Interview Response: Smart Diagnostic Assistant for Factory Floor Technicians
Role: Senior Platform Architect / AI Practice Lead
Question: "A global automotive manufacturer wants to equip their factory floor technicians with a Smart Diagnostic Assistant on phone or tablet. When a machine breaks down, the technician should be able to use the app to get repair instructions. Design the end-to-end architecture."
How I Would Open (Framing the Problem)
Before touching the architecture, I want to reframe what makes this problem genuinely hard — because it's easy to sketch a "RAG chatbot with a nice UI" and call it done. That's wrong for this use case in at least three ways.
This is a safety-critical system. Wrong repair instructions don't just produce a bad user experience — they can injure a technician, damage a $2M piece of equipment, or trigger a production line shutdown costing tens of thousands of dollars per minute. The safety layer is not a feature. It's a non-negotiable architectural constraint that touches every component.
This is a mobile-first, physically hostile environment. A technician standing next to a broken CNC machine has oil on their hands, is wearing PPE, and may have noise levels above 85 dB. Voice input is primary, not secondary. Tap targets need to work with gloved hands. Screen brightness must be readable under industrial lighting. These are UX constraints that change the architecture.
This is potentially an offline-first system. Large factory floors, basement plant rooms, and EMI-shielded machine enclosures frequently have dead zones where WiFi and cellular don't reach. An assistant that stops working when the machine breaks down is worse than useless — it creates a false dependency. Offline capability is load-bearing, not a nice-to-have.
With those three constraints established, this becomes a multimodal, safety-critical, offline-capable agentic system — not a chatbot.
Clarifying Questions I'd Ask First
These questions are not optional. The answers change the architecture significantly.
Machine & Knowledge Base
| Question | Why It Matters |
|---|---|
| How many distinct machine models are in the factory? (10? 500? 2,000?) | Determines knowledge base scale — 10 models means focused RAG; 2,000 means machine-identity routing before retrieval |
| What format are the repair manuals in today? (PDFs, CAD files, scanned paper, structured CMMS database?) | PDFs → Document AI pipeline; scanned paper → OCR + heavy cleanup; structured DB → direct import |
| Is there historical repair log data? (past repairs, parts used, outcomes?) | If yes: gold for RAG — real repair outcomes beat manual instructions for common failure modes |
| Are machines connected (IoT sensors, PLC fault codes, OBD-style diagnostics)? | If yes: real-time machine state (error code, sensor readings) dramatically improves diagnostic accuracy |
| How many distinct failure modes per machine type, on average? | High volume → knowledge graph; low volume → flat RAG may suffice |
Connectivity & Environment
| Question | Why It Matters |
|---|---|
| What is WiFi coverage like on the factory floor? Any known dead zones? | Determines offline architecture depth — partial offline vs. full offline-first |
| What devices will technicians use? (iOS? Android? Company-issued or BYOD?) | Affects app framework choice (Flutter cross-platform vs. native) and MDM deployment strategy |
| Are there areas where camera use is restricted (explosion risk zones, secure areas)? | May require alternative input modes for those zones |
| What is the ambient noise level? (Does the technician need to use the app while nearby machines are still running?) | High noise → voice input needs noise cancellation; may need visual-only mode |
Safety & Compliance
| Question | Why It Matters |
|---|---|
| What safety standards apply? (ISO 45001, IATF 16949, OSHA, country-specific?) | Directly defines what the safety guardrail layer must enforce |
| Are there repair procedures that require LOTO (Lock-Out/Tag-Out) before starting? | LOTO must be surfaced automatically — the system must never allow a technician to skip it |
| What is the escalation policy for major failures? (Always escalate for electrical? Above a certain machine value threshold?) | Defines the agent's escalation logic and hard stops |
| Are technicians tiered by skill level (junior/senior/specialist)? | If yes: instructions must be scoped to the technician's certification level — a junior must not receive instructions for repairs they're not qualified to perform |
Integration & Operations
| Question | Why It Matters |
|---|---|
| Is there a CMMS (e.g., SAP PM, IBM Maximo, ServiceNow)? | Repair logs, work orders, parts inventory — all need to flow back |
| Is parts inventory accessible via API? | Enables real-time check: "this repair requires Part X — you have 3 in stock, Bay 7" |
| What languages do technicians speak? (Single factory or global rollout across multiple countries?) | Multilingual from Day 1 or later? Changes embedding model choice and UI localization scope |
| What is the acceptable response time? (Machine down = production loss; is 3 seconds acceptable?) | Line downtime can cost tens of thousands of dollars per minute; P99 latency is a business SLA, not just a UX metric |
Assumed answers for this design:
- ~300 distinct machine models, mix of PDF manuals and historical CMMS data
- Machines have PLC fault codes (structured) but not full IoT sensor streaming
- WiFi coverage is 85% of factory floor; 15% dead zones (basement plant rooms)
- Company-issued Android tablets + smartphones; MDM-managed
- ISO 45001 + IATF 16949 apply; LOTO procedures are mandatory for electrical
- Technicians are tiered (L1/L2/L3); instructions scoped to level
- SAP PM as CMMS; parts inventory API available
- Target response time: < 5 seconds P99 (machine down = line down)
Scale Reality Check
Unlike the email campaign problem, this is not a throughput problem. It's a latency + reliability problem.
- 5,000 technicians — not concurrent; peak load maybe 200–300 simultaneous queries
- Each query: one multimodal LLM call + RAG retrieval + safety check + CMMS lookup
- Target P99: < 5 seconds (machine down = line down = $$$/min)
- Knowledge base: ~300 machine models × ~500 pages/manual + historical logs ≈ ~5M document chunks
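- Offline guide catalog: top 50 failure modes × 300 machine models = ~15,000 pre-generated guides ≈ ~2GB if a device cached all of them (manageable on modern tablets; in practice each device only needs the guides for the machines in its technician's assigned zone)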
The scale numbers are small enough that a simple Cloud Run deployment handles the peak load. The hard problems here are latency, safety, offline, and multimodal input — not throughput.
System Architecture Overview
┌─────────────────────────────────────────────────────────────────────────────────┐
│ MOBILE APP (Android / iOS) │
│ Flutter — MDM deployed │
│ │
│ Input Capture Offline Mode │
│ ┌────────────────────────────┐ ┌──────────────────────────────────────┐ │
│ │ Voice (primary) │ │ Firebase local cache │ │
│ │ → on-device STT (Whisper) │ │ Pre-generated guides: top 50 failure │ │
│ │ + cloud STT fallback │ │ modes × 300 machine models │ │
│ │ │ │ Sync on WiFi restore │ │
│ │ Photo capture │ │ Firestore offline persistence │ │
│ │ → machine damage photo │ └──────────────────────────────────────┘ │
│ │ │ │
│ │ QR / barcode scan │ Session State │
│ │ → machine_id resolution │ ┌──────────────────────────────────────┐ │
│ │ │ │ Current machine_id │ │
│ │ Text (fallback) │ │ Active repair session │ │
│ │ → large touch targets, │ │ Step progress tracker │ │
│ │ glove-friendly keyboard │ │ Escalation state │ │
│ └────────────────────────────┘ └──────────────────────────────────────┘ │
└───────────────────────────────────────────┬─────────────────────────────────────┘
│ HTTPS / REST + Firebase Realtime
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ API GATEWAY + BACKEND (Cloud Run) │
│ │
│ ┌───────────────────────────────────────────────────────────────────────────┐ │
│ │ Auth: Firebase Auth + IAP — technician identity + skill tier │ │
│ │ Rate limiting: 300 concurrent sessions max │ │
│ │ Session context: Firestore (machine_id, technician_id, repair_session) │ │
│ └───────────────────────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ INPUT PROCESSING PIPELINE (Cloud Run) │
│ │
│ ┌──────────────────┐ ┌──────────────────────┐ ┌───────────────────────────┐ │
│ │ Voice → Text │ │ Image Analysis │ │ Machine ID Resolution │ │
│ │ │ │ │ │ │ │
│ │ Vertex AI │ │ Gemini 1.5 Pro Vision│ │ QR scan → machine_id │ │
│ │ Speech-to-Text │ │ │ │ → Firestore machine │ │
│ │ (noise-suppress │ │ Identify: │ │ registry lookup │ │
│ │ model) │ │ - damaged component │ │ │ │
│ │ │ │ - failure type │ │ Outputs: │ │
│ │ Output: text │ │ - severity estimate │ │ - machine_model │ │
│ │ transcript │ │ - part number match │ │ - machine_location │ │
│ └──────────────────┘ └──────────────────────┘ │ - PLC fault codes │ │
│ │ - last maintenance date │ │
│ └───────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────┐ │
│ │ Unified Problem Description │ │
│ │ {machine_id, model, location, │ │
│ │ fault_codes[], symptom_text, │ │
│ │ damage_image_analysis, │ │
│ │ technician_id, skill_tier} │ │
│ └────────────────────────────────────────────┘ │
└───────────────────────────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ DIAGNOSTIC AGENT (Cloud Run — ADK / LangGraph) │
│ │
│ ┌───────────────────────────────────────────────────────────────────────────┐ │
│ │ Step 1: Failure Mode Classification │ │
│ │ Input: fault_codes + symptom_text + image analysis │ │
│ │ → Gemini 1.5 Pro: classify into failure_mode + affected_component │ │
│ │ → Confidence score: high (≥0.75) → proceed; low → ask clarifying Q │ │
│ └───────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────────────┐ │
│ │ Step 2: Hybrid Knowledge Retrieval │ │
│ │ │ │
│ │ Tool A: graph_query(machine_model, component, failure_mode) │ │
│ │ → Spanner Graph: Machine → Component → FailureMode → RepairProcedure │ │
│ │ → Returns: procedure_id, required_parts[], safety_class, skill_level │ │
│ │ │ │
│ │ Tool B: vector_search(symptom_text + failure_mode) │ │
│ │ → Vertex AI Vector Search: repair manuals + historical repair logs │ │
│ │ → Returns: top-5 relevant manual sections + past repair outcomes │ │
│ │ │ │
│ │ Tool C: get_machine_context(machine_id) │ │
│ │ → Firestore: last maintenance, known issues, open work orders │ │
│ └───────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────────────┐ │
│ │ Step 3: Safety Gate (HARD STOP — non-negotiable) │ │
│ │ │ │
│ │ Check A: Skill level gate │ │
│ │ procedure.required_skill_tier > technician.skill_tier → ESCALATE │ │
│ │ Never generate instructions beyond technician's certification level │ │
│ │ │ │
│ │ Check B: LOTO requirement │ │
│ │ procedure.safety_class IN ["electrical", "hydraulic", "pneumatic"] │ │
│ │ → ALWAYS prepend LOTO procedure steps before any repair instructions │ │
│ │ → Technician must explicitly confirm LOTO complete before continuing │ │
│ │ │ │
│ │ Check C: Escalation triggers │ │
│ │ - Structural damage detected in image (confidence > 0.7) │ │
│ │ - Machine value > $500k AND failure_class = "major" │ │
│ │ - Repair requires >4 hours estimated time (L1/L2 cannot authorize) │ │
│ │ → ESCALATE: create work order, page senior technician / maintenance eng │ │
│ └───────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────────────┐ │
│ │ Step 4: Instruction Generation │ │
│ │ │ │
│ │ Gemini 1.5 Pro: synthesize repair procedure from: │ │
│ │ - Graph-retrieved procedure template │ │
│ │ - Manual sections (RAG) │ │
│ │ - Machine context (last maintenance, known issues) │ │
│ │ - Historical outcomes for this failure mode on this machine model │ │
│ │ │ │
│ │ Output: numbered steps, tool requirements, parts list, │ │
│ │ estimated time, safety warnings, images/diagrams (if available) │ │
│ └───────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────────────┐ │
│ │ Step 5: Real-time Integrations (parallel, non-blocking) │ │
│ │ │ │
│ │ Tool D: check_parts_inventory(parts_list) │ │
│ │ → SAP PM API: is Part X in stock? Location in warehouse? │ │
│ │ │ │
│ │ Tool E: create_work_order(machine_id, failure_mode, technician_id) │ │
│ │ → SAP PM: log the repair, open work order, timestamp │ │
│ │ │ │
│ │ Tool F: check_maintenance_history(machine_id, failure_mode) │ │
│ │ → Firestore: has this exact failure occurred before? Outcome? │ │
│ └───────────────────────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ KNOWLEDGE BASE LAYER │
│ │
│ ┌────────────────────────┐ ┌────────────────────────┐ ┌──────────────────┐ │
│ │ Spanner Graph │ │ Vertex AI Vector Search│ │ Firestore │ │
│ │ │ │ │ │ │ │
│ │ Machine │ │ Repair manuals │ │ Machine registry │ │
│ │ └─ Model │ │ (chunked, embedded) │ │ Technician │ │
│ │ └─ Component │ │ │ │ profiles │ │
│ │ └─ Failure │ │ Historical repair logs │ │ Active sessions │ │
│ │ Mode │ │ (outcome-annotated) │ │ Open work orders │ │
│ │ └─ Proc │ │ │ │ Offline sync │ │
│ │ edure│ │ Safety bulletins │ │ queue │ │
│ │ │ │ Parts catalog │ │ │ │
│ └────────────────────────┘ └────────────────────────┘ └──────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ FEEDBACK + LEARNING LOOP │
│ │
│ Post-repair: technician marks outcome → "Resolved" / "Partial" / "Escalated" │
│ Outcome + steps_used → Pub/Sub → Cloud Run → BigQuery │
│ Weekly: failed repairs reviewed by maintenance engineers → knowledge update │
│ Monthly: RAGAS-style eval on sample of repair sessions │
└─────────────────────────────────────────────────────────────────────────────────┘
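As a rough sketch of the Unified Problem Description that the input pipeline hands to the diagnostic agent, the three input processors could be merged as below. The field names follow the diagram; `MachineRecord`, `build_problem_description`, and the rule that at least one piece of evidence must be present are illustrative assumptions, not part of the design above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MachineRecord:
    """Hypothetical shape of what the QR scan resolves to in the machine registry."""
    machine_id: str
    model: str
    location: str
    fault_codes: tuple[str, ...] = ()


@dataclass(frozen=True)
class ProblemDescription:
    """Field names follow the Unified Problem Description box in the diagram."""
    machine_id: str
    model: str
    location: str
    fault_codes: tuple[str, ...]
    symptom_text: str
    damage_image_analysis: Optional[dict]
    technician_id: str
    skill_tier: str


def build_problem_description(
    machine: MachineRecord,
    transcript: str,
    image_analysis: Optional[dict],
    technician_id: str,
    skill_tier: str,
) -> ProblemDescription:
    symptom_text = transcript.strip()
    # Assumption: refuse to diagnose when there is no evidence at all.
    if not symptom_text and not machine.fault_codes and image_analysis is None:
        raise ValueError("need symptom text, fault codes, or an image analysis")
    return ProblemDescription(
        machine_id=machine.machine_id,
        model=machine.model,
        location=machine.location,
        fault_codes=machine.fault_codes,
        symptom_text=symptom_text,
        damage_image_analysis=image_analysis,
        technician_id=technician_id,
        skill_tier=skill_tier,
    )
```

Keeping this object immutable means every later stage (classification, safety gate, generation) reasons over the same snapshot of the evidence.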
The Three Hard Problems in Detail
Hard Problem 1: Offline Mode
A machine that breaks in a dead zone is the worst-case scenario — the assistant must work there.
Architecture decision: offline-first with pre-generated guides, not on-device model.
Running a full LLM on a factory-floor Android tablet is impractical today (model size, battery, inference latency). The better approach is predictive pre-caching:
- Pre-generation at ingestion time: For every known failure mode × machine model combination, generate the repair guide ahead of time as a batch job during ingestion. Store as structured JSON in Cloud Storage.
- Device cache via Firebase: Nightly, the app syncs the guides relevant to machines in the technician's assigned zone. Each guide is ~10–20KB of JSON + image thumbnails.
- What's cached: top 50 failure modes per machine type × technician's 20 assigned machines = ~1,000 guides ≈ 50–100MB per device (well within limits).
- Offline interaction: No LLM call. The app uses the cached guide with a simple keyword + QR code lookup. Static steps, no dynamic generation.
- What offline can't do: Image-based diagnosis, real-time parts inventory check, work order creation. These are queued and synced when connectivity returns.
- Sync on reconnect: All technician actions during offline mode (steps completed, notes added) are stored in Firestore local cache and synced automatically.
Identifying what's in cache: When the technician opens the app in offline mode, they see: "Offline — showing cached guides for your machines." They select their machine model and reported symptom from a structured dropdown, which maps to the pre-generated guide.
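The offline behaviour described above can be sketched as a plain keyed lookup plus a queue of actions to replay on reconnect. This is a minimal in-memory illustration: `OfflineGuideCache`, the `(machine_model, symptom_code)` key, and the `send` callback are hypothetical names, and a real app would sit on top of the device's local persistence rather than Python dictionaries.

```python
CONNECTIVITY_REQUIRED = "Connectivity required for this diagnosis"


class OfflineGuideCache:
    """Static guide lookup with no LLM call, plus a queue for deferred actions."""

    def __init__(self):
        self._guides = {}   # (machine_model, symptom_code) -> guide dict
        self._pending = []  # actions recorded while offline

    @property
    def pending(self):
        return list(self._pending)

    def store(self, machine_model, symptom_code, guide):
        self._guides[(machine_model, symptom_code)] = guide

    def lookup(self, machine_model, symptom_code):
        guide = self._guides.get((machine_model, symptom_code))
        if guide is None:
            # Unknown failure mode: say so instead of improvising instructions.
            return {"status": "unavailable", "message": CONNECTIVITY_REQUIRED}
        return {"status": "ok", "guide": guide}

    def record_action(self, action):
        self._pending.append(action)

    def flush(self, send):
        """Replay queued actions through `send`; keep any that still fail."""
        remaining = []
        for action in self._pending:
            try:
                send(action)
            except ConnectionError:
                remaining.append(action)
        sent = len(self._pending) - len(remaining)
        self._pending = remaining
        return sent
```

The point of the sketch is that the offline path is deterministic: a miss returns an explicit message, and nothing is generated on the device.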
Hard Problem 2: Safety Layer
This is not a content moderation problem. It's a procedural safety enforcement problem.
Three classes of safety enforcement:
Class 1: Skill tier enforcement (hard gate)
Every repair procedure in the knowledge graph has a required_tier field (L1/L2/L3/Specialist). The agent queries this before generating instructions. If procedure.required_tier > technician.tier, the app:
- Does not generate repair instructions
- Shows: "This repair requires L3 certification. Escalating to maintenance engineer."
- Creates a work order automatically
- This gate cannot be bypassed by the technician
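A minimal sketch of the skill tier gate as deterministic code that runs before any generation step. The `Tier` enum ordering and the shape of the returned dictionary are assumptions for illustration; the message text follows the wording above.

```python
from enum import IntEnum


class Tier(IntEnum):
    # Assumed ordering: a higher value means a higher certification level.
    L1 = 1
    L2 = 2
    L3 = 3
    SPECIALIST = 4


def skill_gate(required: Tier, technician: Tier) -> dict:
    """Decide, without any LLM involvement, whether instructions may be generated."""
    if required > technician:
        return {
            "allowed": False,
            "action": "ESCALATE",
            "create_work_order": True,
            "message": (
                f"This repair requires {required.name} certification. "
                "Escalating to maintenance engineer."
            ),
        }
    return {"allowed": True, "action": "PROCEED", "create_work_order": False}
```

Because the comparison happens on structured fields from the knowledge graph and the technician profile, there is no prompt the technician could type to change the outcome.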
Class 2: LOTO enforcement (mandatory prepend)
For any procedure in safety class {electrical, hydraulic, pneumatic, stored-energy}:
- LOTO steps are always prepended, rendered as a distinct checklist with mandatory confirmation checkboxes
- The app will not show Step 1 of the repair until all LOTO checkboxes are ticked
- LOTO completion is logged with timestamp + technician ID for compliance records
- This is implemented as a UI-level gate, not an LLM prompt instruction — it cannot be hallucinated away
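The LOTO gate can be sketched as a small state object that the step renderer consults. The class and function names here are hypothetical, and in the real app this logic would live in the mobile UI layer; Python is used only to show the control flow.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

LOTO_SAFETY_CLASSES = frozenset({"electrical", "hydraulic", "pneumatic", "stored-energy"})


@dataclass
class LotoChecklist:
    items: list[str]
    technician_id: str
    confirmed: dict = field(default_factory=dict)  # item -> UTC confirmation time

    def confirm(self, item: str) -> None:
        if item not in self.items:
            raise KeyError(f"unknown LOTO item: {item}")
        # Timestamp is kept for the compliance record.
        self.confirmed[item] = datetime.now(timezone.utc)

    @property
    def complete(self) -> bool:
        return all(item in self.confirmed for item in self.items)


def visible_steps(safety_class: str, repair_steps: list[str],
                  checklist: Optional[LotoChecklist]) -> list[str]:
    """Return the repair steps the UI may render right now."""
    needs_loto = safety_class in LOTO_SAFETY_CLASSES
    if needs_loto and (checklist is None or not checklist.complete):
        return []  # nothing is shown until every LOTO box is ticked
    return list(repair_steps)
```

Note that a missing checklist is treated the same as an incomplete one, so a data error fails closed rather than open.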
Class 3: Confidence-based escalation (soft gate)
If the diagnostic agent's confidence in the failure mode classification is < 0.75:
- Do not generate repair instructions
- Ask 1–2 clarifying questions to narrow the diagnosis
- After 2 clarifications with still low confidence: escalate
The reason for the confidence threshold: generating confident-sounding wrong instructions is more dangerous than admitting uncertainty. An "I'm not sure — let me escalate" response is always safer than a hallucinated repair procedure.
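The soft gate reduces to a three-way decision. A minimal sketch, assuming the 0.75 threshold and the two-clarification limit stated above; the function name and the action strings are illustrative.

```python
def next_action(confidence: float, clarifications_asked: int,
                threshold: float = 0.75, max_clarifications: int = 2) -> str:
    """Choose what the agent does after failure mode classification."""
    if confidence >= threshold:
        return "generate_instructions"
    if clarifications_asked < max_clarifications:
        return "ask_clarifying_question"
    # Still uncertain after the allowed clarifications: hand off to a human.
    return "escalate"
```

For example, `next_action(0.6, 0)` asks a question, while `next_action(0.6, 2)` escalates.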
Hard Problem 3: Multimodal Input in a Hostile Environment
Factory floors are harsh for UX. Design choices that matter:
Voice as primary input:
- Technician holds up the tablet, says: "Machine 47-B, hydraulic press. It's making a grinding noise from the left side and stopped mid-cycle."
- Google Cloud Speech-to-Text with an enhanced, noise-robust recognition model (for example the enhanced phone_call model; validate the choice against recorded shop-floor audio)
- On-device Whisper model as offline fallback (no network needed)
- Voice input must work at 80–90 dB ambient noise — use noise-canceling audio processing before STT
Photo as diagnostic input:
- Technician photographs the damaged component
- Gemini Vision identifies: component name, failure type (crack, wear, leak, burn mark), severity
- Output becomes part of the problem description, not just visual context
- Photo is stored with the repair session for compliance documentation
QR code as machine identifier:
- Every machine has a QR code on the control panel (generated at CMMS registration)
- App scans QR → resolves machine_id → loads machine context (model, last maintenance, known issues, fault codes)
- This bypasses the need to type or verbally spell machine names, which are often technical alphanumeric codes (e.g., "KUKA KR 210 R2700, cell 7B")
Glove-friendly UI design:
- All interactive elements: minimum 56dp touch targets (deliberately larger than the 48dp Material Design and 44pt Apple HIG minimums, to allow for gloved hands)
- No small text input: voice first, large dropdown selects for structured fields
- High-contrast display mode for bright industrial lighting
- Step-by-step view: one step per screen, large typography, clear "Next" / "Back" navigation
Knowledge Base Ingestion Pipeline
Repair manuals (PDFs)
│
▼
Cloud Run: Document AI (OCR + layout detection)
│ structured: text blocks, tables, figures, diagrams
▼
Cloud Run: Chunker
├── Semantic chunks: 512 tokens, procedure-boundary aware
│ (don't split a numbered step across chunk boundaries)
├── Metadata: machine_model, section_type (safety/procedure/parts),
│ procedure_id, step_number
└── Image extraction: diagrams → Cloud Storage → linked to chunk
│
▼
Vertex AI text-embedding-005 → Vertex AI Vector Search (RAG index)
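The "procedure-boundary aware" chunking rule in the pipeline above could look roughly like this. It is a simplified sketch: token counts are approximated by whitespace-separated words, steps are detected with a basic numbered-line regex, and the function names are hypothetical.

```python
import re

# A line starting with "1." or "1)" is treated as the start of a new step.
STEP_START = re.compile(r"^\s*\d+[.)]\s+")


def split_units(text: str) -> list[str]:
    """Group lines into units, keeping continuation lines with their step."""
    units, current = [], []
    for line in text.splitlines():
        if STEP_START.match(line) and current:
            units.append("\n".join(current))
            current = []
        if line.strip():
            current.append(line)
    if current:
        units.append("\n".join(current))
    return units


def chunk(text: str, max_tokens: int = 512) -> list[str]:
    """Pack whole units into chunks; a numbered step is never split."""
    chunks, current, size = [], [], 0
    for unit in split_units(text):
        n = len(unit.split())  # crude token estimate (assumption)
        if current and size + n > max_tokens:
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(unit)
        size += n
    if current:
        chunks.append("\n".join(current))
    return chunks
```

A single step longer than `max_tokens` becomes its own oversized chunk in this sketch, which trades index uniformity for never cutting an instruction in half.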
Structured CMMS data (SAP PM export)
│
▼
Dataflow: transform → Spanner Graph upsert
Nodes: Machine, Model, Component, FailureMode, RepairProcedure, Part
Edges: HAS_COMPONENT, EXHIBITS_FAILURE, RESOLVED_BY, REQUIRES_PART
Historical repair logs (CMMS export)
│
▼
Cloud Run: Outcome annotator
├── Parse: machine_id, failure_description, steps_taken, outcome, duration
├── Link: to procedure_id in knowledge graph
└── Embed: failure_description → Vector Search (outcome-annotated index)
Why historical repair logs matter: When the knowledge graph says "Procedure P-47 resolves this failure" but historical logs show that P-47 failed in 60% of cases on this machine model and P-52 succeeds in 90%, the RAG retrieval should surface that signal. Outcome-annotated history is more valuable than the manual alone.
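One way to surface that signal is to rank candidate procedures by their observed success rate. The sketch below assumes the logs are already filtered to one machine model and carry `procedure_id` and `outcome` fields; the minimum sample count and the neutral prior for procedures without history are illustrative choices.

```python
from collections import defaultdict


def success_rates(logs, min_samples=5):
    """Fraction of 'Resolved' outcomes per procedure, ignoring thin evidence."""
    counts = defaultdict(lambda: [0, 0])  # procedure_id -> [resolved, total]
    for log in logs:
        entry = counts[log["procedure_id"]]
        entry[1] += 1
        if log["outcome"] == "Resolved":
            entry[0] += 1
    return {pid: resolved / total
            for pid, (resolved, total) in counts.items()
            if total >= min_samples}


def rank_procedures(candidates, rates, prior=0.5):
    """Order candidates by historical success; unknown procedures get the prior."""
    return sorted(candidates, key=lambda pid: rates.get(pid, prior), reverse=True)
```

With the numbers from the example above (P-47 resolving 40% of the time, P-52 resolving 90%), P-52 is ranked first even if the graph lists P-47 as the default procedure.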
GCP Services Map
| Component | GCP Service | Why |
|---|---|---|
| Mobile app backend | Firebase (Auth, Firestore, Realtime DB) | Offline sync, push notifications, fast auth |
| App deployment | Firebase App Distribution + MDM | Enterprise device management |
| Offline content sync | Firebase + Cloud Storage | Nightly guide sync to devices |
| API backend | Cloud Run | Stateless, scales to 300 concurrent sessions |
| Auth + rate limiting | Firebase Auth + Identity-Aware Proxy | Technician identity + skill tier enforcement |
| Session state | Firestore | Machine_id, active repair session, step progress |
| Voice → text | Google Cloud Speech-to-Text (enhanced model) | Noise-robust recognition, low-latency streaming |
| On-device STT (offline) | Whisper (on-device, bundled in app) | Offline voice input fallback |
| Image analysis | Vertex AI Gemini 1.5 Pro (Vision) | Multimodal damage identification |
| Failure classification | Vertex AI Gemini 1.5 Pro | Reasoning over fault codes + symptoms + image |
| Knowledge graph | Spanner Graph | Machine → Component → Failure → Procedure graph |
| RAG index (manuals + logs) | Vertex AI Vector Search | Semantic search over 5M manual chunks |
| Instruction generation | Vertex AI Gemini 1.5 Pro | Synthesize contextual repair steps |
| Machine registry | Firestore | QR code → machine_id resolution, <10ms |
| Document ingestion (manuals) | Document AI (OCR + Form Parser) | Layout-aware PDF parsing |
| Ingestion pipeline | Dataflow | Parallel manual processing + graph upsert |
| Parts inventory integration | Cloud Run → SAP PM API proxy | Real-time stock check |
| CMMS work order creation | Cloud Run → SAP PM API proxy | Automatic repair log + work order |
| Repair session logs | BigQuery | Full audit trail per session + outcome |
| Feedback processing | Pub/Sub + Cloud Run | Async outcome annotation + knowledge update |
| Content cache (pre-gen guides) | Cloud Storage + Cloud CDN | Low-latency offline guide delivery |
| Monitoring | Cloud Monitoring + Cloud Trace | P99 latency per step, safety gate trigger rate |
| Secrets (SAP API keys) | Secret Manager | Credentials for CMMS + inventory APIs |
Key Trade-offs I Would Call Out
1. On-device LLM vs. pre-generated offline guides
The "obvious" offline solution is to run a small LLM on-device (Gemma 2B, Phi-3 Mini). The reality: even a 3B model struggles on mid-range Android tablets, generates instructions in 15–30 seconds, and hallucinates at a higher rate than a well-tuned cloud model. For safety-critical repair instructions, that hallucination rate is not acceptable. Pre-generated guides for known failure modes are deterministic, fast (<1ms), and auditable. The trade-off: pre-generated guides only cover known failure modes. Novel failures that don't match any cached guide will show "Connectivity required for this diagnosis" in offline mode, which is honest and safe rather than generating potentially wrong instructions.
2. Graph + RAG vs. RAG alone
Pure vector search over repair manuals can retrieve relevant sections, but it can't answer: "What skill level is required for this procedure?" or "Does this machine model have a known issue with this component?" Those are structured relational facts, not semantic retrieval questions. The knowledge graph answers those questions deterministically in <100ms. RAG fills the unstructured knowledge gap (manual prose, historical notes). The combination is more accurate and faster than RAG alone for this domain.
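The division of labour can be sketched with an in-memory stand-in for the graph and an injected search function. Everything here is illustrative: `GRAPH`, the example record, and the `vector_search` callable are placeholders, not the Spanner Graph or Vertex AI Vector Search APIs.

```python
# Toy stand-in for the knowledge graph: structured facts keyed by
# (machine_model, component, failure_mode). Values are made up.
GRAPH = {
    ("HP-200", "pump", "seal_leak"): {
        "procedure_id": "P-52",
        "required_tier": "L2",
        "safety_class": "hydraulic",
        "required_parts": ["SEAL-114"],
    },
}


def graph_query(machine_model, component, failure_mode):
    """Deterministic lookup of relational facts; None if no known procedure."""
    return GRAPH.get((machine_model, component, failure_mode))


def retrieve(machine_model, component, failure_mode, symptom_text, vector_search):
    facts = graph_query(machine_model, component, failure_mode)
    if facts is None:
        return None  # no known procedure: the caller should escalate
    # Unstructured context (manual prose, past repair notes) comes from search.
    passages = vector_search(f"{symptom_text} {failure_mode}", top_k=5)
    return {"facts": facts, "passages": passages}
```

The safety-relevant fields (`required_tier`, `safety_class`) only ever come from the structured lookup, never from retrieved text.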
3. Confidence threshold calibration
Setting the failure classification confidence threshold too high (>0.90) means too many escalations for repairable failures — frustrating for technicians and expensive for maintenance engineers. Too low (<0.60) means generating instructions for misdiagnosed failures — dangerous. The right threshold is calibrated on historical data: run the classifier against labeled past repairs and tune the threshold to maximize precision on dangerous failure classes (electrical, hydraulic, structural). I'd start at 0.75 and adjust based on the first 30 days of production data.
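A minimal sketch of that calibration, assuming a labeled set of `(confidence, was_correct)` pairs from past repairs on the dangerous failure classes. Picking the lowest candidate threshold that meets a target precision is one reasonable policy among several; the function names and the sample data are hypothetical.

```python
def precision_at(threshold, samples):
    """Precision among classifications the gate would have accepted."""
    accepted = [correct for confidence, correct in samples if confidence >= threshold]
    if not accepted:
        return None
    return sum(accepted) / len(accepted)


def calibrate(samples, target_precision, candidates):
    """Lowest candidate threshold whose accepted set meets the target precision."""
    for threshold in sorted(candidates):
        precision = precision_at(threshold, samples)
        if precision is not None and precision >= target_precision:
            return threshold
    return None  # no candidate is safe enough: keep escalating by default


# Illustrative labeled history: (classifier confidence, diagnosis was correct)
history = [(0.95, True), (0.90, True), (0.80, True), (0.78, False),
           (0.70, True), (0.65, False), (0.60, False)]
chosen = calibrate(history, target_precision=0.95, candidates=[0.6, 0.7, 0.75, 0.8, 0.9])
```

On this toy history the result is 0.8, because 0.75 would still accept the wrong diagnosis made at 0.78 confidence.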
4. Real-time PLC fault code integration vs. manual symptom entry
If machines have OPC-UA or MQTT connectivity, the app can automatically pull fault codes the moment a technician scans the machine QR code — no symptom entry needed, dramatically improving diagnostic accuracy. This is the right long-term architecture. But it requires connectivity between the app backend and the factory's OT (Operational Technology) network — which typically has strict IT/OT segmentation policies, requires OT team sign-off, and can take months to approve. Design for it, but plan for manual symptom entry as the Day 1 fallback.
5. Safety gates as UI vs. LLM instructions
A common mistake: implementing safety gates as LLM prompt instructions ("Always include LOTO steps before electrical repairs"). LLM instructions can be reasoned around, misapplied, or hallucinated. LOTO enforcement must be a hard UI gate — a checklist that blocks progression, rendered from the knowledge graph's safety_class field, independent of what the LLM generates. The LLM writes the repair instructions; the safety layer is a separately-engineered, deterministic overlay. Never trust a language model with life-safety enforcement.
6. Single-turn vs. multi-turn repair session
Simple repairs: single-turn (describe → get instructions → execute). Complex repairs with 20+ steps: multi-turn (execute step 1–5, hit unexpected issue, ask follow-up, get adapted instructions). The session model in Firestore preserves step progress and context, enabling the agent to answer mid-repair follow-ups like "Step 7 says to remove the coupling — but mine doesn't have a coupling. What do I do?" with full context of the machine and prior steps.
Failure Scenarios and Handling
| Failure | Impact | Mitigation |
|---|---|---|
| Network drop mid-repair (step 6 of 12) | Technician stranded mid-procedure | Firestore local cache: all received steps cached; technician can continue with cached content; new steps queued for delivery on reconnect |
| Gemini misclassifies failure mode (confidence > threshold but wrong) | Wrong repair instructions → potential damage | Technician confirmation step: "I identified this as a bearing failure in the left drive shaft. Does this match what you're seeing?" before generating steps |
| Safety gate false positive (escalates fixable issue) | Technician frustration, unnecessary escalation | Log all escalations; weekly review by maintenance manager; adjust threshold or procedure tier if pattern identified |
| SAP PM API down | Can't create work order or check parts | Non-blocking: work order creation queued in Pub/Sub, retried when SAP recovers; show technician parts location from cached inventory snapshot |
| Outdated repair manual in knowledge base | Instructions reference superseded procedure | Every document chunk has valid_until metadata; stale chunks suppressed in retrieval; CMMS change notifications trigger re-ingestion |
| Technician photos wrong machine | Wrong machine context → wrong instructions | QR code scan is the authoritative machine identifier; if photo is used as supplemental input, the response includes "Confirming this is machine {id} — the {model} in {location}?" before generating steps |
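As an illustration of the stale-manual mitigation in the table, retrieval results can be filtered on the valid_until metadata before they reach the generator. The chunk dictionary shape is an assumption, and treating a chunk with no valid_until as stale is a fail-closed choice made for this sketch.

```python
from datetime import date


def fresh_chunks(chunks, today):
    """Keep only chunks whose valid_until date has not passed."""
    kept = []
    for chunk in chunks:
        valid_until = chunk.get("valid_until")
        # Missing metadata is treated as stale rather than trusted.
        if valid_until is not None and valid_until >= today:
            kept.append(chunk)
    return kept
```

Passing `today` in explicitly keeps the filter easy to test and avoids hidden clock dependencies.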
Pre-Deployment Checklist
| Task | Lead Time | Owner |
|---|---|---|
| Complete QR code tagging on all machines | 4–6 weeks | Maintenance team |
| Ingest all repair manuals into knowledge base | 2–4 weeks | AI Platform team + Document AI pipeline |
| Import CMMS historical repair logs | 1 week | CMMS admin |
| Build and validate knowledge graph from CMMS data | 2 weeks | AI Platform team |
| Safety gate validation: test all LOTO procedures | 2 weeks | EHS + Maintenance engineering |
| Skill tier mapping: assign tier to all repair procedures | 1 week | Maintenance manager |
| Offline guide pre-generation: all models × top failure modes | 1 day (automated) | AI Platform team |
| Device rollout via MDM: 5,000 devices | 2 weeks | IT / MDM admin |
| Pilot with 50 technicians (1 production line) | 2 weeks | Change management team |
| Go/No-go review: safety audit, accuracy audit on pilot data | 1 week | EHS, Maintenance Eng, IT Security |
How I Would Measure Success
| Metric | Target | Why |
|---|---|---|
| P99 time to first repair step (from QR scan to step 1) | < 5 seconds | Machine down = line down = $$$/min |
| Diagnostic accuracy (correct failure mode identified) | ≥ 92% | Validated against maintenance engineer review of sample |
| First-time fix rate (repair resolved without escalation) | ≥ 75% | Baseline today without assistant: typically 50–60% |
| Unnecessary escalation rate (assistant escalated a fixable repair) | < 10% | Balances safety with technician autonomy |
| LOTO compliance rate (LOTO confirmed before electrical steps) | 100% | Non-negotiable; any miss is a compliance incident |
| Offline mode availability (app functional in dead zones) | 100% for known failure modes | Must work where machines are |
| Mean time to repair (MTTR) reduction vs. baseline | ≥ 20% reduction | Primary business value metric |
| Knowledge base freshness (manuals up-to-date within N days of revision) | < 7 days | Stale manuals are a safety risk |
How the Interview Conversation Actually Flows
Minutes 0–5: Framing + Clarifying Questions
Interviewer: "Design a Smart Diagnostic Assistant for 5,000 factory technicians."
Strong candidate: "Before I start, I want to call out that this problem has some non-obvious constraints that will significantly change the architecture. Can I ask a few questions?"
Key questions to raise verbally: connectivity on factory floor, machine count, manual format, LOTO/safety standards, technician skill tiers, CMMS integration.
What the interviewer is watching for: Do you identify safety-criticality unprompted? Do you ask about offline requirements? Do you know what LOTO is? Candidates who immediately talk about "a RAG chatbot with a nice UI" haven't thought about the environment.
Minutes 5–10: Reframe as a Three-Constraint Problem
Strong candidate: "Based on those answers, I want to flag three constraints that I think are load-bearing for the architecture. Safety — wrong instructions can injure someone. Offline — factory dead zones are real. And mobile hostility — voice is the primary input, not text."
What the interviewer is watching for: Do you derive non-obvious constraints from the clarifications, or just restate the question?
Minutes 10–25: Architecture walkthrough
Draw the five stages: input capture → processing → diagnostic agent → knowledge base → integrations. Call out the safety gate explicitly as a separate, non-LLM-generated layer.
Minutes 25–35: Deep dives (interviewer picks)
Common probes:
- "How does offline mode actually work — what happens if the technician is in a dead zone for 6 hours?"
- "Walk me through exactly what happens when a technician scans a QR code and takes a photo."
- "How do you prevent the LLM from generating instructions that could injure a technician?"
- "The CMMS has 10 years of repair logs. How do you use that data?"
- "What happens if a technician reaches Step 7 and encounters something the manual doesn't cover?"
Step 7 unexpected situation answer: The session context in Firestore contains the full repair history: machine_id, failure_mode, steps completed so far, and the current manual sections used. When the technician says "Step 7 doesn't match what I'm seeing — my machine has an extra coupling here," the agent does a targeted RAG query: vector_search("unexpected coupling + machine_model + procedure_section") + graph_query(machine_model, variant=?) to find whether there's a known variant. If no match and confidence is low → escalate to senior technician with full session context pre-loaded.
Minutes 35–45: Trade-offs
Strong candidate proactively raises:
- "I should flag that safety gates must be implemented as UI constraints, not LLM prompt instructions — I can explain why."
- "The on-device LLM temptation for offline mode is a trap for this use case — pre-generated guides are safer."
- "IT/OT network segmentation is the biggest integration risk for real-time PLC fault code access."
Minutes 45–55: Follow-up scenarios
- "What if a machine is brand new and has no repair history or manual yet?"
- "How would you handle a safety recall that affects 200 machines — how do you update all their guides?"
- "What if a technician follows the instructions and the repair fails — how do you improve the system?"
Safety recall answer: Recall bulletin → ingestion pipeline → knowledge graph edge update: {machine_model} REQUIRES_SAFETY_ACTION {recall_id}. Next time any technician opens a session for that machine model, the safety gate layer checks for active recalls before generating any other content, and surfaces the recall procedure as the mandatory first action. Push notification via Firebase to all technicians assigned to affected machines. Pre-generated offline guides for affected machines are invalidated and regenerated automatically.
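The recall flow can be sketched as two small deterministic checks: one that runs at session start, and one that selects cached guides to regenerate. The data shapes (a mapping from machine model to recall ids, and guide keys as `(machine_model, failure_mode)` tuples) and the function names are assumptions for illustration.

```python
def first_action(machine_model, active_recalls):
    """Decide what a new session shows first for this machine model."""
    recall_ids = sorted(active_recalls.get(machine_model, ()))
    if recall_ids:
        # An active recall pre-empts normal diagnosis.
        return {"action": "show_recall_procedure", "recall_ids": recall_ids}
    return {"action": "start_diagnosis"}


def guides_to_regenerate(cached_guide_keys, affected_models):
    """Select cached offline guides that belong to recalled machine models."""
    return [key for key in cached_guide_keys if key[0] in affected_models]
```

For example, with `active_recalls = {"HP-200": {"R-2024-07"}}`, a session on an HP-200 starts with the recall procedure, while other models go straight to diagnosis.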
Summary Framing for Closing the Answer
The key insight I want to leave you with is that the hardest problems in this design are not the AI problems — they're the environmental and safety constraints that most candidates forget. The LLM is capable of generating excellent repair instructions given the right context. What makes this system production-ready is everything around the LLM: the offline cache that works in dead zones, the LOTO enforcement that cannot be LLM-generated, the skill-tier gate that can't be bypassed, and the confidence threshold that chooses escalation over hallucination.
A RAG chatbot that works 95% of the time in a factory is dangerous because of the 5% where it confidently generates wrong instructions. The architecture I've described is designed so that uncertainty leads to escalation, not fabrication — and safety checks are structural, not instructional. That's the difference between a demo and a system you'd deploy to 5,000 people working with machinery that can kill them.