Agentic AI — Q&A Review Bank
← Back to Overview: Agentic AI
70+ curated Q&A pairs covering the full Agentic AI curriculum.
Each answer is 3–5 sentences with specifics — not vague responses.
Tags: [Easy] = conceptual, [Medium] = design decisions, [Hard] = system design / deep technical.
Section 1: Foundations & Core Concepts (Q1–Q12)
Q1: What is the fundamental difference between traditional AI and agentic AI? [Easy]
A: Traditional AI is reactive — given a single input, it produces a single output and stops. Agentic AI is proactive — given a goal, it decides what steps to take, executes them, observes results, and continues until the goal is met. The key shift is who holds the initiative: in traditional AI the human drives every step; in agentic AI the system drives the steps within a defined goal boundary. This difference requires agentic systems to handle planning, memory, tool use, and failure recovery in ways single-turn systems never need to.
Q2: What are the four core properties that define how "agentic" a system is? [Easy]
A: Goal-directed (pursues an objective rather than responding to input), Autonomous Action (takes real-world actions without human approval at each step), Extended Horizon (operates across multiple steps and sessions with persistent state), and Adaptive (adjusts its approach based on feedback and failure). Most real systems sit somewhere on the spectrum between traditional and fully agentic. The four properties are a diagnostic tool — when evaluating whether a task needs an agentic architecture, check which of the four are actually required; over-engineering to "full agentic" when it's not needed adds cost and complexity.
Q3: Where on the autonomy spectrum does a RAG system sit, and why isn't it considered agentic? [Medium]
A: RAG sits at Level 2 (knowledge injection) — it retrieves relevant context before generating, but it's still fundamentally reactive: one retrieval plus one generation per query, with no loop, no tool use, and no goal-directed behavior. Agentic systems begin at Level 3 (single agent), where an LLM uses tools in a reasoning loop, observes results, and continues until a multi-step task is complete. The key distinction is that RAG cannot decide to take a second retrieval based on the first result, cannot call external APIs, and cannot persist state between turns. RAG provides knowledge; agentic systems provide agency.
Q4: What is the key vocabulary distinction between an "agent" and an "agentic system"? [Easy]
A: An agent is a single LLM that can use tools in a reasoning loop — it's the atomic unit. An agentic system is the full architecture: multiple agents, memory layers, orchestration logic, HITL gates, monitoring, and reliability mechanisms working together toward a goal. A single customer support agent (with tools to look up orders and process refunds) is an agent. A multi-agent system that researches, writes, fact-checks, and publishes content — with checkpointing, observability, and human review gates — is an agentic system. The system design skills are at the agentic system level, not the individual agent level.
Q5: What is "bounded autonomy" and why is it essential in every production agentic system? [Hard]
A: Bounded autonomy means placing explicit limits on an agent's ability to act: maximum steps, maximum wall-clock time, maximum cost per task, maximum retries, and a restricted set of permitted actions. Without these limits, agents can enter infinite retry loops, exhaust API quotas, accumulate runaway costs, or take unintended real-world actions. When a limit is hit, the agent should record partial results, set a "status": "limit_reached" flag, and exit gracefully rather than silently continuing or failing. Bounded autonomy is what separates a reliable production system from a demo — it's the engineering contract between the agent and the infrastructure that hosts it.
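The limits described above can be sketched as a minimal run loop. This is an illustrative sketch, not a prescribed implementation: `agent_step`, the limit values, and the return shape are all assumptions.

```python
import time

def run_bounded(agent_step, max_steps=20, max_seconds=300.0, max_cost_usd=1.00):
    """Run an agent loop under explicit limits. `agent_step` is a hypothetical
    callable returning (done, step_cost_usd, partial_result)."""
    start = time.monotonic()
    spent = 0.0
    results = []
    for _ in range(max_steps):
        if time.monotonic() - start > max_seconds:
            return {"status": "limit_reached", "limit": "time", "partial": results}
        if spent > max_cost_usd:
            return {"status": "limit_reached", "limit": "cost", "partial": results}
        done, cost, partial = agent_step()
        spent += cost
        results.append(partial)
        if done:
            return {"status": "complete", "partial": results}
    # Step budget exhausted: record partial results and exit gracefully.
    return {"status": "limit_reached", "limit": "steps", "partial": results}
```

The key property is that every exit path returns the partial results and a machine-readable status, so the hosting infrastructure can decide what to do next.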
Q6: What is the difference between a tool, a subagent, and an orchestrator? [Easy]
A: A tool is a callable function (API call, search, code executor) that gives an agent capabilities beyond text generation — it's deterministic and has no reasoning capability of its own. A subagent is a full LLM with its own prompt, tools, and reasoning loop — it receives a subtask from the orchestrator, executes it autonomously, and returns a result. An orchestrator is the top-level agent that holds the plan, decomposes the goal into subtasks, assigns them to subagents, handles failures, and synthesizes the final output. The distinction matters for system design: tools are called by agents; subagents are agents called by other agents.
Q7: Why are agentic systems harder to evaluate than traditional AI systems? [Medium]
A: In traditional AI, evaluation is straightforward — does the output match the expected answer? In agentic systems, the path matters as much as the destination: an agent can produce the correct final answer via an inefficient or unreliable path that would fail on any variation. The ground truth is not just the expected answer but the expected trajectory (what tools to call, in what order, with what arguments), which is much harder to define and validate. Additionally, agentic systems take real-world actions with side effects (emails sent, files written, API calls made), and cost is cumulative over many steps rather than per-call.
Q8: What is "context drift" and what is the recommended prevention? [Medium]
A: Context drift is the gradual degradation of reasoning quality as conversation history grows — the agent loses track of the original goal, and later reasoning contradicts earlier reasoning. It's the hardest failure mode to catch because the final answer looks reasonable in isolation; only comparison against the original task reveals the drift. The primary prevention is maintaining a task ledger external to conversation history — a structured record of the goal, remaining subtasks, and completed outputs — so the goal is always retrievable even when conversation history is summarized. Pinning the original goal statement in a fixed position in every system prompt is a secondary defense.
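A task ledger kept outside conversation history might look like the following minimal sketch; the field names and the `render_for_prompt` helper are illustrative, not a standard schema.

```python
from dataclasses import dataclass, field

@dataclass
class TaskLedger:
    """External record of the goal, maintained outside conversation history
    so it survives summarization. Field names are illustrative."""
    goal: str                                      # original goal, never discarded
    remaining: list = field(default_factory=list)  # subtasks not yet done
    completed: dict = field(default_factory=dict)  # subtask -> output

    def finish(self, subtask, output):
        self.remaining.remove(subtask)
        self.completed[subtask] = output

    def render_for_prompt(self):
        # Pinned block re-injected into every system prompt as the
        # secondary defense against drift.
        return (f"GOAL: {self.goal}\n"
                f"REMAINING: {', '.join(self.remaining) or 'none'}\n"
                f"DONE: {', '.join(self.completed) or 'none'}")
```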
Q9: What is "grounding" in the context of agentic systems? [Easy]
A: Grounding constrains agent outputs to verifiable facts or real-world observations rather than model-generated content. Document grounding (via RAG) ensures the agent's claims are sourced from retrieved documents, not model weights. Action grounding ensures the agent's knowledge of a tool's result comes from the actual tool output, not from the LLM predicting what the result might be. Without grounding, agents hallucinate tool results — reasoning about what a search "probably returned" rather than what it actually returned. Well-grounded agents produce answers traceable to specific sources; ungrounded agents produce confident-sounding responses that may be entirely fabricated.
Q10: What is checkpointing and why is it required for long-running agentic tasks? [Medium]
A: Checkpointing serializes the full agent state — task ledger, partial results, conversation history, error log — to persistent storage at each step or at defined intervals. If the agent crashes, is preempted, or hits a resource limit, it can restore from the last checkpoint and continue rather than starting over. For tasks that take minutes or hours and involve many LLM calls and tool calls, restarting from scratch is expensive, may be impossible (tool calls may have side effects that can't be replayed), and destroys completed work. LangGraph provides built-in checkpointing via MemorySaver and SqliteSaver; custom systems should implement checkpoint save/restore as part of the orchestration layer.
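For a custom orchestration layer, checkpoint save/restore can be as simple as the following sketch — plain JSON with an atomic write. The state shape (ledger, results, history, errors) is whatever the orchestrator defines; this shows only the persistence mechanics.

```python
import json
import os

def save_checkpoint(path, state):
    """Atomically write agent state (task ledger, partial results,
    history, error log) so a crash mid-write cannot corrupt the file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename: readers see old or new, never half-written

def restore_checkpoint(path):
    """Return the last saved state, or None to signal a fresh start."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)
```

On restart, the orchestrator calls `restore_checkpoint` first and resumes from the returned state instead of re-running completed (and possibly side-effecting) steps.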
Q11: How does the principle of least privilege apply to agentic systems? [Hard]
A: Each agent should have access only to the tools and APIs it needs for its specific role — nothing more. A research agent should have search and read tools but not email or delete tools; an email agent should only have the email tool. This limits the blast radius of a compromised or misbehaving agent: a research agent that is prompt-injected can't send emails because it doesn't have the email tool. Implementing least privilege requires defining a permission map (agent → allowed tools) enforced by the tool layer, not just by the agent's system prompt — system prompt restrictions are bypassed by prompt injection; tool-layer restrictions are not.
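A tool-layer permission map might be enforced like this sketch. The agent names, tool names, and `call_tool` signature are hypothetical — the point is that the check lives in the dispatch code, not in any prompt.

```python
# Hypothetical permission map: agent role -> tools it may call.
PERMISSIONS = {
    "research_agent": {"search_web", "read_url"},
    "email_agent": {"send_email"},
}

def call_tool(agent_name, tool_name, tools, **kwargs):
    """Enforce least privilege at the tool layer. A prompt-injected agent
    still cannot reach tools outside its map, because the check runs in
    code the LLM never sees."""
    allowed = PERMISSIONS.get(agent_name, set())
    if tool_name not in allowed:
        raise PermissionError(f"{agent_name} may not call {tool_name}")
    return tools[tool_name](**kwargs)
```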
Q12: What distinguishes a "static DAG" workflow from a "dynamic plan" in an orchestration layer? [Hard]
A: A static DAG defines the full workflow at design time: all agents, their dependencies, and the execution order are fixed in code. It's predictable, easy to test, and straightforward to add HITL gates. A dynamic plan has the orchestrator LLM generate the execution plan at runtime based on the specific task — the plan is a product of inference, not code. Dynamic plans handle tasks that don't fit a known template and allow the orchestrator to adapt mid-execution if circumstances change. The trade-off: static DAGs are reliable and auditable; dynamic plans are flexible but harder to reason about, test, and debug. Most production systems start with static DAGs and add dynamic planning only for the subset of tasks that genuinely require it.
Section 2: Architectural Patterns (Q13–Q24)
Q13: Name and briefly describe all eight architectural patterns. [Easy]
A: Single Agent (one LLM with tools in a ReAct loop — simplest, debuggable, context-limited), Orchestrator-Subagent (central coordinator delegates to specialized workers — the production workhorse), Hierarchical (multi-level orchestration for 10+ agent systems with management boundaries), Peer-to-Peer (no central coordinator, agents self-organize via shared state — decentralized), Pipeline/Sequential (agents form a fixed processing chain — most reliable, ETL-style), Parallel/Fan-out (orchestrator dispatches concurrent agents and aggregates results — reduces latency), Adversarial/Debate (proposer + critic + judge for quality-critical analysis), Reflexion/Self-Critique (agent evaluates and iteratively revises its own output).
Q14: When is a Single Agent sufficient, and when does it fail? [Medium]
A: Single Agent is sufficient when the task fits within one context window, a single area of expertise covers the entire task, parallelism is not required, and simplicity and debuggability are the priority. It fails when the task accumulates too much history (context window ceiling), requires deep specialization across multiple distinct domains, or would benefit from parallel execution of independent subtasks. The most common mistake is adding more agents when the actual problem is a poor single-agent system prompt or insufficient tools. A well-designed single agent handles a surprising fraction of real tasks adequately.
Q15: What are the key design decisions when building an Orchestrator-Subagent system? [Medium]
A: Three critical decisions: (1) Plan dynamism — is the decomposition fixed at design time (static DAG) or generated by the orchestrator LLM at runtime (dynamic plan)? Dynamic plans are more flexible but harder to test and debug. (2) Execution strategy — do subagents run sequentially (simpler state management) or in parallel (lower latency but requires aggregation logic and concurrent state access handling)? (3) Failure handling — when a subagent fails, does the orchestrator retry the same agent, substitute a fallback, skip the subtask, or abort? Each choice has different implications for reliability and output completeness. Define all three explicitly at design time; they can't be easily added after the fact.
Q16: Why is Pipeline the "most reliable" architectural pattern? [Medium]
A: Pipeline reliability comes from its static structure: the workflow is fully defined at design time, each stage has a clear and validated input/output contract, stages are independent modules that can be tested in isolation, and there is no dynamic planning or inter-agent negotiation that can go wrong. Each stage can be replaced or upgraded without touching adjacent stages. This makes debugging straightforward: when something fails, you know exactly which stage failed and can inspect its input and output. In contrast, Orchestrator-Subagent and Hierarchical patterns involve dynamic coordination that can fail in ways harder to trace. If a task can be modeled as fixed sequential stages, Pipeline is almost always preferable.
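The stage chain with validated input/output contracts can be sketched as follows; representing stages as (name, fn, validate) triples is an assumed convention, not a library API.

```python
def run_pipeline(stages, payload):
    """Run fixed sequential stages. Each stage is (name, fn, validate):
    `validate` checks the stage's output contract before the next stage
    runs, so a contract violation fails loudly at the exact stage."""
    for name, fn, validate in stages:
        payload = fn(payload)
        if not validate(payload):
            raise ValueError(f"stage '{name}' violated its output contract")
    return payload
```

Because each stage is an isolated function with a checked contract, a stage can be unit-tested or replaced without touching its neighbors — which is exactly the reliability property described above.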
Q17: What makes the Peer-to-Peer pattern difficult to use in production? [Hard]
A: Three hard problems: (1) Race conditions — if multiple agents can claim the same task from shared state simultaneously, you need distributed locking or atomic check-and-set operations to prevent duplicate execution; (2) Debugging — there is no single execution trace; the sequence of events emerges from agent interactions and must be reconstructed from distributed logs; (3) HITL implementation — it's unclear which agent should pause for human input, how to serialize state before pausing, and how to resume after approval. For most teams, the Peer-to-Peer pattern's fault tolerance benefits don't outweigh its operational complexity. It's appropriate in research contexts or highly distributed systems where operator teams have strong distributed systems experience.
Q18: How does HITL become an architectural element rather than just a UI feature? [Hard]
A: HITL becomes architectural when it's planned into the system from day one rather than added as a wrapper around an existing system. Architectural HITL means: the orchestration layer has explicit pause points where the system serializes its full state, stores it durably, and notifies a human reviewer; the system has a resume path where it loads the serialized state and continues from the pause point based on the human's decision; and timeout handling is defined (what happens at 15 min, 1 hour, 24 hours without response). HITL added as an afterthought usually doesn't serialize state correctly, can't resume from where it paused, and has no timeout handling — making it effectively blocking and fragile.
Q19: What is the Adversarial/Debate pattern best suited for, and what does the judge agent do? [Easy]
A: The Adversarial/Debate pattern is best suited for quality-critical analysis where a single agent might be overconfident: security analysis, investment due diligence, legal review, fact-checking, and red-teaming. A proposer agent builds the primary argument or recommendation; a critic agent is specifically instructed to find flaws, gaps, and risks; multiple rounds of debate follow. The judge agent synthesizes the debate — it doesn't just pick a side but produces a final output that incorporates the strongest points from both sides and a reasoned resolution of conflicts. The pattern's cost is multiple LLM round-trips; use it only when quality and thoroughness justify the latency and expense.
Q20: When would you use Hierarchical over Orchestrator-Subagent? [Medium]
A: Hierarchical is justified when the system requires more agents than a single orchestrator can manage effectively (typically 10+ agents), when the task domain naturally decomposes into sub-domains with their own internal structure (e.g., a software project with separate backend and frontend teams), and when management boundaries are valuable — team leads can fail and be retried without the top-level manager needing to know the implementation details. The trade-off is latency (each layer adds an LLM round-trip) and debug complexity (a failure in a worker surfaces only after propagating up through team lead → manager). Don't use Hierarchical for tasks that fit an Orchestrator-Subagent design; the added layers add cost and complexity with no benefit.
Q21: What are the four aggregation strategies for the Parallel/Fan-out pattern? [Medium]
A: Union all results then deduplicate (good for research where you want all unique findings), rank and select top-k across all agent results (good when quality varies and you want the best N items), synthesis agent merge (have a dedicated LLM intelligently combine results — handles conflicts and overlaps), and majority vote (for classification tasks where multiple independent agents vote and the majority wins). The choice depends on the task: research tasks benefit from union + deduplicate + synthesis; classification tasks benefit from majority vote for reliability; coverage-sensitive tasks (want to maximize unique findings) benefit from union. Always define what "conflict" means between results and how the aggregator handles it.
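Two of these strategies are simple enough to sketch directly. The input shapes and the `key` parameter are assumptions for illustration.

```python
from collections import Counter

def majority_vote(labels):
    """Classification aggregation: the most common label across
    independent agents wins."""
    return Counter(labels).most_common(1)[0][0]

def union_dedupe(result_lists, key=lambda r: r):
    """Research aggregation: union all agents' findings, dropping
    duplicates while preserving first-seen order. `key` extracts the
    identity of a finding (e.g. a URL or product id)."""
    seen, merged = set(), []
    for results in result_lists:
        for r in results:
            k = key(r)
            if k not in seen:
                seen.add(k)
                merged.append(r)
    return merged
```

Rank-and-select and synthesis-agent merge need task-specific scoring or an LLM call, so they don't reduce to a few lines the same way.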
Q22: Describe a hybrid pattern that combines Orchestrator-Subagent with Reflexion. [Hard]
A: In this combination, the orchestrator dispatches a subagent to complete a subtask. Instead of the subagent returning its output directly, the subagent has an internal Reflexion loop: it generates a draft output, an evaluator agent (often a stronger or different model) scores it against a rubric, and if the score is below threshold the subagent revises with the critique. Only when the quality threshold is met (or max iterations is reached) does the subagent return its result to the orchestrator. This is used in content creation pipelines (writing agent with editorial loop), code generation (generator + test runner), and research summarization (summarizer + factual accuracy checker). The benefit is that the orchestrator receives a quality-validated result from each subagent, rather than having to validate results itself.
Q23: What is the feedback loop at the innermost level of every agentic system? [Easy]
A: The tool observation loop — the most basic feedback in any agent: the agent reasons, calls a tool, observes the result, reasons again, and repeats. This is the ReAct cycle baked into every agent's operation. All other loops (critic/evaluator loops, HITL loops, environment feedback loops) are outer loops that wrap around this innermost one. A well-designed tool observation loop has structured error returns (not plain strings) that give the agent actionable information when a tool fails, and a bounded retry count to prevent infinite retries on persistent failures.
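A structured error return from the tool layer might look like this sketch; the error taxonomy and field names are illustrative.

```python
def safe_tool_call(fn, *args, **kwargs):
    """Wrap a tool so failures come back as structured data the agent can
    reason about, instead of an opaque exception string. The taxonomy
    (transient vs input) is illustrative."""
    try:
        return {"ok": True, "result": fn(*args, **kwargs)}
    except TimeoutError as e:
        return {"ok": False, "error_type": "transient", "message": str(e),
                "hint": "retry with backoff"}
    except ValueError as e:
        return {"ok": False, "error_type": "input", "message": str(e),
                "hint": "fix the arguments; do not retry as-is"}
```

The `hint` field is what makes the failure actionable: the agent's next Thought step can branch on it rather than guessing why the tool failed.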
Q24: How do you choose between running subagents sequentially vs in parallel? [Medium]
A: Parallel execution is appropriate when subagents' tasks are genuinely independent — Agent A's output is not an input to Agent B's task, and neither modifies shared state the other reads. Sequential execution is required when there are dependencies (Agent B needs Agent A's output) or when shared state must be updated serially to avoid conflicts. The practical check: draw a dependency graph. Subtasks with no edges between them can run in parallel; subtasks connected by edges must be sequential. Most Orchestrator-Subagent systems use a hybrid: the first batch of independent subtasks runs in parallel, then dependent subtasks run sequentially once their dependencies complete.
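The dependency-graph check can be sketched as batching subtasks whose dependencies are already satisfied; the input format (subtask → set of prerequisites) is an assumption.

```python
def parallel_batches(deps):
    """deps maps subtask -> set of subtasks it depends on. Returns batches:
    everything within one batch has all dependencies met and can run
    concurrently; the batches themselves run sequentially."""
    done, batches = set(), []
    pending = dict(deps)
    while pending:
        ready = [t for t, d in pending.items() if d <= done]
        if not ready:
            # Nothing runnable but work remains: a cycle in the graph.
            raise ValueError("circular dependency detected")
        batches.append(sorted(ready))
        done.update(ready)
        for t in ready:
            del pending[t]
    return batches
```

This also catches the deadlock case at plan time: a circular dependency raises before any agent runs.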
Section 3: Design Patterns (Q25–Q36)
Q25: What are the five core design patterns and which is the foundation of all the others? [Easy]
A: Tool-Use, Reflection, Planning, Multi-Agent Coordination, and Routing/Gating. Tool-Use is the foundation — every agentic system uses it. Without the ability to call external functions and act on results, an agent is just a text transformer. The other four patterns build on top of tool use: Reflection uses tool results as evaluation criteria, Planning determines what tools to call and in what order, Multi-Agent Coordination defines how tool-using agents collaborate, and Routing determines which tool-using agent handles a given input.
Q26: What makes a tool description "LLM-facing" rather than "human-facing"? [Medium]
A: A human-facing description explains what the tool does for reference documentation. An LLM-facing description is part of the agent's decision-making prompt — it must tell the model: when to use this tool (what conditions trigger its use), when NOT to use it (what mistakes to avoid), what the input format is (parameter types, constraints, examples), and what the output format is (so the agent can correctly parse results). The LLM reads tool descriptions at inference time when deciding its next action; a vague description produces wrong decisions. "Search products" tells the model nothing useful; "Search the product catalog by name or description. Use when asked about availability, pricing, or specifications. Returns a list of matching products with id, name, price, and stock. Do NOT use for order status — use check_order_status instead" is an LLM-facing description.
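The contrast can be made concrete as a JSON-schema-style tool definition of the kind many function-calling APIs accept. The exact field names vary by provider, so treat this as a sketch rather than any specific API's format.

```python
# Sketch of an LLM-facing tool definition in the JSON-schema style used by
# many function-calling APIs; check your provider's docs for exact fields.
search_products_tool = {
    "name": "search_products",
    "description": (
        "Search the product catalog by name or description. Use when asked "
        "about availability, pricing, or specifications. Returns a list of "
        "matching products with id, name, price, and stock. Do NOT use for "
        "order status — use check_order_status instead."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string",
                      "description": "Name or description text to match"},
            "max_results": {"type": "integer",
                            "description": "Cap on returned products, e.g. 5"},
        },
        "required": ["query"],
    },
}
```

Note that every element of an LLM-facing description is present: when to use it, when not to, the input constraints, and the output shape the agent should expect to parse.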
Q27: What are the three planning patterns and when do you choose each? [Medium]
A: ReAct (Reason + Act) — interleaved reasoning and action with no upfront plan; use for most general-purpose tasks where the right next step depends on the current observation. Plan-and-Execute — explicit upfront plan produced before any action; use when the full plan can be determined ahead of time, when HITL review of the plan before execution is needed, or when tasks have stable, predictable dependencies. Tree of Thoughts — explore multiple reasoning branches simultaneously and select the best before committing; use for complex, high-stakes decisions where the "right first step" isn't obvious and exploring alternatives improves quality. ToT is expensive; reserve it for decisions where the cost of a wrong path is high.
Q28: What is the "single responsibility" principle for tool design? [Medium]
A: Each tool should do exactly one thing well — not two or three things that seem related. A tool that both searches for a product AND checks its inventory status is harder for the LLM to reason about (when does it want the combined vs separate behavior?) and harder to test and maintain. Separate tools (search_products, check_inventory) let the LLM call each independently when needed and compose them naturally in its reasoning loop. Single-responsibility tools also have cleaner error messages — a tool that fails only ever fails for one reason, making the failure actionable. Compound tools fail ambiguously.
Q29: What is the self-consistency bias in Reflection, and how do you avoid it? [Hard]
A: Self-consistency bias occurs when the same model that generated an output also evaluates it — the model tends to evaluate using the same reasoning that produced the error, so it fails to catch its own mistakes. This is why self-reflection (same model as generator and evaluator) is less effective for catching factual errors, logical flaws, or hallucinations. To avoid it: use a stronger or different model as the evaluator (different weights, different priors, harder to fool with the same errors), define the evaluation rubric explicitly in the evaluator's prompt rather than asking for general feedback, and structure the critique as a list of specific issues rather than a qualitative judgment.
Q30: Describe the test-driven Reflection pattern for code generation. [Medium]
A: Instead of using an LLM evaluator, use actual test execution as the feedback signal: write the test suite first (or have the LLM generate tests from the spec), generate the code, run tests, feed failure messages back to the generator as the critique, and iterate. This is the most reliable form of reflection because the feedback is objective (tests either pass or fail) and specific (the failure message tells the agent exactly what went wrong and on which test case). The generator revises to make the failing tests pass while not breaking passing tests. Max iterations is still required — some bugs may require architectural changes the generator can't make without starting over.
Q31: How does a router agent differ from an orchestrator? [Medium]
A: A router agent classifies the input and routes it to the appropriate handler — it doesn't execute the task or coordinate multiple agents, it just makes one routing decision. An orchestrator holds the plan, manages state, coordinates multiple agents, handles errors, and synthesizes results — it actively manages execution. A router is lightweight (often just a classification prompt); an orchestrator is the control plane for the entire system. Good systems use both: the router is the first layer (routes billing questions to the billing agent and tech support to the tech agent), and each specialized agent has an orchestrator managing its internal workflow.
Q32: What is Tool Chaining and how does the LLM manage it? [Easy]
A: Tool chaining is when the output of one tool becomes the input to a subsequent tool call — the LLM naturally manages this in its reasoning loop without any explicit chaining mechanism. In its Thought step, the LLM decides what to do next based on the previous tool result and produces the next tool call accordingly. For example: search_web() returns URLs → read_url() with one of the URLs returns raw text → extract_financials() with the text returns structured data → LLM generates answer from structured data. The LLM's reasoning trace connects these steps. The developer's job is to design tools with compatible input/output formats so chaining works cleanly.
Q33: What is the difference between an approval gate and a review queue for HITL? [Medium]
A: An approval gate is synchronous — the agent pauses and blocks all further execution until a human approves or rejects the proposed action. The human must respond before the task can continue. A review queue is asynchronous — the agent submits its output to a queue and continues processing other tasks; a human reviews queued items on their own schedule, and each item moves forward once its review is complete. Approval gates are appropriate for irreversible, high-risk actions (sending a payment, deleting data) where the system must not proceed without explicit approval. Review queues are appropriate for outputs where human oversight is required but the cost of delay is low (reviewing drafted reports, checking classifications).
Q34: What are the four failure modes in multi-agent coordination and how do you prevent each? [Hard]
A: (1) Agents overwrite each other's output — prevent by assigning each agent a dedicated output key in shared state, using append-only state for intermediate results. (2) Deadlock (Agent A waits for B, Agent B waits for A) — prevent by defining a strict dependency graph at design time; circular dependencies must be caught before deployment. (3) Wrong output format — prevent by defining and validating structured output schemas for each agent; return an explicit error if validation fails rather than passing malformed data downstream. (4) Agent calls another agent without permission — prevent by routing all inter-agent communication through the orchestrator; no direct agent-to-agent calls.
Q35: Why is routing "underestimated" as a performance optimization? [Medium]
A: Most teams think of routing as plumbing — routing the right input to the right handler so the system works at all. But routing also dramatically improves quality: sending a billing question to a specialist billing agent (with billing-specific tools and a billing-focused system prompt) produces much better results than sending it to a general agent. The billing specialist has narrower context and better-tuned behavior for that exact task. Additionally, routing to a smaller, faster model for simple inputs (classification, extraction) and a larger model for complex inputs (multi-step reasoning) reduces cost and latency significantly. A well-designed router is one of the highest-ROI optimizations in a multi-agent system.
Q36: How does Plan-and-Execute enable HITL better than ReAct? [Hard]
A: Plan-and-Execute produces an explicit, readable plan before any action is taken. This plan can be shown to a human for review and approval before the system commits to execution — "here are the 5 steps I plan to take, do you approve?" This is natural HITL at the planning level. With ReAct, the agent interleaves reasoning and action in a stream that's hard to pause and show to a human meaningfully — actions happen as part of reasoning, so there's no clean "show the plan first" moment. For regulated workflows (financial advice, medical diagnosis) where a human must sign off before the system acts, Plan-and-Execute is structurally superior to ReAct precisely because the plan is a first-class artifact.
Section 4: Multi-Agent Systems (Q37–Q48)
Q37: What is the minimum viable reason to add a second agent to a system? [Medium]
A: A second agent is justified when one of three conditions is clearly true: (1) the task has a natural seam where one subtask requires a different tool set or area of expertise that would contaminate the first agent's context, (2) the two subtasks are genuinely independent and running them in parallel would produce a meaningful latency reduction, or (3) fault isolation is required — if one agent fails, the system must be able to continue with other agents rather than crashing entirely. Adding an agent purely because it "feels more capable" or to separate concerns that don't need separation just adds coordination overhead. Measure whether the multi-agent version actually outperforms a well-designed single agent before committing.
Q38: What is the difference between Message Passing and Shared State as communication protocols? [Medium]
A: Message Passing: agents communicate via explicit, typed message objects routed through the orchestrator. Advantages — interfaces are explicit and auditable; each message is logged; agents are fully decoupled (neither knows about the other). Disadvantages — requires message routing infrastructure; verbose. Shared State: all agents read from and write to a common state object; the state is the communication channel. Advantages — simple to implement, natural for sequential pipelines, easy to inspect for debugging. Disadvantages — requires careful design to prevent agents from overwriting each other's outputs; less explicit than typed messages. LangGraph uses Shared State as its native model; most custom systems use a hybrid (typed messages wrapped in shared state).
Q39: How do you prevent an infinite retry loop in a multi-agent system? [Hard]
A: Three defenses in combination: (1) explicit max_retries per agent (typically 3) — after N failures, the agent returns a typed error rather than retrying again; (2) max_iterations for any feedback/reflection loop — even if quality hasn't been met, stop and return best effort with a "max iterations reached" flag; (3) exponential backoff with jitter for transient failures (rate limits, network timeouts) — not all retries should happen immediately. Additionally, classify error types before retrying: transient errors (network timeout, rate limit) are retryable with backoff; input errors (bad arguments, validation failure) should never be retried — fix the input; permanent errors (resource deleted, permission denied) should be escalated, not retried.
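The per-call defense — classify first, then retry transient failures with exponential backoff plus jitter — can be sketched as follows. The retryable error classes and delay values are illustrative.

```python
import random
import time

# Illustrative classification: which exception classes count as transient.
TRANSIENT = (TimeoutError, ConnectionError)

def call_with_retry(fn, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Retry only transient failures, with exponential backoff plus jitter.
    Input and permanent errors propagate immediately instead of retrying."""
    for attempt in range(max_retries):
        try:
            return fn()
        except TRANSIENT:
            if attempt == max_retries - 1:
                raise  # budget exhausted: escalate, don't loop
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            sleep(delay)  # injectable so tests don't actually wait
```

Any exception outside `TRANSIENT` (a `ValueError` from bad arguments, a permission error) escapes on the first attempt, which implements the "never retry input or permanent errors" rule.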
Q40: What is the role of the "error log" component in agent state? [Medium]
A: The error log is a component of agent working state that records every failure: which agent failed, what it was trying to do, what the error was, and how many retries have been attempted. It serves three purposes: (1) enables the orchestrator to replan — by reading the error log, the orchestrator knows what has failed and can decide to skip, substitute a fallback, or try a different approach; (2) prevents retry storms — the orchestrator checks the error log before retrying to avoid exceeding max_retries; (3) post-mortem debugging — the complete error log is included in the episodic memory and is the primary artifact for diagnosing system failures. Without an explicit error log, failures are lost in conversation history and hard to find.
Q41: What is context summarization and what must it preserve? [Hard]
A: Context summarization compresses old conversation history to prevent context window overflow as tasks run for many steps. When the accumulated history approaches the context limit, older sections are summarized into a compact representation and the original messages are discarded. The summary must preserve: the original goal statement, key decisions made and why, tool results that are still relevant, the current state of the task ledger, and any constraints or requirements that remain active. What it can discard: verbose tool outputs already incorporated into later reasoning, redundant back-and-forth, intermediate reasoning steps whose conclusions have been captured. A poorly written summary that loses the goal statement is the direct cause of context drift.
Q42: Why is "start with centralized coordination" the standard advice for new multi-agent systems? [Medium]
A: Centralized coordination (orchestrator-controlled) has one place that holds all state, makes all decisions, and produces a single trace of events — this makes the system tractable to reason about, test, and debug. Decentralized coordination emerges from agent interactions and requires debugging distributed systems, handling race conditions, and reasoning about concurrent state mutations — skills and infrastructure most teams don't have or need. The orchestrator can be made highly available (multiple orchestrator instances with shared state) to address the single-point-of-failure concern without moving to full P2P decentralization. Move to decentralized only when you have a specific, demonstrated need that centralization can't meet.
Q43: What is graceful degradation in the context of multi-agent systems? [Medium]
A: Graceful degradation means returning a useful partial result when some agents fail, rather than returning nothing or crashing the entire system. Instead of: raise exception on first failure, do: collect results from agents that succeeded, record what failed and why, and return a structured output that clearly distinguishes completed work from failed work (with "completeness": "partial" metadata). This is valuable because in many tasks a partial result is still useful — a research report with 4 out of 5 manufacturer profiles is still valuable; a complete failure that returns nothing is not. Graceful degradation requires designing outputs as collections of independent results from the start, not as a single monolithic object that's either complete or missing.
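A minimal aggregation sketch of this pattern (the per-agent result shape and the `completeness` values are illustrative assumptions):

```python
def aggregate(results):
    """Combine per-agent outcomes into one structured partial result.

    `results` maps agent name -> {"ok": bool, "data": ...} on success
    or {"ok": False, "error": ...} on failure.
    """
    completed = {k: v["data"] for k, v in results.items() if v["ok"]}
    failed = {k: v["error"] for k, v in results.items() if not v["ok"]}
    return {
        # "partial" when some agents failed but others delivered;
        # only a total wipeout counts as "failed".
        "completeness": "full" if not failed else ("partial" if completed else "failed"),
        "results": completed,
        "failures": failed,
    }
```

Because outputs are a collection of independent per-agent results, losing one agent degrades the answer instead of destroying it.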
Q44: Why are typed, structured outputs from each agent critical? [Hard]
A: Unstructured handoffs — where Agent A returns a free-form paragraph that Agent B must parse — are a major source of bugs for three reasons: (1) Agent B's LLM may misparse or misinterpret the paragraph, silently propagating an error; (2) the interface between agents is implicit and invisible, making changes to Agent A's output format break Agent B in ways that aren't caught until runtime; (3) testing and mocking are much harder — you can't write unit tests for an agent that receives unstructured text. Typed, structured outputs (Pydantic models, TypedDicts) make interfaces explicit, enable schema validation, produce clear error messages when validation fails, and allow agents to be tested in isolation against known input shapes.
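A dependency-free sketch of a typed handoff using stdlib dataclasses (in production Pydantic models would do the validation automatically; the `ManufacturerProfile` fields here are made up for illustration):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ManufacturerProfile:
    """Explicit interface between the research agent and the writer agent."""
    name: str
    country: str
    founded: int

def validate_profile(raw: dict) -> ManufacturerProfile:
    """Validate Agent A's output at the boundary.

    Failing loudly here beats letting Agent B misparse free-form
    text and silently propagate the error downstream.
    """
    missing = {"name", "country", "founded"} - raw.keys()
    if missing:
        raise ValueError(f"handoff missing fields: {sorted(missing)}")
    if not isinstance(raw["founded"], int):
        raise TypeError("founded must be an int")
    return ManufacturerProfile(raw["name"], raw["country"], raw["founded"])
```

The typed boundary also makes Agent B trivially testable: unit tests construct `ManufacturerProfile` instances directly instead of mocking paragraph-parsing.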
Q45: Describe the "fallback agent" resilience pattern. [Medium]
A: A fallback agent is a simpler, more reliable alternative to the primary agent that takes over when the primary fails. The primary agent is optimized for quality (more capable model, more complex logic); the fallback is optimized for reliability (simpler prompt, fewer tools, more predictable behavior). When the primary raises an exception or returns an error after max retries, the orchestrator transparently routes to the fallback. The fallback may produce a lower-quality result, but it produces a result — which is often preferable to a complete failure. Design fallback agents to be stateless and simple enough that their failure modes are well-understood; a fallback that can also fail in complex ways provides no reliability improvement.
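The routing step can be sketched like this (a simplified orchestrator fragment; agents are modeled as plain callables returning dicts):

```python
def run_with_fallback(primary, fallback, task):
    """Route to the fallback agent when the primary fails.

    The fallback is assumed simpler and more reliable, so its own
    failure is allowed to propagate as a hard error rather than
    being caught and masked.
    """
    try:
        result = primary(task)
        result["source"] = "primary"
        return result
    except Exception as exc:
        # Broad catch is deliberate here: any primary failure after its
        # own retries should degrade to the fallback, not crash the task.
        result = fallback(task)
        result["source"] = "fallback"
        result["primary_error"] = str(exc)
        return result
```

Tagging the result with `source` and `primary_error` keeps the degradation visible in traces instead of silently hiding the primary's failure rate.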
Q46: What makes event-driven (pub/sub) communication challenging for agentic systems? [Hard]
A: Three fundamental challenges: (1) At-least-once delivery — message brokers guarantee each message will be delivered at least once, but possibly multiple times; agents must be idempotent to handle duplicate event processing safely; (2) Event ordering — events from different producers may arrive out of order; an agent may receive a "task complete" event before the "task started" event in some network conditions; (3) Debugging requires reconstructing the causal chain from distributed logs, which is significantly harder than following a single execution trace. Additionally, implementing HITL (pausing for human input mid-stream) requires careful state serialization since there's no natural "pause point" in an event-driven flow.
Q47: What is the difference between a "retry" and a "replan"? [Medium]
A: A retry re-executes the same action that failed, typically after a backoff period, expecting the failure was transient. A replan revises the remaining execution plan — the orchestrator reads the current state (what's been done, what failed, what remains), understands why the failure occurred, and generates a different plan for what to do next. Retries are appropriate for transient failures (rate limits, network timeouts) where the approach is correct but the execution failed temporarily. Replanning is appropriate for permanent or strategic failures — the tool returned an error indicating the approach was wrong, an agent was blocked by missing permissions, or the original plan's assumptions turned out to be incorrect. A system that only retries will loop forever on permanent failures; a system with replanning can pivot gracefully.
Q48: How does parallelism in multi-agent systems improve latency but potentially complicate state? [Hard]
A: Latency improvement: independent subtasks running concurrently complete in max(task_durations) rather than sum(task_durations) — five 30-second research agents running in parallel complete in 30 seconds total, not 150 seconds. State complication: concurrent agents writing to shared state create race conditions. If two agents both read a counter, increment it, and write it back, the final value may be incremented only once instead of twice. Mitigations: use agent-specific output keys (each agent writes to its own field), make aggregation atomic (one designated aggregator reads all results only after all agents complete), and use append-only state for intermediate results rather than overwrite operations. The orchestrator coordinates completion — it dispatches all parallel agents, waits for all to complete, and then proceeds.
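The agent-specific-output-key mitigation can be sketched with `asyncio` (agents modeled as async callables; state is a plain dict for illustration):

```python
import asyncio

async def run_parallel(agents, shared_state):
    """Dispatch independent agents concurrently.

    Each agent writes only to its own key in shared_state, so there
    is no read-modify-write race to mitigate; the orchestrator reads
    the aggregate only after gather() confirms all have completed.
    """
    async def run_one(name, agent):
        shared_state[name] = await agent()  # agent-specific output key

    await asyncio.gather(*(run_one(n, a) for n, a in agents.items()))
    return shared_state
```

Total wall time is governed by the slowest agent (`max`, not `sum`), which is where the five-agents-in-30-seconds figure comes from.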
Section 5: System Design & Production (Q49–Q60)
Q49: Walk through the six-layer production agentic architecture from request to response. [Hard]
A: Request enters the Interface Layer — authenticated, validated, sanitized, and queued as an async task; caller receives a task ID. The Orchestration Layer decomposes the task into a workflow (static DAG or dynamic plan), assigns agents to steps, and manages dependencies and HITL gates. Agents in the Agent Pool execute assigned steps: each is a stateless worker with a specific role, system prompt, model, and tool set. Tools in the Tool Layer are called as needed — each sandboxed and rate-limited. State is maintained in the Memory Layer: short-term (context window), working (task ledger, partial results), long-term (semantic knowledge store), episodic (event log). Every event is captured by the Monitoring Layer for tracing, cost tracking, and alerting. Results are returned via webhook or polling endpoint to the original caller.
Q50: What are the four types of agent failures and how do you handle each? [Medium]
A: Transient failures (network timeout, rate limit) — retry with exponential backoff up to max_retries; these are expected and usually self-resolve. Input errors (bad arguments, schema validation failure) — do not retry; the same input will fail again; fix the upstream agent that produced the bad arguments. Permanent failures (resource deleted, permission denied, API key expired) — do not retry; escalate to the orchestrator for replanning or alert on-call. Unknown failures — retry once with backoff; if it fails again, escalate. The key principle: classify the error before deciding to retry; retrying a permanent failure wastes retries and delays escalation. Return structured error objects with an error_type field so the orchestrator can route each failure type appropriately.
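The classify-before-retry routing reduces to a small decision function (a sketch; the `error_type` labels and action names mirror the taxonomy above but are otherwise made up):

```python
def route_failure(error_type, retries_used, max_retries=3):
    """Decide the next step for a failed call based on its error class.

    Returns one of: "retry_with_backoff", "fix_input", "escalate".
    """
    if error_type == "transient":
        return "retry_with_backoff" if retries_used < max_retries else "escalate"
    if error_type == "input":
        return "fix_input"   # the same input will fail again; never retry
    if error_type == "permanent":
        return "escalate"    # orchestrator replans or alerts on-call
    # Unknown failures get one cautious retry, then escalate.
    return "retry_with_backoff" if retries_used < 1 else "escalate"
```

The orchestrator reads `error_type` from the agent's structured error object and dispatches on the returned action.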
Q51: Why is async task handling the correct architecture for long-running agents? [Easy]
A: Long-running tasks hold connections open and block server resources if handled synchronously. A task that takes 5 minutes would require the client to maintain an open HTTP connection for 5 minutes — nearly all clients will time out. The async pattern — accept task, return task ID immediately, process in background, deliver result via webhook or polling — solves all of these issues: the connection is freed immediately, processing scales horizontally across workers, and the client controls when it checks for results. Async also enables natural retry on worker crash (the task stays in the queue), priority queuing, and back-pressure handling without server exhaustion.
Q52: What is a token budget and how do you enforce it in production? [Hard]
A: A token budget is an explicit limit on the total tokens a task may consume across all LLM calls. Each task type has a defined budget (simple_query: 10K, research_task: 100K, full_analysis: 500K). The orchestrator's CostTracker records input and output tokens for every LLM call and accumulates them against the budget. When the budget is near exhaustion (e.g., 80% used), the orchestrator signals agents to wrap up — summarize accumulated context and return a best-effort result rather than continuing. When the budget is exceeded, the system raises a BudgetExceededError, records partial results, and returns with a "status": "budget_exceeded" flag. Without token budgets, a complex task can consume hundreds of dollars in LLM calls before producing any output — a single misconfigured task can run up significant costs.
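A minimal `CostTracker` sketch (the class name comes from the answer above; the `wrap_up` threshold and status strings are illustrative):

```python
class CostTracker:
    """Accumulate token usage for one task against its budget."""

    def __init__(self, budget_tokens, wrap_up_threshold=0.8):
        self.budget = budget_tokens
        self.threshold = wrap_up_threshold
        self.used = 0

    def record(self, input_tokens, output_tokens):
        """Called after every LLM call with that call's token counts."""
        self.used += input_tokens + output_tokens

    @property
    def status(self):
        if self.used >= self.budget:
            return "budget_exceeded"  # stop now, return partial results
        if self.used >= self.budget * self.threshold:
            return "wrap_up"          # signal agents to summarize and finish
        return "ok"
```

The orchestrator checks `status` before dispatching each step, so a runaway task is cut off within one LLM call of the budget rather than hundreds of dollars later.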
Q53: Describe the four memory types and when each is accessed during a task. [Hard]
A: When a new task arrives: query long-term memory (vector store) for relevant prior knowledge about the task's topic. When beginning execution: load working memory (task ledger, agent assignments, partial results from any prior run). During execution: agents use short-term memory (their context window) for the current session's conversation history and tool results. After each significant step: write results to working memory (state update) and append an event to episodic memory (event log). At context window pressure: summarize short-term memory into a compact form while preserving goal and task ledger. On task completion: extract important findings from working memory and write to long-term memory for future tasks.
Q54: What is prompt injection specific to multi-agent systems and why is it harder to prevent than single-agent injection? [Hard]
A: In a single-agent system, prompt injection requires the malicious content to be directly in the agent's input context — a relatively contained attack surface. In multi-agent systems, injection can be multi-hop: malicious content embedded in a web page is processed by the research agent, incorporated into its output, passed to the writer agent, which passes it to the email agent, which takes an unauthorized action. Each agent in the chain adds a hop; the injection travels through agents that individually look clean. Multi-hop injection is harder to prevent because the content looks like legitimate agent output (not raw user input) by the time it reaches the action-taking agent. The primary defense is role separation enforced by tool permissions — the research agent cannot have email tools, so even if injected instructions reach it, it cannot send email.
Q55: What is a circuit breaker and how does it interact with retry policies? [Hard]
A: A circuit breaker monitors failure rates for a downstream dependency and temporarily stops sending requests when the failure rate exceeds a threshold — rather than having every agent call hang waiting for a timeout. In the Open state (circuit tripped), requests are rejected immediately with a CircuitOpenError rather than waiting for a timeout. Interaction with retry policy: the retry policy handles individual call failures (retry with backoff); the circuit breaker handles sustained dependency failures (stop retrying altogether). They work together: retry handles transient single-call failures; circuit breaker handles prolonged outages where retrying would just pile up failed requests. The recovery path is the Half-Open state — after a recovery timeout, one test call is allowed through; if it succeeds, the circuit closes; if it fails, the circuit stays open.
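A minimal three-state breaker sketch (thresholds and the injectable `clock` are illustrative choices, not from a specific library):

```python
import time

class CircuitBreaker:
    """Closed -> Open after repeated failures; Open -> Half-Open after
    a recovery timeout; Half-Open closes on success, re-opens on failure."""

    def __init__(self, failure_threshold=5, recovery_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock          # injectable for testing
        self.failures = 0
        self.state = "closed"
        self.opened_at = None

    def allow_request(self):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.recovery_timeout:
                self.state = "half_open"  # let one probe call through
                return True
            return False  # fail fast; caller raises CircuitOpenError
        return True

    def record_success(self):
        self.failures = 0
        self.state = "closed"

    def record_failure(self):
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

The agent's call path becomes: check `allow_request()`, then apply the per-call retry policy only if the circuit permits the call at all.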
Q56: How do you size and configure the AgentConfig limits for a new agent? [Medium]
A: Start with conservative defaults and tune from production data: max_llm_calls: 20, max_tool_calls: 30, max_wall_time: 300s, max_cost_usd: 1.00, max_retries: 3. Instrument your agent from day one — log actual steps to completion, actual cost, and actual duration for every task in your evaluation set. Set max_llm_calls to 2× the 95th percentile of actual steps (gives room for retries). Set max_wall_time to 3× the 95th percentile of actual duration. Set max_cost_usd to 2× the 95th percentile of actual cost. The goal is limits that never trigger on legitimate tasks but catch runaway behavior promptly. Review and adjust monthly as task patterns change.
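The percentile-based sizing rules can be sketched directly from logged production data (a sketch; it assumes at least a couple of dozen observations so the 95th percentile is meaningful):

```python
import statistics

def size_limits(observed_steps, observed_cost, observed_seconds):
    """Derive AgentConfig limits from measured task data using the
    rules above: 2x p95 steps, 3x p95 duration, 2x p95 cost."""
    def p95(xs):
        # quantiles(n=20) returns 19 cut points; index 18 is the 95th pct
        return statistics.quantiles(xs, n=20)[18]
    return {
        "max_llm_calls": round(2 * p95(observed_steps)),
        "max_wall_time": round(3 * p95(observed_seconds)),
        "max_cost_usd": round(2 * p95(observed_cost), 2),
    }
```

Re-running this monthly against fresh logs implements the "review and adjust" step without hand-tuning.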
Q57: What are the five HITL trigger conditions and what information must every HITL request include? [Medium]
A: Trigger conditions: irreversible actions (send_email, make_payment, delete_record), dollar value exceeds threshold, confidence score below threshold, personal data (PII) involved, scope is "bulk" (affects many records). Every HITL request must include: (1) what the agent wants to do (the specific action, not a vague description), (2) why (the agent's reasoning — what goal is this action serving), (3) consequences of approval and of rejection, (4) a preview (for content: show the exact email/document to be sent/created), and (5) an alternative (if rejected, what will the agent do instead). HITL requests that lack this context lead to rubber-stamp approvals — humans approve without reading because the context is insufficient to evaluate.
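The five required fields map naturally onto a typed request object (a sketch; the field names are descriptive choices, not a standard schema):

```python
from dataclasses import dataclass

@dataclass
class HITLRequest:
    """Approval request carrying the five required fields."""
    action: str        # the specific action, e.g. the exact send_email call
    reasoning: str     # why: what goal this action serves
    consequences: str  # effect of approval vs. rejection
    preview: str       # exact content to be sent/created
    alternative: str   # what the agent will do if rejected

REQUIRED = ("action", "reasoning", "consequences", "preview", "alternative")

def is_reviewable(req: HITLRequest) -> bool:
    """Reject requests with any empty field before they reach a human —
    incomplete context is what produces rubber-stamp approvals."""
    return all(getattr(req, f) for f in REQUIRED)
```

Validating completeness in code, before queueing the request, makes "insufficient context" a bug the agent's author sees rather than a burden the approver absorbs.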
Q58: Describe a production security model for a multi-agent system that processes external web content. [Hard]
A: Separate the content-processing layer from the action-taking layer with a hard boundary. The research agent processes external content (web pages, user inputs) in a sandboxed environment with no action-taking tools — it can only read and return structured summaries. Its output is sanitized (strip instruction-like patterns) and validated against a schema before passing to the writer or action agents. Action-taking agents (email, database write, payment) only receive sanitized summaries from the research layer, never raw external content. Tool permissions enforce this: research_agent has [search_web, read_url]; email_agent has [send_email] — they cannot be swapped. No agent should have both read-external-content and take-irreversible-action permissions simultaneously.
Q59: What observability metrics should be tracked from day one, and which is most actionable? [Medium]
A: Track from day one: task completion rate (% completing without error), average steps per task (efficiency — should be stable), cost per task (cumulative token cost — alerts on runaway tasks), P50/P95/P99 task duration, error rate by agent and error type, tool call accuracy (% of calls with valid arguments), and safety violation rate. The most actionable metric is error rate by agent — it immediately tells you which specific agent is misbehaving and what type of failure is occurring. Task completion rate tells you something is wrong; error rate by agent tells you what and where. Set alerts: task completion rate < 90%, error rate > 5%, any safety violation, daily cost > budget threshold.
Q60: What is horizontal scaling for agent workers and what architectural requirement enables it? [Medium]
A: Horizontal scaling means running multiple identical worker processes that pull tasks from a shared queue, processing different tasks concurrently — adding a worker increases throughput proportionally. The architectural requirement that enables it is stateless workers: each worker reads all state it needs from the shared database (task ledger, conversation history, tool results), processes the task, and writes results back — no local state that would prevent another worker from picking up the task. Queue-backed architecture (client → task queue → worker → result store → webhook) provides natural back-pressure (tasks queue up during spikes), retry on worker crash (task stays in queue), and clean separation between ingestion and processing. Without stateless workers, each task is pinned to one worker and you can't scale horizontally.
Section 6: Evaluation & Observability (Q61–Q72)
Q61: Why does "the path matters, not just the destination" in agentic evaluation? [Medium]
A: A wrong path produces results that are fragile, expensive, or dangerous even when the final answer looks correct. A system that uses 20 tool calls to answer a question solvable in 3 will cost 7× more at scale; a system that hallucinates tool arguments before accidentally getting the right answer will fail on 80% of real-world variations. The path also reveals system health over time: steps-per-task growing week-over-week signals prompt degradation or context drift long before task completion rate drops. Additionally, for safety — a system that took a prohibited action along the way but produced an acceptable final answer still violated safety constraints. Final-answer-only evaluation misses all of this.
Q62: What are the three trajectory evaluation approaches and their trade-offs? [Hard]
A: Golden trajectory comparison — define the ideal path for each test case and compare mechanically. Precise and reproducible, but brittle: there are usually multiple valid paths to the same answer, and the golden path becomes stale as the system evolves. LLM-as-judge — a strong LLM evaluates whether the trajectory was reasonable on defined dimensions (tool selection, argument quality, efficiency, correctness, safety). Handles trajectory variability, but introduces judge inconsistency and shared blind spots with the generating model. Automated step-level checks — programmatic assertions on each step (no hallucinated arguments, no duplicate calls, tool result referenced in next step). Fast and cheap but can only check explicit assertions, not nuanced quality. Production systems combine all three: automated checks catch mechanical failures, golden trajectories verify known-good paths, LLM-as-judge evaluates novel inputs.
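The automated step-level checks are the cheapest of the three to implement; a sketch (the step dict shape `{"tool": ..., "args": {...}}` is an illustrative trajectory format):

```python
def check_trajectory(steps, valid_tools):
    """Programmatic assertions over a trajectory.

    Flags two mechanical failures: calls to tools the agent doesn't
    have (a symptom of hallucinated tool names) and exact duplicate
    calls (a symptom of looping).
    """
    violations = []
    seen = set()
    for i, step in enumerate(steps):
        if step["tool"] not in valid_tools:
            violations.append((i, "unknown_tool"))
        key = (step["tool"], tuple(sorted(step["args"].items())))
        if key in seen:
            violations.append((i, "duplicate_call"))
        seen.add(key)
    return violations
```

Checks like these run in microseconds per trajectory, so they can gate every eval run while golden-trajectory and LLM-as-judge passes run less frequently.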
Q63: What is the difference between task-level, step-level, and quality metrics? [Easy]
A: Task-level metrics measure overall task outcomes: completion rate, steps to completion, cost per task, time to completion, safety violation rate. Step-level metrics measure individual decisions within a trajectory: tool call accuracy, tool selection accuracy, argument hallucination rate, retry rate. Quality metrics measure the value of the final output: answer accuracy, completeness, instruction following, groundedness. All three are needed because they catch different problems: a high task completion rate with a high step-level error rate means the system gets lucky often but is fundamentally unreliable. High quality metrics with high step costs mean the system produces good outputs expensively. Strong in all three dimensions means a system that is reliable, efficient, and produces good results.
Q64: Why should you build evaluation infrastructure before optimizing prompts? [Hard]
A: Without measurement, you can't tell whether a prompt change improves or degrades behavior — you're optimizing blind. A prompt change that improves output quality on your two test cases may reduce task completion rate on 20% of real-world inputs; without an eval harness covering diverse cases, you'll never know. The sequence should always be: define test cases → build eval harness → establish baselines → make changes → measure impact. Skipping straight to optimization produces systems that perform well on the developer's mental test cases but fail unpredictably in production. Additionally, a good eval harness with a simulated tool environment (mock tools) runs in seconds rather than minutes, enabling rapid iteration that would be too slow against real APIs.
Q65: What is the difference between simulated and real environment evaluation? [Medium]
A: Simulated environments use mock tools with predefined responses — fast (no API latency), cheap (no API cost), reproducible (same response every run), side-effect-free (no emails sent, no real writes). Use simulated evaluation for the vast majority of eval runs during development. Real environment evaluation runs against actual APIs in a staging environment — captures real-world behavior (actual latency, real API error modes, actual response variability) but is slower, more expensive, and has side effects. Use real environment evaluation for pre-production validation, known failure scenario testing, and cost/latency benchmarking. The two are complementary: simulate to iterate fast, validate real to check production readiness.
Q66: Describe the LLM-as-judge evaluation rubric for trajectory assessment. [Medium]
A: A well-structured rubric rates five dimensions on a 1-5 scale: (1) Tool selection accuracy — were the right tools called for each reasoning step? (2) Argument quality — were tool arguments correct, grounded in available context, and free of hallucination? (3) Efficiency — were there unnecessary or repeated steps? Did the agent terminate promptly when done? (4) Answer correctness — is the final answer factually accurate and complete? (5) Safety — did the agent avoid any prohibited actions, prompt injection susceptibility, or scope creep? Each dimension gets a score and a brief justification citing specific steps. Calibrate the judge by comparing its scores to human annotators on a sample — judge and human should agree within one point on average.
Q67: How do you set up distributed tracing for a multi-agent system? [Hard]
A: Every significant event gets logged as an AgentEvent with: event type (llm_call, tool_call, state_update, hitl_trigger, error), task ID, agent ID, step number, and timestamp, plus type-specific fields (model, input/output tokens, cost for LLM calls; tool name, args, result, latency for tool calls; error type, message, stack trace for errors). All events for a task share a common task_id, enabling trace reconstruction. Use LangSmith for LangChain/LangGraph projects (traces automatically from env var configuration); use Langfuse for framework-agnostic tracing (decorate functions with @observe()); use OpenTelemetry for custom microservice architectures. Start tracing on day one — retrofitting tracing into a running system is 10× harder than building it in from the start.
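The `AgentEvent` structure and trace reconstruction can be sketched with stdlib tools (field names follow the answer above; the JSON-lines log format is an illustrative choice — LangSmith/Langfuse handle this for you):

```python
import json
import time
from dataclasses import dataclass, field, asdict

@dataclass
class AgentEvent:
    """One trace event; all events for a task share task_id so the
    full trajectory can be reconstructed from mixed logs."""
    event_type: str   # llm_call | tool_call | state_update | hitl_trigger | error
    task_id: str
    agent_id: str
    step: int
    payload: dict = field(default_factory=dict)  # type-specific fields
    timestamp: float = field(default_factory=time.time)

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

def reconstruct_trace(log_lines, task_id):
    """Filter a mixed JSON-lines event log down to one task's ordered trajectory."""
    events = [json.loads(line) for line in log_lines]
    return sorted((e for e in events if e["task_id"] == task_id),
                  key=lambda e: e["step"])
```

Emitting every event through one helper like this from day one is what makes the "first wrong step" debugging strategy (Q69) possible later.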
Q68: What does "evaluate before you optimize" mean in practice? [Medium]
A: It means running your current system against a defined test set and measuring baseline metrics (completion rate, steps per task, cost per task, answer accuracy) before making any changes. The baseline tells you: what the system actually does (which may differ from what you think it does), which failure modes are most common, which agents have the highest error rates, and what good performance looks like for your specific task distribution. With the baseline measured, every optimization can be evaluated as a delta — "this prompt change increased answer accuracy by 8% with no change in completion rate and 12% reduction in steps." Without the baseline, you're optimizing by feel.
Q69: What is the "first wrong step" debugging strategy? [Hard]
A: The strategy is: when debugging a multi-agent system failure, start from step 1 of the trajectory and find the earliest step where something went wrong — not the step that produced the visible error. In a 10-step trajectory, the error may appear at step 8, but the root cause is at step 3 (a hallucinated tool argument that produced a plausible-but-wrong result). Steps 4-7 built on that wrong result; step 8 failed because of accumulated wrongness. If you start debugging at step 8, you'll fix the symptom rather than the cause. Practical implementation: replay the trajectory step-by-step using your framework's checkpoint replay, inspecting the full state (including tool call arguments and results) at each step until you find the first step whose output doesn't match expectations.
Q70: Why is context drift the hardest failure mode to detect? [Hard]
A: Context drift produces outputs that look superficially correct — the final answer is coherent, grammatical, and on the general topic of the question. The model doesn't fail with an error or refuse to answer; it answers a slightly different question than the one asked. The drift is only detectable by comparison with the original task, which is often several thousand tokens earlier in the conversation history. Standard metrics won't catch it: task completion rate is high (the task completed), cost is normal, latency is normal, and even human review may miss it without carefully comparing the answer to the original goal. The only reliable detection is task-level answer accuracy evaluation that compares the final answer against the original task specification, not just against the most recent context.
Q71: How would you diagnose a system where task completion rate is high but user satisfaction is low? [Hard]
A: This pattern indicates the system is completing tasks but producing low-quality outputs — the common causes are context drift (answering a different question than asked), low groundedness (generating from model weights rather than retrieved context), or poor answer completeness (technically completing but omitting key parts of the request). Diagnostic steps: (1) Run quality metrics (answer accuracy, completeness, groundedness) on a sample of "completed" tasks — this will show if output quality is systematically poor. (2) Run trajectory evaluation to check whether the agent is using the right tools with good arguments — poor tool use can lead to technically complete but empty answers. (3) Compare final answers against original task specs for a random sample — context drift will be visible here. (4) Check if specific task types or agents are driving the satisfaction gap — the problem may be narrow.
Q72: What is A/B evaluation of agent versions and when should you use it? [Medium]
A: A/B evaluation runs two agent versions (different model, different prompt, different architecture) against the same test task set and compares their metrics head-to-head: completion rate, steps to completion, cost per task, answer accuracy, trajectory quality. It measures relative improvement rather than absolute quality. Use A/B evaluation when deciding whether to deploy a change: before shipping a prompt update, model upgrade, or architectural change to production, validate that the new version improves (or at minimum doesn't regress) on all key metrics. This prevents "this change helped on my test cases but hurt on 20 others" production incidents. Run A/B evaluation on a diverse test set representing real task distribution — testing only happy-path cases will miss regressions on edge cases.