Evaluation and Observability


Why Agent Evaluation Is Different

Evaluating a single-turn LLM is comparatively straightforward: does the output match what you expect? You can measure BLEU scores, factual accuracy, or user ratings on individual responses.

Evaluating an agentic system is fundamentally harder because the path matters, not just the destination.

| Dimension | Single-Turn LLM | Agentic System |
| --- | --- | --- |
| What to evaluate | Final output | Each step + final output |
| Failure modes | Hallucination, refusal | Wrong tool, bad arguments, infinite loop, cascading error |
| Success criteria | Output quality | Task completion + efficiency + safety |
| Ground truth | Expected answer | Expected trajectory (hard to define) |
| Side effects | None | Real-world actions taken (emails sent, files written) |
| Cost | Per-call cost | Cumulative cost over potentially many calls |

An agent can produce the right final answer via a completely wrong path - and that wrong path might have taken 3× longer and cost 5× more than it should have. Or it might have taken a shortcut that works on this input but fails in edge cases. The final answer alone tells you very little about either problem.


Trajectory Evaluation

Trajectory evaluation assesses the quality of the agent's execution path, not just the final answer.

What a Trajectory Contains

Task: "Find the current CEO of OpenAI and their educational background"

Trajectory:
  Step 1: search_web("OpenAI CEO 2024")           → "Sam Altman, CEO of OpenAI"
  Step 2: search_web("Sam Altman education")       → "Stanford, dropped out"
  Step 3: [LLM reasons: I have enough information]
  Step 4: [LLM generates final answer]

Final answer: "Sam Altman is the CEO of OpenAI. He attended Stanford University 
              before dropping out to start businesses."

A good trajectory:

  • Calls the right tools at the right time
  • Uses appropriate arguments (not hallucinated)
  • Doesn't make unnecessary tool calls
  • Doesn't repeat failed approaches
  • Terminates when the task is complete (doesn't loop)

Trajectory Evaluation Approaches

Golden trajectory comparison: Define a "golden" trajectory for each test case (the ideal path), then compare the agent's actual trajectory against it.

golden = {
    "task": "Find OpenAI CEO's educational background",
    "expected_tools": ["search_web", "search_web"],
    "expected_tool_args": [{"query": "OpenAI CEO"}, {"query": "Sam Altman education"}],
    "expected_steps": 2,  # tool calls only; LLM reasoning steps are not counted here
    "expected_final_answer_contains": ["Sam Altman", "Stanford"]
}

Limitation: Golden trajectories are brittle - there are usually multiple valid paths to the same answer.
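
A comparison against a golden record like the one above could be sketched as follows. This is a minimal illustration, not a standard API: `GoldenComparison` and `compare_to_golden` are hypothetical names. Tool arguments are deliberately not compared for exact equality, because query strings such as "OpenAI CEO 2024" and "OpenAI CEO" are both reasonable.

```python
from dataclasses import dataclass


@dataclass
class GoldenComparison:
    tools_match: bool                 # same tools, in the same order
    extra_steps: int                  # actual tool calls minus expected (negative = fewer)
    missing_answer_terms: list[str]   # required terms absent from the final answer

    @property
    def passed(self) -> bool:
        return self.tools_match and not self.missing_answer_terms


def compare_to_golden(golden: dict, actual_tools: list[str], final_answer: str) -> GoldenComparison:
    """Compare one agent run against a golden record."""
    expected_tools = golden["expected_tools"]
    answer_lower = final_answer.lower()
    missing = [
        term for term in golden["expected_final_answer_contains"]
        if term.lower() not in answer_lower
    ]
    return GoldenComparison(
        tools_match=(actual_tools == expected_tools),
        extra_steps=len(actual_tools) - len(expected_tools),
        missing_answer_terms=missing,
    )


golden = {
    "expected_tools": ["search_web", "search_web"],
    "expected_final_answer_contains": ["Sam Altman", "Stanford"],
}
result = compare_to_golden(
    golden,
    actual_tools=["search_web", "search_web"],
    final_answer="Sam Altman is the CEO of OpenAI. He attended Stanford University.",
)
print(result.passed)
```

Reporting `extra_steps` separately from `passed` keeps an inefficient but valid path from being counted as a failure, which softens the brittleness noted above.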

LLM-as-Judge for trajectories: Use a strong LLM to evaluate whether the trajectory was reasonable, even if it differed from the golden path.

judge_prompt = """
Evaluate this agent trajectory for a task. Rate each dimension 1-5:

Task: {task}
Trajectory: {trajectory}
Final Answer: {final_answer}

Dimensions to rate:
1. Tool selection accuracy: Were the right tools called?
2. Argument quality: Were tool arguments correct and reasonable?
3. Efficiency: Were there unnecessary or repeated steps?
4. Answer correctness: Is the final answer accurate?
5. Safety: Did the agent avoid any problematic actions?

Provide a score and brief justification for each.
"""

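Judge scores vary from run to run, so one way to consume them is to sample the judge several times and take the median per dimension. The sketch below assumes the prompt is extended to request a JSON object keyed by the dimension names in `DIMENSIONS`, and that `call_judge` is a hypothetical function wrapping whichever LLM client you use. Neither is part of the prompt above.

```python
import json
from statistics import median
from typing import Callable

# Assumed JSON keys, one per dimension in the judge prompt
DIMENSIONS = ["tool_selection", "argument_quality", "efficiency", "answer_correctness", "safety"]


def parse_judge_scores(raw: str) -> dict[str, int]:
    """Parse a judge reply such as {"efficiency": 4, ...} and validate the 1-5 range."""
    data = json.loads(raw)
    scores = {}
    for dim in DIMENSIONS:
        value = int(data[dim])
        if not 1 <= value <= 5:
            raise ValueError(f"{dim} score out of range: {value}")
        scores[dim] = value
    return scores


def judge_trajectory(call_judge: Callable[[str], str], prompt: str, runs: int = 3) -> dict[str, float]:
    """Query the judge `runs` times and return the per-dimension median score."""
    samples = [parse_judge_scores(call_judge(prompt)) for _ in range(runs)]
    return {dim: median([s[dim] for s in samples]) for dim in DIMENSIONS}
```

Using an odd number of runs keeps the median on the original 1-5 scale.
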
Automated step-level checks: For each step in the trajectory, run automated assertions:

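# Step, EvalResult, is_valid_tool_args and is_referenced_in_next_step
# are assumed to be defined elsewhere in your evaluation harness.
def check_trajectory(trajectory: list[Step]) -> EvalResult:
    issues = []

    for i, step in enumerate(trajectory):
        # Check: tool arguments pass validation (catches malformed or hallucinated args)
        if step.tool_call and not is_valid_tool_args(step.tool_call):
            issues.append(f"Step {i}: invalid tool arguments")

        # Check: no repeated identical tool calls. Steps without a tool call are
        # skipped, otherwise two reasoning-only steps would compare None == None.
        if i > 0 and step.tool_call and step.tool_call == trajectory[i-1].tool_call:
            issues.append(f"Step {i}: duplicate tool call")

        # Check: tool result was used in subsequent reasoning
        # (only meaningful when at least one later step exists)
        later_steps = trajectory[i+1:]
        if step.tool_result and later_steps and not is_referenced_in_next_step(step, later_steps):
            issues.append(f"Step {i}: tool result ignored")

    # Guard against an empty trajectory, and clamp at 0 because
    # a single step can raise more than one issue.
    score = max(0.0, 1.0 - len(issues) / len(trajectory)) if trajectory else 0.0
    return EvalResult(issues=issues, score=score)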

Key Metrics

Task-Level Metrics

| Metric | Definition | How to Measure |
| --- | --- | --- |
| Task completion rate | % of tasks that produce a valid final output | Pass/fail per task in evaluation set |
| Steps to completion | Number of LLM calls + tool calls per task | Count from trajectory |
| Cost per task | Total token cost for the task | Token counts × model pricing |
| Time to completion | Wall-clock time from start to final output | Timestamp delta |
| Safety violation rate | % of tasks where agent attempted a prohibited action | Audit log analysis |
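
As a rough sketch, the task-level metrics in the table can be aggregated from per-task records like this. The `TaskRun` fields and both price constants are hypothetical. Substitute the rates your model provider actually charges.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per 1M tokens; replace with your provider's real rates
PRICE_PER_M_INPUT = 3.00
PRICE_PER_M_OUTPUT = 15.00


@dataclass
class TaskRun:
    completed: bool
    steps: int              # LLM calls + tool calls
    input_tokens: int
    output_tokens: int
    started_at: float       # seconds since epoch
    finished_at: float
    safety_violation: bool = False


def task_cost_usd(run: TaskRun) -> float:
    """Token counts x model pricing for a single task."""
    return (run.input_tokens * PRICE_PER_M_INPUT
            + run.output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000


def summarize(runs: list[TaskRun]) -> dict[str, float]:
    """Aggregate the five task-level metrics over an evaluation set."""
    if not runs:
        raise ValueError("need at least one task run")
    n = len(runs)
    return {
        "completion_rate": sum(r.completed for r in runs) / n,
        "avg_steps": sum(r.steps for r in runs) / n,
        "avg_cost_usd": sum(task_cost_usd(r) for r in runs) / n,
        "avg_seconds": sum(r.finished_at - r.started_at for r in runs) / n,
        "safety_violation_rate": sum(r.safety_violation for r in runs) / n,
    }
```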

Step-Level Metrics

| Metric | Definition |
| --- | --- |
| Tool call accuracy | % of tool calls with correct arguments |
| Tool selection accuracy | % of tool choices that were appropriate for the current reasoning step |
| Hallucination rate | % of tool arguments that were hallucinated (not grounded in context) |
| Retry rate | % of steps that required a retry after failure |
| Context utilization | Did the agent use the available context, or ignore it? |
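
The hallucination rate can be approximated by checking whether each string argument appears verbatim in the context the agent had when it made the call. The sketch below is a heuristic, and both function names are hypothetical. It suits identifiers such as file names, IDs, and email addresses. Free-form search queries are composed rather than copied, so they would be flagged unfairly.

```python
def ungrounded_args(tool_args: dict, context: str) -> list[str]:
    """Names of string arguments whose value never appears in the context."""
    context_lower = context.lower()
    return [
        name for name, value in tool_args.items()
        if isinstance(value, str) and value.lower() not in context_lower
    ]


def hallucination_rate(calls: list[tuple[dict, str]]) -> float:
    """Fraction of string arguments, across (tool_args, context) pairs, that are ungrounded."""
    total = 0
    flagged = 0
    for tool_args, context in calls:
        total += sum(isinstance(v, str) for v in tool_args.values())
        flagged += len(ungrounded_args(tool_args, context))
    return flagged / total if total else 0.0
```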

Quality Metrics

| Metric | Definition |
| --- | --- |
| Final answer accuracy | % of final answers that are factually correct |
| Answer completeness | Does the answer address all parts of the task? |
| Instruction following | Did the agent follow the format/style requirements? |
| Groundedness | Is the final answer grounded in tool results vs hallucinated? |

Evaluation Approaches

Simulated Environments

Run the agent against mock tools that return predefined responses. This allows deterministic, reproducible evaluation without real API calls or side effects.

# Mock tool that returns predefined responses based on query
class MockSearchTool:
    def __init__(self, fixtures: dict[str, str]):
        self.fixtures = fixtures  # query → response mapping

    def __call__(self, query: str) -> str:
        # Return the first fixture whose key appears in the query (dict order decides ties)
        for pattern, response in self.fixtures.items():
            if pattern.lower() in query.lower():
                return response
        return "No results found"

fixtures = {
    "OpenAI CEO": "Sam Altman is the CEO of OpenAI as of 2024",
    "Sam Altman education": "Sam Altman attended Stanford University and dropped out",
}
mock_search = MockSearchTool(fixtures)

Advantages: Fast, cheap, reproducible, no external dependencies

Limitations: Can't test the agent's behavior on novel inputs
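
Mock tools become more useful for trajectory checks when every call is recorded. A minimal sketch follows, assuming a hypothetical `RecordingTool` wrapper that can wrap a mock such as `MockSearchTool` or any other callable. A plain function stands in for the mock here so the example is self-contained.

```python
from typing import Callable


class RecordingTool:
    """Wraps a tool callable and records every call for later assertions."""

    def __init__(self, name: str, tool: Callable[[str], str]):
        self.name = name
        self.tool = tool
        self.calls: list[dict] = []

    def __call__(self, query: str) -> str:
        result = self.tool(query)
        self.calls.append({"tool": self.name, "query": query, "result": result})
        return result


def fake_search(query: str) -> str:
    # Stand-in for a fixture-backed mock tool
    if "openai ceo" in query.lower():
        return "Sam Altman is the CEO of OpenAI as of 2024"
    return "No results found"


search = RecordingTool("search_web", fake_search)
search("OpenAI CEO 2024")
print(len(search.calls), search.calls[0]["query"])
```

After an agent run, `search.calls` gives the exact sequence of queries and results, which can be fed to step-level checks or compared with a golden trajectory.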

Real Environment Testing

Run the agent against real tools in a staging environment. This captures real-world behavior but is slower, more expensive, and can cause real side effects.

Use real environment testing for:

  • Pre-production validation
  • Edge case testing for known failure scenarios
  • Cost and latency benchmarking

A/B Evaluation

Compare two agent versions (or two model versions) on the same task set. Measures relative improvement rather than absolute quality.

# evaluate_agent and compare_metrics stand in for your own evaluation-harness functions
results_v1 = evaluate_agent(agent_v1, test_tasks)
results_v2 = evaluate_agent(agent_v2, test_tasks)

# Compare on key metrics
compare_metrics(results_v1, results_v2, ["completion_rate", "steps_to_completion", "cost_per_task"])
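
Because both versions run on the same tasks, results can be compared pairwise rather than only by overall averages. The sketch below is one possible comparison, not the `compare_metrics` helper referenced above. `compare_paired` and the per-task dictionary keys are assumptions, and both result lists are assumed to be in the same task order.

```python
from statistics import mean


def compare_paired(results_a: list[dict], results_b: list[dict], metrics: list[str]) -> dict[str, dict]:
    """Per-metric means for both versions plus the mean per-task difference (B - A)."""
    if len(results_a) != len(results_b):
        raise ValueError("both versions must be evaluated on the same task set")
    report = {}
    for metric in metrics:
        a = [r[metric] for r in results_a]
        b = [r[metric] for r in results_b]
        report[metric] = {
            "mean_a": mean(a),
            "mean_b": mean(b),
            "mean_paired_diff": mean([y - x for x, y in zip(a, b)]),
        }
    return report


v1 = [{"steps": 4, "completed": 1}, {"steps": 6, "completed": 0}]
v2 = [{"steps": 3, "completed": 1}, {"steps": 5, "completed": 1}]
report = compare_paired(v1, v2, ["steps", "completed"])
print(report["steps"]["mean_paired_diff"])
```

A negative difference on `steps` means version B needed fewer steps on average. With small task sets, treat such differences as indicative rather than conclusive.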

Human Evaluation

For tasks where automated metrics are insufficient (subjective quality, nuanced correctness), have humans rate agent outputs.

Use a structured rubric:

  • Is the answer factually correct?
  • Is it complete?
  • Is the reasoning sound?
  • Would you trust this output in production?

Observability Stack

Observability means you can understand what the system did at any point in time. For agentic systems, this requires tracing every LLM call, tool call, and state transition.

What to Instrument

Every significant event should be logged with enough context to reconstruct what happened:

from dataclasses import dataclass

@dataclass
class AgentEvent:
    event_type: str        # "llm_call", "tool_call", "state_update", "hitl_trigger", "error"
    task_id: str
    agent_id: str
    step: int
    timestamp: str
    
    # For LLM calls
    model: str | None = None
    input_tokens: int | None = None
    output_tokens: int | None = None
    cost_usd: float | None = None
    
    # For tool calls
    tool_name: str | None = None
    tool_args: dict | None = None
    tool_result: dict | None = None
    tool_latency_ms: int | None = None
    
    # For errors
    error_type: str | None = None
    error_message: str | None = None
    stack_trace: str | None = None

Observability Tools

| Tool | What It Does | Best For |
| --- | --- | --- |
| LangSmith | Full LLM call + chain tracing, built for LangChain/LangGraph | LangChain/LangGraph projects |
| Langfuse | Open-source LLM observability, works with any framework | Framework-agnostic, self-hosted option |
| Weights & Biases (W&B) | ML experiment tracking + LLM tracing | Teams already using W&B for ML |
| OpenTelemetry + Jaeger | General distributed tracing, framework-agnostic | Custom instrumentations, microservices |

Setting Up LangSmith Tracing

import os
from langsmith import traceable

# Prefer setting these in the deployment environment rather than in code,
# and never commit a real API key.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-api-key"
os.environ["LANGCHAIN_PROJECT"] = "agentic-system-prod"

# All LangChain/LangGraph calls are traced automatically once env vars are set
# For custom functions, use the @traceable decorator:
@traceable(name="research_step")
def run_research(query: str) -> dict:
    # ... your research logic goes here; this placeholder just echoes the query
    results = {"query": query, "findings": []}
    return results

Setting Up Langfuse

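from langfuse import Langfuse
# Import path as used by the v2 Python SDK; check the docs for your installed version
from langfuse.decorators import observe

langfuse = Langfuse(
    public_key="pk-...",
    secret_key="sk-...",
    host="https://cloud.langfuse.com"
)

@observe()  # traces this function and any nested @observe-decorated calls
def run_agent_task(goal: str) -> str:
    # `agent` is assumed to be defined elsewhere. LLM calls inside it are captured
    # when they go through a Langfuse integration or another @observe-decorated function.
    return agent.run(goal)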

Dashboards and Alerts

Set up monitoring dashboards that track:

  • Task completion rate over time
  • Average steps per task (efficiency)
  • Error rate by agent and error type
  • Cost per task (total and by model)
  • P50 / P95 / P99 task duration

Set alerts for:

  • Task completion rate drops below 90%
  • Error rate spikes above 5%
  • Any safety violation
  • Daily cost exceeds budget threshold
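
The duration percentiles and alert thresholds listed above could be computed roughly as follows. The function names and alert labels are hypothetical, and the daily budget is whatever limit you choose. A real deployment would normally express these rules in its monitoring system instead of application code.

```python
from statistics import quantiles


def duration_percentiles(durations_s: list[float]) -> dict[str, float]:
    """P50 / P95 / P99 task duration; needs at least two observations."""
    cuts = quantiles(durations_s, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def check_alerts(completion_rate: float, error_rate: float, safety_violations: int,
                 daily_cost_usd: float, daily_budget_usd: float) -> list[str]:
    """Return the names of all alert conditions that currently hold."""
    alerts = []
    if completion_rate < 0.90:
        alerts.append("completion_rate_below_90pct")
    if error_rate > 0.05:
        alerts.append("error_rate_above_5pct")
    if safety_violations > 0:
        alerts.append("safety_violation")
    if daily_cost_usd > daily_budget_usd:
        alerts.append("daily_cost_over_budget")
    return alerts
```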

Debugging Agentic Systems

Debugging a multi-agent system is harder than debugging a single function because the failure may have happened several steps before the visible error.

Step 1: Find the First Wrong Step

Don't start from the error message - start from the beginning of the trajectory and find the first step where something went wrong.

Task: "Send a summary of the Q3 earnings report to the CFO"

Step 1: search("Q3 earnings report")    → [correct: found the report]
Step 2: read_file("Q3_earnings.pdf")    → [correct: extracted text]
Step 3: summarize(text)                 → [FIRST ERROR: summary missed key figures]
Step 4: get_email("CFO")                → [correct: found email, but summary was already wrong]
Step 5: send_email(summary, to=CFO)     → [symptom: wrong content sent]

The error is at step 3, not step 5. Fix the summarization step.

Step 2: Replay from Checkpoint

Frameworks that persist checkpoints, such as LangGraph, support replaying execution from a saved checkpoint. This lets you:

  1. Stop execution at the problematic step
  2. Modify the agent prompt or tool behavior
  3. Resume from that checkpoint to verify the fix
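
# LangGraph: replay from checkpoint
# "step-3" is a placeholder; real checkpoint IDs come from graph.get_state_history(...)
config = {"configurable": {"thread_id": "task-123", "checkpoint_id": "step-3"}}
# Passing None as the input resumes from the saved checkpoint instead of starting a new run
for event in graph.stream(None, config, stream_mode="values"):
    print(event)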

Step 3: Inspect State at Each Step

Check the full agent state (conversation history, tool results, intermediate outputs) at each step, not just the final state.

Step 4: Check Tool Call Arguments

One of the most common failure modes is a hallucinated tool argument. Log and verify every tool call's arguments, not just whether the tool succeeded.

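import json
import logging

logger = logging.getLogger(__name__)

# Log exactly what was sent to the tool (default=str keeps non-JSON values from raising)
logger.debug("Tool call: %s(%s)", tool_name, json.dumps(tool_args, indent=2, default=str))
logger.debug("Tool result: %s", json.dumps(tool_result, indent=2, default=str))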

Step 5: Trace Context Usage

Did the agent actually use the information it was given? Check whether the final answer references facts from the tool results, or whether it was generated from model weights (hallucinated).
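
A cheap first pass at this check is to look for numbers in the final answer that appear in none of the tool results. The sketch below is only a heuristic under that narrow assumption, and `ungrounded_numbers` is a hypothetical name. It says nothing about non-numeric claims and will miss paraphrased figures.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def ungrounded_numbers(answer: str, tool_results: list[str]) -> list[str]:
    """Numbers that appear in the answer but in none of the tool results."""
    source_numbers = set(NUMBER.findall(" ".join(tool_results)))
    return [n for n in NUMBER.findall(answer) if n not in source_numbers]


flagged = ungrounded_numbers(
    "Revenue was 4.2 million, up 12 percent.",
    ["Revenue: 4.2 million"],
)
print(flagged)
```

Anything flagged is a candidate for manual review or for an LLM-as-judge groundedness check, not proof of hallucination.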


Failure Mode Taxonomy

| Failure Mode | Symptom | Root Cause | Fix |
| --- | --- | --- | --- |
| Hallucinated tool args | Tool returns unexpected result | Agent invents arguments not grounded in context | Improve tool descriptions; validate args before calling |
| Wrong tool selected | Task fails or takes too long | Agent misunderstands when to use each tool | Clarify tool descriptions; add routing logic |
| Infinite retry loop | Task runs forever; cost spikes | No max_iterations; evaluator always rejects | Set max_iterations; improve evaluator rubric |
| Context window exhaustion | Agent loses track of goal | History too long | Add summarization; reduce verbose tool outputs |
| Cascade failure | Downstream agents produce garbage | Upstream error neither caught nor explicitly propagated | Validate each agent's output; propagate errors explicitly |
| Context drift | Agent contradicts its earlier work | Goal lost in long conversation | Pin goal in system prompt; use task ledger |
| HITL timeout | Task stuck | Human didn't respond; no timeout handler | Define timeout policy; add escalation path |
| Prompt injection | Agent takes unauthorized actions | Malicious content in external data | Input sanitization; role separation; action validation |
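
The max_iterations fix for the infinite retry loop can be as small as a bounded loop that raises instead of spinning. In this sketch `step_fn` is a hypothetical callable that performs one agent iteration and returns True once the task is done. The default limit of 10 is arbitrary.

```python
from typing import Callable


class MaxIterationsExceeded(RuntimeError):
    """Raised when the agent loop hits its iteration cap without finishing."""


def run_with_limit(step_fn: Callable[[int], bool], max_iterations: int = 10) -> int:
    """Call step_fn until it reports completion; return the number of iterations used."""
    for iteration in range(1, max_iterations + 1):
        if step_fn(iteration):
            return iteration
    raise MaxIterationsExceeded(f"no result after {max_iterations} iterations")
```

Raising a dedicated exception makes the failure visible to callers and to error-rate dashboards, rather than letting the task end silently with a partial result.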

Study Notes

  • Evaluate before you optimize. Build your evaluation harness before tuning prompts or adding complexity. Without measurement, you can't tell if changes improve or degrade behavior.
  • Trajectory evaluation is the most important investment in agent eval. Final answer quality is easy to measure and easy to game; trajectory quality is what actually reflects system health.
  • LangSmith or Langfuse from day one. Retrofitting instrumentation onto an existing agent is much harder than tracing from the first day of development.
  • Simulated environments enable fast iteration. Running against mock tools lets you test thousands of scenarios quickly and cheaply. Build a good test fixture library.
  • Debugging: always find the first wrong step. Symptoms appear downstream; root causes are upstream. Don't fix the symptom.
  • The hardest failure mode to catch is context drift. The agent's final answer looks reasonable; only comparison against the original task reveals that it answered a slightly different question than the one asked.

Q&A Review Bank

Q1: Why is trajectory evaluation more important than final-answer evaluation for agentic systems? [Medium]
A: A final answer alone tells you almost nothing about system health. An agent can produce the right answer via a completely wrong path - one that took 3× longer, cost 5× more, and would fail on any variation of the input. Conversely, an agent can follow a perfect path and still get a wrong answer due to an out-of-date knowledge source. Trajectory evaluation - assessing tool selection, argument quality, step count, and absence of loops - reveals the real system behavior. It's also essential for catching context drift: the final answer looks reasonable, but only comparing the trajectory against the original task reveals that the agent answered a slightly different question than the one asked.

Q2: What are the three categories of metrics for evaluating agentic systems? [Easy]
A: Task-level metrics (completion rate, steps to completion, cost per task, time to completion, safety violation rate - these measure the overall task outcome), Step-level metrics (tool call accuracy, tool selection accuracy, hallucination rate in tool arguments, retry rate - these measure the quality of individual decisions within a trajectory), and Quality metrics (final answer accuracy, answer completeness, instruction following, groundedness - these measure the value of the output to the user). All three categories are needed: task-level metrics can mask step-level inefficiency, and quality metrics can mask trajectory inefficiency.

Q3: What is LLM-as-judge for trajectory evaluation and what is its key limitation? [Medium]
A: LLM-as-judge uses a strong LLM (the judge) to evaluate a trajectory on defined dimensions - tool selection accuracy, argument quality, efficiency, answer correctness, and safety - producing a score and justification for each. It's used instead of golden trajectory comparison because there are usually multiple valid paths to the same answer, and golden trajectories are brittle. The key limitation is that LLM judges introduce their own biases and are inconsistent: the same trajectory may receive different scores on different runs (due to temperature), and the judge may share blind spots with the generating model. Calibrate LLM judges by comparing their scores to human annotators on a subset of trajectories before relying on them.

Q4: Why are simulated environments preferred for most agentic evaluation runs? [Medium]
A: Simulated environments use mock tools with predefined responses, enabling fast (no API latency), cheap (no API costs), reproducible (same response every run), and side-effect-free (no emails sent, no records written) evaluation. This allows running thousands of test scenarios in minutes. Real environment testing is reserved for pre-production validation, edge case testing, and cost/latency benchmarking - situations where the real behavior of external services matters. The key to making simulated evaluation valuable is building comprehensive fixture libraries that cover both happy-path and failure-mode scenarios.

Q5: What is context drift, why is it the hardest failure mode to catch, and how do you prevent it? [Hard]
A: Context drift occurs when an agent's accumulated conversation history grows so long that later reasoning starts to contradict earlier reasoning, and the agent gradually loses track of the original goal - answering a subtly different question than the one it was asked. It's the hardest to catch because the final answer looks reasonable in isolation; only comparison against the original task reveals the drift. Automated metrics won't flag it; trajectory evaluation is required. Prevention: pin the goal statement in a fixed position in every prompt (system message or beginning of each user turn), maintain a task ledger as the external source of truth for what remains to be done, and use context summarization that preserves the original goal explicitly.

Q6: Describe the five-step debugging process for a multi-agent system. [Hard]
A: Step 1 - Find the first wrong step: don't start from the error message, start from the beginning of the trajectory and find the earliest step where something went wrong; symptoms appear downstream of root causes. Step 2 - Replay from checkpoint: most frameworks (LangGraph, etc.) support replaying execution from a saved state, letting you modify the agent and resume from the problem point without rerunning the whole task. Step 3 - Inspect state at each step: examine the full agent state (conversation history, tool results, intermediate outputs) at the problem step, not just the final state. Step 4 - Check tool call arguments: the most common failure is hallucinated arguments - log and verify every call's exact input, not just whether the tool returned a success code. Step 5 - Trace context usage: verify whether the final answer actually references facts from tool results, or whether it was generated from model weights (hallucinated context).