Agent evaluation
Learn why standard LLM evals fail for agents, how to score agent trajectories step-by-step, and how to build an eval harness that catches regressions before they reach production.
TL;DR
- Standard LLM evals check input-to-output correctness. Agent evals must also check the trajectory: every step taken, every tool called, every argument passed.
- An agent can produce a correct final answer via a completely wrong trajectory (lucky path). Standard evals pass it. Trajectory scoring catches it.
- Task completion rate alone is insufficient. Track efficiency (steps to completion), safety (no harmful actions), cost (tokens per success), and reliability (success rate across 100 repeated runs).
- Golden traces are your ground truth: 10-20 human-verified successful runs stored as (input, expected_trajectory, expected_output) triples.
- LLM-as-judge scoring lets a stronger model evaluate whether each step's tool call and arguments were correct given the preceding context. This scales where human evaluation cannot.
- A drop of 5% in success rate or 10% in cost efficiency between eval runs is a concrete starting threshold for regression alerts.
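As a rough sketch of how the thresholds in the last bullet might become an automated check: the `EvalSummary` type, its field names, and the `find_regressions` helper are hypothetical. The article does not say whether the thresholds are absolute or relative, so this sketch assumes the success-rate threshold is in absolute percentage points and the cost-efficiency threshold is relative to the baseline.

```python
from dataclasses import dataclass


@dataclass
class EvalSummary:
    """Aggregate results of one eval run (hypothetical shape)."""
    runs: int          # total agent runs in this eval
    successes: int     # runs that completed the task
    total_tokens: int  # tokens spent across all runs, successful or not

    @property
    def success_rate(self) -> float:
        return self.successes / self.runs if self.runs else 0.0

    @property
    def successes_per_token(self) -> float:
        # Cost efficiency: higher is better (the inverse of tokens per success).
        return self.successes / self.total_tokens if self.total_tokens else 0.0


def find_regressions(
    baseline: EvalSummary,
    candidate: EvalSummary,
    max_success_drop: float = 0.05,     # assumed: absolute drop, i.e. 5 points
    max_efficiency_drop: float = 0.10,  # assumed: relative to the baseline
) -> list[str]:
    """Return one alert message per threshold the candidate run violates."""
    alerts = []

    success_drop = baseline.success_rate - candidate.success_rate
    if success_drop > max_success_drop:
        alerts.append(f"success rate dropped {success_drop * 100:.1f} points")

    if baseline.successes_per_token > 0:
        ratio = candidate.successes_per_token / baseline.successes_per_token
        efficiency_drop = 1.0 - ratio
        if efficiency_drop > max_efficiency_drop:
            alerts.append(f"cost efficiency dropped {efficiency_drop:.0%}")

    return alerts
```

An empty list means the candidate run stays within both thresholds. In CI, a non-empty list could fail the build or page whoever owns the agent.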
The problem it solves
Your agent passed all your LLM evals, so you promoted it to production. Within 48 hours, users report that it completes tasks correctly but sometimes takes 20 steps to do what should take 4. Your eval measured output quality. It measured nothing about the path taken to get there.
Standard LLM evals miss two failure modes completely:
The lucky path: the agent produces the correct final answer by calling the wrong tool, getting a useful side effect, and stumbling into the right answer. Repeat the run and it fails. Your eval passes it because it only saw the output.
The silent inefficiency: the agent solves the task but uses 3x more tokens than the golden path because it issues redundant tool calls and makes round trips through tool calls that cancel each other out. Your eval passes it because correct output is correct output.
The cost of missing these failures in production is real: wasted compute budget, degraded user experience from slow responses, and safety incidents from agents taking harmful intermediate steps that happen to produce a "correct" final state.
What is it?
Agent evaluation is the systematic process of scoring agent behavior across the full trajectory of a run, not just at the final output. It measures correctness, efficiency, safety, reliability, and cost to give a multi-dimensional view of agent quality.
Think of it like a driving test. The examiner does not just check whether you reached the destination. They score every turn signal, every speed check, every mirror check along the way. You can fail a driving test even if you arrive safely. The same principle applies to agents: how you get there matters as much as whether you get there.
How it works
Golden trace construction
The foundation of agent evaluation is the golden trace library. A golden trace is a human-verified successful run stored as a triple: the input that triggered the run, the expected trajectory (ordered list of tool calls with arguments), and the expected output.
To build golden traces, run your agent manually on 10-20 representative tasks. For each run that succeeds, record exactly which tools were called, in what order, with what arguments, and which observations they produced. A human engineer reviews each run and marks it as valid. These become your ground truth.
from dataclasses import dataclass, field


@dataclass
class TrajectoryStep:
    step_number: int
    tool_name: str
    tool_args: dict
    expected_observation_type: str  # "success" | "list" | "text" | "error"
    is_required: bool  # False = the step is optional (non-deterministic path)


@dataclass
class GoldenTrace:
    trace_id: str
    description: str
    input_message: str
    expected_trajectory: list[TrajectoryStep]
    expected_output_contains: list[str]  # substrings expected in final output
    expected_step_count_max: int  # reject if agent uses more steps
    tags: list[str] = field(default_factory=list)  # ["billing", "account", "read-only"]


# Example golden trace for a support ticket lookup task
example_trace = GoldenTrace(
    trace_id="gt_001",
    description="Look up and summarize recent tickets for a customer account",
    input_message="Show me the last 3 support tickets for account ACC-4821",
    expected_trajectory=[
        TrajectoryStep(1, "search_tickets", {"account_id": "ACC-4821", "limit": 3}, "list", True),
        TrajectoryStep(2, "get_ticket_detail", {"ticket_id": "TKT-*"}, "text", False),
    ],
    expected_output_contains=["ACC-4821", "ticket"],
    expected_step_count_max=5,
    tags=["account-lookup", "read-only"],
)
Golden traces do not need to define every possible valid path. Mark optional steps with is_required=False to accommodate non-deterministic paths where the agent may take a slightly different route and still succeed. The scorer handles this by checking required steps first and scoring optional steps as bonuses.
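One way the required-first check could look is sketched below. This is a minimal illustration, not the article's scorer: `ExpectedStep` is a trimmed-down stand-in for `TrajectoryStep`, the agent's actual steps are assumed to arrive as `(tool_name, args)` tuples, and treating `*` in string arguments as a wildcard (via `fnmatchcase`) is an assumption based on the `"TKT-*"` value in the example trace.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass
class ExpectedStep:
    """Trimmed-down stand-in for TrajectoryStep (hypothetical)."""
    tool_name: str
    tool_args: dict
    is_required: bool


def args_match(expected: dict, actual: dict) -> bool:
    """Every expected argument must be present; string values may use * wildcards."""
    for key, want in expected.items():
        if key not in actual:
            return False
        got = actual[key]
        if isinstance(want, str) and isinstance(got, str):
            if not fnmatchcase(got, want):
                return False
        elif got != want:
            return False
    return True


def step_matches(expected: ExpectedStep, actual: tuple[str, dict]) -> bool:
    name, args = actual
    return name == expected.tool_name and args_match(expected.tool_args, args)


def check_trajectory(expected: list[ExpectedStep],
                     actual: list[tuple[str, dict]],
                     max_steps: int) -> dict:
    # Pass 1: required steps must appear in the actual run, in order.
    cursor = 0
    missing = []
    for step in (s for s in expected if s.is_required):
        hit = next((i for i in range(cursor, len(actual))
                    if step_matches(step, actual[i])), None)
        if hit is None:
            missing.append(step.tool_name)
        else:
            cursor = hit + 1

    # Pass 2: optional steps earn a bonus if they appear anywhere in the run.
    bonus = sum(1 for step in expected
                if not step.is_required
                and any(step_matches(step, a) for a in actual))

    return {
        "passed": not missing and len(actual) <= max_steps,
        "missing_required": missing,
        "optional_matched": bonus,
    }
```

Under these assumptions, a run that calls `search_tickets` with the expected arguments passes with or without the follow-up `get_ticket_detail` call. Making that call with any `TKT-` prefixed ID only raises `optional_matched`.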
Trajectory scoring
The trajectory score combines per-step scores into a single 0.0-1.0 quality metric. Each step gets a score based on whether the right tool was called with the right arguments.
step_score = 1.0 if tool_name correct AND args semantically correct
step_score = 0.5 if tool_name correct BUT args partially wrong or suboptimal
step_score = 0.0 if tool_name wrong OR args critically wrong OR action unsafe
trajectory_score = sum(step_score_i * step_weight_i) / sum(step_weight_i), summed over every step i
Step weights allow you to penalize critical steps (like write operations) more heavily than read steps. A wrong write call should drop the trajectory score further than a slightly wrong filter argument on a search call.
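The weighted average above can be sketched in a few lines. The function name and the example weights (a write step weighted 3x a read step) are illustrative assumptions, not values from the article.

```python
def trajectory_score(step_scores: list[float], step_weights: list[float]) -> float:
    """Weighted average of per-step scores, in the range 0.0 to 1.0."""
    if len(step_scores) != len(step_weights):
        raise ValueError("need exactly one weight per step score")
    total_weight = sum(step_weights)
    if total_weight <= 0:
        raise ValueError("step weights must sum to a positive number")
    weighted = sum(s * w for s, w in zip(step_scores, step_weights))
    return weighted / total_weight


# Three steps: a correct read, a partially wrong read, and a wrong write.
scores = [1.0, 0.5, 0.0]

# Equal weights: (1.0 + 0.5 + 0.0) / 3
unweighted = trajectory_score(scores, [1.0, 1.0, 1.0])

# Assumed weighting: the write step counts 3x, so (1.0 + 0.5 + 0.0) / 5
weighted = trajectory_score(scores, [1.0, 1.0, 3.0])
```

With equal weights the run scores 0.5. Weighting the failed write 3x pulls the same run down to 0.3, which is the effect the weighting is meant to have.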
LLM-as-judge for step correctness
Human evaluation does not scale past 100-200 eval runs per week. For continuous eval (running on every pull request or nightly), you need automated scoring. LLM-as-judge uses a stronger model (GPT-4o or Claude 3 Opus) to evaluate whether each agent step was correct given the preceding context.
The judge prompt is structured: it receives the input message, the context up to the current step (all prior observations), the golden trace step, and the agent's actual step. It returns a score (0.0, 0.5, or 1.0) and a one-sentence reason.
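A minimal sketch of that judge flow follows, under some assumptions: the prompt wording, the JSON reply format, and all function names here are illustrative, and `call_model` is a hypothetical callable that sends a prompt to whichever judge model you use and returns its text reply. No specific provider API is assumed.

```python
import json
from typing import Callable

ALLOWED_SCORES = {0.0, 0.5, 1.0}


def build_judge_prompt(input_message: str, prior_observations: list[str],
                       golden_step: dict, actual_step: dict) -> str:
    """Assemble the four inputs the judge receives into one prompt string."""
    context = "\n".join(
        f"{i}. {obs}" for i, obs in enumerate(prior_observations, 1)
    ) or "(none)"
    return (
        "You are grading one step of an AI agent's run.\n\n"
        f"User input:\n{input_message}\n\n"
        f"Observations so far:\n{context}\n\n"
        f"Golden trace step:\n{json.dumps(golden_step, sort_keys=True)}\n\n"
        f"Agent's actual step:\n{json.dumps(actual_step, sort_keys=True)}\n\n"
        'Reply with JSON only: {"score": 0.0 | 0.5 | 1.0, "reason": "<one sentence>"}'
    )


def parse_judge_reply(reply: str) -> tuple[float, str]:
    """Validate the judge's reply so a malformed score never enters the metrics."""
    data = json.loads(reply)
    score = float(data["score"])
    if score not in ALLOWED_SCORES:
        raise ValueError(f"judge returned unsupported score: {score}")
    return score, str(data["reason"])


def judge_step(call_model: Callable[[str], str], input_message: str,
               prior_observations: list[str], golden_step: dict,
               actual_step: dict) -> tuple[float, str]:
    prompt = build_judge_prompt(input_message, prior_observations,
                                golden_step, actual_step)
    return parse_judge_reply(call_model(prompt))
```

Passing the model call in as a function keeps the harness testable. In unit tests, `call_model` can be a stub that returns a canned JSON string, so the prompt building and reply parsing are exercised without any network calls.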