Workflow evals with mocked tools
Test agent behavior end-to-end by replacing real tools with deterministic mocks, making agent evaluation reproducible, fast, and independent of external services.
TL;DR
- Workflow evals with mocked tools replace real tool implementations (APIs, databases, file systems) with deterministic mock functions that return scripted responses, making agent evaluation reproducible, fast, and free of external dependencies.
- Evaluate agent behavior, not just final output: did it call the right tools, in the right order, with the right arguments? Tool call sequences are the observable signal that separates a working agent from a lucky one.
- Production teams using mocked workflow evals catch 60-80% of prompt regressions in CI before they reach users, at zero external API cost per test run.
- Mock fidelity matters: overly simplified mocks produce passing evals that fail in production. Snapshot-record real tool responses during development, then replay them as mocks for evaluation.
- The core risk is mock drift: real APIs evolve but mocks stay static. Schedule periodic mock refresh (weekly or per-release) to keep evals meaningful.
- Limitation: mocked evals test agent logic and tool orchestration, not tool reliability or real-world latency. Complement with a small set of live integration tests for end-to-end coverage.
The Problem It Solves
Your travel-booking agent chains three tools: search flights, check seat availability, and reserve a seat. You update the system prompt to improve how the agent handles sold-out flights. You deploy. Within an hour, users report that the agent is booking flights without checking availability first, skipping step 2 entirely.
You had no eval that caught this. Your unit tests verified that each tool function works correctly in isolation. Your integration tests hit the real API, but they're slow (8 seconds per run), expensive ($0.02 per flight search API call), and flaky (the API returns different results at different times of day). So you ran them once manually, eyeballed the output, and shipped.
This is the testing gap that kills agent reliability. Individual tool tests tell you the tools work. But they don't tell you whether the agent calls the right tools in the right order with the right arguments. The agent's orchestration logic lives in the prompt and the LLM's reasoning, and that logic changes every time you update the prompt, swap the model, or adjust the temperature. Without workflow-level evals, every prompt change is a coin flip.
What Is It?
Workflow evals with mocked tools test an agent's end-to-end behavior by replacing real tool implementations with deterministic mock functions that return pre-scripted responses. The eval checks whether the agent calls the correct tools in the correct order with the correct arguments, and whether it produces the expected final output given those mock responses.
Think of it as a flight simulator for pilots. A real airplane is expensive, dangerous, and unpredictable (weather, traffic, mechanical issues). A flight simulator gives the pilot the exact same cockpit interface but with scripted scenarios: engine failure at 10,000 feet, crosswind landing, instrument failure. The pilot's responses are graded. The simulator doesn't test whether the airplane works. It tests whether the pilot makes the right decisions with the controls available.
How It Works
The mock tool layer
Every tool in your agent gets two implementations: a real one that calls external services, and a mock that returns deterministic responses. The agent code doesn't know the difference. It calls the same interface either way.
class ToolRegistry:
    def __init__(self, mode: str = "real", mocks=None, real_tools=None):
        self.mode = mode
        self.mocks = mocks or {}            # tool name -> mock function
        self.real_tools = real_tools or {}  # tool name -> real function
        self.call_log = []  # Track every tool call

    def call(self, tool_name: str, **kwargs) -> dict:
        # Log the call before dispatching, so failed calls are recorded too
        self.call_log.append({"tool": tool_name, "args": kwargs})
        if self.mode == "mock":
            return self.mocks[tool_name](**kwargs)
        return self.real_tools[tool_name](**kwargs)

# Mock: deterministic, instant, free
def mock_search_flights(origin, dest, date):
    return {"flights": [
        {"id": "FL123", "price": 299, "seats": 12},
        {"id": "FL456", "price": 189, "seats": 0},
    ]}

# Real: non-deterministic, slow, costs money
# (flight_api stands for your real API client, defined elsewhere)
def real_search_flights(origin, dest, date):
    return flight_api.search(origin=origin, dest=dest, date=date)
The call_log is the key innovation. It records every tool call the agent makes, with the exact arguments. After the eval runs, you compare this log against the expected tool sequence. I've found this call log to be more valuable than the final output for diagnosing regressions, because it shows you exactly where the agent's reasoning diverged from the expected path.
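To make the comparison concrete, here is a minimal sketch of locating the point where a run diverges from the expected path. MiniRegistry and first_divergence are hypothetical names introduced for illustration, not part of any library, and a scripted pair of calls stands in for a real agent run.

```python
# Hypothetical stand-in for the registry above, reduced to mock mode only.
class MiniRegistry:
    def __init__(self, mocks):
        self.mocks = mocks
        self.call_log = []

    def call(self, tool_name, **kwargs):
        self.call_log.append({"tool": tool_name, "args": kwargs})
        return self.mocks[tool_name](**kwargs)


def first_divergence(call_log, expected):
    """Return the index of the first call that differs from the expected
    sequence, or None if the two sequences match exactly."""
    for i, (actual, want) in enumerate(zip(call_log, expected)):
        if actual != want:
            return i
    if len(call_log) != len(expected):
        return min(len(call_log), len(expected))
    return None


registry = MiniRegistry({
    "search_flights": lambda origin, dest, date: {"flights": [{"id": "FL123", "seats": 12}]},
    "reserve_seat": lambda flight_id: {"status": "confirmed"},
})

# Simulate an agent that skips the availability check.
registry.call("search_flights", origin="SFO", dest="JFK", date="2026-03-15")
registry.call("reserve_seat", flight_id="FL123")

expected = [
    {"tool": "search_flights", "args": {"origin": "SFO", "dest": "JFK", "date": "2026-03-15"}},
    {"tool": "check_availability", "args": {"flight_id": "FL123"}},
    {"tool": "reserve_seat", "args": {"flight_id": "FL123"}},
]
divergence = first_divergence(registry.call_log, expected)
```

Here the divergence index points at the second call, where the agent reserved a seat instead of checking availability. That is the kind of regression described in the opening example.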
Defining eval scenarios
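An eval scenario is a triple: (input prompt, expected tool call sequence, expected output). The sequence is ordered. For agents where tool order matters (most production agents), the eval asserts that tools were called in the right sequence with the right arguments. The scenarios below also carry two optional fields: forbidden_tools, which lists tools the agent must never call in that scenario, and mock_overrides, which swaps in scenario-specific mock responses. An expected tool entry with no args key matches a call with any arguments.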
eval_scenarios = [
    {
        "name": "book_cheapest_available_flight",
        "input": "Book the cheapest available flight from SFO to JFK on March 15",
        "expected_tools": [
            {"tool": "search_flights", "args": {"origin": "SFO", "dest": "JFK", "date": "2026-03-15"}},
            {"tool": "check_availability", "args": {"flight_id": "FL456"}},
            {"tool": "check_availability", "args": {"flight_id": "FL123"}},
            {"tool": "reserve_seat", "args": {"flight_id": "FL123"}},
        ],
        "forbidden_tools": ["cancel_reservation", "refund_payment"],
        "expected_output_contains": ["FL123", "confirmed", "299"],
    },
    {
        "name": "handle_no_availability",
        "input": "Book any flight from SFO to JFK on March 15",
        "mock_overrides": {
            "search_flights": lambda **kw: {"flights": [
                {"id": "FL789", "price": 399, "seats": 0},
            ]}
        },
        "expected_tools": [
            {"tool": "search_flights"},
            {"tool": "check_availability", "args": {"flight_id": "FL789"}},
        ],
        "forbidden_tools": ["reserve_seat"],
        "expected_output_contains": ["no available", "sold out"],
    },
]
Notice the second scenario uses mock_overrides to simulate all flights sold out. This is how you test edge cases without waiting for them to occur naturally. You script the tool responses to create the exact conditions you want to test.
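One way the overrides might be applied is to layer them over a shared table of default mocks just before each scenario runs. The sketch below assumes that design. build_mocks is a hypothetical helper, and the response shape of check_availability is invented for illustration.

```python
def build_mocks(base_mocks, scenario):
    """Return the mock table for one scenario: base mocks with any
    per-scenario overrides layered on top. The base table is not mutated."""
    return {**base_mocks, **scenario.get("mock_overrides", {})}


base_mocks = {
    "search_flights": lambda **kw: {"flights": [
        {"id": "FL123", "price": 299, "seats": 12},
        {"id": "FL456", "price": 189, "seats": 0},
    ]},
    # Assumed response shape; the real tool's schema may differ.
    "check_availability": lambda flight_id: {
        "flight_id": flight_id,
        "available": flight_id == "FL123",
    },
}

sold_out = {
    "name": "handle_no_availability",
    "mock_overrides": {
        "search_flights": lambda **kw: {"flights": [
            {"id": "FL789", "price": 399, "seats": 0},
        ]},
    },
}

mocks = build_mocks(base_mocks, sold_out)
overridden = mocks["search_flights"](origin="SFO", dest="JFK", date="2026-03-15")
default = base_mocks["search_flights"](origin="SFO", dest="JFK", date="2026-03-15")
```

Building a fresh table per scenario keeps overrides from leaking into the next scenario, so tools without an override still fall back to the shared defaults.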
Snapshot recording for high-fidelity mocks
The hardest part of mocking is making mocks realistic enough that the agent behaves normally. If your mock returns {"result": "ok"} for a flight search, the agent won't have flight IDs, prices, or seat counts to reason about. It's testing a fiction.
The solution: record real tool responses during development, then replay them as mocks. This is the same snapshot testing approach used in frontend testing (Jest snapshots) and HTTP recording (VCR/Polly.js), applied to agent tools.
I record snapshots whenever I'm developing a new agent workflow or adding a new tool. One recording session produces mock data for dozens of eval scenarios. The snapshots go into version control alongside the eval definitions.
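A record-and-replay layer can be sketched in a few lines. This is not how VCR or Polly.js work internally; it is a simplified version keyed on tool name plus arguments. The names record, replay, and fake_real_search are hypothetical, and the stand-in function replaces a real API client so the sketch runs offline.

```python
import json
import tempfile
from pathlib import Path


def _key(tool_name, kwargs):
    # Stable key: tool name plus arguments serialized in sorted order.
    return f"{tool_name}:{json.dumps(kwargs, sort_keys=True)}"


def record(tool_name, real_fn, store):
    """Wrap a real tool so that every response is saved into `store`."""
    def wrapper(**kwargs):
        response = real_fn(**kwargs)
        store[_key(tool_name, kwargs)] = response
        return response
    return wrapper


def replay(tool_name, store):
    """Build a mock that returns the recorded response for matching arguments."""
    def mock(**kwargs):
        key = _key(tool_name, kwargs)
        if key not in store:
            raise KeyError(f"no snapshot recorded for {key}")
        return store[key]
    return mock


# Stand-in for a real API client, so the sketch runs offline.
def fake_real_search(origin, dest, date):
    return {"flights": [{"id": "FL123", "price": 299, "seats": 12}]}


# Recording session: call the wrapped tool, then persist the snapshots.
snapshots = {}
recording_search = record("search_flights", fake_real_search, snapshots)
recording_search(origin="SFO", dest="JFK", date="2026-03-15")

# A temp directory is used here; in practice the file would live in the repo.
snapshot_path = Path(tempfile.gettempdir()) / "flight_snapshots.json"
snapshot_path.write_text(json.dumps(snapshots, indent=2))

# Eval run: load the file and replay it as a mock.
loaded = json.loads(snapshot_path.read_text())
mock_search = replay("search_flights", loaded)
replayed = mock_search(origin="SFO", dest="JFK", date="2026-03-15")
```

Raising on a missing snapshot, rather than returning a default, makes it obvious when the agent calls a tool with arguments that were never recorded.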
Grading: behavior, not just output
The eval grading engine checks three dimensions. First, tool call accuracy: did the agent call the expected tools? Second, argument accuracy: were the arguments correct? Third, output quality: did the final response match expectations?
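The three dimensions could be scored roughly as follows. This is a sketch under stated assumptions, not the article's grading engine: grade is a hypothetical function, it uses strict ordered matching, and it treats expected_output_contains as "all substrings must appear", which a team might reasonably relax to "any".

```python
def grade(scenario, call_log, output):
    """Score one run on three dimensions; each field is True or False."""
    expected = scenario["expected_tools"]
    called = [c["tool"] for c in call_log]

    # 1. Tool call accuracy: same tools, same order, nothing forbidden.
    tools_ok = (
        called == [e["tool"] for e in expected]
        and not set(called) & set(scenario.get("forbidden_tools", []))
    )

    # 2. Argument accuracy: where the scenario pins args, they must match.
    #    Expected entries without an "args" key accept any arguments.
    args_ok = len(call_log) == len(expected) and all(
        "args" not in e or c["args"] == e["args"]
        for c, e in zip(call_log, expected)
    )

    # 3. Output quality: every expected substring appears (case-insensitive).
    text = output.lower()
    output_ok = all(
        s.lower() in text for s in scenario.get("expected_output_contains", [])
    )

    return {"tools": tools_ok, "args": args_ok, "output": output_ok}


scenario = {
    "expected_tools": [
        {"tool": "search_flights"},
        {"tool": "check_availability", "args": {"flight_id": "FL789"}},
    ],
    "forbidden_tools": ["reserve_seat"],
    "expected_output_contains": ["sold out"],
}
good_log = [
    {"tool": "search_flights", "args": {"origin": "SFO", "dest": "JFK", "date": "2026-03-15"}},
    {"tool": "check_availability", "args": {"flight_id": "FL789"}},
]
# A run that goes on to reserve a seat on a sold-out flight.
bad_log = good_log + [{"tool": "reserve_seat", "args": {"flight_id": "FL789"}}]

good = grade(scenario, good_log, "Sorry, FL789 is sold out.")
bad = grade(scenario, bad_log, "Your seat on FL789 is confirmed.")
```

Reporting the three dimensions separately, instead of a single pass/fail flag, shows whether a failure came from orchestration, arguments, or the wording of the final response.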