Action caching and replay
Cache deterministic tool call results so repeated agent runs skip redundant API calls, saving cost, reducing latency, and enabling offline debugging via replay.
TL;DR
- Action caching stores tool call results keyed by
(tool_name, hash(normalized_args))and returns cached results on identical future calls, skipping redundant API calls, database queries, and web requests. - Production agents executing retry loops or multi-run workflows see 40-70% reduction in API costs and 10-100x faster replays when caching is enabled.
- Replay mode re-runs an entire agent session from recorded logs using cached results instead of live tool calls, enabling deterministic debugging, regression testing, and offline evaluation.
- Cache key design is the hardest part: normalize arguments (sort keys, strip timestamps, canonicalize paths) before hashing. Non-deterministic arguments (random IDs, current timestamps) bust the cache if not handled.
- Determinism test: a tool is cacheable if identical inputs always produce identical outputs. Side-effecting tools (
send_email,write_file,deploy) are never cacheable. - Limitation: stale caches return outdated results. TTL policies must match the volatility of each tool's data source, or you'll debug phantom bugs that only exist in the cache.
The Problem It Solves
Your coding agent is refactoring a 200-file codebase. It calls get_file_contents on the same 15 utility files across 40 different refactoring steps. Each call hits the filesystem API, adds latency, and burns context-building tokens. The agent has already read utils/auth.ts eight times in the past hour, and the file hasn't changed once.
Now multiply this across retries. The agent hits a rate limit on step 37, backs off, and restarts from step 30. Steps 30-36 re-execute identical tool calls: the same file reads, the same linter runs, the same test executions. You pay for every one of them again.
I've watched production agents burn through $200 in a single day purely on redundant tool calls during retry-heavy workflows. The agent wasn't doing more work. It was doing the same work over and over, paying full price each time.
What Is It?
Action caching stores the results of deterministic tool calls in a key-value cache, keyed by the tool name and a hash of normalized arguments. When the agent makes the same call again, the cache returns the stored result instantly instead of executing the tool. Replay mode takes this further: it records an entire session's tool calls and results, then plays them back without any live execution.
Think of it as a court stenographer. During the first trial (agent run), the stenographer records every question asked and every answer given. When the trial is rehearsed for appeal preparation (replay), no one needs to bring the witnesses back. The attorneys read from the transcript. Same questions, same answers, zero witness fees.
How It Works
Cache key design: the make-or-break decision
The cache key determines whether your cache hits or misses. A good key maximizes hits on truly identical calls while never returning stale results for different calls. The standard formula: hash(tool_name + canonical(args)).
Canonicalization is where teams get it wrong. Consider these two calls:
# Call A
get_file(path="./src/utils/auth.ts", encoding="utf-8")
# Call B
get_file(path="src/utils/auth.ts", encoding="utf-8")
These are semantically identical but produce different hashes if you hash raw arguments. Canonicalization resolves paths to absolute form, sorts dictionary keys alphabetically, strips default values, and normalizes whitespace. Without it, your cache hit rate drops from 85% to under 40%.
Non-deterministic arguments need special handling. If a tool call includes timestamp: Date.now() or request_id: uuid(), those values change every call and bust the cache. The solution: define a per-tool argument filter that strips non-deterministic fields before hashing.
# Cache key generation with argument normalization
import hashlib, json, os
def make_cache_key(tool_name: str, args: dict, strip_keys: set = None) -> str:
"""Generate a deterministic cache key from tool call."""
clean_args = dict(args)
# Strip non-deterministic fields
for key in (strip_keys or set()):
clean_args.pop(key, None)
# Normalize paths to absolute
for key in ("path", "file", "directory"):
if key in clean_args and isinstance(clean_args[key], str):
clean_args[key] = os.path.abspath(clean_args[key])
# Sort keys for deterministic serialization
canonical = json.dumps(clean_args, sort_keys=True, default=str)
content = f"{tool_name}:{canonical}"
return hashlib.sha256(content.encode()).hexdigest()[:16]
The cacheability test: deterministic vs. side-effecting tools
Not every tool call can be cached. The rule is simple: cache reads, never cache writes.
A tool is cacheable if the same inputs always produce the same outputs and the call has no side effects. Reading a file is cacheable. Sending an email is not. Running a linter on unchanged code is cacheable. Deploying to production is not.
I've seen teams make the mistake of caching web_search results with no TTL. The agent searched for "current stock price of AAPL" three hours ago, got $185, and now it's returning that cached result while the actual price has moved to $192. The cache turned a reliable tool into a time-delayed one.
TTL policies: matching cache lifetime to data volatility
Different tools need different cache durations. A database schema changes once a week. A web search result is stale after 15 minutes. A compiled binary is valid until the source changes. Using a single TTL for all tools either causes stale results (TTL too long) or cache thrashing (TTL too short).
| Tool Category | TTL | Rationale |
|---|---|---|
| Static config / schema | 24 hours | Changes require deploys |
| File contents (read-only) | Until file modified (inotify or mtime check) | Invalidate on mutation |
| Linter / compiler output | Until source changes | Deterministic for same input |
| Web search | 10-30 minutes | Results drift with time |
| API responses (third-party) | 5-15 minutes | External data changes |
| Directory listings | 5 minutes | Files may be added/removed |
Event-based invalidation is stronger than time-based TTL for file operations. If you watch the filesystem for changes (inotify, FSEvents), you can keep file caches valid indefinitely until the underlying file actually changes. This pushes hit rates above 90% for file-heavy agent workflows.
Replay mode: deterministic debugging without live calls
Replay mode is the second major capability built on action caching. Instead of caching individual tool calls, replay records an entire session (every tool call, every result, every agent reasoning step) and plays it back later using cached results instead of live execution.
This solves three problems. First, debugging: when an agent misbehaves, you replay the session step-by-step to find exactly where reasoning went wrong, without paying for another full run. Second, regression testing: after changing the model or prompt, replay historical sessions against the new configuration and compare outputs. Third, evaluation: replay recorded sessions with different models to benchmark quality without re-running tools.
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.