Memory synthesis from execution logs
Distill raw agent execution logs into compact, reusable memory entries that inform future decisions without consuming the full context window.
TL;DR
- Raw execution logs from a 30-step agent session can easily reach 50K tokens, too many to re-inject into the context window on every run.
- A synthesis LLM pass compresses those 50K tokens into roughly 500 tokens of structured memory entries: what worked, what failed, key decisions, and confidence levels.
- Memory entries follow a typed format (factual, procedural, failure) with temporal decay so recent lessons outweigh stale ones.
- Teams using log synthesis report 50-100x compression ratios while retaining the lessons that actually affect future task success.
- Limitation: synthesis is biased toward outcomes. A path that failed in one context might succeed in another, but the synthesized lesson says "avoid this approach."
The Problem It Solves
Your coding agent just spent 30 steps debugging a Python dependency conflict. It tried pip install, hit a version mismatch, switched to a virtual environment, discovered a system-level conflict, and finally resolved it by pinning a transitive dependency. The raw log is 50,000 tokens of tool calls, error messages, and reasoning traces. Tomorrow, the same agent hits the same class of problem. It has no memory of yesterday's session.
The naive fix is obvious: store the entire log and inject it into context next time. But 50K tokens of raw log crowds out the actual task in a smaller context window, and even a 200K-token window is fully consumed by four such logs before the agent starts working.
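The budget arithmetic is easy to check. The sketch below uses only the figures quoted in this article (a 200K-token window, 50K tokens per raw log, about 500 tokens per synthesized session); the function name is made up for illustration.

```python
# Figures are the ones quoted in the article; tokens_left is an illustrative name.
CONTEXT_WINDOW = 200_000   # tokens available per request
RAW_LOG = 50_000           # tokens per raw session log
SYNTHESIZED = 500          # tokens per synthesized session summary


def tokens_left(window: int, per_session: int, sessions: int) -> int:
    """Tokens remaining for the actual task after injecting past sessions."""
    return window - per_session * sessions


for sessions in (1, 4, 5):
    raw = tokens_left(CONTEXT_WINDOW, RAW_LOG, sessions)
    synth = tokens_left(CONTEXT_WINDOW, SYNTHESIZED, sessions)
    print(f"{sessions} session(s): raw leaves {raw:,}, synthesized leaves {synth:,}")
```

Four raw logs leave zero tokens for the task and a fifth overflows the window, while five synthesized sessions cost 2,500 tokens in total.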
I've watched agents repeat the exact same three-step dead end across consecutive sessions because nobody built the "remember what happened" pipeline. The raw data exists. The extraction step is missing.
What Is It?
Memory synthesis from execution logs runs an LLM pass over raw agent session transcripts to extract compact, reusable "lessons learned" entries. Instead of storing the full 50K-token log, the system produces a handful of structured memory entries (typically 300-500 tokens total) capturing the decisions, outcomes, and transferable knowledge from each session.
Think of it as a surgeon's case notes. The surgeon doesn't record every heartbeat and instrument swap during a six-hour procedure. Instead, they write a one-page summary: what the diagnosis was, what technique they used, what complications arose, and what they'd do differently. Future surgeons reading these notes get the critical knowledge without replaying the entire surgery.
How It Works
Phase 1: structured logging during execution
The synthesis pipeline starts before synthesis. During execution, the agent writes structured task diary entries rather than relying on raw tool call logs alone. Each diary entry captures what was attempted, what happened, and why the agent made that choice.
A good diary format looks like this:
## Task: Resolve Python dependency conflict in payment-service
Attempted approaches:
1. pip install stripe==7.0.0 → failed, conflicts with requests>=2.31
2. pip install --force-reinstall → installed but broke test suite
3. Created venv, pinned stripe==6.5.0 + requests==2.31.0 → all tests pass
What worked: Isolated venv with explicit pinning of transitive deps
What failed: force-reinstall bypassed version checks, broke other packages
Pattern: Dependency conflicts in monorepos always need venv isolation first
This structured format costs almost nothing to produce during execution (the agent generates it as a final step). But it gives the synthesis pass dramatically better signal than raw tool call logs.
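If you want the agent to emit that diary consistently, one option is to build it from a small data structure and render it at the end of the task. This is a minimal sketch, not a prescribed format: the `Attempt` and `DiaryEntry` classes and their field names are my own assumptions, chosen to mirror the diary layout above.

```python
from dataclasses import dataclass, field


@dataclass
class Attempt:
    command: str   # what the agent tried
    outcome: str   # what happened as a result


@dataclass
class DiaryEntry:
    """Hypothetical container for one task diary entry."""
    task: str
    attempts: list[Attempt] = field(default_factory=list)
    worked: str = ""
    failed: str = ""
    pattern: str = ""

    def render(self) -> str:
        """Render the entry as text in the diary layout."""
        lines = [f"## Task: {self.task}", "Attempted approaches:"]
        for number, attempt in enumerate(self.attempts, start=1):
            lines.append(f"{number}. {attempt.command} → {attempt.outcome}")
        lines.append(f"What worked: {self.worked}")
        lines.append(f"What failed: {self.failed}")
        lines.append(f"Pattern: {self.pattern}")
        return "\n".join(lines)


entry = DiaryEntry(
    task="Resolve Python dependency conflict in payment-service",
    attempts=[
        Attempt("pip install stripe==7.0.0", "failed, conflicts with requests>=2.31"),
        Attempt("pip install --force-reinstall", "installed but broke test suite"),
    ],
    worked="Isolated venv with explicit pinning of transitive deps",
    failed="force-reinstall bypassed version checks, broke other packages",
    pattern="Dependency conflicts in monorepos always need venv isolation first",
)
print(entry.render())
```

Keeping attempts as a list lets the agent append to the entry as it works and render it only once, at the end.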
Phase 2: the synthesis pass
After a session ends, a dedicated synthesis LLM processes the full execution log (including diary entries if available) and extracts memory entries. The synthesis prompt is specific:
SYNTHESIS_PROMPT = """
Given this agent execution log, extract the 3-5 most important
facts that would help a future agent performing a similar task.
For each fact, provide:
- task_type: what kind of task this applies to
- lesson: the transferable insight (one sentence)
- category: factual | procedural | failure
- confidence: high | medium | low
- source_session_id: {session_id}
Rules:
- Skip anything too specific to generalize ("make button pink")
- Keep anything that would save >2 minutes if known in advance
- Prefer lessons that apply to a class of problems, not one instance
- Include the WHY, not just the WHAT
"""
The synthesis LLM acts as a filter with judgment. It decides that "the project uses Python 3.11" is worth remembering (factual), "always check virtualenv activation before installing packages" is a transferable procedure (procedural), and "pip install --force-reinstall causes cascading breakage" is a failure lesson worth storing.
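Here is one way the pass could be wired up, as a sketch under stated assumptions rather than a reference implementation. `call_llm` is a hypothetical stand-in for whatever model client you use (it takes a prompt string and returns the model's text). Asking for JSON output is also my assumption; the prompt above does not specify an output format. The prompt is abbreviated here so the block runs on its own.

```python
import json
from typing import Callable

# Abbreviated copy of the synthesis prompt, plus an ASSUMED JSON output instruction.
SYNTHESIS_PROMPT = """
Given this agent execution log, extract the 3-5 most important
facts that would help a future agent performing a similar task.
Return a JSON list of objects with the keys: task_type, lesson,
category, confidence, source_session_id.
source_session_id: {session_id}
"""

ALLOWED_CATEGORIES = {"factual", "procedural", "failure"}
ALLOWED_CONFIDENCE = {"high", "medium", "low"}


def synthesize(log_text: str, session_id: str,
               call_llm: Callable[[str], str]) -> list[dict]:
    """Run one synthesis pass and keep only well-formed entries."""
    # Format the template first, then append the log, so braces inside
    # the log text cannot break str.format.
    prompt = SYNTHESIS_PROMPT.format(session_id=session_id) + "\nLOG:\n" + log_text
    raw = call_llm(prompt)
    try:
        entries = json.loads(raw)
    except json.JSONDecodeError:
        return []  # unparseable output: store nothing rather than guess
    if not isinstance(entries, list):
        return []
    valid = []
    for entry in entries:
        if not isinstance(entry, dict) or not entry.get("lesson"):
            continue
        if entry.get("category") not in ALLOWED_CATEGORIES:
            continue
        if entry.get("confidence") not in ALLOWED_CONFIDENCE:
            continue
        # Use the session id we passed in, not whatever the model echoed back.
        entry["source_session_id"] = session_id
        valid.append(entry)
    return valid[:5]  # the prompt asks for at most five entries
```

Validating the category and confidence values before storing anything keeps a malformed synthesis run from polluting the memory store.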
Phase 3: memory entry format and categorization
Each synthesized memory entry follows a typed schema. The three categories serve different retrieval purposes:
Factual memories store environment truths: API endpoints, configuration values, project-specific conventions. These have high confidence and long shelf life. Example: "Payment service uses Stripe API v2023-10-16 with webhook signing enabled."
Procedural memories capture how-to knowledge: sequences of steps that solved a class of problem. These generalize across similar tasks. Example: "For Python dependency conflicts in monorepos, always (1) create a venv, (2) pin the conflicting package, (3) run the full test suite before committing."
Failure memories record what went wrong and why. These are the most valuable for preventing repeat mistakes. Example: "pip install --force-reinstall bypasses version compatibility checks. It installs the package but silently breaks transitive dependencies. Always use version pinning instead."
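The typed schema and the temporal decay mentioned in the TL;DR could look something like this. Treat it as a sketch: the field names follow the synthesis prompt, but the exponential half-life formula, the half-life values, and the confidence weights are all assumptions I picked for illustration, not numbers from any particular system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum


class Category(Enum):
    FACTUAL = "factual"
    PROCEDURAL = "procedural"
    FAILURE = "failure"


# ASSUMED values, for illustration only. Factual memories get the longest
# half-life because they are described as having a long shelf life.
CONFIDENCE_WEIGHT = {"high": 1.0, "medium": 0.6, "low": 0.3}
HALF_LIFE_DAYS = {
    Category.FACTUAL: 180.0,
    Category.PROCEDURAL: 60.0,
    Category.FAILURE: 60.0,
}


@dataclass(frozen=True)
class MemoryEntry:
    task_type: str
    lesson: str
    category: Category
    confidence: str          # "high" | "medium" | "low"
    source_session_id: str
    created_at: datetime

    def score(self, now: datetime) -> float:
        """Confidence weight multiplied by an exponential decay on age."""
        age_days = max((now - self.created_at).total_seconds() / 86_400, 0.0)
        decay = 0.5 ** (age_days / HALF_LIFE_DAYS[self.category])
        return CONFIDENCE_WEIGHT[self.confidence] * decay


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
entry = MemoryEntry(
    task_type="python-dependency-conflict",
    lesson="Pin the conflicting transitive dependency inside an isolated venv.",
    category=Category.PROCEDURAL,
    confidence="high",
    source_session_id="example-session",
    created_at=now - timedelta(days=60),
)
print(entry.score(now))
```

With these assumed numbers, a high-confidence procedural memory loses half its weight after 60 days, so sorting candidates by `score(now)` lets recent lessons outrank stale ones.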