Memory synthesis from execution logs

TL;DR

Raw execution logs from a 30-step agent session can easily reach 50K tokens, far more than any context window can repeatedly absorb.
A synthesis LLM pass compresses those 50K tokens into roughly 500 tokens of structured memory entries: what worked, what failed, key decisions, and a confidence level for each.
Memory entries follow a typed format (factual, procedural, failure) with temporal decay so recent lessons outweigh stale ones.
Teams using log synthesis report 50-100x compression ratios while retaining the lessons that actually affect future task success.
Limitation: synthesis is biased toward outcomes. A path that failed in one context might succeed in another, but the synthesized lesson says "avoid this approach."

The Problem It Solves

Your coding agent just spent 30 steps debugging a Python dependency conflict. It tried pip install, hit a version mismatch, switched to a virtual environment, discovered a system-level conflict, and finally resolved it by pinning a transitive dependency. The raw log is 50,000 tokens of tool calls, error messages, and reasoning traces. Tomorrow, the same agent hits the same class of problem. It has no memory of yesterday's session.

The naive fix is obvious: store the entire log and inject it into context next time. But 50K tokens of raw log crowds out the actual task. Even with a 200K-token context window, injecting four past session logs (4 × 50K tokens) would consume the entire budget before the agent starts working.
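
The budget arithmetic can be sketched in a few lines. The token figures are the ones quoted in this article; the function name `remaining_budget` is just an illustrative choice.

```python
# Back-of-the-envelope budget check using the figures quoted in the article.
CONTEXT_WINDOW = 200_000  # tokens available to the model
RAW_LOG = 50_000          # tokens per raw session log
SYNTHESIZED = 500         # tokens of synthesized memory per session


def remaining_budget(window: int, per_session: int, sessions: int) -> int:
    """Tokens left for the actual task after injecting past sessions."""
    return window - per_session * sessions


for sessions in (1, 4, 5):
    raw = remaining_budget(CONTEXT_WINDOW, RAW_LOG, sessions)
    synth = remaining_budget(CONTEXT_WINDOW, SYNTHESIZED, sessions)
    print(f"{sessions} session(s): raw leaves {raw:,}, synthesized leaves {synth:,}")
```

Four raw logs fill the window exactly and a fifth overflows it, while five synthesized sessions cost 2,500 tokens in total.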

I've watched agents repeat the exact same three-step dead end across consecutive sessions because nobody built the "remember what happened" pipeline. The raw data exists. The extraction step is missing.

What Is It?

Memory synthesis from execution logs runs an LLM pass over raw agent session transcripts to extract compact, reusable "lessons learned" entries. Instead of storing the full 50K-token log, the system produces a handful of structured memory entries (typically 300-500 tokens total) capturing the decisions, outcomes, and transferable knowledge from each session.

Think of it as a surgeon's case notes. The surgeon doesn't record every heartbeat and instrument swap during a six-hour procedure. Instead, they write a one-page summary: what the diagnosis was, what technique they used, what complications arose, and what they'd do differently. Future surgeons reading these notes get the critical knowledge without replaying the entire surgery.

How It Works

Phase 1: structured logging during execution

The pipeline starts before the synthesis pass itself. During execution, the agent writes structured task diary entries rather than relying on raw tool call logs alone. Each diary entry captures what was attempted, what happened, and why the agent made that choice.

A good diary format looks like this:

## Task: Resolve Python dependency conflict in payment-service
Attempted approaches:
1. pip install stripe==7.0.0 → failed, conflicts with requests>=2.31
2. pip install --force-reinstall → installed but broke test suite
3. Created venv, pinned stripe==6.5.0 + requests==2.31.0 → all tests pass

What worked: Isolated venv with explicit pinning of transitive deps
What failed: force-reinstall bypassed version checks, broke other packages
Pattern: Dependency conflicts in monorepos always need venv isolation first

This structured format costs almost nothing to produce during execution (the agent generates it as a final step). But it gives the synthesis pass dramatically better signal than raw tool call logs.
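
One way an agent could produce that diary is sketched below. This is a minimal illustration, not the article's implementation: the `Attempt` and `DiaryEntry` classes and their field names are hypothetical, chosen to mirror the layout above.

```python
from dataclasses import dataclass, field


@dataclass
class Attempt:
    action: str   # what the agent tried
    outcome: str  # what happened as a result


@dataclass
class DiaryEntry:
    """Hypothetical container for one task diary entry."""
    task: str
    attempts: list[Attempt] = field(default_factory=list)
    worked: str = ""
    failed: str = ""
    pattern: str = ""

    def render(self) -> str:
        """Render the entry in the diary layout shown above."""
        lines = [f"## Task: {self.task}", "Attempted approaches:"]
        for i, attempt in enumerate(self.attempts, start=1):
            lines.append(f"{i}. {attempt.action} → {attempt.outcome}")
        lines += [
            "",
            f"What worked: {self.worked}",
            f"What failed: {self.failed}",
            f"Pattern: {self.pattern}",
        ]
        return "\n".join(lines)


entry = DiaryEntry(task="Resolve Python dependency conflict in payment-service")
entry.attempts.append(
    Attempt("pip install stripe==7.0.0", "failed, conflicts with requests>=2.31")
)
entry.attempts.append(
    Attempt("Created venv, pinned stripe==6.5.0 + requests==2.31.0", "all tests pass")
)
entry.worked = "Isolated venv with explicit pinning of transitive deps"
entry.failed = "force-reinstall bypassed version checks, broke other packages"
print(entry.render())
```

Appending attempts as they happen and rendering once at the end keeps the per-step overhead small.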

Phase 2: the synthesis pass

After a session ends, a dedicated synthesis LLM processes the full execution log (including diary entries if available) and extracts memory entries. The synthesis prompt is specific:

SYNTHESIS_PROMPT = """
Given this agent execution log, extract the 3-5 most important
facts that would help a future agent performing a similar task.

For each fact, provide:
- task_type: what kind of task this applies to
- lesson: the transferable insight (one sentence)
- category: factual | procedural | failure
- confidence: high | medium | low
- source_session_id: {session_id}

Rules:
- Skip anything too specific to generalize ("make button pink")
- Keep anything that would save >2 minutes if known in advance
- Prefer lessons that apply to a class of problems, not one instance
- Include the WHY, not just the WHAT
"""

The synthesis LLM acts as a filter with judgment. It decides that "the project uses Python 3.11" is worth remembering (factual), "always check virtualenv activation before installing packages" is a transferable procedure (procedural), and "pip install --force-reinstall causes cascading breakage" is a failure lesson worth storing.
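
A synthesis pass along these lines might be wired up as in the sketch below. Several things here are assumptions rather than part of the article: the model is asked to reply in JSON, `call_llm` is a hypothetical callable wrapping whatever model API is in use, and `PROMPT_TEMPLATE` is a shortened stand-in for the prompt above.

```python
import json
from typing import Callable

# Shortened stand-in for the prompt above; the JSON instruction is an assumption.
PROMPT_TEMPLATE = (
    "Given this agent execution log, extract the 3-5 most important lessons.\n"
    "Return a JSON list of objects with keys: "
    "task_type, lesson, category, confidence.\n"
    "source_session_id: {session_id}\n\nLOG:\n{log}"
)

ALLOWED_CATEGORIES = ("factual", "procedural", "failure")
ALLOWED_CONFIDENCE = ("high", "medium", "low")


def synthesize(log: str, session_id: str, call_llm: Callable[[str], str]) -> list[dict]:
    """Run one synthesis pass and keep only well-formed entries."""
    raw = call_llm(PROMPT_TEMPLATE.format(session_id=session_id, log=log))
    try:
        candidates = json.loads(raw)
    except json.JSONDecodeError:
        return []  # unparseable output: store nothing rather than guess
    if not isinstance(candidates, list):
        return []
    entries = []
    for item in candidates:
        if not isinstance(item, dict) or not item.get("lesson"):
            continue
        if item.get("category") not in ALLOWED_CATEGORIES:
            continue
        if item.get("confidence") not in ALLOWED_CONFIDENCE:
            continue
        # Stamp provenance ourselves instead of trusting the model to echo it.
        entries.append({**item, "source_session_id": session_id})
    return entries


def fake_llm(prompt: str) -> str:
    """Hypothetical stub standing in for a real model call."""
    return json.dumps([
        {"task_type": "python-dependency-conflict",
         "lesson": "Pin the conflicting transitive dependency inside a venv; "
                   "force-reinstall breaks other packages.",
         "category": "procedural", "confidence": "high"},
        {"task_type": "ui", "lesson": "make button pink",
         "category": "cosmetic", "confidence": "low"},
    ])


memories = synthesize("raw log text goes here", "session-001", fake_llm)
print(memories)
```

Validating the category and confidence fields after the call means a malformed or off-schema entry is dropped instead of polluting the memory store.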

Figure: Memory synthesis pipeline: execute, log, synthesize, store, retrieve (stages: Agent Session, Raw Log, Synthesis LLM, Memory Store, Future Session).

Phase 3: memory entry format and categorization

Each synthesized memory entry follows a typed schema. The three categories serve different retrieval purposes:

Factual memories store environment truths: API endpoints, configuration values, project-specific conventions. These have high confidence and long shelf life. Example: "Payment service uses Stripe API v2023-10-16 with webhook signing enabled."

Procedural memories capture how-to knowledge: sequences of steps that solved a class of problem. These generalize across similar tasks. Example: "For Python dependency conflicts in monorepos, always (1) create a venv, (2) pin the conflicting package, (3) run the full test suite before committing."

Failure memories record what went wrong and why. These are the most valuable for preventing repeat mistakes. Example: "pip install --force-reinstall bypasses version compatibility checks. It installs the package but silently breaks transitive dependencies. Always use version pinning instead."
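
The typed schema and the temporal decay mentioned in the TL;DR could look something like the sketch below. The field names follow the synthesis prompt. The numeric confidence weights, the per-category half-lives, and the exponential decay formula are assumptions made for illustration, since the article does not specify them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum


class Category(Enum):
    FACTUAL = "factual"
    PROCEDURAL = "procedural"
    FAILURE = "failure"


# Assumed values, not from the article: weights and half-lives (in days).
CONFIDENCE_WEIGHT = {"high": 1.0, "medium": 0.6, "low": 0.3}
HALF_LIFE_DAYS = {
    Category.FACTUAL: 180.0,    # factual memories get the longest shelf life
    Category.PROCEDURAL: 90.0,
    Category.FAILURE: 90.0,
}


@dataclass(frozen=True)
class MemoryEntry:
    task_type: str
    lesson: str
    category: Category
    confidence: str
    source_session_id: str
    created_at: datetime

    def score(self, now: datetime) -> float:
        """Confidence weight multiplied by exponential decay of the entry's age."""
        age_days = max((now - self.created_at).total_seconds() / 86_400, 0.0)
        decay = 0.5 ** (age_days / HALF_LIFE_DAYS[self.category])
        return CONFIDENCE_WEIGHT[self.confidence] * decay


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
fresh = MemoryEntry(
    "python-dependency-conflict",
    "Create a venv and pin the conflicting package before installing.",
    Category.PROCEDURAL, "high", "session-002", now - timedelta(days=1),
)
stale = MemoryEntry(
    "python-dependency-conflict",
    "Create a venv and pin the conflicting package before installing.",
    Category.PROCEDURAL, "high", "session-001", now - timedelta(days=90),
)
ranked = sorted([stale, fresh], key=lambda m: m.score(now), reverse=True)
print([m.source_session_id for m in ranked])
```

With a score like this, two otherwise identical lessons rank by recency, so the newer session's entry is retrieved first.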
