Episodic memory

TL;DR

By default, every agent session starts with zero knowledge of past sessions. Episodic memory gives agents access to relevant memories from previous interactions without stuffing the entire history into the context window.
The pattern mirrors RAG but for the agent's own past: store "memories" (summaries of past sessions, decisions, user preferences, learned facts) as vector embeddings; retrieve the most relevant ones at session start.
Memory selection matters as much as storage. Retrieve only memories relevant to the current task, not the complete history. Irrelevant memories add noise and waste context space.
Memory formation is a design decision. You can store raw conversation turns, summaries generated by the agent itself, or explicitly extracted facts. Each has different retrieval behavior and storage cost.
Episodic memory enables personalization, continuity ("as I mentioned last time"), and accumulated context from repeated interactions.

The problem it solves

A user works with a coding agent for three months. In month one, they establish preferences: "Always add type hints, always use black formatting, prefer dependency injection over singleton services." In month two, the agent helps them understand and document a complex payment module. In month three, when they ask the agent to refactor a related module, it has no memory of the established preferences, no knowledge of the payment module's design decisions, and no continuity with the previous two months of work.

The engineer re-explains everything from scratch in every session. The agent makes formatting and architectural decisions it would have gotten right with context. Work that should compound regresses.

The problem gets worse over time, not better. Each session generates context that could inform future sessions, but without persistence, all of it evaporates. A three-month working relationship with the agent feels like talking to a stranger every single day.

Episodic memory addresses the amnesia problem by giving the agent a persistent external memory store that survives session boundaries. The agent doesn't need to remember everything. It needs to remember the right things and retrieve them when they're relevant.

What is it?

Episodic memory is a retrieval system for an agent's own past. At the end of each session, the agent (or a background process) stores a record of what happened as embeddings in a vector database. At the start of each new session, the agent queries its memory store with the current task and injects the most relevant memories into its context window.

Think of it like a doctor's patient file. Each visit, the doctor jots down key notes: allergies, current medications, test results, treatment preferences. At the next visit, the doctor pulls the file and scans the most relevant notes before seeing the patient. They don't re-read every note from every visit (that would take too long). They look for notes related to the current complaint. The patient file is the episodic memory, and the relevance-based scan is the vector similarity search.

The architecture is deliberately similar to RAG: the "documents" are past experiences rather than external knowledge, and the "query" is the current task. The retrieval mechanism (embedding similarity search) is also identical to RAG.
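
To make the retrieval mechanism concrete, here is a minimal sketch of embedding similarity search over a few stored memories. The three-dimensional vectors are hand-written stand-ins for real embeddings, and the names `cosine_similarity` and `top_memories` are illustrative assumptions, not part of any particular library.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_memories(
    query_vec: list[float],
    store: list[tuple[str, list[float]]],
    k: int = 2,
) -> list[tuple[str, float]]:
    """Return the k (text, score) pairs most similar to the query, best first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy store: hand-written 3-d vectors standing in for real embeddings.
store = [
    ("User prefers type hints and black formatting", [0.9, 0.1, 0.0]),
    ("Documented the payment module's design decisions", [0.1, 0.9, 0.1]),
    ("Team migrated from MySQL to PostgreSQL", [0.0, 0.1, 0.9]),
]

# Pretend embedding of the task "refactor a module related to payments".
query = [0.2, 0.95, 0.05]

for text, score in top_memories(query, store):
    print(f"{score:.2f}  {text}")
```

A real system would replace the hand-written vectors with the output of an embedding model and the linear scan with a vector database query, but the ranking logic is the same idea.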

How it works

Memory formation: what to store

Three common memory formation strategies, in increasing sophistication:

Raw turn storage: store each conversation message as a separate vector. High granularity, high storage cost, noisy retrieval (any query returns too many tangentially related messages).

Session summary storage: at session end, generate a compact summary of the session. Store the summary as one or a few vectors. Lower granularity, cleaner retrieval, matches how humans form episodic memories.

Fact extraction storage: during or after the session, an LLM-based extractor identifies specific learnable facts (user preferences, key decisions, technical constraints) and stores each one separately. Highest signal-to-noise retrieval. Most implementation effort.

I'd recommend session summaries with fact extraction for most agent use cases. Raw turn storage works only if sessions are very short (under 10 messages).

The formation strategy is the single most impactful design decision in an episodic memory system. Storage is cheap, and retrieval algorithms are mature. But if you store the wrong things, even perfect retrieval returns noise. Invest most of your design time in deciding what gets stored and at what granularity.
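
The difference between the three strategies is easiest to see side by side. The sketch below turns one short session into memory records under each strategy. All function names are hypothetical, and `naive_summarize` and `naive_extract_facts` are simple rule-based stand-ins for what would normally be LLM calls.

```python
from typing import Callable

Turn = tuple[str, str]  # (speaker, text)


def raw_turn_memories(turns: list[Turn]) -> list[str]:
    """Raw turn storage: one memory per message."""
    return [f"{speaker}: {text}" for speaker, text in turns]


def summary_memories(
    turns: list[Turn], summarize: Callable[[str], str]
) -> list[str]:
    """Session summary storage: one compact memory per session."""
    transcript = "\n".join(f"{speaker}: {text}" for speaker, text in turns)
    return [summarize(transcript)]


def fact_memories(
    turns: list[Turn], extract_facts: Callable[[list[Turn]], list[str]]
) -> list[str]:
    """Fact extraction storage: one memory per extracted fact."""
    return extract_facts(turns)


def naive_summarize(transcript: str) -> str:
    """Stand-in for an LLM summarizer (assumption): reports size and opening line."""
    lines = transcript.splitlines()
    if not lines:
        return "Empty session"
    return f"Session of {len(lines)} turns, opening with: {lines[0]}"


def naive_extract_facts(turns: list[Turn]) -> list[str]:
    """Stand-in for an LLM extractor (assumption): keeps user preference statements."""
    markers = ("always", "prefer", "never")
    return [
        text
        for speaker, text in turns
        if speaker == "user" and any(marker in text.lower() for marker in markers)
    ]


session = [
    ("user", "Always add type hints."),
    ("agent", "Understood, I will add type hints."),
    ("user", "Prefer dependency injection over singleton services."),
    ("agent", "Noted."),
]

print(len(raw_turn_memories(session)), "raw turn memories")
print(summary_memories(session, naive_summarize))
print(fact_memories(session, naive_extract_facts))
```

Passing the summarizer and extractor in as callables keeps the formation step testable without a model, and makes it easy to swap strategies later.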

Memory retrieval: how to inject memories

Memory scoring and selection

Not all retrieved memories are equally relevant. Apply a relevance threshold and limit the number of memories injected:

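def retrieve_memories(query: str, k: int = 5, threshold: float = 0.75) -> list[Memory]:
    # embed() and vector_db stand for your embedding model and vector store client.
    query_embedding = embed(query)
    # Over-fetch candidates so that thresholding can still leave up to k results.
    candidates = vector_db.search(query_embedding, top_k=20)
    # Sort explicitly in case the store does not return results best-first.
    candidates = sorted(candidates, key=lambda m: m.similarity_score, reverse=True)
    return [
        m for m in candidates
        if m.similarity_score >= threshold
    ][:k]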

Injecting too many low-relevance memories is worse than injecting none. They add noise that misleads the model. Keep the threshold high enough that returned memories are genuinely relevant.
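
Once memories are selected, they still have to be placed into the context window. The following sketch shows one possible way to format them as a prompt section under a character budget. The function name `build_memory_block`, the header wording, and the `max_chars` parameter are all assumptions made for illustration.

```python
def build_memory_block(
    memories: list[tuple[str, float]], max_chars: int = 600
) -> str:
    """Format (text, score) pairs as a prompt section, best first, within a budget.

    Returns an empty string if no memory fits, so the caller can skip the section.
    """
    header = "Relevant memories from previous sessions:"
    lines = [header]
    used = len(header)
    for text, _score in sorted(memories, key=lambda m: m[1], reverse=True):
        entry = f"- {text}"
        # +1 accounts for the newline that joins this entry to the previous line.
        if used + len(entry) + 1 > max_chars:
            break
        lines.append(entry)
        used += len(entry) + 1
    return "\n".join(lines) if len(lines) > 1 else ""


selected = [
    ("User prefers dependency injection over singleton services", 0.81),
    ("User wants type hints on all functions", 0.92),
]
print(build_memory_block(selected))
```

Returning an empty string when nothing fits avoids injecting a header with no content, which would only add noise.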

Memory structure

Each stored memory should include metadata beyond the vector:

from datetime import datetime
from typing import Literal

from pydantic import BaseModel


class Memory(BaseModel):
    content: str          # The actual memory text
    session_id: str       # Which session created this
    created_at: datetime
    memory_type: Literal["preference", "decision", "fact", "summary"]
    relevance_tags: list[str]  # Structured tags for filtering
    user_id: str          # For multi-user systems

Include relevance_tags and memory_type for categorical filtering before semantic search. For example, filtering by memory_type == "preference" before a vector search returns much cleaner results.

Memory update and decay

Memories can become stale. A preference stored six months ago ("always use MySQL") may no longer be valid. Two approaches:

Explicit updates: the agent detects when a new session contradicts an old memory (user says "actually, we migrated to PostgreSQL") and updates or archives the old memory.

Recency weighting: weight similarity scores by recency, so that older memories need higher similarity scores to be selected. This prevents ancient memories from dominating retrieval.
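
Recency weighting can be implemented in several ways. One simple option, shown here as a sketch, is exponential decay with a half-life: a memory's similarity score is halved for every `half_life_days` of age. The function name and the 90-day default are assumptions for illustration, not values from any particular system.

```python
from datetime import datetime, timedelta, timezone


def recency_weighted_score(
    similarity: float,
    created_at: datetime,
    now: datetime,
    half_life_days: float = 90.0,
) -> float:
    """Scale a similarity score by 0.5 for every half-life of memory age."""
    age_days = max((now - created_at).total_seconds() / 86400, 0.0)
    return similarity * 0.5 ** (age_days / half_life_days)


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
fresh = recency_weighted_score(0.80, now - timedelta(days=1), now)
old = recency_weighted_score(0.90, now - timedelta(days=180), now)

# The older memory starts with higher similarity but ends up ranked lower.
print(f"fresh: {fresh:.3f}, old: {old:.3f}")
```

Applying the relevance threshold to the weighted score rather than the raw similarity gives the behavior described above: the older a memory is, the more similar it must be to pass.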

Memory compaction

Over months of use, memory stores accumulate redundant and stale entries. A monthly compaction job keeps retrieval fast and relevant.

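from datetime import datetime, timedelta, timezone


async def compact_memories(
    vector_db, user_id: str, max_age_days: int = 180
) -> int:
    """Merge near-duplicates and archive stale memories.

    Returns the number of memories archived.
    """
    # vector_db, cluster_by_similarity and merge_memory_texts are assumed helpers.
    all_memories = vector_db.list(filter={"user_id": user_id})
    archived_ids = set()  # Track ids so no memory is archived or counted twice

    # Cluster near-duplicate memories (similarity > 0.95)
    clusters = cluster_by_similarity(all_memories, threshold=0.95)
    for cluster in clusters:
        if len(cluster) > 1:
            # Assumes the merged memory reuses cluster[0].id, so the upsert overwrites it.
            merged = merge_memory_texts(cluster)
            vector_db.upsert(merged)
            for old in cluster[1:]:
                vector_db.archive(old.id)
                archived_ids.add(old.id)

    # Archive memories never retrieved and older than cutoff
    # (assumes created_at is timezone-aware and the store tracks retrieve_count)
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    for m in all_memories:
        if m.id in archived_ids:
            continue
        if m.created_at < cutoff and m.retrieve_count == 0:
            vector_db.archive(m.id)
            archived_ids.add(m.id)

    return len(archived_ids)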

A well-tuned compaction policy can reduce index size by 50-70% over a year, which helps keep retrieval latency under 100 ms even as the memory store grows.
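
The compaction job relies on a clustering helper that is not defined in this section. One possible way such a helper could work is greedy clustering against a representative vector, sketched below. The name `greedy_cluster` and its input shape (pairs of id and vector) are assumptions and differ from the `cluster_by_similarity` signature used above. Note that the result depends on input order, which is usually acceptable for near-duplicate detection at a threshold as high as 0.95.

```python
import math


def _cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, returning 0.0 for zero-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def greedy_cluster(
    items: list[tuple[str, list[float]]], threshold: float = 0.95
) -> list[list[str]]:
    """Group item ids whose vectors exceed the threshold against a cluster's first member."""
    # Each cluster is (representative vector, member ids); the first member represents it.
    clusters: list[tuple[list[float], list[str]]] = []
    for item_id, vec in items:
        for representative, members in clusters:
            if _cosine(representative, vec) > threshold:
                members.append(item_id)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append((vec, [item_id]))
    return [members for _, members in clusters]


items = [
    ("m1", [1.0, 0.0]),
    ("m2", [0.99, 0.05]),  # near-duplicate of m1
    ("m3", [0.0, 1.0]),
]
print(greedy_cluster(items))
```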

Architecture overview

The full episodic memory pipeline has three stages: formation (deciding what to remember), storage (embedding and persisting), and retrieval (finding relevant memories for the current task). Each stage has distinct engineering concerns.
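
The three stages described above can be wired together in a few lines. The sketch below is a toy in-memory version under stated assumptions: the class `EpisodicMemoryPipeline` and its field names are hypothetical, formation simply keeps every turn, and the embedding is a keyword count over a tiny vocabulary rather than a real model.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EpisodicMemoryPipeline:
    form: Callable[[list[str]], list[str]]              # formation: transcript -> memory texts
    embed: Callable[[str], list[float]]                 # storage: text -> vector
    score: Callable[[list[float], list[float]], float]  # retrieval: similarity function
    store: list[tuple[str, list[float]]] = field(default_factory=list)

    def end_session(self, transcript: list[str]) -> None:
        """Form memories from the transcript, embed them, and persist them."""
        for text in self.form(transcript):
            self.store.append((text, self.embed(text)))

    def start_session(self, task: str, k: int = 3) -> list[str]:
        """Return the k stored memories most similar to the current task."""
        query = self.embed(task)
        ranked = sorted(
            self.store, key=lambda entry: self.score(query, entry[1]), reverse=True
        )
        return [text for text, _ in ranked[:k]]


VOCAB = ["format", "payment", "database"]  # toy vocabulary (assumption)


def toy_embed(text: str) -> list[float]:
    """Count occurrences of each vocabulary word; a stand-in for a real embedding model."""
    lowered = text.lower()
    return [float(lowered.count(word)) for word in VOCAB]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


pipeline = EpisodicMemoryPipeline(form=lambda turns: turns, embed=toy_embed, score=dot)
pipeline.end_session([
    "Always format code with black",
    "Documented the payment module design",
])
print(pipeline.start_session("Refactor the payment service", k=1))
```

Keeping each stage behind its own callable makes it possible to improve formation, which is where most of the design effort belongs, without touching storage or retrieval.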

In production, I've found that the formation stage is where most teams underinvest. Storage and retrieval are solved problems (any vector database handles them well), but deciding what to remember and at what granularity determines whether memories are useful or noisy.
