Agent memory architecture
Learn how to design agent memory across four tiers, when to persist what, how to manage context window pressure, and how to build cross-session continuity that makes agents actually useful.
TL;DR
- Agent memory has four tiers: working (context window), episodic (vector DB), semantic (RAG knowledge base), and procedural (model weights). Each has different latency, cost, and persistence characteristics.
- Working memory has no retrieval latency but clears at the end of every session. Without episodic memory, every agent session starts completely blank.
- Episodic memory retrieval takes 10-50ms via ANN search. With a 200K-token context window, you can inject roughly 100-200 episode summaries before crowding out the task context.
- Context window pressure is a runtime emergency that must be managed proactively: summarize in the yellow zone (60-80% full) before you get forced into lossy truncation at red.
- Forgetting policies are not optional. Without decay and deduplication, episodic memory grows unbounded and retrieval quality degrades.
- Memory synthesis (compressing raw turn logs into structured episodes) is the highest-ROI engineering investment in a production agent memory system.
The problem it solves
A user tells your agent: "I always want code examples in TypeScript, not JavaScript." The agent confirms, stores nothing, and the session ends. Next week the same user is back, asks a related question, and the agent responds with JavaScript examples. The user is annoyed and repeats the preference. This cycle happens forever.
This is not a prompt problem. It is a persistence problem. The agent has no mechanism to carry facts from one session to the next. The context window clears on session end, and with it everything the agent learned about this user.
The cost is retention. Users who experience repeated memory failures trust the agent less and re-type context that the system should have stored. At production scale, this is not just a UX annoyance: it is a measurable drop in task completion rate and session return rate.
What is it?
Agent memory architecture is the design of how an agent stores, retrieves, and manages information across its four storage tiers to maintain continuity within and across sessions. Each tier has a different access pattern, capacity, update cost, and appropriate use case.
Think of it like a doctor's practice. The doctor's mind (working memory) holds the current patient's situation. The patient chart (episodic memory) records past visits and key history. The medical textbooks on the shelf (semantic memory) provide general domain knowledge. The doctor's trained skills like how to read an X-ray (procedural memory) are always active without lookup. A good doctor uses all four without confusing them.
How it works
Working memory and the context window math
Working memory is everything currently in the context window: the system prompt, tool schemas, conversation turns, and all observations from tool calls so far. It is fast, always available, and needs no retrieval step. It is also the smallest of the four tiers by capacity, and it clears when the session ends.
At 200K tokens (the Claude 3.7 Sonnet limit; some models, such as GPT-4.1, advertise larger windows), you have significant headroom for a single session. One conversation turn with a moderate tool call is roughly 300-500 tokens. One episode summary retrieved from the vector store is roughly 200 tokens. The system prompt and tool schemas for a real agent commonly consume 2,000-5,000 tokens. That leaves at least 195K tokens for task context, which is ample for most tasks.
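The budget arithmetic above can be sketched in a few lines. This is a rough illustration, not a tokenizer-accurate count: the constants are the figures quoted in this section, and the function name `remaining_budget` is hypothetical.

```python
# Figures taken from the paragraph above; adjust them for your own agent.
CONTEXT_LIMIT = 200_000    # tokens in the context window
SYSTEM_AND_TOOLS = 5_000   # upper end of the 2,000-5,000 range
TURN_TOKENS = 400          # midpoint of the 300-500 range per turn
EPISODE_TOKENS = 200       # one retrieved episode summary


def remaining_budget(turns: int, episodes: int) -> int:
    """Tokens left for task context after fixed and per-item costs."""
    used = SYSTEM_AND_TOOLS + turns * TURN_TOKENS + episodes * EPISODE_TOKENS
    return CONTEXT_LIMIT - used


if __name__ == "__main__":
    # A fresh session, then a session with 50 turns and 10 injected episodes.
    print(remaining_budget(turns=0, episodes=0))
    print(remaining_budget(turns=50, episodes=10))
```

Real observations vary widely in size, so a production agent would count tokens with the model's own tokenizer rather than rely on per-turn averages.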
The problem is long-running agents. A session that spans 50 tool calls at 400 tokens each uses 20,000 tokens for observations alone. Add in multi-turn conversation history, retrieved episodes, and a dense system prompt, and many production agents hit 60% context utilization within 30-40 turns. After that, the agent is operating under pressure.
I have seen production agents fail silently when they exceed context limits. The usual cause is truncation of the oldest content at the start of the context, which means the original task description disappears and the agent starts answering a different question. The fix is proactive summarization in the yellow zone, not reactive truncation at the limit.
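A minimal sketch of proactive summarization might look like the following. It assumes the yellow zone starts at 60% utilization and the red zone at 80% (the article gives 60-80% for yellow; the red boundary is inferred). The names `classify`, `compact`, and the `summarize` callable are hypothetical, and `summarize` stands in for whatever summarizer you use.

```python
from enum import Enum
from typing import Callable


class Zone(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"


YELLOW_AT = 0.60  # from the article: yellow zone is 60-80% full
RED_AT = 0.80     # assumption: red begins where yellow ends


def classify(used_tokens: int, limit: int = 200_000) -> Zone:
    """Map current context utilization to a pressure zone."""
    ratio = used_tokens / limit
    if ratio >= RED_AT:
        return Zone.RED
    if ratio >= YELLOW_AT:
        return Zone.YELLOW
    return Zone.GREEN


def compact(
    messages: list[str],
    keep_recent: int,
    summarize: Callable[[list[str]], str],
) -> list[str]:
    """Pin the first message (the task description), summarize the middle,
    and keep the most recent turns verbatim."""
    split = len(messages) - keep_recent
    if split <= 1:
        return list(messages)  # nothing between the task and the recent turns
    return [messages[0], summarize(messages[1:split]), *messages[split:]]
```

Pinning the first message is the point of the design: the task description survives compaction, which is exactly what front truncation fails to guarantee.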
Episodic memory: retrieval pipeline
Episodic memory is the tier that gives agents cross-session continuity. After each session, the agent extracts key facts and outcomes, compresses them into a structured episode record, embeds the record, and writes it to a vector database keyed by user ID.
At query time, the agent embeds the current user message, runs an ANN search filtered by user ID, retrieves the top-K most relevant past episodes, and injects them into the current context window as "Prior context for this user."
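The query-time path can be sketched as follows. To stay self-contained, this example swaps the ANN index for a brute-force cosine scan over an in-memory list and takes pre-computed vectors instead of calling an embedding model. The `Episode` type and the `retrieve` and `render_prior_context` functions are hypothetical names for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Episode:
    user_id: str
    summary: str
    vector: list[float]  # embedding of the summary, computed elsewhere


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    store: list[Episode], user_id: str, query_vector: list[float], k: int = 5
) -> list[tuple[float, Episode]]:
    """Filter by user ID, score every candidate, return the top-k by similarity.
    A vector database would replace this linear scan with an ANN search."""
    scored = [
        (cosine(query_vector, ep.vector), ep)
        for ep in store
        if ep.user_id == user_id
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def render_prior_context(hits: list[tuple[float, Episode]]) -> str:
    """Format retrieved episodes for injection into the context window."""
    lines = ["Prior context for this user:"]
    lines += [f"- {ep.summary}" for _, ep in hits]
    return "\n".join(lines)
```

The user ID filter runs before scoring, so one user's episodes can never be injected into another user's session regardless of similarity.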
The similarity threshold is the most important tuning knob. Episodes with similarity above 0.75 are retrieved and injected. Episodes between 0.5 and 0.75 are skipped but flagged for lower-priority consideration. Episodes below 0.5 are discarded as irrelevant. Lowering the threshold retrieves more memories but adds noise; raising it makes retrieval more precise but misses weakly relevant context.
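The three bands can be expressed as a small triage function. This is a sketch using the thresholds quoted above; how scores exactly on a boundary are treated is an assumption (here, 0.75 and 0.5 both fall into the flagged band), and the name `triage` is hypothetical.

```python
from typing import TypeVar

T = TypeVar("T")

INJECT_ABOVE = 0.75  # strictly above: inject into the context
FLAG_FROM = 0.50     # from here up to 0.75: skip but flag


def triage(scored: list[tuple[float, T]]) -> tuple[list[T], list[T]]:
    """Split (score, item) pairs into (inject, flagged); the rest is dropped."""
    inject: list[T] = []
    flagged: list[T] = []
    for score, item in scored:
        if score > INJECT_ABOVE:
            inject.append(item)
        elif score >= FLAG_FROM:
            flagged.append(item)
        # scores below FLAG_FROM are discarded as irrelevant
    return inject, flagged
```

Keeping the two cutoffs as named constants makes the tuning trade-off described above a one-line change.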
Memory synthesis: converting raw logs to episodes
Raw conversation logs are noisy and expensive to store and retrieve. A 50-turn session at 400 tokens per turn is 20,000 tokens. Embedding and storing logs of that size would degrade retrieval quality and make injection prohibitively expensive.
Memory synthesis converts raw logs into compact, structured episode records. A synthesis step runs at session end (or during a pressure event) and extracts the key facts.
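As a rough sketch of what a synthesis step could look like: the record shape below (`user_id`, `session_id`, `facts`, `outcome`) is a hypothetical schema, not one specified by this article, and `extract_facts` is a placeholder for whatever extractor you plug in. In practice that is probably a model call, which is itself an assumption here.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Callable


@dataclass
class EpisodeRecord:
    # Hypothetical schema for illustration only.
    user_id: str
    session_id: str
    facts: list[str] = field(default_factory=list)
    outcome: str = ""


def synthesize(
    user_id: str,
    session_id: str,
    turns: list[str],
    extract_facts: Callable[[list[str]], list[str]],
    outcome: str = "",
) -> EpisodeRecord:
    """Compress raw turns into a compact record, dropping blank and
    duplicate facts while preserving first-seen order."""
    facts: list[str] = []
    for fact in extract_facts(turns):
        fact = fact.strip()
        if fact and fact not in facts:
            facts.append(fact)
    return EpisodeRecord(user_id, session_id, facts, outcome)


def to_embedding_text(record: EpisodeRecord) -> str:
    """Serialize the record deterministically for embedding and storage."""
    return json.dumps(asdict(record), sort_keys=True)
```

For the TypeScript preference example from earlier, a record like this would hold a single short fact instead of the full transcript, which is what keeps both retrieval and injection cheap.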