Context compression techniques
Learn how agents stay within LLM context limits using the iceberg pattern, auto-compaction, progressive summarization, and selective loading to preserve critical information while reducing token usage.
TL;DR
- LLMs have finite context windows (Claude Opus 4.6: 200K tokens, GPT 5.4: 256K, Gemini 3.1 Pro: 2M). Long-running agents blow through these limits in hours during complex tasks.
- Context compression keeps critical information available while shrinking total token count. The core insight: 80-90% of accumulated context is rarely accessed but still sits in the active window consuming tokens on every call.
- The iceberg pattern keeps only the most needed 10-20% of context above the waterline. Everything else lives in compressed storage and is retrieved on demand through retrieval tools.
- Five techniques cover most cases: progressive summarization, auto-compaction, selective loading, hierarchical compression, and a rolling window with anchors. They stack together for maximum efficiency.
- With GPT 5.4 at $15 per million tokens, an agent averaging 100K-token contexts over 20 calls sends 2M tokens and costs $30 per task. Compressing to a 20K average cuts that to $6, a 5x reduction on context spend alone.
- One hard constraint: never compress verbatim content you cannot afford to distort. Original requirements, active tool outputs, error messages, and security constraints must stay exactly as written.
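The cost figures in the TL;DR can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes every call re-sends the full context and that only context tokens are billed; the function name `context_cost_usd` is made up for illustration.

```python
def context_cost_usd(avg_context_tokens: int, calls: int, usd_per_million_tokens: float) -> float:
    """Context spend for one task, assuming each call re-sends the whole context."""
    total_tokens = avg_context_tokens * calls
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Figures taken from the TL;DR: $15 per million tokens, 20 calls per task.
uncompressed = context_cost_usd(100_000, 20, 15.0)
compressed = context_cost_usd(20_000, 20, 15.0)

print(f"${uncompressed:.2f} vs ${compressed:.2f} ({uncompressed / compressed:.1f}x)")
# prints: $30.00 vs $6.00 (5.0x)
```

Output tokens and the cost of the summarization calls themselves are left out, so real savings will be somewhat lower than this estimate.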
The Problem It Solves
A research agent is working through a 200-page technical report. After two hours and 60 tool calls, it has accumulated conversation history, extracted document chunks, intermediate findings, and reasoning traces. The total hits 195K tokens. On the next call, the API rejects the request: context length exceeded. Two hours of accumulated work is gone.
The naive fix is to truncate old messages. This works until step 55, when the agent needs a constraint from turn 3 and hallucinates an entirely different answer because that constraint no longer exists in context. Truncation is not compression. It is selective amnesia.
Larger context windows look like the answer. Gemini 3.1 Pro offers 2M tokens. But long contexts are slow, expensive, and suffer from the lost-in-the-middle problem: LLMs attend much better to information at the start and end of a context than to information buried deep in the middle. Stuffing everything into context does not guarantee the model actually uses it.
What Is It?
Context compression is the set of techniques that reduce what occupies the active context window while preserving access to everything that was learned. The key shift: information does not have to live in context to be available. It just has to be retrievable when needed.
Think of it like a surgical scrub nurse managing the instrument tray. The sterile field holds only the instruments in active use right now. Hundreds of additional instruments are organized, labeled, and instantly available moments away. The surgeon does not need to see every instrument simultaneously. They need to ask for what they need and have it appear immediately.
The iceberg metaphor describes the architecture: 10-20% of the agent's accumulated knowledge is visible above the waterline in active context. The other 80-90% lives below in compressed storage, surfaced only when the agent explicitly retrieves it via tool calls.
The diagram shows three layers. Active context is small and always current. Compressed memory holds summarized versions of older work. External storage holds full-fidelity archives retrieved on demand. The agent only pays the token cost for what is currently above the waterline.
How It Works
Technique 1: Progressive summarization
Progressive summarization fires when context utilization crosses a threshold, typically 70%. The oldest chunk of history (say, turns 1-40) is passed to a summarization LLM. That model produces a compact "what happened so far" document of roughly 500 tokens. The original 40 turns are replaced by the summary in active context and archived verbatim to external storage.
The critical implementation detail: only summarize content that has been fully processed. If the agent is actively using a tool output from turn 38, that output stays verbatim until the agent has finished with it and noted its conclusion. Only then does it become eligible for compression.
I've seen engineers set the compression threshold at 90% instead of 70%. By 90%, you may not have enough headroom to run the summarization call itself. Set the trigger at 70% and compress incrementally. Early and frequent compression is much safer than a single emergency compression at the limit.
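The trigger-and-archive loop described above might look roughly like the sketch below. Everything here is an assumption for illustration: the `Turn` and `ProgressiveSummarizer` names, the `in_use` flag that marks content the agent still needs verbatim, the four-characters-per-token estimate, and the `summarize` callable standing in for a real summarization LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Turn:
    text: str
    in_use: bool = False  # True while the agent still depends on this turn verbatim


def estimate_tokens(text: str) -> int:
    # Crude heuristic (about 4 characters per token); swap in a real tokenizer.
    return max(1, len(text) // 4)


@dataclass
class ProgressiveSummarizer:
    summarize: Callable[[list[str]], str]  # hypothetical summarization LLM call
    context_limit: int
    trigger_ratio: float = 0.70            # compress early, not at 90%
    chunk_size: int = 40                   # how many of the oldest turns to fold up
    turns: list[Turn] = field(default_factory=list)
    archive: list[list[str]] = field(default_factory=list)  # stand-in for external storage

    def used_tokens(self) -> int:
        return sum(estimate_tokens(t.text) for t in self.turns)

    def add(self, turn: Turn) -> None:
        self.turns.append(turn)
        self.maybe_compress()

    def maybe_compress(self) -> bool:
        if self.used_tokens() < self.trigger_ratio * self.context_limit:
            return False
        # Collect the oldest turns, stopping at the first one still in active use.
        chunk: list[Turn] = []
        for turn in self.turns[: self.chunk_size]:
            if turn.in_use:
                break
            chunk.append(turn)
        if not chunk:
            return False
        originals = [t.text for t in chunk]
        self.archive.append(originals)  # keep the verbatim copy before replacing it
        summary = Turn(text=self.summarize(originals))
        self.turns = [summary] + self.turns[len(chunk):]
        return True
```

Because the summary becomes the new oldest turn, a later pass can fold it into the next summary, which is what makes the scheme progressive. A production version would also need to persist the archive somewhere durable and reserve enough headroom for the summarization call itself.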
Technique 2: Auto-compaction (Anthropic's built-in)
Claude Opus 4.6 and Sonnet 4.6 include built-in auto-compaction. When context approaches the limit, Claude automatically summarizes earlier turns without requiring any application-level compression code. The mechanism is transparent to the conversation flow but has a significant implication: the agent loses access to verbatim history after compaction fires.
Auto-compaction is convenient but dangerous in long tasks. The auto-summary does not know which of your constraints are task-critical and which are noise. It summarizes what looks verbose to a general-purpose model, which may be exactly what your agent needs verbatim. I have watched agents that relied solely on auto-compaction start contradicting their own security constraints after the fourth or fifth compaction pass.
The rule: treat auto-compaction as a safety net, not a strategy. Use it to catch overflow cases your compression logic missed. Use manual summarization for the critical content where you need control over what gets preserved.
Technique 3: Selective loading and on-demand retrieval
Selective loading inverts the usual approach entirely. Instead of injecting all available information at context start, the agent receives a compact task definition and a set of retrieval tools. When it needs a specific file, it calls read_file(path). When it needs a past decision, it calls search_memory(query). Nothing enters context unless explicitly requested.
This is the iceberg pattern in its purest form. Context stays consistently small regardless of how long the task has been running. In a Chroma or Qdrant setup, embedding retrieval typically adds under 50ms of latency. The larger overhead is the LLM reasoning needed to decide when to retrieve, which costs a full agent turn.
For tasks running fewer than 20 turns, selective loading overhead may not be worth it. For tasks running 100+ turns over hours, it pays back many times. The break-even point depends on how often the agent needs previously-processed content, which is task-specific.
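A small sketch of the selective-loading setup follows. The tool names `read_file` and `search_memory` come from the description above; everything else (`MemoryStore`, `build_tools`, `handle_tool_call`) is hypothetical. The store ranks notes by plain word overlap as a stand-in for embedding search, so it illustrates the control flow rather than retrieval quality.

```python
from pathlib import Path
from typing import Callable


class MemoryStore:
    """Stand-in for a vector store: ranks notes by word overlap, not embeddings."""

    def __init__(self) -> None:
        self.notes: list[str] = []

    def add(self, note: str) -> None:
        self.notes.append(note)

    def search(self, query: str, k: int = 2) -> list[str]:
        query_words = set(query.lower().split())
        scored = [(len(query_words & set(n.lower().split())), n) for n in self.notes]
        matches = [pair for pair in scored if pair[0] > 0]
        matches.sort(key=lambda pair: pair[0], reverse=True)
        return [note for _, note in matches[:k]]


def build_tools(memory: MemoryStore, root: Path) -> dict[str, Callable[[str], str]]:
    """Expose the two retrieval tools the agent is allowed to call."""

    def read_file(path: str) -> str:
        return (root / path).read_text()

    def search_memory(query: str) -> str:
        return "\n".join(memory.search(query))

    return {"read_file": read_file, "search_memory": search_memory}


def handle_tool_call(context: list[str], tools: dict[str, Callable[[str], str]],
                     name: str, argument: str) -> str:
    """Run a tool the agent asked for and append only that result to context."""
    result = tools[name](argument)
    context.append(f"[{name}({argument!r})]\n{result}")
    return result
```

The context list starts with just the task definition and grows only when `handle_tool_call` runs, which is the property that keeps it small. A real `read_file` would also need to reject paths that escape `root`.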
Technique 4: Hierarchical compression
Hierarchical compression maintains multiple levels of fidelity for the same content. Level 1 is raw data in external storage (full fidelity, high token cost to inject). Level 2 is per-section summaries of roughly 500 words each. Level 3 is an executive summary of 2-3 paragraphs covering everything.
The agent loads Level 3 by default. When it needs to drill into a specific section, it loads the relevant Level 2 summary. Only when it needs exact data does it fetch Level 1 from external storage. Default context cost is predictable regardless of how much raw data the agent has accumulated.
This structure shines in document analysis and research tasks where the agent repeatedly references earlier findings. With simple progressive summarization, going back to an archived detail requires an external retrieval call. With hierarchical compression, the Level 2 intermediate layer is often sufficient, avoiding the external round-trip.
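The three fidelity levels can be modeled as one small container with a level-aware loader. This is a sketch under assumptions: `HierarchicalMemory` and its method names are invented here, the summaries are supplied by the caller rather than generated, and an in-memory dict stands in for external storage.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HierarchicalMemory:
    executive_summary: str = ""                                      # Level 3
    section_summaries: dict[str, str] = field(default_factory=dict)  # Level 2
    raw_store: dict[str, str] = field(default_factory=dict)          # Level 1 (external storage stand-in)

    def add_section(self, name: str, raw: str, summary: str) -> None:
        """Store the raw text and its per-section summary side by side."""
        self.raw_store[name] = raw
        self.section_summaries[name] = summary

    def load(self, level: int = 3, section: Optional[str] = None) -> str:
        """Return the cheapest representation that satisfies the requested level."""
        if level == 3:
            return self.executive_summary
        if section is None:
            raise ValueError("levels 1 and 2 require a section name")
        if level == 2:
            return self.section_summaries[section]
        if level == 1:
            return self.raw_store[section]
        raise ValueError(f"unknown level: {level}")
```

The agent would call `load()` with no arguments by default and pass `level=2` or `level=1` only when it needs to drill down, so the default context cost depends on the size of the executive summary rather than on how much raw data has accumulated.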
Technique 5: Rolling window with anchors
A rolling window keeps the most recent N message turns verbatim, typically the last 8-12 turns. Keeping recent turns verbatim ensures the agent maintains accurate short-term state. Beyond the window, older turns are summarized or evicted.
Anchors are a separate category that never leaves context regardless of age. These are: the original task specification, security and permission constraints, hard user requirements that affect all behavior, and key architectural decisions that define the task's scope. Anchors survive every compression pass.
The rolling window plus anchors approach is the minimum viable implementation. It requires almost no infrastructure complexity yet prevents the single most dangerous failure mode: agents that forget their original requirements halfway through a long task.
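A rolling window with anchors needs little more than a bounded queue. The sketch below assumes turns and anchors are plain strings; the `RollingContext` class and its method names are illustrative, and evicted turns are simply collected in a list where a fuller implementation would summarize or archive them.

```python
from collections import deque


class RollingContext:
    def __init__(self, anchors: list[str], window: int = 10) -> None:
        self.anchors = list(anchors)                # never evicted, regardless of age
        self.recent: deque[str] = deque(maxlen=window)
        self.evicted: list[str] = []                # candidates for summarization or archiving

    def add_turn(self, turn: str) -> None:
        # When the window is full, record the oldest turn before the deque drops it.
        if len(self.recent) == self.recent.maxlen:
            self.evicted.append(self.recent[0])
        self.recent.append(turn)

    def render(self) -> list[str]:
        """Context to send on the next call: anchors first, then the recent window."""
        return self.anchors + list(self.recent)


ctx = RollingContext(
    anchors=["TASK: audit the billing service", "SECURITY: never print API keys"],
    window=3,
)
for i in range(1, 6):
    ctx.add_turn(f"turn {i}")

print(ctx.render())
```

After five turns with a window of three, the rendered context holds both anchors plus turns 3 through 5, while turns 1 and 2 sit in `evicted`.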