Context window management
Learn how sliding windows, summarization chains, and selective eviction keep long-running agents within their token budget without losing critical task context.
TL;DR
- Every LLM has a fixed context window. Long-running agents will hit it. The question is not if but when, and whether the agent degrades gracefully or crashes.
- Four strategies cover most cases: sliding window eviction (drop oldest messages), summarization compression (replace a chunk of history with a summary), selective eviction (score messages by importance and drop low-scoring ones), and auto-compaction (summarize the full context when approaching the limit).
- Summarization compression gives the best quality-per-token tradeoff for most agents. Replace 10,000 tokens of message history with a 500-token summary. The agent keeps the key decisions and facts without the verbose chat back-and-forth.
- Never evict: the system prompt, the current task definition, critical constraints (security rules, user preferences that affect all behavior), and the most recent 3-5 turns (near-term context).
- Track token usage explicitly. Don't discover the context is full when the API returns a 400 error mid-task.
The problem it solves
A coding agent is working on a large refactor. After 45 minutes and 50 tool calls, the context window is at 95% capacity. On the next tool call, the API returns an error: context length exceeded. The agent crashes, losing all state. The engineer has to restart from scratch.
Alternatively: the agent doesn't crash but silently drops old context to fit within the window. It's now forgotten the original requirements it was given in the first user message. It starts generating code that contradicts the spec it agreed to earlier.
Both outcomes are production failures. The first is loud; the second is quiet and potentially worse.
Context window management is the engineering discipline that prevents both. It's not a single pattern. It's a suite of strategies applied at different stages of context growth.
What is it?
Context window management is the set of techniques for actively controlling what occupies an LLM's context window during a long-running agent session. The goal is to keep the most important information available to the model while staying within token limits, and to degrade gracefully when compression is necessary.
The strategies range from simple (drop old messages) to sophisticated (score each message's importance and evict strategically). The right approach depends on how long the agent runs, what information it needs to retain, and how much quality degradation is acceptable.
How it works
Strategy 1: Sliding window eviction
Drop the oldest messages once the context fills. Keep the last N turns. Simple and predictable, but loses early context that may still be relevant.
Best for: short agent sessions where early context rarely matters. Conversational chatbots. Not suitable for long-horizon tasks where initial requirements are still being referenced at step 50.
Strategy 2: Summarization compression
When context hits a high-water mark (e.g., 80% of the context limit), pass the oldest 50% of messages to a secondary LLM call that produces a compressed summary. Replace those messages with the summary. Effective ratio: 10,000 tokens of history β 500-token summary.
Strategy 3: Selective eviction
Score each message by importance before eviction. Messages that contain: task constraints, tool call results with data still referenced, user preferences, or error states score high. Pure reasoning chains ("Let me think about this...") and intermediate planning steps score low.
Evict low-scoring messages first. Keeps the semantically important context longer.
Implementation uses a simple heuristic or a cheap classifier. The cost of running the scoring is justified by the quality improvement over naive sliding window.
Strategy 4: Auto-compaction at critical threshold
When context exceeds 90%, trigger a full re-summarization. This is more aggressive than incremental compression. The agent summarizes its entire history into a structured "checkpoint" document:
## Task checkpoint (auto-generated)
**Original goal:** [...]
**Key decisions made:** [...]
**Files modified:** [...]
**Constraints established:** [...]
**Current state:** [...]
**Next step:** [...]
The agent then starts a fresh context window with just this checkpoint plus the system prompt. This is used by Claude Code and similar tools to handle sessions that run for hours.
What to never evict
Certain context must be preserved regardless of window pressure:
- The system prompt: behavior, permissions, constraints.
- The original task / user goal: the agent's north star.
- Explicit user preferences and constraints communicated in the session.
- The most recent N turns (usually 3-5), which serves as near-term working memory.
- The results of any tool calls that produced data still in use. If the agent queried a database and the result is still being worked on, evicting it causes silent errors.
Monitoring context health
Track token usage proactively:
def check_context_health(messages: list, model: str) -> ContextHealth:
token_count = count_tokens(messages, model)
limit = MODEL_LIMITS[model]
utilization = token_count / limit
if utilization > 0.9:
return ContextHealth.CRITICAL # trigger auto-compaction now
elif utilization > 0.75:
return ContextHealth.HIGH # trigger incremental summarization
elif utilization > 0.5:
return ContextHealth.MEDIUM # monitor closely
return ContextHealth.OK
Choosing a strategy by utilization level
Use this decision tree to pick the right compaction strategy based on current token utilization:
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.