Context window management

TL;DR

Every LLM has a fixed context window. Long-running agents will hit it. The question is not if but when, and whether the agent degrades gracefully or crashes.
Four strategies cover most cases: sliding window eviction (drop oldest messages), summarization compression (replace a chunk of history with a summary), selective eviction (score messages by importance and drop low-scoring ones), and auto-compaction (summarize the full context when approaching the limit).
Summarization compression gives the best quality-per-token tradeoff for most agents. Replace 10,000 tokens of message history with a 500-token summary. The agent keeps the key decisions and facts without the verbose chat back-and-forth.
Never evict: the system prompt, the current task definition, critical constraints (security rules, user preferences that affect all behavior), and the most recent 3-5 turns (near-term context).
Track token usage explicitly. Don't discover the context is full when the API returns a 400 error mid-task.

A coding agent is working on a large refactor. After 45 minutes and 50 tool calls, the context window is at 95% capacity. On the next tool call, the API returns an error: context length exceeded. The agent crashes, losing all state. The engineer has to restart from scratch.

Alternatively: the agent doesn't crash but silently drops old context to fit within the window. It's now forgotten the original requirements it was given in the first user message. It starts generating code that contradicts the spec it agreed to earlier.

Both outcomes are production failures. The first is loud; the second is quiet and potentially worse.

Context window management is the engineering discipline that prevents both. It's not a single pattern. It's a suite of strategies applied at different stages of context growth.

What is it?

Context window management is the set of techniques for actively controlling what occupies an LLM's context window during a long-running agent session. The goal is to keep the most important information available to the model while staying within token limits, and to degrade gracefully when compression is necessary.

The strategies range from simple (drop old messages) to sophisticated (score each message's importance and evict strategically). The right approach depends on how long the agent runs, what information it needs to retain, and how much quality degradation is acceptable.

How it works

Strategy 1: Sliding window eviction

Drop the oldest messages once the context fills. Keep the last N turns. Simple and predictable, but loses early context that may still be relevant.

Best for: short agent sessions where early context rarely matters. Conversational chatbots. Not suitable for long-horizon tasks where initial requirements are still being referenced at step 50.

Strategy 2: Summarization compression

When context hits a high-water mark (e.g., 80% of the context limit), pass the oldest 50% of messages to a secondary LLM call that produces a compressed summary. Replace those messages with the summary. Effective ratio: 10,000 tokens of history → 500-token summary.

Strategy 3: Selective eviction

Score each message by importance before eviction. Messages that contain: task constraints, tool call results with data still referenced, user preferences, or error states score high. Pure reasoning chains ("Let me think about this...") and intermediate planning steps score low.

Evict low-scoring messages first. Keeps the semantically important context longer.

Implementation uses a simple heuristic or a cheap classifier. The cost of running the scoring is justified by the quality improvement over naive sliding window.

Strategy 4: Auto-compaction at critical threshold

When context exceeds 90%, trigger a full re-summarization. This is more aggressive than incremental compression. The agent summarizes its entire history into a structured "checkpoint" document:

## Task checkpoint (auto-generated)
**Original goal:** [...]
**Key decisions made:** [...]
**Files modified:** [...]
**Constraints established:** [...]
**Current state:** [...]
**Next step:** [...]

The agent then starts a fresh context window with just this checkpoint plus the system prompt. This is used by Claude Code and similar tools to handle sessions that run for hours.

What to never evict

Certain context must be preserved regardless of window pressure:

The system prompt: behavior, permissions, constraints.
The original task / user goal: the agent's north star.
Explicit user preferences and constraints communicated in the session.
The most recent N turns (usually 3-5), which serves as near-term working memory.
The results of any tool calls that produced data still in use. If the agent queried a database and the result is still being worked on, evicting it causes silent errors.

Monitoring context health

Track token usage proactively:

def check_context_health(messages: list, model: str) -> ContextHealth:
    token_count = count_tokens(messages, model)
    limit = MODEL_LIMITS[model]
    utilization = token_count / limit

    if utilization > 0.9:
        return ContextHealth.CRITICAL  # trigger auto-compaction now
    elif utilization > 0.75:
        return ContextHealth.HIGH      # trigger incremental summarization
    elif utilization > 0.5:
        return ContextHealth.MEDIUM    # monitor closely
    return ContextHealth.OK

Choosing a strategy by utilization level

Use this decision tree to pick the right compaction strategy based on current token utilization:

TL;DR

Every LLM has a fixed context window. Long-running agents will hit it. The question is not if but when, and whether the agent degrades gracefully or crashes.
Four strategies cover most cases: sliding window eviction (drop oldest messages), summarization compression (replace a chunk of history with a summary), selective eviction (score messages by importance and drop low-scoring ones), and auto-compaction (summarize the full context when approaching the limit).
Summarization compression gives the best quality-per-token tradeoff for most agents. Replace 10,000 tokens of message history with a 500-token summary. The agent keeps the key decisions and facts without the verbose chat back-and-forth.
Never evict: the system prompt, the current task definition, critical constraints (security rules, user preferences that affect all behavior), and the most recent 3-5 turns (near-term context).
Track token usage explicitly. Don't discover the context is full when the API returns a 400 error mid-task.

## Task checkpoint (auto-generated)
**Original goal:** [...]
**Key decisions made:** [...]
**Files modified:** [...]
**Constraints established:** [...]
**Current state:** [...]
**Next step:** [...]

The agent then starts a fresh context window with just this checkpoint plus the system prompt. This is used by Claude Code and similar tools to handle sessions that run for hours.

What to never evict

Certain context must be preserved regardless of window pressure:

The system prompt: behavior, permissions, constraints.
The original task / user goal: the agent's north star.
Explicit user preferences and constraints communicated in the session.
The most recent N turns (usually 3-5), which serves as near-term working memory.
The results of any tool calls that produced data still in use. If the agent queried a database and the result is still being worked on, evicting it causes silent errors.

Monitoring context health

Track token usage proactively:

def check_context_health(messages: list, model: str) -> ContextHealth:
    token_count = count_tokens(messages, model)
    limit = MODEL_LIMITS[model]
    utilization = token_count / limit

    if utilization > 0.9:
        return ContextHealth.CRITICAL  # trigger auto-compaction now
    elif utilization > 0.75:
        return ContextHealth.HIGH      # trigger incremental summarization
    elif utilization > 0.5:
        return ContextHealth.MEDIUM    # monitor closely
    return ContextHealth.OK

Choosing a strategy by utilization level

Use this decision tree to pick the right compaction strategy based on current token utilization:

Context window management

TL;DR

The problem it solves

What is it?

How it works

Strategy 1: Sliding window eviction

Strategy 2: Summarization compression

Strategy 3: Selective eviction

Strategy 4: Auto-compaction at critical threshold

What to never evict

Monitoring context health

Choosing a strategy by utilization level

Continue Reading with Premium

Comments

Context window management

TL;DR

The problem it solves

What is it?

How it works

Strategy 1: Sliding window eviction

Strategy 2: Summarization compression

Strategy 3: Selective eviction

Strategy 4: Auto-compaction at critical threshold

What to never evict

Monitoring context health

Choosing a strategy by utilization level

Continue Reading with Premium

Comments