Context engineering

TL;DR

Context engineering is the discipline of controlling everything inside the model's context window: system prompt, conversation history, retrieved documents, tool outputs, and user message.
Andrej Karpathy popularized the term in June 2025, arguing that "prompt engineering" undersells the real skill because you manage a full window, not just one instruction block.
A production context window has five zones, and their order changes output quality because of primacy and recency effects.
Shorter, focused context outperforms long, diluted context. The best edit is usually a deletion (the "chisel principle").
AlphaCodium's prompt decomposition pipeline improved GPT-4 accuracy (pass@5) on CodeContests from 19% to 44% by breaking one giant prompt into a chain of small, focused prompts.
Master this before fine-tuning or RAG. Context engineering is the highest-leverage skill in applied AI.

The problem it solves

You write a careful system prompt. The model ignores half of it, hallucinates a section you explicitly prohibited, and formats the output in a way that breaks your parser. You add more instructions. Quality gets worse.

The issue is not your word choice. The issue is context design. The model sees your critical instructions diluted by hundreds of tokens it does not need, with retrieved text contradicting your constraints and old conversation history cluttering the signal.

I have watched teams spend weeks tuning word choices in their prompts when the real fix was cutting 60% of the context and restructuring the rest. The model does its best with whatever you put in front of it. Context engineering is the discipline of making "whatever you put in front of it" excellent.

The good news: context engineering is a learnable, testable engineering skill. Unlike "prompt magic," it responds to systematic experimentation and eval-driven iteration.

What is it?

Context engineering is the practice of deciding what goes into the model's context window, what stays out, how content is ordered, and how it is structured, before the model sees a single token.

Think of the context window as the model's working memory. A human cannot hold 50 pages of notes in their head while answering a question; they select the most relevant page, skim for the key paragraph, and work from that. Context engineering is the same selection process, performed by your code on behalf of the model.

Karpathy framed it this way in June 2025: "prompt engineering" suggests the short task descriptions people type into a chatbot, while context engineering is the work of filling the context window with just the right information for the next step. The shift matters because "prompt" implies one instruction block whereas "context" implies a system with multiple inputs, budget constraints, and engineering tradeoffs.

RAG is one technique for filling a zone of this window. Context engineering is the meta-skill of designing the entire window. Conflating them is a common interview mistake.

How it works

The five zones of a context window

Every production context window contains five distinct zones. The order matters because most models give more attention to content at the beginning (primacy effect) and end (recency effect) of the window.

Zone	Position	Typical size	What belongs here
System prompt	Top (first)	1,000 to 5,000 tokens	Role, constraints, output format, few-shot examples
Conversation history	After system prompt	500 to 10,000 tokens	Prior turns, often summarized
Retrieved context	Middle	1,000 to 8,000 tokens	RAG chunks, search results, knowledge base excerpts
Tool outputs	After retrieval	200 to 2,000 tokens	Function call results, API responses
User message	Bottom (last)	50 to 500 tokens	Current query or instruction

Place your most important instructions at the top of the system prompt. The user message sits at the bottom, closest to generation. Everything in the middle competes for attention, so the fewer tokens in the middle zones, the better each zone performs.
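The zone ordering above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Zone` class, `assemble_context`, and `zone_report` are hypothetical names, and the word-count tokenizer is a deliberate simplification that you would replace with your model's real tokenizer.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Zone:
    name: str
    text: str


def rough_token_count(text: str) -> int:
    # Assumption: one token per whitespace-separated word.
    # Real tokenizers differ; use your model's tokenizer for real budgets.
    return len(text.split())


def assemble_context(
    system_prompt: str,
    history: List[str],
    retrieved: List[str],
    tool_outputs: List[str],
    user_message: str,
) -> List[Zone]:
    """Return the zones in window order: system prompt first, user message last."""
    zones = [
        Zone("system_prompt", system_prompt),
        Zone("conversation_history", "\n".join(history)),
        Zone("retrieved_context", "\n\n".join(retrieved)),
        Zone("tool_outputs", "\n".join(tool_outputs)),
        Zone("user_message", user_message),
    ]
    # Drop empty zones so they add nothing to the middle of the window.
    return [z for z in zones if z.text.strip()]


def zone_report(zones: List[Zone]) -> Dict[str, int]:
    """Map each zone name to its approximate token count."""
    return {z.name: rough_token_count(z.text) for z in zones}


zones = assemble_context(
    system_prompt="You are a support agent. Cite the source document.",
    history=["user: hi", "assistant: hello"],
    retrieved=["Refunds are processed within 5 days."],
    tool_outputs=[],
    user_message="How long do refunds take?",
)
print([z.name for z in zones])
print(zone_report(zones))
```

A per-zone report like this makes it obvious when the middle zones have grown large enough to crowd out the instructions at the top and the query at the bottom.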

System prompt design

The system prompt is your most controlled input. Structure it explicitly with markup, not prose paragraphs.

A production system prompt has four sections: (1) role framing ("You are a senior software engineer reviewing pull requests"), (2) output format instructions (a JSON schema or markdown template), (3) hard constraints ("Always cite the source document. Never fabricate URLs."), and (4) persona guardrails ("If you do not know, say so. Do not guess.").

Use structured markup to signal hierarchy. Anthropic recommends XML tags with Claude (such as <instructions>, <examples>, <context>), and OpenAI models respond well to markdown headers and bold text. Either way, explicit structure tells the model which text is a rule, which is an example, and which is reference material, the same way it does for a human reader.
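Here is one way the four-section layout could be assembled with XML-style tags. Treat it as a sketch under assumptions: the tag names (`role`, `output_format`, `constraints`, `guardrails`) and the helper functions are illustrative choices, not names prescribed by any vendor.

```python
from typing import List


def tag(name: str, body: str) -> str:
    """Wrap body in an XML-style tag, with the tags on their own lines."""
    return f"<{name}>\n{body.strip()}\n</{name}>"


def build_system_prompt(
    role: str,
    output_format: str,
    constraints: List[str],
    guardrails: List[str],
) -> str:
    """Assemble the four sections into one tagged system prompt."""
    sections = [
        tag("role", role),
        tag("output_format", output_format),
        # One bullet per rule keeps each constraint individually visible.
        tag("constraints", "\n".join(f"- {c}" for c in constraints)),
        tag("guardrails", "\n".join(f"- {g}" for g in guardrails)),
    ]
    return "\n\n".join(sections)


prompt = build_system_prompt(
    role="You are a senior software engineer reviewing pull requests.",
    output_format="Return a JSON object with keys 'summary' and 'issues'.",
    constraints=["Always cite the source document.", "Never fabricate URLs."],
    guardrails=["If you do not know, say so.", "Do not guess."],
)
print(prompt)
```

Keeping each section behind its own function argument also makes it easy to remove or swap one section during evaluation without touching the others.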

Interview tip: the working memory frame

Say "the context window is the model's working memory, and everything it knows at inference time comes from what I put there." That one sentence signals engineering depth. Then walk through the five zones.

I have seen teams write a 4,000-token wall of prose as their system prompt and wonder why the model ignores half of it. The model is not disobedient. It is working through a degraded signal. Switching to structured XML tags cut their instruction-following failures by roughly 40%.

The chisel principle

More context is not better context. This is the most counterintuitive lesson in production LLM work.

The "chisel" principle comes from applied-llms.org: you improve context by removing superfluous information, not by adding more. Every irrelevant sentence is noise the model must route around. Long prompts with redundant instructions produce worse outputs than short prompts with clean ones.

Treat every token as expensive even when you have budget to spare. Start small. Add only what improves your evaluation metrics. If removing a section does not hurt quality, that section should not be there.

The deletion test

Run your eval suite, then delete 20% of your system prompt. If scores stay the same or improve, that 20% was noise. Repeat until deletions hurt. This is the fastest path to a clean context.
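The deletion test can be automated. The sketch below is one possible variant, not the only one: it removes one section at a time instead of a flat 20%, and it assumes you supply an `evaluate` function that builds a prompt from the given sections, runs your eval suite, and returns a score where higher is better. `deletion_test` and `toy_eval` are hypothetical names, and `toy_eval` stands in for a real eval run.

```python
from typing import Callable, List


def deletion_test(
    sections: List[str],
    evaluate: Callable[[List[str]], float],
    tolerance: float = 0.0,
) -> List[str]:
    """Greedily drop sections whose removal does not lower the eval score."""
    kept = list(sections)
    best = evaluate(kept)
    changed = True
    while changed and len(kept) > 1:
        changed = False
        for i in range(len(kept)):
            candidate = kept[:i] + kept[i + 1:]
            score = evaluate(candidate)
            # Accept the deletion only if quality stays within tolerance
            # of the best score seen so far, so losses cannot accumulate.
            if score >= best - tolerance:
                kept = candidate
                best = max(best, score)
                changed = True
                break
    return kept


def toy_eval(sections: List[str]) -> float:
    # Stand-in for a real eval suite: only two instructions matter.
    needed = {"Return JSON.", "Cite sources."}
    return sum(1 for s in sections if s in needed) / len(needed)


prompt_sections = ["Return JSON.", "Be awesome.", "Cite sources.", "Think hard."]
print(deletion_test(prompt_sections, toy_eval))
# prints ['Return JSON.', 'Cite sources.']
```

In practice each call to `evaluate` costs a full eval run, so coarse sections (whole paragraphs or tagged blocks) are usually a better unit than individual sentences.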

Prompt decomposition (the pipeline pattern)

One god-prompt that attempts everything at once is the worst pattern in production LLM systems.

AlphaCodium found that decomposing code generation into a pipeline of small, focused prompts improved GPT-4 accuracy (pass@5) on the CodeContests benchmark from 19% to 44%, compared with a single well-designed prompt. Each step has a focused goal, a short context, and clear success criteria. Composition beats consolidation.

The pipeline pattern works in four stages: (1) extract requirements from the input, (2) generate a plan, (3) execute the plan (write code, draft text, etc.), (4) review and refine the output. Four prompts. Four small contexts. Each step feeds a structured output into the next step's context.
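The four stages can be wired together as a short chain. This is a minimal sketch, assuming a hypothetical `llm` callable that takes a prompt string and returns the completion; the prompt wording, `run_pipeline`, and the `FakeLLM` stand-in are all illustrative, and you would swap in your own client in place of the fake.

```python
from typing import Callable, Dict, List

LLM = Callable[[str], str]  # hypothetical: prompt in, completion out


def run_pipeline(task: str, llm: LLM) -> Dict[str, str]:
    """Run four focused prompts, feeding each output into the next context."""
    requirements = llm(
        f"List the requirements in this task as bullets.\n\nTask:\n{task}"
    )
    plan = llm(
        "Write a short step-by-step plan that meets these requirements.\n\n"
        f"Requirements:\n{requirements}"
    )
    draft = llm(f"Carry out this plan and produce the result.\n\nPlan:\n{plan}")
    final = llm(
        "Review the draft against the requirements and return a corrected version.\n\n"
        f"Requirements:\n{requirements}\n\nDraft:\n{draft}"
    )
    return {"requirements": requirements, "plan": plan, "draft": draft, "final": final}


class FakeLLM:
    """Test double that records each prompt and returns a numbered reply."""

    def __init__(self) -> None:
        self.prompts: List[str] = []

    def __call__(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return f"[output {len(self.prompts)}]"


fake = FakeLLM()
result = run_pipeline("Summarize the indemnification clause.", fake)
print(len(fake.prompts))
print(result["final"])
```

Note that the execution stage sees only the plan, not the raw task, which keeps its context small. Whether that is the right cut depends on your task; the point is that each stage's context is chosen deliberately.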

I used this pattern on a contract analysis system where a single prompt achieved 62% accuracy on our eval set. Splitting into extraction, reasoning, and formatting stages pushed accuracy to 89% with no model change and no fine-tuning.

Context budget management
