Context engineering
Learn how to construct the context window to get the best results from LLMs, why 'context engineering' has replaced prompt engineering as the key skill, and what belongs in a production system prompt.
TL;DR
- Context engineering is the discipline of controlling everything inside the model's context window: system prompt, conversation history, retrieved documents, tool outputs, and user message.
- Andrej Karpathy popularized the term in June 2025, arguing that "prompt engineering" undersells the real skill because you manage a full window, not just one instruction block.
- A production context window has five zones, and their order changes output quality because of primacy and recency effects.
- Shorter, focused context outperforms long, diluted context. The best edit is usually a deletion (the "chisel principle").
- AlphaCodium's prompt decomposition pipeline improved GPT-4 accuracy (pass@5) on CodeContests from 19% to 44% by breaking one giant prompt into a chain of small, focused prompts.
- Master this before fine-tuning or RAG. Context engineering is the highest-leverage skill in applied AI.
The problem it solves
You write a careful system prompt. The model ignores half of it, hallucinates a section you explicitly prohibited, and formats the output in a way that breaks your parser. You add more instructions. Quality gets worse.
The issue is not your word choice. The issue is context design. The model sees your critical instructions diluted by hundreds of tokens it does not need, with retrieved text contradicting your constraints and old conversation history cluttering the signal.
I have watched teams spend weeks tuning word choices in their prompts when the real fix was cutting 60% of the context and restructuring the rest. The model does its best with whatever you put in front of it. Context engineering is the discipline of making "whatever you put in front of it" excellent.
The good news: context engineering is a learnable, testable engineering skill. Unlike "prompt magic," it responds to systematic experimentation and eval-driven iteration.
What is it?
Context engineering is the practice of deciding what goes into the model's context window, what stays out, how content is ordered, and how it is structured, before the model sees a single token.
Think of the context window as the model's working memory. A human cannot hold 50 pages of notes in their head while answering a question; they select the most relevant page, skim for the key paragraph, and work from that. Context engineering is the same selection process, performed by your code on behalf of the model.
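The selection step described above can be sketched in a few lines. This is a minimal illustration, not a production retriever: the word-overlap score, the whitespace token estimate, and the names `select_context` and `approx_tokens` are all assumptions made for the example. A real system would use a proper tokenizer and a better relevance signal.

```python
def approx_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (assumption; real tokenizers differ)."""
    return len(text.split())


def select_context(query: str, candidates: list[str], budget: int) -> list[str]:
    """Pick the candidates that share the most words with the query, within a token budget."""
    query_words = set(query.lower().split())

    def overlap(text: str) -> int:
        return len(query_words & set(text.lower().split()))

    selected: list[str] = []
    used = 0
    # Highest-overlap candidates first; sorted() is stable, so ties keep input order.
    for text in sorted(candidates, key=overlap, reverse=True):
        cost = approx_tokens(text)
        if overlap(text) == 0 or used + cost > budget:
            continue  # irrelevant, or too large for the remaining budget
        selected.append(text)
        used += cost
    return selected


notes = [
    "Refunds are issued within 14 days of purchase.",
    "Our office dog is named Biscuit.",
    "Refund requests need an order number.",
]
picked = select_context("how do refunds work for my order", notes, budget=20)
print(picked)  # the two refund notes; the unrelated note is left out
```

The point of the sketch is the shape of the decision: score, rank, and stop at the budget, so that irrelevant text never reaches the window at all.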
Karpathy framed it this way in June 2025: people associate "prompts" with the short task descriptions they type in day-to-day use, whereas context engineering is "the delicate art and science of filling the context window with just the right information for the next step." The shift matters because "prompt" implies one instruction block whereas "context" implies a system with multiple inputs, budget constraints, and engineering tradeoffs.
RAG is one technique for filling a zone of this window. Context engineering is the meta-skill of designing the entire window. Conflating them is a common interview mistake.
How it works
The five zones of a context window
Every production context window contains five distinct zones. The order matters because most models give more attention to content at the beginning (primacy effect) and end (recency effect) of the window.
| Zone | Position | Typical size | What belongs here |
|---|---|---|---|
| System prompt | Top (first) | 1,000 to 5,000 tokens | Role, constraints, output format, few-shot examples |
| Conversation history | After system prompt | 500 to 10,000 tokens | Prior turns, often summarized |
| Retrieved context | Middle | 1,000 to 8,000 tokens | RAG chunks, search results, knowledge base excerpts |
| Tool outputs | After retrieval | 200 to 2,000 tokens | Function call results, API responses |
| User message | Bottom (last) | 50 to 500 tokens | Current query or instruction |
Place your most important instructions at the top of the system prompt. The user message sits at the bottom, closest to generation. Everything in the middle competes for attention, so the fewer tokens in the middle zones, the better each zone performs.
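As a rough sketch of the ordering above, the snippet below assembles the five zones top to bottom and trims each one to a cap. The `Zone` class, the `[name]` labels, the word-based truncation, and the cap values are assumptions for illustration; a real system would count tokens with the model's tokenizer and summarize history rather than cut it mid-sentence.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    name: str
    content: str
    max_tokens: int  # hypothetical per-zone cap, loosely following the table above


def truncate(text: str, max_tokens: int) -> str:
    """Keep the first max_tokens whitespace-separated words (crude stand-in for tokens)."""
    return " ".join(text.split()[:max_tokens])


def build_context(zones: list[Zone]) -> str:
    """Join non-empty zones in the order given, each trimmed to its cap."""
    parts = []
    for zone in zones:
        body = truncate(zone.content, zone.max_tokens)
        if body:  # skip empty zones instead of emitting a bare label
            parts.append(f"[{zone.name}]\n{body}")
    return "\n\n".join(parts)


context = build_context([
    Zone("system", "You are a support assistant. Cite the source document.", 5000),
    Zone("history", "", 10000),
    Zone("retrieved", "Refunds are issued within 14 days of purchase.", 8000),
    Zone("tools", "", 2000),
    Zone("user", "How do refunds work?", 500),
])
print(context)
```

Keeping the zone order in one list makes it easy to experiment: reordering or shrinking a zone is a one-line change that you can then measure against your evals.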
System prompt design
The system prompt is your most controlled input. Structure it explicitly with markup, not prose paragraphs.
A production system prompt has four sections: (1) role framing ("You are a senior software engineer reviewing pull requests"), (2) output format instructions (a JSON schema or markdown template), (3) hard constraints ("Always cite the source document. Never fabricate URLs."), and (4) persona guardrails ("If you do not know, say so. Do not guess.").
Use structured markup to signal hierarchy. Anthropic recommends XML tags with Claude (<instructions>, <examples>, <context>). OpenAI models respond well to markdown headers and bold text. Either way, explicit structure tells the model which text is an instruction, which is an example, and which is reference material, the same way headings do for a human reader.
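One possible way to lay out the four sections described above with XML-style tags is sketched here. The tag names (`role`, `output_format`, `constraints`, `guardrails`) and the helper functions are assumptions chosen for this example, not names prescribed by any vendor; the only requirement is that you use them consistently.

```python
def tag(name: str, body: str) -> str:
    """Wrap body in an XML-style tag, with the tag on its own lines."""
    return f"<{name}>\n{body.strip()}\n</{name}>"


def build_system_prompt(role: str, output_format: str,
                        constraints: list[str], guardrails: list[str]) -> str:
    """Assemble the four sections as separately tagged blocks."""
    sections = [
        tag("role", role),
        tag("output_format", output_format),
        tag("constraints", "\n".join(f"- {c}" for c in constraints)),
        tag("guardrails", "\n".join(f"- {g}" for g in guardrails)),
    ]
    return "\n\n".join(sections)


prompt = build_system_prompt(
    role="You are a senior software engineer reviewing pull requests.",
    output_format='Return JSON: {"summary": string, "issues": [string]}',
    constraints=["Always cite the source document.", "Never fabricate URLs."],
    guardrails=["If you do not know, say so.", "Do not guess."],
)
print(prompt)
```

Building the prompt from named parts also makes each section independently testable: you can swap or remove one block and rerun your evals without touching the others.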
Interview tip: the working memory frame
Say "the context window is the model's working memory, and everything it knows at inference time comes from what I put there." That one sentence signals engineering depth. Then walk through the five zones.
I have seen teams write a 4,000-token wall of prose as their system prompt and wonder why the model ignores half of it. The model is not disobedient. It is working through a degraded signal. Switching to structured XML tags cut their instruction-following failures by roughly 40%.
The chisel principle
More context is not better context. This is the most counterintuitive lesson in production LLM work.
The "chisel" principle comes from applied-llms.org: you reach the best context by removing superfluous information, not by adding more. Every irrelevant sentence is noise the model must route around. Long prompts with redundant instructions produce worse outputs than short prompts with clean ones.
Treat every token as expensive even when you have budget to spare. Start small. Add only what improves your evaluation metrics. If removing a section does not hurt quality, that section should not be there.
The deletion test
Run your eval suite, then delete 20% of your system prompt. If scores stay the same or improve, that 20% was noise. Repeat until deletions hurt. This is the fastest path to a clean context.
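The deletion test lends itself to automation. The sketch below is a simplified variant: it removes one section at a time rather than a fixed 20%, and it keeps a deletion whenever the score does not drop. The `evaluate` callback stands in for your real eval suite, and `toy_eval` is a made-up scorer that exists only so the example runs.

```python
from typing import Callable


def deletion_test(sections: list[str],
                  evaluate: Callable[[list[str]], float]) -> list[str]:
    """Greedily drop sections whose removal does not lower the eval score."""
    kept = list(sections)
    baseline = evaluate(kept)
    changed = True
    while changed:
        changed = False
        for i in range(len(kept)):
            trial = kept[:i] + kept[i + 1:]
            score = evaluate(trial)
            if score >= baseline:  # same or better without this section
                kept, baseline = trial, score
                changed = True
                break  # restart the scan on the shorter prompt
    return kept


# Toy stand-in for an eval suite (assumption): rewards two essential
# instructions and applies a small penalty per section.
ESSENTIAL = {"Cite the source document.", "Return JSON."}


def toy_eval(sections: list[str]) -> float:
    return sum(s in ESSENTIAL for s in sections) - 0.01 * len(sections)


trimmed = deletion_test(
    ["Be helpful and kind.", "Cite the source document.",
     "Return JSON.", "Remember to be helpful."],
    toy_eval,
)
print(trimmed)
```

With a real eval suite each call to `evaluate` is expensive and noisy, so in practice you would likely delete in larger chunks and require the score to hold across repeated runs before accepting a deletion.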
Prompt decomposition (the pipeline pattern)
One god-prompt that attempts everything at once is the worst pattern in production LLM systems.
AlphaCodium found that decomposing code generation into a pipeline of small, focused prompts improved GPT-4 accuracy (pass@5) on the CodeContests benchmark from 19% to 44%. Each step has a focused goal, a short context, and clear success criteria. Composition beats consolidation.
The pipeline pattern works in four stages: (1) extract requirements from the input, (2) generate a plan, (3) execute the plan (write code, draft text, etc.), (4) review and refine the output. Four prompts. Four small contexts. Each step feeds a structured output into the next step's context.
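The four stages above can be sketched as a chain of small calls. This is an illustrative outline, not AlphaCodium's actual flow: the prompt wording, the `run_pipeline` function, and the `LLM` callable type are assumptions, and `fake_llm` is a stand-in so the example runs without a model.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical: takes a prompt, returns the model's text


def run_pipeline(task: str, llm: LLM) -> dict[str, str]:
    """Run four small prompts in sequence, each seeing only what it needs."""
    requirements = llm(f"List the requirements in this task:\n{task}")
    plan = llm(f"Write a step-by-step plan for these requirements:\n{requirements}")
    draft = llm(f"Carry out this plan:\n{plan}")
    final = llm(
        "Review the draft against the requirements and return a corrected version.\n"
        f"Requirements:\n{requirements}\nDraft:\n{draft}"
    )
    return {"requirements": requirements, "plan": plan, "draft": draft, "final": final}


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    first_line = prompt.splitlines()[0]
    return f"<output for: {first_line}>"


result = run_pipeline("Parse a CSV file and sum the amount column.", fake_llm)
for stage, output in result.items():
    print(stage, "->", output)
```

Note that the execution stage never sees the raw task, only the plan. Each stage's context stays short, and returning every intermediate output lets you evaluate the stages separately.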
I used this pattern on a contract analysis system where a single prompt achieved 62% accuracy on our eval set. Splitting into extraction, reasoning, and formatting stages pushed accuracy to 89% with no model change and no fine-tuning.
Context budget management