Contextual retrieval

TL;DR

Contextual retrieval prepends a short LLM-generated context sentence to each chunk before embedding, giving the vector store a richer signal about what the chunk means in its document.
Anthropic's 2024 research showed contextual retrieval reduces retrieval failure rate by 49% compared to basic chunking, and up to 67% when combined with reranking.
The technique directly improves BM25 recall too: the context prefix injects entity names, document titles, and dates that keyword search can match against.
Cost is manageable with prompt caching: Claude's prompt caching drops context-generation cost to roughly 10% of the naive per-chunk approach.
The key limitation is stale context: updating a source document requires regenerating all context prefixes for its chunks, which is easy to forget and introduces silent retrieval regressions.

Imagine you are building a RAG system for a 300-page legal contract. You split it into 200 chunks of roughly 750 tokens each. Chunk 87 contains: "The penalty for late payment is 1.5% per month compounded quarterly." When a user asks "what is the late payment penalty in the services agreement?", the vector search correctly finds this chunk as semantically similar. The LLM reads the chunk and generates a confident answer. The answer is correct at the surface.

Now chunk 112 contains: "The decision was made in Q3." The user asks "when was the SLA amendment decided?" The vector search might retrieve this chunk because the query mentions decisions and timing. But the LLM has no idea what decision this is, what Q3 means without a year, or even which document this came from. It cannot answer correctly without the context that surrounded this chunk in the original document.

This is the decontextualization problem. When you split a document into chunks, each chunk loses its location, its neighbors, and its document-level meaning. The chunk says "the decision," but the LLM needs to know which decision in which document from which year. The problem is invisible until users start asking about chunks that reference things outside themselves.

What Is It?

Contextual retrieval enriches each document chunk with a short LLM-generated context sentence at ingestion time, so that when the chunk is later retrieved and shown to the model, it carries enough surrounding context to be interpreted correctly without access to the rest of the document.

Think of it like putting a sticky note on each card in a reference filing system. Every card already contains the information extracted from a document, but each sticky note says: "This card is from the 2024 services agreement between Acme Corp and Globex. It covers late payment penalties in Section 14." When you pull the card later, you do not need to find the original binder to understand what the card means.

How It Works

Step 1: Generating the Context Prefix

For each chunk in your corpus, during ingestion you make a single LLM call with both the full document and the specific chunk in context. The prompt asks the model to generate a short situating sentence, not a summary of the chunk itself, but a description of where the chunk sits within the document.

CONTEXT_PROMPT = """<document>
{full_document}
</document>

Here is the chunk we want to situate within the document:
<chunk>
{chunk_content}
</chunk>

Please give a short succinct context to situate this chunk within the overall document
for the purposes of improving search retrieval of the chunk.
Answer only with the succinct context and nothing else.
"""

The model is constrained to output only the context prefix, nothing else. This keeps the output token count low (typically 50-100 tokens) and prevents the model from paraphrasing the chunk content rather than situating it.

Step 2: The Full Ingestion Pipeline

One critical observation: context generation requires the full document in the prompt for every chunk. A 300-page document split into 200 chunks means 200 LLM calls each reading the entire 300 pages. Without prompt caching, this is prohibitively expensive.

Step 3: Prompt Caching Makes This Practical

Anthropic designed contextual retrieval with Claude's prompt caching feature explicitly in mind. The full document goes into the system prompt (or a cacheable prefix). The only part that varies per call is the chunk itself. Claude caches the full document in memory across all chunk processing calls, so you pay for one full document read and then only the chunk tokens for each subsequent call.

For a 300-page document (roughly 75,000 tokens):

Without caching: 200 calls × 75,000 tokens = 15,000,000 input tokens. At $3/1M tokens (Claude 3 Haiku): $45 per document.
With caching: 75,000 tokens read once + 200 × ~~750 chunk tokens = 225,000 total tokens billed. At cache hit pricing (~~$0.30/1M): approximately $4.50 per document.

The cost drops by roughly 90%. This is what makes contextual retrieval practical at scale.

OpenAI's API also supports prompt caching for repeated prefix tokens, making the same optimization available with GPT-4o. Cache key alignment is critical: the document must be in a fixed position in the prompt structure across all calls for caching to engage.

Step 4: Why BM25 Also Improves

BM25 is a classic keyword search algorithm. It matches terms in a query against terms in documents. Without context, a chunk about "late payment penalties" might not contain the phrase "services agreement" or "Acme Corp" because those terms were in the document header, not in this specific section.

After contextual enrichment, the context prefix injects those terms directly into the chunk: "This chunk is from the 2024 Acme Corp services agreement, Section 14." Now BM25 can match on "Acme Corp" or "services agreement" and retrieve the right chunk for any query using those exact terms.

This is why Anthropic's results show improvement on hybrid retrieval (vector + BM25) from contextual retrieval: it is not just the embedding that benefits, it is the keyword index too.

Step 5: Concrete Before / After Example

Document: Q4 2024 incident post-mortem for the authentication service.

Chunk 23 (raw):

"The root cause was a certificate expiry that was not caught by the monitoring alert because the alert threshold was set to 72 hours rather than 7 days. This was the second occurrence in 18 months."

Retrieved for query: "Why did the auth service go down in October 2024?"

Without context, the chunk does not mention "auth service," "October," or "2024." A vector search for "auth service incident October 2024" might not rank this chunk highly.

Chunk 23 (with context prefix):

"This chunk is from the Q4 2024 incident post-mortem for the authentication service (outage date: October 14, 2024). It describes the root cause of the service failure."

"The root cause was a certificate expiry that was not caught by the monitoring alert because the alert threshold was set to 72 hours rather than 7 days. This was the second occurrence in 18 months."

Now the embedding is enriched with "auth service," "Q4 2024," "October," and "outage." The vector similarity to the query increases. BM25 also matches on "authentication service" and "2024." Retrieval rank goes from position 8-12 to position 1-3.

Anthropic's research shows combining all three (contextual retrieval + BM25 + reranking) achieves the best results: up to 67% reduction in retrieval failure, which is significantly better than any two-component combination.

Implementation Sketch

TL;DR

Contextual retrieval prepends a short LLM-generated context sentence to each chunk before embedding, giving the vector store a richer signal about what the chunk means in its document.
Anthropic's 2024 research showed contextual retrieval reduces retrieval failure rate by 49% compared to basic chunking, and up to 67% when combined with reranking.
The technique directly improves BM25 recall too: the context prefix injects entity names, document titles, and dates that keyword search can match against.
Cost is manageable with prompt caching: Claude's prompt caching drops context-generation cost to roughly 10% of the naive per-chunk approach.
The key limitation is stale context: updating a source document requires regenerating all context prefixes for its chunks, which is easy to forget and introduces silent retrieval regressions.

The Problem It Solves

What Is It?

How It Works

Step 1: Generating the Context Prefix

CONTEXT_PROMPT = """<document>
{full_document}
</document>

Here is the chunk we want to situate within the document:
<chunk>
{chunk_content}
</chunk>

Please give a short succinct context to situate this chunk within the overall document
for the purposes of improving search retrieval of the chunk.
Answer only with the succinct context and nothing else.
"""

Step 2: The Full Ingestion Pipeline

Step 3: Prompt Caching Makes This Practical

For a 300-page document (roughly 75,000 tokens):

Without caching: 200 calls × 75,000 tokens = 15,000,000 input tokens. At $3/1M tokens (Claude 3 Haiku): $45 per document.
With caching: 75,000 tokens read once + 200 × ~~750 chunk tokens = 225,000 total tokens billed. At cache hit pricing (~~$0.30/1M): approximately $4.50 per document.

The cost drops by roughly 90%. This is what makes contextual retrieval practical at scale.

Step 4: Why BM25 Also Improves

This is why Anthropic's results show improvement on hybrid retrieval (vector + BM25) from contextual retrieval: it is not just the embedding that benefits, it is the keyword index too.

Step 5: Concrete Before / After Example

Document: Q4 2024 incident post-mortem for the authentication service.

Chunk 23 (raw):

"The root cause was a certificate expiry that was not caught by the monitoring alert because the alert threshold was set to 72 hours rather than 7 days. This was the second occurrence in 18 months."

Retrieved for query: "Why did the auth service go down in October 2024?"

Without context, the chunk does not mention "auth service," "October," or "2024." A vector search for "auth service incident October 2024" might not rank this chunk highly.

Chunk 23 (with context prefix):

"This chunk is from the Q4 2024 incident post-mortem for the authentication service (outage date: October 14, 2024). It describes the root cause of the service failure."

"The root cause was a certificate expiry that was not caught by the monitoring alert because the alert threshold was set to 72 hours rather than 7 days. This was the second occurrence in 18 months."

Contextual retrieval

TL;DR

The Problem It Solves

What Is It?

How It Works

Step 1: Generating the Context Prefix

Step 2: The Full Ingestion Pipeline

Step 3: Prompt Caching Makes This Practical

Step 4: Why BM25 Also Improves

Step 5: Concrete Before / After Example

Implementation Sketch

Continue Reading with Premium

Comments

Contextual retrieval

TL;DR

The Problem It Solves

What Is It?

How It Works

Step 1: Generating the Context Prefix

Step 2: The Full Ingestion Pipeline

Step 3: Prompt Caching Makes This Practical

Step 4: Why BM25 Also Improves

Step 5: Concrete Before / After Example

Implementation Sketch

Continue Reading with Premium

Comments