Late chunking
Learn how embedding a full document before chunking its token embeddings improves cross-reference resolution and boosts NDCG@10 by 2-8% over standard chunk-then-embed pipelines.
TL;DR
- Late chunking inverts the standard RAG indexing order: instead of chunking text and then embedding each chunk independently, it embeds the full document first, then derives chunk vectors by mean-pooling the resulting token embeddings.
- Because the token embeddings are generated with full-document attention, each chunk vector carries context from the rest of the document: pronouns resolve correctly and cross-references land in the right semantic space.
- Jina AI's 2024 paper showed 2-8% NDCG@10 improvement over standard chunking on MTEB retrieval benchmarks, with the largest gains on documents containing cross-references and pronoun chains.
- The technique requires a long-context embedding model (8K+ tokens); most standard models have 512-token context windows and cannot process a full document.
- Indexing is roughly 3-5x slower than parallel per-chunk embedding because the full document must be processed as a single forward pass.
- The key limitation: adding or updating a single chunk within a document requires recomputing all chunk embeddings for that document, not just the changed one.
The Problem It Solves
In standard RAG, you split a document into chunks and embed each chunk independently. The embeddings are generated in isolation: the model sees only the 256-512 tokens of that chunk and nothing else. This creates a fundamental information gap. Consider a 10-page legal contract. Section 7 states: "The party of the second part agrees to the payment terms defined herein." Embedded independently, "the party of the second part" has no grounding in the actual entity that phrase refers to from Section 1. The embedding for this sentence drifts toward generic legal language rather than toward the specific party's obligations.
The problem compounds for scientific and technical documents. A medical study's results section refers to "the intervention group" without re-specifying the 40-patient cohort defined in the methods section. A technical specification refers to "the module described above" where "above" means a different chunk entirely. When these chunks are embedded in isolation, the embeddings represent decontextualized fragments: the vector for "the intervention group showed a 23% improvement" does not know that the intervention group received Drug X versus placebo.
This is not a problem that chunk overlap or contextual retrieval descriptions fully solve. Overlap adds neighbouring text, but only from adjacent chunks, not from sections far earlier in the document. Contextual retrieval adds an LLM-generated description of context, which is effective but relies on an LLM accurately describing the context for every chunk at ingestion time. Late chunking solves the problem earlier, at the embedding layer, by giving every token access to the full document's attention context before any chunking happens.
What Is It?
Late chunking is an embedding strategy that generates context-aware chunk vectors by running the full document through an embedding model's transformer layers first, obtaining one contextualised token embedding per token, and then computing each chunk's vector as the mean pool of its token embeddings.
Think of it like reading a whole book before taking notes on each chapter. If you take notes on each chapter without reading the rest of the book, your notes on chapter 7 miss the context established in chapter 1. If you read the entire book first, then write chapter summaries, each summary reflects your understanding of how each chapter fits into the whole. Late chunking is the "read first, summarise after" approach applied to embedding.
How It Works
The Math
Standard chunking produces one vector per chunk by feeding each chunk's text into the embedding model independently:
vec(chunk_k) = EmbedModel(text[k_start : k_end])
Late chunking feeds the entire document into the embedding model once, producing one token embedding per token, then mean-pools within each chunk's token range:
token_embeddings = EmbedModel(full_document) # one vector per token
vec(chunk_k) = mean(token_embeddings[k_start : k_end])
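The pooling step in the second formula can be sketched in a few lines. This is a minimal illustration, assuming the token embeddings are already available as plain Python lists and that chunk spans are (start, end) token indices with an exclusive end; the names mean_pool and late_chunk_vectors are illustrative and not taken from any library.

```python
def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors, dimension by dimension."""
    if not vectors:
        raise ValueError("cannot pool an empty token range")
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def late_chunk_vectors(token_embeddings, spans):
    """One vector per chunk: the mean of the token embeddings in its span.

    spans is a list of (start, end) token indices, end exclusive.
    """
    return [mean_pool(token_embeddings[start:end]) for start, end in spans]


# Toy stand-in for the output of one full-document forward pass (4 tokens, 2 dims).
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]
chunk_vectors = late_chunk_vectors(token_embeddings, [(0, 2), (2, 4)])
```

The model is called once, outside this function; the chunking logic only ever slices and averages what that single call produced.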
The critical difference is that token_embeddings[i] was computed with attention over all other tokens in the document. Token 4500 "knows about" every other token in the document, including those that come after it, not just the tokens in its own chunk. When you mean-pool tokens 4096-4351 to get chunk 16's vector (counting 256-token chunks from zero), that vector carries contextual information from across the entire document.
Why Full-Document Attention Matters
Transformer attention is the mechanism that allows each token's representation to be influenced by all other tokens in the sequence. In a standard embedding model, attention operates only within the input sequence. If the input is a 256-token chunk, token 200 can attend to tokens 1-256, but to nothing outside that chunk.
When you pass the full document as a single input, token 4200 can attend to token 50 (where the entity definition lives), token 500 (where the methodology is described), and token 3800 (the previous mention of the same concept). The resulting token embedding for token 4200 is informed by all of those. When you later mean-pool the range containing token 4200, that contextual information is preserved in the chunk vector.
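The effect is easy to see on a toy example. The sketch below is not a real embedding model: it assumes a single attention head with identity query, key, and value projections and hand-written 2-dimensional token vectors, purely to show that the same token gets a different representation depending on what else is in the input.

```python
import math


def self_attention(vectors):
    """Toy single-head attention with identity projections.

    Each output is a softmax-weighted average of all input vectors,
    weighted by dot-product similarity to the query token.
    """
    outputs = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) for key in vectors]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * v[d] for w, v in zip(weights, vectors))
            for d in range(len(query))
        ])
    return outputs


# The first two tokens stand in for an entity definition early in the document;
# the last two stand in for a later chunk that refers back to it.
document = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
later_chunk = document[2:]

in_isolation = self_attention(later_chunk)[0]   # chunk embedded on its own
in_full_document = self_attention(document)[2]  # same token, full-document input
```

In the full-document case the token's first dimension picks up weight from the early "definition" tokens; embedded in isolation, that signal is simply absent from the input.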
Chunking Boundaries in Token Space
A subtle but important implementation detail: chunk boundaries must be defined in token space, not character space. You tokenize the full document, identify which tokens correspond to structural boundaries (paragraph breaks, section headings) or use fixed token-count windows, then apply those boundaries as index ranges into the token embeddings tensor.
The boundaries applied to the token embeddings must match the boundaries used for the text chunks stored in the database. This alignment is essential: chunk_3's text must correspond to exactly the tokens whose embeddings are pooled for chunk_3's vector.
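One way to keep text and vectors aligned is to derive the token ranges from the same character offsets used to cut the stored text. The sketch below assumes a toy whitespace tokenizer and paragraph breaks as boundaries; all function names are illustrative. Real tokenizers usually expose per-token character offsets as well (Hugging Face fast tokenizers, for example, accept return_offsets_mapping=True), and those would replace the toy tokenizer here.

```python
import re


def tokenize_with_offsets(text):
    """Toy tokenizer: one token per whitespace-delimited word, with char offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def paragraph_char_spans(text):
    """(start, end) character spans of non-empty paragraphs split on blank lines."""
    spans, pos = [], 0
    for para in text.split("\n\n"):
        if para.strip():
            spans.append((pos, pos + len(para)))
        pos += len(para) + 2  # skip the "\n\n" separator
    return spans


def char_spans_to_token_spans(offsets, char_spans):
    """Map character spans to (start, end) token index spans, end exclusive.

    A token belongs to a chunk if it starts inside the chunk's character range.
    """
    token_spans = []
    for c_start, c_end in char_spans:
        inside = [i for i, (_, start, _) in enumerate(offsets) if c_start <= start < c_end]
        if not inside:
            raise ValueError(f"chunk {c_start}-{c_end} contains no tokens")
        token_spans.append((inside[0], inside[-1] + 1))
    return token_spans


text = "Acme Corp is the supplier.\n\nThe supplier ships monthly."
offsets = tokenize_with_offsets(text)
char_spans = paragraph_char_spans(text)
token_spans = char_spans_to_token_spans(offsets, char_spans)
```

Storing text[c_start:c_end] alongside the vector pooled from the matching token span keeps the database text and the embedding in lockstep.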
Handling Documents Longer Than the Context Window
jina-embeddings-v3 supports 8,192 tokens. A 30-page document is roughly 15,000-20,000 tokens, which exceeds the model's context window. A workable approach is to split the document into non-overlapping sections that each fit within 8K tokens, apply late chunking within each section, and store the resulting vectors. You lose some cross-section context (a reference in section 2 to content in section 1 is not bridged if the sections are processed separately), but you retain context within each section.
For documents under 8K tokens (roughly 12-16 pages at the same density), apply late chunking across the entire document with no pre-splitting required.
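The pre-splitting step reduces to index arithmetic over the token sequence. A minimal sketch, assuming fixed-size sections and fixed 256-token chunks for simplicity (in practice you would likely cut sections at headings, and leave room for any special tokens the model adds); the function names are illustrative.

```python
def split_into_sections(num_tokens, max_tokens=8192):
    """Non-overlapping (start, end) token ranges, each at most max_tokens long."""
    return [
        (start, min(start + max_tokens, num_tokens))
        for start in range(0, num_tokens, max_tokens)
    ]


def chunk_spans_within(section, chunk_size=256):
    """Fixed-size chunk spans that never cross the section's boundaries."""
    start, end = section
    return [(s, min(s + chunk_size, end)) for s in range(start, end, chunk_size)]


# An 18,000-token document needs three passes through an 8,192-token model.
sections = split_into_sections(18_000)
last_section_chunks = chunk_spans_within(sections[-1])
```

Each section gets its own forward pass, and chunk vectors are pooled only from token embeddings produced in that pass, which is why context does not carry across section boundaries.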
Storage and Query Side
At query time, late chunking is invisible. The query is embedded with the same model using a standard single-query embedding call. The resulting query vector is compared against chunk vectors in the index using cosine similarity, exactly as in standard RAG. Late chunking is a purely indexing-time technique and does not change the query path.
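The query path can be sketched as a plain cosine-similarity ranking. This assumes a small in-memory index of (chunk_id, vector) pairs and hand-written vectors; a real deployment would use a vector database, and the names cosine and top_k are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector, index, k=3):
    """Return the ids of the k chunks most similar to the query, best first."""
    scored = sorted(
        ((cosine(query_vector, vector), chunk_id) for chunk_id, vector in index),
        reverse=True,
    )
    return [chunk_id for _, chunk_id in scored[:k]]


# Chunk vectors as produced at indexing time (late-chunked or not, the shape is the same).
index = [("c0", [1.0, 0.0]), ("c1", [0.0, 1.0]), ("c2", [0.7, 0.7])]
best = top_k([1.0, 0.1], index, k=2)
```

Nothing in this code knows how the chunk vectors were produced, which is the point: switching to late chunking only touches the indexing pipeline.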
Implementation Sketch
This is a simplified implementation showing the core mechanism. Production use requires batching and GPU acceleration for large corpora.
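A minimal end-to-end sketch of that mechanism, under loud assumptions: the tokenizer is a toy whitespace splitter, chunks are fixed token windows, and encode_tokens is a hypothetical callable standing in for a long-context embedding model that returns one contextualised vector per token. The toy_encoder below exists only so the example runs; it is not a real model.

```python
import re


def late_chunk(text, encode_tokens, chunk_size=256):
    """Return [{"text": ..., "vector": ...}] records built by late chunking.

    encode_tokens (hypothetical interface) takes the token strings of the
    whole document and returns one contextualised vector per token.
    """
    matches = list(re.finditer(r"\S+", text))  # toy whitespace tokenizer
    if not matches:
        return []
    tokens = [m.group() for m in matches]
    token_embeddings = encode_tokens(tokens)  # one pass over the full document
    if len(token_embeddings) != len(tokens):
        raise ValueError("encoder must return exactly one vector per token")

    records = []
    for start in range(0, len(tokens), chunk_size):
        end = min(start + chunk_size, len(tokens))
        window = token_embeddings[start:end]
        dim = len(window[0])
        vector = [sum(v[d] for v in window) / len(window) for d in range(dim)]
        # Slice the stored text with the same boundaries used for pooling.
        chunk_text = text[matches[start].start():matches[end - 1].end()]
        records.append({"text": chunk_text, "vector": vector})
    return records


def toy_encoder(tokens):
    """Stand-in encoder: every vector depends on the whole input via doc_mean."""
    doc_mean = sum(len(t) for t in tokens) / len(tokens)
    return [[float(len(t)), doc_mean] for t in tokens]


records = late_chunk("aa bbbb cc dddd", toy_encoder, chunk_size=2)
```

Swapping toy_encoder for a function that runs a real long-context model and returns its final-layer token states would leave the rest of the pipeline unchanged.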