Context-aware chunking

TL;DR

Context-aware chunking splits documents at structural boundaries (headings, code blocks, tables) or semantic topic transitions rather than arbitrary character counts.
Fixed-size chunking at 1,000 characters routinely severs function signatures from their bodies, table headers from their rows, and section headings from their content.
Structure-based chunking with Docling improves retrieval recall by 20-40% on structured documents (PDFs, DOCX, HTML) compared to fixed-size splitting.
The sweet spot for most use cases is 256-512 tokens per chunk with 50-100 token overlap between adjacent chunks.
Semantic chunking adds 5-15ms latency per chunk due to embedding calls; structure-based chunking adds 200ms-2s per document for parsing but zero per chunk.
The key tradeoff: content-aware parsers (Docling, Unstructured.io) handle structure better, but on scanned PDFs without embedded text layers they depend on an OCR step, and extraction quality degrades or fails without one.

The Problem It Solves

Imagine you have a 50-page technical specification in PDF format. You run a naive fixed-size chunker with a 1,000-character window. Chunk 23 ends mid-sentence: "...the authentication flow requires validating the JWT token against the". Chunk 24 starts: "public key stored in the secrets manager, following RFC 7519." Both chunks are semantically incomplete. Neither can be retrieved reliably for a query about JWT validation because neither contains a coherent semantic unit.

This gets worse with structured content. A Python function spanning 800 characters can be split at character 500, right in the middle of the function body, whenever a window boundary happens to fall there. The first chunk contains the signature and the first few lines. The second chunk starts with "return processed_results" and provides no context about what processed_results is or which function it belongs to. When the LLM receives this second chunk, it cannot explain what the function does, because the function definition is gone.

Tables are especially fragile. If a 10-row table is split so that the header row lands in the first chunk and the data rows land in the second, neither chunk can answer queries about the data: the rows have lost their column names, and the header has nothing to describe. Each chunk holds half the semantic information, and the vector embeddings are correspondingly degraded. The core issue is that fixed-size splitting treats documents as byte streams rather than as structured information artifacts.
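To see the failure mode concretely, here is a minimal sketch of a naive fixed-size chunker. The function name and the sample source are illustrative, and the window is shrunk to 40 characters (an assumption made for display purposes) so the effect shows up on a short snippet.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive windows of `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Illustrative input: a tiny function standing in for a longer one.
source = (
    "def total(items):\n"
    "    processed_results = [i * 2 for i in items]\n"
    "    return sum(processed_results)\n"
)

chunks = fixed_size_chunks(source, size=40)
for n, chunk in enumerate(chunks):
    # Only the first chunk contains the signature; later chunks start mid-body.
    print(n, repr(chunk))
```

Nothing is lost when the chunks are concatenated, but each one is embedded and retrieved on its own, so the later chunks carry code with no indication of which function it belongs to.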

What Is It?

Context-aware chunking splits documents at boundaries that preserve semantic completeness: structural boundaries defined by the document's own format (headings, code blocks, tables, paragraphs), or semantic boundaries detected by measuring embedding-space drift between consecutive sentences.

Think of it like a librarian who knows that a chapter summary belongs with its chapter, not with the start of the next chapter. The librarian does not cut chapters at page 50 just because every chapter must be the same length. Each chapter is treated as a natural unit, and only subdivided further when it is genuinely too large for the index.

How It Works

Context-aware chunking has three distinct approaches, each appropriate for different content types and quality requirements.

Approach 1: Structure-Based Chunking

Structure-based chunking reads the document's format to find natural split points. For Markdown and HTML, heading tags (H1, H2, H3) define section boundaries. For DOCX files, paragraph styles indicate headings and body text. For PDFs, Docling reconstructs the document hierarchy from layout analysis.

The key rule: certain content types are always atomic. Code blocks are never split mid-block. Tables are never split between the header row and its data rows. Numbered lists are kept together. These are semantic units where partial content is worse than no content.
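The heading-boundary rule and the atomic-code-block rule can be sketched for Markdown with the standard library alone. This is a simplified illustration, not how Docling or Unstructured.io work internally; the function name is hypothetical, and tables and lists are not handled.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker, built indirectly for readability
HEADING = re.compile(r"#{1,6} ")  # ATX headings: one to six '#' then a space


def split_markdown_sections(markdown: str) -> list[str]:
    """Split at heading lines, but never inside a fenced code block."""
    sections: list[list[str]] = []
    current: list[str] = []
    in_code = False
    for line in markdown.splitlines():
        if line.startswith(FENCE):
            in_code = not in_code  # entering or leaving a code block
        is_heading = bool(HEADING.match(line)) and not in_code
        if is_heading and current:
            sections.append(current)  # close the previous section
            current = []
        current.append(line)
    if current:
        sections.append(current)
    return ["\n".join(section) for section in sections]


doc = "\n".join([
    "# Setup",
    "Install the package.",
    "## Usage",
    "Run the script:",
    FENCE + "python",
    "# this comment is not a heading",
    "print('hello')",
    FENCE,
    "## Notes",
    "Nothing else.",
])

sections = split_markdown_sections(doc)
```

The `in_code` flag is what keeps the code block atomic: the comment line starting with `#` would otherwise be mistaken for a heading and the block would be cut in two.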

Libraries for structure-based chunking:

Docling (IBM, open source): handles PDF, DOCX, HTML, Markdown with layout understanding
Unstructured.io: broad format support (emails, PowerPoint, images with OCR)
LlamaParse: LLM-powered PDF parsing, handles complex multi-column layouts

Approach 2: Semantic Chunking

Semantic chunking does not rely on document structure at all. It detects topic boundaries by measuring how much consecutive sentences diverge in embedding space.

The algorithm: embed groups of 2-3 consecutive sentences together. Compute cosine similarity between adjacent sentence groups. When similarity drops below a threshold (typically 0.6-0.75), insert a chunk boundary. When two sentences are about the same topic, their embeddings are similar. When the topic shifts, the embeddings diverge.

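LangChain's SemanticChunker and LlamaIndex's SemanticSplitterNodeParser implement this. The main parameter to tune is the breakpoint threshold. With the similarity threshold described above, setting it too high (say, 0.85) splits at minor wording changes and creates many tiny chunks; setting it too low (say, 0.5) rarely triggers a split and creates huge chunks that contain multiple topics. Both libraries express the breakpoint in terms of embedding distance rather than similarity (by default as a percentile), so in their settings the direction is reversed: a higher value yields fewer, larger chunks.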
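The boundary-detection loop can be sketched as follows. This is one possible reading of the algorithm, not the LangChain or LlamaIndex implementation. The `toy_embed` function and its `VOCAB` are hypothetical stand-ins for a real embedding model, so the similarity scores (and the 0.3 threshold used here) are not comparable to scores from a real model.

```python
import math
import re

# Hypothetical vocabulary for the toy embedding; a real system calls an embedding model.
VOCAB = ("jwt", "token", "key", "database", "backup", "storage")


def toy_embed(text: str) -> list[int]:
    """Count words that start with each vocabulary term."""
    words = re.findall(r"[a-z]+", text.lower())
    return [sum(word.startswith(term) for word in words) for term in VOCAB]


def cosine(a: list[int], b: list[int]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences, threshold=0.3, group_size=2, embed=toy_embed):
    """Start a new chunk where adjacent sentence groups fall below `threshold`."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Compare the group ending before sentence i with the group starting at i.
        left = " ".join(sentences[max(0, i - group_size):i])
        right = " ".join(sentences[i:i + group_size])
        if cosine(embed(left), embed(right)) < threshold:
            chunks.append(current)  # similarity dropped: close the chunk
            current = []
        current.append(sentences[i])
    chunks.append(current)
    return chunks


sentences = [
    "The JWT token is signed with a private key.",
    "The service validates the JWT token with the public key.",
    "Database backups run nightly on the replica.",
    "Each database backup is copied to cold storage.",
]
chunks = semantic_chunks(sentences, threshold=0.3)
```

With these inputs the only split falls between the second and third sentences, where the topic changes from token validation to backups. Passing a different `embed` callable is how a real model would be plugged in.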

Approach 3: Hybrid Chunking

Hybrid chunking combines both approaches. First, apply structure-based splitting to get section-level segments. Within each segment, apply semantic similarity to detect sub-topic transitions. Also enforce a max-token ceiling (typically 512 tokens) as a fallback to prevent any single chunk from becoming too large.

Docling's HybridChunker follows this pattern. It chunks along the parsed document structure first, then applies tokenizer-aware refinement within each structural section: oversized sections are split to fit the token limit, and undersized neighbors under the same heading are merged. This is a good default choice for enterprise document processing with mixed content types.
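A minimal sketch of the structure-first split with a token ceiling is shown below. It is not Docling's implementation, and the semantic sub-topic pass is left out for brevity. The function names are hypothetical, sections are assumed to arrive as (heading, body) pairs from a structure-based parser, and whitespace word counts stand in for a real tokenizer.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: whitespace-separated words."""
    return len(text.split())


def split_oversized(paragraph: str, max_tokens: int) -> list[str]:
    """Hard-split a paragraph that alone exceeds the ceiling (normalizes whitespace)."""
    words = paragraph.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]


def hybrid_chunks(sections: list[tuple[str, str]], max_tokens: int = 512) -> list[str]:
    """Pack paragraphs within each section up to `max_tokens` body tokens per chunk."""
    chunks = []
    for heading, body in sections:
        current: list[str] = []
        pieces = [
            piece
            for paragraph in body.split("\n\n")
            for piece in split_oversized(paragraph, max_tokens)
        ]
        for piece in pieces:
            candidate = "\n\n".join(current + [piece])
            if current and count_tokens(candidate) > max_tokens:
                # Ceiling reached: flush, repeating the heading for context.
                chunks.append(heading + "\n" + "\n\n".join(current))
                current = []
            current.append(piece)
        if current:
            chunks.append(heading + "\n" + "\n\n".join(current))
    return chunks


sections = [
    ("## Auth", "Tokens are validated on login.\n\n"
                "Keys rotate every ninety days.\n\n"
                "Expired tokens are rejected immediately."),
    ("## Backups", "Backups run nightly."),
]
chunks = hybrid_chunks(sections, max_tokens=10)
```

Chunks never cross a section boundary, and every chunk repeats its section heading so that a retrieved fragment still says where it came from. The ceiling here applies to the body only; the heading's tokens are not counted.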

Chunk Size and Overlap

Optimal chunk sizes from production benchmarks:

Chunk Size	Effect
Under 100 tokens	Too little context for accurate embeddings; precision drops
256-512 tokens	Sweet spot for most text content
Over 512 tokens	Embeddings average too many topics; recall drops
Full function body (up to 2K tokens)	Correct for code; never split mid-function

Add 50-100 token overlap between adjacent chunks. Overlap prevents information loss at boundaries: a key sentence that falls at the edge of a chunk appears in both that chunk and the next. The cost is slightly more storage and occasional duplicate retrieval, which deduplication handles.
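Overlap amounts to a sliding window whose step is smaller than its size. The sketch below assumes the text has already been tokenized into a list; the function name and the default of 64 overlapping tokens are illustrative choices within the range given above.

```python
def chunk_with_overlap(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Slide a window of `size` tokens, advancing by `size - overlap` each step."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Small numbers so the shared tokens are easy to see.
tokens = [f"t{i}" for i in range(10)]
chunks = chunk_with_overlap(tokens, size=4, overlap=1)
```

Here the last token of each chunk reappears as the first token of the next, which is the duplication that deduplication at retrieval time has to absorb.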

Metadata Enrichment

Every chunk should carry metadata that enables filtering and re-ranking:
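As an illustration only, a chunk record with metadata might look like the sketch below. Every field name here (source, heading_path, content_type, chunk_index) is a hypothetical example chosen for this sketch, not a list taken from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    source: str                                            # hypothetical: originating file or URL
    heading_path: list[str] = field(default_factory=list)  # hypothetical: enclosing headings
    content_type: str = "text"                             # hypothetical: "text", "code", or "table"
    chunk_index: int = 0                                   # hypothetical: position within the source


def filter_chunks(chunks: list[Chunk], content_type: str) -> list[Chunk]:
    """Metadata filter: keep only chunks of the requested content type."""
    return [c for c in chunks if c.content_type == content_type]


chunks = [
    Chunk("JWT validation uses the public key.", "spec.pdf", ["Auth", "Tokens"], "text", 0),
    Chunk("def validate(token): ...", "spec.pdf", ["Auth", "Tokens"], "code", 1),
]
code_only = filter_chunks(chunks, "code")
```

Filtering on such fields before or after vector search narrows the candidates without touching the embeddings themselves.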
