Hierarchical RAG
Learn how hierarchical RAG decouples retrieval granularity from context granularity, retrieving small chunks for precision while feeding the LLM larger parent chunks for context.
TL;DR
- Hierarchical RAG stores two sizes of chunk: small child chunks (100-256 tokens) for retrieval and larger parent chunks (512-1536 tokens) for LLM context.
- The system retrieves using child embeddings (precise, tight semantic scope) but injects the parent chunk into the prompt (rich context for generation).
- Sentence-window retrieval and LangChain's ParentDocumentRetriever are the two most common production implementations.
- LlamaIndex benchmarks show a 15-20% improvement in answer relevancy over standard single-level chunking on document QA tasks.
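- The key tradeoff: roughly double the text storage (each passage is kept both as a child and inside its parent), a vector index made of many small children, and one extra lookup per retrieval hop, in exchange for better answer quality on context-heavy questions.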
The Problem It Solves
You pick a chunk size and you're immediately in a dilemma. Small chunks (100-200 tokens) give you precise retrieval: the embedding captures a tight, focused semantic topic and matches queries accurately. But the LLM receives a fragment. A single paragraph about gross margins lacks the surrounding table, the trend commentary, and the comparison figures it needs to generate a useful answer.
Large chunks (1000+ tokens) give the LLM rich context. But retrieval degrades because the embedding now averages over five topics at once. A query about one specific fact has to match against a dense block covering that fact plus four unrelated discussions. The cosine similarity score gets diluted and the right chunk may not make the top-k cutoff.
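To make the dilution concrete, here is an idealized sketch. It assumes each topic maps to its own orthogonal unit vector and that a chunk embedding behaves like the mean of its topic vectors. Real embedding models are messier, so treat the numbers as illustrative only.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Assumption: five unrelated topics, one orthogonal axis each.
topics = [[1.0 if i == j else 0.0 for j in range(5)] for i in range(5)]

query = topics[0]                  # the query is about topic 0 only
small_chunk = topics[0]            # a small chunk covering just topic 0
large_chunk = mean_vector(topics)  # a large chunk blending all five topics

small_score = cosine(query, small_chunk)
large_score = cosine(query, large_chunk)
print(round(small_score, 3), round(large_score, 3))
```

Under these assumptions the focused chunk scores 1.0 while the blended chunk scores 1/sqrt(5), about 0.447, even though it contains exactly the same fact.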
This is a genuine tension: the ideal size for retrieval and the ideal size for generation are different numbers, usually by a factor of 5 to 10. Standard RAG forces you to pick one size and live with the tradeoff. Hierarchical RAG sidesteps it by using both sizes, each for the purpose it suits, at the cost of some extra storage and one extra lookup.
What Is It?
Hierarchical RAG decouples retrieval granularity from context granularity by storing chunks at two levels, using the small level for retrieval and the large level for generation.
Think of it like a newspaper archive. A search index (the small chunks) contains individual article paragraphs, so your query finds the exact paragraph that mentions the figure you need. But when you retrieve that paragraph, the system also pulls the full article (the parent chunk) so you can read the context around it. You searched at paragraph precision but you read at article resolution.
How It Works
The Parent-Child Data Structure
Every child chunk carries a parent_id metadata field pointing to its parent. This is the structural foundation that makes the pattern work. The child lives in the vector index. The parent lives in a separate storage layer (a docstore, a second database table, or a separate collection). When retrieval returns a child, the system resolves its parent_id to fetch the parent from the docstore.
The key insight is that the parent is never embedded and never searched. Its only job is to serve as context once a child has been found. This keeps the vector index clean and precise because only semantically tight child chunks live there.
A practical size ratio: child at 150-200 tokens, parent at 600-1200 tokens. A 4:1 to 6:1 ratio is typical. The parent should be large enough to include surrounding context but small enough to fit comfortably in the LLM context window alongside other chunks.
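The structure described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the class names, the build_hierarchy helper, and the ID format are all hypothetical, and sizes are counted in whitespace-separated words as a rough stand-in for tokens.

```python
from dataclasses import dataclass

@dataclass
class Parent:
    parent_id: str
    text: str

@dataclass
class Child:
    child_id: str
    parent_id: str  # pointer resolved at retrieval time
    text: str

def build_hierarchy(doc_id, text, parent_size=800, child_size=200):
    """Split text into parents, then split each parent into children.

    Sizes are in words (an assumption); use your model's tokenizer in practice.
    """
    words = text.split()
    parents, children = [], []
    for p_num, p_start in enumerate(range(0, len(words), parent_size)):
        p_words = words[p_start:p_start + parent_size]
        parent_id = f"{doc_id}-p{p_num}"
        parents.append(Parent(parent_id, " ".join(p_words)))
        for c_num, c_start in enumerate(range(0, len(p_words), child_size)):
            c_words = p_words[c_start:c_start + child_size]
            children.append(
                Child(f"{parent_id}-c{c_num}", parent_id, " ".join(c_words))
            )
    return parents, children

sample = " ".join(f"w{i}" for i in range(2000))
parents, children = build_hierarchy("doc1", sample)
docstore = {p.parent_id: p for p in parents}  # parents: key-value lookup only
to_embed = [c.text for c in children]         # only children get embedded
print(len(parents), len(children))
```

With the 4:1 ratio used here, a 2000-word document yields 3 parents and 10 children. Only the children would be sent to the embedding model.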
Two-Level vs. Three-Level Hierarchies
The most common setup is two levels: paragraph-sized children and section-sized parents. This is the right default for most document corpora.
Three levels are possible: sentence-level (for retrieval precision), paragraph-level (for local context), and section-level (for broad context). Three levels let the system dynamically choose how much context to inject. But three levels add storage and lookup complexity without proportional quality improvement for most use cases. I've seen teams adopt three levels and then tune themselves back to two because the marginal gain was not measurable.
Stick with two levels unless you have strong benchmarked evidence that three levels improve your specific task.
Flavor 1: Sentence-Window Retrieval
Sentence-window retrieval stores individual sentences as children and a fixed window of surrounding sentences as the "parent." The window is not a structural unit (like a section) but a configurable buffer: 2-3 sentences before and after the matched sentence.
When a sentence is retrieved, the system returns that sentence plus its window. A window of 5-7 sentences total (the matched sentence plus 2-3 on each side) is the typical default. LlamaIndex implements this as SentenceWindowNodeParser, whose window_size parameter counts the sentences kept on each side of the match, so window_size=3 yields up to 7 sentences in total.
Sentence-window retrieval works best when your documents have fine-grained facts scattered across paragraphs and you need to surface the exact fact while providing enough surrounding context for the LLM to interpret it. It works less well when document sections are long and self-contained, because a fixed window can stop short of the section's end or spill across a section boundary, leaving the context incomplete.
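The windowing mechanism itself is small enough to write by hand. The sketch below is a plain-Python illustration of the idea, not LlamaIndex's implementation: the regex sentence splitter is deliberately naive, and the function and field names are hypothetical.

```python
import re

def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def build_sentence_windows(text, window_size=2):
    """Return one record per sentence together with its surrounding window.

    window_size is the number of sentences kept on EACH side of the match.
    """
    sentences = split_sentences(text)
    records = []
    for i, sentence in enumerate(sentences):
        lo = max(0, i - window_size)
        hi = min(len(sentences), i + window_size + 1)
        records.append({
            "sentence": sentence,                  # embedded and searched
            "window": " ".join(sentences[lo:hi]),  # sent to the LLM
        })
    return records

text = "A one. B two. C three. D four. E five. F six. G seven."
records = build_sentence_windows(text, window_size=2)
print(records[3]["sentence"])
print(records[3]["window"])
```

Note that the window shrinks at document edges: the first sentence here only has neighbors on one side, so its window holds three sentences instead of five.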
Flavor 2: ParentDocumentRetriever (LangChain)
LangChain's ParentDocumentRetriever stores child chunk embeddings in a vector store and full parent documents (or large sections) in a separate InMemoryStore or any other docstore. The association is tracked through a metadata field injected at ingestion time.
The retrieval call first queries the vector store for matching children, then resolves the parent document IDs from those children's metadata, and returns the deduplicated list of parent documents. If three children from the same parent document all match the query, only one copy of the parent is returned.
This deduplication behavior is important. Without it, a query that matches multiple paragraphs in the same long document would inject that document's context multiple times, wasting context window tokens.
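The deduplication step can be sketched as follows. This is a plain-Python illustration of the behavior, not LangChain's actual code; the hit dictionaries, the docstore dict, and the resolve_parents helper are hypothetical shapes chosen for the example.

```python
def resolve_parents(child_hits, docstore, id_key="parent_id"):
    """Map ranked child hits to unique parents, keeping first-seen order."""
    seen = set()
    parents = []
    for hit in child_hits:  # hits arrive best-first
        parent_id = hit[id_key]
        if parent_id in seen:
            continue        # this parent is already in the context
        seen.add(parent_id)
        parents.append(docstore[parent_id])
    return parents

docstore = {
    "p1": "Full text of section 1",
    "p2": "Full text of section 2",
}
child_hits = [
    {"child_id": "c3", "parent_id": "p1"},
    {"child_id": "c1", "parent_id": "p1"},
    {"child_id": "c7", "parent_id": "p2"},
    {"child_id": "c2", "parent_id": "p1"},
]
context = resolve_parents(child_hits, docstore)
print(len(context))
```

Four child hits collapse to two parents, and the parent of the best-ranked child stays first. Keeping first-seen order is a design choice in this sketch so that the most relevant context leads the prompt.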
Flavor 3: Manual Hierarchy with Metadata
For teams that want full control or are using a vector database that does not have a built-in retriever abstraction, manual hierarchy via metadata is the pragmatic approach. Every chunk stores four metadata fields: source_file, section_title, chunk_index, and parent_chunk_id.
At retrieval time, a child hit's parent_chunk_id is used to issue a second query, either a vector database get_by_id call or a SQL join, to fetch the parent. PostgreSQL with pgvector makes this a single join:
-- Resolve a retrieved child to its parent chunk with a self-join.
SELECT p.content
FROM chunks c
JOIN chunks p ON c.parent_chunk_id = p.id
WHERE c.id = $1;  -- $1: id of the retrieved child chunk
This approach works with any storage layer and gives you full visibility into the resolution logic. The cost is that you write the resolution code yourself rather than relying on a framework.
How End-to-End Retrieval Flows
The full round-trip adds roughly 5-15ms to a standard vector search. The extra latency comes from the parent lookup, which is usually a key-value fetch (constant time). It is not a second nearest-neighbor search.
Implementation Sketch
Below is a simplified implementation showing the core mechanism using a PostgreSQL + pgvector backend. This is not production code but the pattern is exact.
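The sketch below keeps the same parent-child pattern but swaps in stand-ins so it runs anywhere: Python's built-in sqlite3 in place of PostgreSQL, and a toy word-count embedding with in-Python cosine ranking in place of a real embedding model and pgvector. Both substitutions, along with every table, column, and function name, are assumptions made for illustration. With pgvector, the ranking step would typically move into SQL as an ORDER BY on a vector distance operator.

```python
import math
import sqlite3
from collections import Counter

def embed(text):
    # Toy embedding (assumption): lowercase word counts, not a real model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunks (id TEXT PRIMARY KEY, parent_chunk_id TEXT, content TEXT)"
)

parents = {
    "p1": "gross margin rose to 42 percent . the table compares margin by region"
          " . management expects the trend to continue",
    "p2": "headcount grew in engineering . hiring slowed in sales"
          " . attrition stayed flat",
}

# Ingestion: parents get parent_chunk_id = NULL and are never embedded.
for pid, text in parents.items():
    conn.execute("INSERT INTO chunks VALUES (?, NULL, ?)", (pid, text))
    for i, sentence in enumerate(text.split(" . ")):
        conn.execute(
            "INSERT INTO chunks VALUES (?, ?, ?)", (f"{pid}-c{i}", pid, sentence)
        )

# Only child chunks go into the (toy) vector index.
child_index = [
    (cid, embed(content))
    for cid, content in conn.execute(
        "SELECT id, content FROM chunks WHERE parent_chunk_id IS NOT NULL"
    )
]

def retrieve(query, k=2):
    q = embed(query)
    # Step 1: nearest-neighbor search over children only.
    ranked = sorted(child_index, key=lambda item: cosine(q, item[1]), reverse=True)
    contexts, seen = [], set()
    for cid, _ in ranked[:k]:
        # Step 2: key lookup from child to parent (not a second vector search).
        parent_id, content = conn.execute(
            "SELECT p.id, p.content FROM chunks c "
            "JOIN chunks p ON c.parent_chunk_id = p.id WHERE c.id = ?",
            (cid,),
        ).fetchone()
        # Step 3: deduplicate parents before building the prompt.
        if parent_id not in seen:
            seen.add(parent_id)
            contexts.append(content)
    return contexts

contexts = retrieve("what was gross margin", k=2)
print(contexts)
```

Both top-ranked children belong to the same parent, so the query returns a single parent chunk: the match happened at sentence precision, but the LLM would receive the whole surrounding section.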