Multi-query RAG
Learn how generating multiple query variants and merging their results in parallel closes the single-embedding coverage gap and improves recall by 10-25% on complex questions.
TL;DR
- Multi-query RAG generates 3-5 alternative phrasings of the original question and runs all of them in parallel against the vector database, then deduplicates and merges the results.
- A single query embedding covers one neighborhood in embedding space; multi-query covers multiple neighborhoods, recovering 10-25% more relevant documents on complex multi-faceted questions.
- All searches run in parallel, so the latency cost is one LLM call (100-300ms with a fast model) plus one search round-trip, not N sequential search round-trips.
- Deduplicate by document ID before passing to the LLM; without deduplication the same chunk fills multiple context window slots and wastes context budget.
- The key tradeoff is vector database cost: 5 queries means 5x more read operations, which matters at scale on paid managed services.
The Problem It Solves
You are building a RAG system for a development platform. A senior engineer asks: "How should I handle database migrations in a production microservices environment?" The system retrieves 5 chunks and the LLM produces a generic answer about running db migrate in CI/CD. The engineer marks it unhelpful and moves on.
The problem is that the question contains at least three separate retrieval needs. "Database migration strategies and patterns" needs one set of documents. "Zero-downtime schema changes for running services" needs a completely different set. "Database ownership boundaries in microservices architectures" needs a third set. A single embedding of the full question produces a vector that lands near the semantic centroid of all three topics. That centroid might not be close to any of the three specific topic clusters in your knowledge base.
This is the coverage gap problem. Single-query retrieval is a spotlight: it illuminates one region of your embedding space very well. Complex, multi-faceted questions need a floodlight. The knowledge to answer them exists in your corpus, spread across different document clusters. A single embedding cannot reach all of them at once.
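The centroid effect can be illustrated with a toy calculation. This is only a sketch under strong assumptions: the three 3-dimensional vectors are invented, real embeddings have hundreds of dimensions and are rarely orthogonal, and modeling the blended question as the plain average of the topic vectors is a simplification.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Invented toy "embeddings": one axis per topic cluster.
topics = {
    "migration_patterns": [1.0, 0.0, 0.0],
    "zero_downtime": [0.0, 1.0, 0.0],
    "db_ownership": [0.0, 0.0, 1.0],
}

# Assumption: the multi-faceted question embeds as the average of the topics.
blended = [sum(axis) / len(topics) for axis in zip(*topics.values())]

# Similarity of the blended query to each topic cluster.
similarities = {name: cosine(blended, vec) for name, vec in topics.items()}

# A focused query that sits exactly on one topic, for comparison.
focused = cosine(topics["zero_downtime"], topics["zero_downtime"])
```

In this toy setup the blended query scores 1/sqrt(3) (about 0.58) against every cluster, while a query focused on one topic scores 1.0 against that cluster. The blended vector is equally mediocre everywhere, which is the spotlight problem in miniature.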
What Is It?
Multi-query RAG expands one complex question into N parallel queries, each targeting a different angle or sub-topic, then merges the results into a single deduplicated context set before generation.
Think of it like a research team. Instead of sending one researcher to find everything about a topic, you split the question into sub-tasks and send a different researcher to each sub-topic simultaneously. The researchers work in parallel. An editor combines their findings, removes duplicate sources, and hands the synthesized pile to the analyst who writes the final report.
How It Works
Step 1: Generating Query Variants
The generator LLM receives the original question and a prompt that instructs it to produce N alternative phrasings. The key constraint in the prompt is diversity: you want each variant to target a different sub-topic or angle, not five ways of saying the same thing.
VARIANT_PROMPT = """You are generating search queries to improve document retrieval.
Original question: {query}
Generate {n} alternative search queries that capture different aspects of this question.
Each query should approach the information need from a different angle or sub-topic.
Maximize the diversity of topic areas covered; do not repeat yourself.
Return a JSON array of {n} query strings, nothing else.
"""
The word "diversity" belongs in the prompt. Without it, cheaper models tend to produce near-synonyms rather than genuinely different angles. "How do I run database migrations?" and "How do you execute database schema changes?" are the same query; they will retrieve the same documents. "Zero-downtime schema changes for live production services" and "Database migration tooling comparison: Flyway vs Liquibase" hit entirely different document clusters.
Three types of variants work best. Restatements approach the same question with different vocabulary ("database migrations" vs. "schema changes"). Sub-questions decompose the question into its component parts (migration strategy, zero-downtime techniques, microservices boundaries, tooling). Perspective shifts approach from different roles (developer perspective, ops perspective, security perspective). For most complex questions, a mix of sub-questions and perspective shifts provides the most coverage.
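The generation step might be wired up roughly as follows. This is a minimal sketch, not a definitive implementation: `generate_variants`, the `llm` callable (prompt in, raw text out), and the `prompt_template` parameter are hypothetical names introduced here, and in practice the template would be a prompt like the one shown earlier in this step. The sketch parses the JSON array defensively, drops exact repeats, and falls back to the original query alone if the model returns something unparseable.

```python
import json
from typing import Callable


def generate_variants(
    query: str,
    n: int,
    llm: Callable[[str], str],  # hypothetical: takes a prompt, returns raw model text
    prompt_template: str,       # must contain {query} and {n} placeholders
) -> list[str]:
    """Return the original query followed by up to n distinct variants."""
    raw = llm(prompt_template.format(query=query, n=n))
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        parsed = []  # malformed output: fall back to the original query alone
    if not isinstance(parsed, list):
        parsed = []

    queries = [query]  # the original query always comes first
    seen = {query.strip().lower()}
    for item in parsed:
        if not isinstance(item, str):
            continue
        key = item.strip().lower()
        if key and key not in seen:  # skip blanks and case-insensitive repeats
            seen.add(key)
            queries.append(item.strip())
    return queries[: n + 1]


# Usage with a stub standing in for a real model call.
def fake_llm(prompt: str) -> str:
    return '["zero-downtime schema changes", "Flyway vs Liquibase", "Zero-downtime schema changes"]'


queries = generate_variants(
    "How should I handle database migrations?",
    3,
    fake_llm,
    "Question: {query}\nReturn a JSON array of {n} queries.",
)
```

Exact-match deduplication only catches literal repeats. Near-synonym variants still pass through, which is why the diversity instruction in the prompt carries most of the weight.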
Step 2: Parallel Search
All queries, including the original, run simultaneously via async calls to the vector database. The total search latency equals the latency of the slowest single search, not the sum of all searches. A single vector search typically runs in 5-50ms against a well-indexed corpus, so 5 parallel searches take roughly 5-50ms total.
Always include the original query in the search set. It acts as the semantic anchor. If the generated variants drift away from the user's actual intent, the original query still pulls in the most directly relevant documents.
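The parallel fan-out can be sketched with `asyncio.gather`. The `search` callable below is an assumption: it stands in for whatever async query method your vector database client exposes, and `fake_search` is a stub that only simulates a round-trip.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical signature for an async vector search: (query, top_k) -> chunk IDs.
SearchFn = Callable[[str, int], Awaitable[list[str]]]


async def search_all(
    queries: list[str], search: SearchFn, top_k: int = 5
) -> list[list[str]]:
    """Run one search per query concurrently; results keep the query order."""
    return await asyncio.gather(*(search(q, top_k) for q in queries))


# Stub standing in for a real vector database client.
async def fake_search(query: str, top_k: int) -> list[str]:
    await asyncio.sleep(0.01)  # simulated network round-trip
    return [f"{query}-doc{i}" for i in range(top_k)]


# The original question goes first, followed by the generated variants.
result_sets = asyncio.run(
    search_all(["original question", "variant a", "variant b"], fake_search, top_k=2)
)
```

Because the searches are awaited together, the wall-clock time is close to that of the slowest single search rather than the sum of all of them.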
Step 3: Deduplication and Merging
Without deduplication, a chunk that is highly relevant to the question will appear in multiple result sets. If it appears in all 5 searches and you blindly concatenate results, the same chunk occupies 5 context window slots, 4 of them wasted. With a context window of 8,000 tokens and target chunks of ~500 tokens each, you can fit at most 16 chunks, and fewer once the prompt and the answer are budgeted for. Duplicates directly reduce the number of unique documents the LLM can reason with.
Deduplication by document ID is the minimum. Every chunk in your index should have a stable ID. Maintain a set of seen IDs and skip any chunk whose ID was already added to the merged list.
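A minimal sketch of the seen-ID approach, assuming each chunk exposes a stable `id` attribute. The `Chunk` dataclass is a hypothetical shape used for illustration, not a type from any particular library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical chunk shape: a stable ID plus the chunk text."""
    id: str
    text: str


def dedupe_by_id(result_sets: list[list[Chunk]]) -> list[Chunk]:
    """Flatten result sets, keeping only the first occurrence of each chunk ID."""
    seen: set[str] = set()
    merged: list[Chunk] = []
    for results in result_sets:
        for chunk in results:
            if chunk.id in seen:
                continue  # already in the merged list
            seen.add(chunk.id)
            merged.append(chunk)
    return merged


a, b, c = Chunk("a", "..."), Chunk("b", "..."), Chunk("c", "...")
merged = dedupe_by_id([[a, b], [b, c], [a, c]])
```

Six result slots collapse to three unique chunks here. Note that this keeps first-seen order and discards the information about how often each chunk appeared.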
Frequency weighting is more powerful. A chunk that appears in 4 out of 5 result sets carries a stronger convergence signal than one that appears in only 1, so it should rank higher. This is the core idea behind RAG-Fusion (2023), which applies Reciprocal Rank Fusion (RRF) across multiple result sets: each chunk's score is the sum of 1/(rank + k) over all result sets in which it appears.
from collections import defaultdict

def reciprocal_rank_fusion(result_sets: list[list], k: int = 60) -> list:
    """
    RRF merging: chunks appearing in multiple result sets get promoted.
    k=60 is the standard constant from the original RRF paper.
    """
    scores = defaultdict(float)
    docs = {}
    for results in result_sets:
        # Ranks are 1-based, so the top hit in a set contributes 1/(1 + k)
        for rank, chunk in enumerate(results, start=1):
            scores[chunk.id] += 1.0 / (rank + k)
            docs[chunk.id] = chunk  # Store the chunk object by ID
    # Sort by aggregated RRF score descending
    sorted_ids = sorted(scores, key=scores.get, reverse=True)
    return [docs[chunk_id] for chunk_id in sorted_ids]
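To see why appearing in several result sets matters more than a single top placement, it helps to compute two scores by hand. This is a small worked sketch: `rrf_score` is a hypothetical helper, and ranks are assumed to be 1-based with k = 60.

```python
def rrf_score(ranks: list[int], k: int = 60) -> float:
    """Sum of 1 / (rank + k) over the 1-based ranks a chunk received."""
    return sum(1.0 / (rank + k) for rank in ranks)


# Ranked third in four different result sets: 4 * 1/63
consensus = rrf_score([3, 3, 3, 3])

# Ranked first, but in only one result set: 1/61
single_hit = rrf_score([1])
```

The consensus chunk scores 4/63 (about 0.063) against 1/61 (about 0.016) for the single top hit. With k = 60 the gap between adjacent ranks is small, so the number of result sets a chunk appears in dominates its final position.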
End-to-End Example
A user asks: "How should I handle database migrations in a production microservices environment?"
The generator produces four variants:
- "Database migration rollback strategies and blue-green deployment patterns"
- "Zero-downtime schema changes for services with continuous traffic"
- "Database ownership in microservices: shared vs. per-service databases"
- "Flyway vs Liquibase: migration tooling comparison for production systems"
Five parallel searches run (the original plus 4 variants), each returning the top 5 chunks from the corpus. Four chunks appear in 3 or more result sets, a strong convergence signal, so the RRF merge promotes them to the top positions. Deduplication leaves twelve unique chunks, and reranking against the original question cuts the set to 8. The LLM receives 8 focused, diverse chunks and produces a comprehensive answer covering migration tooling, rollback strategies, zero-downtime techniques, and database ownership patterns.