Query expansion
Learn how expanding short user queries with an LLM before vector search closes the vocabulary gap that causes RAG systems to miss relevant documents entirely.
TL;DR
- Query expansion rewrites or supplements a short user query with an LLM before embedding, closing the vocabulary gap between casual user language and formally written knowledge base documents.
- HyDE (Hypothetical Document Embeddings) generates a hypothetical answer to the query and embeds that instead, typically improving NDCG@10 by 5-20% on domain-specific corpora.
- Generative expansion adds 50-150ms latency using a fast model like gpt-4o-mini; the API cost is approximately $0.000015 per query, negligible at most scales.
- Always keep the original query for final answer generation and reranking; only the vector search step uses the expanded query.
- The key risk is over-expansion: expansions beyond 100 tokens cause embedding drift and can hurt retrieval rather than help it.
The Problem It Solves
Your support team's knowledge base contains an article titled "Performance bottleneck in the data ingestion pipeline: diagnosing connection pool exhaustion and N+1 query patterns." A user opens the chat widget and types: "Why is my thing slow?" The vector search finds nothing useful. The embedding of "Why is my thing slow?" sits in a region of semantic space far from any of the formal documentation. The LLM apologizes and says it couldn't find relevant information.
The vocabulary mismatch problem is the silent killer of RAG quality. Users write queries in casual, spoken language: 3-10 words, informal phrasing, abbreviations, no domain terminology. Knowledge bases are written in formal, complete sentences with precise technical vocabulary. The embedding model compresses both into vectors, but even models trained on corpora that pair questions with long-form answers still leave a measurable gap between "why is my thing slow" and "connection pool exhaustion and N+1 query patterns."
This gap is real and consistent across domains. Healthcare users type "my head hurts" while the corpus says "tension-type cephalgia." Developers type "API dying" while runbooks say "service degradation due to upstream dependency timeouts." The fix is not to rewrite all your documentation in casual language. The fix is to rewrite the query to match the documentation's register before searching.
What Is It?
Query expansion bridges the vocabulary gap by rewriting or augmenting a short, informal user query into a longer, more formal form before it reaches the vector database.
Think of it like a reference librarian. You walk in and say "I'm looking for something about why companies fail." A good librarian translates that into "organizational failure, business collapse, strategic misalignment, founder error, market timing" and searches under all of those headings simultaneously. The patron's words were fine; they just were not the words the catalog uses.
How It Works
Three Subtypes of Query Expansion
Not all expansion strategies work the same way. There are three distinct approaches, and choosing the right one depends on your latency budget, domain specificity, and query characteristics.
Generative expansion uses an LLM to rewrite the query into a more detailed, formal version. You pass the original query to a cheap, fast model with a system prompt describing your knowledge base domain. The model returns an expanded query that adds technical vocabulary, related terms, and specificity. Latency cost: 50-150ms with gpt-4o-mini. This is the most versatile approach because it handles arbitrary query types and adapts to context in the prompt.
HyDE (Hypothetical Document Embeddings) generates a hypothetical document that would answer the query, then embeds that hypothetical document and uses it for search. The insight is that embedding models are trained on document corpora, so a document-shaped vector sits closer to real documents than a question-shaped vector does. Instead of embedding "Why is my app slow?", you generate a fake 200-word paragraph explaining app slowness and embed that. Typical retrieval improvement: 5-20% NDCG@10 on knowledge-dense corpora.
Synonym and term expansion is a rules-based approach that maps known abbreviations and casual terms to formal vocabulary without an LLM call. You maintain a domain-specific dictionary: "API" expands to "API, REST endpoint, HTTP endpoint, web service interface"; "slow" expands to "slow, latency, performance, throughput, response time." There is no per-query LLM cost, and latency is essentially zero, but it requires ongoing maintenance of the synonym dictionary and misses novel terminology.
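The rules-based approach is small enough to sketch in full. The dictionary below is a hypothetical stand-in that reuses the two example entries from the paragraph above, and `expand_terms` is an illustrative name rather than a library function. A production dictionary would be much larger and would probably need phrase matching, not just single-token lookup.

```python
import re

# Hypothetical domain dictionary; both entries mirror the examples in the text.
SYNONYMS = {
    "api": ["REST endpoint", "HTTP endpoint", "web service interface"],
    "slow": ["latency", "performance", "throughput", "response time"],
}


def expand_terms(query: str, synonyms: dict[str, list[str]] = SYNONYMS) -> str:
    """Append formal synonyms for every known term found in the query."""
    additions: list[str] = []
    # Lowercase and split on non-alphanumerics so "API?" still matches "api".
    for token in re.findall(r"[a-z0-9]+", query.lower()):
        for term in synonyms.get(token, []):
            if term not in additions:
                additions.append(term)
    if not additions:
        # No known terms: search with the query unchanged.
        return query
    # Keep the original wording first so it still anchors the embedding.
    return f"{query} ({', '.join(additions)})"
```

Calling `expand_terms("Why is my API slow?")` keeps the user's wording and appends the formal terms in parentheses. A query with no dictionary hits passes through untouched.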
HyDE in Depth
HyDE is the most counterintuitive of the three, so it deserves a closer look at the mechanism.
When a user asks a question, the embedding of that question sits in "question space" in the embedding model's representation. The question embedding and document embeddings come from the same model, but they represent semantically different structures. A question says "what is X?" while a document says "X is defined as..." The vectors are related but not coincident.
HyDE sidesteps this by generating a plausible answer. The model does not need to be correct; it just needs to produce text that looks structurally like an answer. The fictional answer uses the vocabulary, sentence structure, and entities that real answers would use. When you embed the fictional answer, its vector lands in "answer space," close to real answers in the knowledge base.
The critical detail in HyDE: always rerank using the original query, not the hypothetical answer. Reranking with the fictional answer would score chunks based on how much they resemble your invention rather than how much they answer the user's actual question.
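Here is a minimal sketch of that flow. It assumes you already have four pieces available as plain callables: an LLM completion function, an embedding function, a vector search function, and a reranker. All four parameter names (`generate`, `embed`, `search`, `rerank`), the prompt wording, and the default `top_k` are placeholders for whatever your stack provides, not a specific library's API.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def hyde_retrieve(
    query: str,
    generate: Callable[[str], str],                  # assumed: prompt -> LLM text
    embed: Callable[[str], Vector],                  # assumed: text -> embedding
    search: Callable[[Vector, int], list[str]],      # assumed: (vector, k) -> chunks
    rerank: Callable[[str, list[str]], list[str]],   # assumed: (query, chunks) -> ordered chunks
    top_k: int = 20,
) -> list[str]:
    """Retrieve with a hypothetical answer, then rerank with the real question."""
    # 1. Ask the model for a plausible answer. It does not need to be correct.
    hypothetical = generate(f"Write a short passage that answers: {query}")
    # 2. Embed the hypothetical answer instead of the question.
    candidates = search(embed(hypothetical), top_k)
    # 3. Rerank against the ORIGINAL query, never the invented passage.
    return rerank(query, candidates)
```

Passing the dependencies in as callables keeps the sketch independent of any particular vendor and makes the one rule that matters easy to verify: the hypothetical text reaches `embed`, and only the original query reaches `rerank`.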
The Expansion Prompt
The expansion prompt is load-bearing. A vague prompt produces vague expansions that add little recall. A domain-specific prompt that tells the model what kind of documents it is searching against produces expansions that use the exact vocabulary those documents contain.
EXPANSION_PROMPT = """You are a search query optimizer for a {domain} knowledge base.
The knowledge base contains: {kb_description}.
Original query: {query}
Rewrite this query to improve retrieval against formal documentation.
Add relevant technical terms, acronyms, and related concepts.
Keep the expansion to 2-3 sentences and under 100 tokens.
Do not add information not implied by the original query.
Return only the expanded query text, nothing else."""
The {kb_description} field is important. "A support knowledge base about cloud infrastructure" produces different expansions than "A knowledge base containing 3,000 legal contracts and regulatory filings." The model's expansion vocabulary shifts to match the domain.
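To show how the template gets used, here is a hedged sketch of the call site. `expand_query` is an illustrative helper, and `complete` stands in for whichever LLM client you use (any function that takes a prompt string and returns text). The fallback to the original query on an empty response is my own assumption about sensible behavior, not something the template requires.

```python
from typing import Callable


def expand_query(
    query: str,
    template: str,                      # e.g. the EXPANSION_PROMPT string above
    domain: str,
    kb_description: str,
    complete: Callable[[str], str],     # assumed: prompt -> model response text
) -> str:
    """Fill the prompt template, call the model, and return the expanded query."""
    prompt = template.format(
        domain=domain,
        kb_description=kb_description,
        query=query,
    )
    expanded = complete(prompt).strip()
    # Assumption: if the model returns nothing usable, search with the original.
    return expanded or query
```

Because the user's query is passed as a format value rather than spliced into the template, curly braces inside the query do not break the formatting step.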
Expansion Length and Embedding Drift
There is a real tradeoff between expansion richness and embedding focus. Longer expansions cover more vocabulary but dilute the query's focal point. An expansion like "slow AND fast AND reliable AND database AND cache AND network AND GC AND CPU AND memory AND disk AND..." has so many terms that the resulting embedding points roughly equally in many directions and no longer has a strong signal for any specific topic.
The practical ceiling for expansion length is 100 tokens (roughly 75 words or 2-3 sentences). Expansions of 200+ tokens typically hurt retrieval compared to the original query. I've tested this empirically: the sweet spot is 50-80 tokens covering 3-5 key technical synonyms.
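One way to enforce that ceiling is a guard between the expansion step and the embedding step. The sketch below estimates tokens from the word count using the same rough ratio as above (75 words per 100 tokens) and falls back to the original query when the expansion runs long. Both the function names and the fallback policy are assumptions; a real tokenizer would give exact counts, and you might prefer to truncate or re-prompt instead.

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough token estimate: about 4 tokens for every 3 whitespace-separated words."""
    return math.ceil(len(text.split()) * 4 / 3)


def choose_search_text(original: str, expansion: str, max_tokens: int = 100) -> str:
    """Use the expansion only if it is non-empty and within the token budget."""
    if not expansion.strip():
        return original
    if estimate_tokens(expansion) > max_tokens:
        # Over-long expansions tend to drift; the original query is the safer input.
        return original
    return expansion
```

Logging how often the guard trips is a cheap signal that the expansion prompt needs tightening.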
Combining Strategies
Query expansion paired with reranking is the strongest combination in standard RAG pipelines. Expansion increases recall at the first retrieval step (the right document gets retrieved). Reranking increases precision by scoring the expanded result set against the original query. Neither can substitute for the other: expansion does not help if the right document is retrieved but still ranks 8th, and reranking cannot promote a document that the first retrieval step never returned.