RAG architectures
Learn the three RAG architectures (naive, advanced, modular), when HyDE and reranking are worth the complexity, and how to diagnose which layer is causing quality failures.
TL;DR
- Naive RAG (chunk, embed, retrieve top-K, stuff into prompt) breaks on ambiguous queries and poor chunk boundaries. It's a starting point, not a destination.
- Advanced RAG fixes specific failure modes by adding query rewriting before retrieval and reranking after it.
- Modular RAG treats each stage as a swappable component. Any stage can be independently replaced, A/B-tested, or skipped.
- HyDE (Hypothetical Document Embeddings) solves vocabulary mismatch: generate a fake answer, embed it, retrieve against that embedding instead of the raw query.
- Reranking (retrieve 50, rerank to 5) improves accuracy 15-30% for complex queries. The cost is worth it when precision matters.
The problem it solves
Your knowledge base has the right information. The LLM still gives wrong answers. This is the core RAG frustration, and it usually means retrieval is failing, not generation.
Raw vector similarity search retrieves chunks that are semantically close to the query, but "close" is not the same as "relevant." A user asking "what's our refund policy for SaaS subscriptions?" might get chunks about refund accounting processes instead of customer-facing policy text. The embedding model doesn't understand the query's intent, just its surface structure.
RAG architectures exist to close this gap between embedding similarity and actual relevance. Different architectures address different failure modes, so the right architecture depends on which failure mode you're seeing.
What is it?
RAG (Retrieval-Augmented Generation) grounds an LLM's responses in external documents by retrieving relevant chunks at query time and including them in the prompt. The LLM generates an answer conditioned on those retrieved chunks rather than relying solely on its training data.
There are three generations of RAG architectures: naive RAG (the original pattern), advanced RAG (targeted improvements at each pipeline stage), and modular RAG (a compositional framework where any stage is replaceable). Each generation solves problems the previous one couldn't handle.
How it works
Naive RAG
The baseline pipeline: split documents into chunks, embed each chunk, store in a vector database, embed the user query at query time, retrieve the top-K most similar chunks, stuff them into the prompt, generate.
This works well for simple factual retrieval over clean, well-structured documents. It breaks when queries are ambiguous, when the relevant information spans multiple chunks, or when chunk boundaries cut through conceptual units.
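To make the baseline concrete, here is a minimal sketch of the naive pipeline using only the standard library. The bag-of-words `embed` function is a toy stand-in for a real embedding model, the in-memory list stands in for a vector database, and every name here (`chunk`, `retrieve`, `build_prompt`) is illustrative rather than taken from any library. Generation is left out; the sketch stops at the assembled prompt.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words term counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def chunk(document: str, max_words: int = 40) -> list[str]:
    """Fixed-size chunking by word count: simple, but blind to meaning."""
    words = document.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def retrieve(query: str, index: list[tuple[str, Counter]], top_k: int = 3) -> list[str]:
    """Rank every stored chunk by similarity to the query and keep the top K."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]


def build_prompt(query: str, chunks: list[str]) -> str:
    """Stuff the retrieved chunks into the prompt ahead of the question."""
    context = "\n\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"


docs = [
    "Refunds for annual subscriptions are prorated within 30 days of purchase.",
    "The finance team records refunds in the ledger at month end.",
    "Support is available by email on weekdays.",
]
# "Vector database": a list of (chunk text, embedding) pairs.
index = [(c, embed(c)) for d in docs for c in chunk(d)]
query = "refunds for annual subscriptions"
prompt = build_prompt(query, retrieve(query, index, top_k=1))
```

The fixed-size `chunk` function is exactly the part that cuts through conceptual units on longer documents, which is where the failure modes above come from.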
Advanced RAG
Advanced RAG improves the naive pipeline at three stages: before, during, and after retrieval.
Pre-retrieval: Query rewriting transforms the user's query before embedding it. HyDE (Hypothetical Document Embeddings) is one of the most effective techniques: ask the LLM to generate a hypothetical answer to the query, embed that hypothetical answer, then retrieve against it. This works because the hypothetical answer tends to use the same vocabulary as the documents in your knowledge base, whereas a short user query often doesn't. Retrieval accuracy typically improves 10-20% with HyDE.
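The vocabulary-mismatch effect is easy to see in a toy setting. The sketch below assumes a bag-of-words `embed` function in place of a real embedding model and a `canned_llm` function in place of a real LLM call; both are hypothetical stand-ins, as are `nearest` and `hyde_retrieve`. The raw query shares words with the wrong document, while the hypothetical answer shares words with the right one.

```python
import math
import re
from collections import Counter
from typing import Callable


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def nearest(vector: Counter, docs: list[str]) -> str:
    """Return the document most similar to the given vector."""
    return max(docs, key=lambda d: cosine(vector, embed(d)))


def hyde_retrieve(query: str, docs: list[str], generate: Callable[[str], str]) -> str:
    """HyDE: embed a generated draft answer instead of the raw query."""
    hypothetical = generate(f"Write a detailed answer to: {query}")
    return nearest(embed(hypothetical), docs)


def canned_llm(prompt: str) -> str:
    # Hypothetical stand-in for an LLM call; returns a fixed draft answer.
    return "Refunds are issued within a fixed number of days of purchase."


docs = [
    "How can I get my invoice? Invoices are emailed monthly.",
    "Refunds are issued within 30 days of purchase.",
]
query = "Can I get my money back?"
raw_hit = nearest(embed(query), docs)              # matches the invoice doc
hyde_hit = hyde_retrieve(query, docs, canned_llm)  # matches the refund doc
```

The draft answer does not need to be correct. It only needs to sound like your documents, which is why a small, fast model is usually enough for this step.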
Retrieval: Hybrid search combines dense vector search with sparse BM25 keyword matching. Dense retrieval handles semantic similarity; BM25 handles exact keyword matches (product names, IDs, technical terms). MMR (Maximal Marginal Relevance) diversifies the result set so you don't retrieve five nearly identical chunks.
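MMR is short enough to sketch in full. The version below works on plain Python lists as vectors and returns the indices of the selected documents; the function name `mmr` and the `lambda_mult` parameter are my own choices, not any particular library's API. At each step it picks the candidate with the best trade-off between relevance to the query and similarity to what has already been selected.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec: list[float], doc_vecs: list[list[float]], k: int,
        lambda_mult: float = 0.5) -> list[int]:
    """Select k document indices, balancing relevance against redundancy.

    lambda_mult=1.0 is plain relevance ranking; lower values favor diversity.
    """
    selected: list[int] = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            # Highest similarity to anything already picked (0 if nothing yet).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query_vec = [1.0, 1.0, 0.0]
doc_vecs = [
    [1.0, 0.2, 0.0],   # most relevant
    [1.0, 0.2, 0.05],  # near-duplicate of the first
    [0.1, 1.0, 0.0],   # slightly less relevant, but covers different ground
]
diverse = mmr(query_vec, doc_vecs, k=2)                       # skips the near-duplicate
relevance_only = mmr(query_vec, doc_vecs, k=2, lambda_mult=1.0)
```

With the default `lambda_mult`, the near-duplicate loses its slot to the third document; with `lambda_mult=1.0` the result falls back to pure relevance order.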
Post-retrieval: Reranking uses a cross-encoder model that sees both the query and each candidate chunk together. A bi-encoder (used for initial retrieval) encodes query and document independently; a cross-encoder scores them jointly, which is slower but far more accurate. The typical pattern: retrieve top-50 with a fast bi-encoder, then rerank to top-5 with Cohere Rerank, a custom cross-encoder, or ColBERT (strictly a late-interaction model rather than a cross-encoder). Accuracy improves 15-30% for complex queries.
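The retrieve-then-rerank shape can be sketched without any model at all. In the example below, `first_stage_score` is a cheap, order-insensitive word overlap standing in for bi-encoder similarity, and `joint_score` is a toy scorer that reads query and document together and rewards shared word pairs, standing in for a cross-encoder. Both scorers and the function names are assumptions for illustration only; in a real system you would pass your reranking model's scoring function as `scorer`.

```python
import re
from typing import Callable


def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def first_stage_score(query: str, doc: str) -> float:
    """Cheap, order-insensitive overlap; stands in for bi-encoder similarity."""
    return float(len(set(tokens(query)) & set(tokens(doc))))


def joint_score(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: rewards bigrams shared by query and doc."""
    q, d = tokens(query), tokens(doc)
    return float(len(set(zip(q, q[1:])) & set(zip(d, d[1:]))))


def retrieve_then_rerank(
    query: str,
    corpus: list[str],
    scorer: Callable[[str, str], float],
    retrieve_k: int = 50,
    final_k: int = 5,
) -> list[str]:
    # Stage 1: score the whole corpus cheaply and keep a broad candidate set.
    candidates = sorted(
        corpus, key=lambda doc: first_stage_score(query, doc), reverse=True
    )[:retrieve_k]
    # Stage 2: run the expensive scorer only on those candidates.
    reranked = sorted(candidates, key=lambda doc: scorer(query, doc), reverse=True)
    return reranked[:final_k]


corpus = [
    "Accounting policy: subscriptions revenue and refund accruals for finance.",
    "Our refund policy for subscriptions allows cancellation within 30 days.",
    "Contact support for help.",
]
query = "refund policy for subscriptions"
top = retrieve_then_rerank(query, corpus, joint_score, retrieve_k=50, final_k=1)
```

The first two documents look identical to the first stage, since each contains all four query words. Only the joint scorer notices that one of them actually talks about the refund policy for subscriptions.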
Modular RAG
Modular RAG treats the pipeline as a graph of swappable components. Each stage (query transformation, retrieval, reranking, context assembly, generation) is an independent module with defined inputs and outputs. You can swap any module, A/B-test two implementations of the same stage, or add new stages without touching the rest of the pipeline.
LlamaIndex and LangChain both support modular RAG. I find the biggest benefit isn't the swappability itself; it's that a modular architecture forces you to define clear interfaces between stages, which makes debugging dramatically easier.
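One lightweight way to get that shape, without committing to a framework, is to make every stage a function from pipeline state to pipeline state. The sketch below is my own minimal illustration, not how LlamaIndex or LangChain structure things: `RagState`, `run_pipeline`, and the stage names are all hypothetical, and the retriever and generator are trivial stubs.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RagState:
    """Everything a stage may read or write; this is the stage interface."""
    query: str
    search_query: str = ""
    chunks: list[str] = field(default_factory=list)
    answer: str = ""


Stage = Callable[[RagState], RagState]


def run_pipeline(state: RagState, stages: list[Stage]) -> RagState:
    for stage in stages:
        state = stage(state)
    return state


def identity_rewrite(state: RagState) -> RagState:
    state.search_query = state.query
    return state


def make_keyword_retriever(corpus: list[str]) -> Stage:
    def retrieve(state: RagState) -> RagState:
        terms = set(state.search_query.lower().split())
        state.chunks = [doc for doc in corpus if terms & set(doc.lower().split())]
        return state
    return retrieve


def make_truncate(n: int) -> Stage:
    def truncate(state: RagState) -> RagState:
        state.chunks = state.chunks[:n]
        return state
    return truncate


def stub_generate(state: RagState) -> RagState:
    # Placeholder for an LLM call.
    state.answer = f"Answer from {len(state.chunks)} chunk(s)"
    return state


corpus = ["refund policy text", "shipping times", "refund accounting"]
retriever = make_keyword_retriever(corpus)

# Two variants that differ in exactly one stage: easy to A/B-test.
variant_a = [identity_rewrite, retriever, stub_generate]
variant_b = [identity_rewrite, retriever, make_truncate(1), stub_generate]
result_a = run_pipeline(RagState(query="refund"), variant_a)
result_b = run_pipeline(RagState(query="refund"), variant_b)
```

Because every stage has the same signature, adding, removing, or replacing one is a change to a list, and you can inspect the state between any two stages when debugging.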
Context assembly: the underrated stage
How you assemble retrieved chunks matters. LLMs pay more attention to content at the beginning and end of the context window than in the middle (the "lost in the middle" effect). Put the most relevant chunk first.
Remove duplicate chunks before assembly. Retrieval often returns near-identical chunks from different document sections. Before building the prompt, drop any chunk whose embedding has a cosine similarity above 0.97 to a chunk you have already kept.
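A greedy pass is usually enough for this. The sketch below assumes the chunks arrive in relevance order as `(text, embedding)` pairs, with embeddings as plain lists of floats; the `deduplicate` name and that input shape are assumptions for illustration. Because it walks the list in order, the more relevant copy of any near-duplicate pair is the one that survives.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def deduplicate(
    chunks: list[tuple[str, list[float]]], threshold: float = 0.97
) -> list[tuple[str, list[float]]]:
    """Keep a chunk only if it is not too similar to any chunk already kept."""
    kept: list[tuple[str, list[float]]] = []
    for text, vec in chunks:
        if all(cosine(vec, kept_vec) <= threshold for _, kept_vec in kept):
            kept.append((text, vec))
    return kept


chunks = [
    ("Refunds are issued within 30 days.", [1.0, 0.0]),
    ("Refunds are issued within 30 days", [1.0, 0.05]),  # near-identical copy
    ("Support is available on weekdays.", [0.0, 1.0]),
]
unique_texts = [text for text, _ in deduplicate(chunks)]
```

This is quadratic in the number of chunks, which is fine for the handful that survive reranking but worth revisiting if you deduplicate before it.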
Include metadata (source URL, document date, section title) alongside each chunk. This gives the LLM enough information to assess recency and source authority, which reduces hallucination and allows the model to cite sources accurately.
Implementation sketch
This is a simplified implementation showing the core advanced RAG pipeline. Production systems add error handling, caching, and observability around each stage.
```python
async def advanced_rag_query(user_query: str) -> str:
    # NOTE: llm, embed, vector_db, keyword_index, reranker, and
    # merge_and_deduplicate are placeholders for your own clients and helpers.

    # Stage 1: Query rewriting with HyDE
    hypothetical_answer = await llm.generate(
        f"Write a detailed answer to: {user_query}",
        model="gpt-4o-mini",  # Fast model for HyDE
    )
    hyde_embedding = embed(hypothetical_answer)

    # Stage 2: Hybrid retrieval (dense + sparse)
    # Dense search uses the HyDE embedding; BM25 uses the user's exact words.
    dense_results = vector_db.search(hyde_embedding, top_k=50)
    bm25_results = keyword_index.search(user_query, top_k=20)
    candidates = merge_and_deduplicate(dense_results, bm25_results)

    # Stage 3: Cross-encoder reranking against the original query
    scored = reranker.score(query=user_query, documents=candidates)
    top_chunks = sorted(scored, key=lambda x: x.score, reverse=True)[:5]

    # Stage 4: Context assembly (relevance-ordered, with metadata)
    context = "\n\n".join(
        f"[Source: {c.metadata['url']} | {c.metadata['date']}]\n{c.text}"
        for c in top_chunks
    )

    # Stage 5: Grounded generation
    return await llm.generate(
        "Answer based ONLY on the provided context.\n\n"
        f"Context:\n{context}\n\nQuestion: {user_query}",
        model="gpt-4o",
    )
```
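The sketch above leaves `merge_and_deduplicate` undefined. One common way to fill it in is reciprocal rank fusion (RRF), which is my suggestion rather than something this pipeline prescribes. The version below assumes each retriever returns a ranked list of document IDs (the sketch above passes result objects, so you would extract IDs first), and the name `rrf_merge` and the constant `k=60` are conventional choices, not requirements. RRF only looks at ranks, so it sidesteps the problem that cosine scores and BM25 scores live on different scales.

```python
def rrf_merge(*ranked_lists: list[str], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one deduplicated ranking.

    Each document earns 1 / (k + rank) from every list it appears in, so
    documents ranked well by multiple retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


dense_ids = ["a", "b", "c"]   # hypothetical dense-search ranking
sparse_ids = ["c", "a", "d"]  # hypothetical BM25 ranking
fused = rrf_merge(dense_ids, sparse_ids)
```

Documents found by both retrievers ("a" and "c") outrank documents found by only one, and each ID appears once in the output.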
RAG failure diagnosis
When your RAG system gives bad answers, the failure almost always shows up as one of three symptoms. This is the diagnostic framework I use.
- Empty or irrelevant retrieval: The retrieved chunks don't contain the answer. Fix: improve chunking strategy, add query rewriting or HyDE, or check that your embedding model handles the domain vocabulary.
- Right chunks, wrong answer: Retrieval is correct but the LLM still hallucinates or misinterprets. Fix: improve the generation prompt, add explicit instructions to stay grounded in the provided context, or add reranking to surface the most relevant chunk first.
- Inconsistent quality: Sometimes correct, sometimes wrong on the same question. Fix: add MMR for diversity in retrieval, increase top-K and add reranking, or check chunk boundary quality.
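If you have even a small labeled eval set, the triage above can be automated. The sketch below assumes each eval run records which chunk IDs were retrieved, which chunk IDs actually contain the answer (`gold_ids`), and whether the final answer was judged correct; the `EvalRun` structure and the label names are hypothetical. Running the same question several times is what exposes the inconsistency case.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class EvalRun:
    question: str
    retrieved_ids: list[str]
    gold_ids: set[str]      # chunks known to contain the answer (labeled by hand)
    answer_correct: bool


def classify(run: EvalRun) -> str:
    if not set(run.retrieved_ids) & run.gold_ids:
        return "retrieval"   # the answer never reached the prompt
    if not run.answer_correct:
        return "generation"  # right chunks, wrong answer
    return "ok"


def diagnose(runs: list[EvalRun]) -> dict[str, str]:
    """Label each question by its dominant failure, or as inconsistent."""
    by_question: dict[str, list[str]] = defaultdict(list)
    for run in runs:
        by_question[run.question].append(classify(run))
    report = {}
    for question, labels in by_question.items():
        if "ok" in labels and len(set(labels)) > 1:
            report[question] = "inconsistent"  # sometimes right, sometimes wrong
        else:
            report[question] = Counter(labels).most_common(1)[0][0]
    return report


runs = [
    EvalRun("q1", ["c9"], {"c1"}, False),
    EvalRun("q2", ["c2"], {"c2"}, False),
    EvalRun("q3", ["c3"], {"c3"}, True),
    EvalRun("q3", ["c3"], {"c3"}, False),
]
report = diagnose(runs)
```

Counting labels across the whole eval set tells you which of the three fixes to prioritize before you change anything.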
RAG vs fine-tuning
Use RAG when your knowledge changes frequently (product docs, pricing, policies). Use fine-tuning when you need the model to have a specific communication style, domain vocabulary, or format that doesn't change often. Most production systems need both: a fine-tuned base model for style and tone, and RAG for current factual knowledge.
When to use
- You need LLM answers grounded in proprietary or frequently updated documents
- You want to cite sources and reduce hallucination in factual domains
- You need to add knowledge without the cost and time of fine-tuning
- Your queries vary in complexity and a naive pipeline gives inconsistent quality
Real-world examples
Perplexity AI runs a full advanced RAG pipeline: query rewriting, hybrid search across live web content and indexed documents, reranking, and context assembly with source citations. The reranking stage is what allows Perplexity to show high-precision answers for technical queries, not just semantically similar paragraphs.
GitHub Copilot's workspace feature uses modular RAG to retrieve relevant code context (open files, imported modules, recent edits) before generating suggestions. The context assembly stage trims to fit the context budget while prioritizing the most recently edited files.
Customer service chatbots at scale (Intercom, Zendesk AI) use naive RAG for straightforward FAQ retrieval and advanced RAG with reranking for complex multi-step support queries. The architecture is tiered by query complexity.
Limitations and tradeoffs
- Latency stack: Each advanced RAG stage adds latency. HyDE adds one LLM call, reranking adds 100-500ms. Profile your pipeline; the gains may not justify the cost for simple use cases.
- Chunk quality determines ceiling: No retrieval improvement compensates for bad chunking. Semantic chunking (split on meaning, not fixed character count) is worth the implementation cost.
- Threshold calibration is ongoing: The similarity threshold for retrieval and the reranking cutoff both require calibration against real query distributions. They drift as your content and user base evolve.
- Agentic RAG is powerful but unpredictable: Letting an agent decide when and what to retrieve is the most flexible pattern but hardest to debug. Reserve it for use cases where the query distribution is highly variable.
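On the chunk-quality point: full semantic chunking usually involves embeddings or a model, but a cheaper first step is to stop cutting through paragraphs. The sketch below is that simpler, structure-aware variant, not semantic chunking proper. It assumes paragraphs are separated by blank lines and packs whole paragraphs into chunks up to a word budget; `chunk_by_paragraph` and `max_words` are illustrative names.

```python
def chunk_by_paragraph(text: str, max_words: int = 120) -> list[str]:
    """Pack whole paragraphs into chunks without ever splitting a paragraph.

    A single paragraph longer than max_words becomes its own oversized chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in paragraphs:
        n = len(para.split())
        # Start a new chunk if adding this paragraph would exceed the budget.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


text = "A B C\n\nD E\n\nF G H I"
chunks = chunk_by_paragraph(text, max_words=5)
```

Here the first two paragraphs fit the five-word budget together and the third starts a new chunk, so no paragraph is ever cut in half.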
How this shows up in interviews
RAG architecture questions appear in almost every AI/ML system design interview. Interviewers test whether you understand the full retrieval pipeline or just the "stuff chunks into prompt" version.
When to bring it up:
- "Design a question-answering system over company documents"
- "How would you build a knowledge-grounded chatbot?"
- "Your AI assistant gives outdated answers. Debug it."
- Any system design where the LLM needs access to private or changing data
Depth expected by level:
- Junior/Mid: Explain naive RAG pipeline end-to-end. Know that chunking and embedding quality matter. Mention at least one failure mode.
- Senior: Distinguish naive vs advanced RAG. Explain HyDE, hybrid search, and reranking with tradeoffs. Propose a failure diagnosis framework.
- Staff+: Design a modular RAG architecture with A/B-testable stages. Discuss when agentic RAG (agent controls retrieval decisions) beats static pipelines. Quantify latency-accuracy tradeoffs.
| Interviewer asks | Strong answer |
|---|---|
| "Walk me through a RAG pipeline" | "Chunk, embed, store. At query time: rewrite query, hybrid retrieve top-50, cross-encoder rerank to top-5, assemble context with metadata, generate with grounding instructions." |
| "How would you improve retrieval accuracy?" | "HyDE for vocabulary mismatch, BM25 for exact-term queries, cross-encoder reranking. Measure each stage's marginal improvement independently." |
| "When would you NOT use RAG?" | "When knowledge is stable and fits in training data, when latency budget can't accommodate retrieval, or when parametric knowledge already covers the domain." |
| "How do you debug bad RAG answers?" | "Isolate the failing stage: check if retrieval returned relevant chunks, check if those chunks produced a correct answer, check if quality is consistent." |
| "What's the tradeoff with reranking?" | "15-30% accuracy gain for 100-500ms added latency. Route only complex queries through reranking to preserve speed for simple ones." |
Common interview mistakes
| Mistake | Why It Fails | Better Approach |
|---|---|---|
| "Just use embeddings and cosine similarity" | Ignores vocabulary mismatch, chunk quality, and context assembly. Shows naive-only understanding. | Walk through the full pipeline: query rewriting, hybrid retrieval, reranking, context assembly. Show you know where naive RAG breaks. |
| "RAG replaces fine-tuning" | They solve different problems. RAG handles dynamic knowledge; fine-tuning handles style and behavior. | Explain that production systems often need both: fine-tuned model for tone, RAG for current facts. |
| "Just increase top-K for better results" | More chunks dilute context quality and trigger the lost-in-the-middle problem. Higher K without reranking often hurts accuracy. | Retrieve broadly (top-50) but rerank aggressively (to top-5). Quality beats quantity in the context window. |
| "Use the largest embedding model available" | Larger models have higher latency and cost. Domain-specific smaller models often outperform generalist large ones on domain queries. | Match embedding model to your domain. Benchmark with your actual queries, not academic benchmarks. |
| "Chunk at fixed 512-token boundaries" | Fixed chunking splits conceptual units. A policy paragraph cut in half retrieves poorly. | Use semantic chunking that respects document structure (headings, paragraphs, sections). |
Quick recap
- Naive RAG (chunk, embed, retrieve, generate) is a starting point. Most production use cases need at least one advanced RAG improvement to reach acceptable quality.
- HyDE solves vocabulary mismatch by embedding a hypothetical answer instead of the raw query. It's the single highest-leverage improvement for queries that use different terminology than your documents.
- Reranking (retrieve 50, rerank to 5 with a cross-encoder) delivers 15-30% accuracy improvement for complex queries at the cost of 100-500ms added latency.
- Diagnose before improving: check whether retrieval returned the right chunks, whether generation used them correctly, and whether quality is consistent across runs. Each has a different fix.
- Modular RAG treats each pipeline stage as a swappable component. Define clear interfaces between stages and each one can be tested and A/B-tested independently.
- Agentic search (tool-based iterative retrieval) is a viable alternative for smaller corpora with frequent changes, but vector RAG still wins for large-scale semantic retrieval.
- The "lost in the middle" effect means chunk ordering matters. Put the most relevant chunk first in your assembled context.
Related patterns
- Retrieval-augmented generation: The foundational concept that RAG architectures implement. Start here if you're new to grounding LLMs in external data.
- Vector databases for AI: ANN indexes, embedding storage, and similarity search internals are essential for debugging retrieval quality issues.
- LLM evals: Your RAG pipeline needs eval infrastructure to measure quality across stages. Evals tell you which stage is failing.
- Prompt management: The generation prompt determines how well the LLM uses retrieved context. Prompt versioning directly impacts RAG output quality.
- User feedback flywheel: Production RAG quality depends on continuous improvement driven by real user feedback signals.