Retrieval augmented generation
Learn how RAG grounds LLM responses in your data, how the ingestion and retrieval pipelines work, and how to diagnose the most common failure modes in production RAG systems.
TL;DR
- RAG retrieves relevant document chunks at query time and injects them into the context window so the model answers from your data, not from training memory.
- Two pipelines: ingestion (chunk, embed, store) and query (embed, retrieve, rerank, assemble, generate).
- Hybrid retrieval (BM25 + semantic via Reciprocal Rank Fusion) typically outperforms either approach alone on public benchmarks and production evaluation sets.
- Reranking with a cross-encoder is high-leverage and commonly skipped. Don't skip it.
- The RAGAS framework measures faithfulness, answer relevancy, context precision, and context recall, giving you real metrics instead of vibes.
- The "lost in the middle" problem means chunk ordering matters as much as chunk selection.
The problem it solves
Your company has 10,000 internal documents. An LLM trained through 2024 knows nothing about them. You could fine-tune the model on those documents, but fine-tuning is expensive, your documents update weekly, and fine-tuned models can still hallucinate about the very data they were trained on.
You need the model to answer questions grounded in specific, current, private documents. Fine-tuning bakes knowledge into weights. RAG supplies knowledge at query time.
The model doesn't memorize your data. It reads the relevant piece of it right before answering, every time.
What is it?
Retrieval Augmented Generation (RAG) is a pattern where, before calling the LLM, you retrieve the top-K most relevant documents (or document chunks) from a corpus and include them in the context window. The model then answers using those retrieved chunks as its primary source.
Think of it like a librarian. You don't memorize every book in the library. When someone asks a question, you walk to the right shelf, pull the relevant pages, read them, and then answer. RAG makes the LLM work the same way: retrieve first, then generate.
RAG was formalized in a 2020 paper by Lewis et al., led by Facebook AI Research (now Meta AI), and has since become the default architecture for grounding LLMs in private or frequently updated data. It solves training cutoff problems, reduces hallucination on factual questions, and lets you cite the source passages behind each answer.
How it works
There are two distinct pipelines. Ingestion runs offline (or on a schedule). Query runs at request time. Understanding which pipeline to optimize for which problem is the most important RAG debugging skill.
Ingestion pipeline
The ingestion pipeline runs whenever your data changes. For most teams, that means a nightly batch job plus an event-driven trigger for high-priority document updates. The pipeline's job: take raw documents, split them into retrieval-friendly chunks, compute embeddings, and write them to your vector store with metadata for filtering.
I've seen teams spend weeks optimizing their retrieval algorithm when the real problem was their ingestion pipeline silently dropping 30% of their documents. Always verify ingestion completeness before debugging retrieval.
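A completeness check like the one described above can be sketched in a few lines. This is a minimal, dependency-free illustration: the `Chunk` class, `ingest`, and `missing_documents` are hypothetical names, chunks are split by whitespace words rather than real tokens, and the embedding step is left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    doc_id: str
    text: str
    metadata: dict = field(default_factory=dict)


def ingest(documents: dict[str, str], chunk_words: int = 50) -> list[Chunk]:
    """Split each document into word-based chunks and collect them in a list."""
    store: list[Chunk] = []
    for doc_id, text in documents.items():
        words = text.split()
        # A document with no extractable text yields no chunks: a silent drop.
        for start in range(0, len(words), chunk_words):
            piece = " ".join(words[start:start + chunk_words])
            store.append(Chunk(doc_id, piece, {"word_offset": start}))
    return store


def missing_documents(documents: dict[str, str], store: list[Chunk]) -> list[str]:
    """Return IDs of source documents that have no chunk in the store."""
    indexed = {chunk.doc_id for chunk in store}
    return sorted(set(documents) - indexed)
```

Running `missing_documents` after every ingestion job and alerting on a non-empty result is one cheap way to catch dropped documents before anyone starts tuning retrieval.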
Query pipeline
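The query pipeline runs on every request: optionally rewrite the query, embed it, retrieve candidates from the sparse and dense indexes in parallel, rerank, assemble the context, and generate.
Total latency budget for a production RAG call: 800ms to 1.5s. Query rewriting adds 100-200ms (optional but valuable). Embedding is fast (10-30ms). Parallel retrieval takes 30-80ms. Reranking is the hidden cost at 50-200ms. LLM generation dominates at 300-800ms. Those stage ranges sum to roughly 0.5-1.3s, which leaves headroom inside the budget for network and orchestration overhead.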
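The arithmetic behind that budget is easy to check. The sketch below simply sums the per-stage ranges quoted above; the dictionary keys and the `latency_range` helper are illustrative names, not part of any library.

```python
# Per-stage latency ranges in milliseconds, as quoted in the text: (low, high).
STAGES_MS = {
    "query_rewrite": (100, 200),
    "embedding": (10, 30),
    "parallel_retrieval": (30, 80),
    "reranking": (50, 200),
    "llm_generation": (300, 800),
}


def latency_range(stages: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum the best-case and worst-case latency across sequential stages."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


low_ms, high_ms = latency_range(STAGES_MS)
print(f"stages sum to {low_ms}-{high_ms} ms")  # stages sum to 490-1310 ms
```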
For your interview: know this latency breakdown. It shows you've operated RAG at production scale, not just read about it.
Chunking strategies
How you split documents is the first place most RAG systems fail. Chunks that are too large cause the "lost in the middle" problem. Chunks that are too small lose context that spans sentence boundaries.
Fixed-size chunking (e.g., 512 tokens with 50-token overlap) is simple and works for uniform documents like API docs or FAQ pages. The overlap prevents hard splits at sentence boundaries. It's the default most teams start with, and that's fine.
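A minimal sketch of fixed-size chunking with overlap, assuming the document is already tokenized into a list (here any list of strings will do; a real pipeline would use the tokenizer of its embedding model). The function name `fixed_size_chunks` is illustrative.

```python
def fixed_size_chunks(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Slide a window of `size` tokens, stepping by `size - overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once a window has reached the end, to avoid a redundant tail chunk.
        if start + size >= len(tokens):
            break
    return chunks
```

With the defaults, each chunk repeats the last 50 tokens of the previous one, so a sentence that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.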
Semantic chunking splits when the topic changes, detected by embedding similarity between consecutive sentences. It produces variable-length chunks that respect natural topic boundaries, which improves retrieval precision by 8-15% on heterogeneous document collections. The tradeoff: it requires an embedding call per sentence during ingestion.
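The idea can be sketched with a toy embedding. The code below uses bag-of-words counts as a stand-in for a real embedding model (an assumption made purely to keep the example runnable without dependencies), and the 0.2 threshold is arbitrary; in practice both would be replaced and tuned.

```python
import math
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real model)."""
    return Counter(sentence.lower().replace(".", "").split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new chunk whenever similarity to the previous sentence drops below threshold."""
    chunks: list[list[str]] = []
    for sentence in sentences:
        if chunks and cosine(embed(chunks[-1][-1]), embed(sentence)) >= threshold:
            chunks[-1].append(sentence)
        else:
            chunks.append([sentence])
    return chunks
```

Note that every sentence is embedded during ingestion, which is exactly the cost the tradeoff above refers to.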
Parent-child chunking stores small chunks (256 tokens) for retrieval and large chunks (1,024 tokens, the parent) for context injection. You retrieve the small chunk to get high precision, then inject the large parent into the context window for richer information. This is the highest-performing strategy in most benchmarks and the one I recommend for production systems with mixed document types.
| Strategy | Chunk Size | Best For | Key Tradeoff |
|---|---|---|---|
| Fixed-size | 256-512 tokens | Uniform docs, quick start | Splits mid-topic, loses context |
| Semantic | Variable (100-800 tokens) | Mixed document types | Slower ingestion, needs embedding per sentence |
| Parent-child | 256 retrieval / 1,024 context | Production systems | More storage, index complexity |
| Sentence window | Single sentence + window | Precise Q&A | Very granular, high chunk count |
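For the parent-child row, a rough sketch of the bookkeeping may help. It assumes non-overlapping parents split into non-overlapping children, with each child remembering its parent's index; `build_parent_child` and `expand_to_parents` are hypothetical helper names.

```python
def build_parent_child(tokens: list[str], parent_size: int = 1024, child_size: int = 256):
    """Return (parents, children) where each child is (child_tokens, parent_index)."""
    parents = [tokens[i:i + parent_size] for i in range(0, len(tokens), parent_size)]
    children = []
    for parent_index, parent in enumerate(parents):
        for j in range(0, len(parent), child_size):
            children.append((parent[j:j + child_size], parent_index))
    return parents, children


def expand_to_parents(hit_child_ids: list[int], children, parents) -> list[list[str]]:
    """Map retrieved child hits to their parents, keeping hit order and dropping repeats."""
    seen, expanded = set(), []
    for child_id in hit_child_ids:
        parent_index = children[child_id][1]
        if parent_index not in seen:
            seen.add(parent_index)
            expanded.append(parents[parent_index])
    return expanded
```

Only the children are embedded and searched; the parents are looked up by index at assembly time, which is where the extra storage and index complexity come from.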
Retrieval: sparse, dense, and hybrid
Sparse retrieval (BM25): keyword-based, TF-IDF variant. Fast, no embedding needed. Excellent at exact matches: product codes, error messages, specific terminology. Falls apart on paraphrases. "How do I log in?" won't match a document about "authentication procedures."
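To make the exact-match behavior concrete, here is a small from-scratch BM25 scorer. It is a sketch under stated assumptions: whitespace tokenization, the common defaults `k1=1.5` and `b=0.75`, and the IDF variant with `+1` inside the logarithm (which keeps scores non-negative). Production systems would use a search engine rather than this loop.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # no lexical overlap, no contribution
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores


docs = [
    "authentication procedures for new employees",
    "error ERR_CONNECTION_RESET when loading the dashboard",
    "how to log in to the dashboard",
]
print(bm25_scores("ERR_CONNECTION_RESET", docs))  # only the second doc scores above zero
print(bm25_scores("how do i log in", docs))       # the authentication doc scores 0.0
```

The second query shows the paraphrase failure described above: the authentication document shares no tokens with the question, so BM25 cannot see it.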
Dense retrieval: embed both the query and documents, find nearest neighbors in vector space. Catches semantic similarity across paraphrases. Misses exact-term matches when the query hinges on rare strings (IDs, error codes) that the embedding model saw little of in training. A query for "ERR_CONNECTION_RESET" might not retrieve the doc titled with that exact error code.
Hybrid retrieval (BM25 + dense): retrieve candidates from both, merge with Reciprocal Rank Fusion (RRF). My recommendation for almost every production system. BM25 catches what dense misses and vice versa. The combination consistently outperforms either alone on HotpotQA, BEIR, and real-world production evaluation sets.
RRF is elegant: for each document, compute 1 / (k + rank_sparse) + 1 / (k + rank_dense) where k = 60 (standard constant). No tuning needed. Documents that rank well in both systems bubble to the top.
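The formula above translates almost directly into code. In this sketch each ranking is a list of document IDs ordered best-first, ranks start at 1, and a document missing from one list simply gets no contribution from it (a common convention, assumed here). Ties are broken by document ID so the output is deterministic.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several best-first rankings into one list using RRF."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


sparse = ["a", "b", "c"]  # e.g. BM25 order
dense = ["b", "d", "a"]   # e.g. vector-search order
print(reciprocal_rank_fusion([sparse, dense]))  # ['b', 'a', 'd', 'c']
```

Documents `a` and `b` appear in both lists and land on top; `b` edges out `a` because its ranks (2 and 1) beat `a`'s (1 and 3) once the reciprocals are summed.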
For your interview, say you'd default to hybrid retrieval and explain the BM25 vs semantic tradeoff. That distinction signals production experience.
Reranking
Retrieval deliberately casts a wide net (e.g., top-50 candidates), but only the best 5-10 belong in the context window. Reranking is how you do that selection.
A cross-encoder model sees the query and a candidate chunk side-by-side and scores their relevance together. This is fundamentally different from bi-encoder retrieval, which encodes the query and chunk separately. Cross-encoders are 10-50x slower but significantly more precise because they see both texts simultaneously.
You run the cheap bi-encoder retrieval first to get 30-50 candidates from billions of chunks, then run the expensive cross-encoder on only those candidates. This two-stage architecture is how every production RAG system I've worked with operates.
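The two-stage shape can be sketched independently of any particular model. Below, `shared_words` and `shared_bigrams` are toy stand-ins for a bi-encoder and a cross-encoder respectively (assumptions for illustration only), and stage 1 scores every chunk in a loop where a real system would query an ANN index.

```python
def retrieve_then_rerank(query, corpus, cheap_score, expensive_score,
                         n_candidates=50, top_n=5):
    """Shortlist with the cheap scorer, then reorder the shortlist with the expensive one."""
    # Stage 1: cheap scorer over the whole corpus, keep the best few.
    candidates = sorted(corpus, key=lambda chunk: cheap_score(query, chunk),
                        reverse=True)[:n_candidates]
    # Stage 2: expensive scorer runs on the shortlist only.
    return sorted(candidates, key=lambda chunk: expensive_score(query, chunk),
                  reverse=True)[:top_n]


expensive_calls = 0  # counts how often the expensive scorer is invoked


def shared_words(query, chunk):
    """Toy cheap scorer: number of words in common, ignoring order."""
    return len(set(query.split()) & set(chunk.split()))


def shared_bigrams(query, chunk):
    """Toy expensive scorer: number of adjacent word pairs in common."""
    global expensive_calls
    expensive_calls += 1
    q, c = query.split(), chunk.split()
    return len(set(zip(q, q[1:])) & set(zip(c, c[1:])))


corpus = [
    "reset your password in settings",
    "password reset email not arriving",
    "billing settings overview",
    "reset the router",
    "team holiday calendar",
]
best = retrieve_then_rerank("password reset", corpus, shared_words, shared_bigrams,
                            n_candidates=3, top_n=1)
print(best, expensive_calls)  # ['password reset email not arriving'] 3
```

The expensive scorer ran three times for a five-chunk corpus; the same ratio is what keeps cross-encoder cost bounded when the corpus is large.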
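Cohere Rerank, ColBERT v2, and BGE-Reranker are the most common options (ColBERT v2 is strictly a late-interaction model rather than a cross-encoder, but it fills the same second-stage role). I've seen teams skip reranking and then wonder why the model is answering from irrelevant chunks. Don't skip it.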
Interview tip: mention hybrid + rerank as your default
"I'd default to hybrid retrieval (BM25 + semantic via RRF) and add a cross-encoder reranker on the top-50 candidates." That one sentence shows you know the architecture at the level that matters in production.
Context assembly
Retrieval gives you chunks. Assembly determines what actually goes in the context window and in what order. This step is underrated and often ignored entirely.
The "lost in the middle" finding (Liu et al., 2023) showed that LLMs recall information at the start and end of long contexts better than in the middle. If you have 5 retrieved chunks, the most critical one should be first or last, not buried in position 3. Order your chunks by relevance: highest relevance at the edges, lowest in the middle.
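One simple way to get highest-relevance chunks at the edges is to deal the ranked list alternately to the front and the back. This is a sketch of that idea, not the only valid ordering; `order_edges_first` is an illustrative name and the input is assumed to be sorted most-relevant first.

```python
def order_edges_first(chunks_by_relevance: list) -> list:
    """Place the best chunks at the start and end, the weakest in the middle."""
    front, back = [], []
    for index, chunk in enumerate(chunks_by_relevance):
        # Even positions (1st, 3rd, ...) go to the front, odd ones to the back.
        (front if index % 2 == 0 else back).append(chunk)
    return front + back[::-1]


print(order_edges_first(["r1", "r2", "r3", "r4", "r5"]))
# ['r1', 'r3', 'r5', 'r4', 'r2']
```

The most relevant chunk (`r1`) opens the context, the second most relevant (`r2`) closes it, and the least relevant (`r5`) sits in position 3.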
Deduplicate aggressively. Parent-child chunking can return parent and child chunks that overlap significantly. Embedding similarity-based deduplication before context assembly prevents wasting tokens on near-duplicate content.
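A rough sketch of near-duplicate filtering follows. To stay dependency-free it swaps embedding similarity for a token overlap coefficient (shared tokens divided by the smaller set), which is an assumption of this example; it has the useful property of scoring 1.0 when a child chunk is fully contained in its parent. The 0.8 threshold is arbitrary.

```python
def overlap_coefficient(a: str, b: str) -> float:
    """Shared tokens divided by the size of the smaller token set."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    smaller = min(len(tokens_a), len(tokens_b))
    return len(tokens_a & tokens_b) / smaller if smaller else 0.0


def deduplicate(chunks: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a chunk only if it is not too similar to any chunk already kept."""
    kept: list[str] = []
    for chunk in chunks:
        if all(overlap_coefficient(chunk, other) < threshold for other in kept):
            kept.append(chunk)
    return kept
```

Because earlier chunks win, pass the list in relevance order so the more relevant copy is the one that survives.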
Token budgeting matters too. If your model has a 128K context window, don't fill it all with retrieved chunks. Leave room for the system prompt, grounding instructions, conversation history, and the model's own generation. I typically budget 60-70% of the window for retrieved context and reserve the rest.
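The budgeting rule can be sketched as a greedy cutoff. The example assumes chunks arrive in priority order and approximates token counts with whitespace words via a hypothetical `count_words` helper; a real system would plug in its model's tokenizer. The 0.65 default sits in the middle of the 60-70% range mentioned above.

```python
def count_words(text: str) -> int:
    """Crude token estimate: whitespace-separated words."""
    return len(text.split())


def select_within_budget(chunks: list[str], context_window: int = 128_000,
                         retrieval_share: float = 0.65, count_tokens=count_words):
    """Take chunks in order until the retrieval share of the window is used up."""
    budget = int(context_window * retrieval_share)
    selected, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            break  # stop at the first chunk that would overflow the budget
        selected.append(chunk)
        used += cost
    return selected, used
```

Stopping at the first overflow (rather than skipping ahead to smaller chunks) keeps the selection a strict prefix of the ranked list, which is a deliberate simplification here.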
Gotcha: retrieval success does not equal generation success
A chunk being retrieved doesn't mean the model will use it. If your system prompt has vague grounding instructions, the model will still hallucinate even with the right chunk present. Always include explicit instruction: "Answer only using the provided context. If the context doesn't contain the answer, say so."
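A minimal sketch of assembling a grounded prompt with that instruction is shown below. The prompt layout, the numbered `[n]` chunk labels, and the `build_prompt` name are illustrative choices, not a prescribed format.

```python
GROUNDING_INSTRUCTION = (
    "Answer only using the provided context. "
    "If the context doesn't contain the answer, say so."
)


def build_prompt(question: str, chunks: list[str]) -> str:
    """Combine the grounding instruction, numbered chunks, and the question."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return f"{GROUNDING_INSTRUCTION}\n\nContext:\n{context}\n\nQuestion: {question}"


print(build_prompt("How long is the trial?", ["Trials last 14 days.", "Billing is monthly."]))
```

Numbering the chunks also gives the model something concrete to cite, which helps when you need source attribution in the answer.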
Key variants / types
RAG has evolved through distinct generations. Understanding where your system sits in this progression helps you identify what to improve next.
| Variant | Architecture | Key Feature | Best For | Limitation |
|---|---|---|---|---|
| Naive RAG | Retrieve top-K, concatenate, generate | Simple, fast to build | Prototypes, uniform docs | No reranking, poor on ambiguous queries |
| Advanced RAG | + query rewriting, reranking, HyDE | Higher retrieval precision | Production Q&A systems | Added latency (100-300ms) |
| Modular RAG | Pluggable components per stage | Swap retrievers, rankers, generators | Teams with diverse doc types | Architecture complexity |
| Agentic RAG | Agent decides when and whether to retrieve | Conditional retrieval, multi-hop reasoning | Complex research queries | Unpredictable latency, harder to debug |
Naive RAG is what most tutorials teach. Embed the query, retrieve top-5 chunks, stuff them into the prompt, call the LLM. It works for demos but breaks in production because it doesn't handle ambiguous queries, vocabulary mismatches, or noise in retrieved chunks.
Advanced RAG adds the components that make production systems work: query rewriting (expanding the user's query for better recall), HyDE (Hypothetical Document Embeddings, where you generate a hypothetical answer first and use that as the retrieval query), cross-encoder reranking, and structured context assembly. This is where most production systems should be.
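HyDE is easiest to see as a thin wrapper around an existing retriever. In the sketch below, `generate`, `embed`, and `search` are injected callables, and the `fake_generate`, `bag_of_words`, and `overlap_search` stand-ins are entirely hypothetical (a canned answer and set overlap instead of an LLM and a vector index), used only so the example runs.

```python
def hyde_retrieve(query, generate, embed, search, k=3):
    """Embed a generated hypothetical answer instead of the raw query."""
    hypothetical_answer = generate(query)
    return search(embed(hypothetical_answer), k)


# --- toy stand-ins, for illustration only ---
DOCS = ["authentication procedures for SSO sign in", "quarterly billing report"]


def fake_generate(query):
    # A real system would call an LLM here.
    return "Use the authentication procedures page to sign in with SSO"


def bag_of_words(text):
    return set(text.lower().split())


def overlap_search(query_vector, k):
    ranked = sorted(DOCS, key=lambda doc: len(query_vector & bag_of_words(doc)), reverse=True)
    return ranked[:k]


print(hyde_retrieve("how do i log in", fake_generate, bag_of_words, overlap_search, k=1))
# ['authentication procedures for SSO sign in']
```

The hypothetical answer borrows the vocabulary of the target document ("authentication", "SSO"), which is why it can bridge a gap the raw question cannot.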
Modular RAG treats each pipeline stage as a pluggable component. You might use BM25 for one document type and dense retrieval for another, or swap rerankers based on query type. It's the right architecture when you have heterogeneous data sources with different retrieval characteristics.
Agentic RAG gives an AI agent control over the retrieval process. The agent decides whether to retrieve at all, can issue multiple retrieval queries, synthesize results across queries, and decide when it has enough information to answer. This is the frontier, used in systems like Perplexity and multi-hop research assistants. The tradeoff: latency is unpredictable (the agent might make 1 or 5 retrieval calls), and debugging is significantly harder.
The honest recommendation: start with Advanced RAG. Ship it. Only move to Modular or Agentic when you have evidence that Advanced RAG's limitations are hitting you.
When to use / when to avoid
When to use RAG
- You need to answer questions from private, proprietary, or frequently-updated documents
- Your documents are too large to fit in a single context window
- You need to cite sources for answers (compliance, legal, customer trust)
- You're grounding a customer-facing assistant in a product knowledge base
- Your data changes faster than you can retrain (weekly or more frequent updates)
When to avoid RAG
- The task is a capability problem, not a knowledge problem: the model can't perform the task (for example, writing code), rather than lacking data. Use fine-tuning instead.
- Your data source is a live database or API with real-time freshness requirements. Use function calling to query it directly.
- Your entire corpus fits in the context window (under 100K tokens). Just stuff it in. No retrieval needed.
- The expected question types are narrow and predictable. A simpler lookup table or search might outperform the complexity of RAG.
RAG vs fine-tuning vs function calling
RAG adds knowledge at query time. Fine-tuning changes model behavior permanently. Function calling connects the model to live data sources. These are complements, not competitors. Many production systems use all three.
Real-world examples
Notion AI (2023): Notion's Q&A feature is RAG over users' personal workspace. Documents are chunked and embedded at write time. Query time retrieves the top-K most relevant blocks and sends them with the user's question. Their reranking step reduced irrelevant answers by 40% after launch, according to their engineering blog. This is a textbook advanced RAG implementation.
GitHub Copilot Chat (2024): Copilot's chat mode uses RAG over the codebase. Files are chunked by function and class boundary, embedded, and stored per-repo. When you ask a question, the most relevant code chunks are retrieved and included in the context. The key insight: code chunking by syntactic boundaries (functions, classes) vastly outperforms fixed-size chunking for code search, with 25% higher recall.
Elastic + OpenAI production stack: Elasticsearch is the BM25 layer in many hybrid RAG pipelines. Teams use Elastic for sparse retrieval and a separate vector store (or Elastic's kNN search) for dense retrieval, then merge with RRF. Shopify's internal knowledge assistant runs this exact architecture and serves 50K+ queries per day across their engineering org.
Perplexity (2024): The most visible agentic RAG system. Perplexity's agent decides which search queries to issue, retrieves from web and indexed sources, synthesizes across multiple retrievals, and generates answers with inline citations. Average query makes 2-4 retrieval calls. Their p95 latency is around 3 seconds because of the multi-hop retrieval, but the answer quality justifies it for their research-oriented use case.
Limitations and tradeoffs
- Retrieval quality is the ceiling. If the right chunk isn't retrieved, the model can't answer correctly no matter how capable it is. RAG quality is bounded by retrieval quality.
- Latency chain. Query embedding + ANN search + reranking + LLM call = multiple hops. Reranking alone adds 50-200ms at production scale. Plan for it.
- Index maintenance. A stale index is worse than no index. You need a near-real-time ingestion pipeline for documents that change. Incremental indexing (detecting changed documents and re-embedding only those) is essential past 10K documents.
- Long-tail queries. Highly specific queries about niche topics may not retrieve good chunks even with hybrid search. Fallback strategies (query rewriting, HyDE) help but add complexity and latency.
- Evaluation is hard. Unlike classification tasks, there's no single accuracy number. You need to measure retrieval quality and generation quality separately, hence frameworks like RAGAS.
The fundamental tension: more retrieval stages improve quality but add latency. Every RAG system is navigating the precision-vs-speed tradeoff.
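The incremental indexing mentioned under index maintenance can be sketched with content hashes. This assumes the index keeps a stored hash per document ID; `plan_reindex` is a hypothetical helper that only decides what to do and leaves the actual embedding and deletion to the caller.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(documents: dict[str, str], indexed_hashes: dict[str, str]):
    """Return (ids to re-embed, ids to delete) by comparing current and stored hashes."""
    to_embed = [doc_id for doc_id, text in documents.items()
                if indexed_hashes.get(doc_id) != content_hash(text)]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in documents]
    return sorted(to_embed), sorted(to_delete)
```

New and changed documents both show up in the first list, unchanged ones are skipped, and documents removed from the source are flagged so the index does not keep serving stale chunks.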
RAG evaluation with RAGAS
RAGAS (Retrieval Augmented Generation Assessment) is a widely used framework for evaluating RAG pipelines. It measures four metrics that cleanly separate retrieval quality from generation quality:
| Metric | What It Measures | Diagnoses |
|---|---|---|
| Faithfulness | Does the answer stick to the retrieved context? | Hallucination despite correct retrieval |
| Answer Relevancy | Does the answer address the question asked? | Off-topic generation |
| Context Precision | Are the retrieved chunks actually relevant? | Retrieval returning noise |
| Context Recall | Did retrieval find all the needed information? | Missing relevant chunks |
The power of RAGAS is diagnostic separation. If faithfulness is low but context precision is high, your problem is in generation (bad grounding prompt, too many chunks). If context recall is low, your problem is in retrieval (bad embeddings, missing chunks, vocabulary mismatch). I've used this framework to cut debugging time from days to hours.
RAG failure diagnosis
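When your RAG system produces bad answers, walk the RAGAS metrics as a decision tree to pinpoint the root cause. Start with retrieval: low context recall means the needed chunks were never retrieved (bad embeddings, missing chunks, vocabulary mismatch), and low context precision means retrieval is returning noise. If both look healthy, move to generation: low faithfulness points to a grounding problem, and low answer relevancy points to off-topic generation.
This decision tree is the single most useful debugging tool for RAG systems. When you can say "our context precision is 0.85 but faithfulness is 0.6, so the problem is grounding, not retrieval," you've saved everyone a week of blind debugging.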
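The metric-driven diagnosis described in this section can be written as a small function. This is a sketch, not part of the RAGAS library: the `diagnose` name, the check order (retrieval before generation), and the single 0.7 threshold are all assumptions you would adapt to your own baselines.

```python
def diagnose(faithfulness: float, answer_relevancy: float,
             context_precision: float, context_recall: float,
             threshold: float = 0.7) -> str:
    """Map four RAGAS-style scores (0 to 1) to the most likely problem area."""
    # Check retrieval first: generation cannot fix chunks that never arrived.
    if context_recall < threshold:
        return "retrieval: relevant chunks are missing"
    if context_precision < threshold:
        return "retrieval: results contain too much noise"
    # Retrieval looks healthy, so the problem is downstream.
    if faithfulness < threshold:
        return "generation: answer is not grounded in the context"
    if answer_relevancy < threshold:
        return "generation: answer is off-topic"
    return "no obvious problem at this threshold"


print(diagnose(faithfulness=0.6, answer_relevancy=0.9,
               context_precision=0.85, context_recall=0.9))
# generation: answer is not grounded in the context
```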
How this shows up in interviews
When to bring it up
RAG comes up in almost every AI system design interview. Any question involving "how would you build a Q&A system," "how do you ground the model in company data," or "how do you reduce hallucination" is a RAG question. Bring it up immediately when the problem involves private data, frequently-updated data, or source citation requirements.
Depth calibration
- Junior/mid-level: Know the two pipelines (ingestion + query). Be able to draw the basic architecture. Know that RAG reduces hallucination by grounding in retrieved context.
- Senior: Explain hybrid retrieval (BM25 + dense + RRF), cross-encoder reranking, and chunking strategies. Know the latency budget. Mention RAGAS for evaluation.
- Staff/principal: Discuss Agentic RAG, multi-hop retrieval, incremental indexing at scale, evaluation pipelines, the precision-latency tradeoff, and when RAG is the wrong pattern entirely.
Common questions and strong answers
| Interviewer asks | Strong answer |
|---|---|
| "How would you ground the LLM in our company docs?" | "RAG: chunk the docs, embed them, store in a vector DB. At query time, retrieve top-K relevant chunks via hybrid search and inject them into the context with a grounding prompt." |
| "How do you handle hallucination?" | "Explicit grounding instructions in the system prompt, cross-encoder reranking to surface the right chunks, and edge-first ordering to avoid the lost-in-the-middle problem." |
| "BM25 or vector search?" | "Both. Hybrid retrieval with RRF. BM25 catches exact matches that dense misses. Dense catches paraphrases that BM25 misses. The combination consistently beats either alone on standard benchmarks." |
| "How do you evaluate a RAG system?" | "RAGAS framework: faithfulness, answer relevancy, context precision, context recall. The four metrics separate retrieval problems from generation problems." |
| "What's the latency budget?" | "Query rewrite 100ms, embed 20ms, hybrid retrieval 50ms, reranking 100ms, LLM generation 500ms. Total p50 around 800ms. Reranking is the hidden cost most teams don't budget for." |
| "When would you NOT use RAG?" | "When the problem is capability, not knowledge (use fine-tuning). When data is live and real-time (use function calling). When the corpus fits in the context window (just stuff it in)." |
Common interview mistakes
| Mistake | Why it's wrong | Say this instead |
|---|---|---|
| "Just embed everything and do cosine similarity" | Ignores BM25, reranking, chunking strategy, and context assembly. This is naive RAG that fails in production. | "I'd use hybrid retrieval with BM25 and dense search, then rerank with a cross-encoder before context assembly." |
| "RAG eliminates hallucination" | RAG reduces but does not eliminate hallucination. The model can still ignore retrieved context or confabulate from training data. | "RAG reduces hallucination on factual questions by grounding in context, but you still need strong grounding instructions and evaluation." |
| Skipping the ingestion pipeline | Candidates jump straight to query-time architecture without discussing how documents get into the system. | "There are two pipelines. Ingestion handles chunking, embedding, and storage. Query handles retrieval, reranking, and generation." |
| "I'd use the biggest context window possible" | More context often hurts. The lost-in-the-middle problem means the model ignores middle chunks. A 128K window full of noise is worse than 5 precise chunks. | "I'd retrieve fewer, higher-quality chunks and order them edges-first. Precision beats volume." |
| Not mentioning evaluation | Candidates build the system but have no plan to measure whether it works. | "I'd set up RAGAS to measure faithfulness and context precision separately, so I can diagnose whether problems are retrieval or generation." |
Quick recap
- RAG retrieves relevant document chunks at query time and injects them into the context window so the LLM answers from your data.
- Two pipelines: ingestion (chunk, embed, store with metadata) and query (rewrite, embed, retrieve, rerank, assemble, generate).
- Hybrid retrieval (BM25 + dense + RRF) with cross-encoder reranking is the production default. Don't skip either.
- Parent-child chunking gives you retrieval precision (small chunks) and context richness (large parents). Start here for mixed document types.
- The "lost in the middle" effect means chunk ordering matters. Put highest-relevance chunks at the start and end of the context block.
- RAGAS separates retrieval problems (context precision/recall) from generation problems (faithfulness/answer relevancy). Use it to diagnose, not guess.
- The fundamental tradeoff is precision vs latency. Every additional pipeline stage (rewriting, reranking, multi-hop) improves quality and adds milliseconds.
Related concepts
- Vector databases for AI - The storage layer that makes RAG retrieval fast. Understand ANN search, HNSW, and metadata filtering.
- Embeddings - RAG depends on embedding quality. Garbage embeddings mean garbage retrieval, regardless of everything else.
- Context engineering - Context assembly in RAG is a subset of context engineering. The grounding prompt design directly determines faithfulness.
- LLM evaluations - RAGAS is the RAG-specific evaluation framework, but it sits within the broader LLM evaluation ecosystem.