Self-reflective RAG
Learn how self-reflective RAG adds a quality gate between retrieval and generation, grading retrieved chunks and reformulating queries when relevance scores fall below threshold.
TL;DR
- Self-reflective RAG inserts a relevance grading step between retrieval and generation: retrieved chunks are scored before the LLM sees them, and low-scoring retrievals trigger query reformulation and retry.
- CRAG (Corrective RAG, Yan et al., 2024) reported a 25-40% improvement in answer correctness over naive RAG on knowledge-intensive tasks using this approach.
- Use gpt-4o-mini or Gemini 1.5 Flash for the grading step: at roughly $0.00015 per 1K tokens, the cost per query is negligible (under $0.001 extra).
- Cap retries at 2-3; beyond that, the knowledge base likely does not contain the answer and continuing is waste.
- The key tradeoff: 200-600ms of added latency per query (usually acceptable) in exchange for dramatically fewer hallucinations from low-quality context.
The Problem It Solves
Standard RAG has no feedback loop. It retrieves, it injects, it generates. If the retrieved chunks are irrelevant, the LLM generates from bad context. Sometimes it hallucinates. Sometimes it says "based on the provided context, the answer is..." and the answer is wrong. The system has no mechanism to detect that retrieval failed.
The failure cases are common: the user's query was ambiguous and matched the wrong documents; the query used different vocabulary than the knowledge base; the topic exists in the knowledge base but the top-k chunks surfaced the wrong section. In all these cases, the standard pipeline confidently generates a response. There is no upstream signal that anything went wrong.
The fix is a quality gate. Check whether the retrieved chunks actually contain information relevant to answering the query before the LLM sees them. If they do not, do something about it: reformulate the query and try again. Only proceed to generation when the context quality is sufficient.
What Is It?
Self-reflective RAG adds a relevance evaluation step between retrieval and generation. Retrieved chunks are scored for query relevance, and low scores trigger query reformulation and retry before the LLM ever touches the context.
Think of it like an editor reviewing sources before they reach a journalist. The journalist (LLM) can only write from what the editor approves. If the editorial assistant (grader) flags that three of the five sources are off-topic, the editor sends the assistant back to the archive with a more specific brief. The journalist never sees the bad sources and never writes from them.
How It Works
The Grading Step
After retrieval returns top-k chunks, a lightweight LLM evaluates each chunk individually. The grading prompt is deliberately simple:
On a scale of 1 to 5, how relevant is the following document excerpt
to answering the question: "{query}"?
Document excerpt:
{chunk_content}
Respond with only a single integer: 1, 2, 3, 4, or 5.
1 = Completely irrelevant
2 = Tangentially related, mentions similar topics
3 = Partially relevant, contains some useful information
4 = Very relevant, directly addresses the question
5 = Directly answers the question with specific information
The grader runs once per chunk in parallel (not sequentially). For top_k=5, that is 5 concurrent LLM calls using a fast, cheap model. Each call completes in 100-200ms. Total grading time for 5 chunks: 100-200ms (all parallel).
Average the scores. If the average is below the threshold (typically 2.5-3.0), retrieval failed and the retry loop fires.
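The grading step can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: `llm` stands for any async function that sends a prompt to a cheap model and returns its text reply, and `fake_llm` is a hypothetical stand-in used only so the example runs without an API key. Treating an unparseable reply as a score of 1 is also an assumption.

```python
import asyncio
import re
from statistics import mean
from typing import Awaitable, Callable

GRADING_PROMPT = (
    "On a scale of 1 to 5, how relevant is the following document excerpt\n"
    'to answering the question: "{query}"?\n\n'
    "Document excerpt:\n{chunk_content}\n\n"
    "Respond with only a single integer: 1, 2, 3, 4, or 5."
)

# Assumption: the grader model is wrapped as an async prompt -> reply function.
LLMCall = Callable[[str], Awaitable[str]]


def parse_score(reply: str) -> int:
    """Return the first standalone digit 1-5 in the reply, or 1 if none is found."""
    match = re.search(r"\b[1-5]\b", reply)
    return int(match.group()) if match else 1


async def grade_chunks(query: str, chunks: list[str], llm: LLMCall) -> list[int]:
    """Grade every chunk concurrently: one LLM call per chunk, all in flight at once."""
    prompts = [GRADING_PROMPT.format(query=query, chunk_content=c) for c in chunks]
    replies = await asyncio.gather(*(llm(p) for p in prompts))
    return [parse_score(r) for r in replies]


def retrieval_failed(scores: list[int], threshold: float = 2.5) -> bool:
    """True when the average relevance score falls below the retry threshold."""
    return mean(scores) < threshold


async def fake_llm(prompt: str) -> str:
    """Hypothetical stub grader: rates a chunk 4 if it mentions 'refund', else 1."""
    await asyncio.sleep(0.01)  # stands in for network latency
    chunk_part = prompt.split("Document excerpt:\n")[1]
    return "4" if "refund" in chunk_part else "1"


if __name__ == "__main__":
    chunks = ["refund policy: 30 days", "office hours", "shipping rates"]
    scores = asyncio.run(grade_chunks("How do refunds work?", chunks, fake_llm))
    print(scores, "retry" if retrieval_failed(scores) else "proceed")
```

Because the calls run under `asyncio.gather`, total grading time tracks the slowest single call rather than the sum of all calls.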
The Retry Threshold
Calibrate the threshold based on your tolerance for false positives vs. false negatives:
- Threshold 2.0: Only retry when retrieval is clearly bad. Low retry rate, some poor-quality contexts still make it through. Good for latency-sensitive applications.
- Threshold 2.5: Balanced. Triggers retry when context is mixed-quality. Recommended default.
- Threshold 3.0: Aggressive. Retries whenever the context does not average at least "partially relevant". Higher retry rate, better answer quality, higher cost.
Set the threshold by measuring on a held-out query set. Run 100 representative queries through the pipeline, grade the retrieved chunks, and compare average scores against human-labeled answer quality. The threshold that maximizes answer quality at acceptable latency and cost is your production value.
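One way to run that comparison is a simple threshold sweep. The sketch below assumes each held-out query has been reduced to two values: the average grader score of its retrieved chunks and a human label for whether the generated answer was acceptable. The `LabeledQuery` type, the metric names, and the sample data are all illustrative.

```python
from dataclasses import dataclass


@dataclass
class LabeledQuery:
    avg_score: float  # mean grader score of the retrieved chunks
    answer_ok: bool   # human label: was the generated answer acceptable?


def sweep(records: list[LabeledQuery], thresholds=(2.0, 2.5, 3.0)) -> list[dict]:
    """For each candidate threshold, report how often a retry would fire and
    how many bad answers (and good ones) would be caught by it."""
    if not records:
        raise ValueError("need at least one labeled query")
    bad = [r for r in records if not r.answer_ok]
    good_count = len(records) - len(bad)
    rows = []
    for t in thresholds:
        flagged = [r for r in records if r.avg_score < t]
        caught = [r for r in flagged if not r.answer_ok]
        rows.append({
            "threshold": t,
            "retry_rate": len(flagged) / len(records),
            "bad_caught": len(caught) / len(bad) if bad else 1.0,
            "good_retried": (len(flagged) - len(caught)) / good_count if good_count else 0.0,
        })
    return rows


# Made-up sample data, for illustration only.
sample = [
    LabeledQuery(1.4, False), LabeledQuery(2.2, False), LabeledQuery(2.8, False),
    LabeledQuery(2.4, True), LabeledQuery(3.2, True), LabeledQuery(3.6, True),
    LabeledQuery(4.0, True), LabeledQuery(4.6, True),
]

if __name__ == "__main__":
    for row in sweep(sample):
        print(row)
```

A higher `bad_caught` at a tolerable `retry_rate` is the signal to look for; the retry rate translates directly into added latency and cost.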
The Query Reformulation Step
A naive retry with the same query returns the same bad results. Reformulation is where the quality improvement actually comes from.
The reformulation prompt is the most important design decision in the pattern:
The original question was: {query}
A search was performed and returned these document excerpts:
{chunk_summaries_with_scores}
These documents were not sufficiently relevant to answer the question.
Your task: reformulate the original question to improve the search.
Guidelines:
- Be more specific rather than more general
- Use different vocabulary that might match document language
- Focus on the most concrete, answerable part of the question
- Do not change the intent of the original question
Reformulated query:
The reformulation should make the query more specific, not more general. A common mistake is generating a broader reformulation that returns even more generic documents. More specific reformulations find the right document. More general reformulations find more documents, but wrong ones.
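Put together, grading and reformulation form a bounded retry loop. The sketch below assumes three injected callables (`retrieve`, `grade`, and `reformulate`), which are hypothetical names for your vector search, the chunk grader, and an LLM call using the reformulation prompt. Grading each retry against the original question, rather than the reformulated one, is a design assumption intended to keep the user's intent fixed.

```python
from statistics import mean
from typing import Callable


def retrieve_with_reflection(
    query: str,
    retrieve: Callable[[str], list[str]],
    grade: Callable[[str, list[str]], list[int]],
    reformulate: Callable[[str, list[str], list[int]], str],
    threshold: float = 2.5,
    max_retries: int = 2,
) -> dict:
    """Retrieve, grade, and reformulate until the context passes or retries run out."""
    current = query
    for attempt in range(max_retries + 1):
        chunks = retrieve(current)
        # Grade against the ORIGINAL question so reformulation cannot drift the intent.
        scores = grade(query, chunks)
        if scores and mean(scores) >= threshold:
            return {"ok": True, "chunks": chunks, "query_used": current, "attempts": attempt + 1}
        if attempt < max_retries:
            current = reformulate(query, chunks, scores)
    # Retries exhausted: signal a likely knowledge gap instead of generating from bad context.
    return {"ok": False, "chunks": [], "query_used": current, "attempts": max_retries + 1}


# Hypothetical stubs so the sketch runs without a vector store or an LLM.
def stub_retrieve(q: str) -> list[str]:
    return ["SSO setup guide for admins"] if "SSO" in q else ["generic login FAQ"]


def stub_grade(question: str, chunks: list[str]) -> list[int]:
    return [5 if "SSO" in c else 1 for c in chunks]


def stub_reformulate(question: str, chunks: list[str], scores: list[int]) -> str:
    return question + " SSO configuration"


if __name__ == "__main__":
    print(retrieve_with_reflection("How do I set up single sign-on?",
                                   stub_retrieve, stub_grade, stub_reformulate))
```

When the loop returns `ok: False`, the caller can respond with a structured "not found" message rather than passing weak context to the generator.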
The Two Levels of Reflection
Self-reflective RAG can operate at two stages:
Retrieval-level reflection (the primary pattern): Grade chunks after retrieval, before generation. Trigger retry if quality is low. This is the most impactful level and the one described throughout this article.
Generation-level reflection (optional post-generation check): After generation, grade the answer for faithfulness. "Does this answer stay within the bounds of the provided context?" and "Does this answer actually address the question?" This is a secondary quality gate, useful for high-stakes applications where hallucination cost is very high. It adds another LLM call after generation, so use it selectively.
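A generation-level check can be as small as two yes/no questions put to a grader model. The following is a rough sketch under assumptions: the prompt wording beyond the two quoted questions, the `llm` callable (any synchronous prompt-to-text function), and the YES/NO parsing are all illustrative choices.

```python
from typing import Callable

FAITHFULNESS_PROMPT = (
    "Context:\n{context}\n\nAnswer:\n{answer}\n\n"
    "Does this answer stay within the bounds of the provided context? "
    "Respond with only YES or NO."
)
ADDRESSES_PROMPT = (
    "Question:\n{question}\n\nAnswer:\n{answer}\n\n"
    "Does this answer actually address the question? "
    "Respond with only YES or NO."
)


def is_yes(reply: str) -> bool:
    """Treat any reply starting with YES (case-insensitive) as a pass."""
    return reply.strip().upper().startswith("YES")


def check_answer(question: str, context: str, answer: str,
                 llm: Callable[[str], str]) -> dict:
    """Run the two post-generation checks and report each result separately."""
    faithful = is_yes(llm(FAITHFULNESS_PROMPT.format(context=context, answer=answer)))
    addresses = is_yes(llm(ADDRESSES_PROMPT.format(question=question, answer=answer)))
    return {"faithful": faithful, "addresses_question": addresses,
            "passed": faithful and addresses}


# Hypothetical stub: passes the faithfulness check, fails the relevance check.
def stub_llm(prompt: str) -> str:
    return "YES" if prompt.startswith("Context:") else "no"
```

Reporting the two checks separately makes it possible to tell an unfaithful answer apart from an off-topic one when deciding how to recover.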
The CRAG Pattern (Corrective RAG)
CRAG (Yan et al., 2024) is the canonical published implementation. It uses three categories instead of a numeric score:
- CORRECT: the retrieved chunks contain relevant information. Proceed to generation.
- INCORRECT: the retrieved chunks are irrelevant. Pivot to web search to find external context.
- AMBIGUOUS: the evidence is mixed. Supplement local retrieval with web search.
The distinctive feature of CRAG is the pivot to web search when local knowledge fails. This is a more aggressive corrective action than query reformulation alone. If the knowledge base genuinely does not contain the answer, CRAG does not keep retrying the same index; it goes outside.
In practice, for internal-knowledge-base deployments (corporate documentation, product data), the web search fallback is usually replaced with a "knowledge gap" signal and a structured fallback response.
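The three-way routing can be sketched on top of the 1-5 chunk scores used earlier in this article. The mapping below is one plausible adaptation, not the evaluator from the CRAG paper: the `upper` and `lower` cutoffs, the function names, and the behavior when no `web_search` callable is supplied are all assumptions.

```python
from enum import Enum
from typing import Callable, Optional


class Verdict(Enum):
    CORRECT = "correct"
    INCORRECT = "incorrect"
    AMBIGUOUS = "ambiguous"


def classify(scores: list[int], upper: int = 4, lower: int = 2) -> Verdict:
    """Map per-chunk scores to a verdict (illustrative cutoffs)."""
    if any(s >= upper for s in scores):
        return Verdict.CORRECT      # at least one chunk is clearly relevant
    if all(s <= lower for s in scores):
        return Verdict.INCORRECT    # nothing useful (also covers empty retrieval)
    return Verdict.AMBIGUOUS        # mixed evidence


def route(query: str, chunks: list[str], scores: list[int],
          web_search: Optional[Callable[[str], list[str]]] = None) -> dict:
    """Choose the context for generation, or raise a knowledge-gap signal."""
    verdict = classify(scores)
    if verdict is Verdict.CORRECT:
        return {"verdict": verdict, "context": chunks, "knowledge_gap": False}
    # Internal deployments pass web_search=None and rely on the knowledge-gap flag.
    external = web_search(query) if web_search else []
    if verdict is Verdict.INCORRECT:
        return {"verdict": verdict, "context": external, "knowledge_gap": not external}
    # AMBIGUOUS: keep the local chunks and supplement them with external results.
    return {"verdict": verdict, "context": chunks + external, "knowledge_gap": False}
```

With `web_search=None`, an INCORRECT verdict yields empty context and `knowledge_gap: True`, which the caller can turn into a structured fallback response.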
The SELF-RAG Approach (Fine-Tuning-Based)
SELF-RAG (Asai et al., 2024) takes a fundamentally different approach. Rather than using an external grader, the model is fine-tuned to generate reflection tokens inline:
- [Retrieve]: should I retrieve more context for this claim?
- [IsRel]: is this retrieved passage relevant?
- [IsSup]: does this passage support my current claim?
- [IsUse]: is my overall response useful?
These tokens are produced by the model during generation, making reflection intrinsic rather than external. SELF-RAG reports strong results on benchmark tasks, but it requires a fine-tuned model and is not directly applicable to off-the-shelf GPT-4o or Claude deployments.
For production systems using hosted LLMs, use the external grader approach (CRAG-style). SELF-RAG is the research direction; CRAG-style grading is the production pattern.