Semantic context filtering
After retrieving candidate context chunks, apply a secondary LLM or embedding pass to filter out irrelevant results before injecting them into the agent's prompt.
TL;DR
- Semantic context filtering adds a second-pass relevance filter after vector retrieval, cutting injected noise by 50-80% and improving LLM answer quality by 15-25%.
- The two-stage pipeline: coarse retrieval (vector search, top-100) produces high recall, then fine filtering (cross-encoder reranker or LLM judge, top-10) produces high precision.
- Cross-encoder rerankers (Cohere Rerank, BGE Reranker) score each (query, chunk) pair jointly, reading the query and the chunk together, achieving 10-20% better precision than bi-encoder similarity alone.
- Token math: retrieving 20 chunks at 300 tokens each costs 6,000 input tokens. If only 5 are relevant, filtering saves 4,500 tokens per request (75% reduction).
- The "Lost in the Middle" effect (Liu et al., 2023) shows that irrelevant context actively degrades LLM accuracy, not just wastes tokens. Filtering is a quality lever, not just a cost lever.
- Limitation: the filtering pass itself adds 50-200ms latency and has its own cost. For latency-critical paths, you need to tune the tradeoff carefully.
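The token math above is easy to sanity-check in a few lines. This is a minimal sketch that assumes every chunk is the same size; `filtering_savings` is a name I made up for illustration, and it deliberately ignores the cost of the filtering pass itself.

```python
def filtering_savings(retrieved: int, relevant: int, tokens_per_chunk: int) -> dict:
    """Input tokens before and after filtering, assuming uniform chunk size."""
    if not 0 <= relevant <= retrieved:
        raise ValueError("relevant must be between 0 and retrieved")
    before = retrieved * tokens_per_chunk   # everything retrieved gets injected
    after = relevant * tokens_per_chunk     # only relevant chunks get injected
    saved = before - after
    reduction = saved / before if before else 0.0
    return {"before": before, "after": after, "saved": saved, "reduction": reduction}


savings = filtering_savings(retrieved=20, relevant=5, tokens_per_chunk=300)
print(savings)
# {'before': 6000, 'after': 1500, 'saved': 4500, 'reduction': 0.75}
```

Real chunks vary in length, so treat the result as an estimate rather than a bill.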
The Problem It Solves
Your AI coding agent uses RAG to answer questions about a 50,000-document codebase. The vector search returns the top-20 chunks for "how does the payment retry logic work?" and the results include: three chunks about payment processing, two about retry mechanisms, one about the payment model schema, and fourteen chunks that mention "payment" or "retry" in passing but are actually about logging configuration, test fixtures, CI pipeline setup, and unrelated API endpoints.
All 20 chunks get injected into the prompt. The LLM now has 6,000 tokens of context, but 70% of it is noise. The model's answer wanders, conflating retry logic with CI retry behavior, and misses the actual retry backoff configuration, which sat at rank #17 in the retrieval results, buried beneath irrelevant chunks that scored higher on embedding similarity.
I've watched this exact failure mode in three different production RAG systems. The retrieval "worked" by recall metrics, but the injected context was so noisy that the LLM would have done better with zero context. Vector search optimizes for recall (don't miss anything relevant), but the LLM needs precision (don't include anything irrelevant). These are different optimization targets, and you need different stages to handle each one.
Irrelevant context is worse than no context
Research from Liu et al. (2023) on "Lost in the Middle" demonstrates that LLM accuracy drops when relevant information is surrounded by irrelevant filler. Adding 15 irrelevant documents around 5 relevant ones reduced accuracy by 20-30% compared to showing only the 5 relevant documents. More context is not better context.
What Is It?
Semantic context filtering adds a second-pass relevance filter between retrieval and injection. After the vector search returns candidate chunks, a more accurate (but slower) model re-scores each chunk against the original query and drops the ones that don't meet a relevance threshold.
Think of it like a hiring pipeline. The resume screening (vector search) casts a wide net, pulling in 100 candidates who look roughly qualified on paper. The phone screen (semantic filter) asks each candidate a specific question about the role and cuts the list to 10 who actually understand the problem. You interview 10 focused candidates instead of 100 vaguely qualified ones.
The term "semantic" matters here. This isn't simple keyword filtering ("does the chunk contain the word 'payment'?"). It's meaning-based filtering ("does the chunk discuss payment processing in a way that's relevant to the user's specific question about retry logic?"). The distinction is the difference between grep and understanding.
The pattern sits between retrieval and prompt assembly in any RAG pipeline. It doesn't change how you retrieve or how you inject. It adds a precision layer between the two.
How It Works
Stage 1: coarse retrieval with bi-encoders
The first stage is the standard vector search that most RAG systems already have. A bi-encoder model (like OpenAI text-embedding-3-small, Cohere embed-v3, or an open-source model like bge-base-en-v1.5) embeds the query and each document chunk independently, then finds the nearest neighbors by cosine similarity.
Bi-encoders are fast because they pre-compute document embeddings at index time. At query time, you only embed the query once and do a nearest-neighbor lookup. This makes them practical for searching millions of chunks in milliseconds using vector databases (Pinecone, Qdrant, pgvector, Chroma).
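To make the index-time versus query-time split concrete, here's a toy sketch of the coarse stage. The three-dimensional vectors are invented stand-ins for real embeddings (think of the axes as "payment", "retry", "logging"), and the brute-force scan stands in for the approximate nearest-neighbor lookup a vector database would do. None of the names here come from a real library.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Index time: each chunk is embedded once and stored (toy vectors, not real embeddings).
INDEX = {
    "payments/retry.py": [0.9, 0.8, 0.1],
    "ci/pipeline.yml": [0.1, 0.9, 0.2],
    "logging/config.py": [0.1, 0.1, 0.9],
}


def retrieve(query_vec, index, top_k):
    """Query time: score the single query vector against every stored vector."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Stand-in for the embedding of "payment retry backoff".
query_vec = [0.8, 0.9, 0.0]
results = retrieve(query_vec, INDEX, top_k=2)
for chunk_id, score in results:
    print(f"{chunk_id}: {score:.2f}")
```

In this toy setup the CI pipeline chunk still lands in the top 2 because it shares the "retry" direction with the query.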
The tradeoff: bi-encoders encode query and document independently, so they can't capture fine-grained interactions between the two. A query about "payment retry backoff strategy" and a chunk about "exponential backoff in CI pipeline retries" will score high because they share the concepts "retry" and "backoff," even though they're about completely different systems. This is why you need a second stage.
The typical configuration: retrieve the top-50 to top-100 chunks at this stage. You intentionally over-retrieve to ensure high recall. Missing a relevant chunk here means it's gone forever. Getting a few irrelevant ones is fine because the next stage will filter them.
The key question at this stage: how many candidates to retrieve? Too few (top-10) and you might miss relevant chunks that the bi-encoder scored slightly lower. Too many (top-500) and the reranking step becomes expensive. The sweet spot is usually 5-10x your final target. If you want 10 chunks in the final prompt, retrieve 50-100 candidates.
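As a rough sketch, the 5-10x rule of thumb fits in a one-line helper. The function name and the default multiplier of 8 are my own assumptions; the only thing taken from the guidance above is the 5-10x range.

```python
def candidate_count(final_k: int, corpus_size: int, multiplier: int = 8) -> int:
    """How many chunks to pull from the coarse stage for a given final top-K."""
    if final_k < 1 or multiplier < 1:
        raise ValueError("final_k and multiplier must be positive")
    # Never ask for more candidates than the corpus contains.
    return min(final_k * multiplier, corpus_size)


print(candidate_count(final_k=10, corpus_size=50_000, multiplier=5))   # 50
print(candidate_count(final_k=10, corpus_size=50_000, multiplier=10))  # 100
```

The right multiplier depends on how well your bi-encoder ranks your data, so it is worth measuring recall at a few settings rather than trusting the default.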
Stage 2: fine filtering with cross-encoders
Cross-encoder rerankers are the precision weapon. Unlike bi-encoders (which encode query and document separately), cross-encoders take the (query, document) pair as a single input and produce a relevance score. This lets the model attend to the fine-grained interaction between query terms and document content.
The architecture difference matters. A bi-encoder maps "payment retry backoff" and "CI pipeline retry configuration" to similar regions of embedding space because they share semantic features. A cross-encoder reads both together and can distinguish: "This chunk discusses retry in the context of CI pipelines, not payment processing. Low relevance."
Popular cross-encoder rerankers include Cohere Rerank (API), BGE Reranker (open-source), and Jina Reranker. They typically accept a query and a list of documents, returning relevance scores for each pair. Processing time scales linearly with the number of candidates, which is why you use the coarse stage to cut 1,000,000 documents down to 100 before reranking.
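The reranker interface described above (one query, a list of documents, one score per pair) can be sketched without committing to any vendor. Here `score_pair` is a pluggable callable, and `toy_score_pair` is a purely hypothetical stand-in based on word overlap so the example runs offline. A real setup would call an actual cross-encoder model or reranking API in its place.

```python
from typing import Callable, List, Tuple

PairScorer = Callable[[str, str], float]


def rerank(query: str, chunks: List[str], score_pair: PairScorer) -> List[Tuple[str, float]]:
    """Score every (query, chunk) pair once, then sort by score, highest first."""
    scored = [(chunk, score_pair(query, chunk)) for chunk in chunks]  # one call per candidate
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_score_pair(query: str, chunk: str) -> float:
    """Hypothetical stand-in scorer: fraction of query words that appear in the chunk."""
    query_terms = set(query.lower().split())
    chunk_terms = set(chunk.lower().split())
    return len(query_terms & chunk_terms) / len(query_terms) if query_terms else 0.0


candidates = [
    "logging configuration for payment service",
    "payment retry uses exponential backoff with jitter",
    "ci pipeline retry backoff settings",
]
ranked = rerank("payment retry backoff", candidates, toy_score_pair)
for chunk, score in ranked:
    print(f"{score:.2f}  {chunk}")
```

Because `rerank` makes exactly one scorer call per candidate, the cost grows linearly with the candidate list.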
The architecture comparison comes down to what each model outputs. The bi-encoder produces similarity scores. The cross-encoder produces relevance scores. Similarity is a proxy for relevance, but it's a lossy one. A chunk can be semantically similar to the query (it discusses related concepts) without being relevant to it (it answers a different question).
After reranking, apply a relevance threshold (typically 0.6-0.8 depending on the model) and keep only the top-K chunks that pass. The result is a small set of highly relevant chunks.
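The threshold-then-top-K step is small enough to sketch directly. The scores and file names below are invented for illustration, and the default threshold of 0.7 is just the midpoint of the range mentioned above, not a recommendation for any particular model.

```python
from typing import List, Tuple


def filter_chunks(
    scored: List[Tuple[str, float]], threshold: float = 0.7, top_k: int = 10
) -> List[Tuple[str, float]]:
    """Keep chunks scoring at or above the threshold, then cap at the top_k best."""
    passing = [(chunk, score) for chunk, score in scored if score >= threshold]
    passing.sort(key=lambda pair: pair[1], reverse=True)
    return passing[:top_k]


# Made-up reranker output: (chunk id, relevance score).
scored = [
    ("retry_policy.py", 0.93),
    ("ci_retry.yml", 0.41),
    ("payment_client.py", 0.78),
    ("logging.py", 0.12),
    ("backoff.py", 0.71),
]
kept = filter_chunks(scored, threshold=0.7, top_k=2)
print(kept)
# [('retry_policy.py', 0.93), ('payment_client.py', 0.78)]
```

Note that this can return fewer than `top_k` chunks, or none at all, when nothing clears the threshold. That is usually the behavior you want: an empty context is a clearer signal than a padded one.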
The alternative: LLM-as-judge filtering
Cross-encoder rerankers are the standard, but there's a second approach: use a cheap, fast LLM as the filter. For each candidate chunk, ask: "Is this chunk relevant to answering the user's question? Answer yes or no."
This approach trades compute cost for flexibility. A cross-encoder gives you a relevance score but can't explain why. An LLM judge can apply nuanced filtering criteria: "relevant AND from the last 6 months AND not a test file." You can encode complex filtering logic in natural language.
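Here's a minimal sketch of the judge loop. The prompt wording, the chunk dictionary shape (`path`, `text`), and the `ask_llm` callable are all assumptions for illustration. `fake_llm` is a hard-coded stand-in so the example runs without a model; in practice you'd pass a function that calls whichever cheap LLM you use.

```python
from typing import Callable, Dict, List

JUDGE_PROMPT = (
    "Question: {question}\n"
    "Extra criteria: {criteria}\n"
    "Chunk (from {path}):\n{text}\n\n"
    "Is this chunk relevant to answering the question, and does it meet the criteria? "
    "Answer yes or no."
)


def llm_filter(
    question: str,
    chunks: List[Dict[str, str]],
    ask_llm: Callable[[str], str],
    criteria: str = "none",
) -> List[Dict[str, str]]:
    """Ask the judge about each chunk and keep the ones it answers 'yes' to."""
    kept = []
    for chunk in chunks:
        prompt = JUDGE_PROMPT.format(
            question=question, criteria=criteria, path=chunk["path"], text=chunk["text"]
        )
        answer = ask_llm(prompt).strip().lower()
        if answer.startswith("yes"):  # anything other than a clear yes is dropped
            kept.append(chunk)
    return kept


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call, with hard-coded behavior."""
    chunk_part = prompt.split("Chunk (from ", 1)[1]
    is_test_file = chunk_part.startswith("tests/")
    mentions_payment = "payment" in chunk_part.lower()
    return "Yes." if mentions_payment and not is_test_file else "no"


chunks = [
    {"path": "payments/retry.py", "text": "Payment retries use exponential backoff."},
    {"path": "tests/test_payment_retry.py", "text": "Fixture for payment retry tests."},
    {"path": "ci/pipeline.yml", "text": "Retry flaky CI jobs up to 3 times."},
]
kept = llm_filter(
    "how does the payment retry logic work?", chunks, fake_llm, criteria="not a test file"
)
print([chunk["path"] for chunk in kept])
# ['payments/retry.py']
```

One judge call per chunk adds up quickly, so this loop is normally run on the output of the coarse stage and not on the raw corpus.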