Reranking

TL;DR

Reranking is a two-step retrieval strategy: first retrieve 50-100 candidates fast with a bi-encoder, then score every (query, candidate) pair with a cross-encoder to find the best 5-10.
Cross-encoder reranking improves precision@5 by 15-30% over bi-encoder retrieval alone.
Vector similarity finds "semantically nearby" text but misses intent; the cross-encoder reads both query and document together and scores actual relevance.
Top open-source options: mxbai-rerank-large-v1 or BGE-Reranker-Large; top managed API: Cohere Rerank at ~$0.001 per 1,000 tokens.
Reranking adds 200-500ms of latency for a 50-document batch; it cannot rescue bad initial retrieval if the right document never made the first-pass top-100.

The Problem It Solves

Your RAG pipeline retrieves the top-5 chunks from a vector database and stuffs them into the LLM prompt. The user asks "What is our refund policy for international orders?" and the LLM confidently answers based on the domestic refund policy, the shipping FAQ, and a support macro about order cancellations. The actual international refund document sits at position 8 in the vector search results, crowded out by chunks that are semantically similar (they all mention "refund") but answer a different question.

Vector similarity is not relevance scoring. Embedding models compress a document into a fixed-size vector representing its general semantic neighborhood. Documents about "authorization", "authentication", and OAuth all cluster near each other in embedding space. When your user asks "What is the authorization process for our API?", all three document types get pulled in, and the model has no mechanism to decide that two of them answer the wrong question.

This is the precision problem. Your recall may be fine (the right document is somewhere in the top-50) but your precision is broken (positions 1-5 contain noise). Since most LLM context windows are filled from top-k documents, noise in positions 1-5 directly degrades answer quality. Reranking fixes precision without sacrificing recall.

What Is It?

Reranking solves the precision problem by running a second, more expensive scoring pass over the candidates retrieved in the first pass.

Think of it like a library's catalog system. A card catalog (bi-encoder) finds every book with "authorization" in the title quickly, returning a stack of 50 books. A librarian (cross-encoder) then reads the actual question on the reference slip, skims each book's summary, and hands you the 5 that genuinely answer your question. The catalog is fast; the librarian is accurate.

How It Works

Bi-Encoders: Speed at the Cost of Interaction

A bi-encoder processes the query and each document independently. The query is embedded into a dense vector. Documents are pre-embedded at index time and stored. At query time, you compute cosine similarity between the query vector and every document vector.

This independence is the source of both the speed and the limitation. Pre-computing document embeddings means you never need to re-run the model on documents at query time. But because query and document are embedded separately, the model never sees the actual interaction between them. The word "authorization" in the query and the word "authorization" in the document both contribute to their respective vectors, but the model has no way to reason about whether the document's "authorization" answers the query's intent.

Pre-computation is what makes vector databases scale to millions of documents. Retrieval is a single approximate nearest-neighbor (ANN) operation, typically 1-10ms, and that cost grows only slowly with corpus size.
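To make the independence concrete, here is a minimal sketch of the scoring step using exact, brute-force cosine similarity. The toy 3-dimensional vectors and the function names are assumptions for illustration; a real system would use a trained embedding model and an ANN index rather than a full scan.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_by_cosine(query_vec: list[float], doc_vecs: list[list[float]], k: int) -> list[int]:
    """Return the indices of the k document vectors closest to the query."""
    sims = [cosine(query_vec, d) for d in doc_vecs]
    order = sorted(range(len(doc_vecs)), key=lambda i: sims[i], reverse=True)
    return order[:k]

# Document vectors are computed once at index time and stored (toy values).
doc_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# Only the query is embedded at request time (toy value).
query_vector = [0.2, 1.0, 0.0]
top = top_k_by_cosine(query_vector, doc_vectors, k=2)
```

Note that the document text never enters the comparison: the score depends only on two vectors that were produced separately.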

Cross-Encoders: Accuracy Through Interaction

A cross-encoder takes the query and document as a single concatenated token sequence: [CLS] query [SEP] document [SEP]. The model's attention mechanism operates over the full combined sequence, allowing every token in the query to attend to every token in the document. This is what makes cross-encoders accurate: full query-document interaction.

The output is a single relevance score. Many rerankers squash it into the 0-1 range with a sigmoid, while others return raw logits, so check your model's documentation. You run this inference for every (query, candidate) pair. The scores are best treated as relative rather than absolute, so you sort all 50 candidates by score and take the top-5 or top-10.

You cannot pre-compute cross-encoder scores because the query changes with every request. This is why cross-encoders are too slow for first-pass retrieval (you'd need to score every document in the corpus for every query) but perfect for reranking a small candidate set.
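A rough back-of-the-envelope calculation shows the gap. The 5 ms per pair figure is an assumption, chosen to sit inside the 200-500ms per 50 documents range quoted earlier; real numbers depend on model size, hardware, and batching.

```python
def scoring_seconds(num_pairs: int, ms_per_pair: float = 5.0) -> float:
    """Estimated time to score num_pairs (query, document) pairs sequentially."""
    return num_pairs * ms_per_pair / 1000.0

# Scoring every document in a 1M-document corpus for a single query
full_corpus = scoring_seconds(1_000_000)

# Scoring only the 50 candidates returned by the first pass
rerank_only = scoring_seconds(50)
```

Under that assumption, full-corpus scoring takes over an hour per query, while reranking 50 candidates takes a quarter of a second.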

Why the Two-Stage Design Matters

The two-stage approach is not a hack to make cross-encoders practical. It is the optimal design for the precision-recall tradeoff.

First-stage retrieval optimizes for recall: get the right document somewhere in the top-50 or top-100. You want high recall here, not precision, so use a generous top-k (50-100) and a fast model. Missing the right document at this stage cannot be recovered.

Second-stage reranking optimizes for precision: from the candidate set, surface the most relevant documents. You want the 5-10 chunks that best answer the query filling the LLM's context window. Only cross-encoders deliver this reliably.
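The two metrics behind this split can be sketched in a few lines. The document IDs below are hypothetical and mirror the refund example from earlier; the reranked order is an illustration, not a measured result.

```python
def precision_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k

def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

relevant = {"intl_refund"}

# First pass: the right document is retrieved, but only at position 8.
first_pass = [
    "domestic_refund", "shipping_faq", "cancel_macro", "returns_form", "gift_cards",
    "tax_info", "warranty", "intl_refund", "contact", "careers",
]

# After reranking (illustrative order): the right document moves to the top.
reranked = ["intl_refund", "domestic_refund", "returns_form", "cancel_macro", "shipping_faq"]

first_pass_precision = precision_at_k(first_pass, relevant, 5)
first_pass_recall = recall_at_k(first_pass, relevant, 10)
reranked_precision = precision_at_k(reranked, relevant, 5)
```

The first pass has perfect recall@10 but zero precision@5; reranking only has to reorder what is already there.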

I've seen teams try to solve this with a better embedding model, which helps marginally, and then independently discover reranking and see 20%+ uplift in answer quality. The lesson: embedding quality matters less than the two-stage architecture once your corpus reaches a few thousand documents.

Advanced Variants

Two-stage reranking, a cascade of two cross-encoders, suits large corpora or high-quality requirements. Use a fast, small cross-encoder (e.g., cross-encoder/ms-marco-MiniLM-L-12-v2) to reduce 100 candidates to 20, then a large cross-encoder (BGE-Reranker-Large or Cohere Rerank) to reduce 20 to 5. The first stage cuts latency by limiting calls to the expensive model. Total latency is still around 300-600ms, but final accuracy is higher than running the small model alone.
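The cascade can be sketched with pluggable scoring functions. The two scorers here are toy word-overlap stand-ins, not real cross-encoders, and the names `rerank` and `cascade` are hypothetical; in practice each scorer would wrap a model call.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]

def rerank(query: str, docs: List[str], scorer: Scorer, keep: int) -> List[str]:
    """Sort docs by scorer(query, doc), highest first, and keep the top `keep`."""
    return sorted(docs, key=lambda d: scorer(query, d), reverse=True)[:keep]

def cascade(query: str, docs: List[str], cheap: Scorer, expensive: Scorer,
            keep_mid: int = 20, keep_final: int = 5) -> List[str]:
    """Prune with the cheap scorer first, then apply the expensive one to survivors."""
    survivors = rerank(query, docs, cheap, keep_mid)
    return rerank(query, survivors, expensive, keep_final)

# Toy stand-in for the small model: count of shared words.
def cheap_overlap(query: str, doc: str) -> float:
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

# Toy stand-in for the large model: Jaccard overlap, with a call log.
expensive_calls: List[str] = []

def expensive_jaccard(query: str, doc: str) -> float:
    expensive_calls.append(doc)
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)

docs = [
    "domestic refund policy",
    "international refund policy",
    "shipping times for orders",
    "how to cancel an order",
]
best = cascade("international refund policy", docs,
               cheap_overlap, expensive_jaccard, keep_mid=2, keep_final=1)
```

The call log shows the point of the design: the expensive scorer only ever sees the documents that survived the cheap pass.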

LLM-as-reranker uses the main LLM to explicitly score each document. You prompt the LLM with the query and a document and ask for a relevance score 0-10. This is accurate (the LLM has maximum context understanding) but expensive (50 LLM calls per query). Use it as a benchmark to measure your reranker quality, not as a production system.
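A minimal sketch of that loop, assuming a hypothetical `llm_call` function that takes a prompt string and returns the model's text reply. The prompt wording and the parsing rule are illustrative choices, not a recommended template.

```python
import re
from typing import Callable, List

PROMPT = (
    "Rate how well the document answers the query on a scale of 0 to 10.\n"
    "Reply with the number only.\n\n"
    "Query: {query}\n\nDocument: {document}\n\nScore:"
)

def parse_score(reply: str) -> float:
    """Extract the first number in the reply, clamped to 0-10; 0.0 if none found."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return 0.0
    return min(max(float(match.group()), 0.0), 10.0)

def llm_rerank(query: str, docs: List[str],
               llm_call: Callable[[str], str], top_k: int = 5) -> List[str]:
    """Score each document with one LLM call, then return the top_k by score."""
    scored = [
        (parse_score(llm_call(PROMPT.format(query=query, document=doc))), doc)
        for doc in docs
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

# Stand-in for a real LLM client, used only to exercise the loop.
def fake_llm(prompt: str) -> str:
    document = prompt.split("Document: ", 1)[1]
    return "Score: 9" if "international" in document.lower() else "3"

ranked = llm_rerank(
    "What is our refund policy for international orders?",
    ["Domestic refunds take 5 days.", "International refunds take 10 days."],
    fake_llm,
    top_k=1,
)
```

Each document costs one LLM call, which is where the 50-calls-per-query expense comes from.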

Score fusion combines multiple retrieval signals: BM25 score + vector similarity score + cross-encoder score. Each score is normalized, then combined with a weighted sum. The weights can be learned on a labeled evaluation set. This is more complex to maintain but useful when different document types respond better to different retrieval methods.
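One way to sketch the normalize-then-weight step is with min-max normalization. The weights and the score values below are placeholders for illustration; as noted above, real weights would be tuned on a labeled evaluation set.

```python
from typing import List, Sequence

def min_max(scores: Sequence[float]) -> List[float]:
    """Rescale scores to the 0-1 range; all zeros if every score is identical."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def fuse(bm25: Sequence[float], vector: Sequence[float], cross: Sequence[float],
         weights: Sequence[float] = (0.2, 0.3, 0.5)) -> List[float]:
    """Weighted sum of the three normalized signals, one fused score per document."""
    signals = [min_max(bm25), min_max(vector), min_max(cross)]
    return [
        sum(w * signal[i] for w, signal in zip(weights, signals))
        for i in range(len(bm25))
    ]

# Placeholder scores for three documents, one list per retrieval signal.
bm25_scores = [12.0, 8.0, 4.0]
vector_scores = [0.80, 0.90, 0.70]
cross_scores = [0.1, 0.9, 0.5]

fused = fuse(bm25_scores, vector_scores, cross_scores)
best_index = max(range(len(fused)), key=lambda i: fused[i])
```

Normalizing first matters because the raw signals live on different scales: BM25 scores are unbounded, while cosine similarities are not.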

Implementation Sketch

# Simplified reranking pipeline - core mechanism only
# Production systems add caching, retries, and error handling

from sentence_transformers import CrossEncoder

class RerankingRAG:
    def __init__(self, vector_store, llm,
                 reranker_model="mixedbread-ai/mxbai-rerank-large-v1"):
        self.vector_store = vector_store
        # LLM client used by answer(); assumed to expose generate(query, context)
        self.llm = llm
        # Cross-encoder: loaded once, reused for all queries
        self.reranker = CrossEncoder(reranker_model)

    def retrieve(self, query: str, top_k_final: int = 5) -> list[dict]:
        # Phase 1: Fast bi-encoder retrieval - high recall, lower precision
        # Use a generous top_k here; we'll prune in phase 2
        candidates = self.vector_store.similarity_search(query, k=50)

        # Phase 2: Cross-encoder reranking - high precision
        # Build (query, document_text) pairs for the cross-encoder
        pairs = [(query, doc.page_content) for doc in candidates]

        # predict() batches the pairs internally and returns one score per pair,
        # in the same order as pairs; higher = more relevant
        scores = self.reranker.predict(pairs)

        # Zip documents with scores, sort descending by score
        ranked = sorted(
            zip(candidates, scores),
            key=lambda x: x[1],
            reverse=True
        )

        # Return the top_k_final most relevant chunks
        return [
            {"document": doc, "score": float(score)}
            for doc, score in ranked[:top_k_final]
        ]

    def answer(self, query: str) -> str:
        top_chunks = self.retrieve(query, top_k_final=5)
        context = "\n\n".join(c["document"].page_content for c in top_chunks)
        # Pass context to LLM (implementation specific)
        return self.llm.generate(query, context)

For managed APIs, swap the CrossEncoder call for a Cohere Rerank API call. The first-pass retrieval stays the same.
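A hedged sketch of what that swap might look like. The `client` argument is assumed to be a Cohere SDK client exposing a `rerank` method whose results carry `index` and `relevance_score`; the default model name is also an assumption, so check the current Cohere documentation before relying on either. The helper name `cohere_rerank` is hypothetical.

```python
def cohere_rerank(client, query: str, texts: list[str], top_k: int = 5,
                  model: str = "rerank-english-v3.0") -> list[dict]:
    """Rerank texts through a managed API and return the top_k with scores.

    Assumes the response lists results already sorted by relevance, each
    pointing back to its position in `texts` via `index`.
    """
    response = client.rerank(model=model, query=query, documents=texts, top_n=top_k)
    return [
        {"text": texts[result.index], "score": result.relevance_score}
        for result in response.results
    ]
```

Inside the pipeline above, `texts` would be the `page_content` of the first-pass candidates, and the local sort disappears because the service returns results in ranked order.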
