Agentic RAG
Learn how agentic RAG replaces the fixed retrieval pipeline with an LLM planner that selects from multiple retrieval tools, handling diverse query types and mixed data sources.
TL;DR
- Agentic RAG replaces a fixed retrieval pipeline with an LLM planner that dynamically selects from multiple retrieval tools (semantic search, keyword search, SQL queries, metadata filters) based on the query type.
- Hybrid search, combining BM25 keyword retrieval with semantic vector retrieval via Reciprocal Rank Fusion, is often reported to improve NDCG@10 by roughly 10-20% over either method alone (the gain depends on the corpus and query mix) and is a natural building block for agentic retrieval.
- Pure semantic search fails on product names, error codes, and proper nouns; pure BM25 fails on conceptual queries and paraphrasing. Hybrid catches both.
- Agentic RAG is non-deterministic: the same query may route to different tools on different runs, making reproducibility and debugging harder.
- The quality of tool descriptions (docstrings) is the primary determinant of routing accuracy; vague or ambiguous tool names cause systematic routing failures.
The Problem It Solves
Your RAG pipeline is built around semantic vector search. It works great for questions like "explain our deployment process" or "what does our refund policy say?" Then a user asks "list all documents uploaded in March" and the system returns irrelevant chunks, or the model hallucinates a list, because upload dates live in metadata rather than in the embedded text, so nothing in the index is semantically close to the query. Another user asks "how many customers signed up last week?" and the pipeline faithfully retrieves text snippets about the customer onboarding process, completely missing that this is a structured data query that requires SQL.
A fixed retrieval pipeline has exactly one tool: semantic vector similarity. That tool is well-suited to exactly one class of queries: those where the answer lives in unstructured text and can be found by conceptual similarity. The moment your knowledge base becomes heterogeneous (policies in docs, metrics in a database, logs in a search index, recent data filtered by date) or your queries become diverse (factual lookups, aggregations, temporal queries, exact keyword searches), the one-tool pipeline fails systematically.
The deeper problem is that retrieval strategy is query-dependent. "What is our refund policy?" needs semantic search. "Find documents from the Q4 product review meeting" needs metadata filtering by date and topic. "How many support tickets were opened about login issues in the past 30 days?" needs a SQL query. No fixed retrieval method handles all three correctly. A system that must handle diverse queries needs to choose its retrieval strategy based on the nature of each individual query.
What Is It?
Agentic RAG decouples the retrieval strategy from the retrieval mechanism, allowing the LLM planner to select the right tool for each query rather than always applying vector similarity.
Think of a reference librarian at a large research library. A basic library catalog (fixed RAG) only lets you search by title and subject. A skilled research librarian (agentic RAG) listens to your research question and decides whether to use the card catalog, a specialized periodical index, the archive room for historical documents, or a direct phone call to a subject expert. The librarian's decision-making is the agentic layer. The underlying collections and tools exist either way; the agent adds intelligent routing.
How It Works
The Tool Registry
The core of agentic RAG is a set of retrieval tools with clear, precise descriptions. The LLM planner uses these descriptions to route each query. The tool descriptions are not cosmetic; they are the primary inputs to the routing decision.
Each tool has a function signature and a description. Below is what a well-defined tool registry looks like:
tools = [
    {
        "name": "semantic_search",
        "description": """Search the knowledge base using semantic similarity.
        Use this when the query asks for explanations, processes, policies, or
        concepts that involve paraphrasing or conceptual overlap with documents.
        Examples: 'what is our deployment process', 'how does user authentication work'.
        NOT for: exact product names, error codes, or queries requiring date filtering.""",
        "parameters": {"query": "str", "top_k": "int = 10"}
    },
    {
        "name": "keyword_search",
        "description": """Search the knowledge base using exact keyword matching (BM25).
        Use this when the query contains specific product names, error codes, version
        numbers, or proper nouns where exact term matching matters.
        Examples: 'ERR_CERT_EXPIRED', 'GPT-4o pricing', 'Jira ticket PROJ-1234'.
        NOT for: conceptual queries, questions requiring understanding of meaning.""",
        "parameters": {"query": "str", "top_k": "int = 10"}
    },
    {
        "name": "sql_query",
        "description": """Query structured data tables with SQL.
        Use this for counting, aggregating, filtering by exact values, or any
        query about metrics, event counts, user statistics, or time-series data.
        Examples: 'how many users signed up last week', 'top 5 error types in logs'.
        NOT for: unstructured document content.""",
        "parameters": {"query": "str"}
    },
    {
        "name": "filter_by_metadata",
        "description": """Filter documents by structured metadata fields.
        Use this when the query specifies a date range, document type, author,
        category, or any attribute that is indexed as structured metadata.
        Examples: 'documents from March 2025', 'all policies tagged as HR'.
        Can be combined with semantic_search for date-filtered semantic retrieval.""",
        "parameters": {"filters": "dict", "semantic_query": "str = None"}
    },
    {
        "name": "read_document",
        "description": """Retrieve the full text of a specific document by ID.
        Use this when a chunk was found via search but the question requires
        more context than the chunk provides, or when you need the complete
        document content rather than a selected passage.
        Requires: doc_id from a prior search result.""",
        "parameters": {"doc_id": "str"}
    }
]
The descriptions above are specific. They tell the planner exactly what the tool is for, give concrete examples, and explicitly state what the tool is NOT for. Vague descriptions like "searches documents" cause the planner to default to semantic search for every query.
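One way to connect registry entries to real code is a dispatch table keyed by tool name. The sketch below is illustrative rather than any particular framework's API: `TOOL_IMPLS`, `dispatch`, and the shape of the tool call dict are hypothetical, and the two stub functions return canned data in place of real search backends.

```python
from typing import Any, Callable


# Stub implementations standing in for real backends (hypothetical).
def semantic_search(query: str, top_k: int = 10) -> list[dict[str, Any]]:
    return [{"doc_id": "doc-1", "text": f"semantic hit for {query!r}"}][:top_k]


def keyword_search(query: str, top_k: int = 10) -> list[dict[str, Any]]:
    return [{"doc_id": "doc-2", "text": f"keyword hit for {query!r}"}][:top_k]


# Maps the "name" field of each registry entry to a callable.
TOOL_IMPLS: dict[str, Callable[..., Any]] = {
    "semantic_search": semantic_search,
    "keyword_search": keyword_search,
}


def dispatch(tool_call: dict[str, Any]) -> Any:
    """Run one planner-issued call shaped like {"name": ..., "arguments": {...}}."""
    name = tool_call["name"]
    if name not in TOOL_IMPLS:
        # Surface unknown tools instead of silently falling back to one tool.
        raise ValueError(f"unknown tool: {name}")
    return TOOL_IMPLS[name](**tool_call.get("arguments", {}))


result = dispatch(
    {"name": "keyword_search", "arguments": {"query": "ERR_CERT_EXPIRED", "top_k": 5}}
)
```

Raising on an unknown name, rather than defaulting to semantic search, keeps routing failures visible when you debug the planner.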
The Planner's Decision Process
When the LLM planner receives a query, it evaluates the query against the tool descriptions and generates a plan: an ordered list of tool calls, each with specific parameters. The plan can be a single tool call or a sequence.
A query like "what documents discuss our security policy?" produces a single semantic_search call.
A query like "what did the Q4 2024 security review document say about access control?" produces a sequence: filter_by_metadata(filters={"date_range": "Q4 2024", "category": "security"}) to find the document, followed by read_document(doc_id=...) to retrieve the full text.
A query like "how many security incidents were reported in Q4 2024 and what was the most common type?" produces a parallel plan: sql_query("SELECT COUNT(*) ...") for the count plus semantic_search("security incident types Q4 2024") for the qualitative context.
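The sequential case, where a later call depends on an earlier result, can be sketched as a small plan executor. This is a minimal sketch under stated assumptions: `run_plan`, `TOOLS`, and the plan format are hypothetical, the tool stubs return canned data, and the lambda stands in for the planner reading the previous step's output before issuing the next call.

```python
from typing import Any, Callable


# Canned tool stubs (hypothetical); a real system would call actual backends.
def filter_by_metadata(filters: dict[str, str]) -> list[dict[str, str]]:
    return [{"doc_id": "sec-review-q4", "title": "Q4 2024 security review"}]


def read_document(doc_id: str) -> str:
    return f"full text of {doc_id}"


TOOLS: dict[str, Callable[..., Any]] = {
    "filter_by_metadata": filter_by_metadata,
    "read_document": read_document,
}


def run_plan(plan: list[dict[str, Any]]) -> list[Any]:
    """Execute steps in order; callable argument values are resolved from earlier results."""
    results: list[Any] = []
    for step in plan:
        args = {
            key: value(results) if callable(value) else value
            for key, value in step["arguments"].items()
        }
        results.append(TOOLS[step["name"]](**args))
    return results


plan = [
    {"name": "filter_by_metadata",
     "arguments": {"filters": {"date_range": "Q4 2024", "category": "security"}}},
    # The doc_id is not known up front, so it is taken from step 0's first hit.
    {"name": "read_document",
     "arguments": {"doc_id": lambda prior: prior[0][0]["doc_id"]}},
]
outputs = run_plan(plan)
```

Steps that do not depend on each other, such as the SQL count and the semantic search in the parallel example, could be run concurrently instead of in order.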
Hybrid Search: BM25 + Semantic via RRF
Hybrid search is not exclusively an agentic RAG pattern. You can build it into a standard fixed RAG pipeline. But it becomes especially powerful as an agentic tool because the planner can choose to invoke it for queries where exact term matching and semantic matching both matter.
The core insight is that BM25 and vector retrieval fail in complementary ways:
- Semantic search fails on: exact product names ("GPT-4o", "Amazon S3"), error codes ("ERR_SSL_VERSION_OR_CIPHER_MISMATCH"), version numbers, proper nouns, and any term where meaning depends on the specific token sequence.
- BM25 fails on: paraphrasing ("make it faster" doesn't match documents about "performance optimization"), conceptual queries ("what's the difference between authentication and authorization"), and vocabulary mismatch.
Reciprocal Rank Fusion (RRF) combines the ranked lists from both methods without requiring score normalization:
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]],
    k: int = 60
) -> list[str]:
    """
    Combines multiple ranked document lists into a single ranking.
    k=60 is the widely used default from the original RRF paper;
    higher k reduces sensitivity to top ranks.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # rank is 0-indexed, so the top result contributes 1 / (k + 1);
            # documents further down each list contribute less
            scores[doc_id] += 1.0 / (k + rank + 1)
    # Sort descending by fused score
    return sorted(scores.keys(), key=lambda d: scores[d], reverse=True)


# Usage: fuse BM25 and semantic rankings.
# `bm25` and `vector_db` are placeholders for your own keyword and vector indexes.
bm25_ranking = bm25.search(query, top_k=50)     # returns [doc_id_1, doc_id_2, ...]
vector_ranking = vector_db.search(query, k=50)  # returns [doc_id_3, doc_id_1, ...]
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
top_20 = fused[:20]  # pass to reranker or directly to LLM
The RRF formula rewards documents that appear highly in multiple rankings. A document ranked #1 in BM25 and #5 in semantic gets a higher fused score than a document ranked #1 in semantic but absent from BM25.
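The arithmetic behind that comparison can be checked directly. The helper below is a hypothetical illustration that uses 1-based positions, which is equivalent to the `k + rank + 1` term in the function above, with the same k = 60.

```python
K = 60  # same constant as the k default above


def rrf_contribution(position: int, k: int = K) -> float:
    """Score contributed by one list for a document at a 1-based position."""
    return 1.0 / (k + position)


# Document A: #1 in BM25 and #5 in semantic.
score_a = rrf_contribution(1) + rrf_contribution(5)
# Document B: #1 in semantic, absent from BM25 (that list contributes nothing).
score_b = rrf_contribution(1)

print(f"A: {score_a:.4f}  B: {score_b:.4f}")  # A: 0.0318  B: 0.0164
```

Document A scores 1/61 + 1/65, nearly double document B's 1/61, so agreement between the two retrievers outweighs a single top placement.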
Parent-Document Retrieval
Parent-document retrieval is a pattern that pairs naturally with agentic RAG via the read_document tool. The idea: chunk documents into small pieces for precise retrieval, but store references to the parent documents so the agent can expand to full context when needed.
Small chunks (200-300 tokens) are good for retrieval because they have focused semantic content. But small chunks are bad for answering questions that require broader context (e.g., "what is the overall architecture described in this document?"). The agent can do both: find the relevant chunk via semantic search, then call read_document with the parent document ID to retrieve the full text.
Implementation requires storing a parent_doc_id field in every chunk's metadata. Most vector databases (Qdrant, Pinecone, Weaviate) support payload/metadata fields that allow this reference to be retrieved along with the chunk.
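The chunk-to-parent expansion can be sketched with in-memory stand-ins. Everything here is hypothetical scaffolding: `Chunk`, `DOCUMENTS`, `CHUNKS`, and the substring-based `search_chunks` replace a real document store and vector index, and only the `parent_doc_id` field comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str
    text: str
    parent_doc_id: str  # reference back to the full document


# In-memory stand-ins for a document store and a chunk index (hypothetical).
DOCUMENTS = {
    "arch-001": "Section 1: overview. Section 2: services. Section 3: data flow.",
}
CHUNKS = [
    Chunk("arch-001#0", "Section 1: overview.", "arch-001"),
    Chunk("arch-001#1", "Section 2: services.", "arch-001"),
]


def search_chunks(query: str) -> list[Chunk]:
    """Toy retrieval: substring match in place of a real similarity search."""
    return [c for c in CHUNKS if query.lower() in c.text.lower()]


def expand_to_parents(chunks: list[Chunk]) -> list[str]:
    """Fetch each distinct parent document once, preserving first-seen order."""
    seen: dict[str, None] = {}
    for chunk in chunks:
        seen.setdefault(chunk.parent_doc_id, None)
    return [DOCUMENTS[doc_id] for doc_id in seen]


hits = search_chunks("services")
full_docs = expand_to_parents(hits)
```

Deduplicating parent IDs matters because several retrieved chunks often come from the same document, and passing that document to the LLM more than once wastes context.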
The Agentic Loop in Practice