Design a RAG chatbot

TL;DR

Two pipelines drive everything: an async ingestion pipeline (chunk, embed, store) and a real-time query pipeline (retrieve, rerank, assemble, generate).
Hybrid retrieval (BM25 + semantic, fused with Reciprocal Rank Fusion) outperforms either alone, especially for exact product names and error codes.
Add a reranker after retrieval to cut the top-50 candidates down to the top-5 chunks before the LLM call. The extra 15-30ms of latency buys a 20-30% accuracy gain.
Guardrails are not optional: a citation check verifies structure, and a RAGAS faithfulness score gates hallucinations before the response ships.
A semantic cache for common questions reaches a 60-70% hit rate on FAQ traffic, cutting both latency and LLM spend.
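
To make the hybrid-retrieval point concrete, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists of chunk ids. The function name, the sample ids, and the constant k=60 (a commonly used default) are assumptions for illustration, not details taken from this design.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into a single ranking.

    Each list contributes 1 / (k + rank) for every id it contains,
    so ids that rank well in several lists rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id to keep output stable.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))

# Hypothetical results from the keyword and semantic retrievers.
bm25_hits = ["c7", "c2", "c9"]
vector_hits = ["c2", "c4", "c7"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)  # ['c2', 'c7', 'c4', 'c9']
```

Because RRF uses only ranks, it needs no score normalisation between BM25 and cosine similarity, which is one reason it is a convenient default for fusing the two.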

Requirements

Functional requirements

Users ask natural-language questions about the company's product documentation (50K documents).
The system retrieves relevant content and returns a grounded answer with source citations.
Responses arrive in under 3 seconds end-to-end, including LLM generation.
New or updated documents are available for retrieval within 10 minutes of ingestion.
The system refuses out-of-scope questions rather than hallucinating an answer.

Non-functional requirements

100K daily active users with peak concurrency around 10K simultaneous requests.
P95 latency under 3 seconds; P99 under 5 seconds.
Zero-hallucination SLA: no response claims a feature exists unless that claim is supported by a cited source.
Embedding model upgrades require full re-embedding of the corpus without user-visible downtime.
Cost target: under $0.02 per query at scale.
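
The zero-hallucination requirement above can be partly enforced with a structural citation check. The sketch below assumes the LLM is prompted to cite sources as bracketed indices such as [1]; that citation format and the function name are hypothetical. It only verifies that every sentence carries a valid citation, not that the cited source actually supports the claim, which is what the faithfulness score is for.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")

def citation_check(answer, num_sources):
    """Return True if every sentence cites at least one valid source index."""
    # Split on sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    if not sentences:
        return False
    for sentence in sentences:
        cited = [int(n) for n in CITATION.findall(sentence)]
        # Reject uncited sentences and indices outside 1..num_sources.
        if not cited or any(n < 1 or n > num_sources for n in cited):
            return False
    return True

ok = citation_check("SSO is supported via SAML 2.0 [1]. Setup takes minutes [2].", 2)
uncited = citation_check("SSO is supported.", 2)
out_of_range = citation_check("SSO is supported [3].", 2)
```

A response that fails this check could be regenerated or replaced with a refusal before it reaches the user.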

The core entities

Document (source of truth, ingestion time)

id, source_url, title, raw_text, metadata (product area, version), updated_at, content_hash

Chunk (derived at ingestion)

chunk_id, document_id, text (200 tokens), embedding (1536-dim float array), parent_chunk_id

Query

query_id, user_id, question_text, session_id, created_at

Response

response_id, query_id, answer_text, source_chunk_ids[], faithfulness_score, latency_ms

EvalResult (offline quality tracking)

eval_id, response_id, faithfulness, context_relevance, answer_relevancy, human_label
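
The entities above could be written down as plain dataclasses. This is a sketch only: the field names come from the lists above, while the Python types and the optional defaults are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Document:
    id: str
    source_url: str
    title: str
    raw_text: str
    metadata: dict          # e.g. product area, version
    updated_at: datetime
    content_hash: str       # lets ingestion detect changed documents

@dataclass
class Chunk:
    chunk_id: str
    document_id: str
    text: str               # roughly 200 tokens
    embedding: list[float]  # 1536-dim vector
    parent_chunk_id: Optional[str] = None

@dataclass
class Query:
    query_id: str
    user_id: str
    question_text: str
    session_id: str
    created_at: datetime

@dataclass
class Response:
    response_id: str
    query_id: str
    answer_text: str
    source_chunk_ids: list[str]
    faithfulness_score: float
    latency_ms: int

@dataclass
class EvalResult:
    eval_id: str
    response_id: str
    faithfulness: float
    context_relevance: float
    answer_relevancy: float
    human_label: Optional[str] = None  # assumed optional until a human reviews it

chunk = Chunk(chunk_id="c1", document_id="d1", text="hello", embedding=[0.0] * 1536)
```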

API design

POST /api/chat (main query endpoint)

Request:  { "question": "Does the product support SSO?", "session_id": "abc123" }
Response: { "answer": "Yes, SSO via SAML 2.0 ...", "sources": [{"url": "...", "title": "..."}] }
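
As an illustration of the response shape, the sketch below assembles the /api/chat payload from an answer and the chunks used to produce it, listing each source document once. The helper name and the per-chunk "url" and "title" keys are assumptions; the source only specifies the final JSON shape.

```python
def build_chat_response(answer, retrieved_chunks):
    """Shape the POST /api/chat response, deduplicating sources by URL."""
    sources, seen = [], set()
    for chunk in retrieved_chunks:
        if chunk["url"] not in seen:
            seen.add(chunk["url"])
            sources.append({"url": chunk["url"], "title": chunk["title"]})
    return {"answer": answer, "sources": sources}

# Two chunks from the same page should yield a single source entry.
payload = build_chat_response(
    "Yes, SSO via SAML 2.0 ...",
    [
        {"url": "https://docs.example.com/sso", "title": "SSO"},
        {"url": "https://docs.example.com/sso", "title": "SSO"},
    ],
)
```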

POST /api/ingest (trigger document ingestion)

Request:  { "document_url": "https://docs.example.com/sso", "priority": "normal" }
Response: { "job_id": "job_456", "status": "queued" }

GET /api/ingest/status/{job_id}

Response: { "job_id": "job_456", "status": "complete", "chunks_created": 24, "duration_ms": 3200 }

GET /api/health

Response: { "status": "ok", "vector_db": "healthy", "llm_provider": "healthy", "p95_latency_ms": 2100 }

High-level design

Two completely separate pipelines share a single vector database. The ingestion pipeline runs asynchronously and can process thousands of documents per hour without touching the query path. The query pipeline is the hot path where every user request flows in real time.

The ingestion pipeline starts with a document loader that fetches raw HTML/Markdown from the docs system, strips navigation chrome, and passes clean text to the chunker. Chunking uses a sentence-aware splitter at 200 tokens with a 20-token overlap so that ideas are not cut mid-sentence. A parent-child strategy stores both the 200-token chunk (used for retrieval) and its 1,000-token parent chunk (sent to the LLM), giving precise retrieval without losing context.
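
The chunking step described above can be sketched roughly as follows. To stay self-contained, this version counts whitespace-separated words as a stand-in for tokens and uses a simple regex for sentence boundaries; a real pipeline would presumably use the embedding model's tokenizer. All function names and the record layout are hypothetical.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def chunk_sentences(text, max_tokens=200, overlap_tokens=20):
    """Greedily pack whole sentences into chunks of at most max_tokens words.

    The last overlap_tokens words of each chunk are carried into the next one.
    A single sentence longer than the budget is kept whole rather than cut.
    """
    chunks, current = [], []
    for sentence in split_sentences(text):
        words = sentence.split()
        if current and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            # Guard against overlap_tokens == 0, where [-0:] would keep everything.
            current = current[-overlap_tokens:] if overlap_tokens else []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks

def build_parent_child(text, parent_tokens=1000, child_tokens=200, overlap_tokens=20):
    """Split text into parents, then split each parent into retrievable children."""
    records = []
    for parent_id, parent in enumerate(chunk_sentences(text, parent_tokens, 0)):
        for child in chunk_sentences(parent, child_tokens, overlap_tokens):
            records.append({"parent_id": parent_id, "parent_text": parent, "text": child})
    return records

sample = " ".join(f"Sentence number {i} has exactly seven words." for i in range(100))
children = chunk_sentences(sample)
records = build_parent_child(sample)
```

At query time, the child text is what gets embedded and matched, while parent_text is what would be handed to the LLM.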

The query pipeline is where the latency budget matters most. Embedding the user's question (15ms), hybrid retrieval (25ms), reranking top-50 to top-5 (25ms), and the LLM call (600-1,500ms) are the four dominant costs. You hit under 3 seconds by using a fast LLM (GPT-4o-mini or Claude Haiku) and parallelising steps where the dependency graph allows it.
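
The latency figures in the paragraph above can be added up as a quick sanity check. The step names are labels chosen for this sketch, and the sum assumes the four steps run strictly in sequence.

```python
# (best case, worst case) in milliseconds, taken from the figures above.
BUDGET_MS = {
    "embed_query": (15, 15),
    "hybrid_retrieval": (25, 25),
    "rerank_50_to_5": (25, 25),
    "llm_generation": (600, 1500),
}
TARGET_MS = 3000

best = sum(lo for lo, _ in BUDGET_MS.values())
worst = sum(hi for _, hi in BUDGET_MS.values())
headroom = TARGET_MS - worst

print(best, worst, headroom)  # 665 1565 1435
```

Even in the worst case the itemised steps use a little over half of the 3-second budget, which leaves room for anything not itemised here, and the LLM call clearly dominates.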
