Design a RAG chatbot
Walk through designing a production RAG chatbot end-to-end: ingestion pipeline, retrieval, context assembly, guardrails, and scaling to 100K daily active users with roughly 10K concurrent requests at peak.
TL;DR
- Two pipelines drive everything: an async ingestion pipeline (chunk, embed, store) and a real-time query pipeline (retrieve, rerank, assemble, generate).
- Hybrid retrieval (BM25 + semantic, fused with Reciprocal Rank Fusion) outperforms either alone, especially for exact product names and error codes.
- Add a reranker after retrieval to cut from top-50 to top-5 chunks before the LLM call. The 15-30ms extra latency buys a 20-30% accuracy gain.
- Guardrails are not optional: a citation check verifies structure, and a RAGAS faithfulness score gates hallucinations before the response ships.
- A semantic cache for common questions hits 60-70% on FAQ traffic, collapsing both latency and LLM spend.
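The hybrid retrieval point is easier to see with a small sketch of Reciprocal Rank Fusion. This is a minimal illustration, not the design's actual implementation: the function name, the chunk IDs, and the constant k=60 (a commonly used default) are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranked_ids in rankings:
        for rank, chunk_id in enumerate(ranked_ids, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


bm25_hits = ["c7", "c2", "c9"]      # hypothetical keyword (BM25) results
semantic_hits = ["c2", "c5", "c7"]  # hypothetical vector-search results
fused = reciprocal_rank_fusion([bm25_hits, semantic_hits])
print(fused)  # ['c2', 'c7', 'c5', 'c9']
```

Chunks that appear in both lists (c2, c7) rise to the top, which is why fusion helps with exact product names and error codes: a strong keyword match still counts even when its embedding similarity is only moderate.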
Requirements
Functional requirements
- Users ask natural-language questions about the company's product documentation (50K documents).
- The system retrieves relevant content and returns a grounded answer with source citations.
- Responses arrive in under 3 seconds end-to-end, including LLM generation.
- New or updated documents are available for retrieval within 10 minutes of ingestion.
- The system refuses out-of-scope questions rather than hallucinating an answer.
Non-functional requirements
- 100K daily active users with peak concurrency around 10K simultaneous requests.
- P95 latency under 3 seconds; P99 under 5 seconds.
- Zero-hallucination SLA: no response claims a feature exists unless that claim is supported by a cited source.
- Embedding model upgrades require full re-embedding of the corpus without user-visible downtime.
- Cost target: under $0.02 per query at scale.
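The zero-hallucination requirement implies a structural citation check before a response ships. The sketch below shows one possible shape for it. It assumes the LLM is prompted to end every sentence with bracketed markers such as [1] that index into the retrieved sources; that marker format and the function name are hypothetical, and the check covers structure only, not whether the cited source actually supports the claim.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")


def check_citations(answer, num_sources):
    """Return (ok, problems) for a structural citation check.

    Assumes 1-based markers like [1] that refer to retrieved sources.
    Does not judge faithfulness; that needs a separate scoring step.
    """
    problems = []
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    for i, sentence in enumerate(sentences, start=1):
        cited = [int(n) for n in CITATION.findall(sentence)]
        if not cited:
            problems.append(f"sentence {i}: no citation")
        for n in cited:
            if not 1 <= n <= num_sources:
                problems.append(f"sentence {i}: unknown source [{n}]")
    return (not problems, problems)


ok, problems = check_citations(
    "SSO is supported via SAML [1]. SCIM provisioning is available [3]. It is free.",
    num_sources=2,
)
print(ok, problems)
```

Here the second sentence cites a source that was never retrieved and the third cites nothing, so the response would be blocked or regenerated rather than returned.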
The core entities
Document (source of truth, ingestion time)
id, source_url, title, raw_text, metadata (product area, version), updated_at, content_hash
Chunk (derived at ingestion)
chunk_id, document_id, text (200 tokens), embedding (1536-dim float array), parent_chunk_id
Query
query_id, user_id, question_text, session_id, created_at
Response
response_id, query_id, answer_text, source_chunk_ids[], faithfulness_score, latency_ms
EvalResult (offline quality tracking)
eval_id, response_id, faithfulness, context_relevance, answer_relevancy, human_label
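As a rough illustration, three of these entities could be modelled as Python dataclasses. The field types are assumptions, since only the field names are given. The helper that uses content_hash to skip re-embedding unchanged documents is also an assumed use of that field, not something the design states.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class Document:
    id: str
    source_url: str
    title: str
    raw_text: str
    metadata: dict           # product area, version
    updated_at: datetime
    content_hash: str


@dataclass
class Chunk:
    chunk_id: str
    document_id: str
    text: str                # roughly 200 tokens
    embedding: list[float]   # 1536 dimensions
    parent_chunk_id: Optional[str] = None


@dataclass
class Response:
    response_id: str
    query_id: str
    answer_text: str
    source_chunk_ids: list[str] = field(default_factory=list)
    faithfulness_score: Optional[float] = None
    latency_ms: Optional[int] = None


def content_hash(text: str) -> str:
    """SHA-256 of the cleaned text (hash choice is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_reembedding(doc: Document, fetched_text: str) -> bool:
    """True when the freshly fetched text differs from what was ingested."""
    return doc.content_hash != content_hash(fetched_text)
```

Query and EvalResult would follow the same pattern. Comparing hashes before chunking lets the ingestion pipeline skip documents whose text has not changed, which helps with the 10-minute freshness target.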
API design
POST /api/chat (main query endpoint)
Request: { "question": "Does the product support SSO?", "session_id": "abc123" }
Response: { "answer": "Yes, SSO via SAML 2.0 ...", "sources": [{"url": "...", "title": "..."}] }
POST /api/ingest (trigger document ingestion)
Request: { "document_url": "https://docs.example.com/sso", "priority": "normal" }
Response: { "job_id": "job_456", "status": "queued" }
GET /api/ingest/status/{job_id}
Response: { "job_id": "job_456", "status": "complete", "chunks_created": 24, "duration_ms": 3200 }
GET /api/health
Response: { "status": "ok", "vector_db": "healthy", "llm_provider": "healthy", "p95_latency_ms": 2100 }
High-level design
Two completely separate pipelines share a single vector database. The ingestion pipeline runs asynchronously and can process thousands of documents per hour without touching the query path. The query pipeline is the hot path where every user request flows in real time.
The ingestion pipeline starts with a document loader that fetches raw HTML/Markdown from the docs system, strips navigation chrome, and passes clean text to the chunker. Chunking uses a sentence-aware splitter at 200 tokens with a 20-token overlap, so chunk boundaries fall between sentences rather than mid-idea. A parent-child strategy stores each 200-token chunk (the unit that is retrieved) alongside its 1,000-token parent chunk (the unit sent to the LLM), giving precise retrieval without losing context.
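The chunking strategy described above might look roughly like this. It is a simplified sketch: tokens are approximated by whitespace-separated words rather than a real tokenizer, the sentence splitter is a naive regex, and all function names are hypothetical.

```python
import re


def split_sentences(text):
    """Naive split on terminal punctuation followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_sentences(text, max_tokens=200, overlap_tokens=20):
    """Greedy sentence-aware chunking; chunks end on sentence boundaries.

    The overlap is carried over by word count, so a chunk may begin
    mid-sentence. A single sentence longer than max_tokens is kept whole.
    """
    chunks, current = [], []  # current holds the words of the open chunk
    for sentence in split_sentences(text):
        words = sentence.split()
        if current and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            # Seed the next chunk with the tail of the previous one.
            current = current[-overlap_tokens:] if overlap_tokens else []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks


def build_parent_child(text, parent_tokens=1000, child_tokens=200, overlap=20):
    """Split into parents first, then split each parent into child chunks."""
    records = []
    for parent_id, parent in enumerate(chunk_sentences(text, parent_tokens, 0)):
        for child in chunk_sentences(parent, child_tokens, overlap):
            records.append(
                {"text": child, "parent_id": parent_id, "parent_text": parent}
            )
    return records
```

At query time the child text is what gets embedded and matched, while parent_text is what would be placed in the LLM prompt.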
The query pipeline is where the latency budget matters most. Embedding the user's question (15ms), hybrid retrieval (25ms), reranking top-50 to top-5 (25ms), and the LLM call (600-1,500ms) are the four dominant costs. You hit under 3 seconds by using a fast LLM (GPT-4o-mini or Claude Haiku) and parallelising steps where the dependency graph allows it.
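Adding up the stage figures quoted above gives a quick sanity check on the 3-second target. The stage labels are made up for illustration; the numbers are the ones from the paragraph, run sequentially with no parallelism.

```python
BUDGET_MS = 3000  # end-to-end target from the requirements

# (best case, worst case) per stage in milliseconds
stages = {
    "embed_query": (15, 15),
    "hybrid_retrieval": (25, 25),
    "rerank_50_to_5": (25, 25),
    "llm_generation": (600, 1500),
}

best = sum(low for low, _ in stages.values())
worst = sum(high for _, high in stages.values())
headroom = BUDGET_MS - worst

print(f"sequential: {best}-{worst} ms, headroom: {headroom} ms")
# sequential: 665-1565 ms, headroom: 1435 ms
```

Even in the worst case the four stages use about half the budget. The remaining headroom presumably has to cover everything not itemised here, such as network hops, guardrail checks, and queueing under peak load.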