Design a semantic search engine
Walk through designing a semantic search system that combines dense embeddings with sparse retrieval, handles 100M documents, and returns relevant results in under 200ms.
TL;DR
- The single most important architectural insight: hybrid retrieval (BM25 + dense vectors) merged with Reciprocal Rank Fusion (RRF) outperforms either approach alone by 15-25% on recall@10, because keyword search catches exact matches that embeddings miss and embeddings catch synonyms that keywords miss.
- The query pipeline runs in under 200ms P95: embed the query (5ms), fan out to BM25 and vector search in parallel (~30ms each), fuse ranked lists via RRF (1ms), rerank top-50 to top-10 with a cross-encoder (20-30ms), return results.
- Fine-tuning a small embedding model (e5-small or bge-small) on domain-specific query-document pairs gives 20% better recall than a generic large model at 3x lower inference cost. This is the highest-ROI optimization for any search system.
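- Scaling to 100M documents requires an IVF-PQ index (inverted file plus product quantization), which compresses roughly 600GB of raw float32 vectors (about what 100M 1,536-dimensional embeddings occupy) down to ~150GB with less than 5% recall loss, and horizontal sharding across a Qdrant or Weaviate cluster.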
- The production lesson that separates juniors from seniors: embedding model upgrades require full re-indexing of every document. Plan for this from day one with a blue-green index strategy, or you will face a multi-day reindexing outage every time you improve your model.
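As a rough sizing check on the storage figures above, the arithmetic can be sketched in a few lines. The vector count (one vector per document) and the dimensionalities are assumptions chosen for illustration; the article does not state them, and the helper names are hypothetical.

```python
GB = 10**9
NUM_VECTORS = 100_000_000  # assumption: one vector per document


def raw_vector_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Size of uncompressed vectors (float32 uses 4 bytes per dimension)."""
    return num_vectors * dims * bytes_per_value


def pq_code_bytes(num_vectors: int, num_subquantizers: int, bits_per_code: int = 8) -> int:
    """Size of the PQ codes alone: one code per subquantizer per vector."""
    return num_vectors * num_subquantizers * bits_per_code // 8


# Raw float32 footprint at two common embedding sizes.
raw_1536_gb = raw_vector_bytes(NUM_VECTORS, 1536) / GB  # 614.4
raw_384_gb = raw_vector_bytes(NUM_VECTORS, 384) / GB    # 153.6

# Per-vector byte budget implied by a ~150GB target.
budget_bytes_per_vector = 150 * GB / NUM_VECTORS        # 1500.0

# PQ codes only, with 96 one-byte codes per vector (illustrative setting).
pq_96_gb = pq_code_bytes(NUM_VECTORS, 96) / GB          # 9.6
```

Under these assumptions, ~600GB matches 1,536-dimensional float32 vectors, and a 150GB target leaves about 1,500 bytes per vector, roughly one byte per dimension. PQ codes alone can be far smaller, so the real footprint depends on the chosen parameters and on whether raw vectors are kept for rescoring. Indexing chunks rather than whole documents multiplies every figure by the average number of chunks per document.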
Requirements
Functional requirements
- Users can submit natural language queries and receive the top-K most semantically relevant documents, ranked by relevance score.
- The system indexes documents from multiple sources (web pages, PDFs, internal knowledge bases) through an ingestion pipeline that chunks, embeds, and stores content.
- The system supports both semantic similarity search and exact keyword matching (hybrid search) so queries like "error code E-4012" return exact matches while "how to fix a leaky faucet" matches "plumbing repair tutorial for dripping tap."
- Users can apply metadata filters (date range, source, category, language) alongside semantic search to narrow results without sacrificing relevance.
- The system provides a relevance feedback mechanism where users can upvote or downvote results, feeding signal back into the ranking model.
- The system detects and handles multi-lingual queries, returning results in the same language or cross-lingually when requested.
Non-functional requirements
- P95 query latency under 200ms end-to-end (from query received to results returned).
- Support for 100M indexed documents with up to 10M new documents ingested per day.
- Query throughput: 1,000 queries per second sustained.
- Search relevance: NDCG@10 above 0.65 on the domain evaluation set, measured weekly.
- Cost: under $2,000/month for embedding inference and vector storage at 100M document scale.
- Availability: 99.9% uptime with zero-downtime index updates and model upgrades.
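The NDCG@10 target can be checked with a small helper. This is a minimal sketch that uses linear gain and computes the ideal ordering from the same judged list; evaluation libraries differ on both choices, so treat it as illustrative, not as the metric definition this system must use.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain: relevance discounted by log2(rank + 1)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the returned order divided by DCG of the best possible order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant results at all
    return dcg_at_k(relevances, k) / ideal


# Graded relevance labels for one query's results, in returned order.
perfect = ndcg_at_k([3, 2, 1])   # already in ideal order
swapped = ndcg_at_k([0, 1])      # the only relevant result is at rank 2
```

Averaging `ndcg_at_k` over every query in the evaluation set gives the weekly number to compare against the 0.65 threshold.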
The hardest engineering problem here: embedding model upgrades. Vectors produced by different models live in different embedding spaces and cannot be compared with each other, so when you switch from text-embedding-3-small to a fine-tuned model, every single document must be re-embedded and re-indexed. At 100M documents, that is 3-5 days of continuous GPU processing. Without a blue-green index strategy, you either serve stale embeddings or suffer downtime.
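One way to picture the blue-green strategy is an in-memory registry that refuses to switch traffic until the new index is fully backfilled. This is a sketch under stated assumptions: `IndexRegistry`, its methods, and the index names are hypothetical, and a real deployment would use the alias or collection-switch mechanism of its vector database.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class IndexRegistry:
    """Tracks backfill progress per index and which index serves queries."""
    docs_embedded: Dict[str, int] = field(default_factory=dict)
    active_index: Optional[str] = None

    def create_index(self, index_name: str) -> None:
        self.docs_embedded[index_name] = 0

    def record_backfill(self, index_name: str, num_docs: int) -> None:
        self.docs_embedded[index_name] += num_docs

    def promote(self, index_name: str, total_docs: int) -> None:
        """Switch live traffic only once the index holds every document."""
        if self.docs_embedded[index_name] < total_docs:
            raise RuntimeError(f"{index_name} is not fully backfilled")
        self.active_index = index_name


registry = IndexRegistry()
registry.create_index("chunks_v1")            # blue: current model
registry.record_backfill("chunks_v1", 1_000)
registry.promote("chunks_v1", total_docs=1_000)

registry.create_index("chunks_v2")            # green: new model
registry.record_backfill("chunks_v2", 600)    # queries still hit chunks_v1
registry.record_backfill("chunks_v2", 400)
registry.promote("chunks_v2", total_docs=1_000)
```

The query path must embed with the model that matches the active index, so the model switch and the index switch need to happen together.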
The core entities
Document
doc_id, source (web/pdf/knowledge_base), raw_text, url, title, language, created_at, updated_at, metadata (category, author, tags)
Chunk
chunk_id, doc_id, chunk_index, text, token_count, embedding_vector (float32[]), embedding_model_version, created_at
SearchQuery
query_id, raw_query, query_embedding, filters (date_range, source, language), top_k, user_id, timestamp
SearchResult
result_id, query_id, chunk_id, rank, rrf_score, dense_score, sparse_score, rerank_score, clicked (boolean), feedback (upvote/downvote/null)
EmbeddingModel
model_id, model_name, dimensions, index_name, is_active, deployed_at, ndcg_score, total_docs_embedded
API design
POST /api/search - execute a hybrid search query
Request: {
  "query": "how to fix a leaky kitchen faucet",
  "top_k": 10,
  "filters": {
    "source": "knowledge_base",
    "language": "en",
    "date_after": "2025-01-01"
  },
  "include_scores": true
}
Response: {
  "query_id": "qry_sem_7x2k",
  "results": [
    {
      "chunk_id": "chk_48291",
      "doc_id": "doc_plumbing_guide",
      "title": "Plumbing Repair Tutorial: Dripping Taps",
      "snippet": "To repair a dripping tap, first turn off the water supply valve under the sink...",
      "scores": { "rrf": 0.89, "dense": 0.84, "sparse": 0.72, "rerank": 0.93 },
      "rank": 1
    },
    {
      "chunk_id": "chk_51002",
      "doc_id": "doc_home_repair",
      "title": "Home Repair: Kitchen Sink Issues",
      "snippet": "A leaking faucet wastes up to 3,000 gallons per year. Replace the O-ring...",
      "scores": { "rrf": 0.81, "dense": 0.79, "sparse": 0.68, "rerank": 0.87 },
      "rank": 2
    }
  ],
  "latency_ms": 142,
  "model_used": "bge-small-finetuned-v2"
}
The primary search endpoint. Embeds the query, runs parallel dense + sparse retrieval, fuses with RRF, reranks, and returns the final ranked list. Metadata filters are applied as pre-filters on the vector DB side.
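The fusion step in this endpoint can be sketched as follows. Reciprocal Rank Fusion scores each chunk as the sum of 1 / (k + rank) over the lists it appears in; the constant k = 60 is a commonly used default and an assumption here, as are the function name and the `top_n` cutoff.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def rrf_fuse(
    ranked_lists: Sequence[Sequence[str]],
    k: int = 60,       # assumption: common default, not specified by the article
    top_n: int = 50,   # candidates handed to the reranker
) -> List[Tuple[str, float]]:
    """Merge ranked lists of chunk IDs using Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by chunk ID for deterministic output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_n]


dense_hits = ["a", "b", "c"]    # ranked output of the vector search
sparse_hits = ["b", "d", "a"]   # ranked output of BM25
fused = rrf_fuse([dense_hits, sparse_hits])
fused_ids = [chunk_id for chunk_id, _ in fused]  # ['b', 'a', 'd', 'c']
```

Because RRF uses only ranks, it needs no calibration between BM25 scores and cosine similarities. Raw RRF scores are small (at most 2/61 for two lists with k = 60), so a value like the 0.89 in the example response would imply a normalization step that is not described here.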
POST /api/documents/ingest - ingest a document into the search index
Request: {
  "url": "https://example.com/plumbing-guide",
  "source": "web",
  "metadata": { "category": "home_repair", "language": "en" }
}
Response: {
  "doc_id": "doc_plumbing_guide",
  "chunks_created": 12,
  "total_tokens": 3840,
  "embedding_model": "bge-small-finetuned-v2",
  "status": "indexed",
  "latency_ms": 2400
}
Triggers the ingestion pipeline: fetch content, chunk, embed each chunk, and index in both the vector DB and BM25 index. Returns once indexing is complete.
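The chunking step of this pipeline might look like the sketch below. The 512-token window, the 64-token overlap, and whitespace splitting as a stand-in for the embedding model's tokenizer are all illustrative assumptions.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int = 512, overlap: int = 64) -> List[List[str]]:
    """Split a token list into fixed-size windows that overlap by `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Whitespace splitting stands in for a real tokenizer.
text = "t0 t1 t2 t3 t4 t5 t6 t7 t8 t9"
small_chunks = chunk_tokens(text.split(), chunk_size=4, overlap=1)
# [['t0','t1','t2','t3'], ['t3','t4','t5','t6'], ['t6','t7','t8','t9']]
```

The overlap means a sentence that straddles a boundary appears whole in at least one chunk, at the cost of embedding some tokens twice.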
POST /api/feedback - submit relevance feedback on a search result
Request: {
  "query_id": "qry_sem_7x2k",
  "chunk_id": "chk_48291",
  "feedback": "upvote"
}
Response: {
  "status": "recorded",
  "feedback_id": "fb_9x2m"
}
Captures user relevance signals for offline analysis. Aggregated feedback data feeds into periodic fine-tuning of the embedding model and reranker.
GET /api/documents/{doc_id} - retrieve a document and its chunks
Response: {
  "doc_id": "doc_plumbing_guide",
  "title": "Plumbing Repair Tutorial: Dripping Taps",
  "url": "https://example.com/plumbing-guide",
  "chunks": 12,
  "indexed_at": "2026-04-10T14:30:00Z",
  "embedding_model": "bge-small-finetuned-v2"
}
Administrative endpoint for inspecting indexed documents and verifying that ingestion completed correctly.
High-level design
A semantic search engine has two distinct pipelines that share an embedding model but serve different workloads. The ingestion pipeline (offline) processes documents in bulk: it fetches raw content, splits it into overlapping chunks of 256-512 tokens, embeds each chunk with the active model, and writes the resulting vectors to a vector database while simultaneously indexing the raw text in Elasticsearch for BM25 search. The query pipeline (online) handles user searches in real time: it embeds the query, fans out to both the vector DB and Elasticsearch in parallel, merges the ranked lists with RRF, and optionally reranks the top candidates with a cross-encoder.
The embedding model sits behind a dedicated inference service that serves both pipelines. During ingestion, it processes batches of chunks (throughput-optimized, 64-128 chunks per batch). During queries, it processes single queries with minimal latency (latency-optimized, 5ms per embed). I have seen teams try to use the same serving configuration for both workloads and end up with either slow ingestion or high query latency. Separate the serving pools.
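The difference between the two serving pools can be made concrete with a small configuration sketch. `PoolConfig`, its fields, and the wait times are hypothetical; only the batch sizes come from the text above.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class PoolConfig:
    """Hypothetical settings for one embedding-inference pool."""
    name: str
    max_batch_size: int
    max_wait_ms: int  # how long to wait for a batch to fill before running it


# Throughput-optimized: large batches, willing to wait for them to fill.
INGESTION_POOL = PoolConfig(name="ingestion", max_batch_size=128, max_wait_ms=50)
# Latency-optimized: run each query immediately.
QUERY_POOL = PoolConfig(name="query", max_batch_size=1, max_wait_ms=0)


def batched_chunks(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield successive lists of up to `batch_size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


chunk_texts = [f"chunk {i}" for i in range(300)]
batch_sizes = [len(b) for b in batched_chunks(chunk_texts, INGESTION_POOL.max_batch_size)]
```

Keeping the two configurations as separate objects makes it harder to tune one workload at the expense of the other by accident.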
A metadata store (PostgreSQL) tracks documents, chunks, model versions, and feedback signals. This is the source of truth for what is indexed, which model produced each embedding, and which documents need re-embedding after a model upgrade.
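Finding the chunks that still need re-embedding could be a single query against this store. The sketch below uses SQLite from the Python standard library as a stand-in for PostgreSQL; the table layout follows the Chunk and EmbeddingModel entities, but the exact schema, the sample rows, and the model name `generic-v1` are assumptions.

```python
import sqlite3
from typing import List

SCHEMA = """
CREATE TABLE embedding_models (
    model_id   TEXT PRIMARY KEY,
    model_name TEXT NOT NULL,
    is_active  INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE chunks (
    chunk_id                TEXT PRIMARY KEY,
    doc_id                  TEXT NOT NULL,
    embedding_model_version TEXT NOT NULL
);
"""


def chunks_needing_reembedding(conn: sqlite3.Connection, limit: int = 1000) -> List[str]:
    """Return IDs of chunks whose embedding was not produced by the active model."""
    rows = conn.execute(
        """
        SELECT c.chunk_id
        FROM chunks AS c
        WHERE c.embedding_model_version NOT IN (
            SELECT model_id FROM embedding_models WHERE is_active = 1
        )
        ORDER BY c.chunk_id
        LIMIT ?
        """,
        (limit,),
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO embedding_models VALUES (?, ?, ?)",
    [("m1", "generic-v1", 0), ("m2", "bge-small-finetuned-v2", 1)],
)
conn.executemany(
    "INSERT INTO chunks VALUES (?, ?, ?)",
    [("chk_1", "doc_a", "m1"), ("chk_2", "doc_a", "m2"), ("chk_3", "doc_b", "m1")],
)
stale_chunks = chunks_needing_reembedding(conn)
```

A backfill worker can call this in a loop with a modest `limit`, re-embed each batch, and update `embedding_model_version` until the query returns nothing.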
For your interview: draw both pipelines from the start. The interviewer wants to see that you understand offline ingestion and online serving are fundamentally different workloads with different optimization strategies.
The key insight in the query pipeline is the parallel fan-out: BM25 and vector search run simultaneously, so the latency of the retrieval stage is the max of the two (not the sum).
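The fan-out can be illustrated with `asyncio.gather`. In this minimal sketch both retrievers are stand-ins that sleep for about 30ms and return hard-coded chunk IDs; the function names and delays are assumptions, not calls into a real client library.

```python
import asyncio
import time
from typing import List, Tuple


async def bm25_search(query: str, delay_s: float = 0.03) -> List[str]:
    """Stand-in for the sparse retriever; sleeps to simulate the round trip."""
    await asyncio.sleep(delay_s)
    return ["chk_51002", "chk_48291"]


async def vector_search(query: str, delay_s: float = 0.03) -> List[str]:
    """Stand-in for the dense retriever; sleeps to simulate the round trip."""
    await asyncio.sleep(delay_s)
    return ["chk_48291", "chk_51002"]


async def fan_out(
    query: str, sparse_delay_s: float = 0.03, dense_delay_s: float = 0.03
) -> Tuple[List[str], List[str], float]:
    """Run both retrievers concurrently and report the wall-clock time in ms."""
    start = time.perf_counter()
    sparse_hits, dense_hits = await asyncio.gather(
        bm25_search(query, sparse_delay_s),
        vector_search(query, dense_delay_s),
    )
    elapsed_ms = (time.perf_counter() - start) * 1000
    return sparse_hits, dense_hits, elapsed_ms


if __name__ == "__main__":
    sparse, dense, elapsed = asyncio.run(fan_out("how to fix a leaky faucet"))
    print(f"retrieval stage took {elapsed:.0f} ms")
```

Awaiting the two calls one after the other would take roughly the sum of the delays, while `gather` finishes when the slower one does. The same reasoning means the slower retriever sets the tail latency of the stage, so that is the one to tune first.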