Typeahead Search
Walk through designing a real-time autocomplete system that returns ranked suggestions in under 100ms for billions of daily queries, from a simple Trie to a distributed, cache-first prefix index.
What is a typeahead search service?
Every time you type into a search box and suggestions appear before you finish, a typeahead service is running. The real challenge is not storing queries; it is retrieving ranked prefix matches in under 100ms across billions of daily requests, while a background write pipeline refreshes popularity scores fast enough that trending queries surface within minutes. I find this question fascinating in interviews because it forces you to reason about read-path optimization and write-path freshness simultaneously, two goals that directly conflict. This question tests memory-efficient data structures, distributed caching, and offline aggregation pipelines all in one answer.
Functional Requirements
Core Requirements
- As a user types a prefix, return the top N relevant suggestions in real time.
- Suggestions are ranked by popularity (query frequency).
- Results appear within 100ms of each keystroke.
Below the Line (out of scope)
- Full-text search across document bodies
- Multi-language or fuzzy-match suggestions
- Personalized suggestions based on individual user history
The hardest part in scope: Storing prefix-to-suggestion mappings in a structure that supports O(prefix_length) lookups at 58,000 queries per second while keeping popularity scores fresh enough that trending queries surface within 5 minutes. Fast reads and near-real-time write propagation push in opposite directions, and the entire architecture is a negotiation between them.
Full-text search across document bodies is out of scope because it requires inverted indexes over tokenized content rather than prefix matching over query strings. To add it, I would run a separate document index using Elasticsearch, then merge and re-rank the two result sets in a blending layer before serving.
Multi-language and fuzzy-match suggestions are out of scope because they require language detection, stemming, and edit-distance calculations that each add latency. To add them, I would run a fast approximate-match pass using BK-trees or n-gram indexes in parallel with the exact prefix lookup, then merge the results at the suggestion service layer.
Personalized suggestions are out of scope because they require a per-user history store and a real-time re-ranking pass, multiplying storage by the number of active users. To add personalization, I would store recent queries in a small per-user Redis hash, blend personal frequency with global popularity using a weighted score, and re-rank the top-20 candidates from the global index at request time.
Non-Functional Requirements
Core Requirements
- Latency: Suggestion response under 100ms p99 end to end. The lookup itself must complete in under 20ms (network, serialization, and Redis each consume their own slice of that budget).
- Scale: 5B search queries per day, 1B unique queries in the prefix index.
- Writes: 10M new unique queries discovered per day; popularity scores updated every 5 minutes for the full corpus, with trending queries surfacing within 60 seconds via an in-memory fast path.
- Availability: 99.9% uptime. Stale suggestions for a brief window are acceptable; missing suggestions for minutes are not.
- Data freshness: Trending queries must influence suggestions within 5 minutes of going viral.
Below the Line
- Perfectly synchronized query counts across all Trie replicas in real time
- Sub-second freshness for newly discovered queries
Read-dominant workload: At 5B queries per day (roughly 58,000 GET requests per second) against 10M new unique query writes per day (roughly 115 per second), the read-to-write ratio is approximately 500:1. Every architecture decision in this article is read-path optimization first. Write freshness matters, but never at the cost of read latency.
I'd call out that the 100ms budget is end to end from keystroke to rendered suggestion. A typical mobile network round trip consumes 20-40ms, Redis lookup adds 1ms, serialization adds 2-5ms. That leaves under 20ms for the actual prefix lookup, which is tight and determines why a Redis cache in front of the Trie is not optional at this scale.
Core Entities
- Query: A search string issued by a user. The atomic unit of what we index and count.
- QueryCount: The aggregated frequency of a given query string over a rolling time window. This is the popularity score that drives suggestion ranking.
- Suggestion: A query string enriched with its popularity score, ready to be returned in the suggest API response.
- PrefixIndex: The in-memory data structure (a compressed Radix tree) that maps a prefix string to its precomputed top-K suggestions.
Full schema, indexes, and column types are deferred to the data model deep dive. The entities above are enough to drive the API design and High-Level Design.
API Design
FR 1 and FR 3: Return ranked suggestions for a typed prefix:
# Return the top N suggestions for the typed prefix
GET /v1/search/suggest?q={prefix}&limit=10
Response: {
suggestions: [
{ query: "netflix", score: 9823410 },
{ query: "new york times", score: 7234109 },
{ query: "near me", score: 5901002 }
],
prefix: "ne"
}
Use GET because this is a pure read. The q parameter is short and URL-safe for typical prefix lengths. limit defaults to 10 and caps at 20 to prevent unbounded result sets. The response echoes the prefix back so clients can discard stale responses if the user has moved on to a longer prefix before the response arrives.
The naive endpoint shape works here. There is no failure mode that demands an evolved shape at the request contract level; the complexity lives inside the service, not at the API boundary.
FR 2 (implicit): Log completed searches for ranking:
# Record a completed search for offline frequency aggregation
POST /v1/search/log
Body: { query: "netflix", session_id: "s_abc", timestamp: "2026-03-29T12:00:00Z" }
Response: { ok: true }
This is a fire-and-forget write. Clients call it after the user commits to a full search, not on every keystroke. Logging only completed searches rather than every prefix keystroke reduces write volume by roughly 5x and filters out noise prefixes that never resolve to an intent. Authentication is out of scope but would add a user_id field here for personalization.
High-Level Design
1. Return top suggestions for a typed prefix
The minimal read path: client sends the prefix, Suggestion Service looks it up in an in-memory prefix index, returns the top-K results.
The simplest design that satisfies FR 1 and FR 3 is a single Suggestion Service with a prefix index loaded into memory. I will treat the index internals as a black box here and cover the data structure in Deep Dive 1. Keeping the index in-process (not behind a network call) is a deliberate choice: at 58K QPS, even a 1ms network hop to an external index service adds up to 58 seconds of cumulative wait per second across the fleet.
Components:
- Client: Web or mobile browser sending
GET /v1/search/suggest?q=...on each keystroke. - Suggestion Service: Stateless app server. Receives the prefix, traverses the in-memory prefix index, and returns ranked results.
- In-Memory Prefix Index: The prefix lookup data structure. Each prefix maps to a precomputed top-K list of suggestions sorted by score. Lookups run in O(prefix_length) time.
Request walkthrough:
- User types "ne". Client fires
GET /v1/search/suggest?q=ne&limit=10. - Suggestion Service receives the request.
- Suggestion Service looks up the prefix "ne" in the in-memory index.
- The index returns the precomputed top-10 suggestions for that prefix.
- Suggestion Service serializes and returns the result to the client.
This covers the read path only. The prefix index is static here. The next section adds the pipeline that keeps it populated and fresh.
2. Rank suggestions by popularity
The write pipeline: every completed search is logged, aggregated asynchronously, and applied to update query scores in the prefix index.
The naive approach is to update the prefix index directly on every search, incrementing query frequency counts in real time. At 58,000 requests per second, this creates contention on shared index nodes and requires distributed locking. The fix is to decouple writes from reads entirely using a Kafka-backed aggregation pipeline.
Components:
- Search Log Writer: Receives
POST /v1/search/logevents. Validates the payload and publishes each event to thesearch-eventsKafka topic. Write path only; never touches the read-path index. - Kafka (search-events topic): Buffers the raw search event stream. Durable log partitioned by query string so all events for the same query land on the same partition.
- Aggregation Job: Reads from Kafka in micro-batches (every 60 seconds). Computes query frequency counts over a rolling 1-hour window. Writes updated counts to the Query Count Store.
- Query Count Store: Redis sorted sets, one key per index prefix shard, storing query strings as members with their popularity scores as values. The prefix index is rebuilt from this store on a schedule.
Request walkthrough (write path):
- User commits a search for "netflix". Client fires
POST /v1/search/log. - Search Log Writer validates the event and publishes it to the
search-eventsKafka topic. - Aggregation Job consumes the last 60 seconds of events and computes updated frequency counts.
- Aggregation Job writes updated scores to the Query Count Store using
ZADD. - A Trie Rebuild Job reads the Query Count Store every 5 minutes and updates the prefix index.
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.