Typeahead Search
Walk through designing a real-time autocomplete system that returns ranked suggestions in under 100ms for billions of daily queries, from a simple Trie to a distributed, cache-first prefix index.
What is a typeahead search service?
Every time you type into a search box and suggestions appear before you finish, a typeahead service is running. The real challenge is not storing queries; it is retrieving ranked prefix matches in under 100ms across billions of daily requests, while a background write pipeline refreshes popularity scores fast enough that trending queries surface within minutes. I find this question fascinating in interviews because it forces you to reason about read-path optimization and write-path freshness simultaneously, two goals that directly conflict. This question tests memory-efficient data structures, distributed caching, and offline aggregation pipelines all in one answer.
Functional Requirements
Core Requirements
- As a user types a prefix, return the top N relevant suggestions in real time.
- Suggestions are ranked by popularity (query frequency).
- Results appear within 100ms of each keystroke.
Below the Line (out of scope)
- Full-text search across document bodies
- Multi-language or fuzzy-match suggestions
- Personalized suggestions based on individual user history
The hardest part in scope: Storing prefix-to-suggestion mappings in a structure that supports O(prefix_length) lookups at 58,000 queries per second while keeping popularity scores fresh enough that trending queries surface within 5 minutes. Fast reads and near-real-time write propagation push in opposite directions, and the entire architecture is a negotiation between them.
Full-text search across document bodies is out of scope because it requires inverted indexes over tokenized content rather than prefix matching over query strings. To add it, I would run a separate document index using Elasticsearch, then merge and re-rank the two result sets in a blending layer before serving.
Multi-language and fuzzy-match suggestions are out of scope because they require language detection, stemming, and edit-distance calculations that each add latency. To add them, I would run a fast approximate-match pass using BK-trees or n-gram indexes in parallel with the exact prefix lookup, then merge the results at the suggestion service layer.
Personalized suggestions are out of scope because they require a per-user history store and a real-time re-ranking pass, multiplying storage by the number of active users. To add personalization, I would store recent queries in a small per-user Redis hash, blend personal frequency with global popularity using a weighted score, and re-rank the top-20 candidates from the global index at request time.
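The blend described above could be sketched as follows. This is a minimal illustration, assuming both scores are normalized to [0, 1] before weighting; the function name, weight value, and normalization scheme are illustrative assumptions, not part of the design:

```python
def rerank(candidates, personal_counts, w_personal=0.3):
    """Re-rank global top-K candidates with a per-user frequency blend.

    candidates: [(query, global_score)] from the global prefix index.
    personal_counts: {query: times_this_user_searched_it}, e.g. from a
    small per-user Redis hash. Weight and normalization are illustrative.
    """
    max_g = max(score for _, score in candidates) or 1
    max_p = max(personal_counts.values(), default=0) or 1

    def blended(item):
        query, global_score = item
        personal = personal_counts.get(query, 0)
        # weighted blend of normalized global popularity and personal frequency
        return (1 - w_personal) * (global_score / max_g) \
               + w_personal * (personal / max_p)

    return sorted(candidates, key=blended, reverse=True)
```

With a 0.3 personal weight, a query the user searches often can overtake a globally more popular sibling without personal history ever fully overriding global popularity.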
Non-Functional Requirements
Core Requirements
- Latency: Suggestion response under 100ms p99 end to end. The lookup itself must complete in under 20ms (network, serialization, and Redis each consume their own slice of that budget).
- Scale: 5B search queries per day, 1B unique queries in the prefix index.
- Writes: 10M new unique queries discovered per day; popularity scores updated every 5 minutes for the full corpus, with trending queries surfacing within 60 seconds via an in-memory fast path.
- Availability: 99.9% uptime. Stale suggestions for a brief window are acceptable; missing suggestions for minutes are not.
- Data freshness: Trending queries must influence suggestions within 5 minutes of going viral.
Below the Line
- Perfectly synchronized query counts across all Trie replicas in real time
- Sub-second freshness for newly discovered queries
Read-dominant workload: At 5B queries per day (roughly 58,000 GET requests per second) against 10M new unique query writes per day (roughly 115 per second), the read-to-write ratio is approximately 500:1. Every architecture decision in this article is read-path optimization first. Write freshness matters, but never at the cost of read latency.
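The headline numbers come straight from back-of-envelope arithmetic over an 86,400-second day:

```python
SECONDS_PER_DAY = 24 * 60 * 60                      # 86,400

reads_per_sec = 5_000_000_000 / SECONDS_PER_DAY     # ~57,870 GET requests/s
writes_per_sec = 10_000_000 / SECONDS_PER_DAY       # ~116 new unique queries/s
read_write_ratio = reads_per_sec / writes_per_sec   # exactly 500:1
```

These averages ignore diurnal peaks, so real capacity planning would multiply by a peak factor, but the 500:1 shape of the workload is what drives the design.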
I'd call out that the 100ms budget is end to end from keystroke to rendered suggestion. A typical mobile network round trip consumes 20-40ms, Redis lookup adds 1ms, serialization adds 2-5ms. That leaves under 20ms for the actual prefix lookup, which is tight and determines why a Redis cache in front of the Trie is not optional at this scale.
Core Entities
- Query: A search string issued by a user. The atomic unit of what we index and count.
- QueryCount: The aggregated frequency of a given query string over a rolling time window. This is the popularity score that drives suggestion ranking.
- Suggestion: A query string enriched with its popularity score, ready to be returned in the suggest API response.
- PrefixIndex: The in-memory data structure (a compressed Radix tree) that maps a prefix string to its precomputed top-K suggestions.
Full schema, indexes, and column types are deferred to the data model deep dive. The entities above are enough to drive the API design and High-Level Design.
API Design
FR 1 and FR 3: Return ranked suggestions for a typed prefix:
# Return the top N suggestions for the typed prefix
GET /v1/search/suggest?q={prefix}&limit=10
Response: {
suggestions: [
{ query: "netflix", score: 9823410 },
{ query: "new york times", score: 7234109 },
{ query: "near me", score: 5901002 }
],
prefix: "ne"
}
Use GET because this is a pure read. The q parameter is short and URL-safe for typical prefix lengths. limit defaults to 10 and caps at 20 to prevent unbounded result sets. The response echoes the prefix back so clients can discard stale responses if the user has moved on to a longer prefix before the response arrives.
The naive endpoint shape works here. There is no failure mode that demands an evolved shape at the request contract level; the complexity lives inside the service, not at the API boundary.
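The echoed prefix is what makes client-side discarding of out-of-order responses trivial. A minimal sketch of that client logic (class and field names are hypothetical; a real client would be JavaScript, but the idea is language-agnostic):

```python
class SuggestClient:
    """Drops suggestion responses that arrive after the user typed past them."""

    def __init__(self):
        self.current_prefix = ""

    def on_keystroke(self, prefix):
        # remember the latest prefix the user has actually typed
        self.current_prefix = prefix

    def on_response(self, response):
        # the server echoes the prefix it computed results for; if the user
        # has already moved on to a longer prefix, this response is stale
        if response["prefix"] != self.current_prefix:
            return None
        return response["suggestions"]
```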
FR 2 (implicit): Log completed searches for ranking:
# Record a completed search for offline frequency aggregation
POST /v1/search/log
Body: { query: "netflix", session_id: "s_abc", timestamp: "2026-03-29T12:00:00Z" }
Response: { ok: true }
This is a fire-and-forget write. Clients call it after the user commits to a full search, not on every keystroke. Logging only completed searches rather than every prefix keystroke reduces write volume by roughly 5x and filters out noise prefixes that never resolve to an intent. Authentication is out of scope but would add a user_id field here for personalization.
High-Level Design
1. Return top suggestions for a typed prefix
The minimal read path: client sends the prefix, Suggestion Service looks it up in an in-memory prefix index, returns the top-K results.
The simplest design that satisfies FR 1 and FR 3 is a single Suggestion Service with a prefix index loaded into memory. I will treat the index internals as a black box here and cover the data structure in Deep Dive 1. Keeping the index in-process (not behind a network call) is a deliberate choice: at 58K QPS, even a 1ms network hop to an external index service adds 58 request-seconds of cumulative wait every wall-clock second, which by Little's law means roughly 58 extra requests permanently in flight across the fleet.
Components:
- Client: Web or mobile browser sending GET /v1/search/suggest?q=... on each keystroke.
- Suggestion Service: Stateless app server. Receives the prefix, traverses the in-memory prefix index, and returns ranked results.
- In-Memory Prefix Index: The prefix lookup data structure. Each prefix maps to a precomputed top-K list of suggestions sorted by score. Lookups run in O(prefix_length) time.
Request walkthrough:
- User types "ne". Client fires GET /v1/search/suggest?q=ne&limit=10.
- Suggestion Service receives the request.
- Suggestion Service looks up the prefix "ne" in the in-memory index.
- The index returns the precomputed top-10 suggestions for that prefix.
- Suggestion Service serializes and returns the result to the client.
This covers the read path only. The prefix index is static here. The next section adds the pipeline that keeps it populated and fresh.
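As a stand-in for the black-box index, here is a flat-dict sketch that precomputes a top-K list per prefix so reads never sort. This exact approach would blow up memory at 1B queries (every prefix of every query gets its own entry), which is precisely why Deep Dive 1 replaces it with a compressed Radix tree; the read-path contract, prefix in, precomputed top-K out, stays identical:

```python
from collections import defaultdict

def build_prefix_index(query_counts, k=10, max_prefix_len=20):
    """Map every prefix to its K most popular completions.

    query_counts: {query_string: popularity_score}. Sorting happens once
    at build time; lookups are a single dict access.
    """
    buckets = defaultdict(list)
    for query, score in query_counts.items():
        for i in range(1, min(len(query), max_prefix_len) + 1):
            buckets[query[:i]].append((query, score))
    # keep only the top-K per prefix, best score first
    return {p: sorted(c, key=lambda x: -x[1])[:k] for p, c in buckets.items()}

def suggest(index, prefix, limit=10):
    return index.get(prefix, [])[:limit]
```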
2. Rank suggestions by popularity
The write pipeline: every completed search is logged, aggregated asynchronously, and applied to update query scores in the prefix index.
The naive approach is to update the prefix index directly on every search, incrementing query frequency counts in real time. At 58,000 requests per second, this creates contention on shared index nodes and requires distributed locking. The fix is to decouple writes from reads entirely using a Kafka-backed aggregation pipeline.
Components:
- Search Log Writer: Receives POST /v1/search/log events. Validates the payload and publishes each event to the search-events Kafka topic. Write path only; never touches the read-path index.
- Kafka (search-events topic): Buffers the raw search event stream. Durable log partitioned by query string so all events for the same query land on the same partition.
- Aggregation Job: Reads from Kafka in micro-batches (every 60 seconds). Computes query frequency counts over a rolling 1-hour window. Writes updated counts to the Query Count Store.
- Query Count Store: Redis sorted sets, one key per index prefix shard, storing query strings as members with their popularity scores as values. The prefix index is rebuilt from this store on a schedule.
Request walkthrough (write path):
- User commits a search for "netflix". Client fires POST /v1/search/log.
- Search Log Writer validates the event and publishes it to the search-events Kafka topic.
- Aggregation Job consumes the last 60 seconds of events and computes updated frequency counts.
- Aggregation Job writes updated scores to the Query Count Store using ZADD.
- A Trie Rebuild Job reads the Query Count Store every 5 minutes and updates the prefix index.
The prefix index is fully decoupled from the write path. Reads never contend with writes. The tradeoff is freshness: trending queries take up to 5 minutes to surface in suggestions (the index rebuild interval). In my experience, this decoupling is the single most important architectural decision in typeahead systems. Every production issue I have seen at scale traces back to coupling the read and write paths.
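One micro-batch of the aggregation job can be sketched like this, with a plain dict standing in for the Redis sorted set; in production each write would be a ZADD against the shard's key, and the job would maintain the rolling 1-hour window rather than counting a single batch as shown here:

```python
from collections import Counter

def flush_window(window_events, count_store):
    """Apply one micro-batch of raw search events to the count store.

    window_events: query strings consumed from the Kafka topic.
    count_store: dict standing in for the Redis sorted set. Writing
    absolute counts (ZADD semantics: set, not increment) means replaying
    the same window after a crash produces the same final scores.
    """
    window_counts = Counter(window_events)
    for query, count in window_counts.items():
        count_store[query] = count      # ZADD <shard_key> <count> <query>
    return window_counts
```

The set-not-increment choice is what makes the at-least-once Kafka delivery safe: a replayed batch overwrites scores with identical values instead of double-counting.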
Deep Dive 2 covers how to shrink that freshness window to under 60 seconds.
Why Kafka instead of writing directly to Redis?
Publishing to Kafka first gives the aggregation job a replayable audit log. If the aggregation job crashes mid-batch, it replays from the last committed Kafka offset rather than losing those events. Writing directly to Redis from every Suggestion Service instance also produces a thundering-herd of 58K concurrent ZADD calls per second, which saturates a Redis cluster quickly.
3. Scale reads to 5B queries per day
Add a Redis cache in front of the prefix index fleet so that common prefixes are served in under 1ms without touching the index.
A single Suggestion Service instance handling index lookups becomes the bottleneck at 58K QPS. The key insight for scaling reads is that prefix popularity follows a power-law distribution: the top 10,000 prefixes ("the", "ne", "goo", "am") account for the vast majority of all queries. I always draw this power-law curve on the whiteboard during interviews because it justifies the entire caching strategy in one picture. Caching those in Redis means the index fleet only sees cache misses, which are a small fraction of total traffic.
Components:
- Redis Suggestion Cache: Precomputed top-K results per prefix. Key = suggest:{prefix}; value = serialized suggestion list. TTL = 60 seconds. Expect more than 95% cache hit rate for common prefixes.
- Suggestion Service (evolved): Checks Redis before the index. On a cache hit, returns immediately. On a cache miss, looks up the index, then writes the result back to Redis.
- Prefix Index Fleet: Multiple Suggestion Service instances with in-memory indexes, handling only cache misses. Each instance holds the full index in RAM.
Request walkthrough:
- Client fires GET /v1/search/suggest?q=ne.
- Suggestion Service issues GET suggest:ne to Redis.
GET suggest:neto Redis. - Cache hit (more than 95% of requests): return serialized suggestions immediately. Latency: 5-10ms total.
- Cache miss: traverse the prefix index, serialize the result, write to Redis with a 60-second TTL, return to the client. Latency: 15-25ms total.
This three-component read path handles 58K QPS comfortably when Redis is sized appropriately. The index fleet only handles the cold tail of the prefix distribution (rare or novel prefixes not yet in the Redis cache).
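The evolved read path is classic cache-aside. A sketch with a tiny in-memory stand-in for Redis (GET and SETEX are the two Redis commands the real service would issue; the stand-in class is illustrative only):

```python
import json
import time

class FakeRedis:
    """In-memory stand-in for Redis GET/SETEX, for illustration only."""
    def __init__(self):
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry and entry[1] > time.time():    # honor the TTL
            return entry[0]
        return None

    def setex(self, key, ttl, value):
        self._data[key] = (value, time.time() + ttl)

class SuggestionService:
    def __init__(self, cache, index, ttl=60):
        self.cache, self.index, self.ttl = cache, index, ttl

    def suggest(self, prefix, limit=10):
        key = f"suggest:{prefix}"
        cached = self.cache.get(key)
        if cached is not None:                        # hit: >95% of requests
            return json.loads(cached)[:limit]
        result = self.index.get(prefix, [])[:limit]   # miss: traverse the index
        self.cache.setex(key, self.ttl, json.dumps(result))
        return result
```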
Cache TTL is a freshness knob, not a free variable
A 60-second TTL means cached suggestions for a prefix can be up to 60 seconds stale. This is acceptable for most queries but can be visibly wrong during a breaking news event. Do not shorten the TTL below 30 seconds without profiling Redis write throughput: with 30-second TTLs, every active prefix expires and is rewritten twice per minute, and across a key space of a million-odd cached prefixes that approaches 2M cache writes per minute, a meaningful Redis CPU load.
Potential Deep Dives
1. What data structure should store prefix-to-suggestion mappings?
This is the core question of the entire system. Three constraints drive the choice: prefix lookups must be O(prefix_length); the index must fit in RAM (RAM is the budget, not disk); and each prefix must return already-ranked top-K results without sorting at query time. The index holds 1B unique query strings averaging 30 characters each. I would lead the deep dive with these three constraints explicitly because they immediately eliminate most data structure candidates and focus the conversation on Tries and their variants.
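To make the constraints concrete, here is a toy Patricia-style trie: path-compressed edge labels, a precomputed top-K list per node, O(prefix_length) lookups with no query-time sorting. This is a sketch for intuition, not a production structure: it assumes each query is inserted once and skips deduplication, deletion, and memory tuning.

```python
class Node:
    __slots__ = ("children", "topk")
    def __init__(self):
        self.children = {}   # edge label (compressed path segment) -> Node
        self.topk = []       # precomputed [(query, score)], best first

def _common_len(a, b):
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return i

def _push(node, query, score, k):
    # maintain the node's precomputed top-K list (assumes unique queries)
    node.topk = sorted(node.topk + [(query, score)], key=lambda x: -x[1])[:k]

def insert(root, query, score, k=10):
    node, rest = root, query
    _push(node, query, score, k)
    while rest:
        # at most one child edge can share a first character with rest
        hit = next(((lbl, _common_len(lbl, rest))
                    for lbl in node.children if lbl[0] == rest[0]), None)
        if hit is None:
            child = Node()
            node.children[rest] = child
            _push(child, query, score, k)
            return
        label, n = hit
        child = node.children[label]
        if n < len(label):                    # split the compressed edge
            mid = Node()
            mid.children[label[n:]] = child
            mid.topk = list(child.topk)
            del node.children[label]
            node.children[label[:n]] = mid
            child = mid
        _push(child, query, score, k)
        node, rest = child, rest[n:]

def lookup(root, prefix):
    node, rest = root, prefix
    while rest:
        for label, child in node.children.items():
            n = _common_len(label, rest)
            if n == len(rest):                # prefix ends on this edge
                return child.topk
            if n == len(label):               # consume the edge, keep walking
                node, rest = child, rest[n:]
                break
        else:
            return []                         # no edge matches: unknown prefix
    return node.topk
```

Inserting "netflix", "new york times", and "near me" produces a single "ne" edge off the root instead of a two-node character chain, which is where the 5-10x node-count reduction over a character-level Trie comes from.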
2. How do you keep suggestion rankings fresh at scale?
Trending queries (a celebrity announcement, a sports score) must surface in suggestions within 5 minutes. The write pipeline ingests 58,000 search log events per second at peak. The challenge is absorbing this volume without blocking the read path and without losing events during aggregation.
3. How do you serve 5B queries per day under 100ms with high availability?
At 58K requests per second with a 100ms p99 budget, Redis plus a Trie fleet is the right topology but needs careful capacity planning and failure mode design. A poorly sized cache or a single Redis node turns a clean architecture into an availability hazard. I have seen production incidents where a single Redis node failure cascaded into a full Trie fleet overload within 30 seconds because nobody modeled the cache-miss amplification factor.
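The cache-miss amplification factor mentioned here is worth quantifying with the article's own numbers:

```python
qps = 58_000
hit_rate = 0.95

fleet_load_normal = qps * (1 - hit_rate)   # ~2,900 QPS of cache misses
fleet_load_cache_down = qps                # Redis fully down: every request misses
amplification = fleet_load_cache_down / fleet_load_normal   # a 20x instant spike
```

A Trie fleet sized for 2,900 QPS with modest headroom has no chance against a sudden 20x surge, which is why the deep dive must cover Redis replication and load shedding, not just steady-state capacity.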
Final Architecture
The single most important architectural insight: the read path and write path are completely decoupled. The Trie is a read-only snapshot that is hot-swapped periodically from the Query Count Store. Reads never contend with writes, and the write pipeline never mutates the live index.
Interview Cheat Sheet
- Store prefix-to-suggestion mappings in a compressed Radix tree (Patricia trie). Each node holds one edge label (compressed path segment) and a precomputed top-K suggestion list. This gives O(prefix_length) lookups and reduces node count 5-10x compared to a character-level Trie.
- Keep the index memory-efficient by storing only top-K suggestions at each node (not pointers to the full subtree). With K=10, every node holds exactly 10 query strings plus scores. When a score changes, only the nodes on that query's root-to-leaf path need updating: at most one per prefix character, about 50 nodes for the longest queries, each paying O(K log K) to re-sort its top-K list.
- Never update the Trie synchronously with live search traffic. At 58K QPS, direct mutations require distributed locking and create hot-node contention. All writes go to Kafka, are aggregated by a Flink job, and applied to the Trie on a 5-minute rebuild cycle.
- Handle 10M new unique query writes per day by filtering noise at the Query Count Store. Exclude queries with fewer than 100 occurrences from the Trie. This eliminates roughly 80% of unique query strings (mostly single-occurrence typos) without affecting suggestion quality.
- Serve the 500:1 read/write ratio with a two-tier cache. Client-side SessionStorage covers 70% of keystrokes via parent-prefix filtering. Redis Suggestion Cache covers 95%+ of the remaining 30%. The Trie fleet handles only about 1.5% of total request volume.
- Shard the Trie by the first character of the prefix, weighted by observed volume. A strict alphabetic split creates hotspots on "s" and "a" prefixes. Use a frequency histogram from your query log to assign shard boundaries so each shard handles roughly equal QPS.
- Use blue/green Trie hot-swaps during rebuilds. Two instances per shard: one serving live traffic, one building the new snapshot. Switch the active pointer atomically after a readiness probe. This prevents empty or partially-built responses during the 60-90 second build window.
- Blend three time windows for ranking stability. Formula: score = 0.5 * count_1h + 0.3 * count_24h + 0.2 * count_7d. Short-term weight surfaces trends; long-term weight prevents a single viral query from permanently displacing evergreen results.
- Add an in-memory hot-query boost for sub-60-second freshness. Each Suggestion Service instance tracks query frequency locally over the last 60 seconds. If a query exceeds a threshold count, its score is temporarily boosted in the response without waiting for a Trie rebuild. The boost is ephemeral and not persisted across instances.
- The 100ms p99 latency budget allocates roughly 20ms to the actual server-side lookup. Network round-trip consumes 20-40ms. Redis lookup adds 1ms. Serialization adds 2-5ms. A Trie lookup exceeding 20ms breaks the budget. Measure Trie shard lookup time independently from Redis response time using separate histograms.
- At 1B unique queries averaging 30 characters, the raw string data is 30GB. The Radix tree adds roughly 2x overhead for the node structure, reaching about 60GB for the full index. Split across 8 shards, each shard holds roughly 7-8GB, well within a standard memory-optimized instance.
- Trending query freshness is 5 minutes from the Trie rebuild cycle, or 60 seconds via the hot-query boost. The Kafka/Flink pipeline writes to the Query Count Store every 60 seconds. The Trie Rebuild Job reads from there every 5 minutes. The in-memory boost provides the fast path for queries trending right now, before the next rebuild fires.
- Design for at-least-once delivery in the Kafka pipeline. If the Flink job crashes mid-window, it replays from the last committed Kafka offset. The ZADD operation in Redis is idempotent for stable score values. Replaying the same events produces the same final scores, so double-processing an event window is safe.
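The three-window blend translates directly to code. The counts below are hypothetical, chosen to show why the long-term terms matter:

```python
def blended_score(count_1h, count_24h, count_7d):
    # weights from the cheat sheet formula: short-term surfaces trends,
    # long-term anchors evergreen results
    return 0.5 * count_1h + 0.3 * count_24h + 0.2 * count_7d

# an evergreen query with steady volume vs. a one-hour viral spike
evergreen = blended_score(count_1h=1_000, count_24h=30_000, count_7d=200_000)
spike = blended_score(count_1h=60_000, count_24h=60_000, count_7d=60_000)
```

The spike outranks the evergreen query while it is hot, but once its 1-hour count decays its blended score falls back below the evergreen result, exactly the "no permanent displacement" property the formula is designed for.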