Top-K Heavy Hitters
Design a system that tracks the top K most popular items in real time across multiple time windows, from simple in-memory heaps to Count-Min Sketch and distributed stream aggregation at LinkedIn or Amazon scale.
What is a top-K heavy hitters system?
A top-K heavy hitters system identifies the K items seen most frequently in a stream of events, such as the top 10 trending hashtags on Twitter or the top 100 searched products on Amazon. The core loop (count everything, rank it, return the top K) sounds simple. The hard part is two things: counting accurately at millions of events per second without blowing out memory, and merging local top-K lists across dozens of distributed stream processors into a single globally correct ranking without re-processing history.
I'd open with the merge problem on a whiteboard because it catches most candidates off guard. Everyone knows how to count; the interesting design work is in how you combine counts from 32 machines into one correct ranking without shipping all the data.
Functional Requirements
Core Requirements
- Users can submit item events (for example, a search query string, a product view, a hashtag click) to the system.
- Users can query the top K items for a given time window (for example, top 10 in the last hour, top 100 in the last 24 hours).
- Rankings update within 30 seconds of new events arriving.
- Multiple time windows are supported simultaneously (last 1 minute, 1 hour, 24 hours) without re-processing history.
Below the Line (out of scope)
- User authentication and per-user personalization
- Item metadata enrichment (images, display names)
- Exact deduplication of duplicate events from the same user session
The hardest part in scope: Merging distributed top-K results correctly. Naively unioning the top-K list from each stream processor is wrong. An item ranked 11th on every individual node can be globally first if its traffic is spread evenly across partitions, and you miss it entirely. The solution (merging Count-Min Sketches) is the deepest technical challenge in this system, and merging Count-Min Sketches is the subject of Deep Dive 3.
User authentication is below the line because it does not change the counting or ranking path. To add it, attach a user_id to each event but route the event through the same Kafka topic and count pipeline unchanged.
Item metadata enrichment is below the line because it belongs in a separate lookup service. The ranking layer stores only item IDs and approximate counts. Display names and images are joined at the API response layer using a separate catalog service.
Exact deduplication is below the line because it requires a per-user Bloom filter or a Redis SET of (user_session, item_id) pairs, which adds significant storage complexity. Approximate deduplication is acceptable for trending signals: if a user clicks the same product 20 times in a session, the trend still reflects genuine engagement.
Non-Functional Requirements
Core Requirements
- Throughput: Handle 1 billion events per day (roughly 11,500 events per second at steady state, with 3x peak burst = 35,000 events per second).
- Staleness: Rankings are stale by at most 30 seconds. Sub-second freshness is not required; 30-second staleness is imperceptible for trending content.
- Memory: Memory per time window is bounded regardless of item cardinality. At 1 million distinct items per day, exact counting needs 8 MB per window. Count-Min Sketch reduces this to under 200 KB per window.
- Availability: 99.9% uptime. A brief gap in ranking freshness is acceptable; returning stale results from a pre-computed snapshot is always preferable to a total outage.
- Query latency: Top-K reads return in under 10 ms p99. Rankings are pre-computed, so reads resolve to a single Redis
ZREVRANGEcommand.
Below the Line
- Exact (non-approximate) counts for all items (audit-grade counting requires storing every event individually, which does not bound memory)
- Sub-second ranking latency (requires a hot path with no batching window, approaching the Redis throughput ceiling at 35K events/second)
Read/write ratio: 1 billion events per day vs roughly 100,000 top-K queries per day gives a 10,000:1 write-to-read ratio. This is an extremely write-heavy system. The write path is the entire design challenge: ingestion throughput, stream aggregation, and memory-bounded counting all exist to handle this asymmetry. The read path is trivial by comparison, because pre-computed snapshots reduce every query to a Redis
ZREVRANGE.
The 10,000:1 ratio tells you immediately that the system must decouple writes from reads. You cannot compute rankings synchronously on every write event, and you do not need to: a 30-second staleness window gives you time to batch-aggregate and snapshot asynchronously.
I'd write this ratio on the whiteboard early because it immediately kills any design that tries to compute rankings on every incoming event. Once the interviewer sees 10,000:1, the conversation shifts to "how do we absorb writes cheaply and serve reads from a snapshot?"
Core Entities
- Event: A single occurrence of an item being counted (a search query, a product view, a trending signal). Carries
item_id,event_type, andtimestamp. - Counter: The running approximate frequency for an
item_idwithin a specific time window. Stored in-memory inside stream processors as a Count-Min Sketch structure, not as individual key-value rows. - TopKSnapshot: A pre-computed, point-in-time list of the top K items and their approximate counts for a given time window. Stored in Redis as a sorted set and refreshed every 30 seconds.
- TimeWindow: A named time boundary (for example,
last_1m,last_1h,last_24h) that scopes a TopKSnapshot. The system maintains one active TopKSnapshot per TimeWindow.
Schema details for the snapshot store (sorted set key naming, TTL policy, serialization format) are deferred to the deep dives. The four entities above are sufficient to drive the API design and High-Level Design.
API Design
The system exposes two endpoints: one to ingest events and one to query rankings. Events are ingested asynchronously because the write volume (35K events/sec at peak) makes synchronous counting impractical at the app server layer.
FR 1 and FR 3 - Ingest an event:
POST /events
Body: {
item_id: "product:abc123",
event_type: "view",
timestamp: 1711670400
}
Response: HTTP 202 Accepted
Body: { message: "Event queued for processing" }
202 instead of 200: the event is not counted the moment the POST completes. It enters a Kafka queue and is counted asynchronously by a stream processor. Returning 200 would imply the event was counted immediately, which is false and misleading to callers. 202 accurately signals acceptance for later processing.
FR 2 - Query top K for a time window:
GET /top-k?window=last_1h&k=10
Response: {
window: "last_1h",
k: 10,
snapshot_age_seconds: 14,
items: [
{ rank: 1, item_id: "product:abc123", approximate_count: 94821 },
{ rank: 2, item_id: "hashtag:worldcup", approximate_count: 87443 }
]
}
The response includes snapshot_age_seconds so callers know exactly how stale the pre-computed result is. This transparency lets downstream services decide whether to display the result immediately or wait for a fresher snapshot before refreshing a high-visibility leaderboard.
High-Level Design
1. Event ingestion at scale
Direct database writes saturate the primary at 50K events per second, making a write queue the first non-negotiable addition to the system.
The simplest design routes every event from the app server directly to a relational database. At 35,000 events per second at peak, a single Postgres instance cannot sustain the write rate without growing queue depth unboundedly (typical ceiling: 10,000-20,000 simple writes per second before latency degrades). The fix is a Kafka topic that absorbs burst writes and lets stream processors consume at their own pace.
Components:
- Client: Web or mobile client that generates item events (product views, search queries, clicks).
- App Server: Accepts
POST /events, validates the payload, and publishes the event to Kafka. Does not perform any counting. - Kafka: Durable event queue that absorbs write bursts. Decouples the ingestion rate from the stream processing rate. Events are retained for 24 hours to support replay after processor failure.
- Stream Processor Fleet: A horizontally scalable fleet (Kafka Streams, Flink, or Spark Streaming) that consumes from Kafka and maintains in-memory frequency counts. Each processor owns one or more Kafka partitions.
- Redis Counter Store: Receives aggregated item counts flushed from stream processors every 30 seconds. Stores counts as Redis hashes keyed by time window and item ID.
Request walkthrough:
- Client sends
POST /eventswithitem_idandevent_type. - App Server validates the payload and publishes to the Kafka topic
item-events, usingitem_idas the partition key (so all events for the same item always go to the same processor). - Kafka durably persists the event.
- Each stream processor reads from its assigned partitions and increments an in-memory count for the
item_id. - Every 30 seconds, each processor flushes aggregated counts to Redis Counter Store using
HINCRBYbatches.
This diagram covers the write path only. The query path and ranking pre-computation come in the next section.
I'd pause here in an interview and explicitly say: "The app server does zero counting. It is a dumb pipe into Kafka." That single sentence communicates that you understand the separation of concerns, and it prevents the interviewer from wondering whether the app server is a bottleneck.
2. Querying top K: pre-computed snapshots
Scanning all counters on every query is O(N) across potentially millions of items; pre-computing a TopKSnapshot every 30 seconds and serving reads from a Redis sorted set eliminates all per-query computation.
The naive query approach reads all counters from Redis and sorts them in application memory. At 1 million distinct items, that is a million HGETALL values to sort per query. For bulk query traffic (thousands of users refreshing a leaderboard simultaneously), on-demand sorting is a bottleneck that pre-computation eliminates entirely.
Components (new and changed):
- Snapshot Builder: A background process that runs every 30 seconds. Reads all counters from Redis Counter Store for each time window, computes top K using a min-heap, and writes the result to a Redis sorted set.
- Redis Sorted Set (TopKSnapshot): Stores the pre-computed top-K list per time window. Key pattern:
topk:{window}.ZADDwrites the score (approximate count) and member (item_id).ZREVRANGEreads the top K in O(log N + K) time. - Query API: Serves
GET /top-kby callingZREVRANGEon the appropriate Redis sorted set. No ranking computation at query time.
Request walkthrough (query path):
- Client sends
GET /top-k?window=last_1h&k=10. - Query API resolves the Redis key:
topk:last_1h. - Query API calls
ZREVRANGE topk:last_1h 0 9 WITHSCORES. - Redis returns the top 10
item_idvalues and their approximate counts in O(log N + K) time. - Query API attaches
snapshot_age_seconds(computed from the ZADD timestamp stored alongside the set) and returns the response.
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.