Design a system that tracks the top K most popular items in real time across multiple time windows, from simple in-memory heaps to Count-Min Sketch and distributed stream aggregation at LinkedIn or Amazon scale.
37 min read · 2026-03-29 · hard · top-k · heavy-hitters · system-design · streaming · distributed-systems
A top-K heavy hitters system identifies the K items seen most frequently in a stream of events, such as the top 10 trending hashtags on Twitter or the top 100 searched products on Amazon. The core loop (count everything, rank it, return the top K) sounds simple. The hard part is two things: counting accurately at millions of events per second without blowing out memory, and merging local top-K lists across dozens of distributed stream processors into a single globally correct ranking without re-processing history.
I'd open with the merge problem on a whiteboard because it catches most candidates off guard. Everyone knows how to count; the interesting design work is in how you combine counts from 32 machines into one correct ranking without shipping all the data.
Exact deduplication of repeat events from the same user session
The hardest part in scope: Merging distributed top-K results correctly. Naively unioning the top-K list from each stream processor is wrong: an item ranked 11th on every individual node can be globally first if its traffic is spread evenly across partitions, and you miss it entirely. The solution, merging Count-Min Sketches, is the deepest technical challenge in this system and the subject of Deep Dive 3.
User authentication is below the line because it does not change the counting or ranking path. To add it, attach a user_id to each event but route the event through the same Kafka topic and count pipeline unchanged.
Item metadata enrichment is below the line because it belongs in a separate lookup service. The ranking layer stores only item IDs and approximate counts. Display names and images are joined at the API response layer using a separate catalog service.
Exact deduplication is below the line because it requires a per-user Bloom filter or a Redis SET of (user_session, item_id) pairs, which adds significant storage complexity. Approximate deduplication is acceptable for trending signals: if a user clicks the same product 20 times in a session, the trend still reflects genuine engagement.
Throughput: Handle 1 billion events per day (roughly 11,500 events per second at steady state, with 3x peak burst = 35,000 events per second).
Staleness: Rankings are stale by at most 30 seconds. Sub-second freshness is not required; 30-second staleness is imperceptible for trending content.
Memory: Memory per time window is bounded regardless of item cardinality. At 1 million distinct items per day, exact counting needs 8 MB per window. Count-Min Sketch reduces this to under 200 KB per window.
Availability: 99.9% uptime. A brief gap in ranking freshness is acceptable; returning stale results from a pre-computed snapshot is always preferable to a total outage.
Query latency: Top-K reads return in under 10 ms p99. Rankings are pre-computed, so reads resolve to a single Redis ZREVRANGE command.
Exact (non-approximate) counts for all items (audit-grade counting requires storing every event individually, which does not bound memory)
Sub-second ranking latency (requires a hot path with no batching window, approaching the Redis throughput ceiling at 35K events/second)
Read/write ratio: 1 billion events per day vs roughly 100,000 top-K queries per day gives a 10,000:1 write-to-read ratio. This is an extremely write-heavy system. The write path is the entire design challenge: ingestion throughput, stream aggregation, and memory-bounded counting all exist to handle this asymmetry. The read path is trivial by comparison, because pre-computed snapshots reduce every query to a Redis ZREVRANGE.
The 10,000:1 ratio tells you immediately that the system must decouple writes from reads. You cannot compute rankings synchronously on every write event, and you do not need to: a 30-second staleness window gives you time to batch-aggregate and snapshot asynchronously.
I'd write this ratio on the whiteboard early because it immediately kills any design that tries to compute rankings on every incoming event. Once the interviewer sees 10,000:1, the conversation shifts to "how do we absorb writes cheaply and serve reads from a snapshot?"
Event: A single occurrence of an item being counted (a search query, a product view, a trending signal). Carries item_id, event_type, and timestamp.
Counter: The running approximate frequency for an item_id within a specific time window. Stored in-memory inside stream processors as a Count-Min Sketch structure, not as individual key-value rows.
TopKSnapshot: A pre-computed, point-in-time list of the top K items and their approximate counts for a given time window. Stored in Redis as a sorted set and refreshed every 30 seconds.
TimeWindow: A named time boundary (for example, last_1m, last_1h, last_24h) that scopes a TopKSnapshot. The system maintains one active TopKSnapshot per TimeWindow.
Schema details for the snapshot store (sorted set key naming, TTL policy, serialization format) are deferred to the deep dives. The four entities above are sufficient to drive the API design and High-Level Design.
The system exposes two endpoints: one to ingest events and one to query rankings. Events are ingested asynchronously because the write volume (35K events/sec at peak) makes synchronous counting impractical at the app server layer.
FR 1 and FR 3 - Ingest an event:
POST /events
Body: { item_id: "product:abc123", event_type: "view", timestamp: 1711670400 }
Response: HTTP 202 Accepted
Body: { message: "Event queued for processing" }
202 instead of 200: the event is not counted the moment the POST completes. It enters a Kafka queue and is counted asynchronously by a stream processor. Returning 200 would imply the event was counted immediately, which is false and misleading to callers. 202 accurately signals acceptance for later processing.
The response includes snapshot_age_seconds so callers know exactly how stale the pre-computed result is. This transparency lets downstream services decide whether to display the result immediately or wait for a fresher snapshot before refreshing a high-visibility leaderboard.
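For symmetry, a sketch of the query endpoint in the same style; the response field names beyond snapshot_age_seconds (items, approx_count) are illustrative, not fixed by the design:

GET /top-k?window=last_1h&k=10
Response: HTTP 200 OK
Body: { window: "last_1h", snapshot_age_seconds: 12, items: [{ item_id: "product:abc123", approx_count: 48210 }, ...] }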
Direct database writes saturate the primary well below the 35K events/second peak, making a write queue the first non-negotiable addition to the system.
The simplest design routes every event from the app server directly to a relational database. At 35,000 events per second at peak, a single Postgres instance cannot sustain the write rate without growing queue depth unboundedly (typical ceiling: 10,000-20,000 simple writes per second before latency degrades). The fix is a Kafka topic that absorbs burst writes and lets stream processors consume at their own pace.
Components:
Client: Web or mobile client that generates item events (product views, search queries, clicks).
App Server: Accepts POST /events, validates the payload, and publishes the event to Kafka. Does not perform any counting.
Kafka: Durable event queue that absorbs write bursts. Decouples the ingestion rate from the stream processing rate. Events are retained for 24 hours to support replay after processor failure.
Stream Processor Fleet: A horizontally scalable fleet (Kafka Streams, Flink, or Spark Streaming) that consumes from Kafka and maintains in-memory frequency counts. Each processor owns one or more Kafka partitions.
Redis Counter Store: Receives aggregated item counts flushed from stream processors every 30 seconds. Stores counts as Redis hashes keyed by time window and item ID.
Request walkthrough:
Client sends POST /events with item_id and event_type.
App Server validates the payload and publishes to the Kafka topic item-events, using item_id as the partition key (so all events for the same item always go to the same processor).
Kafka durably persists the event.
Each stream processor reads from its assigned partitions and increments an in-memory count for the item_id.
Every 30 seconds, each processor flushes aggregated counts to Redis Counter Store using HINCRBY batches.
This diagram covers the write path only. The query path and ranking pre-computation come in the next section.
I'd pause here in an interview and explicitly say: "The app server does zero counting. It is a dumb pipe into Kafka." That single sentence communicates that you understand the separation of concerns, and it prevents the interviewer from wondering whether the app server is a bottleneck.
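The count-and-flush loop from the walkthrough above can be sketched in a few lines. This is a minimal illustration, not production code: a plain dict stands in for the Redis Counter Store, and the flush is the moral equivalent of a pipeline of HINCRBY counts:{window} {item_id} {delta} commands. Class and field names are assumptions.

```python
from collections import Counter

# Sketch of a single stream processor's count-and-flush loop. Each event is
# an O(1) in-memory increment; the network is touched only once per flush.
class StreamProcessor:
    def __init__(self):
        self.local_counts = Counter()    # counts accumulated since the last flush

    def on_event(self, item_id):
        self.local_counts[item_id] += 1  # no per-event network call

    def flush(self, counter_store, window="last_1h"):
        # Called every 30 seconds: one batched write instead of one write per event,
        # mirroring a pipelined batch of HINCRBY commands against Redis.
        bucket = counter_store.setdefault(f"counts:{window}", Counter())
        bucket.update(self.local_counts)
        self.local_counts.clear()

store = {}
p = StreamProcessor()
for _ in range(3):
    p.on_event("product:abc123")
p.on_event("product:xyz789")
p.flush(store)
print(store["counts:last_1h"])  # Counter({'product:abc123': 3, 'product:xyz789': 1})
```

The batching is what converts 35K writes/second at the app layer into a handful of bulk Redis operations per flush interval.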
Scanning all counters on every query is O(N) across potentially millions of items; pre-computing a TopKSnapshot every 30 seconds and serving reads from a Redis sorted set eliminates all per-query computation.
The naive query approach reads all counters from Redis and sorts them in application memory. At 1 million distinct items, that is a million hash entries fetched via HGETALL and sorted per query. For bulk query traffic (thousands of users refreshing a leaderboard simultaneously), on-demand sorting is a bottleneck that pre-computation eliminates entirely.
Components (new and changed):
Snapshot Builder: A background process that runs every 30 seconds. Reads all counters from Redis Counter Store for each time window, computes top K using a min-heap, and writes the result to a Redis sorted set.
Redis Sorted Set (TopKSnapshot): Stores the pre-computed top-K list per time window. Key pattern: topk:{window}. ZADD writes the score (approximate count) and member (item_id). ZREVRANGE reads the top K in O(log N + K) time.
Query API: Serves GET /top-k by calling ZREVRANGE on the appropriate Redis sorted set. No ranking computation at query time.
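The ranking step the Snapshot Builder performs is a size-K heap selection. A minimal sketch, assuming the flushed counters have been read back into a plain dict:

```python
import heapq

# Top-K extraction with a min-heap of size K (heapq.nlargest maintains
# exactly such a heap internally). The result is ready to be written to the
# topk:{window} sorted set, one ZADD (score, member) pair per entry.
def build_snapshot(counters, k):
    return heapq.nlargest(k, counters.items(), key=lambda kv: kv[1])

counters = {"product:a": 5, "product:b": 9, "product:c": 1, "product:d": 7}
print(build_snapshot(counters, 2))  # [('product:b', 9), ('product:d', 7)]
```

The heap keeps the selection O(N log K) over N counters rather than O(N log N) for a full sort, which matters when N is a million items and K is 10.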
Request walkthrough (query path):
Client sends GET /top-k?window=last_1h&k=10.
Query API resolves the Redis key: topk:last_1h.
Query API calls ZREVRANGE topk:last_1h 0 9 WITHSCORES.
Redis returns the top 10 item_id values and their approximate counts in O(log N + K) time.
Query API attaches snapshot_age_seconds (computed from the ZADD timestamp stored alongside the set) and returns the response.
This diagram adds the full query path. Section 3 extends the stream processor layer to handle multiple time windows simultaneously without re-processing Kafka history.
The key insight I'd highlight here: every single read is a Redis ZREVRANGE. There is no computation at query time. If the interviewer asks "what happens during a traffic spike on the read side?", the answer is "nothing changes, because reads never touch the counting pipeline."
Maintaining one Count-Min Sketch per active time window inside each stream processor eliminates the need to replay Kafka history when a new window interval begins.
The naive approach to supporting three time windows (1 minute, 1 hour, 24 hours) runs three independent Kafka consumer groups, each consuming all events and maintaining its own counters. This triples infrastructure cost and, more critically, cannot support sliding window queries: you cannot subtract old minutes from an approximate counter, because CMS supports only increments and decrementing would invalidate its one-sided error guarantee.
Components (new and changed):
Stream Processor Fleet (windowed): Each processor now maintains one Count-Min Sketch per active time window. On receiving an event, it increments all active sketches simultaneously in a single pass. No re-processing required when a new window slot begins.
Window Merge Layer: Every 30 seconds, deserializes Count-Min Sketch bytes published by all processor nodes and adds the matrices element-by-element to produce a globally merged sketch per window.
Snapshot Builder (multi-window): Reads the merged CMS for each window, extracts top K via a min-heap, and writes the TopKSnapshot to the appropriate Redis sorted set.
The Redis Counter Store from Section 2 is retired; stream processors now flush serialized sketch bytes to the merge layer instead of raw counts.
Request walkthrough:
Stream processor receives an event for item_id="product:abc123".
Processor increments the CMS for all active windows in one pass: cms_1m, cms_1h, cms_24h.
Every 30 seconds, each processor serializes its CMS per window (108 KB each) and publishes the bytes to a merge Kafka topic.
The Window Merge Layer deserializes all sketch messages for the current period and adds them element-by-element (CMS is additively decomposable).
The merged CMS per window flows to the Snapshot Builder, which extracts top K and writes to the Redis sorted sets.
Each stream processor handles all windows in a single pass over the event stream. This design eliminates re-processing: a new 1-hour window starts its ring buffer slot clean, and old slot data expires automatically when the ring index wraps around (covered in Deep Dive 4).
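The element-wise merge the Window Merge Layer performs can be sketched directly; the nested lists stand in for the deserialized d x w sketch matrices (a tiny 2 x 4 shape here rather than the real 10 x 2718):

```python
# Element-wise addition of CMS matrices from two processor nodes. This is
# valid only when every node uses identical (d, w) dimensions and identical
# hash seeds; the small matrices here are for illustration.
def merge_cms(a, b):
    assert len(a) == len(b) and len(a[0]) == len(b[0]), "sketches must share shape and seeds"
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

node1 = [[3, 0, 1, 0],
         [0, 3, 0, 1]]
node2 = [[2, 0, 0, 1],
         [1, 2, 0, 0]]
merged = merge_cms(node1, node2)
print(merged)  # [[5, 0, 1, 1], [1, 5, 0, 1]]
```

Because estimates are the minimum across rows at an item's hashed columns, the merged sketch answers queries as if it had counted the combined stream directly, which is what makes sketch merging globally correct where list unioning is not.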
Three architecturally distinct options exist: an exact hash map (exact counts, unbounded memory), Redis sorted sets (pre-built ranking structure, write saturation), or a Count-Min Sketch with a min-heap (approximate counts, bounded memory). I'd lead with the sketch in an interview after briefly naming the others, because the bounded-memory property is what makes the system viable at scale.
The architecture above uses true streaming (Kafka plus stream processors). The alternative is batch (hourly or micro-batch Spark jobs). The decision is driven by the 30-second staleness NFR: batch does not achieve it, and micro-batch barely achieves it with high overhead.
I've seen candidates jump to Spark because it is the tool they know best. The problem is that Spark's scheduling overhead dominates when the trigger interval drops below a few minutes. I'd name the three options quickly and then justify Kafka Streams by the staleness requirement.
Each stream processor sees only a slice of traffic (its assigned Kafka partitions). To produce a globally correct top-K, you must merge partial results from all nodes. The naive approach (union the local top-K lists) is subtly and severely wrong, and it is the mistake I see most often from candidates who have studied top-K systems but not implemented one.
Naively unioning local top-K lists produces a globally incorrect result. An item ranked 11th on every single processor node can be globally ranked first. Never skip this detail in an interview.
Tumbling windows are the right starting point. The challenge is how to answer a variable-width sliding window query (such as "top K in the last 3 hours") from pre-computed tumbling slots, and how to expire old window data without replaying Kafka.
This deep dive is where I'd spend the most whiteboard time. The ring buffer idea is elegant but non-obvious, and walking through the slot expiry logic step by step shows the interviewer you can reason about data structures at the systems level, not just pick libraries off a shelf.
What data structure tracks item frequency at a single node? Count-Min Sketch (d=10, w=2718) for approximate frequency estimation at 108 KB per window regardless of item cardinality. A min-heap of size K extracts the top-K from a Space-Saving candidate set. Never use an exact hash map: memory grows unboundedly with cardinality.
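A minimal Count-Min Sketch illustrating the one-sided error property. The salted-hash scheme and the small (d, w) used in the demo are illustrative choices, not the production d=10, w=2718 sizing:

```python
import hashlib

# Minimal Count-Min Sketch: d hash rows, w counters per row. add() bumps one
# counter per row; estimate() takes the min across rows, so it can overcount
# on collisions but never undercounts.
class CountMinSketch:
    def __init__(self, d=4, w=64):
        self.d, self.w = d, w
        self.table = [[0] * w for _ in range(d)]

    def _cols(self, item):
        # One salted hash per row; any d independent-ish hash functions work.
        for row in range(self.d):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.w

    def add(self, item, count=1):
        for row, col in enumerate(self._cols(item)):
            self.table[row][col] += count

    def estimate(self, item):
        return min(self.table[row][col] for row, col in enumerate(self._cols(item)))

cms = CountMinSketch()
for _ in range(100):
    cms.add("hot-item")
cms.add("rare-item")
assert cms.estimate("hot-item") >= 100  # one-sided error: never below the true count
```

Memory is d x w counters regardless of how many distinct items pass through, which is the bounded-memory property the NFRs depend on.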
What is the Space-Saving structure and why is it needed? Space-Saving tracks top-K candidates with O(K) memory and guarantees no heavy hitter is missed. It provides the candidate set that CMS estimate() is called against, since CMS has no built-in way to enumerate all items.
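A compact sketch of the Space-Saving algorithm; the linear min-scan keeps the demo short (production implementations pair the counters with a heap or the Stream-Summary linked structure for O(1) updates):

```python
# Space-Saving with capacity k: tracked items are incremented in place; an
# untracked item evicts the current minimum and inherits its count, so the
# new entry's count is an overestimate but no heavy hitter is ever missed.
class SpaceSaving:
    def __init__(self, k):
        self.k = k
        self.counts = {}

    def add(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.k:
            self.counts[item] = 1
        else:
            victim = min(self.counts, key=self.counts.get)  # evict the smallest counter
            floor = self.counts.pop(victim)
            self.counts[item] = floor + 1                   # inherit evicted count

ss = SpaceSaving(k=2)
for item in ["a", "a", "a", "b", "c", "a"]:
    ss.add(item)
print(ss.counts)  # {'a': 4, 'c': 2}: the heavy hitter 'a' survives with its true count
```

The candidate set ss.counts is what CMS estimate() is then called against to produce the final ranked counts.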
How real-time does the ranking need to be? 30 seconds is the right answer for trending content (Twitter, Amazon, YouTube). Sub-second freshness requires removing the batching window and doing per-event Redis ZINCRBY, which approaches the single-instance Redis throughput ceiling at 35K events/second.
Why 202 instead of 200 for event ingestion? 202 signals that the event was accepted for asynchronous processing; a Kafka stream processor counts it seconds later. 200 would falsely imply the event was counted at the moment the response was sent.
How do you aggregate top-K correctly across distributed nodes? Merge Count-Min Sketch matrices by element-wise addition, not by unioning local top-K lists. A globally popular item may rank outside the top-K on every individual node if its traffic is evenly partitioned. Sketch merging is mathematically correct and uses 75x less bandwidth than shipping full count maps.
How do you support multiple time windows without re-processing Kafka history? Use a ring buffer of 30-second tumbling CMS slots (2880 slots = 24 hours, ~313 MB per processor node). Any window query merges the trailing N slots: 1 minute = 2 slots, 1 hour = 120 slots, 24 hours = 2880 slots. Memory is constant regardless of runtime duration.
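The ring-buffer arithmetic and trailing-slot merge can be sketched with plain Counters standing in for serialized CMS matrices (an assumption that keeps the demo self-contained; the merge step is the same element-wise addition either way):

```python
from collections import Counter

SLOT_SECONDS = 30
NUM_SLOTS = 24 * 3600 // SLOT_SECONDS  # 2880 thirty-second slots cover 24 hours

def slot_index(epoch_seconds):
    # Wrapping index: when it wraps, a slot's old contents are overwritten,
    # which is exactly how data older than 24 hours expires with no replay.
    return (epoch_seconds // SLOT_SECONDS) % NUM_SLOTS

def window_slots(window_seconds):
    return window_seconds // SLOT_SECONDS

def query_window(ring, now_slot, window_seconds):
    # Answer a variable-width window query by merging the trailing N slots.
    merged = Counter()
    for i in range(window_slots(window_seconds)):
        merged += ring[(now_slot - i) % NUM_SLOTS]
    return merged

ring = [Counter() for _ in range(NUM_SLOTS)]
ring[10]["item"] = 4  # counts landing in the current slot
ring[9]["item"] = 3   # counts from the previous slot
print(NUM_SLOTS, window_slots(60), window_slots(3600))           # 2880 2 120
print(query_window(ring, now_slot=10, window_seconds=60)["item"])  # 7
```

The fixed-size array is why memory stays constant regardless of how long the processor runs.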
Stream vs batch? Use Kafka Streams with stateful in-memory CMS. Hourly batch Spark misses the 30-second staleness NFR by 120x and wastes compute re-processing unchanged events. Micro-batch Spark (30-second trigger) is operationally complex with high checkpoint I/O overhead per trigger.
What happens if a stream processor crashes? Kafka Streams re-reads from the last committed Kafka offset and rebuilds the CMS from recent events. For the 1-minute window this takes a few seconds. For the 24-hour window, restore from a RocksDB-backed state store that persists the ring buffer every 30 seconds; full restoration takes under 10 seconds from disk.
What are the system memory bounds? CMS per window per node: 108 KB. Ring buffer per node for 24 hours: ~313 MB (2,880 slots x 108 KB per sketch). Redis sorted sets: O(K) per window (top-K items only). Contrast with exact counting: 8 MB per window per node at 1M distinct items, growing linearly with cardinality.
Can you use Redis ZINCRBY instead? Yes for small cardinality (under 100K distinct items). At 1M+ distinct items, Redis sorted set memory grows to 80-100 MB per window and 105K ZINCRBY calls/second approaches the single-instance throughput ceiling. Use ZINCRBY for prototyping or low-cardinality cases, switch to CMS when memory or Redis write throughput becomes a constraint.
How do you scale the query path? Pre-computed Redis sorted sets make every query a single ZREVRANGE command. Add Redis read replicas and route query API nodes to replicas. Query API nodes are stateless and scale horizontally with no coordination.
How does the system handle the 10,000:1 write/read asymmetry? Kafka decouples the write path (35K events/sec ingestion) from the read path (pre-computed snapshot reads). The two paths never share a transaction: events flow through Kafka into stream processors, and reads consume from Redis sorted sets that are updated separately on a 30-second snapshot cycle.
What is the error bound of Count-Min Sketch? With probability at least 1 - delta (delta = e^-d, roughly 5e-5 at d=10), the frequency overcount for any item is at most epsilon x (total events in the window). At epsilon=0.001 and 1B events per day, the absolute overcount is bounded at 1M events. For an item with 100M actual views, this is a 1% relative error. CMS never undercounts; estimates are always >= true frequency.
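A worked check of the sizing arithmetic quoted above. Note the text's w = 2718 corresponds to e / epsilon rounded down; some references use the ceiling, giving 2719:

```python
import math

# Sketch width from the target error: w ~ e / epsilon.
epsilon = 0.001
w = int(math.e / epsilon)
print(w)  # 2718

# Worst-case absolute overcount for a window of N total events.
N = 1_000_000_000          # 1B events/day in the 24-hour window
max_overcount = epsilon * N
print(int(max_overcount))  # 1000000 events

# Relative error for a genuinely popular item.
true_views = 100_000_000
print(max_overcount / true_views)  # 0.01, i.e. 1%
```

The bound scales with total window volume, not with any single item's count, which is why the relative error shrinks for exactly the items a top-K system cares about.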