Count-min sketch: streaming frequency estimation
How count-min sketch estimates element frequencies using multiple hash functions and a 2D counter array. Width vs. depth tradeoffs, the error model, and applications for streaming heavy hitters and rate limiting.
The problem
Your API gateway processes 500,000 requests per second. The security team wants to flag any IP address that exceeds 1,000 requests per minute. The product team wants to know the top 10 most-searched queries in real time. The analytics pipeline wants to detect "heavy hitter" product IDs that account for more than 1% of all views.
The naive solution: maintain a HashMap counting occurrences per key. At 10 million unique IPs, 50 million unique search queries, and 5 million product IDs, the map consumes gigabytes of RAM. Worse, the map grows with cardinality, and you do not know the cardinality in advance.
Now add time windows. You need per-minute counts, so you maintain 60 maps (one per minute) and rotate them. That is 60x the memory. Now do this across 20 gateway nodes. The hash maps are not mergeable without transferring the full key sets.
This is the problem count-min sketch solves: estimating element frequencies in a stream using fixed, small memory that does not grow with the number of unique elements, and that supports merging across distributed nodes.
What it is
A count-min sketch (CMS) is a probabilistic data structure that estimates the frequency of elements in a data stream using a fixed-size 2D array of counters and multiple independent hash functions. Estimates are always equal to or greater than the true count (over-count, never under-count).
Analogy: Imagine you work at a busy polling station. Instead of keeping a precise tally for each candidate (which requires one counter per candidate), you have a small grid of tally marks. Five different volunteers each independently pick a column for each candidate using different random rules, and add a tally mark. When someone asks "how many votes did candidate X get?", each volunteer checks their column for X and reports their tally. You take the smallest reported number, because collisions with other candidates only inflate the tally, never reduce it.
The key guarantee: CMS never under-counts. If an element truly appeared 500 times, the estimate is at least 500. It might be 510 or 520 due to hash collisions with other elements, but never 490. This one-sided error makes CMS especially useful for threshold-based decisions (rate limiting, heavy hitter detection) where over-counting is safe but under-counting could miss violations.
CMS always over-counts, never under-counts
This is the most important property to remember. The error is always positive (or zero). If your CMS estimates an IP has made 1,050 requests and your threshold is 1,000, the true count is somewhere between 0 and 1,050. You might have a false positive (true count was only 950) but never a false negative (true count was 1,100 and CMS reported 950). For security applications, this bias is exactly what you want.
How it works
A count-min sketch is a 2D array of counters with dimensions d × w:
- `d` = number of hash functions (depth, controls confidence)
- `w` = width per hash function (controls error magnitude)
Each hash function maps elements to a position in [0, w). The functions must come from a pairwise-independent family: for any two distinct inputs, the pair of hash values is distributed as if each were chosen uniformly and independently.
Update: adding an element
For each hash function i, increment the counter at position hash_i(x) % w:
```python
def update(x, count=1):
    for i in range(d):          # d rows, one per hash function
        j = hash(i, x) % w      # row i's position for x, in [0, w)
        table[i][j] += count
```
This is O(d) per update. With d=5, that means 5 hash computations and 5 counter increments. At typical hash speeds (25ns per MurmurHash), a single update completes in about 150ns.
Query: estimating frequency
For each hash function i, read the counter at position hash_i(x) % w. Return the minimum across all d rows:
```python
def estimate(x):
    return min(table[i][hash(i, x) % w] for i in range(d))
```
Why the minimum? Each counter at position hash_i(x) accumulates counts from element x plus every other element that hashed to the same position in row i. Different rows have different collision patterns (because the hash functions are independent). The row with the fewest collisions gives the best (least inflated) estimate. Taking the minimum is the best you can do without knowing which collisions occurred.
For your interview: say "CMS uses d hash functions and takes the min across all d rows because collisions can only inflate, never deflate."
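The two operations above fit in a few lines. Below is a minimal, self-contained sketch in Python; deriving the d hash functions from `hashlib.blake2b` with a per-row salt is an illustrative choice, not the only one:

```python
import hashlib


class CountMinSketch:
    """Minimal count-min sketch: d x w counters, update and point query."""

    def __init__(self, w, d):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]

    def _index(self, i, x):
        # Row i's hash: seed a strong hash with the row number as salt.
        h = hashlib.blake2b(str(x).encode(), salt=i.to_bytes(8, "little"))
        return int.from_bytes(h.digest()[:8], "little") % self.w

    def update(self, x, count=1):
        for i in range(self.d):
            self.table[i][self._index(i, x)] += count

    def estimate(self, x):
        # Minimum across rows: collisions only inflate, never deflate.
        return min(self.table[i][self._index(i, x)] for i in range(self.d))
```

Because the estimate is the minimum over rows, it can never fall below the true count, and it can never exceed the total number of events processed.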
Sizing: the error model
The error bounds are probabilistic and controlled by two parameters: ε (error rate) and δ (failure probability).
P(estimate > actual + ε × n) < δ

where:
- n = total events processed
- ε = error parameter (controls the magnitude of the over-count)
- δ = failure probability (controls how often the error bound is exceeded)
The dimensions map directly to these parameters:
- `w = ceil(e / ε) ≈ 2.72 / ε` (width for a given error rate)
- `d = ceil(ln(1/δ))` (depth for a given confidence)
| Target error (ε) | Confidence (1-δ) | Width (w) | Depth (d) | Memory (4-byte counters) |
|---|---|---|---|---|
| 1% | 99% | 272 | 5 | ~5.4 KB |
| 0.1% | 99% | 2,718 | 5 | ~54 KB |
| 0.01% | 99% | 27,183 | 5 | ~544 KB |
| 1% | 99.9% | 272 | 7 | ~7.6 KB |
| 0.1% | 99.99% | 2,718 | 10 | ~109 KB |
The error bound is additive, not multiplicative. For a 1-billion-event stream with ε=0.01, the maximum over-count for any element is 10 million events (1% of n). That sounds large in absolute terms, but it means you can reliably detect any element that appears more than ~10 million times (1% of the stream).
I almost always recommend starting with ε=0.001 (0.1%) and d=5 unless memory is extremely constrained. This uses ~54KB and keeps the error within 0.1% of total stream volume.
Error is proportional to total stream volume, not element frequency
This is the subtlety that trips people up. The error bound is ε × n where n is the total number of events across ALL elements, not the count of the specific element you are querying. On a stream of 1 billion events with ε=0.01, every element's estimate carries up to 10M of over-count, whether the element appeared 5 times or 500 million times. This makes CMS much more useful for detecting frequent elements than for accurately counting rare ones.
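The sizing arithmetic above is easy to wrap in a helper. A sketch (the function name `cms_dimensions` is mine):

```python
import math


def cms_dimensions(eps, delta, counter_bytes=4):
    """Return (width, depth, bytes) so the additive error eps * n
    is exceeded with probability at most delta."""
    w = math.ceil(math.e / eps)          # width for the target error rate
    d = math.ceil(math.log(1 / delta))   # depth for the target confidence
    return w, d, w * d * counter_bytes
```

For example, `cms_dimensions(0.01, 0.01)` gives a 272-wide, 5-deep table, matching the first row of the table above.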
Worked example
Stream of 1,000 events. CMS with w=10, d=3.
Events: A×400, B×300, C×200, D×100

After processing all events:

```
Row 1: [0, 400, 300, 200, 100, 0, 0, 0, 0, 0]   (lucky: no collisions)
Row 2: [0, 0, 700, 0, 0, 200, 100, 0, 0, 0]     (A and B collide at position 2)
Row 3: [0, 0, 0, 400, 500, 0, 0, 0, 100, 0]     (B and C collide at position 4)

Query "A": min(400, 700, 400) = 400  ✓ exact
Query "B": min(300, 700, 500) = 300  ✓ exact
Query "C": min(200, 200, 500) = 200  ✓ exact
Query "D": min(100, 100, 100) = 100  ✓ exact
```
In this case, the minimum operation eliminates the collision noise in rows 2 and 3. With multiple independent hash functions, it is unlikely that element x collides with the same heavy hitter in every row. At least one row usually gives a clean (or near-clean) count.
Count-Min vs. HyperLogLog
| | Count-Min Sketch | HyperLogLog |
|---|---|---|
| Answers | "How often was X seen?" | "How many distinct X were seen?" |
| Error type | Over-count (never under-count) | Relative error (both directions) |
| Merging | Sum the counter arrays | Max the bucket arrays |
| Query input | A specific element X | Nothing (returns global count) |
| Use case | Frequency / heavy hitters / rate limiting | Cardinality (distinct count) |
| Typical memory | 5KB - 500KB depending on error tolerance | 12KB for ~0.8% standard error |
In practice, you often use both together. CMS tells you "IP 1.2.3.4 has sent 10,000 requests in the last minute" (frequency). HLL tells you "50,000 distinct IPs have sent requests in the last minute" (cardinality). Different questions, complementary answers.
Comparison with other frequency sketches
| Sketch | Error type | Supports deletion? | Merging | Memory | Best for |
|---|---|---|---|---|---|
| Count-Min Sketch | One-sided (over-count) | No | Sum | d × w counters | Heavy hitters, rate limiting |
| Count Sketch | Two-sided (zero-mean) | Yes | Sum | d × w counters | Frequency estimation where over-count bias is unacceptable |
| Count-Min-Log | One-sided, reduced | No | Approximate | d × w (log counters) | Long-running streams where counter overflow is a concern |
| Space-Saving | One-sided (bounded over-count per entry) | Implicit (eviction) | Complex | K entries | Top-K frequent items when K is small |
Count Sketch uses +1/-1 random signs per hash function to make the error zero-mean (unbiased) instead of always positive. This gives better median estimation but loses the "never under-count" guarantee that makes CMS attractive for threshold-based decisions.
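To make the contrast concrete, here is a sketch of the Count Sketch update and query, using a seeded hash for both the position and the ±1 sign (the helper names and seeding scheme are mine):

```python
import hashlib
import statistics


def _h(x, seed, m):
    # Illustrative seeded hash of x into [0, m).
    digest = hashlib.blake2b(f"{seed}:{x}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "little") % m


def count_sketch_update(table, x, count=1):
    for i, row in enumerate(table):
        sign = 1 if _h(x, i + 1000, 2) == 1 else -1  # per-row +/-1 sign
        row[_h(x, i, len(row))] += sign * count


def count_sketch_estimate(table, x):
    vals = []
    for i, row in enumerate(table):
        sign = 1 if _h(x, i + 1000, 2) == 1 else -1
        vals.append(sign * row[_h(x, i, len(row))])
    # Median, not min: colliding elements cancel in expectation,
    # so the error is zero-mean rather than one-sided.
    return statistics.median(vals)
```

Note the query takes the median rather than the minimum: because collisions can push counters in either direction, the minimum is no longer a safe choice.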
Hash function requirements
CMS needs d pairwise-independent hash functions. In practice, you do not need d separate hash implementations. The standard approach uses two strong hash functions (e.g., MurmurHash3) and derives d functions via:
hash_i(x) = (h1(x) + i * h2(x)) mod w
This "double hashing" technique (analyzed by Kirsch and Mitzenmacher in the context of Bloom filters) preserves the sketch's error guarantees in practice and is much faster than computing d independent hashes. At w=2718 and d=5, each update requires 2 hash computations plus 5 modular additions and 5 counter increments.
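A sketch of the derivation, splitting one 128-bit digest into the two base hashes (blake2b is an illustrative choice of strong hash):

```python
import hashlib


def double_hashes(x, d, w):
    """Derive d row indices from two 64-bit halves of a single digest,
    via h_i(x) = (h1(x) + i * h2(x)) mod w."""
    digest = hashlib.blake2b(str(x).encode(), digest_size=16).digest()
    h1 = int.from_bytes(digest[:8], "little")
    h2 = int.from_bytes(digest[8:], "little")
    return [(h1 + i * h2) % w for i in range(d)]
```

One hash computation per element, however deep the sketch is.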
Heavy hitter detection
The primary application of CMS is "heavy hitter" detection: finding elements whose frequency exceeds a threshold φ (a fraction of total stream volume). For example, "find all IPs responsible for more than 1% of traffic."
The pattern combines a CMS with a min-heap of size K:
```python
def process_event(x, cms, heap, k, threshold):
    cms.update(x)
    freq = cms.estimate(x)
    if freq >= threshold:
        if x in heap:                  # pseudocode: a min-heap keyed by frequency
            heap.update(x, freq)       # that supports membership and key updates
        elif len(heap) < k:
            heap.push(x, freq)
        elif freq > heap.peek_min():
            heap.pop_min()
            heap.push(x, freq)
```
The heap never grows beyond K entries. CMS handles the counting in O(d) per event, and the heap operations are O(log K). For K=100 and d=5, you process each event in about 200ns regardless of how many distinct elements exist.
Applications:
- Rate limiting per IP/user: maintain per-key counts in a CMS to detect IPs exceeding thresholds without per-key state
- Trending topics: find queries or hashtags exceeding a views/minute threshold
- DDoS detection: detect source IPs with abnormally high packet rates in router firmware
- Ad fraud detection: flag advertiser IDs or publisher IDs with suspiciously high click counts
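A self-contained version of the pattern, with the pseudocode heap replaced by a small candidate dict plus eviction (Python's `heapq` has no decrease-key); the class name, hash scheme, and defaults are illustrative:

```python
import hashlib


def _cms_index(x, i, w):
    # Illustrative seeded hash for row i.
    digest = hashlib.blake2b(f"{i}:{x}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "little") % w


class HeavyHitters:
    """CMS for counting plus a bounded candidate set for top-K tracking."""

    def __init__(self, w=272, d=5, k=10, threshold=100):
        self.w, self.d, self.k, self.threshold = w, d, k, threshold
        self.table = [[0] * w for _ in range(d)]
        self.top = {}  # candidate -> latest frequency estimate

    def process_event(self, x):
        for i in range(self.d):
            self.table[i][_cms_index(x, i, self.w)] += 1
        freq = min(self.table[i][_cms_index(x, i, self.w)]
                   for i in range(self.d))
        if freq >= self.threshold:
            self.top[x] = freq
            if len(self.top) > self.k:
                # Evict the candidate with the smallest estimate.
                victim = min(self.top, key=self.top.get)
                del self.top[victim]
```

Memory stays bounded by the d×w table plus at most K candidates, no matter how many distinct elements flow through.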
Conservative update optimization
Standard CMS increments all d counters by the same amount. Conservative update is a simple optimization: raise a counter only if it sits below the new estimate (the current minimum plus the increment).
```python
def conservative_update(x, count=1):
    est = estimate(x)                  # current minimum across all rows
    for i in range(d):
        j = hash(i, x) % w
        # Raise each counter only as far as the new estimate requires.
        table[i][j] = max(table[i][j], est + count)
```
Instead of blindly adding to every row, conservative update raises each counter only as far as the new minimum estimate requires; counters already above `est + count` are left untouched. This reduces over-counting significantly in practice (often a 2-5x accuracy improvement) at zero extra memory cost. The tradeoff: conservative update breaks mergeability (two independently maintained sketches can no longer be combined by simple addition). Use it when you have a single-node sketch and accuracy matters more than distributed merging.
Merging sketches across nodes
When you run CMS on multiple nodes (e.g., 20 API gateway instances), you need to combine their sketches to get a global frequency estimate. Since CMS is a linear sketch, merging is trivial: add the counter arrays element-wise.
The merged sketch is mathematically identical to a single sketch that processed all events from all nodes. This is the key reason CMS is preferred over exact counters in distributed systems: you ship a fixed-size array (e.g., 54KB) instead of a potentially unbounded hash map.
| Merge property | CMS | HashMap |
|---|---|---|
| Data to transfer per node | d × w × 4 bytes (fixed) | O(unique keys × key size) |
| Merge operation | Element-wise addition | Key-by-key union |
| Result accuracy | Same probabilistic guarantees | Exact |
| Memory for merge | One additional d×w array | Potentially unbounded |
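The merge itself really is just element-wise addition over same-shaped counter tables. A sketch:

```python
def merge(table_a, table_b):
    """Element-wise sum of two same-shaped CMS counter arrays.
    Valid because CMS is a linear sketch (requires identical w, d,
    and hash functions on both nodes)."""
    assert len(table_a) == len(table_b)
    assert len(table_a[0]) == len(table_b[0])
    return [[a + b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(table_a, table_b)]
```

The identical-hash-functions requirement is why distributed deployments fix the sketch parameters (and hash seeds) in configuration shared by all nodes.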
Decay and aging for sliding windows
Production frequency counts usually need time windows. "How many requests from this IP in the last 5 minutes?" is more useful than "how many requests ever?"
The simplest approach: maintain one CMS per time bucket (e.g., per minute) and rotate them.
```python
# Five 1-minute buckets for a 5-minute sliding window
buckets = [CMS(w, d) for _ in range(5)]
current_bucket = 0

def tick_minute():
    global current_bucket
    current_bucket = (current_bucket + 1) % 5
    buckets[current_bucket].clear()    # Reset the oldest bucket before reuse

def update(x):
    buckets[current_bucket].update(x)

def estimate_5min(x):
    return sum(b.estimate(x) for b in buckets)
```
An alternative is exponential decay: multiply all counters by a decay factor λ (e.g., 0.99) at regular intervals. This avoids maintaining multiple sketches but loses the hard window guarantee. In practice, I recommend the bucket rotation approach because it is simpler to reason about and the memory overhead (5x a single sketch) is modest.
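For reference, the exponential-decay alternative is a single in-place pass over the counters; λ and the decay interval are tuning choices:

```python
def decay(table, lam=0.99):
    """Multiply every counter by lam in place (exponential aging).
    Estimates remain upper bounds on the decayed true counts, but the
    hard "exactly the last N minutes" window guarantee is lost."""
    for row in table:
        for j, v in enumerate(row):
            row[j] = v * lam
```

Note the counters become floats, so this variant pairs naturally with float-typed tables rather than packed integer counters.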
Production usage
| System | How CMS is used | Why not exact counters? |
|---|---|---|
| Google Bigtable / Dremel | Count hot row keys to detect tablet hotspots. CMS on each tablet server, merged centrally. | Billions of distinct row keys across thousands of tablets. Exact per-key counters would require unbounded memory on each tablet server. |
| Network routers | Per-flow packet counting for DDoS detection and traffic engineering. CMS fits in SRAM (on-chip, ~1ns access). | Line-rate packet processing (100Gbps+). SRAM is tiny (a few MB). Hash maps require DRAM (~50ns per access), which is too slow at line rate. |
| Apache Spark | CountMinSketch class in the spark-sketch module for approximate frequency queries in streaming pipelines. | Spark Streaming micro-batches process millions of events per second. Exact groupByKey requires shuffle. CMS produces approximate answers in a single pass. |
| Redis (RedisBloom module) | CMS.INCRBY, CMS.QUERY, CMS.MERGE commands. Used for rate limiting and frequency estimation. | Redis keeps everything in RAM. Exact counters with per-key INCR are O(keys); CMS is O(1) memory regardless of key cardinality. |
| Flink / Kafka Streams | Streaming heavy hitter detection in event-driven architectures. CMS per partition, merged at sink. | Event streams are unbounded. Accumulating exact state per key eventually exhausts memory or requires external storage. |
Interview tip: always cite at least one production system
When you mention CMS in a design, name a real system that uses it. "Google uses CMS in Bigtable to detect tablet hotspots" or "Network routers use CMS in SRAM for line-rate DDoS detection" shows you understand where this actually gets deployed, not just the theory.
Limitations and when NOT to use it
- Cannot delete elements. Standard CMS only supports increments. Decrementing counters would break the "never under-count" guarantee. If you need deletions, use a Count sketch, which supports negative updates but loses the one-sided error property.
- Additive error scales with total stream volume. For a stream of 1 billion events with ε=0.01, the error bound is ±10M for every element. Rare elements (appearing fewer than 10M times) are essentially indistinguishable from noise. CMS is not useful for "long tail" frequency estimation.
- No enumeration. CMS cannot tell you which elements it has seen. You can query "how often did X appear?" but not "which elements appeared at least K times?" without maintaining a separate data structure (like the min-heap pattern above).
- Over-counting can produce false positives in threshold systems. A rate limiter using CMS may occasionally block legitimate users because CMS over-estimates their request count. For systems where false positives are costly (e.g., payment fraud), use CMS as a first-pass filter and confirm with exact counts.
- Not appropriate when you need exact counts. If your use case cannot tolerate any estimation error (billing, financial reporting), use exact counters. CMS is for "good enough" answers at scale.
- Point queries only. CMS cannot answer range queries ("how many elements have frequency between 100 and 200?") without modifications like the range-tree CMS variant.
Interview cheat sheet
- A count-min sketch is a `d × w` counter array that estimates element frequencies in O(d) time and O(d × w) fixed memory.
- The estimate is always ≥ the true count (over-count, never under-count). The one-sided error is the key property.
- Error bound: P(estimate exceeds actual + ε × n) is under δ, where n = total events processed. Width w = ceil(e/ε), depth d = ceil(ln(1/δ)).
- Query returns the minimum across all d hash positions. The minimum works because collisions only inflate counters.
- Merging is element-wise addition. The merged sketch has the same guarantees as if all events were processed by a single sketch.
- Use CMS + min-heap for heavy hitter / top-K detection. CMS handles counting, heap tracks the K most frequent elements.
- Time-windowed counting: maintain one CMS per time bucket and rotate, or apply exponential decay to all counters.
- Conservative update reduces over-counting by 2-5x but breaks mergeability. Use it on single-node sketches.
Test Your Understanding
Quick Recap
- Count-min sketch estimates element frequencies using a d×w counter table and d independent hash functions. Updating increments d counters; querying returns the minimum across all d positions.
- Counts are always over-estimates (collisions inflate, never deflate). The minimum across hash functions gives the least-inflated estimate, bounded by ε × total_events with probability 1-δ.
- Memory is fixed: `ceil(e/ε) × ceil(ln(1/δ)) × counter_size`. For 0.1% error at 99% confidence, this is ~54KB regardless of the number of distinct elements.
- Merging is element-wise addition across distributed nodes. The merged sketch is mathematically identical to a single sketch that processed all events. Ship 54KB per node instead of unbounded hash maps.
- Heavy hitter detection combines CMS with a min-heap: CMS handles frequency estimation in O(d), the heap tracks the K most frequent elements in O(log K).
- Time-windowed counting uses one CMS per bucket with rotation. Conservative update reduces over-counting by 2-5x but breaks mergeability. Use standard update for distributed systems, conservative update for single-node accuracy.
Related concepts
- HyperLogLog answers the complementary question: "how many distinct elements?" vs CMS's "how often did element X appear?"
- Bloom filters answer "have I seen element X before?" (membership) vs CMS's frequency estimation
- Rate limiting is the most common production application of CMS: per-key counting without per-key state
- Caching uses frequency information to decide eviction: CMS can approximate access frequencies for LFU-style policies
- Hashing and pairwise-independent hash families are the mathematical foundation that makes CMS work