Count-min sketch: streaming frequency estimation
How count-min sketch estimates element frequencies using multiple hash functions and a 2D counter array. Width vs. depth tradeoffs, the error model, and applications for streaming heavy hitters and rate limiting.
The problem
Your API gateway processes 500,000 requests per second. The security team wants to flag any IP address that exceeds 1,000 requests per minute. The product team wants to know the top 10 most-searched queries in real time. The analytics pipeline wants to detect "heavy hitter" product IDs that account for more than 1% of all views.
The naive solution: maintain a HashMap counting occurrences per key. At 10 million unique IPs, 50 million unique search queries, and 5 million product IDs, the map consumes gigabytes of RAM. Worse, the map grows with cardinality, and you do not know the cardinality in advance.
Now add time windows. You need per-minute counts, so you maintain 60 maps (one per minute) and rotate them. That is 60x the memory. Now do this across 20 gateway nodes. The hash maps are not mergeable without transferring the full key sets.
This is the problem count-min sketch solves: estimating element frequencies in a stream using fixed, small memory that does not grow with the number of unique elements, and that supports merging across distributed nodes.
What it is
A count-min sketch (CMS) is a probabilistic data structure that estimates the frequency of elements in a data stream using a fixed-size 2D array of counters and multiple independent hash functions. Estimates are always equal to or greater than the true count (over-count, never under-count).
Analogy: Imagine you work at a busy polling station. Instead of keeping a precise tally for each candidate (which requires one counter per candidate), you have a small grid of tally marks. Five different volunteers each independently pick a column for each candidate using different random rules, and add a tally mark. When someone asks "how many votes did candidate X get?", each volunteer checks their column for X and reports their tally. You take the smallest reported number, because collisions with other candidates only inflate the tally, never reduce it.
The key guarantee: CMS never under-counts. If an element truly appeared 500 times, the estimate is at least 500. It might be 510 or 520 due to hash collisions with other elements, but never 490. This one-sided error makes CMS especially useful for threshold-based decisions (rate limiting, heavy hitter detection) where over-counting is safe but under-counting could miss violations.
CMS always over-counts, never under-counts
This is the most important property to remember. The error is always positive (or zero). If your CMS estimates an IP has made 1,050 requests and your threshold is 1,000, the true count is somewhere between 0 and 1,050. You might have a false positive (true count was only 950) but never a false negative (true count was 1,100 and CMS reported 950). For security applications, this bias is exactly what you want.
How it works
A count-min sketch is a 2D array of counters with dimensions d × w:
- `d` = number of hash functions (depth, controls confidence)
- `w` = width per hash function (controls error magnitude)
Each hash function maps elements to a position in [0, w). The functions must come from a pairwise-independent family: for any two distinct inputs, the pair of hash values is distributed as if each were chosen uniformly and independently.
Update: adding an element
For each hash function i, increment the counter at position hash_i(x) % w:
```python
def update(x, count=1):
    for i in range(d):          # d rows, one per hash function
        j = hash(i, x) % w      # row i's position for x, in [0, w)
        table[i][j] += count
```
This is O(d) per update. With d=5, that means 5 hash computations and 5 counter increments. At typical hash speeds (25ns per MurmurHash), a single update completes in about 150ns.
Query: estimating frequency
For each hash function i, read the counter at position hash_i(x) % w. Return the minimum across all d rows:
```python
def estimate(x):
    return min(table[i][hash(i, x) % w] for i in range(d))
```
Why the minimum? Each counter at position hash_i(x) accumulates counts from element x plus every other element that hashed to the same position in row i. Different rows have different collision patterns (because the hash functions are independent). The row with the fewest collisions gives the best (least inflated) estimate. Taking the minimum is the best you can do without knowing which collisions occurred.
For your interview: say "CMS uses d hash functions and takes the min across all d rows because collisions can only inflate, never deflate."
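The two operations above fit in a few lines. Below is a minimal, self-contained sketch in Python; deriving the d hash functions from `hashlib.blake2b` with a per-row salt is an illustrative choice, not the only one:

```python
import hashlib


class CountMinSketch:
    """Minimal count-min sketch: d x w counters, update and point query."""

    def __init__(self, w, d):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]

    def _index(self, i, x):
        # Row i's hash: seed a strong hash with the row number as salt.
        h = hashlib.blake2b(str(x).encode(), salt=i.to_bytes(8, "little"))
        return int.from_bytes(h.digest()[:8], "little") % self.w

    def update(self, x, count=1):
        for i in range(self.d):
            self.table[i][self._index(i, x)] += count

    def estimate(self, x):
        # Minimum across rows: collisions only inflate, never deflate.
        return min(self.table[i][self._index(i, x)] for i in range(self.d))
```

Because the estimate is the minimum over rows, it can never fall below the true count, and it can never exceed the total number of events processed.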
Sizing: the error model
The error bounds are probabilistic and controlled by two parameters: ε (error rate) and δ (failure probability).
P(estimate > actual + ε × n) < δ

where:
- n = total events processed
- ε = error parameter (controls the magnitude of the over-count)
- δ = failure probability (controls how often the error bound is exceeded)
The dimensions map directly to these parameters:
- `w = ceil(e / ε) ≈ 2.72 / ε` (width for a given error rate)
- `d = ceil(ln(1/δ))` (depth for a given confidence)
| Target error (ε) | Confidence (1-δ) | Width (w) | Depth (d) | Memory (4-byte counters) |
|---|---|---|---|---|
| 1% | 99% | 272 | 5 | ~5.4 KB |
| 0.1% | 99% | 2,718 | 5 | ~54 KB |
| 0.01% | 99% | 27,183 | 5 | ~544 KB |
| 1% | 99.9% | 272 | 7 | ~7.6 KB |
| 0.1% | 99.99% | 2,718 | 10 | ~109 KB |
The error bound is additive, not multiplicative. For a 1-billion-event stream with ε=0.01, the maximum over-count for any element is 10 million events (1% of n). That sounds large in absolute terms, but it means you can reliably detect any element that appears more than ~10 million times (1% of the stream).
I almost always recommend starting with ε=0.001 (0.1%) and d=5 unless memory is extremely constrained. This uses ~54KB and keeps the error within 0.1% of total stream volume.
Error is proportional to total stream volume, not element frequency
This is the subtlety that trips people up. The error bound is ε × n where n is the total number of events across ALL elements, not the count of the specific element you are querying. On a stream of 1 billion events with ε=0.01, every element's estimate carries up to 10M of over-count, whether the element appeared 5 times or 500 million times. This makes CMS much more useful for detecting frequent elements than for accurately counting rare ones.
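The sizing arithmetic above is easy to wrap in a helper. A sketch (the function name `cms_dimensions` is mine):

```python
import math


def cms_dimensions(eps, delta, counter_bytes=4):
    """Return (width, depth, bytes) so the additive error eps * n
    is exceeded with probability at most delta."""
    w = math.ceil(math.e / eps)          # width for the target error rate
    d = math.ceil(math.log(1 / delta))   # depth for the target confidence
    return w, d, w * d * counter_bytes
```

For example, `cms_dimensions(0.01, 0.01)` gives a 272-wide, 5-deep table, matching the first row of the table above.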
Worked example
Stream of 1,000 events. CMS with w=10, d=3.
Events: A×400, B×300, C×200, D×100

After processing all events:

```
Row 1: [0, 400, 300, 200, 100, 0, 0, 0, 0, 0]   (lucky: no collisions)
Row 2: [0, 0, 700, 0, 0, 200, 100, 0, 0, 0]     (A and B collide at position 2)
Row 3: [0, 0, 0, 400, 500, 0, 0, 0, 100, 0]     (B and C collide at position 4)

Query "A": min(400, 700, 400) = 400  ✓ exact
Query "B": min(300, 700, 500) = 300  ✓ exact
Query "C": min(200, 200, 500) = 200  ✓ exact
Query "D": min(100, 100, 100) = 100  ✓ exact
```
In this case, the minimum operation eliminates the collision noise in rows 2 and 3. With multiple independent hash functions, it is unlikely that element x collides with the same heavy hitter in every row. At least one row usually gives a clean (or near-clean) count.
Count-Min vs. HyperLogLog
| | Count-Min Sketch | HyperLogLog |
|---|---|---|
| Answers | "How often was X seen?" | "How many distinct X were seen?" |
| Error type | Over-count (never under-count) | Relative error (both directions) |
| Merging | Sum the counter arrays | Max the bucket arrays |
| Query input | A specific element X | Nothing (returns global count) |
| Use case | Frequency / heavy hitters / rate limiting | Cardinality (distinct count) |
| Typical memory | 5KB - 500KB depending on error tolerance | 12KB for ~0.8% standard error |
In practice, you often use both together. CMS tells you "IP 1.2.3.4 has sent 10,000 requests in the last minute" (frequency). HLL tells you "50,000 distinct IPs have sent requests in the last minute" (cardinality). Different questions, complementary answers.
Comparison with other frequency sketches
| Sketch | Error type | Supports deletion? | Merging | Memory | Best for |
|---|---|---|---|---|---|
| Count-Min Sketch | One-sided (over-count) | No | Sum | d × w counters | Heavy hitters, rate limiting |
| Count Sketch | Two-sided (zero-mean) | Yes | Sum | d × w counters | Frequency estimation where over-count bias is unacceptable |
| Count-Min-Log | One-sided, reduced | No | Approximate | d × w (log counters) | Long-running streams where counter overflow is a concern |
| Space-Saving | One-sided (bounded over-count per entry) | Implicit (eviction) | Complex | K entries | Top-K frequent items when K is small |
Count Sketch uses +1/-1 random signs per hash function to make the error zero-mean (unbiased) instead of always positive. This gives better median estimation but loses the "never under-count" guarantee that makes CMS attractive for threshold-based decisions.
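To make the contrast concrete, here is a sketch of the Count Sketch update and query, using a seeded hash for both the position and the ±1 sign (the helper names and seeding scheme are mine):

```python
import hashlib
import statistics


def _h(x, seed, m):
    # Illustrative seeded hash of x into [0, m).
    digest = hashlib.blake2b(f"{seed}:{x}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "little") % m


def count_sketch_update(table, x, count=1):
    for i, row in enumerate(table):
        sign = 1 if _h(x, i + 1000, 2) == 1 else -1  # per-row +/-1 sign
        row[_h(x, i, len(row))] += sign * count


def count_sketch_estimate(table, x):
    vals = []
    for i, row in enumerate(table):
        sign = 1 if _h(x, i + 1000, 2) == 1 else -1
        vals.append(sign * row[_h(x, i, len(row))])
    # Median, not min: colliding elements cancel in expectation,
    # so the error is zero-mean rather than one-sided.
    return statistics.median(vals)
```

Note the query takes the median rather than the minimum: because collisions can push counters in either direction, the minimum is no longer a safe choice.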
Hash function requirements
CMS needs d pairwise-independent hash functions. In practice, you do not need d separate hash implementations. The standard approach uses two strong hash functions (e.g., MurmurHash3) and derives d functions via:
hash_i(x) = (h1(x) + i * h2(x)) mod w
This "double hashing" technique (analyzed by Kirsch and Mitzenmacher in the context of Bloom filters) preserves the sketch's error guarantees in practice and is much faster than computing d independent hashes. At w=2718 and d=5, each update requires 2 hash computations plus 5 modular additions and 5 counter increments.
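A sketch of the derivation, splitting one 128-bit digest into the two base hashes (blake2b is an illustrative choice of strong hash):

```python
import hashlib


def double_hashes(x, d, w):
    """Derive d row indices from two 64-bit halves of a single digest,
    via h_i(x) = (h1(x) + i * h2(x)) mod w."""
    digest = hashlib.blake2b(str(x).encode(), digest_size=16).digest()
    h1 = int.from_bytes(digest[:8], "little")
    h2 = int.from_bytes(digest[8:], "little")
    return [(h1 + i * h2) % w for i in range(d)]
```

One hash computation per element, however deep the sketch is.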
Heavy hitter detection
The primary application of CMS is "heavy hitter" detection: finding elements whose frequency exceeds a threshold φ (a fraction of total stream volume). For example, "find all IPs responsible for more than 1% of traffic."
The pattern combines a CMS with a min-heap of size K:
```python
def process_event(x, cms, heap, k, threshold):
    cms.update(x)
    freq = cms.estimate(x)
    if freq >= threshold:
        if x in heap:                  # pseudocode: a min-heap keyed by frequency
            heap.update(x, freq)       # that supports membership and key updates
        elif len(heap) < k:
            heap.push(x, freq)
        elif freq > heap.peek_min():
            heap.pop_min()
            heap.push(x, freq)
```
The heap never grows beyond K entries. CMS handles the counting in O(d) per event, and the heap operations are O(log K). For K=100 and d=5, you process each event in about 200ns regardless of how many distinct elements exist.
Applications:
- Rate limiting per IP/user: maintain per-key counts in a CMS to detect IPs exceeding thresholds without per-key state
- Trending topics: find queries or hashtags exceeding a views/minute threshold
- DDoS detection: detect source IPs with abnormally high packet rates in router firmware
- Ad fraud detection: flag advertiser IDs or publisher IDs with suspiciously high click counts
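A self-contained version of the pattern, with the pseudocode heap replaced by a small candidate dict plus eviction (Python's `heapq` has no decrease-key); the class name, hash scheme, and defaults are illustrative:

```python
import hashlib


def _cms_index(x, i, w):
    # Illustrative seeded hash for row i.
    digest = hashlib.blake2b(f"{i}:{x}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "little") % w


class HeavyHitters:
    """CMS for counting plus a bounded candidate set for top-K tracking."""

    def __init__(self, w=272, d=5, k=10, threshold=100):
        self.w, self.d, self.k, self.threshold = w, d, k, threshold
        self.table = [[0] * w for _ in range(d)]
        self.top = {}  # candidate -> latest frequency estimate

    def process_event(self, x):
        for i in range(self.d):
            self.table[i][_cms_index(x, i, self.w)] += 1
        freq = min(self.table[i][_cms_index(x, i, self.w)]
                   for i in range(self.d))
        if freq >= self.threshold:
            self.top[x] = freq
            if len(self.top) > self.k:
                # Evict the candidate with the smallest estimate.
                victim = min(self.top, key=self.top.get)
                del self.top[victim]
```

Memory stays bounded by the d×w table plus at most K candidates, no matter how many distinct elements flow through.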
Conservative update optimization
Standard CMS increments all d counters by the same amount. Conservative update is a simple optimization: raise a counter only if it sits below the new estimate (the current minimum plus the increment).
```python
def conservative_update(x, count=1):
    est = estimate(x)                  # current minimum across all rows
    for i in range(d):
        j = hash(i, x) % w
        # Raise each counter only as far as the new estimate requires.
        table[i][j] = max(table[i][j], est + count)
```
Instead of blindly adding to every row, conservative update raises each counter only as far as the new minimum estimate requires; counters already above `est + count` are left untouched. This reduces over-counting significantly in practice (often a 2-5x accuracy improvement) at zero extra memory cost. The tradeoff: conservative update breaks mergeability (two independently maintained sketches can no longer be combined by simple addition). Use it when you have a single-node sketch and accuracy matters more than distributed merging.
Merging sketches across nodes
When you run CMS on multiple nodes (e.g., 20 API gateway instances), you need to combine their sketches to get a global frequency estimate. Since CMS is a linear sketch, merging is trivial: add the counter arrays element-wise.
The merged sketch is mathematically identical to a single sketch that processed all events from all nodes. This is the key reason CMS is preferred over exact counters in distributed systems: you ship a fixed-size array (e.g., 54KB) instead of a potentially unbounded hash map.
| Merge property | CMS | HashMap |
|---|---|---|
| Data to transfer per node | d × w × 4 bytes (fixed) | O(unique keys × key size) |
| Merge operation | Element-wise addition | Key-by-key union |
| Result accuracy | Same probabilistic guarantees | Exact |
| Memory for merge | One additional d×w array | Potentially unbounded |
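The merge itself really is just element-wise addition over same-shaped counter tables. A sketch:

```python
def merge(table_a, table_b):
    """Element-wise sum of two same-shaped CMS counter arrays.
    Valid because CMS is a linear sketch (requires identical w, d,
    and hash functions on both nodes)."""
    assert len(table_a) == len(table_b)
    assert len(table_a[0]) == len(table_b[0])
    return [[a + b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(table_a, table_b)]
```

The identical-hash-functions requirement is why distributed deployments fix the sketch parameters (and hash seeds) in configuration shared by all nodes.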
Decay and aging for sliding windows
Production frequency counts usually need time windows. "How many requests from this IP in the last 5 minutes?" is more useful than "how many requests ever?"
The simplest approach: maintain one CMS per time bucket (e.g., per minute) and rotate them.
```python
# Five 1-minute buckets for a 5-minute sliding window
buckets = [CMS(w, d) for _ in range(5)]
current_bucket = 0

def tick_minute():
    global current_bucket
    current_bucket = (current_bucket + 1) % 5
    buckets[current_bucket].clear()    # Reset the oldest bucket before reuse

def update(x):
    buckets[current_bucket].update(x)

def estimate_5min(x):
    return sum(b.estimate(x) for b in buckets)
```
An alternative is exponential decay: multiply all counters by a decay factor λ (e.g., 0.99) at regular intervals. This avoids maintaining multiple sketches but loses the hard window guarantee. In practice, I recommend the bucket rotation approach because it is simpler to reason about and the memory overhead (5x a single sketch) is modest.
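For reference, the exponential-decay alternative is a single in-place pass over the counters; λ and the decay interval are tuning choices:

```python
def decay(table, lam=0.99):
    """Multiply every counter by lam in place (exponential aging).
    Estimates remain upper bounds on the decayed true counts, but the
    hard "exactly the last N minutes" window guarantee is lost."""
    for row in table:
        for j, v in enumerate(row):
            row[j] = v * lam
```

Note the counters become floats, so this variant pairs naturally with float-typed tables rather than packed integer counters.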
Production usage
| System | How CMS is used | Why not exact counters? |
|---|---|---|
| Google Bigtable / Dremel | Count hot row keys to detect tablet hotspots. CMS on each tablet server, merged centrally. | Billions of distinct row keys across thousands of tablets. Exact per-key counters would require unbounded memory on each tablet server. |
| Network routers | Per-flow packet counting for DDoS detection and traffic engineering. CMS fits in SRAM (on-chip, ~1ns access). | Line-rate packet processing (100Gbps+). SRAM is tiny (a few MB). Hash maps require DRAM (~50ns per access), which is too slow at line rate. |
| Apache Spark | CountMinSketch class in the spark-sketch module for approximate frequency queries in streaming pipelines. | Spark Streaming micro-batches process millions of events per second. Exact groupByKey requires shuffle. CMS produces approximate answers in a single pass. |
| Redis (RedisBloom module) | CMS.INCRBY, CMS.QUERY, CMS.MERGE commands. Used for rate limiting and frequency estimation. | Redis keeps everything in RAM. Exact counters with per-key INCR are O(keys); CMS is O(1) memory regardless of key cardinality. |
| Flink / Kafka Streams | Streaming heavy hitter detection in event-driven architectures. CMS per partition, merged at sink. | Event streams are unbounded. Accumulating exact state per key eventually exhausts memory or requires external storage. |
Interview tip: always cite at least one production system
When you mention CMS in a design, name a real system that uses it. "Google uses CMS in Bigtable to detect tablet hotspots" or "Network routers use CMS in SRAM for line-rate DDoS detection" shows you understand where this actually gets deployed, not just the theory.
Limitations and when NOT to use it
- Cannot delete elements. Standard CMS only supports increments. Decrementing counters would break the "never under-count" guarantee. If you need deletions, use a Count sketch, which supports negative updates but loses the one-sided error property.
- Additive error scales with total stream volume. For a stream of 1 billion events with ε=0.01, the error bound is ±10M for every element. Rare elements (appearing fewer than 10M times) are essentially indistinguishable from noise. CMS is not useful for "long tail" frequency estimation.
- No enumeration. CMS cannot tell you which elements it has seen. You can query "how often did X appear?" but not "which elements appeared at least K times?" without maintaining a separate data structure (like the min-heap pattern above).
- Over-counting can produce false positives in threshold systems. A rate limiter using CMS may occasionally block legitimate users because CMS over-estimates their request count. For systems where false positives are costly (e.g., payment fraud), use CMS as a first-pass filter and confirm with exact counts.
- Not appropriate when you need exact counts. If your use case cannot tolerate any estimation error (billing, financial reporting), use exact counters. CMS is for "good enough" answers at scale.
- Point queries only. CMS cannot answer range queries ("how many elements have frequency between 100 and 200?") without modifications like the range-tree CMS variant.
Interview cheat sheet
- A count-min sketch is a `d × w` counter array that estimates element frequencies in O(d) time and O(d × w) fixed memory.
- The estimate is always ≥ the true count (over-count, never under-count). The one-sided error is the key property.
- Error bound: P(estimate exceeds actual + ε × n) is under δ, where n = total events processed. Width w = ceil(e/ε), depth d = ceil(ln(1/δ)).
- Query returns the minimum across all d hash positions. The minimum works because collisions only inflate counters.
- Merging is element-wise addition. The merged sketch has the same guarantees as if all events were processed by a single sketch.
- Use CMS + min-heap for heavy hitter / top-K detection. CMS handles counting, heap tracks the K most frequent elements.
- Time-windowed counting: maintain one CMS per time bucket and rotate, or apply exponential decay to all counters.
- Conservative update reduces over-counting by 2-5x but breaks mergeability. Use it on single-node sketches.
Test Your Understanding
Quick Recap
- Count-min sketch estimates element frequencies using a d×w counter table and d independent hash functions. Updating increments d counters; querying returns the minimum across all d positions.
- Counts are always over-estimates (collisions inflate, never deflate). The minimum across hash functions gives the least-inflated estimate, bounded by ε × total_events with probability 1-δ.
- Memory is fixed: `ceil(e/ε) × ceil(ln(1/δ)) × counter_size`. For 0.1% error at 99% confidence, this is ~54KB regardless of the number of distinct elements.
- Merging is element-wise addition across distributed nodes. The merged sketch is mathematically identical to a single sketch that processed all events. Ship 54KB per node instead of unbounded hash maps.
- Heavy hitter detection combines CMS with a min-heap: CMS handles frequency estimation in O(d), the heap tracks the K most frequent elements in O(log K).
- Time-windowed counting uses one CMS per bucket with rotation. Conservative update reduces over-counting by 2-5x but breaks mergeability. Use standard update for distributed systems, conservative update for single-node accuracy.
Related concepts
- HyperLogLog answers the complementary question: "how many distinct elements?" vs CMS's "how often did element X appear?"
- Bloom filters answer "have I seen element X before?" (membership) vs CMS's frequency estimation
- Rate limiting is the most common production application of CMS: per-key counting without per-key state
- Caching uses frequency information to decide eviction: CMS can approximate access frequencies for LFU-style policies
- Hashing and pairwise-independent hash families are the mathematical foundation that makes CMS work