HyperLogLog: counting unique things without counting them

The problem

Your analytics dashboard shows "unique visitors today." The product team wants this number on every page, every dashboard, every A/B experiment summary. The naive implementation: maintain a HashSet of all user IDs you have seen, then call .size().

At 10,000 daily users, this costs a few hundred KB. At 100 million daily users, the set costs over a gigabyte of RAM for a single counter. Now multiply that by every page, every experiment, every time window. You are looking at terabytes of memory just for counting distinct things.

It gets worse in distributed systems. Your traffic is split across 20 shards. To compute "unique visitors across all shards," you need to transfer every shard's full set to a central node and compute the union. That is 20 gigabytes of network transfer for a single number. And you need this number updated every minute for real-time dashboards.

Even with clever deduplication (Bloom filters, sorted merges), the fundamental problem remains: exact distinct counting requires memory proportional to cardinality. When cardinality is in the hundreds of millions, the cost is prohibitive.

You can approximate with sampling, but sampling introduces bias: rare items are underrepresented. You can use bitmaps, but bitmap size is proportional to the domain size, not the cardinality.

This is the problem HyperLogLog solves: counting distinct elements (cardinality estimation) using a fixed, tiny amount of memory, regardless of whether the actual count is 1,000 or 1 trillion.

Here is a concrete comparison of the approaches:

Approach	Memory for 100M distinct users	Merge cost across 20 shards	Time complexity per add
`HashSet`	~4 GB	~80 GB network transfer	O(1) amortized, but GC spikes
Sorted file + dedup	~1.6 GB on disk	Full dump + sort-merge	O(log N)
HyperLogLog	12-16 KB	320 KB total	O(1), constant

The difference is not incremental. It is orders of magnitude. And the accuracy loss (0.81%) is invisible on any dashboard or report where the numbers are already rounded for display.

What it is

HyperLogLog (HLL) is a probabilistic data structure that estimates the number of distinct elements in a dataset using constant memory, typically 12-16KB, regardless of actual cardinality. The standard error is about 0.81% with the typical 16,384-bucket configuration. It was invented by Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier in 2007.

Analogy: Imagine you are flipping coins and recording the longest streak of heads you have ever seen. If you have only flipped 10 times, a streak of 3 heads is reasonable. If you have flipped 10 million times, seeing a streak of 20+ heads is expected. The longest streak you have observed tells you roughly how many coin flips you must have done. HLL applies this same idea to hash outputs: longer runs of leading zeros imply more distinct inputs.

The core observation: if you hash items uniformly, the more distinct items you process, the longer the longest run of leading zero bits you will encounter. If a hash starts with 5 leading zeros (00000xxx...), the probability of that specific output is 1/32. You probably needed to hash around 32 distinct items to see it. Ten leading zeros means you likely processed about 1,024 items.

Why 'Hyper'?

The name comes from using the harmonic mean (related to "hyperharmonic" numbers) to combine estimates from multiple buckets. The original "LogLog" algorithm used the geometric mean. HyperLogLog's harmonic mean reduces variance from outlier buckets significantly.

The single crude estimator 2^(max_leading_zeros) is wildly inaccurate on its own. The improvement comes from splitting the hash space into thousands of independent buckets, computing the estimator per bucket, and combining them with a harmonic mean. This is the core of the algorithm, and I will walk through each step.

How it works

The algorithm has four steps: hash, split, track, and estimate. Each step is simple on its own. The clever engineering is in how they combine to produce a cardinality estimate from nothing but a small byte array.

Before diving in, here is the high-level intuition: hash each element, use part of the hash to pick a bucket, and use the rest of the hash to make a "how rare is this hash?" observation. Track the rarest observation per bucket. When you have seen many distinct elements, rare observations (long zero runs) become more likely, so the tracked maximums increase.

Step 1: Hash every element to a uniform 64-bit string. The hash function must distribute outputs uniformly across the full 64-bit range. MurmurHash3 and xxHash are common choices. Cryptographic hashes work but are needlessly slow.

Step 2: Split the hash into two parts. The first b bits select which of the 2^b buckets (registers) this element maps to. The remaining 64 - b bits are used for the leading-zero observation. With b=14, you get 16,384 buckets and 50 remaining bits.

Step 3: Track the maximum zero-run per bucket. For the remaining bits, count the number of leading zeros plus one. If this value exceeds the current register value, update it.

def add(element):
    h = hash64(element)
    bucket_index = h >> (64 - b)                 # first b bits
    remaining = h & ((1 << (64 - b)) - 1)        # remaining bits
    run_length = count_leading_zeros(remaining) + 1
    registers[bucket_index] = max(registers[bucket_index], run_length)

Each register fits in a single byte (maximum run length in 64 bits is 64, which fits in 6 bits, though most implementations use a full byte for alignment). Total memory: 16,384 registers x 1 byte = 16KB.

Step 4: Estimate cardinality using the harmonic mean of 2^(-M[j]) across all buckets. This is the aggregation step that turns 16,384 independent observations into one number:

def estimate():
    m = len(registers)  # 16,384
    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2 ** -r for r in registers)
    return round(raw)

The harmonic mean is critical. A simple average would be destroyed by outlier buckets that saw an unusually long zero run by chance. The harmonic mean naturally downweights outliers. For your interview: just say "it uses buckets and harmonic mean to reduce variance from outliers."

Worked example: four elements, four buckets

Let's walk through a tiny HLL with b=2 (4 buckets) to make the mechanics concrete. Assume 8-bit hashes for readability.

Element	Hash (binary)	First 2 bits (bucket)	Remaining 6 bits	Leading zeros + 1
"alice"	`01 101000`	`01` → bucket 1	`101000`	0 + 1 = 1
"bob"	`11 001010`	`11` → bucket 3	`001010`	2 + 1 = 3
"carol"	`01 000110`	`01` → bucket 1	`000110`	3 + 1 = 4
"dave"	`10 100001`	`10` → bucket 2	`100001`	0 + 1 = 1

After processing all four elements, the registers are:

Bucket 0: 0  (never used)
Bucket 1: 4  (max of alice's 1 and carol's 4)
Bucket 2: 1  (dave's 1)
Bucket 3: 3  (bob's 3)

Estimate: α × 4² × (2^0 + 2^(-4) + 2^(-1) + 2^(-3))^(-1)

With α ≈ 0.532 for m=4: 0.532 × 16 × (1 + 0.0625 + 0.5 + 0.125)^(-1) = 0.532 × 16 × 0.593 ≈ 5.05

The true count is 4. The estimate of ~5 is off because 4 buckets provide very high variance (26% standard error). With 16,384 buckets, this variance shrinks to 0.81%.

Why the harmonic mean, not the arithmetic mean

Consider what happens with the arithmetic mean. If one bucket sees a hash with 15 leading zeros (a 1-in-32768 event), that single outlier dominates the average. The arithmetic mean of {1, 1, 1, 15} is 4.5, suggesting about 2^4.5 ≈ 23 distinct elements from 4 observations.

The harmonic mean resists outliers by working with reciprocals: 4 / (2^(-1) + 2^(-1) + 2^(-1) + 2^(-15)) ≈ 4 / 1.5 ≈ 2.67, much closer to the truth. This outlier resistance is why the algorithm jumped from "LogLog" (geometric mean, requiring more buckets for the same accuracy) to "HyperLogLog" (harmonic mean).

Notice that "alice" appearing twice has zero effect. The hash is identical both times, so the bucket and run length are the same. The max operation is idempotent. This is exactly what we want: duplicates are automatically ignored without any explicit deduplication.

Stochastic averaging: why buckets matter

A single maximum-zero-run counter is extremely noisy. If you hash 1,000 elements into one counter, a single lucky hash with 15 leading zeros makes the counter estimate 32,768. That is useless.

The fix: split the input into m independent substreams using the first b bits of the hash. Each substream has roughly n/m elements (where n is the total count). Each substream produces its own estimate. Combining m independent estimates reduces the variance by a factor of sqrt(m).

This is called stochastic averaging, and it is the same principle behind ensemble methods in machine learning: many weak estimators combined produce a strong estimator. Each individual bucket is noisy. Together, 16,384 buckets average out to 0.81% error.

The tradeoff is intuitive: more buckets means more memory (one byte each) but less noise. Fewer buckets means less memory but wilder estimates. The formula 1.04/sqrt(m) captures this exactly. Adding buckets has rapidly diminishing returns because of the square root: going from 16 to 16,384 buckets (1000x) only improves accuracy by ~32x (from 26% to 0.81%).

Memory layout in practice

In memory, an HLL is just a byte array. Nothing fancy.

HLL structure (b=14, 16,384 registers):
┌────────────────────────────────────────────┐
│ Header: 4 bytes (magic number + encoding)  │
├────────────────────────────────────────────┤
│ Register 0:     [5-bit value, 0-padded]    │
│ Register 1:     [3-bit value, 0-padded]    │
│ Register 2:     [7-bit value, 0-padded]    │
│ ...                                        │
│ Register 16383: [2-bit value, 0-padded]    │
├────────────────────────────────────────────┤
│ Total: ~12-16 KB depending on encoding     │
└────────────────────────────────────────────┘

Redis uses 6-bit registers (max value 63, sufficient for up to 2^63 cardinality with 64-bit hashes). The dense representation packs registers tightly: 16,384 × 6 bits = 98,304 bits = 12,288 bytes. Some implementations use 8-bit registers for simplicity, costing 16,384 bytes.

The entire structure fits in a few L1 cache lines when accessed sequentially. This makes HLL not just memory-efficient but also fast: updating a register is a single byte write with no branching beyond the max comparison.

For performance-sensitive applications, the per-element cost breaks down to:

One hash computation (~25ns for MurmurHash3 on a 16-byte key)
One bit shift to extract the bucket index
One leading-zero count (a single CPU instruction: LZCNT on x86, CLZ on ARM)
One max comparison and conditional store

The problem

You can approximate with sampling, but sampling introduces bias: rare items are underrepresented. You can use bitmaps, but bitmap size is proportional to the domain size, not the cardinality.

This is the problem HyperLogLog solves: counting distinct elements (cardinality estimation) using a fixed, tiny amount of memory, regardless of whether the actual count is 1,000 or 1 trillion.

Here is a concrete comparison of the approaches:

Approach	Memory for 100M distinct users	Merge cost across 20 shards	Time complexity per add
`HashSet`	~4 GB	~80 GB network transfer	O(1) amortized, but GC spikes
Sorted file + dedup	~1.6 GB on disk	Full dump + sort-merge	O(log N)
HyperLogLog	12-16 KB	320 KB total	O(1), constant

The difference is not incremental. It is orders of magnitude. And the accuracy loss (0.81%) is invisible on any dashboard or report where the numbers are already rounded for display.

What it is

Why 'Hyper'?

How it works

Step 3: Track the maximum zero-run per bucket. For the remaining bits, count the number of leading zeros plus one. If this value exceeds the current register value, update it.

def add(element):
    h = hash64(element)
    bucket_index = h >> (64 - b)                 # first b bits
    remaining = h & ((1 << (64 - b)) - 1)        # remaining bits
    run_length = count_leading_zeros(remaining) + 1
    registers[bucket_index] = max(registers[bucket_index], run_length)

Step 4: Estimate cardinality using the harmonic mean of 2^(-M[j]) across all buckets. This is the aggregation step that turns 16,384 independent observations into one number:

def estimate():
    m = len(registers)  # 16,384
    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2 ** -r for r in registers)
    return round(raw)

Worked example: four elements, four buckets

Let's walk through a tiny HLL with b=2 (4 buckets) to make the mechanics concrete. Assume 8-bit hashes for readability.

Element	Hash (binary)	First 2 bits (bucket)	Remaining 6 bits	Leading zeros + 1
"alice"	`01 101000`	`01` → bucket 1	`101000`	0 + 1 = 1
"bob"	`11 001010`	`11` → bucket 3	`001010`	2 + 1 = 3
"carol"	`01 000110`	`01` → bucket 1	`000110`	3 + 1 = 4
"dave"	`10 100001`	`10` → bucket 2	`100001`	0 + 1 = 1

After processing all four elements, the registers are:

Bucket 0: 0  (never used)
Bucket 1: 4  (max of alice's 1 and carol's 4)
Bucket 2: 1  (dave's 1)
Bucket 3: 3  (bob's 3)

Estimate: α × 4² × (2^0 + 2^(-4) + 2^(-1) + 2^(-3))^(-1)

With α ≈ 0.532 for m=4: 0.532 × 16 × (1 + 0.0625 + 0.5 + 0.125)^(-1) = 0.532 × 16 × 0.593 ≈ 5.05

The true count is 4. The estimate of ~5 is off because 4 buckets provide very high variance (26% standard error). With 16,384 buckets, this variance shrinks to 0.81%.

HLL structure (b=14, 16,384 registers):
┌────────────────────────────────────────────┐
│ Header: 4 bytes (magic number + encoding)  │
├────────────────────────────────────────────┤
│ Register 0:     [5-bit value, 0-padded]    │
│ Register 1:     [3-bit value, 0-padded]    │
│ Register 2:     [7-bit value, 0-padded]    │
│ ...                                        │
│ Register 16383: [2-bit value, 0-padded]    │
├────────────────────────────────────────────┤
│ Total: ~12-16 KB depending on encoding     │
└────────────────────────────────────────────┘

For performance-sensitive applications, the per-element cost breaks down to:

One hash computation (~25ns for MurmurHash3 on a 16-byte key)
One bit shift to extract the bucket index
One leading-zero count (a single CPU instruction: LZCNT on x86, CLZ on ARM)
One max comparison and conditional store

HyperLogLog: counting unique things without counting them

The problem

What it is

How it works

Worked example: four elements, four buckets

Why the harmonic mean, not the arithmetic mean

Stochastic averaging: why buckets matter

Memory layout in practice

Continue Reading with Premium

Comments

HyperLogLog: counting unique things without counting them

The problem

What it is

How it works

Worked example: four elements, four buckets

Why the harmonic mean, not the arithmetic mean

Stochastic averaging: why buckets matter

Memory layout in practice

Continue Reading with Premium

Comments