Roaring bitmaps: compressed integer sets for analytics
How roaring bitmaps store and operate on integer sets 10-100x more space-efficiently than arrays or hash sets. How they power fast set operations in Druid, ClickHouse, and Elasticsearch, and when to use them.
The problem
Your analytics service stores user segments. Segment A ("users who clicked ad campaign X") has 40 million user IDs. Segment B ("users who purchased in the last 30 days") has 12 million user IDs. A product manager asks: "How many users clicked the ad AND purchased?"
You store each segment as a sorted array of 32-bit integers. Segment A is 160MB. Segment B is 48MB. Intersecting them means scanning both arrays in a merge-join, touching every byte of both segments. The query takes 2 seconds and allocates 200MB+ of heap.
Now multiply that by 500 concurrent queries per second across 200 segments. Memory explodes. Latency spikes. Your analytics dashboard becomes unusable during peak hours. The garbage collector pauses as hundreds of megabytes of intermediate arrays are allocated and discarded per query.
The core issue: sorted arrays are space-inefficient for dense integer sets, and raw bit arrays waste enormous space for sparse sets. You need a structure that adapts to the density of each range, stays compact, and makes set operations (AND, OR, XOR) fast without full decompression. This is the problem roaring bitmaps solve.
To understand the scale: Druid clusters at major ad-tech companies maintain thousands of user segments, each with tens of millions of IDs. Every dashboard query intersects or unions multiple segments. Without compressed set representations, the memory footprint alone makes real-time analytics impossible. The analytics team cannot afford 400MB per segment Γ 2000 segments = 800GB of RAM just for segment storage.
What it is
A roaring bitmap is a compressed data structure for storing sets of 32-bit integers that automatically chooses the best storage format for each region of the integer space.
Think of a library with 65,536 shelves. Each shelf covers a range of book IDs. For a shelf with only a few books, you keep a short handwritten list. For a shelf packed with books, you use a stamp card where each slot is marked present or absent. For a shelf where books arrive in long consecutive runs (IDs 1000 through 5000), you just write "1000 to 5000" on a sticky note. The library picks the format per shelf based on how many books are on it. That is exactly what a roaring bitmap does per 65,536-integer chunk.
The name "roaring" comes from the paper's title ("Roaring Bitmaps: Implementation of an Optimized Software Library," Chambi et al., 2016). The key innovation over earlier compressed bitmap formats (WAH, EWAH, Concise) is the three-container-type design. Earlier formats used only run-length encoding, which is excellent for sequential data but performs poorly on sparse, unordered integer sets. Roaring bitmaps handle both cases well.
Interview tip: the one-liner
"Roaring bitmaps partition 32-bit integers into 65,536-element chunks and pick the smallest container (sorted array, bitset, or run-length encoding) for each chunk independently."
How it works
A roaring bitmap divides the full 32-bit integer space (0 to 2^32 - 1) into chunks of 2^16 = 65,536 values. The upper 16 bits of any integer become the chunk key. The lower 16 bits become the value within the chunk.
Why 16 bits? This is the sweet spot. Smaller chunks (8-bit key = 256-value chunks) would create too many containers, wasting memory on overhead. Larger chunks (20-bit key = 1M-value chunks) would make bitmap containers too large (128KB each) and reduce the benefit of per-chunk density adaptation.
Each chunk is stored in one of three container types, selected automatically based on cardinality:
Array container (sparse chunks, fewer than 4096 values): a sorted array of 16-bit integers. Maximum size is 4096 x 2 bytes = 8KB. Membership check is binary search, O(log n). Insertion maintains sorted order, which costs O(n) in the worst case but is fast for small arrays that fit in L1 cache.
Bitmap container (dense chunks, 4096 or more values): a 65,536-bit bitset, always exactly 8KB. Membership check is a single bit lookup, O(1). Set operations use bitwise AND/OR/XOR, processing 64 elements per CPU instruction. This is why dense bitmap operations are so fast: they map directly to hardware-level operations.
Run-length encoding (RLE) container (consecutive integer runs): stores [start, length] pairs. User IDs 1 through 100,000 compress to a single pair instead of 100,000 entries.
My recommendation: if your dataset has sequential IDs (auto-increment primary keys, timestamp-based IDs), RLE containers will dominate your memory profile. Check your data distribution before estimating memory requirements.
The crossover at 4096 is not arbitrary. A sorted array of 4096 uint16 values takes 8KB. A bitmap container is always 8KB. Below 4096 elements, the array is smaller. At 4096 and above, the bitmap is equal or smaller.
This exact crossover point is what distinguishes roaring bitmaps from ad-hoc approaches. You never need to tune this threshold for your workload. The math guarantees optimal container selection at every cardinality.
The engine converts between container types automatically as elements are added or removed. This is the key design choice: the data structure adapts to the data, not the other way around.
function add(bitmap, value):
chunk_key = value >> 16 // upper 16 bits
low_bits = value & 0xFFFF // lower 16 bits
container = bitmap.containers[chunk_key]
if container is null:
container = new ArrayContainer()
bitmap.containers[chunk_key] = container
container.add(low_bits)
// Convert if density crosses threshold
if container.type == ARRAY and container.cardinality >= 4096:
bitmap.containers[chunk_key] = container.toBitmapContainer()
Worked example
Suppose you add three integers to an empty roaring bitmap: 131073 (0x00020001), 131074 (0x00020002), and 200000 (0x00030D40).
Step 1: 131073 β chunk key = 0x0002, value = 0x0001. No container exists for chunk 0x0002, so create a new array container and insert value 1. The array container starts with a single element: [1].
Step 2: 131074 β chunk key = 0x0002, value = 0x0002. Container for chunk 0x0002 exists (array with [1]). Insert value 2 in sorted position. Array is now [1, 2]. Still below 4096 elements, so it stays as an array container.
Step 3: 200000 β chunk key = 0x0003, value = 0x0D40 (3392). No container for chunk 0x0003. Create a new array container and insert value 3392. Now we have two separate containers, one per chunk key.
Result: two chunk keys (0x0002 and 0x0003), each with an array container. Total memory: 2 chunks Γ (2-byte key + pointer) + 2 values Γ 2 bytes + 1 value Γ 2 bytes = negligible. Compare to storing three 32-bit integers in a sorted array (12 bytes), and the roaring bitmap has slightly more overhead. But at millions of elements, the per-chunk density optimization pays off dramatically.
Now suppose you add all integers from 131072 to 196607 (the entire chunk 0x0002 range, 65,536 values). The array container for chunk 0x0002 would grow past 4096, triggering conversion to a bitmap container. The bitmap container uses exactly 8KB regardless of whether the chunk has 4096 or 65,536 elements. When the chunk is completely full (65,536 of 65,536 possible values), the bitmap is still 8KB, but you could also represent it as a single RLE pair [0, 65535] at just 4 bytes.
Set operations
Operations work at the container level. The bitmap never needs full decompression. Each chunk key is matched between the two bitmaps, and only matching chunks need to be processed.
The crucial insight: if bitmap A has containers for chunks [0, 3, 7] and bitmap B has containers for chunks [0, 3, 10], an intersection only processes chunks 0 and 3.
Chunk 7 (only in A) and chunk 10 (only in B) are skipped without any work. For a union, all chunks from both sides are included, but non-overlapping chunks are copied without modification. This chunk-level skip is why roaring bitmap operations are often faster than naive array operations even when the total data is similar in size.
Intersection (AND): for each chunk key present in both bitmaps, intersect the two containers. Array-to-array uses a sorted merge (or galloping search if sizes differ by 10x+). Bitmap-to-bitmap uses bitwise AND across 8KB (1024 64-bit words), which takes roughly 250 nanoseconds with AVX2. If a chunk key exists in only one bitmap, it contributes nothing to the intersection and is skipped entirely. This skip is what makes intersection fast when bitmaps have different chunk distributions.
Union (OR): for each chunk key present in either bitmap, union the containers. Bitmap-to-bitmap uses bitwise OR. Array-to-array merges and deduplicates. A chunk key present in only one side is copied as-is, which is effectively free.
Symmetric difference (XOR): returns elements in exactly one of the two bitmaps (but not both). Bitmap-to-bitmap uses bitwise XOR. This is used for change detection: "which user IDs were added or removed between yesterday's segment and today's?"
Cardinality (count): bitmap containers use the CPU's popcount instruction to count set bits across 1024 words in roughly 1 microsecond. Array containers return their length. RLE containers sum the run lengths. No iteration through individual elements is needed.
Difference (AND-NOT): A.andNot(B) returns elements in A that are not in B. For bitmap containers, this is A_word & ~B_word for each of 1024 words. For array containers, it is a merge that skips matching elements. This is the operation used for audience exclusion ("users in segment A who are NOT in segment B").
function intersect(bitmapA, bitmapB):
result = new RoaringBitmap()
for each chunk_key in bitmapA.containers:
if chunk_key not in bitmapB.containers:
continue // no match, skip entirely
containerA = bitmapA.containers[chunk_key]
containerB = bitmapB.containers[chunk_key]
merged = containerA.and(containerB)
if merged.cardinality > 0:
result.containers[chunk_key] = merged
return result
Container type conversions during operations
An intersection of two dense bitmap containers can produce a sparse result. The engine checks the output cardinality and downgrades to an array container if the result has fewer than 4096 elements. This keeps memory optimal after every operation, not just during insertion.
Container type selection and conversion
The choice between container types is the heart of roaring bitmap efficiency. Understanding when and why conversions happen is critical for predicting memory usage and performance.
Here is the decision logic:
| Condition | Container type | Size | Membership check |
|---|---|---|---|
| Cardinality < 4096 | Sorted array of uint16 | 2 Γ cardinality bytes | Binary search O(log n) |
| Cardinality >= 4096 | 65,536-bit bitmap | 8,192 bytes (fixed) | Bit lookup O(1) |
| Long consecutive runs | RLE [start, length] pairs | 4 Γ num_runs bytes | Binary search on runs |
The crossover point (4096) is mathematically exact. An array of 4096 uint16 values is 4096 Γ 2 = 8192 bytes. A bitmap container is always 8192 bytes. Below 4096, the array wins. At 4096 and above, the bitmap is never larger.
RLE containers are chosen when the number of runs Γ 4 bytes is smaller than both the array and bitmap representations. I'll often see RLE containers dominate when the data comes from auto-incrementing primary keys or time-bucketed event IDs.
An RLE container for a chunk with three runs ([10, 100], [500, 600], [1000, 2000]) uses 3 Γ 4 = 12 bytes. The same data as an array would use 2 Γ (91 + 101 + 1001) = 2,386 bytes. As a bitmap, 8,192 bytes. The RLE container wins by 200x over the array and 680x over the bitmap.
However, RLE is worst for data with no sequential patterns: if every other integer is present (alternating), each "run" is length 1, and the RLE container uses 4 bytes per element versus 2 bytes for the array container. The engine detects this and avoids RLE when it would be larger.
Conversion happens automatically:
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.