Roaring bitmaps: compressed integer sets for analytics
How roaring bitmaps store and operate on integer sets 10-100x more space-efficiently than arrays or hash sets. How they power fast set operations in Druid, ClickHouse, and Elasticsearch, and when to use them.
The problem
Your analytics service stores user segments. Segment A ("users who clicked ad campaign X") has 40 million user IDs. Segment B ("users who purchased in the last 30 days") has 12 million user IDs. A product manager asks: "How many users clicked the ad AND purchased?"
You store each segment as a sorted array of 32-bit integers. Segment A is 160MB. Segment B is 48MB. Intersecting them means scanning both arrays in a merge-join, touching every byte of both segments. The query takes 2 seconds and allocates 200MB+ of heap.
Now multiply that by 500 concurrent queries per second across 200 segments. Memory explodes. Latency spikes. Your analytics dashboard becomes unusable during peak hours. The garbage collector pauses as hundreds of megabytes of intermediate arrays are allocated and discarded per query.
The core issue: sorted arrays are space-inefficient for dense integer sets, and raw bit arrays waste enormous space for sparse sets. You need a structure that adapts to the density of each range, stays compact, and makes set operations (AND, OR, XOR) fast without full decompression. This is the problem roaring bitmaps solve.
To understand the scale: Druid clusters at major ad-tech companies maintain thousands of user segments, each with tens of millions of IDs. Every dashboard query intersects or unions multiple segments. Without compressed set representations, the memory footprint alone makes real-time analytics impossible. The analytics team cannot afford 400MB per segment × 2000 segments = 800GB of RAM just for segment storage.
What it is
A roaring bitmap is a compressed data structure for storing sets of 32-bit integers that automatically chooses the best storage format for each region of the integer space.
Think of a library with 65,536 shelves. Each shelf covers a range of book IDs. For a shelf with only a few books, you keep a short handwritten list. For a shelf packed with books, you use a stamp card where each slot is marked present or absent. For a shelf where books arrive in long consecutive runs (IDs 1000 through 5000), you just write "1000 to 5000" on a sticky note. The library picks the format per shelf based on how many books are on it. That is exactly what a roaring bitmap does per 65,536-integer chunk.
The format was introduced in the paper "Better bitmap performance with Roaring bitmaps" (Chambi, Lemire, Kaser, and Godin, 2016). The key innovation over earlier compressed bitmap formats (WAH, EWAH, Concise) is the three-container-type design. Earlier formats used only run-length encoding, which is excellent for sequential data but performs poorly on sparse, unordered integer sets. Roaring bitmaps handle both cases well.
Interview tip: the one-liner
"Roaring bitmaps partition 32-bit integers into 65,536-element chunks and pick the smallest container (sorted array, bitset, or run-length encoding) for each chunk independently."
How it works
A roaring bitmap divides the full 32-bit integer space (0 to 2^32 - 1) into chunks of 2^16 = 65,536 values. The upper 16 bits of any integer become the chunk key. The lower 16 bits become the value within the chunk.
Why 16 bits? This is the sweet spot. Smaller chunks (8-bit key = 256-value chunks) would create too many containers, wasting memory on overhead. Larger chunks (20-bit key = 1M-value chunks) would make bitmap containers too large (128KB each) and reduce the benefit of per-chunk density adaptation.
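The key/value split is two bit operations. A minimal sketch in Python (the helper name is mine, not from any roaring library):

```python
def split(value: int) -> tuple[int, int]:
    """Upper 16 bits select the chunk; lower 16 bits are the value within it."""
    return value >> 16, value & 0xFFFF

assert split(131073) == (2, 1)                 # 0x0002_0001
assert split(200000) == (3, 3392)              # 0x0003_0D40
assert split(0xFFFFFFFF) == (0xFFFF, 0xFFFF)   # last value of the last chunk
```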
Each chunk is stored in one of three container types, selected automatically based on cardinality:
Array container (sparse chunks, fewer than 4096 values): a sorted array of 16-bit integers. Maximum size is 4096 x 2 bytes = 8KB. Membership check is binary search, O(log n). Insertion maintains sorted order, which costs O(n) in the worst case but is fast for small arrays that fit in L1 cache.
Bitmap container (dense chunks, 4096 or more values): a 65,536-bit bitset, always exactly 8KB. Membership check is a single bit lookup, O(1). Set operations use bitwise AND/OR/XOR, processing 64 elements per CPU instruction. This is why dense bitmap operations are so fast: they map directly to hardware-level operations.
Run-length encoding (RLE) container (consecutive integer runs): stores [start, length] pairs. User IDs 1 through 100,000 compress to a single pair instead of 100,000 entries.
My recommendation: if your dataset has sequential IDs (auto-increment primary keys, timestamp-based IDs), RLE containers will dominate your memory profile. Check your data distribution before estimating memory requirements.
The crossover at 4096 is not arbitrary. A sorted array of 4096 uint16 values takes 8KB. A bitmap container is always 8KB. Below 4096 elements, the array is smaller. At 4096 and above, the bitmap is equal or smaller.
This exact crossover point is what distinguishes roaring bitmaps from ad-hoc approaches. You never need to tune this threshold for your workload. The math guarantees optimal container selection at every cardinality.
The engine converts between container types automatically as elements are added or removed. This is the key design choice: the data structure adapts to the data, not the other way around.
function add(bitmap, value):
    chunk_key = value >> 16       // upper 16 bits
    low_bits = value & 0xFFFF     // lower 16 bits
    container = bitmap.containers[chunk_key]
    if container is null:
        container = new ArrayContainer()
        bitmap.containers[chunk_key] = container
    container.add(low_bits)
    // Convert if density crosses threshold
    if container.type == ARRAY and container.cardinality >= 4096:
        bitmap.containers[chunk_key] = container.toBitmapContainer()
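As a runnable toy version of this logic, here is a Python sketch (all names are mine; real libraries such as CRoaring are far more optimized). Array containers are sorted lists; a Python big int stands in for the 65,536-bit bitset:

```python
import bisect

ARRAY_TO_BITMAP = 4096  # crossover: 4096 uint16s = 8 KB = bitmap container size

class TinyRoaring:
    """Toy roaring bitmap: array containers convert to bitset containers at 4096."""

    def __init__(self):
        # chunk_key -> ("array", sorted list[int]) or ("bitmap", int used as bitset)
        self.containers = {}

    def add(self, value: int) -> None:
        key, low = value >> 16, value & 0xFFFF
        kind, data = self.containers.get(key, ("array", []))
        if kind == "bitmap":
            self.containers[key] = ("bitmap", data | (1 << low))
            return
        i = bisect.bisect_left(data, low)
        if i == len(data) or data[i] != low:
            data.insert(i, low)            # keep the array sorted, skip duplicates
        if len(data) >= ARRAY_TO_BITMAP:   # density crossed: convert to bitset
            bits = 0
            for v in data:
                bits |= 1 << v
            self.containers[key] = ("bitmap", bits)
        else:
            self.containers[key] = ("array", data)

    def contains(self, value: int) -> bool:
        key, low = value >> 16, value & 0xFFFF
        if key not in self.containers:
            return False
        kind, data = self.containers[key]
        if kind == "bitmap":
            return bool(data >> low & 1)
        i = bisect.bisect_left(data, low)
        return i < len(data) and data[i] == low
```

Adding 5,000 consecutive values into one chunk triggers the array-to-bitmap conversion at the 4,096th insert, exactly as the pseudocode above describes.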
Worked example
Suppose you add three integers to an empty roaring bitmap: 131073 (0x00020001), 131074 (0x00020002), and 200000 (0x00030D40).
Step 1: 131073 → chunk key = 0x0002, value = 0x0001. No container exists for chunk 0x0002, so create a new array container and insert value 1. The array container starts with a single element: [1].
Step 2: 131074 → chunk key = 0x0002, value = 0x0002. Container for chunk 0x0002 exists (array with [1]). Insert value 2 in sorted position. Array is now [1, 2]. Still below 4096 elements, so it stays as an array container.
Step 3: 200000 → chunk key = 0x0003, value = 0x0D40 (3392). No container for chunk 0x0003. Create a new array container and insert value 3392. Now we have two separate containers, one per chunk key.
Result: two chunk keys (0x0002 and 0x0003), each with an array container. Total memory: 2 chunks × (2-byte key + pointer) + 2 values × 2 bytes + 1 value × 2 bytes = negligible. Compare to storing three 32-bit integers in a sorted array (12 bytes), and the roaring bitmap has slightly more overhead. But at millions of elements, the per-chunk density optimization pays off dramatically.
Now suppose you add all integers from 131072 to 196607 (the entire chunk 0x0002 range, 65,536 values). The array container for chunk 0x0002 would grow past 4096, triggering conversion to a bitmap container. The bitmap container uses exactly 8KB regardless of whether the chunk has 4096 or 65,536 elements. When the chunk is completely full (65,536 of 65,536 possible values), the bitmap is still 8KB, but you could also represent it as a single RLE pair [0, 65535] at just 4 bytes.
Set operations
Operations work at the container level. The bitmap never needs full decompression. Each chunk key is matched between the two bitmaps, and only matching chunks need to be processed.
The crucial insight: if bitmap A has containers for chunks [0, 3, 7] and bitmap B has containers for chunks [0, 3, 10], an intersection only processes chunks 0 and 3.
Chunk 7 (only in A) and chunk 10 (only in B) are skipped without any work. For a union, all chunks from both sides are included, but non-overlapping chunks are copied without modification. This chunk-level skip is why roaring bitmap operations are often faster than naive array operations even when the total data is similar in size.
Intersection (AND): for each chunk key present in both bitmaps, intersect the two containers. Array-to-array uses a sorted merge (or galloping search if sizes differ by 10x+). Bitmap-to-bitmap uses bitwise AND across 8KB (1024 64-bit words), which takes roughly 250 nanoseconds with AVX2. If a chunk key exists in only one bitmap, it contributes nothing to the intersection and is skipped entirely. This skip is what makes intersection fast when bitmaps have different chunk distributions.
Union (OR): for each chunk key present in either bitmap, union the containers. Bitmap-to-bitmap uses bitwise OR. Array-to-array merges and deduplicates. A chunk key present in only one side is copied as-is, which is effectively free.
Symmetric difference (XOR): returns elements in exactly one of the two bitmaps (but not both). Bitmap-to-bitmap uses bitwise XOR. This is used for change detection: "which user IDs were added or removed between yesterday's segment and today's?"
Cardinality (count): bitmap containers use the CPU's popcount instruction to count set bits across 1024 words in roughly 1 microsecond. Array containers return their length. RLE containers sum the run lengths. No iteration through individual elements is needed.
Difference (AND-NOT): A.andNot(B) returns elements in A that are not in B. For bitmap containers, this is A_word & ~B_word for each of 1024 words. For array containers, it is a merge that skips matching elements. This is the operation used for audience exclusion ("users in segment A who are NOT in segment B").
function intersect(bitmapA, bitmapB):
    result = new RoaringBitmap()
    for each chunk_key in bitmapA.containers:
        if chunk_key not in bitmapB.containers:
            continue // no match, skip entirely
        containerA = bitmapA.containers[chunk_key]
        containerB = bitmapB.containers[chunk_key]
        merged = containerA.and(containerB)
        if merged.cardinality > 0:
            result.containers[chunk_key] = merged
    return result
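The chunk-level skip is easy to demonstrate. In this Python sketch (my own simplification: plain sets stand in for containers, dict keys stand in for chunk keys), only chunks present on both sides are ever touched:

```python
def intersect(a: dict[int, set[int]], b: dict[int, set[int]]) -> dict[int, set[int]]:
    """Chunk-level intersection: only chunk keys present in both are processed."""
    result = {}
    for key in a.keys() & b.keys():   # chunks unique to one side are skipped
        merged = a[key] & b[key]      # container-level AND (sets stand in here)
        if merged:                    # drop empty containers from the result
            result[key] = merged
    return result

A = {0: {1, 2, 3}, 3: {10, 20}, 7: {5}}
B = {0: {2, 3, 4}, 3: {99}, 10: {1}}
assert intersect(A, B) == {0: {2, 3}}   # chunk 3 intersects to empty; 7 and 10 skipped
```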
Container type conversions during operations
An intersection of two dense bitmap containers can produce a sparse result. The engine checks the output cardinality and downgrades to an array container if the result has fewer than 4096 elements. This keeps memory optimal after every operation, not just during insertion.
Container type selection and conversion
The choice between container types is the heart of roaring bitmap efficiency. Understanding when and why conversions happen is critical for predicting memory usage and performance.
Here is the decision logic:
| Condition | Container type | Size | Membership check |
|---|---|---|---|
| Cardinality < 4096 | Sorted array of uint16 | 2 × cardinality bytes | Binary search O(log n) |
| Cardinality >= 4096 | 65,536-bit bitmap | 8,192 bytes (fixed) | Bit lookup O(1) |
| Long consecutive runs | RLE [start, length] pairs | 4 × num_runs bytes | Binary search on runs |
The crossover point (4096) is mathematically exact. An array of 4096 uint16 values is 4096 × 2 = 8192 bytes. A bitmap container is always 8192 bytes. Below 4096, the array wins. At 4096 and above, the bitmap is never larger.
RLE containers are chosen when the number of runs × 4 bytes is smaller than both the array and bitmap representations. I'll often see RLE containers dominate when the data comes from auto-incrementing primary keys or time-bucketed event IDs.
An RLE container for a chunk with three runs (values 10 through 100, 500 through 600, and 1000 through 2000, inclusive) uses 3 × 4 = 12 bytes. The same data as an array would use 2 × (91 + 101 + 1001) = 2,386 bytes. As a bitmap, 8,192 bytes. The RLE container wins by roughly 200x over the array and 680x over the bitmap.
However, RLE is worst for data with no sequential patterns: if every other integer is present (alternating), each "run" is length 1, and the RLE container uses 4 bytes per element versus 2 bytes for the array container. The engine detects this and avoids RLE when it would be larger.
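The three-way size comparison is simple arithmetic. A hedged Python sketch (helper names are mine) that counts runs and picks the smallest representation, mirroring the decision table above:

```python
def count_runs(sorted_values: list[int]) -> int:
    """Number of maximal runs of consecutive integers in a sorted list."""
    runs, prev = 0, None
    for v in sorted_values:
        if prev is None or v != prev + 1:
            runs += 1            # a gap starts a new run
        prev = v
    return runs

def best_container(sorted_values: list[int]) -> str:
    """Pick the smallest of array (2 B/value), RLE (4 B/run), bitmap (8 KB fixed)."""
    candidates = [("array", 2 * len(sorted_values)),
                  ("rle", 4 * count_runs(sorted_values)),
                  ("bitmap", 8192)]
    return min(candidates, key=lambda t: t[1])[0]

# Three long runs: RLE needs 12 bytes, array 2,386, bitmap 8,192.
runs = list(range(10, 101)) + list(range(500, 601)) + list(range(1000, 2001))
assert best_container(runs) == "rle"

# Alternating values: every "run" has length 1, so RLE loses to the array.
assert best_container(list(range(0, 8000, 2))) == "array"
```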
Conversion happens automatically:
function afterInsert(container):
    if container.type == ARRAY and container.cardinality >= 4096:
        return container.toBitmap()
    if container.type == BITMAP and container.cardinality < 4096:
        return container.toArray()
    // Optionally: check if RLE is smaller
    if container.runCount * 4 < container.currentSizeBytes():
        return container.toRLE()
    return container // keep current type
Memory layout of a roaring bitmap
At the top level, a roaring bitmap is a sorted array of (chunk_key, container) pairs. The chunk keys are stored in a separate sorted array for binary search during lookups. This two-level structure (keys + containers) is what gives roaring bitmaps O(log n) random access to any chunk.
When you call bitmap.contains(value), the engine extracts the chunk key (upper 16 bits), binary-searches the key array to find the matching container, then checks the container using its type-specific method. The binary search on chunk keys is O(log c) where c is the number of active chunks, typically a few hundred to a few thousand.
RoaringBitmap
├── keys: [0, 3, 7, 42, 1001]   // sorted uint16 array
├── containers: [ArrayC, BitmapC, RLEC, ArrayC, BitmapC]
├── size: 5                     // number of active chunks
└── cardinality: 4,203,817      // cached total (optional)
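The two-level lookup described above can be sketched in a few lines of Python (my own simplification: sets stand in for the three container types, and `bisect` provides the binary search on chunk keys):

```python
import bisect

def contains(keys: list[int], containers: list[set[int]], value: int) -> bool:
    """Two-level lookup: binary-search the sorted chunk-key array,
    then probe the matching container with its own membership check."""
    chunk_key, low = value >> 16, value & 0xFFFF
    i = bisect.bisect_left(keys, chunk_key)
    if i == len(keys) or keys[i] != chunk_key:
        return False              # no container exists for this chunk
    return low in containers[i]

keys = [0, 3, 7]
containers = [{1, 2}, {500}, {65535}]
assert contains(keys, containers, 1)                  # chunk 0, value 1
assert contains(keys, containers, (3 << 16) | 500)    # chunk 3, value 500
assert not contains(keys, containers, (5 << 16) | 1)  # chunk 5 has no container
```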
Memory breakdown for a real-world bitmap (40M user IDs, uniform distribution):
| Component | Count | Size each | Total |
|---|---|---|---|
| Chunk key array | 610 entries | 2 bytes | 1.2 KB |
| Container pointers | 610 | 8 bytes | 4.9 KB |
| Array containers (sparse chunks) | ~150 | avg 1KB | ~150 KB |
| Bitmap containers (dense chunks) | ~460 | 8 KB fixed | ~3.6 MB |
| RLE containers | ~0 (uniform random has no runs) | varies | 0 |
| Total | | | ~3.8 MB |
Compare this to 40M × 4 bytes = 160 MB in a sorted array, or 2^32 / 8 = 512 MB as an uncompressed bit array. The roaring bitmap is 42x smaller than the sorted array for this workload.
For sequential IDs (auto-increment), the picture changes dramatically. Forty million sequential IDs from 1 to 40M span about 610 chunks, each stored as a single RLE pair. Total: about 610 × 4 bytes = 2.4 KB. That is roughly 66,000x smaller than the sorted array.
Memory breakdown by data pattern:
| Data pattern | 40M integers | Roaring size | vs. sorted array |
|---|---|---|---|
| Uniform random | Full 32-bit space | ~3.8 MB | 42x smaller |
| Sequential (auto-increment) | IDs 1 to 40M | ~2.4 KB | 66,000x smaller |
| Clustered (10 dense regions) | 10 regions of 4M each | ~80 KB | 2,000x smaller |
| Alternating (every other int) | Even numbers 0 to 80M | ~10 MB (≈1,220 bitmap containers × 8 KB) | 16x smaller |
This table illustrates why roaring bitmaps dominate in real analytics workloads: most production data falls into the "clustered" or "sequential" categories, where compression ratios are extreme.
The takeaway for capacity planning: always profile your actual data distribution before estimating roaring bitmap memory usage. A uniform random distribution (worst case) and a sequential distribution (best case) can differ by 1000x in memory consumption for the same cardinality.
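The best-case and worst-case figures above fall out of a few lines of arithmetic. A hedged Python sketch (function names are mine; it counts container payload only and ignores the small key/pointer overhead):

```python
def sequential_size_bytes(n: int) -> int:
    """Roaring size estimate for IDs 1..n: one RLE [start, length] pair
    (4 bytes) per 65,536-value chunk touched."""
    chunks = -(-n // 65536)   # ceiling division
    return chunks * 4

def mixed_size_bytes(n_dense_chunks: int, sparse_cardinalities: list[int]) -> int:
    """Dense chunks cost a fixed 8 KB bitmap; sparse chunks cost 2 bytes/value."""
    return n_dense_chunks * 8192 + sum(2 * c for c in sparse_cardinalities)

# 40M sequential IDs: 611 chunks x 4 bytes ~= 2.4 KB.
assert sequential_size_bytes(40_000_000) == 611 * 4

# ~460 dense chunks + ~150 sparse chunks of ~500 values: ~3.9 MB,
# matching the uniform-random breakdown in the table above.
assert 3_700_000 < mixed_size_bytes(460, [500] * 150) < 4_000_000
```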
SIMD and hardware acceleration
Modern roaring bitmap libraries exploit CPU-level parallelism to accelerate the most common operations. Two hardware features make a significant difference. Without these, roaring bitmaps would be 3-5x slower on bitmap container operations.
POPCNT (population count): The popcount instruction counts the number of set bits in a 64-bit word in a single CPU cycle. A bitmap container has 1024 words (8KB / 8 bytes per word). Counting all set bits (cardinality) takes 1024 popcount operations, completing in roughly 1 microsecond on modern x86-64 CPUs. Without hardware popcount, you would need a lookup table or bit-manipulation loop (Kernighan's algorithm), which is 5-10x slower.
The popcount instruction became standard with Intel's SSE4.2 (2008) and AMD's ABM (2007). Today, every production server supports it. If you are running on hardware from the last 15 years, you have hardware popcount.
SIMD (Single Instruction, Multiple Data): AVX2 processes 256 bits per instruction. A bitwise AND of two bitmap containers (1024 Ć 64-bit words) can be done in 256 AVX2 operations instead of 1024 scalar operations, a 4x throughput improvement. AVX-512 doubles this again on supported hardware.
| Operation | Scalar throughput | AVX2 throughput | Speedup |
|---|---|---|---|
| Bitmap AND (8KB) | ~1.0 μs | ~0.25 μs | 4x |
| Bitmap OR (8KB) | ~1.0 μs | ~0.25 μs | 4x |
| Bitmap cardinality | ~1.0 μs (popcount) | ~0.3 μs (vpopcount) | 3x |
| Array intersection (galloping) | depends on size | no SIMD benefit | 1x |
Note that array containers (sorted uint16 arrays) do not benefit much from SIMD because their operations involve comparison-based merging rather than bulk bitwise operations. The performance advantage of SIMD is specific to bitmap containers, which is another reason dense data (many bitmap containers) is faster to operate on than sparse data (many array containers).
I'll often see teams benchmark roaring bitmap operations without SIMD and conclude "it's not that fast." Make sure your library is compiled with AVX2 support.
CRoaring (the C library) and Java's RoaringBitmap both support SIMD acceleration out of the box on x86-64. In Go, the roaring library uses assembly-optimized popcount but does not yet have full AVX2 support for bitwise operations.
ARM processors and SIMD
ARM chips (AWS Graviton, Apple Silicon) use NEON instead of AVX2. NEON processes 128 bits per instruction, giving a 2x speedup over scalar instead of 4x. Benchmark on your actual production hardware, not your development laptop, when making performance claims.
Serialization and distributed merging
Roaring bitmaps have a standardized binary format (the Roaring Bitmap specification, maintained by the RoaringBitmap GitHub organization). This matters enormously in distributed systems where partial bitmaps are built on different nodes, possibly in different languages (Java on Druid, C++ on ClickHouse), and merged at a coordinator.
The format is portable across implementations: a roaring bitmap serialized by the Java library can be deserialized by the C library (CRoaring) or the Go library. This cross-language interoperability is rare among data structures and is one reason roaring bitmaps became the industry standard for compressed integer sets.
The specification also supports "frozen" serialized bitmaps that can be memory-mapped directly without deserialization. This is covered in the next section.
Because each node owns a distinct range of chunk keys, merging non-overlapping bitmaps is a zero-copy concatenation of container lists. Overlapping bitmaps require container-level OR operations but still avoid full decompression.
The serialization format defines a header with cookie bytes (identifying the format version), the number of containers, and a run-flag bitset that marks which containers use RLE encoding. After the header, an offset array allows random access to any container without deserializing the entire bitmap. This is critical for lazy loading: a query that only touches chunks 0 and 3 never reads chunks 1, 2, 4, or beyond from disk.
// Serialized roaring bitmap layout (simplified)
Header:
    cookie: uint32           // format identifier
    container_count: uint32  // number of chunks
    run_flag_bitset: bit[]   // which containers are RLE
Key-cardinality pairs:
    [chunk_key, cardinality - 1]   // for each container
Offset array:
    [byte_offset_to_container_0, byte_offset_to_container_1, ...]
Container data:
    // Each container in its native format (array, bitmap, or RLE)
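To make the header layout concrete, here is a Python sketch that packs a simplified header loosely modeled on the structure above. This is illustrative only, NOT the official roaring format: the cookie value, field widths, and flag packing are my own placeholders.

```python
import struct

def serialize_header(chunk_keys: list[int], cardinalities: list[int],
                     run_flags: list[bool]) -> bytes:
    """Pack a simplified header: cookie, container count, run-flag bits,
    then little-endian (key, cardinality - 1) pairs."""
    cookie = 0x3042                      # placeholder, not the spec's cookie
    flags = 0
    for i, is_run in enumerate(run_flags):
        if is_run:
            flags |= 1 << i              # bit i marks container i as RLE
    out = struct.pack("<II", cookie, len(chunk_keys))
    out += struct.pack("<I", flags)
    for key, card in zip(chunk_keys, cardinalities):
        out += struct.pack("<HH", key, card - 1)   # spec stores cardinality - 1
    return out

blob = serialize_header([0, 3], [100, 4096], [False, True])
assert len(blob) == 4 + 4 + 4 + 2 * 4   # cookie + count + flags + two pairs
```

Storing `cardinality - 1` (as the real format does) lets a full container's 65,536 fit in a uint16.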
Frozen bitmaps and memory-mapped I/O
In production analytics systems, bitmaps are typically built once (during ingestion) and queried thousands of times. For this read-heavy pattern, frozen (immutable) bitmaps offer significant advantages over mutable in-memory bitmaps.
A frozen roaring bitmap is serialized to disk and memory-mapped (mmap) directly into the process's address space. No deserialization step is needed. The operating system pages in only the portions of the bitmap that are actually accessed during a query. This means a 50MB bitmap file costs zero heap memory and near-zero startup time.
This is how Druid handles segment bitmaps. Each segment file contains pre-built roaring bitmaps for each dimension value. At query time, Druid mmaps the segment file and reads container data directly from the mapped pages. The kernel's page cache handles hot/cold management automatically.
| Approach | Query startup cost | Memory usage | Suitable for |
|---|---|---|---|
| Deserialize into heap | O(n) time, O(n) memory | Full bitmap in heap | Small bitmaps, frequent mutations |
| Memory-mapped (frozen) | Near-zero (just mmap) | Only accessed pages | Large bitmaps, read-heavy, shared across queries |
The tradeoff: frozen bitmaps cannot be mutated. If you need to add or remove elements, you must create a new bitmap or maintain a mutable overlay that merges with the frozen base at query time.
I'll often see systems use a "base + delta" approach: the frozen bitmap is the bulk of the data, and a small mutable bitmap holds recent changes. Queries merge both at read time. Periodically (every hour or every ingestion cycle), the mutable delta is merged into a new frozen base, and the cycle restarts. This is conceptually similar to how LSM trees separate memtables (mutable) from SSTables (immutable).
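The base + delta read path is just an overlay check. A minimal Python sketch of the idea (my own simplification: sets stand in for the frozen bitmap and the two mutable deltas):

```python
def query_contains(frozen: set[int], delta_add: set[int],
                   delta_del: set[int], value: int) -> bool:
    """Base + delta read path: recent deletes mask the frozen base,
    recent adds extend it. The frozen set is never mutated."""
    if value in delta_del:
        return False                     # tombstoned since the last freeze
    return value in frozen or value in delta_add

frozen = {1, 2, 3}
assert query_contains(frozen, {10}, {2}, 10)       # newly added
assert not query_contains(frozen, {10}, {2}, 2)    # deleted from the base
assert query_contains(frozen, {10}, {2}, 1)        # unchanged base element
```

The periodic compaction step then materializes `(frozen | delta_add) - delta_del` into a new frozen base and clears both deltas.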
Roaring vs. other compressed bitmap formats
Roaring bitmaps were not the first compressed bitmap format. Several earlier formats exist, and understanding why roaring won the industry adoption race is instructive for system design interviews.
| Format | Compression strategy | Random access | Mixed density support | Status |
|---|---|---|---|---|
| WAH (Word-Aligned Hybrid) | Run-length encoding of 31-bit words | No (sequential scan only) | Poor (all RLE) | Superseded |
| EWAH (Enhanced WAH) | 64-bit word RLE with skip pointers | Partial | Poor (all RLE) | Superseded |
| Concise | Compressed 'n' Composable Integer Set (word-aligned RLE variant) | No | Moderate | Superseded |
| Roaring | Three container types per chunk | Yes (binary search on chunk keys) | Excellent | Industry standard |
The critical difference: WAH, EWAH, and Concise encode the entire bitmap as a single run-length stream. To access a specific bit, you must scan from the beginning. This is O(n) random access.
Roaring partitions into chunks with a key array, enabling O(log n) random access to any chunk. This makes random lookups and partial operations dramatically faster.
For sparse, sequential data, EWAH can actually compress slightly better than roaring (pure RLE has no per-chunk overhead). But the moment you need random access, set operations on non-sequential data, or mixed-density sets, roaring dominates. This is why Lucene, Druid, and ClickHouse all migrated from WAH/EWAH to roaring.
Another advantage: roaring bitmaps compose well. You can AND two roaring bitmaps without knowing their internal container types. The engine dispatches to the right implementation for each container pair. WAH/EWAH require you to decompress into a common format first, losing the compression benefit during operations.
For your interview: if someone asks "why roaring over WAH/EWAH?", the answer is "random access and mixed-density support. WAH requires sequential scanning; roaring uses binary search on chunk keys. And roaring supports heterogeneous container types, so it handles both sparse and dense regions in the same bitmap."
Cross-container operations
When two containers of different types meet during an operation, the engine must handle mixed-type combinations. There are nine possible pairings (3 types x 3 types), and each has a specialized implementation. This matrix of optimized operations is what makes roaring bitmaps production-grade.
| Operation | Container A | Container B | Strategy | Result type |
|---|---|---|---|---|
| AND | Array | Array | Sorted merge (galloping) | Array |
| AND | Array | Bitmap | Probe each array element in bitmap | Array |
| AND | Bitmap | Bitmap | Bitwise AND, 1024 words | Bitmap or Array |
| AND | RLE | Bitmap | Iterate runs, extract matching bits | Array or Bitmap |
| OR | Array | Array | Sorted merge | Array or Bitmap |
| OR | Array | Bitmap | Set array elements in bitmap copy | Bitmap |
| OR | Bitmap | Bitmap | Bitwise OR, 1024 words | Bitmap |
| OR | RLE | RLE | Merge run intervals | RLE |
The array-to-bitmap intersection is particularly elegant. Instead of converting the array to a bitmap (8KB allocation), you iterate the array and probe each element's bit in the bitmap. This is O(array.length) rather than O(65536), which matters when the array has only a few hundred elements.
For the array-to-array case, the engine uses galloping (exponential) search when the two arrays differ significantly in size. If array A has 50 elements and array B has 4000 elements, a naive merge scans both linearly (O(4050)). Galloping search for each element of A in B runs in O(50 × log(4000)) = O(600), which is 6x faster.
The result type also depends on output cardinality. Two bitmap containers ANDed together might produce a result with only 100 set bits. The engine detects this via popcount and stores the result as an array container rather than a sparse bitmap. This downgrade saves 8KB minus the 200 bytes the array needs.
function intersect_array_bitmap(array_container, bitmap_container):
    result = new ArrayContainer()
    for each value in array_container.values:
        word_index = value >> 6   // which 64-bit word
        bit_index = value & 0x3F  // which bit in that word
        if bitmap_container.words[word_index] & (1 << bit_index):
            result.add(value)
    return result
Difference (AND-NOT) follows the same cross-container pattern: A.andNot(B) keeps the elements of A absent from B, computed as A_word & ~B_word per word for bitmap containers and as a skipping merge for array containers, exactly as described under set operations above.
Interview tip: galloping search
When intersecting two sorted arrays of very different sizes (say 100 and 50,000 elements), a naive merge scans both. Galloping search (exponential search) on the larger array lets you skip ahead, giving O(small Ć log(large)) instead of O(small + large). Most roaring bitmap libraries implement this optimization.
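A hedged Python sketch of galloping intersection (the function name is mine; real libraries implement this over uint16 arrays with additional fast paths):

```python
import bisect

def galloping_intersect(small: list[int], large: list[int]) -> list[int]:
    """Intersect two sorted arrays when |small| << |large|: gallop ahead
    in doubling steps to bracket each probe, then binary-search the bracket.
    Total cost is O(|small| * log |large|) instead of O(|small| + |large|)."""
    result, lo = [], 0
    for x in small:
        step = 1
        while lo + step < len(large) and large[lo + step] < x:
            step *= 2                         # exponential probe forward
        hi = min(lo + step + 1, len(large))   # x, if present, lies in large[lo:hi]
        i = bisect.bisect_left(large, x, lo, hi)
        if i < len(large) and large[i] == x:
            result.append(x)
        lo = i                                # resume past the last position
    return result

assert galloping_intersect([5, 100, 9999], list(range(0, 10000, 5))) == [5, 100]
```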
Performance characteristics
The performance advantage of roaring bitmaps comes from two sources: compression reduces memory bandwidth requirements, and container-level operations skip irrelevant data entirely. The result is that operations on roaring bitmaps are often faster than operations on uncompressed data because less data moves through the memory bus.
| | Sorted array | Hash set | Uncompressed bit array | Roaring bitmap |
|---|---|---|---|---|
| Memory (1M ints, dense range) | 4MB | ~32MB | 128KB | ~128KB |
| Memory (1M ints, sparse over 2^32) | 4MB | ~32MB | 512MB | ~2MB |
| Intersection speed | O(n) merge | O(min) probe | O(n/64) bitwise | O(n/64) bitwise |
| Union speed | O(n+m) merge | O(n+m) insert | O(n/64) bitwise | O(n/64) bitwise |
| Supports distributed merge | Yes | Difficult | Yes | Yes |
| Compression ratio (sparse) | 1x baseline | 8x worse | 128x worse | 2-10x better |
For your interview: "Roaring bitmaps match uncompressed bit arrays on dense data and beat sorted arrays by 10-100x on sparse data, because each chunk independently picks the smallest representation."
Why roaring bitmaps beat hash sets for analytics workloads:
Hash sets (HashMap, HashSet) are optimized for single-element lookups: O(1) contains, O(1) insert. But they have four properties that make them poor choices for analytics:
- Memory overhead. A Java HashSet wrapping Integer objects costs roughly 32 bytes per element (16-byte object header, 4-byte int, padding, hash table entry with pointer). A roaring bitmap stores the same integer in 2 bytes (array container) or 1 bit (bitmap container).
- No set algebra primitives. Intersecting two hash sets requires iterating the smaller set and probing each element in the larger. This is O(min(n,m)) probes, each potentially missing the CPU cache. Roaring bitmap intersection processes 64 elements per CPU instruction (bitwise AND on 64-bit words).
- Not serializable across languages. There is no standard hash set serialization format. Roaring bitmaps have a cross-language binary format (Java, C, C++, Go, Rust, Python). This matters in polyglot analytics pipelines where different components are written in different languages.
- No SIMD acceleration. Hash set probing is inherently sequential (each probe depends on the previous hash). Roaring bitmap operations on bitmap containers process 64-256 elements per CPU instruction via SIMD, giving orders-of-magnitude better throughput on modern hardware.
Production usage
| System | Usage | Notable behavior |
|---|---|---|
| Apache Druid | User segment bitmaps for filter queries | Each dimension value maps to a roaring bitmap of row IDs. Multi-filter queries run bitmap AND/OR across dimensions, pruning billions of rows in milliseconds. |
| Apache Lucene / Elasticsearch | Document posting lists for filter caches | Boolean filter queries ("tag:java AND status:published") intersect roaring bitmaps of matching doc IDs. Lucene switched from sorted arrays to roaring bitmaps in version 5.0 for a 3-10x speedup on filter queries. |
| ClickHouse | groupBitmapAnd, groupBitmapOr aggregate functions | Set operations execute at query time across distributed shards. Each shard returns a serialized roaring bitmap; the coordinator merges them. |
| Apache Spark / Delta Lake | Shuffle tracking and deletion vectors | Spark's HighlyCompressedMapStatus tracks empty shuffle blocks with a roaring bitmap; Delta Lake deletion vectors record deleted row positions per file as serialized roaring bitmaps. |
| Redis (via the Redis-Roaring module) | Real-time cohort membership | Feature flag evaluation checks "is user X in cohort Y?" against a roaring bitmap. Cohort intersection for A/B test analysis runs server-side. |
| Apache Kylin | Precise count-distinct measures | Exact COUNT(DISTINCT) results are stored as roaring bitmaps per cube segment and merged at query time, avoiding approximate sketches. |
For your interview: be ready to name at least three of these systems. Druid, Elasticsearch, and ClickHouse are the easiest to remember because they cover analytics, search, and OLAP respectively.
Limitations and when NOT to use it
- Not for non-integer keys. Roaring bitmaps only store unsigned 32-bit integers. If your keys are strings, UUIDs, or 64-bit integers, you need a mapping layer (hash to 32-bit, or use Roaring64). The mapping adds complexity, memory for the dictionary, and potential collisions if you hash. Consider whether the mapping overhead is worth the bitmap benefits.
- Not useful for very small sets. If your sets have fewer than a few hundred elements, a sorted array or hash set is simpler, faster, and uses less memory. Roaring's per-chunk metadata (key array, container pointers) adds overhead that only pays off at scale. I would not use roaring bitmaps for sets under 1,000 elements.
- No ordering guarantees within chunks. Bitmap and RLE containers do not store insertion order. If you need to iterate elements in insertion order (not value order), roaring bitmaps are the wrong choice. You would need a separate ordering structure.
- 64-bit integer support is an extension. The core Roaring Bitmap specification is 32-bit. 64-bit extensions (Roaring64) exist but are less mature, less portable, and not supported by all systems (e.g., ClickHouse's bitmap functions are 32-bit only as of this writing).
- Write amplification on container conversion. Inserting element 4096 into an array container triggers a full conversion to a bitmap container (8KB allocation and copy). Deleting one element from a 4096-element bitmap triggers conversion back. If your workload repeatedly crosses the threshold, you pay conversion costs on every operation. Use libraries with hysteresis to mitigate.
- Not a replacement for database indexes. Roaring bitmaps accelerate set operations on pre-built bitmaps. They do not support range queries ("find all IDs between 1000 and 2000"), sorting, or secondary index lookups natively. Use them alongside B-tree or LSM indexes, not instead of them. A B-tree handles arbitrary range predicates; a roaring bitmap handles pre-computed set algebra.
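The first limitation, the mapping layer for non-integer keys, is worth seeing concretely. A minimal sketch of dictionary encoding: assign each new key the next dense 32-bit ID, which avoids hash collisions entirely at the cost of keeping the dictionary in memory. The `IdDictionary` class is a hypothetical helper, not part of any roaring library.

```python
class IdDictionary:
    """Maps arbitrary hashable keys (strings, UUIDs) to dense 32-bit IDs."""

    def __init__(self):
        self._key_to_id = {}
        self._id_to_key = []

    def encode(self, key):
        """Return the existing ID for key, or assign the next dense ID."""
        if key not in self._key_to_id:
            new_id = len(self._id_to_key)
            if new_id > 0xFFFFFFFF:
                raise OverflowError("exceeded the 32-bit ID space")
            self._key_to_id[key] = new_id
            self._id_to_key.append(key)
        return self._key_to_id[key]

    def decode(self, id_):
        """Reverse lookup: dense ID back to the original key."""
        return self._id_to_key[id_]


d = IdDictionary()
segment = set()  # stand-in for a roaring bitmap of 32-bit IDs
for user in ["uuid-a", "uuid-b", "uuid-a"]:
    segment.add(d.encode(user))

print(sorted(segment))  # [0, 1] -- two distinct users, dense IDs
print(d.decode(0))      # uuid-a
```

Dense sequential IDs also keep the bitmap compact: consecutive IDs cluster into the same chunks, which favors the cheaper array and RLE containers.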
Interview cheat sheet
- When asked about computing intersections or unions over large ID sets, mention roaring bitmaps immediately. They are the standard answer for analytics filter queries at scale. Say: "Roaring bitmaps store integer sets in three container types per chunk and execute AND/OR using CPU-native bitwise instructions."
- When comparing roaring bitmaps to bloom filters, clarify that bloom filters answer "is this element in the set?" with false positives, while roaring bitmaps store the exact set and support full set algebra (AND, OR, XOR, difference, cardinality) with zero error. Bloom filters use less memory for membership-only checks.
- When discussing Elasticsearch or Lucene filter performance, state that Lucene has used a roaring-style structure (RoaringDocIdSet) for cached filters since version 5.0; posting lists themselves use delta encoding, but boolean filter queries intersect bitmaps of doc IDs. This is why compound filters in Elasticsearch are fast even on billion-document indexes.
- When asked about memory efficiency, explain the three container types and the 4096-element crossover. Say: "Below 4096 elements, sorted array. At 4096 and above, bitmap. Consecutive runs get RLE. Each chunk picks independently based on its own density."
- When discussing distributed analytics (Druid, ClickHouse), explain that roaring bitmaps are mergeable across nodes because the format is standardized. Each shard builds a partial bitmap; the coordinator merges them with zero-copy concatenation for non-overlapping ranges or container-level OR for overlapping ranges.
- When asked why not just use a hash set, answer with memory and throughput: a hash set for 1M integers costs ~32MB versus ~2MB for a roaring bitmap. Intersection on hash sets requires per-element probing, while bitmap intersection uses CPU bitwise instructions on contiguous memory with SIMD acceleration.
- When the discussion involves user cohorts, A/B testing, or feature flags, describe roaring bitmaps as the standard for "is user X in segment Y?" checks combined with "how many users are in segment X AND segment Y?" cardinality queries. Mention that mutual exclusivity checks between groups are a single AND operation.
- When asked about compression, explain that roaring bitmaps do not apply generic compression (gzip, zstd) but structural compression: choosing the smallest representation per chunk. This means no decompression step before querying. Data is always queryable in its stored form.
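The chunking and bitwise-intersection points above can be sketched in a few lines. This is a toy illustration, not the real format: it splits a 32-bit value into a 16-bit chunk key plus a 16-bit offset, and intersects one dense chunk with a single bitwise AND plus popcount instead of per-element probing.

```python
def split(x):
    """Split a 32-bit value into (chunk key, offset within chunk)."""
    return x >> 16, x & 0xFFFF

def to_bitset(offsets):
    """Pack 16-bit offsets into a 65,536-bit Python integer (toy bitmap container)."""
    bits = 0
    for off in offsets:
        bits |= 1 << off
    return bits

print(split(0x0001F000))  # (1, 61440): chunk 1, offset 0xF000

a = to_bitset([3, 7, 4096, 65535])
b = to_bitset([7, 4096, 9999])

intersection = a & b                 # one bitwise AND covers the whole chunk
print(bin(intersection).count("1"))  # cardinality via popcount -> 2
```

A real implementation does the AND over 1,024 machine words with hardware popcount and SIMD; the structure of the operation is the same.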
Quick recap
- Roaring bitmaps partition 32-bit integers into 65,536-element chunks and independently select the smallest container type (sorted array, bitset, or RLE) for each chunk based on its density.
- The 4096-element crossover between array and bitmap containers is mathematically exact: both representations consume exactly 8,192 bytes at that cardinality.
- Set operations (AND, OR, XOR, difference) work at the container level without full decompression, using hardware popcount and SIMD instructions for bitmap containers.
- The standardized serialization format enables distributed merging: nodes build partial bitmaps independently, and the coordinator merges them with zero-copy concatenation for non-overlapping ranges.
- Frozen (immutable) bitmaps can be memory-mapped from disk, eliminating deserialization overhead and letting the OS page cache manage hot/cold data.
- Use roaring bitmaps when you need set algebra over large integer sets; use bloom filters for probabilistic membership only; use plain arrays for sets under a thousand elements.
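The 4096-element crossover in the recap is plain arithmetic, and it is worth being able to reproduce it on a whiteboard: an array container stores 16-bit values at 2 bytes each, while a bitmap container is a fixed 65,536-bit (8,192-byte) bitset per chunk.

```python
ARRAY_BYTES_PER_ELEMENT = 2   # each stored value is a 16-bit low part
BITMAP_BYTES = 65_536 // 8    # fixed-size bitset covering one chunk

crossover = BITMAP_BYTES // ARRAY_BYTES_PER_ELEMENT
print(crossover)  # 4096: both formats cost exactly 8,192 bytes here

print(4095 * ARRAY_BYTES_PER_ELEMENT, "<", BITMAP_BYTES)  # array wins below
print(4097 * ARRAY_BYTES_PER_ELEMENT, ">", BITMAP_BYTES)  # bitmap wins above
```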
Related concepts
- Bloom filters - Bloom filters answer membership queries probabilistically, while roaring bitmaps store exact sets. Use bloom filters to avoid unnecessary disk reads; use roaring bitmaps for set algebra (intersection, union, difference). In many analytics pipelines, bloom filters serve as the first gate ("does this key possibly exist?") and roaring bitmaps handle the actual set computation.
- HyperLogLog - HyperLogLog estimates cardinality in a small fixed memory budget (~12KB in Redis's implementation) but cannot enumerate elements or perform set intersection. Roaring bitmaps give exact cardinality and full set operations at higher memory cost. Choose HyperLogLog when you only need "how many unique users?" and can tolerate roughly 1% error.
- Databases - Database engines use roaring bitmaps for bitmap indexes, filter caches, and column pruning. Understanding roaring bitmaps explains why Elasticsearch filter queries and Druid segment intersections are fast even at billion-row scale.
- Count-Min Sketch - Like roaring bitmaps, count-min sketches are used in analytics pipelines, but for frequency estimation ("how many times did event X occur?") rather than set membership ("which users saw event X?"). They solve complementary problems and are often used together.