LSM trees
Learn how Log-Structured Merge trees turn random writes into sequential I/O, what compaction costs, and why Cassandra and RocksDB choose LSM over B-trees for write-heavy workloads.
The problem
A write-heavy service needs to handle 100,000 inserts per second. The storage engine uses a B-tree on a spinning disk. Inserting a record requires finding the correct leaf node in the tree, which means a random seek to an arbitrary location on the disk platter.
A spinning disk handles roughly 100 to 200 random I/O operations per second; each random seek takes approximately 5 to 10 milliseconds. At 100,000 writes per second with an average seek time of 10 ms, the workload demands 1,000 seconds of seek time for every second of wall-clock time. The disk saturates at a tiny fraction of the required throughput.
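The gap can be made concrete with back-of-envelope arithmetic, using the numbers from the scenario above:

```python
# Back-of-envelope arithmetic: 100,000 inserts/s on a spinning disk
# at 10 ms per random seek (the figures used in the text).
writes_per_second = 100_000
seek_time_s = 0.010                                   # 10 ms per random seek

seek_time_needed = writes_per_second * seek_time_s    # seconds of seek time needed per wall-clock second
max_random_writes = 1 / seek_time_s                   # sustainable random writes per second

# 1,000 seconds of seek time demanded per second, vs. ~100 sustainable
# random writes/s: the disk is off by three to four orders of magnitude.
```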
NVMe SSDs reduce access latency to 50 to 200 microseconds, but random writes still saturate the device sooner than sequential writes: an NVMe drive delivers roughly 500,000 random 4 KB writes per second, while sequential appends can use its full multi-gigabyte-per-second bandwidth. The fundamental problem is the same: random access patterns forfeit the bandwidth advantage of sequential I/O.
The bottleneck is the access pattern, not the hardware. Random writes force the storage medium to seek. B-trees are designed for fast reads at the cost of random write I/O. For write-heavy workloads, a different data structure is needed: one that converts random writes into sequential I/O.
What an LSM tree is
A Log-Structured Merge tree (LSM tree) converts random writes into sequential writes by never modifying data in place. Instead, it buffers writes in memory, flushes them to immutable sorted files on disk, and periodically merges those files in the background.
Analogy: A busy front desk at a hotel. Guests (writes) arrive continuously. The front desk clerk does not immediately file each guest into the right cabinet drawer (random seek). Instead, the clerk records arrivals in an intake notebook (MemTable, in memory). When the notebook is full, the clerk staples the pages together and files the whole notebook as a labeled bundle (SSTable flush, sequential disk write). Periodically, the back office merges old bundles together to keep the filing room organized (compaction). The tradeoff: checking whether a specific guest is registered (point read) may require scanning multiple bundles.
How it works
The write path always goes through two stages before touching disk: an append to the write-ahead log (WAL) for durability, then an insert into the in-memory MemTable. The read path cascades through levels from newest to oldest: the MemTable first, then every L0 SSTable, then at most one SSTable per level from L1 down.
Pseudocode for write and point lookup:
```
// Write path
function write(key, value):
    wal.append({ key, value, seq: next_seq() })    // crash durability first
    memtable.insert(key, value)                    // sorted in-memory insert
    if memtable.size() > MEMTABLE_FLUSH_THRESHOLD:
        flush_to_l0(memtable)                      // sequential disk write
        memtable = new SkipList()

// Flush MemTable to L0 SSTable (sequential)
function flush_to_l0(mt):
    file = new_sequential_file("l0-" + timestamp())
    for [key, value] in mt.ascending_iterator():
        file.append(encode(key, value))            // sequential, fast
    file.sync()

// Point lookup
function read(key):
    if val = memtable.get(key): return val         // newest data first
    for sstable in l0_tables.newest_first():       // ALL L0 tables (overlapping ranges)
        if sstable.bloom_filter.might_contain(key):
            if val = sstable.binary_search(key): return val
    for level in [L1, L2, L3, ...]:                // at most one SSTable per level
        sstable = level.find_sstable_for_key(key)
        if sstable != null and sstable.bloom_filter.might_contain(key):
            if val = sstable.binary_search(key): return val
    return NOT_FOUND
```
Every write is an append: first to the WAL (sequential on disk), then to the MemTable (in memory). Disk writes only happen during MemTable flushing and compaction, both of which are sequential. This is why LSM trees achieve write throughput that B-trees cannot match on the same hardware.
The write path: WAL and MemTable
The WAL (Write-Ahead Log) is written before the MemTable is updated. If the server crashes after the WAL write but before the MemTable flush, the MemTable is rebuilt from the WAL on restart. No committed write is lost.
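The recovery mechanism can be sketched in a few lines. This is an illustrative model, not any real engine's API: the WAL is a list standing in for an append-only file, and the record format is assumed JSON lines.

```python
import json

# Minimal sketch of WAL-based crash recovery (hypothetical names, not a real
# engine's API). Each committed write is appended to the log before the
# MemTable is updated; replaying the log in order rebuilds the MemTable.

def wal_append(wal, key, value, seq):
    wal.append(json.dumps({"key": key, "value": value, "seq": seq}))

def recover_memtable(wal):
    memtable = {}
    for line in wal:                               # replay in append order
        record = json.loads(line)
        memtable[record["key"]] = record["value"]  # later entries overwrite earlier ones
    return memtable

wal = []                                           # stands in for an append-only file
wal_append(wal, "user:1", "alice", 1)
wal_append(wal, "user:2", "bob", 2)
wal_append(wal, "user:1", "alice-v2", 3)           # update: newer entry wins on replay

memtable = recover_memtable(wal)
```

Because replay is ordered by append position, the rebuilt MemTable ends in exactly the state it had at the moment of the crash.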
The MemTable is a sorted in-memory data structure, typically a skip list or a red-black tree. Both support O(log n) inserts and ordered iteration. Ordered iteration is required because SSTables store keys in sorted order; the flush operation requires iterating the MemTable in ascending key order.
When the MemTable reaches the configured size threshold (commonly 64 MB or 128 MB), it is marked as immutable and a new empty MemTable is created for incoming writes. The immutable MemTable is flushed to a new L0 SSTable file in the background. Writes continue uninterrupted on the new MemTable while the flush proceeds.
The flush is always a sequential append to a new file. No existing SSTable is modified. Because writes only append to the WAL and insert into a memory structure, the write path never performs a random disk seek. This is the source of the write throughput advantage over B-trees.
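The flush path above can be sketched minimally. Real engines use a skip list so inserts stay sorted under concurrent access; here a plain dict sorted at flush time stands in for it, and the threshold counts entries rather than bytes to keep the example tiny.

```python
# Sketch of the MemTable flush path. A dict sorted at flush time stands in
# for a skip list; the threshold is in entries, not bytes, for brevity.
MEMTABLE_FLUSH_THRESHOLD = 3

class MemTable:
    def __init__(self):
        self.data = {}

    def insert(self, key, value):
        self.data[key] = value

    def should_flush(self):
        return len(self.data) >= MEMTABLE_FLUSH_THRESHOLD

    def flush(self):
        # Emit key/value pairs in ascending key order: this is the sequential
        # append that becomes an immutable L0 SSTable on disk.
        sstable = [(k, self.data[k]) for k in sorted(self.data)]
        self.data = {}                 # a fresh MemTable takes new writes
        return sstable

mt = MemTable()
for k, v in [("b", 2), ("a", 1), ("c", 3)]:
    mt.insert(k, v)

l0_sstable = mt.flush() if mt.should_flush() else []
```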
MemTable size is the first compaction tuning lever
A larger MemTable means fewer L0 files flushed per second and more time for compaction to keep pace between flushes. RocksDB's write_buffer_size defaults to 64 MB but is commonly tuned to 256-512 MB on write-heavy deployments. Doubling MemTable size halves L0 flush frequency and directly reduces compaction pressure. The tradeoff: more in-RAM data to rebuild from the WAL on crash recovery, slightly increasing restart time after an unclean shutdown.
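The flush-interval arithmetic behind this advice is straightforward; the 100 MB/s ingest rate below is an assumed example, not a measured figure.

```python
# Flush-interval arithmetic for MemTable sizing (assumed ingest rate).
ingest_mb_per_s = 100                              # hypothetical sustained write rate

def seconds_between_flushes(memtable_mb):
    return memtable_mb / ingest_mb_per_s

default_interval = seconds_between_flushes(64)     # RocksDB default write_buffer_size
tuned_interval = seconds_between_flushes(512)      # common write-heavy tuning

# 512 MB flushes 8x less often than 64 MB, so compaction sees 8x fewer L0 files.
```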
Read amplification and Bloom filters
A point read in an LSM tree may touch multiple SSTables before finding the target key (or confirming it does not exist). This is called read amplification.
Level 0 is the worst case for reads: L0 SSTables are flushed directly from MemTables with no range partitioning, so their key ranges overlap. A key might exist in any L0 SSTable. Every L0 file must be checked. With 10 L0 SSTables and no Bloom filters, a point read performs 10 binary searches before moving to L1.
Level 1 and above are sorted with non-overlapping key ranges per level. Only one SSTable per level can contain a given key, so at most one binary search per level is needed. Without Bloom filters, a read that misses all levels still requires one binary search per level: 1 per L1, 1 per L2, and so on.
Bloom filters are the critical optimization. Each SSTable carries a compact Bloom filter over its keys. Before performing a binary search (which requires disk I/O), the read path checks the Bloom filter. A negative answer ("this key is definitely not in this SSTable") eliminates the I/O entirely. A positive answer ("this key might be in this SSTable") proceeds to the binary search. With typical Bloom filter parameters (1% false positive rate), roughly 99% of miss I/Os are eliminated.
Without Bloom filters, read amplification for a key that does not exist in the database equals the total number of SSTables. With Bloom filters at a 1% false positive rate, almost all of those lookups are answered in memory by the filter check; only the occasional false positive triggers a disk-side binary search, regardless of how many SSTables exist per level.
RocksDB uses Bloom filters on every SSTable by default. Cassandra uses Bloom filters and allows tuning the false positive rate via the bloom_filter_fp_chance table option (default 0.01).
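The filter mechanics can be sketched in a few lines. This is illustrative only: real engines use fast non-cryptographic hashes (not SHA-256 as below) and size the bit array from the target false positive rate.

```python
import hashlib

# Minimal Bloom filter sketch: k hash functions derived from salted SHA-256.
# Illustrative only; production filters use faster hashes and tuned sizing.

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                  # bit array packed into one integer

    def _positions(self, key):
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means definitely absent (skip the SSTable entirely);
        # True means "maybe present" (a false positive is possible).
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
for key in ["user:1", "user:7", "user:42"]:
    bf.add(key)
```

A `might_contain` returning False is authoritative: Bloom filters never produce false negatives, which is why a negative check safely skips the file.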
The level structure
Data flows through levels as compaction runs. Each level is roughly 10x the size of the previous, which is why write amplification from leveled compaction is approximately L × size_ratio. In leveled compaction, every byte written to L0 may be rewritten once per level transition before reaching the deepest level.
The key asymmetry: L0 SSTables have overlapping key ranges and require all files to be checked on every read. L1 and above are non-overlapping, so only one file per level needs to be checked. Getting data out of L0 and into L1 is always the top-priority compaction in both RocksDB and Cassandra.
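The "one binary search per level" property at L1+ follows from the non-overlapping ranges: the candidate file can be found by binary-searching the list of smallest keys. A sketch, with an assumed `(smallest_key, largest_key, table_id)` layout rather than any real engine's metadata format:

```python
import bisect

# Sketch of per-level SSTable lookup at L1+. Because key ranges within a level
# are sorted and non-overlapping, at most one file can contain a given key.

def find_sstable_for_key(level, key):
    # level: list of (smallest_key, largest_key, table_id), sorted by smallest_key
    smallest_keys = [t[0] for t in level]
    i = bisect.bisect_right(smallest_keys, key) - 1
    if i >= 0 and level[i][0] <= key <= level[i][1]:
        return level[i][2]             # the only file that can hold this key
    return None                        # key falls in a gap: definitely absent at this level

l1 = [("a", "f", "sst-1"), ("g", "m", "sst-2"), ("n", "z", "sst-3")]
```

Contrast with L0, where every file's range may overlap and all of them must be checked.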
Compaction strategies
L0 SSTables accumulate as MemTables flush. Without compaction, L0 grows unbounded, reads must scan every L0 file, and deleted keys ("tombstones") consume space forever. Compaction merges SSTables, removes tombstones, and reorganizes data to reduce read amplification.
Two strategies dominate production deployments:
Size-tiered compaction merges SSTables of similar size. When enough small SSTables accumulate (typically 4), they are merged into one larger SSTable, which joins the next size tier. The strategy is simple to implement and produces low write amplification, because each byte is merged infrequently. The downside: during a merge, the old SSTables and the new merged SSTable exist simultaneously, temporarily doubling space usage. Space amplification can reach 2x or more.
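The trigger logic can be sketched as follows. Bucketing by order of magnitude is a simplification of the real heuristic (Cassandra's STCS buckets by average size within configurable bounds), and the sizes are made-up examples:

```python
import math

# Sketch of size-tiered candidate selection: bucket SSTables into size classes
# and merge a class once it accumulates min_threshold files (4 is Cassandra's
# default). Bucketing by order of magnitude is a simplification. Sizes in MB.

def pick_merge_candidates(sstable_sizes, min_threshold=4):
    tiers = {}
    for size in sstable_sizes:
        tier = int(math.log10(size))   # crude size class: 1-9 MB, 10-99 MB, ...
        tiers.setdefault(tier, []).append(size)
    for tier, members in sorted(tiers.items()):
        if len(members) >= min_threshold:
            return members             # merge these into one larger SSTable
    return []                          # no tier is full yet; wait for more flushes

candidates = pick_merge_candidates([6, 7, 5, 8, 120, 95])
```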
Leveled compaction organizes SSTables into levels with a strict size cap per level. Level 1 might be capped at 256 MB, Level 2 at 2.5 GB (10x), Level 3 at 25 GB (10x again). Within each level, key ranges are strictly non-overlapping. When a level exceeds its cap, compaction picks an SSTable from that level, finds all SSTables in the next level whose key range overlaps, and merges them into a new set for the next level. The old SSTables are deleted. Space amplification stays low because level caps enforce a bounded ratio. Write amplification is higher because data may be compacted through many level transitions before reaching the final level.
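The merge step at the heart of both strategies is a k-way merge of sorted files, keeping only the newest version of each key. A sketch, with the caveat that real engines only drop tombstones when compacting into the deepest level (an older version might still exist below):

```python
import heapq

# Sketch of the compaction merge: k-way merge of sorted SSTables, keeping only
# the newest version of each key. Tables are passed newest-first; TOMBSTONE
# marks a deletion. Real engines drop tombstones only at the deepest level.

TOMBSTONE = object()

def compact(tables_newest_first):
    merged = []
    seen = set()
    # Tag entries with the table's age so ties on key resolve newest-first.
    tagged = [[(key, age, value) for key, value in table]
              for age, table in enumerate(tables_newest_first)]
    for key, _age, value in heapq.merge(*tagged):
        if key in seen:
            continue                   # older version of an already-emitted key
        seen.add(key)
        if value is not TOMBSTONE:     # drop deletions during the merge
            merged.append((key, value))
    return merged

l0_table = [("a", "new-a"), ("c", TOMBSTONE)]                 # newest
l1_table = [("a", "old-a"), ("b", "b-val"), ("c", "old-c")]   # older

new_l1 = compact([l0_table, l1_table])
```

The output is itself sorted, so it can be written as a new SSTable with a single sequential pass.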
| Strategy | Write amplification | Read amplification | Space amplification | Best for |
|---|---|---|---|---|
| Size-tiered | Low (4-10x typical) | High without Bloom filters | High (up to 2x per tier during compaction) | Write-heavy workloads, time-series data, append-only patterns |
| Leveled | High (10-30x typical) | Low (1 file per level for point reads) | Low (bounded by level size ratios) | Read-heavy workloads, balanced read/write, small working set |
| FIFO | Near 1x (oldest files deleted) | Grows with dataset size | Low | Time-series where oldest data is simply discarded |
RocksDB defaults to leveled compaction and is tuned for read performance. Cassandra defaults to size-tiered compaction and is tuned for write throughput. Both expose their compaction strategy as a configuration option.
Write amplification
Write amplification is the ratio of bytes written to disk by the storage engine to bytes written by the application. An application writes 1 GB; the storage engine writes 10 GB to disk. Write amplification is 10x.
In an LSM tree, write amplification has two sources:
- MemTable flush. Every byte written by the application is also written once to the WAL and once when the MemTable is flushed to L0. Minimum write amplification is approximately 2x before compaction.
- Compaction. Data written into L0 gets merged into L1, then L1 into L2, and so on. In leveled compaction with a size multiplier of 10 across L levels, each byte may be rewritten once per level transition, for a total compaction write amplification of approximately L × size_ratio. For 5 levels with a ratio of 10, that is roughly 50x across the full compaction chain in the worst case. RocksDB's actual measured write amplification in production is typically 10 to 30x.
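The arithmetic above as a worked example:

```python
# Worked write-amplification estimate using the figures from the text.
levels = 5
size_ratio = 10

worst_case_compaction_wa = levels * size_ratio   # rewritten once per level transition
flush_wa = 2                                     # one WAL append + one MemTable flush

# At a typical measured 30x, 1 GB/s of application writes becomes 30 GB/s of disk I/O.
app_gb_per_s = 1
disk_gb_per_s = app_gb_per_s * 30
```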
This is a real I/O cost. Disk bandwidth is shared between application writes and compaction rewrites, so write amplification directly consumes sequential-write headroom. At 30x write amplification, 1 GB/s of application write load generates 30 GB/s of disk I/O, which exceeds what even high-end NVMe hardware can sustain.
Write amplification also accelerates SSD wear. SSD cells have a finite number of write cycles. At 30x write amplification, the SSD wears out 30x faster than the application write rate suggests. This is why hardware sizing for RocksDB and Cassandra deployments must account for write amplification, not raw application throughput.
Write amplification must be included in your hardware budget
At 30x write amplification, an SSD's endurance rating in petabytes written translates to application-write petabytes divided by 30: a drive rated for 1 PB of total writes can absorb only ~33 TB of application writes before its cells approach exhaustion. Measure actual write amplification with the rocksdb.write-amplification statistic before sizing hardware. This metric is routinely 5-10x higher than engineers expect when first deploying an LSM-based system at load.
Tuning levers: reducing the L0-to-L1 size ratio reduces write amplification at the cost of higher read amplification. Increasing the MemTable size reduces L0 flush frequency and flush write amplification. These are the parameters most commonly adjusted in RocksDB production tuning.
Production usage
| System | LSM usage | Notable behavior |
|---|---|---|
| RocksDB (Meta, LinkedIn, Kafka Streams) | Leveled compaction by default; Bloom filters on every SSTable | Provides db_bench for write amplification measurement; widely tuned for NVMe SSDs; used as storage engine inside MySQL (MyRocks) and TiKV |
| Apache Cassandra | Size-tiered compaction by default; TWCS (Time Window Compaction Strategy) for time-series | Each node runs independent compaction; compaction is single-threaded per table by default; bloom_filter_fp_chance tunes Bloom filter size per table |
| LevelDB (Google Chrome, IndexedDB) | Leveled compaction; original implementation by Sanjay Ghemawat and Jeff Dean | Simpler than RocksDB (no column families, no WAL group commit); used for browser-local storage where write throughput matters more than read latency |
| ScyllaDB | Cassandra-compatible; rewritten in C++ with per-shard compaction | Avoids compaction pauses by running compaction on a dedicated shard per CPU core; Bloom filters loaded into huge pages for faster access |
Limitations
- Read amplification without Bloom filters is severe. A point read that misses the MemTable must check every L0 SSTable and one SSTable per level. Without Bloom filters, missing keys that do not exist in the database are the worst case: every SSTable is checked, every check hits disk.
- Compaction causes write I/O bursts. Background compaction competes with application writes and reads for disk bandwidth. RocksDB implements compaction rate limiting (rate_limiter) to prevent compaction from starving user queries, but bursts remain a production operations concern.
- Space amplification during compaction. Size-tiered compaction temporarily doubles storage during a merge: the old SSTables and the new merged SSTable coexist until the merge completes. Systems without headroom for this expansion will see disk-full failures mid-compaction.
- Write stalls when L0 fills faster than compaction. If the write rate exceeds compaction throughput, the L0 SSTable count grows unchecked. RocksDB implements configurable write slowdowns and hard stops when the L0 file count exceeds thresholds (level0_slowdown_writes_trigger and level0_stop_writes_trigger) to prevent unbounded L0 growth.
- Not ideal for workloads where random point reads dominate. B-trees remain superior for read-heavy workloads with sparse key distributions. LSM's read overhead is most visible in applications that do many small random point reads and few writes.
When to use LSM trees vs. B-trees
For most OLTP workloads with moderate write rates and frequent point reads, a B-tree engine is simpler, easier to operate, and offers more predictable latency. Reach for LSM when sustained write throughput is genuinely the bottleneck, or when you need the operational model of Cassandra or RocksDB.
Interview cheat sheet
- When asked how LSM trees work: buffers writes in a MemTable (in-memory sorted structure), WAL for durability, flushes to L0 SSTables (sequential disk writes), background compaction merges levels. All disk writes are sequential; no random seeks on the write path.
- When asked why LSM trees are fast for writes: every application write is an in-memory operation plus a sequential WAL append. Disk seeks only occur during background compaction and MemTable flush, both of which are sequential.
- When asked about the read path: check MemTable first, then all L0 SSTables (overlapping ranges), then one SSTable per L1+ level (non-overlapping). Bloom filters eliminate most disk I/Os for keys that do not match.
- When asked about Bloom filters in LSM context: each SSTable has a Bloom filter. Before any binary search (disk I/O), the Bloom filter is checked. A definite negative skips the file entirely. Without Bloom filters, every read that misses the database requires checking every SSTable. This is the single most important optimization in production LSM deployments.
- When asked to compare size-tiered vs leveled compaction: size-tiered has lower write amplification but higher space amplification (temporary 2x during merges) and higher read amplification. Leveled has lower read and space amplification but higher write amplification. Cassandra defaults to size-tiered; RocksDB defaults to leveled.
- When asked about write amplification: each byte written by the application may be rewritten 10 to 30 times by compaction over its lifetime. At 30x write amplification, 1 GB/s application writes generate 30 GB/s disk I/O. This is the primary reason LSM hardware sizing must budget for write amplification.
- When asked about write stalls: if the application writes faster than compaction processes L0 files, L0 grows unbounded. RocksDB introduces configurable slowdowns and hard stops at L0 file count thresholds. This is a common production incident pattern in write-heavy deployments.
- When choosing between LSM and B-tree for a workload: LSM wins for write-heavy, time-series, and append-dominated patterns. B-tree wins for read-heavy, random-access, and OLTP patterns where point reads dominate and write rate is moderate. Knowing both and naming the tradeoff factors (write amplification, read amplification, space amplification) is the interview signal.
Quick recap
- LSM trees convert random writes into sequential writes by buffering in a MemTable, appending to a WAL for durability, and flushing to immutable sorted SSTables only when the MemTable is full.
- The write path never performs a random disk seek; all disk writes are sequential appends to the WAL or new SSTable files.
- The read path cascades from MemTable to L0 (all files, overlapping ranges) to L1 and beyond (one file per level), making reads more expensive than in a B-tree without Bloom filters.
- Bloom filters eliminate most disk I/Os for misses by confirming definitively that a key is absent from an SSTable before any binary search; they are the critical optimization that makes LSM reads practical.
- Compaction merges SSTables, removes tombstones, and reorganizes levels; size-tiered compaction minimizes write amplification at the cost of space amplification, while leveled compaction minimizes read and space amplification at the cost of higher write amplification.
- LSM trees are the right choice for write-heavy workloads (Cassandra, RocksDB, LevelDB); for read-heavy workloads with moderate write rates, B-trees remain the more predictable choice.
Related concepts
- Write-ahead log: The WAL in an LSM tree is the same mechanism as in a B-tree database; understanding WAL explains why the MemTable can be RAM-only while writes are still durable across crashes.
- Bloom filters: Bloom filters are the companion optimization that makes LSM reads practical; without them, every miss-case read must touch every SSTable in the database.
- B-tree indexes: The B-tree is the primary alternative to LSM; understanding both enables the tradeoff analysis (write amplification vs. read amplification vs. space amplification) that is central to storage engine selection.
- Databases: LSM and B-tree are the two dominant storage engine architectures; the concepts article on databases explains when each surfaces in system design conversations and which production systems use which engine.