LSM trees
Learn how Log-Structured Merge trees turn random writes into sequential I/O, what compaction costs, and why Cassandra and RocksDB choose LSM over B-trees for write-heavy workloads.
The problem
A write-heavy service needs to handle 100,000 inserts per second. The storage engine uses a B-tree on a spinning disk. Inserting a record requires finding the correct leaf node in the tree, which means a random seek to an arbitrary location on the disk platter.
A spinning disk handles roughly 100 to 200 random I/O operations per second; each random seek takes approximately 5 to 10 milliseconds. At 100,000 writes per second with an average seek time of 10 ms, the workload demands 1,000 seconds of seek time for every second of wall-clock time. The disk saturates at a tiny fraction of the required throughput.
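The gap can be made concrete with back-of-envelope arithmetic, using the numbers from the scenario above:

```python
# Back-of-envelope arithmetic: 100,000 inserts/s on a spinning disk
# at 10 ms per random seek (the figures used in the text).
writes_per_second = 100_000
seek_time_s = 0.010                                   # 10 ms per random seek

seek_time_needed = writes_per_second * seek_time_s    # seconds of seek time needed per wall-clock second
max_random_writes = 1 / seek_time_s                   # sustainable random writes per second

# 1,000 seconds of seek time demanded per second, vs. ~100 sustainable
# random writes/s: the disk is off by three to four orders of magnitude.
```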
NVMe SSDs reduce access latency to 50 to 200 microseconds, but random writes still saturate the device sooner than sequential writes: an NVMe drive delivers roughly 500,000 random 4 KB writes per second, while sequential appends can use its full multi-gigabyte-per-second bandwidth. The fundamental problem is the same: random access patterns forfeit the bandwidth advantage of sequential I/O.
The bottleneck is the access pattern, not the hardware. Random writes force the storage medium to seek. B-trees are designed for fast reads at the cost of random write I/O. For write-heavy workloads, a different data structure is needed: one that converts random writes into sequential I/O.
What an LSM tree is
A Log-Structured Merge tree (LSM tree) converts random writes into sequential writes by never modifying data in place. Instead, it buffers writes in memory, flushes them to immutable sorted files on disk, and periodically merges those files in the background.
Analogy: A busy front desk at a hotel. Guests (writes) arrive continuously. The front desk clerk does not immediately file each guest into the right cabinet drawer (random seek). Instead, the clerk records arrivals in an intake notebook (MemTable, in memory). When the notebook is full, the clerk staples the pages together and files the whole notebook as a labeled bundle (SSTable flush, sequential disk write). Periodically, the back office merges old bundles together to keep the filing room organized (compaction). The tradeoff: checking whether a specific guest is registered (point read) may require scanning multiple bundles.
How it works
The write path always goes through two stages before touching disk: an append to the write-ahead log (WAL) for durability, then an insert into the in-memory MemTable. The read path cascades through levels from newest to oldest: the MemTable first, then every L0 SSTable, then at most one SSTable per level from L1 down.
Pseudocode for write and point lookup:
```
// Write path
function write(key, value):
    wal.append({ key, value, seq: next_seq() })    // crash durability first
    memtable.insert(key, value)                    // sorted in-memory insert
    if memtable.size() > MEMTABLE_FLUSH_THRESHOLD:
        flush_to_l0(memtable)                      // sequential disk write
        memtable = new SkipList()

// Flush MemTable to L0 SSTable (sequential)
function flush_to_l0(mt):
    file = new_sequential_file("l0-" + timestamp())
    for [key, value] in mt.ascending_iterator():
        file.append(encode(key, value))            // sequential, fast
    file.sync()

// Point lookup
function read(key):
    if val = memtable.get(key): return val         // newest data first
    for sstable in l0_tables.newest_first():       // ALL L0 tables (overlapping ranges)
        if sstable.bloom_filter.might_contain(key):
            if val = sstable.binary_search(key): return val
    for level in [L1, L2, L3, ...]:                // at most one SSTable per level
        sstable = level.find_sstable_for_key(key)
        if sstable != null and sstable.bloom_filter.might_contain(key):
            if val = sstable.binary_search(key): return val
    return NOT_FOUND
```
Every write is an append: first to the WAL (sequential on disk), then to the MemTable (in memory). Disk writes only happen during MemTable flushing and compaction, both of which are sequential. This is why LSM trees achieve write throughput that B-trees cannot match on the same hardware.
The write path: WAL and MemTable
The WAL (Write-Ahead Log) is written before the MemTable is updated. If the server crashes after the WAL write but before the MemTable flush, the MemTable is rebuilt from the WAL on restart. No committed write is lost.
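The recovery mechanism can be sketched in a few lines. This is an illustrative model, not any real engine's API: the WAL is a list standing in for an append-only file, and the record format is assumed JSON lines.

```python
import json

# Minimal sketch of WAL-based crash recovery (hypothetical names, not a real
# engine's API). Each committed write is appended to the log before the
# MemTable is updated; replaying the log in order rebuilds the MemTable.

def wal_append(wal, key, value, seq):
    wal.append(json.dumps({"key": key, "value": value, "seq": seq}))

def recover_memtable(wal):
    memtable = {}
    for line in wal:                               # replay in append order
        record = json.loads(line)
        memtable[record["key"]] = record["value"]  # later entries overwrite earlier ones
    return memtable

wal = []                                           # stands in for an append-only file
wal_append(wal, "user:1", "alice", 1)
wal_append(wal, "user:2", "bob", 2)
wal_append(wal, "user:1", "alice-v2", 3)           # update: newer entry wins on replay

memtable = recover_memtable(wal)
```

Because replay is ordered by append position, the rebuilt MemTable ends in exactly the state it had at the moment of the crash.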
The MemTable is a sorted in-memory data structure, typically a skip list or a red-black tree. Both support O(log n) inserts and ordered iteration. Ordered iteration is required because SSTables store keys in sorted order; the flush operation requires iterating the MemTable in ascending key order.
When the MemTable reaches the configured size threshold (commonly 64 MB or 128 MB), it is marked as immutable and a new empty MemTable is created for incoming writes. The immutable MemTable is flushed to a new L0 SSTable file in the background. Writes continue uninterrupted on the new MemTable while the flush proceeds.
The flush is always a sequential append to a new file. No existing SSTable is modified. Because writes only append to the WAL and insert into a memory structure, the write path never performs a random disk seek. This is the source of the write throughput advantage over B-trees.
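The flush path above can be sketched minimally. Real engines use a skip list so inserts stay sorted under concurrent access; here a plain dict sorted at flush time stands in for it, and the threshold counts entries rather than bytes to keep the example tiny.

```python
# Sketch of the MemTable flush path. A dict sorted at flush time stands in
# for a skip list; the threshold is in entries, not bytes, for brevity.
MEMTABLE_FLUSH_THRESHOLD = 3

class MemTable:
    def __init__(self):
        self.data = {}

    def insert(self, key, value):
        self.data[key] = value

    def should_flush(self):
        return len(self.data) >= MEMTABLE_FLUSH_THRESHOLD

    def flush(self):
        # Emit key/value pairs in ascending key order: this is the sequential
        # append that becomes an immutable L0 SSTable on disk.
        sstable = [(k, self.data[k]) for k in sorted(self.data)]
        self.data = {}                 # a fresh MemTable takes new writes
        return sstable

mt = MemTable()
for k, v in [("b", 2), ("a", 1), ("c", 3)]:
    mt.insert(k, v)

l0_sstable = mt.flush() if mt.should_flush() else []
```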
MemTable size is the first compaction tuning lever
A larger MemTable means fewer L0 files flushed per second and more time for compaction to keep pace between flushes. RocksDB's write_buffer_size defaults to 64 MB but is commonly tuned to 256-512 MB on write-heavy deployments. Doubling MemTable size halves L0 flush frequency and directly reduces compaction pressure. The tradeoff: more in-RAM data to rebuild from the WAL on crash recovery, slightly increasing restart time after an unclean shutdown.
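The flush-interval arithmetic behind this advice is straightforward; the 100 MB/s ingest rate below is an assumed example, not a measured figure.

```python
# Flush-interval arithmetic for MemTable sizing (assumed ingest rate).
ingest_mb_per_s = 100                              # hypothetical sustained write rate

def seconds_between_flushes(memtable_mb):
    return memtable_mb / ingest_mb_per_s

default_interval = seconds_between_flushes(64)     # RocksDB default write_buffer_size
tuned_interval = seconds_between_flushes(512)      # common write-heavy tuning

# 512 MB flushes 8x less often than 64 MB, so compaction sees 8x fewer L0 files.
```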
Read amplification and Bloom filters
A point read in an LSM tree may touch multiple SSTables before finding the target key (or confirming it does not exist). This is called read amplification.
Level 0 is the worst case for reads: L0 SSTables are flushed directly from MemTables with no range partitioning, so their key ranges overlap. A key might exist in any L0 SSTable. Every L0 file must be checked. With 10 L0 SSTables and no Bloom filters, a point read performs 10 binary searches before moving to L1.
Level 1 and above are sorted with non-overlapping key ranges per level. Only one SSTable per level can contain a given key, so at most one binary search per level is needed. Without Bloom filters, a read that misses all levels still requires one binary search per level: 1 per L1, 1 per L2, and so on.
Bloom filters are the critical optimization. Each SSTable carries a compact Bloom filter over its keys. Before performing a binary search (which requires disk I/O), the read path checks the Bloom filter. A negative answer ("this key is definitely not in this SSTable") eliminates the I/O entirely. A positive answer ("this key might be in this SSTable") proceeds to the binary search. With typical Bloom filter parameters (1% false positive rate), roughly 99% of miss I/Os are eliminated.
Without Bloom filters, read amplification for a key that does not exist in the database equals the total number of SSTables. With Bloom filters at a 1% false positive rate, almost all of those lookups are answered in memory by the filter check; only the occasional false positive triggers a disk-side binary search, regardless of how many SSTables exist per level.
RocksDB uses Bloom filters on every SSTable by default. Cassandra uses Bloom filters and allows tuning the false positive rate via the bloom_filter_fp_chance table option (default 0.01).
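The filter mechanics can be sketched in a few lines. This is illustrative only: real engines use fast non-cryptographic hashes (not SHA-256 as below) and size the bit array from the target false positive rate.

```python
import hashlib

# Minimal Bloom filter sketch: k hash functions derived from salted SHA-256.
# Illustrative only; production filters use faster hashes and tuned sizing.

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                  # bit array packed into one integer

    def _positions(self, key):
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means definitely absent (skip the SSTable entirely);
        # True means "maybe present" (a false positive is possible).
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
for key in ["user:1", "user:7", "user:42"]:
    bf.add(key)
```

A `might_contain` returning False is authoritative: Bloom filters never produce false negatives, which is why a negative check safely skips the file.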
The level structure
Data flows through levels as compaction runs. Each level is roughly 10x the size of the previous, which is why write amplification from leveled compaction is approximately L × size_ratio. In leveled compaction, every byte written to L0 may be rewritten once per level transition before reaching the deepest level.
The key asymmetry: L0 SSTables have overlapping key ranges and require all files to be checked on every read. L1 and above are non-overlapping, so only one file per level needs to be checked. Getting data out of L0 and into L1 is always the top-priority compaction in both RocksDB and Cassandra.
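The "one binary search per level" property at L1+ follows from the non-overlapping ranges: the candidate file can be found by binary-searching the list of smallest keys. A sketch, with an assumed `(smallest_key, largest_key, table_id)` layout rather than any real engine's metadata format:

```python
import bisect

# Sketch of per-level SSTable lookup at L1+. Because key ranges within a level
# are sorted and non-overlapping, at most one file can contain a given key.

def find_sstable_for_key(level, key):
    # level: list of (smallest_key, largest_key, table_id), sorted by smallest_key
    smallest_keys = [t[0] for t in level]
    i = bisect.bisect_right(smallest_keys, key) - 1
    if i >= 0 and level[i][0] <= key <= level[i][1]:
        return level[i][2]             # the only file that can hold this key
    return None                        # key falls in a gap: definitely absent at this level

l1 = [("a", "f", "sst-1"), ("g", "m", "sst-2"), ("n", "z", "sst-3")]
```

Contrast with L0, where every file's range may overlap and all of them must be checked.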
Compaction strategies
L0 SSTables accumulate as MemTables flush. Without compaction, L0 grows unbounded, reads must scan every L0 file, and deleted keys ("tombstones") consume space forever. Compaction merges SSTables, removes tombstones, and reorganizes data to reduce read amplification.
Two strategies dominate production deployments:
Size-tiered compaction merges SSTables of similar size. When enough small SSTables accumulate (typically 4), they are merged into one larger SSTable, which joins the next size tier. The strategy is simple to implement and produces low write amplification, because each byte is merged infrequently. The downside: during a merge, the old SSTables and the new merged SSTable exist simultaneously, temporarily doubling space usage. Space amplification can reach 2x or more.
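The trigger logic can be sketched as follows. Bucketing by order of magnitude is a simplification of the real heuristic (Cassandra's STCS buckets by average size within configurable bounds), and the sizes are made-up examples:

```python
import math

# Sketch of size-tiered candidate selection: bucket SSTables into size classes
# and merge a class once it accumulates min_threshold files (4 is Cassandra's
# default). Bucketing by order of magnitude is a simplification. Sizes in MB.

def pick_merge_candidates(sstable_sizes, min_threshold=4):
    tiers = {}
    for size in sstable_sizes:
        tier = int(math.log10(size))   # crude size class: 1-9 MB, 10-99 MB, ...
        tiers.setdefault(tier, []).append(size)
    for tier, members in sorted(tiers.items()):
        if len(members) >= min_threshold:
            return members             # merge these into one larger SSTable
    return []                          # no tier is full yet; wait for more flushes

candidates = pick_merge_candidates([6, 7, 5, 8, 120, 95])
```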
Leveled compaction organizes SSTables into levels with a strict size cap per level. Level 1 might be capped at 256 MB, Level 2 at 2.5 GB (10x), Level 3 at 25 GB (10x again). Within each level, key ranges are strictly non-overlapping. When a level exceeds its cap, compaction picks an SSTable from that level, finds all SSTables in the next level whose key range overlaps, and merges them into a new set for the next level. The old SSTables are deleted. Space amplification stays low because level caps enforce a bounded ratio. Write amplification is higher because data may be compacted through many level transitions before reaching the final level.
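The merge step at the heart of both strategies is a k-way merge of sorted files, keeping only the newest version of each key. A sketch, with the caveat that real engines only drop tombstones when compacting into the deepest level (an older version might still exist below):

```python
import heapq

# Sketch of the compaction merge: k-way merge of sorted SSTables, keeping only
# the newest version of each key. Tables are passed newest-first; TOMBSTONE
# marks a deletion. Real engines drop tombstones only at the deepest level.

TOMBSTONE = object()

def compact(tables_newest_first):
    merged = []
    seen = set()
    # Tag entries with the table's age so ties on key resolve newest-first.
    tagged = [[(key, age, value) for key, value in table]
              for age, table in enumerate(tables_newest_first)]
    for key, _age, value in heapq.merge(*tagged):
        if key in seen:
            continue                   # older version of an already-emitted key
        seen.add(key)
        if value is not TOMBSTONE:     # drop deletions during the merge
            merged.append((key, value))
    return merged

l0_table = [("a", "new-a"), ("c", TOMBSTONE)]                 # newest
l1_table = [("a", "old-a"), ("b", "b-val"), ("c", "old-c")]   # older

new_l1 = compact([l0_table, l1_table])
```

The output is itself sorted, so it can be written as a new SSTable with a single sequential pass.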
| Strategy | Write amplification | Read amplification | Space amplification | Best for |
|---|---|---|---|---|
| Size-tiered | Low (4-10x typical) | High without Bloom filters | High (up to 2x per tier during compaction) | Write-heavy workloads, time-series data, append-only patterns |
| Leveled | High (10-30x typical) | Low (1 file per level for point reads) | Low (bounded by level size ratios) | Read-heavy workloads, balanced read/write, small working set |
| FIFO | Near 1x (oldest files deleted) | Grows with dataset size | Low | Time-series where oldest data is simply discarded |
RocksDB defaults to leveled compaction and is tuned for read performance. Cassandra defaults to size-tiered compaction and is tuned for write throughput. Both expose their compaction strategy as a configuration option.
Write amplification
Write amplification is the ratio of bytes written to disk by the storage engine to bytes written by the application. An application writes 1 GB; the storage engine writes 10 GB to disk. Write amplification is 10x.
In an LSM tree, write amplification has two sources:
- MemTable flush. Every byte written by the application is also written once to the WAL and once when the MemTable is flushed to L0. Minimum write amplification is approximately 2x before compaction.
- Compaction. Data written into L0 gets merged into L1, then L1 into L2, and so on. In leveled compaction with a size multiplier of 10 across L levels, each byte may be rewritten once per level transition, for a total compaction write amplification of approximately L × size_ratio. For 5 levels with a ratio of 10, that is roughly 50x across the full compaction chain in the worst case. RocksDB's actual measured write amplification in production is typically 10 to 30x.
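The arithmetic above as a worked example:

```python
# Worked write-amplification estimate using the figures from the text.
levels = 5
size_ratio = 10

worst_case_compaction_wa = levels * size_ratio   # rewritten once per level transition
flush_wa = 2                                     # one WAL append + one MemTable flush

# At a typical measured 30x, 1 GB/s of application writes becomes 30 GB/s of disk I/O.
app_gb_per_s = 1
disk_gb_per_s = app_gb_per_s * 30
```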
This is a real I/O cost. Disk bandwidth is shared between application writes and compaction rewrites, so write amplification directly consumes sequential-write headroom. At 30x write amplification, 1 GB/s of application write load generates 30 GB/s of disk I/O, which exceeds what even high-end NVMe hardware can sustain.
Write amplification also accelerates SSD wear. SSD cells have a finite number of write cycles. At 30x write amplification, the SSD wears out 30x faster than the application write rate suggests. This is why hardware sizing for RocksDB and Cassandra deployments must account for write amplification, not raw application throughput.
Write amplification must be included in your hardware budget
At 30x write amplification, an SSD's endurance rating in petabytes written translates to application-write petabytes divided by 30: a drive rated for 1 PB of total writes can absorb only ~33 TB of application writes before its cells approach exhaustion. Measure actual write amplification with the rocksdb.write-amplification statistic before sizing hardware. This metric is routinely 5-10x higher than engineers expect when first deploying an LSM-based system at load.
Tuning levers: reducing the L0-to-L1 size ratio reduces write amplification at the cost of higher read amplification. Increasing the MemTable size reduces L0 flush frequency and flush write amplification. These are the parameters most commonly adjusted in RocksDB production tuning.
Production usage
| System | LSM usage | Notable behavior |
|---|---|---|
| RocksDB (Meta, LinkedIn, Kafka Streams) | Leveled compaction by default; Bloom filters on every SSTable | Provides db_bench for write amplification measurement; widely tuned for NVMe SSDs; used as storage engine inside MySQL (MyRocks) and TiKV |
| Apache Cassandra | Size-tiered compaction by default; TWCS (Time Window Compaction Strategy) for time-series | Each node runs independent compaction; compaction is single-threaded per table by default; bloom_filter_fp_chance tunes Bloom filter size per table |
| LevelDB (Google Chrome, IndexedDB) | Leveled compaction; original implementation by Sanjay Ghemawat and Jeff Dean | Simpler than RocksDB (no column families, no WAL group commit); used for browser-local storage where write throughput matters more than read latency |
| ScyllaDB | Cassandra-compatible; rewritten in C++ with per-shard compaction | Avoids compaction pauses by running compaction on a dedicated shard per CPU core; Bloom filters loaded into huge pages for faster access |
Limitations
- Read amplification without Bloom filters is severe. A point read that misses the MemTable must check every L0 SSTable and one SSTable per level. Without Bloom filters, missing keys that do not exist in the database are the worst case: every SSTable is checked, every check hits disk.
- Compaction causes write I/O bursts. Background compaction competes with application writes and reads for disk bandwidth. RocksDB implements compaction rate limiting (rate_limiter) to prevent compaction from starving user queries, but bursts remain a production operations concern.
- Space amplification during compaction. Size-tiered compaction temporarily doubles storage during a merge: the old SSTables and the new merged SSTable coexist until the merge completes. Systems without headroom for this expansion will see disk-full failures mid-compaction.
- Write stalls when L0 fills faster than compaction. If the write rate exceeds compaction throughput, the L0 SSTable count grows unchecked. RocksDB implements configurable write slowdowns and hard stops when the L0 file count exceeds thresholds (level0_slowdown_writes_trigger and level0_stop_writes_trigger) to prevent unbounded L0 growth.
- Not ideal for workloads where random point reads dominate. B-trees remain superior for read-heavy workloads with sparse key distributions. LSM's read overhead is most visible in applications that do many small random point reads and few writes.
When to use LSM trees vs. B-trees
For most OLTP workloads with moderate write rates and frequent point reads, a B-tree engine is simpler, easier to operate, and offers more predictable latency. Reach for LSM when sustained write throughput is genuinely the bottleneck, or when you need the operational model of Cassandra or RocksDB.
Interview cheat sheet
- When asked how LSM trees work: buffers writes in a MemTable (in-memory sorted structure), WAL for durability, flushes to L0 SSTables (sequential disk writes), background compaction merges levels. All disk writes are sequential; no random seeks on the write path.
- When asked why LSM trees are fast for writes: every application write is an in-memory operation plus a sequential WAL append. Disk seeks only occur during background compaction and MemTable flush, both of which are sequential.
- When asked about the read path: check MemTable first, then all L0 SSTables (overlapping ranges), then one SSTable per L1+ level (non-overlapping). Bloom filters eliminate most disk I/Os for keys that do not match.
- When asked about Bloom filters in LSM context: each SSTable has a Bloom filter. Before any binary search (disk I/O), the Bloom filter is checked. A definite negative skips the file entirely. Without Bloom filters, every read that misses the database requires checking every SSTable. This is the single most important optimization in production LSM deployments.
- When asked to compare size-tiered vs leveled compaction: size-tiered has lower write amplification but higher space amplification (temporary 2x during merges) and higher read amplification. Leveled has lower read and space amplification but higher write amplification. Cassandra defaults to size-tiered; RocksDB defaults to leveled.
- When asked about write amplification: each byte written by the application may be rewritten 10 to 30 times by compaction over its lifetime. At 30x write amplification, 1 GB/s application writes generate 30 GB/s disk I/O. This is the primary reason LSM hardware sizing must budget for write amplification.
- When asked about write stalls: if the application writes faster than compaction processes L0 files, L0 grows unbounded. RocksDB introduces configurable slowdowns and hard stops at L0 file count thresholds. This is a common production incident pattern in write-heavy deployments.
- When choosing between LSM and B-tree for a workload: LSM wins for write-heavy, time-series, and append-dominated patterns. B-tree wins for read-heavy, random-access, and OLTP patterns where point reads dominate and write rate is moderate. Knowing both and naming the tradeoff factors (write amplification, read amplification, space amplification) is the interview signal.
Quick recap
- LSM trees convert random writes into sequential writes by buffering in a MemTable, appending to a WAL for durability, and flushing to immutable sorted SSTables only when the MemTable is full.
- The write path never performs a random disk seek; all disk writes are sequential appends to the WAL or new SSTable files.
- The read path cascades from MemTable to L0 (all files, overlapping ranges) to L1 and beyond (one file per level), making reads more expensive than in a B-tree without Bloom filters.
- Bloom filters eliminate most disk I/Os for misses by confirming definitively that a key is absent from an SSTable before any binary search; they are the critical optimization that makes LSM reads practical.
- Compaction merges SSTables, removes tombstones, and reorganizes levels; size-tiered compaction minimizes write amplification at the cost of space amplification, while leveled compaction minimizes read and space amplification at the cost of higher write amplification.
- LSM trees are the right choice for write-heavy workloads (Cassandra, RocksDB, LevelDB); for read-heavy workloads with moderate write rates, B-trees remain the more predictable choice.
Related concepts
- Write-ahead log: The WAL in an LSM tree is the same mechanism as in a B-tree database; understanding WAL explains why the MemTable can be RAM-only while writes are still durable across crashes.
- Bloom filters: Bloom filters are the companion optimization that makes LSM reads practical; without them, every miss-case read must touch every SSTable in the database.
- B-tree indexes: The B-tree is the primary alternative to LSM; understanding both enables the tradeoff analysis (write amplification vs. read amplification vs. space amplification) that is central to storage engine selection.
- Databases: LSM and B-tree are the two dominant storage engine architectures; the concepts article on databases explains when each surfaces in system design conversations and which production systems use which engine.