Merkle trees
How Merkle trees enable efficient data verification and anti-entropy in distributed systems, powering Git's content addressing, BitTorrent's peer verification, and Cassandra's anti-entropy repair.
The problem
You operate three replicas of a database holding 500 million rows. One replica silently missed a batch of writes during a network partition last Tuesday. You need to find which rows are out of sync and fix them.
The naive approach: stream every row from replica A, hash it, and compare against replica B. That is O(N) network traffic, hundreds of gigabytes over the wire for a dataset that is 99.99% identical. The repair takes hours, saturates your network, and slows production reads while it runs.
You could timestamp every row, but clocks drift, and "send me everything modified after Tuesday" misses rows where the timestamp itself was the thing that got corrupted. You need a way to detect exactly which rows differ between two copies of a massive dataset, transferring only a tiny amount of metadata to find the divergence. This is the problem Merkle trees solve.
What it is
A Merkle tree is a hash tree where every leaf contains the hash of a data block and every internal node contains the hash of its children's hashes, producing a single root hash that summarizes the entire dataset.
Think of it like a corporate org chart used for headcount auditing. Each team lead reports their headcount to their director, each director sums and reports to the VP, and the VP reports a single number to the CEO. If the CEO's number does not match what HR expects, you do not re-count every employee. You ask each VP, find the wrong division, drill into directors, then teams. You locate the discrepancy in O(log N) steps rather than recounting all N employees.
The key property: if two Merkle trees have the same root hash, their data is identical. If the roots differ, you can binary-search down the tree to find exactly which leaves (data blocks) are different, comparing only O(log N) hashes instead of O(N) data blocks.
How it works
Every leaf node holds the cryptographic hash (SHA-256, MurmurHash, etc.) of one data block. Every internal node holds the hash of its two children's hashes concatenated together. The root hash is a single fixed-size value that uniquely fingerprints the entire dataset.
If any single byte changes in any data block, the leaf hash changes. That change propagates upward: the parent hash changes, its parent changes, all the way to the root. Two trees with different root hashes are guaranteed to have at least one differing data block.
Building a Merkle tree
def build_merkle_tree(data_blocks):
# Hash each data block to create leaf nodes
leaves = [hash(block) for block in data_blocks]
# Pad to even count if needed (duplicate last leaf)
if len(leaves) % 2 != 0:
leaves.append(leaves[-1])
# Build tree bottom-up
tree = [leaves]
current_level = leaves
while len(current_level) > 1:
next_level = []
for i in range(0, len(current_level), 2):
combined = hash(current_level[i] + current_level[i+1])
next_level.append(combined)
if len(next_level) > 1 and len(next_level) % 2 != 0:
next_level.append(next_level[-1])
tree.append(next_level)
current_level = next_level
return tree # tree[-1][0] is the root hash
Construction is O(N): you hash each of the N data blocks once (leaves), then compute roughly N internal hashes. The tree is stored in memory or serialized for exchange. The root hash is a fixed 32 bytes (SHA-256) regardless of dataset size.
O(log N) difference detection
The core power of a Merkle tree is finding exactly which data blocks differ between two copies of a dataset. The algorithm is a top-down binary search through the hash tree.
- Compare root hashes. If equal, datasets are identical. Done.
- If different, compare the two children of the root.
- For each child where hashes match, skip that entire subtree (all data underneath is identical).
- For each child where hashes differ, recurse into its children.
- At the leaf level, the differing leaves pinpoint the exact data blocks that need syncing.
For N data blocks, the tree has depth log2(N). You compare at most 2 hashes per level, so finding all differences costs O(log N) hash comparisons. For a database with 1 billion rows partitioned into 1 million leaf blocks, that is roughly 20 hash comparisons to locate a single divergent block, not 1 million data comparisons.
Interview tip: always state the complexity
When Merkle trees come up in an interview, immediately say: "Difference detection is O(log N) comparisons, where N is the number of data blocks. This matters because the alternative, comparing all data, is O(N) network transfer." The complexity is the entire reason this data structure exists.
Merkle proofs (inclusion verification)
A Merkle proof lets you verify that a specific data block belongs to a dataset without downloading the whole dataset. You only need the block itself, its sibling hashes along the path to the root, and the root hash.
To prove Block C is in the tree: provide Block C, plus the sibling hash H_D, plus the sibling hash H_AB. The verifier hashes Block C to get H_C, combines H_C with H_D to get H_CD, then combines H_CD with H_AB to get the root. If the computed root matches the known root hash, the proof is valid.
def verify_merkle_proof(block, proof_siblings, known_root, index):
current_hash = hash(block)
for sibling_hash, is_left in proof_siblings:
if is_left:
current_hash = hash(sibling_hash + current_hash)
else:
current_hash = hash(current_hash + sibling_hash)
return current_hash == known_root
The proof size is O(log N) hashes, regardless of dataset size. For a tree with 1 million leaves, a Merkle proof is roughly 20 hashes (640 bytes with SHA-256). This is why blockchains and Certificate Transparency logs use Merkle proofs: you can verify one transaction or certificate without downloading the entire chain.
Cassandra anti-entropy repair
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.