Read repair
How read repair fixes stale replicas during normal read operations in leaderless databases like Cassandra and DynamoDB. Covers digest comparison, vector clocks, repair probability tuning, and the relationship between read repair and anti-entropy.
TL;DR -- Read repair piggybacks replica reconciliation onto normal read operations. When a coordinator detects that replicas disagree during a read, it pushes the newest value to stale replicas. It is a lazy, read-driven convergence mechanism that keeps eventually consistent systems consistent without dedicated background repair processes.
The Problem
In leader-based replication, the leader drives replication and detects lag. When a follower falls behind, the leader pushes missing writes. Staleness is the leader's problem, and the leader knows about it.
Leaderless systems (Cassandra, DynamoDB, Riak) do not have a single authority. Any replica can accept writes. A write is acknowledged when W of N replicas confirm it. If one replica is temporarily down during the write, it misses the update.
Replica C is now stale. Nobody will push the correct value to it because there is no leader. It will serve outdated data to any client that happens to read from it. Without an active repair mechanism, the staleness persists indefinitely.
This is the core problem read repair solves: detecting and fixing stale replicas as a side effect of normal read traffic.
One-Line Definition
Read repair detects version mismatches across replicas during a read operation and pushes the newest value to any stale replicas before or after returning the response to the client.
Analogy
Imagine a group chat where three friends each keep a local copy of the pizza order. One friend's phone was offline when the group changed the order from pepperoni to margherita. The next time someone asks "what did we order?", the group compares notes, notices the disagreement, and corrects the out-of-date copy. Nobody ran a dedicated "sync" process; the fix happened naturally as part of answering a question.
Read repair works the same way. The act of reading triggers the repair.
Solution Walkthrough
The Digest Comparison Technique
Comparing full values from every replica on every read is expensive. Cassandra optimizes this with a two-phase approach.
Phase 1: Digest read. The coordinator sends a full data request to one replica and a digest request to the others. A digest is a hash of the value. The coordinator compares the full value's hash against the digests from other replicas.
Phase 2: Full read (if mismatch). If digests disagree, the coordinator requests full data from all replicas, compares versions, and picks the newest value. It then pushes the correct value to any stale replicas.
This optimization matters at scale. Full data comparisons on every read would multiply your network bandwidth by the replication factor. Digests are typically 16-32 bytes regardless of value size.
Synchronous vs Asynchronous Repair
The coordinator has two choices for when to send the repair write.
Blocking read repair waits for the repair write to reach the stale replica before returning to the client. The client pays extra latency, but the next read from any replica is guaranteed to see the correct value. Use this when consistency matters more than read latency.
Background read repair returns the response to the client immediately and sends the repair write asynchronously. The client gets lower latency, but there is a small window where the stale replica can still serve outdated data. Use this when read latency is the priority.
Cassandra exposes this as read_repair_chance (now deprecated in favor of the read_repair table option). In older versions, setting read_repair_chance: 0.1 meant 10% of reads triggered a background repair. Newer Cassandra versions use read_repair = 'BLOCKING' or read_repair = 'NONE' at the table level.
Version Resolution Strategies
When replicas disagree, the coordinator must decide which value wins.
Timestamp-based (last-write-wins): Each write carries a timestamp. The highest timestamp wins. Simple but lossy: concurrent writes resolve arbitrarily based on clock skew. Cassandra uses this approach.
Vector clocks: Each replica maintains a vector of logical clocks, one per node. Vector clocks can detect true conflicts (concurrent writes that cannot be ordered). The system can then present both values to the application for resolution. DynamoDB's ancestor Dynamo used vector clocks, though DynamoDB itself uses last-write-wins.
Version vectors: Similar to vector clocks but track versions per key rather than per operation. Riak uses dotted version vectors, which solve some scalability issues with plain vector clocks by avoiding the vector from growing unboundedly.
The Quorum Connection
Read repair works best with quorum reads. When you read from R replicas where R + W > N, you are guaranteed to contact at least one replica that has the latest write. The coordinator always sees the newest version and can repair any stale replicas it contacted.
Without quorum reads (R=1), you might contact only the stale replica. Read repair cannot trigger because the coordinator sees only one version and has no basis for comparison. This is why low-consistency reads (R=1) combined with low-consistency writes (W=1) can leave data permanently diverged until anti-entropy repairs catch it.
Read repair requires contacting multiple replicas to detect mismatches. If your read consistency level only contacts one replica (like Cassandra's ONE), read repair effectively does nothing. You need QUORUM or ALL for read repair to have teeth.
Implementation Sketch
A simplified read repair coordinator:
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.