Read repair
How read repair fixes stale replicas during normal read operations in leaderless databases like Cassandra and DynamoDB. Covers digest comparison, vector clocks, repair probability tuning, and the relationship between read repair and anti-entropy.
TL;DR -- Read repair piggybacks replica reconciliation onto normal read operations. When a coordinator detects that replicas disagree during a read, it pushes the newest value to the stale ones. It is a lazy, read-driven convergence mechanism that drives eventually consistent replicas back into agreement without dedicated background repair processes.
The Problem
In leader-based replication, the leader drives replication and detects lag. When a follower falls behind, the leader pushes missing writes. Staleness is the leader's problem, and the leader knows about it.
Leaderless systems (Cassandra, DynamoDB, Riak) do not have a single authority. Any replica can accept writes. A write is acknowledged when W of N replicas confirm it. If one replica is temporarily down during the write, it misses the update.
Replica C is now stale. Nobody will push the correct value to it because there is no leader. It will serve outdated data to any client that happens to read from it. Without an active repair mechanism, the staleness persists indefinitely.
This is the core problem read repair solves: detecting and fixing stale replicas as a side effect of normal read traffic.
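The staleness window is easy to reproduce in miniature. A toy sketch with invented in-memory replicas (`InMemoryReplica` and its API are hypothetical, not any database's interface) shows how a W=2-of-3 write succeeds while leaving a down replica stale:

```typescript
// Hypothetical in-memory replicas illustrating the problem: a write is
// acknowledged with W=2 of N=3, and the down replica silently misses it.
type Versioned = { value: string; version: number };

class InMemoryReplica {
  up = true;
  private store = new Map<string, Versioned>();

  write(key: string, v: Versioned): boolean {
    if (!this.up) return false; // a down replica misses the write entirely
    this.store.set(key, v);
    return true;
  }

  read(key: string): Versioned | undefined {
    return this.store.get(key);
  }
}

const replicas = [new InMemoryReplica(), new InMemoryReplica(), new InMemoryReplica()];
const c = replicas[2];
const W = 2;

c.up = false; // replica C is temporarily unreachable
const acks = replicas.filter((r) => r.write("k", { value: "v2", version: 2 })).length;
console.log(acks >= W); // true: the write is acknowledged despite C missing it

c.up = true; // C rejoins, but there is no leader to catch it up
console.log(c.read("k")); // undefined: C stays stale until something repairs it
```

Nothing in this system will ever fix replica C unless a repair mechanism runs.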
One-Line Definition
Read repair detects version mismatches across replicas during a read operation and pushes the newest value to any stale replicas before or after returning the response to the client.
Analogy
Imagine a group chat where three friends each keep a local copy of the pizza order. One friend's phone was offline when the group changed the order from pepperoni to margherita. The next time someone asks "what did we order?", the group compares notes, notices the disagreement, and corrects the out-of-date copy. Nobody ran a dedicated "sync" process; the fix happened naturally as part of answering a question.
Read repair works the same way. The act of reading triggers the repair.
Solution Walkthrough
The Digest Comparison Technique
Comparing full values from every replica on every read is expensive. Cassandra optimizes this with a two-phase approach.
Phase 1: Digest read. The coordinator sends a full data request to one replica and a digest request to the others. A digest is a hash of the value. The coordinator compares the full value's hash against the digests from other replicas.
Phase 2: Full read (if mismatch). If digests disagree, the coordinator requests full data from all replicas, compares versions, and picks the newest value. It then pushes the correct value to any stale replicas.
This optimization matters at scale. Full data comparisons on every read would multiply your network bandwidth by the replication factor. Digests are typically 16-32 bytes regardless of value size.
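A sketch of the digest idea, assuming Node's `crypto` module (the hash choice here is illustrative; Cassandra's internal digest algorithm is its own detail):

```typescript
// A digest is a fixed-size hash of the value: 16 bytes here, whether the
// value is 10 bytes or 10 megabytes.
import { createHash } from "crypto";

function digest(value: string): string {
  return createHash("sha256").update(value).digest("hex").slice(0, 32);
}

const fullValue = "margherita";                                   // full data from one replica
const otherDigests = [digest("margherita"), digest("pepperoni")]; // digest reads from the rest

const expected = digest(fullValue);
const mismatches = otherDigests.filter((d) => d !== expected).length;
console.log(mismatches); // 1: one replica disagrees, so phase 2 fetches full data
```

Only when `mismatches > 0` does the coordinator pay for full-value transfers.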
Synchronous vs Asynchronous Repair
The coordinator has two choices for when to send the repair write.
Blocking read repair waits for the repair write to reach the stale replica before returning to the client. The client pays extra latency, but the next read from any replica is guaranteed to see the correct value. Use this when consistency matters more than read latency.
Background read repair returns the response to the client immediately and sends the repair write asynchronously. The client gets lower latency, but there is a small window where the stale replica can still serve outdated data. Use this when read latency is the priority.
Cassandra exposes this as read_repair_chance (now deprecated in favor of the read_repair table option). In older versions, setting read_repair_chance: 0.1 meant 10% of reads triggered a background repair. Newer Cassandra versions use read_repair = 'BLOCKING' or read_repair = 'NONE' at the table level.
Version Resolution Strategies
When replicas disagree, the coordinator must decide which value wins.
Timestamp-based (last-write-wins): Each write carries a timestamp. The highest timestamp wins. Simple but lossy: concurrent writes resolve arbitrarily based on clock skew. Cassandra uses this approach.
Vector clocks: Each replica maintains a vector of logical clocks, one per node. Vector clocks can detect true conflicts (concurrent writes that cannot be ordered). The system can then present both values to the application for resolution. DynamoDB's ancestor Dynamo used vector clocks, though DynamoDB itself uses last-write-wins.
Version vectors: Similar to vector clocks but track versions per key rather than per event. Riak uses dotted version vectors, which sidestep some scalability issues of plain vector clocks by preventing the vector from growing without bound.
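To make the "true conflict" distinction concrete, here is a minimal vector clock comparison sketch (the types and function names are invented for illustration). Concurrent writes are exactly the cases that last-write-wins would resolve arbitrarily:

```typescript
// Each clock maps node id -> counter. compare() returns "before", "after",
// "equal", or "concurrent"; "concurrent" means neither clock dominates the
// other, i.e. a genuine conflict that needs application-level resolution.
type VClock = Record<string, number>;

function compare(a: VClock, b: VClock): "before" | "after" | "equal" | "concurrent" {
  const nodes = new Set([...Object.keys(a), ...Object.keys(b)]);
  let aBehind = false;
  let bBehind = false;
  for (const n of nodes) {
    const av = a[n] ?? 0;
    const bv = b[n] ?? 0;
    if (av < bv) aBehind = true;
    if (bv < av) bBehind = true;
  }
  if (aBehind && bBehind) return "concurrent";
  if (aBehind) return "before";
  if (bBehind) return "after";
  return "equal";
}

console.log(compare({ A: 2, B: 1 }, { A: 1, B: 1 })); // "after": first clock dominates
console.log(compare({ A: 2, B: 0 }, { A: 1, B: 3 })); // "concurrent": neither dominates
```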
The Quorum Connection
Read repair works best with quorum reads. When you read from R replicas where R + W > N, you are guaranteed to contact at least one replica that has the latest write. The coordinator always sees the newest version and can repair any stale replicas it contacted.
Without quorum reads (R=1), you might contact only the stale replica. Read repair cannot trigger because the coordinator sees only one version and has no basis for comparison. This is why low-consistency reads (R=1) combined with low-consistency writes (W=1) can leave data diverged indefinitely, until anti-entropy repair catches it.
Read repair requires contacting multiple replicas to detect mismatches. If your read consistency level only contacts one replica (like Cassandra's ONE), read repair effectively does nothing. You need QUORUM or ALL for read repair to have teeth.
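The overlap guarantee behind R + W > N is a pigeonhole argument, and for small N it can be checked exhaustively. A brute-force sketch (the helper names are invented):

```typescript
// Enumerate every read set of size R and write set of size W over N replicas
// and check whether they always share at least one replica. They do exactly
// when R + W > N, which is what lets the coordinator see the latest write.
function subsetsOfSize(n: number, k: number): number[][] {
  const out: number[][] = [];
  const rec = (start: number, acc: number[]) => {
    if (acc.length === k) {
      out.push([...acc]);
      return;
    }
    for (let i = start; i < n; i++) rec(i + 1, [...acc, i]);
  };
  rec(0, []);
  return out;
}

function alwaysOverlap(n: number, r: number, w: number): boolean {
  for (const rs of subsetsOfSize(n, r))
    for (const ws of subsetsOfSize(n, w))
      if (!rs.some((x) => ws.includes(x))) return false; // disjoint read/write sets
  return true;
}

console.log(alwaysOverlap(3, 2, 2)); // true: quorum reads always see the latest write
console.log(alwaysOverlap(3, 1, 2)); // false: R=1 can land entirely on the stale replica
```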
Implementation Sketch
A simplified read repair coordinator:
```typescript
interface VersionedValue<T> {
  value: T;
  version: number; // timestamp or logical version
}

interface Replica {
  readFull<T>(key: string): Promise<VersionedValue<T>>;
  readDigest(key: string): Promise<{ digest: string }>;
  write<T>(key: string, value: T, version: number): Promise<void>;
}

declare function pickReplicas(replicas: Replica[], count: number): Replica[];
declare function hash(value: unknown): string;

async function readWithRepair<T>(
  key: string,
  replicas: Replica[],
  readQuorum: number,
  opts: { blocking: boolean }
): Promise<T> {
  // Phase 1: send one full read + (R-1) digest reads, all in parallel
  const [fullTarget, ...digestTargets] = pickReplicas(replicas, readQuorum);
  const fullRead = fullTarget.readFull<T>(key);
  const digestReads = digestTargets.map((r) => r.readDigest(key));
  const fullResult = await fullRead;
  const digests = await Promise.all(digestReads);

  const expectedDigest = hash(fullResult.value);

  // Phase 2: check for mismatches
  const mismatched = digestTargets.filter((_, i) => digests[i].digest !== expectedDigest);
  if (mismatched.length === 0) {
    return fullResult.value; // all contacted replicas agree
  }

  // Fetch full data from the mismatched replicas to resolve versions
  const mismatchedValues = await Promise.all(mismatched.map((r) => r.readFull<T>(key)));

  // Pick the newest version across all contacted replicas
  const contacted = [
    { replica: fullTarget, result: fullResult },
    ...mismatched.map((replica, i) => ({ replica, result: mismatchedValues[i] })),
  ];
  const newest = contacted.reduce((a, b) => (a.result.version > b.result.version ? a : b)).result;

  // Repair every contacted replica that is behind the newest version.
  // Note this includes the full-read target: it may itself be the stale one.
  const repairPromise = Promise.all(
    contacted
      .filter((c) => c.result.version < newest.version)
      .map((c) => c.replica.write(key, newest.value, newest.version))
  );
  if (opts.blocking) {
    await repairPromise; // wait for repair before returning
  }
  // else: fire-and-forget, repair completes in the background

  return newest.value;
}
```
The key invariant: the coordinator always returns the newest value it found across all contacted replicas, regardless of whether repair succeeds or fails.
Repair Probability Tuning
Not every read needs to trigger a repair check. The probability controls the trade-off between consistency convergence speed and read overhead.
```
repair_probability = 0.0   -- no read repair; rely on anti-entropy only
repair_probability = 0.1   -- 10% of reads check digests; low overhead
repair_probability = 1.0   -- every read checks digests; fastest convergence
```
Higher probability means faster convergence but more digest comparisons and potentially more repair writes. For hot keys that are read thousands of times per second, even a low probability (0.01) repairs staleness within seconds because the sheer read volume triggers repairs frequently.
For cold keys that are rarely read, read repair may never fire. These keys need a separate anti-entropy mechanism to converge.
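A back-of-envelope model makes both effects visible: if a key is read `readsPerSec` times per second and each read checks digests with probability `p`, repair checks arrive at rate `readsPerSec * p` (the function and parameter names are illustrative):

```typescript
// Expected time until a stale key is checked for repair: 1 / (readRate * p).
// Hot keys converge fast even at tiny probabilities; keys that are never read
// are never checked, which is the gap anti-entropy must cover.
function expectedSecondsToRepairCheck(readsPerSec: number, probability: number): number {
  const checkRate = readsPerSec * probability;
  if (checkRate === 0) return Infinity; // cold key, or repair disabled
  return 1 / checkRate;
}

console.log(expectedSecondsToRepairCheck(1000, 0.01)); // 0.1s for a hot key at p=1%
console.log(expectedSecondsToRepairCheck(0, 1.0));     // Infinity: never read, never repaired
```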
When It Shines
Read repair is the right mechanism when:
- Your system uses leaderless replication and replicas can independently fall behind. This is the canonical use case (Cassandra, DynamoDB, Riak).
- Read traffic is high relative to write traffic. More reads means more opportunities for repair. If a key is read 1,000 times per second, even a 1% repair probability triggers 10 repairs per second.
- You want consistency convergence without dedicated background processes. Read repair piggybacks on existing read operations rather than requiring separate repair jobs with their own resource consumption.
- Latency is tolerable for blocking mode, or you can accept a brief staleness window with background mode. If neither is acceptable, you need stronger consistency guarantees (e.g., Paxos-based replication).
Failure Modes & Pitfalls
1. Cold Key Staleness
Read repair only triggers during reads. Keys that are written once and rarely read may stay stale indefinitely. If a replica was down during the write and nobody reads that key, read repair never fires. You must complement read repair with anti-entropy (Merkle tree) repair for cold data.
2. Repair Storms on Hot Keys
If a popular key has stale replicas, every read triggers a repair write. With 10,000 reads per second and 3 stale replicas, that is 30,000 repair writes per second. This can overload the stale replicas, which are potentially already struggling. Production systems cap repair write rate per key.
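One way to implement such a cap (the token bucket below is a generic sketch, not any specific database's mechanism; dropped repairs are safe to skip because a later read will trigger repair again):

```typescript
// Per-key token bucket: at most `ratePerSec` repair writes per key, with a
// small burst allowance. Excess repair attempts are simply dropped.
class RepairLimiter {
  private tokens: number;
  private last: number;

  constructor(private ratePerSec: number, private burst: number, now = 0) {
    this.tokens = burst;
    this.last = now;
  }

  tryRepair(now: number): boolean {
    // Refill tokens for the time elapsed, capped at the burst size
    this.tokens = Math.min(this.burst, this.tokens + (now - this.last) * this.ratePerSec);
    this.last = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true; // repair write allowed
    }
    return false; // dropped; a future read will retry the repair
  }
}

const limiter = new RepairLimiter(10, 5); // 10 repairs/sec, burst of 5
let allowed = 0;
for (let i = 0; i < 10000; i++) {
  if (limiter.tryRepair(i / 10000)) allowed++; // 10,000 reads within one second
}
console.log(allowed); // roughly burst + one second of refill, far fewer than 10,000
```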
3. Clock Skew with Last-Write-Wins
If replicas have clock skew and you use timestamp-based conflict resolution, last-write-wins can discard the "correct" value in favor of a value from a replica with a fast clock. NTP drift of even a few milliseconds can cause data loss in high-write workloads. Vector clocks avoid this but add complexity.
4. Read Repair on Tombstones
Deleted data in Cassandra is represented as a tombstone. A replica that missed the delete still holds the live value and can serve it until read repair pushes the tombstone to it, so deleted data can briefly reappear. Worse, if tombstones are garbage collected before all replicas see them, deleted data can be resurrected permanently. This is the "zombie data" problem.
Tombstone garbage collection (gc_grace_seconds in Cassandra) must be longer than your maximum repair interval. If you GC tombstones after 10 days but a replica was down for 14 days, that replica still has the old (deleted) data. When it comes back, read repair resurrects the deleted row because the tombstone no longer exists on the other replicas.
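The invariant is simple enough to encode as a deployment sanity check (the function name is illustrative; the numbers mirror the 10-day/14-day example above):

```typescript
// A tombstone must survive longer than the worst-case window in which a
// replica can miss the delete (downtime plus time until repair covers it).
function tombstoneGcIsSafe(gcGraceSeconds: number, maxRepairWindowSeconds: number): boolean {
  return gcGraceSeconds > maxRepairWindowSeconds;
}

const TEN_DAYS = 10 * 24 * 3600;
const FOURTEEN_DAYS = 14 * 24 * 3600;

console.log(tombstoneGcIsSafe(TEN_DAYS, FOURTEEN_DAYS)); // false: zombie data risk
console.log(tombstoneGcIsSafe(FOURTEEN_DAYS, TEN_DAYS)); // true: tombstone outlives the window
```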
5. Amplified Read Latency
Blocking read repair adds a write operation to the read path. If the stale replica is slow (which is common since it was recently down), the client pays the repair write latency on top of the read latency. This can cause p99 read latency spikes when replicas rejoin after an outage.
6. Digest Collisions
Digest comparison uses a hash function. Hash collisions (two different values producing the same digest) cause the coordinator to miss a mismatch. The probability is extremely low with a strong hash (SHA-256), but CRC32 or short hashes have non-negligible collision rates at scale.
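The standard birthday-bound approximation makes the difference concrete: for n distinct values and a b-bit digest, the collision probability is roughly 1 - e^(-n^2 / 2^(b+1)) (the function name is illustrative):

```typescript
// Birthday-bound estimate of the chance that at least two of n distinct
// values share a b-bit digest. CRC32 collides at modest scale; 128-bit or
// larger digests make collisions effectively impossible.
function collisionProbability(n: number, bits: number): number {
  return 1 - Math.exp(-(n * n) / Math.pow(2, bits + 1));
}

console.log(collisionProbability(100_000, 32));  // ~0.69: CRC32 is risky at scale
console.log(collisionProbability(100_000, 128)); // ~0: negligible for strong digests
```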
Trade-offs
| Dimension | Upside | Downside |
|---|---|---|
| Consistency | Converges replicas toward latest value | Only repairs keys that are actively read |
| Latency | Background mode adds zero read latency | Blocking mode adds a write to the read path |
| Throughput | No dedicated repair bandwidth | Repair writes compete with normal writes |
| Operational cost | Zero configuration; built into the read path | Hard to monitor and debug when repairs lag |
| Coverage | Repairs popular keys very quickly | Cold keys may never get repaired |
| Complexity | Simple concept and implementation | Interacts subtly with tombstones, clocks, and quorum settings |
Read Repair vs Anti-Entropy
Read repair and anti-entropy are complementary, not competing mechanisms.
Read repair fixes hot data fast. It triggers on every read (or a probabilistic subset), so popular keys converge within seconds. It costs nothing for cold keys because it never runs on them.
Anti-entropy (Merkle tree comparison) fixes cold data slowly. A background process periodically compares Merkle trees between replicas and repairs any differences. It covers all data regardless of read frequency but consumes background CPU, disk I/O, and network bandwidth.
Most production systems run both. Read repair handles the 80% of traffic concentrated on 20% of keys. Anti-entropy handles the long tail of rarely-read data that read repair misses.
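A toy sketch of the Merkle-tree comparison that anti-entropy relies on (simplified for illustration: a real implementation descends the tree level by level, while this version checks the root and then jumps straight to the leaves):

```typescript
// Hash each key range into a Merkle leaf; parents hash their children.
// If roots match, the replicas are identical and no data is transferred;
// otherwise only the differing ranges need to be exchanged.
import { createHash } from "crypto";

const h = (s: string) => createHash("sha256").update(s).digest("hex");

// Returns levels from root (index 0) down to the hashed leaves (last index)
function merkleLevels(leaves: string[]): string[][] {
  const levels = [leaves.map(h)];
  while (levels[0].length > 1) {
    const prev = levels[0];
    const next: string[] = [];
    for (let i = 0; i < prev.length; i += 2) next.push(h(prev[i] + (prev[i + 1] ?? "")));
    levels.unshift(next);
  }
  return levels;
}

function differingLeafRanges(a: string[], b: string[]): number[] {
  const la = merkleLevels(a);
  const lb = merkleLevels(b);
  if (la[0][0] === lb[0][0]) return []; // roots match: nothing to repair
  const leavesA = la[la.length - 1];
  const leavesB = lb[lb.length - 1];
  return a.map((_, i) => i).filter((i) => leavesA[i] !== leavesB[i]);
}

const replicaA = ["k0:v1", "k1:v1", "k2:v1", "k3:v1"];
const replicaB = ["k0:v1", "k1:v1", "k2:v2", "k3:v1"]; // one stale range
console.log(differingLeafRanges(replicaA, replicaB)); // [2]: only range 2 needs repair
```

The bandwidth cost scales with the number of differing ranges, not the dataset size, which is why anti-entropy can afford to cover cold keys that read repair never touches.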
Real-World Usage
Apache Cassandra
Cassandra is the canonical read repair implementation. The coordinator sends a full data request to the fastest replica and digest requests to others. On mismatch, it fetches full data from all replicas and repairs the stale ones.
Cassandra 4.0 changed the default from probabilistic read repair (read_repair_chance) to table-level read_repair = 'BLOCKING'. This simplifies configuration and avoids the surprising behavior where identical reads sometimes trigger repair and sometimes do not.
Amazon DynamoDB
DynamoDB uses read repair internally but does not expose configuration. Strongly consistent reads always contact the leader replica. Eventually consistent reads may contact any replica, and DynamoDB repairs discrepancies in the background. Users do not tune repair probability; it is managed by the service.
Riak
Riak KV implements read repair with vector clocks for conflict detection instead of last-write-wins. When a read detects conflicting versions (siblings), Riak can return all siblings to the application for manual resolution. This makes conflict handling explicit rather than silently discarding data.
Apache HBase
HBase uses a different model (leader-based replication via WAL), so it does not use read repair in the Dynamo sense. However, HBase's timeline consistency mode allows stale reads from region replicas, and a form of read repair reconciles stale region replicas with the primary.
How This Shows Up in Interviews
Read repair appears in two contexts: when designing leaderless replication for a system like a key-value store, and when discussing consistency guarantees in eventually consistent systems.
What strong candidates cover: the digest optimization to reduce bandwidth, the quorum requirement (R + W > N) for read repair to work, the distinction between synchronous and asynchronous repair, the cold-key gap that requires anti-entropy, and the tombstone resurrection problem.
What weak candidates miss: they describe read repair as "reads just fix things" without explaining how the coordinator detects staleness, why quorum reads are necessary, or what happens to data that is never read. They also miss the interaction between gc_grace_seconds and read repair for deleted data.
Common follow-ups: "What happens if all replicas are stale by different amounts?" (Coordinator picks the newest version and repairs all others.) "Why not just run anti-entropy for everything?" (Too expensive for hot keys; read repair is free because the data is already being read.) "How does read repair interact with hinted handoff?" (Hinted handoff catches up a replica that was down during writes; read repair catches stragglers that hinted handoff missed.)
Test Your Understanding
Quick Recap
- Read repair fixes stale replicas by detecting version mismatches during normal read operations and pushing the newest value to outdated replicas.
- Digest comparison optimizes bandwidth: the coordinator sends full reads to one replica and lightweight hash requests to others.
- Blocking repair guarantees consistency at the cost of higher read latency. Background repair prioritizes latency at the cost of a brief staleness window.
- Read repair requires quorum reads (R >= 2) to detect mismatches. With R=1 it cannot compare versions.
- The R + W > N quorum formula is what guarantees at least one replica in the read set has the latest write.
- Cold keys that are rarely read may stay stale indefinitely. Anti-entropy (Merkle tree) repair catches what read repair misses.
- Tombstone garbage collection must outlast your maximum repair window. Otherwise, deleted data can be resurrected.
- Hinted handoff, read repair, and anti-entropy are complementary mechanisms, each covering different failure windows and access patterns.
Related Patterns
- Consistency models: Read repair is a mechanism for achieving eventual consistency. Understanding where it sits on the consistency spectrum is essential.
- Replication: Read repair only matters in replicated systems. Leaderless replication is where it is most critical.
- CAP theorem: Read repair is an availability-favoring mechanism. It allows stale reads rather than blocking, then fixes staleness lazily.
- Consistent hashing: Determines which replicas own a key and therefore which replicas participate in read repair.
- CRDT: Conflict-free replicated data types can resolve concurrent writes automatically, simplifying the merge step in read repair when vector clocks detect conflicts.