Quorum consensus math
The mathematics behind quorum reads and writes in distributed systems: how W + R > N guarantees overlap with the latest write, the latency-consistency tradeoff, and sloppy quorums in Dynamo-style systems.
The problem
You run a key-value store replicated across three nodes. A client writes a user profile update. You could send it to all three replicas and wait for all three acknowledgments, but one node is in a different availability zone with 80ms extra latency. Your p99 write latency just became 80ms because you wait for the slowest replica.
The product team asks: can we make writes faster? You consider waiting for only one replica to acknowledge. The write returns in 5ms. But what happens next?
So you try the other extreme: acknowledge after just one replica confirms. Writes are fast, but a subsequent read might hit one of the two replicas that haven't received the write yet. The client reads stale data. A user updates their profile, refreshes the page, and sees the old profile. They update again. Now you have two conflicting writes.
There is a sweet spot between "wait for everyone" and "wait for nobody." The math behind that sweet spot is quorum consensus. It tells you exactly how many replicas need to confirm a write, and how many a read must contact, so that at least one replica in the read set has the latest write.
This is one of the most fundamental ideas in distributed systems. Every database from Cassandra to CockroachDB uses some form of quorum math. If you understand W + R > N and its implications, you understand 80% of distributed consistency tradeoffs.
The formula is deceptively simple. The subtleties, as we will see, are in the edge cases: what happens during failures, what happens with concurrent writes, and why the formula is necessary but not sufficient for strong consistency.
What it is
A quorum is the minimum number of nodes that must agree on an operation for the system to consider it committed. Think of jury verdicts: you don't need all twelve jurors to agree in a civil trial, just a majority. If 7 out of 12 vote "guilty" and any future review panel polls 7 jurors, at least 2 jurors from the original majority are in the review panel. They carry the verdict forward.
The same logic applies to distributed replicas. If enough replicas confirm a write, and enough replicas are polled on a read, the write and read groups must share at least one member. That shared member carries the truth. The beauty of this approach is that no single node needs to be "the source of truth." The truth emerges from the overlap.
In distributed systems, the core guarantee is a formula:
W + R > N
Where N is the total number of replicas, W is the number of replicas that must acknowledge a write, and R is the number of replicas that must respond to a read. When W + R exceeds N, the write set and read set must overlap by at least one node. That overlapping node carries the latest value. The minimum number of overlapping nodes is exactly W + R - N.
How it works
The quorum read/write protocol has three participants: the coordinator (the node receiving the client request), the replica set, and the client. The coordinator is the orchestrator. It does not necessarily store the data itself; its job is to route writes and reads to the right replicas and enforce the quorum constraints.
Here is a write followed by a read with N=3, W=2, R=2:
The coordinator sends the write to all N replicas but only waits for W acknowledgments. For the remaining N-W replicas, the write propagates asynchronously in the background. On read, the coordinator contacts all N replicas but returns after R responses, picking the value with the highest version. Because W + R > N, at least one read-responder has the latest write.
Here is the pseudocode for the coordinator logic:
function quorum_write(key, value, version):
acks = 0
for replica in all_replicas(key):
send_async(replica, WRITE, key, value, version)
while acks < W:
response = await_any_response()
if response.status == OK:
acks += 1
return SUCCESS // W replicas confirmed
function quorum_read(key):
responses = []
for replica in all_replicas(key):
send_async(replica, READ, key)
while len(responses) < R:
response = await_any_response()
responses.append(response)
// The overlap guarantee: at least one response has the latest write
return max_by_version(responses)
The key insight: the coordinator does not need to know which specific replica has the latest value. The pigeonhole principle guarantees that at least one of the R responders is in the W write set, because W + R > N.
The pigeonhole proof
If you have N total slots and you fill W of them (the write set), the remaining unfilled slots are N - W. A read set of R nodes can avoid the write set only if R β€ N - W, which means R + W β€ N. The contrapositive: if R + W > N, the read set cannot avoid the write set. At least one node is in both.
The minimum overlap size is W + R - N. For N=3, W=2, R=2, the overlap is 2 + 2 - 3 = 1. For N=5, W=3, R=3, the overlap is 3 + 3 - 5 = 1. For N=5, W=4, R=3, the overlap is 4 + 3 - 5 = 2 (two nodes carry the latest value, providing redundancy even within the read quorum).
Version resolution
When the coordinator receives R responses, it needs to determine which value is "latest." Two common strategies:
Monotonic version numbers: each write increments a global or per-key version counter. The coordinator picks the response with the highest version. This is simple but requires a way to assign monotonically increasing versions, which itself may need a single sequencer or a consensus protocol. In practice, many systems use the coordinator as the sequencer for the keys it owns.
Vector clocks: each response carries a vector clock. The coordinator compares vectors component-wise. If one vector dominates (every component β₯ the other), that value is newer. If neither dominates, the values are concurrent, and the coordinator must return both (or merge them using application-specific logic). This is more complex than version numbers but correctly handles concurrent writes from multiple coordinators.
Cassandra uses timestamps (a form of version number tied to wall-clock time). DynamoDB's original Dynamo paper used vector clocks. The choice affects what happens during concurrent writes: timestamps silently discard one value (Last-Write-Wins), while vector clocks surface the conflict to the application.
I recommend defaulting to vector clocks or version vectors for data where concurrent writes are possible and data loss is unacceptable. Use LWW only when writes to the same key are serialized by design (single-writer pattern) or when the data can be safely overwritten (counters, last-known-location, session TTLs).
Common quorum configurations
| N | W | R | W+R | Property |
|---|---|---|---|---|
| 3 | 2 | 2 | 4 > 3 | Strong consistency (most common) |
| 3 | 3 | 1 | 4 > 3 | Fast reads, slow writes |
| 3 | 1 | 3 | 4 > 3 | Fast writes, slow reads |
| 3 | 1 | 1 | 2 β€ 3 | Eventual consistency (violates W+R>N) |
| 5 | 3 | 3 | 6 > 5 | Strong consistency in 5-node cluster |
| 5 | 1 | 1 | 2 β€ 5 | High availability, eventual consistency |
The W=1, R=1 configuration (used in Cassandra with CONSISTENCY ONE) offers maximum availability but no quorum guarantee. You may read stale data.
I default to N=3, W=2, R=2 for most systems. It tolerates one node failure, gives strong consistency, and the latency cost is modest. Only deviate when you have a concrete reason.
Interview tip: know the common configs by heart
Interviewers expect you to rattle off N=3/W=2/R=2 instantly when quorum consistency comes up. They also expect you to explain why W=1/R=1 is dangerous (stale reads) and when you'd still use it (read-heavy caches where staleness is acceptable).
Fault tolerance math
A quorum-based system can tolerate node failures as long as enough replicas are available to form a quorum:
Max write failures = N - W
Max read failures = N - R
For N=3, W=2: the system tolerates 1 node failure for writes (3-2=1). For reads with R=2: also 1 node down.
For N=5, W=3: tolerate 2 node failures for writes, 2 for reads. This is why five-node clusters are common in consensus systems like etcd and ZooKeeper, tolerating two simultaneous node failures.
There is a subtlety here that catches people off guard. If you configure W=2 and R=2 on a 3-node cluster and two nodes go down, neither reads nor writes work. The system is unavailable even though one healthy node exists. Quorum availability is determined by the majority, not by any single node.
This also means the practical fault tolerance of a system depends on its weakest link. If your cluster has W=2, R=2, the system tolerates 1 node down for both operations. But if even one extra node goes down, the entire system stops. There is no graceful degradation: it is either fully available or fully unavailable for quorum operations.
| N | W | R | Write tolerance | Read tolerance | Total tolerance |
|---|---|---|---|---|---|
| 3 | 2 | 2 | 1 node down | 1 node down | 1 node down |
| 3 | 3 | 1 | 0 nodes down | 2 nodes down | 0 (writes fail first) |
| 5 | 3 | 3 | 2 nodes down | 2 nodes down | 2 nodes down |
| 7 | 4 | 4 | 3 nodes down | 3 nodes down | 3 nodes down |
This is why Cassandra and DynamoDB default to a replication factor of 3 (N=3) with quorum reads and writes (W=2, R=2): one node failure is tolerated while still guaranteeing consistency.
The latency tradeoff
Quorum operations wait for the Wth (or Rth) response, not the fastest:
Write W=2 to 3 replicas:
Replica A responds: 5ms
Replica B responds: 12ms <- W=2, done at 12ms
Replica C responds: 50ms (ignored)
Write W=1 to 3 replicas:
Replica A responds: 5ms <- W=1, done at 5ms
(faster, but no consistency guarantee with R=1)
Your write latency is the latency of the Wth-fastest replica, not the fastest. For W=2 on 3 replicas, you pay the median latency. For W=3, you pay the slowest replica's latency, which is your tail latency.
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.