Quorum consensus math

The problem

You run a key-value store replicated across three nodes. A client writes a user profile update. You could send it to all three replicas and wait for all three acknowledgments, but one node is in a different availability zone with 80ms extra latency. Your p99 write latency just became 80ms because you wait for the slowest replica.

The product team asks: can we make writes faster? You consider waiting for only one replica to acknowledge. The write returns in 5ms. But what happens next?

So you try the other extreme: acknowledge after just one replica confirms. Writes are fast, but a subsequent read might hit one of the two replicas that haven't received the write yet. The client reads stale data. A user updates their profile, refreshes the page, and sees the old profile. They update again. Now you have two conflicting writes.

There is a sweet spot between "wait for everyone" and "wait for nobody." The math behind that sweet spot is quorum consensus. It tells you exactly how many replicas need to confirm a write, and how many a read must contact, so that at least one replica in the read set has the latest write.

This is one of the most fundamental ideas in distributed systems. Every database from Cassandra to CockroachDB uses some form of quorum math. If you understand W + R > N and its implications, you understand 80% of distributed consistency tradeoffs.

The formula is deceptively simple. The subtleties, as we will see, are in the edge cases: what happens during failures, what happens with concurrent writes, and why the formula is necessary but not sufficient for strong consistency.

What it is

A quorum is the minimum number of nodes that must agree on an operation for the system to consider it committed. Think of jury verdicts: you don't need all twelve jurors to agree in a civil trial, just a majority. If 7 out of 12 vote "guilty" and any future review panel polls 7 jurors, at least 2 jurors from the original majority are in the review panel. They carry the verdict forward.

The same logic applies to distributed replicas. If enough replicas confirm a write, and enough replicas are polled on a read, the write and read groups must share at least one member. That shared member carries the truth. The beauty of this approach is that no single node needs to be "the source of truth." The truth emerges from the overlap.

In distributed systems, the core guarantee is a formula:

W + R > N

Where N is the total number of replicas, W is the number of replicas that must acknowledge a write, and R is the number of replicas that must respond to a read. When W + R exceeds N, the write set and read set must overlap by at least one node. That overlapping node carries the latest value. The minimum number of overlapping nodes is exactly W + R - N.

How it works

The quorum read/write protocol has three participants: the coordinator (the node receiving the client request), the replica set, and the client. The coordinator is the orchestrator. It does not necessarily store the data itself; its job is to route writes and reads to the right replicas and enforce the quorum constraints.

Here is a write followed by a read with N=3, W=2, R=2:

The coordinator sends the write to all N replicas but only waits for W acknowledgments. For the remaining N-W replicas, the write propagates asynchronously in the background. On read, the coordinator contacts all N replicas but returns after R responses, picking the value with the highest version. Because W + R > N, at least one read-responder has the latest write.

Here is the pseudocode for the coordinator logic:

function quorum_write(key, value, version):
  acks = 0
  for replica in all_replicas(key):
    send_async(replica, WRITE, key, value, version)

  while acks < W:
    response = await_any_response()
    if response.status == OK:
      acks += 1

  return SUCCESS  // W replicas confirmed

function quorum_read(key):
  responses = []
  for replica in all_replicas(key):
    send_async(replica, READ, key)

  while len(responses) < R:
    response = await_any_response()
    responses.append(response)

  // The overlap guarantee: at least one response has the latest write
  return max_by_version(responses)

The key insight: the coordinator does not need to know which specific replica has the latest value. The pigeonhole principle guarantees that at least one of the R responders is in the W write set, because W + R > N.

The pigeonhole proof

If you have N total slots and you fill W of them (the write set), the remaining unfilled slots are N - W. A read set of R nodes can avoid the write set only if R ≤ N - W, which means R + W ≤ N. The contrapositive: if R + W > N, the read set cannot avoid the write set. At least one node is in both.

The minimum overlap size is W + R - N. For N=3, W=2, R=2, the overlap is 2 + 2 - 3 = 1. For N=5, W=3, R=3, the overlap is 3 + 3 - 5 = 1. For N=5, W=4, R=3, the overlap is 4 + 3 - 5 = 2 (two nodes carry the latest value, providing redundancy even within the read quorum).

Version resolution

When the coordinator receives R responses, it needs to determine which value is "latest." Two common strategies:

Monotonic version numbers: each write increments a global or per-key version counter. The coordinator picks the response with the highest version. This is simple but requires a way to assign monotonically increasing versions, which itself may need a single sequencer or a consensus protocol. In practice, many systems use the coordinator as the sequencer for the keys it owns.

Vector clocks: each response carries a vector clock. The coordinator compares vectors component-wise. If one vector dominates (every component ≥ the other), that value is newer. If neither dominates, the values are concurrent, and the coordinator must return both (or merge them using application-specific logic). This is more complex than version numbers but correctly handles concurrent writes from multiple coordinators.

Cassandra uses timestamps (a form of version number tied to wall-clock time). DynamoDB's original Dynamo paper used vector clocks. The choice affects what happens during concurrent writes: timestamps silently discard one value (Last-Write-Wins), while vector clocks surface the conflict to the application.

I recommend defaulting to vector clocks or version vectors for data where concurrent writes are possible and data loss is unacceptable. Use LWW only when writes to the same key are serialized by design (single-writer pattern) or when the data can be safely overwritten (counters, last-known-location, session TTLs).

Common quorum configurations

N	W	R	W+R	Property
3	2	2	4 > 3	Strong consistency (most common)
3	3	1	4 > 3	Fast reads, slow writes
3	1	3	4 > 3	Fast writes, slow reads
3	1	1	2 ≤ 3	Eventual consistency (violates W+R>N)
5	3	3	6 > 5	Strong consistency in 5-node cluster
5	1	1	2 ≤ 5	High availability, eventual consistency

The W=1, R=1 configuration (used in Cassandra with CONSISTENCY ONE) offers maximum availability but no quorum guarantee. You may read stale data.

I default to N=3, W=2, R=2 for most systems. It tolerates one node failure, gives strong consistency, and the latency cost is modest. Only deviate when you have a concrete reason.

Interview tip: know the common configs by heart

Interviewers expect you to rattle off N=3/W=2/R=2 instantly when quorum consistency comes up. They also expect you to explain why W=1/R=1 is dangerous (stale reads) and when you'd still use it (read-heavy caches where staleness is acceptable).

Fault tolerance math

A quorum-based system can tolerate node failures as long as enough replicas are available to form a quorum:

Max write failures = N - W Max read failures = N - R

For N=3, W=2: the system tolerates 1 node failure for writes (3-2=1). For reads with R=2: also 1 node down.

For N=5, W=3: tolerate 2 node failures for writes, 2 for reads. This is why five-node clusters are common in consensus systems like etcd and ZooKeeper, tolerating two simultaneous node failures.

There is a subtlety here that catches people off guard. If you configure W=2 and R=2 on a 3-node cluster and two nodes go down, neither reads nor writes work. The system is unavailable even though one healthy node exists. Quorum availability is determined by the majority, not by any single node.

This also means the practical fault tolerance of a system depends on its weakest link. If your cluster has W=2, R=2, the system tolerates 1 node down for both operations. But if even one extra node goes down, the entire system stops. There is no graceful degradation: it is either fully available or fully unavailable for quorum operations.

N	W	R	Write tolerance	Read tolerance	Total tolerance
3	2	2	1 node down	1 node down	1 node down
3	3	1	0 nodes down	2 nodes down	0 (writes fail first)
5	3	3	2 nodes down	2 nodes down	2 nodes down
7	4	4	3 nodes down	3 nodes down	3 nodes down

This is why Cassandra and DynamoDB default to a replication factor of 3 (N=3) with quorum reads and writes (W=2, R=2): one node failure is tolerated while still guaranteeing consistency.

The latency tradeoff

Quorum operations wait for the Wth (or Rth) response, not the fastest:

Write W=2 to 3 replicas:
  Replica A responds: 5ms
  Replica B responds: 12ms  <- W=2, done at 12ms
  Replica C responds: 50ms  (ignored)

Write W=1 to 3 replicas:
  Replica A responds: 5ms   <- W=1, done at 5ms
  (faster, but no consistency guarantee with R=1)

Your write latency is the latency of the Wth-fastest replica, not the fastest. For W=2 on 3 replicas, you pay the median latency. For W=3, you pay the slowest replica's latency, which is your tail latency.

The problem

The product team asks: can we make writes faster? You consider waiting for only one replica to acknowledge. The write returns in 5ms. But what happens next?

What it is

In distributed systems, the core guarantee is a formula:

W + R > N

How it works

Here is a write followed by a read with N=3, W=2, R=2:

Here is the pseudocode for the coordinator logic:

function quorum_write(key, value, version):
  acks = 0
  for replica in all_replicas(key):
    send_async(replica, WRITE, key, value, version)

  while acks < W:
    response = await_any_response()
    if response.status == OK:
      acks += 1

  return SUCCESS  // W replicas confirmed

function quorum_read(key):
  responses = []
  for replica in all_replicas(key):
    send_async(replica, READ, key)

  while len(responses) < R:
    response = await_any_response()
    responses.append(response)

  // The overlap guarantee: at least one response has the latest write
  return max_by_version(responses)

The pigeonhole proof

Version resolution

When the coordinator receives R responses, it needs to determine which value is "latest." Two common strategies:

Common quorum configurations

N	W	R	W+R	Property
3	2	2	4 > 3	Strong consistency (most common)
3	3	1	4 > 3	Fast reads, slow writes
3	1	3	4 > 3	Fast writes, slow reads
3	1	1	2 ≤ 3	Eventual consistency (violates W+R>N)
5	3	3	6 > 5	Strong consistency in 5-node cluster
5	1	1	2 ≤ 5	High availability, eventual consistency

The W=1, R=1 configuration (used in Cassandra with CONSISTENCY ONE) offers maximum availability but no quorum guarantee. You may read stale data.

I default to N=3, W=2, R=2 for most systems. It tolerates one node failure, gives strong consistency, and the latency cost is modest. Only deviate when you have a concrete reason.

Interview tip: know the common configs by heart

Fault tolerance math

A quorum-based system can tolerate node failures as long as enough replicas are available to form a quorum:

Max write failures = N - W Max read failures = N - R

For N=3, W=2: the system tolerates 1 node failure for writes (3-2=1). For reads with R=2: also 1 node down.

For N=5, W=3: tolerate 2 node failures for writes, 2 for reads. This is why five-node clusters are common in consensus systems like etcd and ZooKeeper, tolerating two simultaneous node failures.

N	W	R	Write tolerance	Read tolerance	Total tolerance
3	2	2	1 node down	1 node down	1 node down
3	3	1	0 nodes down	2 nodes down	0 (writes fail first)
5	3	3	2 nodes down	2 nodes down	2 nodes down
7	4	4	3 nodes down	3 nodes down	3 nodes down

This is why Cassandra and DynamoDB default to a replication factor of 3 (N=3) with quorum reads and writes (W=2, R=2): one node failure is tolerated while still guaranteeing consistency.

The latency tradeoff

Quorum operations wait for the Wth (or Rth) response, not the fastest:

Write W=2 to 3 replicas:
  Replica A responds: 5ms
  Replica B responds: 12ms  <- W=2, done at 12ms
  Replica C responds: 50ms  (ignored)

Write W=1 to 3 replicas:
  Replica A responds: 5ms   <- W=1, done at 5ms
  (faster, but no consistency guarantee with R=1)

Quorum consensus math

The problem

What it is

How it works

The pigeonhole proof

Version resolution

Common quorum configurations

Fault tolerance math

The latency tradeoff

Continue Reading with Premium

Comments

Quorum consensus math

The problem

What it is

How it works

The pigeonhole proof

Version resolution

Common quorum configurations

Fault tolerance math

The latency tradeoff

Continue Reading with Premium

Comments