Consensus algorithms
How distributed systems agree on a single value despite crashes and network partitions. Covers Raft leader election, Paxos, quorum math, and when consensus is worth its latency cost.
TL;DR
- Consensus is the problem of getting N distributed nodes to agree on a single value, even when some nodes crash or messages get lost.
- Once a value is committed by a quorum (a majority, N/2+1 nodes), every future read from any quorum returns that same value.
- Paxos (Lamport, 1989) proved consensus is solvable. Raft (2014) made it implementable by choosing clarity over generality.
- All consensus protocols pay a latency cost: every write requires a network round-trip to a majority of nodes before acknowledgment.
- In interviews, consensus shows up whenever you need leader election, distributed locks, or linearizable configuration stores (etcd, ZooKeeper, Consul).
The Problem It Solves
You have three database nodes behind a load balancer. A user updates their email address. The write lands on Node A, which confirms success. Milliseconds later, the user refreshes the page, and the load balancer routes the read to Node C. Node C hasn't received the write yet. The user sees their old email and files a support ticket: "your update didn't work."
That's the mild version. Now imagine two clients simultaneously try to claim the last seat on a flight. Node A accepts Client 1's reservation. Node B accepts Client 2's reservation. Neither node knows about the other's decision. The flight is now double-booked, and someone gets turned away at the gate.
Without consensus, every node operates on its own view of the world. Writes succeed locally with no coordination, and when the network heals, there's no principled way to decide which conflicting write wins.
This is the split-brain problem. Without a protocol that forces agreement before commitment, any multi-node system is one partition away from inconsistency. I see candidates in interviews hand-wave this by saying "just use replication," but replication without consensus is exactly how you get split-brain.
Consensus algorithms solve this by requiring a majority of nodes to agree on every committed value. If the network splits, only the partition with a majority can continue accepting writes. The minority partition blocks rather than diverge.
What Is It?
Consensus is a protocol that ensures a group of N nodes agrees on a single value for each decision slot, even when up to floor(N/2) nodes crash or become unreachable.
Think of a jury. Twelve people must reach a verdict (or in many systems, a majority). Jurors deliberate, propose, and vote. If some jurors are absent, the remaining ones can still reach a verdict as long as enough are present. Once the verdict is reached, it's final; returning jurors don't get to overrule it.
In distributed systems, consensus provides three guarantees:
- Agreement: all non-faulty nodes decide on the same value.
- Validity: the decided value was proposed by some node (no fabrication).
- Termination: all non-faulty nodes eventually decide (no infinite stalling).
The FLP impossibility result (Fischer, Lynch, Paterson, 1985) proved that no deterministic consensus algorithm can guarantee all three properties in an asynchronous system with even one crash failure. Every practical algorithm sidesteps this by using timeouts to detect failures, making the system partially synchronous rather than fully asynchronous.
For your interview: say "consensus requires a majority of nodes to agree before committing, and the FLP result means all practical implementations use timeouts." That's the one-liner that signals depth.
Consensus does not mean all nodes respond
A common misconception: consensus requires all nodes to acknowledge. It doesn't. It requires a majority (quorum). In a 5-node cluster, 3 nodes agreeing is sufficient. The remaining 2 can be down, partitioned, or slow. They catch up when they reconnect, but the decision is already final.
How It Works
Most modern systems use Raft, so I'll walk through Raft's mechanics. Paxos solves the same problem with different terminology, but Raft was designed to be understandable, and that's what you'll implement in practice.
Raft: Leader Election
Every node in a Raft cluster is in one of three states: Follower, Candidate, or Leader. Time is divided into terms (logical epochs). Each term has at most one leader.
- All nodes start as Followers, listening for heartbeats from the Leader.
- If a Follower receives no heartbeat within a randomized timeout (typically 150-300ms), it transitions to Candidate and increments the term.
- The Candidate votes for itself and sends RequestVote RPCs to all other nodes.
- Each node votes for at most one candidate per term (first-come-first-served).
- If the Candidate receives votes from a majority, it becomes Leader.
- The Leader immediately sends heartbeats to all Followers to establish authority and prevent new elections.
The randomized election timeout is the key design choice. It ensures that in most cases, one node times out before the others, wins the election without contention, and avoids split votes. If two candidates split the vote (neither gets a majority), both retry with new randomized timeouts. I find this is the detail that clicks for people: randomness breaks symmetry cheaply.
Raft: Log Replication
Once a leader is elected, all client writes go through it. The leader appends each write to its log, replicates it to followers, and commits it once a quorum acknowledges.
- Client sends write to Leader:
set x = 5 - Leader appends to its log:
[term=2, index=42, cmd="set x 5"] - Leader sends AppendEntries RPC to all Followers
- Each Follower appends the entry and responds with ack
- Leader counts acks; once N/2+1 nodes have the entry, it's committed
- Leader applies committed entry to state machine and responds to client
- On the next heartbeat, Leader tells Followers the new commit index
- Followers apply committed entries to their state machines
The critical invariant: if an entry is committed, it exists on a majority of nodes. Any future leader must have been elected by a majority, so it must have that entry. Committed entries are never lost.
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.