Consensus algorithms
How distributed systems agree on a single value despite crashes and network partitions. Covers Raft leader election, Paxos, quorum math, and when consensus is worth its latency cost.
TL;DR
- Consensus is the problem of getting N distributed nodes to agree on a single value, even when some nodes crash or messages get lost.
- Once a value is committed by a quorum (a majority, N/2+1 nodes), every future read from any quorum returns that same value.
- Paxos (Lamport, circulated from 1989, published 1998) proved consensus is solvable. Raft (2014) made it implementable by choosing clarity over generality.
- All consensus protocols pay a latency cost: every write requires a network round-trip to a majority of nodes before acknowledgment.
- In interviews, consensus shows up whenever you need leader election, distributed locks, or linearizable configuration stores (etcd, ZooKeeper, Consul).
The Problem It Solves
You have three database nodes behind a load balancer. A user updates their email address. The write lands on Node A, which confirms success. Milliseconds later, the user refreshes the page, and the load balancer routes the read to Node C. Node C hasn't received the write yet. The user sees their old email and files a support ticket: "your update didn't work."
That's the mild version. Now imagine two clients simultaneously try to claim the last seat on a flight. Node A accepts Client 1's reservation. Node B accepts Client 2's reservation. Neither node knows about the other's decision. The flight is now double-booked, and someone gets turned away at the gate.
Without consensus, every node operates on its own view of the world. Writes succeed locally with no coordination, and when the network heals, there's no principled way to decide which conflicting write wins.
This is the split-brain problem. Without a protocol that forces agreement before commitment, any multi-node system is one partition away from inconsistency. I see candidates in interviews hand-wave this by saying "just use replication," but replication without consensus is exactly how you get split-brain.
Consensus algorithms solve this by requiring a majority of nodes to agree on every committed value. If the network splits, only the partition with a majority can continue accepting writes. The minority partition blocks rather than diverging.
What Is It?
Consensus is a protocol that ensures a group of N nodes agrees on a single value for each decision slot, even when up to floor(N/2) nodes crash or become unreachable.
Think of a jury. Twelve people must reach a verdict (or in many systems, a majority). Jurors deliberate, propose, and vote. If some jurors are absent, the remaining ones can still reach a verdict as long as enough are present. Once the verdict is reached, it's final; returning jurors don't get to overrule it.
In distributed systems, consensus provides three guarantees:
- Agreement: all non-faulty nodes decide on the same value.
- Validity: the decided value was proposed by some node (no fabrication).
- Termination: all non-faulty nodes eventually decide (no infinite stalling).
The FLP impossibility result (Fischer, Lynch, Paterson, 1985) proved that no deterministic consensus algorithm can guarantee all three properties in an asynchronous system with even one crash failure. Every practical algorithm sidesteps this by using timeouts to detect failures, making the system partially synchronous rather than fully asynchronous.
For your interview: say "consensus requires a majority of nodes to agree before committing, and the FLP result means all practical implementations use timeouts." That's the one-liner that signals depth.
Consensus does not mean all nodes respond
A common misconception: consensus requires all nodes to acknowledge. It doesn't. It requires a majority (quorum). In a 5-node cluster, 3 nodes agreeing is sufficient. The remaining 2 can be down, partitioned, or slow. They catch up when they reconnect, but the decision is already final.
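The quorum arithmetic behind this is worth internalizing. A minimal sketch (the function name `quorum` is my own) showing why any two quorums of the same cluster must overlap, which is what makes a committed decision visible to every future quorum:

```python
def quorum(n: int) -> int:
    """Minimum number of agreeing nodes: floor(n/2) + 1."""
    return n // 2 + 1

# Any two quorums drawn from the same N nodes share at least one node:
# 2 * quorum(n) - n >= 1. That shared node is why a value committed by
# one quorum is seen by every later quorum, even with other nodes down.
for n in (3, 5, 7):
    q = quorum(n)
    print(f"N={n}: quorum={q}, guaranteed overlap={2 * q - n}")
```

The overlap node is the mechanism, not a coincidence: it carries the committed value into any future election or read quorum.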
How It Works
Most modern systems use Raft, so I'll walk through Raft's mechanics. Paxos solves the same problem with different terminology, but Raft was designed to be understandable, and that's what you'll implement in practice.
Raft: Leader Election
Every node in a Raft cluster is in one of three states: Follower, Candidate, or Leader. Time is divided into terms (logical epochs). Each term has at most one leader.
- All nodes start as Followers, listening for heartbeats from the Leader.
- If a Follower receives no heartbeat within a randomized timeout (typically 150-300ms), it transitions to Candidate and increments the term.
- The Candidate votes for itself and sends RequestVote RPCs to all other nodes.
- Each node votes for at most one candidate per term (first-come-first-served).
- If the Candidate receives votes from a majority, it becomes Leader.
- The Leader immediately sends heartbeats to all Followers to establish authority and prevent new elections.
The randomized election timeout is the key design choice. It ensures that in most cases, one node times out before the others, wins the election without contention, and avoids split votes. If two candidates split the vote (neither gets a majority), both retry with new randomized timeouts. I find this is the detail that clicks for people: randomness breaks symmetry cheaply.
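The symmetry-breaking effect of randomized timeouts can be sketched with a toy simulation. This is a simplification, not real Raft: I collapse the RPC exchange into a single rule (the first node to time out wins, unless a second node wakes up within one RPC delay, which I treat as a split vote and retry). All names here are hypothetical:

```python
import random

def run_election(node_ids, lo_ms=150.0, hi_ms=300.0, rpc_ms=10.0, rng=None):
    """Toy model: each node draws a random election timeout; the earliest
    timer usually wins uncontested. Near-ties simulate a split vote and
    trigger a retry with fresh random timeouts."""
    rng = rng or random.Random()
    rounds = 0
    while True:
        rounds += 1
        timeouts = {n: rng.uniform(lo_ms, hi_ms) for n in node_ids}
        ordered = sorted(timeouts, key=timeouts.get)
        first, second = ordered[0], ordered[1]
        if timeouts[second] - timeouts[first] > rpc_ms:
            # The first candidate collects votes before anyone else wakes up.
            return first, rounds

winner, rounds = run_election(["a", "b", "c", "d", "e"], rng=random.Random(1))
print(f"leader={winner} after {rounds} round(s)")
```

With a 150ms spread and a ~10ms RPC delay, most elections resolve in one round, which matches the design intent: randomness makes contention rare rather than impossible.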
Raft: Log Replication
Once a leader is elected, all client writes go through it. The leader appends each write to its log, replicates it to followers, and commits it once a quorum acknowledges.
- Client sends write to Leader: `set x = 5`
- Leader appends to its log: `[term=2, index=42, cmd="set x 5"]`
- Leader sends AppendEntries RPC to all Followers
- Each Follower appends the entry and responds with ack
- Leader counts acks; once N/2+1 nodes have the entry, it's committed
- Leader applies committed entry to state machine and responds to client
- On the next heartbeat, Leader tells Followers the new commit index
- Followers apply committed entries to their state machines
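The leader's commit decision in the steps above reduces to a quorum count over replication progress. A minimal sketch (function and parameter names are my own; real Raft adds the rule that a leader only commits entries from its own term):

```python
def commit_index(leader_last_index: int, match_index: dict) -> int:
    """Highest log index replicated on a majority.

    match_index maps follower id -> highest log index known replicated
    there; the leader counts its own log as one replica.
    """
    indexes = sorted(list(match_index.values()) + [leader_last_index],
                     reverse=True)
    n = len(indexes)            # cluster size
    quorum = n // 2 + 1
    return indexes[quorum - 1]  # quorum-th highest index is safe to commit

# 5-node cluster: leader at index 42, followers at 42, 41, 40, 38.
# Three nodes (a quorum) hold index >= 41, so 41 is committed;
# index 42 is still in flight on a minority.
print(commit_index(42, {"f1": 42, "f2": 41, "f3": 40, "f4": 38}))  # 41
```

Note the asymmetry: the leader never waits for the slowest nodes, only for enough of the fastest ones to form a majority.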
The critical invariant: if an entry is committed, it exists on a majority of nodes. Any future leader must have been elected by a majority, so it must have that entry. Committed entries are never lost.
Raft: Log Safety
Raft ensures that if a log entry is committed, it will never be overwritten:
- A new leader can only be elected if it has the most up-to-date log. Voters reject candidates whose log is less up-to-date than their own (a lower term on the last entry, or the same last-entry term but a shorter log).
- All committed entries are present on a majority of nodes. A new leader must get votes from a majority. The overlap guarantees the leader has every committed entry.
This is the core safety property. Uncommitted entries (replicated to a minority before the old leader crashed) may be overwritten by the new leader, which is safe because clients never received confirmation for those writes.
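The voter's "up-to-date" comparison is small enough to write out. A sketch with my own function name, following the rule stated above: compare the term of the last log entry first, and fall back to log length only on a tie:

```python
def candidate_log_ok(cand_last_term: int, cand_last_index: int,
                     my_last_term: int, my_last_index: int) -> bool:
    """Grant a vote only if the candidate's log is at least as
    up-to-date as the voter's: last-entry term wins, length breaks ties."""
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index

# A voter whose log ends at (term=2, index=42) rejects a candidate
# ending at (term=2, index=40): same term, shorter log.
print(candidate_log_ok(2, 40, 2, 42))  # False
# A higher last-entry term wins even with a shorter log.
print(candidate_log_ok(3, 10, 2, 42))  # True
```

Combined with quorum overlap, this check is what guarantees the new leader holds every committed entry: some voter in its majority must have it, and that voter would have refused a candidate missing it.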
Quorum Math
For a cluster of N nodes:
- Quorum = floor(N/2) + 1
- Maximum tolerated failures = N - quorum = floor(N/2)
| Cluster size | Quorum | Tolerated failures | Notes |
|---|---|---|---|
| 3 | 2 | 1 | Minimum production cluster |
| 5 | 3 | 2 | Standard for etcd, ZooKeeper |
| 7 | 4 | 3 | Rare; higher write latency |
| 4 | 3 | 1 | Same fault tolerance as 3, more expensive |
| 6 | 4 | 2 | Same fault tolerance as 5, more expensive |
Even-numbered clusters are a waste. A 4-node cluster tolerates exactly 1 failure (same as 3 nodes) but requires 3 nodes for quorum instead of 2. I see candidates propose even-numbered clusters in interviews more often than you'd think. Always choose odd counts: 3, 5, or 7.
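The table above, including the even-cluster penalty, falls directly out of the two formulas. A quick check (function names are my own):

```python
def quorum(n: int) -> int:
    return n // 2 + 1

def tolerated(n: int) -> int:
    return n - quorum(n)  # equivalently floor(n/2)

for n in (3, 4, 5, 6, 7):
    print(f"N={n}: quorum={quorum(n)}, tolerated failures={tolerated(n)}")
# tolerated(4) == tolerated(3) and tolerated(6) == tolerated(5):
# the extra even node buys no fault tolerance, only a larger quorum
# (3 acks instead of 2, 4 instead of 3) on every single write.
```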
The cluster-size rule of thumb
3 nodes for most production systems. 5 nodes if you need to survive 2 simultaneous failures (multi-AZ deployments across 5 zones). 7 is almost never worth the write-latency penalty. If an interviewer asks "how many nodes?", say 5 and explain the quorum math.
Key Components
| Component | Role |
|---|---|
| Leader | Single node that accepts all writes, replicates to followers, and decides when entries are committed |
| Follower | Passive node that replicates the leader's log and serves as a vote in elections |
| Candidate | Transient state during leader election; a follower that has timed out and is requesting votes |
| Term | Logical clock that increments with each election; acts as a fencing mechanism to reject stale leaders |
| Log | Ordered sequence of commands; each entry has an index and term; forms the replicated state machine input |
| Commit index | The highest log index known to be replicated on a majority; entries up to this point are safe to apply |
| Heartbeat | Periodic message from leader to followers that resets election timeouts and carries commit index updates |
| Quorum | Minimum number of nodes (N/2+1) that must agree for a value to be committed or a leader to be elected |
Types / Variations
| Dimension | Paxos | Raft | ZAB (ZooKeeper) | Viewstamped Replication |
|---|---|---|---|---|
| Published | 1998 (Lamport; circulated from 1989) | 2014 (Ongaro, Ousterhout) | 2008 (Junqueira et al.) | 1988 (Oki, Liskov) |
| Design priority | Mathematical proof | Understandability | ZooKeeper-specific | Early practical protocol |
| Leader role | Optional (can be leaderless) | Central; all writes through leader | Central; similar to Raft | Central; view change on failure |
| Log structure | Multi-decree, holes allowed | Sequential, no holes | Sequential, similar to Raft | Sequential |
| Recovery | Complex (multiple sub-protocols) | Simple (leader sends missing entries) | Snapshot + log replay | View change protocol |
| Multi-round optimization | Multi-Paxos skips prepare phase | N/A (inherently multi-round) | N/A | N/A |
| Used by | Google Spanner, Chubby | etcd, CockroachDB, TiKV, Consul | Apache ZooKeeper | Rarely in production |
| Learning curve | High (notoriously hard to follow) | Low (single paper, many explanations) | Medium | Medium |
Paxos was the first rigorous consensus algorithm. The original paper ("The Part-Time Parliament") is famously difficult to follow. Multi-Paxos optimizes for the common case where the same leader proposes multiple values in sequence, skipping the prepare phase after the first round.
Raft was explicitly designed as "Paxos made understandable." It sacrifices some generality (requires a leader, no log holes) for drastically simpler implementation. My recommendation: if you're building a consensus system, use Raft. If an interviewer asks about Paxos, explain the phases at a high level and pivot to Raft for the detailed walkthrough.
ZAB (ZooKeeper Atomic Broadcast) is structurally similar to Raft but predates it. It was designed specifically for ZooKeeper's needs: ordered broadcast of state changes. The main difference is recovery: ZAB uses a dedicated synchronization phase where the new leader brings followers up to date before accepting new requests.
Viewstamped Replication (1988) predates both Paxos and Raft. It introduced the concept of "views" (similar to Raft's terms) and view-change protocols for leader replacement. It's historically important but rarely used in modern systems.
Trade-offs
| Advantage | Disadvantage |
|---|---|
| Strong consistency: committed values are never lost | Write latency: every write needs a network round-trip to a majority |
| Fault tolerance: survives floor(N/2) node failures | Write throughput: bounded by the slowest majority node |
| Automatic leader election: no single point of failure | Operational complexity: harder to deploy and monitor than a single node |
| Linearizable reads (if done through the leader) | Cluster size must be odd; scaling writes horizontally is not straightforward |
| No split-brain: only one partition can make progress | Leader bottleneck: all writes flow through one node |
The fundamental tension is consistency vs. latency. Consensus guarantees that every committed write survives failures, but it pays for that guarantee with network round-trips on every write. A 5-node cluster spread across data centers might add 10-50ms per write. For systems where eventual consistency is acceptable, skipping consensus gives you dramatically lower write latency.
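The latency cost has a useful subtlety: a write is gated by the slowest member of the fastest majority, not by the slowest node overall. A back-of-envelope sketch (the function name and RTT figures are illustrative assumptions, not measurements):

```python
def write_latency_ms(follower_rtts_ms: list, n_nodes: int = None) -> float:
    """Approximate commit latency: the leader counts itself, so it needs
    acks from quorum - 1 followers, gated by the slowest of those."""
    n = n_nodes if n_nodes is not None else len(follower_rtts_ms) + 1
    acks_needed = (n // 2 + 1) - 1
    return sorted(follower_rtts_ms)[acks_needed - 1]

# 5-node cluster spanning regions: follower RTTs of 2, 3, 40, 45 ms.
# Quorum needs only 2 follower acks, so a write commits after ~3 ms,
# not the worst-case 45 ms of the farthest replica.
print(write_latency_ms([2, 3, 40, 45]))  # 3
```

This is why placement matters as much as cluster size: keeping a majority of replicas close to the leader keeps writes fast even when some replicas are far away.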
When to Use It / When to Avoid It
Use consensus when:
- You need a single source of truth that multiple services trust (configuration, feature flags, service discovery)
- Leader election must be automatic and partition-safe (database primary failover, scheduler master)
- Distributed locks must be correct under partitions (not "probably correct" like Redis locks)
- You need linearizable key-value storage (etcd, Consul KV)
- Financial or booking systems where double-processing is unacceptable
Avoid consensus when:
- Read-heavy workloads where eventual consistency is fine (product catalogs, social feeds)
- High-write-throughput systems where consensus latency is unacceptable (metrics ingestion, logging pipelines)
- You only have one database node (consensus is meaningless without multiple nodes)
- The data is easily re-derivable or idempotent (cache warming, search index rebuilds)
If you're unsure whether your system needs consensus, it probably doesn't. Consensus is the heaviest consistency primitive in distributed systems. Use it deliberately, not by default.
Real-World Examples
etcd (Kubernetes): Every Kubernetes cluster runs a 3- or 5-node etcd cluster as its brain. All cluster state (pod specs, service endpoints, config maps) is stored in etcd via Raft consensus. When you run kubectl apply, the API server writes to etcd, which replicates to a quorum before acknowledging. A 3-node etcd cluster survives 1 node failure without interruption. Write latency is typically sub-10ms within a single data center.
CockroachDB: Uses Raft for range-level consensus. Each range (a contiguous slice of the keyspace, typically 512MB) has its own Raft group. A CockroachDB cluster with 9 nodes and replication factor 3 runs thousands of independent Raft groups. This allows writes to different key ranges to proceed in parallel, sidestepping the single-leader write bottleneck. Write latencies of 5-15ms for single-range writes within a region.
Google Spanner: Uses a variant of Multi-Paxos for consensus across globally distributed nodes. Spanner famously uses GPS clocks and atomic clocks (TrueTime) to bound clock uncertainty, enabling externally consistent reads without contacting a quorum. The TrueTime uncertainty is typically under 7ms. Spanner proved you can have global consensus with single-digit millisecond write latency if you invest enough in clock infrastructure.
Apache ZooKeeper: Uses ZAB for consensus. ZooKeeper clusters (typically 3 or 5 nodes) provide distributed coordination primitives: locks, barriers, leader election, and config storage. ZooKeeper was the coordination backbone of Hadoop and early Kafka, and it's still widely deployed. Write throughput is approximately 10K-20K writes/second for a 3-node cluster.
How This Shows Up in Interviews
When to bring it up
Consensus is rarely the main topic of a system design interview, but it's the correct answer to a specific class of sub-problems:
- "How does the system elect a new primary?" → Consensus (Raft-based leader election)
- "How do we ensure this critical value isn't duplicated?" → Consensus-backed distributed lock
- "Where is the cluster configuration stored?" → etcd or ZooKeeper (mention it uses Raft/ZAB)
- "How does service discovery work?" → Registration in a consensus store
Don't explain Raft's full algorithm unless asked. Say "the coordination layer uses etcd, which provides Raft-based consensus" and wait for the interviewer to probe deeper.
Depth you need
- Explain quorum math: N/2+1 for commits and elections, odd cluster sizes only
- Describe leader election at a high level: randomized timeouts, majority vote, term fencing
- Know why consensus is expensive: network round-trip per write, bounded by slowest quorum member
- Understand the FLP result in one sentence: "No deterministic algorithm can guarantee consensus in an asynchronous system with even one failure, so all practical implementations use timeouts"
- Articulate when you don't need consensus: eventual consistency is fine for most reads
Interview shortcut: name the system, not the algorithm
Most interviewers don't want you to implement Raft on a whiteboard. They want to hear: "I'd use etcd for leader election because it provides Raft-based consensus with linearizable reads." Name the system, name the algorithm, and explain the quorum guarantee. That's sufficient for 90% of interviews.
Follow-up Q&A
| Interviewer asks | Strong answer |
|---|---|
| "How many nodes for your consensus cluster?" | "5 nodes across 3 AZs. Quorum is 3, so we survive any 2 node failures. 3 nodes is the minimum for production but only tolerates 1 failure." |
| "What happens if the leader crashes?" | "Followers detect the missing heartbeat within 150-300ms. The fastest follower starts an election. A new leader is elected and resumes operations. Total downtime is typically under 1 second." |
| "Why not use consensus for every write in the system?" | "Consensus adds a network round-trip per write. For a globally distributed cluster, that's 50-200ms per write. Most data (feeds, logs, cached results) doesn't need that guarantee. Reserve consensus for coordination data." |
| "What's the difference between Raft and Paxos?" | "Same problem, same quorum math. Raft requires a leader and sequential logs, making it simpler to implement. Paxos allows leaderless operation and log holes, making it more general but harder to reason about. Most new systems choose Raft." |
| "Can consensus help with read scalability?" | "Not directly. All writes still go through the leader. You can serve reads from followers (stale reads) or from the leader (linearizable reads). For read-heavy workloads, caching or eventually-consistent replicas are better tools." |
Test Your Understanding
Quick Recap
- Consensus solves the fundamental problem of getting distributed nodes to agree on a single value, even when some nodes crash or networks partition.
- A quorum of N/2+1 nodes must agree before a value is committed, which is why consensus clusters always use odd numbers (3, 5, 7).
- Raft structures consensus around a single leader: all writes go through the leader, get replicated to followers, and are committed once a quorum acknowledges.
- Leader election uses randomized timeouts so that one candidate starts before others, winning the election without a second round in most cases.
- The FLP impossibility result means all practical consensus algorithms use timeouts, making them partially synchronous rather than fully asynchronous.
- In interviews, name the system (etcd, ZooKeeper) rather than implementing the algorithm on the spot. Explain quorum math and articulate when consensus is overkill.
- The fundamental trade-off is consistency vs. latency: every committed write pays a network round-trip to a majority before acknowledgment.
Related Concepts
- CAP Theorem: Consensus is how CP systems maintain consistency during partitions, at the cost of availability in the minority partition.
- Consistency Models: Consensus provides linearizability, the strongest consistency model. Understanding the spectrum helps you articulate why consensus is expensive and when weaker models suffice.
- Replication: Replication copies data to multiple nodes, but without consensus the replicas can diverge. Consensus-based replication (Raft log replication) prevents divergence.
- Distributed Locking: Consensus-backed locks (etcd, ZooKeeper) provide the strongest correctness guarantees for distributed mutual exclusion.