Paxos consensus algorithm
How Paxos achieves distributed consensus across unreliable nodes, the Proposer/Acceptor/Learner roles, the two-phase protocol, and why Raft replaced it for most implementations.
The problem
Your five-node key-value store needs to replicate a write. The client sends SET leader = nodeA to three nodes. At the same moment, a network partition splits the cluster: two nodes see the write, three do not. A second client, talking to the other partition, sends SET leader = nodeB. Both partitions accept their local write. When the partition heals, the cluster has two conflicting values for the same key and no way to decide which one wins.
This is not a hypothetical. Google's early Bigtable clusters hit exactly this problem: without a formal agreement protocol, split-brain scenarios produced silent data corruption. The question is simple but deceptively hard: how do N servers agree on a single value when any of them can crash at any time and messages can arrive out of order, be duplicated, or be lost entirely?
This is the problem Paxos solves. It was the first provably correct consensus algorithm for asynchronous networks (no bound on message delivery time), published by Leslie Lamport in 1998.
What it is
Paxos is a two-phase voting protocol that allows a group of nodes to agree on exactly one value, even when some nodes crash and messages are delayed or lost. The protocol guarantees safety (no two nodes decide different values) in all cases; progress (a value is eventually chosen) requires that a majority of nodes are alive and can communicate, and that competing proposers do not preempt each other indefinitely.
Analogy: Imagine a committee voting on a new policy. The chairperson proposes a draft and asks each member, "Will you support this draft number?" Each member promises not to support any older draft. The chairperson collects promises from a majority, then sends the final version for a binding vote. Even if some members leave the room, as long as a majority voted, the policy is decided. No minority can later overturn it.
Interview tip: the one-liner
"Paxos is a two-phase majority-vote protocol: Prepare locks out older proposals, Accept gets the value committed. Safety holds always; liveness requires a majority to be reachable."
Three roles
Every node can play multiple roles simultaneously:
- Proposer: initiates a proposal. Wants to get a value accepted. Typically the current leader or the node receiving a client request.
- Acceptor: votes on proposals. The quorum is formed by a majority of acceptors. Stores the highest proposal number it has seen and the last value it accepted.
- Learner: learns the agreed-upon value once consensus is reached. May be a read replica or an application-layer observer.
A cluster of 5 nodes usually has all 5 act as acceptors and learners, with one or a few acting as proposers. The majority quorum is 3.
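The quorum arithmetic is simple enough to sketch directly (a minimal illustration, not tied to any particular implementation):

```python
def quorum_size(n):
    """Smallest majority of n acceptors: floor(n/2) + 1."""
    return n // 2 + 1

def failures_tolerated(n):
    """How many nodes can crash while a majority stays reachable."""
    return n - quorum_size(n)

for n in (3, 5, 7):
    print(n, quorum_size(n), failures_tolerated(n))
# 3 -> quorum 2, tolerates 1 failure
# 5 -> quorum 3, tolerates 2 failures
# 7 -> quorum 4, tolerates 3 failures
```

Note that growing from 5 to 7 nodes buys one extra failure but also raises the quorum size, which is why most deployments stop at 3 or 5.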
I find it helpful to think of proposers as "the mouth" and acceptors as "the memory." Proposers drive the protocol forward, but acceptors hold the durable state that guarantees safety.
How it works
The protocol runs in two phases. Phase 1 establishes authority (locks out stale proposals). Phase 2 gets the value committed. Every round uses a globally unique, monotonically increasing proposal number N.
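How proposal numbers stay globally unique is left open by the protocol; one common scheme (an assumption here, not prescribed by Paxos) encodes the node ID into the number:

```python
def next_proposal_number(highest_seen, node_id, num_nodes):
    """Generate a globally unique, monotonically increasing proposal
    number as round * num_nodes + node_id. Distinct nodes can never
    collide (numbers differ mod num_nodes), and bumping the round
    outbids any number seen so far."""
    round_ = highest_seen // num_nodes + 1
    return round_ * num_nodes + node_id

n = next_proposal_number(0, node_id=2, num_nodes=5)   # -> 7
m = next_proposal_number(n, node_id=4, num_nodes=5)   # -> 14, outbids 7
```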
Phase 1: Prepare and Promise
The proposer picks a proposal number N larger than any it has used before. It sends Prepare(N) to the acceptors (all of them, or at least a majority). Each acceptor checks whether N exceeds the highest number it has already promised. If so, it replies with a Promise: "I will reject any proposal numbered less than N." The promise also carries the highest-numbered proposal this acceptor previously accepted, along with its value (if any).
```
Proposer -> Acceptors: Prepare(N=7)
Acceptor A response:   Promise(N=7), prevAccepted=(N=5, value="commit")
Acceptor B response:   Promise(N=7), prevAccepted=null
Acceptor C response:   Promise(N=7), prevAccepted=(N=3, value="commit")
```
If the proposer receives promises from a majority, phase 1 succeeds.
Phase 2: Accept and Accepted
The proposer must now choose the value to propose. Critical rule: if any acceptor's promise carried a previously accepted value, the proposer must use the value from the highest-numbered prior acceptance. The proposer cannot pick a fresh value if any prior value was partially accepted. This is the core safety mechanism.
```
Proposer: received prevAccepted at N=5 ("commit") and N=3 ("commit")
          must use value="commit" (from highest N=5)
Proposer -> Acceptors: Accept(N=7, value="commit")
Acceptor A: 7 >= promised 7, accepts -> Accepted(N=7, "commit")
Acceptor B: 7 >= promised 7, accepts -> Accepted(N=7, "commit")
Acceptor C: 7 >= promised 7, accepts -> Accepted(N=7, "commit")
```
When the proposer receives Accepted from a majority, the value is chosen. The proposer notifies learners.
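The value-selection rule can be isolated into a few lines (a sketch; the promise representation here is an assumption, not mandated by the protocol):

```python
def choose_value(promises, fresh_value):
    """Paxos value-selection rule. Each promise is a pair
    (accepted_n, accepted_value), with accepted_n = None when the
    acceptor has never accepted anything."""
    prior = [p for p in promises if p[0] is not None]
    if prior:
        # Must adopt the value from the highest-numbered prior acceptance.
        return max(prior, key=lambda p: p[0])[1]
    return fresh_value  # no prior acceptance: free to propose our own

# The Phase 1 responses from the example above:
promises = [(5, "commit"), (None, None), (3, "commit")]
print(choose_value(promises, "abort"))  # -> commit
```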
Pseudocode
```
function proposer_run(value):
    n = generate_unique_proposal_number()

    // Phase 1: send Prepare, collect promises
    promises = []
    for acceptor in acceptors:
        response = send(acceptor, Prepare(n))
        if response.type == PROMISE:
            promises.append(response)
    if len(promises) < majority_size:
        return FAIL  // could not get a majority, retry with higher n

    // Adopt the highest-numbered previously accepted value (if any)
    prior = [p for p in promises if p.accepted_n != null]
    if prior is not empty:
        value = max(prior, key = p -> p.accepted_n).accepted_value

    // Phase 2: send Accept, count acknowledgments
    accepts = 0
    for acceptor in acceptors:
        response = send(acceptor, Accept(n, value))
        if response.type == ACCEPTED:
            accepts += 1
    if accepts >= majority_size:
        broadcast_to_learners(value)
        return CHOSEN(value)
    else:
        return FAIL  // preempted, retry with higher n

function acceptor_handle(message):
    if message.type == PREPARE:
        if message.n > this.promised_n:
            this.promised_n = message.n
            return Promise(n=message.n, accepted_n=this.accepted_n,
                           accepted_value=this.accepted_value)
        else:
            return REJECT  // already promised a higher number
    if message.type == ACCEPT:
        if message.n >= this.promised_n:
            this.promised_n = message.n
            this.accepted_n = message.n
            this.accepted_value = message.value
            return Accepted(n=message.n)
        else:
            return REJECT
```
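The acceptor half translates almost line-for-line into runnable Python (in-memory only; a real acceptor must persist promised_n and the accepted pair to durable storage before replying, or a crash-restart could break safety):

```python
class Acceptor:
    """In-memory acceptor mirroring the handler above."""
    def __init__(self):
        self.promised_n = 0
        self.accepted_n = None
        self.accepted_value = None

    def prepare(self, n):
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted_n, self.accepted_value)
        return ("reject",)

    def accept(self, n, value):
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return ("accepted", n)
        return ("reject",)

# One full round against three acceptors.
acceptors = [Acceptor() for _ in range(3)]
promises = [a.prepare(7) for a in acceptors]
assert all(p[0] == "promise" for p in promises)
acks = [a.accept(7, "commit") for a in acceptors]
assert all(r[0] == "accepted" for r in acks)

# A stale proposer with a lower number is now locked out,
# and a later Prepare(8) learns the accepted value.
print(acceptors[0].prepare(6))  # -> ('reject',)
print(acceptors[0].prepare(8))  # -> ('promise', 7, 'commit')
```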
Safety guarantee
The two phases together ensure only one value can ever be chosen:
- Phase 1 locks out any proposer with a lower proposal number.
- The "must use prior accepted value" rule ensures that if value V was chosen in round N, any future round N' > N that succeeds will also propose V.
Even if a proposer crashes mid-phase, any new proposer will discover the partially chosen value in phase 1 responses and carry the same decision forward. I think of this as the "value propagation" property: once a majority accepts V, no future round can escape V.
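The overlap property this relies on is small enough to verify exhaustively for one cluster size (a toy check, not part of the protocol):

```python
from itertools import combinations

# Every pair of 3-node majorities in a 5-node cluster shares a member,
# so a previously accepted value is always visible to the next round.
nodes = range(5)
majorities = [set(c) for c in combinations(nodes, 3)]
assert all(a & b for a in majorities for b in majorities)
print(len(majorities), "majorities, every pair intersects")
```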
The FLP impossibility result
No deterministic consensus algorithm can guarantee both safety and liveness in a fully asynchronous system where even one process can crash. Paxos guarantees safety always but can only make progress when a majority is reachable and proposers do not indefinitely conflict. In practice, randomized timeouts and a stable leader make indefinite stalls vanishingly unlikely, though FLP means they cannot be ruled out entirely.
Dueling proposers (livelock)
Two proposers can indefinitely preempt each other. Proposer A starts phase 1 with N=5. Before A completes phase 2, Proposer B starts phase 1 with N=6, which invalidates A's proposal number. A retries with N=7 and invalidates B. This cycle can repeat forever.
Paxos avoids this livelock only if competing proposers back off. The standard fix: elect a single distinguished proposer (a leader) and have only it run proposals. If the leader crashes, a new one is elected. This is exactly what Multi-Paxos does.
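A randomized back-off loop around the proposer is one way to break the duel (a sketch; `run_round` is a hypothetical stand-in for one full Prepare/Accept round that returns the chosen value, or None when a higher-numbered proposer preempted us):

```python
import random
import time

def propose_with_backoff(run_round, value, max_attempts=8):
    """Retry a Paxos round with randomized exponential backoff."""
    delay = 0.01
    for _ in range(max_attempts):
        result = run_round(value)
        if result is not None:
            return result
        # Sleep a random fraction of the window so two duelling
        # proposers are unlikely to wake up and collide again.
        time.sleep(random.uniform(0, delay))
        delay = min(delay * 2, 1.0)
    return None
```

The jitter matters more than the exponential growth: two proposers that retry on a fixed schedule will keep colliding forever.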
My recommendation: when you mention Paxos livelock in an interview, immediately follow it with "this is why production systems use a stable leader."
Multi-Paxos
Basic Paxos agrees on a single value. Real systems need a sequence of values (a replicated log). Running full two-phase Paxos for every log entry is expensive: two round trips per entry, contention between proposers, no pipelining.
Multi-Paxos optimizes this by electing a stable leader that skips phase 1 for subsequent entries. Once a leader's proposal number is the highest, it can go directly to phase 2 for each new log entry. Phase 1 only runs when the leader changes.
| Aspect | Basic Paxos | Multi-Paxos |
|---|---|---|
| Scope | Agrees on one value | Agrees on a sequence of values (log) |
| Round trips per entry | 2 (Prepare + Accept) | 1 (Accept only, after leader established) |
| Leader required | No (any node proposes) | Yes (stable leader skips Prepare) |
| Proposal number | Per-round | Per-leader tenure |
| Failure handling | Any proposer retries | New leader election, then resumes |
```
// Multi-Paxos leader steady state (Phase 1 already done)
function leader_replicate(entry):
    log_index = next_available_slot()
    for acceptor in acceptors:
        send(acceptor, Accept(leader_n, log_index, entry))
    acks = count_accepted_responses()
    if acks >= majority_size:
        commit(log_index, entry)
        notify_learners(log_index, entry)
```
The underspecified parts are where the real difficulty lives. Lamport's paper says nothing about how to elect the leader, how to compact the log, how to handle cluster membership changes, or how to manage client retries. Every team that implemented Multi-Paxos reinvented these pieces independently, and many got them wrong.
Paxos vs. Raft
Raft is essentially Multi-Paxos with the underspecified gaps filled in and the presentation restructured for clarity. The Raft paper explicitly decomposes consensus into leader election, log replication, and safety as independent subproblems.
| Aspect | Paxos | Raft |
|---|---|---|
| Understandability | Famously difficult | Designed for understandability |
| Leader election | Not specified | Built into the protocol (terms, timeouts) |
| Log management | Not specified | Central to the design (AppendEntries RPC) |
| Reconfiguration | Not specified | Joint consensus or single-server changes |
| Log compaction | Not specified | Snapshots defined in the paper |
| Safety proof | Single-decree proof extends to multi | End-to-end proof in the Raft paper |
| Where used | Chubby, Spanner, ZooKeeper internals | etcd, CockroachDB, TiKV, Consul |
For your interview: if asked "Paxos or Raft?", say Raft for any new system. Paxos matters for understanding correctness proofs and legacy systems, but Raft is what you ship.
Interview tip: why Paxos still matters
Even though Raft replaced Paxos in practice, interviewers ask about Paxos to test your understanding of consensus fundamentals. Know the two phases, the "must adopt prior value" rule, and the livelock problem. That covers 90% of Paxos interview questions.
Production usage
| System | How Paxos is used | Notable behavior |
|---|---|---|
| Google Chubby | Lock service uses Multi-Paxos for replicated state machine. Five replicas, one master. | Master lease prevents reads from hitting Paxos. Chubby was the motivation for Paxos Made Live (2007). |
| Google Spanner | Uses Multi-Paxos per tablet for replication. TrueTime + Paxos enables globally consistent reads. | Paxos groups span datacenters. Commit latency is bounded by cross-datacenter RTT (~5-10ms within region). |
| Apache ZooKeeper | ZAB (ZooKeeper Atomic Broadcast) is a Paxos variant. Leader-based protocol with atomic broadcast ordering. | ZAB guarantees FIFO order within a session, which basic Paxos does not. Leader election uses a fast-leader variant. |
| Microsoft Azure Storage | Paxos-based replication across storage stamps. Three replicas with chain replication layered on top. | Uses a variant called "stream layer Paxos" optimized for append-only log replication. |
Limitations and when NOT to use it
- Raw Paxos is not a complete system. You cannot ship basic Paxos and call it done. You need leader election, log compaction, snapshotting, membership changes, and client retry semantics. Use Raft instead.
- Single-value Paxos has terrible throughput. Two round trips per decision. Multi-Paxos fixes this, but adds the complexity of maintaining a stable leader.
- No Byzantine fault tolerance. Paxos assumes nodes crash but do not lie. If a node sends forged messages (hardware fault, software bug, malicious actor), Paxos provides no protection. Use PBFT or HotStuff for Byzantine environments.
- Latency is set by the quorum'th-fastest acceptor. A commit completes only when the majority'th ACK arrives, so a 5-node cluster waits for the third-fastest of five responses. A single slow datacenter does not hurt on its own, but if enough replicas are remote that any quorum must include one (say, three of five at 200ms RTT), every commit pays that 200ms.
- Performance degrades with large clusters. Paxos requires communicating with a majority of nodes. With 7 or 9 nodes, the quorum size grows and so does tail latency. Most production deployments stick to 3 or 5 nodes.
- FLP impossibility applies. Paxos cannot guarantee progress in all cases. Dueling proposers or sustained network partitions can stall consensus indefinitely, even though safety is maintained.
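The quorum-latency point above can be made concrete with a toy model: commit time is simply the quorum'th-fastest acceptor RTT.

```python
def commit_latency_ms(acceptor_rtts_ms):
    """Commit completes when the quorum'th-fastest ACK arrives."""
    quorum = len(acceptor_rtts_ms) // 2 + 1
    return sorted(acceptor_rtts_ms)[quorum - 1]

print(commit_latency_ms([2, 3, 5, 8, 200]))      # one remote replica: 5
print(commit_latency_ms([2, 3, 200, 200, 200]))  # three remote: 200
```

This ignores processing time and retries, but it captures why replica placement dominates commit latency in geo-distributed deployments.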
Interview cheat sheet
- When asked "how do distributed databases agree on a value": start with Paxos as the foundational algorithm, then pivot to Raft as the practical choice. "Paxos proved it was possible; Raft made it implementable."
- When asked about the two phases: "Phase 1 (Prepare) establishes authority by getting promises from a majority. Phase 2 (Accept) proposes the value. The key insight is that if any acceptor already accepted a value, the proposer must adopt it."
- When asked "what prevents two different values from being chosen": "Two different majorities must overlap by at least one node. That overlapping node carries the previously accepted value forward via the promise response."
- When asked about livelock: "Two proposers can preempt each other forever. Production systems fix this with a stable leader (Multi-Paxos) and randomized backoff for contested elections."
- When asked "Paxos vs. Raft": "Raft is Multi-Paxos with all the gaps filled in. Same safety guarantees, but Raft specifies leader election, log compaction, and membership changes explicitly. Use Raft for new systems."
- When asked about quorum size: "Majority of N nodes, so floor(N/2)+1. A 5-node cluster has quorum of 3 and tolerates 2 failures. A 3-node cluster has quorum of 2 and tolerates 1 failure."
- When asked about the FLP result: "No deterministic consensus algorithm can guarantee both safety and liveness in an asynchronous system. Paxos guarantees safety always; liveness requires a reachable majority and non-conflicting proposers."
- When asked where Paxos is used in production: "Google Chubby (lock service), Google Spanner (tablet replication), ZooKeeper (ZAB variant). Most new systems use Raft directly."
Quick recap
- Paxos achieves distributed consensus in two phases: Prepare (get promises, discover prior accepted values) and Accept (propose a value, collect majority agreement). Safety holds even under crashes and message reordering.
- Three roles divide responsibility: Proposers drive the protocol, Acceptors hold durable state and form quorum votes, Learners observe the chosen value. A majority of acceptors (floor(N/2)+1) must agree.
- The "must adopt prior accepted value" rule is the core safety mechanism. If any future proposer discovers a value that was partially chosen, it carries that value forward instead of introducing a new one.
- Basic Paxos is impractically inefficient. Multi-Paxos optimizes it with a stable leader that skips phase 1 for sequential log entries, cutting round trips from two to one in the steady state.
- Paxos deliberately leaves leader election, log compaction, snapshotting, and membership changes unspecified. This is why every implementation reinvented these pieces and many got them wrong.
- For new systems, use Raft. It provides the same safety guarantees as Multi-Paxos with all the gaps filled in. Paxos remains important for understanding consensus correctness and for working with legacy systems like ZooKeeper and Spanner.
Related concepts
- Raft consensus internals - Raft is the practical successor to Paxos, filling in the underspecified gaps (leader election, log compaction, reconfiguration) that made Paxos difficult to implement correctly.
- Two-phase commit - 2PC solves distributed atomicity (all-or-nothing transactions), while Paxos solves distributed consensus (agreeing on a single value). Both use two-phase voting but solve different problems.
- Consensus quorum math - The mathematical foundations behind why majority quorums work: any two majorities overlap by at least one member, which is the property Paxos relies on for safety.
- Vector clocks - Where Paxos provides total ordering through consensus, vector clocks provide partial ordering without consensus, useful when you need causality tracking without the overhead of agreement.