Paxos consensus algorithm
How Paxos achieves distributed consensus across unreliable nodes, the Proposer/Acceptor/Learner roles, the two-phase protocol, and why Raft replaced it for most implementations.
The problem
Your five-node key-value store needs to replicate a write. The client sends SET leader = nodeA to three nodes. At the same moment, a network partition splits the cluster: two nodes see the write, three do not. A second client, talking to the other partition, sends SET leader = nodeB. Both partitions accept their local write. When the partition heals, the cluster has two conflicting values for the same key and no way to decide which one wins.
This is not a hypothetical. Google's early Bigtable clusters hit exactly this problem: without a formal agreement protocol, split-brain scenarios produced silent data corruption. The question is simple but deceptively hard: how do N servers agree on a single value when any of them can crash at any time and messages can arrive out of order, be duplicated, or be lost entirely?
This is the problem Paxos solves. It was the first provably correct consensus algorithm for asynchronous networks (no bound on message delivery time), published by Leslie Lamport in 1998.
What it is
Paxos is a two-phase voting protocol that allows a group of nodes to agree on exactly one value, even when some nodes crash and messages are delayed or lost. The protocol guarantees safety (no two nodes decide different values) in all cases; progress (a value is eventually chosen) requires that a majority of nodes are alive and can communicate, and that competing proposers do not preempt each other indefinitely.
Analogy: Imagine a committee voting on a new policy. The chairperson proposes a draft and asks each member, "Will you support this draft number?" Each member promises not to support any older draft. The chairperson collects promises from a majority, then sends the final version for a binding vote. Even if some members leave the room, as long as a majority voted, the policy is decided. No minority can later overturn it.
Interview tip: the one-liner
"Paxos is a two-phase majority-vote protocol: Prepare locks out older proposals, Accept gets the value committed. Safety holds always; liveness requires a majority to be reachable."
Three roles
Every node can play multiple roles simultaneously:
- Proposer: initiates a proposal. Wants to get a value accepted. Typically the current leader or the node receiving a client request.
- Acceptor: votes on proposals. The quorum is formed by a majority of acceptors. Stores the highest proposal number it has seen and the last value it accepted.
- Learner: learns the agreed-upon value once consensus is reached. May be a read replica or an application-layer observer.
A cluster of 5 nodes usually has all 5 act as acceptors and learners, with one or a few acting as proposers. The majority quorum is 3.
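The quorum arithmetic is simple enough to sketch directly (a minimal illustration, not tied to any particular implementation):

```python
def quorum_size(n):
    """Smallest majority of n acceptors: floor(n/2) + 1."""
    return n // 2 + 1

def failures_tolerated(n):
    """How many nodes can crash while a majority stays reachable."""
    return n - quorum_size(n)

for n in (3, 5, 7):
    print(n, quorum_size(n), failures_tolerated(n))
# 3 -> quorum 2, tolerates 1 failure
# 5 -> quorum 3, tolerates 2 failures
# 7 -> quorum 4, tolerates 3 failures
```

Note that growing from 5 to 7 nodes buys one extra failure but also raises the quorum size, which is why most deployments stop at 3 or 5.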
I find it helpful to think of proposers as "the mouth" and acceptors as "the memory." Proposers drive the protocol forward, but acceptors hold the durable state that guarantees safety.
How it works
The protocol runs in two phases. Phase 1 establishes authority (locks out stale proposals). Phase 2 gets the value committed. Every round uses a globally unique, monotonically increasing proposal number N.
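How proposal numbers stay globally unique is left open by the protocol; one common scheme (an assumption here, not prescribed by Paxos) encodes the node ID into the number:

```python
def next_proposal_number(highest_seen, node_id, num_nodes):
    """Generate a globally unique, monotonically increasing proposal
    number as round * num_nodes + node_id. Distinct nodes can never
    collide (numbers differ mod num_nodes), and bumping the round
    outbids any number seen so far."""
    round_ = highest_seen // num_nodes + 1
    return round_ * num_nodes + node_id

n = next_proposal_number(0, node_id=2, num_nodes=5)   # -> 7
m = next_proposal_number(n, node_id=4, num_nodes=5)   # -> 14, outbids 7
```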
Phase 1: Prepare and Promise
The proposer picks a proposal number N larger than any it has used before. It sends Prepare(N) to the acceptors (all of them, or at least a majority). Each acceptor checks whether N exceeds the highest number it has already promised. If so, it replies with a Promise: "I will reject any proposal numbered less than N." The promise also carries the highest-numbered proposal this acceptor previously accepted, along with its value (if any).
```
Proposer -> Acceptors: Prepare(N=7)
Acceptor A response:   Promise(N=7), prevAccepted=(N=5, value="commit")
Acceptor B response:   Promise(N=7), prevAccepted=null
Acceptor C response:   Promise(N=7), prevAccepted=(N=3, value="commit")
```
If the proposer receives promises from a majority, phase 1 succeeds.
Phase 2: Accept and Accepted
The proposer must now choose the value to propose. Critical rule: if any acceptor's promise carried a previously accepted value, the proposer must use the value from the highest-numbered prior acceptance. The proposer cannot pick a fresh value if any prior value was partially accepted. This is the core safety mechanism.
```
Proposer: received prevAccepted at N=5 ("commit") and N=3 ("commit")
          must use value="commit" (from highest N=5)
Proposer -> Acceptors: Accept(N=7, value="commit")
Acceptor A: 7 >= promised 7, accepts -> Accepted(N=7, "commit")
Acceptor B: 7 >= promised 7, accepts -> Accepted(N=7, "commit")
Acceptor C: 7 >= promised 7, accepts -> Accepted(N=7, "commit")
```
When the proposer receives Accepted from a majority, the value is chosen. The proposer notifies learners.
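The value-selection rule can be isolated into a few lines (a sketch; the promise representation here is an assumption, not mandated by the protocol):

```python
def choose_value(promises, fresh_value):
    """Paxos value-selection rule. Each promise is a pair
    (accepted_n, accepted_value), with accepted_n = None when the
    acceptor has never accepted anything."""
    prior = [p for p in promises if p[0] is not None]
    if prior:
        # Must adopt the value from the highest-numbered prior acceptance.
        return max(prior, key=lambda p: p[0])[1]
    return fresh_value  # no prior acceptance: free to propose our own

# The Phase 1 responses from the example above:
promises = [(5, "commit"), (None, None), (3, "commit")]
print(choose_value(promises, "abort"))  # -> commit
```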
Pseudocode
```
function proposer_run(value):
    n = generate_unique_proposal_number()

    // Phase 1: send Prepare, collect promises
    promises = []
    for acceptor in acceptors:
        response = send(acceptor, Prepare(n))
        if response.type == PROMISE:
            promises.append(response)
    if len(promises) < majority_size:
        return FAIL  // could not get a majority, retry with higher n

    // Adopt the highest-numbered previously accepted value (if any)
    prior = [p for p in promises if p.accepted_n != null]
    if prior is not empty:
        value = max(prior, key = p -> p.accepted_n).accepted_value

    // Phase 2: send Accept, count acknowledgments
    accepts = 0
    for acceptor in acceptors:
        response = send(acceptor, Accept(n, value))
        if response.type == ACCEPTED:
            accepts += 1
    if accepts >= majority_size:
        broadcast_to_learners(value)
        return CHOSEN(value)
    else:
        return FAIL  // preempted, retry with higher n

function acceptor_handle(message):
    if message.type == PREPARE:
        if message.n > this.promised_n:
            this.promised_n = message.n
            return Promise(n=message.n, accepted_n=this.accepted_n,
                           accepted_value=this.accepted_value)
        else:
            return REJECT  // already promised a higher number
    if message.type == ACCEPT:
        if message.n >= this.promised_n:
            this.promised_n = message.n
            this.accepted_n = message.n
            this.accepted_value = message.value
            return Accepted(n=message.n)
        else:
            return REJECT
```
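The acceptor half translates almost line-for-line into runnable Python (in-memory only; a real acceptor must persist promised_n and the accepted pair to durable storage before replying, or a crash-restart could break safety):

```python
class Acceptor:
    """In-memory acceptor mirroring the handler above."""
    def __init__(self):
        self.promised_n = 0
        self.accepted_n = None
        self.accepted_value = None

    def prepare(self, n):
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted_n, self.accepted_value)
        return ("reject",)

    def accept(self, n, value):
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return ("accepted", n)
        return ("reject",)

# One full round against three acceptors.
acceptors = [Acceptor() for _ in range(3)]
promises = [a.prepare(7) for a in acceptors]
assert all(p[0] == "promise" for p in promises)
acks = [a.accept(7, "commit") for a in acceptors]
assert all(r[0] == "accepted" for r in acks)

# A stale proposer with a lower number is now locked out,
# and a later Prepare(8) learns the accepted value.
print(acceptors[0].prepare(6))  # -> ('reject',)
print(acceptors[0].prepare(8))  # -> ('promise', 7, 'commit')
```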
Safety guarantee
The two phases together ensure only one value can ever be chosen:
- Phase 1 locks out any proposer with a lower proposal number.
- The "must use prior accepted value" rule ensures that if value V was chosen in round N, any future round N' > N that succeeds will also propose V.
Even if a proposer crashes mid-phase, any new proposer will discover the partially chosen value in phase 1 responses and carry the same decision forward. I think of this as the "value propagation" property: once a majority accepts V, no future round can escape V.
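The overlap property this relies on is small enough to verify exhaustively for one cluster size (a toy check, not part of the protocol):

```python
from itertools import combinations

# Every pair of 3-node majorities in a 5-node cluster shares a member,
# so a previously accepted value is always visible to the next round.
nodes = range(5)
majorities = [set(c) for c in combinations(nodes, 3)]
assert all(a & b for a in majorities for b in majorities)
print(len(majorities), "majorities, every pair intersects")
```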
The FLP impossibility result
No deterministic consensus algorithm can guarantee both safety and liveness in a fully asynchronous system where even one process can crash. Paxos guarantees safety always but can only make progress when a majority is reachable and proposers do not indefinitely conflict. In practice, randomized timeouts and a stable leader make indefinite stalls vanishingly unlikely, though FLP means they cannot be ruled out entirely.
Dueling proposers (livelock)
Two proposers can indefinitely preempt each other. Proposer A starts phase 1 with N=5. Before A completes phase 2, Proposer B starts phase 1 with N=6, which invalidates A's proposal number. A retries with N=7 and invalidates B. This cycle can repeat forever.
Paxos avoids this livelock only if competing proposers back off. The standard fix: elect a single distinguished proposer (a leader) and have only it run proposals. If the leader crashes, a new one is elected. This is exactly what Multi-Paxos does.
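A randomized back-off loop around the proposer is one way to break the duel (a sketch; `run_round` is a hypothetical stand-in for one full Prepare/Accept round that returns the chosen value, or None when a higher-numbered proposer preempted us):

```python
import random
import time

def propose_with_backoff(run_round, value, max_attempts=8):
    """Retry a Paxos round with randomized exponential backoff."""
    delay = 0.01
    for _ in range(max_attempts):
        result = run_round(value)
        if result is not None:
            return result
        # Sleep a random fraction of the window so two duelling
        # proposers are unlikely to wake up and collide again.
        time.sleep(random.uniform(0, delay))
        delay = min(delay * 2, 1.0)
    return None
```

The jitter matters more than the exponential growth: two proposers that retry on a fixed schedule will keep colliding forever.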
My recommendation: when you mention Paxos livelock in an interview, immediately follow it with "this is why production systems use a stable leader."
Multi-Paxos
Basic Paxos agrees on a single value. Real systems need a sequence of values (a replicated log). Running full two-phase Paxos for every log entry is expensive: two round trips per entry, contention between proposers, no pipelining.
Multi-Paxos optimizes this by electing a stable leader that skips phase 1 for subsequent entries. Once a leader's proposal number is the highest, it can go directly to phase 2 for each new log entry. Phase 1 only runs when the leader changes.
| Aspect | Basic Paxos | Multi-Paxos |
|---|---|---|
| Scope | Agrees on one value | Agrees on a sequence of values (log) |
| Round trips per entry | 2 (Prepare + Accept) | 1 (Accept only, after leader established) |
| Leader required | No (any node proposes) | Yes (stable leader skips Prepare) |
| Proposal number | Per-round | Per-leader tenure |
| Failure handling | Any proposer retries | New leader election, then resumes |
```
// Multi-Paxos leader steady state (Phase 1 already done)
function leader_replicate(entry):
    log_index = next_available_slot()
    for acceptor in acceptors:
        send(acceptor, Accept(leader_n, log_index, entry))
    acks = count_accepted_responses()
    if acks >= majority_size:
        commit(log_index, entry)
        notify_learners(log_index, entry)
```
The underspecified parts are where the real difficulty lives. Lamport's paper says nothing about how to elect the leader, how to compact the log, how to handle cluster membership changes, or how to manage client retries. Every team that implemented Multi-Paxos reinvented these pieces independently, and many got them wrong.
Paxos vs. Raft
Raft is essentially Multi-Paxos with the underspecified gaps filled in and the presentation restructured for clarity. The Raft paper explicitly decomposes consensus into leader election, log replication, and safety as independent subproblems.
| Aspect | Paxos | Raft |
|---|---|---|
| Understandability | Famously difficult | Designed for understandability |
| Leader election | Not specified | Built into the protocol (terms, timeouts) |
| Log management | Not specified | Central to the design (AppendEntries RPC) |
| Reconfiguration | Not specified | Joint consensus or single-server changes |
| Log compaction | Not specified | Snapshots defined in the paper |
| Safety proof | Single-decree proof extends to multi | End-to-end proof in the Raft paper |
| Where used | Chubby, Spanner, ZooKeeper internals | etcd, CockroachDB, TiKV, Consul |
For your interview: if asked "Paxos or Raft?", say Raft for any new system. Paxos matters for understanding correctness proofs and legacy systems, but Raft is what you ship.
Interview tip: why Paxos still matters
Even though Raft replaced Paxos in practice, interviewers ask about Paxos to test your understanding of consensus fundamentals. Know the two phases, the "must adopt prior value" rule, and the livelock problem. That covers 90% of Paxos interview questions.
Production usage
| System | How Paxos is used | Notable behavior |
|---|---|---|
| Google Chubby | Lock service uses Multi-Paxos for replicated state machine. Five replicas, one master. | Master lease prevents reads from hitting Paxos. Chubby was the motivation for Paxos Made Live (2007). |
| Google Spanner | Uses Multi-Paxos per tablet for replication. TrueTime + Paxos enables globally consistent reads. | Paxos groups span datacenters. Commit latency is bounded by cross-datacenter RTT (~5-10ms within region). |
| Apache ZooKeeper | ZAB (ZooKeeper Atomic Broadcast) is a Paxos variant. Leader-based protocol with atomic broadcast ordering. | ZAB guarantees FIFO order within a session, which basic Paxos does not. Leader election uses a fast-leader variant. |
| Microsoft Azure Storage | Paxos-based replication across storage stamps. Three replicas with chain replication layered on top. | Uses a variant called "stream layer Paxos" optimized for append-only log replication. |
Limitations and when NOT to use it
- Raw Paxos is not a complete system. You cannot ship basic Paxos and call it done. You need leader election, log compaction, snapshotting, membership changes, and client retry semantics. Use Raft instead.
- Single-value Paxos has terrible throughput. Two round trips per decision. Multi-Paxos fixes this, but adds the complexity of maintaining a stable leader.
- No Byzantine fault tolerance. Paxos assumes nodes crash but do not lie. If a node sends forged messages (hardware fault, software bug, malicious actor), Paxos provides no protection. Use PBFT or HotStuff for Byzantine environments.
- Latency is set by the quorum'th-fastest acceptor. A commit completes only when the majority'th ACK arrives, so a 5-node cluster waits for the third-fastest of five responses. A single slow datacenter does not hurt on its own, but if enough replicas are remote that any quorum must include one (say, three of five at 200ms RTT), every commit pays that 200ms.
- Performance degrades with large clusters. Paxos requires communicating with a majority of nodes. With 7 or 9 nodes, the quorum size grows and so does tail latency. Most production deployments stick to 3 or 5 nodes.
- FLP impossibility applies. Paxos cannot guarantee progress in all cases. Dueling proposers or sustained network partitions can stall consensus indefinitely, even though safety is maintained.
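The quorum-latency point above can be made concrete with a toy model: commit time is simply the quorum'th-fastest acceptor RTT.

```python
def commit_latency_ms(acceptor_rtts_ms):
    """Commit completes when the quorum'th-fastest ACK arrives."""
    quorum = len(acceptor_rtts_ms) // 2 + 1
    return sorted(acceptor_rtts_ms)[quorum - 1]

print(commit_latency_ms([2, 3, 5, 8, 200]))      # one remote replica: 5
print(commit_latency_ms([2, 3, 200, 200, 200]))  # three remote: 200
```

This ignores processing time and retries, but it captures why replica placement dominates commit latency in geo-distributed deployments.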
Interview cheat sheet
- When asked "how do distributed databases agree on a value": start with Paxos as the foundational algorithm, then pivot to Raft as the practical choice. "Paxos proved it was possible; Raft made it implementable."
- When asked about the two phases: "Phase 1 (Prepare) establishes authority by getting promises from a majority. Phase 2 (Accept) proposes the value. The key insight is that if any acceptor already accepted a value, the proposer must adopt it."
- When asked "what prevents two different values from being chosen": "Two different majorities must overlap by at least one node. That overlapping node carries the previously accepted value forward via the promise response."
- When asked about livelock: "Two proposers can preempt each other forever. Production systems fix this with a stable leader (Multi-Paxos) and randomized backoff for contested elections."
- When asked "Paxos vs. Raft": "Raft is Multi-Paxos with all the gaps filled in. Same safety guarantees, but Raft specifies leader election, log compaction, and membership changes explicitly. Use Raft for new systems."
- When asked about quorum size: "Majority of N nodes, so floor(N/2)+1. A 5-node cluster has quorum of 3 and tolerates 2 failures. A 3-node cluster has quorum of 2 and tolerates 1 failure."
- When asked about the FLP result: "No deterministic consensus algorithm can guarantee both safety and liveness in an asynchronous system. Paxos guarantees safety always; liveness requires a reachable majority and non-conflicting proposers."
- When asked where Paxos is used in production: "Google Chubby (lock service), Google Spanner (tablet replication), ZooKeeper (ZAB variant). Most new systems use Raft directly."
Quick recap
- Paxos achieves distributed consensus in two phases: Prepare (get promises, discover prior accepted values) and Accept (propose a value, collect majority agreement). Safety holds even under crashes and message reordering.
- Three roles divide responsibility: Proposers drive the protocol, Acceptors hold durable state and form quorum votes, Learners observe the chosen value. A majority of acceptors (floor(N/2)+1) must agree.
- The "must adopt prior accepted value" rule is the core safety mechanism. If any future proposer discovers a value that was partially chosen, it carries that value forward instead of introducing a new one.
- Basic Paxos is impractically inefficient. Multi-Paxos optimizes it with a stable leader that skips phase 1 for sequential log entries, cutting round trips from two to one in the steady state.
- Paxos deliberately leaves leader election, log compaction, snapshotting, and membership changes unspecified. This is why every implementation reinvented these pieces and many got them wrong.
- For new systems, use Raft. It provides the same safety guarantees as Multi-Paxos with all the gaps filled in. Paxos remains important for understanding consensus correctness and for working with legacy systems like ZooKeeper and Spanner.
Related concepts
- Raft consensus internals - Raft is the practical successor to Paxos, filling in the underspecified gaps (leader election, log compaction, reconfiguration) that made Paxos difficult to implement correctly.
- Two-phase commit - 2PC solves distributed atomicity (all-or-nothing transactions), while Paxos solves distributed consensus (agreeing on a single value). Both use two-phase voting but solve different problems.
- Consensus quorum math - The mathematical foundations behind why majority quorums work: any two majorities overlap by at least one member, which is the property Paxos relies on for safety.
- Vector clocks - Where Paxos provides total ordering through consensus, vector clocks provide partial ordering without consensus, useful when you need causality tracking without the overhead of agreement.