Raft consensus internals
How the Raft consensus algorithm achieves distributed agreement through leader election, log replication, and safety guarantees, and why it replaced Paxos for most production systems.
The problem
You have a 5-node key-value store. The leader node accepts a write: SET balance = 500. It replicates the entry to two followers, then crashes before sending the entry to the other two. Now three nodes (the crashed leader and two followers) have the entry and two do not. The cluster needs a new leader. If a node without the entry gets elected, the write, which already sits on a majority of nodes, is lost. If the election process is ambiguous, two nodes might both believe they are leader and accept conflicting writes.
This is not an edge case. Every distributed database that replicates writes faces this scenario during leader failover. The question is precise: how do you elect a new leader that preserves all committed writes, replicate entries consistently, and never allow two leaders in the same term?
Paxos answered this question first, but its specification left leader election, log management, and reconfiguration undefined. Teams that implemented Paxos each invented these pieces independently, often incorrectly. Raft, published in 2014 by Diego Ongaro and John Ousterhout, was designed specifically to solve the same consensus problem with an algorithm that is understandable and implementable.
What it is
Raft is a consensus algorithm that maintains a replicated log across a cluster of nodes, ensuring all nodes agree on the same sequence of commands even when some nodes crash. It achieves this by decomposing the problem into three independent subproblems: leader election, log replication, and safety.
Analogy: Think of a classroom with one teacher (leader) and several students (followers). The teacher dictates notes, and every student writes them down in order. If the teacher leaves, the students hold a quick vote. Each student checks the candidate's notebook before casting a ballot and refuses to vote for anyone whose notes are less complete than their own. A student with missing pages therefore cannot collect a majority, so the new teacher always has every committed note.
Interview tip: the decomposition
"Raft decomposes consensus into three independent subproblems: leader election (who coordinates), log replication (how writes propagate), and safety (the election restriction that prevents data loss). Each subproblem can be understood and verified independently."
How it works
Every Raft node is in one of three states: Follower (passive, responds to RPCs), Candidate (seeking votes), or Leader (handles all client requests and replicates log entries). Time is divided into terms, monotonically increasing integers. Each term begins with an election and has at most one leader.
The basic protocol works through two RPCs: RequestVote (used during elections) and AppendEntries (used for log replication and heartbeats). That is it. Two RPCs cover election and replication; snapshotting for log compaction adds a third, InstallSnapshot.
Core state per node
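Every Raft node tracks the state below. The first three fields (currentTerm, votedFor, log[]) must be written to stable storage before the node responds to any RPC; the last two are volatile and are rebuilt after a restart.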
| State field | Persisted? | Description |
|---|---|---|
| currentTerm | Yes | Latest term the node has seen. Monotonically increases. |
| votedFor | Yes | Node ID this node voted for in the current term (null if none). |
| log[] | Yes | Log entries: each contains a command and the term when received. |
| commitIndex | No (volatile) | Highest log entry known to be committed. |
| lastApplied | No (volatile) | Highest log entry applied to the state machine. |
Persisting currentTerm and votedFor is critical. If a node crashes and recovers without remembering its vote, it could vote for a second candidate in the same term, allowing two leaders.
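The split between persistent and volatile state can be sketched in Python roughly as follows. The class names, the JSON encoding, and the string node IDs are illustrative assumptions, not part of Raft itself; a real node would also write and fsync the serialized state to disk before answering an RPC.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class LogEntry:
    term: int      # term in which the leader received the command
    command: str   # opaque state-machine command


@dataclass
class PersistentState:
    """Fields that must survive a crash (hypothetical container)."""
    current_term: int = 0
    voted_for: Optional[str] = None
    log: List[LogEntry] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() also converts the nested LogEntry objects to dicts.
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "PersistentState":
        data = json.loads(raw)
        data["log"] = [LogEntry(**entry) for entry in data["log"]]
        return cls(**data)


@dataclass
class VolatileState:
    """Fields that are reset on restart and rebuilt from the log."""
    commit_index: int = 0
    last_applied: int = 0
```

Round-tripping a `PersistentState` through `to_json` and `from_json` should yield an equal object, which is the property a crash-recovery path relies on.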
Leader election
Each follower has a randomized election timeout (typically 150-300ms). If a follower receives no heartbeat from the leader before the timeout fires, it assumes the leader is dead and starts an election.
A candidate wins if it receives votes from a majority (floor(N/2) + 1). Two constraints prevent split-brain:
- One vote per term. Each node votes for at most one candidate per term (first-come-first-served). This is persisted to disk before the vote response is sent.
- Log-up-to-date check. A node only grants its vote if the candidate's log is at least as up-to-date as its own. "Up-to-date" means: higher last log term, or same term with equal-or-higher last log index.
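The two voting rules above can be sketched as a RequestVote handler. This is a minimal Python illustration under stated assumptions: the `Voter` class and function names are hypothetical, node IDs are strings, and the step that persists the vote to disk is only marked with a comment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Voter:
    current_term: int
    voted_for: Optional[str]
    last_log_term: int    # term of the voter's last log entry
    last_log_index: int   # index of the voter's last log entry


def log_is_up_to_date(cand_term: int, cand_index: int,
                      my_term: int, my_index: int) -> bool:
    """Higher last term wins; equal terms fall back to the longer log."""
    if cand_term != my_term:
        return cand_term > my_term
    return cand_index >= my_index


def handle_request_vote(voter: Voter, term: int, candidate_id: str,
                        last_log_term: int, last_log_index: int) -> bool:
    if term < voter.current_term:
        return False                      # candidate is from a stale term
    if term > voter.current_term:
        voter.current_term = term         # newer term: forget the old vote
        voter.voted_for = None
    if voter.voted_for not in (None, candidate_id):
        return False                      # one vote per term
    if not log_is_up_to_date(last_log_term, last_log_index,
                             voter.last_log_term, voter.last_log_index):
        return False                      # candidate's log is behind ours
    voter.voted_for = candidate_id        # a real node persists this first
    return True
```

Granting the same candidate again in the same term returns `True`, so a retransmitted request does not cost the candidate its vote.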
The randomized election timeout makes split votes rare. Because each follower picks its own random timeout, one node usually times out first, becomes a candidate, and collects a majority before any other node starts an election. In the rare case of a true split vote, no candidate reaches a majority, so the candidates time out again, pick fresh random timeouts, and retry with incremented terms.
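As a rough Python sketch, the quorum size and the randomized timeout come down to two small helpers. The function names and the injectable `rng` parameter are assumptions made for illustration; the 150-300ms defaults are the typical range quoted above.

```python
import random


def majority(cluster_size: int) -> int:
    """Votes needed to win: floor(N/2) + 1."""
    return cluster_size // 2 + 1


def random_election_timeout(low_ms: float = 150.0,
                            high_ms: float = 300.0,
                            rng=random) -> float:
    """Pick a fresh timeout for each election attempt.

    `rng` may be the random module or a random.Random instance,
    which makes the choice reproducible in tests.
    """
    return rng.uniform(low_ms, high_ms)


def won_election(votes_received: int, cluster_size: int) -> bool:
    # The candidate's vote for itself counts toward votes_received.
    return votes_received >= majority(cluster_size)
```

Drawing a new timeout on every attempt, rather than once at startup, is what keeps two candidates from colliding repeatedly after a split vote.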
I find the randomized election timeout to be the most elegant part of Raft. It replaces Paxos's undefined leader election with a simple randomized mechanism that works well in practice.
Election timeout tuning matters
If the election timeout is too short relative to network RTT, nodes trigger unnecessary elections. If it is too long, the cluster stays leaderless for an extended period after a crash. The Raft paper states the timing requirement as broadcastTime << electionTimeout << MTBF: broadcastTime (typically 0.5-20ms) should be roughly an order of magnitude below the election timeout (150-300ms is a common choice), which in turn should be far below the mean time between failures (months). Getting this wrong is one of the most common Raft deployment issues.
Log replication
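Once elected, the leader handles all client writes. It appends each command to its own log, then sends AppendEntries RPCs to every follower. When a majority of the cluster (counting the leader itself) has stored the entry, the leader marks it committed and applies it to its state machine. Followers learn the new commit index from later AppendEntries messages and apply the entry in turn.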
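The commit rule can be sketched in Python as a function the leader runs after each acknowledgement. The function name and the dictionary of per-follower match indexes are illustrative assumptions. The extra check that the entry belongs to the leader's current term comes from the Raft paper's safety rules and is not discussed in the text above.

```python
from typing import Dict, List


def advance_commit_index(log_terms: List[int],
                         match_index: Dict[str, int],
                         current_term: int,
                         commit_index: int) -> int:
    """Return the new commit index for the leader.

    log_terms[i - 1] is the term of the entry at 1-based index i.
    match_index maps each follower ID to the highest index known to be
    stored on that follower; the leader itself is counted separately.
    """
    cluster_size = len(match_index) + 1
    quorum = cluster_size // 2 + 1
    # Scan from the newest entry down to just above the current commit point.
    for n in range(len(log_terms), commit_index, -1):
        replicated = 1 + sum(1 for m in match_index.values() if m >= n)
        # Only entries from the leader's own term are committed by counting.
        if replicated >= quorum and log_terms[n - 1] == current_term:
            return n
    return commit_index
```

For a 5-node cluster, the leader plus two followers holding an entry is enough: `replicated` reaches 3, which equals the quorum.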
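The AppendEntries RPC includes prevLogIndex and prevLogTerm: the index and term of the log entry immediately before the new entries. The follower checks whether its own log has an entry with that index and term. If it does not, the follower rejects the RPC, and the leader steps prevLogIndex back by one entry and retries, repeating until it reaches a point where the two logs agree. The follower then discards any conflicting entries after that point and appends the leader's entries. This keeps logs consistent prefix-by-prefix: if two logs agree at a given index and term, they agree on everything before it.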
Consistency check pseudocode
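A minimal Python sketch of the follower-side check and the leader's back-off loop, under stated assumptions: log entries are (term, command) tuples, indexes are 1-based with prevLogIndex 0 meaning "before the first entry", and the function names are hypothetical. Term checks for stale leaders and commit-index handling are left out to keep the focus on the consistency check.

```python
from typing import List, Tuple

Entry = Tuple[int, str]  # (term, command)


def append_entries(log: List[Entry], prev_log_index: int,
                   prev_log_term: int, entries: List[Entry]) -> bool:
    """Follower side: accept entries only if the log matches at prev_log_index."""
    if prev_log_index > len(log):
        return False                        # follower is missing entries
    if prev_log_index > 0 and log[prev_log_index - 1][0] != prev_log_term:
        return False                        # same index, different term
    for offset, entry in enumerate(entries):
        pos = prev_log_index + offset       # 0-based slot for this entry
        if pos < len(log):
            if log[pos][0] != entry[0]:
                del log[pos:]               # conflict: drop it and all after
                log.append(entry)
        else:
            log.append(entry)
    return True


def sync_follower(leader_log: List[Entry], follower_log: List[Entry]) -> int:
    """Leader side: back off one entry at a time until the follower accepts.

    Returns the next_index value at which the logs were found to agree.
    """
    next_index = len(leader_log) + 1
    while True:
        prev = next_index - 1
        prev_term = leader_log[prev - 1][0] if prev > 0 else 0
        if append_entries(follower_log, prev, prev_term, leader_log[prev:]):
            return next_index
        next_index -= 1                     # prev == 0 always succeeds
```

In a real cluster each iteration of the loop is a network round trip, which is why production implementations often let the follower return a hint about where its log diverges instead of stepping back one entry at a time.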