Raft consensus internals
How the Raft consensus algorithm achieves distributed agreement through leader election, log replication, and safety guarantees, and why it replaced Paxos for most production systems.
The problem
You have a 5-node key-value store. The leader node accepts a write: SET balance = 500. It replicates the entry to two followers, then crashes before sending the entry to the other two. Now three nodes have the entry and two do not. The cluster needs a new leader. If a node without the entry gets elected, the committed write is lost. If the election process is ambiguous, two nodes might both believe they are leader, accepting conflicting writes.
This is not an edge case. Every distributed database that replicates writes faces this scenario during leader failover. The question is precise: how do you elect a new leader that preserves all committed writes, replicate entries consistently, and guarantee that two leaders never coexist in the same term?
Paxos answered this question first, but its specification left leader election, log management, and reconfiguration undefined. Teams that implemented Paxos each invented these pieces independently, often incorrectly. Raft, published in 2014 by Diego Ongaro and John Ousterhout, was designed specifically to solve the same consensus problem with an algorithm that is understandable and implementable.
What it is
Raft is a consensus algorithm that maintains a replicated log across a cluster of nodes, ensuring all nodes agree on the same sequence of commands even when some nodes crash. It achieves this by decomposing the problem into three independent subproblems: leader election, log replication, and safety.
Analogy: Think of a classroom with one teacher (leader) and several students (followers). The teacher dictates notes, and every student writes them down in order. If the teacher leaves, the students hold a quick vote: whoever has the most complete notes becomes the new teacher. A student with missing pages cannot win the vote, because the other students check notebooks before casting their ballot. This ensures no committed notes are ever lost.
Interview tip: the decomposition
"Raft decomposes consensus into three independent subproblems: leader election (who coordinates), log replication (how writes propagate), and safety (the election restriction that prevents data loss). Each subproblem can be understood and verified independently."
How it works
Every Raft node is in one of three states: Follower (passive, responds to RPCs), Candidate (seeking votes), or Leader (handles all client requests and replicates log entries). Time is divided into terms, monotonically increasing integers. Each term begins with an election and has at most one leader.
The entire protocol works through two RPCs: RequestVote (used during elections) and AppendEntries (used for log replication and heartbeats). That is it. Two RPCs for the entire consensus protocol.
Core state per node
Every Raft node persists three things to stable storage before responding to any RPC:
| State field | Persisted? | Description |
|---|---|---|
| currentTerm | Yes | Latest term the node has seen. Monotonically increases. |
| votedFor | Yes | Node ID this node voted for in the current term (null if none). |
| log[] | Yes | Log entries; each contains a command and the term when it was received. |
| commitIndex | No (volatile) | Highest log entry known to be committed. |
| lastApplied | No (volatile) | Highest log entry applied to the state machine. |
Persisting currentTerm and votedFor is critical. If a node crashes and recovers without remembering its vote, it could vote for a second candidate in the same term, allowing two leaders.
Leader election
Each follower has a randomized election timeout (typically 150-300ms). If a follower receives no heartbeat from the leader before the timeout fires, it assumes the leader is dead and starts an election.
A candidate wins if it receives votes from a majority (floor(N/2) + 1). Two constraints prevent split-brain:
- One vote per term. Each node votes for at most one candidate per term (first-come-first-served). This is persisted to disk before the vote response is sent.
- Log-up-to-date check. A node only grants its vote if the candidate's log is at least as up-to-date as its own. "Up-to-date" means: higher last log term, or same term with equal-or-higher last log index.
The randomized election timeout breaks ties. If two candidates start simultaneously, they typically have different timeouts, so one completes its election before the other. In the rare case of a true split vote, both candidates time out and retry with incremented terms.
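The two constraints can be sketched as a follower-side RequestVote handler. This is a hedged, in-memory illustration: field names are assumptions, and the required persistence of currentTerm and votedFor before replying is elided.

```python
import random

def election_timeout_ms():
    # Randomized timeout in [150, 300) ms: two simultaneous candidates
    # almost never pick the same value, so one usually finishes first.
    return 150 + random.uniform(0, 150)

def grant_vote(node, term, candidate_id, last_log_index, last_log_term):
    # Rule 0: never help a candidate from a stale term.
    if term < node["current_term"]:
        return False
    if term > node["current_term"]:
        node["current_term"] = term   # newer term: forget any old vote
        node["voted_for"] = None
    # Rule 1: one vote per term, first come first served.
    if node["voted_for"] not in (None, candidate_id):
        return False
    # Rule 2: log-up-to-date check, comparing (last term, last index).
    my_last_term = node["log"][-1]["term"] if node["log"] else 0
    my_last_index = len(node["log"])
    if (last_log_term, last_log_index) < (my_last_term, my_last_index):
        return False
    node["voted_for"] = candidate_id  # must be persisted before replying
    return True

# A follower whose log ends at (term 3, index 5) rejects a longer but
# staler log ending at (term 2, index 9):
node = {"current_term": 3, "voted_for": None,
        "log": [{"term": 1}] * 3 + [{"term": 3}] * 2}
print(grant_vote(node, 4, "B", 9, 2))  # False
```

Note that tuple comparison on (term, index) captures "up-to-date" exactly: the last term dominates, and the index only breaks ties.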
I find the election timeout to be the most elegant part of Raft. It replaces Paxos's undefined leader election with a simple randomized mechanism that works remarkably well in practice.
Election timeout tuning matters
If the election timeout is too short relative to network RTT, nodes trigger unnecessary elections. If too long, the cluster stays leaderless for an extended period after a crash. The Raft paper recommends: broadcastTime (0.5-20ms) is much less than electionTimeout (150-300ms), which is much less than MTBF (months). Getting this wrong is one of the most common Raft deployment issues.
Log replication
Once elected, the leader handles all client writes. It appends each command to its own log, then sends AppendEntries RPCs to every follower. When a majority acknowledges the entry, the leader commits it and applies it to the state machine.
The AppendEntries RPC includes prevLogIndex and prevLogTerm: the index and term of the log entry immediately before the new entries. The follower checks whether its log matches at that point. If it does not match, the follower rejects the RPC and the leader decrements prevLogIndex and retries. This ensures logs are consistent prefix-by-prefix.
Consistency check pseudocode
function follower_handle_append_entries(request):
    if request.term < this.currentTerm:
        return {success: false, term: this.currentTerm}
    // A valid leader exists for this term: adopt the term and step down
    this.currentTerm = request.term
    this.state = FOLLOWER
    reset_election_timeout()
    // Consistency check: does our log match at prevLogIndex?
    if request.prevLogIndex > 0:
        if log[request.prevLogIndex] == null:
            return {success: false, term: this.currentTerm} // gap in our log
        if log[request.prevLogIndex].term != request.prevLogTerm:
            // Conflict: delete this entry and everything after it
            delete_log_from(request.prevLogIndex)
            return {success: false, term: this.currentTerm}
    // Append new entries (truncate on conflict, skip duplicates)
    for entry in request.entries:
        if log[entry.index] != null and log[entry.index].term != entry.term:
            delete_log_from(entry.index) // conflicting suffix must go
        log[entry.index] = entry
    // Advance commit index, but never past our own last entry
    if request.leaderCommit > this.commitIndex:
        this.commitIndex = min(request.leaderCommit, last_log_index())
    return {success: true, term: this.currentTerm}
The leader never overwrites its own log. Only followers delete conflicting entries. The leader's log is always authoritative. For your interview: when asked how Raft handles log conflicts, say "the leader's log always wins, and the consistency check in AppendEntries forces followers to converge."
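The leader-side half of this exchange, the decrement-and-retry loop, can be simulated in a few lines. This sketch runs both sides in one process and skips terms and RPC failures; the point is only the backtracking convergence.

```python
def follower_matches(follower_log, prev_index, prev_term):
    # The AppendEntries consistency check: the follower must hold an
    # entry with exactly (prev_index, prev_term), or prev_index == 0.
    if prev_index == 0:
        return True
    if prev_index > len(follower_log):
        return False                                  # gap in the log
    return follower_log[prev_index - 1]["term"] == prev_term

def replicate(leader_log, follower_log):
    # Back next_index up one entry at a time until the logs agree,
    # then overwrite the follower's suffix with the leader's entries.
    next_index = len(leader_log) + 1
    while True:
        prev_index = next_index - 1
        prev_term = leader_log[prev_index - 1]["term"] if prev_index else 0
        if follower_matches(follower_log, prev_index, prev_term):
            del follower_log[prev_index:]             # drop conflicts
            follower_log.extend(leader_log[prev_index:])
            return follower_log
        next_index -= 1

leader   = [{"term": 1}, {"term": 1}, {"term": 2}, {"term": 3}]
follower = [{"term": 1}, {"term": 1}, {"term": 2}, {"term": 2}]  # diverged
print(replicate(leader, follower) == leader)  # True
```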
Safety properties
Raft guarantees five properties. The most important is the Leader Completeness Property: if a log entry is committed in a given term, that entry will be present in the logs of all leaders for all higher-numbered terms.
| Property | Guarantee |
|---|---|
| Election Safety | At most one leader per term. |
| Leader Append-Only | A leader never overwrites or deletes entries in its own log. It only appends. |
| Log Matching | If two logs contain an entry with the same index and term, all preceding entries are identical. |
| Leader Completeness | If an entry is committed in term T, it is present in the leader's log for every term > T. |
| State Machine Safety | If a node applies entry at index I, no other node applies a different entry at index I. |
The Election Restriction is what enforces Leader Completeness. A candidate must have a log that is at least as up-to-date as any majority member's log to win an election. Since a committed entry exists on a majority, and the candidate must get votes from a majority, the two majorities overlap. The overlapping node will refuse to vote for a candidate whose log is missing the committed entry.
This is the deepest insight in Raft. I'd argue it's actually simpler than Paxos's safety argument, which relies on the "must adopt prior value" rule propagated across proposal numbers. Raft's version is structural: the election restriction physically prevents any candidate without committed entries from becoming leader.
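The overlap argument is small enough to verify exhaustively for a 5-node cluster:

```python
from itertools import combinations

nodes = ["A", "B", "C", "D", "E"]
quorum = len(nodes) // 2 + 1                       # 3 of 5
majorities = [set(c) for c in combinations(nodes, quorum)]

# Every possible commit majority shares at least one node with every
# possible election majority, so some voter always holds the entry
# and its up-to-date check blocks any candidate missing it.
assert all(m1 & m2 for m1 in majorities for m2 in majorities)
print(f"checked {len(majorities)**2} pairs: all overlap")
```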
Interview tip: the commitment rule for prior-term entries
A subtle Raft safety rule: a leader can only commit entries from prior terms by committing an entry in its current term (which indirectly commits all prior entries by log position). The leader never directly counts replicas for prior-term entries. If you skip this rule, you can construct a scenario where a committed entry gets overwritten.
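A sketch of the leader's commit rule, assuming a matchIndex map that includes the leader's own last index (names are illustrative):

```python
def advance_commit_index(match_index, log, current_term, commit_index):
    # Walk down from the end of the log and commit the highest index N
    # that (a) a majority has replicated AND (b) belongs to the CURRENT
    # term. Prior-term entries are never counted directly; they commit
    # only as a side effect of a current-term entry committing above them.
    n_nodes = len(match_index)
    for n in range(len(log), commit_index, -1):
        replicated = sum(1 for m in match_index.values() if m >= n)
        if replicated > n_nodes // 2 and log[n - 1]["term"] == current_term:
            return n
    return commit_index

log = [{"term": 1}, {"term": 2}, {"term": 4}]
match = {"A": 3, "B": 3, "C": 3, "D": 1, "E": 1}   # A is the leader
print(advance_commit_index(match, log, 4, 0))  # 3: current-term entry on a majority
print(advance_commit_index(match, log, 5, 0))  # 0: only prior-term entries, no commit
```

The second call shows the subtlety: the same replication state commits nothing once the term moves on, until the new leader lands an entry of its own.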
Joint consensus (membership changes)
Changing cluster membership (adding or removing nodes) is dangerous because during the transition, two different majorities could exist: one from the old configuration and one from the new. Raft handles this with joint consensus.
The process works in two steps:
- The leader creates a C_old,new entry that contains both the old and new configurations. While this entry is active, agreement requires a majority from the old config and a majority from the new config independently.
- Once C_old,new is committed, the leader creates a C_new entry with only the new configuration. Once that entry is committed, the transition is complete and the old nodes can be decommissioned.
This two-step approach ensures there is never a moment when old-config-only or new-config-only can make unilateral decisions. The overlap period requires agreement from both groups.
In practice, many Raft implementations (including etcd) use a simpler single-server change approach: add or remove one node at a time. This is safe because adding one node to a cluster of N only changes the quorum by at most one, and the old and new majorities always overlap.
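The single-server safety claim is easy to check mechanically: every majority of a 3-node old config intersects every majority of the 4-node config that results from adding one node.

```python
from itertools import combinations

def majorities(nodes):
    q = len(nodes) // 2 + 1
    return [set(c) for c in combinations(sorted(nodes), q)]

old = {"A", "B", "C"}        # quorum: 2 of 3
new = old | {"D"}            # quorum: 3 of 4 after adding one node

# No old-config majority is disjoint from any new-config majority, so
# the two configurations can never make contradictory decisions.
assert all(m1 & m2 for m1 in majorities(old) for m2 in majorities(new))
```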
Log compaction (snapshots)
The log grows without bound unless compacted. Raft handles this with snapshots: periodically, each node saves the current state machine state at a known log index and discards all log entries up to that index.
State machine at index 200: {x: 5, y: 8, z: 3, accounts: {...}}
Snapshot: serialize state machine, tag with (index=200, term=4)
Discard log entries 1-200
Keep log from 201 onward
Slow follower or new node:
Leader sends InstallSnapshot RPC
Node loads snapshot, then receives AppendEntries from index 201
The leader tracks each follower's progress via nextIndex[]. If a follower falls so far behind that the required log entries have been discarded, the leader sends an InstallSnapshot RPC instead of AppendEntries. The follower replaces its state machine with the snapshot and resumes normal replication.
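The leader's choice between the two RPCs reduces to one comparison. This is a sketch; the function and parameter names are assumptions, not from any specific implementation.

```python
def choose_rpc(next_index, snapshot_index):
    # Entries up to and including snapshot_index have been discarded.
    # If the follower still needs any of them, only a snapshot can
    # catch it up; otherwise normal log replication continues.
    if next_index <= snapshot_index:
        return "InstallSnapshot"
    return "AppendEntries"

# Log compacted through index 200:
print(choose_rpc(150, 200))  # InstallSnapshot: entries 150-200 are gone
print(choose_rpc(201, 200))  # AppendEntries: normal replication resumes
```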
Production usage
| System | How Raft is used | Notable behavior |
|---|---|---|
| etcd | Core consensus for Kubernetes cluster state. One Raft group for the entire key-value store. | Raft log is the WAL. Snapshots triggered by configurable entry count (default 10,000). Watch API built on top of the Raft log. |
| CockroachDB | Multi-Raft: one independent Raft group per range (~64MB of data). | Thousands of simultaneous Raft groups per node. Leader leases enable local reads. Range splits/merges trigger membership changes. |
| TiKV | Multi-Raft per region (similar to CockroachDB). Powers the TiDB distributed SQL database. | Uses Raft learner nodes for online data migration. Implements prevote extension to prevent disruptive elections from partitioned nodes. |
| Consul | Raft for service discovery and configuration. One Raft group per datacenter. | Cross-datacenter replication uses a separate protocol (not Raft). Raft is only for intra-datacenter consensus. |
| InfluxDB IOx | Raft for metadata coordination and write ordering. | Uses Raft for catalog metadata, not for the actual time-series data path. Separates consensus overhead from high-throughput writes. |
Limitations and when NOT to use it
- Single-leader bottleneck. All writes go through one node. For write-heavy workloads, a single Raft group tops out at roughly 10,000-50,000 writes/second depending on entry size and disk speed. Multi-Raft (sharding) is the standard workaround, but it adds operational complexity.
- No Byzantine fault tolerance. Raft assumes crash-stop failures: nodes either work correctly or stop. A node that sends corrupted or malicious data can violate safety. Use PBFT, HotStuff, or Tendermint for environments with untrusted nodes.
- Linearizable reads require leader involvement. By default, all reads must go through the leader to ensure consistency. Follower reads require either ReadIndex (an extra round trip to the leader) or leader leases (clock-dependent, which is fragile under clock skew).
- Cross-datacenter latency is unavoidable. Commit latency is bounded by the time to reach a majority. A 5-node cluster across three datacenters (US, EU, Asia) has a commit latency of at least one cross-region RTT (~80-200ms). You cannot avoid this with Raft alone.
- Cluster size is practically limited to 5-7 nodes. Larger clusters increase tail latency (more nodes in the majority path) and election instability (more candidates, more split votes). For large-scale systems, use Multi-Raft with many small groups.
- Log replay on restart can be slow. If snapshots are infrequent and the log is large, a restarting node must replay thousands of entries before it can serve traffic. Tune snapshot frequency to balance disk I/O against recovery time.
Interview cheat sheet
- When asked "how does Raft work": "Raft decomposes consensus into three subproblems: leader election (randomized timeouts, majority vote), log replication (leader appends, followers replicate via AppendEntries), and safety (election restriction ensures the new leader always has all committed entries)."
- When asked about leader election: "Each node has a randomized election timeout. When it fires, the node increments its term, votes for itself, and requests votes from others. A candidate wins with a majority. The log-up-to-date check prevents stale nodes from becoming leader."
- When asked about log consistency: "The AppendEntries RPC includes prevLogIndex and prevLogTerm. If the follower's log doesn't match at that point, it rejects. The leader decrements and retries until logs align. This ensures logs are identical prefix-by-prefix."
- When asked "how does Raft handle network partitions": "The partition with a majority continues operating. The minority partition cannot elect a leader (no majority) and stops accepting writes. When the partition heals, the minority nodes receive the leader's log and converge."
- When asked about the commitment rule: "An entry is committed when replicated to a majority. The leader advances commitIndex and followers apply committed entries to their state machines. Critically, a leader only commits prior-term entries indirectly by committing a current-term entry."
- When asked about read consistency: "Default: reads go through the leader for linearizability. Optimization: ReadIndex (leader confirms it is still leader with a heartbeat round) or lease-based reads (leader holds a time-bounded lease, serves reads without confirmation)."
- When asked about scaling Raft: "Single Raft group is a write bottleneck. Production systems use Multi-Raft: one Raft group per shard (CockroachDB calls them ranges, TiKV calls them regions). Each group has independent leaders, so writes scale horizontally."
- When asked "Raft vs. Paxos": "Same safety guarantees. Raft specifies leader election, log management, and membership changes explicitly. Paxos leaves these to the implementer. For interviews, say Raft for any new system."
Quick recap
- Raft solves distributed consensus by decomposing it into three independent subproblems (leader election, log replication, and safety), which makes it significantly more understandable than Paxos.
- Leader election uses randomized timeouts and a majority vote. The log-up-to-date check ensures only nodes with complete logs can become leader, which is the core safety mechanism.
- Log replication works through AppendEntries RPCs with a consistency check (prevLogIndex and prevLogTerm). The leader's log is authoritative and followers converge through the backtrack-and-retry mechanism.
- The Election Restriction guarantees Leader Completeness: a committed entry (replicated to a majority) will always be present in any future leader's log, because the candidate must get votes from a majority that overlaps with the majority that committed the entry.
- Membership changes use joint consensus (requiring agreement from both old and new configurations) to prevent split-brain during transitions. Single-server changes are the simpler alternative used by most implementations.
- For write scaling, production systems use Multi-Raft: one Raft group per data shard, with independent leaders per group. This is how CockroachDB, TiKV, and other distributed databases achieve horizontal write throughput.
Related concepts
- Paxos consensus algorithm - Paxos is the theoretical foundation that Raft builds on. Understanding Paxos explains why Raft's design choices (strong leader, election restriction) exist.
- Two-phase commit - Where Raft agrees on a sequence of log entries across replicas, 2PC agrees on a single atomic transaction across different databases. They solve different problems: 2PC requires every participant to agree and blocks on coordinator failure, while Raft needs only a majority and tolerates leader failure.
- Write-ahead log - Raft's replicated log is essentially a distributed WAL. Each Raft node uses a local WAL to persist entries before acknowledging, and the Raft protocol ensures all nodes converge to the same WAL contents.
- Gossip protocol - Where Raft provides strong consistency through leader-based replication, gossip provides eventual consistency through peer-to-peer dissemination. Use Raft for state that must be consistent; use gossip for state that can be stale.