Leader election
How distributed systems elect a single leader using ZooKeeper ephemeral nodes, etcd leases, or Redis SETNX. Covers fencing tokens, split-brain prevention, and when you actually need leader election.
TL;DR
- Leader election ensures exactly one node runs a critical task (cron jobs, partition ownership, DB primary) at any given time.
- The hard part is not electing a leader, it is preventing a deposed leader from acting after a new one takes over. Fencing tokens solve this.
- ZooKeeper uses ephemeral sequential nodes (smallest sequence number wins, watch predecessor). etcd uses lease-based campaigns with revision-based fencing. Redis uses SETNX + TTL but offers weaker consistency guarantees.
- Every election mechanism has a leaderless gap between failure detection and new election. Design systems to queue work during this window.
- For interviews, know fencing tokens and at least one concrete election mechanism (ZooKeeper or etcd).
The Problem
Your payment processing system runs a daily reconciliation job at midnight. Three instances of the service are running for high availability. Without coordination, all three fire the job at midnight. Duplicate reconciliation transfers money twice, triggers duplicate refunds, corrupts ledger balances.
"Just use a cron job on one server" works until that server crashes. Now nothing runs the job. You need exactly-one semantics: one node runs the task, and if it fails, another takes over automatically.
The same pattern appears everywhere in distributed systems. Kafka needs exactly one broker to lead each partition. Elasticsearch needs one master node to manage cluster state. Redis Sentinel needs one sentinel to initiate failover. A stream processing application needs one consumer per partition. A distributed scheduler needs one node to fire each scheduled task. The requirement is always the same: coordinate N nodes so exactly one acts as the leader.
The naive approach of "just let all nodes try and deduplicate later" fails for several reasons. Some operations are not idempotent (sending emails, charging credit cards, executing trades). Some operations require sequential ordering (partition processing, log compaction). Some operations accumulate state (building an in-memory index, maintaining a session cache). For all of these, you need one designated node.
What makes leader election genuinely hard? It is not the happy path. Electing a leader when everything works is trivial.
The difficulty is handling partial failures: a leader that is alive but unreachable, a network partition that splits the cluster, a GC pause that makes the leader appear dead. These failure modes create the possibility of split-brain, where two nodes simultaneously believe they are the leader. Solving split-brain is the core technical challenge of leader election.
One-Line Definition
Leader election is the process by which a cluster of nodes agrees on exactly one node to perform a specific role, and re-elects a replacement when that node fails.
Analogy
Think of a relay race team with a backup runner. Only the person holding the baton runs. If the current runner stumbles, the baton is handed to the next runner in line, not thrown to the ground where multiple people might grab it. The baton is the leadership token: whoever holds it acts, everyone else waits.
Critically, a runner who dropped the baton three laps ago cannot just start running again because they still feel like the leader. The referee (the fencing token check at the resource layer) stops them. The baton has a number on it, and the finish line only accepts the highest-numbered baton it has seen. An old baton from a previous handoff is rejected.
Solution Walkthrough
The Core Challenge: Fencing Tokens
The hard part of leader election is not choosing a leader. It is preventing a deposed leader from taking action after it has been replaced. I see candidates skip fencing in interviews, and that is always the gap interviewers probe.
The split-brain scenario: Node A wins election. A network partition isolates A from the rest of the cluster. Nodes B and C elect B as the new leader. A's heartbeat still runs locally, so it thinks it is still leader. Now two nodes both believe they are leader and both write to the database. Data is corrupted.
This is not a theoretical concern. Martin Kleppmann documented real-world cases where GC pauses as short as 10 seconds caused split-brain in systems using lease-based coordination. At scale, everything that can happen eventually will.
The following diagram shows the exact sequence of events during a split-brain incident and how fencing tokens prevent data corruption:
The fix is a fencing token: the election mechanism issues a monotonically increasing token with each new leader. Any resource the leader touches must record the highest token it has seen and reject requests carrying a lower token. An old leader carrying a stale token gets rejected automatically.
Here is how fencing tokens flow through the system:
Time 0: Leader A wins election, receives fencing token = 5
Time 1: Leader A writes to DB with token=5, DB records max_token=5
Time 2: Leader A gets network-partitioned
Time 3: Leader B wins election, receives fencing token = 6
Time 4: Leader B writes to DB with token=6, DB records max_token=6
Time 5: Leader A's partition heals, A tries to write with token=5
Time 6: DB rejects A's write because 5 < max_token (6)
The fencing token must be enforced at the resource layer, not just checked by the leader itself. A deposed leader that checks "am I still leader?" using its own local state will always say "yes" because it does not know it has been replaced. The resource (database, file system, API) must independently validate the token.
For your interview: say "fencing token" within the first 30 seconds of discussing leader election. It shows you understand the real problem, not just the happy path.
ZooKeeper Ephemeral Sequential Nodes
ZooKeeper is the most battle-tested leader election mechanism. The approach uses two properties: ephemeral nodes (deleted when the session disconnects) and sequential nodes (ZooKeeper appends a monotonically increasing counter). Apache Curator, the high-level Java client for ZooKeeper, provides a LeaderLatch recipe that implements this protocol in production-grade code.
Election protocol:
- Each candidate creates an ephemeral sequential node under
/election/:/election/node-0001,/election/node-0002, etc. - Each candidate reads all children of
/election/and checks if its node has the smallest sequence number. - If yes, it is the leader.
- If no, it sets a watch on the node with the next-smaller sequence number (its predecessor).
- When the predecessor is deleted (leader crash or graceful shutdown), the watcher is notified and the node checks again.
The "watch predecessor" pattern avoids the thundering herd problem. If all followers watched the leader's node, every follower would wake up and race to check when the leader dies. With predecessor watches, only one node wakes up per failure, giving O(1) notification cost.
The ZooKeeper session ID and the zxid (transaction ID) serve as natural fencing tokens. The zxid is monotonically increasing across all ZooKeeper operations, so a new leader's session will always have a higher zxid than the old leader.
ZooKeeper vs the simpler non-sequential approach
A simpler ZooKeeper election has all nodes race to create a single ephemeral node (e.g., /election/leader). One wins, the rest watch it. This works for small clusters but suffers from thundering herd: when the leader dies, all N-1 followers wake up and race to create the node. The sequential approach above is the production-grade solution that scales to hundreds of candidates without contention storms.
etcd Lease-Based Election
etcd provides a simpler election API built on leases. A lease is a time-limited handle: it expires if not renewed within a TTL.
Election protocol:
- Each candidate calls
Grantto obtain a lease with a TTL (e.g., 15 seconds). - Each candidate attempts to put a key (e.g.,
/election/leader) using a transaction: "if the key does not exist, set it to my node ID with my lease attached." - Exactly one candidate succeeds (etcd transactions are linearizable). That node is leader.
- The leader calls
KeepAliveon a regular interval to refresh the lease before TTL expiry. - If the leader crashes, the heartbeat stops, the lease expires, and etcd deletes the key automatically.
- Other candidates, which are watching the key, detect the deletion and re-attempt the campaign.
etcd returns a revision number per write. This revision is a global, monotonically increasing counter across the entire etcd cluster. It is the fencing token. Resources must reject operations from clients carrying a revision lower than the highest they have seen.
The etcd concurrency package in Go abstracts all of this into a clean API:
// Go example using etcd concurrency package
session, _ := concurrency.NewSession(client, concurrency.WithTTL(15))
election := concurrency.NewElection(session, "/election/leader")
// Blocks until this node becomes leader
election.Campaign(ctx, nodeID)
// node is now leader, use election.Rev() as fencing token
fencingToken := election.Rev()
// When done, voluntarily resign
election.Resign(ctx)
My recommendation: etcd is the default choice for new projects today. Its gRPC API is cleaner than ZooKeeper's, the election primitives are built-in, and Kubernetes already runs etcd under the hood so you often get it for free.
Choosing between election mechanisms
Here is the honest answer on choosing: if you are on Kubernetes, use the Lease API (it is backed by etcd and requires zero additional infrastructure). If you are not on Kubernetes but need strong consistency, use etcd or ZooKeeper directly. If duplicate work is tolerable and you want simplicity, Redis SETNX is fine. Most startups should start with Redis SETNX and migrate to etcd when they hit the consistency ceiling.
Redis SETNX + TTL
SET leader <node-id> NX EX 30
NX means "only set if not exists." EX 30 sets a 30-second TTL. The leader refreshes the TTL periodically before expiry. If the leader crashes, the key expires after 30 seconds and a new leader can claim it.
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.