Leader election
How distributed systems elect a single leader using ZooKeeper ephemeral nodes, etcd leases, or Redis SETNX. Covers fencing tokens, split-brain prevention, and when you actually need leader election.
TL;DR
- Leader election ensures exactly one node runs a critical task (cron jobs, partition ownership, DB primary) at any given time.
- The hard part is not electing a leader, it is preventing a deposed leader from acting after a new one takes over. Fencing tokens solve this.
- ZooKeeper uses ephemeral sequential nodes (smallest sequence number wins, watch predecessor). etcd uses lease-based campaigns with revision-based fencing. Redis uses SETNX + TTL but offers weaker consistency guarantees.
- Every election mechanism has a leaderless gap between failure detection and new election. Design systems to queue work during this window.
- For interviews, know fencing tokens and at least one concrete election mechanism (ZooKeeper or etcd).
The Problem
Your payment processing system runs a daily reconciliation job at midnight. Three instances of the service are running for high availability. Without coordination, all three fire the job at midnight. Duplicate reconciliation transfers money twice, triggers duplicate refunds, corrupts ledger balances.
"Just use a cron job on one server" works until that server crashes. Now nothing runs the job. You need exactly-one semantics: one node runs the task, and if it fails, another takes over automatically.
The same pattern appears everywhere in distributed systems. Kafka needs exactly one broker to lead each partition. Elasticsearch needs one master node to manage cluster state. Redis Sentinel needs one sentinel to initiate failover. A stream processing application needs one consumer per partition. A distributed scheduler needs one node to fire each scheduled task. The requirement is always the same: coordinate N nodes so exactly one acts as the leader.
The naive approach of "just let all nodes try and deduplicate later" fails for several reasons. Some operations are not idempotent (sending emails, charging credit cards, executing trades). Some operations require sequential ordering (partition processing, log compaction). Some operations accumulate state (building an in-memory index, maintaining a session cache). For all of these, you need one designated node.
What makes leader election genuinely hard? It is not the happy path. Electing a leader when everything works is trivial.
The difficulty is handling partial failures: a leader that is alive but unreachable, a network partition that splits the cluster, a GC pause that makes the leader appear dead. These failure modes create the possibility of split-brain, where two nodes simultaneously believe they are the leader. Solving split-brain is the core technical challenge of leader election.
One-Line Definition
Leader election is the process by which a cluster of nodes agrees on exactly one node to perform a specific role, and re-elects a replacement when that node fails.
Analogy
Think of a relay race team with a backup runner. Only the person holding the baton runs. If the current runner stumbles, the baton is handed to the next runner in line, not thrown to the ground where multiple people might grab it. The baton is the leadership token: whoever holds it acts, everyone else waits.
Critically, a runner who dropped the baton three laps ago cannot just start running again because they still feel like the leader. The referee (the fencing token check at the resource layer) stops them. The baton has a number on it, and the finish line only accepts the highest-numbered baton it has seen. An old baton from a previous handoff is rejected.
Solution Walkthrough
The Core Challenge: Fencing Tokens
The hard part of leader election is not choosing a leader. It is preventing a deposed leader from taking action after it has been replaced. I see candidates skip fencing in interviews, and that is always the gap interviewers probe.
The split-brain scenario: Node A wins election. A network partition isolates A from the rest of the cluster. Nodes B and C elect B as the new leader. A's heartbeat still runs locally, so it thinks it is still leader. Now two nodes both believe they are leader and both write to the database. Data is corrupted.
This is not a theoretical concern. Martin Kleppmann documented real-world cases where GC pauses as short as 10 seconds caused split-brain in systems using lease-based coordination. At scale, everything that can happen eventually will.
The fix is a fencing token: the election mechanism issues a monotonically increasing token with each new leader. Any resource the leader touches must record the highest token it has seen and reject requests carrying a lower token. An old leader carrying a stale token gets rejected automatically.
Here is how fencing tokens flow through the system:
Time 0: Leader A wins election, receives fencing token = 5
Time 1: Leader A writes to DB with token=5, DB records max_token=5
Time 2: Leader A gets network-partitioned
Time 3: Leader B wins election, receives fencing token = 6
Time 4: Leader B writes to DB with token=6, DB records max_token=6
Time 5: Leader A's partition heals, A tries to write with token=5
Time 6: DB rejects A's write because 5 < max_token (6)
The fencing token must be enforced at the resource layer, not just checked by the leader itself. A deposed leader that checks "am I still leader?" using its own local state will always say "yes" because it does not know it has been replaced. The resource (database, file system, API) must independently validate the token.
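A minimal in-memory sketch of that resource-side check (the class and method names are illustrative, not from any specific library):

```typescript
// Hypothetical resource that enforces fencing tokens independently
// of any leader's local view of its own status.
class FencedResource {
  private maxToken = 0;
  private data = new Map<string, string>();

  // Accept a write only if its token is at least the highest seen.
  write(key: string, value: string, token: number): boolean {
    if (token < this.maxToken) return false; // stale leader rejected
    this.maxToken = token;
    this.data.set(key, value);
    return true;
  }

  read(key: string): string | undefined {
    return this.data.get(key);
  }
}

// Replay the timeline above: leader A holds token 5, leader B holds 6.
const db = new FencedResource();
const aWrites = db.write("balance", "from-A", 5); // A is current leader
const bWrites = db.write("balance", "from-B", 6); // B supersedes A
const staleA = db.write("balance", "late-from-A", 5); // rejected: 5 < 6
```

Note that equal tokens are accepted, so the current leader can write repeatedly without being fenced out by its own previous writes.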
For your interview: say "fencing token" within the first 30 seconds of discussing leader election. It shows you understand the real problem, not just the happy path.
ZooKeeper Ephemeral Sequential Nodes
ZooKeeper is the most battle-tested leader election mechanism. The approach uses two properties: ephemeral nodes (deleted when the session disconnects) and sequential nodes (ZooKeeper appends a monotonically increasing counter). Apache Curator, the high-level Java client for ZooKeeper, provides a LeaderLatch recipe that implements this protocol in production-grade code.
Election protocol:
- Each candidate creates an ephemeral sequential node under `/election/`: `/election/node-0001`, `/election/node-0002`, etc.
- Each candidate reads all children of `/election/` and checks if its node has the smallest sequence number.
- If yes, it is the leader.
- If no, it sets a watch on the node with the next-smaller sequence number (its predecessor).
- When the predecessor is deleted (leader crash or graceful shutdown), the watcher is notified and the node checks again.
The "watch predecessor" pattern avoids the thundering herd problem. If all followers watched the leader's node, every follower would wake up and race to check when the leader dies. With predecessor watches, only one node wakes up per failure, giving O(1) notification cost.
The ZooKeeper session ID and the zxid (transaction ID) serve as natural fencing tokens. The zxid is monotonically increasing across all ZooKeeper operations, so a new leader's session will always have a higher zxid than the old leader.
ZooKeeper vs the simpler non-sequential approach
A simpler ZooKeeper election has all nodes race to create a single ephemeral node (e.g., /election/leader). One wins, the rest watch it. This works for small clusters but suffers from thundering herd: when the leader dies, all N-1 followers wake up and race to create the node. The sequential approach above is the production-grade solution that scales to hundreds of candidates without contention storms.
etcd Lease-Based Election
etcd provides a simpler election API built on leases. A lease is a time-limited handle: it expires if not renewed within a TTL.
Election protocol:
- Each candidate calls `Grant` to obtain a lease with a TTL (e.g., 15 seconds).
- Each candidate attempts to put a key (e.g., `/election/leader`) using a transaction: "if the key does not exist, set it to my node ID with my lease attached."
- Exactly one candidate succeeds (etcd transactions are linearizable). That node is leader.
- The leader calls `KeepAlive` on a regular interval to refresh the lease before TTL expiry.
- If the leader crashes, the heartbeat stops, the lease expires, and etcd deletes the key automatically.
- Other candidates, which are watching the key, detect the deletion and re-attempt the campaign.
etcd returns a revision number per write. This revision is a global, monotonically increasing counter across the entire etcd cluster. It is the fencing token. Resources must reject operations from clients carrying a revision lower than the highest they have seen.
The etcd concurrency package in Go abstracts all of this into a clean API:
// Go example using etcd concurrency package
session, _ := concurrency.NewSession(client, concurrency.WithTTL(15))
election := concurrency.NewElection(session, "/election/leader")
// Blocks until this node becomes leader
election.Campaign(ctx, nodeID)
// node is now leader, use election.Rev() as fencing token
fencingToken := election.Rev()
// When done, voluntarily resign
election.Resign(ctx)
My recommendation: etcd is the default choice for new projects today. Its gRPC API is cleaner than ZooKeeper's, the election primitives are built-in, and Kubernetes already runs etcd under the hood so you often get it for free.
Choosing between election mechanisms
Here is the honest answer on choosing: if you are on Kubernetes, use the Lease API (it is backed by etcd and requires zero additional infrastructure). If you are not on Kubernetes but need strong consistency, use etcd or ZooKeeper directly. If duplicate work is tolerable and you want simplicity, Redis SETNX is fine. Most startups should start with Redis SETNX and migrate to etcd when they hit the consistency ceiling.
Redis SETNX + TTL
SET leader <node-id> NX EX 30
NX means "only set if not exists." EX 30 sets a 30-second TTL. The leader refreshes the TTL periodically before expiry. If the leader crashes, the key expires after 30 seconds and a new leader can claim it.
Redis election is not strongly consistent
Redis leader election with SETNX is not strongly consistent when running Redis standalone or with async replication. If the leader writes the key to the primary and the primary crashes before replication, a replica promoted to primary will not have the key. Two nodes can elect themselves simultaneously. For critical coordination, use ZooKeeper or etcd. Redis SETNX is acceptable for "best effort" leader election where occasional duplicate work is tolerable (cache warming, non-critical background jobs).
Redis does not provide a built-in fencing token. You can simulate one by atomically incrementing a counter when claiming leadership, but there is no mechanism to enforce the token at the storage layer the way ZooKeeper zxids or etcd revisions provide. Despite these limitations, Redis SETNX is the most common leader election mechanism in startups and small teams because of its simplicity.
If you use Redis for leader election in production, add these safeguards:
- Set the TTL to at least 3x your expected maximum network latency to avoid false expirations.
- The leader should check that the key still contains its own node ID before performing critical operations.
- Add Lua script-based atomic operations for "check and renew" instead of separate GET and EXPIRE calls.
- Accept that under Redis failover (primary crashes, replica promoted), a brief window exists where two nodes may claim leadership. Design downstream operations to be idempotent.
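To make the owner-checked renewal concrete, here is a simulated sketch of the "compare owner, then extend TTL" step (an in-memory stand-in, not a real Redis client; the server-side Lua equivalent is shown in the comment):

```typescript
// Equivalent atomic Lua on the Redis server:
//   if redis.call("GET", KEYS[1]) == ARGV[1]
//   then return redis.call("EXPIRE", KEYS[1], ARGV[2]) else return 0 end
type Entry = { owner: string; expiresAt: number };

class LeaderKey {
  private entry: Entry | null = null;

  // SET leader <owner> NX EX <ttl>: succeeds only if absent or expired.
  acquire(owner: string, ttlMs: number, now: number): boolean {
    if (this.entry && this.entry.expiresAt > now) return false;
    this.entry = { owner, expiresAt: now + ttlMs };
    return true;
  }

  // Atomic check-and-renew: only the current owner may extend the TTL.
  renew(owner: string, ttlMs: number, now: number): boolean {
    if (!this.entry || this.entry.expiresAt <= now) return false;
    if (this.entry.owner !== owner) return false;
    this.entry.expiresAt = now + ttlMs;
    return true;
  }
}

const key = new LeaderKey();
key.acquire("node-a", 30_000, 0);      // node-a becomes leader
key.acquire("node-b", 30_000, 10_000); // fails: key still held
key.renew("node-b", 30_000, 20_000);   // fails: node-b is not the owner
```

Separate GET and EXPIRE calls would leave a race between the check and the renewal; the Lua script (or this simulated equivalent) makes the pair atomic.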
Kubernetes Leader Election (coordination.k8s.io API)
If your service already runs on Kubernetes, you get leader election for free. The coordination.k8s.io/v1 Lease resource is a Kubernetes-native leader election primitive backed by etcd.
Each replica creates or attempts to update a Lease object. The Lease has a holderIdentity, leaseDurationSeconds, and renewTime. The holder refreshes renewTime on each heartbeat. Other replicas read the Lease and check if renewTime + leaseDuration has passed. If so, the lease is expired and any replica can acquire it by updating holderIdentity.
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: my-controller-leader
  namespace: default
spec:
  holderIdentity: "pod-abc-123"
  leaseDurationSeconds: 15
  acquireTime: "2026-04-04T12:00:00Z"
  renewTime: "2026-04-04T12:00:10Z"
The Kubernetes client libraries (Go, Java, Python) provide LeaderElector wrappers that handle the acquire/renew loop. Your code just implements OnStartedLeading() and OnStoppedLeading() callbacks. This is how the Kubernetes scheduler, controller-manager, and most custom operators implement HA.
The beauty of the Kubernetes approach: no additional infrastructure. No separate ZooKeeper or etcd cluster to manage. The Kubernetes API server already runs etcd. You create a Lease object, implement two callbacks, and leader election just works. The downside: your application is now coupled to Kubernetes. If you ever need to run outside of K8s, you will need to swap the election mechanism.
Leader Heartbeat and Failure Detection
Every leader election mechanism requires a heartbeat to distinguish a living leader from a dead one. The heartbeat is the leader periodically proving it is still alive by renewing its lease, refreshing a TTL, or maintaining a session.
The heartbeat interval and the failure detection timeout create a fundamental trade-off:
| Setting | Heartbeat interval | Failure timeout | Behavior |
|---|---|---|---|
| Aggressive | 1 second | 3 seconds | Fast failover (3s), but network jitter causes false positives |
| Balanced | 5 seconds | 15 seconds | Standard choice, tolerates brief network hiccups |
| Conservative | 10 seconds | 45 seconds | Very stable, but leader crashes take 45s to detect |
The rule of thumb: set the failure timeout to at least 3x the heartbeat interval. This gives the leader three chances to prove it is alive before being declared dead. The heartbeat itself should run in a dedicated thread or goroutine that is not blocked by application logic, GC pauses, or I/O operations.
A common mistake: embedding the heartbeat renewal inside the main processing loop. If the processing loop blocks on a slow database query for 20 seconds and the lease TTL is 15 seconds, the leader loses its lease mid-operation.
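A deterministic sketch of that failure mode: simulate a lease TTL against the gaps between renewals in a work loop (pure simulation, no coordination service involved):

```typescript
// Returns true if the lease expires between renewals, i.e. the
// leader loses leadership mid-operation.
function leaseLostDuringWork(
  ttlSec: number,
  gapsBetweenRenewalsSec: number[] // how long the loop blocks before each renewal
): boolean {
  let remaining = ttlSec;
  for (const gap of gapsBetweenRenewalsSec) {
    remaining -= gap;
    if (remaining < 0) return true; // lease expired before the renewal arrived
    remaining = ttlSec; // renewal refreshes the TTL
  }
  return false;
}

// Heartbeat embedded in the processing loop: one slow 20s database
// query exceeds the 15s TTL and the lease is lost.
leaseLostDuringWork(15, [5, 5, 20, 5]); // true
// Heartbeat in a dedicated task firing every 5s is never blocked:
leaseLostDuringWork(15, [5, 5, 5, 5]); // false
```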
What Happens During Leader Transition
The transition between the old leader and the new leader is the most dangerous window. A typical failover passes through three phases: failure detection (the TTL or session expires), election (a new leader campaigns and wins), and initialization (the new leader rebuilds any state it needs). Until all three complete, the cluster is leaderless.
During the leaderless gap, four things can happen depending on your design:
- Work queues up: producers keep writing to the queue, and the new leader drains the backlog after election. This is the safest approach.
- Work is rejected: clients receive errors during the gap. Acceptable for non-critical paths, but requires client retry logic.
- Work is lost: the worst case. This happens when the leader holds in-memory state that was not checkpointed before the crash.
- Stale leader interferes: the old leader, unaware it was deposed, continues processing and corrupts state. This is the split-brain case that fencing tokens prevent.
The total failover time is:
total_failover = failure_detection_timeout + election_time + new_leader_initialization
For etcd with a 15-second TTL, this is typically 15 + 1 + 2 = 18 seconds. For ZooKeeper with a 10-second session timeout, it is roughly 10 + 1 + 2 = 13 seconds.
The initialization phase is often underestimated: if the leader maintains in-memory state (caches, connection pools, partition maps), the new leader needs to rebuild that state before it can serve.
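As a sanity check, the arithmetic above as a trivial helper:

```typescript
// total_failover = failure_detection_timeout + election + initialization
function totalFailoverSec(
  detectionTimeoutSec: number,
  electionTimeSec: number,
  initTimeSec: number
): number {
  return detectionTimeoutSec + electionTimeSec + initTimeSec;
}

totalFailoverSec(15, 1, 2); // etcd with a 15s TTL: 18 seconds
totalFailoverSec(10, 1, 2); // ZooKeeper with a 10s session timeout: 13 seconds
```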
Graceful Leadership Transfer
Not every leadership change is a crash. Healthy systems also transfer leadership during rolling deployments, node maintenance, or capacity rebalancing. A graceful transfer avoids the leaderless gap entirely.
The protocol: the current leader finishes its current operation, checkpoints state to durable storage, and explicitly releases its lease. A follower detects the release and campaigns for leadership immediately. In etcd, the concurrency package provides a Resign() method for exactly this purpose. In ZooKeeper, the leader deletes its ephemeral node explicitly rather than waiting for session expiry.
The difference between graceful and crash failover: graceful has ~1-2 seconds of downtime (just the election), crash has TTL + election + initialization (10-30 seconds). Design your leader's shutdown hook to always attempt graceful resignation.
For Kubernetes deployments, use a preStop lifecycle hook to trigger graceful resignation before the pod is terminated. Combine this with a terminationGracePeriodSeconds value long enough for the leader to finish its current operation and resign cleanly. The sequence is: Kubernetes sends SIGTERM, the preStop hook calls Resign(), the leader finishes in-flight work, then the pod terminates. Without this, a rolling deployment causes a hard failover every time.
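A sketch of the pod spec wiring for that sequence. The `/resign` endpoint and the curl command are assumptions about the application (something must translate the hook into a `Resign()` call); they are not a Kubernetes convention:

```yaml
# Illustrative only: assumes the app exposes a /resign endpoint
# that triggers graceful resignation before termination.
spec:
  terminationGracePeriodSeconds: 30
  containers:
    - name: my-controller
      image: my-controller:latest
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "curl -sf http://localhost:8080/resign"]
```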
Comparison Table
| Property | ZooKeeper | etcd | Redis SETNX |
|---|---|---|---|
| Consistency | Linearizable (ZAB) | Linearizable (Raft) | Eventual (async repl.) |
| Fencing token | zxid (built-in) | Revision (built-in) | Must implement manually |
| Failure detection | Session timeout + heartbeat | Lease TTL | Key TTL expiry |
| Leader notification | Watch on predecessor node | Watch on key deletion | Polling or pub/sub |
| Typical failover time | 2-10 seconds | 5-15 seconds (lease TTL) | 10-30 seconds (key TTL) |
| Operational complexity | High (ZK ensemble) | Medium (part of K8s) | Low (single Redis) |
| Best for | Critical coordination | Kubernetes-native systems | Best-effort tasks |
For your interview: pick one mechanism and explain it in depth rather than listing all three superficially. etcd is the safest default because Kubernetes already uses it, the API is clean, and the fencing token (revision) is built in.
Implementation Sketch
// etcd lease-based leader election (simplified)
async function runForLeader(etcd: EtcdClient, nodeId: string) {
  const lease = etcd.lease(15); // 15-second TTL
  lease.on('lost', () => {
    console.log('Lost leadership, re-campaigning...');
    runForLeader(etcd, nodeId);
  });
  try {
    // Atomically: if /election/leader does not exist, set it
    const txn = await etcd.if('/election/leader', 'Create', '==', 0)
      .then(etcd.put('/election/leader').value(nodeId).lease(lease))
      .commit();
    // A failed transaction does not throw; check the result explicitly
    if (!txn.succeeded) throw new Error('another node won');
    console.log('Elected as leader');
    const revision = await etcd.get('/election/leader')
      .number('mod_revision');
    // Use revision as fencing token for all downstream writes
    await doLeaderWork(etcd, revision);
  } catch {
    // Another node won. Watch for key deletion.
    const watcher = await etcd.watch()
      .key('/election/leader').create();
    watcher.on('delete', () => runForLeader(etcd, nodeId));
  }
}
The fencing token enforcement happens at the storage layer:
// Fencing token enforcement at the storage layer
async function fencedWrite(
  db: Database, key: string, value: string, token: number
) {
  const result = await db.query(`
    UPDATE resources
    SET value = $1, fence_token = $2
    WHERE key = $3 AND fence_token <= $2
  `, [value, token, key]);
  if (result.rowCount === 0) {
    throw new Error(`Stale leader: token ${token} rejected`);
  }
}
Without fencing enforcement at the resource layer, no election algorithm can prevent split-brain writes.
The leader's work loop should also check leadership status before each operation:
// Leader work loop with pre-operation leadership check
async function doLeaderWork(
  etcd: EtcdClient, fencingToken: number
) {
  while (true) {
    // Check leadership before each batch of work
    const currentLeader = await etcd.get('/election/leader');
    if (!currentLeader || currentLeader.mod_revision > fencingToken) {
      console.log('No longer leader, stopping work loop');
      break;
    }
    const tasks = await fetchPendingTasks();
    for (const task of tasks) {
      await fencedWrite(db, task.key, task.value, fencingToken);
    }
    await sleep(1000); // Process tasks every second
  }
}
When It Shines
- Singleton tasks: cron jobs, scheduled batch processing, periodic cleanup where duplicate execution is harmful.
- Partition ownership: Kafka broker leading a partition, stream processing consumer group coordination.
- Database primary selection: choosing which node accepts writes in a replicated database.
- Cluster state management: Elasticsearch master node managing shard allocation, Redis Sentinel coordinating failover.
- Distributed lock upgrade: when you need a long-lived lock (minutes to hours), leader election with lease renewal is more robust than a one-shot distributed lock.
- Configuration management: one node serves as the authoritative source for cluster configuration, pushes config changes to followers, and ensures consistency.
When to avoid leader election: If the task is idempotent and running it twice is harmless (cache warming, recomputing a materialized view), skip the coordination overhead. If you are distributing work across multiple workers, use competing consumers or consistent hashing. Leader election is specifically for "exactly one at a time" semantics.
Failure Modes and Pitfalls
1. Missing Fencing Tokens (Split-Brain Writes)
The number-one failure mode. The old leader has a GC pause, network partition, or slow disk. It resumes operations after a new leader is elected. Without fencing, both leaders write to shared state simultaneously. I cannot stress this enough: if your leader election design does not mention fencing, interviewers will assume you do not understand the problem.
How to detect it: Monitor for writes carrying outdated fencing tokens. Add a metric that counts rejected stale-token operations. If this metric is ever nonzero, you had a split-brain event.
How to prevent it: Enforce fencing tokens at every resource the leader touches. This includes databases (add a fence_token column to critical tables), message queues (include the token in message headers), and external APIs (pass the token as a request header that downstream services validate). The enforcement must be atomic with the operation itself, not a separate check-then-act.
2. Thundering Herd on Leader Failure
All followers watch the leader node and all race to become the new leader simultaneously. This creates a brief storm of contention that can overwhelm the coordination service. In a cluster of 500 nodes, 499 simultaneous create requests to ZooKeeper can cause significant load.
ZooKeeper avoids this with the "watch predecessor" pattern (only one node wakes up per failure). etcd's election recipe does the equivalent with revisions: candidates are ordered by creation revision, and each waits on the key immediately ahead of its own. Redis does not have a built-in solution.
3. False Leader Failure Detection
Network hiccup causes the coordinator to believe the leader is dead. A new leader is elected. The old leader's network recovers. Now the old leader must detect it was deposed. If it does not check its lease/session status before acting, it will cause split-brain.
This is especially common in cloud environments where network latency can spike during host migrations or availability zone failures. Always check "am I still the leader?" before performing any sensitive operation, and design the leader's work loop to abort gracefully if leadership is lost mid-operation.
4. Lease/Session Timeout Tuning
Too short: network jitter causes false failovers, thrashing leadership back and forth. Too long: a genuine leader crash takes a long time to detect, leaving the cluster leaderless. A useful formula: TTL = max(3 * heartbeat_interval, 3 * p99_network_latency, expected_GC_pause_duration + margin). Start with 10-15 seconds and tune based on your SLA.
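That formula as a sketch (units in milliseconds; the default margin value is an assumption, tune it for your environment):

```typescript
// TTL = max(3 * heartbeat, 3 * p99 network latency, GC pause + margin)
function leaseTtlMs(
  heartbeatMs: number,
  p99LatencyMs: number,
  gcPauseMs: number,
  marginMs = 2000 // illustrative safety margin
): number {
  return Math.max(3 * heartbeatMs, 3 * p99LatencyMs, gcPauseMs + marginMs);
}

// 5s heartbeat, 200ms p99 latency, 8s worst-case GC pause:
leaseTtlMs(5000, 200, 8000); // 15000 -> a 15-second TTL
```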
5. Leader Monopolization
One node keeps winning elections due to faster network, lower clock skew, or being first to detect the lease expiry. This concentrates all load on one machine. Raft solves this with randomized election timeouts. ZooKeeper's sequential nodes naturally rotate leadership. For Redis, add a random backoff (50-200ms) before re-attempting SETNX.
6. The Coordination Service Itself
A common concern: "We added ZooKeeper/etcd to avoid a single point of failure, but now ZooKeeper/etcd is the single point of failure." Both run as replicated clusters (3 or 5 nodes) with consensus-based replication, tolerating minority failures. If the coordination service goes down, the current leader continues serving but no new elections can happen. Run the coordination cluster on separate infrastructure from your application cluster.
7. Leader State Loss on Failover
The new leader starts with a blank slate. If the old leader accumulated in-memory state (a partition map, a consumer offset cache, a connection pool), that state is gone. Reconstruction can take seconds (re-reading a config from etcd) or minutes (rebuilding an in-memory index). Minimize leader state by storing it durably. Anything the leader needs should be recoverable from persistent storage in under 5 seconds.
Monitoring Leader Election Health
Key metrics to monitor: leader uptime (frequent changes indicate instability), heartbeat latency (rising trend predicts leadership loss), leaderless duration (directly measures unavailability), and fencing token rejections (the split-brain smoke alarm). Alert when leaderless duration exceeds your SLA or leadership changes more than 3 times per hour (flapping).
Trade-offs
| Pros | Cons |
|---|---|
| Prevents duplicate execution of critical tasks | Adds coordination dependency (ZK/etcd cluster) |
| Automatic failover on leader crash | Leaderless gap during re-election (seconds to tens of seconds) |
| Well-understood pattern with mature libraries | Fencing token enforcement is application-level responsibility |
| Works across processes, machines, and data centers | Misconfigured TTLs cause false failovers or slow recovery |
| Natural partition/shard ownership model | Coordination service itself must be highly available |
The fundamental tension: leader election trades availability for safety. During re-election, the system has no leader and work queues up. The tighter the failure detection, the more false positives you get. The looser the detection, the longer the leaderless window. There is no configuration that eliminates both risks simultaneously.
A related tension: leader election concentrates responsibility on one node. If the leader becomes a throughput bottleneck, consider splitting work into partitions where each partition has its own leader (Kafka's model), or switching to a competing-consumers pattern where multiple workers share the load.
Real-World Usage
Apache ZooKeeper itself: ZooKeeper uses ZAB (ZooKeeper Atomic Broadcast) for its own internal leader election. The ZooKeeper leader handles all write requests, while followers serve reads and forward writes to the leader. This is a meta-example: the coordination service that provides leader election for your applications is itself a leader-elected system under the hood.
Google Chubby: Google's Chubby lock service, the inspiration for ZooKeeper, uses Paxos for leader election. The original Chubby paper describes how Google uses it for GFS master selection, Bigtable master selection, and numerous internal services. Chubby's design directly influenced ZooKeeper and etcd.
Kafka partition leaders: Every Kafka partition has one leader broker that handles all reads and writes for that partition. Followers replicate from the leader. The Kafka controller (itself elected via ZooKeeper, or KRaft in newer versions) assigns partition leadership.
When a broker fails, the controller elects new leaders for all affected partitions, typically completing failover in under 5 seconds for thousands of partitions. Kafka's move to KRaft eliminates the ZooKeeper dependency entirely, using an internal Raft quorum for controller election. This is one of the clearest examples of a system moving from external coordination (ZooKeeper) to built-in consensus.
Elasticsearch master election: Elasticsearch uses a Raft-like protocol (introduced in 7.x+) to elect a master node that manages cluster state: shard allocation, index creation, node membership. Dedicated master-eligible nodes (usually 3) prevent data nodes from competing for master status. Before 7.x, the custom "Zen Discovery" protocol was prone to split-brain from misconfigured minimum_master_nodes. The move to majority quorum fixed this class of bugs.
Kubernetes leader election: Kubernetes controllers (scheduler, controller-manager) run as leader-elected singletons using the coordination.k8s.io/v1 Lease API. Only the leader instance performs work; standby instances attempt to acquire the lease on a loop. When the leader pod dies, another acquires the lease within the renewal period (default 15 seconds).
Redis Sentinel: Redis Sentinel uses a Raft-inspired protocol to elect a sentinel leader when the Redis primary fails. The sentinel leader orchestrates failover: promoting a replica to primary, reconfiguring other replicas, and notifying clients.
How This Shows Up in Interviews
Leader election comes up directly when designing systems like a distributed scheduler, a job queue, a notification deduplication service, or a database replication layer. It also appears implicitly whenever you say "only one instance should handle this."
The interviewer is testing three things. First, do you recognize when exactly-one semantics are needed? Second, can you name a concrete mechanism (ZooKeeper ephemeral nodes, etcd leases) and explain how it works at the protocol level? Third, do you understand the fencing token problem?
My strongest recommendation: draw the split-brain diagram early. Show the interviewer you understand the failure mode before you present the solution. Then describe fencing tokens as the mitigation. This demonstrates depth that most candidates miss entirely.
Common follow-up questions to prepare for: "What happens during the leaderless gap?", "How do you tune the TTL?", "What if the coordination service itself goes down?", and "How is this different from a distributed lock?" Have crisp one-sentence answers ready for each.
Interview shortcut: the 30-second leader election answer
"We need exactly one node to run this task. I will use etcd (or ZooKeeper) for leader election with lease-based failure detection. The leader gets a fencing token, a monotonically increasing revision number. Any downstream resource rejects requests carrying a token lower than the highest it has seen. This prevents a stale leader from corrupting state after being deposed."
Quick Recap
- Leader election ensures exactly one node runs a critical task at any given time, preventing duplicate execution in distributed systems.
- The real challenge is not election but preventing split-brain, where a deposed leader continues acting after a new leader takes over.
- Fencing tokens (monotonically increasing numbers issued to each new leader) solve split-brain by letting downstream resources reject stale requests.
- ZooKeeper uses ephemeral sequential nodes with predecessor watches. etcd uses lease-based campaigns with revision-based fencing. Redis SETNX is simpler but lacks consistency guarantees.
- Every election mechanism has a leaderless gap between leader failure and new election. Design systems to queue work during this window.
- Kubernetes provides built-in leader election via the Lease API, backed by etcd, making it the easiest choice for K8s-native systems.
- Heartbeat intervals and failure timeouts create a fundamental trade-off: faster detection means more false positives, slower detection means longer outages.
- Graceful leadership transfer (via explicit resignation) eliminates the leaderless gap during planned maintenance and rolling deployments.
- In interviews, draw the split-brain diagram first, then present fencing tokens as the fix. Name one concrete mechanism and explain it in depth.
Related Patterns
- Raft consensus: The consensus algorithm that powers etcd's linearizable leader election and log replication.
- Distributed locks: Shorter-lived mutual exclusion, useful when you do not need stable leadership but need to protect a critical section.
- Consistent hashing: Often paired with leader election for partition assignment, where the leader uses consistent hashing to distribute partitions across nodes.
- Competing consumers: An alternative to leader election for work distribution, where multiple consumers share a queue instead of electing a single processor.
- Circuit breaker: Protects the system when the coordination service (ZooKeeper/etcd) is unavailable, preventing cascading failures during election outages.