Distributed locking
How to coordinate exclusive access to a shared resource across distributed processes. Covers Redis SETNX locks, fencing tokens, Redlock, etcd/ZooKeeper locks, and Martin Kleppmann's critique.
TL;DR
- Distributed locks coordinate exclusive access to a shared resource across machines. Local in-memory locks (
synchronized,mutex) are useless because each process has its own memory space. - Redis
SET key value NX PX ttlis the most common distributed lock: fast to acquire, auto-expires on crash, and sufficient for most coordination tasks. - Fencing tokens protect against the GC-pause problem: a slow lock holder whose TTL expires can still write to the resource after a new holder takes over. Without fencing, you get silent data corruption.
- The Redlock algorithm (Antirez, 2015) acquires locks across 5 independent Redis instances. Martin Kleppmann's critique showed it's still unsafe under clock skew, sparking one of the most important debates in distributed systems.
- For strict correctness (financial transactions, shard assignment), use consensus-backed locks (etcd, ZooKeeper). For best-effort coordination (cron dedup, cache warming), Redis locks are fine.
The Problem It Solves
It's 11:59 PM. Your payment service processes end-of-day settlement batches. A Kubernetes pod running the settlement job starts processing batch #4472. Four seconds in, Kubernetes detects a slow health check and spins up a replacement pod. The replacement also picks up batch #4472. Now two pods are processing the same batch simultaneously.
Pod A debits the merchant account. Pod B debits the merchant account. The merchant gets charged twice. The support team spends the next morning issuing refunds, and your engineering team spends the next week explaining how this happened.
The root cause: both pods checked "is batch #4472 already processing?" against their own in-memory state. Each pod's answer was "no" because each pod has its own memory space. The synchronized block that was supposed to prevent concurrent execution only works within a single JVM.
This isn't a theoretical problem. I've seen it in production at companies that should know better. Any system running multiple instances of the same service (which is every production system) needs a way to coordinate exclusive access across those instances.
Local locks solve the single-process case. Distributed locks solve the multi-process case. The difference matters every time you horizontally scale.
What Is It?
A distributed lock is a mechanism that ensures at most one process across a cluster can execute a critical section at any given time. The lock lives in a shared store (Redis, etcd, a database) that all processes can access, rather than in any single process's memory.
Think of a shared bathroom with a lock on the door. Anyone in the building can try the handle. If the door is locked, you wait. If it's unlocked, you enter and lock it behind you. The lock isn't in your pocket (local mutex). It's on the door itself (shared store). Everyone checks the same door.
The distributed lock must provide three properties:
- Mutual exclusion: At most one holder at any time.
- Deadlock freedom: If a holder crashes without releasing, the lock eventually becomes available (via TTL or lease expiration).
- Fault tolerance: The lock service itself must be available even if some of its nodes fail.
For your interview: say "I'd use a distributed lock with TTL for mutual exclusion, and fencing tokens if I need correctness under process pauses." That covers the two most important design decisions.
A distributed lock does not guarantee mutual exclusion
Most distributed locks (Redis SETNX, Redlock) guarantee mutual exclusion only if the lock holder completes its work before the TTL expires. If the holder pauses (GC, slow network, disk I/O stall) past the TTL, the lock expires, a second process acquires it, and you have two holders. Fencing tokens are the only reliable fix. Without fencing, a distributed lock is a "mostly exclusive" lock.
How It Works
I'll walk through the most common approach: a Redis-based distributed lock with TTL.
Acquiring the Lock
The core operation is Redis SET key value NX PX ttl:
NX: Set only if the key does not exist (atomic check-and-set)PX: Set a TTL in milliseconds (auto-release on crash)value: A unique holder ID (UUID) so only the holder can release
// Acquire lock
const lockKey = "lock:batch:4472";
const holderId = crypto.randomUUID();
const ttlMs = 30_000; // 30 seconds
const acquired = await redis.set(lockKey, holderId, "NX", "PX", ttlMs);
if (acquired === "OK") {
// Lock acquired, do the work
try {
await processBatch(4472);
} finally {
// Release lock (only if we still hold it)
await releaseLock(lockKey, holderId);
}
} else {
// Lock held by someone else, back off or retry
console.log("Batch 4472 already being processed");
}
Releasing the Lock
Release must be atomic: check that you still hold the lock, then delete it. Without atomicity, you can delete another holder's lock.
-- Lua script: atomic check-and-delete
-- KEYS[1] = lock key, ARGV[1] = holder ID
if redis.call("GET", KEYS[1]) == ARGV[1] then
return redis.call("DEL", KEYS[1])
else
return 0 -- someone else holds the lock now
end
The Lua script is critical. A non-atomic GET followed by DEL has a TOCTOU (time-of-check to time-of-use) race: between GET and DEL, the lock might expire and be reacquired by another process. Your DEL then removes their lock.
The Fencing Token Problem
Here's the correctness gap that most Redis lock tutorials skip. A process can hold the lock, get paused (GC stop-the-world, network partition, slow disk I/O), and resume after the TTL expires. By then, another process has the lock. Both processes now believe they have exclusive access.
Fencing tokens fix this. The lock server issues a monotonically increasing integer with each lock acquisition. The downstream resource (database, API, storage layer) checks the token on every write and rejects writes with a token lower than the last seen value.
The catch: fencing tokens require the downstream resource to understand and enforce them. Most databases don't natively support this. You can implement it with a version column:
-- Resource-side fencing check
UPDATE batches
SET status = 'processing', fence_token = 35
WHERE id = 4472 AND fence_token < 35;
-- affected_rows = 0 means stale token
The Redlock Algorithm
Redis is single-node by default. If that node crashes, all locks are lost. Redlock (proposed by Salvatore Sanfilippo) attempts to make Redis locks fault-tolerant by acquiring across 5 independent Redis instances:
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.