Vector clocks
How vector clocks track causal relationships between events in distributed systems, detecting concurrent writes, resolving conflicts, and why Lamport timestamps alone are insufficient for causality.
The problem
Two users edit the same shared document simultaneously. User A changes the title on Server 1. User B changes the body on Server 2. Both writes complete successfully. When the servers sync, you need to know: did A's edit happen before B's, did B's happen before A's, or did they happen concurrently? The answer determines whether you can safely apply both edits or whether you have a conflict that needs resolution.
The obvious approach is to compare timestamps. Server 1 recorded A's edit at 14:00:00.001. Server 2 recorded B's edit at 14:00:00.002. B came later, right? Not necessarily. Server 2's clock might be 500ms ahead of Server 1's. The "later" timestamp means nothing when clocks aren't synchronized.
Server 1: edit_title at 14:00:00.001 (clock accurate)
Server 2: edit_body at 14:00:00.002 (clock 500ms ahead, real time: 13:59:59.502)
Timestamp says: Server 2's edit is later
Reality: Server 2's edit happened 499ms EARLIER
NTP (Network Time Protocol) helps but doesn't solve the problem. NTP synchronizes clocks to within a few milliseconds on a good day, but under network congestion, clock drift can reach hundreds of milliseconds. For a system making thousands of writes per second, a few milliseconds of skew means thousands of incorrectly ordered operations.
The consequences of getting ordering wrong are severe. A distributed database that uses wall-clock timestamps for conflict resolution (Last-Write-Wins) will silently discard the "losing" write. In a shopping cart, this means a user adds an item and it vanishes. In a financial system, a withdrawal might overwrite a deposit. The data looks consistent (no error is raised), but the result is wrong.
You need a mechanism that tracks causality without relying on physical time. That mechanism is vector clocks.
What it is
A vector clock is a logical timestamp that uses one counter per node in the system instead of a single global counter. By tracking which events each node has seen, vector clocks can definitively answer the question that wall clocks cannot: are two events causally related, or are they concurrent?
Think of it like a guest book at a conference with multiple rooms. Each room has its own guest book. When someone moves from Room A to Room B, they carry Room A's guest count with them. If you later see two guest books and Room A's book shows "10 visitors" while Room B's shows "Room A had 8 visitors when I last checked," you know that Room B's state reflects events before Room A reached 10 visitors. They diverged after visitor 8. Any entries made in Room A after visitor 8, and any entries in Room B after that checkpoint, are concurrent.
Formally, a vector clock for a system with N nodes is an array of N non-negative integers. The i-th element records how many events node i has performed (or, more precisely, the latest event from node i that is in the causal past of the current event). The comparison rule is component-wise: VC1 happened before VC2 if and only if every component of VC1 is less than or equal to the corresponding component of VC2, with at least one strictly less.
How it works
Why Lamport timestamps are not enough
Leslie Lamport's logical clock is a simple step beyond wall clocks: a single counter per node, incremented on every event:
Rules:
On send: increment counter, attach to message
On receive: counter = max(local, received) + 1
On local event: increment counter
A: [1] send message to B
B: [2] (max(0, 1) + 1) receive, increment
B: [3] send to C
C: [4] receive, process
Lamport timestamps give a partial causal order: if event A happened before event B, then A's timestamp < B's timestamp. But the converse is NOT true: if A's timestamp < B's, it doesn't mean A caused B. They might be concurrent.
You can say "A happened before B" from Lamport timestamps. You cannot say "A and B are concurrent." This is the fundamental limitation. In a system that needs to detect conflicts (like a replicated database), knowing that two events are concurrent is exactly the information you need.
Lamport timestamps are still useful. They are cheap (one integer per event), and many systems only need total ordering, not conflict detection. Google Spanner uses TrueTime (hardware-backed physical clocks with bounded uncertainty) rather than vector clocks because it only needs a total order for its external consistency guarantee. If you need to detect conflicts rather than impose an arbitrary order, Lamport timestamps are insufficient.
The diagram illustrates the key insight: Lamport timestamps leave ambiguity (X might have caused Y, or they might be concurrent). Vector clocks resolve this ambiguity by tracking per-node counters.
The vector clock algorithm
A vector clock solves this by maintaining one counter per node in the system:
The rules are simple:
- On local event: increment your own slot in the vector
- On send: increment your own slot, attach the full vector to the message
- On receive: take the pairwise maximum of the received vector and your local vector, then increment your own slot
Here is the pseudocode:
function on_local_event(node_id, clock):
clock[node_id] += 1
return clock
function on_send(node_id, clock):
clock[node_id] += 1
send(message, clock)
return clock
function on_receive(node_id, local_clock, received_clock):
for i in 0..N:
local_clock[i] = max(local_clock[i], received_clock[i])
local_clock[node_id] += 1
return local_clock
function compare(vc1, vc2):
if vc1[i] <= vc2[i] for all i: return BEFORE
if vc1[i] >= vc2[i] for all i: return AFTER
return CONCURRENT // neither dominates
The comparison is the key operation. Two vector clocks are compared component-wise:
- If every component of VC1 is β€ the corresponding component of VC2, then VC1 happened before VC2 (VC1 is an ancestor of VC2)
- If every component of VC1 is β₯ the corresponding component of VC2, then VC1 happened after VC2 (VC2 is an ancestor of VC1)
- If neither condition holds (some components are greater, some are smaller), the events are concurrent (neither caused the other)
This three-way comparison is the fundamental operation. Every other aspect of vector clocks (conflict detection, merge, sibling management) builds on this comparison.
The space cost is O(N) per vector clock, where N is the number of nodes. For a 5-node cluster, that's 5 integers per key. For a 500-node cluster, it's 500 integers per key, which is why vector clock size management matters (covered below). The time cost of comparison is also O(N): a single pass through both vectors.
Here is a worked example with all three outcomes:
VC_a = [3, 1, 2]
VC_b = [3, 2, 3]
VC_c = [4, 0, 1]
Compare VC_a vs VC_b:
3 <= 3, 1 <= 2, 2 <= 3 -> all β€ -> VC_a HAPPENED BEFORE VC_b
Compare VC_a vs VC_c:
3 < 4 (a < c), 1 > 0 (a > c) -> mixed -> CONCURRENT
Compare VC_b vs VC_c:
3 < 4, 2 > 0, 3 > 1 -> mixed -> CONCURRENT
Conflict detection and resolution
When a quorum-based database receives a read, and two replicas return different values with concurrent vector clocks, the system has detected a conflict. What happens next depends on the conflict resolution strategy.
There are three common strategies: (1) return all siblings to the client and let the application merge (Dynamo, Riak), (2) use Last-Write-Wins and silently discard the "loser" (Cassandra), or (3) use CRDTs that merge automatically (Riak 2.0+ data types). The first approach preserves all data at the cost of client complexity. The second is simple but loses data. The third constrains your data model but eliminates both problems.
In Amazon Dynamo's original design, the system returned all conflicting versions (called "siblings") to the next reader. The client was responsible for merging them.
System returns to client:
[
{color: "red", version: [A:2, B:0]},
{color: "green", version: [A:1, B:1]}
]
Client merges (application-specific logic):
merged = {color: "red"} // or "green", or some combination
Write back with merged vector: [A:2, B:1]
Amazon's shopping cart used this approach. Two conflicting carts were merged by taking the union of items. The worst case was a deleted item reappearing (if one sibling had it and the other didn't), but no items were ever silently lost. This is a deliberate tradeoff: data preservation over silent data loss.
Merge rules
When merging two concurrent vector clocks, the merged clock takes the component-wise maximum:
VC1 = [A:2, B:0, C:1]
VC2 = [A:1, B:1, C:3]
Merged = [max(2,1), max(0,1), max(1,3)] = [A:2, B:1, C:3]
The merged vector clock dominates both parents. Any future event that sees the merged clock knows it happened after both conflicting events. This is how conflicts are resolved: the merge creates a new "point in time" that supersedes both conflicting versions.
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.