Vector clocks
How vector clocks track causal relationships between events in distributed systems, how they detect concurrent writes and support conflict resolution, and why Lamport timestamps alone are insufficient for causality.
The problem
Two users edit the same shared document simultaneously. User A changes the title on Server 1. User B changes the body on Server 2. Both writes complete successfully. When the servers sync, you need to know: did A's edit happen before B's, did B's happen before A's, or did they happen concurrently? The answer determines whether you can safely apply both edits or whether you have a conflict that needs resolution.
The obvious approach is to compare timestamps. Server 1 recorded A's edit at 14:00:00.001. Server 2 recorded B's edit at 14:00:00.002. B came later, right? Not necessarily. Server 2's clock might be 500ms ahead of Server 1's. The "later" timestamp means nothing when clocks aren't synchronized.
Server 1: edit_title at 14:00:00.001 (clock accurate)
Server 2: edit_body at 14:00:00.002 (clock 500ms ahead, real time: 13:59:59.502)
Timestamp says: Server 2's edit is later
Reality: Server 2's edit happened 499ms EARLIER
NTP (Network Time Protocol) helps but doesn't solve the problem. NTP synchronizes clocks to within a few milliseconds on a good day, but under network congestion, clock drift can reach hundreds of milliseconds. For a system making thousands of writes per second, a few milliseconds of skew means thousands of incorrectly ordered operations.
The consequences of getting ordering wrong are severe. A distributed database that uses wall-clock timestamps for conflict resolution (Last-Write-Wins) will silently discard the "losing" write. In a shopping cart, this means a user adds an item and it vanishes. In a financial system, a withdrawal might overwrite a deposit. The data looks consistent (no error is raised), but the result is wrong.
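To make the failure concrete, here is a minimal Last-Write-Wins sketch fed the skewed timestamps from the example above. The field names ("value", "ts") are illustrative, not from any real database:

```python
# Minimal LWW sketch: the resolver trusts wall-clock timestamps.
def lww_resolve(a, b):
    """Keep the write with the larger timestamp; silently drop the other."""
    return a if a["ts"] >= b["ts"] else b

# Timestamps as seconds since midnight. Server 1's clock is accurate;
# Server 2's clock runs 500ms fast.
edit_title = {"value": "new title", "ts": 50400.001}  # stamped 14:00:00.001 (real)
edit_body  = {"value": "new body",  "ts": 50400.002}  # stamped 14:00:00.002,
                                                      # but really 13:59:59.502

winner = lww_resolve(edit_title, edit_body)
print(winner["value"])  # "new body" wins; the title edit is silently discarded
```

The causally earlier write wins purely because its server's clock runs fast, and no error is ever raised.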
You need a mechanism that tracks causality without relying on physical time. That mechanism is vector clocks.
What it is
A vector clock is a logical timestamp that uses one counter per node in the system instead of a single global counter. By tracking which events each node has seen, vector clocks can definitively answer the question that wall clocks cannot: are two events causally related, or are they concurrent?
Think of it like a guest book at a conference with multiple rooms. Each room has its own guest book. When someone moves from Room A to Room B, they carry Room A's guest count with them. If you later see two guest books and Room A's book shows "10 visitors" while Room B's shows "Room A had 8 visitors when I last checked," you know that Room B's state reflects events before Room A reached 10 visitors. They diverged after visitor 8. Any entries made in Room A after visitor 8, and any entries in Room B after that checkpoint, are concurrent.
Formally, a vector clock for a system with N nodes is an array of N non-negative integers. The i-th element records how many events node i has performed (or, more precisely, the latest event from node i that is in the causal past of the current event). The comparison rule is component-wise: VC1 happened before VC2 if and only if every component of VC1 is less than or equal to the corresponding component of VC2, with at least one strictly less.
How it works
Why Lamport timestamps are not enough
Leslie Lamport's logical clock is a simple step beyond wall clocks: a single counter per node, incremented on every event:
Rules:
On send: increment counter, attach to message
On receive: counter = max(local, received) + 1
On local event: increment counter
A: [1] send message to B
B: [2] (max(0, 1) + 1) receive, increment
B: [3] send to C
C: [4] receive, process
Lamport timestamps are consistent with causal order: if event A happened before event B, then A's timestamp < B's timestamp. But the converse is NOT true: if A's timestamp < B's, it doesn't mean A caused B. They might be concurrent.
From a Lamport comparison you can only rule out one direction: ts(A) < ts(B) tells you that B did not happen before A. You cannot tell whether A caused B or whether they are concurrent. This is the fundamental limitation. In a system that needs to detect conflicts (like a replicated database), knowing that two events are concurrent is exactly the information you need.
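The rules above fit in a few lines. Here is a minimal sketch (class and method names are illustrative) replaying the A -> B -> C trace:

```python
# A minimal Lamport clock: one counter per node.
class LamportClock:
    def __init__(self):
        self.counter = 0

    def local_event(self):
        self.counter += 1
        return self.counter

    def send(self):
        self.counter += 1
        return self.counter          # timestamp to attach to the message

    def receive(self, received):
        self.counter = max(self.counter, received) + 1
        return self.counter

# Replay the A -> B -> C exchange from the trace above.
a, b, c = LamportClock(), LamportClock(), LamportClock()
ts = a.send()          # A: 1
b.receive(ts)          # B: max(0, 1) + 1 = 2
ts = b.send()          # B: 3
c.receive(ts)          # C: max(0, 3) + 1 = 4
print(a.counter, b.counter, c.counter)  # 1 3 4
```

Note that nothing in these counters distinguishes "A caused C's event" from "they happened to interleave": that is the gap vector clocks fill.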
Lamport timestamps are still useful. They are cheap (one integer per event), and many systems only need total ordering, not conflict detection. Google Spanner uses TrueTime (hardware-backed physical clocks with bounded uncertainty) rather than vector clocks because it only needs a total order for its external consistency guarantee. If you need to detect conflicts rather than impose an arbitrary order, Lamport timestamps are insufficient.
The key limitation: given two Lamport timestamps, the smaller value leaves ambiguity (the earlier-stamped event might have caused the later one, or they might be concurrent). Vector clocks resolve this ambiguity by tracking per-node counters.
The vector clock algorithm
A vector clock solves this by maintaining one counter per node in the system:
The rules are simple:
- On local event: increment your own slot in the vector
- On send: increment your own slot, attach the full vector to the message
- On receive: take the pairwise maximum of the received vector and your local vector, then increment your own slot
Here is the pseudocode:
function on_local_event(node_id, clock):
    clock[node_id] += 1
    return clock

function on_send(node_id, message, clock):
    clock[node_id] += 1
    send(message, clock)
    return clock

function on_receive(node_id, local_clock, received_clock):
    for i in 0..N-1:
        local_clock[i] = max(local_clock[i], received_clock[i])
    local_clock[node_id] += 1
    return local_clock

function compare(vc1, vc2):
    if vc1[i] == vc2[i] for all i: return EQUAL
    if vc1[i] <= vc2[i] for all i: return BEFORE      // at least one strictly less
    if vc1[i] >= vc2[i] for all i: return AFTER       // at least one strictly greater
    return CONCURRENT                                  // neither dominates
The comparison is the key operation. Two vector clocks are compared component-wise:
- If every component of VC1 is ≤ the corresponding component of VC2, with at least one strictly less, then VC1 happened before VC2 (VC1 is an ancestor of VC2)
- If every component of VC1 is ≥ the corresponding component of VC2, with at least one strictly greater, then VC1 happened after VC2 (VC2 is an ancestor of VC1)
- If neither condition holds (some components are greater, some are smaller), the events are concurrent (neither caused the other)
This three-way comparison is the fundamental operation. Every other aspect of vector clocks (conflict detection, merge, sibling management) builds on this comparison.
The space cost is O(N) per vector clock, where N is the number of nodes. For a 5-node cluster, that's 5 integers per key. For a 500-node cluster, it's 500 integers per key, which is why vector clock size management matters (covered below). The time cost of comparison is also O(N): a single pass through both vectors.
Here is a worked example with all three outcomes:
VC_a = [3, 1, 2]
VC_b = [3, 2, 3]
VC_c = [4, 0, 1]
Compare VC_a vs VC_b:
3 <= 3, 1 <= 2, 2 <= 3 -> all ≤ -> VC_a HAPPENED BEFORE VC_b
Compare VC_a vs VC_c:
3 < 4 (a < c), 1 > 0 (a > c) -> mixed -> CONCURRENT
Compare VC_b vs VC_c:
3 < 4, 2 > 0, 3 > 1 -> mixed -> CONCURRENT
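The comparison rule translates directly to code. This sketch (function name and return strings are illustrative) reproduces all three outcomes from the worked example:

```python
# Component-wise vector clock comparison.
def compare(vc1, vc2):
    if vc1 == vc2:
        return "EQUAL"
    if all(x <= y for x, y in zip(vc1, vc2)):
        return "BEFORE"      # vc1 is an ancestor of vc2
    if all(x >= y for x, y in zip(vc1, vc2)):
        return "AFTER"       # vc2 is an ancestor of vc1
    return "CONCURRENT"      # mixed: neither dominates

vc_a, vc_b, vc_c = [3, 1, 2], [3, 2, 3], [4, 0, 1]
print(compare(vc_a, vc_b))  # BEFORE
print(compare(vc_a, vc_c))  # CONCURRENT
print(compare(vc_b, vc_c))  # CONCURRENT
```

A single O(N) pass through both vectors answers the question wall clocks cannot.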
Conflict detection and resolution
When a quorum-based database receives a read, and two replicas return different values with concurrent vector clocks, the system has detected a conflict. What happens next depends on the conflict resolution strategy.
There are three common strategies: (1) return all siblings to the client and let the application merge (Dynamo, Riak), (2) use Last-Write-Wins and silently discard the "loser" (Cassandra), or (3) use CRDTs that merge automatically (Riak 2.0+ data types). The first approach preserves all data at the cost of client complexity. The second is simple but loses data. The third constrains your data model but eliminates both problems.
In Amazon Dynamo's original design, the system returned all conflicting versions (called "siblings") to the next reader. The client was responsible for merging them.
System returns to client:
[
{color: "red", version: [A:2, B:0]},
{color: "green", version: [A:1, B:1]}
]
Client merges (application-specific logic):
merged = {color: "red"} // or "green", or some combination
Write back with context [A:2, B:1]; the coordinating node increments its own slot, so the stored clock (e.g., [A:3, B:1]) strictly dominates both siblings
Amazon's shopping cart used this approach. Two conflicting carts were merged by taking the union of items. The worst case was a deleted item reappearing (if one sibling had it and the other didn't), but no items were ever silently lost. This is a deliberate tradeoff: data preservation over silent data loss.
Merge rules
When merging two concurrent vector clocks, the merged clock takes the component-wise maximum:
VC1 = [A:2, B:0, C:1]
VC2 = [A:1, B:1, C:3]
Merged = [max(2,1), max(0,1), max(1,3)] = [A:2, B:1, C:3]
The merged vector clock dominates both parents. Any future event that sees the merged clock knows it happened after both conflicting events. This is how conflicts are resolved: the merge creates a new "point in time" that supersedes both conflicting versions.
The merge operation is commutative (merge(A,B) = merge(B,A)) and associative (merge(merge(A,B),C) = merge(A,merge(B,C))). This means nodes can merge conflicts in any order and arrive at the same result. This property is essential for convergence in distributed systems.
The data merge is application-specific, but the vector clock merge is always component-wise maximum. Do not confuse the two. A shopping cart merge might take the union of items. A counter merge might take the sum. A last-write-wins merge might pick one value and discard the other. But the clock merge is always the same: pairwise max.
When three or more siblings exist (possible under high contention), the merge is applied pairwise in any order, thanks to the associative and commutative properties:
Three siblings after concurrent writes:
VC1 = [3, 0, 1] value = {apple}
VC2 = [1, 2, 0] value = {banana}
VC3 = [0, 1, 3] value = {cherry}
Step 1: merge VC1 and VC2
clock = [max(3,1), max(0,2), max(1,0)] = [3, 2, 1]
data = {apple} ∪ {banana} = {apple, banana}
Step 2: merge result with VC3
clock = [max(3,0), max(2,1), max(1,3)] = [3, 2, 3]
data = {apple, banana} ∪ {cherry} = {apple, banana, cherry}
Final: VC=[3, 2, 3], value = {apple, banana, cherry}
The final vector [3, 2, 3] dominates all three original vectors, establishing a new causal frontier that supersedes all siblings.
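The pairwise merge above can be sketched directly; the clock merge is always component-wise max, while the data merge here happens to be set union (function and field names are illustrative):

```python
from functools import reduce

# Merge two siblings: clock = component-wise max, data = set union.
def merge(sib1, sib2):
    clock = [max(x, y) for x, y in zip(sib1["clock"], sib2["clock"])]
    return {"clock": clock, "value": sib1["value"] | sib2["value"]}

siblings = [
    {"clock": [3, 0, 1], "value": {"apple"}},
    {"clock": [1, 2, 0], "value": {"banana"}},
    {"clock": [0, 1, 3], "value": {"cherry"}},
]

# Because merge is commutative and associative, fold order doesn't matter.
result = reduce(merge, siblings)
print(result["clock"])          # [3, 2, 3]
print(sorted(result["value"]))  # ['apple', 'banana', 'cherry']
```

Shuffling `siblings` before the reduce produces the same result, which is exactly the convergence property the text describes.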
Dotted version vectors
Standard vector clocks have a subtle problem: false siblings. Here is how it happens.
A client reads a value from replica A at version [A:1]. It writes back, and the coordinator sends the update to replicas A and B. Replica A updates to [A:2]. For some reason, B was slow and gets [A:2] later. Meanwhile, another client reads from B (which still has [A:1]) and writes back, producing [A:1, B:1]. Now the system has siblings [A:2] and [A:1, B:1], which are concurrent.
But here is the subtlety: with plain version vectors, a replica cannot tell which specific write produced each stored value; it only sees node-level counters. When reads and writes interleave across replicas like this, a server receiving a write may be unable to decide whether a value it already holds is an ancestor of the incoming write (safe to discard) or truly concurrent with it (must be kept). Forced to be conservative, it keeps values it could have discarded -- these are false siblings. The root cause is that the vector clock tracks node-level counters, not individual writes.
Dotted Version Vectors (DVVs), used in Riak, solve this by adding a "dot" to each value: a (node, counter) pair identifying the specific write that created this value. When comparing versions, the system can distinguish between "this value was created by a specific write" and "this node has seen writes up to counter X." This eliminates false siblings that occur when reads and writes interleave on different replicas.
Standard vector clock:
Value "red" at [A:2, B:0] -- who created this? A at counter 2.
Value "green" at [A:1, B:1] -- who created this? B at counter 1.
Dotted version vector:
Value "red" with dot (A, 2), context [A:1, B:0]
Value "green" with dot (B, 1), context [A:1, B:0]
Now the system knows both were created from the same base context [A:1, B:0].
Neither is "false" -- they are genuinely concurrent.
In practice, DVVs reduce the number of siblings Riak tracks, which improves storage efficiency and reduces the burden on clients to resolve conflicts that aren't real conflicts.
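The core DVV check can be sketched in a few lines, reusing the example above. The representation here (dot tuples, context dicts) is an illustration, not Riak's actual wire format:

```python
# A dot (node, counter) identifies the specific write that created a value.
# A context records which writes the writer had already seen.
# A stored value is superseded iff its dot is covered by the incoming context.
def dot_covered(dot, context):
    node, counter = dot
    return context.get(node, 0) >= counter

red   = {"dot": ("A", 2), "context": {"A": 1}}
green = {"dot": ("B", 1), "context": {"A": 1}}

# Neither writer had seen the other's dot, so both survive as siblings:
print(dot_covered(red["dot"], green["context"]))   # False
print(dot_covered(green["dot"], red["context"]))   # False

# A later write whose context includes (A, 2) would supersede "red":
print(dot_covered(red["dot"], {"A": 2, "B": 1}))   # True
```

Because each value names the exact write that created it, the server discards precisely the values the incoming write supersedes and keeps only genuine conflicts.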
Version vectors vs vector clocks
These two terms are often used interchangeably, but they solve different problems. A vector clock tracks causality per event (every send, receive, and local operation increments the clock). A version vector tracks causality per state of a data item (only writes to a specific key increment the vector).
In a key-value store, you don't care about the causal history of every single message between nodes. You care about whether two versions of a key are causally related. Version vectors are sufficient for this and are cheaper. Each key-value pair carries a version vector that only increments when that specific key is updated, not on every network message.
The practical difference: a vector clock in a chatty system might reach counters in the thousands per node quickly (every message is an event). A version vector for the same key would only increment when someone writes to that key, keeping counters small. Dynamo, Riak, and Voldemort all use version vectors, not full vector clocks, even though the papers often say "vector clocks."
Vector clock (tracks every event):
Node A sends 1000 messages, writes key X once:
A's counter for key X's clock = 1000 (every send incremented it)
Version vector (tracks per-key state):
Node A sends 1000 messages, writes key X once:
A's counter for key X's version vector = 1 (only the write incremented it)
For interviews, use the terms interchangeably unless specifically asked to distinguish them. The comparison algorithm (component-wise), the merge algorithm (pairwise max), and the concurrency detection properties are identical.
Vector clock size management
In a large cluster, vector clocks can grow unbounded. Every node that has ever written to a key adds an entry to that key's vector. In a 100-node cluster with high key churn, vectors can grow to 100+ entries per key, consuming significant storage and slowing comparisons.
Three strategies for managing vector size:
1. Scoped version vectors
Instead of tracking all nodes, scope vectors to the coordinator nodes that actually handled writes for a key. In Dynamo-style systems, each key has a small set of coordinator nodes (typically N=3). The version vector only tracks these coordinators, keeping the vector at 3-5 entries regardless of total cluster size.
This is what Amazon's Dynamo and Riak actually use. The vectors track coordinator-level events, not cluster-level events. A key written only by coordinators A, B, C has a vector with 3 entries, even in a 1000-node cluster. This optimization is underappreciated in interview discussions, where candidates often worry that vector size is a real problem; in practice, scoped version vectors keep sizes small.
2. Timestamp-based pruning
If a vector entry hasn't been updated in a long time (the node that wrote it is presumably decommissioned or hasn't written to this key recently), the entry can be pruned. Dynamo's original paper stored a timestamp alongside each entry and removed the oldest entries once the vector exceeded a size threshold (the paper suggests around ten).
The risk: pruning can cause false conflicts. If a pruned entry was actually relevant to causality determination, two values that were causally ordered might appear concurrent after pruning. This is safe in the sense that false conflicts don't lose data (both values are preserved), but it increases sibling count.
3. Per-entry counters with bounded size
Cap the vector at K entries (e.g., 10). When a new node writes and the vector is full, remove the entry with the smallest counter. This bounds memory usage but, like timestamp pruning, can cause false conflicts.
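The entry-cap strategy fits in a few lines. A hedged sketch (function name and eviction policy as described above, not from any particular system):

```python
# Strategy 3 sketch: cap a version vector at `cap` entries by evicting
# the entries with the smallest counters.
def prune(vv, cap):
    """Return vv with at most `cap` entries, dropping smallest counters."""
    if len(vv) <= cap:
        return dict(vv)
    # Keep the `cap` entries with the largest counters.
    kept = sorted(vv.items(), key=lambda kv: kv[1], reverse=True)[:cap]
    return dict(kept)

vv = {"A": 7, "B": 1, "C": 4, "D": 2}
print(prune(vv, 3))  # {'A': 7, 'C': 4, 'D': 2} -- B's entry is gone,
                     # so a value only B had seen may now look concurrent
```

The pruned entry is exactly what can turn a causally ordered pair into an apparent conflict: safe (no data lost) but noisier.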
| Strategy | Vector size | False conflicts | Used by |
|---|---|---|---|
| Scoped to coordinators | O(N_coordinators) | None | Dynamo, Riak |
| Timestamp pruning | Bounded by time | Possible | Dynamo (original) |
| Entry cap | Fixed K | Possible | Voldemort |
| Dotted Version Vectors | O(N_coordinators) | Fewer than standard VCs | Riak 2.0+ |
Production usage
Modern systems use vector-clock concepts under different names and with different tradeoffs:
| System | Mechanism | How conflicts are handled |
|---|---|---|
| Dynamo (Amazon's internal store) | Version vectors | Returns all siblings to client; client merges (shopping cart: union of items) |
| Riak | Dotted Version Vectors | Client-side or CRDT-based merge; allow_mult=true returns siblings, allow_mult=false uses LWW |
| CRDTs (Riak, Redis) | Causal context (embedded vector) | Automatic merge via mathematically guaranteed convergence; no client intervention |
| Git | DAG of commits | Detects merge conflicts via common ancestor; user resolves manually or auto-merges non-overlapping changes |
| Cassandra | LWW timestamps (no vector clocks) | Silently discards concurrent writes with lower timestamp; clock skew can cause data loss |
| CouchDB | Revision tree | Maintains revision history; detects conflicts via divergent revision branches |
Git's commit DAG is semantically equivalent to vector clocks. Two branches that diverged from a common ancestor represent concurrent writes. The merge commit is conflict resolution: it has both diverged commits as ancestors, so it causally dominates both parents.
A note on DynamoDB specifically: Amazon's publicly documented DynamoDB service does NOT expose vector clocks to users. It uses conditional writes (optimistic locking with a version attribute) and strong consistency reads. The vector clock mechanism described in the original 2007 Dynamo paper applies to the internal Dynamo system, not the DynamoDB service. This distinction comes up in interviews and catching it shows depth.
Interview tip: Cassandra vs. Dynamo conflict handling
This is a common interview question: "How does Cassandra handle conflicting writes?" Cassandra deliberately chose NOT to use vector clocks. It uses Last-Write-Wins with wall-clock timestamps, silently discarding the "losing" write. This is simpler to operate but loses data under concurrent writes. Dynamo chose the opposite: surface all conflicts, never lose data, make the client deal with it. Know this distinction cold.
CRDTs: automatic conflict resolution
The biggest drawback of vector clocks is that they detect conflicts but don't resolve them. The application must define a merge function, and getting that merge function right is hard. CRDTs (Conflict-free Replicated Data Types) solve this by building the merge function into the data structure itself.
A CRDT uses vector clock concepts internally (each operation carries a causal context) but defines a mathematically proven merge function that always converges to the same state, regardless of the order in which concurrent operations are applied.
Common CRDT types:
| CRDT type | What it models | Merge behavior |
|---|---|---|
| G-Counter | Grow-only counter | Each node tracks its own count; merge = sum of per-node counts |
| PN-Counter | Increment/decrement counter | Two G-Counters (positive and negative); merge each independently |
| G-Set | Grow-only set | Merge = union of sets |
| OR-Set (Observed-Remove Set) | Set with add and remove | Track add/remove operations with unique tags; add wins over concurrent remove |
| LWW-Register | Single value | Last write wins (by timestamp); same as LWW but with formal CRDT semantics |
The tradeoff: CRDTs constrain what data structures you can use. You can't have an arbitrary key-value with a custom merge function; you must use one of the predefined CRDT types. But within those types, conflict resolution is fully automatic, and convergence is mathematically guaranteed.
Riak 2.0+ supports CRDT data types (counters, sets, maps, flags, registers) natively. Redis supports CRDTs in its Active-Active (CRDB) deployment. These systems still use vector clocks under the hood for causal tracking, but the application never sees siblings or needs to implement merge logic.
The CRDT approach eliminates the client merge burden entirely. But you cannot use CRDTs for arbitrary data. A "user profile" with free-form fields doesn't map cleanly to any CRDT type. For those cases, vector clocks with application-level merge remain the appropriate choice.
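The G-Counter from the table above is small enough to sketch whole. Class and method names are illustrative; the merge is the same component-wise max used for vector clocks:

```python
# G-Counter sketch: each node increments only its own slot; merge takes
# the per-node max (so re-delivering a merge is harmless); the counter's
# value is the sum across slots.
class GCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}

    def increment(self):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + 1

    def merge(self, other):
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("A"), GCounter("B")
a.increment(); a.increment()   # A observes 2 events
b.increment()                  # B observes 1 event
a.merge(b)
b.merge(a)
print(a.value(), b.value())    # 3 3 -- both replicas converge
```

No client ever sees siblings: the merge is built into the type, which is the whole point of the CRDT approach.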
Limitations and when NOT to use it
- Vector clock size grows with the number of nodes. Each vector has one entry per node that has ever written to the key. In a system with thousands of nodes, vectors become large. This is why Dynamo originally used version vectors scoped to the coordinator nodes, not all nodes in the cluster. Even with scoping, high-contention keys that are written by many different coordinators can accumulate large vectors.
- Client-side merge is operationally painful. Returning siblings to clients means every client must implement correct merge logic. If a single client has a bug in its merge function, it can corrupt data for all subsequent readers. I have seen this happen in production. Every language client (Java, Python, Go, JavaScript) must implement the same merge logic identically, and keeping them in sync is a maintenance burden.
- Garbage collection of old vector entries is hard. When a node is decommissioned, its entry in every vector clock is dead weight. Pruning these entries safely requires knowing that no in-flight operations reference the old node, which is difficult to guarantee. You cannot simply delete the entries during a rolling restart because writes may be in flight that still reference the old node's counter.
- Vector clocks do not help with total ordering. If your application needs a global total order of events (not just causal ordering), vector clocks are insufficient. You need a consensus protocol like Raft or a centralized sequencer. Many applications (bank ledgers, sequential event logs, audit trails) require total ordering.
- For single-writer patterns, vector clocks are overkill. If each key is only written by one node (single-writer, multiple-reader), a simple monotonic version number is sufficient. Vector clocks add complexity without benefit in this case.
- Debugging is difficult. When conflicts occur, understanding why two vector clocks are concurrent requires tracing the write history across multiple nodes. Unlike LWW (where you can inspect timestamps), vector clock conflicts require understanding causal chains, which is harder for on-call engineers to diagnose at 3 AM.
Interview cheat sheet
- When asked about conflict detection in distributed databases, state that vector clocks track one counter per node and compare component-wise. If neither vector dominates, the events are concurrent and you have a conflict.
- When comparing Lamport timestamps to vector clocks, explain that Lamport timestamps can prove "A happened before B" but cannot detect concurrency. Vector clocks can do both. This is the key difference.
- When asked about Amazon Dynamo's design, explain that Dynamo chose to surface conflicts to clients rather than silently discard data. The shopping cart merge (union of items) is the canonical example.
- When Cassandra comes up, note that it uses LWW timestamps instead of vector clocks, trading conflict detection for operational simplicity. Data loss under concurrent writes is the accepted tradeoff.
- When asked about vector clock size, explain that the vector grows with the number of distinct writers. Riak's Dotted Version Vectors reduce false siblings by tracking individual writes (dots) in addition to per-node counters.
- When CRDTs come up, explain that CRDTs use vector clock concepts internally (causal context) but define automatic merge functions. They remove the need for client-side conflict resolution at the cost of restricting the data types you can represent.
- When asked "why not just use timestamps?", give the clock skew example. Two servers with 100ms clock drift will incorrectly order operations 100ms apart. NTP helps but does not eliminate the problem, and you cannot detect concurrency at all.
- When asked how Git relates to vector clocks, explain that Git's commit DAG and merge operations are semantically equivalent. A merge conflict in Git is exactly a concurrent write detected via the commit graph structure.
Quick recap
- Physical timestamps fail in distributed systems because clock drift makes ordering ambiguous. Lamport timestamps capture "happened before" but cannot detect concurrency.
- Vector clocks use one counter per node, compared component-wise. If neither vector dominates, the events are concurrent. This is the definitive test for causal vs. concurrent relationships.
- Conflict resolution is application-specific. The system detects the conflict (via vector comparison); the application defines the merge (union, max, custom logic). The clock merge is always component-wise maximum.
- Dotted Version Vectors solve the false sibling problem in standard vector clocks by tracking individual writes, not just per-node counters. Riak uses DVVs.
- Vector clock size grows with writers. In large clusters, scope vectors to coordinator nodes (version vectors) rather than all nodes. Decommissioned nodes leave dead entries that need cleanup.
- LWW is the simpler alternative (used by Cassandra). It silently discards concurrent writes. Use it when data loss is acceptable. Use vector clocks when every write must be preserved.
- Version vectors ≠ vector clocks in the strict sense. Version vectors track per-key state (only writes to that key increment the counter). Vector clocks track per-event causality (every message is an event). In practice, distributed databases use version vectors, but the comparison and merge algorithms are identical.
- CRDTs are the modern evolution of vector clocks for many use cases. They embed causal tracking but define automatic, mathematically proven merge functions, eliminating the client merge burden.
Related concepts
- Consistency models -- Vector clocks enable causal consistency, which is stronger than eventual consistency but weaker than linearizability. Understanding the clock comparison is key to understanding what "causal" means in practice.
- Replication -- Multi-leader and leaderless replication strategies use vector clocks (or version vectors) to detect conflicts that arise from concurrent writes to different replicas.
- Databases -- The conflict resolution strategy (LWW vs. vector clocks vs. CRDTs) is one of the most important architectural decisions when choosing a distributed database.
- Consistent hashing -- In Dynamo-style systems, the coordinator node for a key is determined by consistent hashing, and the coordinator's ID appears in the version vector.
- Consensus protocols -- When causal ordering is insufficient and you need total ordering or linearizability, consensus protocols like Raft provide stronger guarantees at the cost of availability and latency.