Vector clocks
How vector clocks track causal relationships between events in distributed systems, how they detect concurrent writes and support conflict resolution, and why Lamport timestamps alone are insufficient for causality.
The problem
Two users edit the same shared document simultaneously. User A changes the title on Server 1. User B changes the body on Server 2. Both writes complete successfully. When the servers sync, you need to know: did A's edit happen before B's, did B's happen before A's, or did they happen concurrently? The answer determines whether you can safely apply both edits or whether you have a conflict that needs resolution.
The obvious approach is to compare timestamps. Server 1 recorded A's edit at 14:00:00.001. Server 2 recorded B's edit at 14:00:00.002. B came later, right? Not necessarily. Server 2's clock might be 500ms ahead of Server 1's. The "later" timestamp means nothing when clocks aren't synchronized.
Server 1: edit_title at 14:00:00.001 (clock accurate)
Server 2: edit_body at 14:00:00.002 (clock 500ms ahead, real time: 13:59:59.502)
Timestamp says: Server 2's edit is later
Reality: Server 2's edit happened 499ms EARLIER
NTP (Network Time Protocol) helps but doesn't solve the problem. NTP synchronizes clocks to within a few milliseconds on a good day, but under network congestion, clock drift can reach hundreds of milliseconds. For a system making thousands of writes per second, a few milliseconds of skew means thousands of incorrectly ordered operations.
The consequences of getting ordering wrong are severe. A distributed database that uses wall-clock timestamps for conflict resolution (Last-Write-Wins) will silently discard the "losing" write. In a shopping cart, this means a user adds an item and it vanishes. In a financial system, a withdrawal might overwrite a deposit. The data looks consistent (no error is raised), but the result is wrong.
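To make the failure concrete, here is a minimal Last-Write-Wins sketch fed the skewed timestamps from the example above. The field names ("value", "ts") are illustrative, not from any real database:

```python
# Minimal LWW sketch: the resolver trusts wall-clock timestamps.
def lww_resolve(a, b):
    """Keep the write with the larger timestamp; silently drop the other."""
    return a if a["ts"] >= b["ts"] else b

# Timestamps as seconds since midnight. Server 1's clock is accurate;
# Server 2's clock runs 500ms fast.
edit_title = {"value": "new title", "ts": 50400.001}  # stamped 14:00:00.001 (real)
edit_body  = {"value": "new body",  "ts": 50400.002}  # stamped 14:00:00.002,
                                                      # but really 13:59:59.502

winner = lww_resolve(edit_title, edit_body)
print(winner["value"])  # "new body" wins; the title edit is silently discarded
```

The causally earlier write wins purely because its server's clock runs fast, and no error is ever raised.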
You need a mechanism that tracks causality without relying on physical time. That mechanism is vector clocks.
What it is
A vector clock is a logical timestamp that uses one counter per node in the system instead of a single global counter. By tracking which events each node has seen, vector clocks can definitively answer the question that wall clocks cannot: are two events causally related, or are they concurrent?
Think of it like a guest book at a conference with multiple rooms. Each room has its own guest book. When someone moves from Room A to Room B, they carry Room A's guest count with them. If you later see two guest books and Room A's book shows "10 visitors" while Room B's shows "Room A had 8 visitors when I last checked," you know that Room B's state reflects events before Room A reached 10 visitors. They diverged after visitor 8. Any entries made in Room A after visitor 8, and any entries in Room B after that checkpoint, are concurrent.
Formally, a vector clock for a system with N nodes is an array of N non-negative integers. The i-th element records how many events node i has performed (or, more precisely, the latest event from node i that is in the causal past of the current event). The comparison rule is component-wise: VC1 happened before VC2 if and only if every component of VC1 is less than or equal to the corresponding component of VC2, with at least one strictly less.
How it works
Why Lamport timestamps are not enough
Leslie Lamport's logical clock is a simple step beyond wall clocks: a single counter per node, incremented on every event:
Rules:
On send: increment counter, attach to message
On receive: counter = max(local, received) + 1
On local event: increment counter
A: [1] send message to B
B: [2] (max(0, 1) + 1) receive, increment
B: [3] send to C
C: [4] receive, process
Lamport timestamps are consistent with causal order: if event A happened before event B, then A's timestamp < B's timestamp. But the converse is NOT true: if A's timestamp < B's, it doesn't mean A caused B. They might be concurrent.
From a Lamport comparison you can only rule out one direction: ts(A) < ts(B) tells you that B did not happen before A. You cannot tell whether A caused B or whether they are concurrent. This is the fundamental limitation. In a system that needs to detect conflicts (like a replicated database), knowing that two events are concurrent is exactly the information you need.
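The rules above fit in a few lines. Here is a minimal sketch (class and method names are illustrative) replaying the A -> B -> C trace:

```python
# A minimal Lamport clock: one counter per node.
class LamportClock:
    def __init__(self):
        self.counter = 0

    def local_event(self):
        self.counter += 1
        return self.counter

    def send(self):
        self.counter += 1
        return self.counter          # timestamp to attach to the message

    def receive(self, received):
        self.counter = max(self.counter, received) + 1
        return self.counter

# Replay the A -> B -> C exchange from the trace above.
a, b, c = LamportClock(), LamportClock(), LamportClock()
ts = a.send()          # A: 1
b.receive(ts)          # B: max(0, 1) + 1 = 2
ts = b.send()          # B: 3
c.receive(ts)          # C: max(0, 3) + 1 = 4
print(a.counter, b.counter, c.counter)  # 1 3 4
```

Note that nothing in these counters distinguishes "A caused C's event" from "they happened to interleave": that is the gap vector clocks fill.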
Lamport timestamps are still useful. They are cheap (one integer per event), and many systems only need total ordering, not conflict detection. Google Spanner uses TrueTime (hardware-backed physical clocks with bounded uncertainty) rather than vector clocks because it only needs a total order for its external consistency guarantee. If you need to detect conflicts rather than impose an arbitrary order, Lamport timestamps are insufficient.
The key limitation: given two Lamport timestamps, the smaller value leaves ambiguity (the earlier-stamped event might have caused the later one, or they might be concurrent). Vector clocks resolve this ambiguity by tracking per-node counters.
The vector clock algorithm
A vector clock solves this by maintaining one counter per node in the system:
The rules are simple:
- On local event: increment your own slot in the vector
- On send: increment your own slot, attach the full vector to the message
- On receive: take the pairwise maximum of the received vector and your local vector, then increment your own slot
Here is the pseudocode:
function on_local_event(node_id, clock):
    clock[node_id] += 1
    return clock

function on_send(node_id, message, clock):
    clock[node_id] += 1
    send(message, clock)
    return clock

function on_receive(node_id, local_clock, received_clock):
    for i in 0..N-1:
        local_clock[i] = max(local_clock[i], received_clock[i])
    local_clock[node_id] += 1
    return local_clock

function compare(vc1, vc2):
    if vc1[i] == vc2[i] for all i: return EQUAL
    if vc1[i] <= vc2[i] for all i: return BEFORE      // at least one strictly less
    if vc1[i] >= vc2[i] for all i: return AFTER       // at least one strictly greater
    return CONCURRENT                                  // neither dominates
The comparison is the key operation. Two vector clocks are compared component-wise:
- If every component of VC1 is ≤ the corresponding component of VC2, with at least one strictly less, then VC1 happened before VC2 (VC1 is an ancestor of VC2)
- If every component of VC1 is ≥ the corresponding component of VC2, with at least one strictly greater, then VC1 happened after VC2 (VC2 is an ancestor of VC1)
- If neither condition holds (some components are greater, some are smaller), the events are concurrent (neither caused the other)
This three-way comparison is the fundamental operation. Every other aspect of vector clocks (conflict detection, merge, sibling management) builds on this comparison.
The space cost is O(N) per vector clock, where N is the number of nodes. For a 5-node cluster, that's 5 integers per key. For a 500-node cluster, it's 500 integers per key, which is why vector clock size management matters (covered below). The time cost of comparison is also O(N): a single pass through both vectors.
Here is a worked example with all three outcomes:
VC_a = [3, 1, 2]
VC_b = [3, 2, 3]
VC_c = [4, 0, 1]
Compare VC_a vs VC_b:
3 <= 3, 1 <= 2, 2 <= 3 -> all ≤ -> VC_a HAPPENED BEFORE VC_b
Compare VC_a vs VC_c:
3 < 4 (a < c), 1 > 0 (a > c) -> mixed -> CONCURRENT
Compare VC_b vs VC_c:
3 < 4, 2 > 0, 3 > 1 -> mixed -> CONCURRENT
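The comparison rule translates directly to code. This sketch (function name and return strings are illustrative) reproduces all three outcomes from the worked example:

```python
# Component-wise vector clock comparison.
def compare(vc1, vc2):
    if vc1 == vc2:
        return "EQUAL"
    if all(x <= y for x, y in zip(vc1, vc2)):
        return "BEFORE"      # vc1 is an ancestor of vc2
    if all(x >= y for x, y in zip(vc1, vc2)):
        return "AFTER"       # vc2 is an ancestor of vc1
    return "CONCURRENT"      # mixed: neither dominates

vc_a, vc_b, vc_c = [3, 1, 2], [3, 2, 3], [4, 0, 1]
print(compare(vc_a, vc_b))  # BEFORE
print(compare(vc_a, vc_c))  # CONCURRENT
print(compare(vc_b, vc_c))  # CONCURRENT
```

A single O(N) pass through both vectors answers the question wall clocks cannot.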
Conflict detection and resolution
When a quorum-based database receives a read, and two replicas return different values with concurrent vector clocks, the system has detected a conflict. What happens next depends on the conflict resolution strategy.
There are three common strategies: (1) return all siblings to the client and let the application merge (Dynamo, Riak), (2) use Last-Write-Wins and silently discard the "loser" (Cassandra), or (3) use CRDTs that merge automatically (Riak 2.0+ data types). The first approach preserves all data at the cost of client complexity. The second is simple but loses data. The third constrains your data model but eliminates both problems.
In Amazon Dynamo's original design, the system returned all conflicting versions (called "siblings") to the next reader. The client was responsible for merging them.
System returns to client:
[
{color: "red", version: [A:2, B:0]},
{color: "green", version: [A:1, B:1]}
]
Client merges (application-specific logic):
merged = {color: "red"} // or "green", or some combination
Write back with context [A:2, B:1]; the coordinating node increments its own slot, so the stored clock (e.g., [A:3, B:1]) strictly dominates both siblings
Amazon's shopping cart used this approach. Two conflicting carts were merged by taking the union of items. The worst case was a deleted item reappearing (if one sibling had it and the other didn't), but no items were ever silently lost. This is a deliberate tradeoff: data preservation over silent data loss.
Merge rules
When merging two concurrent vector clocks, the merged clock takes the component-wise maximum:
VC1 = [A:2, B:0, C:1]
VC2 = [A:1, B:1, C:3]
Merged = [max(2,1), max(0,1), max(1,3)] = [A:2, B:1, C:3]
The merged vector clock dominates both parents. Any future event that sees the merged clock knows it happened after both conflicting events. This is how conflicts are resolved: the merge creates a new "point in time" that supersedes both conflicting versions.
The merge operation is commutative (merge(A,B) = merge(B,A)) and associative (merge(merge(A,B),C) = merge(A,merge(B,C))). This means nodes can merge conflicts in any order and arrive at the same result. This property is essential for convergence in distributed systems.
The data merge is application-specific, but the vector clock merge is always component-wise maximum. Do not confuse the two. A shopping cart merge might take the union of items. A counter merge might take the sum. A last-write-wins merge might pick one value and discard the other. But the clock merge is always the same: pairwise max.
When three or more siblings exist (possible under high contention), the merge is applied pairwise in any order, thanks to the associative and commutative properties:
Three siblings after concurrent writes:
VC1 = [3, 0, 1] value = {apple}
VC2 = [1, 2, 0] value = {banana}
VC3 = [0, 1, 3] value = {cherry}
Step 1: merge VC1 and VC2
clock = [max(3,1), max(0,2), max(1,0)] = [3, 2, 1]
data = {apple} ∪ {banana} = {apple, banana}
Step 2: merge result with VC3
clock = [max(3,0), max(2,1), max(1,3)] = [3, 2, 3]
data = {apple, banana} ∪ {cherry} = {apple, banana, cherry}
Final: VC=[3, 2, 3], value = {apple, banana, cherry}
The final vector [3, 2, 3] dominates all three original vectors, establishing a new causal frontier that supersedes all siblings.
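The pairwise merge above can be sketched directly; the clock merge is always component-wise max, while the data merge here happens to be set union (function and field names are illustrative):

```python
from functools import reduce

# Merge two siblings: clock = component-wise max, data = set union.
def merge(sib1, sib2):
    clock = [max(x, y) for x, y in zip(sib1["clock"], sib2["clock"])]
    return {"clock": clock, "value": sib1["value"] | sib2["value"]}

siblings = [
    {"clock": [3, 0, 1], "value": {"apple"}},
    {"clock": [1, 2, 0], "value": {"banana"}},
    {"clock": [0, 1, 3], "value": {"cherry"}},
]

# Because merge is commutative and associative, fold order doesn't matter.
result = reduce(merge, siblings)
print(result["clock"])          # [3, 2, 3]
print(sorted(result["value"]))  # ['apple', 'banana', 'cherry']
```

Shuffling `siblings` before the reduce produces the same result, which is exactly the convergence property the text describes.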
Dotted version vectors
Standard vector clocks have a subtle problem: false siblings. Here is how it happens.
A client reads a value from replica A at version [A:1]. It writes back, and the coordinator sends the update to replicas A and B. Replica A updates to [A:2]. For some reason, B was slow and gets [A:2] later. Meanwhile, another client reads from B (which still has [A:1]) and writes back, producing [A:1, B:1]. Now the system has siblings [A:2] and [A:1, B:1], which are concurrent.
But here is the subtlety: with plain version vectors, a replica cannot tell which specific write produced each stored value; it only sees node-level counters. When reads and writes interleave across replicas like this, a server receiving a write may be unable to decide whether a value it already holds is an ancestor of the incoming write (safe to discard) or truly concurrent with it (must be kept). Forced to be conservative, it keeps values it could have discarded -- these are false siblings. The root cause is that the vector clock tracks node-level counters, not individual writes.
Dotted Version Vectors (DVVs), used in Riak, solve this by adding a "dot" to each value: a (node, counter) pair identifying the specific write that created this value. When comparing versions, the system can distinguish between "this value was created by a specific write" and "this node has seen writes up to counter X." This eliminates false siblings that occur when reads and writes interleave on different replicas.
Standard vector clock:
Value "red" at [A:2, B:0] -- who created this? A at counter 2.
Value "green" at [A:1, B:1] -- who created this? B at counter 1.
Dotted version vector:
Value "red" with dot (A, 2), context [A:1, B:0]
Value "green" with dot (B, 1), context [A:1, B:0]
Now the system knows both were created from the same base context [A:1, B:0].
Neither is "false" -- they are genuinely concurrent.
In practice, DVVs reduce the number of siblings Riak tracks, which improves storage efficiency and reduces the burden on clients to resolve conflicts that aren't real conflicts.
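The core DVV check can be sketched in a few lines, reusing the example above. The representation here (dot tuples, context dicts) is an illustration, not Riak's actual wire format:

```python
# A dot (node, counter) identifies the specific write that created a value.
# A context records which writes the writer had already seen.
# A stored value is superseded iff its dot is covered by the incoming context.
def dot_covered(dot, context):
    node, counter = dot
    return context.get(node, 0) >= counter

red   = {"dot": ("A", 2), "context": {"A": 1}}
green = {"dot": ("B", 1), "context": {"A": 1}}

# Neither writer had seen the other's dot, so both survive as siblings:
print(dot_covered(red["dot"], green["context"]))   # False
print(dot_covered(green["dot"], red["context"]))   # False

# A later write whose context includes (A, 2) would supersede "red":
print(dot_covered(red["dot"], {"A": 2, "B": 1}))   # True
```

Because each value names the exact write that created it, the server discards precisely the values the incoming write supersedes and keeps only genuine conflicts.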
Version vectors vs vector clocks
These two terms are often used interchangeably, but they solve different problems. A vector clock tracks causality per event (every send, receive, and local operation increments the clock). A version vector tracks causality per state of a data item (only writes to a specific key increment the vector).
In a key-value store, you don't care about the causal history of every single message between nodes. You care about whether two versions of a key are causally related. Version vectors are sufficient for this and are cheaper. Each key-value pair carries a version vector that only increments when that specific key is updated, not on every network message.
The practical difference: a vector clock in a chatty system might reach counters in the thousands per node quickly (every message is an event). A version vector for the same key would only increment when someone writes to that key, keeping counters small. Dynamo, Riak, and Voldemort all use version vectors, not full vector clocks, even though the papers often say "vector clocks."
Vector clock (tracks every event):
Node A sends 1000 messages, writes key X once:
A's counter for key X's clock = 1000 (every send incremented it)
Version vector (tracks per-key state):
Node A sends 1000 messages, writes key X once:
A's counter for key X's version vector = 1 (only the write incremented it)
For interviews, use the terms interchangeably unless specifically asked to distinguish them. The comparison algorithm (component-wise), the merge algorithm (pairwise max), and the concurrency detection properties are identical.
Vector clock size management
In a large cluster, vector clocks can grow unbounded. Every node that has ever written to a key adds an entry to that key's vector. In a 100-node cluster with high key churn, vectors can grow to 100+ entries per key, consuming significant storage and slowing comparisons.
Three strategies for managing vector size:
1. Scoped version vectors
Instead of tracking all nodes, scope vectors to the coordinator nodes that actually handled writes for a key. In Dynamo-style systems, each key has a small set of coordinator nodes (typically N=3). The version vector only tracks these coordinators, keeping the vector at 3-5 entries regardless of total cluster size.
This is what Amazon's Dynamo and Riak actually use. The vectors track coordinator-level events, not cluster-level events. A key written only by coordinators A, B, C has a vector with 3 entries, even in a 1000-node cluster. This optimization is underappreciated in interview discussions, where candidates often worry that vector size is a real problem; in practice, scoped version vectors keep sizes small.
2. Timestamp-based pruning
If a vector entry hasn't been updated in a long time (the node that wrote it is presumably decommissioned or hasn't written to this key recently), the entry can be pruned. Dynamo's original paper stored a timestamp alongside each entry and removed the oldest entries once the vector exceeded a size threshold (the paper suggests around ten).
The risk: pruning can cause false conflicts. If a pruned entry was actually relevant to causality determination, two values that were causally ordered might appear concurrent after pruning. This is safe in the sense that false conflicts don't lose data (both values are preserved), but it increases sibling count.
3. Per-entry counters with bounded size
Cap the vector at K entries (e.g., 10). When a new node writes and the vector is full, remove the entry with the smallest counter. This bounds memory usage but, like timestamp pruning, can cause false conflicts.
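The entry-cap strategy fits in a few lines. A hedged sketch (function name and eviction policy as described above, not from any particular system):

```python
# Strategy 3 sketch: cap a version vector at `cap` entries by evicting
# the entries with the smallest counters.
def prune(vv, cap):
    """Return vv with at most `cap` entries, dropping smallest counters."""
    if len(vv) <= cap:
        return dict(vv)
    # Keep the `cap` entries with the largest counters.
    kept = sorted(vv.items(), key=lambda kv: kv[1], reverse=True)[:cap]
    return dict(kept)

vv = {"A": 7, "B": 1, "C": 4, "D": 2}
print(prune(vv, 3))  # {'A': 7, 'C': 4, 'D': 2} -- B's entry is gone,
                     # so a value only B had seen may now look concurrent
```

The pruned entry is exactly what can turn a causally ordered pair into an apparent conflict: safe (no data lost) but noisier.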
| Strategy | Vector size | False conflicts | Used by |
|---|---|---|---|
| Scoped to coordinators | O(N_coordinators) | None | Dynamo, Riak |
| Timestamp pruning | Bounded by time | Possible | Dynamo (original) |
| Entry cap | Fixed K | Possible | Voldemort |
| Dotted Version Vectors | O(N_coordinators) | Fewer than standard VCs | Riak 2.0+ |
Production usage
Modern systems use vector-clock concepts under different names and with different tradeoffs:
| System | Mechanism | How conflicts are handled |
|---|---|---|
| Dynamo (Amazon's internal store) | Version vectors | Returns all siblings to client; client merges (shopping cart: union of items) |
| Riak | Dotted Version Vectors | Client-side or CRDT-based merge; allow_mult=true returns siblings, allow_mult=false uses LWW |
| CRDTs (Riak, Redis) | Causal context (embedded vector) | Automatic merge via mathematically guaranteed convergence; no client intervention |
| Git | DAG of commits | Detects merge conflicts via common ancestor; user resolves manually or auto-merges non-overlapping changes |
| Cassandra | LWW timestamps (no vector clocks) | Silently discards concurrent writes with lower timestamp; clock skew can cause data loss |
| CouchDB | Revision tree | Maintains revision history; detects conflicts via divergent revision branches |
Git's commit DAG is semantically equivalent to vector clocks. Two branches that diverged from a common ancestor represent concurrent writes. The merge commit is conflict resolution: it has both diverged commits as ancestors, so it causally dominates both parents.
A note on DynamoDB specifically: Amazon's publicly documented DynamoDB service does NOT expose vector clocks to users. It uses conditional writes (optimistic locking with a version attribute) and strong consistency reads. The vector clock mechanism described in the original 2007 Dynamo paper applies to the internal Dynamo system, not the DynamoDB service. This distinction comes up in interviews and catching it shows depth.
Interview tip: Cassandra vs. Dynamo conflict handling
This is a common interview question: "How does Cassandra handle conflicting writes?" Cassandra deliberately chose NOT to use vector clocks. It uses Last-Write-Wins with wall-clock timestamps, silently discarding the "losing" write. This is simpler to operate but loses data under concurrent writes. Dynamo chose the opposite: surface all conflicts, never lose data, make the client deal with it. Know this distinction cold.
CRDTs: automatic conflict resolution
The biggest drawback of vector clocks is that they detect conflicts but don't resolve them. The application must define a merge function, and getting that merge function right is hard. CRDTs (Conflict-free Replicated Data Types) solve this by building the merge function into the data structure itself.
A CRDT uses vector clock concepts internally (each operation carries a causal context) but defines a mathematically proven merge function that always converges to the same state, regardless of the order in which concurrent operations are applied.
Common CRDT types:
| CRDT type | What it models | Merge behavior |
|---|---|---|
| G-Counter | Grow-only counter | Each node tracks its own count; merge = sum of per-node counts |
| PN-Counter | Increment/decrement counter | Two G-Counters (positive and negative); merge each independently |
| G-Set | Grow-only set | Merge = union of sets |
| OR-Set (Observed-Remove Set) | Set with add and remove | Track add/remove operations with unique tags; add wins over concurrent remove |
| LWW-Register | Single value | Last write wins (by timestamp); same as LWW but with formal CRDT semantics |
The tradeoff: CRDTs constrain what data structures you can use. You can't have an arbitrary key-value with a custom merge function; you must use one of the predefined CRDT types. But within those types, conflict resolution is fully automatic, and convergence is mathematically guaranteed.
Riak 2.0+ supports CRDT data types (counters, sets, maps, flags, registers) natively. Redis supports CRDTs in its Active-Active (CRDB) deployment. These systems still use vector clocks under the hood for causal tracking, but the application never sees siblings or needs to implement merge logic.
The CRDT approach eliminates the client merge burden entirely. But you cannot use CRDTs for arbitrary data. A "user profile" with free-form fields doesn't map cleanly to any CRDT type. For those cases, vector clocks with application-level merge remain the appropriate choice.
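The G-Counter from the table above is small enough to sketch whole. Class and method names are illustrative; the merge is the same component-wise max used for vector clocks:

```python
# G-Counter sketch: each node increments only its own slot; merge takes
# the per-node max (so re-delivering a merge is harmless); the counter's
# value is the sum across slots.
class GCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}

    def increment(self):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + 1

    def merge(self, other):
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("A"), GCounter("B")
a.increment(); a.increment()   # A observes 2 events
b.increment()                  # B observes 1 event
a.merge(b)
b.merge(a)
print(a.value(), b.value())    # 3 3 -- both replicas converge
```

No client ever sees siblings: the merge is built into the type, which is the whole point of the CRDT approach.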
Limitations and when NOT to use it
- Vector clock size grows with the number of nodes. Each vector has one entry per node that has ever written to the key. In a system with thousands of nodes, vectors become large. This is why Dynamo originally used version vectors scoped to the coordinator nodes, not all nodes in the cluster. Even with scoping, high-contention keys that are written by many different coordinators can accumulate large vectors.
- Client-side merge is operationally painful. Returning siblings to clients means every client must implement correct merge logic. If a single client has a bug in its merge function, it can corrupt data for all subsequent readers. I have seen this happen in production. Every language client (Java, Python, Go, JavaScript) must implement the same merge logic identically, and keeping them in sync is a maintenance burden.
- Garbage collection of old vector entries is hard. When a node is decommissioned, its entry in every vector clock is dead weight. Pruning these entries safely requires knowing that no in-flight operations reference the old node, which is difficult to guarantee. You cannot simply delete the entries during a rolling restart because writes may be in flight that still reference the old node's counter.
- Vector clocks do not help with total ordering. If your application needs a global total order of events (not just causal ordering), vector clocks are insufficient. You need a consensus protocol like Raft or a centralized sequencer. Many applications (bank ledgers, sequential event logs, audit trails) require total ordering.
- For single-writer patterns, vector clocks are overkill. If each key is only written by one node (single-writer, multiple-reader), a simple monotonic version number is sufficient. Vector clocks add complexity without benefit in this case.
- Debugging is difficult. When conflicts occur, understanding why two vector clocks are concurrent requires tracing the write history across multiple nodes. Unlike LWW (where you can inspect timestamps), vector clock conflicts require understanding causal chains, which is harder for on-call engineers to diagnose at 3 AM.
Interview cheat sheet
- When asked about conflict detection in distributed databases, state that vector clocks track one counter per node and compare component-wise. If neither vector dominates, the events are concurrent and you have a conflict.
- When comparing Lamport timestamps to vector clocks, explain that Lamport timestamps can prove "A happened before B" but cannot detect concurrency. Vector clocks can do both. This is the key difference.
- When asked about Amazon Dynamo's design, explain that Dynamo chose to surface conflicts to clients rather than silently discard data. The shopping cart merge (union of items) is the canonical example.
- When Cassandra comes up, note that it uses LWW timestamps instead of vector clocks, trading conflict detection for operational simplicity. Data loss under concurrent writes is the accepted tradeoff.
- When asked about vector clock size, explain that the vector grows with the number of distinct writers. Riak's Dotted Version Vectors reduce false siblings by tracking individual writes (dots) in addition to per-node counters.
- When CRDTs come up, explain that CRDTs use vector clock concepts internally (causal context) but define automatic merge functions. They remove the need for client-side conflict resolution at the cost of restricting the data types you can represent.
- When asked "why not just use timestamps?", give the clock skew example. Two servers with 100ms clock drift will incorrectly order operations 100ms apart. NTP helps but does not eliminate the problem, and you cannot detect concurrency at all.
- When asked how Git relates to vector clocks, explain that Git's commit DAG and merge operations are semantically equivalent. A merge conflict in Git is exactly a concurrent write detected via the commit graph structure.
Quick recap
- Physical timestamps fail in distributed systems because clock drift makes ordering ambiguous. Lamport timestamps capture "happened before" but cannot detect concurrency.
- Vector clocks use one counter per node, compared component-wise. If neither vector dominates, the events are concurrent. This is the definitive test for causal vs. concurrent relationships.
- Conflict resolution is application-specific. The system detects the conflict (via vector comparison); the application defines the merge (union, max, custom logic). The clock merge is always component-wise maximum.
- Dotted Version Vectors solve the false sibling problem in standard vector clocks by tracking individual writes, not just per-node counters. Riak uses DVVs.
- Vector clock size grows with writers. In large clusters, scope vectors to coordinator nodes (version vectors) rather than all nodes. Decommissioned nodes leave dead entries that need cleanup.
- LWW is the simpler alternative (used by Cassandra). It silently discards concurrent writes. Use it when data loss is acceptable. Use vector clocks when every write must be preserved.
- Version vectors ≠ vector clocks in the strict sense. Version vectors track per-key state (only writes to that key increment the counter). Vector clocks track per-event causality (every message is an event). In practice, distributed databases use version vectors, but the comparison and merge algorithms are identical.
- CRDTs are the modern evolution of vector clocks for many use cases. They embed causal tracking but define automatic, mathematically proven merge functions, eliminating the client merge burden.
Related concepts
- Consistency models -- Vector clocks enable causal consistency, which is stronger than eventual consistency but weaker than linearizability. Understanding the clock comparison is key to understanding what "causal" means in practice.
- Replication -- Multi-leader and leaderless replication strategies use vector clocks (or version vectors) to detect conflicts that arise from concurrent writes to different replicas.
- Databases -- The conflict resolution strategy (LWW vs. vector clocks vs. CRDTs) is one of the most important architectural decisions when choosing a distributed database.
- Consistent hashing -- In Dynamo-style systems, the coordinator node for a key is determined by consistent hashing, and the coordinator's ID appears in the version vector.
- Consensus protocols -- When causal ordering is insufficient and you need total ordering or linearizability, consensus protocols like Raft provide stronger guarantees at the cost of availability and latency.