Google Spanner and TrueTime

TL;DR

Google built Spanner because Bigtable and Megastore could not provide globally consistent ACID transactions across data centers.
TrueTime uses GPS receivers (accurate to less than 1μs) and cesium atomic clocks in every data center to return wall-clock time as a bounded interval [earliest, latest], not a single point.
Spanner's commit wait protocol assigns a timestamp s, then waits until TrueTime.now().earliest > s before acknowledging, guaranteeing that s is safely in the past for every clock on Earth.
Practical uncertainty is roughly 7ms (typically around 4ms), making global ACID commits remarkably fast.
The transferable lesson: when you can bound uncertainty precisely, you can wait it out, and that changes what is possible in distributed systems.

By 2011, Google's internal infrastructure ran on two storage systems that could not coexist gracefully. Bigtable handled massive throughput but offered no cross-row transactions and no SQL interface. Megastore added cross-datacenter replication with ACID semantics on top of Bigtable, but write throughput collapsed under load because every write required Paxos consensus across replicas.

Google Ads needed to bill advertisers in real-time across regions. Google Payments required financial-grade consistency for transactions spanning continents. YouTube needed globally consistent counters for views and subscriptions.

Each of these services was bending Megastore's limitations with application-level workarounds, and the complexity was compounding. One Google engineer described the situation in a 2012 conference talk: "We had more lines of code dealing with Megastore's consistency quirks than actual business logic."

The breaking point was not a single outage. It was the realization that every team building a globally distributed application at Google was independently reinventing the same coordination logic on top of broken abstractions. I've seen this pattern at smaller companies too: when three teams all build their own "distributed transaction layer," it is time to fix the platform.

The Spanner project started in 2008 under the leadership of Jeff Dean and Wilson Hsieh. The initial team was small (fewer than 20 engineers), but the scope was enormous: build a database that provides ACID transactions across continents with latency measured in single-digit milliseconds.

The System Before

Google's storage stack in 2011 looked like a layer cake with cracks running through it.

Bigtable was the workhorse: a sorted key-value store sharded across thousands of machines, optimized for high-throughput reads and writes within a single data center. It had no concept of multi-row transactions. If you needed to update a user's balance and their transaction log atomically, you had to put both in the same row or accept eventual consistency.

This limitation was intentional. Bigtable was designed for web-scale workloads (crawl indexing, analytics) where eventual consistency was acceptable. But as Google's business evolved, more services needed transactional guarantees.

Megastore sat on top of Bigtable and added cross-datacenter replication via Paxos. It supported ACID transactions within an "entity group" (a partition of related rows). But cross-group transactions required two-phase commit layered on top of Paxos, which meant write latency of 100-400ms for anything spanning entity groups.

The fundamental issue was clock coordination. Every distributed database must order transactions globally to provide consistency guarantees. Bigtable and Megastore used logical timestamps (monotonically increasing counters) for ordering within a single machine or entity group. But across data centers, there was no shared notion of "now."

And without a shared "now," there is no way to answer the most basic consistency question: "Has this transaction's result been made visible to everyone who started after it committed?"

Logical clocks solved ordering within a single Paxos group. They could not answer the question: "Did transaction A in Virginia commit before transaction B in Belgium started, according to actual wall-clock time?" Without that answer, external consistency was impossible.

The cost of this limitation was concrete. Google Ads engineers wrote thousands of lines of application-level reconciliation code to handle cases where two data centers disagreed about transaction ordering. Google Payments could not guarantee that a refund processed in Europe would be visible to a balance check in the US within any bounded time. These were not theoretical problems. They were support tickets.

Why Not Just [NTP / Logical Clocks / 2PC]?

Three "obvious" fixes existed. Each one fell short in a specific way.

Why not just use NTP?

Network Time Protocol synchronizes clocks across the internet to within 1-50ms of UTC. That sounds precise enough until you realize what "1-50ms" actually means in practice.

NTP accuracy degrades unpredictably. A congested network link adds variable latency to time queries. Temperature changes in a data center shift crystal oscillator frequency by parts per million. A rebooted NTP server can introduce step changes of tens of milliseconds.

The drift between corrections is unbounded for short intervals. Between NTP polls (which happen every few minutes), a server's clock drifts at whatever rate its local oscillator dictates. That rate varies with temperature, load, and hardware age.

For a database, this means two nodes can disagree about "now" by 50ms or more, and neither knows how wrong it is. If transaction A commits at node 1's time T=100 and transaction B starts at node 2's time T=101, you cannot guarantee that A truly happened before B. Node 2's clock might be 50ms ahead.

The critical difference between NTP and TrueTime: NTP gives you a best-effort timestamp with no error bound. TrueTime gives you a timestamp with a worst-case error bound. The second is far more useful for building correct distributed systems, because you can reason about the bound.

Why not just use logical clocks?

Lamport clocks and vector clocks provide causal ordering: if event A caused event B, the clocks guarantee A's timestamp is smaller than B's. But they say nothing about concurrent events that have no causal relationship.

Two users in different cities submit payments at the same wall-clock moment. Neither payment caused the other. Logical clocks assign arbitrary ordering to these events. For an advertising billing system that needs to match real-world time ("this click happened before that impression expired"), arbitrary ordering is not acceptable.

Vector clocks detect concurrency but do not resolve it. They tell you "these two events are concurrent" without providing a total order. For a SQL database that needs to answer SELECT ... ORDER BY timestamp, you need a total order that respects real time.

The bottom line on logical clocks: they are the right tool for causal consistency (Amazon DynamoDB uses them well), but they fundamentally cannot provide external consistency because they have no connection to wall-clock time.

Why not just use two-phase commit?

Two-phase commit (2PC) can coordinate transactions across data centers, but the coordinator must communicate with every participant before committing. For a transaction spanning Virginia and Belgium, that means at least one cross-Atlantic round trip (~70ms one way).

Megastore already used this approach (Paxos plus 2PC for cross-group transactions), and the result was 100-400ms write latency. For Google Ads processing millions of bid updates per second, that latency budget was unacceptable. The problem was not correctness. The problem was performance.

2PC also has a well-known availability weakness: if the coordinator crashes mid-protocol, participants hold locks until the coordinator recovers. At Google's scale, coordinator failures happen daily. The combination of high latency and fragile availability made pure 2PC a non-starter for a global database.

Spanner does use 2PC, but with a twist: each participant in the 2PC protocol is a Paxos group, not a single machine. If the coordinator machine crashes, another machine in its Paxos group takes over. This makes 2PC fault-tolerant, but that insight alone did not solve the clock-ordering problem.

TL;DR

Google built Spanner because Bigtable and Megastore could not provide globally consistent ACID transactions across data centers.
TrueTime uses GPS receivers (accurate to less than 1μs) and cesium atomic clocks in every data center to return wall-clock time as a bounded interval [earliest, latest], not a single point.
Spanner's commit wait protocol assigns a timestamp s, then waits until TrueTime.now().earliest > s before acknowledging, guaranteeing that s is safely in the past for every clock on Earth.
Practical uncertainty is roughly 7ms (typically around 4ms), making global ACID commits remarkably fast.
The transferable lesson: when you can bound uncertainty precisely, you can wait it out, and that changes what is possible in distributed systems.

Google Spanner and TrueTime

TL;DR

The Trigger

The System Before

Why Not Just [NTP / Logical Clocks / 2PC]?

Why not just use NTP?

Why not just use logical clocks?

Why not just use two-phase commit?

Continue Reading with Premium

Comments

Google Spanner and TrueTime

TL;DR

The Trigger

The System Before

Why Not Just [NTP / Logical Clocks / 2PC]?

Why not just use NTP?

Why not just use logical clocks?

Why not just use two-phase commit?

Continue Reading with Premium

Comments