6.5840 2025 Lecture 12: Spanner

Why this paper (Google Spanner, OSDI 2012)?
  Unusually ambitious for its time:
    Wide-area distributed transactions.
    Consistent wide-area replication.
  Neat ideas:
    Two-phase commit over Paxos.
    Read/only transactions w/o locking, via Snapshot Isolation.
    Read/only reads from local replica, but fresh.
    Synchronized time for external consistency.
  Spanner is used a lot inside Google.
    Influential, e.g. CockroachDB.

What was the motivating use case?
  Google F1 advertising database (Section 5.4).
  Previously sharded over many MySQL and BigTable DBs; awkward.
  Needed:
    Better (synchronous) replication.
    Cross-shard transactions.
  Workload is dominated by read-only transactions (Table 6).
  Strong consistency is required.
    External consistency / linearizability / serializability.

The basic organization:
  Datacenter A:
    "clients" are web servers e.g. for gmail
    data is sharded over multiple servers:
      a-m
      n-z
  Datacenter B:
    has its own local clients
    and its own copy of the data shards
      a-m
      n-z
  Datacenter C:
    same setup

Replication managed by Paxos; one Paxos group per shard.
  Paxos similar to Raft -- replicas, leader, quorum, log.
  Replicas are in different data centers.

Why this arrangement?
  Sharding allows huge total throughput via parallelism.
  Replicas at different data centers copes with whole-site failures.
  Clients can read local replica -- fast!
    This is what's driving the design's timestamps and TrueTime.
  Paxos requires only a majority -- tolerate slow/distant replicas.

What are the challenges?
  Entirely local read-only transactions pose some problems:
    Can we have multi-record read-only transactions without locking?
    Data on local replica may not be fresh if not in Paxos majority.
  A transaction may involve multiple shards -> multiple Paxos groups.

Spanner treats read/write and read/only transactions differently.

First, read/write transactions.

Example read/write transaction (bank transfer):
  BEGIN
    x = x + 1
    y = y - 1
  END
We don't want any read or write of x or y sneaking between our two ops.
After commit, all reads should see our updates.

Summary: two-phase commit (2pc) with Paxos-replicated participants.
  (Omitting timestamps for now.)
  (This is for r/w transactions, not r/o.)
  Client picks a unique transaction id (TID).
  Client sends each read to Paxos leader of relevant shard (2.1).
    Each shard first acquires a lock on the relevant record.
      May have to wait.
    Separate lock table per shard, in shard leader.
    Read locks are not replicated via Paxos, so leader failure -> abort.
  Client keeps writes private until commit.
  When client commits (4.2.1):
    Chooses a Paxos group to act as 2pc Transaction Coordinator (TC).
    Sends writes to relevant shard leaders.
    Each written shard leader:
      Acquires lock(s) on the written record(s).
      Log a "prepare" record via Paxos, to replicate lock and new value.
      Tell TC it is prepared.
      Or tell TC "no" if crashed and thus lost lock table.
  Transaction Coordinator:
    Decides commit or abort.
    Logs the decision to its group via Paxos.
    Tell participant leaders and client the result.
  Each participant leader:
    Log the TC's decision via Paxos.
    Perform writes (values already logged during commit).
    Release the transaction's locks.

Some points about the design so far (just read/write transactions).
  Locking (two-phase locking) ensures serializability.
  2pc widely hated b/c it blocks with locks held if TC fails.
    Replicating the TC with Paxos solves this problem!
  r/w transactions take a long time.
    Many inter-data-center messages.
    Tables 4 and 6 suggest 10ms to 100ms.
  But lots of parallelism: many clients, many shards.
    So total throughput can be high if busy.

From now on I'll mostly view each Paxos group as a single entity.
  Replicates shard data.
  Replicates two-phase commit state.

Now for read-only (r/o) transactions.
  These can involve multiple reads, perhaps from multiple shards.
  Table 6 shows 99.9% of transactions are read-only!
  So we want to make them fast, even at expense of r/w transactions.
  But still strictly serializable.

Spanner eliminates three big costs for r/o transactions:
  No two-phase commit, no transaction manager, no locks (!).
    To avoid slow inter-data-center msg to Paxos leader.
    And to avoid forcing r/w transactions to wait.
  Read from local replicas, again to avoid Paxos and cross-datacenter msgs.
    But local replica may not be up to date!
  Tables 3 and 6 show r/o has 10x less latency than r/w!
    This is a big deal.
  How to be correct despite cutting these corners?
  
Correctness constraints on r/o transactions:
  Serializable:
    Same results as if transactions executed one-by-one.
      Even though they may actually execute concurrently.
    I.e. an r/o xaction must essentially fit between r/w xactions.
      See all writes from prior transactions, nothing from subsequent.
    Even though *concurrent* with r/w xactions! And not locking!
  Externally consistent (== linearizable):
    If T1 completes before T2 starts, T2 must see T1's writes.
    "Before" refers to real (wall-clock) time.
    Even though T2 reads from possibly-lagging local replica.

Why not have r/o transactions just read the latest committed values?
  Suppose two bank transfers, and a transaction that reads both balances.
    T1:  Wx  Wy  C
    T2:                 Wx  Wy  C
    T3:             Rx             Ry
  The results won't match any serial order!
    Not T1, T2, T3.
    Not T1, T3, T2.
  We want T3 to see both of T2's writes, or none.
  We want T3's reads to *all* occur at the *same* point relative to T1/T2.

Idea: Snapshot Isolation (SI):
  DB stores multiple versions for each record.
    Each labeled with a time-stamp.
  A write adds a new version (doesn't overwrite).
  Assign every transaction a time-stamp.
    r/w: commit time.
    r/o: start time.
  All of a r/w transactions's writes get the xaction's time-stamp.
  An r/o transaction's reads see versions as of its time-stamp.
    The record version with the highest time-stamp less than the xaction's.
  Called Snapshot Isolation.

Our example with Snapshot Isolation:
                      x@10=9         x@20=8
                      y@10=11        y@20=12
    T1 @ 10:  Wx  Wy  C
    T2 @ 20:                 Wx  Wy  C
    T3 @ 15:             Rx             Ry
  "@ 10" indicates the time-stamp.
  Now T3's reads will both be served from the @10 versions.
    T3 won't see T2's write even though T3's read of y occurs after T2.
  Now the results are serializable: T1 T2 T3
    Results are the same as if executed serially in time-stamp order.

Why OK for T3 to read the *old* value of y even though there's a newer value?
  T2 and T3 are concurrent, so external consistency allows either order.
  Remember: r/o transactions need to read values
    as of their timestamp, and *not* see later writes.

Nice:
  T3 can get transactional (serializable) reads without locking!
    Reduces communication and blocking of r/w transactions.
  The cost (so far) is storing multiple versions for Snapshot Isolation.

Problem: what if T3 reads x from a replica that hasn't seen T1's write?
  Because the replica wasn't in the Paxos majority?

Solution: replica "safe time".
  Paxos leaders send writes in timestamp order.
  Before serving a read at time 20, replica must see Paxos write for time > 20.
    So it knows it has seen all writes < 20.
  So the replica may have to delay its response to a read.
  Must also delay if prepared but uncommitted r/w transactions (Section 4.1.3).
  Thus: r/o transactions can read from local replica -- usually fast.

Problem: what if clocks are not perfectly synchronized?

What goes wrong if clocks aren't synchronized exactly?
  No problem for r/w transactions, which use locks.
  If an r/o transaction's TS is too large:
    Its TS will be higher than replica safe times, and reads will block.
    Correct but slow -- delay increased by amount of clock error.
  If an r/o transaction's TS is too small:
    It will miss writes that committed before the r/o xaction started.
    Since its low TS will cause it to use old versions of records.
    This violates external consistency.
  If r/w transaction's TS is wrong:
    New write might have small TS, thus not visible to r/o transactions.
    Or old write might have large TS, seem like latest data to r/o.

Example of problem if r/o xaction's TS is too small:
  r/w T1 @ 10:         Wx2 C
  r/o T2 @  5:                   Rx?
(C for commit)
This would cause T2 to read the version of x before T1.
But T2 started after T1 committed (in real time),
  so external consistency requires that T2 see x=2.
So we need a way to deal with incorrect clocks!

What goes wrong with computer clocks?
  The absolute time is never quite right -- offset.
  Tick rate varies, never exactly right -- drift.

You can ask a better clock what time it it.
  Correct your offset, measure and compensate for drift.
  But comparisons have error due to variable communication delays.

Google's time reference system (Section 5.3)
  [UTC(USNO), GPS satellites, masters, servers]
  A few time master servers per data center.
  Each time master has either a GPS receiver or an "atomic clock".
  GPS receivers are typically accurate to better than a microsecond.
  Other servers talk to a few nearby time masters.
    Uncertainty due to network delays, drift between checks.

TrueTime
  Time service yields a TTinterval = [ earliest, latest ].
  The correct time is somewhere in the interval, with high probability.
  Interval width computed from measured network delays,
    measured clock drift, time since last sync.
  Figure 6: intervals are usually < 1 millisecond, but sometimes 10+ ms.
  So: server clocks aren't exactly synchronized.
  But some useful guarantees:
    TS > latest is guaranteed to be in the future.
    TS < earliest is guaranteed to be in the past.

How Spanner ensures that if r/w T1 finishes before r/o T2 starts, TS2 > TS1.
  So that T2's snapshot reads see T1's writes.
  And are thus externally consistent.

Two rules for r/w transactions (4.1.2):
  Start rule:
    TS = TT.now().latest, when commit begins
    this is the time-stamp it will write in the DB records
  Commit wait:
    Before releasing locks or replying to client,
      delay until TS < TT.now().earliest
    Guarantees that TS *is* in the past, relative to xaction finish time.

Rule for r/o transctions:
  TS = TT.now().latest
  and read that version (really, version with highest timestamp < TS)
  TS guaranteed *not* to be in the past

Example updated with intervals and commit wait:
  The scenario is T1 finishes, then T2 starts, T2 must see T1's writes.
  I.e. we need TS1 < TS2.

               TS1=10
               [1,10]     [11,20]
  r/w T1:        C..........W
                                   TS2=15
                                   [5,15]
  r/o T2:                            Rx
  "C" for start of commit, "W" for end of commit wait
  Remember, the assumption is that T2 starts after T1 finishes,
    so that T2 needs to see T1's writes.
  
Why commit wait provides external consistency for r/o transactions:
  Given that T1 finishes before T2 starts.
  Commit wait means TS1 is guaranteed to be in the past.
  After commit wait:
    TS1 < TT.now().latest on every computer, including T2.
      (since > latest guaranteed to be in the future, and TS1 is in the past)
    So TS2 = TT.now().latest is guaranteed to be > TS1.
    So T2 will see T1's writes.

Why does an r/w transaction use TS = TT.now().latest?
  (above reasoning allows any TS that commit-wait forces into the past)
  Suppose sequential r/w transactions, T3, and afterwards T4.
  We must have TS3 < TS4 (so that an r/o that sees T4 also sees T3).
  The same reasoning as for r/o will cause TS4 to be greater than TS3.
  So all r/w transactions use TS = TT.now().latest.

More generally:
  Snapshot Isolation gives you serializable r/o transactions.
    Timestamps set an order.
    Snapshot versions implement consistent reads at a timestamp.
    Xaction sees all writes from lower-TS xactions, none from higher.
  Synchronized time + commit-wait yields external consistency.
    Even among transactions at different data centers.
  Safe time allows reading from local replicas that might lag.

Why is all this useful?
  Fast r/o transactions:
    Read from local replicas!
    No locks!
    Thus the 10x latency r/o vs r/w improvement in Table 3 (and Table 6).
  Although:
    r/o transaction reads may pause due to safe time, to catch up.
    r/w transaction commits pause in Commit Wait.
    Accurate (small interval) time minimizes these delays.

Summary:
  Rare to see deployed systems offer distributed transactions
    over geographically distributed data.
  Spanner was a surprising demonstration that it can be practical.
  Timestamping scheme is the most interesting aspect.
  Widely used within Google; a commercial Google service; influential.

Midterm exam on Thursday!
  Papers, lectures, labs.
  Look at old exams on the web site.

---

Spanner: Becoming a SQL System, 2017, https://research.google/pubs/pub46103.pdf
https://www.cockroachlabs.com/blog/living-without-atomic-clocks/
https://engineering.fb.com/2020/03/18/production-engineering/ntp-service/
https://sookocheff.com/post/time/truetime/
https://communities.actian.com/s/article/Using-MVCC-Multi-Version-Concurrency-Control
https://cloud.google.com/spanner/docs/replication