6.5840 2024 Lecture 13: Spanner Why this paper (Google Spanner, OSDI 2012)? Unusually ambitious for its time: Wide-area distributed transactions. Consistent wide-area replication. Replicating *data* through Paxos. Neat ideas: Two-phase commit over Paxos. Synchronized time for fast r/o transactions. Used a lot inside Google. What was the motivating use case? Google F1 advertising database (Section 5.4). Previously sharded over many MySQL and BigTable DBs; awkward. Needed: Better (synchronous) replication. Cross-shard transactions. Workload is dominated by read-only transactions (Table 6). Strong consistency is required. External consistency / linearizability / serializability. The basic organization: Datacenter A: "clients" are web servers e.g. for gmail data is sharded over multiple servers: a-m n-z Datacenter B: has its own local clients and its own copy of the data shards a-m n-z Datacenter C: same setup Replication managed by Paxos; one Paxos group per shard. Replicas are in different data centers. Similar to Raft -- each Paxos group has a leader. As in the labs, Paxos replicates a log of operations. Why this arrangement? Sharding allows huge total throughput via parallelism. Replicas at different data centers copes with whole-site failures. Clients can read local replica -- fast! This is what's driving the design's timestamps and TrueTime. Paxos requires only a majority -- tolerate slow/distant replicas. What are the challenges? Entirely local reads pose some problems: Can we have multi-record read transactions without locking? Data on local replica may not be fresh if not in Paxos majority. A transaction may involve multiple shards -> multiple Paxos groups. Spanner treats read/write and read/only transactions differently. First, read/write transactions. Example read/write transaction (bank transfer): BEGIN x = x + 1 y = y - 1 END We don't want any read or write of x or y sneaking between our two ops. After commit, all reads should see our updates. Summary: two-phase commit (2pc) with Paxos-replicated participants. (Omitting timestamps for now.) (This is for r/w transactions, not r/o.) Client picks a unique transaction id (TID). Client sends each read to Paxos leader of relevant shard (2.1). Each shard first acquires a lock on the relevant record. May have to wait. Separate lock table per shard, in shard leader. Read locks are not replicated via Paxos, so leader failure -> abort. Client keeps writes private until commit. When client commits (4.2.1): Chooses a Paxos group to act as 2pc Transaction Coordinator (TC). Sends writes to relevant shard leaders. Each written shard leader: Acquires lock(s) on the written record(s). Log a "prepare" record via Paxos, to replicate lock and new value. Tell TC it is prepared. Or tell TC "no" if crashed and thus lost lock table. Transaction Coordinator: Decides commit or abort. Logs the decision to its group via Paxos. Tell participant leaders and client the result. Each participant leader: Log the TC's decision via Paxos. Perform its writes. Release the transaction's locks. Some points about the design so far (just read/write transactions). Locking (two-phase locking) ensures serializability. 2pc widely hated b/c it blocks with locks held if TC fails. Replicating the TC with Paxos solves this problem! r/w transactions take a long time. Many inter-data-center messages. Table 6 suggests about 100 ms for cross-USA r/w transaction. Only 14 ms for cross-city (Table 3). But lots of parallelism: many clients, many shards. So total throughput could be high if busy. From now on I'll mostly view each Paxos group as a single entity. Replicates shard data. Replicates two-phase commit state. Now for read-only (r/o) transactions. These can involve multiple reads, perhaps from multiple shards. Table 6 shows 99.9% of transactions are read-only! So we want to make them fast, even at expense of r/w transactions. But still strictly serializable. So all of an r/o transactions reads must appear to happen at the same instant. Spanner eliminates three big costs for r/o transactions: Read from local replicas, to avoid Paxos and cross-datacenter msgs. But note local replica may not be up to date! No locks, so r/o transaction doesn't force r/w to wait. No two-phase commit, no transaction manager. Again to avoid cross-data center msg to Paxos leader. Tables 3 and 6 show r/o has 10x less latency than r/w! This is a big deal. How to be correct despite cutting these corners? Correctness constraints on r/o transactions: Serializable: Same results as if transactions executed one-by-one. Even though they may actually execute concurrently. I.e. an r/o xaction must essentially fit between r/w xactions. See all writes from prior transactions, nothing from subsequent. Even though *concurrent* with r/w xactions! And not locking! Externally consistent: If T1 completes before T2 starts, T2 must see T1's writes. "Before" refers to real (wall-clock) time. Similar to linearizable. Rules out reading stale data. Why not have r/o transactions just read the latest committed values? Suppose we have two bank transfers, and a transaction that reads both. T1: Wx Wy C T2: Wx Wy C T3: Rx Ry The results won't match any serial order! Not T1, T2, T3. Not T1, T3, T2. We want T3 to see both of T2's writes, or none. We want T3's reads to *all* occur at the *same* point relative to T1/T2. Idea: Snapshot Isolation (SI): Synchronize all computers' clocks (to real wall-clock time). Assign every transaction a time-stamp. r/w: commit time. r/o: start time. We want results as if one-at-a-time in time-stamp order. Even if actual reads occur in different order. Each replica stores multiple time-stamped versions of each record. All of a r/w transactions's writes get the same time-stamp. An r/o transaction's reads see version as of xaction's time-stamp. The record version with the highest time-stamp less than the xaction's. Called Snapshot Isolation. Our example with Snapshot Isolation: x@10=9 x@20=8 y@10=11 y@20=12 T1 @ 10: Wx Wy C T2 @ 20: Wx Wy C T3 @ 15: Rx Ry "@ 10" indicates the time-stamp. Now T3's reads will both be served from the @10 versions. T3 won't see T2's write even though T3's read of y occurs after T2. Now the results are serializable: T1 T2 T3 The serial order is the same as time-stamp order. Why OK for T3 to read the *old* value of y even though there's a newer value? T2 and T3 are concurrent, so external consistency allows either order. Remember: r/o transactions need to read values as of their timestamp, and *not* see later writes. Nice: T3 can get transactional (serializable) reads without locking! Reduces communication and blocking of r/w transactions. The cost (so far) is storing multiple versions. Problem: what if T3 reads x from a replica that hasn't seen T1's write? Because the replica wasn't in the Paxos majority? Solution: replica "safe time". Paxos leaders send writes in timestamp order. Before serving a read at time 20, replica must see Paxos write for time > 20. So it knows it has seen all writes < 20. So the replica may have to delay its response to a read. Must also delay if prepared but uncommitted r/w transactions (Section 4.1.3). Thus: r/o transactions can read from local replica -- usually fast. Problem: what if clocks are not perfectly synchronized? What goes wrong if clocks aren't synchronized exactly? No problem for r/w transactions, which use locks. If an r/o transaction's TS is too large: Its TS will be higher than replica safe times, and reads will block. Correct but slow -- delay increased by amount of clock error. If an r/o transaction's TS is too small: It will miss writes that committed before the r/o xaction started. Since its low TS will cause it to use old versions of records. This violates external consistency. If r/w transactions TS's are out of order: r/w transaction might see later committed writes but not earlier ones. Example of problem if r/o xaction's TS is too small: r/w T1 @ 10: Wx2 C r/o T2 @ 5: Rx? (C for commit) This would cause T2 to read the version of x before T1. But T2 started after T1 committed (in real time), so external consistency requires that T2 see x=2. So we need a way to deal with incorrect clocks! What goes wrong with [computer] clocks? The absolute time is never quite right -- offset. They never tick at exactly the right rate -- drift. You can ask a better clock what time it it. Correct your offset, measure and compensate for drift. But comparisons themselves have error due to variable communication delays. Google's time reference system (Section 5.3) [UTC, GPS satellites, masters, servers] A few time master servers per data center. Each time master has either a GPS receiver or an "atomic clock". GPS receivers are typically accurate to better than a microsecond. Other servers talk to a few nearby time masters. Uncertainty due to network delays, drift between checks. TrueTime Time service yields a TTinterval = [ earliest, latest ]. The correct time is somewhere in the interval, with high probability. Interval width computed from measured network delays, measured clock drift, time since last sync. Figure 6: intervals are usually < 1 millisecond, but sometimes 10+ ms. So: server clocks aren't exactly synchronized. But some useful guarantees: TS > latest is guaranteed to be in the future. TS < earliest is guaranteed to be in the past. How Spanner ensures that if r/w T1 finishes before r/o T2 starts, TS2 > TS1. So that T2's snapshot reads see T1's writes. And are thus externally consistent. Two rules for r/w transactions (4.1.2): Start rule: TS = TT.now().latest, when commit begins this is the version it will write in the DB record Commit wait: Before releasing locks or replying to client, delay until TS < TT.now().earliest Guarantees that TS *is* in the past, relative to xaction finish time. Rule for r/o transctions: TS = TT.now().latest and read that version (really, highest version < TS) TS guaranteed *not* to be in the past Example updated with intervals and commit wait: The scenario is T1 finishes, then T2 starts, T2 must see T1's writes. I.e. we need TS1 < TS2. TS1=10 [1,10] r/w T1: C..........W TS2=15 [5,15] r/o T2: Rx "C" for start of commit, "W" for end of commit wait Remember, the assumption is that T2 starts after T1 finishes, so that T2 needs to see T1's writes. Why this provides external consistency for r/o transactions: Given that T1 finishes before T2 starts. Commit wait means TS1 is guaranteed to be in the past. After commit wait: TS1 < TT.now().latest on every computer, including T2. (since > latest guaranteed to be in the future, and TS1 is in the past) So TS2 = TT.now().latest is guaranteed to be > TS1. So T2 will see T1's writes. Why does an r/w transaction use TS = TT.now().latest? (above reasoning allows any TS that commit-wait forces into the past) Suppose sequential r/w transactions, T3, and afterwards T4. We must have TS3 < TS4 (so that an r/o that sees T4 also sees T3). The same reasoning as for r/o will cause TS4 to be greater than TS3. So all r/w transactions use TS = TT.now().latest. More generally: Snapshot Isolation gives you serializable r/o transactions. Timestamps set an order. Snapshot versions (and safe time) implement consistent reads at a timestamp. Xaction sees all writes from lower-TS xactions, none from higher. Any number will do for TS if you don't care about external consistency. Synchronized timestamps yield external consistency. Even among transactions at different data centers. Even though reading from local replicas that might lag. Why is all this useful? Fast r/o transactions: Read from local replicas! No locks! Thus the 10x latency r/o vs r/w improvement in Table 3 (and Table 6). Although: r/o transaction reads may pause due to safe time, to catch up. r/w transaction commits pause in Commit Wait. Accurate (small interval) time minimizes these delays. Summary: Rare to see deployed systems offer distributed transactions over geographically distributed data. Spanner was a surprising demonstration that it can be practical. Timestamping scheme is the most interesting aspect. Widely used within Google; a commercial Google service; influential. --- Spanner: Becoming a SQL System, 2017, https://research.google/pubs/pub46103.pdf https://www.cockroachlabs.com/blog/living-without-atomic-clocks/ https://engineering.fb.com/2020/03/18/production-engineering/ntp-service/ https://sookocheff.com/post/time/truetime/ https://communities.actian.com/s/article/Using-MVCC-Multi-Version-Concurrency-Control https://cloud.google.com/spanner/docs/replication