6.5840 Spanner FAQ Q: This is a complex paper; what should we focus on? A: For us, the key ideas are 1) running two-phase commit on Paxos-replicated entities so that failure doesn't cause blocking with locks held; 2) use of snapshot isolation to allow read-only transactions without locks; and 3) synchronized time to ensure that read-only transactions see the latest writes (external consistency). Q: How does TrueTime select intervals in a way that's guaranteed to contain the correct time? A: Here a simple example of the kind of reasoning it uses. Suppose master time server S1 has the correct time (from GPS or an atomic clock). S2 sends a request to S1 asking for the time, and gets a response. The response says "10:00:00 AM" and it arrives two seconds after S2 sent the request (it's reasonable to assume that S2 can time how long things take even though it doesn't know the absolute time). Since the whole request/response took two seconds, S2 can conclude that the network might have delayed the request by up to two seconds; or delayed the response by up to two seconds; but no more than that. So S2 can conclude that, at the moment it receives the response, that the correct time must be between 10:00:00 and 10:00:02. Q: How does external consistency relate to linearizability and serializability? A: Spanner's external consistency is the same as strict serializability. And strict serializability is linearizability, but applied to multi-read/write transactions rather than single reads and writes. Q: Why is external consistency desirable? A: Suppose Hatshepsut changes the password on an account shared by her workgroup, via a web server in a datacenter in San Jose. She whispers the new password over the cubicle wall to her colleage Cassandra. Cassandra logs into the account via a web server in a different datacenter, in San Mateo. External consistency guarantees that Cassandra will observe the change to the password, and not, for example, see a stale replica. Q: Could Spanner use Raft rather than Paxos? A: Yes. At the level of this paper there is no difference. At the time Spanner was being built, Raft didn't exist, and Google already had a high-performance and robust Paxos implementation. Have a look at the paper Paxos Made Live by Chandra et al. Q: Why does read-only transaction T2 having a greater time-stamp than T1 mean that T2 will see T1's writes? A: Spanner keeps multiple versions of each record, one for each recent write of that record, and remembers the time-stamp of the transaction that wrote each version (this is the timestamp:int64 at the start of Section 2.1). When a read-only transaction reads a record, the version it reads is the one with the highest time-stamp that's less than the transaction's time-stamp. Thus a read-only transaction will see writes by transactions with lower time-stamps, and will not see writes by transactions with higher time-stamps. This technique of read-only transactions seeing a version of the database as of a specific instant in time is often called Snapshot Isolation. Q: What is the purpose of the t_safe "safe time" machinery in Section 4.1.3? A: A read-only transaction with time-stamp T2 is required to see the writes of any read-write transaction T1 if T1's time-stamp is less than T2's time-stamp. But T2 reads only Paxos replicas in its own data-center; if T1 executed at a distant data-center, the replicas in T2's data-center might not yet have received T1's writes. The safe time mechanism causes T2's local Paxos replicas to delay T2's reads until it's guaranteed that the replica has seen all writes with timestamp < T2. Q: What is the purpose of Spanner's commit wait? A: Commit wait ensures that a read/write transaction does not complete (release locks and reply to the client) until the time in its timestamp is guaranteed to have passed. That means that a read/only transaction that starts after the read/write transaction completes is guaranteed to have a higher timestamp, and thus to see the read/write transaction's writes. This helps fulfil the guarantee of external consistency: if T1 completes before T2 starts, T2 will come after T1 in the equivalent serial order (i.e. T2 will see T1's writes). Commit wait is needed because clocks on different computers aren't perfectly synchronized. If they were guaranteed to be perfectly synchronized, then the TrueTime interval could always be zero, and commit wait would never have to wait because TT.after(TT.now().latest) would always be true. Another way to look at this is that there's value in making TrueTime more accurate (i.e. shrinking the interval), because that allows read/write transactions to commit faster. Q: Why does a read/only transaction use TT.now().latest as its timestamp? A: Suppose a read-only transaction T2 is starting. External consistency demands that T2 see writes by any read-write transaction T1 that finished before T2 started. Spanner's Snapshot Isolation means that it's sufficient for T2 to choose a time-stamp that's greater than T1's time-stamp: then T2 will see T1's writes. T1 chose TT.now().latest as its time-stamp; then commit-wait caused T1 to wait until that time-stamp was guaranteed to be in the past; and only then did T1 commit and finish. So after T1 has finished, its time-stamp is guaranteed to be in the past. Since T2 starts after T1 finishes, and T1's time-stamp is guaranteed to have passed, and TT.now().latest is guaranteed not to have passed yet, TT.now().latest is guaranteed to be larger than T1's time-stamp. And thus T2 using TT.now().latest means that T2 will see T1's writes. Q: Why does a read/write transaction use TT.now().latest as its timestamp? A: If there are two read/write transactions T1 and T2, and T1 ends before T2 starts, TS2 needs to be greater than TS1 so that any r/o transactions that see T2's writes will also see T1's writes. Just as with r/o transactions, T2 chooses TS2 = TT.now().latest in order to guarantee that TS2 is larger than TS1. Q: What are the steps in commiting a read-write transaction? A: 0. A read-write transaction causes read locks to be taken as it executes on all the records it uses (including for writes). 1. Once the client has finished all its reads and writes, and is ready to ask Spanner to try to commit, the client chooses a two-phase commit coordinator. 2. The client sends each of its writes to the leader of the relevant shard, along with the identity of the chosen coordinator. 3. Each shard leader upgrades locks to write (perhaps waiting), chooses a "prepare timestamp", logs this information in Paxos, and sends a "prepared" message to the coordinator containing the prepare timestamp. The shard leader chooses the prepare timestamp to be larger than the timestamp of any transaction it knows about. 4. The coordinator waits for all the prepared messages. It chooses a timestamp TS for the transaction that's >= TT.now().latest and greater than any participant's prepared timestamp. The coordinator waits until TS is guaranteed to have passed (the commit-wait). Then the coordinator sends a reply to the client, and sends each participant a message with TS telling the participant to commit. 5. Each participant performs the writes for its shard, and then releases locks. Q: What's the point of the "prepare timestamp"? A: Each shard leader must apply writes in timestamp order, so that the "safe time" mechanism in 4.1.3 can guarantee that if a replica has seen a write with timestamp X, it has also seen every write with timestamp less than X. So a particpant needs to be able to tell the two-phase-commit coordinator "please don't assign this transaction a timestamp less than X", where X is the highest timestamp the participant has already seen. Q: What is wound-wait (Section 4.2.1)? A: Wound-wait is a technique for avoiding deadlock. When transaction T1 needs a lock held by T2, and T1 is older (smaller time-stamp) than T2, wound-wait will abort T2 and allow T1 to immediately have the lock it needs. Otherwise (if T1 is younger than T2) T1 will wait for the lock. So the waits-for graph will be a DAG, and thus have a no cycles and no deadlocks. On the plus side, wound-wait avoids deadlocks, and is relatively simple. If deadlocks were allowed to develop, the options for detecting and resolving them in a distributed database are none of them attractive. On the minus side, wound-wait may sometimes generate aborts (and consequent re-tries) in situations where waiting for the lock would be safe and more efficient. Q: What is a schema change (as in Section 4.2.3)? A: "Schema" refers to the set of tables, columns, types, and indices provided by a database deployment, plus perhaps the way these objects are laid out over physical servers and datacenters. A schema change involves adding or deleting or changing some of these objects. The challenge during a schema change is to allow concurrent reads and writes, and also to ensure that all writes are reflected in the final data state. Every concurrent write must either logically occur before the schema change (which will then copy the write to the new DB state), or after the schema change (in which case the write will naturally affect the post-change DB state). I cannot tell, after reading 4.2.3, whether Spanner's schema change transactions are ordinary read/write transactions, so that their nice features fall out naturally from Spanner's design, or whether Spanner handles schema change transactions specially. The paper's text does not say whether "they must block behind the schema-change transaction" means that the blocking is caused by locks acquired by schema change, or whether the blocking is caused by use of time-stamps. My guess is the latter. Q: In what ways is Spanner better than GFS? A: GFS provides just huge files. GFS has few consistency guarantees. For example, if I write to a file, and then you read it, GFS does not guarantee that you will see my write. GFS has no notion of transactions. Spanner provides most of the nice features and properties of an ACID database. Spanner has tables with rows and columns, and relational queries over those tables (as in SQL). Spanner has atomic distributed transactions, and guarantees very strong consistency (linearizability and strict serializability). Q: What is an atomic clock? A: A very stable oscillator. There are two main technologies that go by the name "atomic clock": rubidium clocks and cesium clocks. Both exploit changes in the state of the outer electron, which involve specific quanta of energy and thus wavelength. One can tune a signal generator to precisely that wavelength by watching how excited the electrons are. An atomic clock is just the oscillator part of a clock: it produces a frequency that can cause a clock to tick at exactly the right rate, but does not by itself know what time it is. To provide time, an atomic clock must initially be synchronized with the time, often via GPS (which itself is fed the time by a bunch of atomic clocks). Q: What kind of atomic clock does Spanner use? A: Sadly the paper doesn't say. Rubidium clocks are typically a few thousand dollars (e.g. https://thinksrs.com/products/fs725.html). Rubidium clocks drift by perhaps a few microseconds per week, so they need to be re-synchronized to UTC (typically by GPS) every once in a while. Cesium clocks cost perhaps $50,000; the HP 5071A is a good example. A cesium clock doesn't drift. Of course, any one clock might fail or suffer a power failure, so even with perfect cesium clocks you still need more than one and the ability to synchronize to UTC. A picture on this page includes a 5071A: https://sookocheff.com/post/time/truetime/ Q: Does anyone use Spanner? A: It's said that hundreds of Google services depend on Spanner. The paper talks about its use by Google's advertising system. Google's Zanzibar Authorization system uses Spanner. It's offered as a service to Google's cloud customers in the form of Cloud Spanner. The CockroachDB open-source database is based on the Spanner design.