Distributed Transactions FAQ

Q: How does this material fit into 6.5840?

A: When data is distributed over many computers, it's common for a
single operation to need to read and/or modify multiple data items,
perhaps stored on different computers. How such multi-step operations
interact with concurrent operations on the same data, and what
happens if a crash occurs in the middle of such an operation, are
usually critical questions for the system's robustness and ease of
programming. The gold standard for good behavior is transactions,
often provided by database systems. Transactions are usually
implemented with two-phase locking and logging; distributed
transactions usually add two-phase commit. Today's reading from the
6.033 textbook explains those ideas.

Most of the storage systems we've looked at so far provide operations
like put() and get() that involve only single records. In contrast,
transactions usually refer to atomic multi-record operations (e.g.,
bank transfers). When the records involved in a transaction are
stored in different places (e.g., in a sharded storage system), then
we're talking about distributed transactions, for which two-phase
commit is helpful. These ideas will show up in some of the upcoming
papers we read (e.g., Spanner and FaRM).

Q: Why is it so important for transactions to be atomic?

A: "Transaction" means that the steps inside the transaction occur
atomically with respect to failures and other transactions. Atomic
here means "all or none". Transactions are a feature provided by some
storage systems to make programming easier.

A situation where transactions are helpful is bank transfers. If the
bank wants to transfer $100 from Alice's account to Bob's account, it
would be awkward if a crash left Alice debited by $100 but Bob *not*
credited by $100. If your storage system supports transactions, you
can write something like

  BEGIN TRANSACTION
    decrease Alice's balance by 100;
    increase Bob's balance by 100;
  END TRANSACTION

and the transaction system will make sure the transaction is atomic:
either both happen, or neither, even if there's a failure; and no
other transaction will observe the intermediate situation where only
one balance has been modified.

Q: Could one use Raft instead of two-phase commit?

A: Two-phase commit and Raft solve different problems.

Two-phase commit causes different computers to do *different* things
(e.g., Alice's bank debits Alice, Bob's bank credits Bob), and causes
them *all* to do their thing, or none of them. Two-phase commit
systems are typically not available (cannot make progress) if some
participants can't be reached, since they have to wait for all
participating computers to perform their part of the transaction.

Raft causes a majority of the peers to all do the *same* thing (so
they remain replicas). It's OK for Raft to wait only for a majority,
since the peers are replicas, and therefore we can make the system
available in the face of failures.

Q: What is the difference between two-phase locking and two-phase
commit?

A: They are largely unrelated; it just happens that both have
"two-phase" in their names. 2PL is a scheme for acquiring locks on
the records used in a transaction; it is useful in both
non-distributed and distributed settings. 2PC is a scheme for
executing a transaction across multiple machines, where each machine
holds some of the records used in the transaction.
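To make the two phases of commit concrete, here is a minimal sketch,
in Go, of a coordinator (TC) driving the bank-transfer example. The
Participant interface, runTransaction, and the TID argument are all
invented for illustration; a real TC must also handle timeouts,
retransmission, and crash recovery.

  package twopc

  // Participant is a hypothetical RPC stub for one worker.
  type Participant interface {
          // Phase 1: ask the participant to get ready (acquire
          // locks, log tentative updates) and to vote yes or no.
          Prepare(tid int) (ok bool)
          // Phase 2: tell the participant the TC's decision.
          Commit(tid int)
          Abort(tid int)
  }

  // runTransaction reports whether transaction tid committed.
  func runTransaction(tid int, parts []Participant) bool {
          // Phase 1: collect votes from every participant.
          for _, p := range parts {
                  if !p.Prepare(tid) {
                          // Any "no" vote forces an abort.
                          for _, q := range parts {
                                  q.Abort(tid)
                          }
                          return false
                  }
          }
          // All voted yes: commit.
          for _, p := range parts {
                  p.Commit(tid)
          }
          return true
  }

Note that a real TC would have to write its commit/abort decision to
durable storage before sending any phase-2 message, so that after a
reboot it can re-send the same decision.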
Q: In two-phase commit, why would a worker send an abort message,
rather than a PREPARED message?

A: The reason we care most about is that the participant crashed and
rebooted after it did some of its work for the transaction but before
it received the prepare message; during the crash it will have lost
the record of the tentative updates it made and the locks it
acquired, so it cannot complete the transaction. Another possibility
(depending on how the database works) is that the worker detected a
violated constraint on the data (e.g., the transaction tried to write
a record with a duplicate key in a table that requires unique keys).
Another possibility is that the worker is involved in a deadlock, and
must abort to break the deadlock.

Q: Can two-phase locking generate deadlock?

A: Yes. If two transactions both use records R1 and R2, but in
opposite orders, they will each acquire one of the locks, and then
deadlock trying to get the other lock. Databases detect these
deadlocks and break them. A database can detect deadlock by timing
out lock acquisition, or by finding cycles in the waits-for graph
among transactions. A deadlock can be broken by aborting one of the
participating transactions.

Q: Why does it matter whether locks are held until after a
transaction commits or aborts?

A: If transactions release locks before they commit, it can be hard
to avoid certain non-serializable executions due to aborts or
crashes. In this example, suppose T1 releases the lock on x after it
updates x, but before it commits or aborts:

  T1:            T2:
  BEGIN
  x = x + 1
                 BEGIN
                 y = x
                 END
  END

It can't be legal for y to end up greater than x. Yet if T1 releases
its lock on x, then T2 acquires the lock, writes y, and commits, but
then T1 aborts or the system crashes and cannot complete T1, we will
end up with y greater than x. It's to avoid having to cope with the
above that people use the "strong strict" variant of 2PL, which only
releases locks after a commit or abort.

Q: What is the point of the two-phase locking rule that says a
transaction isn't allowed to acquire any locks after the first time
that it releases a lock?

A: Acquiring after releasing can lead to non-serializable executions.

  T1:            T2:
  x = x + 1
                 z = x + y
  y = y + 1

Suppose x and y start out as zero, and both transactions execute and
successfully commit. The only final values of z that are allowed by
serializability are zero and 2 (corresponding to the orders T2;T1 and
T1;T2). But if T1 releases its lock on x before acquiring the lock on
y and modifying y, T2 could completely execute and commit while T1 is
between its two statements, giving z a value of 1, which is not
legal. If T1 keeps its lock on x while using y, as two-phase locking
demands, this problem is avoided.
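The locking discipline in the last two answers might look roughly
like the sketch below: a transaction locks each record before first
touching it, and releases nothing until commit or abort. The
lockTable and txn types are invented for illustration; a real lock
manager also needs lock modes, queues, and deadlock handling.

  package tx

  import "sync"

  type lockTable struct {
          mu    sync.Mutex
          locks map[string]*sync.Mutex // one exclusive lock per record
  }

  func newLockTable() *lockTable {
          return &lockTable{locks: map[string]*sync.Mutex{}}
  }

  func (lt *lockTable) lockFor(key string) *sync.Mutex {
          lt.mu.Lock()
          defer lt.mu.Unlock()
          if lt.locks[key] == nil {
                  lt.locks[key] = &sync.Mutex{}
          }
          return lt.locks[key]
  }

  type txn struct {
          lt   *lockTable
          held map[string]*sync.Mutex // all locks this txn holds
  }

  func begin(lt *lockTable) *txn {
          return &txn{lt: lt, held: map[string]*sync.Mutex{}}
  }

  // acquire locks a record before the transaction first uses it.
  // Under 2PL this must never be called after a release.
  func (t *txn) acquire(key string) {
          if t.held[key] != nil {
                  return // already hold this record's lock
          }
          l := t.lt.lockFor(key)
          l.Lock()
          t.held[key] = l
  }

  // finish runs only at commit or abort; strong strict 2PL releases
  // all locks here and nowhere else.
  func (t *txn) finish() {
          for _, l := range t.held {
                  l.Unlock()
          }
          t.held = nil
  }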
Q: How does two-phase commit solve the dilemma of the two generals
(or the Byzantine Generals' Problem)?

A: It doesn't. Two-phase commit doesn't encounter the two generals
problem, and thus doesn't need to solve it. One difference is that,
in two-phase commit, there's just one entity making the decision (the
TC), and it can't disagree with itself. Whereas in the two generals
problem, there are two independent deciders who have trouble
communicating, so there's room for disagreement. Another difference
is that two-phase commit has no real-time requirement (nothing like
the requirement that the generals agree by dawn); it's OK for workers
to wait for as long as needed to receive the TC's decision.

Q: Are the locks exclusive, or can they allow multiple readers to
have simultaneous access?

A: By default, "lock" in 6.5840 refers to an exclusive lock. But
there are databases that can grant locking access to a record to
either multiple readers or a single writer. Some care has to be taken
when a transaction reads a record and then writes it, since the lock
will initially be a read lock and must then be upgraded to a write
lock. There's also increased opportunity for deadlock in some
situations: if two transactions simultaneously want to increment the
same record, they might deadlock when upgrading a read lock to a
write lock on that record, whereas if locks are always exclusive,
they won't deadlock.

Q: How should one decide between pessimistic and optimistic
concurrency control?

A: If your transactions conflict a lot (they use the same records,
and one or more of them writes), then locking is better. Locking
causes conflicting transactions to wait, whereas most OCC systems
deal with conflict by aborting; aborts (really the consequent
retries) are expensive. If your transactions rarely conflict, then
OCC is preferable to locking: OCC doesn't spend CPU time
acquiring/releasing locks and, as long as conflicts are rare, OCC
rarely aborts. The "validation" phase of OCC systems often uses
locks, but they are usually held for shorter periods of time than the
locks in pessimistic designs.

Q: What should two-phase commit workers do if the transaction
coordinator crashes?

A: If a worker has told the coordinator that it is ready to commit,
then the worker cannot later change its mind. The reason is that the
coordinator may (before it crashed) have told other workers to
commit. So the worker has to wait (with locks held) for the
coordinator to reboot and send (or re-send) its decision. Waiting
indefinitely with locks held is a real problem, since the locks can
force a growing set of other transactions to block as well. So people
tend to avoid two-phase commit, or they try to make coordinators
reliable. For example, Google's Spanner replicates coordinators (and
all other servers) using Paxos.

Q: Why don't people use three-phase commit, which allows workers to
commit or abort even if the coordinator crashes?

A: Three-phase commit only works if the network is reliable, or if
workers can reliably distinguish between the coordinator being dead
and the network not delivering packets. For example, three-phase
commit won't work correctly if there's a network partition. In most
practical networks, it's not possible to distinguish a dead computer
from a network failure.

Q: Can there be more than one transaction active? How do
participants know which transaction a message refers to?

A: There can be many concurrent transactions, managed by many TCs. A
TC assigns a unique transaction ID (TID) to each transaction. Every
message includes the TID of the relevant transaction. TCs and
participants tag entries in their tables with the TID, so that (for
example) when a COMMIT message arrives at a participant, it knows
which tentative records to make permanent, and which locks to
release.

Q: How does a two-phase commit system undo modifications if a
transaction has to abort?

A: Each participant performs modifications on temporary copies of the
records. If the participant answers "yes" to the TC's prepare
message, it must first save the temporary record values to its log on
disk, so that it can find them if it crashes and restarts. If the TC
decides to commit, the participant must copy the temporary values to
the real database records; if the TC decides to abort, the
participant must discard the temporary records.
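The tentative-record bookkeeping in the previous answer, keyed by
TID, might look roughly like the following sketch. All names here
(participant, logToDisk, and so on) are invented, and durable logging
is reduced to a stub; a real participant would also persist the fact
that a TID is prepared, and would hold and release locks.

  package twopc

  type participant struct {
          db        map[string]string         // the real records
          tentative map[int]map[string]string // TID -> pending writes
  }

  func newParticipant() *participant {
          return &participant{
                  db:        map[string]string{},
                  tentative: map[int]map[string]string{},
          }
  }

  // write buffers a transaction's update instead of applying it.
  func (p *participant) write(tid int, key, val string) {
          if p.tentative[tid] == nil {
                  p.tentative[tid] = map[string]string{}
          }
          p.tentative[tid][key] = val
  }

  // prepare votes yes only after the tentative values are safely
  // on disk, so they survive a crash and reboot.
  func (p *participant) prepare(tid int) bool {
          return logToDisk(tid, p.tentative[tid])
  }

  // commit copies the tentative values into the real records.
  func (p *participant) commit(tid int) {
          for k, v := range p.tentative[tid] {
                  p.db[k] = v
          }
          delete(p.tentative, tid) // locks would be released here too
  }

  // abort simply discards the tentative values.
  func (p *participant) abort(tid int) {
          delete(p.tentative, tid)
  }

  // logToDisk stands in for a real write-ahead log append.
  func logToDisk(tid int, writes map[string]string) bool { return true }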
Q: How does serializability relate to linearizability?

A: They are similar notions, arising from different communities. Both
require the final outcome to be the same as that of some serial
execution. Serializability usually refers to entire transactions,
each involving multiple operations. Linearizability often refers to
simple reads and writes. It's also the case that linearizability
requires the equivalent serial execution to match the real-time order
of the actual execution, while serializability usually does not.

Q: Why do logs appear so often in the designs we look at?

A: A log is a good way to capture the serial order that the system
has chosen for transactions, so that, for example, all replicas
perform the transactions in the same order, or a server considers
transactions in the same order after a crash+reboot as it did before
the crash.

Many distributed systems keep multiple operations in flight (often
called a window or pipeline of operations). Often the fate of
operations is not known until some time after they are received. A
log is a good way to keep track of such pending operations. Raft is
an example of this arrangement: a follower receives a stream of
commands from the leader, but doesn't hear that they are committed
until later, and must have a place to store commands until they are
committed.

A log is an efficient way to write data to hard disk or SSD, since
both media are much faster at sequential writes (i.e., appends to the
log) than at random writes.

A log is a convenient way for crash-recovery software to see how far
the system got before it crashed, and whether the last transactions
have a complete record in the log and thus can safely be replayed.
That is, a log is a convenient way to implement crash-recoverable
atomic transactions, via write-ahead logging.

Q: Are there structures other than logs that would work as well?

A: There's nothing as general-purpose as logs.

You can record order by storing data in some other way (e.g., a
b-tree) and storing sequence numbers with the data (Frangipani does
this for meta-data, in addition to using logs).

You wouldn't have to worry about performance if you used a persistent
storage system that was as fast for random updates as for sequential
ones, for example battery-backed RAM. However, such systems are often
more expensive and less robust than hard drives or SSDs.

For the write-ahead property, you could store a mini-log for each
data record. However, it might then be time-consuming for the
crash-recovery software to find the full set of incomplete mini-logs.

A different way to get crash-recoverable atomic operations is to
prepare an entire new data structure in fresh storage, and then use a
single committing write to substitute it for the original data
structure. This makes the most sense with tree-shaped data
structures. The NetApp WAFL file system uses that idea:

https://atg.netapp.com/wp-content/uploads/2000/01/file-system-design.pdf

This arrangement may make it hard to support concurrent transactions.

Q: What is the Lock Manager?

A: The software module that implements acquire() and release(). It
may also implement deadlock detection by constructing a waits-for
graph as users of the module acquire locks using acquire(). In some
cases the module is a separate service running on a machine (e.g.,
the lock service in Frangipani).
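To illustrate the waits-for graph just mentioned, here is a sketch of
how a lock manager might look for a deadlock. The map representation
of the graph is an assumption made for illustration: an edge t -> u
means transaction t is waiting for a lock that u holds.

  package lockmgr

  // hasCycle reports whether the waits-for graph contains a cycle,
  // i.e. whether some set of transactions are deadlocked. It does a
  // depth-first search looking for a "back edge" to a transaction
  // that is still on the current search path.
  func hasCycle(waitsFor map[int][]int) bool {
          const (
                  unvisited = iota // zero value: not yet reached
                  inProgress       // on the current DFS path
                  done             // fully explored, cycle-free
          )
          state := map[int]int{}
          var visit func(t int) bool
          visit = func(t int) bool {
                  state[t] = inProgress
                  for _, u := range waitsFor[t] {
                          if state[u] == inProgress {
                                  return true // back edge: deadlock
                          }
                          if state[u] == unvisited && visit(u) {
                                  return true
                          }
                  }
                  state[t] = done
                  return false
          }
          for t := range waitsFor {
                  if state[t] == unvisited && visit(t) {
                          return true
                  }
          }
          return false
  }

A lock manager that finds a cycle would pick one transaction on the
cycle and abort it, as described in the deadlock answer above.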
Q: Why does Section 9.5.3 of the reading say that two-phase locking
forbids this sequence?

  T1: READ X
  T2: WRITE Y
  T1: WRITE Y

A: Perhaps the text means that T2 goes on to do other unrelated
things before committing. In that case two-phase locking would force
T1 to wait until T2 had completely finished. However, in fact it
would be correct (serializable) for T1 to commit before T2 finished.
That is, there are correct executions that two-phase locking forbids.