Distributed Transactions FAQ

Q: How does this material fit into 6.5840?

A: When data is distributed over many computers, it's common for a
single operation to need to read and/or modify multiple data items,
perhaps stored on different computers. How such multi-step operations
interact with concurrent operations on the same data, and what
happens if a crash occurs in the middle of such an operation, are
usually critical questions for the system's robustness and ease of
programming. The gold standard for good behavior is transactions,
often provided by database systems. Transactions are usually
implemented with two-phase locking and logging; distributed
transactions usually add two-phase commit. Today's reading from the
6.033 textbook explains those ideas.

Most of the storage systems we've looked at so far provide operations
like put() and get() that involve only single records. In contrast,
transactions usually refer to atomic multi-record operations (e.g.,
bank transfers). When the records involved in a transaction are
stored in different places (e.g., in a sharded storage system), then
we're talking about distributed transactions, for which two-phase
commit is helpful. These ideas will show up in some of the upcoming
papers we read (e.g., Spanner and FaRM).

Q: Why is it so important for transactions to be atomic?

A: "Transaction" means that the steps inside the transaction occur
atomically with respect to failures and other transactions. Atomic
here means "all or none". Transactions are a feature provided by some
storage systems to make programming easier.

A situation where transactions are helpful is bank transfers. If the
bank wants to transfer $100 from Alice's account to Bob's account, it
would be awkward if a crash left Alice debited by $100 but Bob *not*
credited by $100. If your storage system supports transactions, you
can write something like

  BEGIN TRANSACTION
    decrease Alice's balance by 100;
    increase Bob's balance by 100;
  END TRANSACTION

and the transaction system will make sure the transaction is atomic:
either both happen, or neither, even if there's a failure; and no
other transaction will observe the intermediate situation where only
one balance has been modified.

Q: Could one use Raft instead of two-phase commit?

A: Two-phase commit and Raft solve different problems.

Two-phase commit causes different computers to do *different* things
(e.g., Alice's bank debits Alice, Bob's bank credits Bob), and causes
them *all* to do their thing, or none of them. Two-phase commit
systems are typically not available (cannot make progress) if some
participants can't be reached, since they have to wait for all
participating computers to perform their part of the transaction.

Raft causes a majority of the peers to all do the *same* thing (so
they remain replicas). It's OK for Raft to wait only for a majority,
since the peers are replicas, and therefore we can make the system
available in the face of failures.

Q: What is the difference between two-phase locking and two-phase
commit?

A: They are largely unrelated; it just happens that both have
"two-phase" in their names. 2PL is a scheme for acquiring locks on
the records used in a transaction; it is useful in both
non-distributed and distributed settings. 2PC is a scheme for
executing a transaction across multiple machines, where each machine
holds some of the records used in the transaction.
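To make the two phases of commit concrete, here is a minimal sketch,
in Go, of a coordinator (TC) driving the bank-transfer example. The
Participant interface, runTransaction, and the TID argument are all
invented for illustration; a real TC must also handle timeouts,
retransmission, and crash recovery.

  package twopc

  // Participant is a hypothetical RPC stub for one worker.
  type Participant interface {
          // Phase 1: ask the participant to get ready (acquire
          // locks, log tentative updates) and to vote yes or no.
          Prepare(tid int) (ok bool)
          // Phase 2: tell the participant the TC's decision.
          Commit(tid int)
          Abort(tid int)
  }

  // runTransaction reports whether transaction tid committed.
  func runTransaction(tid int, parts []Participant) bool {
          // Phase 1: collect votes from every participant.
          for _, p := range parts {
                  if !p.Prepare(tid) {
                          // Any "no" vote forces an abort.
                          for _, q := range parts {
                                  q.Abort(tid)
                          }
                          return false
                  }
          }
          // All voted yes: commit.
          for _, p := range parts {
                  p.Commit(tid)
          }
          return true
  }

Note that a real TC would have to write its commit/abort decision to
durable storage before sending any phase-2 message, so that after a
reboot it can re-send the same decision.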
Q: In two-phase commit, why would a worker send an abort message,
rather than a PREPARED message?

A: The reason we care most about is that the participant crashed and
rebooted after it did some of its work for the transaction but before
it received the prepare message; during the crash it will have lost
the record of the tentative updates it made and the locks it
acquired, so it cannot complete the transaction. Another possibility
(depending on how the database works) is that the worker detected a
violated constraint on the data (e.g., the transaction tried to write
a record with a duplicate key in a table that requires unique keys).
Another possibility is that the worker is involved in a deadlock, and
must abort to break the deadlock.

Q: Can two-phase locking generate deadlock?

A: Yes. If two transactions both use records R1 and R2, but in
opposite orders, they will each acquire one of the locks, and then
deadlock trying to get the other lock. Databases detect these
deadlocks and break them. A database can detect deadlock by timing
out lock acquisition, or by finding cycles in the waits-for graph
among transactions. A deadlock can be broken by aborting one of the
participating transactions.

Q: Why does it matter whether locks are held until after a
transaction commits or aborts?

A: If transactions release locks before they commit, it can be hard
to avoid certain non-serializable executions due to aborts or
crashes. In this example, suppose T1 releases the lock on x after it
updates x, but before it commits or aborts:

  T1:            T2:
  BEGIN
  x = x + 1
                 BEGIN
                 y = x
                 END
  END

It can't be legal for y to end up greater than x. Yet if T1 releases
its lock on x, then T2 acquires the lock, writes y, and commits, but
then T1 aborts or the system crashes and cannot complete T1, we will
end up with y greater than x. It's to avoid having to cope with the
above that people use the "strong strict" variant of 2PL, which only
releases locks after a commit or abort.

Q: What is the point of the two-phase locking rule that says a
transaction isn't allowed to acquire any locks after the first time
that it releases a lock?

A: Acquiring after releasing can lead to non-serializable executions.

  T1:            T2:
  x = x + 1
                 z = x + y
  y = y + 1

Suppose x and y start out as zero, and both transactions execute and
successfully commit. The only final values of z that are allowed by
serializability are zero and 2 (corresponding to the orders T2;T1 and
T1;T2). But if T1 releases its lock on x before acquiring the lock on
y and modifying y, T2 could completely execute and commit while T1 is
between its two statements, giving z a value of 1, which is not
legal. If T1 keeps its lock on x while using y, as two-phase locking
demands, this problem is avoided.
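The locking discipline in the last two answers might look roughly
like the sketch below: a transaction locks each record before first
touching it, and releases nothing until commit or abort. The
lockTable and txn types are invented for illustration; a real lock
manager also needs lock modes, queues, and deadlock handling.

  package tx

  import "sync"

  type lockTable struct {
          mu    sync.Mutex
          locks map[string]*sync.Mutex // one exclusive lock per record
  }

  func newLockTable() *lockTable {
          return &lockTable{locks: map[string]*sync.Mutex{}}
  }

  func (lt *lockTable) lockFor(key string) *sync.Mutex {
          lt.mu.Lock()
          defer lt.mu.Unlock()
          if lt.locks[key] == nil {
                  lt.locks[key] = &sync.Mutex{}
          }
          return lt.locks[key]
  }

  type txn struct {
          lt   *lockTable
          held map[string]*sync.Mutex // all locks this txn holds
  }

  func begin(lt *lockTable) *txn {
          return &txn{lt: lt, held: map[string]*sync.Mutex{}}
  }

  // acquire locks a record before the transaction first uses it.
  // Under 2PL this must never be called after a release.
  func (t *txn) acquire(key string) {
          if t.held[key] != nil {
                  return // already hold this record's lock
          }
          l := t.lt.lockFor(key)
          l.Lock()
          t.held[key] = l
  }

  // finish runs only at commit or abort; strong strict 2PL releases
  // all locks here and nowhere else.
  func (t *txn) finish() {
          for _, l := range t.held {
                  l.Unlock()
          }
          t.held = nil
  }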
Q: How does two-phase commit solve the dilemma of the two generals
(or the Byzantine Generals' Problem)?

A: It doesn't. Two-phase commit doesn't encounter the two generals
problem, and thus doesn't need to solve it. One difference is that,
in two-phase commit, there's just one entity making the decision (the
TC), and it can't disagree with itself. Whereas in the two generals
problem, there are two independent deciders who have trouble
communicating, so there's room for disagreement. Another difference
is that two-phase commit has no real-time requirement (nothing like
the requirement that the generals agree by dawn); it's OK for workers
to wait for as long as needed to receive the TC's decision.

Q: Are the locks exclusive, or can they allow multiple readers to
have simultaneous access?

A: By default, "lock" in 6.5840 refers to an exclusive lock. But
there are databases that can grant locking access to a record to
either multiple readers or a single writer. Some care has to be taken
when a transaction reads a record and then writes it, since the lock
will initially be a read lock and must then be upgraded to a write
lock. There's also increased opportunity for deadlock in some
situations: if two transactions simultaneously want to increment the
same record, they might deadlock when upgrading a read lock to a
write lock on that record, whereas if locks are always exclusive,
they won't deadlock.

Q: How should one decide between pessimistic and optimistic
concurrency control?

A: If your transactions conflict a lot (they use the same records,
and one or more of them writes), then locking is better. Locking
causes conflicting transactions to wait, whereas most OCC systems
deal with conflict by aborting; aborts (really the consequent
retries) are expensive. If your transactions rarely conflict, then
OCC is preferable to locking: OCC doesn't spend CPU time
acquiring/releasing locks and, as long as conflicts are rare, OCC
rarely aborts. The "validation" phase of OCC systems often uses
locks, but they are usually held for shorter periods of time than the
locks in pessimistic designs.

Q: What should two-phase commit workers do if the transaction
coordinator crashes?

A: If a worker has told the coordinator that it is ready to commit,
then the worker cannot later change its mind. The reason is that the
coordinator may (before it crashed) have told other workers to
commit. So the worker has to wait (with locks held) for the
coordinator to reboot and send (or re-send) its decision. Waiting
indefinitely with locks held is a real problem, since the locks can
force a growing set of other transactions to block as well. So people
tend to avoid two-phase commit, or they try to make coordinators
reliable. For example, Google's Spanner replicates coordinators (and
all other servers) using Paxos.

Q: Why don't people use three-phase commit, which allows workers to
commit or abort even if the coordinator crashes?

A: Three-phase commit only works if the network is reliable, or if
workers can reliably distinguish between the coordinator being dead
and the network not delivering packets. For example, three-phase
commit won't work correctly if there's a network partition. In most
practical networks, it's not possible to distinguish a dead computer
from a network failure.

Q: Can there be more than one transaction active? How do
participants know which transaction a message refers to?

A: There can be many concurrent transactions, managed by many TCs. A
TC assigns a unique transaction ID (TID) to each transaction. Every
message includes the TID of the relevant transaction. TCs and
participants tag entries in their tables with the TID, so that (for
example) when a COMMIT message arrives at a participant, it knows
which tentative records to make permanent, and which locks to
release.

Q: How does a two-phase commit system undo modifications if a
transaction has to abort?

A: Each participant performs modifications on temporary copies of the
records. If the participant answers "yes" to the TC's prepare
message, it must first save the temporary record values to its log on
disk, so that it can find them if it crashes and restarts. If the TC
decides to commit, the participant must copy the temporary values to
the real database records; if the TC decides to abort, the
participant must discard the temporary records.
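The tentative-record bookkeeping in the previous answer, keyed by
TID, might look roughly like the following sketch. All names here
(participant, logToDisk, and so on) are invented, and durable logging
is reduced to a stub; a real participant would also persist the fact
that a TID is prepared, and would hold and release locks.

  package twopc

  type participant struct {
          db        map[string]string         // the real records
          tentative map[int]map[string]string // TID -> pending writes
  }

  func newParticipant() *participant {
          return &participant{
                  db:        map[string]string{},
                  tentative: map[int]map[string]string{},
          }
  }

  // write buffers a transaction's update instead of applying it.
  func (p *participant) write(tid int, key, val string) {
          if p.tentative[tid] == nil {
                  p.tentative[tid] = map[string]string{}
          }
          p.tentative[tid][key] = val
  }

  // prepare votes yes only after the tentative values are safely
  // on disk, so they survive a crash and reboot.
  func (p *participant) prepare(tid int) bool {
          return logToDisk(tid, p.tentative[tid])
  }

  // commit copies the tentative values into the real records.
  func (p *participant) commit(tid int) {
          for k, v := range p.tentative[tid] {
                  p.db[k] = v
          }
          delete(p.tentative, tid) // locks would be released here too
  }

  // abort simply discards the tentative values.
  func (p *participant) abort(tid int) {
          delete(p.tentative, tid)
  }

  // logToDisk stands in for a real write-ahead log append.
  func logToDisk(tid int, writes map[string]string) bool { return true }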
Q: How does serializability relate to linearizability?

A: They are similar notions, arising from different communities. Both
require the final outcome to be the same as that of some serial
execution. Serializability usually refers to entire transactions,
each involving multiple operations. Linearizability often refers to
simple reads and writes. It's also the case that linearizability
requires the equivalent serial execution to match the real-time order
of the actual execution, while serializability usually does not.

Q: Why do logs appear so often in the designs we look at?

A: A log is a good way to capture the serial order that the system
has chosen for transactions, so that, for example, all replicas
perform the transactions in the same order, or a server considers
transactions in the same order after a crash+reboot as it did before
the crash.

Many distributed systems keep multiple operations in flight (often
called a window or pipeline of operations). Often the fate of
operations is not known until some time after they are received. A
log is a good way to keep track of such pending operations. Raft is
an example of this arrangement: a follower receives a stream of
commands from the leader, but doesn't hear that they are committed
until later, and must have a place to store commands until they are
committed.

A log is an efficient way to write data to hard disk or SSD, since
both media are much faster at sequential writes (i.e., appends to the
log) than at random writes.

A log is a convenient way for crash-recovery software to see how far
the system got before it crashed, and whether the last transactions
have a complete record in the log and thus can safely be replayed.
That is, a log is a convenient way to implement crash-recoverable
atomic transactions, via write-ahead logging.

Q: Are there structures other than logs that would work as well?

A: There's nothing as general-purpose as logs.

You can record order by storing data in some other way (e.g., a
b-tree) and storing sequence numbers with the data (Frangipani does
this for meta-data, in addition to using logs).

You wouldn't have to worry about performance if you used a persistent
storage system that was as fast for random updates as for sequential
ones, for example battery-backed RAM. However, such systems are often
more expensive and less robust than hard drives or SSDs.

For the write-ahead property, you could store a mini-log for each
data record. However, it might then be time-consuming for the
crash-recovery software to find the full set of incomplete mini-logs.

A different way to get crash-recoverable atomic operations is to
prepare an entire new data structure in fresh storage, and then use a
single committing write to substitute it for the original data
structure. This makes the most sense with tree-shaped data
structures. The NetApp WAFL file system uses that idea:

https://atg.netapp.com/wp-content/uploads/2000/01/file-system-design.pdf

This arrangement may make it hard to support concurrent transactions.

Q: What is the Lock Manager?

A: The software module that implements acquire() and release(). It
may also implement deadlock detection by constructing a waits-for
graph as users of the module acquire locks using acquire(). In some
cases the module is a separate service running on a machine (e.g.,
the lock service in Frangipani).
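To illustrate the waits-for graph just mentioned, here is a sketch of
how a lock manager might look for a deadlock. The map representation
of the graph is an assumption made for illustration: an edge t -> u
means transaction t is waiting for a lock that u holds.

  package lockmgr

  // hasCycle reports whether the waits-for graph contains a cycle,
  // i.e. whether some set of transactions are deadlocked. It does a
  // depth-first search looking for a "back edge" to a transaction
  // that is still on the current search path.
  func hasCycle(waitsFor map[int][]int) bool {
          const (
                  unvisited = iota // zero value: not yet reached
                  inProgress       // on the current DFS path
                  done             // fully explored, cycle-free
          )
          state := map[int]int{}
          var visit func(t int) bool
          visit = func(t int) bool {
                  state[t] = inProgress
                  for _, u := range waitsFor[t] {
                          if state[u] == inProgress {
                                  return true // back edge: deadlock
                          }
                          if state[u] == unvisited && visit(u) {
                                  return true
                          }
                  }
                  state[t] = done
                  return false
          }
          for t := range waitsFor {
                  if state[t] == unvisited && visit(t) {
                          return true
                  }
          }
          return false
  }

A lock manager that finds a cycle would pick one transaction on the
cycle and abort it, as described in the deadlock answer above.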
Q: Why does Section 9.5.3 of the reading say that two-phase locking
forbids this sequence?

  T1: READ X
  T2: WRITE Y
  T1: WRITE Y

A: Perhaps the text means that T2 goes on to do other unrelated
things before committing. In that case two-phase locking would force
T1 to wait until T2 had completely finished. However, in fact it
would be correct (serializable) for T1 to commit before T2 finished.
That is, there are correct executions that two-phase locking forbids.