6.5840 2024 Lecture 12: Distributed Transactions

Topics:
  distributed transactions = concurrency control + atomic commit

where are we in the course?
  so far, mostly distribution for fault tolerance
    multiple servers trying to look like one reliable server
  now, many servers for performance
    split the data up (shard) over multiple servers, for parallelism
  fine as long as clients use data items one at a time
  what if an application operation involves records in different shards?
    failures? atomicity?
  important problem, we'll see a number of approaches

examples
  [diagram: clients, servers, data sharded by key]
  client application actions often involve multiple reads and writes
    bank transfer: debit and credit
    install bi-directional links in a social graph
    insert new record, add to index
  we'd like to hide interleaving and failure from application writers

this is an old problem in databases
  the traditional solution: transactions
  programmer marks beginning/end of sequences of code as transactions
  the system automatically provides good behavior

example transactions
  x and y are bank balances -- records in database tables
  both start out as $10
  T1 and T2 are transactions
    T1: transfer $1 from y to x
    T2: audit, to find the total amount of money in the bank
  T1:                T2:
  BEGIN-X            BEGIN-X
    add(x, 1)          tmp1 = get(x)
    add(y, -1)         tmp2 = get(y)
  END-X                print tmp1, tmp2
                     END-X
  the "END-X" indicates that the transaction would like to commit
    as we'll see, the commit may succeed, or it may fail

what is correct behavior for a transaction?
  usually called "ACID"
    Atomic -- all writes or none, despite failures
    Consistent -- obeys application-specific invariants
    Isolated -- no interference between xactions -- serializable
    Durable -- committed writes are permanent

ACID transactions are magic!
  programmer writes straightforward serial code
  system automatically adds correct locking!
  system automatically adds fault tolerance!
  of course we need to implement this magic

a note on trends
  some storage systems provide transactions, some don't
  some applications benefit a lot from transactions, some don't
  SQL databases provide transactions
    but transactions are slow, particularly for sharded data
  so for a while simple key/value stores gained popularity
    just put and get on single records
  but transactions are coming back

today: ACID for distributed transactions
  with data sharded over multiple servers

What does serializable mean?
  you execute some concurrent transactions, which yield results
    "results" means both output and changes in the DB
  the results are serializable if:
    there exists a serial execution order of the transactions
    that yields the same results as the actual execution
  (serial means one at a time; wait for one to finish before starting the next)
  (this definition should remind you of linearizability)

You can test whether an execution's result is serializable by
looking for a serial order that yields the same results.
  for our example, the possible serial orders are
    T1; T2
    T2; T1
  so the correct (serializable) results are:
    T1; T2 : x=11 y=9 "11,9"
    T2; T1 : x=11 y=9 "10,10"
  the results for the two differ; either is OK
  no other result is OK for a serializable system
  the implementation might have executed T1 and T2 in parallel
    but it must still yield results as if in a serial order
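As a concrete illustration (not part of the original notes), here is a
minimal Go sketch that tests whether an observed result of T1/T2 is
serializable by trying both serial orders; the names (result, run,
serializable) are made up for this example:

  package main

  import "fmt"

  // result is the final DB state plus what T2 printed.
  type result struct {
    x, y     int
    t2x, t2y int
  }

  // run executes T1 and T2 one at a time, in the given serial order.
  func run(order []string) result {
    db := map[string]int{"x": 10, "y": 10}
    var r result
    for _, t := range order {
      switch t {
      case "T1": // transfer $1 from y to x
        db["x"] += 1
        db["y"] -= 1
      case "T2": // audit: read both balances
        r.t2x, r.t2y = db["x"], db["y"]
      }
    }
    r.x, r.y = db["x"], db["y"]
    return r
  }

  // serializable reports whether an observed result matches some serial order.
  func serializable(observed result) bool {
    for _, order := range [][]string{{"T1", "T2"}, {"T2", "T1"}} {
      if run(order) == observed {
        return true
      }
    }
    return false
  }

  func main() {
    // the bad interleaving: T1 runs entirely between T2's two get()s,
    // so T2 prints 10,9 even though the final balances are x=11 y=9
    bad := result{x: 11, y: 9, t2x: 10, t2y: 9}
    fmt.Println(serializable(bad)) // false
  }

Trying all serial orders only works for tiny examples like this one;
real checkers and proofs have to be cleverer.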
what if T1's operations run entirely between T2's two get()s?
  would the result be serializable?
  T2 would print 10,9
  but 10,9 is not one of the two serializable results!

what if T2 runs entirely between T1's two add()s?
  T2 would print 11,10
  but 11,10 is not one of the two serializable results!

what if x's server does the increment but y's server can't?
  x=11 y=10 is not one of the serializable results!

a transaction can "abort" if something goes wrong
  an abort un-does any modifications
  the transaction might voluntarily abort,
    e.g. if the account doesn't exist, or y's balance is <= 0
  the system may force an abort, e.g. to break a locking deadlock
  server failure can result in abort
  result of abort should be as if entire xaction never executed!!!
    must un-do, or not apply, all updates
  the application might (or might not) try the transaction again

distributed transaction implementations have two main components:
  concurrency control (to provide isolation/serializability)
  atomic commit (to provide atomicity despite failure)

first, concurrency control
  the goal: isolated/serializable execution of concurrent transactions
  for now, on a single DB server (not distributed)

two classes of concurrency control for transactions:
  pessimistic:
    lock records before use
    conflicts cause delays (waiting for locks)
  optimistic:
    use records without locking
    commit checks if reads/writes were serializable
    conflict causes abort+retry
    called Optimistic Concurrency Control (OCC)
  pessimistic is faster if conflicts are frequent
  optimistic is faster if conflicts are rare
  today: pessimistic concurrency control
  in a few weeks: optimistic concurrency control (FaRM)

"Two-phase locking" is one way to implement serializability
  each database record has a lock
  2PL rules:
    a transaction must acquire a record's lock before using it
    a transaction must hold its locks until *after* commit or abort

2PL for our example
  suppose T1 and T2 start at the same time
  the transaction system automatically acquires locks as needed
  so the first of T1/T2 to use x will get the lock
  the other waits until the first completely finishes (reaches END-X)
  this prohibits the non-serializable interleavings

details:
  an executing transaction acquires locks as needed, at the first use
    add() and get() implicitly acquire the record's lock
    END-X() releases all locks
  all locks are exclusive (for this discussion, no reader/writer locks)
  the full name is "strong strict two-phase locking"
  related to thread locking (e.g. Go's Mutex), but:
    programmer must supply BEGIN-X/END-X
    DB locks automatically, on first use of each record
    DB unlocks automatically, at transaction end
    DB may automatically abort to resolve deadlock

Why hold locks until after commit/abort?
  why not release as soon as done with the record?
  example of a resulting problem:
    suppose T2 releases x's lock after get(x)
    T1 could then execute between T2's get()s
    T2 would print 10,9
    not a serializable execution: neither T1;T2 nor T2;T1

Two-phase locking can produce deadlock, e.g.
  T1        T2
  get(x)    get(y)
  get(y)    get(x)
  The system must detect deadlock (cycles? timeout?) and abort a transaction
  Or the programmer must avoid deadlock by thinking about lock order

The Question:
  describe a situation where Two-Phase Locking yields higher
  performance than Simple Locking.
  Simple locking: lock *every* record before *any* use;
  release after abort/commit.
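To make the 2PL mechanism concrete, here is a minimal Go sketch of a
per-record exclusive lock table that acquires a record's lock at first
use and releases everything only at commit/abort; the names (LockTable,
Acquire, ReleaseAll) are made up for this sketch, not code from the
notes or the labs:

  package locktable

  import "sync"

  type TID int64

  // LockTable holds one exclusive lock per record key.
  type LockTable struct {
    mu    sync.Mutex
    cond  *sync.Cond
    owner map[string]TID   // record key -> transaction holding its lock
    held  map[TID][]string // transaction -> keys it has locked
  }

  func New() *LockTable {
    lt := &LockTable{owner: map[string]TID{}, held: map[TID][]string{}}
    lt.cond = sync.NewCond(&lt.mu)
    return lt
  }

  // Acquire is the implicit lock on first use of a record: get()/add()
  // would call it before touching the record. It blocks until the lock
  // is free.
  func (lt *LockTable) Acquire(tid TID, key string) {
    lt.mu.Lock()
    defer lt.mu.Unlock()
    for {
      owner, locked := lt.owner[key]
      if !locked {
        lt.owner[key] = tid
        lt.held[tid] = append(lt.held[tid], key)
        return
      }
      if owner == tid {
        return // this transaction already holds the lock
      }
      lt.cond.Wait() // conflict: wait until the owner commits or aborts
    }
  }

  // ReleaseAll is called at END-X, only after the transaction has
  // committed or aborted; that's the "strong strict" part of 2PL.
  func (lt *LockTable) ReleaseAll(tid TID) {
    lt.mu.Lock()
    defer lt.mu.Unlock()
    for _, key := range lt.held[tid] {
      delete(lt.owner, key)
    }
    delete(lt.held, tid)
    lt.cond.Broadcast()
  }

A real system would also detect deadlock, e.g. by finding cycles in the
waits-for graph or by timing out, and abort one of the waiting
transactions.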
Next topic: distributed transactions versus failures

how can distributed transactions cope with failures?
  suppose, for our example, x and y are on different storage servers
  suppose x's server adds 1, but y's crashes before subtracting?
  or x's server adds 1, but y's realizes the account doesn't exist?
  or x and y both can do their part, but aren't sure if the other will?
  it's a hard problem!

We want "atomic commit":
  A bunch of computers are cooperating on some task
  Each computer has a different role
  We want atomicity: all execute, or none execute
  Challenges: failures, performance

We're going to look at a protocol called "two-phase commit"
  Used by distributed databases for multi-server transactions

The setting
  Data is sharded among multiple servers
  Each transaction runs on a "transaction coordinator" (TC)
  For each read/write, the TC sends an RPC to the relevant shard server
    Each shard server is a "participant"
    Each participant manages locks for its shard of the data
  There may be many concurrent transactions, many TCs
    The TC assigns a unique transaction ID (TID) to each transaction
    Every message, every piece of xaction state is tagged with the TID
    To avoid confusion

Two-phase commit without failures:
  [time diagram: A, TC, B]
  TC sends put(), get(), &c RPCs to A, B
    A and B lock records (and wait if already locked).
    Modifications are tentative, on a copy, only installed if the
    transaction commits.
  TC gets to the end of the transaction.
  TC sends PREPARE messages to A and B.
  If A is able to commit, A responds YES.
    then A is in the "prepared" state.
    otherwise, A responds NO.
  Same for B.
  If both A and B said YES, TC sends COMMIT messages to A and B.
  If either A or B said NO, TC sends ABORT messages.
  A/B commit if they get a COMMIT message from the TC.
    I.e. they copy tentative records to the real DB.
    And release the transaction's locks on their records.
  A/B acknowledge the COMMIT message.

Why is this correct so far?
  Neither A nor B can commit unless they both agreed.

What if B crashes and restarts?
  If B crashed *before* it got the PREPARE, it can forget the transaction.
  If B crashed *after* sending YES, B must remember (despite the crash)!
    Because A might have received a COMMIT and committed.
    So B must be able to commit (or not) even after a reboot.
  Thus participants must write persistent (on-disk) state:
    B must save xaction state on disk before saying YES,
      including locks and modified data.
    If B reboots, and its disk says YES but it didn't receive a COMMIT
      from the TC, B must ask the TC, or wait for the TC to re-send.
      And meanwhile, B must continue to hold the transaction's locks.

What if the TC crashes and restarts?
  If the TC might have sent COMMIT before the crash, the TC must remember!
    Since one participant may already have committed.
  Thus the TC must write COMMIT to disk before sending COMMIT msgs.
  And repeat COMMIT if it crashes and reboots,
    or if a participant asks (i.e. if A/B didn't get the COMMIT msg).
  Participants must filter out duplicate COMMITs (using the TID).

What if the TC never gets a YES/NO from B?
  Perhaps B crashed and didn't recover; perhaps the network is broken.
  The TC can time out, and abort (since it has not sent any COMMIT msgs).
  Good: allows servers to release locks.

What if B times out or crashes while waiting for a PREPARE from the TC?
  B has not yet responded to the PREPARE, so the TC can't have decided commit
  so B can unilaterally abort, and release locks,
  and respond NO to a future PREPARE

What if B replied YES to the PREPARE, but doesn't receive a COMMIT or ABORT?
  Can B unilaterally decide to abort?
    No! The TC might have gotten YES from both,
    and sent out COMMIT to A, but crashed before sending to B.
    So then A would commit and B would abort: incorrect.
  B can't unilaterally commit, either:
    A might have voted NO.
  So: if B voted YES, it must "block": wait for the TC's decision.

When can the TC completely forget about a committed transaction?
  If it sees an acknowledgement from every participant for the COMMIT.
  Then no participant will ever need to ask again.
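Here is a minimal Go sketch of the TC's decision logic described above.
The Participant and Log interfaces are assumed stand-ins for RPC stubs
and the TC's on-disk log, not code from the notes or the labs; a real TC
would send the PREPAREs in parallel, time out on missing votes, and
re-send COMMIT until every participant acknowledges:

  package tc

  type TID int64

  // Participant stands in for an RPC stub to one shard server.
  type Participant interface {
    Prepare(tid TID) bool // true means YES, false means NO
    Commit(tid TID)       // install tentative writes, release locks, ack
    Abort(tid TID)        // discard tentative writes, release locks
  }

  // Log stands in for the TC's persistent (on-disk) state.
  type Log interface {
    RecordCommit(tid TID) // must be durable before any COMMIT is sent
  }

  // Decide runs the commit protocol once the transaction reaches END-X.
  // It returns true if the transaction committed.
  func Decide(tid TID, log Log, parts []Participant) bool {
    // Phase 1: ask every participant to prepare.
    for _, p := range parts {
      if !p.Prepare(tid) {
        // Someone voted NO (or we timed out waiting): abort.
        // Safe because no COMMIT message has been sent yet.
        for _, q := range parts {
          q.Abort(tid)
        }
        return false
      }
    }
    // Everyone voted YES: the decision is commit. Record it on disk
    // *before* telling anyone, so a rebooted TC can re-send COMMIT.
    log.RecordCommit(tid)
    // Phase 2: tell every participant. The TC must keep re-sending
    // COMMIT until all acknowledge (retries omitted here), and only
    // then may it forget the transaction.
    for _, p := range parts {
      p.Commit(tid)
    }
    return true
  }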
When can a participant completely forget about a committed transaction?
  After it acknowledges the TC's COMMIT message.
  If it gets another COMMIT, and has no record of the transaction,
    it must have already committed and forgotten, and can acknowledge (again).
  (a participant-side sketch of these rules appears at the end of these notes)

Two-phase commit perspective
  Used in sharded DBs when a transaction uses data on multiple shards
  But it has a bad reputation:
    slow: multiple rounds of messages
    slow: disk writes
    locks are held over the prepare/commit exchanges; blocks other xactions
    TC crash can cause indefinite blocking, WITH LOCKS HELD
  Thus usually used only in a single small domain
    E.g. not between banks, not between airlines, not over the wide area
  Faster distributed transactions are an active research area.

Raft and two-phase commit solve different problems!
  Use Raft to get high availability by replicating
    i.e. to be able to operate when some servers are crashed
    the servers all do the *same* thing
  Use 2PC when each participant does something different
    And *all* of them must do their part
  2PC does not help availability
    since all servers must be up to get anything done
  Raft does not ensure that all servers do something
    since only a majority have to be alive

What if you want high availability *and* atomic commit?
  Here's one plan.
  [diagram]
  The TC and servers should each be replicated with Raft
  Run two-phase commit among the replicated services
  Then you can tolerate failures and still make progress
  Spanner uses this arrangement (we'll read about it in a few weeks)

---

http://dbmsmusings.blogspot.com/2019/01/its-time-to-move-on-from-two-phase.html
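Finally, as promised above, a minimal participant-side Go sketch that
follows the same rules: persist state before voting YES, block after
voting YES, and acknowledge a duplicate COMMIT for a forgotten
transaction. The helper methods (canCommit, persistPrepared, install,
etc.) are illustrative placeholders, not real lab code:

  package participant

  import "sync"

  type TID int64

  type Participant struct {
    mu       sync.Mutex
    prepared map[TID]bool // TIDs we voted YES for (persisted before replying)
    // tentative writes, the lock table, and the on-disk log are omitted
  }

  func NewParticipant() *Participant {
    return &Participant{prepared: map[TID]bool{}}
  }

  // Prepare: vote YES only if we can commit, and persist that fact first,
  // because after saying YES we may no longer unilaterally abort.
  func (p *Participant) Prepare(tid TID) bool {
    p.mu.Lock()
    defer p.mu.Unlock()
    if !p.canCommit(tid) {
      p.discard(tid) // voting NO: free to abort and release locks now
      return false
    }
    p.persistPrepared(tid) // locks + modified data to disk before replying
    p.prepared[tid] = true
    return true // from here on, block until the TC decides
  }

  // Commit installs the tentative writes, releases locks, and acknowledges.
  // A COMMIT for a transaction we have no record of must be a duplicate
  // for one we already committed and forgot, so just ack again.
  func (p *Participant) Commit(tid TID) (ack bool) {
    p.mu.Lock()
    defer p.mu.Unlock()
    if !p.prepared[tid] {
      return true // forgotten: already committed; re-acknowledge
    }
    p.install(tid) // copy tentative records to the real DB
    p.releaseLocks(tid)
    delete(p.prepared, tid) // safe to forget once the TC has our ack
    return true
  }

  // Placeholders for the participant's local storage and locking logic.
  func (p *Participant) canCommit(tid TID) bool  { return true }
  func (p *Participant) discard(tid TID)         {}
  func (p *Participant) persistPrepared(tid TID) {}
  func (p *Participant) install(tid TID)         {}
  func (p *Participant) releaseLocks(tid TID)    {}

Note that after voting YES this participant never unilaterally aborts;
it simply waits (with locks held) for the TC's decision.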