6.824 2013 Lecture 15: Two-Phase Commmit

Topics:
  crash recovery for distributed systems
  distributed transactions -- atomic
  two-phase commit

Crash recovery:
  A bunch of computers are cooperating on some task
    Each has a different role e.g. bank transfer
  Some crash, or network fails
  How to ensure atomicity: all finish, or all undo?

Example:
  calendar system, each user has a calendar
  want to schedule meetings with multiple participants
  one server hold calendars of users A-M, another server holds N-Z
  [diagram: client, two servers]
  sched(u1, u2, t):
    begin_transaction
      if reserve(u1, t) and reserve(u2, t):
        print "yes"
      else
        print "no"
    end_transaction
  the reserve() calls are RPCs to the two calendar servers
  either might return true, false, or fail
  We want both to reserve, or both not to reserve.
  What if 2nd reserve() returns "false"?
  What if 2nd server crashes during reserve()?
  What if client fails after 1st reserve()?
  Need an "Atomic Commit Protocol"

Idea: tentative changes, later commit or undo (abort)
  reserve_handler(u, t):
    lock u[t]
    if u[t] is free:
      u[t] = taken -- IN TEMPORARY STORAGE
      return true -- DON'T RELEASE LOCK
    return false
  commit_handler():
    copy temporary u[t] to real u[t]
    unlock u[t]
  abort_handler():
    discard temporary u[t]
    unlock u[t]

Idea: single entity decides whether to commit
  let's call it the Transaction Coordinator (TC)
  [time diagram: client, TC, A, B]
  client sends RPCs to A, B
  on end_transaction, client sends "go" to TC
  TC/A/B execute atomic commit protocol...
  TC reports "commit" or "abort" to client

We want two properties for atomic commit protocol:
  TC, A, and B each have a notion of committing
  Correctness:
    if any commit, none abort
    if any abort, none commit
  Performance:
    (since doing nothing is correct...)
    if no failures, and A and B can commit, then commit.
    if failures, come to some conclusion ASAP.

We're going to develop a protocol called "two-phase commit"

Correctness first.

Correct atomic commit protocol:
  [time diagram: client, TC, A, B]
  client sends reserve() RPCs to A, B
  client sends "go" to TC
  TC sends "prepare" messages to A and B.
  A and B respond, saying whether they're willing to commit.
  If both say "yes", TC sends "commit" messages.
  If either says "no", TC sends "abort" messages.
  A/B "decide to commit" if they get a commit message.
    I.e. they actually modify the user's calendar.

Why is this correct?
  Neither can commit unless they both agreed.

What about performance?
  Crashes or network problems can still prevent completion.
  Bad b/c A/B hold locks until commit/abort decided.
  
Two problems:
  Timeout. I'm up, but I don't recv a msg I expect.
    Host crash? Network broken?
  Reboot. I crashed, and I'm rebooting, and I need to clean up.
    
Let's deal with timeout first.
  Where do hosts wait for messages?
  1) TC waits for yes/no.
  2) A and B wait for commit/abort.

Termination protocol summary:
  TC t/o for yes/no -> abort
  B t/o from TC, B voted no -> abort
  B t/o from TC, B voted yes -> block
    could ask A if it knows

TC timeout while waiting for yes/no from A/B.
  TC has not sent any "commit" messages.
  So TC can safely abort, and send "abort" messages.

A/B timeout while waiting for commit/abort from TC.
  Let's talk about just B (A is symmetric).
  If B voted "no", it can unilaterally abort.
  So what if B voted "yes"?
  Can B unilaterally decide to abort?
    No! TC might have gotten "yes" from both,
    and sent out "commit" to A, but crashed before sending to B.
    So then A would commit and B would abort: incorrect.
  B can't unilaterally commit, either:
    A might have voted "no".

If B voted "yes", it must wait for TC decision.

What if B crashes and restarts?
  If B sent "yes" before crash, B must remember!
    --- this is today's question
  Can't change to "no" (and thus abort) after restart
  Since TC may have seen previous yes and told A to commit
  Thus:
    B must remember on disk before saying "ok", as well as modified data.
    B reboots, disk says "yes" but no "commit", must ask TC.

What if TC crashes and restarts?
  If TC might have sent "commit" or "abort" before crash, TC must remember!
    And repeat that if anyone asks (i.e. if A/B/client didn't get msg).
    Thus TC must write "commit" to disk before sending commit msgs.
  Can't change mind since A/B/client have already acted.

This protocol is called "two-phase commit".
  What properties does it have?
  * All hosts that decide reach the same decision.
  * No commit unless everyone says "yes".
  * TC failure can make servers block w/ locks held until repair.

2PC perspective
  Useful when different services cooperate (not so useful for replication)
  Useful within a single smallish distributed system (many msgs)
  Bad that TC crash can cause indefinite blocking
  Bad that locks can't be released until very end (even if no crash)
  Rarely used on a large scale
    E.g. not between banks, not between airlines
    Nobody wants to give up control: can't release locks
    Usually separate commit, reconcile later if someone else aborted

Case study: Argus

Argus's big ideas:
  Language support for distributed programs
    Very cool: language abstracts away ugly parts of distrib systems
  Easy persistence ("stable"):
    Ordinary variables, no need for DB
    Automatic crash recovery
  Easy concurrency:
    Transactions
    Implicit locking of language objects
  Easy RPC model:
    RPCs are transactions w/ two-phase commit
    RPC failure largely hidden

Picture
  "guardian" is like an RPC server
    has state (variables) and handlers
  "handler" is an RPC handler
    reads and writes local variables
  "action" is a distributed atomic transaction
  action on A
    RPC to B
      B sends RPC to C
    RPC to D
  A finishes action
    prepare msgs to B, C, D
    commit msgs to B, C, D

The style is to send RPC to where the data is
  Not to read the data to where it's needed
  Argus is not a storage system

Look at bank example
  page 309 (and 306): bank transfer

Points to notice
  stable keyword (programmer never writes to disk &c)
  atomic keyword (programmer almost never locks/unlocks)
  enter topaction (in transfer)
  coenter (in transfer)
  RPCs are hidden (e.g. f.withdraw())
  RPC error handling hidden (just aborts)

what if deposit account doesn't exist?
  but f.withdraw(from) has already been called?
  how to un-do?

what's the guardian state when withdraw() handler returns?
  lock, temporary version, just in memory

crash just before replying to withdraw()?

crash just after replying to withdraw()?

what happens at withdraw guardian when it receives prepare msg?

what happens at withdraw guardian when it receives a commit?
  an abort?

crash before receiving the commit?

crash while guardian is processing the commit?
  i.e. while it is writing new versions to its disk?

what if an audit runs during a transfer?
  how does the audit not see the tentative new balances?

if a guardian crashes, what happens to its locks?

why not let other readers see the old version?
  i.e. why block readers until new value committed?
  e.g. transfer not committed, let audit see old balance?
  arguably it's just as if the other reader ran a little earlier!
  b/c this sequence would yield the wrong audit result:
    A=20, B=30 (so audit should return 50)
    transfer(A, B, 10) starts
    audit starts, reads A=20
    transfer commits, now A=10 B=40
    audit reads B=40
    audit returns 60

why hold read locks until end of transaction?
  e.g. audit keeps all locks until it finally commits
    so no transfer can run -- might take a long time
  why not release read locks when done reading?
  example (much like above):
    A=20 B=30
    audit reads A=20
    transfer(A, B, 10)
    audit reads B=40
    audit total is 60, not 50, oops

above locking policy is called "two-phase locking"
  xaction acquires and holds locks
  only release after commit

two-phase locking yields "serializability"
  results are as if xactions ran one at a time
    audit, then transfer; or
    transfer, then audit
  an easy-to-understand semantics for transactions
    though sometimes slow

is Argus's implicit locking the right thing?
  very convenient!
  don't have to worry about forgetting to lock!
  but easy to get deadlock
    seed = seed + 1 (if done concurrently, lock upgrade deadlock)
    or transfer(A, B) || transfer(B, A)
  but easy to over-lock
    a lookup() fn might scan a list, lock every element, not needed
  but easy to under-lock
    a lookup() fn might find no entry; nothing to lock indicating it was missing

is transactions / RPC / 2PC / persistent a good design point?
  do we want to write our web site in Argus?
  programmability pro:
    very easy to get nice fault tolerance semantics
    language support probably important
  programmability con:
    like a database but no SQL joins &c
  performance con:
    lots of msgs and disk writes: Table II
    2PC and 2PL hold locks for a while, block if failure
    automatic locking probably over-locks

contrast with Spanner/FDS/PNUTS/&c
  Argus a bit lower-level than Spanner &c
    e.g. Argus gives you tools to build replication, but isn't replicated
  storage-centric (Spanner &c)
    good if the main issue is preserving state
    good if many clients need to get at each piece of data
  compute-centric (Argus)
    good if qualitatively different components (so RPC is best you can do)
    good if it makes sense to manipulate state locally