6.824 Lecture 11: two-phase commmit

General topic: coordinating servers in a distributed system.
  Last lecture:  how to recover a single machine from a power failure
  Today:
    two-phase commit
    case study: Argus

general problem: distributed atomic actions
  operation has sub-parts on multiple computers
  how to ensure they all do/don't do it?

Example:
  calendar system
  want to schedule meetings with multiple participants
  one server hold calendars of users A-M, another server holds N-Z
  [diagram: two servers]
  sched(u1, u2, t):
    begin_transaction
      if reserve(u1, t) and reserve(u2, t):
        return true
      else
        return false
  the two reserve calls are really RPCs to the two calendar servers
  either might say yes, no, or fail
  We want both to reserve, or both not to reserve.
  We don't want only one to reserve
    We'd rather have nothing happen (this is important).
  Need an "Atomic Commit Protocol"

Bigger transaction processing picture: we want two kinds of atomicity:
  Serializability, as if in some order, locking
  Recoverability, all or nothing, no partial results
  Today is mostly about recoverability
    I'll just assume something serializes transactions
    Perhaps a lock server forcing one-at-a-time transactions
    Perhaps there's only one source of transactions

Atomic Commit is hard
  "I'll commit if you commit"
  [time diagram]
  What if I don't hear back from you?
  What if you say "ok" -- did I get your message?
  Neither party can finally decide

Straw Man:
  invent a Transaction Coordinator as single authoritative entity
    only one entity decides: the TC
  four entities: client, TC, server A, server B
  client sends "go" to TC
  TC sends "reserve" to A
  TC sends "reserve" to B
  TC reports "ok" to client.
  [time diagram, no acks]

How can this go wrong?
  Maybe one of the users isn't free at that time.
  Maybe one of the users does not exist.
  Maybe the network link to B is broken.
  Maybe A or B has crashed.
  Maybe TC crashes between sending the messages.

We want two properties:
  TC, A, and B each have a notion of committing
  Correctness:
    if one commits, no one aborts.
    if one aborts, no one commits.
  Performance:
    if no failures, and A and B can commit, then commit.
    if failures, come to some conclusion ASAP.

Let's do correctness first.

Correct atomic commit protocol:
  [time diagram: client, TC, A, B]
  client sends request to TC
  TC sends "prepare" messages to A and B.
  A and B respond, saying whether they're willing to commit.
  If both say "yes", TC sends "commit" messages.
  If either says "no", TC sends "abort" messages.
  A/B "decide to commit" if they get a commit message.
    I.e. they actually modify the user's calendar.

Why is this now correct?
  Neither can commit unless they both agreed.

What about performance?
  Crashes or message losses can still prevent completion.
  Bad b/c servers likely holding locks until commit/abort decided.
  
We have two types of problems:
  Timeout. I'm up, but I don't recv a msg I expect.
    Maybe the other host crashed. Maybe the network is broken.
    We cannot usually tell the difference, so must be correct in either case.
  Reboot. I crashed, and I'm rebooting, and I need to clean up.
    
Let's deal with timeout first.
  Where do hosts wait for messages?
  1) TC waits for yes/no.
  2) A and B wait for commit/abort.

Handling TC timeout waiting for yes/no.
  TC has not sent any "commit" messages.
  So TC can safely abort, and send "abort" messages.
  We've preserved correctness, but sacrificed performance.
  Because (maybe) A and B were both prepared to commit,
    but we lost one of the messages,
    we could have committed, but TC didn't know it.
  TC is being conservative.

Handling A/B timeout while waiting for commit/abort.
  Let's talk about just B (A is symmetric).
  If B voted "no", it can unilaterally abort.
  So what if B voted "yes"?
  Can B unilaterally decide to abort?
    No! TC might have gotten "yes" from both,
    and sent out "commit" to A, but crashed before sending to B.
    So then A would commit and B would abort: incorrect.
  B can't unilaterally commit, either:
    A might have voted "no".

B could just wait forever until it gets commit/abort from TC.
  But you can do better than that:

Termination protocol for B if it voted "yes":
  B sends "status" request message to A.
    Asks if A knows whether transaction should commit.
  If B doesn't hear reply from A: no decision, wait for TC.
  If A received "commit" or "abort" from TC: B decides the same way.
    Can't disagree with TC...
  If A hasn't voted yes/no yet: B and A both abort.
    TC can't have decided "commit", so it will eventually hear from A or B.
  If A voted "no": B and A both abort.
    TC can't have decided "commit".
  If A voted "yes": no decision possible!
    TC might have decided "commit" and replied to client.
    Or TC might have timed out and aborted.
    A and B must wait for the TC.

What have we achieved?
  Can resolve some timeout situations w/ guaranteed correctness.
  But sometimes A and B must block.
  Due to TC failure, or failure of TC's network connection.

How to handle crash/reboot?
  Cannot back out of a commit if already decided.
  TC crashes just after deciding "commit".
  A/B crash just after sending "yes".
  Big trouble if they reboot and don't remember saying "yes"!
    They might change their minds after the reboot.
    Or *even after everybody re-starts* they may not be able to decide!
  If all nodes know state before crash, you can use termination protocol.
    But also talk to TC, which may know that it committed.

How do you know what state you were in before the crash?
  Assume non-volatile memory, perhaps a disk.
  But do you write disk, then send "yes" message? Or "commit" if TC?
    Or send message and then update the state on the disk?
  We cannot send the message before writing the disk:
    Since then we might change our mind after the reboot.
    I.e. B might have voted "yes", then reboot, then "no".
  Does it work to write the disk before sending the message?
    For TC w.r.t. "commit", and A/B w.r.t. "yes".

Recovery protocol w/ non-volatile state:
  If you're the TC, and there's no "commit" on disk, abort.
    Because no commit on disk -> you didn't send any "commit" messages.
  If you're A/B, and no "yes" on disk, abort.
    No "yes" on disk -> didn't vote yes -> nobody could have committed.
  If you're A/B, and there is a "yes" on your disk:
    Ordinary termination protocol, might block.
  If everyone has rebooted and is reachable, you can decide.
    Based just on whether TC has "commit" on its disk.

This protocol is called "two-phase commit".
  What properties does it have?
  1. All hosts that decide reach the same decision.
  2. No commit unless everyone says "yes".
  3. No failures and all "yes", then commit.
  4. If failures, then repair, wait long enough, then some decision.

2PC perspective
  Useful within a single system
  Bad that crash can cause indefinite blocking
  Bad that locks can't be released until very end
  Rarely used on a large scale
    E.g. not between banks, not between airlines
    Nobody wants to give up control: can't release locks
    Usually separate commit, reconcile later if someone else aborted

What if you have transactions and two-phase commit, how can you build systems
  Case study: Argus

Argus's big ideas:
  Language support for distributed programs
  RPC
  Implicit locking
  Distributed Actions -- 2PC
  Nested actions
  Persistent data
  Crash recovery

Picture
  "guardian" is like an RPC server
  "handler" is an RPC handler
  "action" is a distributed atomic transaction
  top-level action on A
    RPC to B
      B sends RPC to C
    RPC to D
  top-level commit
    prepare msgs to B, C, D
    committ msgs to B, C, D

Benefits best seen by studying mail example
  page 398: walk through send mail to many users example

what if one user doesn't exist?
  but mail has already been delivered to some other users
  how to un-do?

why do nested transactions make sense?
  are they just a feature? or necessary?
  does xactions + RPC => nested transactions?
  required by modularity? 
    you don't know who is calling, but you want to be atomic
  required in case RPC fails but larger xaction wants to proceed?

what happens when a guardian that did sub-action receives final commit?

what if that guardian crashes before it gets final commit/abort decision?

what happens if a guardian crashes while it is committing?
  has finalized some of the action's updates, not others

what happens if an RPC times out -- no reply?
  maybe target got the RPC
  is there ambiguity about whether it executed?

what if user reads mail during send-to-many transaction?
  how does user not see tentative new mail?
  does reading user block? where?

read_mail also deletes it
  what if new mail arrives just before msg_list$trim()?
  will it be deleted but not returned?
  why not? what lock protects? where is the lock acquired? released?

what if a user is on the list twice?
  locks are held until end of top-level xaction
  deadlock?

how do locks work?
  is there a separate lock server?

is implicit locking the right thing?
  very convenient!
  don't have to worry about forgetting to lock!
  does it ever lock too much? too long?
    easy to get deadlock
    all locks held to final commit
  does it ever not lock something you'd like to lock?
  hard case:
    if user not in table
      add user to table
    lock on whole table is correct but low concurrency
    lock per entry is not really correct (no entry to lock!)

performance?
  could the email example handle a high load?

is it good to build locks/rpc/2PC/persistence/recovery into language?