6.824 Lecture 11: two-phase commmit General topic: coordinating servers in a distributed system. Last lecture: how to recover a single machine from a power failure Today: two-phase commit case study: Argus general problem: distributed atomic actions operation has sub-parts on multiple computers how to ensure they all do/don't do it? Example: calendar system want to schedule meetings with multiple participants one server hold calendars of users A-M, another server holds N-Z [diagram: two servers] sched(u1, u2, t): begin_transaction if reserve(u1, t) and reserve(u2, t): return true else return false the two reserve calls are really RPCs to the two calendar servers either might say yes, no, or fail We want both to reserve, or both not to reserve. We don't want only one to reserve We'd rather have nothing happen (this is important). Need an "Atomic Commit Protocol" Bigger transaction processing picture: we want two kinds of atomicity: Serializability, as if in some order, locking Recoverability, all or nothing, no partial results Today is mostly about recoverability I'll just assume something serializes transactions Perhaps a lock server forcing one-at-a-time transactions Perhaps there's only one source of transactions Atomic Commit is hard "I'll commit if you commit" [time diagram] What if I don't hear back from you? What if you say "ok" -- did I get your message? Neither party can finally decide Straw Man: invent a Transaction Coordinator as single authoritative entity only one entity decides: the TC four entities: client, TC, server A, server B client sends "go" to TC TC sends "reserve" to A TC sends "reserve" to B TC reports "ok" to client. [time diagram, no acks] How can this go wrong? Maybe one of the users isn't free at that time. Maybe one of the users does not exist. Maybe the network link to B is broken. Maybe A or B has crashed. Maybe TC crashes between sending the messages. We want two properties: TC, A, and B each have a notion of committing Correctness: if one commits, no one aborts. if one aborts, no one commits. Performance: if no failures, and A and B can commit, then commit. if failures, come to some conclusion ASAP. Let's do correctness first. Correct atomic commit protocol: [time diagram: client, TC, A, B] client sends request to TC TC sends "prepare" messages to A and B. A and B respond, saying whether they're willing to commit. If both say "yes", TC sends "commit" messages. If either says "no", TC sends "abort" messages. A/B "decide to commit" if they get a commit message. I.e. they actually modify the user's calendar. Why is this now correct? Neither can commit unless they both agreed. What about performance? Crashes or message losses can still prevent completion. Bad b/c servers likely holding locks until commit/abort decided. We have two types of problems: Timeout. I'm up, but I don't recv a msg I expect. Maybe the other host crashed. Maybe the network is broken. We cannot usually tell the difference, so must be correct in either case. Reboot. I crashed, and I'm rebooting, and I need to clean up. Let's deal with timeout first. Where do hosts wait for messages? 1) TC waits for yes/no. 2) A and B wait for commit/abort. Handling TC timeout waiting for yes/no. TC has not sent any "commit" messages. So TC can safely abort, and send "abort" messages. We've preserved correctness, but sacrificed performance. Because (maybe) A and B were both prepared to commit, but we lost one of the messages, we could have committed, but TC didn't know it. TC is being conservative. Handling A/B timeout while waiting for commit/abort. Let's talk about just B (A is symmetric). If B voted "no", it can unilaterally abort. So what if B voted "yes"? Can B unilaterally decide to abort? No! TC might have gotten "yes" from both, and sent out "commit" to A, but crashed before sending to B. So then A would commit and B would abort: incorrect. B can't unilaterally commit, either: A might have voted "no". B could just wait forever until it gets commit/abort from TC. But you can do better than that: Termination protocol for B if it voted "yes": B sends "status" request message to A. Asks if A knows whether transaction should commit. If B doesn't hear reply from A: no decision, wait for TC. If A received "commit" or "abort" from TC: B decides the same way. Can't disagree with TC... If A hasn't voted yes/no yet: B and A both abort. TC can't have decided "commit", so it will eventually hear from A or B. If A voted "no": B and A both abort. TC can't have decided "commit". If A voted "yes": no decision possible! TC might have decided "commit" and replied to client. Or TC might have timed out and aborted. A and B must wait for the TC. What have we achieved? Can resolve some timeout situations w/ guaranteed correctness. But sometimes A and B must block. Due to TC failure, or failure of TC's network connection. How to handle crash/reboot? Cannot back out of a commit if already decided. TC crashes just after deciding "commit". A/B crash just after sending "yes". Big trouble if they reboot and don't remember saying "yes"! They might change their minds after the reboot. Or *even after everybody re-starts* they may not be able to decide! If all nodes know state before crash, you can use termination protocol. But also talk to TC, which may know that it committed. How do you know what state you were in before the crash? Assume non-volatile memory, perhaps a disk. But do you write disk, then send "yes" message? Or "commit" if TC? Or send message and then update the state on the disk? We cannot send the message before writing the disk: Since then we might change our mind after the reboot. I.e. B might have voted "yes", then reboot, then "no". Does it work to write the disk before sending the message? For TC w.r.t. "commit", and A/B w.r.t. "yes". Recovery protocol w/ non-volatile state: If you're the TC, and there's no "commit" on disk, abort. Because no commit on disk -> you didn't send any "commit" messages. If you're A/B, and no "yes" on disk, abort. No "yes" on disk -> didn't vote yes -> nobody could have committed. If you're A/B, and there is a "yes" on your disk: Ordinary termination protocol, might block. If everyone has rebooted and is reachable, you can decide. Based just on whether TC has "commit" on its disk. This protocol is called "two-phase commit". What properties does it have? 1. All hosts that decide reach the same decision. 2. No commit unless everyone says "yes". 3. No failures and all "yes", then commit. 4. If failures, then repair, wait long enough, then some decision. 2PC perspective Useful within a single system Bad that crash can cause indefinite blocking Bad that locks can't be released until very end Rarely used on a large scale E.g. not between banks, not between airlines Nobody wants to give up control: can't release locks Usually separate commit, reconcile later if someone else aborted What if you have transactions and two-phase commit, how can you build systems Case study: Argus Argus's big ideas: Language support for distributed programs RPC Implicit locking Distributed Actions -- 2PC Nested actions Persistent data Crash recovery Picture "guardian" is like an RPC server "handler" is an RPC handler "action" is a distributed atomic transaction top-level action on A RPC to B B sends RPC to C RPC to D top-level commit prepare msgs to B, C, D committ msgs to B, C, D Benefits best seen by studying mail example page 398: walk through send mail to many users example what if one user doesn't exist? but mail has already been delivered to some other users how to un-do? why do nested transactions make sense? are they just a feature? or necessary? does xactions + RPC => nested transactions? required by modularity? you don't know who is calling, but you want to be atomic required in case RPC fails but larger xaction wants to proceed? what happens when a guardian that did sub-action receives final commit? what if that guardian crashes before it gets final commit/abort decision? what happens if a guardian crashes while it is committing? has finalized some of the action's updates, not others what happens if an RPC times out -- no reply? maybe target got the RPC is there ambiguity about whether it executed? what if user reads mail during send-to-many transaction? how does user not see tentative new mail? does reading user block? where? read_mail also deletes it what if new mail arrives just before msg_list$trim()? will it be deleted but not returned? why not? what lock protects? where is the lock acquired? released? what if a user is on the list twice? locks are held until end of top-level xaction deadlock? how do locks work? is there a separate lock server? is implicit locking the right thing? very convenient! don't have to worry about forgetting to lock! does it ever lock too much? too long? easy to get deadlock all locks held to final commit does it ever not lock something you'd like to lock? hard case: if user not in table add user to table lock on whole table is correct but low concurrency lock per entry is not really correct (no entry to lock!) performance? could the email example handle a high load? is it good to build locks/rpc/2PC/persistence/recovery into language?