Recovery Management in QuickSilver
Haskin, Malachi, Sawdon, Chan
TOCS Feb 1988

context
  distributed operating system: lots of workstations and servers
  so applications might be harder to write:
    partial failures may be common
    not like ordinary O/S where everything fails together

solution:
  distributed transactions for *everything*
    not just DB
    also file system, window system, &c
    every RPC part of a transaction
  thesis: this is the key to distributed O/S success

examples of multi-server transactions?
  editor atomically replaces file, no temporary
  finder-like UI move dir of files to a new file server
    atomic move of all files (and UI update)
  upgrade of all system files on a workstation
  parallel make, ensure no weird intermediate state visible on failure
    e.g. partially written .o file
  one file server may contact another
    directory server -> storage server?
    directory server -> child directory server?

why might this be complex?
  what if transaction involves multiple servers?
    two-phase commit!
  what if server issues RPCs to serve client's RPC?
    need to include 2nd server in transaction
  some servers can't implement 2pc
    can't keep persistent state across reboots
    i.e. may vote "yes" but then crash, then cannot complete.

Messages:
  original RPC: adds target server/TM to transaction tree
    seems to be first point at which server hears of transaction
    presumably doesn't actually do any work in RPC handler...
  vote: down the tree to try to commit.
  vote-commit: up the tree after a vote (and recoverably ready to commit...)
  vote-abort: up the tree
  end-commit: down the tree after all children vote-commit
  end-abort: down the tree
  abort: sent down the tree.

When can failure occur?
  participant before it sends vote-commit
    well, everyone must wait, or abort?
  participant while "prepared" (after sending vote-commit, before end-commit)
    must ask around for final status
    must have saved prepared state persistently, so it can do or un-do
  what if main TM (client) crashes?
    page 96 suggests that transaction blocks until TM restarts

Why is one-phase commit used by volatile servers?
  Example server?
  You only have to know the final outcome: whether to update.
  No need to "prepare" by logging to stable storage.
  I.e. no need to promise to commit by vote-commit.
  Of course you *might* want two-phase with volatile
    since you might want to be able to vote no

Mixtures of one-phase and two-phase?
  Example mixed transaction?
  Yes: one-phase servers see only end-commit message?
    So they don't vote, but they do see commit/abort, to clean up.

are there services that aren't likely to be transactional?
  window server? e.g. "create window" inside a transaction?
  output to user?
  input from user?
  send packets on network?
  read packets from network?

impact on server software?
  what does it have to do differently?

impact on client software?
  what does it have to do differently?

do we think this saves work for application writers?
  i.e. QuickSilver significantly reduces recovery code?
  or encourages good semantics rather than blowing them off?

flexible enough to be widely useful?

how's the performance?
  much overhead above RPC?
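
sketch: vote / end message flow with mixed one-phase and two-phase servers
  A minimal Python sketch of the message flow in these notes, not QuickSilver's
  actual code. Node, run_transaction, two_phase and will_vote_commit are
  hypothetical names invented for illustration. An in-memory list stands in for
  the stable-storage log, and crashes, timeouts and recovery are not modeled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One server/TM in the transaction tree (hypothetical model)."""
    name: str
    two_phase: bool = True          # False = one-phase (volatile) server
    will_vote_commit: bool = True   # assumption: the server's local decision
    children: List["Node"] = field(default_factory=list)
    log: List[str] = field(default_factory=list)  # stand-in for stable storage

    def vote(self) -> bool:
        """'vote' goes down the tree; the return value is the reply going up."""
        child_votes = [c.vote() for c in self.children]  # ask every child
        if not self.two_phase:
            # One-phase server: never prepares and casts no vote of its own;
            # it only relays what its subtree said.
            return all(child_votes)
        if self.will_vote_commit and all(child_votes):
            self.log.append("prepared")  # must be durable before vote-commit
            return True                  # vote-commit
        return False                     # vote-abort

    def end(self, commit: bool) -> None:
        """end-commit / end-abort goes down the tree to every server."""
        self.log.append("end-commit" if commit else "end-abort")
        for c in self.children:
            c.end(commit)


def run_transaction(root: Node) -> bool:
    """Coordinator: collect votes, then broadcast the final outcome."""
    outcome = root.vote()
    root.end(outcome)
    return outcome


if __name__ == "__main__":
    fs = Node("file-server")                      # two-phase, recoverable
    win = Node("window-server", two_phase=False)  # one-phase, volatile
    root = Node("client-TM", children=[fs, win])
    print(run_transaction(root), fs.log, win.log)
```

  The one-phase server's log never contains "prepared": it only learns the
  outcome so it can clean up, which matches the "they don't vote, but they do
  see commit/abort" point above.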