Recovery Management in QuickSilver
Haskin, Malachi, Sawdon, Chan
ACM TOCS, Feb 1988

context
  distributed operating system -- lots of workstations and servers
  so applications might be harder to write:
    partial failures may be common
    not like ordinary O/S where everything fails together
  solution: distributed transactions for *everything*
    not just DB
    also file system, window system, &c
    every RPC part of a transaction
  thesis: this is the key to distributed O/S success

examples of multi-server transactions?
  editor atomically replaces file, no temporary
  Finder-like UI moves a directory of files to a new file server
    atomic move of all files (and UI update)
  upgrade of all system files on a workstation
  parallel make, ensure no weird intermediate state visible on failure
    e.g. a partially written .o file
  one file server may contact another
    directory server -> storage server?
    directory server -> child directory server?

why might this be complex?
  what if transaction involves multiple servers?
    two-phase commit!
  what if server issues RPCs to serve client's RPC?
    need to include 2nd server in transaction
  some servers can't implement 2pc
    can't keep persistent state across reboots
    i.e. may vote "yes" but then crash, and be unable to complete the commit.

Messages:
  original RPC: adds target server/TM to transaction tree
    seems to be first point at which server hears of transaction
    presumably doesn't actually do any work in RPC handler...
  vote: down the tree to try to commit.
  vote-commit: up the tree after a vote (and recoverably ready to commit...)
  vote-abort: up the tree
  end-commit: down the tree after all children vote-commit
  end-abort: down the tree
  abort: sent down the tree.
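
A minimal sketch of the message flow above, as a toy in-memory model rather than
QuickSilver's actual interfaces. The names Node, rpc, vote, end, run and the
will_commit flag are all hypothetical. The separate "abort" message (sent down the
tree before any vote) is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One participant (a server plus its TM) in the transaction tree."""
    name: str
    will_commit: bool = True                        # hypothetical local decision
    children: list = field(default_factory=list)
    outcome: str = "active"
    log: list = field(default_factory=list)         # messages this node received

    def rpc(self, child):
        """An RPC made on behalf of the transaction adds the callee below the caller."""
        self.children.append(child)
        return child

    def vote(self):
        """'vote' goes down the tree; 'vote-commit' or 'vote-abort' comes back up."""
        self.log.append("vote")
        replies = [c.vote() for c in self.children]
        if self.will_commit and all(r == "vote-commit" for r in replies):
            self.outcome = "prepared"               # a real server would log this first
            return "vote-commit"
        return "vote-abort"

    def end(self, decision):
        """'end-commit' or 'end-abort' goes down the tree."""
        self.log.append(decision)
        self.outcome = "committed" if decision == "end-commit" else "aborted"
        for c in self.children:
            c.end(decision)


def run(root):
    """Root of the tree: collect votes, then broadcast the decision."""
    decision = "end-commit" if root.vote() == "vote-commit" else "end-abort"
    root.end(decision)
    return decision
```

With client -> directory server -> storage server built via rpc(), run() returns
"end-commit" only if every node in the tree votes to commit; one vote-abort anywhere
turns the decision into "end-abort" for the whole tree.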

When can failure occur?
  participant before it sends vote-commit
    well, everyone must wait, or abort?
  participant while "prepared" (after sending vote-commit, before end-commit)
    must ask around for final status
    must have saved prepared state persistently, so it can do or un-do
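
A sketch of why the prepared state has to be persistent, assuming a simple
append-only log file. The Participant class, the record layout, and the
ask_outcome callback are hypothetical illustrations, not the paper's log format.

```python
import json
import os
import tempfile


class Participant:
    """Toy participant that forces a 'prepared' record before voting."""

    def __init__(self, log_path):
        self.log_path = log_path

    def _append(self, record):
        # Force the record to disk before replying to anyone.
        with open(self.log_path, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def prepare(self, tid, redo, undo):
        # Save enough to either do or un-do the work, then promise to commit.
        self._append({"tid": tid, "state": "prepared", "redo": redo, "undo": undo})
        return "vote-commit"

    def finish(self, tid, outcome):
        self._append({"tid": tid, "state": outcome})

    def recover(self, ask_outcome):
        """After a restart: find in-doubt transactions and ask around for their fate."""
        prepared, finished = {}, set()
        if os.path.exists(self.log_path):
            with open(self.log_path) as f:
                for line in f:
                    rec = json.loads(line)
                    if rec["state"] == "prepared":
                        prepared[rec["tid"]] = rec
                    else:
                        finished.add(rec["tid"])
        actions = {}
        for tid, rec in prepared.items():
            if tid in finished:
                continue                              # outcome already known locally
            outcome = ask_outcome(tid)                # "committed", "aborted", or None
            if outcome == "committed":
                actions[tid] = rec["redo"]
            elif outcome == "aborted":
                actions[tid] = rec["undo"]
            else:
                actions[tid] = "still-blocked"        # nobody knows yet: keep waiting
        return actions
```

A participant that crashes after prepare() but before finish() finds the
transaction in doubt on restart; recover() then picks redo or undo according to
whatever the rest of the tree reports, or stays blocked if nobody can say.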

what if main TM (client) crashes?
  page 96 suggests that transaction blocks until TM restarts

Why is one-phase commit used by volatile servers?
  Example server?
  You only have to know the final outcome: whether to update.
  No need to "prepare" by logging to stable storage.
    I.e. no need to promise to commit by vote-commit.
  Of course you *might* want two-phase with volatile
    since you might want to be able to vote no

Mixtures of one-phase and two-phase?
  Example mixed transaction?
  Yes: one-phase servers see only end-commit message?
  So they don't vote, but they do see commit/abort, to clean up.
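
One possible reading of the mixed case, as a toy sketch: only two-phase servers
are asked to vote, and every server is told the outcome. The Server class, the
two_phase flag, and commit() are hypothetical names, and the exact messages a
one-phase server sees are an assumption taken from the notes above.

```python
class Server:
    def __init__(self, name, two_phase, vote_yes=True):
        self.name = name
        self.two_phase = two_phase     # False: volatile server, one-phase only
        self.vote_yes = vote_yes
        self.seen = []                 # messages this server received

    def vote(self):
        self.seen.append("vote")
        return "vote-commit" if self.vote_yes else "vote-abort"

    def end(self, decision):
        self.seen.append(decision)     # one-phase servers clean up here


def commit(servers):
    """Collect votes from two-phase servers only, then notify everyone."""
    votes = [s.vote() for s in servers if s.two_phase]
    decision = "end-commit" if all(v == "vote-commit" for v in votes) else "end-abort"
    for s in servers:
        s.end(decision)
    return decision
```

For example, a recoverable file server (two_phase=True) and a window server
(two_phase=False) in one transaction: the file server sees "vote" and then the
decision, while the window server sees only the decision.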

are there services that aren't likely to be transactional?
  window server?
  e.g. "create window" inside a transaction?
  output to user?
  input from user?
  send packets on network?
  read packets from network?

impact on server software?
  what does it have to do differently?

impact on client software?
  what does it have to do differently?

do we think this saves work for application writers?
  e.g. does QuickSilver significantly reduce recovery code?
  or encourage good semantics rather than blowing them off?
  flexible enough to be widely useful?

how's the performance?
  much overhead above RPC?