6.824 2007 Lecture 12: state machine replication

General topic area: Fault tolerance
  last week: how to recover a single machine from a power failure
    durability, but when machine is down service is unavailable
  topic of next few lectures: high availability 
    using multiple computers to provide service so that if one fails
    another one can provide it.
  today's topic: state machine replication

state machine replication
  works for any kind of replicated service: storage or lock server or whatever
  every replica must see same operations in same order
    if deterministic, replicas will end up with same state
  how to handle concurrent client requests to different replicas?
    an easy option: order them by a master
        master-slave organization
  when to respond to a request?
    before slaves perform request?  after?
    master runs two-phase commit.
    respond after master commits
  what if primary fails?
    elect new primary and recovery correctly (next lectures)
  how many replicas?
    assume repair: if one fails, it comes back.
    enough to survive a failure(s) during the time that it takes for a new
     one to start

Case study: Hypervisor-based Fault-tolerance
  Bressoud and Schneider
  SOSP 1995

Why are we reading this paper?
  We're about to look at a bunch of systems that replicate for fault-tolerance
    lab 7 and lab 8 you implement a replicated state machine
  The hypervisor is at one extreme of the spectrum
  Two machines run the exact same instructions
  If one fails, the other just keeps going, no time/data/work is lost
  Totally transparent, runs existing O/S and apps
  Only a factor of two slower
  Seems like an amazingly general/easy/practical solution to fault-tolerance

Where they are coming from
  Banks, telephone exchanges, NASA need fault-tolerant computers
  1980s saw lots of *hardware* fault-tolerant machines
  Assumption: CPU most likely part to fail due to complexity
  So two CPUs connected to bus, single memory, disk, &c system
  All in the same cabinet. Not distributed over the Internet.
  Hypervisor is all s/w! Runs on stock h/w.

What is the modern version of this paper?
  Goal: fault tolerant web application that has its durable state on a NAS
  One option: rewrite web application to use replicated-state machine
    But the Web application programmers might not have the expertise
  One option: apply the replicate state machine on OS level (Linux)
    But OS is a big program with many interfaces---maybe hard to pull off
    In addition, you may not have the source
  Other option: put Web site/Linux on a replicate VM (VMware/Xen, ..)
    Advantage: works with any web application (modula some assumptions).

Basic slow plan
  hypervisor interprets every instruction
  One instruction at the time
  At instruction i:
    at P:
      P asks primary have you executed through i-1
      P waits for response
      Yes: execute instruction i
      No response: backup failed
    at B:
      B waits for message from primary for i-1
      If B receives message: execute instruction i
      If B doesn't receiv message: B becomes primary

Challenges:
  Performance  (use hypervisor and use epochs)
  I/O instructions (P and B shouldn't execute them both)
  Non-determinismistic instructions (I/O, interrupts, clock, ..)
  
Advantages: 
  No OS changes
  No app changes
  No special HW

Why shared disk?
  Disk writes are non deterministic
    one disk may have bad sector and write fails
    => B doesn't execute I/O instructions, only primary
  Simplifies failure recovery
    all state through instruction i-1 is on disk!
  If instruction i is I/O, retry I/O instruction?

I/O Interrupts
  Must be delivered at the same point in instruction stream
  P checks for interrupts at before executing instruction i
   if any interrupts forward to B, and execute them in slot i
  B simulates interrupt by collecting results from P
  P needs to send block data to the secondary
    no problem for programmed I/O (hypervisor can trap on LD)
    puzzle: DMA. how does the hypervisor know?

What about fail-over?
  if primary fails, backup must start doing I/O
  how does backup know the primary has failed?
  what about I/O writes in the last epoch: might or might not have happened.
  what if backup is waiting for an I/O read?
    did the primary even issue the read request?
    did the result come back, but not yet sent to backup?
  if instruction i was I/O instruction
    generate interrupt and set status register of device to failed
    hopefully the O/S will re-try the I/O!  (most do)
  assumptions
    device has to support repeated operations correctly
    O/S device drivers have to know to re-try operations
    OK for disk read and disk write
      some SCSI drivers notice time-outs and re-try
    maybe OK for network (TCP will sort it out)
    not so OK for keyboard, line printer, ATM bill vendor

Performance
  slow to check in with B every instruction
  epochs!
  CPU h/w can be told to interrupt every N instructions
    i.e. at exactly the same point on both machines
  primary hypervisor delays all interrupts until end of epoch
  backup has to wait at the end of its epoch
    maybe primary hasn't reached epoch end, hasn't seen all interrupts
  at each epoch end, primary and backup deliver all interrupts

Performance
  we can always interpret SLOWLY
  real question: can I run o/s and apps on the real hardware?
  will the hardware trap to hypervisor on *every* tricky instruction?
  i.e. any instruction that might be non-deterministic
    OR reveal the existence of the hypervisor?
  time-of-day register
  memory-mapped I/O loads and stores
    need to trap and substitute load values
  HP branch-and-link instruction
  HP TLB replacement
  HP "space registers" (page mapping tables?) writable by users (!)

What if ethernet cable breaks?
  primary still running
  backup will promote itself
  that's part of what the fail-stop assumption assumes away...

What if we want to re-start a failed+repaired primary?

Performance
  epoch boundaries are expensive
  can you do better?
    make hypervisor responsible for all devices directly
    hypervisor downloads its device drivers in OS (like VMware does)