6.824 2012 Lecture 12: state machine replication

General topic area: Fault tolerance
  last week: crash recovery for single machines and distributed systems
    all about continuing after fault is repaired
    but no service in the meantime
  next few weeks: high availability
    how to provide service while some machines are down
    how to survive permanent failures
  today and next week: state machine replication
    replicate data so that it is "always" available
    if one replica fails, use another one
  note: still assuming fail-stop; no malice, no bugs

state machine replication
  works for any kind of replicated service: storage or lock server or whatever
  every replica must see same operations in same order
    if deterministic, replicas will end up with same state
    disaster if replicas diverge
  [client, replicas, client sends stream of arithmetic ops to all]
  how to handle multiple clients sending operations to replicas?
    [two clients sending to all replicas]
    what if replicas see different operation orders?
    easy solution:
      [clients, master, backups]
      all operations to master
      master chooses order
      master sends to backups
      "primary/backup"
    there are other approaches
      quorum/voting systems can avoid master but enforce order
      primary/backup is simpler
  when to respond to a request?
    before backups perform request?  after?
    must be after: can't say "yes" to client but then forget after master fails
  do all replicas have to acknowledge?
    depends on recovery scheme
  what if primary fails?
    one backup takes over as primary
    clients change where they send operations
  how many replicas?
    assume repair or replacement
    enough that P(all fail) during one repair period is low
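
Sketch of the primary/backup ordering rule above, in Python. This is only an illustration: the Replica and Primary classes and the arithmetic ops are made up for the example, and a direct method call stands in for "send to backup and wait for its ACK".

```python
class Replica:
    """A deterministic state machine: a single integer accumulator."""
    def __init__(self):
        self.value = 0
        self.log = []          # operations applied, in order

    def apply(self, op):
        kind, arg = op
        if kind == "add":
            self.value += arg
        elif kind == "mul":
            self.value *= arg
        else:
            raise ValueError(f"unknown op {kind!r}")
        self.log.append(op)
        return self.value


class Primary(Replica):
    def __init__(self, backups):
        super().__init__()
        self.backups = backups

    def handle(self, op):
        # The primary alone chooses the order: the order in which it
        # forwards operations is the order every backup applies them.
        for b in self.backups:
            b.apply(op)        # stands in for "send, then wait for ACK"
        result = self.apply(op)
        return result          # reply only after all backups have the op


def demo():
    backups = [Replica(), Replica()]
    primary = Primary(backups)
    # Two clients' operations arrive interleaved at the primary.
    for op in [("add", 2), ("mul", 5), ("add", 1)]:
        primary.handle(op)
    in_order = [primary.value] + [b.value for b in backups]

    # Without a master, a replica might see the same ops in another order.
    stray = Replica()
    for op in [("mul", 5), ("add", 2), ("add", 1)]:
        stray.apply(op)
    return in_order, stray.value
```

demo() returns ([11, 11, 11], 3): the three replicas that saw the primary's order agree, and the one that saw a different order has diverged.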

what are the big issues that need to be solved?
  what's a good level of abstraction?
    maybe "state" = who holds each lock
    maybe "state" = machine registers and RAM content
  how to decide if primary has failed?
    disaster if network partition causes two primaries
  how to reconstruct primary's state on backup after failure?
    backup may not have seen last few operations
  how to hide change-over from clients?
    clients may need to re-send request
    and thus clients may effectively repeat a request
  how to bring newly joined backup up to speed?
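
One common way to make a re-sent request harmless is to tag each request with an ID and remember the reply. The notes don't say which technique to use, so treat this as an assumption; LockServer, req_id, and the seen table are hypothetical names. Note the seen table would itself have to be part of the replicated state, or the new primary would not recognize the repeat.

```python
class LockServer:
    def __init__(self):
        self.holder = {}   # lock name -> client id
        self.seen = {}     # (client id, request id) -> saved reply

    def acquire(self, client, req_id, lock):
        key = (client, req_id)
        if key in self.seen:
            # A re-sent request: return the original reply, do not re-execute.
            return self.seen[key]
        granted = lock not in self.holder
        if granted:
            self.holder[lock] = client
        self.seen[key] = granted
        return granted


def demo():
    s = LockServer()
    first = s.acquire("A", 1, "x")    # A gets the lock
    resent = s.acquire("A", 1, "x")   # same request again, after a change-over
    other = s.acquire("B", 1, "x")    # a different client is refused
    return first, resent, other
```

Without the seen table, A's repeated acquire would be refused because the lock is already held (by A itself).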

Case study: Hypervisor-based Fault-tolerance
  Bressoud and Schneider
  SOSP 1995

Motivation
  Goal: fault tolerance / availability for any existing s/w
    by running same instructions &c on two computers
  Transparent: any app, any O/S
  Would be magic if it worked well!

Plan 1:
  [simple diagram]
  Two machines
  Identical start state: registers, program, memory contents
  Maybe they will stay identical
  If one fails, the other just keeps going, no time/data/work is lost

What will go wrong with Plan 1?
  external outputs will be duplicated
  must send external inputs to both
  inputs must arrive at the same time at both
  interrupts must arrive at same time at both
  CPUs unlikely to run at exactly the same speeds

How exact must we be about timing of interrupts?
  is it ok if timing differs a little bit?
  what does it mean for timing to be exact?

How are they able to control I/O and interrupt timing?
  they slip a virtual machine / hypervisor under the O/S
  
What is a hypervisor?
  a piece of software
  emulates a real machine precisely
    so an O/S &c can run on a hypervisor
    "guest"
  as if hypervisor simulated each instruction
    and kept simulated registers, memory, &c
  BUT as much as possible runs guest on real h/w!
    can be much faster than interpreting instructions
  Many fascinating details, see VMware papers for x86 story

Hypervisor gives us control
  Suppress output from backup, detect interrupts/input at primary, &c
  Still need a scheme for delivering I/O inputs and interrupts
  From primary to backup, at the right times

But: hard to know when interrupt happened at primary, repeat at backup
  Answer: HP PA-RISC recovery counter
    It forces an interrupt every N instructions
  This allows primary and backup hypervisor to get control 
    at exactly the same point in the code

Epochs
  Primary alternates between running an epoch and end-of-epoch processing; backup does too
  Every epoch is same # of instructions, e.g. 4000 instructions
  Primary during epoch E:
    Execute handlers for interrupts delivered during E-1
    I/O output instructions proceed as normal
    I/O input is from data collected during E-1 interrupts
    Hypervisor hides new interrupts, just records, w/ I/O input
  Primary at end of epoch E:
    Send interrupt info, and I/O input, to backup
    Wait for backup to ACK
    Send "done with E" to backup
  Backup during epoch E:
    Execute handlers for interrupts primary told us of during E-1
    I/O output instructions do nothing
    I/O input instructions consume I/O input data sent by primary
    Ignore interrupts from backup devices
  Backup at end of epoch E:
    Process and buffer interrupt/input msgs from primary
    Wait for "done with E+1" from primary
    Start epoch E+1
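
A toy simulation of the epoch rules above, in Python. Everything here is invented for illustration (the Guest class, its arithmetic, the arrivals list); it is not the paper's code. It only shows the key property: interrupts that arrive during epoch E are buffered and delivered at the E+1 boundary on both machines, so both guests handle them at the same instruction count. The ACK round trip is elided.

```python
EPOCH_LEN = 4000  # instructions per epoch, the notes' example value


class Guest:
    """Toy deterministic guest: state depends on when interrupts are delivered."""
    def __init__(self):
        self.state = 0
        self.executed = 0

    def run(self, n):
        for _ in range(n):
            self.state = (self.state * 31 + 7) % 1_000_003
            self.executed += 1

    def interrupt(self, data):
        # Handler result depends on the current state, so delivery timing matters.
        self.state = (self.state ^ data) % 1_000_003


def run_epoch(guest, pending):
    """Deliver interrupts buffered in the previous epoch, then run one epoch."""
    for data in pending:
        guest.interrupt(data)
    guest.run(EPOCH_LEN)


def simulate(arrivals):
    """arrivals[e] = input data whose interrupts reach the primary during epoch e."""
    primary, backup = Guest(), Guest()
    p_pending = []   # interrupts the primary buffered in the last epoch
    b_pending = []   # what the primary told the backup for its next epoch
    for arrived in arrivals:
        # Primary during epoch e: handle e-1's interrupts, buffer new ones.
        run_epoch(primary, p_pending)
        p_pending = list(arrived)
        # Primary at end of epoch e: send buffered info, then "done with e".
        msg = list(p_pending)
        # Only now may the backup run epoch e, using what it learned for e-1.
        run_epoch(backup, b_pending)
        b_pending = msg
    return primary, backup
```

For example, simulate([[3, 9], [], [42], [1]]) returns two guests with equal state and equal instruction counts.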

Note backup lags primary by one epoch
  Backup doesn't start 7 until primary is done with 7
  So backup knows all input/interrupts for 7 when it starts 7
  And backup knows if primary completed 7 (i.e. didn't crash)
    
What if primary fails during epoch E?
  Backup finishes with E-1
  Backup times out waiting for "done with E"

Can backup just switch to being primary for epoch E?
  No: primary may have asked devices for I/O during E-1
  Backup did not, so will expect interrupts that will never happen

What if primary did the following I/O during "failover" epoch E:
  read input -- will promoted backup see that input?
  wrote output -- will promoted backup repeat that output?
  failed just before output -- will backup write?
  got an interrupt + didn't forward to backup -- will backup miss it?
  asked its dev h/w to interrupt -- will backup h/w interrupt later?

Failover strategy:
  Backup times out waiting for "done with E"
  Backup executes epoch E as backup (no output, I/O suppressed)
  Switches to primary for E+1
  Hypervisor generates "uncertain interrupts" for I/O started <= E
    Since backup didn't actually generate the requests
    O/S drivers know they should repeat I/O if they get an uncertain interrupt
    Re-issue disk read/write, re-send packet, &c
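
A sketch of the failover bookkeeping, in Python. BackupHypervisor and its methods are hypothetical names, and the assumption that the backup hypervisor keeps a list of suppressed requests is mine; the notes only say the hypervisor knows which I/O is outstanding because it saw the request instructions.

```python
class BackupHypervisor:
    def __init__(self):
        self.role = "backup"
        self.outstanding = []   # I/O requests the guest issued but we suppressed
        self.device_log = []    # what actually reaches the device

    def guest_io_request(self, req):
        if self.role == "primary":
            self.device_log.append(req)    # really perform the I/O
        else:
            self.outstanding.append(req)   # suppress, but remember it

    def completion_from_primary(self, req):
        # Primary forwarded the completion interrupt, so the request is settled.
        self.outstanding.remove(req)

    def primary_timed_out(self, finish_epoch):
        finish_epoch()                     # run epoch E still as backup
        self.role = "primary"              # primary from E+1 on
        uncertain, self.outstanding = self.outstanding, []
        for req in uncertain:
            self.driver_uncertain_interrupt(req)

    def driver_uncertain_interrupt(self, req):
        # Guest driver's response to an uncertain interrupt: repeat the I/O.
        self.guest_io_request(req)


def demo():
    hv = BackupHypervisor()
    hv.guest_io_request("write blk 7")
    hv.guest_io_request("write blk 9")
    hv.completion_from_primary("write blk 7")
    # Primary goes silent; the guest issues one more request while we finish E.
    hv.primary_timed_out(lambda: hv.guest_io_request("send pkt 3"))
    return hv.role, hv.device_log
```

demo() returns ("primary", ["write blk 9", "send pkt 3"]): only the requests with no known completion are re-issued, and they may duplicate work the failed primary already did.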

Details of I/O during failover:
  Primary read input:
    Backup will see the input too, since delivered w/ intr at end of E-1
  Primary wrote output:
    Backup will *not* write during E
    Uncertain interrupt -> backup will repeat during E+x
    So disk/net/&c will see output twice!
  Primary didn't get as far as writing output:
    Uncertain interrupt -> backup will write during E+x
  Primary got an interrupt:
    Backup will *not* get an interrupt during E
    BUT hypervisor knows interrupt is pending, b/c saw request instr during E-x
    Hypervisor will issue uncertain interrupt at end of E

(all this is based on footnote 5 on page 4, and IO1 and P8 on page 5)

Will the outside world tolerate repeated output?
  In general, maybe not
    E.g. controlling an instrument in a scientific experiment
  Network protocols:
    No problem, must handle duplicated packets anyway
  Shared disk:
    No problem, repeated write of same data 
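
Why these two cases tolerate repeats, as a small Python sketch. The sequence-number scheme is an assumption standing in for whatever the real protocol does (TCP does something similar), and the dict is a stand-in for a disk.

```python
def deliver(packets):
    """Receiver side: drop packets whose sequence number was already seen."""
    seen, out = set(), []
    for seq, payload in packets:
        if seq not in seen:
            seen.add(seq)
            out.append(payload)
    return out


def write_all(disk, writes):
    """Writing the same data to the same block twice leaves the same disk."""
    for block, data in writes:
        disk[block] = data
    return disk
```

deliver([(1, "a"), (2, "b"), (2, "b"), (3, "c")]) yields ["a", "b", "c"], and write_all({}, [(7, "x"), (7, "x")]) yields {7: "x"}. An instrument that acts on every command it receives has no such protection.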

Do O/S drivers on primary talk directly to h/w?
  I suspect O/S talks to device simulated/mediated by hypervisor
  Hypervisor needs to be able to collect input read from
    devices and send it to backup
  So hypervisor must be intervening in (and understanding) access to dev h/w
  Also hypervisor must initialize dev h/w when backup takes over

Figure 1 shows a shared disk
  Rather than separate disk per machine with identical content
  Only the primary writes the disk
  Why does shared disk make the design easier?
  Disk writes are non-deterministic
    one disk may have bad sector and write fails
  Simplifies recovery of failed machine
    Don't need to copy whole disk from survivor

Won't disk failures ruin the fault-tolerance story?
  Can be fixed separately, w/ RAID

Does the backup read the disk directly?
  No: why not?
  Suppose primary reads and then immediately writes.
  Backup might read the new data, should see old.
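
The read-then-write race, as a few lines of Python. The dict is a stand-in for the shared disk and the variable names are made up; the point is only that the backup runs later than the primary, so a direct read can see a later write.

```python
def run(backup_reads_disk):
    disk = {0: "old"}
    # Primary, during its epoch: read block 0, then immediately overwrite it.
    primary_saw = disk[0]
    forwarded = primary_saw     # hypervisor sends the input to the backup
    disk[0] = "new"
    # Backup executes the same read later (it lags the primary).
    backup_saw = disk[0] if backup_reads_disk else forwarded
    return primary_saw, backup_saw
```

run(True) gives ("old", "new"), so the replicas diverge; run(False) gives ("old", "old").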

Today's question:
  What if each computer had its own disk replica (not dual-ported)
  Would that make the system more fault-tolerant?
  How would we modify the system?
  How would recovery have to be changed?

What if ethernet cable breaks?
  primary still running
  backup will try to promote itself
  that's part of what the fail-stop assumption assumes away...

What if we want to re-start a failed+repaired primary?

What if the computers have multiple cores + threads?
  Would this scheme work w/o modification?

What can we expect for performance?

When will performance be bad?
  Frequent interrupts -- may be delayed by epoch scheme
  Lots of input -- must be sent over ethernet
  Many privileged instructions -- many hypervisor traps

Should epochs be short or long?
  Short means many end-of-epoch pauses + chatter
  Long means less overhead
    But I/O interrupts delayed longer

What performance do they see?
  CPU-bound: Figure 2

Disk-bound (really seek-bound): Figure 3
  Why do writes do better than reads?
  Why much better relative performance for disk-bound than CPU-bound?
  Why does Figure 3 write performance level off?
    Why isn't it heading down towards 1.0, as in Figure 2?

What is the limiting factor in performance?
  442-microsecond epoch overhead
  442 us is 22100 instructions (50 MHz, at one instruction per cycle)
    So 22000-instruction epochs -> 2x slowdown for CPU-bound
    Plus time to emulate privileged instructions
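
The arithmetic above as a tiny model, in Python. It assumes one instruction per cycle and counts only the fixed per-epoch overhead, ignoring privileged-instruction emulation, so it is a lower bound on slowdown rather than a prediction of the paper's figures. The function names are made up.

```python
OVERHEAD_US = 442   # per-epoch overhead from the notes
CLOCK_MHZ = 50      # assumed: one instruction per cycle


def overhead_instructions():
    # microseconds * (instructions per microsecond)
    return OVERHEAD_US * CLOCK_MHZ


def slowdown(epoch_len):
    """Time with replication / time without, for a CPU-bound guest."""
    return (epoch_len + overhead_instructions()) / epoch_len
```

overhead_instructions() is 22100, slowdown(22100) is 2.0, and slowdown(4000) is 6.525: under this model short epochs are dominated by the end-of-epoch cost.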

Is the 442 microseconds CPU time, or network time?
  Figure 4 ATM experiment suggests CPU time, but not clear

Does anyone use these ideas today?
  Five years ago I said "no" -- too slow
    Instead, specialized replicated storage systems
    Like my original client/server diagram, for put/get &c
  But now: yes!
  VMware has a fault-tolerant VM system
  Same basic idea, but more complete and sophisticated
  no epochs
    primary has no restrictions on where interrupts occur
    backup can cause them to occur at same place
  primary holds each output until backup ACKs
    to ensure backup will produce same output if primary fails
    but primary can continue executing while it waits
  fault-tolerant network disks
  copes with partition by test-and-set on disk
    at most one of primary and backup will win
    no progress if network disk not reachable
  automatic creation of new backup after failure
    on some spare VM host, don't need to repair same hardware
  much faster: only 10% slow-down, not paper's 2X
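
A sketch of the test-and-set idea for partitions, in Python. SharedDisk and on_lost_contact are hypothetical; the lock models an atomic operation provided by the disk, and having the loser halt is my assumption about what "at most one will win" implies for the other side.

```python
import threading


class SharedDisk:
    """Stand-in for the network disk; the flag lives on the disk, not on a host."""
    def __init__(self):
        self._lock = threading.Lock()   # models the disk's atomic operation
        self._flag = None

    def test_and_set(self, who):
        with self._lock:
            if self._flag is None:
                self._flag = who
                return True
            return False


def on_lost_contact(me, disk):
    """Called by a replica that can no longer hear its peer."""
    if disk is None:
        return "wait"       # disk unreachable: no progress
    if disk.test_and_set(me):
        return "go live"    # won: run alone as primary
    return "halt"           # lost: the peer is already live
```

If both sides of a partition call on_lost_contact on the same disk, exactly one gets "go live", so there are never two primaries.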