6.824 2010 Lecture 12: state machine replication

General topic area: Fault tolerance
  last week: how to recover a single machine from a power failure
    durability, but when the machine is down the service is unavailable
    how to make arbitrary distributed computations atomic: 2PC
  today's and next week's topic: state machine replication
    replicate data so that it is "always" available
    if one replica fails, use another one

state machine replication works for any kind of replicated service:
  storage or lock server or whatever
  every replica must see the same operations in the same order
  if deterministic, replicas will end up with the same state
  disaster if replicas diverge
    so e.g. replicas probably aren't multiprocessors
  [client, replicas, client sends stream of arithmetic ops to all]

how to handle multiple clients sending operations to replicas?
  [two clients sending to all replicas]
  what if replicas see different operation orders?
  easy solution: [clients, master, backups]
    all operations go to the master
    master chooses order
    master sends to slaves
    "master/slave" or "primary/backup"
  there are other approaches
    quorum/voting systems can avoid a master but still enforce order
    primary/backup is simpler

when to respond to a request? before slaves perform the request? after?
  must be after: can't say "yes" to client but then forget after the
    master fails (see the sketch below)
  do all replicas have to acknowledge? depends on recovery scheme

what if the primary fails?
  one backup takes over as primary
  clients change where they send operations

how many replicas?
  assume repair or replacement
  enough to survive failures during the time it takes to start a new one

what are the big issues that need to be solved?
  has the primary failed?
    disaster if a network partition causes two primaries
  recover the primary's state on the backup after it fails
    last operation, or more if lag
  avoid clients seeing any weirdness during fail-over
    lost requests, repeated responses
  bring a newly joined slave up to speed
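Before the case study, a minimal sketch of the primary/backup rule above,
in Go (hypothetical names and structure, not any real system's code):
the primary chooses the order of operations, the backup applies them in
that order, and the primary replies to the client only after the backup
ACKs, so an acknowledged op can't be forgotten at fail-over.

  package main

  import "fmt"

  // Minimal primary/backup sketch (hypothetical, for illustration).
  // The primary chooses the order of operations; the backup applies
  // them in that order, so deterministic replicas stay identical.

  type Op struct {
    Seq int    // order chosen by the primary
    Cmd string // the operation itself, e.g. "put x 1"
  }

  type Backup struct {
    log []Op
  }

  // Apply appends the op in primary-chosen order and ACKs.
  func (b *Backup) Apply(op Op, ack *bool) {
    b.log = append(b.log, op) // same ops, same order => same state
    *ack = true
  }

  type Primary struct {
    seq    int
    log    []Op
    backup *Backup // stands in for an RPC stub to the real backup
  }

  // Execute handles one client operation.
  func (p *Primary) Execute(cmd string) string {
    p.seq++
    op := Op{Seq: p.seq, Cmd: cmd}
    p.log = append(p.log, op)
    // Reply only after the backup ACKs: otherwise the primary could
    // say "yes" to the client, crash, and the backup (now primary)
    // would have forgotten the operation.
    var ack bool
    p.backup.Apply(op, &ack)
    if !ack {
      return "error: no ack" // real system: retry or declare backup dead
    }
    return "ok"
  }

  func main() {
    b := &Backup{}
    p := &Primary{backup: b}
    fmt.Println(p.Execute("put x 1"))
    fmt.Println(p.Execute("get x"))
  }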
Case study: Hypervisor-based Fault-tolerance
  Bressoud and Schneider, SOSP 1995

Motivation
  Note the above example is *not* transparent
    server and client s/w know all about replication
    approach only really makes sense for client/server
  Paper wants fault tolerance / replication for any existing s/w
    Any app, any O/S
    So totally transparent
    And runs on inexpensive stock hardware
  Would be magic if it worked well!

Where they are coming from
  Banks, telephone exchanges, NASA need fault-tolerant computers
  1980s saw lots of *hardware* fault-tolerant machines
    Assumption: CPU most likely part to fail, due to complexity
    So two CPUs connected to a bus, single memory, disk, &c
    All in the same cabinet
  Fault-tolerant *hardware* replication is Very Expensive!
  So a software-only approach was very attractive

Plan 1: [simple diagram]
  Two machines
  Identical start state: registers, program, memory contents
  Maybe they will stay identical
  If one fails, the other just keeps going; no time/data/work is lost

What will go wrong with Plan 1?
  external outputs will be duplicated
  must send external inputs to both
  inputs must arrive at the same time at both
  interrupts &c must arrive at the same time at both
  CPUs may not be deterministic internally

How exact must we be about timing of inputs, interrupts, &c?
  is it ok if timing differs a little bit?
  what does it mean for timing to be exact?

How are they able to control I/O and interrupt timing?
  they slip a virtual machine / hypervisor under the O/S

What is a hypervisor?
  a piece of software that emulates a real machine precisely
    so an O/S &c can run on a hypervisor as a "guest"
  as if the hypervisor simulated each instruction
    and kept simulated registers, memory, &c
  BUT as much as possible runs the guest on the real h/w!
    can be much faster than interpreting instructions

What are the main challenges in making a hypervisor?
  What if the guest tries to write on the hypervisor's memory?
    That might crash the hypervisor
  What if the guest O/S tries to modify hardware state?
    E.g. set up page tables
    Emulated page tables may differ from hardware page tables
    And don't want the O/S to defeat hypervisor protection
  What if the guest tries to read hardware state?
    Read page tables, or read current time
    May differ from intended emulated state
    We want the guest to see the emulated state
  What about devices?
    Hypervisor probably exposes fake devices to the guest
    Hypervisor has its own device drivers for the real h/w
    Thus the hypervisor is in charge of I/O data and interrupts

How does a hypervisor work?
  Run the guest O/S without hardware privileges
    As if it were an ordinary application
  So the guest cannot modify sensitive h/w state
    E.g. guest cannot change page tables
  Rely on hardware to trap to the hypervisor if guest r/w's hardware state
  Mostly this works out
    CPUs have to prevent user programs from modifying priv state anyway
  Problem: h/w state that does *not* trap
    PA-RISC example: branch-and-link exposes current h/w priv level
    either tolerate it, or rewrite guest instructions (as modern VMs do)!
    Many fascinating details; see the VMware papers for the x86 story

Plan 2: primary and backup
  primary's hypervisor:
    watches I/O and interrupts
    sends them to the backup
  backup's hypervisor:
    suppresses I/O and interrupts
    injects the primary's, at the same instruction
  primary details:
    determine exact instruction # of each interrupt
    determine exact instruction # of each input
      does input always arrive as an interrupt?
  backup details:
    be able to trigger an interrupt at an exact instruction #
    must lag behind the primary
      if backup too fast, may execute past the interrupt point
      before the primary realizes an interrupt would occur there

Paper does *not* use Plan 2!
  instead: interrupts and I/O only at epoch boundaries
  an epoch ends every N instructions
    PA-RISC has special hardware for this
  at end of epoch: primary sends interrupt and input info to backup
    both deliver at end of epoch

Why does the paper use epochs? I don't know
  Maybe as a way to ensure the backup lags the primary enough
  Maybe too hard to know what instruction # an interrupt arrived at

What if the primary generates output, then crashes?
  Will the backup take over before or after the output instruction?
  If the backup takes over before it reached the output instruction
    Will it re-generate exactly the same output?
    Will it generate a second *different* output?

How does the paper deal with output vs primary crash? Page 4, P2
  Backup lags primary by one epoch
  Primary waits at epoch end for backup to ack all msgs
  Suppose the primary crashes midway through epoch 10
    At that point the backup has not started epoch 10
    Backup decides the primary has crashed, takes over
    Backup executes epoch 10 as sole server -- generates output
    Backup has all msgs from the primary as of the start of 10
    So backup will execute epoch 10 the same way the primary did
    Primary crashed after the output -> backup emits a 2nd identical output
    Primary crashed before -> backup generates the output
  So: the outside world must tolerate some duplicate output

General rule:
  Primary can't produce output until backup has ACKed all prev msgs
  So backup will generate the same output if the primary crashes
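A sketch of that general rule and the epoch handshake, in Go (hypothetical
structure; the paper's mechanism lives inside a hypervisor, not in
goroutines): the primary buffers interrupts/inputs during an epoch, ships
them to the backup at the epoch boundary, and waits for the ACK before
delivering them or letting output reach the outside world.

  package main

  import (
    "fmt"
    "sync"
  )

  type Event struct {
    Kind string // "interrupt" or "input"
    Data string
  }

  type link struct { // stands in for the primary->backup ethernet
    events chan []Event
    acks   chan bool
  }

  func primaryEpochEnd(l link, buffered []Event) {
    l.events <- buffered // ship this epoch's events to the backup
    <-l.acks             // general rule: wait for the backup's ACK...
    for _, e := range buffered {
      fmt.Println("primary delivers at epoch end:", e.Kind)
    }
    // ...before any of the guest's output goes out: if the primary now
    // crashes, the backup can re-execute and produce the same output
    // (possibly a duplicate, which the outside world must tolerate).
  }

  func backupEpochEnd(l link) {
    buffered := <-l.events // backup lags the primary by one epoch:
    l.acks <- true         // it runs an epoch only once it has the events
    for _, e := range buffered {
      fmt.Println("backup injects at same epoch end:", e.Kind)
    }
  }

  func main() {
    l := link{events: make(chan []Event), acks: make(chan bool)}
    var wg sync.WaitGroup
    wg.Add(1)
    go func() { defer wg.Done(); backupEpochEnd(l) }()
    primaryEpochEnd(l, []Event{{"interrupt", "disk done"}, {"input", "pkt"}})
    wg.Wait()
  }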
Will the outside world tolerate repeated output?
  In general, maybe not
    E.g. controlling an instrument in a scientific experiment
  Network protocols: no problem, must handle duplicated packets anyway
  Shared disk: no problem, repeated write of the same data

What if the primary starts an I/O operation, the epoch ends w/o
    completion, and then the primary crashes?
  Backup takes over
  Backup did *not* start the I/O operation
  So it will never get a completion interrupt!
  P8 says: the hypervisor on the backup keeps track of incomplete I/O
    started on the primary, and delivers "uncertain" interrupts on the
    backup after fail-over

Paper makes other interesting assumptions about the I/O hardware
  Input is always associated with an interrupt
    So the epoch scheme causes it to arrive at the same time on both
    Thus the hypervisor must intercept+defer all DMA
  Backup can start using h/w devices at the start of any epoch
    I.e. the device h/w has no state set up by the primary
    Which the backup would not have set up, since its I/O is suppressed

Figure 1 shows a shared disk
  Rather than a separate disk per machine with identical content
  Only the primary writes the disk
  Why does a shared disk make the design easier?
    Disk writes are non-deterministic:
      one disk may have a bad sector, so the same write succeeds on one
      machine and fails on the other
    Simplifies recovery of a failed machine
      Don't need to copy the whole disk from the survivor
  Won't disk failures ruin the fault-tolerance story?
    Can be fixed separately, w/ RAID
  Does the backup read the disk directly?
    No: why not?
    Suppose the primary reads and then immediately writes
    The backup might read the new data, when it should see the old

What about fail-over?
  how does the backup know the primary has failed?
  What if the ethernet cable breaks? primary still running
    backup will try to promote itself
    that's part of what the fail-stop assumption assumes away...
  What if we want to re-start a failed+repaired primary?

What can we expect for performance?
  When will performance be bad?
    Frequent interrupts -- may be delayed by the epoch scheme
    Lots of input -- must be sent over the ethernet
    Many privileged instructions -- many hypervisor traps
  Should epochs be short or long?
    Short means many P2 pauses and much primary/backup chatter
    Long means less overhead, but I/O interrupts are delayed longer

What performance do they see?
  CPU-bound: Figure 2
  Disk-bound (really seek-bound): Figure 3
    Why do writes do better than reads?
  What is the limiting factor in performance?
    the 442-microsecond epoch overhead
    certainly can't have more than about 2000 epochs per second!
      (1 sec / 442 usec is roughly 2260 epochs)
  Is the 442 microseconds CPU time, or network time?
    The Figure 4 ATM experiment suggests CPU time, but it's not clear

Does anyone use these ideas today?
  Five years ago I said "no" -- too slow
    Instead, specialized replicated storage systems
    Like my original client/server diagram, for put/get &c
  But now: yes! VMware has a fault-tolerant VM system
    Same basic idea, but more complete and sophisticated
    no epochs
      primary has no restrictions on where interrupts occur
      backup can cause them to occur at the same place
    primary holds each output until the backup ACKs
      to ensure the backup will produce the same output if the primary fails
      but the primary can continue executing while it waits
    fault-tolerant network disks
    copes with partition by test-and-set on the disk (see the sketch below)
      at most one of primary and backup will win
      no progress if the network disk is not reachable
    automatic creation of a new backup after a failure
      on some spare VM host; don't need to repair the same hardware
    much faster: only a 10% slow-down, not the paper's 2X
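A final sketch of that test-and-set tie-breaker, in Go (hypothetical code;
the source only says VMware FT does an atomic test-and-set on the shared
disk): each side tries to claim the disk, and at most one compare-and-swap
can succeed, so a partition can't produce two live primaries.

  package main

  import (
    "fmt"
    "sync/atomic"
  )

  // diskFlag stands in for the test-and-set sector on the shared disk.
  var diskFlag int32

  // tryGoLive returns true if this replica won the right to be primary.
  // At most one CompareAndSwap on diskFlag can ever succeed.
  func tryGoLive(name string) bool {
    won := atomic.CompareAndSwapInt32(&diskFlag, 0, 1)
    if won {
      fmt.Println(name, "won test-and-set: goes live as sole primary")
    } else {
      fmt.Println(name, "lost test-and-set: halts")
    }
    return won
  }

  func main() {
    // A partition: each replica thinks the other is dead, and both try
    // to promote themselves. The shared disk is the single arbiter.
    tryGoLive("primary")
    tryGoLive("backup")
    // If the disk is unreachable, neither side can decide: no progress,
    // which is the safe trade against ending up with two primaries.
  }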