6.824 2013 Lecture 3: Primary/Backup Replication

today
  primary/backup replication
  explain lab 2 ideas
  discuss paper

today's goals:
  availability: still usable despite [some class of] failures
  correctness: act just like a single server to clients
  and handle network failures and repair, unlike lab 1

tool: replication of service's state
  (e.g. lock table, saying if each lock is held)
  how to ensure two replicas (servers) remain identical?
    1) copy the whole state, and/or
    2) apply same operations in same order to both replicas

basic primary/backup idea (you did this in Lab 1)
  multiple clients, sending requests concurrently
  two server roles: primary, backup
    which server is primary may change due to failures
  clients send operations to primary (e.g. Lock, Unlock)
  primary chooses the order
  primary forwards operations (or results) to backup
  primary replies to client
  Q: how to ensure backup applies ops in same order as primary?
  Q: does primary need to wait for backup to respond?

shortcomings of Lab 1
  depends on single fail-stop assumption
  cannot replace a failed server
  what wider classes of failure would we like to handle?
    temporary or permanent loss of connectivity
    network partitions == can't know if a server is crashed or just not reachable
  what would happen if Lab 1 saw these failures?
    clients, primary, backup might not agree about who is primary
      or about whether backup is alive (and thus whether it is up to date)
    result: two primaries
      e.g. some clients switch to backup, others don't
      or one client switches back and forth
      they don't see each other's updates!
      "split brain"
    does *not* achieve goal of acting like a single server
    lab 1 is *incorrect* if there are network problems!

lab 2 goals:
  tolerate network problems, including partition
    either keep going, correctly
    or suspend operations until network is repaired
  replacement of failed servers

lab 2 overview:
  agreement: "view server" decides who p and b are
    clients and servers ask view server
    they don't make independent decisions
    only one vs, avoids multiple machines independently deciding who is p
  repair: view server can co-opt "idle" server as b after old b becomes p
    primary initializes new backup's state
  the tricky part:
    1. only one primary!
    2. primary must have state!
  we will work out some rules to ensure these

view server
  maintains a sequence of "views"
    view #, primary, backup
    1: S1 S2
    2: S2 --
    3: S2 S3
  monitors server liveness
    each server periodically sends a Ping RPC
    "dead" if missed N Pings in a row
    "live" after a single Ping
  can be more than two servers Pinging view server
    if more than two, the extras are "idle" servers
  if primary is dead
    new view with previous backup as primary
  if backup is dead, or no backup
    new view with previously idle server as backup
  OK to have a view with just a primary, and no backup
    but -- if an idle server is available, make it the backup

how to ensure new primary has an up-to-date replica of the state?
  only promote previous backup
    i.e. don't make an idle server the primary
  but what if the backup hasn't had time to acquire the state?
  how to avoid promoting a state-less backup?
  example:
    1: S1 S2
       S1 stops pinging viewserver
    2: S2 S3
       S2 *immediately* stops pinging
    3: S3 --
    potential mistake: maybe S3 never got state from S2
    better to stay with 2/S2/S3; maybe S2 will revive
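to make the view-service mechanics concrete, here is a minimal Go sketch of
the view struct and liveness tracking; the names (View, ViewServer, DeadPings,
tick, idle) are illustrative assumptions, not the actual lab 2 skeleton:

  package viewservice

  import "time"

  const (
      PingInterval = 100 * time.Millisecond
      DeadPings    = 5 // "dead" if this many Ping intervals pass with no Ping
  )

  type View struct {
      Viewnum uint
      Primary string
      Backup  string
  }

  type ViewServer struct {
      view     View
      acked    bool                 // has the current primary acked this view?
      lastPing map[string]time.Time // most recent Ping time from each server
  }

  // tick runs periodically and may move to a new view -- but only if the
  // current primary has acked the current view (the rule worked out below).
  func (vs *ViewServer) tick() {
      now := time.Now()
      dead := func(s string) bool {
          return s != "" && now.Sub(vs.lastPing[s]) > DeadPings*PingInterval
      }
      if !vs.acked {
          return // must stay with this view until the primary acks it
      }
      if dead(vs.view.Primary) && vs.view.Backup != "" {
          // only ever promote the previous backup, never an idle server
          vs.view = View{vs.view.Viewnum + 1, vs.view.Backup, vs.idle(now)}
          vs.acked = false
      } else if dead(vs.view.Backup) || (vs.view.Backup == "" && vs.idle(now) != "") {
          vs.view = View{vs.view.Viewnum + 1, vs.view.Primary, vs.idle(now)}
          vs.acked = false
      }
  }

  // idle returns some live server that is neither primary nor backup, or "".
  func (vs *ViewServer) idle(now time.Time) string {
      for s, t := range vs.lastPing {
          if s != vs.view.Primary && s != vs.view.Backup &&
              now.Sub(t) <= DeadPings*PingInterval {
              return s
          }
      }
      return ""
  }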
how can viewserver know it's OK to change views?
  lab 2 answer: primary in each view must acknowledge that view to viewserver
  viewserver must stay with current view until acknowledged
    even if the primary seems to have failed
    no point in proceeding, since not acked == backup may not be initialized

Example:
  Viewnum  Primary  Backup  Event
  ------------------------------------------------------------------
  0        none     none
                            server S1 sends Ping(0), which returns view 0
  1        S1       none
                            server S2 sends Ping(0), which returns view 1
                              (no view change yet, since S1 hasn't acked)
                            server S1 sends Ping(1), which returns view 1 or 2
  2        S1       S2
                            server S1 sends Ping(2), which returns view 2
                            server S1 stops sending Pings
  3        S2       none
                            server S2 sends Ping(3), which returns view 3

Q: can more than one server think it is primary?
  1: S1 S2
     net broken, so viewserver thinks S1 dead, but it's alive
  2: S2 --
  now S1 is alive and not aware of view #2, so S1 still thinks it is primary
  AND S2 is alive and thinks it is primary

how to ensure only one server acts as primary?
  even though more than one may *think* it is primary
  "acts as" == executes and responds to client requests
  the basic idea:
    1: S1 S2
    2: S2 --
    S1 still thinks it is primary
    S1 must forward ops to S2
    S2 thinks S2 is primary
    so S2 can reject S1's forwarded ops
  the rules: (sketched in Go below)
    1. non-backup must reject forwarded requests
    2. primary in view i must have been primary or backup in view i-1
    3. non-primary must reject direct client requests
    4. primary must wait for backup to accept each request
  example:
    1: S1 S2
       viewserver stops hearing Pings from S1
    2: S2 --
    it may be a while before S2 hears about view #2
    before S2 hears about view #2:
      S1 can process ops from clients; S2 will accept forwarded requests
      S2 will reject ops from clients who have heard about view #2
    after S2 hears:
      if S1 receives a client request, it will forward it, and S2 will reject it
      so S1 can no longer act as primary
      S1 will send an error to the client; the client will ask the vs for the
        new view, and re-send to S2
    the true moment of switch-over occurs when S2 hears about view #2

how can a new backup get the state? e.g. the table of locks
  if S2 is backup in view i, but was not in view i-1,
    S2 should ask the primary to transfer the complete state
  rule for state transfer:
    every operation (Lock, Unlock) must be either before or after the state xfer
    if before, the xferred state must reflect the operation
    if after, the primary must forward the operation after the xfer finishes

Q: does primary need to forward Get()s to backup?
  after all, Get() doesn't change anything, so why does backup need to know?
  and the extra RPC costs time

Q: how could we make primary-only Get()s work?
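here is a minimal Go sketch of rules 1, 3, and 4 on the server side (rule 2
lives in the viewserver's promotion logic sketched earlier); Op, Server,
apply, and callForward are illustrative stand-ins, not the lab 2 code:

  package pb

  import "errors"

  var ErrWrongServer = errors.New("wrong server")

  // View mirrors the struct from the viewserver sketch above.
  type View struct {
      Viewnum uint
      Primary string
      Backup  string
  }

  type Op struct {
      Name string // e.g. "Lock" or "Unlock"
      Key  string
  }

  type Server struct {
      me   string
      view View // latest view learned from the viewserver
      // ... plus application state, e.g. the lock table
  }

  // HandleClient: rule 3 -- a non-primary must reject direct client requests.
  func (s *Server) HandleClient(op Op) error {
      if s.view.Primary != s.me {
          return ErrWrongServer
      }
      // rule 4 -- primary must wait for the backup to accept each request
      if s.view.Backup != "" {
          if err := callForward(s.view.Backup, op); err != nil {
              return err // backup rejected us: we may no longer be primary
          }
      }
      s.apply(op)
      return nil
  }

  // HandleForward: rule 1 -- a non-backup must reject forwarded requests.
  func (s *Server) HandleForward(op Op) error {
      if s.view.Backup != s.me {
          return ErrWrongServer
      }
      s.apply(op)
      return nil
  }

  func (s *Server) apply(op Op) { /* update the lock table, etc. */ }

  // callForward would be an RPC to the backup's HandleForward; stubbed here.
  func callForward(backup string, op Op) error { return nil }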
*** Case study: Hypervisor-based Fault-tolerance
    Bressoud and Schneider, SOSP 1995

Why are we reading this paper?
  detailed description of a primary/backup system
  very general -- not application-specific like the labs

Motivation
  goal: fault tolerance / availability for any existing s/w
    by running same instructions &c on two computers
  transparent: any app, any O/S
  would be magic if it worked well!

Plan 1:
  [simple diagram]
  two machines
  identical start state: registers, program, memory contents
  maybe they will stay identical
  if one fails, the other just keeps going; no time/data/work is lost

What will go wrong with Plan 1?
  external outputs will be duplicated
  must send external inputs to both
    inputs must arrive at the same time at both
    interrupts must arrive at the same time at both
  CPUs unlikely to run at exactly the same speeds

The basic plan:
  ignore backup's I/O instructions
  hide interrupts from backup
  primary forwards I/O results and interrupts to backup
  inject those interrupts into backup
  Q: do we have to be exact about timing of interrupts?
     can backup just fake an interrupt when a msg arrives from primary?

How are they able to control I/O and interrupt timing?
  they slip a virtual machine / hypervisor under the O/S

What is a hypervisor?
  a piece of software
  emulates a real machine precisely
  so an O/S &c can run on a hypervisor as a "guest"
  as if the hypervisor simulated each instruction
    and kept simulated registers, memory, &c
  BUT as much as possible runs guest on real h/w!
    can be much faster than interpreting instructions
  many fascinating details; see the VMware papers for the x86 story

Hypervisor gives us control
  suppress output from backup, detect interrupts/input at primary, &c

Still need a scheme for delivering I/O inputs and interrupts
  from primary to backup, at the right times
  but: hard to know when an interrupt happened at primary,
    to repeat it at the same point at backup
  answer: HP PA-RISC recovery counter
    it forces an interrupt every N instructions
    this allows primary and backup hypervisors to get control
      at exactly the same point in the code

Epochs (see the sketch after this section)
  primary alternates between executing an epoch and end-of-epoch processing;
    backup does too
  every epoch is the same # of instructions, e.g. 4000 instructions
  Primary during epoch E:
    deliver interrupts buffered during E-1
      interrupts deliver input, and acknowledge output
    then continue executing
    I/O instructions start input/output (but not done until interrupt)
    hypervisor hides new interrupts, just buffers them w/ I/O input
  Primary at end of epoch E:
    send buffered interrupt info, including I/O input, to backup
    wait for backup to ACK
    send "done with E" to backup
  Backup during epoch E:
    deliver interrupts, and I/O input, primary told us of during E-1
    then continue executing
    I/O instructions do nothing
    ignore interrupts from backup's own devices
  Backup at end of epoch E:
    receive and buffer interrupt/input msgs from primary
    wait for "done with E+1" from primary
    start epoch E+1
  note backup lags primary by one epoch
    backup doesn't start epoch 7 until primary is done with epoch 7
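a schematic Go rendering of that end-of-epoch exchange; EpochMsg, the
channels, deliver, and runEpoch are invented for illustration and gloss over
the hypervisor details and the exact epoch indexing:

  package ft

  // Interrupt stands in for the paper's buffered interrupt records
  // (device, input bytes, &c); fields elided.
  type Interrupt struct{}

  type EpochMsg struct {
      Epoch      int
      Interrupts []Interrupt // interrupt info + I/O input buffered by primary
  }

  // Primary at end of epoch E: send buffered interrupt info, wait for the
  // backup's ACK, then send "done with E".
  func primaryEndOfEpoch(e int, buffered []Interrupt,
      toBackup chan<- EpochMsg, acks <-chan int, done chan<- int) {
      toBackup <- EpochMsg{Epoch: e, Interrupts: buffered}
      <-acks    // wait for backup to ACK the buffered info
      done <- e // "done with E"
  }

  // Backup: deliver the interrupts the primary buffered during the previous
  // epoch, run an epoch, then wait for the primary before going on -- so
  // the backup always lags the primary by one epoch.
  func backupLoop(fromPrimary <-chan EpochMsg, acks chan<- int, done <-chan int) {
      var pending []Interrupt // deliver these at the start of the next epoch
      for e := 1; ; e++ {
          deliver(pending) // inject primary's interrupts + I/O input into guest
          runEpoch(e)      // execute exactly N instructions (recovery counter)
          msg := <-fromPrimary
          acks <- msg.Epoch
          pending = msg.Interrupts
          <-done // wait for primary's "done" before starting the next epoch
      }
  }

  func deliver([]Interrupt) { /* hypervisor injects interrupts */ }
  func runEpoch(int)        { /* run guest until recovery counter fires */ }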
What if primary fails during epoch E?
  backup finishes with E-1
  backup times out waiting for "done with E"
  can backup just switch to being primary for epoch E?
    no: primary might have crashed in the middle of writing device registers
    may not be correct for backup to repeat I/O register writes
    example: maybe for a disk read the driver is required to write disk
      controller registers in a specific order, e.g. track #, sector #,
      length, DMA address
      primary did *some* of those writes before crashing
      backup might then write track # when the h/w is expecting sector #
  good news: many h/w devices can be reset to a known state
    and I/O operations restarted
    existing drivers already know how to handle "please reset me" interrupts
    at any rate, true for the HP-UX disk driver

Failover strategy:
  backup times out waiting for "done with E"
  backup executes epoch E as backup (no output, I/O suppressed)
  switches to primary for E+1
  hypervisor generates "uncertain interrupts" for I/O started <= E
    including I/O suppressed during failover epoch E
  during E+1:
    drivers respond to uncertain interrupts w/ reset, repeat of waiting I/O ops
  (all this is based on footnote 5 on page 4, and IO1 and P8 on page 5)

Potential problem: repeated output
  maybe primary crashed just as it finished E
    and it completed some output, e.g. a disk write or network pkt send
  hypervisor will generate uncertain interrupts for those I/O ops
  backup will repeat them in E+1
  will the outside world tolerate repeated output?
    in general, maybe not
      e.g. controlling an instrument in a scientific experiment
    network protocols: no problem, must handle duplicated packets anyway
    shared disk: no problem, repeated write of the same data

Do O/S drivers on primary talk directly to h/w?
  I suspect the O/S talks to a device simulated/mediated by the hypervisor
  hypervisor needs to be able to collect input read from devices
    and send it to the backup
  so hypervisor must be intervening in (and understanding) access to dev h/w
  also hypervisor must initialize dev h/w when backup takes over

Figure 1 shows a shared disk
  rather than a separate disk per machine with identical content
  only the primary writes the disk
  why does a shared disk make the design easier?
    disk writes are non-deterministic
      one disk may have a bad sector so its write fails
      so (unlike RAM) backup's disk won't naturally be identical to primary's
    simplifies recovery of a failed machine
      don't need to copy the whole disk from the survivor
  won't disk failures ruin the fault-tolerance story?
    can be fixed separately, w/ RAID
    and my memory of 1990 is that disks failed much less often
      than computers crashed

Today's question: what if the system had private disks instead of one
  dual-ported disk? would that make the system more fault-tolerant?
  should backup suppress disk I/O instructions?
  should backup suppress disk interrupts?
  should backup directly read its own disk?
    or should primary forward the read data?

What if the ethernet cable breaks?
  primary still running
  backup will try to promote itself
  that's part of what the fail-stop assumption assumes away...
  fail-stop is reasonable for a tightly coupled system
    they could probably make a 10-foot ethernet pretty reliable
    or have two separate ethernets

What if we want to re-start a failed+repaired primary?

What if the computers have multiple cores + threads?
  would this scheme work w/o modification?

What can we expect for performance? when will performance be bad?
  frequent interrupts -- may be delayed by the epoch scheme
  lots of input -- must be sent over the ethernet
  many privileged instructions -- many hypervisor traps

Should epochs be short or long?
  short means many end-of-epoch pauses + chatter
  long means less overhead
    but I/O interrupts are delayed longer

What performance do they see?
  CPU-bound: Figure 2
  disk-bound (really seek-bound): Figure 3
  why do writes do better than reads?
  why much better relative performance for disk-bound than CPU-bound?
  why does Figure 3 write performance level off?
    why isn't it heading down towards 1.0, as in Figure 2?

What is the limiting factor in performance?
  the 442-microsecond end-of-epoch overhead
  442 us is 22,100 instructions (at 50 MHz)
  so ~22,000-instruction epochs -> 2x slowdown for CPU-bound
  plus time to emulate privileged instructions
  is the 442 microseconds CPU time, or network time?
    the Figure 4 ATM experiment suggests CPU time, but it's not clear
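to make the short-vs-long epoch tradeoff concrete, here is a
back-of-the-envelope Go calculation using the paper's 442 us end-of-epoch
overhead and the same 50 MHz, ~1 instruction/cycle assumption as above:

  package main

  import "fmt"

  func main() {
      const overhead = 442e-6 // seconds of end-of-epoch pause (from the paper)
      const ips = 50e6        // 50 MHz, assuming ~1 instruction per cycle

      for _, n := range []float64{4000, 22100, 200000} {
          epoch := n / ips // seconds spent executing one epoch
          fmt.Printf("epoch %6.0f instrs: slowdown %.2fx\n", n, (epoch+overhead)/epoch)
      }
      // prints ~6.5x at 4000 instrs, 2.00x at 22100, ~1.11x at 200000:
      // short epochs pay for the pause often; long epochs amortize it,
      // at the cost of delaying I/O interrupts longer.
  }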
Does anyone use these ideas today?
  five years ago I said "no" -- too slow
    instead, specialized replicated storage systems
    like my original client/server diagram, for put/get &c
  but now: yes! VMware has a fault-tolerant VM system
    same basic idea, but more complete and sophisticated
    no epochs
      primary has no restrictions on where interrupts occur
      backup can cause them to occur at the same place
    primary holds each output until backup ACKs
      to ensure backup will produce the same output if primary fails
      but primary can continue executing while it waits
      (see the sketch below)
    fault-tolerant network disks
    copes with partition by test-and-set on the disk
      at most one of primary and backup will win
      no progress if the network disk is not reachable
    automatic creation of a new backup after a failure
      on some spare VM host; no need to repair the same hardware
    much faster: only a 10% slow-down, not the paper's 2x
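a minimal Go sketch of that output-holding idea; Output, the channel
plumbing, and the sequence numbers are invented for illustration, not
VMware's actual design:

  package ft

  // Output is an external effect (network packet, disk write) that depends
  // on everything the primary did up through log sequence number AfterSeq.
  type Output struct {
      AfterSeq int
      Data     []byte
  }

  // runPrimary holds each output until the backup has ACKed the replication
  // log up to the point the output depends on. Note that execution itself
  // never blocks; only the release of outputs does.
  func runPrimary(acks <-chan int, outputs <-chan Output, release func(Output)) {
      acked := 0        // highest log seq the backup has ACKed
      var held []Output // outputs waiting on backup ACKs
      for {
          select {
          case n := <-acks:
              if n > acked {
                  acked = n
              }
              kept := held[:0]
              for _, o := range held {
                  if o.AfterSeq <= acked {
                      release(o) // backup can now reproduce this output
                  } else {
                      kept = append(kept, o)
                  }
              }
              held = kept
          case o := <-outputs:
              if o.AfterSeq <= acked {
                  release(o)
              } else {
                  held = append(held, o) // hold it; keep executing
              }
          }
      }
  }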