6.824 2001 Lecture 21: Primary-Backup Replication

Big picture over next few days.
  We want to replicate data for reliability.
  How do we manage replicas consistently?
  Can we replicate transparently to applications?
  Can we replicate and maintain high performance?

The big issues:
  1. Order of updates -- same on all replicas.
  2. Availability: continue despite some failures.
  3. Don't allow two partitions to operate.
  4. Allow concurrent operations for efficiency.

The techniques:
  Primary.
  Two-phase commit.
  Voting.
  Log.
  View change.
  Recovery.

Example application:
  Appendable file, perhaps suitable for e-mail storage.
  For simplicity, just one file.
  Must support multiple concurrent clients.
  Just two operations:
    Read
    Append(Data)

We want the replicated system to mimic the client-visible behavior of a
single-server implementation, despite partial failures at any time.
Examples of correct behavior:
  A1(X), A1(Y), R1 -> X,Y
  A1(X) concurrent w/ A2(Y): all reads see same X/Y order
This is sequential consistency.

Single server implementation:
  [time diagram]
  Process operations one at a time.
  Write to stable storage before replying to client.
  Client waits for reply before sending next message.
  Might lose the last update. But client will re-send after reboot.

Why do programmers like this model?
  Hides existence of concurrency.
  Hides failure recovery details.
  General tool, same for many applications.

Why isn't a single-server DB good enough?
  Single point of failure.
  Want to replicate data on multiple machines.
  May help read (or even write) performance as well as availability.

Straw man (incorrect!) replicated database:
  Imagine 2 servers, file replicated on both.
  Clients send read operations to either server.
  Clients send update operations to both.
    Wait for both to reply.
  Use Miguel's time diagrams.
  What can go wrong?
    Not available if either server is down.
    A1(X), A1(Y), R : correct; waiting for each reply forces the order.
    A1(X) | A2(Y) : not correct.
      Losses/delays may cause a different order at the two servers.

We have two problems to fix:
  Want to survive one server being down.
    Straw man is *less* available than a single server!
  Want to ensure servers have identical replicas.
    I.e. process updates in the same order.

Can we relax the requirement that the client wait for replies from all servers?
  To tolerate one server being down.
  The naive approach (wait for just one) does not work!
    What if one of your network messages is lost?
    What if the network is partitioned?
    Client can't distinguish a server being down from a partition.

Can fix the partition problem with voting (see the sketch after this list):
  Use 2n+1 replicas if you want to survive n failures.
  Only allow operations in a partition with >= n+1 members.
    There can be at most one such partition.
  Reads also need to contact n+1 members.
    To verify that the client is in the majority partition.
    (What if client gets different answers???)
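A minimal sketch (not from the notes) of the voting rule in Go: with 2n+1
replicas, a partition may operate only if it can reach >= n+1 of them, so at
most one partition can proceed. The "reachable" callback is an illustrative
stand-in for probing each replica over the network.

  // Sketch only: returns true if we can reach a majority of replicas.
  package main

  import "fmt"

  func canOperate(replicas []string, reachable func(string) bool) bool {
          votes := 0
          for _, r := range replicas {
                  if reachable(r) {
                          votes++
                  }
          }
          // len(replicas) is 2n+1, so len/2+1 is the majority n+1.
          return votes >= len(replicas)/2+1
  }

  func main() {
          replicas := []string{"A", "B", "C", "D", "E"} // 2n+1 = 5, n = 2
          // Pretend D and E are partitioned away.
          up := map[string]bool{"A": true, "B": true, "C": true}
          fmt.Println(canOperate(replicas, func(r string) bool { return up[r] })) // true: 3 >= 3
  }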
Can fix the order problem with a primary copy:
  One primary; clients send operations to it.
  Primary imposes an order on concurrent operations.
  Primary tells slaves about operations, with sequence numbers.
  Slaves perform each operation, then send an ACK to the primary.
  (What if the primary crashes???)

Second straw man: is this enough?
  Can the primary just wait for ACKs, then respond to the client?
  No! What if fewer than n slaves respond?
    I.e. the primary may be in the minority partition.
    Then the primary should abort the operation and send an error to the client.
    But some of the slaves have performed it!
  So slaves need to defer the actual update until the primary confirms.

Two-phase commit protocol (a code sketch appears after this section):
  1: Primary sends updates to slaves.
     Slaves append the update information to a log, but don't yet perform it.
       (Logs must be persistent???)
     Slaves ACK the first phase to the primary.
     Primary waits for n replies (so n+1 replicas, counting itself).
     Primary replies "YES" to the client.
  2: In the background, the primary tells the slaves that the commit happened.
     Slaves update the real DB from the log entry.

What if the primary fails before sending the client the ACK?
  I.e. while sending phase 1 msgs or collecting slave ACKs.
  If some slave got the phase 1 message, the operation can be re-started.
  But it's OK to just abort the operation -- the client hasn't seen an ACK.

What if the primary fails after ACKing the client?
  But before sending phase 2 commit messages to all slaves.
  The operation *must* complete, because the client has seen an ACK.
  The new primary can ask the remaining replicas.
    If n+1 saw (and ACKed) the phase 1 message, the new primary can safely commit.

What if a slave fails?
  Doesn't matter -- can just keep going as long as > 1/2 survive.

What about concurrent operations?
  Primary numbers them and allows them to proceed.
  Primary and slaves keep a log of all operations.
  A slave only ACKs a phase 1 msg if it has seen all prior phase 1 msgs.
  The primary only sends out a commit if it has committed all previous ops.
    Otherwise a reboot could lose a later op but commit an earlier one.
  Logs look like:
    Old committed operations
    <-- Commit Point (CP)
    Uncommitted operations

Slave log:
  If a replica ACKs phase 1:
    It can't yet write the DB.
    But it has promised to do so when asked.
      Since the primary may already have ACKed the client.
    A slave should not ACK, reboot, and change its mind.
      So it must have a stable log.
      Same order as the primary's log.

Recovery procedure:
  Only interesting if the primary crashes and someone else takes over.
  Worst case: old primary operating with just n slaves.
    I.e. the other n were down; now they've recovered, w/ incomplete logs.
  Must reconstruct the primary's state.
  Must recover the order of updates the primary committed.
    I.e. updates for which the primary replied to the client.
  Can either commit or forget updates that provably weren't committed.
    The client will re-send.
  The primary only committed if it got ACKs from n+1 replicas (including itself).
  If the primary fails, and we expect to continue, a majority must survive.
    So the old view had *more* than a majority.
  Thus if the primary committed an operation, a majority must know about it
    after the view change.
  So that's the criterion for accepting updates into the new view.
  By the way, this is not quite two-phase commit, though similar.
    Since we only collect ACKs from a majority, not from all.
    Real two-phase commit can't always recover w/o the primary.

Does this preserve the ordering of committed operations?
  Suppose OP1 and OP2 are concurrent, but in that order.
    Remember, they're assigned sequence numbers when submitted to the primary.
  Maybe the prepare messages for OP1 were lost.
  So the primary gets n+1 ACKs for OP2 first.
  Then the primary crashes.
  Consequences:
    The primary cannot reply to the client until all previous operations commit.
    The new primary cannot commit an op unless all previous ops can be shown
      to have (possibly) committed.

Reads:
  Can't just send a read to any replica.
    Must make sure it's in the majority partition.
    Otherwise we may miss a committed write in the other half.
  A read must reflect all committed updates.
    So clients have to send reads to the primary.
  A read could wait for all prior writes to commit.
    But could also just use the latest committed write at the primary.
    Not much of a savings, since you have to gather a quorum anyway.

When is the real on-disk DB written (from the log)?
  Primary sends the current CP along with every message.
  Slaves remember the largest CP seen.
  Real DB entries are written only when the CP passes the log entry.
    Updates happen in the background, in log order.
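To tie the pieces together, here is a minimal single-process sketch in Go of
the protocol above: phase 1 appends to slave logs, the primary commits after
a majority of ACKs, and phase 2 advances a commit point (CP) piggybacked on
later messages, applied in log order. All names (op, slave, primary, prepare,
advance) are illustrative; persistence, networking, aborts/retransmission,
and view-change recovery are omitted.

  package main

  import "fmt"

  type op struct {
          seq  int
          data string
  }

  type slave struct {
          log []op     // stable log in sequence order (persistence omitted)
          cp  int      // highest commit point seen from the primary
          db  []string // the "real DB", written only up to the CP
  }

  // prepare is phase 1: log the op but don't apply it.  ACK only if
  // we've seen all prior phase 1 msgs (seq must be the next one).
  func (s *slave) prepare(o op, primaryCP int) bool {
          if o.seq != len(s.log)+1 {
                  return false
          }
          s.log = append(s.log, o)
          s.advance(primaryCP)
          return true
  }

  // advance applies logged ops up through the CP, in log order.
  func (s *slave) advance(cp int) {
          if cp > s.cp {
                  s.cp = cp
          }
          for len(s.db) < s.cp && len(s.db) < len(s.log) {
                  s.db = append(s.db, s.log[len(s.db)].data)
          }
  }

  type primary struct {
          slaves []*slave
          log    []op
          cp     int // ops with seq <= cp are committed
          db     []string
  }

  // append runs one client Append through both phases.  The primary
  // counts itself, so with 2n+1 replicas it needs n slave ACKs to
  // reach the majority of n+1.
  func (p *primary) append(data string) bool {
          o := op{seq: len(p.log) + 1, data: data}
          p.log = append(p.log, o)

          acks := 1 // the primary's own vote
          for _, s := range p.slaves {
                  if s.prepare(o, p.cp) { // phase 1, current CP piggybacked
                          acks++
                  }
          }
          total := len(p.slaves) + 1 // 2n+1 replicas, counting the primary
          if acks < total/2+1 {
                  return false // minority partition: abort; client re-sends
          }

          p.cp = o.seq // commit; only now may the primary reply "YES"
          p.db = append(p.db, data)
          for _, s := range p.slaves { // phase 2, "in the background"
                  s.advance(p.cp)
          }
          return true
  }

  func main() {
          slaves := []*slave{{}, {}, {}, {}} // 2n+1 = 5 replicas, n = 2
          p := &primary{slaves: slaves}
          fmt.Println(p.append("X"))         // true
          fmt.Println(p.append("Y"))         // true
          fmt.Println(p.db, slaves[0].db)    // [X Y] [X Y]
  }

Note how the seq check in prepare enforces "slave only ACKs a phase 1 msg if
it has seen all prior phase 1 msgs", and how the CP alone decides when log
entries reach the real DB.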
As-yet unresolved issues:
  How to agree on the choice of a new primary?
  What if slaves are not fail-stop? I.e. malicious?
  What kind of performance are we likely to get?
    Every operation involves all the servers.
    So it's likely to be *slower* than a single server.

Can we replicate and still get high performance?
  Yes -- let's talk about Porcupine.