Replication in the Harp File System
Liskov, Ghemawat, Gruber, Johnson, Shrira, Williams
SOSP 1991

Key points
  2n+1 servers tolerate n simultaneous failures
  improvements over viewstamped replication:
    supports concurrent/overlapped operations, using a log to maintain order
    supports huge state, using the log to bring recovered disks up to date
    handles simultaneous power failures

Outline of basic operation. Roles: client, primary, backup, witness.
  Client -> Primary
  Primary -> Backups
  Backups -> Primary; primary waits for all backups
  Primary replies to Client
  Primary tells backups to commit

Why does Harp use a log?
  VSR kept only the status of the single most recent operation.
  Harp has many overlapped / concurrent operations.
  If there's a failure, it must recover a consistent prefix:
    if OP1 precedes OP2, it can't commit OP2 but not OP1 after a crash.
  What is in a typical log record?

Why does Harp have so many log pointers?
  FP: most recent client request
  CP: commit point (authoritative at the primary; a backup tracks the
    latest CP it has heard)
  AP: highest record sent to the disk on this node
  LB: the disk has completed writes up to here
  GLB: all nodes' disks have completed writes up to here

Why the FP-CP gap?
  So the primary doesn't have to wait for ACKs from every backup before
    starting the next operation.
  Higher throughput: it can overlap the wait for the previous op with
    execution of the next.
  Probably most useful when there are multiple clients.
  (There's a complex story for how to execute new operations given that
    previous OPs have not yet updated the FS state.)

Why the CP-AP gap? Why not apply to disk at the CP?
  Multiple operations may modify the same block,
  so by waiting we can do just one disk write, not many.

Why the AP-LB gap? I.e., why keep multiple writes active at the disk?
  Because the disk does better scheduling w/ concurrent writes.

Why the LB-GLB gap? I.e., why not delete a log record when its disk
    write completes?
  We can't throw away log records until we know *everyone* has applied
    them, because we might need our log to bring someone up to date.

How does failure recovery work? Scenarios:
  5 servers, S1-S5; S1 is usually primary, S2-S3 backups, S4-S5 witnesses.

S2 crashes
  New view. S4 is promoted (witness -> backup).
  S4 gets a copy of the log starting at the GLB (i.e., all ops not known
    to be on the disks).
  S4 starts logging all operations, but doesn't apply them (it has no
    file-system state).
  The GLB still advances over the nodes with disks, so the primary
    discards log entries; S4 retains everything, so its log grows long.

S2 reboots
  New view. S4 sends its big log to S2; S2 plays it to pick up all the
    missed operations.

S2 suffers a disk failure
  It needs to get the complete state from S1 or S3.

All nodes suffer a power failure just after S1 replies to a client.
  Then they all re-start. Can they continue?
  What if they had all crashed -- could they continue?
    Crash == lost memory contents (despite the UPS).
  How do they tell the difference?

S2 and S3 are partitioned (but still alive).
  Can S1+S4+S5 continue to process operations?

S4 moves to the S2/S3 partition.
  Can S2+S3+S4 continue?

S2 and S3 are partitioned (but still alive).
  S4 crashes, loses its memory contents, reboots in the S2/S3 partition.
  Can they continue?
  Depends on what S4's on-disk view # says.

Everybody suffers a power failure.
  S4's disk and memory are lost, but it does re-start after repair.
  S1 and S5 never recover.
  S2 and S3 save everything on disk and re-start just fine.
  Can S2+S3+S4 continue?

In general, how do you know you can form a view?
  1. No other view is possible.
  2. You know about the most recent view.
  3. You know all ops from the most recent view.
  #1 is true if you have n+1 nodes in the new view.
  #2 is true if you have n+1 nodes that did not lose their view # since
    the last view. The view # is stored on disk, so they just have to
    know the disk is OK. One of them *must* have been in the previous
    view, so just take the highest view number.
  Now that we know the last view number, #3 is true if we have a backup
    or promoted witness w/ intact state from that view.
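A minimal sketch of that view-formation check, in Go. This is my
illustration, not code from the paper; the Node fields and the name
canFormView are made up, and n is the number of tolerated failures in
a 2n+1 group.

  package harp

  // Node describes what a reachable server can report during a view
  // change; all fields here are illustrative assumptions.
  type Node struct {
      ViewNum     int  // view number recorded on this node's disk
      DiskOK      bool // disk survived, so ViewNum can be trusted
      StateIntact bool // log/FS state from its last view is intact
  }

  // canFormView reports whether the reachable nodes satisfy
  // conditions #1-#3 and so may safely form a new view.
  func canFormView(reachable []Node, n int) bool {
      // #1: n+1 of 2n+1 nodes is a majority, so no other view
      // can form concurrently.
      if len(reachable) < n+1 {
          return false
      }
      // #2: among n+1 nodes with intact disks, at least one was in
      // the most recent view, so the highest on-disk view number
      // is the latest one.
      latest, disksOK := -1, 0
      for _, nd := range reachable {
          if nd.DiskOK {
              disksOK++
              if nd.ViewNum > latest {
                  latest = nd.ViewNum
              }
          }
      }
      if disksOK < n+1 {
          return false
      }
      // #3: some backup or promoted witness from that latest view
      // must still hold intact state, so no committed op is lost.
      for _, nd := range reachable {
          if nd.DiskOK && nd.ViewNum == latest && nd.StateIntact {
              return true
          }
      }
      return false
  }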
S1 sends an op to S2, then S1 crashes.
  Will the op survive in the new view?
  What does correctness require?

S1 sends an op to S2, then S1/S2 are partitioned.
  The other nodes form a view and continue.
  Then S1/S2 re-join. What about that operation?

What if S1 had served a read just after sending the op to S2?
  A read of the data written by the op?
  Not possible: the primary only replies to a client at the CP,
  so if S1 replied to the read, all backups must have acked the write.

What do they use the UPS for?
  It gives them a few extra minutes after a power failure.
  Apply the log to the disk? Save the log to disk, but don't apply it?
  Suppose the primary replies to a client, then the power fails.
    The primary's CP shows that the operation committed.
    The backups do not know whether it committed.
    But since it did, they are not allowed to forget it,
    in case the primary doesn't recover.
  Thus backups must save the FP-CP region of the log to disk while on
    UPS power, and actually perform operations up through the CP.

What about a crash other than a power failure? Or if the UPS doesn't work?
  After reboot, the disk might be in bad shape,
  since it's just an FFS file system.
  So they have to run fsck?
  But will fsck leave the replicas with different contents?

Does Harp have performance benefits?
  Yes: thanks to the UPS, no synchronous disk writes are needed.
  But in general, not 3x performance.

Why graph x=load, y=response-time? Why does this graph make sense?
  Why not just graph the total time to perform X operations?
  One reason is that systems sometimes get more/less efficient w/ high load,
  and we care a lot about how they perform under overload.

Why does response time go up with load?
  Why gradual at first? Queuing and random bursts?
    And some ops are more expensive than others, causing temporary delays.
  Then almost straight up? Probably hard limits, like disk I/Os per second.
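The notes don't give a formula, but a standard M/M/1 queueing model (an
assumption on my part, not from the paper) reproduces that shape: mean
response time R = 1/(mu - lambda) for service rate mu and offered load
lambda. A tiny Go sketch with made-up numbers:

  package main

  import "fmt"

  func main() {
      // Assumed capacity, e.g. a disk's I/Os per second.
      const mu = 100.0
      // Sweep the offered load toward capacity.
      for _, lambda := range []float64{10, 50, 80, 90, 95, 99} {
          r := 1 / (mu - lambda) // M/M/1 mean response time, seconds
          fmt.Printf("load %3.0f%%  response %6.1f ms\n",
              100*lambda/mu, 1000*r)
      }
  }

Response time rises gently up to moderate load (11 ms at 10%, 50 ms at
80%) and then shoots up as lambda approaches mu (1000 ms at 99%), which
matches the gradual-then-nearly-vertical curve described above.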