Replication in the Harp File System
Liskov, Ghemawat, Gruber, Johnson, Shrira, Williams

Outline basic operation.
  Client, primary, backup, witness.
  Voting.
  Reply message.
  Log.

Why does Harp have so many log pointers?
  CP: commit point (real in primary, latest heard in backup)
  AP: highest record sent to disk on this node
  LB: disk has completed up to here
  GLB: all nodes have completed disk up to here?

At the primary, what does it mean to be before/after CP?
How about at the backup?
  Actually you can't tell in either case...

Why AP -- why not apply to disk as committed, at CP?
Why is LB not AP-1?
Why do we need GLB?
  Allows us to discard log entries.
  Why not discard at LB?
  In case another node lost log, but disk is OK.
  Though doesn't UPS protect against that?
  No: crashes due to software will lose the log.

When can Harp reclaim log space?

Why do they have to have a log at all?
  Need state of partially completed operation, i.e. being committed.
  But mostly to allow concurrent operations.
  Linear because operations must be ordered.
  If OP1, then OP2, can't commit OP2 but not OP1.
  If crashes &c.

What if power failure, operation committed, but not done writing?
  All nodes lose power.
  Already replied to client, can't forget about committed operations.
  UPS...

What is the point of the witness if it doesn't store data?
  Breaks ties, ensures majority partition.

If primary fails, what does witness do?
  Promoted to pseudo-backup.
  Has no copy of file system.
  Logs all messages, even before GLB.
  On disk/tape.
  What is the point?
  Witness has log required to bring old primary up to date when it recovers.
  And new primary might fail.
  We cannot continue then, since we don't have majority.
  But we can still restore stable storage when someone recovers.
  Assuming they recover with disk intact.

What if running w/ bare majority, primary fails, 3rd node revives at same time?
  Does Harp form a new view?
  Or where/why exactly does it decide not to?

Why is serving reads just on the primary complex?
  What if primary just became a minority partition?
  Read will miss committed writes...

Could Harp operate over a WAN?
  Or do the machines have to be in same building?

What exactly is the UPS for?
  I think only for simultaneous power failure.
  They don't depend on it to recover from partial failure.
  Other nodes' logs are enough for that.

Do we believe UPS story?
  What if the UPS battery runs out?
  They flush to disk and halt (?) when main power fails.
  So not as vulnerable as a RAID controller battery.

What exactly are the failures they can survive?
  One node permanently fails, or loses network connection.
  Network separates all nodes, then they re-join?
  Witness and backup permanently fail?
  All nodes reboot w/o losing any non-volatile data?

Why not just use one server with a UPS?
  Perhaps with a RAID array.

Does Harp have performance benefits?
  Yes, due to UPS, no need for sync disk writes.
  But in general, not 3x performance.
  But maybe if you had 3 file systems, could get 3x (???) performance.
  Or at least use the witness for something useful.

Why graph x=load y=response-time?
  Why does this graph make sense?
  Why not just graph total time to perform X operations?
  One reason is that systems sometimes get more/less efficient w/ high load.
  And we care a lot how they perform w/ overload.

Why does response time go up with load?
  Why first gradual...
  Queuing and random bursts?
  And some ops more expensive than others, cause temp delays.
  Then almost straight up?
  Probably has hard limits, like disk I/Os per second.
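
A rough way to see the "gradual, then almost straight up" shape is a sketch under an assumed queuing model. The notes only say "queuing" and "hard limits"; the M/M/1 formula below (mean response time = service time / (1 - load)) is a textbook stand-in, not something taken from the Harp paper or its measurements. The function name and the 10 ms service time are hypothetical.

```python
def mm1_response_time(service_time, load):
    """Mean response time of an M/M/1 queue (assumed model, not from the Harp paper).

    service_time: mean time to serve one operation with no queuing, in seconds
    load: utilization (offered rate / capacity), must be in [0, 1)
    """
    if not 0 <= load < 1:
        # At or beyond capacity the queue grows without bound: the "hard limit".
        raise ValueError("load must be in [0, 1)")
    return service_time / (1.0 - load)


if __name__ == "__main__":
    service_time = 0.010  # hypothetical: 10 ms per operation when idle
    for load in (0.1, 0.5, 0.8, 0.9, 0.95, 0.99):
        ms = mm1_response_time(service_time, load) * 1000
        print(f"load={load:.2f}  response={ms:.1f} ms")
```

Under this model response time only doubles between idle and 50% load, but grows tenfold by 90% load and keeps climbing as load nears capacity. That matches the qualitative curve: gradual at first from random bursts queuing up, then nearly vertical once a fixed resource such as disk I/Os per second saturates.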