6.824 Lecture 9: Eventual Consistency Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System Terry, Theimer, Petersen, Demers, Spreitzer, Hauser [see also companion paper about update propagation] Big picture Last lecture: file sync, optimistic consistency, detect conflicts This lecture: automatic conflict resolution update functions update log logical clocks eventual consistency Let's build a meeting scheduler Only one meeting allowed at a time (one room). Each entry has a time and a description. We want everyone to end up seeing the same set of entries. Traditional approach: one server Server processes requests one at a time Checks for conflicting time, says yes or no Updates DB Proceeds to next request Server implicitly chooses order for concurrent requests Why aren't we satisfied with central server? I want my calendar on my iPhone. I.e. database replicated in every node. Modify on any node, as well as read. Periodic connectivity to net. Periodic bluetooth contact with other calendar users. Straw man 1: merge DBs. Similar to iPhone calendar sync, or file sync. Might require lots of network b/w. What if there's a conflict? IE two meetings at same time. iPhone just schedules them both! But we want automatic conflict resolution. We can't just view DB items as bits. Since then we can't resolve conflicts. Idea: update functions Have update be a function, not a new value. Read current state of DB, decide best change. E.g. "Meeting at 9 if all are free, otherwise 10, otherwise 11." Rather than just "Meeting at 9" Function must be deterministic Otherwise nodes will get different answers Challenge: A: meeting at 10:00 or 11:00 B: metting at 10:00 or 11:00 X syncs w/ A, then B Y syncs w/ B, then A Will X put A's meeting at 10:00, and Y put A's at 11:00? Goal: eventual consistency OK for X and Y to disagree initially But after enough syncing, everyone should agree Idea: ordered update log Ordered list of updates at each node. DB is result of applying updates in order. Syncing == ensure both nodes have same updates in log. How can nodes agree on update order? Update ID: Assigned by node that creates the update. Order rules: a < b if a_time < b_time or (a_time = b_time and a_ID < b_ID) Example: <10,A>: staff meeting at 10:00 or 11:00 <20,B>: hiring meeting at 10:00 or 11:00 What's the correct final outcome? the result of executing update functions in timestamp order staff at 10:00, hiring at 11:00 What's the status before any syncs? I.e. content of each node's DB A: staff at 10:00 B: hiring at 10:00 This is what A/B user will see before syncing. Now A and B sync with each other Both now know the full set of updates Can each just run the new update function against its DB? A: staff at 10, hiring at 11 B: hiring at 10, staff at 11 That's not the right answer! Roll back and replay Bayou re-runs *all* update functions from the start Starting from empty DB Since A and B have same set of updates they will arrive at same final DB We will see how to optimize this in a bit Displayed calendar entries are "tentative" B's user saw hiring at 10, then changed to hiring at 11 You never know if there's some update from the past you haven't seen. That will be ordered into the middle of your log, requiring roll-back. Will update order be consistent with real time? Maybe A saw B create update first And A only created update later Node clocks are probably not synchronized Thus A's later update got earlier time stamp And A's meeting got priority, even though B asked first Will update order be consistent with causality? What if you add a meeting, then I see it on my node after sync, then I delete your meeting. <10,A> add <9,B> delete -- my clock is slow Now my delete will be ordered before your add! That would not make sense. Differs from real-time case b/c system *knew* I had seen the add. Logical clocks Want to timestamp events s.t. node observes E1, then generates E2, TS(E2) > TS(E1) Thus other nodes will order E1 and E2 the same way. Each node keeps a clock T increments T as real time passes, one second per second T = max(T, T'+1) if sees T' from another node Note properties: E1 then E2 on same node => TS(E1) < TS(E2) BUT TS(E1) < TS(E2) does not imply E1 came before E2 Invented by Lamport How can we have a notion of committing a tentative entry? I.e. freeze the order up until some point, disallow any new updates before that point. So that we can know for sure when a meeting takes place. And trip the update log. How does Bayou agree on total order of committed updates? One node designated "primary replica". It marks each update it receives with a permanent CSN. Commit Sequence Number. That update is committed. So a complete time stamp is CSN notifications are exchanged between nodes. The CSNs define a total order for committed updates. All nodes will eventually agree on it. Uncommitted updates come after all committed updates. Will commit order match tentative order? Sometimes yes. During sync, nodes send update records in log order. Including updates learned from other nodes. Called "prefix property" of syncing. So if A's update log says <-,10,X> <-,20,A> A will send both to primary, in that order Primary will assign CSNs in that order Commit order will, in this case, match tentative order Will commit order always match tentative order? No: primary may see newer updates before older ones. A has just: <-,10,A> M1 B has just: <-,20,B> M2 A syncs to C, B syncs to C. C's order: M1 M2 B syncs with primary, gets CSN=5. Later A syncs w/ primary, gets CSN=6. When C syncs w/ primary, order will change to M2 M1 <5,20,B> M1 <6,10,A> M2 So: committing may change order. Committing allows us to tell users which calendar entries are stable. Nodes can discard committed updates. Instead, keep a copy of the DB as of the highest CSN. Roll back to that DB when replaying tentative update log. Never need to roll back farther. Prefix property guarantees seen CSN=x => seen CSN <-,20,Y> <-,30,X> <-,40,X> B has: <-,10,X> <-,20,Y> <-,30,X> At start of sync, B tells A "X 30, Y 20" Sync prefix means B has all X updates before 30, all Y before 20 This is a version vector How did all this work out? Replicas, write any copy, and sync are good ideas Though WiFi and cell-phone have made disconnected operation less compelling A central server works fine I.e. you don't need peer-to-peer sync Protocol much simpler since central server does all resolution Bayou introduced some very influential design ideas Update functions Ordered update log is the real truth, not the DB Allowed general purpose conflict resolution Bayou made good use of existing ideas Eventual consistency Logical clock