6.824 Lecture 9: Eventual Consistency

Managing Update Conflicts in Bayou,
a Weakly Connected Replicated Storage System
Terry, Theimer, Petersen, Demers, Spreitzer, Hauser

[note that ordering the updates is described in a different paper]

Let's build a calendar system to help understand ordering and conflicts.
 Each entry has a time and a set of participants.
 We want everyone to end up seeing the same set of entries.

Traditional approach: one calendar server.
 Ordering: only one copy, server picks the order.
 Conflict resolution: server checks for conflicts before accepting update.
   Returns error to user, who can hand in any way desired.

Why aren't we satisfied with central server?
 I want my calendar on my Palm Pilot.
   I.e. database replicated in every node.
   No master copy.
 Periodic connectivity to net.
 Periodic infrared contact with other calendar users.

Straw man 1: swap complete DBs.
 Similar to Palm Pilot sync.
 Might requires lots of network b/w.
 What if there's a conflict? IE two meetings at same time.
   Palm just schedules them both!
   But we want automatic  conflict resolution.
 We can't just view DB items as bits.
   Since then we can't resolve conflicts.
 We want intelligent DB items that know how to resolve conflicts.
   They are more like updates: read DB, think, make a change.
   But we have to make sure that all nodes resolve conflicts the same way.
   How?

Insight:
 Maintain an ordered list of updates at each node.
 Make sure every node has the same updates.
 Make sure every node applies the updates in the same order.
 Make sure the updates are deterministic functions of DB contents.
 Then a "sync" is really a merge of two ordered lists: easy.
   Not a DB merge at all.

What are the write log entries?
 Why not just "10am meeting with RTM and Frans"?
   Because we can't resolve conflicts.
 Instead, "1-hour meeting w/ RTM and Frans at 9, otherwise 10, otherwise 11."
   Along with a unique ID: <local-time-stamp, originating-node>
   This is really instructions for doing a write, not the written data.
   So the write log is really an instruction in the distributed calendar program.
     We want all nodes to run same instructions in same order. Eventually.

Example:
 <701,A>: Node A asks for meeting M1 to occur at 10am, otherwise 11am.
 <770,B>: Node B asks for meeting M2 to occur at 10am, otherwise 11am.
 Let's agree to sort by write ID (e.g. <701,A>).
 As "writes" spread, nodes may initially apply updates in different order.
 Each newly seen write is merged into the log,
   and the log is replayed,
   which may cause the calendar displayed to the user to change.
   I.e. all entries are really "tentative", nothing is stable.
 But when everybody has seen all the writes, everybody will agree.

Global time sync is not possible.
 Does that make this particular scheme impossible?
 No. We're just using time stamps to allow agreement on order.
   Doesn't matter if node clocks are wrong.
   As long as users don't depend on reasoning about real time.
 But this scheme arbitrarily constrains order.
   You never know if there's some write from the past you haven't seen.
   So all entries must be tentative forever.
   And you have to keep the log around forever.

How can we allow a notion of committing a tentative entry?
 So we can have the meetings! And trim the logs.
 For an entry X to be committed, everyone must agree on:
   The total order of all previous committed entries.
   The fact that X is the next in the total order.
   The fact that all uncommitted entries are "after" X.

How does Bayou agree on total order of committed writes?
 One designated "primary replica".
 It marks each write it receives with a permanent CSN.
   Commit Sequence Number.
   That write is committed.
   So a complete time stamp is <CSN, local-TS, node-id>
 CSN notifications are exchanged between nodes.
 The CSNs define a total order for committed writes.
   All nodes will eventually agree on it.
   Uncommitted writes come after all committed writes.

Do we know know enough to show user "committed" flag with entries?
 Not quite.
 Entire log up to that committed entry must be stable.
   Otherwise there might be earlier committed write we don't know about.
   And we'd have to re-run conflict resolution.
 So a committed write isn't stable unless we've seen all prior committed writes.
 Bayou update propagation protocol maintains this.
   Propagates writes in order.

Now DB entries can be shown to user as committed.
 Which means everyone does (or will) agree on them.
 And a slow or disconnected node cannot prevent commits.
 Nodes may still disagree on the meaning of uncommitted writes, though.
   Even if they have synced with each other!
     *** well, only if one then sees a CSN that the other doesn't see.
   Only arrival of CSNs will resolve.
   EXAMPLE
     Roll-back, or start at stable DB as of highest CSN seen?

Now nodes can discard log entries with CSNs.
 As long as they've seen every CSN up to that point.
   (Turns out to be guaranteed by propagation protocol.)
 Instead, keep a copy of the database as of the highest CSN.
 That *is* the offical committed database.
   Everybody does (or will) agree on it.
   Its entries will never need to have conflict resolution.
 So you don't need to keep years of calendar write log entries.

Is is OK if primary replica can choose *any* order to commit?
 Suppose I schedule an event, then want to delete it?
   Or change attendee list?
 The create must precede the delete in the CSN order!
   And in every node's view of uncommitted part of log too.
 Total order must preserve order of writes originated at each node.
   But not necessarily order among different nodes' writes.
 How can primary replica be sure it commits each node's writes in order?
 1) Nodes actually use Lamport logical time-stamps for local TS.
 2) Everybody sends updates in order.
 So primary sees updates in causal order, and commits them in that order.

How do I propagate if I've discarded part of my log?
 Suppose I've discarded *all* writes with CSNs.
 I keep a copy of the stable DB reflecting just discarded entries.
 First, I know I cannot receive writes that conflict.
   Could only happen if write has a CSN < one discarded.
   But I already saw it, in the right order, so can ignore.
 When I propagate to node X:
   If node X's highest CSN is less than mine,
     I can send him my stable DB reflecting just committed writes.
     Node X can use my DB as starting point.
     And X can discard all CSN log entries.
     Then play his tentative writes into that DB.
   If node X's highest CSN is greater than mine,
     X can ignore my DB.

Evaluation
 Seems much more functional than Palm Pilot calendar.
 *Not* transparent to applications!
   Requires very strange programming practices.
   Every "write" is actually a bunch of code, not the new bits.
   Check conflicts (meeting already scheduled for any attendee?).
   Resolve conflicts (choose a different unused meeting time?).
 Not every application has appropriate semantics.
   Might actually work for a bank account!
   But conflicts can't always be automatically resolved.
     For example, changes to source code from multiple programmers.