6.824 2001 Lecture 22: View Changes

In a primary-backup voting system, it's not enough for each operation
to be sent to just any majority of nodes.
  Then different nodes see different operation sequences,
  so the replicas will diverge.
  [Though note Gifford's voting system works like this.]

There must be a coherent notion of the current majority.
Changes to that majority must be controlled:
  The new majority must start from the state left by the previous majority.
  So you have to know what the previous majority was.
  So a majority of the old view must be carried into the new one.
  E.g. if you have just 2 of 5, perhaps they were in the minority,
  and missed some operations.

Let's call an operating majority of nodes a "view".

What can go wrong with transitions between views? Assume a 5-node system.
  1. Operate with V1 = A, B, C, D, E.
  2. D and E crash, so now V2 = A, B, C.
  3. A and B are partitioned off, D revives, so V3 = A, B, D.
  4. D crashes, then revives w/o state; E revives; C, D, and E end up
     in the same network partition, so (again) V3 = C, D, E.
But now we've lost our total order of views, and hence of operations.
From the client's point of view, the system claims to work, but it has
lost committed operations (those from the first V3).

A view change occurs when the set of live nodes seems to have changed.
  It finds a majority for the new view.
    Must include a majority of the previous view.
    Uses that to decide which operations from the previous view committed.
  Makes sure the new view starts with an identical set of operations.
  Chooses a new primary, perhaps the node w/ highest node ID.

Big picture:
  Sequence of views, totally ordered.
    Always a "next" and "previous" view.
  Operations totally ordered within each view.
    Agreement on which committed at the end of the view.
  A unique new view picks up the old view's final state.
  So we have a definition of the total order of operations,
  even though we have a changing set of nodes.
  And thus we have a well-defined current state.

Interesting cases:
  Primary fails.
  Operation interrupted by a view change.
  Node fails or recovers during a view change. I.e.
multiple view changes started at the same time.

What happens in the minority partition?
  The view change must fail in that case!

Every server has a current view_number in stable storage.
  view_number = < sequence_number, initiating_server_id >
  The second part helps break ties.

View change protocol:
  First, the initiating server chooses a new view number:
    new view_number = < old_sequence_number + 1, initiating_server_id >
    (using its own server ID).
  The initiating node broadcasts a prepare-view-change message.
    Contains the new view_number.
    Recipients stop processing operations.
    Recipients send their log tail to the initiator.
  The initiator waits for a majority of replies.
    Maybe a little longer, so that the view includes all live nodes.
  The initiator decides which operations are before/after the view change.
    I.e. which have committed and must be preserved.
    Others are aborted, and must be re-started by the client.
  The initiator sends a phase-2 message to each replica:
    List of nodes (including who is primary).
    List of committed recent operations.
  Now the replicas can continue.

What if there are two concurrent view changes?
  Must retain a notion of views occurring in a particular order!
    Otherwise we cannot show continuity from one majority to the next.
    I.e. this is the failure/recovery-during-view-change case.
  initiating_server_id breaks the tie, so we know which is first/second.
  If the first got a majority, the second did not.
    So complete the first view change; abort and maybe re-do the second.
  If the second got a majority, the first aborts.
  But perhaps neither got a majority.
    Then both must abort, delay (randomly???), and try again.
    Maybe delays proportional to node ID, so the retry has a
    deterministic winner.

Big point:
  Always at most one functioning view -- thus at most one primary.
  Operations totally ordered within each view.
  No operation spans a view.
  Views are also totally ordered.
  A majority carries over from each view to the next, and with it
  knowledge of the order of committed operations.

What if the initiator fails during the view change?
  We either do or don't have evidence that he completed the view change.
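The view_number scheme above can be sketched as follows. This is a
minimal illustration, not code from the lecture; the names ViewNumber
and next_view are hypothetical. It shows how comparing
< sequence_number, initiating_server_id > pairs both orders successive
views and breaks the tie between two concurrent view changes.

```python
from dataclasses import dataclass

# order=True compares fields in declaration order, so the sequence
# number dominates and the initiating server's ID breaks ties.
@dataclass(frozen=True, order=True)
class ViewNumber:
    sequence_number: int
    initiating_server_id: int

def next_view(current: ViewNumber, my_server_id: int) -> ViewNumber:
    # New view number: bump the sequence, stamp with our own server ID.
    return ViewNumber(current.sequence_number + 1, my_server_id)

# Servers 2 and 5 both try to succeed view <7, 1> concurrently:
v = ViewNumber(7, 1)
a = next_view(v, 2)   # <8, 2>
b = next_view(v, 5)   # <8, 5>
assert a < b          # server 2's view change is "first"; 5's must yield
```

Either contender's view number is greater than the old view's, so a
server that has ACKed <8, 5> can still recognize and reject the stale <8, 2>.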
If he committed operations, it's a real view, and we must preserve it.
  He must have had a majority.
  Make sure we get a majority of the nodes from it.
If he committed no operations, it doesn't matter.

What if a view change starts in a minority partition?
  The initiator is unable to acquire a majority of ACKs, so it fails.

Data recovery procedure: [same as in Lecture 21]
  I.e. how to deal with operations interrupted by the view change.
  Need to decide which operations to carry over from the old view.
  Must commit any operation the old primary might have responded to.
  Can abort any operation the old primary couldn't have responded to.
    The client will re-try in the new view.
  The old primary only committed if it got ACKs from n+1 of the
  2n+1 replicas.
  If the primary fails, and we expect to continue, a majority must survive.
    So the old view had *more* than a majority.
  Thus if the primary committed an operation, a majority must know about
  it after the view change.
  So that's the criterion for accepting operations into the new view.

Does this preserve the ordering of committed operations?
  Suppose OP1 and OP2 are concurrent, but in that order.
    Remember, operations are assigned sequence numbers when submitted
    to the primary.
  Maybe the prepare messages for OP1 were lost,
  so the primary gets n+1 ACKs for OP2 first.
  Then the primary crashes.
Consequences:
  The primary cannot reply to the client until all previous operations
  have committed.
  The new primary cannot commit an operation unless all previous ops
  can be shown to have (possibly) committed.
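The carry-over criterion above can be sketched as a simple count. This
is an illustrative sketch, not the lecture's code: it assumes 2n+1
replicas, represents each participant's log tail as a set of operation
sequence numbers, and the helper name ops_to_carry_over is hypothetical.
An operation is preserved in the new view iff a majority (n+1) of the
view-change participants report it, since that is the only way the old
primary could have replied to the client.

```python
from collections import Counter

def ops_to_carry_over(log_tails: list[set[int]], n: int) -> set[int]:
    # 2n+1 replicas total.  The old primary replied to a client only
    # after n+1 ACKs, and more than a majority of the old view survives,
    # so every committed op appears in >= n+1 of the reported log tails.
    counts = Counter(op for tail in log_tails for op in tail)
    return {op for op, c in counts.items() if c >= n + 1}

# n = 2 (5 replicas).  Ops 10 and 11 reached a majority and must be
# committed in the new view; op 12 did not, so it can be aborted and
# re-tried by the client.
tails = [{10, 11}, {10, 11, 12}, {10, 11}, {10}]
assert ops_to_carry_over(tails, 2) == {10, 11}
```

Note this rule may commit an operation the old primary never actually
replied to -- that is safe; only losing an acknowledged operation is not.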