6.824 2002 Lecture 20: Viewstamped Replication (2)

Solution: Viewstamped Replication
  Harp uses a variant of viewstamped replication.
  I'll present a simplified form.
  More suitable for e.g. lock server than for file server.

Properties:
  Provides sequential consistency / one-copy serializability.
  Assume there are 2b + 1 replicas.
  Can tolerate failure (node or net) of up to b replicas. Any of them.
  Handles partition correctly:
    If one partition has b + 1 or more, that partition can proceed.
    A partition with b or fewer replicas blocks; does nothing.

Overview of Viewstamped Replication
  System goes through a series of views
  A "view" consists of a primary and the replicas it can talk to
  In each view:
    Elect a primary
    Determine which other replicas are available
    Recover state from previous view
    Primary accepts client operations, sends to other replicas
  A view ends when any replica in the view sees a change in the set of
    reachable replicas. E.g. primary crashes, a backup crashes,
    a dead replica comes back to life.
    "View Change" algorithm constructs a new view.
  It's view change that make this better than two-phase commit.

Why the system is going to be correct
  I.e. provide sequential consistency
  A view must have a majority, so one view at a time
  Primary sequences client operations within each view
  Primary sends each operation to every server in the view
  Majority ensures each view has a representative from previous view
    So all operations that completed in previous view known to next view

Data types:
  viewid: <viewcount, replica_id>
  viewstamp: <viewid, opcount>

State maintained by each replica:
  cur_viewid
  data
  last_viewstamp
  max_viewid
  crashed

cur_viewid is on disk, preserved even if there's a crash.
  Others (including data) are in memory, lost in a crash.

Operation within a view:
  View consists of primary and at least b other replicas.
  Clients send operations to the primary.
    They know because non-primary replicas will redirect them.
  Primary picks the next client operation.
  Primary assigns that operation the next viewstamp.
  Primary sends the operation+viewstamp to all the replicas *in the view*.
  Primary waits for all ACKs.
    (otherwise different replicas apply different subsets of operations.)
  Primary sends ACK to client.

When does a View Change occur?
  Primary notices a backup in the view has died.
  Any node in the view notices the primary has died.
  Any node in the view notices a node outside the view has revived.

First step: pick a new primary and a new view #
  One or more servers send invitations (maybe simultaneously)
  Each invite contains a new viewid: higher than max_viewid.
  We want highest viewid to win:
    higher viewcount always wins.
    If viewcounts the same, higher replica_id wins.
  If a node sees an invite with viewid > max_viewid,
    drop everything and participate in election.
  Reachable replicas will agree on a viewid if nothing fails/joins
    for long enough.
  Node that proposed the winning viewid is the new primary.

Primary needs to reconstruct state.
  Which server (if any) has up-to-date state, reflecting all commits?
  We want to maintain sequental consistency.
  So primary needs to establish three properties:
  1. At most one view at a time.
  2. New view knows about previous view.
  3. New view knows last state of previous view (e.g. last committed op).

Property 1 is satisfied if primary assembles a majority.

Property 2 is also satisfied with a majority.
  Any majority must include a server from the previous view.
  That server has the previous view's viewid on disk.
    Even if it crashed.
  We're assuming disks don't fail.
    So we're really assuming a majority with intact disks.

Property 3 is satisfied if we have one non-crashed server
  from the previous view.
  Since all servers in previous view saw all committed operations.
  Old primary would not have committed w/o ACK from every replica in old view.
  So if our one replica doesn't have the viewstamp, it didn't commit.

New primary sends all servers in view the state from the server
  it has chosen.
  All servers in view write cur_viewid to disk.
  New primary can now start processing client requests.

What about operations that hadn't quite committed when the view change started?
  Primary had sent operation to one server, but not all of them.
  Since primary had not gotten ACKs, it didn't send response to client.
  So we can commit it, or not commit it; either is legal.
  Depends on which server primary chooses as "most recent state".

When *can't* we form a new view?
  A, B, C
  A is primary, B is backup, C is separated.
  A crashes.
  B is separated from A and C.
  C's network connection comes back.
  A reboots.
  A and C between them are a majority, but don't have latest state.
  How do they know?
    A kept last cur_viewid on disk.
    So it can tell C isn't up to date.
    And A knows it isn't up to date either, since it crashed.
  Must wait for B to rejoin; it has the latest state.

Is this system perfect?
  Must copy full state around at each view change.
    OK if lock service, a disaster if NFS service.
  Vulnerable to power failures.
    Nodes lose state if power fails.
    May strike all nodes at the same time.
  Primary executes operations one at a time.
    This is probably slow, cannot overlap execute of one
      with send of the next.
    Would be even slower if every op had to be written to disk.