6.824 2004 Lecture 18: Viewstamped Replication

Suppose we want to build a replicated service, for high availability.
  We want to tolerate replica or network failure and keep going.
  Example: lock manager.
  Example: file server.

Why not use two-phase commit?
  There are many situations in which 2PC blocks due to just one server being down.
  2PC is designed for situations where *all* nodes are necessary.
  If the servers are replicas, we shouldn't have to wait for all of them.
  But we need to be careful to keep them true replicas despite failures.
  The hard part is really partition.
    What if concurrent operations run in two partitions?

What's the overall plan?
  Clients, and a group of replicated servers.
  The servers run "replicated state machines": same operations, same order.
  Operations might be reads/writes of particular data items.
    Or acquire/release of locks. Or NFS operations.
  To avoid an external lock manager, we'll assume each operation is atomic:
    e.g. a rename operation, not link+unlink; a transfer operation, not debit+credit.
  In any event, the only issue is agreement on what operations have occurred,
    and in what order.
  So we can boil everything down to numbering operations.

Idea for order: primary-copy.
  Elect a primary.
  Clients send all operations through the primary.
  The primary processes client operations in some order.
  The primary sends each operation to the backups and waits for ACKs.
  If the primary fails, clients start using a backup as the primary.
  Problem: the status of the last operation if the primary fails.
  Problem: a network partition might lead to two primaries.
  Problem: a network or backup failure may leave a replica out of date.
    Unlike Hypervisor, we'd like to bring failed servers back to life.

Idea to handle partition: quorum.
  Make sure every operation consults a majority of the replicas.
  A quorum ensures that you proceed in at most one partition.
    Which avoids non-serializable histories.
  A quorum ensures that at least one replica saw the previous operation.
    So a read following a write is guaranteed to see one copy of the latest data.
    But how does the reader know which replica is most recent?
  But quorums do not directly provide serialization.
  And quorums allow different replicas to see different operations.
    That's fine if you update replicated data by completely replacing it.
    Not OK if operations are true modifications, e.g. NFS.

Solution: Viewstamped Replication
  Harp uses a variant of viewstamped replication.
  I'll present a simplified form,
    more suitable for e.g. a lock server than for a file server.
  Properties:
    Provides sequential consistency / one-copy serializability.
    Assume there are 2b+1 replicas.
    Can tolerate failure (node or net) of up to b replicas.
      This includes primary failures.
    Will continue operating if b+1 replicas are up, unlike 2PC.
    Handles partition correctly:
      If one partition has b+1 or more replicas, that partition can proceed.
      A partition with b or fewer replicas blocks; it does nothing.

Overview of Viewstamped Replication
  The system goes through a series of views, one at a time.
  A "view" consists of a primary and the replicas it can talk to.
  In each view:
    Elect a primary.
    Determine which other replicas are available.
    Recover state from the previous view.
    The primary accepts client operations and sends them to the other replicas.
  A view ends when any replica in the view sees a change in the set of reachable replicas.
    E.g. the primary crashes, a backup crashes, or a dead replica comes back to life.
  The "view change" algorithm constructs a new view.
  It's the view change that makes this better than two-phase commit.

Why the system is going to be correct, i.e. provides sequential consistency:
  A view must have a majority, so there is one view at a time.
  The primary sequences client operations within each view.
  The primary sends each operation to every server in the view.
  A majority ensures each view has a representative from the previous view.
    So all operations that completed in the previous view are known to the next view.
    This part is harder than it seems, and it's the core of the technique.

Data types:
  viewid: a view count plus the id of the replica that proposed the view.
  viewstamp: a viewid plus a per-view operation number.

State maintained by each replica:
  cur_viewid
  data
  last_viewstamp
  max_viewid
  crashed
  cur_viewid is on disk, preserved even if there's a crash.
  The others (including data) are in memory and lost in a crash.

Operation within a view:
  A view consists of the primary and at least b other replicas.
  Clients send operations to the primary.
    They know which replica that is because non-primary replicas will redirect them.
  The primary picks the next client operation.
  The primary assigns that operation the next viewstamp.
  The primary sends the operation+viewstamp to all the replicas *in the view*.
    Thus it's possible to proceed despite some failed replicas.
  The primary waits for ACKs from all of them.
    (Otherwise different replicas would apply different subsets of the operations.)
  The primary sends an ACK to the client.
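To make the normal case concrete, here is a minimal Go sketch of the primary's per-operation loop: assign the next viewstamp, send the operation to every replica in the view, wait for an ACK from all of them, and only then acknowledge the client. The names (Primary, Submit) and the channel-based stand-in for RPC are illustrative assumptions, not from the paper.

  package main

  import "fmt"

  // Illustrative types: a viewid is (view count, proposing replica's id);
  // a viewstamp is a viewid plus a per-view operation number.
  type ViewID struct {
      Count     int
      ReplicaID int
  }

  type Viewstamp struct {
      View  ViewID
      OpNum int
  }

  type Op struct {
      Stamp Viewstamp
      Body  string
  }

  // Primary holds just what this sketch needs; channels stand in for RPC
  // connections to the other replicas in the current view.
  type Primary struct {
      view    ViewID
      nextOp  int
      backups []chan Op
      acks    chan bool
  }

  // Submit handles one client operation: assign the next viewstamp, send the
  // operation to every replica in the view, wait for an ACK from each one,
  // and only then report success (i.e. ACK the client).
  func (p *Primary) Submit(body string) error {
      p.nextOp++
      op := Op{Stamp: Viewstamp{View: p.view, OpNum: p.nextOp}, Body: body}
      for _, b := range p.backups {
          b <- op
      }
      for range p.backups {
          if ok := <-p.acks; !ok {
              // A missing or negative ACK means the op is not committed; the
              // primary must not reply to the client, and a view change follows.
              return fmt.Errorf("backup did not ACK; view change needed")
          }
      }
      return nil
  }

  func main() {
      p := &Primary{view: ViewID{Count: 1, ReplicaID: 0}, acks: make(chan bool)}
      // Two backups that apply each operation in viewstamp order and ACK it.
      for i := 1; i <= 2; i++ {
          ch := make(chan Op, 1)
          p.backups = append(p.backups, ch)
          go func(id int, in chan Op) {
              for op := range in {
                  fmt.Printf("backup %d applied op %d of view %v\n", id, op.Stamp.OpNum, op.Stamp.View)
                  p.acks <- true
              }
          }(i, ch)
      }
      fmt.Println(p.Submit("acquire lock 17")) // <nil> once both backups have ACKed
  }

The key point is the wait for *all* ACKs: without it, different replicas in the view could apply different subsets of the operations.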
When does a view change occur?
  The primary notices that a backup in the view has died.
  Any node in the view notices that the primary has died.
  Any node in the view notices that a node outside the view has revived.

First step: pick a new primary and a new view number.
  One or more servers send invitations (maybe simultaneously).
  Each invitation contains a new viewid, higher than max_viewid.
  We want the highest viewid to win: a higher view count always wins;
    if the view counts are the same, the higher replica_id wins
    (see the code sketch after this part of the notes).
  If a node sees an invitation with viewid > max_viewid, it drops everything
    and participates in the election.
  Reachable replicas will agree on a viewid if nothing fails or joins for long enough.
  The node that proposed the winning viewid is the new primary.

The new primary needs to reconstruct state.
  Which server (if any) has up-to-date state, reflecting all commits?
  We want to maintain sequential consistency.
  So the primary needs to establish three properties:
    1. At most one view at a time, so there is a well-defined "previous view".
    2. The new view knows what the previous view and its viewid were.
    3. The new view knows the last state of the previous view.
  Property 1 is satisfied if the primary assembles a majority.
  Property 2 is also satisfied with a majority:
    any majority must include a server from the previous view,
    and that server has the previous view's viewid on disk, even if it crashed.
    We're assuming disks don't fail.
      So we're really assuming a majority with intact disks.
  Property 3 is satisfied if we have one non-crashed server from the previous view,
    since all servers in the previous view saw all committed operations:
    the old primary would not have committed without an ACK from every replica in the old view.
    So if our one replica doesn't have the viewstamp, the operation didn't commit.
  The new primary sends all servers in the view the state from the server it has chosen.
  All servers in the view write cur_viewid to disk.
  The new primary can now start processing client requests.

What about operations that hadn't quite committed when the view change started?
  The primary had sent the operation to one server, but not to all of them.
  Since the primary had not gotten all the ACKs, it didn't send a response to the client.
  So we can commit it, or not commit it; either is legal.
    It depends on which server the primary chooses as the "most recent state".
  The client may re-send, so the primary needs to know whether an operation is a duplicate.
    So somehow transaction IDs or RPC sequence numbers must be in the state?
    Or operations must be safely repeatable.
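Here is the viewid comparison from the election step above as a small Go sketch. The Replica type and method names are made up for illustration: a proposal's viewid is strictly greater than anything the proposer has seen, a higher view count always wins, ties are broken by replica id, and a replica that receives a better invitation drops what it is doing and joins that election.

  package main

  import "fmt"

  // A viewid orders views: compare view counts first, then replica ids.
  type ViewID struct {
      Count     int
      ReplicaID int
  }

  // Less reports whether a orders before b: a higher count wins, and on a
  // tie the higher replica id wins.
  func (a ViewID) Less(b ViewID) bool {
      if a.Count != b.Count {
          return a.Count < b.Count
      }
      return a.ReplicaID < b.ReplicaID
  }

  // Replica keeps the highest viewid it has ever heard of (max_viewid).
  type Replica struct {
      ID        int
      MaxViewID ViewID
  }

  // Invite is what a replica does when it proposes a new view: pick a viewid
  // strictly greater than anything it has seen, with its own id as tie-breaker.
  func (r *Replica) Invite() ViewID {
      v := ViewID{Count: r.MaxViewID.Count + 1, ReplicaID: r.ID}
      r.MaxViewID = v
      return v
  }

  // OnInvite is what a replica does on receiving an invitation: if the
  // proposed viewid beats max_viewid, drop everything and join that election.
  func (r *Replica) OnInvite(v ViewID) bool {
      if r.MaxViewID.Less(v) {
          r.MaxViewID = v
          return true // accept: participate in the election for view v
      }
      return false // stale invitation; ignore it
  }

  func main() {
      r1 := &Replica{ID: 1, MaxViewID: ViewID{Count: 3, ReplicaID: 2}}
      r2 := &Replica{ID: 2, MaxViewID: ViewID{Count: 3, ReplicaID: 2}}
      // Both propose simultaneously; the higher (count, replica id) wins.
      v1, v2 := r1.Invite(), r2.Invite()
      fmt.Println(v1, v2, v1.Less(v2)) // {4 1} {4 2} true: r2's proposal wins
      fmt.Println(r1.OnInvite(v2))     // true: r1 abandons its own and joins r2's
  }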
When *can't* we form a new view?
  Three replicas: A, B, C. A is the primary, B is a backup, C is separated.
  A crashes.
  B becomes separated from A and C.
  C's network connection comes back.
  A reboots.
  A and C between them are a majority, but they don't have the latest state.
  How do they know?
    A kept its last cur_viewid on disk, so it can tell that C isn't up to date.
    And A knows it isn't up to date either, since it crashed.
  They must wait for B to rejoin; it has the latest state.
  (The sketch at the end of these notes walks through this check.)

Is this system perfect?
  It must copy the full state around at each view change.
    That's OK for a lock service, but a disaster for an NFS service.
  It is vulnerable to power failures.
    Nodes lose state if power fails, and power may fail at all nodes at the same time.
  The primary executes operations one at a time.
    This is probably slow: it cannot overlap executing one operation with sending the next.
    It would be even slower if every operation had to be written to disk.

We'll see solutions to these problems in the Harp paper:
  a replicated NFS server that uses viewstamped replication.
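To recap the view-formation conditions with the A, B, C example above, here is a minimal Go sketch; canFormView and the Candidate fields are illustrative names, not from the paper. A new view needs a majority of the 2b+1 replicas plus at least one non-crashed member of the previous view, since only such a replica still holds every committed operation in memory.

  package main

  import "fmt"

  // Candidate describes what the new primary learns about one reachable
  // replica during a view change.
  type Candidate struct {
      Name       string
      OnDiskView int  // cur_viewid (just the view count here); survives crashes
      Crashed    bool // crashed since the previous view, so in-memory state is gone
  }

  // canFormView checks the two conditions from the notes: the new view needs
  // (1) a majority of the 2b+1 replicas, and (2) at least one *non-crashed*
  // member of the previous view, whose memory still holds all committed ops.
  func canFormView(total int, prevView int, reachable []Candidate) bool {
      if len(reachable) <= total/2 { // not a majority of the 2b+1 replicas
          return false
      }
      for _, c := range reachable {
          if c.OnDiskView == prevView && !c.Crashed {
              return true // this replica can supply the latest state
          }
      }
      return false // a majority, but nobody has the previous view's state in memory
  }

  func main() {
      // A, B, C with the previous view (count 7) containing A (primary) and B.
      a := Candidate{Name: "A", OnDiskView: 7, Crashed: true}  // rebooted: disk intact, memory lost
      c := Candidate{Name: "C", OnDiskView: 6, Crashed: false} // was partitioned out of view 7

      // A and C are a majority of 3, but neither has view 7's latest state.
      fmt.Println(canFormView(3, 7, []Candidate{a, c})) // false: must wait for B

      // Once B (a non-crashed member of view 7) rejoins, the view can form.
      b := Candidate{Name: "B", OnDiskView: 7, Crashed: false}
      fmt.Println(canFormView(3, 7, []Candidate{a, b, c})) // true
  }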