6.824 2007 Lecture 15: Paxos

From Paxos Made Simple, by Leslie Lamport, 2001

introduction

2-phase commit is good if different nodes are doing different things
  but in general you have to wait for all sites and the TC to be up
  you have to know if each site voted yes or no
  and the TC must be up to decide
  not very fault-tolerant: has to wait for repair

can we get work done even if some nodes can't be contacted?
  yes: in the special case of replication

state machine replication
  works for any kind of replicated service: storage or lock server or whatever
  every replica must see the same operations in the same order
  if deterministic, replicas will end up with the same state

how to ensure all replicas see operations in the same order?

primary + backup(s)
  clients send all operations to the current primary
  the primary chooses the order, sends to backups, replies to the client

what if the primary fails?
  need to worry about that last operation, possibly not complete
  need to pick a new primary
  can't afford to have two primaries!

suppose the lowest-numbered live server is the primary
  so after a failure, everyone pings everyone
  then everyone knows who the new primary is?
  well, maybe not:
    pings may be lost => two primaries
    pings may be delayed => two primaries
    partition => two primaries

idea: a majority of nodes must agree on the primary
  at most one network partition can have a majority
  if there are two potential primaries, their majorities must overlap

technique: a "view change" algorithm
  the system goes through a sequence of views
  view: a view# and a set of participants
  ensure agreement on the unique successor of each view
  the participant set allows everyone to agree on the new primary

view change requires "fault-tolerant agreement"
  at most a single value is chosen
  agree despite lost messages and crashed nodes
  we can't really guarantee to agree
  but we can guarantee to *not* "agree" on different values!

Paxos: fault-tolerant agreement protocol
  eventually succeeds if a majority of participants are reachable
  best known algorithm

general Paxos approach
  one (or more) nodes decide to be the leader
  the leader chooses a proposed value to agree on (view# and participant set)
  the leader contacts participants, tries to assemble a majority
    participants are all the nodes in the old view (including unreachable ones)
    or a fixed set of configuration master nodes
  if a majority respond, we announce the result

why agreement is hard
  what if two nodes decide to be the leader?
  what if a network partition leads to two leaders?
  what if the leader crashes after persuading only some of the nodes?
  what if the leader got a majority, then failed, without announcing the result?
    or announced the result to only a few nodes?
    a new leader might choose a different value, even though we agreed

Paxos has three phases
  may have to start over if there are failures/timeouts
  see the handout with code (included below) or the lab 8 assignment

we run an instance of this protocol for each view
  the n's in the protocol are proposal numbers, not view numbers
  per view a node may make many proposals
  each of the proposals is numbered in increasing order
    (see the sketch below for one way to generate unique, increasing n's)
  done is set to false when a node suspects a problem with a node in the
    current view and invokes Paxos to create the next view

if at any time any node gets bored (times out)
  it declares itself a leader and starts a new Phase 1

if nothing goes wrong, Paxos clearly reaches agreement
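aside: a minimal Python sketch (names are illustrative, not the lab's code) of
the proposal-number rule from the handout below, "my_n = max(n_h, my_n)+1,
append node ID"; pairing a counter with the node ID keeps proposal numbers
unique and totally ordered:

from typing import NamedTuple

class ProposalNum(NamedTuple):
    count: int    # bumped past max(n_h, my_n) on every new proposal
    node_id: int  # appended so two nodes can never produce the same n

def next_proposal(n_h: ProposalNum, my_n: ProposalNum, node_id: int) -> ProposalNum:
    # Mirrors the handout's "my_n = max(n_h, my_n)+1, append node ID".
    return ProposalNum(max(n_h.count, my_n.count) + 1, node_id)

# NamedTuples compare lexicographically, so proposal numbers are totally
# ordered and a node's later proposals are always larger than its earlier ones:
assert ProposalNum(11, 3) > ProposalNum(10, 5) > ProposalNum(10, 2)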
how do we ensure a good probability that there is only one leader?
  every node has to be prepared to be leader, to cope w/ failure
  so delay a random amount of time after you realize a new view is required
  or delay your ID times some constant

key danger: nodes w/ different v_a receive a donereq
  goal: if a donereq *could* have been sent, future donereqs are
    guaranteed to have the same v_a

what if more than one leader?
  due to timeout or partition or lost packets
  suppose the two leaders used different n, say 10 and 11
  if 10 didn't get a majority to acceptres
    it never will, since no one will acceptres 10 after seeing 11's preparereq
    or perhaps 10 is in a network partition
  if 10 did get a majority to acceptres
    i.e. it might have sent a donereq
    10's majority saw 10's acceptreq before 11's preparereq
      otherwise they would have ignored 10's acceptreq, so no majority
    so 11 will get a prepareres from at least one node that saw 10's acceptreq
    so 11 will be aware of 10's value
    so 11 will use 10's value, rather than making a new one
    so we agreed on a v after all

what if the leader fails before sending acceptreqs?
  some node will time out and become a leader
  the old leader didn't send any donereq, so we don't care what it did
  it's good, but not necessary, that the new leader chooses a higher n
    if it doesn't, timeout, and some other leader will try
    eventually we'll get a leader that knew the old n and will use a higher n

what if the leader fails after sending a minority of acceptreqs?
  same as two leaders...

what if the leader fails after sending a majority of acceptreqs?
  i.e. potentially after reaching agreement!
  same as two leaders...

example scenario:
  say the last view's v = {1,2,3}
  2 fails
  1 becomes leader, receives prepareres from 1 and 3, chooses v = {1,3}
  1 sends acceptreq to 1 and 3, and fails
  3 will start running Paxos, but cannot assemble a majority; it keeps trying
  2 reboots, reloads state from disk (nothing related to this view change)
  2 becomes a leader and sends a preparereq with n higher than 1 used
  3's prepareres will include v_a = {1,3}, so 2 must propose v = {1,3}
  2 and 3 will switch to view {1,3}
  2 will run Paxos again to add itself
    no majority of {1,3} is alive, however, to add in 2
    must wait until 1 is back up
  (note that if we change the protocol, we could continue, because 3
   knows that 1 cannot have a majority---because that would be 1 and
   2---but 2 is talking to 3 to make a view)

what if a node fails after receiving an acceptreq?
  if it doesn't restart, possible timeout in Phase 3, new leader
  if it does restart, it must remember v_a/n_a! (on disk)
    the leader might have failed after sending a few donereqs
    the new leader must choose the same value
    our node might be the intersecting node of the two majorities

what if a node reboots after sending a prepareres?
  does it have to remember n_h on disk?
  it uses n_h to reject preparereq/acceptreq with smaller n
  scenario:
    leader1 sends preparereq(n=10), a bare majority sends prepareres
      so node X's n_h = 10
    leader2 sends preparereq(n=11), a majority intersecting only at node X
      sends prepareres
      node X's n_h = 11
    leader2 got no prepareres with a value, so it chooses v=200
    node X crashes and reboots, loses n_h
    leader1 sends acceptreq(n=10, v=100), its bare majority gets it
      including node X (which should have rejected it...)
      so we have agreement w/ v=100
    leader2 sends acceptreq(n=11, v=200)
      its bare majority all accept the message
      including node X, since 11 > n_h
      so we have agreement w/ v=200.  oops.
  so: each node must remember n_h on disk (see the sketch below)
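aside: a minimal Python sketch (hypothetical names; a plain JSON file stands in
for the on-disk log, and a real implementation would also fsync) of an acceptor
that writes n_h, n_a, and v_a to disk before replying, so a reboot cannot make
it forget a promise or an accepted value:

import json
import os

class Acceptor:
    """Handles preparereq/acceptreq for a single Paxos instance."""

    def __init__(self, path="paxos_state.json"):
        self.path = path
        self.n_h = 0                  # highest proposal # seen in a preparereq
        self.n_a, self.v_a = 0, None  # highest proposal # accepted, and its value
        if os.path.exists(path):      # after a reboot, reload what we promised/accepted
            with open(path) as f:
                s = json.load(f)
            self.n_h, self.n_a, self.v_a = s["n_h"], s["n_a"], s["v_a"]

    def _save(self):
        # Persist *before* replying; if we replied first and then crashed, we
        # could later accept an acceptreq with a smaller n (the n_h scenario
        # above) or forget the v_a/n_a that a new leader is counting on.
        with open(self.path, "w") as f:
            json.dump({"n_h": self.n_h, "n_a": self.n_a, "v_a": self.v_a}, f)

    def preparereq(self, n):
        if n > self.n_h:
            self.n_h = n
            self._save()
            return ("prepareres", self.n_a, self.v_a)
        return ("reject",)

    def acceptreq(self, n, v):
        if n >= self.n_h:
            self.n_a, self.v_a = n, v
            self._save()
            return ("acceptres",)
        return ("reject",)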
conclusion
  what have we achieved?
  remember the original goal was replicated state machines
    and we want to continue even if some nodes are not available
  after each failure we can perform a view change using Paxos agreement
    that is, we can agree on exactly which nodes are in the new view
    so, for example, everyone can agree on a single new primary
  but we haven't talked at all about how to manage the data

------------------ paxos.txt --------------------

state:
  n_a, v_a: highest proposal # which this node has accepted, and its value
  n_h: highest proposal # seen in a prepare
  my_n: the last proposal # this node has used in this round of Paxos
  vid_h: highest view number we have accepted
  views: map of past view numbers to values
  done: leader says agreement was reached, we can start the new view

on each view change, initialize state:
  n_a = 0
  n_h = 0
  my_n = 0
  v_a = ()   // empty list

Paxos Phase 1
  a node (maybe more than one...) decides to be leader
  (it may not be in the current view):
    my_n = max(n_h, my_n)+1, append node ID   // unique proposal number
    done = false
    sends prepare(vid_h+1, my_n) to all nodes in
      {views[vid_h], initial contact node, itself}
  if node receives prepare(vid, n):
    if vid <= vid_h:
      return oldview(vid, views[vid])
    else if n > n_h:
      n_h = n
      done = false
      return prepareres(n_a, v_a)
    else:
      return reject()

Paxos Phase 2
  if leader gets oldview(vid, v):
    views[vid] = v
    vid_h = vid
    view change
    restart paxos
  else if leader gets reject():
    delay and restart paxos
  else if leader gets prepareres from majority of nodes in views[vid_h]:
    if any prepareres(n_i, v_i) exists such that v_i is not empty:
      v = non-empty value v_i corresponding to highest n_i received
    else leader gets to choose a value:
      v = set of pingable nodes (including self)
    send accept(vid_h+1, my_n, v) to all responders
  else:
    delay and restart paxos
  if node gets accept(vid, n, v):
    if vid <= vid_h:
      return oldview(vid, views[vid])
    else if n >= n_h:
      n_a = n
      v_a = v
      return acceptres()
    else:
      return reject()

Paxos Phase 3
  if leader gets oldview(vid, v):
    views[vid] = v
    vid_h = vid
    view change
    restart paxos
  else if leader gets acceptres from a majority of nodes in views[vid_h]:
    send decide(vid_h+1, v_a) to all (including self)
  else:
    delay and restart paxos
  if node gets decide(vid, v):
    if vid <= vid_h:
      return oldview(vid, views[vid])
    else:
      done = true
      primary is lowest-numbered node in v
      views[vid] = v
      vid_h = vid
      view change
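aside: the critical rule in Phase 2 above is that the leader may pick its own
value only if no prepareres carried one; a minimal Python sketch of that choice
(function and variable names are illustrative, not from the handout):

def choose_value(prepareres_list, my_choice):
    """prepareres_list: (n_a, v_a) pairs from a majority of views[vid_h].
    my_choice: the value this leader would propose if it were free to,
    e.g. its current set of pingable nodes."""
    accepted = [(n_a, v_a) for (n_a, v_a) in prepareres_list if v_a]
    if accepted:
        # Some node already accepted a value, so a donereq *could* have been
        # sent; we must re-propose the value with the highest n_a.
        return max(accepted)[1]
    return my_choice

# In the earlier scenario, node 3 already accepted {1,3} under 1's proposal,
# so the new leader (2) must propose {1,3} again instead of its own choice:
assert choose_value([(0, None), (10, [1, 3])], [2, 3]) == [1, 3]
assert choose_value([(0, None), (0, None)], [2, 3]) == [2, 3]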