6.824 2005 Lecture 15: Paxos

From Paxos Made Simple, by Leslie Lamport, 2001

introduction
  2-phase commit is good if different nodes are doing different things
  but in general you have to wait for all sites and the TC to be up
    you have to know if each site voted yes or no
    and the TC must be up to decide
    not very fault-tolerant: has to wait for repair
  can we get work done even if some nodes can't be contacted?
    yes: in the special case of replication
  state machine replication
    works for any kind of replicated service: storage or lock server or whatever
    every replica must see the same operations in the same order
    if deterministic, the replicas will end up with the same state
  use primary/backup, so that the primary can choose the order
  what if the primary fails?
    need to worry about the last operation, which may not have completed
    need to pick a new primary
    can't afford to have two primaries!
  suppose the lowest-numbered live server is the primary
    so after a failure, everyone pings everyone
    then everyone knows who the new primary is?
    well, maybe not:
      pings may be lost => two primaries
      pings may be delayed => two primaries
      partition => two primaries
  solution: a "view change" algorithm
    the system goes through a sequence of views
    view: a view# and a set of participants
    ensure agreement on a unique successor of each view
    the participant set lets everyone agree on the new primary
  view change requires "fault-tolerant agreement"
    at most a single value is chosen
    agree despite lost messages and crashed nodes
    can't really guarantee to agree
    but we can guarantee to *not* "agree" on different values!

Paxos
  a fault-tolerant agreement protocol
  eventually succeeds if a majority of participants are reachable
  the best known such algorithm

general Paxos approach
  one (or more) nodes decide to be the leader
  the leader chooses a proposed value to agree on (view# and participant set)
  the leader contacts the participants, tries to assemble a majority
    participants are all the nodes in the old view (including unreachable ones)
    or a fixed set of configuration master nodes
  if a majority respond, we're done

why agreement is hard
  what if two nodes decide to be the leader?
  what if the leader crashes after persuading only some of the nodes?
  what if the leader got a majority, then failed, without announcing the result?
    or announced the result to only a few nodes?
    a new leader might choose a different value, even though we agreed
  what if a network partition leads to two leaders?

Paxos has three phases
  may have to start over if there are failures/timeouts

state kept by each node
  n_a, v_a: highest proposal number this node has accepted, and its value
  n_p: highest proposal number seen in any PREPARE
  (the message types and rules are given per phase below)

Paxos Phase 1
  a node (maybe more than one...) decides to be leader
  the leader picks a proposal number n
    must be unique, and ideally higher than any known proposal number
    e.g. last known proposal number, plus one, with the node ID appended
  the leader sends PREPARE(n) to every node (including itself)
  if a node gets PREPARE(n) and n > n_p:
    return RESPONSE(n_a, v_a)
    n_p = n

Paxos Phase 2
  if the leader gets a RESPONSE from a majority of nodes (including itself):
    if any RESPONSE(n, v) had a value, v = the value with the highest n
    else the leader gets to choose a value
      e.g. old view# + 1, and the set of pingable nodes
    send ACCEPT(n, v) to all responders
  if a node gets ACCEPT(n, v) and n >= n_p:
    n_a = n
    v_a = v
  if a majority accepted, we've agreed on a new value
    but we don't know yet whether we've reached agreement
    what if a node crashes after its RESPONSE?
    what if there was a second leader?
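aside: to make the per-node rules in Phases 1 and 2 concrete, here is a
minimal sketch of the acceptor side in Go. The notes specify only the
state (n_p, n_a, v_a) and the rules; the names Acceptor, HandlePrepare,
and HandleAccept are mine, and the value is just a string standing in
for an encoded view.

  // acceptor.go -- a sketch of the per-node (acceptor) rules above.
  package paxos

  // Acceptor holds the state each node keeps (on disk -- see the
  // reboot scenarios below).
  type Acceptor struct {
      np int    // n_p: highest n seen in any PREPARE
      na int    // n_a: n of the highest-numbered accepted proposal
      va string // v_a: its value; "" if nothing accepted yet
  }

  // HandlePrepare implements Phase 1: on PREPARE(n) with n > n_p,
  // promise to ignore lower-numbered proposals and report anything
  // already accepted.
  func (a *Acceptor) HandlePrepare(n int) (ok bool, na int, va string) {
      if n > a.np {
          a.np = n // must be persisted before replying
          return true, a.na, a.va
      }
      return false, 0, ""
  }

  // HandleAccept implements Phase 2: on ACCEPT(n, v) with n >= n_p,
  // record the proposal as accepted.
  func (a *Acceptor) HandleAccept(n int, v string) bool {
      if n >= a.np {
          a.na, a.va = n, v // must be persisted as well
          return true
      }
      return false
  }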
Paxos Phase 3
  the leader asks all nodes if they accepted its value
  if it gets a majority of "yes", it tells everyone
  now everyone agrees that the new *primary* is the lowest-numbered node
    in the new view
  if at any time any node gets bored (times out),
    it declares itself a leader and starts a new Phase 1

if nothing goes wrong, Paxos clearly reaches agreement
how do we ensure a good probability that there is only one leader?
  every node has to be prepared to be leader, to cope with failure
  so delay a random amount of time after you realize a new view is required
  or delay your ID times some constant

what if more than one leader?
  due to timeout, partition, or lost packets
  say the two leaders used different n, say 10 and 11
  if 10 didn't get a majority to ACCEPT:
    it never will, since no one will ACCEPT 10 after seeing 11's PREPARE
    or perhaps 10 is in a network partition
  if 10 did get a majority to ACCEPT:
    that majority saw 10's ACCEPT before 11's PREPARE
      (otherwise they would have ignored 10's ACCEPT, so no majority)
    so 11 will get a RESPONSE from at least one node that saw 10's ACCEPT
    so 11 will be aware of 10's value
    so 11 will use 10's value, rather than making up a new one
    so we reached agreement after all

what if the leader fails before sending ACCEPTs?
  some node will time out and become a leader
  the old leader didn't reach agreement, so we don't care what it did
  it's good, but not necessary, that the new leader chooses a higher n
    if it doesn't: timeout, and some other leader will try
    eventually we'll get a leader that knew the old n and will use a higher n

what if the leader fails after sending a minority of ACCEPTs?
  same as two leaders...

what if the leader fails after sending a majority of ACCEPTs?
  i.e. potentially after reaching agreement!
  same as two leaders...

what if a node fails after receiving an ACCEPT?
  if it doesn't reboot: possible timeout in Phase 3, new leader
  if it does reboot, it must remember v_a/n_a! (on disk)
    the leader might have failed
    the new leader must choose the same value, since there might have
      been agreement
    our node might be the intersecting node of the two majorities

what if a node reboots after sending a RESPONSE?
  does it have to remember n_p on disk?
  it uses n_p to reject PREPARE/ACCEPT with smaller n
  scenario:
    leader1 sends PREPARE(n=10), a bare majority RESPONDs
      so node X's n_p = 10
    leader2 sends PREPARE(n=11), a majority intersecting only at node X
      RESPONDs, so node X's n_p = 11
    leader2 got no RESPONSE with a value, so it chooses v=200
    node X crashes and reboots, losing n_p
    leader1 sends ACCEPT(n=10, v=100), its bare majority gets it
      including node X (which should have rejected it...)
      so we have agreement with v=100
    leader2 sends ACCEPT(n=11, v=200), its bare majority all accept it
      including node X, since 11 > n_p
      so we have agreement with v=200. oops.
  so: each node must remember n_p on disk

conclusion
  what have we achieved?
  remember the original goal was replicated state machines
    and we want to continue even if some nodes are not available
  after each failure we can perform a view change using Paxos agreement
  that is, we can agree on exactly which nodes are in the new view
  so, for example, everyone can agree on a single new primary
  but we haven't talked at all about how to manage the data
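aside: a matching sketch of the leader side, under the same caveats
(Response, makeN, and chooseValue are illustrative names, not from the
notes). It shows the two rules the failure analysis above leans on:
building a proposal number that is unique (and usually higher than any
seen), and adopting the value from the highest-numbered RESPONSE rather
than inventing a new one.

  // leader.go -- a sketch of the leader (proposer) rules.
  package paxos

  // Response is what an acceptor returns from PREPARE(n).
  type Response struct {
      OK bool
      Na int    // the acceptor's n_a
      Va string // the acceptor's v_a; "" if it has accepted nothing
  }

  // makeN builds a proposal number: last known n, plus one, with the
  // node ID appended in the low digits (assumes nodeID < 1000).
  // Unique because node IDs are unique; usually higher than any seen.
  func makeN(lastKnown, nodeID int) int {
      return (lastKnown/1000+1)*1000 + nodeID
  }

  // chooseValue implements the Phase 2 rule: if any RESPONSE carried
  // a value, adopt the value with the highest n (this is what forces
  // leader 11 to reuse leader 10's value in the analysis above);
  // otherwise the leader may propose its own value, e.g. the new view.
  func chooseValue(responses []Response, own string) string {
      v, best := own, -1
      for _, r := range responses {
          if r.OK && r.Va != "" && r.Na > best {
              v, best = r.Va, r.Na
          }
      }
      return v
  }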