6.824 2013 Lecture 5: Paxos

From Paxos Made Simple, by Leslie Lamport, 2001

today: a cleaner approach to replication: RSM via Paxos
  Lab 3

Lab 2 critique
  primary/backup with viewserver
  pro:
    conceptually simple
    just two msgs per op (request, reply)
    primary can do computation, send result to backup
    only two k/v servers needed to tolerate one failure
  con:
    viewserver is a single point of failure
    order can be messy, e.g. new view, data to backup, ack, &c
    tension if backup is slow / temporarily unavailable
      1. primary can wait for backup -- slow
      2. viewserver can declare backup dead -- expensive, hurts fault tolerance

we would like:
  general-purpose ordering scheme
  no single point of failure
  graceful handling of slow / intermittent replicas

the basic idea:
  [diagram: clients, replicas, log in each replica, k/v layer, paxos layer]
  no viewserver
  three replicas
  clients can send RPCs to any replica (not just primary)
  server appends each client op to a replicated *log* of operations
    Put, Get (and more later)
    numbered log entries -- instances -- seq
    Paxos agreement on content of each log entry
  often called the "replicated state machine" (RSM) approach
  applicable to many kinds of computation, not just k/v storage

example: client sends Put(a,b) to S1
  S1 picks a log entry, say 3
  S1 uses Paxos to get all servers to agree that entry 3 holds Put(a,b)

example: client sends Get(a) to S2
  S2 picks log entry 4
  S2 uses Paxos to get all servers to agree that entry 4 holds Get(a)
  S2 scans log up to entry 4 to find latest Put(a,...)
  S2 replies with that value
  (S2 can cache content of DB up through last log scan)

why a log? why not require all replicas to agree on each op in lock-step?
  allows one replica to fall behind, then catch up
    e.g. if it is slow
    other replicas do not have to wait
  allows one replica to crash and catch up
    if it keeps state on disk, it can replay missed operations
  allows pipelining/overlap of agreement
    agreement turns out to require multiple message rounds
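to make the log structure concrete, here is a minimal Go sketch of a
replica catching up by applying decided log entries in order; the Paxos
interface, Op struct, and method names below are assumptions for
illustration, not Lab 3's actual API:

  package main

  import "time"

  // hypothetical agreement layer: one Paxos instance per log slot
  type Paxos interface {
      Start(seq int, v interface{})                 // propose v for slot seq
      Status(seq int) (decided bool, v interface{}) // has slot seq been decided?
  }

  type Op struct {
      Kind  string // "Put" or "Get"
      Key   string
      Value string
  }

  type KVServer struct {
      px      Paxos
      applied int               // log entries [0, applied) are reflected in db
      db      map[string]string // cached effect of the log so far
  }

  // apply walks the log forward through slot upto, in order, so every
  // replica's db reflects the same sequence of operations
  func (kv *KVServer) apply(upto int) {
      for kv.applied <= upto {
          decided, v := kv.px.Status(kv.applied)
          if !decided {
              time.Sleep(10 * time.Millisecond) // slot not agreed yet; wait
              continue
          }
          if op, ok := v.(Op); ok && op.Kind == "Put" {
              kv.db[op.Key] = op.Value // a Get changes no state
          }
          kv.applied++
      }
  }

a Get assigned to slot 4 is answered by calling apply(4) and then reading
kv.db[key] -- this is the log scan (and cached DB) described above.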
what about agreement? we need all replicas to have the same op in each log slot
  Paxos is a fault-tolerant agreement protocol
  one or more servers each propose a value
  correctness: if agreement reached, all agreeing nodes see same value
  fault-tolerance: can tolerate non-reachability of a minority of servers
    (correctness implies they won't agree at all)
  liveness: a majority must be alive and able to communicate reliably
    (minorities are not live)

agreement is hard (1):
  may be multiple proposals for the op in a particular log slot
  Sx may initially hear of one, Sy may hear of another
  clearly one must later change its mind
  thus: multiple rounds, tentative initially
  how do we know when agreement is permanent -- no longer tentative?

agreement is hard (2):
  if S1 and S2 agree, and S3 and S4 don't respond, are we done?
  agreement has to be able to complete even w/ failed servers
  we can't distinguish a failed server from a network partition
  so maybe S3/S4 are partitioned and have "agreed" on a different operation!

two main ideas in Paxos:
  1. many rounds may be required, but they will converge on one value
  2. a majority is required for agreement -- prevents "split brain"
     a key point: any two majorities overlap
     so any later majority will share at least one server w/ any earlier majority
     so any later majority can find out what an earlier majority decided

Lab 3B k/v server creates a separate Paxos instance for each client Put, Get
rest of lecture focuses on agreement for a specific instance

general Paxos approach:
  one or more nodes act as proposers
  each wants to get agreement on its value
  proposer contacts participants, tries to assemble a majority
  if a majority respond, we're done

basic Paxos exchange:

  proposer                acceptors
    prepare(n)       ->
                     <-   prepare_ok(n_a, v_a)
    accept(n, v')    ->
                     <-   accept_ok(n)
    decided(v')      ->

why n?
  to distinguish among multiple rounds, e.g. proposer crashes, simultaneous proposals
  want later rounds to supersede earlier ones
  numbers allow us to compare early/late
  n values must be unique and roughly follow time
    e.g. n = <time, server ID>

definition: server S accepts n/v
  S responded accept_ok to accept(n, v)
definition: n/v is chosen
  a majority of servers accepted n/v

the crucial property:
  if a value was chosen, any subsequent choice must be the same value
  i.e. the protocol must not change its mind
    maybe a different proposer &c, but same value!
  this allows us to freely start new rounds after crashes &c

tricky b/c "chosen" is a system-wide property
  e.g. majority accepts, then proposer crashes
  no node can tell locally that agreement was reached

so:
  proposer doesn't send out a value with prepare
  acceptors send back any value they have already accepted
  if there is one, proposer proposes that value
    to avoid changing an existing choice
  if no value already accepted, proposer can propose any value (e.g. a client request)
  proposer must get prepare_ok from a majority
    to guarantee intersection with any previous majority
    to guarantee proposer hears of any previously chosen value

now the protocol -- see the handout

proposer(v):
  choose n, unique and higher than any n seen so far
  send prepare(n) to all servers including self
  if prepare_ok(n_a, v_a) from majority:
    v' = v_a with highest n_a, or own v if no v_a was returned
    send accept(n, v') to all
    if accept_ok(n) from majority:
      send decided(v') to all

acceptor state:
  must persist across reboots
  n_p (highest prepare seen)
  n_a, v_a (highest accept seen)

acceptor's prepare(n) handler:
  if n > n_p
    n_p = n
    reply prepare_ok(n_a, v_a)

acceptor's accept(n, v) handler:
  if n >= n_p
    n_p = n
    n_a = n
    v_a = v
    reply accept_ok(n)
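a Go sketch of the two acceptor handlers above (a sketch only: the RPC
plumbing is assumed, and the required write of n_p/n_a/v_a to stable
storage before each reply -- see the reboot discussion below -- is omitted):

  package main

  import "sync"

  type Acceptor struct {
      mu sync.Mutex
      np int         // n_p: highest prepare seen
      na int         // n_a: highest accept seen
      va interface{} // v_a: value of highest accept seen
  }

  type PrepareReply struct {
      OK bool
      Na int
      Va interface{}
  }

  func (a *Acceptor) Prepare(n int, reply *PrepareReply) {
      a.mu.Lock()
      defer a.mu.Unlock()
      if n > a.np { // promise to ignore anything numbered below n
          a.np = n
          reply.OK, reply.Na, reply.Va = true, a.na, a.va
      } else {
          reply.OK = false
      }
  }

  func (a *Acceptor) Accept(n int, v interface{}) bool {
      a.mu.Lock()
      defer a.mu.Unlock()
      if n >= a.np { // >= (not >): the proposer that prepared n must be able to accept n
          a.np, a.na, a.va = n, n, v
          return true
      }
      return false
  }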
example 1 (normal operation):
  S1, S2, S3, but S3 is dead or slow
  S1 starts proposal, n=1 v=A

  S1: p1  a1vA  dA
  S2: p1  a1vA  dA
  S3: dead...

  "p1" means Sx receives prepare(n=1)
  "a1vA" means Sx receives accept(n=1, v=A)
  "dA" means Sx receives decided(v=A)

these diagrams are not specific about who the proposer is
  it doesn't really matter
  the proposers are logically separate from the acceptors
  we only care about what the acceptors saw and replied

note Paxos only requires a majority of the servers
  so we can continue even though S3 was down
  proposer must not wait forever for any acceptor's response

what would happen if there were a network partition?
  i.e. S3 was alive and had a proposed value B
  S3's prepare would not assemble a majority

the homework question:
  how does Paxos ensure that the following sequence of events can't happen?
  what actually happens, and which value is ultimately chosen?
    proposer 1 crashes after sending two accepts
    proposer 2 has a different value in mind
    A: p1  a1foo
    B: p1          p2  a2bar
    C: p1  a1foo   p2  a2bar
  C's prepare_ok to B really included "foo"
    thus a2foo, and so no problem

the point:
  if the system has already reached agreement, a majority will know the value
  any new majority of prepares will intersect that majority
  so the subsequent proposer will learn of the already-agreed-on value
  and send it in its accept msgs

example 2 (concurrent proposers):
  A1 starts proposing n=10
  A1 sends out just one accept, v=10
  A3 starts proposing n=11
    but A1 does not receive A3's prepare
  A3 only has to wait for a majority of prepare_ok responses

  A1: p10  a10v10
  A2: p10  p11
  A3: p10  p11  a11v11

  A1 and A3 have accepted different values!
  what will happen?
  what will A2 do if it gets the a10v10 accept msg from A1?
  what will A1 do if it gets the a11v11 accept msg from A3?
  what if A3 were to crash at this point (and not restart)?

how about this:
  A1: p10  a10v10  p12
  A2: p10  p11  a11v11
  A3: p10  p11  p12  a12v10
  has the system agreed to a value at this point?

what's the commit point?
  i.e. exactly when has agreement been reached?
  i.e. at what point would changing the value be a disaster?
  after a majority has the same v_a? no -- why not?
  after a majority has the same v_a/n_a? yes -- why sufficient?

why does the proposer need to pick v_a with highest n_a?
  A1: p10  a10vA  p12
  A2: p10  p11  a11vB
  A3: p10  p11  a11vB  p12  a12v??
  n=11 already agreed on vB
  n=12 sees both vA and vB, but must choose vB
  why: two cases:
    1. there was a majority before n=11
       n=11's prepares would have seen the value and re-used it
       so it's safe for n=12 to re-use n=11's value
    2. there was not a majority before n=11
       n=11 might have obtained a majority
       so it's required for n=12 to re-use n=11's value

why does the prepare handler check that n > n_p?
  it's taking max(concurrent n's), for the accept handler

why does the accept handler check n >= n_p?
  required to ensure agreement:
    there's a unique highest n active
    everyone favors the highest n
  w/o the n >= n_p check, you could get this bad scenario:
    A1: p1  p2  a1vA
    A2: p1  p2  a1vA  a2vB
    A3: p1  p2  a2vB

what if a new proposer chooses n < an old proposer's n?
  i.e. if clocks are not synced
  it cannot make progress, though there is no correctness problem

what if an acceptor crashes after receiving an accept?
  A1: p1  a1v1
  A2: p1  a1v1  reboot  p2  a2v?
  A3: p1        p2  a2v?
  A2 must remember v_a/n_a across the reboot! on disk
    A2 might be the only intersection with the new proposer's majority
    and thus the only evidence that v1 was already agreed on

what if an acceptor reboots after sending prepare_ok?
  does it have to remember n_p on disk?
  if n_p were not remembered, this could happen:
    S1: p10  a10v10
    S2: p10  p11  reboot  a10v10  a11v11
    S3: p11  a11v11
  11's proposer did not see value 10, so 11 proposed its own value
  but just before that, 10 had been chosen!
  b/c S2 did not remember to ignore a10v10

can Paxos get stuck?
  yes, if there is not a majority that can communicate
  how about if a majority is available?
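the proposer pseudocode says to "choose n, unique and higher than any n
seen so far" but leaves the scheme open; a common construction, sketched
here in Go as an illustration (not a required scheme), reserves
n mod nservers for the server's index, so no two servers ever pick the
same n:

  package main

  type Proposer struct {
      me          int // this server's index, 0 <= me < nservers
      nservers    int
      highestSeen int // highest n observed in any prepare/accept traffic
  }

  // nextN returns an n unique to this server (n mod nservers == me)
  // and strictly higher than anything this proposer has seen
  func (p *Proposer) nextN() int {
      round := p.highestSeen/p.nservers + 1
      return round*p.nservers + p.me
  }

note that every retry outbids all n's seen so far, which bears on the
closing question: even with a majority up, two live proposers can keep
pre-empting each other's rounds this way.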