6.824 2009 Lecture 13: Paxos

From Paxos Made Simple, by Leslie Lamport, 2001

introduction
  how to build a fault-tolerant service?
    e.g. lock server
    want it to look to clients like a single server
  replicated state machine
    like hypervisor paper
    [many clients, two servers, state, operations]
    if all servers see same sequence of ops
      they will remain replicas, so one can take over from another
    works for deterministic local services: lock, block storage, files
  how to ensure all replicas see same sequence of ops?
    there are many possibilities

primary + backup
  clients send all operations to current primary
  primary chooses order, sends to backup, replies to client

what if the primary fails?
  how about: backup pings primary, takes over if it gets no response from primary
  doesn't work:
    pings lost => two primaries
    pings delayed => two primaries
    partition => two primaries
  basic problem:
    broken network looks exactly like a dead primary
    but must be treated differently!
  can't afford to have two primaries!
    won't see same operations, won't have same state
    might give the same lock to two different clients!

we'll use Paxos to agree on which nodes are alive
  basic idea: odd # of nodes, nodes vote, at most one majority
  can use set of live nodes to agree on who is the primary
  agreed set of live nodes is a "view"

protocol stack:
  RSM
  config service (view change)
  agreement (Paxos)

config service provides "views" and "view change"
  system goes through a sequence of views
  view: view# and set of participants
  ensure agreement on unique successor of each view
  example:
    0: S1, S2, S3
    1: S2, S3
    2: S2, S3, S4
    3: S1, S2, S3, S4
  two uses:
    current view implies current primary
    previous view tells new primary who to get state from

what does config service need from Paxos?
  one or more nodes will see change in set of pingable nodes
  call Paxos("i+1 S1 S2")
    maybe just one such call, maybe more than one
    all same i+1
    maybe all same srv list, maybe different
  eventually Paxos says "decided i+1 S1 S2"
  config srv creates separate Paxos instance for each view change
  so from now on in lecture, agreeing on just a single new view

Paxos
  fault-tolerant agreement protocol
  general-purpose algorithm
  one or more servers each propose a value
  Paxos will not ever "agree" on more than one distinct value
  might never agree on anything (if too many failures)
  if net+servers stable long enough, and enough are live,
    Paxos will eventually choose one proposed value as the agreed value

general Paxos approach
  one (or more) nodes decide to be the proposer
  proposer contacts participants, tries to assemble a majority
  if a majority respond, we're done

why agreement is hard
  two proposers?
  partition (two proposers)?
  proposer crashes?
  proposer crashes just after majority?
    new proposer shouldn't choose a different value...

Paxos has two phases
  proposer can't just send "do you promise to commit to this value?"
    can't promise: maybe everyone promised to a different value
    have to be able to change mind
  so: prepare, and accept

definition: "chosen"
  a majority has accepted the value, i.e. sent accept_ok
  majority means at most one value can be chosen

exchange:
  proposer                        acceptors
    prepare(n)        ->
                      <-          prepare_ok(n_a, v_a)
    accept(n, v')     ->
                      <-          accept_ok(n)
    decided(v')       ->

why n?
  may need multiple rounds, e.g. if a proposer crashes
  want later rounds to supersede earlier ones
  numbers allow us to compare early/late

the crucial property:
  if a value was chosen, any subsequent choice must be the same value
  i.e. protocol must not change its mind
  maybe a different proposer &c, but same value!
  tricky b/c "chosen" is a system-wide property
    e.g. majority accepts, then proposer crashes
    nodes cannot tell locally
  so:
    proposer doesn't send out a value with prepare
    acceptors send back any value they have already accepted
    if there is one, proposer proposes that
      to avoid changing an existing choice
    if no value already accepted, proposer can propose any value
      (e.g. a client request)
    proposer must get prepare_ok from a majority
      to guarantee intersection with a majority formed by an existing choice
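aside: before the pseudocode below, a rough Go sketch of the messages in the
exchange above and of the proposer's value-selection rule. All names here
(PrepareArgs, chooseValue, &c) are invented for illustration, not from the
paper or the lab code.

  // Sketch of the message shapes; not the lab's actual RPC definitions.
  package paxos

  // PrepareArgs / PrepareReply carry prepare(n) -> prepare_ok(n_a, v_a).
  type PrepareArgs struct {
    N int64 // proposal number n
  }

  type PrepareReply struct {
    OK bool        // true iff n > n_p at this acceptor
    Na int64       // n_a: highest proposal this acceptor has accepted
    Va interface{} // v_a: value accepted with Na; nil if none yet
  }

  // AcceptArgs / AcceptReply carry accept(n, v') -> accept_ok(n).
  type AcceptArgs struct {
    N int64
    V interface{} // v': the value the proposer asks acceptors to accept
  }

  type AcceptReply struct {
    OK bool // true iff n >= n_p at this acceptor
  }

  // DecidedArgs announces the chosen value.
  type DecidedArgs struct {
    V interface{}
  }

  // chooseValue applies the rule above: if any acceptor in the majority
  // has already accepted a value, re-propose the one with the highest n_a;
  // otherwise the proposer may use its own value.
  func chooseValue(replies []PrepareReply, ownV interface{}) interface{} {
    vPrime := ownV
    var highest int64 // assumes real proposal numbers are > 0
    for _, r := range replies {
      if r.OK && r.Va != nil && r.Na > highest {
        highest = r.Na
        vPrime = r.Va
      }
    }
    return vPrime
  }

chooseValue is the heart of "must not change its mind": any already-accepted
value found among a majority of prepare_ok replies wins over the proposer's own.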
now the protocols

proposer(v):
  choose n, unique and higher than any n seen so far
  send prepare(n) to all servers including self
  if prepare_ok(n_a, v_a) from majority:
    v' = v_a with highest n_a; choose own v otherwise
    send accept(n, v') to all
    if accept_ok(n) from majority:
      send decided(v') to all

acceptor state:
  must persist across reboots
  n_p (highest prepare seen)
  n_a, v_a (highest accept seen)

acceptor prepare(n) handler:
  if n > n_p
    n_p = n
    reply prepare_ok(n_a, v_a)

acceptor accept(n, v) handler:
  if n >= n_p
    n_a = n
    v_a = v
    reply accept_ok(n)

the commit point is when the f+1'st acceptor records n_a/v_a in stable storage

if an acceptor times out (i.e. doesn't see decided())
  ask all servers for v_a, see if one value has a majority
  otherwise become a proposer

easy cases:
  one proposer, nothing fails
  network partition (at most one majority)
  why only a majority -- why not all?

three harder cases:
  > 1 proposer (this is the hardest case)
  proposer fails (sub-case of > 1 proposer)
  acceptor fails

example 1 (concurrent proposers):
  scenario:
    x sends prepare(10), gets replies
    y sends prepare(11), gets replies, none accepted a value from 10
    x sends accept(10, v10)
    y sends accept(11, v11)
  A1: p10 p11 a10v10 a11v11
  A2: ""
  A3: ""
  the risk: could all nodes accept v10 (so it is chosen), then accept v11?
    if this could happen, we would change our mind about chosen values!
  but it's OK: n_p and the "if" in accept() cause nodes to ignore
    10's accept after they have seen 11's prepare
  what about other interleavings of concurrent proposers?
    v10 either was or was not accepted by a majority
    if yes: 10's accepts came before 11's prepares, so 11 will see the value
    if no: 11 will win, permanently

example 2 (proposer failure):
  S1: p10 a10v10     a11v10
  S2: p10 a10v10 p11 a11v10
  S3: p10        p11 a11v10
  the risk: 10 crashed just after sending accepts, but a majority arrived
    11 doesn't know it, sends out prepares
  11's majority of prepare_ok must include 10's majority of accepts
    so 11 will learn of 10's value in prepare_ok()
    and will choose the same value as 10
  conservative: 11 may use v10 even if only one server accepted v10

general argument about proposer failure?
  old proposer either did or did not get a majority of accepts
  what if the new proposer chooses n < old proposer's n?
    cannot make progress, though no correctness problem
    maybe n = timestamp in high bits, unique node ID in low bits

what if an acceptor crashes after receiving accept?
  if it doesn't restart, maybe the proposer doesn't get enough accept_oks,
    times out, new proposer
  if it does restart, it must remember v_a/n_a! (on disk)
    the proposer might have failed after sending a few accept()s
    our node might be the only intersection with the new proposer's majority
    i.e. otherwise example 2 breaks, 11 may not see chosen v10

what if an acceptor reboots after sending prepare_ok?
  does it have to remember n_p on disk? yes!
  example 1 above requires this, e.g.
    if some servers restart after getting p11,
    we don't want them to accept a10v10,
    since 11 saw a majority of prepare_ok without a value,
    so 11 will propose its own value
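aside: a minimal Go sketch of the acceptor side, using the message structs
sketched earlier. The names are illustrative; the point is that n_p, n_a, v_a
reach stable storage before the acceptor replies, so they survive a reboot.
persist() stands in for whatever synchronous disk write a real implementation
would do.

  // same package as the message structs sketched earlier
  package paxos

  import "sync"

  type Acceptor struct {
    mu sync.Mutex
    np int64       // n_p: highest prepare seen
    na int64       // n_a: highest accept seen
    va interface{} // v_a: value accepted with na
  }

  // persist is a placeholder for a synchronous write of np/na/va to disk
  // (e.g. append a record to a log file and fsync before returning).
  func (a *Acceptor) persist() {
  }

  func (a *Acceptor) Prepare(args *PrepareArgs, reply *PrepareReply) error {
    a.mu.Lock()
    defer a.mu.Unlock()
    if args.N > a.np {
      a.np = args.N
      a.persist() // remember the promise before replying
      reply.OK, reply.Na, reply.Va = true, a.na, a.va
    } else {
      reply.OK = false
    }
    return nil
  }

  func (a *Acceptor) Accept(args *AcceptArgs, reply *AcceptReply) error {
    a.mu.Lock()
    defer a.mu.Unlock()
    if args.N >= a.np { // same test as the handler above; n_p unchanged
      a.na = args.N
      a.va = args.V
      a.persist() // the accept must survive a crash + reboot
      reply.OK = true
    } else {
      reply.OK = false
    }
    return nil
  }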
the homework question:
  How does Paxos ensure that the following sequence of events can't happen?
  What actually happens, and which value is ultimately chosen?
  1. A sends p(1); acks from A, B, C.
  2. A sends a(1, foo); A and C respond; A sends decide(foo)
  3. B sends p(2); acks from B and C
  4. B sends a(2, bar); B and C respond; B sends decide(bar)

summary
  overall goal: replicated state machines
  needs a single sequence of views, e.g. sequence of primaries
  paxos chooses each new view; one paxos instance per view
  still need a way to carry data from one view to the next
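aside: a sketch of what the config service's use of Paxos might look like,
one instance per view change as in the summary. View, ConfigServer, Instance,
Propose, and Decided are all made-up placeholders, not a real API.

  package config

  type View struct {
    Num     int      // view number i
    Servers []string // set of participants in this view
  }

  // Instance stands in for one Paxos agreement (proposer + acceptor state).
  type Instance struct{}

  func (ix *Instance) Propose(v View)        { /* run prepare/accept with v */ }
  func (ix *Instance) Decided() (View, bool) { return View{}, false }

  type ConfigServer struct {
    views     []View            // agreed views; views[i].Num == i
    instances map[int]*Instance // one Paxos instance per view number
  }

  // pingSetChanged runs when this node notices the set of pingable nodes
  // has changed: propose view i+1 with the live servers it currently sees.
  // Several nodes may do this concurrently, possibly with different lists;
  // Paxos guarantees they all end up with the same view i+1.
  func (cs *ConfigServer) pingSetChanged(live []string) {
    next := len(cs.views) // i+1
    if cs.instances == nil {
      cs.instances = make(map[int]*Instance)
    }
    ix, ok := cs.instances[next]
    if !ok {
      ix = &Instance{}
      cs.instances[next] = ix
    }
    ix.Propose(View{Num: next, Servers: live})
    // later, once ix.Decided() returns true, every node appends the same
    // View to cs.views, whatever it proposed itself
  }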