6.824 Lecture 13: Paxos

From Paxos Made Simple, by Leslie Lamport, 2001

introduction
  overall topic: fault-tolerant replicated services
  saw replicated state machines last time, primary/backup
  today: robust agreement on the primary
  Lab 6!

replicated state machine review, from the hypervisor paper
  [many clients, two servers, state, operations]
  if all servers see the same sequence of ops
    they will remain replicas, so one can take over from another
  works for deterministic local services: lock, block storage, files

how to ensure all replicas see the same sequence of ops?
  naive client -> all servers doesn't work
  common approach: primary + backups
    clients send all operations to the current primary
    primary chooses the order, sends to backups, replies to the client

what if the primary fails?
  how about: backup pings primary, takes over if it gets no response?
  doesn't work:
    pings lost => two primaries
    pings delayed => two primaries
    partition => two primaries
  basic problem: a broken network looks exactly like a dead primary
    but must be treated differently!
  can't afford to have two primaries!
    won't see the same operations, won't have the same state
    might give the same lock to two different clients!

we'll use Paxos to agree on the primary
  nodes ping, choose the node w/ lowest ID
  try to agree that that node is the primary
  really, agree on the set of live nodes
  the agreed set of live nodes is a "view"

protocol stack:
  RSM
  config service (view change)
  agreement (Paxos)

config service provides "views" and "view change"
  system goes through a sequence of views
  view: view# and set of participants
  ensure agreement on a unique successor of each view
  example:
    0, S1, S2, S3
    1, S2, S3
    2, S2, S3, S4
    3, S1, S2, S3, S4
  two uses:
    current view implies the current primary
    previous view tells the new primary who to get state from

what does the config service need from Paxos?
  one or more nodes will see a change in the set of pingable nodes
  call Paxos("i+1 S1 S2")
  maybe just one such call, maybe more than one
    all same i+1
    maybe all the same srv list, maybe different
  eventually Paxos says "decided i+1 S1 S2"
  config srv creates a separate Paxos instance for each view change
  so from now on in the lecture: agreeing on just a single new view

Paxos
  fault-tolerant agreement protocol
  general-purpose algorithm
  one or more servers each propose a value
  correctness: if agreement is reached, all agreeing nodes see the same value
  fault-tolerance: only a majority is required
    but a steady stream of failures may prevent agreement

general Paxos approach
  one (or more) nodes decide to be the proposer
  the proposer contacts participants, tries to assemble a majority
  if a majority respond, we're done

why agreement is hard
  two proposers?
  partition (two proposers)?
  proposer crashes?
  proposer crashes just after a majority?
  new proposer shouldn't choose a different value...

Paxos has two phases
  the proposer can't just send "do you promise to commit to this value?"
    can't promise: maybe everyone promised a different value
    acceptors have to be able to change their minds
  so: prepare, and accept

exchange:
  proposer                acceptors
    prepare(n)       ->
                     <-   prepare_ok(n_a, v_a)
    accept(n, v')    ->
                     <-   accept_ok(n)
    decided(v')      ->

definition: server S accepts a value
  S responded accept_ok
definition: value V is chosen
  a majority of servers accepted V

why n?
  may need multiple rounds, e.g. if a proposer crashes
  want later rounds to supersede earlier ones
  numbers allow us to compare early/late
  n values must be unique, comparable, and roughly follow time
  n = <time, server ID>?
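One concrete (and assumed, not mandated by these notes) way to get n values with
those properties is to pair a round number with the proposer's server ID; a
minimal Go sketch, with hypothetical names:

  package paxos

  // Proposal numbers must be unique across servers, comparable, and roughly
  // follow time. A common construction is a (round, server ID) pair: bump the
  // round past any n seen so far, and break ties with the server ID so two
  // proposers never generate equal n's.
  type Num struct {
  	Round int // roughly follows time; bumped for each new proposal
  	Srv   int // proposer's server ID, guarantees uniqueness
  }

  // Greater reports whether a is strictly larger than b.
  func (a Num) Greater(b Num) bool {
  	if a.Round != b.Round {
  		return a.Round > b.Round
  	}
  	return a.Srv > b.Srv
  }

  // Next returns a proposal number higher than any n seen so far.
  func Next(highestSeen Num, me int) Num {
  	return Num{Round: highestSeen.Round + 1, Srv: me}
  }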
the crucial property:
  if a value was chosen, any subsequent choice must be the same value
  i.e. the protocol must not change its mind
  maybe a different proposer &c, but the same value!

tricky b/c "chosen" is a system-wide property
  e.g. a majority accepts, then the proposer crashes
  nodes cannot tell locally that a value has been chosen

so:
  the proposer doesn't send out a value with prepare
  acceptors send back any value they have already accepted
  if there is one, the proposer proposes that value
    to avoid changing an existing choice
  if no value has already been accepted, the proposer can propose any value
    (e.g. a client request)
  the proposer must get prepare_ok from a majority
    to guarantee intersection with any majority that formed an existing choice

now the protocol

state: must persist across reboots
  n_p (highest prepare seen)
  n_a, v_a (highest accept seen)

proposer(v):
  choose n, unique and higher than any n seen so far
  send prepare(n) to all servers including self
  if prepare_ok(n_a, v_a) from a majority:
    v' = v_a with highest n_a; choose own v otherwise
    send accept(n, v') to all
    if accept_ok(n) from a majority:
      send decided(v') to all

acceptor's prepare(n) handler:
  if n > n_p
    n_p = n
    reply prepare_ok(n_a, v_a)

acceptor's accept(n, v) handler:
  if n >= n_p
    n_a = n
    v_a = v
    reply accept_ok(n)

if an acceptor times out (i.e. doesn't see decide()):
  ask all servers for v_a, see if one value has a majority
  otherwise become a proposer

(a short code sketch of these handlers appears after the examples below)

example 1 (normal operation):
  server set is S1, S2, S3; S3 just crashed
  S1's periodic pings would see that S3 is now dead
  S1's config server asks Paxos to agree on "vid+1, S1, S2"
  S1 is the proposer
    S1: p1  a1v1  d1
    S2: p1  a1v1  d1
    S3: (p1) (a1v1) (d1)
  note Paxos only requires a majority of the servers
    so we can continue even though S3 was down
    the proposer must not wait for S3's responses! fast timeout.
  what would happen in a network partition, i.e. if S3 were alive?
    S3 would also initiate Paxos for a new view
    S3's prepare would not assemble a majority
  Paxos must assemble a majority
    but a majority of what?
    if there are seven servers, and S1 thinks S4 S5 S6 S7 are down,
      can it proceed with a majority among just S1 S2 S3?

example 2 (concurrent proposers):
  scenario:
    x sends prepare(10), gets replies
    y sends prepare(11), gets replies, none had accepted a value from 10
    x sends accept(10, vx)
    y sends accept(11, vy)
  A1: p10 p11 a10v10 a11v11
  A2: ""
  A3: ""
  the risk: could all nodes accept v10 (so it is chosen), then accept v11?
    if this could happen, we would change our mind about a chosen value!
  but it's OK: n_p and the "if" in accept() cause nodes to ignore
    10's accept after they have seen 11's prepare
  what about other interleavings of concurrent proposers?
    v10 either was or was not accepted by a majority
    if yes: that majority saw 10's accepts before 11's prepare
      otherwise they would have ignored 10's accept
      so 11 will get at least one prepare_ok w/ 10's value
        majorities must intersect!
    if no: 11 will win, permanently

example 3 (proposer failure):
  S1: p10 a10v10     a11v10
  S2: p10 a10v10 p11 a11v10
  S3: p10        p11 a11v10
  the risk: 10 crashed just after sending accepts, but a majority arrived
    11 doesn't know that, and sends out prepares
  11's majority of prepare_oks must intersect 10's majority of accepts
    so 11 will learn of 10's value in a prepare_ok()
    and will choose the same value as 10
  conservative: 11 may use v10 even if only one server accepted v10

general argument about proposer failure?
  the old proposer either did or did not get a majority of accepts

what if the new proposer chooses an n lower than the old proposer's?
  e.g. if clocks are not synced
  it cannot make progress, though there is no correctness problem
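Here is a minimal Go sketch of the acceptor state and handlers above, plus the
proposer's choice of v'. It is illustrative only: the type and function names
are assumptions, n is simplified to a plain int (in practice it would be the
<time, server ID> pair sketched earlier), and a real implementation must write
n_p, n_a, v_a to stable storage before replying and wrap these in RPC handlers.

  package paxos

  import "sync"

  // Per-Paxos-instance acceptor state. n_p, n_a, v_a must survive reboots,
  // so a real implementation persists them to disk before replying.
  type Acceptor struct {
  	mu       sync.Mutex
  	np       int         // n_p: highest prepare seen
  	na       int         // n_a: highest accept seen
  	va       interface{} // v_a: value of the highest accept seen
  	accepted bool        // has any accept been seen?
  }

  type PrepareReply struct {
  	OK       bool
  	Na       int
  	Va       interface{}
  	Accepted bool
  }

  // Prepare is the acceptor's prepare(n) handler: promise to ignore lower-
  // numbered proposals, and report any already-accepted value so the proposer
  // cannot change an existing choice.
  func (a *Acceptor) Prepare(n int) PrepareReply {
  	a.mu.Lock()
  	defer a.mu.Unlock()
  	if n > a.np {
  		a.np = n // persist before replying
  		return PrepareReply{OK: true, Na: a.na, Va: a.va, Accepted: a.accepted}
  	}
  	return PrepareReply{OK: false}
  }

  // Accept is the acceptor's accept(n, v) handler.
  func (a *Acceptor) Accept(n int, v interface{}) bool {
  	a.mu.Lock()
  	defer a.mu.Unlock()
  	if n >= a.np {
  		a.na, a.va, a.accepted = n, v, true // persist before replying
  		return true
  	}
  	return false
  }

  // chooseValue is the proposer's rule after prepare_oks from a majority:
  // v' = the v_a with the highest n_a if any acceptor reported one,
  // otherwise the proposer's own value.
  func chooseValue(replies []PrepareReply, own interface{}) interface{} {
  	vprime, bestNa, seen := own, 0, false
  	for _, r := range replies {
  		if r.OK && r.Accepted && (!seen || r.Na > bestNa) {
  			vprime, bestNa, seen = r.Va, r.Na, true
  		}
  	}
  	return vprime
  }

Note that Accept uses n >= n_p, exactly as in the pseudocode above, so a
proposer's own prepare does not block its later accept.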
what if an acceptor crashes after receiving an accept?
  if it doesn't restart: maybe the proposer doesn't get enough accept_oks,
    times out, and a new proposer takes over
  if it does restart, it must remember v_a/n_a! (on disk)
    our node might be the only intersection with the new proposer's majority
    i.e. otherwise example 3 breaks: 11 may not see the chosen v10

what if an acceptor reboots after sending prepare_ok?
  does it have to remember n_p on disk?
  if n_p is not remembered, this could happen:
    S1: p10 a10v10
    S2: p10 p11 reboot a10v10 a11v11
    S3: p11 a11v11
  11's proposer did not see value 10, so 11 proposed its own value
    but just before that, 10 had been chosen!
    b/c S2 did not remember to ignore a10v10

the homework question:
  how does Paxos ensure that the following sequence of events can't happen?
  what actually happens, and which value is ultimately chosen?
    1. A sends p(1); acks from A, B, C
    2. A sends a(1, foo); A and C respond; A sends decide(foo)
    3. B sends p(2); acks from B and C
    4. B sends a(2, bar); B and C respond; B sends decide(bar)
  A: p1 a1foo
  B: p1 p2 a2bar
  C: p1 a1foo p2 a2bar
  C's prepare_ok to B really included "foo"
    thus a2foo, and so no problem

that was Lab 6; now for Lab 7

remember: replicated lock server
  a primary and a few backups
  all servers execute all operations in the same order
  layers:
    lock_server
    RSM
    config
    Paxos
  Paxos ensures agreement on the sequence of views
  ...

how do new views get triggered?
  all nodes ping each other
  if pingable nodes != the view's live nodes
    start a new Paxos instance for vid+1
  if a node reboots
    start a new Paxos instance

RSM assigns a sequence number to client requests
  e.g. acquire, release
  called a "viewstamp" = vid:seq
    0:0, 0:1, 0:2, 1:0, 1:1, 2:0
  all replicas execute client requests in viewstamp order
    can't execute 0:2 until 0:1 has been executed
  (a small ordering sketch appears at the end of these notes)

how to continue after a view change?
  if the primary from the prev view is still running, use its state
  if a backup from the prev view is still running, use its state
  if no server from the prev view is still running
    must wait for one to re-join! even if Paxos reached agreement

if the primary crashes, will all backups have the same state?
  no: maybe some backups saw the last operation, and some did not
  if state is small: pick one backup and send its state to all
  if state is large: find the last operation, send it to the backups that missed it

what if backup B1 was the only one to see the last operation,
  and B1 is down/separated for a while, and then B1 re-joins?
  problem: B1 executed an operation that no one else saw
  if state is small: replace B1's state
  if state is large: need a way to abort the last operation
  read the Harp paper

summary
  overall goal: replicated state machines
  needs a single sequence of views, e.g. a sequence of primaries
  paxos chooses each new view; one paxos instance per view
  servers must
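A small, self-contained Go sketch (with a hypothetical Viewstamp type) of the
vid:seq ordering the RSM uses: compare view IDs first, then sequence numbers
within a view, so 0:1 executes before 0:2 and every 0:* before 1:0.

  package main

  import (
  	"fmt"
  	"sort"
  )

  // A viewstamp (vid:seq) names a client operation: the view in which the
  // primary assigned it, and its sequence number within that view.
  type Viewstamp struct {
  	Vid, Seq uint64
  }

  // Before reports whether a precedes b in viewstamp order.
  func (a Viewstamp) Before(b Viewstamp) bool {
  	if a.Vid != b.Vid {
  		return a.Vid < b.Vid
  	}
  	return a.Seq < b.Seq
  }

  func main() {
  	ops := []Viewstamp{{1, 1}, {0, 2}, {2, 0}, {0, 0}, {1, 0}, {0, 1}}
  	sort.Slice(ops, func(i, j int) bool { return ops[i].Before(ops[j]) })
  	fmt.Println(ops) // [{0 0} {0 1} {0 2} {1 0} {1 1} {2 0}]
  }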