6.824 Lecture 13: Paxos

From Paxos Made Simple, by Leslie Lamport, 2001

introduction
  overall topic: fault-tolerant replicated services
  saw replicated state machines last time, primary/backup
  today: robust agreement on the primary
  Lab 6!

replicated state machine review, from the hypervisor paper
  [many clients, two servers, state, operations]
  if all servers see the same sequence of ops
    they will remain replicas, so one can take over from another
  works for deterministic local services: lock, block storage, files

how to ensure all replicas see the same sequence of ops?
  naive client -> all servers doesn't work
  common approach: primary + backups
    clients send all operations to the current primary
    primary chooses the order, sends to backups, replies to the client

what if the primary fails?
  how about: backup pings primary, takes over if it gets no response?
  doesn't work:
    pings lost => two primaries
    pings delayed => two primaries
    partition => two primaries
  basic problem: a broken network looks exactly like a dead primary
    but must be treated differently!
  can't afford to have two primaries!
    won't see the same operations, won't have the same state
    might give the same lock to two different clients!

we'll use Paxos to agree on the primary
  nodes ping, choose the node w/ lowest ID
  try to agree that that node is the primary
  really, agree on the set of live nodes
  the agreed set of live nodes is a "view"

protocol stack:
  RSM
  config service (view change)
  agreement (Paxos)

config service provides "views" and "view change"
  system goes through a sequence of views
  view: view# and set of participants
  ensure agreement on a unique successor of each view
  example:
    0, S1, S2, S3
    1, S2, S3
    2, S2, S3, S4
    3, S1, S2, S3, S4
  two uses:
    current view implies the current primary
    previous view tells the new primary who to get state from

what does the config service need from Paxos?
  one or more nodes will see a change in the set of pingable nodes
  call Paxos("i+1 S1 S2")
  maybe just one such call, maybe more than one
    all same i+1
    maybe all the same srv list, maybe different
  eventually Paxos says "decided i+1 S1 S2"
  config srv creates a separate Paxos instance for each view change
  so from now on in the lecture: agreeing on just a single new view

Paxos
  fault-tolerant agreement protocol
  general-purpose algorithm
  one or more servers each propose a value
  correctness: if agreement is reached, all agreeing nodes see the same value
  fault-tolerance: only a majority is required
    but a steady stream of failures may prevent agreement

general Paxos approach
  one (or more) nodes decide to be the proposer
  the proposer contacts participants, tries to assemble a majority
  if a majority respond, we're done

why agreement is hard
  two proposers?
  partition (two proposers)?
  proposer crashes?
  proposer crashes just after a majority?
  new proposer shouldn't choose a different value...

Paxos has two phases
  the proposer can't just send "do you promise to commit to this value?"
    can't promise: maybe everyone promised a different value
    acceptors have to be able to change their minds
  so: prepare, and accept

exchange:
  proposer                acceptors
    prepare(n)       ->
                     <-   prepare_ok(n_a, v_a)
    accept(n, v')    ->
                     <-   accept_ok(n)
    decided(v')      ->

definition: server S accepts a value
  S responded accept_ok
definition: value V is chosen
  a majority of servers accepted V

why n?
  may need multiple rounds, e.g. if a proposer crashes
  want later rounds to supersede earlier ones
  numbers allow us to compare early/late
  n values must be unique, comparable, and roughly follow time
  n = <time, server ID>?
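One concrete (and assumed, not mandated by these notes) way to get n values with
those properties is to pair a round number with the proposer's server ID; a
minimal Go sketch, with hypothetical names:

  package paxos

  // Proposal numbers must be unique across servers, comparable, and roughly
  // follow time. A common construction is a (round, server ID) pair: bump the
  // round past any n seen so far, and break ties with the server ID so two
  // proposers never generate equal n's.
  type Num struct {
  	Round int // roughly follows time; bumped for each new proposal
  	Srv   int // proposer's server ID, guarantees uniqueness
  }

  // Greater reports whether a is strictly larger than b.
  func (a Num) Greater(b Num) bool {
  	if a.Round != b.Round {
  		return a.Round > b.Round
  	}
  	return a.Srv > b.Srv
  }

  // Next returns a proposal number higher than any n seen so far.
  func Next(highestSeen Num, me int) Num {
  	return Num{Round: highestSeen.Round + 1, Srv: me}
  }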
the crucial property:
  if a value was chosen, any subsequent choice must be the same value
  i.e. the protocol must not change its mind
  maybe a different proposer &c, but the same value!

tricky b/c "chosen" is a system-wide property
  e.g. a majority accepts, then the proposer crashes
  nodes cannot tell locally that a value has been chosen

so:
  the proposer doesn't send out a value with prepare
  acceptors send back any value they have already accepted
  if there is one, the proposer proposes that value
    to avoid changing an existing choice
  if no value has already been accepted, the proposer can propose any value
    (e.g. a client request)
  the proposer must get prepare_ok from a majority
    to guarantee intersection with any majority that formed an existing choice

now the protocol

state: must persist across reboots
  n_p (highest prepare seen)
  n_a, v_a (highest accept seen)

proposer(v):
  choose n, unique and higher than any n seen so far
  send prepare(n) to all servers including self
  if prepare_ok(n_a, v_a) from a majority:
    v' = v_a with highest n_a; choose own v otherwise
    send accept(n, v') to all
    if accept_ok(n) from a majority:
      send decided(v') to all

acceptor's prepare(n) handler:
  if n > n_p
    n_p = n
    reply prepare_ok(n_a, v_a)

acceptor's accept(n, v) handler:
  if n >= n_p
    n_a = n
    v_a = v
    reply accept_ok(n)

if an acceptor times out (i.e. doesn't see decide()):
  ask all servers for v_a, see if one value has a majority
  otherwise become a proposer

(a short code sketch of these handlers appears after the examples below)

example 1 (normal operation):
  server set is S1, S2, S3; S3 just crashed
  S1's periodic pings would see that S3 is now dead
  S1's config server asks Paxos to agree on "vid+1, S1, S2"
  S1 is the proposer
    S1: p1  a1v1  d1
    S2: p1  a1v1  d1
    S3: (p1) (a1v1) (d1)
  note Paxos only requires a majority of the servers
    so we can continue even though S3 was down
    the proposer must not wait for S3's responses! fast timeout.
  what would happen in a network partition, i.e. if S3 were alive?
    S3 would also initiate Paxos for a new view
    S3's prepare would not assemble a majority
  Paxos must assemble a majority
    but a majority of what?
    if there are seven servers, and S1 thinks S4 S5 S6 S7 are down,
      can it proceed with a majority among just S1 S2 S3?

example 2 (concurrent proposers):
  scenario:
    x sends prepare(10), gets replies
    y sends prepare(11), gets replies, none had accepted a value from 10
    x sends accept(10, vx)
    y sends accept(11, vy)
  A1: p10 p11 a10v10 a11v11
  A2: ""
  A3: ""
  the risk: could all nodes accept v10 (so it is chosen), then accept v11?
    if this could happen, we would change our mind about a chosen value!
  but it's OK: n_p and the "if" in accept() cause nodes to ignore
    10's accept after they have seen 11's prepare
  what about other interleavings of concurrent proposers?
    v10 either was or was not accepted by a majority
    if yes: that majority saw 10's accepts before 11's prepare
      otherwise they would have ignored 10's accept
      so 11 will get at least one prepare_ok w/ 10's value
        majorities must intersect!
    if no: 11 will win, permanently

example 3 (proposer failure):
  S1: p10 a10v10     a11v10
  S2: p10 a10v10 p11 a11v10
  S3: p10        p11 a11v10
  the risk: 10 crashed just after sending accepts, but a majority arrived
    11 doesn't know that, and sends out prepares
  11's majority of prepare_oks must intersect 10's majority of accepts
    so 11 will learn of 10's value in a prepare_ok()
    and will choose the same value as 10
  conservative: 11 may use v10 even if only one server accepted v10

general argument about proposer failure?
  the old proposer either did or did not get a majority of accepts

what if the new proposer chooses an n lower than the old proposer's?
  e.g. if clocks are not synced
  it cannot make progress, though there is no correctness problem
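Here is a minimal Go sketch of the acceptor state and handlers above, plus the
proposer's choice of v'. It is illustrative only: the type and function names
are assumptions, n is simplified to a plain int (in practice it would be the
<time, server ID> pair sketched earlier), and a real implementation must write
n_p, n_a, v_a to stable storage before replying and wrap these in RPC handlers.

  package paxos

  import "sync"

  // Per-Paxos-instance acceptor state. n_p, n_a, v_a must survive reboots,
  // so a real implementation persists them to disk before replying.
  type Acceptor struct {
  	mu       sync.Mutex
  	np       int         // n_p: highest prepare seen
  	na       int         // n_a: highest accept seen
  	va       interface{} // v_a: value of the highest accept seen
  	accepted bool        // has any accept been seen?
  }

  type PrepareReply struct {
  	OK       bool
  	Na       int
  	Va       interface{}
  	Accepted bool
  }

  // Prepare is the acceptor's prepare(n) handler: promise to ignore lower-
  // numbered proposals, and report any already-accepted value so the proposer
  // cannot change an existing choice.
  func (a *Acceptor) Prepare(n int) PrepareReply {
  	a.mu.Lock()
  	defer a.mu.Unlock()
  	if n > a.np {
  		a.np = n // persist before replying
  		return PrepareReply{OK: true, Na: a.na, Va: a.va, Accepted: a.accepted}
  	}
  	return PrepareReply{OK: false}
  }

  // Accept is the acceptor's accept(n, v) handler.
  func (a *Acceptor) Accept(n int, v interface{}) bool {
  	a.mu.Lock()
  	defer a.mu.Unlock()
  	if n >= a.np {
  		a.na, a.va, a.accepted = n, v, true // persist before replying
  		return true
  	}
  	return false
  }

  // chooseValue is the proposer's rule after prepare_oks from a majority:
  // v' = the v_a with the highest n_a if any acceptor reported one,
  // otherwise the proposer's own value.
  func chooseValue(replies []PrepareReply, own interface{}) interface{} {
  	vprime, bestNa, seen := own, 0, false
  	for _, r := range replies {
  		if r.OK && r.Accepted && (!seen || r.Na > bestNa) {
  			vprime, bestNa, seen = r.Va, r.Na, true
  		}
  	}
  	return vprime
  }

Note that Accept uses n >= n_p, exactly as in the pseudocode above, so a
proposer's own prepare does not block its later accept.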
what if an acceptor crashes after receiving an accept?
  if it doesn't restart: maybe the proposer doesn't get enough accept_oks,
    times out, and a new proposer takes over
  if it does restart, it must remember v_a/n_a! (on disk)
    our node might be the only intersection with the new proposer's majority
    i.e. otherwise example 3 breaks: 11 may not see the chosen v10

what if an acceptor reboots after sending prepare_ok?
  does it have to remember n_p on disk?
  if n_p is not remembered, this could happen:
    S1: p10 a10v10
    S2: p10 p11 reboot a10v10 a11v11
    S3: p11 a11v11
  11's proposer did not see value 10, so 11 proposed its own value
    but just before that, 10 had been chosen!
    b/c S2 did not remember to ignore a10v10

the homework question:
  how does Paxos ensure that the following sequence of events can't happen?
  what actually happens, and which value is ultimately chosen?
    1. A sends p(1); acks from A, B, C
    2. A sends a(1, foo); A and C respond; A sends decide(foo)
    3. B sends p(2); acks from B and C
    4. B sends a(2, bar); B and C respond; B sends decide(bar)
  A: p1 a1foo
  B: p1 p2 a2bar
  C: p1 a1foo p2 a2bar
  C's prepare_ok to B really included "foo"
    thus a2foo, and so no problem

that was Lab 6; now for Lab 7

remember: replicated lock server
  a primary and a few backups
  all servers execute all operations in the same order
  layers:
    lock_server
    RSM
    config
    Paxos
  Paxos ensures agreement on the sequence of views
  ...

how do new views get triggered?
  all nodes ping each other
  if pingable nodes != the view's live nodes
    start a new Paxos instance for vid+1
  if a node reboots
    start a new Paxos instance

RSM assigns a sequence number to client requests
  e.g. acquire, release
  called a "viewstamp" = vid:seq
    0:0, 0:1, 0:2, 1:0, 1:1, 2:0
  all replicas execute client requests in viewstamp order
    can't execute 0:2 until 0:1 has been executed
  (a small ordering sketch appears at the end of these notes)

how to continue after a view change?
  if the primary from the prev view is still running, use its state
  if a backup from the prev view is still running, use its state
  if no server from the prev view is still running
    must wait for one to re-join! even if Paxos reached agreement

if the primary crashes, will all backups have the same state?
  no: maybe some backups saw the last operation, and some did not
  if state is small: pick one backup and send its state to all
  if state is large: find the last operation, send it to the backups that missed it

what if backup B1 was the only one to see the last operation,
  and B1 is down/separated for a while, and then B1 re-joins?
  problem: B1 executed an operation that no one else saw
  if state is small: replace B1's state
  if state is large: need a way to abort the last operation
  read the Harp paper

summary
  overall goal: replicated state machines
  needs a single sequence of views, e.g. a sequence of primaries
  paxos chooses each new view; one paxos instance per view
  servers must
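A small, self-contained Go sketch (with a hypothetical Viewstamp type) of the
vid:seq ordering the RSM uses: compare view IDs first, then sequence numbers
within a view, so 0:1 executes before 0:2 and every 0:* before 1:0.

  package main

  import (
  	"fmt"
  	"sort"
  )

  // A viewstamp (vid:seq) names a client operation: the view in which the
  // primary assigned it, and its sequence number within that view.
  type Viewstamp struct {
  	Vid, Seq uint64
  }

  // Before reports whether a precedes b in viewstamp order.
  func (a Viewstamp) Before(b Viewstamp) bool {
  	if a.Vid != b.Vid {
  		return a.Vid < b.Vid
  	}
  	return a.Seq < b.Seq
  }

  func main() {
  	ops := []Viewstamp{{1, 1}, {0, 2}, {2, 0}, {0, 0}, {1, 0}, {0, 1}}
  	sort.Slice(ops, func(i, j int) bool { return ops[i].Before(ops[j]) })
  	fmt.Println(ops) // [{0 0} {0 1} {0 2} {1 0} {1 1} {2 0}]
  }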