6.824 2009 Lecture 13: Paxos

From Paxos Made Simple, by Leslie Lamport, 2001

introduction
  how to build a fault-tolerant service?
    e.g. lock server
    want it to look to clients like a single server
  replicated state machine
    like hypervisor paper
    [many clients, two servers, state, operations]
    if all servers see same sequence of ops
      they will remain replicas, so one can take over from another
    works for deterministic local services: lock, block storage, files
  how to ensure all replicas see same sequence of ops?
    there are many possibilities

primary + backup
  clients send all operations to current primary
  primary chooses order, sends to backup, replies to client

what if the primary fails?
  how about: backup pings primary, takes over if it gets no response from primary
  doesn't work:
    pings lost => two primaries
    pings delayed => two primaries
    partition => two primaries
  basic problem:
    broken network looks exactly like a dead primary
    but must be treated differently!
  can't afford to have two primaries!
    won't see same operations, won't have same state
    might give the same lock to two different clients!

we'll use Paxos to agree on which nodes are alive
  basic idea: odd # of nodes, nodes vote, at most one majority
  can use set of live nodes to agree on who is the primary
  agreed set of live nodes is a "view"

protocol stack:
  RSM
  config service (view change)
  agreement (Paxos)

config service provides "views" and "view change"
  system goes through a sequence of views
  view: view# and set of participants
  ensure agreement on unique successor of each view
  example:
    0: S1, S2, S3
    1: S2, S3
    2: S2, S3, S4
    3: S1, S2, S3, S4
  two uses:
    current view implies current primary
    previous view tells new primary who to get state from

what does config service need from Paxos?
  one or more nodes will see change in set of pingable nodes
  call Paxos("i+1 S1 S2")
    maybe just one such call, maybe more than one
    all same i+1
    maybe all same srv list, maybe different
  eventually Paxos says "decided i+1 S1 S2"
  config srv creates separate Paxos instance for each view change
  so from now on in lecture, agreeing on just a single new view

Paxos
  fault-tolerant agreement protocol
  general-purpose algorithm
  one or more servers each propose a value
  Paxos will not ever "agree" on more than one distinct value
  might never agree on anything (if too many failures)
  if net+servers stable long enough, and enough are live,
    Paxos will eventually choose one proposed value as the agreed value

general Paxos approach
  one (or more) nodes decide to be the proposer
  proposer contacts participants, tries to assemble a majority
  if a majority respond, we're done

why agreement is hard
  two proposers?
  partition (two proposers)?
  proposer crashes?
  proposer crashes just after majority?
    new proposer shouldn't choose a different value...

Paxos has two phases
  proposer can't just send "do you promise to commit to this value?"
    can't promise: maybe everyone promised to a different value
    have to be able to change mind
  so: prepare, and accept

definition: "chosen"
  a majority has accepted the value, i.e. sent accept_ok
  majority means at most one value can be chosen

exchange:
  proposer                        acceptors
    prepare(n)        ->
                      <-          prepare_ok(n_a, v_a)
    accept(n, v')     ->
                      <-          accept_ok(n)
    decided(v')       ->

why n?
  may need multiple rounds, e.g. if a proposer crashes
  want later rounds to supersede earlier ones
  numbers allow us to compare early/late

the crucial property:
  if a value was chosen, any subsequent choice must be the same value
  i.e. protocol must not change its mind
  maybe a different proposer &c, but same value!
  tricky b/c "chosen" is a system-wide property
    e.g. majority accepts, then proposer crashes
    nodes cannot tell locally
  so:
    proposer doesn't send out a value with prepare
    acceptors send back any value they have already accepted
    if there is one, proposer proposes that
      to avoid changing an existing choice
    if no value already accepted, proposer can propose any value
      (e.g. a client request)
    proposer must get prepare_ok from a majority
      to guarantee intersection with a majority formed by an existing choice
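aside: before the pseudocode below, a rough Go sketch of the messages in the
exchange above and of the proposer's value-selection rule. All names here
(PrepareArgs, chooseValue, &c) are invented for illustration, not from the
paper or the lab code.

  // Sketch of the message shapes; not the lab's actual RPC definitions.
  package paxos

  // PrepareArgs / PrepareReply carry prepare(n) -> prepare_ok(n_a, v_a).
  type PrepareArgs struct {
    N int64 // proposal number n
  }

  type PrepareReply struct {
    OK bool        // true iff n > n_p at this acceptor
    Na int64       // n_a: highest proposal this acceptor has accepted
    Va interface{} // v_a: value accepted with Na; nil if none yet
  }

  // AcceptArgs / AcceptReply carry accept(n, v') -> accept_ok(n).
  type AcceptArgs struct {
    N int64
    V interface{} // v': the value the proposer asks acceptors to accept
  }

  type AcceptReply struct {
    OK bool // true iff n >= n_p at this acceptor
  }

  // DecidedArgs announces the chosen value.
  type DecidedArgs struct {
    V interface{}
  }

  // chooseValue applies the rule above: if any acceptor in the majority
  // has already accepted a value, re-propose the one with the highest n_a;
  // otherwise the proposer may use its own value.
  func chooseValue(replies []PrepareReply, ownV interface{}) interface{} {
    vPrime := ownV
    var highest int64 // assumes real proposal numbers are > 0
    for _, r := range replies {
      if r.OK && r.Va != nil && r.Na > highest {
        highest = r.Na
        vPrime = r.Va
      }
    }
    return vPrime
  }

chooseValue is the heart of "must not change its mind": any already-accepted
value found among a majority of prepare_ok replies wins over the proposer's own.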
now the protocols

proposer(v):
  choose n, unique and higher than any n seen so far
  send prepare(n) to all servers including self
  if prepare_ok(n_a, v_a) from majority:
    v' = v_a with highest n_a; choose own v otherwise
    send accept(n, v') to all
    if accept_ok(n) from majority:
      send decided(v') to all

acceptor state:
  must persist across reboots
  n_p (highest prepare seen)
  n_a, v_a (highest accept seen)

acceptor prepare(n) handler:
  if n > n_p
    n_p = n
    reply prepare_ok(n_a, v_a)

acceptor accept(n, v) handler:
  if n >= n_p
    n_a = n
    v_a = v
    reply accept_ok(n)

the commit point is when the f+1'st acceptor records n_a/v_a in stable storage

if an acceptor times out (i.e. doesn't see decided())
  ask all servers for v_a, see if one value has a majority
  otherwise become a proposer

easy cases:
  one proposer, nothing fails
  network partition (at most one majority)
  why only a majority -- why not all?

three harder cases:
  > 1 proposer (this is the hardest case)
  proposer fails (sub-case of > 1 proposer)
  acceptor fails

example 1 (concurrent proposers):
  scenario:
    x sends prepare(10), gets replies
    y sends prepare(11), gets replies, none accepted a value from 10
    x sends accept(10, v10)
    y sends accept(11, v11)
  A1: p10 p11 a10v10 a11v11
  A2: ""
  A3: ""
  the risk: could all nodes accept v10 (so it is chosen), then accept v11?
    if this could happen, we would change our mind about chosen values!
  but it's OK: n_p and the "if" in accept() cause nodes to ignore
    10's accept after they have seen 11's prepare
  what about other interleavings of concurrent proposers?
    v10 either was or was not accepted by a majority
    if yes: 10's accepts came before 11's prepares, so 11 will see the value
    if no: 11 will win, permanently

example 2 (proposer failure):
  S1: p10 a10v10     a11v10
  S2: p10 a10v10 p11 a11v10
  S3: p10        p11 a11v10
  the risk: 10 crashed just after sending accepts, but a majority arrived
    11 doesn't know it, sends out prepares
  11's majority of prepare_ok must include 10's majority of accepts
    so 11 will learn of 10's value in prepare_ok()
    and will choose the same value as 10
  conservative: 11 may use v10 even if only one server accepted v10

general argument about proposer failure?
  old proposer either did or did not get a majority of accepts
  what if the new proposer chooses n < old proposer's n?
    cannot make progress, though no correctness problem
    maybe n = timestamp in high bits, unique node ID in low bits

what if an acceptor crashes after receiving accept?
  if it doesn't restart, maybe the proposer doesn't get enough accept_oks,
    times out, new proposer
  if it does restart, it must remember v_a/n_a! (on disk)
    the proposer might have failed after sending a few accept()s
    our node might be the only intersection with the new proposer's majority
    i.e. otherwise example 2 breaks, 11 may not see chosen v10

what if an acceptor reboots after sending prepare_ok?
  does it have to remember n_p on disk? yes!
  example 1 above requires this, e.g.
    if some servers restart after getting p11,
    we don't want them to accept a10v10,
    since 11 saw a majority of prepare_ok without a value,
    so 11 will propose its own value
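aside: a minimal Go sketch of the acceptor side, using the message structs
sketched earlier. The names are illustrative; the point is that n_p, n_a, v_a
reach stable storage before the acceptor replies, so they survive a reboot.
persist() stands in for whatever synchronous disk write a real implementation
would do.

  // same package as the message structs sketched earlier
  package paxos

  import "sync"

  type Acceptor struct {
    mu sync.Mutex
    np int64       // n_p: highest prepare seen
    na int64       // n_a: highest accept seen
    va interface{} // v_a: value accepted with na
  }

  // persist is a placeholder for a synchronous write of np/na/va to disk
  // (e.g. append a record to a log file and fsync before returning).
  func (a *Acceptor) persist() {
  }

  func (a *Acceptor) Prepare(args *PrepareArgs, reply *PrepareReply) error {
    a.mu.Lock()
    defer a.mu.Unlock()
    if args.N > a.np {
      a.np = args.N
      a.persist() // remember the promise before replying
      reply.OK, reply.Na, reply.Va = true, a.na, a.va
    } else {
      reply.OK = false
    }
    return nil
  }

  func (a *Acceptor) Accept(args *AcceptArgs, reply *AcceptReply) error {
    a.mu.Lock()
    defer a.mu.Unlock()
    if args.N >= a.np { // same test as the handler above; n_p unchanged
      a.na = args.N
      a.va = args.V
      a.persist() // the accept must survive a crash + reboot
      reply.OK = true
    } else {
      reply.OK = false
    }
    return nil
  }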
the homework question:
  How does Paxos ensure that the following sequence of events can't happen?
  What actually happens, and which value is ultimately chosen?
  1. A sends p(1); acks from A, B, C.
  2. A sends a(1, foo); A and C respond; A sends decide(foo)
  3. B sends p(2); acks from B and C
  4. B sends a(2, bar); B and C respond; B sends decide(bar)

summary
  overall goal: replicated state machines
  needs a single sequence of views, e.g. sequence of primaries
  paxos chooses each new view; one paxos instance per view
  still need a way to carry data from one view to the next
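aside: a sketch of what the config service's use of Paxos might look like,
one instance per view change as in the summary. View, ConfigServer, Instance,
Propose, and Decided are all made-up placeholders, not a real API.

  package config

  type View struct {
    Num     int      // view number i
    Servers []string // set of participants in this view
  }

  // Instance stands in for one Paxos agreement (proposer + acceptor state).
  type Instance struct{}

  func (ix *Instance) Propose(v View)        { /* run prepare/accept with v */ }
  func (ix *Instance) Decided() (View, bool) { return View{}, false }

  type ConfigServer struct {
    views     []View            // agreed views; views[i].Num == i
    instances map[int]*Instance // one Paxos instance per view number
  }

  // pingSetChanged runs when this node notices the set of pingable nodes
  // has changed: propose view i+1 with the live servers it currently sees.
  // Several nodes may do this concurrently, possibly with different lists;
  // Paxos guarantees they all end up with the same view i+1.
  func (cs *ConfigServer) pingSetChanged(live []string) {
    next := len(cs.views) // i+1
    if cs.instances == nil {
      cs.instances = make(map[int]*Instance)
    }
    ix, ok := cs.instances[next]
    if !ok {
      ix = &Instance{}
      cs.instances[next] = ix
    }
    ix.Propose(View{Num: next, Servers: live})
    // later, once ix.Decided() returns true, every node appends the same
    // View to cs.views, whatever it proposed itself
  }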