6.824 2006 Lecture 12: Paxos

From Paxos Made Simple, by Leslie Lamport, 2001

introduction
  2-phase commit is good if different nodes are doing different things
  but in general you have to wait for all sites and the TC to be up
    you have to know if each site voted yes or no
    and the TC must be up to decide
    not very fault-tolerant: has to wait for repair

can we get work done even if some nodes can't be contacted?
  yes: in the special case of replication

state machine replication
  works for any kind of replicated service: storage or lock server or whatever
  every replica must see the same operations in the same order
  if deterministic, the replicas will end up with the same state

how to ensure all replicas see operations in the same order?
  primary + backup(s)
  clients send all operations to the current primary
  primary chooses the order, sends to the backups, replies to the client

what if the primary fails?
  need to worry about the last operation, which may not have completed
  need to pick a new primary
  can't afford to have two primaries!

suppose the lowest-numbered live server is the primary
  so after a failure, everyone pings everyone
  then everyone knows who the new primary is?
  well, maybe not:
    pings may be lost => two primaries
    pings may be delayed => two primaries
    partition => two primaries

idea: a majority of nodes must agree on the primary
  at most one network partition can contain a majority
  so if there are two potential primaries, their majorities must overlap

technique: "view change" algorithm
  the system goes through a sequence of views
  view: view# and set of participants
  ensure agreement on a unique successor of each view
  the participant set allows everyone to agree on the new primary

view change requires "fault-tolerant agreement"
  at most a single value is chosen
  agree despite lost messages and crashed nodes
  can't really guarantee to agree
  but we can guarantee to *not* "agree" on different values!

Paxos
  fault-tolerant agreement protocol
  eventually succeeds if a majority of participants are reachable
  best known algorithm

general Paxos approach
  one (or more) nodes decide to be the leader
  the leader chooses a proposed value to agree on (view# and participant set)
  the leader contacts the participants, tries to assemble a majority
    participants are all the nodes in the old view (including unreachable ones)
    or a fixed set of configuration master nodes
  if a majority respond, we're done

why agreement is hard
  what if two nodes decide to be the leader?
  what if a network partition leads to two leaders?
  what if the leader crashes after persuading only some of the nodes?
  what if the leader got a majority, then failed without announcing the result?
    or announced the result to only a few nodes?
    a new leader might choose a different value, even though we had agreed

Paxos has three phases
  may have to start over if there are failures or timeouts

state (per view)
  n_a, v_a: highest proposal number this node has accepted, and its value
  n_h: highest proposal number seen in any Q1
  done: leader says agreement was reached, we can start the new view
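to make the state concrete, here is a minimal Python sketch; the field
names follow the notes, everything else is illustrative. proposal numbers
are modeled as (count, node_id) pairs so they are unique and totally
ordered, anticipating the "plus one, append node ID" trick in Phase 1 below.

  # per-view state kept by every node (sketch; names not part of the protocol)
  class PaxosState:
      def __init__(self):
          self.n_a = (0, 0)   # highest proposal number accepted so far
          self.v_a = None     # value accepted along with n_a
          self.n_h = (0, 0)   # highest proposal number seen in any Q1
          self.done = False   # leader announced agreement; new view can start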
Paxos Phase 1
  a node (maybe more than one...) decides to be the leader
  it picks a proposal number n
    n must be unique, and it helps if it's higher than any known proposal number
    how about: last known proposal number, plus one, append the node's ID
  it sends Q1(n) to every node (including itself)
  if a node gets Q1(n) and n > n_h:
    n_h = n
    return R1(n_a, v_a)

Paxos Phase 2
  if the leader gets R1 from a majority of nodes (including itself):
    if any R1(n, v) contained a value, v = the value with the highest n
    else the leader gets to choose a value:
      old view# + 1, and the set of pingable nodes
    send Q2(n, v) to all the responders
  if a node gets Q2(n, v) and n >= n_h:
    n_a = n
    v_a = v
    return R2()

Paxos Phase 3
  if the leader gets R2() from a majority:
    send Q3() to all
  if a node gets Q3():
    done = true
    the primary is the lowest-numbered node in v_a

if at any time any node gets bored (times out)
  it declares itself a leader and starts a new Phase 1

if nothing goes wrong, Paxos clearly reaches agreement

how do we ensure a good probability that there is only one leader?
  every node has to be prepared to be leader, to cope w/ failure
  so delay a random amount of time after you realize a new view is required
  or delay your ID times some constant

key danger: nodes w/ different v_a receive Q3
goal: if Q3 *could* have been sent, future Q3s are guaranteed to have the same v_a

what if there is more than one leader?
  due to timeout or partition or lost packets
  say the two leaders used different n, say 10 and 11
  if 10 didn't get a majority to R2
    it never will, since no node will send R2 to 10 after seeing 11's Q1
    or perhaps 10 is in a network partition
  if 10 did get a majority to R2
    i.e. it might have sent Q3
    10's majority saw 10's Q2 before 11's Q1
      otherwise they would have ignored 10's Q2, so no majority
    so 11 will get an R1 from at least one node that saw 10's Q2
    so 11 will be aware of 10's value
    so 11 will use 10's value, rather than making up a new one
    so we agreed on a single v after all

what if the leader fails before sending any Q2s?
  some node will time out and become a leader
  the old leader didn't send any Q3, so we don't care what it did
  it's good, but not necessary, that the new leader choose a higher n
    if it doesn't, it will time out and some other leader will try
    eventually we'll get a leader that knew the old n and will use a higher n

what if the leader fails after sending a minority of Q2s?
  same as two leaders...

what if the leader fails after sending a majority of Q2s?
  i.e. potentially after reaching agreement!
  same as two leaders...

what if a node fails after receiving a Q2?
  if it doesn't restart: possible timeout in Phase 3, new leader
  if it does restart, it must remember v_a/n_a! (on disk)
    the leader might have failed after sending a few Q3s
    the new leader must choose the same value
    our node might be the intersecting node of the two majorities

what if a node reboots after sending R1?
  does it have to remember n_h on disk?
  it uses n_h to reject Q1/Q2 with smaller n; consider this scenario:
    leader1 sends Q1(n=10), a bare majority sends R1
      so node X's n_h = 10
    leader2 sends Q1(n=11), a majority intersecting only at node X sends R1
      so node X's n_h = 11
    leader2 got no R1 with a value, so it chooses v=200
    node X crashes and reboots, losing n_h
    leader1 sends Q2(n=10, v=100), and its bare majority gets it
      including node X (which should have rejected it...)
      so we have agreement w/ v=100
    leader2 sends Q2(n=11, v=200)
      its bare majority all accept the message
      including node X, since 11 > n_h
      so we have agreement w/ v=200. oops.
  so: each node must remember n_h on disk
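pulling the three phases and the persistence requirement together, here is
a hedged Python sketch of the acceptor-side handlers and the leader's
value-choice rule. all the names (Acceptor, FakeDisk, handle_q1, ...) are
invented for this sketch; a real node would write and fsync a file where
FakeDisk stands in. note that state goes to disk *before* each reply.

  class FakeDisk:
      # stand-in for stable storage; holds (n_a, v_a, n_h)
      def __init__(self):
          self.state = ((0, 0), None, (0, 0))
      def load(self):
          return self.state
      def save(self, n_a, v_a, n_h):
          self.state = (n_a, v_a, n_h)

  class Acceptor:
      def __init__(self, disk):
          self.disk = disk
          self.n_a, self.v_a, self.n_h = disk.load()  # recover after a reboot
          self.done = False

      def handle_q1(self, n):
          # Phase 1: promise to ignore proposals numbered below n
          if n > self.n_h:
              self.n_h = n
              self.disk.save(self.n_a, self.v_a, self.n_h)  # before replying!
              return ("R1", self.n_a, self.v_a)
          return None  # stale proposal; ignore it

      def handle_q2(self, n, v):
          # Phase 2: accept unless we've promised a higher-numbered leader
          if n >= self.n_h:
              self.n_a, self.v_a = n, v
              self.disk.save(self.n_a, self.v_a, self.n_h)  # before replying!
              return ("R2",)
          return None

      def handle_q3(self):
          # Phase 3: the leader saw a majority of R2s; agreement was reached
          self.done = True

  def choose_value(r1_replies, my_value):
      # leader's Phase 2 rule: if any R1 carried an accepted value, adopt
      # the value with the highest n; otherwise propose our own value
      # (here, the new view# and participant set)
      accepted = [(n, v) for (n, v) in r1_replies if v is not None]
      if accepted:
          return max(accepted)[1]
      return my_value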
conclusion
  what have we achieved?
  remember, the original goal was replicated state machines
    and we want to continue even if some nodes are not available
  after each failure we can perform a view change using Paxos agreement
    that is, we can agree on exactly which nodes are in the new view
    so, for example, everyone can agree on a single new primary
  but we haven't talked at all about how to manage the data