6.824 2005 Lecture 15: Paxos

From Paxos Made Simple, by Leslie Lamport, 2001

introduction
  2-phase commit is good if different nodes are doing different things
  but in general you have to wait for all sites and the TC to be up
    you have to know if each site voted yes or no
    and the TC must be up to decide
    not very fault-tolerant: has to wait for repair
  can we get work done even if some nodes can't be contacted?
    yes: in the special case of replication
  state machine replication
    works for any kind of replicated service: storage or lock server or whatever
    every replica must see the same operations in the same order
    if deterministic, the replicas will end up with the same state
  use primary/backup, so that the primary can choose the order
  what if the primary fails?
    need to worry about the last operation, which may not have completed
    need to pick a new primary
    can't afford to have two primaries!
  suppose the lowest-numbered live server is the primary
    so after a failure, everyone pings everyone
    then everyone knows who the new primary is?
    well, maybe not:
      pings may be lost => two primaries
      pings may be delayed => two primaries
      partition => two primaries
  solution: a "view change" algorithm
    the system goes through a sequence of views
    view: a view# and a set of participants
    ensure agreement on a unique successor of each view
    the participant set lets everyone agree on the new primary
  view change requires "fault-tolerant agreement"
    at most a single value is chosen
    agree despite lost messages and crashed nodes
    can't really guarantee to agree
    but we can guarantee to *not* "agree" on different values!

Paxos
  a fault-tolerant agreement protocol
  eventually succeeds if a majority of participants are reachable
  the best known such algorithm

general Paxos approach
  one (or more) nodes decide to be the leader
  the leader chooses a proposed value to agree on (view# and participant set)
  the leader contacts the participants, tries to assemble a majority
    participants are all the nodes in the old view (including unreachable ones)
    or a fixed set of configuration master nodes
  if a majority respond, we're done

why agreement is hard
  what if two nodes decide to be the leader?
  what if the leader crashes after persuading only some of the nodes?
  what if the leader got a majority, then failed, without announcing the result?
    or announced the result to only a few nodes?
    a new leader might choose a different value, even though we agreed
  what if a network partition leads to two leaders?

Paxos has three phases
  may have to start over if there are failures/timeouts

state kept by each node
  n_a, v_a: highest proposal number this node has accepted, and its value
  n_p: highest proposal number seen in any PREPARE
  (the message types and rules are given per phase below)

Paxos Phase 1
  a node (maybe more than one...) decides to be leader
  the leader picks a proposal number n
    must be unique, and ideally higher than any known proposal number
    e.g. last known proposal number, plus one, with the node ID appended
  the leader sends PREPARE(n) to every node (including itself)
  if a node gets PREPARE(n) and n > n_p:
    return RESPONSE(n_a, v_a)
    n_p = n

Paxos Phase 2
  if the leader gets a RESPONSE from a majority of nodes (including itself):
    if any RESPONSE(n, v) had a value, v = the value with the highest n
    else the leader gets to choose a value
      e.g. old view# + 1, and the set of pingable nodes
    send ACCEPT(n, v) to all responders
  if a node gets ACCEPT(n, v) and n >= n_p:
    n_a = n
    v_a = v
  if a majority accepted, we've agreed on a new value
    but we don't know yet whether we've reached agreement
    what if a node crashes after its RESPONSE?
    what if there was a second leader?
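aside: to make the per-node rules in Phases 1 and 2 concrete, here is a
minimal sketch of the acceptor side in Go. The notes specify only the
state (n_p, n_a, v_a) and the rules; the names Acceptor, HandlePrepare,
and HandleAccept are mine, and the value is just a string standing in
for an encoded view.

  // acceptor.go -- a sketch of the per-node (acceptor) rules above.
  package paxos

  // Acceptor holds the state each node keeps (on disk -- see the
  // reboot scenarios below).
  type Acceptor struct {
      np int    // n_p: highest n seen in any PREPARE
      na int    // n_a: n of the highest-numbered accepted proposal
      va string // v_a: its value; "" if nothing accepted yet
  }

  // HandlePrepare implements Phase 1: on PREPARE(n) with n > n_p,
  // promise to ignore lower-numbered proposals and report anything
  // already accepted.
  func (a *Acceptor) HandlePrepare(n int) (ok bool, na int, va string) {
      if n > a.np {
          a.np = n // must be persisted before replying
          return true, a.na, a.va
      }
      return false, 0, ""
  }

  // HandleAccept implements Phase 2: on ACCEPT(n, v) with n >= n_p,
  // record the proposal as accepted.
  func (a *Acceptor) HandleAccept(n int, v string) bool {
      if n >= a.np {
          a.na, a.va = n, v // must be persisted as well
          return true
      }
      return false
  }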
Paxos Phase 3
  the leader asks all nodes if they accepted its value
  if it gets a majority of "yes", it tells everyone
  now everyone agrees that the new *primary* is the lowest-numbered node
    in the new view
  if at any time any node gets bored (times out),
    it declares itself a leader and starts a new Phase 1

if nothing goes wrong, Paxos clearly reaches agreement
how do we ensure a good probability that there is only one leader?
  every node has to be prepared to be leader, to cope with failure
  so delay a random amount of time after you realize a new view is required
  or delay your ID times some constant

what if more than one leader?
  due to timeout, partition, or lost packets
  say the two leaders used different n, say 10 and 11
  if 10 didn't get a majority to ACCEPT:
    it never will, since no one will ACCEPT 10 after seeing 11's PREPARE
    or perhaps 10 is in a network partition
  if 10 did get a majority to ACCEPT:
    that majority saw 10's ACCEPT before 11's PREPARE
      (otherwise they would have ignored 10's ACCEPT, so no majority)
    so 11 will get a RESPONSE from at least one node that saw 10's ACCEPT
    so 11 will be aware of 10's value
    so 11 will use 10's value, rather than making up a new one
    so we reached agreement after all

what if the leader fails before sending ACCEPTs?
  some node will time out and become a leader
  the old leader didn't reach agreement, so we don't care what it did
  it's good, but not necessary, that the new leader chooses a higher n
    if it doesn't: timeout, and some other leader will try
    eventually we'll get a leader that knew the old n and will use a higher n

what if the leader fails after sending a minority of ACCEPTs?
  same as two leaders...

what if the leader fails after sending a majority of ACCEPTs?
  i.e. potentially after reaching agreement!
  same as two leaders...

what if a node fails after receiving an ACCEPT?
  if it doesn't reboot: possible timeout in Phase 3, new leader
  if it does reboot, it must remember v_a/n_a! (on disk)
    the leader might have failed
    the new leader must choose the same value, since there might have
      been agreement
    our node might be the intersecting node of the two majorities

what if a node reboots after sending a RESPONSE?
  does it have to remember n_p on disk?
  it uses n_p to reject PREPARE/ACCEPT with smaller n
  scenario:
    leader1 sends PREPARE(n=10), a bare majority RESPONDs
      so node X's n_p = 10
    leader2 sends PREPARE(n=11), a majority intersecting only at node X
      RESPONDs, so node X's n_p = 11
    leader2 got no RESPONSE with a value, so it chooses v=200
    node X crashes and reboots, losing n_p
    leader1 sends ACCEPT(n=10, v=100), its bare majority gets it
      including node X (which should have rejected it...)
      so we have agreement with v=100
    leader2 sends ACCEPT(n=11, v=200), its bare majority all accept it
      including node X, since 11 > n_p
      so we have agreement with v=200. oops.
  so: each node must remember n_p on disk

conclusion
  what have we achieved?
  remember the original goal was replicated state machines
    and we want to continue even if some nodes are not available
  after each failure we can perform a view change using Paxos agreement
  that is, we can agree on exactly which nodes are in the new view
  so, for example, everyone can agree on a single new primary
  but we haven't talked at all about how to manage the data
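aside: a matching sketch of the leader side, under the same caveats
(Response, makeN, and chooseValue are illustrative names, not from the
notes). It shows the two rules the failure analysis above leans on:
building a proposal number that is unique (and usually higher than any
seen), and adopting the value from the highest-numbered RESPONSE rather
than inventing a new one.

  // leader.go -- a sketch of the leader (proposer) rules.
  package paxos

  // Response is what an acceptor returns from PREPARE(n).
  type Response struct {
      OK bool
      Na int    // the acceptor's n_a
      Va string // the acceptor's v_a; "" if it has accepted nothing
  }

  // makeN builds a proposal number: last known n, plus one, with the
  // node ID appended in the low digits (assumes nodeID < 1000).
  // Unique because node IDs are unique; usually higher than any seen.
  func makeN(lastKnown, nodeID int) int {
      return (lastKnown/1000+1)*1000 + nodeID
  }

  // chooseValue implements the Phase 2 rule: if any RESPONSE carried
  // a value, adopt the value with the highest n (this is what forces
  // leader 11 to reuse leader 10's value in the analysis above);
  // otherwise the leader may propose its own value, e.g. the new view.
  func chooseValue(responses []Response, own string) string {
      v, best := own, -1
      for _, r := range responses {
          if r.OK && r.Va != "" && r.Na > best {
              v, best = r.Va, r.Na
          }
      }
      return v
  }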