6.824 2012 Lecture 17: Security: Byzantine Fault Tolerance reminder: start thinking about projects, groups of 2/3 we've considered many fault-tolerance protocols have always assumed "fail-stop" failures -- like power failure i.e. servers follow the protocol hard enough: crash vs network down; network partition can one handle a larger class of failures? buggy servers, that compute incorrectly rather than stopping? servers that *don't* follow the protocol? servers that have been modified by an attacker? often called "Byzantine" faults the paper's approach: replicated state machine assumes 2f+1 of 3f+1 are non-faulty use voting to select the right results not as easy as it might sound let's assume the worst case: a single attacker controls the f faulty replicas and is actively trying to break the system if we can handle this, we can handle bugs in f replicas too what are the attacker's powers? supplies the code that faulty replicas run knows the code the non-faulty replicas are running knows the faulty replicas' crypto keys can read network messages can temporarily force messages to be delayed via DoS what faults *can't* happen? no more than f out of 3f+1 replicas can be faulty no client failure -- clients never do anything bad no guessing of crypto keys or breaking of cryptography example use scenario: RM: echo A > grade echo B > grade tell YM "the grade file is ready" YM: cat grade a faulty system could: totally make up the file contents execute write("A") but ignore write("B") show "B" to RM and "A" to YM execute write("B") only only some of the replicas let's try to design our own byzantine-fault-tolerant RSM start simple (and broken), work towards paper's design design 1: [client, n servers] n servers client sends request to all of them waits for all n to reply only proceeds if all n agree what's wrong with design 1? one faulty replica can stop progress by disagreeing design 2: let's have replicas vote 2f+1 servers, assume no more than f are faulty client waits for f+1 matching replies if only f are faulty, and network works eventually, must get them! what's wrong with design 2's 2f+1? f+1 matching replies might be f bad nodes and just 1 good so maybe only one good node got the operation! *next* operation also waits for f+1 might *not* include that one good node that saw op1 example: S1 S2 S3 (S1 is bad) everyone hears and replies to write("A") S1 and S2 reply to write("B"), but S3 misses it client can't wait for S3 since it may be the one faulty server S1 and S3 reply to read(), but S2 misses it so read() yields "A" result: client tricked into accepting a reply based on out-of-date state e.g. TA reads A instead of B from grades file design 3: 3f+1 servers, of which at most f are faulty client waits for 2f+1 matching replies == f bad nodes plus a majority of the good nodes so all sets of 2f+1 overlap in at least one good node example: S1 S2 S3 S4 (S1 is bad) everyone hears write("A") S1, S2, S3 process write("B"), S4 misses it now the read() client will wait for 2f+1=3 matching replies S1 and S4 will reply "A" S2 and S3 will reply "B" client doesn't know what to believe (neither is 2f+1) but it is guaranteed to see there's a problem so client can *detect* that some good nodes missed an operation we'll see how to repair in a bit what about handling multiple clients? non-faulty replicas must process operations in the same order! let's have a primary to pick order for concurrent client requests but we have to worry about a faulty primary what can a faulty primary do? 1. send wrong result to client 2. different ops to different replicas 3. ignore a client op general approach to handling faulty primary 1. replicas send results direct to client 2. replicas exchange info about ops sent by primary 3. clients notify replicas of each operation, as well as primary each replica watches progress of each operation if no progress, force change of primary can a replica execute an operation when it first receives it from primary? no: maybe primary gave different ops to different replicas if we execute before we're sure, we've wrecked the replica's state need 2nd round of messages to make sure all good replicas got the same op design 4: 3f+1 servers, one is primary, f faulty, primary might be faulty client sends request to primary AND to each replica primary chooses next op and op # primary sends PRE-PREPARE(op, n) to replicas each replica sends PREPARE(op, n) to all replicas if replica gets matching PREPARE(op, n) from 2f+1 replicas (incl itself) and n is the next operation # execute the operation, possibly modifying state send reply to client else: keep waiting client is happy when it gets f+1 matching replies REQ PRE-P PREPARE REPLY C 0 1 2 3 remember our strategy: primary follows protocol => progress no progress => replicas detect and force change of primary if the primary is non-faulty, can faulty replicas prevent correct progress? they can't forge primary msgs they can delay msgs, but not forever they can do nothing: but they aren't needed for 2f+1 matching PREPAREs they can send correct PREPAREs and DoS f good replicas to prevent them from hearing ops but those replicas will eventually hear the ops from the primary worst outcome: delays if the primary is faulty, will replicas detect any problem? or can primary cause undetectable problem? primary can't forge client ops -- signed it can't ignore client ops -- client sends to all replicas it can try to send in different order to different replicas, or try to trick replicas into thinking an op has been processed even though it hasn't will replicas detect such an attack? results of the primary sending diff ops to diff replicas? case 1: all good nodes get 2f+1 matching PREPAREs did they all get the same op? yes: everyone who got 2f+1 matching PREPAREs must have gotten same op since any two sets of 2f+1 share at least one good server result: all good nodes will execute op2, client happy case 2: >= f+1 good nodes get 2f+1 matching PREPARES again, no disagreement possible result: f+1 good nodes will execute op, client happy BUT up to f good nodes don't execute can they be used to effectively roll back the op? i.e. send the write("B") to f+1, send read() to remaining f no: won't be able to find 2f+1 replicas with old state so no enough PREPAREs case 3: < f+1 good nodes get 2f+1 matching PREPAREs result: client never gets a reply result: system will stop, since f+1 stuck waiting for this op how to resume operation after faulty primary? need a view change to choose new primary (this view change only chooses primary; no notion of set of live servers) when does a replica ask for a view change? if it sees a client op but doesn't see 2f+1 matching PREPAREs after some timeout period is it OK to trigger a view change if just one replica asks? no: faulty replicas might cause constant view changes let's defer the question of how many replicas must ask for a view change who is the next primary? need to make sure faulty replicas can't always make themselves next primary view number v primary is v mod n so primary rotates among servers at most f faulty primaries in a row view change design 1 (not correct) replicas send VIEW-CHANGE requests to *new* primary new primary waits for enough view-change requests new primary announces view change w/ NEW-VIEW includes the VIEW-CHANGE requests as proof that enough replicas wanted to change views new primary starts numbering operations at last n it saw + 1 will all non-faulty replicas agree about operation numbering across view change? problem: I saw 2f+1 PREPAREs for operation n, so I executed it new primary did not, so it did not execute it thus new primary may start numbering at n, yielding two different op #n can new primary ask all replicas for set of operations they have executed? doesn't work: new primary can only wait for 2f+1 replies faulty replicas may reply, so new primary may not wait for me solution: don't execute operation until sure a new primary will hear about it add a third phase: PRE-PREPARE, PREPARE, then COMMIT only execute after commit operation protocol: client sends op to primary primary sends PRE-PREPARE(op, n) to all all send PREPARE(op, n) to all after replica receives 2f+1 matching PREPARE(op, n) send COMMIT(op, n) to all after receiving 2f+1 matching COMMIT(op, n) execute op view change: each replica sends new primary 2f+1 PREPAREs for recent ops new primary waits for 2f+1 VIEW-CHANGE requests new primary sends NEW-VIEW msg to all replicas with complete set of VIEW-CHANGE msgs list of every op for which some VIEW-CHANGE contained 2f+1 PREPAREs i.e. list of final ops from last view if a replica executes an op, will new primary will know of that op? replica only executed after receiving 2f+1 COMMITS maybe f of those were lies, from faulty replicas, who won't tell new primary but f+1 COMMITs were from replicas that got 2f+1 matching PREPAREs new primary waits for view-change requests from 2f+1 replicas ignoring the f faulty nodes f+1 sent COMMITs, f+1 sent VIEW-CHANGE must overlap can the new primary omit some of the reported recent operations? no, NEW-VIEW must include signed VIEW-CHANGE messages paper also discusses checkpoints and logs to help good nodes recover various cryptographic optimizations optimizations to reduce # of msgs in common case fast read-only operations what are the consequences of more than f corrupt servers? can the system recover? what if the client is corrupt? suppose an attacker can corrupt one of the servers exploits a bug, or steals a password, or has physical access, &c why can't the attacker corrupt them all?