Practical Byzantine Fault Tolerance
Castro and Liskov

What is this paper about?
  Replicated state machines w/ Byzantine failures.
  Make it work in an asynchronous system.
  - what is an asynchronous system?
    - no relation to asynchronous programming
    - definition: no bound on message delay
    - if you don't have a timeout, you may wait indefinitely
    - if you have timeouts, you will miss legitimate messages
  - attacks: compromised replicas may delete messages, delay them, change
    them, send different replies to the client, etc.
  - claim of paper: can tolerate f failures w/ 3f+1 total nodes (2f+1 are
    good), where "failure" means an actively malicious participant.
    safety: serializability
    liveness: clients eventually receive a reply
  - assumptions:
    1. the replicas fail independently
       the model for *why* replicas fail is critical.
       component failure? s/w bugs? power failure? evil hackers?
       not easy to ensure independent failure!
    2. messages are received at some point
       - keep retransmitting
       - network, client, and server will heal at some point

Why 3f+1 nodes to tolerate f failures?
  A lot like Harp: want a majority of good nodes.
  But we cannot wait for votes from all nodes!
  Have to assume that f nodes might never answer.
  So we wait for 2f+1 votes (including our own).
  In reality, the f votes we didn't wait for might all have been good,
  so we may have collected only f+1 good votes.
  This is a majority of the 2f+1 good nodes, so it is enough.

The goal in more detail:
  All 2f+1 good replicas eventually agree on the order of committed
  operations, so their DBs stay identical.
  In a little more detail:
    f+1 (or more) good replicas agree on the order.
    The other f do not commit anything not agreed on.
    Thus it is easy to repair the other f.

- the algorithm
  - a primary-backup protocol to achieve total order
    - primary sets the order, backups agree
    - we need 2f backups to agree so that we have 2f+primary replicas
      agreeing (2f+1)
  - each replica keeps a message log (in-core)
  - 3 phases: pre-prepare (propose), prepare (agree), commit

    c -----------------------------------------------
    p -----------------------------------------------
    b -----------------------------------------------
    b -----------------------------------------------
    b -----------------------------------------------

  - c -> p: request o, t, c
  - p -> b's: pre-prepare: in view v, I propose sequence number n for
    message m (signed by p)
  - b's -> b's: prepare: in view v, I agree on n for message m (signed by b)
    m is prepared when b receives 2f matching prepares plus 1 pre-prepare
  - b's -> b's: commit: in view v, I commit message m with sequence number n
    (signed by b)
    if a non-faulty b thinks that m is committed, then eventually m will be
    committed
    after committing, b executes the operation
  - b sends reply to c

Framework for analyzing the protocol:
  1. If the primary is bad (inconsistent order, or doesn't send some ops):
     Good nodes will detect/time out and force a view change.
     Rotate the primary, so we will soon see a good primary.
     Bad nodes cannot prevent the view change.
  2. If the primary is good:
     Good nodes will agree on the next operation.
     Bad nodes will not be able to prevent agreement.
     Bad nodes cannot force a view change if all goes well.

What each node knows after each phase:
  After getting 1 pre-prepare message:
    The primary sent you a message.
    It might have sent different (or no) messages to other good nodes.
  After getting 2f prepare messages:
    We can only *wait* for 2f, of which f might be traitors.
    We now know that f+1 good nodes got the original primary message.
    We don't yet know if the primary is loyal; maybe the other f got
    something else.
    The f+1 is to ensure a majority among the 2f+1 loyal nodes.
    Maybe f+1 good nodes got M1, and f got M2.
    Depending on the traitors and which packets are lost,
    some nodes may see a consistent M1, some may see a mix.
    I.e. some nodes may go ahead, some may abort.
    Perhaps everybody but me saw a mix of M1 and M2 prepares!
    So we cannot commit yet.
  After getting 2f commit messages:
    We know f+1 good nodes saw f+1 good prepare messages,
    i.e. f+1 good nodes are willing to commit.
    So we can commit as well.
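To make the counting above concrete, here is a minimal Python sketch (not
from the paper) of the per-sequence-number bookkeeping a backup might keep;
the names (F, Slot, on_prepare, on_commit) are invented for illustration,
and the real protocol additionally checks views, signatures/MACs, and
watermarks, and logs the messages so they can be replayed in view changes.

    F = 1                       # assumed number of tolerated faulty replicas
    N = 3 * F + 1               # total replicas

    class Slot:
        """State a backup keeps for one (view, sequence number)."""
        def __init__(self):
            self.pre_prepare = None   # request digest proposed by the primary
            self.prepares = set()     # replica ids whose prepares match it
            self.commits = set()      # replica ids that sent matching commits

        def prepared(self):
            # prepared: the primary's pre-prepare plus 2f matching prepares
            return self.pre_prepare is not None and len(self.prepares) >= 2 * F

        def committed_local(self):
            # committed-local: prepared plus 2f+1 commits (our own included)
            return self.prepared() and len(self.commits) >= 2 * F + 1

    def on_prepare(slot, sender, digest):
        if slot.pre_prepare == digest:
            slot.prepares.add(sender)

    def on_commit(slot, sender, digest):
        if slot.pre_prepare == digest:
            slot.commits.add(sender)

Once committed_local() holds, the backup executes the operation and sends
its reply to the client.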
If the primary is good:
  At least f+1 good nodes will collect 2f+1 commits.
  The client will see at least f+1 agreeing replies.
  Bad nodes can only omit messages, since the primary is good.
  But we never have to wait for them -- good nodes are enough.

If the primary is bad:
  If good nodes could not agree, they will force a view change. OK.
  What if only f+1 good nodes agreed?
    What happens at the other f good nodes?
    Maybe the primary sent them an inconsistent pre-prepare.
    Maybe the primary sent them nothing.
    Those f good nodes will time out and request a view change.
    We have to ignore them, since they might actually be bad.
    But it's OK: f+1 can proceed.
  If f+1 good nodes see an inconsistent pre-prepare, or time out,
    we can't proceed. All good nodes will time out.
    We have 2f+1 nodes, enough to force a view change.

Why does the client see the right answer?
  The client waits for f+1 agreeing answers.
  Only f can be bad, so one of them must be good.
  No good node will reply w/o a commit,
  i.e. without proof that f+1 good nodes committed.
  So at least that 1 answer is correct (i.e. this was the correct next op).
  If f+1 agree, they agree with the correct node.
  Similarly, if f+1 good nodes agreed to commit, the client will get f+1
  agreeing answers, plus maybe f incorrect ones.
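A hedged sketch of this client-side rule (again Python; the names
wait_for_reply and replies are invented): accept a result only once f+1
replicas have returned the same answer for the same request timestamp. The
paper's client additionally retransmits the request to all replicas if it
times out waiting.

    from collections import Counter

    def wait_for_reply(replies, f):
        """replies: iterable of (replica_id, timestamp, result) tuples,
        in the order they arrive over the network."""
        counts = Counter()
        voted = set()
        for replica_id, t, result in replies:
            if (replica_id, t) in voted:       # count each replica at most once
                continue
            voted.add((replica_id, t))
            counts[(t, result)] += 1
            if counts[(t, result)] >= f + 1:   # f+1 matching replies: at least
                return result                  # one is from a good replica
        return None                            # not enough agreement yet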
When are view changes triggered?
  The point is to switch to a new primary, since we suspect the old one:
  it's handing out inconsistent messages, or not handing out enough
  messages, so we cannot assemble our 2f+1 consistent messages.
  Only agree to do it if 2f+1 nodes want a view change.
  So f evil nodes cannot by themselves trigger one, unless the primary is
  disloyal/broken -- but then we want a view change!

What happens in a view change?
  We're only sure f+1 good nodes are up to date.
  The other f nodes need to get the latest correct operations.
    No conflicts, just missing the last few operations.
  How do they know whom to believe?
    Nodes exchange *proofs* of current state.
    Amounts to signed messages from 2f+1 nodes detailing their state.
    Allows inconsistencies to be discovered.
  The new primary picks a pre-prepare for each pending message.
  All nodes re-run the prepare &c protocol.
  We don't care if the new primary picked the "right" pre-prepare for each
  seq#. We only care that all good nodes agree on the order of operations.

- optimizations:
  1. backups send digests of the reply to the client, except one replica,
     which sends the whole reply
  2. send replies as soon as the prepared predicate becomes true
  3. read-only requests go straight to the backups instead of through the
     primary
     - be careful: respond only after all tentative requests have been
       committed
  4. avoid public-key crypto (use MACs instead)
     - I probably should present the algorithm based on MACs first; it is
       easier to understand

- BFS
  - Byzantine file system
  - speaks NFS2
  - in-core file system (replication to ensure stability)
  - non-deterministic operations
    - e.g. setting the last-modified time

- Performance evaluation
  - n = 4
  - microbenchmarks (dominated by message cost)
  - Andrew benchmark (BFS does well compared to NFS because of the lack of
    synchronous writes)

- Discussion
  - what problem does BFS solve?
  - is IS going to run BFS to deal with Byzantine failures?
  - what failures are we talking about?
    compromised servers
  - what about compromised clients?
    authentication and authorization
  - how can we extend the system to allow more than (n-1)/3 failures over
    its lifetime?
    - detect failed replicas using proactive recovery
      - recover replicas periodically, no matter what
      - makes bad nodes good again
    - tricky stuff
      - an attacker might steal a compromised replica's keys
  - with how many replicas will BFS work reasonably well?
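As a tiny illustration of the (n-1)/3 bound mentioned above (illustrative
only, not from the paper), the arithmetic for a few group sizes:

    # For n replicas, PBFT tolerates f = floor((n-1)/3) simultaneous faults
    # and uses quorums of size 2f+1.
    for n in (4, 7, 10, 13):
        f = (n - 1) // 3
        print(f"n={n:2d}: tolerates f={f}, quorum size 2f+1={2*f + 1}")
    # n= 4: tolerates f=1, quorum size 2f+1=3
    # n= 7: tolerates f=2, quorum size 2f+1=5
    # n=10: tolerates f=3, quorum size 2f+1=7
    # n=13: tolerates f=4, quorum size 2f+1=9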