Bill Weihl
----------------------------

Distributed systems are now essential to many aspects of modern life, so it is more important than ever to make them robust. Distributed systems have the potential for massive fault-tolerance -- having many nearly identical nodes that can substitute for each other makes it possible to tolerate a large number of failures. At the same time, distributed systems themselves can fail in unexpected ways.

Many distributed systems, including telephone systems (both VoIP and traditional circuit-switched systems), content delivery systems, and web server farms, run the same or very similar software on all nodes. This leads to the possibility of a complete system failure. For example, suppose a particular (perhaps unlikely or unexpected) input condition can cause a node to fail. If the system reacts to a node failure by directing the same input to a different node, that node may fail in turn. Such failures could cascade throughout the system, causing the entire system to go down.

For another example, consider the regional and national electric power grids. These are also distributed systems (though more analog than digital), and they also suffer from the possibility of complete or near-complete failure (as history clearly shows). The electric grid will probably become even more distributed in years to come, with increasing use of co-generation, grid-tied solar, and other "local" power generation techniques.

What happens to distributed systems such as these when the software contains the inevitable bugs? What failure modes do the systems as a whole exhibit when some of these bugs are triggered? What happens when the analog parts or the operators fail? And what happens when the systems are attacked? Terrorism makes the last question even more critical today.

Much work has been done in the past decades on techniques for building robust distributed systems, including Byzantine agreement, the Tandem "non-stop" methods, and others.
While useful, these techniques typically focus on only a small part of the problem, ignoring much of the "system" as a whole. We need to consider the entire system -- the operational procedures, the analog feedback loops, etc. -- not just the digital algorithms encoded in the software. In addition, these techniques do not deal well with persistent problems. The Tandem techniques work well for "heisenbugs", but not for "Bohr bugs". Byzantine agreement assumes that only a small fraction of the nodes have failed -- but if all the nodes have the same bug, they will all respond to the same input in the same way.

The challenge facing us is to better understand how to minimize the impact of any persistent and coordinated failures that do occur -- to keep local failures from becoming global, and to isolate the effects of any larger problems. How can we design systems so that they can identify and isolate potentially fatal failure modes, instead of just letting them spread?
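
As a toy illustration of the cascade described earlier, and of one possible way to contain it, the sketch below simulates five identical nodes that all crash on the same input. Everything in it is a hypothetical assumption made for illustration: the names Node, naive_failover, IsolatingDispatcher and max_crashes, the "poison" request, and the idea of quarantining an input after it has crashed a fixed number of nodes. It is not a method proposed in the text above, and it sidesteps the hard question of how a real system would attribute a crash to a particular input.

```python
from dataclasses import dataclass, field


class NodeCrash(Exception):
    """Raised when a node hits the shared (hypothetical) bug."""


@dataclass
class Node:
    name: str
    alive: bool = True

    def handle(self, request: str) -> str:
        # Assumption: every node runs the same software, so the same
        # input ("poison") crashes each of them in the same way.
        if request == "poison":
            self.alive = False
            raise NodeCrash(self.name)
        return f"{self.name} handled {request}"


def naive_failover(nodes, request):
    """Redirect a failed request to the next live node, without limit."""
    for node in nodes:
        if not node.alive:
            continue
        try:
            return node.handle(request)
        except NodeCrash:
            continue  # try the next node with the very same input
    return None


@dataclass
class IsolatingDispatcher:
    """Stops retrying an input once it has crashed max_crashes nodes."""
    nodes: list
    max_crashes: int = 2
    quarantined: set = field(default_factory=set)

    def dispatch(self, request):
        if request in self.quarantined:
            return None  # refuse inputs already known to be fatal
        crashes = 0
        for node in self.nodes:
            if not node.alive:
                continue
            try:
                return node.handle(request)
            except NodeCrash:
                crashes += 1
                if crashes >= self.max_crashes:
                    # Isolate the input instead of letting it spread.
                    self.quarantined.add(request)
                    return None
        return None


def live_count(nodes):
    return sum(node.alive for node in nodes)


# Naive failover: the poison input is handed from node to node.
naive_nodes = [Node(f"n{i}") for i in range(5)]
naive_failover(naive_nodes, "poison")
naive_survivors = live_count(naive_nodes)

# Isolating dispatcher: the same input is quarantined after two crashes.
guarded_nodes = [Node(f"n{i}") for i in range(5)]
dispatcher = IsolatingDispatcher(guarded_nodes)
dispatcher.dispatch("poison")
guarded_survivors = live_count(guarded_nodes)
followup = dispatcher.dispatch("hello")

print("survivors with naive failover:", naive_survivors)
print("survivors with isolation:", guarded_survivors)
print("follow-up request:", followup)
```

In this sketch the naive policy loses every node to a single input, while the quarantining policy gives up two nodes and keeps the rest available for ordinary requests. The threshold trades availability for safety: a lower max_crashes limits the damage, but it also risks rejecting a legitimate input whose failures were merely transient.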