"Global-scale systems that know when they are behaving badly" Jeffrey Mogul -- HP Labs, Palo Alto -- jeff.mogul@hp.com The goal of much distributed systems research has been the creation of complex systems that always work, both through fundamental design principles (e.g., two-phase commit and replication) and through better engineering (e.g., model checking and type-safe languages). In the real world, and especially in complex "enterprise" applications, distributed systems almost inevitably misbehave -- even in the absence of malicious attacks. Moreover, system-wide misbehavior is not necessarily the result of any component-level failure, nor is it necessarily manifested as a failure to produce "correct" results; it could be a significant performance or resource-consumption problem. I assert that a key, unappreciated challenge in distributed systems design is how to keep complex systems running well even when they are incorrectly designed. My premise is that, in spite of our best research efforts, most large real-world systems will misbehave, either because an attempt at correct-by-construction has failed, or (more likely) the original specification is wrong, underdefined, or obsolete. Few enterprises (businesses, governments, etc.) can afford the cost and time of deploying systems that always work, or the cost and downtime when they misbehave. How do we resolve this conflict? While we might never fully avoid it, the first step is the ability to detect system-wide misbehavior in complex systems. This is a prerequisite to diagnosis and repair, but relatively little research attention has been paid to the detection problem. One example is Emre Kiciman's PhD work (see "Detecting Application-Level Failures in Component-based Internet Services", Trans. Neural Networks), where he shows how one might detect application failure without any specification of correct behavior. We must learn how to design systems that realize when they are misbehaving. 
Such systems would include enough monitoring to construct a global view of system behavior, at a level of detail sufficient to capture unanticipated behavior. (System designers tend to resist adding such "superfluous" monitoring; it should not be a surprise that the Space Shuttle program waited 15 years to install enough cameras to see that foam was breaking off.)

Such systems would also include detectors for misbehavior. This is different from trying to specify correct behavior (the normal and daunting problem for formal approaches); it might be possible to define generic types of misbehavior (deadlock, thrashing, oscillation, resource leakage) without any model of correctness (see also Kiciman's work).

Steven Gribble's HotOS-8 paper "Robustness in complex systems" points out that systems can gain robustness when designed to expect failures, detect them quickly, and recover immediately. Patrick Reynolds, in unpublished work (joint with Amin Vahdat, Janet Wiener, Mehul Shah, and myself), has developed tools to express and check "expectations" about distributed system behavior. Pairing these kinds of approaches could lead to distributed systems that recognize their own misbehaviors.

A research agenda:

(1) Add instrumentation sufficient to synthesize a global view of system behavior, yet cheap enough to use at all times.
(2) Create a formal yet usable language for describing behavior, and tools to check such descriptions against observations.
(3) Create a library of common (generic) misbehavior descriptions, as well as system-specific misbehaviors.
(4) Experiment with automatic on-line discovery of misbehavior; does this really work?
(5) Reason backwards from detected misbehavior to possible root causes. (Root-cause inference might be hard, but not as hard as correct construction, because you don't have to be right every time.)
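
As a rough illustration of what a generic misbehavior description might look like, here is a minimal Python sketch of two detectors, one for resource leakage and one for oscillation, that operate on a window of metric samples without any model of correct behavior. Everything in it is an assumption made for illustration: the function names, the thresholds, and the choice of a least-squares slope and a reversal count are hypothetical, and are not taken from Kiciman's or Reynolds's work.

```python
from statistics import mean


def leak_slope(samples):
    """Least-squares slope of the samples against their index (units per sample)."""
    n = len(samples)
    if n < 2:
        return 0.0
    x_bar = (n - 1) / 2
    y_bar = mean(samples)
    num = sum((i - x_bar) * (y - y_bar) for i, y in enumerate(samples))
    den = sum((i - x_bar) ** 2 for i in range(n))
    return num / den


def looks_like_leak(samples, min_slope, min_rising_fraction=0.8):
    """Flag sustained growth: a steep fitted slope and mostly rising steps."""
    if len(samples) < 2:
        return False
    rising = sum(1 for a, b in zip(samples, samples[1:]) if b > a)
    rising_fraction = rising / (len(samples) - 1)
    return leak_slope(samples) >= min_slope and rising_fraction >= min_rising_fraction


def count_reversals(samples, min_swing):
    """Count direction changes between consecutive large steps."""
    deltas = [b - a for a, b in zip(samples, samples[1:])]
    # Ignore small steps so that ordinary noise is not counted as a swing.
    big = [d for d in deltas if d != 0 and abs(d) >= min_swing]
    return sum(1 for d1, d2 in zip(big, big[1:]) if (d1 > 0) != (d2 > 0))


def looks_like_oscillation(samples, min_reversals, min_swing):
    """Flag repeated large swings in alternating directions."""
    return count_reversals(samples, min_swing) >= min_reversals


def detect_misbehavior(samples, min_slope, min_reversals, min_swing):
    """Run every generic detector and return the names of those that fire."""
    findings = []
    if looks_like_leak(samples, min_slope):
        findings.append("resource leakage")
    if looks_like_oscillation(samples, min_reversals, min_swing):
        findings.append("oscillation")
    return findings
```

Such detectors would be fed from whatever instrumentation the system already exports (memory in use, queue length, request rate). The thresholds would have to be tuned per metric, and, as with root-cause inference, a detector of this kind need not be right every time to be useful.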
Summary: Anticipate misbehavior in complex distributed systems; build in self-monitoring; don't just try to specify correct behavior, but also describe how to detect incorrect behavior.