Grand Challenge: Self-managing distributed systems

Peter Druschel
Max Planck Institute for Software Systems

A grand challenge is the design of distributed systems whose operation requires no human intervention other than to add or replace hardware components or to change the system's service contract. To be able to operate in a complex, open environment in which both resource availability and workload characteristics change dynamically, a self-managing system must be capable of self-organization, reflection and adaptation. A challenging (and perhaps infeasible) example of such a system is a self-managing, secure Internet that delivers predictable performance despite changing hardware, physical topology and workloads, and despite unpredictable influences, including attacks against the network.

Distributed systems capable of unattended operation exist today in embedded environments. However, such systems operate in a closed environment, with well-known workloads and available resources, and in the absence of unforeseen influences such as security attacks. Moreover, their design tends to be conservative, i.e., it trades efficiency and flexibility for dependability.

When a system operates in an open, dynamic environment, both the set of available resources and the workload characteristics change in unpredictable ways. A system that meets our grand challenge must attempt to maintain its service contract on a best-effort basis, maximize its functionality and performance given the available resources, remain secure, and, when it anticipates that the service contract can no longer be met, generate an appropriate alert. This requires a level of introspection and adaptation not found in current systems. A self-managing system must identify failed components, detect impending violations of the service contract (security, performance or functionality) and effect an appropriate response, such as reconfiguration, task reassignment and isolation of failed components or malicious event sources.
Only as a last resort should the system send an alert to human operators, and it should then provide specific information regarding the anomaly or fault that caused the alert.

The challenge is important because meeting it would enable truly robust, reliable and secure distributed computer systems that can be deployed incrementally and operated cost-effectively. Meeting this grand challenge would have a significant impact on society, since it would enable trustworthy, complex information systems at low cost and reduce the human expertise required to operate everything from a desktop computer to the Internet.

Among the technical challenges and milestones are (1) introspective techniques to determine hardware component health, system performance and security state; (2) maintaining a system model that relates low-level information about resource availability, system events and workload to high-level functional/performance specifications and security policies; (3) dynamic adaptation mechanisms that select system organization, data structures, algorithms and task assignments according to the prevailing system state, relying on algorithms from control theory and machine learning and on inspiration from biological and economic systems; (4) a software architecture that enables the cost-effective reuse of system components in a wide range of environments and applications; and (5) formal methods to ascertain the correctness of the system and to provide provable bounds on system performance, dependability and security.
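
As a rough illustration of the detect-and-respond behavior described earlier (identify failed components, respond locally, and alert human operators only as a last resort), the following Python sketch shows one pass of a hypothetical management loop. Everything in it is an assumption made for illustration: the names `Component`, `ServiceContract` and `manage`, the use of a single capacity figure in requests per second as the service contract, and the two responses (isolation and task reassignment). It is not a design proposed by this statement, and a real system would need far richer models of health, performance and security.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """A hypothetical hardware or software component under management."""
    name: str
    capacity: int            # requests/s this component can serve (assumed unit)
    healthy: bool = True     # result of some introspection mechanism (assumed)
    isolated: bool = False   # True once the system has removed it from service


@dataclass
class ServiceContract:
    """A deliberately simplified contract: a single capacity requirement."""
    required_capacity: int   # requests/s the system has promised to serve


def manage(components, contract):
    """Run one monitor -> respond -> analyze pass.

    Returns (actions, alerts). Alerts are raised only when the remaining
    capacity can no longer satisfy the contract.
    """
    actions, alerts = [], []

    # Respond locally first: isolate components observed to have failed.
    for c in components:
        if not c.healthy and not c.isolated:
            c.isolated = True
            actions.append(f"isolate {c.name}")

    # Analyze: compare the remaining capacity with the contract.
    available = sum(c.capacity for c in components if not c.isolated)
    if available >= contract.required_capacity:
        # The contract can still be met, so adapt without involving operators.
        if actions:
            actions.append("reassign tasks to remaining components")
    else:
        # Last resort: alert operators with specific information about the fault.
        failed = [c.name for c in components if c.isolated]
        alerts.append(
            f"contract at risk: {available}/{contract.required_capacity} req/s "
            f"available; isolated components: {failed}"
        )
    return actions, alerts


if __name__ == "__main__":
    nodes = [Component("a", 100), Component("b", 100), Component("c", 100)]
    contract = ServiceContract(required_capacity=150)
    nodes[1].healthy = False          # simulate a single failure
    print(manage(nodes, contract))
    nodes[0].healthy = False          # simulate a second failure
    print(manage(nodes, contract))
```

In this sketch a single failure is absorbed silently through isolation and reassignment, while a second failure that pushes capacity below the contract produces an alert naming the isolated components. Anticipating an impending violation, rather than reacting to one that has already occurred, would require the predictive system model listed among the technical challenges.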