This ITR proposes a novel decentralized infrastructure, based on distributed hash tables (DHTs), that will enable a new generation of large-scale distributed applications. DHTs, the key technology on which we build, are robust in the face of failures, attacks, and unexpectedly high loads. They are scalable, achieving large system sizes without incurring undue overhead. They are self-configuring, automatically incorporating new nodes without manual intervention or oversight. They simplify distributed programming by providing a clean and flexible interface. And, finally, they provide a shared infrastructure simultaneously usable by many applications.
The approach advocated here is a radical departure from both the centralized client-server and the application-specific overlay models of distributed applications. This new approach will not only change the way large-scale distributed systems are built, but could potentially have far-reaching societal implications as well. The main challenge in building any distributed system lies in dealing with security, robustness, management, and scaling issues; today each new system must solve these problems for itself, requiring significant hardware investment and sophisticated software design. The shared distributed infrastructure will relieve individual applications of these burdens, thereby greatly reducing the barriers to entry for large-scale distributed services.
Our belief that DHTs are the right enabling infrastructure is based on two conjectures: (1) a DHT with application-independent, unconstrained keys and values provides a general purpose interface upon which a wide variety of distributed applications can be built, and (2) distributed applications that make use of the DHT-based infrastructure inherit basic levels of security, robustness, ease of operation, and scaling. Much of the thrust of the proposed research is an exploration of these two conjectures.
We propose to investigate the first conjecture, that the DHT abstraction can support a wide range of applications, by building a variety of DHT-based systems. Our recent work has used DHTs to support such varied applications as distributed file systems, multicast overlay networks, event notification systems, and distributed query processing. DHTs simplify the structure of these systems by providing general-purpose key/value naming rather than imposing structured keys (e.g. hierarchical names in DNS). These systems are early prototypes, but they suggest that DHTs may be as useful to distributed applications as ordinary hash tables are to programs.
The second conjecture relies on techniques for creating robust, secure, and self-organizing infrastructures out of many mutually distrustful nodes. Our initial work on robust DHT designs gives us confidence that such techniques are within reach. The bulk of our proposed research will be devoted to the in-depth study of these techniques, with the express aim of producing a sound and coherent design for the infrastructure. To investigate the real-world behavior of our design, we plan to create a large-scale open testbed for which we will distribute our infrastructure software, some enabling libraries, and a few key compelling applications.
In addition to its impact on the creation of distributed applications, our research program will have benefits in education and outreach. Given their current importance, security, robustness, and the design of distributed systems should become central topics in undergraduate computer science education. To this end, we are planning (as described in Section ) a new interdisciplinary course that will address these issues, and bring them into sharper focus early in the undergraduate course sequence.
Our testbed and proposed research agenda is also a good vehicle for encouraging the participation of organizations not traditionally involved in networking and systems research. Participation in the testbed requires little cost (a PC and an Internet connection) and minimal levels of systems expertise and oversight. Moreover, because the material is closely related to the P2P systems with which many students are familiar, the project might appeal to students who would not normally be attracted to research in this area. Based on this premise, we plan an active outreach program to underrepresented populations at non-research undergraduate institutions.
To support this broad agenda of research, outreach, and educational innovation we have assembled a team of researchers who have made initial contributions to the new area of decentralized, Internet-scale distributed systems [48,59,18,65]. The team consists of researchers from the diverse disciplines of networking, algorithms, computer security, computer systems, and databases, with a successful track record in research and in high-quality software systems.
To give context for our proposed work, we first describe the state-of-the-art in distributed systems, and then discuss some of the ways in which current requirements are ``pulling'' and technology is ``pushing'' on how we might build distributed systems in the future.
Most deployed large-scale distributed applications adopt the centralized client-server model, where a single server or, more typically, a localized cluster of servers responds to all client requests. Examples of such applications are e-mail, electronic shops, distributed file systems, search engines, etc. A few distributed applications adopt the overlay model in which many servers spread throughout the Internet form an application-level and application-specific network on top of the existing Internet. In this case the servers are decentralized but are all controlled by a centralized administrative authority operating one or a few network operation centers (NOCs). For example, Akamai, Overcast, and FFnet all adopt this model for Internet content distribution.
Both the centralized client-server and the overlay models of distributed computing have a long history of success, and have reshaped the world of distributed applications. There is little doubt that both approaches will continue to be important in the future. However, they may not be appropriate for critical operations because they both have single points of failure (the server or the NOC). Consequently, their performance degrades substantially when certain nodes fail or are subjected to distributed denial-of-service attacks (DDoS). Moreover, these designs often do not function well when network partitions render subsets of nodes unreachable. Perhaps most seriously, an attacker can seriously compromise and disrupt such systems by subverting one of the servers.
Our main objective in this proposal is to design distributed systems that degrade gracefully under failures and attacks. No single server (or small set of servers) should be the Achilles heel of the system. Instead, we desire that when a node fails, comes under attack, or is subverted, the system should continue operating, albeit perhaps with degraded performance.
In the early days of the Internet, vulnerability to attacks and failures was acceptable because the Internet did not play a central role in any critical functions. However, as we discuss next, this is no longer the case. More generally, now that the Internet has come of age, the expectations for it are changing. These changing expectations have created some ``demand pulls'' that should shape the next generation of distributed system design. The technologies related to distributed systems design have also changed substantially. These ``technology pushes'' enable us to pursue new approaches to distributed system design. In the next two subsections we discuss these demand pulls and technology pushes.
The world has changed since the first large-scale Internet-based distributed systems were deployed, and the expectations for these systems have evolved significantly. In our opinion, the most important and interesting new requirements facing distributed systems are the following:
Security and Robustness: Internet applications and services are increasingly being used for critical services. Hospitals, police departments, financial institutions, utilities, state and federal governments and many other crucial institutions make extensive and critical use of the Internet. The distributed systems on which these institutions rely must remain operational under adverse conditions.
At the very least, these distributed systems should be robust in the face of failures of the computational nodes and/or the Internet links over which they communicate. Application performance might inevitably degrade as conditions get worse, but the degradation should be graceful and tolerable.
Moreover, these applications should be highly resilient against distributed denial-of-service attacks (DDoS). The frequency and sophistication of attacks upon the information infrastructure, both in the U.S. and elsewhere, has increased dramatically over the past few years.
In addition, critical distributed systems should not be vulnerable to attacks that compromise one or more of their constituent nodes. Large-scale systems typically must have many servers in order to handle their load. The security and robustness of such systems is extremely limited if the subversion of any one of these servers seriously compromises the system as a whole.
Unfortunately, as we discussed above, our current infrastructure is not robust against failures or attacks; the failure of the server (or cluster), whether through subverting attacks, DDoS attacks, or benign malfunctions, can bring a system down. In particular, DDoS attacks effectively shut down several important services during the past year. This vulnerability must be addressed.
While safeguarding our current infrastructure from both attacks and failures is a pressing need, it is critically important that future distributed systems be designed, from the start, to be highly resilient to attacks and failures. Only when this level of resilience becomes embedded into the basic design, rather than addressed as an afterthought, can Internet-based distributed systems safely play such a critical role in the information infrastructure. The focus of this proposal is on the design of these future systems, rather than retrofitting current systems to be more secure.
Deployment, Maintenance and Management Costs: Currently, deploying a large-scale distributed system is an expensive and time-consuming effort. Nodes may have to be manually installed and configured in many locales, and agreements must be negotiated with many independent ISPs, network operators, and site operators. Once deployed, the systems require substantial human oversight to maintain acceptable levels of performance; the labor costs associated with this oversight have become a sizable fraction of the overall cost of running large distributed systems. This human involvement is also a common source of failures; for example, many routing failures in the Internet are due to BGP configuration errors [20,43]. Thus, finding new ways of designing large-scale distributed systems that have lower deployment, maintenance, and management costs, and are less vulnerable to operator error, is of significant importance.
Content vs. Location: The basic Internet architecture provides point-to-point communication; the routing infrastructure need only know the source and destination addresses of packets and is oblivious to their content. However, at the user level, the situation is often reversed; users typically care only about the content of the data, not the location from whence it came. That is, users know what they want (e.g. today's CNN headlines), but don't care which server it comes from. To mitigate this mismatch, various pieces of Internet infrastructure translate between the name of data and its location: for instance, the Domain Name System translates between domain names and addresses; search engines (e.g. Google) translate between keywords and URLs (which are then translated into addresses by DNS). However, applications could be far more flexible, powerful, and robust if they were able to deal more directly with the names rather than the locations of data. This will require substantially different approaches to designing the supporting infrastructure for distributed systems.
Scaling: The Internet user community has grown so large that distributed applications must be prepared to deal with millions of users. This requirement of extreme scalability stresses many traditional approaches to distributed system design. In particular, sudden, unexpected spurts of activity (e.g. ``flash crowds'' on the Web) frequently occur but are difficult to plan for in advance, so they severely stress parts of the system.
Technology has evolved substantially in the past decade, giving rise to new possibilities for distributed system design. We describe three particularly relevant technological developments below.
Bigger, Faster, Better: CPU speeds and disk capacities of individual nodes are increasing rapidly, as are the backbone link bandwidths used to communicate between them. Any reasonably configured PC can function as a server, and a large collection of such nodes provides an abundance of computing and storage resources. The collection could consist of loosely cooperating but administratively unrelated participants, or of nodes controlled by a single enterprise. New approaches to distributed system design should seek to harness the resources latent in such collections.
Distributed Hash Tables: In the past few years, there have been several proposals for what could be termed Distributed Hash Tables (DHTs); they are also known as lookup services or location-independent routing services [48,52,59,66]. Some of these systems were originally developed in the context of peer-to-peer file-sharing systems. They are overlay infrastructures that rely on a symmetric distributed design and can reach Internet scales without incurring undue overhead.
In these systems each data object is associated with a key (produced, for instance, by hashing the file name). There is one basic operation, lookup(key), which returns the identity (e.g. the IP address) of the node storing the object with that key. This operation allows nodes to put and get files based on their key, thereby supporting a hash-table-like interface. As we discuss below, these DHT systems provide the starting point for our proposed research.
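To make this layering concrete, the hash-table-like interface can be sketched in a single process; the `ToyDHT` class, `key_for` helper, and successor rule below are our own illustrative stand-ins, not the design of any particular DHT proposal:

```python
import hashlib

M = 16                        # toy key space: 2**M points

def key_for(name: str) -> int:
    """Hash an object name to an m-bit key (truncated SHA-1)."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % (2 ** M)

class ToyDHT:
    """In-memory stand-in for a DHT: sorted node ids partition the key space."""
    def __init__(self, node_ids):
        self.nodes = sorted(node_ids)
        self.store = {n: {} for n in self.nodes}

    def lookup(self, key: int) -> int:
        """Return the node responsible for `key` (first node id at or after it)."""
        for n in self.nodes:
            if n >= key:
                return n
        return self.nodes[0]          # wrap around the circular key space

    def put(self, name: str, value):
        k = key_for(name)
        self.store[self.lookup(k)][k] = value

    def get(self, name: str):
        k = key_for(name)
        return self.store[self.lookup(k)].get(k)

dht = ToyDHT([100, 20000, 40000, 60000])
dht.put("cnn-headlines", b"...")
assert dht.get("cnn-headlines") == b"..."
```

The point of the sketch is that `put` and `get` need only the single `lookup(key)` primitive; everything else is ordinary hash-table logic on the responsible node.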
Trustworthy Systems out of Untrusted Components: In most traditional distributed systems, if one node is compromised then the system as a whole is compromised. Recent work such as SUNDR, CFS, and BFS has demonstrated the possibility of building trustworthy systems out of untrusted components. Thus, the subversion of one, or several, of the constituent nodes does not lead to the compromise of the overall system. Such technology is essential for building distributed systems with large numbers of nodes spread over many administrative domains.
These technology pushes motivate several aspects of our proposed approach: using an infrastructure comprised of many nodes, incorporating several techniques for secure and robust operation, and offering a simple DHT-based programming abstraction. We were led to this approach by our prior work in the field, which was inspired by recent developments in peer-to-peer (P2P) computing. The broad success of P2P systems (e.g. Gnutella), which are based on many symmetric peers, demonstrated the utility of the P2P approach; the rather primitive nature of these initial P2P designs leaves significant room for improvement. In many ways, the technical aspects of this proposal can be seen as providing a sound scientific and design basis for the P2P approach, but the intended application areas we discuss here are far broader than the file-sharing use that dominates current P2P systems. Industry leaders in this area see broad applications for P2P systems (see enclosed letters from Microsoft, Intel, and Sun).
We describe our proposed approach in more detail in the following section.
The confluence of these demand ``pulls'' and technology ``pushes'' creates an important opportunity. Large-scale distributed systems should be more scalable, secure, robust, easy to manage, and less location-centric. They can also take advantage of the abundant computing resources available in large collections of PC-class machines, use algorithms that provide hash-table-like functionality at Internet scales, and incorporate techniques that construct trustworthy services out of untrusted components. Merging these pushes and pulls, we propose to create a large-scale general-purpose overlay infrastructure comprised of many untrusted machines that will provide DHT-like functionality (that is, will support the lookup operation or some similar interface). Our proposal is based on two simple conjectures:
We now discuss these two conjectures at a general level, motivating the case for DHTs as an interface and discussing the capabilities envisioned for the proposed Internet-wide infrastructure.
Essentially all distributed applications require a way by which software running on one node can discover and communicate with software running elsewhere on the network. The process of discovery is often complex, and depends on the details of the application. Distributed applications therefore use some form of naming to describe what person, host, location, data, or resource they wish to refer to.
The Internet solves the unicast communication problem, but provides little more than DNS for the discovery problem. DNS is essentially a host-naming mechanism; it lacks the flexibility required for sophisticated applications. The heterogeneity of distributed applications makes a general-purpose naming system appear to be a very challenging goal.
Yet we desire a common naming infrastructure in the face of this heterogeneity. We assert that DHTs offer an infrastructure flexible enough to provide useful service to a wide range of applications. The idea is to decouple the syntax and semantic content of names from the infrastructure, preserving only a level of indirection referred to by a key in an m-bit key space. The value corresponding to this key is application-dependent; the DHT naming infrastructure would not impose any restrictions or structure on keys or values. The naming system would perform one well-defined task: given a key and a message, find the node in the system responsible for the key (and, optionally, deliver the message to it). The strong resemblance between this primitive and standard hash tables motivates our choice of the term ``DHT''.
The interface is minimal in that it requires little of the application; it doesn't dictate the registration mechanism, or the language, or the programming discipline, as opposed to several other (complementary) attempts to support distributed systems such as CORBA, IDL, and Java RMI. We believe that this minimalism will help its widespread adoption. The utility of the DHT interface has been shown by a number of prototype systems and proposals for systems, many of which form the basis for this proposal; these include naming systems [1,17], data storage [18,53], communication primitives including multicast and anycast [58,49,67], query processing, and event notification. Thus, we contend that the network location-independent key-based abstraction provided by DHTs is a minimal and useful substrate over which to build additional sophisticated functionality.
Figure 1: Examples of the simple interfaces we plan to supply:
    lookup(key) -> IPAddress
    get(key) -> data
    send(IPAddress, data)
    put(key, message)
    recv() -> message
The DHT abstraction facilitates a variety of useful middleware services being developed and used as modules in future systems, including support for distributed data storage and content-based retrieval, group communication primitives, complex query processing, etc. These will be provided as library modules that distributed applications can link with and use. Figure 1 illustrates specific simple examples of the kinds of interfaces we plan to supply.
Our Conjecture #1 is preliminary and only partially tested. While we have initial evidence that DHTs are useful, much more work remains to be done to explore the full utility, and the limitations, of this approach. We expect to find that modifications to the interface are needed to support a sufficiently broad set of applications. This exploration, by design, evaluation, and redesign, will be an important part of our proposed work. As described in Section , we will develop several new applications that will help guide the design of the interface and test the value of the infrastructure. We emphasize, however, that our focus is not on developing these few individual applications but is instead on developing a general-purpose interface and overlay infrastructure.
Above we argued that the DHT interface will be useful for distributed system design. However, a useful programming abstraction does not, by itself, address the new security, robustness, management, and scaling requirements discussed in Section 1. We now describe how the nature of the infrastructure, and not just its interface, will help address those issues. The properties we ascribe to the DHT-based infrastructure are based on our experiences with the current DHT designs, which we discuss in more detail in Section 3.
Thus, we believe that such an infrastructure could radically change the way large-scale distributed systems are designed and deployed. However, there are limitations to this approach, which we now discuss.
We've stated that the interface and infrastructure we are proposing are general-purpose. While this has the benefit that our approach can support a wide range of applications, it has the disadvantage that it is sub-optimal for almost all of them. That is, the goal of our approach is to provide pretty good support for a wide range of applications, rather than optimized support for one or two applications. For instance, application-level multicast running over special-purpose multicast overlay networks is probably superior to application-level multicast built using the DHT approach. However, we don't expect the difference to be substantial, and the use of separate infrastructures for each application will prevent the synergy and cost savings that arise from a shared infrastructure. In this sense, our intent is similar to the design choices behind the original Internet architecture; robustness and flexibility were important, and optimality for any particular use was not. This differentiates our approach from many of the proposed special-purpose overlay networks.
At the same time, we expect that there may be classes of applications for which this infrastructure is not appropriate. These include applications that require a transactional service model, strict consistency among many writers, or fine-grained control over the physical location of data. We will only know the limitations of our approach after substantially more investigation.
Lastly, we are not claiming that our approach is a panacea. For example, one part of a complete solution to security would be techniques to eliminate remotely exploitable bugs in server software. We take pains to avoid such bugs by using careful coding standards and software infrastructure, and we also plan to produce independent implementations of our software in different languages. However, we do not otherwise address local operating system and software security.
Another important area we do not consider is the economic aspects of our proposed infrastructure. In the beginning such an infrastructure could be based on ``donated'' nodes assigned to the infrastructure by various institutions. This may be sufficient to keep such an infrastructure running indefinitely, but if commercial viability is required for its survival then we claim no expertise about how this will come about.
Despite all this, however, we strongly believe that this approach has broad enough applicability and addresses enough of the central concerns that it will be of significant value. Our proposed work is intended to verify this belief in two basic ways.
Addressing Open Research Questions: While the basic outline of the infrastructure design is largely set, there are many remaining open problems. Although we don't expect that any of them is insurmountable, and in fact have preliminary ideas on all of them, significant new research will be required to fully explore this rather new design space. Section 4 presents five broad areas of this research effort.
Testing the Infrastructure Design: The above research will be instrumental in bringing the infrastructure to life, but only real usage will determine its value. Our second area of proposed work is to test the infrastructure in two ways. First, we will build a testbed on which it can be deployed. In collaboration with Intel, we expect to grow this testbed to over 1000 nodes, large enough to get some idea of the problems that arise at large scales. Second, we will build a set of pilot applications to deploy on this testbed. These pilot applications will both provide insight into the utility of the DHT interface (addressing Conjecture #1) and will also attract significant usage of the testbed, allowing us to test the properties of the infrastructure (addressing Conjecture #2). The testbed and applications are described in Section .
These two efforts, addressing the open research questions and testing the resulting infrastructure design, form the two pillars of our proposed work. Before addressing our future work, however, we first describe some prior and related work in the next section.
Because DHTs are central to our proposal but are not yet widely known, we review some of their properties here. DHTs enable data-centric networking by allowing requests for data to be sent without requiring any knowledge of where the corresponding items may be stored. One way of using the DHT abstraction would be to associate a name with each object of interest, and hash that name to a ``key'' in an m-bit virtual address space. The virtual address space is partitioned into cells, which form contiguous regions of this address space. Depending on the design of the specific DHT algorithm, either a single host or a set of hosts is assigned to each cell of the virtual address space. Each host is assigned to one or more cells, and maintains copies of those key-value bindings whose key values lie within its assigned cells. The partitioning of the address space and the assignment of hosts to cells is dynamic and, in particular, changes whenever a node enters, departs or fails, or whenever load balancing is required to match the capabilities of nodes with the bandwidth, storage and processing requirements assigned to them.
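The dynamic repartitioning described above can be illustrated with a toy successor-assignment rule (a minimal sketch of our own, not the algorithm of any specific DHT): when a node joins, it splits exactly one cell, so only the keys in that cell change owners.

```python
import hashlib

M = 2 ** 32                              # toy 32-bit virtual address space

def owner(key, nodes):
    """Successor rule: the first node id at or after the key owns its cell."""
    cands = [n for n in sorted(nodes) if n >= key]
    return cands[0] if cands else min(nodes)   # wrap around the address space

nodes = {10_000, 1_000_000_000, 3_000_000_000}
keys = [int(hashlib.sha1(str(i).encode()).hexdigest(), 16) % M for i in range(1000)]
before = {k: owner(k, nodes) for k in keys}

nodes.add(2_000_000_000)                 # a new node joins, splitting one cell
after = {k: owner(k, nodes) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
# Only the keys in the split cell change hands, and every moved key
# lands on the newly joined node; all other assignments are untouched.
assert all(after[k] == 2_000_000_000 for k in moved)
```

This locality of reassignment is what makes node arrival and departure cheap: the rest of the system's key-value bindings stay where they are.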
Several virtual address spaces have been proposed, including the unit circle (Chord), a high-dimensional torus (CAN), and the set of binary strings of a given length (Pastry and Tapestry). In all cases the space supports a measure of distance between its points, and a definition of proximity between cells. Each cell has a small number of neighbors, and a host assigned to a given cell must know the IP address of at least one host in each neighbor of that cell. A query originating at one host is transmitted by IP unicasts through a chain of neighboring cells until it reaches a host in the cell containing its hash address; this host is capable of handling the query for the key. Each transmission reduces the distance of the query from its target point in the virtual address space. If a host along the originally chosen routing path is unresponsive then the routing algorithm must find an alternate path in the overlay.
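The neighbor-chain routing just described can be sketched in a Chord-flavored toy (our own simplification: one process with global knowledge builds each node's fingers, and the final hop to the key's successor stands in for the successor pointer a real node would keep):

```python
import random

M = 16                                   # identifier circle of 2**16 points
SIZE = 2 ** M

def dist(a, b):
    """Clockwise distance from a to b on the identifier circle."""
    return (b - a) % SIZE

def successor(point, nodes):
    """The node whose cell contains `point` (first node clockwise from it)."""
    return min(nodes, key=lambda n: dist(point, n))

def fingers(node, nodes):
    """Chord-style neighbor set: the successor of node + 2**i for each i."""
    return {successor((node + 2 ** i) % SIZE, nodes) for i in range(M)}

def route(src, key, nodes):
    """Greedily forward toward the key; each hop shrinks the remaining distance."""
    target, cur, hops = successor(key, nodes), src, 0
    while cur != target:
        nxt = min(fingers(cur, nodes), key=lambda f: dist(f, key))
        if dist(nxt, key) >= dist(cur, key):
            nxt = target                 # no closer finger: final hop to the key's successor
        cur, hops = nxt, hops + 1
    return hops

random.seed(1)
nodes = random.sample(range(SIZE), 64)
hops = [route(random.choice(nodes), random.randrange(SIZE), nodes)
        for _ in range(200)]
assert max(hops) <= M + 1                # each finger hop at least halves the distance
```

Because every hop at least halves the remaining clockwise distance, no route takes more than about log2 of the identifier-space size, which is the O(log n) behavior discussed below.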
The key aspect of these DHT designs is their routing efficiency. In an overlay with n nodes, the original class of algorithms requires O(log n) overlay hops to reach arbitrary destinations, and each node must maintain information about O(log n) neighbors. (The CAN proposal has slightly different results: O(d n^{1/d}) hops and O(d) neighbors, where d is the dimension of the torus.) Equally important is minimizing the delay experienced in each overlay hop by considering proximity in the Internet. The original DHT algorithms take different approaches to this problem; the best achieve average path delays that exceed the delay between source and destination by only a small factor. In addition, these algorithms are very robust to failures; for instance, in rather conservative cases (very few neighbors, only simple recovery algorithms, etc.), when 20% of the nodes simultaneously fail only an insignificant number (less than 0.1%) of the routes fail. With more neighbors and with more extensive recovery algorithms, the number of routing failures becomes extremely small.
Recent advances have been made in the efficiency of these DHT routing algorithms, achieving for instance O(log n) hops with only O(1) neighbors, but it isn't yet clear whether these improvements will interfere with the necessary robustness [37,32].
The proposed research builds on a large body of prior work. Because of the depth and breadth of that body of work, we restrict ourselves here to drawing the relationships at a high-level. In subsequent sections we will relate our specific ideas to prior work in more detail. The prior work on robust distributed computing falls roughly in six categories:
Robust client/server computing: This work has focused on building robust distributed systems from server clusters, perhaps replicated in a few geographically separated places. Most high-value Web sites are built in this manner. The primary techniques involved in these systems are primary-backup replication and load-balancer devices that spread queries across the loosely synchronized servers. Examples of academic research in this area include Thor, Ninja, TACC, and Porcupine. Our approach extends this work by reducing reliance on central components, thereby improving overall reliability.
Overlay networks: Recently, Internet systems have been built in the form of overlay networks to provide higher robustness. Examples include content distribution networks (such as Akamai, FFnet, and Overcast) and overlay routing systems (such as Xbone and RON). Overlay networks are a positive indication that it is possible to construct large-scale distributed systems that enhance robustness. Our work can be viewed as extending overlay networks by including ways to deal with untrusted nodes; this allows our infrastructure overlay to be constructed from nodes under different administrative domains and to be more robust to subverted nodes.
P2P systems: P2P systems are new overlay systems constructed from nodes in different administrative domains. The recent P2P systems (such as Napster, Gnutella, and FastTrack/Morpheus), however, scale badly and are insecure. A few ambitious secure systems (Freenet, FreeHaven, MojoNation, and Intermemory) have been designed and implemented, partially inspired by the Eternity Service, but have failed in practice for various reasons. A burst of recent activity in the academic community (some involving the PIs) has focused on scaling and performance issues, but little work has been done on building robust decentralized systems.
DDoS: One of the biggest challenges in building robust distributed systems is denial-of-service attacks. Recent work in this area includes tracing back attackers, practical algorithms for dealing with Byzantine failures, and file systems built out of untrusted servers [24,41]. At a fundamental level, any system that has central components is vulnerable to denial-of-service attacks. A major point of this proposal is to explore the conjecture that decentralized systems built out of symmetric nodes can resist these attacks.
Programming abstractions for robust systems: Most of the prior work focuses on building robust distributed systems using remote procedure call (e.g., Java RMI and Network Objects), distributed transactions (e.g., Argus), or group communication (e.g., Isis). DHTs complement these previous approaches by focusing on simplifying the design and implementation of large-scale decentralized applications.
The programming model of hash tables has been explored previously in distributed systems (e.g., Gribble's work on distributed data structures for scalable server clusters) and parallel systems (e.g., the Linda system). Recent work by the PIs (e.g., CAN, Chord, Kademlia, Pastry, Tapestry) and others [32,37,56] has extended this work to Internet-wide P2P systems, applying it to applications such as event notification, multicast, file systems [33,18,53], and naming systems. This proposal is a direct outcome of this recent work, which involved collaborative projects among subsets of the PIs. Our goal now is to attack the problems of robust DHTs as a single team, minimizing overlap and building on each other's strengths.
While the DHT-based approach appears to be quite promising, there are many challenges that must be met before such systems can be safely deployed. Our research will be devoted to addressing these challenges. The most important issues to be addressed can be roughly categorized into five broad areas:
In this section we discuss our proposed work in these five areas.
The DHT infrastructure provides robust routing: even in the face of massive failures, most messages reach their intended destination. However, when nodes fail, they take their data with them. Steps must be taken to ensure data availability and durability despite node and network failures. Two primary tools help provide highly-available and durable data: replication and active repair.
Fault Tolerance through Replication: Replication provides fault tolerance: the ability to satisfy requests even when some copies of the data have been lost. Building systems out of large numbers of widely distributed nodes provides failure independence and large amounts of storage space for replicas. Replication also enables spreading load across multiple sources, and helps cross-check results obtained from untrustworthy nodes.
Replication can be achieved by hashing each object into several different DHT cells. The replicas should be geographically distributed so that, for every host, the round-trip time to the copy nearest that host is as small as possible. The most storage-efficient form of redundancy, measured in durability per byte, is achieved through erasure-coding techniques such as Reed-Solomon codes, in which partial information about an object is stored on n different hosts, any k (k < n) of which hold sufficient information to reconstruct the object. Erasure coding thus provides a higher level of availability than whole-object replication for a given amount of storage.
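As a concrete sketch of the first step, hashing an object into several DHT cells, the following derives n replica keys from one object key. The `replica_keys` helper and the 4-byte index encoding are illustrative assumptions, not part of any particular DHT's API:

```python
import hashlib

def replica_keys(object_key: bytes, n: int) -> list:
    """Derive n independent DHT keys for an object by hashing the
    object's key together with a replica index; each derived key
    places one replica in a different region of the identifier space."""
    return [hashlib.sha1(object_key + i.to_bytes(4, "big")).hexdigest()
            for i in range(n)]
```

Because SHA-1 scatters the derived keys uniformly over the identifier space, and DHT cells are themselves spread over the network, the replicas land on unrelated nodes without any central placement decision.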
An important issue in replication is ensuring that an object's replicas do not exhibit correlated failure. Geographic distribution of servers removes one level of correlation, that due to regional outages and natural disasters. However, other sources of correlation can remain, such as shared network routes, operating system versions, and administrative domains. Thus, an open problem is how to choose server sets that minimize correlated failure.
Active Repair: Replication alone does not guarantee durability, since a given set of replicas will eventually all fail. This problem leads to a need for repair. In order to repair information, we postulate a continuous, on-line monitoring process that recognizes failures in the system and triggers repair of information.
One interesting property of DHTs is that they distribute information about the location of objects throughout the infrastructure. Treating location information as periodically refreshed soft state lends itself to efficient techniques for triggering the repair process in DHTs that combine proximity and convergence (Pastry and Tapestry have this property). This process is potentially efficient but imprecise, since untrusted nodes are participating. Periodic sweep is a trusted but more costly technique which slowly scans through all information to restore the level of redundancy.
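The periodic-sweep technique can be illustrated with a toy soft-state table. All names, the redundancy target, and the refresh TTL below are assumptions made for the sketch:

```python
import time

REDUNDANCY = 3      # desired number of live replicas per object (assumed)
REFRESH_TTL = 60.0  # seconds before an unrefreshed announcement expires (assumed)

# object key -> {node id -> time of last refresh announcement}
replica_state = {}

def refresh(key, node, now=None):
    """A node periodically re-announces that it still holds a replica;
    location information is soft state that must be refreshed to stay alive."""
    now = time.time() if now is None else now
    replica_state.setdefault(key, {})[node] = now

def sweep(now=None):
    """Periodic sweep: expire stale announcements and report objects whose
    live-replica count has dropped below the redundancy target."""
    now = time.time() if now is None else now
    needs_repair = []
    for key in list(replica_state):
        live = {n: t for n, t in replica_state[key].items()
                if now - t < REFRESH_TTL}
        replica_state[key] = live
        if len(live) < REDUNDANCY:
            needs_repair.append(key)
    return needs_repair
```

The sweep is trusted but touches every entry, which is why the proposal pairs it with the cheaper (if imprecise) soft-state triggering performed by the nodes themselves.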
Verification and Byzantine Failure: Read-only replicas are easily compared with one another to combat faulty or corrupted nodes. In fact, if the DHT keys are derived from secure hashes (such as SHA-1) of the information, then corrupted information can simply be treated as a failure. This means that read-only information is easily amenable to repair as well. Unfortunately, when information is changing, a single corrupted node may compromise the system in a way that is undetectable. This leads to a need for techniques such as Byzantine agreement.
Our overlay infrastructure will be comprised of many nodes so that the system can handle large loads and remain robust in the face of failures. However, such a large node population, while solving some problems, raises several others. We now discuss three such important issues.
Reasoning about systems in flux: Most formal reasoning about distributed systems starts with an ideal state and then explores how long it takes to return to the ideal state after a failure or other change. However, a large system may never achieve a quiescent ideal state. New nodes are constantly arriving, old nodes are departing, and nodes are in various states of failure and recovery. In short, these are systems in flux.
We have, as yet, no theory for systems in flux. The result is that systems builders rarely use algorithms that have provable or characterizable properties-and systems often function poorly or fail in unforeseen ways. One of the important challenges before us is to develop theories for systems in flux.
To make progress in this area, we advocate the following general methodology:
The PIs have made some initial progress along these lines, using the concept of the half-life of distributed systems. However, much more remains to be done.
Load balancing systems in flux: While systems built from large numbers of nodes may provide total storage and serving capacity greater than the load, the load may not always be spread evenly over the nodes. This can lead to hotspots and degraded performance. In particular, flash crowds, in which sudden large spurts of offered load target a particular item or service of interest, can easily overwhelm individual nodes.
To deal with such hotspots, DHT infrastructures can use striping (breaking each data object into smaller components) and replication (keeping several independent copies of each component) to distribute the load. This allows the demand for particular objects to be spread across many different overlay nodes, thereby decreasing the chance of hotspots. In particular, it allows the system to make use of statistical multiplexing; the system need not require that each node handle some (unknown) peak load but instead need only have enough aggregate capacity to handle the overall average load. This greatly reduces the requirements on individual nodes, and allows the infrastructure to be comprised of much cheaper machines.
This approach to handling high loads presents two challenges. First, one must adaptively rebalance the load when a temporary imbalance arises. Such adaptive algorithms must respond quickly to flash crowds and other temporary overloads, but must also not incur much overhead during more routine operations. Second, one must cope with node heterogeneity. We expect there to be a vast diversity of hosts participating in the overlay infrastructure, ranging from powerful servers to PCs, from nodes with high-bandwidth Internet connections to those with limited bandwidth connections, from nodes with mean-time-between-failures measured in years to those measured in days (or less). The basic replication and routing algorithms must take this heterogeneity into account so that the load on individual nodes does not outstrip their capacity.
Monitoring systems in flux: Because of the frequent arrivals and departures of hosts and rapidly changing distributions of access requests, it will be very hard to predict the future state of the system with any accuracy. Thus, to ensure that the system is functioning well, and to give adaptive algorithms (such as the load balancing algorithms) some global context, we will require some level of decentralized monitoring of the overall system. This builds on previous proposals for introspective systems, but the challenge here is to apply those ideas at tremendous scale and in a self-organizing manner. While it seems clear how to monitor some of the basic statistics of the overlay (e.g., the number of nodes), monitoring more subtle performance measures will be more challenging.
Another side effect of intentionally building systems out of many wide-spread nodes is that not all of the participating nodes will be trustworthy. This can take a number of forms. Some nodes may be actively trying to subvert the system; others may have been mis-configured; others may be running buggy or out-of-date software; others might be mis-behaving because they are overloaded. Similarly, requirements will vary from application to application. Our initial approach in this area will be to explore specific applications, threats, and defenses; as we learn more we hope to identify comprehensive defense strategies.
The main authentication problem facing DHT-based systems is authentication of data, rather than the more traditional concern of authenticating principals over secure channels. For data-centric systems, the question is ``is this the data I wanted,'' not ``is this the principal I wanted.'' The risk is that participants responsible for storing or computing data will maliciously or mistakenly supply incorrect data. A significant constraint is that the set of participants may be so large and diverse that a globally acceptable certification or public-key infrastructure is impractical.
In general, DHTs will be used to organize complex structures consisting of related objects. Thus a key concern is the ability to provide verifiable inter-object references, perhaps analogous to secure links between web pages. A simple example involves naming an object using a cryptographic hash of its content, an idea that fits well with DHTs. More difficult challenges include mutable objects; objects that can be changed by more than one user; verifying that the freshest version of an object has been obtained; and verifying that a particular set of objects consists of consistent versions. Initial work by the PIs in these areas includes self-certifying pathnames for mutable data [40,24] and techniques to ensure consistent and correct mutable file systems in the face of malicious file servers.
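The content-hash naming idea can be sketched as follows, with a Python dict standing in for the DHT; the `put`/`get` names are illustrative:

```python
import hashlib

def put(store, content: bytes) -> str:
    """Name an object by the SHA-1 hash of its content, making the
    key self-certifying: any holder of the key can verify the data."""
    key = hashlib.sha1(content).hexdigest()
    store[key] = content
    return key

def get(store, key: str) -> bytes:
    """Fetch and verify: data that does not hash to its key is treated
    as a failure, exactly like a missing replica."""
    data = store[key]
    if hashlib.sha1(data).hexdigest() != key:
        raise ValueError("data does not match its content-hash key")
    return data
```

Note that this scheme verifies only immutable data; a key naming mutable content cannot be a hash of that content, which is precisely why the mutable cases listed above are harder.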
While data authentication can help validate the results of a lookup in a DHT, authentication cannot easily verify whether a ``data not found'' error is correct. This means that attacks that cause DHT lookups to go astray are potentially troublesome. Since DHT nodes build their lookup routing tables based on information other nodes give them, DHTs are particularly vulnerable to participants that provide incorrect routing information. Characterization of the scope of this problem is already underway.
One promising approach to this problem is to define the mapping of DHT keys to nodes in a way that can be independently verified. One option is for a central authority to hand out signed certificates that describe responsibility for regions of the key space; Druschel explores this idea with Pastry. A weaker but more scalable approach is to assign responsibility for the key space based on IP address, and to verify possession of IP addresses by exchanging nonces. A third approach is to place responsibility for maintaining a consistent view of the participant set in a small core of nodes, and use Byzantine-fault-tolerance techniques to avoid corruption of this view.
An alternate view of this problem is that the DHT approach works best if the fraction of malicious (or unreliable) nodes can be bounded, and if nodes can be viewed as having independent probabilities of being malicious. If these two assumptions hold, then (for example) placing replicas at randomly chosen nodes will lead to robustness, and the freshness and validity of data can be checked by retrieving multiple replicas. Similar arguments apply to the construction and validation of lookup routing tables. A key part of this story is the ability to determine that two apparently different participants are actually the same, to prevent a malicious node from participating enough times to break the bound on the fraction of malicious nodes. The PIs have worked out this general approach in more detail.
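Under these two assumptions, checking data by retrieving multiple replicas amounts to simple voting. This sketch assumes the replica responses have already been fetched; the function name and quorum parameter are illustrative:

```python
from collections import Counter

def fetch_with_voting(responses, quorum):
    """Accept the value returned by at least `quorum` of the replicas
    fetched from randomly chosen nodes; if the fraction of malicious
    nodes is bounded and placement is random, honest copies win the
    vote with high probability."""
    value, count = Counter(responses).most_common(1)[0]
    if count < quorum:
        raise RuntimeError("no replica value reached the quorum")
    return value
```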
Some applications may require that data can be retrieved anonymously, or that particular pieces of data cannot easily be deleted from the system. Though these are partially political issues, the technical fundamentals are worth exploring. A particularly useful question is the extent to which anonymity and censor-resistance must be integrated into the design of a system, as was done in Freenet. We hope that layering will allow these concerns to be treated separately, for example by using general-purpose anonymity techniques such as Onion Routing or Tarzan. On the other hand, integration seems to be a good approach to censor-resistance, as explored by Mazières in Tangler.
While the DHT abstraction hides all facets of the underlying network, the infrastructure cannot remain so oblivious. In particular, to achieve reasonable performance it must take into account network topology and congestion levels.
Topology-awareness: To achieve low query latency, high query throughput, and efficient use of network resources, DHT overlays must take proximity in the underlying Internet topology into account while routing messages. Proximity routing deals with the selection of overlay routes that attempt to minimize the delay or hop count of a DHT message. This requires a model of the distances among nodes in the Internet, according to a chosen proximity metric such as RTT. The main concern here is to achieve sufficient accuracy with reasonable overhead. One way to construct such a model is to measure the RTT between each host and a small set of widely distributed ``beacon nodes'' with well-known positions in the network [46,50]. However, special beacon nodes are somewhat at odds with the goals of decentralization and self-organization. A second approach is for each node to autonomously measure its distance to other nodes. Doing this in a scalable manner is a significant open problem.
Given a proximity model, two different approaches to proximity routing present themselves. In one approach, the overlay is constructed without regard for proximity. However, there are typically several alternate choices for the next node while forwarding a given message, and one may pick the ``best'' choice according to some metric (such as RTT). The measurable cost is the network delay to the next hop, while the tangible benefit is the amount of progress in identifier space towards the desired node. Developing practical routing schemes based on this cost-benefit idea, and evaluating them on real workloads, is an important research topic.
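A minimal version of this cost-benefit hop selection might look like the following, assuming per-neighbor RTT estimates are available and `progress` measures advance in identifier space (both names are hypothetical):

```python
def next_hop(candidates, rtt):
    """Pick the neighbor with the best benefit/cost ratio: progress in
    identifier space (the benefit) divided by measured delay (the cost).
    candidates: list of (node, progress) pairs with progress > 0
    rtt: node -> estimated round-trip time in milliseconds"""
    return max(candidates, key=lambda c: c[1] / rtt[c[0]])[0]
```

A nearby neighbor that makes modest progress can beat a distant neighbor that makes large progress, which is exactly the trade-off the text describes; the practical research question is which ratio or weighting works well on real workloads.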
In the second approach, the overlay is constructed in a topology-aware manner. One way to achieve this is to assign node identifiers such that nearby nodes in the network are also close in the logical identifier space. This works well for DHT protocols like CAN that use a multi-dimensional identifier space. For DHTs with a one-dimensional space, finding an adequate mapping of the physical proximity space to the logical space may prove difficult. Also, topology-based identifier assignment may entail a loss of robustness, because it increases the likelihood of correlated failures of adjacent nodes in the identifier space. We propose to build on the ideas embedded in Tapestry to attack this problem in general.
Our current research on different DHT schemes suggests that the appropriate method for proximity routing depends on the algorithmic properties of the underlying scheme. We therefore plan to investigate what the general proximity routing principles are and how they relate to algorithmic properties.
Traffic-awareness: Many of the techniques we are proposing have the property that a given node in the system will simultaneously communicate with several other nodes. This changes the traffic model imposed on the network; rather than receiving data from one sender, a node will now receive data from multiple concurrent senders. Furthermore, each stream of data from a given sender may be only a few packets long, making traditional TCP-style congestion control inappropriate. The risk is that extensive use of DHTs may lead to high levels of congestion and unfair competition with other types of traffic.
An important research area is the design of congestion control algorithms for traffic that involves many short interactions with many different nodes. A first step would be to use a generalization of TCP's congestion window to control the aggregate set of requests in progress from any given node; this would make DHTs compatible with existing congestion control. It is also likely that integration could provide opportunities for increased efficiency; for example, each retransmission of a request for a particular piece of data could be sent to a different replica, thus avoiding repeated use of overloaded network paths. At a higher level, the load spreading inherent in DHTs could have a beneficial effect on the overall distribution of network traffic levels, by eliminating hotspots at the network level. A more general problem is the online determination of which concurrent transfers to a receiver from multiple senders share common bottlenecks, so that appropriate congestion sharing strategies (similar to the simpler cases handled by the Congestion Manager today) may be developed, perhaps starting from [55,27]. Finally, analysis and protocol design are needed to realize this vision in a way that does not conflict with traffic management at lower layers.
One of the purposes of this work is to identify the limits to the usefulness of DHTs. Part of this involves exploring the interface needs of a variety of distributed applications. We have preliminary experience using DHTs as building blocks in larger applications, such as multicast communication, notification systems, keyword search, DNS emulation, complex database queries, file system storage, and distributed public key infrastructures. At a high level, all of these applications make good use of the DHT's ability to find data based on keys. At a more detailed level, however, the different applications needed noticeably different interfaces, and more work is required to find general-purpose APIs.
To some extent, application heterogeneity can be addressed by layering more sophisticated abstractions as libraries on top of an underlying DHT. An example where this works well is the layering of the DHash block storage layer on top of a DHT in the CFS read-only file system. DHash takes good advantage of the DHT to help with replica and cache placement. Further, DHash presents a general-purpose store/retrieve key/value interface, potentially useful to many kinds of applications. However, since such libraries live on the infrastructure nodes, they will change only slowly over time; we therefore seek to define a limited number of such APIs that will form the core programming abstractions supported by the infrastructure.
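A toy sketch in the spirit of this block-store layering (not DHash's actual interface) splits a file into content-hashed blocks plus a root block that lists them; the 4-byte block size is purely illustrative:

```python
import hashlib

BLOCK = 4  # toy block size; a real system would use kilobytes

def store_file(dht, data: bytes) -> str:
    """Split a file into blocks, store each block under the SHA-1 of
    its content, then store a root block listing the block keys."""
    keys = []
    for i in range(0, len(data), BLOCK):
        block = data[i:i + BLOCK]
        key = hashlib.sha1(block).hexdigest()
        dht[key] = block
        keys.append(key)
    root = ",".join(keys).encode()
    root_key = hashlib.sha1(root).hexdigest()
    dht[root_key] = root
    return root_key

def fetch_file(dht, root_key: str) -> bytes:
    """Reassemble the file by following the root block's key list."""
    keys = dht[root_key].decode().split(",")
    return b"".join(dht[k] for k in keys)
```

Because each block key is a content hash, the striping that spreads load across nodes also makes every block independently verifiable, combining the load-balancing and authentication benefits discussed earlier.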
Complex queries: While the data-centric naming that DHTs provide is directly useful for some applications, others will require more sophisticated query support. We intend to develop techniques for relational queries layered on top of DHTs. Rich query languages should do more than ``find'' things: they should also allow for combinations and correlations among the things found. Examples of such queries are: ``Give the list of stores in New York that have the Canon S40 in stock, ordered by the camera's advertised price,'' and ``Find the median home price in the Bay Area and the median NASDAQ index for each month over the last 10 years.''
We will provide support for the traditional relational database operators: selection, projection, join, grouping and aggregation, and sorting. One issue that complicates the design of a query-related API is that DHTs assume a flat identifier space, which is not appropriate for managing the multiple data structures required for query processing. In particular, we need to be able to name tables and temporary tables, tuples within a table, and fields within a tuple. One approach is to implement a hierarchical name space on top of the flat identifier space provided by DHTs, by partitioning the identifiers into multiple fields and having each field identify objects of the same granularity.
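This identifier-partitioning approach might be sketched as follows, assuming an illustrative 96-bit identifier split into three 32-bit fields for table, tuple, and field:

```python
import hashlib

def field_id(name: str) -> int:
    """Hash a name down to a 32-bit field (the width is illustrative)."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big") % (1 << 32)

def flat_key(table: str, tup: str, field: str) -> int:
    """Pack table, tuple, and field identifiers into one flat 96-bit
    DHT key, so the DHT's flat key space carries a three-level hierarchy."""
    return (field_id(table) << 64) | (field_id(tup) << 32) | field_id(field)
```

All keys for one table then share a common 32-bit prefix, which is what makes range-style operations over a table's tuples conceivable on top of a flat DHT.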
A hierarchical name space may also require more complex routing primitives such as multicast. Suppose we wish to store a small temporary table on a subset of nodes in the network. Then we will need to route queries to just that subset of nodes. One possibility would be to modify the routing protocol so that a node forwards a query to all neighbors that make progress in the identifier space towards any of the identifiers covered by the query.
Resource Location: An important challenge facing emerging pervasive computing systems is the development of scalable resource discovery techniques that will allow client applications to locate services and devices, in increasingly large-scale environments, using semi-structured intentional names. An intentional name is an attribute-based description of a resource or query, which allows applications to describe what they are looking for, rather than where to find it.
The particularly hard problem is scalability, because traditional techniques that use hierarchy do not work in a semi-structured world with arbitrary queries. However, the DHT-based approach may lead to practically deployable intentional naming systems, because intentional names can be scalably stored using the DHT abstraction. The PIs have started investigating this area.
Mutable Data: Much of the work to date on DHT-based storage systems has involved read-only data, or more properly single-writer data. This is because it is not clear how to authenticate data generated by multiple users, and not clear how to maintain consistency of such data. Worse, the users cooperating in producing the data may not fully trust each other (and thus cannot share keys), or may not trust the reliability of each others' computers (and thus cannot easily participate in a locking protocol).
One possible approach to allowing multiple writers is to maintain data in per-writer logs, instead of in a single unified image. This allows mutually suspicious participants to ignore each others' modifications (i.e., logs) if desired. This use of per-client logs is similar to that of Zebra, though Zebra's centralized meta-data handling is not appropriate in this situation. It also eases conflict resolution, since the individual log entries can be replayed whenever resolution is required; in this way this approach resembles that of Bayou. One challenge in this approach is that the semantics it can provide, while similar to those of an ordinary file system, differ in some ways that are hard to make transparent.
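A minimal sketch of per-writer logs with replay-based resolution, assuming globally ordered sequence numbers (a simplification; a real system would use version vectors or timestamps):

```python
def append(logs, writer, seq, op):
    """Each writer appends only to its own log, so no cross-writer
    locking protocol is needed."""
    logs.setdefault(writer, []).append((seq, op))

def resolve(logs, trusted=None):
    """Replay-based resolution: merge entries from the (optionally
    restricted) set of writers in sequence order; a suspicious
    writer's log can simply be left out of the merge."""
    entries = []
    for writer, log in logs.items():
        if trusted is None or writer in trusted:
            entries.extend((seq, writer, op) for seq, op in log)
    return [op for _, _, op in sorted(entries)]
```

The `trusted` filter captures the point above: because modifications are segregated by writer, distrust is expressed by exclusion at read time rather than by a shared locking protocol at write time.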
Communication: The Internet is very successful at supporting a basic point-to-point communication abstraction. However, attempts to generalize this abstraction to, for instance, multicast and anycast, have faced difficult technical challenges and deployment barriers. One approach to generalizing communication abstractions is to use the DHT infrastructure to provide a level of indirection. In this approach senders don't name receivers by IP addresses but by an abstract handle (a ``key''). Abstractions that can be provided include multicast and anycast, and seamless communication to nodes that change IP addresses (e.g., due to mobility) can also be supported readily.
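The indirection idea can be sketched with a dict standing in for the DHT; the `register`/`send` names are illustrative. Registering several receivers under one handle yields multicast from the same mechanism:

```python
def register(dht, handle, address):
    """A receiver publishes its current address under an abstract handle."""
    dht.setdefault(handle, []).append(address)

def send(dht, handle, message):
    """Senders address the handle, not an IP; resolving the handle in
    the DHT yields one delivery per registered receiver, so multicast
    falls out of the same lookup."""
    return [(addr, message) for addr in dht.get(handle, [])]
```

Mobility would additionally require expiring stale registrations (e.g., as soft state), so that a node that changes IP address can simply re-register under the same handle.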
Peter Druschel, Rice
Joe Hellerstein, Berkeley
M. Frans Kaashoek, MIT
David Karger, MIT
Dick Karp, ICSI/Berkeley
John Kubiatowicz, Berkeley
Barbara Liskov, MIT
David Mazières, NYU
Robert Morris, MIT
Scott Shenker, ICSI
Ion Stoica, Berkeley