Global Services for Internet-Scale e-Science

Matt Welsh, Harvard University

Many scientific fields are increasingly computational and dependent upon diverse sources of digital data. Astronomy, physics, biology, geology, environmental sciences, and public health are all drawing upon networked sources of both archived and real-time digital data to undertake studies of unprecedented scope. To date, however, these fields have had to resort to building their own infrastructure from the ground up to support their research. As an example, the NSF-funded EarthScope initiative is archiving data from thousands of GPS receivers deployed across the North American continent to track the movement of tectonic plates at resolutions of millimeters per year. However, the infrastructure used for this purpose amounts to a central FTP server hosted in Colorado from which researchers manually download data sets for local processing.

We argue that the distributed systems community should be partnering directly with researchers in the natural sciences to address the pressing need for a large-scale networked infrastructure for data-intensive "e-Science" applications. Such an approach could enable significant scientific investigations that are not possible today because each experiment must build substantial networking and systems infrastructure of its own. This vision also differs substantially from the many Grid efforts, which are primarily focused on providing computational resources to cycle-intensive simulations. The problem of locating, indexing, archiving, and querying vast numbers of scientific data sources is still largely unaddressed.

The Internet provides low-level, physical connectivity between computers, the Web provides a user interface, and Web Service protocols enable a programmatic structure for building applications.
Still, domain scientists demand much richer interfaces: meaningful interchange across different data formats; the ability to name data sources according to physical, logical, and spatiotemporal attributes; annotation and tracking of data provenance; and, finally, the ability to scale to the ever-increasing number of real-time and archived repositories of digital data that will eventually be tied into such an infrastructure. Much in the way that search engines have unified access to static content found on the Web, we envision a common, scalable framework for data-intensive applications that can unify researchers' access to science data.

As an example, consider a study of climatic processes and the effect of industrial pollutants on the environment. Such a study can integrate data from ground-, ocean-, and air-based sensors of pollutants and environmental conditions; satellite imagery; and historical databases of weather observations. Currently, locating and accessing these diverse data sources is hampered by the lack of a common data management infrastructure. Simply placing archives of data on the Web does not address the complex process of locating and processing this data. Supporting a large number of simultaneous applications also raises many challenges for resource management and scalability. An environmental scientist should not have to become an expert in network protocols and distributed systems programming to do her research!

The database community is starting to look at many of these challenges, although we believe these efforts are hampered by system designs that call for central processing of streaming data, as well as an SQL-centric vision for the query interface. We believe that this effort will require input from experts in databases, languages, theory, AI, systems, and networking, as well as effective partnerships with the natural science communities.
In particular, the computer science research agenda must be informed by the needs of the "hard science" problems that this infrastructure will support. We believe that the space of systems problems here is very rich, and that focusing our efforts on supporting large-scale scientific research can reinvigorate the distributed systems community with a new purpose that is intimately linked to the many scientific discoveries that this research will support.
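
To make two of the interface requirements named above more concrete (naming data sources by physical and spatiotemporal attributes, and tracking data provenance), the following is a minimal, purely illustrative sketch in Python. It is not a design proposed in this paper: the `DataSource`, `Catalog`, and `derive` names, the attribute set, and the station and buoy entries are all hypothetical, and a single in-memory catalog ignores the distribution and scalability questions that are the real research problem.

```python
from dataclasses import dataclass, replace
from datetime import datetime


@dataclass(frozen=True)
class DataSource:
    """A data source named by its attributes rather than by host or file path."""
    name: str
    kind: str               # physical attribute, e.g. "gps" or "ocean-temp"
    lat: float              # location of the instrument
    lon: float
    start: datetime         # first observation covered
    end: datetime           # last observation covered
    provenance: tuple = ()  # ordered record of how the data was produced


class Catalog:
    """Hypothetical in-memory catalog; a real service would be distributed."""

    def __init__(self):
        self._sources = []

    def register(self, source):
        self._sources.append(source)

    def query(self, kind=None, bbox=None, during=None):
        """Return sources matching every constraint that is supplied.

        bbox is (min_lat, min_lon, max_lat, max_lon); during is (start, end).
        """
        results = []
        for s in self._sources:
            if kind is not None and s.kind != kind:
                continue
            if bbox is not None:
                min_lat, min_lon, max_lat, max_lon = bbox
                if not (min_lat <= s.lat <= max_lat and min_lon <= s.lon <= max_lon):
                    continue
            if during is not None:
                q_start, q_end = during
                # Keep the source only if its time span overlaps the query span.
                if s.end < q_start or s.start > q_end:
                    continue
            results.append(s)
        return results


def derive(source, name, step):
    """Return a new source derived from `source`, appending `step` to its provenance."""
    return replace(source, name=name, provenance=source.provenance + (step,))


# Made-up entries for illustration only.
catalog = Catalog()
raw = DataSource("station-001", "gps", 40.0, -105.3,
                 datetime(2004, 1, 1), datetime(2005, 1, 1),
                 provenance=("raw receiver log",))
catalog.register(raw)
catalog.register(derive(raw, "station-001-daily", "averaged to daily positions"))
catalog.register(DataSource("buoy-17", "ocean-temp", 36.8, -122.0,
                            datetime(2004, 1, 1), datetime(2005, 1, 1)))

# Ask for GPS data in a region and time window, without naming any server.
hits = catalog.query(kind="gps",
                     bbox=(30.0, -110.0, 45.0, -100.0),
                     during=(datetime(2004, 6, 1), datetime(2004, 7, 1)))
for s in hits:
    print(s.name, s.provenance)
```

The point of the sketch is only that a researcher asks for data by what it measures, where, and when, and that each derived data set carries a record of how it was produced; how such a catalog would be indexed and scaled across many repositories is left open.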