Organize all the world's medical information
--------------------------------------------

Fay Chang, Muthian Sivathanu, Rob von Behren, and Deborah A. Wallach

The field of medicine is fundamentally driven by careful observation and analysis of example occurrences. Thus, creating more opportunities for such observation and analysis will enable significant advancements in medicine. We believe that large distributed computer systems have the potential to help achieve this goal in two key ways. First, they can provide a comprehensive, reliable, and secure store of all the world's medical data that is globally accessible and usable. Second, and more importantly, massive distributed systems can employ techniques such as statistical machine learning to automate the "observation and analysis" part, again potentially resulting in new insights and advancements.

Although a unified repository of all medical information is quite valuable in itself, analyzing such large amounts of data manually is a daunting task. The most interesting application of distributed systems technology in this scenario would be to mine this information automatically to derive new truths and insights about correlations between symptoms, geographic conditions, life patterns, and diseases or health conditions. Even allowing mining through manual searches, however, would be a key enabler for medical research.

The structure of such a repository presents interesting challenges. For example, medical institutions may need to retain ownership of the data they generate in order to protect the privacy of their patients. This necessitates a fundamentally distributed solution, since not all data can belong to a single domain of administrative control. Accessing and analyzing information in this environment requires basic advances in distributed information security and anonymity techniques.

Achieving knowledge extraction from a large corpus of medical data is challenging for a variety of reasons.
However, we believe that large amounts of data coupled with enough computing power can circumvent fundamentally hard problems by converting them into brute force searches on data. Investigating this line of research will require collaboration between the systems community and the machine learning/data mining community. We will also need good abstractions and platform support that facilitate building such massive distributed applications for statistical machine learning and brute force search.

This challenge can be broken down into several concrete sub-problems and then progressively attacked on several fronts. Possible steps include:

- A system that provides data correlation capabilities across a number of types of medical information records, all taken from the same medical center. Although this sounds simple, records entered by different doctors may use different abbreviations for the same symptoms, and some records may be input as text while others are transcribed from voice input, either through voice recognition or manually.

- A system that provides security guarantees at the proper granularity, such that no individual patient can be identified, yet retains enough detail to spot geographical or familial trends.

- A system that is robust across records from multiple sources, all still in a single language.

- A system that can correlate records across multiple languages.

- A system that, given a set of symptoms, will suggest further symptoms to look for, diagnoses, and treatments.

- A system that performs automated knowledge extraction across the corpus, with automated processes running in the background attempting to learn new information.

- Although acquiring the medical records themselves is as much a political challenge as a systems one, any steps that can be taken in this direction would be useful.
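
As a rough illustration of two of the steps listed above (reconciling different abbreviations for the same symptom, and suggesting further symptoms given an observed set), the following is a minimal sketch in Python. It is not a method proposed in this paper: the abbreviation table, the sample records, and the function names `normalize`, `cooccurrence_counts`, and `suggest` are all hypothetical, and plain co-occurrence counting stands in for whatever statistical technique a real system would use.

```python
from collections import Counter
from itertools import combinations

# Hypothetical abbreviation table (an assumption for illustration only);
# a real system would need a curated medical vocabulary.
ABBREVIATIONS = {
    "sob": "shortness of breath",
    "ha": "headache",
    "n/v": "nausea and vomiting",
}


def normalize(symptom):
    """Map a raw symptom string to a canonical lowercase form."""
    key = symptom.strip().lower()
    return ABBREVIATIONS.get(key, key)


def cooccurrence_counts(records):
    """Count how often each unordered pair of symptoms shares a record."""
    counts = Counter()
    for record in records:
        # Deduplicate and sort so each pair has one canonical ordering.
        symptoms = sorted({normalize(s) for s in record})
        for a, b in combinations(symptoms, 2):
            counts[(a, b)] += 1
    return counts


def suggest(observed, counts, top_n=3):
    """Rank symptoms not yet observed by co-occurrence with observed ones."""
    seen = {normalize(s) for s in observed}
    scores = Counter()
    for (a, b), n in counts.items():
        if a in seen and b not in seen:
            scores[b] += n
        elif b in seen and a not in seen:
            scores[a] += n
    return [symptom for symptom, _ in scores.most_common(top_n)]


# Made-up records; each is the list of symptoms one doctor entered.
records = [
    ["SOB", "cough", "fever"],
    ["shortness of breath", "cough"],
    ["HA", "fever"],
    ["headache", "N/V"],
]
counts = cooccurrence_counts(records)
print(suggest(["sob"], counts))  # prints ['cough', 'fever']
```

The counting pass is a brute force scan over every record, which is the kind of work that could be spread across many machines. The sketch ignores the privacy, multi-source, and multi-language steps entirely.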