"Ensuring data longevity in large-scale archival systems" Mehul Shah -- HP Labs, Palo Alto -- mehul.shah@hp.com Online services like email, photo-sharing, and archival sites offer to store large amounts of personal and business data, making the data quickly accessible while preserving it indefinitely. As disk storage becomes denser and cheaper, we will be able to store cost-effectively an increasing fraction of our digital assets online. For example, the MyLifeBits project (J. Gemmell et al.) assumes that we will soon be able to store all our personal data, including everything we see and hear. As we transform our physical personal assets like financial records, photos, and video into digital form, the longevity of this data will become a primary concern. For businesses, data longevity really matters. For some companies, e.g. DreamWorks or Disney, strategic assets are already digital. Moreover, recent legislation such as Sarbanes-Oxley mandate preservation of records for decades. While current storage system designs provide availability for voluminous data, they ignore the obstacles in maintaining data integrity for decades and centuries to come. The next main challenge is to build large-scale systems that permit efficient, cost-effective retrieval of high-integrity, archived information arbitrarily into the future. In addition to typical storage issues, digital stores face a number of long-term threats usually overlooked in reliability analyses of traditional storage systems like RAID. While negligible in the short term, the chances of these threats accumulate to significance over the long term. These threats stem from the larger ecosystem in which the store operates. They include component faults, operator error, large-scale disaster, media and format obsolescence, limited budget, and organizational failure. A component fault results from an error in the layers or components above the storage system, e.g. volume manager, OS driver, or networking layer. 
Some of these errors, e.g. a corrupted write, are silent and can remain unnoticed for long periods, surfacing only after it is too late to fix them. The IRON filesystem (V. Prabhakaran et al.) attempts to fortify local filesystem metadata against these higher-level errors, but does not protect the central asset: application data. An example of organizational failure is a provider going out of business and leaving the customer with no reasonable exit strategy for her data. These threats manifest as failures that can eliminate large portions of the store, or the entire store, all at once.

Faced with these threats, we must tackle the following sub-goals to achieve our main objective.

(1) Quantify the impact of long-term failure modes on existing distributed storage system designs. This task requires instrumenting such systems to track faults, in particular the silent, insidious ones, and to track environmental conditions to pinpoint root causes.
(2) Design and model defense mechanisms against long-term threats, and build threat models to evaluate the efficacy of the methods. For example, we should explore methods for maintaining consistent, geographically separated replicas to diminish the impact of natural disasters.
(3) Understand the costs involved in both the development and the ongoing maintenance of the system. Otherwise, unexpected budgetary constraints remain a threat.
(4) Build a to-scale prototype to validate our models and to investigate methods that cannot be modeled, e.g. data ingestion and data exit strategies. As a realistic starting point, we will need storage infrastructure at the petabyte scale.
(5) Develop effective rapid-aging testing strategies to evaluate systems intended to last forever. As a start, we consider accelerated fault injection to simulate threats and conditions over long periods.

Long-term preservation is a never-ending battle. Our success will be based on two criteria.
First, we must build a large-scale archival system demonstrating no failures on rapid-aging tests simulating a century. Second, we must develop a process to continually evolve defense strategies and demonstrate success as arbitrary new threats arise.
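
As a rough illustration of two ideas above, the accumulation of small per-year threat probabilities and rapid-aging tests driven by accelerated fault injection, consider the following minimal Python sketch. It is not a design from this statement: the function names, the use of SHA-256 as the integrity check, the assumption that years are independent, and all probabilities are hypothetical choices made only for illustration.

```python
import hashlib
import random


def cumulative_risk(annual_probability: float, years: int) -> float:
    """Chance of at least one event in `years`, ASSUMING independent years."""
    return 1.0 - (1.0 - annual_probability) ** years


def checksum(data: bytes) -> str:
    """Integrity fingerprint (SHA-256 is an assumption, not from the text)."""
    return hashlib.sha256(data).hexdigest()


def scrub(replicas: list[bytearray], expected: str) -> bool:
    """Repair corrupted replicas from any intact copy.

    Returns False if no intact copy remains, i.e. the data is lost.
    """
    good = next(
        (bytes(r) for r in replicas if checksum(bytes(r)) == expected), None
    )
    if good is None:
        return False
    for r in replicas:
        if checksum(bytes(r)) != expected:
            r[:] = good  # overwrite the damaged replica in place
    return True


def simulate(years: int, n_replicas: int, p_corrupt: float,
             scrub_every: int, seed: int = 0) -> bool:
    """Hypothetical rapid-aging run: inject silent faults year by year.

    Returns True if at least one intact replica survives the whole run.
    """
    rng = random.Random(seed)  # seeded so a run can be repeated
    original = b"archived record"
    expected = checksum(original)
    replicas = [bytearray(original) for _ in range(n_replicas)]
    for year in range(1, years + 1):
        for r in replicas:
            if rng.random() < p_corrupt:
                # Silent corruption: invert one byte without reporting it.
                r[rng.randrange(len(r))] ^= 0xFF
        if year % scrub_every == 0 and not scrub(replicas, expected):
            return False
    return any(checksum(bytes(r)) == expected for r in replicas)
```

Under the independence assumption, `cumulative_risk(0.01, 100)` is about 0.63, so a fault with only a 1% chance per year is more likely than not to occur within a century. A call such as `simulate(years=100, n_replicas=3, p_corrupt=0.05, scrub_every=1)` then compresses a century of injected silent faults into a single run, and varying `scrub_every` gives a crude view of how detection delay affects survival.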