The Black Art of Real Distributed Systems Development.
Werner Vogels, CTO - Amazon.com

The construction of very large, complex distributed systems for production environments is extremely challenging. And that is before considering the fact that hardly any commercial or research distributed systems technology can operate in really large, dynamic production environments. Regardless of this lack of available technology, the process of building real systems at large scale in a continuously evolving environment is at times very, very hard. Building large, complex distributed systems for production environments that must be scalable, reliable, and efficient, with guaranteed performance, is a Black Art.

Key to Amazon.com's business strategy is growth. Growth comes through more shopping customers, more sellers, more merchants, more categories, a larger selection, more services, a growing number of access methods, and expanding delivery mechanisms. This growth has a continuous impact on many areas: larger datasets, faster update rates, more requests, more services, tighter SLAs, more failures, more latency, more service interdependencies, more developers, more documentation, more programs, more servers, more networks, more datacenters. Under these facts of life, architects, engineers, and application developers are continuously working to develop new, innovative services to address customer needs.

Problems do not get solved by just throwing fancy new distributed systems technology at them. The process of developing complex services in a massively complex distributed world is challenging in itself, even before you get to use the fancy stuff. Here are some of the challenges that one faces:

- Engineers are not machines. Innovative architects and engineers are artists. To make them successful, you need to impose as few restrictions as possible on how they build their systems.
You need to find a software containment method that allows them to develop their software any way they see fit, using the tools they feel are best suited for the job, in the languages they are most productive in.
- Reduce dependencies. In an environment where innovation is not stifled by top-down control, a lot of concurrent development happens with little or no coordination. The development process itself becomes a loosely coupled distributed system, one that exhibits many of the same properties as digital distributed systems.
- Feedback is required. Builders need direct contact with their customers to understand the impact their services are having. For this you cannot just throw the service over the wall to an operational team; the software team needs to be responsible for the operational aspects of its services, so that it understands and controls the full process from design to operation and can use real-time feedback to continuously improve the service.
- Evolution. The system didn't just happen overnight. There are probably a lot of pieces that were developed a few years ago and can no longer easily survive the onslaught of higher request rates or larger datasets. However, you cannot just redesign and redevelop each service from scratch. How to evolve parts of the system into a new state gracefully yet urgently is a challenge. How to make sure that those parts of the system that have open or hidden dependencies evolve at the same time is even more challenging.
- Testing, testing, and testing. In a large, complex distributed system that is continuously evolving, how do you test? How do you make sure that thousands of developers can test their services in an environment that mimics the real deployment as closely as possible, using the correct versions of the other services they depend on, and using global application and database state that correctly reflects the cases they need to test?
How do you do stress and performance testing in an environment that is shared with many other testing developers? How do you stress test and do fault injection in the production system without taking the whole system down if you do hit a bug?
- Tools versus Frameworks. There is no single technology that can provide a framework in which all of the development for such a large distributed system can happen. At Amazon, for example, every possible form of data storage is in use, from file systems to replicated relational databases. Every possible form of caching is deployed, from a write-through collaborative cache to database query result caching. And those are just the infrastructure-style services; the application and business logic services that make use of these technologies run into the thousands for a large-scale operation. Any piece of technology that forces a developer into a strict framework is not likely to be successful in this environment; currently a developer generally needs to incorporate two or three different technologies into any application, each forcing its own style of development. Software pieces need to become tools that can be used together instead of frameworks competing for pole position in the developer's mind.
- Deployment. All these thousands of different pieces rely on thousands of other pieces. Making sure each service has exactly the right software available is very challenging. Being able to augment a deployment if you missed something is essential. Rolling back a deployment on thousands of nodes to a previous version very quickly when you have screwed up is essential. Integration into the build and test process is mandatory, but supporting it for C++, C#, Java, Perl, Ruby, ML, and Erlang at the same time is not that simple.

These are only the tip of the iceberg of problems that real distributed systems developers have to deal with on a daily basis.
If we really want to have an impact on the way distributed systems are used in real systems, we first need to make sure that they can actually be developed, built, tested, and deployed in a scalable and reliable manner that doesn't require black magic.