6.824 Lecture 4: Distributed Programming: MapReduce and Dryad

Todo: Handout with MapReduce word count app

Outline:
  Why, designs, challenges
  MapReduce (Google)
  Dryad (Microsoft)
  More lab 2

Since writing a distributed application poses a number of challenges beyond
sequential programming, it would be nice if there were ways to simplify it.
Today we see two designs for making it easier to write parallel applications
on distributed computer systems: MapReduce and Dryad.

This lecture is also a case study of:
  Use of distributed computer systems
  Distributed computing challenges: programming, fault tolerance,
    consistency, concurrency, etc.

One use of distributed computer systems is running large computations.
Typically the application is partitioned into computations that run in
parallel and sometimes communicate.

Applications:
  Scientific applications
  Large-data processing apps (indexing, search, ...)
  Etc.

Designs:
  Cluster computing: PCs connected by a high-speed network
  Grid computing: a high-speed network of supercomputers
  Volunteer/personal computing: aggregates PCs on the Internet

Challenges:
  Parallelize the application --- how to handle shared state?
    The network is typically a bottleneck
    Embarrassingly parallel (run the same app for different inputs, users, ...)
    Coarse-grained (communication-to-computation ratio is low)
    Fine-grained (typically requires a parallel computer)
  How to write the application?
    Explicit messages (RPC, MPI)
    Shared memory
  Balance the computations of an application across computers
    Statically (e.g., doable when the designer knows how much work there is)
    Dynamically
  Handle failures of nodes during computation
    With 1,000 machines, is a failure likely in a 1-hour period?
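To make the last question concrete, here is a back-of-envelope estimate. The one-failure-per-year MTBF is an assumed figure for illustration, not a number from the papers:

```python
import math

def p_any_failure(n_machines, mtbf_hours=365 * 24, window_hours=1.0):
    # Per-machine probability of failing within the window, using a
    # simple exponential failure model with an ASSUMED 1-year MTBF.
    p_one = 1.0 - math.exp(-window_hours / mtbf_hours)
    # Probability that at least one of n independent machines fails.
    return 1.0 - (1.0 - p_one) ** n_machines

# With 1,000 machines and a 1-year MTBF, a failure in any given hour
# is quite likely: roughly an 11% chance.
```

So over a 1-hour computation on 1,000 machines, failures are routine, not exceptional; the system must handle them automatically.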
  Handling failures is often easier than with, say, banking applications
    (or the YFS lock server): many computations have no "harmful"
    side-effects and have clear commit points
  Schedule several applications that want to share the infrastructure
    Time-sharing
    Strict partitioning

MapReduce

Design:
  Partition the large input data set into M splits
    a split is typically a 64-Mbyte piece of the input
  Run map on each split; using a partition function, each map produces
    R local partitions
  Run reduce on each intermediate partition, producing R output files

Programmer interface:
  map(string key, string value)
  reduce(string key, iterator values)

Example: word count
  split the input file into big splits
  a map computation takes one split as input
    produces a list of words as output
    the output is partitioned into R partitions
  a reduce computation takes one partition as input
    outputs the number of occurrences of each word

Implementation:
  caller invokes the mapreduce library
  the library creates worker processes, which run map or reduce computations
  the library creates one master process
    the master assigns map and reduce tasks to workers
    the master is the communication channel between map and reduce workers
    the master handles failures of workers
  map workers communicate the locations of their R partitions to the master
  reduce workers ask the master for those locations
    sort the input keys
    run the reduce operation
  when all workers are finished, the master returns the result to the caller

Fault tolerance:
  when a worker fails
    the master resets all its map and reduce tasks to idle, so they can be
      rescheduled on other workers
    completed maps must be reset because a map's output is stored on the
      failed worker's local disk and is now unavailable
    when a map task is re-run, inform all reduce tasks so they read that
      map's input from the new worker
  when the master fails
    nothing!  (the library does not recover; the caller must retry)

Semantics:
  if the user's map and reduce functions are deterministic, then the output
    is the same as that of a non-faulty sequential run of the program
  when a reduce task completes, the worker renames its tmp output file
    atomically -- the reduce commit point!
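The word count example can be sketched as follows. wc_map and wc_reduce follow the map(key, value) / reduce(key, values) interface above; the driver around them (splitting, hash partitioning into R, sorting) is an assumed stand-in for the MapReduce library's plumbing, not the paper's actual C++ implementation:

```python
from collections import defaultdict

R = 4  # number of reduce partitions / output files

def wc_map(key, value):
    # key: split name (unused here), value: the split's text.
    # Emit (word, "1") for each word occurrence in the split.
    return [(word, "1") for word in value.split()]

def wc_reduce(key, values):
    # key: a word; values: iterator over the counts emitted for it.
    return sum(int(v) for v in values)

def run(splits):
    # "Map phase": each map's output is split into R partitions
    # by a partition function (here hash(word) % R).
    partitions = [defaultdict(list) for _ in range(R)]
    for name, text in splits.items():
        for word, count in wc_map(name, text):
            partitions[hash(word) % R][word].append(count)
    # "Reduce phase": each partition is reduced independently;
    # keys are sorted first, as in the real library.
    output = {}
    for part in partitions:
        for word in sorted(part):
            output[word] = wc_reduce(word, iter(part[word]))
    return output
```

For example, run({"split0": "a b a", "split1": "b c"}) returns {"a": 2, "b": 2, "c": 1}.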
Load balance:
  M + R tasks; ideally M + R is much larger than the number of workers
  challenge: stragglers

Locality:
  the master runs mappers close to one of the 3 replicas of their input

Dryad

Similar goals to MapReduce, but a different design:
  Computations are expressed as a graph
    Vertices are computations
    Edges are communication channels
    Each vertex can have several input and output channels
  Nice use of C++ to make it easy to construct graphs
    See figure 3

Example: what does the graph look like for the word count example?
  Answer: figure 6 (before)
  How is the reduce run in parallel?
    Figure 6 (after), dynamically
    Or change the graph to have R reduce nodes

Implementation:
  Job manager
    keeps an execution record for each vertex
    when all of a vertex's inputs are available, the vertex becomes runnable
    vertices may express preferences
    dynamic graph refinements
  Daemon creates processes to run vertices
  Stage manager
    locality
    replicates stages to avoid the straggler problem
  Channels
    files, TCP pipes, or shared memory

Load balancing:
  Greedy scheduling

Fault tolerance:
  if the job manager fails, the computation fails
  if a vertex computation fails
    restart the vertex with a different version #
    a previous instance of the vertex may run in parallel with new instances

Semantics:
  Assumption: vertices are deterministic
  Each vertex runs one or more times
  Stop when every vertex has completed its execution at least once
  Locality: stage manager

MapReduce versus Dryad:
  Many similarities
  Dryad supports general computation graphs, while MapReduce is a series of
    maps and reduces
    Each vertex can take n inputs, while a map takes one input
    Each vertex can produce n outputs, while a map generates one output
  How would you express the SQL query (see sec 2.1) using MapReduce?

Lab 2:
  fuse.cc, fuse_client.cc, fuse_client.h, and yfs_protocol.h
  trace getattr(), which the open or stat system call will send
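As a toy illustration of Dryad's graph model (not its actual C++ graph-construction API), the shape of the word-count graph with M map vertices feeding R reduce vertices over channel edges might look like this:

```python
# Toy model of a Dryad-style computation graph: vertices are computations,
# edges are channels.  This only mimics the SHAPE of the word-count graph;
# the Vertex/connect names are hypothetical, not Dryad's interface.

class Vertex:
    def __init__(self, name):
        self.name = name
        self.inputs = []   # channels feeding this vertex
        self.outputs = []  # channels this vertex writes

def connect(src, dst):
    # An edge is a channel; in Dryad it could be a file,
    # a TCP pipe, or shared memory.
    channel = (src.name, dst.name)
    src.outputs.append(channel)
    dst.inputs.append(channel)

def word_count_graph(m, r):
    maps = [Vertex(f"map{i}") for i in range(m)]
    reduces = [Vertex(f"reduce{j}") for j in range(r)]
    # Each map vertex sends one partition to every reduce vertex,
    # so each reduce vertex has m input channels.
    for mv in maps:
        for rv in reduces:
            connect(mv, rv)
    return maps, reduces
```

A vertex becomes runnable once all of its input channels are available; with R reduce vertices the reduce stage itself runs in parallel, matching the "R reduce nodes" variant above.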