6.824 Lecture 4: Distributed Programming: MapReduce and Dryad

Todo: Handout with MapReduce word count app

Outline:
  Why, designs, challenges
  MapReduce (Google)
  Dryad (Microsoft)
  More lab 2

Since writing a distributed application poses a number of challenges beyond
sequential programming, it would be nice if there were ways to simplify it.
Today we see two designs for making it easier to write parallel applications
on distributed computer systems: MapReduce and Dryad.

This lecture is also a case study of:
  Use of distributed computer systems
  Distributed computing challenges: programming, fault tolerance,
    consistency, concurrency, etc.

One use of distributed computer systems is running large computations.
Typically the application is partitioned into computations that run in
parallel and sometimes communicate.

Applications:
  Scientific applications
  Large-data processing apps (indexing, search, ...)
  Etc.

Designs:
  Cluster computing: PCs connected by a high-speed network
  Grid computing: a high-speed network of supercomputers
  Volunteer/personal computing: aggregates PCs on the Internet

Challenges:
  Parallelize the application --- how to handle shared state?
    The network is typically a bottleneck
    Embarrassingly parallel (run the same app for different inputs, users, ...)
    Coarse-grained (communication-to-computation ratio is low)
    Fine-grained (typically requires a parallel computer)
  How to write the application?
    Explicit messages (RPC, MPI)
    Shared memory
  Balance the computations of an application across computers
    Statically (e.g., doable when the designer knows how much work there is)
    Dynamically
  Handle failures of nodes during computation
    With 1,000 machines, is a failure likely in a 1-hour period?
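To make the last question concrete, here is a back-of-envelope estimate. The one-failure-per-year MTBF is an assumed figure for illustration, not a number from the papers:

```python
import math

def p_any_failure(n_machines, mtbf_hours=365 * 24, window_hours=1.0):
    # Per-machine probability of failing within the window, using a
    # simple exponential failure model with an ASSUMED 1-year MTBF.
    p_one = 1.0 - math.exp(-window_hours / mtbf_hours)
    # Probability that at least one of n independent machines fails.
    return 1.0 - (1.0 - p_one) ** n_machines

# With 1,000 machines and a 1-year MTBF, a failure in any given hour
# is quite likely: roughly an 11% chance.
```

So over a 1-hour computation on 1,000 machines, failures are routine, not exceptional; the system must handle them automatically.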
  Handling failures is often easier than with, say, banking applications
    (or the YFS lock server): many computations have no "harmful"
    side-effects and have clear commit points
  Schedule several applications that want to share the infrastructure
    Time-sharing
    Strict partitioning

MapReduce

Design:
  Partition the large input data set into M splits
    a split is typically a 64-Mbyte piece of the input
  Run map on each split; using a partition function, each map produces
    R local partitions
  Run reduce on each intermediate partition, producing R output files

Programmer interface:
  map(string key, string value)
  reduce(string key, iterator values)

Example: word count
  split the input file into big splits
  a map computation takes one split as input
    produces a list of words as output
    the output is partitioned into R partitions
  a reduce computation takes one partition as input
    outputs the number of occurrences of each word

Implementation:
  caller invokes the mapreduce library
  the library creates worker processes, which run map or reduce computations
  the library creates one master process
    the master assigns map and reduce tasks to workers
    the master is the communication channel between map and reduce workers
    the master handles failures of workers
  map workers communicate the locations of their R partitions to the master
  reduce workers ask the master for those locations
    sort the input keys
    run the reduce operation
  when all workers are finished, the master returns the result to the caller

Fault tolerance:
  when a worker fails
    the master resets all its map and reduce tasks to idle, so they can be
      rescheduled on other workers
    completed maps must be reset because a map's output is stored on the
      failed worker's local disk and is now unavailable
    when a map task is re-run, inform all reduce tasks so they read that
      map's input from the new worker
  when the master fails
    nothing!  (the library does not recover; the caller must retry)

Semantics:
  if the user's map and reduce functions are deterministic, then the output
    is the same as that of a non-faulty sequential run of the program
  when a reduce task completes, the worker renames its tmp output file
    atomically -- the reduce commit point!
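The word count example can be sketched as follows. wc_map and wc_reduce follow the map(key, value) / reduce(key, values) interface above; the driver around them (splitting, hash partitioning into R, sorting) is an assumed stand-in for the MapReduce library's plumbing, not the paper's actual C++ implementation:

```python
from collections import defaultdict

R = 4  # number of reduce partitions / output files

def wc_map(key, value):
    # key: split name (unused here), value: the split's text.
    # Emit (word, "1") for each word occurrence in the split.
    return [(word, "1") for word in value.split()]

def wc_reduce(key, values):
    # key: a word; values: iterator over the counts emitted for it.
    return sum(int(v) for v in values)

def run(splits):
    # "Map phase": each map's output is split into R partitions
    # by a partition function (here hash(word) % R).
    partitions = [defaultdict(list) for _ in range(R)]
    for name, text in splits.items():
        for word, count in wc_map(name, text):
            partitions[hash(word) % R][word].append(count)
    # "Reduce phase": each partition is reduced independently;
    # keys are sorted first, as in the real library.
    output = {}
    for part in partitions:
        for word in sorted(part):
            output[word] = wc_reduce(word, iter(part[word]))
    return output
```

For example, run({"split0": "a b a", "split1": "b c"}) returns {"a": 2, "b": 2, "c": 1}.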
Load balance:
  M + R tasks; ideally M + R is much larger than the number of workers
  challenge: stragglers

Locality:
  the master runs mappers close to one of the 3 replicas of their input

Dryad

Similar goals to MapReduce, but a different design:
  Computations are expressed as a graph
    Vertices are computations
    Edges are communication channels
    Each vertex can have several input and output channels
  Nice use of C++ to make it easy to construct graphs
    See figure 3

Example: what does the graph look like for the word count example?
  Answer: figure 6 (before)
  How is the reduce run in parallel?
    Figure 6 (after), dynamically
    Or change the graph to have R reduce nodes

Implementation:
  Job manager
    keeps an execution record for each vertex
    when all of a vertex's inputs are available, the vertex becomes runnable
    vertices may express preferences
    dynamic graph refinements
  Daemon creates processes to run vertices
  Stage manager
    locality
    replicates stages to avoid the straggler problem
  Channels
    files, TCP pipes, or shared memory

Load balancing:
  Greedy scheduling

Fault tolerance:
  if the job manager fails, the computation fails
  if a vertex computation fails
    restart the vertex with a different version #
    a previous instance of the vertex may run in parallel with new instances

Semantics:
  Assumption: vertices are deterministic
  Each vertex runs one or more times
  Stop when every vertex has completed its execution at least once
  Locality: stage manager

MapReduce versus Dryad:
  Many similarities
  Dryad supports general computation graphs, while MapReduce is a series of
    maps and reduces
    Each vertex can take n inputs, while a map takes one input
    Each vertex can produce n outputs, while a map generates one output
  How would you express the SQL query (see sec 2.1) using MapReduce?

Lab 2:
  fuse.cc, fuse_client.cc, fuse_client.h, and yfs_protocol.h
  trace getattr(), which the open or stat system call will send
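As a toy illustration of Dryad's graph model (not its actual C++ graph-construction API), the shape of the word-count graph with M map vertices feeding R reduce vertices over channel edges might look like this:

```python
# Toy model of a Dryad-style computation graph: vertices are computations,
# edges are channels.  This only mimics the SHAPE of the word-count graph;
# the Vertex/connect names are hypothetical, not Dryad's interface.

class Vertex:
    def __init__(self, name):
        self.name = name
        self.inputs = []   # channels feeding this vertex
        self.outputs = []  # channels this vertex writes

def connect(src, dst):
    # An edge is a channel; in Dryad it could be a file,
    # a TCP pipe, or shared memory.
    channel = (src.name, dst.name)
    src.outputs.append(channel)
    dst.inputs.append(channel)

def word_count_graph(m, r):
    maps = [Vertex(f"map{i}") for i in range(m)]
    reduces = [Vertex(f"reduce{j}") for j in range(r)]
    # Each map vertex sends one partition to every reduce vertex,
    # so each reduce vertex has m input channels.
    for mv in maps:
        for rv in reduces:
            connect(mv, rv)
    return maps, reduces
```

A vertex becomes runnable once all of its input channels are available; with R reduce vertices the reduce stage itself runs in parallel, matching the "R reduce nodes" variant above.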