6.824 Lecture 1: Introduction and lab overview

Introduction
  6.824 is about building distributed systems
    big ideas, useful techniques, implementation, case studies
    lectures on ideas, papers for case studies
    you'll build a real system (yfs, inspired by Frangipani)
  why take the course?
    synthesize many different areas in order to build working systems
    the Internet has made the area much more attractive and timely
    hard/unsolved: not many deployed sophisticated distributed systems
  background: 6.033 ideas and readings
    much more depth here: build a real distributed system

What is a distributed system?
  multiple connected computers cooperate to provide some service
  examples: Internet e-mail, Athena file server, Google MapReduce

What is the point of distributing a system?
  communication, among distant components
    Web
  reliability, from unreliable components
    Argus, BFT, Frangipani
  performance, sum of many components
    aggregate cycles+memory (TreadMarks, Dryad)
    aggregate bandwidth (Coral, Shark)
    aggregate disks (Frangipani)
  isolation, to increase security and failure tolerance
    authentication server, backup server

Challenges
  system design
    how to split functions among computers? (e.g., clients, servers)
    who talks to whom? what info do they exchange?
    today's example: file servers
  performance
    how to divide work? maximize parallelism, minimize interaction
    load balance, avoid bottlenecks (e.g., the network)
  failures
    replication is usually used to cope with failures
    how to tell which replicas are live? or is the network down?
    which replica has the freshest data?
  consistency
    how to keep replicas identical?
    how to sort out many concurrent clients using shared data in a server?
  security
    an adversary may compromise machines or manipulate messages
  network properties
    clusters -- high bandwidth, low latency, high reliability
    wide-area -- low bandwidth, high latency, low reliability
  advice: solutions are complex and hard to get right
    it's easy to make a distributed system slower and less reliable
      than a centralized one
    Lamport's definition: a distributed system is one in which the failure
      of a computer you didn't know existed can render your own unusable
    use a central system if you can
    build a distributed system only if you're forced to

6.824 lecture topics
  infrastructure for building servers and clients: threads+RPC
  distributed programming
  consistency
  fault tolerance
  peer-to-peer/decentralized systems
  security
  case studies
  [continuous thread: system design]

Example: file system (like NFS or AFS)
  lots of client computers
  they all want to see the same file system, so they can cooperate

Simple system design
  one server w/ a disk to store directories and files
  clients send high-level ops, like create(filename), mkdir, &c
  [picture: server, clients, read file, write file, create, etc.]

Topic: implementation
  the server needs to execute multiple operations concurrently
    from multiple clients
    to get good disk and multicore performance
  solution: threads
  server threads may modify shared state (e.g., a file cache) concurrently
    avoid race conditions! (see the mutex/condition-variable sketch below)
  threads often need to wait for each other
    avoid deadlocks!

Topic: consistency
  an operation may update multiple servers, e.g. mv from dir1 to dir2
    will clients see the intermediate state?
    what if something fails partway through?
  multiple copies of each piece of data
    server replicas, caches in clients
    when does a modification appear in all these copies?
  same behavior as a single-machine file system?
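The sketch promised above: a minimal illustration (not lecture or lab code;
the file_cache state and function names are invented) of using a pthreads
mutex to avoid race conditions on shared server state, and a condition
variable so one thread can wait for another:

  #include <pthread.h>
  #include <map>
  #include <string>

  // hypothetical shared state: a cache of file contents
  std::map<std::string, std::string> file_cache;
  pthread_mutex_t cache_mutex = PTHREAD_MUTEX_INITIALIZER;
  pthread_cond_t cache_cond = PTHREAD_COND_INITIALIZER;
  bool cache_ready = false;

  // writer thread: fills the cache, then wakes up waiters
  void cache_insert(const std::string &name, const std::string &data) {
    pthread_mutex_lock(&cache_mutex);   // avoid races on file_cache
    file_cache[name] = data;
    cache_ready = true;
    pthread_cond_broadcast(&cache_cond);
    pthread_mutex_unlock(&cache_mutex);
  }

  // reader thread: waits until the cache has been filled
  std::string cache_lookup(const std::string &name) {
    pthread_mutex_lock(&cache_mutex);
    while (!cache_ready)                // re-check after every wakeup
      pthread_cond_wait(&cache_cond, &cache_mutex);
    std::string data = file_cache[name];
    pthread_mutex_unlock(&cache_mutex);
    return data;
  }

The while loop around pthread_cond_wait is the standard idiom: a woken
thread must re-check the condition, since other threads may have run first.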
Topic: scalability
  what if the load is too high for one server?
  idea: multiple servers
    split the load among the servers
    we hope that 2 servers will provide 2x performance,
      i.e. that performance will scale w/ # of servers
  how to balance the load? by file ID? by user?
  what if one user suddenly generates a lot of load?
  what if some files are much more popular than others?

Topic: fault tolerance
  can I get to my files when some servers are down or the network fails?
  yes: replicate the data on multiple servers
  problem: replica consistency (a deleted file may still be visible in
    another replica)
  problem: failure agreement, especially if the network is partitioned
  opportunity: operate from two "replicas" independently if partitioned
  opportunity: can 2 servers yield 2x availability AND 2x performance?

Topic: security
  threats: corrupt employees, targeted external attacks,
    spammers collecting botnets
  how does the server know that a request is from me?
  how much do I have to trust the system administrators?
  what if the server code has bugs? (one per 1000 lines...)

Topic: system design
  what if we want to provide an Internet-wide file system?
    aggregate servers at many sites into a unified file system
  more failures, longer delays, multiple administrative domains, security
  we want to understand the individual techniques, and how to assemble them

COURSE STRUCTURE

All announcements and info at http://pdos.csail.mit.edu/6.824
  look at the first lab, due at the end of next week

Meetings: 1/2 lectures, 1/2 paper discussions (or lab help)
  research papers on working systems, starting today
  you must read the papers before class
    otherwise it's boring, and you can't pick it up by listening
  in the future, we will post paper questions 24 hours in advance
    hand in your answer on paper in class, one or two paragraphs

Mid-term quiz in class, and a final exam

Labs: build a real cluster file system a la Frangipani
  labs are due on Fridays

Project: extend the lab in any way you like
  alone or in teams of two
  one-page report
  demo in the last class meeting

Yandong Mao is the TA; office hours are on the Web.

LAB: YET ANOTHER FILE SYSTEM (YFS)

The lab is inspired by Frangipani, a scalable distributed file system
(see the paper).
  designed/built at a research lab, not a product
  you will build a simplified version of Frangipani

Frangipani goals
  scalable storage -- add disk servers
  scalable file service -- add file servers
  consistency, much like a single file server
  adaptive load balance across file servers
  tolerates and recovers from server, network, and disk failures

Frangipani design
  diagram: client workstations, Frangipani servers, Petal servers,
    lock servers
  Petal looks like a single huge disk
    interface: put and get (pretty low-level; see the sketch at the end
      of this section)
    replicates blocks/extents
    add Petal servers to increase storage capacity and throughput
  Frangipani file server
    knows about file names, directories, inodes, &c
    uses Petal to store blocks
    all Frangipani servers serve the same file system
      they communicate only via Petal
    add Frangipani servers to handle more client workstations
  lock servers
    Frangipani servers use the lock server to provide consistency
      e.g., when creating a file, lock the directory
    lock servers are replicated
  no security beyond traditional file system security
    intended for a cluster, not the wide area

This is a common architecture
  many clients
  many front ends to provide CPU to process client requests
  a shared back-end that *only* provides storage
  front ends are independent; they communicate indirectly via
    back-end storage
  only the back-end needs to be fault-tolerant
    since there's no real state in the front-end servers, it's no
      problem if they crash
    easier to provide fault tolerance for storage than for general
      computation
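To make the "back-end that only provides storage" concrete, here is a
hedged sketch of a Petal-like extent interface. The lecture says only
"put and get", so the class name extent_store and the exact signatures
below are invented for illustration; a real Petal also replicates blocks
across servers, which this in-memory sketch omits:

  #include <map>
  #include <string>

  // Hypothetical Petal-like extent store: a flat map from extent IDs
  // to uninterpreted byte strings. Frangipani-style file servers layer
  // names, directories, and inodes on top of this.
  class extent_store {
   public:
    // store (or overwrite) the bytes for extent id
    bool put(unsigned long long id, const std::string &buf) {
      extents[id] = buf;
      return true;
    }
    // fetch the bytes for extent id; false if it doesn't exist
    bool get(unsigned long long id, std::string &buf) {
      std::map<unsigned long long, std::string>::iterator it =
          extents.find(id);
      if (it == extents.end())
        return false;
      buf = it->second;
      return true;
    }
   private:
    std::map<unsigned long long, std::string> extents;
  };

Lab 2's in-memory extent_server has roughly this flavor: yfs_client code
in the front end layers file-system structure over a flat put/get store.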
When might Frangipani *not* scale well?
  i.e. when might we not be able to get more performance by adding servers?
  if the shared LAN is a bottleneck
  if all clients are modifying the same set of files
  if there's a single demanding user

Frangipani scales well in the common case of independent users
  each user's Frangipani server caches that user's files
  Frangipani servers don't need to wait for each other for locks
  so each added server doesn't slow down the other servers => scalable

YFS
  same basic structure as Frangipani, but a single extent server
    i.e. you don't have to implement Petal
  diagram: many clients, one extent_server, multiple lock_servers
  client diagram: app, kernel FUSE module, fuse.cc, yfs_client
  each program is written in C++ with pthreads, using our own RPC library
  the next two lectures cover the infrastructure in detail

Labs: build YFS incrementally
  L1: simple lock server
    threads, locks, condition variables
    then at-most-once RPC
  L2: extent_server, yfs_client
    in-memory store for extent_server
    basic yfs_client: create, lookup, readdir, write, read
  L3: yfs_client + extent_server + lock_server
    mkdir, unlink, and sharing/locks
  L4: caching lock_server
    add revocation to the lock server
  L5: caching extent_server
    consistency using the lock_server
  L6: paxos library
    agreement protocol (who is the master lock server?)
  L7: fault-tolerant lock server
    replicated state machine using paxos
  L8: your choice
    e.g. a distributed extent server for performance or fault tolerance

Lab 1: simple lock server
  what does a lock server do?
    handles acquire and release RPCs from clients
    supports multiple locks; each lock has a number
    acquire(num):
      if any client holds lock num, wait for release(num)
      mark lock num as held
      then the server sends an "OK" reply to the client
    release(num):
      mark lock num as not held
      wake up any waiting acquires
  we supply you with an RPC library and some demo client/server code
    you have to add the locking RPCs -- acquire and release
  RPC library
    greatly simplifies client/server communication over the net
    overall structure:
      client app
      client stubs
      rpcc (one for each server the client talks to)
      RPC library
      ... network ...
      RPC library
      rpcs (just one)
      server handlers
  lock_demo.cc
    new lock_client -- creates the client stubs
    lc->stat(1) -- looks like an ordinary C++ method call
  lock_client.cc
    creates an rpcc -- tells it what server to connect (bind) to
    each stub method: cl->call(...)
      part of the RPC library
      sends the procedure # and the arguments to the server
      waits for the reply
    you need to fill in acquire() and release(); they call cl->call()
  lock_smain.cc
    creates an rpcs
    registers lock_server's handler functions
    you will have to register two new handlers
  lock_server.cc
    handlers
    you will have to add two new handlers
    a table of locks (use the C++ STL map class)
      for each one: held or not held
    (a minimal sketch follows)
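Putting the pieces above together, a minimal sketch of the lock_server
state and handlers. This is illustrative, not the official starter code;
in the real lab the handlers are invoked through the RPC library and
return a status that is sent back to the client:

  #include <pthread.h>
  #include <map>

  // minimal lock_server sketch: a table of locks, each held or not held
  class lock_server {
   public:
    lock_server() {
      pthread_mutex_init(&m, NULL);
      pthread_cond_init(&c, NULL);
    }
    // acquire(num): wait until lock num is free, then mark it held
    int acquire(int num) {
      pthread_mutex_lock(&m);
      while (held[num])            // someone holds it: wait for release
        pthread_cond_wait(&c, &m);
      held[num] = true;
      pthread_mutex_unlock(&m);
      return 0;                    // server then sends "OK" to the client
    }
    // release(num): mark lock num free and wake up waiting acquires
    int release(int num) {
      pthread_mutex_lock(&m);
      held[num] = false;
      pthread_cond_broadcast(&c);
      pthread_mutex_unlock(&m);
      return 0;
    }
   private:
    std::map<int, bool> held;      // lock number -> held?
    pthread_mutex_t m;
    pthread_cond_t c;
  };

Note the single mutex protecting the table and the broadcast on release:
every waiting acquire wakes up, re-checks held[num], and all but one go
back to sleep.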