6.824 Lecture 16: Peer-to-peer systems: Content distribution

Last lecture was about finding a computer for an object.
Now focus on how to serve the objects (i.e., the data).

There are a few different instances of the content distribution problem:
- A home user/company has a popular web page.
  We want to avoid being limited by the home user's link.
  (Coral, Akamai, etc.)
- A home user wants a faster Web for all URLs.
  Cooperative Web caching (CoDeeN, etc.).
  Coral primarily helps for coralized URLs, not for all URLs.
  This case assumes that the user configures the browser to contact a
  Web proxy instead, or that all clicks are observed by a proxy close
  to the browser.
- A home user wants to retrieve a big file.
  Coral doesn't help, but, for example, BitTorrent can download the
  file from several servers in parallel. (More next week.)

What's the high-level problem that Coral solves?
  You have cooperating caching proxies scattered over the Internet.
  Direct the browser to the nearest cached copy.
  If not cached nearby, fetch from the real server into a nearby cache.
  Doesn't work for dynamic content.

Why is this helpful?
  Might reduce server load.
  Might reduce delay visible to the user (for servers participating in Coral).
  Doesn't Akamai already solve this problem?

What are the constraints that make it hard?
  No support from the browser.
  No support from the final server.
  Popularity of Web pages can change quickly.

What tools are available?
  We only get to see DNS and HTTP requests.
  We can ask the server to change URLs.

Idea 1: "Coralize" names like www.cnn.com.nyucd.edu
  What can we achieve with just a bunch of DNS servers for nyucd.edu?
  Browser probably chooses a nearby server from a list of DNS servers.
  That DNS server can send the browser an A record for one of the proxies?
  But which one?

Idea 2: if the DNS server is close (low ping time) to the browser, then
  the DNS server can return any proxy close to the DNS server.
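Ideas 1 and 2 can be sketched as follows. This is a minimal illustration, not Coral's actual code: the suffix handling follows the coralized-name scheme above, while the proxy table and RTT values are made up for the example.

```python
# Sketch of a Coral-style DNS server for nyucd.edu (hypothetical data):
# 1) strip the coralizing suffix to recover the origin name (Idea 1);
# 2) answer with the proxy closest to this DNS server, as a stand-in
#    for "any proxy close to the DNS server" (Idea 2).

CORAL_SUFFIX = ".nyucd.edu"

# Hypothetical table: proxy IP -> RTT in ms, as pinged from this DNS server.
proxy_rtt_ms = {
    "10.0.0.1": 12.0,
    "10.0.0.2": 3.5,
    "10.0.0.3": 40.0,
}

def decoralize(name):
    """www.cnn.com.nyucd.edu -> www.cnn.com"""
    if not name.endswith(CORAL_SUFFIX):
        raise ValueError("not a coralized name: " + name)
    return name[:-len(CORAL_SUFFIX)]

def pick_proxy():
    """Return the A record to hand back: the proxy nearest this DNS server."""
    return min(proxy_rtt_ms, key=proxy_rtt_ms.get)

print(decoralize("www.cnn.com.nyucd.edu"))  # www.cnn.com
print(pick_proxy())                         # 10.0.0.2
```

The limitation this exposes is exactly the one the notes turn to next: "close to the DNS server" only approximates "close to the browser", which motivates mapping the browser's own network to a proxy.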
Idea 3: build a database mapping IP net numbers to nearby proxies:
  each proxy registers its net number, then the DNS server looks up the
  browser's IP net number to find a proxy.
  What about browsers not on the same net as a proxy?
    There might still be a nearby proxy.
    So we'd want to somehow cause the browser to use a nearby Coral DNS server.

How does Coral cause the browser to use a nearby Coral DNS server?

Idea 4: L2.L1.L0 trick
  have one chance per hierarchy level
  nodes(level, count, target) to find a good "next" DNS server
  traceroute and hints in the DHT to implement nodes()

How does Coral find a nearby cached copy of a URL?

What does Coral store in the DHT?
  router IP addresses (found w/ traceroute) -> nearby proxy/DNS server
  24-bit IP prefixes -> nearby proxy/DNS server
  URL -> proxy

Does Coral handle flash crowds (very popular URLs) well?
  What might go wrong?
    Every proxy fetches the URL directly from the server.
    DHT hot-spots.
  What does Coral do about it?

What DHT techniques did they use?
  Hierarchy for locality.
    Why don't they just cache along the path?
    How do they choose clusters?
  How do they avoid overloading any node in the DHT?
    Why are short TTLs nice?
    Why are long TTLs nice?

---- No bittorrent this year -----

6.824 Lecture 16b: Peer-to-peer systems: BitTorrent

This lecture's case study: BitTorrent

Usages: static bulk content
  Songs and videos
  Linux distributions
  ...
Usage model: cooperative
  user downloads file from someone using a simple user interface
  while downloading, BitTorrent serves the file to others as well
  BitTorrent keeps running for a little while after the download completes

Goal: get the file out to many users quickly
  Encourage everyone to upload the file

Challenges:
  Tracking which peer has what
  Handling high churn rates
  Download rate proportional to upload rate

Approach:
  Publisher puts a .torrent file on a Web server (e.g., suprnova.org)
    URL of tracker
    file name, length, secure hash
  Seed posts the URL for the .torrent with the tracker
    Seed must have a complete copy of the file
    Every peer that is online and has a complete copy of the file becomes a seed
  Peer asks the tracker for a list of peers to download from
    Tracker returns a list with a random selection of peers
  Peers contact peers to learn what parts of the file they have, etc.
  Download from other peers

Transport
  Peers pipeline requests on top of TCP
    divide a file piece further into 16-Kbyte subpieces
    keep typically 5 requests in flight

Which pieces to download
  strict (in order)?
  rarest first?
    ensures that every piece is widely available
    also helps the seed bootstrap the swarm rapidly
  random?
    avoids overloading the seed when starting a download
  parallel download of the same pieces?
    avoids waiting on the slowest peer
  final algorithm: random for the first piece, then rarest-first,
    parallel for the last pieces

Fairness
  Goal: avoid peers that don't upload
  Approach: peers maximize download rate, but an uploader can choke
    peers reciprocate, uploading to peers that upload to them
  Which peers to unchoke?
    the ones that provide the highest download rate
    pick the best 4
    rate is calculated as a 20s average
    best are recalculated every 10s
    every 30s, unchoke an unused peer to see if it gives a good download rate
  What if you are choked by your peers?
    choke those peers if you don't receive a piece in 1 min
  What if the download is finished?
    use upload rate as the metric

How well does BitTorrent work in practice?
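The two core policies above can be sketched in a few lines. This is a simplified illustration with made-up function names and data, not the real client: real rates are 20-second averages refreshed every 10s, and the real client also does the optimistic unchoke every 30s.

```python
# Sketch of two BitTorrent policies (simplified, hypothetical names):
# rarest-first piece selection, and choosing the best 4 peers to unchoke
# by measured download rate.
from collections import Counter

def rarest_first(have, peer_bitfields):
    """Pick the piece we lack that the fewest peers have.
    have: set of piece indices we already hold.
    peer_bitfields: one set per peer of the pieces that peer has."""
    counts = Counter()
    for bf in peer_bitfields:
        for piece in bf - have:
            counts[piece] += 1
    if not counts:
        return None  # nothing left to fetch
    # fewest copies first; break ties by lowest index for determinism
    return min(counts, key=lambda p: (counts[p], p))

def pick_unchoked(download_rate, n=4):
    """Unchoke the n peers currently giving us the highest download rate."""
    return sorted(download_rate, key=download_rate.get, reverse=True)[:n]

peers = [{0, 1, 2}, {1, 2}, {2, 3}]
print(rarest_first({0, 2}, peers))  # 3: only one peer has it
print(pick_unchoked({"a": 50, "b": 10, "c": 80, "d": 5, "e": 30}))
# ['c', 'a', 'e', 'b']
```

Note how rarest-first directly serves the availability goal: by always replicating the scarcest piece, the swarm keeps every piece widely held even as peers churn.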
Anecdotally it works great---many happy users.
Difficult to answer precisely, but let's check the measurement paper
  (figures 1, 2, 3, 5, and 6).

Efficiency
  How long does it take for a high-upload-capacity peer to find other
  high-performance peers?
  If there are a few high-upload peers and many low-upload peers, does
  BitTorrent perform at its best?
  (Best = the total download time is at its lowest.)

Fairness
  Do high-upload peers receive download rates proportional to their
  upload rate?
  http://www.cs.washington.edu/homes/piatek/papers/BitTyrant.pdf