6.824 Lecture 7: Peer-to-peer systems: Content distribution

Last lecture was about finding a computer for an object; more on that in this
lecture and the next, but the focus here is on how to serve the objects
(i.e., the data). Objects can be large, and therefore consume network
bandwidth when served. Network bandwidth costs money, and many computers have
insufficient network bandwidth to serve popular objects, or many objects.

Goal: aggregate network bandwidth from many computers to serve data. For
popular objects, make copies on multiple computers. For many, but less
popular, objects, partition them across the available computers. Once there
are multiple copies, it would be good to get an object from a nearby server,
because that reduces latency.

This approach works well for static content (images in Web pages, software
distributions, videos, songs, etc.); not so well for dynamic Web pages, for
example a page generated from a user query or, worse, one that involves the
user modifying data, which introduces consistency issues once there are
several copies.

A challenge is that we don't know in advance what is popular, and popularity
can change quickly. For example, a Web page can become very popular in a
short period of time and experience a flash crowd, overloading the source
and, in the worst case, making it impossible for anybody to see the page.

There are a few different instances of the data distribution problem,
depending on which link is the limiting factor, which computers are available
to aggregate, etc. The following instances show up in practice:
- A home user/company has a popular web page. We want to avoid being limited
  by the home user's link. (Coral, Akamai, etc.)
- A home user wants a faster Web for all URLs: cooperative Web caching
  (CoDeeN, etc.). Coral primarily helps for Coralized URLs, not for all URLs.
  This case assumes that the user configures the browser to contact a Web
  proxy, or that all clicks are observed by a proxy close to the browser.
- A home user wants to share a popular song or video, but isn't always online
  and has limited bandwidth. Depending on which machines are being
  aggregated, you again get two different cases: (1) if the machines are home
  PCs, then BitTorrent is useful; (2) if the machines are powerful and
  well-connected, then Usenet news might be good (or BitTorrent).
This lecture we look at cases 1 and 2 by studying Coral, and next at case 3
by studying BitTorrent.

What's the high-level problem that Coral solves?
  You have cooperating caching proxies scattered over the Internet.
  Direct the browser to the nearest cached copy.
  If not cached nearby, fetch from the real server into a nearby cache.

Why is this helpful?
  Might reduce server load.
  Might reduce delay visible to the user.
  Doesn't Akamai already solve this problem?

What are the constraints that make it hard?
  No support from the browser.
  No support from the final server.

What tools are available?
  We only get to see DNS and HTTP requests.
  Assuming "Coralized" names like www.cnn.com.nyucd.edu (see the
  name-mangling sketch below).

What can we achieve with just a bunch of DNS servers for nyucd.edu?
  Browser probably chooses a nearby server from a list of DNS servers.
  That DNS server can send the browser an A record for one of the proxies.
  But which one?
  Idea 1: if the DNS server is close (low ping time) to the browser, then the
  DNS server can return any proxy close to the DNS server. So we'd want to
  somehow cause the browser to use a nearby Coral DNS server.
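As a concrete illustration of the "Coralized" name mangling, a minimal sketch
in Go; this is not Coral's code, and only the nyucd.edu suffix comes from the
example above:

  package main

  import (
      "fmt"
      "strings"
  )

  const coralSuffix = ".nyucd.edu" // Coral's DNS domain, from the example

  // Coralize rewrites an origin hostname so that DNS queries for it are
  // answered by Coral's DNS servers instead of the origin's.
  func Coralize(host string) string {
      return host + coralSuffix
  }

  // Decoralize recovers the origin hostname from a Coralized name, so a
  // proxy knows which real server to fetch from on a cache miss.
  func Decoralize(host string) (string, bool) {
      if !strings.HasSuffix(host, coralSuffix) {
          return host, false
      }
      return strings.TrimSuffix(host, coralSuffix), true
  }

  func main() {
      c := Coralize("www.cnn.com")
      fmt.Println(c) // www.cnn.com.nyucd.edu
      orig, _ := Decoralize(c)
      fmt.Println(orig) // www.cnn.com
  }

The point is that no browser or server changes are needed: anyone who embeds
the Coralized name in a link opts that URL into Coral.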
Idea 2: build a database mapping IP network numbers to nearby proxies: each
proxy registers its network number, and the DNS server looks up the browser's
network number to find a proxy (a sketch of such a prefix table appears at
the end of these notes). What about browsers not on the same net as any
proxy? There might still be a nearby proxy.

How does Coral cause the browser to use a nearby Coral DNS server?
  The L2.L1.L0 trick gives one chance per hierarchy level.
  nodes(level, count, target) finds a good "next" DNS server (sketch below).
  traceroute and hints stored in the DHT implement nodes().

How does Coral find a nearby cached copy of a URL? (sketch below)
  What does Coral store in the DHT?
    router IP addresses (found w/ traceroute) -> nearby proxy
    24-bit IP prefixes -> nearby proxy
    URL -> proxy
  If the browser is at MIT and the nearest proxy is at BU, will we find it?
  5 hops toward www.cnn.com take us to BBN Planet.

Does Coral handle flash crowds (very popular URLs) well?
  What might go wrong?
    Every proxy fetches the URL directly from the server.
    DHT hot-spots (sketch below).
  What does Coral do about it?

What DHT techniques did they use?
  Hierarchy for locality.
  Why don't they just cache along the path?
  How do they choose clusters?
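To make Idea 2 concrete, a minimal sketch in Go of a table mapping 24-bit IP
prefixes to nearby proxies; the proxy names are hypothetical, and the real
table lives in Coral's DHT rather than in one map:

  package main

  import (
      "fmt"
      "net"
  )

  // prefix24 masks an IPv4 address down to its 24-bit network number,
  // e.g. 18.26.4.9 -> "18.26.4.0/24".
  func prefix24(ip net.IP) string {
      v4 := ip.To4()
      if v4 == nil {
          return ""
      }
      return v4.Mask(net.CIDRMask(24, 32)).String() + "/24"
  }

  func main() {
      // Each proxy registers under its own network's prefix.
      nearby := map[string]string{
          prefix24(net.ParseIP("18.26.4.20")):   "proxy-mit.example.org", // hypothetical
          prefix24(net.ParseIP("128.197.11.5")): "proxy-bu.example.org",  // hypothetical
      }

      // The DNS server masks the browser's address and looks it up.
      browser := net.ParseIP("18.26.4.9")
      if p, ok := nearby[prefix24(browser)]; ok {
          fmt.Println("nearby proxy:", p)
      } else {
          fmt.Println("no proxy on the browser's net; fall back to other hints")
      }
  }

A miss in this table doesn't mean no proxy is nearby, which is why Coral also
stores router addresses found with traceroute as a second kind of hint.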
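The one-chance-per-level idea can be sketched as follows; Cluster here is a
hypothetical stand-in for one level of Coral's hierarchy of DHTs (level 2 =
most local, level 0 = global):

  package main

  import "fmt"

  // Cluster stands in for one level of Coral's hierarchy:
  // level 2 = most local (low-RTT members), level 0 = global.
  type Cluster map[string]string

  func (c Cluster) Get(key string) (string, bool) {
      v, ok := c[key]
      return v, ok
  }

  // lookup gives each level one chance, most local first, mirroring
  // the L2.L1.L0 DNS trick above.
  func lookup(levels [3]Cluster, key string) (string, bool) {
      for level := 2; level >= 0; level-- {
          if v, ok := levels[level].Get(key); ok {
              return v, true // nearby hit; no need to widen the search
          }
      }
      return "", false
  }

  func main() {
      levels := [3]Cluster{
          0: {"urlX": "far-proxy.example.org"},  // global
          1: {},                                 // regional
          2: {"urlX": "near-proxy.example.org"}, // local
      }
      v, _ := lookup(levels, "urlX")
      fmt.Println(v) // near-proxy.example.org: the local copy wins
  }

The DNS side works the same way: each component of the L2.L1.L0 name gives
the resolver one chance to be steered to a more local Coral DNS server
before falling back to a wider level.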
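Putting the URL -> proxy mapping to use, a hedged sketch of what a proxy
might do on a request: check the DHT for other proxies caching the URL, fetch
from one of them if possible, otherwise fetch from the real server, and then
advertise its own copy. The dht map and fetchFrom are stand-ins for Coral's
distributed index and real HTTP fetches:

  package main

  import "fmt"

  // A toy DHT: URL -> list of proxies caching it. Real Coral is a
  // distributed "sloppy" hash table; this map only shows the flow.
  var dht = map[string][]string{}

  // fetchFrom stands in for an HTTP GET to a proxy or the origin.
  func fetchFrom(host, url string) string {
      return "contents of " + url + " via " + host
  }

  // serve handles a browser request arriving at proxy `me`.
  func serve(me, url string) string {
      var body string
      if proxies := dht[url]; len(proxies) > 0 {
          // Some other proxy has it: fetch from there, sparing the origin.
          body = fetchFrom(proxies[0], url)
      } else {
          // Nobody has it cached: fetch from the real server once.
          body = fetchFrom("www.cnn.com", url) // origin, from the example
      }
      dht[url] = append(dht[url], me) // advertise our new copy
      return body
  }

  func main() {
      fmt.Println(serve("proxy-bu.example.org", "http://www.cnn.com/x"))
      fmt.Println(serve("proxy-mit.example.org", "http://www.cnn.com/x")) // hits BU's copy
  }

Because every proxy that serves the page also advertises itself, later
requests fan out over the copies instead of converging on the origin.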
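On the hot-spot problem: a popular URL makes every proxy put and get the same
key, overloading the DHT nodes closest to it. A rough sketch of the "sloppy"
idea, in which a put stops early on the lookup path once nodes nearer the key
look loaded; the load metric and threshold here are made up, and this toy
version only shows the shape of Coral's actual insertion rule:

  package main

  import "fmt"

  // Node is one hop on the DHT lookup path toward a key's root
  // (the node whose ID is closest to the key).
  type Node struct {
      name    string
      load    int // stores absorbed for this key (made-up metric)
      entries map[string][]string
  }

  const maxLoad = 4 // assumption: past this, a node counts as loaded

  // sloppyPut walks the path toward the root, but stores at the current
  // node whenever the next hop is already loaded, so pointers for a hot
  // key spread over many nodes instead of piling up at the root.
  func sloppyPut(path []*Node, key, value string) *Node {
      for i, n := range path {
          atRoot := i == len(path)-1
          if atRoot || path[i+1].load > maxLoad {
              n.entries[key] = append(n.entries[key], value)
              n.load++
              return n
          }
      }
      return nil
  }

  func main() {
      mk := func(name string) *Node {
          return &Node{name: name, entries: map[string][]string{}}
      }
      path := []*Node{mk("far"), mk("mid"), mk("root")}
      for i := 0; i < 12; i++ {
          n := sloppyPut(path, "hot-url", fmt.Sprintf("proxy%d", i))
          fmt.Println("stored at", n.name) // root x5, then mid x5, then far
      }
  }

Gets can likewise stop at the first node holding a value for the key, so for
hot keys both puts and gets are absorbed before they reach the root.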