12/7 --- Web caching

- Let's study a special, but important, case of replication: caching of Web
  pages in the Internet.

- Assumptions:
  - Web pages are read-only (excludes database transactions, queries, etc.)
  - Web pages are static (excludes all dynamic pages)
  - Consistency requirements are lax (doesn't work for data with many updates)

- Why do we want to cache web pages?
  1. reduce load on the server
  2. reduce network congestion
  3. reduce network bandwidth consumption
  4. improve fetch latency (avoid long round trips)
  5. improve availability

- To understand the advantages and placement of caches, draw a picture of the
  Internet:
  - a collection of autonomous systems (ASes) connecting browsers and servers
    (clusters)
  - what are example ASes?  AT&T, Sprint, UUnet, etc.
  - each AS owns part of the IP address space
  - how are packets routed from client to server?
    - through a series of ASes that peer
    - peering relationships are political/economic/strategic deals
    - BGP is used to set up the routing tables between ASes

- What are the potential bottlenecks in the Internet?
  1. client to ISP (modem line)
  2. ISP to AS (if different)
  3. internal AS
  4. peering points
  5. AS to server (the server is often at a hosting service)

- Where can we place caches?
  1. right before the server cluster, in the customer rack (doesn't help with
     bottlenecks 2, 3, 4)
  2. right before the server cluster, but within the AS
     - reduces load on the servers
     - reduces load on the network link between the AS and the server
     - might improve availability
  3. cluster of caches on the edge of an AS
     - reduces internal load on the network
     - might improve client fetch latency substantially
     - improves availability
     - helps only clients whose packets go over that AS
     - doesn't deal with spikes in load
  4. cooperative cache across multiple ASes
     - doesn't help with the client-to-AS bottleneck
     - potential for dealing with spikes
  5. customer premises
     - improves everything for that customer and for the customer's AS

- How do we get a request to the cache?
  1. manual proxy settings (helps with bottleneck 5)
  2. transparent proxy (1, 3, 5)
  3. have a single "marked" IP address for a server and catch it at the edge (3)
  4. advertise "marked" names (DNS or URL) and redirect
     - redirection can be based on load, content, ...
  5. modify the browser, advertise multiple caches, and let the client select
     (3 and 4)
     - the client might measure

- How do we locate a cached item in a cluster of caches?
  0. it is always local (caches in the cluster don't cooperate)
     - duplicates cached data
  1. primary plus multicast
     - other caches are secondary to the primary
     - perhaps organized through a tree
  2. directory-based schemes (centralized or distributed)
     - perhaps a distributed scheme using a tree
     - must cope with a changing number of servers
  3. hashing + direct access
     - with a changing number of servers, changing the hash function can cause
       all objects to end up in different caches
  4. consistent hashing (the topic of the paper)

- Consistent hashing
  - assumptions:
    - many caches, some down and some up, all with equal access cost
    - many clients with different views of which caches are live
    - a hash function that takes a URL and outputs a number in 0 ... M
  - URLs and caches are mapped to points on a circle 0 ... M
  - lookup: the first cache point that succeeds hash(U) on the circle holds
    the document
  - add cache: move only the objects that are "closest" on the circle to the
    new cache
  - note: map each cache to multiple points on the circle for a uniform
    distribution of URLs to caches
  - result: each URL is in a small number of caches
  - ideal implementation: the hash runs at the client
  - practical implementation: virtual caches through DNS; for example, in
    a456.proxycache3.com, a456 is mapped to an IP address
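The consistent-hashing scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the cache names, the choice of MD5 as the hash function, and the replica count of 100 points per cache are all assumptions made here for the example. It maps each cache to many points on the circle, finds the first cache point that succeeds hash(URL), and shows that adding a cache moves only the URLs "closest" to the new cache's points.

```python
import hashlib
from bisect import bisect, insort

M = 2 ** 32  # size of the circle 0 ... M (an arbitrary choice for the sketch)

def point(key: str) -> int:
    """Hash a string to a point on the circle 0 ... M (MD5 is an assumption)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % M

class ConsistentHash:
    def __init__(self, caches=(), replicas=100):
        # Map each cache to `replicas` points on the circle so that
        # URLs are distributed roughly uniformly across caches.
        self.replicas = replicas
        self.points = []  # sorted list of (point, cache) pairs
        for c in caches:
            self.add(c)

    def add(self, cache):
        for i in range(self.replicas):
            insort(self.points, (point(f"{cache}#{i}"), cache))

    def remove(self, cache):
        self.points = [(p, c) for (p, c) in self.points if c != cache]

    def lookup(self, url):
        # The first cache point that succeeds hash(url) on the circle
        # (wrapping around past M) is responsible for the URL.
        i = bisect(self.points, (point(url), ""))
        return self.points[i % len(self.points)][1]

# Hypothetical cache names and URLs, just to exercise the ring.
ring = ConsistentHash(["cacheA", "cacheB", "cacheC"])
urls = [f"http://example.com/page{i}" for i in range(1000)]
before = {u: ring.lookup(u) for u in urls}

ring.add("cacheD")
moved = sum(1 for u in urls if ring.lookup(u) != before[u])
# Only URLs whose points now fall "closest" to cacheD move; with four
# equal caches we expect roughly a quarter of them, not all of them.
```

Contrast this with plain `hash(url) % n_caches`: there, growing `n_caches` from 3 to 4 changes the modulus, so almost every URL lands in a different cache. Removing `cacheD` again restores exactly the original assignment, since the other caches' points never moved.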