6.824 Lecture 20: Distributed Hash Tables / Chord

Prior focus has been on traditional distributed systems
  e.g. client/server style
  Server in machine room: well maintained, centrally located, perhaps replicated
  Examples: lock and extent server in YFS

Now: Internet-scale systems
  Harness 1000s or millions of machines
  Aggregate machines from many owners
  Huge capacity
  Maybe high availability via massive replication
  No central components to fail, go out of business, &c
  e.g. BitTorrent, Skype

Can one design general-purpose tools to organize big distributed systems?
  One proposal: Distributed Hash Tables (DHTs)
  Focus is on finding information
    lookup(key) -> IP address of responsible server
  Often under a storage layer:
    put(key, value)
    get(key) -> value
  There are many DHT schemes
    Mostly used in research projects
    Sometimes used in real systems, e.g. BitTorrent trackers

Potential DHT applications:
  publishing: a DHT key names specific content
    key *not* tied to a specific server (unlike URLs)
  file system: use the DHT as a sort of distributed disk drive
    keys are like block numbers
    YFS's extent server could be like this
  location tracking
    keys are e.g. cell phone numbers or chat usernames
    a value is a phone's current location

What structure could be used to organize nodes?
  Central lookup service, maybe replicated
    example: Napster many years ago
    single point of failure, performance bottleneck, who runs it?
  Hierarchy of lookup servers, like DNS
    and hierarchical naming, e.g. www.mit.edu
    who is in charge of the root?
    the root may have to carry much of the load
    names have location baked in
  Could we have flat names and a decentralized lookup service?
    flat names give apps full control over naming
    decentralization maybe avoids overload, and avoids forcing someone to be in charge

Problems
  Fast lookup w/ flat names (no location hints in names)
  Fast lookup w/ many nodes
  Churn (join/fail rate; servers not well maintained)
  Load balance (some data more popular/larger than others)
  Data availability (servers not well maintained)
  Security (can't trust all participants)

Basic idea
  Two layers: routing (lookup) and data storage
  The routing layer handles naming and arranging nodes and then finding them
  The storage layer handles actually putting and maintaining data

What does a complete algorithm have to do?
  1. Define IDs, and the document-ID-to-node-ID assignment
  2. Define per-node routing table contents
  3. Lookup algorithm that uses the routing tables
  4. Join procedure to reflect new nodes in the tables
  5. Failure recovery
  6. Move data around when nodes join
  7. Make new replicas when nodes fail

Typical approach:
  Give each node a unique ID
  Have a rule for assigning keys to nodes based on node/key ID
    e.g. key X goes on the node with the nearest ID to X
  Now how, given X, do we find that node?
  Arrange nodes in an ID-space
    i.e. use IDs as coordinates
    Build a global sense of direction
    Examples: 1D line, 2D square, binary tree, hypercube, or ID circle
  Build routing tables to allow ID-space navigation
    Each node knows about its ID-space neighbors
      i.e. knows the neighbors' IDs and IP addresses
    Perhaps each node also knows a few farther-away nodes
      To move long distances quickly

The "Chord" peer-to-peer lookup system
  An example system of this type

ID-space topology
  Ring: all IDs are 160-bit numbers, viewed in a ring
  Everyone agrees on how the ring is divided between nodes
    Just based on ID bits

Assignment of key IDs to node IDs?
  Key stored on the first node whose ID is equal to or greater than the key ID
  Closeness is defined as the "clockwise distance"
  If node and key IDs are uniform, we get reasonable load balance
  Node IDs can be assigned, chosen randomly, derived from a SHA-1 hash of the IP address, ...
  Key IDs can be derived from the data, or chosen by the user
  Example: 5-bit ID space (0..31)
    Nodes: 2 9 12 20 25
  (the assignment rule is sketched just below)
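Here is a minimal sketch (in Python, not from the original notes) of that assignment rule on the 5-bit example ring: a key lives on the first node clockwise whose ID is equal to or greater than the key ID, wrapping around past the top. The helper name successor_node is invented for illustration.

  # Illustrative sketch of Chord's key-to-node assignment on a 5-bit ring.
  # successor_node() is a made-up helper name, not part of the Chord API.

  RING_BITS = 5
  RING = 1 << RING_BITS          # IDs live in [0, 32)

  def successor_node(key_id, node_ids):
      """Return the ID of the node responsible for key_id: the first node
      clockwise whose ID is >= key_id, wrapping around the ring."""
      candidates = sorted(node_ids)
      for n in candidates:
          if n >= key_id % RING:
              return n
      return candidates[0]       # wrapped past the highest node ID

  nodes = [2, 9, 12, 20, 25]               # the lecture's example ring
  assert successor_node(13, nodes) == 20   # 13 falls between 12 and 20
  assert successor_node(26, nodes) == 2    # wraps around past 25
  assert successor_node(9, nodes) == 9     # a key equal to a node ID stays there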
Routing?
  A query is at some node
  That node needs to forward the query to a node "closer" to the key
  Simplest system: either you are the "closest" or your neighbor is closer
    Hand off queries in a clockwise direction until done
    Only state necessary is the "successor":
      n.find_successor(k):
        if k in (n, successor]:
          return successor
        else:
          return successor.find_successor(k)
  Slow but steady; how can we make this faster?
    This looks like a linked list: O(n)
    Can we make it more like a binary search?
    We need to be able to halve the distance at each step

Finger table routing:
  Keep track of nodes exponentially further away:
    n.finger[i] = successor of n + 2^i
    Many of these entries will point to the same node; in a full system expect O(lg N) distinct fingers
      n.find_successor(k):
        if k in (n, successor]:
          return successor
        else:
          n' = closest_preceding(k)
          return n'.find_successor(k)
  Node 2's finger table looks like this (offset: finger):
    1: 9
    2: 9
    4: 9
    8: 12
    16: 20
  Example: 2.lookup(13)
    2.closest_preceding(13) is 12, so 2 forwards to 12
    13 is in (12, 20], so 12 returns 20
  There's a complete tree rooted at every node
    Starts at that node's row 0
    Threaded through other nodes' row 1, &c
    Every node acts as a root, so there's no root hotspot
    This is *better* than simply arranging the nodes in one tree

What if finger tables are wrong?
  Too far along the ring? e.g. node 2's "8" finger points to 20, not 12
    2.lookup(13): forwards to 9
    9's successor is 12, so 9 forwards to 12
    12's successor is 20, so 12 returns 20
  Too close? e.g. node 2's "8" finger points to 9, not 12
    2.lookup(13) forwards to 9
    same as before, via successor pointers
  Note that correctness depends on successors, not finger pointers
    fingers are only about speed

How does a new node acquire correct tables?
  General approach:
    Assume the system starts out w/ correct routing tables
    Use the routing tables to help the new node find information
    Add the new node in a way that maintains correctness
  The new node issues a lookup for its own ID via any existing node
    This finds the new node's successor
    Ask that successor for its finger table
  At this point the new node can forward queries correctly:
    it tweaks its own finger table as necessary
  Does routing *to* us now work?
    If the new node doesn't do anything,
      queries will go to where they would have gone before we joined,
      i.e. to the existing node numerically closest to us
    So, for correctness, we need to let people know that we are here
    Only need to ensure successor pointers are correct
  Straw man: just tell our predecessor to use us as its successor
    Why doesn't this work? Concurrent joins!
    Old ring: 40 48
    44 and 46 join at the same time
      both set their successor to 48
    44 tells 40 to use 44 as successor
      maybe 44 accepts some put()s
    46 tells 40 to use 46 as successor
    now 44 and the data on it can't be found!
  Chord join algorithm: Figure 7
    Each node keeps track of its current predecessor
    When you join, tell your successor that its predecessor has changed
    Periodically ask your successor who its predecessor is:
      If that node is closer to you, set your successor to that node
    Does this fix concurrent joins?
      E.g., the 40 44 46 48 example
      44 and 46 may sort themselves out but still not be part of the ring
      finally 40 asks 48 what 48's predecessor is
      48 says "46", so 40 sets its successor to 46
      the next round, 40 learns about 44 the same way: it walks backwards to 44
    Everyone must also continue to update their finger tables:
      Periodically look up your n + 2^i-th key
  (a single-process sketch of join + stabilization follows)
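A minimal single-process sketch (Python, not the paper's code) of the join/stabilize idea above: successor-only routing for correctness, plus periodic stabilization that lets joining nodes stitch themselves into the ring. The Node class, the between() helper, and the in-memory setup are illustrative assumptions; a real Chord node would do these steps over RPC and would also maintain fingers and a successor list.

  # Sketch of Chord join/stabilize on a tiny in-memory ring.
  # Modeled on the notes' description of the join algorithm (Figure 7).

  M = 5                         # 5-bit ID space, as in the example above
  RING = 1 << M

  def between(x, a, b):
      # True if x lies in the open interval (a, b) clockwise around the ring
      a, b, x = a % RING, b % RING, x % RING
      if a < b:
          return a < x < b
      return x > a or x < b     # interval wraps past zero (also covers a == b)

  class Node:
      def __init__(self, node_id):
          self.id = node_id
          self.successor = self        # a one-node ring points at itself
          self.predecessor = None

      def find_successor(self, k):
          # successor-only routing: correctness needs only successor pointers
          n = self
          while not (between(k, n.id, n.successor.id) or k % RING == n.successor.id):
              n = n.successor
          return n.successor

      def join(self, existing):
          # learn our successor by looking up our own ID via any existing node
          self.predecessor = None
          self.successor = existing.find_successor(self.id)

      def stabilize(self):
          # run periodically: did someone join between us and our successor?
          x = self.successor.predecessor
          if x is not None and between(x.id, self.id, self.successor.id):
              self.successor = x
          self.successor.notify(self)

      def notify(self, n):
          # n thinks it might be our predecessor
          if self.predecessor is None or between(n.id, self.predecessor.id, self.id):
              self.predecessor = n

  # Build the lecture's example ring by joining nodes one at a time and
  # running a few stabilization rounds (in real Chord these are periodic timers).
  nodes = {2: Node(2)}
  for i in (9, 12, 20, 25):
      n = Node(i)
      n.join(nodes[2])
      nodes[i] = n
      for _ in range(3):
          for node in nodes.values():
              node.stabilize()

  assert nodes[2].find_successor(13).id == 20   # matches the lookup example above
  assert nodes[2].find_successor(26).id == 2    # the key wraps around the ring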
How to do lookups if there are failed nodes?
  Assume nodes fail w/o warning
    Strictly harder than graceful departure
  If a finger table entry doesn't respond:
    Re-route via a lower finger table entry
  What if the successor doesn't respond?
    The failed node might have been closest to the key ID!
    Need to know the next-closest node
    Maintain a _list_ of successors: r successors
      If you expect really bad luck, maintain O(log N) successors
    Then we can route around failure
  How to repair finger tables after a failure?
    Periodically re-look-up each entry
    A lookup's failure retries will find the correct live successor

How *long* will a lookup take?
  [diagram]

Locality
  A lookup takes log(n) messages
    But they are to random nodes on the Internet!
    They will often be very far away
  Can we route through nodes close to us on the underlying network?
    This boils down to whether we have choices:
      if there are multiple correct next hops, we can try to choose the closest
    Chord doesn't allow much choice
      Many fingers, but there's an ID-progress vs. proximity tradeoff
      But fingers need not be exact for correctness
    Idea:
      Don't just use succ(n + 2^i) as finger[i]
      Sample nodes in the successor list of the true finger, and pick the closest
      "Closest" = lowest ping time
    What's the effect?
      Individual hops are lower latency
      But there's less and less choice (lower node density) as you get close in ID space
      So the last few hops are likely to be very long
      Thus you don't *end up* close to the initiating node
      You just get there quicker

Can a DHT node forge data?
  Use key = SHA1(value)
  If the client knows the key, the client knows how to check the data
  Servers reject put()s that don't check out
  Clients reject get()s that don't check out
  (see the verification sketch at the end of these notes)

Can a DHT node claim that data doesn't exist?
  Yes, though perhaps you can check other replicas

Can a host join w/ IDs chosen to sit under every replica?
  Or "join" many times, so that it is most of the DHT nodes?
  Can we prevent a node from choosing any ID it likes?

How to allow updates?
  Same key, different value
  Can't have key = SHA1(value)
    that doesn't allow arbitrarily updatable values
  How to check that a value is correct?
    Maybe have each value owned by someone
      and require the value to be signed
    But DHT nodes may not know who should sign a value
      and thus they cannot reject a bad put()
    Maybe nodes only allow put() when key = SHA1(value),
      or when key = public key and the value is signed with it

How to build a s/w distribution service on a DHT?
  Need updates
  Need to spread the load for popular releases
  Need availability
  Split the distro into a big set of blocks
    put() each block into the DHT
    put() replicas on the successors of each block's key
  Each distro has a public key, well known
    get(pubkey) = {list of block keys, signature}
  Now fetches are spread over many DHT servers
    Large aggregate capacity

Does anyone use DHTs?
  Many research systems (e.g. next paper)
  Amazon Dynamo
  Some distributed BitTorrent trackers
  A few p2p file sharing / p2p indexing systems
  Maybe Skype

Retrospective
  log(n) is a lot
    many hops -> many chances for a timeout
    one-hop (full knowledge) is attractive for modest-sized systems
  proximity is hard
    internet-scale often means internet-scale delays
    delays are often human-perceptible
  hard to guarantee data will be there
    hard to replicate for failures
    hard to guarantee you'll find a key if it is there
  lots of security problems in an open system
  put/get is too restrictive
    often want servers to do more computation near the data
    e.g. keyword search
  conclusions:
    maybe private rather than public?
    maybe not internet-wide?
    maybe one-hop rather than routing?
    maybe an application-tailorable library rather than a separate service?
    maybe "soft state" or a cache rather than storing real data?
    next paper (CoralCDN) uses DHTs in many of these ways
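Wrapping back to the forgery question above: a minimal sketch (Python, not from the notes) of self-certifying blocks where key = SHA1(value), so a server can reject bad put()s and a client can reject forged get()s. DHTStore, content_key, and the single-node setup are invented names for illustration; a real DHT would run this check at both ends of a network put/get.

  # Sketch of self-certifying DHT blocks: key = SHA1(value), so both the
  # storing server and the fetching client can verify data integrity.
  # DHTStore and its methods are illustrative names, not a real DHT API.

  import hashlib

  def content_key(value: bytes) -> str:
      return hashlib.sha1(value).hexdigest()

  class DHTStore:
      """Stands in for one DHT node's local storage layer."""
      def __init__(self):
          self.blocks = {}

      def put(self, key: str, value: bytes) -> bool:
          # server side: reject put()s that don't check out
          if key != content_key(value):
              return False
          self.blocks[key] = value
          return True

      def get(self, key: str):
          # client side: reject get()s whose contents don't match the key
          value = self.blocks.get(key)
          if value is None or content_key(value) != key:
              return None        # missing or forged
          return value

  node = DHTStore()
  block = b"one block of a software distribution"
  k = content_key(block)
  assert node.put(k, block)
  assert not node.put(k, b"forged contents")   # server rejects a bad put()
  assert node.get(k) == block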