6.824 2004 Lecture 19: Distributed Hash Tables (1)

From "Pastry: Scalable, decentralized object location and routing for
large-scale peer-to-peer systems", by Rowstron and Druschel.

Can we build something like Freenet, but more suitable as a
building block for large distributed systems?

What was good about Freenet?
  unique naming of data (like read/write memory)
  storage "out there" in the net (publishers don't have to be up all the time)
  spreads work over many hosts, large total capacity
  efficient routing of keys to data (no flooding)
  caching along the lookup path

What could Freenet have done better?
  Lookups often took a huge number of hops.
  Unpopular data might die out.
  New nodes don't quickly learn good routes.
  New nodes don't pick up their share of the costs.

Typical DHT interface:
  put(key, value)
  get(key) -> value
  loose guarantees about keeping data alive
  log(n) hops, with a narrow distribution, even for newly arriving nodes
  guarantees about load balance, even for new nodes

Potential DHT applications:
  publishing: DHT keys are like links
  file system: use the DHT as a sort of distributed disk drive
    keys are like block numbers
    Petal is a little bit like this
  location tracking:
    keys are e.g. cell phone numbers
    a value is a phone's current location

Basic idea
  Two layers: routing (lookup) and data storage.
  Give each node a unique ID.
  Have a rule for assigning keys to nodes based on node/key IDs,
    e.g. key X goes on the node with the ID nearest to X.
  Now, given X, how do we find that node?
  Arrange nodes in an ID space, based on ID,
    i.e. use the ID as coordinates.
  Build a global sense of direction. Examples:
    1D line, 2D square, tree based on ID bits, hypercube, or ID circle.
  Build routing tables to allow navigation in ID space.
    Each node knows about its ID-space neighbors,
      i.e. knows the neighbors' IDs and IP addresses.
    Perhaps each node also knows a few farther-away nodes,
      to move long distances quickly.

What does a complete algorithm have to do?
  1. Define IDs and the assignment of key IDs to node IDs.
  2. Define per-node routing table contents.
  3. Lookup algorithm that uses the routing tables.
  4. Join procedure to reflect new nodes in the tables.
  5. Failure recovery.
  6. Move data around when nodes join.
  7. Make new replicas when nodes fail.

The "Pastry" peer-to-peer lookup system
  By Rowstron and Druschel.
  http://www.research.microsoft.com/~antr/pastry/
  An example system of this type.

Assignment of key IDs to node IDs?
  IDs are 128-bit numbers.
  Node IDs chosen randomly? An MD5 hash of the IP address?
  A key is stored on the node with the numerically closest ID.
  If node and key IDs are uniform, we get reasonable load balance.

Routing?
  A query is at some node.
  The node needs to forward the query to a node "closer" to the key.
  Note that the key ID and the node ID share some high bits of ID prefix.
  Routing forwards to a node that shares more bits of prefix with the key.
  Sharing more bits of prefix is the definition of "closer":
    we're descending a b-ary tree, one digit at a time.
  Not the same as "numerically closer": there are edge effects
    (an ID can be numerically close to a key yet share few prefix bits).

How can we ensure every node always knows some node "closer" to any key?
  I.e. what's the structure of the Pastry routing table?

ID-space topology
  Ring.
  Pointers to quadrants, quad-quadrants, &c.
  Everyone agrees on how the ring is divided,
    just based on ID bits.

Routing table contents
  Divide the ID into digits.
  One routing table row per digit position.
  Each row contains one column per digit value.
  An entry refers to a node with the same prefix, but a different
    value for this digit.
  "Refers" means the entry contains that node's ID and IP address.
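Here is a minimal Python sketch of such a routing table and its
forwarding rule, using the toy parameters of the example below
(2-bit digits, 3-digit IDs). The names (digit, RoutingTable,
next_hop) are invented for illustration; real Pastry uses 128-bit
IDs and 4-bit digits.

  DIGIT_BITS = 2          # b = 2 bits per digit
  NDIGITS = 3             # IDs have 3 digits
  BASE = 1 << DIGIT_BITS  # 4 possible values per digit

  def digit(node_id, i):
      """Return the i-th digit of node_id, counting from the high end."""
      shift = DIGIT_BITS * (NDIGITS - 1 - i)
      return (node_id >> shift) & (BASE - 1)

  class RoutingTable:
      def __init__(self, my_id):
          self.my_id = my_id
          # rows[i][d]: ID (plus, in a real system, the IP address) of
          # some node that agrees with my_id in the first i digits and
          # has digit value d at position i; None if no such node is known.
          self.rows = [[None] * BASE for _ in range(NDIGITS)]

      def next_hop(self, key):
          """Correct the highest digit in which key disagrees with our ID."""
          for i in range(NDIGITS):
              d = digit(key, i)
              if d != digit(self.my_id, i):
                  return self.rows[i][d]  # may be None (empty entry)
          return self.my_id               # key ID == our ID: we're done

  # E.g. at node 102 (base 4), key 211 (base 4) disagrees in the first
  # digit, so next_hop returns rows[0][2] -- node 223 in the table below.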
Example. 2 bits per digit. 3 digits per ID.
  ID of this node: 102.
  Its routing table, one row per digit position, one column per
  digit value (0-3):
    row 0:  022  102  223  312
    row 1:  102  113  122  130
    row 2:  100  101  102  103
  This node (102) shows up in each row.
  There may be many choices for each entry (e.g. 021 rather than 022).
    We can use *any* node with the right prefix: we just need to
    correct the digit.
  Note real Pastry uses 128-bit IDs and 4-bit digits.
  Note every node has its own routing table; maybe 223's looks like this:
    row 0:  031  114  223  312
    row 1:  202  210  223  231
    row 2:  220  221  222  223

To forward with the routing table:
  Row index: position of the highest digit in which the key disagrees
    with our ID.
  Column index: value of that digit.
  Forward the query to that node. It does the same thing...
  If we reach the end of the key ID, we're done: key ID == node ID.
  This takes log_b(N) messages.
  Example: lookup(211) from node 102.
    The first digit differs, so 102 forwards to row 0, column 2: node 223.
    At 223 the second digit differs, so 223 forwards to its row 1,
    column 1 entry: node 210. Each hop corrects one more digit.

There's a complete tree rooted at every node:
  it starts at that node's row 0,
  and is threaded through other nodes' row 1, &c.
  Every node acts as a root, so there's no root hotspot.
  This is *better* than simply arranging the nodes in one tree.

But what if no node exists for a particular routing table entry?
  There are more possible IDs than there are nodes,
    so we won't be able to fill all routing table entries.
    E.g. maybe node 103 does not exist.
  Indeed, only about log_b(N) rows are likely to exist.
  We can't correct the digit: what do we do now?
  Forward to a node that's numerically closer to the key.
    We may happen to know one in our routing table.
  What if the routing table contains no numerically closer node?
    Perhaps we are the right node!
    But maybe not: maybe the node closest to the key doesn't share
      many bits with us, so it isn't in our table.
    Suppose 113, 122, and 130 didn't exist, and the key is 123.
      Prefix routing might find node 103.
      But there may be a node 200, which is numerically closer to 123.
  How can we know whether we're closest, or forward to the closest?
    Easy if each node knows its numeric neighbor nodes.

Each node maintains a "leaf set":
  it refers to the L nodes with the numerically closest IDs,
  L/2 of them on each side of us.
  Now we can tell if a node on the other side of the key is
    numerically closer.
  So node 103 would know about node 200.

That's the complete routing story.
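Putting the pieces together, here is a sketch of the complete
per-hop decision, building on the RoutingTable sketch above. It
loosely follows the route() procedure in the Pastry paper; the names
(Node, leaf_set, route) are invented, and it ignores wraparound of
the ID ring for brevity.

  class Node:
      def __init__(self, my_id, table, leaf_set):
          self.my_id = my_id
          self.table = table        # RoutingTable from the sketch above
          self.leaf_set = leaf_set  # IDs of the L numerically closest nodes

      def route(self, key_id):
          """Return the next hop for key_id (our own ID if we believe
          we are the numerically closest node)."""
          dist = lambda n: abs(n - key_id)  # ignores ring wraparound
          # 1. If the key falls within the leaf set's range, deliver
          #    to the numerically closest node we know there.
          candidates = self.leaf_set + [self.my_id]
          if min(candidates) <= key_id <= max(candidates):
              return min(candidates, key=dist)
          # 2. Otherwise try prefix routing: correct the next digit.
          hop = self.table.next_hop(key_id)
          if hop is not None:
              return hop
          # 3. The table entry was empty: fall back to any known node
          #    that is numerically closer to the key than we are.
          #    (Real Pastry also requires it to share at least as long
          #    a prefix with the key as we do.)
          known = [n for row in self.table.rows for n in row if n is not None]
          closer = [n for n in known + self.leaf_set if dist(n) < dist(self.my_id)]
          return min(closer, key=dist) if closer else self.my_id

In the key-123 example above, the leaf-set check in step 1 is how
node 103 discovers that node 200 is numerically closer to the key.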