6.824 Lecture 20: Distributed Hash Tables / Chord

Prior focus has been on traditional distributed systems
  e.g. client/server style
  Server in machine room: well maintained, centrally located, perhaps replicated
  Examples: lock and extent server in YFS

Now: Internet-scale systems
  Harness 1000s or millions of machines
  Aggregate machines from many owners
  Huge capacity
  Maybe high availability via massive replication
  No central components to fail, go out of business, &c
  e.g. BitTorrent, Skype

Can one design general-purpose tools to organize big distributed systems?
  One proposal: Distributed Hash Tables (DHTs)
  Focus is on finding information
    lookup(key) -> IP address of responsible server
  Often under a storage layer:
    put(key, value)
    get(key) -> value
  There are many DHT schemes
    Mostly used in research projects
    Sometimes used in real systems, e.g. BitTorrent trackers

Potential DHT applications:
  publishing: a DHT key names specific content
    key *not* tied to a specific server (unlike URLs)
  file system: use the DHT as a sort of distributed disk drive
    keys are like block numbers
    YFS's extent server could be like this
  location tracking
    keys are e.g. cell phone numbers or chat usernames
    a value is a phone's current location

What structure could be used to organize nodes?
  Central lookup service, maybe replicated
    example: Napster many years ago
    single point of failure, performance bottleneck, who runs it?
  Hierarchy of lookup servers, like DNS
    and hierarchical naming, e.g. www.mit.edu
    who is in charge of the root?
    the root may have to carry much of the load
    names have location baked in
  Could we have flat names and a decentralized lookup service?
    flat names give apps full control over naming
    decentralization maybe avoids overload, and avoids forcing someone to be in charge

Problems
  Fast lookup w/ flat names (no location hints in names)
  Fast lookup w/ many nodes
  Churn (join/fail rate; servers not well maintained)
  Load balance (some data more popular/larger than others)
  Data availability (servers not well maintained)
  Security (can't trust all participants)

Basic idea
  Two layers: routing (lookup) and data storage
  The routing layer handles naming and arranging nodes and then finding them
  The storage layer handles actually putting and maintaining data

What does a complete algorithm have to do?
  1. Define IDs, and the document-ID-to-node-ID assignment
  2. Define per-node routing table contents
  3. Lookup algorithm that uses the routing tables
  4. Join procedure to reflect new nodes in the tables
  5. Failure recovery
  6. Move data around when nodes join
  7. Make new replicas when nodes fail

Typical approach:
  Give each node a unique ID
  Have a rule for assigning keys to nodes based on node/key ID
    e.g. key X goes on the node with the nearest ID to X
  Now how, given X, do we find that node?
  Arrange nodes in an ID-space
    i.e. use IDs as coordinates
    Build a global sense of direction
    Examples: 1D line, 2D square, binary tree, hypercube, or ID circle
  Build routing tables to allow ID-space navigation
    Each node knows about its ID-space neighbors
      i.e. knows the neighbors' IDs and IP addresses
    Perhaps each node also knows a few farther-away nodes
      To move long distances quickly

The "Chord" peer-to-peer lookup system
  An example system of this type

ID-space topology
  Ring: all IDs are 160-bit numbers, viewed in a ring
  Everyone agrees on how the ring is divided between nodes
    Just based on ID bits

Assignment of key IDs to node IDs?
  Key stored on the first node whose ID is equal to or greater than the key ID
  Closeness is defined as the "clockwise distance"
  If node and key IDs are uniform, we get reasonable load balance
  Node IDs can be assigned, chosen randomly, derived from a SHA-1 hash of the IP address, ...
  Key IDs can be derived from the data, or chosen by the user
  Example: 5-bit ID space (0..31)
    Nodes: 2 9 12 20 25
  (the assignment rule is sketched just below)
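Here is a minimal sketch (in Python, not from the original notes) of that assignment rule on the 5-bit example ring: a key lives on the first node clockwise whose ID is equal to or greater than the key ID, wrapping around past the top. The helper name successor_node is invented for illustration.

  # Illustrative sketch of Chord's key-to-node assignment on a 5-bit ring.
  # successor_node() is a made-up helper name, not part of the Chord API.

  RING_BITS = 5
  RING = 1 << RING_BITS          # IDs live in [0, 32)

  def successor_node(key_id, node_ids):
      """Return the ID of the node responsible for key_id: the first node
      clockwise whose ID is >= key_id, wrapping around the ring."""
      candidates = sorted(node_ids)
      for n in candidates:
          if n >= key_id % RING:
              return n
      return candidates[0]       # wrapped past the highest node ID

  nodes = [2, 9, 12, 20, 25]               # the lecture's example ring
  assert successor_node(13, nodes) == 20   # 13 falls between 12 and 20
  assert successor_node(26, nodes) == 2    # wraps around past 25
  assert successor_node(9, nodes) == 9     # a key equal to a node ID stays there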
Routing?
  A query is at some node
  That node needs to forward the query to a node "closer" to the key
  Simplest system: either you are the "closest" or your neighbor is closer
    Hand off queries in a clockwise direction until done
    Only state necessary is the "successor":
      n.find_successor(k):
        if k in (n, successor]:
          return successor
        else:
          return successor.find_successor(k)
  Slow but steady; how can we make this faster?
    This looks like a linked list: O(n)
    Can we make it more like a binary search?
    We need to be able to halve the distance at each step

Finger table routing:
  Keep track of nodes exponentially further away:
    n.finger[i] = successor of n + 2^i
    Many of these entries will point to the same node; in a full system expect O(lg N) distinct fingers
      n.find_successor(k):
        if k in (n, successor]:
          return successor
        else:
          n' = closest_preceding(k)
          return n'.find_successor(k)
  Node 2's finger table looks like this (offset: finger):
    1: 9
    2: 9
    4: 9
    8: 12
    16: 20
  Example: 2.lookup(13)
    2.closest_preceding(13) is 12, so 2 forwards to 12
    13 is in (12, 20], so 12 returns 20
  There's a complete tree rooted at every node
    Starts at that node's row 0
    Threaded through other nodes' row 1, &c
    Every node acts as a root, so there's no root hotspot
    This is *better* than simply arranging the nodes in one tree

What if finger tables are wrong?
  Too far along the ring? e.g. node 2's "8" finger points to 20, not 12
    2.lookup(13): forwards to 9
    9's successor is 12, so 9 forwards to 12
    12's successor is 20, so 12 returns 20
  Too close? e.g. node 2's "8" finger points to 9, not 12
    2.lookup(13) forwards to 9
    same as before, via successor pointers
  Note that correctness depends on successors, not finger pointers
    fingers are only about speed

How does a new node acquire correct tables?
  General approach:
    Assume the system starts out w/ correct routing tables
    Use the routing tables to help the new node find information
    Add the new node in a way that maintains correctness
  The new node issues a lookup for its own ID via any existing node
    This finds the new node's successor
    Ask that successor for its finger table
  At this point the new node can forward queries correctly:
    it tweaks its own finger table as necessary
  Does routing *to* us now work?
    If the new node doesn't do anything,
      queries will go to where they would have gone before we joined,
      i.e. to the existing node numerically closest to us
    So, for correctness, we need to let people know that we are here
    Only need to ensure successor pointers are correct
  Straw man: just tell our predecessor to use us as its successor
    Why doesn't this work? Concurrent joins!
    Old ring: 40 48
    44 and 46 join at the same time
      both set their successor to 48
    44 tells 40 to use 44 as successor
      maybe 44 accepts some put()s
    46 tells 40 to use 46 as successor
    now 44 and the data on it can't be found!
  Chord join algorithm: Figure 7
    Each node keeps track of its current predecessor
    When you join, tell your successor that its predecessor has changed
    Periodically ask your successor who its predecessor is:
      If that node is closer to you, set your successor to that node
    Does this fix concurrent joins?
      E.g., the 40 44 46 48 example
      44 and 46 may sort themselves out but still not be part of the ring
      finally 40 asks 48 what 48's predecessor is
      48 says "46", so 40 sets its successor to 46
      the next round, 40 learns about 44 the same way: it walks backwards to 44
    Everyone must also continue to update their finger tables:
      Periodically look up your n + 2^i-th key
  (a single-process sketch of join + stabilization follows)
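A minimal single-process sketch (Python, not the paper's code) of the join/stabilize idea above: successor-only routing for correctness, plus periodic stabilization that lets joining nodes stitch themselves into the ring. The Node class, the between() helper, and the in-memory setup are illustrative assumptions; a real Chord node would do these steps over RPC and would also maintain fingers and a successor list.

  # Sketch of Chord join/stabilize on a tiny in-memory ring.
  # Modeled on the notes' description of the join algorithm (Figure 7).

  M = 5                         # 5-bit ID space, as in the example above
  RING = 1 << M

  def between(x, a, b):
      # True if x lies in the open interval (a, b) clockwise around the ring
      a, b, x = a % RING, b % RING, x % RING
      if a < b:
          return a < x < b
      return x > a or x < b     # interval wraps past zero (also covers a == b)

  class Node:
      def __init__(self, node_id):
          self.id = node_id
          self.successor = self        # a one-node ring points at itself
          self.predecessor = None

      def find_successor(self, k):
          # successor-only routing: correctness needs only successor pointers
          n = self
          while not (between(k, n.id, n.successor.id) or k % RING == n.successor.id):
              n = n.successor
          return n.successor

      def join(self, existing):
          # learn our successor by looking up our own ID via any existing node
          self.predecessor = None
          self.successor = existing.find_successor(self.id)

      def stabilize(self):
          # run periodically: did someone join between us and our successor?
          x = self.successor.predecessor
          if x is not None and between(x.id, self.id, self.successor.id):
              self.successor = x
          self.successor.notify(self)

      def notify(self, n):
          # n thinks it might be our predecessor
          if self.predecessor is None or between(n.id, self.predecessor.id, self.id):
              self.predecessor = n

  # Build the lecture's example ring by joining nodes one at a time and
  # running a few stabilization rounds (in real Chord these are periodic timers).
  nodes = {2: Node(2)}
  for i in (9, 12, 20, 25):
      n = Node(i)
      n.join(nodes[2])
      nodes[i] = n
      for _ in range(3):
          for node in nodes.values():
              node.stabilize()

  assert nodes[2].find_successor(13).id == 20   # matches the lookup example above
  assert nodes[2].find_successor(26).id == 2    # the key wraps around the ring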
How to do lookups if there are failed nodes?
  Assume nodes fail w/o warning
    Strictly harder than graceful departure
  If a finger table entry doesn't respond:
    Re-route via a lower finger table entry
  What if the successor doesn't respond?
    The failed node might have been closest to the key ID!
    Need to know the next-closest node
    Maintain a _list_ of successors: r successors
      If you expect really bad luck, maintain O(log N) successors
    Then we can route around failure
  How to repair finger tables after a failure?
    Periodically re-look-up each entry
    A lookup's failure retries will find the correct live successor

How *long* will a lookup take?
  [diagram]

Locality
  A lookup takes log(n) messages
    But they are to random nodes on the Internet!
    They will often be very far away
  Can we route through nodes close to us on the underlying network?
    This boils down to whether we have choices:
      if there are multiple correct next hops, we can try to choose the closest
    Chord doesn't allow much choice
      Many fingers, but there's an ID-progress vs. proximity tradeoff
      But fingers need not be exact for correctness
    Idea:
      Don't just use succ(n + 2^i) as finger[i]
      Sample nodes in the successor list of the true finger, and pick the closest
      "Closest" = lowest ping time
    What's the effect?
      Individual hops are lower latency
      But there's less and less choice (lower node density) as you get close in ID space
      So the last few hops are likely to be very long
      Thus you don't *end up* close to the initiating node
      You just get there quicker

Can a DHT node forge data?
  Use key = SHA1(value)
  If the client knows the key, the client knows how to check the data
  Servers reject put()s that don't check out
  Clients reject get()s that don't check out
  (see the verification sketch at the end of these notes)

Can a DHT node claim that data doesn't exist?
  Yes, though perhaps you can check other replicas

Can a host join w/ IDs chosen to sit under every replica?
  Or "join" many times, so that it is most of the DHT nodes?
  Can we prevent a node from choosing any ID it likes?

How to allow updates?
  Same key, different value
  Can't have key = SHA1(value)
    that doesn't allow arbitrarily updatable values
  How to check that a value is correct?
    Maybe have each value owned by someone
      and require the value to be signed
    But DHT nodes may not know who should sign a value
      and thus they cannot reject a bad put()
    Maybe nodes only allow put() when key = SHA1(value),
      or when key = public key and the value is signed with it

How to build a s/w distribution service on a DHT?
  Need updates
  Need to spread the load for popular releases
  Need availability
  Split the distro into a big set of blocks
    put() each block into the DHT
    put() replicas on the successors of each block's key
  Each distro has a public key, well known
    get(pubkey) = {list of block keys, signature}
  Now fetches are spread over many DHT servers
    Large aggregate capacity

Does anyone use DHTs?
  Many research systems (e.g. next paper)
  Amazon Dynamo
  Some distributed BitTorrent trackers
  A few p2p file sharing / p2p indexing systems
  Maybe Skype

Retrospective
  log(n) is a lot
    many hops -> many chances for a timeout
    one-hop (full knowledge) is attractive for modest-sized systems
  proximity is hard
    internet-scale often means internet-scale delays
    delays are often human-perceptible
  hard to guarantee data will be there
    hard to replicate for failures
    hard to guarantee you'll find a key if it is there
  lots of security problems in an open system
  put/get is too restrictive
    often want servers to do more computation near the data
    e.g. keyword search
  conclusions:
    maybe private rather than public?
    maybe not internet-wide?
    maybe one-hop rather than routing?
    maybe an application-tailorable library rather than a separate service?
    maybe "soft state" or a cache rather than storing real data?
    next paper (CoralCDN) uses DHTs in many of these ways
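Wrapping back to the forgery question above: a minimal sketch (Python, not from the notes) of self-certifying blocks where key = SHA1(value), so a server can reject bad put()s and a client can reject forged get()s. DHTStore, content_key, and the single-node setup are invented names for illustration; a real DHT would run this check at both ends of a network put/get.

  # Sketch of self-certifying DHT blocks: key = SHA1(value), so both the
  # storing server and the fetching client can verify data integrity.
  # DHTStore and its methods are illustrative names, not a real DHT API.

  import hashlib

  def content_key(value: bytes) -> str:
      return hashlib.sha1(value).hexdigest()

  class DHTStore:
      """Stands in for one DHT node's local storage layer."""
      def __init__(self):
          self.blocks = {}

      def put(self, key: str, value: bytes) -> bool:
          # server side: reject put()s that don't check out
          if key != content_key(value):
              return False
          self.blocks[key] = value
          return True

      def get(self, key: str):
          # client side: reject get()s whose contents don't match the key
          value = self.blocks.get(key)
          if value is None or content_key(value) != key:
              return None        # missing or forged
          return value

  node = DHTStore()
  block = b"one block of a software distribution"
  k = content_key(block)
  assert node.put(k, block)
  assert not node.put(k, b"forged contents")   # server rejects a bad put()
  assert node.get(k) == block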