6.824 2006: Scalable Lookup

Prior focus has been on traditional distributed systems
  e.g. NFS, DSM/Hypervisor, Harp
  Machine room: well maintained, centrally located.
  Relatively stable population: can be known in its entirety.
  Focus on performance, semantics, recovery.
  Biggest system might be Porcupine.

Now: Internet-scale systems
  Machines owned by you and me: no central authority.
  Huge number of distributed machines: can't know everyone.
  e.g. from e-mail to Napster.

Problems
  How do you name nodes and objects?
  How do you find other nodes in the system (efficiently)?
  How should data be split up between nodes?
  How to prevent data from being lost? How to keep it available?
  How to provide consistency?
  How to provide security? Anonymity?

What structure could be used to organize nodes?
  Central contact point: Napster
  Hierarchy: DNS (for e-mail, the WWW)
  Flat? Let's look at a system with a flat interface: a DHT.

Scalable lookup: provide an abstract interface to store and find data.
Typical DHT interface:
  put(key, value)
  get(key) -> value
  loose guarantees about keeping data alive
  log(n) hops, even for new nodes
  guarantees about load balance, even for new nodes

Potential DHT applications:
  publishing: DHT keys are like links
  file system: use the DHT as a sort of distributed disk drive
    keys are like block numbers; Petal is a little bit like this
  location tracking:
    keys are e.g. cell phone numbers
    a value is a phone's current location

Basic idea
  Two layers: routing (lookup) and data storage.
  The routing layer handles naming and arranging nodes, and then finding them.
  The storage layer handles actually putting and maintaining data.

What does a complete algorithm have to do?
  1. Define IDs, and the assignment of document IDs to node IDs.
  2. Define per-node routing table contents.
  3. Lookup algorithm that uses the routing tables.
  4. Join procedure to reflect new nodes in the tables.
  5. Failure recovery.
  6. Move data around when nodes join.
  7. Make new replicas when nodes fail.

Typical approach:
  Give each node a unique ID.
  Have a rule for assigning keys to nodes based on node/key IDs,
    e.g. key X goes on the node with the ID nearest to X.
  Now how, given X, do we find that node?
  Arrange nodes in an ID space, based on ID, i.e. use IDs as coordinates.
  Build a global sense of direction. Examples:
    a 1D line, a 2D square, a tree based on bits, a hypercube, or an ID circle.
  Build routing tables to allow ID-space navigation.
    Each node knows about its ID-space neighbors,
      i.e. knows the neighbors' IDs and IP addresses.
    Perhaps each node also knows a few farther-away nodes,
      to move long distances quickly.

The "Chord" peer-to-peer lookup system
  By Stoica, Morris, Karger, Kaashoek, and Balakrishnan
  http://pdos.csail.mit.edu/chord/
  An example system of this type.

ID-space topology
  Ring: all IDs are 160-bit numbers, viewed in a ring.
  Everyone agrees on how the ring is divided between nodes,
    based only on the ID bits.

Assignment of key IDs to node IDs?
  A key is stored on the first node whose ID is equal to or greater than the key ID.
  Closeness is defined as the "clockwise distance" around the ring.
  If node and key IDs are uniform, we get reasonable load balance.
  Node IDs can be assigned, chosen randomly, the SHA-1 hash of the node's IP address, ...
  Key IDs can be derived from the data, or chosen by the user.
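To make the assignment rule concrete, here is a minimal sketch in Python. It is
not from the original notes: the successor_of helper and the example IP
addresses are made up for illustration.

    import hashlib

    def node_id(ip_addr):
        # A node ID: the SHA-1 hash of the node's IP address (a 160-bit number).
        return int.from_bytes(hashlib.sha1(ip_addr.encode()).digest(), "big")

    def key_id(data):
        # A key ID derived from the data itself.
        return int.from_bytes(hashlib.sha1(data).digest(), "big")

    def successor_of(k, node_ids):
        # The assignment rule: key k lives on the first node whose ID is
        # equal to or greater than k, wrapping around the ring.
        for n in sorted(node_ids):
            if n >= k:
                return n
        return min(node_ids)  # wrapped past the highest node ID

    nodes = [node_id(ip) for ip in ("10.0.0.1", "10.0.0.2", "10.0.0.3")]
    print(hex(successor_of(key_id(b"some document"), nodes)))

With uniformly distributed SHA-1 outputs, each node ends up responsible for a
roughly equal arc of the ring, which is where the load-balance guarantee comes from.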
Routing?
  A query is at some node.
  That node needs to forward the query to a node "closer" to the key.
  Simplest system: either you are the "closest", or your neighbor is closer.
    Hand off queries in a clockwise direction until done.
    The only state necessary is the "successor":

  n.find_successor(k):
    if k in (n, successor]:
      return successor
    else:
      return successor.find_successor(k)

  Slow but steady; how can we make this faster?
  This looks like a linked list: O(n).
  Can we make it more like a binary search?
    Need to be able to halve the distance at each step.

Finger table routing:
  Keep track of nodes exponentially farther away.
  New state: succ(n + 2^i).
    Many of these entries will be the same in a full system: expect O(lg N) distinct ones.

  n.find_successor(k):
    if k in (n, successor]:
      return successor
    else:
      n' = closest_preceding_node(k)
      return n'.find_successor(k)

  Maybe node 8's finger table looks like this:
    succ(8 + 1)  = 14
    succ(8 + 2)  = 14
    succ(8 + 4)  = 14
    succ(8 + 8)  = 21
    succ(8 + 16) = 32
    succ(8 + 32) = 42

  There's a complete tree rooted at every node:
    it starts at that node's row 0,
    and is threaded through other nodes' row 1, &c.
  Every node acts as a root, so there's no root hotspot.
  This is *better* than simply arranging the nodes in one tree.

How does a new node acquire correct tables?
  General approach:
    Assume the system starts out with correct routing tables.
    Use the routing tables to help the new node find information.
    Add the new node in a way that maintains correctness.
  The new node issues a lookup for its own key to any existing node.
    This finds the new node's successor.
    Ask that node for its finger table.
  At this point the new node can forward queries correctly;
    it tweaks its own finger table as necessary.

Does routing *to* us now work?
  If the new node doesn't do anything, queries will go to where they would have
    gone before it joined, i.e. to the existing node numerically closest to it.
  So, for correctness, we need to let people know that we are here.
  Each node keeps track of its current predecessor.
  When you join, tell your successor that its predecessor has changed.
  Periodically ask your successor who its predecessor is:
    if that node is closer to you, switch to it.

Is that enough?
  Everyone must also continue to update their finger tables:
    periodically look up your (n + 2^i)-th key.

What about concurrent joins?
  E.g. two new nodes with very close IDs might have the same successor,
    e.g. 44 and 46: both may find node 48... a spiky tree!
  Good news: periodic stabilization takes care of this.

What about node failures?
  Assume nodes fail without warning; strictly harder than graceful departure.
  Two issues:
    Other nodes' routing tables refer to the dead node.
    The dead node's predecessor has no successor.
  If you try to route via the dead node, detect the timeout and treat it as
    an empty table entry, i.e. route to the numerically closer entry instead.
    Repair: ask any node on the same row for a copy of its corresponding entry,
      or any node on the rows below; all of these share the right prefix.
  For the missing successor:
    The failed node might have been the closest to the key ID!
    Need to know the next-closest, so maintain a _list_ of r successors.
    If you expect really bad luck, maintain O(log N) successors.
  We can route around failure; the system is effectively self-correcting.
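Here is a minimal, runnable sketch of the lookup strategies above, in Python.
It is not from the notes: the in-memory method call standing in for an RPC,
the 6-bit ID space, and the hand-built 10-node ring (similar to the Chord
paper's running example) are all assumptions made for illustration.

    M = 6                      # 6-bit IDs for the example: the ring is 0..63

    def between(x, a, b):
        # True if x lies in the half-open ring interval (a, b].
        if a < b:
            return a < x <= b
        return x > a or x <= b  # the interval wraps around the ring

    class Node:
        def __init__(self, ident):
            self.id = ident
            self.successor = None   # next node clockwise
            self.finger = []        # finger[i] = succ(id + 2^i)

        def find_successor(self, k):
            if between(k, self.id, self.successor.id):
                return self.successor
            n = self.closest_preceding_node(k)
            return n.find_successor(k)   # a real system would make an RPC here

        def closest_preceding_node(self, k):
            # Scan fingers from farthest to nearest; the farthest finger that
            # still precedes k roughly halves the remaining ID-space distance.
            for f in reversed(self.finger):
                if between(f.id, self.id, k) and f.id != k:
                    return f
            return self.successor    # safe default; not reached on a consistent ring

    # Hand-build the example ring.
    ring = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]
    nodes = {i: Node(i) for i in ring}

    def succ_id(k):
        # First node ID equal to or greater than k, wrapping around.
        return next((x for x in ring if x >= k), ring[0])

    for j, i in enumerate(ring):
        nodes[i].successor = nodes[ring[(j + 1) % len(ring)]]
        nodes[i].finger = [nodes[succ_id((i + 2**b) % 2**M)] for b in range(M)]

    print(nodes[8].find_successor(54).id)   # hops 8 -> 42 -> 51 -> 56; prints 56

Note how each hop jumps to the farthest finger entry that still precedes the
key; halving the remaining distance each time is what gives the O(log N) hop count.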
Locality
  Lookup takes log(n) messages, but they are to random nodes on the Internet!
  They will often be very far away.
  Can we route through nodes that are close to us on the underlying network?
  This boils down to whether we have choices:
    if there are multiple correct next hops, we can try to choose the closest one.
    Chord doesn't allow much choice.
  Observe: the strict successor for a finger is not necessary.
    Sample the nodes in the successor list of the true finger, and pick the closest.
  What's the effect?
    Individual hops have lower latency.
    But there is less and less choice (lower node density) as you get close in
      ID space, so the last few hops are likely to be very long.
    Thus you don't *end up* close to the initiating node; you just get there quicker.
  How fast could proximity routing be?
    1 + 1/4 + 1/16 + 1/64 + ... = 4/3
      (a geometric series: the total latency is only a small constant factor
       more than the longest single hop)
    Not as good as real shortest-path routing!

Any downside to locality routing?
  It is harder to prove independent failure.
    Maybe no big deal, since there is no locality for successor-list sets.
  It is easier to trick me into using malicious nodes in my tables.

What about security?
  Self-authenticating data, e.g. key = SHA1(value),
    so a DHT node can't forge data (see the sketch at the end of these notes).
    Of course it's annoying to have immutable data...
  Can a DHT node claim that data doesn't exist?
    Yes, though perhaps you can check other replicas.
  Can a host join with IDs chosen to sit under every replica?
    Or "join" many times, so that it is most of the DHT nodes?
    How are IDs chosen?

Why not just keep complete routing tables?
  Then you could always route in one hop.
  Danger in large systems: timeouts, and the cost of keeping the tables up to date.
  Accordion (NSDI '05) trades off between keeping complete state and performance,
    subject to a bandwidth budget.

Are there any reasonable applications for DHTs?
  For example, could you build a DHT-based Gnutella?

Next time: wide-area storage.
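The sketch promised in the security discussion above: a minimal illustration,
in Python, of self-authenticating data. The dht_put/dht_get wrappers and the
dict standing in for a real DHT are assumptions made for the example.

    import hashlib

    def sha1(data):
        return hashlib.sha1(data).hexdigest()

    def dht_put(dht, value):
        # Self-authenticating: the key *is* the SHA-1 of the value.
        key = sha1(value)
        dht[key] = value
        return key

    def dht_get(dht, key):
        value = dht[key]
        # The fetcher re-hashes what it receives; a DHT node cannot
        # substitute forged data without failing this check.
        if sha1(value) != key:
            raise ValueError("forged or corrupted data")
        return value

    dht = {}                       # a plain dict stands in for the DHT
    k = dht_put(dht, b"immutable content")
    assert dht_get(dht, k) == b"immutable content"

This also shows the flip side noted above: because the key is bound to the
bytes, the data is immutable; changing the value changes the key.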