6.824 2012 Lecture 20: Dynamo

Dynamo: Amazon's Highly Available Key-value Store
DeCandia et al, SOSP 2007

Why are we reading this paper?
  It's a real system: used for e.g. the shopping cart at Amazon
  DB tradition: you can't have strong consistency AND high availability
    most database-like systems opt for consistency
  Dynamo's big break with tradition: eventual consistency
    one of the first production storage systems based on eventual consistency
    followed by Cassandra, Voldemort, Riak, CouchDB
    (git is also eventually consistent, though quite different)

Environment
  [data centers, app servers ("clients"), Dynamo nodes]
  *not* a full replica per site as in PNUTS, MegaStore
  sounds like e.g. 10 data centers, each item at 3 of them
  seems like this would yield frequent WAN traffic
  maybe they get locality at a higher level?
    i.e. direct UK customers to amazon.co.uk,
    with multiple datacenters in the UK and low latency between them

Things I think about when looking at a storage design:
  WAN?
  Finding data?
  Agreement on placement?
  Fresh reads? Ordered writes? (for replication and read freshness)
  Read-modify-write? (w/o lost updates)
  Durable writes?
  Partitions?
  Temporary server failure?
  Permanent server failure?

Their obsessions
  SLAs: 99.9th percentile of delay
  constant failures
  "data centers being destroyed by tornadoes"
  "always writeable"

Where does that take us?
  high availability
  always writeable
    write in any partition
    writes in multiple partitions
  no Paxos, no primary, no "view" of live servers
  conflicting versions

The big idea: eventual consistency
  accept writes at any replica
  allow divergent replicas
  allow reads to see stale or conflicting data
  resolve conflicts when failures go away
    reader must merge and then write

Why eventual consistency makes sense
  Allows writes at any time
    in a minority partition, if the master is unreachable, &c
    unlike PNUTS, which must block if the master is unreachable
    unlike Paxos-based systems, which don't work outside a majority partition
  Simplifies some parts of the system
    no need for agreement on exactly which node is master for each record
    no need for careful view change
    easier put()s, crash recovery, join, leave

What are they giving up?
  fresh reads
  simple reads: clients must merge and re-write
  nothing like PNUTS's test-and-set-write

A spectrum: IVY/yfs/Harp ... PNUTS ... Dynamo/Bayou/file-sync
  Dynamo is like a standard DB (or Harp) when all goes well
  like Bayou when there are failures

Data model
  simple k/v hash; not ordered, no range scans

Query model
  get(k) -> set of values and a "context"
    the context is version info
  put(k, v, context)
    the context indicates which versions this put merges

Data placement goals
  load balance, including as servers join/leave
  replication
  finding keys, including during failures
  encourage puts and gets to see each other
  avoid conflicting versions
  spread over many servers

Consistent hashing (a ring sketch follows the quorum discussion below)
  [ring, and physical view of servers]
  node ID = random
  key ID = hash(key)
  coordinator: the successor of the key on the ring
    clients send puts/gets to the coordinator
  join/leave only affects neighbors
  replicas at the successors: the "preference list"
    the coordinator forwards puts (and gets...) to the nodes on the preference list
  virtual nodes

Failure handling: quorum
  goal: gets should have a high probability of seeing the most recent puts
  goal: but don't block
  N nodes in the preference list
    we try to ensure they all have all data
  the coordinator waits for W replies during a put()
  the coordinator waits for R replies during a get()
  if failures aren't too crazy, the client sees all recent versions
    it may still have to merge and re-write
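A minimal Go sketch of the consistent-hashing placement above; node
names and the hash are invented for illustration, and a real
implementation would also place several virtual nodes per server and
skip duplicate physical nodes when building the preference list:

  package main

  import (
  	"crypto/sha1"
  	"encoding/binary"
  	"fmt"
  	"sort"
  )

  // hashKey maps a key onto a point on the ring.
  func hashKey(key string) uint64 {
  	h := sha1.Sum([]byte(key))
  	return binary.BigEndian.Uint64(h[:8])
  }

  // Ring holds node positions; virtual nodes would just add more
  // positions per physical node.
  type Ring struct {
  	ids   []uint64          // sorted positions on the ring
  	nodes map[uint64]string // position -> node name
  }

  // preferenceList returns the key's N successors: the first is the
  // coordinator, the rest hold replicas.
  func (r *Ring) preferenceList(key string, n int) []string {
  	pos := hashKey(key)
  	// first node clockwise from the key's position
  	i := sort.Search(len(r.ids), func(j int) bool { return r.ids[j] >= pos })
  	var out []string
  	for k := 0; k < n && k < len(r.ids); k++ {
  		out = append(out, r.nodes[r.ids[(i+k)%len(r.ids)]])
  	}
  	return out
  }

  func main() {
  	ring := &Ring{nodes: map[uint64]string{}}
  	for _, name := range []string{"A", "B", "C", "D", "E"} {
  		pos := hashKey(name) // stands in for a random node ID
  		ring.ids = append(ring.ids, pos)
  		ring.nodes[pos] = name
  	}
  	sort.Slice(ring.ids, func(i, j int) bool { return ring.ids[i] < ring.ids[j] })
  	fmt.Println(ring.preferenceList("cart-1234", 3))
  }

Note how a join or leave only shifts range boundaries between ring
neighbors, which is the point of consistent hashing.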
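And a sketch of the coordinator's quorum wait, with an invented
Replica RPC stub (not Dynamo's real API): a put() fans out to all N
nodes on the preference list but returns after W acks, and a get()
symmetrically waits for only R replies:

  package quorum

  import (
  	"errors"
  	"fmt"
  )

  // Replica is an invented stand-in for an RPC stub to one node on
  // the preference list.
  type Replica interface {
  	Store(key, val string) error
  	Fetch(key string) (string, error)
  }

  // put forwards the write to all N replicas but returns as soon as
  // W of them acknowledge; stragglers complete in the background.
  func put(replicas []Replica, key, val string, w int) error {
  	acks := make(chan error, len(replicas))
  	for _, rep := range replicas {
  		go func(rep Replica) { acks <- rep.Store(key, val) }(rep)
  	}
  	ok := 0
  	for range replicas {
  		if <-acks == nil {
  			ok++
  			if ok >= w {
  				return nil // W acks: success, don't wait for the tail
  			}
  		}
  	}
  	return fmt.Errorf("put %q: fewer than %d acks", key, w)
  }

  // get waits for R replies; the values may be stale or conflicting,
  // in which case the client must merge and re-write.
  func get(replicas []Replica, key string, r int) ([]string, error) {
  	type reply struct {
  		val string
  		err error
  	}
  	replies := make(chan reply, len(replicas))
  	for _, rep := range replicas {
  		go func(rep Replica) {
  			v, err := rep.Fetch(key)
  			replies <- reply{v, err}
  		}(rep)
  	}
  	var vals []string
  	for range replicas {
  		if rp := <-replies; rp.err == nil {
  			vals = append(vals, rp.val)
  			if len(vals) >= r {
  				return vals, nil
  			}
  		}
  	}
  	return nil, errors.New("get: fewer than R replies")
  }

Waiting for only W or R of N is also what cuts off the tail of the
delay distribution (see the 99.9th-percentile discussion below).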
What if the coordinator can't reach W out of N nodes during a put()?
  Wait? Fail? Write to fewer than W?
  No: it just reaches farther around the ring ("hinted handoff")
    that node remembers a "hint" that the data belongs elsewhere
    and will forward it once the real home is reachable

What if the coordinator can't get R replies during a get()?
  the paper seems to say the read fails
  I wonder if it would be better to return the data you do get?
    though if the app wanted that, maybe it should set R=1

Do these quorums guarantee intersection?
  i.e. that my get() will see all previous writes?
  they are really "sloppy" quorums
    real quorums would block writers if too few nodes were available
  these quorums just increase the probability of good behavior
    i.e. that reads will see recent writes

How should clients resolve conflicts?
  depends on the application
  shopping basket: take the union?
    thus un-deletes
    weaker than Bayou, but simpler
  some apps can probably use the latest wall-clock time
    e.g. if I'm updating my password
  this is pretty simple for apps

How to detect whether two versions conflict?
  if they are not bit-wise identical, must the client always merge+write?
  we have seen this problem before... version vectors
  Example:
    v1: [a:1]
    v2: [a:1, b:2]
    the VVs indicate that v2 supersedes v1
    Dynamo nodes automatically take care of this case
  Example:
    v1: [a:1]
    v2: [a:1, b:2]
    v3: [a:2]
    v2 and v3 conflict: the client must merge

What happens if two clients write concurrently?
  e.g. to increment a counter
  each does read-modify-write
  so they both see the same initial version
  and their puts produce two conflicting versions that must be merged

Won't the VVs get big?
  Dynamo deletes the least-recently-updated entry if a VV has > 10 elements

Impact of deleting a VV entry?
  you won't realize that one version subsumes another, and will merge
  when not needed:
    put@b: [b:4]
    put@a: [a:3, b:4]
    forget b: [a:3]
    now, if you sync w/ [b:4], it looks like a merge is required
  can you mistakenly think one version DOES subsume another,
  and thus mistakenly fail to merge? I can't think of a case:
    * one reason: each node won't forget its own previous version
    * another: it's the oldest element that is dropped
      forgetting the oldest is clever, since that's the element most
      likely to be present in other branches,
      so if it's missing, a merge is forced
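A small runnable sketch of version-vector comparison using the
examples above (the VV type and descends() are illustrative, not
Dynamo's actual code):

  package main

  import "fmt"

  // VV is a version vector: node name -> per-node write counter.
  type VV map[string]int

  // descends reports whether a supersedes (or equals) b: a's counter
  // is >= b's for every node present in b. (A missing map entry reads
  // as 0, which gives the right semantics.)
  func descends(a, b VV) bool {
  	for node, n := range b {
  		if a[node] < n {
  			return false
  		}
  	}
  	return true
  }

  func main() {
  	v1 := VV{"a": 1}
  	v2 := VV{"a": 1, "b": 2}
  	v3 := VV{"a": 2}

  	fmt.Println(descends(v2, v1)) // true: v2 supersedes v1, nodes drop v1

  	// neither descends from the other: conflict, client must merge
  	fmt.Println(descends(v2, v3), descends(v3, v2)) // false false

  	// the truncation example above: forgetting b's entry in
  	// [a:3, b:4] leaves [a:3], which no longer descends from [b:4],
  	// so a sync forces a merge that wasn't really needed
  	fmt.Println(descends(VV{"a": 3}, VV{"b": 4})) // false
  }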
To which coordinator does a client send a request? (4.5)
  maybe via a *random* load balancer, and the wrong node forwards
  maybe the client library knows the key->node mapping
  maybe the client sends to any of the top N (end of Sec 5)
    and the last coordinator tells the client which of the N responded fastest
  what if the client can't contact a coordinator?

Do put()s wait for a disk write?
  4.1 implies yes
  clearly a problem: the end of 6.1 talks about fixing only one of the
    N nodes to write to disk -- and presumably that isn't one of the W
    waited for

Temporary server failures?
  where do puts go? do RPCs keep timing out?
  how to heal a server when it restarts?
    what if the node holding hinted keys fails?
    nodes nearby on the ring periodically compare/sync their DBs

Permanent server failures / additions?
  manual maintenance of the server list
  so the system doesn't have to guess permanent vs temporary failure
    good, since the handling is very different!

Is the design inherently low delay?
  no: a client may be forced to contact a distant coordinator
  no: the coordinator has to wait for W or R responses

What parts of the design are likely to help limit 99.9th-percentile delay?
  this is a question about variance, not the mean
  bad news: consulting multiple nodes per get/put is a lose
    time = max(servers); if you have to talk to lots of them, at least
    one will be slow
  good news: Dynamo only waits for W or R out of N
    cuts off the tail of the delay distribution
    e.g. if each node has a 1% chance of being busy with something else
    or if a few nodes are broken, the network is overloaded, &c

No real Eval section, only Experience

How does Amazon use Dynamo?
  shopping cart (merge)
  session info, maybe Recently Visited &c? (most recent timestamp wins)
  product list (mostly read-only, R=1 W=N)

They claim the main advantage of Dynamo is flexible N, R, W
  what do you get by varying them?
  N-R-W
  3-2-2 : default; reasonably fast R/W, reasonable durability
  3-3-1 : fast W, slow R, not very durable
  3-1-3 : fast R, slow W, durable
  3-3-3 : ??? reduces the chance of an R missing a W?
  3-1-1 : doesn't seem useful

They had to fiddle with the load balance (6.2)
  old scheme:
    a random choice of node ID meant a new node had to split old nodes' ranges
    which required expensive scans of on-disk DBs
  new scheme (a range-lookup sketch appears at the end of these notes):
    a pre-determined set of Q evenly divided ranges
    each node is coordinator for a few of them
    a new node takes over a few entire ranges
    store each range in a file, so a whole file can be transferred

How useful is the ability to have multiple versions? (6.3)
  i.e. how useful is eventual consistency?
  this is a Big Question for them
  6.3 claims 0.06% of reads see multiple versions
    so perhaps 0.06% of writes benefited from always-writeable?
    i.e. they would have blocked in a primary/backup scheme
  but maybe not! they seem to say divergent versions are caused by
    concurrent writes, not by e.g. disconnected data centers
    concurrent writes might be better solved w/ test-and-set-write

Performance / throughput (Figure 4, 6.1)
  Figure 4 says 10 ms average reads, 20 ms average writes
    the 20 ms must include a disk write
    the 10 ms probably includes waiting for R of the N
    so are the nodes all in the same datacenter? all in the same city?
  Figure 4 says the 99.9th percentile is about 100 or 200 ms
    why? "request load, object sizes, locality patterns"
    does this mean they sometimes had to wait for a coast-to-coast msg?

Wrap-up
  the big issue is eventual consistency
    maybe the only way to get high availability + no blocking on the WAN
    but PNUTS and MegaStore claim the WAN is super reliable
  an awkward model for some applications (stale reads, merges)
    hard for us to tell from the paper
  no agreement on whether it's good for storage systems
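The range-lookup sketch referenced in the load-balance discussion
(6.2) above: a minimal Go version of the fixed-Q-ranges scheme. Q,
the hash, and the membership map are invented for illustration; the
paper doesn't give Amazon's actual parameters.

  package main

  import (
  	"fmt"
  	"hash/fnv"
  )

  // Q is a made-up number of pre-determined, equal ring ranges.
  // Each node owns a few whole ranges, so a new node takes over
  // entire ranges; with one file per range, handoff is a whole-file
  // transfer rather than a scan of an on-disk DB.
  const Q = 128

  // rangeOf maps a key to its fixed range index.
  func rangeOf(key string) int {
  	h := fnv.New64a()
  	h.Write([]byte(key))
  	return int(h.Sum64() % Q)
  }

  func main() {
  	// rangeOwner stands in for the manually maintained membership
  	// map from range index -> coordinator node.
  	rangeOwner := map[int]string{42: "nodeB" /* ... */}
  	r := rangeOf("cart-1234")
  	fmt.Println("range", r, "owner", rangeOwner[r])
  }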