6.824 2017 Lecture 16: PNUTS Cooper et al, PNUTS: Yahoo!'s Hosted Data Serving Platform. VLDB 2008. why this paper? introduction to "geographic replication" a real-world case study of relaxed consistency the setting web sites with data centers ("regions") all over the world users see better performance by contacting nearest region each region stores a replica of all data user profile, messages, shopping cart, friend list, &c potential for web servers to read/write local copy quickly better fault tolerance for e.g. whole-datacenter power failure the big truth long-distance communication is slow coast-to-coast round-trip Internet is > 50ms 30ms of that is speed-of-light hard to make both reads and writes fast if you want consistency we can make either reads or writes fast maintain many replicas either read or write the nearest replica but we can't make both reads and writes fast if we read only the nearest replica, writes must write all replicas if we write only the nearest, reads must read all either read or write must be slow or we must sacrifice consistency PNUTS (and many others) struggle with this truth PNUTS design overview [diagram: 3 regions, browsers, web apps, routers, SUs, MBs] each region has a complete copy of all data each table partitioned by key over storage units routers know the partition plan why replicas of *all* data in all regions? so any user can read anything quickly good for data used by many users / many regions cross-region reads would be slow (~100ms) so failed SU can be repaired from a different region's copy drawbacks of a copy at each region? updates need to be sent to every region local reads may be stale concurrent updates of same data need to be sorted out keep replicas identical don't lose updates (e.g. read-modify-write for counter) uses more disk space how do updates work? app server gets web request, needs to write data in PNUTS need to update every region! why not just have app logic send update to every region? what if app crashes after updating only some regions? what if concurrent updates to same record? PNUTS has a "record master" for each record the master is a region all web apps send updates for a record through record's master MB master MB chooses the order to send out writes to all regions why per-record master? (why not e.g. same master for all data?) a record is often mostly read/written by a single user PNUTS tries to make record's master a region close to the user thus fast writes each record has a hidden column indicating region of record master the complete update story (many guesses, paper doesn't explain) app wants to update a record, knows key 1. app sends update to local router 2. router forwards to local SU for key 3. SU looks up record master for key: region R2 4. SU sends update request to R2's MB 5. R2's MB stores on disk + backup MB 6. MB sends update to MB at every region 7. every region (incl R2) updates local copy (MB -> router -> SU) 8. SU in R2 replies to the app with the new version # step 5 is the commit point and it's the master MB that chooses the order of writes to a record why does the SU in R2 reply to the app? why not have write() return as soon as committed to master MB? * ensures that read-latest sees any completed write * SU returns new version number, needed for subsequent read-critical * SU returns test-and-set-write success/failure MB is a neat idea atomic: updates all replicas, or none rather than app server updating replicas (crash...) reliable: logs to disk, keeps sending msgs, to cope with various failures ordered: record's replicas eventually identical even w/ multiple writers async: writes needn't wait for all regions, just master MB write order semantics suppose each record in a table looks like: Name Where What Alice home sleeping PNUTS preserves order of writes to each record individually e.g. Alice does write(Alice.What, awake) write(Alice.Where, work) write RPC only returns after committed to record master's MB readers may see home/sleeping or home/awake or work/awake no-one will see work / sleeping all replicas (regions) apply the writes in the same order "timeline consistency" BUT writes to different records do not maintain order e.g. Write(Bob.What, on duty) Write(Alice.What, off duty) readers may these in either order i.e. no-one on duty; or two people on duty if atomic hand-off important, need a single record with name of person writes are complex, take a while to finish, why OK? fundamental benefit: reads are then very fast, since local and reads usually greatly outnumber writes PNUTS mitigates write delay: application waits for MB commit but not propagation ("asynchronous") master likely to be local (they claim 80% of the time) so MB commit will often be quick still, eval says write takes 300ms if master is remote! down side: readers at non-master regions may see stale data how stale might a non-master record be? depends on how quickly MB sends updates to regions guess: less than a second usually longer if network slow/flaky, or MB busy how do apps cope with potentially stale local replica? sometimes stale is ok: looking at other people's profiles sometimes stale is not ok: shopping cart after add/delete item application gets to choose how consistent (section 2.2) read-any(k) read from local SU fast but maybe stale read-critical(k, required_version) fast local read if local SU has vers >= required_version otherwise slow read from master SU why: app knows it just wrote, wants value to reflect write read-latest(k) guaranteed to see any completed write thus per-record strong consistency reads from master SU and depends on write not returning until master SU is updated slow if master is remote! what if app needs to increment a counter stored in a record? is read-latest(k), increment, then write(k) enough? not if there might be concurrent updates! test-and-set-write(version#, new value) gives you atomic update to one record SUs reject the write if current version # != version# so if concurrent updates, one will lose and retry while(1): (x, ver) = read-latest(k) if(t-a-s-w(k, ver, x+1)) break t-a-s-w is fully general for single-record atomic read-modify-write why not atomic multi-record writes? e.g. for bank transfers: Alice -= $10 ; Bob += $10 would that be easy to add to PNUTS? no: seems to require single master, not master per record The Question what's the potential problem in Example 1? Alice's profile record contains both ACL and picture list 1. Alice removes her mom from ACL 2. Alice adds a picture to list mere eventual consistency might perform second write first Mom could then (briefly) see the picture how does PNUTS solve this problem? timeline consistency PNUTS guarantees all replicas perform the two writes in issue order PNUTS solution requires that both items be in the same record if ACL and list were in different records, timeline consistency doesn't help PNUTS solution requires mom read Alice's record just once not one read of ACL, a later read of picture list how to change record's master (if no failures)? e.g. I move from Boston to LA perhaps just update the record, via old master? since ID of master region is stored in the record old master announces change over MB a few clients might try to send writes to the old master SU it will reject them, app retries and finds new master what about tolerating failures? app server crashes midway through a set of updates not a transaction, so the completed writes will happen but master SU/MB either did or didn't get each write so each write happens at all regions, or none apps may need to be careful about order of writes SU down briefly, or network temporarily broken/lossy MB keeps trying until SU acks SU shouldn't ACK until safely on disk So after SU reboots, MB will cause it to catch up SU must ignore duplicate writes from MB SU never restarts, perhaps due to disk failure spread its tablets over other SUs need to fetch tablet content from SUs at other regions 1. subscribe to MB feed, and save writes 2. copy tablet from SU at another region 3. replay saved MB writes puzzle: how to ensure we didn't miss any MB updates for this SU? e.g. subscribe to MB at time=100, but source SU only saw through 90? checkpoint messages in 3.2.2 must fix this; not clear how MB crashes after accepting update logs to disks on two MB servers before ACKing MB recovery looks at log, (re)sends logged but un-acked msgs record's master region loses network connection can we change record's master to another region? no: original master MB may have logged and ACKed updates, not yet sent out so apps have to wait for region to revive in order to write this is one price of ordered updates now performance evaluation focuses on latency, not throughput why latency? why not throughput? maybe throughput can be indefinitely increased by adding more SUs whereas latency is harder to reduce big performance question: why isn't the MB a terrible bottleneck? all writes funnel through a few MBs which log every write to disk -- 10ms per disk write paper doesn't say how PNUTS avoids MB bottleneck perhaps by batching many writes into each disk log write 5.2: time for an insert while busy probably measuring time until client sees reply via master MB and SU depends on how far away master region is master local: 75.6 ms master nearby: 131.5 ms master other coast: 315.5 ms Why 75 ms for write to locally mastered record? speed-of-light delay? no: local queuing, waiting for other client's operations? no: they imply 100 clients was max that didn't cause delay to rise End of 5.2 suggests 40 ms of 75 ms in in SU how could it take 40 ms? each key/value is one file? creating a file takes 3 disk writes (directory, inode, content)? what's the other 35 ms? MB disk write? 5.3 / Figure 3: effect of increasing request rate what do we expect for graph w/ x-axis req rate, y-axis latency? for lower rates, constant latency but system has some max capacity, due to disk seeks or CPU for higher rates, queue grows rapidly, avg latency blows up blow-up should be near max capacity of h/w we don't see a blow-up in Figure 3 end of 5.3 says clients too slow text says max rate was about 3000/second 10% writes, so 300 writes/second 5 SU per region, so 60 writes/SU/second about right if each write does a random disk I/O but you'll need lots of SUs for millions of active users stepping back, what were PNUTS' key design decisions? 1. replication of all data at multiple regions fast reads, slow but async writes 2. relaxed consistency -- stale reads needed if you want fast reads and async writes 3. sequence all writes thru record master good for consistency; bad for latency; maybe bad for fault tolerance 4. ordered writes only for individual records simplifies system; maybe awkward for applications Points where the paper isn't clear What's the sequence of messages during a write? How/where is each record's version # maintained? The two-part version # implies it's remembered after delete -- where? How are a delete and subsequent insert of same key ordered? Paper says write() returns new version # -- how does it find out? How does client learn fate of test-and-set-write? What chooses write order -- master SU, or master MB? How do writes and failure interact with changing a record's master? Next: measurements of eventual consistency at Facebook then Amazon's Dynamo, a real-world DB with even looser consistency