6.824 Lecture 16: PNUTS

Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein,
Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, and Ramana
Yerneni. PNUTS: Yahoo!'s Hosted Data Serving Platform. Proceedings of VLDB,
2008.

what is PNUTS' overall goal?
  [world, browsers, data centers]
  web servers at data centers ("sites") all over the world
  multiple web applications (e.g. mail, or shopping)
    an app might run at one data center, or many
  apps must read and write global state
    per-user: profile, shopping cart, friend list
    per-item: book popularity, user comments
    might need any piece of data at any data center
  must cope well with partial failures
    if one of 1000s of servers reboots, or permanently dies
    if the network to one data center is flaky or dies
  need a lot of parallel performance for OLTP
    e.g. lots of users must be able to add items to shopping carts at the same time

this is *not* a traditional architecture
  traditional would be a central site; send all user requests there
    or hard partition by application, or perhaps by user
    or independent operation, with periodic batch reconciliation
  people have not been building this kind of system for very long
    but this is the way a lot of big web apps are going
    similar recent designs: Amazon's Dynamo, Facebook's Cassandra
  warning: not much consensus or traditional wisdom here

Why not store each user's data at just one site?
  use primary/backup (Harp, Frangipani/Petal) at each site
  send queries to the user's site

Why not use Bayou?
  every site has a copy of all the data
  reads and writes complete locally
  eventual consistency

overview of PNUTS architecture
  [diagram: 3 sites, browsers, web apps, routers, storage units, MBs]
  each site has all the data
  each table is partitioned by key over the storage units
  routers know the partition plan

why does it make sense for every site to have all the data?
  reads will be fast
  read success depends only on the local cluster, not WAN reliability

what are some down-sides of a copy at each site?
  lots of disk space (this probably isn't a serious problem for them)
  updates will be slow: need to contact every site
  need to ensure every copy is identical
    it's easy for them to drift out of sync if they miss some updates
    due to network or server failures

what is the data and query model?
  tables, records, attributes (columns)
  looks like a relational DB so far...
  but queries are only by primary key, which must be unique
    insert, update, and delete must specify the exact primary key
    a lookup can specify a key range, for "ordered" tables
  so it's mostly a "put/get" or "key/value" interface
    though you can scan a range of keys
  reads/writes can probably be by column
    so a write might replace just one column, not the whole record
  the query model differs from mainstream web practice
    *not* SQL, no direct support for fancy queries:
      find users who last updated their passwords over a month ago
      find the most popular item for sale
      get a list of my friends' ages
    but good enough for:
      maintaining user profiles, storing item info
      a list of friends (range scan)
      a shopping cart

how does a read-only query execute?
  a web server at some site gets a web request
  app logic runs in that web server, in PHP or Python or whatever
  it issues PNUTS queries:
  1. send the key to a router, asking which local storage unit has that key
  2. send the read to that storage unit
  3. the storage unit replies
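a minimal sketch of that read path, in Go; the Router and StorageUnit
types and their methods are invented for illustration (the paper doesn't
specify an exact client API):

  package pnuts

  // Record is one row: column name -> value.
  type Record map[string]string

  // StorageUnit serves reads for the tablets (key ranges) it stores.
  type StorageUnit interface {
    Get(key string) (Record, error)
  }

  // Router knows the partition plan: which local SU holds which keys.
  type Router interface {
    Lookup(key string) StorageUnit
  }

  // readRecord is what the app logic in the local web server does: ask the
  // local router which SU has the key, then read from that SU. no WAN
  // traffic is needed, since every site has all the data.
  func readRecord(r Router, key string) (Record, error) {
    su := r.Lookup(key) // step 1: key -> local storage unit
    return su.Get(key)  // steps 2-3: read from the SU, return the reply
  }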
how do updates work?
  need to update every site!
  why not just have the app logic send the update to every site?
    what if the app crashes after updating only some sites?
    what if there are concurrent updates to the same record?

PNUTS has a "record master" for each record
  all updates to a record must go through that site
  the responsible storage unit executes updates one at a time per record
  each record has a hidden column indicating the site of its record master

so the complete update story (some guesswork):
  the app wants to update some columns of a record, and knows the key
  1. the app sends the key and update to its local router; the router
     decides the key is on SU1
  2. the router forwards to SU1, which knows the record master site for
     the key: SI2
  3. SU1 sends the update request to the router at SI2
  4. the router at SI2 forwards the update to the local SU2 for that key
  5. SU2 sends the update to its local Message Broker (MB)
  6. the MB stores it on disk + on a backup MB, and sends the new version #
     to the original app
     how does the MB know the version #? maybe SU2 told it
     or perhaps SU2 (not the MB) replies to the original app
  7. the MB sends the update to the router at every site
  8. every site updates its local copy

what about atomic read-modify-write?
  e.g. we need to increment a counter stored in a record
  the app reads the old value, increments it locally, writes the new value
    what if the local read produced stale data?
    what if the read was OK, but there were concurrent updates?
  test-and-set-write(version#, new value)
    gives you an atomic update to one record
    the master rejects the write if the record's current version # != version#
    so if there are concurrent updates, one will lose and retry
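a minimal sketch of that read/increment/test-and-set-write retry loop, in
Go; the Client interface, its method signatures, and the error value are
guesses, not the paper's actual API (Record is the column map from the
read-path sketch above):

  package pnuts

  import (
    "errors"
    "strconv"
  )

  // ErrStaleVersion is the (hypothetical) error the record master returns
  // when the supplied version # no longer matches the record's version.
  var ErrStaleVersion = errors.New("record version mismatch")

  type Client interface {
    // Get returns the record plus the version # it was read at.
    Get(key string) (rec Record, version int, err error)
    // TestAndSetWrite asks the record's master to apply the write only if
    // the record is still at the given version.
    TestAndSetWrite(key string, version int, rec Record) error
  }

  // incrementCounter does an atomic read-modify-write on one record: read,
  // increment locally, write back with test-and-set-write, and retry from
  // a fresh read if a concurrent writer won the race.
  func incrementCounter(c Client, key, col string) error {
    for {
      rec, version, err := c.Get(key)
      if err != nil {
        return err
      }
      n, _ := strconv.Atoi(rec[col]) // a missing column counts as 0
      rec[col] = strconv.Itoa(n + 1)
      err = c.TestAndSetWrite(key, version, rec)
      if err == nil {
        return nil // applied: the record is now at version+1
      }
      if !errors.Is(err, ErrStaleVersion) {
        return err // some other failure
      }
      // we lost the race with a concurrent update; re-read and retry
    }
  }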
what if we wanted to do bank transfers?
  i.e. move money from one account (record) to another
  no support for that! nothing like 2PC for updating multiple records atomically
  multi-record updates are not atomic
    other readers can see the intermediate state
    other writers are not locked out
  multi-record reads are not atomic
    might read one account before the xfer, and the other account after the xfer

could PNUTS reasonably have provided general transactions?
  2PC and locking are not very attractive: blocking, failures
  maybe package up a set of updates; each site applies them (since it has all the data)
    but how to ensure order? the MB only orders updates for locally-mastered records
  i don't think anyone knows a good way to do it

is the lack of general transactions a problem for web applications?
  maybe not, if programmers know to expect it

what about tolerating failures?
  want to keep going even if some parts are broken
  the main players are:
    the storage units
    the record master's MB
    a record's master site

an SU crashes and quickly reboots
  it may have been in the middle of applying an update
  we don't want to toss the SU's disk contents and copy from another site!
    so SUs must have some kind of logging and recovery
  the SU writes to disk, then sends an ACK to the MB
    if the MB got no ACK, the MB will send the update to the SU again
    so the SU won't miss updates from while it was down

an SU loses its disk contents, or is down for a while
  now that site is missing some data!
  can it serve reads from local web servers by forwarding to other sites? unclear
  it needs to restore its disk contents from SUs at other sites:
  1. subscribe to the MB feed, and save those updates for now
  2. copy content from SUs at other sites
  3. replay the saved MB updates

an MB crashes after accepting an update
  it logs to disks on two MB servers before ACKing
  recovery looks at the log and (re)sends the logged messages
  a record master may re-send an update if the MB crashed before ACKing
    record version #s allow SUs to ignore the duplicates

the MB is a neat idea
  atomic: updates all replicas, or none
    so failure of app servers isn't a problem
  reliable: keeps trying, to cope with temporary SU/site failures
  async: apps don't have to wait for the write to complete, good for the WAN
  ordered: keeps replicas identical even with multiple writers

a record's master site loses its network connection
  can the other sites designate a replacement RM?
    no: the original RM may still be processing updates
        we don't want *two* RMs!
  do the other sites have to wait indefinitely?
    this is what the end of section 2.2 is about -- Dynamo envy

how to change a record's master if there are no failures?
  e.g. I move from Boston to LA
  perhaps just update the record, via the old master?
    since the ID of the master site is stored in the record
  a few subsequent updates might still go to the old master
    it will reject them; the app retries and finds the new master

Do they need Paxos? the paper does not mention agreement
  e.g. for tablet controllers to agree on which SU stores each tablet?
  or for MBs to agree on which is primary?

The evaluation focuses on latency and scaling.

5.2: time for an insert while the system is busy
  depends on how far away the Record Master is:
    RM local: 75.6 ms
    RM nearby: 131.5 ms
    RM on the other coast: 315.5 ms
  what is 5.2 measuring? from what to what?
    maybe from when the web server starts the insert, to when the RM replies
      with the new version?
    probably not the time for the MB to propagate to all sites,
      since then a local RM wouldn't be faster than a remote one
  Why 75 ms?
    Is it 75 ms of network speed-of-light delay?
    Or is some cpu/disk/net resource a throughput bottleneck?
      i.e. we have to wait our turn behind N (99?) other requests
    what's the throughput?
      99 requests finish every 76 ms, so about 1300 requests/second total
    Could the MB limit us to 1300 req/sec?
      these are new inserts, so all going through the same MB
      it cannot be doing 1300 disk ops/sec!
        maybe it logs with group commit, batching up to 99 requests per 10 ms
      so a well-designed MB is unlikely to be the problem
    Could the network limit us to 1300 req/sec?
      how big are the updates? 500 bytes
      500 bytes * 1300/sec = 650 KB/s, about 5.2 megabits/second
      so unlikely, even over a WAN
    Could the SUs limit us to 1300 req/sec?
      the 1300 req/sec are split over 5 SUs at each site
        thus 260 req/second per SU
      does the SU have to write to disk?
        yes -- probably twice (log and real location)
        so we'd expect on the order of 100/second
      260/second is not so far off; maybe group commit for the log
    section 5.2 says the SU is the bottleneck

5.3: effect of increasing the request rate
  what do we expect for a graph with request rate on the x-axis and latency
    on the y-axis?
    the system has some inherent capacity, e.g. total disk seeks/second
    for rates less than that, constant latency
    for rates higher than that, infinite average latency
    the blow-up should be at the max capacity of the h/w
      e.g. # of disk arms / seek time
  we don't see that in Figure 3
    maybe they only explored rates less than the capacity
  the text says the max possible rate was about 3000/second
    that's higher than the 1300 from section 5.2 -- why?
    probably 5.3 has lots of reads as well as writes

stepping back, what were PNUTS' key design decisions?
  1. primary-backup replication (a per-record master, plus the master site's MB)
     pro: keeps replicas identical, enforces a serial order on updates,
       easy to reason about
     con: hard to make progress if the primary is up but partitioned
     alternate design: eventual consistency (Dynamo):
       always allow updates, keep a tree of versions
       if the network partitions, readers must reconcile versions
  2. put/get query model (no SQL)
  3. not transactional
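to make design decision 1 concrete, here is a sketch (guesswork, not code
from the paper) of what a replica SU does with the update stream it gets
from the MB: apply each record's updates in the order the record's master
assigned, and use the version # to drop duplicates the MB may redeliver
after a crash:

  package pnuts

  // Update is one record update as delivered by the message broker. the
  // version # is assigned by the record's master, so there is a single
  // serial order of updates per record.
  type Update struct {
    Key     string
    Version int
    Columns map[string]string // only the columns being overwritten
  }

  // replica is one site's copy: the records, plus the highest version
  // applied so far for each key.
  type replica struct {
    version map[string]int
    data    map[string]map[string]string // key -> record
  }

  func newReplica() *replica {
    return &replica{
      version: map[string]int{},
      data:    map[string]map[string]string{},
    }
  }

  // apply installs one MB-delivered update. the MB delivers each record's
  // updates in version order, so an update with a version <= what we have
  // already applied must be a redelivered duplicate (e.g. re-sent after an
  // MB or SU crash) and is ignored. this is how every site ends up with an
  // identical copy despite retries and multiple writers.
  func (r *replica) apply(u Update) {
    if u.Version <= r.version[u.Key] {
      return // duplicate: already applied
    }
    rec := r.data[u.Key]
    if rec == nil {
      rec = map[string]string{}
      r.data[u.Key] = rec
    }
    for col, val := range u.Columns {
      rec[col] = val // per-column write: other columns keep their old values
    }
    r.version[u.Key] = u.Version
  }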