6.824 2012 Lecture 19: PNUTS Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver and Ramana Yerneni. PNUTS: Yahoo!'s Hosted Data Serving Platform. Proceedings of VLDB, 2008. what is PNUTS' overall goal? [world, browsers, data centers] overall story similar to that of MegaStore 10s of data centers ("sites") all over the world web applications, e.g. mail, shopping, social net each app probably runs at all sites PNUTS keeps state for apps per-user: profile, shopping cart, friend list per-item: book popularity, user comments app might need any piece of data at any data center need to handle lots of concurrent updates to different data e.g. lots of users must be able to add items to shopping cart at same time thus 1000s of PNUTS servers must cope well with partial failures 1000s of servers => crashes must be frequent non-traditional architecture traditional would be central site, send all user requests there people have not been building this kind of system for very long but this is the way a lot of big web apps are going similar recent designs: Amazon Dynamo, Facebook's Cassandra, MegaStore warning: not much consensus or traditional wisdom here overview [diagram: 3 sites, browsers, web apps, tablet ctlrs, routers, storage units, MBs] each site has all data each table partitioned by key over storage units tablet servers + routers know the partition plan why does it make sense for every site to have all data? reads will be fast read success only depends on local cluster, not WAN reliability what are some down-sides of a copy at each site? (disk space prob not an issue for their uses) updates will be slow, need to contact every site need to keep replicas identical all must see same updates, in same order what is the data and query model? table record primary key + attributes (columns) queries only by primary key, which must be unique also supports range scan on primary key reads/writes can probably be by column so a write might replace just one column, not whole record how do updates work? app server gets web request, needs to write data in PNUTS need to update every site! why not just have app logic send update to every site? what if app crashes after updating only some sites? what if concurrent updates to same record? PNUTS has a "record master" for each record all updates must go through that site each record has a hidden column indicating site of record master responsible storage unit executes updates one at a time per record tells MB to broadcast update to all sites so the complete update story (some guesswork): app wants to update some columns of a record, knows key 1. app sends key and update to local SU1 2. SU1 looks up record master for key: SI2 3. SU1 sends update request to router at SI2 4. router at SI2 forwards update to local SU2 for key 6. SU2 sends update to local Message Broker (MB) 7. MB stores on disk + backup MB, sends vers # to original app how does MB know the vers #? maybe SU2 told it or perhaps SU2 (not MB) replies to original app 8. MB sends update to router at every site 9. every site updates local copy puzzles: paper says MB is commit point does SU2 perform update before or after talking to MB? what if SU2 crashes at a bad time? maybe they are serious that they never recover a crashed SU? who assigns version #, and when? who replies to the app? writes seem like they'd be slow -- why does it make sense? master likely to be local MB distribution is async (app does not wait) down side: readers at non-master sites may see stale data how does a read-only query execute? multiple kinds of reads (section 2.2) -- why? how does each work? why is each needed? read-any(k) read from local SU might return stale data (even if you just wrote!) why: fast! read-critical(k, required_version) maybe read from local SU if it has vers >= required_version otherwise read from master SU? why: reflects app's writes; maybe fast read-latest(k) always read from master SU (? "if local copy too stale") slow if master is remote! why: app needs fresh data what if you need to increment a counter stored in a record? app reads old value, increments locally, writes new value what if the local read produced stale data? what if read was OK, but concurrent updates? test-and-set-write(version#, new value) gives you atomic update to one record master rejects the write if current version # != version# so if concurrent updates, one will lost and retry while(1): (x, ver) = read-latest(k) if(t-a-s-w(k, ver, x+1)) break what if we wanted to do bank transfers? from one account (record) to another can t-a-s-w be used for this? not in any direct way (but maybe to update a log) nothing like 2pc for updating multiple records atomically multi-record updates are not atomic other readers can see intermediate state other writers are not locked out multi-record reads are not atomic might read one account before xfer, other account after xfer could PNUTS reasonably have provided general transactions? Megastore shows one way to do it multi-site coordination, quite slow but even Megastore has serious restrictions you get transactions just within a row or related set of rows e.g. within my e-mail msgs but doesn't allow e.g. atomic msg delivery to multiple users is lack of general transactions a problem for web applications? maybe not, if programmers know to expect it The Question how does PNUTS cope with Example 1 (page 2) Initially Alice's mother is in Alice's ACL, so mother can see photos 1. Alice removes her mother from ACL 2. Alice posts spring-break photos could her mother see update #2 but not update #1? really, could mother's app server see updates in wrong order esp if mother uses different site than Alice ACL and photo list must be in the same record since PNUTS guarantees order only for updates to same record Alice sends updates to her record's master site in order master site broadcasts via MB in order MB tells other sites to apply updates in order What if Alice's mother: reads the old ACL, that includes mother reads the new photo list answer: just one read of Alice's record, has both ACL and photo list if record doesn't have new ACL, order says it can't have new photos either what about tolerating failures? want to keep going even if some parts are broken the main players are app servers storage units record master's MB a record's master app server crashes midway through a set of updates not a transaction, so only some of writes will happen but master SU/MB either did or didn't get each write so each write happens at all sites, or none SU crashes and quickly reboots (I'm guessing here, could be wrong) may have been in the middle of applying an update don't want to toss SU's disk content and copy from another site! so SUs must have some kind of logging and recovery slave SU writes disk, then sends ACK to MB if MB got no ACK, MB will send to SU again so SU won't miss updates while it was down but: if SU is master for anything, it might disagree with MB about recent commit suggests maybe PNUTS doesn't do this kind of recovery? or master SU asks MB to commit before updating local disk? and master SU asks MB for its own recent ops after reboot? SU loses disk contents, or is down for a while now that site is missing some data! can it serve reads from local web servers by forwarding to other sites? paper doesn't say need to restore disk content from SUs at other sites 1. subscribe to MB feed, and save them for now 2. copy content from SUs at other sites 3. replay saved MB updates MB crashes after accepting update logs to disks on two MB server before ACKing recovery looks at log, (re)sends logged msgs record master may re-send an update if MB crash before ACK record version #s will allow SUs to ignore duplicate MB is a neat idea atomic: updates all replicas, or none so failure of app srvrs isn't a problem reliable: keeps trying, to cope with temporarily SU/site failure async: apps don't have to wait for write to complete, good for WAN ordered: keeps replicas identical even w/ multiple writers record's master site loses network connection can other sites designate a replacement RM? no: original RM may still be processing updates don't want *two* RMs! do other sites have to wait indefinitely? this is what the end of section 2.2 is about -- Dynamo envy how to change record's master if no failures? e.g. I move from Boston to LA perhaps just update the record, via old master? since ID of master site is stored in the record a few subsequent updates might go to the old master it will reject them, app retries and finds new master Do they need Paxos? paper does not mention agreement for tablet controllers to agree on which SU stores each tablet? for MBs to agree on which is primary? Evaluation focuses on latency and scaling 5.2: time for an insert while busy depends on how far away Record Master is RM local: 75.6 ms RM nearby: 131.5 ms RM other coast: 315.5 ms what is 5.2 measuring? from what to what? maybe web server starts insert, to RM replies w/ new version? probably not time for MB to propagate to all sites since then local RM wouldn't be < remote Why 75 ms? Is it 75 ms of network speed-of-light delay? no: local Is the 75 ms mostly queuing, waiting for other client's operations? no: they imply 100 clients was max that didn't cause delay to rise End of 5.2 suggests 40 ms of 75 ms in in SU how could it take 40 ms? each key/value is one file? creating a file takes 3 disk writes (directory, inode, content)? what's the other 35 ms? But only 33 ms (not 75) for "ordered table" (MySQL/Innodb) closer to the one disk write we'd expect 5.3 / Figure 3: effect of increasing request rate what do we expect for graph w/ x-axis req rate, y-axis latency? system has some inherent capacity, e.g. total disk seeks/second for rates less than that, constant latency for rates higher than that, growing queues, divergent average latency blow-up should be at max capacity of h/w e.g. # disk arms / seek time we don't see that in Figure 3 end of 5.3 says clients too slow at >= 75 ms/op, 300 clients -> about 4000/sec text says max possible rate was about 3000/second that's higher than 1300 from section 5.2 -- why? probably 5.3 has lots of reads as well as writes stepping back, what were PNUTS key design decisions? 1. async replication fast reads, but stale fast writes if near master, but not visible for a while 2. relaxed consistency / no transactions 3. primary-backup replication: sequence all writes thru master site pro: keeps replicas identical, enforces serial order on updates, easy to reason about con: no progress if master site disconnected Next: Dynamo, a very different design async replication, but no master eventual consistency always allow updates tree of versions if network partitions readers must reconcile versions