6.824 2012 Lecture 19: PNUTS

Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam
Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel
Weaver and Ramana Yerneni. PNUTS: Yahoo!'s Hosted Data Serving
Platform. Proceedings of VLDB, 2008.

what is PNUTS' overall goal?
  [world, browsers, data centers]
  overall story similar to that of MegaStore
  10s of data centers ("sites") all over the world
  web applications, e.g. mail, shopping, social net
    each app probably runs at all sites
  PNUTS keeps state for apps
    per-user: profile, shopping cart, friend list
    per-item: book popularity, user comments
  app might need any piece of data at any data center
  need to handle lots of concurrent updates to different data
    e.g. lots of users must be able to add items to shopping cart at same time
    thus 1000s of PNUTS servers
  must cope well with partial failures
    1000s of servers => crashes must be frequent

non-traditional architecture
  traditional would be central site, send all user requests there
  people have not been building this kind of system for very long
  but this is the way a lot of big web apps are going
    similar recent designs: Amazon Dynamo, Facebook's Cassandra, MegaStore
  warning: not much consensus or traditional wisdom here

overview
  [diagram: 3 sites, browsers, web apps, tablet ctlrs, routers, storage units, MBs]
  each site has all data
  each table partitioned by key over storage units
    tablet servers + routers know the partition plan

why does it make sense for every site to have all data?
  reads will be fast
  read success only depends on local cluster, not WAN reliability

what are some down-sides of a copy at each site?
  (disk space prob not an issue for their uses)
  updates will be slow, need to contact every site
  need to keep replicas identical
    all must see same updates, in same order

what is the data and query model?
  table
    record
      primary key + attributes (columns)
  queries only by primary key, which must be unique
  also supports range scan on primary key
  reads/writes can probably be by column
    so a write might replace just one column, not whole record

how do updates work?
  app server gets web request, needs to write data in PNUTS
  need to update every site!
  why not just have app logic send update to every site?
    what if app crashes after updating only some sites?
    what if concurrent updates to same record?

PNUTS has a "record master" for each record
  all updates must go through that site
    each record has a hidden column indicating site of record master
  responsible storage unit executes updates one at a time per record
  tells MB to broadcast update to all sites

so the complete update story (some guesswork):
  app wants to update some columns of a record, knows key
  1. app sends key and update to local SU1
  2. SU1 looks up record master for key: SI2
  3. SU1 sends update request to router at SI2
  4. router at SI2 forwards update to local SU2 for key
  6. SU2 sends update to local Message Broker (MB)
  7. MB stores on disk + backup MB, sends vers # to original app
     how does MB know the vers #? maybe SU2 told it
     or perhaps SU2 (not MB) replies to original app
  8. MB sends update to router at every site
  9. every site updates local copy

puzzles:
  paper says MB is commit point
  does SU2 perform update before or after talking to MB?
    what if SU2 crashes at a bad time?
    maybe they are serious that they never recover a crashed SU?
  who assigns version #, and when?
  who replies to the app?

writes seem like they'd be slow -- why does it make sense?
  master likely to be local
  MB distribution is async (app does not wait)
  down side: readers at non-master sites may see stale data

how does a read-only query execute?
  multiple kinds of reads (section 2.2) -- why?
  how does each work?
  why is each needed?
  read-any(k)
    read from local SU
    might return stale data (even if you just wrote!)
    why: fast!
  read-critical(k, required_version)
    maybe read from local SU if it has vers >= required_version
    otherwise read from master SU?
    why: reflects app's writes; maybe fast
  read-latest(k)
    always read from master SU (? "if local copy too stale")
    slow if master is remote!
    why: app needs fresh data

what if you need to increment a counter stored in a record?
  app reads old value, increments locally, writes new value
  what if the local read produced stale data?
  what if read was OK, but concurrent updates?

test-and-set-write(version#, new value) gives you atomic update to one record
  master rejects the write if current version # != version#
  so if concurrent updates, one will lost and retry 
  while(1):
    (x, ver) = read-latest(k)
    if(t-a-s-w(k, ver, x+1))
      break

what if we wanted to do bank transfers?
  from one account (record) to another
  can t-a-s-w be used for this?
    not in any direct way (but maybe to update a log)
  nothing like 2pc for updating multiple records atomically
  multi-record updates are not atomic
    other readers can see intermediate state
    other writers are not locked out
  multi-record reads are not atomic
    might read one account before xfer, other account after xfer

could PNUTS reasonably have provided general transactions?
  Megastore shows one way to do it
  multi-site coordination, quite slow
  but even Megastore has serious restrictions
    you get transactions just within a row or related set of rows
    e.g. within my e-mail msgs
    but doesn't allow e.g. atomic msg delivery to multiple users

is lack of general transactions a problem for web applications?
  maybe not, if programmers know to expect it

The Question
  how does PNUTS cope with Example 1 (page 2)
  Initially Alice's mother is in Alice's ACL, so mother can see photos
  1. Alice removes her mother from ACL
  2. Alice posts spring-break photos
  could her mother see update #2 but not update #1?
    really, could mother's app server see updates in wrong order
    esp if mother uses different site than Alice
  ACL and photo list must be in the same record
    since PNUTS guarantees order only for updates to same record
  Alice sends updates to her record's master site in order
    master site broadcasts via MB in order
    MB tells other sites to apply updates in order
  What if Alice's mother:
    reads the old ACL, that includes mother
    reads the new photo list
    answer: just one read of Alice's record, has both ACL and photo list
      if record doesn't have new ACL, order says it can't have new photos either

what about tolerating failures?
  want to keep going even if some parts are broken

the main players are
  app servers
  storage units
  record master's MB
  a record's master

app server crashes midway through a set of updates
  not a transaction, so only some of writes will happen
  but master SU/MB either did or didn't get each write
    so each write happens at all sites, or none

SU crashes and quickly reboots
  (I'm guessing here, could be wrong)
  may have been in the middle of applying an update
  don't want to toss SU's disk content and copy from another site!
  so SUs must have some kind of logging and recovery
  slave SU writes disk, then sends ACK to MB
  if MB got no ACK, MB will send to SU again
    so SU won't miss updates while it was down
  but:
    if SU is master for anything, it might disagree with MB about recent commit
    suggests maybe PNUTS doesn't do this kind of recovery?
    or master SU asks MB to commit before updating local disk?
      and master SU asks MB for its own recent ops after reboot?

SU loses disk contents, or is down for a while
  now that site is missing some data!
  can it serve reads from local web servers by forwarding to other sites?
    paper doesn't say
  need to restore disk content from SUs at other sites
    1. subscribe to MB feed, and save them for now
    2. copy content from SUs at other sites
    3. replay saved MB updates

MB crashes after accepting update
  logs to disks on two MB server before ACKing
  recovery looks at log, (re)sends logged msgs
  record master may re-send an update if MB crash before ACK
    record version #s will allow SUs to ignore duplicate

MB is a neat idea
  atomic: updates all replicas, or none
    so failure of app srvrs isn't a problem
  reliable: keeps trying, to cope with temporarily SU/site failure
  async: apps don't have to wait for write to complete, good for WAN
  ordered: keeps replicas identical even w/ multiple writers

record's master site loses network connection
  can other sites designate a replacement RM?
    no: original RM may still be processing updates
    don't want *two* RMs!
  do other sites have to wait indefinitely?
  this is what the end of section 2.2 is about -- Dynamo envy

how to change record's master if no failures?
  e.g. I move from Boston to LA
  perhaps just update the record, via old master?
    since ID of master site is stored in the record
  a few subsequent updates might go to the old master
    it will reject them, app retries and finds new master

Do they need Paxos?
  paper does not mention agreement
  for tablet controllers to agree on which SU stores each tablet?
  for MBs to agree on which is primary?

Evaluation focuses on latency and scaling

5.2: time for an insert while busy
  depends on how far away Record Master is
  RM local: 75.6 ms
  RM nearby: 131.5 ms
  RM other coast: 315.5 ms

what is 5.2 measuring? from what to what?
  maybe web server starts insert, to RM replies w/ new version?
  probably not time for MB to propagate to all sites
    since then local RM wouldn't be < remote

Why 75 ms?

Is it 75 ms of network speed-of-light delay?
  no: local

Is the 75 ms mostly queuing, waiting for other client's operations?
  no: they imply 100 clients was max that didn't cause delay to rise

End of 5.2 suggests 40 ms of 75 ms in in SU
  how could it take 40 ms?
    each key/value is one file?
    creating a file takes 3 disk writes (directory, inode, content)?
  what's the other 35 ms?

But only 33 ms (not 75) for "ordered table" (MySQL/Innodb)
  closer to the one disk write we'd expect

5.3 / Figure 3: effect of increasing request rate
  what do we expect for graph w/ x-axis req rate, y-axis latency?
    system has some inherent capacity, e.g. total disk seeks/second
    for rates less than that, constant latency
    for rates higher than that, growing queues, divergent average latency
  blow-up should be at max capacity of h/w
    e.g. # disk arms / seek time
  we don't see that in Figure 3
    end of 5.3 says clients too slow
    at >= 75 ms/op, 300 clients -> about 4000/sec
  text says max possible rate was about 3000/second
    that's higher than 1300 from section 5.2 -- why?
    probably 5.3 has lots of reads as well as writes

stepping back, what were PNUTS key design decisions?
  1. async replication
     fast reads, but stale
     fast writes if near master, but not visible for a while
  2. relaxed consistency / no transactions
  3. primary-backup replication: sequence all writes thru master site
     pro: keeps replicas identical,
          enforces serial order on updates,
          easy to reason about
     con: no progress if master site disconnected

Next: Dynamo, a very different design
  async replication, but no master
  eventual consistency
  always allow updates
  tree of versions if network partitions
  readers must reconcile versions