6.824 Lecture 16: PNUTS

Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein,
Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, and Ramana
Yerneni. PNUTS: Yahoo!'s Hosted Data Serving Platform. Proceedings of VLDB,
2008.

what is PNUTS' overall goal?
  [world, browsers, data centers]
  web servers at data centers ("sites") all over the world
  multiple web applications (e.g. mail, or shopping)
    an app might run at one data center, or many
  apps must read and write global state
    per-user: profile, shopping cart, friend list
    per-item: book popularity, user comments
    might need any piece of data at any data center
  must cope well with partial failures
    if one of 1000s of servers reboots, or permanently dies
    if the network to one data center is flaky or dies
  need a lot of parallel performance for OLTP
    e.g. lots of users must be able to add items to shopping carts at the same time

this is *not* a traditional architecture
  traditional would be a central site; send all user requests there
    or hard partition by application, or perhaps by user
    or independent operation, with periodic batch reconciliation
  people have not been building this kind of system for very long
    but this is the way a lot of big web apps are going
    similar recent designs: Amazon's Dynamo, Facebook's Cassandra
  warning: not much consensus or traditional wisdom here

Why not store each user's data at just one site?
  use primary/backup (Harp, Frangipani/Petal) at each site
  send queries to the user's site

Why not use Bayou?
  every site has a copy of all the data
  reads and writes complete locally
  eventual consistency

overview of PNUTS architecture
  [diagram: 3 sites, browsers, web apps, routers, storage units, MBs]
  each site has all the data
  each table is partitioned by key over the storage units
  routers know the partition plan

why does it make sense for every site to have all the data?
  reads will be fast
  read success depends only on the local cluster, not WAN reliability

what are some down-sides of a copy at each site?
  lots of disk space (this probably isn't a serious problem for them)
  updates will be slow: need to contact every site
  need to ensure every copy is identical
    it's easy for them to drift out of sync if they miss some updates
    due to network or server failures

what is the data and query model?
  tables, records, attributes (columns)
  looks like a relational DB so far...
  but queries are only by primary key, which must be unique
    insert, update, and delete must specify the exact primary key
    a lookup can specify a key range, for "ordered" tables
  so it's mostly a "put/get" or "key/value" interface
    though you can scan a range of keys
  reads/writes can probably be by column
    so a write might replace just one column, not the whole record
  the query model differs from mainstream web practice
    *not* SQL, no direct support for fancy queries:
      find users who last updated their passwords over a month ago
      find the most popular item for sale
      get a list of my friends' ages
    but good enough for:
      maintaining user profiles, storing item info
      a list of friends (range scan)
      a shopping cart

how does a read-only query execute?
  a web server at some site gets a web request
  app logic runs in that web server, in PHP or Python or whatever
  it issues PNUTS queries:
  1. send the key to a router, asking which local storage unit has that key
  2. send the read to that storage unit
  3. the storage unit replies
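a minimal sketch of that read path, in Go; the Router and StorageUnit
types and their methods are invented for illustration (the paper doesn't
specify an exact client API):

  package pnuts

  // Record is one row: column name -> value.
  type Record map[string]string

  // StorageUnit serves reads for the tablets (key ranges) it stores.
  type StorageUnit interface {
    Get(key string) (Record, error)
  }

  // Router knows the partition plan: which local SU holds which keys.
  type Router interface {
    Lookup(key string) StorageUnit
  }

  // readRecord is what the app logic in the local web server does: ask the
  // local router which SU has the key, then read from that SU. no WAN
  // traffic is needed, since every site has all the data.
  func readRecord(r Router, key string) (Record, error) {
    su := r.Lookup(key) // step 1: key -> local storage unit
    return su.Get(key)  // steps 2-3: read from the SU, return the reply
  }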
how do updates work?
  need to update every site!
  why not just have the app logic send the update to every site?
    what if the app crashes after updating only some sites?
    what if there are concurrent updates to the same record?

PNUTS has a "record master" for each record
  all updates to a record must go through that site
  the responsible storage unit executes updates one at a time per record
  each record has a hidden column indicating the site of its record master

so the complete update story (some guesswork):
  the app wants to update some columns of a record, and knows the key
  1. the app sends the key and update to its local router; the router
     decides the key is on SU1
  2. the router forwards to SU1, which knows the record master site for
     the key: SI2
  3. SU1 sends the update request to the router at SI2
  4. the router at SI2 forwards the update to the local SU2 for that key
  5. SU2 sends the update to its local Message Broker (MB)
  6. the MB stores it on disk + on a backup MB, and sends the new version #
     to the original app
     how does the MB know the version #? maybe SU2 told it
     or perhaps SU2 (not the MB) replies to the original app
  7. the MB sends the update to the router at every site
  8. every site updates its local copy

what about atomic read-modify-write?
  e.g. we need to increment a counter stored in a record
  the app reads the old value, increments it locally, writes the new value
    what if the local read produced stale data?
    what if the read was OK, but there were concurrent updates?
  test-and-set-write(version#, new value)
    gives you an atomic update to one record
    the master rejects the write if the record's current version # != version#
    so if there are concurrent updates, one will lose and retry
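a minimal sketch of that read/increment/test-and-set-write retry loop, in
Go; the Client interface, its method signatures, and the error value are
guesses, not the paper's actual API (Record is the column map from the
read-path sketch above):

  package pnuts

  import (
    "errors"
    "strconv"
  )

  // ErrStaleVersion is the (hypothetical) error the record master returns
  // when the supplied version # no longer matches the record's version.
  var ErrStaleVersion = errors.New("record version mismatch")

  type Client interface {
    // Get returns the record plus the version # it was read at.
    Get(key string) (rec Record, version int, err error)
    // TestAndSetWrite asks the record's master to apply the write only if
    // the record is still at the given version.
    TestAndSetWrite(key string, version int, rec Record) error
  }

  // incrementCounter does an atomic read-modify-write on one record: read,
  // increment locally, write back with test-and-set-write, and retry from
  // a fresh read if a concurrent writer won the race.
  func incrementCounter(c Client, key, col string) error {
    for {
      rec, version, err := c.Get(key)
      if err != nil {
        return err
      }
      n, _ := strconv.Atoi(rec[col]) // a missing column counts as 0
      rec[col] = strconv.Itoa(n + 1)
      err = c.TestAndSetWrite(key, version, rec)
      if err == nil {
        return nil // applied: the record is now at version+1
      }
      if !errors.Is(err, ErrStaleVersion) {
        return err // some other failure
      }
      // we lost the race with a concurrent update; re-read and retry
    }
  }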
what if we wanted to do bank transfers?
  i.e. move money from one account (record) to another
  no support for that! nothing like 2PC for updating multiple records atomically
  multi-record updates are not atomic
    other readers can see the intermediate state
    other writers are not locked out
  multi-record reads are not atomic
    might read one account before the xfer, and the other account after the xfer

could PNUTS reasonably have provided general transactions?
  2PC and locking are not very attractive: blocking, failures
  maybe package up a set of updates; each site applies them (since it has all the data)
    but how to ensure order? the MB only orders updates for locally-mastered records
  i don't think anyone knows a good way to do it

is the lack of general transactions a problem for web applications?
  maybe not, if programmers know to expect it

what about tolerating failures?
  want to keep going even if some parts are broken
  the main players are:
    the storage units
    the record master's MB
    a record's master site

an SU crashes and quickly reboots
  it may have been in the middle of applying an update
  we don't want to toss the SU's disk contents and copy from another site!
    so SUs must have some kind of logging and recovery
  the SU writes to disk, then sends an ACK to the MB
    if the MB got no ACK, the MB will send the update to the SU again
    so the SU won't miss updates from while it was down

an SU loses its disk contents, or is down for a while
  now that site is missing some data!
  can it serve reads from local web servers by forwarding to other sites? unclear
  it needs to restore its disk contents from SUs at other sites:
  1. subscribe to the MB feed, and save those updates for now
  2. copy content from SUs at other sites
  3. replay the saved MB updates

an MB crashes after accepting an update
  it logs to disks on two MB servers before ACKing
  recovery looks at the log and (re)sends the logged messages
  a record master may re-send an update if the MB crashed before ACKing
    record version #s allow SUs to ignore the duplicates

the MB is a neat idea
  atomic: updates all replicas, or none
    so failure of app servers isn't a problem
  reliable: keeps trying, to cope with temporary SU/site failures
  async: apps don't have to wait for the write to complete, good for the WAN
  ordered: keeps replicas identical even with multiple writers

a record's master site loses its network connection
  can the other sites designate a replacement RM?
    no: the original RM may still be processing updates
        we don't want *two* RMs!
  do the other sites have to wait indefinitely?
    this is what the end of section 2.2 is about -- Dynamo envy

how to change a record's master if there are no failures?
  e.g. I move from Boston to LA
  perhaps just update the record, via the old master?
    since the ID of the master site is stored in the record
  a few subsequent updates might still go to the old master
    it will reject them; the app retries and finds the new master

Do they need Paxos? the paper does not mention agreement
  e.g. for tablet controllers to agree on which SU stores each tablet?
  or for MBs to agree on which is primary?

The evaluation focuses on latency and scaling.

5.2: time for an insert while the system is busy
  depends on how far away the Record Master is:
    RM local: 75.6 ms
    RM nearby: 131.5 ms
    RM on the other coast: 315.5 ms
  what is 5.2 measuring? from what to what?
    maybe from when the web server starts the insert, to when the RM replies
      with the new version?
    probably not the time for the MB to propagate to all sites,
      since then a local RM wouldn't be faster than a remote one
  Why 75 ms?
    Is it 75 ms of network speed-of-light delay?
    Or is some cpu/disk/net resource a throughput bottleneck?
      i.e. we have to wait our turn behind N (99?) other requests
    what's the throughput?
      99 requests finish every 76 ms, so about 1300 requests/second total
    Could the MB limit us to 1300 req/sec?
      these are new inserts, so all going through the same MB
      it cannot be doing 1300 disk ops/sec!
        maybe it logs with group commit, batching up to 99 requests per 10 ms
      so a well-designed MB is unlikely to be the problem
    Could the network limit us to 1300 req/sec?
      how big are the updates? 500 bytes
      500 bytes * 1300/sec = 650 KB/s, about 5.2 megabits/second
      so unlikely, even over a WAN
    Could the SUs limit us to 1300 req/sec?
      the 1300 req/sec are split over 5 SUs at each site
        thus 260 req/second per SU
      does the SU have to write to disk?
        yes -- probably twice (log and real location)
        so we'd expect on the order of 100/second
      260/second is not so far off; maybe group commit for the log
    section 5.2 says the SU is the bottleneck

5.3: effect of increasing the request rate
  what do we expect for a graph with request rate on the x-axis and latency
    on the y-axis?
    the system has some inherent capacity, e.g. total disk seeks/second
    for rates less than that, constant latency
    for rates higher than that, infinite average latency
    the blow-up should be at the max capacity of the h/w
      e.g. # of disk arms / seek time
  we don't see that in Figure 3
    maybe they only explored rates less than the capacity
  the text says the max possible rate was about 3000/second
    that's higher than the 1300 from section 5.2 -- why?
    probably 5.3 has lots of reads as well as writes

stepping back, what were PNUTS' key design decisions?
  1. primary-backup replication (a per-record master, plus the master site's MB)
     pro: keeps replicas identical, enforces a serial order on updates,
       easy to reason about
     con: hard to make progress if the primary is up but partitioned
     alternate design: eventual consistency (Dynamo):
       always allow updates, keep a tree of versions
       if the network partitions, readers must reconcile versions
  2. put/get query model (no SQL)
  3. not transactional
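to make design decision 1 concrete, here is a sketch (guesswork, not code
from the paper) of what a replica SU does with the update stream it gets
from the MB: apply each record's updates in the order the record's master
assigned, and use the version # to drop duplicates the MB may redeliver
after a crash:

  package pnuts

  // Update is one record update as delivered by the message broker. the
  // version # is assigned by the record's master, so there is a single
  // serial order of updates per record.
  type Update struct {
    Key     string
    Version int
    Columns map[string]string // only the columns being overwritten
  }

  // replica is one site's copy: the records, plus the highest version
  // applied so far for each key.
  type replica struct {
    version map[string]int
    data    map[string]map[string]string // key -> record
  }

  func newReplica() *replica {
    return &replica{
      version: map[string]int{},
      data:    map[string]map[string]string{},
    }
  }

  // apply installs one MB-delivered update. the MB delivers each record's
  // updates in version order, so an update with a version <= what we have
  // already applied must be a redelivered duplicate (e.g. re-sent after an
  // MB or SU crash) and is ignored. this is how every site ends up with an
  // identical copy despite retries and multiple writers.
  func (r *replica) apply(u Update) {
    if u.Version <= r.version[u.Key] {
      return // duplicate: already applied
    }
    rec := r.data[u.Key]
    if rec == nil {
      rec = map[string]string{}
      r.data[u.Key] = rec
    }
    for col, val := range u.Columns {
      rec[col] = val // per-column write: other columns keep their old values
    }
    r.version[u.Key] = u.Version
  }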