6.824 2013 Lecture 19: Scaling Memcache at Facebook

Scaling Memcache at Facebook, by Nishtala et al, NSDI 2013

why are we reading this paper?
  it's not an idea/technique paper
  it's an experience paper
  lots of lessons and facts about the real world of big web sites
  we can argue with their design, but not with their success

the big picture
  facebook: lots of users, friend lists, status, posts, likes, photos
  multiple data centers (at least west and east coast)
  each data center -- a "region":
    "real" data sharded over MySQL DBs
    memcached layer (mc)
    web servers (clients of memcached)
  each data center's DBs contain a full replica
  west coast is the master, others are slaves via MySQL async log replication

the paper focuses on memcached (mc)
  a cache, between the web servers and the DBs, to reduce load on the DBs
  used by many web sites

mc
  each server stores a hash table -- keys and values
    put(k, v)
    get(k) -> v
    delete(k)
  data in RAM (not disk)
  eviction? LRU list -- on put(), if memory is full, delete the LRU k/v pair
    also, put() can specify an expiration time
    but FB found these insufficient -- they rely on explicit delete()s
  very simple!

how to use multiple mc servers?
  clients hash the key to pick the server
  thus they partition the k/v pairs over the mc servers

Q: why partition -- why not replicate?
   i.e. why not have each web server use just one mc server?
   partition uses less memory (one copy of each k/v)
   partition works better if no single key is extremely popular
   replication would make writes more expensive (every copy must be invalidated)
   but partition forces each web server to contact many mc servers

how do FB apps use mc?
  read:
    v = get(k)    (client computes hash(k) to choose the mc server)
    if v is nil {
      v = fetch from DB
      put(k, v)
    }
  write:
    v = new value
    send k,v to DB
    delete(k)
  this is a "look-aside" cache
  looks simple!
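
To make the look-aside pattern and the hash-the-key partitioning above concrete, here is a minimal Go sketch. It is only an illustration, not FB's code: the "mc servers" and the "DB" are in-memory maps, and the names (serverFor, read, write) are made up; real clients also batch their gets and route requests through mcrouter.

// Sketch of the look-aside pattern and key partitioning described above.
// The "mc servers" and "DB" here are just in-memory maps standing in for
// real memcached servers and MySQL; names like serverFor/read/write are
// invented for illustration, not Facebook's API.
package main

import (
    "fmt"
    "hash/fnv"
)

const nServers = 3

var mcServers [nServers]map[string]string // one map per (fake) mc server
var db = map[string]string{}              // fake MySQL

// pick an mc server by hashing the key -- this is the "partition" scheme
func serverFor(key string) map[string]string {
    h := fnv.New32a()
    h.Write([]byte(key))
    return mcServers[h.Sum32()%nServers]
}

// look-aside read: try mc first; on a miss, fetch from the DB and fill the cache
func read(key string) string {
    mc := serverFor(key)
    if v, ok := mc[key]; ok {
        return v // hit
    }
    v := db[key] // miss: read the DB
    mc[key] = v  // put() so that later reads hit
    return v
}

// look-aside write: update the DB, then invalidate (delete) the cached value
func write(key, value string) {
    db[key] = value
    delete(serverFor(key), key)
}

func main() {
    for i := range mcServers {
        mcServers[i] = map[string]string{}
    }
    db["user:42:name"] = "alice"
    fmt.Println(read("user:42:name")) // miss, fills the cache
    fmt.Println(read("user:42:name")) // hit
    write("user:42:name", "bob")      // DB update + invalidate
    fmt.Println(read("user:42:name")) // miss again, sees the new value
}

Note that write() invalidates (delete()s) rather than updating the cached value; the next reader re-fills the cache from the DB. That choice is what makes the consistency races later in these notes possible.
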
what does FB store in mc?
  the paper does not say
  maybe userID -> name; userID -> friend list; postID -> text; URL -> likes
  basically copies of data from the DB

lessons I found surprising:
  look-aside is tricky -- consistency -- even though it seems simple
    the paper is trying to integrate mutually-oblivious storage layers
  the cache is critical:
    not about improving UX, but about surviving huge load
    cache misses and failures can wreck the DB
  replication vs partition comes up again and again (shared pools, regions, &c)
  stale data is a big headache
    and each performance fix brings a new source of staleness
  packet rate is more important than query rate => batching &c
  the caches are more of a bottleneck than the DBs
    multiple mc clusters share one set of DBs
  huge "fan-out" => parallel fetch, in-cast congestion, batching
  memory seems to be an issue -- the reason they sometimes don't replicate
    even with 100s of servers, terabytes of memory

let's talk about performance first
  the majority of the paper is about avoiding stale data
  but the staleness only arose because of the performance design

regions (Section 5)
  Q: why regions as complete replicas, rather than partitioning users?
     social net -> not much locality
     very different from e.g. e-mail
  Q: why OK performance despite all writes forced to go to the master region?
  Q: why not a PNUTS-style per-record master?
     (regions are not about latency to the user)
  Q: why write to MySQL? why not all data in memcached, eliminating the DB?
     persistence, transactions, one master copy
  Q: why isn't MySQL write load a serious bottleneck?
     after all, region replicas and mc only reduce read load
     and they make write load *worse* -- the DB has to tell all mc servers about writes
       they have 100s of mc servers
       so even if writes are only 1%, that's still significant

within a region (Section 4)

multiple mc clusters *within* each region too!
  cluster == a complete set of mc cache servers
  i.e. a replica, at least of the cached data

why multiple clusters per region?
  why not add more and more mc servers to a single cluster?
  1. adding mc servers to a cluster doesn't help single popular keys
     replicating (multiple clusters) does help
  2. more mc servers in a cluster -> each client request talks to more servers
     and more in-cast congestion
       client requests fetch 20 to 500 keys! over many mc servers
       MUST request them in parallel (otherwise total latency is too large)
       so all the replies come back at the same time
       network switches and NICs run out of buffers
  3. hard to build a network for a single big cluster
     uniform client/server access
     so cross-section b/w must be large -- a major driver of cost
     two clusters -> 1/2 the cross-section b/w

but -- replicating all objects is apparently not a good idea
  "regional pool" shared by all clusters
    unpopular objects (no need for many copies)
    decided by *type* of object
  suggests they are running into RAM limits
  again, a partition-vs-replicate tension
  clearly a good storage system needs to provide flexible controls

bringing up a new mc cluster was a serious performance problem
  a new cluster has a 0% hit rate
  if clients use it, they will generate a big DB load
  serious enough that they didn't just accept the temporary hit
  if ordinarily 99% hit rate, and (let's say) two existing clusters:
    the new cluster takes about a third of the gets and misses all of them,
    so DB load jumps from ~1% of gets to ~34% -- roughly a 30x spike!
    (Table 2 has hit-rate numbers)
  thus the clients of the new cluster first get() from an existing cluster (4.3)
    and put() into the new cluster
    basically a lazy copy of the existing cluster into the new cluster
  better a 2x load on the existing cluster than a 30x load on the DB

sending deletes from the DBs to the mc servers is apparently a problem
  one reason: the DB does not know exactly what data is cached, so it is conservative
  one delete per packet -- too many packets/second
  so: DB -> mcrouter -- many deletes in each packet
    mcrouter partitions the deletes by key, buffers them, sends batches to the mc servers
    mcrouter may also gather deletes from many DBs, for better batching

mc server failure?
  can't shift the load to one other mc server -- too much load
  can't re-partition (like FDS) -- popular keys
  Gutter -- a pool of idle servers; clients use them only after an mc server fails

thundering herd problem, caused by invalidation + look-aside
  lots of clients get() but miss
    they all fetch from the DB
    they all put()
  not good: needless DB load
  mc gives just the first missing client a "lease"
    lease = permission to refresh from the DB
    mc tells the others "try again in a few milliseconds"
  effect: only one client reads the DB and does the put()
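
A toy, single-threaded Go sketch of the lease idea, from one mc server's point of view. The names (Server, Get, Put, Delete) and the integer lease tokens are invented for illustration; the real modified memcached also has lease timeouts, rate limits, and mcrouter in the path.

// Sketch of leases as described above, as seen by one mc server.
package main

import "fmt"

type getResult struct {
    value string
    hit   bool
    lease int // nonzero: caller may fill the cache with Put(key, v, lease)
}

type Server struct {
    data      map[string]string
    leases    map[string]int // outstanding lease token per missing key
    nextToken int
}

func NewServer() *Server {
    return &Server{data: map[string]string{}, leases: map[string]int{}, nextToken: 1}
}

// Get returns the value on a hit. On a miss, the *first* caller gets a lease
// token (permission to read the DB and fill the cache); later callers get
// neither a value nor a lease -- they should retry in a few milliseconds.
func (s *Server) Get(key string) getResult {
    if v, ok := s.data[key]; ok {
        return getResult{value: v, hit: true}
    }
    if _, ok := s.leases[key]; ok {
        return getResult{} // someone else is already refreshing; retry later
    }
    s.leases[key] = s.nextToken
    s.nextToken++
    return getResult{lease: s.leases[key]}
}

// Put succeeds only if the lease is still outstanding; a Delete() in the
// meantime (an invalidation) revokes it.
func (s *Server) Put(key, value string, lease int) bool {
    if lease == 0 || s.leases[key] != lease {
        return false
    }
    delete(s.leases, key)
    s.data[key] = value
    return true
}

// Delete invalidates both the cached value and any outstanding lease.
func (s *Server) Delete(key string) {
    delete(s.data, key)
    delete(s.leases, key)
}

func main() {
    s := NewServer()
    r1 := s.Get("post:7") // miss: client 1 gets a lease, will read the DB
    r2 := s.Get("post:7") // miss: client 2 gets no lease, must retry later
    fmt.Println(r1.lease != 0, r2.lease != 0)             // true false
    fmt.Println(s.Put("post:7", "hello world", r1.lease)) // true: fill accepted
    fmt.Println(s.Get("post:7").hit)                      // true: client 2's retry now hits
}

Because Delete() revokes any outstanding lease, a put() that races with an invalidation is silently dropped -- the same mechanism that fixes Race 2 below.
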
let's talk about consistency now

what do they mean by consistency?
  *not* "a read sees the latest write"
    since that's not guaranteed across clusters
  more like "not more than a few seconds stale"
  *and* writers see their own writes
    see-your-own-writes is a big driving force

first, how are the DB replicas kept consistent?
  one region is the master
  the master DBs distribute an ordered log of updates to the DBs in the slave regions
  the slave DBs apply the updates
  the slave DBs are complete replicas (not caches)
  DB replication delay can be considerable (maybe up to 20 seconds)

how do we feel about the consistency of the DB replication scheme?
  good: it keeps the replicas consistent, b/c of the single ordered write stream
  good: we can do SQL transactions, e.g. x = x + 1
  bad: longish replication delay

could they use the same approach for the mc servers?
  (instead of clients invalidating and updating mc)
  i.e. the DB in each region sends writes to the mc servers, and the mc servers update
  would be nice: preserves the single ordered write stream

Q: why doesn't the DB send writes to the mc servers?
   1. it only solves consistency if mc holds a full replica
      since mc caches only a subset, they need a different way to insert after a miss
   2. it requires the DB to understand how to transform the data
      probably represented differently in mc than in the DB
      e.g. friend list rows -> list
      e.g. profile columns -> individual mc keys
   3. they wanted read-your-writes; but the DB update stream lags

how do they maintain mc consistency?
  i.e. how do client writes invalidate the *many* mc copies?
  1. the DBs send invalidates (delete()s) to all mc servers that might cache the key
  2. the writing client also invalidates mc in its local cluster
     for read-your-writes

what went wrong? lack of atomicity in two places
  1. write: DB update vs delete()
  2. read miss: DB read vs put()
  led to multiple races, multiple fixes

what were the races?

Race 1: suppose C1 wants to write
  C1 delete()
  C2 get(), miss
  C2 reads DB
  C1 updates DB
  DB sends delete()
  C2 put()
  oops, now mc has stale data, and the delete() has already happened
  solved by the client updating the DB (and waiting) before the delete()
    (see Figure 1)

Race 2: key not in cache
  C1 get(), misses
  C1 reads DB
  C2 updates DB
  C2 delete() -- does nothing
  C1 put()
  again, now mc has stale data, and the delete() has already happened
  solved with leases -- C1 gets a lease, but C2's invalidate deletes the lease,
    so mc ignores C1's put()

Race 3: during cold cluster warm-up
  remember clients try get() in the warm cluster, then copy to the cold cluster
  C1 updates DB
  C1 delete() -- in the cold cluster
  C2 get(), miss -- in the cold cluster
  C2 get() from the warm cluster, hits
  C2 put() into the cold cluster
  now mc has stale data, and the delete() has already happened
  solved with a two-second hold-off, used only on cold clusters
    after C1's delete(), the cold cluster ignores put()s for two seconds
    by then, the delete() will have propagated via the DB to the warm cluster

Race 4: write from a non-master region (sketch at the end of these notes)
  C1 updates the master DB
  C1 delete() -- local region
  C1 get(), miss
  C1 reads the local DB -- does not see its own write!
    later, the new data arrives from the master DB
  solved with a "remote mark"
    C1's delete() marks the key "remote"
    a get()/miss that yields "remote" tells C1 to read from the *master* region
    "remote" is cleared when the new data arrives from the master region

could they have used PNUTS?
  pro: PNUTS versions solve read-your-writes
  pro: PNUTS gives the reader control over whether stale data is OK
  pro: PNUTS updates replicas from a single write stream; perhaps fewer races
  con: PNUTS replicates everything; it's not a cache, so it uses more RAM
  con: PNUTS's query language is less expressive than MySQL's
  con: PNUTS gives less control over replication vs partition (maybe)

lessons for storage system designers?
  need better ideas for integrating storage layers with consistency
  need flexible tools for controlling partition vs replication
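
Finally, a rough Go sketch of the "remote mark" fix for Race 4, under simplifying assumptions: the marker lives in an ordinary map rather than the regional pool, the master region's DB is read directly, and replicate() stands in for the MySQL replication stream plus its invalidations. The names are invented for illustration.

// Sketch of the "remote mark" idea: a writer in a non-master region leaves a
// marker so that reads go to the master DB until the write has replicated.
package main

import "fmt"

var masterDB = map[string]string{} // master region's DB
var localDB = map[string]string{}  // this region's (lagging) replica
var cache = map[string]string{}    // this region's mc
var remoteMark = map[string]bool{} // "remote" markers (regional pool in the paper)

// write from a non-master region: update the master, mark the key "remote",
// and invalidate the local cache. The local DB only learns of the write later,
// when MySQL replication catches up (replicate() below).
func write(key, value string) {
    masterDB[key] = value
    remoteMark[key] = true
    delete(cache, key)
}

// read in the non-master region: on a miss, the marker says whether the local
// replica can be trusted yet; if not, fetch from the master region.
func read(key string) string {
    if v, ok := cache[key]; ok {
        return v
    }
    var v string
    if remoteMark[key] {
        v = masterDB[key] // the local replica may still be stale
    } else {
        v = localDB[key]
    }
    cache[key] = v
    return v
}

// replicate models the master->slave MySQL stream delivering the write;
// at that point the marker can be cleared and the cache entry invalidated.
func replicate(key string) {
    localDB[key] = masterDB[key]
    delete(remoteMark, key)
    delete(cache, key) // the invalidation that rides along with replication
}

func main() {
    localDB["status:9"] = "old"
    masterDB["status:9"] = "old"
    write("status:9", "new")      // from the non-master region
    fmt.Println(read("status:9")) // "new": read-your-writes via the master
    replicate("status:9")         // the replication stream arrives
    fmt.Println(read("status:9")) // "new": now served from the local replica
}

The point is only that reads of a marked key bypass the lagging local replica until the write has arrived, which restores read-your-writes for the writer's region.
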