6.5840 2024 Lecture 18: Scaling Memcache at Facebook

Scaling Memcache at Facebook, by Nishtala et al, NSDI 2013

why are we reading this paper?
  it's an experience paper
    how did the authors scale up their system?
    what problems did they run into?
    how did they solve those problems?
  performance vs consistency vs practicality
  the authors had to learn as they went; let's learn with them

the big facebook infrastructure picture
  lots of users, friend lists, status, posts, likes, photos
    fresh/consistent data not critical -- humans are tolerant
    read-heavy (helpful)
    little locality (not helpful)
  high load: billions of storage operations per second
    much higher than a single storage server can handle
      ~100,000 simple queries/s for mysql
      ~1,000,000 get/puts/s for memcached
  multiple data centers (at least west and east coast)
  each data center -- "region":
    "real" data sharded over MySQL DBs -- big disks, but slow
    memcached layer (mc) -- fast, but limited RAM
    web servers (clients of memcached) -- "stateless"
  each data center's DBs contain a full replica
  west coast is primary, others are replicas via MySQL async log replication

let's talk about performance first
  the majority of the paper is about avoiding stale cached data
  but the staleness arose from efforts to increase performance

how do FB apps use mc?
  Figure 1
  FB uses mc as a "look-aside" cache
    the real data is in the DB
    the application talks separately to mc and (if miss or write) to the DB
    mc doesn't know about the DB
  read(k):
    v = get(k) -- hash(k) chooses which mc server/shard
    if v is nil:
      v = fetch from DB
      set(k, v)
  write(k,v):
    send k,v to DB
    delete(k)
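
  a minimal Go sketch of the look-aside pattern above -- my own illustration,
  not FB's code; the Cache and DB interfaces and the in-memory stand-ins are invented:

    package main

    import (
        "fmt"
        "sync"
    )

    // Cache and DB are hypothetical stand-ins for memcached and MySQL;
    // only the read/write structure follows the paper's look-aside pattern.
    type Cache interface {
        Get(k string) (string, bool)
        Set(k, v string)
        Delete(k string)
    }

    type DB interface {
        Read(k string) string
        Write(k, v string)
    }

    // read: try mc first; on a miss, fetch from the DB and set() into mc.
    func read(c Cache, db DB, k string) string {
        if v, ok := c.Get(k); ok {
            return v // hit
        }
        v := db.Read(k) // miss: fall back to the DB
        c.Set(k, v)     // populate mc for later readers
        return v
    }

    // write: send the new value to the DB, then delete() the cached copy
    // (the look-aside pattern invalidates rather than updates).
    func write(c Cache, db DB, k, v string) {
        db.Write(k, v)
        c.Delete(k)
    }

    // in-memory stand-ins so the sketch runs
    type memStore struct {
        mu sync.Mutex
        m  map[string]string
    }

    func (s *memStore) Get(k string) (string, bool) {
        s.mu.Lock()
        defer s.mu.Unlock()
        v, ok := s.m[k]
        return v, ok
    }
    func (s *memStore) Set(k, v string) { s.mu.Lock(); defer s.mu.Unlock(); s.m[k] = v }
    func (s *memStore) Delete(k string) { s.mu.Lock(); defer s.mu.Unlock(); delete(s.m, k) }
    func (s *memStore) Read(k string) string { v, _ := s.Get(k); return v }
    func (s *memStore) Write(k, v string)    { s.Set(k, v) }

    func main() {
        mc := &memStore{m: map[string]string{}}
        db := &memStore{m: map[string]string{}}
        write(mc, db, "post:42", "hello")
        fmt.Println(read(mc, db, "post:42")) // miss -> DB -> set
        fmt.Println(read(mc, db, "post:42")) // hit
    }
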
what is the benefit of using mc?
  it's only helpful for reads -- but reads are by far the majority of operations
  high hit rate -> reduces load on DB servers
    Table 2 says about 99% hit rate, i.e. a 100x reduction in DB read load
  this caching is not about reducing user-visible delay,
    it's about protecting the DB servers from massive overload

the client hash function determines how keys are assigned to mc servers
  it can partition (shard), or replicate, or use some combination
  all web servers use the same hash(k) function
    so if C1 caches key k, C2 will see it!

will partition or replication yield the most mc throughput?
  partition: divide keys over mc servers
  replicate: divide clients over mc servers
  partition:
    + memory-efficient (only one copy of each k/v pair)
    + most effective if load is spread evenly over keys
    - not effective if a few keys are extremely popular
    - each web server must talk to many mc servers (high packet overhead)
  replication:
    + useful if a few keys are very popular
    + can pack many requests/responses per packet (low overhead)
    - uses more memory, so fewer distinct items can be cached
    - writes are more expensive

performance and multiple regions (Section 5)

[diagram: west, db primary shards, mc servers, clients |
          east, db secondary shards, ...
          feed from db primary to secondary]

Q: what is the point of regions -- multiple complete replicas?
   lower RTT to users (east coast, west coast)
   quick local reads, from local mc and DB
     (though writes are expensive: they must be sent to the primary region)
   hot replica in case the primary site fails

Q: why not partition users over regions?
   i.e. why not keep east-coast users' data in the east-coast region, &c?
   then there would be no need to replicate: might cut hardware costs in half!
   but: social net -> not much locality
   might work well for e.g. e-mail

Q: why is performance OK despite all writes being forced to go to the primary region?
   writes are much rarer than reads
   users do not wait for all effects of writes to finish,
     i.e. for all stale cached values to be deleted

performance within a region (Section 4)

[diagram: db shards, multiple clusters, each w/ mc's and clients]

multiple mc clusters *within* each region
  cluster = complete set of mc cache servers + web servers
  each web server hashes keys over just the mc servers in its cluster

why multiple clusters per region?
  why not a single big cluster in each region?
  1. adding mc servers to a cluster doesn't help single popular keys
     replicating (one copy per cluster) does help
  2. more mcs in a cluster -> each client request talks to more servers
       and more in-cast congestion at the requesting web servers
     client requests fetch 20 to 500 keys! over many mc servers
     MUST request them in parallel (otherwise total latency is too large)
     so all the replies come back at the same time
       network switches, NICs run out of buffers
  3. hard to build a network for a single big cluster
     uniform client/server access
     so cross-section b/w must be large -- expensive
     two clusters -> 1/2 the cross-section b/w

but -- replicating is a waste of RAM for less-popular items
  "regional pool" shared by all clusters
    for unpopular objects (no need for many copies)
    the application s/w decides what to put in the regional pool
    frees mc RAM to replicate more popular objects

bringing up a new mc cluster is a performance problem
  a new cluster has a 0% hit rate
  so its clients could generate a big spike in DB load
  thus the clients of the new cluster first get() from an existing cluster (4.3)
    and set() into the new cluster
    basically a lazy copy of the existing cluster to the new cluster

another overload problem: thundering herd
  one client updates the DB and delete()s a key
  lots of clients get() but miss
    they all fetch from the DB
    not good: needless DB load
  mc gives just the first missing client a "lease"
    lease = permission to refresh from the DB
    mc tells the others "try get() again in a few milliseconds"
  effect: only one client reads the DB and does set()
    others re-try get() later and hopefully hit
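
  a minimal Go sketch of the lease idea -- the token scheme and names are my own,
  not the real memcached protocol; it also shows how a later delete() causes a
  stale set() to be dropped, which is the fix for Race 1 below:

    package main

    import (
        "fmt"
        "sync"
    )

    type leaseServer struct {
        mu     sync.Mutex
        data   map[string]string
        leases map[string]int64 // key -> outstanding lease token (real mc also times these out)
        next   int64
    }

    // Get returns (value, hit, token). On a miss, at most one caller per key
    // is handed a non-zero lease token and should refill from the DB;
    // the rest get token==0, meaning "retry in a few milliseconds".
    func (s *leaseServer) Get(k string) (string, bool, int64) {
        s.mu.Lock()
        defer s.mu.Unlock()
        if v, ok := s.data[k]; ok {
            return v, true, 0
        }
        if _, ok := s.leases[k]; ok {
            return "", false, 0 // someone else is already refilling this key
        }
        s.next++
        s.leases[k] = s.next
        return "", false, s.next
    }

    // Set is accepted only if the caller's lease is still outstanding;
    // a Delete in the meantime invalidates it, so stale refills are dropped.
    func (s *leaseServer) Set(k, v string, token int64) bool {
        s.mu.Lock()
        defer s.mu.Unlock()
        if token == 0 || s.leases[k] != token {
            return false
        }
        delete(s.leases, k)
        s.data[k] = v
        return true
    }

    // Delete removes the value and invalidates any outstanding lease.
    func (s *leaseServer) Delete(k string) {
        s.mu.Lock()
        defer s.mu.Unlock()
        delete(s.data, k)
        delete(s.leases, k)
    }

    func main() {
        s := &leaseServer{data: map[string]string{}, leases: map[string]int64{}}
        _, _, t1 := s.Get("k")             // miss: this client gets a lease
        _, _, t2 := s.Get("k")             // miss: no lease, should just retry
        fmt.Println(t1 != 0, t2 == 0)      // true true
        s.Delete("k")                      // a concurrent writer invalidates the lease
        fmt.Println(s.Set("k", "old", t1)) // false: the stale refill is rejected
    }
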
what if an mc server fails?
  can't have DB servers handle the misses -- too much load
  can't shift the load to other mc servers -- too much load
  Gutter -- a pool of idle mc servers, clients use them only after an mc server fails
  after a while, the failed mc server will be replaced
  as long as only a few mc servers are down at any one time,
    a small Gutter pool can act as backup for a large set of mc servers

The Question: why aren't invalidates (deletes) sent to Gutter servers?
  (from web servers and MySQL/McSqueal)
  my guess: a Gutter server can potentially hold *any* key,
    so all invalidates would have to be sent to all Gutter servers
    this at least doubles delete traffic
    and may place a heavy load on a small # of Gutter servers

let's talk about consistency now

first, suppose they had wanted linearizable caching?
  they would need a cache coherence protocol, as in multi-core CPUs
  many coherence schemes exist, all costly; here's a sketch of one
  [DB x=1, caches x=1]
  read miss: cache asks DB for current value
  write: clients send all writes to the DB
    DB marks the item as "locked", so caches cannot read it
    DB asks all caches to invalidate
    DB waits for responses (the paper doesn't do this...)
    DB updates its copy
    DB unlocks the item; now caches can read it
  slow! a write must wait for all replicas to acknowledge invalidation
    and the item cannot be read during that time!

why must the DB hide the new value while waiting for invalidations?
  linearizability says that a write must appear at a point in time
    x=1 at start
           |-----Wx2----|
    C1:             |--Rx2--|
    C2:                        |--Rx?--|
  C1's read implies the write's "point" was before C2's read
    so C2 must see x=2, not 1
  once any client sees x=2, no cache can be allowed to hold x=1
  the paper's scheme does *not* enforce this
    it doesn't wait for all invalidations before revealing a write
    so reading clients can see the value switch back and forth
    so the system is not linearizable

what is the paper's consistency goal?
  writes go directly to the primary DB, with transactions, so the DB stays consistent
    e.g. incrementing a "like" count will be correct
  what about reads?
    reads are not guaranteed to see the latest write
    different clients are not guaranteed to see the same values
    but not too stale! only a few seconds
    i.e. eventual consistency *and* "read-your-own-writes"
  this is a common pattern:
    updates are ACID -- and slow
    reads are not very consistent -- but fast

why is it OK that reads can yield stale data?
  the data is news feed items, postings, likes, &c
  users may see web pages with content that lags the DB a little
  few people will notice or care, as long as it's only a little
  the next time they look, mc will likely have caught up to the DB

how are DB replicas kept in sync across regions?
  one region is primary
  all clients send updates only to the primary region's DB servers
  the primary DBs distribute a log of updates to the DBs in secondary regions
  the secondary DBs apply the log
  secondary DBs are complete replicas (not caches)
  DB replication delay can be considerable (many seconds)

what do they do about now-stale cached data when the DB is written?
  there can be many cached copies of an item in a given region: one per cluster
  they delete out-of-date cached data (rather than updating it)
  1. DBs send invalidates (delete()s) to all mc servers that might cache the item
     this is McSqueal in Figure 6
  2. the writing client also invalidates mc in its local cluster,
     for read-your-own-writes

they ran into a number of DB-vs-mc consistency problems,
  due to concurrent operations affecting different parts of the system in different orders

what were the races and fixes?

Race 1: k not in cache
  C1 get(k), misses
  C1 v1 = read k from DB
    C2 writes k = v2 in DB
    C2 delete(k)
  C1 set(k, v1)
  now mc has stale data, and the delete(k) has already happened
    it will stay stale indefinitely, until k is next written
  solved with leases
    C1 gets a lease from mc; C2's delete() invalidates the lease,
      so mc ignores C1's set()
    the key is still missing, so the next reader will refresh it from the DB

Race 2: k starts with value v1
  C1 is in a secondary region
  C1 updates k=v2 in the primary DB
  C1 delete(k) -- local region
  C1 get(k), miss
  C1 reads the local DB -- sees v1, not v2!
    later, v2 arrives from the primary DB
  solved with a "remote mark"
    C1's delete() marks the key "remote"
    a get() miss that yields "remote" tells C1 to read from the *primary* region
    "remote" is cleared when the new data arrives from the primary region

Race 3: during cold cluster warm-up
  remember: on a miss, clients try get() in the warm cluster, and copy to the cold cluster
  k starts with value v1
  C1 updates k to v2 in the DB
  C1 delete(k) -- in the cold cluster
  C2 get(k), miss -- in the cold cluster
  C2 v1 = get(k) from the warm cluster, hits
  C2 set(k, v1) into the cold cluster
  now mc has stale v1, but the delete() has already happened
    it will stay stale indefinitely, until the key is next written
  solved with a two-second hold-off, used only on cold clusters (sketched below)
    after C1's delete(), the cold mc ignores set()s for two seconds
    by then, the delete() will (probably) have propagated via the DB to the warm cluster
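
  a minimal Go sketch of the hold-off rule -- the names and structure are my own,
  only the two-second rule follows the paper:

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    const holdoffPeriod = 2 * time.Second

    type coldCache struct {
        mu      sync.Mutex
        data    map[string]string
        holdoff map[string]time.Time // key -> time until which set()s are ignored
    }

    // Delete removes the key and starts its hold-off window.
    func (c *coldCache) Delete(k string) {
        c.mu.Lock()
        defer c.mu.Unlock()
        delete(c.data, k)
        c.holdoff[k] = time.Now().Add(holdoffPeriod)
    }

    // Set is ignored during the hold-off window, so a value copied from the
    // warm cluster just after a delete() cannot re-insert stale data (Race 3).
    func (c *coldCache) Set(k, v string) bool {
        c.mu.Lock()
        defer c.mu.Unlock()
        if until, ok := c.holdoff[k]; ok && time.Now().Before(until) {
            return false
        }
        c.data[k] = v
        return true
    }

    func main() {
        c := &coldCache{data: map[string]string{}, holdoff: map[string]time.Time{}}
        c.Delete("k")                 // a writer invalidates k in the cold cluster
        fmt.Println(c.Set("k", "v1")) // false: the stale warm-cluster copy is dropped
        time.Sleep(holdoffPeriod)     // by now the delete has reached the warm cluster
        fmt.Println(c.Set("k", "v2")) // true
    }
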
Q: aren't all these problems caused by clients copying DB data to mc?
   why not instead have the DB send new values to mc, so clients only read mc?
   then there would be no racing client updates &c, just ordered writes
A: that's correct in principle, but:
   1. the DB doesn't generally know how to compute the values for mc
      generally client app code computes them from DB results,
        i.e. mc content is often not simply a literal DB record
   2. it would increase read-your-own-writes delay
   3. the DB doesn't know what's cached, so it would end up sending
      lots of values for keys that aren't cached

FB/mc lessons for storage system designers?
  caching is vital for surviving high load, not just for reducing latency
  you need flexible tools for controlling partition vs replication
  linearizability is too much; eventual consistency is often not enough

---

references
  http://cs.cmu.edu/~beckmann/publications/papers/2020.osdi.cachelib.pdf
  https://engineering.fb.com/2008/08/20/core-data/scaling-out/