6.824 Scaling Memcached at Facebook FAQ

Q: Does the paper's design eliminate the possibility of stale data?

A: No, the design allows clients to read stale data from memcached in some fairly common situations. For example, if a client writes some data in the database, there will be a delay before mcsqueal sends out invalidates (delete()s) to all the memcached servers that may be caching data derived from that write, in all the clusters in the region. A client that reads during that delay may read old cached data, not the newly written data.

Q: Why is it OK for memcache to yield stale data?

A: The cached data is typically displayed to users on web pages, for example news feed items, friend status, and messages. If the data is out of date by a fraction of a second, users will usually not notice. The big danger they are avoiding is long-term caching of stale data. It's OK to serve data that's out of date by a few seconds. It's not OK to serve data that's out of date by hours. Without the paper's machinery, unbounded memcached staleness could arise due to lost deletes or out-of-order updates.

Q: What if a client reads stale data from memcached, computes something based on it, and writes the result to the database?

A: Facebook's application programmers don't write code like that. Instead, the client sends a transaction to the database; the transaction includes both the reads and the writes. So updates get strong consistency and don't involve stale data.

Q: Why do they use memcached at all? Why not just read directly from the MySQL database servers in the "storage cluster"?

A: The MySQL servers are not nearly fast enough to serve the volume of reads generated by Facebook's web servers. memcached is orders of magnitude faster than MySQL.

Q: If one had unlimited time and engineering resources, what would a no-compromises design look like?

A: Ideally a design would handle billions of requests per second, be easy for application programmers to use, work well for users spread all over the world, provide strong consistency, and not cost too much. That's a hard set of goals, and I don't know of a satisfying answer. One source of problems is that MySQL, while powerful and easy to use, has relatively low performance. So one could imagine using a faster database, such as FaRM, which might eliminate the need for a cache, and thus eliminate problems with cache consistency. Another source of problems is the lack of integration between memcached and MySQL. Perhaps one could have the cache and database cooperate more closely; it might help if the cache (rather than the application) controlled the handling of cache misses, and if the database were in sole charge of updating or invalidating cached data. For geographic distribution, have a look at Yahoo's PNUTS, which was designed from the start to have useful consistency properties (though not linearizability) while supporting multiple regions.

Q: What's the difference between the paper's "memcached" and "memcache"?

A: "memcached" refers to the software, which you can find here:

https://github.com/memcached/memcached

memcached is a simple and fast key/value server. It stores data in RAM, with no fault tolerance, so people only use it for caching (not for persistent storage). The paper uses "memcache" to refer to Facebook's set of servers running memcached.

Q: What is the "stale set" problem in 3.2.1, and how do leases solve it?

A: Here's an example of the "stale set" problem that could occur if there were no leases:

1. Client C1 asks memcache for k; memcache says k doesn't exist.
2. C1 asks MySQL for k, MySQL replies with value 1. C1 is slow at this point for some reason...
3. Someone updates k's value in MySQL to 2.
4. MySQL/mcsqueal/mcrouter send an invalidate for k to memcache, though memcache is not caching k, so there's nothing to invalidate.
5. C2 asks memcache for k; memcache says k doesn't exist.
6. C2 asks MySQL for k, MySQL replies with value 2.
7. C2 installs k=2 in memcache.
8. C1 installs k=1 in memcache.

Now memcache has a stale version of k, and it may never be updated.

The paper's leases fix the example:

1. Client C1 asks memcache for k; memcache says k doesn't exist, returns lease L1 to C1, and remembers the lease.
2. C1 asks MySQL for k, MySQL replies with value 1. C1 is slow at this point for some reason...
3. Someone updates k's value in MySQL to 2.
4. MySQL/mcsqueal/mcrouter send an invalidate for k to memcache, though memcache is not caching k, so there's nothing to invalidate. But memcache does invalidate C1's lease L1 (deletes L1 from its set of valid leases).
5. C2 asks memcache for k; memcache says k doesn't exist, and returns lease L2 to C2 (since there was no current lease for k).
6. C2 asks MySQL for k, MySQL replies with value 2.
7. C2 installs k=2 in memcache, supplying valid lease L2.
8. C1 installs k=1 in memcache, supplying invalid lease L1, so memcache ignores C1.

Now memcache is left caching the correct k=2.

Q: What is the "thundering herd" problem in 3.2.1, and how do leases solve it?

A: The thundering herd problem:

* key k is popular -- lots of clients read it.
* ordinarily clients read k from memcache, which is fast.
* but suppose someone writes k, causing it to be invalidated in memcache.
* for a while, every client that tries to read k will miss in memcache.
* they will all ask MySQL for k.
* MySQL may be overloaded with too many simultaneous requests.

The paper's leases solve this problem by allowing only the first client that misses to ask MySQL for the latest data. The other clients wait for a bit to give the first client a chance to fetch the data from MySQL and install it in memcache, then the other clients re-try memcache.

Q: Why do writing clients delete() from memcache, rather than updating the values in memcache?

A: Suppose two clients, C1 and C2, are updating the same data at the same time. C1 writes value "x", and C2 writes value "y". They both send their updates to the database, which executes the writes in one order or the other. Let's suppose the database executes C1's write("x") first, then C2's write("y"), so that the final value in the database is "y". Then C1 sends put(k, "x") to memcached, and C2 sends put(k, "y") to memcached, at about the same time. Memcached may execute the requests in either order, so it may execute C2's put("y") first, and C1's put("x") second, so that memcached ends up caching "x". Now memcached is caching a value that differs from the one in the database, which is a bad situation. This problem doesn't arise if C1 and C2 delete() instead of put().

Q: What is mcrouter?

A: The point of mcrouter is to aggregate memcached RPCs from many clients and send them in big batches to memcached servers. It's more efficient to have a smallish number of mcrouter servers talk to memcached than a large number of individual clients. One reason is that there's overhead to each network (TCP) connection; better that each memcached server have a TCP connection per mcrouter than per client. Another reason is that there's overhead (packet header space and interrupts) for each packet, so it's helpful that mcrouter can pack many client requests into each TCP packet.

Q: Isn't it wasteful that the gutter servers are idle when they aren't taking over for a failed server? Why not use the gutter servers for ordinary memcached service as well as gutter?

A: I think non-gutter memcached servers are often close to fully loaded, and have little spare capacity. If one fails, the replacement server needs to have been more or less idle, in order to handle the failed server's load.

Q: How do Section 4.2's regional pools reduce the number of replicas?

A: Each region has multiple clusters. Each cluster has a complete cache. Thus a given data item may be cached in each of the clusters.
If there are N clusters in a region, there may be N distinct cached copies of a data item, one per cluster. Items that are cached in the regional pool are only cached once per region, not N times. The tradeoff is that the potential serving capacity is N times higher if there are N copies.

Q: What storage system work has gone on at Facebook since this paper?

A: Here's a sample:

https://www.usenix.org/system/files/conference/atc13/atc13-bronson.pdf
https://www.cs.princeton.edu/~wlloyd/papers/existential-sosp15.pdf
http://www.cs.cmu.edu/~beckmann/publications/papers/2020.osdi.cachelib.pdf
https://www.usenix.org/system/files/fast21-pan.pdf

Q: Why not just put a cache into MySQL, where it can be better integrated to provide good consistency?

A: It would be fantastic if someone could add a transparent cache to MySQL that made it as fast as a cluster of memcached servers. But no-one knows how to do that. MySQL in fact does quite a bit of caching, and it's still much slower than memcached. Presumably a lot of the reason is that MySQL presents a much more powerful and complex interface than memcached (MySQL supports SQL queries, an interface which is about 1000x as complex as memcached's put()/get()/delete()).

Q: Figure 11 shows that memcache can serve data that is even a day old. Although this happens with low probability, couldn't it still cause significant, perhaps catastrophic, problems in applications using memcache?

A: Yes, indeed. It is something FB has struggled with because it makes writing applications more challenging. It is the topic of two follow-on papers (see the references above). The bottom line of these papers is still roughly the same: the probability of inconsistency is so low that they are willing to accept it. Even though a few users in principle might be able to notice the inconsistency, they probably won't realize it or care. (Their target applications are not banking applications.)

Q: How does the MySQL replication system work?

A: See https://dev.mysql.com/doc/refman/8.0/en/replication.html. FB uses the log-based replication scheme as a component of the publish/subscribe system, as described in "Wormhole: Reliable Pub-Sub to support Geo-replicated Internet Services", Sharma et al, 2015. The core of the replication scheme is to read updates from MySQL's transaction log and send them to the backup, which applies them to its data.

Q: What does "look-aside" caching refer to?

A: The cache sits off to the side, as opposed to sitting in between the application and the storage layer. If the application misses in the cache, the application retrieves the database records and updates the cache, instead of the cache doing it. This arrangement is relatively simple, since the cache and database don't have to know about each other, and the application is free to use different key schemes for the cache versus the database.

Q: How is the privacy of user data stored in memcached protected?

A: The paper does not touch on this question, so we can only guess. The first line of defense is likely that only Facebook's own computers can talk to their memcached and MySQL servers. Probably there are firewalls between Facebook's datacenters and the Internet so that no outsider can directly contact any of Facebook's internal servers.

At a higher level, Facebook's web servers have code that decides what data to reveal to whom. If you entrust sensitive information to Facebook, you have to trust that their code has notions similar to your own about who should see your information. One also has to think about the possibility of bugs in Facebook's permissions code, of bugs that allow outside hackers to break into Facebook's computers, or of corrupt, malicious, or careless Facebook employees. Dealing with such threats requires internal controls that limit how data can be used even within Facebook.
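To tie several of the answers above together (look-aside caching, leases for the stale-set problem, and delete-on-write), here is a toy, single-process sketch in Python. It is not Facebook's implementation: the class and function names are mine, a dict stands in for MySQL, integers stand in for lease tokens, and the real system's timeouts, thundering-herd rate limiting, and mcsqueal pipeline are omitted.

```python
import itertools

class Memcache:
    """Toy in-memory model of one memcached server with leases (paper 3.2.1)."""
    def __init__(self):
        self.data = {}                    # key -> cached value
        self.leases = {}                  # key -> outstanding lease token
        self.next_lease = itertools.count(1)

    def get(self, key):
        """Return (value, None) on a hit, or (None, lease) on a miss."""
        if key in self.data:
            return self.data[key], None
        lease = next(self.next_lease)
        self.leases[key] = lease          # remember who may fill this key
        return None, lease

    def set(self, key, value, lease):
        """Install a value only if the caller's lease is still valid."""
        if self.leases.get(key) == lease:
            self.data[key] = value
            del self.leases[key]
            return True
        return False                      # stale set: silently ignored

    def delete(self, key):
        """Invalidate the cached value AND any outstanding lease."""
        self.data.pop(key, None)
        self.leases.pop(key, None)

db = {}                                   # stand-in for MySQL
mc = Memcache()

def read(key):
    """Look-aside read: try the cache, fall back to the database on a miss."""
    value, lease = mc.get(key)
    if value is None:
        value = db[key]                   # application, not cache, does the fill
        mc.set(key, value, lease)
    return value

def write(key, value):
    """Write the database, then delete() (not update) the cached copy."""
    db[key] = value
    mc.delete(key)                        # plays the role of mcsqueal's invalidate

# Replay the stale-set walkthrough from above:
db["k"] = 1
_, lease1 = mc.get("k")                   # step 1: C1 misses, gets lease L1
v1 = db["k"]                              # step 2: C1 reads 1, then stalls...
write("k", 2)                             # steps 3-4: update; delete() kills L1
v2 = read("k")                            # steps 5-7: C2 misses, installs k=2
assert mc.set("k", v1, lease1) is False   # step 8: C1's stale set is rejected
assert read("k") == 2                     # cache holds the fresh value
```

Without the lease check in set(), the final mc.set("k", v1, lease1) would overwrite the cache with the stale value 1, reproducing the problem the paper describes.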