6.824 Scaling Memcached at Facebook FAQ

Q: Does the paper's design eliminate the possibility of stale data?

A: No, the design allows clients to read stale data from memcached in some fairly common situations. For example, if a client writes some data in the database, there will be a delay before mcsqueal sends out invalidates (delete()s) to all the memcached servers that may be caching data derived from that write, in all the clusters in the region. A client that reads during that delay may read old cached data, not the newly written data.

Q: Why is it OK for memcache to yield stale data?

A: The cached data is typically displayed to users on web pages, for example news feed items, friend status, and messages. If the data is out of date by a fraction of a second, users will usually not notice. The big danger they are avoiding is long-term caching of stale data. It's OK to serve data that's out of date by a few seconds. It's not OK to serve data that's out of date by hours. Without the paper's machinery, unbounded memcached staleness could arise due to lost deletes or out-of-order updates.

Q: What if a client reads stale data from memcached, computes something based on it, and writes the result to the database?

A: Facebook's application programmers don't write code like that. Instead, the client sends a transaction to the database; the transaction includes both the reads and the writes. So updates get strong consistency and don't involve stale data.

Q: Why do they use memcached at all? Why not just read directly from the MySQL database servers in the "storage cluster"?

A: The MySQL servers are not nearly fast enough to serve the volume of reads generated by Facebook's web servers. memcached is orders of magnitude faster than MySQL.

Q: If one had unlimited time and engineering resources, what would a no-compromises design look like?

A: Ideally a design would handle billions of requests per second, be easy for application programmers to use, work well for users spread all over the world, provide strong consistency, and not cost too much. That's a hard set of goals, and I don't know of a satisfying answer. One source of problems is that MySQL, while powerful and easy to use, has relatively low performance. So one could imagine using a faster database, such as FaRM, which might eliminate the need for a cache, and thus eliminate problems with cache consistency. Another source of problems is the lack of integration between memcached and MySQL. Perhaps one could have the cache and database cooperate more closely; it might help if the cache (rather than the application) controlled the handling of cache misses, and if the database were in sole charge of updating or invalidating cached data. For geographic distribution, have a look at Yahoo's PNUTS, which was designed from the start to have useful consistency properties (though not linearizability) while supporting multiple regions.

Q: What's the difference between the paper's "memcached" and "memcache"?

A: "memcached" refers to the software, which you can find here:

https://github.com/memcached/memcached

memcached is a simple and fast key/value server. It stores data in RAM, with no fault tolerance, so people only use it for caching (not for persistent storage). The paper uses "memcache" to refer to Facebook's set of servers running memcached.

Q: What is the "stale set" problem in 3.2.1, and how do leases solve it?

A: Here's an example of the "stale set" problem that could occur if there were no leases:

1. Client C1 asks memcache for k; memcache says k doesn't exist.
2. C1 asks MySQL for k, MySQL replies with value 1. C1 is slow at this point for some reason...
3. Someone updates k's value in MySQL to 2.
4. MySQL/mcsqueal/mcrouter send an invalidate for k to memcache, though memcache is not caching k, so there's nothing to invalidate.
5. C2 asks memcache for k; memcache says k doesn't exist.
6. C2 asks MySQL for k, MySQL replies with value 2.
7. C2 installs k=2 in memcache.
8. C1 installs k=1 in memcache.

Now memcache has a stale version of k, and it may never be updated.

The paper's leases fix the example:

1. Client C1 asks memcache for k; memcache says k doesn't exist, returns lease L1 to C1, and remembers the lease.
2. C1 asks MySQL for k, MySQL replies with value 1. C1 is slow at this point for some reason...
3. Someone updates k's value in MySQL to 2.
4. MySQL/mcsqueal/mcrouter send an invalidate for k to memcache, though memcache is not caching k, so there's nothing to invalidate. But memcache does invalidate C1's lease L1 (deletes L1 from its set of valid leases).
5. C2 asks memcache for k; memcache says k doesn't exist, and returns lease L2 to C2 (since there was no current lease for k).
6. C2 asks MySQL for k, MySQL replies with value 2.
7. C2 installs k=2 in memcache, supplying valid lease L2.
8. C1 installs k=1 in memcache, supplying invalid lease L1, so memcache ignores C1.

Now memcache is left caching the correct k=2.

Q: What is the "thundering herd" problem in 3.2.1, and how do leases solve it?

A: The thundering herd problem:

* key k is popular -- lots of clients read it.
* ordinarily clients read k from memcache, which is fast.
* but suppose someone writes k, causing it to be invalidated in memcache.
* for a while, every client that tries to read k will miss in memcache.
* they will all ask MySQL for k.
* MySQL may be overloaded with too many simultaneous requests.

The paper's leases solve this problem by allowing only the first client that misses to ask MySQL for the latest data. The other clients wait for a bit to give the first client a chance to fetch the data from MySQL and install it in memcache, then the other clients re-try memcache.

Q: Why do writing clients delete() from memcache, rather than updating the values in memcache?

A: Suppose two clients, C1 and C2, are updating the same data at the same time. C1 writes value "x", and C2 writes value "y". They both send their updates to the database, which executes the writes in one order or the other. Let's suppose the database executes C1's write("x") first, then C2's write("y"), so that the final value in the database is "y". Then C1 sends put(k, "x") to memcached, and C2 sends put(k, "y") to memcached, at about the same time. Memcached may execute the requests in either order, so it may execute C2's put("y") first, and C1's put("x") second, so that memcached ends up caching "x". Now memcached is caching a value that differs from the one in the database, which is a bad situation. This problem doesn't arise if C1 and C2 delete() instead of put().

Q: What is mcrouter?

A: The point of mcrouter is to aggregate memcached RPCs from many clients and send them in big batches to memcached servers. It's more efficient to have a smallish number of mcrouter servers talk to memcached than a large number of individual clients. One reason is that there's overhead to each network (TCP) connection; better that each memcached server have a TCP connection per mcrouter than per client. Another reason is that there's overhead (packet header space and interrupts) for each packet, so it's helpful that mcrouter can pack many client requests into each TCP packet.

Q: Isn't it wasteful that the gutter servers are idle when they aren't taking over for a failed server? Why not use the gutter servers for ordinary memcached service as well as gutter?

A: I think non-gutter memcached servers are often close to fully loaded, and have little spare capacity. If one fails, the replacement server needs to have been more or less idle, in order to handle the failed server's load.

Q: How do Section 4.2's regional pools reduce the number of replicas?

A: Each region has multiple clusters. Each cluster has a complete cache. Thus a given data item may be cached in each of the clusters.
If there are N clusters in a region, there may be N distinct cached copies of a data item, one per cluster. Items that are cached in the regional pool are only cached once per region, not N times. The tradeoff is that the potential serving capacity is N times higher if there are N copies.

Q: What storage system work has gone on at Facebook since this paper?

A: Here's a sample:

https://www.usenix.org/system/files/conference/atc13/atc13-bronson.pdf
https://www.cs.princeton.edu/~wlloyd/papers/existential-sosp15.pdf
http://www.cs.cmu.edu/~beckmann/publications/papers/2020.osdi.cachelib.pdf
https://www.usenix.org/system/files/fast21-pan.pdf

Q: Why not just put a cache into MySQL, where it can be better integrated to provide good consistency?

A: It would be fantastic if someone could add a transparent cache to MySQL that made it as fast as a cluster of memcached servers. But no-one knows how to do that. MySQL in fact does quite a bit of caching, and it's still much slower than memcached. Presumably a lot of the reason is that MySQL presents a much more powerful and complex interface than memcached (MySQL supports SQL queries, an interface which is about 1000x as complex as memcached's put()/get()/delete()).

Q: Figure 11 shows that memcache can serve data that is even a day old. Although this happens with low probability, couldn't it still cause significant, perhaps catastrophic, problems in applications using memcache?

A: Yes, indeed. It is something FB has struggled with because it makes writing applications more challenging. It is the topic of two follow-on papers (see the references above). The bottom line of these papers is still roughly the same: the probability of inconsistency is so low that they are willing to accept it. Even though a few users in principle might be able to notice the inconsistency, they probably won't realize it or care. (Their target applications are not banking applications.)

Q: How does the MySQL replication system work?

A: See https://dev.mysql.com/doc/refman/8.0/en/replication.html. FB uses the log-based replication scheme as a component of the publish/subscribe system, as described in "Wormhole: Reliable Pub-Sub to support Geo-replicated Internet Services", Sharma et al, 2015. The core of the replication scheme is to read updates from MySQL's transaction log and send them to the backup, which applies them to its data.

Q: What does "look-aside" caching refer to?

A: The cache sits off to the side, as opposed to sitting in between the application and the storage layer. If the application misses in the cache, the application retrieves the database records and updates the cache, instead of the cache doing it. This arrangement is relatively simple, since the cache and database don't have to know about each other, and the application is free to use different key schemes for the cache versus the database.

Q: How is the privacy of user data stored in memcached protected?

A: The paper does not touch on this question, so we can only guess. The first line of defense is likely that only Facebook's own computers can talk to their memcached and MySQL servers. Probably there are firewalls between Facebook's datacenters and the Internet so that no outsider can directly contact any of Facebook's internal servers.

At a higher level, Facebook's web servers have code that decides what data to reveal to whom. If you entrust sensitive information to Facebook, you have to trust that their code has notions similar to your own about who should see your information. One also has to think about the possibility of bugs in Facebook's permissions code, of bugs that allow outside hackers to break into Facebook's computers, or of corrupt, malicious, or careless Facebook employees. Dealing with such threats requires internal controls that limit how data can be used even within Facebook.
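To tie several of the answers above together (look-aside caching, leases for the stale-set problem, and delete-on-write), here is a toy, single-process sketch in Python. It is not Facebook's implementation: the class and function names are mine, a dict stands in for MySQL, integers stand in for lease tokens, and the real system's timeouts, thundering-herd rate limiting, and mcsqueal pipeline are omitted.

```python
import itertools

class Memcache:
    """Toy in-memory model of one memcached server with leases (paper 3.2.1)."""
    def __init__(self):
        self.data = {}                    # key -> cached value
        self.leases = {}                  # key -> outstanding lease token
        self.next_lease = itertools.count(1)

    def get(self, key):
        """Return (value, None) on a hit, or (None, lease) on a miss."""
        if key in self.data:
            return self.data[key], None
        lease = next(self.next_lease)
        self.leases[key] = lease          # remember who may fill this key
        return None, lease

    def set(self, key, value, lease):
        """Install a value only if the caller's lease is still valid."""
        if self.leases.get(key) == lease:
            self.data[key] = value
            del self.leases[key]
            return True
        return False                      # stale set: silently ignored

    def delete(self, key):
        """Invalidate the cached value AND any outstanding lease."""
        self.data.pop(key, None)
        self.leases.pop(key, None)

db = {}                                   # stand-in for MySQL
mc = Memcache()

def read(key):
    """Look-aside read: try the cache, fall back to the database on a miss."""
    value, lease = mc.get(key)
    if value is None:
        value = db[key]                   # application, not cache, does the fill
        mc.set(key, value, lease)
    return value

def write(key, value):
    """Write the database, then delete() (not update) the cached copy."""
    db[key] = value
    mc.delete(key)                        # plays the role of mcsqueal's invalidate

# Replay the stale-set walkthrough from above:
db["k"] = 1
_, lease1 = mc.get("k")                   # step 1: C1 misses, gets lease L1
v1 = db["k"]                              # step 2: C1 reads 1, then stalls...
write("k", 2)                             # steps 3-4: update; delete() kills L1
v2 = read("k")                            # steps 5-7: C2 misses, installs k=2
assert mc.set("k", v1, lease1) is False   # step 8: C1's stale set is rejected
assert read("k") == 2                     # cache holds the fresh value
```

Without the lease check in set(), the final mc.set("k", v1, lease1) would overwrite the cache with the stale value 1, reproducing the problem the paper describes.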