6.824 2017 Lecture 17: Existential Consistency

why are we reading this paper?
  storage for giant read-heavy web sites (again)
  consistency concerns
  a technique to detect anomalies

review of overall FB architecture
  multiple data centers -- regions
    low network delay to users, parallelism, maybe fault tolerance
  each region has lots of Front Ends (FEs)
    web servers, with web site logic in e.g. Python or PHP
    parallelism
  each region has a full DB replica
    so all reads can be local -- users often look at each other's data
    why MySQL: persistence, crash recovery, transactional writes
    data sharded across MySQL servers in each region, for parallelism
    for each shard, one region is the master
      so there's a notion of the most up-to-date value
      all regions send writes to the master DB
      the master DB updates slave DBs in other regions
  each region has lots of cache servers
    caches have much higher throughput than DBs for reads,
      and their workload is read-heavy
    key/value pairs sharded over caches
    FE reads look in the cache first, then in the local DB
  nice properties:
    uses off-the-shelf DB and (originally) caches
    grew incrementally from a much smaller non-replicated non-caching system

early FB "look-aside" caching had problems
  memcached cache servers (mc) driven by FEs; mc and DB not aware of each other
  many web sites are structured like this!
  read(k)
    v = mc-get(k)
    if v == nil
      v = fetch from local DB
      mc-set(k, v)
    return v
  write(k, v)
    send k,v to master DB
    mc-delete(k)
  asynchronously (the write doesn't wait):
    master DB sends to slave DBs
    all DBs send invalidates to caches
      so the next read from a cache will miss, and fetch from the DB

it was hard for FB to get look-aside caching right:
  * miss/write race
    an entire write (incl. delete) might occur between a read()'s DB fetch
      and its mc-set(), leaving memcache with *permanently* stale data
    users will eventually notice the missing update
  * read-after-write anomaly
    C1 sends a write to the DB master in a remote region
    the same user reads the same key; misses in memcache; reads the local DB
    the local DB hasn't seen the update yet
    users may immediately notice that the read doesn't see their prior write!
  * thundering herd
    one invalidate may cause many FEs to miss on that key simultaneously,
      and to send lots of simultaneous reads to the DB
  cause of these problems: no entity is aware of -- and ordering --
    the concurrent operations on each key
  these problems motivated a new design (TAO), as well as
    the desire to systematically monitor correctness

FB's new system: TAO (Figure 1)
  FEs send reads/writes *through* caches -- not look-aside
  split a region's FEs over multiple leaf cache "clusters"
    even more read parallelism: multiple cached copies of each key
  2nd layer of "root" caches
    one root cache per shard per region
    all writes and leaf misses go through the region's root cache
    the root cache orders writes+misses on each key -- one op at a time
  read: leaf cache for the key's shard,
    then root cache for the key's shard,
    then local region's DB for the key's shard;
    as the reply returns, it populates the root/leaf caches
  write: leaf, root, root in master region, DB in master region, and back;
    the reply contains the new value
    leaf/root caches update on the way back
  async: master DB replicates its log to slave DBs in other regions
    DBs send invalidates to caches
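a toy Python sketch of TAO's write-through read/write paths
  illustrative only: the leaf/root/db dicts and the tao_read/tao_write names
    are invented here, not TAO's actual API, and everything runs in one
    process rather than on separate cache and DB servers

    leaf = {}      # stand-in for a leaf cache cluster
    root = {}      # stand-in for the per-shard root cache
    db   = {}      # stand-in for the local region's DB replica

    def tao_read(key):
        if key in leaf:
            return leaf[key]
        if key in root:                  # leaf miss -> root cache
            v = root[key]
        else:                            # root miss -> local region's DB
            v = db.get(key)
            root[key] = v                # populate root on the way back
        leaf[key] = v                    # populate leaf on the way back
        return v

    def tao_write(key, v):
        # real TAO forwards leaf -> root -> root in master region -> master DB;
        # the reply carries the new value, installed in caches on the way back
        db[key] = v                      # pretend this is the master DB
        root[key] = v
        leaf[key] = v
        return v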
how does TAO fix the look-aside problems?
  * miss/write race:
    the root cache processes one op at a time
    it delays a write until the read is done,
      so a stale read reply won't overwrite a fresh write value
  * read-after-write anomaly:
    caches install the new value (from the master DB) before returning from a write
    the FE won't issue the subsequent read until the write finishes
  * thundering herd:
    each cache forwards only one miss at a time per key; others wait

TAO consistency model
  the paper says:
    per-object sequential consistency within a cache
    read-after-write consistency within a cache
    otherwise eventually consistent
  what is per-object sequential consistency?
    a single sequence of versions (writes) for each object
    reads may lag -- see old versions
    but clients never go backwards -- never see new, then old
    close to PNUTS' timeline consistency
    also part of eventual consistency
      ensures agreement about which write comes last
  what is read-after-write?
    FE1 writes, then FE2 in the same cluster reads
    FE2 is guaranteed to see FE1's write (or a later write)
    helps ensure human users see their changes take effect,
      as long as each user uses FEs within the same cluster
  what is "within a cache"?
    guaranteed only if you look at a single cache [cluster]
    if you look at different caches/clusters/regions, no guarantee,
      since a different cache might have an older version
  this is weak consistency
    reads can yield stale cached data
    writes to different objects may appear in different orders
      to different readers

why might we want stronger consistency?
  human users may complain about stale or delayed data
    messages, stock quotes, auction bids
  human users might complain about order anomalies
    if a follow-up message is displayed before the original
  strong consistency might reduce the need for defensive programming
    e.g. photo ID appears in a list, but the photo doesn't (yet) exist
  strong consistency might make careful update easier
    the PNUTS paper's ACL vs photo example
    via carefully ordered writes -- only works with strong consistency

why do the authors want a consistency checker?
  their system only guarantees weak consistency
    is that a problem? or basically OK?
    does it serve up stale data all the time? or only rarely?
  they use linearizability as a gold standard
  they also want a systematic way to watch for
    the kind of bugs that the old look-aside cache had

what is linearizability?
  an execution history is linearizable if
    one can find a total order of all operations,
    that matches real time (for non-overlapping ops), and
    in which each read sees the value from the write preceding it in the order
  example history, x-axis is real time:
    |--Wx1--|      |-Wx2-|
              |------Rx2-------|
                    |-Rx1-|
  can we find an order that proves linearizability?
    yes: Wx1 Rx1 Wx2 Rx2
    obeys the real-time rule
      Wx2, Rx2, and Rx1 are concurrent, so any order among them is
        consistent w/ real time
    obeys the value rule

how do they check linearizability?
  pick a (tiny) random subset of objects (keys)
  all FEs record all reads/writes on those keys
  offline processing of the trace at the end of the day
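a toy Python sketch of sampled tracing at an FE
  illustrative only: SAMPLE_MOD, sampled(), traced_read(), and traced_write()
    are invented names, not the paper's instrumentation; tao_read/tao_write
    refer to the toy sketch earlier in these notes

    import time
    import zlib

    SAMPLE_MOD = 1000000          # trace roughly 1 in a million keys

    def sampled(key):
        # deterministic hash of the key, so every FE samples the same keys
        return zlib.crc32(key.encode()) % SAMPLE_MOD == 0

    trace = []                    # in reality, shipped off for nightly analysis

    def traced_read(key):
        start = time.time()
        v = tao_read(key)
        if sampled(key):
            trace.append(("r", key, hash(v), start, time.time()))
        return v

    def traced_write(key, v):
        start = time.time()
        tao_write(key, v)
        if sampled(key):
            trace.append(("w", key, hash(v), start, time.time()))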
what's in a trace entry?
  key, read vs write, value hash, start time, end time, &c

the paper constructs a graph for each key's operations
  graph edges reflect ordering constraints
  look for cycles in the graph
    a cycle == a total order is not possible
  the paper talks about starting with a graph with read and write nodes,
    and merging reads into writes
  it's simpler to just think about a graph of writes

building the graph of writes
  a graph node for each write
  two kinds of directed graph edges:
    * Time: W' -> W'' if W' finishes before W'' starts
    * Value: W' -> W'' if a later read saw the value written by W''
        after W' completed
  both mean W'' must come after W' in any total order
  (a small runnable sketch of this check appears below, after the
    causal-consistency discussion)

  Example:
    |--Wx1--|          |--Wx3--|
      |--Wx2--|
                |-Rx1-|
    the graph has three nodes: Wx1, Wx2, Wx3
    edges:
      Wx1 -> Wx3 (time)
      Wx2 -> Wx3 (time)
      (no time edge for Wx1/Wx2)
      Wx2 -> Wx1 (value seen by Rx1)
    there is no cycle; a total order is: Wx2 Wx1 Rx1 Wx3
    so: the operation history is linearizable

  Figure 4:
    Wx1 Wx2 Rx1
    Wx1 -> Wx2 (time)
    Wx2 -> Wx1 (value)
    cycle -> no total order -> not linearizable
    "stale read anomaly" -- Rx1 probably served from a stale cache
    legal in their system, since allowed by eventual consistency
    not legal in a linearizable system

  Figure 5:
    Wx1 Rx1 Wx2 Rx2
    Wx1 -> Wx2 (value)
    Wx2 -> Wx1 (value)
    cycle -> not linearizable
    "total order anomaly"
    when writes are concurrent, either outcome is OK,
      but subsequent reads have to agree!

will this technique ever say "legal" to a non-linearizable execution?
  yes, if an incorrect FE clock says a stale read happened earlier than it did
  Wx1 Wx2 Rx1
  if Rx1's time incorrectly says it happened before Wx2,
    the checker will say "OK"

will this technique ever say "illegal" to a linearizable execution?
  yes, again if an FE clock is wrong
  Wx1 Rx1 Wx2
  but the clock says Wx2 happened before Rx1

why can't one use the paper's traces to check causal consistency?
  (this is The Question)
  this is legal under causal consistency (x is initially 0):
    C0: Wx1
    C1: Rx1
    C2: Rx0
  this is illegal:
    C0: Wx1
    C1: Rx1 Wy2
    C2: Ry2 Rx0
    that is, if C2 sees the write to y, it should also see all writes
      that could have caused that write to y
  can we distinguish these using just an x trace, or just a y trace?
    the y trace alone looks OK in the 2nd example: Wy2 Ry2
      so the y trace alone isn't enough to spot illegal executions
    the x trace is the SAME for both the legal and illegal executions:
      Wx1 Rx1 Rx0
      so the x trace alone isn't enough to spot illegal executions
  since they only trace a tiny random sample of keys, they are unlikely to
    see both x and y, so they cannot check causal consistency

it's surprising that the paper's technique can check linearizability
  after all, linearizability constrains operations on multiple objects,
    just like causal consistency,
    and the multiple-object part made causal consistency not checkable

why they can check linearizability
  suppose the trace contains just x, but not y
    Wx1 Rx1 Rx0
  this might be linearizable, or not,
    depending on whether Rx0 started after Wx1 completed, in real time
  their trace contains the start/end real time of each operation on x,
    so they can tell if Rx0 started after Wx1 finished
  but this technique does depend on global clock agreement,
    which the paper admits is a problem -- the 35 milliseconds in Sec 3.1
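a toy Python sketch of the per-key write-graph check promised above
  illustrative only, not the paper's checker: writes as nodes, Time and
    Value edges, then cycle detection; assumes each write to the key
    writes a distinct value

    from itertools import combinations

    def check_key(writes, reads):
        # writes, reads: lists of (value, start_time, end_time) for one key;
        # returns True if some total order satisfies the Time and Value
        # constraints (i.e. no anomaly found for this key)
        n = len(writes)
        edges = set()
        # Time edges: W' -> W'' if W' finishes before W'' starts
        for i, j in combinations(range(n), 2):
            if writes[i][2] < writes[j][1]:
                edges.add((i, j))
            elif writes[j][2] < writes[i][1]:
                edges.add((j, i))
        # Value edges: if a read returns writes[j]'s value and that read
        # starts after writes[i] completes, then writes[i] -> writes[j]
        for rval, rstart, rend in reads:
            for j in range(n):
                if writes[j][0] != rval:
                    continue
                for i in range(n):
                    if i != j and writes[i][2] < rstart:
                        edges.add((i, j))
        # cycle detection by depth-first search
        adj = {i: [j for (a, j) in edges if a == i] for i in range(n)}
        state = {}                        # 0 = on the DFS stack, 1 = done
        def dfs(u):
            state[u] = 0
            for v in adj[u]:
                if state.get(v) == 0 or (v not in state and dfs(v)):
                    return True           # found a cycle
            state[u] = 1
            return False
        return not any(dfs(i) for i in range(n) if i not in state)

    # Figure 4's stale-read anomaly: Wx1, then Wx2, then a read that still
    # returns 1 -> Time edge Wx1->Wx2 plus Value edge Wx2->Wx1 -> cycle
    print(check_key(writes=[(1, 0, 1), (2, 2, 3)], reads=[(1, 4, 5)]))  # False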
measurement results -- Table 3
  I'd expect many linearizability anomalies, from stale caches,
    since TAO implements a relaxed consistency model
  the big surprise is that there are so few!
    only 0.0004% of reads show linearizability anomalies
  why so few anomalies?
    few writes
    locality (consistency is better within the writer's cache)
    fast invalidation messages

what do the results imply?
  is 0.0004% good or bad?
    for comments on pictures?
    for the latest bid in an eBay auction?
    for a password change?
  what about:
    ensuring a follow-up is shown after the original post?
    PNUTS' delete Mom from ACL, then post picture?
    advertising impression counts?
  perhaps Facebook's main goal is freshness for users,
    not semantics for programs
    users can probably tolerate a lot, so 0.0004% is pretty good
  the paper doesn't have much to say about the effect on programmability
  how generalizable are their results?
    most relevant to user-oriented systems with no hard requirements
    the 0.0004% depends on a read-heavy workload, locality, and fast invalidation

conclusion
  a low-cost checking technique
  a weak consistency design can provide remarkably high consistency!
    i.e. they rarely display stale data to the user
    sheds doubt on the need for costly consistent designs
  consistency is not always a correctness question
    for them, it's about the freshness of information shown to the user,
      *not* about correct/incorrect or inter-module contracts
  the results are Facebook-specific,
    but perhaps they are representative of many web sites