6.824 2022 Lecture 17: Causal Consistency, COPS

Lloyd et al, Don't Settle for Eventual: Scalable Causal Consistency for
  Wide-Area Storage with COPS, SOSP 2011.

the setting: geo-replication for big web sites
  [ 3 datacenters, users, web servers, sharded storage servers ]
  multiple datacenters, to be close to users
  each datacenter has a complete copy of all data
  reads are fast -- from local replica
  can writes be fast also?

we've seen two solutions for geo-replication
  Spanner
    reads are local and fast
    write commit waits for replies from other datacenters
    transactions for read and write
  Facebook / Memcache
    reads are local and fast
    all writes go through primary datacenter
    transactions only for updates
  problems:
    writes must wait for wide-area communication
    slow inter-datacenter network makes writes even slower
    failed datacenter can block writes at other datacenters

can we have writes that never have to wait for other datacenters?
  would help fault tolerance, performance, robustness vs slow WAN
  then we'd have fast local reads *and* writes
  puzzle: what's the best we can then do for a consistency model?

idea: Eventual Consistency
  client reads and writes just contact local shard
  each shard pushes writes to other datacenters, shard-to-shard, asynchronously
    that is, a put() fires off a message to each other datacenter,
    but does not wait; there could be long delays and out-of-order delivery

Eventual Consistency properties
  1. datacenters/clients may see updates in different orders
  2. if no writes for long enough, all clients see same data
  a pretty loose spec; many ways to implement, easy to get good performance
  used in deployed systems, e.g. Dynamo and Cassandra
  but can be tricky for app programmers

example app code -- a photo manager:
  C1 and C2 in different datacenters
  C1 uploads photo, adds reference to list:
    C1: put(photo) put(list)
  C2 reads:
    C2: get(list) get(photo)
  what can C2 see?
  app code can see non-intuitive behavior -- "anomalies"
    not incorrect, since there was no promise of better
  and it is possible to program such a system
    perhaps wait for expected data (photo) to appear
  in a bit we'll turn to COPS, which provides better behavior

how to ensure that all datacenters end up with the same value?
  if there are writes from multiple datacenters,
  everyone has to choose the same final value, for eventual consistency

why not attach the current wall-clock time as version number on each put?
  and have all datacenters prefer higher version numbers?
  local shard server assigns v#=time when it receives client put()
  remote datacenter receives put(k, val, v#)
    if v# is larger than version of currently stored value for k,
      replace with new value / v#
    otherwise ignore new value
  "timestamp" might be a better word here than "version"
  what if two put(k) happen at exactly the same time at different datacenters?
    break tie with a unique ID in the low bits of v#
  ok: now different datacenters will agree on final values
    so, eventually, they will become consistent

what if one datacenter's (or server's) clock is fast by an hour?
  will cause that datacenter's values to win, even if an hour old!
  worse: prevents any other update for an hour!

COPS uses "Lamport clocks" to assign v#
  each server implements a "Lamport clock" or "logical clock"
    Tmax = highest v# seen (from self and others)
    T = max(Tmax + 1, wall-clock time)
    v# for a new put() is current T
  so: if some server has a fast clock, everyone who sees a version from
    that server will advance their Lamport clock
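a minimal Go sketch of the scheme above -- Lamport-clock version assignment
  plus higher-version-wins for puts forwarded from other datacenters. this is
  not COPS's actual code; Shard, LocalPut, RemotePut, and the field layout
  are invented for illustration.

  package main

  import (
      "fmt"
      "sync"
      "time"
  )

  // Version: Lamport time, plus a unique node ID in the low bits to
  // break ties between puts assigned the same time.
  type Version struct {
      T      uint64
      NodeID uint32
  }

  func (a Version) Less(b Version) bool {
      if a.T != b.T {
          return a.T < b.T
      }
      return a.NodeID < b.NodeID
  }

  type entry struct {
      val string
      v   Version
  }

  type Shard struct {
      mu   sync.Mutex
      id   uint32
      tmax uint64 // highest Lamport time seen, from self and others
      data map[string]entry
  }

  // nextVersion implements T = max(Tmax + 1, wall-clock time).
  func (s *Shard) nextVersion() Version {
      t := s.tmax + 1
      if now := uint64(time.Now().Unix()); now > t {
          t = now
      }
      s.tmax = t
      return Version{T: t, NodeID: s.id}
  }

  // LocalPut: put() from a local client -- assign a version, store, and
  // (not shown here) asynchronously forward to the other datacenters.
  func (s *Shard) LocalPut(k, val string) Version {
      s.mu.Lock()
      defer s.mu.Unlock()
      v := s.nextVersion()
      s.data[k] = entry{val, v}
      return v
  }

  // RemotePut: put forwarded from another datacenter -- keep whichever
  // value has the higher version (last-writer-wins).
  func (s *Shard) RemotePut(k, val string, v Version) {
      s.mu.Lock()
      defer s.mu.Unlock()
      if v.T > s.tmax { // a fast remote clock advances our Lamport clock
          s.tmax = v.T
      }
      if old, ok := s.data[k]; !ok || old.v.Less(v) {
          s.data[k] = entry{val, v}
      } // otherwise ignore the incoming value
  }

  func main() {
      s := &Shard{id: 1, data: map[string]entry{}}
      fmt.Println("photo stored at version", s.LocalPut("photo", "cat.jpg"))
  }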
if concurrent writes, is it OK to simply discard all but one?
  the paper's "last-writer-wins"
  sometimes that's OK:
    e.g. there's only a single possible writer, so the problem can't arise
    probably I'm the only person who can write my photo list or profile
  sometimes last-writer-wins is not useful:
    what if put()s are trying to increment a counter?
    or update a shopping cart to have a new item?
    the problem is "conflicting writes"
    we'd often like a more clever plan to detect and merge:
      mini-transactions -- atomic increment operation, not just get()/put()
      custom conflict resolution for shopping cart (set union?)
  resolution of conflicting writes is a problem for eventual/causal consistency
    no single primary to order atomic operations or transactions
  the paper mostly ignores write conflict resolution
    but it's a serious problem for real systems

can we have better consistency than eventual?
  and still retain fast local reads and writes, from any datacenter?

remember the eventual consistency anomaly with the photo list:
  C1: put(photo) put(list)
  C2: get(list) get(photo)
  C1 and C2 in different datacenters
  the problem: the put()s may arrive out of order at other datacenters,
    so get(list) may see the new photo name, but get(photo) fails
  we need to enforce order on writes
    even when "photo" and "list" are in different shards

a possibility: single write log per datacenter
  each datacenter has a single "log server"
  put() appends to the local log server's log, but doesn't wait
  log server sends log, in order, to other datacenters, in the background
  remote sites apply log in order
  so put(photo), put(list) will appear in that order everywhere
  two problems:
    log server will be a bottleneck
    only orders writes from this datacenter

example with writes from multiple datacenters:
  initially x=0 y=0
  C1: put(x,2)
  C2: t=get(x) put(y,t)
  C3: get(y) get(x)
  C1/C2/C3 in different datacenters
  suppose C2 sees x=2 (i.e. it observes C1's write)
  a log per datacenter produces an anomaly for this example
    C1's put and C2's put would be in different logs, not ordered
    so C3 might see C2's put but not C1's (prior) put
    i.e. C3 might see y=2 but x=0
  we need a transitive notion of order

so:
  we want to forward puts asynchronously
  we want each shard to forward puts independently (no central log server)
  we want enough ordering to make our examples have intuitive results

here's how COPS achieves these goals
  (sketched in code below, after the causal consistency definition)
  each COPS client maintains a "context" to reflect order of client ops
    client adds an item to context after each get() and put()
    client tells COPS to order each put() after everything in its context
  C2's get(x) yields version v2
    C2's context now: x/v2
  C2's put(y,2)
    C2 sends put(y,2,x/v2) to local shard server
  local shard server for y:
    picks a new v# = v3 for y, stores y, 2, v3
    sends y,2,v3,x/v2 to corresponding shard server in each datacenter
      but does not wait for reply
  remote shard server receives y,2,v3,x/v2
    talks to local shard server for x
    *waits* for x to reach version v2 (from C1's datacenter)
    then updates DB to hold y,2,v3

COPS calls a relationship like "y/v3 comes after x/v2" a "dependency"
  x/v2 -> y/v3
  what does a dependency imply for COPS?
    if C3 sees y/v3, then asks for x, it should see at least x/v2
  this notion of dependency is meant to match programmer intuition
    about what it means to get(x) AND THEN put(y)
    or put(x) AND THEN put(y)

these consistency semantics are called "causal consistency"
  paper defines in Section 3 / Figure 2
  a client establishes dependencies between versions in two ways:
    1. its own sequence of puts and gets ("Execution Thread" in Section 3)
    2. reading data written by another client
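a rough Go sketch of the mechanism above: a per-client context that becomes
  the dependency list of the next put(), and a remote shard server that waits
  for those dependencies to be installed locally before installing the new
  value. Dep, ClientContext, Shard, and ApplyRemote are invented names; real
  COPS shard servers communicate over the network, not by in-process calls.

  package main

  import (
      "fmt"
      "sync"
  )

  // Dep names a key and the minimum version a reader must be able to see.
  type Dep struct {
      Key string
      Ver uint64
  }

  // ClientContext records what this client has read and written, so each
  // put() can be ordered after everything in the context.
  type ClientContext struct {
      deps []Dep
  }

  func (c *ClientContext) AfterGet(key string, ver uint64) {
      c.deps = append(c.deps, Dep{key, ver})
  }

  func (c *ClientContext) AfterPut(key string, ver uint64) {
      c.deps = append(c.deps, Dep{key, ver})
  }

  func (c *ClientContext) Deps() []Dep { return c.deps }

  // Shard is one shard server in a remote datacenter.
  type Shard struct {
      mu   sync.Mutex
      cond *sync.Cond
      vers map[string]uint64 // highest installed version per key
      vals map[string]string
  }

  func NewShard() *Shard {
      s := &Shard{vers: map[string]uint64{}, vals: map[string]string{}}
      s.cond = sync.NewCond(&s.mu)
      return s
  }

  // waitForDeps blocks until every dependency has reached at least the
  // required version locally (dependencies arrive asynchronously from the
  // shard servers in the writer's datacenter).
  func (s *Shard) waitForDeps(deps []Dep) {
      s.mu.Lock()
      defer s.mu.Unlock()
      for _, d := range deps {
          for s.vers[d.Key] < d.Ver {
              s.cond.Wait()
          }
      }
  }

  // ApplyRemote installs a forwarded put only after its dependencies are
  // visible locally, so readers never see a value before its causes.
  func (s *Shard) ApplyRemote(key, val string, ver uint64, deps []Dep) {
      s.waitForDeps(deps)
      s.mu.Lock()
      if ver > s.vers[key] {
          s.vers[key] = ver
          s.vals[key] = val
      }
      s.cond.Broadcast() // this write may satisfy someone else's dependency
      s.mu.Unlock()
  }

  func main() {
      s := NewShard()
      ctx := &ClientContext{}
      ctx.AfterGet("x", 2) // C2 read x at version 2

      done := make(chan bool)
      go func() {
          // C2's put(y) carries the dependency x/2; the remote install
          // blocks until x/2 has arrived from C1's datacenter.
          s.ApplyRemote("y", "2", 3, ctx.Deps())
          done <- true
      }()
      s.ApplyRemote("x", "2", 2, nil) // x/2 arrives; now y can install
      <-done
      fmt.Println("y installed only after its dependency x/2")
  }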
dependencies are transitive (if A -> B, and B -> C, then A -> C)
the system guarantees that if A -> B, and a client reads B,
  then the client must subsequently see A (or a later version)

causal consistency makes the photo list example work
  put(list) will arrive with photo as a dependency
  remote servers will wait for photo to appear before installing updated list

nice: when updates are causally related, readers see versions
  at least as new as the writer saw.

nice: when updates are NOT causally related, COPS has no order obligations
  example:
    C1: put(x,1)
    C2: put(y,2)
    C3: put(x,3)
  these are not causally related
    so they can appear in different orders at different datacenters
    no need for any mechanism to enforce agreement on order of appearance
      e.g. no single primary for x
      e.g. no wait for write to complete as in linearizability
    we can expect this freedom to help performance
  but: readers may see newer data than required by causal dependencies
    so we're not getting transactions or snapshots

but: COPS sees only certain causal relationships
  ones that COPS can observe from client get()s and put()s
  if there are other communication channels, COPS is only eventually consistent
    e.g. I put(k), tell you by phone, you do get(k); maybe you won't see my data

COPS optimizations to avoid ever-growing client contexts
  * client clears context after put(x), replaces it with x/v#
    so next put(), e.g. put(y), depends only on x/v#
    so remote sites will wait for arrival of x/v# before writing y
    x/v# itself was waiting for the (cleared) context
    so y effectively also waits for the (cleared) context
  * garbage collection
    sees when all datacenters have a certain version
    that version never needs to be remembered in a context
      since it's already visible to everyone

are there situations where ordered puts/gets aren't sufficient?
  yes: when you need transactions or atomic updates
  paper's example: a photo list with an ACL (Access Control List)
    get(ACL), then get(list)?
      what if someone deletes you from ACL, then adds a sensitive photo,
      between the two get()s?
    get(list), then get(ACL)?
      what if someone deletes a sensitive photo, then adds you to ACL?
  need a multi-key get that returns mutually consistent versions

COPS-GT get_trans() provides read-only sort-of transactions
  (sketched in code at the end of this section)
  servers store the full set of dependencies for each value
  servers store a few old versions of each value
  get_trans(k1,k2,...)
    client library does independent get()s
      get()s return dependencies as well as value/v#
    client checks dependencies:
      for each get() result R, for each other get() result S mentioned in
        R's dependencies, is S's v# >= the version mentioned in R's dependency?
      if yes for all, can use results
      if no for any, need a second round of get()s for values that were too old
        each fetches the specific version mentioned in the dependency,
        which may be old, to avoid chasing a cascade of new dependencies
  for ACL / list example:
    C1: get_trans(ACL, list)
    C1: get(ACL) -> v1, no deps
    C2: put(ACL) -> v2
    C2: put(list, deps=ACL/v2) -> v2
    C1: get(list) -> v2, deps: ACL/v2
    (C1 checks dependencies against the versions it got)
    C1: get(ACL, version v2) -> v2
    (now C1 has a causally consistent pair of get() results)
  why are only two phases needed for COPS-GT get transactions?
    a new value won't be installed until all its dependencies are installed
    so if a get() returns a dependency, it must already be locally installed
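a rough Go sketch of the two-round dependency check in get_trans(). Result,
  Store, Get, and GetByVersion are invented stand-ins for the COPS-GT client
  library; the second round asks for the exact version named in the
  dependency, which servers can answer because they keep a few old versions
  of each value.

  package main

  import "fmt"

  // Result of a first-round get(): the value, its version, and the
  // dependencies (key -> minimum version) recorded when it was written.
  type Result struct {
      Val  string
      Ver  uint64
      Deps map[string]uint64
  }

  // Store stands in for the local datacenter: Get returns the latest
  // locally installed version; GetByVersion returns a specific (possibly
  // old) version.
  type Store interface {
      Get(key string) Result
      GetByVersion(key string, ver uint64) Result
  }

  // GetTrans does independent get()s, checks each result's dependencies
  // against the versions actually returned, and re-fetches any result that
  // is too old at the exact version named in the dependency. no third round
  // is needed: a value is installed only after its dependencies are, so any
  // version named in a returned dependency is already installed locally.
  func GetTrans(s Store, keys ...string) map[string]Result {
      results := map[string]Result{}
      for _, k := range keys { // round 1: independent local gets
          results[k] = s.Get(k)
      }
      need := map[string]uint64{} // keys whose first-round result is too old
      for _, r := range results {
          for depKey, depVer := range r.Deps {
              if got, ok := results[depKey]; ok && got.Ver < depVer && depVer > need[depKey] {
                  need[depKey] = depVer
              }
          }
      }
      for k, ver := range need { // round 2: fetch the exact needed versions
          results[k] = s.GetByVersion(k, ver)
      }
      return results
  }

  // fakeStore replays the ACL / list example from the notes.
  type fakeStore struct{}

  func (fakeStore) Get(key string) Result {
      if key == "ACL" {
          return Result{Val: "old acl", Ver: 1} // stale first-round read
      }
      return Result{Val: "new list", Ver: 2, Deps: map[string]uint64{"ACL": 2}}
  }

  func (fakeStore) GetByVersion(key string, ver uint64) Result {
      return Result{Val: "new acl", Ver: ver} // pretend to look up the old version
  }

  func main() {
      rs := GetTrans(fakeStore{}, "ACL", "list")
      fmt.Printf("ACL v%d, list v%d\n", rs["ACL"].Ver, rs["list"].Ver) // ACL v2, list v2
  }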
performance?
  Section 6.2: one client, a single server in each datacenter
    52,000 gets per second
    30,000 puts per second (async forwarding to remote datacenter)
  that is OK performance, like a conventional DB
  but no comparisons with other systems
    not for performance
    not for ease of programming
  too bad, since central thesis is that COPS has a better tradeoff
    between ease of programming and performance

limitations / drawbacks, for both COPS and causal consistency
  conflicting writes are a serious difficulty
  awkward for clients to track causality
    e.g. user and browser, multiple page views, multiple servers
  COPS doesn't see external causal dependencies
    s/w and people really do communicate outside of the COPS world
  limited notion of "transaction"
    only for reads (though later work generalized a bit)
    definition is more subtle than serializable transactions
  significant overhead
    track, communicate, enforce causal dependencies
    garbage collection
    update delays can cascade

impact?
  causal consistency is a popular research idea
    with good reason: promises both performance and useful consistency
    much work before COPS -- and after (Eiger, SNOW, Occult)
  causal consistency is rarely used in deployed storage systems
  what is actually used?
    no geographic replication at all, just local primary-site
      (PNUTS, Facebook/Memcache)
    eventual consistency (Dynamo, Cassandra)
    strongly consistent (Spanner)