6.5840 2022 Lecture 11: Chain Replication

Chain Replication for Supporting High Throughput and Availability
(OSDI 2004) by van Renesse and Schneider

a couple of topics today:
  replicated state machines
    revisit primary/backup replication
    chain replication as a better primary/backup
    compare p/b vs chain vs quorum
  structure of sharded systems

two approaches to replicated state machines:
  1. run all ops through Raft/Paxos (lab 3)
  2. configuration server plus primary-backup replication
     GFS: master + chunk replication
     VM-FT: test-and-set + P/B replication
     ZooKeeper + primary-backup
     the configuration server must be fault-tolerant

why approach 2? separation of concerns
  configuration server stores configuration info
    who is primary, which server has what shard,
      who is the head of the chain
    must handle split-brain syndrome
    often uses Paxos/Raft -- can proceed with a majority
  data replication protocol
    primary-backup, chain replication, or whatever
    often simpler than Raft (Figure 2); doesn't need to handle split brain
    can proceed if only 1 server is available (after reconfiguration)

primary/backup (paper's Section 4)
  old, widely used in various forms
  GFS uses a (non-strict) p/b scheme for chunks
  the CR paper is an improved p/b

the basic primary/backup arrangement
  [diagram: CFG, clients, primary, two backups, state = key/value table]
  primary numbers ops, to impose an order if concurrent client ops arrive
  primary forwards each op to the backups (in parallel)
  primary waits for responses from ALL backups, then replies to the client
  primary can respond to reads w/o sending to backups
  if the primary fails, one of the backups becomes the new primary
  can proceed if even a single replica survives
    so N replicas can survive N-1 failures -- better than Raft
  a separate configuration service (CFG) manages failover
    configuration = identity of current primary and backups
    usually built on Paxos or Raft or ZooKeeper
    pings all servers to detect failures
    CFG must include at least one replica from the previous configuration
      in any new configuration
    if a backup fails, CFG must remove it from the configuration,
      b/c the primary has to wait for all backups
  sometimes called ROWA (Read One Write All), as opposed to quorum
    the "write all" is what allows us to tolerate N-1 failures,
    since every replica has all committed values, unlike quorums

a few key properties we need to check for any strong replication scheme:
  1. never reveal uncommitted data
     i.e. data that might not survive a tolerated failure
  2. after failure recovery, all replicas must agree!
     e.g. if one saw the last msg from the primary, but another did not
  3. no split brain if the network partitions
     only one new primary; the old primary must stop
  4. able to recover after a complete failure

#1 for primary/backup?
  example bad situation:
    put(x, 99) arrives at the primary
    get(x) immediately follows it,
      before the primary gets the backups' responses to put(x, 99)
    we said the primary can serve reads w/o talking to backups
    but it's not safe to return 99!
  answer: either respond with the old value,
    or block the read until all backups respond to put(x, 99)

#2 for primary/backup?
  the new primary could merge all replicas' last few operations,
  or elect as primary the backup that received the highest-numbered op,
  and the new primary must start by forcing the backups to agree with it

#3 for primary/backup?
  how to prevent a backup from taking over if partitioned from the primary,
    while the primary is still alive?
  *don't* have the backups make this decision!
  only the CFG can declare a new primary + backups,
    based on the CFG's opinion of which servers it can reach
  what if the old primary is actually alive, but the CFG can't talk to it?
    how to prevent the old primary from serving requests?
    for updates:
      the old primary cannot respond to client requests until
        all replicas in its configuration reply
      at least one replica must be part of the new configuration,
      so replicas must be careful not to reply to an old primary!
    for reads:
      the primary doesn't have to talk to the backups for reads,
      so the CFG must grant a lease to the primary
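a minimal Go sketch of the p/b write path described above -- not the
paper's code; names (Primary, Backup.Apply, memBackup) and the in-memory
maps and direct method calls (instead of RPCs) are assumptions made here
for illustration. the primary numbers each op, forwards it to every backup
in parallel, and replies only after ALL backups acknowledge ("write all"):

  package main

  import (
    "fmt"
    "sync"
  )

  // Backup is the interface the primary uses to forward numbered ops.
  type Backup interface {
    Apply(seq int64, key, value string) error
  }

  // memBackup is a trivial in-memory backup, just for this sketch.
  type memBackup struct {
    mu   sync.Mutex
    data map[string]string
  }

  func (b *memBackup) Apply(seq int64, key, value string) error {
    b.mu.Lock()
    defer b.mu.Unlock()
    b.data[key] = value
    return nil
  }

  // Primary numbers ops to impose an order, then forwards each op
  // to all backups in parallel.
  type Primary struct {
    mu      sync.Mutex
    nextSeq int64
    data    map[string]string
    backups []Backup
  }

  // Put replies to the caller only after ALL backups have acknowledged;
  // that is what lets N replicas survive N-1 failures.
  func (p *Primary) Put(key, value string) error {
    p.mu.Lock()
    p.nextSeq++
    seq := p.nextSeq
    p.data[key] = value // not yet safe to reveal to readers (property #1)
    p.mu.Unlock()

    var wg sync.WaitGroup
    errs := make([]error, len(p.backups))
    for i, b := range p.backups {
      wg.Add(1)
      go func(i int, b Backup) {
        defer wg.Done()
        errs[i] = b.Apply(seq, key, value)
      }(i, b)
    }
    wg.Wait() // wait for every backup before replying
    for _, err := range errs {
      if err != nil {
        return err // an unreachable backup means the CFG must reconfigure
      }
    }
    return nil
  }

  func main() {
    b1 := &memBackup{data: map[string]string{}}
    b2 := &memBackup{data: map[string]string{}}
    p := &Primary{data: map[string]string{}, backups: []Backup{b1, b2}}
    if err := p.Put("x", "99"); err == nil {
      fmt.Println("put committed on primary and both backups")
    }
  }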
Chain Replication was published at a time when only a few people understood
  the detailed design of strongly consistent replication; it's been
  influential, and a fair number of real-world systems build on it.

what p/b problems does Chain Replication aim to fix?
  1. the primary has to do a lot of work
  2. the primary has to send a lot of network data
  3. re-sync after a primary failure is complex, since backups may differ

the basic Chain Replication idea:
  [clients, S1=head, S2, S3=tail, CFG (= master)]
  (can be more replicas)
  clients send update requests to the head (S1)
  head picks an order (assigns sequence numbers)
  head updates its local replica, sends to S2
  S2 updates its local replica, sends to S3
  S3 (the tail) updates its local replica, sends the response to the client
  updates move along the chain in order:
    at each server, earlier updates are delivered before later ones
  clients send read requests to the tail
    the tail reads its local replica and responds to the client
  benefit: the head sends less network data than a primary
  benefit: client interaction work is split between head and tail

The Question:
  Suppose Chain Replication replied to update requests from the head,
  as soon as the next chain server said it received the forwarded update,
  instead of responding from the tail. Explain how that could cause
  Chain Replication to produce results that are not linearizable.

what if the head fails?
  the CFG (master) is in charge of recovering from failures
  the CFG tells the 2nd chain server to be the new head
    and tells clients who the new head is
  will all (remaining) servers still be exact replicas?
    they will soon!
    in-order delivery means each server is identical to the one before it,
      just missing the last few updates
    so each server needs to compare notes with its successor
      and just send those last few updates
  some client update requests may be lost
    if only the failed head knew about them
    clients won't receive responses and will eventually re-send to the new head

what if the tail fails?
  the CFG tells the next-to-last server to be the new tail
    and tells clients, for read requests
  the next-to-last is at least as up to date as the old tail
  for updates that the new tail received but the old tail didn't,
    the system won't send responses to clients;
    clients will time out and re-send
  Section 2 says clients are responsible for checking whether timed-out
    operations have actually already been executed
    (often harder than it sounds)

what if an intermediate server fails?
  the CFG tells the previous/next servers to talk to each other
  the previous server may have to re-send some updates
    that it had already sent to the failed server
  note that servers need to remember updates even after forwarding,
    in case a failure requires them to re-send
  when to free? the tail sends ACKs back up the chain as it receives updates;
    when a server gets an ACK, it can free everything through that op
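a minimal Go sketch of the update and ACK paths just described -- again not
the paper's code; the names (ChainServer, handleUpdate, handleAck) and the
direct in-process calls standing in for RPCs are assumptions. each server
applies an update locally, remembers it as pending in case a failure forces
a re-send, and forwards it; the tail replies to the client and sends ACKs
back up the chain so predecessors can free their pending copies:

  package main

  import "fmt"

  type Update struct {
    Seq        int64
    Key, Value string
  }

  type ChainServer struct {
    data    map[string]string
    pending map[int64]Update // forwarded but not yet ACKed by the tail
    next    *ChainServer     // nil at the tail
    prev    *ChainServer     // nil at the head
    nextSeq int64            // used only at the head
  }

  // ClientUpdate is called at the head: it assigns the sequence number
  // and starts the update down the chain.
  func (s *ChainServer) ClientUpdate(key, value string, reply func(Update)) {
    s.nextSeq++
    s.handleUpdate(Update{Seq: s.nextSeq, Key: key, Value: value}, reply)
  }

  // handleUpdate applies the update locally and forwards it; updates arrive
  // in seq order, so each server's state is a prefix of its predecessor's.
  func (s *ChainServer) handleUpdate(u Update, reply func(Update)) {
    s.data[u.Key] = u.Value
    if s.next == nil {
      // tail: every server now has the update, so it's committed
      reply(u)           // respond to the client
      s.handleAck(u.Seq) // start ACKs flowing back up the chain
      return
    }
    s.pending[u.Seq] = u // keep it in case a failure forces a re-send
    s.next.handleUpdate(u, reply)
  }

  // handleAck frees everything through seq and propagates the ACK upstream.
  func (s *ChainServer) handleAck(seq int64) {
    for k := range s.pending {
      if k <= seq {
        delete(s.pending, k)
      }
    }
    if s.prev != nil {
      s.prev.handleAck(seq)
    }
  }

  // Get is served only by the tail, from its local replica.
  func (s *ChainServer) Get(key string) (string, bool) {
    v, ok := s.data[key]
    return v, ok
  }

  func main() {
    newSrv := func() *ChainServer {
      return &ChainServer{data: map[string]string{}, pending: map[int64]Update{}}
    }
    s1, s2, s3 := newSrv(), newSrv(), newSrv() // head, middle, tail
    s1.next, s2.next = s2, s3
    s2.prev, s3.prev = s1, s2
    s1.ClientUpdate("x", "99", func(u Update) { fmt.Println("client reply, seq", u.Seq) })
    v, _ := s3.Get("x")
    fmt.Println("read from tail:", v)
  }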
what's the argument that CR won't reveal an uncommitted update?
  i.e. could a client read a value, but then it disappears
    due to a tolerated failure?
  or could a client get a "yes" response to an update,
    but then it's not there after a failure?
  reads come from the tail
    the tail only sees an update after every other server sees it
    so after a failure, every server still has that update
  an update is responded to only after it gets to the tail,
    at which point every server has it

how to add a new server? ("extend the chain")
  you need to do this to restore the replication level after a failure
  again, the CFG manages this
  the new server is added at the tail
  a slow possibility:
    tell the old tail to stop processing updates
    tell the old tail to send a complete copy of its data to the new tail
    tell the old tail to start acting as an intermediate server,
      forwarding to the new tail
    tell the new tail to start acting as the tail
    tell clients about the new tail
  slow b/c we're pausing all updates for minutes or hours
  better is to transfer a snapshot of the state in advance,
    then freeze the system just long enough to send
      the last few updates to the new tail,
    then reconfigure and un-freeze
    perhaps use ZooKeeper's fuzzy snapshot idea

the partition situation is much as in p/b
  the CFG makes all decisions
    it will pick a single new head &c
    based on its view of server liveness -- i.e. just in the CFG's partition
  if the new head is the old 2nd server, it should ignore updates from
    the old head, to cope with "what if the old head is alive but the CFG
    thinks it has failed"
  the CFG needs to grant the tail a lease to serve client reads,
    and not designate a new tail until the lease has expired

p/b versus chain replication?
  p/b may have lower latency (for small requests)
  the chain head has less network load than a primary
    important if data items are big (as with GFS)
  a chain splits work between head and tail
    a primary does it all, so it's more of a bottleneck
  a chain has a simpler story for which server should take over if the
    head fails, and for how to ensure servers get back in sync

chain (or p/b) versus Raft/Paxos/Zab (quorum)?
  p/b can tolerate N-1 of N failures, quorum only N/2
  p/b simpler, maybe faster than quorum
  p/b requires a separate CFG, quorum is self-contained
  p/b must wait for reconfiguration after a failure, quorum keeps going
  p/b is slow if even one server is slow,
    quorum tolerates a temporarily slow minority
  the p/b CFG's server failure detector is hard to tune:
    any failed server stalls p/b, so you want to declare failure quickly!
    but an over-eager failure detector wastes time copying data to a new server
    a quorum system handles short / unclear failures more gracefully
  for a long time p/b (and chain) dominated data replication;
    Paxos was viewed as too complex and slow for high-performance DBs
  recently quorum systems have been gaining ground,
    due to good toleration of temporarily slow/flaky replicas

what if you have too much data to fit on a single replica group?
  e.g. millions of objects
  you need to "shard" across many "replica groups"
  sharding diagram: [CFG, G1, G2, ..., Gn, clients]
  GFS looked like this
  a modern system might use ZK for the CFG,
    and chain or p/b or Raft for the data

how to lay out chains on servers in a big sharding setup?
  the paper's 5.2 / 5.3 / 5.4

a not-so-great chain or p/b sharding arrangement:
  each set of three servers serves a single shard / chain
    shard A: S1 S2 S3
    shard B: S4 S5 S6
  problem: some servers will be more loaded than others
    the primary in each group will be the bottleneck
      while the others have idle capacity
    the head and tail will be more loaded than the middle
    the under-loaded servers waste money!
  problem: replacing a failed replica takes a long time!
    the new server must fetch a whole disk of data over the network
      from one of the remaining replicas
    a terabyte at a gigabit/second takes two hours! (quick check below)
    significant risk of the remaining replicas failing before completion!
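a quick check of that transfer-time claim as a tiny Go program, under
round-number assumptions made here (exactly 1 TB to re-copy, a 1 Gbit/s
link, no protocol overhead):

  package main

  import "fmt"

  func main() {
    const bytesToCopy = 1e12   // ~1 TB of replica data to re-copy
    const linkBitsPerSec = 1e9 // 1 gigabit/second network link
    seconds := bytesToCopy * 8 / linkBitsPerSec
    fmt.Printf("full re-copy takes about %.1f hours\n", seconds/3600) // ~2.2 hours
  }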
a better plan ("rndpar" in Section 5.4):
  split the data into many more shards than servers
    (so each shard is much smaller than in the previous arrangement)
  each server is a replica in many shard groups
    shard A: S1 S2 S3
    shard B: S2 S3 S1
    shard C: S3 S1 S2
    (this is a regular arrangement, but in general it would be random)
  for p/b, a server is primary in some groups, backup in others
  for chain, a server is head in some, tail in others, middle in others
  now the request-processing work is likely to be more balanced

how does rndpar do for repair speed?
  suppose one server fails
  say it participated in M replica groups, for M shards
  instead of designating a single replacement server, let's choose
    M replacement servers, a different one for each shard
    (these are existing servers, which we're giving a new responsibility)
  now repair of the M shards can go on in parallel!
  instead of taking a few hours, it will take 1/M'th of that time

how does rndpar do if three random servers fail?
  as the number of shards on each server increases, it gets more likely
    that *some* shard had its three replicas on exactly the three random
    servers that failed
  this is not ideal
  rndpar gives us fast repair, but it's somewhat undermined by the higher
    probability that a few failures wipe out all replicas of some shard
    (Figure 7)

conclusion
  Chain Replication is one of the clearest descriptions of a ROWA scheme
  it does a good job of balancing work
  it has a simple approach to re-syncing replicas after a failure
  influential: used in EBS, Ceph, Parameter Server, COPS, FAWN
  it's one of a number of designs (p/b, quorums) with different properties

-----------------

5.1 take-away:
  chain *throughput* is as high as p/b, b/c both are limited by the
    head/primary CPU
  for 0% updates, the chain is limited by the tail alone,
    and p/b by the primary alone
  for 100% updates, the chain is limited by the head alone,
    and p/b by the primary alone
  only in the middle is there a difference,
    b/c the chain splits work between head and tail
  BUT network communication is assumed to be free; in real life p/b has
    a problem b/c the primary must send to all backups
  BUT write latency is a problem in the real world,
    since the client is often waiting for the "committed" reply

5.2 take-away:
  assume 1000s of chains and dozens of servers
  idea: split your data over many chains, and the chains over a modest
    number of servers, so that every server is in many chains,
    some as head, some as tail, some in the middle
  this balances the load of being head or tail vs middle
  Figure 5 doesn't seem to say much, maybe just that if you have only
    25 clients, there's not much point in having more than about 25 servers

5.3 take-away:
  assume 1000s of chains and dozens of servers
  repair time is a big deal, since a single server can store so much data
    that it takes hours to transfer over a single network link
  when a *single* server fails, spread responsibility for the data it
    replicated over *multiple* other servers, to get fast parallel repair

5.4 take-away:
  assume 1000s of chains and dozens of servers
  best for parallel recovery speed is no placement constraints and random
    placement, so that both the sources and the destinations of recovery
    traffic are spread more or less evenly after a single failure
  BUT if chains are randomly spread over servers, then *any* combination
    of three (if chainlen=3) random server failures has a good chance of
    destroying all replicas of *some* chain
  the ring topologies are a compromise: you can't spread the recovery load
    very widely, but a few random server failures are less likely to
    destroy all of any one chain's replicas
  for their setup, speed of reconstruction seems to be more important
    than simultaneous failures wiping out a chain
  however, they assume an MTBF of 24 hours for a single server, which does
    make fast repair crucial when repair itself can take hours;
    but 24 hours seems unrealistically short
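to make that trade-off concrete, a small Go simulation -- everything here
is an assumption chosen for illustration (24 servers, chain length 3, fully
random placement, exactly three simultaneous failures, the function name
probSomeChainLost), not a reproduction of the paper's Figure 7 setup. it
estimates how the chance that *some* chain loses all of its replicas grows
with the number of chains, which is the price paid for rndpar's roughly
1/M parallel repair time:

  package main

  import (
    "fmt"
    "math/rand"
  )

  const (
    numServers = 24    // assumed cluster size
    chainLen   = 3     // replicas per chain, as in the paper's examples
    trials     = 20000 // Monte Carlo trials per data point
  )

  // probSomeChainLost estimates the probability that, with numChains chains
  // each placed on chainLen distinct random servers, three simultaneous
  // random server failures destroy every replica of at least one chain.
  func probSomeChainLost(numChains int) float64 {
    lost := 0
    for t := 0; t < trials; t++ {
      // random placement of each chain
      chains := make([][]int, numChains)
      for c := range chains {
        chains[c] = rand.Perm(numServers)[:chainLen]
      }
      // fail three random servers
      failed := map[int]bool{}
      for _, f := range rand.Perm(numServers)[:3] {
        failed[f] = true
      }
      // did any chain lose all of its replicas?
      for _, chain := range chains {
        dead := 0
        for _, srv := range chain {
          if failed[srv] {
            dead++
          }
        }
        if dead == chainLen {
          lost++
          break
        }
      }
    }
    return float64(lost) / trials
  }

  func main() {
    // more chains per server => faster parallel repair, but a higher chance
    // that a few simultaneous failures wipe out some chain entirely
    for _, chains := range []int{8, 100, 1000} {
      fmt.Printf("chains=%4d  P(some chain loses all replicas) ~ %.3f\n",
        chains, probSomeChainLost(chains))
    }
  }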