6.824 2013 Lecture 7: Spanner

Spanner: Google's Globally-Distributed Database
Corbett et al, OSDI 2012

why this paper?
  modern, high performance, driven by real-world needs
  sophisticated use of paxos
  tackles consistency + performance (will be a big theme)
  Lab 4 is a (hugely) simplified version of Spanner

what are the big ideas?
  shard management w/ paxos replication
  high performance despite synchronous WAN replication
  fast reads by asking only the nearest replica
  consistency despite sharding (this is the real focus)
  clever use of time for consistency
  distributed transactions
  this is a dense paper! i've tried to boil down some of the ideas
    to simpler form

idea: sharding
  we've seen this before in FDS
  the real problem is managing configuration changes
  Spanner has a more convincing design for this than FDS

simplified sharding outline (lab 4):
  replica groups, paxos-replicated
    paxos log in each replica group
  master, paxos-replicated
    assigns shards to groups
    numbered configurations
  if master moves a shard, groups eventually see new config
    "start handoff Num=7" op in both groups' paxos logs
      though perhaps not at the same time
    dst can't finish handoff until it has copies of shard data at majority
      and can't wait long for possibly-dead minority
      minority must catch up, so perhaps put shard data in paxos log (!)
    "end handoff Num=7" op in both groups' logs
    (a toy version of these handoff ops is sketched in code below)

Q: what if a Put is concurrent w/ handoff?
   client sees new config, sends Put to new group before handoff starts?
   client has stale view and sends it to old group after handoff?
   arrives at either during handoff?

Q: what if a failure during handoff?
   e.g. old group thinks shard is handed off
   but new group fails before it thinks so

Q: can *two* groups think they are serving a shard?

Q: could old group still serve shard if can't hear master?
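to make the handoff concrete, here is a toy Go sketch of the handoff ops
being applied from a group's paxos log. everything here (op format, names,
the master elided) is invented for illustration -- it is not Spanner's or
Lab 4's real interface:

  package main

  import "fmt"

  type OpKind int

  const (
          OpPut          OpKind = iota
          OpStartHandoff        // "start handoff Num=7": src stops serving the shard
          OpEndHandoff          // "end handoff Num=7": dst starts serving the shard
  )

  type Op struct {
          Kind       OpKind
          Num        int // configuration number
          Shard      int
          Key, Value string
          Data       map[string]string // shard contents, carried in the log so a
                                       // lagging minority can catch up by replaying it
  }

  type Group struct {
          owned map[int]bool // shards this group may currently serve
          kv    map[string]string
  }

  // apply is called in log order for each op the group's paxos log commits;
  // every replica applies the same ops in the same order, so all replicas
  // agree on exactly where in the op stream the shard changed hands
  func (g *Group) apply(op Op) {
          switch op.Kind {
          case OpPut:
                  if !g.owned[op.Shard] {
                          fmt.Println("reject Put", op.Key, "-- shard not owned (stale client config?)")
                          return
                  }
                  g.kv[op.Key] = op.Value
          case OpStartHandoff:
                  delete(g.owned, op.Shard) // src freezes the shard during handoff
          case OpEndHandoff:
                  for k, v := range op.Data { // dst installs the shard data...
                          g.kv[k] = v
                  }
                  g.owned[op.Shard] = true // ...and only then starts serving it
          }
  }

  func main() {
          src := &Group{owned: map[int]bool{7: true}, kv: map[string]string{}}
          dst := &Group{owned: map[int]bool{}, kv: map[string]string{}}
          src.apply(Op{Kind: OpPut, Shard: 7, Key: "a", Value: "1"})
          src.apply(Op{Kind: OpStartHandoff, Num: 7, Shard: 7})
          dst.apply(Op{Kind: OpEndHandoff, Num: 7, Shard: 7, Data: src.kv})
          src.apply(Op{Kind: OpPut, Shard: 7, Key: "b", Value: "2"}) // rejected
          fmt.Println(dst.kv)                                        // map[a:1]
  }

a real version must still answer the questions above; e.g. the reject in
OpPut is what protects against a Put from a client with a stale config.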
idea: wide-area synchronous replication
  goal: survive single-site disasters
  goal: replica near customers
  goal: don't lose any updates

considered impractical until a few years ago
  paxos too expensive, so maybe primary/backup?
  if primary waits for ACK from backup
    50ms network will limit throughput and cause palpable delay
    esp if app has to do multiple reads at 50ms each
  if primary does not wait, it will reply to client before durable
    danger of split brain; can't make network reliable

what's changed?
  other site may be only 5 ms away -- San Francisco / Los Angeles
  faster/cheaper WAN
  apps written to tolerate delays
    may make many slow read requests
    but issue them in parallel
    maybe time out quickly and try elsewhere, or redundant gets
  huge # of concurrent clients lets you get hi thruput despite high delay
    run their requests in parallel
  people appreciate paxos more and have streamlined variants
    fewer msgs
    page 9 of paxos paper: 1 round per op w/ leader + bulk preprepare
    paper's scheme a little more involved b/c they must ensure
      there's at most one leader
    read at any replica

actual performance?
  Table 3
    pretend just measuring paxos for writes, read at any replica for reads
    latency
      why doesn't write latency go up w/ more replicas?
      why does std dev of latency go down w/ more replicas?
      r/o a *lot* faster since not a paxos agreement + use closest replica
    throughput
      why does read throughput go up w/ # replicas?
      why doesn't write throughput go up?
      does write thruput seem to be going down?
  what can we conclude from Table 3? is the system fast? slow?
    how fast do your paxoses run?
    mine takes 10 ms per agreement with purely local communication and no disk
    Spanner paxos might wait for disk write
  Figure 5
    npaxos=5, all leaders in same zone
    why does killing a non-leader in each group have no effect?
    for killing all the leaders ("leader-hard"):
      why flat for a few seconds?
      what causes it to start going up?
      why does it take 5 to 10 seconds to recover?
      why is slope *higher* until it rejoins?

spanner reads from any paxos replica
  read does *not* involve a paxos agreement
  just reads the data directly from replica's k/v DB
  maybe 100x faster -- same room rather than cross-country

Q: could we *write* to just one replica?

Q: is reading from any replica correct?

example of a problem: photo sharing site
  i have photos
  i have an ACL (access control list) saying who can see my photos
  i take my mom out of my ACL, then upload new photo
  really it's web front ends doing these client reads/writes
  1. W1: I write ACL on group G1 (bare majority), then
  2. W2: I add image on G2 (bare majority), then
  3. mom reads image -- may get old data from lagging G2 replica
  4. mom reads ACL -- may get new data from G1
  this system is not acting like a single server!
    there was not really any point in time at which the image was present
    but the ACL hadn't been updated
  this problem is caused by a combination of
    * partitioning -- replica groups operate independently
    * cutting corners for performance -- read from any replica

how can we fix this?
  A. make reads see latest data
     e.g. full paxos for reads
     expensive!
  B. make reads see *consistent* data
     data as it existed at *some* previous point in time
     i.e. before #1, between #1 and #2, or after #2
     this turns out to be much cheaper
     spanner does this

here's a super-simplification of spanner's consistency story for r/o clients
  "snapshot" or "lock-free" reads
  assume for now that all the clocks agree
  server (paxos leader) tags each write with the time at which it occurred
  k/v DB stores *multiple* values for each key, each with a different time
  reading client picks a time t
    for each read, asks the relevant replica to do the read at time t
  how does a replica read a key at time t?
    return the stored value with highest time <= t
  but wait, the replica may be behind
    that is, there may be a write at time < t that the replica hasn't seen
    so replica must somehow be sure it has seen all writes <= t
    idea: has it seen *any* operation from time > t?
      if yes, and the paxos group always agrees on ops in time order,
      it's enough to check/wait for an op with time > t
      that is what spanner does on reads (4.1.3)
      (a toy multi-version store is sketched in code below, after the example)
  what time should a reading client pick?
    using current time may force lagging replicas to wait
    so perhaps a little in the past
    client may miss latest updates
    but at least it will see a consistent snapshot
    in our example, won't see new image w/o also seeing ACL update

how does that fix our ACL/image example?
  1. W1: I write ACL, G1 assigns it time=10, then
  2. W2: I add image, G2 assigns it time=15 (> 10 since clocks agree)
  3. mom picks a time, for example t=14
  4. mom reads ACL at t=14 from lagging G1 replica
     if it hasn't seen paxos agreements up through t=14, it knows to wait
     so it will return the new ACL
  5. mom reads image from G2 at t=14
     image may have been written on that replica
     but it will know to *not* return it, since image's time is 15
  other choices of t work too
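here is a toy Go sketch of a single replica's multi-version store with
read-at-time -- invented names, integer timestamps, and one replica
standing in for a whole paxos group:

  package main

  import (
          "fmt"
          "sort"
  )

  type version struct {
          t     int64 // timestamp the paxos leader assigned to the write
          value string
  }

  type MVStore struct {
          versions map[string][]version // per key, ascending by t
          safeTime int64                // all writes w/ time <= safeTime applied
  }

  // write applies a paxos-committed write; the group agrees on ops in
  // timestamp order, so safeTime only moves forward
  func (s *MVStore) write(key, value string, t int64) {
          s.versions[key] = append(s.versions[key], version{t, value})
          s.safeTime = t
  }

  // readAt returns the value of key as of time t: the stored version
  // with the highest timestamp <= t
  func (s *MVStore) readAt(key string, t int64) (string, bool) {
          if t > s.safeTime {
                  // replica can't be sure it has seen all writes <= t;
                  // a real replica would wait for a later op, not answer
                  return "", false
          }
          vs := s.versions[key]
          i := sort.Search(len(vs), func(j int) bool { return vs[j].t > t })
          if i == 0 {
                  return "", true // key had no value yet at time t
          }
          return vs[i-1].value, true
  }

  func main() {
          s := &MVStore{versions: map[string][]version{}}
          s.write("acl", "friends-but-not-mom", 10) // W1
          s.write("img", "new-photo", 15)           // W2
          acl, _ := s.readAt("acl", 14)
          img, _ := s.readAt("img", 14)
          fmt.Printf("at t=14: acl=%q img=%q\n", acl, img) // new ACL, no image
  }

in the real system the ACL and the image live in different replica groups
(G1 and G2); the point is that a single timestamp t gives a consistent
cut across all of them.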
Q: is it reasonable to assume that different computers' clocks agree?
   why might they not agree?

Q: what may go wrong if servers' clocks don't agree?
   a performance problem: reading client may pick a time in the future,
     forcing reading replicas to wait to "catch up"
   a correctness problem: again, the ACL/image example
     G1 and G2 disagree about what time it is
     1. W1: I write ACL on G1 -- stamped with time=15
     2. W2: I add image on G2 -- stamped with time=10
     now a client read at t=14 will see the image but not the ACL update

Q: why doesn't spanner just ensure that the clocks are all correct?
   after all, it has all those master GPS / atomic clocks

TrueTime (section 3)
  there is an actual "absolute" time t_abs
    but server clocks are typically off by some unknown amount
  TrueTime can bound the error
    so now() yields an interval: [earliest,latest]
    earliest and latest are ordinary scalar times
      perhaps microseconds since Jan 1 1970
    t_abs is highly likely to be between earliest and latest

Q: how does TrueTime choose the interval?

Q: why are GPS time receivers able to avoid this problem?
   do they actually avoid it?
   what about the "atomic clocks"?

spanner assigns each write a scalar time
  might not be the actual absolute time
  but is chosen to ensure consistency

the danger:
  W1 at G1, G1's interval is [20,30]
    is any time in that interval OK?
  then W2 at G2, G2's interval is [11,21]
    is any time in that interval OK?
  if they are not careful, might get s1=25 and s2=15

so what we want is:
  if W2 starts after W1 finishes, then s2 > s1
  simplified "external consistency invariant" from 4.1.2
  causes snapshot reads to see data consistent w/ true order of W1, W2

how does spanner assign times to writes?
  (again, this is much simplified, see 4.1.2; a code sketch follows below)
  a write request arrives at the paxos leader
    s will be the write's time-stamp
  leader sets s to TrueTime now().latest
    this is "Start" in 4.1.2
  then leader *delays* until s < now().earliest
    i.e. until s is guaranteed to be in the past (compared to absolute time)
    this is "commit wait" in 4.1.2
  then leader runs paxos to cause the write to happen
  then leader replies to client

does this work for our example?
  W1 at G1: TrueTime says [20,30]
    s1 = 30
    commit wait until TrueTime says [31,41]
    reply to client
  W2 at G2: TrueTime *must* now say >= [21,31]
    (otherwise TrueTime is broken)
    s2 = 31
    commit wait until TrueTime says [32,43]
    reply to client
  it does work for this example:
    the client observed that W1 finished before W2 started, and indeed s2 > s1
    even though G2's TrueTime clock was slow by the most it could be
    so if my mom sees W2, she is guaranteed to also see W1

why the "Start" rule?
  i.e. why choose the time at the end of the TrueTime interval?
  previous writers waited only until their timestamps were barely < t_abs
  new writer must choose s greater than that of any completed write
  t_abs might be as high as now().latest
  so s = now().latest

why the "Commit Wait" rule?
  ensures that s < t_abs
  otherwise a write might complete with an s in the future
    and that would let the Start rule give too low an s to a subsequent write

Q: why commit *wait*; why not immediately write value with chosen time?
   it indirectly forces the subsequent write to have a high enough s
   the system has no other way to communicate the minimum acceptable next s
     for writes in different replica groups
   waiting forces writes that some external agent is serializing
     to have monotonically increasing timestamps
   w/o the wait, our example goes back to s1=30, s2=21
   you could imagine explicit schemes to communicate the last write's TS
     to the next write
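a toy Go sketch of the Start and Commit Wait rules -- now(eps) fakes
TrueTime with an assumed fixed uncertainty eps; the real API and its
GPS / atomic-clock time masters are more involved:

  package main

  import (
          "fmt"
          "time"
  )

  // Interval is what TrueTime's now() returns: t_abs is (with very
  // high probability) somewhere in [Earliest, Latest]
  type Interval struct {
          Earliest, Latest time.Time
  }

  // now fakes TrueTime using the local clock and an assumed fixed
  // uncertainty eps; real TrueTime derives its bound from time masters
  func now(eps time.Duration) Interval {
          t := time.Now()
          return Interval{Earliest: t.Add(-eps), Latest: t.Add(eps)}
  }

  // assignTimestamp shows the two rules a paxos leader applies to a write
  func assignTimestamp(eps time.Duration) time.Time {
          // Start rule: s = now().latest, so s is >= the timestamp of
          // any write whose commit wait has already finished
          s := now(eps).Latest

          // Commit wait: delay until s < now().earliest, i.e. until s
          // is guaranteed to be in the past; only then run paxos and
          // reply. this is what forces a later write (even in another
          // group) to pick a larger s
          for !s.Before(now(eps).Earliest) {
                  time.Sleep(100 * time.Microsecond)
          }
          return s
  }

  func main() {
          eps := 5 * time.Millisecond // TrueTime uncertainty bound
          begin := time.Now()
          s := assignTimestamp(eps)
          // the wait is roughly 2*eps: more uncertainty, longer wait
          fmt.Printf("s=%v elapsed=%v\n", s.Format("15:04:05.000"), time.Since(begin))
  }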
Q: how long is the commit wait?
   this answers today's Question
   a large TrueTime uncertainty requires a long commit wait
   so the Spanner authors are interested in accurate low-uncertainty time

let's step back
  why did we get into all this timestamp stuff?
    our replicas were 100s or 1000s of miles apart (for locality/fault tol)
    we wanted fast reads from a local replica (no full paxos)
    our data was partitioned over many replica groups w/ separate clocks
    we wanted consistency for reads:
      if W1 then W2, reads don't see W2 but not W1
  it's complex but it makes sense as a high-performance evolution
    of Lab 3 / Lab 4

why is this timestamp technique interesting?
  we want to enforce order -- things that happened in some order in real time
    are ordered the same way by the distributed system -- "external consistency"
  the naive approach requires a central agent, or lots of communication
  Spanner does the synchronization implicitly via time
    time can be a form of communication
    e.g. we agree in advance to meet for dinner at 6:00pm

there's a lot of additional complexity in the paper
  transactions, two phase commit, two phase locking,
    schema change, query language, &c
  some of this we'll see more of later
  in particular, the problem of ordering events in a distributed system
    will come up a lot, soon