6.5840 2024 Lecture 14: Chardonnay

Chardonnay: Fast and General Datacenter Transactions for On-Disk Databases,
by Eldeeb et al., OSDI 2023

Why are we reading this paper?
  background: people want transactions; are forced to shard; thus two-phase commit
  remember Spanner: r/w transactions are slow -- 14ms
    relies on fancy time synchronization
  Chardonnay is partially a reaction to Spanner
  a big goal: faster 2pc
  very recent: summer 2023

What's wrong with two-phase commit?
  two rounds of slow-ish network messages
  two slow synchronous disk writes (participants, then coordinator)
  locks must be held for the full commit time
    delaying other transactions that use popular records

How fast are 2pc components? (approximately)
  RPC round-trip: 200us (try pinging a nearby host)
  write a random block on a disk and wait:
    10 ms for a spinning disk
    200 us for an ordinary SSD

Figure 1 / Section 4 measures 2pc performance
  it's a complex graph, not fully explained
  one lesson: faster RPC and a faster log disk make a big difference
    Baseline-Slow versus Baseline-Fast
    eRPC
    NVMe "disk"
    these are new technologies, available post-Spanner

what is eRPC? why is it fast?
  ordinary RPC (as in Go's RPC) is slow
    send: user, system call, copy to kernel buffer, NIC
    recv: NIC DMA, kernel buffer, interrupt, schedule user, system call, copy
    each part is only 1-10 microseconds, but they add up
    round trip around 100 us or more
  eRPC: "kernel bypass"
    NIC DMAs direct to user space -- no copy, no system call
    user thread(s) poll DMA rings -- no interrupt, no scheduling
    round trip of around 2 us!
    plus (probably) uses expensive high-speed NICs and switches

what is NVMe? why is it fast?
  some kind of fast flash storage -- the paper doesn't say
    simulated with RAM!
    they use this simulated NVMe for the logs: critical for 2pc
  kernel bypass?
  PCI-Express or RAM-like interface rather than SATA?
  faster underlying flash?
    Intel Optane with 10 us latency (discontinued...)
    Toshiba XL-FLASH, 5 us
    expensive, rare, but may be the future

Baseline-Fast is the *starting* point for Chardonnay
  any modern two-phase commit could act like Baseline-Fast

the more interesting implication of Figure 1:
  look at Baseline-Fast (10%)
  the left-hand bar is for a read transaction that had to read the disk
  most of the time is spent reading the disk!
    b/c eRPC and NVMe made the other costs much smaller

the paper's claim:
  with fast RPC and NVMe, transaction time will often be dominated
    by the time to read data from the disk.
  and (for multi-record transactions) locks will be held while reading.
  and if the locked records are "hot", that will block other transactions.
  so 2pc performance may be limited by contention due to reads and locks.

Chardonnay aims to reduce 2pc's lock contention.

The big idea: dry run
  run each r/w transaction twice
  once in read-only mode, to cause data to be cached -- no locks
  then for real: no disk waits, to reduce lock hold time,
    to reduce impact on other transactions on the same records

what do we need for dry runs?
  programmer must write the transaction as a function that can be called twice
    (the code snippet in Section 7.1; a sketch follows below)
  suppress DB writes the first time -- easy
  need lock-free reads (since the point was to reduce lock contention)
    AND we need the reads to be properly transactional
    can't just read whatever is in the DB,
      since that might break invariants and cause the application to crash
    aha, can use snapshot reads, as we saw in Spanner!
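A concrete (if made-up) illustration of "a function that can be called twice":
a minimal Go sketch, where the Tx interface, Get/Put, and dryRunTx are
assumptions for illustration rather than the paper's actual API
(the real snippet is in Section 7.1).

  package main

  import (
      "errors"
      "fmt"
  )

  // Tx is a hypothetical client-side transaction handle. In dry-run mode,
  // Get would be a lock-free snapshot read that warms the shard leaders'
  // caches, and Put would be suppressed.
  type Tx interface {
      Get(key string) (int64, error)
      Put(key string, value int64) error
  }

  // transfer is the application's transaction body, written once and run
  // twice: first as a dry run, then for real under locks.
  func transfer(tx Tx, from, to string, amount int64) error {
      fromBal, err := tx.Get(from)
      if err != nil {
          return err
      }
      toBal, err := tx.Get(to)
      if err != nil {
          return err
      }
      if fromBal < amount {
          return errors.New("insufficient funds")
      }
      if err := tx.Put(from, fromBal-amount); err != nil {
          return err
      }
      return tx.Put(to, toBal+amount)
  }

  // dryRunTx stands in for the first pass: reads come from a snapshot
  // (no locks) and writes are suppressed.
  type dryRunTx struct{ snapshot map[string]int64 }

  func (t *dryRunTx) Get(k string) (int64, error) { return t.snapshot[k], nil }
  func (t *dryRunTx) Put(k string, v int64) error { return nil } // suppressed

  func main() {
      // First pass: dry run; its only job is to fault the needed records
      // into cache (here the "snapshot" is just an in-memory stand-in).
      dry := &dryRunTx{snapshot: map[string]int64{"acct-a": 100, "acct-b": 0}}
      fmt.Println("dry run:", transfer(dry, "acct-a", "acct-b", 30))
      // Second pass: call transfer again with the real, locking transaction.
  }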
another twist: the Chardonnay authors don't want to require a special time sync scheme
  so they need snapshot r/o transactions, but without TrueTime

the setup -- Figure 2
  [diagram]
  clients: run transaction code, not fault tolerant
  KV service: sharded data, paxos replication, leaders
    multiple versions for each key, tagged with epoch
  TX state store: 2pc coordinator log, paxos replicated
  Epoch Service: source of time-like epochs, paxos replicated

Epoch Service
  maintains just one number, the epoch
  incremented every 10 ms
  every transaction (including r/o) asks the epoch service for the current epoch
  so there's a notion that a transaction executes "in" a particular epoch
    usually many transactions in each epoch
  epochs are similar to time:
    there is only one epoch service, so anyone who asks sees the same "time"

how do r/w transactions use epochs?
  client sends reads and writes to the appropriate KV leader
    leaders maintain a lock table in RAM (thus lost in a crash)
    leaders remember writes -- tentative
  when the client is ready to try to commit:
    it sends PREPARE requests to the shard leaders
      shard leaders log info via Paxos, and reply
    in parallel, the client asks the epoch service for the current epoch e
    wait for replies...
      abort if any shard says "no" (due to a crash losing locks?)
    client sends out COMMIT messages with the epoch e
      shard leaders log COMMIT through Paxos, then perform the writes,
        then release locks, and reply to the client

Each record in the DB appears in multiple versions.
  a version is tagged <e, i>, where
    e is the epoch of the writer, and
    i is a counter the leader maintains, starting at zero for each epoch.
  the i reflects the order of writes to each key during each epoch,
    so reads can see the most recent.

Now for r/o transactions.
  client T2 gets epoch 13 from the epoch server when the r/o transaction starts
  goal: see a snapshot of the data as of the start of 13.
    i.e. we want to see the results of all r/w transactions that were in
    epochs < 13, but none from epochs >= 13.
    [ 12 | 13 | 14 ]
  Snapshot read of key k is easy if no r/w transaction is active for k:
    when k's shard leader receives a request to read k as of the start of 13,
    it returns k's record <12,i> with the highest i.
  What if an r/w transaction T1 got 12 (or earlier) from the epoch server,
    but hasn't finished committing -- hasn't written the DB?
    An r/o transaction T2 that sees epoch 13 needs to see T1's writes!
    Good news: any such T1 must still be holding locks.
    So it's enough for T2 to wait for all outstanding locks to be released.
  So the procedure for an r/o transaction:
    ask the epoch server for the epoch e
    for each read, ask the shard leader to read as of the end of e-1
      the shard leader first waits for the lock on that key, if any,
      then replies
  Result: r/o sees the results of all r/w transactions < e, none >= e
    (a sketch of this read path appears below)

Are r/o transactions linearizable? Do they see fresh data?
  No, they see a snapshot of the data as of the start of the current epoch.
  If r/w T1 finishes in epoch 13, and *then* r/o T2 starts in epoch 13,
    T2 won't see any of T1's writes.
  So r/o transactions see consistent data, but maybe a bit stale.
  The paper notes that r/o can get fresh data by delaying until the next epoch.
    So r/o T3 that starts in 13 delays until 14, then reads as of the start of 14.
      (still needs to wait for lock releases)
    So T3 will see the writes of all transactions that started before it in 13.
    This is the analogue of Spanner's commit wait, but for r/o rather than r/w.
    (This was The Question.)
  Spanner and Chardonnay make opposite r/w vs r/o optimization decisions.

How do Chardonnay's r/o transactions fit into the dry run idea?
  Remember, the original goal is to speed up r/w transactions.
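(Before getting to that, a minimal Go sketch of the shard leader's
snapshot-read rule described above: wait for any outstanding lock on the
key, then return the newest version from an epoch < e. Version, keyState,
and snapshotRead are names made up for illustration, not Chardonnay's
actual code.)

  package chardonnay

  import "sync"

  // Version is one stored version of a key: Epoch is the writer's epoch e,
  // Counter is the per-epoch counter i assigned by the shard leader,
  // so (e, i) orders the writes to the key.
  type Version struct {
      Epoch   uint64
      Counter uint64
      Value   []byte
  }

  // keyState sketches per-key state at a shard leader.
  type keyState struct {
      mu       sync.Mutex // stand-in for the r/w lock on this key
      versions []Version  // kept sorted by (Epoch, Counter)
  }

  // snapshotRead serves an r/o read "as of the start of epoch e":
  // wait for any outstanding r/w lock on the key to be released (so that a
  // committing transaction from an earlier epoch has applied its write),
  // then return the newest version whose epoch is < e.
  func (ks *keyState) snapshotRead(e uint64) ([]byte, bool) {
      ks.mu.Lock() // blocks until any in-flight r/w transaction releases the lock
      defer ks.mu.Unlock()
      for j := len(ks.versions) - 1; j >= 0; j-- {
          if ks.versions[j].Epoch < e {
              return ks.versions[j].Value, true
          }
      }
      return nil, false
  }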
  The authors' concern: r/w transactions hold locks while reading from disk.
  The dry run pre-fetches and caches the read data.
    Using an r/o transaction (and ignoring writes).
    The dry run sees a consistent snapshot, so it won't crash.
    OK if a little stale, since the key point is not to violate the application's invariants.
  Then the real r/w execution is fast (no disk reads).
    So r/w holds locks for a short time.
    So locks don't slow down other transactions.

For what kinds of transactions does the dry-run idea make the most sense?
  What kinds of transactions would have caused lock contention w/o dry run?
    Read "hot" data (and lock).
    Then read some "cold" data (and wait for disk reads).
    Write the hot data.
    Commit (and release the hot lock).
    (and meanwhile other transactions are waiting for the "hot" lock)
  Do we think this comes up a lot?

What about deadlocks?
  The dry run doesn't lock, so it can't deadlock.
  The dry run reveals which records will be needed, and thus which locks.
  Before the real execution, the client sorts the locks and acquires them in order.
    (a sketch of ordered acquisition appears below)
  If the dry and real executions match, no deadlock!
    Since all r/w transactions dry-run and sort in the same way.
  Very nice that Chardonnay gets this for free, given the dry run.

Does the epoch length matter? Could it be e.g. 10 us rather than 10 ms?
  Shorter epoch -> data read by r/o is less stale.
  Shorter epoch -> linearizable r/o must wait less time until the epoch changes.
  BUT shorter -> the Epoch Service must run Paxos more often -> might get overloaded.
    The Epoch Service is probably easy to overload since EVERY TRANSACTION talks to it.

What's the relationship between epochs and Spanner's TrueTime?
  They are both effectively time sources.
    All transactions use them to time-stamp writes, and for snapshot reads.
  TrueTime is more scalable, in # of clients and geographically,
    b/c GPS is a radio broadcast which many can hear without overloading GPS,
    but the Epoch Service is a single server (or replica cluster).
  Both have some uncertainty.
    The TT interval is similar to the 10ms epoch period.
    Both require transactions to delay to cover the uncertainty:
      commit wait, and (for linearizable r/o) wait for one epoch.
    Both could probably reduce uncertainty with more expensive engineering.

What does Chardonnay's design sacrifice?
  Slows down r/o snapshots, in order to make r/w faster.
  Dry run hurts transactions that aren't contended, don't have a hot record.
  Single datacenter (since there's a single epoch service, and must talk to shard leaders).
  The epoch service smells like a bottleneck.

Evaluation

What are the paper's claims?
  Dry run makes 2pc faster for some workloads, by reducing lock contention.

What would we like them to demonstrate?
  Do their techniques help for real-world workloads?
  Is Chardonnay faster than competing systems?

Figure 4(b) shows the main result
  each transaction:
    8 cold records, two hot records on different shards
    random order, random keys
    reads and then writes all 10 records
    every transaction is r/w (thus dry run, then real)
  x axis: contention index, 1 means everyone hitting the same hot record
  y axis: transactions per second
  the three bars:
    Chardonnay
    Prefetch only: dry run to pre-fetch, but no ordered lock acquisition
    Baseline: Chardonnay w/o dry run
  For low contention, why are they all about the same?
    Baseline wins b/c it doesn't do a dry run
    Chardonnay beats Prefetch-only due to pipelined ordered acquisition
  For contention=0.1, why is Baseline slower?
    the paper says hot locks are held while reading cold data from disk
    deadlock must also be a problem: Figure 5 says a 30% abort rate!
      and those aborted transactions all have to be re-run, wasting effort
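Ordered lock acquisition is the piece that separates Chardonnay from
Prefetch-only above (and whose absence shows up in Baseline's deadlock
aborts). A minimal Go sketch, where the lockKey callback and the
surrounding names are assumptions rather than Chardonnay's real interface:

  package chardonnay

  import "sort"

  // acquireInOrder sketches the deadlock-avoidance step: the dry run has
  // revealed the transaction's key set, so before the real execution the
  // client sorts the keys and takes their locks in that fixed global order.
  // If every r/w transaction sorts the same way, no cycle of waiters can
  // form, so no deadlock -- as long as the real execution touches the same
  // keys the dry run did (otherwise fall back to abort and retry).
  func acquireInOrder(keys []string, lockKey func(key string) error) error {
      sorted := append([]string(nil), keys...)
      sort.Strings(sorted)
      for _, k := range sorted {
          if err := lockKey(k); err != nil {
              return err // e.g. a leader crashed and lost its lock table; abort
          }
      }
      return nil
  }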
  For contention=1.0, why does Baseline get almost zero throughput?
    even more contention, and a 50% abort rate
  Prefetch-only suggests that about half of the win is reduced contention,
    which prefetch-only alone gets you, and half is deadlock avoidance

Are the absolute numbers impressive?
  7x more throughput than the Spanner paper's Table 3 (2200 one-key r/w per second).
  But the experiments are not comparable:
    Chardonnay transactions do much more -- 10 keys.
    Chardonnay's hardware is 10 years more modern.
    Neither paper gives much detail about the setup.

What we *don't* see in the evaluation:
  Real-world workloads (Section 9.1 is carefully tailored for Chardonnay)
  Comparison with other databases
  Details of the hardware/software/network/disk setup

Wrap-up
  distributed transactions are an enduring area of work
  next week, another take: optimistic concurrency control