6.5840 2025 Lecture 14: Chardonnay

XXX take laptop to project Figure 1

Chardonnay: Fast and General Datacenter Transactions for On-Disk
Databases, by Eldeeb et al., OSDI 2023

Why are we reading this paper?
  people want transactions; are forced to shard; thus two-phase commit
  much effort has gone into faster distributed transactions
  Chardonnay is partially a reaction to Spanner
    Spanner r/w transactions are slow -- 14 ms
    Spanner relies on fancy time synchronization
  Chardonnay is slower than FaRM, but
    supports big on-disk storage, not limited to RAM
    offers a higher-level programming model (tables/rows/columns)

What's wrong with two-phase commit?
  two rounds of network messages
  two synchronous disk writes (participants, then coordinator)
  locks must be held until the commit is complete
    delaying other transactions that use popular records

How fast are 2pc's components? (approximately)
  RPC round trip: 200 us (try pinging a nearby host)
  write a random block on a disk and wait:
    10 ms for a spinning disk
    200 us for an ordinary SSD

Figure 1 / Section 4 measures 2pc performance
  it's a complex graph, not fully explained
  one lesson: faster RPC and a faster log disk make a big difference
    Baseline-Slow versus Baseline-Fast
    eRPC
    NVMe "disk"
    these are new technologies, available post-Spanner

what is eRPC? why is it fast?
  ordinary RPC (as in Go's RPC) is slow
    send: user, system call, copy to kernel buffer, NIC
    recv: NIC DMA, kernel buffer, interrupt, schedule user, system call, copy
    each part is only 1-10 microseconds, but they add up
    round trip of around 100 us or more
  eRPC: "kernel bypass"
    NIC DMAs directly to user space -- no copy, no system call
    user thread(s) poll DMA rings -- no interrupt, no scheduling
    round trip of around 2 us!

what is NVMe?
  log writes (even to SSD) are traditionally painfully slow
  so one would like storage with very fast writes
    and non-volatile, so the log survives crash+reboot
  the paper says "NVMe" but does not say what hardware it has in mind!
  the experiments simulate NVMe with RAM
    RAM has write latency of less than a microsecond -- very nice
    but you cannot buy flash this fast!
  the fastest real-world flash has write latency of around 10 or 20 us
    Intel Optane SSD DC P5800X
    Samsung PM1743
    currently very exotic

Baseline-Fast is the *starting* point for Chardonnay
  any modern two-phase commit could act like Baseline-Fast

the more interesting implication of Figure 1:
  look at Baseline-Fast (10%)
  the left-hand bar is for a read transaction that had to read the disk
  most of the time is spent reading the disk!
    b/c eRPC and NVMe made the other costs much smaller
  the paper's claim: with fast RPC and NVMe, transaction time will often be
    dominated by the time to read data from the disk
  and (for multi-record transactions) locks will be held while reading
  and if the locked records are "hot", that will block other transactions
  so 2pc performance may be limited by contention due to "hot" locks
    held while reading cold records from disk

Chardonnay aims to reduce 2pc's lock contention.

The big idea: dry runs
  run each r/w transaction twice
  first the dry run:
    to cause the data to be fetched from disk and cached
    in read-only mode; discard writes; don't lock
  then for real: with locks and writes
    since the data is now cached, locks are held for less time,
    so other transactions have to wait less for hot locks
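To make the dry-run idea concrete, here is a minimal Go sketch of a
transaction written as a function that the client library can execute
twice. The Txn/RWTxn/Client interfaces and the runTransaction driver are
hypothetical names for illustration, not Chardonnay's actual API (the
paper's own code snippet is in Section 7.1).

  // Txn is what the transaction body sees. In the dry run, Read is a
  // lock-free snapshot read and Write is discarded; in the real run,
  // Read/Write acquire locks and buffer tentative writes for 2pc.
  type Txn interface {
      Read(key string) (string, error)
      Write(key, value string)
  }

  // RWTxn adds commit/abort for the real (locking) execution.
  type RWTxn interface {
      Txn
      Commit() error
      Abort()
  }

  // Client is the assumed client-library interface.
  type Client interface {
      BeginDryRun() Txn // snapshot reads, no locks, writes ignored
      Begin() RWTxn     // ordinary 2pc transaction
  }

  // The programmer writes the transaction as a function that is safe
  // to call twice.
  func payment(t Txn) error {
      hot, err := t.Read("account:1") // popular record
      if err != nil {
          return err
      }
      cold, err := t.Read("order:9999") // cold record; may need a disk read
      if err != nil {
          return err
      }
      t.Write("account:1", hot+cold) // placeholder for the real update logic
      return nil
  }

  // runTransaction drives the two executions.
  func runTransaction(c Client, body func(Txn) error) error {
      // 1. Dry run: pulls cold records into the servers' caches.
      //    Errors here only cost us the prefetch.
      _ = body(c.BeginDryRun())

      // 2. Real run: reads now hit the cache, so locks are held briefly.
      t := c.Begin()
      if err := body(t); err != nil {
          t.Abort()
          return err
      }
      return t.Commit()
  }

The sketch assumes the body has no side effects other than its Reads and
Writes, so that discarding the dry run's execution is harmless.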
What do we need for dry runs?
  the programmer must write the transaction as a function that can be
    called twice (the code snippet in Section 7.1; the sketch above has
    roughly this shape)
  we need lock-free reads (since the point was to reduce lock contention)
    AND we need the reads to be properly transactional
    can't just read whatever happens to be in the DB,
      since that might break invariants and cause the application to crash
  we can use snapshot reads, as in Spanner!
    the DB stores multiple versions of each record, tagged with time (epoch)
    each r/w transaction's writes all appear at the same instant in time
    a read-only transaction sees the set of values as of an instant in time
      writes of all xactions before that point; no writes from xactions after
    no locks required for read-only transactions!
  another twist: the Chardonnay authors don't want to require a special
    time synchronization scheme
    so they need snapshot r/o transactions, but without TrueTime

The setup -- Figure 2
  [diagram]
  clients: run transaction code, not fault tolerant
  Epoch Service: source of time-like epoch numbers, paxos replicated
  KV service: sharded data, paxos replication, leaders
    multiple versions for each key, tagged with epoch
  TX State Store: 2pc coordinator log, paxos replicated

Two-phase commit for r/w transactions:
  [arrow diagram]
  client (as TC) sends reads and writes
    leaders keep a lock table and tentative writes in RAM
    (the lock table and writes are lost if a leader crashes)
  client sends PREPARE messages to leaders
    leaders log prepared state, locks, and tentative writes via paxos
    (or reply "no" if they crashed, or on deadlock)
  client waits for "yes" from all leaders
  client logs "commit" (or "abort") in the TX State Store's paxos, and waits
    paxos agreement is the "commit point"
  client sends out COMMIT to the leaders
    leaders perform the writes, release the locks

The Epoch Service maintains just one number, the epoch
  increments the epoch every 10 ms
  many transactions occur during any one epoch
  fault-tolerant via Paxos
  interesting performance details in the paper

How do r/w transactions use epochs?
  [diagram]
  the client asks the Epoch Service for the current epoch e,
    in parallel with the PREPAREs
  the client sends e to the leaders in the COMMIT message
  each record in the DB appears in multiple versions
    the DB is indexed by <key, e, i>, where e is the prepare epoch of the
      writing transaction, and i is a counter the leader maintains,
      starting at zero for each epoch
    the i reflects the order of writes to each key during each epoch,
      so reads can find the most recent version

Now for r/o transactions.
  client T2 gets epoch 13 from the epoch server when its r/o transaction starts
  goal: see a snapshot of the data as of the start of 13
    i.e. we want to see the results of all r/w transactions in epochs < 13,
    but none from epochs >= 13
  [ 12 | 13 | 14 ]
  a snapshot read of key k is easy if no r/w transaction is active for k:
    when k's shard leader receives a request to read k as of the start of 13,
    it returns k's record <12,i> with the highest i
  what if an r/w transaction T1 got 12 (or earlier) from the epoch server,
    but hasn't finished committing -- hasn't written the DB?
    an r/o transaction T2 in epoch 13 needs to see T1's writes!
    good news: any such T1 must still be holding its locks
    so it's enough for T2 to wait for any conflicting lock to be released
  so the procedure for an r/o transaction:
    ask the epoch server for the epoch e
    for each read, ask the shard leader to read the version as of the end of e-1
      the shard leader first waits for any lock on that key, then replies
    result: the r/o sees the results of all r/w transactions < e, and none >= e,
      and it acquires no locks, so it doesn't slow down other xactions
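Here is a Go sketch of the shard-leader side of that snapshot read
(assumes the standard "sync" package is imported). The Leader struct, its
in-memory lock table, and the version layout are assumptions made for
illustration; the real system keeps versions on disk and replicates them
via paxos.

  // One stored version of a key, tagged <epoch, i>; i is the per-epoch
  // counter the leader maintains, reflecting write order within the epoch.
  type Version struct {
      Epoch int64
      Seq   int64
      Value string
  }

  // A toy shard leader: the keys currently locked by prepared r/w
  // transactions, plus every version of every key, all in memory.
  type Leader struct {
      mu       sync.Mutex
      unlocked *sync.Cond           // signaled whenever a lock is released
      locked   map[string]bool      // keys locked by prepared r/w txns
      versions map[string][]Version
  }

  func newLeader() *Leader {
      ld := &Leader{
          locked:   map[string]bool{},
          versions: map[string][]Version{},
      }
      ld.unlocked = sync.NewCond(&ld.mu)
      return ld
  }

  // readAsOf serves an r/o transaction running in epoch e: return key's
  // value as of the end of epoch e-1.
  func (ld *Leader) readAsOf(key string, e int64) (string, bool) {
      ld.mu.Lock()
      defer ld.mu.Unlock()

      // A prepared-but-uncommitted r/w transaction with an epoch < e is
      // still holding its lock, so waiting for the lock to go away is
      // enough to make its writes (or its abort) visible to this read.
      for ld.locked[key] {
          ld.unlocked.Wait()
      }

      // Return the newest version written before epoch e.
      var best *Version
      for i := range ld.versions[key] {
          v := &ld.versions[key][i]
          if v.Epoch >= e {
              continue // written in epoch e or later: not in the snapshot
          }
          if best == nil || v.Epoch > best.Epoch ||
              (v.Epoch == best.Epoch && v.Seq > best.Seq) {
              best = v
          }
      }
      if best == nil {
          return "", false // key did not exist before epoch e
      }
      return best.Value, true
  }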
Are r/o transactions linearizable? Do they see fresh data?
  No, they see a snapshot of the data as of the start of the current epoch.
  If r/w T1 finishes in epoch 13, and *then* r/o T2 starts in epoch 13,
    T2 won't see any of T1's writes.
  So r/o transactions see consistent data, but maybe a bit stale.
  The paper notes that an r/o transaction can get fresh data by delaying
    until the next epoch.
    So an r/o T3 that starts in 13 delays until 14, then reads as of the
      start of 14 (it still needs to wait for lock releases).
    So T3 will see the writes of all transactions that started before it in 13.
    This is the analogue of Spanner's commit wait, but for r/o rather than r/w.
    (This was The Question)
  Spanner and Chardonnay make opposite r/w vs r/o optimization decisions.

How do Chardonnay's r/o transactions fit into the dry-run idea?
  Goal: r/w transactions should avoid slow disk reads while holding "hot" locks,
    so they don't force other transactions to wait.
  The dry run pre-fetches and caches the data it reads,
    executing as an r/o transaction (and ignoring the writes).
  The dry run sees a consistent snapshot, so the application won't crash.
    OK if it's stale, since the point is just to not violate the
    application's invariants.
  Then the real r/w execution is fast (no disk reads),
    so the r/w transaction holds its locks for only a short time,
    so its locks don't slow down other transactions as much.

For what kinds of transactions does the dry-run idea make the most sense?
  What kinds of transactions would have caused lock contention w/o the dry run?
    Read "hot" data (and lock it).
    Then read some "cold" data (and wait for disk reads).
    Write the hot data.
    Commit (and release the hot lock).
    (and meanwhile other transactions are waiting for the "hot" lock)
  Do we think this comes up a lot?

What about deadlocks?
  The dry run doesn't lock, so it can't deadlock.
  The dry run reveals which records will be needed, and thus which locks.
  Before the real execution of the transaction, the client sorts the locks
    and acquires them in that order.
  If the dry and real executions match, there can be no deadlock!
    Since all r/w transactions dry-run and sort in the same way.
  Very nice that Chardonnay gets this for free, given the dry run.
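A Go sketch of that ordered-acquisition step (hypothetical helper names;
assumes the standard "sort" package is imported). The client takes the key
set observed during the dry run, sorts it, and requests each lock in order;
the real system pipelines these lock requests rather than issuing them
strictly one at a time as shown here.

  // acquireInOrder locks every key the dry run read or wrote, in sorted
  // order. keysFromDryRun is the read/write set the dry run recorded;
  // lockKey stands in for an RPC to the key's shard leader that blocks
  // until the lock is granted.
  func acquireInOrder(keysFromDryRun map[string]bool,
      lockKey func(key string) error) error {

      keys := make([]string, 0, len(keysFromDryRun))
      for k := range keysFromDryRun {
          keys = append(keys, k)
      }
      sort.Strings(keys) // every transaction sorts the same way -> no cycles

      for _, k := range keys {
          if err := lockKey(k); err != nil {
              return err // e.g. leader unreachable; caller aborts and retries
          }
      }
      return nil
  }

If the real execution then needs a key the dry run didn't see, the two
executions have diverged, and the no-deadlock argument above no longer
applies to that transaction.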
Does the epoch length matter? Could it be e.g. 10 us rather than 10 ms?
  Shorter epoch -> data read by r/o transactions is less stale.
  Shorter epoch -> a linearizable r/o must wait less time for the epoch to change.
  BUT shorter -> the Epoch Service runs Paxos more often -> maybe overloaded.
    The Epoch Service is a potential bottleneck since EVERY transaction talks to it.

What's the relationship between epochs and Spanner's TrueTime?
  They are both effectively time sources,
    used to time-stamp writes -- for snapshot reads.
  TrueTime works fine when geographically distributed:
    distant datacenters can all hear the correct time from GPS.
    But the Epoch Service is a single server (or replica cluster),
      so Chardonnay is effectively restricted to a single datacenter.
  Both have some uncertainty.
    The TrueTime interval is similar to the 10 ms epoch period.
    Both require transactions to delay to cover the uncertainty:
      commit wait, and (for linearizable r/o) waiting for one epoch.
    Both could probably reduce the uncertainty with more expensive engineering.

What does Chardonnay's design sacrifice?
  It slows r/o transactions, to make r/w transactions faster.
  The dry run hurts low-contention transactions.
  Only a single datacenter (epoch service; must talk to leaders).
  The epoch service is potentially a bottleneck.

Evaluation

What are the paper's claims?
  Dry runs make 2pc faster for some workloads, by reducing lock contention.

What would we like them to demonstrate?
  Do their techniques help for real-world workloads?
  Is Chardonnay faster than competing systems?

Figure 4(b) (page 11) shows the main result
  each transaction: eight cold records and two hot records, on different shards
    random order, random keys
    reads and then writes all 10 records
  every transaction is r/w (thus a dry run, then the real run)
  x axis: contention index; 1 means everyone is hitting the same hot record
  y axis: transactions per second (higher is better)
  the three bars:
    Chardonnay
    Prefetch only: dry run to pre-fetch, but no ordered lock acquisition
    Baseline: Chardonnay w/o dry runs
  For low contention, why are all three about the same?
    Baseline wins b/c it doesn't do the dry run
    Chardonnay beats prefetch-only due to pipelined ordered lock acquisition
  For contention=0.1, why is Baseline slower?
    the paper says hot locks are held while reading cold data from disk
    deadlock must also be a problem: Figure 5 says a 30% abort rate!
      aborted transactions all have to be re-run, wasting effort
  For contention=1.0, why does Baseline get almost zero throughput?
    even more contention, and a 50% abort rate
  Prefetch-only suggests that about half of the win is reduced contention,
    which prefetch-only gets you, and half is deadlock avoidance
  Note that the Section 9.1 workload is designed to favor Chardonnay:
    many cold records (-> much time spent reading disk)
    very hot records (-> contention)
    random order (-> risk of deadlock)

Are the absolute numbers impressive?
  7x more throughput than the Spanner paper's Table 3 (2200 one-key r/w / second).
  But the experiments are not comparable:
    Chardonnay's transactions do much more -- 10 keys.
    Chardonnay's hardware is 10 years more modern.
    Neither paper gives much detail about its setup.

What we *don't* see in the evaluation:
  Does the dry run help for real-world workloads?
  Comparison with other databases.
  Details of the hardware/software/network/disk setup.

Next week: revisiting replication, this time with proofs of correctness.