FAQ for Chardonnay

Q: What is Wound-Wait?
A: Wound-Wait is a technique for avoiding deadlock. If transaction T holds a lock and transaction U needs that lock, one of two things happens. If T started before U, then U waits until T releases the lock. If U started before T, then the system aborts T and gives U the lock. Put another way, the system orders transactions by start time and ensures that older transactions never have to wait for younger transactions. Wound-Wait can abort a transaction unnecessarily, because it doesn't actually detect deadlocks. In particular, if the lock holder is younger and was about to release the lock, Wound-Wait may kill that younger transaction even though, had the older transaction just waited a little, the potential for deadlock would have gone away.

Q: What is eRPC?
A: Chardonnay uses eRPC, a fast experimental RPC system. eRPC can perform a complete RPC in a few microseconds, in contrast to a few hundred microseconds with conventional RPC software. Much of its speed comes from completely bypassing the kernel: eRPC uses a user-level NIC driver and tells the NIC to DMA directly to/from user memory. eRPC polls DMA queues rather than taking interrupts. eRPC uses commercially available high-end network hardware, but requires an unusual software setup. You can read more about eRPC here: https://www.usenix.org/system/files/nsdi19-kalia.pdf

Q: What are SSD and NVMe?
A: SSD means Solid State Drive: a replacement for mechanical spinning hard drives that uses flash memory. SSDs are often attached to the host computer using the same interface and cabling system (SATA) as mechanical hard drives, for compatibility. SSDs are much faster than hard drives: an SSD read or write takes 100 or 200 microseconds rather than 5 or 10 milliseconds. NVMe refers to a technique for attaching storage to a computer via the PCI Express bus, which is faster than SATA. If you pay a lot of money for a flash device attached via NVMe, you might be able to get read times as low as 10 microseconds. The paper's experiments emulate NVMe with ordinary DRAM, so perhaps the authors have in mind an NVMe device that's about as fast as DRAM, but it's not clear what that could be.

Q: What is an Azure L8s v2 VM?
A: https://learn.microsoft.com/en-us/azure/virtual-machines/lsv2-series

Q: Why do transactions need to check leaders' lease ranges?
A: Here's my guess. A read-only transaction in epoch e needs to wait until all locks from epoch e-1 have been released. So it needs to talk to leaders who have valid lock tables for epoch e-1. If leadership has changed since epoch e-1, the new leader won't know about locks from not-yet-prepared transactions from e-1 (since the lock table is in RAM, and not replicated with Paxos).

Q: What is YCSB-A (Section 4)?
A: YCSB (Yahoo Cloud Serving Benchmark) is a set of benchmarks for key/value databases. The "A" variant is 50% reads, 50% writes. https://courses.cs.duke.edu/fall13/cps296.4/838-CloudPapers/ycsb.pdf
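To make the 50/50 mix concrete, here is a small Go sketch of what a YCSB-A-style client loop does; it is not the real YCSB tool. The KV interface, key count, and uniform key choice are invented for illustration (the actual benchmark also skews key popularity, Zipfian by default).

    package ycsb

    import (
        "fmt"
        "math/rand"
    )

    // KV is a stand-in for whatever key/value store is being benchmarked.
    type KV interface {
        Read(key string) (string, error)
        Write(key, value string) error
    }

    // RunWorkloadA issues nOps operations over nKeys keys, half reads and
    // half writes, mimicking YCSB workload A's 50/50 mix.
    func RunWorkloadA(db KV, nOps, nKeys int) {
        for i := 0; i < nOps; i++ {
            key := fmt.Sprintf("user%d", rand.Intn(nKeys))
            if rand.Intn(2) == 0 {
                db.Read(key) // 50% reads
            } else {
                db.Write(key, "new-value") // 50% writes (updates)
            }
        }
    }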
Q: Why does a read/write transaction have to hold read locks until after commit (Section 5.4)?
A: One of the paper's authors kindly explained with an example of what can go wrong if transactions released read locks earlier:
  (1) Transaction T1 reads key K1.
  (2a) T1 calls prepare and releases its lock on K1.
  (2b) In parallel, T1 calls read-epoch, which for some reason stalls.
  (3) Transaction T2 writes key K1.
  (4) T2 runs 2PC and read-epoch, is assigned epoch e, and commits successfully. Since T1 didn't observe T2's write, T1 must be ordered before T2 in any valid ordering.
  (5) The epoch is incremented to e+1.
  (6) Step 2b completes, but with T1 assigned epoch e+1. This breaks the epoch-ordering property.

Q: What is the RPC-Chains-like technique for acquiring locks that Section 8 describes?
A: The Chardonnay client needs to acquire the transaction's locks in advance, in ascending key order, to avoid deadlock. The naive design would acquire the locks one at a time, sending an RPC to the relevant shard server for each lock. If the transaction needs N locks, that's 2*N messages. The paper's scheme uses only N messages, by forwarding a network message listing all the locks from one shard server to the next. Each server acquires its share of the locks, then forwards the message to the next server.

Q: Section 8 describes how Chardonnay avoids deadlock by acquiring locks in order. So why does it later say it may need to fall back on Wound-Wait to avoid deadlock?
A: Some transactions read a few records, and then choose what records to read next based on the contents of the first few records. If those first records change between the dry run and the read/write execution of the transaction, the read/write execution may need different locks than those calculated during the dry run. In that case, the in-order lock acquisition won't have acquired the right locks; the real set of locks will only be discovered as the read/write transaction executes; so more locks will have to be acquired on the fly, and that risks deadlock.

Q: In section 5.3, why does the Transaction State Store not need to record whether a transaction is in the Prepared state?
A: The job of the Transaction State Store (TSS) is to help the shard leaders (participants) recover if the client (the transaction coordinator) fails before all the leaders have received a Commit or Abort from the client. In particular, if a leader times out waiting for a Commit or Abort, it asks the TSS whether it has recorded Commit or Abort. If the TSS says Commit or Abort, the leader follows that decision; if neither, the leader tries to set the state in the TSS to Abort. So one way to look at it is that a leader only needs to know Commit vs. Abort vs. not-yet-known; it never needs to know Prepared. Another way to look at it is that a leader already knows whether it is in the Prepared state -- i.e., whether it has received a Prepare message.
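Here is a minimal Go sketch of that timeout rule. The Store interface and its method names are invented for illustration, not Chardonnay's actual TSS API; only the Commit/Abort/neither decision logic follows the answer above.

    package tss

    // Outcome is what the Transaction State Store can report for a transaction.
    type Outcome int

    const (
        Unknown Outcome = iota // no decision recorded yet
        Commit
        Abort
    )

    // Store is a hypothetical client interface to the TSS.
    type Store interface {
        // Get returns the recorded outcome for txn, or Unknown.
        Get(txn string) Outcome
        // TrySetAbort records Abort unless a decision already exists, and
        // returns whatever outcome ends up recorded.
        TrySetAbort(txn string) Outcome
    }

    // DecideAfterTimeout is what a prepared participant does when it gives up
    // waiting for the coordinator's Commit/Abort message: follow any recorded
    // decision, otherwise try to force Abort. Note that Prepared never appears;
    // the participant already knows whether it has prepared.
    func DecideAfterTimeout(s Store, txn string) Outcome {
        switch s.Get(txn) {
        case Commit:
            return Commit
        case Abort:
            return Abort
        default:
            // The attempt can lose a race with a coordinator that records
            // Commit first, in which case the participant follows Commit.
            return s.TrySetAbort(txn)
        }
    }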
Q: Spanner and Chardonnay both use a notion of time to implement read-only transactions; how do they differ?
A: Spanner uses time and snapshot reads to make read-only transactions fast (no locks, and they can read a local replica), but at the expense of slowing down read/write transactions, which have to suffer commit wait. Spanner's TrueTime works well even among computers at distant data-centers, so Spanner has reasonable performance even when the replicas and clients are geographically spread out. Chardonnay's technique, in contrast, makes read-only transactions slower, by forcing them to wait for all the transactions in the chosen epoch to commit. But in return, certain read/write transactions slow down other transactions less (those that lock a hot record and then read cold records from the disk). Chardonnay only makes sense in a single data-center, because contacting its single epoch server would take a long time if the computers involved were geographically distributed.

Q: Section 5.1 says that the epoch interval can't be too short since then there might be cache misses. What does the paper mean?
A: I imagine that the epoch servers are multi-core machines, and that the core that executes Paxos and writes the current epoch number is different from the core(s) that handle "read epoch" RPC requests from clients. The microprocessor's cache-coherence protocol will invalidate the stale cached copy of the epoch number in all cores whenever the Paxos core updates it, and the other cores will then suffer a cache miss the next time a client asks to fetch the epoch. A CPU data-cache miss can take a microsecond or more, particularly if many cores want the same cache line at the same time. Maybe that would be significant in situations like Figure 6, where the epoch server must be receiving hundreds of thousands of requests per second.
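If that guess is right, the sharing pattern looks roughly like the toy Go model below: one goroutine stands in for the Paxos core that bumps a single shared epoch counter, and other goroutines stand in for the read-epoch handlers that load it. All names and numbers are invented; the point is just that every increment invalidates the readers' cached copy of that cache line, so a shorter epoch interval means more misses on the read path.

    package main

    import (
        "sync/atomic"
        "time"
    )

    // epoch is the single shared counter; every increment invalidates the
    // cached copy of its cache line in the cores running the readers.
    var epoch atomic.Uint64

    // paxosCore stands in for the core that commits a new epoch via Paxos
    // and then publishes it in memory.
    func paxosCore(interval time.Duration) {
        for {
            epoch.Add(1)
            time.Sleep(interval)
        }
    }

    // readEpochHandler stands in for a "read epoch" RPC handler; right after
    // an increment, this load misses in the local cache.
    func readEpochHandler() uint64 {
        return epoch.Load()
    }

    func main() {
        go paxosCore(10 * time.Millisecond) // shorter interval => more invalidations
        for i := 0; i < 8; i++ {
            go func() {
                for {
                    _ = readEpochHandler()
                }
            }()
        }
        time.Sleep(time.Second)
    }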