FAQ for Chardonnay

Q: What is Wound-Wait?
A: Wound-Wait is a technique for avoiding deadlock. If transaction T holds a lock and transaction U needs that lock, one of two things happens. If T started before U, then U waits until T releases the lock. If U started before T, then the system aborts T and gives U the lock. Put another way, the system orders transactions by start time and ensures that older transactions never have to wait for younger transactions. Wound-Wait can abort a transaction unnecessarily, because it doesn't actually detect deadlocks. In particular, if the lock holder is younger and was about to release the lock, Wound-Wait may kill that younger transaction even though, had the older transaction just waited a little, the potential for deadlock would have gone away.

Q: What is eRPC?
A: Chardonnay uses eRPC, a fast experimental RPC system. eRPC can perform a complete RPC in a few microseconds, in contrast to a few hundred microseconds with conventional RPC software. Much of its speed comes from completely bypassing the kernel: eRPC uses a user-level NIC driver and tells the NIC to DMA directly to/from user memory. eRPC polls DMA queues rather than taking interrupts. eRPC uses commercially available high-end network hardware, but requires an unusual software setup. You can read more about eRPC here: https://www.usenix.org/system/files/nsdi19-kalia.pdf

Q: What are SSD and NVMe?
A: SSD means Solid State Drive: a replacement for mechanical spinning hard drives that uses flash memory. SSDs are often attached to the host computer using the same interface and cabling system (SATA) as mechanical hard drives, for compatibility. SSDs are much faster than hard drives: an SSD read or write takes 100 or 200 microseconds rather than 5 or 10 milliseconds. NVMe refers to a technique for attaching storage to a computer via the PCI Express bus, which is faster than SATA. If you pay a lot of money for a flash device attached via NVMe, you might be able to get read times as low as 10 microseconds. The paper's experiments emulate NVMe with ordinary DRAM, so perhaps the authors have in mind an NVMe device that's about as fast as DRAM, but it's not clear what that could be.

Q: What is an Azure L8s v2 VM?
A: https://learn.microsoft.com/en-us/azure/virtual-machines/lsv2-series

Q: Why do transactions need to check leaders' lease ranges?
A: Here's my guess. A read-only transaction in epoch e needs to wait until all locks from epoch e-1 have been released. So it needs to talk to leaders who have valid lock tables for epoch e-1. If leadership has changed since epoch e-1, the new leader won't know about locks from not-yet-prepared transactions from e-1 (since the lock table is in RAM, and not replicated with Paxos).

Q: What is YCSB-A (Section 4)?
A: YCSB (Yahoo Cloud Serving Benchmark) is a set of benchmarks for key/value databases. The "A" variant is 50% reads, 50% writes. https://courses.cs.duke.edu/fall13/cps296.4/838-CloudPapers/ycsb.pdf
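To make the 50/50 mix concrete, here is a small Go sketch of what a YCSB-A-style client loop does; it is not the real YCSB tool. The KV interface, key count, and uniform key choice are invented for illustration (the actual benchmark also skews key popularity, Zipfian by default).

    package ycsb

    import (
        "fmt"
        "math/rand"
    )

    // KV is a stand-in for whatever key/value store is being benchmarked.
    type KV interface {
        Read(key string) (string, error)
        Write(key, value string) error
    }

    // RunWorkloadA issues nOps operations over nKeys keys, half reads and
    // half writes, mimicking YCSB workload A's 50/50 mix.
    func RunWorkloadA(db KV, nOps, nKeys int) {
        for i := 0; i < nOps; i++ {
            key := fmt.Sprintf("user%d", rand.Intn(nKeys))
            if rand.Intn(2) == 0 {
                db.Read(key) // 50% reads
            } else {
                db.Write(key, "new-value") // 50% writes (updates)
            }
        }
    }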
Q: Why does a read/write transaction have to hold read locks until after commit (Section 5.4)?
A: One of the paper's authors kindly explained with an example of what can go wrong if transactions released read locks earlier:
  (1) Transaction T1 reads key K1.
  (2a) T1 calls prepare and releases its lock on K1.
  (2b) In parallel, T1 calls read-epoch, which for some reason stalls.
  (3) Transaction T2 writes key K1.
  (4) T2 runs 2PC and read-epoch, is assigned epoch e, and commits successfully. Since T1 didn't observe T2's write, T1 must be ordered before T2 in any valid ordering.
  (5) The epoch is incremented to e+1.
  (6) Step 2b completes, but with T1 assigned epoch e+1. This breaks the epoch-ordering property.

Q: What is the RPC-Chains-like technique for acquiring locks that Section 8 describes?
A: The Chardonnay client needs to acquire the transaction's locks in advance, in ascending key order, to avoid deadlock. The naive design would acquire the locks one at a time, sending an RPC to the relevant shard server for each lock. If the transaction needs N locks, that's 2*N messages. The paper's scheme uses only N messages, by forwarding a network message listing all the locks from one shard server to the next. Each server acquires its share of the locks, then forwards the message to the next server.

Q: Section 8 describes how Chardonnay avoids deadlock by acquiring locks in order. So why does it later say it may need to fall back on Wound-Wait to avoid deadlock?
A: Some transactions read a few records, and then choose what records to read next based on the contents of the first few records. If those first records change between the dry run and the read/write execution of the transaction, the read/write execution may need different locks than those calculated during the dry run. In that case, the in-order lock acquisition won't have acquired the right locks; the real set of locks will only be discovered as the read/write transaction executes; so more locks will have to be acquired on the fly, and that risks deadlock.

Q: In section 5.3, why does the Transaction State Store not need to record whether a transaction is in the Prepared state?
A: The job of the Transaction State Store (TSS) is to help the shard leaders (participants) recover if the client (the transaction coordinator) fails before all the leaders have received a Commit or Abort from the client. In particular, if a leader times out waiting for a Commit or Abort, it asks the TSS whether it has recorded Commit or Abort. If the TSS says Commit or Abort, the leader follows that decision; if neither, the leader tries to set the state in the TSS to Abort. So one way to look at it is that a leader only needs to know Commit vs. Abort vs. not-yet-known; it never needs to know Prepared. Another way to look at it is that a leader already knows whether it is in the Prepared state -- i.e., whether it has received a Prepare message.
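Here is a minimal Go sketch of that timeout rule. The Store interface and its method names are invented for illustration, not Chardonnay's actual TSS API; only the Commit/Abort/neither decision logic follows the answer above.

    package tss

    // Outcome is what the Transaction State Store can report for a transaction.
    type Outcome int

    const (
        Unknown Outcome = iota // no decision recorded yet
        Commit
        Abort
    )

    // Store is a hypothetical client interface to the TSS.
    type Store interface {
        // Get returns the recorded outcome for txn, or Unknown.
        Get(txn string) Outcome
        // TrySetAbort records Abort unless a decision already exists, and
        // returns whatever outcome ends up recorded.
        TrySetAbort(txn string) Outcome
    }

    // DecideAfterTimeout is what a prepared participant does when it gives up
    // waiting for the coordinator's Commit/Abort message: follow any recorded
    // decision, otherwise try to force Abort. Note that Prepared never appears;
    // the participant already knows whether it has prepared.
    func DecideAfterTimeout(s Store, txn string) Outcome {
        switch s.Get(txn) {
        case Commit:
            return Commit
        case Abort:
            return Abort
        default:
            // The attempt can lose a race with a coordinator that records
            // Commit first, in which case the participant follows Commit.
            return s.TrySetAbort(txn)
        }
    }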
Q: Spanner and Chardonnay both use a notion of time to implement read-only transactions; how do they differ?
A: Spanner uses time and snapshot reads to make read-only transactions fast (no locks, and they can read a local replica), but at the expense of slowing down read/write transactions, which have to suffer commit wait. Spanner's TrueTime works well even among computers at distant data-centers, so Spanner has reasonable performance even when the replicas and clients are geographically spread out. Chardonnay's technique, in contrast, makes read-only transactions slower, by forcing them to wait for all the transactions in the chosen epoch to commit. But in return, certain read/write transactions slow down other transactions less (those that lock a hot record and then read cold records from the disk). Chardonnay only makes sense in a single data-center, because contacting its single epoch server would take a long time if the computers involved were geographically distributed.

Q: Section 5.1 says that the epoch interval can't be too short since then there might be cache misses. What does the paper mean?
A: I imagine that the epoch servers are multi-core machines, and that the core that executes Paxos and writes the current epoch number is different from the core(s) that handle "read epoch" RPC requests from clients. The microprocessor's cache-coherence protocol will invalidate the stale cached copy of the epoch number in all cores whenever the Paxos core updates it, and the other cores will then suffer a cache miss the next time a client asks to fetch the epoch. A CPU data-cache miss can take a microsecond or more, particularly if many cores want the same cache line at the same time. Maybe that would be significant in situations like Figure 6, where the epoch server must be receiving hundreds of thousands of requests per second.
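If that guess is right, the sharing pattern looks roughly like the toy Go model below: one goroutine stands in for the Paxos core that bumps a single shared epoch counter, and other goroutines stand in for the read-epoch handlers that load it. All names and numbers are invented; the point is just that every increment invalidates the readers' cached copy of that cache line, so a shorter epoch interval means more misses on the read path.

    package main

    import (
        "sync/atomic"
        "time"
    )

    // epoch is the single shared counter; every increment invalidates the
    // cached copy of its cache line in the cores running the readers.
    var epoch atomic.Uint64

    // paxosCore stands in for the core that commits a new epoch via Paxos
    // and then publishes it in memory.
    func paxosCore(interval time.Duration) {
        for {
            epoch.Add(1)
            time.Sleep(interval)
        }
    }

    // readEpochHandler stands in for a "read epoch" RPC handler; right after
    // an increment, this load misses in the local cache.
    func readEpochHandler() uint64 {
        return epoch.Load()
    }

    func main() {
        go paxosCore(10 * time.Millisecond) // shorter interval => more invalidations
        for i := 0; i < 8; i++ {
            go func() {
                for {
                    _ = readEpochHandler()
                }
            }()
        }
        time.Sleep(time.Second)
    }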