6.5840 2024 Lecture 15: FaRM, Optimistic Concurrency Control

why are we reading about FaRM?
  another take on transactions + replication + sharding
  this is still an open research area!
  optimistic concurrency control
  exploiting the huge performance potential of RDMA NICs

the overall setup
  all in one data center
  a configuration manager, using ZooKeeper, chooses primaries/backups
  sharded w/ primary/backup replication
    P1 B1
    P2 B2
    ...
  can recover as long as at least one replica of each shard survives
    i.e. f+1 replicas tolerate f failures
  transaction clients (which they run in the servers)
    the transaction code acts as a two-phase-commit Transaction Coordinator (TC)

the goal: millions of distributed transactions per second
  so the time budget is tens of microseconds
  very challenging!

how do they get high performance?
  sharding over many servers (90 in the evaluation)
  data must fit in total RAM (so no disk reads)
  non-volatile RAM (so no disk writes)
  one-sided RDMA (fast cross-network access to RAM)
  fast user-level access to the NIC
  a transaction+replication protocol that exploits one-sided RDMA

NVRAM (non-volatile RAM)
  FaRM writes go to RAM, not disk -- eliminates a huge bottleneck
  a RAM write takes 200 ns; a hard drive write takes 10 ms, an SSD write 100 us
    (ns = nanosecond, us = microsecond, ms = millisecond)
  but RAM loses its content in a power failure! not persistent by itself.
  why not just write to the RAM of f+1 machines, to tolerate f failures?
    might be enough if failures were always independent
    but power failure is not independent -- it may strike 100% of the machines!
  so: batteries in every rack can run the machines for a few minutes
    the power h/w notifies the s/w when main power fails
    the s/w halts all transaction processing
    the s/w writes FaRM's RAM to SSD; this may take a few minutes
    then the machine shuts down
    on re-start, FaRM reads the saved memory image back from SSD
    hence "non-volatile RAM"
  what if a crash prevents the s/w from writing to SSD?
    e.g. a bug in FaRM or the kernel, or a cpu/memory/hardware error
    FaRM copes with single-machine crashes via replication
    crashes (other than power failure) must be independent!
  summary: NVRAM eliminates the persistence-write bottleneck,
    leaving network and CPU as the remaining bottlenecks

why is the network often a performance bottleneck?
  FaRM assumes a single data center, so low speed-of-light delay
  but the CPU cost of network data handling is often large!
  the usual setup for RPC over TCP over a LAN:

    app                       app
    ---                       ---
    socket buffers            buffers
    TCP                       TCP
    NIC driver                driver
    NIC  -------------------- NIC

  lots of expensive CPU operations:
    system calls
    copying messages
    interrupts
    context switches
  slow: hard to build an RPC system that can deliver more than a few 100,000 / second
  wire b/w (e.g. 10 gigabits/second) is rarely the limit for short RPCs
  per-packet CPU costs traditionally limit performance for small messages

FaRM uses two networking ideas:
  kernel bypass
  RDMA

kernel bypass
  [diagram: FaRM user program, CPU cores, DMA queues, NIC]
  the application interacts directly with the NIC -- no system calls, no kernel
  the NIC DMAs into/out of user RAM
  FaRM s/w polls the DMA areas to check for incoming messages
  the NIC polls the DMA areas to check for outgoing messages
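as an aside, a minimal Go sketch of what the receive side of kernel bypass
might look like -- the ring layout, the use of the length word as a "ready"
flag, and all the names here are invented for illustration; the point is
just that receiving a message is a polling loop over shared memory, with
no system call or interrupt:

  // hypothetical kernel-bypass receive path: the NIC DMAs incoming
  // messages into slots of a ring buffer in user memory, and the
  // application polls the slots.
  package ringsketch

  import "sync/atomic"

  const slotSize = 2048

  // a slot is "full" once the NIC sets length non-zero (it writes
  // length last); the receiver zeroes it to hand the slot back.
  type slot struct {
      length uint32
      data   [slotSize]byte
  }

  type ring struct {
      slots []slot
      next  int // next slot the NIC is expected to fill
  }

  // poll busy-waits for the next message; this burns a core, but it
  // avoids kernel entry/exit, copies through the socket layer, and
  // interrupt costs entirely.
  func (r *ring) poll() []byte {
      for {
          s := &r.slots[r.next%len(r.slots)]
          if n := atomic.LoadUint32(&s.length); n != 0 {
              msg := make([]byte, n)
              copy(msg, s.data[:n])
              atomic.StoreUint32(&s.length, 0) // return slot to the NIC
              r.next++
              return msg
          }
      }
  }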
RDMA (remote direct memory access)
  [diagram: src host, NIC, switch, NIC, target memory, target CPU]
  the remote NIC directly reads/writes memory
    the sender provides the memory address
    the remote CPU is not involved!
    this is "one-sided RDMA"
  reads an entire cache line, atomically
  RDMA NICs use a reliable protocol, with ACKs
  one server's throughput: 10+ million/second (Figure 2)
  latency: 5 microseconds (from their NSDI 2014 paper)

performance would be amazing if clients could directly access
  DB records on servers via one-sided RDMA!

how to combine one-sided RDMA with replication and transactions?
  the protocols we've seen so far require active server participation,
    e.g. to check and set locks, to check leases, to indicate when data
    is safely persisted
  so they are not immediately compatible with one-sided RDMA

two classes of concurrency control for transactions:
  pessimistic (two-phase locking):
    wait for a lock on first use of an object; hold it until commit/abort
    conflicts cause delays
  optimistic:
    read objects without locking
    don't install writes until commit
    the commit "validates" to see if other xactions conflicted
      valid: commit the writes
      invalid: abort
    called Optimistic Concurrency Control (OCC)

FaRM uses OCC
  the reason: OCC lets FaRM read using one-sided RDMA reads,
    so a server needn't actively participate in reads
  how does FaRM's OCC validate? we'll look at Figure 4 in a minute

FaRM transaction API (simplified):
  txCreate()
  o = txRead(oid)   -- RDMA
  o.f += 1
  txWrite(oid, o)   -- purely local
  ok = txCommit()   -- Figure 4

what's an oid?
  a region # plus an address within the region
  the region # indexes a mapping to [ primary, backup1, ... ]
  the target's RDMA NIC uses the address directly to read RAM

server memory layout
  regions, each an array of objects
  object layout:
    a header with a version #, and a lock flag in the high bit of the version #
  for each other server:
    an incoming log
    an incoming message queue
  (senders write these via RDMA; local FaRM s/w reads them via polling)
  all of this is in non-volatile RAM (i.e. written to SSD on power failure)

Figure 4: transaction execution / commit protocol
  let's consider the steps in Figure 4 one by one,
    focusing on concurrency control (not fault tolerance)

Execute phase:
  the TC (the client) reads the objects it needs from the servers,
    including records that it will write,
    using one-sided RDMA reads, without locking
    this is the optimism in Optimistic Concurrency Control
  the TC remembers the version numbers
  the TC buffers writes locally

now the TC commits; two big goals:
  atomic distributed commit -- all writes or none
  serializability -- as if entirely before or after every other transaction

LOCK (the first message in the commit protocol)
  the TC sends it to the primary of each written object,
    using RDMA to append to its log at each primary
  a LOCK record contains the oid, the version # the xaction read, and the new value
  the LOCK is now logged in the primary's NVRAM, so it will survive a power failure
  a LOCK message is both a write-ahead log entry and an RPC request to the primary

what does a primary do on receipt of LOCK?
  FaRM s/w polls the incoming logs in RAM, and sees our LOCK
  if the object is locked, or its version # != what the xaction read,
    send a "no" reply to the TC
  otherwise set the object's lock flag and reply "yes"
    but don't yet modify the data!
  the lock check, version check, and lock set are atomic,
    using an atomic compare-and-swap instruction
    the "locked" flag is the high-order bit of the object's version number
    needed in case another CPU is also processing a LOCK,
      or a client is reading w/ RDMA
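a minimal Go sketch of that atomic check-and-lock, assuming (as described
above) a 64-bit header word whose high bit is the lock flag and whose low
bits are the version #; the function names are invented:

  package locksketch

  import "sync/atomic"

  const lockBit = uint64(1) << 63

  // tryLock sketches the primary's LOCK processing: it succeeds only
  // if the object is unlocked AND its version # still equals the
  // version the transaction read during the Execute phase. one
  // compare-and-swap does the lock check, the version check, and the
  // lock set atomically.
  func tryLock(header *uint64, versionRead uint64) bool {
      return atomic.CompareAndSwapUint64(header, versionRead, versionRead|lockBit)
  }

  // installCommit is what COMMIT-PRIMARY processing does to the
  // header: increment the version # and clear the lock flag. only the
  // lock holder writes the header now, so a plain store suffices.
  func installCommit(header *uint64) {
      atomic.StoreUint64(header, (atomic.LoadUint64(header)&^lockBit)+1)
  }

this is why the lock flag lives inside the version word: a separate lock
word could not be checked and set together with the version # in a single
instruction.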
the TC waits for all the LOCK reply messages
  if any reply is "no": abort
    append ABORT to the primaries' logs so they can release the locks
    return "no" from txCommit()

let's ignore VALIDATE and COMMIT-BACKUP for now

at this point the primaries need to know the TC's decision
  the TC appends COMMIT-PRIMARY to the primaries' logs
    the TC waits only for the RDMA hardware acknowledgement (ack);
      it does not wait for the primary to process the log entry
    the hardware ack means the record is safe in the primary's NVRAM
  the TC returns "yes" from txCommit()

when a primary processes the COMMIT-PRIMARY in its log:
  copy the new value into the object's memory
  increment the object's version #
  clear the object's lock flag

the commit point is when the first COMMIT-PRIMARY is written,
  since at that point the transaction's results can be revealed

example: T1 and T2 both want to increment x
  x = x + 1
  what results does serializability allow?
    i.e. what outcomes are possible if they run one at a time?
    x = 2, both clients told "success"
    x = 1, one client told "success", the other "aborted"
    x = 0, both clients told "aborted"
  what if T1 and T2 run exactly in step?
    T1: Rx0  Lx  Cx
    T2: Rx0  Lx  Cx
    what will happen?
    (both read version 0; one LOCK's compare-and-swap wins, the other
     sees the lock flag and aborts -- so x = 1)
  or (T2 starts after T1 commits):
    T1: Rx0  Lx  Cx
    T2:             Rx1  Lx  Cx
    (serial execution; both commit, so x = 2)
  or (T2 reads before T1's LOCK, locks after T1 commits):
    T1: Rx0      Lx  Cx
    T2:     Rx0          Lx  Cx
    (T2's LOCK sees a changed version # and aborts, so x = 1)

intuition for why FaRM's OCC provides serializability
  (i.e. why validation checks "was this execution the same as one at a time?"):
  if there was no conflicting transaction, the version #s won't have changed
  if there was a conflicting transaction,
    one or the other will see a lock flag or a changed version #

what about VALIDATE in Figure 4?
  it is an optimization for objects that a transaction only reads
  VALIDATE = a one-sided RDMA read to re-fetch the object's version # and lock flag
  if the lock is set, or the version # has changed since the original read,
    the TC aborts
  VALIDATE does not set the lock, and is thus faster than LOCK + COMMIT

VALIDATE example:
  x and y are initially zero
  T1: if x == 0: y = 1
  T2: if y == 0: x = 1
  (this is a classic test example for transactions)
  T1,T2 yields y=1,x=0
  T2,T1 yields x=1,y=0
  aborts could leave x=0,y=0
  but serializability forbids x=1,y=1

  suppose they run simultaneously:
    T1: Rx  Ly  Vx  Cy
    T2: Ry  Lx  Vy  Cx
    what will happen?
      the LOCKs will both succeed!
      the VALIDATEs will both fail, since the lock bits are both set
      so both will abort -- which is OK
  how about:
    T1: Rx  Ly  Vx      Cy
    T2:     Ry      Lx  Vy  Cx
    T1 commits; T2 aborts, since T2's Vy sees T1's lock (or a higher version #)
  but we can't have *both* V's complete before the other xaction's L
  so VALIDATE seems correct in this example
  and it's fast: a one-sided VALIDATE read rather than LOCK + COMMIT writes

a purely read-only FaRM transaction uses only one-sided RDMA reads
  no writes, no log records, no locking
  very fast!

what about fault tolerance?
  suppose some computers crash and don't reboot
    most interesting if the TC and some primaries crash
    but we assume one backup of each shard survives
  the critical issue:
    if a transaction was interrupted by a failure, and
      a client could have been told the transaction committed, or
      a committed value could have been read by another xaction,
    then the transaction must be preserved and completed during recovery

look at Figure 4
  a committed write might be revealed as soon as the first COMMIT-PRIMARY
    is sent (since that primary writes and unlocks)
  so by that point, all of the transaction's writes must already be on
    all f+1 replicas of all the relevant shards
  the good news: LOCK and COMMIT-BACKUP achieve this
    LOCK tells all the primaries the new value(s)
    COMMIT-BACKUP tells all the backups the new value(s)
  the TC doesn't send COMMIT-PRIMARY until all LOCKs and COMMIT-BACKUPs complete
    the backups may not have processed the COMMIT-BACKUPs, but the records
      are in their NVRAM logs
  similarly, the TC doesn't return to the client until at least one
    COMMIT-PRIMARY is safe in a primary's log
    without that COMMIT-PRIMARY, the risky case would be:
      the TC replies "yes" to the app after the COMMIT-BACKUPs
        (i.e. before any COMMIT-PRIMARY)
      the TC and all the backups then fail
      now the only evidence left is the LOCK records
      but even a complete set of LOCK records doesn't tell us whether the
        TC committed -- maybe the TC aborted due to a failed VALIDATE!
    writing the COMMIT-PRIMARY handles this risk, because then the TC's
      decision will survive f failures of any shard:
      one shard now has a full set of COMMIT-BACKUPs plus a COMMIT-PRIMARY,
      any one of which is evidence that the TC decided to commit
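to tie the steps together, here is a Go sketch of the TC's side of the
Figure 4 commit protocol as described above; the transport interface and
all its method names are invented stand-ins for RDMA appends to remote
NVRAM logs (a sketch of the protocol, not of FaRM's actual code):

  package commitsketch

  type oid uint64

  // write holds what the Execute phase buffered for one object.
  type write struct {
      versionRead uint64 // version # the xaction read
      newValue    []byte // buffered locally, installed nowhere yet
  }

  // transport models the RDMA messaging; each method stands for an
  // append to a remote server's NVRAM log, except ReadHeader, which
  // is a one-sided RDMA read (no remote CPU involved).
  type transport interface {
      Lock(id oid, versionRead uint64, newValue []byte) bool // LOCK; primary replies yes/no
      Abort(writes map[oid]write)                            // ABORT lets primaries release locks
      ReadHeader(id oid) (locked bool, version uint64)       // one-sided RDMA read
      CommitBackup(id oid, newValue []byte)                  // append to each backup's log; wait for NIC ack
      CommitPrimary(id oid)                                  // append to primary's log; wait for NIC ack only
  }

  // commit runs after the Execute phase; reads maps each read-only
  // oid to the version # the transaction observed.
  func commit(t transport, writes map[oid]write, reads map[oid]uint64) bool {
      // LOCK: each primary atomically checks the version # and sets
      // the lock flag (see the tryLock sketch above).
      for id, w := range writes {
          if !t.Lock(id, w.versionRead, w.newValue) {
              t.Abort(writes)
              return false
          }
      }
      // VALIDATE: re-fetch headers of read-only objects; abort if
      // locked or changed -- no lock is ever taken on them.
      for id, versionRead := range reads {
          if locked, v := t.ReadHeader(id); locked || v != versionRead {
              t.Abort(writes)
              return false
          }
      }
      // COMMIT-BACKUP: after these acks, the new values are in the
      // NVRAM logs of all f+1 replicas of every written shard.
      for id, w := range writes {
          t.CommitBackup(id, w.newValue)
      }
      // COMMIT-PRIMARY: the first of these is the commit point; each
      // primary installs the value, bumps the version #, clears the lock.
      for id := range writes {
          t.CommitPrimary(id)
      }
      return true
  }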
FaRM is very impressive; does it fall short of perfection?
  * it works best if there are few conflicts, due to OCC
  * data must fit in total RAM
  * replication is only within a datacenter (no geographic distribution)
  * the data model is low-level; one would need e.g. a SQL library
  * it requires somewhat unusual RDMA and NVRAM hardware

how does FaRM differ from Spanner?
  both shard, replicate, and use two-phase commit (2pc) for transactions
  Spanner:
    focuses on coping with network delay due to geographic replication
    Paxos tolerates the delay
    TrueTime lets them read from local replicas
    performance: an r/w xaction takes 10 to 100 ms (Tables 3 and 6)
  FaRM:
    focuses on reducing CPU costs
    RDMA, direct NIC access, NVRAM to avoid disk writes
    RDMA leads them to Optimistic Concurrency Control (OCC)
    performance: 58 microseconds for simple transactions (6.3, Figure 7)
    i.e. 100 times faster than Spanner

summary
  super high speed distributed transactions
  the hardware is exotic (NVRAM and RDMA) but may be common soon
  use of OCC for speed, and to allow fast one-sided RDMA reads
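finally, as a concrete recap of the client's view: under OCC a conflicting
transaction simply aborts and retries. a Go sketch of the usual retry loop
around the simplified API from above (the tx and obj types and the method
stubs are invented; txCommit stands for the Figure 4 protocol sketched
earlier):

  package txsketch

  type oid uint64

  type obj struct{ f int }

  // tx holds client-side transaction state: version #s remembered for
  // LOCK/VALIDATE, and writes buffered locally until commit.
  type tx struct {
      readVersions map[oid]uint64
      writes       map[oid]obj
  }

  func txCreate() *tx {
      return &tx{readVersions: map[oid]uint64{}, writes: map[oid]obj{}}
  }

  // txRead stands for a one-sided RDMA read that also records the
  // object's version # (stubbed out here).
  func (t *tx) txRead(id oid) obj {
      t.readVersions[id] = 0 // placeholder for the fetched version #
      return obj{}           // placeholder for the fetched object
  }

  // txWrite is purely local: the new value stays buffered in the TC.
  func (t *tx) txWrite(id oid, o obj) { t.writes[id] = o }

  // txCommit stands for the Figure 4 commit protocol; false = abort.
  func (t *tx) txCommit() bool { return true } // placeholder

  // increment runs "x = x + 1", retrying on OCC aborts.
  func increment(x oid) {
      for {
          t := txCreate()
          o := t.txRead(x)
          o.f += 1
          t.txWrite(x, o)
          if t.txCommit() {
              return
          }
      }
  }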