FAQ FaRM

Q: What are some systems that currently use FaRM?

A: FaRM seems to be a research system, and not in production use. I suspect it will influence future designs, and perhaps itself be developed into a production system.

Q: Why do companies (Microsoft, Google, Facebook, Yahoo, etc.) publish papers about their software, rather than keeping their designs secret?

A: These companies only publish papers about a tiny fraction of the software they write. One reason they publish is that these systems are partially developed by people with an academic background (i.e. who have PhDs), who feel that part of their mission in life is to help the world understand the new ideas they invent. They are proud of their work and want people to appreciate it. Another reason is that such papers may help the companies attract top talent, because the papers show that intellectually interesting work is going on there.

Q: Does FaRM really signal the end of necessary compromises in consistency/availability in distributed systems?

A: This part of the paper seems more like advertising than science. History suggests that no level of performance is so high that no-one will want more, and those people will likely be willing to compromise in other areas to get the performance they need.

Q: What are some limitations of FaRM?

A: The data has to fit in RAM. OCC will produce lots of aborts if transactions conflict a lot. The transaction API (described in their NSDI 2014 paper) looks awkward to use because replies return in callbacks. Application code has to tightly interleave executing application transactions and polling RDMA NIC queues and logs for messages from other computers. Application code can see inconsistencies while executing transactions that will eventually abort. Applications may not be able to make free use of threads for their own purposes, because FaRM pins threads to cores and uses all cores. FaRM requires special network hardware that's not widely deployed. The design only makes sense if all the computers are close to each other; it's not a recipe for geographical distribution (and thus can have only limited fault tolerance).

Of course, FaRM is a research prototype intended to explore new ideas. It is not a finished product intended for general use. If people continue this line of work, we might eventually see descendants of FaRM with fewer rough edges.

Q: What's a NIC?

A: A Network Interface Card -- the hardware that connects a computer to the network.

Q: What is RDMA?

A: RDMA is a special feature implemented in some modern NICs. The NIC looks for special command packets that arrive over the network, and executes the commands itself (without giving the packets to the CPU). The commands specify memory operations such as "write a value to an address" or "read from an address and send the value back over the network". In addition, RDMA NICs allow application code to talk directly to the NIC hardware to send the special RDMA command packets, and to be notified when the "hardware ACK" packet arrives indicating that the receiving NIC has executed the command.

Q: What is one-sided RDMA?

A: "One-sided" refers to a situation where application code in one computer uses these RDMA NICs to directly read or write memory in another computer without involving the other computer's CPU. FaRM's "Validate" phase in Section 4 / Figure 4 uses only a one-sided read.

FaRM sometimes uses RDMA as a fast way to implement an RPC-like scheme to talk to software running on the receiving computer. The sender uses RDMA to write the request message to an area of memory that the receiver's FaRM software is polling (checking periodically); the receiver sends its reply in the same way. The FaRM "Lock" phase uses RDMA in this way.

The benefit of RDMA is speed. A one-sided RDMA read or write takes as little as 1/18 of a microsecond (Figure 2), while a traditional RPC might take 10 microseconds. Even FaRM's use of RDMA for messaging is a lot faster than traditional RPC: user-space code in the receiver frequently polls the incoming NIC queues in order to see new messages quickly, rather than relying on interrupts and user/kernel transitions.
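To make the polling idea concrete, here is a minimal sketch in Go (hypothetical types and layout; FaRM's real implementation uses actual RDMA-registered memory, and these names are invented for illustration). The sender's one-sided RDMA write fills a message slot, writing a "ready" flag last; a dedicated receiver thread spins on the flag instead of waiting for an interrupt.

  package main

  import (
  	"fmt"
  	"sync/atomic"
  )

  // One message slot in a receive ring that a remote sender fills with
  // a one-sided RDMA write. The sender writes the payload first and the
  // ready flag last, so a set flag implies a complete message.
  type slot struct {
  	payload [56]byte
  	ready   uint32
  }

  var (
  	ring [128]slot
  	next int
  )

  // pollOnce checks the next slot for a new message. In FaRM, a pinned
  // hardware thread runs something like this in a tight loop, which is
  // why no interrupts, system calls, or kernel copies are needed.
  func pollOnce() ([]byte, bool) {
  	s := &ring[next]
  	if atomic.LoadUint32(&s.ready) == 0 {
  		return nil, false // nothing new yet
  	}
  	msg := make([]byte, len(s.payload))
  	copy(msg, s.payload[:])
  	atomic.StoreUint32(&s.ready, 0) // hand the slot back to the sender
  	next = (next + 1) % len(ring)
  	return msg, true
  }

  func main() {
  	// Simulate the sender's RDMA write: payload first, flag last.
  	copy(ring[0].payload[:], "hello")
  	atomic.StoreUint32(&ring[0].ready, 1)

  	if msg, ok := pollOnce(); ok {
  		fmt.Printf("received: %q\n", msg[:5])
  	}
  }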
Q: Why is FaRM's RDMA-based RPC faster than traditional RPC?

A: Traditional RPC requires the application to make a system call to the local kernel, which asks the local NIC to send a packet. At the receiving computer, the NIC writes the packet to a queue in memory and interrupts the receiving computer's kernel. The kernel copies the packet to user space and context-switches to the receiving application. The receiving application does the reverse to send the reply (system call to kernel, kernel talks to NIC, NIC on the other side interrupts its kernel, &c). The point is that a huge amount of code is executed for each RPC, and it's not very fast.

In contrast, FaRM arranges that the application code can directly read and write memory to communicate with the NIC, and dedicates CPU cores (which the paper calls hardware threads) to polling for incoming messages. This eliminates the costs of interrupts, system calls, copying data between user and kernel space, and context switches.

Q: Much of FaRM's performance comes from the hardware. In what ways does the software design contribute to performance?

A: It's true that one reason FaRM is fast is that the hardware is fast. But the hardware has been around for many years now, yet before FaRM no-one had figured out how to put all the parts together in a way that really exploits the hardware's potential. One reason FaRM does so well is that the authors simultaneously put a lot of effort into optimizing the network, the persistent storage, and the use of CPU; many previous systems have optimized one but not all. A specific design point is the way FaRM uses fast one-sided RDMA (rather than slower full RPC) for many of the interactions.

Q: Do other systems use UPSes (uninterruptible power supplies, with batteries) to implement fast but persistent storage?

A: The idea is old; for example, the Harp replicated file service used it in the early 1990s. Many storage systems use batteries in related ways (e.g. in RAID controllers) to write persistently without waiting for the disk. However, the kind of battery setup that FaRM uses isn't particularly common, so software that has to be general purpose can't rely on it. If you configure your own hardware to have batteries, then it would make sense to modify your Raft (or k/v server) to exploit your batteries.

Q: Would the FaRM design still make sense without the battery-backed RAM?

A: I'm not sure FaRM would make sense without non-volatile RAM, because then the one-sided log writes (e.g. COMMIT-BACKUP in Figure 4) would not persist across power failures. You could modify FaRM so that all log updates were written to SSD before returning, but then it would have much lower performance. An SSD write takes about 100 microseconds, while FaRM's one-sided RDMA writes to non-volatile RAM take only a few microseconds.

Q: Isn't DRAM inherently volatile?

A: The authors make RAM "non-volatile" by using a UPS, which allows FaRM to write the contents of RAM to an SSD on a power failure. This is not completely non-volatile, though: if a computer crashes for any reason other than a power failure, the contents of its memory are lost. That is why FaRM replicates each region across several machines and has a fast recovery protocol.
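As a toy illustration of the UPS trick just described, here is a hypothetical Go sketch: when the UPS reports loss of wall power, the server dumps its in-memory region to SSD, and reloads it when power returns. The real mechanism in FaRM is more involved; the file name and function names here are invented.

  package main

  import (
  	"log"
  	"os"
  )

  // A single in-memory data region; FaRM divides its memory into
  // many such regions.
  var region = make([]byte, 1<<20)

  // onPowerFailure runs when the UPS signals loss of wall power. The
  // battery only has to last long enough for this write to finish.
  func onPowerFailure() {
  	if err := os.WriteFile("region.img", region, 0o600); err != nil {
  		log.Fatalf("dump to SSD failed: %v", err)
  	}
  }

  // onRestart reloads the saved image after power is restored.
  func onRestart() {
  	img, err := os.ReadFile("region.img")
  	if err != nil {
  		log.Fatalf("no saved image: %v", err)
  	}
  	copy(region, img)
  }

  func main() {
  	copy(region, []byte("committed log records live here"))
  	onPowerFailure()             // pretend the UPS just signaled
  	region = make([]byte, 1<<20) // simulate losing DRAM contents
  	onRestart()
  	log.Printf("recovered: %q", region[:31])
  }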
Q: A FaRM server copies RAM to SSD if the power is about to fail. Could they use mechanical hard drives instead of SSDs?

A: They use SSDs because they are fast. They could have used hard drives without changing the design. However, it would then take about 10x longer to write the data to disk during a power outage, and 10x longer to read it back in after power is restored. That would require bigger batteries and more patience.

Q: What is the distinction between primaries, backups, and configuration managers in FaRM? Why are there three roles?

A: The data is sharded among many primary/backup sets. The point of the backups is to store a copy of the shard's data and logs in case the primary fails. The primary performs all reads and writes to data in the shard, while the backups perform only the writes (in order to keep their copies of the data identical to the primary's copy).

There's just one configuration manager. It keeps track of which primaries and backups are alive, and of how the data is sharded among them. At a high level this arrangement is similar to GFS, which also sharded data among many primary/backup sets, and also had a master that kept track of where data is stored.

Q: Would FaRM make sense at small scale?

A: I think FaRM is only interesting if you need to support a huge number of transactions per second. If you only need a few thousand transactions per second, you can use off-the-shelf mature technology like MySQL. You could probably set up a considerably smaller FaRM system than the authors' 90-machine system. But FaRM doesn't make sense unless you are sharding and replicating data, which means you need at least four data servers (two shards, two servers per shard) plus a few machines for ZooKeeper (though you could probably run ZooKeeper on the four machines). Then maybe you have a system that costs on the order of $10,000 and can execute a few million simple transactions per second, which is pretty good.

Q: Section 3 seems to say that a single transaction's reads may see inconsistent data. That doesn't seem like it would be serializable!

A: FaRM only guarantees serializability for transactions that commit. If a transaction sees the kind of inconsistency Section 3 is talking about, FaRM will abort the transaction. Applications must handle inconsistency in the sense that they should not crash, so that they can get as far as asking to commit, at which point FaRM can abort them.

Q: How does FaRM ensure that a transaction's reads are consistent? What happens if a transaction reads an object that is being modified by a different transaction?

A: There are two dangers here. First, for a big object, the reader may read the first half of the object before a concurrent transaction has written it, and the second half after the concurrent transaction has written it; this might cause the reading program to crash. Second, the reading transaction can't be allowed to commit if it might not be serializable with a concurrent writing transaction.

Based on my reading of the authors' previous NSDI 2014 paper, the solution to the first problem is that every cache line of every object has a version number, and single-cache-line RDMA reads and writes are atomic. The reading transaction's FaRM library fetches all of the object's cache lines, and then checks whether they all have the same version number. If yes, the library gives the copy of the object to the application; if no, the library reads it again over RDMA. The second problem is solved by FaRM's validation scheme described in Section 4: in the VALIDATE step, if another transaction has written an object read by our transaction since our transaction started, our transaction will be aborted.
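Here is a minimal Go sketch of that per-cache-line version check (hypothetical layout and sizes; the real FaRM object format differs). A writer bumps the version in every cache line of the object, so a torn read shows up as a version mismatch and the reader simply retries.

  package main

  import "fmt"

  // A 64-byte cache line carrying the object's version; single-line
  // RDMA reads and writes are atomic, but multi-line reads are not.
  type cacheLine struct {
  	version uint64
  	data    [56]byte
  }

  // An object spanning several cache lines, fetched with one-sided
  // RDMA reads (simulated here by a plain function).
  type object []cacheLine

  // readConsistent retries until every line carries the same version,
  // i.e. until the read did not overlap a concurrent writer.
  func readConsistent(fetch func() object) object {
  	for {
  		obj := fetch()
  		torn := false
  		for _, line := range obj[1:] {
  			if line.version != obj[0].version {
  				torn = true // caught a writer mid-object; try again
  				break
  			}
  		}
  		if !torn {
  			return obj
  		}
  	}
  }

  func main() {
  	remote := object{{version: 7}, {version: 7}}
  	obj := readConsistent(func() object { return remote })
  	fmt.Println("consistent read at version", obj[0].version)
  }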
Q: How does log truncation work? When can a log entry be removed? If one entry is removed by a truncate call, are all previous entries also removed?

A: The TC tells the primaries and backups to delete the log entries for a transaction after the TC sees that all of them have a COMMIT-PRIMARY or COMMIT-BACKUP in their log. So that recovery will know a transaction is done despite truncation, primaries remember completed transaction IDs even after truncation (page 62). Truncation implies that all log entries before the truncation point are deleted; this works because each primary/backup keeps a separate log per TC.

Q: Is it possible for an abort to occur during COMMIT-BACKUP, perhaps due to hardware failure?

A: I believe so. If one of the backups doesn't respond, and the TC crashes, then there's a possibility that the transaction might be aborted during recovery.

Q: Does FaRM performance suffer when many transactions need to modify the same object?

A: When multiple transactions modify the same object at the same time, some of them will see during Figure 4's LOCK phase that the lock is already held. Readers may see a changed version, or a set lock flag, during the VALIDATE phase. Each such transaction will abort and restart from the beginning. If that happens a lot, performance will indeed suffer. The "optimistic" in "optimistic concurrency control" refers to the hope that such conflicts will be rare, and that the ability to do lock-free reads will yield high performance. And indeed, for the applications the authors measure, FaRM gets fantastic performance. Very likely one reason is that their applications have relatively few conflicting transactions, and thus not many aborts.

Q: Figure 7 shows a significant increase in latency when the number of operations exceeds 120 per microsecond. Why is that?

A: I suspect the limit is that the servers can only process about 140 million operations per second in total. If clients send operations faster than that, some of them will have to wait; this waiting shows up as increased latency.

Q: What is vertical Paxos?

A: It is a style of Paxos protocol in which an external master performs reconfiguration, while the Paxos group continues to process operations during the reconfiguration (see https://lamport.azurewebsites.net/pubs/vertical-paxos.pdf for the details). In the FaRM paper, the authors use the term "vertical Paxos" loosely to mean that configuration management is done by an external service (ZooKeeper and the CM), while the writes of a transaction are processed with a standard primary/backup protocol.
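Several answers above refer to the phases of Figure 4's commit protocol (LOCK, VALIDATE, COMMIT-BACKUP, COMMIT-PRIMARY, truncation). The sketch below is a rough Go outline of the order of those phases from the TC's point of view; every helper is a stub standing in for a one-sided RDMA log write or read, the names are invented for illustration, and the real protocol has many more details (recovery, truncation bookkeeping, &c).

  package main

  import "fmt"

  // Tx stands in for a transaction's read and write sets; real FaRM
  // tracks object addresses and the versions observed when reading.
  type Tx struct {
  	readSet, writeSet []string
  }

  // Stubs for one-sided RDMA interactions (invented names).
  func writeLockRecords(tx *Tx) bool     { return true } // primaries try to lock
  func validateReadSet(tx *Tx) bool      { return true } // one-sided version reads
  func writeCommitBackupRecords(tx *Tx)  {}              // persists the decision
  func writeCommitPrimaryRecords(tx *Tx) {}              // install + unlock
  func abort(tx *Tx) bool                { return false }

  func commit(tx *Tx) bool {
  	// LOCK: append LOCK records to the logs of the primaries of all
  	// written objects; a primary refuses if the object's version has
  	// changed since the transaction read it, or if it is locked.
  	if !writeLockRecords(tx) {
  		return abort(tx)
  	}
  	// VALIDATE: one-sided reads of the versions of read-only
  	// objects; abort if any has changed or is locked.
  	if !validateReadSet(tx) {
  		return abort(tx)
  	}
  	// COMMIT-BACKUP: write records to all backups' logs; the NIC's
  	// hardware ACK means the record is in non-volatile RAM.
  	writeCommitBackupRecords(tx)
  	// COMMIT-PRIMARY: after all backups ACK, primaries install the
  	// new values and release the locks; truncation happens later.
  	writeCommitPrimaryRecords(tx)
  	return true
  }

  func main() {
  	tx := &Tx{readSet: []string{"x"}, writeSet: []string{"y"}}
  	fmt.Println("committed:", commit(tx))
  }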