Spinnaker FAQ

Q: What is timeline consistency?

A: It's a form of relaxed consistency that can provide faster reads at the expense of visibly anomalous behavior. All replicas apply writes in the same order (clients send writes to the leader, and the leader picks an order and forwards the writes to the replicas). Clients are allowed to send reads to any replica, and that replica replies with whatever data it currently has. The replica may not yet have received recent writes from the leader, so the client may see stale data. If a client sends a read to one replica, and then another, the client may get older data from the second read. Unlike strong consistency, timeline consistency makes clients aware of the fact that there are replicas, and that the replicas may differ.

Q: When there is only 1 node up in a cohort, the paper says reads are still timeline consistent; how is that possible?

A: Timeline consistency allows the system to return stale values: a get can return a value that was written in the past, but is not the most recent one. Thus, if a server is partitioned off from the majority, and perhaps has missed some Puts, it can still execute a timeline-consistent Get.

Q: What are the trade-offs of the different consistency levels?

A: Strongly-consistent systems are easier to program because they behave like a non-replicated system. Building applications on top of eventually-consistent systems is generally more difficult because the programmer must consider scenarios that couldn't come up in a non-replicated system. On the other hand, one can often obtain higher performance from weaker consistency models.

Q: How does Spinnaker implement timeline reads, as opposed to consistent reads?

A: Section 5 says that clients are allowed to send timeline read requests to any replica, not just the leader. Replicas serve timeline read requests from their latest committed version of the data, which may be out of date (because the replica hasn't seen recent writes). (See the code sketch below.)

Q: Why are Spinnaker's timeline reads faster than its consistent reads, in Figure 8?

A: It's not clear why timeline reads improve read performance; perhaps the read throughput is higher by being split over multiple replicas, perhaps the leader can serve reads without talking to the followers, or perhaps all replicas can execute reads without waiting for processing of prior writes to complete. On the other hand, the paper says the read workload is uniform random (ruling out the first explanation, since the Figure 2 arrangement evenly spreads the work regardless); the paper implies that the leader doesn't send read requests to the followers (ruling out the second explanation); and it's not obvious why either kind of read would ever have to wait for concurrent writes to complete.

Q: Why do Spinnaker's consistent reads have much higher performance than Cassandra's quorum reads, in Figure 8?

A: The paper says that Spinnaker's consistent reads are processed only by the leader, while Cassandra's reads involve messages to two servers. It's not clear why it's legal for consistent reads to consult only the leader -- suppose leadership has changed, but the old leader isn't yet aware; won't that cause the old leader to serve up stale data to a consistent read? We'd expect either that the leader waits for the read to be committed to the log, or that the leader holds a lease, but the paper doesn't mention anything about leases.
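As a concrete illustration of the two read paths just described, here is a minimal Go sketch of a replica serving timeline and consistent gets. The names (Replica, TimelineGet, ConsistentGet) and the flat in-memory map are invented for this note; the real system reads from memtables and SSTables, and this is only a sketch of the rule "timeline reads go to any replica, consistent reads go to the leader."

    package main

    import (
        "errors"
        "fmt"
        "sync"
    )

    type Replica struct {
        mu        sync.Mutex
        isLeader  bool              // set by leader election (not shown)
        committed map[string]string // data committed at this replica so far
    }

    // TimelineGet may be sent to any replica; it answers from whatever this
    // replica has committed, which may be stale if it hasn't seen recent writes.
    func (r *Replica) TimelineGet(key string) (string, bool) {
        r.mu.Lock()
        defer r.mu.Unlock()
        v, ok := r.committed[key]
        return v, ok
    }

    // ConsistentGet is served only by the leader, which (per Section 9.1)
    // answers from its own committed state without consulting the followers.
    func (r *Replica) ConsistentGet(key string) (string, error) {
        r.mu.Lock()
        defer r.mu.Unlock()
        if !r.isLeader {
            return "", errors.New("not the leader; retry at the leader")
        }
        v, ok := r.committed[key]
        if !ok {
            return "", errors.New("no such key")
        }
        return v, nil
    }

    func main() {
        follower := &Replica{committed: map[string]string{"x": "old"}}
        fmt.Println(follower.TimelineGet("x")) // may print a stale value
    }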
Q: What is the CAP theorem about?

A: The key point is that if there is a network partition, you have two choices: both partitions can continue to operate independently, sacrificing consistency (since they won't see each other's writes); or at most one partition can continue to operate, preserving consistency but sacrificing forward progress in the other partitions.

Q: Where does Spinnaker sit in the CAP scheme?

A: Spinnaker's consistent reads allow operation in at most one partition: reads in the majority partition see the latest write, but minority partitions are not allowed to operate. Raft and Lab 3 are similar. Spinnaker's timeline reads allow replicas in minority partitions to process reads, but they are not consistent, since they may not reflect recent writes.

Q: Could Spinnaker use Raft as the replication protocol rather than Paxos?

A: Although the details are slightly different, my thinking is that they are interchangeable to first order. The main reason we are reading this paper is as a case study of how to build a complete system using a replicated log. It is similar to Lab 3 and Lab 4, except that you will be using your Raft library.

Q: The paper mentions Paxos hadn't previously been used for database replication. What was it used for?

A: Paxos was not often used at all until recently. By 2011 most uses were for configuration services (e.g., Google's Chubby or Zookeeper), but not to directly replicate data.

Q: What is the reason for logical truncation?

A: Logical truncation exists because Spinnaker merges the logs of three cohorts into a single physical log, for performance on magnetic disks. This complicates log truncation, because when one cohort wants to truncate the log there may be log entries from other cohorts, which it cannot truncate. So, instead of truncating the log, the cohort remembers which of its entries have been truncated in a skip list. When the cohort recovers and starts replaying log entries from the beginning of the log, it skips the entries in the skip list, because they are already present in the last checkpoint of the memtables.

Q: What exactly is the key range (i.e., 'r') defined in the paper for leader election?

A: One of the key ranges of a shard (e.g., [0,199] in Figure 2).

Q: Is there a more detailed description of Spinnaker's replication protocol somewhere?

A: http://www.vldb.org/2011/files/slides/research15/rSession15-1.ppt

Q: How does Spinnaker's leader election ensure there is at most one leader?

A: The new leader is the candidate with the max n.lst registered in Zookeeper under /r/candidates, using Zookeeper sequence numbers to break ties. (See the code sketch below.)

Q: Does Spinnaker have something corresponding to Raft's terms?

A: Yes, it has epoch numbers (see Appendix B).

Q: Section 9.1 says that a Spinnaker leader replies to a consistent read without consulting the followers. How does the leader ensure that it is still the leader, so that it doesn't reply to a consistent read with stale data?

A: Unclear. Maybe the leader learns from Zookeeper that it isn't the leader anymore, because some lease/watch times out.

Q: Would it be possible to replace the use of Zookeeper with a Raft-like leader election protocol?

A: Yes, that would be a very reasonable thing to do. My guess is that they needed Zookeeper for shard assignment and then decided to also use it for leader election.
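Here is a small Go sketch of the leader-selection rule mentioned above: among the candidates registered under /r/candidates, pick the one with the highest n.lst, breaking ties with the Zookeeper sequence number. The Candidate struct, its field names, and the direction of the tie-break (earlier sequence number wins) are assumptions made for illustration; the real election also involves Zookeeper sessions and watches, which are omitted here.

    package main

    import "fmt"

    // Candidate describes one entry under /r/candidates; field names are invented.
    type Candidate struct {
        Node  string // server that proposed itself as leader
        Lst   uint64 // n.lst: last LSN in its log (epoch number in the high bits)
        ZkSeq int64  // Zookeeper-assigned sequence number of its znode
    }

    // pickLeader returns the candidate with the highest n.lst; ties are broken
    // by Zookeeper sequence number (earlier registration wins -- an assumption).
    func pickLeader(cands []Candidate) (Candidate, bool) {
        if len(cands) == 0 {
            return Candidate{}, false
        }
        best := cands[0]
        for _, c := range cands[1:] {
            if c.Lst > best.Lst || (c.Lst == best.Lst && c.ZkSeq < best.ZkSeq) {
                best = c
            }
        }
        return best, true
    }

    func main() {
        cands := []Candidate{
            {Node: "A", Lst: 42, ZkSeq: 2},
            {Node: "B", Lst: 48, ZkSeq: 1},
        }
        leader, _ := pickLeader(cands)
        fmt.Println("new leader:", leader.Node) // B: it has the higher n.lst
    }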
Q: What is the difference between a forced and a non-forced log write?

A: After a forced write returns, it is guaranteed that the data is on persistent storage. After a non-forced write returns, the write to persistent storage has been issued, but the data may not yet be on persistent storage.

Q: Step 6 of Figure 7 seems to say that the candidate with the longest log gets to be the next leader. But in Raft we saw that this rule doesn't work, and that Raft has to use the more elaborate Election Restriction. Why can Spinnaker safely use the longest log?

A: Spinnaker actually seems to use a rule similar to Raft's. Figure 7 compares LSNs (Log Sequence Numbers), and Appendix B says that an LSN has the epoch number in its high bits. The epoch number is equivalent to Raft's term. So step 6's "max n.lst" actually boils down to "highest epoch wins; if epochs are equal, longest log wins."
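To see why comparing n.lst values amounts to Raft's election restriction, here is a small Go sketch of an LSN with the epoch in its high bits. The 16-bit epoch / 48-bit sequence split is an assumption made for illustration; the paper only says the epoch occupies the high-order bits.

    package main

    import "fmt"

    // Assumed layout: 16-bit epoch in the high bits, 48-bit sequence in the low bits.
    const seqBits = 48

    func makeLSN(epoch, seq uint64) uint64 { return epoch<<seqBits | seq }
    func epochOf(lsn uint64) uint64        { return lsn >> seqBits }
    func seqOf(lsn uint64) uint64          { return lsn & (1<<seqBits - 1) }

    func main() {
        a := makeLSN(3, 100) // shorter log, but written in a newer epoch
        b := makeLSN(2, 900) // longer log, but from an older epoch
        // Because the epoch occupies the high bits, a plain integer comparison
        // implements "highest epoch wins; if epochs are equal, longest log wins".
        fmt.Println(a > b)                // true: epoch 3 beats epoch 2
        fmt.Println(epochOf(a), seqOf(a)) // 3 100
    }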