6.5840 Lecture 24: Formal verification of distributed systems (Grove)

Why this lecture? A number of students have asked about formal verification
for distributed systems, so we wanted to spend some lecture time talking about
it. The paper for this lecture, Grove, is very recent research in the area.

Plan for today's lecture:
* First, talk about GroveKV, the "case study" distributed system verified with Grove.
* Then, some motivation and general background about verification.
* Main thing: how Grove reasons about code like GroveKV.
* Finally, reflect on what Grove actually accomplished and its limits (and
  those of verification in general).

--------------------------------------------------------------------------------

GroveKV is a fault-tolerant, linearizable KV service.
  KV service → clerk with Put and Get operations. Operations happen exactly once.
  Fault tolerance → crash-safe and reconfigurable.
    If all servers crash, it should be possible to restart them without losing
    any state or violating linearizability.
    If a few servers are down, it should be possible to reconfigure to use the
    remaining online servers in a consistent way: a Put that was acknowledged
    before reconfiguration should still be there after reconfiguration.

GroveKV's goals are similar to lab 3, which is also a fault-tolerant KV
service. It is implemented differently, though: GroveKV uses primary/backup
(PB) replication instead of Raft.

A "bonus" feature: fast reads based on leases, similar to how leases are used
in Chain Replication.

[ Diagram of different servers in GroveKV ]

To do reconfiguration (adding/removing servers consistently), GroveKV uses a
configuration service. Without a config service, two backups might be network
partitioned from each other and both want to become primary, which would be
wrong. Same issue as in VMware FT.

In GroveKV specifically, the config service is just a single server. It should
be made fault-tolerant with Raft/Paxos/etc.; think of ZooKeeper. The Grove
authors plan to make the config service Paxos-replicated in the near future.

Reconfiguration is a program that "facilitates" a configuration change. You
can imagine an admin logging into a machine and running `reconfigure`, or a
failure detector on some machine doing it automatically. Importantly, it is
*always* safe to attempt reconfiguration.

Code fragments:

1) Replication.
   The primary uses goroutines to replicate to the other servers. The system
   is built with RPCs. LocalApply updates the KV map and writes to the log.
   With $f+1$ servers (+ the config service), tolerates $f$ faults.
   Concurrency: have to hold the lock while doing this; nextIndex determines
   the order. Have to wait for all backups to reply, because PB means EVERYONE
   has to ack an op before replying to the client.

2) Reconfiguration.
   Makes use of the configuration service. Idea: "seal" the current group of
   servers, get a copy of the state from them, install the state on the new
   servers. There might be concurrent reconfiguration attempts, so use "epoch
   numbers" to make sure only one reconfiguration succeeds. The epoch numbers
   are like Raft term numbers. Because it's PB, only have to reach 1 of the
   old servers.

3) Reads based on leases.
   Idea: do reads from the primary. What's wrong with the naive version? The
   primary sees all the ops from clients, so this seems plausible. But we need
   to deal with reconfiguration happening without the primary knowing it.
   Problematic execution: a backup from epoch 1 becomes primary in epoch 2,
   without the epoch-1 primary knowing about it. (A sketch of the naive read
   follows below.)
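To make the problematic execution concrete, here is a minimal sketch of the
naive, lease-free primary read (hypothetical names and fields, not GroveKV's
actual code):

    package main

    import "sync"

    // Hypothetical server state, for illustration only.
    type Server struct {
        mu sync.Mutex
        kv map[string]string
    }

    // NaiveGet replies from local state without any coordination or lease check.
    func (s *Server) NaiveGet(key string) string {
        s.mu.Lock()
        defer s.mu.Unlock()
        // BUG: nothing checks that this server is still the primary of the
        // current epoch. A backup from epoch 1 may already have become the
        // primary of epoch 2 and accepted Puts this server never saw, so the
        // value returned here can be stale, violating linearizability.
        return s.kv[key]
    }

The lease mechanism described next closes exactly this gap.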
Similar issues in Raft are why the leader has to replicate Get operations
(otherwise it would read stale state).

With leases: a server replies to a Get without coordinating with anyone else.
  Assume loosely synchronized clocks.
  GetTimeRange: returns {earliest, latest} s.t. earliest <= current time <= latest.
    E.g. NTP (network time protocol) with error bounds, or AWS time sync:
    https://github.com/aws/clock-bound
  Puts can happen concurrently with Gets; a Get waits for previous Puts to finish.
  The configuration service can't enter a new epoch until every lease has expired.

---

Commonly, developers write tests to make sure code is correct.
  Challenge: lots of scenarios ("what if" questions), easy to miss something.

The idea of verification is to prove: "∀ scenarios, code behaves correctly".
  Needs a mathematical model for how code executes.
  Needs a definition of "correctly". This is called the _specification_.

In principle, one could do this with pen and paper. To make it so that the
developer doesn't have to worry about messing up the proof (easy to mess
something up with lots of lines of code), use a mechanized proof checker.

[ Formal verification diagram ]
  Human writes code, proof, and spec.

What Grove is about: proving "∀ scenarios" theorems by going "line-by-line"
through the code.
  Cool part: with the right setup, can verify the system by going
  line-by-line and cover all the what-if scenarios without having to list
  them all out. Examples of this further down.

---

Specs: pre + postconditions. E.g.
  { is_sorted a } bsearch(a, x) { RET i, a[i] = x ... }

How to specify GroveKV? The pre- and postconditions have to make sense across
machines, and with possible crashes and restarts happening in the middle.

Grove builds on top of Concurrent Separation Logic (CSL). CSL is a formal
verification approach that was originally designed for concurrent
(non-distributed) programs. Grove extends CSL to distributed systems. First,
let's talk a bit about CSL.

Key idea from CSL: reason about code in terms of _ownership of resources_.
  "In separation logic, assertions not only describe what is true about a
  system's state, but also what parts of the state are logically owned by the
  thread executing the code at that point."

The classic example of CSL is the heap points-to: x ↦ v.
  As long as one thread (logically) owns a points-to, it knows that the value
  at that address is unchanged. It can also safely write to it without having
  to worry about data races. This is the crux of what allows proofs in Grove
  to work without having to consider concurrently executed code.
  x ↦ 10 ∗ y ↦ 3 is read as "x points to 10 and separately y points to 3".

GroveKV spec:
  { k ↦ w } clerk.Put(k, v) { k ↦ v }
  { k ↦ v } clerk.Get(k)    { RET v, k ↦ v }

This spec is technically a weaker version of the real GroveKV spec, since it
insists that you fully own the key in order to call Put; e.g. it disallows
concurrent Puts to the same key. The full spec does allow concurrent Puts to
the same key, and specifies that they happen in a linearizable way; it uses a
fancier specification technique called "logical atomicity". We'll discuss the
technically weaker spec, which gets to the heart of ownership-based reasoning.

[ Example walkthrough of using Put/Get specs ]
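For instance, here is a minimal sketch of how a client-side proof composes the
Put/Get specs (the Clerk interface is a stand-in, assumed for illustration;
the separation-logic assertions appear as comments):

    package main

    // Clerk is a stand-in for GroveKV's clerk; methods assumed for illustration.
    type Clerk interface {
        Put(key, value string)
        Get(key string) string
    }

    // specWalkthrough shows how a client proof uses the ownership-based spec.
    func specWalkthrough(ck Clerk, k string) string {
        // Precondition: { k ↦ w }   (the caller hands us exclusive ownership of k)
        ck.Put(k, "v1")
        // Postcondition of Put: { k ↦ "v1" }. Ownership rules out interference:
        // no other clerk can Put to k between these two calls.
        v := ck.Get(k)
        // Postcondition of Get: { RET v, v = "v1" ∗ k ↦ "v1" }, so the proof
        // concludes v == "v1".
        return v
        // We still hold { k ↦ "v1" }, which is handed back to the caller.
    }

Because the caller owns k's points-to throughout, the proof never has to
consider what other clerks might be doing to k at these program points.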
---

How to specify ApplyAsBackupRPC?

Often we want to think about the system more "abstractly" than in terms of
the contents of each node's heap. E.g. the Raft paper talks about the
abstract log of committed operations. We want to be able to do this formally,
and use it to specify ApplyAsBackupRPC.

Idea from CSL: ghost resources.
  These help talk about the system at a more abstract level than, e.g., heap
  points-tos. They represent "ghost state": state that is not physically
  present in the system, but is useful for specification and proof. Ghost
  resources are controlled by the proof.

We want to model the replica servers at a high level with a "log of
operations". Do this using an "append-only list" resource:

  a ↦ ℓ:  a points-to, like the points-tos before: exact ownership. Because
          the list is append-only, it also makes sense to talk about a lower
          bound.
  a ⊒ ℓ:  a lower bound. This shows up on the primary: the primary can't own
          the backups' lists; it only needs to know that the same ops are
          definitely in the backups' logs (it's OK if there are more). It does
          not denote ownership of anything, only knowledge.
  a ↦□ ℓ: the append-only list a has value ℓ and will never be modified
          again. During reconfiguration, one of the old replica servers
          promises not to add any more operations; capture this with ghost
          resources via knowledge of a "frozen list".

Model PB by having each server own an append-only-list points-to. There is
also a committed append-only list.

Invariant: something that has to always be true. In the case of ownership,
resources that have to always be available. The resources are owned not by
any particular thread, but by the invariant itself.

ApplyAsBackupRPC spec: promises to accept the operation.

[ Proof walkthrough ]

Time-bounded invariants: tinv(A, L) (written with a dashed box in the paper
and slides) means A is available for as long as L is valid. L is a name for
this particular tinv. L >= t means L is valid until at least time t. When
doing GetTimeRange(), if latest < t, the tinv library provides a rule that
allows temporary access to the underlying resource. In GroveKV, when
verifying Get(), this rule is what gives the server access to the resources
it needs while its lease is still known to be valid.

---

What if...

* The backup loses operations? Would be buggy. In ApplyAsBackupRPC, the
  backup promised a ⊒ ops + [op], which meant it had a ↦ ops + [op] at some
  point. There's just no way to do an update that goes backwards: if the
  backup code ever decided to undo an operation, the proof would get stuck
  because the physical state and the accepted ghost list would go out of
  sync.

* What if the old primary does a Put during reconfig? Would be buggy. The bug
  would appear if a primary tried to do an operation after having responded
  to a Seal request, because the primary would be left with a ↦□ ops.

* What if the lease expires in the middle of Get? Still OK: everything was
  done in the proof at the moment that GetTimeRange runs. (A sketch of the
  lease check follows below.)
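To make the lease check concrete, here is a minimal sketch of a lease-protected
read (names, types, and the clock-bound stub are assumed; GroveKV's real code
is more involved, e.g. a Get also waits for preceding Puts):

    package main

    import "sync"

    // GetTimeRange stands in for a clock-bound API (e.g. NTP error bounds or
    // AWS ClockBound): it returns {earliest, latest} such that
    // earliest <= current time <= latest. Stubbed out here.
    func GetTimeRange() (earliest, latest uint64) {
        return 0, 0
    }

    // Hypothetical server state, for illustration only.
    type Server struct {
        mu         sync.Mutex
        leaseUntil uint64 // the server's lease on the current epoch is valid until this time
        kv         map[string]string
    }

    // LeaseGet replies from local state, without coordinating with anyone
    // else, but only if the lease is guaranteed to still be valid.
    func (s *Server) LeaseGet(key string) (value string, ok bool) {
        s.mu.Lock()
        defer s.mu.Unlock()
        _, latest := GetTimeRange()
        if latest < s.leaseUntil {
            // Even in the worst case, the real time is before leaseUntil, so
            // the configuration service cannot have entered a new epoch yet.
            // In the proof, this is the moment the time-bounded invariant is
            // used; it does not matter if the lease expires after this point.
            return s.kv[key], true
        }
        // The lease may have expired: fall back to a slower path (not shown).
        return "", false
    }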
---

(When) is verification worth it?

This kind of verification is currently a research endeavor. Proving theorems
can eliminate some classes of bugs, but not necessarily all bugs that matter
in practice. E.g. specs proven with Grove don't guarantee liveness, i.e. it's
OK for the code to loop forever at any moment.

Writing proofs is a lot like writing code. You have little modules, once in a
while you have to refactor to clean things up, and you break the proof up
into smaller lemmas (like breaking code up into different functions).

Writing good tests takes time, as does tracking down and fixing bugs revealed
by tests. Tests might also miss bugs, so one has to weigh the cost of bugs in
production vs. the overhead of writing proofs. Writing a proof requires a lot
of effort. We're not familiar with any quantitative studies that compare
building a verified system vs. the traditional practice of testing the code.
The proof for GroveKV probably took on the order of 6 person-months of work.

Grove is a research project, and making it economical for developers to prove
real systems would at minimum require a lot of engineering work to improve
the proof framework. Becoming proficient with writing these kinds of proofs
also takes time, which would be a big up-front cost for distributed-systems
developers.

There are more lightweight tools for formally checking some properties of
distributed systems. For instance, AWS has made use of lightweight formal
methods for many years
(https://lamport.azurewebsites.net/tla/formal-methods-amazon.pdf), mainly
focusing on checking properties of high-level protocols, not fully verifying
a system implementation.

How much work was it to verify the whole system? Figure 5 (overall proof
effort): 19,347 lines of proof code.
  * "Easy" program steps
  * Pure math reasoning (e.g. l1 ⪯ l2 ∧ l1[l2.len() - 1] = x → l1 = l2)
  * Some of it is "ghost code" (e.g. mutating an append-only list)
  * Some of it is defining invariants (e.g. the replication invariant)

Q: What about verifying Raft?
A: In principle possible. In practice, we haven't done it, so there might be
   unexpectedly hard-to-overcome challenges.