6.5840 2024 Lecture 21: Secure Untrusted Data Repository (SUNDR) (2004)

Why are we reading this paper?
  We routinely trust storage services: github, gmail, AFS, Dropbox.
  Google &c may mean well, but:
    Maybe the server s/w or h/w is buggy, even exploitable.
    Maybe an attacker guessed the server admin password, modified the s/w.
    Maybe an employee of the cloud provider is corrupt or sloppy.
  Can we obtain trustworthy storage from non-trustworthy servers?
  This is a hard problem!
  SUNDR contains some good ideas.
    Similar ideas show up in git and blockchains.
    Keybase (acquired by Zoom) is directly influenced by SUNDR.
    Some of you have seen SUNDR in 6.5660 / 6.858.
  It's interesting both as a distributed system (consistency, logging)
    and for security.

The problem is real!
  I.e. attackers break into source code repositories, perhaps modify source.
  Paper mentions Debian server compromised in 2003.
  SourceForge compromised in 2011.
    [ http://sourceforge.net/blog/sourceforge-attack-full-report ]
  Canonical (Ubuntu) compromised in 2019.
    [ https://hub.packtpub.com/canonical-the-company-behind-the-ubuntu-linux-distribution-was-hacked-ubuntu-source-code-unaffected/ ]

We want "integrity":
  Readers see correct data.
  The server cannot trick them into accepting incorrect data.
  The server cannot omit legitimate updates.
    e.g. cannot omit a critical security patch.
    (this turns out to be the hard part!)
  Secrecy is not a SUNDR goal.
    SUNDR uses cryptography only to verify that data is correct.

What tools do we have for integrity?
  cryptographic hashes
  digital signatures

Cryptographic hash, e.g. SHA-1:
  h = hash(data)
  h is small, data can be large
    for SHA-1, h is 160 bits
  secure = not feasible to find two different inputs that have the same hash
  used to name data and/or to verify integrity

Secure storage with cryptographic hashes:
  use an ordinary (untrusted) key/value server
  client write(v):
    k = hash(v)
    put(k, v)
  client read(k):
    (the reading client needs to know the key k)
    v = get(k)
    check that hash(v) == k
  this is secure storage with an untrusted server
    if the server corrupts the data, the client's check will fail
  "content-addressed storage" or "content-hash storage"
  from where does a reading client get the key?
    we can't choose keys -- they must be hash(v)
    we could embed them in URLs, publish them on secure pages, send them in e-mail
  we can build secure list- and tree-shaped structures from content-hash storage
    [key, directory block, file blocks]
    knowledge of the root key allows secure access to the whole structure
      even if the blocks are stored in an untrusted server
  content-addressed storage is not enough for a read/write file system:
    a key's value cannot change, since k = hash(v) -- "immutable"

Digital signatures allow mutable storage:
  public/private key pairs; only the owner knows the private key
  sig = sign(data, k_priv)
  ok = verify(data, sig, k_pub)
  we can use any key/value naming scheme we like,
    as long as readers know the writer's k_pub
  client write(k, v) or update(k, v):
    client calls put(k, v+sign(v, k_priv))
  client read(k):
    client calls v+sig = get(k)
    verify(v, sig, k_pub)
  now the owner can modify the value stored under a key
  anyone who knows the public key can fetch and check the returned value
    and can be sure the owner of k_priv signed that value at some point
    (though that is not quite the same as "v is the correct value for k")
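To make the two tools concrete, here is a minimal sketch in Go of both
schemes over an untrusted key/value server. The kvPut/kvGet interface and
the map stand-in for the server are hypothetical, not from the paper;
ed25519 stands in for "digital signatures" (the notes above don't mandate
a particular signature scheme), and SHA-1 matches the paper's era.

  package main

  import (
      "bytes"
      "crypto/ed25519"
      "crypto/sha1"
      "errors"
      "fmt"
  )

  // The untrusted server, stubbed as a map for illustration.
  var server = map[string][]byte{}

  func kvPut(k string, v []byte) { server[k] = v }
  func kvGet(k string) []byte    { return server[k] }

  // Content-addressed write: the key is the SHA-1 of the value.
  func writeImmutable(v []byte) string {
      h := sha1.Sum(v)
      kvPut(string(h[:]), v)
      return string(h[:])
  }

  // Content-addressed read: recompute the hash to detect corruption.
  func readImmutable(k string) ([]byte, error) {
      v := kvGet(k)
      h := sha1.Sum(v)
      if !bytes.Equal(h[:], []byte(k)) {
          return nil, errors.New("server returned corrupted data")
      }
      return v, nil
  }

  // Mutable storage: the owner signs each version; any reader who
  // knows the owner's public key can verify what it fetches.
  func writeMutable(k string, v []byte, priv ed25519.PrivateKey) {
      kvPut(k, append(v, ed25519.Sign(priv, v)...))
  }

  func readMutable(k string, pub ed25519.PublicKey) ([]byte, error) {
      blob := kvGet(k)
      if len(blob) < ed25519.SignatureSize {
          return nil, errors.New("malformed value")
      }
      v := blob[:len(blob)-ed25519.SignatureSize]
      sig := blob[len(blob)-ed25519.SignatureSize:]
      if !ed25519.Verify(pub, v, sig) {
          return nil, errors.New("bad signature")
      }
      // A valid signature only proves the owner wrote v at some point;
      // the server may still be returning an old version.
      return v, nil
  }

  func main() {
      pub, priv, _ := ed25519.GenerateKey(nil) // nil = use crypto/rand
      writeMutable("login.c", []byte("v1"), priv)
      v, err := readMutable("login.c", pub)
      fmt.Println(string(v), err)
  }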
Are digital signatures alone enough?
  the server cannot forge a signature, so it cannot create arbitrary fake values
  but the server *can* return an old correctly-signed value
    so it can hide recent updates
    or show different signed versions to different readers
    or show old versions for some keys, new for others
  these are serious consistency problems
  also we'd like a coherent file system, not just individual data items
    file names, directories, multiple users, permissions, &c

One of SUNDR's ideas:
  Every update to any file includes a signature over the entire current FS state.
  Thus B's update to net.c includes a signature reflecting A's updated login.c.
  So if C sees the updated net.c, the signature will only check out
    if C also sees the new login.c.

Strawman design: Section 3.1 from the paper.
  The server stores a complete log of file-system operations:
    create, write, rename, delete, mkdir, &c
  Clients ask the server to append new operations.
  Clients reconstruct FS content by fetching and playing the log.

Strawman details:
  Log entries: fetch or modify, user, sig.
    The signature covers the entire log up to and including that entry.
  Client step:
    Download the entire log (other clients now wait).
    Check the log:
      Correct signature in each entry, covering the prior log.
      Each operation is allowed by permissions.
      This client's last log entry is present (the client must remember it).
    Construct FS state based on the logged operations.
    Append the client's operation and sign the new log.
    Upload the log (other clients can now proceed).
  Inefficient, but simple to reason about.
  (A code sketch of the client's log check appears below.)

Example straw-man log:
  mod(login.c), X, sig -- X creates the login program
  mod(login.c), B, sig -- B applies a crucial security patch

Could the server sign a fake log entry?
  No: it would need to know the private key of an authorized user.

How do clients know the public keys of authorized users?
  1. All clients must be told the public key of the root directory's owner.
  2. The file system itself stores the public key of each user.

Could the server hide an entry from C?
  Yes. This is the only kind of consistency attack possible in SUNDR.

What if the server hides B's mod(login.c) from C?
  C will read, see just X's mod(login.c),
  ask to append a fetch() with a signature over what it saw:
    fetch(), C, sig
  and remember its last operation (that fetch and its signature).
  [diagram: start of a forked log]

Suppose the server later wants to include B's mod(login.c)
  the next time C asks for the log,
  in case C hears that B patched login.c.
  It won't work for the server to show C this:
    mod(login.c), X, sig
    mod(login.c), B, sig
  because C requires that its last operation (its fetch) be in the log.
  It won't work for the server to show C this:
    mod(login.c), X, sig
    mod(login.c), B, sig
    fetch(), C, sig
  because C's fetch signature didn't cover B's mod(login.c).
  This won't work either:
    mod(login.c), X, sig
    fetch(), C, sig
    mod(login.c), B, sig
  because B's mod() signature didn't cover C's fetch.
  So the server cannot ever show C B's mod(login.c).
    Or any future B operation, since B's signatures will include
    its mod(login.c) in each of them.
  The server can't show B any future C operation either,
    since it can't show B C's fetch(), which is not compatible
    with B's mod(login.c).
  But the server *can* continue to give C a separate view of the file system,
    containing only C's future operations, but not B's.
  That is, the server can "fork" B and C.

This is "fork consistency":
  The server can fork users, concealing each other's updates.
  But the server cannot heal a fork -- forked forever.
    Since then signatures would reveal the missing operations.

The Question: what's the point of including fetch() in the log?
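Here is a compact Go sketch of the straw man's client-side log check
described above. Entry, prefixBytes, and the byte encoding are simplified
stand-ins, not the paper's format; a real client would use a canonical
encoding and also check permissions.

  package main

  import (
      "bytes"
      "crypto/ed25519"
      "errors"
      "fmt"
  )

  // One log entry: the operation (e.g. "fetch" or "mod(login.c)"),
  // the user who issued it, and a signature covering the entire log
  // up to and including this entry.
  type Entry struct {
      Op, User string
      Sig      []byte
  }

  // The bytes that entry i's signature must cover: all earlier entries
  // (including their signatures) plus entry i's own op and user.
  func prefixBytes(log []Entry, i int) []byte {
      var b []byte
      for j := 0; j < i; j++ {
          b = append(b, log[j].Op...)
          b = append(b, log[j].User...)
          b = append(b, log[j].Sig...)
      }
      return append(append(b, log[i].Op...), log[i].User...)
  }

  // A client signs its new operation over the whole log it saw.
  func signEntry(log []Entry, op, user string, priv ed25519.PrivateKey) Entry {
      withNew := append(append([]Entry{}, log...), Entry{Op: op, User: user})
      e := withNew[len(withNew)-1]
      e.Sig = ed25519.Sign(priv, prefixBytes(withNew, len(withNew)-1))
      return e
  }

  // checkLog validates a downloaded log: every entry must carry a
  // valid signature over the prior log, and the client's own last
  // entry must be present -- otherwise the server has hidden history.
  func checkLog(log []Entry, keys map[string]ed25519.PublicKey, myLast *Entry) error {
      sawMine := myLast == nil // nil = this client's first operation
      for i, e := range log {
          pub, ok := keys[e.User]
          if !ok {
              return errors.New("operation by unauthorized user")
          }
          if !ed25519.Verify(pub, prefixBytes(log, i), e.Sig) {
              return errors.New("bad signature in log")
          }
          if myLast != nil && bytes.Equal(e.Sig, myLast.Sig) {
              sawMine = true
          }
      }
      if !sawMine {
          return errors.New("my last operation is missing: possible fork")
      }
      return nil
  }

  func main() {
      pub, priv, _ := ed25519.GenerateKey(nil)
      keys := map[string]ed25519.PublicKey{"X": pub}
      var log []Entry
      log = append(log, signEntry(log, "mod(login.c)", "X", priv))
      fmt.Println(checkLog(log, keys, &log[0])) // <nil>
  }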
Is fork consistency a good outcome?
  It's not ideal that it allows fork attacks.
  But users can detect a fork if they can communicate outside of SUNDR.
    e.g. e-mail asking "what do you think of my last commit?"
  And, given the assumptions, it seems the best one can do.
  A trusted "timestamp box" can automate fork detection.
    A special "user" updates some file every 5 seconds.
    If a client sees these updates, it's in the same fork as the timestamp box.
    If there is a fork, the timestamp box's updates can show up in only one fork.
    Clients in the other fork won't see the timestamp box's updates,
      and can infer that the server has forked them.

The strawman is not practical:
  The log keeps growing.
  Interpreting the log gets slow.

Idea: a tree of blocks reflecting the current state, instead of a log.
  Figure 2 -- similar to an ordinary disk file system.
  One tree per user; the full file system is the union of all users' trees.
  A user's i-handle points to a table of i-number -> inode mappings.
  An inode contains the type (file vs directory) and a list of block references.
  A directory contains a list of name -> i-number mappings.
  All references are SHA-1(block),
    so clients can verify the blocks that the server returns.
  But SHA-1 references mean that the tree under a given i-handle is immutable.
  How to update?
    When client C writes a file, it constructs a new tree reflecting its modification.
    The new tree can share almost all blocks with the old tree;
      it only needs new blocks on the path from the modified content
      up to the i-handle.
    A new i-handle for each successive modification.

How to maintain fork consistency with i-handles?
  A malicious server could give out old i-handles, concealing recent updates.
  We can't prevent the server from forking users, but (as with the straw man)
    we want to prevent it from merging forks and thus concealing its misdeeds.
  We want each new i-handle to somehow encode what came before it,
    including updates (and fetches) from all users.

Idea: a signed "version structure" (VS) for each user.
  Figure 3.
  The server stores each user's latest VS.
  Each VS contains:
    The user's i-handle after the user's last operation.
    A version vector (VV):
      for each user, how many operations that user has performed.
    A public-key signature by the user.
  The point: the VV operation counts allow clients to detect the
    appearance of previously-hidden operations, and thus detect
    attempts by the server to merge forks.

How client U1 executes an operation (both reads and writes):
  Get all users' VSs from the server.
  Validate them (we'll see how in a minute).
  Get the needed i-tables &c from the block server.
  Execute the operation using the i-tables.
  Store the new i-table and i-handle with modifications (if any).
  New VS:
    The new i-handle.
    Copy the current vv[j] from each user j's VS,
      but increment U1's version #.
  Put the new VS back to the server.

How do version vectors evolve in correct operation?
  U1: 1,10               2,12
  U2:        1,11  1,12        2,13
  A correct server executes operations serially, in some order,
  so each operation should see the version # of the previous operation.

How does U2 validate a set of fetched VSs?
  Check that U2's own VS is up to date (so U2 must remember its last VS).
  Check that the version vectors of different users can be ordered.
    OK: 2,13 then 2,14
    not OK: 2,13 then 3,12
  General VV rule: vv1 and vv2 can be ordered if either
    vv1[i] >= vv2[i] for all i, or
    vv2[i] >= vv1[i] for all i.
  That is, vv1 came after vv2 and saw everything vv2 saw,
    or vv2 came after vv1 and saw everything vv1 saw.

A malicious server cannot create a fake VS, because users sign them,
  but it can return an old one -- i.e. hide a user's latest VS.
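The validation rule translates directly into code. A sketch in Go
(the VV representation as a map from user to operation count is
hypothetical; SUNDR's actual version structures also carry i-handles
and signatures):

  package main

  // A version vector: for each user, how many operations that user
  // had performed when this version structure was signed.
  type VV map[string]int

  // canOrder reports whether one vector dominates the other, i.e.
  // whether both could have been produced by a server executing all
  // operations in a single serial order. (Missing users count as 0.)
  func canOrder(a, b VV) bool {
      aGE, bGE := true, true
      for u := range union(a, b) {
          if a[u] < b[u] {
              aGE = false
          }
          if b[u] < a[u] {
              bGE = false
          }
      }
      return aGE || bGE
  }

  func union(a, b VV) map[string]bool {
      users := map[string]bool{}
      for u := range a {
          users[u] = true
      }
      for u := range b {
          users[u] = true
      }
      return users
  }

  // validate checks that every pair of fetched version vectors can
  // be ordered; an incomparable pair means the server is trying to
  // merge a fork, so the client refuses to proceed.
  func validate(vvs []VV) bool {
      for i := range vvs {
          for j := i + 1; j < len(vvs); j++ {
              if !canOrder(vvs[i], vvs[j]) {
                  return false
              }
          }
      }
      return true
  }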
What would the version vectors look like if the server hid an update from U2?
  U1: 1,10        [2,11]
  U2:       1,11           1,12
  ([2,11] is U1's update that the server conceals from U2,
   so U2's next VS is 1,12 rather than 2,12.)

Do the version vectors give us fork consistency?
  Can the server show future U2 VSs to U1? e.g. 1,13
    No: 1,13 and 2,11 cannot be ordered.
    U1 knows its 2,11 must have executed before U2's 1,13 (since 11 < 13),
    but 2 > 1 means the server concealed 2,11 from 1,13.
  Can the server show future U1 VSs to U2? e.g. 3,11
    No: 3,11 and 1,12 cannot be ordered!

Nice: version structures allow us to use an efficient tree data structure
  and eliminate the need for an ever-growing log,
  while still enforcing fork consistency.

Summary:
  Hard problem: integrity despite compromised servers.
  Client signatures prevent outright forgery.
  Hiding and forking are the main attacks still possible.
  Fork consistency prevents the server from merging a fork once created.
  Forking is still possible, but will eventually become obvious to clients.
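As a closing illustration, a main function exercising canOrder from the
sketch above on the forked example (add it to the same file, along with
an `import "fmt"`); the vector values are the ones from the discussion:

  func main() {
      mine := VV{"U1": 2, "U2": 11}  // U1's own last version vector
      shown := VV{"U1": 1, "U2": 13} // a later VS from U2's fork
      // 2 > 1 but 11 < 13: neither vector dominates, so no correct
      // serial server could have produced both. U1 detects the
      // attempted fork merge and refuses to proceed.
      fmt.Println(canOrder(mine, shown)) // false
  }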