6.824 2007 Lecture 10: Logging and Recovery

Done with consistency segment of the class.
  Consistency helps multiple clients update the same shared data
  Haven't yet dealt with the durability of the data itself
    across failures, even on just a single node

Today's topics:
  logging
  FSD case study

What's the overall topic?
  Atomic updates of complex data w.r.t. failures.
  Today just a single system, we'll be seeing distributed versions later.
  Similar techniques will be useful in distributed systems
    like Petal, 2-phase commit, Paxos, etc.
  assume disk sector writes are atomic, and "work correctly"
    see FSD paper for better handling of disk failures

What does "atomic" mean?
  Suppose an operation has many steps
  It is atomic if observers see either none of the steps, or all
  We're worried about atomicity in the face of crashes

Let's talk about atomic update of file system data structures.
  E.g. UNIX file system. Or YFS updating the extent server.

Can you update the on-disk FS in place?
  For example, a file create might involve these disk writes:
    1. Allocate an i-number from a free-bitmap.
    2. Append name/i-number to directory.
    3. Initialize i-node (size, owner, block list, &c).
  Why is the above order potentially bad?
    Crash between 1 and 2.
    Crash between 2 and 3.
  Would a different order be better?
    1. Allocate i-number.
    2. Initialize i-node.
    3. Append to directory.
  Idea: can view write #3 as a "committing write"
    Crash before: no change visible
    Crash after: complete change is visible
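[Aside: a minimal sketch of the committing-write ordering in Go, using
 an in-memory map as a stand-in disk. Everything here (the disk type,
 writeSync, the block numbers) is invented for illustration, not from
 any real file system.]

  package main

  import "fmt"

  type disk map[int][]byte

  // writeSync stands in for a synchronous disk write: a real FS would
  // not return until the sector is on the platter.
  func (d disk) writeSync(blockno int, data []byte) {
      d[blockno] = append([]byte(nil), data...)
  }

  func create(d disk, name string) {
      // 1. Allocate an i-number (pretend block 0 is the free bitmap).
      d.writeSync(0, []byte("bitmap: inum 7 allocated"))
      // 2. Initialize the i-node (pretend block 7 holds it).
      d.writeSync(7, []byte("inode 7: size=0 owner=alice"))
      // 3. The committing write: only now does a directory reference
      // inum 7. A crash before this point leaves garbage (an allocated
      // but unreferenced i-node) that a scan can collect, but no
      // visible change; a crash after it leaves the complete create.
      d.writeSync(1, []byte("dir: "+name+" -> 7"))
  }

  func main() {
      d := disk{}
      create(d, "t0")
      fmt.Println(string(d[1])) // dir: t0 -> 7
  }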
Synchronous Meta-Data Update
  Write disk in a careful order
  Wait for each write to finish before doing the next
  Scan the disk and fix things after crash
    e.g. allocated but unreferenced i-nodes

Why aren't synchronous meta-data updates enough?
  They're slow
    Must write to disk, not cache
    10 ms per disk I/O
    Can't re-order disk writes to decrease seek time
  Recovery may require checking/scanning the whole FS (e.g. fsck)
  Some operations don't have an obvious single committing write
    Example: UNIX rename
      Programs use rename to reveal only a complete file:
        echo stuff > /tmp/t0
        mv /tmp/t0 file
      need to update two directories, stored in two blocks on disk
      remove then add? add then remove?
        probably want add then remove
      what if a crash? what does the post-crash checker do?
        it knows something is wrong: link count is 1, but there are two links
        can't roll back -- which one to delete?
        has to leave two links, and increase the link count
        this is *not* a legal result of rename!
        but at least we haven't lost the file
  sync meta-data updates are slow *and* have wrong semantics

You can push tree update one step farther.
  Prepare a new copy of the entire affected sub-tree.
  Replace old subtree in one final write.
  Very expensive if done in the obvious way.
  But you can share structure between old and new tree.
    Only need new storage between change points and sub-tree root.
  (NetApp WAFL does this and more.)
  This approach only works for tree data structures.
    and doesn't support concurrent operations very well

What are the reasons to use logging?
  communicates order -- this is the fundamental property of a log
    Bayou, so all nodes agree on order of operations
    pre-crash FS operations -> FS recovery
  atomic commit of compound operations, w.r.t. crashes
  fast recovery (unlike fsck)
  well-defined post-recovery state: serial prefix of operations
    as if synchronous, and the crash had occurred a bit earlier
    in echo/mv example, if you see the file, you see the new content too
  can be applied to almost any existing data structure
    e.g. database tables, free lists
  representation is compact on a disk, so very fast to append
  useful to coordinate updates to distributed data structures
    let's all do this operation
    oops, someone didn't say "yes"
    how to back out or complete?

How does one build a logging file system?

      Apps
        |
        FS
       /  \
     Log  Disk cache
       \  /
       Disk

  (the FS pushes each update through both the log and the disk
   cache, and both eventually reach the disk)

Transactions
  Most FS operations involve multiple disk writes
  We want all the writes of an operation to be atomic
  So the logging system needs to know what the operations are
  A "transaction" is a multi-write operation that should be atomic.
  create():
    begin_transaction
      update free list
      update i-node
      update directory entry
    end_transaction
  there may be multiple concurrent transactions
    e.g. two processes are making system calls

Terminology
  in-memory data cache
    dirty vs clean
  on-disk data
  in-memory log
  on-disk log

A naive re-do log

log format:
  B TID                [begin]
  W TID B# new-data    [write]
  E TID                [end == commit]

Example:
  B T1
  W T1 B1 25
  E T1
  B T2
  W T2 B1 30
  B T3
  W T3 B2 99
  W T3 B3 50
  E T3

for now, the log lives on its own infinite disk
note the log may include records from uncommitted xactions
  records from concurrent xactions may be inter-mingled
we can write dirty in-memory data blocks to disk any time we want

what does a write do?
  modify the block in the data cache
  append a record to the log

what does recovery do?
  1. discard all on-disk data
  2. scan the whole log and remember all committed TIDs
  3. scan the whole log, ignore non-committed TIDs, replay the writes

why does recovery omit writes of non-committed xactions?
  *code* for that xaction can't be resumed
  thus it can't complete
  thus atomicity requires that it look like it was never started

when can we write dirty data from cache to disk?
  any time, since recovery reconstructs everything

why can't we use any of the on-disk data's contents during recovery?
  don't know if a block is from an uncommitted xaction
    i.e. was written to disk before commit
  the *real* data is in the log!
  the on-disk data structure is just a cache for speed
    since it's hard to *find* things in a log
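[Aside: the two-pass recovery above is easy to express in code. A
 minimal sketch in Go, assuming the simple B/W/E record format; the
 record type and names are invented for illustration.]

  package main

  import "fmt"

  type record struct {
      kind  byte   // 'B', 'W', or 'E'
      tid   int
      block int    // valid for 'W'
      data  string // valid for 'W'
  }

  // recoverFS discards the old on-disk data entirely and rebuilds it
  // from the log (the on-disk structure is just a cache of the log).
  func recoverFS(log []record) map[int]string {
      // Pass 1: find committed transactions (those with an End record).
      committed := map[int]bool{}
      for _, r := range log {
          if r.kind == 'E' {
              committed[r.tid] = true
          }
      }
      // Pass 2: replay writes, ignoring uncommitted transactions, so a
      // half-finished operation looks like it never started.
      disk := map[int]string{}
      for _, r := range log {
          if r.kind == 'W' && committed[r.tid] {
              disk[r.block] = r.data
          }
      }
      return disk
  }

  func main() {
      log := []record{
          {'B', 1, 0, ""}, {'W', 1, 1, "25"}, {'E', 1, 0, ""},
          {'B', 2, 0, ""}, {'W', 2, 1, "30"}, // T2 never committed
      }
      fmt.Println(recoverFS(log)) // map[1:25] -- T2's write is ignored
  }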
so what have we achieved?
  atomic update of complex data structures: gets rename() right
  recoverable operations are fast (exploit disk locality)
  problems:
    we have to store the whole log forever
    recovery has to replay from the beginning of time

A better log scheme: write-ahead log + checkpoint + re-do
  most FS logs work like this, e.g. FSD
  allows much faster recovery: can use on-disk data

how can we avoid re-playing the whole log on recovery?
  recovery needs to know a point in the log at which it can start
  a "checkpoint": pointer into the log, stored on disk
  how to ensure recovery can ignore everything before the checkpoint?
    checkpoint rule: all data writes before the checkpoint must be
      stable on disk
    checkpoint may not advance beyond the first uncommitted Begin
  in background, flush a bunch of early writes, update the checkpoint ptr

recovery:
  scan log from the checkpoint to see what transactions are committed
  re-play committed updates from the checkpoint onward

what about transactions that wrote the disk but did not commit?
  will they leave partial operations visible after recovery?
  we can't allow this!
  write-ahead rule: do *not* write dirty blocks from the in-memory data
    cache to disk until the corresponding commit record is on disk
  thus the disk contains only committed data

three log regions:
  data guaranteed on disk (before the checkpoint)
  data might be on disk (before the log write point)
  data cannot be on disk (end of the in-memory log)

so a transaction does this:
  write disk cache in memory
  append log records
  append commit record
  wait for all its log records to reach disk
  write disk cache to disk
  update checkpoint
  (last three in background)

what if crash after writing blocks to disk, but before updating the
  checkpoint? recovery will write the same block contents to disk

bonus: we can free log space before the checkpoint
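[Aside: a minimal sketch of the write-ahead discipline in Go, with
 in-memory stand-ins for the data cache, the on-disk log, and the
 on-disk data blocks. All names are invented; the only point is the
 ordering: log records, including the commit record, must reach the
 disk before any of the transaction's data blocks do.]

  package main

  import "fmt"

  type system struct {
      cache    map[int]string // in-memory data cache (dirty blocks)
      memLog   []string       // in-memory log tail
      diskLog  []string       // on-disk log
      diskData map[int]string // on-disk data blocks
  }

  // flushLog stands in for forcing the in-memory log to disk (fsync).
  func (s *system) flushLog() {
      s.diskLog = append(s.diskLog, s.memLog...)
      s.memLog = nil
  }

  // commit runs one transaction under the write-ahead rule.
  func (s *system) commit(tid string, writes map[int]string) {
      for b, v := range writes {
          s.cache[b] = v // 1. update the in-memory data cache
          s.memLog = append(s.memLog, fmt.Sprintf("W %s B%d %s", tid, b, v))
      }
      s.memLog = append(s.memLog, "E "+tid) // 2. append the commit record
      s.flushLog()                          // 3. commit point: log on disk
      for b, v := range writes {
          s.diskData[b] = v // 4. safe now; may also happen in background
      }
      // 5. only after the data writes may the checkpoint ptr advance.
  }

  func main() {
      s := &system{cache: map[int]string{}, diskData: map[int]string{}}
      s.commit("T1", map[int]string{1: "new dir block"})
      fmt.Println(s.diskLog, s.diskData)
  }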
careful disk writing
  the log is usually stored in a dedicated known area of the disk
    so it's easy to find after a reboot
  where's the start? block #0 holds the checkpoint pointer
  where's the end? hard if a crash interrupted a log append
    append records in order
    include a unique ascending sequence # in each record
      also a checksum for multi-sector records (maybe in the End?)
    recovery must search forward for the highest sequential #
    does this always work?

why is logging fast?
  can delay writes to data blocks
    only have to write the log right away
  log appends are fast
    no seek if the log was recently written
    all writes go to the same place, not scattered around the disk
    single seek/rotate per transaction, rather than per write
  write-behind of data allows batched / scheduled writes
    one data block may reflect many transactions
      e.g. creating many files in one directory
    don't have to be so careful, since the log is the real information

Problems with WAL + checkpoint + re-do
  uncommitted transactions use space in the in-memory data cache
    a problem for long-running transactions
    (not a problem for file systems)
  log records are huge
    an entire block, even if only a few bytes were modified
    writing the log to disk is slow
  can't abort a transaction under s/w control
    you're midway through a rename
    unexpectedly run out of blocks, or i-nodes, or cache space
    would be nice to un-do the partial operation
    FS rarely has support for this, but it's common in DBs

More sophisticated log system:
  log just the bytes that change, not the whole block
    or a description of the operation
  un-do as well as re-do
    each log record has old data and new data
    un-do makes abort a lot easier
    un-do allows one to write uncommitted data to disk
      recovery can un-do it

*** un-do/re-do was briefly summarized, skip to FSD paper ...

problem: uncommitted transactions use space in the in-memory data cache
  a problem for long-running transactions
  (not a problem for file systems)

un-do/re-do with checkpoint
  suppose we want to write uncommitted data to disk?
  need to be able to un-do it in recovery
  so include the old value in each log record:
    W TID B# old-data new-data
  now we can write data from the in-memory data cache to disk
    after the log entry is on disk
    no need to wait for the End to be on disk
  so we can free in-memory data cache blocks of uncommitted transactions
  recovery: for each block mentioned in the log
    find the last xaction that wrote that block
    if committed: re-do
    if not committed: un-do
  two pointers stored on disk: checkpoint and tail
    checkpoint: all in-memory data cache entries flushed up to this point
      no need to re-do before this point
      but may need to un-do before this point
    tail: start of the first uncommitted transaction
      no need to un-do before this point
      so can free log space before this point
    it's ok if we crash just before updating the tail pointer itself
      we would have advanced it over committed transaction(s)
      so we will re-do them, no problem
  what if there's an un-do record for a block never written to disk?
    it's ok: un-do will re-write the same value that's already there
  what if:
    B T1
    W T1 B1 old=10 new=20
    B T2
    W T2 B1 old=20 new=30
    crash
  The right answer is B1 = 10, since neither committed
  But it looks like we'll un-do to 20
  What went wrong? How to fix it?
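[Aside: one standard fix is to apply un-dos in reverse log order,
 undoing *every* uncommitted write, not just the last writer of each
 block. A minimal sketch in Go, with invented types; it assumes the
 usual locking discipline, so a committed xaction never overwrites an
 uncommitted one's data, which makes re-do-then-un-do safe.]

  package main

  import "fmt"

  type rec struct {
      kind     byte // 'B', 'W', 'E'
      tid      int
      block    int
      old, new int // valid for 'W'
  }

  func recoverFS(log []rec, disk map[int]int) {
      committed := map[int]bool{}
      for _, r := range log {
          if r.kind == 'E' {
              committed[r.tid] = true
          }
      }
      // Re-do committed writes, oldest first.
      for _, r := range log {
          if r.kind == 'W' && committed[r.tid] {
              disk[r.block] = r.new
          }
      }
      // Un-do uncommitted writes, newest first, so in the example
      // above we un-do T2 (B1: 30 -> 20) and then T1 (B1: 20 -> 10).
      for i := len(log) - 1; i >= 0; i-- {
          r := log[i]
          if r.kind == 'W' && !committed[r.tid] {
              disk[r.block] = r.old
          }
      }
  }

  func main() {
      log := []rec{
          {'B', 1, 0, 0, 0}, {'W', 1, 1, 10, 20},
          {'B', 2, 0, 0, 0}, {'W', 2, 1, 20, 30},
      }
      disk := map[int]int{1: 30} // B1 may have reached disk pre-crash
      recoverFS(log, disk)
      fmt.Println(disk[1]) // 10 -- both un-dos applied, reverse order
  }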
-------

Example: Cedar
  Reimplementing the Cedar File System Using Logging and Group Commit
  Robert Hagmann, SOSP 1987

What are FSD's main on-disk data structures?
  name table
    b-tree indexed by name
    entry: block list
  VAM -- bitmap indicating which blocks are free
  file content blocks
  leader block per file (ignore)

What were they worried about?
  Crash during a complex b-tree operation (e.g. splitting during insert)
  Recovery speed -- no whole-disk scans
  Maybe speed of ordinary operation
  Disk sector corruption

What happens when you create a file in FSD?
  1. get two free pages for leader and data from the VAM
  2. update the name table b-tree in memory
  3. write leader+data to disk synchronously
  4. append modified name table pages to the log in memory

When is the in-memory log forced to the disk?
  Group commit... Every few seconds.
  Why not log immediately, after every operation?
  Is there a down-side to group commit?

What does FSD's on-disk log look like?
  Fixed area of disk, not a file, no free list for log blocks.
  Proper entry: hdr, blank, hdr copy, data pages..., end,
    copies of data..., end copy.
  What if FSD runs out of log space during normal operation?
  How can we be sure it's safe to start re-using the log space on disk?
  What if crash while writing blocks during switch to a new 1/3?
  Suppose a block is modified in one 1/3, modified again in a later 1/3.
    Does it have to be written to disk when the first 1/3 is re-used?
    Is it even *allowed* to be written to disk when the first 1/3 is
      re-used? What if the 2nd transaction to write the block hasn't
      committed?

What happens during recovery after crash+reboot?
  How does recovery s/w find the start of the log?
    On-disk ptr to start of the oldest third.
    Written when the previous third was re-used.
  How does recovery find the last log record?
    hdr/end blocks must have a time-stamp or something.
  What does recovery actually do with the log records?
  Can it just start re-play at the beginning?
    I.e. what if it crashed during a previous replay, is the 2nd time OK?
  Does recovery work if crash during a log write?
    Can recovery reliably decide if there's an "end" block?
    If the "end" block is present, are all data blocks guaranteed to be
      good?

What about recovering the free list (VAM)?
  What if the on-disk VAM was incorrect at the time of the crash?
  Does FSD log VAM modifications? Why is it OK not to log them?

Why don't they log data writes?
  When do they write the data?
  Does that break the prefix rule?
    May have writes from "the future", after the recovery point.
  I think FSD never updates a file in place.
    Overwriting a file always creates a new version, preserves the old one.
  What about delete, create, write, crash?
    W/ recovery point before the delete?

Performance Evaluation
  The big story is crash recovery: 25 sec instead of an hour
  Why is FSD faster than CFS?
  Table 3: why does FSD do 1/6th as many I/Os for small creates?
    Why 149 I/Os for 100 files?
  Did they evaluate whether they achieved their robustness goals?

****** unused old notes *******

Let's talk about the old system, CFS, first.
  In some ways it's the more interesting of the two.

What were some strengths of the old design?
  Labels provided robustness.

What were the weaknesses of the old design?
  Non-standard hardware.
  Slow recovery.
  Slow meta-data update.
  CFS file name table was a b-tree.
    *not* consistent after a crash.
    thus really required duplicating names &c in headers.

How did the labels work?
  What was in a label? When written? When checked?
  What errors can the labels detect?
  What invariants could a label help maintain?

Disk hardware failures: for each, detect bad data? Recover?
  If the disk scribbles onto the file name table?
  If the disk scribbles onto a header?
  If the disk scribbles onto file content?

Software errors:
  If a used page is on the free list, will we re-use it incorrectly?
  If someone writes junk into a header, will we detect? Can we recover?
  If someone writes junk into file content, will we detect?

Why was the CFS VAM just hints?
  VAM is the disk page free list.
  How could there be no invariants?
  What if an allocated page is on the free list?
  Does the FSD VAM have invariants?

Why is guarding against single sector errors reasonable?
  What's the logic behind this design focus?
  What are the alternatives?
  What if most disk errors occurred in the disk electronics?

Do they have hierarchical directories?
  Only one "directory", the file name table...