6.824 2012 Lecture 10: Logging and Recovery

Start of new big topic: fault tolerance
  Start with single computers
  Then distributed systems

Today's lecture: crash recovery for storage
  logging
  FSD case study

Overall topic: crash recovery
  Updating persistent (on-disk) storage in the face of crashes
  Challenge: multi-write operation, crash partway through
  Overall game:
    Write the disk carefully
    Recover after restart: complete or roll back partial operation
  Assuming fail-stop crashes; no malice, no bugs

Why storage? Why not crash recovery for computation?
  Crash recovery for computation has to involve persistent storage
    E.g. periodically save state of computation to disk
  You need today's techniques to save that state reliably

Big idea: atomic operations
  Suppose an operation has many steps
  It is atomic w.r.t. crashes if:
    After restart+recovery, you see the effect of all steps, or none
  We are *not* worried about atomicity in the face of concurrency today

Let's talk about atomic update of file system data structures.
  E.g. UNIX file system. Or YFS updating the extent server.

Can you update the on-disk FS in place?
  For example, a file create might involve these disk writes:
    1. Allocate an i-number from a free-bitmap.
    2. Append name/i-number to directory.
    3. Initialize i-node (size, owner, block list, &c).
  Why is the above order potentially bad?
    Crash between 1 and 2.
    Crash between 2 and 3.
  Would a different order be better?
    1. Allocate i-number.
    2. Initialize i-node.
    3. Append to directory.
  Idea: can view write #3 as a "committing write"
    Crash before: no change visible
    Crash after: complete change is visible

Synchronous Meta-Data Update
  Organize all system calls to have a committing write
  Write the disk in a careful order
  Wait for each write to finish before doing the next
  Scan the disk and fix things after a crash
    e.g. allocated but unreferenced i-nodes

Why aren't synchronous meta-data updates enough?
  They're slow
    Must write to disk, not cache
    10 ms per disk I/O
    Can't re-order disk writes to decrease seek time
  Recovery may require checking/scanning the whole FS (e.g. fsck)
  Some operations don't have an obvious single committing write
    Example: UNIX rename
    Programs use rename to reveal only a complete file:
      echo stuff > /tmp/t0
      mv /tmp/t0 file
    Need to update two directories, stored in two blocks on disk.
    Remove then add? Add then remove?
      Probably want add then remove.
    What if a crash? What does the post-crash checker do?
      It knows something is wrong: the link count is 1, but there are two links.
      Can't roll back -- which link should it delete?
      Has to leave both links and increase the link count.
    This is *not* a legal result of rename!
      But at least we haven't lost the file.
  Sync meta-data updates are slow *and* have the wrong semantics

You can push tree update one step farther.
  Prepare a new copy of the entire affected sub-tree.
  Replace the old sub-tree in one final write.
  Very expensive if done in the obvious way.
  But you can share structure between the old and new trees.
    Only need new storage between the change points and the sub-tree root.
  (NetApp WAFL does this and more.)
  This approach only works for tree data structures.
    And it doesn't support concurrent operations very well.

What's the basic idea of logging?
  The log lives on disk, separate from the file system
  Each operation pre-declares all its writes in the log
  Create would say in the log that it *will* write:
    dwrite(bn, data) -- free list
    dwrite(bn, data) -- dir
    dwrite(bn, data) -- inode
    END
  Start writing the FS only after the intentions are all in the log on disk
  If crash:
    Recover the entire op from the log
  What if crash while writing the log?
    No END -> recovery ignores the operation's log entries

What are the reasons to use logging?
  Atomic commit of compound operations, w.r.t. crashes.
  Fast recovery (unlike fsck).
  Well-defined post-recovery state: a serial prefix of the operations.
    In the echo/mv example, if you see the file, you see the new content too.
  Can be applied to almost any existing data structure
    e.g. database tables, free lists
  Helps performance by turning random writes into sequential writes
  Useful to coordinate updates to distributed data structures
    let's all do this operation
    oops, some nodes crashed
    look in the logs to help pick up the pieces

Software structure of a logging file system
  Apps (system calls)
  FS (block writes)
  Log + Disk cache
  Disk

Terminology
  in-memory data cache
    dirty vs clean
  on-disk FS
  [in-memory log]
  on-disk log

Transactions
  Most FS operations involve multiple disk writes
  Atomic == all writes or none
  So the logging system needs to know operation boundaries
  A "transaction" is a multi-write operation that should be atomic
    create()
      begin_transaction
      update free list
      update i-node
      update directory entry
      end_transaction

Let's firm up the details of our super-simple log ("Design 1"):
  Where is the log?
    On disk, at a known block address
    Holds just one transaction at a time
  Log format?
    Start block:
      # of data blocks the xaction will dirty
      array of block addresses
      COMMIT flag
    A sequence of data blocks
  What does BEGIN_TRANSACTION do?
    Lock the log (wait until the previous operation is totally done)
  What does an operation write(bn,data) do?
    Update the in-memory data cache (but do *not* write the on-disk FS)
    Write data to the end of the log, on disk
    Remember bn
  What does END_TRANSACTION do?
    Write the block addresses to the start block
    Set the COMMIT flag in the start block
    Then write the cached blocks to the on-disk FS
    Then clear the COMMIT flag
  What does recovery do?
    If the start block has the COMMIT flag set:
      Copy each block to its place in the on-disk FS
      Clear the COMMIT flag
    Else: do nothing
  This is a "write-ahead" "re-do" log
    Write-ahead = all of an xaction's writes are in the log before any are
      written to the on-disk FS
    Re-do = recovery consists of replaying written data from the log
      To finish any potentially interrupted END_TRANSACTIONs
  What if crash during the operation, before END_TRANSACTION?
  What if crash during END_TRANSACTION?
  What if crash during recovery?
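A toy Python sketch of Design 1, to make the steps concrete. This is my own
illustration, not code from the lecture or the paper: the dicts standing in
for the disk and cache, the block numbers, and names like twrite() are all
invented.

  LOG_START = 1000           # assumed address of the log's start block
  LOG_DATA = 1001            # first log data block

  disk = {}                  # block number -> data (stand-in for the disk)
  cache = {}                 # in-memory data cache
  pending = []               # block numbers this transaction will dirty

  def begin_transaction():
      # a real system would lock the log here (one transaction at a time)
      del pending[:]

  def twrite(bn, data):
      cache[bn] = data                      # update the cache, *not* the on-disk FS
      disk[LOG_DATA + len(pending)] = data  # append a copy to the on-disk log
      pending.append(bn)                    # remember bn for the start block

  def end_transaction():
      # committing write: block addresses plus COMMIT flag, in one start block
      disk[LOG_START] = ("COMMIT", list(pending))
      for i, bn in enumerate(pending):      # now install into the on-disk FS
          disk[bn] = disk[LOG_DATA + i]
      disk[LOG_START] = ("FREE", [])        # finally clear the COMMIT flag

  def recover():
      flag, addrs = disk.get(LOG_START, ("FREE", []))
      if flag == "COMMIT":                  # re-do: replay the logged blocks
          for i, bn in enumerate(addrs):
              disk[bn] = disk[LOG_DATA + i]
          disk[LOG_START] = ("FREE", [])
      # no COMMIT flag -> the partial transaction is simply ignored

Note that recover() is idempotent: if we crash during recovery and run it
again, the result is the same -- which is the answer to the "crash during
recovery" question above.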
What are the defects of Design 1?
  Slow: every block is written twice
  Slow: each log write takes one disk revolution

Idea: delay writes from the cache to the on-disk FS
  The point: hope for "write absorption"
    Multiple operations may modify the same block
    If we delay, maybe we write the block to disk once, not many times
    Example: create lots of files in a single directory
    Example: create a temporary file, then remove it
  But now we must store multiple transactions in the on-disk log
    Otherwise we cannot delay writes to the on-disk FS
    Can't erase a transaction from the log until we write the on-disk FS
  And we need a new story for when to re-use on-disk log space

Design 2: logging with delayed writes to the on-disk FS
  Log format:
    Again, the log sits in a known range of blocks on disk
    Header block: block # of the first valid log record
    Start block: TID, # blocks, block addresses, COMMIT flag
    Data blocks...
    (then maybe another start block and data blocks, &c)
  BEGIN_TRANSACTION and block writes same as in Design 1
  What does END_TRANSACTION do?
    Write the block addresses to this transaction's start block
    Set the COMMIT flag in the start block
    (but don't write data from the cache to the on-disk FS)
  When to write data from the cache to the on-disk FS?
    After the transaction's COMMIT flag is set
    Before re-using the transaction's space in the log
  When to re-use on-disk log area?
    When the log is full -- typically the log wraps around to the start
    So free log space in order, oldest transaction first
    Before freeing a transaction's space in the log:
      Must first force the transaction's cached data to the on-disk FS
        But only if the cached block has *not* been modified by a subsequent xaction
        Why is that?
      Recovery needs to know where the log starts -- the earliest valid xaction
        So update the pointer in the header block to point past the freed xaction
      Clear the freed xaction's COMMIT flag
    Now we can re-use the freed xaction's log space
  What does recovery do?
    The header block says where to start looking at the log
    Look at each xaction in turn, wrapping around
    Stop if the COMMIT flag is not set
    Replay the logged blocks
  Must recovery examine log records in commit order?
    E.g. why not replay in reverse order?
  Suppose a transaction in the middle of the log is damaged (disk block goes bad)
    E.g., TID 10 OK, TID 11 damaged, TID 12 OK
    Is it OK to replay 10 and 12 but skip 11?
  Problem: what if a data block looks just like a log Start block?
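A toy Python sketch of Design 2's recovery scan and log-space reuse, again my
own illustration: log records are Python tuples of the form
("COMMIT", tid, [(bn, data), ...]), the log area is a fixed-size list, and
disk/cache are the dicts from the previous sketch.

  LOG_LEN = 64               # number of record slots in the on-disk log area
  log = [None] * LOG_LEN     # stand-in for the on-disk log
  HEADER = 2000              # disk block holding the index of the oldest record

  def recover():
      i = disk.get(HEADER, 0)     # header block: where the valid log starts
      for _ in range(LOG_LEN):    # scan forward, wrapping around
          rec = log[i]
          if rec is None or rec[0] != "COMMIT":
              break               # stop at the first uncommitted xaction
          _, tid, writes = rec
          for bn, data in writes:
              disk[bn] = data     # replay; commit order means later xactions win
          i = (i + 1) % LOG_LEN

  def free_oldest():
      # called when the log is full, to reclaim the oldest xaction's space
      i = disk.get(HEADER, 0)
      _, tid, writes = log[i]
      for bn, data in writes:
          # force the data to the on-disk FS before the log entry goes away
          # (the notes add: write the cached copy only if no later xaction has
          # modified this block -- a check this toy omits)
          disk[bn] = cache.get(bn, data)
      disk[HEADER] = (i + 1) % LOG_LEN  # point the header past the freed xaction
      log[i] = None                     # clear its COMMIT flag / free the slot

Replaying in commit order matters when two transactions in the log wrote the
same block: the later write must win, which the forward scan gets for free.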
What are the defects of Design 2?
  Slow: each log write (or transaction) takes one disk revolution
    7200 RPM = 120 disk writes per second
    So < 30 3-write transactions/second -- not very fast!
  If we could somehow do huge sequential writes to the log:
    50 MB/sec = 5,000 10-kbyte transactions/second

Idea: delay log writes
  END_TRANSACTION appends to an in-memory log
  Force the in-memory log to disk every second or so
    One big sequential write, which includes many transactions
  Called "group commit"
  Note: now completed system calls can disappear after a crash!

Design 3: group commit
  Same log format as for Design 2
    Except the transaction's start block must hold a checksum of all its data
      In case the big disk write is interrupted
  BEGIN_TRANSACTION same as in Design 1/2
  When a transaction writes a block:
    Modify it in the cache, append a copy to the in-memory log
  What does END_TRANSACTION do?
    Compute the checksum of the transaction's blocks, put it in the in-memory start block
    Set the COMMIT flag in the start block in the in-memory log
  When to force data from the in-memory log to the on-disk log?
    At any time -- but only whole transactions
  When to write data from the cache to the on-disk FS?
    After the transaction commits
    After the transaction has been forced from the in-memory log to the on-disk log
  Note there are three log regions:
    1. data guaranteed written to the on-disk FS
    2. data that might be written to the on-disk FS
    3. data that can *not* have been written to the on-disk FS
    Part 1 can be freed
    Part 2 is on disk but cannot be freed
    Part 3 is either not COMMITted or only in the in-memory log
  Potential problem:
    We're getting performance by doing huge sequential log writes to disk
    A huge write may fail in the middle
      The disk may not write in the order we ask it to
    So, during recovery, any subset of the tail end of the log may be missing
      Blocks will be present on disk but contain old data/start blocks
    The checksum in the start blocks ensures each transaction is complete
    Recovery must stop at the first damaged/missing/uncommitted xaction

Are there defects in Design 3?
  Each FS modification turns into a whole block appended to the log
    Create many files in a directory -> the directory block is logged over and over
  Could group many operations into one transaction
    Just log the block once, after the full set of operations
  Could log a short description of the operation, rather than the whole block
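A toy Python sketch of Design 3's group commit and the checksum check at
recovery. Again invented for illustration: zlib.crc32 stands in for whatever
checksum a real log would use, data items are assumed to be bytes, and disk
is the dict from the first sketch.

  import zlib

  mem_log = []               # in-memory log: committed but not yet on disk
  disk_log = []              # stand-in for the on-disk log area

  def end_transaction(tid, writes):
      # writes = list of (bn, data) pairs
      csum = zlib.crc32(b"".join(d for _, d in writes))
      mem_log.append(("COMMIT", tid, csum, writes))
      # the system call can return now, but this transaction is still lost
      # if we crash before the next force()

  def force():
      # one big sequential write: move whole transactions to the on-disk log
      while mem_log:
          disk_log.append(mem_log.pop(0))

  def recover():
      for rec in disk_log:
          if rec[0] != "COMMIT":
              break                   # uncommitted record: stop
          _, tid, csum, writes = rec
          if zlib.crc32(b"".join(d for _, d in writes)) != csum:
              break                   # torn tail of a big log write: stop here
          for bn, data in writes:
              disk[bn] = data         # replay into the on-disk FS

The checksum is what lets recovery stop at the first damaged, missing, or
uncommitted transaction, so the post-recovery state is still a prefix of the
committed transactions even if the disk reordered parts of a big log write.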
Review: why does write-ahead logging provide atomic transactions?
  Don't write the on-disk FS until the log holds the transaction's full set of writes
  Then a crash midway through writing the on-disk FS can be completed
    by replay from the log during recovery

Review: why is logging fast?
  Can write just to the in-memory cache, and delay writes to the on-disk FS
    Since the data can be recovered from the log
    May benefit from write absorption
  Log appends are fast
    No seek if the log was recently written
    Big sequential writes, not random writes
    A single seek/rotate per log force

Paper question (last para in Section 4):
  Why doesn't FSD put file content writes in the log?
  The synchronous content write means the content goes to disk before the log
    Why is that OK?
    What if crash after the content write, but before forcing the log to disk?
  Would it be OK for the content to go to disk after the log is forced?
    I.e. to write the content just into the cache?
    What if a delete frees a block, a create allocates it, then a write, then a crash --
      but the delete hadn't been forced to the log on disk?
    Then the deleted file will exist -- but we wrote one of its blocks!
    See the end of 5.5...

-------

Example: Cedar
  "Reimplementing the Cedar File System Using Logging and Group Commit"
  Robert Hagmann
  SOSP 1987

What are FSD's main on-disk data structures?
  name table
    b-tree indexed by name
    entry: block list
  VAM -- bitmap indicating which blocks are free
  file content blocks
  leader block per file (ignore)

What were they worried about?
  Crash during a complex b-tree operation (e.g. splitting during insert)
  Recovery speed -- no whole-disk scans
  Maybe speed of ordinary operation
  Disk sector corruption

What happens when you create a file in FSD?
  1. Get two free pages for the leader and data from the VAM.
  2. Update the name table b-tree in memory.
  3. Write leader+data to disk synchronously.
  4. Append the modified name table pages to the log in memory.

When is the in-memory log forced to the disk?
  Group commit...
  Every few seconds.
  Why not log immediately, after every operation?
  Is there a down-side to group commit?

What does FSD's on-disk log look like?
  A fixed area of the disk, not a file; no free list for log blocks.
  Proper entry:
    hdr, blank, hdr copy, data pages..., end, copies of data..., end copy.

What if FSD runs out of log space during normal operation?
  How can we be sure it's safe to start re-using the log space on disk?
  What if crash while writing blocks during the switch to a new 1/3?
  Suppose a block is modified in one 1/3, and modified again in a later 1/3.
    Does it have to be written to disk when the first 1/3 is re-used?
    Is it even *allowed* to be written to disk when the first 1/3 is re-used?
    What if the 2nd transaction to write the block hasn't committed?

What happens during recovery after crash+reboot?
  How does the recovery s/w find the start of the log?
    On-disk ptr to the start of the oldest third.
    Written when the previous third was re-used.
  How does recovery find the last log record?
    The hdr/end blocks must have a time-stamp or something.
  What does recovery actually do with the log records?
  Can it just start replay at the beginning?
    I.e. what if it crashed during a previous replay -- is the 2nd time OK?
  Does recovery work if crash during a log write?
    Can recovery reliably decide if there's an "end" block?
    If the "end" block is present, are all data blocks guaranteed to be good?
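A guess, as Python, at how recovery might decide whether an FSD log entry is
complete. This is purely illustrative: the dict representation, the field
names, and the use of a matching time-stamp are my assumptions based on the
"hdr ... end copy" layout and the "time-stamp or something" comment above,
not the paper's actual format.

  def entry_complete(hdr, hdr_copy, end, end_copy):
      # each argument is a dict such as {"timestamp": ..., "nblocks": ...},
      # or None if that block is blank/unreadable
      if hdr is None or hdr_copy is None or hdr != hdr_copy:
          return False     # the header itself may be torn or corrupted
      if end is None or end_copy is None or end != end_copy:
          return False     # crash hit before the end blocks reached the disk
      # matching time-stamps tie the end blocks to *this* header,
      # not to a stale entry left over from an earlier use of these blocks
      return hdr["timestamp"] == end["timestamp"]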
What about recovering the free list (VAM)?
  What if the on-disk VAM was incorrect at the time of the crash?
  Does FSD log VAM modifications?
  Why is it OK not to log them?

Why don't they log data writes?
  When do they write the data?
  Does that break the prefix rule?
    May have writes from "the future", after the recovery point.
  I think FSD never updates a file in place.
    Overwriting a file always creates a new version and preserves the old one.
    What about delete, create, write, crash?
      W/ recovery point before the delete?

Performance Evaluation
  The big story is crash recovery: 25 sec instead of an hour.
  Why is FSD faster than CFS?
    Table 3: why does FSD do 1/6th as many I/Os for small creates?
    Why 149 I/Os for 100 files?
  Did they evaluate whether they achieved their robustness goals?