6.824 2002 Lecture 11: Logging

What's the overall topic?
  Atomic updates of complex data w.r.t. failures.
  Today just a single system, we'll be seeing distributed versions later.

Terminology
  data buffer
  data storage
  log buffer
  log storage
  dirty vs clean
  sync write vs async

FFS sync create
  1. update i-free list
  2. initialize i-node
  3. update directory block
  sync writes correct w.r.t. failure
  but very slow.
  and requires slow recovery: restore i-free list.
  but fsck recovery works correctly!

FFS rename
  editor could use rename from temp file for careful update
  echo a > d1/f1
  echo b > d2/f2
  mv d2/f2 d1/f1
  need to update two directories, stored in two blocks on disk.
  remove then add? add then remove?
  probably want add then remove
  what if a crash?
  what does fsck do?
    it knows something is wrong, since link count is 1, but two links.
    can't roll back -- which one to delete?
    has to just increase the link count.
    this is *not* a legal result of rename!
    but at least we haven't lost the file.
  so FFS is slow *and* it doesn't get semantics right.

XFS needs even more simultaneous updates
  create: dir block, new inode, i-free B+-tree, i-free counts
  can't just let fsck clean up mess (i.e. free lists)
  XFS wants no fsck at all!
  so can't even play FFS/fsck rename trick

What are the reasons to use logging?
  atomic commit of compound operations. w.r.t. crashes.
  fast recovery (unlike fsck).
  correct recovery: serial prefix of operations.
  can be applied to almost any existing data structure
    not just file systems. mostly used in databases, e.g. for banks.
  very useful to coordinate updates to distributed data structures
    need to be very systematic about order of updates,
    which parts of which updates have completed.
    in case of partial failure.
  very efficient representation

Transactions
  re-organize code to mark start/end of group of atomic operations
  create()
    begin_transaction
      update free list
      update i-node
      update directory entry
    end_transaction
  writes go to buffer cache via logging system

simple log (*not* write-ahead! no rule!)
  keep a "log" of updates
  for now, log lives on its own infinite disk
  note we include records from uncommitted xactions in the log
  recovery
    discard DB
    replay from start of log
    don't recover transactions that didn't commit
  why can't we use any of DB's contents during recovery?
    don't know if a block is from an uncommitted xaction
  so what have we achieved?
    atomic update of complex data structures
    slow recovery
    fast operation
  real persistent store is the log
    on-disk DB just a sort of cache

careful disk writing
  where does the log live? tape? dedicated disk? special area on disk?
  how do we know where the end of the log is?
    "log anchor"
    two copies, ping-pong
  what if a multi-sector disk write to log is interrupted by crash?
    can't guarantee sectors written in order
    so each update probably protected by checksum
    update anchor after finishing a full append
  probably don't want partial-sector updates

why is logging fast?
  group commit -- batched log writes.
    could delay flushing log -- may lose committed transactions
    but at least you have a prefix.
  single write (or less) to implement a transaction.
  no seek if dedicated log
  write-behind of data allows batched / scheduled writes.
    one buffer may reflect many transactions.
    e.g. create many files in a directory.
  don't have to be so careful since the log is the real information

re-do with checkpoint
  this is probably how SGI XFS (and Echo) log
  allows much faster recovery: can use on-disk DB
  write-ahead rule
    only write committed updates (i.e. commit record on disk)
    so keep updates of uncommitted xactions in buffers (not disk)
    so no un-committed data on disk.
    but disk may be missing some committed data
    recovery may need to replay committed data from the log
  how does recovery know what to re-do out of the log?
    remember a checkpoint pointer as well as anchor. non-volatile.
    points to first log record for which write isn't flushed to disk
    in background, flush earliest write, advance checkpoint pointer
    of course, cannot advance if an open xaction
    on recovery, re-play committed updates from checkpoint onward
    can free log space before checkpoint!
  problem: uncommitted transactions sit in main-memory buffers.

un-do/re-do with checkpoint
  what if you don't want to wait for commits?
    write uncommitted data to disk
    need to be able to un-do it in recovery
    so include old value in each log record
  now we can write a buffer whenever we want to
    as long as log entry already on disk
  checkpoint: all buffers flushed up to this point
    no need to re-do before this point
    but may need to un-do before this point
  tail: start of first uncommitted transaction
    no need to un-do before this point
    so can free before this point
  no main-memory buffers for data between tail and checkpoint.
    so uncommitted xactions don't consume memory
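
The un-do/re-do recovery described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how XFS or any real system implements it. The names Update, Commit, and recover are hypothetical. The "disk" is a plain dict and the log is an in-memory list. Checkpoint and tail are list indices into the retained log. The sketch also assumes that a transaction holds a lock on each key it writes until it commits, so restoring old values in reverse log order is safe.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass(frozen=True)
class Update:
    """Hypothetical log record: one write, carrying old and new values."""
    txid: int
    key: str
    old: Any   # value before the write (needed for un-do)
    new: Any   # value after the write (needed for re-do)


@dataclass(frozen=True)
class Commit:
    """Hypothetical log record: transaction txid committed."""
    txid: int


LogRecord = Union[Update, Commit]


def recover(disk: Dict[str, Any], log: List[LogRecord],
            checkpoint: int = 0, tail: int = 0) -> Dict[str, Any]:
    """Bring `disk` to a state reflecting exactly the committed transactions.

    checkpoint: index before which all updates are assumed flushed to disk,
                so re-do can start here.
    tail:       index of the start of the first uncommitted transaction,
                so un-do never needs to look before it.
    """
    # A transaction counts as committed only if its commit record is in the log.
    committed = {r.txid for r in log if isinstance(r, Commit)}

    # Re-do pass: forward from the checkpoint, reapply committed writes
    # that may not have reached the disk before the crash.
    for r in log[checkpoint:]:
        if isinstance(r, Update) and r.txid in committed:
            disk[r.key] = r.new

    # Un-do pass: backward to the tail, restore old values for writes of
    # uncommitted transactions that may have reached the disk.
    for r in reversed(log[tail:]):
        if isinstance(r, Update) and r.txid not in committed:
            disk[r.key] = r.old

    return disk


# Example: T1 committed but its write never reached disk;
# T2 did not commit but its write was flushed early.
example_log: List[LogRecord] = [
    Update(txid=1, key="x", old=0, new=1),
    Update(txid=2, key="y", old=0, new=5),
    Commit(txid=1),
]
disk_at_crash = {"x": 0, "y": 5}
print(recover(dict(disk_at_crash), example_log))  # {'x': 1, 'y': 0}
```

Because each record stores absolute old and new values, running recover a second time (say, after a crash during recovery) gives the same result. A log of relative operations such as "add 1" would not have that property without extra bookkeeping.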