6.828 2016 Lecture 12: File System

lecture plan:
  file systems
  API -> disk layout
  caching

why are file systems useful?
  durability across restarts
  naming and organization
  sharing among programs and users

why interesting?
  crash recovery
  performance
  API design for sharing
  security for sharing
  abstraction is useful: pipes, devices, /proc, /afs, Plan 9
    so FS-oriented apps work with many kinds of objects
  you will implement one for JOS

API example -- UNIX/Posix/Linux/xv6/&c:
  fd = open("x/y", -);
  write(fd, "abc", 3);
  link("x/y", "x/z");
  unlink("x/y");
  high-level choices visible in this API
    objects: files (vs virtual disk, DB)
    content: byte array (vs 80-byte records, BTree)
    naming: human-readable (vs object IDs)
    organization: name hierarchy
    synchronization: none (vs locking, versions)

a few implications of the API:
  fd refers to something
    that is preserved even if the file name changes
    or if the file is deleted while open!
  a file can have multiple links
    i.e. occur in multiple directories
    no one of those occurrences is special
    so a file must have info stored somewhere other than the directory
  thus:
    FS records file info in an "inode" on disk
    FS refers to an inode with an i-number (internal version of FD)
    inode must have a link count (tells us when to free)
    inode must have a count of open FDs
    inode deallocation deferred until last link and FD are gone

let's talk about xv6 FS software layers
  system calls
  name ops | FD ops
  inodes
  inode cache
  log
  buffer cache
  ide driver

IDE is a standard to talk to storage devices
  https://en.wikipedia.org/wiki/Parallel_ATA
  connectors, interface, protocol, etc.
  device has an IDE controller
  several popular kinds of IDE devices
    mechanical drives, hard disk drives (HDD)
    solid state drives (SSD)

hard disk drives
  concentric tracks
  each track is a sequence of sectors, usually 512 bytes
  ECC on each sector
  can only read/write whole sectors
  thus: sub-sector writes are expensive (read-modify-write)

solid state drives
  NAND flash non-volatile memory
  can write only to erased blocks
    cannot directly overwrite a block
    so it's good to avoid erasing blocks
  there is a limit to the number of times a block can be erased
    wear-leveling: techniques to distribute wear across the medium

performance characteristics
  SSDs are much faster than HDDs
    HDDs: 10s of milliseconds for a small random read/write
    SSDs: 10s of microseconds for a small read,
          100s of microseconds for a small write
    difference in throughput is smaller (3x?)
  big sequential writes have much higher throughput than random sector I/O
    both on SSDs and HDDs
  SSDs are still much slower than memory
    10s of microseconds is still many CPU cycles
  -> disk layout and batching are important

disk blocks
  most o/s use blocks of multiple sectors, e.g. 4 KB blocks = 8 sectors
    to reduce book-keeping and seek overheads
  xv6 uses single-sector blocks for simplicity

on-disk layout
  xv6 file system on 2nd IDE drive; first has just the kernel
  xv6 treats the IDE drive as an array of sectors, hides tracks
  0: unused
  1: super block (size, ninodes)
  2: log for transactions
  32: array of inodes, packed into blocks
  58: block in-use bitmap (0=free, 1=in use)
  59: file/dir content blocks
  end of disk

"meta-data"
  everything on disk other than file content
  super block, i-nodes, bitmap, directory content

on-disk inode
  type (free, file, directory, device)
  nlink
  size
  addrs[12+1]
  direct and indirect blocks (see the sketch below)
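To make this concrete, here is a C sketch of an xv6-style on-disk inode and
the offset-to-block-index arithmetic it implies. The struct and constants
(struct dinode, NDIRECT, BSIZE) follow xv6's fs.h; the helper function is an
illustration of the idea, not xv6's actual bmap.

  typedef unsigned int uint;                // as in xv6's types.h

  #define BSIZE   512                       // block size: one sector
  #define NDIRECT 12                        // direct block pointers per inode
  #define NINDIRECT (BSIZE / sizeof(uint))  // entries in the indirect block

  // on-disk inode (64 bytes), following xv6's struct dinode
  struct dinode {
    short type;               // free, file, directory, or device
    short major, minor;       // device numbers (device inodes only)
    short nlink;              // number of directory entries pointing here
    uint  size;               // file size in bytes
    uint  addrs[NDIRECT+1];   // 12 direct block numbers + 1 indirect block number
  };

  // which entry names the block holding byte `off`?
  // returns an index into addrs[] for small offsets, or an index into
  // the indirect block (the block named by addrs[NDIRECT]) otherwise
  uint block_index(uint off, int *indirect)
  {
    uint bn = off / BSIZE;    // logical block number within the file
    if (bn < NDIRECT) {
      *indirect = 0;
      return bn;              // disk block number is addrs[bn]
    }
    *indirect = 1;
    return bn - NDIRECT;      // entry within the indirect block
  }

For byte 8000, bn = 8000/512 = 15, which is >= NDIRECT, so the answer is
entry 15 - 12 = 3 of the indirect block, matching the example below.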
example: how to find file's byte 8000?
  logical block 15 = 8000 / 512
  the 4th entry (index 3) in the indirect block

each i-node has an i-number
  easy to turn an i-number into an inode
  inode is 64 bytes long
  byte address on disk: 32*512 + 64*inum
    (the inode array starts at block 32; see the sketch at the end of these notes)

directory contents
  directory much like a file
    but user can't directly write
  content is an array of dirents
  dirent:
    inum
    14-byte file name
  dirent is free if inum is zero
  (a C sketch of the dirent layout appears at the end of these notes)

you should view the FS as an on-disk data structure
  [tree: dirs, inodes, blocks]
  with two allocation pools: inodes and blocks

let's look at xv6 in action
  focus on disk writes
  illustrate on-disk data structures via how they are updated

Q: how does xv6 create a file?

rm fs.img
$ echo > a
write 34 ialloc (from create, sysfile.c; mark it non-free)
write 34 iupdate (from create; initialize nlink &c)
write 59 writei (from dirlink, fs.c, from create)

Q: what's in block 34?
  look at create() in sysfile.c

Q: why *two* writes to block 34?

Q: what is in block 59?

Q: what if there are concurrent calls to ialloc?
   will they get the same inode?
  note bread / write / brelse in ialloc
  bread locks the block, perhaps waiting, and reads it from disk
  brelse unlocks the block

Q: how does xv6 write data to a file?

$ echo x > a
write 58 balloc (from bmap, from writei)
write 565 bzero
write 565 writei (from filewrite, file.c)
write 34 iupdate (from writei)
write 565 writei
write 34 iupdate

Q: what's in block 58?
  look at writei's call to bmap
  look at bmap's call to balloc

Q: what's in block 565?

Q: why the iupdate?
  file length and addrs[]

Q: why *two* writei+iupdate pairs?
   how many calls to write()?

Q: how does xv6 delete a file?

$ rm a
write 59 writei (from sys_unlink; directory content)
write 34 iupdate (from sys_unlink; link count of file)
write 58 bfree (from itrunc, from iput)
write 34 iupdate (from itrunc)
write 34 iupdate (from iput)

Q: what's in block 59?
  sys_unlink in sysfile.c

Q: what's in block 34?

Q: what's in block 58?
  look at iput

Q: why three iupdates?

Let's look at the block cache in bio.c
  the block cache holds just a few recently-used blocks

FS calls bread, which calls bget
bget looks to see if the block is already cached
  if present and not B_BUSY, return the block
  if present and B_BUSY, wait
  if not present, re-use an existing buffer

Q: why goto loop after sleep()?

Q: what is the block cache replacement policy?
  prev ... head ... next
  bget re-uses bcache.head.prev -- the "tail"
  brelse moves a block to bcache.head.next
  (a toy model of this list appears at the end of these notes)

Q: is that the best replacement policy?

Q: what if lots of processes need to read the disk?
   who goes first?
  iderw appends to the idequeue list
  ideintr calls idestart on the head of the idequeue list
  so FIFO

Q: is FIFO a good disk scheduling policy?
  priority to interactive programs?
  elevator sort?

Q: how fast can an xv6 application read big files?
  contiguous blocks?

Q: why does it make sense to have a double copy of I/O?
  disk to buffer cache
  buffer cache to user space
  can we fix it to get better performance?

Q: how much RAM should we dedicate to disk buffers?
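For reference, here is a minimal C sketch of the on-disk directory entry and
the i-number-to-disk-address arithmetic described above (dirents, and the
32*512 + 64*inum formula). struct dirent and DIRSIZ follow xv6's fs.h; the
helper function and the INODESTART constant are illustrative, tied to the
block-32 inode array in this particular layout.

  #include <stdint.h>
  #include <stdio.h>

  #define BSIZE      512        // block (sector) size
  #define DIRSIZ     14         // max file-name length in a dirent
  #define INODESTART 32         // first block of the inode array (this layout)
  #define INODESIZE  64         // bytes per on-disk inode

  // on-disk directory entry: a directory's content is an array of these;
  // an entry is free when inum == 0
  struct dirent {
    uint16_t inum;              // i-number of the file this name refers to
    char     name[DIRSIZ];      // file name, NUL-padded
  };

  // byte offset of inode `inum` on the disk: skip to the inode array,
  // then index by 64-byte inodes (illustrative helper, not part of xv6)
  uint32_t inode_disk_addr(uint32_t inum)
  {
    return INODESTART * BSIZE + INODESIZE * inum;
  }

  int main(void)
  {
    // e.g. inode 1, the root directory, lives at byte 32*512 + 64 = 16448
    printf("inode 1 at byte %u\n", (unsigned)inode_disk_addr(1));
    return 0;
  }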
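The replacement-policy question above is easier to reason about with a toy
model of the bcache list. The sketch below is a user-space approximation, not
xv6's bio.c: buffers sit on a doubly linked list headed by `head`; releasing
a buffer moves it to head.next (most recently used), and the miss path
recycles head.prev (least recently used), which is what makes the policy LRU.

  #include <stdio.h>

  struct buf {
    int sector;                  // which disk sector this buffer caches
    struct buf *prev, *next;     // doubly linked list through all buffers
  };

  static struct buf head;        // head.next = most recent, head.prev = least recent

  // put every buffer on the list, caching nothing yet
  static void cache_init(struct buf bufs[], int n) {
    head.prev = head.next = &head;
    for (int i = 0; i < n; i++) {
      bufs[i].sector = -1;
      bufs[i].next = head.next;  // insert at the front
      bufs[i].prev = &head;
      head.next->prev = &bufs[i];
      head.next = &bufs[i];
    }
  }

  // "brelse": a just-released buffer moves to the front of the list
  static void release_buf(struct buf *b) {
    b->next->prev = b->prev;     // unlink
    b->prev->next = b->next;
    b->next = head.next;         // relink at the front
    b->prev = &head;
    head.next->prev = b;
    head.next = b;
  }

  // "bget" miss path: recycle the tail, i.e. the least recently used buffer
  static struct buf *evict_victim(void) {
    return head.prev;
  }

  int main(void) {
    struct buf bufs[3];
    cache_init(bufs, 3);
    for (int i = 0; i < 3; i++)
      bufs[i].sector = 100 + i;  // pretend each buffer caches some sector
    release_buf(&bufs[0]);       // buffer 0 was just used: now most recent
    printf("victim caches sector %d\n", evict_victim()->sector);  // 101, not 100
    return 0;
  }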