6.1810 2025 Lecture 17: File System

why are file systems useful?
  durability across restarts
  naming and organization
  sharing among programs and users
  device independence

why interesting?
  crash recovery
  performance
  security
  API generalizes: pipes, devices, /proc, /afs, Plan 9
    so FS-oriented apps work with many kinds of objects
  topic of two labs

API affects design / implementation. Example -- UNIX/Posix/Linux/xv6/&c:
  fd = open("x/y", -);
  write(fd, "abc", 3);
  link("x/y", "x/z");   -- now the file has two names!
  unlink("x/y");
  write(fd, "def", 3);
  close(fd);
  // file x/z contains abcdef

high-level choices visible in the UNIX FS API
  objects: files (vs virtual disk, DB)
  content: byte array (vs rows/columns, BTree)
  size: can append at any time; disk space not pre-allocated
  naming: human-readable (vs object IDs)
  organization: directory hierarchy
  synchronization: none (no locking, no versions)
    does not prevent applications from modifying concurrently!
  each of these decisions has been made differently in some other FSs!

a few subtle implications of the API:
  an fd refers to something that is preserved even if the file's name changes,
    or if the file is deleted while open!
  a file can have multiple "hard" links, i.e. occur in multiple directories
    no one of those occurrences is special
    so file info must be stored somewhere other than in a directory
  thus:
    the FS records file info in an "inode" on disk
    the FS refers to an inode by its i-number
    the inode must have a link count (tells us when to free)
    the inode must have a count of open FDs
    inode deallocation is deferred until the last link and FD are gone

let's talk about xv6's FS software layers
  system calls
  names | FDs
  inodes
  inode cache
  log
  buffer cache
  virtio_disk driver

data on disk
  storing data on a disk makes it persistent
    persistent = data preserved even when powered off
  many kinds of "disks":
    hard drives (big but slow, inexpensive)
    solid state drives (smaller, but fast, and more expensive)
  common interface: an array of sectors, often 512 bytes

hard disk drives (HDD)
  disk spins at a constant rate (e.g. 7200 RPM)
  concentric tracks
  each track is a sequence of sectors
  to read/write a block:
    seek the head to the right track
    wait for the sector to rotate under the head
  random access is slow (5 or 10 ms, due to seek+rotate)
  sequential access is much faster (100 MB/second)
    b/c bits are very dense on the disk
  ECC on each sector (Reed-Solomon, a few hundred bits)
  can only read/write whole sectors
    thus: sub-sector writes are slow (read-modify-write)

solid state drives (SSD)
  non-volatile "flash" memory
  random access: 100 microseconds
  sequential: 500 MB/second
  internally complex -- hidden, except sometimes in performance
    flash must be erased, in big blocks, before it's re-written
    limit to the number of times a flash block can be written
    SSD maintains an internal level of indirection -- remapped blocks

for both HDD and SSD:
  sequential access has much higher throughput than random
  big reads/writes are faster than small ones
  both of these influence FS design and performance

block size
  most o/s file systems use big blocks, e.g. 4 KB blocks = 8 sectors
    to reduce book-keeping and seek overheads
  xv6 uses 1024-byte blocks (2 sectors)

on-disk layout
  xv6 treats the disk as an array of blocks
  0: unused
  1: super block (size, ninodes)
  2: log for transactions (31 blocks)
  33: array of inodes, 64 bytes each, packed 16 per block
  46: block in-use bitmap (0=free, 1=used)
  47: file/dir content blocks...
  exact numbers depend on the # of inodes and # of blocks
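
To make the layout arithmetic concrete, here is a small C sketch using the
example block numbers above. The constants are illustrative only: real xv6
does not hard-code them, but reads the sizes and start blocks from the
superblock (described below).

  // Sketch of the example layout above (illustrative constants only --
  // real xv6 gets these numbers from the superblock, not from #defines).
  #define BSIZE      1024            // block size
  #define LOGSTART      2            // 31 log blocks: 2..32
  #define INODESTART   33            // on-disk inodes start here
  #define BMAPSTART    46            // free-block bitmap
  #define DATASTART    47            // file/dir content blocks

  #define INODESZ      64            // bytes per on-disk inode
  #define IPB          (BSIZE / INODESZ)   // 16 inodes per block

  // Which disk block holds inode inum, and at what offset within it?
  // Equivalent to the "33*BSIZE + 64*inum" byte-address formula used below.
  static inline unsigned iblock(unsigned inum)  { return INODESTART + inum / IPB; }
  static inline unsigned ioffset(unsigned inum) { return (inum % IPB) * INODESZ; }
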
"meta-data"
  everything on disk other than file content
  super block, i-nodes, bitmap, directory content

xv6's mkfs program generates this layout for an empty file system
  the layout is static for the file system's lifetime
  mkfs puts the sizes &c that the kernel needs into the superblock

superblock [kernel/fs.h]
  sizes and locations of the log, inode, bitmap, and content areas
  the kernel reads this when it boots

on-disk inode [kernel/fs.h]
  type (free, file, directory, device)
  nlink
  size
  addrs[12+1]
    direct and indirect blocks
    the indirect block holds 256 block #s
    so max file size is 12+256 blocks, or 268 kbytes
  example: how to find the disk block # of a file's byte 8000?
    logical block 7 = 8000 / BSIZE (=1024)
    inode.addrs[7] holds the disk block #

each i-node has an i-number
  easy to find an i-node on disk given its i-number
  an inode is 64 bytes long
  byte address on disk: 33*BSIZE + 64*inum

directory contents
  a directory is implemented as an i-node with type T_DIR
    shares lots of implementation with files
  content is an array of dirents
  dirent: [kernel/fs.h]
    inum
    14-byte file name
  a dirent is free if its inum is zero
  applications can read a directory to find file names
  applications cannot directly write directories
  the root directory (/) is i-number 1, created by mkfs
    so xv6 knows where to find it

you should view the FS as an on-disk data structure
  [tree: root inode, dirents, inodes, blocks]
  with two allocation pools: inodes and blocks

the xv6 disk layout is low-performance on a hard drive
  suppose you grep (search) for something in a directory full of files
  lots of long seeks between i-nodes and file content
  there are better layouts
    e.g. allocate a file's blocks near each other
    e.g. spread the i-nodes out over the disk, near content

let's look at xv6 in action
  focus on disk writes
  illustrate the on-disk data structures via how they are updated

Q: how does xv6 create a file?

  rm fs.img && make qemu
  $ echo > x          // create
  bwrite 34 ialloc    // allocate an inode in the 2nd inode block
  bwrite 34 iupdate   // update the inode (e.g., set nlink)
  bwrite 47 writei    // write a directory entry, adding "x", by dirlink()
  bwrite 33 iupdate   // update the directory's inode, because it may have changed
  bwrite 34 iupdate   // itrunc the new inode (even though nothing changed)

  First, what is in those blocks? 34? 47? 33?
  Second, what does the code look like?

  call graph:
    sys_open        sysfile.c
      create        sysfile.c
        ialloc      fs.c
        iupdate     fs.c
        dirlink     fs.c
          writei    fs.c
            iupdate fs.c
      itrunc        fs.c
        iupdate     fs.c

Q: how does xv6 write data to an existing file?

  $ echo a > x
  bwrite 34 iupdate   // open / itrunc
  // write()
  bwrite 46 balloc    // allocate a data block (mark it in the bitmap block)
  bwrite 914 bzero    // zero the allocated block
  bwrite 914 writei   // write to it ("a")
  bwrite 34 iupdate   // update the inode (length and new block)
  // write()
  bwrite 914 writei   // write to it ("\n")
  bwrite 34 iupdate   // update the inode

  First, what is in those blocks? 34? 46? 914?
  Second, what does the code look like?
    the first iupdate() is open()'s truncate for "> x"
    then sys_write...

  call graph:
    sys_write       sysfile.c
      filewrite     file.c
        writei      fs.c
          bmap
            balloc
              bzero
          iupdate

  Q: why the bzero?

Q: how does xv6 delete a file?

  $ rm x
  bwrite 47 writei    // from sys_unlink; directory content
  bwrite 33 iupdate   // from writei of the directory content
  bwrite 34 iupdate   // from sys_unlink; link count of the file
  bwrite 46 bfree     // from itrunc, from iput
  bwrite 34 iupdate   // from itrunc; zeroed length
  bwrite 34 iupdate   // from iput; marked free

  First, what is in those blocks? 47? 33? 34? 46?
  Second, what does the code look like?

  call graph:
    sys_unlink
      writei
      iupdate
      iunlockput
        iput
          itrunc
            bfree
            iupdate
          iupdate
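
The write trace above allocated block 914 and found it again on the second
write() via bmap(), which maps a file's logical block number to a disk block
using the 12 direct + 1 indirect addrs[] scheme described earlier. Below is a
simplified, read-only sketch of that idea; the real bmap() in kernel/fs.c also
allocates missing blocks with balloc() and goes through the buffer cache, and
read_block() here is a hypothetical stand-in for bread()/brelse().

  #define NDIRECT   12
  #define NINDIRECT 256                 // BSIZE / sizeof(uint) = 1024 / 4
  typedef unsigned int uint;

  struct dinode_sketch {
    uint addrs[NDIRECT + 1];            // 12 direct blocks + 1 indirect block
  };

  // Hypothetical helper: read one 1024-byte disk block into buf.
  void read_block(uint blockno, uint buf[NINDIRECT]);

  // Return the disk block number holding logical block bn of the file,
  // or 0 if that part of the file has no block yet.
  uint
  bmap_sketch(struct dinode_sketch *ip, uint bn)
  {
    if(bn < NDIRECT)
      return ip->addrs[bn];             // e.g. byte 8000 -> bn 7 -> addrs[7]
    bn -= NDIRECT;
    if(bn < NINDIRECT){
      uint indirect[NINDIRECT];
      if(ip->addrs[NDIRECT] == 0)
        return 0;                       // no indirect block allocated yet
      read_block(ip->addrs[NDIRECT], indirect);
      return indirect[bn];
    }
    return 0;                           // beyond the 268-block maximum
  }
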
What about caching, since the disk is so slow?
  xv6's FS caches at two levels:
    a cache of disk blocks in bio.c
    a cache of active i-nodes in fs.c

Let's look at the block cache in bio.c
  the block cache holds just a few recently-used blocks
  bcache at the start of bio.c

FS calls bread, which calls bget
  bget looks to see if the block is already cached
    if present, lock (may wait) and return the block
      may wait on the sleep-lock until the process currently using the buffer releases it
      a sleep-lock puts the caller to sleep if the buffer is already locked
    if not present, re-use an existing buffer
      b->refcnt++ prevents the buf from being recycled while we're waiting
  invariant: at most one copy of a disk block in memory

Two levels of locking here
  bcache.lock protects the description of what's in the cache
  b->lock protects just the one buffer

Q: what is the block cache replacement policy?
  prev ... head ... next
  bget re-uses bcache.head.prev -- the "tail"
  brelse moves the block to bcache.head.next

Q: is that the best replacement policy?

Q: why does it make sense to have a double copy of I/O?
  disk to buffer cache
  buffer cache to user space
  can we fix it to get better performance?

Q: how much RAM should we dedicate to disk buffers?

Why bget()'s acquiresleep(&b->lock)?
  i.e. why does bget() return a locked block?

What if there are concurrent calls to ialloc? will they get the same inode?
  no: ialloc holds bread's sleep-lock on the inode block while it scans and marks an inode
    (inode free/used status is in the inode itself, not in the bitmap)
  note bread / write / brelse in ialloc
    bread locks the block, perhaps waiting, and reads it from disk
    brelse unlocks the block
  (similar story for allocating data blocks from the bitmap, in balloc)

Pathname lookup
  traverses a pathname one element at a time
  potentially many blocks are involved:
    the inode of the top directory
    the data of the top directory
    the inode of the next directory down
    .. and so on ..
  each one of them might result in a cache miss
    disk accesses are expensive
  => allow parallel pathname lookup
    if one process blocks on the disk, another process may proceed with its lookup
  challenging: an unlink may happen concurrently with a lookup

Let's look at namex() (kernel/fs.c)
  (a sketch of its per-element loop appears at the end of these notes)
  ilock(): locks the current directory
  find the next directory's inode
  then unlock the current directory
    another process may unlink the next inode
    but the inode won't be deleted, because its refcnt > 0
  risk: next points to the same inode as current (lookup of ".")
    so unlock current before getting a lock on next
  key idea: getting a reference is separate from locking

Next week: what if the power fails mid-way through one of these operations?
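
Postscript: a minimal sketch of namex()'s per-element loop, to make the
locking discipline above concrete. It uses the real xv6 helper names from
kernel/fs.c but assumes the kernel's declarations (struct inode, DIRSIZ,
T_DIR, skipelem, ilock, iunlockput, dirlookup) and omits the nameiparent
case and some error handling.

  // Sketch of namex()'s loop (real code: kernel/fs.c).
  struct inode*
  namex_sketch(struct inode *ip, char *path)
  {
    char name[DIRSIZ];
    struct inode *next;

    while((path = skipelem(path, name)) != 0){
      ilock(ip);                       // lock + read in the current directory
      if(ip->type != T_DIR){
        iunlockput(ip);
        return 0;
      }
      next = dirlookup(ip, name, 0);   // iget()s next: reference held, NOT locked
      if(next == 0){
        iunlockput(ip);                // name not found
        return 0;
      }
      iunlockput(ip);                  // unlock current *before* locking next:
                                       // for ".", next == ip, so locking next
                                       // while holding ip's lock would deadlock
      ip = next;                       // next's refcnt keeps it from being freed
    }
    return ip;                         // referenced but unlocked
  }
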