10/18
--- File systems: disk management

- Review of file representation
  - inodes
  - block numbers
  - directories: files with a particular format

- Simple file system:
  - assume a simple disk layout: inodes at the beginning, the rest data
  - no cache

- Reliability: ordering of writes
  - How would create("f") in directory "d" be implemented?
    1. allocate and write data block for f
       (perhaps write modified free map)
       (perhaps initialize newly allocated block)
    2. allocate and write inode for f
       (perhaps write modified free map)
    3. read inode for "d" (if no cache)
    4. read data for "d" (if no cache)
    5. update data block for "d"
    6. update inode of "d"
  - The order of these 4 writes (steps 1, 2, 5, and 6) matters.
  - Ex.: if the file system crashes after performing steps 2-6 but before step 1,
    we have an incorrect file system after recovery: the directory entry and
    inode for f refer to a data block that was never written.
    (Sketch 1 at the end of these notes walks through this ordering.)

- Performance
  - Consider the following fragment:

        char sector[512];
        int fd, i;
        fd = open("x", O_WRONLY | O_CREAT, 0666);
        for (i = 0; i < 1000; i++) {
            write(fd, sector, 512);
        }
        close(fd);

  - The following operations happen per iteration:
    1. allocate block (perhaps write modified free map)
    2. write block
    3. a. read inode (if we don't have a cache)
       b. read indirect block (if we don't have a cache)
    4. update inode
  - Do we have to be careful about order? Yes.
  - What is the performance of this loop?
    - worst case: one seek + 1/2 rotation + 512/BW per sector
    - best case: one seek for the whole loop + 1/2 rotation per sector + 512/BW per sector
    - seek: ~10 msec; rotation: ~2 msec; BW: 50 Mbyte/sec
    - plugging in: 512/BW ~ 0.01 msec, so worst case ~ 12 msec/sector (~12 sec for
      1,000 sectors, ~43 Kbyte/sec) and best case ~ 2 msec/sector (~2 sec,
      ~250 Kbyte/sec)---both far below the 50 Mbyte/sec the disk can stream.
  - What is the performance of creating 1,000 files? Terrible: each create is
    several small, scattered, ordered writes.

- Goal: achieve high performance *and* reliability

  Design approach:
  - Reads can be made fast by exploiting a big main-memory cache.
    Result: read ops can happen at the speed of main memory.
  - Writes can be made fast by performing them in the main-memory cache and
    asynchronously updating the disk.
    Result:
    - write ops can happen at the speed of main memory (until main memory is
      filled with dirty blocks)
    - a queue of dirty blocks to be written also allows us to achieve high disk
      bandwidth (by handing them to the disk together so the disk can schedule
      them efficiently---e.g., using an elevator algorithm)
  - But: LOST RELIABILITY
    - applications expect that the completion of some disk operation (e.g., close)
      means the data is on the disk and recoverable
    - writes performed out of order could make data unrecoverable

- Approaches to making disk writes asynchronous:
  1. Forget reliability (Linux)
  2. Synchronously write metadata (inodes, etc.) and asynchronously write data
     blocks (VMS, DOS, FFS, etc.)
  3. Use NVRAM (disadv: cannot remove the disk from a broken machine)
  4. Atomic updates (group a set of dependent writes into an atomic operation)
     with write-ahead logging (journaling FS, LFS, XFS, etc.)
     - each change to metadata first appends an entry to the log (asynchronously)
     - commit ops wait until the writes to the log are stable
     (Sketch 2 at the end of these notes illustrates the idea.)
  5. Scheduler-enforced ordering (pass dependencies to the disk driver)
  6. The cache write-back code enforces inter-buffer dependencies.
     Problem: many circular dependencies, e.g., between an inode block I and a
     directory block D after creating a file A and removing a file B in the same
     directory, where A's and B's inodes live in the same inode block:

         inode block I        dir block D
           inode 4: -           add entry <A, 4>
           inode 5: B           remove entry <B, 5>

     - if we add 4 to D, then D is dependent on I (inode 4 must be on disk
       before the entry that points to it)
     - if we delete 5 from D, then I is dependent on D (the entry for B must be
       gone from disk before inode 5 is marked free)
  7. Soft updates
     - fine-grained in-memory log of updates to break the dependencies from 6
     - after the main-memory updates to I and D that add 4 and delete 5:

           inode block I        dir block D
             inode 4: A           entry <A, 4> added
             inode 5: -           entry <B, 5> removed

     - writing I and D:
         undo adding 4 to D (in main memory)
         write D
         redo adding 4 to D (in main memory)
         write I (inode 4 is now initialized on disk)
         write D
     - Adv: no on-disk log or transaction machinery
     (Sketch 3 at the end of these notes shows this undo/redo sequence.)
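
Sketch 1 (ordering of writes for create). A minimal, runnable sketch of the six
steps above, simulated on an in-memory array standing in for the disk. The
layout and helper names (write_block, write_inode, the block/inode numbers) are
invented for illustration, not taken from any real file system; the point is
only the order of the four writes.

    #include <stdio.h>
    #include <string.h>

    #define BLKSZ   512
    #define NBLOCKS 64

    static char disk[NBLOCKS][BLKSZ];       /* stand-in for the real disk        */
    static int  next_free_block = 8;        /* blocks 0..7: free map + inodes    */
    static int  next_free_inode = 2;        /* inode 1 is the directory "d"      */

    struct dinode { int used; int blk; };   /* toy on-disk inode                 */

    static void write_block(int b, const void *data)   /* "synchronous" write    */
    {
        memcpy(disk[b], data, BLKSZ);
        printf("disk write: block %d\n", b);
    }

    static void write_inode(int inum, const struct dinode *ip)
    {
        memcpy(&disk[1][inum * sizeof *ip], ip, sizeof *ip);
        printf("disk write: inode %d (inode block 1)\n", inum);
    }

    int main(void)
    {
        char zero[BLKSZ] = {0}, ddata[BLKSZ];
        struct dinode f = {0, 0}, d = {1, 7};  /* "d" exists; its data is blk 7  */

        int fblk = next_free_block++;       /* allocate data block for f         */
        write_block(fblk, zero);            /* 1. f's data block (initialized)   */

        int finum = next_free_inode++;      /* allocate inode for f              */
        f.used = 1; f.blk = fblk;
        write_inode(finum, &f);             /* 2. f's inode                      */

        memcpy(ddata, disk[d.blk], BLKSZ);  /* 3./4. read d's inode and data     */
        snprintf(ddata, BLKSZ, "f -> inode %d", finum);  /* add entry for f      */

        write_block(d.blk, ddata);          /* 5. d's data block                 */
        write_inode(1, &d);                 /* 6. d's inode (size/mtime change)  */
        return 0;
    }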
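
Sketch 2 (write-ahead logging, approach 4). A rough sketch of the log-then-update
rule using ordinary POSIX calls on two files standing in for the log and the
metadata region; the file names (fs.log, fs.meta) and the record format are made
up. The essential point is that the log record is forced to stable storage
(fsync) before the in-place metadata update may be issued.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int log_fd  = open("fs.log",  O_WRONLY | O_CREAT | O_APPEND, 0644);
        int meta_fd = open("fs.meta", O_WRONLY | O_CREAT, 0644);
        if (log_fd < 0 || meta_fd < 0) { perror("open"); return 1; }

        /* 1. append a redo record describing the metadata change */
        const char rec[] = "ALLOC inode 4 for file f; ADD <f,4> to dir d\n";
        write(log_fd, rec, sizeof rec - 1);

        /* 2. commit: wait until the log record is stable on disk */
        fsync(log_fd);

        /* 3. only now may the in-place metadata update go to disk, and it can
         *    be issued asynchronously; after a crash the record replays it    */
        const char upd[] = "inode 4: used; dir d: entry <f,4>\n";
        pwrite(meta_fd, upd, sizeof upd - 1, 0);

        close(log_fd);
        close(meta_fd);
        return 0;
    }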
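
Sketch 3 (soft updates, item 7). A toy illustration of the undo/redo sequence,
with invented structures for the inode block I and directory block D;
write_to_disk just prints what would be written, in the order soft updates would
issue it.

    #include <stdio.h>

    struct inode_blk { char inode4, inode5; };  /* 'A' = file A, '-' = free    */
    struct dir_blk   { int has_A, has_B; };     /* directory entries in D      */

    static void write_to_disk(const char *what) { printf("disk write: %s\n", what); }

    int main(void)
    {
        /* in-memory state after create(A) and remove(B)                       */
        struct inode_blk I = { .inode4 = 'A', .inode5 = '-' };
        struct dir_blk   D = { .has_A = 1,    .has_B = 0   };

        /* write D first: undo the part that depends on I (the entry for A)    */
        D.has_A = 0;                            /* undo adding 4 (in memory)   */
        write_to_disk("D (no entry for A, entry for B removed)");
        D.has_A = 1;                            /* redo adding 4 (in memory)   */

        /* I no longer depends on D (B's entry is gone from disk), so write it */
        write_to_disk("I (inode 4 initialized, inode 5 freed)");

        /* finally D again, now that inode 4 is safely on disk                 */
        write_to_disk("D (with entry for A)");

        (void)I; (void)D;
        return 0;
    }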