10/18
--- File systems: disk management

- Review of file representation
  - inodes
  - block numbers
  - directories: files with a particular format

- Simple file system:
  - assume a simple disk layout: inodes at the beginning, the rest data
  - no cache

- Reliability: ordering of writes
  - How would create("f") in directory "d" be implemented?
    1. allocate and write data block for f
       (perhaps write modified free map)
       (perhaps initialize newly allocated block)
    2. allocate and write inode for f
       (perhaps write modified free map)
    3. read inode for "d" (if no cache)
    4. read data for "d" (if no cache)
    5. update data block for "d"
    6. update inode of "d"
  - The order of these 4 writes (steps 1, 2, 5, and 6) matters.
  - Ex.: if the file system crashes after performing steps 2-6 but before step 1,
    we have an incorrect file system after recovery: the directory entry and
    inode for f refer to a data block that was never written.
    (Sketch 1 at the end of these notes walks through this ordering.)

- Performance
  - Consider the following fragment:

        char sector[512];
        int fd, i;
        fd = open("x", O_WRONLY | O_CREAT, 0666);
        for (i = 0; i < 1000; i++) {
            write(fd, sector, 512);
        }
        close(fd);

  - The following operations happen per iteration:
    1. allocate block (perhaps write modified free map)
    2. write block
    3. a. read inode (if we don't have a cache)
       b. read indirect block (if we don't have a cache)
    4. update inode
  - Do we have to be careful about order? Yes.
  - What is the performance of this loop?
    - worst case: one seek + 1/2 rotation + 512/BW per sector
    - best case: one seek for the whole loop + 1/2 rotation per sector + 512/BW per sector
    - seek: ~10 msec; rotation: ~2 msec; BW: 50 Mbyte/sec
    - plugging in: 512/BW ~ 0.01 msec, so worst case ~ 12 msec/sector (~12 sec for
      1,000 sectors, ~43 Kbyte/sec) and best case ~ 2 msec/sector (~2 sec,
      ~250 Kbyte/sec)---both far below the 50 Mbyte/sec the disk can stream.
  - What is the performance of creating 1,000 files? Terrible: each create is
    several small, scattered, ordered writes.

- Goal: achieve high performance *and* reliability

  Design approach:
  - Reads can be made fast by exploiting a big main-memory cache.
    Result: read ops can happen at the speed of main memory.
  - Writes can be made fast by performing them in the main-memory cache and
    asynchronously updating the disk.
    Result:
    - write ops can happen at the speed of main memory (until main memory is
      filled with dirty blocks)
    - a queue of dirty blocks to be written also allows us to achieve high disk
      bandwidth (by handing them to the disk together so the disk can schedule
      them efficiently---e.g., using an elevator algorithm)
  - But: LOST RELIABILITY
    - applications expect that the completion of some disk operation (e.g., close)
      means the data is on the disk and recoverable
    - writes performed out of order could make data unrecoverable

- Approaches to making disk writes asynchronous:
  1. Forget reliability (Linux)
  2. Synchronously write metadata (inodes, etc.) and asynchronously write data
     blocks (VMS, DOS, FFS, etc.)
  3. Use NVRAM (disadv: cannot remove the disk from a broken machine)
  4. Atomic updates (group a set of dependent writes into an atomic operation)
     with write-ahead logging (journaling FS, LFS, XFS, etc.)
     - each change to metadata first appends an entry to the log (asynchronously)
     - commit ops wait until the writes to the log are stable
     (Sketch 2 at the end of these notes illustrates the idea.)
  5. Scheduler-enforced ordering (pass dependencies to the disk driver)
  6. The cache write-back code enforces inter-buffer dependencies.
     Problem: many circular dependencies, e.g., between an inode block I and a
     directory block D after creating a file A and removing a file B in the same
     directory, where A's and B's inodes live in the same inode block:

         inode block I        dir block D
           inode 4: -           add entry <A, 4>
           inode 5: B           remove entry <B, 5>

     - if we add 4 to D, then D is dependent on I (inode 4 must be on disk
       before the entry that points to it)
     - if we delete 5 from D, then I is dependent on D (the entry for B must be
       gone from disk before inode 5 is marked free)
  7. Soft updates
     - fine-grained in-memory log of updates to break the dependencies from 6
     - after the main-memory updates to I and D that add 4 and delete 5:

           inode block I        dir block D
             inode 4: A           entry <A, 4> added
             inode 5: -           entry <B, 5> removed

     - writing I and D:
         undo adding 4 to D (in main memory)
         write D
         redo adding 4 to D (in main memory)
         write I (inode 4 is now initialized on disk)
         write D
     - Adv: no on-disk log or transaction machinery
     (Sketch 3 at the end of these notes shows this undo/redo sequence.)
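
Sketch 1 (ordering of writes for create). A minimal, runnable sketch of the six
steps above, simulated on an in-memory array standing in for the disk. The
layout and helper names (write_block, write_inode, the block/inode numbers) are
invented for illustration, not taken from any real file system; the point is
only the order of the four writes.

    #include <stdio.h>
    #include <string.h>

    #define BLKSZ   512
    #define NBLOCKS 64

    static char disk[NBLOCKS][BLKSZ];       /* stand-in for the real disk        */
    static int  next_free_block = 8;        /* blocks 0..7: free map + inodes    */
    static int  next_free_inode = 2;        /* inode 1 is the directory "d"      */

    struct dinode { int used; int blk; };   /* toy on-disk inode                 */

    static void write_block(int b, const void *data)   /* "synchronous" write    */
    {
        memcpy(disk[b], data, BLKSZ);
        printf("disk write: block %d\n", b);
    }

    static void write_inode(int inum, const struct dinode *ip)
    {
        memcpy(&disk[1][inum * sizeof *ip], ip, sizeof *ip);
        printf("disk write: inode %d (inode block 1)\n", inum);
    }

    int main(void)
    {
        char zero[BLKSZ] = {0}, ddata[BLKSZ];
        struct dinode f = {0, 0}, d = {1, 7};  /* "d" exists; its data is blk 7  */

        int fblk = next_free_block++;       /* allocate data block for f         */
        write_block(fblk, zero);            /* 1. f's data block (initialized)   */

        int finum = next_free_inode++;      /* allocate inode for f              */
        f.used = 1; f.blk = fblk;
        write_inode(finum, &f);             /* 2. f's inode                      */

        memcpy(ddata, disk[d.blk], BLKSZ);  /* 3./4. read d's inode and data     */
        snprintf(ddata, BLKSZ, "f -> inode %d", finum);  /* add entry for f      */

        write_block(d.blk, ddata);          /* 5. d's data block                 */
        write_inode(1, &d);                 /* 6. d's inode (size/mtime change)  */
        return 0;
    }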
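
Sketch 2 (write-ahead logging, approach 4). A rough sketch of the log-then-update
rule using ordinary POSIX calls on two files standing in for the log and the
metadata region; the file names (fs.log, fs.meta) and the record format are made
up. The essential point is that the log record is forced to stable storage
(fsync) before the in-place metadata update may be issued.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int log_fd  = open("fs.log",  O_WRONLY | O_CREAT | O_APPEND, 0644);
        int meta_fd = open("fs.meta", O_WRONLY | O_CREAT, 0644);
        if (log_fd < 0 || meta_fd < 0) { perror("open"); return 1; }

        /* 1. append a redo record describing the metadata change */
        const char rec[] = "ALLOC inode 4 for file f; ADD <f,4> to dir d\n";
        write(log_fd, rec, sizeof rec - 1);

        /* 2. commit: wait until the log record is stable on disk */
        fsync(log_fd);

        /* 3. only now may the in-place metadata update go to disk, and it can
         *    be issued asynchronously; after a crash the record replays it    */
        const char upd[] = "inode 4: used; dir d: entry <f,4>\n";
        pwrite(meta_fd, upd, sizeof upd - 1, 0);

        close(log_fd);
        close(meta_fd);
        return 0;
    }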
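
Sketch 3 (soft updates, item 7). A toy illustration of the undo/redo sequence,
with invented structures for the inode block I and directory block D;
write_to_disk just prints what would be written, in the order soft updates would
issue it.

    #include <stdio.h>

    struct inode_blk { char inode4, inode5; };  /* 'A' = file A, '-' = free    */
    struct dir_blk   { int has_A, has_B; };     /* directory entries in D      */

    static void write_to_disk(const char *what) { printf("disk write: %s\n", what); }

    int main(void)
    {
        /* in-memory state after create(A) and remove(B)                       */
        struct inode_blk I = { .inode4 = 'A', .inode5 = '-' };
        struct dir_blk   D = { .has_A = 1,    .has_B = 0   };

        /* write D first: undo the part that depends on I (the entry for A)    */
        D.has_A = 0;                            /* undo adding 4 (in memory)   */
        write_to_disk("D (no entry for A, entry for B removed)");
        D.has_A = 1;                            /* redo adding 4 (in memory)   */

        /* I no longer depends on D (B's entry is gone from disk), so write it */
        write_to_disk("I (inode 4 initialized, inode 5 freed)");

        /* finally D again, now that inode 4 is safely on disk                 */
        write_to_disk("D (with entry for A)");

        (void)I; (void)D;
        return 0;
    }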