6.828 2016 Lecture 12: File System

lecture plan:
  file systems
  API -> disk layout
  caching

why are file systems useful?
  durability across restarts
  naming and organization
  sharing among programs and users

why interesting?
  crash recovery
  performance
  API design for sharing
  security for sharing
  abstraction is useful: pipes, devices, /proc, /afs, Plan 9
    so FS-oriented apps work with many kinds of objects
  you will implement one for JOS

API example -- UNIX/Posix/Linux/xv6/&c:
  fd = open("x/y", -);
  write(fd, "abc", 3);
  link("x/y", "x/z");
  unlink("x/y");
  high-level choices visible in this API
    objects: files (vs virtual disk, DB)
    content: byte array (vs 80-byte records, BTree)
    naming: human-readable (vs object IDs)
    organization: name hierarchy
    synchronization: none (vs locking, versions)

a few implications of the API:
  fd refers to something
    that is preserved even if the file name changes
    or if the file is deleted while open!
  a file can have multiple links
    i.e. occur in multiple directories
    no one of those occurrences is special
    so a file must have info stored somewhere other than the directory
  thus:
    FS records file info in an "inode" on disk
    FS refers to an inode with an i-number (internal version of FD)
    inode must have a link count (tells us when to free)
    inode must have a count of open FDs
    inode deallocation deferred until last link and FD are gone

let's talk about xv6 FS software layers
  system calls
  name ops | FD ops
  inodes
  inode cache
  log
  buffer cache
  ide driver

IDE is a standard to talk to storage devices
  https://en.wikipedia.org/wiki/Parallel_ATA
  connectors, interface, protocol, etc.
  device has an IDE controller
  several popular kinds of IDE devices
    mechanical drives, hard disk drives (HDD)
    solid state drives (SSD)

hard disk drives
  concentric tracks
  each track is a sequence of sectors, usually 512 bytes
  ECC on each sector
  can only read/write whole sectors
  thus: sub-sector writes are expensive (read-modify-write)

solid state drives
  NAND flash non-volatile memory
  can write only to erased blocks
    cannot directly overwrite a block
    so it's good to avoid erasing blocks
  there is a limit to the number of times a block can be erased
    wear-leveling: techniques to distribute wear across the medium

performance characteristics
  SSDs are much faster than HDDs
    HDDs: 10s of milliseconds for a small random read/write
    SSDs: 10s of microseconds for a small read,
          100s of microseconds for a small write
    difference in throughput is smaller (3x?)
  big sequential writes have much higher throughput than random sector I/O
    both on SSDs and HDDs
  SSDs are still much slower than memory
    10s of microseconds is still many CPU cycles
  -> disk layout and batching are important

disk blocks
  most o/s use blocks of multiple sectors, e.g. 4 KB blocks = 8 sectors
    to reduce book-keeping and seek overheads
  xv6 uses single-sector blocks for simplicity

on-disk layout
  xv6 file system on 2nd IDE drive; first has just the kernel
  xv6 treats the IDE drive as an array of sectors, hides tracks
  0: unused
  1: super block (size, ninodes)
  2: log for transactions
  32: array of inodes, packed into blocks
  58: block in-use bitmap (0=free, 1=in use)
  59: file/dir content blocks
  end of disk

"meta-data"
  everything on disk other than file content
  super block, i-nodes, bitmap, directory content

on-disk inode
  type (free, file, directory, device)
  nlink
  size
  addrs[12+1]
  direct and indirect blocks (see the sketch below)
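To make this concrete, here is a C sketch of an xv6-style on-disk inode and
the offset-to-block-index arithmetic it implies. The struct and constants
(struct dinode, NDIRECT, BSIZE) follow xv6's fs.h; the helper function is an
illustration of the idea, not xv6's actual bmap.

  typedef unsigned int uint;                // as in xv6's types.h

  #define BSIZE   512                       // block size: one sector
  #define NDIRECT 12                        // direct block pointers per inode
  #define NINDIRECT (BSIZE / sizeof(uint))  // entries in the indirect block

  // on-disk inode (64 bytes), following xv6's struct dinode
  struct dinode {
    short type;               // free, file, directory, or device
    short major, minor;       // device numbers (device inodes only)
    short nlink;              // number of directory entries pointing here
    uint  size;               // file size in bytes
    uint  addrs[NDIRECT+1];   // 12 direct block numbers + 1 indirect block number
  };

  // which entry names the block holding byte `off`?
  // returns an index into addrs[] for small offsets, or an index into
  // the indirect block (the block named by addrs[NDIRECT]) otherwise
  uint block_index(uint off, int *indirect)
  {
    uint bn = off / BSIZE;    // logical block number within the file
    if (bn < NDIRECT) {
      *indirect = 0;
      return bn;              // disk block number is addrs[bn]
    }
    *indirect = 1;
    return bn - NDIRECT;      // entry within the indirect block
  }

For byte 8000, bn = 8000/512 = 15, which is >= NDIRECT, so the answer is
entry 15 - 12 = 3 of the indirect block, matching the example below.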
example: how to find file's byte 8000?
  logical block 15 = 8000 / 512
  the 4th entry (index 3) in the indirect block

each i-node has an i-number
  easy to turn an i-number into an inode
  inode is 64 bytes long
  byte address on disk: 32*512 + 64*inum
    (the inode array starts at block 32; see the sketch at the end of these notes)

directory contents
  directory much like a file
    but user can't directly write
  content is an array of dirents
  dirent:
    inum
    14-byte file name
  dirent is free if inum is zero
  (a C sketch of the dirent layout appears at the end of these notes)

you should view the FS as an on-disk data structure
  [tree: dirs, inodes, blocks]
  with two allocation pools: inodes and blocks

let's look at xv6 in action
  focus on disk writes
  illustrate on-disk data structures via how they are updated

Q: how does xv6 create a file?

rm fs.img
$ echo > a
write 34 ialloc (from create, sysfile.c; mark it non-free)
write 34 iupdate (from create; initialize nlink &c)
write 59 writei (from dirlink, fs.c, from create)

Q: what's in block 34?
  look at create() in sysfile.c

Q: why *two* writes to block 34?

Q: what is in block 59?

Q: what if there are concurrent calls to ialloc?
   will they get the same inode?
  note bread / write / brelse in ialloc
  bread locks the block, perhaps waiting, and reads it from disk
  brelse unlocks the block

Q: how does xv6 write data to a file?

$ echo x > a
write 58 balloc (from bmap, from writei)
write 565 bzero
write 565 writei (from filewrite, file.c)
write 34 iupdate (from writei)
write 565 writei
write 34 iupdate

Q: what's in block 58?
  look at writei's call to bmap
  look at bmap's call to balloc

Q: what's in block 565?

Q: why the iupdate?
  file length and addrs[]

Q: why *two* writei+iupdate pairs?
   how many calls to write()?

Q: how does xv6 delete a file?

$ rm a
write 59 writei (from sys_unlink; directory content)
write 34 iupdate (from sys_unlink; link count of file)
write 58 bfree (from itrunc, from iput)
write 34 iupdate (from itrunc)
write 34 iupdate (from iput)

Q: what's in block 59?
  sys_unlink in sysfile.c

Q: what's in block 34?

Q: what's in block 58?
  look at iput

Q: why three iupdates?

Let's look at the block cache in bio.c
  the block cache holds just a few recently-used blocks

FS calls bread, which calls bget
bget looks to see if the block is already cached
  if present and not B_BUSY, return the block
  if present and B_BUSY, wait
  if not present, re-use an existing buffer

Q: why goto loop after sleep()?

Q: what is the block cache replacement policy?
  prev ... head ... next
  bget re-uses bcache.head.prev -- the "tail"
  brelse moves a block to bcache.head.next
  (a toy model of this list appears at the end of these notes)

Q: is that the best replacement policy?

Q: what if lots of processes need to read the disk?
   who goes first?
  iderw appends to the idequeue list
  ideintr calls idestart on the head of the idequeue list
  so FIFO

Q: is FIFO a good disk scheduling policy?
  priority to interactive programs?
  elevator sort?

Q: how fast can an xv6 application read big files?
  contiguous blocks?

Q: why does it make sense to have a double copy of I/O?
  disk to buffer cache
  buffer cache to user space
  can we fix it to get better performance?

Q: how much RAM should we dedicate to disk buffers?
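For reference, here is a minimal C sketch of the on-disk directory entry and
the i-number-to-disk-address arithmetic described above (dirents, and the
32*512 + 64*inum formula). struct dirent and DIRSIZ follow xv6's fs.h; the
helper function and the INODESTART constant are illustrative, tied to the
block-32 inode array in this particular layout.

  #include <stdint.h>
  #include <stdio.h>

  #define BSIZE      512        // block (sector) size
  #define DIRSIZ     14         // max file-name length in a dirent
  #define INODESTART 32         // first block of the inode array (this layout)
  #define INODESIZE  64         // bytes per on-disk inode

  // on-disk directory entry: a directory's content is an array of these;
  // an entry is free when inum == 0
  struct dirent {
    uint16_t inum;              // i-number of the file this name refers to
    char     name[DIRSIZ];      // file name, NUL-padded
  };

  // byte offset of inode `inum` on the disk: skip to the inode array,
  // then index by 64-byte inodes (illustrative helper, not part of xv6)
  uint32_t inode_disk_addr(uint32_t inum)
  {
    return INODESTART * BSIZE + INODESIZE * inum;
  }

  int main(void)
  {
    // e.g. inode 1, the root directory, lives at byte 32*512 + 64 = 16448
    printf("inode 1 at byte %u\n", (unsigned)inode_disk_addr(1));
    return 0;
  }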
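The replacement-policy question above is easier to reason about with a toy
model of the bcache list. The sketch below is a user-space approximation, not
xv6's bio.c: buffers sit on a doubly linked list headed by `head`; releasing
a buffer moves it to head.next (most recently used), and the miss path
recycles head.prev (least recently used), which is what makes the policy LRU.

  #include <stdio.h>

  struct buf {
    int sector;                  // which disk sector this buffer caches
    struct buf *prev, *next;     // doubly linked list through all buffers
  };

  static struct buf head;        // head.next = most recent, head.prev = least recent

  // put every buffer on the list, caching nothing yet
  static void cache_init(struct buf bufs[], int n) {
    head.prev = head.next = &head;
    for (int i = 0; i < n; i++) {
      bufs[i].sector = -1;
      bufs[i].next = head.next;  // insert at the front
      bufs[i].prev = &head;
      head.next->prev = &bufs[i];
      head.next = &bufs[i];
    }
  }

  // "brelse": a just-released buffer moves to the front of the list
  static void release_buf(struct buf *b) {
    b->next->prev = b->prev;     // unlink
    b->prev->next = b->next;
    b->next = head.next;         // relink at the front
    b->prev = &head;
    head.next->prev = b;
    head.next = b;
  }

  // "bget" miss path: recycle the tail, i.e. the least recently used buffer
  static struct buf *evict_victim(void) {
    return head.prev;
  }

  int main(void) {
    struct buf bufs[3];
    cache_init(bufs, 3);
    for (int i = 0; i < 3; i++)
      bufs[i].sector = 100 + i;  // pretend each buffer caches some sector
    release_buf(&bufs[0]);       // buffer 0 was just used: now most recent
    printf("victim caches sector %d\n", evict_victim()->sector);  // 101, not 100
    return 0;
  }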