Lecture 8: VM Page Tables

Intro

Previous lecture made VM seem easy.
  Simple goals: isolation, address space model.
  Simple implementation: page table provides indirection.
  Hardware does most of the work and defines the data structures.
  VM may be nearly invisible to the O/S.
  And this was how UNIX worked 15 years ago.
But:
  Sophisticated VM tricks can produce huge performance gains,
    both within the O/S and in VM-aware applications.
  This requires complex O/S management of VM.
  Typical VM hardware is hard to manage.
    Hardware facilities are not a close match to O/S requirements.
    Goals and hardware often conflict.
Why do we care when machines keep getting faster?
  Rooms full of servers...
  Forking CGI scripts -- either a fast fork, or contorted software.

O/S goals
(Keep in mind whether the simple PTE array model could support these.)

Processes larger than physical memory.
  Not really mapping virtual to physical memory -- some memory is on disk.
Mapped access to files.
  Program text, for example.
Laziness, for better response time and maybe to save work.
  Demand fill for instant program start-up.
  Zero-fill pages.
  Just don't map the pages in the VM hardware; wait for the fault.
Efficient copying.
  Copy-on-write access to initialized program data.
    Mark the page read-only; do the work on the write fault.
  Implement UNIX pipes with re-mapping.
Sharing to conserve memory.
  I.e. the same physical pages mapped into multiple processes.
  Program text (r/o) and initialized data (r/w with copy-on-write).
  Fork with copy-on-write, rather than copying memory (from swap...).
Avoid using lots of physical memory for page tables.
  4GB of 4KB pages requires 4 megabytes of 32-bit PTEs.
  Sparse mappings -- e.g. the stack is at the top of the address space.
  Lazy map construction -- demand PTE creation for file and zero-fill mappings.
These are just internal O/S goals!
  Haven't even considered exposing mappings to user processes.

O/S and hardware design depend on each other.
  Not a simple layered abstraction:
    The O/S is "under" the hardware during a page fault,
    so the O/S-vs-hardware split is fluid.
  A portable O/S must be sophisticated about VM hardware management.

Avoiding TLB flushes.
  Context IDs, as in the SPARC.
  The O/S must map processes to a limited # of context IDs and tell the hardware.
  The O/S can be lazy when *increasing* a process' space or permissions.

Virtually indexed data caches.
  Physically indexed caches are slow: must wait for the TLB.
    But they can be big, and the hardware guarantees consistency.
  Index with the offset (physical) part of the vaddr, tag with the physical address.
    Fast: the cache lookup can overlap with the TLB lookup.
    Hardware guarantees consistency -- no flushes.
    But the cache is limited in size to one page.
  What if we index with some of the virtual address bits?
    Then two mappings for the same page may use different parts of the cache.
    Read through the 1st vaddr, write through the 2nd; the write is invisible to the 1st.
    The hardware cannot easily fix this by itself.
    The O/S could install only one mapping at a time.
      Flush when the other mapping is needed.
      Or allow either one writer or many readers.
    The O/S could align all mappings in the cache.
      The O/S must then have the freedom to pick virtual addresses.
      Low performance if the application insists on its own addresses.

Multiple-level page tables.  (See the sketch at the end of this section.)
  One linear page table works badly.
    What if address use is sparse -- stack at the top?
    What if only a small fraction of valid addresses are actually used?
  Example of hardware-supported multi-level page tables.
    The highest level is often a context table.
  Now the O/S can get faults for page-table pages.
    Must page them as well as user pages.

Machines with only a TLB -- MIPS.
  We're going to find out that the O/S must maintain its own copy of the page tables.
  So why duplicate them in the hardware?
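The multi-level lookup is easy to see in code.  The sketch below is a minimal, hypothetical two-level walk in C, assuming an x86-style 32-bit split (10-bit directory index, 10-bit table index, 12-bit page offset) and identity-mapped page-table pages; the constants and names are invented for illustration, not taken from a real kernel.  With 4KB pages and 4-byte PTEs, a flat table for a 4GB space costs 2^20 * 4 bytes = 4MB per process; the two-level scheme allocates inner tables only for directory slots that are actually in use, which is what makes sparse address spaces cheap.

#include <stdint.h>

/* Hypothetical two-level page table: 10-bit directory index, 10-bit
 * table index, 12-bit page offset (an x86-style 32-bit split). */
#define PTE_PRESENT 0x1u
#define PAGE_SHIFT  12
#define PAGE_MASK   0xfffu

typedef uint32_t pte_t;

/* Translate vaddr through the two-level table rooted at pgdir.
 * Returns 0 and fills *paddr on success; returns -1 on a missing
 * entry at either level (the fault cases).  Assumes page-table pages
 * are identity-mapped, so a physical address can be used as a pointer. */
int translate(pte_t *pgdir, uint32_t vaddr, uint32_t *paddr)
{
    pte_t dir = pgdir[vaddr >> 22];                   /* top 10 bits */
    if (!(dir & PTE_PRESENT))
        return -1;           /* no inner page-table page allocated here */

    pte_t *ptab = (pte_t *)(uintptr_t)(dir & ~PAGE_MASK);
    pte_t pte = ptab[(vaddr >> PAGE_SHIFT) & 0x3ffu]; /* middle 10 bits */
    if (!(pte & PTE_PRESENT))
        return -1;           /* page not mapped: page fault */

    *paddr = (pte & ~PAGE_MASK) | (vaddr & PAGE_MASK);
    return 0;
}

Real MMUs do this walk themselves on a TLB miss (or, on a MIPS-style TLB-only machine, the O/S does it in the refill handler), but the O/S follows the same structure whenever it builds or inspects the tables.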
How does the O/S manage VM?

Note that the O/S goals are not just about address mapping.
All of them are about mapping memory-like (not non-memory) objects:
  Files.
  Swap/paging space.
  Zero-fill pages.
  Copy-on-write not-really-there pages.
  Some objects are shared, mapped by many processes at different places.
An address space is also non-homogeneous:
  Lots of distinct ranges with different properties.
All of this is hardware-independent.
We also need an interface to the low-level hardware.
  Can't rely on it to do much -- what if there is only a TLB?
  And hardware page tables are unlikely to be able to express all we want.

Common solution

Separate the ideas:
  1. How process address ranges are mapped to objects.
     The virtual part.
  2. Where the underlying object data comes from.
     The physical part -- often not memory at all.
  3. The state of the VM hardware.
  4. Global management of the limited # of physical pages.
     [The subject of the next lecture/paper.]
This split is used by Mach / SunOS / 4.4BSD.

Picture: a process contains a list of vm_map_entries.
(A sketch of these structures and of the fault path appears at the end of this section.)

vm_map_entry
  vstart, vend, protection, offset, object pointer.
  protection: read, write, copy-on-write.
  These are not shared between processes.
  Note that they are more space-efficient than PTE arrays,
    but not directly indexable.

object
  Examples: a file, anonymous zero-fill memory.
  Only knows about object-relative "physical" addresses.
  May have a cache of vm_pages.
  May be shared by many vm_map_entries;
    this implements read/write sharing of an object.

shadow object
  Implements copy-on-write.
  Points to an underlying object.
  Holds a list of private copies of the pages that have been modified.

Effect of PROT1 and PROTN?

Example
  r/o text -> file object
  r/w data -> shadow object -> file object
  r/w stack -> anonymous object
After fork: new vm_map_entries for the child.
  Share the text directly, since it is r/o.
  Share the data through two more shadows:
    sharing pre-fork modifications, but not post-fork ones.
  Share the stack via two shadows.
Points:
  More expressive, but slower, than PTE arrays.
  Must follow potentially long chains of shadows to look up a page.
  There are often opportunities to discard or collapse shadows.

What happens on a page fault?
  See the pseudo-code (and the simplified sketch at the end of this section).
  Find the vm_map_entry, if any.
    Might be an unmapped address, or a protection violation.
    Kill the process -- or notify it with a signal!
  Follow the object chain.
    Might be a resident page, just not in the hardware map.
    Or a write access to a r/o (maybe copy-on-write) page.
    Might be found in a shadow object, or in an underlying object.
    Might be non-resident; read it in from the file.
  If copy-on-write, and writing, and the page is not in the first shadow:
    Make a copy of the page.
    Install it in the first shadow.
  We end up with a vaddr and a physical page; now what?

Machine-dependent layer

Mach calls this layer "pmap"; let's call the machine-independent layer the "VM layer".
Calculating the full set of PTEs from the VM layer's info would be hard!
So pmap is very lazy.  The VM layer only calls it:
  During a page fault, to create a mapping for the needed page.
  When taking away permissions or a mapping.
  To collect dirty or referenced information.
pmap can always throw away mappings!
  It does not have to notify the VM layer.
  The process will fault, and the VM layer will re-install the mapping.
The VM layer must assume pmap knows nothing.
  The VM layer tells pmap about every mapping or protection change.
  pmap can ignore many of these calls.
pmap must handle cache alignment restrictions,
  for physical pages with multiple virtual mappings.
  So it must keep track of the vaddr[s] of every physical page.
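To make the map-entry/object/shadow structures and the fault path concrete, here is a minimal sketch in C.  It is hypothetical: the struct layouts, field names, and the function vm_fault are invented and simplified for illustration (shadows are assumed to cover the same offsets as the objects behind them, and the pager/page-in path is omitted); they are not the real Mach or BSD definitions.

#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

struct vm_page {                  /* one resident page of an object */
    uint64_t offset;              /* page-aligned offset within the object */
    void *data;                   /* page contents (stand-in for phys mem) */
    struct vm_page *next;
};

struct vm_object {                /* file, anonymous zero-fill, or shadow */
    struct vm_page *pages;        /* cache of resident pages */
    struct vm_object *backing;    /* underlying object for a shadow, else NULL */
    /* pager operations (read from file, zero-fill, ...) omitted */
};

struct vm_map_entry {             /* one contiguous range of an address space */
    uintptr_t vstart, vend;
    int writable;
    int copy_on_write;
    uint64_t offset;              /* object offset corresponding to vstart */
    struct vm_object *object;     /* first object; a shadow if copy-on-write */
    struct vm_map_entry *next;
};

static struct vm_page *object_lookup(struct vm_object *obj, uint64_t off)
{
    for (struct vm_page *p = obj->pages; p != NULL; p = p->next)
        if (p->offset == off)
            return p;
    return NULL;
}

/* Fault path: follow the shadow chain to find the page; if this is a write
 * to a copy-on-write page found below the first object, copy the page into
 * the first (shadow) object. */
struct vm_page *vm_fault(struct vm_map_entry *e, uintptr_t va, int write)
{
    uint64_t off = e->offset + ((va - e->vstart) & ~(uint64_t)(PAGE_SIZE - 1));
    struct vm_object *obj = e->object;
    struct vm_page *page = NULL;

    for (; obj != NULL; obj = obj->backing)
        if ((page = object_lookup(obj, off)) != NULL)
            break;
    if (page == NULL)
        return NULL;              /* would ask the pager to bring it in */

    if (write && e->copy_on_write && obj != e->object) {
        /* Private copy: install a fresh page in the first shadow. */
        struct vm_page *copy = malloc(sizeof *copy);
        copy->offset = off;
        copy->data = malloc(PAGE_SIZE);
        memcpy(copy->data, page->data, PAGE_SIZE);
        copy->next = e->object->pages;
        e->object->pages = copy;
        page = copy;
    }
    return page;                  /* caller then asks pmap to map it */
}

The caller of vm_fault would hand the resulting page to the machine-dependent layer to install the hardware mapping.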
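The VM-layer/pmap split is easiest to see from the calls the VM layer makes.  The prototypes below are a rough sketch in the spirit of the Mach pmap interface; the function names follow the Mach convention, but the signatures are simplified placeholders and do not match any particular kernel.

#include <stdint.h>
#include <stdbool.h>

typedef struct pmap pmap_t;          /* per-address-space hardware state */
typedef uintptr_t vaddr_t;
typedef uintptr_t paddr_t;
typedef int vm_prot_t;               /* read/write/execute bits */

/* Called on a page fault, after the VM layer has found the physical page:
 * create (or upgrade) the hardware mapping for exactly this one page. */
void pmap_enter(pmap_t *pmap, vaddr_t va, paddr_t pa, vm_prot_t prot);

/* Called when the VM layer takes mappings or permissions away. */
void pmap_remove(pmap_t *pmap, vaddr_t sva, vaddr_t eva);
void pmap_protect(pmap_t *pmap, vaddr_t sva, vaddr_t eva, vm_prot_t prot);

/* Reduce permissions on every mapping of a physical page, in all address
 * spaces -- this is why pmap must track the vaddr[s] of each phys page. */
void pmap_page_protect(paddr_t pa, vm_prot_t prot);

/* Called when the VM layer needs dirty/referenced information. */
bool pmap_is_modified(paddr_t pa);
bool pmap_is_referenced(paddr_t pa);
void pmap_clear_modify(paddr_t pa);
void pmap_clear_reference(paddr_t pa);

/* A mapping created by pmap_enter may be silently dropped by pmap; the
 * process will fault again and the VM layer will re-install it. */

Note the asymmetry: mappings created by pmap_enter may be discarded at will, but the calls that take mappings or permissions away must take effect immediately, since the VM layer's correctness (e.g. copy-on-write) depends on them.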
Summary

Don't view VM as a thin layer just above the memory system.
It's actually an important program / O/S interface.
  It allows the O/S to control what memory references refer to.
Most of the implementation is in the O/S, not the hardware.
The O/S uses this flexible control to improve performance.
Applications can do the same.