6.828 2011 Lecture 20: Memory Management

Read: Carl A. Waldspurger, Memory Resource Management in VMware ESX Server, OSDI 2002.

Based on Eddie Kohler's lecture notes.

Here are three ideas to remember from the previous lecture:

big idea: virtualization
  VMM maintains virtual h/w state (IF, CPL, &c)
  guest sees/modifies only virtual state
  VMM may set up real state in totally different way
  need to interpose on all guest reads/writes of state

big idea: binary translation
  rewrite the guest code
  replace instructions that read/write machine state
    with instructions that modify just the virtual state
    and perhaps apply derived modification to shadow state
  hard: need to find instruction boundaries
  don't let guest know that you're moving code, so e.g. fn addrs different
  much faster than having privileged instructions trap
  BT useful far beyond virtual machines
    last week's race detectors: record lock held for every load/store
    run PPC binary on x86
  in fact, BT not so useful for VMs due to modern VM hardware

medium idea: shadow page tables
  two layers of mapping
    guest kernel page tables: virtual -> "physical"
    VMM maintains per-guest mapping: "physical" -> machine (or invalid)
    real MMU page table (i.e. "shadow" page table): virtual -> machine, as computed by VMM
      can play tricks, e.g. invalid mappings
  VMM monitors changes to guest page table
    reflects into shadow page table
  modern hardware directly provides two-level table

introduction
  today's paper is about physical memory management
    how to allocate phys mem among different tasks
  very important for VMM *and* for ordinary O/S kernels
  many clever techniques and tricks are known, useful for both

paper's situation
  company may run many distinct services
  want each on a separate O/S installation
    very painful to entangle e.g. web server and mail server
  many services are idle most of the time
    so dedicating a CPU + memory + disk very wasteful
  lots of VMs in one box: good performance, much less money
    "statistical multiplexing"

how to divide up physical memory among guests?
  suppose 4 GB of RAM, 4 guests
  guests think they are running on real h/w w/ fixed amt of RAM
  how much RAM should VMM tell each guest at boot?
    hard to predict actual needs!
    1 GB per guest: what if one needs 2 GB and others only 0.5?
    4 GB per guest: probably will use a lot less, so likely OK
      but what if sum of active use is > 4 GB?

how to make over-commit work?
  start with all phys pages on VMM's free list
  shadow page table has 4 GB of *invalid* PTEs
  allocate phys mem only during page fault
    "lazy" or "on-demand" allocation (see the sketch below)
  the millionth page fault will see an empty free list!
    "page out" some page to "swap space" on disk
    invalidate corresponding PTE
    re-use phys page for our page fault

most O/S kernels over-commit mem
  sum of process sizes > phys mem
  usually ok since most process pages are idle
    thus nothing lost by paging them out to disk
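
Here is a minimal sketch of how a VMM's fault handler might tie together
the shadow-page-table and lazy-allocation ideas above.  Not ESX's actual
code: every helper and name (pmap[], guest_pgtable_walk, vmm_swap_out,
&c) is made up for illustration.

  #include <stdint.h>

  #define NPPN      (1 << 20)         /* assumed # of guest "physical" pages */
  #define INVALID   ((uint64_t)-1)
  #define NOPAGE    ((uint64_t)-1)

  /* guest "physical" page # -> machine page #, or INVALID if not backed yet */
  uint64_t pmap[NPPN];

  void
  vmm_shadow_fault(struct guest *g, uint64_t va)
  {
      /* 1. walk the guest's own page table: virtual -> "physical" */
      uint64_t ppn = guest_pgtable_walk(g, va);

      /* 2. "physical" -> machine; back the page lazily, on first touch */
      if (pmap[ppn] == INVALID) {
          uint64_t mpn = machine_free_list_pop();
          if (mpn == NOPAGE) {
              /* free list empty: page some victim out to the VMM's swap
                 area, invalidate its shadow PTEs, reuse its machine page */
              struct vmm_page *victim = pick_victim();
              vmm_swap_out(victim);
              shadow_invalidate(victim);
              mpn = victim->mpn;
          }
          pmap[ppn] = mpn;
      }

      /* 3. install the composed virtual -> machine mapping in the shadow
            page table that the real MMU uses */
      shadow_pgtable_map(g, va, pmap[ppn], guest_pte_perms(g, va));
  }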
paper implies memory mgmt is the key VM sharing issue -- why?
  why isn't CPU sharing/scheduling the key problem?
    usually not a performance issue, just fairness or priority
  impact of doing a bad job w/ mem allocation?
    *huge* performance penalty if you have to page in from disk
    s/w runs at disk speed, not RAM speed
    maybe you sometimes see this on your PC -- disk light on all the time
  doing a good job is helpful
    efficient use of phys mem -> more disk data cached
  paper often trades CPU time to save memory
    suggests they think memory mgmt a worse bottleneck than CPU

let's look at the paper's techniques

Section 4: content-based sharing
  problem: same data may be in memory many times -- wasteful
  examples:
    instructions of programs/kernels running more than once
    same file content read by many programs/guests
    zero-filled pages
  how does a conventional O/S deal w/ duplicate data?
    shared libraries
    single file buffer cache used by all programs
    copy-on-write fork()
    basic idea: sharing is explicit, easy for kernel to observe
      open/read/mmap of same file
      fork()
  how does ESX deal with duplicate data?
    cannot see explicitly (doesn't see open, fork, read, &c)
    idea: directly check whether two physical pages have the same content
      very general, e.g. different files with same content
      potentially slow -- lots of comparisons
    details: (sketch after the paging discussion below)
      table of page content hashes
      pick random page, hash, see if table entry has same hash
      if duplicate:
        fix PTE to refer to existing phys page
        mark pages copy-on-write
        add duplicate page to free list
    downside: CPU time to compute hashes

Section 3: cooperative page reclamation
  problem: overcommitting memory
    guest 1 is using too much physical memory
    we want to take some back and give it to guest 2
    BUT guest 1 has data in that phys mem, can't just take the pages
    how best to cause the data to be paged out to disk?
  conventional O/S disk paging scheme
    sum of processes' mem >> phys mem -- overcommitted
    [diagram: address space table -> disk, pg tbl -> phys mem]
    some PTEs refer to phys pages
    other PTEs are invalid
      kernel tables indicate a disk block #
        block of a data or executable file
        block of a "swap" file
    think of phys mem as cache for data on disk
    what to do on a page fault for a paged-out page?
      need a phys page into which to read it
      steal a random page while in the pg flt handler?
        what if that page is dirty, or is actively in use?
    kernel keeps lists of phys page #s
      free
      inactive+clean
      inactive+dirty
      active
    kernel remembers what data is in memory
      file+off -> phys addr
    page fault:
      desired page in memory? perhaps on inactive or free list
        fix PTE to point to it
        move page to active list
      else
        take a physical page from free list
        read from disk -- maybe read-ahead too
        change PTE
        move page to active list
    background reclaim kernel thread: (sketch below)
      if inactive+dirty list too short
        move some not-recently-used pages from active to inactive+dirty or +clean
        modify PTEs to be invalid
      if inactive+clean too short
        write inactive+dirty pages to disk, move to +clean
        once you pick a page, write neighbors too, for batching
      if free too short
        move pages from inactive+clean to free
    eviction/replacement policies (choose pages to inactivate or free)
      huge # of clever ideas
      LRU? random? learn app patterns? app tells you?
      avoid shared pages? avoid dirty pages?
    important to move a bunch of pages at a time
      if we fall one page below free target, don't just move one page!
      "hysteresis" -- e.g. start moving when free < 5%, stop when > 7%
      batch disk writes
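
A sketch of the content-based sharing scan from Section 4.  The helpers
(hash_page_contents, remap_cow, &c) are hypothetical; real ESX differs
in details, e.g. how unshared pages are tracked as hints.

  #include <stdint.h>
  #include <stddef.h>

  struct hash_entry {
      uint64_t hash;
      uint64_t mpn;               /* machine page # already holding this content */
  };

  void
  scan_one_page(void)
  {
      uint64_t mpn = pick_random_machine_page();
      uint64_t h = hash_page_contents(mpn);

      struct hash_entry *e = hash_table_lookup(h);
      if (e != NULL && pages_identical(e->mpn, mpn)) {
          /* pages_identical() does a full byte-for-byte compare, since
             hashes can collide */
          /* repoint every mapping of mpn at e->mpn, read-only, so a later
             write faults and gets a private copy (copy-on-write) */
          remap_cow(mpn, e->mpn);
          machine_free_list_push(mpn);      /* the duplicate page is now free */
      } else {
          hash_table_insert(h, mpn);        /* remember it for future matches */
      }
  }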
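
And a sketch of the conventional O/S background reclaim daemon with
hysteresis described above.  The list names and the 5%/7% thresholds are
illustrative, not taken from any particular kernel.

  /* start reclaiming when free memory drops below 5%, stop above 7% */
  #define LOW_WATER   (phys_pages * 5 / 100)
  #define HIGH_WATER  (phys_pages * 7 / 100)

  void
  reclaim_daemon(void)
  {
      for (;;) {
          if (list_len(&free_list) < LOW_WATER) {
              while (list_len(&free_list) < HIGH_WATER) {
                  /* refill the inactive lists with not-recently-used pages
                     from the active list, invalidating their PTEs */
                  if (list_len(&inactive_clean) + list_len(&inactive_dirty) < HIGH_WATER)
                      deactivate_some_pages();

                  /* clean dirty pages in batches -- write neighbors too */
                  if (list_len(&inactive_clean) < HIGH_WATER)
                      write_back_batch(&inactive_dirty, &inactive_clean);

                  /* clean inactive pages can now become free */
                  move_some_pages(&inactive_clean, &free_list);
              }
          }
          sleep_until_woken_or_timeout();
      }
  }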
could a VMM do disk paging like a conventional O/S?
  yes -- VMM can page to its own swap area on disk
    and ESX does this if it has to
    but not such a great idea
  double paging
    VMM writes a page to swap, steals phys page, inval PTE
    guest O/S decides to page out that page
      doesn't know VMM has already paged it out
    guest O/S reads page -- so VMM pages it in
    but guest O/S just writes it back out to disk
    wasteful
    and likely to happen, since both will choose inactive pages
  O/S can make better decisions than VMM
    already has sophisticated policies
    knows which pages are clean
    understands app patterns

how does ESX solve this problem?
  essentially asks the guest to page out some pages
  can't do this directly -- guest O/S doesn't know about VMM
  "balloon driver" (sketch at the end of these notes)
    device driver loaded into guest kernel
    VMM can tell it to allocate from kern phys page allocator
    VMM then takes away those pages!
  thus:
    ESX decides guest 1 should use less phys mem
    tells balloon driver to grab N phys pages
    guest O/S takes from free list
      guest page-out daemon then reclaims pages from processes
      using whatever clever algorithms
    balloon driver gives the phys pages to VMM
    VMM gives phys pages to guest 2
      -> guest 2's balloon driver *frees* corresponding # of pages
  very clever
    effect is that O/S and VMM cooperate, use good O/S page-out policy
    but O/S basically doesn't know about VMM

Section 5: share-based memory allocation, idle memory
  problem:
    some guests are much more important than others
      for performance, need to give important guests more memory when needed
    but want to take mem from important guest if mem or guest are idle
    how to do both?
  conventional O/S solution
    you often see sophisticated *CPU* schedulers
      priority &c, low-pri can use time left idle by hi-pri
    i don't know of a sophisticated *memory* scheduler
    more typical to see global LRU replacement
      idle pages taken away, given to active processes
      achieves high overall throughput
    thus: if a hi-pri process is idle
      all its memory may be paged out
      and so it will be very slow when it resumes
    how to combine priority with idle page reclamation?
  ESX plan
    give each guest a number of shares
    if no idle mem, guest gets shares / totalshares
      i.e. phys mem in proportion to # of shares
    "adjusted shares / page ratio" indicates which guests use too much
      combines shares and idleness
    when a guest needs memory
      take it from the guest with lowest s/p
    p. 8 rho formula (sketch at the end of these notes)
      decreases guest's s/p with frac of pages that are idle
    thus:
      high-share guest has left much of its memory idle
      lower-share guests will take some of it -- using ballooning
      but never all of it!
      if the high-share guest starts using more memory
        it will take the memory back
    neatly combines:
      demand-driven sharing for high efficiency
      priorities
  how does ESX know what frac of a guest's phys pages are idle?
    clear PTE_P bit in N random PTEs
    wait 30 seconds
    track what fraction of the N faulted
    that's an estimate of the fraction of pages that are not idle

ideas and tricks to remember
  physical memory as cache for disk/file data
  over-commit / statistical multiplexing for bursty consumers
    only need sum of averages, not sum of maximums
  elastic consumers -- rather than rigid resource partition
    can benefit from more memory
    can survive with less
  on-demand / lazy allocation
    defer work, maybe you won't have to do it
  pageout in background daemon
    decouples allocation from freeing, reduces delays, allows batching
  soft page faults to see what pages are active (stats, lists)
  avoid policy -- ballooning
  deduplication via hashing
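
Appended here, a sketch of what a balloon driver might look like inside
the guest kernel.  The guest allocation calls and the hypercalls
(vmm_give_page, vmm_take_back_page) are hypothetical stand-ins, not
VMware's actual interface.

  #define MAX_BALLOON  (1 << 18)            /* assumed cap on balloon size, in pages */

  static struct page *balloon[MAX_BALLOON]; /* guest pages we currently hold */
  static int balloon_size;

  /* VMM wants this guest to give up memory: grow the balloon to target pages */
  void
  balloon_inflate(int target)
  {
      while (balloon_size < target) {
          /* allocate from the guest kernel's own page allocator; if memory
             is tight, this is what drives the guest's page-out daemon */
          struct page *pg = guest_alloc_page();
          if (pg == NULL)
              break;                          /* guest is under pressure; retry later */
          balloon[balloon_size++] = pg;
          /* tell the VMM it may reclaim the machine page backing this
             guest-"physical" page; the guest must never touch it again */
          vmm_give_page(page_to_ppn(pg));
      }
  }

  /* VMM is giving memory back: shrink the balloon */
  void
  balloon_deflate(int target)
  {
      while (balloon_size > target) {
          struct page *pg = balloon[--balloon_size];
          vmm_take_back_page(page_to_ppn(pg));   /* VMM re-backs the page */
          guest_free_page(pg);                   /* guest can use it again */
      }
  }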
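
And the "adjusted shares-per-page" ratio as a worked formula.  This is
the paper's rho from p. 8: rho = S / (P * (f + k*(1-f))), with
k = 1/(1 - tau) for idle-memory tax rate tau; the function and variable
names here are mine, and frac_active is the estimate from the PTE
sampling trick above.

  /* shares      = the guest's share allocation S
     pages       = P, phys pages currently allocated to the guest
     frac_active = f, estimated fraction of those pages recently touched
     tax         = tau, idle-memory tax rate (paper's default is 0.75) */
  double
  adjusted_shares_per_page(double shares, double pages,
                           double frac_active, double tax)
  {
      double k = 1.0 / (1.0 - tax);    /* cost of an idle page; tax = 0.75 => k = 4 */
      return shares / (pages * (frac_active + k * (1.0 - frac_active)));
  }

  /* when memory is needed, reclaim (via ballooning or VMM paging) from the
     guest with the lowest adjusted shares-per-page: a high-share guest with
     many idle pages loses some memory, but its shares still protect the
     pages it is actively using */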