6.1810 2025 L9: Paging to Disk, Superpages

Practical, transparent operating system support for superpages,
  Navarro, Iyer, Druschel, and Cox, OSDI 2002.

outline:
  Classic O/S paging
  TLB considerations
  Today's paper

important point from previous lecture:
  page allocation is usually lazy (much more so than on xv6)
  when a program starts, its page table may be entirely empty
  when it allocates memory, its page table generally does not change!
  when a program first *accesses* a page:
    page fault
    kernel allocates a physical page, initializes it (zero, or read from file),
      and maps it into the page table
  why?
    transparent
    fast start
    a program can ask for lots of memory w/o worrying about RAM size

original purpose of page tables:
  support applications that are bigger than RAM
    just one, or many with a large total
  dates back to a time when memory was relatively very expensive
  still useful and worth knowing (and part of today's paper's context)
  "paging" or "paging to disk"

how does paging to disk work?
  i.e. how to let programs think they have more memory than there is RAM?

toy example:
  machine has just two pages of RAM,
    but a program wants to use 4 pages of data
  diagram:
    page table: 4 PTEs
    RAM: two pages of phys mem, initially free
    disk: 4 pages of "paging" storage, one-page executable file
  read page fault on va 0
    kernel handles the page fault transparently
    allocate phys page #1
    read from the executable file
    PTE0 -> pa1
  write page fault on va 2 (really va 0x2000, the 3rd page)
    allocate phys page #0
    zero it
    PTE2 -> pa0
  write page fault on va 1
    no free RAM: now what?
    choose a NON-FREE physical page, let's say phys page #0
    write phys page #0 to disk page 2
      we know we must save it b/c PTE2 will have its dirty bit set
    PTE2 -> invalid
    remember where va 2 is on the disk
    "eviction" or "page-out" or "replacement"
    initialize phys page #0 for va 1
    PTE1 -> pa0
  page fault on va 2 (read or write)
    let's evict physical page #1 (va 0)
    write page #1 to disk block #1
    PTE0 -> invalid
    let's "page in" virtual page 2 to physical page #1
    PTE2 -> pa1
  we're using RAM as a cache, with the disk as "backing store"
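to make the toy example concrete, here is a minimal sketch of the fault path
in xv6-style C. struct vma and the helpers vma_lookup(), find_victim(),
swap_out(), swap_in(), and read_exec_page() are hypothetical (base xv6 has
none of them); kalloc(), mappages(), PGSIZE, PGROUNDDOWN, and the PTE_*
flags are real xv6 names:

  // handle a page fault at virtual address va in process p.
  // returns 0 on success, -1 on failure (e.g. va isn't mapped at all).
  int
  pagefault(struct proc *p, uint64 va)
  {
    struct vma *v = vma_lookup(p, va);   // which region does va fall in?
    if(v == 0)
      return -1;                         // unmapped: this is a segfault

    char *mem = kalloc();                // try to grab a free physical page
    if(mem == 0){
      // no free RAM: evict (page out) some other page first.
      uint64 victim = find_victim();     // e.g. tail of an LRU list
      swap_out(victim);                  // write to disk if dirty, invalidate its PTE(s)
      mem = (char *) victim;
    }

    if(v->on_disk)
      swap_in(v, va, mem);               // previously paged out: read it back
    else if(v->from_file)
      read_exec_page(v, va, mem);        // demand-load from the executable
    else
      memset(mem, 0, PGSIZE);            // fresh anonymous memory starts zeroed

    // install the mapping; grant PTE_W only if the region is writable.
    return mappages(p->pagetable, PGROUNDDOWN(va), PGSIZE, (uint64) mem,
                    PTE_U | PTE_R | (v->writable ? PTE_W : 0));
  }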
memories are huge now, so why does e.g. Linux still support paging to disk?
  when paging was invented, memory was tiny (< 1 MB) and expensive
  now computers are individually cheap and have GBs of memory
  but memory is still a huge expense, e.g. in datacenters
    total memory cost is typically more than CPU
  and programs and data keep growing larger
  thus it's still important for the O/S to optimize RAM use
  page replacement is still important:
    manage the file cache
    very big programs
    memory-mapped huge files
    avoid hard edges

eviction strategy matters a lot:
  if we evict a page that's itself about to be used,
    that will cause another page fault and a disk read
  a disk read takes around a millisecond
    that's a million instructions!
    and the faulting process has to wait...
  a bad eviction policy can cause a lot of time-consuming page faults
  idea: evict the page that will next be used the longest time from now
    sadly, in general a kernel cannot predict which page this will be
    so we need eviction heuristics that are possible to implement
  people have invented lots of eviction strategies

a common eviction policy: least recently used (LRU)
  kernel keeps an LRU list of all physical page numbers
  kernel periodically scans all physical pages (see the code sketch below)
    finds the PTE(s) for each, looks at the "Accessed" bit
    if the Accessed bit is set, move the page to the head of the LRU list
      and clear the Accessed bit
    so if a page is used a lot, it keeps being moved to the head of the LRU list
  when a page fault needs a physical page of memory,
    take it from the tail of the LRU list
  when evicting a page, need to invalidate all PTEs that refer to it
    there may be more than one: copy-on-write fork, memory-mapped files, shared memory
    typically the kernel has a "reverse map" from PA to a list of page tables and VAs

better to evict a clean page than a dirty one
  clean == there's already an identical copy on the disk,
    either from a previous page-out, or in a read-only executable file
  we can re-use a clean page without first having to write it to disk
    just need to zero it out, and adjust the old and new PTEs
  page faults will be faster if we can re-use (evict) clean pages

a more elaborate LRU scheme (the FreeBSD scheme in the paper's Section 5.1):
  three LRU lists: Active, Inactive, Cache
  periodic scan; if Accessed (on any list), move to the head of Active
  if memory is tight:
    move the tail of Active to Inactive
    write dirty Inactive pages to disk
    move the clean tail of the Inactive list to the Cache list
  so the Cache list contains only clean pages that haven't been used for a while
  it's the Cache pages that are allocated for page faults
  important to keep the Cache list non-empty for fast page faults
  if there's "memory pressure", scan faster, write dirty pages to disk faster
    memory pressure = the Cache LRU list is too short

when does paging to disk work well?
  if applications use only a fraction of their memory at any given time,
    so that active memory can be kept in RAM and the rest paged out to disk
  "locality of reference"
  "working set"
  the working set may change over time, but hopefully slowly
  this is true of many programs, but not all

what if a program's working set is too big?
  or there are too many programs running?
  the kernel will evict pages that will soon be needed
  lots of memory accesses will incur page faults and disk reads and writes
  disk is so much slower than RAM that the system will be dramatically slower
  "thrashing"

is LRU a good plan?
  LRU is effectively a prediction: recently used => will be used again soon
  usually works well
  but sometimes quite badly!
  example: a repeated sequential scan through huge memory
    the program will probably fault on *every* page
    the least-recently-used page will be the one the program is about to access!
    so, for this program, LRU is a terrible replacement strategy
  thus some systems use random replacement, or offer it as an option
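a minimal sketch of the periodic Accessed-bit scan described above, in
xv6-style C. struct page, the pages[] array, the lru_* list operations, and
the rmap_* reverse-map iterator are hypothetical bookkeeping (neither xv6
nor FreeBSD code); PTE_A is the real RISC-V Accessed bit:

  #define PTE_A (1L << 6)            // RISC-V "Accessed" bit in a PTE

  void
  lru_scan(void)
  {
    for(struct page *pg = pages; pg < pages + NPAGES; pg++){
      int used = 0;
      // a physical page may be mapped by several PTEs (COW fork,
      // memory-mapped files, shared memory), so consult the reverse map
      // to find every PTE that refers to it.
      for(pte_t *pte = rmap_first(pg); pte != 0; pte = rmap_next(pg, pte)){
        if(*pte & PTE_A){
          used = 1;
          *pte &= ~PTE_A;            // clear it so the next scan sees fresh info
        }
      }
      if(used)
        lru_move_to_head(pg);        // recently used => last in line for eviction
    }
    // a real kernel would also flush stale TLB entries here (sfence.vma),
    // since the TLB may still cache PTEs with the old Accessed bits.
  }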
how big should pages be? (assuming just one size)
  CPU designers choose the page size; the kernel has to live with it
  long ago: small, e.g. 512 bytes
  now often 4096 (but the paper uses 8k, and some are larger)
  why prefer big?
    less memory consumed by page tables
      16 GB -> 32 MB of PTEs if 4k pages
    less CPU and memory for kernel book-keeping
      free lists, reverse map, VMAs
    fewer page faults (assuming good locality of reference)
  why prefer small?
    less wasted RAM for pages that are only partially used
      so less RAM may be needed for a given workload
    finer granularity for the PTE Accessed and Dirty bits
      so the kernel can more precisely keep just the needed data in RAM
    less disk I/O per page-in and page-out, so lower page-fault delay
  both big and small have compelling properties
  today's paper is about O/S support for *both*

a final piece of background for today's paper: the TLB
  Translation Lookaside Buffer
  a CPU cache of virtual-to-physical address translations
    it's a va,pa table
  [diagram: core, L1, L2, RAM; then add the TLB]
  every memory reference requires a va-to-pa lookup
    ld, st, instruction fetch
    so the CPU needs to be able to do roughly one per cycle -- a nanosecond
  the TLB is typically small since it must be extremely fast
    paper: 128 entries
    modern CPUs: 128 L1 entries, 1024 L2 entries

what is the "reach" of a TLB?
  128 entries, 4k pages: half a megabyte!
  1024 entries: four megabytes
  that's tiny on the scale of machines and programs with >= 16 GB of RAM
  a program that regularly uses more memory than that will miss in the TLB

what's the cost of a TLB miss?
  the RISC-V CPU walks the multi-level page table to fetch the leaf PTE
    (the RISC-V hardware handles TLB misses, not the kernel)
  that's a couple of memory fetches, at up to 100 ns each
    hundreds of cycles (though the CPU's data caches may help)
  the CPU may stall until the TLB fills
  so a TLB miss may increase instruction time by around 100x

larger pages are a way to increase TLB reach
  128 TLB entries * 2MB pages = 1/4 GB
  and indeed modern CPUs have superpages at least that big
  but huge pages can waste lots of RAM!
    since some programs use only a small fraction of each page
  being able to mix small and super pages is attractive
  you are all familiar with the basic idea!

today's paper's goal:
  the kernel automatically chooses between base and super pages
  to increase TLB reach and decrease time wasted in TLB misses
  adapt to program characteristics, including changes in behavior
  cope with big and small programs, tight and loose memory pressure
  do it transparently -- a classic kernel strategy

giving a process a 2MB superpage is effectively a prediction:
  the process will use that 2MB fairly intensively in the near future,
    rather than e.g. only using a few kilobytes of the 2MB
  if wrong -> wasted RAM
  tension: fewer TLB misses vs wasted RAM

Three puzzling design questions tackled by today's paper,
follow-on research, and real kernels, e.g. Linux:

* when should the kernel give an application a superpage (PTE and 2MB of RAM)?
  when the app calls sbrk()?
    hard to guess the situation from sbrk()'s size argument.
    sbrk(2MB) may not mean the process will use the memory.
    sbrk(4096) may be the first page of a full 2MB allocation.
  when the app takes the first page fault in a 2MB area?
    (this is one of Linux's approaches)
    at least then we know it will use some of the RAM
      but perhaps not all, leading to wasted RAM
    the application will get immediate benefit from superpages
    this is a bet that superpages are usually appropriate
    may be a good plan, esp. if there's lots of RAM and a few big apps
  when the app has taken page faults on all pages in a 2MB area?
    convert the area to a superpage
    (this is the paper's approach)
    good: we know it used 100% of the 2 MB superpage
    bad: we let it execute for quite a while without superpage benefits
    this is a conservative approach; it prioritizes not wasting memory

* how can the kernel find a suitable 2MB area of RAM?
  it must be contiguous, 2MB-aligned, and free
  2MB-long free runs rarely occur by accident!
  phys mem is typically "fragmented" -- used and free 4k pages interspersed
    no problem for 4k pages, a big problem when using both 4k and 2MB
  allocate some 2MB areas of RAM at boot?
    (this is what the lab suggests, for simplicity)
    but how many? what if more, or fewer, are needed?
  move (copy) physical pages around to clear out 2MB when needed?
    "compaction" or "de-fragmentation"
    and patch up the PTEs of moved pages
    (this is one of Linux's approaches)
    causes allocation or first use to be slow, and uses CPU to compact
  the paper suggests a clever allocation plan (Section 4.1; sketched below)
    on the first fault in a new 2MB area,
      create one ordinary PTE, allocate and initialize just the one page,
      but also reserve 2MB of pages, from the LRU "Cache" list
        these are clean pages that have not been used for a while
        the paper doesn't actually allocate or initialize or map those pages,
          so they still hold their cached content,
          in case their original mappings are used
    as more faults are taken in the superpage,
      allocate more of those reserved pages, and initialize them
    when all ordinary pages in the 2MB area have faulted,
      promote to a superpage
    The Question: why is this arrangement good?
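a minimal sketch of that fault-count-and-promote bookkeeping, in xv6-style C
on RISC-V Sv39 (512 4KB base pages per 2MB superpage) rather than the paper's
FreeBSD/Alpha page sizes. struct reservation, superpage_fault(), and
map_megapage() are hypothetical, and the sketch assumes a 2MB-aligned run of
physical frames (res->pa0) has already been reserved, e.g. from the Cache
list; mappages() and uvmunmap() are real xv6 functions:

  #define BASE_PER_SUPER 512               // 2MB / 4KB base pages

  struct reservation {
    uint64 va0;          // 2MB-aligned start of the virtual region
    uint64 pa0;          // 2MB-aligned start of the reserved physical frames
    int populated;       // how many base pages have faulted so far
  };

  // called from the page-fault handler for a fault at va inside res.
  void
  superpage_fault(pagetable_t pt, struct reservation *res, uint64 va, int perm)
  {
    uint64 off = PGROUNDDOWN(va) - res->va0;
    uint64 pa = res->pa0 + off;            // this frame was reserved, not yet used
    // xv6 direct-maps all of physical memory, so the kernel can write pa directly.
    memset((void *) pa, 0, PGSIZE);        // initialize only this one base page
    mappages(pt, PGROUNDDOWN(va), PGSIZE, pa, perm);   // ordinary 4KB PTE

    if(++res->populated == BASE_PER_SUPER){
      // the process has now touched every base page: promote.
      // remove the 512 level-0 PTEs (without freeing the frames) and
      // install a single leaf PTE at level 1 of the Sv39 page table;
      // a level-1 leaf maps a 2MB-aligned "megapage".
      uvmunmap(pt, res->va0, BASE_PER_SUPER, 0);
      map_megapage(pt, res->va0, res->pa0, perm);      // hypothetical helper
    }
  }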
* the kernel sometimes needs to demote superpages
  if memory is tight, smaller pages are likely to require less RAM
  which superpages should be demoted?
  problem: 2MB-granularity Accessed bits make it hard to tell what's really in use
    so break superpages up to get finer-grained Accessed bits
  also: if a process writes to a superpage,
    we may not want that single write to (later) cause page-out of the entire 2MB
    so read-protect superpages, demote on the first write,
      and then benefit from finer-granularity Dirty bits

Performance?
  Table 1
  a wide array of different benchmarks, mostly real programs
  the typical improvement from superpages is 10%
    sometimes higher
  however: Table 1 assumes "plenty of memory is available"
    sometimes true in real life, sometimes not

Current status?
  Easiest to track for Linux; many posts on the web.
  Very valuable for some applications.
  But "transparent huge pages" waste lots of memory in other situations,
    because only a fraction of each 2MB is actually used.
  So the design details have changed a lot over time.
  And careful applications often explicitly allocate their superpages
    (see the sketch at the end of these notes).
  Still actively evolving.
  Perhaps the underlying issue is that the gains are often pretty modest,
    but the costs (in memory waste) can be very large.

next week: a neat application-level use of page tables

--------------

Some more recent information about superpages:

https://lwn.net/Articles/423584/
https://lwn.net/Articles/1025629/
https://lwn.net/Articles/1032199/
https://www.usenix.org/system/files/conference/osdi16/osdi16-kwon.pdf
https://www.usenix.org/system/files/atc20-zhu-weixi_0.pdf
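--------------

The "careful applications" mentioned under Current status above can ask Linux
for superpages explicitly instead of relying on transparent promotion. A
minimal user-level sketch (not from the paper): MAP_HUGETLB only works if the
administrator has reserved huge pages (e.g. via /proc/sys/vm/nr_hugepages),
and madvise(MADV_HUGEPAGE) is merely a hint to the kernel.

  #define _GNU_SOURCE
  #include <sys/mman.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  #define SZ (2UL * 1024 * 1024)      // one 2MB superpage

  int
  main(void)
  {
    // option 1: explicitly-backed huge pages; fails if none are reserved.
    void *p = mmap(0, SZ, PROT_READ|PROT_WRITE,
                   MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0);
    if(p == MAP_FAILED){
      // option 2: ordinary pages, plus a hint that the kernel may promote
      // this range using transparent huge pages.
      p = mmap(0, SZ, PROT_READ|PROT_WRITE,
               MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
      if(p == MAP_FAILED){ perror("mmap"); exit(1); }
      madvise(p, SZ, MADV_HUGEPAGE);
    }
    memset(p, 1, SZ);                 // touching all 2MB is the case where a superpage pays off
    printf("mapped 2MB at %p\n", p);
    return 0;
  }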