6.1810 2025 Lecture 10: User-level virtual memory

Reading: VM primitives for user programs by Appel & Li (1991)

Plan:
  OS kernel uses virtual memory in creative ways
  Paper argues user-level application can also benefit from VM
    Concurrent garbage collector
    Generation garbage collector
    Concurrent check-pointing
    Distributed shared memory
    Data-compression paging
    Persistent stores
  Many of the use cases are not widely-used but the bigger idea had impact
    Most OSes have mmap() and user-level pagefaults
    Many application use virtual memory in creative ways

What VM user-level VM primitives?
  Trap: handle page-fault traps in user mode
  Prot1: decrease the accessibility of a page
  ProtN: decrease the accessibility of N pages
  Unprot: increase the accessibility of a page
  Dirty: return a list of dirtied pages since previous calls
  Map2: map the same physical page at two different VAs with
    different prot levels.

Xv6 supports none of them
  One way to see the paper is that a good OS should support them
  What about Unix today?

Unix today: mmap()
  Maps memory into the address space (many flags and options)
  Example: mapping file
    mmap(NULL, len, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, offset)
  Examples: anonymous memory
    mmap(NULL, len, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)
    (preferred over sbrk)
  [see sqrts.c for example, and `man 2 mmap` for a detailed description]
    
Unix today: mprotect()
  Changes the permission of a mapping
  Examples:
    mprotect(addr, len, PROT_READ)
    mprotect(addr, len, PROT_NONE)
      results in a trap on access

Unix today: munmap()
  Removes a mapping
  Example:
    munmap(addr, len)

Unix today: sigaction()
  Configures a signal handler
    (Think sigalarm() lab)
  act.sa_sigaction = handle_sigsegv;
  act.sa_flags = SA_SIGINFO;
  segemptyset(&act.sa_mask)
  sigaction(SIGSEGV, &act, NULL);

Additional Linux VM calls
  Madvise(), Mincore(), Mremap(), Msync(), Mlock(), Mbind(), Shmat(),..

Can we support user-level VM primitives?
  Trap: sigaction and SIGSEGV
  Prot1: mprotect()
  ProtN: mprotect()
  Unprot: mprotect()
  Dirty: No, but workaround exists
  Map2: Not directly, but shm_open/mmap

Demo: large sqrt table backed by a single page
  Application code thinks there is a pre-computed big table
    n -> sqrt(n)
  Application can lookup sqrt:  sqrts[n]
    If present, very fast!
  Table is bigger than physical memory
  User-level VM primitives allow it to occupy only a *single* page

VM Implementation
  Address spaces consists of VMAs and page tables
  VMA (virtual memory area)
    contiguous range of virtual addresses
    same permissions
    backed by the same object (file, anonymous memory)
  VMA helps the kernel to decide how to handle page faults

Trap/sigaction implementation
  PTE (or TLB entry) marked "protected"
  CPU saves user state, jumps into kernel.
  Kernel asks VM system what to do?
    I.e. page in from disk? Core dump?
    VM system looks at VMA    
  Generate signal -- upcall into user process.
    Lower on user stack, or on separate stack...
  Run user handler, can do anything.
    Probably must call UNPROT for referenced page.
    That is, must avoid repeated fault.
  User handler returns to kernel.
  Kernel returns to user program.
  Continue or re-start instruction that trapped.

Use case: Concurrent GC
  Application allocates memory and computes with it
    Application is called the "mutator"
    Application doesn't have to call free
    (which is error prone, in particular in concurrent programs)
  Collector concurrently finds free memory
    Memory that isn't in use by application anymore
    Not reachable from root pointer or registers of any thread
  Traditional implementation requires language/compiler support
    Instrument loads/stores
  User-level virtual memory can avoid instrumenting load/stores
    Faster

Example: Baker's real-time GC algorithm  (see toy version in baker.c)
  https://dl.acm.org/citation.cfm?id=359460.359470
  Divide heap into 2 regions: from-space and to-space
    to-space is further divided in: scanned, unscanned, and new
  At start of collection: all objects are in from-space
    Copy roots to to-space (register and stack)
    Put forwarding pointers in from-space to the to-space copy
  At the end of scan, from-space is free memory
  Don't have to stop the world for GC
    Do a little of scanning on each allocation
    Or when dereferencing pointer in from-space
  Example with two circular list with 2 nodes: head->n1->n2->n1
    Assume head is a root
    At GC, copy head to space unscanned
    When app dereferences head it points into from space
      copy n1 into to space and record address of n1' in n1 (in from space)
    When app dereferences n1->next, it points into from space
      copy n2 into to space, and record address of n2' in n2 (in from space)
    When app dereferences n2->next, it points into from space
      n1, however, has been copied and contains n1'
      update n2->next to n1'

Observations
  Why attractive?
    Alloc is cheap. Compacts, so no free list.
    Incremental: every allocation scan/copy a little
    This is the "real-time" aspect
  What are the cost?
    Does pointer reside in the from space? (If so, it needs to be copied)
      Requires test and branch for every dereference
    Difficult to run collector and program at same time
      Race condition between collector tracing heap and program threads
      Risk: two copies of the same object

Solution: let application use VM
  https://dl.acm.org/doi/10.1145/53990.53992
  Avoid explicit checks for references in application threads
    After copying roots, unmap the unscanned part of to-space
      Initially a page with roots etc.
    Page fault when thread accesses unscanned region
      handlers scans just the page and inspects all objects, then UNPROT
      at most one fault per page
      no compiler changes
  Easy to make concurrent
    A collector thread can run concurrently with application thread
      A collector thread can UNPROT a page after scanning
    Only synchronization needed is for which thread is scanning which page
  Today: few systems uses baker's algorithm because it copies all live objects
    But the VM idea is cool

Are existing VM primitives good enough for concurrent GC?
  MAP2 is the only functionality issue
    Can work around with shm_open and mmap the object at different addresses
  Are traps &c fast enough?
    They say no: 500 us to scan a page, 1200 us to take the trap.
    Why not scan 3 pages?
    How much slower to run Baker's actual algorithm, w/ checks?
      VM version might be faster! Even w/ slow traps.
    What about time saved by 2nd CPU scanning? They don't count this.
  Is it an issue how often faults occur for concurrent GC?
    Not really -- more faults means more scanning.
    I.e. we'll get <= one fault per page, at most.

Measurements
  Sun3/60: 2080/0.12, 17333 adds
  i7-8550U CPU @ 1.80GHz (kaby lake), Linux 6.0.7
     One fault in sqrts.c is 5.4us on my laptop (wo. sqrt)
     One add @ 2.4Ghz is ~ 0.417ns; thus, ~ 12980 adds
       with power policy performance
     2024:  10233us for nfault 1000, Linux 6.11.6

A real use case of user-level VM: virtual-machine/container migration
  Move a process from machine M1 to machine M2
    E.g., a container or virtual machine
    To free up M1 so that it can be, for example, upgraded
  Plan: Checkpoint P, copy P to M2, restart P on M2
    Slow: checkpoint all memory before copying and restarting may take a long time
  Solution: use user-level VM
    Stop P on M1
    Checkpoint P except its memory image
    Start checkpoint on M2, but unmap its memory image
      in background start fetching memory image from M1
      when restarted P references not-fetched memory, it will get a page fault
      on fault, fetch page from M1
  Faster because can overlap sending memory with running P
    Needs trap, protN
  Another variation:
    Keep P running on M1 while copying over checkpoint (incl. memory) to M2
    Keep track of pages that P modifies before switching over to M2
      unmap P's memory
      on page fault record that this page must be sent

What has changed between 1991 and 2022?
  Tons of changes to VM API and implementations
    VM system evolves continuously
    Continued research too (e.g., see OSDI 2020)
  Other primitives
    mlock, munlock, madvise, mremap, mincore,...
    give applications more control
  Switching address space is free (tagged TLBs)
  Extended addressibility doesn't matter
    2^52 bytes of virtual address space