6.1810 2025 Lecture 10: User-level virtual memory Reading: VM primitives for user programs by Appel & Li (1991) Plan: OS kernel uses virtual memory in creative ways Paper argues user-level application can also benefit from VM Concurrent garbage collector Generation garbage collector Concurrent check-pointing Distributed shared memory Data-compression paging Persistent stores Many of the use cases are not widely-used but the bigger idea had impact Most OSes have mmap() and user-level pagefaults Many application use virtual memory in creative ways What VM user-level VM primitives? Trap: handle page-fault traps in user mode Prot1: decrease the accessibility of a page ProtN: decrease the accessibility of N pages Unprot: increase the accessibility of a page Dirty: return a list of dirtied pages since previous calls Map2: map the same physical page at two different VAs with different prot levels. Xv6 supports none of them One way to see the paper is that a good OS should support them What about Unix today? Unix today: mmap() Maps memory into the address space (many flags and options) Example: mapping file mmap(NULL, len, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, offset) Examples: anonymous memory mmap(NULL, len, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) (preferred over sbrk) [see sqrts.c for example, and `man 2 mmap` for a detailed description] Unix today: mprotect() Changes the permission of a mapping Examples: mprotect(addr, len, PROT_READ) mprotect(addr, len, PROT_NONE) results in a trap on access Unix today: munmap() Removes a mapping Example: munmap(addr, len) Unix today: sigaction() Configures a signal handler (Think sigalarm() lab) act.sa_sigaction = handle_sigsegv; act.sa_flags = SA_SIGINFO; segemptyset(&act.sa_mask) sigaction(SIGSEGV, &act, NULL); Additional Linux VM calls Madvise(), Mincore(), Mremap(), Msync(), Mlock(), Mbind(), Shmat(),.. Can we support user-level VM primitives? Trap: sigaction and SIGSEGV Prot1: mprotect() ProtN: mprotect() Unprot: mprotect() Dirty: No, but workaround exists Map2: Not directly, but shm_open/mmap Demo: large sqrt table backed by a single page Application code thinks there is a pre-computed big table n -> sqrt(n) Application can lookup sqrt: sqrts[n] If present, very fast! Table is bigger than physical memory User-level VM primitives allow it to occupy only a *single* page VM Implementation Address spaces consists of VMAs and page tables VMA (virtual memory area) contiguous range of virtual addresses same permissions backed by the same object (file, anonymous memory) VMA helps the kernel to decide how to handle page faults Trap/sigaction implementation PTE (or TLB entry) marked "protected" CPU saves user state, jumps into kernel. Kernel asks VM system what to do? I.e. page in from disk? Core dump? VM system looks at VMA Generate signal -- upcall into user process. Lower on user stack, or on separate stack... Run user handler, can do anything. Probably must call UNPROT for referenced page. That is, must avoid repeated fault. User handler returns to kernel. Kernel returns to user program. Continue or re-start instruction that trapped. Use case: Concurrent GC Application allocates memory and computes with it Application is called the "mutator" Application doesn't have to call free (which is error prone, in particular in concurrent programs) Collector concurrently finds free memory Memory that isn't in use by application anymore Not reachable from root pointer or registers of any thread Traditional implementation requires language/compiler support Instrument loads/stores User-level virtual memory can avoid instrumenting load/stores Faster Example: Baker's real-time GC algorithm (see toy version in baker.c) https://dl.acm.org/citation.cfm?id=359460.359470 Divide heap into 2 regions: from-space and to-space to-space is further divided in: scanned, unscanned, and new At start of collection: all objects are in from-space Copy roots to to-space (register and stack) Put forwarding pointers in from-space to the to-space copy At the end of scan, from-space is free memory Don't have to stop the world for GC Do a little of scanning on each allocation Or when dereferencing pointer in from-space Example with two circular list with 2 nodes: head->n1->n2->n1 Assume head is a root At GC, copy head to space unscanned When app dereferences head it points into from space copy n1 into to space and record address of n1' in n1 (in from space) When app dereferences n1->next, it points into from space copy n2 into to space, and record address of n2' in n2 (in from space) When app dereferences n2->next, it points into from space n1, however, has been copied and contains n1' update n2->next to n1' Observations Why attractive? Alloc is cheap. Compacts, so no free list. Incremental: every allocation scan/copy a little This is the "real-time" aspect What are the cost? Does pointer reside in the from space? (If so, it needs to be copied) Requires test and branch for every dereference Difficult to run collector and program at same time Race condition between collector tracing heap and program threads Risk: two copies of the same object Solution: let application use VM https://dl.acm.org/doi/10.1145/53990.53992 Avoid explicit checks for references in application threads After copying roots, unmap the unscanned part of to-space Initially a page with roots etc. Page fault when thread accesses unscanned region handlers scans just the page and inspects all objects, then UNPROT at most one fault per page no compiler changes Easy to make concurrent A collector thread can run concurrently with application thread A collector thread can UNPROT a page after scanning Only synchronization needed is for which thread is scanning which page Today: few systems uses baker's algorithm because it copies all live objects But the VM idea is cool Are existing VM primitives good enough for concurrent GC? MAP2 is the only functionality issue Can work around with shm_open and mmap the object at different addresses Are traps &c fast enough? They say no: 500 us to scan a page, 1200 us to take the trap. Why not scan 3 pages? How much slower to run Baker's actual algorithm, w/ checks? VM version might be faster! Even w/ slow traps. What about time saved by 2nd CPU scanning? They don't count this. Is it an issue how often faults occur for concurrent GC? Not really -- more faults means more scanning. I.e. we'll get <= one fault per page, at most. Measurements Sun3/60: 2080/0.12, 17333 adds i7-8550U CPU @ 1.80GHz (kaby lake), Linux 6.0.7 One fault in sqrts.c is 5.4us on my laptop (wo. sqrt) One add @ 2.4Ghz is ~ 0.417ns; thus, ~ 12980 adds with power policy performance 2024: 10233us for nfault 1000, Linux 6.11.6 A real use case of user-level VM: virtual-machine/container migration Move a process from machine M1 to machine M2 E.g., a container or virtual machine To free up M1 so that it can be, for example, upgraded Plan: Checkpoint P, copy P to M2, restart P on M2 Slow: checkpoint all memory before copying and restarting may take a long time Solution: use user-level VM Stop P on M1 Checkpoint P except its memory image Start checkpoint on M2, but unmap its memory image in background start fetching memory image from M1 when restarted P references not-fetched memory, it will get a page fault on fault, fetch page from M1 Faster because can overlap sending memory with running P Needs trap, protN Another variation: Keep P running on M1 while copying over checkpoint (incl. memory) to M2 Keep track of pages that P modifies before switching over to M2 unmap P's memory on page fault record that this page must be sent What has changed between 1991 and 2022? Tons of changes to VM API and implementations VM system evolves continuously Continued research too (e.g., see OSDI 2020) Other primitives mlock, munlock, madvise, mremap, mincore,... give applications more control Switching address space is free (tagged TLBs) Extended addressibility doesn't matter 2^52 bytes of virtual address space