6.824 2001 Lecture 7: VM implementation

What's in a PTE?
  About 20 bits of physical page number.
  Valid bit.
  Readable.
  Writeable.
  Referenced.
  Dirty.
  What if a program uses an invalid page?  A protected page?
    A page for which there's no PTE?
      The page table register is really a base and a bound.
  A TLB entry copies all this stuff.
  On a TLB miss, CPU microcode fetches the PTE.
    Or maybe it just traps to the O/S, and lets the O/S cook up the TLB entry.
    Machines with only a TLB -- MIPS.
  We're going to find out that the O/S must maintain its own copy of the page tables.
    So why duplicate them in full hardware PTE arrays?
    Hardware could support just a TLB, with software refill.

How does the O/S manage VM?
  Note that O/S goals are not just about address mapping.
  All about mapping memory-like (but usually not actual memory) objects:
    Files.
    Swap/paging space.
    Zero-fill pages.
    Copy-on-write not-really-there pages.
  Some objects are shared, mapped by many processes at different places.
  An address space is also non-homogeneous:
    Lots of distinct ranges with different properties.
  All this is hardware independent.
  Also need an interface to the low-level hardware.
    Can't rely on it to do much -- what if there's only a TLB?
    And hardware page tables are unlikely to be able to express all we want.

Common solution
  Separate the ideas:
  1. How process address ranges are mapped to objects.
     The virtual part.  Per-process.
  2. Where the underlying object data comes from.
     The physical part -- often not memory.  Per-object.
  3. The state of the VM hardware.
     View it as a cache of #1 -> #2 mappings.
  4. Global management of the limited # of physical pages.
  This split is used by Mach / SunOS / 4.4BSD.

Picture: a process contains a list of vm_map_entries.
  vm_map_entry
    vstart, vend, protection, offset, object pointer
    protection: read, write, copy-on-write
    these are not shared
    note that they are more space-efficient than PTE arrays,
      but not directly indexable
  object
    examples: file, anonymous zero-fill memory
    only knows about object-relative "physical" addresses
    may have a cache of vm_pages
    may be shared by many vm_map_entries
    implements read/write sharing of the object
  shadow object
    implements copy on write
    points to an underlying object
    holds a list of private copies of pages that have been modified

Effect of PROT1 and PROTN?  Appel/Li.
  PROT1: protect one page.
  PROTN: protect N pages in a batch, for efficiency.

Example
  r/o text -> file object
  r/w private data -> shadow object -> file object
  r/w stack -> anonymous object

After fork: new vm_map_entries for the child.
  Share text directly, since it's r/o.
  Share data through two more shadows.
    Shares pre-fork modifications, but not post-fork ones.
  Share stack via two shadows.

Points:
  More expressive, but slower, than PTE arrays.
  Must follow potentially long chains of shadows to look up a page.
  Often there are opportunities to discard or collapse shadows.

What happens on a page fault?
  CPU saves user program state on the stack, jumps to ... vm_fault().
  See the vm_fault() pseudo-code.
  Find the vm_map_entry, if any.
    Might be an unmapped address, or a protection violation.
    Kill the process -- or notify it with a signal!
  Follow the object chain.
    Might be a resident page, just not in the hardware map.
    Or a write access to a r/o (maybe copy-on-write) page.
    Might be found in a shadow object, or in the underlying object.
    Might be non-resident; read it from the file.
  If copy-on-write, and writing, and the page is not in the first shadow:
    Make a copy of the page.  Install it in the first shadow.
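
  As a concrete (if simplified) illustration, here is a small C sketch of
  the vm_map_entry / object structures and the object-chain walk that
  vm_fault() performs.  The field names and helper functions are invented
  for this handout -- they only loosely follow the real Mach/4.4BSD code,
  and locking, error handling, and the pager interface are omitted.

    /* Simplified sketch of the machine-independent VM structures and the
       object-chain walk in vm_fault().  Names loosely follow Mach/4.4BSD,
       but the fields and helpers here are invented for illustration. */
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/types.h>

    typedef uintptr_t vaddr_t;

    struct vm_page;                       /* a physical page cached by an object */

    struct vm_object {
        struct vm_object *shadow;         /* underlying object (NULL if none) */
        struct vm_page *(*lookup)(struct vm_object *, off_t);
        /* ... pager, list of resident vm_pages, reference count ... */
    };

    struct vm_map_entry {
        struct vm_map_entry *next;
        vaddr_t vstart, vend;             /* [vstart, vend) of this mapping */
        int protection;                   /* read / write / copy-on-write */
        off_t offset;                     /* offset of vstart within the object */
        struct vm_object *object;         /* first (possibly shadow) object */
    };

    /* Find the entry covering va; NULL means an unmapped address, so
       kill the process or deliver a signal. */
    struct vm_map_entry *
    vm_map_lookup(struct vm_map_entry *map, vaddr_t va)
    {
        struct vm_map_entry *e;
        for (e = map; e != NULL; e = e->next)
            if (va >= e->vstart && va < e->vend)
                return e;
        return NULL;
    }

    /* Walk the shadow chain looking for a resident copy of the page.
       A real vm_fault() would also stop at the first shadow on a write
       to a copy-on-write page, copy the page into it, and then call the
       pmap layer to install the mapping. */
    struct vm_page *
    vm_object_chain_lookup(struct vm_map_entry *e, vaddr_t va)
    {
        off_t off = e->offset + (off_t)(va - e->vstart);
        struct vm_object *obj;
        for (obj = e->object; obj != NULL; obj = obj->shadow) {
            struct vm_page *pg = obj->lookup(obj, off);
            if (pg != NULL)
                return pg;                /* resident in this object */
        }
        return NULL;  /* not resident anywhere: ask the bottom object's pager */
    }
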
  We end up with a vaddr and a physical page; now what?

Machine-dependent layer.
  Mach calls this layer "pmap".  Let's call the machine-independent layer the "VM layer".
  Calculating the full set of PTEs from VM layer info would be hard!
  So pmap is very lazy.  The VM layer only calls it:
    During a page fault, to create a mapping for the needed page.
    When taking away permissions or a mapping.
    To collect dirty or referenced information.
  pmap can always throw away mappings!
    It does not have to notify the VM layer.
    The process will fault, and the VM layer will re-install the mapping.
  The VM layer must assume pmap knows nothing.
    The VM layer tells pmap about every mapping or protection change.
    pmap can ignore many of the calls.
  pmap must handle cache alignment restrictions.
    For physical pages with multiple virtual mappings.
    So it must keep track of the vaddr[s] of every physical page.

Common hardware techniques for better VM.
  Common thread: O/S support for hardware optimizations.

Interaction of VM with physically-indexed data caches.
  Physically indexed caches are slow: must wait for the TLB.
    But they can be big, and hardware guarantees consistency.
  Trick: index with the offset part of the vaddr, tag with the physical address.
    Fast: the cache lookup can overlap with the TLB lookup.
    Hardware guarantees consistency -- no flushes.
    But limited in size to one page.

Interaction of VM with virtually-indexed data caches.
  Index the cache with some of the virtual address bits, before translation.
    Should be faster.
  But then two mappings for the same page may use different parts of the cache.
    Read through the 1st vaddr; write through the 2nd; the write is invisible to the 1st.
    Hardware cannot easily fix this by itself.
  The O/S could install only one mapping at a time.
    Flush when the other mapping is needed.
    Or allow either one writer, or many readers.
  The O/S could align all mappings in the cache.
    The O/S must have the freedom to pick virtual addresses.
    Low performance if the application insists on particular addresses.

Multiple-level page tables.
  One linear page table works badly.
    What if address use is sparse -- stack at the top?
    What if only a small fraction of valid addresses are actually used?
  Example: hardware-supported multi-level page tables.
    The highest level is often a context table.
  Now the O/S can get faults for page-table pages.
    It must page them as well as user processes.

Avoiding TLB flushes.
  Context IDs, as on the SPARC.
  The O/S must map processes to the limited # of context IDs and tell the hardware.
  The CPU has a current context ID register, used (w/ the vaddr) to index the TLB.

Summary
  Don't view VM as a thin layer just above the memory system.
  It's actually an important program / O/S interface.
    It allows the O/S to control what memory references refer to.
  Most of the implementation is in the O/S, not the hardware.
  The O/S uses this flexible control to improve performance.
  Applications can do the same.
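
  To make that last point concrete, here is a small user-level sketch in
  the spirit of Appel/Li's PROT1 / TRAP / UNPROT primitives: protect a
  page with mprotect(), catch the resulting SIGSEGV, and unprotect the
  page so the faulting store can be restarted.  This is illustrative
  POSIX code, not part of the original lecture; error checking is
  omitted, and calling mprotect() inside a signal handler is a
  simplification.

    /* User-level use of VM protection, in the spirit of Appel/Li:
       PROT1 (protect one page), TRAP (catch the fault), UNPROT.
       Illustrative only; error checking omitted. */
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static char *page;
    static long pagesize;

    /* The kernel's vm_fault() finds a protection violation and turns it
       into a SIGSEGV.  Re-enable access so the faulting store can be
       restarted.  (mprotect() in a handler is a simplification.) */
    static void
    on_fault(int sig, siginfo_t *info, void *ctx)
    {
        (void)sig; (void)info; (void)ctx;
        mprotect(page, pagesize, PROT_READ | PROT_WRITE);   /* UNPROT */
    }

    int
    main(void)
    {
        struct sigaction sa;

        pagesize = sysconf(_SC_PAGESIZE);
        page = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = on_fault;                         /* TRAP handler */
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);

        mprotect(page, pagesize, PROT_NONE);                /* PROT1 */
        page[0] = 'x';        /* faults; handler unprotects; access is retried */
        printf("wrote '%c' after handling the fault\n", page[0]);
        return 0;
    }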