6.828 2011 Lecture 20: Memory Management

Read: Carl A. Waldspurger, Memory Resource Management in VMware ESX Server, OSDI 2002.

Based on Eddie Kohler's lecture notes.

Here are three ideas to remember from the previous lecture:

big idea: virtualization
  VMM maintains virtual h/w state (IF, CPL, &c)
  guest sees/modifies only virtual state
  VMM may set up real state in totally different way
  need to interpose on all guest reads/writes of state

big idea: binary translation
  rewrite the guest code
  replace instructions that read/write machine state
    with instructions that modify just the virtual state
    and perhaps apply derived modification to shadow state
  hard: need to find instruction boundaries
  don't let guest know that you're moving code, so e.g. fn addrs different
  much faster than having privileged instructions trap
  BT useful far beyond virtual machines
    last week's race detectors: record lock held for every load/store
    run PPC binary on x86
  in fact, BT not so useful for VMs due to modern VM hardware

medium idea: shadow page tables
  two layers of mapping
    guest kernel page tables: virtual -> "physical"
    VMM maintains per-guest mapping: "physical" -> machine (or invalid)
    real MMU page table (i.e. "shadow" page table): virtual -> machine, as computed by VMM
      can play tricks, e.g. invalid mappings
  VMM monitors changes to guest page table
    reflects into shadow page table
  modern hardware directly provides two-level table

introduction
  today's paper is about physical memory management
    how to allocate phys mem among different tasks
  very important for VMM *and* for ordinary O/S kernels
  many clever techniques and tricks are known, useful for both

paper's situation
  company may run many distinct services
  want each on a separate O/S installation
    very painful to entangle e.g. web server and mail server
  many services are idle most of the time
    so dedicating a CPU + memory + disk very wasteful
  lots of VMs in one box: good performance, much less money
    "statistical multiplexing"

how to divide up physical memory among guests?
  suppose 4 GB of RAM, 4 guests
  guests think they are running on real h/w w/ fixed amt of RAM
  how much RAM should VMM tell each guest at boot?
    hard to predict actual needs!
    1 GB per guest: what if one needs 2 GB and others only 0.5?
    4 GB per guest: probably will use a lot less, so likely OK
      but what if sum of active use is > 4 GB?

how to make over-commit work?
  start with all phys pages on VMM's free list
  shadow page table has 4 GB of *invalid* PTEs
  allocate phys mem only during page fault
    "lazy" or "on-demand" allocation (see the sketch below)
  the millionth page fault will see an empty free list!
    "page out" some page to "swap space" on disk
    invalidate corresponding PTE
    re-use phys page for our page fault

most O/S kernels over-commit mem
  sum of process sizes > phys mem
  usually ok since most process pages are idle
    thus nothing lost by paging them out to disk
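
Here is a minimal sketch of how a VMM's fault handler might tie together
the shadow-page-table and lazy-allocation ideas above.  Not ESX's actual
code: every helper and name (pmap[], guest_pgtable_walk, vmm_swap_out,
&c) is made up for illustration.

  #include <stdint.h>

  #define NPPN      (1 << 20)         /* assumed # of guest "physical" pages */
  #define INVALID   ((uint64_t)-1)
  #define NOPAGE    ((uint64_t)-1)

  /* guest "physical" page # -> machine page #, or INVALID if not backed yet */
  uint64_t pmap[NPPN];

  void
  vmm_shadow_fault(struct guest *g, uint64_t va)
  {
      /* 1. walk the guest's own page table: virtual -> "physical" */
      uint64_t ppn = guest_pgtable_walk(g, va);

      /* 2. "physical" -> machine; back the page lazily, on first touch */
      if (pmap[ppn] == INVALID) {
          uint64_t mpn = machine_free_list_pop();
          if (mpn == NOPAGE) {
              /* free list empty: page some victim out to the VMM's swap
                 area, invalidate its shadow PTEs, reuse its machine page */
              struct vmm_page *victim = pick_victim();
              vmm_swap_out(victim);
              shadow_invalidate(victim);
              mpn = victim->mpn;
          }
          pmap[ppn] = mpn;
      }

      /* 3. install the composed virtual -> machine mapping in the shadow
            page table that the real MMU uses */
      shadow_pgtable_map(g, va, pmap[ppn], guest_pte_perms(g, va));
  }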
paper implies memory mgmt is the key VM sharing issue -- why?
  why isn't CPU sharing/scheduling the key problem?
    usually not a performance issue, just fairness or priority
  impact of doing a bad job w/ mem allocation?
    *huge* performance penalty if you have to page in from disk
    s/w runs at disk speed, not RAM speed
    maybe you sometimes see this on your PC -- disk light on all the time
  doing a good job is helpful
    efficient use of phys mem -> more disk data cached
  paper often trades CPU time to save memory
    suggests they think memory mgmt a worse bottleneck than CPU

let's look at the paper's techniques

Section 4: content-based sharing
  problem: same data may be in memory many times -- wasteful
  examples:
    instructions of programs/kernels running more than once
    same file content read by many programs/guests
    zero-filled pages
  how does a conventional O/S deal w/ duplicate data?
    shared libraries
    single file buffer cache used by all programs
    copy-on-write fork()
    basic idea: sharing is explicit, easy for kernel to observe
      open/read/mmap of same file
      fork()
  how does ESX deal with duplicate data?
    cannot see explicitly (doesn't see open, fork, read, &c)
    idea: directly check whether two physical pages have the same content
      very general, e.g. different files with same content
      potentially slow -- lots of comparisons
    details: (sketch after the paging discussion below)
      table of page content hashes
      pick random page, hash, see if table entry has same hash
      if duplicate:
        fix PTE to refer to existing phys page
        mark pages copy-on-write
        add duplicate page to free list
    downside: CPU time to compute hashes

Section 3: cooperative page reclamation
  problem: overcommitting memory
    guest 1 is using too much physical memory
    we want to take some back and give it to guest 2
    BUT guest 1 has data in that phys mem, can't just take the pages
    how best to cause the data to be paged out to disk?
  conventional O/S disk paging scheme
    sum of processes' mem >> phys mem -- overcommitted
    [diagram: address space table -> disk, pg tbl -> phys mem]
    some PTEs refer to phys pages
    other PTEs are invalid
      kernel tables indicate a disk block #
        block of a data or executable file
        block of a "swap" file
    think of phys mem as cache for data on disk
    what to do on a page fault for a paged-out page?
      need a phys page into which to read it
      steal a random page while in the pg flt handler?
        what if that page is dirty, or is actively in use?
    kernel keeps lists of phys page #s
      free
      inactive+clean
      inactive+dirty
      active
    kernel remembers what data is in memory
      file+off -> phys addr
    page fault:
      desired page in memory? perhaps on inactive or free list
        fix PTE to point to it
        move page to active list
      else
        take a physical page from free list
        read from disk -- maybe read-ahead too
        change PTE
        move page to active list
    background reclaim kernel thread: (sketch below)
      if inactive+dirty list too short
        move some not-recently-used pages from active to inactive+dirty or +clean
        modify PTEs to be invalid
      if inactive+clean too short
        write inactive+dirty pages to disk, move to +clean
        once you pick a page, write neighbors too, for batching
      if free too short
        move pages from inactive+clean to free
    eviction/replacement policies (choose pages to inactivate or free)
      huge # of clever ideas
      LRU? random? learn app patterns? app tells you?
      avoid shared pages? avoid dirty pages?
    important to move a bunch of pages at a time
      if we fall one page below free target, don't just move one page!
      "hysteresis" -- e.g. start moving when free < 5%, stop when > 7%
      batch disk writes
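
A sketch of the content-based sharing scan from Section 4.  The helpers
(hash_page_contents, remap_cow, &c) are hypothetical; real ESX differs
in details, e.g. how unshared pages are tracked as hints.

  #include <stdint.h>
  #include <stddef.h>

  struct hash_entry {
      uint64_t hash;
      uint64_t mpn;               /* machine page # already holding this content */
  };

  void
  scan_one_page(void)
  {
      uint64_t mpn = pick_random_machine_page();
      uint64_t h = hash_page_contents(mpn);

      struct hash_entry *e = hash_table_lookup(h);
      if (e != NULL && pages_identical(e->mpn, mpn)) {
          /* pages_identical() does a full byte-for-byte compare, since
             hashes can collide */
          /* repoint every mapping of mpn at e->mpn, read-only, so a later
             write faults and gets a private copy (copy-on-write) */
          remap_cow(mpn, e->mpn);
          machine_free_list_push(mpn);      /* the duplicate page is now free */
      } else {
          hash_table_insert(h, mpn);        /* remember it for future matches */
      }
  }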
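
And a sketch of the conventional O/S background reclaim daemon with
hysteresis described above.  The list names and the 5%/7% thresholds are
illustrative, not taken from any particular kernel.

  /* start reclaiming when free memory drops below 5%, stop above 7% */
  #define LOW_WATER   (phys_pages * 5 / 100)
  #define HIGH_WATER  (phys_pages * 7 / 100)

  void
  reclaim_daemon(void)
  {
      for (;;) {
          if (list_len(&free_list) < LOW_WATER) {
              while (list_len(&free_list) < HIGH_WATER) {
                  /* refill the inactive lists with not-recently-used pages
                     from the active list, invalidating their PTEs */
                  if (list_len(&inactive_clean) + list_len(&inactive_dirty) < HIGH_WATER)
                      deactivate_some_pages();

                  /* clean dirty pages in batches -- write neighbors too */
                  if (list_len(&inactive_clean) < HIGH_WATER)
                      write_back_batch(&inactive_dirty, &inactive_clean);

                  /* clean inactive pages can now become free */
                  move_some_pages(&inactive_clean, &free_list);
              }
          }
          sleep_until_woken_or_timeout();
      }
  }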
could a VMM do disk paging like a conventional O/S?
  yes -- VMM can page to its own swap area on disk
    and ESX does this if it has to
    but not such a great idea
  double paging
    VMM writes a page to swap, steals phys page, inval PTE
    guest O/S decides to page out that page
      doesn't know VMM has already paged it out
    guest O/S reads page -- so VMM pages it in
    but guest O/S just writes it back out to disk
    wasteful
    and likely to happen, since both will choose inactive pages
  O/S can make better decisions than VMM
    already has sophisticated policies
    knows which pages are clean
    understands app patterns

how does ESX solve this problem?
  essentially asks the guest to page out some pages
  can't do this directly -- guest O/S doesn't know about VMM
  "balloon driver" (sketch at the end of these notes)
    device driver loaded into guest kernel
    VMM can tell it to allocate from kern phys page allocator
    VMM then takes away those pages!
  thus:
    ESX decides guest 1 should use less phys mem
    tells balloon driver to grab N phys pages
    guest O/S takes from free list
      guest page-out daemon then reclaims pages from processes
      using whatever clever algorithms
    balloon driver gives the phys pages to VMM
    VMM gives phys pages to guest 2
      -> guest 2's balloon driver *frees* corresponding # of pages
  very clever
    effect is that O/S and VMM cooperate, use good O/S page-out policy
    but O/S basically doesn't know about VMM

Section 5: share-based memory allocation, idle memory
  problem:
    some guests are much more important than others
      for performance, need to give important guests more memory when needed
    but want to take mem from important guest if mem or guest are idle
    how to do both?
  conventional O/S solution
    you often see sophisticated *CPU* schedulers
      priority &c, low-pri can use time left idle by hi-pri
    i don't know of a sophisticated *memory* scheduler
    more typical to see global LRU replacement
      idle pages taken away, given to active processes
      achieves high overall throughput
    thus: if a hi-pri process is idle
      all its memory may be paged out
      and so it will be very slow when it resumes
    how to combine priority with idle page reclamation?
  ESX plan
    give each guest a number of shares
    if no idle mem, guest gets shares / totalshares
      i.e. phys mem in proportion to # of shares
    "adjusted shares / page ratio" indicates which guests use too much
      combines shares and idleness
    when a guest needs memory
      take it from the guest with lowest s/p
    p. 8 rho formula (sketch at the end of these notes)
      decreases guest's s/p with frac of pages that are idle
    thus:
      high-share guest has left much of its memory idle
      lower-share guests will take some of it -- using ballooning
      but never all of it!
      if the high-share guest starts using more memory
        it will take the memory back
    neatly combines:
      demand-driven sharing for high efficiency
      priorities
  how does ESX know what frac of a guest's phys pages are idle?
    clear PTE_P bit in N random PTEs
    wait 30 seconds
    track what fraction of the N faulted
    that's an estimate of the fraction of pages that are not idle

ideas and tricks to remember
  physical memory as cache for disk/file data
  over-commit / statistical multiplexing for bursty consumers
    only need sum of averages, not sum of maximums
  elastic consumers -- rather than rigid resource partition
    can benefit from more memory
    can survive with less
  on-demand / lazy allocation
    defer work, maybe you won't have to do it
  pageout in background daemon
    decouples allocation from freeing, reduces delays, allows batching
  soft page faults to see what pages are active (stats, lists)
  avoid policy -- ballooning
  deduplication via hashing
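
Appended here, a sketch of what a balloon driver might look like inside
the guest kernel.  The guest allocation calls and the hypercalls
(vmm_give_page, vmm_take_back_page) are hypothetical stand-ins, not
VMware's actual interface.

  #define MAX_BALLOON  (1 << 18)            /* assumed cap on balloon size, in pages */

  static struct page *balloon[MAX_BALLOON]; /* guest pages we currently hold */
  static int balloon_size;

  /* VMM wants this guest to give up memory: grow the balloon to target pages */
  void
  balloon_inflate(int target)
  {
      while (balloon_size < target) {
          /* allocate from the guest kernel's own page allocator; if memory
             is tight, this is what drives the guest's page-out daemon */
          struct page *pg = guest_alloc_page();
          if (pg == NULL)
              break;                          /* guest is under pressure; retry later */
          balloon[balloon_size++] = pg;
          /* tell the VMM it may reclaim the machine page backing this
             guest-"physical" page; the guest must never touch it again */
          vmm_give_page(page_to_ppn(pg));
      }
  }

  /* VMM is giving memory back: shrink the balloon */
  void
  balloon_deflate(int target)
  {
      while (balloon_size > target) {
          struct page *pg = balloon[--balloon_size];
          vmm_take_back_page(page_to_ppn(pg));   /* VMM re-backs the page */
          guest_free_page(pg);                   /* guest can use it again */
      }
  }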
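
And the "adjusted shares-per-page" ratio as a worked formula.  This is
the paper's rho from p. 8: rho = S / (P * (f + k*(1-f))), with
k = 1/(1 - tau) for idle-memory tax rate tau; the function and variable
names here are mine, and frac_active is the estimate from the PTE
sampling trick above.

  /* shares      = the guest's share allocation S
     pages       = P, phys pages currently allocated to the guest
     frac_active = f, estimated fraction of those pages recently touched
     tax         = tau, idle-memory tax rate (paper's default is 0.75) */
  double
  adjusted_shares_per_page(double shares, double pages,
                           double frac_active, double tax)
  {
      double k = 1.0 / (1.0 - tax);    /* cost of an idle page; tax = 0.75 => k = 4 */
      return shares / (pages * (frac_active + k * (1.0 - frac_active)));
  }

  /* when memory is needed, reclaim (via ballooning or VMM paging) from the
     guest with the lowest adjusted shares-per-page: a high-share guest with
     many idle pages loses some memory, but its shares still protect the
     pages it is actively using */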