Memory Allocation, Page Replacement and Working Sets


1. PROJECT:
   Many people want to combine projects for two classes or independent studies.
   i. You must have at least 3 people total on the project, at least 2 from this class.
   ii. If you cannot do that, come up with a project just for this class.

2. NOTES FROM REVIEWS

3. Memory management intro: looking at three problems:
   i. page replacement, e.g. clock, LRU, FIFO, working set
   ii. kernel page table management
   iii. VM page table management
   QUESTION: Why is this such a big deal?
   i. Benefit: avoiding one page fault saves ~5 million cycles, so you can afford to do a lot of work to avoid paging.
   ii. Opportunity: there are lots of different programs and no perfect prediction, so there is always an opportunity to improve.
   Major themes:
   i. Page replacement: global vs. local, etc.
      1. Issue: avoid thrashing, get good performance.
   ii. Kernel page table structures:
      1. Issue: portability across machines; using virtual memory for new things.
   iii. Virtual machine memory management:
      1. Issue: managing memory without the cooperation of an OS.

4. Converting a swap system to paging
   Problem: the VAX lacks reference bits -- the hardware provides only valid, dirty, and protection bits -- so there is no way to tell whether a page has been used.
   i. Claim: without at least minimal usage information, you can only do simple things (FIFO, RANDOM).
   High-level solution: mark a sampled page as invalid (a "reclaimable" state), so the next touch produces a cheap soft fault that stands in for the reference bit.
   i. VMS approach:
      1. Resident set == the set of pages for a process, managed FIFO.
      2. A global free list is shared by all processes; a process that faults on a page still on the free list can reclaim it cheaply.
   ii. VMS problem: the resident set size is fixed, so a process cannot use more memory even when it is available.
      1. VMS handled replacement, but did not solve the problem of how much memory to allocate.
   iii. General approach of the VMS system: hybrid replacement policies.
      1. Different policies for different areas (see the sketch after this section):
         a. FIFO for mapped pages (fast, no hardware support needed).
         b. LRU for the free list (references are explicit, because the pages are unmapped).
      2. For the right resident set size, this behaves close to pure LRU while the runtime cost is that of FIFO (LRU operations are rare).
         a. Like a hardware victim cache.
   iv. Why the VMS model does not apply to Unix:
      1. People use more processes.
      2. Processes are heterogeneous and vary widely in memory use.
         a. The behavior of processes varies widely.
         b. Example: a Lisp GC scans all of memory.
      3. So there cannot be a single resident set size.
         a. VMS gets away with it because part of the real resident set (the working set) lives on the free list.
         b. Better: dynamically vary the resident set size using a predictor -- the WORKING SET.
      4. This is used in Windows NT/2000.
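A minimal sketch of the VMS hybrid above, in Python (all names here are hypothetical): each process evicts FIFO from a fixed-size resident set onto a global free list kept in LRU order, and a fault on a page still on the free list is a cheap "soft" fault. For the right resident-set size this approximates LRU at FIFO cost, because hits do no bookkeeping at all.

    from collections import deque, OrderedDict

    class VMSMemory:
        """Sketch: per-process FIFO resident sets + a global free list
        in LRU order that acts as a second chance (like a victim cache)."""

        def __init__(self, resident_limit):
            self.resident_limit = resident_limit   # fixed resident-set size (the VMS problem)
            self.resident = {}                     # pid -> deque of pages, FIFO order
            self.free_list = OrderedDict()         # (pid, page) -> None; front = coldest

        def reference(self, pid, page):
            rs = self.resident.setdefault(pid, deque())
            if page in rs:
                return "hit"                       # no bookkeeping on a hit: FIFO cost
            if (pid, page) in self.free_list:      # soft fault: reclaim from the free list
                del self.free_list[(pid, page)]
                self._insert(pid, page, rs)
                return "soft fault"
            self._insert(pid, page, rs)            # hard fault: must read from disk
            return "hard fault"

        def _insert(self, pid, page, rs):
            if len(rs) >= self.resident_limit:     # FIFO eviction -- to the free list,
                victim = rs.popleft()              # not to disk
                self.free_list[(pid, victim)] = None
            rs.append(page)

    mem = VMSMemory(resident_limit=2)
    for p in [1, 2, 3, 1, 2]:                      # pages 1 and 2 leave the resident set
        print(p, mem.reference(0, p))              # but come back via cheap soft faults

When a frame is really needed, the system would pop the coldest entry off the front of free_list and write it out if dirty; that LRU ordering is what the explicit references to unmapped pages buy.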

5. UNIX approach:
   Clock algorithm: more robust to variations in the system:
   i. page fault cost
   ii. amount of memory
   iii. no real knobs to tune
   iv. QUESTION: Why is this important?
   Clock basics (see the sketch after this section):
   i. Put all allocated physical pages on a circle (the LOOP).
   ii. A pointer (the HAND) advances around the circle until it finds a replacement candidate.
      1. Candidate = a page that has not been referenced between two successive passes of the hand (one complete rotation).
      2. The first pass clears the reference bit.
      3. The second pass checks it.
   iii. SCAN == check the reference bit and clear it.
      1. Without reference bits:
         a. CLEAR = make the page reclaimable (unmap it).
         b. CHECK = is it VALID again (i.e., was it faulted back in)?
   Apply the clock to all of physical memory, not per process:
   i. The clock does both allocation and replacement, yielding variable-size partitions for programs.
   ii. This helps cooperating programs that trade off execution: they can trade off memory use too.
   Batching: maintain a free pool of page frames that are not in the loop.
   i. Trigger a scan when the free list is too small.
   ii. Scan a minimum number of pages per second in REAL TIME to repopulate the free list.
      1. E.g., 1/4 of real memory, 100 pages/sec.
      2. A fixed-rate hand implies a fixed age for old pages.
      3. Increase the scan rate as the free page list shrinks:
         a. a faster hand means younger pages get reclaimed.
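A minimal sketch of the clock loop described above (Python; names are hypothetical). On hardware with reference bits, access() models the bit being set on every touch; on a VAX, "clearing the bit" would mean unmapping the page and "checking" would mean seeing whether a soft fault remapped it.

    class Clock:
        """One hand over a circular list of frames: a page is a victim only
        if its reference bit stays clear between two passes of the hand."""

        def __init__(self, nframes):
            self.frames = [None] * nframes     # the LOOP of physical frames
            self.ref = [False] * nframes       # reference bits
            self.hand = 0

        def access(self, page):
            if page in self.frames:            # hit: the hardware sets the bit
                self.ref[self.frames.index(page)] = True
                return False                   # no fault
            while True:                        # miss: advance the HAND
                if self.frames[self.hand] is None or not self.ref[self.hand]:
                    self.frames[self.hand] = page          # found clear on second pass
                    self.ref[self.hand] = True
                    self.hand = (self.hand + 1) % len(self.frames)
                    return True                # fault
                self.ref[self.hand] = False    # first pass: clear, give a second chance
                self.hand = (self.hand + 1) % len(self.frames)

    clock = Clock(nframes=3)
    faults = sum(clock.access(p) for p in [1, 2, 3, 1, 4, 1, 5])
    print("faults:", faults)                   # 6 of the 7 accesses fault here

Note the hand runs over all frames regardless of owner, which is how it yields variable-size partitions: a program that keeps touching its pages keeps them; one that does not gradually donates frames to the others.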

   WHAT IS THE COST OF A SCAN?
   i. A page fault on a reclaimable page (a soft fault).
   ii. Because that cost is measurable, the overhead of scanning can be fixed by bounding the overhead of soft page faults.
   iii. They fix it at 10% of processor time (which seems high).
      1. QUESTION: General approach: rate-limit a cost.
      2. QUESTION: If memory demand is high, what happens to the scan rate?
         a. More scans are done on demand rather than in the background.
   iv. OTHER SOURCES OF PAGES:
      1. A swapped-out process' pages go on the reclaimable list.
      2. A terminated process' pages go on the reclaimable list.
         a. Both can be used again if the process restarts (fast re-running).
   NOTE: the length of the free pool indicates the amount of contention for memory.
   i. If the system cannot maintain a long list, there is high contention for memory.

6. ISSUES:
   Fork performance: copy-on-write was considered too complex (though it is used now).
   i. Use vfork() instead:
      1. Suspend the parent process.
      2. Let the child process run, but it promises not to touch memory.
      3. QUESTION: What is the issue?
         a. Not "correct" all the time; easy to get wrong.
         b. Cannot always be used (bad for backgrounding), so it is less modular.
   ISSUES: Load control
   i. PROBLEM: on a small system, you don't want to drown in paging traffic.
      1. SOLUTION: swap out whole programs rather than trying to keep their recent parts in memory; simply don't run the program.
   ii. QUESTION: What is a good policy?
      1. Largest first: frees the most memory (but the program may be important, and this starves large programs).
      2. Oldest first: gives priority to new programs, which are more interesting, but wastes time swapping small programs without freeing much memory.
      3. SOLUTION: pick the largest of the 4 oldest processes (see the sketch after this section).
         a. QUESTION: Why does this work?
            i. Old processes (which haven't run in a while) have often changed phase, so they no longer need the old memory that was still resident.
   FREE PAGE POOL:
   i. Smoothes the high-frequency component of memory demand by absorbing load spikes without having to swap immediately.
   ii. Decouples the collection of pages from allocation; allocation has to meet the average rate but not the instantaneous rate.
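A small sketch of the load-control heuristic above (hypothetical names and numbers): among the four processes that have been resident the longest, swap out the largest, freeing the most memory per swap while betting that old processes have changed phase anyway.

    def pick_swap_victim(processes):
        """processes: (pid, resident_since, resident_pages) tuples."""
        oldest_four = sorted(processes, key=lambda p: p[1])[:4]  # longest-resident first
        return max(oldest_four, key=lambda p: p[2])              # largest of those

    procs = [(1, 10.0, 200), (2, 12.5, 800), (3, 30.0, 50),
             (4, 41.0, 400), (5, 55.0, 90)]
    print("swap out pid", pick_swap_victim(procs)[0])            # -> pid 2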

   PERFORMANCE EVALUATION:
   i. Paging performance depends as much on I/O performance as on the replacement policy.
   ii. You need large, sequential streams of pages to swap in and out.
   iii. SOLUTION:
      1. Prefetch data on a miss to avoid future misses; the extra transfer is cheap.
         a. Cluster pages going out to get longer sequential writes.
      2. Mark prefetched pages unreferenced so they can be reclaimed more quickly.
         a. This balances the tension between holding prefetched data and other uses of memory; if prefetched pages were marked referenced, they would look too valuable.
   iv. NOTE: having a comparison system that performed better made it easier to optimize Unix.

7. VAX/VMS:
   Uses the hybrid system described above.
   Buffers modified pages selected for eviction for a while, to "cluster" them for locality on write-out.
   The swapper swaps in the entire resident set when scheduling a process (like a working set) rather than demand paging; this makes the program run more efficiently.
   Does not load a process unless there is enough space for it to run.

8. NOTE: Problem in Unix:
   When memory fills up, Unix has an ad-hoc policy of swapping out a whole program when things get bad.
   Is there a more principled solution for keeping the system in better shape?
   i. Big idea:
      1. There is little value in running a program if its memory is not available.

9. Working set.
   Context: automatic resource allocation and scheduling were pretty new things, as was multiprogramming. Previously, applications handled this themselves (e.g., explicitly swapping memory, or reserving a fixed amount).
   Key problems: efficiency and simple programming.
   Efficiency: full utilization of the CPU -- it was the critical resource.
   Simple programming: not having to deal with manual resource allocation.

10. Problems
   Prior work: automatically give pages to processes, using simple LRU or FIFO global page replacement.
   CLOCK problems: you need to know how fast the clock hands should move to make memory available. If they move too fast for some program, the memory it needs won't be there. If too slow, there is not enough free memory when needed.

   Problem: thrashing. Every program takes a page fault and blocks waiting for a page; meanwhile somebody takes another page from it, so when it comes back, it faults again.

11. OVERALL GOAL/APPROACH:
   Estimate the memory & CPU needed by a program.
   Make sure that much is available.
   Have models to back up the idea.

12. Other people: perhaps VM is not worth it? Ask for advice? E.g., let the programmer tell the system how much memory it needs and when to bring in pages?
   Problem:
   i. Usage may depend on the environment, which is hard for the programmer to predict.
   ii. Modular compilation makes this hard for compilers.
   Observation: it is not worth scheduling a process that doesn't have enough memory to make progress. The scheduler and the memory manager need to work together.

13. Working Set Approach: build a model of program behavior, then use the model plus measurements to make scheduling and allocation decisions.
   APPROACH (KEY): instead of figuring out which page to discard, figure out which pages should be there, and make sure they are.

14. Working set MODEL: a 2-level memory.
   Traverse time T == the time to transfer a page between the two memories.
   Goal of memory management:
   i. Decide which pages are in core -- i.e., decide which pages to REMOVE (not which to load).
   ii. Optimize to minimize page traffic == pages per unit time moved between the memories.
   iii. Page in on demand only.
   Prior work: LRU, Random, LFU, FIFO, ...
   Problem: susceptible to overload under heavy competition. EXPLAIN WHY.

15. Working set == the minimal set of pages that must be in main memory for a process to operate efficiently, without unnecessary page faults.
   Look at pages as the program executes (process time, not real time). A sketch of the definition follows below.
   [Figure: the window (t - tau, t] over the page-reference sequence, defining W(t, tau)]
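A tiny sketch of the definition (Python; the reference string is made up): W(t, tau) is just the set of distinct pages touched in the last tau references of process time.

    def working_set(refs, t, tau):
        """W(t, tau): distinct pages referenced in process-time window (t - tau, t]."""
        return set(refs[max(0, t - tau):t])

    refs = [1, 2, 1, 3, 2, 2, 4, 4, 4, 5]     # one reference per unit of process time
    for tau in (2, 4, 8):
        W = working_set(refs, t=len(refs), tau=tau)
        print(f"tau={tau}: W={sorted(W)}  w={len(W)}")
    # w(t, tau) is monotone in tau and grows sublinearly, because
    # W(t, 2*tau) = W(t, tau) | W(t - tau, tau): doubling the window
    # can at most union in one more window's worth of pages.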

   The model drives the notion that over some time scale tau you can capture the locality of a program: what is it referencing? Locality is both the size of the working set and its sensitivity -- how much it changes as tau changes. This lets you capture memory and CPU demand.
   i. QUESTION: How does the graph change for different programs?

16. Give the app what it demands; measure what it needs.
   W(t, tau) of a process at time t is the set of pages referenced in the interval (t - tau, t) == the pages referenced in the last interval of length tau.
   Working set size: w(t, tau) = |W(t, tau)|.
   Properties:
   i. Size: monotonically increasing in tau, but converging: w(t, 2*tau) <= 2*w(t, tau).
      1. Show on the plot.
      2. Prediction: immediate past behavior predicts the future:
         a. W(t, tau) is a good predictor of W(t + a, tau) for small a.
         b. For large a, it is not a good predictor.
         c. The probability that W(t + a, tau) and W(t, tau) do not intersect is small.
   ii. Re-entry rate: the rate at which pages fault back in.
      1. As tau is reduced, w(t, tau) shrinks, and the probability that useful pages are not in W(t, tau) increases.
      2. The re-entry rate can be computed from the inter-reference interval distribution:
         a. given the mean time between references to a page, you can compute that page's re-entry rate;
         b. if the mean time < tau, the page is always in memory;
         c. if the mean time > tau, the page will always have to page back in.
      3. Given the page fault rate, you can compute the return traffic rate (in real time):
         a. number of pages faulted in tau / execution time.
      4. You can compute the total page fault rate for a process as a function of tau:
         a. in real time, roughly lambda(tau) / (1 + lambda(tau) * T)
            i. = # of pages faulted / (execution time + fault time).
      5. BIG PICTURE: starting from a prediction of inter-reference intervals, you can estimate page traffic for a given working set window tau (see the sketch below).

17. Tau-sensitivity: sigma(tau) = how sensitive the re-entry rate lambda(tau) is to tau:
      sigma(tau) = -d lambda(tau) / d tau = f_x(tau) / xbar,
   i. where f_x is the probability density of the inter-reference interval distribution and xbar is the mean interval.
   Meaning: if tau is decreased by delta-tau, lambda(tau) increases by sigma(tau) * delta-tau.
   i. sigma is always positive; decreasing tau always increases the rate of faults.
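A hedged numeric sketch of item 16.ii (the page population is invented): classify each page by its mean inter-reference interval x. If x < tau the page never leaves W(t, tau); if x > tau it has aged out by the time it is referenced again, so roughly every one of its references is a re-entry, contributing about 1/x faults per unit of process time.

    def reentry_rate(page_intervals, tau):
        """page_intervals: mean inter-reference interval x for each page.
        Returns the estimated re-entries per unit of process time."""
        return sum(1.0 / x for x in page_intervals if x > tau)

    pages = [2, 3, 5, 8, 13, 21, 34]          # hypothetical mean intervals, one per page
    for tau in (1, 4, 10, 40):
        lam = reentry_rate(pages, tau)
        print(f"tau={tau:>2}: lambda(tau) ~ {lam:.3f}, "
              f"real-time fault rate ~ {lam / (1 + lam * 2):.3f}")   # traverse time T = 2

lambda(tau) falls monotonically as tau grows -- a larger window keeps more pages resident, trading memory for fewer re-entry faults -- which is exactly the always-positive tau-sensitivity sigma(tau) of item 17.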

18. NOTE: Choice of tau:
   Too small: pages may be removed while still useful.
   Too large: pages stay in memory too long, wasting it (the impact depends on how many simultaneous working sets you need in memory).
   Recommendation: tau comparable to the traverse time T.
   Example: for a page with inter-reference interval x:
   i. x < tau: the page is in memory 100% of the time -- it is always in W(t, tau).
   ii. tau < x < tau + T: the page is in memory for a tau/(tau + 2T) fraction of the time:
      1. it is in for tau, then paged out, during which time it is referenced,
      2. so it begins the round trip back (e.g., 2T for out to disk and back);
      3. the page reappears tau + 2T after the previous reference.
   iii. x > tau + T: the page is in memory for a tau/(x + T) fraction of the time:
      1. in memory for tau,
      2. swapped out,
      3. then referenced (at x),
      4. then brought back in (taking time T).
   [Fig. 4: residency fraction vs. inter-reference interval x, with a cliff at x = tau]
   i. To keep the cliff at tau small, set tau = 2T.

19. How do we detect W (the working set of pages)?
   W(t, tau) = the pages the process referenced in the last tau interval.
   i. Option: reference bits on each page, sampled and shifted at intervals that add up to tau (see the sketch below).
   ii. Or keep a count = # of references since the last interval.
   iii. Or: clock, etc.
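A sketch of the sampled-reference-bit detector of item 19.i (Python; K, the page names, and the tick data are all made up): every tau/K units of process time, shift each page's history left and shift in the hardware reference bit, clearing it as it is read. A page whose K-bit history is nonzero was referenced within roughly the last tau units, so it is counted in W(t, tau).

    K = 4                                    # K sampling intervals add up to tau

    def sample(history, ref_bits):
        """One sampling tick. history, ref_bits: dicts mapping page -> int."""
        for page in history:
            bit = ref_bits.pop(page, 0)      # read and clear the hardware bit
            history[page] = ((history[page] << 1) | bit) & ((1 << K) - 1)
        return {p for p, h in history.items() if h != 0}   # estimated working set

    history = {p: 0 for p in "ABC"}
    ticks = [{"A": 1, "B": 1}, {"A": 1}, {"A": 1}, {"A": 1, "C": 1}, {"A": 1}]
    for ref_bits in ticks:                   # hardware bits observed in each interval
        print(sorted(sample(history, ref_bits)))
    # B ages out K ticks after its last reference; C joins as soon as it is touched.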

20. Resource allocation with the working set:
   DEMAND = the intrinsic need of a program (independent of the system).
   Memory demand of process i: m_i = min(w_i / M, 1) = the fraction of main memory it needs (M = total main memory size).
   i. This depends on the tau used for w_i.
   Processor demand = the fraction of capacity used before blocking:
   i. = Q(t_i) / (N * A) == if the process was just given t_i seconds, how much more CPU it would use before blocking, divided by the total capacity (N processors * a standard time interval A).
   ii. NOTE: demand is per-CPU and ignores blocking time, because a process always eventually blocks for I/O or other processes.
   iii. It depends on the quantum size (since it is a fraction of a standard quantum).
   Balance = total memory demand < 1 AND total CPU demand < 1.
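A sketch of the balance test (Python; M and the job list are assumptions): each process contributes a memory demand m_i = min(w_i / M, 1) and a CPU demand, and the active set is "balanced" when both totals fit within capacity.

    M = 1000                                 # total page frames in main memory (assumed)

    def demands(jobs):
        """jobs: list of (w_i, cpu_i); returns (total memory, total CPU) demand."""
        mem = sum(min(w / M, 1.0) for w, _ in jobs)    # m_i = min(w_i / M, 1)
        cpu = sum(c for _, c in jobs)
        return mem, cpu

    jobs = [(300, 0.4), (450, 0.3), (400, 0.2)]
    print(demands(jobs))         # (1.15, 0.9): memory overcommitted -> would thrash
    print(demands(jobs[:2]))     # (0.75, 0.7): balanced, safe to run both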