CSE 120 -- July 18, 2006 -- Day 5: Memory
Instructor: Neil Rhodes

Translation Lookaside Buffer (TLB)
- Implemented in hardware
- A cache that maps virtual page numbers to page frames
- Associative memory: hardware looks up all cache entries simultaneously
- Usually not big: 64-128 entries
- TLB entry: page number, valid bit, modified bit, protection bits, page frame
- If the page is not present, do an ordinary page-table lookup, then evict an entry from the TLB and add the new one
  - Evict which entry?
- Serial/parallel lookup
  - Serial: first look in the TLB. If not found, then look in the page table
  - Parallel: look in the TLB and in the page table in parallel. If not found in the TLB, the page-table lookup is already in progress

Software TLB Management
- The MMU doesn't handle page tables; software does
- On a TLB miss, generate a TLB fault and let the OS deal with it
  - Search a larger in-memory cache. The page containing the cache must itself be in the TLB, for speed
  - If not in the cache, search the page table
  - Once the page frame, etc. is found, update the TLB
- Why not use hardware?
  - Logic to search the page table takes space on the die
  - That die area could be spent instead to:
    - Increase the memory cache
    - Reduce cost/power consumption

Cost Example
- Direct memory access: 100ns
- Without TLB: 200ns (lookup in the page table first)
- With TLB:
  - Assume the cost of a TLB lookup is 10ns
  - Assume the TLB hit rate is 90%
  - Serial lookup: average cost = .9*110ns + .1*(10ns+200ns) = 120ns
  - Parallel lookup: average cost = .9*110ns + .1*200ns = 119ns (a miss saves the 10ns TLB probe, since the page-table lookup is already in progress)
- Caches are very sensitive to:
  - Hit rate
  - Cost of a cache miss

TLB Summary
- Note that the TLB must be flushed on a context switch
  - Unless TLB entries include a process ID
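The cost example above can be checked with a few lines of Python. The constants are the slide's numbers (100ns memory access, 10ns TLB lookup, 90% hit rate); the two formulas are the serial and parallel effective access times.

```python
# Effective access time for the TLB cost example.
MEM = 100   # ns, one memory access
TLB = 10    # ns, one TLB lookup
HIT = 0.9   # TLB hit rate

# Serial: probe the TLB first; on a miss, pay the TLB probe plus a
# page-table access plus the actual memory access.
serial = HIT * (TLB + MEM) + (1 - HIT) * (TLB + MEM + MEM)

# Parallel: the page-table walk starts at the same time as the TLB probe,
# so a miss costs only the page-table access plus the memory access.
parallel = HIT * (TLB + MEM) + (1 - HIT) * (MEM + MEM)

print(f"serial:   {serial:.0f}ns")    # 120ns
print(f"parallel: {parallel:.0f}ns")  # 119ns
```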
Inverted Page Tables
- Traditional page tables: 1 entry per virtual page
- Inverted page tables: 1 entry per physical frame of memory
  - Hash table
- Space: proportional to the number of allocated memory frames
  - 1 entry in the hash table for each allocated page
- Why? Size: with 64-bit virtual addresses and 4KB pages, a traditional page table needs one entry per virtual page, per process. With 256MB of RAM, an inverted page table needs only 65536 entries (one per frame)
- Page table entry: process ID, virtual page number, additional PTE info
- (Figure: the virtual address <pid, p, offset> is hashed on p to locate frame f; the physical address is <f, offset>)
- Slow to search linearly through a table with 65536 entries
  - Solution: hash table. The key is the virtual page number. An entry contains the virtual page, process ID, and page frame
- Advantage: page-table memory is proportional to physical memory
  - Not to the logical address space
  - Not to the number of processes
- Disadvantage: hard to share memory between processes

Segmentation vs. Paging
                                                             Segmentation   Paging
  Need the programmer be aware the technique is being used?  Yes            No
  How many linear address spaces are there?                  Many           1
  Can the total address space exceed the size of phys. mem?  Yes            Yes
  Can procedures and data be distinguished and
    separately protected?                                    Yes            No
  Can tables whose size fluctuates be accommodated easily?   Yes            No
  Is sharing of procedures between users facilitated?        Yes            No

Page Fault Handling for Paging
- The MMU generates a page fault (protection violation or page not present). The page fault handler must:
  - Save registers
  - Figure out the virtual address that caused the fault
    - Often in a hardware register
  - If a protection problem, signal or kill the process
  - If writing to the page just past the currently-allocated stack:
    - Allocate a free page to add to the stack
    - Update the page table
    - Restart the instruction for the faulting process
      - Must undo any partial effects
  - Else, signal or kill the process
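The hash-table design described above can be sketched in a few lines of Python. The class and method names here are hypothetical; a Python `dict` stands in for the hash table keyed on (process ID, virtual page number).

```python
# Sketch of an inverted page table: one entry per allocated physical frame,
# found via a hash on (pid, virtual page number).

class InvertedPageTable:
    def __init__(self):
        # (pid, vpn) -> frame number; Python's dict is itself a hash table
        self.table = {}

    def map(self, pid, vpn, frame):
        self.table[(pid, vpn)] = frame

    def translate(self, pid, virtual_addr, page_size=4096):
        vpn, offset = divmod(virtual_addr, page_size)
        frame = self.table.get((pid, vpn))
        if frame is None:
            raise KeyError("page fault")   # not resident: OS must handle
        return frame * page_size + offset  # physical address

ipt = InvertedPageTable()
ipt.map(pid=7, vpn=3, frame=42)
print(ipt.translate(7, 3 * 4096 + 100))    # 42*4096 + 100 = 172132
```

Note the disadvantage mentioned above: because the key includes the pid, two processes sharing one frame would need two entries, which is why sharing is awkward with inverted tables.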
Virtual Memory
- Idea: use fast (small, expensive) memory as a cache for slow (large, cheap) disk
- 90/10 rule: processes spend 90% of their time in 10% of the code
- Not all of a process's address space need be in memory at a time
  - Illusion of near-infinite memory
  - More processes in memory (higher degree of multiprogramming)
- Locality
  - Spatial: the likelihood of accessing a resource is higher if a resource close to it was just referenced
  - Temporal: the likelihood of accessing a resource is higher if it was recently accessed

Page Fault Handling for Virtual Memory
- The MMU generates a page fault (protection violation or page not present)
  - Save registers
  - Figure out the virtual address that caused the fault
    - Often in a hardware register
  - If a protection problem, signal or kill the process
  - If no free frame, evict a page from memory (which one?)
    - If modified, write it to the backing store (dedicated paging space or a normal file)
      - Keep the disk location of this page (not in the page table, but in some other data structure)
        - The MMU doesn't need to know the disk location
      - Suspend the faulting process (resume when the write is complete)
  - Read data for the faulting page
    - From the backing store, or application code, or fill-with-zero
    - Suspend the faulting process (resume when the read is complete)
  - Update the page table
  - Restart the instruction for the faulting process
    - Must undo any partial effects

Paging and Translation Lookaside Buffer
- (Flowchart: the CPU checks the TLB. If the PTE is in the TLB, the CPU generates the physical address. Otherwise, access the page table. If the page is in main memory, update the TLB and return to the failed instruction. If not, and no page frame is free, the OS instructs the CPU to write a victim page to disk: the CPU activates the I/O hardware, the page is transferred from main memory to disk, and the page table is updated. Then the OS instructs the CPU to read the faulting page from disk: the CPU activates the I/O hardware, the page is transferred from disk to main memory, and the page table is updated.)

Resident Set Management
- How many page frames are allocated to each active process?
  - Fixed
  - Variable

What existing pages can be considered for replacement?
- Local: only the process that caused the page fault
- Global: all processes

Cleaning policy
- Pre-cleaning: write dirty pages out prospectively
- Demand-cleaning: write dirty pages out only as needed

Fetch policy
- Demand paging
- Prepaging: load extra pages speculatively while you're loading others
- Copy-on-write
  - Lazy duplication of pages. For example, on fork, don't copy a data page until a write occurs

Replacement Policy
- Which page, among those eligible, should be replaced?
- All policies want to replace pages that won't be needed for a long time
- Since most processes exhibit locality, recent behavior helps predict future behavior
- Eligibility may be limited based on locked frames
  - Kernel pages
  - I/O buffers in kernel space
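The copy-on-write idea above can be illustrated with a toy model. This is a sketch of the concept only, not how a real MMU/OS implements it; the class and method names are invented for illustration.

```python
# Toy copy-on-write (COW): on "fork", parent and child share frames;
# a private copy is made only on the first write.

class CowSpace:
    def __init__(self, pages):
        # page number -> frame contents (a mutable list stands in for a frame)
        self.pages = pages
        self.private = set()          # pages already copied by this process

    def fork(self):
        # The child shares every frame; neither side owns private copies yet.
        child = CowSpace(dict(self.pages))
        self.private.clear()
        return child

    def read(self, n):
        return self.pages[n][0]

    def write(self, n, value):
        if n not in self.private:     # first write: fault, copy the frame
            self.pages[n] = list(self.pages[n])
            self.private.add(n)
        self.pages[n][0] = value

parent = CowSpace({0: ["hello"]})
child = parent.fork()
child.write(0, "bye")                 # child now has its own copy of page 0
print(parent.read(0), child.read(0))  # hello bye
```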
Page References
- Assumption is that the sequence of page references exhibits locality
- A reference string is the list of page numbers used by a program
  - For example, <0 1 2 3 0 1 4 0 1 2 3 4>
- Consecutive references to the same page are removed
  - That page had better still be in memory
- Reference means read or write

OPT: the Optimal Page Replacement Policy
- Swap out the page that will be used farthest in the future
- Difficult to implement :) (requires knowing the future)
- Example reference string: <0 1 2 3 0 1 4 0 1 2 3 4>, three page frames

FIFO: First-In First-Out
- Swap out the page that's been in memory the longest
- Works well for swapping out initialization code
- Not so good for often-used code

FIFO: Belady's Anomaly
- For FIFO, adding extra page frames can cause more page faults
- Example reference string: <0 1 2 3 0 1 4 0 1 2 3 4>
  - Three page frames: 9 page faults
  - Four page frames: 10 page faults
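A short FIFO simulator makes Belady's anomaly concrete. Run on the slide's reference string, four frames really do take more faults than three.

```python
# FIFO page replacement, demonstrating Belady's anomaly.
from collections import deque

def fifo_faults(refs, nframes):
    frames = deque()   # oldest page at the left
    faults = 0
    for page in refs:
        if page not in frames:
            faults += 1
            if len(frames) == nframes:
                frames.popleft()      # evict the longest-resident page
            frames.append(page)
    return faults

refs = [0, 1, 2, 3, 0, 1, 4, 0, 1, 2, 3, 4]
print(fifo_faults(refs, 3))  # 9
print(fifo_faults(refs, 4))  # 10
```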
Least Recently Used (LRU)
- Remove the page that has been unused the longest
- Hardware: keep a counter in the PTE, updated on each use; find the PTE with the lowest counter to evict
  - Or, keep a linked list ordered by usage
- Example reference string: <0 1 2 3 0 1 4 0 1 2 3 4>

Clock (or Second-Chance)
- Choose the oldest page that hasn't been referenced
- Implementation:
  - Pages in a circular list
  - R bit maintained by hardware in the PTE
    - HW: whenever a PTE is accessed (read or write for that page), the R bit is set to 1
    - SW: can set the R bit to 0 or 1
  - When a page is loaded, set its R bit to 1
  - A hand points to a particular page. When a page is needed, check the R bit of that page
    - If set, clear it and move to the next page
    - If not set, this is the page to free
- Two levels of pages:
  - Old pages (those not referenced in the last clock cycle)
  - New pages (referenced in the last clock cycle)
- The algorithm picks one of the old pages
  - Not necessarily the oldest (as LRU would)
- Another way to look at it: FIFO with a second chance (if the page at the front of the list is referenced, clear the reference and put it at the back of the list)

Nth-Chance
- Clock gives a second chance, so it has 2 ages it can distinguish
- Give n chances instead: don't evict a page unless the hand has swept by n times
  - Need a counter in the PTE
- The higher we make n, the closer it approximates LRU
- Can it loop infinitely?
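The clock algorithm above can be simulated in a few lines. This is a sketch (class and method names are invented): each resident page carries an R bit, and the hand sweeps, clearing R bits, until it finds one that is 0. Note the sweep always terminates: after at most one full revolution every R bit has been cleared, so a victim is found.

```python
# Clock (second-chance) page replacement.

class Clock:
    def __init__(self, nframes):
        self.nframes = nframes
        self.frames = []      # list of [page, R-bit], in circular order
        self.hand = 0
        self.faults = 0

    def access(self, page):
        for entry in self.frames:
            if entry[0] == page:
                entry[1] = 1                       # hardware sets the R bit
                return
        self.faults += 1                           # page fault
        if len(self.frames) < self.nframes:
            self.frames.append([page, 1])
            return
        while self.frames[self.hand][1] == 1:
            self.frames[self.hand][1] = 0          # give a second chance
            self.hand = (self.hand + 1) % self.nframes
        self.frames[self.hand] = [page, 1]         # evict; load the new page
        self.hand = (self.hand + 1) % self.nframes

clock = Clock(3)
for page in [0, 1, 2, 3, 0, 1, 4, 0, 1, 2, 3, 4]:
    clock.access(page)
print(clock.faults)  # 9
```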
Working-Set Model
- Developed by Denning
- W(Δ, t): the set of pages a process has accessed from time t−Δ to time t
  - t is virtual time (measured in memory accesses)
  - Δ is the size of the window (a larger window means a possibly larger set of pages)
- The working set can grow and shrink over time
- Idea for an algorithm:
  - Monitor the working set of each process
  - Shrink/grow the page frames allocated to a process down/up to the size of its working set
  - If there is not enough space for the working set, swap the process to disk
- Difficulties
  - What size of Δ to use?
  - Keeping track of the working set is very difficult
- Approximation: monitor the page-fault frequency of the process
  - Exceeds an upper threshold: add a page frame
  - Below a lower threshold: remove a page frame

Keeping Free Pages
- Keeping some clean free pages makes page faults faster
  - No need to run the page replacement algorithm: just go to the free list
  - Only need to wait for the page to be brought in (instead of first waiting for a dirty page to be written out)
- Retain the contents of freed page frames
  - If requested again, reuse the page frame without I/O
- Write modified page frames lazily
  - Save them in a modified-page list
  - Write them out in groups (based on disk locality)

Thrashing
- What it is: spending more time paging than doing real work
- Why it happens: if the degree of multiprogramming gets too high, each process's working set is not resident
  - With local replacement, the number of frames allocated to a process isn't enough (fighting within a process)
  - With global replacement, one process causes pages from other processes' working sets to be evicted (fighting among processes)
- Solution: reduce the degree of multiprogramming
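The definition of W(Δ, t) above translates directly into code: the working set at virtual time t is the set of distinct pages touched in the last Δ references. A minimal sketch (the function name and reference string are illustrative):

```python
# W(delta, t): distinct pages referenced in the last delta accesses up to
# virtual time t, where refs[t-1] is the access made at time t.

def working_set(refs, delta, t):
    window = refs[max(0, t - delta):t]
    return set(window)

refs = [0, 1, 2, 1, 1, 3, 1, 2, 2, 1]
print(working_set(refs, delta=4, t=10))   # last 4 accesses -> {1, 2}
print(working_set(refs, delta=10, t=10))  # whole history -> {0, 1, 2, 3}
```

This also shows why a larger Δ can only grow the set: widening the window never removes pages from it.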
  - Swap processes out to disk
- How to determine a good degree of multiprogramming?
  - Look at the utilization of the paging device (50% utilization is optimal)
  - Look at the mean time between faults versus the mean time to service a fault (equal maximizes CPU utilization)
  - For the clock algorithm, look at the rate at which the hand scans through the clock
    - Too low:
      - Few page faults: not many requests to move the pointer
      - Not scanning many pages per request: most pages not referenced
    - Too high:
      - High fault rate
      - Scanning many pages per request: most pages are referenced

Memory-Mapped Files
- A file can be mapped into an address space
  - The pager must read/write from the file, similar to the way it pages in from an executable
- Processes can read and write using memory access rather than file read/file write
  - Written data is cached in page frames
  - Difficult to change EOF
  - Can be shared between processes
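Memory-mapped files as described above are directly available in Python via the standard-library `mmap` module: after mapping, the file's bytes are read and written with ordinary slicing rather than `read`/`write` calls.

```python
# Memory-mapped file access: bytes of the file behave like memory.
import mmap
import os
import tempfile

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello, paging")          # create a small file to map
    path = f.name

with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as m:   # map the whole file
        print(bytes(m[0:5]))              # read via memory access: b'hello'
        m[0:5] = b"HELLO"                 # write via memory access

with open(path, "rb") as f:
    data = f.read()
print(data)                               # b'HELLO, paging'
os.unlink(path)
```

Note the "difficult to change EOF" point above: slice assignment through the mapping must keep the same length; growing the file requires resizing the mapping.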
Page Sizes
- Advantages of a smaller page size:
  - Less internal fragmentation
    - On average, each process's address space wastes P/2 bytes
- Advantages of a larger page size:
  - The TLB covers more bytes (TLB size * P), so a better TLB hit rate
  - Smaller page tables (need address space / P PTEs)
- As memory has become cheaper and address spaces have become larger, page sizes have increased
  - 1970s: VAX: 512 bytes
  - 1990s: PowerPC: 4KB
  - 1990s: Pentium: 4KB or 4MB (defined per secondary page table)
  - 1990s: MIPS: 16KB

Summary
- Some page replacement algorithms are better than others
  - OPT, LRU, Clock, FIFO (best to worst)
- Locality is what makes VM (or any caching) work
  - Physical memory is a cache for logical memory
- Keep the working set in memory
  - Otherwise, thrashing