Systems Group, Department of Computer Science, ETH Zürich. Systems Programming and Computer Architecture (252-0061-00). Timothy Roscoe. Herbstsemester 2016. Virtual Memory
8: Virtual Memory. Computer Architecture and Systems Programming, 252-0061-00, Herbstsemester 2016, Timothy Roscoe
Virtual Memory (up until now)
Programs refer to virtual memory addresses: movl (%rcx),%eax
- Conceptually a very large array of bytes; each byte has its own address
- Actually implemented with a hierarchy of different memory types
- The system provides an address space private to each process
Allocation: the compiler and run-time system decide where different program objects are stored; all allocation is within a single virtual address space.
But why virtual memory? Why not physical memory?
8.1: Recap: Address Translation
Address Translation: a level of indirection
Each process (Process 1 … Process n) has its own virtual memory, mapped through a per-process mapping onto the shared physical memory. Each process gets its own private memory space.
Address spaces
Linear address space: ordered set of contiguous non-negative integer addresses: {0, 1, 2, 3, …}
Virtual address space: set of N = 2^n virtual addresses {0, 1, 2, 3, …, N−1}
Physical address space: set of M = 2^m physical addresses {0, 1, 2, 3, …, M−1}
Clean distinction between data (bytes) and their attributes (addresses): each object can now have multiple addresses.
Every byte in main memory has one physical address and one (or more) virtual addresses.
System with physical addressing
The CPU issues a physical address (PA) directly to main memory (addresses 0 … M−1) and gets back a data word.
[Still] used in simple systems like embedded microcontrollers in devices like cars, elevators, and digital picture frames.
System with virtual addressing
The CPU issues a virtual address (VA); the MMU on the CPU chip translates it into a physical address (PA) before it reaches main memory (addresses 0 … M−1), which returns the data word.
Used in all modern desktops, laptops, and workstations. One of the great ideas in computer science.
Address translation with a page table
The page table base register (PTBR) points to the page table of the current process.
Virtual address = virtual page number (VPN) | virtual page offset (VPO).
The VPN indexes the page table; each entry holds a valid bit and a physical page number (PPN). Valid bit = 0: page not in memory (page fault).
Physical address = PPN | physical page offset (PPO), where PPO = VPO.
8.2: Uses of virtual memory
Why virtual memory (VM)?
Efficient use of limited main memory (RAM):
- Use RAM as a cache for the parts of a virtual address space; some non-cached parts are stored on disk, some (unallocated) non-cached parts are stored nowhere
- Keep only active areas of the virtual address space in memory; transfer data back and forth as needed
Simplifies memory management for programmers:
- Each process gets the same full, private linear address space
Isolates address spaces:
- One process can't interfere with another's memory, because they operate in different address spaces
- A user process cannot access privileged information: different sections of address spaces have different permissions
1: VM as a cache for more memory than you have
64-bit addresses: 16 Exabytes. Physical main memory: a few Gigabytes? And there are many processes.
1: VM as a tool for caching
Virtual memory: array of N = 2^n contiguous bytes; think of the array (its allocated part) as being stored on disk.
Physical main memory (DRAM) = cache for allocated virtual memory. Blocks are called pages; size = 2^p bytes.
Virtual pages (VPs) VP 0 … VP 2^(n−p)−1 live on disk or SSD and may be unallocated, cached, or uncached; physical pages (PPs) PP 0 … PP 2^(m−p)−1 hold the cached ones in DRAM.
Memory hierarchy: Skylake CPU
Registers; L1 I-cache and L1 D-cache (32 KB each); L2 cache (~256 KB); L3 unified cache (~8 MB); main memory (a few GB); SSD. The L1/L2/L3 caches use 64 B blocks. Throughput falls and latency grows at each level (L1 ~4 cycles, L2 ~12 cycles, L3 ~44 cycles, main memory hundreds of cycles, SSD millions). Not drawn to scale. The miss penalty (latency) grows by orders of magnitude at the lower levels.
DRAM cache organization
DRAM cache organization is driven by the enormous miss penalty: DRAM is about 10x slower than SRAM; disk is about 10,000x slower than DRAM (for the first byte; faster for successive bytes).
Consequences:
- Large page (block) size: typically 4–8 KB, sometimes 4 MB
- Fully associative: any VP can be placed in any PP; requires a large mapping function, unlike CPU caches
- Highly sophisticated, expensive replacement algorithms: too complicated and open-ended to implement in hardware
- Write-back rather than write-through
Why does it work? Locality!
Virtual memory works because of locality. At any point in time, programs tend to access a set of active virtual pages called the working set. Programs with better temporal locality have smaller working sets.
If (working set size) < (main memory size): good performance for one process, after compulsory misses.
If SUM(working set sizes) > (main memory size): thrashing, a performance meltdown where pages are swapped (copied) in and out continuously.
Problem 2: Memory management
Each process (Process 1, Process 2, …, Process n) has a stack, heap, .text, and .data, laid out from address 0x0. What goes where in physical main memory?
2. VM as a tool for memory management
Key idea: each process has its own virtual address space.
- It can view memory as a simple linear array
- A mapping function scatters addresses through physical memory
- Well-chosen mappings simplify memory allocation and management
The virtual address spaces of Process 1 and Process 2 (VP 1, VP 2, …, up to N−1) translate into one physical address space (up to M−1; e.g., PP 2, PP 6, PP 8), e.g. sharing read-only library code.
2. VM as a tool for memory management
Memory allocation:
- Each virtual page can be mapped to any physical page
- A virtual page can be stored in different physical pages at different times
Sharing code and data among processes:
- Map virtual pages of both processes to the same physical page (here: PP 6, e.g. read-only library code)
3. Using VM to simplify linking and loading
Linking: each program has a similar virtual address space; code, stack, and shared libraries always start at the same address.
Loading: execve() allocates virtual pages for the .text and .data sections and creates PTEs marked as invalid. The .text and .data sections are then copied, page by page, on demand by the virtual memory system.
Layout (top to bottom): kernel virtual memory (invisible to user code); user stack (created at runtime, growing down from %rsp); memory-mapped region for shared libraries; run-time heap (created by malloc, growing up to brk); read/write segment (.data, .bss) and read-only segment (.init, .text, .rodata), loaded from the executable file; unused.
Problem 3: Protection — Process i and Process j must not corrupt each other's physical main memory.
Problem 4: Sharing — Process i and Process j may want to share regions of physical main memory.
4. VM as a tool for memory protection
Extend PTEs with permission bits; the page fault handler checks these before remapping. If they are violated, the process is sent SIGSEGV (segmentation fault).

Process i:   SUP  READ  WRITE  Address
  VP 0:      No   Yes   No     PP 6
  VP 1:      No   Yes   Yes    PP 4
  VP 2:      Yes  Yes   Yes    PP 2

Process j:   SUP  READ  WRITE  Address
  VP 0:      No   Yes   No     PP 9
  VP 1:      Yes  Yes   Yes    PP 6
  VP 2:      No   Yes   Yes    PP 11

Physical address space: PP 2, PP 4, PP 6, PP 8, PP 9, PP 11
Summary
Virtual Memory is an important tool for:
1. Caching disk contents in RAM
2. Performing memory management
3. Simplifying linking and loading
4. Protecting memory
8.3: The address translation process
Address translation: page hit
1) Processor sends virtual address (VA) to MMU
2–3) MMU fetches the PTE (at address PTEA) from the page table in cache/memory
4) MMU sends the physical address (PA) to cache/memory
5) Cache/memory sends the data word to the processor
Address translation: page fault
1) Processor sends virtual address (VA) to MMU
2–3) MMU fetches the PTE (at address PTEA) from the page table in cache/memory
4) Valid bit is zero, so the MMU triggers a page fault exception; control transfers to the page fault handler
5) Handler identifies a victim page (and, if dirty, pages it out to disk)
6) Handler pages in the new page from disk and updates the PTE in memory
7) Handler returns to the original process, restarting the faulting instruction
8.4: Translation lookaside buffers
Speeding up translation with a TLB
Page table entries (PTEs) are cached in L1 like any other memory word:
- PTEs may be evicted by other data references
- A PTE hit in L1 still costs an extra 1- (or 2-) cycle delay
Solution: Translation Lookaside Buffer (TLB)
- Small hardware cache in the MMU
- Maps virtual page numbers to physical page numbers
- Contains complete page table entries for a small number of pages
TLB hit
1) Processor sends virtual address (VA) to MMU
2) MMU sends the VPN to the TLB
3) TLB returns the matching PTE
4) MMU sends the physical address (PA) to cache/memory
5) Cache/memory sends the data word to the processor
A TLB hit eliminates a memory access.
TLB miss
1) Processor sends virtual address (VA) to MMU
2) MMU sends the VPN to the TLB: miss
3) MMU fetches the PTE (at address PTEA) from the page table in cache/memory
4) The PTE is installed in the TLB
5) MMU sends the physical address (PA) to cache/memory
6) Cache/memory sends the data word to the processor
A TLB miss incurs an additional memory access (the PTE). Fortunately, TLB misses are rare (we hope).
8.5: A simple memory system example
A simple memory system
Addressing:
- 14-bit virtual addresses
- 12-bit physical addresses
- Page size = 64 bytes
Virtual address (bits 13…0): VPN = bits 13–6 (virtual page number), VPO = bits 5–0 (virtual page offset).
Physical address (bits 11…0): PPN = bits 11–6 (physical page number), PPO = bits 5–0 (physical page offset).
Simple memory system: page table
Only the first 16 entries (out of 256) are shown.

VPN  PPN  Valid    VPN  PPN  Valid
00   28   1        08   13   1
01   –    0        09   17   1
02   33   1        0A   09   1
03   02   1        0B   –    0
04   –    0        0C   –    0
05   16   1        0D   2D   1
06   –    0        0E   11   1
07   –    0        0F   0D   1
Simple memory system: TLB
16 entries, 4-way set associative. The VPN splits into TLBT (tag, bits 13–8) and TLBI (set index, bits 7–6).

Set  Tag PPN Valid   Tag PPN Valid   Tag PPN Valid   Tag PPN Valid
0    03  –   0       09  0D  1       00  –   0       07  02  1
1    03  2D  1       02  –   0       04  –   0       0A  –   0
2    02  –   0       08  –   0       06  –   0       03  –   0
3    07  –   0       03  0D  1       0A  34  1       02  –   0
Simple memory system: cache
16 lines, 4-byte block size. Physically addressed, direct mapped. The physical address splits into CT (tag, bits 11–6), CI (set index, bits 5–2), and CO (block offset, bits 1–0).

Idx  Tag  Valid  B0 B1 B2 B3     Idx  Tag  Valid  B0 B1 B2 B3
0    19   1      99 11 23 11     8    24   1      3A 00 51 89
1    15   0      –  –  –  –      9    2D   0      –  –  –  –
2    1B   1      00 02 04 08     A    2D   1      93 15 DA 3B
3    36   0      –  –  –  –      B    0B   0      –  –  –  –
4    32   1      43 6D 8F 09     C    12   0      –  –  –  –
5    0D   1      36 72 F0 1D     D    16   1      04 96 34 15
6    31   0      –  –  –  –      E    13   1      83 77 1B D3
7    16   1      11 C2 DF 03     F    14   0      –  –  –  –
Address translation example #1
Virtual Address: 0x03D4
VPN = 0x0F, TLBI = 0x3, TLBT = 0x03. TLB Hit? Yes. Page Fault? No. PPN = 0x0D.
Physical Address: 0x354
CO = 0x0, CI = 0x5, CT = 0x0D. Cache Hit? Yes. Byte returned: 0x36.
Address translation example #2
Virtual Address: 0x0B8F
VPN = 0x2E, TLBI = 0x2, TLBT = 0x0B. TLB Hit? No. Page Fault? Yes. PPN: TBD.
Physical Address: unknown until the page fault has been handled (CO, CI, CT, hit, byte: TBD).
Address translation example #3
Virtual Address: 0x0020
VPN = 0x00, TLBI = 0x0, TLBT = 0x00. TLB Hit? No. Page Fault? No. PPN = 0x28.
Physical Address: 0xA20
CO = 0x0, CI = 0x8, CT = 0x28. Cache Hit? No. Byte returned: from memory.
Summary
Programmer's view of virtual memory:
- Each process has its own private linear address space
- It cannot be corrupted by other processes
System view of virtual memory:
- Uses memory efficiently by caching virtual memory pages; efficient only because of locality
- Simplifies memory management and programming
- Simplifies protection by providing a convenient interpositioning point at which to check permissions
8.6: Multi-level page tables
Terminology
Virtual Page may refer to:
- a page-aligned region of virtual address space, and
- the contents thereof (a different copy might appear on disk)
Physical Page: a page-aligned region of physical memory (RAM); also called a Physical Frame (= Physical Page).
Alternative terminology: Page = contents, Frame = container. Page size may differ from frame size (rarely).
Linear page table size
Given: 4 KB (2^12) page size, a 48-bit address space, and 8-byte PTEs, how big a page table do you need?
Answer: 512 GB! 2^48 / 2^12 × 2^3 = 2^39 bytes.
2-level page table hierarchy
Level 1 page table: PTE 0 and PTE 1 point to level 2 page tables, PTE 2 through PTE 7 are null, PTE 8 points to a level 2 table for the stack, and the remaining (1K − 9) PTEs are null.
Level 2 page tables: the first two (PTE 0 … PTE 1023 each) map VP 0 … VP 1023 and VP 1024 … VP 2047: 2K allocated VM pages for code and data. The table reached through PTE 8 contains 1023 null PTEs and one PTE mapping VP 9215: one allocated VM page for the stack.
In between lie 6K unallocated VM pages (the gap) plus 1023 unallocated pages; none of these need level 2 tables.
Translating with a k-level page table
Virtual address (bits n−1 … 0): VPN 1 | VPN 2 | … | VPN k | VPO (bits p−1 … 0).
VPN 1 indexes the level 1 page table, whose entry points to a level 2 page table, and so on; the level k page table entry contains the PPN.
Physical address (bits m−1 … 0): PPN | PPO (bits p−1 … 0), with PPO = VPO.
8.7: Case study of the Core i7 virtual memory system
A typical Core i7 processor
Processor package with 4 cores. Each core: registers, instruction fetch, MMU (address translation); L1 d-cache and L1 i-cache (32 KB, 8-way each); L1 d-TLB (64 entries, 4-way) and L1 i-TLB (128 entries, 4-way); L2 unified cache (256 KB, 8-way); L2 unified TLB (512 entries, 4-way).
Shared by all cores: L3 unified cache (8 MB, 16-way); QuickPath interconnect (4 links @ 25.6 GB/s, 102.4 GB/s total) to the other cores and the I/O bridge; DDR3 memory controller (3 × 64 bit @ 10.66 GB/s, 32 GB/s total) to main memory.
Core i7 memory system*
Virtual address (VA): 36-bit VPN | 12-bit VPO. The VPN splits into a 32-bit TLBT and a 4-bit TLBI for the L1 TLB (16 sets, 4 entries/set). On a TLB miss, CR3 and the four 9-bit fields VPN1 … VPN4 walk the page tables to reach the PTE. The resulting 40-bit PPN plus the 12-bit PPO form the physical address (PA), which splits into CT (40 bits) | CI (6 bits) | CO (6 bits) for the L1 d-cache (64 sets, 8 lines/set). An L1 hit returns the 32/64-bit result; an L1 miss goes to L2, L3, and main memory.
*A bit simplified
More abbreviations
Components of the virtual address (VA): TLBI: TLB index; TLBT: TLB tag; VPO: virtual page offset; VPN: virtual page number.
Components of the physical address (PA): PPO: physical page offset (same as VPO); PPN: physical page number; CO: byte offset within cache line; CI: cache index; CT: cache tag.
x86-64 paging
48-bit virtual address: 256 terabytes (TB). Not yet ready for the full 64 bits: nobody can buy that much DRAM yet, the mapping tables would be huge, and a multi-level array map may not be the right data structure at that scale.
52-bit physical address = 40 bits for the PPN; requires 64-bit table entries.
Keeps the older x86 4 KB page size (including for the page tables themselves): (4096 bytes per PT) / (8 bytes per PTE) = 512 entries per page.
Core i7 TLB translation
(Memory-system diagram with the L1 TLB path highlighted: the 36-bit VPN splits into a 32-bit TLBT and a 4-bit TLBI; a hit in the L1 TLB — 16 sets, 4 entries/set — yields the 40-bit PPN directly, bypassing the page-table walk.)
Core i7 TLB
TLB entry (not all documented, so this is speculative): a 40-bit PPN, a 32-bit TLBTag, and flag bits V, G, S, W, D.
- V: indicates a valid (1) or invalid (0) TLB entry
- TLBTag: disambiguates entries cached in the same set
- PPN: translation of the address indicated by index & tag
- G: page is global, according to PDE/PTE
- S: page is supervisor-only, according to PDE/PTE
- W: page is writable, according to PDE/PTE
- D: the PTE has already been marked dirty (once is enough)
Structure of the data TLB: 16 sets (set 0 … set 15), 4 entries/set.
Translating with the i7 TLB
1. Partition the VPN into TLBT and TLBI.
2. Is the PTE for this VPN cached in set TLBI?
3. Yes: check permissions and build the physical address from the cached PPN and the PPO.
4. No (TLB miss, or a partial hit on the page-table walk cache): read the PTE (and the PDE if not cached) from memory and build the physical address.
Core i7 page translation
(Memory-system diagram with the page-table walk highlighted: on a TLB miss, CR3 and the four 9-bit fields VPN1 … VPN4 index successive page tables until the final PTE yields the 40-bit PPN.)
x86-64 paging
Virtual address: VPN1 | VPN2 | VPN3 | VPN4 | VPO (9 + 9 + 9 + 9 + 12 bits).
- VPN1 indexes the Page Map Level 4 Table (PML4E): 512 GB region per entry
- VPN2 indexes the Page Directory Pointer Table (PDPE): 1 GB region per entry
- VPN3 indexes the Page Directory Table (PDE): 2 MB region per entry
- VPN4 indexes the Page Table (PTE): 4 KB region per entry
The base register (BR) points to the first table. Physical address: PPN (40 bits) | PPO (12 bits).
Page-Map Level 4 entry
- Page-Directory-Pointer Base Address: 40 most significant bits of the physical page table address (forces page tables to be 4 KB aligned)
- Avail: these bits are available for system programmers
- A: accessed (set by the MMU on reads and writes, cleared by software)
- PCD: cache disabled (1) or enabled (0)
- PWT: write-through or write-back cache policy for this page table
- U/S: user or supervisor mode access
- R/W: read-only or read-write access
- P: page table is present in memory (1) or not (0)
Page Directory Pointer Entry (level 3)
- Page-Directory Base Address: 40 most significant bits of the physical page table address (forces page tables to be 4 KB aligned)
- Avail: these bits are available for system programmers
- A: accessed (set by the MMU on reads and writes, cleared by software)
- PCD: cache disabled (1) or enabled (0)
- PWT: write-through or write-back cache policy for this page table
- U/S: user or supervisor mode access
- R/W: read-only or read-write access
- P: page table is present in memory (1) or not (0)
Page Directory Entry (level 2)
- Page table physical base address: 40 most significant bits of the physical page table address (forces page tables to be 4 KB aligned)
- Avail: these bits are available for system programmers
- A: accessed (set by the MMU on reads and writes, cleared by software)
- PCD: cache disabled (1) or enabled (0)
- PWT: write-through or write-back cache policy for this page table
- U/S: user or supervisor mode access
- R/W: read-only or read-write access
- P: page table is present in memory (1) or not (0)
Page Table Entry (level 1)
- Physical Page base address: 40 most significant bits of the physical page address (forces pages to be 4 KB aligned)
- Avail: available for system programmers
- G: global page (don't evict from the TLB on a task switch)
- PAT: Page-Attribute Table
- D: dirty (set by the MMU on writes)
- A: accessed (set by the MMU on reads and writes)
- PCD: cache disabled or enabled
- PWT: write-through or write-back cache policy for this page
- U/S: user/supervisor
- R/W: read/write
- P: page is present in physical memory (1) or not (0)
x86-64 paging: page tables and page present
All levels present: PML4E (p=1) → PDPE (p=1) → PDE (p=1) → PTE (p=1) → data.
MMU action: build the physical address (PPN | PPO) and fetch the data word.
OS action: none.
x86-64 paging: page tables present, page missing
PML4E (p=1) → PDPE (p=1) → PDE (p=1) → PTE (p=0); the data page is on disk.
MMU action: page fault exception. The handler receives as arguments: the %rip of the fault, the VA of the fault, and the fault type (non-present page, or page-level protection violation).
x86-64 paging: page tables present, page missing (continued)
OS action:
- Check that the VA is legal
- Read the PTE through the page tables
- Find a free physical page (swapping out the current occupant if necessary)
- Read the VP from disk into the physical page
- Adjust the PTE to point to the PP and set p=1
- Restart the faulting instruction
x86-64 paging: page tables and page missing
An entry along the walk has p=0, so a page table itself, as well as the data page, is on disk.
MMU action: page fault.
OS action: swap in the missing page table, restart the faulting instruction by returning from the handler, and fall into the previous case. Multiple disk reads may be needed.
x86-64 paging: page table missing, page present
This introduces a consistency issue: potentially, every page-out would require an update of the on-disk page table. Linux disallows this case: if a page table is swapped out, its data pages are swapped out too.
8.8: What about the cache?
Core i7 cache access
(Memory-system diagram with the L1 d-cache path highlighted: the physical address splits into CT (40 bits) | CI (6 bits) | CO (6 bits) for the 64-set, 8-line/set L1 d-cache.)
L1 cache access
Physical address (PA): CT (40 bits) | CI (6 bits) | CO (6 bits).
- Partition the physical address into CO, CI, and CT
- Use CT to determine whether the line containing the word at address PA is cached in set CI
- No: check L2 (then L3, then DRAM)
- Yes: extract the word at byte offset CO and return it to the processor (64-bit data on an L1 hit)
Much older (P6) memory design
32-bit virtual address: 20-bit VPN | 12-bit VPO; the VPN splits into a 16-bit TLBT and a 4-bit TLBI for the TLB (16 sets, 4 entries/set). On a TLB miss, PDBR, PDE, and PTE walk a 2-level page table. Physical address: 20-bit PPN | 12-bit PPO, split into CT (20 bits) | CI (7 bits) | CO (5 bits) for the L1 cache (128 sets, 4 lines/set); misses go to L2 and DRAM, and a hit returns the 32-bit result.
Same size cache (just less associative). Why?
Speeding up L1 access
Physical address (PA): CT (40 bits) | CI (6 bits) | CO (6 bits). Virtual address (VA): VPN (36 bits) | VPO (12 bits). Address translation leaves CI unchanged, and the tag check needs no change either.
Observation: the bits that determine CI are identical in the virtual and physical address, so we can index into the cache while address translation is taking place. Generally we hit in the TLB, so the PPN bits (the CT bits) are available shortly afterwards for the tag check.
This is a virtually indexed, physically tagged cache — carefully sized to make this possible.
8.9: Large pages
x86-64 paging (recap)
Virtual address: VPN1 | VPN2 | VPN3 | VPN4 | VPO (9 + 9 + 9 + 9 + 12 bits), walking the Page Map Level 4 Table (512 GB per entry), Page Directory Pointer Table (1 GB per entry), Page Directory Table (2 MB per entry), and Page Table (4 KB per entry) from the base register (BR). Physical address: PPN (40 bits) | PPO (12 bits).
x86-64 large pages
Virtual address: VPN1 | VPN2 | VPN3 | 21-bit page offset (9 + 9 + 9 + 21 bits).
The walk stops at the Page Directory Table: PML4E (512 GB region per entry) → PDPE (1 GB region per entry) → PDE, which now maps a 2 MB page directly. Physical address: PPN | 21-bit PPO.
Page Directory Entry (level 2), recap
(Same PDE format as before: a 40-bit page table base address — 4 KB aligned — plus the Avail, A, PCD, PWT, U/S, R/W, and P bits.)
PDE for large pages
(Figure: the same PDE format, with a page-size bit set so the entry's base address points directly at a 2 MB page rather than at a page table.)
Huge pages
Virtual address: VPN1 | VPN2 | 30-bit page offset (9 + 9 + 30 bits).
The walk stops at the Page Directory Pointer Table: PML4E (512 GB region per entry) → PDPE, which now maps a 1 GB page directly. Physical address: PPN | 30-bit PPO.
PDPE (level 3) for huge pages
Large and huge pages
- Terminate the page table walk early: 2 MB and 1 GB pages
- Simplify address translation
- Useful for programs with very large, contiguous working sets: reduces compulsory TLB misses
- Hard to use: (Linux) hugetlbfs support; use the libhugetlbfs {m,c,re}alloc replacements
8.10: Optimizing for the TLB
Buffering: example MMM
Matrix-matrix multiply, blocked for the cache with block size B × B: c = a · b + c.
Assume blocking for the L2 cache: say 512 KB = 2^19 B = 2^16 doubles = C. Fitting three blocks, 3B^2 ≤ C, means B ≈ 150.
Buffering: example MMM
But: look at one iteration of c = a · b + c with block size B = 150, and assume a matrix row is > 4 KB = 512 doubles.
Consequence: each row is used O(B) times, but with O(B^2) operations between uses, and each row lies on a different page. With more rows than TLB entries: TLB thrashing.
Solution: buffering = copy the block to contiguous memory. O(B^2) copy cost for O(B^3) operations.