Morgan Kaufmann Publishers, 26 February 2018
COMPUTER ORGANIZATION AND DESIGN: The Hardware/Software Interface, 5th Edition
Chapter 5: Virtual Memory

Review: The Memory Hierarchy
- Take advantage of the principle of locality to present the user with as much memory as possible at the fastest speed and cheapest price.
- Access time increases with distance from the processor:
  Processor <-> L1$ (4-8 bytes, word) <-> L2$ (8-32 bytes, block) <-> Main Memory (1 to 4 blocks) <-> Secondary Memory (1,024+ bytes, disk sector = page)
- Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in main memory, which is a subset of what is in secondary memory.
- (Relative) size of the memory grows at each level.

Chapter 5: Large and Fast: Exploiting Memory Hierarchy
How is the Hierarchy Managed?
- Registers <-> cache: by the compiler or programmer.
- Cache <-> main memory: by the cache controller hardware.
- Main memory <-> disks: by the operating system (virtual memory); virtual-to-physical address mapping is assisted by the hardware.

Virtual Memory
- Use main memory as a "cache" for secondary memory:
  - Allows efficient and safe sharing of memory among multiple programs.
  - Provides the ability to run programs larger than the size of physical memory.
  - Simplifies loading a program for execution by providing for code relocation (i.e., the code can be loaded anywhere in main memory).
- Each program is compiled into its own address space, a virtual address space:
  - During run time, each virtual address must be translated to a physical address, i.e., an address in main memory.
- A virtual memory block is called a page; a virtual memory translation miss is called a page fault.
Two Programs Sharing Physical Memory
- A program's address space is divided into pages (fixed size) or segments (variable size):
  - The starting location of each page (either in main memory or in secondary memory) is contained in the program's page table.

Address Translation
- A virtual address is translated to a physical address by a combination of hardware and software.
- Each memory request first requires an address translation from the virtual space to the physical space.

  Virtual Address (VA):    31 30 ... 12 | 11 ... 0
                           virtual page number | page offset
                                  |
                              translation
                                  v
  Physical Address (PA):   29 ... 12 | 11 ... 0
                           physical page number | page offset
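The bit split above can be sketched in a few lines of Python. This is an illustrative sketch, not the book's code: it assumes 4 KB pages (12 offset bits), and the `page_table` dict is a hypothetical stand-in for the real page table structure.

```python
# Split a 32-bit virtual address into VPN and page offset, then rebuild the
# physical address. Assumes 4 KB pages, matching the 12 offset bits on the slide.
PAGE_OFFSET_BITS = 12
PAGE_SIZE = 1 << PAGE_OFFSET_BITS

def translate(va, page_table):
    """Map a virtual address to a physical address via a VPN -> PPN mapping."""
    vpn = va >> PAGE_OFFSET_BITS       # bits 31..12: virtual page number
    offset = va & (PAGE_SIZE - 1)      # bits 11..0: page offset, unchanged
    ppn = page_table[vpn]              # translation replaces only the VPN
    return (ppn << PAGE_OFFSET_BITS) | offset
```

For example, with the (made-up) mapping VPN 0x00403 -> PPN 0x2A7, the virtual address 0x00403ABC translates to 0x2A7ABC: only the page number changes, the offset is carried through.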
Page Fault Penalty
- On a page fault, the entire page must be fetched from disk:
  - Takes millions of clock cycles.
  - Handled by the operating system.
- Try to minimize the page fault rate:
  - Fully associative placement of pages in main memory.
  - Smarter replacement algorithms.

Page Tables
- Store placement information:
  - A page table is an array of page table entries (PTEs), indexed by virtual page number.
  - The page table register points to the page table in physical memory.
- If the page is present in memory:
  - The PTE stores the physical page number.
  - Plus other status bits (referenced, dirty, ...).
- If the page is not present: page fault, and the OS gets involved.
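The PTE structure and lookup described above can be sketched as follows. The field and class names (`PTE`, `PageFault`, `lookup`) are illustrative, not from the book; a real PTE is a packed machine word, not a Python object.

```python
# A page table as an array/dict of PTEs indexed by virtual page number.
class PTE:
    def __init__(self):
        self.valid = False       # is the page present in main memory?
        self.referenced = False  # set on access, periodically cleared by the OS
        self.dirty = False       # set on write; needed for write-back
        self.ppn = None          # physical page number, meaningful only if valid

class PageFault(Exception):
    """Translation miss: the OS would service this (millions of cycles)."""

def lookup(page_table, vpn, write=False):
    pte = page_table[vpn]
    if not pte.valid:
        raise PageFault(vpn)     # page not present: OS fetches it from disk
    pte.referenced = True        # status bits updated on every access
    if write:
        pte.dirty = True
    return pte.ppn
```

A valid entry yields the physical page number and updates the status bits; an invalid entry raises the fault that the OS handler would catch.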
Translation Using a Page Table

Replacement and Writes
- To reduce the page fault rate, prefer least-recently used (LRU) replacement:
  - A reference bit in the page table entry is set to 1 on any access to the page.
  - It is periodically cleared to 0 by the OS.
  - A page with reference bit = 0 has not been used recently.
- Disk writes take millions of cycles:
  - Write-through is impractical, so write-back is used.
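The reference-bit approximation of LRU can be sketched like this: hardware sets a per-page bit on each access, the OS periodically clears all bits, and a page whose bit is still 0 at eviction time has not been used since the last sweep. The function names and the dict representation are illustrative only.

```python
def access(ref_bits, page):
    ref_bits[page] = 1            # set by hardware on every access to the page

def os_sweep(ref_bits):
    for page in ref_bits:         # periodic OS pass clears every reference bit
        ref_bits[page] = 0

def pick_victim(ref_bits):
    """Prefer evicting a page not referenced since the last OS sweep."""
    for page, bit in ref_bits.items():
        if bit == 0:
            return page
    return next(iter(ref_bits))   # all recently referenced: fall back to any page
```

This is only an approximation of true LRU: it distinguishes "used since the last sweep" from "not used", not a full recency ordering.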
Address Translation Summary
[Diagram: the page table register plus the virtual page number index the page table in main memory; a valid PTE supplies the physical page base address, which is concatenated with the page offset; invalid entries point to disk storage.]

Virtual Addressing with a Cache
- It takes an extra memory access to translate a virtual address to a physical address via the page table (CPU -> VA -> translation -> PA -> cache, with main memory on a miss).
- This makes cache accesses very expensive (if every access were really two accesses).
- The hardware fix is a Translation Lookaside Buffer (TLB): a small cache that keeps track of recently used address mappings to avoid having to do a page table lookup.
Fast Translation Using a TLB
- TLBs work well because access to page tables has good locality:
  - Use a fast cache of page table entries within the CPU.
  - Typical: 16-512 PTEs, 0.5-1 cycle for a hit, 10-100 cycles for a miss, 0.01%-1% miss rate.
  - Misses can be handled by hardware or software.
- Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped.
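A tiny fully associative TLB can be sketched as a bounded map from VPN to PPN that is checked before the page table. The capacity and the FIFO-style eviction below are illustrative choices, not any particular machine's policy, and the `page_table` dict is a stand-in for a real page table walk.

```python
from collections import OrderedDict

class TLB:
    """A small fully associative cache of recent VPN -> PPN translations."""
    def __init__(self, capacity=16):
        self.capacity = capacity
        self.entries = OrderedDict()   # VPN -> PPN, oldest entry first
        self.hits = self.misses = 0

    def translate(self, vpn, page_table):
        if vpn in self.entries:        # TLB hit: no page table access needed
            self.hits += 1
            return self.entries[vpn]
        self.misses += 1               # TLB miss: walk the page table (10s of cycles)
        ppn = page_table[vpn]
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # evict the oldest translation
        self.entries[vpn] = ppn
        return ppn
```

Repeated accesses to the same page hit in the TLB, which is exactly the locality argument on the slide: the page table is consulted only on the first touch of a page (or after eviction).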
A TLB in the Memory Hierarchy
- A TLB miss: is it a page fault or merely a TLB miss?
- If the page is loaded into main memory, then the TLB miss can be handled (in hardware or software) by loading the translation information from the page table into the TLB:
  - Takes 10's of cycles to find and load the translation info into the TLB.
- If the page is not in main memory, then it's a true page fault:
  - Takes 1,000,000's of cycles to service a page fault.
- TLB misses are much more frequent than true page faults.

TLB Event Combinations
  TLB    Page Table   Cache      Possible? Under what circumstances?
  Hit    Hit          Hit        Yes: this is what we want!
  Hit    Hit          Miss       Yes, although the page table is not checked if the TLB hits.
  Miss   Hit          Hit        Yes: TLB miss, PA in page table.
  Miss   Hit          Miss       Yes: TLB miss, PA in page table, but data not in cache.
  Miss   Miss         Miss       Yes: page fault (OS allocates a new page table entry).
  Hit    Miss         Miss/Hit   Impossible: the TLB cannot hit if the page table misses.
  Miss   Miss         Hit        Impossible: data is not allowed in the cache if there is no page table entry.
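The feasibility column of the table above reduces to two invariants, which this small (illustrative) checker captures: a TLB hit implies a page table hit, and a cache hit implies a page table hit.

```python
def possible(tlb_hit, pt_hit, cache_hit):
    """Return True if this TLB / page table / cache outcome can occur."""
    if tlb_hit and not pt_hit:
        return False   # the TLB caches page table entries, so it cannot out-hit the table
    if cache_hit and not pt_hit:
        return False   # data is not allowed in the cache without a page table entry
    return True
```

Every "Yes" row of the table satisfies both invariants and every "Impossible" row violates one of them.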
Memory Protection
- Different tasks can share parts of their virtual address spaces:
  - But we need to protect against errant access.
  - Requires OS assistance.
- Hardware support for OS protection:
  - Privileged supervisor mode (aka kernel mode).
  - Privileged instructions.
  - Page tables and other state information accessible only in supervisor mode.

Some Virtual Memory Design Parameters
                         Paged VM                     TLBs
  Total size             16,000 to 250,000 words      16 to 512 entries
  Total size (KB)        250,000 to 1,000,000,000     0.25 to 16
  Block size (B)         4000 to 64,000               4 to 8
  Hit time                                            0.5 to 1 clock cycle
  Miss penalty (clocks)  10,000,000 to 100,000,000    10 to 100
  Miss rates             0.00001% to 0.0001%          0.01% to 1%
2-Level TLB Organization: Two Machines' TLB Parameters

Intel Nehalem:
- Address sizes: 48 bits (virtual); 44 bits (physical).
- Page size: 4 KB.
- 1 L1 TLB for instructions and 1 L1 TLB for data per core; both are 4-way set associative, LRU.
- L1 ITLB has 128 entries; L1 DTLB has 64 entries.
- The L2 TLB (unified) is 4-way set associative, LRU, with 512 entries.
- TLB misses handled in hardware.

AMD Barcelona:
- Address sizes: 48 bits (virtual); 48 bits (physical).
- Page size: 4 KB.
- 1 L1 TLB for instructions and 1 L1 TLB for data per core; both are fully associative, LRU.
- L1 ITLB and DTLB each have 48 entries.
- 1 L2 TLB for instructions and 1 L2 TLB for data per core; each is 4-way set associative, round-robin LRU.
- Both L2 TLBs have 512 entries.
- TLB misses handled in hardware.
Two Machines' TLB Parameters

Intel P4:
- 1 TLB for instructions and 1 TLB for data.
- Both are 4-way set associative.
- Both use ~LRU replacement.
- Both have 128 entries.
- TLB misses handled in hardware.

AMD Opteron:
- 2 TLBs for instructions and 2 TLBs for data.
- Both L1 TLBs are fully associative with ~LRU replacement.
- Both L2 TLBs are 4-way set associative with round-robin LRU.
- Both L1 TLBs have 40 entries; both L2 TLBs have 512 entries.
- TLB misses handled in hardware.

The Hardware/Software Boundary
- What parts of the virtual-to-physical address translation are done by, or assisted by, the hardware?
- The Translation Lookaside Buffer (TLB) caches the recent translations:
  - TLB access time is part of the cache hit time.
  - May allot an extra stage in the pipeline for TLB access.
- Page table storage, fault detection, and updating:
  - Page faults result in precise interrupts that are then handled by the OS.
  - Hardware must support dirty and reference bits in the page tables.
Summary: Questions for the Memory Hierarchy
- Q1: Where can an entry be placed in the cache? (Entry placement)
- Q2: How is an entry found if it is in the cache? (Entry identification)
- Q3: Which entry should be replaced on a miss? (Entry replacement)
- Q4: What happens on a write? (Write strategy)

Q1 & Q2: Where can an entry be placed/found?
                     # of sets                        Entries per set
  Direct mapped      # of entries                     1
  Set associative    (# of entries)/associativity     associativity (typically 2 to 16)
  Fully associative  1                                # of entries

                     Location method                          # of comparisons
  Direct mapped      index                                    1
  Set associative    index the set; compare the set's tags    degree of associativity
  Fully associative  compare all entries' tags                # of entries
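The placement arithmetic in the table above can be written out directly; this is a sketch with illustrative function names, where associativity 1 means direct mapped and associativity equal to the number of entries means fully associative.

```python
def num_sets(n_entries, associativity):
    """Number of sets: n_entries for direct mapped, 1 for fully associative."""
    return n_entries // associativity

def set_index(block_number, n_entries, associativity):
    """The set a block maps to: block number modulo the number of sets."""
    return block_number % num_sets(n_entries, associativity)
```

For a 64-entry cache: direct mapped gives 64 sets of 1 entry, 4-way set associative gives 16 sets of 4, and fully associative gives 1 set of 64 (so every lookup compares all 64 tags).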
Q3: Which entry should be replaced on a miss?
- Easy for direct mapped: there is only one choice.
- Set associative or fully associative:
  - Random.
  - LRU (least recently used).
- For a 2-way set associative cache, random replacement has a miss rate about 1.1 times higher than LRU.
- LRU is too costly to implement for high degrees of associativity (> 4-way) since tracking the usage information is costly.

Q4: What happens on a write?
- Write-through: the information is written to the entry in the current memory level and to the entry in the next level of the memory hierarchy:
  - Always combined with a write buffer so write waits to the next-level memory can be eliminated (if the write buffer doesn't fill).
- Write-back: the information is written only to the entry in the current memory level. The modified entry is written to the next level of memory only when it is replaced:
  - Needs a dirty bit to keep track of whether the entry is clean or dirty.
- Virtual memory systems always use write-back.
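The two write policies can be contrasted with a minimal sketch, using plain dicts as stand-ins for a cache level and the next level of the hierarchy (the function names are illustrative): write-through updates both levels immediately, while write-back only marks the entry dirty and defers the next-level update until eviction.

```python
def write_through(cache, next_level, addr, value):
    cache[addr] = value
    next_level[addr] = value          # next level is always up to date

def write_back(cache, dirty, addr, value):
    cache[addr] = value
    dirty[addr] = True                # next level updated only on replacement

def evict(cache, dirty, next_level, addr):
    """Replace an entry: a dirty entry must be written back first."""
    if dirty.get(addr):
        next_level[addr] = cache[addr]
        dirty[addr] = False
    del cache[addr]
```

This is why virtual memory must use write-back: with millions of cycles per disk write, updating the next level on every store (write-through) would be hopeless, so modified pages are written back only when replaced.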
Multilevel On-Chip Caches
- Intel Nehalem 4-core processor.
- Per core: 32 KB L1 I-cache, 32 KB L1 D-cache, 512 KB L2 cache.

3-Level Cache Organization

Intel Nehalem:
- L1 caches (per core): L1 I-cache: 32 KB, 64-byte blocks, 4-way, approx. LRU replacement, hit time n/a. L1 D-cache: 32 KB, 64-byte blocks, 8-way, approx. LRU replacement, write-back/allocate, hit time n/a.
- L2 unified cache (per core): 256 KB, 64-byte blocks, 8-way, approx. LRU replacement, write-back/allocate, hit time n/a.
- L3 unified cache (shared): 8 MB, 64-byte blocks, 16-way, replacement n/a, write-back/allocate, hit time n/a.

AMD Opteron X4:
- L1 caches (per core): L1 I-cache: 32 KB, 64-byte blocks, 2-way, LRU replacement, hit time 3 cycles. L1 D-cache: 32 KB, 64-byte blocks, 2-way, LRU replacement, write-back/allocate, hit time 9 cycles.
- L2 unified cache (per core): 512 KB, 64-byte blocks, 16-way, approx. LRU replacement, write-back/allocate, hit time n/a.
- L3 unified cache (shared): 2 MB, 64-byte blocks, 32-way, replace the block shared by the fewest cores, write-back/allocate, hit time 32 cycles.

n/a: data not available
Summary
- The principle of locality: a program is likely to access a relatively small portion of its address space at any instant of time:
  - Temporal locality: locality in time.
  - Spatial locality: locality in space.
- Caches, TLBs, and virtual memory can all be understood by examining how they deal with the four questions:
  1. Where can an entry be placed?
  2. How is an entry found?
  3. Which entry is replaced on a miss?
  4. How are writes handled?
- Page tables map virtual addresses to physical addresses:
  - TLBs are important for fast translation.