Chapter 5. Large and Fast: Exploiting Memory Hierarchy


Processor-Memory Performance Gap (1980-2004): processor performance has grown roughly 55% per year (2X every 1.5 years, following Moore's Law), while DRAM performance has grown only about 7% per year (2X every 10 years), so the processor-memory performance gap grows by about 50% per year.

The Memory Wall: the processor vs. DRAM speed disparity continues to grow (chart: clocks per instruction vs. clocks per DRAM access, from core memory and the VAX in 1980 through the Pentium Pro in 1996 and beyond 2010). Good memory hierarchy (cache) design is increasingly important to overall performance.

Memory Technology
Static RAM (SRAM): 0.5-2.5 ns, $2000-$5000 per GB
Dynamic RAM (DRAM): 50-70 ns, $20-$75 per GB
Magnetic disk: 5-20 ms, $0.20-$2 per GB
Ideal memory: the access time of SRAM with the capacity and cost/GB of disk

Memory Hierarchy: The Pyramid

A Typical Memory Hierarchy: take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology, while operating at the speed offered by the fastest technology. On-chip components (control, datapath, register file, ITLB/DTLB, instruction and data caches) are backed by a second-level cache (SRAM), main memory (DRAM), and secondary memory (disk). Speed (cycles): roughly 1/2, 1, 10s, 100s, 10,000s respectively; size (bytes): 100s, 10Ks, Ms, Gs, Ts; cost per byte: highest at the top, lowest at the bottom.

Inside the Processor AMD Barcelona: 4 processor cores

Principle of Locality: programs access only a small proportion of their address space at any time. Temporal locality (locality in time): items accessed recently are likely to be accessed again soon (e.g., instructions in a loop, induction variables), so keep the most recently accessed items in the cache. Spatial locality (locality in space): items near those accessed recently are likely to be accessed soon (e.g., sequential instruction access, array data), so move blocks consisting of contiguous words closer to the processor.
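As a small illustration (my own addition, not part of the slides), the loop below exhibits both kinds of locality: the running total and loop index are reused on every iteration (temporal locality), while the array elements are touched in consecutive order (spatial locality).

```python
# Hypothetical illustration of locality in a simple summation loop.
def sum_array(data):
    total = 0                      # 'total' is reused every iteration: temporal locality
    for i in range(len(data)):
        total += data[i]           # consecutive elements of 'data': spatial locality
    return total

print(sum_array(list(range(16))))  # 120
```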

Taking Advantage of Locality: build a memory hierarchy. Store everything on disk; copy recently accessed (and nearby) items from disk to a smaller DRAM memory (main memory); copy more recently accessed (and nearby) items from DRAM to a smaller SRAM memory (the cache attached to the CPU).

Memory Hierarchy Levels: a block (aka cache line) is the unit of copying and may be multiple words. If the accessed data is present in the upper level, the access is a hit and is satisfied by the upper level; hit ratio = hits/accesses, and hit time = time to access the block plus time to determine hit/miss. If the accessed data is absent, the access is a miss; miss ratio = misses/accesses = 1 - hit ratio, and miss penalty = time to access the block in the lower level + time to transmit that block to the level that experienced the miss + time to insert the block in that level + time to pass the block to the requestor.

5.3 The Basics of Caches. Cache memory is the level of the memory hierarchy closest to the CPU. Given accesses X1, ..., Xn-1, Xn: how do we know if the data is present, and where do we look?

Direct Mapped Cache: the location is determined by the address. Direct mapped means there is only one choice: (Block address) modulo (#Blocks in cache). Since #Blocks is a power of 2, the low-order address bits are used as the index.

Tags and Valid Bits. How do we know which particular block is stored in a cache location? Store the block address as well as the data; actually only the high-order bits are needed, and these are called the tag. What if there is no data in a location? A valid bit indicates this: 1 = valid, 0 = not valid, initially 0.

Cache Example: 8 blocks, 1 word/block, direct mapped. Initial state: every index (000 through 111) has V = N (invalid), with no tag or data. Access sequence (word addresses): 22, 26, 22, 26, 16, 3, 16, 18, 16.

Cache Example (step 1): word address 22 = binary 10 110 is a miss, placed in cache block 110. State after: index 110 is valid with tag 10 and data Mem[10110]; all other indices remain invalid.

Cache Example (step 2): word address 26 = binary 11 010 is a miss, placed in cache block 010. State after: index 010 holds tag 11, Mem[11010]; index 110 holds tag 10, Mem[10110]; all others invalid.

Cache Example (steps 3-4): word address 22 (10 110) hits in block 110, and word address 26 (11 010) hits in block 010. The cache state is unchanged.

Cache Example (steps 5-6): word address 16 = 10 000 misses and is placed in block 000; word address 3 = 00 011 misses and is placed in block 011. State after: 000 holds tag 10, Mem[10000]; 010 holds tag 11, Mem[11010]; 011 holds tag 00, Mem[00011]; 110 holds tag 10, Mem[10110].

Cache Example (steps 7-8): word address 16 (10 000) hits in block 000; word address 18 = 10 010 misses and replaces the block at index 010, which now holds tag 10 and Mem[10010] (Mem[11010] is evicted). Final state: 000 holds tag 10, Mem[10000]; 010 holds tag 10, Mem[10010]; 011 holds tag 00, Mem[00011]; 110 holds tag 10, Mem[10110].
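The following Python sketch (an addition for illustration, not part of the slides) simulates this 8-block, one-word-per-block direct-mapped cache and reproduces the hit/miss pattern of the access sequence above.

```python
# Minimal direct-mapped cache simulation: 8 blocks, 1 word per block.
NUM_BLOCKS = 8
cache = [None] * NUM_BLOCKS             # each entry holds a tag, or None if invalid

def access(word_addr):
    index = word_addr % NUM_BLOCKS      # low-order address bits select the cache block
    tag = word_addr // NUM_BLOCKS       # remaining high-order bits form the tag
    if cache[index] == tag:
        return "hit"
    cache[index] = tag                  # on a miss, fetch the block and overwrite the entry
    return "miss"

for addr in [22, 26, 22, 26, 16, 3, 16, 18, 16]:
    print(addr, access(addr))
# Expected: 22 miss, 26 miss, 22 hit, 26 hit, 16 miss, 3 miss, 16 hit, 18 miss, 16 hit
```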

Address Subdivision Chapter 5 Large and Fast: Exploiting Memory Hierarchy 20

Multiword Block Direct Mapped Cache: four words/block, cache size = 1K words (256 blocks). The 32-bit address is divided into a byte offset (bits 1-0), a 2-bit block offset (bits 3-2) that selects the word within the block, an 8-bit index (bits 11-4) that selects one of the 256 entries, and a 20-bit tag (bits 31-12) that is compared against the stored tag of a valid entry. What kind of locality are we taking advantage of?

Taking Advantage of Spatial Locality: let a cache block hold more than one word (here, two words per block). Starting with an empty cache (all blocks marked not valid), the reference sequence 0, 1, 2, 3, 4, 3, 4, 15 gives: 0 miss, 1 hit, 2 miss, 3 hit, 4 miss, 3 hit, 4 hit, 15 miss. Result: 8 requests, 4 misses.

Larger Block Size: 64 blocks, 16 bytes/block. To what block number does address 1200 map? 1200 (decimal) = 0x000004B0; block address = 1200 / 16 = 75, and block number = 75 modulo 64 = 11. Address fields: tag = bits 31-10 (22 bits), index = bits 9-4 (6 bits), offset = bits 3-0 (4 bits).
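A short Python sketch of the same calculation (added here for illustration), splitting a byte address into offset, index, and tag for this 64-block cache with 16-byte blocks:

```python
# Address fields for a direct-mapped cache with 64 blocks of 16 bytes each.
OFFSET_BITS = 4   # 16-byte blocks -> 4 offset bits
INDEX_BITS  = 6   # 64 blocks      -> 6 index bits

def split_address(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index  = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag    = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

tag, index, offset = split_address(1200)
print(hex(1200), "-> tag", tag, "index", index, "offset", offset)
# 0x4b0 -> tag 1 index 11 offset 0  (block address 75 maps to cache block 11)
```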

Block Size Considerations: larger blocks should reduce the miss rate due to spatial locality. But in a fixed-size cache, larger blocks mean fewer of them, hence more competition and an increased miss rate; they also mean a larger miss penalty, which can override the benefit of the reduced miss rate. Early restart and critical-word-first transfer can help.

Miss Rate vs. Block Size

Cache Field Sizes: the number of bits in a cache includes the storage for data, the tags, and the flag bits. For a direct-mapped cache with 2^n blocks, n bits are used for the index. For a block size of 2^m words (2^(m+2) bytes), m bits are used to address the word within the block and 2 bits are used to address the byte within the word. What is the size of the tag field? The total number of bits in a direct-mapped cache is then 2^n x (block size + tag field size + flag bits size). How many total bits are required for a direct-mapped cache with 16 KB of data and 4-word blocks, assuming a 32-bit address?
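A sketch of the calculation the question asks for (my own working under the stated parameters): 16 KB of data is 4K words, or 1024 four-word blocks, so n = 10 index bits; with m = 2 word-offset bits and 2 byte-offset bits, the tag is 32 - 10 - 2 - 2 = 18 bits, and with one valid bit per block the total is 1024 x (128 + 18 + 1) = 147 Kbits, roughly 18.4 KB of raw storage for a 16 KB data cache.

```python
# Total bits for a direct-mapped cache: 16 KB of data, 4-word blocks, 32-bit addresses.
ADDR_BITS   = 32
DATA_BYTES  = 16 * 1024
BLOCK_WORDS = 4
WORD_BYTES  = 4

blocks      = DATA_BYTES // (BLOCK_WORDS * WORD_BYTES)       # 1024 blocks
index_bits  = blocks.bit_length() - 1                        # 10
offset_bits = (BLOCK_WORDS * WORD_BYTES).bit_length() - 1    # 4 (word + byte offset)
tag_bits    = ADDR_BITS - index_bits - offset_bits           # 18
bits_per_block = BLOCK_WORDS * 32 + tag_bits + 1             # data + tag + valid bit
print(blocks * bits_per_block, "bits")                       # 150528 bits = 147 Kbits
```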

Handling Cache Hits. Read hits (I$ and D$): this is what we want! Write hits (D$ only): either require the cache and memory to be consistent by always writing the data into both the cache block and the next level of the memory hierarchy (write-through), or allow the cache and memory to be inconsistent by writing the data only into the cache block and writing that block back to the next level of the hierarchy only when it is evicted (write-back).

Write-Through: update the data in both the cache and main memory. But this makes writes take longer, since writes run at the speed of the next level in the memory hierarchy; e.g., if base CPI = 1, 10% of instructions are stores, and a write to memory takes 100 cycles, effective CPI = 1 + 0.1 x 100 = 11. Solution: a write buffer that holds data waiting to be written to memory; the CPU continues immediately and only stalls on a write if the write buffer becomes full.

Write-Back Write back: On data-write hit, just update the block in cache Keep track of whether each block is dirty Need a dirty bit When a dirty block is replaced Write it back to memory Can use a write buffer Chapter 5 Large and Fast: Exploiting Memory Hierarchy 29

Handling Cache Misses (single-word cache block) Read misses (I$ and D$) Stall the CPU pipeline Fetch block from the next level of hierarchy Copy it to the cache Send the requested word to the processor Resume the CPU pipeline 30

Handling Cache Misses (single-word cache block). What should happen on a write miss? Write allocate (aka fetch on write): the data at the missed-write location is loaded into the cache. For write-through there are two alternatives: allocate on miss (fetch the block, i.e., write allocate) or no write allocate (don't fetch the block, since programs often write a whole data structure before reading it, e.g., during initialization). For write-back, the block is usually fetched (write allocate).

Multiword Block Considerations. Read misses (I$ and D$) are processed the same as for single-word blocks: a miss returns the entire block from memory, and the miss penalty grows as the block size grows. Early restart: the processor resumes execution as soon as the requested word of the block is returned. Requested word first: the requested word is transferred from memory to the cache (and processor) first. A nonblocking cache allows the processor to continue to access the cache while the cache is handling an earlier miss.

Multiword Block Considerations. Write misses (D$): if using write allocate, the block must first be fetched from memory and then the word written into it; otherwise the cache could end up with a garbled block (e.g., for 4-word blocks, a new tag, one word of data from the new block, and three words of data from the old block). If not using write allocate, forward the write request to main memory.

Main Memory Supporting Caches Use DRAMs for main memory Example: cache block read 1 bus cycle for address transfer 15 bus cycles per DRAM access 1 bus cycle per data transfer 4-word cache block 3 different main memory configurations 1-word wide memory 4-word wide memory 4-bank interleaved memory Chapter 5 Large and Fast: Exploiting Memory Hierarchy 34

1-Word Wide Bus, 1-Word Wide Memory: what if the block size is four words and the bus and the memory are 1 word wide? How many cycles to send the address, to read the DRAM, and to return the 4 data words, and what is the total miss penalty in clock cycles?

1-Word Wide Bus, 1-Word Wide Memory: 1 cycle to send the address, 4 x 15 = 60 cycles to read the DRAM (15 cycles per word), and 4 cycles to return the 4 data words, for a total miss penalty of 65 clock cycles. Bandwidth = (4 x 4 bytes)/65 cycles = 0.246 bytes/cycle.

4-Word Wide Bus, 4-Word Wide Memory: what if the block size is four words and the bus and the memory are 4 words wide? How many cycles to send the address, to read the DRAM, and to return the 4 data words, and what is the total miss penalty?

4-Word Wide Bus, 4-Word Wide Memory: 1 cycle to send the address, 15 cycles to read the DRAM (all four words at once), and 1 cycle to return the 4 data words, for a total miss penalty of 17 clock cycles. Bandwidth = (4 x 4 bytes)/17 cycles = 0.941 bytes/cycle.

Interleaved Memory, 1-Word Wide Bus: for a block size of four words, how many cycles to send the address, to read the DRAM banks, and to return the 4 data words, and what is the total miss penalty?

Interleaved Memory, 1-Word Wide Bus: 1 cycle to send the address, 15 cycles to read the DRAM banks (all four banks are read in parallel), and 4 x 1 = 4 cycles to return the 4 data words, for a total miss penalty of 20 clock cycles. Bandwidth = (4 x 4 bytes)/20 cycles = 0.8 bytes/cycle.
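A compact sketch summarizing the three cases above (my own restatement, using the slides' assumptions of 1 bus cycle per address transfer, 15 bus cycles per DRAM access, and 1 bus cycle per data transfer):

```python
# Miss penalty and bandwidth for a 4-word cache block under three memory organizations.
ADDR, DRAM, XFER, WORDS, BYTES_PER_WORD = 1, 15, 1, 4, 4

def miss_penalty(dram_accesses, transfers):
    return ADDR + dram_accesses * DRAM + transfers * XFER

configs = {
    "1-word wide memory":        miss_penalty(WORDS, WORDS),  # 1 + 4*15 + 4 = 65
    "4-word wide memory":        miss_penalty(1, 1),          # 1 + 15   + 1 = 17
    "4-bank interleaved memory": miss_penalty(1, WORDS),      # 1 + 15   + 4 = 20
}
for name, cycles in configs.items():
    bandwidth = WORDS * BYTES_PER_WORD / cycles
    print(f"{name}: {cycles} cycles, {bandwidth:.3f} bytes/cycle")
```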

5.4 Measuring and Improving Cache Performance. Measuring Cache Performance: the components of CPU time are program execution cycles (which include the cache hit time) and memory-stall cycles (mainly from cache misses). CPU time = IC x CPI x CC = IC x (CPI_ideal + average memory-stall cycles per instruction) x CC. With simplifying assumptions: Memory-stall cycles = Memory accesses x Miss rate x Miss penalty = Instructions x (Misses/Instruction) x Miss penalty.

Cache Performance Example. Given: I-cache miss rate = 2%, D-cache miss rate = 4%, miss penalty = 100 cycles, base CPI (ideal cache) = 2, and loads and stores are 36% of instructions. Miss cycles per instruction: I-cache: 0.02 x 100 = 2 (every instruction is fetched from the I-cache); D-cache: 0.36 x 0.04 x 100 = 1.44 (36% of instructions access the D-cache). Actual CPI = 2 + 2 + 1.44 = 5.44, so the ideal-cache CPU is 5.44/2 = 2.72 times faster.
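The same arithmetic as a short Python sketch, with the parameters taken from the example above:

```python
# CPI impact of cache misses (parameters from the example).
base_cpi     = 2.0
miss_penalty = 100
i_miss_rate  = 0.02
d_miss_rate  = 0.04
ls_fraction  = 0.36   # loads and stores per instruction

i_stalls = i_miss_rate * miss_penalty                 # 2.0 stall cycles per instruction
d_stalls = ls_fraction * d_miss_rate * miss_penalty   # 1.44 stall cycles per instruction
actual_cpi = base_cpi + i_stalls + d_stalls
print(round(actual_cpi, 2), "CPI;", round(actual_cpi / base_cpi, 2), "x slower than ideal")
# 5.44 CPI; 2.72 x slower than ideal
```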

Average Memory Access Time: AMAT = Hit time + Miss rate x Miss penalty; hit time is also important for performance. Example: a CPU with a 1 ns clock, hit time = 1 cycle, miss penalty = 20 cycles, and cache miss rate = 5% has AMAT = 1 + 0.05 x 20 = 2 cycles (2 ns).
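A one-function sketch of the AMAT formula, with the example's numbers plugged in:

```python
# Average memory access time: hit time + miss rate * miss penalty.
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

print(amat(hit_time=1, miss_rate=0.05, miss_penalty=20), "cycles")  # 2.0 cycles
```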

Performance Summary: as CPU performance increases, the miss penalty becomes more significant and a greater proportion of time is spent on memory stalls. Decreasing the base CPI or increasing the clock rate makes memory stalls account for even more CPU cycles. We can't neglect cache behavior when evaluating system performance.

Mechanisms to Reduce Cache Miss Rates: (1) allow more flexible block placement; (2) use multiple levels of caches.

Reducing Cache Miss Rates #1 1. Allow more flexible block placement In a direct mapped cache a memory block maps to exactly one cache block At the other extreme, could allow a memory block to be mapped to any cache block fully associative cache A compromise is to divide the cache into sets, each of which consists of n ways (n-way set associative) A memory block maps to a unique set (specified by the index field) and can be placed in any way of that set (so there are n choices)

Associative Caches Fully associative (only one set) Allow a given block to go into any cache entry Requires all entries to be searched at once Comparator per entry (expensive) n-way set associative Each set contains n entries Block address determines which set (Block address) modulo (#Sets in cache) Search all entries in a given set at once n comparators (less expensive) Chapter 5 Large and Fast: Exploiting Memory Hierarchy 47

Associative Cache Example Chapter 5 Large and Fast: Exploiting Memory Hierarchy 48

Spectrum of Associativity For a cache with 8 entries Chapter 5 Large and Fast: Exploiting Memory Hierarchy 49

Associativity Example: compare 4-block caches (each block one word wide) that are direct mapped, 2-way set associative, and fully associative, for the block address sequence 0, 8, 0, 6, 8. Direct mapped: 0 maps to index 0, miss (cache holds Mem[0]); 8 maps to index 0, miss (Mem[8]); 0 maps to index 0, miss (Mem[0]); 6 maps to index 2, miss (Mem[0], Mem[6]); 8 maps to index 0, miss (Mem[8], Mem[6]). Five misses.

Associativity Example (continued). 2-way set associative: 0 maps to set 0, miss (set 0 holds Mem[0]); 8 maps to set 0, miss (Mem[0], Mem[8]); 0 hits; 6 maps to set 0, miss and replaces Mem[8] (Mem[0], Mem[6]); 8 maps to set 0, miss and replaces Mem[0] (Mem[8], Mem[6]). Four misses. Fully associative: 0 miss (Mem[0]); 8 miss (Mem[0], Mem[8]); 0 hit; 6 miss (Mem[0], Mem[8], Mem[6]); 8 hit. Three misses.
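A small Python sketch (an addition, not from the slides) that simulates all three organizations with LRU replacement and reproduces the 5/4/3 miss counts for the sequence 0, 8, 0, 6, 8:

```python
# Simulate a 4-block cache at different associativities with LRU replacement.
def count_misses(refs, num_blocks, ways):
    num_sets = num_blocks // ways
    sets = [[] for _ in range(num_sets)]   # each set lists block addresses, MRU last
    misses = 0
    for block in refs:
        s = sets[block % num_sets]
        if block in s:
            s.remove(block)                # hit: move to most-recently-used position
        else:
            misses += 1
            if len(s) == ways:
                s.pop(0)                   # evict the least-recently-used block
        s.append(block)
    return misses

refs = [0, 8, 0, 6, 8]
for ways in (1, 2, 4):                     # direct mapped, 2-way, fully associative
    print(f"{ways}-way: {count_misses(refs, 4, ways)} misses")   # 5, 4, 3
```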

Another Reference String Mapping: consider the main memory word reference string 0, 4, 0, 4, 0, 4, 0, 4 in a direct-mapped cache, starting with an empty cache (all blocks marked not valid). Words 0 and 4 map to the same cache block, so every access misses: 8 requests, 8 misses. This ping-pong effect is due to conflict misses between two memory locations that map into the same cache block.

Another Reference String Mapping (2-way set associative): for the same reference string 0, 4, 0, 4, 0, 4, 0, 4, starting with an empty cache, only the first access to 0 and the first access to 4 miss; afterwards every access hits because both words co-exist in the same set (tags 000 and 010). Result: 8 requests, 2 misses. This solves the ping-pong effect that conflict misses cause in a direct-mapped cache, since two memory locations that map into the same cache set can now co-exist.

How Much Associativity Increased associativity decreases miss rate But with diminishing returns Simulation of a system with 64KB D-cache, 16-word blocks, SPEC2000 1-way: 10.3% 2-way: 8.6% 4-way: 8.3% 8-way: 8.1% Chapter 5 Large and Fast: Exploiting Memory Hierarchy 55

Benefits of Set Associative Caches Largest gains are in going from direct mapped cache to 2-way set associative cache (20+% reduction in miss rate) 56

Set Associative Cache Organization Chapter 5 Large and Fast: Exploiting Memory Hierarchy 57

Range of Set Associative Caches: for a fixed-size cache, each increase by a factor of two in associativity doubles the number of blocks per set (i.e., the number of ways) and halves the number of sets; it decreases the size of the index by 1 bit and increases the size of the tag by 1 bit. The address fields are the tag (used for the tag compare), the index (selects the set), the block offset (selects the word in the block), and the byte offset. Decreasing associativity toward direct mapped (only one way) gives smaller tags and needs only a single comparator; increasing associativity toward fully associative (only one set) means the tag is all the address bits except the block and byte offsets.

Example: Tag Size. Cache with 4K blocks, a 4-word block size, and a 32-bit byte address. What is the total number of tag bits for the cache if it is direct mapped, 2-way set associative, 4-way set associative, or fully associative? (A worked sketch follows.)
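A sketch of the calculation (my own working under the slide's parameters): the 4-word (16-byte) blocks use 4 offset bits, so tag plus index together occupy 28 bits; direct mapped uses 12 index bits (4K sets), 2-way uses 11, 4-way uses 10, and fully associative uses none.

```python
# Total tag bits for a cache with 4K blocks, 4-word (16-byte) blocks, 32-bit addresses.
ADDR_BITS, BLOCKS, OFFSET_BITS = 32, 4096, 4

for ways in (1, 2, 4, BLOCKS):             # BLOCKS ways = fully associative
    sets = BLOCKS // ways
    index_bits = sets.bit_length() - 1
    tag_bits = ADDR_BITS - index_bits - OFFSET_BITS
    total = tag_bits * BLOCKS              # one tag per block
    print(f"{ways}-way: tag = {tag_bits} bits, total = {total // 1024} Kbits")
# direct mapped: 16-bit tags, 64 Kbits; 2-way: 17, 68; 4-way: 18, 72; fully: 28, 112
```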

Replacement Policy Direct mapped: no choice Set associative Prefer non-valid entry, if there is one Otherwise, choose among entries in the set Least-recently used (LRU) Choose the one unused for the longest time Simple for 2-way, manageable for 4-way, too hard beyond that Random Gives approximately the same performance as LRU for high associativity Chapter 5 Large and Fast: Exploiting Memory Hierarchy 61

Reducing Cache Miss Rates #2 2. Use multiple levels of caches Primary (L1) cache attached to CPU Small, but fast Separate L1 I$ and L1 D$ Level-2 cache services misses from L1 cache Larger, slower, but still faster than main memory Unified cache for both instructions and data Main memory services L-2 cache misses Some high-end systems include L-3 cache Chapter 5 Large and Fast: Exploiting Memory Hierarchy 62

Multilevel Cache Example. Given: CPU base CPI = 1, clock rate = 4 GHz, miss rate per instruction = 2%, main memory access time = 100 ns. With just the primary cache: miss penalty = 100 ns / 0.25 ns = 400 cycles, so effective CPI = 1 + 0.02 x 400 = 9.

Example (continued): now add an L2 cache with a 5 ns access time and a global miss rate to main memory of 0.5%. A primary miss that hits in L2 costs 5 ns / 0.25 ns = 20 cycles; a primary miss that also misses in L2 costs an extra 400 cycles. CPI = 1 + 0.02 x 20 + 0.005 x 400 = 3.4, so the performance ratio is 9/3.4 = 2.6.
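The two-level CPI calculation as a short sketch, with the numbers taken from this example:

```python
# Effective CPI with and without an L2 cache (example parameters from the slides).
base_cpi     = 1.0
l1_miss_rate = 0.02    # L1 misses per instruction
mem_penalty  = 400     # cycles to main memory (100 ns at 4 GHz)
l2_penalty   = 20      # cycles to L2 (5 ns at 4 GHz)
global_miss  = 0.005   # references per instruction that miss in both L1 and L2

cpi_l1_only = base_cpi + l1_miss_rate * mem_penalty
cpi_with_l2 = base_cpi + l1_miss_rate * l2_penalty + global_miss * mem_penalty
print(round(cpi_l1_only, 2), round(cpi_with_l2, 2), round(cpi_l1_only / cpi_with_l2, 1))
# 9.0 3.4 2.6
```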

Multilevel Cache Considerations Primary cache Focus on minimal hit time Smaller total size with smaller block size L-2 cache Focus on low miss rate to avoid main memory access Hit time has less overall impact Larger total size with larger block size Higher levels of associativity Chapter 5 Large and Fast: Exploiting Memory Hierarchy 65

Global vs. Local Miss Rate. Global miss rate (GR): the fraction of references that miss in all levels of a multilevel cache; it dictates how often main memory is accessed. Global miss rate so far (GRS): the fraction of references that miss in all levels up to a certain level of a multilevel cache. Local miss rate (LR): the fraction of references to one level of a cache that miss at that level. The L2 local miss rate is much larger than the global miss rate.

Example: LR1 = 5%, LR2 = 20%, LR3 = 50%. GR = LR1 x LR2 x LR3 = 0.05 x 0.2 x 0.5 = 0.005. GRS1 = LR1 = 0.05; GRS2 = LR1 x LR2 = 0.05 x 0.2 = 0.01; GRS3 = LR1 x LR2 x LR3 = 0.005. CPI = 1 + GRS1 x Penalty1 + GRS2 x Penalty2 + GRS3 x Penalty3 = 1 + 0.05 x 10 + 0.01 x 20 + 0.005 x 100 = 2.2.
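The same calculation as a tiny sketch (the per-level penalties of 10, 20, and 100 cycles are the example's values):

```python
# Global miss rates and CPI for a three-level cache (local miss rates from the example).
local_miss = [0.05, 0.20, 0.50]   # L1, L2, L3
penalty    = [10, 20, 100]        # cycles charged per reference that misses up to each level

cpi, grs = 1.0, 1.0
for lr, cycles in zip(local_miss, penalty):
    grs *= lr                     # global miss rate "so far" after this level
    cpi += grs * cycles
print(round(grs, 4), round(cpi, 2))   # 0.005 (global miss rate), 2.2 CPI
```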

Sources of Cache Misses. Compulsory (cold start, process migration, first reference): the first access to a block; a cold fact of life, not a whole lot you can do about it, and if you are going to run millions of instructions, compulsory misses are insignificant. Solution: increase the block size (this increases the miss penalty, and very large blocks could increase the miss rate). Capacity: the cache cannot contain all the blocks accessed by the program. Solution: increase the cache size (may increase access time). Conflict (collision): multiple memory locations map to the same cache location. Solution 1: increase the cache size. Solution 2: increase associativity (may increase access time).

Cache Design Trade-offs. Increasing the cache size decreases capacity misses but may increase access time. Increasing associativity decreases conflict misses but may increase access time. Increasing the block size decreases compulsory misses but increases the miss penalty and, for very large blocks, may increase the miss rate due to pollution.

5.7 Virtual Memory. Virtual memory is a memory management technique developed for multitasking computer architectures. It virtualizes various forms of data storage, allowing a program to be designed as if there were only one kind of memory (virtual memory), with each program running in its own virtual address space. It uses main memory as a cache for secondary (disk) storage, allows efficient and safe sharing of memory among multiple programs, provides the ability to easily run programs larger than physical memory, and simplifies loading a program for execution by providing for code relocation.

Two Programs Sharing Physical Memory: a program's address space is divided into fixed-size pages, and the starting location of each page (either in main memory or in secondary memory) is recorded in the program's page table. (Figure: the virtual address spaces of program 1 and program 2 both map into main memory, which acts as a cache of the hard drive.)

Address Translation: a virtual address is translated to a physical address by a combination of hardware and software. The virtual address (bits 31-12 form the virtual page number, bits 11-0 the page offset) is translated into a physical address (bits 29-12 form the physical page number, with the page offset unchanged). Every memory request therefore first requires an address translation from the virtual space to the physical space. A virtual memory miss (i.e., when the page is not in physical memory) is called a page fault.
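A minimal sketch of this translation step (my own illustration, assuming 4 KB pages and a plain dictionary standing in for the page table; the mapping shown is hypothetical):

```python
# Virtual-to-physical translation with 4 KB pages; a dict models the page table.
PAGE_BITS = 12                        # 4 KB pages -> 12-bit page offset

page_table = {0x12345: 0x00042}       # hypothetical mapping: virtual page -> physical page

def translate(vaddr):
    vpn = vaddr >> PAGE_BITS
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    if vpn not in page_table:
        raise RuntimeError("page fault: the OS must bring the page in from disk")
    return (page_table[vpn] << PAGE_BITS) | offset

print(hex(translate(0x12345ABC)))     # 0x42abc
```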

Page Fault Penalty On page fault, the page must be fetched from disk Takes millions of clock cycles Handled by OS code Try to minimize page fault rate Fully associative placement A physical page can be placed at any place in the main memory Smart replacement algorithms Chapter 5 Large and Fast: Exploiting Memory Hierarchy 73

Page Tables Stores placement information Array of page table entries, indexed by virtual page number Page table register in CPU points to page table in physical memory If page is present in memory Page table entry stores the physical page number Plus other status bits (referenced, dirty, ) If page is not present Page table entry can refer to location in swap space on disk Swap space: the space on the disk reserved for the full virtual memory space of a process Chapter 5 Large and Fast: Exploiting Memory Hierarchy 74

Address Translation Mechanisms (figure: the page table register points to the page table held in main memory; the virtual page number indexes the page table, and a valid entry supplies the physical page base address, which is combined with the page offset to form the main memory address, while invalid entries refer to disk storage).

Translation Using a Page Table Chapter 5 Large and Fast: Exploiting Memory Hierarchy 76

Replacement and Writes To reduce page fault rate, prefer least-recently used (LRU) replacement Reference bit (aka use bit) in PTE set to 1 on access to page Periodically cleared to 0 by OS A page with reference bit = 0 has not been used recently Disk writes take millions of cycles Replace a page at once, not individual locations Use write-back Write through is impractical Dirty bit in PTE set when page is written Chapter 5 Large and Fast: Exploiting Memory Hierarchy 77

Virtual Addressing with a Cache: it takes an extra memory access to translate a VA to a PA (the CPU issues a VA, translation produces a PA, the PA is looked up in the cache, and a miss goes to main memory). This would make memory (cache) accesses very expensive if every access were really two accesses. The hardware fix is a Translation Lookaside Buffer (TLB), a small cache that keeps track of recently used address mappings so that a page table lookup can usually be avoided.

Fast Translation Using a TLB Address translation would appear to require extra memory references One to access the PTE Then the actual memory access Check cache first But access to page tables has good locality So use a fast cache of PTEs within the CPU Called a Translation Look-aside Buffer (TLB) Typical: 16 512 PTEs, 0.5 1 cycle for hit, 10 100 cycles for miss, 0.01% 1% miss rate Misses could be handled by hardware or software Chapter 5 Large and Fast: Exploiting Memory Hierarchy 79

Making Address Translation Fast (figure: the TLB caches recently used page table entries as a valid bit, a tag, and a physical page base address; on a TLB miss, the page table register and virtual page number are used to look up the page table in physical memory, whose entries point to main memory pages or to disk storage).

Fast Translation Using a TLB Chapter 5 Large and Fast: Exploiting Memory Hierarchy 81

Translation Lookaside Buffers (TLBs): just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped. Each entry holds a valid bit, a tag, the physical page number, and dirty, reference, and access bits. TLB access time is typically smaller than cache access time (because TLBs are much smaller than caches), and TLBs typically have no more than 512 entries even on high-end machines.

TLB Misses If page is in memory Load the PTE from page table to TLB and retry Could be handled in hardware Can get complex for more complicated page table structures Or handled in software Raise a special exception, with optimized handler If page is not in memory (page fault) OS handles fetching the page and updating the page table Then restart the faulting instruction Chapter 5 Large and Fast: Exploiting Memory Hierarchy 83

TLB Miss Handler TLB miss indicates two possibilities Page present, but PTE not in TLB Handler copies PTE from memory to TLB Then restarts instruction Page not present in main memory Page fault will occur Chapter 5 Large and Fast: Exploiting Memory Hierarchy 84

Page Fault Handler 1. Use virtual address to find PTE 2. Locate page on disk 3. Choose page to replace if necessary If the chosen page is dirty, write to disk first 4. Read page into memory and update page table 5. Make process runnable again Restart from faulting instruction Chapter 5 Large and Fast: Exploiting Memory Hierarchy 85

TLB and Cache Interaction If cache tag uses physical address Need to translate before cache lookup Alternative: use virtual address tag Complications due to aliasing Different virtual addresses for shared physical address Chapter 5 Large and Fast: Exploiting Memory Hierarchy 86

TLB Event Combinations (TLB / page table / cache): is each combination possible, and under what circumstances?
Hit / Hit / Hit: yes, this is what we want!
Hit / Hit / Miss: yes, although the page table is not checked if the TLB hits.
Miss / Hit / Hit: yes, a TLB miss with the PA in the page table.
Miss / Hit / Miss: yes, a TLB miss with the PA in the page table, but the data is not in the cache.
Miss / Miss / Miss: yes, a page fault.
Hit / Miss / Miss or Hit: impossible, since a TLB translation is not possible if the page is not present in memory.
Miss / Miss / Hit: impossible, since data is not allowed in the cache if its page is not in memory.

Memory Protection Different tasks can share parts of their virtual address spaces But need to protect against errant access Requires OS assistance Hardware support for OS protection Privileged supervisor mode (aka kernel mode) Privileged instructions Page tables and other state information only accessible in supervisor mode System call exception (e.g., syscall in MIPS) Chapter 5 Large and Fast: Exploiting Memory Hierarchy 88

5.16 Concluding Remarks. Fast memories are small and large memories are slow; we really want fast, large memories, and caching gives this illusion. The principle of locality: programs use a small part of their memory space frequently. The memory hierarchy: L1 cache, L2 cache, DRAM main memory, disk. Memory system design is critical for multiprocessors.