EEC 483 Computer Organization
Chapter 5.3 Measuring and Improving Cache Performance
Chansu Yu (c.yu9@csuohio.edu)

Cache Performance

Performance equation:
  execution time = (execution cycles + stall cycles) x cycle time
  stall cycles = read-stall cycles + write-stall cycles
  read-stall cycles = reads/program x read miss rate x read miss penalty
  write-stall cycles = writes/program x write miss rate x write miss penalty + write buffer stalls

Two ways of improving performance:
- decreasing the miss ratio
- decreasing the miss penalty

Decreasing the miss ratio by more flexible placement of blocks is the main theme of this section.
Example #1

Memory and cache system:
- Cache hit time = 1 cycle, miss penalty = 50 cycles
- Miss rate of the 32KB cache = 1.99%
- What is the average memory access time?

Example #2

Harvard architecture refers to one where data and program are stored separately in the storage system, e.g., a data cache and an instruction cache.

Memory and cache system:
- Cache hit time = 1 cycle, miss penalty = 50 cycles
- 75% of memory accesses are instruction fetches, 25% are data accesses
- Miss rate of the 16KB data cache = 6.47%, miss rate of the 16KB instruction cache = 0.64%
- What is the average memory access time?
Direct-Mapped Caches

Advantage: simple, fast, and inexpensive.
Disadvantage: vulnerable to thrashing. Two heavily used memory blocks map to the same cache block index, and each is repeatedly evicted to make room for the other.

Thrashing Example

  float dot_prod(float x[SIZE], float y[SIZE])
  {
      float sum = 0.0;
      int i;
      for (i = 0; i < SIZE; i++)
          sum += x[i] * y[i];
      return sum;
  }

If x[i] and y[i] map to the same cache blocks, the loop thrashes. What is the hit rate in this case?
Cache Performance: Tradeoffs

(1) Increasing block size
  + decreases miss rate, until the block gets too large (spatial locality)
  - increases miss penalty
(2) Increasing cache size
  + decreases miss rate
  - increases hit time
  - increases hardware cost
(3) Increasing associativity (Section 5.3)
  + increases hit rate
  - increases hit time
  - increases hardware cost

To make exact tradeoffs, we need specific numbers: calculation and measurement. See the book for formulae.

Other Block Placement Policies

- Direct-mapped: a memory block can be cached in only one position in the cache
- Set-associative (n-way): a memory block can be cached in any of n positions in the cache
- Fully-associative: a memory block can be cached in any position in the cache

Do they decrease the miss ratio?
Direct-mapped Block Placement

[Figure: memory of 16-byte blocks (block # 0-F) mapped into a direct-mapped cache of 8 blocks (index 0-7), each cache block holding a valid bit, a tag, and 16 bytes of data]

8-bit memory address = 1-bit tag | 3-bit block # | 4-bit offset in the block

(N-Way) Set-associative Block Placement

[Figure: the same memory mapped into a 2-way set-associative cache of 4 sets (Set 0-Set 3), each set holding two blocks, each with a valid bit, a tag, and 16 bytes of data]

8-bit memory address = tag (??? bits) | set # (??? bits) | 4-bit offset in the block
Fully-associative Block Placement

[Figure: the same memory mapped into a fully-associative cache of 8 blocks; any memory block can go into any cache block, each with a valid bit, a tag, and 16 bytes of data]

8-bit memory address = tag (??? bits) | block # (??? bits) | 4-bit offset in the block

Block Identification (cont'd)

N-way set-associative cache:
- There are N cache blocks in a set to compare
- N tag comparisons are done in parallel

Block # to set #:
- choose the low-order bits of the block # as the set #
- address = tag + set # + offset
- consecutive blocks map to different sets, giving fewer conflicts in the cache, especially in the presence of spatial locality

Same cache size with higher associativity:
- # blocks/set? index size? tag size?
2-Way Set-Associative Lookup

[Figure: the CPU address is split into tag | set # | offset; the set # indexes into both ways of the cache, the two stored tags are compared with the address tag in parallel, the comparison results are ORed to form the hit signal, and a MUX selects the data from the matching way]

An Example

[Figure: a set-associative cache with 256 sets (index 0-255), four blocks per set, a 22-bit tag and an 8-bit index taken from a 32-bit address, four parallel tag comparators, and a 4-to-1 multiplexor selecting the 32-bit data word on a hit]

Questions: x-way set associative cache? Number of cache blocks? Block size? Tag field size? Cache size?
Cache Performance: Tradeoffs

[Figure: miss rate (0% to 15%) vs. associativity (one-way, two-way, four-way, eight-way) for cache sizes from 1 KB to 128 KB; miss rate drops as associativity increases, with the largest benefit for the small caches]

What is it?

Replacement Policies

- In a direct-mapped cache, no replacement policy is necessary
- In a set-associative cache, an important question is: which block to replace among the set (see page 54)?
- Least recently used (LRU) is the most commonly used scheme
- How to keep track of the usage of blocks? A single bit suffices for a two-way set-associative cache; see Section 7.5 for the higher-associativity case
Intel Nehalem 4-Core Processor (Multilevel On-Chip Caches)

Per core: 32KB L1 I-cache, 32KB L1 D-cache, 256KB L2 cache

3-Level Cache Organization

L1 caches (per core):
  Intel Nehalem:  L1 I-cache: 32KB, 64-byte blocks, 4-way, approx LRU replacement, hit time n/a
                  L1 D-cache: 32KB, 64-byte blocks, 8-way, approx LRU replacement, write-back/allocate, hit time n/a
  AMD Opteron X4: L1 I-cache: 32KB, 64-byte blocks, 2-way, LRU replacement, hit time 3 cycles
                  L1 D-cache: 32KB, 64-byte blocks, 2-way, LRU replacement, write-back/allocate, hit time 9 cycles

L2 unified cache (per core):
  Intel Nehalem:  256KB, 64-byte blocks, 8-way, approx LRU replacement, write-back/allocate, hit time n/a
  AMD Opteron X4: 512KB, 64-byte blocks, 16-way, approx LRU replacement, write-back/allocate, hit time n/a

L3 unified cache (shared):
  Intel Nehalem:  8MB, 64-byte blocks, 16-way, replacement n/a, write-back/allocate, hit time n/a
  AMD Opteron X4: 2MB, 64-byte blocks, 32-way, replace block shared by fewest cores, write-back/allocate, hit time 32 cycles
Miss Penalty Reduction

- Return the requested word first, then back-fill the rest of the block
- Non-blocking miss processing:
  - Hit under miss: allow hits to proceed while a miss is outstanding
  - Miss under miss: allow multiple outstanding misses
- Hardware prefetch of instructions and data
- Opteron X4: a bank-interleaved L1 D-cache supports two concurrent accesses per cycle