Morgan Kaufmann Publishers
COMPUTER ORGANIZATION AND DESIGN: The Hardware/Software Interface, 5th Edition
Chapter 5: Set-Associative Cache Architecture

Performance Summary
When CPU performance increases:
- Miss penalty becomes more significant
- A greater proportion of time is spent on memory stalls
With an increasing clock rate:
- Memory stalls account for more CPU cycles
Can't neglect cache behavior when evaluating system performance

Chapter 5: Large and Fast: Exploiting Memory Hierarchy
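The claim above can be made concrete with a little arithmetic: a fixed memory latency in nanoseconds costs more cycles, and therefore a larger share of CPI, as the clock rate rises. A minimal sketch; all numbers here (2% miss rate, 50 ns memory latency, base CPI of 1) are assumed for illustration only:

```python
# Illustrative only: a fixed 50 ns memory latency translates into more
# stall cycles per miss as the clock speeds up, inflating CPI.
miss_rate, mem_ns, base_cpi = 0.02, 50.0, 1.0

for ghz in (1.0, 2.0, 4.0):
    penalty_cycles = mem_ns * ghz          # 50 ns expressed in cycles at this clock
    cpi = base_cpi + miss_rate * penalty_cycles
    print(f"{ghz} GHz: miss penalty {penalty_cycles:.0f} cycles, CPI {cpi:.1f}")
# At 1 GHz the penalty is 50 cycles (CPI 2.0); at 4 GHz it is 200 cycles (CPI 5.0).
```

The program itself has not changed, yet its CPI more than doubles, which is why cache behavior cannot be neglected when evaluating system performance.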
Review: Reducing Cache Miss Rates #1
Allow more flexible block placement:
- In a direct-mapped cache, a memory block maps to exactly one cache block
- At the other extreme, we could allow a memory block to be mapped to any cache block (fully associative cache)
- A compromise is to divide the cache into sets, each of which consists of n ways (n-way set associative). A memory block maps to a unique set, specified by the index field, and can be placed anywhere in that set

Associative Caches
Fully associative cache:
- Allows a given block to go in any cache entry
- On a read, requires all blocks to be searched in parallel
- One comparator per entry (expensive)
n-way set associative:
- Each set contains n entries
- The block number determines which set the requested item is located in
- Search all n entries in a given set at once: n comparators (less expensive)
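The three placement policies above differ only in how many cache slots a given memory block may occupy. A minimal sketch, assuming a small 8-block cache laid out as consecutive ways within each set (the cache size and block number are made-up values):

```python
def candidate_slots(block_num, num_blocks, ways):
    """Return the cache slot indices where block_num may be placed in a
    cache of num_blocks total blocks organized as `ways`-way set
    associative (ways=1 is direct mapped; ways=num_blocks is fully
    associative)."""
    num_sets = num_blocks // ways
    set_index = block_num % num_sets       # the index field selects the set
    # In this layout a set occupies `ways` consecutive slots.
    return [set_index * ways + w for w in range(ways)]

# Memory block 12 in an 8-block cache:
print(candidate_slots(12, 8, 1))  # direct mapped: exactly one slot, [4]
print(candidate_slots(12, 8, 2))  # 2-way: two slots in set 12 % 4 = 0 -> [0, 1]
print(candidate_slots(12, 8, 8))  # fully associative: any of the 8 slots
```

Direct mapped gives no choice, fully associative gives every choice, and n-way set associative sits in between, which is exactly the spectrum the next slide draws.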
Spectrum of Associativity
For a cache with eight entries: the same eight blocks can be organized as direct mapped (8 sets of one way), 2-way (4 sets), 4-way (2 sets), or fully associative (one set of eight ways).

Four-Way Set Associative Cache
[Figure: a four-way set associative cache with 2^8 = 256 sets, each with four ways of one block each. The 32-bit address is split into a 22-bit tag, an 8-bit index, and a 2-bit byte offset. The index selects a set; four comparators check the V bit and tag of ways 0-3 in parallel; a 4-to-1 multiplexor selects the 32-bit data word on a hit.]
Another Direct-Mapped Cache Example
Consider the main-memory word reference string: 0 4 0 4 0 4 0 4
Start with an empty cache, all blocks initially marked as not valid:

  0 miss -> Mem(0)                    4 miss -> Mem(4) replaces Mem(0)
  0 miss -> Mem(0) replaces Mem(4)    4 miss -> Mem(4) replaces Mem(0)   ... and so on

8 requests, 8 misses
Ping-pong effect due to conflict misses: two memory locations that map into the same cache block keep evicting each other.

Set Associative Cache Example
[Figure: a 2-way set associative cache with two sets, next to a 16-word main memory; each word is 4 bytes, and the two low-order address bits select the byte within the word.]
Q1: Is it there? Compare all the cache tags in the set to the high-order 3 memory address bits to tell if the memory block is in the cache.
Q2: How do we find it? Use the next low-order memory address bit to determine which cache set (i.e., block address modulo the number of sets in the cache).
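The ping-pong example above can be reproduced with a few lines of simulation. A minimal sketch, assuming a 4-block direct-mapped cache with one-word blocks (the slide's figure implies but does not state these sizes):

```python
def simulate_direct_mapped(refs, num_blocks):
    """Count misses for word-address references in a direct-mapped
    cache of num_blocks one-word blocks."""
    cache = {}  # block index -> tag currently stored there
    misses = 0
    for addr in refs:
        index, tag = addr % num_blocks, addr // num_blocks
        if cache.get(index) != tag:   # empty slot or conflicting tag
            misses += 1
            cache[index] = tag        # replace on miss (no choice here)
    return misses

refs = [0, 4, 0, 4, 0, 4, 0, 4]
# Words 0 and 4 both map to index 0, so every reference evicts the other:
print(simulate_direct_mapped(refs, 4))  # -> 8 misses for 8 requests
```

Every one of the 8 requests misses, matching the slide: the two addresses fight over a single cache block even though the other three blocks sit empty.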
2-Way Set Associative Memory
Consider the same main-memory word reference string: 0 4 0 4 0 4 0 4
Start with an empty cache, all blocks initially marked as not valid:

  0 miss -> Mem(0)   4 miss -> Mem(4)   0 hit   4 hit   0 hit   4 hit   0 hit   4 hit

8 requests, 2 misses
Solves the ping-pong effect seen in a direct-mapped cache due to conflict misses, since two memory locations that map into the same cache set can now co-exist.

Range of Set Associative Caches
For a fixed-size cache, each increase by a factor of two in associativity:
- doubles the number of blocks per set (i.e., the number of ways) and halves the number of sets
- decreases the size of the index by 1 bit and increases the size of the tag by 1 bit

Address fields: Tag (used for the tag compare) | Index (selects the set) | Block offset (selects the word in the block) | Byte offset

Decreasing associativity -> direct mapped (only one way): smaller tags, only a single comparator
Increasing associativity -> fully associative (only one set): the tag is all the address bits except the block and byte offsets
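The 2-misses result above is easy to verify with a small simulator. A minimal sketch assuming the same sizes as before (4 blocks total, now organized as 2 sets of 2 ways) and LRU replacement, which the replacement-policy slide below discusses:

```python
def simulate_set_assoc(refs, num_blocks, ways):
    """Count misses for word-address references in a `ways`-way set
    associative cache of num_blocks one-word blocks, LRU replacement."""
    num_sets = num_blocks // ways
    sets = [[] for _ in range(num_sets)]  # per-set tag list, front = LRU
    misses = 0
    for addr in refs:
        idx, tag = addr % num_sets, addr // num_sets
        tags = sets[idx]
        if tag in tags:
            tags.remove(tag)              # hit: move tag to MRU position
        else:
            misses += 1
            if len(tags) == ways:         # set full: evict the LRU entry
                tags.pop(0)
        tags.append(tag)
    return misses

refs = [0, 4, 0, 4, 0, 4, 0, 4]
# Words 0 and 4 map to the same set but occupy different ways:
print(simulate_set_assoc(refs, 4, 2))  # -> 2 misses (both compulsory)
```

Only the two compulsory misses remain; the conflict misses that plagued the direct-mapped cache disappear because both blocks co-exist in one set.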
How Much Associativity is Right?
Increased associativity decreases the miss rate, but with diminishing returns.
Simulation of a system with a 64 KB D-cache, 16-word blocks, SPEC2000:
- 1-way: 10.3%
- 2-way: 8.6%
- 4-way: 8.3%
- 8-way: 8.1%

Costs of Set Associative Caches
N-way set associative cache costs:
- N comparators: delay and area
- MUX delay (set selection) before the data is available
- Data is available only after set selection and the Hit/Miss decision. In a direct-mapped cache, the cache block is available before the Hit/Miss decision, so it is possible to just assume a hit, continue, and recover later if it was a miss; a set-associative cache cannot do this.
When a miss occurs, which way's block do we pick for replacement?
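The earlier point that each doubling of associativity moves one bit from the index to the tag can be tabulated directly. A minimal sketch, assuming an illustrative 4 KiB cache with 16-byte blocks and 32-bit addresses (these sizes are not from the slides):

```python
import math

# Assumed example geometry: 32-bit addresses, 4 KiB of data, 16-byte blocks.
ADDR_BITS, CACHE_BYTES, BLOCK_BYTES = 32, 4096, 16

num_blocks = CACHE_BYTES // BLOCK_BYTES      # 256 blocks in total
offset_bits = int(math.log2(BLOCK_BYTES))    # block offset + byte offset

for ways in (1, 2, 4, 256):
    num_sets = num_blocks // ways
    index_bits = int(math.log2(num_sets))    # 0 bits when fully associative
    tag_bits = ADDR_BITS - index_bits - offset_bits
    print(f"{ways:>3}-way: index {index_bits} bits, tag {tag_bits} bits")
# 1-way has an 8-bit index and 20-bit tag; fully associative (256-way)
# has no index and a 28-bit tag covering everything but the offsets.
```

Each doubling of the ways shrinks the index by exactly one bit and grows the tag by one bit, while the offset bits stay fixed, matching the address-field diagram above.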
Replacement Policy
Direct mapped: no choice.
Set associative:
- Prefer a non-valid entry, if there is one
- Otherwise, choose among the entries in the set
Least-recently used (LRU) is common: choose the one unused for the longest time. Simple for 2-way, manageable for 4-way, too hard beyond that.
Random: perhaps surprisingly, gives about the same performance as LRU for high associativity.

Benefits of Set Associative Caches
The choice of direct mapped or set associative depends on the cost of a miss versus the benefit of a hit.
The largest gains come in going from direct mapped to 2-way (20%+ reduction in miss rate).
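The LRU-versus-random comparison above can be explored with a toy experiment. A minimal sketch; the trace, cache shape, and seed are all made up, so the exact miss counts are illustrative only and say nothing about real workloads:

```python
import random

def misses(refs, num_sets, ways, policy):
    """Count misses under 'lru' or 'random' replacement; each per-set
    list keeps tags with the LRU entry at the front."""
    sets = [[] for _ in range(num_sets)]
    miss = 0
    for a in refs:
        s, tag = a % num_sets, a // num_sets
        if tag in sets[s]:
            if policy == "lru":           # record recency only for LRU
                sets[s].remove(tag)
                sets[s].append(tag)
        else:
            miss += 1
            if len(sets[s]) == ways:      # set full: pick a victim
                victim = 0 if policy == "lru" else random.randrange(ways)
                sets[s].pop(victim)
            sets[s].append(tag)
    return miss

random.seed(0)
trace = [random.randrange(256) for _ in range(10_000)]  # toy address trace
results = {p: misses(trace, num_sets=8, ways=8, policy=p)
           for p in ("lru", "random")}
print(results)
```

On a trace with little locality like this one, the two policies land close together, which is consistent with the slide's remark for high associativity; on real programs with strong locality, LRU tends to pull ahead.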
Reducing Cache Miss Rates #2
Use multiple levels of caches.
With advancing technology, we have more than enough room on the die for bigger L1 caches or for a second level of cache, normally a unified L2 cache (it holds both instructions and data) and, in some cases, even a unified L3 cache.
For our example: an ideal CPI of 2, a 100-cycle miss penalty (to main memory), a 25-cycle miss penalty (to the UL2$), 36% load/stores, a 2% (4%) L1 I$ (D$) miss rate, and a 0.5% UL2$ miss rate:
  CPI_stalls = 2 + 0.02x25 + 0.36x0.04x25 + 0.005x100 + 0.36x0.005x100 = 3.54
(as compared to 5.44 with no L2$)

Multilevel Cache Design Considerations
Design considerations for L1 and L2 caches are different:
- The primary cache should focus on minimizing hit time in support of a shorter clock cycle: smaller capacity with smaller block sizes
- The secondary cache(s) should focus on reducing the miss rate to reduce the penalty of long main-memory access times: larger capacity with larger block sizes, and higher levels of associativity
The miss penalty of the L1 cache is significantly reduced by the presence of an L2 cache, so the L1 can be smaller (i.e., faster) even with a higher miss rate.
For the L2 cache, hit time is less important than miss rate: the L2$ hit time determines the L1$'s miss penalty.
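The CPI arithmetic above is easy to check mechanically. This sketch just re-evaluates the slide's own formula, term by term, with its stated parameters:

```python
# Parameters taken from the slide's example.
cpi_ideal   = 2.0
mem_penalty = 100     # cycles: L2 miss serviced by main memory
l2_penalty  = 25      # cycles: L1 miss serviced by the unified L2
ld_st_frac  = 0.36    # fraction of instructions that are loads/stores
l1_i_miss, l1_d_miss = 0.02, 0.04
l2_miss     = 0.005   # global UL2$ miss rate

cpi_with_l2 = (cpi_ideal
               + l1_i_miss * l2_penalty              # I$ misses hitting in L2
               + ld_st_frac * l1_d_miss * l2_penalty # D$ misses hitting in L2
               + l2_miss * mem_penalty               # instruction refs missing L2
               + ld_st_frac * l2_miss * mem_penalty) # data refs missing L2

cpi_no_l2 = (cpi_ideal
             + l1_i_miss * mem_penalty
             + ld_st_frac * l1_d_miss * mem_penalty)

print(round(cpi_with_l2, 2))  # -> 3.54
print(round(cpi_no_l2, 2))    # -> 5.44
```

The L2 cache turns most 100-cycle stalls into 25-cycle stalls, cutting CPI from 5.44 to 3.54 even though the L1 miss rates are unchanged.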
Using the Memory Hierarchy Well
FIGURE 5.18: Comparing quicksort and radix sort by (a) instructions executed per item sorted, (b) time per item sorted, and (c) cache misses per item sorted. This data is from a paper by LaMarca and Ladner [1996]. Although the numbers would change for newer computers, the idea still holds. Due to such results, new versions of radix sort have been invented that take the memory hierarchy into account, to regain its algorithmic advantages.
The basic idea of cache optimizations is to use all the data in a block repeatedly before it is replaced on a miss.
Copyright Elsevier, Inc. All rights reserved.

Two Machines' Cache Parameters
Cortex-A8 Data Cache Miss Rates
FIGURE 5.45: Data cache miss rates for the ARM Cortex-A8 when running Minnespec, a small version of SPEC2000. Applications with larger memory footprints tend to have higher miss rates in both L1 and L2. Note that the L2 rate is the global miss rate, that is, counting all references, including those that hit in L1. (See the Elaboration in Section 5.4.) Mcf is known as a cache buster.

Intel Core i7 920 Data Cache Miss Rates
FIGURE 5.47: The L1, L2, and L3 data cache miss rates for the Intel Core i7 920 running the full integer SPEC CPU2006 benchmarks.
Summary: Improving Cache Performance
1. Reduce the time to hit in the cache:
- Smaller cache
- Direct-mapped cache
- Smaller blocks
- For writes: no write allocate (no "hit" on the cache, just write to the write buffer), or write allocate (to avoid two cycles, first checking for a hit and then writing, pipeline writes via a delayed write buffer to the cache)
2. Reduce the miss rate:
- Bigger cache
- More flexible placement (increase associativity)
- Larger blocks (16 to 64 bytes is typical)
- Victim cache: a small buffer holding the most recently replaced blocks
3. Reduce the miss penalty:
- Smaller blocks
- Use a write buffer to hold dirty blocks being replaced, so you don't have to wait for the write to complete before reading
- Check the write buffer (and/or victim cache) on a read miss: you may get lucky
- For large blocks, fetch the critical word first
- Use multiple cache levels; the L2 cache is often not tied to the CPU clock rate
Summary: The Cache Design Space
Several interacting dimensions:
- Cache size
- Block size
- Associativity
- Replacement policy
- Write-through vs. write-back
- Write allocation
The optimal choice is a compromise:
- It depends on access characteristics, which are hard to predict
- It depends on technology and cost
- Simplicity often wins
[Figure: the classic design-space plot, with "Good" and "Bad" regions as cache size, associativity, and block size vary against competing factors A and B, from less to more.]

Concluding Remarks
Fast memories are small; large memories are slow:
- We want fast, large memories
- Caching gives this illusion
Principle of locality: programs use a small part of their memory space frequently.
Memory hierarchy: L1 cache, L2 cache, ..., DRAM memory, disk, with each level larger and slower than the one above it.