/6/7

Reducing Conflict Misses with Set-Associative Caches
Not too conflicty. Not too slow. Just right!

[Figure: paper simulation of an 8-byte, 2-way set-associative cache. Main memory holds blocks A-Q; a sequence of loads fills the ways with block pairs such as N/O, E/F, C/D, P/Q. What should the offset be? What should the index be? What should the tag be?]
Misses: the Three C's
- Cold (compulsory): never seen this address before
- Conflict: cache associativity is too low
- Capacity: cache is too small

ABCs of Caches: t_avg = t_hit + %miss * t_miss
- Associativity: decreases conflict misses, increases hit time
- Block size: decreases cold misses, increases conflict misses
- Capacity: decreases capacity misses, increases hit time

Which caches get what properties?
- L1 caches: fast; design with t_hit (speed) in mind
- L2/L3 caches: big; more associative, bigger block sizes, larger capacity; design with miss rate in mind
- t_avg = t_hit + %miss * t_miss

Summary so far
Things we've covered:
- The need for speed
- Locality to the rescue!
- Calculating average memory access time
- $ (cache) misses: cold, conflict, capacity
- $ characteristics: associativity, block size, capacity
Things we skipped (and are about to cover):
- Cache overhead
- Replacement policies
- Writes

Basic Memory Array Structure
- Number of entries: 2^n entries need n bits for lookup; example: 1024 entries, 10-bit address
- Decoder changes the n-bit address into a 2^n-bit one-hot signal
- Size of entries: width of data accessed; here: 256 bits (32 bytes)
- 1024-entry x 256-bit SRAM; 256 bits read from the cache

Caches: Finding Data via Indexing
- Basic cache: array of cache lines (or blocks); here: 32KB cache (1024 entries, 32B blocks)
- A hash table in hardware
- To find an entry: decode part of the address. Which part?
  - 32-bit address; 32B blocks -> 5 lowest bits locate a byte in the block = offset bits
  - 1024 entries -> next 10 bits find the entry = index bits
- Note: nothing says the index must be these bits, but these work best (think about why)
- Address fields: tag [31:15], index [14:5], offset [4:0]
  - index: which entry? (1K possible); offset: which byte? (32 possible)
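The index/offset/tag split above can be sketched in code. A minimal sketch, assuming the slide's parameters (32-bit addresses, 32B blocks, 1024 entries); the function name is illustrative:

```python
BLOCK_SIZE = 32      # bytes -> 5 offset bits
NUM_ENTRIES = 1024   # -> 10 index bits

OFFSET_BITS = BLOCK_SIZE.bit_length() - 1    # log2(32) = 5
INDEX_BITS = NUM_ENTRIES.bit_length() - 1    # log2(1024) = 10

def split_address(addr):
    """Return (tag, index, offset) fields of a 32-bit address."""
    offset = addr & (BLOCK_SIZE - 1)                    # addr[4:0]
    index = (addr >> OFFSET_BITS) & (NUM_ENTRIES - 1)   # addr[14:5]
    tag = addr >> (OFFSET_BITS + INDEX_BITS)            # addr[31:15]
    return tag, index, offset

print(split_address(0x0000ABCD))   # (1, 350, 13)
```

Pulling the index out of the middle of the address is just two shifts and a mask, which is why power-of-two block sizes and entry counts are convenient.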
Knowing that You Found It: Tags
- Each entry can hold one of 2^17 blocks: all blocks with the same index bit pattern
- How to know which (if any) is currently there?
  - To each entry attach a tag and a valid bit
  - Compare the entry tag to the address tag bits
  - No need to match the index bits (why?)
- Lookup algorithm:
  - Read the entry indicated by the index bits
  - Hit if the tag matches and the valid bit is set
  - Otherwise, a miss: go to the next level

Calculating Tag Overhead
- "32KB cache" means the cache holds 32KB of data (called capacity); tag storage is considered overhead
- Tag overhead of a 32KB cache with 1024 x 32B entries:
  - 32B blocks -> 5-bit offset
  - 1024 entries -> 10-bit index
  - 32-bit address - (5-bit offset + 10-bit index) = 17-bit tag
  - (17-bit tag + 1-bit valid) x 1024 entries = 18Kb tags = 2.2KB tags -> ~6% overhead
- What about 64-bit addresses? Tag increases to 49 bits, ~20% overhead (worst case)

Handling a Cache Miss
- What if the requested data isn't in the cache? How does it get in there?
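The overhead arithmetic above can be checked with a short script. A sketch under the slide's assumptions (power-of-two capacity and block size, one valid bit per entry); `tag_overhead` is a hypothetical helper, and note that 2.25KB out of 32KB is strictly ~7%, which the slides round to ~6%:

```python
import math

def tag_overhead(capacity_bytes, block_bytes, addr_bits):
    """Tag bits per entry, and tag storage as a fraction of data capacity."""
    entries = capacity_bytes // block_bytes
    offset_bits = int(math.log2(block_bytes))
    index_bits = int(math.log2(entries))
    tag_bits = addr_bits - offset_bits - index_bits
    store_bits = (tag_bits + 1) * entries    # +1 for the valid bit
    return tag_bits, store_bits / (capacity_bytes * 8)

print(tag_overhead(32 * 1024, 32, 32))   # (17, ~0.07): 18Kb of tags
print(tag_overhead(32 * 1024, 32, 64))   # (49, ~0.195): the 64-bit worst case
```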
- Cache controller: a finite state machine
  - Remembers the miss address
  - Accesses the next level of memory
  - Waits for the response
  - Writes the data/tag into the proper locations
- All of this happens on the fill path, sometimes called the "backside"

Cache Performance Equation
- Access: read or write to the cache
- Hit: desired data found in the cache (t_hit)
- Miss: desired data not found in the cache
  - Must get the data from another component
  - No notion of a "miss" in a register file
- Fill: action of placing data into the cache
- %miss (miss rate): #misses / #accesses
- t_hit: time to read data from (write data to) the cache
- t_miss: time to read data into the cache
- Performance metric: average access time: t_avg = t_hit + %miss * t_miss

CPI Calculation with Cache Misses
- Parameters:
  - Simple pipeline with base CPI of 1
  - Instruction mix: 30% loads/stores
  - I$: %miss = 2%, t_miss = 10 cycles
  - D$: %miss = 10%, t_miss = 10 cycles
- What is the new CPI?
  - CPI_I$ = ?
  - CPI_D$ = ?
  - CPI_new = ?
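The performance equation is simple enough to express directly. A minimal sketch; the 5%/10-cycle numbers in the example call are illustrative, not from the slides:

```python
def t_avg(t_hit, miss_rate, t_miss):
    """Average access time: t_avg = t_hit + %miss * t_miss."""
    return t_hit + miss_rate * t_miss

# e.g., 1-cycle hit, 5% miss rate, 10-cycle miss penalty:
print(t_avg(1, 0.05, 10))   # 1.5 cycles
```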
CPI Calculation with Cache Misses
- Parameters:
  - Simple pipeline with base CPI of 1
  - Instruction mix: 30% loads/stores
  - I$: %miss = 2%, t_miss = 10 cycles
  - D$: %miss = 10%, t_miss = 10 cycles
- What is the new CPI?
  - CPI_I$ = %miss_I$ * t_miss = 0.02 * 10 cycles = 0.2 cycles
  - CPI_D$ = %load/store * %miss_D$ * t_miss_D$ = 0.3 * 0.1 * 10 cycles = 0.3 cycles
  - CPI_new = CPI + CPI_I$ + CPI_D$ = 1 + 0.2 + 0.3 = 1.5

Measuring Cache Performance
- Ultimate metric is t_avg
  - Cache capacity and circuits roughly determine t_hit
  - Lower-level memory structures determine t_miss
- Measure %miss:
  - Hardware performance counters
  - Simulation
  - Paper simulation (next)

Cache Paper Simulation
- 4-bit addresses -> total memory size = 16B; simpler cache diagrams than 32 bits
- 8B cache, 2B blocks
  - Number of entries (or sets): 4 (capacity / block size)
- Figure out how the address splits into offset/index/tag bits
  - Offset: least significant log2(block size) = log2(2) = 1 bit
  - Index: next log2(number of entries) = log2(4) = 2 bits
  - Tag: rest = 4 - 1 - 2 = 1 bit
- Cache diagram tracks the addresses of the data in each block; the values don't matter
[Table: example trace showing cache contents, address, and outcome per access for Sets 0-3, ending in a run of hits; the numeric addresses were lost in transcription]
- How to reduce %miss? And hopefully t_avg?

Capacity and Performance
- Simplest way to reduce %miss: increase capacity
  + Miss rate decreases monotonically
  - "Working set": the data and insns a program is actively using
  - Diminishing returns
  - However, t_hit increases: latency is proportional to sqrt(capacity)
- t_avg?
[Figure: %miss versus cache capacity, flattening beyond the working set size]
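The paper simulation can also be run as code. A sketch of the direct-mapped configuration above (8B cache, 2B blocks, 4-bit addresses); the trace is an illustrative assumption, since the slide's own address sequence did not survive transcription:

```python
OFFSET_BITS, INDEX_BITS = 1, 2      # 2B blocks, 4 sets

def simulate_dm(trace):
    """Direct-mapped cache: one resident tag per set; fill on miss."""
    resident = {}                   # set index -> tag currently there
    outcomes = []
    for addr in trace:
        index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
        tag = addr >> (OFFSET_BITS + INDEX_BITS)
        if resident.get(index) == tag:
            outcomes.append("Hit")
        else:
            outcomes.append("Miss")
            resident[index] = tag   # fill
    return outcomes

# 0b0000 and 0b1000 share set 0 but differ in tag, so they conflict:
print(simulate_dm([0b0000, 0b0001, 0b1000, 0b0000]))
# ['Miss', 'Hit', 'Miss', 'Miss']
```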
Block Size
- Given capacity, manipulate %miss by changing organization
- One option: increase block size
  - Exploit spatial locality
  - Notice the index/offset bits change; the tag bits remain the same
- Ramifications:
  + Reduces %miss (up to a point)
  + Reduces tag overhead (why?)
  - Potentially useless data transfer
  - Premature replacement of useful data
  - Fragmentation
[Figure: 512 x 512-bit SRAM with 64B block size; 9-bit index; address fields tag [31:15], index [14:6], offset [5:0]]

Block Size and Tag Overhead
- Tag overhead of a 32KB cache with 1024 32B frames:
  - 32B frames -> 5-bit offset; 1024 frames -> 10-bit index
  - 32-bit address - (5-bit offset + 10-bit index) = 17-bit tag
  - (17-bit tag + 1-bit valid) x 1024 frames = 18Kb tags = 2.2KB tags -> ~6% overhead
- Tag overhead of a 32KB cache with 512 64B frames:
  - 64B frames -> 6-bit offset; 512 frames -> 9-bit index
  - 32-bit address - (6-bit offset + 9-bit index) = 17-bit tag
  - (17-bit tag + 1-bit valid) x 512 frames = 9Kb tags = 1.1KB tags -> ~3% overhead

Block Size Cache Paper Simulation
- 8B cache, 4B blocks: tag (1 bit), index (1 bit), offset (2 bits); 2 sets
  + Spatial prefetching: a miss brings in the neighboring word -> later hit (spatial locality)
  - Conflicts: a miss can kick out useful data
[Table: example trace for the 4B-block cache; the numeric addresses were lost in transcription]

Effect of Block Size on Miss Rate
- Two effects on miss rate:
  + Spatial prefetching (good): for blocks with adjacent addresses; turns miss/miss pairs into miss/hit pairs
  - Interference (bad): for blocks with non-adjacent addresses that land in adjacent entries; turns hits into misses by disallowing simultaneous residence (consider the entire cache as one big block)
- Both effects are always present
  - Spatial prefetching dominates initially; depends on the size of the cache
  - Good block size is 16-128B; program dependent
[Figure: %miss versus block size, a U-shaped curve]

Block Size and Miss Penalty
- Does increasing block size increase t_miss?
- Don't larger blocks take longer to read, transfer, and fill?
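The spatial-prefetching effect can be seen by replaying one trace at two block sizes. A sketch, assuming the 8B-capacity paper-simulation setup; the four-byte trace is illustrative:

```python
import math

def miss_count(trace, capacity=8, block_size=2):
    """Count misses in a direct-mapped cache of the given geometry."""
    entries = capacity // block_size
    offset_bits = int(math.log2(block_size))
    index_bits = int(math.log2(entries))
    resident = {}                       # set index -> tag
    misses = 0
    for addr in trace:
        block = addr >> offset_bits     # block number
        index = block & (entries - 1)
        tag = block >> index_bits
        if resident.get(index) != tag:
            misses += 1
            resident[index] = tag
    return misses

trace = [0, 1, 2, 3]                    # four adjacent bytes
print(miss_count(trace, block_size=2))  # 2 misses
print(miss_count(trace, block_size=4))  # 1 miss: neighbors come along for free
```

With the larger block, each miss also fetches the adjacent bytes, turning the second miss/hit pair into hits, which is exactly the miss/miss-to-miss/hit conversion described above.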
- They do, but the t_miss of an isolated miss is not affected
  - Critical Word First / Early Restart (CWF/ER)
  - The requested word is fetched first, and the pipeline restarts immediately
  - The remaining words in the block are transferred/filled in the background
- The t_miss of a cluster of misses will suffer
  - Reads/transfers/fills of two misses can't happen at the same time
  - Latencies can start to pile up
  - This is a bandwidth problem (more later)
Conflicts
- 8B cache, 2B blocks, direct-mapped: tag (1 bit), index (2 bits)
- Pairs of addresses with the same index bits conflict, regardless of block size (assuming capacity < 16B)
- Q: can we allow such pairs to simultaneously reside?
- A: yes, reorganize the cache to do so
[Table: example trace showing repeated conflict misses between two addresses that share an index]

Set-Associativity
- A block can reside in one of a few entries
  - Entry groups are called sets; each entry in a set is called a way
  - This is 2-way set-associative (SA)
  - 1-way -> direct-mapped (DM); 1-set -> fully-associative (FA)
  + Reduces conflicts
  - Increases t_hit: additional tag match & muxing
- Note: valid bits not shown
[Figure: 2-way set-associative cache with 512 sets; 9-bit index; address fields tag [31:14], index [13:5], offset [4:0]]

Set-Associativity: Lookup Algorithm
- Use the index bits to find the set
- Read the data/tags in all ways in parallel
- Any way with a tag match and its valid bit set -> hit
- Notice the tag/index/offset bits: only a 9-bit index (versus 10-bit for direct-mapped)
- Notice the block numbering

Associativity and Paper Simulation
- 8B cache, 2B blocks, 2-way set-associative: Set0.Way0, Set0.Way1, Set1.Way0, Set1.Way1
  + Avoids conflicts: two addresses with the same index can both be in the same set
  - Introduces some new conflicts: addresses get rearranged
- Conflict avoidance usually dominates
[Table: the same trace as before; the old conflict pair now hits, while one new conflict appears]

Replacement Policies
- Associative caches present a new design choice: on a cache miss, which block in the set to replace (kick out)?
- Some options:
  - Random
  - FIFO (first-in first-out)
  - LRU (least recently used): fits with temporal locality; LRU = least likely to be used in the future
  - NMRU (not most recently used): an easier-to-implement approximation of LRU; identical to LRU for 2-way set-associative caches
  - Belady's: replace the block that will be used furthest in the future; an unachievable optimum
- Which policy is simulated in the previous example?
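A sketch of the 2-way lookup with LRU replacement, using the paper-simulation parameters (8B cache, 2B blocks, so 2 sets of 2 ways); the trace is illustrative:

```python
OFFSET_BITS, INDEX_BITS, WAYS = 1, 1, 2   # 2B blocks, 2 sets, 2 ways

def simulate_2way(trace):
    """Each set holds up to WAYS tags, ordered LRU-first."""
    sets = [[] for _ in range(1 << INDEX_BITS)]
    outcomes = []
    for addr in trace:
        index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
        tag = addr >> (OFFSET_BITS + INDEX_BITS)
        ways = sets[index]
        if tag in ways:
            outcomes.append("Hit")
            ways.remove(tag)          # will re-append as most recent
        else:
            outcomes.append("Miss")
            if len(ways) == WAYS:
                ways.pop(0)           # evict the LRU way
        ways.append(tag)              # mark as most recently used
    return outcomes

# 0b0000 and 0b1000 conflicted when direct-mapped; now they coexist:
print(simulate_2way([0b0000, 0b1000, 0b0000, 0b1000]))
# ['Miss', 'Miss', 'Hit', 'Hit']
```

Keeping each set's tags in recency order makes the LRU victim simply the front of the list; for 2 ways this is the same choice NMRU would make.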
Associativity and Performance
- Higher-associativity caches:
  + Have better (lower) %miss; diminishing returns
  - However, t_hit increases: the more associative, the slower
- What about t_avg?
[Figure: %miss versus associativity, flattening by about 4-5 ways]
- Block size and number of sets should be powers of two
  - Makes indexing easier (just rip bits out of the address)
- 3-way set-associativity? No problem