CS 61C: Great Ideas in Computer Architecture
Lecture 6: Caches, Part 3
Instructor: Sagar Karandikar, sagark@eecs.berkeley.edu, http://inst.eecs.berkeley.edu/~cs61c

You Are Here!
Software: Parallel Requests, assigned to computer (e.g., search "Katz"); Parallel Threads, assigned to core (e.g., lookup, ads); Parallel Instructions, >1 instruction @ one time (e.g., 5 pipelined instructions); Parallel Data, >1 data item @ one time (e.g., add of 4 pairs of words); Hardware descriptions, all gates @ one time; Programming Languages.
Hardware: Warehouse-Scale Computer; Computer (Core, Memory, Input/Output); Core (Instruction Unit(s), Functional Unit(s): A0+B0, A1+B1, A2+B2, A3+B3); Main Memory; Smart Phone; Logic Gates.
Harness Parallelism & Achieve High Performance. Today's Lecture: Caches.

Review: Caches
- Processor issues addresses; the cache breaks them into Tag/Index/Offset
- Direct-mapped vs. set-associative vs. fully associative
- AMAT = Hit Time + Miss Rate * Miss Penalty
- 3 Cs of cache misses: Compulsory, Capacity, Conflict
- Effect of cache parameters on performance

Review: Processor Address Fields Used by Cache Controller
- Block Offset: byte address within block; #bits = log2(block size in bytes)
- Index: selects which set we want; #bits = log2(number of sets)
- Tag: remaining portion of the processor address; #bits = processor address size - offset bits - index bits
Processor Address (32 bits total): | Tag | Set Index | Block Offset |

Review: Direct-Mapped Cache
- One-word blocks, cache size = 1Ki words (or 4 KiB)
- Valid bit ensures something useful is in the cache for this index
- Compare Tag with the upper part of the address to see if a hit
- Read data from the cache instead of memory if a hit
[Figure: address split into Tag / Index / Block Offset; Valid bit and Tag array feed a comparator that signals a hit]

Review: Multiword-Block Direct-Mapped Cache
- Four words/block, cache size = 1Ki words
[Figure: address split into Tag / Index / Block Offset / Byte Offset; block offset selects the word within the block]
Review: Four-Way Set-Associative Cache
- 2^8 = 256 sets, each with four ways (each with one block)
[Figure: address split into Tag / Index / Byte Offset; Ways 0-3, each with Valid + Tag + data; a 4-to-1 mux selects the hit way]

Review: Cache Measurement Terms
- Hit rate: fraction of accesses that hit in the cache
- Miss rate: 1 - hit rate
- Miss penalty: time to replace a block from the lower level of the memory hierarchy into the cache
- Hit time: time to access the cache memory (including tag comparison)
These are controlled by the myriad cache parameters we can set.

Primary Cache Parameters
- Block size (aka line size): how many bytes of data in each cache entry?
- Associativity: how many ways in each set?
  Direct-mapped => associativity = 1
  Set-associative => 1 < associativity < #Entries
  Fully associative => associativity = #Entries
- Capacity (bytes) = total #Entries * block size
- #Entries = #Sets * associativity

Other Cache Parameters
- Write policy
- Replacement policy

Write Policy Choices
- Cache hit:
  - Write-through: write both cache & memory on every access; generally higher memory traffic, but simpler pipeline & cache design
  - Write-back: write the cache only; memory is written only when a dirty entry is evicted. A dirty bit per line reduces write-back traffic. Must handle 0, 1, or 2 accesses to memory for each load/store.
- Cache miss:
  - No-write-allocate: only write to main memory
  - Write-allocate (aka fetch-on-write): fetch the block into the cache
- Common combinations: write-through with no-write-allocate; write-back with write-allocate

Replacement Policy
In an associative cache, which line from a set should be evicted when the set becomes full?
- Random
- Least-Recently Used (LRU): LRU cache state must be updated on every access; a true implementation is only feasible for small sets (2-way); a pseudo-LRU binary tree is often used for 4- to 8-way
- First-In, First-Out (FIFO), aka Round-Robin: used in highly associative caches
- Not-Most-Recently Used (NMRU): FIFO with an exception for the most-recently-used line or lines
Replacement policy is a second-order effect. Why? Replacement only happens on misses.
Sources of Cache Misses (3 Cs)
- Compulsory (cold start, first reference): 1st access to a block in memory; a cold fact of life, not a lot you can do about it. If running billions of instructions, compulsory misses are insignificant.
- Capacity: the cache cannot contain all blocks accessed by the program; misses that would not occur with an infinite cache.
- Conflict (collision): multiple memory locations map to the same cache set; misses that would not occur with an ideal fully associative cache.

Rules for Determining Miss Type for a Given Access Pattern in 61C
1) Compulsory: a miss is compulsory if and only if it results from accessing the data for the first time. If you have ever brought the data you are accessing into the cache before, it's not a compulsory miss; otherwise it is.
2) Conflict: a conflict miss is a miss that's not compulsory and that would have been avoided if the cache were fully associative. Imagine a cache with the same parameters but fully associative (with LRU). If that would've avoided the miss, it's a conflict miss.
3) Capacity: a miss that would not have happened with an infinitely large cache. If your miss is not a compulsory or conflict miss, it's a capacity miss.

Impact of Cache Parameters on Performance
AMAT = Hit Time + Miss Rate * Miss Penalty
Note: we assume we always search the cache first, so hit time must be incorporated for both hits and misses! For misses, characterize by the 3 Cs.

Administrivia
- Project 3-1 out. Last week we built a CPU together; this week, you start building your own!
- HW4 out - Caches
- Guerrilla Section on Pipelining, Caches on Thursday, 5-7pm, Woz
- Midterm is one week from tomorrow, in this room, at this time
- Two double-sided 8.5"x11" handwritten cheatsheets; we'll provide a MIPS green sheet; no electronics
- Covers up to and including tomorrow's lecture (7/)*
- Review session is Friday, 7/4 from -4pm in HP Aud
  * This is one less lecture than initially indicated

Break
CPU-Cache Interaction (5-stage pipeline)
[Figure: 5-stage pipeline with a primary instruction cache and a primary data cache. On an instruction-cache miss, stall at the fetch stage (but not the whole CPU); on a data-cache miss, stall the entire CPU. Misses refill from the lower levels of the memory hierarchy.]

Increasing Block Size?
- Hit time as block size increases? Unchanged, though there might be a slight hit-time reduction: the number of tags shrinks, so the memory holding the tags is faster to access.
- Miss rate as block size increases? Goes down at first due to spatial locality, then increases due to more conflict misses from having fewer blocks in the cache.
- Miss penalty as block size increases? Rises with longer block size, but with a fixed initial latency that is amortized over the whole block.

Increasing Associativity?
- Hit time as associativity increases? Increases, with a large step from direct-mapped to >=2 ways, since we now need to mux the correct way to the processor; smaller increases in hit time for further increases in associativity.
- Miss rate as associativity increases? Goes down due to reduced conflict misses, but most of the gain is from 1-way -> 2-way -> 4-way, with limited benefit from higher associativities.
- Miss penalty as associativity increases? Unchanged; the replacement policy runs in parallel with fetching the missing line from memory.

Increasing #Entries? (Capacity)
- Hit time as #entries increases? Increases, since tags and data are read from larger memory structures.
- Miss rate as #entries increases? Goes down due to reduced capacity and conflict misses. Architects' rule of thumb: miss rate drops ~2x for every ~4x increase in capacity (only a gross approximation).
- Miss penalty as #entries increases? Unchanged.

How to Reduce Miss Penalty?
Could there be locality on misses from a cache? Use multiple cache levels! With Moore's Law, there is more room on die for bigger L1 caches and for a second-level (L2) cache, and in some cases even an L3 cache!
IBM mainframes even have a large (~GB) L4 cache off-chip.

Review: Memory Hierarchy
Processor -> Level 1 -> Level 2 -> Level 3 -> ... -> Level n (inner levels near the processor, outer levels far from it)
Increasing distance from the processor means decreasing speed and increasing size of memory at each level. As we move to outer levels, the latency goes up and the price per bit goes down.
Local vs. Global Miss Rates
- Local miss rate: the fraction of references to one level of a cache that miss
  Local miss rate L2$ = L2$ misses / L1$ misses
- Global miss rate: the fraction of references that miss in all levels of a multilevel cache
  L2$ global miss rate = L2$ misses / total accesses
  The L2$ local miss rate is much greater (>>) than the global miss rate
Example hierarchy: L1: 32 KB I$, 32 KB D$; L2: 256 KB; L3: 4 MB

Chaining the levels:
L2$ global miss rate = L2$ misses / total accesses
  = (L2$ misses / L1$ misses) x (L1$ misses / total accesses)
  = local miss rate L2$ x local miss rate L1$
AMAT = time for a hit + miss rate x miss penalty
AMAT = time for an L1$ hit + (local) miss rate L1$ x (time for an L2$ hit + (local) miss rate L2$ x L2$ miss penalty)

On a Retina MBP, run: sudo sysctl -a | grep cache
hw.cachelinesize = 64
hw.l1icachesize = 32768
hw.l1dcachesize = 32768
hw.l2cachesize = 262144
hw.l3cachesize = 6291456
(...plus assorted other cache-related sysctls: kernel, VM, NFS, and routing caches)
On a hive machine, run: cat /sys/devices/system/cpu/cpu0/cache/index0/*
A cool thing about Unix: you can poke around in the system by reading files.

[Figure: CPI, miss rates, and DRAM accesses for SpecInt2006, shown for data only vs. instructions and data]
Clickers/Peer Instruction
Overall, what are the L2 and L3 local miss rates?
A: L2 > 50%, L3 > 50%
B: L2 < 50%, L3 < 50%
C: L2 ~ 50%, L3 ~ 50%
D: L2 > 50%, L3 < 50%
E: L2 > 50%, L3 ~ 50%

CS61C in the News: Privacy-Respecting Laptops
On the front page of Hacker News yesterday: the Gluglug Libreboot X200 and the Purism Librem 13/15.
Why does it matter? You can run all the free software you want, but:
1) Who knows what the hardware is really doing?
2) 99.999% of your machines still have proprietary blobs of driver/firmware code (some are even encrypted)
https://news.ycombinator.com/item?id=9934
https://news.ycombinator.com/item?id=99863

Break

Let's say we're interested in analyzing program performance with respect to the cache:

for (int i = 0; i < 2048; i += 2) { sum += a[i]; }   // loop 1
for (int i = 1; i < 2048; i += 2) { sum += a[i]; }   // loop 2

Assumptions:
1) We only care about the D$.
2) The array is block-aligned, i.e., the first byte of a[0] is the first byte in a block.
3) Local variables live in registers.
So what do we care about? The sequence of accesses to the array a.

Important first questions:
1) What does our access pattern look like within a block?
2) How many cache blocks does our array take up? Does it fit entirely in the cache at once?
1) What does our access pattern look like within a block?
We're iterating over 2048 ints, each of which is 4 bytes. A block in this cache holds 4 ints: (16 B/block) / (4 B/int) = 4 ints/block.
Block as ints: a[0] | a[1] | a[2] | a[3]
Iteration 0: compulsory miss. Iterations 1, 2, 3: hits. This repeats over and over: 25% miss rate.

Wait: we're iterating over an array of 2048 ints but only accessing 1024 of them, each of which is 4 bytes. A block in this cache still holds 4 ints.

1) What does our access pattern look like within a block? (Let's stick with the first loop.)
We're accessing 1024 ints (not 2048), each of which is 4 bytes; a block holds 4 ints, and the loop touches every other one.
Block as ints: a[0] | a[1] | a[2] | a[3]
Iteration 0: compulsory miss (a[0]). Iteration 1: hit (a[2]). This repeats over and over for the first loop: 50% miss rate.

But now we have to worry about question 2:
2) How many cache blocks does our array take up?
The array contains 2048 ints; that's 2048 * 4 = 8192 B. Blocks are 16 B each, so that's 8192 / 16 = 512 blocks.
Does it fit entirely in the cache at once? No: the cache is only 4096 B and contains only 256 blocks.

So, to figure out what loop 2's performance is going to be like, we need to know what's in the cache at the end of loop 1.
// What's the state of the cache here, at the end of loop 1?
Say for the sake of argument that our array starts at address 0. Then, halfway through the first loop (between i = 0 and i = 1024), the first half of the array (a[0] ... a[1023]) is in the cache.
Now, when we execute the second half of the accesses in loop 1, we replace everything already in the cache: a[0] ... a[1023] are evicted in favor of a[1024] ... a[2047].
So, at the end of the first loop, the second half of the array (a[1024] ... a[2047]) is in the cache.

Now, let's take a look at loop 2. Notice that it's going to start by accessing the first half of the array, none of which is in the cache.

1) What does our access pattern look like within a block? (Now, for the second loop.)
We're again accessing 1024 of the 2048 ints, 4 bytes each, with 4 ints per block, touching every other one.
Iteration 0: capacity miss (a[1024] was in this spot in the cache). Iteration 1: hit. This repeats over and over for the second loop: 50% miss rate.

So, we got a 50% miss rate for loop #1 and a 50% miss rate for loop #2. Now, we compute a weighted average:
Total miss rate = weighted average of miss rates
Half the accesses were in loop 1, the other half in loop 2:
Total miss rate = 0.5 * loop 1 miss rate + 0.5 * loop 2 miss rate = 0.5 * 50% + 0.5 * 50% = 50%
Clicker Question p1: KiB direct-mapped cache
What's the miss rate for the first loop?
A: 0%   B: 25%   C: 50%   D: 75%   E: 100%

Clicker Question p2: KiB direct-mapped cache
What's the miss rate for the second loop?
A: 0%   B: 25%   C: 50%   D: 75%   E: 100%

Clicker Question p3: KiB direct-mapped cache
What types of misses do we get in the first/second loops?
A: Capacity/Capacity   B: Compulsory/Capacity   C: Compulsory/Conflict   D: Conflict/Capacity   E: Compulsory/Compulsory

In Conclusion: A Huge Cache Design Space
Several interacting dimensions:
- cache size
- block size
- associativity
- replacement policy
- write-through vs. write-back
- write-allocation
The optimal choice is a compromise: it depends on access characteristics (workload; use as I-cache vs. D-cache). Simplicity often wins.
[Figure: qualitative good/bad tradeoff curves vs. cache size, associativity (less to more), and block size (factor A vs. factor B)]
So we learned how to analyze code performance on a cache.