CS 61C: Great Ideas in Computer Architecture Lecture 16: Caches, Part 3


CS 61C: Great Ideas in Computer Architecture, Lecture 16: Caches, Part 3. Instructor: Sagar Karandikar (sagark@eecs.berkeley.edu), http://inst.eecs.berkeley.edu/~cs61c

You Are Here! Software: Parallel Requests, assigned to computer (e.g., search "Katz"); Parallel Threads, assigned to core (e.g., lookup, ads); Parallel Instructions, >1 instruction @ one time (e.g., 5 pipelined instructions); Parallel Data, >1 data item @ one time (e.g., add of 4 pairs of words: A0+B0, A1+B1, A2+B2, A3+B3); Hardware descriptions, all gates @ one time; Programming Languages. Harness parallelism and achieve high performance. Hardware: Warehouse Scale Computer > Computer > Core > Instruction Unit(s), Functional Unit(s), Memory, Input/Output > Logic Gates; also Smart Phone. Today's lecture: Caches.

Review. Processor issues addresses; the cache breaks them into Tag/Index/Offset. Direct-mapped vs. set-associative vs. fully associative. AMAT = Hit Time + Miss Rate * Miss Penalty. 3 Cs of cache misses: Compulsory, Capacity, Conflict. Effect of cache parameters on performance.

Review: Processor Address Fields Used by the Cache Controller. Block offset: byte address within the block; #bits = log2(block size in bytes). Index: selects which index (set) we want; #bits = log2(number of sets). Tag: remaining portion of the processor address; #bits = processor address size - offset bits - index bits. Processor address (32 bits total): Tag | Set Index | Block Offset. (A C sketch of this decomposition follows below.)

Review: Direct-Mapped Cache. One-word blocks, cache size = 1Ki words (i.e., 4 KiB). The valid bit ensures there is something useful in the cache for this index. Compare the tag with the upper part of the address to see if there is a hit; read data from the cache instead of memory on a hit. [Diagram: 32-bit address split into Tag, Index, and Block offset; per-entry valid bit and tag comparator.]

Review: Multiword-Block Direct-Mapped Cache. Four words/block, cache size = 1Ki words. [Diagram: 32-bit address split into Tag (20 bits), Index (8 bits, entries 0..255), Block offset, and Byte offset; a mux selects the requested word out of the block.]
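To make the Tag/Index/Offset split concrete, here is a minimal C sketch that decomposes a 32-bit address for the multiword-block cache reviewed above (16-byte blocks, 256 sets). The macro names and the example address are our own illustration, not from the slides:

    #include <stdint.h>
    #include <stdio.h>

    /* Parameters of the reviewed cache: 4-word (16 B) blocks, 256 sets. */
    #define OFFSET_BITS 4   /* log2(16 B/block) */
    #define INDEX_BITS  8   /* log2(256 sets)   */

    int main(void) {
        uint32_t addr   = 0x12345678;   /* arbitrary example address */
        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
        printf("tag=0x%05x index=%u offset=%u\n", tag, index, offset);
        return 0;
    }

The remaining 32 - 8 - 4 = 20 bits form the tag, matching the field widths in the diagram above.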

Review: Four-Way Set-Associative Cache. 2^8 = 256 sets, each with four ways (each way holding one block). [Diagram: address split into Tag, Index, and Byte offset; per set, four (Valid, Tag, Data) ways are compared in parallel and a 4-to-1 mux selects the hitting way.]

Review: Cache Measurement Terms. Hit rate: fraction of accesses that hit in the cache. Miss rate: 1 - hit rate. Miss penalty: time to replace a block from a lower level of the memory hierarchy into the cache. Hit time: time to access the cache (including the tag comparison). These are controlled by the myriad cache parameters we can set.

Primary Cache Parameters. Block size (aka line size): how many bytes of data in each cache entry? Associativity: how many ways in each set? Direct-mapped => associativity = 1; set-associative => 1 < associativity < #entries; fully associative => associativity = #entries. Capacity (bytes) = total #entries * block size; #entries = #sets * associativity. Other parameters: write policy, replacement policy.

Write Policy Choices. On a hit: write-through writes both cache and memory on every access (generally higher memory traffic, but simpler pipeline and cache design); write-back writes the cache only, with memory written only when a dirty entry is evicted (a dirty bit per line reduces write-back traffic; must handle 0, 1, or 2 accesses to memory for each load/store). On a miss: no-write-allocate only writes to main memory; write-allocate (aka fetch-on-write) fetches the block into the cache. Common combinations: write-through with no-write-allocate, and write-back with write-allocate.

Replacement Policy. In an associative cache, which line from a set should be evicted when the set becomes full? Random. Least-Recently Used (LRU): LRU state must be updated on every access; a true implementation is only feasible for small sets (2-way); a pseudo-LRU binary tree is often used for 4- to 8-way (see the sketch below). First-In, First-Out (FIFO, aka round-robin): used in highly associative caches. Not-Most-Recently Used (NMRU): FIFO with an exception for the most-recently-used line or lines. Replacement policy is a second-order effect. Why? Replacement only happens on misses.
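Since true LRU is impractical beyond about 2 ways, here is a minimal sketch of the tree pseudo-LRU idea mentioned above, for one 4-way set. The 3-bit layout and the "each bit points toward the less-recently-used half" convention are our assumptions; real designs vary:

    #include <stdint.h>
    #include <stdio.h>

    /* One 4-way set's tree PLRU state: 3 bits forming a binary tree.
       b0 chooses between the {0,1} and {2,3} halves; b1 and b2 choose
       within each half. Each bit points at the less-recently-used side. */
    typedef struct { uint8_t b0, b1, b2; } plru4;

    /* Update on every hit or fill: make the tree point away from `way`. */
    static void plru4_touch(plru4 *t, int way) {
        if (way < 2) { t->b0 = 1; t->b1 = (way == 0); } /* used left half  */
        else         { t->b0 = 0; t->b2 = (way == 2); } /* used right half */
    }

    /* Pick a victim by walking toward the less-recently-used half. */
    static int plru4_victim(const plru4 *t) {
        return t->b0 ? (t->b2 ? 3 : 2) : (t->b1 ? 1 : 0);
    }

    int main(void) {
        plru4 t = {0, 0, 0};
        int accesses[] = {0, 1, 2, 0};
        for (int i = 0; i < 4; i++) plru4_touch(&t, accesses[i]);
        /* Way 3 was never touched, so it is the victim here (prints 3). */
        printf("victim after accessing 0,1,2,0: way %d\n", plru4_victim(&t));
        return 0;
    }

Only 3 bits per set are needed, versus tracking a full ordering of 4 ways, which is why this approximation is popular at 4-8 ways.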

Sources of Cache Misses (the 3 Cs). Compulsory (cold start, first reference): the 1st access to a block in memory; a cold fact of life, not a lot you can do about it. If running billions of instructions, compulsory misses are insignificant. Capacity: the cache cannot contain all blocks accessed by the program; these are misses that would not occur with an infinite cache. Conflict (collision): multiple memory locations map to the same cache set; these are misses that would not occur with an ideal fully associative cache.

Rules for Determining Miss Type for a Given Access Pattern in 61C (transcribed as code below). 1) Compulsory: a miss is compulsory if and only if it results from accessing the data for the first time. If you have ever brought the data you are accessing into the cache before, it's not a compulsory miss; otherwise it is. 2) Conflict: a conflict miss is a miss that is not compulsory and that would have been avoided if the cache were fully associative. Imagine a cache with the same parameters but fully associative (with LRU): if that would have avoided the miss, it's a conflict miss. 3) Capacity: a miss that would not have happened with an infinitely large cache. If your miss is not a compulsory or conflict miss, it's a capacity miss.

Impact of Cache Parameters on Performance. AMAT = Hit Time + Miss Rate * Miss Penalty. Note that we assume the cache is always searched first, so the hit time is incurred on both hits and misses! For misses, characterize them by the 3 Cs.

Administrivia. Project 3-1 is out: last week we built a CPU together; this week, you start building your own! HW4 is out: caches. Guerrilla section on pipelining and caches this Thursday, 5-7pm, Woz. The midterm is one week from tomorrow, in this room, at this time: two double-sided 8.5" x 11" handwritten cheatsheets; we'll provide a MIPS green sheet; no electronics. It covers up to and including tomorrow's lecture (this is one less lecture than initially indicated). The review session is Friday (7/24) from 2-4pm in the HP Auditorium.

Break.
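The three rules are really a per-access decision procedure. Here is a direct C transcription; the flag names are ours, and the two flags stand for the "thought experiments" the rules describe:

    #include <stdio.h>

    typedef enum { HIT, COMPULSORY_MISS, CONFLICT_MISS, CAPACITY_MISS } access_kind;

    /* Apply the 61C rules in order. */
    static access_kind classify(int hit,
                                int first_time_touched,      /* rule 1 */
                                int fully_assoc_would_hit)   /* rule 2: same size, LRU */
    {
        if (hit)                   return HIT;
        if (first_time_touched)    return COMPULSORY_MISS;
        if (fully_assoc_would_hit) return CONFLICT_MISS;
        return CAPACITY_MISS;                                /* rule 3: all that remain */
    }

    int main(void) {
        /* A miss on data seen before, which a fully associative cache of the
           same size would have avoided, is a conflict miss. Prints 1. */
        printf("%d\n", classify(0, 0, 1) == CONFLICT_MISS);
        return 0;
    }

Note the ordering matters: compulsory takes priority over conflict, and capacity is whatever is left over, exactly as the rules state.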

CPU-Cache Interaction (5-stage pipeline). [Diagram: PC (+4 adder, bubble insertion) feeds the primary instruction cache (addr, inst, hit?); decode/register fetch; ALU; primary data cache (addr, wdata, rdata, hit?); refills come from the lower levels of the memory hierarchy.] Stall the fetch stage on an instruction-cache miss (but not the whole CPU); stall the entire CPU on a data-cache miss.

Increasing Block Size? Hit time as block size increases: unchanged, though there might be a slight hit-time reduction, since the number of tags is reduced and the memory holding tags is faster to access. Miss rate: goes down at first due to spatial locality, then increases due to increased conflict misses, since fewer blocks fit in the cache. Miss penalty: rises with longer block size, but there is a fixed initial latency that is amortized over the whole block.

Increasing Associativity? Hit time: increases, with a large step from direct-mapped to >= 2 ways, since we now need to mux the correct way to the processor; further increases in associativity add smaller hit-time increases. Miss rate: goes down due to reduced conflict misses, but most of the gain is from 1- to 2- to 4-way, with limited benefit from higher associativities. Miss penalty: unchanged; the replacement policy runs in parallel with fetching the missing line from memory.

Increasing #Entries (Capacity)? Hit time: increases, since we read tags and data from larger memory structures. Miss rate: goes down due to reduced capacity and conflict misses; architects' rule of thumb: the miss rate drops ~2x for every ~4x increase in capacity (only a gross approximation; see the sketch below). Miss penalty: unchanged.

How to Reduce Miss Penalty? Could there be locality on misses from a cache? Use multiple cache levels! With Moore's Law, there is more room on die for bigger L1 caches and for a second-level (L2) cache, and in some cases even an L3 cache. IBM mainframes even have ~1 GB of L4 cache off-chip.

Review: Memory Hierarchy. Processor at the top; Level 1, Level 2, Level 3, ..., Level n below, from inner levels (close to the processor) to outer levels. Increasing distance from the processor means decreasing speed and increasing size of memory at each level; as we move to outer levels, latency goes up and price per bit goes down.
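To see how these knobs feed AMAT, here is a small C sketch that applies the single-level AMAT formula together with the rule of thumb above (~2x miss-rate drop per ~4x capacity). The hit time, miss penalty, and starting miss rate are made-up illustration values, not measurements:

    #include <stdio.h>

    /* AMAT = Hit Time + Miss Rate * Miss Penalty */
    static double amat(double hit_time, double miss_rate, double miss_penalty) {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void) {
        double miss_rate = 0.10;                  /* assumed 10% at 4 KiB   */
        for (int kib = 4; kib <= 64; kib *= 4) {
            printf("%2d KiB: AMAT = %5.2f cycles\n", kib,
                   amat(1.0, miss_rate, 100.0));  /* 1-cycle hit, 100-cycle penalty */
            miss_rate /= 2.0;                     /* halve miss rate per 4x capacity */
        }
        return 0;
    }

With these assumed numbers, AMAT falls from 11.00 to 6.00 to 3.50 cycles as capacity quadruples twice, while the hit time (held constant here) would in reality creep upward, which is exactly the trade-off the slide describes.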

Local vs. Global Miss Rates. Local miss rate: the fraction of references to one level of a cache that miss; e.g., local miss rate of L2$ = # L2$ misses / # L1$ misses. Global miss rate: the fraction of references that miss in all levels of a multilevel cache; the L2$ local miss rate is much larger than the global miss rate. L2$ global miss rate = L2$ misses / total accesses. (Example hierarchy: L1: 32 KB I$ + 32 KB D$; L2: 256 KB; L3: 4 MB.)

Local vs. Global Miss Rates, continued:

L2$ global miss rate = L2$ misses / total accesses
  = (L2$ misses / L1$ misses) x (L1$ misses / total accesses)
  = local miss rate L2$ x local miss rate L1$

AMAT = time for a hit + miss rate x miss penalty
AMAT = L1$ hit time + (local) L1$ miss rate x (L2$ hit time + (local) L2$ miss rate x L2$ miss penalty)

(A worked example with assumed numbers follows below.)

On a Retina MBP, sudo sysctl -a | grep cache reports, among assorted kern, vfs, and net cache knobs (values restored to the standard ones for this hardware where the transcript garbled the digits):

    hw.cachelinesize = 64
    hw.l1icachesize = 32768
    hw.l1dcachesize = 32768
    hw.l2cachesize = 262144
    hw.l3cachesize = 6291456

On a hive machine, run: cat /sys/devices/system/cpu/cpu0/cache/index0/*. A cool thing about Unix: you can poke around by reading files.

[Chart slide: CPI, miss rates, and DRAM accesses for SPECint2006, broken down as data-only vs. instructions-and-data.]
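Here is the two-level AMAT formula above turned into a tiny C calculation. All the rates and latencies are assumed illustration values, not numbers from the lecture:

    #include <stdio.h>

    /* AMAT = L1 hit + L1 local MR * (L2 hit + L2 local MR * L2 miss penalty) */
    int main(void) {
        double l1_hit = 1.0,  l1_mr = 0.05;  /* 5% of all accesses miss in L1  */
        double l2_hit = 10.0, l2_mr = 0.40;  /* 40% of L1 misses also miss L2  */
        double mem    = 200.0;               /* cycles to main memory          */

        double amat         = l1_hit + l1_mr * (l2_hit + l2_mr * mem);
        double global_l2_mr = l1_mr * l2_mr; /* product of local miss rates    */

        printf("AMAT = %.2f cycles, global L2 miss rate = %.1f%%\n",
               amat, 100.0 * global_l2_mr);
        return 0;
    }

With these numbers, AMAT = 1 + 0.05 * (10 + 0.4 * 200) = 5.5 cycles, and the global L2 miss rate is only 2% even though the L2 local miss rate is 40%, illustrating why local and global rates must not be confused.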

Clickers/Peer Instruction: overall, what are the L2 and L3 local miss rates? A: L2 > 50%, L3 > 50%. B: L2 < 50%, L3 < 50%. C: L2 ~ 50%, L3 ~ 50%. D: L2 > 50%, L3 < 50%. E: L2 > 50%, L3 ~ 50%.

CS61C in the News: privacy-respecting laptops on the front page of Hacker News yesterday, the Gluglug Libreboot X200 and the Purism Librem 13/15. Why does it matter? You can run all the free software you want, but: 1) who knows what the hardware is really doing? 2) 99.999% of your machines still have proprietary blobs of driver/firmware code (some are even encrypted). https://news.ycombinator.com/item?id=9934 https://news.ycombinator.com/item?id=99863

Break.

Let's say we're interested in analyzing program performance with respect to the cache, for a simple loop that sums an array: each iteration does sum += a[i]. (A full reconstruction of the program appears below.) Assumptions: 1) we only care about the D$; 2) the array is block-aligned, i.e., the first byte of a[0] is the first byte in a block; 3) local variables live in registers. So what do we care about? The sequence of accesses to the array a.

Important first questions: 1) What does our access pattern look like within a block? 2) How many cache blocks does our array take up, and does it fit entirely in the cache at once?
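Only the fragment "sum += a[i];" survives in the transcript, so the program below is a plausible reconstruction, not the verbatim original: the array length (2048 ints) comes from the slides that follow, and the stride of 2 is inferred from the corrected analysis ("only accessing 1024 of them"). Whether loop 2 revisits the even elements or touches the odd ones is not recoverable, but either way the block-level access pattern, and hence the analysis, is the same. The array contents are irrelevant; only the accesses matter:

    #include <stdio.h>

    #define LEN 2048              /* 2048 4-byte ints = 8192 B */

    int a[LEN];                   /* assumed block-aligned, per the slides */

    int main(void) {
        int sum = 0;
        for (int i = 0; i < LEN; i += 2)   /* loop 1: a[0], a[2], ..., a[2046] */
            sum += a[i];
        for (int i = 0; i < LEN; i += 2)   /* loop 2: the same 1024 accesses again */
            sum += a[i];
        printf("%d\n", sum);
        return 0;
    }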

Question 1) What does our access pattern look like within a block? We're iterating over 2048 ints, each of which is 4 bytes. A block in this cache holds 4 ints: (16 B/block) / (4 B/int) = 4 ints/block. Block as ints: a[0] a[1] a[2] a[3]. Iteration 0: compulsory miss; iterations 1, 2, 3: hits. This repeats over and over: 25% miss rate.

Correction: we're iterating over an array of 2048 ints, but only accessing 1024 of them, since the loop strides by two ints. A block in this cache still holds 4 ints. So, sticking with the first loop, we're iterating over [struck out: 2048 ints] 1024 ints. Block as ints: a[0] a[1] a[2] a[3]. Iteration 0: compulsory miss (a[0]); iteration 1: hit (a[2]). This repeats over and over for the first loop: 50% miss rate.

But now we have to worry about question 2) How many cache blocks does our array take up? The array contains 2048 ints; that's 2048 * 4 = 8192 B. Blocks are 16 B each, so that's 8192 / 16 = 512 blocks. Does it fit entirely in the cache at once? No: the cache is only 4096 B and contains only 256 blocks. (This arithmetic is spelled out in the sketch below.)

So, to figure out what loop 2's performance is going to be like, we need to know what's in the cache at the end of loop 1.
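The two "first questions" reduce to a few divisions. A trivial C sketch with the example's numbers, purely to make the units explicit:

    #include <stdio.h>

    int main(void) {
        int ints = 2048, int_bytes = 4, block_bytes = 16, cache_bytes = 4096;
        int array_bytes  = ints * int_bytes;            /* 8192 B     */
        int array_blocks = array_bytes / block_bytes;   /* 512 blocks */
        int cache_blocks = cache_bytes / block_bytes;   /* 256 blocks */
        printf("ints/block = %d\n", block_bytes / int_bytes);
        printf("array = %d B = %d blocks; cache holds %d blocks -> fits? %s\n",
               array_bytes, array_blocks, cache_blocks,
               array_blocks <= cache_blocks ? "yes" : "no");
        return 0;
    }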

What's the state of the cache at the end of loop 1? Say, for the sake of argument, that our array starts at address 0. Then, halfway through the first loop (between i = 0 and i = 1024), the first half of the array, a[0] through a[1023], is in the cache. Now, when we execute the second half of the accesses in loop 1, we replace everything already in the cache: a[1024] through a[2047] evict a[0] through a[1023]. So, at the end of the first loop, the second half of the array (a[1024]..a[2047]) is in the cache.

Now, let's take a look at loop 2. Notice that it starts by accessing the first half of the array, none of which is in the cache.

Question 1 again, now for the second loop: we're iterating over [struck out: 2048 ints] 1024 ints, each 4 bytes, and a block in this cache holds 4 ints: (16 B/block) / (4 B/int) = 4 ints/block. Block as ints: a[0] a[1] a[2] a[3]. Iteration 0: capacity miss (a[1024] was in this spot in the cache). Iteration 1: hit. This repeats over and over for the second loop: 50% miss rate.

So, we got a 50% miss rate for loop #1 and a 50% miss rate for loop #2. Now, we compute a weighted average: the total miss rate is the weighted average of the per-loop miss rates, and half the accesses were in loop 1, the other half in loop 2:

Total miss rate = 0.5 * loop 1 miss rate + 0.5 * loop 2 miss rate = 0.5 * 50% + 0.5 * 50% = 50%.

(The simulator sketch below reproduces this result.)
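As a sanity check on the walkthrough, here is a tiny direct-mapped cache simulator in C for exactly this configuration (4096 B cache, 16 B blocks, hence 256 blocks; 2048-int array assumed to start at address 0, per the slide). It is a sketch for this example only, not a general simulator:

    #include <stdio.h>

    #define BLOCKS       256
    #define BLOCK_BYTES  16

    static long tag[BLOCKS];
    static int  valid[BLOCKS];
    static int  misses, accesses;

    /* Simulate one load: direct-mapped lookup, fill on miss. */
    static void touch(long addr) {
        long block = addr / BLOCK_BYTES;     /* block number serves as the tag */
        int  index = (int)(block % BLOCKS);  /* direct-mapped index            */
        accesses++;
        if (!valid[index] || tag[index] != block) {
            misses++;                        /* miss: fill this entry          */
            valid[index] = 1;
            tag[index]   = block;
        }
    }

    int main(void) {
        for (int pass = 0; pass < 2; pass++)      /* loop 1, then loop 2 */
            for (long i = 0; i < 2048; i += 2)
                touch(i * 4);                     /* &a[i], with a[] at address 0 */
        printf("miss rate = %d/%d = %.0f%%\n",
               misses, accesses, 100.0 * misses / accesses);
        return 0;
    }

Running it prints "miss rate = 1024/2048 = 50%": 512 compulsory misses in loop 1, then 512 capacity misses in loop 2, matching the slides' weighted-average answer.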

Clicker Question p1 (4 KiB direct-mapped cache): what's the miss rate for the first loop? A. 0%  B. 25%  C. 50%  D. 75%  E. 100%

Clicker Question p2 (4 KiB direct-mapped cache): what's the miss rate for the second loop? A. 0%  B. 25%  C. 50%  D. 75%  E. 100%

Clicker Question p3 (4 KiB direct-mapped cache): what types of misses do we get in the first/second loops? A. Capacity/Capacity  B. Compulsory/Capacity  C. Compulsory/Conflict  D. Conflict/Capacity  E. Compulsory/Compulsory

In Conclusion: a Huge Design Space. Several interacting dimensions: cache size, block size, associativity, replacement policy, write-through vs. write-back, write allocation. The optimal choice is a compromise: it depends on access characteristics, i.e., the workload and the use (I-cache vs. D-cache). Simplicity often wins. And we learned how to analyze code performance on a cache. [Design-space chart: good/bad trade-off curves vs. cache size, associativity, and block size.]