CS 61C: Great Ideas in Computer Architecture (Machine Structures): Caches

CS 61C: Great Ideas in Computer Architecture (Machine Structures): Caches
Instructors: Randy H. Katz, David A. Patterson
http://inst.eecs.berkeley.edu/~cs61c/sp11

Review
- Time (seconds/program) is the performance measure:
  Time/Program = Instructions/Program × Clock cycles/Instruction × Seconds/Clock cycle
- Benchmarks stand in for real workloads as a standardized measure of relative performance
- Power is of increasing concern, and is being added to benchmarks
- Time measurement is via clock cycles, which are machine specific
- Profiling tools are a way to see where your program spends its time

New-School Machine Structures (It's a bit more complicated!)
- Software: parallel requests assigned to a computer (e.g., search "Katz"); parallel threads assigned to a core (e.g., lookup, ads); parallel instructions, more than one at a time (e.g., pipelined instructions); parallel data, more than one item at a time (e.g., add of 4 pairs of words)
- Hardware descriptions: all gates working at one time
- Harness parallelism and achieve high performance
- [Figure: the hardware hierarchy from warehouse-scale computer down to computer (e.g., smart phone), core, memory, input/output, instruction and functional units, and logic gates; the cache is today's lecture]

Agenda
- Memory Hierarchy Analogy
- Memory Hierarchy Overview
- Administrivia
- Caches
- Technology Break
- (Cache Performance: if time permits)

Library Analogy
- Writing a report based on books on reserve, e.g., the works of J.D. Salinger
- Go to the library to get the reserved book and place it on a desk in the library
- If you need more books, check them out and keep them on the desk
- But don't return earlier books, since you might need them again
- You hope this small collection of books on the desk is enough to write the report, despite it being only a tiny fraction of the books in the UC Berkeley libraries

Big Idea: Locality
- Temporal locality (locality in time): go back to the same book on the desk multiple times; if a memory location is referenced, it will tend to be referenced again soon
- Spatial locality (locality in space): when you go to the book shelf, pick up multiple books by J.D. Salinger, since the library stores related books together; if a memory location is referenced, the locations with nearby addresses will tend to be referenced soon
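A minimal, hypothetical C example (mine, not from the slides) showing both kinds of locality in code: the running total is reused on every iteration (temporal locality), and the array is walked through consecutive addresses (spatial locality).

```c
#include <stdio.h>

#define N 1024

int main(void) {
    static int a[N];             /* zero-initialized array                     */
    int sum = 0;                 /* reused every iteration: temporal locality  */
    for (int i = 0; i < N; i++)  /* a[0], a[1], a[2], ...: spatial locality    */
        sum += a[i];
    printf("sum = %d\n", sum);
    return 0;
}
```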

Principle of Locality
- Principle of Locality: programs access only a small portion of the address space at any instant of time
- What program structures lead to temporal and spatial locality in code? In data?
- How does hardware exploit the principle of locality? It offers a hierarchy of memories where the level closest to the processor is fastest (and most expensive per bit, so smallest), and the level furthest from the processor is largest (and least expensive per bit, so slowest)
- The goal is to create the illusion of a memory almost as fast as the fastest memory and almost as large as the biggest memory in the hierarchy

Big Idea: Memory Hierarchy
- [Figure: a pyramid from the processor through Level 1, Level 2, Level 3, ..., Level n; increasing distance from the processor means decreasing speed and increasing size of memory at each level]
- As we move to deeper levels, the latency goes up and the price per bit goes down. Why?

Cache Concept
- The processor/memory speed mismatch leads us to add a new level: a memory cache
- Implemented with the same integrated-circuit processing technology as the processor and integrated on-chip: faster but more expensive than DRAM memory
- A cache is a copy of a subset of main memory
- Modern processors have separate caches for instructions and data, as well as several levels of caches implemented in different sizes
- As a pun, we often use $ ("cash") to abbreviate cache, e.g., D$ = Data Cache, I$ = Instruction Cache

Memory Hierarchy Technologies
- Caches use SRAM (Static RAM) for speed and technology compatibility
  - Fast (typical access times of 0.5 to 2.5 ns)
  - Low density (6-transistor cells), higher power, expensive per GB
  - Static: content will last as long as power is on
- Main memory uses DRAM (Dynamic RAM) for size (density)
  - Slower (typical access times of 50 to 70 ns)
  - High density (1-transistor cells), lower power, much cheaper per GB
  - Dynamic: needs to be refreshed regularly (~ every 8 ms), which consumes 1% to 2% of the active cycles of the DRAM

Characteristics of the Memory Hierarchy
- Transfer sizes grow with distance: Processor to L1$: 4-8 bytes (word); L1$ to L2$: 8-32 bytes (block); L2$ to Main Memory: 16-128 bytes (block); Main Memory to Secondary Memory: 4,096+ bytes (page)
- Access time and the (relative) size of the memory at each level increase with distance from the processor
- Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in main memory, which is a subset of what is in secondary memory
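A minimal sketch (my own illustration, not the lecture's) of why the hierarchy rewards locality: both loops below compute the same sum over the same matrix, but the row-major loop walks memory sequentially and mostly hits in the cache, while the column-major loop strides far apart on every access and, on most machines, runs noticeably slower.

```c
#include <stdio.h>
#include <time.h>

#define ROWS 2048
#define COLS 2048
static int m[ROWS][COLS];            /* ~16 MB, much larger than typical caches */

static long sum_row_major(void) {
    long s = 0;
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            s += m[r][c];            /* consecutive addresses: spatial locality */
    return s;
}

static long sum_col_major(void) {
    long s = 0;
    for (int c = 0; c < COLS; c++)
        for (int r = 0; r < ROWS; r++)
            s += m[r][c];            /* large stride between accesses           */
    return s;
}

int main(void) {
    clock_t t0 = clock();
    long s1 = sum_row_major();
    clock_t t1 = clock();
    long s2 = sum_col_major();
    clock_t t2 = clock();
    printf("row-major: %ld (%.3f s)   col-major: %ld (%.3f s)\n",
           s1, (double)(t1 - t0) / CLOCKS_PER_SEC,
           s2, (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}
```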

How is the Hierarchy Managed?
- registers <-> memory: by the compiler (or assembly-level programmer)
- cache <-> main memory: by the cache controller hardware
- main memory <-> disks (secondary storage): by the operating system (virtual memory, which we'll talk about later in the semester), with virtual-to-physical address mapping assisted by the hardware (TLB), and by the programmer (files)

Typical Memory Hierarchy
- On-chip components: Control, RegFile, Instruction Cache, Data Cache; then a Second-Level Cache (SRAM), Main Memory (DRAM), and Secondary Memory (disk or flash)
- Speed (cycles): from ½'s and 1's on chip, through 10's and 100's for the second-level cache and main memory, to millions for secondary storage
- Size (bytes): 100's, 10K's, M's, G's, T's; Cost/bit: highest on chip, lowest in secondary storage
- The principle of locality plus the memory hierarchy presents the programmer with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology

Review so far
- Wanted: the effect of a large, cheap, fast memory
- Approach: a memory hierarchy, where successively lower levels contain the most-used data from the next higher level
- Exploits temporal and spatial locality
- The memory hierarchy follows two design principles: "smaller is faster" and "make the common case fast"

Agenda
- Memory Hierarchy Overview
- Administrivia
- Caches
- Technology Break
- (Cache Performance: if time permits)

Administrivia
- Everyone is off the wait lists
- Lab #6 posted
- Project #2, Part 2 due at 11:59:59
- No homework this week!
- Midterm in 2 weeks
  - Exam: Tu, Mar 8, 6-9 PM, 145/155 Dwinelle (split: A-Lew in 145, Li-Z in 155)
  - No discussion during exam week; no lecture that day
  - TA review: Su, Mar 6, 2-5 PM, 2050 VLSB
  - Small number of special-consideration cases (class conflicts, etc.): contact Dave or Randy
- [Chart: Project 2, Part 1 grading distribution; the most common failure was code that does not properly sign-extend the immediate field, and a few submissions failed the initial tests or did not compile]

61C in the News: What Conference?
- Will be held in charming San Francisco, April 3-5?
- Is sponsored by Google, Intel, Microsoft, Cisco, NetApp, ...?
- Includes a past National Academy of Engineering President, a past IBM Academy of Technology Chair, a past ACM President, a past Computing Research Association Chair, members of the NAE, Technology Review Young Innovators (TR35), and Google's Executive VP of Research & Engineering?
- Has attendance that is far more diverse than a typical computing conference, with high percentages of female, African American, and Hispanic attendees?
- Answer: the Tapia Celebration of Diversity in Computing Conference (tapiaconference.org/2011), a joint organization of ACM, IEEE-CS, and CRA; General Chair: Dave Patterson
- "Inclusive: all welcome, it works!" 82% of attendees reaffirm the CS major and will finish the degree
- Luminaries: Deborah Estrin (UCLA), Blaise Aguera y Arcas (Microsoft), Alan Eustace (Google), Bill Wulf (UVA), Irving Wladawsky-Berger (IBM), John Kubiatowicz (UC Berkeley); Rising Stars: Illya Hicks (Rice), Ayanna Howard (Georgia Tech), Patty Lopez (Intel)
- Activities: grad-school and early-career-success workshops, resume preparation, BOFs, banquet and dance, a doctoral consortium (grad students are encouraged to apply), and San Francisco outings (Alcatraz, Chinatown, bike tour over the Golden Gate Bridge)
- If you care, come! If interested, email Sheila Humphreys (humphrys@eecs.berkeley.edu) with your name, year, topic of interest, and 2-3 sentences on why you want to go to Tapia; see http://tapiaconference.org/2011/participate.html
- [Photos: conference speakers and organizers, including Aguera y Arcas, Lopez, Patterson, Hicks, Wladawsky-Berger, Kubiatowicz, Eustace, Lanius, Wulf, Howard, Vargas, Taylor, Estrin, Perez-Quinones, and a Tapia Award winner]

61C in the News: "Home Internet May Get Even Faster in South Korea" (Mark McDonald, New York Times, February 2011)
- "By the end of 2012, South Korea intends to connect every home in the country to the Internet at one gigabit per second. That would be a tenfold increase from the already blazing national standard and more than 200 times as fast as the average household setup in the United States."
- "South Korean homes now have greater Internet access than we do," President Obama said in his State of the Union address. Last week, Mr. Obama unveiled a multibillion-dollar broadband spending program (using WiMax to beam broadband from where Internet providers' systems currently end out to rural areas).

Cache Design Questions
- Q1: How do we know which blocks of main memory currently have a copy in the cache?
- Q2: How do we find these copies quickly?

Cache Basics: Direct Mapped
- How best to organize the memory blocks of the cache?
- To which block of the cache does a given main memory address map?
- (Later in the semester we'll cover alternatives to direct mapped)

Direct Mapped
- Each memory block is mapped to exactly one block in the cache, so many lower-level (memory) blocks share a given cache block; a block is also called a line
- Address mapping: since the cache is a subset of memory, multiple memory addresses can map to the same cache location; cache block index = (block address) modulo (# of blocks in the cache)
- A tag is associated with each cache block, containing the address information (the upper portion of the address) required to identify the block (to answer Q1)
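The modulo mapping above is cheap in hardware because cache and block sizes are powers of two. A small C sketch of the arithmetic (the block and cache sizes here are illustrative assumptions, not the lecture's):

```c
#include <stdint.h>
#include <stdio.h>

#define BLOCK_BYTES 4u    /* one 4-byte word per block */
#define NUM_BLOCKS  4u    /* tiny 4-block cache        */

int main(void) {
    uint32_t addr = 0x34;                         /* arbitrary example address  */
    uint32_t block_addr = addr / BLOCK_BYTES;     /* drop the byte offset       */
    uint32_t index = block_addr % NUM_BLOCKS;     /* which cache block it maps to */
    uint32_t tag   = block_addr / NUM_BLOCKS;     /* identifies which of the many */
                                                  /* memory blocks that map here  */
    /* Because the sizes are powers of two, the modulo is just the low-order
     * index bits and the divisions are just shifts. */
    printf("addr 0x%x -> block %u, index %u, tag %u\n",
           addr, block_addr, index, tag);
    return 0;
}
```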

Mapping a 6-bit Memory Address
- [Figure: the 6-bit address is split into a Tag (which memory block is in a given cache block), an Index (block within the cache), and a Byte Offset within the block (e.g., within the word); note: $ = cache]
- In this example the block size is 4 bytes / 1 word (it could be multi-word)
- Memory and cache blocks are the same size: the unit of transfer between memory and cache
- # memory blocks >> # cache blocks
  - 16 memory blocks = 16 words = 64 bytes, so 6 bits to address all bytes
  - 4 cache blocks, 4 bytes (1 word) per block
  - 4 memory blocks map to each cache block
- Byte within block: the low-order two bits; ignore them (nothing smaller than a block is transferred)
- Memory block to cache block, aka index: the middle two bits
- Which memory block is in a given cache block, aka tag: the top two bits

Direct-Mapped Cache Example
- One-word blocks, cache size = 1K words (or 4 KB)
- [Figure: the 32-bit address splits into a 20-bit Tag, a 10-bit Index, and a 2-bit Byte Offset; each cache entry holds a Valid bit, a Tag, and a Data word]
- The Valid bit ensures something useful is in the cache for this index
- Compare the Tag with the upper part of the address to see if there is a Hit
- Read data from the cache instead of memory if there is a Hit
- What kind of locality are we taking advantage of?

Caching: A Simple First Example
- Main memory: 16 one-word blocks; the two low-order address bits define the byte in the word (32-bit words)
- Q: Is the memory block in the cache? Compare the cache tag to the high-order 2 memory address bits to tell if the memory block is in the cache
- Q: Where in the cache is the memory block? Use the next 2 low-order memory address bits, the index, to determine which cache block, i.e., (block address) modulo (# of blocks in the cache)
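A software sketch of the hit test just described, for a direct-mapped cache of one-word blocks; the sizes and the struct layout are my assumptions for illustration, not the exact hardware on the slide.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS   1024u   /* 1K one-word blocks (4 KB of data) */
#define OFFSET_BITS  2u      /* byte offset within a 4-byte word  */
#define INDEX_BITS   10u     /* log2(NUM_BLOCKS)                  */

struct line { bool valid; uint32_t tag; uint32_t data; };
static struct line cache[NUM_BLOCKS];

/* Returns true on a hit and fills *word; a real cache would fetch the
 * block from memory and update the line on a miss. */
bool cache_read(uint32_t addr, uint32_t *word) {
    uint32_t index = (addr >> OFFSET_BITS) & (NUM_BLOCKS - 1);
    uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);
    if (cache[index].valid && cache[index].tag == tag) {  /* valid + tag match */
        *word = cache[index].data;                        /* hit               */
        return true;
    }
    return false;                                         /* miss              */
}
```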

Direct-Mapped Cache Example: Temporal Locality
- Start with an empty 4-word cache (one-word blocks), all blocks initially marked as not valid, and issue the sequence of word-address requests 0, 1, 2, 3, 4, 3, 4, 15
- 0 miss, 1 miss, 2 miss, 3 miss (the four blocks fill with Mem(0), Mem(1), Mem(2), Mem(3)); 4 miss (Mem(4) replaces Mem(0) in block 0); 3 hit; 4 hit; 15 miss (Mem(15) replaces Mem(3) in block 3)
- 8 requests, 6 misses

Multiword-Block Direct Mapped Cache
- Four words/block, cache size = 1K words
- [Figure: the address splits into a Tag, an Index, a Block Offset selecting the word within the block, and a Byte Offset; a hit requires the Valid bit set and a Tag match]
- What kind of locality are we taking advantage of?

Taking Advantage of Spatial Locality
- Let the cache block hold more than one word: the same 4-word cache, now with two-word blocks, the same request sequence 0, 1, 2, 3, 4, 3, 4, 15, starting empty
- 0 miss (brings in Mem(1) and Mem(0)); 1 hit; 2 miss (brings in Mem(3) and Mem(2)); 3 hit; 4 miss (Mem(5)/Mem(4) replaces Mem(1)/Mem(0)); 3 hit; 4 hit; 15 miss (Mem(15)/Mem(14) replaces Mem(3)/Mem(2))
- 8 requests, 4 misses
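A minimal direct-mapped cache simulator (a sketch of my own, not the lecture's code) that reproduces the two tallies above for the same word-address stream: 6 misses with one-word blocks, 4 misses with two-word blocks.

```c
#include <stdio.h>

#define CACHE_WORDS 4

static int count_misses(const int *refs, int n, int words_per_block) {
    int num_blocks = CACHE_WORDS / words_per_block;
    int valid[CACHE_WORDS] = {0};          /* per-block valid bits  */
    int tag[CACHE_WORDS]   = {0};          /* per-block tags        */
    int misses = 0;
    for (int i = 0; i < n; i++) {
        int block_addr = refs[i] / words_per_block;
        int index = block_addr % num_blocks;
        int t = block_addr / num_blocks;
        if (!valid[index] || tag[index] != t) {   /* miss: (re)fill the block */
            misses++;
            valid[index] = 1;
            tag[index] = t;
        }
    }
    return misses;
}

int main(void) {
    int refs[] = {0, 1, 2, 3, 4, 3, 4, 15};       /* word-address request stream */
    int n = sizeof refs / sizeof refs[0];
    printf("1-word blocks: %d misses of %d requests\n", count_misses(refs, n, 1), n);
    printf("2-word blocks: %d misses of %d requests\n", count_misses(refs, n, 2), n);
    return 0;
}
```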

Miss Rate vs Block Size vs Cache Size
- [Figure: miss rate (%) vs. block size (bytes) for cache sizes of 8 KB, 16 KB, 64 KB, and 256 KB]
- Miss rate goes up if the block size becomes a significant fraction of the cache size, because the number of blocks that can be held in the same-size cache is smaller (increasing capacity misses)

Measuring Cache Performance
- Assuming cache hit costs are included as part of the normal CPU execution cycle, then
  CPU Time = IC × CPI × CC = IC × (CPI_ideal + Memory-stall cycles per instruction) × CC,
  where CPI_ideal + memory-stall cycles = CPI_stall
- A simple model for memory-stall cycles:
  Memory-stall cycles = accesses/program × miss rate × miss penalty
- We will talk about writes and write misses next lecture, where it's a little more complicated

Impacts of Cache Performance
- The relative cache miss penalty increases as processor performance improves (faster clock rate and/or lower CPI); memory speed is unlikely to improve as fast as processor cycle time
- When calculating CPI_stall, the cache miss penalty is measured in processor clock cycles needed to handle a miss
- The lower the CPI_ideal, the more pronounced the impact of stalls
- Example: a processor with a CPI_ideal of 2, a 100-cycle miss penalty, 36% load/store instructions, and 2% I$ and 4% D$ miss rates
  Memory-stall cycles = 2% × 100 + 36% × 4% × 100 = 2 + 1.44 = 3.44
  So CPI_stall = 2 + 3.44 = 5.44, more than twice the CPI_ideal!
- What if the CPI_ideal is reduced to 1? What if the D$ miss rate went up by 1%?

Average Memory Access Time (AMAT)
- Average Memory Access Time (AMAT) is the average time to access memory considering both hits and misses:
  AMAT = Time for a hit + Miss rate × Miss penalty
- What is the AMAT for a processor with a 200 psec clock, a miss penalty of 50 clock cycles, a miss rate of 0.02 misses per instruction, and a cache access time of 1 clock cycle?
  AMAT = 1 + 0.02 × 50 = 2 clock cycles, or 2 × 200 = 400 psecs
- Potential impact of a much larger cache on AMAT?
  1) Lower miss rate
  2) Longer access time (hit time): smaller is faster, and an increase in hit time will likely add another stage to the pipeline
  At some point, the increase in hit time for a larger cache may overcome the improvement in hit rate, yielding a decrease in performance

Review so far
- Principle of locality, for libraries and for computer memory
- Hierarchy of memories (speed/size/cost per bit) to exploit locality
- Cache: a copy of data from a lower level in the memory hierarchy
- Direct mapped: find a block in the cache using the Tag field and a Valid bit for a Hit
- Larger caches reduce miss rate via temporal and spatial locality, but can increase hit time
- Larger blocks reduce miss rate via spatial locality, but increase miss penalty
- AMAT helps balance hit time, miss rate, and miss penalty
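For concreteness, a small C sketch (my own code, using the numbers from the two worked examples above) that recomputes CPI_stall and AMAT from the slide formulas:

```c
#include <stdio.h>

int main(void) {
    /* CPI example: CPI_ideal = 2, 100-cycle miss penalty, 36% loads/stores,
     * 2% I$ and 4% D$ miss rates. */
    double cpi_ideal = 2.0, penalty = 100.0, ls_frac = 0.36;
    double i_miss = 0.02, d_miss = 0.04;
    double stalls = i_miss * penalty + ls_frac * d_miss * penalty;  /* 2 + 1.44 */
    printf("memory-stall cycles/instr = %.2f, CPI_stall = %.2f\n",
           stalls, cpi_ideal + stalls);                             /* 3.44, 5.44 */

    /* AMAT example: 200 ps clock, 1-cycle hit time, 50-cycle miss penalty,
     * 0.02 misses per instruction. */
    double hit_time = 1.0, miss_rate = 0.02, miss_penalty = 50.0, clock_ps = 200.0;
    double amat_cycles = hit_time + miss_rate * miss_penalty;       /* 2 cycles  */
    printf("AMAT = %.1f cycles = %.0f ps\n", amat_cycles, amat_cycles * clock_ps);
    return 0;
}
```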
