Computer Architecture ELEC3441

Lecture 9: Cache (2)
Dr. Hayden Kwok-Hay So, Department of Electrical and Electronic Engineering

Causes of Cache Misses: The 3 C's

- Compulsory: first reference to a line (a.k.a. cold-start misses)
  - misses that would occur even with an infinite cache
- Capacity: the cache is too small to hold all data needed by the program
  - misses that would occur even under a perfect replacement policy
- Conflict: misses that occur because of collisions due to the line-placement strategy
  - misses that would not occur with ideal full associativity

AMAT

- Average Memory Access Time:
  AMAT = Hit Time + Miss Rate × Miss Penalty

Example

- A processor runs at 2 GHz with CPI = 1. The miss penalty to memory is 50 clock cycles. The L1 cache returns data in 1 cycle on a cache hit. On a particular program, the instruction miss rate is 1%; loads/stores make up 30% of dynamic instructions and have a miss rate of 5%. Assume read and write penalties are the same and ignore other stalls.
- What is the AMAT for instructions and for data?
- What is the average CPI given the above memory access times?

Example: AMAT

Instruction cache:
  AMAT = Hit Time + Miss Rate × Miss Penalty = 1 + 1% × 50 = 1.5 cycles
Data cache:
  AMAT = Hit Time + Miss Rate × Miss Penalty = 1 + 5% × 50 = 3.5 cycles

Average CPI (with Memory)

- Instruction memory miss cycles per instruction = 1% × 50 = 0.5
- Data memory miss cycles per instruction = 30% × 5% × 50 = 0.75
- Total memory stall cycles per instruction = 0.5 + 0.75 = 1.25
- Average CPI = 1 + 1.25 = 2.25

CPU-Cache Interaction (5-stage pipeline)

[Figure: 5-stage pipeline datapath with the primary instruction cache feeding fetch and the primary data cache in the memory stage; cache refills come from lower levels of the memory hierarchy. The entire CPU stalls on a data cache miss.]

Improving Cache Performance

Average memory access time (AMAT) = Hit time + Miss rate × Miss penalty

To improve performance:
- reduce the hit time
- reduce the miss rate
- reduce the miss penalty

What is the best cache design for a 5-stage pipeline? The biggest cache that doesn't increase hit time past 1 cycle (approx. 8-32KB in modern technology). [Design issues are more complex with deeper pipelines and/or out-of-order superscalar processors.]
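As a quick check of the arithmetic above, the following small C program (my addition, not from the slides) evaluates the same formulas with the example's parameters:

#include <stdio.h>

int main(void) {
    /* Parameters from the example slide */
    double hit_time     = 1.0;   /* cycles, L1 hit */
    double miss_penalty = 50.0;  /* cycles, to memory */
    double i_miss_rate  = 0.01;  /* instruction miss rate */
    double d_miss_rate  = 0.05;  /* data (load/store) miss rate */
    double ls_fraction  = 0.30;  /* loads/stores per instruction */
    double base_cpi     = 1.0;

    double amat_i = hit_time + i_miss_rate * miss_penalty;    /* 1.5 */
    double amat_d = hit_time + d_miss_rate * miss_penalty;    /* 3.5 */
    double stalls = i_miss_rate * miss_penalty                /* 0.5  */
                  + ls_fraction * d_miss_rate * miss_penalty; /* 0.75 */

    printf("AMAT(I) = %.2f cycles, AMAT(D) = %.2f cycles\n", amat_i, amat_d);
    printf("Average CPI = %.2f\n", base_cpi + stalls);        /* 2.25 */
    return 0;
}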

Effect of Cache Parameters on Performance

- Larger cache size
  + reduces capacity and conflict misses
  - hit time will increase
- Higher associativity
  + reduces conflict misses
  - may increase hit time
- Larger line size
  + reduces compulsory and capacity (reload) misses
  - increases conflict misses and miss penalty

Performance vs. Associativity

[Figure: miss rate per type vs. cache size (4KB to 1024KB) for one-way, two-way, and four-way associativity, with the capacity-miss component marked; and miss rate vs. associativity (one-way to eight-way) for cache sizes from 1KB to 128KB.]

- 1-way to 2-way: significant drop in miss rate
- 2-way to 4-way: less significant drop
- The effect of associativity is most significant in small caches

Write Policy Choices

Cache hit:
- write-through: write both cache & memory
  - Generally higher traffic, but simpler pipeline & cache design
- write-back: write cache only; memory is written only when the entry is evicted
  - A dirty bit per line further reduces write-back traffic
  - Must handle 0, 1, or 2 accesses to memory for each load/store

Cache miss:
- no-write-allocate: only write to main memory
- write-allocate (a.k.a. fetch-on-write): fetch the line into the cache

Common combinations:
- write-through and no-write-allocate
- write-back with write-allocate

Write Hit (Cache Writing)

[Figure: cache write-hit datapath: the address is split into tag (t bits), index, and offset (b bits); the valid bit and tag comparison produce HIT, and the write enable updates the selected data word or byte in one of the 2^k lines.]

- Write-back: done
- Write-through: write also to memory
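To make the four hit/miss policy combinations concrete, here is a minimal toy cache model in C. This is a sketch of my own, not anything from the lecture; the direct-mapped organization, the sizes, and all helper names are made up for illustration.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NLINES 64
#define LINESZ 16
#define MEMSZ  4096

static uint8_t mem[MEMSZ];
typedef struct { bool valid, dirty; uint32_t base; uint8_t data[LINESZ]; } line_t;
static line_t cache[NLINES];

static line_t *slot(uint32_t addr) { return &cache[(addr / LINESZ) % NLINES]; }
static uint32_t linebase(uint32_t addr) { return addr - addr % LINESZ; }

static line_t *lookup(uint32_t addr) {          /* NULL on miss */
    line_t *l = slot(addr);
    return (l->valid && l->base == linebase(addr)) ? l : NULL;
}

static line_t *refill(uint32_t addr) {
    line_t *l = slot(addr);
    if (l->valid && l->dirty)                   /* write back the dirty victim */
        memcpy(&mem[l->base], l->data, LINESZ);
    l->valid = true; l->dirty = false; l->base = linebase(addr);
    memcpy(l->data, &mem[l->base], LINESZ);     /* fetch the full line */
    return l;
}

/* One byte store under a chosen combination of policies. */
static void store(uint32_t addr, uint8_t v, bool write_back, bool write_allocate) {
    line_t *l = lookup(addr);
    if (!l) {                                           /* write miss */
        if (!write_allocate) { mem[addr] = v; return; } /* straight to memory */
        l = refill(addr);                               /* fetch on write */
    }
    l->data[addr % LINESZ] = v;     /* update the cached line */
    if (write_back) l->dirty = true;/* memory is updated only on eviction */
    else            mem[addr] = v;  /* write-through: update memory too */
}

int main(void) {
    store(100, 42, true, true);   /* write-back + write-allocate (common pair) */
    store(200, 7, false, false);  /* write-through + no-write-allocate */
    printf("mem[200]=%d (written through), mem[100]=%d (still stale in memory)\n",
           mem[200], mem[100]);
    return 0;
}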

Write-Through via Write Buffer

[Figure: processor writes go to both the cache ($) and a write buffer; the write buffer drains to DRAM.]

- The processor writes to both the $ and the write buffer
  - The memory write completes as soon as the data is in the write buffer
- The memory controller completes the write to DRAM offline
- Writing too fast may saturate the write buffer

Read Miss with Write Buffer

- On a read miss, we need to read memory to fill the cache
- But the data may still be in the write buffer, pending its write to DRAM
- 2 solutions:
  - Flush the write buffer before the read
  - Check all pending writes in the write buffer and return the latest write data if the address matches
- Q: Would there be data in the write buffer that needs to be forwarded on a read hit?

Write Miss

- A write miss happens when the write location is not in the cache
- Write-allocate: at the end of the write, the cache contains the full line of data
  - Need to read the line from memory
  - Write-back: must have write-allocate
  - Write-through: may or may not
- No-write-allocate: data goes straight to memory

Multilevel Caches

Problem: a memory cannot be both large and fast.
Solution: increasing sizes of cache at each level.

  CPU <-> L1$ <-> L2$ <-> DRAM

- Local miss rate = misses in this cache / accesses to this cache
- Global miss rate = misses in this cache / CPU memory accesses
- Misses per instruction = misses in this cache / number of instructions

For example, if L1 misses on 5% of CPU accesses and L2 misses on 20% of the accesses it sees, L2's local miss rate is 20% but its global miss rate is 5% × 20% = 1%.
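These definitions compose: an inner level's global miss rate is the product of the miss rates along the path to it, which is also how a two-level AMAT is computed. Below is a short C sketch with assumed event counts and latencies (my numbers, not from the slides):

#include <stdio.h>

int main(void) {
    /* Assumed counts for illustration only. */
    double cpu_accesses = 1000000.0; /* all CPU memory accesses (go to L1) */
    double l1_misses    = 50000.0;   /* 5% of CPU accesses miss in L1 */
    double l2_misses    = 10000.0;   /* of those, 20% also miss in L2 */

    double l1_miss_rate        = l1_misses / cpu_accesses; /* 0.05 */
    double l2_local_miss_rate  = l2_misses / l1_misses;    /* 0.20: L2 accesses = L1 misses */
    double l2_global_miss_rate = l2_misses / cpu_accesses; /* 0.01 = product of the two */

    /* Two-level AMAT with assumed latencies (cycles). */
    double l1_hit = 1.0, l2_hit = 8.0, mem_penalty = 100.0;
    double amat = l1_hit + l1_miss_rate * (l2_hit + l2_local_miss_rate * mem_penalty);

    printf("L2 local miss rate  = %.2f\n", l2_local_miss_rate);
    printf("L2 global miss rate = %.2f\n", l2_global_miss_rate);
    printf("AMAT = %.2f cycles\n", amat);  /* 1 + 0.05*(8 + 0.2*100) = 2.40 */
    return 0;
}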

Presence of L2 Influences L1 Design

- Use a smaller L1 if there is also an L2
  - Trade increased L1 miss rate for reduced L1 hit time
  - The backup L2 reduces the L1 miss penalty
  - Reduces average access energy
- Use a simpler write-through L1 with an on-chip L2
  - The write-back L2 cache absorbs the write traffic, which doesn't go off-chip
  - At most one L1 miss request per L1 access (no dirty victim write-back) simplifies pipeline control
  - Simplifies coherence issues
  - Simplifies error recovery in L1 (can use just parity bits in L1 and reload from L2 when a parity error is detected on an L1 read)

Inclusion Policy

- Inclusive multilevel cache: the inner cache can only hold lines also present in the outer cache
  - External coherence snoop accesses need only check the outer cache
- Exclusive multilevel caches: the inner cache may hold lines not in the outer cache
  - Swap lines between the inner/outer caches on a miss
  - Used in the AMD Athlon, with a 64KB primary and a 256KB secondary cache
- Why choose one type or the other?

L1 vs L2 Miss Rate

[Figure: L1 vs. L2 data cache miss rates (0-25%) for the ARM Cortex-A8 running MinneSPEC, across twolf, bzip2, gzip, parser, gap, vpr, perlbmk, gcc, crafty, vortex, eon, and mcf.]

- The miss rate of the L2$ is usually much lower than that of the L1$
- L2 usually has:
  - Higher capacity
  - Higher associativity
- Only accesses that missed in L1 arrive at L2

Itanium-2 On-Chip Caches (Intel/HP, 2002)

- Level 1: 16KB, 4-way set-associative, 64B line, quad-port (2 load + 2 store), single-cycle latency
- Level 2: 256KB, 4-way set-associative, 128B line, quad-port (4 load or 4 store), five-cycle latency
- Level 3: 3MB, 12-way set-associative, 128B line, single 32B port, twelve-cycle latency

Power 7 On-Chip Caches [IBM 2009]

- 32KB L1 I$/core and 32KB L1 D$/core, 3-cycle latency
- 256KB unified L2$/core, 8-cycle latency
- 32MB unified shared L3$ in embedded DRAM (eDRAM), 25-cycle latency to the local slice

IBM z196 Mainframe Caches (2010)

- 96 cores (4 cores/chip, 24 chips/system)
  - Out-of-order, 3-way superscalar @ 5.2GHz
- L1: 64KB I-$/core + 128KB D-$/core
- L2: 1.5MB private/core (144MB total)
- L3: 24MB shared/chip (eDRAM) (576MB total)
- L4: 768MB shared/system (eDRAM)

Prefetching

- Speculate on future instruction and data accesses and fetch them into the cache(s)
  - Instruction accesses are easier to predict than data accesses
- Varieties of prefetching:
  - Hardware prefetching
  - Software prefetching
  - Mixed schemes
- What types of misses does prefetching affect?

Issues in Prefetching

- Usefulness: should produce hits
- Timeliness: not late and not too early
- Cache and bandwidth pollution

[Figure: prefetched data flows from the unified L2 cache into the CPU's L1 instruction and L1 data caches.]

Hardware Instruction Prefetching

Instruction prefetch in the Alpha AXP 21064:
- Fetch two lines on a miss: the requested line (i) and the next consecutive line (i+1)
- The requested line is placed in the cache, and the next line in the instruction stream buffer
- If a fetch misses in the cache but hits in the stream buffer, move the stream buffer line into the cache and prefetch the next line (i+2)

[Figure: the requested line goes from the unified L2 cache into the L1 instruction cache, while the prefetched line is held in a stream buffer beside the CPU.]

Hardware Data Prefetching

- Prefetch-on-miss: prefetch b+1 upon a miss on b
- One-Block Lookahead (OBL) scheme: initiate a prefetch for block b+1 when block b is accessed
  - Why is this different from doubling the block size?
  - Can extend to N-block lookahead
- Strided prefetch: if we observe a sequence of accesses to lines b, b+N, b+2N, then prefetch b+3N, etc.
- Example: IBM Power 5 [2003] supports eight independent streams of strided prefetch per processor, prefetching 12 lines ahead of the current access
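To illustrate how strided prefetch is commonly described, here is a minimal sketch of a reference prediction table indexed by the load PC. This is my own toy model, not the Power 5 design, and the table size and names are invented:

#include <stdint.h>
#include <stdio.h>

#define RPT_ENTRIES 256

typedef struct {
    uint64_t last_addr;  /* address of the previous access by this PC */
    int64_t  stride;     /* last observed stride */
} rpt_entry_t;

static rpt_entry_t rpt[RPT_ENTRIES];

/* Stub: a real prefetcher would start a cache refill here. */
static void issue_prefetch(uint64_t addr) {
    printf("prefetch line at 0x%llx\n", (unsigned long long)addr);
}

/* Called on every load. After accesses to b, b+N, b+2N, the stride N has
 * matched twice, so we prefetch b+3N, as in the slide's example. */
void on_load(uint64_t pc, uint64_t addr) {
    rpt_entry_t *e = &rpt[(pc >> 2) % RPT_ENTRIES];
    int64_t stride = (int64_t)(addr - e->last_addr);
    if (stride != 0 && stride == e->stride)
        issue_prefetch(addr + (uint64_t)stride);  /* predict the next access */
    else
        e->stride = stride;                       /* stride changed: retrain */
    e->last_addr = addr;
}

int main(void) {
    on_load(0x400, 0x1000);
    on_load(0x400, 0x1040);
    on_load(0x400, 0x1080);  /* stride 0x40 seen twice: prefetches 0x10C0 */
    return 0;
}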

Hardware Instructon Prefetchng Instructon prefetch n Alpha AXP 21064 Fetch two lnes on a mss; the requested lne () and the next consecutve lne (+1) Requested lne placed n cache, and next lne n nstructon stream buffer If mss n cache but ht n stream buffer, move stream buffer lne nto cache and prefetch next lne (+2) CPU RF Req lne Stream Buffer L1 Instructon Prefetched nstructon lne Req lne Unfed L2 Cache Hardware Data Prefetchng Prefetch-on-mss: Prefetch b + 1 upon mss on b One-Bloc Looahead (OBL) scheme Intate prefetch for bloc b + 1 when bloc b s accessed Why s ths dfferent from doublng bloc sze? Can extend to N-bloc looahead Strded prefetch If observe sequence of accesses to lne b, b+n, b+2n, then prefetch b+3n etc. Example: IBM Power 5 [2003] supports eght ndependent streams of strded prefetch per processor, prefetchng 12 lnes ahead of current access 25 26 Software Prefetchng for(=0; < N; ++) { prefetch( &a[ + 1] ); prefetch( &b[ + 1] ); SUM = SUM + a[] * b[]; Software Prefetchng Issues Tmng s the bggest ssue, not predctablty If you prefetch very close to when the data s requred, you mght be too late Prefetch too early, cause polluton Estmate how long t wll tae for the data to come nto L1, so we can set P approprately Why s ths hard to do? for(=0; < N; ++) { prefetch( &a[ + P] ); prefetch( &b[ + P] ); SUM = SUM + a[] * b[]; Must consder cost of prefetch nstructons 27 28

Compiler Optimizations

- Restructuring code affects the data access sequence
  - Group data accesses together to improve spatial locality
  - Re-order data accesses to improve temporal locality
- Prevent data from entering the cache
  - Useful for variables that will only be accessed once before being replaced
  - Needs a mechanism for software to tell the hardware not to cache data ("no-allocate" instruction hints or page table bits)
- Kill data that will never be used again
  - Streaming data exploits spatial locality but not temporal locality
  - Replace into dead cache locations

Loop Interchange

for(j = 0; j < N; j++) {
    for(i = 0; i < M; i++) {
        x[i][j] = 2 * x[i][j];
    }
}

becomes

for(i = 0; i < M; i++) {
    for(j = 0; j < N; j++) {
        x[i][j] = 2 * x[i][j];
    }
}

What type of locality does this improve?

Loop Fusion

for(i = 0; i < N; i++)
    a[i] = b[i] * c[i];
for(i = 0; i < N; i++)
    d[i] = a[i] * c[i];

becomes

for(i = 0; i < N; i++) {
    a[i] = b[i] * c[i];
    d[i] = a[i] * c[i];
}

What type of locality does this improve?

Matrix Multiply, Naïve Code

for(i = 0; i < N; i++)
    for(j = 0; j < N; j++) {
        r = 0;
        for(k = 0; k < N; k++)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }

[Figure: access patterns in x, y, and z for the naive loop nest, marking not-touched, old, and new accesses.]

Matrix Multiply with Cache Tiling

for(jj = 0; jj < N; jj = jj + B)
    for(kk = 0; kk < N; kk = kk + B)
        for(i = 0; i < N; i++)
            for(j = jj; j < min(jj + B, N); j++) {
                r = 0;
                for(k = kk; k < min(kk + B, N); k++)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;
            }

[Figure: access patterns in x, y, and z for the tiled loop nest, showing B-by-B blocks.]

What type of locality does this improve?

Acknowledgements

- These slides contain material developed and copyright by:
  - Arvind (MIT)
  - Krste Asanovic (MIT/UCB)
  - Joel Emer (Intel/MIT)
  - James Hoe (CMU)
  - John Kubiatowicz (UCB)
  - David Patterson (UCB)
  - John Lazzaro (UCB)
- MIT material derived from course 6.823
- UCB material derived from courses CS152 and CS252