CS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III


CS 152 Computer Architecture and Engineering
Lecture 8 - Memory Hierarchy-III

Krste Asanovic
Electrical Engineering and Computer Sciences
University of California at Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.eecs.berkeley.edu/~cs152

February 11, 2010 (CS152, Spring 2010)

Last time in Lecture 7
- 3 C's of cache misses: compulsory, capacity, conflict
- Average memory access time = hit time + miss rate * miss penalty
- To improve performance, reduce: hit time, miss rate, and/or miss penalty
- Primary cache parameters: total cache capacity, cache line size, associativity
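
As a quick worked example (with illustrative numbers, not from the lecture): with a 1-cycle hit time, 5% miss rate, and 20-cycle miss penalty, average memory access time = 1 + 0.05 * 20 = 2 cycles. Halving the miss rate to 2.5% gives 1.5 cycles; halving the miss penalty to 10 cycles gives the same improvement.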

Multilevel Caches
Problem: a memory cannot be large and fast.
Solution: increasing sizes of cache at each level.

CPU <-> L1$ <-> L2$ <-> DRAM

- Local miss rate = misses in cache / accesses to cache
- Global miss rate = misses in cache / CPU memory accesses
- Misses per instruction = misses in cache / number of instructions
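
To see the difference, suppose (hypothetical numbers) the L1 misses on 5% of CPU accesses and the L2 misses on 40% of the accesses that reach it. The L2's local miss rate is 40%, but its global miss rate is 0.05 * 0.40 = 2% of all CPU memory accesses.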

Presence of L2 influences L1 design
- Use a smaller L1 if there is also an L2: trade increased L1 miss rate for reduced L1 hit time and reduced L1 miss penalty; reduces average access energy
- Use a simpler write-through L1 with an on-chip L2:
  - The write-back L2 cache absorbs write traffic, so it doesn't go off-chip
  - At most one L1 miss request per L1 access (no dirty victim write back) simplifies pipeline control
  - Simplifies coherence issues
  - Simplifies error recovery in L1 (can use just parity bits in L1 and reload from L2 when a parity error is detected on an L1 read)

Inclusion Policy
- Inclusive multilevel cache: inner cache holds copies of data in the outer cache; external coherence snoop accesses need only check the outer cache
- Exclusive multilevel caches: inner cache may hold data not in the outer cache; swap lines between inner/outer caches on a miss; used in the AMD Athlon, with a 64KB primary and 256KB secondary cache
Why choose one type or the other?

Itanium-2 On-Chip Caches (Intel/HP, 2002)
- Level 1: 16KB, 4-way set-associative, 64B line, quad-port (2 load + 2 store), single-cycle latency
- Level 2: 256KB, 4-way set-associative, 128B line, quad-port (4 load or 4 store), five-cycle latency
- Level 3: 3MB, 12-way set-associative, 128B line, single 32B port, twelve-cycle latency

Power 7 On-Chip Caches [IBM 2009]
- 32KB L1 I$/core and 32KB L1 D$/core, 3-cycle latency
- 256KB unified L2$/core, 8-cycle latency
- 32MB unified shared L3$ in embedded DRAM, 25-cycle latency to local slice

Increasing Cache Bandwidth with Non-Blocking Caches
- A non-blocking (lockup-free) cache allows the data cache to continue to supply cache hits during a miss; requires full/empty bits on registers or out-of-order execution
- Hit under miss reduces the effective miss penalty by working during the miss instead of ignoring CPU requests
- Hit under multiple miss, or miss under miss, may further lower the effective miss penalty by overlapping multiple misses
- Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses, and a miss can occur to a line that already has an outstanding miss (a secondary miss)
- Requires a pipelined or banked memory system (otherwise it cannot support multiple misses)
- The Pentium Pro allows 4 outstanding memory misses (the Cray X1E vector supercomputer allows 2,048 outstanding memory misses)
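
To make the bookkeeping concrete, here is a minimal C sketch of the miss status holding registers (MSHRs) such a controller needs. The four-entry table and 64B line size are illustrative choices (echoing the Pentium Pro's four outstanding misses), not a description of any particular design; a match against an in-flight line is the secondary-miss case above, merged rather than re-issued to memory.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_MSHRS  4        /* illustrative: 4 outstanding misses */
    #define LINE_BYTES 64       /* illustrative line size */

    typedef struct {
        bool     valid;         /* entry tracks an outstanding miss */
        uint64_t line_addr;     /* block address of the in-flight miss */
        int      pending;       /* merged (secondary) requests waiting */
    } Mshr;

    static Mshr mshrs[NUM_MSHRS];

    /* Returns 0 = primary miss issued, 1 = merged into an existing miss
       (secondary miss), -1 = structural stall (all MSHRs busy). */
    int handle_miss(uint64_t addr)
    {
        uint64_t line = addr / LINE_BYTES;
        int free_slot = -1;
        for (int i = 0; i < NUM_MSHRS; i++) {
            if (mshrs[i].valid && mshrs[i].line_addr == line) {
                mshrs[i].pending++;   /* secondary miss: no new request */
                return 1;
            }
            if (!mshrs[i].valid && free_slot < 0)
                free_slot = i;
        }
        if (free_slot < 0)
            return -1;                /* too many outstanding misses */
        mshrs[free_slot].valid = true;
        mshrs[free_slot].line_addr = line;
        mshrs[free_slot].pending = 1; /* primary miss: issue to L2/memory */
        return 0;
    }

    int main(void)
    {
        printf("%d\n", handle_miss(0x1000)); /* 0: primary miss */
        printf("%d\n", handle_miss(0x1008)); /* 1: secondary, same line */
        printf("%d\n", handle_miss(0x2000)); /* 0: miss to another line */
        return 0;
    }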

Value of Hit Under Miss for SPEC (old data)
[Bar chart: average memory access time per SPEC92 benchmark (eqntott, espresso, xlisp, compress, mdljsp2, ear, fpppp, tomcatv, swm256, doduc, su2cor, wave5, mdljdp2, hydro2d, alvinn, nasa7, spice2g6, ora) under hit-under-0 (base), 1, 2, and 64 outstanding misses]
- Floating-point programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26
- Integer programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19
- Configuration: 8KB data cache, direct-mapped, 32B block, 16-cycle miss penalty, SPEC 92


Prefetching
- Speculate on future instruction and data accesses and fetch them into the cache(s)
- Instruction accesses are easier to predict than data accesses
- Varieties of prefetching: hardware prefetching, software prefetching, mixed schemes
What types of misses does prefetching affect?

Issues in Prefetching
- Usefulness: should produce hits
- Timeliness: not late and not too early
- Cache and bandwidth pollution
[Diagram: CPU (RF), L1 instruction cache, L1 data cache, unified L2 cache; prefetched data shown entering the L2]

Hardware Instruction Prefetching
Instruction prefetch in the Alpha AXP 21064:
- Fetch two blocks on a miss: the requested block (i) and the next consecutive block (i+1)
- Requested block placed in the cache; next block placed in the instruction stream buffer
- If an access misses in the cache but hits in the stream buffer, move the stream buffer block into the cache and prefetch the next block (i+2)
[Diagram: CPU (RF) and L1 instruction cache, with a stream buffer between the L1 and the unified L2 cache; requested blocks fill the L1, prefetched blocks fill the stream buffer]
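
A toy C model of this protocol may help; the one-entry stream buffer, instantaneous fills, and presence-bit "cache" are simplifications for illustration, not details of the 21064.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define MAX_BLOCKS 4096
    static bool cache[MAX_BLOCKS];     /* toy cache: presence bit per block */
    static struct { bool valid; uint64_t block; } stream_buf;

    static void fetch_from_l2(uint64_t b) { (void)b; /* model: instant fill */ }

    void fetch_block(uint64_t i)
    {
        if (cache[i]) {
            printf("block %llu: cache hit\n", (unsigned long long)i);
            return;
        }
        if (stream_buf.valid && stream_buf.block == i) {
            printf("block %llu: stream-buffer hit\n", (unsigned long long)i);
            cache[i] = true;           /* promote buffered block into cache */
            stream_buf.block = i + 1;  /* prefetch next sequential block (i+2
                                          relative to the original miss) */
            fetch_from_l2(i + 1);
            return;
        }
        printf("block %llu: miss\n", (unsigned long long)i);
        fetch_from_l2(i);              /* fetch requested block i ... */
        cache[i] = true;
        stream_buf.valid = true;       /* ... and prefetch block i+1 into
                                          the stream buffer */
        stream_buf.block = i + 1;
        fetch_from_l2(i + 1);
    }

    int main(void)
    {
        for (uint64_t i = 10; i < 14; i++)
            fetch_block(i);   /* one miss, then stream-buffer hits */
        return 0;
    }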

Hardware Data Prefetching
- Prefetch-on-miss: prefetch b+1 upon a miss on b
- One Block Lookahead (OBL) scheme: initiate a prefetch for block b+1 when block b is accessed. Why is this different from doubling the block size? Can extend to N-block lookahead
- Strided prefetch: if a sequence of accesses to blocks b, b+n, b+2n is observed, then prefetch b+3n, etc.
- Example: IBM Power 5 [2003] supports eight independent streams of strided prefetch per processor, prefetching 12 lines ahead of the current access
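
As a sketch, strided prefetching can be built from a reference-prediction entry that remembers the last address and the stride between accesses, issuing a prefetch once the stride has repeated. The single entry and the confidence threshold of two are illustrative assumptions; a real prefetcher keeps a table of such entries, typically indexed by the PC of the load.

    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint64_t last_addr;     /* last address seen */
        int64_t  stride;        /* stride between the last two accesses */
        int      confidence;    /* consecutive times the stride repeated */
    } StridePredictor;

    void observe(StridePredictor *p, uint64_t addr)
    {
        int64_t s = (int64_t)(addr - p->last_addr);
        if (s == p->stride) {
            p->confidence++;
        } else {                /* stride changed: retrain */
            p->stride = s;
            p->confidence = 0;
        }
        p->last_addr = addr;
        /* After two matching strides, prefetch one stride ahead. */
        if (p->confidence >= 2)
            printf("prefetch 0x%llx\n",
                   (unsigned long long)(addr + p->stride));
    }

    int main(void)
    {
        StridePredictor p = {0, 0, 0};
        for (uint64_t a = 0x1000; a < 0x1400; a += 0x80)
            observe(&p, a);     /* accesses b, b+n, b+2n, ... */
        return 0;
    }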

Software Prefetching

    for (i = 0; i < N; i++) {
        prefetch(&a[i + 1]);
        prefetch(&b[i + 1]);
        SUM = SUM + a[i] * b[i];
    }

What property do we require of the cache for prefetching to work?

Software Prefetching Issues
- Timing is the biggest issue, not predictability: prefetch very close to when the data is required and you might be too late; prefetch too early and you cause pollution
- Estimate how long it will take for the data to come into L1, so we can set P appropriately. Why is this hard to do?

    for (i = 0; i < N; i++) {
        prefetch(&a[i + P]);
        prefetch(&b[i + P]);
        SUM = SUM + a[i] * b[i];
    }

- Must consider the cost of the prefetch instructions
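
For reference, the loop above compiles directly with GCC or Clang using the __builtin_prefetch intrinsic; the distance P = 16 below is an arbitrary illustrative value, exactly the parameter the slide says is hard to set.

    #include <stdio.h>

    #define N (1 << 20)
    #define P 16    /* prefetch distance: tune to memory latency / loop cost */

    double a[N + P], b[N + P];   /* pad so a[i+P] stays within the arrays */

    int main(void)
    {
        double sum = 0.0;
        for (int i = 0; i < N; i++) {
            __builtin_prefetch(&a[i + P]);  /* GCC/Clang builtin */
            __builtin_prefetch(&b[i + P]);
            sum += a[i] * b[i];
        }
        printf("%f\n", sum);
        return 0;
    }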

Compiler Optimizations
- Restructuring code affects the data block access sequence: group data accesses together to improve spatial locality; re-order data accesses to improve temporal locality
- Prevent data from entering the cache: useful for variables that will only be accessed once before being replaced; needs a mechanism for software to tell hardware not to cache data ("no-allocate" instruction hints or page table bits)
- Kill data that will never be used again: streaming data exploits spatial locality but not temporal locality; replace into dead cache locations

Loop Interchange

    for (j = 0; j < N; j++) {
        for (i = 0; i < M; i++) {
            x[i][j] = 2 * x[i][j];
        }
    }

becomes

    for (i = 0; i < M; i++) {
        for (j = 0; j < N; j++) {
            x[i][j] = 2 * x[i][j];
        }
    }

What type of locality does this improve?

Loop Fusion

    for (i = 0; i < N; i++)
        a[i] = b[i] * c[i];
    for (i = 0; i < N; i++)
        d[i] = a[i] * c[i];

becomes

    for (i = 0; i < N; i++) {
        a[i] = b[i] * c[i];
        d[i] = a[i] * c[i];
    }

What type of locality does this improve?

Matrix Multiply, Naïve Code

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            r = 0;
            for (k = 0; k < N; k++)
                r = r + y[i][k] * z[k][j];
            x[i][j] = r;
        }

[Diagram: access patterns over x, y, and z; legend: not touched / old access / new access]

Matrix Multiply with Cache Tiling

    for (jj = 0; jj < N; jj = jj + b)
        for (kk = 0; kk < N; kk = kk + b)
            for (i = 0; i < N; i++)
                for (j = jj; j < min(jj + b, N); j++) {
                    r = 0;
                    for (k = kk; k < min(kk + b, N); k++)
                        r = r + y[i][k] * z[k][j];
                    x[i][j] = x[i][j] + r;
                }

[Diagram: tiled access patterns over x, y, and z]

What type of locality does this improve?

Acknowledgements
These slides contain material developed and copyright by: Arvind (MIT), Krste Asanovic (MIT/UCB), Joel Emer (Intel/MIT), James Hoe (CMU), John Kubiatowicz (UCB), David Patterson (UCB).
MIT material derived from course 6.823. UCB material derived from course CS252.