DECstation 5000 Miss Rates. Cache Performance Measures. Example. Cache Performance Improvements. Types of Cache Misses. Cache Performance Equations

DECstation 5000 Miss Rates

[Figure: miss rate (%) versus cache size, 1 KB to 128 KB, for a direct-mapped cache with 32-byte blocks; separate curves for the instruction cache, the data cache, and a unified cache. Percentage of instruction references is 75%.]

Cache Performance Measures

Hit rate: fraction found in that level
- So high that we usually talk about the miss rate instead
- Miss rate fallacy: as MIPS is to CPU performance, so is miss rate to average memory access time

Average memory-access time = Hit time + Miss rate x Miss penalty (ns)

Miss penalty: time to replace a block from the lower level, including the time to deliver it to the CPU
- access time to lower level = f(latency to lower level)
- transfer time: time to transfer the block = f(bandwidth)

Example

Which has the lower miss rate:
- a 16-KB instruction cache paired with a 16-KB data cache, or
- a 32-KB unified cache?

Mix: 75% instruction references, 25% data references
Hit time = 1 cycle, miss penalty = 50 cycles
Load/store hit = 2 cycles on a unified cache (the single port forces an extra cycle)

Overall miss rate for split caches = 0.75 * 0.64% + 0.25 * 6.47% = 2.10%
Miss rate for unified cache = 1.99%, which is lower

Average memory access times (re-derived in the C sketch below):
Split = 0.75 * (1 + 0.0064 * 50) + 0.25 * (1 + 0.0647 * 50) = 2.05 cycles
Unified = 0.75 * (1 + 0.0199 * 50) + 0.25 * (1 + 1 + 0.0199 * 50) = 2.24 cycles

So the split caches win on average memory access time despite their higher miss rate.

Cache Performance Improvements

Average memory-access time = Hit time + Miss rate x Miss penalty

Cache optimizations:
- Reducing the miss rate
- Reducing the miss penalty
- Reducing the hit time

Cache Performance Equations

CPU time = (CPU execution cycles + Mem stall cycles) x Cycle time
Mem stall cycles = Mem accesses x Miss rate x Miss penalty
CPU time = IC x (CPI_execution + Mem accesses per instr x Miss rate x Miss penalty) x Cycle time
Misses per instr = Mem accesses per instr x Miss rate
CPU time = IC x (CPI_execution + Misses per instr x Miss penalty) x Cycle time

Types of Cache Misses

Compulsory: first-reference or cold-start misses
Capacity: the working set is too big for the cache; these occur even in fully associative caches
Conflict (collision): many blocks map to the same block frame (line); affects set-associative and direct-mapped caches
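A small C sketch that re-derives the numbers from the Example slide using the AMAT formula above; every rate and penalty is a value quoted on the slides, and the program is just the arithmetic spelled out.

#include <stdio.h>

/* Re-derive the split vs. unified comparison from the Example slide. */
int main(void) {
    double instr = 0.75, data = 0.25;      /* reference mix          */
    double penalty = 50.0;                 /* miss penalty in cycles */
    double mr_i = 0.0064, mr_d = 0.0647;   /* 16-KB split miss rates */
    double mr_u = 0.0199;                  /* 32-KB unified miss rate */

    /* Overall miss rate of the split caches, weighted by the mix. */
    printf("split miss rate = %.2f%%\n", 100 * (instr * mr_i + data * mr_d));

    /* AMAT = hit time + miss rate * miss penalty for each class;
       the unified cache charges loads/stores one extra hit cycle. */
    double split   = instr * (1 + mr_i * penalty) + data * (1 + mr_d * penalty);
    double unified = instr * (1 + mr_u * penalty) + data * (1 + 1 + mr_u * penalty);
    printf("AMAT split   = %.2f cycles\n", split);    /* 2.05       */
    printf("AMAT unified = %.2f cycles\n", unified);  /* about 2.24 */
    return 0;
}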

3Cs Absolute Miss Rates / 3Cs Relative Miss Rate

[Figures: absolute miss rate, and relative miss rate normalized to 100%, versus cache size (1 KB to 128 KB), each decomposed into compulsory, capacity, and conflict components for 1-way, 2-way, 4-way, and 8-way associativity; the conflict component illustrates the 2:1 cache rule.]

Reducing the 3Cs Cache Misses

1. Larger block size
2. Higher associativity
3. Victim caches
4. Pseudo-associative caches
5. Hardware prefetching
6. Compiler-controlled prefetching
7. Compiler optimizations

Execution time is the only final measure.

1. Larger Block Size

Effects of larger block sizes:
- Reduction of compulsory misses (spatial locality)
- Increase of miss penalty (transfer time)
- Reduction of the number of blocks: potential increase of conflict and capacity misses

Latency and bandwidth of lower-level memory:
- High latency and high bandwidth favor a large block size, since a bigger block adds only a small increase in miss penalty

[Figure: miss rate (0% to 25%) versus block size (16 to 256 bytes) for cache sizes 1K to 256K.]

2. Higher Associativity

- Eight-way set associative is good enough
- 2:1 Cache Rule: the miss rate of a direct-mapped cache of size N equals the miss rate of a 2-way set-associative cache of size N/2
- Higher associativity can increase the clock cycle time: hit time for 2-way vs. 1-way is +10% for an external cache, +2% for an internal cache
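That clock-cycle cost is what the table on the next slide quantifies. As a one-line C sketch of the tradeoff (the miss rate in the example call is backed out of the 1-KB, 2-way table entry below; it is not quoted on the slides):

/* AMAT when associativity stretches the clock: the hit time scales by
   the clock-cycle factor while the miss penalty in cycles stays fixed. */
double amat(double clock_factor, double miss_rate, double penalty) {
    return 1.0 * clock_factor + miss_rate * penalty;
}
/* amat(1.10, 0.105, 10.0) == 2.15, the 1-KB, 2-way entry in the table */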

Average memory access time vs. associativity

Clock cycle time = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way, and 1.0 for direct mapped; miss penalty = 10 cycles.

Cache Size (KB)   1-way   2-way   4-way   8-way
  1               2.33    2.15    2.07    2.01
  2               1.98    1.86    1.76    1.68
  4               1.72    1.67    1.61    1.53
  8               1.46    1.48    1.47    1.43
 16               1.29    1.32    1.32    1.32
 32               1.20    1.24    1.25    1.27
 64               1.14    1.20    1.21    1.23
128               1.10    1.17    1.18    1.20

Note that from 8 KB up, the faster clock of the direct-mapped cache outweighs its higher miss rate.

3. Victim Caches

- A small, fully associative cache behind the main cache
- Reduces conflict misses without impairing the clock rate

[Figure: victim-cache organization: CPU address and data in/out, tag and data arrays, the victim cache, a write buffer, and the lower-level memory.]

4. Pseudo-Associative Caches

- Goal: the fast hit time of direct mapped with the lower conflict misses of a 2-way set-associative cache
- Divide the cache: on a miss, check the other half of the cache to see if the block is there; if so, it is a pseudo-hit (slow hit)

[Figure: access timeline showing hit time, pseudo hit time, and miss penalty, plus the cache organization.]

- Drawback: CPU pipeline design is hard if a hit can take 1 or 2 cycles
- Better for caches not tied directly to the processor (L2)
- Used in the MIPS R10000 L2 cache; similar in UltraSPARC

Compare direct-mapped, 2-way set-associative, and pseudo-associative caches using AMAT (evaluated in the C sketch after the hardware-prefetching slide). It takes two extra cycles to find an entry in the alternative location (one cycle to check and one cycle to swap):

Miss rate_pseudo = Miss rate_2-way
Miss penalty_pseudo = Miss penalty_1-way + 2
Hit time_pseudo = Hit time_1-way + Alternate hit rate_pseudo * 2
Alternate hit rate_pseudo = Hit rate_2-way - Hit rate_1-way = Miss rate_1-way - Miss rate_2-way

5. Hardware Prefetching

- Instruction prefetching: the Alpha 21064 fetches 2 blocks on a miss; the extra block is placed in a stream buffer, which is checked on a miss
- Works with data blocks too: a single data stream buffer catches 25% of the misses from a 4-KB direct-mapped cache; 4 streams catch 43%
- For scientific programs, 8 streams caught 50% to 70% of the misses from a pair of 64-KB, 4-way set-associative caches
- Prefetching relies on having extra memory bandwidth that can be used without penalty
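A quick check of the pseudo-associative formulas above in C. The 1-way and 2-way miss rates below are assumed values for illustration; the 10-cycle miss penalty follows the table, and the two extra cycles are from the slide.

#include <stdio.h>

/* Evaluate the pseudo-associative AMAT formulas.
   The miss rates here are assumed, illustrative values. */
int main(void) {
    double mr_1way = 0.098, mr_2way = 0.076;   /* assumed miss rates */
    double hit_1way = 1.0, penalty = 10.0;     /* cycles             */

    double alt_hit   = mr_1way - mr_2way;      /* hits found in the other half */
    double hit_ps    = hit_1way + alt_hit * 2; /* +2 cycles: check and swap    */
    double amat_1way = hit_1way + mr_1way * penalty;
    double amat_ps   = hit_ps + mr_2way * (penalty + 2);

    printf("AMAT 1-way  = %.3f cycles\n", amat_1way);  /* 1.980 */
    printf("AMAT pseudo = %.3f cycles\n", amat_ps);    /* 1.956 */
    return 0;
}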

6. Software Prefetching

- The compiler inserts data prefetch instructions
- Register prefetch: load the data into a register (HP PA-RISC loads)
- Cache prefetch: load into the cache only (MIPS IV, PowerPC)
- Special prefetching instructions cannot cause faults; they are a form of speculative execution
- Requires a nonblocking cache, so execution overlaps with the prefetch
- Issuing prefetch instructions takes time: is the cost of the prefetch issues smaller than the savings in reduced misses? Wider superscalar issue reduces the difficulty of finding issue bandwidth

7. Compiler Optimizations

Avoid hardware changes.

Instructions:
- Profiling to look at conflicts between groups of instructions

Data:
- Merging arrays: improve spatial locality by one array of compound elements vs. 2 arrays
- Loop interchange: change the nesting of loops to access data in the order it is stored in memory
- Loop fusion: combine 2 independent loops that have the same looping and some variables in common
- Blocking: improve temporal locality by accessing blocks of data repeatedly vs. going down whole columns or rows

Merging Arrays

/* Before: 2 sequential arrays */
int key[SIZE];
int val[SIZE];

/* After: 1 array of structures */
struct merge {
    int key;
    int val;
};
struct merge merged_array[SIZE];

Reducing conflicts between val & key; improved spatial locality.

Loop Interchange

/* Before */
for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
        x[i][j] = 2 * x[i][j];

/* After */
for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
        x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality. Same number of executed instructions.

Loop Fusion

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }

Two misses per access to a & c vs. one miss per access; improved temporal locality.

Blocking (1/2)

/* Unblocked matrix multiply */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k+1)
            r = r + y[i][k]*z[k][j];
        x[i][j] = r;
    }

Two inner loops:
- read all N x N elements of z[]
- read the N elements of one row of y[] repeatedly
- write the N elements of one row of x[]

Capacity misses are a function of N and the cache size:
- if the cache holds all 3 N x N x 4-byte matrices, there are no capacity misses

Idea: compute on a B x B submatrix that fits in the cache (sized as sketched below).
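How big can B be? A tiny C helper for picking the blocking factor; the 16-KB cache size in the example call, and the rule that three tiles of 4-byte elements must fit at once, are assumptions for illustration, not slide data.

#include <math.h>

/* Largest B such that three B x B tiles of 4-byte elements fit at once. */
int max_blocking_factor(int cache_bytes) {
    return (int)sqrt(cache_bytes / (3.0 * 4.0));
}
/* max_blocking_factor(16384) == 36, so B = 32 is a comfortable choice */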

Blocking (2/2)

/* Blocked matrix multiply */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < min(jj+B-1,N); j = j+1) {
                r = 0;
                for (k = kk; k < min(kk+B-1,N); k = k+1)
                    r = r + y[i][k]*z[k][j];
                x[i][j] = x[i][j] + r;
            }

B is called the blocking factor. Capacity misses drop from 2N³ + N² to 2N³/B + N². Conflict misses too? (A self-contained, runnable version appears after the summary.)

Conflict Misses and Blocking

[Figure: miss rate versus blocking factor for a direct-mapped cache and a fully associative cache; the fully associative cache avoids the conflict misses the direct-mapped cache suffers at unfavorable blocking factors.]

Compiler Optimization Performance Summary

[Figure: performance improvement (1x to 3x) from merged arrays, loop interchange, loop fusion, and blocking on vpenta (nasa7), gmty (nasa7), tomcatv, btrix (nasa7), mxm (nasa7), spice, cholesky (nasa7), and compress.]

Summary

CPU time = IC x (CPI_execution + Memory accesses per instruction x Miss rate x Miss penalty) x Clock cycle time

3 Cs: compulsory, capacity, and conflict misses.

Reducing the miss rate:
1. Larger block size
2. Higher associativity
3. Victim cache
4. Pseudo-associativity
5. HW prefetching of instructions and data
6. SW prefetching of data
7. Compiler optimizations

Danger of concentrating on just one parameter when evaluating performance.
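For completeness, a self-contained, runnable version of the blocked multiply above. N and B are arbitrary choices for illustration, and the loop bounds are written as min(jj+B, N) rather than the slide's min(jj+B-1, N), which would skip the last column of each tile.

#include <stdio.h>

#define N 512
#define B 32   /* blocking factor */

static double x[N][N], y[N][N], z[N][N];

static int min(int a, int b) { return a < b ? a : b; }

int main(void) {
    /* arbitrary initialization; x starts zeroed as a static array */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            y[i][j] = i + j;
            z[i][j] = i - j;
        }

    /* Blocked matrix multiply: each (jj, kk) pass touches only a
       B-wide strip of y and z, so the working set fits in the cache. */
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < min(jj + B, N); j++) {
                    double r = 0;
                    for (int k = kk; k < min(kk + B, N); k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }

    printf("x[0][0] = %f\n", x[0][0]);
    return 0;
}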