Some material adapted from Mohamed Younis, UMBC CMSC 611 Spring 2003 course slides. Some material adapted from Hennessy & Patterson, © 2003 Elsevier Science.


$$\text{CPU time} = \text{IC} \times \left( \text{CPI}_{\text{Execution}} + \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{Miss rate} \times \text{Miss penalty} \right) \times \text{Clock cycle time}$$

Seven techniques for reducing the miss rate (a worked example of the equation follows the list):
1. Reducing misses via larger block size
2. Reducing misses via higher associativity
3. Reducing misses via victim cache
4. Reducing misses via pseudo-associativity
5. Reducing misses by H/W prefetching of instructions and data
6. Reducing misses by S/W prefetching of data
7. Reducing misses by compiler optimizations
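To see how the equation weighs these factors, here is a worked example with assumed values (not from the slides): $\text{CPI}_{\text{Execution}} = 1.2$, 1.4 memory accesses per instruction, a 5% miss rate, and a 50-cycle miss penalty.

$$\text{CPI}_{\text{effective}} = 1.2 + 1.4 \times 0.05 \times 50 = 1.2 + 3.5 = 4.7$$

Under these assumptions memory stalls dominate: halving the miss rate to 2.5% brings the effective CPI down to $1.2 + 1.75 = 2.95$, roughly a 1.6x speedup.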

Compiler-based cache optimization reduces the miss rate without any hardware change. McFarling [1989] reduced cache misses by 75% (8KB direct-mapped cache, 4-byte blocks).
For instructions:
- Reorder procedures in memory to reduce conflict misses
- Use profiling to determine likely conflicts among groups of instructions
For data:
- Merging arrays: improve spatial locality by using a single array of compound elements instead of two separate arrays
- Loop interchange: change the nesting of loops to access data in the order it is stored in memory
- Loop fusion: combine two independent loops that have the same looping structure and overlapping variables
- Blocking: improve temporal locality by accessing blocks of data repeatedly instead of walking down whole columns or rows

Merging Arrays:

    /* Before: 2 sequential arrays */
    int val[SIZE];
    int key[SIZE];

    /* After: 1 array of structures */
    struct merge {
        int val;
        int key;
    };
    struct merge merged_array[SIZE];

Reduces misses by improving spatial locality: values and keys that are accessed together now sit in the same block.

Loop Interchange:

    /* Before */
    for (k = 0; k < 100; k = k+1)
        for (j = 0; j < 100; j = j+1)
            for (i = 0; i < 5000; i = i+1)
                x[i][j] = 2 * x[i][j];

    /* After */
    for (k = 0; k < 100; k = k+1)
        for (i = 0; i < 5000; i = i+1)
            for (j = 0; j < 100; j = j+1)
                x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality.

Some programs have separate sections of code that access the same arrays (performing different computations on common data). Fusing multiple loops into a single loop allows the data in the cache to be reused before being swapped out. Loop fusion reduces misses through improved temporal locality (rather than the spatial locality improved by array merging and loop interchange).

    /* Before */
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1)
            a[i][j] = 1/b[i][j] * c[i][j];
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1)
            d[i][j] = a[i][j] + c[i][j];

    /* After */
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1) {
            a[i][j] = 1/b[i][j] * c[i][j];
            d[i][j] = a[i][j] + c[i][j];
        }

Without loop fusion, accessing arrays a and c would cause twice the number of misses.

    /* Before */
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1) {
            r = 0;
            for (k = 0; k < N; k = k+1)
                r = r + y[i][k] * z[k][j];
            x[i][j] = r;
        }

The two inner loops:
- Read all N×N elements of z[]
- Read N elements of one row of y[] repeatedly
- Write N elements of one row of x[]

Capacity misses are a function of N and the cache size: if the cache can hold all three N×N matrices (3 × N² × 4 bytes), there are no capacity misses. Idea: compute on a B×B sub-matrix that fits in the cache.

    /* After */
    for (jj = 0; jj < N; jj = jj+B)
        for (kk = 0; kk < N; kk = kk+B)
            for (i = 0; i < N; i = i+1)
                for (j = jj; j < min(jj+B, N); j = j+1) {
                    r = 0;
                    for (k = kk; k < min(kk+B, N); k = k+1)
                        r = r + y[i][k] * z[k][j];
                    x[i][j] = x[i][j] + r;
                }

B is called the blocking factor. (Note that x[] must be zero-initialized, since partial sums now accumulate across kk blocks.) Memory words accessed drop from $2N^3 + N^2$ to $2N^3/B + N^2$. Conflict misses can go down too. Blocking is also useful for register allocation.

Traditionally, blocking is used to reduce capacity misses, relying on high associativity to tackle conflict misses. Choosing a blocking factor smaller than the cache capacity can also reduce conflict misses (fewer words are active in the cache at once). Lam et al. [1991] found that a blocking factor of 24 had one-fifth the misses of a factor of 48, even though both fit in the cache. A sketch of how one might pick B from the cache size follows.
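As a rough illustration (not from the slides), the blocking factor can be sized so that the three active B×B sub-matrices use only part of the cache; the fraction used here (one half, to leave headroom against conflict misses) is an assumption for illustration:

    #include <math.h>
    #include <stdio.h>

    /* Hypothetical helper: pick a blocking factor B so that the three
     * active B x B sub-matrices of 4-byte words use only a fraction of
     * the cache, leaving headroom against conflict misses. */
    int pick_blocking_factor(int cache_bytes) {
        double budget = cache_bytes / 2.0;   /* assumed headroom fraction */
        /* 3 sub-matrices * B*B elements * 4 bytes each <= budget */
        return (int)sqrt(budget / (3.0 * 4.0));
    }

    int main(void) {
        /* A 16KB cache gives B = sqrt(8192/12), about 26, close to the
         * factor of 24 cited from Lam et al. above. */
        printf("B for 16KB cache: %d\n", pick_blocking_factor(16384));
        return 0;
    }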

$$\text{CPU time} = \text{IC} \times \left( \text{CPI}_{\text{Execution}} + \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{Miss rate} \times \text{Miss penalty} \right) \times \text{Clock cycle time}$$

Reducing the miss penalty can be as effective as reducing the miss rate. With the gap between the processor and DRAM widening, the relative cost of miss penalties increases over time. Seven techniques:
1. Read priority over write on miss
2. Sub-block placement
3. Merging write buffer
4. Victim cache
5. Early restart and critical word first on miss
6. Non-blocking caches (hit under miss, miss under miss)
7. Second-level cache

The last technique can be applied recursively to build multilevel caches. The danger is that the time to DRAM grows with multiple levels in between, and first attempts at L2 caches can make things worse, since the increased worst case hurts.

Write-through caches with write buffers risk RAW conflicts between buffered writes and main-memory reads on cache misses. Simply waiting for the write buffer to empty can increase the read miss penalty (by 50% on the old MIPS 1000). Instead, check the write buffer contents before the read; if there are no conflicts, let the memory access continue (a minimal sketch of the check appears below). [Figure: processor, cache, write buffer, and DRAM]

Write back? On a read miss replacing a dirty block:
- Normal: write the dirty block to memory, then do the read
- Instead: copy the dirty block to a write buffer, then do the read, then do the write
The CPU stalls less since it restarts as soon as the read is done.
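A minimal C sketch of the buffer check, assuming a simple array-based write-buffer model (all names here are illustrative, not from the slides):

    #include <stdbool.h>
    #include <stddef.h>

    #define WB_ENTRIES 4   /* assumed buffer depth */

    /* Hypothetical model of a write buffer: pending addresses + count. */
    struct write_buffer {
        unsigned long addr[WB_ENTRIES];
        size_t n;
    };

    /* On a read miss, scan the buffer: if no pending write targets the
     * missed block, the read may proceed ahead of the buffered writes. */
    bool read_may_proceed(const struct write_buffer *wb,
                          unsigned long miss_addr, unsigned long block_mask) {
        for (size_t i = 0; i < wb->n; i++)
            if ((wb->addr[i] & block_mask) == (miss_addr & block_mask))
                return false;   /* RAW conflict: drain the buffer first */
        return true;            /* no conflict: service the read now */
    }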

Sub-block placement was originally invented to reduce tag storage while avoiding the increased miss penalty caused by large block sizes. Enlarge the block size while dividing each block into smaller units (sub-blocks), so the cache does not have to load a full block on a miss. Include a valid bit per sub-block to indicate whether that sub-block is present in the cache; a sketch of the bookkeeping follows. [Figure: cache blocks with per-sub-block valid bits]
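A minimal C sketch of the per-sub-block bookkeeping, with sizes assumed for illustration only:

    #include <stdbool.h>
    #include <stdint.h>

    #define SUBBLOCKS_PER_BLOCK 4   /* assumed: 4 sub-blocks per block */
    #define SUBBLOCK_BYTES      16  /* assumed sub-block size */

    /* One cache block: a single tag covers several sub-blocks, each with
     * its own valid bit, so a miss loads only the needed sub-block. */
    struct cache_block {
        uint32_t tag;
        bool     valid[SUBBLOCKS_PER_BLOCK];
        uint8_t  data[SUBBLOCKS_PER_BLOCK][SUBBLOCK_BYTES];
    };

    /* A hit requires both a tag match and the sub-block's valid bit set. */
    bool subblock_hit(const struct cache_block *b, uint32_t tag, int sub) {
        return b->tag == tag && b->valid[sub];
    }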

Don't wait for the full block to be loaded before restarting the CPU:
- Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
- Critical word first: request the missed word first from memory; also called wrapped fetch or requested-word-first
Both complicate the cache controller design. Critical word first is generally useful only with large blocks. Given spatial locality, programs tend to want the next sequential word anyway, which limits the benefit. (A sketch of the wrapped fetch order follows.)
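A minimal sketch of the wrapped fetch order, assuming an 8-word block (names and sizes are illustrative):

    #include <stdio.h>

    #define WORDS_PER_BLOCK 8   /* assumed block size in words */

    /* Critical word first: fetch the requested word, then wrap around
     * the block so every word arrives exactly once. */
    void wrapped_fetch_order(int requested, int order[WORDS_PER_BLOCK]) {
        for (int i = 0; i < WORDS_PER_BLOCK; i++)
            order[i] = (requested + i) % WORDS_PER_BLOCK;
    }

    int main(void) {
        int order[WORDS_PER_BLOCK];
        wrapped_fetch_order(5, order);   /* miss on word 5 */
        for (int i = 0; i < WORDS_PER_BLOCK; i++)
            printf("%d ", order[i]);     /* prints: 5 6 7 0 1 2 3 4 */
        return 0;
    }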

[Figure: victim cache organization; the victim cache sits between the cache (tag/data compare on the CPU address) and lower-level memory, alongside the write buffer]

A victim cache lowers both the miss rate and the average miss penalty, while slightly extending the worst-case miss penalty. A lookup sketch follows.
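A minimal C sketch of a victim-cache probe on a main-cache miss, assuming a small fully associative 4-entry victim cache (all names illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    #define VICTIM_ENTRIES 4   /* assumed: small, fully associative */

    struct victim_cache {
        uint32_t tag[VICTIM_ENTRIES];
        bool     valid[VICTIM_ENTRIES];
    };

    /* On a main-cache miss, probe the victim cache: a hit there costs
     * a cycle or two instead of a full trip to the next level. */
    int victim_lookup(const struct victim_cache *vc, uint32_t block_tag) {
        for (int i = 0; i < VICTIM_ENTRIES; i++)
            if (vc->valid[i] && vc->tag[i] == block_tag)
                return i;      /* hit: swap this block back into the cache */
        return -1;             /* miss: go to lower-level memory */
    }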

Early restart still waits for the requested word to arrive before the CPU can continue execution. For machines that allow out-of-order execution using a scoreboard or Tomasulo-style control, the CPU should not stall on cache misses. A non-blocking (lock-free) cache allows the data cache to continue supplying hits during a miss:
- Hit under miss reduces the effective miss penalty by doing useful work during a miss instead of ignoring CPU requests
- Hit under multiple miss or miss under miss may further lower the effective miss penalty by overlapping multiple misses
This significantly increases the complexity of the cache controller, since there can be multiple outstanding memory accesses, and it requires multiple memory banks to service them. The Pentium Pro allows 4 outstanding memory misses. (A bookkeeping sketch follows.)
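Outstanding misses are conventionally tracked in miss status holding registers (MSHRs); the slides do not show this structure, so the following C sketch is an illustration under assumed names and sizes:

    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_OUTSTANDING 4   /* matches the Pentium Pro figure above */

    /* One miss status holding register: which block is in flight. */
    struct mshr {
        bool     busy;
        uint32_t block_addr;
    };

    struct mshr mshrs[MAX_OUTSTANDING];

    /* On a miss: if the block is already in flight, merge with it;
     * else allocate a free MSHR; if all are busy, the CPU must stall. */
    int allocate_mshr(uint32_t block_addr) {
        int free_slot = -1;
        for (int i = 0; i < MAX_OUTSTANDING; i++) {
            if (mshrs[i].busy && mshrs[i].block_addr == block_addr)
                return i;                 /* merge with in-flight miss */
            if (!mshrs[i].busy && free_slot < 0)
                free_slot = i;
        }
        if (free_slot >= 0) {
            mshrs[free_slot].busy = true;
            mshrs[free_slot].block_addr = block_addr;
        }
        return free_slot;                 /* -1 means stall */
    }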

[Figure: ratio of average memory stall time with hit under miss, per benchmark]

The previous techniques reduce the impact of the miss penalty on the CPU; an L2 cache instead handles the cache-memory interface. Measuring cache performance (worked example below):

$$\text{AMAT} = \text{HitTime}_{L1} + \text{MissRate}_{L1} \times \text{MissPenalty}_{L1} = \text{HitTime}_{L1} + \text{MissRate}_{L1} \times (\text{HitTime}_{L2} + \text{MissRate}_{L2} \times \text{MissPenalty}_{L2})$$

- Local miss rate: misses in this cache divided by the total number of memory accesses to this cache ($\text{MissRate}_{L2}$)
- Global miss rate (and the biggest penalty!): misses in this cache divided by the total number of memory accesses generated by the CPU ($\text{MissRate}_{L1} \times \text{MissRate}_{L2}$)
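A worked example with assumed values (not from the slides): $\text{HitTime}_{L1} = 1$ cycle, $\text{MissRate}_{L1} = 4\%$, $\text{HitTime}_{L2} = 10$ cycles, local $\text{MissRate}_{L2} = 50\%$, $\text{MissPenalty}_{L2} = 100$ cycles.

$$\text{AMAT} = 1 + 0.04 \times (10 + 0.5 \times 100) = 1 + 0.04 \times 60 = 3.4 \text{ cycles}$$

The global L2 miss rate is $0.04 \times 0.5 = 2\%$ of all CPU accesses, even though half of the accesses that reach the L2 miss.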

(The global miss rate stays close to the single-level cache miss rate, provided the L2 is much larger than the L1.)

[Figure: relative execution time vs. second-level cache block size (bytes), for a 32-bit bus and 512KB L2 cache]

The L1 cache directly affects the processor design and clock cycle, so it should be simple and small. The bulk of the optimization techniques can go to the L2 cache instead; miss-rate reduction is more practical for L2. Considering the L2 cache can also improve the L1 design, e.g. use write-through in L1 if the L2 cache applies write-back.

$$\text{Average access time} = \text{Hit time} \times (1 - \text{Miss rate}) + \text{Miss time} \times \text{Miss rate}$$

The hit rate is typically very high compared to the miss rate, so any reduction in hit time is magnified (worked example below). Hit time is also critical because it affects the processor clock rate. Three techniques to reduce hit time:
1. Simple and small caches
2. Avoiding address translation during cache indexing
3. Pipelining writes for fast write hits

Simple and small caches: design simplicity limits control-logic complexity and allows shorter clock cycles, and on-chip integration decreases signal propagation delay, further reducing hit time. The Alpha 21164 uses an 8KB instruction cache and an 8KB data cache backed by a 96KB on-chip second-level cache to keep the clock cycle fast.
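To illustrate why hit time dominates, here is a worked example with assumed values (not from the slides): a 98% hit rate, a 2-cycle hit time, and a 40-cycle miss time.

$$\text{Average access time} = 2 \times 0.98 + 40 \times 0.02 = 1.96 + 0.8 = 2.76 \text{ cycles}$$

Shaving one cycle off the hit time saves 0.98 cycles per access, more than the 0.8 cycles contributed by all misses combined under these assumptions.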