COSC4201 Chapter 4: Cache. Prof. Mokhtar Aboelaze, York University. Based on notes by Prof. L. Bhuyan (UCR) and Prof. M. Shaaban (RIT).

Memory Hierarchy

The gap between CPU performance and main memory speed has been widening; with higher-performance CPUs, memory-access instructions become the performance bottleneck. The memory hierarchy is organized into several levels, with the smaller, more expensive, and faster levels closer to the CPU: registers, then the primary cache level (L1), then additional secondary cache levels (L2, L3), then main memory, then mass storage (virtual memory). Each level of the hierarchy is a subset of the level below: data found in a level is also found in the level below, but at lower speed. Each level maps addresses from a larger physical memory onto a smaller, faster one. This organization works because of the principle of locality, both temporal and spatial: programs tend to reuse data and instructions that they have used recently, or those stored in their vicinity, which gives rise to the working set of a program.

Processor-Memory Gap

[Chart: processor vs. DRAM performance, 1980-2000. The processor-memory performance gap grows roughly 50% per year: microprocessor performance improves about 60%/yr while DRAM performance improves about 7%/yr.]

Cost of Cache

Processor         % Area (-cost)   % Transistors (-power)
Alpha 21164       37%              77%
StrongArm SA110   61%              94%
Pentium Pro       64%              88%    (2 dies per package: Proc/I$/D$ + L2$)

Caches have no inherent value; they only try to close the performance gap. Smaller is faster.

Principle of Locality

Programs usually spend 90% of their time in 10% of the code (the 90/10 rule). There are two types of locality:
Temporal locality: if an item is referenced, it will tend to be referenced again soon.
Spatial locality: if an item is referenced, items whose addresses are close to it will tend to be referenced soon.

Levels of Memory Hierarchy

[Diagram: the levels of the memory hierarchy, from registers down to mass storage.]
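A minimal C sketch (not from the slides) of the two kinds of locality defined above: the running total sum is touched on every iteration (temporal locality), while the array a is walked through consecutive addresses (spatial locality).

    #include <stdio.h>

    #define N 1024

    int main(void) {
        static int a[N];            /* consecutive elements share cache blocks     */
        int sum = 0;                /* reused every iteration: temporal locality   */
        for (int i = 0; i < N; i++)
            sum += a[i];            /* sequential addresses of 'a': spatial locality */
        printf("sum = %d\n", sum);
        return 0;
    }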

Definitions

Block: the smallest piece of information transferred between two levels (line, page, ...).
Hit, hit rate.
Miss, miss rate.
Miss penalty.

4 Questions

Q1: Where can a block be placed in the upper level? (Block placement) Fully associative, set associative, direct mapped.
Q2: How is a block found if it is in the upper level? (Block identification) Tag/block.
Q3: Which block should be replaced on a miss? (Block replacement) Random, LRU.
Q4: What happens on a write? (Write strategy) Write back or write through (with a write buffer).

Cache Placement

Placement strategies, i.e., the mapping of a main memory block onto cache block frames, divide caches into three organizations:
1. Direct mapped cache: a block can be placed in one location only, given by (Block address) MOD (Number of blocks in cache).
2. Fully associative cache: a block can be placed anywhere in the cache.
3. Set associative cache: a block can be placed in a restricted set of places (cache block frames). A set is a group of block frames in the cache. A block is first mapped onto a set and can then be placed anywhere within that set. The set is chosen by (Block address) MOD (Number of sets in cache). If there are n blocks in a set, the placement is called n-way set associative. (A small mapping sketch follows this slide.)

Direct Mapped Cache

[Diagram: a 1024-block direct-mapped cache, each block one word; the 32-bit address splits into a 20-bit tag, a 10-bit index, and a 2-bit byte offset; the cache can map up to 2^32 bytes of memory.]
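The mapping sketch referenced above: both MOD formulas applied to one hypothetical block address, using made-up sizes (1024 block frames, or 256 sets of 4 ways) purely for illustration.

    #include <stdio.h>

    int main(void) {
        unsigned block_addr = 123456;   /* hypothetical block address                 */
        unsigned num_blocks = 1024;     /* direct mapped: 1024 block frames (assumed) */
        unsigned num_sets   = 256;      /* 4-way set associative: 1024/4 sets         */

        unsigned frame = block_addr % num_blocks;  /* (Block address) MOD (Number of blocks in cache) */
        unsigned set   = block_addr % num_sets;    /* (Block address) MOD (Number of sets in cache)   */

        printf("direct mapped: frame %u\n", frame);
        printf("4-way set associative: set %u (any of its 4 ways)\n", set);
        return 0;
    }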

Direct Mapped Cache

[Diagram: a 4K-block direct-mapped cache, each block four words (128 bits); the address (bit positions shown) splits into a 16-bit tag field, a 12-bit index field, a 2-bit word select (block offset), and a 2-bit byte offset; a 4-to-1 mux selects the requested 32-bit word.]

Alpha AXP 21064 Data Cache

[Diagram of the Alpha AXP 21064 data cache.]

Set Associativity

[Diagram: an eight-block cache organized four ways: one-way set associative (direct mapped, eight sets of one block), two-way set associative (four sets of two blocks), four-way set associative (two sets of four blocks), and eight-way set associative (fully associative, one set of eight blocks); each block frame holds a tag and data.]

Set Associative Cache

Each block frame in the cache has an address tag. The tags of every cache block that might contain the required data are checked in parallel. A valid bit is added to the tag to indicate whether the entry contains a valid address. The address from the CPU to the cache is divided into:
A block address, further divided into:
  An index field to choose a block set in the cache (there is no index field when fully associative).
  A tag field to search and match addresses in the selected set.
A block offset to select the data from the block.

Set Associative Cache

Physical address generated by the CPU: | Tag | Index | Block offset |
Block offset size = log2(block size)
Index size = log2(total number of blocks / associativity)
Tag size = address size - index size - offset size
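A quick sketch of the three field-size formulas above, with an assumed 32-bit address, 32-byte blocks, 256 total blocks, and 4-way associativity (all of these sizes are illustrative, not from the slides).

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        int addr_bits  = 32;   /* physical address size (assumed) */
        int block_size = 32;   /* bytes per block (assumed)       */
        int num_blocks = 256;  /* total block frames (assumed)    */
        int assoc      = 4;    /* 4-way set associative (assumed) */

        int offset_bits = (int)log2(block_size);           /* log2(block size)                   */
        int index_bits  = (int)log2(num_blocks / assoc);   /* log2(total blocks / associativity) */
        int tag_bits    = addr_bits - index_bits - offset_bits;

        printf("offset = %d bits, index = %d bits, tag = %d bits\n",
               offset_bits, index_bits, tag_bits);          /* 5, 6, 21 */
        return 0;
    }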

4-way Set Associative Cache

[Diagram: a 4-way set associative cache with 256 sets; the address splits into a 22-bit tag field, an 8-bit index field, and the byte offset; the four tags of the indexed set are compared in parallel and a 4-to-1 multiplexor selects the data of the hitting way.]

Example

Which has a lower miss rate: a 16KB instruction cache plus a 16KB data cache, or a combined 32KB cache? (Miss rates: 0.64% instruction, 6.47% data, 1.99% combined.) Assume a hit takes 1 cycle, a miss costs 50 cycles, and 75% of memory references are instruction fetches.
Miss rate of the split cache = 0.75*0.64% + 0.25*6.47% = 2.1%, slightly worse than the 1.99% of the combined cache.
But what about average memory access time? (Re-derived in the short sketch after this slide.)
Split cache: 0.75*(1 + 0.64%*50) + 0.25*(1 + 6.47%*50) = 2.05 cycles.
Combined cache (extra cycle for load/store, since data accesses contend with instruction fetches for the single port): 0.75*(1 + 1.99%*50) + 0.25*(1 + 1 + 1.99%*50) = 2.24 cycles.

Example

Miss penalty = 50 cycles, 2 cycles per instruction for execution, 1.33 memory references per instruction, and a miss rate of 2%:
CPU time = IC * (2.0 + 1.33*2%*50) * Tclock = IC * 3.33 * Tclock
Without a cache it would be IC * (2 + 1.33*50) * Tclock = IC * 68.5 * Tclock
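The sketch referenced above simply plugs the first example's own numbers into the average-memory-access-time formula; nothing new is assumed.

    #include <stdio.h>

    int main(void) {
        double hit = 1.0, penalty = 50.0;
        double f_inst = 0.75, f_data = 0.25;     /* fraction of references            */
        double mr_i = 0.0064, mr_d = 0.0647;     /* split 16KB I-cache / 16KB D-cache */
        double mr_u = 0.0199;                    /* combined 32KB cache               */

        double amat_split    = f_inst * (hit + mr_i * penalty)
                             + f_data * (hit + mr_d * penalty);
        double amat_combined = f_inst * (hit + mr_u * penalty)
                             + f_data * (hit + 1.0 + mr_u * penalty);  /* +1 cycle for the port conflict */

        printf("split: %.3f cycles, combined: %.3f cycles\n",
               amat_split, amat_combined);       /* 2.050 and 2.245 (slide rounds to 2.24) */
        return 0;
    }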

Example

Compare two different organizations. CPI with a perfect cache = 2.0, 1 ns clock cycle, 1.5 memory references per instruction.
Direct mapped: miss rate = 1.4%.
2-way set associative: miss rate = 1.0%, but we have to stretch the clock cycle by 25% (the MUX is on the critical path).
In both cases the miss penalty is 75 ns. (These numbers are re-derived in the sketch after this slide.)
Direct mapped memory access time = 1 + 0.014*75 = 2.05 ns.
2-way memory access time = 1*1.25 + 0.010*75 = 2.00 ns.
CPU time = IC * (CPI_exec + CPI_misses)
         = IC * (CPI_exec + Misses/Inst * Miss penalty) * Tc
         = IC * [ (CPI_exec * Tc) + (Mem/Inst * Miss rate * Miss penalty) ]
Direct mapped: IC * (2*1 + 1.5*0.014*75) = 3.58 IC
2-way:         IC * (2*1.25 + 1.5*0.010*75) = 3.63 IC

Improving Cache Performance

1. Reducing the miss rate
2. Reducing the miss penalty
3. Reducing the time to hit
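The sketch below re-derives the direct-mapped vs. 2-way comparison from the example above using only the slide's inputs (2.0 CPI, 1 ns cycle, 1.5 references per instruction, 75 ns miss penalty).

    #include <stdio.h>

    int main(void) {
        double cpi_exec = 2.0, tc = 1.0;   /* base CPI and clock cycle (ns) */
        double mem_per_inst = 1.5;
        double miss_penalty = 75.0;        /* ns */

        /* CPU time per instruction = CPI_exec*Tc + Mem/Inst * miss rate * miss penalty */
        double direct = cpi_exec * tc        + mem_per_inst * 0.014 * miss_penalty;
        double twoway = cpi_exec * tc * 1.25 + mem_per_inst * 0.010 * miss_penalty;

        printf("direct mapped: %.3f ns per instruction\n", direct);  /* 3.575, slide rounds to 3.58 */
        printf("2-way:         %.3f ns per instruction\n", twoway);  /* 3.625, slide rounds to 3.63 */
        return 0;
    }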

Reducing Cache Misses

Classifying Misses: the 3 Cs
Compulsory: the first access to a block cannot be in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses. (These occur even in an infinite cache.)
Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses occur because blocks are discarded and later retrieved. (Misses in a fully associative cache of size X.)
Conflict: if the block placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) occur because a block can be discarded and later retrieved when too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way set associative cache of size X.)

Classifying Misses

[Chart: miss rate vs. cache size (1KB to 128KB), broken into compulsory, capacity, and conflict components for 1-way, 2-way, 4-way, and 8-way set associative caches; the conflict component shrinks as associativity increases.]
Misses in a 1-way set associative cache of size X ≈ misses in a 2-way set associative cache of size X/2.

Reducing Cache Misses: Larger Block Size

A larger block size reduces misses up to a point, then starts to increase them: a larger block takes advantage of spatial locality, but it decreases the number of distinct blocks in the cache and increases the miss penalty, since it takes longer to load a block from memory into the cache.

[Chart: miss rate vs. block size (16 to 256 bytes) for cache sizes of 1K, 4K, 16K, 64K, and 256K; miss rate first falls and then rises as the block grows.]

Reducing Cache Misses: Larger Block Size

From the last graph, suppose memory takes 40 cycles of overhead and then delivers 16 bytes every 2 cycles. Compare 64-byte and 128-byte blocks for a 64K cache (miss rates 1.06% and 1.02%):
64-byte block: access time = 1 + 1.06%*(40 + 2*4) = 1 + 1.06%*48 = 1.5088
128-byte block: access time = 1 + 1.02%*(40 + 2*8) = 1 + 1.02%*56 = 1.5712
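A small sketch of the calculation above, with the miss penalty modeled as 40 cycles of overhead plus 2 cycles per 16 bytes transferred; the miss rates are the slide's figures for a 64K cache.

    #include <stdio.h>

    /* miss penalty: 40-cycle overhead + 2 cycles per 16 bytes transferred */
    static double miss_penalty(int block_bytes) {
        return 40.0 + 2.0 * (block_bytes / 16);
    }

    int main(void) {
        double amat_64  = 1.0 + 0.0106 * miss_penalty(64);    /* 64K cache, 64-byte blocks  */
        double amat_128 = 1.0 + 0.0102 * miss_penalty(128);   /* 64K cache, 128-byte blocks */

        printf("64-byte blocks:  %.4f cycles\n", amat_64);    /* 1.5088 */
        printf("128-byte blocks: %.4f cycles\n", amat_128);   /* 1.5712 */
        return 0;
    }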

Reducing Cache Misses: Higher Associativity

In practice, an 8-way set associative cache is about as good as a fully associative one.
The 2:1 cache rule of thumb states that a direct-mapped cache of size N has about the same miss rate as a 2-way set associative cache of size N/2.
One problem with associative caches is that we need to lengthen the clock cycle (at a minimum, we need a MUX to choose a block within the set). In practice, going from direct mapped to 2-way set associative costs roughly a 10% increase in clock time for TTL or ECL, and about 2% for custom CMOS.

Reducing Cache Misses: Victim Caches

Add a small fully associative cache between the cache and memory. This victim cache contains only blocks that were discarded from the original cache. The victim cache is checked on a miss; if the data is found there, it is swapped with the data in the cache. A small victim cache of 1-5 blocks is sufficient (a 4-block victim cache removed 20% to 95% of conflict misses).

Reducing Cache Misses: Pseudo-Associative Cache

This cache behaves like a direct-mapped cache, but on a miss, before going to main memory, it checks another cache entry. There are therefore two hit times, one fast and one slow.
Example: compare direct-mapped, 2-way set associative, and pseudo-associative caches (2 extra cycles for a pseudo hit):
Direct mapped: 1 + 9.8%*50 = 5.9
2-way: 1.1 + 7.6%*50 = 4.9 (10% increase in Tc)
Pseudo-associative: 1 + (9.8% - 7.6%)*2 + 7.6%*50 = 4.844

Reducing Cache Misses: Prefetching

Prefetching can be done by the compiler (inserting prefetch instructions in the code) or by the hardware. Prefetching can go directly into the cache or into a buffer that can be accessed much faster than main memory.
Access time = hit time + miss rate * prefetch hit rate * 1 + miss rate * (1 - prefetch hit rate) * miss penalty
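A tiny sketch of the prefetch access-time formula above. The miss rate, prefetch hit rate, and miss penalty are made-up numbers, chosen only to show how a prefetch buffer turns part of the miss penalty into a 1-cycle buffer hit.

    #include <stdio.h>

    int main(void) {
        double hit_time     = 1.0;
        double miss_rate    = 0.05;   /* assumed */
        double prefetch_hit = 0.60;   /* fraction of misses found in the prefetch buffer (assumed) */
        double penalty      = 50.0;   /* cycles (assumed) */

        /* access time = hit time + miss rate*prefetch hit rate*1
                                  + miss rate*(1 - prefetch hit rate)*miss penalty */
        double t = hit_time
                 + miss_rate * prefetch_hit * 1.0
                 + miss_rate * (1.0 - prefetch_hit) * penalty;

        printf("average access time = %.2f cycles\n", t);   /* 2.03, vs 1 + 0.05*50 = 3.5 without prefetching */
        return 0;
    }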

Reducing Cache Misses: Compiler Optimization

By rearranging data we can reduce cache misses.
Merging arrays: instead of

    int A[N];
    int B[N];

use

    struct merge {
        int a;
        int b;
    };
    struct merge merged_array[N];

This works if we reference the same indices of both arrays at the same time.
Loop interchange: reference arrays by row instead of by column if the array is stored in row-major order (a sketch follows this slide).

Reducing Cache Misses: Compiler Optimization (Loop Fusion)

Before:

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[i][j] = 1/b[i][j] * c[i][j];
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            d[i][j] = a[i][j] * c[i][j];

After:

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++) {
            a[i][j] = 1/b[i][j] * c[i][j];
            d[i][j] = a[i][j] * c[i][j];
        }
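The loop interchange mentioned above, as a minimal before/after fragment in the same style as the fusion example; x is assumed to be a row-major two-dimensional array and the loop bounds are illustrative.

    /* Before: the inner loop strides down a column, touching a new row
       (and likely a new cache block) on every iteration */
    for (j = 0; j < 100; j++)
        for (i = 0; i < 5000; i++)
            x[i][j] = 2 * x[i][j];

    /* After: the inner loop walks consecutive elements of a row, so each
       cache block brought in is fully used (spatial locality) */
    for (i = 0; i < 5000; i++)
        for (j = 0; j < 100; j++)
            x[i][j] = 2 * x[i][j];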

Reducing Cache Misses: Compiler Optimization (Blocking)

Blocking reduces misses by exploiting temporal locality. Suppose we are multiplying two matrices, a row of the first against the columns of the second. We load the first column, then the second, and so on until the last; by then the first column has been evicted from the cache, so the next pass over it misses again. By blocking, we work on sub-matrices that fit in the cache, so we fully utilize the data we brought into the cache before we replace it (a blocked matrix-multiply sketch follows this slide).

Reducing Cache Miss Penalty: Giving Priority to Reads over Writes

With a write-through cache, don't wait for the write to complete: send the data to a write buffer and continue reading. This might cause problems (RAW hazards) if we want to read data that is still in the buffer and not yet written to memory. One way out is to wait until the buffer is empty before reading (wastes time); instead, we can check the write buffer on our way to memory (if we find the data there, we don't go to memory). The same situation can arise with a write-back cache.
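The blocked matrix-multiply sketch referenced above. The matrix size N and tile size B are assumptions for illustration; B is normally chosen so that roughly three B x B tiles fit in the cache at once.

    #include <stdio.h>

    #define N 256   /* matrix dimension (assumed) */
    #define B 32    /* tile size (assumed); N must be a multiple of B in this sketch */

    static double x[N][N], y[N][N], z[N][N];

    /* blocked matrix multiply: x += y * z, one B x B tile at a time */
    static void blocked_mm(void) {
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                for (int i = 0; i < N; i++)
                    for (int j = jj; j < jj + B; j++) {
                        double r = 0.0;
                        for (int k = kk; k < kk + B; k++)
                            r += y[i][k] * z[k][j];   /* reuse the cached y/z tiles before eviction */
                        x[i][j] += r;
                    }
    }

    int main(void) {
        blocked_mm();
        printf("x[0][0] = %f\n", x[0][0]);
        return 0;
    }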

Reducing Cache Miss Penalty: Sub-block Placement

To decrease the area occupied by the tags, we want bigger blocks, but bigger blocks increase the miss penalty. One solution is to add a valid bit for each sub-block and to transfer data between memory and the cache at the sub-block level. Even when the tag matches, some parts of the block may be valid while other parts are not (see the struct sketch after this slide).

Reducing Cache Miss Penalty: Early Restart and Critical Word First

The CPU resumes as soon as the required word is in the cache. Even better, we can request the required word first. This reduces the miss penalty since we don't have to wait for the entire block to be written into the cache.
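A possible layout for a cache line with sub-block valid bits, as described above; the block and sub-block sizes are assumptions for illustration. A hit now requires both a tag match and the valid bit of the addressed sub-block.

    #include <stdbool.h>

    #define SUBBLOCKS      4    /* sub-blocks per block (assumed) */
    #define SUBBLOCK_BYTES 16   /* bytes per sub-block (assumed)  */

    struct cache_line {
        unsigned tag;                                   /* one tag for the whole 64-byte block */
        bool valid[SUBBLOCKS];                          /* one valid bit per sub-block         */
        unsigned char data[SUBBLOCKS][SUBBLOCK_BYTES];  /* transfers happen per sub-block      */
    };

    /* hit = tag match AND the addressed sub-block is valid */
    bool is_hit(const struct cache_line *line, unsigned tag, unsigned subblock) {
        return line->tag == tag && line->valid[subblock];
    }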

Reducing Cache Miss Penalty: Non-blocking Caches

The cache continues to supply data during the miss penalty (hit under miss). It can even allow hits under multiple outstanding misses.

Reducing Cache Miss Penalty: Second-Level Cache

In this case:
Average memory access time = Hit time L1 + Miss rate L1 * Miss penalty L1
Miss penalty L1 = Hit time L2 + Miss rate L2 * Miss penalty L2
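A short sketch of the two-level formula above; the L1/L2 hit times, miss rates, and L2 miss penalty are illustrative assumptions, not figures from the slides.

    #include <stdio.h>

    int main(void) {
        double hit_l1 = 1.0,  miss_rate_l1 = 0.04;   /* assumed                         */
        double hit_l2 = 10.0, miss_rate_l2 = 0.20;   /* L2 local miss rate (assumed)    */
        double penalty_l2 = 100.0;                   /* cycles to main memory (assumed) */

        double penalty_l1 = hit_l2 + miss_rate_l2 * penalty_l2;   /* = 30 cycles  */
        double amat       = hit_l1 + miss_rate_l1 * penalty_l1;   /* = 2.2 cycles */

        printf("L1 miss penalty = %.1f cycles, AMAT = %.1f cycles\n", penalty_l1, amat);
        return 0;
    }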

Reducing Hit Time

Small and simple caches.
Avoiding address translation: virtual caches (indexed with the virtual address). Every time the process is switched, the same virtual address refers to a different physical address, requiring the cache to be flushed. If a process identifier (PID) is attached to each line, only that process's lines need to be flushed.
Pipelined writes: we have to check the tag before writing. One technique is to pipeline the writes: check the tag in one cycle, then in the next cycle perform the write while checking the next access's tag.