COSC 6385 Computer Architecture - Memory Hierarchies (II)


COSC 6385 Computer Architecture - Memory Hierarchies (II), Fall 2008

Cache Performance
Avg. memory access time = Hit time + Miss rate x Miss penalty (a small numeric example follows below)
with
- Hit time: time to access a data item that is available in the cache
- Miss rate: ratio of the number of memory accesses that miss in the cache to the total number of memory accesses
- Miss penalty: time (in cycles) required to bring a data item into the cache
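A quick way to see how the formula behaves is to plug in numbers. The sketch below is a minimal illustration; the hit time, miss rate, and miss penalty values are made up for the example, not taken from the slides.

    #include <stdio.h>

    /* Average memory access time: hit time + miss rate * miss penalty */
    static double amat(double hit_time, double miss_rate, double miss_penalty)
    {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void)
    {
        /* Assumed example values: 1-cycle hit, 5% miss rate, 100-cycle penalty */
        printf("AMAT = %.2f cycles\n", amat(1.0, 0.05, 100.0));   /* prints 6.00 cycles */
        return 0;
    }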

Processor Performance
CPU equation:
CPU time = (CPU execution clock cycles + Memory stall clock cycles) x Clock cycle time
Can avg. memory access time really be mapped to CPU time?
- Not all memory stall cycles are due to cache misses (we ignore that on the following slides)
- Depends on the processor architecture: in-order vs. out-of-order execution
- For out-of-order processors only the visible portion of the miss penalty counts:
Memory stall cycles per instruction = Misses per instruction x (Total miss latency - Overlapped miss latency)
(a small worked sketch of these equations follows below, after the list of techniques)

Reducing cache miss penalty
Five techniques:
1. Multilevel caches
2. Critical word first and early restart
3. Giving priority to read misses over writes
4. Merging write buffers
5. Victim caches
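Returning to the CPU time equation above, a minimal sketch with assumed numbers (instruction count, CPI, miss statistics, and clock period are all illustrative, not from the slides):

    #include <stdio.h>

    int main(void)
    {
        /* Assumed example values */
        double instructions     = 1e9;    /* instruction count (IC) */
        double cpi_execution    = 1.0;    /* CPI without memory stalls */
        double misses_per_instr = 0.02;   /* misses per instruction */
        double total_miss_lat   = 100.0;  /* cycles */
        double overlapped_lat   = 30.0;   /* cycles hidden by out-of-order execution */
        double clock_cycle_time = 0.5e-9; /* seconds (2 GHz clock) */

        double stall_per_instr = misses_per_instr * (total_miss_lat - overlapped_lat);
        double cpu_time = instructions * (cpi_execution + stall_per_instr) * clock_cycle_time;

        printf("Memory stall cycles per instruction: %.2f\n", stall_per_instr);
        printf("CPU time: %.3f s\n", cpu_time);
        return 0;
    }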

Multilevel caches (I)
Dilemma: should the cache be fast or should it be large?
Compromise: multi-level caches
- 1st level: small, but at the speed of the CPU
- 2nd level: larger, but slower
Avg. memory access time = Hit time L1 + Miss rate L1 x Miss penalty L1
with
Miss penalty L1 = Hit time L2 + Miss rate L2 x Miss penalty L2

Multilevel caches (II)
Local miss rate: ratio of the number of misses in a cache to the total number of accesses to that cache
Global miss rate: ratio of the number of misses in a cache to the total number of memory accesses generated by the CPU
- 1st level cache: global miss rate = local miss rate
- 2nd level cache: global miss rate = Miss rate L1 x Miss rate L2
Design decisions for the 2nd level cache:
1. Direct mapped or n-way set associative?
2. Size of the 2nd level cache?
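Combining the two formulas above, a minimal sketch with assumed numbers (the hit times, miss rates, and memory access time below are illustrative, not from the slides):

    #include <stdio.h>

    int main(void)
    {
        /* Assumed example values */
        double hit_l1 = 1.0,  miss_rate_l1 = 0.05;   /* L1: 1 cycle, 5% local miss rate */
        double hit_l2 = 10.0, miss_rate_l2 = 0.25;   /* L2: 10 cycles, 25% local miss rate */
        double mem_penalty = 100.0;                  /* main memory access, cycles */

        double miss_penalty_l1 = hit_l2 + miss_rate_l2 * mem_penalty;
        double amat            = hit_l1 + miss_rate_l1 * miss_penalty_l1;
        double global_miss_l2  = miss_rate_l1 * miss_rate_l2;

        printf("Miss penalty L1     = %.2f cycles\n", miss_penalty_l1);  /* 35.00 */
        printf("AMAT                = %.2f cycles\n", amat);             /* 2.75  */
        printf("Global L2 miss rate = %.4f\n", global_miss_l2);          /* 0.0125 */
        return 0;
    }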

Multilevel caches (III)
Assumptions used to decide question 1:
- Hit time of the L2 cache: direct mapped: 10 clock cycles; 2-way set associative: 10.1 clock cycles
- Local miss rate of the L2 cache: direct mapped: 25%; 2-way set associative: 20%
- Miss penalty of the L2 cache: 100 clock cycles
Miss penalty L1 with a direct mapped L2 = 10 + 0.25 x 100 = 35 clock cycles
Miss penalty L1 with a 2-way assoc. L2 = 10.1 + 0.2 x 100 = 30.1 clock cycles

Multilevel caches (IV)
Multilevel inclusion: the 2nd level cache includes all data items that are in the 1st level cache
- Applied if the size of the 2nd level cache >> size of the 1st level cache
Multilevel exclusion: data in the L1 cache is never in the L2 cache
- Applied if the 2nd level cache is only slightly bigger than the 1st level cache
- A cache miss in L1 often leads to a swap of an L1 block with an L2 block

Critical word first and early restart
On a cache miss, an entire cache block has to be loaded from memory.
Idea: don't wait until the entire cache block has been loaded; focus on the required data item.
Critical word first:
- Ask memory for the required data item first
- Forward the data item to the processor
- Fill up the rest of the cache block afterwards
Early restart:
- Fetch the words of a cache block in normal order
- Forward the requested data item to the processor as soon as it is available
- Fill up the rest of the cache block afterwards

Giving priority to read misses over writes
Write-through caches use a write buffer to speed up write operations.
The write buffer might contain a value required by a subsequent load operation.
Two possibilities for ensuring consistency (a sketch of option 2 follows below):
1. A read resulting in a cache miss has to wait until the write buffer is empty
2. Check the contents of the write buffer and take the data item from the write buffer if it is available
A similar technique is used for cache-line replacement in n-way set associative caches.
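A minimal sketch of option 2, checking the write buffer on a read miss; the write-buffer layout, size, and field names are assumptions for illustration, not a real controller design.

    #include <stdint.h>
    #include <stdbool.h>

    #define WB_ENTRIES 4

    /* Hypothetical write-buffer entry: a pending store waiting to reach memory.
       Entries are assumed to be ordered oldest (index 0) to newest. */
    struct wb_entry {
        bool     valid;
        uint64_t addr;
        uint64_t data;
    };

    static struct wb_entry write_buffer[WB_ENTRIES];

    /* On a read miss, search the write buffer before going to memory.
       Returns true and the value if a matching pending store is found. */
    static bool read_miss_check_write_buffer(uint64_t addr, uint64_t *value)
    {
        for (int i = WB_ENTRIES - 1; i >= 0; i--) {   /* newest matching entry wins */
            if (write_buffer[i].valid && write_buffer[i].addr == addr) {
                *value = write_buffer[i].data;
                return true;     /* serve the load from the write buffer */
            }
        }
        return false;            /* not in the buffer: fetch the block from memory */
    }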

Merging write buffers
Check whether multiple entries in the write buffer can be merged into a single entry.
Example (from the slide figure): four one-word writes to addresses 100, 108, 116, 124.
- Without merging: each write occupies its own buffer entry, with only one valid word per entry (Mem[100], Mem[108], Mem[116], Mem[124])
- With write merging: all four words are placed in a single buffer entry, leaving the other entries free

Victim caches (a lookup sketch follows below)
Question: how often is a cache block that has just been replaced by another block needed again soon afterwards?
Victim cache: a fully associative cache between the real cache and the memory that keeps blocks that have been discarded from the cache.
Typically very small.
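A minimal sketch of a victim cache lookup and insertion, assuming block-granular addresses and a handful of entries; the structure, sizes, and replacement scheme are illustrative, not from the slides.

    #include <stdint.h>
    #include <stdbool.h>

    #define VICTIM_ENTRIES 4            /* assumed: victim caches are typically tiny */

    struct victim_entry {
        bool     valid;
        uint64_t block_addr;            /* block-granular address of a discarded block */
    };

    static struct victim_entry victim[VICTIM_ENTRIES];

    /* On an L1 miss, search the (fully associative) victim cache before memory. */
    static bool victim_lookup(uint64_t block_addr)
    {
        for (int i = 0; i < VICTIM_ENTRIES; i++)
            if (victim[i].valid && victim[i].block_addr == block_addr)
                return true;            /* found: swap this block back into L1 */
        return false;                   /* not found: fetch the block from the next level */
    }

    /* When L1 discards a block, place it into the victim cache (simple rotation). */
    static void victim_insert(uint64_t block_addr)
    {
        static int next = 0;
        victim[next].valid = true;
        victim[next].block_addr = block_addr;
        next = (next + 1) % VICTIM_ENTRIES;
    }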

Reducing miss rate
Three categories of cache misses:
- Compulsory misses: the first access to a block can never hit in the cache (cold start misses)
- Capacity misses: the cache cannot contain all blocks required by the execution -> increase the cache size
- Conflict misses: a cache block has to be discarded because too many blocks map to the same set -> increase the cache size and/or the associativity

Reducing miss rate (II)
Five techniques to reduce the miss rate:
1. Larger cache block size
2. Larger caches
3. Higher associativity
4. Way prediction and pseudo-associative caches
5. Compiler optimizations

Larger block size
- A larger block size reduces compulsory misses (better exploitation of spatial locality)
- Assuming the cache size is constant, a larger block size also reduces the number of blocks in the cache -> increases conflict misses

Larger caches
- Reduce capacity misses
- Might increase the hit time (e.g. if implemented as off-chip caches)
- Cost limitations

Higher Associativity, Way Prediction and Pseudo-associative Caches
Higher associativity reduces conflict misses.
Way prediction:
- Add bits to an n-way set associative cache to predict which block of the set will be used next
- The predicted block is checked first on a data request
- If the prediction was wrong, the other entries are checked
- Speeds up the initial access of the cache when the prediction is correct
Pseudo-associative caches (a lookup sketch follows below):
- Implemented as a direct mapped cache with a backup location
- Only one location (the primary one) can hold the cache item on a fast hit
- If the item is not found there, a secondary location is also checked
- The secondary address is obtained by inverting one bit of the cache index
- Problem: avoid ending up with most data items sitting in their secondary location after a while
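A minimal sketch of a pseudo-associative lookup, assuming the secondary location is found by inverting the most significant index bit; the cache geometry and field names are illustrative assumptions (for simplicity, the full block address is stored as the tag).

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_SETS 256                 /* assumed direct-mapped cache with 256 blocks */

    struct line {
        bool     valid;
        uint64_t tag;                    /* full block address stored for simplicity */
    };

    static struct line cache[NUM_SETS];

    /* Returns 0 on a fast hit (primary location), 1 on a slow hit (secondary
       location), -1 on a miss. */
    static int pseudo_assoc_lookup(uint64_t addr)
    {
        uint64_t block = addr >> 6;                  /* assumed 64-byte blocks */
        uint64_t index = block & (NUM_SETS - 1);

        if (cache[index].valid && cache[index].tag == block)
            return 0;                                /* fast hit at the primary location */

        uint64_t alt = index ^ (NUM_SETS >> 1);      /* invert the most significant index bit */
        if (cache[alt].valid && cache[alt].tag == block)
            return 1;                                /* slow hit at the secondary location */

        return -1;                                   /* miss: fetch from the next level */
    }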

Compiler optimizations
Re-organize the source code such that caches are used more efficiently. Two examples:
Loop interchange (x is an M x N array, stored row-major in C):

    /* original code jumps through memory and cache */
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            x[i][j] = 2 * x[i][j];

    /* optimized code: the access order follows the row-major layout */
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            x[i][j] = 2 * x[i][j];

- Blocked matrix-matrix multiply (see introductory lecture 1; a sketch follows below)

Reducing cache miss penalty/rate via parallelism
- Non-blocking caches
- Hardware prefetch of instructions and data
- Compiler-controlled prefetching
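For reference, a minimal sketch of the blocked (tiled) matrix-matrix multiply mentioned above, assuming square N x N matrices with N a multiple of the tile size B; both values below are illustrative tuning parameters, not numbers from the lecture.

    #define N 512          /* assumed matrix dimension */
    #define B 32           /* assumed tile size, chosen so a tile fits in the cache */

    /* C = C + A * Bm, all N x N; the three inner loops work on one B x B tile,
       so the tile of each operand stays cache resident while it is reused.
       Assumes N is a multiple of B. */
    void blocked_matmul(double A[N][N], double Bm[N][N], double C[N][N])
    {
        for (int ii = 0; ii < N; ii += B)
            for (int jj = 0; jj < N; jj += B)
                for (int kk = 0; kk < N; kk += B)
                    for (int i = ii; i < ii + B; i++)
                        for (int j = jj; j < jj + B; j++) {
                            double sum = C[i][j];
                            for (int k = kk; k < kk + B; k++)
                                sum += A[i][k] * Bm[k][j];
                            C[i][j] = sum;
                        }
    }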

Non-blocking caches
- Allow the data cache to continue to supply cache hits during a cache miss ("hit under miss")
- Reduces the effective miss penalty
- Further optimization: allow several outstanding misses to overlap ("miss under miss")
- Only useful if the memory system can service multiple misses concurrently
- Increases the complexity of the cache controller

Hardware prefetching (a toy sketch follows below)
- Prefetch items before they are requested by the processor
- E.g. the processor fetches two blocks on an instruction miss, assuming that the block following the missing one will also be requested
- The requested block is placed in the instruction cache
- The prefetched block is placed in an instruction stream buffer
- On a request to the prefetched block, the original cache request is cancelled by the hardware and the block is read from the stream buffer
- Downside: memory bandwidth is wasted if the prefetched block is never needed
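A toy sketch of one-block-ahead instruction prefetching with a single-entry stream buffer; the block-granular addressing, cache size, and stream-buffer layout are assumptions for illustration only.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define ICACHE_BLOCKS 64

    static uint64_t icache[ICACHE_BLOCKS];                  /* stores block numbers, 0 = empty */
    static struct { bool valid; uint64_t block; } sb;       /* instruction stream buffer */

    static bool in_icache(uint64_t b)  { return icache[b % ICACHE_BLOCKS] == b; }
    static void fill_icache(uint64_t b){ icache[b % ICACHE_BLOCKS] = b; }

    /* Instruction fetch: on a miss, check the stream buffer, then prefetch ahead. */
    static void ifetch(uint64_t block)
    {
        if (in_icache(block)) return;                        /* ordinary cache hit */

        if (sb.valid && sb.block == block) {
            fill_icache(block);                              /* serve the miss from the stream buffer */
            printf("block %llu: stream buffer hit\n", (unsigned long long)block);
        } else {
            fill_icache(block);                              /* fetch the block from memory */
            printf("block %llu: fetched from memory\n", (unsigned long long)block);
        }
        sb.valid = true;                                     /* prefetch the next sequential block */
        sb.block = block + 1;
    }

    int main(void)
    {
        for (uint64_t b = 100; b < 104; b++)                 /* blocks 101..103 hit the stream buffer */
            ifetch(b);
        return 0;
    }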

Compiler-controlled prefetch
- Register prefetch: load the value into a register
- Cache prefetch: load the data only into the cache
- Faulting vs. non-faulting prefetches: a non-faulting prefetch turns into a no-op in case of a virtual address fault/protection violation

Compiler-controlled prefetch (II)
Example:

    for (i = 0; i < 3; i++) {
        for (j = 0; j < 100; j++) {
            a[i][j] = b[j][0] * b[j+1][0];
        }
    }

Assumptions:
- Each element of a and b is 8 bytes (double precision floating point)
- 8 KB direct mapped cache with 16-byte blocks
- Each block can hold 2 consecutive elements of either a or b

Compiler-controlled prefetching (III)
Number of cache misses for a:
- Since a block contains a[i][j] and a[i][j+1], every even value of j leads to a cache miss and every odd value of j is a cache hit
- 3 x 100/2 = 150 cache misses because of a
Number of cache misses because of b:
- b does not benefit from spatial locality (the accesses walk down column 0)
- Each element of b is used 3 times (for i = 0, 1, 2)
- Ignoring cache conflicts, this leads to a cache miss for b[0][0] for i=0 and a cache miss for every b[j+1][0] when i=0
- 1 + 100 = 101 cache misses

Compiler-controlled prefetching (IV)
- Assuming that a cache miss takes 7 clock cycles, the prefetch has to be issued 7 iterations ahead
- Assuming non-faulting prefetches
- This does not prevent the cache misses of the first 7 iterations

    for (j = 0; j < 100; j++) {            /* first loop: i = 0, prefetch both a and b */
        prefetch(b[j+7][0]);
        prefetch(a[0][j+7]);
        a[0][j] = b[j][0] * b[j+1][0];
    }
    for (i = 1; i < 3; i++) {              /* remaining iterations: b is already cached */
        for (j = 0; j < 100; j++) {
            prefetch(a[i][j+7]);
            a[i][j] = b[j][0] * b[j+1][0];
        }
    }

Compiler-controlled prefetching (V)
New number of cache misses:
- 7 misses for b[0][0], b[1][0], ..., b[6][0] in the first loop
- 7/2 = 4 misses (rounded up) for a[0][0], a[0][2], ..., a[0][6] in the first loop
- 7/2 = 4 misses (rounded up) for a[1][0], a[1][2], ..., a[1][6] in the second loop
- 7/2 = 4 misses (rounded up) for a[2][0], a[2][2], ..., a[2][6] in the second loop
- Total number of misses: 19, reducing the number of cache misses by 232 (from 251 to 19)!
A write hint could tell the cache: "I need the elements of a in the cache only for speeding up the writes. Since I do not read the elements, don't bother to actually load the data into the cache; it will be overwritten anyway!"

Reducing hit time
Four techniques:
1. Small and simple caches
2. Avoiding address translation
3. Pipelined cache access
4. Trace caches

Small and simple caches
Cache hit time depends on:
- The comparison of the tag with the address
- The number of comparisons that have to be executed
- The clock cycle of the cache (on-chip vs. off-chip caches)
Smaller and simpler caches are faster by nature.

Avoiding address translation
Translation of the virtual address to a physical address is required to access memory/caches.
Can a cache use virtual addresses? In practice: no.
- Page protection has to be enforced, even if the cache uses virtual addresses
- The cache would have to be flushed on every process switch: two processes might use the same virtual address without meaning the same physical address
- Adding a process identifier tag (PID) to the cache address tag would solve part of this problem
- Operating systems might use different virtual addresses for the same physical address (aliases), which could lead to duplicate copies of the same data in the cache

Pipelined cache access and trace caches
Pipelined cache access:
- Pipeline the first-level cache access, spreading the hit latency over several stages so the cache can accept one access per (fast) clock cycle
Trace caches:
- Find a dynamic sequence of instructions to load into an instruction cache block
- The cache stores dynamic traces of the instructions executed by the CPU and is not bound to the spatial locality of the instructions in memory
- Works well across taken branches
- Used e.g. by the Pentium 4

Memory organization
Example: cost of a cache miss
- 4 clock cycles to send the address to main memory
- 56 clock cycles to access a word
- 4 clock cycles to send a word of data
Assuming a cache block consists of 4 words, loading a cache block takes 4 x (4 + 56 + 4) = 256 cycles.

Improving memory bandwidth
Wider main memory bus:
- E.g. for the previous example: assuming the memory is 2 words wide, the cost of a cache miss goes down to 2 x (4 + 56 + 4) = 128 cycles
Interleaved main memory:
- Takes advantage of the fact that reading/writing multiple words uses multiple chips (= banks)
- Banks are typically one word wide
- Memory interleaving tries to utilize all chips in the system
- E.g. for the previous example: the cost of a cache miss is 4 + 56 + (4 x 4) = 76 cycles
Independent memory banks:
- Allow independent access to different memory chips/banks
- Multiple memory controllers allow the banks to operate independently
- Might require separate data buses and address lines
- Usually required by non-blocking caches
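Putting the three organizations side by side, a minimal sketch using the cycle counts assumed in the example above:

    #include <stdio.h>

    int main(void)
    {
        int send_addr = 4, access = 56, send_word = 4;   /* cycles, from the example */
        int words_per_block = 4;

        /* one-word-wide memory: every word pays the full round trip */
        int narrow = words_per_block * (send_addr + access + send_word);

        /* two-word-wide memory: two transfers per 4-word block */
        int wide2 = (words_per_block / 2) * (send_addr + access + send_word);

        /* four-way interleaved memory: one access latency, then one transfer per word */
        int interleaved = send_addr + access + words_per_block * send_word;

        printf("1-word-wide bus  : %d cycles\n", narrow);       /* 256 */
        printf("2-word-wide bus  : %d cycles\n", wide2);        /* 128 */
        printf("4-way interleaved: %d cycles\n", interleaved);  /*  76 */
        return 0;
    }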