
COSC 6385 Computer Architecture - Memory Hierarchy Design (III) Fall 2006

Reducing cache miss penalty
Five techniques:
- Multilevel caches
- Critical word first and early restart
- Giving priority to read misses over writes
- Merging write buffer
- Victim caches

Reducing miss rate
Five techniques to reduce the miss rate:
- Larger cache block size
- Larger caches
- Higher associativity
- Way prediction and pseudo-associative caches
- Compiler optimizations

Reducing cache miss penalty/rate via parallelism
- Non-blocking caches
- Hardware prefetching of instructions and data
- Compiler-controlled prefetching

Non-blocking caches
- Allow the data cache to continue to supply cache hits during a cache miss
- Reduces the effective miss penalty
- Further optimization: allow overlap of multiple misses
  - Only useful if the memory system can service multiple misses
- Increases the complexity of the cache controller
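To make the "effective miss penalty" point concrete, here is a minimal back-of-the-envelope sketch in C using the standard formula AMAT = hit time + miss rate x miss penalty. The cycle counts and the 50% overlap fraction are illustrative assumptions, not measured values:

    #include <stdio.h>

    /* AMAT = hit time + miss rate * miss penalty (all in cycles).
       With a non-blocking cache, part of the penalty overlaps with
       useful work; the overlap fraction below is assumed. */
    int main(void) {
        double hit_time = 1.0, miss_rate = 0.05, miss_penalty = 100.0;
        double overlap = 0.5;   /* assumed fraction of penalty hidden */
        double blocking     = hit_time + miss_rate * miss_penalty;
        double non_blocking = hit_time + miss_rate * miss_penalty * (1.0 - overlap);
        printf("AMAT blocking cache:     %.2f cycles\n", blocking);      /* 6.00 */
        printf("AMAT non-blocking cache: %.2f cycles\n", non_blocking);  /* 3.50 */
        return 0;
    }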

Hardware prefetching
- Prefetch items before they are requested by the processor
- E.g. the processor prefetches two blocks on a miss, assuming that the block following the missing data item will also be requested
  - The requested block is placed in the instruction cache
  - The prefetched block is placed in an instruction stream buffer
  - On a request to the prefetched block, the original cache request is cancelled by the hardware and the block is read from the stream buffer
- Downside: wastes memory bandwidth if the block is never required
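The cancel-and-redirect behavior is easiest to see in code. Below is a toy C model of a hypothetical one-entry stream buffer; the block numbers, the miss sequence, and the single-entry capacity are illustrative assumptions, not a description of real hardware:

    #include <stdio.h>
    #include <stdbool.h>

    /* One-entry instruction stream buffer: on a miss to block k, the
       cache receives block k and the buffer prefetches block k+1. */
    typedef struct { long block; bool valid; } StreamBuf;
    static StreamBuf sb = { 0, false };

    /* Serve a cache miss; returns where the block came from. */
    static const char *serve_miss(long block) {
        const char *src = "memory";
        if (sb.valid && sb.block == block)
            src = "stream buffer";      /* cancel the memory request */
        sb.block = block + 1;           /* prefetch the next block */
        sb.valid = true;
        return src;
    }

    int main(void) {
        long seq[] = { 10, 11, 12, 40 };  /* sequential run, then a jump */
        for (int i = 0; i < 4; i++)
            printf("miss on block %ld -> served from %s\n",
                   seq[i], serve_miss(seq[i]));
        return 0;
    }

Blocks 11 and 12 are served from the stream buffer; the jump to block 40 goes back to memory, and the prefetched block 13 is never used, which is exactly the wasted-bandwidth case from the last bullet.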

Compiler-controlled prefetch
- Register prefetch: load the value into a register
- Cache prefetch: load the data into the cache only
- Faulting vs. non-faulting prefetches: a non-faulting prefetch turns into a no-op in case of a virtual address fault/protection violation
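GCC and Clang expose a real cache-prefetch primitive, __builtin_prefetch(addr, rw, locality), which compiles to a non-faulting prefetch instruction on targets that have one. The sketch below contrasts it with a register prefetch; the function name, array, and prefetch distance of 8 are made up for illustration:

    /* Cache prefetch vs. register prefetch, assuming GCC/Clang. */
    double sum_with_prefetch(const double *b, int n) {
        double s = 0.0;
        for (int j = 0; j < n; j++) {
            /* cache prefetch: no register target, cannot fault */
            __builtin_prefetch(&b[j + 8], 0 /* read */, 1 /* low locality */);
            /* a register prefetch would be an ordinary early load,
               e.g. double t = b[j + 8]; it is a real load and can
               fault, so it may only be hoisted when provably safe */
            s += b[j];
        }
        return s;
    }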

Compiler-controlled prefetch (II)
Example:

    for (i=0; i<3; i++) {
      for (j=0; j<100; j++) {
        a[i][j] = b[j][0] * b[j+1][0];
      }
    }

Assumptions:
- Each element of a and b is 8 bytes (double-precision floating point)
- 8 KB direct-mapped cache with 16-byte blocks
- Each block can hold 2 consecutive values of either a or b

Compiler-controlled prefetching (III)
Number of cache misses for a:
- Since a block contains a[i][j] and a[i][j+1], every even value of j leads to a cache miss and every odd value of j is a cache hit
- 3 x 100/2 = 150 cache misses because of a
Number of cache misses because of b:
- b does not benefit from spatial locality: the accessed elements b[j][0] and b[j+1][0] do not fall into the same block
- Each element of b is used 3 times (for i = 0, 1, 2), but stays in the cache after the first pass, so only i=0 misses
- Ignoring cache conflicts, this leads to:
  - a cache miss for b[0][0] when i=0
  - a cache miss for every b[j+1][0] when i=0
  - 1 + 100 = 101 cache misses
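These counts can be reproduced with a small direct-mapped cache simulator. The memory layout (a at byte 0, b directly behind it), b's shape of 101 x 3 doubles (as in the textbook version of this example), and a write-allocate policy for the writes to a are all assumptions:

    #include <stdio.h>

    #define BLOCK 16
    #define SETS  512                   /* 8 KB / 16-byte blocks */

    static long tag[SETS];
    static int  valid[SETS];

    /* One access against a direct-mapped cache; counts misses. */
    static void touch(long addr, long *miss) {
        long blk = addr / BLOCK, set = blk % SETS, t = blk / SETS;
        if (!valid[set] || tag[set] != t) {
            (*miss)++; valid[set] = 1; tag[set] = t;
        }
    }

    int main(void) {
        long base_a = 0, base_b = 3 * 100 * 8;  /* assumed layout */
        long miss_a = 0, miss_b = 0;
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 100; j++) {
                touch(base_b + 8 * (3 * j),       &miss_b); /* read b[j][0]   */
                touch(base_b + 8 * (3 * (j + 1)), &miss_b); /* read b[j+1][0] */
                touch(base_a + 8 * (100 * i + j), &miss_a); /* write a[i][j]  */
            }
        printf("a: %ld  b: %ld  total: %ld\n",
               miss_a, miss_b, miss_a + miss_b);
        /* prints: a: 150  b: 101  total: 251 */
        return 0;
    }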

Compiler-controlled prefetching (IV)
- Assuming that a cache miss takes 7 clock cycles
- Assuming that we are dealing with non-faulting prefetches
- Does not prevent cache misses in the first 7 iterations

    for (j=0; j<100; j++) {
      prefetch(b[j+7][0]);
      prefetch(a[0][j+7]);
      a[0][j] = b[j][0] * b[j+1][0];
    }
    for (i=1; i<3; i++) {
      for (j=0; j<100; j++) {
        prefetch(a[i][j+7]);
        a[i][j] = b[j][0] * b[j+1][0];
      }
    }
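For reference, here is a compilable rendering of these loops, with GCC/Clang's __builtin_prefetch standing in for the slide's generic prefetch (second argument: 0 = read, 1 = write). The prefetches deliberately run up to 7 iterations past the last useful element; this is only acceptable because the prefetch instruction is non-faulting, per the assumption above:

    /* Restructured loops with software prefetching (GCC/Clang). */
    void prefetched_kernel(double a[3][100], double b[101][3]) {
        for (int j = 0; j < 100; j++) {
            __builtin_prefetch(&b[j + 7][0], 0, 1);  /* read prefetch  */
            __builtin_prefetch(&a[0][j + 7], 1, 1);  /* write prefetch */
            a[0][j] = b[j][0] * b[j + 1][0];
        }
        for (int i = 1; i < 3; i++)
            for (int j = 0; j < 100; j++) {
                __builtin_prefetch(&a[i][j + 7], 1, 1);
                a[i][j] = b[j][0] * b[j + 1][0];
            }
    }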

Compiler-controlled prefetching (V)
New number of cache misses:
- 7 misses for b[0][0], b[1][0], ... in the first loop
- ⌈7/2⌉ = 4 misses for a[0][0], a[0][2], ... in the first loop
- ⌈7/2⌉ = 4 misses for a[1][0], a[1][2], ... in the second loop
- ⌈7/2⌉ = 4 misses for a[2][0], a[2][2], ... in the second loop
- Total number of misses: 19, reducing the number of cache misses by 232 (from 251 down to 19)!
A write hint could tell the cache: "I need the elements of a in the cache to speed up the writes. But since I never read the elements, don't bother to actually load the data into the cache; it will be overwritten anyway!"
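x86 provides a real mechanism in this spirit: non-temporal stores, which write to memory without allocating the line in the cache. A minimal sketch using SSE2 intrinsics; the function and its preconditions (dst 16-byte aligned, n even) are illustrative:

    #include <emmintrin.h>   /* SSE2 intrinsics */

    /* Write one row of a without reading its cache lines first:
       _mm_stream_pd emits a non-temporal store (MOVNTPD), so the
       written lines are not allocated in the cache. */
    void store_row_nt(double *dst, const double *src, int n) {
        for (int j = 0; j < n; j += 2)
            _mm_stream_pd(&dst[j], _mm_loadu_pd(&src[j]));
        _mm_sfence();   /* order streaming stores before later reads */
    }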

Reducing hit time
- Small and simple caches
- Avoiding address translation
- Pipelined cache access
- Trace caches

Small and simple caches
Cache hit time depends on:
- Reading the tag and comparing it to the address
- The number of comparisons that have to be executed
- The clock cycle of the cache
- On-chip vs. off-chip caches
Smaller and simpler caches are faster by nature.

Avoiding address translation
- Translation of virtual addresses to physical addresses is required to access memory/caches
- Can a cache use virtual addresses? In practice: no
  - Page protection has to be enforced even if the cache used virtual addresses
  - The cache would have to be flushed on every process switch
    - Two processes might use the same virtual address without meaning the same physical address
    - Adding a process identifier tag (PID) to the cache address tag would solve part of this problem
  - Operating systems might use different virtual addresses for the same physical address (aliases)
    - This could lead to duplicate copies of the same data in the cache

Pipelined cache access and trace caches
- Pipelined cache access: reduce the latency of first-level cache access by pipelining
- Trace caches: find a dynamic sequence of instructions to load into an instruction cache block
  - The cache holds dynamic traces of the instructions executed by the CPU and is not bound to the spatial locality of the instructions in memory
  - Works well for taken branches
  - Used e.g. by the Pentium 4

Memory organization
How can you improve memory bandwidth?
- Wider main memory bus
- Interleaved main memory
  - Tries to take advantage of the fact that reading/writing multiple words uses multiple chips (= banks)
  - Banks are typically one word wide
  - Memory interleaving tries to utilize all chips in the system
- Independent memory banks
  - Allow independent access to different memory chips/banks
  - Multiple memory controllers allow banks to operate independently
  - Might require separate data buses and address lines
  - Usually required by non-blocking caches
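A quick sketch of the word-interleaved mapping: consecutive words land in consecutive banks, so a sequential access stream keeps every bank busy. The 4-bank, 8-byte-word configuration is an assumed example:

    #include <stdio.h>

    #define WORD  8   /* bytes per word (assumed) */
    #define BANKS 4   /* number of banks (assumed) */

    int main(void) {
        /* word-interleaved: bank = word number mod BANKS */
        for (long addr = 0; addr < 8 * WORD; addr += WORD) {
            long word = addr / WORD;
            printf("address %3ld -> bank %ld, row %ld\n",
                   addr, word % BANKS, word / BANKS);
        }
        return 0;
    }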