Advanced Caching Techniques

Advanced Caching

Approaches to improving memory system performance:
- eliminate memory operations
- decrease the number of misses
- decrease the miss penalty
- decrease the cache/memory access times
- hide memory latencies
- increase cache throughput
- increase memory bandwidth

Handling a Cache Miss the Old Way

(1) Send the address & read operation to the next level of the hierarchy.
(2) Wait for the data to arrive.
(3) Update the cache entry with the data*, rewrite the tag, turn the valid bit on, and clear the dirty bit (if a data cache).
(4) Resend the memory address; this time there will be a hit.

* There are variations:
- get the data before replacing the block
- send the requested word to the CPU as soon as it arrives at the cache (early restart)
- the requested word is sent from memory first, then the rest of the block follows (requested word first)

How do the variations improve memory system performance? (A sketch of steps (1)-(3) follows.)
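To make the sequence concrete, here is a minimal simulation-style sketch in C of steps (1)-(3) with both refill variations folded in. The block size and the next_level_read_word/deliver_to_cpu helpers are hypothetical stand-ins, not anything from the original slides.

#include <stdint.h>
#include <stdio.h>

#define BLOCK_WORDS 8   /* hypothetical block size, in words */

typedef struct {
    uint32_t tag;
    int      valid, dirty;
    uint32_t data[BLOCK_WORDS];
} CacheLine;

/* Stand-ins for the next memory level and the CPU-side interface. */
static uint32_t next_level_read_word(uint32_t word_addr) { return word_addr ^ 0xABCDu; }
static void deliver_to_cpu(uint32_t word) { printf("CPU restarts with %u\n", (unsigned)word); }

static void handle_miss(CacheLine *line, uint32_t block_addr, uint32_t req_offset)
{
    /* Requested word first: the refill starts at the word the CPU
     * asked for and wraps around the rest of the block. */
    for (uint32_t i = 0; i < BLOCK_WORDS; i++) {
        uint32_t off = (req_offset + i) % BLOCK_WORDS;
        line->data[off] = next_level_read_word(block_addr * BLOCK_WORDS + off);
        if (i == 0)
            deliver_to_cpu(line->data[off]);  /* early restart: CPU resumes now */
    }
    line->tag   = block_addr;   /* rewrite the tag */
    line->valid = 1;            /* turn the valid bit on */
    line->dirty = 0;            /* clear the dirty bit (data cache) */
}

int main(void)
{
    CacheLine line = {0};
    handle_miss(&line, 42, 5);  /* miss on word 5 of block 42 */
    return 0;
}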

Non-blocking Caches

A non-blocking cache (lockup-free cache) allows the CPU to continue executing instructions while a miss is handled:
- some processors allow only 1 outstanding miss ("hit under miss")
- some processors allow multiple outstanding misses ("miss under miss")

Miss status holding registers (MSHRs) are the hardware structure for tracking outstanding misses. Each entry records the physical address of the block, which word in the block is wanted, and the destination register number (if a data miss). The MSHR logic also provides:
- a mechanism to merge requests to the same block
- a mechanism to ensure that accesses to the same location execute in program order
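A sketch of what one MSHR entry might hold, with a merge helper; the entry layout, MAX_TARGETS, and mshr_try_merge are illustrative assumptions, since real MSHR formats vary by design.

#include <stdbool.h>
#include <stdint.h>

#define MAX_TARGETS 4   /* merged requests per outstanding miss */

typedef struct {
    bool     valid;                  /* entry tracks an outstanding miss */
    uint64_t block_paddr;            /* physical address of the block */
    struct {
        bool    in_use;
        uint8_t word_in_block;       /* which word in the block */
        uint8_t dest_reg;            /* destination register (data miss) */
    } target[MAX_TARGETS];
} MSHR;

static bool mshr_try_merge(MSHR *m, uint64_t block_paddr,
                           uint8_t word, uint8_t dest_reg)
{
    /* A later miss to the same block is merged onto the existing
     * entry instead of issuing a second request to memory. */
    if (!m->valid || m->block_paddr != block_paddr)
        return false;
    for (int i = 0; i < MAX_TARGETS; i++) {
        if (!m->target[i].in_use) {
            m->target[i].in_use        = true;
            m->target[i].word_in_block = word;
            m->target[i].dest_reg      = dest_reg;
            return true;
        }
    }
    return false;   /* all target slots full: the request must stall */
}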

Non-blocking Caches

A non-blocking cache (lockup-free cache) can be used with both in-order and out-of-order processors:
- in-order processors stall only when an instruction that uses the load data is the next instruction to be executed (non-blocking loads)
- out-of-order processors can execute instructions past the load's consumer

How do non-blocking caches improve memory system performance?

Victim Cache

A victim cache is a small, fully-associative cache that holds the most recently replaced blocks of a direct-mapped cache; it is an alternative to a 2-way set-associative cache.
- check it on a cache miss
- on a victim-cache hit, swap the direct-mapped block and the victim cache block

How do victim caches improve memory system performance? Why do victim caches work?
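A minimal sketch of the miss-path probe, assuming a direct-mapped L1 and a 4-entry victim cache (all names and sizes hypothetical). Because the victim cache is fully associative, entries are matched on the full block address rather than a tag/index split.

#include <stdbool.h>
#include <stdint.h>

#define L1_SETS     256   /* direct-mapped L1 */
#define VICTIM_WAYS 4     /* tiny fully-associative victim cache */

typedef struct {
    bool     valid;
    uint32_t block_addr;  /* full block address */
    /* data omitted */
} Line;

static Line l1[L1_SETS];
static Line victim[VICTIM_WAYS];

static bool victim_probe_and_swap(uint32_t index, uint32_t block_addr)
{
    /* On an L1 miss, probe the victim cache; on a hit, swap blocks so
     * the hot block returns to L1 and the evicted block becomes the
     * new victim. Conflicting blocks ping-pong cheaply this way. */
    for (int w = 0; w < VICTIM_WAYS; w++) {
        if (victim[w].valid && victim[w].block_addr == block_addr) {
            Line evicted = l1[index];
            l1[index]    = victim[w];
            victim[w]    = evicted;
            return true;   /* serviced without going to the next level */
        }
    }
    return false;          /* true miss: fetch from the next level */
}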

Sub-block Placement

Divide a block into sub-blocks, each with its own valid bit (V = valid, I = invalid):

  tag | I data | V data | V data | I data
  tag | I data | V data | V data | V data
  tag | V data | V data | V data | V data
  tag | I data | I data | I data | I data

- a sub-block is the unit of transfer on a cache miss
- one valid bit per sub-block
- two kinds of misses:
  - block-level miss: the tags didn't match
  - sub-block-level miss: the tags matched, but the valid bit was clear
+ only the transfer time of a sub-block on a miss
+ fewer tags than if each sub-block were a full block
- less implicit prefetching

How does sub-block placement improve memory system performance?
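A sketch of the two miss types, assuming four sub-blocks per tag; the SubBlockedLine layout and lookup helper are illustrative only.

#include <stdbool.h>
#include <stdint.h>

#define SUB_BLOCKS 4   /* one tag covers 4 sub-blocks */

typedef struct {
    uint32_t tag;
    bool     sub_valid[SUB_BLOCKS];   /* one valid bit per sub-block */
    uint32_t sub_data[SUB_BLOCKS];    /* each sub-block shown as one word */
} SubBlockedLine;

typedef enum { HIT, SUB_BLOCK_MISS, BLOCK_MISS } Outcome;

static Outcome lookup(const SubBlockedLine *line, uint32_t tag, int sub)
{
    if (line->tag != tag)
        return BLOCK_MISS;       /* tags didn't match: refill the tag
                                    and invalidate all sub-blocks */
    if (!line->sub_valid[sub])
        return SUB_BLOCK_MISS;   /* tag matched, valid bit clear:
                                    transfer only this sub-block */
    return HIT;
}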

Pseudo-set-associative Cache

Pseudo-set-associative cache:
- access the cache
- on a miss, invert the high-order index bit & access the cache again
+ miss rate of a 2-way set-associative cache
+ access time of a direct-mapped cache when the hit is in the fast-hit block
- must predict which block is the fast-hit block
- increase in hit time (relative to 2-way set-associative) when hits always land in the slow-hit block

How does pseudo-set associativity improve memory system performance?
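The second probe is just an index-bit flip. A sketch, assuming a hypothetical geometry of 8 index bits and 32-byte blocks:

#include <stdint.h>

#define INDEX_BITS  8   /* 256 sets */
#define OFFSET_BITS 5   /* 32-byte blocks */

static uint32_t primary_index(uint32_t addr)
{
    return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
}

static uint32_t secondary_index(uint32_t addr)
{
    /* Flip the high-order index bit: every block gets exactly one
     * alternative ("slow-hit") location in the same data array. */
    return primary_index(addr) ^ (1u << (INDEX_BITS - 1));
}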

Pipelined Cache Access

Pipelined cache access: a simple 2-stage pipeline
- stage 1: access the cache
- stage 2: data transfer back to the CPU
- the tag check & hit/miss logic go in whichever stage is shorter

How do pipelined caches improve memory system performance?
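A rough model of the two stages as functions separated by a pipeline latch; the stage split and all names are assumptions for illustration.

#include <stdbool.h>
#include <stdint.h>

#define SETS 256

/* Pipeline register between stage 1 (array read) and stage 2
 * (tag compare & data return). */
typedef struct {
    bool     valid_op;
    uint32_t tag_read;      /* tag read from the array in stage 1 */
    uint32_t tag_expected;  /* tag bits of the requested address */
    uint32_t data_read;
} S1S2Latch;

static uint32_t tags[SETS], data[SETS];
static S1S2Latch latch;

/* Stage 1: index the tag & data arrays; results go into the latch. */
static void stage1(uint32_t index, uint32_t tag)
{
    latch.valid_op     = true;
    latch.tag_read     = tags[index];
    latch.tag_expected = tag;
    latch.data_read    = data[index];
}

/* Stage 2: compare tags and return the data on a hit. A new access
 * can occupy stage 1 in the same cycle, so throughput rises even
 * though each individual access still takes two cycles. */
static bool stage2(uint32_t *out)
{
    if (!latch.valid_op || latch.tag_read != latch.tag_expected)
        return false;   /* miss (or pipeline bubble) */
    *out = latch.data_read;
    return true;
}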

Mechanisms for Prefetching

Stream buffers: where prefetched instructions/data are held
- if the requested block is in the stream buffer, cancel the cache access

How do stream buffers improve memory system performance?
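A sketch of a 4-entry stream buffer modeled as a FIFO; the depth, entry layout, and issue_prefetch stub are hypothetical.

#include <stdbool.h>
#include <stdint.h>

#define SB_ENTRIES 4   /* depth of the FIFO */

typedef struct {
    bool     valid;
    uint32_t block_addr;
    /* block data omitted */
} SBEntry;

static SBEntry  sb[SB_ENTRIES];
static uint32_t sb_next_prefetch;   /* next sequential block to request */

static void issue_prefetch(uint32_t block_addr)
{
    (void)block_addr;   /* would send a request to the next level */
}

static bool stream_buffer_check(uint32_t block_addr)
{
    /* Checked alongside the cache: if the head of the FIFO holds the
     * block, supply it, shift the buffer, and prefetch the next
     * sequential block to keep the stream running ahead. */
    if (sb[0].valid && sb[0].block_addr == block_addr) {
        for (int i = 0; i < SB_ENTRIES - 1; i++)
            sb[i] = sb[i + 1];
        sb[SB_ENTRIES - 1].valid = false;
        issue_prefetch(sb_next_prefetch++);
        return true;    /* cancel the cache access */
    }
    return false;
}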

Trace Cache

Trace cache contents:
- contains instructions from the dynamic instruction stream
+ fetch statically noncontiguous instructions in a single cycle
+ a more efficient use of I-cache space
- a trace is analogous to a cache block with respect to accessing

Trace Cache

Accessing a trace cache:
- trace cache state includes the low bits of the next fetch addresses (target & fall-through code), since the last instruction in a trace is a branch
- the trace cache tag is the high branch address bits + the predictions for all branches in the trace
- access the trace cache, branch predictor, BTB & I-cache in parallel
- compare the high PC bits & the prediction history of the current branch instruction to the trace cache tag
  - hit: use the trace cache; the I-cache fetch is ignored
  - miss: use the I-cache & start constructing a new trace

Why does a trace cache work? (A lookup sketch follows.)
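A sketch of the tag match just described, assuming up to 4 branches per trace; the entry layout and trace_hit helper are illustrative, not any real machine's format.

#include <stdbool.h>
#include <stdint.h>

#define MAX_BRANCHES 4   /* branches allowed per trace */

typedef struct {
    bool     valid;
    uint32_t pc_tag;             /* high bits of the trace's starting PC */
    uint8_t  branch_mask;        /* taken/not-taken pattern inside the trace */
    uint8_t  num_branches;
    uint32_t next_target;        /* low bits of the next fetch addresses */
    uint32_t next_fallthrough;
    /* instructions omitted */
} TraceEntry;

static bool trace_hit(const TraceEntry *t, uint32_t pc_tag,
                      uint8_t predicted_mask)
{
    /* Hit only if the starting PC matches AND the predictor's current
     * predictions agree with the outcomes embedded in the stored trace;
     * otherwise the I-cache fetch is used and a new trace is built. */
    uint8_t used = (uint8_t)((1u << t->num_branches) - 1);
    return t->valid &&
           t->pc_tag == pc_tag &&
           (predicted_mask & used) == (t->branch_mask & used);
}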

Trace Cache

Effect on performance?

Cache-friendly Compiler Optimizations

Exploit spatial locality:
- schedule for array misses: hoist the first load to a cache block

Improve spatial locality:
- group & transpose: makes portions of vectors that are accessed together lie together in memory
- loop interchange: so the inner loop follows the memory layout (sketched below)

Improve temporal locality:
- loop fusion: do multiple computations on the same portion of an array
- tiling (also called blocking): do all computation on a small block of memory that will fit in the cache
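Loop interchange is the easiest of these to show in code. A sketch for a hypothetical N x N array in C, which stores 2-D arrays row-major:

#define N 1024
static double a[N][N];

/* before: the inner loop walks down a column, so consecutive
 * iterations touch addresses N*8 bytes apart, hitting a new
 * cache block on almost every access */
static void column_walk(void)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] += 1.0;
}

/* after interchange: the inner loop follows the memory layout,
 * so every word of a fetched block is used before eviction */
static void row_walk(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] += 1.0;
}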

Tiling Example

/* before */
for (i = 0; i < n; i = i+1)
    for (j = 0; j < n; j = j+1) {
        r = 0;
        for (k = 0; k < n; k = k+1)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }

/* after: t is the tile (blocking) size and min(a,b) is the smaller of
   its arguments. The bounds min(jj+t, n) and min(kk+t, n) include the
   last element of each tile; x must start zeroed, because each (jj,kk)
   tile adds a partial sum into x[i][j]. */
for (jj = 0; jj < n; jj = jj+t)
    for (kk = 0; kk < n; kk = kk+t)
        for (i = 0; i < n; i = i+1)
            for (j = jj; j < min(jj+t, n); j = j+1) {
                r = 0;
                for (k = kk; k < min(kk+t, n); k = k+1)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;
            }

Memory Banks

Interleaved memory: multiple memory banks
- word locations are assigned across the banks
- interleaving factor: the number of banks
- send a single address to all banks at once
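A sketch of the word-interleaved address mapping, assuming a hypothetical 4-bank system:

#include <stdint.h>

#define NUM_BANKS 4   /* the interleaving factor */

/* Word interleaving: consecutive word addresses rotate across banks,
 * so one broadcast address lets all banks read in parallel and a
 * sequential block fill gets NUM_BANKS words per memory cycle. */
static uint32_t bank_of(uint32_t word_addr)
{
    return word_addr % NUM_BANKS;   /* low address bits select the bank */
}

static uint32_t row_within_bank(uint32_t word_addr)
{
    return word_addr / NUM_BANKS;   /* same row index sent to every bank */
}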

Memory Banks

Interleaved memory:
+ get more data per transfer
+ the extra data will probably be used (why?)
- larger DRAM chip capacity means fewer banks
- power issue

Effect on performance?

Memory Banks

Independent memory banks:
- different banks can be accessed at once, with different addresses
- allows parallel access, and possibly parallel data transfer
- multiple memory controllers & separate address lines, one for each access
- different controllers cannot access the same bank
- less area than dual porting

Effect on performance?

Machine Comparison

Today's Memory Subsystems

Look for designs in common:

Advanced Caching

Approaches to improving memory system performance (recap):
- eliminate memory operations
- decrease the number of misses
- decrease the miss penalty
- decrease the cache/memory access times
- hide memory latencies
- increase cache throughput
- increase memory bandwidth

Wrap-up

- Victim cache (reduce miss penalty)
- TLB (reduce page fault time (penalty))
- Hardware- or compiler-based prefetching (reduce misses)
- Cache-conscious compiler optimizations (reduce misses or hide miss penalty)
- Coupling a write-through memory update policy with a write buffer (eliminate store ops/hide store latencies)
- Handling the read miss before replacing a block with a write-back memory update policy (reduce miss penalty)
- Sub-block placement (reduce miss penalty)
- Non-blocking caches (hide miss penalty)
- Merging requests to the same cache block in a non-blocking cache (hide miss penalty)
- Requested word first or early restart (reduce miss penalty)
- Cache hierarchies (reduce misses/reduce miss penalty)
- Virtual caches (reduce hit time)
- Pipelined cache accesses (increase cache throughput)
- Pseudo-set-associative cache (reduce misses)
- Banked or interleaved memories (increase bandwidth)
- Independent memory banks (hide latency)
- Wider bus (increase bandwidth)