Advanced Caching Techniques

Advanced Caching

Approaches to improving memory system performance:
- eliminate memory accesses/operations
- decrease the number of misses
- decrease the miss penalty
- decrease the cache/memory access times
- hide memory latencies
- increase cache throughput
- increase memory bandwidth

New techniques address particular components of memory system performance.

Handling a Cache Miss the Old Way

(1) Send the address & read operation to the next level of the hierarchy.
(2) Wait for the data to arrive.
(3) Update the cache entry with the data*, rewrite the tag, turn the valid bit on, and clear the dirty bit (if a data cache).
(4) Resend the memory address; this time there will be a hit.

* There are variations:
- get the data before replacing the block
- send the requested word to the CPU as soon as it arrives at the cache (early restart)
- the requested word is sent from memory first, and the rest of the block follows (requested word first)

How do the variations improve memory system performance?
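A quick worked example, with illustrative (assumed) timings, shows what the variations buy. Suppose an 8-word block, 10 cycles until the first word arrives from the next level, and 2 cycles per additional word. Waiting for the full block before restarting the CPU costs 10 + 7 * 2 = 24 cycles; with requested word first combined with early restart, the CPU resumes after only 10 cycles while the remaining seven words fill in behind it.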

Non-blocking Caches

A non-blocking cache (lockup-free cache) allows the CPU to continue executing instructions while a miss is handled:
- some caches allow only 1 outstanding miss ("hit under miss")
- some caches allow multiple outstanding misses ("miss under miss")

Miss status holding registers (MSHRs) are the hardware structure for tracking outstanding misses. Each entry records:
- the physical address of the block
- which word in the block
- the destination register number (if a data miss)

The MSHRs also provide:
- a mechanism to merge requests to the same block
- a mechanism to ensure accesses to the same location execute in program order

A non-blocking cache can be used with both in-order and out-of-order processors:
- in-order processors can keep executing as long as the next instruction is not the load's consumer
- out-of-order processors can also execute instructions after the load's consumer

How do non-blocking caches improve memory system performance?
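As a concrete picture of the MSHR state described above, here is a minimal C sketch of a small MSHR file; the entry count, block size, and all names are illustrative assumptions, not taken from the slides:

#include <stdbool.h>
#include <stdint.h>

#define NUM_MSHRS       4
#define WORDS_PER_BLOCK 8

/* One target: a pending word request merged into an outstanding block miss. */
typedef struct {
    uint8_t word_offset;   /* which word in the block */
    uint8_t dest_reg;      /* destination register number (if data) */
    bool    valid;
} mshr_target_t;

/* One miss status holding register: tracks one outstanding block miss. */
typedef struct {
    bool          busy;                      /* entry allocated? */
    uint64_t      block_addr;                /* physical address of the block */
    mshr_target_t targets[WORDS_PER_BLOCK];  /* merged requests to this block */
} mshr_t;

static mshr_t mshr_file[NUM_MSHRS];

/* On a miss: merge into a matching MSHR or allocate a new one.
 * Caller guarantees word < WORDS_PER_BLOCK. Returns false if the
 * MSHR file is full, in which case the cache must block after all. */
bool mshr_handle_miss(uint64_t block_addr, uint8_t word, uint8_t dest_reg)
{
    int free_slot = -1;
    for (int i = 0; i < NUM_MSHRS; i++) {
        if (mshr_file[i].busy && mshr_file[i].block_addr == block_addr) {
            /* Secondary miss: merge the request, no new memory access. */
            mshr_file[i].targets[word] = (mshr_target_t){ word, dest_reg, true };
            return true;
        }
        if (!mshr_file[i].busy && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return false;   /* all MSHRs busy: structural stall */
    /* Primary miss: allocate an entry and (conceptually) issue the fetch. */
    mshr_file[free_slot].busy = true;
    mshr_file[free_slot].block_addr = block_addr;
    mshr_file[free_slot].targets[word] = (mshr_target_t){ word, dest_reg, true };
    return true;
}

A secondary miss to a block that already has an entry is merged rather than sent to memory again, which is exactly the request-merging mechanism listed above.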

Non-blocking Caches: Example

In-order processor:

    lw  $3, 100($4)   # in execution, cache miss
    add $2, $3, $4    # consumer: waits until the miss is satisfied
    sub $5, $6, $7    # independent instruction, but it waits for the add

Out-of-order processor:

    lw  $3, 100($4)   # in execution, cache miss
    sub $5, $6, $7    # independent instruction: can execute during the miss
    add $2, $3, $4    # consumer: waits until the miss is satisfied

Victim Cache

A victim cache is a small fully-associative cache that contains the most recently replaced blocks of a direct-mapped L1 cache:
- on an L1 cache miss & victim cache hit, swap the direct-mapped block and the victim cache block
- if both miss, the replaced L1 block goes to the victim cache
- an alternative to a 2-way set-associative cache

Why do victim caches work? How do victim caches improve memory system performance?
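A minimal C sketch of the lookup-and-swap path, assuming 64-byte blocks, a 256-set direct-mapped L1, and a 4-entry victim cache (all sizes and names illustrative):

#include <stdbool.h>
#include <stdint.h>

#define L1_SETS        256
#define VICTIM_ENTRIES 4

typedef struct { bool valid; uint64_t tag; } l1_line_t;          /* data omitted */
typedef struct { bool valid; uint64_t block_addr; } victim_line_t;

static l1_line_t     l1[L1_SETS];              /* direct-mapped L1 */
static victim_line_t victim[VICTIM_ENTRIES];   /* small fully-associative victim cache */

/* Look up addr; returns true on an L1 or victim hit.
 * On a victim hit, swap blocks so the hot line returns to L1. */
bool cache_lookup(uint64_t addr)
{
    uint64_t block = addr / 64;        /* 64-byte blocks (assumed) */
    uint64_t set   = block % L1_SETS;
    uint64_t tag   = block / L1_SETS;

    if (l1[set].valid && l1[set].tag == tag)
        return true;                   /* L1 hit */

    for (int i = 0; i < VICTIM_ENTRIES; i++) {
        if (victim[i].valid && victim[i].block_addr == block) {
            /* Victim hit: swap the conflicting L1 block and the victim block. */
            uint64_t evicted  = l1[set].tag * L1_SETS + set;
            bool     had_line = l1[set].valid;
            l1[set].valid        = true;
            l1[set].tag          = tag;
            victim[i].valid      = had_line;
            victim[i].block_addr = evicted;
            return true;
        }
    }
    /* Both miss: fetch from the next level; the replaced L1 block goes to
       the victim cache (victim replacement policy omitted for brevity). */
    return false;
}

The swap keeps the most recently used block in the fast direct-mapped array, which is why a victim cache approximates the conflict-miss behavior of a 2-way set-associative cache without its hit-time cost.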

Sub-block Placement

Divide a block into sub-blocks:
- sub-block = the unit of transfer on a cache miss
- one valid bit per sub-block

[Figure: four cache blocks, each with one tag and four sub-blocks; every sub-block carries its own valid (V) or invalid (I) bit.]

2 kinds of misses:
- block-level miss: the tags didn't match
- sub-block-level miss: the tags matched, but the valid bit was clear

+ the miss penalty is only the transfer time of a sub-block
+ fewer tags than if each block were the size of a sub-block
- can't exploit spatial locality

How does sub-block placement improve memory system performance?

Pipelined Cache Access

Pipelined cache access is a simple 2-stage pipeline:
- stage 1: access the cache
- stage 2: transfer the data back to the CPU
- the tag check & hit/miss logic go with the shorter of the two stages

How do pipelined caches improve memory system performance?
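The two miss types fall directly out of the tag/valid-bit check; a minimal C sketch, with the sub-block count and names as illustrative assumptions:

#include <stdbool.h>
#include <stdint.h>

#define SUBBLOCKS_PER_BLOCK 4

typedef struct {
    uint64_t tag;
    bool     valid[SUBBLOCKS_PER_BLOCK];   /* one valid bit per sub-block */
} subblocked_line_t;

typedef enum { HIT, SUBBLOCK_MISS, BLOCK_MISS } access_result_t;

/* Tags match and the sub-block is valid -> hit;
 * tags match but the valid bit is clear -> sub-block-level miss
 *   (fetch only one sub-block);
 * tags differ -> block-level miss (new tag, all valid bits cleared). */
access_result_t access_line(subblocked_line_t *line, uint64_t tag, int sub)
{
    if (line->tag == tag) {
        if (line->valid[sub])
            return HIT;
        line->valid[sub] = true;       /* transfer just this sub-block */
        return SUBBLOCK_MISS;
    }
    line->tag = tag;                   /* replace: invalidate all sub-blocks */
    for (int i = 0; i < SUBBLOCKS_PER_BLOCK; i++)
        line->valid[i] = false;
    line->valid[sub] = true;           /* then fetch the requested sub-block */
    return BLOCK_MISS;
}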

Trace Cache

A trace cache contains instructions from the dynamic instruction stream:
- fetches statically noncontiguous instructions in a single cycle; such a fetch unit is called a trace
- there is a limit on the number of basic blocks & the number of instructions in a trace
- instructions may appear more than once (in different traces)
- accessed with the PC & prediction bits
- traces can contain decoded instructions, which is particularly useful for CISC architectures

[Figure: a trace assembled from basic blocks B1 (6 instructions), B2 (4 instructions), B3 (8 instructions), and B4 (5 instructions).]

Advantages/disadvantages of a trace cache?
+ ?
- ?
Effect on memory system performance?
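To make the access path concrete, here is a minimal C sketch of a direct-mapped trace cache indexed by the starting PC and tagged with the branch directions the trace was built under; all structure sizes and field names are illustrative assumptions:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TRACE_SETS    64
#define MAX_TRACE_LEN 16   /* illustrative limit on instructions per trace */

typedef struct {
    bool     valid;
    uint64_t start_pc;               /* PC of the first instruction */
    uint8_t  branch_dirs;            /* taken/not-taken bits for embedded
                                        branches (assumed <= 8 per trace) */
    uint8_t  num_branches;
    uint32_t instrs[MAX_TRACE_LEN];  /* possibly pre-decoded instructions */
    uint8_t  len;
} trace_t;

static trace_t trace_cache[TRACE_SETS];

/* A trace hit requires both the start PC and the current branch predictions
 * to match the directions under which the trace was built. */
const trace_t *trace_lookup(uint64_t pc, uint8_t predicted_dirs)
{
    trace_t *t = &trace_cache[(pc / 4) % TRACE_SETS];
    if (t->valid && t->start_pc == pc) {
        uint8_t mask = (uint8_t)((1u << t->num_branches) - 1);
        if ((predicted_dirs & mask) == (t->branch_dirs & mask))
            return t;   /* deliver the whole trace in one fetch cycle */
    }
    return NULL;        /* trace miss: fall back to the ordinary fetch path */
}

On a hit, a whole trace of statically noncontiguous instructions is delivered at once; on a miss, fetch falls back to the regular instruction cache while a new trace is assembled.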

Cache-friendly Compiler Optimizations

Improve spatial locality for data:
- loop interchange: reorder loops so the inner loop follows the memory layout of the data (see the sketch after the tiling example)
- group & transpose: collects data from different vectors or structures that are accessed together & places them contiguously in memory

Improve temporal locality for data:
- loop fusion: put computations on the same portion of an array from separate loops into one loop
- tiling (also called blocking): do all computation on a small block of an array that will fit in the cache

Tiling Example

/* before */
for (i = 0; i < n; i = i+1)
    for (j = 0; j < n; j = j+1) {
        r = 0;
        for (k = 0; k < n; k = k+1)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }

/* after: t-by-t tiles; x must be zero-initialized, because partial sums
   are accumulated across the kk tiles; min(a,b) is the usual minimum */
for (jj = 0; jj < n; jj = jj+t)
    for (kk = 0; kk < n; kk = kk+t)
        for (i = 0; i < n; i = i+1)
            for (j = jj; j < min(jj+t, n); j = j+1) {
                r = 0;
                for (k = kk; k < min(kk+t, n); k = k+1)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;
            }

[Figure: the order in which tiles of x, y, and z are touched under blocking.]
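Loop interchange, listed above, is the simplest of these to show. A minimal sketch in the same style, assuming x is a 5000-by-100 int array stored row-major:

/* before: the inner loop strides through x a whole row apart,
   touching a new cache block on nearly every access */
for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
        x[i][j] = 2 * x[i][j];

/* after: the inner loop walks consecutive words of a row,
   so each fetched cache block is fully used before eviction */
for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
        x[i][j] = 2 * x[i][j];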

Memory Banks

Interleaved memory:
- multiple memory banks
- word locations are assigned across the banks
- send a single address to all banks at once
- interleaving factor: the number of banks

Interleaved memory:
+ get more data for one transfer, and the data is probably used (why?)
- larger DRAM chip capacity means fewer banks
- power issue

Effect on memory system performance?
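A minimal C sketch of word-interleaved bank selection, assuming a 4-bank system (the bank count and names are illustrative):

#include <stdint.h>

#define NUM_BANKS 4   /* interleaving factor (assumed) */

/* Consecutive word addresses fall in consecutive banks, so after a single
 * address is broadcast, the words of one block can be read from all banks
 * in parallel. */
static inline unsigned bank_of(uint64_t word_addr)         { return word_addr % NUM_BANKS; }
static inline uint64_t row_within_bank(uint64_t word_addr) { return word_addr / NUM_BANKS; }

For example, an 8-word block starting at word address 40 maps to banks 0, 1, 2, 3, 0, 1, 2, 3 (40 % 4 = 0, 41 % 4 = 1, and so on), so each bank supplies two words of the transfer.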

Memory Banks

Independent memory banks:
- different banks can be accessed at once, with different addresses (a bank-conflict sketch follows the summary below)
- allows parallel access, and possibly parallel data transfer
- multiple memory controllers & separate address lines, one for each access
- different controllers cannot access the same bank
- less area than dual porting

Effect on memory system performance?

Advanced Caching (summary)

Approaches to improving memory system performance:
- eliminate memory accesses
- decrease the number of misses
- decrease the miss penalty
- hide memory latencies
- increase cache throughput
- increase memory/cache bandwidth
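As referenced above, a minimal C sketch of the constraint that independent controllers cannot target the same bank; the block size and bank count are illustrative assumptions:

#include <stdbool.h>
#include <stdint.h>

#define NUM_BANKS 4

/* Bank selected by block address (64-byte blocks assumed). */
static inline unsigned bank_of_block(uint64_t addr) { return (addr / 64) % NUM_BANKS; }

/* Two requests, each with its own controller and address lines, can be
 * serviced in parallel only if they fall in different banks; a bank
 * conflict serializes them. */
bool parallel_access_ok(uint64_t addr_a, uint64_t addr_b)
{
    return bank_of_block(addr_a) != bank_of_block(addr_b);
}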

Other

- Hardware- or compiler-based prefetching (decreases misses; a software-prefetch sketch follows below)
- Coupling a write-through memory update policy with a write buffer (eliminates store ops / hides store latencies)
- Merging requests to the same cache block in a non-blocking cache (hides the miss penalty)
- TLB (reduces the page fault time (penalty))
- Cache hierarchies (reduce the miss penalty)
- Virtual caches (reduce the L1 cache access time)
- Wider bus (increases bandwidth)
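As one concrete instance of the first item, a minimal sketch of compiler-style software prefetching in C using the GCC/Clang __builtin_prefetch intrinsic; the prefetch distance of 8 iterations is an illustrative assumption that would be tuned to the actual miss latency:

#include <stddef.h>

#define DIST 8   /* prefetch distance (assumed); tune to memory latency */

/* Sum an array while prefetching ahead, so the miss latency of a[i+DIST]
 * overlaps with computation on elements already in the cache. */
long sum_with_prefetch(const long *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST], 0 /* read */, 1 /* low temporal locality */);
        s += a[i];
    }
    return s;
}

When the distance is chosen well, each prefetched block arrives before its demand reference, converting what would have been misses into hits; a hardware prefetcher achieves the same effect automatically by detecting the sequential access pattern.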