Reducing Miss Penalty: Read Priority over Write on Miss. Improving Cache Performance. Non-blocking Caches to reduce stalls on misses

Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

Reducing Miss Penalty: Read Priority over Write on Miss
Write buffers can create RAW conflicts with main-memory reads on misses. If we simply wait for the write buffer to empty, we might increase the read miss penalty by 50%. Instead, check the write buffer contents before the read; if there are no conflicts, let the memory access continue (see the sketch below).

What about write back? A read miss may require the write of a dirty block. The normal approach is to write the dirty block to memory and then do the read. Instead, copy the dirty block to a write buffer, then do the read, and then do the write. The CPU stalls less, since it can restart as soon as the read completes.

Early Restart and Critical Word First
Don't wait for the full block to be loaded before restarting the CPU:
- Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
- Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while the rest of the words in the block are filled. Also called wrapped fetch or requested word first (see the fill-order sketch below).
These techniques are most useful with large blocks. Spatial locality can be a problem: we often want the next sequential word soon, so early restart is not always a benefit.

Non-blocking Caches to Reduce Stalls on Misses
A non-blocking (lockup-free) cache allows the data cache to continue to supply hits during a miss:
- Hit under miss reduces the effective miss penalty by being helpful during a miss instead of ignoring the requests of the CPU.
- Hit under multiple miss (or miss under miss) can further lower the effective miss penalty by overlapping multiple misses, but it significantly increases the complexity of the cache controller, since there can be multiple outstanding memory accesses.
Non-blocking caches assume "stall on use" rather than "stall on miss", which works naturally with dynamic scheduling but can also work with static scheduling.
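
A minimal sketch of the write-buffer check described above. The model is hypothetical, not any particular machine's design: the block size, the buffer representation, and the drain-on-conflict policy are all assumptions for illustration.

    # Before sending a read miss to memory, scan the write buffer for a
    # pending store to the same block. If one is found, drain it first;
    # otherwise the read proceeds immediately instead of waiting for the
    # whole buffer to empty.

    BLOCK_BYTES = 32  # assumed block size

    def block_addr(addr):
        return addr // BLOCK_BYTES

    def read_miss(addr, write_buffer, memory):
        """write_buffer: list of (addr, value) pending stores, oldest first."""
        pending = [(a, v) for a, v in write_buffer
                   if block_addr(a) == block_addr(addr)]
        if pending:
            # RAW conflict: the freshest value is still in the buffer.
            # Simplest safe policy: drain the conflicting writes, then read.
            for a, v in pending:
                memory[a] = v
                write_buffer.remove((a, v))
        return memory.get(addr, 0)  # fetch from memory

    memory = {0x40: 7}
    wb = [(0x40, 9)]                     # pending store to block of 0x40
    print(read_miss(0x100, wb, memory))  # no conflict -> read proceeds, 0
    print(read_miss(0x40, wb, memory))   # conflict -> drain first, 9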

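The fill order for critical word first can be sketched in a few lines; the 8-word block size below is an assumption for illustration.

    # Critical-word-first (wrapped fetch): request the missed word first,
    # then wrap around the block, so the CPU can restart as soon as the
    # first word of the transfer arrives.

    WORDS_PER_BLOCK = 8

    def fill_order(requested_word):
        """Order in which the words of the block arrive from memory."""
        return [(requested_word + i) % WORDS_PER_BLOCK
                for i in range(WORDS_PER_BLOCK)]

    print(fill_order(5))  # [5, 6, 7, 0, 1, 2, 3, 4] -- CPU restarts after word 5
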
Value of Hit Under Miss for SPEC
[Figure: effective miss penalty with hit under miss for the SPEC benchmarks.]

Miss Penalty Reduction: Second-Level Cache
L2 equations:
AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
Definitions:
- Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2 for the second level).
- Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 x Miss Rate_L2 for the second level).

Multi-level Caches, cont.
Suppose the L1 local miss rate is 10% and the L2 local miss rate is 40%. What are the global miss rates? (Worked in the example below.)
- L1's highest priority is fast hit time; L2 typically aims for a low miss rate.
- Design the L1 and L2 caches in concert.
- The property of inclusion -- if a block is in L1, it is guaranteed to be in L2 -- simplifies the design of consistent caches.
- L2 can have a different associativity (good idea?) or block size (good idea?) than L1.
- This can be applied recursively to multilevel caches; the danger is that the time to DRAM grows with multiple levels in between.

Reducing Miss Penalty Summary
Four techniques:
1. Read priority over write on miss
2. Early restart and critical word first on miss
3. Non-blocking caches (hit under miss)
4. Multi-level caches
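
A worked version of the multi-level example above. The miss rates are the lecture's (L1 local 10%, L2 local 40%); the hit times and memory latency are assumptions chosen only to make the arithmetic concrete.

    # Plugging the L2 equations into a two-level example.
    hit_time_L1  = 1      # cycles (assumed)
    hit_time_L2  = 10     # cycles (assumed)
    miss_pen_L2  = 100    # cycles to main memory (assumed)
    miss_rate_L1 = 0.10   # local = global for L1
    miss_rate_L2 = 0.40   # local: L2 misses / accesses that reach L2

    miss_pen_L1 = hit_time_L2 + miss_rate_L2 * miss_pen_L2  # 10 + 0.4*100 = 50
    amat        = hit_time_L1 + miss_rate_L1 * miss_pen_L1  # 1 + 0.1*50   = 6

    global_L2 = miss_rate_L1 * miss_rate_L2                 # 0.10 * 0.40 = 4%

    print(f"L1 miss penalty     = {miss_pen_L1} cycles")
    print(f"AMAT                = {amat} cycles")
    print(f"global L2 miss rate = {global_L2:.0%}")

So the L1 global miss rate equals its local miss rate (10%), while the L2 global miss rate is only 4%, even though 40% of the accesses that reach L2 miss there.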

Review: Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

Fast Hit Times via Small and Simple Caches
This is why the Alpha 21164 has an 8 KB instruction cache and an 8 KB data cache plus a 96 KB second-level cache for instructions and data; first-level caches used to be typically direct mapped and on chip.

Fast Hits by Avoiding Address Translation: Virtual Cache
[Figure: physical cache (virtual address -> TLB -> physical address -> cache) vs. virtual cache (virtual address -> cache, with the TLB off the critical path).]
Send the virtual address to the cache? This is called a virtually addressed cache, or just virtual cache, as opposed to a physical cache. Issues:
- Every time the process is switched, the cache logically must be flushed; otherwise we get false hits. The cost is the time to flush plus the compulsory misses from the empty cache.
- Dealing with aliases (sometimes called synonyms): two different virtual addresses map to the same physical address.
- I/O must interact with the cache.
Solutions to aliases: hardware that guarantees every cache block has a unique physical address, or a software guarantee that the lower n bits of aliases must be the same; as long as those bits cover the index field and the cache is direct mapped, aliases must be unique. This is called page coloring (checked in the sketch below).
Solution to flushes: add a process-identifier tag that identifies the process as well as the address within the process; we can't get a hit if the process is wrong.
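
The page-coloring condition above can be checked mechanically: if the bits that index the cache all lie inside the page offset, i.e. (cache size / associativity) <= page size, virtual and physical addresses agree on the index and aliases land in the same set. A sketch, where the cache and page sizes are assumptions for illustration:

    def index_within_page(cache_bytes, associativity, page_bytes):
        """True if virtual and physical addresses share the cache index bits."""
        return cache_bytes // associativity <= page_bytes

    PAGE = 4096
    print(index_within_page(8 * 1024, 2, PAGE))   # 4 KB per way -> True, aliases harmless
    print(index_within_page(32 * 1024, 2, PAGE))  # 16 KB per way -> False, page coloring needed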

Avoiding Translation: Process ID Impact
[Figure: miss-rate impact of adding process-ID tags to a virtual cache.]

Cache Bandwidth: Trace Caches
The fetch bottleneck: you cannot execute instructions faster than you can fetch them into the processor, and you typically cannot fetch past more than about one taken branch per cycle, at best (why? and why one taken branch?). A trace cache is an instruction cache that stores instructions in dynamic execution order rather than in program/address order, so one fetch can cross several taken branches (a minimal sketch follows the summary below).
[Figure: the dynamic path A B C beq -> G H jsr -> W X ret is spread across three blocks in a conventional instruction cache but is stored contiguously as a single trace in a trace cache.]

Cache Optimization Summary
Each technique is rated by its effect on miss rate (MR), miss penalty (MP), and hit time (HT), and by its hardware complexity:
- Larger block size
- Higher associativity
- Victim caches
- Pseudo-associative caches
- HW prefetching of instructions/data
- Compiler-controlled prefetching
- Compiler techniques to reduce misses
- Priority to read misses
- Early restart & critical word first
- Non-blocking caches
- Second-level caches
- Small & simple caches
- Avoiding address translation
- Trace caches
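
A minimal, dict-based trace-cache sketch, as promised above. Keying traces by start PC plus predicted branch outcomes is the general idea; the function names and trace format here are hypothetical, not a real design.

    # Traces are keyed by (start PC, taken/not-taken outcomes of the
    # branches inside) and store instructions in dynamic order, so one
    # fetch can cross several taken branches.
    trace_cache = {}

    def fill_trace(start_pc, branch_outcomes, insts):
        trace_cache[(start_pc, tuple(branch_outcomes))] = insts

    def fetch(start_pc, predicted_outcomes):
        """One fetch: a hit returns a whole dynamic trace, a miss returns None."""
        return trace_cache.get((start_pc, tuple(predicted_outcomes)))

    # Usage: the dynamic path from the figure (A B C beq taken, then
    # G H jsr, then W X ret) stored contiguously as one trace.
    fill_trace("A", [True], ["A", "B", "C", "beq", "G", "H", "jsr", "W", "X", "ret"])
    print(fetch("A", [True]))   # hit: the full trace in one fetch
    print(fetch("A", [False]))  # miss: a different predicted path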

Recent Cache Research at UCSD
- Hardware prefetching of complex data structures (e.g., pointer chasing).
- Fetch target buffer: let the branch predictor run ahead of the fetch engine.
- Runtime identification of conflict misses.
- Speculative precomputation: spawn threads at runtime to calculate the addresses of delinquent (problematic) loads and prefetch them; this creates a prefetcher from the application's own code.