Lecture 10: Large Cache Design III


Topics: replacement policies, prefetch, dead blocks, associativity

- Sign up for the class mailing list
- Pseudo-LRU has a 9% higher miss rate than true LRU (tree pseudo-LRU is sketched below)
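
For reference, a minimal tree pseudo-LRU sketch for an 8-way set, in C++. This is the standard textbook formulation, not code from the lecture: 7 direction bits per set approximate the full LRU ordering, and that loss of precision is the source of the extra misses noted above.

#include <cstdint>

// Tree pseudo-LRU for an 8-way set: 7 node bits instead of full LRU state.
// Usage: call touch(way) on a hit or fill; call victim() on a miss.
struct TreePLRU8 {
  uint8_t bits = 0;  // 7 tree-node bits; bit value 0 = victim search goes left

  // On a hit or fill of 'way', set the bits on its path to point away from it.
  void touch(int way) {
    int node = 0;
    for (int level = 0; level < 3; ++level) {
      int dir = (way >> (2 - level)) & 1;            // which subtree holds 'way'
      if (dir) bits &= ~(1 << node); else bits |= (1 << node);
      node = 2 * node + 1 + dir;                     // descend toward 'way'
    }
  }

  // On a miss, follow the direction bits down to the pseudo-LRU way.
  int victim() const {
    int node = 0, way = 0;
    for (int level = 0; level < 3; ++level) {
      int dir = (bits >> node) & 1;
      way = (way << 1) | dir;
      node = 2 * node + 1 + dir;
    }
    return way;
  }
};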

Overview

Set Partitioning

- Can also partition sets among cores by assigning a set of page colors to each core (sketch below)
- Needs little hardware support, but the OS must adapt to the dynamic arrival and exit of tasks
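
A minimal sketch of the page-coloring computation, assuming an illustrative geometry (2 MB shared cache, 64 B lines, 4096 sets, 4 KB pages); none of these numbers come from the slide.

#include <cstdint>
#include <cstdio>

// Assumed geometry: 64 B lines (6 bits), 4096 sets (12 bits), 4 KB pages.
constexpr uint64_t kLineBits  = 6;
constexpr uint64_t kSetBits   = 12;
constexpr uint64_t kPageBits  = 12;
// The "color" is the set-index bits that lie above the page offset.
constexpr uint64_t kColorBits = kLineBits + kSetBits - kPageBits;  // 6 -> 64 colors

uint64_t page_color(uint64_t phys_addr) {
  return (phys_addr >> kPageBits) & ((1ull << kColorBits) - 1);
}

// The OS hands core c only physical pages whose color lies in c's share,
// e.g., with 4 cores, core c owns colors [c*16, c*16 + 16).
bool page_belongs_to_core(uint64_t phys_addr, int core, int num_cores = 4) {
  uint64_t colors_per_core = (1ull << kColorBits) / num_cores;
  return page_color(phys_addr) / colors_per_core == (uint64_t)core;
}

int main() {
  uint64_t pa = 0x12345678;
  printf("color(%#llx) = %llu, belongs to core 0? %d\n",
         (unsigned long long)pa, (unsigned long long)page_color(pa),
         page_belongs_to_core(pa, 0) ? 1 : 0);
}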

LIN (Qureshi et al., ISCA'06)

- Memory-level parallelism (MLP): the number of misses that simultaneously access memory; a high-MLP miss is less expensive because its latency is amortized across concurrent misses
- The replacement decision is a linear combination of recency and the MLP cost experienced when fetching that block (sketch below)
- A block's MLP cost is estimated by tracking the number of outstanding requests in the MSHRs while the block's own miss waits in the MSHR
- Set dueling can also be used to decide between LRU and LIN
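
A sketch of LIN's victim choice. The weight lambda = 4 and the 3-bit quantized cost reflect my reading of the ISCA'06 paper; treat them as illustrative, not the paper's exact hardware.

#include <cstdint>
#include <vector>

struct Block {
  uint8_t recency;   // 0 = LRU position ... assoc-1 = MRU position
  uint8_t mlp_cost;  // quantized MLP-based cost recorded at fill time (0..7)
};

// Evict the block minimizing recency + lambda * mlp_cost: prefer blocks that
// are both cold (near-LRU) and cheap to refetch (high-MLP miss = low cost).
int lin_victim(const std::vector<Block>& set, int lambda = 4) {
  int victim = 0, best = 1 << 30;
  for (int i = 0; i < (int)set.size(); ++i) {
    int score = set[i].recency + lambda * set[i].mlp_cost;
    if (score < best) { best = score; victim = i; }
  }
  return victim;
}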

Pseudo-LIFO (Chaudhuri, MICRO'09)

- A fill stack records the order in which blocks entered the set (most recent fill at the top)
- Most hits are serviced while a block is near the top of the stack; there is usually a knee point beyond which blocks stop yielding hits
- Evict the highest block beyond the knee that has not yet serviced a hit in its current fill-stack position (sketch below)
- Some blocks are probabilistically allowed to escape past the knee and be retained for distant reuse
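
An illustrative sketch of the eviction rule. The fixed knee and escape probability are assumptions; the paper learns the knee from per-position hit histograms and manages the escape probabilities dynamically.

#include <cstdint>
#include <cstdlib>
#include <vector>

struct Way {
  uint8_t fill_pos;  // position in the fill stack, 0 = most recently filled
  bool hit_at_pos;   // has it serviced a hit at its current position?
};

// Evict the block nearest the top of the fill stack that sits at or beyond
// the knee and has not hit at its current position; occasionally skip a
// candidate so some blocks escape past the knee for distant reuse.
int plifo_victim(const std::vector<Way>& set, int knee, double escape_p = 0.05) {
  for (int pos = knee; pos < (int)set.size(); ++pos)
    for (int w = 0; w < (int)set.size(); ++w)
      if (set[w].fill_pos == pos && !set[w].hit_at_pos)
        if ((double)rand() / RAND_MAX >= escape_p)  // probabilistic escape
          return w;
  // Fallback: the deepest block in the fill stack.
  int deepest = 0;
  for (int w = 1; w < (int)set.size(); ++w)
    if (set[w].fill_pos > set[deepest].fill_pos) deepest = w;
  return deepest;
}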

Scavenger (Basu et al., MICRO'07)

- Half the cache is used as a victim cache to retain blocks that will likely be used in the distant future
- Counting Bloom filters track a block's potential for reuse and drive replacement decisions in the victim cache (sketch below)
- Cost: complex indexing and search in the victim cache
- Another recent paper (NuCache, HPCA'11) places blocks in a large FIFO victim file if they were fetched by delinquent PCs and have a short reuse distance
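
A minimal counting Bloom filter of the kind Scavenger uses to estimate how often an address has missed; the hash functions and sizes here are ad hoc, not the paper's.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

class CountingBloom {
  std::vector<uint8_t> counters_;
  static uint64_t hash(uint64_t x, uint64_t seed) {
    x ^= seed * 0x9e3779b97f4a7c15ull;
    x *= 0xff51afd7ed558ccdull;
    return x ^ (x >> 33);
  }
 public:
  explicit CountingBloom(std::size_t n) : counters_(n, 0) {}

  // Record one more miss to 'addr' (saturating counters).
  void insert(uint64_t addr) {
    for (uint64_t seed : {1ull, 2ull, 3ull}) {
      uint8_t& c = counters_[hash(addr, seed) % counters_.size()];
      if (c < 255) ++c;
    }
  }

  // Estimated count: min over the k counters (may overestimate, never under).
  uint8_t estimate(uint64_t addr) const {
    uint8_t m = 255;
    for (uint64_t seed : {1ull, 2ull, 3ull})
      m = std::min(m, counters_[hash(addr, seed) % counters_.size()]);
    return m;
  }
};
// Scavenger-style use: on an L2 miss to 'addr', compare estimate(addr) against
// the lowest-priority entry of the victim cache to decide whether to keep it.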

V-Way Cache (Qureshi et al., ISCA'05)

- Meant to reduce load imbalance among sets and to compute a better global replacement decision
- Tag store: every set has twice as many ways
- Data store: no fixed correspondence with the tag store, so forward and reverse pointers are needed
- In most cases any block can be replaced: every block has a 2-bit saturating counter that is incremented on every access; replacement scans blocks (decrementing counters) until a zero counter is found, and the next replacement continues the scan from there (sketch below)
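
A sketch of the global "reuse replacement" scan over the data store; the structure and field names are mine, not the paper's RTL.

#include <cstddef>
#include <cstdint>
#include <vector>

struct DataEntry {
  uint8_t reuse = 0;  // 2-bit saturating counter
};

struct VWayDataStore {
  std::vector<DataEntry> entries;
  std::size_t hand = 0;  // the scan resumes here at the next replacement

  explicit VWayDataStore(std::size_t n) : entries(n) {}

  // On every access to entry i, bump its reuse counter (saturate at 3).
  void on_access(std::size_t i) {
    if (entries[i].reuse < 3) ++entries[i].reuse;
  }

  // Scan from the hand, decrementing nonzero counters, until a zero counter
  // is found: that entry is the global victim (any set may supply it).
  std::size_t pick_victim() {
    for (;;) {
      std::size_t i = hand;
      hand = (hand + 1) % entries.size();
      if (entries[i].reuse == 0) return i;
      --entries[i].reuse;
    }
  }
};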

ZCache (Sanchez and Kozyrakis, MICRO'10)

- Skewed-associative cache: each way has a different indexing function (in essence, W direct-mapped caches; sketch below)
- When block A is brought in, it could replace one of, say, four blocks B, C, D, E; but B could itself reside in one of three other locations (currently occupied by F, G, H), and F could in turn be moved elsewhere
- This yields a tree of replacement options, among which we can pick the LRU block
- Every replacement requires multiple tag look-ups and data block copies; worthwhile if it reduces off-chip accesses
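
A sketch of the skewed indexing that underlies the replacement tree; the hash mix below is illustrative (the paper uses H3-family hash functions).

#include <cstdint>

constexpr int kWays = 4;                // W skewed ways
constexpr uint64_t kSetsPerWay = 1024;  // rows per way

// Cheap mixing function, for illustration only.
uint64_t mix(uint64_t x) {
  x *= 0xff51afd7ed558ccdull;
  return x ^ (x >> 33);
}

// Each way indexes with a different hash of the block address, so a block's
// W candidate frames differ per way: in essence, W direct-mapped caches.
uint64_t way_index(uint64_t block_addr, int w) {
  return mix(block_addr ^ (0x9e3779b97f4a7c15ull * uint64_t(w + 1))) % kSetsPerWay;
}
// On a fill, the W candidates (one per way) form the first level of the
// replacement tree; each candidate's alternative frames in the other ways
// form the next level, and the best leaf (e.g., the oldest block) is chosen,
// relocating blocks along the chosen path.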

Temporal Memory Streaming (Wenisch et al., ISCA'05)

- When a thread incurs a series of misses to blocks (P, Q, R, S; not necessarily contiguous), other threads are likely to incur a similar series of misses
- Each thread maintains its miss log in a circular buffer in memory; the directory entry for P also keeps pointers to multiple log entries for P
- When a thread misses on P, it contacts the directory, which supplies the log pointers; the thread then receives multiple streams and starts prefetching (toy sketch below)
- Log accesses and prefetches are off the critical path
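
A single-threaded toy sketch of the bookkeeping, with the miss log and the directory pointers collapsed into ordinary in-memory structures; the real design keeps the log in DRAM, the pointers in the coherence directory, and replays streams across threads.

#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct MissLog {
  std::vector<uint64_t> ring;  // circular buffer of miss addresses
  std::size_t head = 0;
  explicit MissLog(std::size_t n) : ring(n, 0) {}
  std::size_t append(uint64_t addr) {  // returns the log position written
    std::size_t pos = head;
    ring[pos] = addr;
    head = (head + 1) % ring.size();
    return pos;
  }
};

// Directory side: for each block address, a few pointers into the log.
std::unordered_map<uint64_t, std::vector<std::size_t>> log_ptrs;

// On a miss to P: replay the streams recorded after earlier occurrences of P,
// then log this occurrence and register its pointer with the directory.
std::vector<uint64_t> on_miss(MissLog& log, uint64_t P, std::size_t stream_len = 4) {
  std::vector<uint64_t> prefetches;
  for (std::size_t ptr : log_ptrs[P])
    for (std::size_t i = 1; i <= stream_len; ++i)
      prefetches.push_back(log.ring[(ptr + i) % log.ring.size()]);
  auto& ptrs = log_ptrs[P];
  ptrs.push_back(log.append(P));
  if (ptrs.size() > 2) ptrs.erase(ptrs.begin());  // keep a couple of pointers
  return prefetches;
}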

Spatial Memory Streaming (Somogyi et al., ISCA'06)

- Threads often enter a new region (page) and touch a few arbitrary blocks in that region
- A predictor is indexed with the PC of the first access to the region and the offset of that access; it returns a bit vector indicating which blocks in the region will be accessed (sketch below)
- Can even prefetch for regions that have never been touched before, since the index does not involve the region's address
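
A sketch of the pattern table. The region size, table organization, and index hash are assumptions; the paper's predictor is a fixed-size structure trained at the end of a region's generation.

#include <cstdint>
#include <unordered_map>

constexpr int kBlocksPerRegion = 64;  // e.g., a 4 KB region of 64 B blocks

using Pattern = uint64_t;             // one bit per block in the region
std::unordered_map<uint64_t, Pattern> pattern_table;

// Index on (PC of first access to the region, block offset of that access).
uint64_t key(uint64_t pc, uint32_t first_offset) {
  return (pc << 6) ^ first_offset;
}

// Training: when a region's generation ends (its blocks get evicted or
// invalidated), record which of its blocks were touched.
void train(uint64_t pc, uint32_t first_offset, Pattern touched) {
  pattern_table[key(pc, first_offset)] = touched;
}

// Prediction: on the first access to a new region, prefetch the predicted
// blocks. The index involves no region address, which is why regions never
// touched before can still be prefetched.
Pattern predict(uint64_t pc, uint32_t first_offset) {
  auto it = pattern_table.find(key(pc, first_offset));
  return it == pattern_table.end() ? 0 : it->second;
}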

Feedback Directed Prefetching (Srinath et al., HPCA'07)

- A stream prefetcher has two parameters:
  - P, the prefetch distance: how far ahead of the stream's start do we prefetch?
  - N, the prefetch degree: how much do we advance the start when there is a hit in the stream?
- Both parameters can be varied based on prefetcher effectiveness, measured as follows (sketch below):
  - Accuracy: a bit per prefetched block tracks whether it was touched
  - Timeliness: was the block touched while still in the MSHR?
  - Pollution: track recent evictions (in a Bloom filter) and check whether they are re-touched; pollution also guides the insertion policy
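
A sketch of the feedback loop. The thresholds and step sizes are invented for illustration; the paper drives the adjustment with interval-sampled hardware counters.

#include <algorithm>

struct Knobs { int distance; int degree; };  // P and N from the slide

struct Feedback {
  double accuracy;   // fraction of prefetched blocks that were touched
  double lateness;   // fraction of useful prefetches still in the MSHR when used
  double pollution;  // fraction of demand misses caused by prefetch evictions
};

// Raise aggressiveness when prefetches are accurate but late; back off when
// they are inaccurate or polluting.
void adjust(Knobs& k, const Feedback& f) {
  if (f.accuracy > 0.75 && f.lateness > 0.05) {
    k.distance += 8;
    k.degree += 1;
  } else if (f.accuracy < 0.40 || f.pollution > 0.05) {
    k.distance = std::max(4, k.distance - 8);
    k.degree = std::max(1, k.degree - 1);
  }
}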

Dead Block Prediction

- Track the number of accesses to a line during its previous residence; the block is deemed dead after that many accesses in its current residence (Kharbutli and Solihin, IEEE TOC'08; sketch below)
- To reduce noise, an "access" can instead be counted as a block's move to the MRU position (Liu et al., MICRO'08)
- Earlier DBPs used a trace of PCs to detect when a block has completed its use
- DBP is used for energy savings, replacement policies, and cache bypassing
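
A sketch of the counter-based predictor. For simplicity the history table here is keyed by block address; practical designs key a compact table by a hashed address and/or the fetching PC.

#include <cstdint>
#include <unordered_map>

// block address -> access count observed during its previous residence
std::unordered_map<uint64_t, uint8_t> learned_count;

struct LineState {
  uint64_t addr = 0;
  uint8_t accesses = 0;  // accesses during the current residence
};

void on_access(LineState& l) { if (l.accesses < 255) ++l.accesses; }

// A block is predicted dead once it has been accessed as many times as it
// was during its previous residence; predicted-dead lines become preferred
// eviction (or bypass) candidates.
bool predicted_dead(const LineState& l) {
  auto it = learned_count.find(l.addr);
  return it != learned_count.end() && l.accesses >= it->second;
}

// On eviction, remember the count for the block's next residence.
void on_evict(const LineState& l) { learned_count[l.addr] = l.accesses; }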

Distill Cache (Qureshi et al., HPCA'07)

- Half the ways form a traditional line-organized cache (LOC); when a block is evicted from the LOC, only the touched words are stored in a word-organized cache (WOC) with many narrow ways
- Incurs a fair bit of complexity: more tags for the WOC, collection of word-touch information in the L1s, blocks with holes, etc.
- Does not need a predictor; actions are based on the block's behavior during its current residence
- Useless-word identification is orthogonal to cache compression
