Cache Replacement Championship. The 3P and 4P cache replacement policies


1 Cache Replacement Championship
The 3P and 4P cache replacement policies
Pierre Michaud, INRIA
June 20, 2010

2 Optimal replacement?
- Offline (we know the future): Belady's algorithm
- Online (we don't know the future): a problem without a solution
- On random address sequences, all online replacement policies perform equally on average
- The best online replacement policy does not exist

3 In practice
- We look for a policy that performs well on as many applications as possible
- We hope that our benchmarks are representative
- But there is no guarantee that a replacement policy will always perform well

4 The DIP replacement policy (Qureshi et al., ISCA 2007)
- Key idea #1: bimodal insertion (BIP). LRU behaves badly on cyclic access patterns; BIP tries to correct this. On a miss, insert the block in the MRU position only with probability ε = 1/32; otherwise leave it in the LRU position.
- Key idea #2: set sampling. 32 sets are dedicated to LRU and 32 sets to BIP; the best of the two policies is used in all the other sets.
- Beauty of DIP: just one counter!
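The bimodal insertion rule above can be sketched as follows. This is a minimal illustration of BIP for a single 16-way set managed as an LRU recency stack; the function name and stack representation are illustrative, not from the talk.

```python
import random

BIP_EPSILON = 1 / 32  # probability of MRU insertion, as on the slide

def bip_insert(lru_stack, block, epsilon=BIP_EPSILON):
    """Handle a miss with bimodal insertion.

    lru_stack is ordered from MRU (index 0) to LRU (last index).
    The LRU block is evicted; the new block goes to the MRU position
    only with probability epsilon, otherwise to the LRU position.
    Returns the evicted block.
    """
    victim = lru_stack.pop()          # evict the block in the LRU position
    if random.random() < epsilon:
        lru_stack.insert(0, block)    # rare: insert at MRU
    else:
        lru_stack.append(block)       # common: insert at LRU
    return victim
```

With epsilon = 1, this degenerates to plain LRU insertion; with epsilon = 1/32, a cyclic working set slightly larger than the cache keeps a useful fraction of itself resident instead of thrashing.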

5 Proposed policy
- Incrementally derived from DIP: start from a carefully tuned DIP
- Based on CLOCK instead of LRU, which needs less storage
- Combines more than 2 different insertion policies
- (New?) method for multi-policy selection

6 Carefully tuned DIP
- Cache levels use a unique line size? OK. Otherwise a (small) filter would have been needed.
- Don't update replacement information on writes: the fact that a block is evicted from a cache level does not mean that the block is likely to be accessed soon. If it is, it is by chance, not a manifestation of temporal locality.
- 28 SPEC CPU2006 benchmarks, CRC simulator, 16-way 1MB L3
- Speedup DIP / LRU: avg +2%; max +20%; min -4%

7 CLOCK DIP
- CLOCK policy: one use bit per block, one clock hand per cache set. For a 16-way cache, 16 + 4 = 20 bits per set.
- On access to a block (hit or insertion), set the use bit.
- On a miss, the hand points to a potential victim. If its use bit is set, reset it and increment the hand (mod 16); repeat until a victim is found.
- CLOCK BIP: on insertion, set the use bit with probability ε = 1/32.
- CLOCK DIP / DIP: avg +0.2%; max +1.2%; min -0.5%
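A sketch of the per-set CLOCK state described above, with the bimodal use-bit setting folded into insertion. The class and method names are illustrative; the state matches the slide's budget of one use bit per block plus a 4-bit hand per 16-way set.

```python
import random

class ClockSet:
    """One 16-way set managed by CLOCK: 16 use bits + a 4-bit hand."""
    WAYS = 16

    def __init__(self):
        self.use = [0] * self.WAYS   # one use bit per block
        self.hand = 0                # clock hand (0..15)

    def on_hit(self, way):
        self.use[way] = 1            # a hit always sets the use bit

    def find_victim(self):
        # Advance past blocks whose use bit is set, clearing them,
        # until a block with a clear use bit is found.
        while self.use[self.hand]:
            self.use[self.hand] = 0
            self.hand = (self.hand + 1) % self.WAYS
        return self.hand

    def insert(self, epsilon=1.0):
        # epsilon = 1 gives plain CLOCK insertion (always set the bit);
        # epsilon = 1/32 gives CLOCK BIP, per the slide.
        way = self.find_victim()
        self.use[way] = 1 if random.random() < epsilon else 0
        return way
```

A block inserted with its use bit clear is the hand's next candidate for eviction, which is how CLOCK BIP mimics LRU-position insertion without a recency stack.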

8 Multi-policy selection mechanism
- DIP uses a single PSEL counter: a miss in an LRU-dedicated set decrements PSEL; a miss in a BIP-dedicated set increments PSEL.
- Generalization: N policies, N counters P1, ..., PN. On a miss in a set dedicated to policy j, add N-1 to Pj and subtract 1 from all the other counters.
- Keep P1 + P2 + ... + PN = 0: if a counter would saturate, all counters stay unchanged.
- The best policy is the one with the smallest counter value.
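The generalized update rule can be sketched directly from the slide. The saturation bound PMAX is a hypothetical parameter (the talk does not give the counter width); the add-N-1 / subtract-1 update and the zero-sum invariant are as stated.

```python
PMAX = 511  # hypothetical saturation bound, not specified in the talk

def on_dedicated_miss(counters, j, pmax=PMAX):
    """A miss in a set dedicated to policy j: add N-1 to counter j,
    subtract 1 from every other counter. If any counter would
    saturate, leave all counters unchanged (the sum stays zero)."""
    n = len(counters)
    proposed = [c + (n - 1 if i == j else -1) for i, c in enumerate(counters)]
    if any(abs(c) > pmax for c in proposed):
        return counters
    return proposed

def best_policy(counters):
    # The best policy is the one with the smallest counter value,
    # i.e. the fewest misses in its dedicated sets.
    return counters.index(min(counters))
```

With N = 2 this collapses to DIP's single PSEL: +1 / -1 updates whose difference tracks which policy misses less.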

9 The 3P policy
- For a few benchmarks, neither LRU nor BIP performs well. For example, 473.astar exhibits access patterns that are approximately cyclic, but drifting relatively quickly.
- We found that, on a few benchmarks, BIP with ε = 1/2 can outperform both LRU and BIP with ε = 1/32, hence 3 policies. All three policies use the same hardware.
- For ε = 1/2, it is possible to improve MLP: instead of setting the use bit on every other insertion, set it for 64 consecutive insertions out of every 128 misses.
- 3P / CLOCK DIP: avg +0.5%; max +5.7%; min -2.1%
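The MLP-friendly variant of ε = 1/2 insertion amounts to replacing a random or alternating use-bit pattern with a bursty deterministic one, so that blocks inserted with the use bit clear (and thus soon evicted) are clustered and their misses can overlap. A minimal sketch, with illustrative names:

```python
def use_bit_for_miss(miss_count, burst=64, period=128):
    """Use-bit value for the insertion caused by miss number miss_count.

    Sets the bit for `burst` consecutive insertions out of every
    `period` misses, giving the same average rate (64/128 = 1/2) as
    setting it on every other insertion, but in bursts.
    """
    return 1 if (miss_count % period) < burst else 0
```

Over any window of 128 misses the fraction of set bits is still 1/2, so the insertion policy's average behavior is unchanged; only the grouping of the likely-to-be-evicted blocks differs.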

10 Shared-cache replacement
- Thread-unaware policies like DIP or 3P may be unfair. They are OK when threads have equal force, i.e., equal miss rates (in misses per cycle). But fragile threads (low miss rate) are penalized when they share the cache with aggressive threads (high miss rate).
- BIP is good for containing aggressive threads.
- Thread-aware bimodal insertion (TABIP): use normal insertion for fragile threads and bimodal insertion for aggressive threads.

11 TABIP: identifying fragile threads
- Heuristic: one TMISS counter per running thread, updated the same way as the policy-selection counters.
- E.g., with 4 running threads, on a miss by thread k, add 3 to TMISS[k] and subtract 1 from the TMISS of each other thread (keeping the sum of the TMISS[i] null).
- Define fragile threads as threads whose TMISS is negative.
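The fragile-thread heuristic reuses the zero-sum counter update from the policy-selection mechanism. A minimal sketch (saturation handling is omitted here for brevity; function names are illustrative):

```python
def on_thread_miss(tmiss, k):
    """Thread k missed: add N-1 to TMISS[k], subtract 1 from each
    other thread's counter, keeping the sum of all TMISS at zero."""
    n = len(tmiss)
    return [c + (n - 1 if i == k else -1) for i, c in enumerate(tmiss)]

def is_fragile(tmiss, k):
    # A thread is fragile when its TMISS is negative, i.e. it
    # misses less often than the average running thread.
    return tmiss[k] < 0
```

Fragile threads then get normal insertion while the others get bimodal insertion, which is the TABIP rule from the previous slide.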

12 The 4P policy
- 4P = 3P + CLOCK TABIP, using 4 policy-selection counters instead of 3.
- 28 SPEC CPU2006 benchmarks, CRC simulator, 16-way 4MB L3, 100 fixed random 4-thread mixes. Performance for an app = arithmetic mean of its CPIs among the 400 CPIs.
- Speedup 4P / LRU: avg +3%; max +18%; min -4.5%
- Speedup 4P / 3P: avg +1%; max +7%; min -3%
- 4P is fairer than 3P.

13 Questions?