A Comparison of Capacity Management Schemes for Shared CMP Caches

A Comparison of Capacity Management Schemes for Shared CMP Caches
Carole-Jean Wu and Margaret Martonosi, Princeton University
7th Annual WDDD, 6/22/2008

Motivation
[Diagram: cores P0, P1, ..., Pn, each with a private L1, sharing a last-level on-chip cache]
- Heterogeneous workloads: web servers, video streaming, graphics-intensive, scientific, data mining, security scans, file/database
- Core counts are scaling up, so the shared cache becomes highly contested
- LRU replacement is not enough: it makes no distinction between process priority and applications' memory needs

When there is no capacity management
[Two charts: Cycles per Instruction (CPI) for 7 instances of mcf, and for libquantum]

This paper
- Offers an extensive and detailed study of two shared-resource management schemes:
  - Way-partitioned management [D. Chiou, MIT PhD thesis '99]
  - Decay-based management [Petoumenos et al., IEEE Workload Characterization '06]
- Demonstrates the potential benefits of each management scheme: cache space utilization, performance, flexibility and scalability

Outline
- Motivation
- Shared Cache Capacity Management
- Experimental Setup and Evaluation
- Related Work
- Conclusion

Shared Cache Capacity Management
Apportioning shared cache resources among multiple processor cores:
- Way-Partitioned Management [D. Chiou, MIT PhD thesis '99]
- Decay-Based Management [Petoumenos et al., IEEE Workload Characterization '06]

Way-Partitioned Management
Statically allocate a number of L2 cache ways to each process.
[Diagram: a 4-way set-associative cache with ways assigned to different processes]
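
A minimal sketch of how way-partitioned allocation can work, assuming a per-process way range and LRU within it; the interface and names are illustrative, not the authors' implementation:

```c
/* Way-partitioned victim selection: each process is granted a fixed range
 * of ways; on a miss, the victim is chosen only among that process's own
 * ways, so processes cannot evict each other's lines. */
#include <stdint.h>

#define NUM_WAYS 4              /* matches the 4-way example above */

typedef struct {
    uint64_t tag;
    uint64_t last_access;       /* timestamp for LRU within the partition */
    int      valid;
} Line;

typedef struct {
    int first_way;              /* start of this process's way range */
    int num_ways;               /* ways statically allocated to it */
} Partition;

/* Pick the LRU line, but only among the ways allocated to `proc`. */
int select_victim(const Line set[NUM_WAYS], const Partition *proc)
{
    int victim = proc->first_way;
    for (int w = proc->first_way; w < proc->first_way + proc->num_ways; w++) {
        if (!set[w].valid)
            return w;           /* prefer an empty way */
        if (set[w].last_access < set[victim].last_access)
            victim = w;         /* otherwise LRU within the partition */
    }
    return victim;
}
```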

How do applications benefit from cache size and set-associativity?
[Chart: Number of Ways Allocated (out of a 16-way set-associative, 4MB L2 cache) vs. L2 Miss Rate (%), for MCF and SJENG]
- Some applications are more sensitive than others to the number of cache ways (cache resource) allocated to them.
- Miss rates improve as the number of allocated cache ways increases.
- Way allocation can be used to achieve performance predictability.

Prior Work: Cache Decay for Leakage Power Management
[Diagram: two distinct memory addresses mapped to the same cache set; an access pattern over time, M H H H H H H H M, i.e., multiple accesses in a short time, where M = miss and H = hit]
- A timer per cache line: if the line is accessed frequently, maintain power, resetting the timer with every access.
- If the line is not accessed for a long time (timer reaches the decay interval), switch off its Vdd.
- A decayed line is re-powered on the next access to it.
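
A minimal sketch of the per-line decay timer described above; the field names and the tick granularity are assumptions, not the authors' design:

```c
/* Cache decay for leakage power: a coarse global tick advances every
 * line's timer; any access resets it. A line idle for a full decay
 * interval has its Vdd gated off to save leakage power; the next access
 * re-powers it, at the cost of refetching the data. */
#include <stdbool.h>
#include <stdint.h>

enum { DECAY_INTERVAL = 8 };    /* in coarse ticks; illustrative value */

typedef struct {
    uint16_t timer;             /* ticks since the last access */
    bool     powered;           /* is Vdd on for this line? */
    bool     valid;
} DecayLine;

void on_access(DecayLine *l)
{
    l->timer = 0;               /* frequent accesses keep the line powered */
    if (!l->powered) {          /* access to a decayed line: re-power it; */
        l->powered = true;      /* its contents were lost, so this is a miss */
        l->valid   = false;
    }
}

void on_global_tick(DecayLine *l)
{
    if (l->powered && ++l->timer >= DECAY_INTERVAL) {
        l->powered = false;     /* idle too long: switch off Vdd */
        l->valid   = false;     /* the stored data is gone */
    }
}
```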

Decay for Capacity Management
- When a line's decay counter reaches 0, the line becomes an immediate candidate for replacement, even if it is NOT the LRU line.
- Decay counters are set on a per-process basis:
  - Long decay interval: high-priority process.
  - Short decay interval: low-priority process, so its cache lines are evicted more frequently.
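
A minimal sketch of decay-driven victim selection under these rules, assuming a per-process interval table and a per-line decayed flag; this is illustrative structure, not the authors' hardware. Note that, unlike leakage decay, a "decayed" line here keeps its data and merely loses its claim on the cache:

```c
/* Decay for capacity management: a decayed line is replaced first, even if
 * it is not the LRU line. Decay intervals are per process: long (or none)
 * for high priority, short for low priority. */
#include <stdbool.h>
#include <stdint.h>

#define NUM_PROCS 16

typedef struct {
    uint64_t tag, last_access;
    uint32_t timer;             /* ticks since this line was last touched */
    uint8_t  owner;             /* process that brought the line in */
    bool     valid, decayed;
} CLine;

/* Per-process decay interval, e.g. UINT32_MAX for "no decay" (high
 * priority) and a small value for a low-priority process. */
static uint32_t decay_interval[NUM_PROCS];

void decay_tick(CLine *l)
{
    if (l->valid && !l->decayed && ++l->timer >= decay_interval[l->owner])
        l->decayed = true;      /* immediate replacement candidate */
}

int select_victim_decay(const CLine set[], int num_ways)
{
    int lru = 0;
    for (int w = 0; w < num_ways; w++) {
        if (!set[w].valid || set[w].decayed)
            return w;           /* decayed line goes first, even if not LRU */
        if (set[w].last_access < set[lru].last_access)
            lru = w;
    }
    return lru;                 /* no decayed line: fall back to LRU */
}
```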

Outline
- Motivation
- Shared Cache Capacity Management
- Experimental Setup and Evaluation
- Related Work
- Conclusion

Experimental Setup
Simulation framework:
- GEMS full-system simulator [Simics + Ruby]
- 16-core multiprocessor on the SPARC architecture running an unmodified Solaris 10 operating system
Workload:
- SPEC2006 CINT benchmark suite [program initialization is included]
[Diagram: cores P0, P1, ..., P15, each with a private L1, sharing the L2 cache; off-chip memory below]
Memory hierarchy:
- Private L1: 32KB each; 4-way; 64B cache line
- Shared L2: 4MB; 16-way; 64B cache line
- L1 miss latency: 20 cycles; L2 miss latency: 400 cycles
- MESI directory protocol between L1 and L2

Evaluation
Mechanisms:
- Baseline: no cache capacity management
- Way-partitioned management
- Decay-based management
Scenarios:
- High contention
- General Workload 1: constraining one memory-intensive application
- General Workload 2: protecting a high-priority application (refer to the paper)

High Contention Scenario
[Chart: Cycles per Instruction (CPI) for 7 instances of mcf plus libquantum; configurations: Alone; No Management; Way-Partitioned Management [libquantum: 2 ways; others: 14 ways]; Decay-Based Management [libquantum: no decay; others: 1M]. Bar values: 1.29, 1.0, 0.84, 0.85, 0.87, 0.74, 0.67, 0.55]
- No management: the applications take turns evicting each other's cache lines.

Cache Space Distribution
[Charts: Cache Occupancy (%) over 2.2x10^9 cycles under No Management (showing memory interference), Way-Partitioned Management (libquantum: 2 ways; mcf+OS: 14 ways), and Decay-Based Management (libquantum: no decay; mcf+OS: 1K)]

Constraining a Memory-Intensive Application
[Chart: Cycles per Instruction (CPI) per application; configurations: Alone; No Management; Way-Partitioned Management [mcf: 4 ways; others: 12 ways]; Decay-Based Management [mcf: 1M; others: no decay]]
- Way-partitioning's coarse-granularity control trades off 5% performance for mcf for an average 1% performance improvement for the rest.
- Decay-based management costs mcf only 2% and improves the others by 3%, because of its fine-grained control and improved ability to exploit the data's temporal locality.

Cache Space Distribution
[Charts: Cache Occupancy (%) over 2.2x10^9 cycles under No Management, Way-Partitioned Management (mcf: 4 ways; others: 12 ways), and Decay-Based Management (mcf: 1M; others: no decay)]

[Charts: per-application cache space under No Management, Way-Partitioning, and Decay-Based management]
- The cache space of mcf is constrained to 4 ways.
- libquantum gets more of the shared L2 cache space.
- More cache space is used by gcc under way-partitioning and decay-based management.

Outline
- Motivation
- Shared Cache Capacity Management
- Experimental Setup and Evaluation
- Related Work
- Conclusion

Related Work
- Priority classification and enforcement to achieve differentiated QoS [Iyer, ICS '04]
- Architectural support for optimizing the performance of a high-priority application with minimal performance degradation, based on QoS policies [Iyer et al., SIGMETRICS '07]
- Performance metrics, such as miss rates, bandwidth usage, IPC, and fairness, to assist resource allocation [Hsu et al., PACT '06]
- Resource allocation fairness mechanisms in virtual private caches, where the capacity manager implements way-partitioning [Nesbit et al., ISCA '07]
Further cache fairness policies can be incorporated into both capacity management schemes discussed in this work.

Related Work: Dynamic Cache Capacity Management
- The OS distributes an equal amount of cache space to all running processes, keeps statistics on the fly, and dynamically adjusts the cache space distribution [Suh et al., HPCA '02; Kim et al., PACT '04; Qureshi and Patt, MICRO '06]
- Adaptive set pinning to eliminate inter-process misses [Srikantaiah et al., ISCA '08]
- A statistical model to predict thread behaviors, and capacity management through decay [Petoumenos et al., IEEE Workload Characterization '06]
To the best of our knowledge, there has been no prior work based on decay management that takes full-system effects into account.

Conclusion
Way-Partitioned Management
- Advantages: simple hardware complexity; straightforward technique; great performance isolation
- Drawbacks: preferably, the number of cache ways >= the number of concurrent processes; coarse granularity in space allocation leads to inefficient space utilization
Decay-Based Management
- Advantages: fine-granularity control gives more effective space utilization; the data remaining in the cache is high priority and has good temporal locality
- Drawbacks: more complex hardware

Hardware Overhead: Way-Partitioning
[Diagram: the set index feeds a MUX; the process ID selects the cache ways for tag comparison; the result of the tag comparison drives the data MUX]
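
A minimal sketch of the lookup path the diagram suggests, with assumed semantics: the process ID indexes a small table of way masks, and only the enabled ways take part in tag comparison. Names and structure are illustrative, not the authors' RTL:

```c
/* Way-masked tag lookup: the process ID selects which ways are compared,
 * and the matching way (if any) drives the data MUX. */
#include <stdint.h>

#define NUM_WAYS  16
#define NUM_PROCS 16

typedef struct { uint64_t tag; int valid; } TagEntry;

static uint16_t way_mask[NUM_PROCS];   /* one bit per way, set by software */

/* Compare the tag only in the ways enabled for `pid`; -1 means miss. */
int tag_lookup(const TagEntry set[NUM_WAYS], int pid, uint64_t tag)
{
    uint16_t mask = way_mask[pid];
    for (int w = 0; w < NUM_WAYS; w++)
        if (((mask >> w) & 1) && set[w].valid && set[w].tag == tag)
            return w;                  /* selects the data MUX input */
    return -1;
}
```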

What happens to the replaced lines?
[Diagram: cores P0, P1, ..., P15 with private L1s above the shared L2 cache]
Lines replaced in the L2 are evicted without evicting the L1 copies. This works because L1 and L2 cache blocks are the same size: 64 bytes.

Cache Space Distribution
[Charts: Cache Occupancy (%) under No Management, Way-Partitioned Management (lbm: 4 ways; others: 12 ways), and Decay-Based Management (lbm: no decay; others: 1K)]
4 ways out of the 16-way set-associative cache are allocated to lbm for its exclusive access.

Related Work: Iyer's QoS Shared Cache Capacity Management

Decay-Based Management Example
[Diagram: a 4-way set-associative cache and memory controller replaying the reference stream E A B C D E A B C ...]
- A, C, and E come from the HIGH-PRIORITY process -> no decay.
- B and D come from the LOW-PRIORITY process -> decay (D decays, then B decays).
- With decay-based management, 5 out of 9 accesses are hits, and all 5 hits belong to the high-priority process.
- With LRU, there are NO HITS at all.
- LRU captures temporal behavior only; decay-based management captures both process priority and temporal behavior.
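
The full reference stream did not survive transcription, so the runnable sketch below is an assumption, not the authors' experiment: it replays a 9-access stream E A B C D E A B C through a single 4-way set, with an illustratively short decay interval for the low-priority blocks B and D. It reproduces the qualitative result on the slide: LRU thrashes and gets no hits, while decay-based replacement hits only on the high-priority blocks A, C, and E.

```c
/* One-set comparison of plain LRU vs. priority-decay replacement.
 * Blocks A, C, E belong to a high-priority process (never decay);
 * B, D belong to a low-priority process (decay quickly). */
#include <stdio.h>
#include <string.h>

#define WAYS 4
enum { HIGH, LOW };
enum { LOW_DECAY = 1 };   /* a low-priority line decays after one idle access */

typedef struct { char tag; int idle; int prio; int valid; } Line;

static int prio_of(char b) { return (b == 'B' || b == 'D') ? LOW : HIGH; }

/* Replay `stream` through one 4-way set; returns the number of hits. */
static int run(const char *stream, int use_decay)
{
    Line set[WAYS];
    memset(set, 0, sizeof set);
    int hits = 0;
    for (const char *p = stream; *p; p++) {
        int hit = -1, victim = -1, lru = 0;
        for (int w = 0; w < WAYS; w++) {
            if (set[w].valid && set[w].tag == *p) hit = w;
            if (!set[w].valid && victim < 0) victim = w;   /* empty way first */
            if (set[w].idle > set[lru].idle) lru = w;      /* most idle = LRU */
        }
        if (use_decay && victim < 0)                       /* a decayed low-   */
            for (int w = 0; w < WAYS; w++)                 /* priority line is */
                if (set[w].prio == LOW && set[w].idle >= LOW_DECAY)
                    { victim = w; break; }                 /* taken first,     */
        if (victim < 0) victim = lru;                      /* else plain LRU   */
        for (int w = 0; w < WAYS; w++) set[w].idle++;      /* age every line...  */
        if (hit >= 0) { hits++; set[hit].idle = 0; }       /* ...except the hit  */
        else set[victim] = (Line){ *p, 0, prio_of(*p), 1 };
    }
    return hits;
}

int main(void)
{
    const char *stream = "EABCDEABC";   /* A, C, E high priority; B, D low */
    printf("plain LRU : %d hits of %zu\n", run(stream, 0), strlen(stream));
    printf("with decay: %d hits of %zu\n", run(stream, 1), strlen(stream));
    return 0;
}
```

With this stream, LRU's five-block working set cycles through the four ways and never hits, while decay keeps A, C, and E resident so their reuses hit.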