Exploiting Compressed Block Size as an Indicator of Future Reuse
1 Exploiting Compressed Block Size as an Indicator of Future Reuse
Gennady Pekhimenko, Tyler Huberty, Rui Cai, Onur Mutlu, Todd C. Mowry, Phillip B. Gibbons, Michael A. Kozuch
2 Executive Summary
In a compressed cache, compressed block size is an additional dimension.
Problem: How can we maximize cache performance utilizing both block reuse and compressed size?
Observations: the importance of a cache block depends on its compressed size, and block size is indicative of reuse behavior in some applications.
Key Idea: Use compressed block size in making cache replacement and insertion decisions.
Results: higher performance (9%/11% on average for 1-/2-core), lower energy (12% on average).
3 Potential for Data Compression
Compression algorithms for on-chip caches, by decompression latency and compression ratio:
- Frequent Pattern Compression (FPC) [Alameldeen+, ISCA 04]: 5-cycle decompression
- C-Pack [Chen+, Trans. on VLSI 10]: 8-14-cycle decompression
- Statistical Compression (SC2) [Arelakis+, ISCA 14]: 9-cycle decompression
- Base-Delta-Immediate (BDI) [Pekhimenko+, PACT 12]: 1-2-cycle decompression
All these works showed significant performance improvements, and all use the LRU replacement policy. Can we do better?
4 Size-Aware Cache Management
1. Compressed block size matters
2. Compressed block size varies
3. Compressed block size can indicate reuse
5 #1: Compressed Block Size Matters
Example: a cache that can hold either one large block (X) or several small blocks (A, Y, B, C), with access stream X A Y B C repeated over time. Belady's OPT replacement, being size-unaware, keeps the large block and misses on small blocks it could have retained; a size-aware replacement evicts the large block to make room for several small ones, turning misses into hits and saving cycles.
Size-aware policies could yield fewer misses.
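To make the saved-misses intuition concrete, here is a minimal toy sketch (not the paper's mechanism; the `Cache` struct, the bypass rule, and the 84-byte budget are illustrative assumptions): a fully-associative set with a byte budget, comparing plain LRU against a size-aware variant that declines to cache blocks larger than half the capacity.

```cpp
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Toy byte-budget cache set; front of the deque is the LRU block.
struct Cache {
    int capacity, used = 0, misses = 0;
    bool size_aware;  // if true, bypass blocks larger than half the capacity
    std::deque<std::pair<std::string, int>> blocks;  // (tag, size in bytes)

    Cache(int cap, bool aware) : capacity(cap), size_aware(aware) {}

    void access(const std::string& tag, int size) {
        for (auto it = blocks.begin(); it != blocks.end(); ++it) {
            if (it->first == tag) {           // hit: move block to MRU position
                auto blk = *it;
                blocks.erase(it);
                blocks.push_back(blk);
                return;
            }
        }
        ++misses;
        if (size_aware && 2 * size > capacity) return;  // don't cache huge blocks
        while (used + size > capacity) {      // evict from the LRU end until it fits
            used -= blocks.front().second;
            blocks.pop_front();
        }
        blocks.push_back({tag, size});
        used += size;
    }
};

// Replay an access trace and return the miss count.
int run_trace(Cache& c, const std::vector<std::pair<std::string, int>>& trace) {
    for (const auto& a : trace) c.access(a.first, a.second);
    return c.misses;
}
```

On the stream X A Y B C repeated twice (X is a 64-byte block, the others compress to 21 bytes, 84-byte budget), the size-unaware cache misses on all 10 accesses because inserting X repeatedly flushes the small blocks, while the size-aware variant hits on all four small blocks in the second round.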
6 #2: Compressed Block Size Varies
Cache coverage by compressed block size (ranges 0-7, 8-15, ..., up to 64 bytes) under the BDI compression algorithm [Pekhimenko+, PACT 12] for mcf, lbm, astar, tpch2, calculix, h264ref, wrf, povray, and gcc: compressed block size varies within and between applications.
7 #3: Block Size Can Indicate Reuse
Reuse distance (in memory accesses) plotted against compressed block size for bzip2: different sizes have different dominant reuse distances.
8 Code Example to Support Intuition

    int A[N];      // small indices: compressible, long reuse
    double B[16];  // FP coefficients: incompressible, short reuse
    for (int i = 0; i < N; i++) {
        int idx = A[i];
        for (int j = 0; j < N; j++) {
            sum += B[(idx + j) % 16];
        }
    }

A's small indices compress well and have long reuse distances; B's floating-point coefficients do not compress and have short reuse distances. Compressed size can be an indicator of reuse distance.
9 Size-Aware Cache Management
1. Compressed block size matters
2. Compressed block size varies
3. Compressed block size can indicate reuse
10 Outline
Motivation
Key Observations
Key Ideas: Minimal-Value Eviction (MVE), Size-based Insertion Policy (SIP)
Evaluation
Conclusion
11 Compression-Aware Management Policies (CAMP)
MVE: Minimal-Value Eviction
SIP: Size-based Insertion Policy
12 Minimal-Value Eviction (MVE): Observations
Define a value (importance) of each block to the cache: a block's importance depends on both its likelihood of reference and its compressed size. Rather than ordering blocks from MRU (highest priority) to LRU (lowest), order them by value and evict the block with the lowest value.
13 Minimal-Value Eviction (MVE): Key Idea
Value = Probability of reuse / Compressed block size
The probability of reuse can be determined in many different ways; our implementation is based on the Re-Reference Interval Prediction value (RRIP [Jaleel+, ISCA 10]).
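A minimal sketch of this value computation (the probabilities below are illustrative stand-ins for what RRIP-style prediction would supply, and the struct and field names are assumptions, not the paper's hardware):

```cpp
#include <cstddef>
#include <vector>

// MVE value function: a block's value to the cache is its probability of
// reuse divided by its compressed size, i.e., expected benefit per byte.
struct Block {
    double p_reuse;   // estimated probability of reuse (e.g., from RRIP state)
    int size_bytes;   // compressed block size
};

double value(const Block& b) { return b.p_reuse / b.size_bytes; }

// Pick the victim: the block with the lowest value in the set.
std::size_t pick_victim(const std::vector<Block>& set) {
    std::size_t victim = 0;
    for (std::size_t i = 1; i < set.size(); ++i)
        if (value(set[i]) < value(set[victim])) victim = i;
    return victim;
}
```

Note that a large block can become the victim even when its reuse probability is the highest in the set: with blocks {p=0.2, 8B} and {p=0.9, 64B}, the 64-byte block has the lower per-byte value (0.9/64 ≈ 0.014 < 0.2/8 = 0.025) and is evicted.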
14 Compression-Aware Management Policies (CAMP)
MVE: Minimal-Value Eviction
SIP: Size-based Insertion Policy
15 Size-based Insertion Policy (SIP): Observations
Sometimes there is a relation between the compressed block size and the reuse distance (data structure → compressed block size → reuse distance). This relation can be detected through the compressed block size, and tracking it has minimal overhead because compressed block information is already part of the design.
16 Size-based Insertion Policy (SIP): Key Idea
Insert blocks of a certain size with higher priority if this improves the cache miss rate. Use dynamic set sampling [Qureshi+, ISCA 06] to detect which size to insert with higher priority: a few main-tag sets get auxiliary tags that prioritize a particular size (e.g., 8B or 64B), misses in the competing versions update a counter (CTR), and the counter decides the policy for the steady state.
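The counter-based decision can be sketched as follows (a simplified single-counter version for one candidate size class; the struct and its names are assumptions for illustration, not the paper's exact hardware):

```cpp
// SIP decision via set sampling: some sampled sets insert a candidate size
// class (e.g., 8B blocks) with high priority, others use the baseline policy.
// One counter accumulates the miss difference between the two groups; its
// sign picks the steady-state policy for all remaining (follower) sets.
struct SIPSampler {
    int ctr = 0;

    // Called on every miss in a sampled set.
    void on_sampled_miss(bool set_prioritizes_size) {
        if (set_prioritizes_size) --ctr;  // size-priority sets missing: evidence against
        else ++ctr;                       // baseline sets missing: evidence for
    }

    // Steady-state decision applied by the follower sets.
    bool prioritize_size_class() const { return ctr > 0; }
};
```

If the baseline sampled sets take 5 misses while the size-prioritizing sets take only 2, the counter ends at +3 and the follower sets adopt the size-prioritizing insertion.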
17 Outline
Motivation
Key Observations
Key Ideas: Minimal-Value Eviction (MVE), Size-based Insertion Policy (SIP)
Evaluation
Conclusion
18 Methodology
Simulator: x86 event-driven simulator (MemSim [Seshadri+, PACT 12])
Workloads: SPEC2006 benchmarks, TPC, Apache web server; 1-4 core simulations of 1 billion representative instructions
System parameters: L1/L2/L3 cache latencies from CACTI; BDI with 1-cycle decompression [Pekhimenko+, PACT 12]; 4GHz x86 in-order core; cache sizes from 1MB to 16MB
19-21 Evaluated Cache Management Policies
Design | Description
LRU    | Baseline LRU policy (size-unaware)
RRIP   | Re-Reference Interval Prediction [Jaleel+, ISCA 10] (size-unaware)
ECM    | Effective Capacity Maximizer [Baek+, HPCA 13] (size-aware)
CAMP   | Compression-Aware Management Policies (MVE + SIP) (size-aware)
22 Size-Aware Replacement
Effective Capacity Maximizer (ECM) [Baek+, HPCA 13] inserts big blocks with lower priority and uses a heuristic to define the size threshold.
Shortcomings: coarse-grained; no relation between block size and reuse; not easily applicable to other cache organizations.
23 CAMP Single-Core
Normalized IPC over LRU for RRIP, ECM, MVE, SIP, and CAMP on SPEC2006, database, and web workloads with a 2MB L2 cache (improvements of up to 31% on individual workloads): CAMP improves performance over LRU, RRIP, and ECM (8% GMean on memory-intensive workloads, 5% GMean over all).
24 Multi-Core Results
Classification based on the compressed block size distribution: homogeneous (1 or 2 dominant block sizes) vs. heterogeneous (more than 2 dominant sizes). We form three different 2-core workload groups (20 workloads in each): Homo-Homo, Homo-Hetero, and Hetero-Hetero.
25 Multi-Core Results (2)
Normalized weighted speedup of RRIP, ECM, MVE, SIP, and CAMP for the Homo-Homo, Homo-Hetero, and Hetero-Hetero groups (and their GeoMean): CAMP outperforms the LRU, RRIP, and ECM policies, with higher benefits at higher heterogeneity in size.
26 Effect on Memory Subsystem Energy
Normalized energy of the L1/L2 caches, DRAM, NoC, and compression/decompression for RRIP, ECM, and CAMP across the single-core workloads: CAMP reduces energy consumption in the memory subsystem (5% on average).
27 CAMP for Global Replacement
CAMP with the V-Way [Qureshi+, ISCA 05] cache design: G-CAMP.
G-MVE: Global Minimal-Value Eviction (Reuse Replacement instead of RRIP)
G-SIP: Global Size-based Insertion Policy (Set Dueling instead of Set Sampling)
28 G-CAMP Single-Core
Normalized IPC over LRU for RRIP, V-Way, G-MVE, G-SIP, and G-CAMP on SPEC2006, database, and web workloads with a 2MB L2 cache (improvements of 63%-97% on individual workloads): G-CAMP improves performance over LRU, RRIP, and V-Way (14% GMean on memory-intensive workloads, 9% GMean over all).
29 Storage Bit Count Overhead
Overhead over an uncompressed cache with LRU, for designs with BDI:
           LRU    CAMP   V-Way  G-CAMP
tag-entry  +14b   +14b   +15b   +19b
data-entry +0b    +0b    +16b   +32b
total      +9%    +11%   +12.5% +17%
That is, CAMP adds 2% over the compressed LRU baseline, and G-CAMP adds 4.5% over V-Way.
30 Cache Size Sensitivity
Normalized IPC for LRU, RRIP, ECM, CAMP, V-Way, and G-CAMP at 1MB, 2MB, 4MB, 8MB, and 16MB cache sizes: CAMP/G-CAMP outperforms the prior policies at all cache sizes, and G-CAMP outperforms LRU with 2X the cache size.
31 Other Results and Analyses in the Paper
- Block size distribution for different applications
- Sensitivity to the compression algorithm
- Comparison with an uncompressed cache
- Effect on cache capacity
- SIP vs. PC-based replacement policies
- More details on multi-core results
32 Conclusion
In a compressed cache, compressed block size is an additional dimension.
Problem: How can we maximize cache performance utilizing both block reuse and compressed size?
Observations: the importance of a cache block depends on its compressed size, and block size is indicative of reuse behavior in some applications.
Key Idea: Use compressed block size in making cache replacement and insertion decisions.
Two techniques: Minimal-Value Eviction (MVE) and Size-based Insertion Policy (SIP).
Results: higher performance (9%/11% on average for 1-/2-core), lower energy (12% on average).
33 Exploiting Compressed Block Size as an Indicator of Future Reuse
Gennady Pekhimenko, Tyler Huberty, Rui Cai, Onur Mutlu, Todd C. Mowry, Phillip B. Gibbons, Michael A. Kozuch
34 Backup Slides
35 Potential for Data Compression
Compression algorithms for on-chip caches:
- Frequent Pattern Compression (FPC) [Alameldeen+, ISCA 04]: performance +15%
- C-Pack [Chen+, Trans. on VLSI 10]: compression ratio 1.64
- Base-Delta-Immediate (BDI) [Pekhimenko+, PACT 12]: performance +11.2%, compression ratio 1.53
- Statistical Compression (SC2) [Arelakis+, ISCA 14]: performance +11%
36 Potential for Data Compression (2)
Better compressed cache management:
- Decoupled Compressed Cache (DCC) [MICRO 13]
- Skewed Compressed Cache [MICRO 14]
37 #3: Block Size Can Indicate Reuse
Reuse distance plotted against compressed block size for soplex: different sizes have different dominant reuse distances.
38 Conventional Cache Compression
A conventional 2-way cache stores two tags (Tag 0, Tag 1) and two 32-byte data lines (Data 0, Data 1) per set. A compressed 4-way cache with 8-byte segmented data keeps twice as many tags (Tag 0-Tag 3) per set, adds compression-encoding bits to the tags, and lets each tag map to multiple adjacent 8-byte segments (S0-S7).
39 CAMP = MVE + SIP
Minimal-Value Eviction (MVE): value function = probability of reuse / compressed block size, with the probability of reuse based on RRIP [Jaleel+, ISCA 10].
Size-based Insertion Policy (SIP): dynamically prioritizes blocks based on their compressed size (e.g., inserts into the MRU position for the LRU policy).
40 Overhead Analysis
41 CAMP and Global Replacement
CAMP with the V-Way cache design: per-set tag entries (status, tag, forward pointer fptr, compression encoding) are decoupled from the data store, whose 64-byte lines are divided into 8-byte segments with a reverse pointer (rptr), valid+size bits (v+s), and reuse counters (R0-R3) per data entry.
42 G-MVE
Two major changes:
1. Probability of reuse based on the Reuse Replacement policy: on insertion, a block's counter is set to zero; on a hit, the block's counter is incremented by one, indicating its reuse.
2. Global replacement using a pointer (PTR) to a reuse counter entry: starting at the entry PTR points to, the reuse counters of 64 valid data entries are scanned, decrementing each non-zero counter.
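The pointer-based Reuse Replacement scan can be sketched like this (a simplified version: it walks one entry at a time rather than 64 per step, and the struct and member names are illustrative assumptions):

```cpp
#include <cstddef>
#include <vector>

// Sketch of the Reuse Replacement scan used by G-MVE: starting at a global
// pointer, walk the reuse counters, decrementing each non-zero counter,
// until a zero counter is found; that entry becomes the victim.
struct ReuseReplacer {
    std::vector<int> counter;  // one reuse counter per data entry
    std::size_t ptr = 0;       // global replacement pointer

    explicit ReuseReplacer(std::size_t n) : counter(n, 0) {}

    void on_insert(std::size_t i) { counter[i] = 0; }  // new block: no reuse yet
    void on_hit(std::size_t i)    { ++counter[i]; }    // reuse observed

    std::size_t pick_victim() {
        for (;;) {
            std::size_t i = ptr;
            ptr = (ptr + 1) % counter.size();
            if (counter[i] == 0) return i;  // no recent reuse: evict this entry
            --counter[i];                   // otherwise give it another chance
        }
    }
};
```

The decrements guarantee the scan terminates: every pass over a non-zero counter brings it closer to zero, so entries that stop being reused eventually become victims.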
43 G-SIP
Use set dueling instead of set sampling (as in SIP) to decide the best insertion policy:
- Divide the data store into 8 regions and block sizes into 8 ranges
- During training, for each region, insert blocks of a different size range with higher priority
- Count cache misses per region
- In steady state, insert blocks with the sizes of the top-performing regions with higher priority in all regions
44 Size-based Insertion Policy (SIP): Key Idea
Insert blocks of a certain size with higher priority (at the highest-priority rather than the lowest-priority position in the set) if this improves the cache miss rate.
45 Evaluated Cache Management Policies
Design | Description
LRU    | Baseline LRU policy
RRIP   | Re-Reference Interval Prediction [Jaleel+, ISCA 10]
ECM    | Effective Capacity Maximizer [Baek+, HPCA 13]
V-Way  | Variable-Way Cache [Qureshi+, ISCA 05]
CAMP   | Compression-Aware Management (Local)
G-CAMP | Compression-Aware Management (Global)
46-47 Performance and MPKI Correlation
Performance improvement / MPKI reduction relative to each baseline:
Mechanism | vs. LRU        | vs. RRIP      | vs. ECM       | vs. V-Way
CAMP      | 8.1% / -13.3%  | 2.7% / -5.6%  | 2.1% / -5.9%  | N/A
G-CAMP    | 14.0% / -21.9% | 8.3% / -15.1% | 7.7% / -15.3% | 4.9% / -8.7%
CAMP's performance improvement correlates with its MPKI reduction.
48 Multi-Core Results (2)
Normalized weighted speedup of RRIP, V-Way, G-MVE, G-SIP, and G-CAMP for the Homo-Homo, Homo-Hetero, and Hetero-Hetero groups (and their GeoMean): G-CAMP outperforms the LRU, RRIP, and V-Way policies.
49 Effect on Memory Subsystem Energy
Normalized energy of the L1/L2 caches, DRAM, NoC, and compression/decompression for RRIP, ECM, CAMP, V-Way, and G-CAMP: CAMP/G-CAMP reduces energy consumption in the memory subsystem (15.1% for G-CAMP).
More informationA Best-Offset Prefetcher
A Best-Offset Prefetcher Pierre Michaud Inria pierre.michaud@inria.fr The Best-Offset (BO) prefetcher submitted to the DPC contest prefetches one line into the level-two (L) cache on every cache miss or
More informationRelative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review
Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Bijay K.Paikaray Debabala Swain Dept. of CSE, CUTM Dept. of CSE, CUTM Bhubaneswer, India Bhubaneswer, India
More informationarxiv: v1 [cs.ar] 1 Feb 2016
SAFARI Technical Report No. - Enabling Efficient Dynamic Resizing of Large DRAM Caches via A Hardware Consistent Hashing Mechanism Kevin K. Chang, Gabriel H. Loh, Mithuna Thottethodi, Yasuko Eckert, Mike
More informationHybrid Cache Architecture (HCA) with Disparate Memory Technologies
Hybrid Cache Architecture (HCA) with Disparate Memory Technologies Xiaoxia Wu, Jian Li, Lixin Zhang, Evan Speight, Ram Rajamony, Yuan Xie Pennsylvania State University IBM Austin Research Laboratory Acknowledgement:
More informationExploring Latency-Power Tradeoffs in Deep Nonvolatile Memory Hierarchies
Exploring Latency-Power Tradeoffs in Deep Nonvolatile Memory Hierarchies Doe Hyun Yoon, Tobin Gonzalez, Parthasarathy Ranganathan, and Robert S. Schreiber Intelligent Infrastructure Lab (IIL), Hewlett-Packard
More informationA Heterogeneous Multiple Network-On-Chip Design: An Application-Aware Approach
A Heterogeneous Multiple Network-On-Chip Design: An Application-Aware Approach Asit K. Mishra Intel Corporation Hillsboro, OR 97124, USA asit.k.mishra@intel.com Onur Mutlu Carnegie Mellon University Pittsburgh,
More informationStaged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems
Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems Rachata Ausavarungnirun Kevin Kai-Wei Chang Lavanya Subramanian Gabriel H. Loh Onur Mutlu Carnegie Mellon University
More informationPIPELINING AND PROCESSOR PERFORMANCE
PIPELINING AND PROCESSOR PERFORMANCE Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 1, John L. Hennessy and David A. Patterson, Morgan Kaufmann,
More informationLoop-Oriented Array- and Field-Sensitive Pointer Analysis for Automatic SIMD Vectorization
Loop-Oriented Array- and Field-Sensitive Pointer Analysis for Automatic SIMD Vectorization Yulei Sui, Xiaokang Fan, Hao Zhou and Jingling Xue School of Computer Science and Engineering The University of
More informationContinuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads
Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads Milad Hashemi, Onur Mutlu, Yale N. Patt The University of Texas at Austin ETH Zürich ABSTRACT Runahead execution pre-executes
More informationEKT 303 WEEK Pearson Education, Inc., Hoboken, NJ. All rights reserved.
+ EKT 303 WEEK 2 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved. Chapter 2 + Performance Issues + Designing for Performance The cost of computer systems continues to drop dramatically,
More informationCloudCache: Expanding and Shrinking Private Caches
CloudCache: Expanding and Shrinking Private Caches Hyunjin Lee, Sangyeun Cho, and Bruce R. Childers Computer Science Department, University of Pittsburgh {abraham,cho,childers}@cs.pitt.edu Abstract The
More informationSEN361 Computer Organization. Prof. Dr. Hasan Hüseyin BALIK (2 nd Week)
+ SEN361 Computer Organization Prof. Dr. Hasan Hüseyin BALIK (2 nd Week) + Outline 1. Overview 1.1 Basic Concepts and Computer Evolution 1.2 Performance Issues + 1.2 Performance Issues + Designing for
More informationCONTENT-AWARE SPIN-TRANSFER TORQUE MAGNETORESISTIVE RANDOM-ACCESS MEMORY (STT-MRAM) CACHE DESIGNS
CONTENT-AWARE SPIN-TRANSFER TORQUE MAGNETORESISTIVE RANDOM-ACCESS MEMORY (STT-MRAM) CACHE DESIGNS By QI ZENG A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
More informationA Cache Reconfiguration Approach for Saving Leakage and Refresh Energy in Embedded DRAM Caches
A Cache Reconfiguration Approach for Saving Leakage and Refresh Energy in Embedded DRAM Caches arxiv:139.782v1 [cs.ar] 26 Sep 213 Abstract In recent years, the size and leakage energy consumption of large
More informationHigh Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas
Prefetch-Aware Shared-Resource Management for Multi-Core Systems Eiman Ebrahimi Chang Joo Lee Onur Mutlu Yale N. Patt High Performance Systems Group Department of Electrical and Computer Engineering The
More informationTiming Local Streams: Improving Timeliness in Data Prefetching
Timing Local Streams: Improving Timeliness in Data Prefetching Huaiyu Zhu, Yong Chen and Xian-He Sun Department of Computer Science Illinois Institute of Technology Chicago, IL 60616 {hzhu12,chenyon1,sun}@iit.edu
More informationDynamic Cache Pooling in 3D Multicore Processors
Dynamic Cache Pooling in 3D Multicore Processors TIANSHENG ZHANG, JIE MENG, and AYSE K. COSKUN, BostonUniversity Resource pooling, where multiple architectural components are shared among cores, is a promising
More informationDecoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching
Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching Somayeh Sardashti Computer Sciences Department University of Wisconsin-Madison somayeh@cs.wisc.edu David
More informationHOTL: a Higher Order Theory of Locality
HOTL: a Higher Order Theory of Locality Xiaoya Xiang Chen Ding Hao Luo Department of Computer Science University of Rochester {xiang, cding, hluo}@cs.rochester.edu Bin Bao Adobe Systems Incorporated bbao@adobe.com
More informationHOTL: A Higher Order Theory of Locality
HOTL: A Higher Order Theory of Locality Xiaoya Xiang Chen Ding Hao Luo Department of Computer Science University of Rochester {xiang, cding, hluo}@cs.rochester.edu Bin Bao Adobe Systems Incorporated bbao@adobe.com
More informationLinux Performance on IBM zenterprise 196
Martin Kammerer martin.kammerer@de.ibm.com 9/27/10 Linux Performance on IBM zenterprise 196 visit us at http://www.ibm.com/developerworks/linux/linux390/perf/index.html Trademarks IBM, the IBM logo, and
More informationGather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses
Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses Vivek Seshadri Thomas Mullins, AmiraliBoroumand, Onur Mutlu, Phillip B. Gibbons, Michael A.
More informationPrefetch-Aware DRAM Controllers
Prefetch-Aware DRAM Controllers Chang Joo Lee Onur Mutlu Veynu Narasiman Yale N. Patt Department of Electrical and Computer Engineering The University of Texas at Austin {cjlee, narasima, patt}@ece.utexas.edu
More information562 IEEE TRANSACTIONS ON COMPUTERS, VOL. 65, NO. 2, FEBRUARY 2016
562 IEEE TRANSACTIONS ON COMPUTERS, VOL. 65, NO. 2, FEBRUARY 2016 Memory Bandwidth Management for Efficient Performance Isolation in Multi-Core Platforms Heechul Yun, Gang Yao, Rodolfo Pellizzoni, Member,
More informationSkewed Compressed Cache
Skewed Compressed Cache Somayeh Sardashti, André Seznec, David A. Wood To cite this version: Somayeh Sardashti, André Seznec, David A. Wood. Skewed Compressed Cache. MICRO - 47th Annual IEEE/ACM International
More informationEvalua&ng STT- RAM as an Energy- Efficient Main Memory Alterna&ve
Evalua&ng STT- RAM as an Energy- Efficient Main Memory Alterna&ve Emre Kültürsay *, Mahmut Kandemir *, Anand Sivasubramaniam *, and Onur Mutlu * Pennsylvania State University Carnegie Mellon University
More information