Improving Cache Performance by Exploi7ng Read- Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A.
|
|
- Daniella Nash
- 6 years ago
- Views:
Transcription
1 Improving Cache Performance by Exploi7ng Read- Write Disparity Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Jiménez
2 Summary Read misses are more cri?cal than write misses Read misses can stall processor, writes are not on the cri?cal path Problem: Cache management does not exploit read- write disparity Goal: Design a cache that favors reads over writes to improve performance Lines that are only wriden to are less cri7cal Priori7ze lines that service read requests Key observa7on: Applica?ons differ in their read reuse behavior in clean and dirty lines Idea: Read- Write Par??oning Dynamically par??on the cache between clean and dirty lines Protect the par??on that has more read hits Improves performance over three recent mechanisms 2
3 Outline Mo7va7on Reuse Behavior of Dirty Lines Read- Write Par77oning Results Conclusion 3
4 Mo?va?on Read and write misses are not equally cri?cal Read misses are more cri?cal than write misses Read misses can stall the processor Writes are not on the cri?cal path Rd A STALL Wr B Rd C STALL Buffer/writeback B?me Cache management does not exploit the disparity between read- write requests 4
5 Key Idea Favor reads over writes in cache Differen?ate between read vs. only wriden to lines Cache should protect lines that serve read requests Lines that are only wriden to are less cri7cal Improve performance by maximizing read hits An Example Rd A Wr B Rd B Wr C Rd D A D Read- Only B Read and WriDen C Write- Only 5
6 An Example Rd A Wr B Rd B Wr C Rd D Rd A M D C B Rd A H WR B M Rd B H Wr C M Rd D M STALL Write B A 2 D stalls C B A per D B itera7on A D C B A Wr B H Rd B H WR C M Rd D M Write B Write C LRU Replacement Policy Write C STALL D B A D B A D B A 1 D stall C B A per Replace itera7on C D B A Read- Biased Replacement Policy STALL D C B cycles saved Evic7ng Dirty lines are that treated are only differently wriden to depending can improve on performance read requests 6
7 Outline Mo7va7on Reuse Behavior of Dirty Lines Read- Write Par77oning Results Conclusion 7
8 Reuse Behavior of Dirty Lines Not all dirty lines are the same Write- only Lines Do not receive read requests, can be evicted Read- Write Lines Receive read requests, should be kept in the cache Evic7ng write- only lines provides more space for read lines and can improve performance 8
9 Percentage of Cachelines in LLC perlbench Reuse Behavior of Dirty Lines 401.bzip2 403.gcc 410.bwaves 429.mcf 433.milc 434.zeusmp 435.gromacs 436.cactusADM 437.leslie3d 445.gobmk 447.dealII 450.soplex 456.hmmer 458.sjeng 459.GemsFDTD 462.libquantum Dirty (write- only) Dirty (read- write) Applica7ons On average have 37.4% different lines read are write- only, reuse behavior 9.4% lines are in both dirty read lines and wriden 464.h264ref 465.tonto 470.lbm 471.omnetpp 473.astar 481.wrf 482.sphinx xalancbmk
10 Outline Mo7va7on Reuse Behavior of Dirty Lines Read- Write Par77oning Results Conclusion 10
11 Read- Write Par??oning Goal: Exploit different read reuse behavior in dirty lines to maximize number of read hits Observa7on: Some applica?ons have more reads to clean lines Other applica?ons have more reads to dirty lines Read- Write Par77oning: Dynamically par??ons the cache in clean and dirty lines Evict lines from the par??on that has less read reuse Improves performance by protec7ng lines with more read reuse 11
12 Read- Write Par??oning Number of Reads Normalized to Reads in clean lines at 100m Soplex Clean Line Dirty Line Instruc7ons (M) Number of Reads Normalized to Reads in clean lines at 100m Xalanc Clean Line Dirty Line Instruc7ons (M) Applica7ons have significantly different read reuse behavior in clean and dirty lines 12
13 Read- Write Par??oning U?lize disparity in read reuse in clean and dirty lines Par??on the cache into clean and dirty lines Predict the par??on size that maximizes read hits Maintain the par??on through replacement DIP [Qureshi et al. 2007] selects vic?m within the par??on Predicted Best Par77on Size 3 Replace from dirty par77on Dirty Lines Cache Sets Clean Lines 13
14 Predic?ng Par??on Size Predicts par??on size using sampled shadow tags Based on u?lity- based par??oning [Qureshi et al. 2006] Counts the number of read hits in clean and dirty lines Picks the par??on (x, associa?vity x) that maximizes number of read hits Maximum number of read hits C O U N T E R S C O U N T E R S S A M P L E D S H A D O W S H A D O W MRU MRU- 1 LRU+1 LRU MRU MRU- 1 LRU+1 LRU S E T S T A G S T A G S Dirty Clean 14
15 Outline Mo7va7on Reuse Behavior of Dirty Lines Read- Write Par77oning Results Conclusion 15
16 Methodology CMP$im x86 cycle- accurate simulator [Jaleel et al. 2008] 4MB 16- way set- associa?ve LLC 32KB I+D L1, 256KB L cycle DRAM access?me 550m representa?ve instruc?ons Benchmarks: 10 memory- intensive SPEC benchmarks 35 mul?- programmed applica?ons 16
17 Comparison Points DIP, RRIP: Inser?on Policy [Qureshi et al. 2007, Jaleel et al. 2010] Avoid thrashing and cache pollu?on Dynamically insert lines at different stack posi?ons Low overhead Do not differen?ate between read- write accesses SUP+: Single- Use Reference Predictor [Piquet et al. 2007] Avoids cache pollu?on Bypasses lines that do not receive re- references High accuracy Does not differen?ate between read- write accesses Does not bypass write- only lines High storage overhead, needs PC in LLC 17
18 Comparison Points: Read Reference Predictor (RRP) A new predictor inspired by prior works [Tyson et al. 1995, Piquet et al. 2007] Iden?fies read and write- only lines by alloca?ng PC Bypasses write- only lines Writebacks are not associated with any PC Time PC P: Rd A Wb A Wb A PC Q: Wb A Wb A Wb A Alloca7ng No alloca7ng PC from L1 PC Marks Associates P as a the PC alloca7ng that High allocates storage PC in a overhead L1 line and that passes is never PC in read L2, again LLC 18
19 1.20 Single Core Performance 48.4KB 2.6KB Speedup vs. Baseline LRU DIP RRIP SUP+ RRP RWP Differen7a7ng RWP performs read within vs. write- only 3.4% of RRP, lines improves But requires performance 18X less over storage recent overhead mechanisms 19
20 4 Core Performance Speedup vs. Baseline LRU No Memory Intensive DIP RRIP SUP+ RRP RWP +4.5% 1 Memory Intensive 2 Memory Intensive 3 Memory Intensive +8% 4 Memory Intensive Differen7a7ng More benefit when read vs. more write- only applica7ons lines improves performance are memory over intensive recent mechanisms 20
21 Average Memory Traffic Percentage of Memory Traffic % 85% Writeback Miss 17% 66% 0 Base RWP Increases writeback traffic by 2.5%, but reduces overall memory traffic by 16% 21
22 Dirty Par??on Sizes Number of Cachelines Natural Dirty Par77on Predicted Dirty Par77on 400.perlbench 401.bzip2 403.gcc 410.bwaves 416.gamess 429.mcf 433.milc 434.zeusmp 435.gromacs 436.cactusADM 437.leslie3d 444.namd 445.gobmk 447.dealII 450.soplex 453.povray 454.calculix 456.hmmer 458.sjeng 459.GemsFDTD 462.libquantum 464.h264ref 465.tonto 470.lbm 471.omnetpp 473.astar 481.wrf 482.sphinx3 483.xalancbmk Par77on size varies significantly for some benchmarks 22
23 Dirty Par??on Sizes Number of Cachelines Natural Dirty Par77on Predicted Dirty Par77on 400.perlbench 401.bzip2 403.gcc 410.bwaves 416.gamess 429.mcf 433.milc 434.zeusmp 435.gromacs 436.cactusADM 437.leslie3d 444.namd 445.gobmk 447.dealII 450.soplex 453.povray 454.calculix 456.hmmer 458.sjeng 459.GemsFDTD 462.libquantum 464.h264ref 465.tonto 470.lbm 471.omnetpp 473.astar 481.wrf 482.sphinx3 483.xalancbmk Par77on size varies significantly during the run7me for some benchmarks 23
24 Outline Mo7va7on Reuse Behavior of Dirty Lines Read- Write Par77oning Results Conclusion 24
25 Conclusion Problem: Cache management does not exploit read- write disparity Goal: Design a cache that favors read requests over write requests to improve performance Lines that are only wriden to are less cri7cal Protect lines that serve read requests Key observa7on: Applica?ons differ in their read reuse behavior in clean and dirty lines Idea: Read- Write Par??oning Dynamically par??on the cache in clean and dirty lines Protect the par??on that has more read hits Results: Improves performance over three recent mechanisms 25
26 Thank you 26
27 Improving Cache Performance by Exploi7ng Read- Write Disparity Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Jiménez
28 Extra Slides 28
29 Reuse Behavior of Dirty Lines in LLC Different read reuse behavior in dirty lines Read Intensive/Non- Write Intensive Most accesses are reads, only a few writes Example: 483.xalancbmk Write Intensive Generates huge amount of intermediate data Example: 456.hmmer Read- Write Intensive Itera?vely reads and writes huge amount of data Example: 450.soplex 29
30 Read Intensive/Non- Write Intensive Most accesses are reads, only a few writes 483.xalancbmk: Extensible stylesheet language transforma?ons (XSLT) processor XSLT Processor XML Input HTML/XML/TXT Output XSLT Code 92% accesses are reads 99% write accesses are stack opera?on 30
31 elementat push %ebp mov %esp, %ebp push %esi pop %esi pop %ebp ret Non- Write Intensive Logical View of Stack %ebp %esi Write- Only 1.8% lines in LLC are write- only Read and Wri,en These dirty lines should be evicted WRITEBACK WRITEBACK 31
32 Write Intensive Generates huge amount of intermediate data 456.hmmer: Searches a protein sequence database Hidden Markov Model of Mul?ple Sequence Alignments Best Ranked Matches Database Viterbi algorithm, uses dynamic programming Only 0.4% writes are from stack opera?ons 32
33 Write Intensive 2D Table Read and Wri,en WRITEBACK Write- Only 92% lines in LLC are write- only These lines can be evicted WRITEBACK 33
34 Read- Write Intensive Itera?vely reads and writes huge amount of data 450.soplex: Linear programming solver Miminum/maximum of objec?ve func?on Inequali?es and objec?ve func?on Simplex algorithm, iterates over a matrix Different opera?ons over the en?re matrix at each itera?on 34
35 Read- Write Intensive Read and Wri,en Pivot Column READ WRITEBACK Pivot Row Write- Only Read and Wri,en 19% lines in LLC write- only, 10% lines are read- wriden Read and wriden lines should be protected 35
36 Read Reuse of Clean- Dirty Lines in LLC Percentage of Cachelines in LLC Clean Non- Read Dirty Non- Read Clean Read Dirty Read perl 401.bzip 403.gcc 410.bwa 429.mcf 433.milc 434.zeus 435.gro 436.cact 437.lesli 445.gob 447.dealI 450.sopl 456.hm 458.sjen 459.Gem 462.libqu 464.h tont 470.lbm 471.omn 473.astar 481.wrf 482.sphi 483.xala On average 37.4% blocks are dirty non- read and 42% blocks are clean non- read 36
37 1.07 Single Core Performance 48.4KB 2.6KB Speedup vs. Baseline LRU DIP RRIP SUP+ RRP RWP 5% speedup over the whole SPECCPU2006 benchmarks 37
38 Speedup Speedup vs. Baseline LRU DIP RRIP SUP+ RRP RWP On average 14.6% speedup over baseline 5% speedup over the whole SPECCPU2006 benchmarks
39 Writeback Traffic to Memory Writeback Traffic Normalized to LRU RRP RWP Increases traffic by 17% over the whole SPECCPU2006 benchmarks 39
40 4- Core Performance Normalized Weighted Speedup RRP RWP Workloads 40
41 Change of IPC with Sta?c Par??oning 0.9 Xalancbmk 0.8 Soplex IPC IPC Number of Ways Assigned to Dirty Blocks Number of Ways Assigned to Dirty Blocks
Improving Cache Performance by Exploi7ng Read- Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A.
Improving Cache Performance by Exploi7ng Read- Write Disparity Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Jiménez Summary Read misses are more cri?cal than write misses
More informationLightweight Memory Tracing
Lightweight Memory Tracing Mathias Payer*, Enrico Kravina, Thomas Gross Department of Computer Science ETH Zürich, Switzerland * now at UC Berkeley Memory Tracing via Memlets Execute code (memlets) for
More informationA Fast Instruction Set Simulator for RISC-V
A Fast Instruction Set Simulator for RISC-V Maxim.Maslov@esperantotech.com Vadim.Gimpelson@esperantotech.com Nikita.Voronov@esperantotech.com Dave.Ditzel@esperantotech.com Esperanto Technologies, Inc.
More informationResource-Conscious Scheduling for Energy Efficiency on Multicore Processors
Resource-Conscious Scheduling for Energy Efficiency on Andreas Merkel, Jan Stoess, Frank Bellosa System Architecture Group KIT The cooperation of Forschungszentrum Karlsruhe GmbH and Universität Karlsruhe
More informationExploi'ng Compressed Block Size as an Indicator of Future Reuse
Exploi'ng Compressed Block Size as an Indicator of Future Reuse Gennady Pekhimenko, Tyler Huberty, Rui Cai, Onur Mutlu, Todd C. Mowry Phillip B. Gibbons, Michael A. Kozuch Execu've Summary In a compressed
More informationNear-Threshold Computing: How Close Should We Get?
Near-Threshold Computing: How Close Should We Get? Alaa R. Alameldeen Intel Labs Workshop on Near-Threshold Computing June 14, 2014 Overview High-level talk summarizing my architectural perspective on
More informationNightWatch: Integrating Transparent Cache Pollution Control into Dynamic Memory Allocation Systems
NightWatch: Integrating Transparent Cache Pollution Control into Dynamic Memory Allocation Systems Rentong Guo 1, Xiaofei Liao 1, Hai Jin 1, Jianhui Yue 2, Guang Tan 3 1 Huazhong University of Science
More informationPerceptron Learning for Reuse Prediction
Perceptron Learning for Reuse Prediction Elvira Teran Zhe Wang Daniel A. Jiménez Texas A&M University Intel Labs {eteran,djimenez}@tamu.edu zhe2.wang@intel.com Abstract The disparity between last-level
More informationInsertion and Promotion for Tree-Based PseudoLRU Last-Level Caches
Insertion and Promotion for Tree-Based PseudoLRU Last-Level Caches Daniel A. Jiménez Department of Computer Science and Engineering Texas A&M University ABSTRACT Last-level caches mitigate the high latency
More informationAddressing End-to-End Memory Access Latency in NoC-Based Multicores
Addressing End-to-End Memory Access Latency in NoC-Based Multicores Akbar Sharifi, Emre Kultursay, Mahmut Kandemir and Chita R. Das The Pennsylvania State University University Park, PA, 682, USA {akbar,euk39,kandemir,das}@cse.psu.edu
More informationFootprint-based Locality Analysis
Footprint-based Locality Analysis Xiaoya Xiang, Bin Bao, Chen Ding University of Rochester 2011-11-10 Memory Performance On modern computer system, memory performance depends on the active data usage.
More informationDecoupled Dynamic Cache Segmentation
Appears in Proceedings of the 8th International Symposium on High Performance Computer Architecture (HPCA-8), February, 202. Decoupled Dynamic Cache Segmentation Samira M. Khan, Zhe Wang and Daniel A.
More informationSandbox Based Optimal Offset Estimation [DPC2]
Sandbox Based Optimal Offset Estimation [DPC2] Nathan T. Brown and Resit Sendag Department of Electrical, Computer, and Biomedical Engineering Outline Motivation Background/Related Work Sequential Offset
More informationImproving Cache Performance using Victim Tag Stores
Improving Cache Performance using Victim Tag Stores SAFARI Technical Report No. 2011-009 Vivek Seshadri, Onur Mutlu, Todd Mowry, Michael A Kozuch {vseshadr,tcm}@cs.cmu.edu, onur@cmu.edu, michael.a.kozuch@intel.com
More informationMultiperspective Reuse Prediction
ABSTRACT Daniel A. Jiménez Texas A&M University djimenezacm.org The disparity between last-level cache and memory latencies motivates the search for e cient cache management policies. Recent work in predicting
More informationImproving Writeback Efficiency with Decoupled Last-Write Prediction
Improving Writeback Efficiency with Decoupled Last-Write Prediction Zhe Wang Samira M. Khan Daniel A. Jiménez The University of Texas at San Antonio {zhew,skhan,dj}@cs.utsa.edu Abstract In modern DDRx
More informationData Prefetching by Exploiting Global and Local Access Patterns
Journal of Instruction-Level Parallelism 13 (2011) 1-17 Submitted 3/10; published 1/11 Data Prefetching by Exploiting Global and Local Access Patterns Ahmad Sharif Hsien-Hsin S. Lee School of Electrical
More informationSampling Dead Block Prediction for Last-Level Caches
Appears in Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-43), December 2010 Sampling Dead Block Prediction for Last-Level Caches Samira Khan, Yingying Tian,
More informationEvaluating STT-RAM as an Energy-Efficient Main Memory Alternative
Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative Emre Kültürsay *, Mahmut Kandemir *, Anand Sivasubramaniam *, and Onur Mutlu * Pennsylvania State University Carnegie Mellon University
More informationWADE: Writeback-Aware Dynamic Cache Management for NVM-Based Main Memory System
WADE: Writeback-Aware Dynamic Cache Management for NVM-Based Main Memory System ZHE WANG, Texas A&M University SHUCHANG SHAN, Chinese Institute of Computing Technology TING CAO, Australian National University
More informationChargeCache. Reducing DRAM Latency by Exploiting Row Access Locality
ChargeCache Reducing DRAM Latency by Exploiting Row Access Locality Hasan Hassan, Gennady Pekhimenko, Nandita Vijaykumar, Vivek Seshadri, Donghyuk Lee, Oguz Ergin, Onur Mutlu Executive Summary Goal: Reduce
More informationA Heterogeneous Multiple Network-On-Chip Design: An Application-Aware Approach
A Heterogeneous Multiple Network-On-Chip Design: An Application-Aware Approach Asit K. Mishra Onur Mutlu Chita R. Das Executive summary Problem: Current day NoC designs are agnostic to application requirements
More informationCS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines
CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines Sreepathi Pai UTCS September 14, 2015 Outline 1 Introduction 2 Out-of-order Scheduling 3 The Intel Haswell
More informationThesis Defense Lavanya Subramanian
Providing High and Predictable Performance in Multicore Systems Through Shared Resource Management Thesis Defense Lavanya Subramanian Committee: Advisor: Onur Mutlu Greg Ganger James Hoe Ravi Iyer (Intel)
More informationEnergy Proportional Datacenter Memory. Brian Neel EE6633 Fall 2012
Energy Proportional Datacenter Memory Brian Neel EE6633 Fall 2012 Outline Background Motivation Related work DRAM properties Designs References Background The Datacenter as a Computer Luiz André Barroso
More informationUCB CS61C : Machine Structures
inst.eecs.berkeley.edu/~cs61c UCB CS61C : Machine Structures Lecture 36 Performance 2010-04-23 Lecturer SOE Dan Garcia How fast is your computer? Every 6 months (Nov/June), the fastest supercomputers in
More informationEnergy Models for DVFS Processors
Energy Models for DVFS Processors Thomas Rauber 1 Gudula Rünger 2 Michael Schwind 2 Haibin Xu 2 Simon Melzner 1 1) Universität Bayreuth 2) TU Chemnitz 9th Scheduling for Large Scale Systems Workshop July
More informationThe Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory
The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory Lavanya Subramanian* Vivek Seshadri* Arnab Ghosh* Samira Khan*
More informationPIPELINING AND PROCESSOR PERFORMANCE
PIPELINING AND PROCESSOR PERFORMANCE Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 1, John L. Hennessy and David A. Patterson, Morgan Kaufmann,
More informationPerformance Characterization of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor
Performance Characterization of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor Sarah Bird ϕ, Aashish Phansalkar ϕ, Lizy K. John ϕ, Alex Mericas α and Rajeev Indukuru α ϕ University
More informationDynRBLA: A High-Performance and Energy-Efficient Row Buffer Locality-Aware Caching Policy for Hybrid Memories
SAFARI Technical Report No. 2-5 (December 6, 2) : A High-Performance and Energy-Efficient Row Buffer Locality-Aware Caching Policy for Hybrid Memories HanBin Yoon hanbinyoon@cmu.edu Justin Meza meza@cmu.edu
More informationArchitecture of Parallel Computer Systems - Performance Benchmarking -
Architecture of Parallel Computer Systems - Performance Benchmarking - SoSe 18 L.079.05810 www.uni-paderborn.de/pc2 J. Simon - Architecture of Parallel Computer Systems SoSe 2018 < 1 > Definition of Benchmark
More informationMinimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era
Minimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era Dimitris Kaseridis Electrical and Computer Engineering The University of Texas at Austin Austin, TX, USA kaseridis@mail.utexas.edu
More informationSpatial Memory Streaming (with rotated patterns)
Spatial Memory Streaming (with rotated patterns) Michael Ferdman, Stephen Somogyi, and Babak Falsafi Computer Architecture Lab at 2006 Stephen Somogyi The Memory Wall Memory latency 100 s clock cycles;
More informationComputer Architecture. Introduction
to Computer Architecture 1 Computer Architecture What is Computer Architecture From Wikipedia, the free encyclopedia In computer engineering, computer architecture is a set of rules and methods that describe
More informationUCB CS61C : Machine Structures
inst.eecs.berkeley.edu/~cs61c UCB CS61C : Machine Structures Lecture 38 Performance 2008-04-30 Lecturer SOE Dan Garcia How fast is your computer? Every 6 months (Nov/June), the fastest supercomputers in
More informationLinearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency
Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency Gennady Pekhimenko, Vivek Seshadri, Yoongu Kim, Hongyi Xin, Onur Mutlu, Todd C. Mowry Phillip B. Gibbons,
More informationBias Scheduling in Heterogeneous Multi-core Architectures
Bias Scheduling in Heterogeneous Multi-core Architectures David Koufaty Dheeraj Reddy Scott Hahn Intel Labs {david.a.koufaty, dheeraj.reddy, scott.hahn}@intel.com Abstract Heterogeneous architectures that
More informationPerformance. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University
Performance Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Defining Performance (1) Which airplane has the best performance? Boeing 777 Boeing
More informationEvalua&ng STT- RAM as an Energy- Efficient Main Memory Alterna&ve
Evalua&ng STT- RAM as an Energy- Efficient Main Memory Alterna&ve Emre Kültürsay *, Mahmut Kandemir *, Anand Sivasubramaniam *, and Onur Mutlu * Pennsylvania State University Carnegie Mellon University
More informationImproving Cache Management Policies Using Dynamic Reuse Distances
Improving Cache Management Policies Using Dynamic Reuse Distances Nam Duong, Dali Zhao, Taesu Kim, Rosario Cammarota, Mateo Valero and Alexander V. Veidenbaum University of California, Irvine Universitat
More informationPiCL: a Software-Transparent, Persistent Cache Log for Nonvolatile Main Memory
PiCL: a Software-Transparent, Persistent Cache Log for Nonvolatile Main Memory Tri M. Nguyen Department of Electrical Engineering Princeton University Princeton, USA trin@princeton.edu David Wentzlaff
More informationEKT 303 WEEK Pearson Education, Inc., Hoboken, NJ. All rights reserved.
+ EKT 303 WEEK 2 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved. Chapter 2 + Performance Issues + Designing for Performance The cost of computer systems continues to drop dramatically,
More informationAdaptive Placement and Migration Policy for an STT-RAM-Based Hybrid Cache
Adaptive Placement and Migration Policy for an STT-RAM-Based Hybrid Cache Zhe Wang Daniel A. Jiménez Cong Xu Guangyu Sun Yuan Xie Texas A&M University Pennsylvania State University Peking University AMD
More informationMicro-sector Cache: Improving Space Utilization in Sectored DRAM Caches
Micro-sector Cache: Improving Space Utilization in Sectored DRAM Caches Mainak Chaudhuri Mukesh Agrawal Jayesh Gaur Sreenivas Subramoney Indian Institute of Technology, Kanpur 286, INDIA Intel Architecture
More informationCache Friendliness-aware Management of Shared Last-level Caches for High Performance Multi-core Systems
1 Cache Friendliness-aware Management of Shared Last-level Caches for High Performance Multi-core Systems Dimitris Kaseridis, Member, IEEE, Muhammad Faisal Iqbal, Student Member, IEEE and Lizy Kurian John,
More informationarxiv: v1 [cs.ar] 1 Feb 2016
SAFARI Technical Report No. - Enabling Efficient Dynamic Resizing of Large DRAM Caches via A Hardware Consistent Hashing Mechanism Kevin K. Chang, Gabriel H. Loh, Mithuna Thottethodi, Yasuko Eckert, Mike
More informationARI: Adaptive LLC-Memory Traffic Management
0 ARI: Adaptive LLC-Memory Traffic Management VIACHESLAV V. FEDOROV, Texas A&M University SHENG QIU, Texas A&M University A. L. NARASIMHA REDDY, Texas A&M University PAUL V. GRATZ, Texas A&M University
More informationDEMM: a Dynamic Energy-saving mechanism for Multicore Memories
DEMM: a Dynamic Energy-saving mechanism for Multicore Memories Akbar Sharifi, Wei Ding 2, Diana Guttman 3, Hui Zhao 4, Xulong Tang 5, Mahmut Kandemir 5, Chita Das 5 Facebook 2 Qualcomm 3 Intel 4 University
More informationSEN361 Computer Organization. Prof. Dr. Hasan Hüseyin BALIK (2 nd Week)
+ SEN361 Computer Organization Prof. Dr. Hasan Hüseyin BALIK (2 nd Week) + Outline 1. Overview 1.1 Basic Concepts and Computer Evolution 1.2 Performance Issues + 1.2 Performance Issues + Designing for
More informationMulti-Cache Resizing via Greedy Coordinate Descent
Noname manuscript No. (will be inserted by the editor) Multi-Cache Resizing via Greedy Coordinate Descent I. Stephen Choi Donald Yeung Received: date / Accepted: date Abstract To reduce power consumption
More informationA Cache Reconfiguration Approach for Saving Leakage and Refresh Energy in Embedded DRAM Caches
A Cache Reconfiguration Approach for Saving Leakage and Refresh Energy in Embedded DRAM Caches arxiv:139.782v1 [cs.ar] 26 Sep 213 Abstract In recent years, the size and leakage energy consumption of large
More information562 IEEE TRANSACTIONS ON COMPUTERS, VOL. 65, NO. 2, FEBRUARY 2016
562 IEEE TRANSACTIONS ON COMPUTERS, VOL. 65, NO. 2, FEBRUARY 2016 Memory Bandwidth Management for Efficient Performance Isolation in Multi-Core Platforms Heechul Yun, Gang Yao, Rodolfo Pellizzoni, Member,
More informationOpenPrefetch. (in-progress)
OpenPrefetch Let There Be Industry-Competitive Prefetching in RISC-V Processors (in-progress) Bowen Huang, Zihao Yu, Zhigang Liu, Chuanqi Zhang, Sa Wang, Yungang Bao Institute of Computing Technology(ICT),
More informationPredicting Performance Impact of DVFS for Realistic Memory Systems
Predicting Performance Impact of DVFS for Realistic Memory Systems Rustam Miftakhutdinov Eiman Ebrahimi Yale N. Patt The University of Texas at Austin Nvidia Corporation {rustam,patt}@hps.utexas.edu ebrahimi@hps.utexas.edu
More informationBalancing DRAM Locality and Parallelism in Shared Memory CMP Systems
Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Min Kyu Jeong, Doe Hyun Yoon^, Dam Sunwoo*, Michael Sullivan, Ikhwan Lee, and Mattan Erez The University of Texas at Austin Hewlett-Packard
More informationLinux Performance on IBM zenterprise 196
Martin Kammerer martin.kammerer@de.ibm.com 9/27/10 Linux Performance on IBM zenterprise 196 visit us at http://www.ibm.com/developerworks/linux/linux390/perf/index.html Trademarks IBM, the IBM logo, and
More informationHOTL: a Higher Order Theory of Locality
HOTL: a Higher Order Theory of Locality Xiaoya Xiang Chen Ding Hao Luo Department of Computer Science University of Rochester {xiang, cding, hluo}@cs.rochester.edu Bin Bao Adobe Systems Incorporated bbao@adobe.com
More informationPower Gating with Block Migration in Chip-Multiprocessor Last-Level Caches
Power Gating with Block Migration in Chip-Multiprocessor Last-Level Caches David Kadjo, Hyungjun Kim, Paul Gratz, Jiang Hu and Raid Ayoub Department of Electrical and Computer Engineering Texas A& M University,
More informationLinearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency
Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency Gennady Pekhimenko Advisors: Todd C. Mowry and Onur Mutlu Computer Science Department, Carnegie Mellon
More informationFiltered Runahead Execution with a Runahead Buffer
Filtered Runahead Execution with a Runahead Buffer ABSTRACT Milad Hashemi The University of Texas at Austin miladhashemi@utexas.edu Runahead execution dynamically expands the instruction window of an out
More informationDASCA: Dead Write Prediction Assisted STT-RAM Cache Architecture
DASCA: Dead Write Prediction Assisted STT-RAM Cache Architecture Junwhan Ahn *, Sungjoo Yoo, and Kiyoung Choi * junwhan@snu.ac.kr, sungjoo.yoo@postech.ac.kr, kchoi@snu.ac.kr * Department of Electrical
More informationHOTL: A Higher Order Theory of Locality
HOTL: A Higher Order Theory of Locality Xiaoya Xiang Chen Ding Hao Luo Department of Computer Science University of Rochester {xiang, cding, hluo}@cs.rochester.edu Bin Bao Adobe Systems Incorporated bbao@adobe.com
More informationBalancing Context Switch Penalty and Response Time with Elastic Time Slicing
Balancing Context Switch Penalty and Response Time with Elastic Time Slicing Nagakishore Jammula, Moinuddin Qureshi, Ada Gavrilovska, and Jongman Kim Georgia Institute of Technology, Atlanta, Georgia,
More informationLACS: A Locality-Aware Cost-Sensitive Cache Replacement Algorithm
1 LACS: A Locality-Aware Cost-Sensitive Cache Replacement Algorithm Mazen Kharbutli and Rami Sheikh (Submitted to IEEE Transactions on Computers) Mazen Kharbutli is with Jordan University of Science and
More informationA Front-end Execution Architecture for High Energy Efficiency
A Front-end Execution Architecture for High Energy Efficiency Ryota Shioya, Masahiro Goshima and Hideki Ando Department of Electrical Engineering and Computer Science, Nagoya University, Aichi, Japan Information
More informationPractical Data Compression for Modern Memory Hierarchies
Practical Data Compression for Modern Memory Hierarchies Thesis Oral Gennady Pekhimenko Committee: Todd Mowry (Co-chair) Onur Mutlu (Co-chair) Kayvon Fatahalian David Wood, University of Wisconsin-Madison
More informationPotential for hardware-based techniques for reuse distance analysis
Michigan Technological University Digital Commons @ Michigan Tech Dissertations, Master's Theses and Master's Reports - Open Dissertations, Master's Theses and Master's Reports 2011 Potential for hardware-based
More informationGenerating Low-Overhead Dynamic Binary Translators
Generating Low-Overhead Dynamic Binary Translators Mathias Payer ETH Zurich, Switzerland mathias.payer@inf.ethz.ch Thomas R. Gross ETH Zurich, Switzerland trg@inf.ethz.ch Abstract Dynamic (on the fly)
More informationScalable Dynamic Task Scheduling on Adaptive Many-Cores
Introduction: Many- Paradigm [Our Definition] Scalable Dynamic Task Scheduling on Adaptive Many-s Vanchinathan Venkataramani, Anuj Pathania, Muhammad Shafique, Tulika Mitra, Jörg Henkel Bus CES Chair for
More informationDecoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching
Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching Somayeh Sardashti and David A. Wood University of Wisconsin-Madison 1 Please find the power point presentation
More informationLightweight Memory Tracing
Lightweight Memory Tracing Mathias Payer ETH Zurich Enrico Kravina ETH Zurich Thomas R. Gross ETH Zurich Abstract Memory tracing (executing additional code for every memory access of a program) is a powerful
More informationRelative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review
Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Bijay K.Paikaray Debabala Swain Dept. of CSE, CUTM Dept. of CSE, CUTM Bhubaneswer, India Bhubaneswer, India
More informationAB-Aware: Application Behavior Aware Management of Shared Last Level Caches
AB-Aware: Application Behavior Aware Management of Shared Last Level Caches Suhit Pai, Newton Singh and Virendra Singh Computer Architecture and Dependable Systems Laboratory Department of Electrical Engineering
More informationCOP: To Compress and Protect Main Memory
COP: To Compress and Protect Main Memory David J. Palframan Nam Sung Kim Mikko H. Lipasti Department of Electrical and Computer Engineering University of Wisconsin Madison palframan@wisc.edu, nskim3@wisc.edu,
More informationFlexible Cache Error Protection using an ECC FIFO
Flexible Cache Error Protection using an ECC FIFO Doe Hyun Yoon and Mattan Erez Dept Electrical and Computer Engineering The University of Texas at Austin 1 ECC FIFO Goal: to reduce on-chip ECC overhead
More informationCPU Performance Evaluation: Cycles Per Instruction (CPI) Most computers run synchronously utilizing a CPU clock running at a constant clock rate:
CPI CPU Performance Evaluation: Cycles Per Instruction (CPI) Most computers run synchronously utilizing a CPU clock running at a constant clock rate: Clock cycle where: Clock rate = 1 / clock cycle f =
More informationMaking Data Prefetch Smarter: Adaptive Prefetching on POWER7
Making Data Prefetch Smarter: Adaptive Prefetching on POWER7 Víctor Jiménez Barcelona Supercomputing Center Barcelona, Spain victor.javier@bsc.es Alper Buyuktosunoglu IBM T. J. Watson Research Center Yorktown
More informationCloudCache: Expanding and Shrinking Private Caches
CloudCache: Expanding and Shrinking Private Caches Hyunjin Lee, Sangyeun Cho, and Bruce R. Childers Computer Science Department, University of Pittsburgh {abraham,cho,childers}@cs.pitt.edu Abstract The
More informationOpen Access Research on the Establishment of MSR Model in Cloud Computing based on Standard Performance Evaluation
Send Orders for Reprints to reprints@benthamscience.ae The Open Automation and Control Systems Journal, 2015, 7, 821-825 821 Open Access Research on the Establishment of MSR Model in Cloud Computing based
More informationStaged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems
Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems Rachata Ausavarungnirun Kevin Kai-Wei Chang Lavanya Subramanian Gabriel H. Loh Onur Mutlu Carnegie Mellon University
More informationAn Application-Oriented Approach for Designing Heterogeneous Network-on-Chip
An Application-Oriented Approach for Designing Heterogeneous Network-on-Chip Technical Report CSE-11-7 Monday, June 13, 211 Asit K. Mishra Department of Computer Science and Engineering The Pennsylvania
More informationA Dynamic Program Analysis to find Floating-Point Accuracy Problems
1 A Dynamic Program Analysis to find Floating-Point Accuracy Problems Florian Benz fbenz@stud.uni-saarland.de Andreas Hildebrandt andreas.hildebrandt@uni-mainz.de Sebastian Hack hack@cs.uni-saarland.de
More informationMemory Mapped ECC Low-Cost Error Protection for Last Level Caches. Doe Hyun Yoon Mattan Erez
Memory Mapped ECC Low-Cost Error Protection for Last Level Caches Doe Hyun Yoon Mattan Erez 1-Slide Summary Reliability issues in caches Increasing soft error rate (SER) Cost increases with error protection
More informationLast time. Lecture #29 Performance & Parallel Intro
CS61C L29 Performance & Parallel (1) inst.eecs.berkeley.edu/~cs61c CS61C : Machine Structures Lecture #29 Performance & Parallel Intro 2007-8-14 Scott Beamer, Instructor Paper Battery Developed by Researchers
More informationDetecting and Mitigating Data-Dependent DRAM Failures by Exploiting Current Memory Content
Detecting and Mitigating Data-Dependent DRAM Failures by Exploiting Current Memory Content Samira Khan Chris Wilkerson Zhe Wang Alaa R. Alameldeen Donghyuk Lee Onur Mutlu University of Virginia Intel Labs
More informationSiNUCA: A Validated Micro-Architecture Simulator
SiNUCA: A Validated Micro-Architecture ulator Marco A. Z. Alves, Matthias Diener, Francis B. Moreira, Philippe O. A. Navaux Informatics Institute Federal University of Rio Grande do Sul Email: {mazalves,
More informationEfficient Memory Shadowing for 64-bit Architectures
Efficient Memory Shadowing for 64-bit Architectures The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation Qin Zhao, Derek Bruening,
More informationPrefetch-Aware DRAM Controllers
Prefetch-Aware DRAM Controllers Chang Joo Lee Onur Mutlu Veynu Narasiman Yale N. Patt High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin
More informationInformation System Architecture Natawut Nupairoj Ph.D. Department of Computer Engineering, Chulalongkorn University
2110684 Information System Architecture Natawut Nupairoj Ph.D. Department of Computer Engineering, Chulalongkorn University Agenda Capacity Planning Determining the production capacity needed by an organization
More informationISA-Aging. (SHRINK: Reducing the ISA Complexity Via Instruction Recycling) Accepted for ISCA 2015
ISA-Aging (SHRINK: Reducing the ISA Complexity Via Instruction Recycling) Accepted for ISCA 2015 Bruno Cardoso Lopes, Rafael Auler, Edson Borin, Luiz Ramos, Rodolfo Azevedo, University of Campinas, Brasil
More informationEfficient Data Mapping and Buffering Techniques for Multilevel Cell Phase-Change Memories
Efficient Data Mapping and Buffering Techniques for Multilevel Cell Phase-Change Memories HANBIN YOON * and JUSTIN MEZA, Carnegie Mellon University NAVEEN MURALIMANOHAR, Hewlett-Packard Labs NORMAN P.
More informationReMAP: Reuse and Memory Access Cost Aware Eviction Policy for Last Level Cache Management
This is a pre-print, author's version of the paper to appear in the 32nd IEEE International Conference on Computer Design. ReMAP: Reuse and Memory Access Cost Aware Eviction Policy for Last Level Cache
More informationPOPS: Coherence Protocol Optimization for both Private and Shared Data
POPS: Coherence Protocol Optimization for both Private and Shared Data Hemayet Hossain, Sandhya Dwarkadas, and Michael C. Huang University of Rochester Rochester, NY 14627, USA {hossain@cs, sandhya@cs,
More informationA Hybrid Adaptive Feedback Based Prefetcher
A Feedback Based Prefetcher Santhosh Verma, David M. Koppelman and Lu Peng Department of Electrical and Computer Engineering Louisiana State University, Baton Rouge, LA 78 sverma@lsu.edu, koppel@ece.lsu.edu,
More informationCache Controller with Enhanced Features using Verilog HDL
Cache Controller with Enhanced Features using Verilog HDL Prof. V. B. Baru 1, Sweety Pinjani 2 Assistant Professor, Dept. of ECE, Sinhgad College of Engineering, Vadgaon (BK), Pune, India 1 PG Student
More informationPage-Based Memory Allocation Policies of Local and Remote Memory in Cluster Computers
2012 IEEE 18th International Conference on Parallel and Distributed Systems Page-Based Memory Allocation Policies of Local and Remote Memory in Cluster Computers Mónica Serrano, Salvador Petit, Julio Sahuquillo,
More informationThe Load Slice Core Microarchitecture
The Load Slice Core Microarchitecture Trevor E. Carlson 1 Wim Heirman 2 Osman Allam 3 Stefanos Kaxiras 1 Lieven Eeckhout 3 1 Uppsala University, Sweden 2 Intel, ExaScience Lab 3 Ghent University, Belgium
More informationInvestigating the Viability of Bufferless NoCs in Modern Chip Multi-processor Systems
Investigating the Viability of Bufferless NoCs in Modern Chip Multi-processor Systems Chris Craik craik@cmu.edu Onur Mutlu onur@cmu.edu Computer Architecture Lab (CALCM) Carnegie Mellon University SAFARI
More informationHideM: Protecting the Contents of Userspace Memory in the Face of Disclosure Vulnerabilities
HideM: Protecting the Contents of Userspace Memory in the Face of Disclosure Vulnerabilities Jason Gionta, William Enck, Peng Ning 1 JIT-ROP 2 Two Attack Categories Injection Attacks Code Integrity Data
More information