AB-Aware: Application Behavior Aware Management of Shared Last Level Caches


AB-Aware: Application Behavior Aware Management of Shared Last Level Caches
Suhit Pai, Newton Singh and Virendra Singh
Computer Architecture and Dependable Systems Laboratory, Department of Electrical Engineering, Indian Institute of Technology Bombay, India
The 28th ACM Great Lakes Symposium on VLSI (GLSVLSI), Friday, May 25th

Outline
1. Introduction
2. Motivation
3. AB-Aware Cache Management
4. Evaluation
5. Results
6. Conclusion


Introduction
Thousand-fold growth in microprocessor performance, driven by:
1. Core micro-architectural innovations
2. Transistor-speed scaling
Figure: Memory latency and CPU frequency improvement [1]. The relative improvement in processor clock frequency (roughly 200x) has far outpaced the improvement in memory latency (roughly 10x).
[1] J. Srinivasan, Improving Cache Utilisation, PhD thesis, University of Cambridge, 2011


Introduction
The disparity between CPU and memory speeds creates a memory bottleneck, which the cache hierarchy mitigates.
A miss at the Last Level Cache (LLC) takes hundreds of cycles to get serviced; processor execution stalls, causing performance loss.
Non-blocking caches service multiple misses in parallel, reducing the effective memory stalls.
The notion of generating and servicing long-latency cache misses in parallel is called Memory Level Parallelism (MLP) [2].
MLP is not uniform across all memory accesses, so MLP-aware cache replacement can minimize isolated misses.
[2] M. K. Qureshi et al., A Case for MLP-Aware Cache Replacement, ISCA '06



Example
Example from [2]: a reduced total number of misses does not always result in reduced memory stall cycles!
[2] M. K. Qureshi et al., A Case for MLP-Aware Cache Replacement, ISCA '06
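The intuition behind this example can be sketched with a toy stall-cycle model (all numbers are invented, and real miss overlap is only partial):

```python
MISS_LATENCY = 400   # made-up DRAM round-trip time, in cycles

def stall_cycles(parallel_groups):
    """parallel_groups: sizes of groups of misses serviced in parallel.
    Under a simple MLP model, a group of k fully overlapping misses
    stalls the core for ~one miss latency, not k of them."""
    return sum(MISS_LATENCY for k in parallel_groups if k > 0)

# Policy A keeps 5 misses, but 4 of them overlap: 2 stall episodes.
a = stall_cycles([4, 1])      # 800 cycles
# Policy B has only 3 misses, all isolated: 3 stall episodes.
b = stall_cycles([1, 1, 1])   # 1200 cycles
```

Policy B suffers fewer misses yet stalls the processor longer, which is exactly why a replacement policy should account for the cost of a miss, not just its count.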


Multi-core Systems
LLC shared among multiple homogeneous or heterogeneous cores:
+ Dynamic sharing of cache space
+ Better utilization of cache resources
- Increased number of conflict misses
- Interference among applications
Regularity in cache accesses is filtered out by the higher-level caches, so the LLC sees a diverse access pattern.
Co-running applications have different memory characteristics:
Cache Friendly (CF): small working data-set, more cache re-use
Streaming (STR): large working data-set, low cache re-use
LRU-based cache replacement policies allocate cache resources on a demand basis.


LRU cache replacement on shared LLC
L3 cache size varied from 512 kB to 4 MB by changing the number of ways while keeping the number of sets constant.
Figure: Memory characteristics of sphinx and libquantum (MPKI vs. number of ways, 1 way = 512 kB). MPKI is Misses Per Kilo Instructions.
Under the LRU policy in a 2-core configuration, sphinx got only 3 ways out of the total 8 (libquantum got 5 ways): sphinx suffered due to interference from libquantum.


Replacement policies for shared LLC
On average, 60% of cache lines are not re-referenced before eviction [3], either because they have no temporal locality or because the data-set is larger than the allocated cache.
Figure: Various cache replacement policies
[3] M. K. Qureshi et al., Adaptive Insertion Policies for High Performance Caching, ISCA '07
[4] A. Jaleel et al., High Performance Cache Replacement using Re-Reference Interval Prediction (RRIP), ISCA '10


Replacement policies for shared LLC
ABRIP [5] maintains two counters:
1. Core level (Cr): whose block to replace
2. Block level (Br): which block to replace
+ Reduces interference by giving more priority to application behavior
+ Improves throughput of CF-STR workload mixes
- Doesn't work very well with CF-CF workload mixes
Proposal: combine MLP awareness, which tries to minimize the number of isolated misses, with Application Behavior awareness, which tries to reduce interference among applications.
[5] P. Lathigara et al., Application Behavior Aware Re-reference Interval Prediction for Shared LLC, ICCD '15



MLP-cost calculation for shared LLC
The Miss Status Holding Register (MSHR) maintains the number of in-flight misses [2].

Algorithm: Calculation of MLP-cost for shared LLC misses
  init_mlp_cost(miss):    /* when miss enters MSHR */
    miss.mlp_cost = 0
  update_mlp_cost():      /* called every cycle */
    N_i <- number of outstanding misses from the i-th core
    for each demand miss in MSHR do
      miss.mlp_cost += 1/N_i
    end for

The higher the number of parallel misses, the lower the mlp_cost.
Example: consider 5 misses on a 2-core system, 4 from core0 and 1 from core1 (an isolated miss). Using N = 5 for core1's miss would be unfair, hence the outstanding misses are counted per core.
[2] M. K. Qureshi et al., A Case for MLP-Aware Cache Replacement, ISCA '06
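As a rough illustration, the per-core accounting above can be sketched in Python (the Miss class, the MSHR list, and the cycle loop are hypothetical simulator scaffolding, not the authors' code):

```python
from collections import Counter

class Miss:
    def __init__(self, core):
        self.core = core          # id of the core that issued the miss
        self.mlp_cost = 0.0       # initialized when the miss enters the MSHR

def update_mlp_cost(mshr):
    """Called once per simulated cycle: each outstanding demand miss is
    charged 1/N_i, where N_i is the number of outstanding misses from the
    same core (not from the whole shared MSHR)."""
    per_core = Counter(m.core for m in mshr)
    for m in mshr:
        m.mlp_cost += 1.0 / per_core[m.core]

# The slide's example: 4 misses from core0, 1 isolated miss from core1,
# all outstanding for 100 cycles.
mshr = [Miss(0), Miss(0), Miss(0), Miss(0), Miss(1)]
for _ in range(100):
    update_mlp_cost(mshr)
```

Each core0 miss accumulates 100/4 = 25, while the isolated core1 miss accumulates the full 100 cycles, so losing its block is far costlier.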

Calculation of MLP-cost for shared LLC misses
The accumulated MLP-cost of a miss is quantized into a 3-bit value, q_cost, which tracks the degree of parallelism of the miss: low accumulated MLP-cost (highly parallel misses) maps to high q_cost, and costly isolated misses map to low q_cost.
Table: Quantization of MLP-cost. Eight successive ranges of MLP-cost in cycles, the first being 0-22 cycles, map to the eight 3-bit quantized values.
Figure: % MLP-cost distribution of bzip under LRU, by range of MLP-cost in cycles.
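A hedged sketch of the quantization step: the slide preserves only the first bin (0-22 cycles), so the 23-cycle step and the exact value mapping below are assumptions for illustration only.

```python
def quantize_mlp_cost(mlp_cost_cycles, step=23, bits=3):
    """Map an accumulated MLP-cost (in cycles) to a 3-bit q_cost that
    tracks the degree of parallelism: cheap, highly parallel misses
    quantize high; costly isolated misses quantize low."""
    max_q = (1 << bits) - 1                          # 7 for a 3-bit value
    bin_idx = min(int(mlp_cost_cycles // step), max_q)  # saturating bin index
    return max_q - bin_idx                           # invert: low cost -> high q_cost

quantize_mlp_cost(10)    # in the 0-22 cycle range: highly parallel miss -> 7
quantize_mlp_cost(500)   # beyond the last range: isolated miss -> 0
```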

AB-Aware Cache Replacement
AB-Aware avoids evicting isolated cache blocks that have a near-future re-reference prediction.
Figure: Micro-architecture of AB-Aware for a 2-core system. Cost-compute logic attached to the L3 MSHR supplies the MLP-cost to the cost-aware replacement logic of the shared L3 cache (LLC); each core has private L1 and L2 caches, and the LLC connects to main memory over the bus.

Policies for AB-Aware
Insertion: Br = 2^M - 2; Cr is unaffected.

Algorithm: Victim selection policy for AB-Aware
  iterate:
  if any ABr == max_ABr then
    Victim = max { Br(i) + α Cr(i) + λ q_cost(i) }
  else if any Cr == max_Cr then
    all Br++; goto iterate
  else
    all Cr++; all Br++; goto iterate
  end if

Promotion: Cr = 0 and Br = 0, i.e. ABr = 0.
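A hedged sketch of the victim-selection loop over one cache set. Here ABr is taken as the combination of the core-level (Cr) and block-level (Br) RRPVs, with a block becoming eviction-eligible once both saturate; the 2-bit widths, the weights alpha and lam, and the per-block dictionaries are all illustrative assumptions, not the paper's implementation.

```python
MAX_CR, MAX_BR = 3, 3          # assumed 2-bit core-level and block-level RRPVs

def select_victim(blocks, alpha=1.0, lam=1.0):
    while True:
        eligible = [b for b in blocks
                    if b["cr"] == MAX_CR and b["br"] == MAX_BR]
        if eligible:
            # Highest score is evicted: among equally old blocks, those
            # with high q_cost (highly parallel, cheap-to-refetch misses)
            # go first, so isolated blocks (low q_cost) are protected.
            return max(eligible,
                       key=lambda b: b["br"] + alpha * b["cr"] + lam * b["q_cost"])
        if any(b["cr"] == MAX_CR for b in blocks):
            for b in blocks:                   # age only the block-level RRPVs
                b["br"] = min(b["br"] + 1, MAX_BR)
        else:
            for b in blocks:                   # age both levels and retry
                b["cr"] = min(b["cr"] + 1, MAX_CR)
                b["br"] = min(b["br"] + 1, MAX_BR)

# Two saturated blocks: the one whose miss was highly parallel (q_cost 7)
# is evicted; the isolated one (q_cost 0) survives.
victim = select_victim([
    {"cr": 3, "br": 3, "q_cost": 0},   # isolated miss: costly to lose
    {"cr": 3, "br": 3, "q_cost": 7},   # parallel miss: cheap to refetch
])
```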


Evaluation framework
Sniper multi-core x86 simulator; Nehalem micro-architecture; 2.67 GHz, 4-wide fetch, 128-entry ROB; three-level cache hierarchy; main memory access latency of 175 cycles; 32-entry MSHR.

Table: Cache hierarchy of the simulated system
L1-D Cache: 32KB, 4-Way, Private, 4-cycles
L1-I Cache: 32KB, 4-Way, Private, 4-cycles
L2 Cache:   256KB, 8-Way, Private, 8-cycles
L3 Cache:   2MB per core, 8-Way, Shared, 30-cycles

Memory characteristics
Four traces of 250M instructions per application were selected using PinPoints, for a total of 1B instructions.

Table: Memory characteristics of SPEC CPU2006 applications (APKI: accesses per kilo instructions; MPKI: misses per kilo instructions)
bzip:       Cache friendly (CF)
gcc:        Cache friendly (CF)
soplex:     Cache friendly (CF)
sphinx:     Cache friendly (CF)
lbm:        Streaming (STR)
libquantum: Streaming (STR)
mcf:        Streaming (STR)


AB-Aware 2-core configuration
Figure: Weighted speedup of SRRIP, LIN, ABRIP and AB-Aware across the 2-core workload mixes (CF-STR mixes such as sphinx-lbm and gcc-libq, and CF-CF mixes such as sphinx-soplex and gcc-bzip), with geometric means.

Gains over SRRIP:
                  LIN      ABRIP    AB-Aware
GMean CF-STR     -8.89%    0.63%    1.76%
GMean all        -7.12%   -0.18%    1.69%

AB-Aware 2-core configuration
Figure: MPKI normalized to SRRIP for each application of every 2-core workload mix.
Maximum MPKI reduction, achieved by bzip in the bzip-lbm mix: 69.22%. Average MPKI reduction: 9.7%.

AB-Aware 4-core configuration
Table: Workloads under evaluation for 4-core configuration
Mix 1:  soplex-sphinx-gcc-libq    (CF-CF-CF-STR)
Mix 2:  bzip-gcc-sphinx-lbm       (CF-CF-CF-STR)
Mix 3:  sphinx-bzip-gcc-mcf       (CF-CF-CF-STR)
Mix 4:  bzip-soplex-gcc-lbm       (CF-CF-CF-STR)
Mix 5:  gcc-soplex-sphinx-mcf     (CF-CF-CF-STR)
Mix 6:  sphinx-soplex-bzip-lbm    (CF-CF-CF-STR)
Mix 7:  bzip-sphinx-libq-mcf      (CF-CF-STR-STR)
Mix 8:  gcc-sphinx-mcf-libq       (CF-CF-STR-STR)
Mix 9:  sphinx-bzip-lbm-mcf       (CF-CF-STR-STR)
Mix 10: bzip-gcc-libq-lbm         (CF-CF-STR-STR)
Mix 11: bzip-gcc-lbm-mcf          (CF-CF-STR-STR)
Mix 12: soplex-bzip-libq-lbm      (CF-CF-STR-STR)
Mix 13: soplex-libq-mcf-lbm       (CF-STR-STR-STR)
Mix 14: sphinx-libq-lbm-mcf       (CF-STR-STR-STR)
Mix 15: bzip-gcc-sphinx-soplex    (CF-CF-CF-CF)

AB-Aware 4-core configuration
Figure: Weighted speedup of SRRIP, ABRIP and AB-Aware for Mix 1 through Mix 15 and their geometric mean.
Average performance improvement over SRRIP: 8.71%.

Storage overhead
Table: Storage overhead of AB-Aware (LIN + ABRIP) over SRRIP
Overhead                               2-core    4-core
Core RRPV (2 bits per core per set)    4 KB      16 KB
Block RRPV bits (16 bits per set)      16 KB     32 KB
MLP cost storage (24 bits per set)     24 KB     48 KB
Net overhead                           44 KB     96 KB
% overhead over LLC                    1.1%      1.2%

Overhead over ABRIP: 0.6% for both 2-core and 4-core.
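The rows follow from the set count of the simulated LLC; a quick check for the 2-core, 4 MB shared LLC (the 64 B line size is an assumption, as the slides do not state it; the 8-way, 2 MB-per-core geometry comes from the evaluation setup):

```python
def num_sets(llc_bytes, line_bytes=64, ways=8):
    # sets = capacity / (line size x associativity)
    return llc_bytes // (line_bytes * ways)

sets = num_sets(2 * 2 * 2**20)              # 2 cores x 2 MB -> 8192 sets

def kb(bits):
    return bits / 8 / 1024                  # bits -> kilobytes

core_rrpv  = kb(sets * 2 * 2)               # 2 bits x 2 cores per set -> 4.0 KB
block_rrpv = kb(sets * 16)                  # 16 bits per set          -> 16.0 KB
mlp_cost   = kb(sets * 24)                  # 24 bits per set          -> 24.0 KB
total = core_rrpv + block_rrpv + mlp_cost   # 44.0 KB, ~1.1% of the 4 MB LLC
```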



Conclusion
Memory Level Parallelism (MLP) is non-uniform, and MLP-aware cache replacement reduces isolated misses.
The shared Last Level Cache (LLC) experiences diverse access patterns; LRU is inefficient for L3 caches, and Application Behavior aware replacement reduces interference among applications.
AB-Aware improves MLP and minimizes application interference: average MPKI reduction of 9.6%, and average performance gains of 1.69% (2-core) and 8.71% (4-core).

References
[1] J. R. Srinivasan, Improving Cache Utilisation. PhD thesis, University of Cambridge, 2011.
[2] M. K. Qureshi, D. N. Lynch, O. Mutlu, and Y. N. Patt, A Case for MLP-Aware Cache Replacement, in Proceedings of the 33rd Annual International Symposium on Computer Architecture (ISCA), 2006.
[3] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer, Adaptive Insertion Policies for High Performance Caching, in Proceedings of the 34th Annual International Symposium on Computer Architecture (ISCA), 2007.
[4] A. Jaleel, K. B. Theobald, S. C. Steely, Jr., and J. Emer, High Performance Cache Replacement Using Re-reference Interval Prediction (RRIP), in Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA), 2010.
[5] P. Lathigara, S. Balachandran, and V. Singh, Application Behavior Aware Re-reference Interval Prediction for Shared LLC, in Proceedings of the 33rd IEEE International Conference on Computer Design (ICCD), Oct. 2015.

Thank You!

Backup Slides

LIN single-core configuration
Figure: % IPC improvement over LRU for LIN with λ = 1, 2, 3 and 4 on bzip, gcc, soplex, sphinx, lbm, libq and mcf.
Improvement or degradation is proportional to λ; degradation can be alleviated using Dynamic Set Sampling.

LIN MLP-distribution
Figures: % MLP-cost distribution, by range of MLP-cost in cycles, under LRU versus LIN for bzip, gcc, sphinx, soplex, libq, lbm and mcf.

Performance metrics
IPC_i: IPC of the i-th core running concurrently with the other cores.
IPC_alone_i: IPC of the i-th core running in isolation.

IPC_sum = Σ_i IPC_i    (1)
Weighted Speedup = Σ_i (IPC_i / IPC_alone_i)    (2)

IPC_sum signifies system throughput; Weighted Speedup signifies the reduction in execution time. The ideal value of Weighted Speedup is 2 for dual-core and 4 for quad-core.
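A tiny numeric example of the two metrics (all IPC values are invented):

```python
def ipc_sum(ipcs):
    return sum(ipcs)                          # equation (1): system throughput

def weighted_speedup(ipcs, ipcs_alone):
    # equation (2): each core contributes its slowdown-normalized IPC
    return sum(ipc / alone for ipc, alone in zip(ipcs, ipcs_alone))

ipcs       = [0.8, 1.0]   # IPC of core0, core1 while sharing the LLC
ipcs_alone = [1.0, 2.0]   # IPC of each core when run in isolation

throughput = ipc_sum(ipcs)                # 1.8
ws = weighted_speedup(ipcs, ipcs_alone)   # 0.8 + 0.5 = 1.3 (ideal: 2.0)
```

Both cores run at 80% of their standalone IPC, so the weighted speedup of 1.3 captures the shared-cache slowdown relative to the ideal value of 2.0.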


More information

Cache Friendliness-aware Management of Shared Last-level Caches for High Performance Multi-core Systems

Cache Friendliness-aware Management of Shared Last-level Caches for High Performance Multi-core Systems 1 Cache Friendliness-aware Management of Shared Last-level Caches for High Performance Multi-core Systems Dimitris Kaseridis, Member, IEEE, Muhammad Faisal Iqbal, Student Member, IEEE and Lizy Kurian John,

More information

Lecture 14: Large Cache Design II. Topics: Cache partitioning and replacement policies

Lecture 14: Large Cache Design II. Topics: Cache partitioning and replacement policies Lecture 14: Large Cache Design II Topics: Cache partitioning and replacement policies 1 Page Coloring CACHE VIEW Bank number with Page-to-Bank Tag Set Index Bank number with Set-interleaving Block offset

More information

Adaptive Cache Partitioning on a Composite Core

Adaptive Cache Partitioning on a Composite Core Adaptive Cache Partitioning on a Composite Core Jiecao Yu, Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Scott Mahlke Computer Engineering Lab University of Michigan, Ann Arbor, MI {jiecaoyu, lukefahr,

More information

Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching

Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching Somayeh Sardashti and David A. Wood University of Wisconsin-Madison 1 Please find the power point presentation

More information

Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity. Donghyuk Lee Carnegie Mellon University

Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity. Donghyuk Lee Carnegie Mellon University Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity Donghyuk Lee Carnegie Mellon University Problem: High DRAM Latency processor stalls: waiting for data main memory high latency Major bottleneck

More information

Sampling Dead Block Prediction for Last-Level Caches

Sampling Dead Block Prediction for Last-Level Caches Appears in Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-43), December 2010 Sampling Dead Block Prediction for Last-Level Caches Samira Khan, Yingying Tian,

More information

Prefetch-Aware DRAM Controllers

Prefetch-Aware DRAM Controllers Prefetch-Aware DRAM Controllers Chang Joo Lee Onur Mutlu Veynu Narasiman Yale N. Patt High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin

More information

Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency

Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency Gennady Pekhimenko, Vivek Seshadri, Yoongu Kim, Hongyi Xin, Onur Mutlu, Todd C. Mowry Phillip B. Gibbons,

More information

Adaptive Placement and Migration Policy for an STT-RAM-Based Hybrid Cache

Adaptive Placement and Migration Policy for an STT-RAM-Based Hybrid Cache Adaptive Placement and Migration Policy for an STT-RAM-Based Hybrid Cache Zhe Wang Daniel A. Jiménez Cong Xu Guangyu Sun Yuan Xie Texas A&M University Pennsylvania State University Peking University AMD

More information

Designing High-Performance and Fair Shared Multi-Core Memory Systems: Two Approaches. Onur Mutlu March 23, 2010 GSRC

Designing High-Performance and Fair Shared Multi-Core Memory Systems: Two Approaches. Onur Mutlu March 23, 2010 GSRC Designing High-Performance and Fair Shared Multi-Core Memory Systems: Two Approaches Onur Mutlu onur@cmu.edu March 23, 2010 GSRC Modern Memory Systems (Multi-Core) 2 The Memory System The memory system

More information

Exploring Latency-Power Tradeoffs in Deep Nonvolatile Memory Hierarchies

Exploring Latency-Power Tradeoffs in Deep Nonvolatile Memory Hierarchies Exploring Latency-Power Tradeoffs in Deep Nonvolatile Memory Hierarchies Doe Hyun Yoon, Tobin Gonzalez, Parthasarathy Ranganathan, and Robert S. Schreiber Intelligent Infrastructure Lab (IIL), Hewlett-Packard

More information

High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas

High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas Prefetch-Aware Shared-Resource Management for Multi-Core Systems Eiman Ebrahimi Chang Joo Lee Onur Mutlu Yale N. Patt High Performance Systems Group Department of Electrical and Computer Engineering The

More information

Flexible Cache Error Protection using an ECC FIFO

Flexible Cache Error Protection using an ECC FIFO Flexible Cache Error Protection using an ECC FIFO Doe Hyun Yoon and Mattan Erez Dept Electrical and Computer Engineering The University of Texas at Austin 1 ECC FIFO Goal: to reduce on-chip ECC overhead

More information

Insertion and Promotion for Tree-Based PseudoLRU Last-Level Caches

Insertion and Promotion for Tree-Based PseudoLRU Last-Level Caches Insertion and Promotion for Tree-Based PseudoLRU Last-Level Caches Daniel A. Jiménez Department of Computer Science and Engineering Texas A&M University ABSTRACT Last-level caches mitigate the high latency

More information

OAP: An Obstruction-Aware Cache Management Policy for STT-RAM Last-Level Caches

OAP: An Obstruction-Aware Cache Management Policy for STT-RAM Last-Level Caches OAP: An Obstruction-Aware Cache Management Policy for STT-RAM Last-Level Caches Jue Wang, Xiangyu Dong, Yuan Xie Department of Computer Science and Engineering, Pennsylvania State University Qualcomm Technology,

More information

Improving DRAM Performance by Parallelizing Refreshes with Accesses

Improving DRAM Performance by Parallelizing Refreshes with Accesses Improving DRAM Performance by Parallelizing Refreshes with Accesses Kevin Chang Donghyuk Lee, Zeshan Chishti, Alaa Alameldeen, Chris Wilkerson, Yoongu Kim, Onur Mutlu Executive Summary DRAM refresh interferes

More information

Addressing End-to-End Memory Access Latency in NoC-Based Multicores

Addressing End-to-End Memory Access Latency in NoC-Based Multicores Addressing End-to-End Memory Access Latency in NoC-Based Multicores Akbar Sharifi, Emre Kultursay, Mahmut Kandemir and Chita R. Das The Pennsylvania State University University Park, PA, 682, USA {akbar,euk39,kandemir,das}@cse.psu.edu

More information

Combining Local and Global History for High Performance Data Prefetching

Combining Local and Global History for High Performance Data Prefetching Combining Local and Global History for High Performance Data ing Martin Dimitrov Huiyang Zhou School of Electrical Engineering and Computer Science University of Central Florida {dimitrov,zhou}@eecs.ucf.edu

More information

Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat

Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat Security-Aware Processor Architecture Design CS 6501 Fall 2018 Ashish Venkat Agenda Common Processor Performance Metrics Identifying and Analyzing Bottlenecks Benchmarking and Workload Selection Performance

More information

SWAP: EFFECTIVE FINE-GRAIN MANAGEMENT

SWAP: EFFECTIVE FINE-GRAIN MANAGEMENT : EFFECTIVE FINE-GRAIN MANAGEMENT OF SHARED LAST-LEVEL CACHES WITH MINIMUM HARDWARE SUPPORT Xiaodong Wang, Shuang Chen, Jeff Setter, and José F. Martínez Computer Systems Lab Cornell University Page 1

More information

ACCORD: Enabling Associativity for Gigascale DRAM Caches by Coordinating Way-Install and Way-Prediction

ACCORD: Enabling Associativity for Gigascale DRAM Caches by Coordinating Way-Install and Way-Prediction ACCORD: Enabling Associativity for Gigascale DRAM Caches by Coordinating Way-Install and Way-Prediction Vinson Young, Chiachen Chou, Aamer Jaleel *, and Moinuddin K. Qureshi Georgia Institute of Technology

More information

Lecture: Large Caches, Virtual Memory. Topics: cache innovations (Sections 2.4, B.4, B.5)

Lecture: Large Caches, Virtual Memory. Topics: cache innovations (Sections 2.4, B.4, B.5) Lecture: Large Caches, Virtual Memory Topics: cache innovations (Sections 2.4, B.4, B.5) 1 More Cache Basics caches are split as instruction and data; L2 and L3 are unified The /L2 hierarchy can be inclusive,

More information

Improving Inclusive Cache Performance with Two-level Eviction Priority

Improving Inclusive Cache Performance with Two-level Eviction Priority Improving Inclusive Cache Performance with Two-level Eviction Priority Lingda Li, Dong Tong, Zichao Xie, Junlin Lu, Xu Cheng Microprocessor Research and Development Center, Peking University, Beijing,

More information

Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization

Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization Fazal Hameed and Jeronimo Castrillon Center for Advancing Electronics Dresden (cfaed), Technische Universität Dresden,

More information

Towards Energy-Proportional Datacenter Memory with Mobile DRAM

Towards Energy-Proportional Datacenter Memory with Mobile DRAM Towards Energy-Proportional Datacenter Memory with Mobile DRAM Krishna Malladi 1 Frank Nothaft 1 Karthika Periyathambi Benjamin Lee 2 Christos Kozyrakis 1 Mark Horowitz 1 Stanford University 1 Duke University

More information

Analyzing Instructtion Based Cache Replacement Policies

Analyzing Instructtion Based Cache Replacement Policies University of Central Florida Electronic Theses and Dissertations Masters Thesis (Open Access) Analyzing Instructtion Based Cache Replacement Policies 2010 Ping Xiang University of Central Florida Find

More information

Computer Sciences Department

Computer Sciences Department Computer Sciences Department SIP: Speculative Insertion Policy for High Performance Caching Hongil Yoon Tan Zhang Mikko H. Lipasti Technical Report #1676 June 2010 SIP: Speculative Insertion Policy for

More information

h Coherence Controllers

h Coherence Controllers High-Throughput h Coherence Controllers Anthony-Trung Nguyen Microprocessor Research Labs Intel Corporation 9/30/03 Motivations Coherence Controller (CC) throughput is bottleneck of scalable systems. CCs

More information

arxiv: v1 [cs.ar] 10 Apr 2017

arxiv: v1 [cs.ar] 10 Apr 2017 Banshee: Bandwidth-Efficient DRAM Caching Via Software/Hardware Cooperation Xiangyao Yu, Christopher J. Hughes, Nadathur Satish, Onur Mutlu, Srinivas Devadas CSAIL, MIT, Intel Labs, ETH Zurich {yxy, devadas}@mit.edu,

More information

Improving Cache Management Policies Using Dynamic Reuse Distances

Improving Cache Management Policies Using Dynamic Reuse Distances Improving Cache Management Policies Using Dynamic Reuse Distances Nam Duong, Dali Zhao, Taesu Kim, Rosario Cammarota, Mateo Valero and Alexander V. Veidenbaum University of California, Irvine Universitat

More information

A Cache Scheme Based on LRU-Like Algorithm

A Cache Scheme Based on LRU-Like Algorithm Proceedings of the 2010 IEEE International Conference on Information and Automation June 20-23, Harbin, China A Cache Scheme Based on LRU-Like Algorithm Dongxing Bao College of Electronic Engineering Heilongjiang

More information

15-740/ Computer Architecture Lecture 12: Advanced Caching. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 12: Advanced Caching. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 12: Advanced Caching Prof. Onur Mutlu Carnegie Mellon University Announcements Chuck Thacker (Microsoft Research) Seminar Tomorrow RARE: Rethinking Architectural

More information

Database Workload. from additional misses in this already memory-intensive databases? interference could be a problem) Key question:

Database Workload. from additional misses in this already memory-intensive databases? interference could be a problem) Key question: Database Workload + Low throughput (0.8 IPC on an 8-wide superscalar. 1/4 of SPEC) + Naturally threaded (and widely used) application - Already high cache miss rates on a single-threaded machine (destructive

More information

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt High Performance Systems Group

More information

Dynamic Cache Partitioning for CMP/SMT Systems

Dynamic Cache Partitioning for CMP/SMT Systems Dynamic Cache Partitioning for CMP/SMT Systems G. E. Suh (suh@mit.edu), L. Rudolph (rudolph@mit.edu) and S. Devadas (devadas@mit.edu) Massachusetts Institute of Technology Abstract. This paper proposes

More information

Improving Writeback Efficiency with Decoupled Last-Write Prediction

Improving Writeback Efficiency with Decoupled Last-Write Prediction Improving Writeback Efficiency with Decoupled Last-Write Prediction Zhe Wang Samira M. Khan Daniel A. Jiménez The University of Texas at San Antonio {zhew,skhan,dj}@cs.utsa.edu Abstract In modern DDRx

More information

Silent Shredder: Zero-Cost Shredding For Secure Non-Volatile Main Memory Controllers

Silent Shredder: Zero-Cost Shredding For Secure Non-Volatile Main Memory Controllers Silent Shredder: Zero-Cost Shredding For Secure Non-Volatile Main Memory Controllers 1 ASPLOS 2016 2-6 th April Amro Awad (NC State University) Pratyusa Manadhata (Hewlett Packard Labs) Yan Solihin (NC

More information

Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors

Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors Wenun Wang and Wei-Ming Lin Department of Electrical and Computer Engineering, The University

More information

EECS750: Advanced Operating Systems. 2/24/2014 Heechul Yun

EECS750: Advanced Operating Systems. 2/24/2014 Heechul Yun EECS750: Advanced Operating Systems 2/24/2014 Heechul Yun 1 Administrative Project Feedback of your proposal will be sent by Wednesday Midterm report due on Apr. 2 3 pages: include intro, related work,

More information

Predictive Thread-to-Core Assignment on a Heterogeneous Multi-core Processor*

Predictive Thread-to-Core Assignment on a Heterogeneous Multi-core Processor* Predictive Thread-to-Core Assignment on a Heterogeneous Multi-core Processor* Tyler Viswanath Krishnamurthy, and Hridesh Laboratory for Software Design Department of Computer Science Iowa State University

More information

Memory Performance Characterization of SPEC CPU2006 Benchmarks Using TSIM1

Memory Performance Characterization of SPEC CPU2006 Benchmarks Using TSIM1 Available online at www.sciencedirect.com Physics Procedia 33 (2012 ) 1029 1035 2012 International Conference on Medical Physics and Biomedical Engineering Memory Performance Characterization of SPEC CPU2006

More information

Application-to-Core Mapping Policies to Reduce Memory System Interference in Multi-Core Systems

Application-to-Core Mapping Policies to Reduce Memory System Interference in Multi-Core Systems Application-to-Core Mapping Policies to Reduce Memory System Interference in Multi-Core Systems Reetuparna Das Rachata Ausavarungnirun Onur Mutlu Akhilesh Kumar Mani Azimi University of Michigan Carnegie

More information

Secure Hierarchy-Aware Cache Replacement Policy (SHARP): Defending Against Cache-Based Side Channel Attacks

Secure Hierarchy-Aware Cache Replacement Policy (SHARP): Defending Against Cache-Based Side Channel Attacks : Defending Against Cache-Based Side Channel Attacks Mengjia Yan, Bhargava Gopireddy, Thomas Shull, Josep Torrellas University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu Presented by Mengjia

More information

Multiperspective Reuse Prediction

Multiperspective Reuse Prediction ABSTRACT Daniel A. Jiménez Texas A&M University djimenezacm.org The disparity between last-level cache and memory latencies motivates the search for e cient cache management policies. Recent work in predicting

More information

JIGSAW: SCALABLE SOFTWARE-DEFINED CACHES

JIGSAW: SCALABLE SOFTWARE-DEFINED CACHES JIGSAW: SCALABLE SOFTWARE-DEFINED CACHES NATHAN BECKMANN AND DANIEL SANCHEZ MIT CSAIL PACT 13 - EDINBURGH, SCOTLAND SEP 11, 2013 Summary NUCA is giving us more capacity, but further away 40 Applications

More information

The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory

The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory Lavanya Subramanian* Vivek Seshadri* Arnab Ghosh* Samira Khan*

More information

Performance metrics for caches

Performance metrics for caches Performance metrics for caches Basic performance metric: hit ratio h h = Number of memory references that hit in the cache / total number of memory references Typically h = 0.90 to 0.97 Equivalent metric:

More information

http://uu.diva-portal.org This is an author produced version of a paper presented at the 4 th Swedish Workshop on Multi-Core Computing, November 23-25, 2011, Linköping, Sweden. Citation for the published

More information

Cache Replacement Championship. The 3P and 4P cache replacement policies

Cache Replacement Championship. The 3P and 4P cache replacement policies 1 Cache Replacement Championship The 3P and 4P cache replacement policies Pierre Michaud INRIA June 20, 2010 2 Optimal replacement? Offline (we know the future) Belady Online (we don t know the future)

More information

THE DYNAMIC GRANULARITY MEMORY SYSTEM

THE DYNAMIC GRANULARITY MEMORY SYSTEM THE DYNAMIC GRANULARITY MEMORY SYSTEM Doe Hyun Yoon IIL, HP Labs Michael Sullivan Min Kyu Jeong Mattan Erez ECE, UT Austin MEMORY ACCESS GRANULARITY The size of block for accessing main memory Often, equal

More information

A Swap-based Cache Set Index Scheme to Leverage both Superpage and Page Coloring Optimizations

A Swap-based Cache Set Index Scheme to Leverage both Superpage and Page Coloring Optimizations A Swap-based Cache Set Index Scheme to Leverage both and Page Coloring Optimizations Zehan Cui, Licheng Chen, Yungang Bao, Mingyu Chen State Key Laboratory of Computer Architecture, Institute of Computing

More information

Virtualized and Flexible ECC for Main Memory

Virtualized and Flexible ECC for Main Memory Virtualized and Flexible ECC for Main Memory Doe Hyun Yoon and Mattan Erez Dept. Electrical and Computer Engineering The University of Texas at Austin ASPLOS 2010 1 Memory Error Protection Applying ECC

More information

A Model for Application Slowdown Estimation in On-Chip Networks and Its Use for Improving System Fairness and Performance

A Model for Application Slowdown Estimation in On-Chip Networks and Its Use for Improving System Fairness and Performance A Model for Application Slowdown Estimation in On-Chip Networks and Its Use for Improving System Fairness and Performance Xiyue Xiang Saugata Ghose Onur Mutlu Nian-Feng Tzeng University of Louisiana at

More information

15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 20: Main Memory II Prof. Onur Mutlu Carnegie Mellon University Today SRAM vs. DRAM Interleaving/Banking DRAM Microarchitecture Memory controller Memory buses

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Background Heterogeneous Architectures Performance Modeling Single Core Performance Profiling Multicore Performance Estimation Test Cases Multicore

Background Heterogeneous Architectures Performance Modeling Single Core Performance Profiling Multicore Performance Estimation Test Cases Multicore By Dan Stafford Background Heterogeneous Architectures Performance Modeling Single Core Performance Profiling Multicore Performance Estimation Test Cases Multicore Design Space Results & Observations General

More information

OpenPrefetch. (in-progress)

OpenPrefetch. (in-progress) OpenPrefetch Let There Be Industry-Competitive Prefetching in RISC-V Processors (in-progress) Bowen Huang, Zihao Yu, Zhigang Liu, Chuanqi Zhang, Sa Wang, Yungang Bao Institute of Computing Technology(ICT),

More information

Kill the Program Counter: Reconstructing Program Behavior in the Processor Cache Hierarchy

Kill the Program Counter: Reconstructing Program Behavior in the Processor Cache Hierarchy Kill the Program Counter: Reconstructing Program Behavior in the Processor Cache Hierarchy Jinchun Kim Texas A&M University cienlux@tamu.edu Daniel A. Jiménez Texas A&M University djimenez@cse.tamu.edu

More information

DEMM: a Dynamic Energy-saving mechanism for Multicore Memories

DEMM: a Dynamic Energy-saving mechanism for Multicore Memories DEMM: a Dynamic Energy-saving mechanism for Multicore Memories Akbar Sharifi, Wei Ding 2, Diana Guttman 3, Hui Zhao 4, Xulong Tang 5, Mahmut Kandemir 5, Chita Das 5 Facebook 2 Qualcomm 3 Intel 4 University

More information

Ctrl-C: Instruction-Aware Control Loop Based Adaptive Cache Bypassing for GPUs

Ctrl-C: Instruction-Aware Control Loop Based Adaptive Cache Bypassing for GPUs The 34 th IEEE International Conference on Computer Design Ctrl-C: Instruction-Aware Control Loop Based Adaptive Cache Bypassing for GPUs Shin-Ying Lee and Carole-Jean Wu Arizona State University October

More information