AB-Aware: Application Behavior Aware Management of Shared Last Level Caches
1 AB-Aware: Application Behavior Aware Management of Shared Last Level Caches
Suhit Pai, Newton Singh and Virendra Singh
Computer Architecture and Dependable Systems Laboratory, Department of Electrical Engineering, Indian Institute of Technology Bombay, India
The 28th ACM Great Lakes Symposium on VLSI (GLSVLSI)
Suhit Pai (IIT Bombay), AB-Aware, Friday, May 25th, 2018
2 Outline
1. Introduction
2. Motivation
3. AB-Aware Cache Management
4. Evaluation
5. Results
6. Conclusion
3-5 Introduction
Thousand-fold growth in microprocessor performance:
1. Core micro-architectural innovations
2. Transistor-speed scaling
Figure: Memory latency and CPU frequency improvement [1] (relative improvement: processor clock frequency ~200x vs. memory latency only ~10x)
[1] J. R. Srinivasan, Improving Cache Utilisation, PhD thesis, University of Cambridge, 2011
6-9 Introduction
Disparity between CPU and memory speeds -> memory bottleneck -> cache hierarchy
A miss at the Last Level Cache (LLC) takes hundreds of cycles to get serviced; processor execution stalls, causing performance loss.
Non-blocking caches service multiple misses in parallel, reducing effective memory stalls.
The notion of generating and servicing long-latency cache misses in parallel is called Memory Level Parallelism (MLP) [2].
MLP is not uniform across all memory accesses; MLP-aware cache replacement can minimize isolated misses.
[2] M. K. Qureshi et al., A Case for MLP-Aware Cache Replacement, ISCA '06
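The cost asymmetry between parallel and isolated misses can be sketched as follows; the 200-cycle miss latency and the full-overlap assumption are illustrative, not from the slides.

```python
# Illustrative sketch (not from the slides): why parallel misses cost
# less per miss than isolated ones. Assumes a fixed miss latency and
# full overlap among concurrent misses.

MISS_LATENCY = 200  # cycles; assumed value for illustration

def stall_cycles(miss_groups):
    """Each group is a set of misses serviced concurrently.
    Overlapped misses share one latency window."""
    return sum(MISS_LATENCY for _ in miss_groups)

# 4 isolated misses: 4 separate latency windows
isolated = stall_cycles([[0], [1], [2], [3]])   # 800 cycles
# 4 parallel misses: one shared latency window
parallel = stall_cycles([[0, 1, 2, 3]])         # 200 cycles

print(isolated, parallel)
```

With the same number of misses, the stall time differs by 4x, which is why a replacement policy that only minimizes miss counts can still lose performance.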
11-13 Example
A reduced total number of misses does not always result in reduced memory stall cycles!
[2] M. K. Qureshi et al., A Case for MLP-Aware Cache Replacement, ISCA '06
14-18 Multi-core Systems
LLC shared among multiple homogeneous or heterogeneous cores:
+ Dynamic sharing of cache space
+ Better utilization of cache resources
- Increased number of conflict misses
- Interference among applications
Regularity in cache accesses is filtered by higher-level caches, so the LLC sees a diverse access pattern.
Applications have different memory characteristics:
- Cache Friendly (CF): small working data-set, more cache re-use
- Streaming (STR): large working data-set, low cache re-use
LRU-based cache replacement policies allocate cache resources on a demand basis.
19-20 LRU cache replacement on shared LLC
L3 cache size varied from 512 KB to 4 MB by changing the number of ways and keeping the number of sets constant.
Figure: Memory characteristics of sphinx and libquantum (MPKI vs. number of ways; 1 way = 512 KB)
Under the LRU policy in a 2-core configuration, sphinx got only 3 ways out of 8 (libquantum got 5 ways); sphinx suffered due to interference from libquantum.
MPKI is Misses Per Kilo Instructions.
21-22 Replacement policies for shared LLC
On average, 60% of cache lines are not re-referenced before eviction [3]: no temporal locality, or a data-set larger than the allocated cache.
Figure: Various cache replacement policies
[3] M. K. Qureshi et al., Adaptive Insertion Policies for High Performance Caching, ISCA '07
[4] A. Jaleel et al., High Performance Cache Replacement using Re-Reference Interval Prediction (RRIP), ISCA '10
23-25 Replacement policies for shared LLC
ABRIP [5] maintains two counters:
1. Core level (Cr): whose block to replace
2. Block level (Br): which block to replace
+ Reduces interference by giving more priority to application behavior
+ Improves throughput of CF-STR workload mixes
- Doesn't work very well with CF-CF workload mixes
Proposal:
- MLP awareness: tries to minimize the number of isolated misses
- Application Behavior awareness: tries to reduce interference among applications
[5] P. Lathigara et al., Application Behavior Aware Re-reference Interval Prediction for Shared LLC, ICCD '15
27-28 MLP-cost calculation for shared LLC
The Miss Status Holding Register (MSHR) maintains the number of in-flight misses [2].

Algorithm 1: Calculation of MLP-cost for shared LLC misses
1: init_mlp_cost(miss):   /* when miss enters MSHR */
2:   miss.mlp_cost = 0
3: update_mlp_cost():     /* called every cycle */
4:   N_i <- number of outstanding misses from the i-th core
5:   for each demand miss in MSHR do
6:     mlp_cost += (1/N_i)
7:   end for

The higher the number of parallel misses, the lower the mlp_cost.
Example: consider 5 misses on a 2-core system, 4 from core0 and 1 from core1 (an isolated miss). Considering N = 5 for core1's miss would be unfair, so outstanding misses are counted per core.
[2] M. K. Qureshi et al., A Case for MLP-Aware Cache Replacement, ISCA '06
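Algorithm 1 can be sketched in a few lines; the MSHR model and names below are illustrative, not the paper's implementation.

```python
# Sketch of Algorithm 1 (per-core MLP-cost accounting) over a simple
# MSHR model; field and function names are illustrative.
from collections import Counter

class Miss:
    def __init__(self, core):
        self.core = core
        self.mlp_cost = 0.0   # init_mlp_cost: set when the miss enters the MSHR

def update_mlp_cost(mshr):
    """Called every cycle: each outstanding miss accrues 1/N_i, where
    N_i is the number of outstanding misses from its own core."""
    n = Counter(m.core for m in mshr)
    for m in mshr:
        m.mlp_cost += 1.0 / n[m.core]

# 5 misses on a 2-core system: 4 from core 0, 1 isolated miss from core 1
mshr = [Miss(0), Miss(0), Miss(0), Miss(0), Miss(1)]
update_mlp_cost(mshr)
# core-0 misses each accrue 1/4 per cycle; the isolated core-1 miss accrues 1
print([m.mlp_cost for m in mshr])
```

This reproduces the slide's fairness point: counting N per core (not per MSHR) charges the isolated core-1 miss its full cost instead of diluting it by the other core's parallelism.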
29 Calculation of MLP-cost for shared LLC misses
Table: Quantization of MLP-cost. Ranges of MLP-based cost in cycles (starting with 0-22 cycles) are mapped to 3-bit quantized values; q_cost thus reflects the degree of parallelism.
Figure: % MLP distribution of bzip under LRU vs. range of mlp_cost in cycles
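A quantizer in this style can be sketched as follows; the table's exact bin boundaries did not survive transcription, so the uniform bin width below is an assumption based on the first visible range (0-22 cycles).

```python
# Hedged sketch of 3-bit MLP-cost quantization. Uniform ~23-cycle bins
# are an assumption, not the paper's exact table.

BIN_WIDTH = 23  # cycles per bin; assumed for illustration

def q_cost(mlp_cost_cycles):
    """Map an accumulated MLP-cost (in cycles) to a 3-bit value 0..7,
    saturating at 7 for very isolated (high-cost) misses."""
    return min(int(mlp_cost_cycles // BIN_WIDTH), 7)

print(q_cost(10), q_cost(50), q_cost(500))
```

Low-cost (highly parallel) misses land in the low bins; costs past the last boundary saturate at 7, keeping the stored cost to 3 bits per block.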
30 AB-Aware Cache Replacement
AB-Aware avoids eviction of isolated cache blocks that have a near-future re-reference prediction.
Figure: Micro-architecture of AB-Aware. Cost-compute logic attached to the L3 MSHR supplies an MLP cost to the cost-aware replacement logic of the shared L3 cache (LLC); each core has private L1 and L2 caches, with main memory below the LLC.
31 Policies for AB-Aware
Insertion: Br = 2^M - 2 (a long re-reference interval for M-bit Br counters); Cr is unaffected.

Algorithm 3: Victim selection policy for AB-Aware
1: iterate:
2: if any ABr == max ABr then
3:   Victim = max {Br(i) + α·Cr(i) + λ·q_cost(i)}
4: else if any Cr == max Cr then
5:   all Br++; goto iterate
6: else
7:   all Cr++; all Br++; goto iterate
8: end if

Promotion: Cr = 0 & Br = 0 (ABr = 0)
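The victim selection above can be sketched as follows, under several assumptions: 2-bit counters, a per-block copy of the core-level counter Cr (ABRIP keeps Cr per core), ABr taken as Br + Cr, and the score expression exactly as printed on the slide. Since AB-Aware aims to retain isolated (high-MLP-cost) blocks, the effective sign convention of the q_cost term depends on how q_cost is encoded; treat the sketch as illustrative.

```python
# Illustrative sketch of Algorithm 3's victim selection, not the
# paper's implementation. Assumes 2-bit saturating counters and
# ABr = Br + Cr.

MAX_RRPV = 3  # 2^M - 1 for M = 2 bits; assumed

def select_victim(blocks, alpha=1.0, lam=1.0):
    """blocks: dicts with block-level 'Br', core-level 'Cr', and
    quantized MLP cost 'q'. Ages counters until some block reaches the
    maximum combined interval, then picks the block with the highest
    Br + alpha*Cr + lam*q score (expression as printed on the slide)."""
    while True:
        abr = [b['Br'] + b['Cr'] for b in blocks]
        if max(abr) >= 2 * MAX_RRPV:           # any ABr == max ABr
            return max(range(len(blocks)),
                       key=lambda i: blocks[i]['Br']
                                     + alpha * blocks[i]['Cr']
                                     + lam * blocks[i]['q'])
        if max(b['Cr'] for b in blocks) >= MAX_RRPV:
            for b in blocks:                   # age block-level counters only
                b['Br'] = min(b['Br'] + 1, MAX_RRPV)
        else:                                  # age both counter levels
            for b in blocks:
                b['Cr'] = min(b['Cr'] + 1, MAX_RRPV)
                b['Br'] = min(b['Br'] + 1, MAX_RRPV)

blocks = [{'Br': 1, 'Cr': 2, 'q': 0},
          {'Br': 3, 'Cr': 3, 'q': 2},
          {'Br': 2, 'Cr': 3, 'q': 7}]
print(select_victim(blocks))
```

Because the counters saturate, the aging loop always terminates: eventually some block's ABr reaches the maximum and the λ-weighted score breaks the tie.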
33 Evaluation framework
Sniper multi-core x86 simulator; Nehalem micro-architecture; 2.67 GHz, 4-wide fetch, 128-entry ROB; three-level cache hierarchy; main memory access latency 175 cycles; 32-entry MSHR.

Table: Cache hierarchy of the simulated system
- L1-D Cache: 32KB, 4-way, private, 4 cycles
- L1-I Cache: 32KB, 4-way, private, 4 cycles
- L2 Cache: 256KB, 8-way, private, 8 cycles
- L3 Cache: 2MB per core, 8-way, shared, 30 cycles
34 Memory characteristics
Four traces of 250M instructions selected using PinPoints (1B instructions in total).

Table: Memory characteristics of SPEC CPU2006 applications (apki: accesses per kilo instructions; mpki: misses per kilo instructions)
- bzip: Cache friendly (CF)
- gcc: Cache friendly (CF)
- soplex: Cache friendly (CF)
- sphinx: Cache friendly (CF)
- lbm: Streaming (STR)
- libquantum: Streaming (STR)
- mcf: Streaming (STR)
36 AB-Aware: 2-core configuration
Figure: Weighted speedup of SRRIP, LIN, ABRIP, and AB-Aware across 2-core CF-STR mixes (sphinx-lbm, gcc-libq, bzip-lbm, gcc-mcf, gcc-lbm, bzip-libq, soplex-libq, soplex-mcf, sphinx-mcf, bzip-mcf, soplex-lbm, sphinx-libq) and CF-CF mixes (sphinx-soplex, soplex-gcc, gcc-bzip, sphinx-gcc, sphinx-bzip, soplex-bzip), with geometric means

Gains over SRRIP:
               LIN      ABRIP    AB-Aware
GMean CF-STR   -8.89%   0.63%    1.76%
GMean all      -7.12%   -0.18%   1.69%
37 AB-Aware: 2-core configuration
Figure: MPKI normalized to SRRIP for each application (app-1, app-2) in each 2-core mix
Maximum reduction achieved by (bzip*-lbm) = 69.22%
Average MPKI reduction = 9.7%
38 AB-Aware: 4-core configuration
Table: Workloads under evaluation for the 4-core configuration
- Mix 1: soplex-sphinx-gcc-libq (CF-CF-CF-STR)
- Mix 2: bzip-gcc-sphinx-lbm (CF-CF-CF-STR)
- Mix 3: sphinx-bzip-gcc-mcf (CF-CF-CF-STR)
- Mix 4: bzip-soplex-gcc-lbm (CF-CF-CF-STR)
- Mix 5: gcc-soplex-sphinx-mcf (CF-CF-CF-STR)
- Mix 6: sphinx-soplex-bzip-lbm (CF-CF-CF-STR)
- Mix 7: bzip-sphinx-libq-mcf (CF-CF-STR-STR)
- Mix 8: gcc-sphinx-mcf-libq (CF-CF-STR-STR)
- Mix 9: sphinx-bzip-lbm-mcf (CF-CF-STR-STR)
- Mix 10: bzip-gcc-libq-lbm (CF-CF-STR-STR)
- Mix 11: bzip-gcc-lbm-mcf (CF-CF-STR-STR)
- Mix 12: soplex-bzip-libq-lbm (CF-CF-STR-STR)
- Mix 13: soplex-libq-mcf-lbm (CF-STR-STR-STR)
- Mix 14: sphinx-libq-lbm-mcf (CF-STR-STR-STR)
- Mix 15: bzip-gcc-sphinx-soplex (CF-CF-CF-CF)
39 AB-Aware: 4-core configuration
Figure: Weighted speedup of SRRIP, ABRIP, and AB-Aware for Mix 1 through Mix 15, with the geometric mean
Average performance improvement over SRRIP: 8.71%
40 Storage overhead
Table: Storage overhead of LINABRIP over SRRIP
  Overhead                               2-core   4-core
  Core RRPV (2 bits per core per set)    4 KB     16 KB
  Block RRPV bits (16 bits per set)      16 KB    32 KB
  MLP cost storage (24 bits per set)     24 KB    48 KB
  Net overhead                           44 KB    96 KB
  % overhead over LLC                    1.1%     1.2%
Overhead over ABRIP: 0.6% for both 2-core and 4-core
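The table's totals can be checked arithmetically; 64-byte cache lines are assumed (the slides give the 8-way, 2 MB-per-core shared LLC, but not the line size).

```python
# Checking the storage-overhead arithmetic. Assumes 64-byte lines and
# the simulated hierarchy: 2 MB of shared 8-way LLC per core.

LINE, WAYS, LLC_PER_CORE = 64, 8, 2 * 1024 * 1024

def overhead_kb(cores):
    sets = cores * LLC_PER_CORE // (LINE * WAYS)
    core_rrpv = 2 * cores * sets   # 2 bits per core per set
    block_rrpv = 16 * sets         # 16 bits per set (2 bits x 8 ways)
    mlp_cost = 24 * sets           # 24 bits per set (3 bits x 8 ways)
    return (core_rrpv + block_rrpv + mlp_cost) // (8 * 1024)  # bits -> KB

print(overhead_kb(2), overhead_kb(4))
```

With these assumptions the per-row and net figures reproduce the table exactly: 44 KB for 2 cores and 96 KB for 4 cores.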
42-44 Conclusion
- Memory Level Parallelism (MLP) is non-uniform; MLP-aware cache replacement reduces isolated misses.
- A shared Last Level Cache (LLC) experiences diverse access patterns; LRU is inefficient for L3 caches; Application Behavior aware replacement reduces interference among applications.
- AB-Aware improves MLP and minimizes application interference: average MPKI reduction 9.6%; average performance gain 1.69% (2-core) and 8.71% (4-core).
45 References
[1] J. R. Srinivasan, Improving Cache Utilisation, PhD thesis, University of Cambridge, 2011.
[2] M. K. Qureshi, D. N. Lynch, O. Mutlu, and Y. N. Patt, A Case for MLP-Aware Cache Replacement, in Proceedings of the 33rd Annual International Symposium on Computer Architecture (ISCA), 2006.
[3] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer, Adaptive Insertion Policies for High Performance Caching, in Proceedings of the 34th Annual International Symposium on Computer Architecture (ISCA), 2007.
[4] A. Jaleel, K. B. Theobald, S. C. Steely, Jr., and J. Emer, High Performance Cache Replacement Using Re-reference Interval Prediction (RRIP), in Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA), 2010.
[5] P. Lathigara, S. Balachandran, and V. Singh, Application Behavior Aware Re-reference Interval Prediction for Shared LLC, in Proceedings of the 33rd IEEE International Conference on Computer Design (ICCD), Oct. 2015.
46 Thank You!
47 Backup Slides
48 LIN: single-core configuration
Figure: % IPC improvement over LRU for LIN with λ = 1, 2, 3, 4 on bzip, gcc, soplex, sphinx, lbm, libq, mcf
Improvement/degradation is proportional to λ; degradation can be alleviated using Dynamic Set Sampling.
49-52 LIN: MLP-distribution
Figures: % MLP distribution (vs. range of mlp_cost in cycles) under LRU and under LIN for bzip, gcc, sphinx, soplex, libq, lbm, and mcf
53 Performance metrics
IPC_i = IPC of the i-th core running concurrently with other cores
IPC_alone_i = IPC of the i-th core running in isolation

IPC_sum = Σ_i IPC_i  (1)
Weighted Speedup = Σ_i (IPC_i / IPC_alone_i)  (2)

IPC_sum signifies system throughput; Weighted Speedup signifies the reduction in execution time. The ideal value of Weighted Speedup is 2 for dual-core and 4 for quad-core.
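The two metrics can be sketched as follows; the IPC values in the usage example are hypothetical.

```python
# Sketch of the two performance metrics from the slide; the IPC values
# below are made up for illustration.

def ipc_sum(ipcs):
    """System throughput: sum of per-core IPCs in the shared run."""
    return sum(ipcs)

def weighted_speedup(ipcs, ipcs_alone):
    """Sum over cores of IPC_i / IPC_alone_i; equals the core count
    when sharing causes no slowdown."""
    return sum(i / a for i, a in zip(ipcs, ipcs_alone))

shared = [0.8, 1.2]   # IPC of each core running together (hypothetical)
alone = [1.0, 1.6]    # IPC of each core running in isolation (hypothetical)
print(ipc_sum(shared), round(weighted_speedup(shared, alone), 2))
```

Here the weighted speedup is 1.55 out of an ideal 2.0, i.e. cache interference cost the two cores about a quarter of their combined isolated performance.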
More informationExploring Latency-Power Tradeoffs in Deep Nonvolatile Memory Hierarchies
Exploring Latency-Power Tradeoffs in Deep Nonvolatile Memory Hierarchies Doe Hyun Yoon, Tobin Gonzalez, Parthasarathy Ranganathan, and Robert S. Schreiber Intelligent Infrastructure Lab (IIL), Hewlett-Packard
More informationHigh Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas
Prefetch-Aware Shared-Resource Management for Multi-Core Systems Eiman Ebrahimi Chang Joo Lee Onur Mutlu Yale N. Patt High Performance Systems Group Department of Electrical and Computer Engineering The
More informationFlexible Cache Error Protection using an ECC FIFO
Flexible Cache Error Protection using an ECC FIFO Doe Hyun Yoon and Mattan Erez Dept Electrical and Computer Engineering The University of Texas at Austin 1 ECC FIFO Goal: to reduce on-chip ECC overhead
More informationInsertion and Promotion for Tree-Based PseudoLRU Last-Level Caches
Insertion and Promotion for Tree-Based PseudoLRU Last-Level Caches Daniel A. Jiménez Department of Computer Science and Engineering Texas A&M University ABSTRACT Last-level caches mitigate the high latency
More informationOAP: An Obstruction-Aware Cache Management Policy for STT-RAM Last-Level Caches
OAP: An Obstruction-Aware Cache Management Policy for STT-RAM Last-Level Caches Jue Wang, Xiangyu Dong, Yuan Xie Department of Computer Science and Engineering, Pennsylvania State University Qualcomm Technology,
More informationImproving DRAM Performance by Parallelizing Refreshes with Accesses
Improving DRAM Performance by Parallelizing Refreshes with Accesses Kevin Chang Donghyuk Lee, Zeshan Chishti, Alaa Alameldeen, Chris Wilkerson, Yoongu Kim, Onur Mutlu Executive Summary DRAM refresh interferes
More informationAddressing End-to-End Memory Access Latency in NoC-Based Multicores
Addressing End-to-End Memory Access Latency in NoC-Based Multicores Akbar Sharifi, Emre Kultursay, Mahmut Kandemir and Chita R. Das The Pennsylvania State University University Park, PA, 682, USA {akbar,euk39,kandemir,das}@cse.psu.edu
More informationCombining Local and Global History for High Performance Data Prefetching
Combining Local and Global History for High Performance Data ing Martin Dimitrov Huiyang Zhou School of Electrical Engineering and Computer Science University of Central Florida {dimitrov,zhou}@eecs.ucf.edu
More informationSecurity-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat
Security-Aware Processor Architecture Design CS 6501 Fall 2018 Ashish Venkat Agenda Common Processor Performance Metrics Identifying and Analyzing Bottlenecks Benchmarking and Workload Selection Performance
More informationSWAP: EFFECTIVE FINE-GRAIN MANAGEMENT
: EFFECTIVE FINE-GRAIN MANAGEMENT OF SHARED LAST-LEVEL CACHES WITH MINIMUM HARDWARE SUPPORT Xiaodong Wang, Shuang Chen, Jeff Setter, and José F. Martínez Computer Systems Lab Cornell University Page 1
More informationACCORD: Enabling Associativity for Gigascale DRAM Caches by Coordinating Way-Install and Way-Prediction
ACCORD: Enabling Associativity for Gigascale DRAM Caches by Coordinating Way-Install and Way-Prediction Vinson Young, Chiachen Chou, Aamer Jaleel *, and Moinuddin K. Qureshi Georgia Institute of Technology
More informationLecture: Large Caches, Virtual Memory. Topics: cache innovations (Sections 2.4, B.4, B.5)
Lecture: Large Caches, Virtual Memory Topics: cache innovations (Sections 2.4, B.4, B.5) 1 More Cache Basics caches are split as instruction and data; L2 and L3 are unified The /L2 hierarchy can be inclusive,
More informationImproving Inclusive Cache Performance with Two-level Eviction Priority
Improving Inclusive Cache Performance with Two-level Eviction Priority Lingda Li, Dong Tong, Zichao Xie, Junlin Lu, Xu Cheng Microprocessor Research and Development Center, Peking University, Beijing,
More informationRethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization
Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization Fazal Hameed and Jeronimo Castrillon Center for Advancing Electronics Dresden (cfaed), Technische Universität Dresden,
More informationTowards Energy-Proportional Datacenter Memory with Mobile DRAM
Towards Energy-Proportional Datacenter Memory with Mobile DRAM Krishna Malladi 1 Frank Nothaft 1 Karthika Periyathambi Benjamin Lee 2 Christos Kozyrakis 1 Mark Horowitz 1 Stanford University 1 Duke University
More informationAnalyzing Instructtion Based Cache Replacement Policies
University of Central Florida Electronic Theses and Dissertations Masters Thesis (Open Access) Analyzing Instructtion Based Cache Replacement Policies 2010 Ping Xiang University of Central Florida Find
More informationComputer Sciences Department
Computer Sciences Department SIP: Speculative Insertion Policy for High Performance Caching Hongil Yoon Tan Zhang Mikko H. Lipasti Technical Report #1676 June 2010 SIP: Speculative Insertion Policy for
More informationh Coherence Controllers
High-Throughput h Coherence Controllers Anthony-Trung Nguyen Microprocessor Research Labs Intel Corporation 9/30/03 Motivations Coherence Controller (CC) throughput is bottleneck of scalable systems. CCs
More informationarxiv: v1 [cs.ar] 10 Apr 2017
Banshee: Bandwidth-Efficient DRAM Caching Via Software/Hardware Cooperation Xiangyao Yu, Christopher J. Hughes, Nadathur Satish, Onur Mutlu, Srinivas Devadas CSAIL, MIT, Intel Labs, ETH Zurich {yxy, devadas}@mit.edu,
More informationImproving Cache Management Policies Using Dynamic Reuse Distances
Improving Cache Management Policies Using Dynamic Reuse Distances Nam Duong, Dali Zhao, Taesu Kim, Rosario Cammarota, Mateo Valero and Alexander V. Veidenbaum University of California, Irvine Universitat
More informationA Cache Scheme Based on LRU-Like Algorithm
Proceedings of the 2010 IEEE International Conference on Information and Automation June 20-23, Harbin, China A Cache Scheme Based on LRU-Like Algorithm Dongxing Bao College of Electronic Engineering Heilongjiang
More information15-740/ Computer Architecture Lecture 12: Advanced Caching. Prof. Onur Mutlu Carnegie Mellon University
15-740/18-740 Computer Architecture Lecture 12: Advanced Caching Prof. Onur Mutlu Carnegie Mellon University Announcements Chuck Thacker (Microsoft Research) Seminar Tomorrow RARE: Rethinking Architectural
More informationDatabase Workload. from additional misses in this already memory-intensive databases? interference could be a problem) Key question:
Database Workload + Low throughput (0.8 IPC on an 8-wide superscalar. 1/4 of SPEC) + Naturally threaded (and widely used) application - Already high cache miss rates on a single-threaded machine (destructive
More informationAn Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors
An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt High Performance Systems Group
More informationDynamic Cache Partitioning for CMP/SMT Systems
Dynamic Cache Partitioning for CMP/SMT Systems G. E. Suh (suh@mit.edu), L. Rudolph (rudolph@mit.edu) and S. Devadas (devadas@mit.edu) Massachusetts Institute of Technology Abstract. This paper proposes
More informationImproving Writeback Efficiency with Decoupled Last-Write Prediction
Improving Writeback Efficiency with Decoupled Last-Write Prediction Zhe Wang Samira M. Khan Daniel A. Jiménez The University of Texas at San Antonio {zhew,skhan,dj}@cs.utsa.edu Abstract In modern DDRx
More informationSilent Shredder: Zero-Cost Shredding For Secure Non-Volatile Main Memory Controllers
Silent Shredder: Zero-Cost Shredding For Secure Non-Volatile Main Memory Controllers 1 ASPLOS 2016 2-6 th April Amro Awad (NC State University) Pratyusa Manadhata (Hewlett Packard Labs) Yan Solihin (NC
More informationEfficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors
Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors Wenun Wang and Wei-Ming Lin Department of Electrical and Computer Engineering, The University
More informationEECS750: Advanced Operating Systems. 2/24/2014 Heechul Yun
EECS750: Advanced Operating Systems 2/24/2014 Heechul Yun 1 Administrative Project Feedback of your proposal will be sent by Wednesday Midterm report due on Apr. 2 3 pages: include intro, related work,
More informationPredictive Thread-to-Core Assignment on a Heterogeneous Multi-core Processor*
Predictive Thread-to-Core Assignment on a Heterogeneous Multi-core Processor* Tyler Viswanath Krishnamurthy, and Hridesh Laboratory for Software Design Department of Computer Science Iowa State University
More informationMemory Performance Characterization of SPEC CPU2006 Benchmarks Using TSIM1
Available online at www.sciencedirect.com Physics Procedia 33 (2012 ) 1029 1035 2012 International Conference on Medical Physics and Biomedical Engineering Memory Performance Characterization of SPEC CPU2006
More informationApplication-to-Core Mapping Policies to Reduce Memory System Interference in Multi-Core Systems
Application-to-Core Mapping Policies to Reduce Memory System Interference in Multi-Core Systems Reetuparna Das Rachata Ausavarungnirun Onur Mutlu Akhilesh Kumar Mani Azimi University of Michigan Carnegie
More informationSecure Hierarchy-Aware Cache Replacement Policy (SHARP): Defending Against Cache-Based Side Channel Attacks
: Defending Against Cache-Based Side Channel Attacks Mengjia Yan, Bhargava Gopireddy, Thomas Shull, Josep Torrellas University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu Presented by Mengjia
More informationMultiperspective Reuse Prediction
ABSTRACT Daniel A. Jiménez Texas A&M University djimenezacm.org The disparity between last-level cache and memory latencies motivates the search for e cient cache management policies. Recent work in predicting
More informationJIGSAW: SCALABLE SOFTWARE-DEFINED CACHES
JIGSAW: SCALABLE SOFTWARE-DEFINED CACHES NATHAN BECKMANN AND DANIEL SANCHEZ MIT CSAIL PACT 13 - EDINBURGH, SCOTLAND SEP 11, 2013 Summary NUCA is giving us more capacity, but further away 40 Applications
More informationThe Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory
The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory Lavanya Subramanian* Vivek Seshadri* Arnab Ghosh* Samira Khan*
More informationPerformance metrics for caches
Performance metrics for caches Basic performance metric: hit ratio h h = Number of memory references that hit in the cache / total number of memory references Typically h = 0.90 to 0.97 Equivalent metric:
More informationhttp://uu.diva-portal.org This is an author produced version of a paper presented at the 4 th Swedish Workshop on Multi-Core Computing, November 23-25, 2011, Linköping, Sweden. Citation for the published
More informationCache Replacement Championship. The 3P and 4P cache replacement policies
1 Cache Replacement Championship The 3P and 4P cache replacement policies Pierre Michaud INRIA June 20, 2010 2 Optimal replacement? Offline (we know the future) Belady Online (we don t know the future)
More informationTHE DYNAMIC GRANULARITY MEMORY SYSTEM
THE DYNAMIC GRANULARITY MEMORY SYSTEM Doe Hyun Yoon IIL, HP Labs Michael Sullivan Min Kyu Jeong Mattan Erez ECE, UT Austin MEMORY ACCESS GRANULARITY The size of block for accessing main memory Often, equal
More informationA Swap-based Cache Set Index Scheme to Leverage both Superpage and Page Coloring Optimizations
A Swap-based Cache Set Index Scheme to Leverage both and Page Coloring Optimizations Zehan Cui, Licheng Chen, Yungang Bao, Mingyu Chen State Key Laboratory of Computer Architecture, Institute of Computing
More informationVirtualized and Flexible ECC for Main Memory
Virtualized and Flexible ECC for Main Memory Doe Hyun Yoon and Mattan Erez Dept. Electrical and Computer Engineering The University of Texas at Austin ASPLOS 2010 1 Memory Error Protection Applying ECC
More informationA Model for Application Slowdown Estimation in On-Chip Networks and Its Use for Improving System Fairness and Performance
A Model for Application Slowdown Estimation in On-Chip Networks and Its Use for Improving System Fairness and Performance Xiyue Xiang Saugata Ghose Onur Mutlu Nian-Feng Tzeng University of Louisiana at
More information15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University
15-740/18-740 Computer Architecture Lecture 20: Main Memory II Prof. Onur Mutlu Carnegie Mellon University Today SRAM vs. DRAM Interleaving/Banking DRAM Microarchitecture Memory controller Memory buses
More informationMicroarchitecture Overview. Performance
Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make
More informationBackground Heterogeneous Architectures Performance Modeling Single Core Performance Profiling Multicore Performance Estimation Test Cases Multicore
By Dan Stafford Background Heterogeneous Architectures Performance Modeling Single Core Performance Profiling Multicore Performance Estimation Test Cases Multicore Design Space Results & Observations General
More informationOpenPrefetch. (in-progress)
OpenPrefetch Let There Be Industry-Competitive Prefetching in RISC-V Processors (in-progress) Bowen Huang, Zihao Yu, Zhigang Liu, Chuanqi Zhang, Sa Wang, Yungang Bao Institute of Computing Technology(ICT),
More informationKill the Program Counter: Reconstructing Program Behavior in the Processor Cache Hierarchy
Kill the Program Counter: Reconstructing Program Behavior in the Processor Cache Hierarchy Jinchun Kim Texas A&M University cienlux@tamu.edu Daniel A. Jiménez Texas A&M University djimenez@cse.tamu.edu
More informationDEMM: a Dynamic Energy-saving mechanism for Multicore Memories
DEMM: a Dynamic Energy-saving mechanism for Multicore Memories Akbar Sharifi, Wei Ding 2, Diana Guttman 3, Hui Zhao 4, Xulong Tang 5, Mahmut Kandemir 5, Chita Das 5 Facebook 2 Qualcomm 3 Intel 4 University
More informationCtrl-C: Instruction-Aware Control Loop Based Adaptive Cache Bypassing for GPUs
The 34 th IEEE International Conference on Computer Design Ctrl-C: Instruction-Aware Control Loop Based Adaptive Cache Bypassing for GPUs Shin-Ying Lee and Carole-Jean Wu Arizona State University October
More information