AB-Aware: Application Behavior Aware Management of Shared Last Level Caches


AB-Aware: Application Behavior Aware Management of Shared Last Level Caches
Suhit Pai, Newton Singh and Virendra Singh
Computer Architecture and Dependable Systems Laboratory, Department of Electrical Engineering, Indian Institute of Technology Bombay, India
The 28th ACM Great Lakes Symposium on VLSI (GLSVLSI), Friday, May 25th

Outline
1. Introduction
2. Motivation
3. AB-Aware Cache Management
4. Evaluation
5. Results
6. Conclusion


Introduction
Thousand-fold growth in microprocessor performance, driven by:
1. Core micro-architectural innovations
2. Transistor-speed scaling
Figure: Memory latency and CPU frequency improvement [1]. The relative improvement in processor clock frequency (roughly 200x) has far outpaced the improvement in memory latency (roughly 10x).
[1] J. Srinivasan, Improving Cache Utilisation, PhD thesis, University of Cambridge, 2011


Introduction
The disparity between CPU and memory speeds creates a memory bottleneck, which the cache hierarchy mitigates.
A miss at the Last Level Cache (LLC) takes hundreds of cycles to get serviced; processor execution stalls, causing performance loss.
Non-blocking caches service multiple misses in parallel, reducing the effective memory stalls.
The notion of generating and servicing long-latency cache misses in parallel is called Memory Level Parallelism (MLP) [2].
MLP is not uniform across all memory accesses, so MLP-aware cache replacement can minimize isolated misses.
[2] M. K. Qureshi et al., A Case for MLP-Aware Cache Replacement, ISCA '06



Example
Example from [2]: a reduced total number of misses does not always result in reduced memory stall cycles!
[2] M. K. Qureshi et al., A Case for MLP-Aware Cache Replacement, ISCA '06
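The intuition behind this example can be sketched with a toy stall-cycle model (all numbers are invented, and real miss overlap is only partial):

```python
MISS_LATENCY = 400   # made-up DRAM round-trip time, in cycles

def stall_cycles(parallel_groups):
    """parallel_groups: sizes of groups of misses serviced in parallel.
    Under a simple MLP model, a group of k fully overlapping misses
    stalls the core for ~one miss latency, not k of them."""
    return sum(MISS_LATENCY for k in parallel_groups if k > 0)

# Policy A keeps 5 misses, but 4 of them overlap: 2 stall episodes.
a = stall_cycles([4, 1])      # 800 cycles
# Policy B has only 3 misses, all isolated: 3 stall episodes.
b = stall_cycles([1, 1, 1])   # 1200 cycles
```

Policy B suffers fewer misses yet stalls the processor longer, which is exactly why a replacement policy should account for the cost of a miss, not just its count.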


Multi-core Systems
LLC shared among multiple homogeneous or heterogeneous cores:
+ Dynamic sharing of cache space
+ Better utilization of cache resources
- Increased number of conflict misses
- Interference among applications
Regularity in cache accesses is filtered out by the higher-level caches, so the LLC sees a diverse access pattern.
Co-running applications have different memory characteristics:
Cache Friendly (CF): small working data-set, more cache re-use
Streaming (STR): large working data-set, low cache re-use
LRU-based cache replacement policies allocate cache resources on a demand basis.


LRU cache replacement on shared LLC
L3 cache size varied from 512 kB to 4 MB by changing the number of ways while keeping the number of sets constant.
Figure: Memory characteristics of sphinx and libquantum (MPKI vs. number of ways, 1 way = 512 kB). MPKI is Misses Per Kilo Instructions.
Under the LRU policy in a 2-core configuration, sphinx got only 3 ways out of the total 8 (libquantum got 5 ways): sphinx suffered due to interference from libquantum.


Replacement policies for shared LLC
On average, 60% of cache lines are not re-referenced before eviction [3], either because they have no temporal locality or because the data-set is larger than the allocated cache.
Figure: Various cache replacement policies
[3] M. K. Qureshi et al., Adaptive Insertion Policies for High Performance Caching, ISCA '07
[4] A. Jaleel et al., High Performance Cache Replacement using Re-Reference Interval Prediction (RRIP), ISCA '10


Replacement policies for shared LLC
ABRIP [5] maintains two counters:
1. Core level (Cr): whose block to replace
2. Block level (Br): which block to replace
+ Reduces interference by giving more priority to application behavior
+ Improves throughput of CF-STR workload mixes
- Doesn't work very well with CF-CF workload mixes
Proposal: combine MLP awareness, which tries to minimize the number of isolated misses, with Application Behavior awareness, which tries to reduce interference among applications.
[5] P. Lathigara et al., Application Behavior Aware Re-reference Interval Prediction for Shared LLC, ICCD '15



MLP-cost calculation for shared LLC
The Miss Status Holding Register (MSHR) maintains the number of in-flight misses [2].

Algorithm: Calculation of MLP-cost for shared LLC misses
  init_mlp_cost(miss):    /* when miss enters MSHR */
    miss.mlp_cost = 0
  update_mlp_cost():      /* called every cycle */
    N_i <- number of outstanding misses from the i-th core
    for each demand miss in MSHR do
      miss.mlp_cost += 1/N_i
    end for

The higher the number of parallel misses, the lower the mlp_cost.
Example: consider 5 misses on a 2-core system, 4 from core0 and 1 from core1 (an isolated miss). Using N = 5 for core1's miss would be unfair, hence the outstanding misses are counted per core.
[2] M. K. Qureshi et al., A Case for MLP-Aware Cache Replacement, ISCA '06
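As a rough illustration, the per-core accounting above can be sketched in Python (the Miss class, the MSHR list, and the cycle loop are hypothetical simulator scaffolding, not the authors' code):

```python
from collections import Counter

class Miss:
    def __init__(self, core):
        self.core = core          # id of the core that issued the miss
        self.mlp_cost = 0.0       # initialized when the miss enters the MSHR

def update_mlp_cost(mshr):
    """Called once per simulated cycle: each outstanding demand miss is
    charged 1/N_i, where N_i is the number of outstanding misses from the
    same core (not from the whole shared MSHR)."""
    per_core = Counter(m.core for m in mshr)
    for m in mshr:
        m.mlp_cost += 1.0 / per_core[m.core]

# The slide's example: 4 misses from core0, 1 isolated miss from core1,
# all outstanding for 100 cycles.
mshr = [Miss(0), Miss(0), Miss(0), Miss(0), Miss(1)]
for _ in range(100):
    update_mlp_cost(mshr)
```

Each core0 miss accumulates 100/4 = 25, while the isolated core1 miss accumulates the full 100 cycles, so losing its block is far costlier.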

Calculation of MLP-cost for shared LLC misses
The accumulated MLP-cost of a miss is quantized into a 3-bit value, q_cost, which tracks the degree of parallelism of the miss: low accumulated MLP-cost (highly parallel misses) maps to high q_cost, and costly isolated misses map to low q_cost.
Table: Quantization of MLP-cost. Eight successive ranges of MLP-cost in cycles, the first being 0-22 cycles, map to the eight 3-bit quantized values.
Figure: % MLP-cost distribution of bzip under LRU, by range of MLP-cost in cycles.
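A hedged sketch of the quantization step: the slide preserves only the first bin (0-22 cycles), so the 23-cycle step and the exact value mapping below are assumptions for illustration only.

```python
def quantize_mlp_cost(mlp_cost_cycles, step=23, bits=3):
    """Map an accumulated MLP-cost (in cycles) to a 3-bit q_cost that
    tracks the degree of parallelism: cheap, highly parallel misses
    quantize high; costly isolated misses quantize low."""
    max_q = (1 << bits) - 1                          # 7 for a 3-bit value
    bin_idx = min(int(mlp_cost_cycles // step), max_q)  # saturating bin index
    return max_q - bin_idx                           # invert: low cost -> high q_cost

quantize_mlp_cost(10)    # in the 0-22 cycle range: highly parallel miss -> 7
quantize_mlp_cost(500)   # beyond the last range: isolated miss -> 0
```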

AB-Aware Cache Replacement
AB-Aware avoids evicting isolated cache blocks that have a near-future re-reference prediction.
Figure: Micro-architecture of AB-Aware for a 2-core system. Cost-compute logic attached to the L3 MSHR supplies the MLP-cost to the cost-aware replacement logic of the shared L3 cache (LLC); each core has private L1 and L2 caches, and the LLC connects to main memory over the bus.

Policies for AB-Aware
Insertion: Br = 2^M - 2; Cr is unaffected.

Algorithm: Victim selection policy for AB-Aware
  iterate:
  if any ABr == max_ABr then
    Victim = max { Br(i) + α Cr(i) + λ q_cost(i) }
  else if any Cr == max_Cr then
    all Br++; goto iterate
  else
    all Cr++; all Br++; goto iterate
  end if

Promotion: Cr = 0 and Br = 0, i.e. ABr = 0.
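A hedged sketch of the victim-selection loop over one cache set. Here ABr is taken as the combination of the core-level (Cr) and block-level (Br) RRPVs, with a block becoming eviction-eligible once both saturate; the 2-bit widths, the weights alpha and lam, and the per-block dictionaries are all illustrative assumptions, not the paper's implementation.

```python
MAX_CR, MAX_BR = 3, 3          # assumed 2-bit core-level and block-level RRPVs

def select_victim(blocks, alpha=1.0, lam=1.0):
    while True:
        eligible = [b for b in blocks
                    if b["cr"] == MAX_CR and b["br"] == MAX_BR]
        if eligible:
            # Highest score is evicted: among equally old blocks, those
            # with high q_cost (highly parallel, cheap-to-refetch misses)
            # go first, so isolated blocks (low q_cost) are protected.
            return max(eligible,
                       key=lambda b: b["br"] + alpha * b["cr"] + lam * b["q_cost"])
        if any(b["cr"] == MAX_CR for b in blocks):
            for b in blocks:                   # age only the block-level RRPVs
                b["br"] = min(b["br"] + 1, MAX_BR)
        else:
            for b in blocks:                   # age both levels and retry
                b["cr"] = min(b["cr"] + 1, MAX_CR)
                b["br"] = min(b["br"] + 1, MAX_BR)

# Two saturated blocks: the one whose miss was highly parallel (q_cost 7)
# is evicted; the isolated one (q_cost 0) survives.
victim = select_victim([
    {"cr": 3, "br": 3, "q_cost": 0},   # isolated miss: costly to lose
    {"cr": 3, "br": 3, "q_cost": 7},   # parallel miss: cheap to refetch
])
```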


Evaluation framework
Sniper multi-core x86 simulator; Nehalem micro-architecture; 2.67 GHz, 4-wide fetch, 128-entry ROB; three-level cache hierarchy; main memory access latency of 175 cycles; 32-entry MSHR.

Table: Cache hierarchy of the simulated system
L1-D Cache: 32KB, 4-Way, Private, 4-cycles
L1-I Cache: 32KB, 4-Way, Private, 4-cycles
L2 Cache:   256KB, 8-Way, Private, 8-cycles
L3 Cache:   2MB per core, 8-Way, Shared, 30-cycles

Memory characteristics
Four traces of 250M instructions per application were selected using PinPoints, for a total of 1B instructions.

Table: Memory characteristics of SPEC CPU2006 applications (APKI: accesses per kilo instructions; MPKI: misses per kilo instructions)
bzip:       Cache friendly (CF)
gcc:        Cache friendly (CF)
soplex:     Cache friendly (CF)
sphinx:     Cache friendly (CF)
lbm:        Streaming (STR)
libquantum: Streaming (STR)
mcf:        Streaming (STR)


AB-Aware 2-core configuration
Figure: Weighted speedup of SRRIP, LIN, ABRIP and AB-Aware across the 2-core workload mixes (CF-STR mixes such as sphinx-lbm and gcc-libq, and CF-CF mixes such as sphinx-soplex and gcc-bzip), with geometric means.

Gains over SRRIP:
                  LIN      ABRIP    AB-Aware
GMean CF-STR     -8.89%    0.63%    1.76%
GMean all        -7.12%   -0.18%    1.69%

AB-Aware 2-core configuration
Figure: MPKI normalized to SRRIP for each application of every 2-core workload mix.
Maximum MPKI reduction, achieved by bzip in the bzip-lbm mix: 69.22%. Average MPKI reduction: 9.7%.

AB-Aware 4-core configuration
Table: Workloads under evaluation for 4-core configuration
Mix 1:  soplex-sphinx-gcc-libq    (CF-CF-CF-STR)
Mix 2:  bzip-gcc-sphinx-lbm       (CF-CF-CF-STR)
Mix 3:  sphinx-bzip-gcc-mcf       (CF-CF-CF-STR)
Mix 4:  bzip-soplex-gcc-lbm       (CF-CF-CF-STR)
Mix 5:  gcc-soplex-sphinx-mcf     (CF-CF-CF-STR)
Mix 6:  sphinx-soplex-bzip-lbm    (CF-CF-CF-STR)
Mix 7:  bzip-sphinx-libq-mcf      (CF-CF-STR-STR)
Mix 8:  gcc-sphinx-mcf-libq       (CF-CF-STR-STR)
Mix 9:  sphinx-bzip-lbm-mcf       (CF-CF-STR-STR)
Mix 10: bzip-gcc-libq-lbm         (CF-CF-STR-STR)
Mix 11: bzip-gcc-lbm-mcf          (CF-CF-STR-STR)
Mix 12: soplex-bzip-libq-lbm      (CF-CF-STR-STR)
Mix 13: soplex-libq-mcf-lbm       (CF-STR-STR-STR)
Mix 14: sphinx-libq-lbm-mcf       (CF-STR-STR-STR)
Mix 15: bzip-gcc-sphinx-soplex    (CF-CF-CF-CF)

AB-Aware 4-core configuration
Figure: Weighted speedup of SRRIP, ABRIP and AB-Aware for Mix 1 through Mix 15 and their geometric mean.
Average performance improvement over SRRIP: 8.71%.

Storage overhead
Table: Storage overhead of AB-Aware (LIN + ABRIP) over SRRIP
Overhead                               2-core    4-core
Core RRPV (2 bits per core per set)    4 KB      16 KB
Block RRPV bits (16 bits per set)      16 KB     32 KB
MLP cost storage (24 bits per set)     24 KB     48 KB
Net overhead                           44 KB     96 KB
% overhead over LLC                    1.1%      1.2%

Overhead over ABRIP: 0.6% for both 2-core and 4-core.
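The rows follow from the set count of the simulated LLC; a quick check for the 2-core, 4 MB shared LLC (the 64 B line size is an assumption, as the slides do not state it; the 8-way, 2 MB-per-core geometry comes from the evaluation setup):

```python
def num_sets(llc_bytes, line_bytes=64, ways=8):
    # sets = capacity / (line size x associativity)
    return llc_bytes // (line_bytes * ways)

sets = num_sets(2 * 2 * 2**20)              # 2 cores x 2 MB -> 8192 sets

def kb(bits):
    return bits / 8 / 1024                  # bits -> kilobytes

core_rrpv  = kb(sets * 2 * 2)               # 2 bits x 2 cores per set -> 4.0 KB
block_rrpv = kb(sets * 16)                  # 16 bits per set          -> 16.0 KB
mlp_cost   = kb(sets * 24)                  # 24 bits per set          -> 24.0 KB
total = core_rrpv + block_rrpv + mlp_cost   # 44.0 KB, ~1.1% of the 4 MB LLC
```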



Conclusion
Memory Level Parallelism (MLP) is non-uniform, and MLP-aware cache replacement reduces isolated misses.
The shared Last Level Cache (LLC) experiences diverse access patterns; LRU is inefficient for L3 caches, and Application Behavior aware replacement reduces interference among applications.
AB-Aware improves MLP and minimizes application interference: average MPKI reduction of 9.6%, and average performance gains of 1.69% (2-core) and 8.71% (4-core).

References
[1] J. R. Srinivasan, Improving Cache Utilisation. PhD thesis, University of Cambridge, 2011.
[2] M. K. Qureshi, D. N. Lynch, O. Mutlu, and Y. N. Patt, A Case for MLP-Aware Cache Replacement, in Proceedings of the 33rd Annual International Symposium on Computer Architecture (ISCA), 2006.
[3] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer, Adaptive Insertion Policies for High Performance Caching, in Proceedings of the 34th Annual International Symposium on Computer Architecture (ISCA), 2007.
[4] A. Jaleel, K. B. Theobald, S. C. Steely, Jr., and J. Emer, High Performance Cache Replacement Using Re-reference Interval Prediction (RRIP), in Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA), 2010.
[5] P. Lathigara, S. Balachandran, and V. Singh, Application Behavior Aware Re-reference Interval Prediction for Shared LLC, in Proceedings of the 33rd IEEE International Conference on Computer Design (ICCD), Oct. 2015.

Thank You!

Backup Slides

LIN single-core configuration
Figure: % IPC improvement over LRU for LIN with λ = 1, 2, 3 and 4 on bzip, gcc, soplex, sphinx, lbm, libq and mcf.
Improvement or degradation is proportional to λ; degradation can be alleviated using Dynamic Set Sampling.

LIN MLP-distribution
Figures: % MLP-cost distribution, by range of MLP-cost in cycles, under LRU versus LIN for bzip, gcc, sphinx, soplex, libq, lbm and mcf.

Performance metrics
IPC_i: IPC of the i-th core running concurrently with the other cores.
IPC_alone_i: IPC of the i-th core running in isolation.

IPC_sum = Σ_i IPC_i    (1)
Weighted Speedup = Σ_i (IPC_i / IPC_alone_i)    (2)

IPC_sum signifies system throughput; Weighted Speedup signifies the reduction in execution time. The ideal value of Weighted Speedup is 2 for dual-core and 4 for quad-core.
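A tiny numeric example of the two metrics (all IPC values are invented):

```python
def ipc_sum(ipcs):
    return sum(ipcs)                          # equation (1): system throughput

def weighted_speedup(ipcs, ipcs_alone):
    # equation (2): each core contributes its slowdown-normalized IPC
    return sum(ipc / alone for ipc, alone in zip(ipcs, ipcs_alone))

ipcs       = [0.8, 1.0]   # IPC of core0, core1 while sharing the LLC
ipcs_alone = [1.0, 2.0]   # IPC of each core when run in isolation

throughput = ipc_sum(ipcs)                # 1.8
ws = weighted_speedup(ipcs, ipcs_alone)   # 0.8 + 0.5 = 1.3 (ideal: 2.0)
```

Both cores run at 80% of their standalone IPC, so the weighted speedup of 1.3 captures the shared-cache slowdown relative to the ideal value of 2.0.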


More information

Cache Friendliness-aware Management of Shared Last-level Caches for High Performance Multi-core Systems

Cache Friendliness-aware Management of Shared Last-level Caches for High Performance Multi-core Systems 1 Cache Friendliness-aware Management of Shared Last-level Caches for High Performance Multi-core Systems Dimitris Kaseridis, Member, IEEE, Muhammad Faisal Iqbal, Student Member, IEEE and Lizy Kurian John,

More information

Lecture 14: Large Cache Design II. Topics: Cache partitioning and replacement policies

Lecture 14: Large Cache Design II. Topics: Cache partitioning and replacement policies Lecture 14: Large Cache Design II Topics: Cache partitioning and replacement policies 1 Page Coloring CACHE VIEW Bank number with Page-to-Bank Tag Set Index Bank number with Set-interleaving Block offset

More information

Adaptive Cache Partitioning on a Composite Core

Adaptive Cache Partitioning on a Composite Core Adaptive Cache Partitioning on a Composite Core Jiecao Yu, Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Scott Mahlke Computer Engineering Lab University of Michigan, Ann Arbor, MI {jiecaoyu, lukefahr,

More information

Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching

Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching Somayeh Sardashti and David A. Wood University of Wisconsin-Madison 1 Please find the power point presentation

More information

Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity. Donghyuk Lee Carnegie Mellon University

Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity. Donghyuk Lee Carnegie Mellon University Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity Donghyuk Lee Carnegie Mellon University Problem: High DRAM Latency processor stalls: waiting for data main memory high latency Major bottleneck

More information

Sampling Dead Block Prediction for Last-Level Caches

Sampling Dead Block Prediction for Last-Level Caches Appears in Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-43), December 2010 Sampling Dead Block Prediction for Last-Level Caches Samira Khan, Yingying Tian,

More information

Prefetch-Aware DRAM Controllers

Prefetch-Aware DRAM Controllers Prefetch-Aware DRAM Controllers Chang Joo Lee Onur Mutlu Veynu Narasiman Yale N. Patt High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin

More information

Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency

Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency Gennady Pekhimenko, Vivek Seshadri, Yoongu Kim, Hongyi Xin, Onur Mutlu, Todd C. Mowry Phillip B. Gibbons,

More information

Adaptive Placement and Migration Policy for an STT-RAM-Based Hybrid Cache

Adaptive Placement and Migration Policy for an STT-RAM-Based Hybrid Cache Adaptive Placement and Migration Policy for an STT-RAM-Based Hybrid Cache Zhe Wang Daniel A. Jiménez Cong Xu Guangyu Sun Yuan Xie Texas A&M University Pennsylvania State University Peking University AMD

More information

Designing High-Performance and Fair Shared Multi-Core Memory Systems: Two Approaches. Onur Mutlu March 23, 2010 GSRC

Designing High-Performance and Fair Shared Multi-Core Memory Systems: Two Approaches. Onur Mutlu March 23, 2010 GSRC Designing High-Performance and Fair Shared Multi-Core Memory Systems: Two Approaches Onur Mutlu onur@cmu.edu March 23, 2010 GSRC Modern Memory Systems (Multi-Core) 2 The Memory System The memory system

More information

Exploring Latency-Power Tradeoffs in Deep Nonvolatile Memory Hierarchies

Exploring Latency-Power Tradeoffs in Deep Nonvolatile Memory Hierarchies Exploring Latency-Power Tradeoffs in Deep Nonvolatile Memory Hierarchies Doe Hyun Yoon, Tobin Gonzalez, Parthasarathy Ranganathan, and Robert S. Schreiber Intelligent Infrastructure Lab (IIL), Hewlett-Packard

More information

High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas

High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas Prefetch-Aware Shared-Resource Management for Multi-Core Systems Eiman Ebrahimi Chang Joo Lee Onur Mutlu Yale N. Patt High Performance Systems Group Department of Electrical and Computer Engineering The

More information

Flexible Cache Error Protection using an ECC FIFO

Flexible Cache Error Protection using an ECC FIFO Flexible Cache Error Protection using an ECC FIFO Doe Hyun Yoon and Mattan Erez Dept Electrical and Computer Engineering The University of Texas at Austin 1 ECC FIFO Goal: to reduce on-chip ECC overhead

More information

Insertion and Promotion for Tree-Based PseudoLRU Last-Level Caches

Insertion and Promotion for Tree-Based PseudoLRU Last-Level Caches Insertion and Promotion for Tree-Based PseudoLRU Last-Level Caches Daniel A. Jiménez Department of Computer Science and Engineering Texas A&M University ABSTRACT Last-level caches mitigate the high latency

More information

OAP: An Obstruction-Aware Cache Management Policy for STT-RAM Last-Level Caches

OAP: An Obstruction-Aware Cache Management Policy for STT-RAM Last-Level Caches OAP: An Obstruction-Aware Cache Management Policy for STT-RAM Last-Level Caches Jue Wang, Xiangyu Dong, Yuan Xie Department of Computer Science and Engineering, Pennsylvania State University Qualcomm Technology,

More information

Improving DRAM Performance by Parallelizing Refreshes with Accesses

Improving DRAM Performance by Parallelizing Refreshes with Accesses Improving DRAM Performance by Parallelizing Refreshes with Accesses Kevin Chang Donghyuk Lee, Zeshan Chishti, Alaa Alameldeen, Chris Wilkerson, Yoongu Kim, Onur Mutlu Executive Summary DRAM refresh interferes

More information

Addressing End-to-End Memory Access Latency in NoC-Based Multicores

Addressing End-to-End Memory Access Latency in NoC-Based Multicores Addressing End-to-End Memory Access Latency in NoC-Based Multicores Akbar Sharifi, Emre Kultursay, Mahmut Kandemir and Chita R. Das The Pennsylvania State University University Park, PA, 682, USA {akbar,euk39,kandemir,das}@cse.psu.edu

More information

Combining Local and Global History for High Performance Data Prefetching

Combining Local and Global History for High Performance Data Prefetching Combining Local and Global History for High Performance Data ing Martin Dimitrov Huiyang Zhou School of Electrical Engineering and Computer Science University of Central Florida {dimitrov,zhou}@eecs.ucf.edu

More information

Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat

Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat Security-Aware Processor Architecture Design CS 6501 Fall 2018 Ashish Venkat Agenda Common Processor Performance Metrics Identifying and Analyzing Bottlenecks Benchmarking and Workload Selection Performance

More information

SWAP: EFFECTIVE FINE-GRAIN MANAGEMENT

SWAP: EFFECTIVE FINE-GRAIN MANAGEMENT : EFFECTIVE FINE-GRAIN MANAGEMENT OF SHARED LAST-LEVEL CACHES WITH MINIMUM HARDWARE SUPPORT Xiaodong Wang, Shuang Chen, Jeff Setter, and José F. Martínez Computer Systems Lab Cornell University Page 1

More information

ACCORD: Enabling Associativity for Gigascale DRAM Caches by Coordinating Way-Install and Way-Prediction

ACCORD: Enabling Associativity for Gigascale DRAM Caches by Coordinating Way-Install and Way-Prediction ACCORD: Enabling Associativity for Gigascale DRAM Caches by Coordinating Way-Install and Way-Prediction Vinson Young, Chiachen Chou, Aamer Jaleel *, and Moinuddin K. Qureshi Georgia Institute of Technology

More information

Lecture: Large Caches, Virtual Memory. Topics: cache innovations (Sections 2.4, B.4, B.5)

Lecture: Large Caches, Virtual Memory. Topics: cache innovations (Sections 2.4, B.4, B.5) Lecture: Large Caches, Virtual Memory Topics: cache innovations (Sections 2.4, B.4, B.5) 1 More Cache Basics caches are split as instruction and data; L2 and L3 are unified The /L2 hierarchy can be inclusive,

More information

Improving Inclusive Cache Performance with Two-level Eviction Priority

Improving Inclusive Cache Performance with Two-level Eviction Priority Improving Inclusive Cache Performance with Two-level Eviction Priority Lingda Li, Dong Tong, Zichao Xie, Junlin Lu, Xu Cheng Microprocessor Research and Development Center, Peking University, Beijing,

More information

Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization

Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization Fazal Hameed and Jeronimo Castrillon Center for Advancing Electronics Dresden (cfaed), Technische Universität Dresden,

More information

Towards Energy-Proportional Datacenter Memory with Mobile DRAM

Towards Energy-Proportional Datacenter Memory with Mobile DRAM Towards Energy-Proportional Datacenter Memory with Mobile DRAM Krishna Malladi 1 Frank Nothaft 1 Karthika Periyathambi Benjamin Lee 2 Christos Kozyrakis 1 Mark Horowitz 1 Stanford University 1 Duke University

More information

Analyzing Instructtion Based Cache Replacement Policies

Analyzing Instructtion Based Cache Replacement Policies University of Central Florida Electronic Theses and Dissertations Masters Thesis (Open Access) Analyzing Instructtion Based Cache Replacement Policies 2010 Ping Xiang University of Central Florida Find

More information

Computer Sciences Department

Computer Sciences Department Computer Sciences Department SIP: Speculative Insertion Policy for High Performance Caching Hongil Yoon Tan Zhang Mikko H. Lipasti Technical Report #1676 June 2010 SIP: Speculative Insertion Policy for

More information

h Coherence Controllers

h Coherence Controllers High-Throughput h Coherence Controllers Anthony-Trung Nguyen Microprocessor Research Labs Intel Corporation 9/30/03 Motivations Coherence Controller (CC) throughput is bottleneck of scalable systems. CCs

More information

arxiv: v1 [cs.ar] 10 Apr 2017

arxiv: v1 [cs.ar] 10 Apr 2017 Banshee: Bandwidth-Efficient DRAM Caching Via Software/Hardware Cooperation Xiangyao Yu, Christopher J. Hughes, Nadathur Satish, Onur Mutlu, Srinivas Devadas CSAIL, MIT, Intel Labs, ETH Zurich {yxy, devadas}@mit.edu,

More information

Improving Cache Management Policies Using Dynamic Reuse Distances

Improving Cache Management Policies Using Dynamic Reuse Distances Improving Cache Management Policies Using Dynamic Reuse Distances Nam Duong, Dali Zhao, Taesu Kim, Rosario Cammarota, Mateo Valero and Alexander V. Veidenbaum University of California, Irvine Universitat

More information

A Cache Scheme Based on LRU-Like Algorithm

A Cache Scheme Based on LRU-Like Algorithm Proceedings of the 2010 IEEE International Conference on Information and Automation June 20-23, Harbin, China A Cache Scheme Based on LRU-Like Algorithm Dongxing Bao College of Electronic Engineering Heilongjiang

More information

15-740/ Computer Architecture Lecture 12: Advanced Caching. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 12: Advanced Caching. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 12: Advanced Caching Prof. Onur Mutlu Carnegie Mellon University Announcements Chuck Thacker (Microsoft Research) Seminar Tomorrow RARE: Rethinking Architectural

More information

Database Workload. from additional misses in this already memory-intensive databases? interference could be a problem) Key question:

Database Workload. from additional misses in this already memory-intensive databases? interference could be a problem) Key question: Database Workload + Low throughput (0.8 IPC on an 8-wide superscalar. 1/4 of SPEC) + Naturally threaded (and widely used) application - Already high cache miss rates on a single-threaded machine (destructive

More information

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt High Performance Systems Group

More information

Dynamic Cache Partitioning for CMP/SMT Systems

Dynamic Cache Partitioning for CMP/SMT Systems Dynamic Cache Partitioning for CMP/SMT Systems G. E. Suh (suh@mit.edu), L. Rudolph (rudolph@mit.edu) and S. Devadas (devadas@mit.edu) Massachusetts Institute of Technology Abstract. This paper proposes

More information

Improving Writeback Efficiency with Decoupled Last-Write Prediction

Improving Writeback Efficiency with Decoupled Last-Write Prediction Improving Writeback Efficiency with Decoupled Last-Write Prediction Zhe Wang Samira M. Khan Daniel A. Jiménez The University of Texas at San Antonio {zhew,skhan,dj}@cs.utsa.edu Abstract In modern DDRx

More information

Silent Shredder: Zero-Cost Shredding For Secure Non-Volatile Main Memory Controllers

Silent Shredder: Zero-Cost Shredding For Secure Non-Volatile Main Memory Controllers Silent Shredder: Zero-Cost Shredding For Secure Non-Volatile Main Memory Controllers 1 ASPLOS 2016 2-6 th April Amro Awad (NC State University) Pratyusa Manadhata (Hewlett Packard Labs) Yan Solihin (NC

More information

Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors

Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors Wenun Wang and Wei-Ming Lin Department of Electrical and Computer Engineering, The University

More information

EECS750: Advanced Operating Systems. 2/24/2014 Heechul Yun

EECS750: Advanced Operating Systems. 2/24/2014 Heechul Yun EECS750: Advanced Operating Systems 2/24/2014 Heechul Yun 1 Administrative Project Feedback of your proposal will be sent by Wednesday Midterm report due on Apr. 2 3 pages: include intro, related work,

More information

Predictive Thread-to-Core Assignment on a Heterogeneous Multi-core Processor*

Predictive Thread-to-Core Assignment on a Heterogeneous Multi-core Processor* Predictive Thread-to-Core Assignment on a Heterogeneous Multi-core Processor* Tyler Viswanath Krishnamurthy, and Hridesh Laboratory for Software Design Department of Computer Science Iowa State University

More information

Memory Performance Characterization of SPEC CPU2006 Benchmarks Using TSIM1

Memory Performance Characterization of SPEC CPU2006 Benchmarks Using TSIM1 Available online at www.sciencedirect.com Physics Procedia 33 (2012 ) 1029 1035 2012 International Conference on Medical Physics and Biomedical Engineering Memory Performance Characterization of SPEC CPU2006

More information

Application-to-Core Mapping Policies to Reduce Memory System Interference in Multi-Core Systems

Application-to-Core Mapping Policies to Reduce Memory System Interference in Multi-Core Systems Application-to-Core Mapping Policies to Reduce Memory System Interference in Multi-Core Systems Reetuparna Das Rachata Ausavarungnirun Onur Mutlu Akhilesh Kumar Mani Azimi University of Michigan Carnegie

More information

Secure Hierarchy-Aware Cache Replacement Policy (SHARP): Defending Against Cache-Based Side Channel Attacks

Secure Hierarchy-Aware Cache Replacement Policy (SHARP): Defending Against Cache-Based Side Channel Attacks : Defending Against Cache-Based Side Channel Attacks Mengjia Yan, Bhargava Gopireddy, Thomas Shull, Josep Torrellas University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu Presented by Mengjia

More information

Multiperspective Reuse Prediction

Multiperspective Reuse Prediction ABSTRACT Daniel A. Jiménez Texas A&M University djimenezacm.org The disparity between last-level cache and memory latencies motivates the search for e cient cache management policies. Recent work in predicting

More information

JIGSAW: SCALABLE SOFTWARE-DEFINED CACHES

JIGSAW: SCALABLE SOFTWARE-DEFINED CACHES JIGSAW: SCALABLE SOFTWARE-DEFINED CACHES NATHAN BECKMANN AND DANIEL SANCHEZ MIT CSAIL PACT 13 - EDINBURGH, SCOTLAND SEP 11, 2013 Summary NUCA is giving us more capacity, but further away 40 Applications

More information

The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory

The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory Lavanya Subramanian* Vivek Seshadri* Arnab Ghosh* Samira Khan*

More information

Performance metrics for caches

Performance metrics for caches Performance metrics for caches Basic performance metric: hit ratio h h = Number of memory references that hit in the cache / total number of memory references Typically h = 0.90 to 0.97 Equivalent metric:

More information

http://uu.diva-portal.org This is an author produced version of a paper presented at the 4 th Swedish Workshop on Multi-Core Computing, November 23-25, 2011, Linköping, Sweden. Citation for the published

More information

Cache Replacement Championship. The 3P and 4P cache replacement policies

Cache Replacement Championship. The 3P and 4P cache replacement policies 1 Cache Replacement Championship The 3P and 4P cache replacement policies Pierre Michaud INRIA June 20, 2010 2 Optimal replacement? Offline (we know the future) Belady Online (we don t know the future)

More information

THE DYNAMIC GRANULARITY MEMORY SYSTEM

THE DYNAMIC GRANULARITY MEMORY SYSTEM THE DYNAMIC GRANULARITY MEMORY SYSTEM Doe Hyun Yoon IIL, HP Labs Michael Sullivan Min Kyu Jeong Mattan Erez ECE, UT Austin MEMORY ACCESS GRANULARITY The size of block for accessing main memory Often, equal

More information

A Swap-based Cache Set Index Scheme to Leverage both Superpage and Page Coloring Optimizations

A Swap-based Cache Set Index Scheme to Leverage both Superpage and Page Coloring Optimizations A Swap-based Cache Set Index Scheme to Leverage both and Page Coloring Optimizations Zehan Cui, Licheng Chen, Yungang Bao, Mingyu Chen State Key Laboratory of Computer Architecture, Institute of Computing

More information

Virtualized and Flexible ECC for Main Memory

Virtualized and Flexible ECC for Main Memory Virtualized and Flexible ECC for Main Memory Doe Hyun Yoon and Mattan Erez Dept. Electrical and Computer Engineering The University of Texas at Austin ASPLOS 2010 1 Memory Error Protection Applying ECC

More information

A Model for Application Slowdown Estimation in On-Chip Networks and Its Use for Improving System Fairness and Performance

A Model for Application Slowdown Estimation in On-Chip Networks and Its Use for Improving System Fairness and Performance A Model for Application Slowdown Estimation in On-Chip Networks and Its Use for Improving System Fairness and Performance Xiyue Xiang Saugata Ghose Onur Mutlu Nian-Feng Tzeng University of Louisiana at

More information

15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 20: Main Memory II Prof. Onur Mutlu Carnegie Mellon University Today SRAM vs. DRAM Interleaving/Banking DRAM Microarchitecture Memory controller Memory buses

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Background Heterogeneous Architectures Performance Modeling Single Core Performance Profiling Multicore Performance Estimation Test Cases Multicore

Background Heterogeneous Architectures Performance Modeling Single Core Performance Profiling Multicore Performance Estimation Test Cases Multicore By Dan Stafford Background Heterogeneous Architectures Performance Modeling Single Core Performance Profiling Multicore Performance Estimation Test Cases Multicore Design Space Results & Observations General

More information

OpenPrefetch. (in-progress)

OpenPrefetch. (in-progress) OpenPrefetch Let There Be Industry-Competitive Prefetching in RISC-V Processors (in-progress) Bowen Huang, Zihao Yu, Zhigang Liu, Chuanqi Zhang, Sa Wang, Yungang Bao Institute of Computing Technology(ICT),

More information

Kill the Program Counter: Reconstructing Program Behavior in the Processor Cache Hierarchy

Kill the Program Counter: Reconstructing Program Behavior in the Processor Cache Hierarchy Kill the Program Counter: Reconstructing Program Behavior in the Processor Cache Hierarchy Jinchun Kim Texas A&M University cienlux@tamu.edu Daniel A. Jiménez Texas A&M University djimenez@cse.tamu.edu

More information

DEMM: a Dynamic Energy-saving mechanism for Multicore Memories

DEMM: a Dynamic Energy-saving mechanism for Multicore Memories DEMM: a Dynamic Energy-saving mechanism for Multicore Memories Akbar Sharifi, Wei Ding 2, Diana Guttman 3, Hui Zhao 4, Xulong Tang 5, Mahmut Kandemir 5, Chita Das 5 Facebook 2 Qualcomm 3 Intel 4 University

More information

Ctrl-C: Instruction-Aware Control Loop Based Adaptive Cache Bypassing for GPUs

Ctrl-C: Instruction-Aware Control Loop Based Adaptive Cache Bypassing for GPUs The 34 th IEEE International Conference on Computer Design Ctrl-C: Instruction-Aware Control Loop Based Adaptive Cache Bypassing for GPUs Shin-Ying Lee and Carole-Jean Wu Arizona State University October

More information