Exploiting Compressed Block Size as an Indicator of Future Reuse
1 Exploiting Compressed Block Size as an Indicator of Future Reuse
Gennady Pekhimenko, Tyler Huberty, Rui Cai, Onur Mutlu, Todd C. Mowry, Phillip B. Gibbons, Michael A. Kozuch
2 Executive Summary
In a compressed cache, compressed block size is an additional dimension.
Problem: How can we maximize cache performance utilizing both block reuse and compressed size?
Observations: the importance of a cache block depends on its compressed size, and block size is indicative of reuse behavior in some applications.
Key Idea: Use compressed block size in making cache replacement and insertion decisions.
Results: higher performance (9%/11% on average for 1-/2-core), lower energy (12% on average).
3 Potential for Data Compression
Compression algorithms for on-chip caches, by decompression latency and compression ratio:
- Frequent Pattern Compression (FPC) [Alameldeen+, ISCA 04]: 5-cycle decompression
- C-Pack [Chen+, Trans. on VLSI 10]: 8-14-cycle decompression
- Statistical Compression (SC2) [Arelakis+, ISCA 14]: 9-cycle decompression
- Base-Delta-Immediate (BDI) [Pekhimenko+, PACT 12]: 1-2-cycle decompression
All these works showed significant performance improvements, and all use the LRU replacement policy. Can we do better?
4 Size-Aware Cache Management
1. Compressed block size matters
2. Compressed block size varies
3. Compressed block size can indicate reuse
5 #1: Compressed Block Size Matters
Example: a cache that can hold either one large block (X) or several small blocks (A, Y, B, C), with access stream X A Y B C repeated over time. Belady's OPT replacement, being size-unaware, keeps the large block and misses on small blocks it could have retained; a size-aware replacement evicts the large block to make room for several small ones, turning misses into hits and saving cycles.
Size-aware policies could yield fewer misses.
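To make the saved-misses intuition concrete, here is a minimal toy sketch (not the paper's mechanism; the `Cache` struct, the bypass rule, and the 84-byte budget are illustrative assumptions): a fully-associative set with a byte budget, comparing plain LRU against a size-aware variant that declines to cache blocks larger than half the capacity.

```cpp
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Toy byte-budget cache set; front of the deque is the LRU block.
struct Cache {
    int capacity, used = 0, misses = 0;
    bool size_aware;  // if true, bypass blocks larger than half the capacity
    std::deque<std::pair<std::string, int>> blocks;  // (tag, size in bytes)

    Cache(int cap, bool aware) : capacity(cap), size_aware(aware) {}

    void access(const std::string& tag, int size) {
        for (auto it = blocks.begin(); it != blocks.end(); ++it) {
            if (it->first == tag) {           // hit: move block to MRU position
                auto blk = *it;
                blocks.erase(it);
                blocks.push_back(blk);
                return;
            }
        }
        ++misses;
        if (size_aware && 2 * size > capacity) return;  // don't cache huge blocks
        while (used + size > capacity) {      // evict from the LRU end until it fits
            used -= blocks.front().second;
            blocks.pop_front();
        }
        blocks.push_back({tag, size});
        used += size;
    }
};

// Replay an access trace and return the miss count.
int run_trace(Cache& c, const std::vector<std::pair<std::string, int>>& trace) {
    for (const auto& a : trace) c.access(a.first, a.second);
    return c.misses;
}
```

On the stream X A Y B C repeated twice (X is a 64-byte block, the others compress to 21 bytes, 84-byte budget), the size-unaware cache misses on all 10 accesses because inserting X repeatedly flushes the small blocks, while the size-aware variant hits on all four small blocks in the second round.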
6 #2: Compressed Block Size Varies
Cache coverage by compressed block size (ranges 0-7, 8-15, ..., up to 64 bytes) under the BDI compression algorithm [Pekhimenko+, PACT 12] for mcf, lbm, astar, tpch2, calculix, h264ref, wrf, povray, and gcc: compressed block size varies within and between applications.
7 #3: Block Size Can Indicate Reuse
Reuse distance (in memory accesses) plotted against compressed block size for bzip2: different sizes have different dominant reuse distances.
8 Code Example to Support Intuition

    int A[N];      // small indices: compressible, long reuse
    double B[16];  // FP coefficients: incompressible, short reuse
    for (int i = 0; i < N; i++) {
        int idx = A[i];
        for (int j = 0; j < N; j++) {
            sum += B[(idx + j) % 16];
        }
    }

A's small indices compress well and have long reuse distances; B's floating-point coefficients do not compress and have short reuse distances. Compressed size can be an indicator of reuse distance.
9 Size-Aware Cache Management
1. Compressed block size matters
2. Compressed block size varies
3. Compressed block size can indicate reuse
10 Outline
Motivation
Key Observations
Key Ideas: Minimal-Value Eviction (MVE), Size-based Insertion Policy (SIP)
Evaluation
Conclusion
11 Compression-Aware Management Policies (CAMP)
MVE: Minimal-Value Eviction
SIP: Size-based Insertion Policy
12 Minimal-Value Eviction (MVE): Observations
Define a value (importance) of each block to the cache: a block's importance depends on both its likelihood of reference and its compressed size. Rather than ordering blocks from MRU (highest priority) to LRU (lowest), order them by value and evict the block with the lowest value.
13 Minimal-Value Eviction (MVE): Key Idea
Value = Probability of reuse / Compressed block size
The probability of reuse can be determined in many different ways; our implementation is based on the Re-Reference Interval Prediction value (RRIP [Jaleel+, ISCA 10]).
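A minimal sketch of this value computation (the probabilities below are illustrative stand-ins for what RRIP-style prediction would supply, and the struct and field names are assumptions, not the paper's hardware):

```cpp
#include <cstddef>
#include <vector>

// MVE value function: a block's value to the cache is its probability of
// reuse divided by its compressed size, i.e., expected benefit per byte.
struct Block {
    double p_reuse;   // estimated probability of reuse (e.g., from RRIP state)
    int size_bytes;   // compressed block size
};

double value(const Block& b) { return b.p_reuse / b.size_bytes; }

// Pick the victim: the block with the lowest value in the set.
std::size_t pick_victim(const std::vector<Block>& set) {
    std::size_t victim = 0;
    for (std::size_t i = 1; i < set.size(); ++i)
        if (value(set[i]) < value(set[victim])) victim = i;
    return victim;
}
```

Note that a large block can become the victim even when its reuse probability is the highest in the set: with blocks {p=0.2, 8B} and {p=0.9, 64B}, the 64-byte block has the lower per-byte value (0.9/64 ≈ 0.014 < 0.2/8 = 0.025) and is evicted.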
14 Compression-Aware Management Policies (CAMP)
MVE: Minimal-Value Eviction
SIP: Size-based Insertion Policy
15 Size-based Insertion Policy (SIP): Observations
Sometimes there is a relation between the compressed block size and the reuse distance (data structure → compressed block size → reuse distance). This relation can be detected through the compressed block size, and tracking it has minimal overhead because compressed block information is already part of the design.
16 Size-based Insertion Policy (SIP): Key Idea
Insert blocks of a certain size with higher priority if this improves the cache miss rate. Use dynamic set sampling [Qureshi+, ISCA 06] to detect which size to insert with higher priority: a few main-tag sets get auxiliary tags that prioritize a particular size (e.g., 8B or 64B), misses in the competing versions update a counter (CTR), and the counter decides the policy for the steady state.
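The counter-based decision can be sketched as follows (a simplified single-counter version for one candidate size class; the struct and its names are assumptions for illustration, not the paper's exact hardware):

```cpp
// SIP decision via set sampling: some sampled sets insert a candidate size
// class (e.g., 8B blocks) with high priority, others use the baseline policy.
// One counter accumulates the miss difference between the two groups; its
// sign picks the steady-state policy for all remaining (follower) sets.
struct SIPSampler {
    int ctr = 0;

    // Called on every miss in a sampled set.
    void on_sampled_miss(bool set_prioritizes_size) {
        if (set_prioritizes_size) --ctr;  // size-priority sets missing: evidence against
        else ++ctr;                       // baseline sets missing: evidence for
    }

    // Steady-state decision applied by the follower sets.
    bool prioritize_size_class() const { return ctr > 0; }
};
```

If the baseline sampled sets take 5 misses while the size-prioritizing sets take only 2, the counter ends at +3 and the follower sets adopt the size-prioritizing insertion.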
17 Outline
Motivation
Key Observations
Key Ideas: Minimal-Value Eviction (MVE), Size-based Insertion Policy (SIP)
Evaluation
Conclusion
18 Methodology
Simulator: x86 event-driven simulator (MemSim [Seshadri+, PACT 12])
Workloads: SPEC2006 benchmarks, TPC, Apache web server; 1-4 core simulations of 1 billion representative instructions
System parameters: L1/L2/L3 cache latencies from CACTI; BDI with 1-cycle decompression [Pekhimenko+, PACT 12]; 4GHz x86 in-order core; cache sizes from 1MB to 16MB
19-21 Evaluated Cache Management Policies
Design | Description
LRU    | Baseline LRU policy (size-unaware)
RRIP   | Re-Reference Interval Prediction [Jaleel+, ISCA 10] (size-unaware)
ECM    | Effective Capacity Maximizer [Baek+, HPCA 13] (size-aware)
CAMP   | Compression-Aware Management Policies (MVE + SIP) (size-aware)
22 Size-Aware Replacement
Effective Capacity Maximizer (ECM) [Baek+, HPCA 13] inserts big blocks with lower priority and uses a heuristic to define the size threshold.
Shortcomings: coarse-grained; no relation between block size and reuse; not easily applicable to other cache organizations.
23 CAMP Single-Core
Normalized IPC over LRU for RRIP, ECM, MVE, SIP, and CAMP on SPEC2006, database, and web workloads with a 2MB L2 cache (improvements of up to 31% on individual workloads): CAMP improves performance over LRU, RRIP, and ECM (8% GMean on memory-intensive workloads, 5% GMean over all).
24 Multi-Core Results
Classification based on the compressed block size distribution: homogeneous (1 or 2 dominant block sizes) vs. heterogeneous (more than 2 dominant sizes). We form three different 2-core workload groups (20 workloads in each): Homo-Homo, Homo-Hetero, and Hetero-Hetero.
25 Multi-Core Results (2)
Normalized weighted speedup of RRIP, ECM, MVE, SIP, and CAMP for the Homo-Homo, Homo-Hetero, and Hetero-Hetero groups (and their GeoMean): CAMP outperforms the LRU, RRIP, and ECM policies, with higher benefits at higher heterogeneity in size.
26 Effect on Memory Subsystem Energy
Normalized energy of the L1/L2 caches, DRAM, NoC, and compression/decompression for RRIP, ECM, and CAMP across the single-core workloads: CAMP reduces energy consumption in the memory subsystem (5% on average).
27 CAMP for Global Replacement
CAMP with the V-Way [Qureshi+, ISCA 05] cache design: G-CAMP.
G-MVE: Global Minimal-Value Eviction (Reuse Replacement instead of RRIP)
G-SIP: Global Size-based Insertion Policy (Set Dueling instead of Set Sampling)
28 G-CAMP Single-Core
Normalized IPC over LRU for RRIP, V-Way, G-MVE, G-SIP, and G-CAMP on SPEC2006, database, and web workloads with a 2MB L2 cache (improvements of 63%-97% on individual workloads): G-CAMP improves performance over LRU, RRIP, and V-Way (14% GMean on memory-intensive workloads, 9% GMean over all).
29 Storage Bit Count Overhead
Overhead over an uncompressed cache with LRU, for designs with BDI:
           LRU    CAMP   V-Way  G-CAMP
tag-entry  +14b   +14b   +15b   +19b
data-entry +0b    +0b    +16b   +32b
total      +9%    +11%   +12.5% +17%
That is, CAMP adds 2% over the compressed LRU baseline, and G-CAMP adds 4.5% over V-Way.
30 Cache Size Sensitivity
Normalized IPC for LRU, RRIP, ECM, CAMP, V-Way, and G-CAMP at 1MB, 2MB, 4MB, 8MB, and 16MB cache sizes: CAMP/G-CAMP outperforms the prior policies at all cache sizes, and G-CAMP outperforms LRU with 2X the cache size.
31 Other Results and Analyses in the Paper
- Block size distribution for different applications
- Sensitivity to the compression algorithm
- Comparison with an uncompressed cache
- Effect on cache capacity
- SIP vs. PC-based replacement policies
- More details on multi-core results
32 Conclusion
In a compressed cache, compressed block size is an additional dimension.
Problem: How can we maximize cache performance utilizing both block reuse and compressed size?
Observations: the importance of a cache block depends on its compressed size, and block size is indicative of reuse behavior in some applications.
Key Idea: Use compressed block size in making cache replacement and insertion decisions.
Two techniques: Minimal-Value Eviction (MVE) and Size-based Insertion Policy (SIP).
Results: higher performance (9%/11% on average for 1-/2-core), lower energy (12% on average).
33 Exploiting Compressed Block Size as an Indicator of Future Reuse
Gennady Pekhimenko, Tyler Huberty, Rui Cai, Onur Mutlu, Todd C. Mowry, Phillip B. Gibbons, Michael A. Kozuch
34 Backup Slides
35 Potential for Data Compression
Compression algorithms for on-chip caches:
- Frequent Pattern Compression (FPC) [Alameldeen+, ISCA 04]: performance +15%
- C-Pack [Chen+, Trans. on VLSI 10]: compression ratio 1.64
- Base-Delta-Immediate (BDI) [Pekhimenko+, PACT 12]: performance +11.2%, compression ratio 1.53
- Statistical Compression (SC2) [Arelakis+, ISCA 14]: performance +11%
36 Potential for Data Compression (2)
Better compressed cache management:
- Decoupled Compressed Cache (DCC) [MICRO 13]
- Skewed Compressed Cache [MICRO 14]
37 #3: Block Size Can Indicate Reuse
Reuse distance plotted against compressed block size for soplex: different sizes have different dominant reuse distances.
38 Conventional Cache Compression
A conventional 2-way cache stores two tags (Tag 0, Tag 1) and two 32-byte data lines (Data 0, Data 1) per set. A compressed 4-way cache with 8-byte segmented data keeps twice as many tags (Tag 0-Tag 3) per set, adds compression-encoding bits to the tags, and lets each tag map to multiple adjacent 8-byte segments (S0-S7).
39 CAMP = MVE + SIP
Minimal-Value Eviction (MVE): value function = probability of reuse / compressed block size, with the probability of reuse based on RRIP [Jaleel+, ISCA 10].
Size-based Insertion Policy (SIP): dynamically prioritizes blocks based on their compressed size (e.g., inserts into the MRU position for the LRU policy).
40 Overhead Analysis
41 CAMP and Global Replacement
CAMP with the V-Way cache design: per-set tag entries (status, tag, forward pointer fptr, compression encoding) are decoupled from the data store, whose 64-byte lines are divided into 8-byte segments with a reverse pointer (rptr), valid+size bits (v+s), and reuse counters (R0-R3) per data entry.
42 G-MVE
Two major changes:
1. Probability of reuse based on the Reuse Replacement policy: on insertion, a block's counter is set to zero; on a hit, the block's counter is incremented by one, indicating its reuse.
2. Global replacement using a pointer (PTR) to a reuse counter entry: starting at the entry PTR points to, the reuse counters of 64 valid data entries are scanned, decrementing each non-zero counter.
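The pointer-based Reuse Replacement scan can be sketched like this (a simplified version: it walks one entry at a time rather than 64 per step, and the struct and member names are illustrative assumptions):

```cpp
#include <cstddef>
#include <vector>

// Sketch of the Reuse Replacement scan used by G-MVE: starting at a global
// pointer, walk the reuse counters, decrementing each non-zero counter,
// until a zero counter is found; that entry becomes the victim.
struct ReuseReplacer {
    std::vector<int> counter;  // one reuse counter per data entry
    std::size_t ptr = 0;       // global replacement pointer

    explicit ReuseReplacer(std::size_t n) : counter(n, 0) {}

    void on_insert(std::size_t i) { counter[i] = 0; }  // new block: no reuse yet
    void on_hit(std::size_t i)    { ++counter[i]; }    // reuse observed

    std::size_t pick_victim() {
        for (;;) {
            std::size_t i = ptr;
            ptr = (ptr + 1) % counter.size();
            if (counter[i] == 0) return i;  // no recent reuse: evict this entry
            --counter[i];                   // otherwise give it another chance
        }
    }
};
```

The decrements guarantee the scan terminates: every pass over a non-zero counter brings it closer to zero, so entries that stop being reused eventually become victims.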
43 G-SIP
Use set dueling instead of set sampling (as in SIP) to decide the best insertion policy:
- Divide the data store into 8 regions and block sizes into 8 ranges
- During training, for each region, insert blocks of a different size range with higher priority
- Count cache misses per region
- In steady state, insert blocks with the sizes of the top-performing regions with higher priority in all regions
44 Size-based Insertion Policy (SIP): Key Idea
Insert blocks of a certain size with higher priority (at the highest-priority rather than the lowest-priority position in the set) if this improves the cache miss rate.
45 Evaluated Cache Management Policies
Design | Description
LRU    | Baseline LRU policy
RRIP   | Re-Reference Interval Prediction [Jaleel+, ISCA 10]
ECM    | Effective Capacity Maximizer [Baek+, HPCA 13]
V-Way  | Variable-Way Cache [Qureshi+, ISCA 05]
CAMP   | Compression-Aware Management (Local)
G-CAMP | Compression-Aware Management (Global)
46-47 Performance and MPKI Correlation
Performance improvement / MPKI reduction relative to each baseline:
Mechanism | vs. LRU        | vs. RRIP      | vs. ECM       | vs. V-Way
CAMP      | 8.1% / -13.3%  | 2.7% / -5.6%  | 2.1% / -5.9%  | N/A
G-CAMP    | 14.0% / -21.9% | 8.3% / -15.1% | 7.7% / -15.3% | 4.9% / -8.7%
CAMP's performance improvement correlates with its MPKI reduction.
48 Multi-Core Results (2)
Normalized weighted speedup of RRIP, V-Way, G-MVE, G-SIP, and G-CAMP for the Homo-Homo, Homo-Hetero, and Hetero-Hetero groups (and their GeoMean): G-CAMP outperforms the LRU, RRIP, and V-Way policies.
49 Effect on Memory Subsystem Energy
Normalized energy of the L1/L2 caches, DRAM, NoC, and compression/decompression for RRIP, ECM, CAMP, V-Way, and G-CAMP: CAMP/G-CAMP reduces energy consumption in the memory subsystem (15.1% for G-CAMP).
More informationA Best-Offset Prefetcher
A Best-Offset Prefetcher Pierre Michaud Inria pierre.michaud@inria.fr The Best-Offset (BO) prefetcher submitted to the DPC contest prefetches one line into the level-two (L) cache on every cache miss or
More informationRelative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review
Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Bijay K.Paikaray Debabala Swain Dept. of CSE, CUTM Dept. of CSE, CUTM Bhubaneswer, India Bhubaneswer, India
More informationarxiv: v1 [cs.ar] 1 Feb 2016
SAFARI Technical Report No. - Enabling Efficient Dynamic Resizing of Large DRAM Caches via A Hardware Consistent Hashing Mechanism Kevin K. Chang, Gabriel H. Loh, Mithuna Thottethodi, Yasuko Eckert, Mike
More informationHybrid Cache Architecture (HCA) with Disparate Memory Technologies
Hybrid Cache Architecture (HCA) with Disparate Memory Technologies Xiaoxia Wu, Jian Li, Lixin Zhang, Evan Speight, Ram Rajamony, Yuan Xie Pennsylvania State University IBM Austin Research Laboratory Acknowledgement:
More informationExploring Latency-Power Tradeoffs in Deep Nonvolatile Memory Hierarchies
Exploring Latency-Power Tradeoffs in Deep Nonvolatile Memory Hierarchies Doe Hyun Yoon, Tobin Gonzalez, Parthasarathy Ranganathan, and Robert S. Schreiber Intelligent Infrastructure Lab (IIL), Hewlett-Packard
More informationA Heterogeneous Multiple Network-On-Chip Design: An Application-Aware Approach
A Heterogeneous Multiple Network-On-Chip Design: An Application-Aware Approach Asit K. Mishra Intel Corporation Hillsboro, OR 97124, USA asit.k.mishra@intel.com Onur Mutlu Carnegie Mellon University Pittsburgh,
More informationStaged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems
Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems Rachata Ausavarungnirun Kevin Kai-Wei Chang Lavanya Subramanian Gabriel H. Loh Onur Mutlu Carnegie Mellon University
More informationPIPELINING AND PROCESSOR PERFORMANCE
PIPELINING AND PROCESSOR PERFORMANCE Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 1, John L. Hennessy and David A. Patterson, Morgan Kaufmann,
More informationLoop-Oriented Array- and Field-Sensitive Pointer Analysis for Automatic SIMD Vectorization
Loop-Oriented Array- and Field-Sensitive Pointer Analysis for Automatic SIMD Vectorization Yulei Sui, Xiaokang Fan, Hao Zhou and Jingling Xue School of Computer Science and Engineering The University of
More informationContinuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads
Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads Milad Hashemi, Onur Mutlu, Yale N. Patt The University of Texas at Austin ETH Zürich ABSTRACT Runahead execution pre-executes
More informationEKT 303 WEEK Pearson Education, Inc., Hoboken, NJ. All rights reserved.
+ EKT 303 WEEK 2 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved. Chapter 2 + Performance Issues + Designing for Performance The cost of computer systems continues to drop dramatically,
More informationCloudCache: Expanding and Shrinking Private Caches
CloudCache: Expanding and Shrinking Private Caches Hyunjin Lee, Sangyeun Cho, and Bruce R. Childers Computer Science Department, University of Pittsburgh {abraham,cho,childers}@cs.pitt.edu Abstract The
More informationSEN361 Computer Organization. Prof. Dr. Hasan Hüseyin BALIK (2 nd Week)
+ SEN361 Computer Organization Prof. Dr. Hasan Hüseyin BALIK (2 nd Week) + Outline 1. Overview 1.1 Basic Concepts and Computer Evolution 1.2 Performance Issues + 1.2 Performance Issues + Designing for
More informationCONTENT-AWARE SPIN-TRANSFER TORQUE MAGNETORESISTIVE RANDOM-ACCESS MEMORY (STT-MRAM) CACHE DESIGNS
CONTENT-AWARE SPIN-TRANSFER TORQUE MAGNETORESISTIVE RANDOM-ACCESS MEMORY (STT-MRAM) CACHE DESIGNS By QI ZENG A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
More informationA Cache Reconfiguration Approach for Saving Leakage and Refresh Energy in Embedded DRAM Caches
A Cache Reconfiguration Approach for Saving Leakage and Refresh Energy in Embedded DRAM Caches arxiv:139.782v1 [cs.ar] 26 Sep 213 Abstract In recent years, the size and leakage energy consumption of large
More informationHigh Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas
Prefetch-Aware Shared-Resource Management for Multi-Core Systems Eiman Ebrahimi Chang Joo Lee Onur Mutlu Yale N. Patt High Performance Systems Group Department of Electrical and Computer Engineering The
More informationTiming Local Streams: Improving Timeliness in Data Prefetching
Timing Local Streams: Improving Timeliness in Data Prefetching Huaiyu Zhu, Yong Chen and Xian-He Sun Department of Computer Science Illinois Institute of Technology Chicago, IL 60616 {hzhu12,chenyon1,sun}@iit.edu
More informationDynamic Cache Pooling in 3D Multicore Processors
Dynamic Cache Pooling in 3D Multicore Processors TIANSHENG ZHANG, JIE MENG, and AYSE K. COSKUN, BostonUniversity Resource pooling, where multiple architectural components are shared among cores, is a promising
More informationDecoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching
Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching Somayeh Sardashti Computer Sciences Department University of Wisconsin-Madison somayeh@cs.wisc.edu David
More informationHOTL: a Higher Order Theory of Locality
HOTL: a Higher Order Theory of Locality Xiaoya Xiang Chen Ding Hao Luo Department of Computer Science University of Rochester {xiang, cding, hluo}@cs.rochester.edu Bin Bao Adobe Systems Incorporated bbao@adobe.com
More informationHOTL: A Higher Order Theory of Locality
HOTL: A Higher Order Theory of Locality Xiaoya Xiang Chen Ding Hao Luo Department of Computer Science University of Rochester {xiang, cding, hluo}@cs.rochester.edu Bin Bao Adobe Systems Incorporated bbao@adobe.com
More informationLinux Performance on IBM zenterprise 196
Martin Kammerer martin.kammerer@de.ibm.com 9/27/10 Linux Performance on IBM zenterprise 196 visit us at http://www.ibm.com/developerworks/linux/linux390/perf/index.html Trademarks IBM, the IBM logo, and
More informationGather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses
Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses Vivek Seshadri Thomas Mullins, AmiraliBoroumand, Onur Mutlu, Phillip B. Gibbons, Michael A.
More informationPrefetch-Aware DRAM Controllers
Prefetch-Aware DRAM Controllers Chang Joo Lee Onur Mutlu Veynu Narasiman Yale N. Patt Department of Electrical and Computer Engineering The University of Texas at Austin {cjlee, narasima, patt}@ece.utexas.edu
More information562 IEEE TRANSACTIONS ON COMPUTERS, VOL. 65, NO. 2, FEBRUARY 2016
562 IEEE TRANSACTIONS ON COMPUTERS, VOL. 65, NO. 2, FEBRUARY 2016 Memory Bandwidth Management for Efficient Performance Isolation in Multi-Core Platforms Heechul Yun, Gang Yao, Rodolfo Pellizzoni, Member,
More informationSkewed Compressed Cache
Skewed Compressed Cache Somayeh Sardashti, André Seznec, David A. Wood To cite this version: Somayeh Sardashti, André Seznec, David A. Wood. Skewed Compressed Cache. MICRO - 47th Annual IEEE/ACM International
More informationEvalua&ng STT- RAM as an Energy- Efficient Main Memory Alterna&ve
Evalua&ng STT- RAM as an Energy- Efficient Main Memory Alterna&ve Emre Kültürsay *, Mahmut Kandemir *, Anand Sivasubramaniam *, and Onur Mutlu * Pennsylvania State University Carnegie Mellon University
More information