Practical Data Compression for Modern Memory Hierarchies
1 Practical Data Compression for Modern Memory Hierarchies. Thesis Oral: Gennady Pekhimenko. Committee: Todd Mowry (Co-chair), Onur Mutlu (Co-chair), Kayvon Fatahalian, David Wood (University of Wisconsin-Madison), Doug Burger (Microsoft), Michael Kozuch (Intel). Presented in partial fulfillment of the requirements for the degree of Doctor of Philosophy.
2 Performance and Energy Efficiency. Applications today are data-intensive: memory caching, databases, graphics.
3 Computation vs. Communication. Modern memory systems are bandwidth constrained, and data movement is very costly: an integer operation takes ~1 pJ and a floating-point operation ~20 pJ, but a low-power memory access takes ~1200 pJ. Implication: even half the bandwidth of modern mobile phone memory exceeds the power budget. So transfer less data, or keep data near the processing units.
4 Data Compression across the System: processor, cache, memory, network, disk.
5 Software vs. Hardware Compression. Software: layer = disk; latency = milliseconds; algorithms = dictionary-based, arithmetic. Hardware: layer = cache/memory; latency = nanoseconds. Existing dictionary-based algorithms are too slow for main memory hierarchies.
6 Key Challenges for Compression in the Memory Hierarchy: fast access latency, practical implementation and low cost, high compression ratio.
7 Thesis Statement. It is possible to develop a new set of designs for data compression within modern memory hierarchies that is fast, simple, and effective in saving storage space and consumed bandwidth, so that the resulting improvements in performance, cost, and energy efficiency will make it attractive to implement in future systems.
8 Contributions of This Dissertation: (1) Base-Delta-Immediate (BDI), a compression algorithm with low latency and high compression ratio; (2) Compression-Aware Management Policies (CAMP), which incorporate compressed block size into cache management decisions; (3) the Linearly Compressed Pages (LCP) framework for efficient main memory compression; (4) Toggle-Aware Bandwidth Compression mechanisms for energy-efficient bandwidth compression.
9 Practical Data Compression in Memory. Four directions along the processor-cache-memory-disk hierarchy: 1. Cache Compression; 2. Compression and Cache Replacement; 3. Memory Compression; 4. Bandwidth Compression.
10 Cache Compression [PACT 2012]
11 Background on Cache Compression. An L1 hit serves uncompressed data in 3-4 cycles; an L2 hit serves compressed data in ~15 cycles plus decompression. Key requirement: low decompression latency.
12 Key Data Patterns in Real Applications. Zero values: initialization, sparse matrices, NULL pointers (0x00000000 0x00000000 0x00000000 0x00000000). Repeated values: common initial values, adjacent pixels (0x000000C0 0x000000C0 0x000000C0 0x000000C0). Narrow values: small values stored in a big data type (0x000000C0 0x000000C8 0x000000D0 0x000000D8). Other patterns: pointers to the same memory region (0xC04039C0 0xC04039C8 0xC04039D0 0xC04039D8).
13 How Common Are These Patterns? [Chart: cache coverage (%) of zero, repeated-value, and other patterns per benchmark (libquantum, lbm, mcf, tpch17, sjeng, omnetpp, and others) for SPEC2006, database, and web workloads with a 2MB L2 cache; Other Patterns include Narrow Values.] On average, 43% of the cache lines belong to these key patterns.
15 Key Data Patterns in Real Applications. Low Dynamic Range: differences between values are significantly smaller than the values themselves. This observation drives three design goals: a low-latency decompressor, a low-cost and low-complexity compressor, and a compressed cache organization.
16 Key Idea: Base+Delta (B+Δ) Encoding. A 32-byte uncompressed cache line of 4-byte values (0xC04039C0 0xC04039C8 0xC04039D0 ... 0xC04039F8) is encoded as a 4-byte base (0xC04039C0) plus 1-byte deltas (0x00, 0x08, 0x10, ..., 0x38): a 12-byte compressed cache line, saving 20 bytes. Effective: good compression ratio.
17 B+Δ Decompressor Design. The compressed cache line (base B0 and deltas Δ0, Δ1, Δ2, Δ3) is expanded in one step by vector addition, Vi = B0 + Δi for all deltas in parallel, yielding the uncompressed line (V0, V1, V2, V3). Fast decompression: 1-2 cycles.
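The encoding and decompression on the two slides above can be sketched in a few lines of C. This is a minimal illustration, not the dissertation's hardware design: the function names are invented, the base is fixed to the first word, and deltas are unsigned 1-byte values (the full BΔI design also supports other base and delta sizes).

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative B+Delta sketch (names are invented, not from the thesis):
 * a 32-byte line of eight 32-bit words becomes a 4-byte base plus eight
 * 1-byte deltas (12 bytes) when every delta fits in one byte. */
int bdelta_compress(const uint32_t line[8], uint32_t *base, uint8_t deltas[8]) {
    *base = line[0];                   /* first word serves as the base */
    for (size_t i = 0; i < 8; i++) {
        uint32_t d = line[i] - *base;
        if (d > 0xFF)                  /* delta too wide for one byte: */
            return 0;                  /* leave the line uncompressed  */
        deltas[i] = (uint8_t)d;
    }
    return 1;                          /* compressed: 4B + 8*1B = 12B  */
}

/* Decompression is a single vector-style addition: V_i = B_0 + Delta_i. */
void bdelta_decompress(uint32_t base, const uint8_t deltas[8], uint32_t line[8]) {
    for (size_t i = 0; i < 8; i++)
        line[i] = base + deltas[i];
}
```

On the slide's example line (0xC04039C0 ... 0xC04039F8) the deltas are 0x00 through 0x38, so the line compresses from 32 to 12 bytes; in hardware the per-word operations run in parallel, which is why decompression fits in 1-2 cycles.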
18 Can We Get Higher Compression Ratio? Some cache lines are uncompressible with a single base, e.g. a structure that mixes pointers and small integers: struct A { int* next; int count; }; puts wide values like 0x09A4A838 next to small counts. Key idea: use more bases, so more cache lines can be compressed. But it is unclear how to find these bases efficiently, and additional bases add overhead.
19 B+Δ with Multiple Arbitrary Bases. [Chart: compression ratio as the number of bases varies.] The number of bases is fixed at design time; two bases is empirically the best option.
20 How to Find Two Bases Efficiently? 1. First base: the first element in the cache line (the Base+Delta part). 2. Second base: an implicit base of 0 (the Immediate part). The result is Base-Delta-Immediate (BΔI) Compression.
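The two-base trick can be expressed as a compressibility check; a sketch under the same assumptions as the earlier B+Δ code (4-byte words, 1-byte unsigned deltas; the helper name is hypothetical):

```c
#include <stdint.h>
#include <stddef.h>

/* BDI-style check with two bases: each 32-bit word must be within a
 * 1-byte delta of either the first word (the arbitrary base, covering
 * the Base+Delta part) or the implicit base 0 (the Immediate part). */
int bdi_compressible(const uint32_t line[8]) {
    uint32_t base = line[0];
    for (size_t i = 0; i < 8; i++) {
        if (line[i] - base <= 0xFF) continue;  /* near the first-word base */
        if (line[i]        <= 0xFF) continue;  /* near the implicit base 0 */
        return 0;                              /* neither base covers it   */
    }
    return 1;
}
```

A line interleaving pointers with small counters (as in struct A on slide 18) passes this check even though no single base covers it.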
21 BΔI Cache Organization. Conventional design: a 2-way cache with 32-byte cache lines (tag storage holds Tag0/Tag1 per set; data storage holds Data0/Data1, 32 bytes each). BΔI design: a 4-way cache with 8-byte segmented data (tag storage holds Tag0-Tag3 per set; data storage holds eight 8-byte segments S0-S7 per set, plus compression-encoding bits C). There are twice as many tags, and each tag maps to multiple adjacent segments.
22 Methodology. Simulator: x86 event-driven simulator (MemSim [Seshadri+, PACT'12]). Workloads: SPEC2006 benchmarks, TPC, Apache web server; 1-4 core simulations for 1 billion representative instructions. System parameters: L1/L2/L3 cache latencies from CACTI; BDI with 1-cycle decompression; 4GHz x86 in-order cores; cache sizes from 1MB to 16MB.
23 Comparison Summary, Prior Work vs. BΔI. Decompression latency: 5-9 cycles (prior work) vs. 1-2 cycles (BΔI). Compression latency: 1-9 cycles for BΔI. Bottom line: BΔI delivers the average performance of a twice-larger cache.
24 Roadmap along the processor-cache-memory-disk hierarchy: 1. Cache Compression; 2. Compression and Cache Replacement; 3. Memory Compression; 4. Bandwidth Compression.
25 Compression and Cache Replacement [HPCA 2015]
26 Cache Management Background. Cache management is not only about size: the policies for insertion, promotion, and eviction matter. Belady's OPT replacement is optimal for fixed-size caches with fixed-size blocks, but it assumes perfect knowledge of the future access stream (e.g., deciding which of X, A, Y, B, C to keep given the upcoming accesses X, Y, Z, A).
27 Block Size Can Indicate Reuse. Sometimes there is a relation between compressed block size and reuse distance: the data structure a block comes from determines both its compressed size and its reuse distance. The relation can therefore be detected through the compressed block size, with minimal tracking overhead (compressed block information is already part of the design).
28 Code Example to Support Intuition. int A[N]; double B[16]; for (int i=0; i<N; i++) { int idx = A[i]; for (int j=0; j<N; j++) { sum += B[(idx+j)%16]; } } A holds small indices: compressible, long reuse. B holds FP coefficients: incompressible, short reuse. Compressed size can be an indicator of reuse distance.
29 Block Size Can Indicate Reuse. [Chart: reuse distance (in memory accesses) versus block size in bytes.] Different sizes have different dominant reuse distances.
30 Compression-Aware Management Policies (CAMP). CAMP = SIP (Size-based Insertion Policy) + MVE (Minimal-Value Eviction), where a block's value is its probability of reuse divided by its compressed block size. The benefits are on par with those of cache compression, a 2X additional increase in effective capacity.
31 Roadmap: 1. Cache Compression; 2. Compression and Cache Replacement; 3. Memory Compression; 4. Bandwidth Compression.
32 Main Memory Compression [MICRO 2013]
33 Challenges in Main Memory Compression: 1. Address Computation; 2. Mapping and Fragmentation.
34 Address Computation. In an uncompressed page, cache line Li (64B) sits at address offset i*64, so line LN-1 is at (N-1)*64. In a compressed page, line sizes vary, so only L0's offset (0) is known trivially; the offsets of L1 ... LN-1 require extra computation.
35 Mapping and Fragmentation. A 4KB virtual page must map to a physical page of unknown compressed size (? KB), and the variable page sizes cause fragmentation of physical memory.
36 Shortcomings of Prior Work. Compression mechanisms compared on compression ratio, address computation latency, decompression latency, and complexity/cost: IBM MXT [IBM J.R.D. '01] (64-cycle latency); Robust Main Memory Compression [ISCA '05] (5-cycle address computation); and Linearly Compressed Pages, our proposal.
39 Linearly Compressed Pages (LCP): Key Idea. An uncompressed 4KB page (64 x 64B cache lines) is compressed 4:1 into 1KB of compressed data by storing every line at the same fixed compressed size. LCP sacrifices some compression ratio in favor of design simplicity, and it effectively solves challenge 1: address computation.
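The contrast with the address-computation challenge on slide 34 can be shown directly; a minimal sketch with the slide's constants (the function names are illustrative, not from the LCP hardware):

```c
#include <stdint.h>

/* Why LCP makes address computation trivial: if every line in a page is
 * compressed to the same fixed size, the byte offset of line i is a
 * single multiply, exactly as in an uncompressed page. Constants follow
 * the slide (4KB page = 64 lines of 64B; 4:1 compression = 16B/line). */
uint64_t uncompressed_offset(unsigned i) {
    return (uint64_t)i * 64;               /* line i at i * 64B          */
}

uint64_t lcp_offset(unsigned i, unsigned comp_line_size) {
    return (uint64_t)i * comp_line_size;   /* e.g., i * 16B for 4:1 LCP  */
}
```

With variable-size compressed lines, by contrast, the offset of line i would be the sum of the sizes of lines 0..i-1, which cannot be computed without consulting per-line metadata.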
40 LCP: Key Idea (2). Alongside the 1KB of compressed data, the page keeps a metadata region (64B) and an exception storage region: the per-line metadata records whether each line is compressible, and lines that do not fit the fixed size are stored in the exception region.
41 LCP Framework Overview. Page table entry extension: compression type and size. OS support: multiple page sizes with 4 memory pools (512B, 1KB, 2KB, 4KB); handling of uncompressible data. Hardware support: memory controller logic and a metadata (MD) cache.
42 Physical Memory Layout. [Diagram: page table entries PA0, PA1, ... plus an offset map virtual pages into physical memory pools of 4KB, 2KB, 1KB, and 512B compressed pages.]
43 LCP Optimizations. Metadata cache: avoids additional memory requests for metadata. Memory bandwidth reduction: four adjacent compressed 64B lines can be fetched in 1 transfer instead of 4. Zero pages and zero cache lines: handled separately with 1 bit in the TLB (per page) and 1 bit per cache line in the metadata.
44 Summary of the Results, Prior Work vs. LCP. Performance: -4% (prior work) vs. +14% (LCP). Energy consumption: 6% vs. 5%.
45 Roadmap: 1. Cache Compression; 2. Compression and Cache Replacement; 3. Memory Compression; 4. Bandwidth Compression.
46 Energy-Efficient Bandwidth Compression [HPCA 2016, CAL]
47 Energy Efficiency: Bit Toggles. How energy is spent in data transfers: every 0→1 or 1→0 transition on a wire costs energy (Energy = C*V²), so sending new data after previous data 0011 pays for each bit position that toggles. The energy of data transfers (e.g., over the NoC or to DRAM) is proportional to the bit toggle count.
48 Excessive Number of Bit Toggles. The uncompressed cache line 0x00003A00 0x8001D000 0x00003A01 0x8001D008 is sent as two flits; XOR-ing consecutive flits shows only 2 toggles, because the aligned words barely differ. The compressed form (0x5 0x3A00, 0x7 8001D000, 0x5 0x3A01, 0x7 8001D008) packs values at arbitrary bit positions, so consecutive flits differ in many positions: 31 toggles.
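Toggle counting itself is just a popcount of the XOR shown on the slide; a minimal sketch modeling a flit as a 64-bit word (real flits are wider):

```c
#include <stdint.h>

/* Toggle count between two consecutive transfers: the wires that flip
 * are exactly the set bits of prev XOR next, so the toggle count is a
 * population count of that XOR. */
unsigned toggles64(uint64_t prev, uint64_t next) {
    uint64_t x = prev ^ next;          /* 1 bits mark toggled wires */
    unsigned n = 0;
    while (x) {
        x &= x - 1;                    /* clear the lowest set bit  */
        n++;
    }
    return n;
}
```

For the slide's example, summing this count over all pairs of consecutive flits (and over all wires of the link) gives the total toggle energy cost of a transfer.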
49 Effect of Compression on Bit Toggles. [Chart: normalized bit toggle count for FPC, BDI, BDI+FPC, Fibonacci, and C-Pack on Discrete, Mobile, and Open-Source workloads.] Compression significantly increases the bit toggle count.
50 Energy Control. Compare the bit toggle count of the compressed and uncompressed versions of each transfer; use a heuristic (an Energy x Delay or Energy x Delay² metric) to estimate the trade-off; take bandwidth utilization into account; and throttle compression when it is not beneficial.
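The throttling decision can be sketched as a cost comparison. This is an assumption-laden illustration: it models energy by toggle count and delay by flit count and picks the smaller Energy x Delay product; the thesis's actual Energy Control mechanism also factors in bandwidth utilization.

```c
#include <stdint.h>

/* Sketch of an Energy Control style decision (the exact metric and any
 * thresholds in the thesis may differ): send the compressed version only
 * when its Energy x Delay product, approximated as toggles * flits, does
 * not exceed that of the uncompressed version. Returns 1 to compress. */
int ec_send_compressed(unsigned toggles_comp, unsigned toggles_uncomp,
                       unsigned flits_comp, unsigned flits_uncomp) {
    uint64_t cost_comp   = (uint64_t)toggles_comp   * flits_comp;
    uint64_t cost_uncomp = (uint64_t)toggles_uncomp * flits_uncomp;
    return cost_comp <= cost_uncomp;   /* otherwise throttle compression */
}
```

On slide 48's numbers (31 toggles compressed vs. 2 uncompressed over two flits), such a policy would send the line uncompressed.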
51 Methodology. Simulator: GPGPU-Sim 3.2.x and an in-house simulator. Workloads: NVIDIA applications (discrete and mobile), 221 apps; open-source suites (Lonestar, Rodinia, MapReduce), 21 apps. System parameters (Fermi): 15 SMs, 32 threads/warp, 48 warps/SM, 32KB shared memory; core: 1.4GHz, GTO scheduler, 2 schedulers/SM; memory: 177.4GB/s bandwidth, GDDR5; caches: 16KB L1, 768KB L2.
52 Effect of EC on Bit Toggle Count. [Chart: normalized bit toggle count without and with EC, for FPC, BDI, BDI+FPC, Fibonacci, and C-Pack on Discrete, Mobile, and Open-Source workloads.] EC significantly reduces the bit toggle count, and it works for different compression algorithms.
53 Effect of EC on Compression Ratio. [Chart: compression ratio without and with EC, for the same algorithms and workload groups.] EC preserves most of the benefits of compression.
54 Acknowledgments. Todd Mowry and Onur Mutlu; Phil Gibbons and Mike Kozuch; Kayvon Fatahalian, David Wood, and Doug Burger; SAFARI and LBA group members; collaborators at CMU, MSR, NVIDIA, and GaTech; CALCM and PDL; Deb Cavlovich; family and friends.
55 Conclusion. Data stored in memory hierarchies has significant redundancy, which means inefficient usage of existing limited resources. This dissertation proposes simple and efficient mechanisms for hardware-based data compression in on-chip caches, main memory, and on-chip/off-chip interconnects. Our mechanisms improve performance, cost, and energy efficiency.
More informationManaging GPU Concurrency in Heterogeneous Architectures
Managing Concurrency in Heterogeneous Architectures Onur Kayıran, Nachiappan CN, Adwait Jog, Rachata Ausavarungnirun, Mahmut T. Kandemir, Gabriel H. Loh, Onur Mutlu, Chita R. Das Era of Heterogeneous Architectures
More informationYet Another Compressed Cache: a Low Cost Yet Effective Compressed Cache
Yet Another Compressed Cache: a Low Cost Yet Effective Compressed Cache Somayeh Sardashti, André Seznec, David A. Wood To cite this version: Somayeh Sardashti, André Seznec, David A. Wood. Yet Another
More informationVirtual Memory. Physical Addressing. Problem 2: Capacity. Problem 1: Memory Management 11/20/15
Memory Addressing Motivation: why not direct physical memory access? Address translation with pages Optimizing translation: translation lookaside buffer Extra benefits: sharing and protection Memory as
More informationArchitecture of Parallel Computer Systems - Performance Benchmarking -
Architecture of Parallel Computer Systems - Performance Benchmarking - SoSe 18 L.079.05810 www.uni-paderborn.de/pc2 J. Simon - Architecture of Parallel Computer Systems SoSe 2018 < 1 > Definition of Benchmark
More informationA Framework for Accelerating Bottlenecks in GPU Execution with Assist Warps
A Framework for Accelerating Bottlenecks in GPU Execution with Assist Warps Nandita Vijaykumar Gennady Pekhimenko Adwait Jog Saugata Ghose Abhishek Bhowmick Rachata Ausavarungnirun Chita Das Mahmut Kandemir
More informationCS 6354: Memory Hierarchy I. 29 August 2016
1 CS 6354: Memory Hierarchy I 29 August 2016 Survey results 2 Processor/Memory Gap Figure 2.2 Starting with 1980 performance as a baseline, the gap in performance, measured as the difference in the time
More informationComputer Architecture. Introduction
to Computer Architecture 1 Computer Architecture What is Computer Architecture From Wikipedia, the free encyclopedia In computer engineering, computer architecture is a set of rules and methods that describe
More informationRow Buffer Locality Aware Caching Policies for Hybrid Memories. HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu
Row Buffer Locality Aware Caching Policies for Hybrid Memories HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu Executive Summary Different memory technologies have different
More informationInsertion and Promotion for Tree-Based PseudoLRU Last-Level Caches
Insertion and Promotion for Tree-Based PseudoLRU Last-Level Caches Daniel A. Jiménez Department of Computer Science and Engineering Texas A&M University ABSTRACT Last-level caches mitigate the high latency
More informationAlgorithm Performance Factors. Memory Performance of Algorithms. Processor-Memory Performance Gap. Moore s Law. Program Model of Memory II
Memory Performance of Algorithms CSE 32 Data Structures Lecture Algorithm Performance Factors Algorithm choices (asymptotic running time) O(n 2 ) or O(n log n) Data structure choices List or Arrays Language
More informationEFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES
EFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES MICRO 2011 @ Porte Alegre, Brazil Gabriel H. Loh [1] and Mark D. Hill [2][1] December 2011 [1] AMD Research [2] University
More informationCS 537 Lecture 6 Fast Translation - TLBs
CS 537 Lecture 6 Fast Translation - TLBs Michael Swift 9/26/7 2004-2007 Ed Lazowska, Hank Levy, Andrea and Remzi Arpaci-Dussea, Michael Swift Faster with TLBS Questions answered in this lecture: Review
More informationVirtualized and Flexible ECC for Main Memory
Virtualized and Flexible ECC for Main Memory Doe Hyun Yoon and Mattan Erez Dept. Electrical and Computer Engineering The University of Texas at Austin ASPLOS 2010 1 Memory Error Protection Applying ECC
More informationMulti-level Translation. CS 537 Lecture 9 Paging. Example two-level page table. Multi-level Translation Analysis
Multi-level Translation CS 57 Lecture 9 Paging Michael Swift Problem: what if you have a sparse address space e.g. out of GB, you use MB spread out need one PTE per page in virtual address space bit AS
More informationAccelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh
Accelerating Pointer Chasing in 3D-Stacked : Challenges, Mechanisms, Evaluation Kevin Hsieh Samira Khan, Nandita Vijaykumar, Kevin K. Chang, Amirali Boroumand, Saugata Ghose, Onur Mutlu Executive Summary
More informationgem5-gpu Extending gem5 for GPGPUs Jason Power, Marc Orr, Joel Hestness, Mark Hill, David Wood
gem5-gpu Extending gem5 for GPGPUs Jason Power, Marc Orr, Joel Hestness, Mark Hill, David Wood (powerjg/morr)@cs.wisc.edu UW-Madison Computer Sciences 2012 gem5-gpu gem5 + GPGPU-Sim (v3.0.1) Flexible memory
More informationAn Application-Oriented Approach for Designing Heterogeneous Network-on-Chip
An Application-Oriented Approach for Designing Heterogeneous Network-on-Chip Technical Report CSE-11-7 Monday, June 13, 211 Asit K. Mishra Department of Computer Science and Engineering The Pennsylvania
More informationChapter 02. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 02 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 2.1 The levels in a typical memory hierarchy in a server computer shown on top (a) and in
More informationMultiperspective Reuse Prediction
ABSTRACT Daniel A. Jiménez Texas A&M University djimenezacm.org The disparity between last-level cache and memory latencies motivates the search for e cient cache management policies. Recent work in predicting
More informationRelative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review
Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Bijay K.Paikaray Debabala Swain Dept. of CSE, CUTM Dept. of CSE, CUTM Bhubaneswer, India Bhubaneswer, India
More informationDEMM: a Dynamic Energy-saving mechanism for Multicore Memories
DEMM: a Dynamic Energy-saving mechanism for Multicore Memories Akbar Sharifi, Wei Ding 2, Diana Guttman 3, Hui Zhao 4, Xulong Tang 5, Mahmut Kandemir 5, Chita Das 5 Facebook 2 Qualcomm 3 Intel 4 University
More informationUsing Transparent Compression to Improve SSD-based I/O Caches
Using Transparent Compression to Improve SSD-based I/O Caches Thanos Makatos, Yannis Klonatos, Manolis Marazakis, Michail D. Flouris, and Angelos Bilas {mcatos,klonatos,maraz,flouris,bilas}@ics.forth.gr
More informationI, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3.
5 Solutions Chapter 5 Solutions S-3 5.1 5.1.1 4 5.1.2 I, J 5.1.3 A[I][J] 5.1.4 3596 8 800/4 2 8 8/4 8000/4 5.1.5 I, J 5.1.6 A(J, I) 5.2 5.2.1 Word Address Binary Address Tag Index Hit/Miss 5.2.2 3 0000
More informationBest-Offset Hardware Prefetching
Best-Offset Hardware Prefetching Pierre Michaud March 2016 2 BOP: yet another data prefetcher Contribution: offset prefetcher with new mechanism for setting the prefetch offset dynamically - Improvement
More informationVirtual Memory. CS 3410 Computer System Organization & Programming
Virtual Memory CS 3410 Computer System Organization & Programming These slides are the product of many rounds of teaching CS 3410 by Professors Weatherspoon, Bala, Bracy, and Sirer. Where are we now and
More informationCS162 - Operating Systems and Systems Programming. Address Translation => Paging"
CS162 - Operating Systems and Systems Programming Address Translation => Paging" David E. Culler! http://cs162.eecs.berkeley.edu/! Lecture #15! Oct 3, 2014!! Reading: A&D 8.1-2, 8.3.1. 9.7 HW 3 out (due
More informationVirtual Memory. Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. April 12, 2018 L16-1
Virtual Memory Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. L16-1 Reminder: Operating Systems Goals of OS: Protection and privacy: Processes cannot access each other s data Abstraction:
More informationSkewed Compressed Cache
Skewed Compressed Cache Somayeh Sardashti, André Seznec, David A. Wood To cite this version: Somayeh Sardashti, André Seznec, David A. Wood. Skewed Compressed Cache. MICRO - 47th Annual IEEE/ACM International
More information