Memory-Link Compression Schemes: A Value Locality Perspective


Memory-Link Compression Schemes: A Value Locality Perspective. Martin Thuresson, Lawrence Spracklen and Per Stenström (IEEE). Presented by Jean Niklas L'orange and Caroline Sæhle for TDT01, Norwegian University of Science and Technology, November 11, 2013.

Introduction

What is the problem exactly? There are limits to dealing with the processor-memory gap on-chip:
- Diminishing returns from making deeper cache hierarchies
- Limited bandwidth is already a problem for memory-bound applications, and it is getting worse:
  - Pin count goes up less than 10% as transistor count doubles
  - The transition to multicore processing increases bandwidth usage
  - Control speculation, hardware scouting, and value prediction increase bandwidth usage by 15-30%
- Latency-compensating techniques come at the expense of bandwidth

Solution: reduce the bandwidth needed!

Introduction: Memory-Link Compression

- Compress data before it is sent over the link between the last-level cache and memory
- Decompress data before it is installed in the cache or written back to memory
- Storing compressed data in memory is also a possibility, but is not investigated in the paper
- Additional advantage: reduces transfer time and miss penalty
- Disadvantage: increases transfer latency, so lightweight compression is needed

(Diagram: Cache — Comp/Decomp — Link — Comp/Decomp — Main Memory)

Introduction: Memory-Link Compression, cont'd

Lightweight compression schemes:
- Significance-width compression (SWC) exploits small value locality
- Delta encoding exploits clustered value locality
- The Citron scheme and the frequent value encoding (FVE) scheme exploit isolated value locality

Which are relevant for the different domains: integer, multimedia, and commercial applications? Does a combination of schemes work better?

Value Locality

So what is value locality? "A program attribute that describes the likelihood of the recurrence of previously-seen program values" [1]. The paper analyses the locality and compressibility of memory-link traffic using the Simics full-system simulator. Results are fairly consistent but not uniform: a large number of transferred values are either very small or very large.

[1] Lepak, K.M.; Lipasti, M.H., "On the value locality of store instructions", 2000

Value Locality: Small Value Locality

Significance-width compression (SWC) utilises small value locality: many values are small. Each value is encoded in two parts:
- An integer x of fixed width, giving the number of remaining bits
- x bits, representing the actual value

A fast and simple approach, extremely parallelizable. Good compression rates for small values, but significant overhead for large values.
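The two-part encoding above can be sketched as follows. This is our illustrative model of SWC's link cost, not the paper's hardware encoder; `WIDTH_BITS` and the function names are assumptions.

```python
# A minimal sketch of SWC, assuming a 5-bit fixed-width length field
# (the configuration the slides report as working well).

WIDTH_BITS = 5  # fixed-width field: number of significant bits that follow

def swc_encode(value):
    """Split a non-negative word into (significant-bit count, payload)."""
    sig = max(value.bit_length(), 1)  # at least one bit, so zero is encodable
    return sig, value                 # transmitted as WIDTH_BITS + sig bits

def swc_bits(value):
    """Total bits on the link for one word under SWC."""
    sig, _ = swc_encode(value)
    return WIDTH_BITS + sig
```

Under this model a small value like 3 costs 7 bits instead of 32, while a full 32-bit value pays the 5-bit overhead and costs 37 bits, illustrating the trade-off stated on the slide.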

Value Locality: SWC Compression Results

Frees up 30% bandwidth on average, with 5 bits representing the number of remaining bits. Different binning schemes were tried, but worked well only for integer applications.

Value Locality: Clustered Value Locality

Many values are close to each other, even if they are large. Delta encoding utilises this property:
- Keep multiple cluster values in a cache; pick the closest one found
- Send the index of the cluster value over the link, along with the difference from the actual value
- If the difference exceeds a threshold, insert the current value into the cache (LRU) and send the raw value over the wire

A larger cache means more index bits; a larger threshold allows larger differences. What are the optimal values?

(Diagram: a Δ-cache on each side of the link between the cache and main memory)
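The steps above can be sketched as a toy encoder. The class, its parameter names, and the exact field widths (a hit/miss flag, log2-sized index, sign bit on the delta) are our assumptions for illustration; the paper's wire format may differ.

```python
from collections import OrderedDict

class DeltaLinkEncoder:
    """Toy delta encoder: keep a small LRU cache of cluster values; on a
    close match send (index, delta), otherwise send the raw word and
    insert it into the cache."""

    def __init__(self, size=32, threshold_bits=16, word_bits=32):
        self.cache = OrderedDict()               # cluster values, LRU order
        self.size = size
        self.threshold = 1 << threshold_bits
        self.word_bits = word_bits
        self.index_bits = size.bit_length() - 1  # log2(size), size a power of 2

    def encode(self, value):
        """Return (hit, bits_on_link) for one transferred word."""
        if self.cache:
            base = min(self.cache, key=lambda b: abs(value - b))
            if abs(value - base) < self.threshold:
                self.cache.move_to_end(base)     # LRU touch
                delta_bits = max(abs(value - base).bit_length(), 1) + 1  # + sign
                return True, 1 + self.index_bits + delta_bits  # 1 = hit flag
        # Miss: insert the value (evicting the LRU entry) and send it raw.
        self.cache[value] = None
        if len(self.cache) > self.size:
            self.cache.popitem(last=False)
        return False, 1 + self.word_bits
```

With the slide's parameters (32-entry cache, 16-bit threshold), a word close to a cached cluster value costs roughly a flag, a 5-bit index, and a short delta instead of a full 32-bit word.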

Value Locality: Delta Encoding Compression Results

Value Locality: Delta Encoding Compression Results

Very good compression rate: on average 12 bits for integer, 14 for media, and 20 for commercial applications, yielding a 60% compression rate on average. Setting the threshold to 16 and the cache size to 32 gave optimal results on average. No sensible results for commercial programs, presumably because many data ranges are live at any given time.

Value Locality: Isolated Value Locality

Programs tend to have many frequently recurring values, for example 0 and 1. We can utilise this to avoid sending the same value over and over again. There are two schemes to handle this:
1. The Frequent Value Encoding (FVE) scheme: store frequent values in a cache
2. The Citron scheme: an FVE scheme applied to the 16 most significant bits

Value Locality: The FVE and Citron Schemes

Frequent Value Encoding (FVE) scheme:
- Keep a cache of the most recently used values (LRU replacement)
- If the value is in the cache, send only its index over the wire
- Otherwise, update the cache (LRU) and send the whole value over

Citron scheme:
- Split the value into two 16-bit halves
- Perform the FVE scheme on the 16 most significant bits
- Send the 16 least significant bits verbatim

Again, the question is the optimal cache size for both schemes.
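Both schemes above can be sketched with one LRU value cache; the sender and receiver update their caches identically so the index is meaningful on both sides. The class and the 1-bit hit/miss flag are illustrative assumptions.

```python
from collections import OrderedDict

class FVEncoder:
    """Toy FVE: an LRU value cache. A hit sends a flag plus an index; a
    miss sends a flag plus the raw value and updates the cache."""

    def __init__(self, size=32, word_bits=32):
        self.cache = OrderedDict()
        self.size, self.word_bits = size, word_bits
        self.index_bits = size.bit_length() - 1  # log2(size), size a power of 2

    def encode(self, value):
        if value in self.cache:                  # hit: index only
            self.cache.move_to_end(value)
            return True, 1 + self.index_bits
        self.cache[value] = None                 # miss: insert, evict LRU
        if len(self.cache) > self.size:
            self.cache.popitem(last=False)
        return False, 1 + self.word_bits

def citron_bits(upper_fve, value):
    """Citron: FVE on the upper 16 bits, lower 16 bits sent verbatim."""
    _, bits = upper_fve.encode(value >> 16)
    return bits + 16
```

With a 32-entry cache, an FVE hit costs 6 bits and a Citron upper-half hit costs 22 bits (6 + the verbatim lower half), consistent with FVE compressing further than Citron when values repeat exactly.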

Value Locality: FVE Scheme Compression Results

Value Locality: Citron Scheme Compression Results

Value Locality: FVE and Citron Compression Results

The optimal cache size for both seems to be 32 entries. Citron compresses values down to 20 bits on average, around a 40% reduction. FVE manages half of that: 10 bits, almost a 70% reduction. The miss component is quite big, even with a large cache.

Combining Value Locality Properties

None of these compression schemes handles more than one type of locality. Why not try to combine them?

Combining Value Locality Properties: Small and Clustered Value Locality

There is some inefficiency in delta encoding: the delta values to be transferred are usually small. Using SWC, we can compress the delta value before sending it over the memory link. The combined compression is quite effective, especially for integer and media applications. For commercial applications, the combined gain is better than the separate gains, but not as spectacular.

(Diagram: Cache — Delta — SWC — Link)
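A rough per-word cost model for the combined scheme is sketched below: on a cluster-cache hit the delta is SWC-encoded instead of sent at a fixed width. The field widths are illustrative assumptions (5-bit SWC length field, 32-entry delta cache).

```python
# Illustrative link cost for a cluster-cache hit when the delta is
# SWC-encoded; field widths are assumptions, not the paper's exact format.

SWC_LEN_BITS = 5   # SWC length field
INDEX_BITS = 5     # index into a 32-entry cluster-value cache

def delta_swc_bits(delta):
    """Bits on the link for a cluster-cache hit with the given signed delta."""
    sig = max(abs(delta).bit_length(), 1) + 1  # magnitude bits + sign bit
    return INDEX_BITS + SWC_LEN_BITS + sig
```

For a typical small delta of 10 this costs 15 bits, versus 32 for a raw word, which is where the combined scheme's extra gain over plain delta encoding comes from.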

Combining Value Locality Properties: Small and Clustered Value Locality Compression Results

Combining Value Locality Properties: Small and Isolated Value Locality

The Citron and FVE schemes are quite effective at reducing bandwidth usage, but transferring the data on a miss in the value cache is inefficient. Using SWC, we can handle this more efficiently:
- When updating the value cache on a miss, the new value can be sent SWC-encoded
- Small values, defined as those representable in 16 bits or less, don't need to be stored in the value cache, as SWC is already very efficient for them
- The small and large values can thus be compressed in parallel, which may give low latency

(Diagram: Cache — SWC — FVE — Link)
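The split described above can be sketched as one encoder with two paths: a 1-bit path flag selects SWC for small values and the FVE value cache for large ones. The class name, flag, and field widths are our illustrative assumptions.

```python
from collections import OrderedDict

SWC_LEN_BITS = 5  # SWC length field

class SwcFveEncoder:
    """Combined SWC+FVE sketch: values fitting in `small_bits` bypass the
    value cache and are SWC-encoded; larger values use an LRU value cache
    (index on a hit, raw word on a miss)."""

    def __init__(self, size=32, word_bits=32, small_bits=16):
        self.cache = OrderedDict()
        self.size, self.word_bits, self.small_bits = size, word_bits, small_bits
        self.index_bits = size.bit_length() - 1

    def bits(self, value):
        """Link cost in bits for one transferred word."""
        if value.bit_length() <= self.small_bits:                 # SWC path
            return 1 + SWC_LEN_BITS + max(value.bit_length(), 1)
        if value in self.cache:                                   # FVE hit
            self.cache.move_to_end(value)
            return 1 + 1 + self.index_bits
        self.cache[value] = None                                  # FVE miss
        if len(self.cache) > self.size:
            self.cache.popitem(last=False)
        return 1 + 1 + self.word_bits
```

Keeping small values out of the value cache leaves its entries for the large values that actually need them, which is the intuition behind the combined scheme's gain for integer and commercial applications.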

Combining Value Locality Properties: Small and Isolated Value Locality Compression Results


Combining Value Locality Properties: Small and Isolated Value Locality, cont'd

Combining SWC and FVE increases compressibility for integer and commercial applications, but is not helpful for media applications. For media applications, values are either zero or need all 32 bits to be represented, so SWC brings no benefit, only added overhead. For the Citron scheme, combining it with SWC gives better results than either individually; this benefit comes primarily from compressing the 16 least significant bits.

Conclusion

- Identified three categories of value locality: small, clustered, and isolated
- Measured previously proposed techniques by their compressibility, using a consistent framework
- As the previous schemes each targeted only one of the three categories, the authors proposed two new schemes by combining the previous techniques
- Achieved a 70-75% bandwidth reduction, compared to the previous 35-60%

Conclusion: Discussion

- While the bandwidth reduction is good, there are no performance measures: what are the speedups? Are there any energy savings?
- Though not quantified, the architecture will evidently be more complex
- Increasing the performance of this new compression scheme seems hard
- The commercial multicore programs show worse bandwidth reduction than the single-core programs; compression quality may degrade with more cores

Questions?