Analyzing Memory Access Patterns and Optimizing Through Spatial Memory Streaming. Ogün HEPER CmpE 511 Computer Architecture December 24th, 2009


1 Analyzing Memory Access Patterns and Optimizing Through Spatial Memory Streaming Ogün HEPER CmpE 511 Computer Architecture December 24th, 2009

2 Agenda: Introduction; Memory Hierarchy Design; CPU Speed vs. Memory Speed; Memory Stalls; Memory Streaming; Principle of Locality; Motivation; Babak et al.'s Spatial Memory Streaming; Identifying Spatial Patterns; Spatial Memory Streaming Design; Experimental Results; Conclusions; References

3 Introduction Ideally, one would desire an indefinitely large memory capacity such that any particular...word would be immediately available...we are...forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible. A.W.Burks, H.H.Goldstine, and J. von Neumann Preliminary Discussion of the Logical Design of an Electronic Computing Instrument (1946)

4 Memory Hierarchy Design Note that the time units change by factors of 10 (from picoseconds to milliseconds) and that the size units change by factors of 1000 (from bytes to terabytes) [1].

5 Memory Hierarchy Design Caches are the mechanism for reducing average memory access time. Average Memory Access Time = Cache Hit Time + Cache Miss Rate x Cache Miss Penalty [1]. Caches provide faster access to data, but they are small, they are expensive, and they require consistency and coherence.
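As a rough illustration of the formula above, the following sketch evaluates the average memory access time for made-up parameters. The function name and the numbers (1-cycle hit, 100-cycle miss penalty, 5% and 2.5% miss rates) are assumptions chosen for the example, not values from the presentation.

```python
def average_memory_access_time(hit_time, miss_rate, miss_penalty):
    """AMAT = hit time + miss rate * miss penalty (times in the same unit)."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be a fraction between 0 and 1")
    return hit_time + miss_rate * miss_penalty


# Hypothetical numbers, in cycles.
baseline = average_memory_access_time(hit_time=1, miss_rate=0.05, miss_penalty=100)
# Halving the miss rate, for example by fetching blocks ahead of use.
improved = average_memory_access_time(hit_time=1, miss_rate=0.025, miss_penalty=100)

print(baseline, improved)
```

With these assumed numbers, halving the miss rate takes the average access time from about 6 cycles to about 3.5 cycles, which is why techniques that remove misses matter even when the hit time is unchanged.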

6 CPU Speed vs. Memory Speed Starting with 1980 performance as a baseline, the gap in performance between memory and processors is plotted over time [1].

7 Memory Stall Memory is much slower than the CPU. Each time the CPU accesses memory, it waits for the response and wastes valuable cycles. This wait is called a memory stall.

8 Memory Streaming To improve performance, we need to further reduce memory access time, in other words to remove memory stalls. Streaming (or prefetching) is a technique that copies memory blocks into the CPU cache before the CPU needs them. Streaming can greatly reduce the number of memory stalls because it reduces the number of cache misses.

9 Memory Streaming The questions with streaming are: Which data to stream? When to stream? The answers depend directly on the principle of locality. Two types of locality have been observed.

10 Principle of Locality Temporal locality states that recently accessed items are likely to be accessed in the near future [8]. F(x) = P(page i is referenced at time t+x | page i was referenced at time t) [2]. Spatial locality states that items whose addresses are near one another tend to be referenced close together in time [8]. G(d, x) = P(page i+d is referenced at time t+x | page i was referenced at time t) [2].

11 Motivation Do memory accesses follow certain patterns? If so, can these patterns be predicted by a hardware design? If they can be predicted, can the predictions be exploited for a performance gain? Memory accesses in commercial workloads often exhibit repetitive layouts that span large memory regions [5].

12 Babak et al.'s SMS Babak et al. have proposed Spatial Memory Streaming, a practical on-chip hardware technique for streaming [3,4,7]. SMS predicts code-correlated spatial access patterns. Using its predicted patterns, SMS aims to improve performance by hiding long cache miss penalties.

13 Spatial Correlation For instance, a database page is the basic unit that a DBMS works with. Every time a database page is visited, the page header (row directory, log serial number) and the page footer (checksum information) are read. This relationship between accesses is called spatial correlation.

14 Regions, Patterns, Generations A spatial region is a fixed-size portion of the system's address space, consisting of multiple consecutive cache blocks. A spatial region generation is the time interval over which SMS records accesses within a spatial region. A spatial pattern is a bit vector representing the set of blocks in a region accessed during a generation.
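The three definitions above can be made concrete with a small sketch. It assumes 64-byte cache blocks (an assumption, the slides do not state the block size) and the 2kB region size used later in the results. The function names and the addresses are hypothetical.

```python
BLOCK_SIZE = 64        # bytes per cache block (assumed for illustration)
REGION_SIZE = 2048     # bytes per spatial region (2kB, as in the results)
BLOCKS_PER_REGION = REGION_SIZE // BLOCK_SIZE  # 32 blocks, so a 32-bit pattern


def region_tag(addr):
    """High-order address bits that identify the spatial region."""
    return addr // REGION_SIZE


def region_offset(addr):
    """Index of the cache block inside its spatial region."""
    return (addr % REGION_SIZE) // BLOCK_SIZE


def spatial_pattern(addresses):
    """Bit vector with one bit per block accessed during a generation."""
    pattern = 0
    for addr in addresses:
        pattern |= 1 << region_offset(addr)
    return pattern


# Hypothetical accesses that all fall into the same 2kB region.
accesses = [0x1000, 0x1040, 0x1048, 0x17C0]
print(region_tag(accesses[0]), hex(spatial_pattern(accesses)))
```

Here the four accesses touch blocks 0, 1 and 31 of region 2, so the pattern has exactly those three bits set. Two accesses to the same block (0x1040 and 0x1048) set the same bit.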

15 Identifying Spatial Patterns Spatial Address Correlation: Several variables or fields of an aggregate are frequently accessed together. Address of the data structure identifies the spatial correlation. Program Counter Correlation: A data structure is accessed in a recurring traversal. Address of the code executing the traversal identifies the spatial pattern.

16 Identifying Spatial Patterns PC + Address Indexing: Generates distinct patterns when multiple code sequences lead to different traversals of the same data structure. For instance: a) for (i = 0; i < n; ++i) { a[i].data; } b) while (a) { a->data; a = a->next; } Stores a different pattern for each instance of the same data structure.

17 Identifying Spatial Patterns PC + Offset Indexing: Identifies code sequences that repeat the same access pattern over a large number of instances of the same data structure. void foo(struct node *p, int n) { for (int i = 0; i < n; ++i) p[i].data; } Provides accurate predictions for data that has never been visited before during the lifetime of the application.
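The difference between the two indexing schemes can be sketched as follows. This is an illustration under assumptions, not the exact hardware hash: the index is modeled as a tuple, the block and region sizes are assumed, and the PC and addresses are made up.

```python
BLOCK_SIZE = 64      # bytes per cache block (assumed)
REGION_SIZE = 2048   # bytes per spatial region (assumed, 2kB)


def pc_address_index(pc, addr):
    """PC + Address: the index depends on where the data lives."""
    return (pc, addr // BLOCK_SIZE)


def pc_offset_index(pc, addr):
    """PC + Offset: the index depends only on the position inside the region."""
    return (pc, (addr % REGION_SIZE) // BLOCK_SIZE)


TRAVERSAL_PC = 0x400123        # hypothetical PC of the trigger access
first_instance = 0x10000 + 0x80    # an instance that was visited before
second_instance = 0x58000 + 0x80   # a new instance, same layout, never visited

same_offset_index = (
    pc_offset_index(TRAVERSAL_PC, first_instance)
    == pc_offset_index(TRAVERSAL_PC, second_instance)
)
same_address_index = (
    pc_address_index(TRAVERSAL_PC, first_instance)
    == pc_address_index(TRAVERSAL_PC, second_instance)
)
print(same_offset_index, same_address_index)
```

Under these assumptions the PC + Offset index is identical for both instances, so a pattern learned on the first one can be applied to the second even though the second was never visited. The PC + Address index differs, so a separate pattern would have to be learned and stored per instance.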

18 SMS Design SMS comprises two on-chip hardware structures. Active Generation Table: Records spatial patterns as the processor accesses spatial regions and trains the predictor. Pattern History Table: Stores previously observed spatial patterns, and is accessed at the start of each spatial region generation to predict the pattern of future accesses.

19 Active Generation Table The AGT consists of two parts, a filter table and an accumulation table. Entries in the accumulation table are tagged by the spatial region tag, which is the high-order bits of the region base address. Each entry stores the PC and spatial region offset of the trigger access, and a spatial pattern bit vector indicating which blocks have been accessed during the generation.

20 Active Generation Table Each L1 access first searches the accumulation table. If a matching entry is found, the spatial pattern bit corresponding to the accessed block is set. Otherwise, the access searches for its tag in the filter table. If no match is found, this access is the trigger access for a new spatial region generation and a new entry is allocated in the filter table. If an access matches in the filter table, then its spatial region offset is compared to the recorded offset. If the offsets differ, then this block is the second distinct cache block accessed within the generation, and the entry in the filter table is transferred to the accumulation table. Additional accesses to the region set corresponding bits in the pattern.

21 Active Generation Table Spatial region generations end with an eviction or invalidation. Upon these events, both the filter table and accumulation table are searched for the corresponding spatial region tag. A matching entry in the filter table is discarded because it represents a generation with only a trigger access. A matching entry in the accumulation table is transferred to the pattern history table.
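The training steps described on the Active Generation Table slides can be sketched in software as below. This is a simplified model under stated assumptions: both tables are unbounded dictionaries (real tables have a fixed number of entries and a replacement policy), block and region sizes are assumed, and the class and method names are hypothetical.

```python
BLOCK_SIZE = 64      # bytes per cache block (assumed)
REGION_SIZE = 2048   # bytes per spatial region (assumed, 2kB)


class ActiveGenerationTable:
    def __init__(self):
        self.filter = {}        # region tag -> (trigger pc, trigger offset)
        self.accumulation = {}  # region tag -> [trigger pc, trigger offset, pattern]

    def access(self, pc, addr):
        """Record one L1 access; return True if it is a trigger access."""
        tag = addr // REGION_SIZE
        offset = (addr % REGION_SIZE) // BLOCK_SIZE
        if tag in self.accumulation:
            # Generation already has two or more blocks: set the pattern bit.
            self.accumulation[tag][2] |= 1 << offset
        elif tag in self.filter:
            trigger_pc, trigger_offset = self.filter[tag]
            if offset != trigger_offset:
                # Second distinct block: move the entry to the accumulation table.
                del self.filter[tag]
                pattern = (1 << trigger_offset) | (1 << offset)
                self.accumulation[tag] = [trigger_pc, trigger_offset, pattern]
        else:
            # First access to the region: a new generation starts here.
            self.filter[tag] = (pc, offset)
            return True
        return False

    def end_generation(self, addr):
        """Called on eviction or invalidation of a block in the region.

        Returns (pc, offset, pattern) for the pattern history table,
        or None if the generation contained only the trigger access.
        """
        tag = addr // REGION_SIZE
        self.filter.pop(tag, None)  # trigger-only generations are discarded
        entry = self.accumulation.pop(tag, None)
        return tuple(entry) if entry is not None else None


agt = ActiveGenerationTable()
agt.access(0x400, 0x1000)   # trigger access, goes to the filter table
agt.access(0x404, 0x1040)   # second block, entry moves to accumulation
agt.access(0x408, 0x17C0)   # further access sets another pattern bit
trained = agt.end_generation(0x1000)
print(trained)
```

The tuple returned at the end of the generation carries the trigger PC, the trigger offset and the accumulated pattern, which is what a pattern history table indexed by PC + Offset would need to store.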

22 Pattern History Table The PHT is accessed using a prediction index constructed from the PC and spatial region offset of the trigger access for a generation. Each entry in the PHT stores the spatial pattern that was accumulated in the AGT. Upon a trigger access, SMS consults the PHT to predict which blocks will be accessed during the generation.


23 Pattern History Table If an entry in the PHT is found, the spatial region's base address and the spatial pattern are copied to one of several prediction registers. As SMS streams each block predicted by the pattern into the primary cache, it clears the corresponding bit in the prediction register. The register is freed when its entire pattern has been cleared. If multiple prediction registers are active, SMS requests blocks from each prediction register in a round-robin fashion. SMS stream requests behave like read requests in the cache coherence protocol.
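The prediction side can be sketched in the same spirit. The model below assumes an unbounded PHT dictionary keyed by (PC, offset), an unbounded queue of prediction registers, and that the trigger block's own bit is masked out because the demand access already fetches it. All three are simplifying assumptions, and the names are hypothetical.

```python
from collections import deque

BLOCK_SIZE = 64      # bytes per cache block (assumed)
REGION_SIZE = 2048   # bytes per spatial region (assumed, 2kB)


class Streamer:
    def __init__(self):
        self.pht = {}             # (pc, offset) -> spatial pattern
        self.registers = deque()  # each entry: [region base, remaining pattern]

    def train(self, pc, offset, pattern):
        """Store a pattern accumulated during a finished generation."""
        self.pht[(pc, offset)] = pattern

    def trigger(self, pc, addr):
        """On a trigger access, look up the PHT and arm a prediction register."""
        base = addr - addr % REGION_SIZE
        offset = (addr % REGION_SIZE) // BLOCK_SIZE
        pattern = self.pht.get((pc, offset))
        if pattern is None:
            return False
        pattern &= ~(1 << offset)  # assumption: skip the trigger block itself
        if pattern:
            self.registers.append([base, pattern])
        return True

    def next_stream_request(self):
        """Return the next block address to stream, round-robin over registers."""
        if not self.registers:
            return None
        base, pattern = self.registers.popleft()
        lowest = pattern & -pattern          # lowest set bit of the pattern
        block = lowest.bit_length() - 1
        pattern &= ~lowest                   # clear the bit once requested
        if pattern:
            self.registers.append([base, pattern])  # not yet freed: back of queue
        return base + block * BLOCK_SIZE


s = Streamer()
s.train(0x400, 0, 0b1011)
s.train(0x500, 1, 0b0110)
s.trigger(0x400, 0x1000)
s.trigger(0x500, 0x2040)
requests = []
while (addr := s.next_stream_request()) is not None:
    requests.append(addr)
print([hex(a) for a in requests])
```

With two active registers the requests alternate between the two regions until one pattern is exhausted, at which point its register is freed and the remaining one drains alone.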

24 Experimental Results SMS is evaluated using a combination of trace-driven and cycle-accurate full-system simulation of a shared-memory multiprocessor using FLEXUS. FLEXUS can execute unmodified operating systems and commercial applications on top of them. FLEXUS extends the Virtutech Simics functional simulator with cycle-accurate models of an out-of-order processor core, cache hierarchy, protocol controllers and interconnect.

25 Experimental Results TPC-C OLTP and TPC-H DSS workloads are used for database servers, and SPECweb99 is used for web servers. The aggregate number of user instructions committed per cycle is used as the performance metric.

26 Miss Rates vs. Block Size Increased cache block size leads to drastic increases in L1 miss rates because of conflict behavior. The commercial workloads use only a subset of the data in large regions and interleave accesses across regions. Thus, as the cache block size increases, conflicts increase, and the effective capacity of the L1 cache is reduced, leading to a sharp increase in miss rate with block sizes beyond 512B. The data sets of the scientific applications are more tightly packed, but nevertheless suffer from similar conflict behavior. The larger capacity of L2 reduces the prevalence of conflict effects as compared to L1. However, commercial workloads instead incur misses from false sharing, which accounts for 26% to 42% of L2 misses at the 8kB block size.


27 Comparison - Indexing Methods Coverage represents the fraction of L1 read misses that are eliminated by SMS. Over-predictions represent blocks that are fetched but not used prior to eviction or invalidation, and thus waste bandwidth. PC + Offset indexing yields the same or significantly higher coverage than address-based indexing methods, as well as lower storage requirements. PC + Offset attains peak coverage with 16k entries - roughly the same hardware cost as a 64kB L1 cache data array. For PC + Address, in all workloads except OLTP, 16k entries is far too small to capture a meaningful fraction of program footprint and provide significant coverage.
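The two metrics defined above can be computed from sets of block addresses, as in the sketch below. It makes two assumptions that the slide does not state: timeliness is ignored (a streamed block counts as covering a miss whenever it would have missed in the baseline), and over-predictions are normalized to the number of baseline misses. The block numbers are made up.

```python
def coverage_and_overprediction(baseline_misses, streamed):
    """Return (coverage, over-prediction), both relative to baseline misses.

    baseline_misses: set of blocks that miss in L1 without SMS
    streamed:        set of blocks that SMS fetched ahead of use
    """
    if not baseline_misses:
        raise ValueError("baseline_misses must not be empty")
    covered = baseline_misses & streamed   # misses eliminated by streaming
    wasted = streamed - baseline_misses    # fetched but never needed
    total = len(baseline_misses)
    return len(covered) / total, len(wasted) / total


# Hypothetical block numbers.
coverage, overprediction = coverage_and_overprediction(
    baseline_misses={1, 2, 3, 4},
    streamed={2, 3, 9},
)
print(coverage, overprediction)
```

In this made-up case half of the baseline misses are eliminated, and one extra block is fetched without being used, which costs bandwidth equal to a quarter of the baseline miss traffic.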


28 Memory Access Density The root cause of the inefficiency of large cache blocks is the variability of memory access density within and across applications. Memory access density is defined as the fraction of cache misses occurring in spatial region generations that contain a particular number of misses. The results give a breakdown of memory access density for each application for a 2kB region size. For example, in OLTP-DB2, 22% of L1 misses come from spatial region generations in which 4 to 7 blocks miss during the generation. With the exception of ocean and sparse, all applications exhibit wide variations in their memory access density at both L1 and L2. Thus, no single block size can simultaneously exploit the available spatial correlation while using bandwidth and storage efficiently.
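The density definition above can be turned into a small calculation. The sketch groups generations into power-of-two buckets (1, 2-3, 4-7, and so on), which is an assumption suggested by the 4 to 7 example, and the per-generation miss counts are invented for illustration.

```python
def access_density(misses_per_generation):
    """Fraction of all misses that fall into each bucket of generation size.

    misses_per_generation: one miss count per spatial region generation.
    Returns {(low, high): fraction}, with power-of-two buckets (assumed).
    """
    counts = [m for m in misses_per_generation if m > 0]
    total = sum(counts)
    if total == 0:
        return {}
    histogram = {}
    for m in counts:
        low = 1 << (m.bit_length() - 1)   # largest power of two <= m
        bucket = (low, 2 * low - 1)
        histogram[bucket] = histogram.get(bucket, 0) + m
    return {bucket: n / total for bucket, n in histogram.items()}


# Hypothetical trace: six generations with these miss counts.
density = access_density([1, 1, 2, 4, 6, 6])
print(density)
```

Note that each generation is weighted by its number of misses, not counted once, because the metric is a fraction of misses rather than a fraction of generations.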

29 Spatial Region Size Larger regions are more likely to span unrelated data structures, and therefore some accesses may not be repetitive with respect to the trigger access. In DSS, most patterns are dense, so the benefit to merging adjacent spatial regions (i.e., eliminating the trigger misses of additional regions) is negligible. In scientific applications, at region sizes above 2kB, we observe the negative effect of spanning data structures. Using PC + Address (rather than PC + Offset) indexing can mitigate this effect by learning specific patterns for each boundary between data structures, at the cost of drastically increased PHT storage requirements. All applications except OLTP exhibit peak coverage with 2kB regions. The 2% coverage increase for OLTP when increasing region size to 4kB does not justify the doubled PHT size.

30 Comparison - GHB The PC/DC (program counter / delta correlation) variant of the Global History Buffer (GHB) was shown to be the most effective prefetching technique for desktop/engineering applications [6]. SMS outperforms GHB in OLTP and web applications.

31 Performance Comparisons are made against a baseline system without SMS. Results are given with 95% confidence intervals. The geometric mean speedup is 1.37, meaning a mean performance improvement of 37%. The maximum speedup is 4.07, meaning a performance improvement of 307% at best.
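The conversion from speedup to percentage improvement, and the geometric mean used to aggregate speedups, can be checked with a few lines. The per-workload speedups in the example are invented for illustration and are not the measured results; only 1.37 and 4.07 come from the slide.

```python
import math


def improvement_percent(speedup):
    """A speedup of s over the baseline is an improvement of (s - 1) * 100 percent."""
    return (speedup - 1.0) * 100.0


def geometric_mean(values):
    """Geometric mean, computed through logarithms for numerical stability."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be positive and non-empty")
    return math.exp(sum(math.log(v) for v in values) / len(values))


mean_gain = improvement_percent(1.37)   # figures quoted on the slide
best_gain = improvement_percent(4.07)

# Hypothetical per-workload speedups, only to show how the mean is formed.
example_mean = geometric_mean([1.0, 2.0, 4.0])
print(round(mean_gain), round(best_gain), round(example_mean, 3))
```

The geometric mean is the usual choice for averaging ratios such as speedups, because it gives the same answer regardless of which system is taken as the reference.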

32 Conclusions CPUs are faster than memory subsystems. Streaming is a well-known technique to decrease average memory access latency. Memory accesses in commercial workloads are spatially correlated over large memory regions, and the correlation is both repetitive and predictable. The proposed code-based correlation (PC + Offset indexing) is fundamentally superior to address-based correlation (PC + Address indexing). Babak et al.'s SMS produces very promising results in simulation.

33 References
1. J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann.
2. B. R. Rau. The Stack Working Set: A Characterization of Spatial Locality. Digital Systems Laboratory, Stanford Electronic Laboratories, Stanford University, July.
3. S. Somogyi, T. F. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos. Spatial Memory Streaming. In Proceedings of the 33rd International Symposium on Computer Architecture, June.
4. S. Somogyi, T. F. Wenisch, A. Ailamaki, and B. Falsafi. Spatio-Temporal Memory Streaming. In Proceedings of the 36th International Symposium on Computer Architecture, June.
5. C. F. Chen, S.-H. Yang, B. Falsafi, and A. Moshovos. Accurate and Complexity-Effective Spatial Pattern Prediction. In Proceedings of the 10th Symposium on High-Performance Computer Architecture, February.
6. K. J. Nesbit and J. E. Smith. Data Cache Prefetching Using a Global History Buffer. In Proceedings of the 10th Symposium on High-Performance Computer Architecture, February.

34 Questions / Thank you


Spatial Memory Streaming (with rotated patterns) Michael Ferdman, Stephen Somogyi, and Babak Falsafi Computer Architecture Lab at 2006 Stephen Somogyi The Memory Wall Memory latency 100 s clock cycles;