Identifying and Exploiting Memory Access Characteristics for Prefetching Linked Data Structures. Hassan Fakhri Al-Sukhni


Identifying and Exploiting Memory Access Characteristics for Prefetching Linked Data Structures

by

Hassan Fakhri Al-Sukhni

B.S., Jordan University of Science and Technology, 1989
M.S., King Fahd University of Petroleum and Minerals, 1993

A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Doctor of Philosophy, Department of Electrical and Computer Engineering, 2006

This thesis entitled:
Identifying and Exploiting Memory Access Characteristics for Prefetching Linked Data Structures
written by Hassan Fakhri Al-Sukhni
has been approved for the Department of Electrical and Computer Engineering

Daniel Alexander Connors

Andrew Pleszkun

James C. Holt

Date

The final copy of this thesis has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards of scholarly work in the above mentioned discipline.

Al-Sukhni, Hassan Fakhri (Ph.D., Computer Engineering)
Identifying and Exploiting Memory Access Characteristics for Prefetching Linked Data Structures
Thesis directed by Assistant Professor Daniel Alexander Connors

Data prefetching is one paradigm for hiding memory latency in modern computer systems. Prefetch completeness requires that a prefetching mechanism achieve high coverage of a program's would-be misses with high prefetching accuracy and timeliness. Achieving prefetch completeness has been successful for memory accesses associated with regular data structures, like arrays, because of the spatial regularity of these memory accesses. Modern applications that use Linked Data Structures (LDS) exhibit a lesser degree of spatial regularity in their memory accesses. Thus, other characteristics of the memory accesses associated with LDS have been exploited by previous prefetching mechanisms. Unfortunately, only limited success has been reported for prefetching LDS; therefore, a significant opportunity remains to improve the performance of modern computer systems by improving prefetch completeness for LDS memory accesses. This dissertation proposes a coordinated approach consisting of three components to improve prefetch completeness for LDS. These components are: 1) a rigorous approach that offers metrics to quantify the exploitable characteristics of the memory accesses, 2) a coordinated software and hardware approach that benefits from the static characteristics facilitated by the global view of the compiler, combined with the dynamic characteristics accessible via profiling and runtime monitoring, and 3) simultaneous coordination of several mechanisms that exploit different characteristics of the LDS memory accesses. The proposed coordinated approach is illustrated in this dissertation by

extending the understanding of three exploitable characteristics of LDS memory accesses: spatial regularity, temporal regularity, and topology. Metrics are offered to enable the identification of these characteristics, and prefetching mechanisms that exploit them are designed in a coordinated fashion to benefit from the offered metrics. Simulation results indicate that the proposed approach can improve prefetch completeness by improving prefetch coverage of the program's would-be misses using accurate and timely prefetches.

Dedication

To all of the fluffy kitties.

Acknowledgements

Here's where you acknowledge folks who helped.

Contents

Chapter 1  Introduction
    Contributions
    Organization

Chapter 2  Exploiting Spatial Regularity Using Extrinsic Stream Metrics
    Introduction
    Characterizing Regular Streams
        Intrinsic Stream Characteristics
        Extrinsic Stream Characteristics
        Stream Affinity and Short Streams Exploitation
        Stream Density and Prefetch Coverage
        Measuring Stream Metrics
    Run-time Exploitation of Stream Metrics
        Stream Detection and Allocation Filtering
        Stream Prioritization and Thrashing Control Using Stream Density
        Exploiting Short Streams Using Stream Affinity
        Using Intrinsic and Extrinsic Metrics Together: PAD
            Controlling Accuracy Using PAD and Stream Length
            Controlling Timeliness Using PAD and Stream Density
    Experimental Evaluations
        Methodology
        Results
    Conclusions

Chapter 3  Stream De-Aliasing: The Design of Cost-Effective Stride-Prefetching for Modern Processors
    Introduction
    Approach
        Stream Identification
        Feedback Management
    Experimental Evaluation
        Methodology
        Results
    Conclusions

Chapter 4  A Stitch In Time: Identifying and Exploiting Temporal Regularity
    Introduction
        LDS Prefetching Limitations
        Context-Based Prefetching Limitations
        Content-Directed Prefetching Limitations
    Characterizing Temporal Regularity in the Memory Accesses
    Stitch Cache Prefetching
        Identifying Recurrent Loads Using Temporal Regularity Metrics
        Stitch Pointers Installation
        Using Stitch Pointers
        Controlling Prefetch Timeliness
        Exploiting Content-Based Prefetching
        Overcoming the Limitations of Context-Based Prefetching
        Overcoming the Limitations of Content-Based Prefetching
    Experimental Evaluation
        Methodology
        Results
    Conclusions

Chapter 5  Exploiting the Topology of Linked Data Structures for Prefetching
    Introduction
    Motivation and Background
        Dynamic Data Structure Analysis
        Memory System Evaluation
    Compiler-Directed Content-Aware Prefetching
        Compiler-Directed Content-Aware Prefetching Directives
        Linked Data Structure Example
    Experimental Evaluation
        Methodology
        Results and Analysis
    Conclusions

Chapter 6  Conclusions
    Summary
    Future Work

Bibliography

Tables

3.1   Estimated structural requirements to use full program counter for prefetching
      Simulated Benchmarks
      Percentage of dynamic execution of linked data structure access types

Figures

2.1   Prefetch opportunity of spatially regular accesses in SPEC-INT
2.2   Stream Affinity Illustration
2.3   Stream Density Illustration
2.4   Stream Density Histogram
2.5   Percentage of streams with α > 0.5, w =
2.6   A Finite State Machine That Controls The PEs
2.7   IPC Speedup of PAT over Stream Buffers
2.8   Prefetch Coverage
2.9   Prefetch Accuracy
2.10  Prefetch Timeliness
2.11  L1 Cast-outs
3.1   Forming Stream ID in Load Attribute Prefetching (LAP)
3.2   Prefetch moving window illustration
3.3   Performance of PC bits
3.4   Accuracy of PC bits
3.5   Performance gain
3.6   Prefetching Accuracy
3.7   Prefetch timeliness
3.8   Prefetching into the L1 cache
4.1   Content-Directed Prefetching
4.2   A sequence with temporal regularity
4.3   Markov model conditional entropy for the reference sequence of Figure 4.2
4.4   Stitch Cache Prefetching
4.5   Controlling prefetch timeliness
4.6   Average conditional entropy and variability for the top ten executed load instructions
4.7   Percentage of stitch pointers updates to load accesses
4.8   Utilization of Stitch Pointers
4.9   Prefetch coverage
4.10  Percentage of used prefetches from each mechanism in SCP
4.11  Prefetch Overall Accuracy
4.12  Prefetch accuracy of each prefetch type in SCP
4.13  Prefetch Timeliness of CDP and SCP
4.14  Percentage improvement in IPC
5.1   Basic example of four linked data structure types: traversal, pointer, direct, and indirect
5.2   Dynamic distribution of linked data structure access types
5.3   Loads miss rates at L1 and L2 caches
5.4   Content-Aware Prefetching Memory System
5.5   Compiler-directed prefetch harbinger instruction
5.6   Source and low-level code example of linked data structure in Olden's health
5.7   Memory accesses and prefetching commands for CDCAP on Olden's health
5.8   Prefetching accuracy
5.9   Timeliness of prefetching
5.10  Normalized bus blocking with prefetching
5.11  Normalized loads miss rates for L2 cache
5.12  Normalized cycles of processor waiting for load values
5.13  Normalized execution times

Chapter 1

Introduction

Over the last decade, significant advancements in the semiconductor industry have allowed an unprecedented increase in microprocessor performance. Continued exploitation of instruction-level parallelism, longer pipelines, and faster clocks are just a few techniques that have led to increased performance. Unfortunately, innovations in memory system design and technology have been unable to achieve the same rate of improvement in memory speeds. As a result, an increasing percentage of the total execution time of modern computer systems is spent waiting on the memory system. It is estimated that nearly 40% of execution time is spent stalled waiting for both instruction and data cache misses [5]. Cache organizations have been used successfully to reduce the perceived latency of the memory subsystem in modern computer systems [46]. Caches exploit the locality-of-reference concepts known as spatial and temporal locality [23]. Unfortunately, the larger a cache is, the slower it becomes, which limits the benefits of cache-based solutions. This situation is exacerbated by the increased size of both programs and their data sets. Prefetching is another appealing technique for overcoming the memory latency problem in modern processors [41]. Numerous prefetching approaches have been proposed, including hardware-based approaches [38, 26, 11, 27] as well as software-based approaches [30, 31, 7]. Hardware-based prefetching approaches

employ specialized hardware to monitor memory accesses at run-time and to predict future memory accesses so that they can be prefetched well in advance of when they are needed by software executing on the processor. Software-based approaches use tools such as compilers and code profilers to instrument code so that data for future memory accesses can be prefetched using software mechanisms at run-time. A complete prefetching mechanism needs to meet three strict requirements: coverage, timeliness, and accuracy [25]. Coverage measures the ability to correctly predict and prefetch future memory accesses; it is the ratio of misses hidden by prefetching to the overall number of misses without prefetching. Timeliness requires that prefetches be launched with sufficient lead time to hide the memory latency; the timeliness of a prefetch is the ratio of the memory latency hidden by the prefetch to the overall memory latency. Accuracy is the ratio of prefetches that were used by the program to the total prefetches generated. Timeliness and coverage are correlated, since prefetches that are late do not contribute toward coverage. Similarly, coverage and accuracy are correlated, because inaccurate prefetches will not cover program misses. Prefetching spatially regular data structures (e.g., data structures such as arrays that exhibit simple mathematical relationships in their load addresses) has been particularly successful using stride-based techniques [4, 20, 38, 27, 24, 11, 26]. These techniques satisfy the prefetching completeness requirements by effectively exploiting spatial regularity to accurately predict future accesses well ahead of the program. While many programs make use of such spatially regular data structures, the use of Linked Data Structures (LDS) is pervasive in modern software (e.g., linked lists, trees, and hashes). Integer applications that use LDS demonstrate fragmented, short, spatially regular patterns in their memory accesses, and previous stride-prefetching mechanisms are not designed to account for such

patterns. Prefetching techniques for LDS have been proposed [18, 25, 17, 42, 35], but the results have been only marginally successful [15]. Therefore, significant opportunities remain to hide the memory latency associated with LDS. Prefetching techniques for LDS can be broadly classified into three categories: context-based techniques, content-based techniques, and precomputation techniques. Context-based techniques use correlations amongst the memory accesses to predict future memory references. Content-based techniques use the content of the accessed data to make their predictions. Precomputation techniques run a slice of the program to generate memory addresses that are used for prefetching. Context-based prefetching techniques exploit temporal regularity in LDS accesses [15] by finding correlations amongst repeatedly accessed addresses and then using these correlations to initiate prefetches (temporal regularity is discussed in Chapter 4). These correlations enable context-based techniques to launch timely prefetches ahead of the program. Unfortunately, context-based prefetching has several limitations that reduce its coverage, including limited capacity, learning time, and excessive overhead. A detailed discussion of these limitations follows in Chapter 4. Content-based prefetching uses stateless systems that prefetch the connected data objects of LDS by discovering pointers in the data contained within cache lines as they are filled. Content-based prefetching can achieve good coverage by overcoming the limitations of context-based prefetching; however, its timeliness is limited, since the pointers contained in the accessed data usually refer to data that will soon be needed by the program. Other limitations of content-based prefetching include the inability to recognize traversal orders over multiple potential paths and the inability to connect and prefetch isolated LDS. Chapter 4 discusses these limitations further.
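The coverage, timeliness, and accuracy ratios defined above can be computed from simple event counts gathered during a run. The sketch below is illustrative only; the struct and counter names are assumptions of this example, not terminology from the dissertation:

```c
#include <assert.h>

/* Hypothetical event counters gathered during a simulation run. */
typedef struct {
    unsigned long baseline_misses;   /* misses with prefetching disabled      */
    unsigned long covered_misses;    /* would-be misses hidden by prefetches  */
    unsigned long prefetches_issued; /* total prefetches launched             */
    unsigned long prefetches_used;   /* prefetches later consumed by a load   */
    double latency_hidden;           /* cycles of memory latency hidden       */
    double full_latency;             /* full memory latency in cycles         */
} prefetch_stats;

/* Coverage: ratio of misses hidden by prefetching to misses without it. */
double coverage(const prefetch_stats *s) {
    return (double)s->covered_misses / (double)s->baseline_misses;
}

/* Accuracy: ratio of used prefetches to total prefetches generated. */
double accuracy(const prefetch_stats *s) {
    return (double)s->prefetches_used / (double)s->prefetches_issued;
}

/* Timeliness: fraction of the memory latency hidden by a prefetch. */
double timeliness(const prefetch_stats *s) {
    return s->latency_hidden / s->full_latency;
}
```

Note how late prefetches lower both timeliness and coverage in this accounting: a late prefetch hides only part of the latency and does not count as a covered miss.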

Precomputation prefetching runs a slice of the program instructions to compute future references. The slice of instructions is run as a thread context in multithreaded environments [17], or in a dedicated prefetching engine [2, 55, 47, 29]. These techniques trigger their precomputation with specially marked instructions of the program. Although these techniques can predict accesses that exhibit neither spatial nor temporal regularity, the time between the trigger instructions and the instructions that require the prefetched data is usually too short to hide the memory latency, which limits their timeliness. Coordinated prefetching approaches that employ two or more of the above prefetching mechanisms have been proposed. For example, Roth [43] combined a context-based mechanism called jump pointers with a content-based mechanism called dependence-based prefetching [42]. This combined approach, called cooperative prefetching, used jump pointers to launch timely prefetches for correlated accesses and subsequently triggered the dependence-based prefetcher as a result of jump-pointer prefetches. Guided-region prefetching [50] is another coordinated approach that used software analysis to tag load instructions that access LDS fields and used these tags (as a context) to improve the accuracy of content-based prefetching. Recently, multi-chain prefetching [29] was proposed as a coordinated approach that uses compiler-analysis techniques to identify static traversal paths in a program. Precomputation schedules for the identified paths are statically generated and shipped to a hardware prefetching engine that executes them based on the memory content. While these coordinated prefetching solutions proved the viability of a coordinated approach to improving prefetch completeness, they still suffered from the limitations of the underlying mechanisms they coordinated.
The principal impediment to prefetching LDS is that the memory addresses of the connected data objects are dynamically generated based on specific programming traversal patterns, phases, and application behavior. The connectivity among the nodes of the LDS changes during the program lifetime, and the program potentially changes its traversal paths along these nodes. Therefore, in order to achieve completeness, a technique for LDS prefetching must be able to dynamically track these changes over the program lifetime, adjusting its prefetch targets for improved coverage and its prefetch initiation time for improved timeliness. This dissertation hypothesizes that a coordinated software and hardware prefetching system, which builds on the strengths of both software and hardware prefetching paradigms and utilizes well-established compile-time and profile-time technologies to guide a run-time prefetching environment, can improve prefetching completeness for LDS. Such a coordinated system can be enabled by a rigorous approach that extends the understanding of the memory access characteristics associated with LDS and offers metrics to quantify those characteristics. The compiler's global view of the memory access characteristics cannot be matched by a hardware-only system. This information, combined with the offered metrics and the dynamic run-time knowledge of the hardware component, creates a promising and flexible system for exploiting the memory access characteristics associated with LDS.

1.1 Contributions

The methodology of this dissertation consists of: (1) surveying and understanding the characteristics exploited by previous prefetching mechanisms, (2) extending the understanding of these characteristics and offering metrics to quantify them, (3) proposing prefetching mechanisms that employ the offered metrics using a coordinated approach drawing on information from compile, profile, and run-time technologies, and (4) validating that the proposed mechanisms improve prefetch completeness using cycle-accurate simulation.

Through this methodology, the memory access characteristics associated with LDS are studied, and mechanisms to identify and exploit them are proposed. Thus, this dissertation makes the following contributions:

Exploiting spatial regularity: To exploit spatially regular streams in the memory accesses, previously identified regular-stream metrics are extended to quantify extrinsic characteristics of regular streams. These new metrics are employed to improve the efficiency of stride prefetching for LDS accesses. The extrinsic metrics introduced are stream affinity and stream density. Stream affinity enables prefetching for the short streams that result from LDS manipulations; such streams were previously ignored by stride prefetching mechanisms. Stream density enables a prioritization mechanism that dynamically selects amongst available streams in favor of those that promise more miss coverage, and provides thrashing control amongst several coexisting streams. Using intrinsic and extrinsic stream metrics in combination allows a novel hardware technique for controlling Prefetch Ahead Distance (PAD), which dynamically adjusts the prefetch launch time to better enable timely prefetches while minimizing cache pollution.

De-aliasing regular streams: Several prefetching mechanisms utilize the Program Counter (PC) to de-alias co-existing streams of regular memory accesses (spatial and temporal). Transmitting the full PC across modern deep and wide pipelines for the sole purpose of prefetching is not practical (as illustrated in Chapter 3). To overcome the issues related to using the entire PC for effective stream de-aliasing and prefetching, this dissertation combines other instruction attributes with a small subset of the PC to help detect regularity in program data accesses. This de-aliasing

scheme is illustrated by implementing a cost-effective stride prefetching mechanism called Load-Attributes Prefetching (LAP).

Exploiting temporal regularity: Metrics are proposed to quantify temporal regularity in the memory accesses generated by load instructions that traverse LDS. These metrics are applied to profile information to identify recurrent load instructions (instructions that generate temporally regular accesses). Recurrent load instructions are then targeted at run-time by a coordinated context-based and content-based mechanism that exploits temporal regularity to prefetch LDS accesses. The approach is illustrated by proposing an LDS prefetching mechanism called Stitch Cache Prefetching (SCP), which makes the following contributions: (1) definition of temporal regularity metrics (based on concepts of information theory) that allow a profiler to identify loads that traverse any type of LDS path (static or dynamic) without requiring source code, (2) a context-based prefetching mechanism that avoids capacity problems by using a logical stitch space, and maintenance overhead by physically implementing the stitch space with a hierarchical organization, and (3) improved prefetch timeliness, achieved by dynamically adjusting the prefetch launch time based on the observed memory latency using a continuously tuned timeliness stitch queue.

Exploiting LDS topology: A classification of the load instructions used to traverse LDS is proposed. Based on this classification, the compiler is enabled to extract LDS topology and traversal information. This information is used by a hardware component and combined with the content of accessed cache lines to dynamically construct LDS traversal schedules. The constructed schedules are used by the hardware component to generate timely and accurate prefetches in a coordinated prefetching approach, illustrated via the design of a novel prefetching mechanism called Compiler-Directed Content-Aware Prefetching (CDCAP). The mechanism utilizes compiler-inserted prefetch instructions to guide hardware prefetching engines (HPEs) in prefetching traversed paths of LDS. The inserted prefetch instructions carry static attributes that describe the topology of the data structure. As the program runs, compiler-inserted hints invoke the HPE to employ the topology information, generating prefetches based on the current program state and the contents of the accessed cache lines. The technique addresses the shortcomings of software-only techniques by eliminating the need to transform the data structure and by avoiding excessive prefetch instructions, and it does not require prior knowledge of the traversed data structure paths. At the same time, the approach eliminates the need for large correlation-based hardware structures and reduces the number of unnecessary prefetches caused by hardware-only implementations.

1.2 Organization

This dissertation is composed of six chapters. Chapter 2 studies spatial regularity in the memory accesses and proposes metrics to quantify extrinsic regular-stream characteristics. The proposed metrics are utilized by a run-time prefetching system to improve the prefetch completeness of stride-based prefetching systems. Chapter 3 describes Load-Attribute Prefetching (LAP) as a cost-effective solution for de-aliasing regular streams in modern deep and wide pipelines. Chapter 4 studies temporal regularity in the memory accesses associated with LDS. Metrics to quantify temporal regularity are offered and used in the design of a coordinated

prefetching approach called Stitch Cache Prefetching (SCP) to improve prefetch completeness. Chapter 5 identifies patterns in LDS traversal instructions and employs these patterns to design a Compiler-Directed Content-Aware Prefetching (CDCAP) approach that generates dynamic precomputation prefetching schedules, based on compiler-communicated information, for prefetching engines located at different cache levels of the memory hierarchy. Finally, Chapter 6 offers conclusions and suggestions for future research.

Chapter 2

Exploiting Spatial Regularity Using Extrinsic Stream Metrics

2.1 Introduction

Hardware-based prefetching approaches are particularly interesting due to (1) their potential to dynamically adapt to run-time characteristics, and (2) their ability to improve software performance for unmodified program binaries (e.g., there is no need to recompile programs using special compilers or other software tools, and legacy program binaries can also benefit from these techniques). Stride-based prefetching is an effective hardware technique that exploits regular streams of memory accesses [26]. A regular stream is an ordered sequence of addresses that exhibits a non-zero constant address difference (a stride) between its consecutive addresses. Existing stride-based techniques achieve efficiency primarily through accuracy and timeliness. Figure 2.1 depicts the prefetch opportunity of spatially regular accesses in SPEC-INT benchmarks. This figure indicates that there exist significant opportunities to hide the memory latency in these benchmarks by exploiting spatial regularity. Unfortunately, previous stride-prefetching mechanisms were designed to target spatial regularity in scientific numerical applications. The spatial regularity in such applications demonstrates different characteristics than that of integer applications that use LDS (as will be illustrated in the following sections of this chapter). Thus, improving the efficiency of stride-based prefetching for applications that use LDS requires a better understanding of their spatial regularity characteristics.

[Figure 2.1: Prefetch opportunity of spatially regular accesses in SPEC-INT. The chart reports the fraction of spatially regular accesses for gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, and their average.]

This understanding enables the improvement of prefetch completeness through improved coverage, without sacrificing accuracy or timeliness (as demonstrated by the simulation results in Section 2.4). Mohan defines regularity metrics for measuring the characteristics of streams [34]. That work illustrated that well-defined metrics could be effectively employed to guide software optimizations associated with regular streams. The fidelity enabled by these metrics allowed a compiler to select the most effective code optimizations amongst techniques such as tiling, software prefetching, and loop transformations. While Mohan's work was not focused on hardware prefetching, this dissertation hypothesizes that stride-based hardware prefetching completeness can be improved by similar exploitation of regular-stream metrics. Furthermore, while Mohan focused on the important characteristics of individual streams in isolation, this research recognizes that at run-time there are interactions between streams and other entities (for example, other streams or the memory sub-system). These interactions can be accounted for and used to further improve hardware prefetch efficiency. Thus, the major contributions of this chapter are the extension of Mohan's metrics to enable measurement of certain additional characteristics of regular streams, and the application of these extended metrics to the optimization of stride-based hardware prefetching. The rest of this chapter is organized as follows. The next section identifies characteristics of regular streams and defines metrics to quantify these characteristics. Section 2.3 illustrates how these metrics can be used dynamically to improve the efficiency of a stride-based hardware prefetching system. Section 2.4 illustrates (using simulations) that the proposed hardware outperforms one that does not account for the metrics offered in this research. Section 2.5 summarizes and draws conclusions.

2.2 Characterizing Regular Streams

Metrics of regular streams are classified into two classes: intrinsic stream metrics and extrinsic stream metrics. The intrinsic class describes a stream's inherent characteristics, such as its stride or length (e.g., the number of accesses that belong to the stream). The extrinsic class describes a stream's characteristics relative to the program, to other streams, and to the memory system (which includes the hardware prefetchers). While intrinsic stream metrics have been identified [34] and used to optimize stride prefetching mechanisms [26, 27, 20, 24], the same cannot be said of extrinsic stream metrics. In this section, intrinsic stream metrics are briefly discussed, and extrinsic stream metrics are introduced. These metrics are then used to make three main contributions that improve the efficiency of hardware stride prefetching:

- Exploiting short streams to improve coverage,

- Dynamic selection amongst streams to prevent thrashing and improve coverage, and
- Dynamically changing the prefetch-ahead distance to improve timeliness.

2.2.1 Intrinsic Stream Characteristics

Existing hardware stride prefetching mechanisms recognize and exploit the intrinsic characteristics of regular streams. These characteristics can be measured by two metrics: stride and length. The stride measures how far apart consecutive accesses of the stream are in terms of memory addresses, while the length measures the number of accesses in the stream. Streams having strides within one cache line, called unit-stride streams, have been exploited by stream-buffer prefetching mechanisms [26, 20]. Unit-stride streams are easy to detect and exploit; however, in practice there remain a significant number of streams that exhibit strides larger than a single cache line [34]. Detection of streams with arbitrary strides can provide increased coverage. To exploit streams with strides longer than one cache line, Farkas [27] uses per-PC stream detection, a technique that detects streams with arbitrary strides by comparing consecutive accesses of specific load instructions. However, as with all stream-buffer-based techniques, there is no mechanism to adjust the prefetch launch time to improve prefetch timeliness. Stream length has been used to optimize stride prefetching techniques through the exploitation of long streams [38, 26, 27, 4]. Longer streams are preferred because they offer better prefetch efficiency over time. However, the efficiency of prefetching for long streams can be negatively impacted by the presence of short streams unless there is a way to distinguish between them. Therefore, several mechanisms have been proposed to filter out streams of short length, such as allocation filters

[27, 20]. Unfortunately, while these techniques can filter out irregular accesses, they fail to exploit a short stream because most of the stream is consumed to establish its stride, leaving very little of the stream remaining to be prefetched. This chapter shows how extrinsic stream metrics can enable efficient prefetching of some classes of short regular streams without negatively impacting the efficiency of prefetching for long streams.

2.2.2 Extrinsic Stream Characteristics

Extrinsic metrics measure a stream's characteristics relative to other streams and to the memory system (including misses not associated with any stream, as well as interactions with prefetchers). Stride prefetching mechanisms can utilize these metrics to evaluate tradeoffs whose implications cannot be measured using only intrinsic metrics, thus allowing for improved efficiency. The following sections discuss this in detail. To facilitate the extrinsic-metrics discussion, the notion of regular streams is augmented by associating a timestamp with each memory access to provide a temporal ordering of stream accesses. The timestamp is a monotonically increasing integer that starts at 0 and is incremented with each memory access. Using the temporal ordering provided by the stream access timestamps, the following stream attributes can be defined:

- Stream birth, b, is the timestamp of the first address of the stream,
- Stream death, d, is the timestamp of the last address of the stream, and
- Stream age, a, is the difference between a stream's death and its birth (d − b + 1).
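Under the timestamp convention just defined, a stream's birth, death, length, and age can be extracted by scanning a trace in which each access's timestamp is its index. The helper below is a hypothetical sketch for illustration (the function and type names are inventions of this example, not the dissertation's hardware):

```c
#include <assert.h>

/* Stream attributes from the text: birth b, death d, and length l. */
typedef struct { long b, d, l; } stream_attr;

/* Scan a trace of addresses (timestamp = array index) and collect the
 * attributes of the stream formed by addresses base, base+stride, ... */
stream_attr collect(const unsigned long *trace, int n,
                    unsigned long base, unsigned long stride) {
    stream_attr s = { -1, -1, 0 };
    unsigned long next = base;
    for (int t = 0; t < n; t++) {
        if (trace[t] == next) {   /* next expected access of the stream */
            if (s.b < 0) s.b = t; /* birth: timestamp of first access   */
            s.d = t;              /* death: timestamp of latest access  */
            s.l++;                /* length: number of accesses         */
            next += stride;
        }
    }
    return s;
}

/* Age a = d - b + 1, as defined in the text. */
long age(stream_attr s) { return s.d - s.b + 1; }
```

For a stream of three accesses interleaved with unrelated misses, the age exceeds the length, which is exactly the gap the density metric of Section 2.2 quantifies.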

With these attributes, the following metrics are next defined and discussed: stream affinity and stream density.

2.2.3 Stream Affinity and Short Streams Exploitation

Allocating hardware resources to detect and prefetch regular streams is usually done based on demand misses [26, 4]. Allocation filters are used to filter out misses that do not belong to regular streams, preventing them from disturbing ongoing prefetching of regular streams [38]. This filtering consumes part of the stream to establish its intrinsic characteristics before prefetching can commence. Such consumption can prevent the prefetching of short streams. Stream affinity is introduced as an extrinsic metric that measures how similar a stream is to the most recent non-interleaved stream of equivalent stride. Stride prefetching can benefit from recognizing affine streams (streams with high affinity) by spending less time identifying intrinsic stream characteristics, and instead using that time to continue prefetching. This is especially important for short streams, where individual streams cannot be exploited due to the time wasted in determining their intrinsic characteristics.

    for (i=0; i < 50; i++) {
        t = A[i];
        for (j=0; j < 3; j++) {
            sum += B[t+j*16];
        }
    }

[Figure 2.2: Stream Affinity Illustration. (a) The code above; (b) a timeline of the memory accesses of stream x (for j=0) and stream y (for j=1), marking the death d_x, the birth b_y, and the window w.]
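The "consumption" that prevents short-stream prefetching can be seen in a minimal per-PC stride detector with two-miss confirmation. This is a generic sketch of that widely used scheme under assumed table layout and policy, not the exact hardware of the cited designs:

```c
#include <assert.h>
#include <stdint.h>

/* One entry of a hypothetical per-PC stride table. */
typedef struct {
    uint64_t last_addr;  /* last address seen for this PC */
    int64_t  stride;     /* candidate stride              */
    int      confirmed;  /* stride seen twice in a row?   */
} stride_entry;

/* Observe one access by a given PC; return a prefetch address, or 0 if
 * the stride is not yet established.  The first accesses of every stream
 * are consumed just to learn and confirm the stride -- which is why a
 * three-access stream like those of Figure 2.2 yields at most one
 * prefetch without affinity information. */
uint64_t observe(stride_entry *e, uint64_t addr) {
    int64_t delta = (int64_t)(addr - e->last_addr);
    e->confirmed = (delta != 0 && delta == e->stride);
    e->stride = delta;
    e->last_addr = addr;
    return e->confirmed ? addr + (uint64_t)delta : 0;
}
```

An affinity-aware prefetcher would instead carry the confirmed stride over from the previous affine stream, so even the first access of the next short stream can trigger a prefetch.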

starting points for the inner loop. The short streams produced by consecutive executions of the inner loop will be affine streams. A prefetching mechanism that detects stream affinity can use the first of a set of high affinity streams to identify the stride, and then reuse this characterization to efficiently prefetch subsequent members of the set. The author is not aware of any technique that recognizes or exploits affine streams.

Given two streams, x and y, that are generated by the same load instruction, let the stream births and stream deaths of these streams be denoted as b_x, b_y, d_x, and d_y, respectively. Further, let w be a timestamp window during which stream affinity can be exploited efficiently by prefetching hardware. The affinity, α_y, for stream y is defined as:

α_y = 1 - (b_y - d_x)/w   if (b_y - d_x) ≤ w and stride(x) = stride(y)
    = 0                   otherwise                                    (2.1)

The fractional portion of the metric measures how far apart the two streams are; the metric approaches zero as the distance between the streams grows larger. Therefore, the highest affinity of 1 occurs when stream y is a continuation of stream x (making the fractional term 0).

Stream Density and Prefetch Coverage

Regular streams compete for limited hardware prefetching resources, which are usually fewer than the number of regular streams [34]. Sherwood [45] used priority counters to select one of several potential streams based on how predictable each one is (e.g., selection amongst streams based on prefetch accuracy). This approach prevents irregular misses from deterring the prefetching of regular streams. However, Sherwood's approach does not distinguish between regular streams to select one that potentially has better coverage than the others, nor does it prevent predictable streams from thrashing each other.

Stream density is introduced as an extrinsic metric that indicates the expected coverage of a stream (how many program misses are potentially hidden in a given period of time by prefetching the stream). Given the density metric, a prefetching mechanism can select a denser stream over a less dense one. This maximizes prefetch coverage and therefore efficiency. With the previously introduced stream attributes, stream density, δ, is defined as the ratio of a stream's length to its age:

δ = l/a = l/(d - b + 1)                                                (2.2)

Intuitively, a stream with low density (a sparse stream) is one whose accesses are separated by many interleaved memory accesses that do not belong to the stream. Conversely, accesses of a high-density stream are separated by few memory accesses not belonging to the stream. Dense streams appear, for example, in memory copy operations and in search algorithms that use tight loops containing load and compare operations.

To illustrate dense and sparse streams, consider the two nested loops of Figure 2.3(a). The outer loop generates a sparse stream (stream z) of the accesses of array A, whereas the inner loop generates streams x and y, which are dense streams. Figure 2.3(b) illustrates the calculation of the density metric for stream z. In this example stream z has a length of 2 (accesses A[0] and A[100] in the code segment) and an age of 5 (its death is 4 and its birth is 0). Conversely, the densities of streams x and y are equal to 1. Dense streams present more prefetching opportunities than sparse streams.

A[0] = 0; A[100] = 100;
for (j=0; j < 2; j++) {
    t = A[j*100];
    for (i=0; i < 3; i++) {
        sum += B[t + i*16];
    }
}

[Figure 2.3(b): timeline of memory accesses, marking stream z's birth b_z and death d_z among the interleaved accesses of streams x and y, with δ_z = l_z/a_z = 2/(d_z - b_z + 1) = 2/5 = 0.4.]

Figure 2.3: Stream Density Illustration.

A prefetching mechanism that does not account for stream density can be diverted from prefetching dense streams in the presence of sparse streams. This problem is known as the stream thrashing problem [38]. Although allocation filtering [38, 27] prevents non-stream misses or short streams from interrupting prefetching for long streams, previous hardware stride prefetching mechanisms do not resolve the stream thrashing problem. Section 2.3 introduces a mechanism that uses the density metric to select a stream that promises more coverage than other streams.

Measuring Stream Metrics

This chapter uses the SPEC2K suite of benchmarks for illustration. Each benchmark is represented statistically by several traces, each consisting of several million instructions. Each trace has a weight representing its contribution to the overall benchmark; this is similar in concept to the benchmark representation used in SimPoint [39]. Trace representation was verified against actual hardware (a previous processor) using IPC as a verification metric. The simulated IPC, obtained by simulating the traces on the matching cycle-accurate simulator and weighting the IPCs of each sub-trace appropriately, was compared against the IPC from execution on hardware. All benchmarks showed good correlation between

[Figure 2.4: stacked-bar histograms of the stream density distribution for the SPEC2K INT benchmarks (gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf) and FP benchmarks (wupwise, swim, mgrid, applu, mesa, galgel, art, equake, facerec, ammp, lucas, fma3d, sixtrack, apsi), with density ranges >= 0.2, < 0.2, < 0.1, < 0.01, and smaller.]

Figure 2.4: Stream Density Histogram.

actual and simulated IPC. In collecting stream metrics, the streams were not weighted by the trace weights.

Stream Density

Figure 2.4 depicts a histogram of stream densities. Each bar represents the percentage of streams with densities within the ranges identified in the legend. One observation is that the SPEC2K-INT benchmarks generally exhibit higher-density streams than the FP benchmarks. This is because interleaved streams appear more frequently in the FP benchmarks as a result of aggressive compiler loop unrolling; in contrast, the INT benchmarks offer little or no loop unrolling opportunity to the compiler.

Figure 2.4 illustrates that the majority of the regular streams detected in the benchmarks are sparse. This does not mean that these streams represent the majority of the memory accesses, since sparse streams are often significantly shorter than dense streams. However, the high percentage of sparse streams means that they will frequently interrupt the denser streams. Improved stride prefetching mechanisms should account for this observation and should be able to react dynamically as conditions change. Section 2.3 presents one solution to this problem based on stream density.

Stream Affinity

Figure 2.5 shows the percentage of streams in the benchmarks that have an affinity greater than 0.5 (α > 0.5) for a w of 200. The value of 200 was chosen for w experimentally, based on a study of how well the available streams could be exploited by the intended hardware system. The figure suggests that affine streams constitute a significant portion of the total number of streams in several workloads. Overall, streams with high affinity are less frequent in the SPEC2K-FP benchmarks than in the SPEC2K-INT ones. This is due to the nature of the two suites: the INT benchmarks tend to have more loop nesting constructs and more dynamic data structures. Dynamic data structures are usually allocated in chunks that are consecutive in the memory space. As these data structures are rearranged by the program through deletion and insertion of nodes, their subsequent traversal results in fragments of short affine streams. In contrast, the FP benchmarks generally do not use dynamic data structures; they tend to scan large arrays, causing long streams that are well separated by non-stream accesses. Therefore, these benchmarks demonstrate low affinity in general. Investigating the FP benchmarks that do exhibit high affinity (for example, galgel and art) reveals that these benchmarks have outer loops that set up

[Figure 2.5: bar charts of the percentage of affine streams per benchmark, for the SPEC2K INT suite (0-12%) and the FP suite (0-14%).]

Figure 2.5: Percentage of streams with α > 0.5, w = 200.

an address for inner loops, which is consistent with the affine streams present in the INT benchmarks. The only difference between the two code segments is that galgel used aggressive loop unrolling of the inner loop, while art did not.

2.3 Run-time Exploitation of Stream Metrics

The effective use of both intrinsic and extrinsic regular stream metrics is demonstrated with a design for a hardware stride prefetching system. The design includes a number of prefetching engines (PEs). Each PE is controlled by a finite state machine similar to the one shown in Figure 2.6. These PEs are allocated to load instructions that miss in the L1 cache in order to prefetch their regular streams.

Therefore, this design employs per-PC stride detection similar to that presented by Farkas [27]. The system is designed to prefetch load accesses and not store accesses; stores are handled by other micro-architecture solutions such as store buffers [20]. The states of this state machine are divided into two sets, inactive states and active states. The active states dynamically control prefetching accuracy and timeliness, while the inactive states are employed to improve coverage by detecting streams, managing priorities amongst streams, and identifying affine streams. How stream metrics govern the state transitions is discussed in detail in the following sections.

State   Description                    Metric Used
SD      Stream Detection               Density
TC      Thrashing Control              Density
AFD     Affinity Detection             Affinity/Density
LC1     Low Confidence with 1 PAD      PAD/Affinity/Density
HCx     High Confidence with x PAD     PAD/Affinity/Density

[Figure 2.6: state diagram with inactive states (OFF, ALLOCATE, SD, TC, AFD) and active states (LC1, HC1, HC2, HC4, HC6); transitions are driven by the events in the legend, with separate padding, affinity, and density paths.]

LEGEND
LM : Allocated Load Miss
PU : Prefetch Used
PV : Prefetch Evicted
MU : Load Miss Unpredicted
MP : Load Miss Predicted
PEM: PE Miss (Load Miss Not Allocated An Engine)

Figure 2.6: A Finite State Machine That Controls The PEs.

Stream Detection and Allocation Filtering

When a load instruction misses in the L1 cache, a free PE is allocated to the instruction, and the state machine transitions to the Stream Detection (SD) state. During the SD state the PE makes an initial stride guess equal to the size of a cache line. Using the subsequent accesses of the load instruction, a stream is identified by comparing the guessed stride with the measured stride of the load accesses. If a stream is identified, the state machine transitions to state LC1 (low confidence with the ability to launch 1 prefetch). Otherwise, the PE state machine stays in SD and re-computes the stride based on the addresses of the load instruction's consecutive accesses. This process repeats until either a stream is detected or the PE is reallocated to another missing load instruction (via repeated transitions through states TC and finally OFF due to other load instruction misses). This stream detection is analogous to the allocation filters proposed by Palacharla [38], in the sense that misses not comprising a stream do not affect prefetching because there will be no resulting transition to the LC1 state.

Stream Prioritization and Thrashing Control Using Stream Density

When several streams are interleaved, they generate misses that are also interleaved. The proposed prefetching system is designed with a number of PEs equal to N. If the number of interleaved streams is less than or equal to N, each one will be allocated a PE. However, when the number of interleaved streams exceeds N, a prioritization mechanism is needed. The prioritization is based on measuring the density of each of the interleaved streams and favoring the denser of them. This task is carried out by the states SD, Thrashing Control

(TC), and LC1. In these states the interleaved streams compete for the PE, such that allocated streams climb to higher confidence states while unallocated streams can decrease the confidence of allocated streams. The denser stream eventually wins the PE. In previously proposed mechanisms, newly identified streams can replace existing streams without consideration of their respective densities. This approach results in stream thrashing that reduces prefetching efficiency. Unfortunately, this situation is common in scientific code that has been subject to aggressive loop unrolling. The above prioritization mechanism implements thrashing control using stream density, such that a denser stream retains the PE. As with the prioritization itself, this is done by dynamically allowing streams to compete for the PEs. The details of this competition are explained next.

Recall that the states of the state machine are divided into active states (LC1 and HC1-HC6) and inactive states (SD, TC, and AFD). Once a stream has been detected (while the PE is in the inactive states), the PE's state machine transitions to the active states and the PE launches a number of prefetches based on the measured stride. The number of prefetches launched in each state is shown as a number in the state name (for example, in state LC1 one prefetch is launched). Prefetch requests are queued to an architected miss buffer, which manages requesting and collecting data from lower memory levels. Once a request has been fully serviced, its data is evicted to the L1 data cache (referred to as a PV event). If the prefetched data is needed by any program instruction while still in the miss buffer, a prefetch use (PU) event is declared. In this case, the state machine transitions to the next higher confidence state (state High Confidence with the ability to launch 1 prefetch, HC1 in this example).
However, if the prefetched cache line is not needed until after it has been evicted, then the state machine transitions to the lower confidence state (e.g., state Affinity Detect, AFD.)

Streams that are not allocated any PE generate PE miss (PEM) events that reduce the confidence of allocated PEs, while allocated streams increase their PEs' confidence via prefetch use (PU) events. This competition resolves in one of two ways: (1) if the unallocated stream is denser than the allocated one, its PEM events will outnumber the PU events of the allocated stream, and the denser stream will eventually take over the PE, or (2) if the allocated stream is denser, its PU events will outnumber the PEM events of the unallocated stream, leaving the PE at high confidence. Either case results in the denser stream controlling the PE, thereby mitigating the thrashing problem and improving prefetch coverage.

Exploiting Short Streams Using Stream Affinity

In all previously reported hardware stride-based mechanisms, if a missing address has not already been predicted by the prefetching mechanism, then it becomes a candidate for starting a new stream. This treatment of misses does not account for stream affinity. In the approach proposed in this chapter, stream affinity is exploited by lowering the confidence of the state machine controlling the PE allocated to the missing load instruction (such misses are referred to as LM events). This lowering of confidence repeats until the state machine reaches the Affinity Detect (AFD) state. In AFD, if the conditions of Equation 2.1 indicate an affine stream, the PE changes its prefetching region to match the most recently missed address. This change of the prefetching region allows the PE to begin prefetching affine streams without going through the detection process again. When affinity is detected, the PE state transitions directly to the HC1 state, bypassing LC1. Allowing this transition results in faster climbing of the confidence states and enables exploiting short affine streams.
Note also that not changing the prefetch region until the state machine confidence drops to one of the inactive states al-


More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu

More information

Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review

Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Bijay K.Paikaray Debabala Swain Dept. of CSE, CUTM Dept. of CSE, CUTM Bhubaneswer, India Bhubaneswer, India

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Decoupled Zero-Compressed Memory

Decoupled Zero-Compressed Memory Decoupled Zero-Compressed Julien Dusser julien.dusser@inria.fr André Seznec andre.seznec@inria.fr Centre de recherche INRIA Rennes Bretagne Atlantique Campus de Beaulieu, 3542 Rennes Cedex, France Abstract

More information

Future Execution: A Hardware Prefetching Technique for Chip Multiprocessors

Future Execution: A Hardware Prefetching Technique for Chip Multiprocessors Future Execution: A Hardware Prefetching Technique for Chip Multiprocessors Ilya Ganusov and Martin Burtscher Computer Systems Laboratory Cornell University {ilya, burtscher}@csl.cornell.edu Abstract This

More information

Memory Systems and Compiler Support for MPSoC Architectures. Mahmut Kandemir and Nikil Dutt. Cap. 9

Memory Systems and Compiler Support for MPSoC Architectures. Mahmut Kandemir and Nikil Dutt. Cap. 9 Memory Systems and Compiler Support for MPSoC Architectures Mahmut Kandemir and Nikil Dutt Cap. 9 Fernando Moraes 28/maio/2013 1 MPSoC - Vantagens MPSoC architecture has several advantages over a conventional

More information

Quantifying Load Stream Behavior

Quantifying Load Stream Behavior In Proceedings of the 8th International Symposium on High Performance Computer Architecture (HPCA), February. Quantifying Load Stream Behavior Suleyman Sair Timothy Sherwood Brad Calder Department of Computer

More information

Exploiting Streams in Instruction and Data Address Trace Compression

Exploiting Streams in Instruction and Data Address Trace Compression Exploiting Streams in Instruction and Data Address Trace Compression Aleksandar Milenkovi, Milena Milenkovi Electrical and Computer Engineering Dept., The University of Alabama in Huntsville Email: {milenka

More information

The Smart Cache: An Energy-Efficient Cache Architecture Through Dynamic Adaptation

The Smart Cache: An Energy-Efficient Cache Architecture Through Dynamic Adaptation Noname manuscript No. (will be inserted by the editor) The Smart Cache: An Energy-Efficient Cache Architecture Through Dynamic Adaptation Karthik T. Sundararajan Timothy M. Jones Nigel P. Topham Received:

More information

A Cross-Architectural Interface for Code Cache Manipulation. Kim Hazelwood and Robert Cohn

A Cross-Architectural Interface for Code Cache Manipulation. Kim Hazelwood and Robert Cohn A Cross-Architectural Interface for Code Cache Manipulation Kim Hazelwood and Robert Cohn Software-Managed Code Caches Software-managed code caches store transformed code at run time to amortize overhead

More information

Implicitly-Multithreaded Processors

Implicitly-Multithreaded Processors Implicitly-Multithreaded Processors School of Electrical & Computer Engineering Purdue University {parki,vijay}@ecn.purdue.edu http://min.ecn.purdue.edu/~parki http://www.ece.purdue.edu/~vijay Abstract

More information

Demand fetching is commonly employed to bring the data

Demand fetching is commonly employed to bring the data Proceedings of 2nd Annual Conference on Theoretical and Applied Computer Science, November 2010, Stillwater, OK 14 Markov Prediction Scheme for Cache Prefetching Pranav Pathak, Mehedi Sarwar, Sohum Sohoni

More information

Computer System. Performance

Computer System. Performance Computer System Performance Virendra Singh Associate Professor Computer Architecture and Dependable Systems Lab Department of Electrical Engineering Indian Institute of Technology Bombay http://www.ee.iitb.ac.in/~viren/

More information

DATA CACHE PREFETCHING USING

DATA CACHE PREFETCHING USING DATA AHE PREFETHING USING A GLOBAL HISTORY BUFFER BY ORGANIZING DATA AHE PREFETH INFORMATION IN A NEW WAY, A GHB SUPPORTS EXISTING PREFETH ALGORITHMS MORE EFFETIVELY THAN ONVENTIONAL PREFETH TABLES. IT

More information

Optimizing SMT Processors for High Single-Thread Performance

Optimizing SMT Processors for High Single-Thread Performance University of Maryland Inistitute for Advanced Computer Studies Technical Report UMIACS-TR-2003-07 Optimizing SMT Processors for High Single-Thread Performance Gautham K. Dorai, Donald Yeung, and Seungryul

More information

SEVERAL studies have proposed methods to exploit more

SEVERAL studies have proposed methods to exploit more IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 16, NO. 4, APRIL 2005 1 The Impact of Incorrectly Speculated Memory Operations in a Multithreaded Architecture Resit Sendag, Member, IEEE, Ying

More information

A Hybrid Adaptive Feedback Based Prefetcher

A Hybrid Adaptive Feedback Based Prefetcher A Feedback Based Prefetcher Santhosh Verma, David M. Koppelman and Lu Peng Department of Electrical and Computer Engineering Louisiana State University, Baton Rouge, LA 78 sverma@lsu.edu, koppel@ece.lsu.edu,

More information

Picking Statistically Valid and Early Simulation Points

Picking Statistically Valid and Early Simulation Points In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), September 23. Picking Statistically Valid and Early Simulation Points Erez Perelman Greg Hamerly

More information

A Self-Repairing Prefetcher in an Event-Driven Dynamic Optimization Framework

A Self-Repairing Prefetcher in an Event-Driven Dynamic Optimization Framework In Proceedings of the International Symposium on Code Generation and Optimization (CGO 2006). A Self-Repairing Prefetcher in an Event-Driven Dynamic Optimization Framework Weifeng Zhang Brad Calder Dean

More information

Speculative Multithreaded Processors

Speculative Multithreaded Processors Guri Sohi and Amir Roth Computer Sciences Department University of Wisconsin-Madison utline Trends and their implications Workloads for future processors Program parallelization and speculative threads

More information

Hardware Loop Buffering

Hardware Loop Buffering Hardware Loop Buffering Scott DiPasquale, Khaled Elmeleegy, C.J. Ganier, Erik Swanson Abstract Several classes of applications can be characterized by repetition of certain behaviors or the regular distribution

More information

Accelerating and Adapting Precomputation Threads for Efficient Prefetching

Accelerating and Adapting Precomputation Threads for Efficient Prefetching In Proceedings of the 13th International Symposium on High Performance Computer Architecture (HPCA 2007). Accelerating and Adapting Precomputation Threads for Efficient Prefetching Weifeng Zhang Dean M.

More information

Mapping of Applications to Heterogeneous Multi-cores Based on Micro-architecture Independent Characteristics

Mapping of Applications to Heterogeneous Multi-cores Based on Micro-architecture Independent Characteristics Mapping of Applications to Heterogeneous Multi-cores Based on Micro-architecture Independent Characteristics Jian Chen, Nidhi Nayyar and Lizy K. John Department of Electrical and Computer Engineering The

More information

Memory Consistency. Challenges. Program order Memory access order

Memory Consistency. Challenges. Program order Memory access order Memory Consistency Memory Consistency Memory Consistency Reads and writes of the shared memory face consistency problem Need to achieve controlled consistency in memory events Shared memory behavior determined

More information

Exploiting Core Working Sets to Filter the L1 Cache with Random Sampling

Exploiting Core Working Sets to Filter the L1 Cache with Random Sampling Exploiting Core Working Sets to Filter the L Cache with Random Sampling Yoav Etsion and Dror G. Feitelson Abstract Locality is often characterized by working sets, defined by Denning as the set of distinct

More information

José F. Martínez 1, Jose Renau 2 Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2

José F. Martínez 1, Jose Renau 2 Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2 CHERRY: CHECKPOINTED EARLY RESOURCE RECYCLING José F. Martínez 1, Jose Renau 2 Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2 1 2 3 MOTIVATION Problem: Limited processor resources Goal: More

More information

Workloads, Scalability and QoS Considerations in CMP Platforms

Workloads, Scalability and QoS Considerations in CMP Platforms Workloads, Scalability and QoS Considerations in CMP Platforms Presenter Don Newell Sr. Principal Engineer Intel Corporation 2007 Intel Corporation Agenda Trends and research context Evolving Workload

More information

Guided Region Prefetching: A Cooperative Hardware/Software Approach

Guided Region Prefetching: A Cooperative Hardware/Software Approach Guided Region Prefetching: A Cooperative Hardware/Software Approach Zhenlin Wang Ý Doug Burger Ü Kathryn S. McKinley Ü Steven K. Reinhardt Þ Charles C. Weems Ý Ý Dept. of Computer Science Ü Dept. of Computer

More information

Computer Architecture Spring 2016

Computer Architecture Spring 2016 omputer Architecture Spring 2016 Lecture 09: Prefetching Shuai Wang Department of omputer Science and Technology Nanjing University Prefetching(1/3) Fetch block ahead of demand Target compulsory, capacity,

More information

Efficient Architecture Support for Thread-Level Speculation

Efficient Architecture Support for Thread-Level Speculation Efficient Architecture Support for Thread-Level Speculation A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY Venkatesan Packirisamy IN PARTIAL FULFILLMENT OF THE

More information

Chapter-5 Memory Hierarchy Design

Chapter-5 Memory Hierarchy Design Chapter-5 Memory Hierarchy Design Unlimited amount of fast memory - Economical solution is memory hierarchy - Locality - Cost performance Principle of locality - most programs do not access all code or

More information

Administrivia. CMSC 411 Computer Systems Architecture Lecture 14 Instruction Level Parallelism (cont.) Control Dependencies

Administrivia. CMSC 411 Computer Systems Architecture Lecture 14 Instruction Level Parallelism (cont.) Control Dependencies Administrivia CMSC 411 Computer Systems Architecture Lecture 14 Instruction Level Parallelism (cont.) HW #3, on memory hierarchy, due Tuesday Continue reading Chapter 3 of H&P Alan Sussman als@cs.umd.edu

More information

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Huiyang Zhou School of Computer Science University of Central Florida New Challenges in Billion-Transistor Processor Era

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Cache Insertion Policies to Reduce Bus Traffic and Cache Conflicts

Cache Insertion Policies to Reduce Bus Traffic and Cache Conflicts Cache Insertion Policies to Reduce Bus Traffic and Cache Conflicts Yoav Etsion Dror G. Feitelson School of Computer Science and Engineering The Hebrew University of Jerusalem 14 Jerusalem, Israel Abstract

More information

Implicitly-Multithreaded Processors

Implicitly-Multithreaded Processors Appears in the Proceedings of the 30 th Annual International Symposium on Computer Architecture (ISCA) Implicitly-Multithreaded Processors School of Electrical & Computer Engineering Purdue University

More information

Improving Achievable ILP through Value Prediction and Program Profiling

Improving Achievable ILP through Value Prediction and Program Profiling Improving Achievable ILP through Value Prediction and Program Profiling Freddy Gabbay Department of Electrical Engineering Technion - Israel Institute of Technology, Haifa 32000, Israel. fredg@psl.technion.ac.il

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Exploiting Incorrectly Speculated Memory Operations in a Concurrent Multithreaded Architecture (Plus a Few Thoughts on Simulation Methodology)

Exploiting Incorrectly Speculated Memory Operations in a Concurrent Multithreaded Architecture (Plus a Few Thoughts on Simulation Methodology) Exploiting Incorrectly Speculated Memory Operations in a Concurrent Multithreaded Architecture (Plus a Few Thoughts on Simulation Methodology) David J Lilja lilja@eceumnedu Acknowledgements! Graduate students

More information

SPM Management Using Markov Chain Based Data Access Prediction*

SPM Management Using Markov Chain Based Data Access Prediction* SPM Management Using Markov Chain Based Data Access Prediction* Taylan Yemliha Syracuse University, Syracuse, NY Shekhar Srikantaiah, Mahmut Kandemir Pennsylvania State University, University Park, PA

More information

15-740/ Computer Architecture Lecture 16: Prefetching Wrap-up. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 16: Prefetching Wrap-up. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 16: Prefetching Wrap-up Prof. Onur Mutlu Carnegie Mellon University Announcements Exam solutions online Pick up your exams Feedback forms 2 Feedback Survey Results

More information

Exploring Wakeup-Free Instruction Scheduling

Exploring Wakeup-Free Instruction Scheduling Exploring Wakeup-Free Instruction Scheduling Jie S. Hu, N. Vijaykrishnan, and Mary Jane Irwin Microsystems Design Lab The Pennsylvania State University Outline Motivation Case study: Cyclone Towards high-performance

More information

A Study of the Performance Potential for Dynamic Instruction Hints Selection

A Study of the Performance Potential for Dynamic Instruction Hints Selection A Study of the Performance Potential for Dynamic Instruction Hints Selection Rao Fu 1,JiweiLu 2, Antonia Zhai 1, and Wei-Chung Hsu 1 1 Department of Computer Science and Engineering University of Minnesota

More information

Reproducible Simulation of Multi-Threaded Workloads for Architecture Design Exploration

Reproducible Simulation of Multi-Threaded Workloads for Architecture Design Exploration Reproducible Simulation of Multi-Threaded Workloads for Architecture Design Exploration Cristiano Pereira, Harish Patil, Brad Calder $ Computer Science and Engineering, University of California, San Diego

More information

Understanding Cache Interference

Understanding Cache Interference Understanding Cache Interference by M.W.A. Settle B.A., University of Colorado, 1996 M.S., University of Colorado, 2001 A thesis submitted to the Faculty of the Graduate School of the University of Colorado

More information

Skewed-Associative Caches: CS752 Final Project

Skewed-Associative Caches: CS752 Final Project Skewed-Associative Caches: CS752 Final Project Professor Sohi Corey Halpin Scot Kronenfeld Johannes Zeppenfeld 13 December 2002 Abstract As the gap between microprocessor performance and memory performance

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

A Feasibility Study for Methods of Effective Memoization Optimization

A Feasibility Study for Methods of Effective Memoization Optimization A Feasibility Study for Methods of Effective Memoization Optimization Daniel Mock October 2018 Abstract Traditionally, memoization is a compiler optimization that is applied to regions of code with few

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Integrated CPU and Cache Power Management in Multiple Clock Domain Processors

Integrated CPU and Cache Power Management in Multiple Clock Domain Processors Integrated CPU and Cache Power Management in Multiple Clock Domain Processors Nevine AbouGhazaleh, Bruce Childers, Daniel Mossé & Rami Melhem Department of Computer Science University of Pittsburgh HiPEAC

More information

Using Aggressor Thread Information to Improve Shared Cache Management for CMPs

Using Aggressor Thread Information to Improve Shared Cache Management for CMPs Appears in Proc. of the 18th Int l Conf. on Parallel Architectures and Compilation Techniques. Raleigh, NC. Sept. 2009. Using Aggressor Thread Information to Improve Shared Cache Management for CMPs Wanli

More information

Analyzing and Quantifying Dynamc Program Behavior in Terms of Regularities and Patterns

Analyzing and Quantifying Dynamc Program Behavior in Terms of Regularities and Patterns University of Rhode Island DigitalCommons@URI Open Access Dissertations 2013 Analyzing and Quantifying Dynamc Program Behavior in Terms of Regularities and Patterns Celal Ozturk University of Rhode Island,

More information

Dynamic Speculative Precomputation

Dynamic Speculative Precomputation In Proceedings of the 34th International Symposium on Microarchitecture, December, 2001 Dynamic Speculative Precomputation Jamison D. Collins y, Dean M. Tullsen y, Hong Wang z, John P. Shen z y Department

More information

The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared-Memory Multiprocessor

The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared-Memory Multiprocessor IEEE Transactions on Parallel and Distributed Systems, Vol. 5, No. 6, June 1994, pp. 573-584.. The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared-Memory Multiprocessor David J.

More information

Continuous Adaptive Object-Code Re-optimization Framework

Continuous Adaptive Object-Code Re-optimization Framework Continuous Adaptive Object-Code Re-optimization Framework Howard Chen, Jiwei Lu, Wei-Chung Hsu, and Pen-Chung Yew University of Minnesota, Department of Computer Science Minneapolis, MN 55414, USA {chenh,

More information