Identifying and Exploiting Memory Access Characteristics for Prefetching Linked Data Structures. Hassan Fakhri Al-Sukhni


Identifying and Exploiting Memory Access Characteristics for Prefetching Linked Data Structures

by

Hassan Fakhri Al-Sukhni

B.S., Jordan University of Science and Technology, 1989
M.S., King Fahd University of Petroleum and Minerals, 1993

A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Doctor of Philosophy, Department of Electrical and Computer Engineering, 2006

This thesis entitled:
Identifying and Exploiting Memory Access Characteristics for Prefetching Linked Data Structures
written by Hassan Fakhri Al-Sukhni
has been approved for the Department of Electrical and Computer Engineering

Daniel Alexander Connors

Andrew Pleszkun

James C. Holt

Date

The final copy of this thesis has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards of scholarly work in the above mentioned discipline.

Al-Sukhni, Hassan Fakhri (Ph.D., Computer Engineering)
Identifying and Exploiting Memory Access Characteristics for Prefetching Linked Data Structures
Thesis directed by Assistant Professor Daniel Alexander Connors

Data prefetching is one paradigm for hiding memory latency in modern computer systems. Prefetch completeness requires that a prefetching mechanism achieve high coverage of a program's would-be misses with high prefetching accuracy and timeliness. Achieving prefetch completeness has been successful for memory accesses associated with regular data structures, like arrays, because of the spatial regularity of these memory accesses. Modern applications that use Linked Data Structures (LDS) exhibit a lesser degree of spatial regularity in their memory accesses. Thus, other characteristics of the memory accesses associated with LDS have been exploited by previous prefetching mechanisms. Unfortunately, only limited success has been reported for prefetching LDS; therefore, a significant opportunity remains to improve the performance of modern computer systems by improving prefetch completeness for LDS memory accesses. This dissertation proposes a coordinated approach consisting of three components to improve prefetch completeness for LDS. These components are: 1) a rigorous approach that offers metrics to quantify the exploitable characteristics of the memory accesses, 2) a coordinated software and hardware approach that benefits from the static characteristics facilitated by the global view of the compiler, combined with the dynamic characteristics accessible via profiling and runtime monitoring, and 3) simultaneous coordination of several mechanisms that exploit different characteristics of the LDS memory accesses. The proposed coordinated approach is illustrated in this dissertation by

extending the understanding of three exploitable characteristics of LDS memory accesses: spatial regularity, temporal regularity, and topology. Metrics are offered to enable the identification of these characteristics, and prefetching mechanisms that exploit them are designed in a coordinated fashion to benefit from the offered metrics. Simulation results indicate that the proposed approach can improve prefetch completeness by improving prefetch coverage of the program's would-be misses using accurate and timely prefetches.

Dedication

To all of the fluffy kitties.

Acknowledgements

Here's where you acknowledge folks who helped.

Contents

Chapter 1  Introduction
    Contributions
    Organization

Chapter 2  Exploiting Spatial Regularity Using Extrinsic Stream Metrics
    Introduction
    Characterizing Regular Streams
        Intrinsic Stream Characteristics
        Extrinsic Stream Characteristics
        Stream Affinity and Short Streams Exploitation
        Stream Density and Prefetch Coverage
        Measuring Stream Metrics
    Run-time Exploitation of Stream Metrics
        Stream Detection and Allocation Filtering
        Stream Prioritization and Thrashing Control Using Stream Density
        Exploiting Short Streams Using Stream Affinity
        Using Intrinsic and Extrinsic Metrics Together: PAD
            Controlling Accuracy Using PAD and Stream Length
            Controlling Timeliness Using PAD and Stream Density
    Experimental Evaluations
        Methodology
        Results
    Conclusions

Chapter 3  Stream De-Aliasing: The Design of Cost-Effective Stride-Prefetching for Modern Processors
    Introduction
    Approach
        Stream Identification
        Feedback Management
    Experimental Evaluation
        Methodology
        Results
    Conclusions

Chapter 4  A Stitch In Time: Identifying and Exploiting Temporal Regularity
    Introduction
        LDS Prefetching Limitations
        Context-Based Prefetching Limitations
        Content-Directed Prefetching Limitations
    Characterizing Temporal Regularity in the Memory Accesses
    Stitch Cache Prefetching
        Identifying Recurrent Loads Using Temporal Regularity Metrics
        Stitch Pointers Installation
        Using Stitch Pointers
        Controlling Prefetch Timeliness
        Exploiting Content-Based Prefetching
        Overcoming the Limitations of Context-Based Prefetching
        Overcoming the Limitations of Content-Based Prefetching
    Experimental Evaluation
        Methodology
        Results
    Conclusions

Chapter 5  Exploiting the Topology of Linked Data Structures for Prefetching
    Introduction
    Motivation and Background
        Dynamic Data Structure Analysis
        Memory System Evaluation
    Compiler-Directed Content-Aware Prefetching
        Compiler-Directed Content-Aware Prefetching Directives
        Linked Data Structure Example
    Experimental Evaluation
        Methodology
        Results and Analysis
    Conclusions

Chapter 6  Conclusions
    Summary
    Future Work

Bibliography

Tables

3.1   Estimated structural requirements to use full program counter for prefetching
      Simulated Benchmarks
      Percentage of dynamic execution of linked data structure access types

Figures

2.1   Prefetch opportunity of spatially regular accesses in SPEC-INT
2.2   Stream Affinity Illustration
2.3   Stream Density Illustration
2.4   Stream Density Histogram
2.5   Percentage of streams with α > 0.5, w =
2.6   A Finite State Machine That Controls The PEs
2.7   IPC Speedup of PAT over Stream Buffers
2.8   Prefetch Coverage
2.9   Prefetch Accuracy
2.10  Prefetch Timeliness
2.11  L1 Cast-outs
3.1   Forming Stream ID in Load Attribute Prefetching (LAP)
3.2   Prefetch moving window illustration
3.3   Performance of PC bits
3.4   Accuracy of PC bits
3.5   Performance gain
3.6   Prefetching Accuracy
3.7   Prefetch timeliness
3.8   Prefetching into the L1 cache
4.1   Content-Directed Prefetching
4.2   A sequence with temporal regularity
4.3   Markov model conditional entropy for the reference sequence of Figure 4.2
4.4   Stitch Cache Prefetching
4.5   Controlling prefetch timeliness
4.6   Average conditional entropy and variability for the top ten executed load instructions
4.7   Percentage of stitch pointers updates to load accesses
4.8   Utilization of Stitch Pointers
4.9   Prefetch coverage
4.10  Percentage of used prefetches from each mechanism in SCP
4.11  Prefetch Overall Accuracy
4.12  Prefetch accuracy of each prefetch type in SCP
4.13  Prefetch Timeliness of CDP and SCP
4.14  Percentage improvement in IPC
5.1   Basic example of four linked data structure types: traversal, pointer, direct, and indirect
5.2   Dynamic distribution of linked data structure access types
5.3   Loads miss rates at L1 and L2 caches
5.4   Content-Aware Prefetching Memory System
5.5   Compiler-directed prefetch harbinger instruction
5.6   Source and low-level code example of linked data structure in Olden's health
5.7   Memory accesses and prefetching commands for CDCAP on Olden's health
5.8   Prefetching accuracy
5.9   Timeliness of prefetching
5.10  Normalized bus blocking with prefetching
5.11  Normalized loads miss rates for L2 cache
5.12  Normalized cycles of processor waiting for load values
5.13  Normalized execution times

Chapter 1

Introduction

Over the last decade, significant advancements in the semiconductor industry have allowed an unprecedented increase in microprocessor performance. Continued exploitation of instruction-level parallelism, longer pipelines, and faster clocks are just a few techniques that have led to increased performance. Unfortunately, innovations in memory system design and technology have been unable to achieve the same rate of improvement in memory speeds. As a result, an increasing percentage of the total execution time of modern computer systems is spent waiting on the memory system. It is estimated that nearly 40% of execution time is spent stalled waiting for both instruction and data cache misses [5]. Cache organizations have been used successfully to reduce the perceived latency of the memory subsystem in modern computer systems [46]. Caches exploit the locality-of-reference concepts known as spatial and temporal locality [23]. Unfortunately, the larger a cache is, the slower it becomes, which limits the benefits of cache-based solutions. This situation is exacerbated by the increased size of both programs and their data sets. Prefetching is another appealing technique for overcoming the memory latency problem in modern processors [41]. Numerous prefetching approaches have been proposed, including hardware-based approaches [38, 26, 11, 27] as well as software-based approaches [30, 31, 7]. Hardware-based prefetching approaches

employ specialized hardware to monitor memory accesses at run-time and to predict future memory accesses so that they can be prefetched well in advance of when they are needed by software executing on the processor. Software-based approaches use tools such as compilers and code profilers to instrument code so that data for future memory accesses can be prefetched using software mechanisms at run-time. A complete prefetching mechanism needs to meet three strict requirements: coverage, timeliness, and accuracy [25]. Coverage measures the ability to correctly predict and prefetch future memory accesses; it is the ratio of misses hidden by prefetching to the overall number of misses without prefetching. Timeliness requires that prefetches be launched with sufficient lead time to hide the memory latency; the timeliness of a prefetch is the ratio of the memory latency hidden by the prefetch to the overall memory latency. Accuracy is the ratio of prefetches that were used by the program to the total prefetches generated. Timeliness and coverage are correlated, since prefetches that are late do not contribute toward coverage. Similarly, coverage and accuracy are correlated, because inaccurate prefetches will not cover program misses. Prefetching spatially regular data structures (e.g., data structures such as arrays that exhibit simple mathematical relationships in their load addresses) has been particularly successful using stride-based techniques [4, 20, 38, 27, 24, 11, 26]. These techniques satisfy the prefetching completeness requirements by effectively exploiting spatial regularity to accurately predict future accesses well ahead of the program. While many programs make use of such spatially regular data structures, the use of Linked Data Structures (LDS) is pervasive in modern software (e.g., linked lists, trees, and hashes). Integer applications that use LDS demonstrate fragmented, short, spatially regular patterns in their memory accesses, and previous stride-prefetching mechanisms are not designed to account for such

patterns. Prefetching techniques for LDS have been proposed [18, 25, 17, 42, 35], but the results have been only marginally successful [15]. Therefore, significant opportunities remain to hide the memory latency associated with LDS. Prefetching techniques for LDS can be broadly classified into three categories: context-based techniques, content-based techniques, and precomputation techniques. Context-based techniques use correlations amongst the memory accesses to predict future memory references. Content-based techniques use the content of the accessed data to make their predictions. Precomputation techniques run a slice of the program to generate memory addresses that are used for prefetching. Context-based prefetching techniques exploit temporal regularity in LDS accesses [15] by finding correlations amongst repeatedly accessed addresses and then using these correlations to initiate prefetches (temporal regularity is discussed in Chapter 4). These correlations enable context-based techniques to launch timely prefetches ahead of the program. Unfortunately, context-based prefetching has several limitations that reduce its coverage, including limited capacity, learning time, and excessive overhead. A detailed discussion of these limitations follows in Chapter 4. Content-based prefetching uses stateless systems that prefetch the connected data objects of LDS by discovering pointers in the data contained within cache lines as they are filled. Content-based prefetching can achieve good coverage by overcoming the limitations of context-based prefetching; however, its timeliness is limited, since the pointers contained in the accessed data usually refer to data that will soon be needed by the program. Other limitations of content-based prefetching include the inability to recognize traversal orders over multiple potential paths and the inability to connect and prefetch isolated LDS. Chapter 4 discusses these limitations further.
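The coverage, timeliness, and accuracy ratios defined above can be computed from simple event counts gathered during a run. The sketch below is illustrative only; the struct and counter names are assumptions of this example, not terminology from the dissertation:

```c
#include <assert.h>

/* Hypothetical event counters gathered during a simulation run. */
typedef struct {
    unsigned long baseline_misses;   /* misses with prefetching disabled      */
    unsigned long covered_misses;    /* would-be misses hidden by prefetches  */
    unsigned long prefetches_issued; /* total prefetches launched             */
    unsigned long prefetches_used;   /* prefetches later consumed by a load   */
    double latency_hidden;           /* cycles of memory latency hidden       */
    double full_latency;             /* full memory latency in cycles         */
} prefetch_stats;

/* Coverage: ratio of misses hidden by prefetching to misses without it. */
double coverage(const prefetch_stats *s) {
    return (double)s->covered_misses / (double)s->baseline_misses;
}

/* Accuracy: ratio of used prefetches to total prefetches generated. */
double accuracy(const prefetch_stats *s) {
    return (double)s->prefetches_used / (double)s->prefetches_issued;
}

/* Timeliness: fraction of the memory latency hidden by a prefetch. */
double timeliness(const prefetch_stats *s) {
    return s->latency_hidden / s->full_latency;
}
```

Note how late prefetches lower both timeliness and coverage in this accounting: a late prefetch hides only part of the latency and does not count as a covered miss.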

Precomputation prefetching runs a slice of the program instructions to compute future references. The slice of instructions is run as a thread context in multithreaded environments [17], or in a dedicated prefetching engine [2, 55, 47, 29]. These techniques trigger their precomputation with specially marked instructions of the program. Although these techniques can predict accesses that exhibit neither spatial nor temporal regularity, the time between the trigger instructions and the instructions that require the prefetched data is usually too short to hide the memory latency, which limits their timeliness. Coordinated prefetching approaches that employ two or more of the above prefetching mechanisms have been proposed. For example, Roth [43] combined a context-based mechanism called jump pointers with a content-based mechanism called dependence-based prefetching [42]. This combined approach, called cooperative prefetching, used jump pointers to launch timely prefetches for correlated accesses and subsequently triggered the dependence-based prefetcher as a result of jump-pointer prefetches. Guided-region prefetching [50] is another coordinated approach that used software analysis to tag load instructions that access LDS fields and used these tags (as a context) to improve the accuracy of content-based prefetching. Recently, multi-chain prefetching [29] was proposed as a coordinated approach that uses compiler-analysis techniques to identify static traversal paths in a program. Precomputation schedules for the identified paths are statically generated and shipped to a hardware prefetching engine that executes them based on the memory content. While these coordinated prefetching solutions proved the viability of a coordinated approach to improving prefetch completeness, they still suffered from the limitations of the underlying mechanisms they coordinated.
The principal impediment to prefetching LDS is that the memory addresses of the connected data objects are dynamically generated based on specific programming traversal patterns, phases, and application behavior. The connectivity among the nodes of the LDS changes during the program lifetime, and the program potentially changes its traversal paths along these nodes. Therefore, in order to achieve completeness, a technique for LDS prefetching must be able to dynamically track these changes over the program lifetime, adjusting its prefetch targets for improved coverage and its prefetch initiation time for improved timeliness. This dissertation hypothesizes that a coordinated software and hardware prefetching system, which builds on the strengths of both software and hardware prefetching paradigms and utilizes well-established compile-time and profile-time technologies to guide a run-time prefetching environment, can improve prefetching completeness for LDS. Such a coordinated system can be enabled by a rigorous approach that extends the understanding of the memory access characteristics associated with LDS and offers metrics to quantify those characteristics. The compiler's global view of the memory access characteristics cannot be matched by a hardware-only system. This information, combined with the offered metrics and the dynamic run-time knowledge of the hardware component, creates a promising and flexible system for exploiting the memory access characteristics associated with LDS.

1.1 Contributions

The methodology of this dissertation consists of: (1) surveying and understanding the characteristics exploited by previous prefetching mechanisms, (2) extending the understanding of these characteristics and offering metrics to quantify them, (3) proposing prefetching mechanisms that employ the offered metrics using a coordinated approach drawing on information from compile, profile, and run-time technologies, and (4) validating that the proposed mechanisms improve prefetch completeness using cycle-accurate simulation.

Through this methodology, the memory access characteristics associated with LDS are studied, and mechanisms to identify and exploit them are proposed. Thus, this dissertation makes the following contributions:

Exploiting spatial regularity: To exploit spatially regular streams in the memory accesses, previously identified regular-stream metrics are extended to quantify extrinsic characteristics of regular streams. These new metrics are employed to improve the efficiency of stride prefetching for LDS accesses. The extrinsic metrics introduced are stream affinity and stream density. Stream affinity enables prefetching for the short streams that result from LDS manipulations; such streams were previously ignored by stride prefetching mechanisms. Stream density enables a prioritization mechanism that dynamically selects amongst available streams in favor of those that promise more miss coverage, and provides thrashing control amongst several coexisting streams. Using intrinsic and extrinsic stream metrics in combination allows a novel hardware technique for controlling Prefetch Ahead Distance (PAD), which dynamically adjusts the prefetch launch time to better enable timely prefetches while minimizing cache pollution.

De-aliasing regular streams: Several prefetching mechanisms utilize the Program Counter (PC) to de-alias co-existing streams of regular memory accesses (spatial and temporal). Transmitting the full PC across modern deep and wide pipelines for the sole purpose of prefetching is not practical (as illustrated in Chapter 3). To overcome the issues related to using the entire PC for effective stream de-aliasing and prefetching, this dissertation combines other instruction attributes with a small subset of the PC to help detect regularity in program data accesses. This de-aliasing

scheme is illustrated by implementing a cost-effective stride prefetching mechanism called Load-Attributes Prefetching (LAP).

Exploiting temporal regularity: Metrics are proposed to quantify temporal regularity in the memory accesses generated by load instructions that traverse LDS. These metrics are applied to profile information to identify recurrent load instructions (instructions that generate temporally regular accesses). Recurrent load instructions are then targeted at run-time by a coordinated context-based and content-based mechanism that exploits temporal regularity to prefetch LDS accesses. The approach is illustrated by proposing an LDS prefetching mechanism called Stitch Cache Prefetching (SCP), which makes the following contributions: (1) definition of temporal regularity metrics (based on concepts of information theory) that allow a profiler to identify loads that traverse any type of LDS path (static or dynamic) without requiring source code, (2) a context-based prefetching mechanism that avoids capacity problems by using a logical stitch space, and maintenance overhead by physically implementing the stitch space with a hierarchical organization, and (3) improved prefetch timeliness, achieved by dynamically adjusting the prefetch launch time based on the observed memory latency using a continuously tuned timeliness stitch queue.

Exploiting LDS topology: A classification of the load instructions used to traverse LDS is proposed. Based on this classification, the compiler is enabled to extract LDS topology and traversal information. This information is used by a hardware component and combined with the content of accessed cache lines to dynamically construct LDS traversal schedules. The constructed schedules are used by the hardware component to generate timely and accurate prefetches in a coordinated prefetching approach, illustrated via the design of a novel prefetching mechanism called Compiler-Directed Content-Aware Prefetching (CDCAP). The mechanism utilizes compiler-inserted prefetch instructions to guide hardware prefetching engines (HPEs) in prefetching traversed paths of LDS. The inserted prefetch instructions carry static attributes that describe the topology of the data structure. As the program runs, compiler-inserted hints invoke the HPE to employ the topology information, generating prefetches based on the current program state and the contents of the accessed cache lines. The technique addresses the shortcomings of software-only techniques by eliminating the need to transform the data structure and by avoiding excessive prefetch instructions, and it does not require prior knowledge of the traversed data structure paths. At the same time, the approach eliminates the need for large correlation-based hardware structures and reduces the number of unnecessary prefetches caused by hardware-only implementations.

1.2 Organization

This dissertation is composed of six chapters. Chapter 2 studies spatial regularity in the memory accesses and proposes metrics to quantify extrinsic regular-stream characteristics. The proposed metrics are utilized by a run-time prefetching system to improve the prefetch completeness of stride-based prefetching systems. Chapter 3 describes Load-Attribute Prefetching (LAP) as a cost-effective solution for de-aliasing regular streams in modern deep and wide pipelines. Chapter 4 studies temporal regularity in the memory accesses associated with LDS. Metrics to quantify temporal regularity are offered and used in the design of a coordinated

prefetching approach called Stitch Cache Prefetching (SCP) to improve prefetch completeness. Chapter 5 identifies patterns in LDS traversal instructions and employs these patterns to design a Compiler-Directed Content-Aware Prefetching (CDCAP) approach that generates dynamic precomputation prefetching schedules, based on compiler-communicated information, for prefetching engines located at different cache levels of the memory hierarchy. Finally, Chapter 6 offers conclusions and suggestions for future research.

Chapter 2

Exploiting Spatial Regularity Using Extrinsic Stream Metrics

2.1 Introduction

Hardware-based prefetching approaches are particularly interesting due to (1) their potential to dynamically adapt to run-time characteristics, and (2) their ability to improve software performance for unmodified program binaries (e.g., there is no need to recompile programs using special compilers or other software tools, and legacy program binaries can also benefit from these techniques). Stride-based prefetching is an effective hardware technique that exploits regular streams of memory accesses [26]. A regular stream is an ordered sequence of addresses that exhibits a non-zero constant address difference (a stride) between its consecutive addresses. Existing stride-based techniques achieve efficiency primarily through accuracy and timeliness. Figure 2.1 depicts the prefetch opportunity of spatially regular accesses in SPEC-INT benchmarks. This figure indicates that there exist significant opportunities to hide the memory latency in these benchmarks by exploiting spatial regularity. Unfortunately, previous stride-prefetching mechanisms were designed to target spatial regularity in scientific numerical applications. The spatial regularity in such applications demonstrates different characteristics than that of integer applications that use LDS (as will be illustrated in the following sections of this chapter). Thus, improving the efficiency of stride-based prefetching for applications that use LDS requires a better understanding of their spatial regularity characteristics.

[Figure 2.1: Prefetch opportunity of spatially regular accesses in SPEC-INT. The chart reports the fraction of spatially regular accesses for gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, and their average.]

This understanding enables the improvement of prefetch completeness through improved coverage, without sacrificing accuracy or timeliness (as demonstrated by the simulation results in Section 2.4). Mohan defines regularity metrics for measuring the characteristics of streams [34]. That work illustrated that well-defined metrics could be effectively employed to guide software optimizations associated with regular streams. The fidelity enabled by these metrics allowed a compiler to select the most effective code optimizations amongst techniques such as tiling, software prefetching, and loop transformations. While Mohan's work was not focused on hardware prefetching, this dissertation hypothesizes that stride-based hardware prefetching completeness can be improved by similar exploitation of regular-stream metrics. Furthermore, while Mohan focused on the important characteristics of individual streams in isolation, this research recognizes that at run-time there are interactions between streams and other entities (for example, other streams or the memory sub-system). These interactions can be accounted for and used to further improve hardware prefetch efficiency. Thus, the major contributions of this chapter are the extension of Mohan's metrics to enable measurement of certain additional characteristics of regular streams, and the application of these extended metrics to the optimization of stride-based hardware prefetching. The rest of this chapter is organized as follows. The next section identifies characteristics of regular streams and defines metrics to quantify these characteristics. Section 2.3 illustrates how these metrics can be used dynamically to improve the efficiency of a stride-based hardware prefetching system. Section 2.4 illustrates (using simulations) that the proposed hardware outperforms one that does not account for the metrics offered in this research. Section 2.5 summarizes and draws conclusions.

2.2 Characterizing Regular Streams

Metrics of regular streams are classified into two classes: intrinsic stream metrics and extrinsic stream metrics. The intrinsic class describes a stream's inherent characteristics, such as its stride or length (e.g., the number of accesses that belong to the stream). The extrinsic class describes a stream's characteristics relative to the program, to other streams, and to the memory system (which includes the hardware prefetchers). While intrinsic stream metrics have been identified [34] and used to optimize stride prefetching mechanisms [26, 27, 20, 24], the same cannot be said of extrinsic stream metrics. In this section, intrinsic stream metrics are briefly discussed, and extrinsic stream metrics are introduced. These metrics are then used to make three main contributions that improve the efficiency of hardware stride prefetching:

- Exploiting short streams to improve coverage,

- Dynamic selection amongst streams to prevent thrashing and improve coverage, and
- Dynamically changing the prefetch-ahead distance to improve timeliness.

2.2.1 Intrinsic Stream Characteristics

Existing hardware stride prefetching mechanisms recognize and exploit the intrinsic characteristics of regular streams. These characteristics can be measured by two metrics: stride and length. The stride measures how far apart consecutive accesses of the stream are in terms of memory addresses, while the length measures the number of accesses in the stream. Streams having strides within one cache line, called unit-stride streams, have been exploited by stream-buffer prefetching mechanisms [26, 20]. Unit-stride streams are easy to detect and exploit; however, in practice there remain a significant number of streams that exhibit strides larger than a single cache line [34]. Detection of streams with arbitrary strides can provide increased coverage. To exploit streams with strides longer than one cache line, Farkas [27] uses per-PC stream detection, a technique that detects streams with arbitrary strides by comparing consecutive accesses of specific load instructions. However, as with all stream-buffer-based techniques, there is no mechanism to adjust the prefetch launch time to improve prefetch timeliness. Stream length has been used to optimize stride prefetching techniques through the exploitation of long streams [38, 26, 27, 4]. Longer streams are preferred because they offer better prefetch efficiency over time. However, the efficiency of prefetching for long streams can be negatively impacted by the presence of short streams unless there is a way to distinguish between them. Therefore, several mechanisms have been proposed to filter out streams of short length, such as allocation filters

[27, 20]. Unfortunately, while these techniques can filter out irregular accesses, they fail to exploit a short stream because most of the stream is consumed to establish its stride, leaving very little of the stream remaining to be prefetched. This chapter shows how extrinsic stream metrics can enable efficient prefetching of some classes of short regular streams without negatively impacting the efficiency of prefetching for long streams.

2.2.2 Extrinsic Stream Characteristics

Extrinsic metrics measure a stream's characteristics relative to other streams and to the memory system (including misses not associated with any stream, as well as interactions with prefetchers). Stride prefetching mechanisms can utilize these metrics to evaluate tradeoffs whose implications cannot be measured using only intrinsic metrics, thus allowing for improved efficiency. The following sections discuss this in detail. To facilitate the extrinsic-metrics discussion, the notion of regular streams is augmented by associating a timestamp with each memory access to provide a temporal ordering of stream accesses. The timestamp is a monotonically increasing integer that starts at 0 and is incremented with each memory access. Using the temporal ordering provided by the stream access timestamps, the following stream attributes can be defined:

- Stream birth, b, is the timestamp of the first address of the stream,
- Stream death, d, is the timestamp of the last address of the stream, and
- Stream age, a, is the difference between a stream's death and its birth (d − b + 1).
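Under the timestamp convention just defined, a stream's birth, death, length, and age can be extracted by scanning a trace in which each access's timestamp is its index. The helper below is a hypothetical sketch for illustration (the function and type names are inventions of this example, not the dissertation's hardware):

```c
#include <assert.h>

/* Stream attributes from the text: birth b, death d, and length l. */
typedef struct { long b, d, l; } stream_attr;

/* Scan a trace of addresses (timestamp = array index) and collect the
 * attributes of the stream formed by addresses base, base+stride, ... */
stream_attr collect(const unsigned long *trace, int n,
                    unsigned long base, unsigned long stride) {
    stream_attr s = { -1, -1, 0 };
    unsigned long next = base;
    for (int t = 0; t < n; t++) {
        if (trace[t] == next) {   /* next expected access of the stream */
            if (s.b < 0) s.b = t; /* birth: timestamp of first access   */
            s.d = t;              /* death: timestamp of latest access  */
            s.l++;                /* length: number of accesses         */
            next += stride;
        }
    }
    return s;
}

/* Age a = d - b + 1, as defined in the text. */
long age(stream_attr s) { return s.d - s.b + 1; }
```

For a stream of three accesses interleaved with unrelated misses, the age exceeds the length, which is exactly the gap the density metric of Section 2.2 quantifies.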

With these attributes, the following metrics are next defined and discussed: stream affinity and stream density.

2.2.3 Stream Affinity and Short Streams Exploitation

Allocating hardware resources to detect and prefetch regular streams is usually done based on demand misses [26, 4]. Allocation filters are used to filter out misses that do not belong to regular streams, preventing them from disturbing ongoing prefetching of regular streams [38]. This filtering consumes part of the stream to establish its intrinsic characteristics before prefetching can commence. Such consumption can prevent the prefetching of short streams. Stream affinity is introduced as an extrinsic metric that measures how similar a stream is to the most recent non-interleaved stream of equivalent stride. Stride prefetching can benefit from recognizing affine streams (streams with high affinity) by spending less time identifying intrinsic stream characteristics, and instead using that time to continue prefetching. This is especially important for short streams, where individual streams cannot be exploited due to the time wasted in determining their intrinsic characteristics.

    for (i=0; i < 50; i++) {
        t = A[i];
        for (j=0; j < 3; j++) {
            sum += B[t+j*16];
        }
    }

[Figure 2.2: Stream Affinity Illustration. (a) The code above; (b) a timeline of the memory accesses of stream x (for j=0) and stream y (for j=1), marking the death d_x, the birth b_y, and the window w.]
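The "consumption" that prevents short-stream prefetching can be seen in a minimal per-PC stride detector with two-miss confirmation. This is a generic sketch of that widely used scheme under assumed table layout and policy, not the exact hardware of the cited designs:

```c
#include <assert.h>
#include <stdint.h>

/* One entry of a hypothetical per-PC stride table. */
typedef struct {
    uint64_t last_addr;  /* last address seen for this PC */
    int64_t  stride;     /* candidate stride              */
    int      confirmed;  /* stride seen twice in a row?   */
} stride_entry;

/* Observe one access by a given PC; return a prefetch address, or 0 if
 * the stride is not yet established.  The first accesses of every stream
 * are consumed just to learn and confirm the stride -- which is why a
 * three-access stream like those of Figure 2.2 yields at most one
 * prefetch without affinity information. */
uint64_t observe(stride_entry *e, uint64_t addr) {
    int64_t delta = (int64_t)(addr - e->last_addr);
    e->confirmed = (delta != 0 && delta == e->stride);
    e->stride = delta;
    e->last_addr = addr;
    return e->confirmed ? addr + (uint64_t)delta : 0;
}
```

An affinity-aware prefetcher would instead carry the confirmed stride over from the previous affine stream, so even the first access of the next short stream can trigger a prefetch.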

starting points for the inner loop. The short streams produced by consecutive executions of the inner loop will be affine streams. A prefetching mechanism that detects stream affinity can use the first of a set of high affinity streams to identify the stride, and then reuse this characterization to efficiently prefetch subsequent members of the set. The author is not aware of any technique that recognizes or exploits affine streams.

Given two streams, x and y, that are generated by the same load instruction, let the stream births and stream deaths of these streams be denoted as b_x, b_y, d_x, and d_y, respectively. Further, let w be a timestamp window during which stream affinity can be exploited efficiently by prefetching hardware. The affinity, α_y, for stream y is defined as:

α_y = 1 - (b_y - d_x)/w   if (b_y - d_x) ≤ w and stride(x) = stride(y)
    = 0                   otherwise                                    (2.1)

The fractional portion of the metric measures how far apart the two streams are; the metric approaches zero as the distance between the streams grows larger. Therefore, the highest affinity of 1 occurs when stream y is a continuation of stream x (making the fractional term 0).

Stream Density and Prefetch Coverage

Regular streams compete for limited hardware prefetching resources, which are usually fewer than the number of regular streams [34]. Sherwood [45] used priority counters to select one of several potential streams based on how predictable each one is (e.g., selection amongst streams based on prefetch accuracy). This approach prevents irregular misses from deterring the prefetching of regular streams. However, Sherwood's approach does not distinguish between regular streams to select one that potentially has better coverage than the others, nor does it prevent predictable streams from thrashing each other.

Stream density is introduced as an extrinsic metric that indicates the expected coverage of a stream (how many program misses are potentially hidden in a given period of time by prefetching the stream). Given the density metric, a prefetching mechanism can select a denser stream over a less dense one. This maximizes prefetch coverage and therefore efficiency. With the previously introduced stream attributes, stream density, δ, is defined as the ratio of a stream's length to its age:

δ = l/a = l/(d - b + 1)                                                (2.2)

Intuitively, a stream with low density (a sparse stream) is one whose accesses are separated by many interleaved memory accesses that do not belong to the stream. Conversely, accesses of a high-density stream are separated by few memory accesses not belonging to the stream. Dense streams appear, for example, in memory copy operations and in search algorithms that use tight loops containing load and compare operations.

To illustrate dense and sparse streams, consider the two nested loops of Figure 2.3(a). The outer loop generates a sparse stream (stream z) of the accesses of array A, whereas the inner loop generates streams x and y, which are dense streams. Figure 2.3(b) illustrates the calculation of the density metric for stream z. In this example stream z has a length of 2 (accesses A[0] and A[100] in the code segment) and an age of 5 (its death is 4 and its birth is 0). Conversely, the densities of streams x and y are equal to 1. Dense streams present more prefetching opportunities than sparse streams.

A[0] = 0; A[100] = 100;
for (j=0; j < 2; j++) {
    t = A[j*100];
    for (i=0; i < 3; i++) {
        sum += B[t + i*16];
    }
}

[Figure 2.3(b): timeline of memory accesses, marking stream z's birth b_z and death d_z among the interleaved accesses of streams x and y, with δ_z = l_z/a_z = 2/(d_z - b_z + 1) = 2/5 = 0.4.]

Figure 2.3: Stream Density Illustration.

A prefetching mechanism that does not account for stream density can be diverted from prefetching dense streams in the presence of sparse streams. This problem is known as the stream thrashing problem [38]. Although allocation filtering [38, 27] prevents non-stream misses or short streams from interrupting prefetching for long streams, previous hardware stride prefetching mechanisms do not resolve the stream thrashing problem. Section 2.3 introduces a mechanism that uses the density metric to select a stream that promises more coverage than other streams.

Measuring Stream Metrics

This chapter uses the SPEC2K suite of benchmarks for illustration. Each benchmark is represented statistically by several traces, each consisting of several million instructions. Each trace has a weight representing its contribution to the overall benchmark; this is similar in concept to the benchmark representation used in SimPoint [39]. Trace representation was verified against actual hardware (a previous processor) using IPC as a verification metric. The simulated IPC, obtained by simulating the traces on the matching cycle-accurate simulator and weighting the IPCs of each sub-trace appropriately, was compared against the IPC from execution on hardware. All benchmarks showed good correlation between

[Figure 2.4: stacked-bar histograms of the stream density distribution for the SPEC2K INT benchmarks (gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf) and FP benchmarks (wupwise, swim, mgrid, applu, mesa, galgel, art, equake, facerec, ammp, lucas, fma3d, sixtrack, apsi), with density ranges >= 0.2, < 0.2, < 0.1, < 0.01, and smaller.]

Figure 2.4: Stream Density Histogram.

actual and simulated IPC. In collecting stream metrics, the streams were not weighted by the trace weights.

Stream Density

Figure 2.4 depicts a histogram of stream densities. Each bar represents the percentage of streams with densities within the ranges identified in the legend. One observation is that the SPEC2K-INT benchmarks generally exhibit higher-density streams than the FP benchmarks. This is because interleaved streams appear more frequently in the FP benchmarks as a result of aggressive compiler loop unrolling; in contrast, the INT benchmarks offer little or no loop unrolling opportunity to the compiler.

Figure 2.4 illustrates that the majority of the regular streams detected in the benchmarks are sparse. This does not mean that these streams represent the majority of the memory accesses, since sparse streams are often significantly shorter than dense streams. However, the high percentage of sparse streams means that they will frequently interrupt the denser streams. Improved stride prefetching mechanisms should account for this observation and should be able to react dynamically as conditions change. Section 2.3 presents one solution to this problem based on stream density.

Stream Affinity

Figure 2.5 shows the percentage of streams in the benchmarks that have an affinity greater than 0.5 (α > 0.5) for a w of 200. The value of 200 was chosen for w experimentally, based on a study of how well the available streams could be exploited by the intended hardware system. The figure suggests that affine streams constitute a significant portion of the total number of streams in several workloads. Overall, streams with high affinity are less frequent in the SPEC2K-FP benchmarks than in the SPEC2K-INT ones. This is due to the nature of the two suites: the INT benchmarks tend to have more loop nesting constructs and more dynamic data structures. Dynamic data structures are usually allocated in chunks that are consecutive in the memory space. As these data structures are rearranged by the program through deletion and insertion of nodes, their subsequent traversal results in fragments of short affine streams. In contrast, the FP benchmarks generally do not use dynamic data structures; they tend to scan large arrays, causing long streams that are well separated by non-stream accesses. Therefore, these benchmarks demonstrate low affinity in general. Investigating the FP benchmarks that do exhibit high affinity (for example, galgel and art) reveals that these benchmarks have outer loops that set up

[Figure 2.5: bar charts of the percentage of affine streams per benchmark, for the SPEC2K INT suite (0-12%) and the FP suite (0-14%).]

Figure 2.5: Percentage of streams with α > 0.5, w = 200.

an address for inner loops, which is consistent with the affine streams present in the INT benchmarks. The only difference between the two code segments is that galgel used aggressive loop unrolling of the inner loop, while art did not.

2.3 Run-time Exploitation of Stream Metrics

The effective use of both intrinsic and extrinsic regular stream metrics is demonstrated with a design for a hardware stride prefetching system. The design includes a number of prefetching engines (PEs). Each PE is controlled by a finite state machine similar to the one shown in Figure 2.6. These PEs are allocated to load instructions that miss in the L1 cache in order to prefetch their regular streams.

Therefore, this design employs per-PC stride detection similar to that presented by Farkas [27]. The system is designed to prefetch load accesses and not store accesses; stores are handled by other micro-architecture solutions such as store buffers [20]. The states of this state machine are divided into two sets, inactive states and active states. The active states dynamically control prefetching accuracy and timeliness, while the inactive states are employed to improve coverage by detecting streams, managing priorities amongst streams, and identifying affine streams. How stream metrics govern the state transitions is discussed in detail in the following sections.

State   Description                    Metric Used
SD      Stream Detection               Density
TC      Thrashing Control              Density
AFD     Affinity Detection             Affinity/Density
LC1     Low Confidence with 1 PAD      PAD/Affinity/Density
HCx     High Confidence with x PAD     PAD/Affinity/Density

[Figure 2.6: state diagram with inactive states (OFF, ALLOCATE, SD, TC, AFD) and active states (LC1, HC1, HC2, HC4, HC6); transitions are driven by the events in the legend, with separate padding, affinity, and density paths.]

LEGEND
LM : Allocated Load Miss
PU : Prefetch Used
PV : Prefetch Evicted
MU : Load Miss Unpredicted
MP : Load Miss Predicted
PEM: PE Miss (Load Miss Not Allocated An Engine)

Figure 2.6: A Finite State Machine That Controls The PEs.

Stream Detection and Allocation Filtering

When a load instruction misses in the L1 cache, a free PE is allocated to the instruction, and the state machine transitions to the Stream Detection (SD) state. During the SD state the PE makes an initial stride guess equal to the size of a cache line. Using the subsequent accesses of the load instruction, a stream is identified by comparing the guessed stride with the measured stride of the load accesses. If a stream is identified, the state machine transitions to state LC1 (low confidence with the ability to launch 1 prefetch). Otherwise, the PE state machine stays in SD and re-computes the stride based on the addresses of the load instruction's consecutive accesses. This process repeats until either a stream is detected or the PE is reallocated to another missing load instruction (via repeated transitions through states TC and finally OFF due to other load instruction misses). This stream detection is analogous to the allocation filters proposed by Palacharla [38], in the sense that misses not comprising a stream do not affect prefetching because there will be no resulting transition to the LC1 state.

Stream Prioritization and Thrashing Control Using Stream Density

When several streams are interleaved, they generate misses that are also interleaved. The proposed prefetching system is designed with a number of PEs equal to N. If the number of interleaved streams is less than or equal to N, each one will be allocated a PE. However, when the number of interleaved streams exceeds N, a prioritization mechanism is needed. The prioritization is based on measuring the density of each of the interleaved streams and favoring the denser of them. This task is carried out by the states SD, Thrashing Control

(TC), and LC1. In these states the interleaved streams compete for the PE, such that allocated streams climb to higher confidence states while unallocated streams can decrease the confidence of allocated streams. The denser stream eventually wins the PE. In previously proposed mechanisms, newly identified streams can replace existing streams without consideration of their respective densities. This approach results in stream thrashing that reduces prefetching efficiency. Unfortunately, this situation is common in scientific code that has been subject to aggressive loop unrolling. The above prioritization mechanism implements thrashing control using stream density, such that a denser stream retains the PE. As with the prioritization itself, this is done by dynamically allowing streams to compete for the PEs. The details of this competition are explained next.

Recall that the states of the state machine are divided into active states (LC1 and HC1-HC6) and inactive states (SD, TC, and AFD). Once a stream has been detected (while the PE is in the inactive states), the PE's state machine transitions to the active states and the PE launches a number of prefetches based on the measured stride. The number of prefetches launched in each state is shown as a number in the state name (for example, in state LC1 one prefetch is launched). Prefetch requests are queued to an architected miss buffer, which manages requesting and collecting data from lower memory levels. Once a request has been fully serviced, its data is evicted to the L1 data cache (referred to as a PV event). If the prefetched data is needed by any program instruction while still in the miss buffer, a prefetch use (PU) event is declared. In this case, the state machine transitions to the next higher confidence state (state High Confidence with the ability to launch 1 prefetch, HC1 in this example).
However, if the prefetched cache line is not needed until after it has been evicted, then the state machine transitions to the lower confidence state (e.g., state Affinity Detect, AFD.)

Streams that are not allocated any PE generate PE miss (PEM) events that reduce the confidence of allocated PEs, while allocated streams increase their PEs' confidence via prefetch use (PU) events. This competition resolves in one of two ways: (1) if the unallocated stream is denser than the allocated one, its PEM events will outnumber the PU events of the allocated stream, and the denser stream will eventually take over the PE, or (2) if the allocated stream is denser, its PU events will outnumber the PEM events of the unallocated stream, leaving the PE at high confidence. Either case results in the denser stream controlling the PE, thereby mitigating the thrashing problem and improving prefetch coverage.

Exploiting Short Streams Using Stream Affinity

In all previously reported hardware stride-based mechanisms, if a missing address has not already been predicted by the prefetching mechanism, then it becomes a candidate for starting a new stream. This treatment of misses does not account for stream affinity. In the approach proposed in this chapter, stream affinity is exploited by lowering the confidence of the state machine controlling the PE allocated to the missing load instruction (such misses are referred to as LM events). This lowering of confidence repeats until the state machine reaches the Affinity Detect (AFD) state. In AFD, if the conditions of Equation 2.1 indicate an affine stream, the PE changes its prefetching region to match the most recently missed address. This change of the prefetching region allows the PE to begin prefetching affine streams without going through the detection process again. When affinity is detected, the PE state transitions directly to the HC1 state, bypassing LC1. Allowing this transition results in faster climbing of the confidence states and enables exploiting short affine streams.
Note also that not changing the prefetch region until the state machine confidence drops to one of the inactive states al-


More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu

More information

Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review

Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Bijay K.Paikaray Debabala Swain Dept. of CSE, CUTM Dept. of CSE, CUTM Bhubaneswer, India Bhubaneswer, India

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Decoupled Zero-Compressed Memory

Decoupled Zero-Compressed Memory Decoupled Zero-Compressed Julien Dusser julien.dusser@inria.fr André Seznec andre.seznec@inria.fr Centre de recherche INRIA Rennes Bretagne Atlantique Campus de Beaulieu, 3542 Rennes Cedex, France Abstract

More information

Future Execution: A Hardware Prefetching Technique for Chip Multiprocessors

Future Execution: A Hardware Prefetching Technique for Chip Multiprocessors Future Execution: A Hardware Prefetching Technique for Chip Multiprocessors Ilya Ganusov and Martin Burtscher Computer Systems Laboratory Cornell University {ilya, burtscher}@csl.cornell.edu Abstract This

More information

Memory Systems and Compiler Support for MPSoC Architectures. Mahmut Kandemir and Nikil Dutt. Cap. 9

Memory Systems and Compiler Support for MPSoC Architectures. Mahmut Kandemir and Nikil Dutt. Cap. 9 Memory Systems and Compiler Support for MPSoC Architectures Mahmut Kandemir and Nikil Dutt Cap. 9 Fernando Moraes 28/maio/2013 1 MPSoC - Vantagens MPSoC architecture has several advantages over a conventional

More information

Quantifying Load Stream Behavior

Quantifying Load Stream Behavior In Proceedings of the 8th International Symposium on High Performance Computer Architecture (HPCA), February. Quantifying Load Stream Behavior Suleyman Sair Timothy Sherwood Brad Calder Department of Computer

More information

Exploiting Streams in Instruction and Data Address Trace Compression

Exploiting Streams in Instruction and Data Address Trace Compression Exploiting Streams in Instruction and Data Address Trace Compression Aleksandar Milenkovi, Milena Milenkovi Electrical and Computer Engineering Dept., The University of Alabama in Huntsville Email: {milenka

More information

The Smart Cache: An Energy-Efficient Cache Architecture Through Dynamic Adaptation

The Smart Cache: An Energy-Efficient Cache Architecture Through Dynamic Adaptation Noname manuscript No. (will be inserted by the editor) The Smart Cache: An Energy-Efficient Cache Architecture Through Dynamic Adaptation Karthik T. Sundararajan Timothy M. Jones Nigel P. Topham Received:

More information

A Cross-Architectural Interface for Code Cache Manipulation. Kim Hazelwood and Robert Cohn

A Cross-Architectural Interface for Code Cache Manipulation. Kim Hazelwood and Robert Cohn A Cross-Architectural Interface for Code Cache Manipulation Kim Hazelwood and Robert Cohn Software-Managed Code Caches Software-managed code caches store transformed code at run time to amortize overhead

More information

Implicitly-Multithreaded Processors

Implicitly-Multithreaded Processors Implicitly-Multithreaded Processors School of Electrical & Computer Engineering Purdue University {parki,vijay}@ecn.purdue.edu http://min.ecn.purdue.edu/~parki http://www.ece.purdue.edu/~vijay Abstract

More information

Demand fetching is commonly employed to bring the data

Demand fetching is commonly employed to bring the data Proceedings of 2nd Annual Conference on Theoretical and Applied Computer Science, November 2010, Stillwater, OK 14 Markov Prediction Scheme for Cache Prefetching Pranav Pathak, Mehedi Sarwar, Sohum Sohoni

More information

Computer System. Performance

Computer System. Performance Computer System Performance Virendra Singh Associate Professor Computer Architecture and Dependable Systems Lab Department of Electrical Engineering Indian Institute of Technology Bombay http://www.ee.iitb.ac.in/~viren/

More information

DATA CACHE PREFETCHING USING

DATA CACHE PREFETCHING USING DATA AHE PREFETHING USING A GLOBAL HISTORY BUFFER BY ORGANIZING DATA AHE PREFETH INFORMATION IN A NEW WAY, A GHB SUPPORTS EXISTING PREFETH ALGORITHMS MORE EFFETIVELY THAN ONVENTIONAL PREFETH TABLES. IT

More information

Optimizing SMT Processors for High Single-Thread Performance

Optimizing SMT Processors for High Single-Thread Performance University of Maryland Inistitute for Advanced Computer Studies Technical Report UMIACS-TR-2003-07 Optimizing SMT Processors for High Single-Thread Performance Gautham K. Dorai, Donald Yeung, and Seungryul

More information

SEVERAL studies have proposed methods to exploit more

SEVERAL studies have proposed methods to exploit more IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 16, NO. 4, APRIL 2005 1 The Impact of Incorrectly Speculated Memory Operations in a Multithreaded Architecture Resit Sendag, Member, IEEE, Ying

More information

A Hybrid Adaptive Feedback Based Prefetcher

A Hybrid Adaptive Feedback Based Prefetcher A Feedback Based Prefetcher Santhosh Verma, David M. Koppelman and Lu Peng Department of Electrical and Computer Engineering Louisiana State University, Baton Rouge, LA 78 sverma@lsu.edu, koppel@ece.lsu.edu,

More information

Picking Statistically Valid and Early Simulation Points

Picking Statistically Valid and Early Simulation Points In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), September 23. Picking Statistically Valid and Early Simulation Points Erez Perelman Greg Hamerly

More information

A Self-Repairing Prefetcher in an Event-Driven Dynamic Optimization Framework

A Self-Repairing Prefetcher in an Event-Driven Dynamic Optimization Framework In Proceedings of the International Symposium on Code Generation and Optimization (CGO 2006). A Self-Repairing Prefetcher in an Event-Driven Dynamic Optimization Framework Weifeng Zhang Brad Calder Dean

More information

Speculative Multithreaded Processors

Speculative Multithreaded Processors Guri Sohi and Amir Roth Computer Sciences Department University of Wisconsin-Madison utline Trends and their implications Workloads for future processors Program parallelization and speculative threads

More information

Hardware Loop Buffering

Hardware Loop Buffering Hardware Loop Buffering Scott DiPasquale, Khaled Elmeleegy, C.J. Ganier, Erik Swanson Abstract Several classes of applications can be characterized by repetition of certain behaviors or the regular distribution

More information

Accelerating and Adapting Precomputation Threads for Efficient Prefetching

Accelerating and Adapting Precomputation Threads for Efficient Prefetching In Proceedings of the 13th International Symposium on High Performance Computer Architecture (HPCA 2007). Accelerating and Adapting Precomputation Threads for Efficient Prefetching Weifeng Zhang Dean M.

More information

Mapping of Applications to Heterogeneous Multi-cores Based on Micro-architecture Independent Characteristics

Mapping of Applications to Heterogeneous Multi-cores Based on Micro-architecture Independent Characteristics Mapping of Applications to Heterogeneous Multi-cores Based on Micro-architecture Independent Characteristics Jian Chen, Nidhi Nayyar and Lizy K. John Department of Electrical and Computer Engineering The

More information

Memory Consistency. Challenges. Program order Memory access order

Memory Consistency. Challenges. Program order Memory access order Memory Consistency Memory Consistency Memory Consistency Reads and writes of the shared memory face consistency problem Need to achieve controlled consistency in memory events Shared memory behavior determined

More information

Exploiting Core Working Sets to Filter the L1 Cache with Random Sampling

Exploiting Core Working Sets to Filter the L1 Cache with Random Sampling Exploiting Core Working Sets to Filter the L Cache with Random Sampling Yoav Etsion and Dror G. Feitelson Abstract Locality is often characterized by working sets, defined by Denning as the set of distinct

More information

José F. Martínez 1, Jose Renau 2 Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2

José F. Martínez 1, Jose Renau 2 Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2 CHERRY: CHECKPOINTED EARLY RESOURCE RECYCLING José F. Martínez 1, Jose Renau 2 Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2 1 2 3 MOTIVATION Problem: Limited processor resources Goal: More

More information

Workloads, Scalability and QoS Considerations in CMP Platforms

Workloads, Scalability and QoS Considerations in CMP Platforms Workloads, Scalability and QoS Considerations in CMP Platforms Presenter Don Newell Sr. Principal Engineer Intel Corporation 2007 Intel Corporation Agenda Trends and research context Evolving Workload

More information

Guided Region Prefetching: A Cooperative Hardware/Software Approach

Guided Region Prefetching: A Cooperative Hardware/Software Approach Guided Region Prefetching: A Cooperative Hardware/Software Approach Zhenlin Wang Ý Doug Burger Ü Kathryn S. McKinley Ü Steven K. Reinhardt Þ Charles C. Weems Ý Ý Dept. of Computer Science Ü Dept. of Computer

More information

Computer Architecture Spring 2016

Computer Architecture Spring 2016 omputer Architecture Spring 2016 Lecture 09: Prefetching Shuai Wang Department of omputer Science and Technology Nanjing University Prefetching(1/3) Fetch block ahead of demand Target compulsory, capacity,

More information

Efficient Architecture Support for Thread-Level Speculation

Efficient Architecture Support for Thread-Level Speculation Efficient Architecture Support for Thread-Level Speculation A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY Venkatesan Packirisamy IN PARTIAL FULFILLMENT OF THE

More information

Chapter-5 Memory Hierarchy Design

Chapter-5 Memory Hierarchy Design Chapter-5 Memory Hierarchy Design Unlimited amount of fast memory - Economical solution is memory hierarchy - Locality - Cost performance Principle of locality - most programs do not access all code or

More information

Administrivia. CMSC 411 Computer Systems Architecture Lecture 14 Instruction Level Parallelism (cont.) Control Dependencies

Administrivia. CMSC 411 Computer Systems Architecture Lecture 14 Instruction Level Parallelism (cont.) Control Dependencies Administrivia CMSC 411 Computer Systems Architecture Lecture 14 Instruction Level Parallelism (cont.) HW #3, on memory hierarchy, due Tuesday Continue reading Chapter 3 of H&P Alan Sussman als@cs.umd.edu

More information

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Huiyang Zhou School of Computer Science University of Central Florida New Challenges in Billion-Transistor Processor Era

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Cache Insertion Policies to Reduce Bus Traffic and Cache Conflicts

Cache Insertion Policies to Reduce Bus Traffic and Cache Conflicts Cache Insertion Policies to Reduce Bus Traffic and Cache Conflicts Yoav Etsion Dror G. Feitelson School of Computer Science and Engineering The Hebrew University of Jerusalem 14 Jerusalem, Israel Abstract

More information

Implicitly-Multithreaded Processors

Implicitly-Multithreaded Processors Appears in the Proceedings of the 30 th Annual International Symposium on Computer Architecture (ISCA) Implicitly-Multithreaded Processors School of Electrical & Computer Engineering Purdue University

More information

Improving Achievable ILP through Value Prediction and Program Profiling

Improving Achievable ILP through Value Prediction and Program Profiling Improving Achievable ILP through Value Prediction and Program Profiling Freddy Gabbay Department of Electrical Engineering Technion - Israel Institute of Technology, Haifa 32000, Israel. fredg@psl.technion.ac.il

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Exploiting Incorrectly Speculated Memory Operations in a Concurrent Multithreaded Architecture (Plus a Few Thoughts on Simulation Methodology)

Exploiting Incorrectly Speculated Memory Operations in a Concurrent Multithreaded Architecture (Plus a Few Thoughts on Simulation Methodology) Exploiting Incorrectly Speculated Memory Operations in a Concurrent Multithreaded Architecture (Plus a Few Thoughts on Simulation Methodology) David J Lilja lilja@eceumnedu Acknowledgements! Graduate students

More information

SPM Management Using Markov Chain Based Data Access Prediction*

SPM Management Using Markov Chain Based Data Access Prediction* SPM Management Using Markov Chain Based Data Access Prediction* Taylan Yemliha Syracuse University, Syracuse, NY Shekhar Srikantaiah, Mahmut Kandemir Pennsylvania State University, University Park, PA

More information

15-740/ Computer Architecture Lecture 16: Prefetching Wrap-up. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 16: Prefetching Wrap-up. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 16: Prefetching Wrap-up Prof. Onur Mutlu Carnegie Mellon University Announcements Exam solutions online Pick up your exams Feedback forms 2 Feedback Survey Results

More information

Exploring Wakeup-Free Instruction Scheduling

Exploring Wakeup-Free Instruction Scheduling Exploring Wakeup-Free Instruction Scheduling Jie S. Hu, N. Vijaykrishnan, and Mary Jane Irwin Microsystems Design Lab The Pennsylvania State University Outline Motivation Case study: Cyclone Towards high-performance

More information

A Study of the Performance Potential for Dynamic Instruction Hints Selection

A Study of the Performance Potential for Dynamic Instruction Hints Selection A Study of the Performance Potential for Dynamic Instruction Hints Selection Rao Fu 1,JiweiLu 2, Antonia Zhai 1, and Wei-Chung Hsu 1 1 Department of Computer Science and Engineering University of Minnesota

More information

Reproducible Simulation of Multi-Threaded Workloads for Architecture Design Exploration

Reproducible Simulation of Multi-Threaded Workloads for Architecture Design Exploration Reproducible Simulation of Multi-Threaded Workloads for Architecture Design Exploration Cristiano Pereira, Harish Patil, Brad Calder $ Computer Science and Engineering, University of California, San Diego

More information

Understanding Cache Interference

Understanding Cache Interference Understanding Cache Interference by M.W.A. Settle B.A., University of Colorado, 1996 M.S., University of Colorado, 2001 A thesis submitted to the Faculty of the Graduate School of the University of Colorado

More information

Skewed-Associative Caches: CS752 Final Project

Skewed-Associative Caches: CS752 Final Project Skewed-Associative Caches: CS752 Final Project Professor Sohi Corey Halpin Scot Kronenfeld Johannes Zeppenfeld 13 December 2002 Abstract As the gap between microprocessor performance and memory performance

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

A Feasibility Study for Methods of Effective Memoization Optimization

A Feasibility Study for Methods of Effective Memoization Optimization A Feasibility Study for Methods of Effective Memoization Optimization Daniel Mock October 2018 Abstract Traditionally, memoization is a compiler optimization that is applied to regions of code with few

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Integrated CPU and Cache Power Management in Multiple Clock Domain Processors

Integrated CPU and Cache Power Management in Multiple Clock Domain Processors Integrated CPU and Cache Power Management in Multiple Clock Domain Processors Nevine AbouGhazaleh, Bruce Childers, Daniel Mossé & Rami Melhem Department of Computer Science University of Pittsburgh HiPEAC

More information

Using Aggressor Thread Information to Improve Shared Cache Management for CMPs

Using Aggressor Thread Information to Improve Shared Cache Management for CMPs Appears in Proc. of the 18th Int l Conf. on Parallel Architectures and Compilation Techniques. Raleigh, NC. Sept. 2009. Using Aggressor Thread Information to Improve Shared Cache Management for CMPs Wanli

More information

Analyzing and Quantifying Dynamc Program Behavior in Terms of Regularities and Patterns

Analyzing and Quantifying Dynamc Program Behavior in Terms of Regularities and Patterns University of Rhode Island DigitalCommons@URI Open Access Dissertations 2013 Analyzing and Quantifying Dynamc Program Behavior in Terms of Regularities and Patterns Celal Ozturk University of Rhode Island,

More information

Dynamic Speculative Precomputation

Dynamic Speculative Precomputation In Proceedings of the 34th International Symposium on Microarchitecture, December, 2001 Dynamic Speculative Precomputation Jamison D. Collins y, Dean M. Tullsen y, Hong Wang z, John P. Shen z y Department

More information

The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared-Memory Multiprocessor

The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared-Memory Multiprocessor IEEE Transactions on Parallel and Distributed Systems, Vol. 5, No. 6, June 1994, pp. 573-584.. The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared-Memory Multiprocessor David J.

More information

Continuous Adaptive Object-Code Re-optimization Framework

Continuous Adaptive Object-Code Re-optimization Framework Continuous Adaptive Object-Code Re-optimization Framework Howard Chen, Jiwei Lu, Wei-Chung Hsu, and Pen-Chung Yew University of Minnesota, Department of Computer Science Minneapolis, MN 55414, USA {chenh,

More information