MODELING EFFECTS OF SPECULATIVE INSTRUCTION EXECUTION IN A FUNCTIONAL CACHE SIMULATOR

BY

AMOL SHAMKANT PANDIT, B.E.

A thesis submitted to the Graduate School in partial fulfillment of the requirements for the degree Master of Science in Electrical Engineering

New Mexico State University
Las Cruces, New Mexico

May 2005

"Modeling Effects of Speculative Instruction Execution in a Functional Cache Simulator," a thesis prepared by Amol Shamkant Pandit in partial fulfillment of the requirements for the degree, Master of Science in Electrical Engineering, has been approved and accepted by the following:

Linda Lacey
Dean of the Graduate School

Jeanine Cook
Chair of the Examination Committee

Date

Committee in charge:
Dr. Jeanine Cook, Chair
Dr. Wolfgang Mueller
Dr. Steve Stochaj

DEDICATION

To my parents, Mr. Shamkant Pandit and Mrs. Smita Pandit, and my brother Yogesh, for their love and support.

ACKNOWLEDGEMENTS

I would like to thank my academic advisor, Dr. Jeanine Cook, for her help and guidance throughout my Master's program. This work would not have been possible without her continuous encouragement and her ideas. I thank Dr. Steve Stochaj for providing excellent suggestions, both during my research work and the thesis review, and Dr. Wolfgang Mueller for agreeing to be the Dean's representative during my thesis defense. I also thank my colleague, Ramkumar Srinivasan, for the numerous technical discussions we had during my work as a Graduate Assistant. Finally, I thank my roommates and other friends for making my stay in Las Cruces an enjoyable one.

VITA

October 20, 1977: Born at Pune, Maharashtra, India
1999: Bachelor of Engineering (B.E.), University of Pune, Pune, Maharashtra, India
Software Engineer, Tata Technologies, Pune, Maharashtra, India
IC Design Engineer, Texas Instruments, Bangalore, Karnataka, India
Graduate Research Assistant, Klipsch School of Electrical and Computer Engineering, New Mexico State University, Las Cruces, New Mexico

FIELD OF STUDY

Major Field: Electrical Engineering (Computer Engineering)

ABSTRACT

MODELING EFFECTS OF SPECULATIVE INSTRUCTION EXECUTION IN A FUNCTIONAL CACHE SIMULATOR

BY
AMOL SHAMKANT PANDIT, B.E.

Master of Science in Electrical Engineering
New Mexico State University
Las Cruces, New Mexico, 2005
Dr. Jeanine Cook, Chair

Functional simulation is a fast technique to determine the performance of cache memories in terms of the cache hit and miss rates for various benchmarks. Functional cache simulators achieve their fast simulation speed by tracking only the programmer-visible state of the processor and by not simulating memory references along the mis-predicted branch paths during program execution. In this thesis, we present quantitative results demonstrating how far the data cache miss rates measured by functional cache simulators, which omit mis-predicted path memory references, deviate from the miss rates measured by detailed architectural simulators, which track the complete state of the processor on a cycle-by-cycle basis. We model the effects of speculatively executing mis-predicted path instructions in a functional cache simulator, and show that our models are successful in reducing the level-1 data cache miss rate difference between the functional cache simulator and the detailed architectural simulator for the integer subset of the SPEC CPU2000 benchmark suite, with little reduction in simulation speed.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 INTRODUCTION
  1.1 Branch Prediction
  1.2 Speculative Instruction Execution
  1.3 Memory Hierarchy and On-chip Caches
    1.3.1 Cache Organization
    1.3.2 Cache Performance Metrics
  1.4 Computer Architecture Simulators
  1.5 Motivation
  1.6 Organization
2 RELATED WORK
3 PROBLEM STATEMENT AND SOLUTION OVERVIEW
  3.1 Problem Statement
  3.2 Solution Overview
    3.2.1 Effects of Mis-predicted Branch Path Execution on L-1 D-cache
    3.2.2 Modeling Mis-prediction Effects in Functional Cache Simulator
4 SIMULATION ENVIRONMENT
  4.1 The SimpleScalar Tool Set
  4.2 Benchmarks
  4.3 Hardware Platform
5 MIS-PREDICTED PATH EXECUTION EFFECTS ON DATA CACHE
  5.1 Detailed Out-of-order Processor Simulator
    5.1.1 Sim-outorder Internals
    5.1.2 Enhancing Sim-outorder to Quantify Mis-prediction Effects
  5.2 Experimental Results
    5.2.1 Sensitivity to Changes in Data Cache Parameters
  5.3 Analysis of Results
6 MODELS OF MIS-PREDICTION EFFECTS IN FUNCTIONAL CACHE SIMULATOR
  6.1 Difference in Miss Rates of Functional and Detailed Simulators
  6.2 Developing Models of Mis-predicted Path Execution Effects in Sim-cache
    6.2.1 Branch Prediction in Sim-cache
    6.2.2 Modeling Instruction Execution on the Mis-predicted Paths
  6.3 Techniques to Model the Number of Speculatively Executed Instructions
    6.3.1 Random Number of Instructions
    6.3.2 Data Dependence of Conditional Branch
    6.3.3 Data Dependence of Speculative Load
    6.3.4 Combined Model
  6.4 Experimental Results
    6.4.1 Reduction in the Deviation of Interval Miss Rates
    6.4.2 Reduction in Simulation Speedup
  6.5 Analysis of Results
    6.5.1 Sensitivity of Simulator Models to Cache Parameters
7 CONCLUSIONS
8 FUTURE WORK
APPENDICES
  A. PLOTS OF L-1 D-CACHE INTERVAL MISS RATE DIFFERENCE
  B. SIM-OUTORDER AND SIM-CACHE OPTIONS USED IN SIMULATIONS
  C. PLOTS OF PROBABILITY DISTRIBUTION OF SPECULATIVE INSTRUCTION NUMBERS
  D. PLOTS OF MISS RATE DIFFERENCE WITH MIS-PREDICTION MODELS
REFERENCES

LIST OF TABLES

3.1 Cumulative L-1 D-cache Miss Rates for Sim-cache and Sim-outorder
4.1 SPEC CPU2000 Integer and Floating Point Benchmarks
5.1 Mis-predicted Path Execution Effects (A) L-1 D-cache (CINT2000)
5.2 Mis-predicted Path Execution Effects (B) L-1 D-cache (CINT2000)
5.3 Mis-predicted Path Execution Effects (C) L-1 D-cache (CINT2000)
5.4 Mis-predicted Path Execution Effects (A) L-1 D-cache (CFP2000)
5.5 Mis-predicted Path Execution Effects (B) L-1 D-cache (CFP2000)
5.6 Mis-predicted Path Execution Effects (C) L-1 D-cache (CFP2000)
5.7 Branch Prediction Accuracy and Percentage of Branch Instructions for SPEC CPU2000 Benchmarks
5.8 Sensitivity of Mis-prediction Effects to Changes in Data Cache Parameters
6.1 Summarizing the Difference in Interval Miss Rates of Sim-cache and Sim-outorder for SPEC CPU2000 Benchmarks
6.2 Functional Unit Latency Values
6.3 Sim-cache-bpred-rand Results (CINT2000)
6.4 Sim-cache-bpred-rand Results (CFP2000)
6.5 Sim-cache-bpred-brdep Results (CINT2000)
6.6 Sim-cache-bpred-brdep Results (CFP2000)
6.7 Sim-cache-bpred-lddep Results (CINT2000)
6.8 Sim-cache-bpred-lddep Results (CFP2000)
6.9 Sim-cache-bpred-comb Results (CINT2000)
6.10 Sim-cache-bpred-comb Results (CFP2000)
6.11 Simulation Times (in seconds) CINT2000 Workloads
6.12 Simulation Times (in seconds) CFP2000 Workloads
6.13 Simulation Speedups CINT2000 Workloads
6.14 Simulation Speedups CFP2000 Workloads
6.15 Sensitivity of Functional Cache Simulator Models to Cache Parameters (181.mcf)
6.16 Sensitivity of Functional Cache Simulator Models to Cache Parameters (179.art)
B.1 Sim-outorder Processor Configuration Simulated
B.2 Sim-cache Cache Hierarchy Configuration Simulated

LIST OF FIGURES

1.1 Typical Memory Hierarchy Organization in Modern Processors
1.2 Types of Architectural Simulators
3.1 Difference in L-1 D-cache Interval Miss Rate (176.gcc-expr)
3.2 Difference in L-1 D-cache Interval Miss Rate (187.facerec-ref)
5.1 The Pipeline Simulated in Sim-outorder
5.2 Main Loop of Sim-outorder
5.3 Additions to the Sim-outorder Code
6.1 A Code Segment Illustrating a Branch Instruction
6.2 Probability Distribution of Speculative Instructions (176.gcc-expr)
6.3 Probability Distribution of Speculative Instructions (252.eon-rushmeir)
6.4 Probability Distribution of Speculative Instructions (301.apsi-ref)
6.5 Probability Distribution of Speculative Instructions (179.art-110)
6.6 Latency Calculation of an Instruction
6.7 Data Dependence of Speculative Load Model
6.8 Modified Functional Cache Simulator Main Loop
6.9 Percent Decrease in Mean (CINT2000)
6.10 Percent Decrease in Standard Deviation (CINT2000)
6.11 Percent Increase in CSM (CINT2000)
6.12 Speedups for Different Models of Functional Simulator (CINT2000)
A.1 Difference in L-1 D-cache Interval Miss Rate (164.gzip-log)
A.2 Difference in L-1 D-cache Interval Miss Rate (175.vpr-route)
A.3 Difference in L-1 D-cache Interval Miss Rate (181.mcf-ref)
A.4 Difference in L-1 D-cache Interval Miss Rate (186.crafty-ref)
A.5 Difference in L-1 D-cache Interval Miss Rate (197.parser-ref)
A.6 Difference in L-1 D-cache Interval Miss Rate (252.eon-rushmeir)
A.7 Difference in L-1 D-cache Interval Miss Rate (253.perlbmk-diffmail)
A.8 Difference in L-1 D-cache Interval Miss Rate (254.gap-ref)
A.9 Difference in L-1 D-cache Interval Miss Rate (255.vortex-lendian1)
A.10 Difference in L-1 D-cache Interval Miss Rate (256.bzip2-source)
A.11 Difference in L-1 D-cache Interval Miss Rate (300.twolf-ref)
A.12 Difference in L-1 D-cache Interval Miss Rate (168.wupwise-ref)
A.13 Difference in L-1 D-cache Interval Miss Rate (177.mesa-ref)
A.14 Difference in L-1 D-cache Interval Miss Rate (179.art-110)
A.15 Difference in L-1 D-cache Interval Miss Rate (301.apsi-ref)
A.16 Difference in L-1 D-cache Interval Miss Rate (164.gzip-log) with Interval Size of
A.17 Difference in L-1 D-cache Interval Miss Rate (181.mcf-ref) with Interval Size of
C.1 Probability Distribution of Speculative Instructions (188.ammp-ref)
C.2 Probability Distribution of Speculative Instructions (183.equake-ref)
C.3 Probability Distribution of Speculative Instructions (164.gzip-log)
C.4 Probability Distribution of Speculative Instructions (256.bzip2-source)
D.1 Simulator Models Interval Miss Rate Differences (164.gzip-log)
D.2 Simulator Models Interval Miss Rate Differences (175.vpr-route)
D.3 Simulator Models Interval Miss Rate Differences (176.gcc-expr)
D.4 Simulator Models Interval Miss Rate Differences (181.mcf-ref)
D.5 Simulator Models Interval Miss Rate Differences (186.crafty-ref)
D.6 Simulator Models Interval Miss Rate Differences (197.parser-ref)
D.7 Simulator Models Interval Miss Rate Differences (252.eon-rushmeir)
D.8 Simulator Models Interval Miss Rate Differences (253.perlbmk-diffmail)
D.9 Simulator Models Interval Miss Rate Differences (254.gap-ref)
D.10 Simulator Models Interval Miss Rate Differences (255.vortex-lendian1)
D.11 Simulator Models Interval Miss Rate Differences (256.bzip2-source)
D.12 Simulator Models Interval Miss Rate Differences (300.twolf-ref)
D.13 Simulator Models Interval Miss Rate Differences (168.wupwise-ref)
D.14 Simulator Models Interval Miss Rate Differences (177.mesa-ref)
D.15 Simulator Models Interval Miss Rate Differences (179.art-110)
D.16 Simulator Models Interval Miss Rate Differences (187.facerec-ref)
D.17 Simulator Models Interval Miss Rate Differences (301.apsi-ref)

1 INTRODUCTION

Modern microprocessors include two distinct performance improvement techniques. Hierarchical memory organization with on-chip caches allows processors to exploit program locality. Dynamic branch prediction with speculative execution along the predicted path helps to reduce branch penalty stalls and to achieve instruction level parallelism [9].

Computer architects and researchers have generally adopted simulation as the preferred methodology for performance analysis and for feasibility studies of novel design ideas. Cache miss rate is one of the three metrics necessary for performance analysis of caches and is readily available as the output of cache simulators [5], which take as input the cache parameters and a memory reference trace from a program in execution. Such simulators are not concerned with the rest of the micro-architectural state of the processor. While this approach eliminates unnecessary computation and achieves a very fast simulation speed, it may not produce accurate results for modern processors: these simulators assume perfect branch prediction accuracy and hence do not simulate memory references along mis-predicted branch paths. Clearly there is a mismatch between the simulation model and the actual hardware. Detailed micro-architectural simulators exist [5] which give accurate results for cache miss rates; these simulators track the complete micro-architectural state of the processor on a cycle-by-cycle basis for the entire duration of program execution. However, it is too time consuming to use detailed architectural simulators to measure only cache performance metrics such as cache miss rates.

In this thesis, we demonstrate that there exists a clear deviation of the miss rate results of functional cache simulators from those of detailed micro-architectural simulators. We investigate the reasons for this deviation by quantifying the effects of mis-predicted branch path instruction execution on cache miss rate. Finally, we propose a modified cache simulator that models these effects and achieves results which are a clear improvement over the unmodified cache simulator.

This chapter provides a brief introduction to the micro-architectural concepts of interest to this thesis. In the following sections, we provide an overview of dynamic branch prediction with speculation, cache memory concepts and processor architecture simulator types.

1.1 Branch Prediction

Over the past decade and a half, computer architects have relied upon taking advantage of the instruction level parallelism available in a program to speed up its execution. In today's processors, many instructions are executed simultaneously. The Intel Pentium 4 can have a maximum of 126 instructions in flight at any time [10], whereas the Alpha processor can execute 80 instructions simultaneously [16]. The availability of pipelined and/or multiple execution units is a prerequisite for a processor to execute many instructions simultaneously. Processors that can issue multiple instructions for execution on every clock cycle are called superscalar processors.

Conditional branch instructions present a major obstacle to exploiting the available instruction level parallelism. If the branch condition is satisfied, the program control flow changes to start at the branch target address; otherwise, it continues at the next sequential instruction. It has been observed [9] that conditional branches account for anywhere from 10% to 20% of all dynamic instructions for the SPEC CINT2000 [28] benchmark suite. Clearly, a processor that stalls its pipeline for every conditional branch instruction, until the condition is evaluated to decide which direction to continue, will incur a large performance degradation due to frequent branch stalls.

To reduce the branch stall penalty, processors incorporate branch prediction schemes whose goal is to allow the processor to determine the outcome of a branch early, thus preventing control-flow-dependent stalls. When the processor encounters a conditional branch, the predictor makes a guess whether to take the branch or not, and the processor begins fetching instructions along the predicted path. If the prediction is correct, the processor is spared any branch stall penalty, which helps to speed up program execution. However, if the prediction turns out to be incorrect after the branch condition is resolved, the processor abandons instruction fetch along the wrong path and starts fetching instructions along the correct path. An incorrect prediction effectively means that the processor does incur the branch stall penalty, since the cycles spent fetching instructions along the mis-predicted path are wasted.

An important point to note here is that branch prediction does not necessarily imply that the instructions fetched from the predicted path are executed immediately. In processors lacking the capability of speculative predicted path instruction execution, the processor begins executing these instructions only after the branch condition is resolved and the prediction turns out to be correct.

Branch prediction techniques are broadly classified into static and dynamic techniques. In static prediction schemes, the prediction for a branch instruction remains the same at all times, irrespective of the dynamic behavior of the program; examples are the predict-not-taken and predict-taken techniques [9]. Static branch predictors are simple, with little hardware overhead, but their prediction accuracy is lower. Dynamic branch predictors involve elaborate hardware mechanisms whose prediction depends on the run-time behavior of the branch instructions: the prediction changes if the branch changes its behavior during execution. Dynamic predictors offer better branch prediction accuracy at the cost of significant hardware complexity and generally are preferred over static predictors.
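To make the dynamic prediction idea concrete, the following sketch shows a classic 2-bit saturating counter predictor of the kind described in [9]. It is an illustration written for this discussion, not code from any simulator in this thesis; the table size and the function names are hypothetical.

    /* Illustrative 2-bit saturating counter branch predictor.
     * Counter states 0-1 predict not-taken; states 2-3 predict taken. */
    #include <stdint.h>

    #define BPRED_ENTRIES 4096                 /* hypothetical table size */

    static uint8_t bpred_table[BPRED_ENTRIES]; /* each entry is a 2-bit counter */

    /* Predict the direction of the branch at address pc: 1 = taken, 0 = not. */
    int bpred_lookup(uint64_t pc)
    {
        return bpred_table[pc % BPRED_ENTRIES] >= 2;
    }

    /* After the branch resolves, train the counter toward the actual outcome. */
    void bpred_update(uint64_t pc, int taken)
    {
        uint8_t *ctr = &bpred_table[pc % BPRED_ENTRIES];
        if (taken) {
            if (*ctr < 3) (*ctr)++;
        } else {
            if (*ctr > 0) (*ctr)--;
        }
    }

Because the counter saturates, a branch that is almost always taken tolerates a single not-taken outcome without flipping its prediction, which is exactly the hysteresis that gives dynamic predictors their accuracy advantage over static schemes.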

1.2 Speculative Instruction Execution

Branch prediction techniques reduce the direct stalls attributable to branches, but for a processor executing multiple instructions per clock, just predicting branches accurately and fetching instructions along the predicted path may not be sufficient to extract the available instruction level parallelism. More parallelism can be exploited by predicting the branch outcome and conditionally executing instructions along the predicted path. This mechanism, called hardware-based speculation [9], represents a subtle but important extension over dynamic branch prediction. In particular, with speculation, instructions along the predicted path are fetched, issued and executed as if the branch prediction were always correct, and mechanisms are provided to handle the situation when the prediction turns out to be incorrect. Instructions are executed only up to the point where their results are computed. The results are not committed to the architectural state (such as the register file and memory), but are held inside buffers until the branch condition is resolved and the instructions are no longer speculative. If the branch prediction turns out to be incorrect, it is relatively straightforward to flush these buffers without committing the results to the architectural state.
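The buffer-and-flush idea can be sketched as a simple reorder buffer. This is a hedged illustration of the concept only; the structure, capacity and function names below are our own and do not describe any particular processor or simulator.

    /* Illustrative reorder buffer: results wait here until they are
     * non-speculative, and a mis-prediction discards them in place. */
    #include <stdint.h>

    #define ROB_SIZE 64                 /* hypothetical capacity */

    struct rob_entry {
        uint64_t result;                /* computed but uncommitted result */
        int      dest_reg;              /* architectural destination register */
        int      spec;                  /* 1 while control-dependent on an unresolved branch */
        int      done;                  /* result has been computed */
    };

    static struct rob_entry rob[ROB_SIZE];
    static int rob_head, rob_tail;      /* commit from head, allocate at tail */

    /* Commit completed, non-speculative instructions in program order. */
    void rob_commit(uint64_t regfile[])
    {
        while (rob_head != rob_tail &&
               rob[rob_head].done && !rob[rob_head].spec) {
            regfile[rob[rob_head].dest_reg] = rob[rob_head].result;
            rob_head = (rob_head + 1) % ROB_SIZE;
        }
    }

    /* On a mis-prediction, discard everything younger than the branch:
     * no squashed result ever reaches the architectural register file. */
    void rob_squash(int branch_index)
    {
        rob_tail = (branch_index + 1) % ROB_SIZE;
    }

Note that the squash only resets buffer state; as Section 1.5 discusses, any cache blocks already fetched by squashed loads remain in the cache.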

1.3 Memory Hierarchy and On-chip Caches

Computer programmers continuously demand larger amounts of faster memory. An economical solution to this demand, based on the principle of locality and on the cost-performance of memory technologies, is the memory hierarchy. The principle of locality [9] states that programs spend most of their execution time accessing only a small part of their data and code. This principle, together with the guidelines that smaller memory is faster and faster memory is more expensive, led to memory being organized into several hierarchical levels, as shown in Figure 1.1. A memory level nearer to the processor is smaller, faster and more expensive per byte than the levels farther from the processor. The levels nearest the processor are generally called cache memories. Modern processors have multiple levels of caches, with the level-1 and level-2 caches typically present on the same die as the processor and the level-3 cache (if present) being off chip. The level-1 cache is generally split into a level-1 data cache (L-1 D-cache) and a level-1 instruction cache (L-1 I-cache) to hold data and instructions separately.

Figure 1.1: Typical Memory Hierarchy Organization in Modern Processors

1.3.1 Cache Organization

A cache is typically split into a number of blocks, each consisting of a fixed number of bytes. Data transfer between a cache and the lower level of memory takes place one block at a time. Depending upon how blocks in the lower level of memory are mapped to blocks in the cache, there are three categories of cache organization. If each block from the lower level can appear in only one block of the cache, the cache is said to be direct mapped. If a block from the lower level can be placed in any block of the cache, the cache is fully associative. In the third organization, called set associative, cache blocks are grouped into sets: a block from the lower level is mapped onto exactly one set, but inside that set it can be placed in any block. The number of blocks in a set is called the associativity of the cache. A one-way set associative cache is direct mapped, a cache with only one set is fully associative, and a cache with n blocks per set is termed an n-way set associative cache.

1.3.2 Cache Performance Metrics

When a processor needs to access memory, either for instructions or for data, the memory address is first sent to the cache. If the requested item is present in the cache, the access is a cache hit; otherwise, it is a cache miss. The ratio of the total number of cache misses to the total number of cache accesses is called the cache miss rate. A cache hit saves the processor the time needed to access the slower main memory. The time taken to access the cache when the address hits is called the hit time. When an address misses in the cache, the lower level cache or the main memory must be accessed; the time required for this access is called the miss penalty. A good measure of cache performance [9] is the average cache access time, given by the formula:

Average Cache Access Time = Hit Time + Miss Rate * Miss Penalty

In this thesis, we will be mainly concerned with L-1 D-cache miss rates.
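As an illustration of how a set associative cache locates a block, the sketch below decomposes an address into its offset, set index and tag, and then evaluates the average access time formula above. The cache geometry (16 KB, 4-way, 32-byte blocks) and the timing numbers are assumptions chosen for the example, not measurements from this thesis.

    /* Address decomposition for a 16 KB, 4-way, 32-byte-block cache:
     * 16384 / (32 * 4) = 128 sets, i.e. 5 offset bits and 7 index bits. */
    #include <stdio.h>
    #include <stdint.h>

    #define BLOCK_SIZE 32
    #define ASSOC      4
    #define CACHE_SIZE 16384
    #define NSETS      (CACHE_SIZE / (BLOCK_SIZE * ASSOC))   /* 128 */

    int main(void)
    {
        uint64_t addr   = 0x7fffdeadbeef;   /* arbitrary example address */
        uint64_t offset = addr % BLOCK_SIZE;
        uint64_t set    = (addr / BLOCK_SIZE) % NSETS;
        uint64_t tag    = addr / (BLOCK_SIZE * (uint64_t)NSETS);
        printf("offset=%llu set=%llu tag=%llx\n",
               (unsigned long long)offset, (unsigned long long)set,
               (unsigned long long)tag);

        /* Average access time example: 1-cycle hit, 5% miss rate,
         * 20-cycle miss penalty -> 1 + 0.05 * 20 = 2.0 cycles. */
        double avg = 1.0 + 0.05 * 20.0;
        printf("average access time = %.1f cycles\n", avg);
        return 0;
    }

The worked numbers show why miss rate is the lever this thesis focuses on: with these assumed parameters, halving the miss rate from 5% to 2.5% cuts the average access time from 2.0 to 1.5 cycles without touching the hit time or miss penalty.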

1.4 Computer Architecture Simulators

A cursory glance at research papers in the field of computer architecture reveals that researchers generally prefer simulation as their performance analysis methodology over analytic modeling. Simulations can incorporate more details in the simulation model and require fewer assumptions than analytic modeling, and thus are more often closer to reality [11]. Simulation models therefore provide an easy way to predict performance or to compare several alternatives. Computer architecture simulation techniques can be broadly classified into two types, trace-driven simulation and execution-driven simulation [1]. Figure 1.2 shows the different types of simulation techniques.

Figure 1.2: Types of Architectural Simulators

In trace-driven simulation [4, 15], a trace is generated by running a workload with some instrumentation tool, and the monitored data is stored in a file. Traces include the dynamic instruction stream along with additional information such as memory reference addresses. Trace-driven simulators such as Atom are stable, familiar and hence widely used. Trace-driven simulation has advantages such as portability and flexibility, requiring only a trace file to drive the simulator. Its drawbacks are the inability to capture any speculative activity that takes place inside a processor, since traces include only the non-speculative instructions, and the huge size of trace files, which causes storage problems.

Execution-driven simulators emulate the hardware and actually execute the program to be simulated. Examples of this type are the simulators from the SimpleScalar tool set [5] and the SimOS simulator. Functional simulators [1] model the programmer-visible architecture (the instruction set architecture) of a processor; they do not incorporate any of the organizational details of the micro-architecture, such as pipelines. In direct-execution functional simulation, a program is instrumented to run directly on a host computer, and the instrumented code gathers the desired data during its execution. Interpreted functional simulators read instructions from the program one by one and execute them, tracking only the programmer-visible state of the processor. We will refer to these simulators as functional simulators henceforth. The advantage of functional simulation is its high speed; the disadvantages are that no speculative state is tracked and no timing information is provided.

Detailed execution-driven simulators track the cycle-by-cycle micro-architectural state of the processor while executing a program. These simulators allow fairly accurate performance analysis. However, their simulation speed is rather slow, due to the detailed tracking of the micro-architecture needed to gather timing information.

Stand-alone cache simulators (which simulate only caches) are either trace-driven or functional, since they model only the cache memories of the micro-architecture. Examples of functional cache simulators are sim-cache and sim-cheetah from the SimpleScalar tool set [5]. Dinero is an example of a trace-driven cache simulator.

1.5 Motivation

As seen in Section 1.2, hardware-based speculation combines two key ideas: dynamic branch prediction, to choose which instructions to execute, and speculation, to allow the execution of instructions before branch conditions are resolved, which includes the ability to buffer results of speculative instructions until they are no longer speculative. If the prediction turns out to be correct, the buffered results are committed to the micro-architectural state, such as the register file. If the prediction turns out to be incorrect, they are discarded and instruction execution begins at the correct branch target.

The effects of speculative instruction execution on the state of a cache memory, however, cannot be reversed. For example, speculatively executed load instructions can change the data cache contents (if there is a miss) before the branch condition is evaluated and the branch prediction is found to be incorrect. Such changes to the cache cannot be undone. Thus, speculation will cause a cache to exhibit different behavior than in the case where there is no speculation. A functional cache simulator, which does not model speculative or mis-predicted branch paths in a program, will be unable to capture this difference in cache behavior. (See Appendix A for plots showing the percent difference in L-1 D-cache miss rates of functional and detailed simulators, and Chapter 6 for quantification of the difference with our metrics introduced in Chapter 3.)

A straightforward solution is to use detailed execution-driven simulators, as described in Section 1.4, for cache performance analysis. However, this would mean that researchers lose the speedup advantage functional cache simulators offer over detailed architectural simulators. (See Chapter 6 for simulation time results; the speedup of the functional simulator over the detailed simulator is about 5 to 8.) To the best of our knowledge, no work has been done to model the effects of speculative instruction execution in functional cache simulators (see Chapter 2 for related work). Our goal in this research is not only to analyze the effects of speculatively executed mis-predicted branch paths on the L-1 D-cache miss rate, but also to model them in a functional cache simulator with as small an increase in functional cache simulation time as possible.

1.6 Organization

The rest of the thesis is organized as follows. Chapter 2 presents an overview of related work. In Chapter 3, we make a formal case for the research by providing quantitative data on L-1 D-cache miss rates using functional and detailed simulators, and we provide the solution overview. Our experimental setup, including the simulation tools, benchmarks and the hardware platform, is presented in Chapter 4. Chapter 5 describes the methodology used to quantify the effects of mis-predicted branch paths on the L-1 D-cache miss rate and the corresponding results. Chapter 6 explains the models incorporated in the functional cache simulator and the simulation results. Chapter 7 presents the conclusions and Chapter 8 outlines future work.

2 RELATED WORK

Many researchers have quantified the effects of mis-predicted branch path instruction execution on cache performance [18, 23, 8, 2, 17, 20]. However, very few studies [4, 15, 22, 3] have tried to model these effects in architectural simulators other than detailed execution-driven simulators, and to the best of our knowledge, none of the work specifically targets functional cache simulators.

Speculatively executing mis-predicted path load instructions, and allowing such loads to modify the cache contents on a miss by replacing a cache block, can have both beneficial and detrimental effects on cache performance during the correct path execution. If a cache block replaced during a mis-predicted path execution is accessed immediately during the subsequent correct path execution, there is an extra miss in the correct path and the effect of speculative execution is detrimental to cache performance. On the other hand, if a block fetched into the cache during a mis-predicted path execution is accessed during the correct path execution, there is an extra hit in the correct path and the effect of speculative execution is beneficial.

Pierce and Mudge [20] report the earliest work to quantify the effects of mis-predicted branch path instruction execution on caches. This work uses a code instrumentation tool, Spex, which generates a memory reference trace from the program binary. This trace is an approximation of the speculative processor memory references, and they simulate a subset of the SPEC CINT92 benchmark suite. The number of cycles required to resolve a conditional branch is approximated by a fixed number in this technique. Their results indicate that speculation does not significantly increase data memory traffic, and that the main effect of mis-predicted path execution is to pre-fetch cache blocks for later correct path execution (more than 50% of mis-predicted misses pre-allocate useful information) for both instruction and data caches.

Moudgill et al. [17] quantify the error introduced in the average number of instructions completed per cycle (IPC) by not simulating the mis-predicted paths, accounting for the additional level-1 (instruction and data) cache hits and misses that occur in the taken path due to instruction execution along the wrong path. Their approach is to simulate the processor twice, once with mis-predicted path execution and once without, using an out-of-order detailed processor simulator, Turandot. For simulation they use a subset of the SPEC CINT95 benchmark suite. They report that for 5 of the 9 workloads simulated, mis-predicted path memory references lead to more extra cache hits than extra cache misses in the taken path for both level-1 instruction and data caches; for 2 workloads the extra hits and misses are almost equal in number; and for the remaining 2 workloads the extra misses exceed the extra hits.

The work reported in [2] by Bahar and Albera contradicts the above results, which indicate that mis-predicted path memory references have a beneficial effect on cache performance during the correct path execution. They assume a priori that wrong path memory references are more likely to pollute the L-1 D-cache with data which is non-reusable in the correct path, and they suggest techniques to prevent the pollution of the data cache with unusable data blocks. They obtain their results using the SimpleScalar detailed execution-driven simulator and a subset of the SPEC CINT95 benchmark suite, and report a maximum performance improvement of 3.4% in terms of execution cycles.

Combs et al. [8] use their detailed performance simulator, fmw, on a subset of the SPEC CINT95 benchmark suite to conclude that the pollution/pre-fetching effect is benchmark dependent. Of the 8 benchmarks they simulate, 3 show a predominantly pre-fetching effect, 2 indicate more of a pollution effect, and for the remaining 3, neither effect is dominant. A reason for these contradictory results is that Combs et al. measure the pollution/pre-fetching effect in terms of change in IPC, whereas Pierce and Mudge [20] and Moudgill et al. [17] measure it in terms of extra cache hits and misses in the correct paths.

Sendag et al. [23] examine the effect of executing mis-predicted path load instructions in the context of a speculative multi-threaded architecture. This work considers four integer and two floating point programs from the SPEC CPU2000 benchmark suite. Using a detailed multi-threaded architecture simulator, SIMCA (which is based on the sim-outorder simulator from the SimpleScalar tool set), they quantify the pollution and pre-fetching effects of mis-predicted path memory references. They conclude that pollution is the dominant effect for lower associativity data caches and that pre-fetching is more likely for higher associativity data caches. To eliminate the pollution effect for lower associativity caches, they propose adding a small buffer to the cache to hold the blocks replaced during mis-predicted path execution.

In a recently published work [18], Mutlu et al. quantify the effects of mis-predicted path memory references on data caches in the uni-processor context, using the SPEC CINT2000 benchmark suite and a detailed execution-driven simulator. Their results conclude that pre-fetching of reusable data is more likely than pollution with unusable data for level-1 data and instruction caches (on average, 76% of wrong path misses pre-allocate cache blocks which are useful during the correct path execution), but that pollution is more likely for a unified level-2 cache for some of the benchmarks. They also report results for the error in the IPC measure due to not simulating mis-predicted branch paths.

Some researchers have noted the effects of mis-predicted branch path execution in passing. In [6], Cain et al. briefly discuss the effects of mis-predicted path execution on IPC and caches and report some results for IPC variation and memory stall time. Li et al. [14] consider modeling mis-predicted branch path execution effects on caches in the context of Worst Case Execution Time (WCET) analysis of software for embedded systems. Jourdan et al. [12] analyze the effects of mis-predicted branch path execution on the branch prediction mechanism itself.

Researchers who believe that executing mis-predicted path memory references is beneficial, due to the dominance of the pre-fetching effect over the pollution effect in the correct path, have suggested additional pre-fetch mechanisms to exploit this effect. Such work is presented in [19, 13, 24, 7]. For example, Sendag et al. [24] suggest a technique to continue the execution of load instructions along a mis-predicted path even after the branch is resolved. Lee et al. [13] advocate a similar strategy of always fetching instructions along speculative paths for instruction caches.

Most of the research cited above was performed using detailed execution-driven simulators that faithfully model all the micro-architectural details of a processor, including speculative execution, which is a rather time consuming activity. Although researchers do not agree on whether the effect of speculative branch path execution on data cache performance is beneficial or detrimental, they do agree that it is necessary to incorporate this effect in simulators used for performance analysis studies. We describe the work reported so far on modeling the effects of mis-predicted path instruction execution in simulators other than detailed architectural simulators in the remainder of this chapter.

Bhargava et al. [4] attempt to model the effects of speculative instruction fetching using traces generated by the simulator Shade. They create an approximate copy of the program code segment, called the resurrected code, to represent the approximate sequence of instruction addresses in the source code by taking one pass through the trace. At the end of the first pass, the resurrected code structure is complete. The next pass through the trace involves simulation with a speculative fetch unit. This unit accesses the resurrected code to fetch instructions along the path that would have been predicted by an actual processor fetch unit. Their simulation results on a subset of the SPEC CPU95 integer and floating point benchmarks and a suite of C++ programs show that the simulator can fetch from the proper branch target for 98.5% of all branches. However, this work is limited to determining the branch target only; it does not indicate how many instructions need to be executed along a speculative branch target during simulation to accurately account for the effects of mis-prediction.

Loh [15] proposes a time-stamping algorithm to model instruction execution latency in cycles in a functional simulator. His goal is to be able to predict IPC using a functional architecture simulator. But he does not model the effects of mis-predicted path execution in his simulator, sim-timestamp, which is based on the functional simulator from the SimpleScalar tool set.

In [3], Bechem et al. describe the fmw simulator. This simulator models the PowerPC micro-architecture, and they use it to study the mis-predicted path effects on caches and IPC, as reported in [8]. Their simulation system consists of a functional front-end simulator, PSIM, which interprets instructions, and a back-end simulator, MW, that models the micro-architecture. The back-end simulator instructs the front-end simulator when to simulate instructions along mis-predicted paths. Therefore, this is not a pure functional simulator. It is very similar in operation to the SimpleScalar [5] detailed architectural simulator sim-outorder, which also separates the functional instruction set simulation from the detailed cycle-by-cycle micro-architecture simulation. They do not report any speedup gain as compared to a detailed execution-driven simulator.

Reilly and Edmondson [22] provide an overview of the architectural simulation methodology used by the Alpha microprocessor design group at Digital Equipment Corporation. They state that their simulation setup (which is not a detailed micro-architectural simulation environment) accounts for mis-predicted branch path execution, but they do not explain their technique in detail.

In summary, very little work has been reported on modeling mis-predicted branch execution effects in a simulator that is not detailed execution-driven, and none of the reported work is specific to functional cache simulators. There is an obvious need for a functional cache simulation tool which incorporates the effects of speculative instruction execution, so that cache performance can be analyzed more accurately without spending too much time on detailed execution-driven simulation. Therefore, we feel justified in pursuing our work to model mis-predicted path execution effects in a functional cache simulator, to fill this gap in the simulation tools available to computer architecture researchers.

3 PROBLEM STATEMENT AND SOLUTION OVERVIEW

Chapter 2 provided an overview of the related work concerning the effects of speculative branch path execution on cache behavior and stressed the necessity of modeling mis-predicted branch effects in a functional cache simulator. In this chapter, we begin with our problem statement and describe the outline of the proposed solution.

3.1 Problem Statement

In this section, we provide quantitative results for the difference between functional cache simulation and detailed execution-driven simulation in the context of L-1 D-cache miss rates. We also identify metrics to measure this difference and state the goal of the research.

The dynamic behavior of a program can change dramatically with time during its execution [25]. Cumulative statistics, gathered at the end of the program execution, may hide the dynamic difference in the results of two different simulation techniques. Table 3.1 shows the cumulative L-1 D-cache miss rates for two programs from the SPEC CPU2000 benchmark suite, as measured by the functional cache simulator sim-cache and the detailed micro-architectural simulator sim-outorder, both from the SimpleScalar tool set. The 187.facerec simulation was terminated after 60 billion completed instructions. The difference in the cumulative results is negligible for the two workloads.

Table 3.1: Cumulative L-1 D-cache Miss Rates for Sim-cache and Sim-outorder (8 KB, 4-way L-1 D-cache with 32-byte block size; other options have default values)

Program       Input              Benchmark   Sim-cache Miss Rate (r1)   Sim-outorder Miss Rate (r2)   % Difference 100*(r1-r2)/r2
176.gcc       expr (reference)   CINT2000
187.facerec   reference          CFP2000

When we consider the dynamic difference in the L-1 D-cache miss rates measured over fixed-size program intervals, a different picture emerges. Figure 3.1 shows the difference between the sim-outorder and sim-cache miss rates for the 176.gcc workload (with the expr input), and Figure 3.2 shows the difference for the 187.facerec workload (with the reference input) with the same cache configuration. The miss rates are measured over program intervals of one million (completed) instructions each.

Figure 3.1: Difference in L-1 D-cache Interval Miss Rate (176.gcc-expr)

Figure 3.2: Difference in L-1 D-cache Interval Miss Rate (187.facerec-ref)

We observe that there are regions of program execution where the difference between the miss rates of the functional cache simulator and the detailed architectural simulator cannot be ignored. Plots for more workloads appear in Appendix A, and quantitative results summarizing the difference with our metrics (discussed later in this chapter) are presented in Chapter 6. A similar difference is observed for program intervals of other sizes (plots are presented at the end of Appendix A). Our aim in this research is to reduce the per-interval difference in miss rates as much as possible, with minimum penalty in terms of simulation time increase.
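As a sketch of how such interval statistics can be collected, the fragment below counts D-cache accesses and misses and emits a miss rate sample every one million committed instructions. This is an illustrative reconstruction with names of our choosing (stats_commit, stats_dcache_access); it is not the instrumentation actually used in the simulators.

    /* Emit an L-1 D-cache miss rate sample every 1M committed instructions. */
    #include <stdio.h>

    #define INTERVAL 1000000ULL

    static unsigned long long committed, interval_accesses, interval_misses;

    /* Called once per committed instruction. */
    void stats_commit(void)
    {
        if (++committed % INTERVAL == 0) {
            double rate = interval_accesses
                        ? (double)interval_misses / (double)interval_accesses
                        : 0.0;
            printf("%llu %.6f\n", committed / INTERVAL, rate);
            interval_accesses = interval_misses = 0;  /* start next interval */
        }
    }

    /* Called on every D-cache access; miss != 0 on a cache miss. */
    void stats_dcache_access(int miss)
    {
        interval_accesses++;
        if (miss)
            interval_misses++;
    }

Running both simulators with the same interval length yields the two aligned sequences, o[n] from sim-outorder and f[n] from sim-cache, that the metrics below compare.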

We use three metrics to measure the difference in the interval miss rates: the arithmetic mean and standard deviation of the absolute difference, and the Chi-squared-based Similarity Measure (CSM) [29]. We use the following terminology. Let o[n] represent the sequence of interval cache miss rates for detailed micro-architectural simulation (using sim-outorder) and let f[n] represent the sequence of interval cache miss rates for functional cache simulation (using sim-cache), where n = 1, 2, 3, ..., N. (N denotes the number of program intervals, each consisting of an equal number of completed instructions.) The error sequence e[n] is defined as:

e[n] = abs(f[n] - o[n]),    (1)

where abs() denotes the absolute value function. The mean (M) and the standard deviation (S) of the error sequence e[n] are two of the three metrics we use for measuring the difference between the functional and the detailed architectural simulators.

CSM is a new measure proposed by Srinivasan et al. [29] to detect the difference between the dynamic behaviors of two similar sequences of equal length. It is based on the Chi-square distance measure. The Chi-square distance, χ², between two distributions A and B having an equal population size and an equal number of bins is defined [21] as:

χ² = Σ_{i=1}^{n} (a_i - b_i)² / (a_i + b_i),    (a_i + b_i) ≠ 0,    (2)

where a_i and b_i are the values of the i-th bin in the distributions of A and B, respectively, and n is the number of bins in either distribution. The Chi-square distance is typically used to compute the Chi-squared probability, P(), which reflects the similarity between the distributions. However, for extremely large populations, P() is very sensitive to the number of bins. The CSM measure is defined as:

CSM = 1 - χ² / Σ_{j=1}^{n} (a_j + b_j)    (3)

CSM ranges between [0, 1] and is less sensitive to the number of bins for large populations. The higher the value of CSM, the greater the similarity between the two distributions. We use the CSM measure for o[n] and f[n] as our third metric. Thus, we have three metrics (M, S, and CSM) for measuring the difference between the functional cache simulator interval miss rate sequence and the detailed simulator interval miss rate sequence. We intend to enhance the functional cache simulator with models of speculative branch path execution so as to reduce M, reduce S and increase CSM.
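A small routine along the following lines can compute all three metrics from the two interval miss rate sequences. This is our own sketch, not code from [29] or from the thesis; in particular, it treats each interval as one bin (a_i = o[i], b_i = f[i]), which is one possible binning choice, and it skips bins where a_i + b_i = 0, as required by Equation (2).

    /* Compute M, S (mean and standard deviation of |f - o|) and CSM
     * for two interval miss rate sequences o[] and f[] of length N. */
    #include <math.h>
    #include <stddef.h>

    void miss_rate_metrics(const double *o, const double *f, size_t N,
                           double *M, double *S, double *CSM)
    {
        double sum = 0.0, sumsq = 0.0, chi2 = 0.0, norm = 0.0;
        for (size_t i = 0; i < N; i++) {
            double e = fabs(f[i] - o[i]);   /* error sequence e[n], Eq. (1) */
            sum   += e;
            sumsq += e * e;
            double t = o[i] + f[i];
            if (t != 0.0) {                 /* Eq. (2) excludes empty bins */
                chi2 += (o[i] - f[i]) * (o[i] - f[i]) / t;
                norm += t;
            }
        }
        *M = sum / N;
        *S = sqrt(sumsq / N - (*M) * (*M));            /* population std. dev. */
        *CSM = (norm != 0.0) ? 1.0 - chi2 / norm : 1.0; /* Eq. (3) */
    }

With this formulation, identical sequences give M = 0, S = 0 and CSM = 1, and any divergence between f[n] and o[n] pushes M and S up and CSM down, matching the improvement direction stated above.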

3.2 Solution Overview

Our work is broadly classified into two parts. We quantify the effects of speculative branch path execution on L-1 D-cache behavior, and we model these effects in a functional cache simulator.

3.2.1 Effects of Mis-predicted Branch Path Execution on L-1 D-cache

Functional cache simulators do not model mis-predicted branch path memory references to caches; they simulate memory references only along the correct branch targets. As a result, they do not capture the effects of mis-predicted branch path execution on cache behavior. In this work, we quantify the effects of mis-predicted branch path memory references on the overall behavior of the L-1 D-cache using a detailed architectural simulator that models the speculative branch path execution behavior of a processor. We measure the number of L-1 D-cache hits and misses along the mis-predicted branch paths and compare them with the total numbers of cache hits and misses. We determine the number of extra cache hits during the correct branch path execution that result from the pre-fetching of useful data cache blocks by memory references executed on the mis-predicted branch paths (blocks which are used later during the correct path execution). We also calculate the number of extra cache misses during the correct branch paths due to the removal of useful cache blocks and the pollution of the data cache with non-reusable cache blocks caused by memory references executed on the mis-predicted branch paths. Thus, we determine which of the two effects, pre-fetching or pollution, is dominant. (See Chapter 2 for related work with contrasting conclusions.)

We use the detailed architectural simulator, sim-outorder, from the SimpleScalar tool set in this work. By default, sim-outorder does not provide the cache statistics, such as the number of hits and misses for the mis-predicted branch path memory references, which we require to quantify the effects of mis-predicted path execution. We modify the sim-outorder program source code available in the SimpleScalar distribution to accomplish this task. We also add to sim-outorder the functionality to measure the extra hits due to pre-fetching and the extra misses due to pollution. We use a subset of the SPEC CPU2000 benchmark suite in this work for simulation. We measure the results for a fixed cache configuration and provide a sensitivity analysis by varying the cache parameters. This work is presented in Chapter 5.

3.2.2 Modeling Mis-prediction Effects in Functional Cache Simulator

In this work, we model the effects of mis-predicted branch path execution in the functional cache simulator sim-cache from the SimpleScalar tool set. Using the detailed simulator sim-outorder, we analyze the distribution of the number of speculative instructions executed along the mis-predicted branch paths before the branch condition is evaluated, the prediction is determined to be incorrect and the speculative instructions are flushed. We use a subset of the SPEC CPU2000 benchmark suite for simulation. We then augment sim-cache with a branch prediction mechanism to enable it to simulate mis-predicted branch path memory references, and we implement in the sim-cache program source code four different techniques to dynamically determine the number of speculative instructions executed along mis-predicted paths. We measure the (per-interval) L-1 D-cache miss rates for a number of workloads from the SPEC CPU2000 benchmark suite using the four techniques we implement in the sim-cache simulator, and we compare them with the unmodified sim-cache simulator results for closeness to the sim-outorder results, using the three metrics discussed in Section 3.1. This work is discussed in Chapter 6.

4 SIMULATION ENVIRONMENT

We present our experimental framework in this chapter. We explain the simulation tools, the benchmarks and the hardware platform used to perform the simulation experiments.

4.1 The SimpleScalar Tool Set

The SimpleScalar tool set [5, 1, 26] is a collection of simulation tools used for modeling applications for architecture and program performance analysis, detailed micro-architectural modeling, and hardware-software co-verification. SimpleScalar simulators can emulate the Alpha, PISA (a MIPS-like instruction set available for instructional use), ARM, and x86 instruction sets. The tool set includes a machine definition infrastructure that permits most architectural details to be separated from the simulator implementations.

The SimpleScalar tool set provides simulators ranging from a fast functional simulator (sim-fast) to a detailed, dynamically scheduled superscalar processor model (sim-outorder) that supports non-blocking caches, speculative execution, and state-of-the-art branch prediction. It has two functional cache simulators, sim-cache and sim-cheetah. We use sim-outorder in our work to quantify the effects of mis-predicted branch path memory references on the L-1 D-cache hits and misses during the correct branch path execution. We model these effects in the functional simulator sim-cache and compare its results to the cache statistics of sim-outorder.

4.2 Benchmarks

Benchmarks are sets of standardized applications used for evaluating the performance of computer systems. Different benchmark suites exist for evaluating different systems, such as desktop systems, engineering and scientific workstations, business servers, embedded processors and digital signal processors (DSPs). Benchmarks are designed to be representative of realistic programs and are used as inputs for simulations of new architectural models and compiler designs. In this section, we describe the SPEC CPU2000 benchmark suite, which we use in our research.

We choose the SPEC CPU2000 suite since it is the most widely used standardized benchmark for performance evaluation of micro-architectures and memory hierarchies in desktop and engineering/scientific environments. It is developed by the Standard Performance Evaluation Corporation [28], an organization composed of computer vendors, systems integrators, universities, research organizations, publishers and consultants, with the goal of establishing, maintaining and endorsing a standardized set of relevant benchmarks for computer systems. The SPEC CPU2000 suite contains real programs, modified for portability across architectures, to minimize the role of I/O in overall benchmark performance and to emphasize processor performance. It consists of 12 integer benchmarks (CINT2000) and 14 floating point benchmarks (CFP2000), as shown in Table 4.1.

Table 4.1: SPEC CPU2000 Integer and Floating Point Benchmarks

CINT2000 Benchmarks

Benchmark      Programming Language   Category
164.gzip       C                      Compression
175.vpr        C                      FPGA Circuit Placement and Routing
176.gcc        C                      C Programming Language Compiler
181.mcf        C                      Combinatorial Optimization
186.crafty     C                      Game Playing: Chess
197.parser     C                      Word Processing
252.eon        C++                    Computer Visualization
253.perlbmk    C                      PERL Programming Language
254.gap        C                      Group Theory, Interpreter
255.vortex     C                      Object-oriented Database
256.bzip2      C                      Compression
300.twolf      C                      Place and Route Simulator

CFP2000 Benchmarks

Benchmark      Programming Language   Category
168.wupwise    Fortran 77             Physics / Quantum Chromodynamics
171.swim       Fortran 77             Shallow Water Modeling
172.mgrid      Fortran 77             Multi-grid Solver: 3D Potential Field
173.applu      Fortran 77             Parabolic / Elliptic Partial Differential Equations
177.mesa       C                      3-D Graphics Library
178.galgel     Fortran 90             Computational Fluid Dynamics
179.art        C                      Image Recognition / Neural Networks
183.equake     C                      Seismic Wave Propagation Simulation
187.facerec    Fortran 90             Image Processing: Face Recognition
188.ammp       C                      Computational Chemistry
189.lucas      Fortran 90             Number Theory / Primality Testing
191.fma3d      Fortran 90             Finite-element Crash Simulation
200.sixtrack   Fortran 77             High Energy Nuclear Physics Accelerator Design
301.apsi       Fortran 77             Meteorology: Pollutant Distribution

For each workload included in the SPEC CPU2000 benchmark suite, three types of inputs are available: test, train and reference. The reference inputs are the largest of the three and are considered representative of realistic applications. Reference inputs result in billions of executed instructions and take a long time to simulate. For some of the benchmarks, more than one input is available in the reference set. We use the reference inputs in our simulations. However, some of the simulations are truncated before they finish, to avoid exceedingly long simulation times. Such simulations are identified wherever their results are presented.

4.3 Hardware Platform

Our hardware platform for running simulations is an 8-node cluster of workstations. Each node consists of two AMD Athlon processors operating at 2 GHz and 2 GB of RAM. The operating system on each node is Linux (SMP kernel). The nodes share a common file system through the NFS protocol.

We use the Alpha ISA target emulation in SimpleScalar. The binaries for the SPEC CPU2000 workloads are compiled with peak optimization levels for the Alpha AXP architecture with DEC C on Digital UNIX V4.0. The compiler optimization flags used are: -g3 -fast -O4 -arch ev6 -non_shared.

5 MIS-PREDICTED PATH EXECUTION EFFECTS ON DATA CACHE

In this chapter, we present the quantitative results and the analysis of the effects of executing mis-predicted path memory references on L-1 D-cache hits and misses. We use a fixed cache configuration to measure the results and probe their sensitivity to changes in the cache configuration. We begin by describing the SimpleScalar detailed out-of-order processor simulator (sim-outorder), which we use in our simulations, and the modifications made to its source code to measure the effects of mis-prediction on the data cache.

5.1 Detailed Out-of-order Processor Simulator

We use the detailed architectural simulator sim-outorder from SimpleScalar to quantify the effects of mis-predicted path memory references on the L-1 D-cache, since a functional cache simulator does not simulate mis-predicted branches. This section describes the internal details of sim-outorder and our additions to its code.

5.1.1 Sim-outorder Internals

The sim-outorder simulator is execution driven and tracks the cycle-by-cycle micro-architectural state of a processor in great detail. It models a superscalar processor with out-of-order instruction execution, in-order instruction commit, a two-level cache hierarchy with main memory, and branch prediction with speculative instruction execution. Most of the components of the micro-architecture modeled by sim-outorder are configurable by means of command line options used when starting the simulation. The simulator generates a number of performance statistics for different parts of the micro-architecture, such as caches, branch prediction buffers and functional units, as well as overall processor performance metrics such as the IPC.

Sim-outorder models out-of-order instruction issue and execution based on the Register Update Unit (RUU) scheme proposed in [27]. The RUU uses a re-order buffer (ROB) [9] to automatically rename registers and hold the results of instructions waiting to commit. The ROB retires completed instructions in program order to the architectural register file every cycle. The pipeline organization simulated in sim-outorder is shown in Figure 5.1.

Figure 5.1: The Pipeline Simulated in Sim-outorder

The main loop of sim-outorder is structured as shown in Figure 5.2. This loop is located inside the sim_main() function and is executed once for every simulated machine cycle. Every iteration of the loop traces the above pipeline in reverse, so that every machine cycle can be simulated correctly by executing the loop only once [5].

Figure 5.2: Main Loop of Sim-outorder
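For readers without the figure, the loop has roughly the following shape. The stage function names are the ones used in the text; the wrapper name sim_main_loop, the forward declarations, and the omission of termination checks and statistics are ours.

    /* Sketch of the per-cycle simulation loop: one iteration simulates one
     * machine cycle, visiting the pipeline stages in reverse order so commit
     * happens before fetch. Stage bodies live in the sim-outorder source. */
    void ruu_commit(void);
    void ruu_writeback(void);
    void lsq_refresh(void);
    void ruu_issue(void);
    void ruu_dispatch(void);
    void ruu_fetch(void);

    extern unsigned long long sim_cycle;

    void sim_main_loop(void)
    {
        for (;;) {
            ruu_commit();      /* retire completed, non-speculative instructions */
            ruu_writeback();   /* forward results; recover from mis-predicted branches */
            lsq_refresh();     /* update memory dependences in the load/store queue */
            ruu_issue();       /* issue ready instructions and model execution */
            ruu_dispatch();    /* decode, rename, enter instructions into RUU/LSQ */
            ruu_fetch();       /* fetch from the I-cache along the predicted path */
            sim_cycle++;       /* advance simulated time by one cycle */
        }
    }

Visiting the stages in reverse order means each stage consumes the state its upstream neighbor produced in the previous cycle, so a single pass per iteration models all stages operating concurrently.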

48 schedule the instructions whose register operands are ready for execution on every cycle. The availability of the required functional unit is checked before instruction issue. The function ruu_issue() also models the execute stage, during which the results of the instructions are calculated and scheduled to be available after the predefined latency of the functional units. For loads and stores (in the load/store queue), the execute stage involves computation of their effective addresses. Load instructions access the data cache once their address is computed. Speculative loads are allowed to access the data cache. However, speculative store values are held in the queue until the branch condition is resolved and the stores are no longer speculative. Non-speculative store values are written to the data cache in the later commit stage. The ruu_writeback() function models the writeback stage of the pipeline. In this stage, results of instructions which have completed execution are made available to any instructions whose operands depend on the completed instructions. Such instructions, which receive their operands, are marked as ready for issue. When a mispredicted branch instruction reaches this stage, the results of instructions executed along the mis-predicted path are discarded and the processor state is rolled back to the point where the mis-predicted branch occurred. The instruction flow then resumes at the correct branch path. The commit stage is modeled in the ruu_commit() function. In this stage, instructions are committed in program order. For store instructions, commit means updating the store values to the data cache; for other instructions, it means updating their results in the architectural register file. The ruu_commit() function retires 33

Enhancing Sim-outorder to Quantify Mis-prediction Effects

By default, the cache metrics generated by sim-outorder, such as the numbers of references, hits and misses, do not distinguish between those occurring along mis-predicted paths and those occurring along correct paths. Nor does sim-outorder quantify the prefetching and pollution effects that mis-predicted path memory references have on the cache during correct path execution. Fortunately, incorporating these enhancements in sim-outorder is relatively straightforward, since the complete source code for sim-outorder is available with the SimpleScalar distribution. Figure 5.3 provides a simplified view of the additions (shown in bold) made to the sim-outorder source code to enable it to quantify the effects of mis-prediction on L-1 D-cache hits and misses. We add four parameters to the L-1 D-cache structure: two to keep track of hits and misses along the mis-predicted paths, and two to measure the extra hits and extra misses along the correct paths, which quantify the pre-fetching and pollution effects of mis-prediction.

Figure 5.3: Additions to the Sim-outorder Code
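The figure itself is not reproduced in this transcription. As a minimal sketch of the kind of additions it describes, the four counters might be added to SimpleScalar's cache structure as below; the field names are hypothetical stand-ins for whatever Figure 5.3 actually uses (counter_t is SimpleScalar's statistics counter type).

/* Hypothetical counter fields added to struct cache_t (cache.h). */
struct cache_t {
    /* ... existing SimpleScalar fields: name, nsets, assoc,
       hits, misses, replacements, ... */
    counter_t spec_hits;    /* hits along mis-predicted paths           */
    counter_t spec_misses;  /* misses along mis-predicted paths         */
    counter_t extra_hits;   /* correct-path hits on blocks brought in by
                               mis-predicted-path loads (pre-fetching)  */
    counter_t extra_misses; /* correct-path misses on blocks evicted
                               during mis-predicted paths (pollution)   */
};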

The interface of the data cache read function for loads is modified to take an integer indicating whether the processor is currently executing instructions along a mis-predicted path or the correct path. Inside the data cache read function, two static integer arrays keep track of the data blocks replaced from and brought into the data cache during mis-predicted path execution. On each subsequent data cache read or write access along the correct path, the tag of the requested block is checked against these arrays to see whether the correct-path hit or miss is a result of block changes made during the previous mis-predicted path. If the tag matches an entry in the array tracking the blocks added during the mis-predicted path, the resulting hit is counted as an extra hit along the correct path (pre-fetching effect). If the tag matches an entry in the array tracking the blocks removed during the mis-predicted path, the resulting miss is counted as an extra miss along the correct path (pollution effect).

Care is taken in the code to count at most one extra hit along the subsequent correct path execution per entry in the array of blocks added during a mis-predicted path, and at most one extra miss per entry in the array of blocks removed during a mis-predicted path. The reasoning behind this strategy is that after a block is accessed for the first time along the correct path, it is bound to be present in the cache (until it is replaced again); the effect of mis-predicted path cache activity on a cache block is therefore valid only for the first access to that block along the next correct path. Measuring the cache hits and misses during the mis-predicted path itself is much simpler: if the cache read access function is called in mis-speculated mode (indicated by the integer parameter spec_mode passed to the function), the hits and misses are counted as mis-predicted hits and misses, respectively.
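A sketch of this bookkeeping is shown below with hypothetical names (blocks_added, blocks_removed, classify_correct_path_access); the real logic is embedded in SimpleScalar's cache access routine and must also reset the arrays each time a new mis-predicted path begins.

/* Tags of blocks brought in / evicted during the last mis-predicted
 * path; the "used" flags enforce at most one extra hit or miss per
 * entry on the following correct path. */
#define TRACK_MAX 64

static md_addr_t blocks_added[TRACK_MAX], blocks_removed[TRACK_MAX];
static int added_used[TRACK_MAX], removed_used[TRACK_MAX];
static int n_added, n_removed;

void classify_correct_path_access(struct cache_t *cp, md_addr_t tag, int was_hit)
{
    int i;
    if (was_hit) {
        for (i = 0; i < n_added; i++)
            if (!added_used[i] && blocks_added[i] == tag) {
                cp->extra_hits++;    /* pre-fetching effect */
                added_used[i] = 1;   /* count only the first access */
                break;
            }
    } else {
        for (i = 0; i < n_removed; i++)
            if (!removed_used[i] && blocks_removed[i] == tag) {
                cp->extra_misses++;  /* pollution effect */
                removed_used[i] = 1;
                break;
            }
    }
}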

5.2 Experimental Results

The quantitative results for the effects of executing mis-predicted branch path memory references on the L-1 D-cache are presented in this section. The memory hierarchy and the processor organization used for the simulations have their default values in sim-outorder (see Appendix B for a description of the processor configuration parameters simulated by sim-outorder). The L-1 D-cache, with which we are concerned in this research, is a 16 KB, 4-way set associative cache with a 32-byte block size and the least recently used [9] block replacement policy on misses. The access latency for hits is 1 cycle. The cache uses a write-back, write-allocate scheme [9]. The load/store queue size is 8 entries and the register update unit size is 16 entries. The results are summarized in Tables 5.1 to 5.6.

Sensitivity to Changes in Data Cache Parameters

In this section, we present the results of simulations performed to determine the sensitivity of the mis-prediction effects to changes in L-1 D-cache parameters. We vary one parameter at a time and measure the mis-prediction effects for two workloads, one integer and one floating point. The cache parameters selected are cache size, cache associativity and block size. The results are summarized in Table 5.8.

Table 5.1: Mis-predicted Path Execution Effects (A) L-1 D-cache (CINT2000)
Columns: Workload (ref input); Instructions Executed (A); Instructions Committed (B); Mis-predicted Instructions (A-B); Total Accesses (C); Total Hits (D); Total Misses (E).
Rows: 176.gcc (expr), 164.gzip (log), 181.mcf, 256.bzip2 (source), 252.eon (rushmeir), 300.twolf, 255.vortex (lendian2), 175.vpr (route), 253.perlbmk (diffmail).
(The numeric entries of this table were not preserved in this transcription.)

Table 5.2: Mis-predicted Path Execution Effects (B) L-1 D-cache (CINT2000)
Columns: Workload (ref input); Misses Mis-predicted Path (F); Hits Mis-predicted Path (G); Accesses Mis-predicted Path (F+G); Extra Hits Correct Path (H); Extra Misses Correct Path (I).
Rows: same workloads as Table 5.1. (Numeric entries not preserved.)

Table 5.3: Mis-predicted Path Execution Effects (C) L-1 D-cache (CINT2000)
Columns: Workload (ref input); % of Instructions in Mis-predicted Path; % of Total Cache Accesses in Mis-predicted Path; % of Total Misses in Mis-predicted Paths (F/E*100); % of Total Hits in Mis-predicted Paths (G/D*100); Extra Hits/Extra Misses in Correct Path (H/I).
Rows: same workloads as Table 5.1. (Numeric entries not preserved.)

Table 5.4: Mis-predicted Path Execution Effects (A) L-1 D-cache (CFP2000)
Columns: same as Table 5.1.
Rows: 187.facerec, 179.art (110), 191.fma3d, 183.equake, 189.lucas, 200.sixtrack, 177.mesa, 168.wupwise. (Numeric entries not preserved.)

Table 5.5: Mis-predicted Path Execution Effects (B) L-1 D-cache (CFP2000)
Columns: same as Table 5.2; rows: same workloads as Table 5.4. (Numeric entries not preserved.)

Table 5.6: Mis-predicted Path Execution Effects (C) L-1 D-cache (CFP2000)
Columns: same as Table 5.3; rows: same workloads as Table 5.4. (Numeric entries not preserved.)

Table 5.7: Branch Prediction Accuracy and Percentage of Branch Instructions for SPEC CPU2000 Benchmarks
Columns (for both the CINT2000 and CFP2000 halves): Workload; Branch Prediction Accuracy %; % of Branch Instructions Committed.
Rows: the CINT2000 and CFP2000 workloads of Tables 5.1 and 5.4. (Numeric entries not preserved.)

Analysis of Results

The results presented above indicate contrasting behavior for the integer and floating point workloads. All of the integer workloads simulated (except 255.vortex) have considerably more activity recorded in mis-predicted branch paths than any of the floating point workloads. For the integer workloads, the percentage of mis-predicted path instructions varies from 1.43% for 255.vortex to 16.78% for 253.perlbmk. The percentage of L-1 D-cache accesses in the mis-predicted path varies from 1.25% for 255.vortex to 15.59% for 181.mcf. The percentage of mis-predicted path L-1 D-cache misses ranges from 2.14% for 255.vortex to 30.72% for 253.perlbmk.

The percentage of mis-predicted path L-1 D-cache hits ranges from 1.24% for 255.vortex to 17.72% for 181.mcf.

Table 5.8: Sensitivity of Mis-prediction Effects to Changes in Data Cache Parameters
Columns: Workload; Variation in Cache Configuration (cache size, associativity, block size); % of Total Misses in Mis-predicted Paths; % of Total Hits in Mis-predicted Paths; Extra Hits/Extra Misses in Correct Path.
For each of 176.gcc (expr) and 179.art (110), the configurations are: none (16KB, 4-way, 32 byte); cache size variations (8KB, 4-way, 32 byte) and (32KB, 4-way, 32 byte); associativity variations (16KB, 2-way, 32 byte) and (16KB, 1-way, 32 byte); and block size variations (16KB, 4-way, 16 byte) and (16KB, 4-way, 64 byte). (Numeric entries not preserved.)

For the floating point workloads, the percentage of mis-predicted path instructions varies from 0.07% for 189.lucas to 5.39% for 179.art. The mis-predicted path L-1 D-cache accesses vary from 0.007% for 189.lucas to 2.50% for 168.wupwise. The mis-predicted path L-1 D-cache misses range from 0.002% for 168.wupwise to 6.26% for 200.sixtrack, and the mis-predicted path L-1 D-cache hits from 0.007% for 189.lucas to 2.57% for 168.wupwise.

For all of the integer workloads (except possibly 255.vortex), the mis-predicted path L-1 D-cache hits and misses form a considerable part of the total L-1 D-cache hits and misses. Clearly, the effects of executing mis-predicted path memory references on the L-1 D-cache should not be ignored in performance analysis studies, at least for integer benchmarks, when the total (cumulative) hits and misses are the metrics of concern. For the floating point benchmarks, the number of mis-predicted path L-1 D-cache hits and misses is negligible compared to the total hits and misses, except for the workload 200.sixtrack (6.26% of total misses in mis-predicted paths) and the workload 177.mesa (4.58% of total misses in mis-predicted paths). As such, for the floating point workloads, modeling the effects of mis-prediction on the L-1 D-cache is not really helpful when the metrics under consideration are the total numbers of hits and misses.

The results presented in Table 5.7 give a clue to why the mis-prediction effects are more pronounced for the integer workloads and less so for the floating point workloads. This table summarizes the branch prediction accuracy and the percentage of committed instructions that are conditional branches. The branch prediction accuracy is only slightly greater for the floating point benchmarks than for the integer benchmarks, but the percentage of conditional branch instructions in the dynamic instruction count is considerably greater for the integer workloads. These two observations, the slightly worse branch prediction accuracy and the greater number of dynamic conditional branches in the integer workloads, indicate that branch mis-prediction is more prevalent in the integer workloads than in the floating point workloads.

The results for the relative numbers of extra hits and extra misses in the correct path, due respectively to the pre-fetching of useful cache blocks into the L-1 D-cache and the replacement of required cache blocks during mis-predicted path execution, are also presented. Except for the three floating point workloads 179.art, 183.equake and 200.sixtrack, all the workloads show a greater number of extra hits than extra misses. This indicates that, for the selected cache organization, the pre-fetching effect is more likely than the pollution effect. Indeed, the number of extra hits is an order of magnitude or more larger than the number of extra misses for many workloads: for example, 300.twolf has 163 times as many extra hits as extra misses, 255.vortex 159 times, 189.lucas 496 times and 187.facerec 196 times. Therefore, we conclude that speculative path execution has an indirect beneficial effect on cache performance, in the form of the pre-fetching of useful cache blocks during mis-predicted path execution, in addition to the direct benefit of exploiting instruction level parallelism across the basic blocks bounded by conditional branches.

The sensitivity analysis experiments for the two workloads 176.gcc and 179.art indicate that the percentages of mis-predicted path L-1 D-cache hits and misses do not change considerably with changes in cache parameters such as the total cache size, block size and associativity. For the relative numbers of extra hits and extra misses in the correct path, some drastic variations are observed for both workloads with certain cache configurations.

For example, keeping the cache size fixed at 16 KB and the block size fixed at 32 bytes, and changing the associativity from 4-way set associative to direct mapped, the ratio of extra hits to extra misses for 176.gcc increases sharply from its base value of 6.50. For 179.art, the ratio actually rises above 1 when the block size is changed to either 16 bytes or 64 bytes, keeping the other parameters constant. However, no definite trend (such as an increase in extra hits with decreasing associativity or increasing block size) is observable. Therefore, we conclude that, although pre-fetching is the more likely result of speculative instruction execution, more work is necessary to correlate the pre-fetching and pollution effects with the parameters of the cache organization.

6 MODELS OF MIS-PREDICTION EFFECTS IN A FUNCTIONAL CACHE SIMULATOR

This chapter describes our efforts to model the effects of mis-predicted path memory references on the cache statistics of the functional cache simulator sim-cache. We begin by presenting quantitative results for the difference in the L-1 D-cache interval miss rates of sim-cache and sim-outorder, in terms of the three metrics introduced in Chapter 3. We then explain the development of four distinct models of mis-prediction effects for the functional cache simulator and present experimental results indicating their effectiveness in reducing the deviation of its interval miss rate from the detailed architectural simulator's interval miss rate, for a fixed cache configuration. We also provide results on the sensitivity of the models to changes in cache parameters, and conclude with an analysis of the experimental results.

6.1 Difference in Miss Rates of Functional and Detailed Simulators

Appendix A contains plots showing the percent difference in the L-1 D-cache interval miss rates measured by sim-cache and sim-outorder for a number of programs from the SPEC CPU2000 benchmark suite. In this section, we summarize the difference for the selected workloads using our metrics: the arithmetic mean of the absolute difference, the standard deviation of the absolute difference, and the CSM. The L-1 D-cache organization simulated is an 8KB, 4-way set associative cache with a block size of 32 bytes and the least recently used block replacement policy. The other parameters of the processor configuration in sim-outorder and of the memory hierarchy modeled in sim-cache have their default values (as described in Appendix B).

Table 6.1: Summarizing the Difference in Interval Miss Rates of Sim-cache and Sim-outorder for SPEC CPU2000 Benchmarks
Columns (for both the CINT2000 and CFP2000 halves): Benchmark; Reference Input Name; Mean of Absolute Difference; Standard Deviation of Absolute Difference; CSM.
CINT2000 rows: 164.gzip (log), 175.vpr (route), 176.gcc (expr), 181.mcf (ref), 186.crafty* (ref), 197.parser* (ref), 252.eon (rushmeir), 253.perlbmk (diffmail), 254.gap* (ref), 255.vortex* (lendian), 256.bzip2 (source), 300.twolf (ref).
CFP2000 rows: 168.wupwise* (ref), 301.apsi* (ref), 179.art (110), 187.facerec* (ref), 177.mesa* (ref).
(The numeric entries of this table were not preserved in this transcription.)

We use the entire SPEC CINT2000 subset and five workloads from the SPEC CFP2000 subset of the SPEC CPU2000 benchmark suite, simulating either the single reference input or one of the several reference inputs available for each workload. The reason for presenting results for only a few of the floating point benchmarks from the SPEC suite will become clear when we analyze our results. Most of the simulations are terminated after 60 billion completed instructions, in order to avoid excessively long simulation times for the sim-outorder simulator. The results are summarized in Table 6.1; benchmarks marked with a * indicate truncated simulations.

The results for the arithmetic mean and standard deviation of the absolute difference may seem very small; however, one should note that the miss rates themselves are very small for most of the benchmarks. (See the plots in Appendix A of the percent difference in the interval miss rates for a better idea of the relative error.) The CSM is a good measure of the similarity between two sequences of identical length, with a value nearer to zero indicating less similarity and a value nearer to one indicating good similarity. As can be seen from the table, the CSM values for some of the workloads, such as 187.facerec, 253.perlbmk, 181.mcf and 164.gzip, are quite low. In the remainder of this chapter, we describe the modeling of mis-prediction effects in sim-cache and present experimental results showing how successful these models are in reducing the mean and standard deviation and increasing the CSM for the SPEC CPU2000 benchmarks.
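The CSM itself is defined in Chapter 3, which is not part of this excerpt. Purely as an illustration of such a [0, 1] similarity measure over two equal-length miss rate sequences, a cosine-similarity computation looks like the following; whether this matches the thesis's exact definition is an assumption.

#include <math.h>

/* Cosine similarity of two equal-length interval miss rate sequences;
 * since miss rates are non-negative, the result lies in [0, 1]. */
double cosine_similarity(const double *a, const double *b, int n)
{
    double dot = 0.0, na = 0.0, nb = 0.0;
    int i;
    for (i = 0; i < n; i++) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    if (na == 0.0 || nb == 0.0)
        return 0.0;                  /* degenerate all-zero sequence */
    return dot / (sqrt(na) * sqrt(nb));
}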

6.2 Developing Models of Mis-predicted Path Execution Effects in Sim-cache

In this section, we describe the changes that must be made in a functional cache simulator in order to model the effects of mis-predicted branch path execution on the L-1 D-cache miss rate. We begin with a description of how to incorporate a branch prediction mechanism in sim-cache.

6.2.1 Branch Prediction in Sim-cache

A functional cache simulator executes instructions only along the correct (actually taken) branch paths. It does not execute instructions along paths the program does not actually take, since no branch prediction mechanism is simulated. So the first step towards modeling the effects of mis-prediction in a functional cache simulator is to augment it with a branch prediction mechanism and the ability to speculatively execute instructions along mis-predicted branch paths.

We add to sim-cache the default branch prediction mechanism modeled in the detailed simulator, sim-outorder: a bimodal branch predictor [5]. (The identical branch prediction configuration is used when we compare the functional cache simulator miss rates to the detailed simulator cache miss rates.) With this enhancement, every time the functional simulator fetches a conditional branch instruction, the branch prediction model predicts whether the branch will be taken and returns a predicted address for the next instruction to be fetched. The predicted address is compared to the correct branch target address, calculated by actually evaluating the branch condition. If they are equal, the prediction is correct and instruction execution continues normally along the correct path.

If the two addresses are unequal, the branch prediction is incorrect. The simulator saves the correct branch target address and begins executing instructions along the incorrect branch path. (How the simulator determines the number of instructions to execute along the mis-predicted path is explained in the next section.) When the simulator has finished executing the dynamically estimated number of instructions along the mis-predicted path, it resumes fetching and executing instructions from the correct (saved) branch target address.

In a real processor, instructions are speculatively executed along the predicted branch target path and their results are buffered until the branch condition is actually evaluated and the prediction is checked. If the prediction is correct, the speculatively calculated results are committed to the architectural state of the processor; otherwise, the results are discarded before execution resumes at the correct branch target address. Our functional cache simulator must model this behavior of speculative instruction execution along the (mis-)predicted branch target path. Whenever the functional simulator begins instruction execution along a mis-predicted path, it copies the architectural state of the processor (the register file) into a duplicate register file. The results of computational instructions executed along the mis-predicted path are updated in the duplicate register file. For memory instructions, different strategies are adopted for loads and stores. In an actual processor, the store values of speculative instructions are buffered and written to the data cache only after the branch prediction is determined to be correct, and discarded otherwise. This behavior is easily modeled by simply ignoring store instructions along the mis-predicted path, without performing any action. Load instructions along the mis-predicted path are allowed to access the data cache just like non-speculative loads; the value read from the memory hierarchy is written to the corresponding register in the duplicate file. The duplicate register file contents are simply discarded when the simulator resumes instruction execution along the correct branch target path.
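A sketch of this duplicate-register-file mechanism, with hypothetical names, is shown below; in the modified sim-cache the equivalent logic is woven into the instruction interpretation loop.

/* Entering and leaving mis-predicted-path execution. regs is the
 * architectural register file; spec_regs is the duplicate used while
 * on the wrong path. */
static struct regs_t regs, spec_regs;
static int spec_mode = 0;        /* 1 while on a mis-predicted path */
static md_addr_t recover_PC;     /* saved correct branch target     */

void enter_spec_mode(md_addr_t correct_target)
{
    spec_regs = regs;            /* duplicate the architectural state */
    recover_PC = correct_target; /* remember where to resume          */
    spec_mode = 1;
}

void leave_spec_mode(void)
{
    /* spec_regs is simply discarded; nothing is copied back */
    spec_mode = 0;
    regs.regs_PC = recover_PC;   /* resume along the correct path     */
}

/* While spec_mode is set: stores are ignored, and loads access the
 * D-cache normally, updating spec_regs rather than regs. */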

Program exceptions generated along the mis-predicted path must be handled properly. In a real processor, no action is generally taken on exceptions caused by speculative instructions until the branch condition is resolved; the exceptions are recorded just like the results of speculative instructions. If the branch prediction is correct, the exceptions generated by speculative instructions are acted upon in program order; otherwise they are discarded. We model this behavior by ignoring all exceptions generated by the execution of mis-predicted path instructions. Thus, program exceptions such as illegal opcodes, illegal memory addresses and floating point exceptions are ignored during mis-predicted path execution.

6.2.2 Modeling Instruction Execution on the Mis-predicted Paths

So far, we have discussed how to enable a functional cache simulator to speculatively execute memory references along mis-predicted paths by incorporating a branch predictor. In a real processor, speculative instruction execution along a branch path continues at least until the branch condition is evaluated, and frequently until the branch instruction reaches the head of the in-order instruction retiring buffer (the reorder buffer).

The number of cycles for this to happen, and hence the number of speculatively executed instructions along a mis-predicted path, is not fixed. It can vary from zero to the maximum number of instructions that can be simultaneously executed inside the processor minus one, as determined by the size of the reorder buffer.

SUB  R1, R1, #1     // decrement R1 by 1
BNE  R1, R2, LOOP   // jump to LOOP if R1 != R2
ADD  R5, R6, R7     // R5 = R6 + R7
SUB  R4, R8, R9     // R4 = R8 - R9
OR   R7, R7, R8     // R7 = R7 or R8
LOAD R3, 100(R0)    // load R3 with memory contents
MUL  R3, R3, R1     // R3 = R3 * R1
LOAD R2, 200(R0)    // load R2 with memory contents
DIV  R2, R2, R4     // R2 = R2 / R4
ADD  R5, R2, R3     // R5 = R2 + R3

Figure 6.1: A Code Segment Illustrating a Branch Instruction

Consider the code shown in Figure 6.1 and assume that at most 8 instructions can be executed speculatively along a mis-predicted path. When the branch instruction shown in the figure is wrongly predicted as not taken, then depending on the number of speculatively executed instructions, either both load instructions will be speculatively executed, only the first load will be, or neither load will be. Therefore, our functional cache simulator must include a realistic mechanism to estimate how many instructions to execute speculatively every time it enters mis-predicted branch path execution mode; it should not simply execute a fixed number of instructions along every mis-predicted path.

The number of speculatively executed instructions along mis-predicted paths can be represented by a random variable, N. If this random variable has a definite probability distribution, say an exponential or Gaussian distribution, or can be closely approximated by one, it can easily be modeled in the functional cache simulator and used to estimate the number of instructions to execute along every mis-predicted path.

To verify this presumption, we simulated several SPEC CPU2000 workloads using a modified version of the detailed architectural simulator sim-outorder and experimentally estimated the probability that x instructions will be executed along a mis-predicted path. The code needed to achieve this with sim-outorder is straightforward. We maintain a counter for every possible number of instructions that can be executed along a mis-predicted path. For each mis-predicted path, we count the actual number of speculatively executed instructions and increment the corresponding counter. At the end of the simulation, each counter value divided by the sum of all counters gives the probability for the corresponding number. We use the default processor and memory hierarchy configuration of sim-outorder in these simulations. Plots showing the probability distribution for some of the benchmarks are shown below (Figures 6.2 to 6.5); refer to Appendix C for probability distribution plots for more benchmarks.
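A sketch of the counting code, with hypothetical names, is shown below; SimpleScalar's statistics package would normally be used for the counters themselves.

#include <stdio.h>

#define MAX_SPEC 16   /* path lengths 0..15, bounded by the RUU size */

static unsigned long long spec_len_count[MAX_SPEC];

/* called once per recovered mis-predicted path */
void record_spec_path(int n_spec_insts)
{
    if (n_spec_insts >= 0 && n_spec_insts < MAX_SPEC)
        spec_len_count[n_spec_insts]++;
}

/* at the end of simulation: the estimated probability of each length */
void dump_distribution(FILE *out)
{
    unsigned long long total = 0;
    int i;
    for (i = 0; i < MAX_SPEC; i++)
        total += spec_len_count[i];
    for (i = 0; i < MAX_SPEC; i++)
        fprintf(out, "%d %g\n", i,
                total ? (double)spec_len_count[i] / (double)total : 0.0);
}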

Figure 6.2: Probability Distribution of Speculative Instructions (176.gcc-expr)

Two observations can be made from these plots. First, the number of speculatively executed instructions along a mis-predicted path varies from zero to fifteen, which agrees well with the fact that the default size of the Register Update Unit in sim-outorder is sixteen. Second, every distribution is quite different from the others; there is not much similarity between any two of the distributions. Therefore, we cannot build a simple model of speculatively executed mis-predicted path instructions into the functional simulator based on the probability distribution of one benchmark (approximated by a known distribution such as the exponential distribution) and expect it to work well for all the benchmarks we might be interested in simulating.

Figure 6.3: Probability Distribution of Speculative Instructions (252.eon-rushmeir)

Figure 6.4: Probability Distribution of Speculative Instructions (301.apsi-ref)

Figure 6.5: Probability Distribution of Speculative Instructions (179.art-110)

In a real processor with speculative instruction execution capability, the number of instructions speculatively executed along a mis-predicted path depends on how many cycles it takes for the branch condition to be evaluated and on how quickly the branch instruction reaches the head of the reorder buffer in the ready-to-commit state. This in turn depends on various architectural parameters of the processor, on the architectural state of the processor at that time, and on the data dependences present in the program. Our model must include the influence of as many factors as possible to be successful in reducing the difference in the interval L-1 D-cache miss rates of the functional and detailed simulators. The architectural parameters of a processor, such as the number of available functional units, the latency of the different functional units, the size of the instruction holding buffers, the number of pipeline stages and the latency of the different stages of the memory hierarchy, will have an impact on the number of cycles it takes to determine whether the branch is taken after the branch instruction is fetched.

The program dependent factors that influence the number of cycles needed to resolve a branch are the data dependence of the conditional branch operand(s) on the results of immediately preceding instructions, and the occupancy of the various functional units at the time the branch instruction is ready to execute. The actual number of instructions speculatively executed along a mis-predicted branch path also depends on the data dependences among the speculative instructions themselves. For example, if a load instruction along a mis-predicted path is data dependent on a previous long-running multiply instruction, its probability of being speculatively executed is reduced.

The functional cache simulator models only the programmer visible architecture of the processor (the instruction set, the programmable registers and the memory hierarchy); it does not incorporate the micro-architectural details of the processor. Adding such organizational details to a functional cache simulator for the purpose of estimating the number of mis-predicted path instructions is time consuming and difficult, and it changes the functional nature of the simulator. We therefore concentrate on modeling the program dependent factors, namely the data dependences of conditional branches and of the speculative instructions along mis-predicted paths. The only architectural factor we consider, because it is easy to account for, is the latency of the different functional units available in the processor.

6.3 Techniques to Model the Number of Speculatively Executed Instructions

In this section, we describe in detail four different models for dynamically determining the number of instructions to speculatively execute along a mis-predicted path in the functional cache simulator sim-cache. These are the random number of instructions model, the data dependence of conditional branch model, the data dependence of speculative load model, and the combined model.

6.3.1 Random Number of Instructions

Our first mechanism for estimating the number of speculatively executed instructions along a mis-predicted branch path is simple. Every time the functional cache simulator enters a mis-predicted branch path, it picks a pseudo-random integer between zero and fifteen (corresponding to the default size of the RUU in sim-outorder, the simulator configuration we compare our results to) as the number of instructions to speculatively execute along the mis-predicted path. It then executes that many speculative instructions, including any memory references, along the mis-predicted branch path, after which it resumes instruction execution along the correct branch path. We call the functional cache simulator with a branch prediction mechanism and a random number of speculatively executed instructions along the mis-predicted path sim-cache-bpred-rand.

We do not expect this model to significantly improve the accuracy of the interval miss rates of the functional cache simulator, since it does not account for any of the architectural factors or the inherent program characteristics that determine how many instructions are executed speculatively along a mis-predicted path.
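In code, the estimate for this model reduces to a single pseudo-random draw (a sketch with hypothetical names; the real code would seed and select the generator appropriately):

#include <stdlib.h>

#define RUU_SIZE 16   /* default RUU size in sim-outorder */

/* sim-cache-bpred-rand: number of wrong-path instructions to execute */
int estimate_rand(void)
{
    return rand() % RUU_SIZE;   /* pseudo-random integer in 0..15 */
}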

Our motivation in simulating a random speculative instruction model is to quantify the improvement in functional cache simulator interval miss rate accuracy, if any, achieved by merely enabling the simulator to execute instructions along mis-predicted paths. The overhead incurred in implementing the random model in the functional cache simulator is small, since the simulator does not have to consider any information when choosing the number of instructions to execute as it starts mis-predicted path execution. Therefore, this model should not introduce a large loss of simulation speedup relative to the detailed architectural simulator.

6.3.2 Data Dependence of Conditional Branch

In our second technique, we use the data dependence of the operand(s) of the conditional branch instruction on the results of immediately preceding instructions, if any, to estimate the number of speculatively executed instructions along the mis-predicted branch path. The changes made to the sim-cache source code to implement this model, beyond the branch prediction mechanism and the speculative instruction execution capability discussed in Section 6.2.1, are described below.

We maintain a circular queue storing the latency of the last sixteen instructions while the simulator is executing instructions along a correct program path. The number sixteen is chosen to match the maximum number of instructions that can be in the process of execution in the processor architecture simulated by the detailed simulator, sim-outorder.

The latency of a correct path instruction is calculated as follows:

    L = L_FUnit + L_I    (1)

where L_FUnit is the latency of the functional unit and L_I is the latency of a previous instruction on which the current instruction has a data dependence, reduced by the number of instructions that appear between the two instructions. Figure 6.6 illustrates an example.

Figure 6.6: Latency Calculation of an Instruction

If the latency of the previous instruction, reduced by the number of intervening instructions, turns out to be negative, it is taken to be zero. The idea here is that the previous instruction will have completed as many cycles of its execution as the number of intervening instructions, so the number of cycles until its result becomes available to the current instruction is reduced by that many cycles. For simplicity, we assume that exactly one instruction begins execution every clock cycle.
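Figure 6.6 is not reproduced here; as an illustrative example of our own (not the figure's): suppose an integer multiply instruction (functional unit latency 3, per Table 6.2 below) is followed, with two instructions in between, by an ALU instruction (functional unit latency 1) that consumes the multiply's result. Then L_I = max(0, 3 - 2) = 1, and by equation (1) the ALU instruction's latency is L = 1 + 1 = 2 cycles.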

Table 6.2 provides the functional unit latency values we use in our model. These values are the same as the functional unit latencies assumed in the detailed simulator, sim-outorder. The actual latencies of some functional units, such as the floating point divide and floating point square root units, are greater than fifteen, but our model takes them to be fifteen for the purpose of estimating the number of mis-predicted path instructions, which should not exceed fifteen (the maximum number of instructions simultaneously in the execution phase inside our processor model).

Table 6.2: Functional Unit Latency Values
  Integer ALU: 1
  Integer multiply unit: 3
  Integer divide unit: 15
  Floating point add/compare/convert unit: 2
  Floating point multiply unit: 4
  Floating point divide unit: 12
  Floating point square root unit: 15
  Load/store unit: 1

When a conditional branch is encountered, its latency is calculated as described above. If the branch is mis-predicted, the latency queue is cleared and the branch latency is taken as the number of instructions to execute along the mis-predicted path. The simulator then executes that many instructions along the mis-predicted path (without considering the data dependences among the speculative instructions). After finishing speculative instruction execution, the simulator resumes instruction execution along the correct path, tracking instruction latencies again. We call the functional cache simulator which incorporates the model of data dependence of conditional branches to estimate the number of speculatively executed instructions along mis-predicted paths sim-cache-bpred-brdep.
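A sketch of this model's bookkeeping, with hypothetical names, is shown below. The sixteen-entry queue is written as a simple shift array rather than a true circular queue for clarity.

#include <string.h>

#define RUU_SIZE 16

/* lat[0] is the latency of the most recent correct-path instruction */
static int lat[RUU_SIZE];

/* Record a correct-path instruction. dep_idx is the index in lat[] of
 * the producer it depends on (-1 if none); dep_idx also equals the
 * number of instructions between producer and consumer. */
int track_inst(int fu_latency, int dep_idx)
{
    int li = (dep_idx >= 0) ? lat[dep_idx] - dep_idx : 0;
    int l = fu_latency + (li > 0 ? li : 0);          /* equation (1) */
    memmove(&lat[1], &lat[0], (RUU_SIZE - 1) * sizeof lat[0]);
    lat[0] = l;
    return l;
}

/* sim-cache-bpred-brdep: on a mis-prediction, the branch's own latency
 * becomes the number of wrong-path instructions to execute, and the
 * latency queue is cleared. */
int estimate_brdep(int branch_latency)
{
    memset(lat, 0, sizeof lat);
    return branch_latency < RUU_SIZE ? branch_latency : RUU_SIZE - 1;
}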

6.3.3 Data Dependence of Speculative Load

The data dependence of a conditional branch model essentially estimates the number of cycles it takes to evaluate a branch condition; the simulator then executes as many speculative instructions as the number of cycles estimated. In our third model, we instead consider the data dependences among the speculative instructions along a mis-predicted path to estimate the number of instructions executed speculatively. The instructions that change the state of the data cache, even when executed speculatively, are the load instructions; therefore, we consider the data dependence of only the load instructions on previous instructions. The details of the implementation of this model in sim-cache are described below.

When the simulator mis-predicts a conditional branch instruction, a pseudo-random integer ranging from zero to fifteen is selected as the number of instructions to execute speculatively along the mis-predicted path (as in Model 1). A queue of length sixteen is maintained to store the latency of instructions along the mis-predicted path, calculated in the same way as in the Model 2 implementation. If an instruction along the mis-predicted path is not a load, it is speculatively executed immediately. When a speculative load instruction is encountered, its latency is calculated as before.

But before the load instruction is speculatively executed, the number of speculative instructions still remaining to be executed is reduced by the latency of the load instruction. If the number of speculative instructions to execute is still positive after subtracting the load's latency, the load instruction is speculatively executed (allowed to access the data cache) and speculative instruction execution continues. Otherwise, speculative instruction execution is terminated without allowing the load instruction to access the data cache. Figure 6.7 shows an example.

BEQ  R1, R2, LOOP   // conditional branch
ADD  R2, R3, R4     // some speculative instructions
AND  R1, R2, R2
SUB  R4, R1, R3
LOAD R5, 0(R4)      // a speculative load instruction
                    // with a dependence chain
SUB  R6, R5, R1     // a few more speculative instructions

Figure 6.7: Data Dependence of Speculative Load Model

Assume that the conditional branch is mis-predicted as not taken and the simulator begins to speculatively execute, say, ten instructions (a pseudo-random number between zero and fifteen assigned dynamically). If the latency of the speculative load instruction is, say, three, then after the load instruction the simulator will speculatively execute 10 - 4 - 3 = 3 more instructions (the load instruction is the 4th speculatively executed instruction). When speculative execution along a mis-predicted path is finished, the simulator clears the latency queue and resumes normal instruction execution along the correct branch target path. We call the functional cache simulator which incorporates the model of data dependence of load instructions along mis-predicted paths to estimate the number of speculatively executed instructions sim-cache-bpred-lddep.
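A sketch of this rule, with hypothetical names, is shown below; remaining is initialized to the pseudo-random path length, and each wrong-path instruction is executed only while the function keeps returning a non-zero value.

/* sim-cache-bpred-lddep: execute one wrong-path instruction.
 * Non-load instructions simply consume one unit of the budget; a load
 * must first pay its dependence latency (equation (1)), and if the
 * budget is exhausted it is not allowed to access the D-cache.
 * Returns 1 to continue speculative execution, 0 to stop. */
int execute_spec_inst(int *remaining, int is_load, int latency)
{
    if (is_load) {
        *remaining -= latency;   /* charge the load's latency first */
        if (*remaining <= 0)
            return 0;            /* stop: load does not touch the cache */
    }
    (*remaining)--;              /* this instruction has now executed */
    return *remaining > 0;
}

With the example above (a budget of ten, the load as the fourth instruction, and a load latency of three), this executes three instructions, the load, and then three more, matching the 10 - 4 - 3 = 3 count in the text.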

6.3.4 Combined Model

Our fourth and final model for estimating the number of speculatively executed instructions along mis-predicted paths combines the data dependence of the conditional branch model and the data dependence of the speculative load model. The simulator estimates the number of instructions to execute speculatively after a mis-predicted branch by tracking the latency of the instructions before the conditional branch and the data dependence of the conditional branch on its preceding instructions, as described in Section 6.3.2. While in speculative execution mode along a mis-predicted branch path, the simulator stores the latency of instructions and reduces the number of speculative instructions to execute according to the dependence of the load instructions on preceding instructions, as described in Section 6.3.3. We call the functional cache simulator which uses this combined model sim-cache-bpred-comb.

Figure 6.8 shows the main loop of the functional cache simulator sim-cache, modified to include a branch prediction mechanism and the ability to speculatively execute instructions along mis-predicted paths. The estimate() function shown in Figure 6.8 is an abstraction of the model used to calculate the number of mis-predicted path instructions to execute in each of the four techniques described above.
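Figure 6.8 itself is reproduced below only by its caption; the following is a minimal sketch of such a loop, with simplified stand-in names (and using the enter_spec_mode()/leave_spec_mode() helpers sketched in Section 6.2.1), not the exact modified sim-cache source.

/* Sketch of the modified sim-cache main loop. estimate() abstracts
 * whichever of the four models is in use. */
void sim_main(void)
{
    for (;;) {
        /* execute one correct-path instruction, accessing the caches */
        md_inst_t inst = execute_next(&regs);

        if (is_cond_branch(inst)) {
            md_addr_t predicted = predict_target(inst);    /* bimodal predictor */
            md_addr_t actual    = actual_target(inst, &regs);

            if (predicted != actual) {          /* mis-prediction */
                int n = estimate();             /* model-specific path length */
                enter_spec_mode(actual);
                while (n-- > 0)
                    execute_spec(&spec_regs);   /* loads access the D-cache;
                                                   stores are ignored */
                leave_spec_mode();              /* resume at the saved target */
            }
        }
    }
}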

Figure 6.8: Modified Functional Cache Simulator Main Loop

6.4 Experimental Results

In this section, we present the results for the L-1 D-cache interval miss rates of the SPEC CPU2000 benchmarks, as measured by the four versions of the modified sim-cache functional simulator. We measure the deviation of the interval miss rates of the functional simulators from the sim-outorder interval miss rates using the mean, standard deviation and CSM.

We simulate the same subset of the SPEC suite as described in Section 6.1. All simulator options are identical to those used in generating the results presented in Section 6.1; in particular, the L-1 D-cache is an 8KB, 4-way set associative cache with a 32-byte block size and the least recently used block replacement policy. See Appendix D for plots showing the percent difference in the L-1 D-cache miss rates of the detailed simulator and the four versions of the modified sim-cache functional simulator. We use the same reference input for every workload from the SPEC CPU2000 suite as described in Section 6.1 for the unmodified functional cache simulator sim-cache and the detailed simulator sim-outorder. As before, many simulations are terminated after 60 billion completed instructions (indicated by a *), in order to avoid excessively long simulation times for the sim-outorder simulator.

6.4.1 Reduction in the Deviation of Interval Miss Rates

Tables 6.3 to 6.10 summarize the results for the difference in the interval miss rates, measured in terms of the mean, standard deviation and CSM, and the improvement in the accuracy of the interval miss rates relative to the unmodified sim-cache functional simulator, for the four models on the SPEC CPU2000 benchmarks. The improvement in accuracy is measured by the percent change in the three metrics from their values for the unmodified functional simulator: the % decrease in the mean of absolute difference from the unmodified sim-cache mean, the % decrease in the standard deviation of absolute difference from the unmodified sim-cache standard deviation, and the % increase in the CSM from the unmodified sim-cache CSM.
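Written out (a restatement of the definitions just given, not a formula taken from the thesis), with m, s and c denoting the mean of absolute difference, the standard deviation of absolute difference and the CSM, and subscripts identifying the simulator:

\[
\Delta m = \frac{m_{\text{sim-cache}} - m_{\text{model}}}{m_{\text{sim-cache}}} \times 100, \quad
\Delta s = \frac{s_{\text{sim-cache}} - s_{\text{model}}}{s_{\text{sim-cache}}} \times 100, \quad
\Delta c = \frac{c_{\text{model}} - c_{\text{sim-cache}}}{c_{\text{sim-cache}}} \times 100
\]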

(The mean, standard deviation and CSM results for the unmodified sim-cache simulator are presented in Section 6.1.) A positive value for a percent change implies that, in terms of that particular metric, the simulation results of the modified functional simulator are closer to the sim-outorder results than those of the unmodified functional simulator; a negative value indicates that the modified simulator results are worse.

Table 6.3: Sim-cache-bpred-rand Results (CINT2000)
Columns: CINT2000 Workload; Mean of Absolute Difference; % Decrease in Mean from Sim-cache; Standard Deviation of Absolute Difference; % Decrease in Std. Dev. from Sim-cache; CSM; % Increase in CSM from Sim-cache.
Rows: 164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty*, 197.parser*, 252.eon, 253.perlbmk, 254.gap*, 255.vortex*, 256.bzip2, 300.twolf. (Numeric entries not preserved in this transcription.)

Table 6.4: Sim-cache-bpred-rand Results (CFP2000)
Columns: same as Table 6.3, for the CFP2000 workloads 168.wupwise*, 301.apsi*, 179.art, 187.facerec* and 177.mesa*. (Numeric entries not preserved.)

Table 6.5: Sim-cache-bpred-brdep Results (CINT2000)
Columns and rows: same as Table 6.3. (Numeric entries not preserved.)

Table 6.6: Sim-cache-bpred-brdep Results (CFP2000)
Columns and rows: same as Table 6.4. (Numeric entries not preserved.)

Table 6.7: Sim-cache-bpred-lddep Results (CINT2000)
Columns and rows: same as Table 6.3. (Numeric entries not preserved.)

Table 6.8: Sim-cache-bpred-lddep Results (CFP2000)
Columns and rows: same as Table 6.4. (Numeric entries not preserved.)

Table 6.9: Sim-cache-bpred-comb Results (CINT2000)
Columns and rows: same as Table 6.3. (Numeric entries not preserved.)

Table 6.10: Sim-cache-bpred-comb Results (CFP2000)
Columns and rows: same as Table 6.4. (Numeric entries not preserved.)

Table 6.11: Simulation Times (in seconds) CINT2000 Workloads
Columns: Workload; Sim-outorder; Sim-cache; Sim-cache-bpred-rand; Sim-cache-bpred-brdep; Sim-cache-bpred-lddep; Sim-cache-bpred-comb.
Rows: the twelve CINT2000 workloads of Table 6.3. (Numeric entries not preserved.)

Table 6.12: Simulation Times (in seconds) CFP2000 Workloads
Columns: same as Table 6.11, for the five CFP2000 workloads of Table 6.4. (Numeric entries not preserved.)

6.4.2 Reduction in Simulation Speedup

Tables 6.11 and 6.12 present the simulation times of sim-outorder and the functional cache simulator models. Tables 6.13 and 6.14 show the simulation speedup offered by the functional cache simulators over sim-outorder. The simulation speedup of a functional simulator is defined as the ratio of the simulation time of sim-outorder to the simulation time of the functional simulator.

Table 6.13: Simulation Speedups CINT2000 Workloads
Columns: Workload; Sim-cache; Sim-cache-bpred-rand; Sim-cache-bpred-brdep; Sim-cache-bpred-lddep; Sim-cache-bpred-comb.
Rows: the twelve CINT2000 workloads of Table 6.3. (Numeric entries not preserved.)

Table 6.14: Simulation Speedups CFP2000 Workloads
Columns: same as Table 6.13, for the five CFP2000 workloads of Table 6.4. (Numeric entries not preserved.)
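Written as a formula (restating the definition above), the speedup of a functional simulator f reported in Tables 6.13 and 6.14 is

\[
\text{speedup}(f) = \frac{T_{\text{sim-outorder}}}{T_{f}}
\]

where T denotes the wall-clock simulation time for the same workload and input.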

6.5 Analysis of Results

The simulation results in Tables 6.4, 6.6, 6.8 and 6.10 indicate that our models of mis-prediction behavior in the functional cache simulator are not successful in improving the accuracy of the L-1 D-cache interval miss rates for the floating point workloads from the SPEC CPU2000 suite. Except for the benchmark 168.wupwise when simulated with sim-cache-bpred-brdep and sim-cache-bpred-comb, all the floating point benchmarks have worse interval cache miss rate results with our models than with the unmodified functional simulator sim-cache. We simulated additional floating point benchmarks from the CFP2000 subset, beyond the five for which results are presented here, and all of them show a similar degradation in interval cache miss rate closeness.

The integer workloads have a completely different outcome. All of the integer workloads from the CINT2000 subset, except the 175.vpr benchmark, show an increase in the accuracy of the L-1 D-cache interval miss rates, in terms of all three metrics, using at least two of our models, and for most of the workloads using all four models. For comparison, the results of all four models for the integer workloads are plotted together as bar plots in Figures 6.9, 6.10 and 6.11, showing the percent change in the mean, standard deviation and CSM, respectively.

With the mean of absolute difference as the metric for measuring the difference in interval miss rates, the random instructions model (sim-cache-bpred-rand) and the data dependence of speculative loads model (sim-cache-bpred-lddep) show a negative change for the workloads 175.vpr, 252.eon, 254.gap, 255.vortex and 300.twolf.

The data dependence of the branch instruction model (sim-cache-bpred-brdep) and the combined model (sim-cache-bpred-comb) show a negative change only for 175.vpr and 176.gcc. The random instructions model and the speculative loads based model show particularly good improvement (a 50% or greater decrease) for the workloads 164.gzip, 181.mcf and 253.perlbmk. The branch instruction based model and the combined model show good improvement (a 40% or greater decrease) for the workloads 197.parser, 252.eon and 300.twolf. The conditional branch model also has a good effect on the 253.perlbmk workload.

Figure 6.9: Percent Decrease in Mean (CINT2000)

With the standard deviation of absolute difference as the metric for measuring the closeness of interval miss rates, we observe that, except for the workload 253.perlbmk (which has a negligible change with all four models), all other workloads show a positive decrease in standard deviation using at least two of the models. The random model and the speculative loads based model are worse than sim-cache for the workloads 186.crafty, 252.eon, 254.gap and 300.twolf with respect to standard deviation. The conditional branch based model and the combined model show a positive improvement for all the workloads. As with the mean, some benchmarks improve much more with one or two models than with the others.

Figure 6.10: Percent Decrease in Standard Deviation (CINT2000)

The percent increase in the CSM is very small for most of the workloads, except for 181.mcf and 253.perlbmk.

The reason for the small effect is that the CSM of the sim-cache interval miss rates (a value of 1 indicating maximum similarity) already exceeds 0.9 for most of the CINT2000 workloads (see Table 6.1). Of the three workloads with smaller original CSM values (164.gzip, 181.mcf and 253.perlbmk), two show a large percentage increase in CSM with at least one of the models; the workload 164.gzip shows only a small increase in CSM with all four models. As before, the random model and the speculative loads based model have a detrimental effect on a few workloads, whereas the branch based model and the combined model have a negative effect only for the workloads 175.vpr and 176.gcc.

Figure 6.11: Percent Increase in CSM (CINT2000)

Figure 6.12: Speedups for Different Models of the Functional Simulator (CINT2000)

Figure 6.12 compares the speedup in simulation time relative to the detailed simulator sim-outorder for all the versions of the functional simulator when simulating the CINT2000 subset. As expected, the random instructions model shows the smallest decrease in speedup from the sim-cache value, and the combined model shows the largest. The decrease in speedup with the conditional branch model is greater than that with the speculative loads model: the former incurs its bookkeeping overhead during correct path execution, whereas the latter does its extra work on the mis-predicted path. Since correct path execution occurs far more frequently than mis-predicted path execution, the simulation time with the conditional branch model is greater than that with the speculative loads model.

Overall, the conditional branch dependent model has a positive effect on the interval miss rates of all the integer workloads (except the 175.vpr workload and the CSM of the 176.gcc workload). The random number of instructions model gives surprisingly good results for many of the integer workloads, sometimes even better than the conditional branch based model, but it is worse than the unmodified functional cache simulator for a few integer benchmarks. The speculative loads based model (which starts with a random number of instructions to execute along a mis-predicted path and then adjusts this number according to the data dependences of the load instructions along that path) gives results almost identical to the random instructions model. Similarly, the combined model results are very close to those of the conditional branch based model. Therefore, we conclude that the influence of the data dependence of speculative load instructions on the number of instructions executed along a mis-predicted path is very small and need not be considered in a functional cache simulator.

Thus, we consider the conditional branch based functional cache simulator to be a good candidate for use in performance analysis studies involving the data cache miss rates of the CINT2000 subset of the SPEC CPU2000 suite. If the loss in speedup of this simulator is unacceptable, the random model based simulator can be used instead, but only for certain workloads. As previously stated, none of these models is currently of any use when simulating the floating point benchmarks.

6.6 Sensitivity of Simulator Models to Cache Parameters

In this section, we present results indicating the sensitivity of our simulator models to changes in L-1 D-cache parameters such as the cache size, block size and associativity. We select one integer benchmark (181.mcf) and one floating point benchmark (179.art). The results are shown in Tables 6.15 and 6.16.

The sensitivity analysis results indicate that, for the two workloads 181.mcf and 179.art, variations in cache parameters such as the cache size, associativity and block size do not much affect the percentage change in the three metrics from their values for the unmodified simulator sim-cache. For the integer workload 181.mcf, which shows good improvement in the accuracy of the interval cache miss rates over sim-cache with the original cache configuration, the sensitivity analysis experiments imply that the models will work more or less equally well for other cache organizations. For the floating point benchmark 179.art, which has poor results for the original configuration, the sensitivity analysis experiments indicate that the models will not be of any use for other configurations either. We thus conclude that our techniques are not very sensitive to changes in the data cache configuration.

Table 6.15: Sensitivity of Functional Cache Simulator Models to Cache Parameters (181.mcf)
For each metric (% decrease in mean from sim-cache, % decrease in standard deviation from sim-cache, % increase in CSM from sim-cache) and each model (random, branch, load, combined), the table reports the percent change under four cache configurations (size, associativity, block size): 8KB, 4-way, 32B; 16KB, 4-way, 32B; 8KB, 2-way, 32B; 8KB, 4-way, 16B. (Numeric entries not preserved.)

Table 6.16: Sensitivity of Functional Cache Simulator Models to Cache Parameters (179.art)
Same structure as Table 6.15. (Numeric entries not preserved.)

7 CONCLUSIONS

Integer and floating point workloads from the SPEC CPU2000 benchmark suite show contrasting behavior when the effects of speculative execution of mis-predicted path instructions on the L-1 D-cache miss rates are quantified. The difference in the behavior of the integer and floating point workloads is even more pronounced when these effects are modeled in a functional cache simulator.

The slightly lower accuracy of branch prediction mechanisms for the integer workloads compared to the floating point workloads, combined with the higher frequency of conditional branch instructions in the dynamic instruction count of the integer workloads (see Table 5.7), suggests that speculative mis-predicted branch path instruction execution occurs much more frequently in the integer workloads. Consequently, for the floating point workloads, the L-1 D-cache hits and misses along the mis-predicted paths form a small fraction of the overall cache hits and misses, whereas for the integer workloads the mis-predicted path hits and misses are a sizable part of the overall hits and misses. Therefore, the effects of executing mis-predicted path instructions on the L-1 D-cache should not be ignored in performance analysis studies, at least for integer benchmarks.

We observed that, for almost all of the SPEC CPU2000 benchmarks we simulated, speculatively executing mis-predicted path memory references has a beneficial impact on the L-1 D-cache during normal path execution, due to the pre-fetching, during mis-predicted paths, of cache blocks that are later used on correct paths.

Due to the lack of modeling of mis-predicted branch path instruction execution in functional cache simulators, we see a considerable difference between the program interval-wise data cache miss rates measured by a functional cache simulator and those measured by a detailed architectural simulator for many workloads from the SPEC suite. We have attempted to model the effects of mis-predicted path execution in the functional simulator sim-cache in four different ways. In a functional simulator, we can only consider the effects of inherent program characteristics, such as the data dependences among instructions, when modeling the mis-prediction behavior; the effects of the various architectural parameters are difficult to account for, since a functional simulator does not model the micro-architecture to begin with. For most of the integer workloads from the SPEC suite, our models were quite successful in reducing the difference in the data cache miss rates of the functional cache simulator and the detailed architectural simulator. In particular, the model based on the data dependence of conditional branch instructions provides consistently positive results for the integer workloads, although it introduces a non-trivial simulation time overhead. However, none of our models could improve the accuracy of the functional cache miss rates for any of the floating point workloads of the SPEC suite.

8 FUTURE WORK

Our models of mis-prediction effects on the data cache miss rates in a functional cache simulator have mostly dealt with the influence of program-dependent characteristics, such as the data dependence among instructions. The success of our models in improving the accuracy of the functional cache miss rates for the integer workloads indicates that, for these workloads, program-dependent factors influence the mis-prediction effects more than architectural factors do. Since our models have not worked as well for floating point workloads, we suspect that the reverse is true for them, but more work is needed to test this hypothesis. It would be interesting to identify the architectural factors that must be accounted for when considering the effects of mis-predicted path execution on cache performance metrics; if their effects can be incorporated in a functional simulator, the accuracy of its results is likely to improve. In Chapter 5, we demonstrated that speculatively executed memory references are more likely to pre-fetch useful data blocks into the data cache than to pollute it with unusable blocks. We would also like to correlate the pre-fetching and pollution effects with changes in the cache organization parameters through further experiments.
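One way to run such experiments is to tag each cache block filled by a speculative (wrong-path) reference and classify it at its first correct-path use or at eviction. The following is only a sketch of that bookkeeping under assumed structure and function names, not code from our simulator:

    #include <stdbool.h>

    struct cache_block {
        unsigned long tag;
        bool valid;
        bool spec_fill;        /* filled by a wrong-path reference */
        bool used_by_correct;  /* has served a correct-path access */
    };

    static unsigned long useful_spec_fills;   /* pre-fetch effect */
    static unsigned long useless_spec_fills;  /* pollution effect  */

    /* Call on every fill; on_wrong_path marks speculative fills. */
    void on_fill(struct cache_block *b, bool on_wrong_path)
    {
        b->spec_fill = on_wrong_path;
        b->used_by_correct = false;
    }

    /* Call on every correct-path hit. */
    void on_correct_hit(struct cache_block *b)
    {
        if (b->spec_fill && !b->used_by_correct)
            useful_spec_fills++;   /* first correct-path use: pre-fetch */
        b->used_by_correct = true;
    }

    /* Call when a block is evicted. */
    void on_evict(struct cache_block *b)
    {
        if (b->spec_fill && !b->used_by_correct)
            useless_spec_fills++;  /* evicted unused: pollution */
    }

Sweeping the cache size, associativity, and block size while collecting these two counters would quantify how the pre-fetch/pollution balance shifts with cache organization.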

APPENDICES

APPENDIX A

PLOTS OF L-1 D-CACHE INTERVAL MISS RATE DIFFERENCE

Figure A.1: Difference in L-1 D-cache Interval Miss Rate (164.gzip-log)

Figure A.2: Difference in L-1 D-cache Interval Miss Rate (175.vpr-route)

Figure A.3: Difference in L-1 D-cache Interval Miss Rate (181.mcf-ref)

Figure A.4: Difference in L-1 D-cache Interval Miss Rate (186.crafty-ref)

Figure A.5: Difference in L-1 D-cache Interval Miss Rate (197.parser-ref)

Figure A.6: Difference in L-1 D-cache Interval Miss Rate (252.eon-rushmeir)

Figure A.7: Difference in L-1 D-cache Interval Miss Rate (253.perlbmk-diffmail)

Figure A.8: Difference in L-1 D-cache Interval Miss Rate (254.gap-ref)

Figure A.9: Difference in L-1 D-cache Interval Miss Rate (255.vortex-lendian1)

Figure A.10: Difference in L-1 D-cache Interval Miss Rate (256.bzip2-source)

Figure A.11: Difference in L-1 D-cache Interval Miss Rate (300.twolf-ref)

Figure A.12: Difference in L-1 D-cache Interval Miss Rate (168.wupwise-ref)

Figure A.13: Difference in L-1 D-cache Interval Miss Rate (177.mesa-ref)

Figure A.14: Difference in L-1 D-cache Interval Miss Rate (179.art-110)

Figure A.15: Difference in L-1 D-cache Interval Miss Rate (301.apsi-ref)

Figure A.16: Difference in L-1 D-cache Interval Miss Rate (164.gzip-log) with Interval Size of

Figure A.17: Difference in L-1 D-cache Interval Miss Rate (181.mcf-ref) with Interval Size of

APPENDIX B

SIM-OUTORDER AND SIM-CACHE OPTIONS USED IN SIMULATIONS

Table B.1: Sim-outorder Processor Configuration

Instruction fetch queue size: 4 instructions
Extra branch mis-prediction latency: 3 cycles
Speed of front-end of machine relative to execution core: 1
Branch predictor type/configuration: Bimodal, table size 2048
Return address stack size: 8
BTB configuration: 512 sets, 4-way set associative
Speculative predictors update mode: Null (defaults to non-speculative update)
Instruction decode B/W: 4 instructions/cycle
Instruction issue B/W: 4 instructions/cycle
Run pipeline with in-order issue: False
Issue instructions down wrong execution paths: True
Instruction commit B/W: 4 instructions/cycle
Register update unit (RUU) size: 16 slots
Load/store queue (LSQ) size: 8 slots
L-1 Data Cache configuration: 8 KB, 4-way set associative, 32 byte block size, LRU
L-1 Instruction Cache configuration: 16 KB, direct mapped, 32 byte block size, LRU
L-2 Unified Cache configuration: 256 KB, 4-way set associative, 64 byte block size, LRU
Data TLB configuration: 128 entries, 4-way set associative, LRU policy
Instruction TLB configuration: 64 entries, 4-way set associative, LRU policy
L-1 Data Cache hit latency: 1 cycle
L-1 Instruction Cache hit latency: 1 cycle
L-2 Cache hit latency: 6 cycles
Flush caches on system calls: False
Convert 64-bit inst addresses to 32-bit instruction equivalents: False
Memory access latency: 18 cycles first chunk, 2 cycles for later chunks
Memory access bus width: 8 bytes
Instruction/data TLB miss latency: 30 cycles
Total number of integer ALUs: 4
Total number of integer multiplier/dividers: 1
Total number of memory system ports available (to CPU): 2
Total number of floating point ALUs: 4
Total number of floating point multiplier/dividers: 1
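For reference, Table B.1 corresponds approximately to the sim-outorder invocation below. This is a best-effort reconstruction rather than the exact command used in the experiments: flag names follow SimpleScalar 3.0, the set counts in the cache specifiers (name:nsets:bsize:assoc:repl) are derived as size / (block size × associativity), the speculative predictor update is left at its non-speculative default by omitting -bpred:spec_update, and "benchmark" is a placeholder for the SPEC binary and its arguments.

    sim-outorder \
      -fetch:ifqsize 4 -fetch:mplat 3 -fetch:speed 1 \
      -bpred bimod -bpred:bimod 2048 -bpred:ras 8 -bpred:btb 512 4 \
      -decode:width 4 -issue:width 4 -commit:width 4 \
      -issue:inorder false -issue:wrongpath true \
      -ruu:size 16 -lsq:size 8 \
      -cache:dl1 dl1:64:32:4:l   -cache:dl1lat 1 \
      -cache:il1 il1:512:32:1:l  -cache:il1lat 1 \
      -cache:dl2 ul2:1024:64:4:l -cache:dl2lat 6 -cache:il2 dl2 \
      -tlb:dtlb dtlb:32:4096:4:l -tlb:itlb itlb:16:4096:4:l -tlb:lat 30 \
      -cache:flush false -cache:icompress false \
      -mem:lat 18 2 -mem:width 8 \
      -res:ialu 4 -res:imult 1 -res:memport 2 -res:fpalu 4 -res:fpmult 1 \
      benchmark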

Table B.2: Sim-cache Cache Hierarchy Configuration

L-1 Data Cache: 8 KB, 4-way set associative, 32 byte block size, LRU policy
L-1 Instruction Cache: 16 KB, direct mapped, 32 byte block size, LRU policy
L-2 Unified Cache: 256 KB, 4-way set associative, 64 byte block size, LRU policy
Data TLB: 128 entries, 4-way set associative, LRU policy
Instruction TLB: 64 entries, 4-way set associative, LRU policy
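The matching sim-cache invocation, again as a hedged sketch using SimpleScalar 3.0 flag names and the same derived set counts, with "benchmark" as a placeholder:

    sim-cache \
      -cache:dl1 dl1:64:32:4:l \
      -cache:il1 il1:512:32:1:l \
      -cache:dl2 ul2:1024:64:4:l -cache:il2 dl2 \
      -tlb:dtlb dtlb:32:4096:4:l -tlb:itlb itlb:16:4096:4:l \
      benchmark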

APPENDIX C

PLOTS OF PROBABILITY DISTRIBUTION OF SPECULATIVE INSTRUCTION NUMBERS

Figure C.1: Probability Distribution of Speculative Instructions (188.ammp-ref)

Figure C.2: Probability Distribution of Speculative Instructions (183.equake-ref)

Figure C.3: Probability Distribution of Speculative Instructions (164.gzip-log)

Figure C.4: Probability Distribution of Speculative Instructions (256.bzip2-source)

APPENDIX D

PLOTS OF MISS RATE DIFFERENCE WITH MIS-PREDICTION MODELS

Figure D.1: Simulator Models Interval Miss Rate Differences (164.gzip-log)

Figure D.2: Simulator Models Interval Miss Rate Differences (175.vpr-route)

Figure D.3: Simulator Models Interval Miss Rate Differences (176.gcc-expr)

Figure D.4: Simulator Models Interval Miss Rate Differences (181.mcf-ref)

Figure D.5: Simulator Models Interval Miss Rate Differences (186.crafty-ref)

Figure D.6: Simulator Models Interval Miss Rate Differences (197.parser-ref)

Figure D.7: Simulator Models Interval Miss Rate Differences (252.eon-rushmeir)

Figure D.8: Simulator Models Interval Miss Rate Differences (253.perlbmk-diffmail)

Figure D.9: Simulator Models Interval Miss Rate Differences (254.gap-ref)

Figure D.10: Simulator Models Interval Miss Rate Differences (255.vortex-lendian1)

Figure D.11: Simulator Models Interval Miss Rate Differences (256.bzip2-source)

Figure D.12: Simulator Models Interval Miss Rate Differences (300.twolf-ref)

Figure D.13: Simulator Models Interval Miss Rate Differences (168.wupwise-ref)

Figure D.14: Simulator Models Interval Miss Rate Differences (177.mesa-ref)

Figure D.15: Simulator Models Interval Miss Rate Differences (179.art-110)

Figure D.16: Simulator Models Interval Miss Rate Differences (187.facerec-ref)

Figure D.17: Simulator Models Interval Miss Rate Differences (301.apsi-ref)
