Preliminary Evaluation of the Load Data Re-Computation Method for Delinquent Loads

Hideki Miwa, Yasuhiro Dougo, Victor M. Goulart Ferreira, Koji Inoue, and Kazuaki Murakami
Dept. of Informatics, Kyushu University, Fukuoka, Japan
Dept. of Elec. Eng. and Computer Science, Fukuoka University, Fukuoka, Japan
arch-ccc-lpc@c.csce.kyushu-u.ac.jp

Abstract

In recent years, the performance gap between microprocessors and main memory has been widening. This gap limits microprocessor performance and is well known in the literature as the memory wall problem (MWP). This paper proposes a new method to reduce the effect of the MWP by means of re-computation. The basic idea is to replace a frequently cache-missing load (a so-called delinquent load) instruction with a piece of code that regenerates the load value. This code is called re-computation (RC) code. The method can reduce both the number of main memory accesses and the cache miss penalty, thereby alleviating the MWP. In our experiments, execution time is reduced by up to 45.3% on SPEC CPU 2000 benchmark programs.

1 Introduction

The performance of microprocessors has been improving by about 55% per year, whereas that of the DRAM commonly used as main memory has been improving by only 7% [4]. The performance gap between them has therefore been growing, and it is one of the main factors that limit microprocessor performance. This problem is called the memory wall problem. To solve it, the memory access latency needs to be reduced or hidden. One traditional latency reduction technique is to employ an on-chip cache memory, which is widely used in state-of-the-art commercial microprocessors. A common latency hiding technique, on the other hand, is prefetching, and many researchers have proposed techniques to improve its accuracy.

In recent years, some research groups have proposed prefetching techniques that focus on the load instructions which frequently cause cache misses [3, 7]. Those load instructions have come to be known as delinquent load (DL) instructions, and they were responsible for 90% of total cache misses in one experiment [3]. These prefetching techniques are based on multithreaded processors: besides the normal program execution in one thread (the main thread), the data addresses accessed by DL instructions are speculatively calculated in another thread (a helper thread), so the microprocessor can prefetch the data whose addresses are calculated by the helper thread. However, prefetching still accesses main memory to improve the cache hit rate, so it cannot reduce the number of accesses. Its other main drawback is that inaccurate prefetches pollute the on-chip cache, resulting in higher cache miss rates. These points are the limitations of prefetching.

In this paper, we propose a novel approach to the memory wall problem based on Re-Computation (RC). It builds on a concept we call CCC (Computing Centric Computation), one of our countermeasures to the memory wall problem. In today's microprocessors, the memory access latency is far longer than the computation time of a single operation, so CCC replaces memory accesses with computation to reduce total execution time. We apply this concept to load instructions and propose a Load Data Re-Computation (LDRC) technique. To avoid the growing off-chip memory access latency, we replace load instructions with corresponding RC code; in other words, we trade memory access latency for RC code execution time. Because the harm caused by cache misses is concentrated in a small number of loads, we apply this method only to DL instructions rather than to all loads. Hence, if the microprocessor can execute an RC code faster than the corresponding DL instruction, LDRC executes the RC code instead and shortens the total execution time. In this paper, we preliminarily evaluate our technique and show its potential for execution time reduction.

The remainder of this paper is organized as follows. Section 2 introduces related work. Section 3 describes the concept of our re-computation technique. Section 4 presents the experiment and the evaluation results for our method. Section 5 concludes the paper.

2 Related work

There are many techniques to reduce or hide the latency of main memory accesses. However, as far as we know, no technique based on re-computation has been proposed. In this section, we take up methods dealing with DL instructions. Recently, speculative data prefetching techniques for DL instructions have been proposed to hide the access latency.

Roth and Sohi propose speculative data-driven multithreading (DDMT) in [7]. In DDMT, instructions are classified by their criticality to performance using trace data. If the processor detects that critical instructions are upcoming at run time, it forks a speculative thread to execute them. The speculative thread is called a data-driven thread (DDT) and is executed in parallel with the main thread. A DDT includes load instructions that cause first-level cache misses or branch instructions that are frequently mispredicted.

Collins et al. propose Speculative Precomputation (SP) in [3]. SP also uses a simultaneous multithreading processor to improve the performance of single-thread applications. In SP, precomputation slices (p-slices), which include DL instructions and their dependent instructions, are executed in another thread to calculate the data addresses accessed by the DL instructions and to prefetch them. SP targets Itanium processors, which have an in-order pipeline, whereas DDMT targets out-of-order processors.

Annavaram et al. propose Data Prefetching by Dependence Graph Precomputation (DGP) in [1]. DGP employs a Dependence Graph Generator (DGG) in the instruction fetch queue (IFQ) and generates a dependence graph for a load or store instruction that is likely to cause a cache miss. A Precomputation Engine (PE) executes the dependence graph to generate the prefetch address and prefetch the data. DGP does not need a multithreaded processor, but it does need the additional DGG and PE hardware.

The techniques described above are limited to data prefetching. If the predicted data address is wrong, they cannot hide the latency of main memory accesses. In the next section, we propose a data re-computation method for DL instructions.

3 Load data re-computation

In this section, we propose the load data re-computation technique, called LDRC, and discuss its effectiveness. To keep the problem simple, we assume that the benchmark programs run as a single thread on a normal 32-bit RISC-type microprocessor. The microprocessor accesses a 2-level on-chip cache memory and an off-chip main memory.

3.1 Basic idea

To explain the concept of LDRC, consider the execution of a load instruction. If the data accessed by the load is not in the on-chip L2 cache, it has to be fetched from the off-chip main memory. Since fetching data from off-chip memory takes a long time, the cache miss penalty is very large. However, if the data was defined earlier in the program, we can regenerate it by re-computation, and if the re-computation time for the load data is shorter than the memory access time, we can reduce the execution time. This is the basic idea of LDRC: we attempt to replace DL instructions with corresponding re-computation codes to reduce execution time.
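To make the idea concrete at the source level, the following C fragment sketches the replacement under the same assumption as Figure 1 below: the operands a and b are still cache-resident when c is needed again, so re-adding them is cheaper than re-loading c from main memory. The function names and operand values are hypothetical, and the real transformation operates on object code rather than on C source.

    #include <stdio.h>

    /* Hypothetical source-level illustration of LDRC.  In use_original() the
     * second use of c corresponds to a delinquent "Load c" that misses in the
     * L2 cache; in use_ldrc() that load is replaced by its RC code, which
     * re-computes c from a and b (assumed to still hit in the cache). */

    static int use_original(int a, int b, int x)
    {
        int c = a + b;        /* Store c: c is later written back to memory   */
        /* ... intervening work evicts c from the on-chip cache ...           */
        int z = x + c;        /* Load c: L2 miss, off-chip main memory access */
        return z;
    }

    static int use_ldrc(int a, int b, int x)
    {
        int c = a + b;        /* still stored for any other consumers         */
        /* ... */
        int z = x + (a + b);  /* RC code: re-compute c instead of re-loading  */
        (void)c;
        return z;
    }

    int main(void)
    {
        /* hypothetical operand values, used only to show both versions agree */
        printf("%d %d\n", use_original(1, 2, 3), use_ldrc(1, 2, 3));
        return 0;
    }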

We give an example in Figure 1 to explain the method. Figure 1 shows an original source code, the corresponding object code, an RC code, and the modified object code.

Figure 1. Outline of LDRC
  Original source code:       c = a + b;   (value c is stored to main memory)
                              z = x + c;   (value c is not in the cache and must be read from main memory)
  Original object code:       Load a; Load b; Add c, a, b; Store c; Load c; Load x; Add z, x, c; Store z
  RC code:                    Load a; Load b; Add c, a, b
  LDRC-applied object code:   Load a; Load b; Add c, a, b; Store c; [RC code]; Load x; Add z, x, c; Store z
  (Load c is replaced with the RC code, assuming that value x is in the cache.)

In the original source code, the variable c is defined in the first statement (c = a + b) and used in the second one (z = x + c). We assume that, by the time the second statement executes, the value of c has been evicted from the on-chip cache, so in the corresponding object code the instruction Load c causes a cache miss and c has to be loaded from main memory. In the LDRC-applied object code, the RC code is inserted in place of the Load c of the original object code, and c is regenerated by the RC code. If the variables a and b are in the cache, c is re-computed successfully without any cache misses. The execution cycles are reduced if the re-computation replaces the accesses to main memory and its execution time is shorter than the main memory access time.

3.2 RC code generation

We explain the procedure used to generate RC codes. The generation algorithm is as follows:

1. Determine the DL instructions through simulations of the benchmark programs. The DL instructions depend on the input vectors of a benchmark program, so this determination has to be carried out for each input.
2. Construct a CDFG for each benchmark program.
3. On the CDFG, search for the definition point of the load data of each DL instruction.
4. Extract the defining code for each load data item from the CDFG.

The above algorithm yields the RC codes. Furthermore, if we know that a load instruction at a leaf of an RC code will itself cause a cache miss, an RC code for that load should be generated as well. RC codes of this kind are called multi-layered RC codes, whereas the normal ones are called single-layered RC codes.

3.3 Conditions for performance improvement with LDRC

This subsection discusses the conditions that must be satisfied for LDRC to reduce execution time. We ignore implementation issues because the objective of the evaluation is to show the potential of LDRC to reduce execution time. To reduce execution time, an application should satisfy at least the following four conditions:

1. DL instructions cause a large number of cache misses.
2. DL instructions can be replaced with RC codes.
3. The execution time of an RC code is shorter than the memory access latency.
4. The load instructions inside the RC code do not cause cache misses.

Conditions 1 and 2 describe the opportunity to apply the proposed method: the more opportunities there are, the larger the effect that can be obtained. Conditions 3 and 4 bound the overhead introduced by the proposed method: execution time cannot be reduced unless the re-computation time is short and the load instructions in the RC code hit in the cache. The cost comparison implied by conditions 3 and 4 is sketched below.
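The following C sketch writes down that cost comparison. It is not the authors' implementation: the data structure is an assumption, the RC code execution time is estimated from the mean CPI (the same estimate used in Formula (2) of Subsection 4.3), and the 250 cc main memory latency comes from Table 1 in Section 4.

    #include <stdbool.h>

    struct rc_code {
        int  num_insts;   /* number of instructions in the RC code              */
        bool loads_hit;   /* condition 4: loads inside the RC code hit in cache */
    };

    /* Decide whether replacing a DL instruction with this RC code is expected
     * to pay off.  cpi is the mean cycles per instruction of the machine and
     * mem_latency the off-chip main memory latency in cycles (250 cc here). */
    static bool worth_replacing(const struct rc_code *rc, double cpi, int mem_latency)
    {
        double rc_cycles = cpi * rc->num_insts;            /* condition 3 estimate */
        return rc->loads_hit && rc_cycles < (double)mem_latency;
    }

In the evaluation of Section 4, an RC code whose estimated execution time exceeds the memory latency is simply not used and the original off-chip access is charged instead, which corresponds to this check returning false.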
4 Evaluation

In this section, we quantitatively evaluate our method using several benchmark programs. Since the objective of the evaluation is to show the maximum performance improvement achievable with our method, we ignore several implementation-related issues.

4.1 Experimental environment

To evaluate our method, we used the SimpleScalar simulation tool set [2]. The processor configuration assumed in this evaluation is shown in Table 1. We selected the following benchmark programs from the SPEC CPU 2000 benchmark suite [5]; all of them were compiled with the MIRV C compiler [6].

Floating-point benchmark programs (4 programs): 177.mesa, 179.art, 183.equake, 188.ammp
Integer benchmark programs (7 programs): 164.gzip, 175.vpr, 176.gcc, 181.mcf, 197.parser, 255.vortex, 256.bzip2

We used the reference inputs of SPEC CPU 2000 for these benchmarks, and selected command line parameters so as to cause the maximum number of L2 data cache misses by DL instructions. To shorten the simulation time, we evaluated 200M instructions executed after skipping the first 2G instructions.

Table 1. Simulator parameters (cc: clock cycle(s))
  Instruction issue          Out of order
  Branch predictor           2-level (gshare, 2K entries); BTB: 512 entries, 4-way; RAS: 32 entries
  Inst. decode width         8 instructions/cc
  Inst. issue width          8 instructions/cc
  IFQ size                   8 entries
  RUU size                   64 entries
  LSQ size                   32 entries
  L1 data cache              32KB (64B/entry, 2-way, 256 entries)
  L1 instruction cache       32KB (64B/entry, 1-way, 512 entries)
  L2 unified cache           2MB (64B/entry, 4-way, 8192 entries)
  L1 cache latency           1 cc
  L2 cache latency           16 cc
  Main memory latency        250 cc
  Memory bandwidth           8B
  Memory ports               2
  ITLB, DTLB                 1M entries (4KB/entry, 256 entries/way, 4-way); miss penalty 30 cc
  Integer units (units, exec. lat., issue lat.)         ALU: 4, 1 cc, 1 cc; Mult.: 1, 3 cc, 1 cc; Div.: 1, 20 cc, 19 cc
  Floating-point units (units, exec. lat., issue lat.)  ALU: 4, 2 cc, 1 cc; Mult.: 1, 4 cc, 1 cc; Div.: 1, 12 cc, 12 cc; SQRT: 1, 24 cc, 24 cc

4.2 Evaluation assumptions

The assumptions made in this evaluation are as follows.

We define as DL instructions the 16 static load instructions that cause L2 cache misses most frequently (a sketch of this selection appears at the end of this subsection). This definition is based on the observation that these 16 static loads are responsible for the greater part of the L2 cache misses. Figure 2 shows the percentage of L2 cache misses caused by the DL instructions for each benchmark program; we ranked the DL instructions by their cache miss counts as DL1, DL2, DL3, and DL4-16.

Figure 2. Percentage of L2 cache misses due to DL instructions (DL1, DL2, DL3, DL4-16) for each benchmark program.

According to Figure 2, the DL1, DL2, and DL3 instructions are responsible for more than 40% of the total cache misses in most of the benchmark programs, and all the DL instructions together are responsible for more than %.

In Subsection 3.2 we explained how to generate RC codes from a CDFG. However, we could not generate the CDFGs, so we used instruction traces instead. As a consequence, although an RC code should include all the paths needed for re-computation, the generated RC codes include only one path.

To count the number of instructions in the RC codes, all of them would have to be generated, which takes a long time. To reduce the RC code generation time in this evaluation, we group the RC codes by similarity and generate only one representative RC code per group. We assume that all the RC codes in a group have the same number of instructions as their representative RC code.

In order to always satisfy condition 4 of Subsection 3.3, we assume that all the load instructions in all the RC codes hit in the L2 cache. If the execution time of an RC code is longer than the memory access latency, we assume that the RC code is not executed and that the load fetches the data from the off-chip memory instead. We do not consider the side effects of LDRC on the instruction and data caches.
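As a rough sketch of how the 16 static DL instructions could be selected from a profiling run (step 1 of the algorithm in Subsection 3.2), the following C fragment ranks static loads by their profiled L2 miss counts and keeps the top 16. The data layout is an assumption for illustration only; the paper's own selection was done with SimpleScalar simulations.

    #include <stdlib.h>

    #define NUM_DL 16   /* number of static delinquent loads kept (Subsection 4.2) */

    struct load_site {
        unsigned long pc;         /* address of the static load instruction       */
        unsigned long l2_misses;  /* L2 data cache misses observed in the profile */
    };

    static int by_misses_desc(const void *a, const void *b)
    {
        const struct load_site *x = a, *y = b;
        if (x->l2_misses == y->l2_misses)
            return 0;
        return (y->l2_misses > x->l2_misses) ? 1 : -1;
    }

    /* Sort the profiled static loads by miss count and keep the top NUM_DL as
     * the delinquent loads for this particular input set. */
    static void pick_delinquent_loads(struct load_site *sites, size_t n,
                                      struct load_site dl[NUM_DL])
    {
        qsort(sites, n, sizeof sites[0], by_misses_desc);
        for (size_t i = 0; i < NUM_DL && i < n; i++)
            dl[i] = sites[i];
    }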

4.3 Evaluation index

We adopt the percentage of execution time reduced by LDRC as the evaluation index. Formula (1) gives the execution time (in clock cycles) under our method:

  C_ldrc = C_ideal + C_rc + C_norc    (1)

  C_ldrc:  the execution time (in clock cycles) with LDRC.
  C_ideal: the execution time (in clock cycles) ignoring the data cache miss penalty of the DL instructions; in this case, DL instructions are assumed to always hit in the cache.
  C_rc:    the execution time (in clock cycles) of all the RC codes.
  C_norc:  the execution time (in clock cycles) of the DL instructions that cannot be replaced with RC codes.

We estimate C_rc with Formula (2):

  C_rc = CPI * sum_{i=0}^{dl} sum_{k=0}^{rcc} I_rccg(i, k) * N_rccg(i, k)    (2)

  CPI:          the mean execution clock cycles per instruction.
  I_rccg(i, k): the number of instructions in the k-th representative RC code for the i-th DL instruction.
  N_rccg(i, k): the number of executions of the k-th representative RC code for the i-th DL instruction.
  dl:           the number of DL instructions (fixed to 16 in this experiment).
  rcc:          the number of RC code groups.

Formula (3) gives the execution time of the DL instructions that cannot be replaced with RC codes:

  C_norc = (C_org - C_ideal) * N_norc / N_dl2miss    (3)

  C_org:     the original execution time (in clock cycles).
  N_dl2miss: the total number of L2 misses due to DL instructions.
  N_norc:    the total number of DL instructions that cannot be replaced with RC codes.

Formula (4) gives the percentage of execution time reduced, using Formulas (1), (2), and (3):

  R_ldrc = (1 - C_ldrc / C_org) * 100 [%]    (4)

  R_ldrc: the percentage of execution time reduction achieved by the proposed method.

To estimate the percentage of execution time reduced by LDRC, we obtain C_org, C_ideal, I_rccg(i, k), N_rccg(i, k), N_norc, N_dl2miss, and CPI from our experiment.
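The evaluation model above is easy to restate as code. The following C sketch mirrors Formulas (1)-(4); the flattened array layout and the example values in main() are hypothetical and are not measured results.

    #include <stdio.h>

    /* Formula (2): C_rc = CPI * sum_i sum_k I_rccg(i,k) * N_rccg(i,k).
     * insts[] and execs[] are flattened over all (i, k) pairs (n_pairs entries). */
    static double c_rc(double cpi, int n_pairs, const double insts[], const double execs[])
    {
        double sum = 0.0;
        for (int n = 0; n < n_pairs; n++)
            sum += insts[n] * execs[n];
        return cpi * sum;
    }

    /* Formula (3): penalty of the DL instructions that cannot be replaced. */
    static double c_norc(double c_org, double c_ideal, double n_norc, double n_dl2miss)
    {
        return (c_org - c_ideal) * n_norc / n_dl2miss;
    }

    /* Formulas (1) and (4): LDRC execution time and percentage reduction. */
    static double r_ldrc(double c_org, double c_ideal, double crc, double cnorc)
    {
        double c_ldrc = c_ideal + crc + cnorc;
        return (1.0 - c_ldrc / c_org) * 100.0;
    }

    int main(void)
    {
        /* hypothetical numbers, for illustration only */
        const double insts[] = { 12, 30 }, execs[] = { 1e6, 5e5 };
        double crc   = c_rc(1.2, 2, insts, execs);
        double cnorc = c_norc(9.0e8, 6.0e8, 2.0e5, 1.0e6);
        printf("R_ldrc = %.1f%%\n", r_ldrc(9.0e8, 6.0e8, crc, cnorc));
        return 0;
    }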

4.4 Experimental results and discussion

Figure 3 shows the percentage of execution time reduced by LDRC, Figure 4 the mean execution time of the RC codes, and Figure 5 the percentage of DL instructions that can be replaced with RC codes.

Figure 3. Percentage of reduced execution time (two bars per benchmark: ideal case and LDRC case).
Figure 4. Average execution time of the RC codes (in clock cycles).
Figure 5. Percentage of DL instructions which can be replaced with RC codes.

In Figure 3, there are two bars for each benchmark: the first corresponds to the ideal case, in which all DL instructions are assumed to hit in the cache, and the second corresponds to the LDRC-applied case. Figure 4 plots the average execution time of the representative RC codes of each RC code group mentioned in Subsection 4.2, and Figure 5 shows the percentage of DL instructions that can be replaced with RC codes.

According to Figure 3, one benchmark program, 181.mcf, reduces its execution time by up to 45.3%. On the other hand, LDRC shows no execution time reduction on 177.mesa, 179.art, 176.gcc, and 255.vortex. We discuss the reasons for these results using the average execution time of the RC codes and the number of DL instructions that can be replaced with RC codes.

First, according to Figure 3, the benchmark programs 177.mesa, 176.gcc, and 255.vortex show little reduction because they seldom satisfy condition 1 of Subsection 3.3: their DL instructions have little influence on execution time, so LDRC cannot produce a large effect. Secondly, according to Figure 5, 179.art has very few DL instructions that can be replaced with RC codes, so it seldom satisfies condition 2 of Subsection 3.3; this kind of program has little opportunity to apply LDRC. Finally, we discuss one more benchmark program, 188.ammp. According to Figure 3, this program shows a high potential for execution time reduction in the ideal case, but not in the LDRC case. The reason, seen in Figure 4, is that its RC codes take a long time to execute, so it does not satisfy condition 3 of Subsection 3.3.

From the above discussion, a large execution time reduction cannot be obtained if even one of the performance improvement conditions of Subsection 3.3 is not satisfied. One benchmark program in this evaluation, 181.mcf, satisfies every condition, and it therefore obtains a large execution time reduction.

5 Conclusion

In this paper, we proposed a load data re-computation technique called LDRC as a countermeasure to the memory wall problem. We applied LDRC to DL instructions and evaluated its potential for execution time reduction. From the experimental results, one benchmark program, 181.mcf, reduces its execution time by up to 45.3%, confirming the potential of the approach. In this evaluation we ignored several overhead-related factors, such as changes in the miss rates of the data and instruction caches. Further evaluations that take realistic implementations into account will show the effectiveness of LDRC more precisely.

References

[1] M. Annavaram, J. M. Patel, and E. S. Davidson. Data prefetching by dependence graph precomputation. In Proc. of the 28th Intl. Symposium on Computer Architecture, 2001.
[2] D. Burger and T. M. Austin. The SimpleScalar tool set, version 2.0. Technical report, University of Wisconsin-Madison Computer Sciences Department, June 1997.
[3] J. D. Collins, H. Wang, D. M. Tullsen, C. Hughes, G. Hoflener, D. Lavery, and J. P. Shen. Speculative precomputation: Long-range prefetching of delinquent loads. In Proc. of the 28th Intl. Symposium on Computer Architecture, 2001.
[4] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach, 3rd Edition. Morgan Kaufmann Publishers, 2003.
[5] J. L. Henning. SPEC CPU2000: Measuring CPU performance in the new millennium. IEEE Computer, 33(7), July 2000.
[6] M. Postiff, D. Greene, C. Lefurgy, D. Helder, and T. Mudge. The MIRV SimpleScalar/PISA compiler. Technical report, University of Michigan EECS Department, April 2000.
[7] A. Roth and G. Sohi. Speculative data-driven multithreading. In Proc. of the 7th Intl. Symposium on High-Performance Computer Architecture, January 2001.