Pre-Computational Thread Paradigm: A Survey


Alok Garg

Abstract

The straightforward way to exploit high instruction-level parallelism is to increase the size of the instruction window. A large instruction window enables the processor to look ahead and find independent instructions for execution; it helps tolerate long latencies (for example, those of long latency instructions like cache misses) and can remove or delay frequent processor stalls. However, a deeper window is challenging to implement and may not always provide a performance benefit because of control dependencies: even a highly accurate (98%) branch predictor may guarantee only 50% correct-path accuracy when executing speculatively very deep into the pipeline. Branch mispredictions and cache misses therefore still limit peak processor performance and are called performance degrading events (PDEs). The small fraction of instructions whose behavior cannot be anticipated by conventional branch predictors and caches contributes a large fraction of these PDEs. A promising solution for accurately predicting the behavior of these instructions is to pre-compute their outcomes ahead of time. This survey examines pre-computational techniques that attack PDEs ahead of time and train the hardware with future access information.

1 Introduction

Since instructions retire sequentially (following program semantics), the performance of a program is measured by its retirement throughput. If no instruction is available for retirement during a particular cycle, the retirement bandwidth of the machine is wasted for that cycle and performance degrades. Retirement may block either because of long latency instructions or because of frequent pipeline flushes caused by control mispredictions. These events are commonly called performance degrading events (PDEs).
Long Latency Instructions: The presence of long latency instructions (such as cache misses) fills the processor window and stalls the pipeline. Stalls directly impact performance, since the processor can no longer retire an instruction each cycle. A stall also indirectly hurts performance because it exposes the slack in the pipeline to the retirement engine. This slack is the time gap between when an instruction is fetched and when it is ready for retirement; for a particular instruction it depends not only on pipeline depth but also on execution latency, and hence is variable. Increasing the instruction window size is a straightforward way to hide this latency, but doing so is challenging because of design complexity. Work by Akkary et al. [1] addresses some of the hardware design complexity issues of scalable large instruction window processors; even so, a large instruction window does not address the waste of retirement bandwidth. Examples of long latency instructions include floating point operations and cache misses, but it is cache misses that account for the majority of them.

Control Dependencies: Despite a plethora of research on predicting control dependencies, branch mispredictions continue to be a major limitation on microprocessor performance. Whenever a branch mispredicts, the processor begins misprediction recovery, which involves flushing the pipeline and restarting fetch from the correct path. From the time the mispredicted branch is fetched until a new instruction from the correct path is fetched, the processor pays the branch misprediction penalty, which directly degrades program performance. Techniques to minimize this degradation revolve around minimizing branch mispredictions, minimizing branch resolution time, and recovering from mispredictions faster. Another issue tied to branch misprediction is instruction cache pollution caused by instructions fetched from the wrong path. Even though modern prediction mechanisms achieve impressive accuracies, the penalty incurred by mispredictions continues to increase with wider and deeper processor pipelines. On large instruction window processors, even a branch predictor that is 98% accurate for an individual branch may only guarantee machine execution on the correct path with 50% confidence. Several schemes [8, 14, 17] have been developed and are effective in controlling the frequency of these PDEs. Yet for wide-issue, large-window processors, the accuracy provided by conventional techniques is not enough, and further insight is needed to reduce the frequency of these events. A small fraction of static instructions, whose behavior cannot be anticipated using current branch predictors and caches, contributes a large fraction of such PDEs. A promising solution, and a hot topic of research, is to complement these frequency reduction techniques with a generic latency tolerance technique such as pre-computation. Pre-execution runs the instructions leading up to a PDE some time before the PDE is actually encountered, improving branch prediction accuracy by pre-computing branch outcomes ahead of time.
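The 50% figure above follows directly from compounding per-branch accuracy over all branches in flight. A quick sketch (the window depths shown are illustrative):

```python
# Probability that EVERY unresolved branch in the window was predicted
# correctly compounds geometrically with the number of in-flight branches.
def correct_path_probability(per_branch_accuracy, in_flight_branches):
    return per_branch_accuracy ** in_flight_branches

# With a 98%-accurate predictor, roughly 34 unresolved branches are enough
# to drop the whole-window correct-path probability to about 50%.
for n in (8, 16, 34, 64):
    print(n, round(correct_path_probability(0.98, n), 3))
```

A deep window on a wide machine can easily hold dozens of unresolved branches, which is why per-branch accuracy alone understates the problem.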
The outcome of the pre-executed branch is provided to the machine when it encounters that branch. Similarly, pre-execution can be used to pre-compute the addresses of load instructions, which can then be prefetched to hide cache miss latencies. Several papers have recently investigated the use of pre-computational threads to mitigate the effect of PDEs. Purser and Sundaramoorthy [12, 15] propose parallel execution of a reduced version of the original program, the A-stream, on a spare context of a single-chip multiprocessor (CMP) [11] or simultaneous multithreaded (SMT) processor [16]. The A-stream speculatively runs ahead of the full program and supplies it with control and data flow outcomes, letting the full program execute more efficiently. Other techniques propose small pre-computational helper threads: hand-generated speculative slices leading to the computation of a difficult branch or the address of a difficult load. These lightweight threads are spawned early enough that their outcomes are useful in hiding the effect of PDEs, and they may leverage the spare execution capacity of the machine to benefit the primary thread. This paper surveys the classification of techniques in what we call the pre-computational paradigm and summarizes the issues related to these schemes. We exclude conventional branch prediction and load prefetching techniques from this survey, and instead aim at more sophisticated yet simple pre-computational schemes.

2 Classification of Pre-Computational Techniques

Pre-computation techniques fall into two categories, depending on whether a full-blown thread (Slipstream Processors) is executed on a separate context or multiple small helper threads (Speculative Slices) are spawned opportunistically.
Speculative slice techniques can be further classified by the methodology used to construct these lightweight threads. Some rely on compile-time profiling data to hand-construct speculative slices for the instructions responsible for many branch mispredictions or cache misses. Others are entirely hardware runtime techniques. A third category, also discussed in this survey, uses idle clock cycles during full-window stalls to run ahead and pre-execute some instructions. This technique, called Runahead Execution, increases the processor's tolerance of long-latency memory operations and provides instruction and data prefetching benefits.

3 Slipstream Processors

3.1 Principles

The slipstream processor [12, 15] proposes that only a subset of the original dynamic instruction stream is needed to make full, correct forward progress. At the same time, creating a shorter, equivalent program is speculative. The slipstream philosophy therefore argues for running two redundant copies of the program on two separate contexts of an SMT or CMP processor, with one program always running slightly ahead of the other. The lead program, with its reduced dynamic instruction stream, is called the advanced stream, or A-stream. The trailing program, which executes the full dynamic instruction stream, is called the redundant stream, or R-stream. The much-reduced A-stream is sped up because it executes and retires fewer instructions than it otherwise would. All values and branch outcomes produced in the leading A-stream are communicated to the trailing R-stream, which benefits from these accurate predictions of the future and executes more efficiently. Since the R-stream executes the full dynamic instruction stream, it always re-computes and validates the predictions supplied by the A-stream. The R-stream therefore always makes correct forward progress, and it also keeps the speculative A-stream on the correct path: when the A-stream deviates, the architectural state of the R-stream is used to selectively repair the corrupted architectural state of the A-stream. The combination of A-stream and R-stream improves the performance of the single program as a whole in two ways. First, it provides accurate value and branch outcomes to the R-stream, so the R-stream rarely strays from the correct path.
Second, by prefetching memory references well in advance, the A-stream reduces cache misses on the R-stream path. The R-stream thus very rarely suffers from PDEs, while the A-stream is still able to run ahead of it because it executes fewer instructions. In the next section we briefly discuss the microarchitectural requirements of the slipstream processor; we then discuss methods, and some of the issues, related to creating the shorter program for the A-stream.

3.2 Slipstream microarchitecture

A slipstream processor requires two architectural contexts, one each for the A-stream and R-stream, hardware for directing dynamic instruction removal in the A-stream, and hardware for communicating state between the threads. Four components are required to support slipstream processing. The instruction-removal predictor, or IR-predictor, is a modified branch predictor that generates the program counter (PC) of the next block of instructions to be fetched in the A-stream. The instruction-removal detector, or IR-detector, monitors the R-stream, detects instructions that could have been removed from the program, and conveys this instruction-removal information to the IR-predictor. The delay buffer communicates control and data flow outcomes from the A-stream to the R-stream. The recovery controller is responsible for keeping the A-stream on the correct path.

3.3 Creating the shorter program

The IR-detector monitors past run-time behavior and detects instructions that could have been removed and might be removable in the future. This information is conveyed to the IR-predictor, and after sufficient repeated indications by the IR-detector, the IR-predictor removes future instances of those instructions. The IR-predictor is built on top of a conventional trace predictor. A trace is a large dynamic instruction sequence of 16 to 32 instructions, uniquely identified by a starting PC and the branch outcomes indicating the path through the trace (its ID). An index into a correlated prediction table is formed from the sequence of past trace IDs, using a hash function that favors bits from more recent trace IDs over less recent ones. Each entry in the correlated prediction table contains a trace ID and a 2-bit counter for replacement. The predictor is augmented with a second table indexed with only the most recent trace ID. Together, the two tables form a hybrid predictor that outputs the predicted trace ID of the next trace.

The IR-detector creates the shorter version of the program by removing ineffectual computations. The cases of ineffectual computation are: instructions that write a value to a register or memory location that is overwritten before ever being used (such instructions, and the computation chains leading up to them, have no effect on final program state); instructions that write the same value into a register or memory location as already exists there (their writes are not true modifications, so they and their computation chains have no effect on final program state); and control flow so predictable that it appears deterministic (with a high level of confidence, such branches and the computation chains feeding them can be removed). A more aggressive technique for reducing the A-stream is to remove branch-predictable computation; another possibility is to remove value-predictable computation. An overall better predictor may be possible by combining a conventional predictor with the A-stream.
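As a concrete illustration of the first two triggering conditions (writes overwritten before use, and silent writes), here is a minimal sketch that scans a retired-instruction trace. The tuple encoding is an assumption made for illustration, and propagating removal up the computation chains via the R-DFG is omitted:

```python
# Hedged sketch of the IR-detector's first two triggering conditions on a
# retired trace. Each instruction is (dest, source_locations, value_written);
# the trace format and single-pass analysis are illustrative simplifications.
def find_removable(trace):
    removable = set()
    state = {}         # last value written to each location
    last_write = {}    # location -> index of most recent still-unread write
    for i, (dest, sources, value) in enumerate(trace):
        for src in sources:
            last_write.pop(src, None)         # pending write was used: not dead
        if dest in last_write:
            removable.add(last_write[dest])   # overwritten before use: dead write
        if state.get(dest) == value:
            removable.add(i)                  # silent write: value unchanged
        state[dest] = value
        last_write[dest] = i
    return removable
```

In a real IR-detector this analysis runs continuously over R-stream retirement, and repeated indications (rather than a single hit) train the IR-predictor.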
By removing highly predictable branches or values, the A-stream focuses instead on hard-to-predict branches or values. The IR-detector consumes retired R-stream instructions, addresses, and values. The instructions are buffered and, based on data dependencies, circuitry among the buffers is dynamically configured to establish connections from consumer to producer instructions; in other words, a reverse dataflow graph (R-DFG) is constructed. The IR-detector watches for any of the triggering conditions for instruction removal, and when one is observed, the corresponding instruction is selected for removal.

3.4 IR-misprediction recovery

An instruction-removal misprediction, or IR-misprediction, occurs when A-stream instructions were removed that should not have been. IR-mispredictions are detected by the R-stream; the IR-detector can also detect them sooner by comparing its computed removal information against the corresponding predicted removal information. When an IR-misprediction is detected, the reorder buffer of the R-stream is flushed, and the IR-predictor and A-stream are then resynchronized with the precise state of the R-stream.

3.5 Issues with slipstream processors

The main performance issue with slipstream processors is the accuracy of the A-stream. If the A-stream is to be made more accurate and less speculative, instruction removal must be more conservative, resulting in fewer instructions removed from the A-stream and hence a slower A-stream. A slower A-stream degrades the performance of the entire program, since the R-stream would then also make slower forward progress. The performance of the slipstream processor also depends on the distance between the A-stream and R-stream, because the R-stream is partially sped up by encountering fewer cache misses. If the distance between A-stream and R-stream is not large enough, the long latencies of cache misses encountered by the A-stream may not be completely hidden from the R-stream.
The R-stream then eventually suffers from PDEs. To shield the R-stream from performance degrading events, the A-stream must always stay a comfortable distance ahead of the R-stream. This requires a faster A-stream, because PDEs frequently slow the A-stream down, which in turn requires more aggressive instruction removal from the A-stream. More aggressive removal increases speculation and may result in frequent IR-misprediction recoveries; the IR-misprediction is thus another type of performance degrading event, one specific to slipstream processors.

4 Helper Threads

Branch prediction using a helper thread was first proposed by Farcy et al. [7]. They proposed a mechanism to target highly mispredicted branches within local loops: the computations leading up to applicable branches were duplicated at decode time and used to pre-compute the branch conditions several loop iterations ahead of the current iteration. Branch prediction using subordinate microthreads was proposed by Chappell et al. [2-4]. Zilles and Sohi [18, 19] proposed using speculative slices to pre-compute branch conditions and prefetch addresses. Roth and Sohi [13] proposed a processor capable of using data-driven threads (DDT) to perform critical computations, chains of instructions leading to a mispredicted branch or cache miss. Collins et al. proposed dynamic speculative pre-computation [5, 6]. The common principle behind pre-computational, or helper, threads is discussed next.

4.1 Principles

Performance degrading events tend to be concentrated in a subset of static instructions whose behavior is not predictable with existing mechanisms. Because simple instructions are handled accurately by existing mechanisms, most branch mispredictions and cache misses are concentrated in the remaining static instructions whose behavior is more complex. A small set of frequently executed static instructions is responsible for causing a majority of PDEs.
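One way to picture this concentration is as a profiling pass over a log of dynamic executions. The log format and the absolute count threshold below are illustrative assumptions; the 10% rate follows the characterization of problem instructions given later in this section:

```python
# Sketch of the problem-instruction criterion: a static instruction is a
# problem instruction if it causes a non-trivial number of PDEs (threshold
# of 100 events assumed here) and at least 10% of its dynamic executions
# cause one.
from collections import Counter

def classify_problem_instructions(events, min_pdes=100, min_rate=0.10):
    executions, pdes = Counter(), Counter()
    for pc, caused_pde in events:        # one record per dynamic execution
        executions[pc] += 1
        if caused_pde:
            pdes[pc] += 1
    return {pc for pc in executions
            if pdes[pc] >= min_pdes and pdes[pc] / executions[pc] >= min_rate}
```

An instruction that mispredicts often but rarely relative to its execution count is filtered out, since a slice targeting it would mostly do useless work.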
In the helper thread approach, a code fragment is constructed that mimics the computation leading up to and including these problem instructions. Such a piece of code is called a speculative slice. A speculative slice includes only the operations necessary to compute the outcome of the problem instructions. In contrast to the original program, a speculative slice only needs to approximate it, which gives significant flexibility in slice construction. These speculative slices may execute as helper threads on a multithreaded machine, or via Simultaneous Subordinate Microthreading (SSMT) as proposed in [4]. A helper thread is spawned far enough in advance of the target instruction that an accurate prediction is available by the time the original instruction is fetched. Helper threads accelerate the program's execution microarchitecturally, by prefetching data into the cache and generating branch predictions; they do not affect the architected state of the program, since a slice has its own registers and performs no stores. Forking the slice many cycles before the main thread encounters the problem instructions provides the necessary latency tolerance. After copying some register values from the main thread, the slice thread executes autonomously. Additional hardware is required to bind the branch predictions generated by helper threads to the correct branch instances in the main thread.

4.2 Early execution of speculative slices

The benefit of executing a speculative slice derives from pre-computing a problem instruction significantly in advance of when it is executed in the main program. The following rules for constructing lightweight slices ensure that slice instructions execute early relative to the whole program. Only the instructions necessary to execute the problem instruction are included in the slice. The slice avoids the misprediction stalls that impede the whole program: control flow required by the slice is if-converted, either arithmetically or through prediction. Finally, the code from the original program can be transformed to reduce overhead and shorten the critical path to computing a problem instruction; transformations can be applied to slices that the compiler could not apply to the original program because of limited resources or correctness concerns.

4.3 Slice construction

Construction of a speculative slice depends on the problem instruction and the forking point.

Problem instruction. For a limited number of speculative slices to be most effective at early resolution of problem instructions, the problem instructions must be characterized. A static instruction is characterized as a problem instruction if it accounts for a non-trivial number of PDEs and at least 10% of its executions cause a PDE. [18] shows that problem instructions are responsible for a disproportionate fraction of performance degrading events. A single speculative slice may target one or several problem instructions; slice overhead can be minimized by aggregating all problem instructions that share common dataflow predecessors into a single slice. If a speculative slice terminates with the prefetch of a memory location, the final instruction can be converted into a software prefetch by changing its destination register to the zero register. This conversion allows the speculative slice to terminate as soon as the last instruction is fetched and issued.

Forking point. Each speculative slice executes the dataflow from the primary thread's forking point along a particular control-flow path. Forking a new slice requires communicating seed register values from the primary thread to the helper thread. Selecting a fork point often requires carefully balancing two conflicting desires.
Maximizing the likelihood that the problem instruction's latency is fully tolerated requires the fork point to be as early as possible, but increasing the distance between the fork point and the problem instructions can both increase the size of the slice and reduce the likelihood that the problem instructions will actually be executed. Once the forking point and target problem instruction(s) are identified, only the dependent instructions from the original program are used to construct the speculative slice.

4.4 False spawns

If the primary thread deviates from the path assumed by a speculative slice, that slice does not produce a useful prediction. Such false spawns are damaging because the speculative slices they launch consume resources without performing useful work. Even a speculative slice that follows the correct path must complete in time to be useful. Chappell [3] proposed using a set of difficult paths to guide helper threads: the predictability of branches is classified by control-flow path, difficult paths that frequently lead to mispredictions are identified, and mispredictions are attacked with specially constructed speculative slices for each of these difficult paths. Hardware mechanisms are required to identify a difficult path at fork time.

4.5 Automatic slice construction

For the helper thread technique to be practical, an automatic method of slice construction is necessary. Chappell [3] and Collins [5] have proposed dynamic methods for constructing speculative slices for branch prediction and memory prefetching, respectively. For dynamic construction of speculative slices, the dynamic instruction stream is analyzed for dependencies after the retirement stage. A Retirement Instruction Buffer (RIB) is used for this purpose, storing a trace of committed instructions. The RIB is off the processor's critical path and does not hurt processor performance.
The RIB watches for problem instructions; once a problem instruction is inserted into the RIB, it transitions into slice-building mode.
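The dependence analysis performed in slice-building mode amounts to a backward walk over register dependences, keeping only the producers of values the problem instruction transitively needs. A hedged sketch, where the instruction encoding and the straight-line path are simplifying assumptions:

```python
# Hedged sketch of backward-slice extraction between a fork point and a
# problem instruction along one control-flow path. Each instruction is
# (dest_reg, source_regs); the most recent producer of a needed register
# satisfies the dependence.
def build_slice(path, problem_idx):
    needed = set(path[problem_idx][1])    # registers the problem insn reads
    slice_idx = [problem_idx]
    for i in range(problem_idx - 1, -1, -1):
        dest, srcs = path[i]
        if dest in needed:                # this insn produces a needed value
            slice_idx.append(i)
            needed.discard(dest)          # dependence satisfied by this producer
            needed.update(srcs)           # now need this producer's inputs
    return sorted(slice_idx), needed      # leftover regs = live-ins at the fork
```

The registers still "needed" when the walk reaches the fork point are exactly the seed values the primary thread must copy to the helper thread at spawn time.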

4.6 Branch prediction correlation

To derive any benefit from the predictions made by speculative slices, each prediction must be assigned to the corresponding dynamic instance of the problem branch in the primary thread. Zilles [18] proposed a branch correlator, which consists of multiple queues of predictions, each tagged with the PC of a branch. When a branch is fetched and its PC matches a non-empty queue, the processor uses the prediction at the head of that queue in place of the one generated by the conventional branch predictor.

4.7 Issues with the helper thread technique

Effective use of helper threads for hiding performance degrading events is challenging. Sophisticated techniques are required to cope with the conflicting desires of helper threads, which relate to the forking point of the helper thread and the size of the speculative slice. Even when these conflicting desires are met by careful selection of the forking point, a few more issues arise. Aborting useless helper threads: a mechanism is required to detect and abort useless helper threads when the machine deviates from the path predicted by an active helper thread, while minimizing false aborts. Controlling pre-computations: issues like primary thread mis-speculation recovery, late predictions, and useless predictions complicate the correlation mechanism discussed earlier.

5 Runahead Execution

Mutlu et al. [9, 10] proposed Runahead Execution to tolerate long latency performance degrading events.

5.1 Principle

Runahead execution is a cheap alternative to an unreasonably large instruction window, which would otherwise be needed to exploit instruction-level parallelism aggressively when a simple out-of-order machine stalls on a long latency operation. Runahead execution unblocks an instruction window blocked by a long latency operation and executes speculatively far ahead on the program path.
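The core of the mechanism, distributing a bogus, invalid result for the blocking load, propagating invalidity through dependent instructions, and issuing prefetches only for loads whose sources are valid, can be sketched as follows. The instruction encoding with a precomputed load address is an illustrative assumption:

```python
# Hedged sketch of runahead-mode invalid-bit propagation. Each instruction
# is (opcode, dest_reg, source_regs, load_address); load_address stands in
# for the address computation and is an illustrative simplification.
def runahead_prefetches(instrs, blocked_dest):
    invalid = {blocked_dest}            # the blocking load's bogus result
    prefetches = []
    for op, dest, srcs, addr in instrs:
        if any(s in invalid for s in srcs):
            invalid.add(dest)           # invalid source -> invalid result
            continue
        if op == "load":
            prefetches.append(addr)     # valid load: issue a data prefetch
        invalid.discard(dest)           # dest now holds a valid value
    return prefetches, invalid
```

Instructions independent of the miss still compute usable load addresses, and those prefetches are where runahead's benefit comes from; everything touched by the bogus value is simply marked invalid and skipped.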
When the instruction window is blocked by the long latency operation, the state of the architectural register file is checkpointed and the processor enters runahead mode. It distributes a bogus result for the blocking operation and removes that instruction from the instruction window. Instructions following the long latency instruction are fetched, executed, and retired, but do not update architectural state. When the blocking operation resolves, the processor re-enters normal mode: it restores the checkpointed state and re-executes instructions starting with the blocking operation. The instructions fetched and executed during runahead mode create very accurate prefetches for the data and instruction caches. This technique differs from other pre-computational techniques in that it uses the spare resources of the processor only during stalls. It does not need idle thread contexts or spare resources (e.g., fetch and execution bandwidth), which are not available when the processor is well utilized. Runahead execution mitigates anticipated future performance degrading events by pre-computation only when the primary execution is not using processor resources.

5.2 Features

The main complexity in executing runahead instructions lies in memory communication and in the propagation of invalid results.

Propagating invalid results. Invalid results must be filtered out while taking anticipatory measures for long latency instructions and branches. Each physical register has an invalid bit associated with it to indicate whether or not it holds a bogus value; any instruction that sources an invalid register is itself invalid.

Memory communication. For runahead execution to

most accurately mimic real execution, a runahead cache is used to forward results from stores to dependent loads.

Load prefetches. Only valid load instructions are prefetched.

Branch prediction. Branches are predicted and resolved in runahead mode exactly as they are in normal mode. Branch predictions from runahead mode may be communicated to normal mode to improve its branch prediction.

5.3 Issues with runahead execution

If a mispredicted invalid branch is encountered during runahead mode, it may steer runahead execution onto the wrong path; such branches are called divergence points. Predictions made on the wrong path are wrong and must be discarded. Even then, the processor can reach a control-flow independent point, after which it continues on the correct program path again.

6 Performance analysis

6.1 Slipstream processors

The slipstream processor dramatically improves single-program performance: on average, a 12% improvement for a 4-way processor with 64 ROB entries in a CMP. Slipstreaming on an 8-way SMT processor improves performance by 10%-20%. The benefit of slipstreaming decreases as more execution bandwidth becomes available.

6.2 Helper threads

Speculative slices achieve an average speedup of 10% on a 4-way machine. Hand-constructed speculative slices achieve better performance than dynamically constructed slices.

6.3 Runahead execution

Runahead execution on a machine model based on the Intel Pentium 4 improves performance by 22% across a wide range of memory-intensive applications.

7 Conclusion

Three pre-computational techniques are discussed in this survey. Slipstream and helper threads use unused contexts of a multithreaded processor, while runahead execution uses unused resources of the processor during stalls on long latency instructions. Runahead execution targets all accessible performance degrading events while the processor is stalled on the normal path.
Explicit multithreading techniques, on the other hand, target all performance degrading events all the time. The scope of runahead execution is more limited, but it is more accurate than the multithreading techniques, since it executes the program's own instructions rather than a speculatively constructed version of them. The slipstream processor is a completely dynamic pre-computation technique, whereas helper threads need hand-constructed speculative slices to be most effective. The slipstream processor, executing a reduced copy of the primary thread, remains on the correct path most of the time, so correlating predictions is not a big issue for it. For helper thread techniques, the primary thread's future path has to be anticipated before forking a new slice; this anticipation mechanism sometimes results in false spawns, and the mechanism for correlating helper thread predictions is also more complicated. On the other hand, the slipstream processor suffers from a slow lead thread (the A-stream) and might not hide the long latencies of performance degrading events. For pre-computation techniques to be practically viable, they must be simpler than existing techniques, need to be more flexible (and perhaps more speculative), and need to be dynamic. The slipstream processor may be the more practical option for future microprocessors, but it needs better and more aggressive instruction removal mechanisms to be effective.

References

[1] H. Akkary, R. Rajwar, and S. Srinivasan. Checkpoint Processing and Recovery: Towards Scalable Large Instruction Window Processors. In International Symposium on Microarchitecture, San Diego, California, Dec. 2003.

[2] R. Chappell, J. Stark, S. Kim, S. Reinhardt, and Y. Patt. Simultaneous Subordinate Microthreading (SSMT). In International Symposium on Computer Architecture, Atlanta, Georgia, May 1999.

[3] R. Chappell, F. Tseng, A. Yoaz, and Y. Patt. Difficult-Path Branch Prediction Using Subordinate Microthreads. In International Symposium on Computer Architecture, Anchorage, Alaska, May 2002.

[4] R. Chappell, F. Tseng, A. Yoaz, and Y. Patt. Microarchitectural Support for Precomputation Microthreads. In International Symposium on Microarchitecture, pages 74-84, Istanbul, Turkey, Nov. 2002.

[5] J. Collins, D. Tullsen, H. Wang, and J. Shen. Dynamic Speculative Precomputation. In International Symposium on Microarchitecture, Austin, Texas, Dec. 2001.

[6] J. Collins, H. Wang, D. Tullsen, C. Hughes, Y. Lee, D. Lavery, and J. Shen. Speculative Precomputation: Long-Range Prefetching of Delinquent Loads. In International Symposium on Computer Architecture, pages 14-25, Göteborg, Sweden, June/July 2001.

[7] A. Farcy, O. Temam, R. Espasa, and T. Juan. Dataflow Analysis of Branch Mispredictions and Its Application to Early Resolution of Branch Outcomes. In International Symposium on Microarchitecture, pages 59-68, Dallas, Texas, Nov./Dec. 1998.

[8] C.-C. Lee, I.-C. K. Chen, and T. N. Mudge. The Bi-Mode Branch Predictor. In International Symposium on Microarchitecture, pages 4-13, Research Triangle Park, North Carolina, Dec. 1997.

[9] O. Mutlu, J. Stark, C. Wilkerson, and Y. N. Patt. Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors. In International Symposium on High-Performance Computer Architecture, Anaheim, California, Feb. 2003.

[10] O. Mutlu, J. Stark, C. Wilkerson, and Y. N. Patt. Runahead Execution: An Effective Alternative to Large Instruction Windows. IEEE Micro, Special Issue: Micro's Top Picks from Microarchitecture Conferences, 23(6):20-25, Nov./Dec. 2003.

[11] K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chang. The Case for a Single-Chip Multiprocessor. In International Conference on Architectural Support for Programming Languages and Operating Systems, pages 2-11, Cambridge, Massachusetts, Oct. 1996.

[12] Z. Purser, K. Sundaramoorthy, and E. Rotenberg. A Study of Slipstream Processors. In International Symposium on Microarchitecture, Monterey, California, Dec. 2000.

[13] A. Roth and G. Sohi. Speculative Data-Driven Multithreading. In International Symposium on High-Performance Computer Architecture, pages 37-48, Monterrey, Mexico, Jan. 2001.

[14] E. Sprangle, R. S. Chappell, M. Alsup, and Y. N. Patt. The Agree Predictor: A Mechanism for Reducing Negative Branch History Interference. In International Symposium on Computer Architecture, Denver, Colorado, June 1997.

[15] K. Sundaramoorthy, Z. Purser, and E. Rotenberg. Slipstream Processors: Improving both Performance and Fault Tolerance. In International Conference on Architectural Support for Programming Languages and Operating Systems, Cambridge, Massachusetts, Nov. 2000.

[16] D. Tullsen, S. Eggers, and H. Levy. Simultaneous Multithreading: Maximizing On-Chip Parallelism. In International Symposium on Computer Architecture, Santa Margherita Ligure, Italy, June 1995.

[17] T.-Y. Yeh and Y. N. Patt. Two-Level Adaptive Training Branch Prediction. In International Symposium on Microarchitecture, pages 51-61, Albuquerque, New Mexico, Nov. 1991.

[18] C. Zilles and G. Sohi. Execution-Based Prediction Using Speculative Slices. In International Symposium on Computer Architecture, pages 2-13, Göteborg, Sweden, June/July 2001.

[19] C. B. Zilles and G. S. Sohi. Understanding the Backward Slices of Performance Degrading Instructions. In International Symposium on Computer Architecture, Vancouver, Canada, June 2000.


Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1) Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling

More information

Complex Pipelines and Branch Prediction

Complex Pipelines and Branch Prediction Complex Pipelines and Branch Prediction Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. L22-1 Processor Performance Time Program Instructions Program Cycles Instruction CPI Time Cycle

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

Dynamic Control Hazard Avoidance

Dynamic Control Hazard Avoidance Dynamic Control Hazard Avoidance Consider Effects of Increasing the ILP Control dependencies rapidly become the limiting factor they tend to not get optimized by the compiler more instructions/sec ==>

More information

ECE 341. Lecture # 15

ECE 341. Lecture # 15 ECE 341 Lecture # 15 Instructor: Zeshan Chishti zeshan@ece.pdx.edu November 19, 2014 Portland State University Pipelining Structural Hazards Pipeline Performance Lecture Topics Effects of Stalls and Penalties

More information

High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs

High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs October 29, 2002 Microprocessor Research Forum Intel s Microarchitecture Research Labs! USA:

More information

Exploiting Large Ineffectual Instruction Sequences

Exploiting Large Ineffectual Instruction Sequences Exploiting Large Ineffectual Instruction Sequences Eric Rotenberg Abstract A processor executes the full dynamic instruction stream in order to compute the final output of a program, yet we observe equivalent,

More information

EXAM 1 SOLUTIONS. Midterm Exam. ECE 741 Advanced Computer Architecture, Spring Instructor: Onur Mutlu

EXAM 1 SOLUTIONS. Midterm Exam. ECE 741 Advanced Computer Architecture, Spring Instructor: Onur Mutlu Midterm Exam ECE 741 Advanced Computer Architecture, Spring 2009 Instructor: Onur Mutlu TAs: Michael Papamichael, Theodoros Strigkos, Evangelos Vlachos February 25, 2009 EXAM 1 SOLUTIONS Problem Points

More information

Pipelining and Vector Processing

Pipelining and Vector Processing Chapter 8 Pipelining and Vector Processing 8 1 If the pipeline stages are heterogeneous, the slowest stage determines the flow rate of the entire pipeline. This leads to other stages idling. 8 2 Pipeline

More information

Evaluation of Branch Prediction Strategies

Evaluation of Branch Prediction Strategies 1 Evaluation of Branch Prediction Strategies Anvita Patel, Parneet Kaur, Saie Saraf Department of Electrical and Computer Engineering Rutgers University 2 CONTENTS I Introduction 4 II Related Work 6 III

More information