Pre-Computational Thread Paradigm: A Survey


Alok Garg

Abstract

The straightforward way to exploit high instruction-level parallelism is to increase the size of the instruction window. A large instruction window enables the processor to look ahead and find independent instructions for execution; it helps tolerate long latencies (for example, those of long latency instructions like cache misses) and can remove or delay frequent processor stalls. However, a deeper window is challenging to implement and may not always provide a performance benefit because of control dependencies: even a highly accurate (98%) branch predictor may guarantee only 50% correct-path accuracy when executing speculatively very deep into the pipeline. Branch mispredictions and cache misses therefore still limit peak processor performance and are called performance degrading events (PDEs). The small fraction of instructions whose behavior cannot be anticipated by conventional branch predictors and caches contributes a large fraction of these PDEs. A promising solution for accurately predicting the behavior of these instructions is to pre-compute their outcomes ahead of time. This survey examines pre-computational techniques that attack PDEs ahead of time and train the hardware with future access information.

1 Introduction

Since instructions retire sequentially (following program semantics), the performance of a program is measured by its retirement throughput. If no instruction is available for retirement during a particular cycle, the retirement bandwidth of the machine is wasted for that cycle and performance degrades. Retirement may block either because of long latency instructions or because of frequent pipeline flushes caused by control mispredictions. These events are commonly called performance degrading events (PDEs).
Long Latency Instructions: The presence of long latency instructions (such as cache misses) fills the processor window and stalls the pipeline. Stalls directly impact performance, since the processor can no longer retire an instruction each cycle. A stall also indirectly hurts performance because it exposes the slack in the pipeline to the retirement engine. This slack is the time gap between when an instruction is fetched and when it is ready for retirement; for a particular instruction it depends not only on pipeline depth but also on execution latency, and hence is variable. Increasing the instruction window size is a straightforward way to hide this latency, but doing so is challenging because of design complexity. Work by Akkary et al. [1] addresses some of the hardware design complexity issues of scalable large instruction window processors; even so, a large instruction window does not address the waste of retirement bandwidth. Examples of long latency instructions include floating point operations and cache misses, but it is cache misses that account for the majority of them.

Control Dependencies: Despite a plethora of research on predicting control dependencies, branch mispredictions continue to be a major limitation on microprocessor performance. Whenever a branch mispredicts, the processor begins misprediction recovery, which involves flushing the pipeline and restarting fetch from the correct path. From the time the mispredicted branch is fetched until a new instruction from the correct path is fetched, the processor pays the branch misprediction penalty, which directly degrades program performance. Techniques to minimize this degradation revolve around minimizing branch mispredictions, minimizing branch resolution time, and recovering from mispredictions faster. Another issue tied to branch misprediction is instruction cache pollution caused by instructions fetched from the wrong path. Even though modern prediction mechanisms achieve impressive accuracies, the penalty incurred by mispredictions continues to increase with wider and deeper processor pipelines. On large instruction window processors, even a branch predictor that is 98% accurate for an individual branch may only guarantee machine execution on the correct path with 50% confidence. Several schemes [8, 14, 17] have been developed and are effective in controlling the frequency of these PDEs. Yet for wide-issue, large-window processors, the accuracy provided by conventional techniques is not enough, and further insight is needed to reduce the frequency of these events. A small fraction of static instructions, whose behavior cannot be anticipated using current branch predictors and caches, contributes a large fraction of such PDEs. A promising solution, and a hot topic of research, is to complement these frequency reduction techniques with a generic latency tolerance technique such as pre-computation. Pre-execution runs the instructions leading up to a PDE some time before the PDE is actually encountered, improving branch prediction accuracy by pre-computing branch outcomes ahead of time.
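The 50% figure above follows directly from compounding per-branch accuracy over all branches in flight. A quick sketch (the window depths shown are illustrative):

```python
# Probability that EVERY unresolved branch in the window was predicted
# correctly compounds geometrically with the number of in-flight branches.
def correct_path_probability(per_branch_accuracy, in_flight_branches):
    return per_branch_accuracy ** in_flight_branches

# With a 98%-accurate predictor, roughly 34 unresolved branches are enough
# to drop the whole-window correct-path probability to about 50%.
for n in (8, 16, 34, 64):
    print(n, round(correct_path_probability(0.98, n), 3))
```

A deep window on a wide machine can easily hold dozens of unresolved branches, which is why per-branch accuracy alone understates the problem.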
The outcome of the pre-executed branch is provided to the machine when it encounters that branch. Similarly, pre-execution can be used to pre-compute the addresses of load instructions, which can then be prefetched to hide cache miss latencies. Several papers have recently investigated the use of pre-computational threads to mitigate the effect of PDEs. Purser and Sundaramoorthy [12, 15] propose parallel execution of a reduced version of the original program, the A-stream, on a spare context of a single-chip multiprocessor (CMP) [11] or simultaneous multithreaded (SMT) processor [16]. The A-stream speculatively runs ahead of the full program and supplies it with control and data flow outcomes, letting the full program execute more efficiently. Other techniques propose small pre-computational helper threads: hand-generated speculative slices leading to the computation of a difficult branch or the address of a difficult load. These lightweight threads are spawned early enough that their outcomes are useful in hiding the effect of PDEs, and they may leverage the spare execution capacity of the machine to benefit the primary thread. This paper surveys the classification of techniques in what we call the pre-computational paradigm and summarizes the issues related to these schemes. We exclude conventional branch prediction and load prefetching techniques from this survey, and instead aim at more sophisticated yet simple pre-computational schemes.

2 Classification of Pre-Computational Techniques

Pre-computation techniques fall into two categories, depending on whether a full-blown thread (Slipstream Processors) is executed on a separate context or multiple small helper threads (Speculative Slices) are spawned opportunistically.
Speculative slice techniques can be further classified by the methodology used to construct these lightweight threads. Some rely on compile-time profiling data to hand-construct speculative slices for the instructions responsible for many branch mispredictions or cache misses. Others are entirely hardware runtime techniques. A third category, also discussed in this survey, uses idle clock cycles during full-window stalls to run ahead and pre-execute some instructions. This technique, called Runahead Execution, increases the processor's tolerance of long-latency memory operations and provides instruction and data prefetching benefits.

3 Slipstream Processors

3.1 Principles

The slipstream processor [12, 15] proposes that only a subset of the original dynamic instruction stream is needed to make full, correct forward progress. At the same time, creating a shorter, equivalent program is speculative. The slipstream philosophy therefore argues for running two redundant copies of the program on two separate contexts of an SMT or CMP processor, with one program always running slightly ahead of the other. The lead program, with its reduced dynamic instruction stream, is called the advanced stream, or A-stream. The trailing program, which executes the full dynamic instruction stream, is called the redundant stream, or R-stream. The much-reduced A-stream is sped up because it executes and retires fewer instructions than it otherwise would. All values and branch outcomes produced in the leading A-stream are communicated to the trailing R-stream, which benefits from these accurate predictions of the future and executes more efficiently. Since the R-stream executes the full dynamic instruction stream, it always re-computes and validates the predictions supplied by the A-stream. The R-stream therefore always makes correct forward progress, and it also keeps the speculative A-stream on the correct path: when the A-stream deviates, the architectural state of the R-stream is used to selectively repair the corrupted architectural state of the A-stream. The combination of A-stream and R-stream improves the performance of the single program as a whole in two ways. First, it provides accurate value and branch outcomes to the R-stream, so the R-stream rarely strays from the correct path.
Second, by prefetching memory references well in advance, the A-stream reduces cache misses on the R-stream path. The R-stream thus very rarely suffers from PDEs, while the A-stream is still able to run ahead of it because it executes fewer instructions. In the next section we briefly discuss the microarchitectural requirements of the slipstream processor; we then discuss methods, and some of the issues, related to creating the shorter program for the A-stream.

3.2 Slipstream microarchitecture

A slipstream processor requires two architectural contexts, one each for the A-stream and R-stream, hardware for directing dynamic instruction removal in the A-stream, and hardware for communicating state between the threads. Four components are required to support slipstream processing. The instruction-removal predictor, or IR-predictor, is a modified branch predictor that generates the program counter (PC) of the next block of instructions to be fetched in the A-stream. The instruction-removal detector, or IR-detector, monitors the R-stream, detects instructions that could have been removed from the program, and conveys this instruction-removal information to the IR-predictor. The delay buffer communicates control and data flow outcomes from the A-stream to the R-stream. The recovery controller is responsible for keeping the A-stream on the correct path.

3.3 Creating the shorter program

The IR-detector monitors past run-time behavior and detects instructions that could have been removed and might be removable in the future. This information is conveyed to the IR-predictor, and after sufficient repeated indications by the IR-detector, the IR-predictor removes future instances of those instructions. The IR-predictor is built on top of a conventional trace predictor. A trace is a large dynamic instruction sequence of 16 to 32 instructions, uniquely identified by a starting PC and the branch outcomes indicating the path through the trace (its ID). An index into a correlated prediction table is formed from the sequence of past trace IDs, using a hash function that favors bits from more recent trace IDs over less recent ones. Each entry in the correlated prediction table contains a trace ID and a 2-bit counter for replacement. The predictor is augmented with a second table indexed with only the most recent trace ID. Together, the two tables form a hybrid predictor that outputs the predicted trace ID of the next trace.

The IR-detector creates the shorter version of the program by removing ineffectual computations. The cases of ineffectual computation are: instructions that write a value to a register or memory location that is overwritten before ever being used (such instructions, and the computation chains leading up to them, have no effect on final program state); instructions that write the same value into a register or memory location as already exists there (their writes are not true modifications, so they and their computation chains have no effect on final program state); and control flow so predictable that it appears deterministic (with a high level of confidence, such branches and the computation chains feeding them can be removed). A more aggressive technique for reducing the A-stream is to remove branch-predictable computation; another possibility is to remove value-predictable computation. An overall better predictor may be possible by combining a conventional predictor with the A-stream.
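As a concrete illustration of the first two triggering conditions (writes overwritten before use, and silent writes), here is a minimal sketch that scans a retired-instruction trace. The tuple encoding is an assumption made for illustration, and propagating removal up the computation chains via the R-DFG is omitted:

```python
# Hedged sketch of the IR-detector's first two triggering conditions on a
# retired trace. Each instruction is (dest, source_locations, value_written);
# the trace format and single-pass analysis are illustrative simplifications.
def find_removable(trace):
    removable = set()
    state = {}         # last value written to each location
    last_write = {}    # location -> index of most recent still-unread write
    for i, (dest, sources, value) in enumerate(trace):
        for src in sources:
            last_write.pop(src, None)         # pending write was used: not dead
        if dest in last_write:
            removable.add(last_write[dest])   # overwritten before use: dead write
        if state.get(dest) == value:
            removable.add(i)                  # silent write: value unchanged
        state[dest] = value
        last_write[dest] = i
    return removable
```

In a real IR-detector this analysis runs continuously over R-stream retirement, and repeated indications (rather than a single hit) train the IR-predictor.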
By removing highly predictable branches or values, the A-stream focuses instead on hard-to-predict branches or values. The IR-detector consumes retired R-stream instructions, addresses, and values. The instructions are buffered and, based on data dependencies, circuitry among the buffers is dynamically configured to establish connections from consumer to producer instructions; in other words, a reverse dataflow graph (R-DFG) is constructed. The IR-detector watches for any of the triggering conditions for instruction removal, and when one is observed, the corresponding instruction is selected for removal.

3.4 IR-misprediction recovery

An instruction-removal misprediction, or IR-misprediction, occurs when A-stream instructions were removed that should not have been. IR-mispredictions are detected by the R-stream; the IR-detector can also detect them sooner by comparing its computed removal information against the corresponding predicted removal information. When an IR-misprediction is detected, the reorder buffer of the R-stream is flushed, and the IR-predictor and A-stream are then resynchronized with the precise state of the R-stream.

3.5 Issues with slipstream processors

The main performance issue with slipstream processors is the accuracy of the A-stream. If the A-stream is to be made more accurate and less speculative, instruction removal must be more conservative, resulting in fewer instructions removed from the A-stream and hence a slower A-stream. A slower A-stream degrades the performance of the entire program, since the R-stream would then also make slower forward progress. The performance of the slipstream processor also depends on the distance between the A-stream and R-stream, because the R-stream is partially sped up by encountering fewer cache misses. If the distance between A-stream and R-stream is not large enough, the long latencies of cache misses encountered by the A-stream may not be completely hidden from the R-stream.
The R-stream then eventually suffers from PDEs. To shield the R-stream from performance degrading events, the A-stream must always stay a comfortable distance ahead of the R-stream. This requires a faster A-stream, because PDEs frequently slow the A-stream down, which in turn requires more aggressive instruction removal from the A-stream. More aggressive removal increases speculation and may result in frequent IR-misprediction recoveries; the IR-misprediction is thus another type of performance degrading event, one specific to slipstream processors.

4 Helper Threads

Branch prediction using a helper thread was first proposed by Farcy et al. [7]. They proposed a mechanism to target highly mispredicted branches within local loops: the computations leading up to applicable branches were duplicated at decode time and used to pre-compute the branch conditions several loop iterations ahead of the current iteration. Branch prediction using subordinate microthreads was proposed by Chappell et al. [2-4]. Zilles and Sohi [18, 19] proposed using speculative slices to pre-compute branch conditions and prefetch addresses. Roth and Sohi [13] proposed a processor capable of using data-driven threads (DDT) to perform critical computations, chains of instructions leading to a mispredicted branch or cache miss. Collins et al. proposed dynamic speculative pre-computation [5, 6]. The common principle behind pre-computational, or helper, threads is discussed next.

4.1 Principles

Performance degrading events tend to be concentrated in a subset of static instructions whose behavior is not predictable with existing mechanisms. Because simple instructions are handled accurately by existing mechanisms, most branch mispredictions and cache misses are concentrated in the remaining static instructions whose behavior is more complex. A small set of frequently executed static instructions is responsible for causing a majority of PDEs.
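One way to picture this concentration is as a profiling pass over a log of dynamic executions. The log format and the absolute count threshold below are illustrative assumptions; the 10% rate follows the characterization of problem instructions given later in this section:

```python
# Sketch of the problem-instruction criterion: a static instruction is a
# problem instruction if it causes a non-trivial number of PDEs (threshold
# of 100 events assumed here) and at least 10% of its dynamic executions
# cause one.
from collections import Counter

def classify_problem_instructions(events, min_pdes=100, min_rate=0.10):
    executions, pdes = Counter(), Counter()
    for pc, caused_pde in events:        # one record per dynamic execution
        executions[pc] += 1
        if caused_pde:
            pdes[pc] += 1
    return {pc for pc in executions
            if pdes[pc] >= min_pdes and pdes[pc] / executions[pc] >= min_rate}
```

An instruction that mispredicts often but rarely relative to its execution count is filtered out, since a slice targeting it would mostly do useless work.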
In the helper thread approach, a code fragment is constructed that mimics the computation leading up to and including these problem instructions. Such a piece of code is called a speculative slice. A speculative slice includes only the operations necessary to compute the outcome of the problem instructions. In contrast to the original program, a speculative slice only needs to approximate it, which gives significant flexibility in slice construction. These speculative slices may execute as helper threads on a multithreaded machine, or via Simultaneous Subordinate Microthreading (SSMT) as proposed in [4]. A helper thread is spawned far enough in advance of the target instruction that an accurate prediction is available by the time the original instruction is fetched. Helper threads accelerate the program's execution microarchitecturally, by prefetching data into the cache and generating branch predictions; they do not affect the architected state of the program, since a slice has its own registers and performs no stores. Forking the slice many cycles before the main thread encounters the problem instructions provides the necessary latency tolerance. After copying some register values from the main thread, the slice thread executes autonomously. Additional hardware is required to bind the branch predictions generated by helper threads to the correct branch instances in the main thread.

4.2 Early execution of speculative slices

The benefit of executing a speculative slice derives from pre-computing a problem instruction significantly in advance of when it is executed in the main program. The following rules for constructing lightweight slices ensure that slice instructions execute early relative to the whole program. Only the instructions necessary to execute the problem instruction are included in the slice. The slice avoids the misprediction stalls that impede the whole program: control flow required by the slice is if-converted, either arithmetically or through prediction. Finally, the code from the original program can be transformed to reduce overhead and shorten the critical path to computing a problem instruction; transformations can be applied to slices that the compiler could not apply to the original program because of limited resources or correctness concerns.

4.3 Slice construction

Construction of a speculative slice depends on the problem instruction and the forking point.

Problem instruction. For a limited number of speculative slices to be most effective at early resolution of problem instructions, the problem instructions must be characterized. A static instruction is characterized as a problem instruction if it accounts for a non-trivial number of PDEs and at least 10% of its executions cause a PDE. [18] shows that problem instructions are responsible for a disproportionate fraction of performance degrading events. A single speculative slice may target one or several problem instructions; slice overhead can be minimized by aggregating all problem instructions that share common dataflow predecessors into a single slice. If a speculative slice terminates with the prefetch of a memory location, the final instruction can be converted into a software prefetch by changing its destination register to the zero register. This conversion allows the speculative slice to terminate as soon as the last instruction is fetched and issued.

Forking point. Each speculative slice executes the dataflow from the primary thread's forking point along a particular control-flow path. Forking a new slice requires communicating seed register values from the primary thread to the helper thread. Selecting a fork point often requires carefully balancing two conflicting desires.
Maximizing the likelihood that the problem instruction's latency is fully tolerated requires the fork point to be as early as possible, but increasing the distance between the fork point and the problem instructions can both increase the size of the slice and reduce the likelihood that the problem instructions will actually be executed. Once the forking point and target problem instruction(s) are identified, only the dependent instructions from the original program are used to construct the speculative slice.

4.4 False spawns

If the primary thread deviates from the path assumed by a speculative slice, that slice does not produce a useful prediction. Such false spawns are damaging because the speculative slices they launch consume resources without performing useful work. Even a speculative slice that follows the correct path must complete in time to be useful. Chappell [3] proposed using a set of difficult paths to guide helper threads: the predictability of branches is classified by control-flow path, difficult paths that frequently lead to mispredictions are identified, and mispredictions are attacked with specially constructed speculative slices for each of these difficult paths. Hardware mechanisms are required to identify a difficult path at fork time.

4.5 Automatic slice construction

For the helper thread technique to be practical, an automatic method of slice construction is necessary. Chappell [3] and Collins [5] have proposed dynamic methods for constructing speculative slices for branch prediction and memory prefetching, respectively. For dynamic construction of speculative slices, the dynamic instruction stream is analyzed for dependencies after the retirement stage. A Retirement Instruction Buffer (RIB) is used for this purpose, storing a trace of committed instructions. The RIB is off the processor's critical path and does not hurt processor performance.
The RIB watches for problem instructions; once a problem instruction is inserted into the RIB, it transitions into slice-building mode.
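The dependence analysis performed in slice-building mode amounts to a backward walk over register dependences, keeping only the producers of values the problem instruction transitively needs. A hedged sketch, where the instruction encoding and the straight-line path are simplifying assumptions:

```python
# Hedged sketch of backward-slice extraction between a fork point and a
# problem instruction along one control-flow path. Each instruction is
# (dest_reg, source_regs); the most recent producer of a needed register
# satisfies the dependence.
def build_slice(path, problem_idx):
    needed = set(path[problem_idx][1])    # registers the problem insn reads
    slice_idx = [problem_idx]
    for i in range(problem_idx - 1, -1, -1):
        dest, srcs = path[i]
        if dest in needed:                # this insn produces a needed value
            slice_idx.append(i)
            needed.discard(dest)          # dependence satisfied by this producer
            needed.update(srcs)           # now need this producer's inputs
    return sorted(slice_idx), needed      # leftover regs = live-ins at the fork
```

The registers still "needed" when the walk reaches the fork point are exactly the seed values the primary thread must copy to the helper thread at spawn time.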

4.6 Branch prediction correlation

To derive any benefit from the predictions made by speculative slices, each prediction must be assigned to the corresponding dynamic instance of the problem branch in the primary thread. Zilles [18] proposed a branch correlator, which consists of multiple queues of predictions, each tagged with the PC of a branch. When a branch is fetched and its PC matches a non-empty queue, the processor uses the prediction at the head of that queue in place of the one generated by the conventional branch predictor.

4.7 Issues with the helper thread technique

Effective use of helper threads for hiding performance degrading events is challenging. Sophisticated techniques are required to cope with the conflicting desires of helper threads, which relate to the forking point of the helper thread and the size of the speculative slice. Even when these conflicting desires are met by careful selection of the forking point, a few more issues arise. Aborting useless helper threads: a mechanism is required to detect and abort useless helper threads when the machine deviates from the path predicted by an active helper thread, while minimizing false aborts. Controlling pre-computations: issues like primary thread mis-speculation recovery, late predictions, and useless predictions complicate the correlation mechanism discussed earlier.

5 Runahead Execution

Mutlu et al. [9, 10] proposed Runahead Execution to tolerate long latency performance degrading events.

5.1 Principle

Runahead execution is a cheap alternative to an unreasonably large instruction window, which would otherwise be needed to exploit instruction-level parallelism aggressively when a simple out-of-order machine stalls on a long latency operation. Runahead execution unblocks an instruction window blocked by a long latency operation and executes speculatively far ahead on the program path.
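The core of the mechanism, distributing a bogus, invalid result for the blocking load, propagating invalidity through dependent instructions, and issuing prefetches only for loads whose sources are valid, can be sketched as follows. The instruction encoding with a precomputed load address is an illustrative assumption:

```python
# Hedged sketch of runahead-mode invalid-bit propagation. Each instruction
# is (opcode, dest_reg, source_regs, load_address); load_address stands in
# for the address computation and is an illustrative simplification.
def runahead_prefetches(instrs, blocked_dest):
    invalid = {blocked_dest}            # the blocking load's bogus result
    prefetches = []
    for op, dest, srcs, addr in instrs:
        if any(s in invalid for s in srcs):
            invalid.add(dest)           # invalid source -> invalid result
            continue
        if op == "load":
            prefetches.append(addr)     # valid load: issue a data prefetch
        invalid.discard(dest)           # dest now holds a valid value
    return prefetches, invalid
```

Instructions independent of the miss still compute usable load addresses, and those prefetches are where runahead's benefit comes from; everything touched by the bogus value is simply marked invalid and skipped.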
When the instruction window is blocked by the long latency operation, the state of the architectural register file is checkpointed and the processor enters runahead mode. It distributes a bogus result for the blocking operation and removes that instruction from the instruction window. Instructions following the long latency instruction are fetched, executed, and retired, but do not update architectural state. When the blocking operation resolves, the processor re-enters normal mode: it restores the checkpointed state and re-executes instructions starting with the blocking operation. The instructions fetched and executed during runahead mode create very accurate prefetches for the data and instruction caches. This technique differs from other pre-computational techniques in that it uses the spare resources of the processor only during stalls. It does not need idle thread contexts or spare resources (e.g., fetch and execution bandwidth), which are not available when the processor is well utilized. Runahead execution mitigates anticipated future performance degrading events by pre-computation only when the primary execution is not using processor resources.

5.2 Features

The main complexity in executing runahead instructions lies in memory communication and in the propagation of invalid results.

Propagating invalid results. Invalid results must be filtered out while taking anticipatory measures for long latency instructions and branches. Each physical register has an invalid bit associated with it to indicate whether or not it holds a bogus value; any instruction that sources an invalid register is itself invalid.

Memory communication. For runahead execution to

most accurately mimic real execution, a runahead cache is used to forward results from stores to dependent loads.

Load prefetches. Only valid load instructions are prefetched.

Branch prediction. Branches are predicted and resolved in runahead mode exactly as they are in normal mode. Branch predictions from runahead mode may be communicated to normal mode to improve its branch prediction.

5.3 Issues with runahead execution

If a mispredicted invalid branch is encountered during runahead mode, it may steer runahead execution onto the wrong path; such branches are called divergence points. Predictions made on the wrong path are wrong and must be discarded. Even then, the processor can reach a control-flow independent point, after which it continues on the correct program path again.

6 Performance analysis

6.1 Slipstream processors

The slipstream processor dramatically improves single-program performance: on average, a 12% improvement for a 4-way processor with 64 ROB entries in a CMP. Slipstreaming on an 8-way SMT processor improves performance by 10%-20%. The benefit of slipstreaming decreases as more execution bandwidth becomes available.

6.2 Helper threads

Speculative slices achieve an average speedup of 10% on a 4-way machine. Hand-constructed speculative slices achieve better performance than dynamically constructed slices.

6.3 Runahead execution

Runahead execution on a machine model based on the Intel Pentium 4 improves performance by 22% across a wide range of memory-intensive applications.

7 Conclusion

Three pre-computational techniques are discussed in this survey. Slipstream and helper threads use unused contexts of a multithreaded processor, while runahead execution uses unused resources of the processor during stalls on long latency instructions. Runahead execution targets all accessible performance degrading events while the processor is stalled on the normal path.
Explicit multithreading techniques, on the other hand, target all performance degrading events all the time. The scope of runahead execution is more limited, but it is more accurate than the multithreading techniques, since it executes the program's own instructions rather than a speculatively constructed version of them. The slipstream processor is a completely dynamic pre-computation technique, whereas helper threads need hand-constructed speculative slices to be most effective. The slipstream processor, executing a reduced copy of the primary thread, remains on the correct path most of the time, so correlating predictions is not a big issue for it. For helper thread techniques, the primary thread's future path has to be anticipated before forking a new slice; this anticipation mechanism sometimes results in false spawns, and the mechanism for correlating helper thread predictions is also more complicated. On the other hand, the slipstream processor suffers from a slow lead thread (the A-stream) and might not hide the long latencies of performance degrading events. For pre-computation techniques to be practically viable, they must be simpler than existing techniques, need to be more flexible (and perhaps more speculative), and need to be dynamic. The slipstream processor may be the more practical option for future microprocessors, but it needs better and more aggressive instruction removal mechanisms to be effective.

References

[1] H. Akkary, R. Rajwar, and S. Srinivasan. Checkpoint Processing and Recovery: Towards Scalable Large Instruction Window Processors. In International Symposium on Microarchitecture, San Diego, California, Dec. 2003.

[2] R. Chappell, J. Stark, S. Kim, S. Reinhardt, and Y. Patt. Simultaneous Subordinate Microthreading (SSMT). In International Symposium on Computer Architecture, Atlanta, Georgia, May 1999.

[3] R. Chappell, F. Tseng, A. Yoaz, and Y. Patt. Difficult-Path Branch Prediction Using Subordinate Microthreads. In International Symposium on Computer Architecture, Anchorage, Alaska, May 2002.

[4] R. Chappell, F. Tseng, A. Yoaz, and Y. Patt. Microarchitectural Support for Precomputation Microthreads. In International Symposium on Microarchitecture, pages 74-84, Istanbul, Turkey, Nov. 2002.

[5] J. Collins, D. Tullsen, H. Wang, and J. Shen. Dynamic Speculative Precomputation. In International Symposium on Microarchitecture, Austin, Texas, Dec. 2001.

[6] J. Collins, H. Wang, D. Tullsen, C. Hughes, Y. Lee, D. Lavery, and J. Shen. Speculative Precomputation: Long-Range Prefetching of Delinquent Loads. In International Symposium on Computer Architecture, pages 14-25, Göteborg, Sweden, June/July 2001.

[7] A. Farcy, O. Temam, R. Espasa, and T. Juan. Dataflow Analysis of Branch Mispredictions and Its Application to Early Resolution of Branch Outcomes. In International Symposium on Microarchitecture, pages 59-68, Dallas, Texas, Nov./Dec. 1998.

[8] C.-C. Lee, I.-C. K. Chen, and T. N. Mudge. The Bi-Mode Branch Predictor. In International Symposium on Microarchitecture, pages 4-13, Research Triangle Park, North Carolina, Dec. 1997.

[9] O. Mutlu, J. Stark, C. Wilkerson, and Y. N. Patt. Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors. In International Symposium on High-Performance Computer Architecture, Anaheim, California, Feb. 2003.

[10] O. Mutlu, J. Stark, C. Wilkerson, and Y. N. Patt. Runahead Execution: An Effective Alternative to Large Instruction Windows. IEEE Micro, Special Issue: Micro's Top Picks from Microarchitecture Conferences, 23(6):20-25, Nov./Dec. 2003.

[11] K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chang. The Case for a Single-Chip Multiprocessor. In International Conference on Architectural Support for Programming Languages and Operating Systems, pages 2-11, Cambridge, Massachusetts, Oct. 1996.

[12] Z. Purser, K. Sundaramoorthy, and E. Rotenberg. A Study of Slipstream Processors. In International Symposium on Microarchitecture, Monterey, California, Dec. 2000.

[13] A. Roth and G. Sohi. Speculative Data-Driven Multithreading. In International Symposium on High-Performance Computer Architecture, pages 37-48, Monterrey, Mexico, Jan. 2001.

[14] E. Sprangle, R. S. Chappell, M. Alsup, and Y. N. Patt. The Agree Predictor: A Mechanism for Reducing Negative Branch History Interference. In International Symposium on Computer Architecture, Denver, Colorado, June 1997.

[15] K. Sundaramoorthy, Z. Purser, and E. Rotenberg. Slipstream Processors: Improving both Performance and Fault Tolerance. In International Conference on Architectural Support for Programming Languages and Operating Systems, Cambridge, Massachusetts, Nov. 2000.

[16] D. Tullsen, S. Eggers, and H. Levy. Simultaneous Multithreading: Maximizing On-Chip Parallelism. In International Symposium on Computer Architecture, Santa Margherita Ligure, Italy, June 1995.

[17] T.-Y. Yeh and Y. N. Patt. Two-Level Adaptive Training Branch Prediction. In International Symposium on Microarchitecture, pages 51-61, Albuquerque, New Mexico, Nov. 1991.

[18] C. Zilles and G. Sohi. Execution-Based Prediction Using Speculative Slices. In International Symposium on Computer Architecture, pages 2-13, Göteborg, Sweden, June/July 2001.

[19] C. B. Zilles and G. S. Sohi. Understanding the Backward Slices of Performance Degrading Instructions. In International Symposium on Computer Architecture, Vancouver, Canada, June 2000.


Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1) Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling

More information

Complex Pipelines and Branch Prediction

Complex Pipelines and Branch Prediction Complex Pipelines and Branch Prediction Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. L22-1 Processor Performance Time Program Instructions Program Cycles Instruction CPI Time Cycle

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

Dynamic Control Hazard Avoidance

Dynamic Control Hazard Avoidance Dynamic Control Hazard Avoidance Consider Effects of Increasing the ILP Control dependencies rapidly become the limiting factor they tend to not get optimized by the compiler more instructions/sec ==>

More information

ECE 341. Lecture # 15

ECE 341. Lecture # 15 ECE 341 Lecture # 15 Instructor: Zeshan Chishti zeshan@ece.pdx.edu November 19, 2014 Portland State University Pipelining Structural Hazards Pipeline Performance Lecture Topics Effects of Stalls and Penalties

More information

High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs

High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs October 29, 2002 Microprocessor Research Forum Intel s Microarchitecture Research Labs! USA:

More information

Exploiting Large Ineffectual Instruction Sequences

Exploiting Large Ineffectual Instruction Sequences Exploiting Large Ineffectual Instruction Sequences Eric Rotenberg Abstract A processor executes the full dynamic instruction stream in order to compute the final output of a program, yet we observe equivalent,

More information

EXAM 1 SOLUTIONS. Midterm Exam. ECE 741 Advanced Computer Architecture, Spring Instructor: Onur Mutlu

EXAM 1 SOLUTIONS. Midterm Exam. ECE 741 Advanced Computer Architecture, Spring Instructor: Onur Mutlu Midterm Exam ECE 741 Advanced Computer Architecture, Spring 2009 Instructor: Onur Mutlu TAs: Michael Papamichael, Theodoros Strigkos, Evangelos Vlachos February 25, 2009 EXAM 1 SOLUTIONS Problem Points

More information

Pipelining and Vector Processing

Pipelining and Vector Processing Chapter 8 Pipelining and Vector Processing 8 1 If the pipeline stages are heterogeneous, the slowest stage determines the flow rate of the entire pipeline. This leads to other stages idling. 8 2 Pipeline

More information

Evaluation of Branch Prediction Strategies

Evaluation of Branch Prediction Strategies 1 Evaluation of Branch Prediction Strategies Anvita Patel, Parneet Kaur, Saie Saraf Department of Electrical and Computer Engineering Rutgers University 2 CONTENTS I Introduction 4 II Related Work 6 III

More information