Speculation Control for Simultaneous Multithreading


Dongsoo Kang
Dept. of Electrical Engineering, University of Southern California

Jean-Luc Gaudiot
Dept. of Electrical Engineering and Computer Science, University of California, Irvine

Abstract

Speculative execution helps modern processors expose independent instructions on the fly and thereby exploit more Instruction-Level Parallelism. However, when incorrect speculations occur, useless work is performed for the incorrectly speculated instructions. This lowers sustained performance and leads to a significant waste of power. Unlike superscalar processors, Simultaneous Multithreading (SMT) processors can concurrently execute multiple threads. Thus, they have the opportunity to control speculative execution by deliberately choosing, at each cycle, the threads from which instructions will be fetched, taking into account the dynamic characteristics of the running threads. In this paper, we present an efficient front-end mechanism, called SAFE-T (Speculation-Aware Front-End Throttling), for scheduling threads in SMT processors. It involves thread prioritizing and throttling: the priority given to a thread can be overridden when that thread appears to suffer from an excessive amount of incorrect speculation, thereby preventing its instructions from being fetched. Simulation results show that our policy provides an average reduction of 41.6% in the number of wrong-path instructions and improves the instruction throughput by up to 14.5%. A cost-effective implementation of the proposed policy is shown as well.

1. Introduction

In an effort to overcome the limited Instruction-Level Parallelism (ILP) within application programs, Simultaneous Multithreading (SMT) processors exploit Thread-Level Parallelism (TLP) [4] [8] [15].
By filling the instruction window with instructions fetched from multiple threads, an SMT processor is able to exploit TLP as well as ILP, with the inherent capability of decreasing horizontal and vertical waste [4] and thus providing high instruction throughput. (This work is partly supported by the National Science Foundation under Grants No. CSA and INT. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.)

The overall performance of an SMT processor depends on many factors, including how the threads are selected and the number of threads from which to fetch instructions. Further, how to allocate the limited fetch slots to the selected threads must be judiciously decided. For example, if instructions fetched from a thread reside in the instruction window for too many cycles before they are issued (due to dependencies and latencies), they occupy valuable entries of the window that could be used by other threads, ultimately limiting the ILP and TLP which can be exploited.

Tullsen et al. [19] examined several pipeline variables for prioritizing all running threads and choosing a few of them, and they reported that the thread scheduling policy based on the ICOUNT variable, which indicates the number of instructions in the front-end stages, provided the best performance in terms of overall instruction throughput. However, the ICOUNT variable is not aware of speculative execution. Since instructions must be discarded from the intermediate stages of the pipeline if they are found to have been incorrectly speculated, the ICOUNT variable cannot correctly reflect the activities of threads, and the number of instructions discarded due to incorrect speculations (wrong-path instructions) is quite high.
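To make the ICOUNT idea concrete, the prioritization step can be sketched as follows. This is an illustrative Python sketch, not the authors' simulator code; the function name and data layout are our own.

```python
# Illustrative sketch of ICOUNT-style thread prioritization: each cycle,
# threads with fewer instructions in the front-end pipeline stages are
# given higher fetch priority.

def icount_priorities(icount):
    """icount maps thread id -> number of in-flight front-end instructions.
    Returns thread ids ordered from highest to lowest fetch priority."""
    return sorted(icount, key=icount.get)

# Thread 2 has the fewest in-flight instructions, so it is fetched from first.
print(icount_priorities({0: 7, 1: 3, 2: 1}))  # [2, 1, 0]
```

Because the counter only tracks how many instructions occupy the front end, not whether they are on a correct path, a thread speculating down a wrong path can still look attractive to this ranking.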
In this research, we observed that in an SMT processor using the ICOUNT-based policy, wrong-path instructions account for 16.2 ~ 28.8% of all instructions fetched. The unnecessary work done for these instructions limits the sustained performance achieved by SMT processors and makes them power-inefficient, due largely to unnecessary switching activity of the logic gates.

The goal of the study presented in this paper is to develop a fetch mechanism for thread scheduling that enables SMT processors to dynamically control the speculative execution of threads. We accomplish this by employing two pipeline variables, ICOUNT and LCOUNT, which capture distinct dynamic behaviors of the threads, in a new mechanism called SAFE-T (Speculation-Aware Front-End Throttling). The LCOUNT variable represents the number of unresolved, low-confidence conditional branches determined by confidence estimation [6] [11]. It is used to throttle threads that appear to have been incorrectly speculated, even if they are highly prioritized by the ICOUNT variable. Our experimental results show that such a hybrid policy is able to eliminate a noticeable amount of wrong-path instructions, with an attendant enhancement in instruction throughput.

Prior to describing our front-end policy, we review related work regarding front-end policies for SMT and confidence estimation techniques (Section 2). Our front-end policy and the hardware mechanisms that embody it are discussed in more detail in Section 3. We describe the simulation environment we used to evaluate the policy in Section 4, and the experimental results are presented in Section 5.

2. Related work

The performance of a superscalar processor certainly depends on how many independent instructions are delivered to both the front-end and the back-end stages. However, modern microprocessors have notoriously suffered from the limited instruction parallelism within representative application programs, consequently yielding diminishing returns even when the issue width is increased [4] [20]. SMT overcomes this limited ILP within a thread by concurrently fetching and executing instructions from other threads, thereby increasing resource utilization and overall performance.

2.1. Front-end policies for SMT

Just as for superscalars, the performance of SMT processors is affected by the quality of the instructions injected into the pipeline. For instance, if the instructions being processed have dependencies among one another or have long latencies, the ILP and TLP which can be exploited will be limited, clogging the instruction window and then stalling the front-end stages (fetch, decode, and dispatch). Therefore, how to fill the front-end stages of an SMT processor with instructions fetched from multiple threads is a critical decision which needs to be made at each cycle.
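As an illustration of this per-cycle decision, a generic fetch-slot allocation loop might look like the following. This is our own hypothetical sketch (names and parameters are not from the paper): threads are walked in priority order, each selected thread receives a bounded number of slots, and allocation stops when the fetch width is exhausted.

```python
# Hypothetical sketch of per-cycle fetch-slot allocation in an SMT
# front end: walk threads in priority order, grant each selected thread
# up to `num_insts` slots, and stop when the fetch width is used up.

def allocate_fetch_slots(priority_order, available, fetch_width, num_threads, num_insts):
    """available[tid] = instructions thread tid can supply this cycle.
    Returns a dict mapping thread id -> fetch slots granted."""
    grants, remaining = {}, fetch_width
    for tid in priority_order[:num_threads]:
        n = min(num_insts, available.get(tid, 0), remaining)
        if n:
            grants[tid] = n
            remaining -= n
    return grants

# Thread 2 can only supply 2 instructions, so the leftover slots
# spill over to the next thread in priority order.
print(allocate_fetch_slots([2, 1, 0], {2: 2, 1: 4, 0: 4}, 4, 2, 4))  # {2: 2, 1: 2}
```

The point of the sketch is that both the ordering (`priority_order`) and the caps (`num_threads`, `num_insts`) are policy knobs; the rest of the paper is about choosing the ordering well.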
Three parameters, algorithm, num_threads, and num_insts, characterize the front-end policies for SMT. The first determines how to choose threads among all available threads. The second is the number of threads to fetch from at each cycle, and the third is the number of instructions which can be fetched at each cycle per thread.

Tullsen et al. [19] suggested several priority-based front-end policies which surpass the simple round-robin policy. They investigated four policies which prioritize threads according to four pipeline variables: BRCOUNT, MISSCOUNT, ICOUNT, and IQPOSN. These pipeline variables were mainly devised to avoid clogging the issue buffers, which may occur when instructions reside in the pipeline for many cycles until they are retired. Among these policies, the one based on ICOUNT was found to provide the highest performance in terms of instruction throughput.

However, these variables do not take into account that even after an instruction has been injected into the pipeline, it must be discarded from the front-end or the back-end stages whenever a conditional branch preceding it is determined to have been incorrectly predicted. Wrong-path instructions, fetched along the incorrectly predicted path of a conditional branch, consume not only fetch slots in the front-end but also valuable functional units in the back-end, which correspondingly reduces instruction throughput and power efficiency. This is part of the reason why the sustained instruction throughput obtained under the ICOUNT-based policy is still well below the possible peak.

2.2. Controlling speculative executions

Speculative execution is an aggressive technique which has been used in an effort to achieve higher performance by reducing the effect of control dependencies among instructions. Predicting the outcome of branches allows a processor to speculatively execute instructions fetched along the predicted target path of a branch.
As modern processors tend to have deeper and wider pipelines, however, the penalty for incorrect speculations becomes increasingly substantial. Because confidence estimators [6] [11] are able to assess the quality of conditional branch predictions, they have been adopted as a building block in many applications [1] [3] [7] [13] [14]. Confidence estimators are very similar to branch predictors [21] in the sense that both make binary decisions (high-confidence vs. low-confidence, and predicted-taken vs. predicted-not-taken, respectively). Like branch predictors, confidence estimators keep a table of miss distance counters (MDCs) to record a history. However, whereas a branch predictor takes into account the actual path of branches (i.e., was the branch taken or not?) when making future predictions, a confidence estimator takes into account the history of the outcomes of branch predictions (i.e., was the branch correctly predicted or not?). If a conditional branch was correctly predicted, the corresponding MDC is incremented by one; if it was incorrectly predicted, the MDC is reset to zero. A branch prediction is considered to have high confidence when its corresponding MDC value is greater than a given confidence threshold.

In order to reduce the power demand of superscalar processors, Manne et al. [14] developed a particular form of speculation control called pipeline gating, in which the fetch unit stalls when there are more low-confidence predictions than a certain pre-determined threshold. Even though this scheme can eliminate many wrong-path instructions and thus reduce the unnecessary activity in the fetch and decode stages, it suffers from a small (1%) loss of performance. Adopting the concept of pipeline gating, Luo et al. [13] proposed a speculation-aware policy for SMT processors. Under this policy, running threads are prioritized and gated according to the value of their corresponding LCOUNT (short for LC_BPCOUNT) variables, each of which indicates the number of unresolved conditional branches with low confidence per thread. This front-end policy significantly reduced the total amount of wrong-path instructions, from 9 ~ 24% to 8 ~ 12%. However, when compared with the previous policies examined in [19], this policy is certainly the most expensive in terms of the additional hardware required, since the confidence estimator on which it relies is implemented with a table of MDCs for each thread. Moreover, unlike the ICOUNT variable, the LCOUNT variable is not able to perceive the dynamic latency of instructions in the pipeline stages, mostly in the back-end stages. We observed that it performed worse than the ICOUNT variable for workloads mixing integer and floating-point benchmarks.

3. SAFE-T

The front-end fetch mechanism we now propose for SMT, called SAFE-T (Speculation-Aware Front-End Throttling), integrates the positive features of the ICOUNT and LCOUNT variables, thereby achieving enhanced instruction throughput and power efficiency. The two variables represent the dynamic characteristics of the running threads. We will explain how these variables are used to schedule threads at the fetch stage of an SMT processor. Next, the overhead of the confidence estimation on which the LCOUNT variable relies will be discussed, and a cost-effective confidence estimation scheme will be presented.
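The integration of the two variables can be sketched as follows: rank by ICOUNT, then veto any thread whose LCOUNT exceeds a throttling threshold. This is an illustrative sketch under our own naming, not the authors' hardware description; the default thresholds follow the values used later in the paper.

```python
# Sketch of the SAFE-T fetch decision: rank threads by ICOUNT (fewer
# in-flight front-end instructions = higher priority), then drop any
# thread whose LCOUNT (unresolved low-confidence conditional branches)
# exceeds the throttling threshold, and fetch from the top survivors.

def safe_t_select(icount, lcount, throttle_threshold=1, num_threads=2):
    ranked = sorted(icount, key=icount.get)          # prioritize by ICOUNT
    eligible = [t for t in ranked
                if lcount.get(t, 0) <= throttle_threshold]  # throttle by LCOUNT
    return eligible[:num_threads]

# Thread 1 is ranked first by ICOUNT but is throttled (LCOUNT 4 > 1),
# so threads 2 and 0 are selected instead.
print(safe_t_select({0: 5, 1: 2, 2: 3}, {0: 0, 1: 4, 2: 1}))  # [2, 0]
```

Note that throttling does not reorder threads; it only removes candidates, so the ICOUNT ranking still decides among the threads that survive the LCOUNT check.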
3.1. Prioritizing and throttling threads

For each thread, there is a counter (the ICOUNT variable) which represents the number of in-flight instructions in the front-end stages of the pipeline. By looking up these counters, thread prioritizing gives high priority to the threads with lower values. This information is used to select the threads from which instructions will be fetched during the next cycle. Unlike the policies examined in [19], however, the SAFE-T mechanism enables the fetch unit to reject selected threads, even if they are highly prioritized by the ICOUNT variable, as a way of controlling speculation. In order to throttle selected threads, there is a designated counter (the LCOUNT variable) for each running thread, as shown in Figure 1. The counter is incremented by one whenever a conditional branch prediction is made with low confidence in the fetch stage. Conversely, it is decremented by one whenever a low-confidence conditional branch is resolved in the complete stage or is discarded from one of the preceding stages during recovery from an incorrect speculation.

Figure 1. SAFE-T mechanism.

If a running thread has an LCOUNT value which is greater than a given throttling threshold, the fetching of instructions from that thread is stalled, on the assumption that an incorrect path has been entered. That is, it is assumed that if instructions are fetched along the current path of the thread, they will be discarded once a previous conditional branch has been resolved. In fact, once wrong-path instructions are fed into the pipeline, they must reside in it until the previous branches are determined to have been incorrectly predicted, contending for valuable resources with other useful instructions. By throttling the threads after prioritizing them, the fetch unit of an SMT processor can save fetch slots which would otherwise be wasted due to incorrect speculations, and can then keep useful instructions in the instruction window by allocating these saved slots to other threads.

3.2. Minimizing the overhead of confidence estimation

Figure 2. Two types of confidence estimators used for SAFE-T.

As shown in Figure 2 (a), we use a confidence estimator based on miss distance counters (MDCs) [11] to assess the quality of each conditional branch prediction. 4-bit MDCs are used, generating a binary signal (low-confidence or high-confidence) according to a given confidence threshold. The branch history register (BHR), which holds the global history of recently resolved conditional branches, is shared with the collaborating branch prediction unit. In previous work [1] [6] [7] [11], the confidence estimators for superscalar processors were designed primarily to foretell the correctness of the current prediction for a conditional branch according to the recent history of that branch's outcomes. Thus, a table of registers was required, each register recording the correctness of recent predictions for the same conditional branch. As an extension of this approach, a confidence estimator for an SMT processor can have one table of registers per thread, or alternatively a single table shared by all the threads. When shared, the size of the table should be proportional to the number of threads, due to the increased branch misprediction rate caused by thread interference [10]. Neither implementation is inexpensive, since both require a lot of space and their complexity grows with the number of threads.

We claim that the role of a confidence estimator for the front-end policy of an SMT processor can be restricted to indicating whether or not a thread appears to have entered a section of instructions in which conditional branches are likely to be incorrectly predicted. Also, a conditional branch must be discarded regardless of the correctness of its prediction whenever any of its preceding conditional branches is resolved to have been mispredicted.
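The table-based (tmdc-style) estimator discussed above can be sketched as a table of saturating counters. The MDC update rule (increment on a correct prediction, reset on a misprediction, high confidence only above a threshold) is as described in Section 2.2; the gshare-like indexing by PC xor global history is our assumption, not a detail given in the paper.

```python
# Sketch of a table-based (tmdc-style) confidence estimator built from
# saturating miss distance counters (MDCs).  A prediction is judged
# high-confidence only if its counter exceeds the confidence threshold.

class MDCTable:
    def __init__(self, entries=2048, bits=4, threshold=8):
        self.counters = [0] * entries
        self.mask = entries - 1          # entries assumed to be a power of two
        self.max = (1 << bits) - 1       # 4-bit counters saturate at 15
        self.threshold = threshold

    def _index(self, pc, ghr):
        # Assumed gshare-like indexing: PC xor global branch history.
        return (pc ^ ghr) & self.mask

    def high_confidence(self, pc, ghr):
        return self.counters[self._index(pc, ghr)] > self.threshold

    def update(self, pc, ghr, correct):
        i = self._index(pc, ghr)
        self.counters[i] = min(self.counters[i] + 1, self.max) if correct else 0
```

A run of nine correct predictions raises a counter above the default threshold of 8, while a single misprediction resets it to zero, matching the reset-on-miss behavior described above.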
In addition, we observed that mispredictions of conditional branches occur in bursts in our execution-driven simulations of an SMT processor. Indeed, as shown in Figure 3, on average 75% of misprediction distances are no more than 8; in most cases, a misprediction is separated from the previous one by no more than 8 correct predictions. The results shown in Figure 3 were obtained by running the SPEC CPU2000 benchmarks in multi-thread mode on the baseline SMT architecture specified in Section 4. For a superscalar processor, Heil and Smith reported a similar observation from a trace-driven simulation [7].

Figure 3. Distribution of distances between mispredicted branches ([1:8], [9:16], and 17+) for 164.gzip, 175.vpr-pl, 175.vpr-rt, 176.gcc, 181.mcf, and 197.parser.

The features observed allow us to build an inexpensive confidence estimator by just recording the recent history in a single register per thread, shared by all static branches, without the need for a table of MDCs. This confidence estimator with a single MDC assigns low confidence to the current conditional branch prediction if an incorrect prediction has been detected and the number of correct predictions since then is no more than a certain threshold, even though those predictions may have been made for different conditional branches.

Figure 4. A schematic diagram of our SMT architecture.
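The single-register estimator just described can be sketched as follows (again an illustrative sketch with our own names): one global MDC per thread, reset by any misprediction and shared by all static branches of that thread.

```python
# Sketch of the gmdc confidence estimator: a single global MDC per
# thread.  After any misprediction, the next `threshold` predictions of
# that thread are treated as low-confidence, regardless of which static
# branch they belong to.

class GlobalMDC:
    def __init__(self, bits=4, threshold=8):
        self.max = (1 << bits) - 1
        self.threshold = threshold
        self.count = 0    # correct predictions since the last misprediction

    def resolve(self, correct):
        self.count = min(self.count + 1, self.max) if correct else 0

    def low_confidence(self):
        return self.count <= self.threshold
```

Compared with a tmdc table of 2048 counters per thread, this costs a single small register per thread, which is what makes the estimator inexpensive; the burstiness of mispredictions observed above is what makes the shared history a reasonable proxy for per-branch confidence.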

This confidence estimator is truly inexpensive since it uses a single MDC, as shown in Figure 2 (b). We call it gmdc, since a global MDC is shared by all static conditional branches encountered in a thread, while a confidence estimator with a table of MDCs is called tmdc. In gmdc, the global MDC is updated as described in [11] whenever a conditional branch is resolved.

Figure 5. Characteristics of the SPEC CPU2000 benchmarks: (a) breakdown of conditional and unconditional branches, (b) branch misprediction rates, (c) fraction of wrong-path instructions, and (d) instruction throughput in single-thread mode. Branch misprediction rates were collected using a gshare branch predictor.

4. Simulation methodology

To properly evaluate the effects of the proposed front-end policy and the underlying confidence estimators, we designed an execution-driven simulator derived from the SimpleScalar tool set [2]. We modified the sim-outorder simulator to implement an SMT processor model (Figure 4) which supports out-of-order and speculative execution. The architectural model contains seven pipeline stages: fetch, decode, dispatch, issue, execute, complete, and commit. Several resources, such as the PC, the integer and floating-point register files, and the branch predictor, are replicated to support multiple thread contexts.

4.1. Experimental set-up

The major simulation parameters are shown in Table 1, and the configuration parameters for the functional units are shown in Table 2.
In Table 1, each cache line size is in bytes, and the 128KB instruction cache is equivalent to a 64KB cache for 32-bit instructions since the simulator uses the 64-bit PISA instruction set. The simulator is configured to issue as many instructions as the total number of functional units at each cycle. When multiple instructions are ready to be issued, older instructions have higher priority than newer ones. We used two types of branch predictors [16] [21]: one is a gshare branch predictor with 2048 entries, and the other is a hybrid predictor that consists of the same gshare predictor and a 2048-entry bimodal predictor. The branch misprediction penalty is a minimum of eight cycles: six cycles for the branch delay and two cycles for restoring the correct architectural state after each misprediction.

We used the SPEC CPU2000 benchmark suite [9] to build workloads for performance simulation. Our workloads consist of seven integer benchmarks (164.gzip, 175.vpr, 176.gcc, 181.mcf, 197.parser, and others) and one floating-point benchmark (183.equake). As shown in Table 3, seven multiprogrammed workloads were created for the simulation experiments. The characteristics of these benchmarks are shown in Figure 5.

Table 1. Simulation parameters.
Fetch rate: 4
Dispatch rate: 4
Retire rate: 4
Branch predictor: 2-level: 2K gshare; hybrid: 2K gshare + 2K bimodal (2K meta)
Branch target buffer: -way
Branch mispredict penalty: 8+ cycles
Return address stack: 16
L1 instruction cache: 128KB (512:64:4:LRU)
L1 data cache: 64KB (512:32:4:LRU)
L2 cache: 512KB (2048:64:4:LRU)
Main memory: 256-bit width
I-TLB: 512KB (32:4096:4:LRU)
D-TLB: 1MB (64:4096:4:LRU)
IFQ size: 16
IDQ size: 16
RUU size: 64
LQ/SQ size: 16/8
INT units: 4
FP units: 2

We compiled all the benchmarks with gcc -O2 and ran each with its corresponding lgred input data set from MinneSPEC [12]. Each simulation of a workload runs T x 500 million instructions, where T is the number of threads, after fast-forwarding through the first 300 million instructions of each thread to skip the initialization part of the benchmarks.

4.2. Simulated front-end policies

We simulated and evaluated the following four front-end policies for SMT:

Ti: Threads are prioritized according to the values of the ICOUNT variable.

Tc: Threads are prioritized and gated according to the values of their LCOUNT variables. When two or more threads have the same priority value, the ICOUNT value is used as a tie-breaker. The confidence threshold is set to 8 and the gating threshold is set to 1. There is a table of 2048 MDCs per thread, and each MDC is a 4-bit register.

St and Sg: Threads are scheduled according to the SAFE-T mechanism. St relies on a tmdc confidence estimator with 2048 entries, whereas Sg uses a gmdc. When two or more threads have the same value of the ICOUNT variable, the LCOUNT value is used as a tie-breaker. The confidence threshold is set to 8 and the gating threshold is set to 1 by default. For the MDCs, 4-bit registers are used.

In all front-end policies, the 2.4 scheme was used for distributing the available fetch slots, since the Ti policy showed its best performance with this scheme. Thus, up to two threads can be selected at each cycle, and up to four instructions can be fetched from each thread. Since the fetch rate of the baseline SMT architecture is 4, the second thread, which has the lower priority, is fetched from only if slots remain after they have been allocated to the first.

Table 2. Configuration of functional units (repeat rate, latency): INT add/logical/shift 1, 1; INT mult 1, 3; INT div; FP add/comp 1, 2; FP mult 1, 4; FP div; FP sqrt.

Table 3. Workloads used for simulations. Workload / Benchmarks: 164.gzip 197.parser gzip 175.vpr (place) 176.gcc vpr (place) 176.gcc 197.parser vpr (route) 176.gcc 181.mcf gzip 175.vpr (route) 181.mcf gzip 176.gcc 183.equake vpr (place) 176.gcc 181.mcf

5. Experimental evaluation

In order to evaluate the effectiveness of the proposed SAFE-T mechanism, we simulated St and Sg to measure

instruction throughput and the amount of wrong-path instructions. The results obtained are compared with those of the Ti and Tc policies. In addition, we analyze the impact of the confidence and gating thresholds on St and Sg, which implement the SAFE-T mechanism with different structures of confidence estimator.

5.1. Instruction throughput

Our simulation results, in terms of IPC (instructions per cycle), are presented in Figure 6. For each workload, a set of four histograms is shown; each represents the performance of one of the four front-end policies in our discussion: Ti, Tc, St, and Sg. The rightmost set of histograms represents harmonic means over all seven workloads.

Figure 6. Instruction throughputs for (a) gshare and (b) hybrid branch prediction schemes.

The figure shows that the St policy is clearly superior to the others because of its higher IPC for all workloads. As a matter of fact, it improves the IPC by up to 14.5% compared with the Ti policy, which has been known to be the most effective for high performance. Even compared with the St policy, the Sg policy yields almost equivalent instruction throughput, although it relies on the gmdc confidence estimator with one miss distance counter per thread. This result indicates that a confidence estimator which updates the LCOUNT variable using the global history of conditional branches is effective, since an SMT processor is able to exploit TLP as well as ILP.

To better understand the characteristics of the front-end policies, the average priority assigned by all four front-end policies was measured for the benchmarks in Workload 3 and Workload 5, as shown in Figure 7. Among the benchmarks in Workload 3, both the Ti and Tc policies give higher priority to 176.gcc and its co-runner than to the other benchmarks. Comparing the two, the Ti policy favors 176.gcc, while the Tc policy gives the co-runner a higher priority than 176.gcc; 176.gcc has a larger percentage of wrong-path instructions and a smaller IPC than the co-runner. This means that the ICOUNT variable can generate incorrect feedback about the instruction flow in the pipeline. Accordingly, Ti achieves a smaller IPC than Tc for Workload 3. Even though the St and Sg policies prioritize threads by using ICOUNT like the Ti policy, they can cancel the priority given to 176.gcc as a result of thread throttling by using LCOUNT. Thus, they avoid the distorted signal for 176.gcc from ICOUNT. This is the main reason that the St and Sg policies yield a better IPC than the Ti policy for Workload 3.

Figure 7. Average priority assigned to the benchmarks in (a) Workload 3 and (b) Workload 5. A smaller number means a higher priority.

The Tc policy appears inferior to the others. Even compared to the Ti policy, it yields decreased performance for all workloads except Workload 3 and Workload 6. In the case of Workload 5, where 183.equake is a floating-point benchmark and the others are integer benchmarks, the Tc policy assigned 1.47 to 183.equake whereas the Ti policy assigned 1.72, and the St policy assigned a larger value as well. This means that the Tc policy favored 183.equake more than the other policies did, and that the LCOUNT variable is not appropriate for detecting threads with rapidly retired instructions. Consequently, the Tc policy tends to fetch instructions which are likely to clog the pipeline due to their long latencies. We measured the average slip time, as shown in Figure 8. The slip time of an instruction can be defined as the amount of time that elapses from when it is dispatched into the instruction window until it is retired. These results show that the Tc policy selects those threads with long-latency instructions, even though they cause a comparatively small number of incorrect speculations.

5.2. Wrong-path instructions

Figure 9 shows the percentage of wrong-path instructions fetched into the pipeline for the Ti, Tc, St, and Sg policies. As they were designed to, the three policies based on the LCOUNT variable significantly reduce wrong-path instructions. The Tc policy reduces wrong-path instructions by 35.1% on average. However, this reduction does not lead to a noticeable improvement in IPC, because the Tc policy tends to select threads with long-latency instructions, as shown in Figure 8. On average, the St policy shows a reduction of 36.3% in the number of wrong-path instructions, and the Sg policy yields a reduction of 41.6%.
This shows that for SMT processors the inexpensive confidence estimator (gmdc) is adequate both to determine whether a thread has entered an incorrect path and to effectively prevent instructions on that path from entering the pipeline.

Figure 8. Average slip time of instructions in the back-end stages.

Figure 9. Percentage of wrong-path instructions.

5.3. Impact of throttling threshold

The threshold values used for thread throttling with the St and Sg policies affect the instruction throughput and the number of wrong-path instructions. As the throttling threshold increases, threads are allowed to have more conditional branch predictions with low confidence. Consequently, the pipeline has more chance of being fed with instructions which are likely to be discarded. These instructions compete for resources with other useful instructions (those which actually contribute to instruction throughput). Since these conflicts tend to increase the delay of instructions in the pipeline, they negatively affect the overall performance. For instance, if the gating threshold is set to infinity, no threads will be gated by the fetch unit. Thus, as the throttling threshold increases, the performance of St and Sg, with respect to both the IPC and the percentage of wrong-path instructions, converges to that of the Ti policy.

To better understand the impact of throttling thresholds on the instruction throughput and the number of wrong-path instructions, we measured both the IPC and the fraction of wrong-path instructions over all fetched instructions while varying the value of the throttling threshold. The experimental data were obtained when the underlying 4-bit MDCs had a confidence threshold of 8. A gshare branch predictor, configured as mentioned in Section 4, was used.
The data points presented in Figure 10 and Figure 11 are averages calculated over the seven workloads given in Table 3.

Figure 10. Throttling threshold vs. IPC.

Figure 11. Throttling threshold vs. percentage of wrong-path instructions.

As the throttling threshold is increased, there is less chance that threads will be throttled after they are prioritized. Both St and Sg show the same behavior, and the best IPC and speculation control are achieved when the throttling threshold is set to 1. If the threshold increases to 2, the IPC of Sg degrades by 3.8% and the fraction of wrong-path instructions grows from 13.9% to 18.6% on average. When the threshold is further increased to 3, the average IPC decreases to 1.75, which is close to that of the Ti policy; with a large throttling threshold, the Sg policy provides no more performance than the Ti policy.

5.4. Impact of confidence threshold

In truth, the value range of the LCOUNT variable, which is referenced to remove prioritized threads from consideration for fetching for some period of time by depriving them of their assigned priority, is affected by the confidence threshold. According to the confidence threshold, the underlying confidence estimator used for the St and Sg policies determines whether the prediction of each conditional branch is low-confidence or high-confidence. If the confidence threshold is low, the confidence estimator is optimistic and will classify most predictions as high-confidence.

In order to examine the impact of confidence threshold values, we ran the seven workloads in Table 3 and measured the changes in the IPC and the number of wrong-path instructions while varying the confidence threshold value. For thread throttling, the throttling threshold was set to 1. The results obtained for the IPC and the percentage of wrong-path instructions are shown in Figure 12 and Figure 13, respectively.

Figure 12. Confidence threshold vs. IPC.

Figure 13. Confidence threshold vs. percentage of wrong-path instructions.

We can see that as the confidence threshold increases, the IPC rises slightly and the amount of wrong-path instructions is further reduced. This means that an underlying confidence estimator for the SAFE-T mechanism needs to be pessimistic about branch predictions after a misprediction has been detected, since even a correctly predicted branch and its subsequent instructions must be discarded if a preceding branch is found to have been incorrectly predicted.

6. Conclusions

An SMT processor collects instructions from multiple threads and deploys them to the shared instruction window in order to exploit both ILP and TLP. Thus, how to fill the front-end stages with instructions from multiple threads is critical for SMT processors. We have proposed here a thread scheduling mechanism for SMT processors, called SAFE-T, which prioritizes threads according to the ICOUNT variable and throttles threads when they appear to be on incorrect paths, based on the LCOUNT variable representing the number of unresolved conditional branches with low-confidence predictions in the pipeline. SAFE-T enables SMT processors to increase instruction throughput by up to 14.5% and to reduce wrong-path instructions by 41.6% on average, when compared with the policy using the ICOUNT variable only.

As for the implementation cost of our front-end policy, we have examined a confidence estimator with a global MDC instead of a table of MDCs and have evaluated its effectiveness. This inexpensive implementation, the Sg policy, has been shown to be comparable to St, which uses a table of MDCs, in terms of instruction throughput and speculation control.

High performance is the primary goal of any modern processor. However, it has been achieved at the cost of wasted work, such as instructions discarded from the pipeline. As the pipelines of processors become wider and deeper, the amount of wasted work will only increase. Thus, the proposed scheme will be essential for achieving high performance with lower power demands in SMT processors. In the future, we plan to evaluate dynamic adaptation of the throttling threshold and an extension to more than two pipeline variables, in order to reflect differences between thread characteristics at run-time.

References

[1] J. Aragón, J. González, J. García, and A. González, Confidence Estimation for Branch Prediction Reversal, Proc. 8th Int'l Conference on High Performance Computing, Dec. 2001.

[2] D. Burger and T. Austin, The SimpleScalar Tool Set, Version, Univ.
of Wisconsin-Madison Computer Science Department, Technical Report #1342, June.

[3] M. Burtscher and B. Zorn, "Prediction Outcome History-based Confidence Estimation for Load Value Prediction," Journal of Instruction-Level Parallelism, May.

[4] S. Eggers, J. Emer, H. Levy, J. Lo, R. Stamm, and D. Tullsen, "Simultaneous Multithreading: A Platform for Next-Generation Processors," IEEE Micro, Sept./Oct. 1997.

[5] R. Gonçalves, M. Pilla, G. Pizzol, T. Santos, R. Santos, and P. Navaux, "Evaluating the Effects of Branch Prediction Accuracy on the Performance of SMT Architectures," Euromicro Workshop on Parallel and Distributed Processing, Feb. 2001.

[6] D. Grunwald, A. Klauser, S. Manne, and A. Pleszkun, "Confidence Estimation for Speculation Control," Proc. 25th Annual Int'l Symposium on Computer Architecture.

[7] T. Heil and J. Smith, "Selective Dual Path Execution," Univ. of Wisconsin-Madison, Technical Report, Nov.

[8] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, 3rd Ed., Morgan Kaufmann, San Francisco, CA.

[9] J. Henning, "SPEC CPU2000: Measuring CPU Performance in the New Millennium," IEEE Computer, July 2000.

[10] S. Hilly and A. Séznec, "Branch Prediction and Simultaneous Multithreading," Proc. 5th Int'l Conference on Parallel Architectures and Compilation Techniques, 1996.

[11] E. Jacobsen, E. Rotenberg, and J. Smith, "Assigning Confidence to Conditional Branch Predictions," Proc. 29th Annual Int'l Symposium on Microarchitecture, Dec. 1996.

[12] A. KleinOsowski and D. Lilja, "MinneSPEC: A New SPEC Benchmark Workload for Simulation-Based Computer Architecture Research," Computer Architecture Letters, June.

[13] K. Luo, M. Franklin, S. Mukherjee, and A. Séznec, "Boosting SMT Performance by Speculation Control," Proc. 15th Int'l Parallel and Distributed Processing Symposium.

[14] S. Manne, A. Klauser, and D. Grunwald, "Pipeline Gating: Speculation Control for Energy Reduction," Proc. 25th Annual Int'l Symposium on Computer Architecture, 1998.

[15] D. Marr, F. Binns, D. Hill, G. Hinton, D. Koufaty, J. Miller, and M. Upton, "Hyper-Threading Technology Architecture and Microarchitecture," Intel Technology Journal, vol. 6, issue 1, Feb.

[16] S. McFarling, "Combining Branch Predictors," WRL Technical Note TN-36, June.

[17] J. Rabaey, Digital Integrated Circuits: A Design Perspective, Prentice Hall, Upper Saddle River, NJ.

[18] G. Sohi, "Instruction Issue Logic for High-Performance, Interruptible, Multiple Functional Unit, Pipelined Computers," IEEE Transactions on Computers, vol. 39, no. 3, Mar. 1990.

[19] D. Tullsen, S. Eggers, J. Emer, H. Levy, J. Lo, and R. Stamm, "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor," Proc. 23rd Annual Int'l Symposium on Computer Architecture, May 1996.

[20] D. Wall, "Limits of Instruction-Level Parallelism," Proc. 4th Int'l Conf. on Architectural Support for Programming Languages and Operating Systems, 1991.

[21] T. Yeh and Y. Patt, "Alternative Implementations of Two-Level Adaptive Branch Prediction," Proc. 19th Annual Int'l Symposium on Computer Architecture, May 1992.


More information

Exploiting the Prefetching Effect Provided by Executing Mispredicted Load Instructions

Exploiting the Prefetching Effect Provided by Executing Mispredicted Load Instructions Exploiting the Prefetching Effect Provided by Executing Mispredicted Load Instructions Resit Sendag 1, David J. Lilja 1, and Steven R. Kunkel 2 1 Department of Electrical and Computer Engineering Minnesota

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Reducing Reorder Buffer Complexity Through Selective Operand Caching

Reducing Reorder Buffer Complexity Through Selective Operand Caching Appears in the Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), 2003 Reducing Reorder Buffer Complexity Through Selective Operand Caching Gurhan Kucuk Dmitry Ponomarev

More information

Use-Based Register Caching with Decoupled Indexing

Use-Based Register Caching with Decoupled Indexing Use-Based Register Caching with Decoupled Indexing J. Adam Butts and Guri Sohi University of Wisconsin Madison {butts,sohi}@cs.wisc.edu ISCA-31 München, Germany June 23, 2004 Motivation Need large register

More information

High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs

High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs October 29, 2002 Microprocessor Research Forum Intel s Microarchitecture Research Labs! USA:

More information

Multithreading Processors and Static Optimization Review. Adapted from Bhuyan, Patterson, Eggers, probably others

Multithreading Processors and Static Optimization Review. Adapted from Bhuyan, Patterson, Eggers, probably others Multithreading Processors and Static Optimization Review Adapted from Bhuyan, Patterson, Eggers, probably others Schedule of things to do By Wednesday the 9 th at 9pm Please send a milestone report (as

More information

Lecture: SMT, Cache Hierarchies. Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1)

Lecture: SMT, Cache Hierarchies. Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) Lecture: SMT, Cache Hierarchies Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized

More information

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor 1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic

More information

Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero

Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero The Nineteenth International Conference on Parallel Architectures and Compilation Techniques (PACT) 11-15

More information

Lecture 14: Multithreading

Lecture 14: Multithreading CS 152 Computer Architecture and Engineering Lecture 14: Multithreading John Wawrzynek Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~johnw

More information

SEVERAL studies have proposed methods to exploit more

SEVERAL studies have proposed methods to exploit more IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 16, NO. 4, APRIL 2005 1 The Impact of Incorrectly Speculated Memory Operations in a Multithreaded Architecture Resit Sendag, Member, IEEE, Ying

More information

High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas

High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas Diverge-Merge Processor (DMP): Dynamic Predicated Execution of Complex Control-Flow Graphs Based on Frequently Executed Paths yesoon Kim José. Joao Onur Mutlu Yale N. Patt igh Performance Systems Group

More information

An Efficient Indirect Branch Predictor

An Efficient Indirect Branch Predictor An Efficient Indirect ranch Predictor Yul Chu and M. R. Ito 2 Electrical and Computer Engineering Department, Mississippi State University, ox 957, Mississippi State, MS 39762, USA chu@ece.msstate.edu

More information

The Impact of Resource Sharing Control on the Design of Multicore Processors

The Impact of Resource Sharing Control on the Design of Multicore Processors The Impact of Resource Sharing Control on the Design of Multicore Processors Chen Liu 1 and Jean-Luc Gaudiot 2 1 Department of Electrical and Computer Engineering, Florida International University, 10555

More information

The Impact of Resource Sharing Control on the Design of Multicore Processors

The Impact of Resource Sharing Control on the Design of Multicore Processors The Impact of Resource Sharing Control on the Design of Multicore Processors Chen Liu 1 and Jean-Luc Gaudiot 2 1 Department of Electrical and Computer Engineering, Florida International University, 10555

More information

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

CSAIL. Computer Science and Artificial Intelligence Laboratory. Massachusetts Institute of Technology

CSAIL. Computer Science and Artificial Intelligence Laboratory. Massachusetts Institute of Technology CSAIL Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Dynamic Cache Partioning for Simultaneous Multithreading Systems Ed Suh, Larry Rudolph, Srinivas Devadas

More information

Exploiting Type Information in Load-Value Predictors

Exploiting Type Information in Load-Value Predictors Exploiting Type Information in Load-Value Predictors Nana B. Sam and Min Burtscher Computer Systems Laboratory Cornell University Ithaca, NY 14853 {besema, burtscher}@csl.cornell.edu ABSTRACT To alleviate

More information

Speculative Parallelization in Decoupled Look-ahead

Speculative Parallelization in Decoupled Look-ahead Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang Dept. of Electrical & Computer Engineering University of Rochester, Rochester, NY Motivation Single-thread

More information