Speculation Control for Simultaneous Multithreading
Dongsoo Kang, Dept. of Electrical Engineering, University of Southern California
Jean-Luc Gaudiot, Dept. of Electrical Engineering and Computer Science, University of California, Irvine

Abstract

Speculative execution helps modern processors expose independent instructions on the fly and thereby exploit more Instruction-Level Parallelism. However, when incorrect speculations occur, useless work is performed for the incorrectly speculated instructions. This lowers sustained performance and leads to a significant waste of power. Unlike superscalar processors, Simultaneous Multithreading (SMT) processors can concurrently execute multiple threads. They therefore have the opportunity to control speculative execution by deliberately choosing the threads from which instructions are fetched at each cycle, taking into account the dynamic characteristics of the running threads. In this paper, we present an efficient front-end mechanism for scheduling threads in SMT processors, called SAFE-T (Speculation-Aware Front-End Throttling). It combines thread prioritizing and throttling: the priority given to a thread can be overridden when that thread appears to suffer from an excessive number of incorrect speculations, preventing further instructions from being fetched. Simulation results show that our policy provides an average reduction of 41.6% in the number of wrong-path instructions and improves instruction throughput by up to 14.5%. A cost-effective implementation of the proposed policy is shown as well.

1. Introduction

In an effort to overcome the limited Instruction-Level Parallelism (ILP) within application programs, Simultaneous Multithreading (SMT) processors exploit Thread-Level Parallelism (TLP) [4] [8] [15].
By filling the instruction window with instructions fetched from multiple threads, an SMT processor is able to exploit TLP as well as ILP, with the inherent capability of decreasing horizontal and vertical waste [4] and thus providing high instruction throughput. (This work is partly supported by the National Science Foundation under Grants No. CSA and INT. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.) The overall performance of an SMT processor depends on many factors, including how the threads are selected and the number of threads from which to fetch instructions. Further, how to allocate the limited fetch slots to the selected threads must be judiciously decided. For example, if instructions fetched from a thread reside in the instruction window for too many cycles before they are issued (due to dependencies and latencies), they occupy valuable entries of the window that could be used by other threads, ultimately limiting the ILP and TLP which can be exploited. Tullsen et al. [19] examined several pipeline variables for prioritizing the running threads and choosing a few of them, and reported that the thread scheduling policy based on the ICOUNT variable, which indicates the number of instructions in the front-end stages, provided the best performance in terms of overall instruction throughput. However, the ICOUNT variable is not aware of speculative execution. Since instructions must be discarded from the intermediate stages of the pipeline if they are found to have been incorrectly speculated, the ICOUNT variable cannot correctly reflect the activities of threads, and the number of instructions discarded due to incorrect speculations (wrong-path instructions) is quite high.
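As a rough illustration (not the authors' hardware implementation), ICOUNT-style selection can be sketched as follows; the dictionary of per-thread counts and the example values are hypothetical:

```python
# Hypothetical sketch of ICOUNT-style thread prioritization: each cycle,
# fetch from the threads with the fewest instructions in the front-end
# stages, on the assumption that they are moving through the pipeline
# fastest and are least likely to clog the instruction window.

def prioritize_by_icount(icount, num_threads=2):
    """icount: dict mapping thread id -> instructions in the front-end stages.
    Returns up to num_threads thread ids, highest priority first."""
    ranked = sorted(icount, key=lambda tid: icount[tid])
    return ranked[:num_threads]

# Example: thread 1 has the fewest in-flight front-end instructions,
# so it is selected first.
print(prioritize_by_icount({0: 12, 1: 3, 2: 7}))  # [1, 2]
```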
In this research, we observed that in an SMT processor using the ICOUNT-based policy, wrong-path instructions account for 16.2 ~ 28.8% of all instructions fetched. The unnecessary work done for these instructions limits the sustained performance achieved by SMT processors and makes them power-inefficient, due largely to unnecessary switching activity of the logic gates. The goal of the study presented in this paper is to develop a fetch mechanism for thread scheduling that enables SMT processors to dynamically control the speculative execution of threads. We accomplish this by employing two pipeline variables, ICOUNT and LCOUNT, which indicate the distinct behaviors of threads, in a new mechanism called SAFE-T (Speculation-Aware Front-End Throttling). The LCOUNT variable represents the number of unresolved, low-confidence conditional branches determined by confidence estimation [6] [11]. It is used to throttle threads that appear to have been incorrectly speculated, even when they are highly prioritized by the ICOUNT variable. Our experimental results show that such a hybrid policy is able to eliminate a noticeable number of wrong-path instructions with an attendant improvement in instruction throughput. Prior to describing our front-end policy, we review related work regarding front-end policies for SMT and confidence estimation techniques (Section 2). Our front-end policy and the hardware mechanisms that embody it are discussed in more detail in Section 3. We describe the simulation environment used to evaluate the policy in Section 4, and the experimental results are presented in Section 5.

2. Related work

The performance of a superscalar processor certainly depends on how many independent instructions are delivered to both the front-end and the back-end stages. However, modern microprocessors have notoriously suffered from limited instruction parallelism within representative application programs, consequently yielding diminishing returns even when the issue width is increased [4] [20]. SMT overcomes this limited ILP within a thread by concurrently fetching and executing instructions from other threads, thereby increasing resource utilization and overall performance.

2.1. Front-end policies for SMT

Just as with superscalars, the performance of an SMT processor is affected by the quality of the instructions injected into its pipeline. For instance, if the instructions being processed have dependencies among one another or long latencies, the ILP and TLP which can be exploited will be limited, clogging the instruction window and then stalling the front-end stages (fetch, decode, and dispatch). Therefore, how to fill the front-end stages of an SMT processor with instructions fetched from multiple threads is a critical decision which needs to be made at each cycle.
Three parameters (algorithm, num_threads, and num_insts) characterize the front-end policies for SMT. The first determines how to choose threads among all available threads, the second the number of threads to fetch from at each cycle, and the third the number of instructions which can be fetched from each thread per cycle. Tullsen et al. [19] suggested several priority-based front-end policies which surpass the simple round-robin policy. They investigated four policies which prioritize threads according to four pipeline variables: BRCOUNT, MISSCOUNT, ICOUNT, and IQPOSN. These pipeline variables were mainly devised to avoid clogging the issue buffers, which may occur when instructions reside in the pipeline for many cycles before they are retired. Among these policies, the one based on ICOUNT provided the highest performance in terms of instruction throughput. However, these variables do not take into account that even after an instruction has been injected into the pipeline, it must be discarded from the front-end or back-end stages whenever a preceding conditional branch is determined to have been incorrectly predicted. Wrong-path instructions, fetched along the incorrectly predicted path of a conditional branch, consume not only fetch slots in the front-end but also valuable functional units in the back-end, correspondingly reducing instruction throughput and power efficiency. This is part of the reason why the sustained instruction throughput obtained under the ICOUNT-based policy is still considerably lower than the possible peak.

2.2. Controlling speculative executions

Speculative execution is an aggressive technique which has been used in an effort to achieve higher performance by reducing the effect of control dependencies among instructions. Predicting the outcome of branches allows a processor to speculatively execute the instructions fetched along the predicted target path of a branch.
As modern processors tend toward deeper and wider pipelines, however, the penalty for incorrect speculations becomes increasingly substantial. Because confidence estimators [6] [11] are able to assess the quality of conditional branch predictions, they have been adopted as a building block in many applications [1] [3] [7] [13] [14]. Confidence estimators are very similar to branch predictors [21] in the sense that both must make binary decisions (high-confidence vs. low-confidence and predicted-taken vs. predicted-untaken, respectively). Like branch predictors, confidence estimators record a history, in their case in a table of miss distance counters (MDCs). However, whereas a branch predictor takes into account the actual path of branches (i.e., was the branch taken or not?) when making future predictions, a confidence estimator takes into account the history of the outcomes of branch predictions (i.e., was the branch correctly predicted or not?). If a conditional branch was correctly predicted, the corresponding MDC is incremented by one; if it was incorrectly predicted, the MDC is reset to zero. A branch prediction is considered to have high confidence when its corresponding MDC value is greater than a given confidence threshold. In order to reduce the power demand of superscalar processors, Manne et al. [14] developed a particular form of speculation control called pipeline gating, in which the
fetch unit stalls when there are more low-confidence predictions in flight than a certain pre-determined threshold. Even though this scheme eliminates many wrong-path instructions and thus reduces unnecessary activity in the fetch and decode stages, it suffers from a small (1%) loss of performance. Adopting the concept of pipeline gating, Luo et al. [13] proposed a speculation-aware policy for SMT processors. Under this policy, running threads are prioritized and gated according to the values of their corresponding LCOUNT (short for LC_BPCOUNT) variables, each of which indicates the number of unresolved low-confidence conditional branches per thread. This front-end policy significantly reduced the total amount of wrong-path instructions, from 9 ~ 24% to 8 ~ 12%. However, compared with the previous policies examined in [19], this policy is certainly the most expensive in terms of additional hardware, since the confidence estimator on which it relies is implemented with a table of MDCs for each thread. In addition, unlike the ICOUNT variable, the LCOUNT variable is not able to perceive the dynamic latency of instructions in the pipeline stages, mostly in the back-end. We observed that it performed worse than the ICOUNT variable for workloads mixing integer and floating-point benchmarks.

3. SAFE-T

The front-end fetch mechanism we now propose for SMT, called SAFE-T (Speculation-Aware Front-End Throttling), integrates the positive features of the ICOUNT and LCOUNT variables, thereby achieving enhanced instruction throughput and power efficiency. The two variables represent the dynamic characteristics of the running threads. We first explain how these variables are used to schedule threads at the fetch stage of an SMT processor. Next, the overhead of the confidence estimation on which the LCOUNT variable relies is discussed, and a cost-effective confidence estimation scheme is presented.
3.1. Prioritizing and throttling threads

For each thread, there is a counter (the ICOUNT variable) which represents the number of in-flight instructions in the front-end stages of the pipeline. By looking up this counter, thread prioritizing gives high priority to those threads with lower values. This information is used to select the threads from which instructions will be fetched during the next cycle. Unlike the policies examined in [19], however, the SAFE-T mechanism enables the fetch unit to reject selected threads, even if they are highly prioritized by the ICOUNT variable, as a way of controlling speculation. In order to throttle selected threads, there is a designated counter (the LCOUNT variable) for each running thread, as shown in Figure 1. The counter is incremented by one whenever a conditional branch prediction is made with low confidence in the fetch stage. Conversely, it is decremented by one whenever a low-confidence conditional branch is resolved in the complete stage or is discarded from one of the preceding stages during recovery from an incorrect speculation.

Figure 1. SAFE-T mechanism.

If a running thread has an LCOUNT value greater than a given throttling threshold, the fetching of instructions from that thread is stalled on the assumption that an incorrect path has been entered. That is, it is assumed that if instructions are fetched along the current path of the thread, they will be discarded once a previous conditional branch has been resolved. In fact, once wrong-path instructions are fed into the pipeline, they must reside in it until the previous branches are determined to have been incorrectly predicted, contending for valuable resources with other useful instructions.

Figure 2. Two types of confidence estimators used for SAFE-T.

By throttling the threads after prioritizing them, the fetch unit of an SMT processor can save fetch slots which would otherwise be wasted due to incorrect speculations
and then keep the useful instructions in the instruction window by allocating the saved slots to other threads. As shown in Figure 2 (a), we use a confidence estimator based on miss distance counters (MDCs) [11] to assess the quality of each conditional branch prediction. 4-bit MDCs are used, generating a binary signal (low-confidence or high-confidence) according to a given confidence threshold. The branch history register (BHR), which holds the global history of recently resolved conditional branches, is shared with the collaborating branch prediction unit. In previous work [1] [6] [7] [11], the confidence estimators for superscalar processors were designed primarily to foretell the correctness of the current prediction for a conditional branch according to the recent history of that branch's outcomes. Thus, a table of registers was required, with each register recording the correctness of recent predictions for the same conditional branch. Extending this approach, a confidence estimator for an SMT processor can have one table of registers per thread or, alternatively, a single table shared by all the threads. When shared, the size of the table should be proportional to the number of threads because of the increased branch misprediction rate caused by thread interference [10]. Neither implementation is inexpensive, since both require a lot of space and their complexity grows with the number of threads. We claim that the role of a confidence estimator for the front-end policy of an SMT processor can be restricted to indicating whether or not a thread seems to have entered a section of instructions in which conditional branches are likely to be incorrectly predicted. Also, a conditional branch must be discarded, regardless of the correctness of its prediction, whenever any of its preceding conditional branches is resolved to have been mispredicted.
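Under the update rule described above, a table-based (tmdc-style) estimator might look like the following sketch; the indexing by PC and the saturating behavior of the 4-bit counters are assumptions for illustration:

```python
# Sketch of an MDC-based confidence estimator (tmdc style): one saturating
# miss distance counter per table entry. A counter is incremented when the
# corresponding branch prediction was correct and reset to zero when it was
# incorrect; a prediction is high-confidence when its MDC exceeds a given
# confidence threshold.

class MDCTable:
    def __init__(self, entries=2048, bits=4, threshold=8):
        self.entries = entries
        self.mdcs = [0] * entries
        self.max_val = (1 << bits) - 1   # 4-bit counters saturate at 15
        self.threshold = threshold

    def is_high_confidence(self, pc):
        return self.mdcs[pc % self.entries] > self.threshold

    def update(self, pc, prediction_correct):
        i = pc % self.entries
        if prediction_correct:
            self.mdcs[i] = min(self.mdcs[i] + 1, self.max_val)
        else:
            self.mdcs[i] = 0  # a misprediction resets the miss distance

est = MDCTable()
for _ in range(9):                 # nine correct predictions in a row
    est.update(0x400100, True)
print(est.is_high_confidence(0x400100))  # True (MDC = 9 > threshold 8)
est.update(0x400100, False)        # one misprediction resets the counter
print(est.is_high_confidence(0x400100))  # False
```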
3.2. Minimizing the overhead of confidence estimation

In addition, we observed in our execution-driven simulation of an SMT processor that mispredictions of conditional branches occur in bursts. Indeed, as shown in Figure 3, on average 75% of misprediction distances are no more than 8; that is, in most cases a misprediction is separated from the next misprediction by no more than 8 correct predictions. The results shown in Figure 3 were obtained by running the SPEC CPU2000 benchmarks in multi-thread mode on the baseline SMT architecture specified in Section 4. For a superscalar processor, Heil and Smith reported a similar observation from a trace-driven simulation [7].

Figure 3. Distribution of distances between mispredicted branches.

These observations allow us to build an inexpensive confidence estimator by simply recording the recent history in a single register (one per thread) shared by all static branches, without the need for a table of MDCs. This confidence estimator with a single MDC assigns low confidence to the current conditional branch prediction if an incorrect prediction has been detected and the number of subsequent correct predictions is no more than a certain threshold, even though those predictions may have been made for different conditional branches.

Figure 4. A schematic diagram of our SMT architecture.
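Combining a single shared counter per thread with the LCOUNT bookkeeping of Section 3.1 gives a gating decision along the following lines; the class structure, method names, and event wiring are illustrative assumptions, with the thresholds taken from the defaults used in our experiments:

```python
# Sketch of the gmdc scheme plus LCOUNT throttling: a single global MDC per
# thread, shared by all static branches of that thread, classifies each new
# prediction; low-confidence predictions raise LCOUNT, and a thread whose
# LCOUNT exceeds the throttling threshold is not fetched from, regardless of
# its ICOUNT priority.

class ThreadSpeculationState:
    def __init__(self, confidence_threshold=8, throttling_threshold=1):
        self.global_mdc = 0            # one MDC shared by all static branches
        self.lcount = 0                # unresolved low-confidence branches
        self.confidence_threshold = confidence_threshold
        self.throttling_threshold = throttling_threshold

    def on_branch_predicted(self):
        """Called in the fetch stage when a conditional branch is predicted."""
        if self.global_mdc <= self.confidence_threshold:
            self.lcount += 1           # low-confidence prediction in flight

    def on_branch_resolved(self, was_correct, was_low_confidence):
        """Called when a conditional branch resolves or is squashed."""
        self.global_mdc = self.global_mdc + 1 if was_correct else 0
        if was_low_confidence:
            self.lcount = max(0, self.lcount - 1)

    def is_throttled(self):
        return self.lcount > self.throttling_threshold

t = ThreadSpeculationState()
t.on_branch_predicted()            # MDC is 0, so this prediction is low confidence
print(t.is_throttled())            # False: LCOUNT = 1, threshold = 1
t.on_branch_predicted()
print(t.is_throttled())            # True: LCOUNT = 2 exceeds the threshold
```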
Figure 5. Characteristics of the SPEC CPU2000 benchmarks: (a) breakdown of conditional and unconditional branches, (b) branch misprediction rates, (c) fraction of wrong-path instructions, and (d) instruction throughput in single-thread mode. Branch misprediction rates were collected using a gshare branch predictor with bit counters.

This confidence estimator is truly inexpensive since it uses a single MDC, as shown in Figure 2 (b). We call it gmdc, since a global MDC is shared by all static conditional branches encountered in a thread, whereas a confidence estimator with a table of MDCs would be called tmdc. In gmdc, the global MDC is updated as described in [11] whenever a conditional branch is resolved.

4. Simulation methodology

To properly evaluate the effects of the proposed front-end policy and the underlying confidence estimators, we designed an execution-driven simulator derived from the SimpleScalar tool set [2]. We modified the sim-outorder simulator to implement an SMT processor model (Figure 4) which supports out-of-order and speculative execution. The architectural model contains seven pipeline stages: fetch, decode, dispatch, issue, execute, complete, and commit. Several resources, such as the PC, the integer and floating-point register files, and the branch predictor, are replicated to allow multiple thread contexts.

4.1. Experimental set-up

The major simulation parameters are shown in Table 1, and the configuration parameters for the functional units are shown in Table 2.
In Table 1, each cache line size is in bytes, and the 128KB instruction cache is equivalent to a 64KB cache for 32-bit instructions since the simulator uses the 64-bit PISA instruction set. The simulator is configured to issue as many instructions as the total number of functional units at each cycle. When multiple instructions are ready to be issued, older instructions have higher priority than newer ones. We used two types of branch predictors [16] [21]. One is a gshare branch predictor with 2048 entries; the other is a hybrid predictor that combines the same gshare predictor with a 2048-entry bimodal predictor. The branch misprediction penalty is a minimum of eight cycles: six cycles for branch delays and two cycles for restoring the correct architectural state after each misprediction. We used the SPEC CPU2000 benchmark suite [9] to build workloads for performance simulation. Our workloads consist of seven integer benchmarks (164.gzip,
175.vpr, 176.gcc, 181.mcf, 197.parser,, and ) and one floating-point benchmark (183.equake). As shown in Table 3, seven multiprogrammed workloads were created for the simulation experiments. The characteristics of these benchmarks are shown in Figure 5.

Table 1. Simulation parameters.

Parameter: Value
Fetch rate: 4
Dispatch rate: 4
Retire rate: 4
Branch predictor: 2-level: 2K gshare; hybrid: 2K gshare + 2K bimodal (2K meta)
Branch target buffer: way
Branch mispredict penalty: 8+ cycles
Return address stack: 16
L1 instruction cache: 128KBytes (512:64:4:LRU)
L1 data cache: 64KBytes (512:32:4:LRU)
L2 cache: 512KBytes (2048:64:4:LRU)
Main memory: 256-bit width
I-TLB: 512KBytes (32:4096:4:LRU)
D-TLB: 1MBytes (64:4096:4:LRU)
IFQ size: 16
IDQ size: 16
RUU size: 64
LQ/SQ size: 16/8
INT units: 4
FP units: 2

We compiled all the benchmarks with gcc -O2 and ran each with its corresponding lgred input data set from MinneSPEC [12]. Each simulation of a workload comprises T x 500 million instructions, where T is the number of threads, after fast-forwarding through the first 300 million instructions of each thread to skip the initialization part of the benchmarks.

4.2. Simulated front-end policies

We simulated and evaluated the following four front-end policies for SMT:

Ti: Threads are prioritized according to the values of the ICOUNT variable.

Tc: Threads are prioritized and gated according to the values of the LCOUNT variable. When two or more threads have the same priority value, the ICOUNT value is used as a tie-breaker. The confidence threshold is set to 8 and the gating threshold to 1. There is a table of 2048 MDCs per thread, and each MDC is a 4-bit register.

St and Sg: Threads are scheduled according to the SAFE-T mechanism. St relies on a tmdc confidence estimator with 2048 entries, whereas Sg uses a gmdc. When two or more threads have the same value of the ICOUNT variable, the LCOUNT value is used as a tie-breaker.
The confidence threshold is set to 8 and the gating threshold to 1 by default. For the MDCs, 4-bit registers are used.

In all front-end policies, the 2.4 fetch scheme was used to distribute the available slots, because the Ti policy showed its best performance under it. Thus, up to two threads can be selected at each cycle, and up to four instructions can be fetched from each thread. Since the fetch rate of the baseline SMT architecture is set to 4, the second thread, which has the lower priority, is fetched from only if slots remain after they have been allocated to the first one.

Table 2. Configuration of functional units.

INT add/logical/shift: repeat rate 1, latency 1
INT mult: repeat rate 1, latency 3
INT div
FP add/comp: repeat rate 1, latency 2
FP mult: repeat rate 1, latency 4
FP div, sqrt

Table 3. Workloads used for simulations.

W0: 164.gzip, 197.parser
W1: 164.gzip, 175.vpr (place), 176.gcc
W2: 175.vpr (place), 176.gcc, 197.parser
W3: 175.vpr (route), 176.gcc, 181.mcf
W4: 164.gzip, 175.vpr (route), 181.mcf
W5: 164.gzip, 176.gcc, 183.equake
W6: 175.vpr (place), 176.gcc, 181.mcf

5. Experimental evaluation

In order to evaluate the effectiveness of the proposed SAFE-T mechanism, we simulated St and Sg to measure
instruction throughput and the number of wrong-path instructions. The results obtained are compared with those of the Ti and Tc policies. In addition, we analyze the impact of the confidence and gating thresholds on St and Sg, which implement the SAFE-T mechanism with different confidence estimator structures.

5.1. Instruction throughput

Our simulation results, in terms of IPC (instructions per cycle), are presented in Figure 6. For each workload, a set of four histograms is shown, one for each front-end policy under discussion: Ti, Tc, St, and Sg. The rightmost set of histograms represents harmonic means over all seven workloads. The figure shows that the St policy is clearly superior to the others, with the highest IPC for all workloads. It improves the IPC by up to 14.5% compared with the Ti policy, which has been known as the most effective for high performance. Even compared with the St policy, the Sg policy yields almost equivalent instruction throughput, although it relies on the gmdc confidence estimator with only one miss distance counter per thread. This result confirms that a confidence estimator which updates the LCOUNT variable using the global history of conditional branches is effective, since an SMT processor is able to exploit TLP as well as ILP. To better understand the characteristics of the front-end policies, the average priority assigned by all four front-end policies was measured for the benchmarks in Workload 3 and Workload 5, as shown in Figure 7. Among the benchmarks in Workload 3, both the Ti and Tc policies give higher priority to 176.gcc and one other benchmark than to the rest; between those two, the Ti policy favors 176.gcc, while the Tc policy ranks 176.gcc lower. Yet 176.gcc has a larger percentage of wrong-path instructions and a smaller IPC.
This means that the ICOUNT variable can generate incorrect feedback about the instruction flow in the pipeline. Accordingly, the Ti policy achieves a smaller IPC than Tc for Workload 3. Even though the St and Sg policies prioritize threads using ICOUNT like the Ti policy, they can cancel the priority given to 176.gcc through thread throttling based on LCOUNT. Thus, they avoid the distorted signal that ICOUNT gives for 176.gcc. This is the main reason that the St and Sg policies yield a better IPC than the Ti policy for Workload 3.

Figure 6. Instruction throughputs for (a) gshare and (b) hybrid branch prediction schemes.

Figure 7. Average priority assigned to the benchmarks in (a) Workload 3 (175.vpr-rt, 176.gcc, 181.mcf) and (b) Workload 5 (164.gzip, 176.gcc, 183.equake). A smaller number means a higher priority.
The Tc policy appears inferior to the others. Even compared to the Ti policy, it yields decreased performance for all workloads except Workload 3 and Workload 6. In the case of Workload 5, where 183.equake is a floating-point benchmark and the others are integer benchmarks, the Tc policy assigned an average priority of 1.47 to 183.equake whereas the Ti policy assigned 1.72. This means that the Tc policy favored 183.equake more than the other policies did; the LCOUNT variable is not appropriate for detecting threads whose instructions retire quickly. Consequently, the Tc policy tends to fetch instructions which are likely to clog the pipeline due to their long latencies. We measured the average slip time, shown in Figure 8. The slip time of an instruction is the amount of time that elapses from when it is dispatched into the instruction window until it is retired. These results show that the Tc policy selects threads with long-latency instructions, even though they cause a comparatively small number of incorrect speculations.

Figure 8. Average slip time of instructions in the back-end stages.

5.2. Wrong-path instructions

Figure 9 shows the percentage of wrong-path instructions fetched into the pipeline for the Ti, Tc, St, and Sg policies. As designed, the three policies based on the LCOUNT variable significantly reduce wrong-path instructions. The Tc policy reduces them by 35.1% on average. However, this reduction does not lead to a noticeable improvement in IPC, because the Tc policy tends to select threads with long-latency instructions, as shown in Figure 8. On average, the St policy shows a reduction of 36.3% in the number of wrong-path instructions and the Sg policy a reduction of 41.6%. This shows that for SMT processors, the inexpensive confidence estimator (gmdc) is adequate both to determine whether a thread has entered an incorrect path and to effectively prevent instructions on that path from entering the pipeline.

Figure 9. Percentage of wrong-path instructions.

5.3. Impact of throttling threshold

The threshold values used for thread throttling with the St and Sg policies affect both the instruction throughput and the number of wrong-path instructions. As the throttling threshold increases, threads are allowed to carry more low-confidence conditional branch predictions. Consequently, the pipeline is more likely to be fed with instructions which will later be discarded. These instructions compete for resources with useful instructions (those which actually contribute to instruction throughput), and since the resulting conflicts tend to increase the delay of instructions in the pipeline, they negatively affect overall performance. For instance, if the gating threshold is set to infinity, no thread will ever be gated by the fetch unit; thus, as the throttling threshold increases, the performance of St and Sg, with respect to both the IPC and the percentage of wrong-path instructions, converges to that of the Ti policy. To better understand this impact, we measured both the IPC and the fraction of wrong-path instructions over all fetched instructions while varying the value of the throttling threshold. The experimental data were obtained with the underlying 4-bit MDCs using a confidence threshold of 8 and a gshare branch predictor configured as described in Section 4.
The data points presented in Figure 10 and Figure 11 are averages calculated over the seven workloads given in Table 3.
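For aggregates of this kind (and for the harmonic-mean bars in Figure 6), the standard way to average a rate such as IPC is the harmonic mean, which weights slow workloads more heavily than an arithmetic mean would. A minimal sketch, with entirely hypothetical per-workload values:

```python
# Harmonic-mean aggregation of per-workload IPC figures: n divided by the
# sum of reciprocals. The IPC values below are made up for illustration.

def harmonic_mean(values):
    return len(values) / sum(1.0 / v for v in values)

ipcs = [2.1, 1.8, 2.4, 1.6, 2.0, 1.9, 2.2]   # hypothetical per-workload IPCs
print(round(harmonic_mean(ipcs), 3))
```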
Figure 10. Throttling threshold vs. IPC.

As the throttling threshold is increased, threads become less and less likely to be throttled after they are prioritized. Both St and Sg show the same behavior, and the best IPC and the best speculation control are achieved when the throttling threshold is set to 1. If the threshold increases to 2, the IPC of Sg degrades by 3.8% and the fraction of wrong-path instructions grows from 13.9% to 18.6% on average. When the threshold is further increased to 3, the average IPC decreases to 1.75, which is close to that of the Ti policy; at this point the Sg policy provides only about as much performance as the Ti policy.

Figure 11. Throttling threshold vs. percentage of wrong-path instructions.

5.4. Impact of confidence threshold

In truth, the value range of the LCOUNT variable, which is referenced to remove prioritized threads from consideration for fetching during some period of time, is affected by the confidence threshold. According to the confidence threshold, the confidence estimator underlying the St and Sg policies determines whether each conditional branch prediction is low-confidence or high-confidence. If the confidence threshold is low, the confidence estimator is optimistic and will classify most predictions as high-confidence. To examine the impact of confidence threshold values, we ran the seven workloads in Table 3 and measured the changes in the IPC and the number of wrong-path instructions while varying the confidence threshold; for thread throttling, the throttling threshold was set to 1. The results obtained for the IPC and the percentage of wrong-path instructions are shown in Figure 12 and Figure 13, respectively.

Figure 12. Confidence threshold vs. IPC.

Figure 13. Confidence threshold vs. percentage of wrong-path instructions.

We can see that as the confidence threshold increases, the IPC rises slightly and the number of wrong-path instructions is further reduced. This means that a confidence estimator underlying the SAFE-T mechanism needs to be pessimistic about branch predictions after a misprediction has been detected, since even a correctly predicted branch and its subsequent instructions must be discarded if a preceding branch is found to have been incorrectly predicted.

6. Conclusions

An SMT processor collects instructions from multiple threads and deploys them into the shared instruction window in order to exploit both ILP and TLP. Thus, how to fill the front-end stages with instructions from multiple threads is critical for SMT processors. We have proposed a thread scheduling mechanism for SMT processors, called SAFE-T, which prioritizes threads according to the ICOUNT variable and throttles threads when they appear to be on incorrect paths, based on the LCOUNT variable, which represents the number of unresolved conditional branches with low-confidence predictions in the pipeline. SAFE-T enables SMT processors to increase instruction throughput by up to 14.5% and to reduce the number of wrong-path instructions by 41.6% on average, compared with the policy using the ICOUNT variable alone. As for the implementation cost of our front-end policy, we have examined a confidence estimator with a global MDC instead of a table of MDCs and have evaluated its effectiveness. It has been shown that this inexpensive implementation, the Sg policy, is comparable to St, which uses a table of MDCs, in terms of both instruction throughput and speculation control.

High performance is the primary goal of any modern processor, but it has been achieved at the cost of wasted work such as instructions discarded from the pipeline. As processor pipelines become wider and deeper, the amount of wasted work will only increase. The proposed scheme will therefore be essential for achieving high performance with lower power demands in SMT processors. In the future, we plan to evaluate dynamic adaptation of the throttling threshold and an extension to more than two pipeline variables in order to reflect differences between thread characteristics at run-time.

References

[1] J. Aragón, J. González, J. García, and A. González, Confidence Estimation for Branch Prediction Reversal, Proc. 8th Int'l Conference on High Performance Computing, Dec. 2001, pp.
[2] D. Burger and T. Austin, The SimpleScalar Tool Set, Version, Univ.
of Wisconsin-Madison Computer Science Department Technical Report #1342, June 1997.
[3] M. Burtscher and B. Zorn, Prediction Outcome History-based Confidence Estimation for Load Value Prediction, Journal of Instruction-Level Parallelism, May.
[4] S. Eggers, J. Emer, H. Levy, J. Lo, R. Stamm, and D. Tullsen, Simultaneous Multithreading: A Platform for Next-Generation Processors, IEEE Micro, Sept./Oct. 1997.
[5] R. Gonçalves, M. Pilla, G. Pizzol, T. Santos, R. Santos, and P. Navaux, Evaluating the Effects of Branch Prediction Accuracy on the Performance of SMT Architectures, Euromicro Workshop on Parallel and Distributed Processing, Feb. 2001.
[6] D. Grunwald, A. Klauser, S. Manne, and A. Pleszkun, Confidence Estimation for Speculation Control, Proc. 25th Annual Int'l Symposium on Computer Architecture, 1998.
[7] T. Heil and J. Smith, Selective Dual Path Execution, Univ. of Wisconsin-Madison, Technical Report, Nov.
[8] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, 3rd Ed., Morgan Kaufmann, San Francisco, CA.
[9] J. Henning, SPEC CPU2000: Measuring CPU Performance in the New Millennium, IEEE Computer, July 2000.
[10] S. Hily and A. Seznec, Branch Prediction and Simultaneous Multithreading, Proc. 5th Int'l Conference on Parallel Architectures and Compilation Techniques, 1996.
[11] E. Jacobsen, E. Rotenberg, and J. Smith, Assigning Confidence to Conditional Branch Predictions, Proc. 29th Annual Int'l Symposium on Microarchitecture, Dec. 1996.
[12] A. KleinOsowski and D. Lilja, MinneSPEC: A New SPEC Benchmark Workload for Simulation-Based Computer Architecture Research, Computer Architecture Letters, June 2002.
[13] K. Luo, M. Franklin, S. Mukherjee, and A. Seznec, Boosting SMT Performance by Speculation Control, Proc. 15th Int'l Parallel and Distributed Processing Symposium, 2001.
[14] S. Manne, A. Klauser, and D. Grunwald, Pipeline Gating: Speculation Control for Energy Reduction, Proc. 25th Annual Int'l Symposium on Computer Architecture, 1998.
[15] D. Marr, F. Binns, D. Hill, G. Hinton, D. Koufaty, J. Miller, and M. Upton, Hyper-Threading Technology Architecture and Microarchitecture, Intel Technology Journal, vol. 6, issue 1, Feb. 2002.
[16] S. McFarling, Combining Branch Predictors, WRL Technical Note TN-36, June 1993.
[17] J. Rabaey, Digital Integrated Circuits: A Design Perspective, Prentice Hall, Upper Saddle River, NJ.
[18] G. Sohi, Instruction Issue Logic for High-Performance, Interruptible, Multiple Functional Unit, Pipelined Computers, IEEE Transactions on Computers, vol. 39, no. 3, Mar. 1990.
[19] D. Tullsen, S. Eggers, J. Emer, H. Levy, J. Lo, and R. Stamm, Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Proc. 23rd Annual Int'l Symposium on Computer Architecture, May 1996.
[20] D. Wall, Limits of Instruction-Level Parallelism, Proc. 4th Int'l Conference on Architectural Support for Programming Languages and Operating Systems, 1991.
[21] T. Yeh and Y. Patt, Alternative Implementations of Two-Level Adaptive Branch Prediction, Proc. 19th Annual Int'l Symposium on Computer Architecture, May 1992.
More information