Speculation Control for Simultaneous Multithreading

Dongsoo Kang, Dept. of Electrical Engineering, University of Southern California, dkang@usc.edu
Jean-Luc Gaudiot, Dept. of Electrical Engineering and Computer Science, University of California, Irvine, gaudiot@uci.edu

Abstract

Speculative execution helps modern processors expose independent instructions on the fly and thus exploit more Instruction-Level Parallelism. When incorrect speculations occur, however, useless work is performed on the incorrectly speculated instructions, which lowers sustained performance and wastes a significant amount of power. Unlike superscalar processors, Simultaneous Multithreading (SMT) processors execute multiple threads concurrently, so they have the opportunity to control speculative execution by deliberately choosing, at each cycle, the threads from which instructions will be fetched, based on the dynamic characteristics of the running threads. In this paper, we present an efficient front-end mechanism for scheduling threads in SMT processors, called SAFE-T (Speculation-Aware Front-End Throttling). It combines thread prioritizing with throttling: the priority given to a thread is overridden when that thread appears to suffer from an excessive number of incorrect speculations, and instructions are prevented from being fetched from it. Simulation results show that our policy reduces the number of wrong-path instructions by 41.6% on average and improves instruction throughput by up to 14.5%. A cost-effective implementation of the proposed policy is presented as well.

1. Introduction

In an effort to overcome the limited Instruction-Level Parallelism (ILP) within application programs, Simultaneous Multithreading (SMT) processors exploit Thread-Level Parallelism (TLP) [4] [8] [15]. By filling the instruction window with instructions fetched from multiple threads, an SMT processor can exploit TLP as well as ILP, with the inherent capability of decreasing horizontal and vertical waste [4] and thus providing high instruction throughput. (This work is partly supported by the National Science Foundation under Grants No. CSA-0073527 and INT-9815742. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.)

The overall performance of an SMT processor depends on many factors, including how threads are selected and the number of threads from which to fetch instructions. Further, how the limited fetch slots are allocated to the selected threads must be decided judiciously. For example, if instructions fetched from a thread reside in the instruction window for too many cycles before they are issued (due to dependencies and latencies), they occupy valuable window entries that could be used by other threads, ultimately limiting the ILP and TLP that can be exploited.

Tullsen et al. [19] examined several pipeline variables for prioritizing the running threads and choosing a few of them, and reported that the thread scheduling policy based on the ICOUNT variable, which indicates the number of instructions in the front-end stages, provided the best performance in terms of overall instruction throughput. However, the ICOUNT variable is not aware of speculative execution.

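For concreteness, the selection step of such an ICOUNT-based policy can be sketched in C as below. This is our illustration, not code from the paper or from [19]; the type and function names are assumptions.

    #include <stdlib.h>

    #define NUM_THREADS 4                 /* hardware thread contexts (assumed) */

    typedef struct {
        int id;       /* thread context id */
        int icount;   /* instructions currently in the front-end stages */
    } thread_t;

    static int by_icount(const void *a, const void *b)
    {
        return ((const thread_t *)a)->icount - ((const thread_t *)b)->icount;
    }

    /* Rank threads for fetch: fewest front-end instructions first. */
    void prioritize_icount(thread_t threads[], int order[])
    {
        qsort(threads, NUM_THREADS, sizeof(thread_t), by_icount);
        for (int i = 0; i < NUM_THREADS; i++)
            order[i] = threads[i].id;
    }
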
Since instructions must be discarded from the intermediate stages of the pipeline if they are found to have been incorrectly speculated, the ICOUNT variable cannot correctly reflect the activities of threads. The number of instructions discarded due to incorrect speculations (wrong-path instructions) is quite high: we observed that in an SMT processor using the ICOUNT-based policy, wrong-path instructions account for 16.2% to 28.8% of all instructions fetched. The unnecessary work done for these instructions limits the sustained performance achieved by SMT processors and makes them power-inefficient, largely because of the unnecessary switching activity of the logic gates.

The goal of the study presented in this paper is to develop a fetch mechanism for thread scheduling that enables SMT processors to dynamically control the speculative execution of threads. We accomplish this by employing two pipeline variables, ICOUNT and LCOUNT, which capture distinct behaviors of the threads, in a new mechanism called SAFE-T (Speculation-Aware Front-End Throttling). The LCOUNT variable represents the number of unresolved, low-confidence conditional branches determined by confidence estimation [6] [11]. It is used to throttle threads that appear to have been incorrectly speculated, even when they are highly prioritized by the ICOUNT variable. Our experimental results show that such a hybrid policy reduces a noticeable number of wrong-path instructions, with an attendant improvement in instruction throughput.

Prior to describing our front-end policy, we review related work on front-end policies for SMT and on confidence estimation techniques (Section 2). Our front-end policy and the hardware mechanisms that embody it are discussed in more detail in Section 3. We describe the simulation environment used to evaluate the policy in Section 4, and the experimental results are presented in Section 5.

2. Related work

The performance of a superscalar processor certainly depends on how many independent instructions are delivered to the front-end and back-end stages. However, modern microprocessors have notoriously suffered from the limited instruction parallelism within representative application programs, yielding diminishing returns even when the issue width is increased [4] [20]. SMT overcomes this limited ILP within a thread by concurrently fetching and executing instructions from other threads, thereby increasing resource utilization and overall performance.

2.1. Front-end policies for SMT

Just as for superscalars, the performance of SMT processors is affected by the quality of the instructions injected into the pipeline. For instance, if the instructions being processed have dependencies among one another or long latencies, the exploitable ILP and TLP will be limited, clogging the instruction window and stalling the front-end stages (fetch, decode, and dispatch). Therefore, how to fill the front-end stages of an SMT processor with instructions fetched from multiple threads is a critical decision which must be made at each cycle.

Three parameters characterize the front-end policies for SMT: algorithm, num_threads, and num_insts. The first determines how to choose threads among all available threads. The second is the number of threads to fetch from at each cycle, and the third is the number of instructions which can be fetched per thread at each cycle.

Tullsen et al. [19] suggested several priority-based front-end policies which surpass the simple round-robin policy. They investigated four policies which prioritize threads according to four pipeline variables: BRCOUNT, MISSCOUNT, ICOUNT, and IQPOSN. These variables were mainly devised to avoid clogging the issue buffers, which may occur when instructions reside in the pipeline for many cycles before they retire. Among these policies, the one based on ICOUNT provided the highest instruction throughput. However, these variables do not account for the fact that even after an instruction has been injected into the pipeline, it must be discarded from the front-end or back-end stages whenever a preceding conditional branch turns out to have been incorrectly predicted. Wrong-path instructions, fetched along the incorrectly predicted path of a conditional branch, consume not only fetch slots in the front-end but also valuable functional units in the back-end, which correspondingly reduces instruction throughput and power efficiency. This is part of the reason why the sustained instruction throughput obtained under the ICOUNT-based policy is still well below the possible peak.

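As a rough illustration of this three-parameter characterization, a simulator might describe a front-end policy with a small descriptor such as the following; this struct is a hypothetical sketch, not code from [19] or from the authors' simulator.

    /* Hypothetical descriptor for an SMT front-end policy. */
    typedef struct {
        void (*algorithm)(int *order, int num_runnable); /* ranks the threads */
        int num_threads;   /* threads fetched from per cycle */
        int num_insts;     /* max instructions fetched per thread per cycle */
    } frontend_policy_t;

    /* For example, an ICOUNT-based "2.4" configuration would be
     * { icount_rank, 2, 4 }, with icount_rank a ranking routine. */
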
2.2. Controlling speculative executions

Speculative execution is an aggressive technique used to achieve higher performance by reducing the effect of control dependencies among instructions. Predicting the outcome of branches allows a processor to speculatively execute instructions fetched along the predicted target path of a branch. As modern processors tend toward deeper and wider pipelines, however, the penalty for incorrect speculations becomes increasingly substantial.

Because confidence estimators [6] [11] are able to assess the quality of conditional branch predictions, they have been adopted as a building block in many applications [1] [3] [7] [13] [14]. Confidence estimators are very similar to branch predictors [21] in that both make binary decisions (high-confidence vs. low-confidence, and predicted-taken vs. predicted-untaken, respectively). Like branch predictors, confidence estimators keep a table of miss distance counters (MDCs) to record a history. However, whereas a branch predictor takes into account the actual path of branches (i.e., was the branch taken or not?) when making future predictions, a confidence estimator takes into account the history of the outcomes of branch predictions (i.e., was the branch correctly predicted or not?). If a conditional branch is correctly predicted, the corresponding MDC is incremented by one; it is reset to zero if the branch is incorrectly predicted. A branch prediction is considered to have high confidence when its corresponding MDC value is greater than a given confidence threshold.

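The MDC update rule just described can be stated concretely in C; the sketch below is our reading of the rule, with the table size and indexing chosen for illustration rather than taken from [11].

    #define MDC_TABLE_SIZE 2048
    #define MDC_MAX        15      /* 4-bit saturating counters */

    static unsigned char mdc[MDC_TABLE_SIZE];

    /* Called when a conditional branch resolves. */
    void mdc_update(unsigned index, int prediction_was_correct)
    {
        if (prediction_was_correct) {
            if (mdc[index] < MDC_MAX)
                mdc[index]++;      /* one more correct prediction since the last miss */
        } else {
            mdc[index] = 0;        /* a misprediction resets the miss distance */
        }
    }

    /* Called when a new prediction is made. */
    int is_high_confidence(unsigned index, unsigned threshold)
    {
        return mdc[index] > threshold;
    }
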

In order to reduce the power demand of superscalar processors, Manne et al. [14] developed a particular form of speculation control called pipeline gating, in which the fetch unit stalls when there are more outstanding low-confidence predictions than a pre-determined threshold. Even though this scheme eliminates many wrong-path instructions and thus reduces unnecessary activity in the fetch and decode stages, it suffers a small (1%) loss of performance.

Adopting the concept of pipeline gating, Luo et al. [13] proposed a speculation-aware policy for SMT processors. Under this policy, running threads are prioritized and gated according to the values of their corresponding LCOUNT (short for LC_BPCOUNT) variables, each of which indicates the number of unresolved low-confidence conditional branches per thread. This front-end policy significantly reduced the total amount of wrong-path instructions, from 9~24% to 8~12%. However, compared with the previous policies examined in [19], this policy is certainly the most expensive in terms of additional hardware, since the confidence estimator on which it relies is implemented with a table of MDCs for each thread. In addition, unlike the ICOUNT variable, the LCOUNT variable cannot perceive the dynamic latency of instructions in the pipeline stages, mostly in the back-end stages. We observed that it performed worse than the ICOUNT variable for workloads mixing integer and floating-point benchmarks.

3. SAFE-T

The front-end fetch mechanism we now propose for SMT, called SAFE-T (Speculation-Aware Front-End Throttling), integrates the positive features of the ICOUNT and LCOUNT variables, thereby achieving enhanced instruction throughput and power efficiency. The two variables represent the dynamic characteristics of the running threads. We first explain how these variables are used to schedule threads at the fetch stage of an SMT processor. Next, the overhead of the confidence estimation on which the LCOUNT variable relies is discussed, and a cost-effective confidence estimation scheme is presented.

3.1. Prioritizing and throttling threads

For each thread, there is a counter (the ICOUNT variable) which represents the number of in-flight instructions in the front-end stages of the pipeline. Thread prioritizing gives high priority to the threads with lower counter values. This information is used to select the threads from which instructions will be fetched during the next cycle.

Unlike the policies examined in [19], however, the SAFE-T mechanism enables the fetch unit to reject selected threads, even if they are highly prioritized by the ICOUNT variable, as a way of controlling speculation. In order to throttle selected threads, there is a designated counter (the LCOUNT variable) for each running thread, as shown in Figure 1. The counter is incremented by one whenever a conditional branch prediction is made with low confidence in the fetch stage. Conversely, it is decremented by one whenever a low-confidence conditional branch is resolved in the complete stage or is discarded from one of the preceding stages during recovery from an incorrect speculation.

Figure 1. The SAFE-T mechanism.

If a running thread has an LCOUNT value greater than a given throttling threshold, the fetching of instructions from that thread is stalled on the assumption that an incorrect path has been entered; that is, it is assumed that instructions fetched along the current path of the thread would be discarded once a previous conditional branch is resolved. Indeed, once wrong-path instructions are fed into the pipeline, they reside in it until the previous branches are determined to have been incorrectly predicted, contending for valuable resources with other useful instructions.

By throttling the threads after prioritizing them, the fetch unit of an SMT processor can save fetch slots which would otherwise be wasted due to incorrect speculations and keep useful instructions in the instruction window by allocating the saved slots to other threads.

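The LCOUNT bookkeeping and throttling test described above can be summarized by the following C sketch; it is an illustration under assumed names, not the authors' hardware or simulator code.

    /* One LCOUNT counter per running thread (NUM_THREADS as before). */
    static int lcount[NUM_THREADS];

    void on_low_conf_prediction(int tid) { lcount[tid]++; } /* fetch stage */
    void on_low_conf_resolved(int tid)   { lcount[tid]--; } /* complete stage */
    void on_low_conf_squashed(int tid)   { lcount[tid]--; } /* misspeculation recovery */

    /* A thread selected by ICOUNT priority is still rejected for fetch
     * when its LCOUNT exceeds the throttling threshold (1 by default). */
    int thread_is_throttled(int tid, int throttle_threshold)
    {
        return lcount[tid] > throttle_threshold;
    }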

3.2. Minimizing the overhead of confidence estimation

As shown in Figure 2 (a), we use a confidence estimator based on miss distance counters (MDCs) [11] to assess the quality of each conditional branch prediction. 4-bit MDCs are used, generating a binary signal (low-confidence or high-confidence) according to a given confidence threshold. The branch history register (BHR), which holds the global history of recently resolved conditional branches, is shared with the collaborating branch prediction unit.

Figure 2. Two types of confidence estimators used for SAFE-T.

In previous work [1] [6] [7] [11], confidence estimators for superscalar processors were designed primarily to foretell the correctness of the current prediction for a conditional branch from the recent history of that branch's outcomes. Thus, a table of registers was required, each register recording the correctness of recent predictions for the same conditional branch. Extending this approach, a confidence estimator for an SMT processor could have one table of registers per thread or, alternatively, a single table shared by all threads. When shared, the size of the table should be proportional to the number of threads because of the increased branch misprediction rate caused by thread interference [10]. Neither implementation is inexpensive, since both require considerable space and their complexity grows with the number of threads.

We claim that the role of a confidence estimator in the front-end policy of an SMT processor can be restricted to indicating whether or not a thread appears to have entered a section of instructions in which conditional branches are likely to be incorrectly predicted. Moreover, a conditional branch must be discarded, regardless of the correctness of its own prediction, whenever any of its preceding conditional branches is resolved to have been mispredicted. In addition, we observed that mispredictions of conditional branches occur in bursts in execution-driven simulation of an SMT processor. Indeed, as shown in Figure 3, on average 75% of misprediction distances are no more than 8; in most cases, a misprediction is separated from the next misprediction by no more than 8 correct predictions. The results in Figure 3 were obtained by running the SPEC CPU2000 benchmarks in multi-thread mode on the baseline SMT architecture specified in Section 4. For a superscalar processor, Heil and Smith reported a similar observation from a trace-driven simulation [7].

Figure 3. Distribution of distances between mispredicted branches.

These observations allow us to build an inexpensive confidence estimator which simply records the recent history in a single register per thread, shared by all static branches, without the need for a table of MDCs. This confidence estimator assigns low confidence to the current conditional branch prediction if an incorrect prediction has been detected and the number of correct predictions since then is no more than a certain threshold, even though those predictions may have been made for different conditional branches. The estimator is truly inexpensive since it uses a single MDC, as shown in Figure 2 (b). We call it gmdc, since a global MDC is shared by all static conditional branches encountered in a thread, while a confidence estimator with a table of MDCs is called tmdc. In gmdc, the global MDC is updated as described in [11] whenever a conditional branch is resolved.

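Under these assumptions, the gmdc scheme reduces to one saturating counter per thread; the following C sketch is our illustration of that idea, not the authors' implementation.

    #define GMDC_MAX 15                    /* one 4-bit counter per thread */

    typedef struct {
        unsigned char global_mdc;          /* shared by all static branches */
    } gmdc_t;

    /* Updated on every resolved conditional branch of the thread,
     * regardless of which static branch it was. */
    void gmdc_update(gmdc_t *c, int prediction_was_correct)
    {
        if (prediction_was_correct) {
            if (c->global_mdc < GMDC_MAX)
                c->global_mdc++;
        } else {
            c->global_mdc = 0;             /* enter the pessimistic burst state */
        }
    }

    /* Low confidence while at most 'threshold' correct predictions have
     * followed the most recent misprediction. */
    int gmdc_low_confidence(const gmdc_t *c, unsigned threshold)
    {
        return c->global_mdc <= threshold;
    }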

4. Simulation methodology

To properly evaluate the effects of the proposed front-end policy and the underlying confidence estimators, we designed an execution-driven simulator derived from the SimpleScalar tool set [2]. We modified the sim-outorder simulator to implement an SMT processor model (Figure 4) which supports out-of-order and speculative execution. The architectural model contains seven pipeline stages: fetch, decode, dispatch, issue, execute, complete, and commit. Several resources, such as the PC, the integer and floating-point register files, and the branch predictor, are replicated to support multiple thread contexts.

Figure 4. A schematic diagram of our SMT architecture.

4.1. Experimental set-up

The major simulation parameters are shown in Table 1, and the configuration parameters for the functional units are shown in Table 2. In Table 1, each cache line size is in bytes, and the 128KB instruction cache is equivalent to a 64KB cache for 32-bit instructions since the simulator uses the 64-bit PISA instruction set. The simulator is configured to issue, at each cycle, as many instructions as the total number of functional units. When multiple instructions are ready to be issued, older instructions have priority over newer ones.

Table 1. Simulation parameters.

    Parameter                  Value
    Fetch rate                 4
    Dispatch rate              4
    Retire rate                4
    Branch predictor           2-level: 2K gshare; hybrid: 2K gshare + 2K bimodal (2K meta)
    Branch target buffer       1024 entries, 4-way
    Branch mispredict penalty  8+ cycles
    Return address stack       16
    L1 instruction cache       128KBytes (512:64:4:LRU)
    L1 data cache              64KBytes (512:32:4:LRU)
    L2 cache                   512KBytes (2048:64:4:LRU)
    Main memory                256-bit width
    I-TLB                      512KBytes (32:4096:4:LRU)
    D-TLB                      1MBytes (64:4096:4:LRU)
    IFQ size                   16
    IDQ size                   16
    RUU size                   64
    LQ/SQ size                 16/8
    INT units                  4
    FP units                   2

Table 2. Configuration of functional units.

    Unit  Function           Repeat rate  Latency
    INT   add/logical/shift  1            1
    INT   mult               1            3
    INT   div                19           20
    FP    add/comp           1            2
    FP    mult               1            4
    FP    div                12           12
    FP    sqrt               24           24

We used two types of branch predictors [16] [21]: a gshare branch predictor with 2048 entries, and a hybrid predictor that consists of the same gshare predictor and a 2048-entry bimodal predictor. The branch misprediction penalty is a minimum of eight cycles: six cycles of branch delay plus two cycles to restore the correct architectural state after each misprediction.

We used the SPEC CPU2000 benchmark suite [9] to build workloads for performance simulation.

Our workloads consist of seven integer benchmarks (164.gzip, 175.vpr, 176.gcc, 181.mcf, 197.parser, …) and one floating-point benchmark (183.equake). As shown in Table 3, seven multiprogrammed workloads were created for the simulation experiments. The characteristics of these benchmarks are shown in Figure 5.

Figure 5. Characteristics of the SPEC CPU2000 benchmarks: (a) breakdown of branches, (b) branch misprediction rates, (c) fraction of wrong-path instructions, and (d) instruction throughput in single-thread mode. Branch misprediction rates were collected using a gshare branch predictor with 2048 2-bit counters.

Table 3. Workloads used for simulations.

    Workload  Benchmarks
    0         164.gzip, 197.parser
    1         164.gzip, 175.vpr (place), 176.gcc
    2         175.vpr (place), 176.gcc, 197.parser
    3         175.vpr (route), 176.gcc, 181.mcf
    4         164.gzip, 175.vpr (route), 181.mcf
    5         164.gzip, 176.gcc, 183.equake
    6         175.vpr (place), 176.gcc, 181.mcf

We compiled all the benchmarks with gcc -O2 and ran each with its corresponding lgred input data set from MinneSPEC [12]. Each simulation of a workload comprises T × 500 million instructions, where T is the number of threads, after fast-forwarding past the first 300 million instructions of each thread to skip the initialization part of the benchmarks.

4.2. Simulated front-end policies

We simulated and evaluated the following four front-end policies for SMT:

Ti: Threads are prioritized according to the values of the ICOUNT variable.

Tc: Threads are prioritized and gated according to the values of the LCOUNT variable. When two or more threads have the same priority value, the ICOUNT value is used as a tie-breaker. The confidence threshold is set to 8 and the gating threshold to 1. There is a table of 2048 MDCs per thread, each MDC being a 4-bit register.

St and Sg: Threads are scheduled according to the SAFE-T mechanism; St relies on a tmdc confidence estimator with 2048 entries whereas Sg uses a gmdc. When two or more threads have the same ICOUNT value, the LCOUNT value is used as a tie-breaker. The confidence threshold is set to 8 and the gating threshold to 1 by default. 4-bit registers are used for the MDCs.

In all front-end policies, the 2.4 scheme was used for distributing the available fetch slots, because the Ti policy showed the best performance with it. Thus, up to two threads can be selected at each cycle, and up to four instructions can be fetched from each thread. Since the fetch rate of the baseline SMT architecture is 4, the second thread, which has the lower priority, is fetched only if slots remain after they have been allocated to the first; a sketch of this allocation appears below.

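The sketch below pulls the pieces together for one fetch cycle under the 2.4 scheme; it is our illustration, with assumed helpers (fetch_from_thread returning the number of instructions actually fetched), not the authors' simulator code.

    #define FETCH_RATE            4
    #define MAX_THREADS_PER_CYCLE 2   /* the "2" of the 2.4 scheme */
    #define MAX_INSTS_PER_THREAD  4   /* the "4" of the 2.4 scheme */

    extern int thread_is_throttled(int tid, int throttle_threshold);
    extern int fetch_from_thread(int tid, int max_insts);

    /* One fetch cycle: walk threads in ICOUNT-priority order, skip
     * throttled threads, and give the remaining slots to at most two. */
    void fetch_cycle(const int order[], int num_runnable, int throttle_threshold)
    {
        int slots = FETCH_RATE;
        int picked = 0;

        for (int i = 0; i < num_runnable; i++) {
            if (slots <= 0 || picked == MAX_THREADS_PER_CYCLE)
                break;
            int tid = order[i];
            if (thread_is_throttled(tid, throttle_threshold))
                continue;                              /* SAFE-T gating */
            int max = (slots < MAX_INSTS_PER_THREAD) ? slots : MAX_INSTS_PER_THREAD;
            slots -= fetch_from_thread(tid, max);
            picked++;
        }
    }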

5. Experimental evaluation

In order to evaluate the effectiveness of the proposed SAFE-T mechanism, we simulated St and Sg and measured the instruction throughput and the number of wrong-path instructions. The results are compared with those of the Ti and Tc policies. In addition, we analyze the impact of the confidence and gating thresholds on St and Sg, which implement the SAFE-T mechanism with different confidence estimator structures.

5.1. Instruction throughput

Our simulation results in terms of IPC (instructions per cycle) are presented in Figure 6. For each workload, a set of four histograms is shown, one for each of the four front-end policies: Ti, Tc, St, and Sg. The rightmost set of histograms represents the harmonic means over all seven workloads.

Figure 6. Instruction throughputs for (a) gshare and (b) hybrid branch prediction schemes.

The figure shows that the St policy is clearly superior to the others, achieving the highest IPC for all workloads. It improves the IPC by up to 14.5% compared with the Ti policy, which had been known to be the most effective for high performance. Even compared with the St policy, the Sg policy yields almost equivalent instruction throughput, although it relies on the gmdc confidence estimator with a single miss distance counter per thread. This result shows that a confidence estimator which updates the LCOUNT variable using the global history of conditional branches is effective, since an SMT processor is able to exploit TLP as well as ILP.

To better understand the characteristics of the front-end policies, the average priority assigned by each of the four policies was measured for the benchmarks in Workload 3 and Workload 5, as shown in Figure 7. Among the benchmarks in Workload 3, both the Ti and Tc policies give higher priority to 176.gcc and one other benchmark than to the rest; however, the Ti policy favors 176.gcc while the Tc policy ranks it lower. 176.gcc has a larger percentage of wrong-path instructions and a smaller IPC, which means that the ICOUNT variable can generate incorrect feedback about the instruction flow in the pipeline. Accordingly, Ti achieves a smaller IPC than Tc for Workload 3. Even though the St and Sg policies prioritize threads using ICOUNT as the Ti policy does, they can cancel the priority given to 176.gcc by throttling threads with LCOUNT. Thus, they avoid the distorted signal that ICOUNT gives for 176.gcc. This is the main reason the St and Sg policies yield a better IPC than the Ti policy for Workload 3.

Figure 7. Average priority assigned to the benchmarks in (a) Workload 3 and (b) Workload 5. A smaller number means a higher priority.

The Tc policy appears inferior to the others. Even compared to the Ti policy, it yields lower performance for all workloads except Workload 3 and Workload 6. In the case of Workload 5, where 183.equake is a floating-point benchmark and the others are integer benchmarks, the Tc policy assigned an average priority of 1.47 to 183.equake whereas the Ti policy assigned 1.72 and the St policy 1.86. This means that the Tc policy favored 183.equake more than the other policies did, and that the LCOUNT variable is not suited to detecting threads whose instructions retire quickly. Consequently, the Tc policy tends to fetch instructions which are likely to clog the pipeline because of their long latencies.

We measured the average slip time, as shown in Figure 8. The slip time of an instruction is defined as the time that elapses from when it is dispatched into the instruction window until it retires. These results show that the Tc policy selects threads with long-latency instructions, even though those threads cause comparatively few incorrect speculations.

Figure 8. Average slip time of instructions in the back-end stages.

5.2. Wrong-path instructions

Figure 9 shows the percentage of wrong-path instructions fetched into the pipeline under the Ti, Tc, St, and Sg policies. As designed, the three policies based on the LCOUNT variable significantly reduce wrong-path instructions. The Tc policy reduces them by an average of 35.1%. However, this reduction does not lead to a noticeable improvement in IPC because the Tc policy tends to select threads with long-latency instructions, as shown in Figure 8. On average, the St policy shows a reduction of 36.3% in the number of wrong-path instructions and the Sg policy a reduction of 41.6%. This shows that, for SMT processors, the inexpensive confidence estimator (gmdc) is adequate both for determining whether a thread has entered an incorrect path and for effectively preventing instructions on that path from entering the pipeline.

Figure 9. Percentage of wrong-path instructions.

5.3. Impact of throttling threshold

The threshold values used for thread throttling in the St and Sg policies affect the instruction throughput and the number of wrong-path instructions. As the throttling threshold increases, threads are allowed more unresolved low-confidence conditional branch predictions, so the pipeline is more likely to be fed instructions which will be discarded. These instructions compete for resources with useful instructions (those which actually contribute to instruction throughput), and since the resulting conflicts add extrinsic delay to instructions in the pipeline, overall performance suffers. For instance, if the gating threshold is set to infinity, no thread is ever gated by the fetch unit. Thus, as the throttling threshold increases, the behavior of St and Sg, with respect to both the IPC and the percentage of wrong-path instructions, converges to that of the Ti policy.

To better understand the impact of the throttling threshold on the instruction throughput and the number of wrong-path instructions, we measured both the IPC and the fraction of wrong-path instructions among all fetched instructions while varying the throttling threshold. The experimental data were obtained with the underlying 4-bit MDCs using a confidence threshold of 8, and with a gshare branch predictor configured as described in Section 4. The data points presented in Figure 10 and Figure 11 are averages over the seven workloads given in Table 3.

Figure 10. Throttling threshold vs. IPC.

Figure 11. Throttling threshold vs. percentage of wrong-path instructions.

As the throttling threshold is increased, there is less chance that threads will be throttled after they are prioritized. Both St and Sg show the same behavior, and the best IPC and speculation control are achieved when the throttling threshold is set to 1. If the threshold increases to 2, the IPC of Sg degrades by 3.8% and the fraction of wrong-path instructions grows from 13.9% to 18.6% on average. When the threshold is further increased to 3, the average IPC decreases to 1.75, which is close to the average of the Ti policy, 1.72; at that point the Sg policy provides only as much performance as the Ti policy.

5.4. Impact of confidence threshold

In fact, the value range of the LCOUNT variable, which is referenced to remove prioritized threads from consideration for fetching by depriving them of their assigned priority, is affected by the confidence threshold. According to the confidence threshold, the underlying confidence estimator used for the St and Sg policies decides whether the prediction of each conditional branch is low-confidence or high-confidence. If the confidence threshold is low, the confidence estimator is optimistic and will rate most predictions as high-confidence.

In order to examine the impact of the confidence threshold, we ran the seven workloads in Table 3 and measured the changes in the IPC and in the number of wrong-path instructions while varying the confidence threshold value. For thread throttling, the throttling threshold was set to 1. The results obtained for the IPC and the percentage of wrong-path instructions are shown in Figure 12 and Figure 13, respectively.

Figure 12. Confidence threshold vs. IPC.

Figure 13. Confidence threshold vs. percentage of wrong-path instructions.

We can see that as the confidence threshold increases, the IPC rises slightly and the amount of wrong-path instructions is further reduced. This means that an underlying confidence estimator for the SAFE-T mechanism should remain pessimistic about branch predictions after a misprediction has been detected, since even a correctly predicted branch and its subsequent instructions must be discarded if a preceding branch is found to have been incorrectly predicted.

6. Conclusions

An SMT processor collects instructions from multiple threads and deploys them into the shared instruction window in order to exploit both ILP and TLP.

Thus, how to fill the front-end stages with instructions from multiple threads is critical for SMT processors. We have proposed a thread scheduling mechanism for SMT processors, called SAFE-T, which prioritizes threads according to the ICOUNT variable and throttles threads that appear to be on incorrect paths, based on the LCOUNT variable, which represents the number of unresolved conditional branches with low-confidence predictions in the pipeline. SAFE-T enables SMT processors to increase instruction throughput by up to 14.5% and to reduce wrong-path instructions by 41.6% on average, compared with the policy using the ICOUNT variable alone.

As for the implementation cost of our front-end policy, we have examined a confidence estimator with a global MDC instead of a table of MDCs and have evaluated its effectiveness. This inexpensive implementation, the Sg policy, has been shown to be comparable to the St policy, which uses a table of MDCs, in terms of both instruction throughput and speculation control.

High performance is the primary goal of any modern processor, but it has been achieved at the cost of wasted work, such as instructions discarded from the pipeline. As processor pipelines become wider and deeper, the amount of wasted work will only increase. The proposed scheme is therefore valuable for achieving high performance with lower power demands in SMT processors. In the future, we plan to evaluate dynamic adaptation of the throttling threshold and an extension to more than two pipeline variables, in order to reflect differences between thread characteristics at run-time.

References

[1] J. Aragón, J. González, J. García, and A. González, Confidence Estimation for Branch Prediction Reversal, Proc. 8th Int'l Conference on High Performance Computing, Dec. 2001, pp. 214-223.
[2] D. Burger and T. Austin, The SimpleScalar Tool Set, Version 2.0, Univ. of Wisconsin-Madison Computer Science Department Technical Report #1342, June 1997.
[3] M. Burtscher and B. Zorn, Prediction Outcome History-based Confidence Estimation for Load Value Prediction, Journal of Instruction-Level Parallelism, May 1999.
[4] S. Eggers, J. Emer, H. Levy, J. Lo, R. Stamm, and D. Tullsen, Simultaneous Multithreading: A Platform for Next-Generation Processors, IEEE Micro, Sept./Oct. 1997, pp. 12-19.
[5] R. Gonçalves, M. Pilla, G. Pizzol, T. Santos, R. Santos, and P. Navaux, Evaluating the Effects of Branch Prediction Accuracy on the Performance of SMT Architectures, Euromicro Workshop on Parallel and Distributed Processing, Feb. 2001, pp. 355-362.
[6] D. Grunwald, A. Klauser, S. Manne, and A. Pleszkun, Confidence Estimation for Speculation Control, Proc. 25th Annual Int'l Symposium on Computer Architecture, 1998.
[7] T. Heil and J. Smith, Selective Dual Path Execution, Univ. of Wisconsin-Madison, Technical Report, Nov. 1996.
[8] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, 3rd Ed., Morgan Kaufmann, San Francisco, CA, 2002.
[9] J. Henning, SPEC CPU2000: Measuring CPU Performance in the New Millennium, IEEE Computer, July 2000, pp. 28-35.
[10] S. Hily and A. Seznec, Branch Prediction and Simultaneous Multithreading, Proc. 5th Int'l Conference on Parallel Architectures and Compilation Techniques, 1996, pp. 169-173.
[11] E. Jacobsen, E. Rotenberg, and J. Smith, Assigning Confidence to Conditional Branch Predictions, Proc. 29th Annual Int'l Symposium on Microarchitecture, Dec. 1996, pp. 142-152.
[12] A. KleinOsowski and D. Lilja, MinneSPEC: A New SPEC Benchmark Workload for Simulation-Based Computer Architecture Research, Computer Architecture Letters, June 2002.
[13] K. Luo, M. Franklin, S. Mukherjee, and A. Seznec, Boosting SMT Performance by Speculation Control, Proc. 15th Int'l Parallel and Distributed Processing Symposium, 2001.
[14] S. Manne, A. Klauser, and D. Grunwald, Pipeline Gating: Speculation Control for Energy Reduction, Proc. 25th Annual Int'l Symposium on Computer Architecture, 1998, pp. 132-141.
[15] D. Marr, F. Binns, D. Hill, G. Hinton, D. Koufaty, J. Miller, and M. Upton, Hyper-Threading Technology Architecture and Microarchitecture, Intel Technology Journal, vol. 6, issue 1, Feb. 2002.
[16] S. McFarling, Combining Branch Predictors, WRL Technical Note TN-36, June 1993.
[17] J. Rabaey, Digital Integrated Circuits: A Design Perspective, Prentice Hall, Upper Saddle River, NJ, 1996.
[18] G. Sohi, Instruction Issue Logic for High-Performance, Interruptible, Multiple Functional Unit, Pipelined Computers, IEEE Transactions on Computers, vol. 39, no. 3, Mar. 1990, pp. 349-359.
[19] D. Tullsen, S. Eggers, J. Emer, H. Levy, J. Lo, and R. Stamm, Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Proc. 23rd Annual Int'l Symposium on Computer Architecture, May 1996, pp. 191-202.
[20] D. Wall, Limits of Instruction-Level Parallelism, Proc. 4th Int'l Conf. on Architectural Support for Programming Languages and Operating Systems, 1991, pp. 176-189.
[21] T. Yeh and Y. Patt, Alternative Implementations of Two-Level Adaptive Branch Prediction, Proc. 19th Annual Int'l Symposium on Computer Architecture, May 1992, pp. 124-134.