Balancing Throughput and Fairness in SMT Processors


Kun Luo (ECE Department, University of Maryland, College Park, MD), Jayanth Gummaraju (Dept. of Electrical Engineering, Stanford University, Stanford, CA), Manoj Franklin (ECE Department and UMIACS, University of Maryland, College Park, MD)

Abstract

Simultaneous Multithreading (SMT) is an execution model that executes multiple threads in parallel within a single processor pipeline. Usually, an SMT processor uses shared instruction queues to collect instructions from the different threads. Hence, an SMT processor's performance depends on how the instruction fetch unit fills these instruction queues every cycle. In the recent past, many schemes have been proposed for fetching instructions into the SMT pipeline. These schemes focused on increasing the throughput, using the number of instructions and the number of low-confidence branch predictions currently in the pipeline to decide which threads to fetch from. The goal of this paper is to investigate fetch policies that find a balance between fairness and throughput. We present metrics to quantify fairness. We then discuss techniques that use a set of pipeline system variables to achieve balanced throughput and fairness. Finally, we evaluate several fetch policies. Our evaluation confirms that many of our fetch policies provide a good balance between throughput and fairness.

1 Introduction

Simultaneous multithreading (SMT) is a processing model in which multiple thread contexts are simultaneously active in a single pipeline [1] [2] [9]. The performance contribution of each thread depends on the amount of resources available to it. Unfortunately, the relationship between a thread's performance and the amount of resources allocated to it is rarely linear. As a thread receives more resources, its performance increases somewhat uniformly up to a point, beyond which the increase tends to be marginal. Importantly, the instruction fetch unit, which supplies instructions from different threads to the rest of the pipeline, can control such resource allocation by slowing down or speeding up the fetch rate of specific threads.

Tullsen et al. [8] investigated several fetch policies for SMT processors. Among these, a policy called ICOUNT was found to provide the best performance. This policy gives the highest priority to the thread with the fewest instructions in the pipeline; the underlying assumption is that such threads are likely to make the most efficient use of processor resources. This policy, however, does not consider the probability of each thread being on a wrong speculative path. Luo et al. [4] addressed this problem with a fetch policy (LC-BPCOUNT) that considers the number of outstanding low-confidence branch predictions to prioritize threads. Threads with the fewest low-confidence branch predictions are given the highest priority, with the pipeline instruction count used as a tie-breaker, if required. This fetch policy provides very high throughput, as it allocates the processor resources to threads that are very likely to be on their correct paths. However, it does not attempt to be fair. If a thread has many low-confidence branch predictions, it is likely to be consistently passed over in favor of threads that have fewer low-confidence branch predictions.

Both fairness and throughput are important when using an SMT processor.
Whereas higher throughput ensures higher utilization of processor resources, fairness ensures that all threads are given equal opportunity and that no thread is forced to starve. Typically, different threads have different rates at which their instructions execute, and hence different native throughputs. Therefore, considering only throughput will result in always giving priority to threads with high instruction execution rates, causing the rest to suffer. In a multi-user environment, users experiencing slow response times will not be very pleased. On the other hand, considering only fairness will result in an inefficient use of pipeline resources, because a thread may often receive resources when it is least capable of utilizing them. On the surface, it appears that naive approaches to increase throughput work against fairness, and vice versa. This paper investigates SMT fetch policies, with special emphasis on enhancing both throughput and fairness.

We present and evaluate fetch policies that are geared to provide balanced throughput and fairness. The basic idea behind these policies is to use a combination of prioritizing and selective gating, based on pipeline status. Our experimental analysis shows that it is indeed possible to achieve high throughput and fairness at the same time in SMT processors.

The rest of this paper is organized as follows. Section 2 reviews background information on the SMT processor and SMT fetch policies. Section 3 discusses fairness and metrics for ensuring fairness. Section 4 describes ways of using different types of feedback from the pipeline for obtaining throughput as well as fairness. Section 5 presents the experimental results, and Section 6 presents the conclusions.

2 Simultaneous Multithreading

2.1 SMT Processor

The SMT processing engine, like most other processing engines, comprises two major parts: the fetch engine and the execute engine. The fetch engine fills the instruction queues (IQs) with instructions, and includes the i-cache, the branch predictor, the fetch unit, the decode unit, and the register rename unit, as shown in Figure 1. The execute engine drains the instruction queues, and includes the instruction issue logic, the functional units, the memory hierarchy, the result forwarding mechanism, and the reorder buffer.

[Figure 1: Block Diagram of an SMT Processor. The fetch engine (fetch unit, integer and load/store instruction queues) feeds the execute engine (functional units, data cache).]

The key features of the processor can be summarized as follows:

1. The major resources are shared by all active threads.
2. Every clock cycle, instructions from all active threads compete for each of the shared resources.
3. Instructions of a thread are fetched and committed strictly in program order.
4. Each of the resources in the processor has very limited buffering capabilities.
5. Once an instruction enters the processor pipeline, it is not pre-empted (i.e., discarded without execution) from the pipeline unless it is found to be from an incorrect path.

Therefore, it is very important to select the right instructions every cycle at the fetch part of the pipeline, where instructions enter the pipeline. In this paper, we focus on utilizing the available fetch bandwidth to bring the 'best' instructions into the IQs.

2.2 SMT Fetch Policies

Tullsen et al. [8] studied several fetch policies for SMT processors, so as to improve on the simple round-robin priority policy with the use of feedback from the processor pipeline. Of these, a policy called ICOUNT was found to provide good throughput. ICOUNT gives the highest priority every cycle to those threads that have the fewest instructions in the processor pipeline. The motivation for this policy is two-fold: (i) give the highest priority to threads whose instructions are moving through the pipeline efficiently, and (ii) provide an even mix of instructions from the available threads. This naturally prevents any one thread from monopolizing the processor pipeline. However, it does not take into account whether a particular thread is in the correct path of execution or not. Because of control speculation, many of the instructions in an SMT pipeline can be from wrong paths. Wrong-path instructions not only have zero contribution to the throughput, but also tie up valuable resources, preventing correct-path instructions from being executed.
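As a rough illustration (a sketch in Python; the paper gives no code, so the Thread class and its field names are assumptions), ICOUNT-style prioritization amounts to a simple sort on per-thread pipeline occupancy:

```python
# A minimal sketch of ICOUNT-style fetch prioritization; the Thread class
# and field names are hypothetical.
from dataclasses import dataclass

@dataclass
class Thread:
    tid: int
    insts_in_pipeline: int   # instructions currently between fetch and commit

def icount_priority(threads):
    """Order threads for fetch: fewest in-flight instructions first."""
    return sorted(threads, key=lambda t: t.insts_in_pipeline)

# Thread 2 has the fewest instructions in the pipeline, so it receives the
# highest fetch priority this cycle.
threads = [Thread(0, 24), Thread(1, 10), Thread(2, 3)]
print([t.tid for t in icount_priority(threads)])   # [2, 1, 0]
```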
For SMT processors, speculation control of individual threads is beneficial to the overall performance, because the resources that would have been spent on wrong-path instructions of one thread can instead be diverted for use by other threads that are on the right path. To increase the overall performance of SMT processors, we have to reduce the number of incorrectly speculated instructions, so as to save resources for non-speculative or correctly speculated instructions. Luo et al. [4] investigated the use of confidence estimators [3] [5] to reduce the number of wrong-path instructions. Individual threads are prioritized every cycle based on the number of outstanding low-confidence branch predictions. In essence, speculation control is used to improve the overall SMT throughput, as the resources that would have been spent on wrong-path instructions of one thread instead get diverted to other threads that are on the right path. However, this scheme pays no attention to fairness. In fact, its basic principle works against fairness!
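Continuing the illustrative sketches (again with hypothetical names, not the paper's code), LC-BPCOUNT prioritization is a two-key sort: the outstanding low-confidence branch prediction count first, and the pipeline instruction count as the tie-breaker:

```python
# A minimal sketch of LC-BPCOUNT prioritization: fewest outstanding
# low-confidence branch predictions first, pipeline instruction count as
# the tie-breaker. The Thread class and field names are hypothetical.
from dataclasses import dataclass

@dataclass
class Thread:
    tid: int
    low_conf_branches: int   # unresolved low-confidence branch predictions
    insts_in_pipeline: int

def lc_bpcount_priority(threads):
    # The tuple key gives the two-level comparison: C first, then I.
    return sorted(threads, key=lambda t: (t.low_conf_branches,
                                          t.insts_in_pipeline))

# Threads 1 and 2 tie on the low-confidence count; the instruction count
# breaks the tie in favor of thread 2.
threads = [Thread(0, 2, 5), Thread(1, 0, 30), Thread(2, 0, 12)]
print([t.tid for t in lc_bpcount_priority(threads)])   # [2, 1, 0]
```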

3 Fairness Metrics

If the SMT fetch unit focuses only on maximizing the instruction throughput, then threads with low branch prediction accuracies and low throughputs may not get a fair share of the SMT resources. The end result will be that these threads proceed at a slow pace, even though the overall throughput of the SMT processor is very high. Therefore, every cycle, it is important that the SMT fetch unit pay special attention to fairness when deciding which threads to fetch from. Notice that by fairness we do not mean that all threads should have the same throughput; instead, we mean that all threads should get an equal opportunity to utilize resources.

Although the term fairness is quite familiar, quantifying it is a somewhat difficult endeavor! There are several different ways in which fairness can be defined and measured. One option is to use the metric thread fetch priority, to see if each thread gets more or less equal priority during fetching. With this metric, a straightforward approach to guarantee fairness is round robin fetching, in which the threads take strict turns at fetching. With round robin fetching, however, the overall instruction throughput may be severely affected. This is because this fetch policy gives no consideration to a thread's ability to utilize additional resources allocated to it at that time. It also does not consider the thread's probability of being on the correct execution path.

Another possible metric for fairness is individual thread speedup, i.e., the speedup experienced by each thread with respect to the throughput it obtained when executed in single-thread mode. We can consider the variance (or standard deviation) in thread speedups; if all threads experience similar speedups, then we may conclude that the fetch policy is fair (although the threads may not have had the same fetch priority). This metric has the drawback that it does not give any importance to the overall throughput. It is conceivable that a fetch policy slows down every thread equally, and still provides fairness!

Based on the discussion so far, instead of having two different metrics, one for throughput and another for fairness, it seems better to use a single metric that encapsulates both. One such metric is the harmonic mean of the individual thread speedups. (The weighted speedup metric used in [7] is obtained by taking the arithmetic mean of the speedup values and multiplying it by the number of threads. The arithmetic mean does not capture fairness, whereas the harmonic mean tends to be lower if one or more threads have lower speedups, thereby capturing some effect of the lack of fairness.) Whatever metric we use to quantify fairness, however noble the intentions, it is conceivable to find some amount of criticism against that metric.
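To make the contrast concrete, here is a small numerical illustration (the speedup values are invented for the example) of why the harmonic mean penalizes unfairness while the arithmetic mean does not:

```python
# A worked example: the harmonic mean drops sharply when one thread starves,
# whereas the arithmetic mean can still look healthy.
from statistics import harmonic_mean

fair   = [0.60, 0.55, 0.65, 0.60, 0.60]   # all threads slow down evenly
unfair = [1.00, 1.00, 1.00, 0.90, 0.10]   # one thread starves

for label, s in (("fair", fair), ("unfair", unfair)):
    arith = sum(s) / len(s)
    print(f"{label}: arithmetic={arith:.2f} harmonic={harmonic_mean(s):.2f}")
# fair:   arithmetic=0.60 harmonic=0.60
# unfair: arithmetic=0.80 harmonic=0.35  <- the starving thread drags it down
```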
4 Prioritizing and Gating

In order to achieve high throughput as well as fairness, we take feedback from the processor pipeline and modify the thread fetch priorities accordingly. Feedback from the pipeline can be collected in terms of system variables such as the low-confidence branch prediction count and the instruction count. There are at least two ways to use the pipeline system variables: either to prioritize the threads (prioritizing) or to stop fetching from specific threads (gating).

4.1 Fetch Prioritizing

The current value of a pipeline system variable can be used to assign fetch priorities to the threads. For instance, in the ICOUNT fetch policy, the pipeline instruction count values of the threads determine their fetch priorities for the current cycle. In order to obtain good throughput as well as fairness, we may need to consider multiple system variables at the same time. When multiple system variables are used for prioritizing, there must be some means to combine the fetch priority values given by each variable. One way to achieve this is to assign a weight to each system variable and calculate an overall fetch priority for every thread. The drawback of this approach is that it requires the fetch unit to do complex calculations every cycle. A more feasible approach is to perform prioritization in multiple steps or levels, with each level representing a system variable. That is, one of the variables is first used to partition the threads into different priority groups (with the members of each group having equal priority), then another variable is used to further prioritize threads that fall into the same group, and so on.

An important aspect to consider while doing multi-level partitioning is deciding the partitioning granularity at each level. Consider the pipeline variable instruction count. If the value of this variable is used as such for prioritizing at the first level, chances are that a definite fetch priority order will be established after the first level itself. This leaves the remaining system variables unused, and the policy becomes the same as ICOUNT. Therefore, we need to consider a range of values to be in one group. If this range is too large, then all of the threads may fall into the same group, leaving the current partitioning level unused.
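As an illustrative sketch of this two-level scheme (the tuple layout and bucket arithmetic are assumptions; a granularity of 3 matches the policy evaluated in Section 5):

```python
# A minimal sketch of two-level prioritization with a partitioning
# granularity. Level 1 buckets threads by instruction count in ranges of
# GRANULARITY slots; level 2 orders a bucket by low-confidence branch count.
GRANULARITY = 3   # instruction-count range treated as one priority group

def two_level_priority(threads):
    # threads: list of (tid, insts_in_pipeline, low_conf_branches) tuples
    return sorted(threads,
                  key=lambda t: (t[1] // GRANULARITY,  # level 1: I, coarse
                                 t[2]))                # level 2: C, exact

# Threads 0 and 1 fall into the same instruction-count bucket (3..5), so
# the low-confidence count decides between them at the second level.
print(two_level_priority([(0, 5, 2), (1, 3, 0), (2, 9, 0)]))
# -> [(1, 3, 0), (0, 5, 2), (2, 9, 0)]
```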

4.2 Fetch Gating

The previous subsection discussed how threads can be prioritized based on pipeline variables. Next, we discuss the importance of using gating in addition to prioritization for providing fairness. For motivation, consider the data in Figure 2, which shows how the single-thread throughputs vary when the IQ size is varied. We can see that the performance of a single thread begins to saturate once the IQ size allocated to it grows beyond a modest number of slots. This means that it is not very efficient to provide a thread with more IQ slots than that. Once a thread begins to occupy that many IQ slots, it is better to temporarily stop fetching from that thread altogether and use the fetch bandwidth to fetch from other threads, until the situation improves. That is, the fetch unit will temporarily throttle fetch activity from those threads, even if they are very likely to be on the correct paths. Such fetch gating helps towards providing fairness, without affecting the throughput in a significant manner. When multiple system variables are used for fetch gating, they can all be applied simultaneously.

[Figure 2: Single-Thread Throughputs for Different IQ Sizes. X-axis: Resources Allocated (IQ slots).]
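A minimal sketch of such gating (the low-confidence limit of 2 is the default given in Section 5.1; the instruction-count limit is an assumption, since the saturation point read off Figure 2 is not legible in this transcription):

```python
# A minimal sketch of fetch gating on two pipeline variables applied
# simultaneously. Threshold values: LOW_CONF_LIMIT is the paper's default;
# IQ_SLOT_LIMIT is an assumed saturation point.
from dataclasses import dataclass

@dataclass
class Thread:
    tid: int
    insts_in_pipeline: int
    low_conf_branches: int

IQ_SLOT_LIMIT = 24    # assumed IQ-occupancy saturation point
LOW_CONF_LIMIT = 2    # default gating threshold from Section 5.1

def fetch_candidates(threads):
    # A thread is dropped from this cycle's fetch candidates if it trips
    # either threshold; prioritization then runs over the survivors.
    return [t for t in threads
            if t.insts_in_pipeline <= IQ_SLOT_LIMIT
            and t.low_conf_branches <= LOW_CONF_LIMIT]
```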
5 Experimental Evaluation

This section presents an experimental evaluation of the different fetch prioritizing and gating techniques that are geared towards providing balanced throughput and fairness in SMT processors.

5.1 Evaluation Setup

The experiments in this section are conducted using a simulator derived from the public-domain SMT simulator developed by Tullsen et al. [8]. The simulator executes unmodified Alpha object code, and models the fetch engine and the execute engine of an SMT processor. Some of the simulator parameters are fixed as follows. The instruction pipeline has 9 stages, based on the Alpha pipeline, but includes extra cycles for accessing a large register file. Functional unit latencies are also based on the Alpha processor. The processor has 64-slot integer and floating-point IQs, 12 integer functional units, and 6 floating-point units. The memory hierarchy has 64 KB 2-way set-associative instruction and data caches, a 1024 KB 2-way set-associative on-chip L2 cache, and a 4 MB off-chip cache. Cache line sizes are all 64 bytes. All the on-chip caches are 8-way banked. Cache miss penalties are 6 cycles to the L2 cache, another 12 cycles to the L3 cache, and another 62 cycles to main memory. The fetch unit fetches a maximum of f instructions in a cycle from up to 2 threads, where f is the fetch bandwidth. The second thread gets an opportunity only if f instructions could not be fetched from the first thread, due to events such as cache misses. We simulate two fetch sizes, 8 and 16, and a fetch bandwidth of 8 instructions per cycle.

Workload: Our workload consists of the following 5 programs from the SPEC95 integer benchmark suite: compress95, gcc, go, li, and ijpeg. These programs have different individual IPC values ranging from 2 to 4, and have different branch misprediction rates. We compiled each program with gcc at the -O4 optimization level. The measurement strategy is kept the same as that used in [8]: each data point is collected by simulating the SMT processor for a total of 500 million instructions.

Metrics: We use the following 3 metrics to get a clear idea of the throughput and fairness associated with each fetch policy:

Throughput: We measure throughput in terms of IPC (instructions per cycle). We measure the IPC of each thread, as well as the overall IPC of the SMT processor.

Thread Fetch Priority: This metric measures the average fetch priority given to each thread, with 0 being the highest priority.

Thread Speedup: This metric indicates the speedup experienced by each thread in SMT mode compared to executing the thread in single-thread mode.
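Returning to the fetch scheme in the setup above, a sketch of the per-cycle fetch loop (the helper fetch_from is hypothetical):

```python
# A minimal sketch of the simulated fetch scheme: at most f instructions
# per cycle from up to two threads, the second thread receiving only the
# bandwidth the first thread leaves unused.
F = 8   # fetch bandwidth, instructions per cycle

def fetch_cycle(prioritized_threads, fetch_from):
    """fetch_from(thread, n) fetches up to n instructions and returns how
    many were actually fetched (fewer on events such as i-cache misses)."""
    fetched = 0
    for thread in prioritized_threads[:2]:   # at most two threads per cycle
        fetched += fetch_from(thread, F - fetched)
        if fetched >= F:                     # bandwidth exhausted
            break
    return fetched
```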

Pipeline System Variables: The pipeline system variables that we use for our fetch policies are:

Round Robin-based Priority (R): We keep track of the fetch priorities assigned to the threads in the last several cycles.

Instruction Count (I): Every cycle, we measure the number of instructions from each thread in the pipeline.

Low-Confidence Branch Prediction Count (C): We use a JRS confidence estimator [5] to assess the quality of each branch prediction. This estimator parallels the structure of the gshare branch predictor [6], and uses a table of miss distance counters (MDCs) to keep a record of branch prediction correctness. Each MDC is a saturating resetting counter. Correct predictions increment the corresponding MDC, whereas incorrect predictions reset the MDC to zero. A branch is considered to have high confidence only when the MDC has reached a particular threshold value called the MDC-threshold. The default MDC-threshold value is set to 8. The default gating threshold is set to 2; that is, when gating is used, no instructions are fetched from a thread if it has more than 2 unresolved low-confidence branch predictions.

5.2 Fetch Policies Simulated

We start with three existing policies (round robin, ICOUNT, and LC-BPCOUNT) and enhance them with the aim of achieving balanced throughput and fairness. The basic round robin policy is good for fairness, but poor for throughput, partly because it does not consider wrong-path instructions. Therefore, we enhance it by adding gating. The basic ICOUNT policy does not consider the probability of threads being on wrong paths, so we enhance it by using the low-confidence branch prediction count for prioritizing or gating. The basic LC-BPCOUNT policy does not limit the number of instructions in the pipeline from a thread, so we enhance it by adding gating based on the instruction count. Altogether, we evaluate the following 7 fetch policies. For naming these policies, we use the notation (A,B)p + (C,D)g, where A, B, C, and D are pipeline system variables, each one of {R, I, C}. Variables A and B are used for prioritization, in that order, and variables C and D are used for gating.

1. Rp: basic round robin policy
2. Rp + (I,C)g: round robin policy augmented with gating based on instruction count and low-confidence branch prediction count
3. Ip: basic ICOUNT policy
4. Ip + (I,C)g: ICOUNT policy augmented with gating based on instruction count and low-confidence branch prediction count
5. (I,C)p: 2-level prioritization using instruction count and low-confidence branch prediction count. The first level, which uses the instruction count, uses a granularity of 3 for partitioning
6. (C,I)p: basic LC-BPCOUNT policy, which is a 2-level prioritizing policy
7. (C,I)p + Ig: basic LC-BPCOUNT policy, augmented with gating based on instruction count
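The C variable used by several of these policies comes from the JRS estimator described above. A sketch of its counter mechanism (the table size, counter width, and gshare-style index are assumptions; the MDC-threshold of 8 is the paper's default):

```python
# A minimal sketch of a JRS-style miss distance counter (MDC) table.
TABLE_SIZE = 4096
MDC_MAX = 15           # assumed saturation value of the counter
MDC_THRESHOLD = 8      # default MDC-threshold

mdc_table = [0] * TABLE_SIZE

def index_of(pc, global_history):
    return (pc ^ global_history) % TABLE_SIZE   # gshare-style hashing

def update(idx, prediction_was_correct):
    # Saturating resetting counter: correct predictions increment the MDC,
    # an incorrect prediction resets it to zero.
    if prediction_was_correct:
        mdc_table[idx] = min(mdc_table[idx] + 1, MDC_MAX)
    else:
        mdc_table[idx] = 0

def high_confidence(idx):
    # A prediction is high confidence once its MDC reaches the threshold;
    # otherwise it counts toward the thread's low-confidence total C.
    return mdc_table[idx] >= MDC_THRESHOLD
```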
5.3 Throughput

Let us first look at the throughput results for the different fetch policies. Figure 3 presents the thread IPCs as well as the overall IPCs obtained with the different fetch policies, for fetch sizes of 8 (left figure) and 16 (right figure). The results are very similar for both fetch sizes. In both figures, the Y-axis indicates the IPC, and the X-axis lays out the different fetch policies. Each fetch policy has 6 histogram bars; the first 5 correspond to the IPCs of the 5 threads, and the last one corresponds to the overall IPC.

Among the different fetch policies, the round robin fetch policy (Rp) has the lowest IPC. The main reasons for this are that, when deciding to fetch from a particular thread, (i) it does not consider whether that thread is likely to be on a wrong speculative path at that time, and (ii) it does not consider the appropriateness of allocating more resources to that thread at that time. Notice that when the basic round robin scheme is enhanced by gating based on instruction count as well as low-confidence branch prediction count (the Rp + (I,C)g policy), the overall IPC increases by 14% for a fetch size of 8 and 22% for a fetch size of 16. What is interesting to note is that this increase in overall IPC is obtained without sacrificing the individual thread IPCs of any of the 5 threads! In fact, the individual thread IPCs are higher for all threads; the maximum increase is seen for ijpeg, as its branch predictions have very high confidence. This highlights the utility of enhancing the round robin fetch policy with the use of gating.

Next, consider the third fetch policy, namely ICOUNT (Ip). The overall IPC is much better than that of the round robin policy, but is slightly lower than that of round robin enhanced by gating. (Although the overall IPCs obtained for the Rp + (I,C)g and Ip policies are very close, the individual thread IPCs differ by a large extent between the two policies.)

[Figure 3: SMT Throughput Values for (i) Fetch Size = 8; (ii) Fetch Size = 16.]

The individual thread IPCs are also consistently higher than those of the round robin policy. The ICOUNT policy, as mentioned earlier, gives the highest priority to threads that have the fewest instructions in the pipeline, and that are thus very likely to clear the pipeline sooner. The result is that the IPCs of all threads increase, compared to those obtained with the round robin policy. When the basic ICOUNT policy is enhanced by gating based on instruction count as well as low-confidence branch prediction count (the Ip + (I,C)g policy), the overall IPC increases by 2.7% (for fetch size 8). When the basic ICOUNT policy is enhanced by an additional level of prioritization using the low-confidence branch prediction count (the (I,C)p policy), the IPC values are very similar to those obtained with gating.

Finally, consider the two fetch policies with the low-confidence branch prediction count as the first prioritizing level: the (C,I)p policy (LC-BPCOUNT) and the (C,I)p + Ig policy. These two policies register the highest overall IPC values. The use of the low-confidence branch prediction count at the first prioritizing level has a marked impact on the number of wrong-path instructions that enter the pipeline. However, as we will see later, these two policies are not very fair.

5.4 Thread Fetch Priority

So far, we were primarily analyzing throughput results, with some consideration given to fairness. Next, let us consider metrics that are more germane to fairness. The first metric we consider in this category is thread fetch priority. Figure 4 presents the average thread fetch priority values obtained for the two fetch sizes. Notice that a lower value for the priority metric indicates higher priority. From the figure, we can see that the round robin policy provides more or less the same average fetch priority for all threads, as expected. The maximum discrepancy in average fetch priority is seen for the (C,I)p + Ig policy. As we saw in Figure 3, this policy has the highest throughput. Next, let us look at the thread fetch priorities obtained with the other fetch policies. When the basic round robin scheme is enhanced by gating based on instruction count as well as low-confidence branch prediction count (i.e., the Rp + (I,C)g policy), the average fetch priorities become somewhat varied. These observations are consistent for both fetch sizes.

5.5 Thread Speedup

We have seen metrics that focus primarily on either throughput or fairness. Next, let us consider metrics that address both throughput and fairness. These are the harmonic mean and standard deviation of the relative throughputs obtained for the individual threads, compared to their throughputs in single-thread mode. Considering the relative throughputs helps to factor out the inherent discrepancies between the different threads. Figure 5 presents these results. The format of this figure is similar to that of the previous three figures. The first 5 bars for each fetch policy indicate the relative IPC values for the 5 threads. The 6th and 7th bars indicate the harmonic mean and standard deviation, respectively, of the 5 relative IPC values.

Let us analyze the results presented in Figure 5.

[Figure 4: Average Thread Fetch Priorities Obtained in SMT Mode: (i) Fetch Size = 8; (ii) Fetch Size = 16.]

[Figure 5: Relative IPCs Obtained for the Different Threads when Executed in SMT Mode. Relative IPCs are calculated with respect to Single-Thread IPCs. (i) Fetch Size = 8; (ii) Fetch Size = 16. Bars per policy: compress95, go, gcc, li, ijpeg, harmonic mean, and standard deviation.]

Looking at the harmonic mean values, the last 4 policies (Ip + (I,C)g, (I,C)p, (C,I)p, and (C,I)p + Ig) have very high values for both fetch sizes. Among these, the first two have low standard deviations, and the last two have high standard deviations. Therefore, the first two policies (Ip + (I,C)g and (I,C)p), which have high harmonic means as well as low standard deviations, are good at providing balanced throughput and fairness. Their overall IPC values (cf. Figure 3) are very similar, and reasonably high. If we want to give somewhat higher preference to throughput, along with some fairness, then the last policy ((C,I)p + Ig) is a good one to use.

6 Summary and Conclusions

Simultaneous Multithreading (SMT) permits multiple threads to execute in parallel within a single processor. Usually, an SMT processor uses shared instruction queues to collect instructions from the different threads. Hence, an SMT processor's performance depends on how the instruction fetch unit fills these instruction queues. On each cycle, the fetch unit must judiciously decide which threads to fetch instructions from.

This paper addressed the issue of balancing throughput and fairness when fetching instructions into the SMT pipeline.

We highlighted the importance of achieving a balance between throughput and fairness. We also proposed metrics that consider both fairness and throughput. Then, we investigated techniques that utilize pipeline system variables to obtain balanced throughput and fairness. The basic idea is to use multiple system variables at the same time for prioritizing the threads, and to temporarily stop fetching from some threads. By limiting the amount of resources given to a particular thread, resources can be better distributed to achieve higher throughput as well as fairness.

Our experimental evaluation showed that the basic round robin policy provides very low throughput, although it provides good fairness when the average thread fetch priority is considered as the metric. Enhancements to the round robin policy provide throughput improvements to all threads, though not necessarily by the same amount. This means that the use of average thread fetch priority as a metric is not necessarily a good idea. When better metrics such as the harmonic mean and standard deviation of the individual thread speedups were used, we found that enhancements to the basic ICOUNT policy provide good throughput as well as fairness. These enhancements include gating with instruction count and low-confidence branch prediction count, or two-level prioritization with the low-confidence branch prediction count at the second level. We also found that very high throughputs, with moderate fairness, can be achieved by adding instruction count-based gating to a 2-level prioritizer that uses the low-confidence branch prediction count at the first level and the instruction count at the second level.

Acknowledgements

This work was supported by the U.S. National Science Foundation (NSF) through a CAREER grant (MIP) and a regular grant (CCR). We are thankful to the reviewers for their helpful and insightful comments.

References

[1] G. E. Daddis, Jr. and H. C. Torng, "The Concurrent Execution of Multiple Instruction Streams on Superscalar Processors," Proc. International Conference on Parallel Processing (ICPP), pp. I:76-83, 1991.
[2] S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, R. L. Stamm, and D. M. Tullsen, "Simultaneous Multithreading: A Platform for Next-Generation Processors," IEEE Micro, September/October 1997.
[3] D. Grunwald, A. Klauser, S. Manne, and A. Pleszkun, "Confidence Estimation for Speculation Control," Proc. 25th International Symposium on Computer Architecture (ISCA), 1998.
[4] K. Luo, M. Franklin, S. S. Mukherjee, and A. Seznec, "Boosting SMT Performance by Speculation Control," Proc. 15th International Parallel & Distributed Processing Symposium (IPDPS), 2001.
[5] E. Jacobsen, E. Rotenberg, and J. E. Smith, "Assigning Confidence to Conditional Branch Predictions," Proc. 29th International Symposium on Microarchitecture (MICRO-29), December 1996.
[6] S. McFarling, "Combining Branch Predictors," WRL Technical Note TN-36, June 1993.
[7] A. Snavely and D. M. Tullsen, "Symbiotic Job Scheduling for a Simultaneous Multithreading Processor," Proc. 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IX), 2000.
[8] D. M. Tullsen, S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, and R. L. Stamm, "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor," Proc. 23rd International Symposium on Computer Architecture (ISCA), May 1996.
[9] W. Yamamoto and M. Nemirovsky, "Increasing Superscalar Performance Through Multistreaming," Proc. IFIP WG10.3 Working Conference on Parallel Architectures and Compilation Techniques (PACT '95), pp. 49-58, 1995.


More information

Execution-based Prediction Using Speculative Slices

Execution-based Prediction Using Speculative Slices Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers

More information

ECE404 Term Project Sentinel Thread

ECE404 Term Project Sentinel Thread ECE404 Term Project Sentinel Thread Alok Garg Department of Electrical and Computer Engineering, University of Rochester 1 Introduction Performance degrading events like branch mispredictions and cache

More information

The Alpha Microprocessor: Out-of-Order Execution at 600 Mhz. R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA

The Alpha Microprocessor: Out-of-Order Execution at 600 Mhz. R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA The Alpha 21264 Microprocessor: Out-of-Order ution at 600 Mhz R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA 1 Some Highlights z Continued Alpha performance leadership y 600 Mhz operation in

More information

The Use of Multithreading for Exception Handling

The Use of Multithreading for Exception Handling The Use of Multithreading for Exception Handling Craig Zilles, Joel Emer*, Guri Sohi University of Wisconsin - Madison *Compaq - Alpha Development Group International Symposium on Microarchitecture - 32

More information

A Study of Control Independence in Superscalar Processors

A Study of Control Independence in Superscalar Processors A Study of Control Independence in Superscalar Processors Eric Rotenberg, Quinn Jacobson, Jim Smith University of Wisconsin - Madison ericro@cs.wisc.edu, {qjacobso, jes}@ece.wisc.edu Abstract An instruction

More information

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2. Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Problem 1 Consider the following LSQ and when operands are

More information

Selective Fill Data Cache

Selective Fill Data Cache Selective Fill Data Cache Rice University ELEC525 Final Report Anuj Dharia, Paul Rodriguez, Ryan Verret Abstract Here we present an architecture for improving data cache miss rate. Our enhancement seeks

More information

Parallel Processing SIMD, Vector and GPU s cont.

Parallel Processing SIMD, Vector and GPU s cont. Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP

More information

1993. (BP-2) (BP-5, BP-10) (BP-6, BP-10) (BP-7, BP-10) YAGS (BP-10) EECC722

1993. (BP-2) (BP-5, BP-10) (BP-6, BP-10) (BP-7, BP-10) YAGS (BP-10) EECC722 Dynamic Branch Prediction Dynamic branch prediction schemes run-time behavior of branches to make predictions. Usually information about outcomes of previous occurrences of branches are used to predict

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

An Intelligent Fetching algorithm For Efficient Physical Register File Allocation In Simultaneous Multi-Threading CPUs

An Intelligent Fetching algorithm For Efficient Physical Register File Allocation In Simultaneous Multi-Threading CPUs International Journal of Computer Systems (ISSN: 2394-1065), Volume 04 Issue 04, April, 2017 Available at http://www.ijcsonline.com/ An Intelligent Fetching algorithm For Efficient Physical Register File

More information

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Ravi Rajwar and Jim Goodman University of Wisconsin-Madison International Symposium on Microarchitecture, Dec. 2001 Funding

More information

Software-Controlled Multithreading Using Informing Memory Operations

Software-Controlled Multithreading Using Informing Memory Operations Software-Controlled Multithreading Using Informing Memory Operations Todd C. Mowry Computer Science Department University Sherwyn R. Ramkissoon Department of Electrical & Computer Engineering University

More information

Efficient Resource Allocation on a Dynamic Simultaneous Multithreaded Architecture Ortiz-Arroyo, Daniel

Efficient Resource Allocation on a Dynamic Simultaneous Multithreaded Architecture Ortiz-Arroyo, Daniel Aalborg Universitet Efficient Resource Allocation on a Dynamic Simultaneous Multithreaded Architecture Ortiz-Arroyo, Daniel Publication date: 2006 Document Version Early version, also known as pre-print

More information

Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat

Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat Security-Aware Processor Architecture Design CS 6501 Fall 2018 Ashish Venkat Agenda Common Processor Performance Metrics Identifying and Analyzing Bottlenecks Benchmarking and Workload Selection Performance

More information

Threshold-Based Markov Prefetchers

Threshold-Based Markov Prefetchers Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this

More information

Understanding The Effects of Wrong-path Memory References on Processor Performance

Understanding The Effects of Wrong-path Memory References on Processor Performance Understanding The Effects of Wrong-path Memory References on Processor Performance Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt The University of Texas at Austin 2 Motivation Processors spend

More information

Assigning Confidence to Conditional Branch Predictions

Assigning Confidence to Conditional Branch Predictions Assigning Confidence to Conditional Branch Predictions Erik Jacobsen, Eric Rotenberg, and J. E. Smith Departments of Electrical and Computer Engineering and Computer Sciences University of Wisconsin-Madison

More information

A Dynamic Multithreading Processor

A Dynamic Multithreading Processor A Dynamic Multithreading Processor Haitham Akkary Microcomputer Research Labs Intel Corporation haitham.akkary@intel.com Michael A. Driscoll Department of Electrical and Computer Engineering Portland State

More information