Applications of Thread Prioritization in SMT Processors

Steven E. Raasch & Steven K. Reinhardt
Electrical Engineering and Computer Science Department
The University of Michigan
1301 Beal Avenue, Ann Arbor, MI 48109, USA
sraasch@eecs.umich.edu, stever@eecs.umich.edu

Abstract

Previous work in multithreading, and specifically in simultaneous multithreading (SMT), has focused primarily on increasing total instruction throughput. While this focus is sufficient in some application domains, widespread deployment of multithreaded processors will require robust behavior across a variety of platforms. For instance, interactive systems must be concerned with the execution latency of foreground user-interface threads. Multiuser systems must be concerned with fair allocation of throughput among competing users. A multithreaded processor that seeks solely to maximize throughput will favor efficient threads at the expense of any potential latency or fairness issues. We show that a very simple fetch-stage prioritization scheme can substantially reduce the latency impact of multithreading on a selected foreground thread while continuing to provide a throughput improvement over single-threaded execution. When all threads have equal priority, rotating the high-priority designation among the threads reduces the processor's bias against less efficient threads, resulting in a more even throughput distribution across the threads. We also show that even when latency and fairness are not a concern, rotating thread prioritization has a positive effect on cache and branch predictor utilization. Unfortunately, although our simple prioritized multithreading scheme provides these benefits while improving utilization over a single-threaded processor, total throughput falls well short of existing throughput-oriented fetch policies. Our ongoing work focuses on more sophisticated prioritization algorithms, potentially incorporating branch confidence estimators, that will maintain these benefits while increasing total throughput.

1 Introduction

Multithreading is a well-known technique for increasing the utilization of a processor core, and thus total processor throughput, by sharing the core among several independent threads of control. Processor resources that would be unused or underused by any single thread due to cache misses or program dependencies can be applied to the execution of another thread.

Two recent trends have heightened interest in multithreading. First, semiconductor fabrication technology is now capable of producing superscalar microprocessors whose peak throughput potential is far beyond the throughput that can be extracted from most single-threaded applications. Second, operating system, compiler, and language support for multithreading is becoming more widespread, as exemplified by Windows NT and Java.

Simultaneous multithreading (SMT) is a promising form of multithreading proposed by Tullsen et al. [1]. SMT enables very fine-grained resource sharing in a dynamic out-of-order superscalar processor core.

By multiplexing resources among threads within a single cycle, as well as across cycles, total throughput can be improved significantly over a single-threaded processor.

Studies of SMT processors [1][2], as with earlier multithreading studies and systems [3][4][5][6][7][8], have focused almost exclusively on improving overall processor throughput. The implicit assumption is that all threads and all instructions are equally important, so maximizing instructions per cycle is good regardless of which instructions are executed. The effects of this assumption are particularly pronounced in Tullsen et al.'s study of SMT fetch policies [2], in which they increase throughput by explicitly favoring threads that use the processor efficiently.

This assumption is reasonable for application domains such as scientific computation and database servers (those traditionally targeted by multithreading), where all the threads are components of some larger parallel application. However, for multithreaded microprocessors to escape niche markets, they must benefit a wide range of platforms, including portable and desktop PCs and shared multiuser systems. These systems have constraints that violate the "all instructions are equal" assumption of traditional multithreading. For example, in interactive systems one or more threads are directly responsible for user interaction. As user interfaces move into new modes such as speech and 3D graphics, this interaction can be computationally expensive. If a multithreaded processor decides to favor a more resource-efficient thread over a user-interaction thread, the user may see an intolerable and potentially unbounded latency increase.

One solution is to disable multithreading when a latency-critical thread is active. However, this approach wastes the additional throughput capability that multithreading was designed to exploit. Instead, we propose thread prioritization, in which software associates priorities with active threads, and the processor incorporates these priorities into its instruction fetch policy. In this paper, we examine only the simplest thread prioritization policy: a single thread is identified as the high-priority thread, and the processor fetches instructions for this thread whenever possible. Lower-priority threads are given the opportunity to fetch instructions only when the high-priority thread is stalled.

Thread prioritization can be useful even when all threads are logically of equal priority. As mentioned above, a processor that is maximizing throughput will favor threads that use the processor efficiently. This results in an unfair allocation of resources among threads. While this is not an issue in many environments, a multiuser system would like to guarantee fair allocation of resources across all users. By rotating the high-priority designation among all active threads, we can reduce the bias against less efficient threads and improve the fairness of CPU allocation. We also demonstrate that even when latency and fairness are not a concern, rotating thread prioritization has a positive effect on cache and branch predictor utilization.

Unfortunately, although our simple prioritized multithreading scheme provides these benefits while improving utilization over a single-threaded processor, total throughput falls well short of existing throughput-oriented fetch policies. Our ongoing work focuses on more sophisticated prioritization algorithms, potentially incorporating branch confidence estimators, that will maintain these benefits while increasing total throughput.
We provide additional background and describe our experimental methodology in Section 2. Section 3 describes the three potential areas which may benefit from thread prioritization: limiting latency effects, increasing fairness, and reducing cache and branch predictor conflicts. Section 4 concludes with a discussion of ongoing and future work.

2 Background and methodology

When an SMT processor has more threads than instruction-cache fetch ports, it can fetch from only a subset of the threads each cycle. Several fetch policies are introduced and evaluated in [2]. This paper will draw from two of these policies: Round-Robin (RR) and Instruction-Count (I-Count, or IC). We also studied simple prioritized extensions of both Round-Robin and I-Count. In describing these policies, we define an active thread as any thread that is allowed to execute instructions. A thread is eligible to fetch instructions if it is active and has no outstanding instruction-cache miss.

A fetch opportunity for a thread occurs when the thread's fetch address is supplied to an instruction-cache port, resulting in either the addition of one or more instructions to the fetch queue or an instruction-cache miss. In the case of a full fetch queue, no thread is given the opportunity to fetch. This situation must be handled explicitly, since thread starvation can result if this state is counted as a fetch opportunity.

The Round-Robin (RR) policy maintains an arbitrarily ordered list of eligible threads. After the thread or threads at the top of the list are given the opportunity to fetch, these threads are rotated to the bottom of the list. The Round-Robin policy is a fair policy in that, over a number of cycles, each thread has equal opportunity to fetch new instructions.

The Instruction-Count (I-Count or IC) policy counts the number of instructions from active threads that are currently in the instruction buffers but have not yet been issued to a function unit. This policy gives fetch opportunities to the eligible threads that have the fewest instructions in the pipeline, under the assumption that these threads are moving instructions through the CPU quickly and hence making the most efficient use of the pipeline. Threads with fewer active instructions are also less likely to exhibit data dependencies or to be stalled as a result of a cache miss.

We defined two additional fetch policies that incorporate a software-defined fetch priority for each thread. Fetch opportunities are given first to the highest-priority threads. Lower-priority threads are given fetch opportunities only when higher-priority threads are unable to make use of the full fetch bandwidth. The two policies, Prioritized Round-Robin (PR) and Prioritized I-Count (PI), differ only in the policy used to select among threads at the same priority level (RR and IC, respectively). In this paper, we use at most two priority levels: a single foreground thread runs at a high priority while the background thread(s) share the same lower priority.

We simulated the behavior of these policies using a modified version of the sim-outorder simulator from the SimpleScalar tool set [9]. The original simulator was extended by replicating the necessary machine context for multithreading, adding support for multiple address spaces, and increasing the coverage and types of collected statistics. The processor model is based on Sohi's RUU [13]. This model is similar to the instruction queue model used in Hewlett-Packard's PA-8000 processors. The fetch stage feeds instructions to a simple fetch/decode queue. The decode stage picks instructions from this queue, decodes and renames them, then places them into the Register Update Unit (RUU). Loads and stores are broken into an address generation and a memory reference. The address-generation portion is placed in the RUU, while the memory-reference portion of a load/store instruction is placed into the Load/Store Queue (LSQ). The RUU (and LSQ) serve as a combination of global reservation station, rename register file, and re-order buffer. The processor maintains precise exceptions by committing instructions from the RUU only in fetch order.

We specified an advanced processor with numerous function units and reasonable on-chip instruction and data caches. Details of the model can be found in Table 1. Although we have studied processors with multiported instruction caches, for clarity we focus on single-ported instruction caches in this paper.
Table 1: Simulated Processor Configuration

  L1 Instruction Cache              32K bytes, 4-way associative, single ported
  L1 Data Cache                     32K bytes, 4-way associative, dual ported
  Unified L2 Cache                  1M bytes, 4-way associative
  Branch Predictor                  Two-level: 11-bit global history register, 2048-entry PHT,
                                    512-entry BTB, 16-entry RAS (per thread)
  Fetch/Decode/Issue/Commit Width   8 instructions / cycle
  Integer Function Units            6 ALU, 2 Multiply
  FP Function Units                 4 Add, 2 Multiply
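As a concrete illustration of the fetch policies described above, the following sketch (ours, not the authors' simulator code; the Thread fields, function names, and single fetch port are assumptions) shows how RR, I-Count, and the prioritized variants could choose which eligible thread receives the next fetch opportunity:

```python
from dataclasses import dataclass

@dataclass
class Thread:
    tid: int
    priority: int = 0          # software-assigned fetch priority (higher wins)
    icache_miss: bool = False  # an outstanding I-cache miss makes a thread ineligible
    in_flight: int = 0         # fetched-but-unissued instructions, used by I-Count

def eligible(threads):
    """Active threads that may fetch this cycle (no outstanding I-cache miss)."""
    return [t for t in threads if not t.icache_miss]

def pick_round_robin(order):
    """RR: the topmost eligible thread in the list fetches, then moves to the bottom."""
    for i, t in enumerate(order):
        if not t.icache_miss:
            order.append(order.pop(i))
            return t
    return None

def pick_icount(threads):
    """IC: the eligible thread with the fewest unissued instructions fetches."""
    cands = eligible(threads)
    return min(cands, key=lambda t: t.in_flight) if cands else None

def pick_prioritized(threads, tie_break):
    """PR / PI: restrict the choice to the highest software priority level present;
    tie_break (an RR- or IC-style selector) decides among threads at that level."""
    cands = eligible(threads)
    if not cands:
        return None
    top = max(t.priority for t in cands)
    return tie_break([t for t in cands if t.priority == top])

# Prioritized I-Count (PI): the foreground thread (priority 1) always wins when
# eligible; background threads compete by instruction count only when it is stalled.
threads = [Thread(0, priority=1, in_flight=12), Thread(1, in_flight=3), Thread(2, in_flight=7)]
winner = pick_prioritized(threads, tie_break=pick_icount)
```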

[Figure 1. Throughput vs. latency for various fetch policies: normalized foreground-thread latency plotted against total throughput (IPC) for the RR, IC, and PR policies with zero to three background threads.]

3 Applications of prioritization

This section examines each of the potential areas which may benefit from thread prioritization: limiting latency effects, improving fairness, and reducing cache and branch predictor contention.

3.1 Limiting latency effects

Our initial goal in this study was to find a way to address the issue of increased program runtime when several threads are present in the processor. We first looked at the effects on some foreground thread when background threads were introduced into our simulator. To accomplish this, we configured our simulator to stop when our foreground thread completed 10 million instructions and ran simulations with zero, one, two, and three background threads. We refer to the time required to complete the 10 million instructions as the latency of the foreground thread. By comparing the number of simulated cycles from the runs with no background threads to the others, we determined the relative increase in latency due to the background threads. Our simulations use a mix of integer benchmarks from SPEC95, where the foreground thread has been chosen to be perl. Background threads include m88ksim, compress, and ijpeg.

Figure 1 plots relative latency versus total throughput for zero, one, two, and three background threads using the Round-Robin (RR), I-Count (IC), and Prioritized Round-Robin (PR) fetch policies. We see that for both the RR and IC fetch policies, adding a second thread increases the total processor throughput from 2.33 to more than 3.41 IPC. Unfortunately, this 46% increase in throughput comes at the expense of a 42% increase in latency for our foreground thread. As a second and third background thread are added, the latency increases further. Though the total throughput increases significantly, each thread's portion of the available throughput steadily decreases.

Ideally, we would like the addition of background threads to increase throughput without affecting the runtime of the foreground thread. Our goal is to have the curves move from their starting point (a single thread) to the right (increasing throughput) without moving up (no increase in latency); i.e., we are aiming for the lower-right corner of the graph.

Our prioritized fetch policy specifies to the processor that the foreground thread should always be given the opportunity to fetch instructions when it is eligible (not suffering an instruction-cache miss). The background threads are allowed to use whatever fetch bandwidth the foreground thread is unable to use. As the figure shows, adding a second thread in this situation results in an increase in total throughput from 2.33 to 2.91 IPC. Here, we have improved throughput by 25%, but have increased our foreground thread latency by only 6%. Beyond the first background thread, the prioritized schemes degrade quickly, but even with three background threads the latency never increases beyond 13%.
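The latency metric above is simply the foreground thread's cycle count normalized to its single-threaded run. As a small worked check of the percentages quoted in this section (a sketch of ours, using only the IPC figures reported in the text; the underlying cycle counts are not given in the paper):

```python
def normalized_latency(cycles_with_background, cycles_alone):
    """Relative latency of the foreground thread's fixed 10M-instruction run."""
    return cycles_with_background / cycles_alone

def pct_increase(new, old):
    return 100.0 * (new - old) / old

single_ipc = 2.33
print(f"RR/IC, one background thread: throughput +{pct_increase(3.41, single_ipc):.0f}%")  # ~46%
print(f"PR,    one background thread: throughput +{pct_increase(2.91, single_ipc):.0f}%")  # ~25%
# The corresponding normalized latencies quoted in the text are about 1.42 (RR/IC)
# and 1.06 (PR) for one background thread.
```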

3.2 Improving fairness

For applications where processing resources are being shared between a number of users, we would like to be able to avoid giving preferential treatment to any one thread. This principle of fairness can directly conflict with the goal of maximizing throughput. By design, the I-Count fetch policy increases total throughput by favoring threads that are more efficient. Similarly, Culler et al. [7] observed that switch-on-miss (non-simultaneous) multithreading favors threads with lower miss rates, improving overall cache performance at the expense of fairness.

To evaluate the fairness of different fetch policies, we measured the speedup of individual threads (relative to single-threaded performance) when run in a three-thread workload. As a reference point, we also implemented time-slice (TS) scheduling, where each thread has sole possession of the processor for some period of time before relinquishing it to another thread, which then has sole possession. The speedup values for several threads are plotted for different workloads and fetch policies in Figure 2.

[Figure 2. Thread speedups for four three-thread workloads: TS = time slicing, RR = round robin, IC = instruction count, PI = prioritized I-Count. Standard deviations of the per-thread speedups: Workload 1 - TS 0.010, RR 0.061, IC 0.066, PI 0.012; Workload 2 - TS 0.009, RR 0.068, IC 0.082, PI 0.050; Workload 3 - TS 0.006, RR 0.062, IC 0.074, PI 0.043; Workload 4 - TS 0.021, RR 0.049, IC 0.062, PI 0.019.]

What we would like to see for each fetch policy is a nearly horizontal line with large speedup values, indicating that each of the three threads has a similar speedup, i.e., that they all suffer the same performance penalty. As we would expect, the time-slicing policy is quite fair, the standard deviation of the speedup values being no more than 0.010. The RR and IC policies have the largest speedup values, but with significantly different speedups for the individual threads. Standard deviations for these policies range from 0.061 to 0.074. As might be expected, the RR policy does exhibit less unfairness than IC, but the difference is surprisingly small. Although RR is fair in distributing fetch opportunities, useful throughput is still biased toward threads with fewer instruction-cache misses or less sensitivity to branch predictor or cache interference (see the following section).
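For reference, the fairness measure used here can be computed as sketched below: per-thread speedup relative to running alone, summarized by its standard deviation. This is our illustrative formulation; the example numbers are hypothetical, not data from Figure 2.

```python
from statistics import pstdev

def speedups(ipc_in_workload, ipc_alone):
    """Per-thread speedup: each thread's IPC in the multithreaded workload divided
    by its IPC when it has the processor to itself. A perfectly even three-way
    split gives every thread roughly 1/3, or more if SMT adds total throughput."""
    return [m / s for m, s in zip(ipc_in_workload, ipc_alone)]

def fairness(speedup_values):
    """Lower standard deviation means the threads are treated more evenly."""
    return pstdev(speedup_values)

# Hypothetical three-thread workload under an IC-like policy that favors thread 0.
alone = [2.3, 2.0, 1.8]
shared = [1.3, 0.9, 0.7]
s = speedups(shared, alone)
print([round(x, 2) for x in s], round(fairness(s), 3))  # uneven speedups, larger std. dev.
```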

To counteract this bias, we extended our prioritized RR and IC policies to allow us to rotate the high-priority designation among the active threads. Assuming that all threads are of logically equal priority, the operating system rotates (time-slices) the high-priority designation among all the threads, giving each equal time as the foreground thread. (Note that by varying the time each thread spends at high priority, the operating system gains significant control over CPU allocation without disabling multithreading, a capability not present in non-prioritized policies. We can view the foreground/background thread experiments in the preceding section as a special case of this more general model where the priorities are not changed.)

Unlike the Round-Robin or I-Count policies, the rotating priority policy gives each thread a macroscopic interval during which it can exploit all of the processor's resources to the best of its ability. Unlike the time-sliced CPU, we have not completely lost the ability to get work done on background threads during these intervals. However, since we are forcing the processor to give a fetch opportunity to a thread where it may not have done so under a purely throughput-oriented policy, we expect that our throughput will suffer compared to RR or IC.

We ran the new policy with a priority rotation scheduled to occur every 75 million cycles. The resulting curves in Figure 2 are marked PI (Prioritized I-Count) and demonstrate improved fairness over the pure RR and IC policies (for the same workloads), and improved speedup over the time-slicing policy. In one instance (Workload 3), the rotating priority policy flattens the curve by increasing the speedup of one thread by approximately 10% at the expense of lower speedups for the other threads.

3.3 Reducing cache and branch predictor contention

Though threads within an SMT processor may not interact directly, they share execution resources and thus do have an effect on each other. The exact nature of these interactions will vary widely with the processor architecture and the workload being executed. An excellent example of this is illustrated in Figure 3.

[Figure 3. Throughput vs. latency for the ijpeg initialization phase: normalized latency plotted against total throughput (IPC) for the RR and PR fetch policies.]

This graph is a latency-throughput graph similar to Figure 1. In this case, the foreground thread is ijpeg in its initialization phase. As the figure indicates, when running alone, this phase of ijpeg's execution achieves an impressive throughput of 4.4 IPC. It is so efficient in its use of processor resources that, for the RR and IC fetch policies, the introduction of any background thread causes a serious drop in total processor throughput and a 100% increase in foreground thread latency. The use of a prioritized fetch policy allows the background threads to execute at a rate of less than 0.01 IPC. As a result, the data points for the prioritized round-robin policy plot essentially on top of one another. Luckily, this type of behavior seems to be rare.

We can quantify the contention effects of SMT by examining changes in cache miss rates and branch prediction accuracy, as was done previously by Hily [10][11]. We looked at two base cases: an SMT processor using the round-robin fetch policy and a single-threaded out-of-order processor using time-slicing to run multiple threads. The simulated branch predictor accuracies, L1 data cache miss rates, and processor throughput for the perl workloads are shown in Figure 4.
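The metrics plotted in Figure 4 are ordinary aggregate counters. The sketch below shows the bookkeeping we have in mind; the field and function names are ours (not SimpleScalar's statistics), included only to make the comparison explicit.

```python
from dataclasses import dataclass

@dataclass
class ContentionStats:
    """Counters accumulated over one simulated configuration (e.g. RR with 3 threads)."""
    dl1_accesses: int = 0
    dl1_misses: int = 0
    branches: int = 0
    correct_predictions: int = 0
    committed_insts: int = 0
    cycles: int = 0

    def dcache_miss_rate(self):
        return self.dl1_misses / self.dl1_accesses

    def branch_accuracy(self):
        return self.correct_predictions / self.branches

    def total_ipc(self):
        return self.committed_insts / self.cycles

def contention_penalty(smt, timeslice):
    """Compare an SMT run against the time-sliced baseline: positive values mean the
    SMT run suffers a higher D-cache miss rate / lower branch prediction accuracy."""
    return (smt.dcache_miss_rate() - timeslice.dcache_miss_rate(),
            timeslice.branch_accuracy() - smt.branch_accuracy())
```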

[Figure 4. Cache and branch predictor effects: L1 data-cache miss rate, branch prediction accuracy, and total IPC as the number of threads grows from one to four, for the time-slice, round-robin, and prioritized round-robin configurations.]

For the time-slicing processor, contention from additional threads degrades both branch predictor and cache performance, leading to slightly lower overall throughput. The SMT processor sees an even greater degradation in predictor and cache performance, but provides increased throughput nonetheless. Time-slicing provides more favorable predictor and cache performance because it allows a single thread to run alone for a comparatively long period of time (8.5 to 12.5 million cycles for the figures shown). This unperturbed running time allows the branch predictor and caches to warm up without interference, leading to good performance for the running thread. The SMT processor's simultaneous threads do not allow any one thread to avoid interference in the branch predictor or cache as it would in the single-thread case.

Our rotating priority policy from the previous section should reduce interference similarly. In each scheduling interval, the foreground (high-priority) thread will receive a dominant fraction of the execution cycles, allowing it to warm up and exploit the branch predictor and caches with reduced (though non-zero) interference from other threads.
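A minimal sketch of the rotation mechanism from Section 3.2 follows, under the assumption that an operating-system (or simulator) hook can rewrite each thread's fetch priority once per interval; the class and its interface are illustrative, not part of our simulator.

```python
class PriorityRotator:
    """Rotate the high-priority (foreground) designation among equal-priority threads.

    Every `interval` cycles the next thread in turn becomes the single foreground
    thread; all others drop to the shared background priority level."""

    def __init__(self, threads, interval=75_000_000, high=1, low=0):
        self.threads = threads          # objects exposing a mutable `priority` field
        self.interval = interval        # 75M cycles in the runs described above
        self.high, self.low = high, low
        self.current = 0
        self.next_switch = interval
        self._apply()

    def _apply(self):
        for i, t in enumerate(self.threads):
            t.priority = self.high if i == self.current else self.low

    def tick(self, cycle):
        """Call once per simulated cycle; hands the foreground role to the next
        thread each time the rotation interval expires."""
        if cycle >= self.next_switch:
            self.current = (self.current + 1) % len(self.threads)
            self.next_switch += self.interval
            self._apply()
```

With equal intervals this reproduces the equal-time rotation used here; giving some threads longer intervals yields the finer-grained control over CPU allocation noted in Section 3.2.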

We ran experiments which gave each thread two equal-length periods as the foreground thread. Our results for the prioritized round-robin (PR) fetch policy are also found in Figure 4. As expected, we have improved branch prediction and cache performance over the pure Round-Robin policy. This improved predictor and cache performance relative to RR does not translate into improved throughput for this simple prioritization scheme on these workloads. However, we expect that better prioritization schemes and/or more demanding workloads may translate this increased resource efficiency into an overall performance improvement.

4 Conclusions and future work

The prioritized fetch policies that we have developed have shown themselves to be effective in reducing the impact of background threads on the foreground thread's latency. These policies also have the ability to trade speedup for fairness in applications where this is important. Finally, we have shown that a prioritized scheme can limit the pathological behavior of some workloads by addressing the problem of resource contention between threads.

Though our rotating priority scheme clearly reduces cache and branch predictor contention, this utilization improvement does not produce an overall throughput gain over non-prioritizing policies with our simple prioritization scheme and our workload mix. However, we hypothesize that a throughput gain may be realized using more sophisticated prioritization policies and/or workloads with higher resource contention. One policy currently under investigation uses branch confidence measures to temporarily reduce the priority of threads executing down a low-confidence path. An SMT processor is ideally suited to use branch confidence because of the processor's ability to dynamically reallocate processor resources to other threads. By allocating these resources to threads on higher-confidence paths of execution, performance should be improved and contention for resources between threads should be reduced.

To the extent that we can execute background threads without significantly increasing the latency of a foreground thread, prioritization makes available free execution cycles that can be exploited in novel ways. Even when only a single application thread is available, these free cycles may be used in support of that application to manage or optimize cache or branch prediction resources, to collect profiling data, or even to re-optimize application code on the fly.

Acknowledgments

This work was supported by IBM, Intel, and the National Science Foundation under award CCR-9734026.

References

[1] D. M. Tullsen, S. J. Eggers, and H. M. Levy. Simultaneous Multithreading: Maximizing On-Chip Parallelism. In 22nd Annual International Symposium on Computer Architecture, pages 392-403, June 1995.
[2] D. M. Tullsen, S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, and R. L. Stamm. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor. In 23rd Annual International Symposium on Computer Architecture, pages 191-202, May 1996.
[3] B. J. Smith. Architecture and Applications of the HEP Multiprocessor Computer System. In Proceedings of the SPIE, 298:241-248, 1981.
[4] G. Alverson, et al. Exploiting Heterogeneous Parallelism on a Multithreaded Multiprocessor. In Proceedings of the International Conference on Supercomputing, pages 188-197, July 1992.
[5] A. Agarwal, B. H. Lim, D. Kranz, and J. Kubiatowicz. APRIL: A Processor Architecture for Multiprocessing. In 17th Annual International Symposium on Computer Architecture, pages 104-114, May 1990.
[6] S. W. Keckler and W. J. Dally. Processor Coupling: Integrating Compile Time and Runtime Scheduling for Parallelism. In 19th Annual International Symposium on Computer Architecture, pages 202-213, May 1992.
[7] D. E. Culler, M. Gunter, and J. C. Lee. Analysis of Multithreaded Microprocessors under Multiprogramming. Univ. of California, Berkeley, Computer Science Division Tech. Report No. UCB/CSD 92/687, May 1992.

[8] J. Laudon, A. Gupta, and M. Horowitz. Interleaving: A Multithreading Technique Targeting Multiprocessors and Workstations. In Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VI), pages 308-318, October 1994.
[9] D. Burger and T. M. Austin. The SimpleScalar Tool Set, Version 2.0. Technical Report #1342, University of Wisconsin-Madison Computer Sciences Department, June 1997.
[10] S. Hily and A. Seznec. Branch Prediction and Simultaneous Multithreading. In 1996 International Conference on Parallel Architectures and Compilation Techniques, pages 169-173, October 1996.
[11] S. Hily and A. Seznec. Contention on 2nd Level Cache May Limit the Effectiveness of Simultaneous Multithreading. Internal Publication #1086, IRISA, Campus Universitaire de Beaulieu, February 1997.
[12] T.-Y. Yeh and Y. Patt. Two-Level Adaptive Branch Prediction. In Proceedings of the 24th Annual International Symposium on Microarchitecture, November 1995.
[13] G. S. Sohi. Instruction Issue Logic for High-Performance, Interruptible, Multiple Functional Unit, Pipelined Computers. IEEE Transactions on Computers, 39(3):349-359, March 1990.