Evaluating the Performance Impact of Hardware Thread Priorities in Simultaneous Multithreaded Processors using SPEC CPU2000

Mitesh R. Meswani and Patricia J. Teller
Department of Computer Science, University of Texas at El Paso
{mmeswani,

Abstract

Applications executing on Simultaneous Multithreaded (SMT) processors face interference from parallel execution contexts, which can significantly reduce overall system performance. This interference differs from that observed in SMP or other multiprocessor systems because resources are shared at a finer, cycle-level granularity. The IBM POWER5 processor provides user-adjustable hardware thread priorities to throttle SMT threads and, thus, improve overall system performance. This paper evaluates the significance of these hardware thread priorities on the execution times of pairs of co-scheduled processes. The experiments were run on a trace-driven simulator of the POWER5 processor using SPEC CPU2000 benchmarks. The results show that the performance differences between the best and worst cases range from 7% to 30%, and that the default priorities do not perform best most of the time. This result shows that it is worthwhile to gain a better understanding of the effect of priorities on application interference and to use this information to develop heuristics for automatically setting priorities in the future.

1. Introduction and Background

In the past few years, system architects have tackled the utilization problem at the processor level by taking advantage of an application's Instruction Level Parallelism (ILP). By exploiting ILP, multiple independent instructions can be executed in parallel, thus improving processor utilization. Design innovations like superscalar processors combined with out-of-order execution allow independent instructions to execute in parallel within the microarchitecture.
Modern superscalar processors with wide instruction fetch and issue windows further improve utilization by allowing the completion of more than one Instruction Per Cycle (IPC). Intelligent branch predictors allow speculative execution of instructions down the predicted path and improve IPC when the prediction is correct. However, even with ILP, processor utilization is low. This low utilization is due to several factors. For example, without correct branch prediction, an application's ILP is limited by the number of instructions in a basic block, and today's wide-issue, out-of-order, superscalar processors may have to resolve a branch every one or two clock cycles. In addition, the use of pointers and indirect memory references reduces the accuracy of branch predictions. Furthermore, the high cost of access to the various levels of memory makes the average stall costly.

Explicit hardware multithreading has emerged as a design choice to help solve the processor utilization problem. The literature [7] discusses different forms of explicit hardware multithreading, ranging from interleaved execution to overlapped execution. The basic concept in all of these forms is to increase the potential for ILP by considering instruction streams from multiple applications. In this paper we use the terms hardware threads, contexts, and threads interchangeably. Simultaneous multithreading (SMT) [6] allows multiple hardware threads or contexts to execute in parallel. These hardware threads share the resources of the microarchitecture and can execute simultaneously as long as there is no conflict for shared resources. SMT allows a processor to improve the IPC rate of the system at the cost of a drop in individual application performance.
Ideally, the execution time of a batch of programs executed on an SMT processor with n simultaneous hardware threads is bounded, on the lower end, by the execution time of the programs executed on a single processor and, on the upper end, by the execution time of the programs executed on an n-way Shared Memory Multiprocessor (SMP). An application executing on an SMT processor faces interference for shared resources from applications executing on the other hardware contexts. The applications can interfere in the microarchitecture/pipeline for functional units, in the memory subsystem due to working-set interference, or on the system bus due to competing bus transactions. This interference differs from that observed in SMP or other multiprocessor systems due to the finer, cycle-level granularity. It could result in situations where the overall system IPC becomes even lower than that achieved in single-threaded mode, which is not the ideal/expected case. The interference is highly dependent on the characteristics of the simultaneously executing applications and their patterns of shared-resource access.

SMT is available on IBM's POWER5 processor [8] and on Intel processors as Hyper-Threading [9]. The Sun Solaris 9 and 10, Windows XP, Windows Server 2003, Linux 2.6, and AIX 5L operating systems support SMT processors. They provide separate run queues for each executing hardware context and, for load-balancing purposes, consider each context as a separate processor. The IBM processor has a set of environment tuning knobs [2] to optimize SMT performance. These knobs allow selective disabling of SMT, fixed-time idling of a hardware context, and assignment of individual thread priorities to grant higher-priority threads more decode cycles and, thus, a larger share of the processing time. Some of these knobs are used by the operating system to implement fast spin locks. In this research we study the effect of hardware thread priorities on the execution time of co-scheduled application pairs.

The remainder of the paper is organized as follows. Section 2 discusses related research. The experimental platform, including the implementation of hardware thread priorities, and the workloads used in the study are described in Sections 3 and 4, respectively. Section 5 presents the method used for data collection, while Section 6 explains the experimental methodology. The results are presented and discussed in Section 7. Finally, Section 8 presents conclusions and future work.

2. Related Research

The SMT processor design was first introduced by Tullsen, et al. [6] as a simulated system driven by Alpha binaries. Snavely, et al. [10] studied an operating system (OS) process scheduler targeted at SMT processors; the SOS (sample, optimize, symbios) "symbiotic" scheduler collects performance data of various job mixes during the sample phase and then optimizes the schedule during the optimize phase.
The inherent scheduling algorithm, which considers all possible co-schedules, can take exponential time to execute. Nakajima, et al. [11] studied process scheduling optimization on a hyper-threaded processor with a dual-core package for non-multiprogrammed workloads. This work involved load balancing of the set of processes across processors to meet level-two cache and floating-point requirements. Bulpin, et al. [12] studied the design of a hyper-threading-aware process scheduler using SPEC CPU2000 [3] benchmarks on Pentium 4 processors. The scheduler keeps track of individual application performance as the ratio of its Single-Threaded performance to its SMT performance. The scheduler records every pair's co-performance, which is the sum of the individual performance ratios. When a scheduling decision has to be made for a thread, the co-performance data is used to calculate the dynamic priorities, and the application that works best with the co-scheduled application is selected. This scheduler improved overall system performance by 3.2% over the native Linux scheduler. In [12] dynamically calculated process priorities were used to guide the co-scheduling of applications and to achieve the least application interference. Our work extends this idea by exploring the dynamic changing of hardware, rather than software, thread priorities in order to enhance performance. Changing priorities at the hardware, rather than operating system, level provides finer-grained control of the impact that a process can have on the performance of a co-scheduled process. McGregor, et al. [13] studied the design of an OS process scheduler for Intel's Hyper-Threaded processor using NAS OpenMP benchmarks. This research evaluated process pairing and enabling/disabling Hyper-Threading using one of three metrics: L3 cache misses, bus transactions, or number of stall cycles. Using these metrics, they adjusted the number of running processes and the process pairing on a processor.
The study was limited to compute-intensive applications. Moreover, it did not keep track of processor affinity; with process migration, processes may not be able to take advantage of saved cache state. Our work differs from [13] by allowing co-scheduling, using different thread priorities, of applications that may have otherwise been executed in Single-Threaded mode. This allows increased opportunities to improve overall system performance. In our research we do not consider off-chip migration and, hence, the applications can take advantage of any saved on-chip cache state. In the future we will extend the analysis to memory-intensive and I/O-intensive classes of applications. Fedorova, et al. [14] designed an OS process scheduler for Chip Multithreading (CMT) systems. The CMT systems they studied are not superscalar and are targeted more at threaded applications. This work considered adjusting the compute time assigned to an application based on its fair share of the caches, e.g., in an n-threaded system, the performance obtained with a cache of 1/n-th the size. Using a detailed regression model, the expected IPC of a process is dynamically predicted based on it having its fair share of the cache. Then the compute time allocated to the process is adjusted to make up for deviation from this expected, fair IPC.

For a small class of workloads, these approaches have reduced interference not considered by the current naïve schedulers. However, policies based on simple heuristics [13] or customized models [14] cannot be extended to other application classes, whereas policies that consider all possible co-schedules [10] cannot scale to increasing numbers of hardware contexts. Our work, which uses SMT tuning knobs, such as those provided by IBM processors [2], along with a characterization of their effects on inter-application interference for various application classes, may yield a solution that could automatically set the knobs to attain the best possible performance. In addition, this methodology can scale to increasing numbers of hardware contexts.

3. Experimental Platform

The experiments and the data collection, explained in Sections 5 and 6, were carried out on the trace-driven IBM Performance Simulator for Linux on POWER [1]. This simulator provides a suite of performance models for various IBM POWER processor models, including the POWER5 processor. The simulator requires instruction traces as input and gives as output various metrics, such as Cycles Per Instruction (CPI), functional unit usage statistics, and an instruction histogram. The simulator for the POWER5 processor model supports both the Simultaneous Multithreading (SMT) mode and the Single-Threaded mode.

Figure 1. POWER5 Processor Layout

The POWER5 processor simulated and studied, shown in Figure 1, has two identical cores running at 1.65 GHz, each with a 32KB L1 instruction cache and a 64KB L1 data cache. Each pair of cores shares a unified 1.9MB L2 cache. The processor supports two SMT threads per core. This processor has hardware logic to throttle the interference of a thread. The processor's Dynamic Resource Balance (DRB) logic can be used to temporarily reduce the number of decode cycles allocated to a thread based on resource-usage thresholds, such as the number of outstanding L2 cache misses. In addition to the hardwired throttling, the processor supports software-controlled hardware thread priorities. Depending on the application's privilege level, there are eight software-adjustable priorities. The difference in priorities between two threads controls the proportion of decode cycles allocated to each hardware thread.
A higher-priority thread is allowed to decode more instructions, allowing its tasks to progress faster. In this paper we study the priorities that can be set by user applications, which are two, three, and four. Given X as the priority of thread 0 and Y as the priority of thread 1, equation 1 shows the formula used to calculate the decode cycle allocation for the lower-priority thread; the decode cycle allocation for the higher-priority thread is obtained by subtracting this result from one.

    decode fraction (lower-priority thread) = 1 / 2^(|X - Y| + 1)    (1)

The decode cycle allocations associated with the possible thread priorities are shown in Table 1; a complete list is available in the related IBM Red Book [2]. The processor assigns the default priority of 4 to both threads.

Thread 0 Priority | Thread 1 Priority | Thread 0 Decode | Thread 1 Decode
        4         |         2         |       7/8       |       1/8
        4         |         3         |       3/4       |       1/4
        4         |         4         |       1/2       |       1/2
        3         |         4         |       1/4       |       3/4
        2         |         4         |       1/8       |       7/8

Table 1. Effect of Thread Priorities on Decode Cycle Allocation

4. Workload

We use 14 workloads, each composed of two applications from the industry-standard SPEC CPU2000 [3] benchmark suite, to measure the performance of the CPU and memory subsystem. The applications used include the floating-point benchmarks swim, art, mesa, and mgrid, and the integer benchmarks gzip, vpr, twolf, and vortex. The application swim is a weather prediction program, while art uses neural networks to perform image recognition. mesa is a 3-D graphics library and mgrid is a multi-grid solver. The application gzip implements data compression algorithms, while vpr is an FPGA circuit placement and routing program. vortex is a single-user, object-oriented database transaction program and twolf is a placement and global routing application used in the process of creating the lithography artwork needed for the production of microchips.

5. Data Collection

To drive the simulations and, thus, collect performance data, we first had to capture instruction traces for the applications executing on a POWER5 processor.
Next, the simulator was configured to capture the information needed to understand the effect of hardware thread priorities on the execution time of a pair of co-scheduled applications. The mechanism used to capture traces is explained in Section 5.1 and the configuration of the simulator is briefly discussed in Section 5.2.
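The decode-cycle allocation defined by equation 1 and Table 1 can be sketched in a few lines of Python; this helper is ours, for illustration only, and is not part of the simulator or its configuration:

```python
# Decode-cycle shares for two POWER5 SMT threads with user-settable
# priorities (2-4). Per equation 1, the lower-priority thread receives
# 1 / 2^(|X - Y| + 1) of the decode cycles; the other thread gets the rest.
# Illustrative helper only, not part of the IBM simulator.

def decode_share(x: int, y: int) -> tuple[float, float]:
    """Return (thread0_share, thread1_share) for priorities x and y."""
    low = 1.0 / 2 ** (abs(x - y) + 1)  # equation 1: lower-priority share
    high = 1.0 - low                   # remainder goes to the other thread
    return (high, low) if x >= y else (low, high)

# Reproduce Table 1's five distinct co-schedules:
for x, y in [(4, 2), (4, 3), (4, 4), (3, 4), (2, 4)]:
    print((x, y), decode_share(x, y))
```

Note that equal priorities need no special case: with X = Y, equation 1 yields 1/2 for both threads, matching the default 4,4 row of Table 1.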

5.1. Instruction Tracing

The Linux tool ITrace [5] was used to capture the instruction traces of unmodified application binaries. The traces are captured in a main-memory buffer, the size of which is limited to 200 MB per CPU. The applications were executed on one processor of a two-processor POWER5 machine running the Linux kernel. The system has 5GB of main memory and a 70GB hard disk. The study by Gomez, et al. [4] showed that the four integer benchmarks reach a stable CPI after the first 50 million instructions. Accordingly, we allow a warm-up time of one minute for each application before capturing traces. The traces for the floating-point benchmarks are captured after executing each application for half of its total execution time. The trace tool captured between 13 million and 28 million instructions for each benchmark.

5.2. Simulator Configuration

The trace-driven simulator can be configured to use either the SMT mode or the Single-Threaded mode. We use the SMT mode for executing the co-scheduled application pairs. The simulator in SMT mode exits when one of the instruction traces completes. The leftover instructions from the incomplete trace are executed in Single-Threaded mode. This is done because the goal of the research is to study the effect of thread priorities on the execution time required to complete both traces. Accordingly, the total time for executing both traces is computed by adding the times from the SMT and Single-Threaded modes.

6. Experimental Methodology

The hypothesis of our research is that hardware thread priorities have a significant performance impact on the execution times of co-scheduled application pairs. Consequently, the goal of our experiments is to verify this hypothesis by comparing the execution times of different application co-schedules. While co-scheduling may cause individual applications to slow down, the time to execute the co-scheduled pair of applications may be reduced.
The process of creating co-scheduled pairs is described in Section 6.1, while Section 6.2 describes the performance metrics used for evaluation.

6.1. Application Co-schedules

For a given pair of applications, each can be assigned to one of the two threads of the simulated processor. Since each thread has three possible priorities, this gives a total of 18 combinations. However, the number of combinations can be reduced since the hardware threads are identical, i.e., executing application X with priority i on thread 0 and application Y with priority j on thread 1 is the same as executing application X with priority i on thread 1 and application Y with priority j on thread 0. This reduces the possible combinations to nine. Moreover, from equation 1, since only the difference in priority matters, the number of distinct pairs reduces to five. The co-schedules for one application pair are shown in Table 2. We study pairs where both applications are either integer or floating-point. Additionally, we study two mixed pairs where one application is integer and one is floating-point. Thus, there are 14 pairs, giving a total of 70 different co-schedules. The different pairings are shown in Table 3.

Thread 0 Priority (Executing Application X) | Thread 1 Priority (Executing Application Y)
                     4                      |                      2
                     4                      |                      3
                     4                      |                      4
                     3                      |                      4
                     2                      |                      4

Table 2. Co-schedules for Application Pairs X, Y

Integer Application Pairs | Floating-Point Application Pairs | Mixed Pairs
gzip, twolf               | swim, mgrid                      | swim, gzip
gzip, vortex              | swim, mesa                       | art, vpr
gzip, vpr                 | swim, art                        |
twolf, vortex             | mgrid, mesa                      |
twolf, vpr                | mgrid, art                       |
vortex, vpr               | mesa, art                        |

Table 3. Application Pairs X, Y

6.2. Performance Metrics

The simulator outputs the total number of cycles required to execute a pair of applications, with defined thread priorities, in SMT mode. Additionally, for each trace, it outputs the number of completed instructions. As explained in Section 5.2, once one trace is completed, the remaining instructions from the other trace are executed in Single-Threaded mode.
The total number of cycles required to execute a co-scheduled pair of applications is computed by adding the total cycles in SMT mode and in Single-Threaded mode. The performance metric used to evaluate whether thread priorities have a significant performance impact on the execution of a pair of applications is computed by taking the difference between the best-case and worst-case total number of cycles reported for the five different co-schedules investigated (see Table 2 above). We also compute the difference between the best case and the default case, i.e., with each thread assigned a priority of 4.
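The cycle accounting and comparison metric described above can be sketched as follows. The function names are ours, the sample inputs are the best-case and worst-case cycle counts reported for the (gzip, twolf) pair in Table 4, and we assume the percentage difference is taken relative to the best case:

```python
# Sketch of the paper's cycle accounting and comparison metric.
# Helper names are ours; using the best case as the percentage base
# is an assumption (it reproduces the 7.39%-29.45% range in Section 7).

def total_cycles(smt_cycles: int, leftover_st_cycles: int) -> int:
    """Total cost of a co-schedule: SMT-mode cycles until one trace
    finishes, plus Single-Threaded cycles for the leftover trace."""
    return smt_cycles + leftover_st_cycles

def pct_difference(best: int, other: int) -> float:
    """Percentage by which `other` exceeds `best`."""
    return 100.0 * (other - best) / best

best, worst = 41_107_097, 47_015_237  # gzip, twolf (Table 4)
print(f"{pct_difference(best, worst):.2f}%")  # prints 14.37%
```

Applied to the (twolf, vortex) and (swim, gzip) rows of Table 4, the same function yields 7.39% and 29.45%, the endpoints of the range reported in Section 7.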

Application Pair | Best Total Cycles | Worst Total Cycles | Default Total Cycles | % Diff. Best vs. Worst | % Diff. Best vs. Default | Best Priorities | Worst Priorities
gzip, twolf      | 41,107,097 | 47,015,237 | 43,452,…   | 14.37 | … | 4,3 | 2,4
gzip, vortex     | 38,449,210 | 43,138,538 | 39,974,…   | 12.20 | … | 4,2 | 2,4
gzip, vpr        | 39,674,997 | 44,251,721 | 41,818,…   | 11.54 | … | 4,3 | 2,4
twolf, vortex    | 35,711,356 | 38,348,825 | 35,711,356 |  7.39 | … | 4,4 | 2,4
twolf, vpr       | 37,301,850 | 40,711,131 | 37,301,850 |  9.14 | … | 4,4 | 4,2
vortex, vpr      | 34,172,959 | 37,054,265 | 34,529,…   |  8.43 | … | …,4 | 4,2
swim, mgrid      | 34,809,876 | 38,236,266 | 34,809,876 |  9.84 | … | 4,4 | 4,2
swim, mesa       | 32,221,419 | 39,797,984 | 33,063,…   | 23.51 | … | …   | …
swim, art        | 29,793,892 | 33,311,936 | 29,793,892 | 11.81 | … | 4,4 | 4,2
mgrid, mesa      | 30,911,156 | 37,952,726 | 32,205,…   | 22.78 | … | …,4 | 4,2
mgrid, art       | 27,109,275 | 31,520,482 | 27,109,275 | 16.27 | … | 4,4 | 4,2
mesa, art        | 27,870,271 | 34,869,714 | 28,711,…   | 25.11 | … | 4,2 | 2,4
swim, gzip       | 39,334,954 | 50,919,210 | 44,438,…   | 29.45 | … | …,4 | 4,2
art, vpr         | 31,965,969 | 36,215,448 | 34,149,…   | 13.29 | … | …,4 | 4,2

Table 4. Difference in Performance between Best and Worst Cases and between Best and Default Cases, and Thread Priorities for Best and Worst Cases

7. Results

Column five of Table 4 shows the percentage difference between the best-case and worst-case number of cycles required for the execution of each application pair under the various priority co-schedules shown in Table 1. As can be seen from these results, the performance differences between the best and worst cases are between 7.39% and 29.45%. Columns seven and eight of Table 4 give the thread priorities associated with the best- and worst-case performance, respectively, for each application pair. The first column shows the co-scheduled pair X, Y; the seventh and eighth columns show the thread priorities as i, j, such that i is the priority of the thread executing application X and j is the priority of the thread executing application Y. As can be seen from this table, the thread priorities for the best- and worst-case performance depend on the applications that are co-scheduled.
For nine of the fourteen application pairs, the best-case performance is not that associated with the case in which both threads have the default priority of 4. Column six of Table 4 shows the difference between the best-case number of cycles required for the execution of each application pair under the various priority co-schedules shown in Table 1 and the default case, i.e., with each thread having priority 4. The application pairs (twolf, vortex), (twolf, vpr), (swim, mgrid), (swim, art), and (mgrid, art) experience best performance under the default thread-priority settings, whereas the remaining nine application pairs experience best performance under different priority settings and, for them, the differences in performance between the settings that result in best performance and the default setting range from 1.03% to 12.97%. Although, with the exception of the pair (swim, gzip), this difference is small, it will be interesting to discover the performance differences for other applications.

Application Pair | Single-Threaded Total Cycles | % Diff. between Worst and Single-Threaded Cases
swim, mesa       | 38,810,…                     | …
swim, art        | 33,001,…                     | …
swim, gzip       | 50,669,…                     | …

Table 5. Difference between Single-Threaded and Worst-Case Performance

Table 5 shows application pairs for which the worst-case performance (the total number of cycles to execute the application pair) is worse than that of the Single-Threaded case, i.e., worse than the sum of the number of cycles required for each individual application in Single-Threaded mode without SMT on. The negative numbers in the last column of Table 5 show the percentage of extra cycles that the worst cases required relative to the Single-Threaded cases.

The performance differences presented in this section show that application interference on SMT processors depends on both individual application characteristics and on SMT priorities. As shown by our study, the default SMT priorities do not always yield the best performance, and, in the best cases, the effect of application interference can be reduced by intelligently setting thread priorities. In addition, as shown in this section, in the worst case, hardware thread priorities can cause inter-application interference to reduce performance to less than that of single-threaded execution. Hence, a detailed understanding of the performance impact of thread priorities can garner performance rewards while avoiding worst-case scenarios.

8. Conclusions and Future Work

This work shows that hardware thread priorities on SMT processors can have a significant impact on overall performance and that the default priorities assigned to hardware threads on a POWER5 processor do not always yield the best performance. The differences between the best-case and worst-case number of cycles required for the execution of two applications using five different pairs of thread priorities show that a detailed understanding of the effect of SMT priorities on inter-application interference could result in significant performance gains. We will continue to expand this work by using the simulator's functional unit usage data to characterize the effect of thread priorities on inter-application interference. In the future, we intend to expand the applications studied, adding I/O- and memory-intensive applications, and to implement dynamic modification of hardware thread priorities in real systems. Finally, we aim to develop heuristics that will allow us to build models that can, given an application pair's performance behavior, effectively set hardware thread priorities dynamically. Besides thread priorities, there are tuning knobs like SMT Snooze and SMT mode off that are of interest to us as well.
Acknowledgements

We would like to thank John Griswell from IBM for providing the simulator binaries. In addition, we want to acknowledge that this work was supported by an IBM Faculty Award, an IBM SUR (Supported University Research) grant, and a University of Texas STAR (Science and Technology Acquisition and Retention) Program award, which were awarded to Dr. Teller.

9. References

[1] alphaWorks: IBM Performance Simulator for Linux on POWER: Overview, accessed 08/15/2006.
[2] Advanced POWER Virtualization on IBM eServer p5 Servers: Architecture and Performance Considerations, SG245768.
[3] SPEC - Standard Performance Evaluation Corporation, accessed 08/15/2006.
[4] I. Gómez, L. Piñuel, M. Prieto, and F. Tirado, "Analysis of Simulation-adapted SPEC 2000 Benchmarks," ACM SIGARCH Computer Architecture News, 30(4):4-10, September 2002.
[5] ITrace for Linux/PPC.
[6] D. M. Tullsen, S. J. Eggers, and H. M. Levy, "Simultaneous Multithreading: Maximizing On-chip Parallelism," Proceedings of the 22nd International Symposium on Computer Architecture (ISCA '95), IEEE Computer Society, June 1995.
[7] T. Ungerer, B. Robic, and J. Silc, "A Survey of Processors with Explicit Multithreading," ACM Computing Surveys (CSUR), 35(1):29-63, March 2003.
[8] R. Kalla, B. Sinharoy, and J. M. Tendler, "IBM POWER5 Chip: a Dual-Core Multithreaded Processor," IEEE Micro, 24(2):40-47, March 2004.
[9] G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, and P. Roussel, "The Microarchitecture of the Pentium 4 Processor," Intel Technology Journal, 5(1):1-13, February 2001.
[10] A. Snavely and D. M. Tullsen, "Symbiotic Job Scheduling for a Simultaneous Multithreading Processor," Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '00), ACM Press, November 2000.
[11] J. Nakajima and V. Pallipadi, "Enhancements for Hyper-Threading Technology in the Operating System: Seeking the Optimal Scheduling," Proceedings of the 2nd Workshop on Industrial Experiences with Systems Software, The USENIX Association, December 2002.
[12] J. Bulpin and I. Pratt, "Hyper-Threading Aware Process Scheduling Heuristics," Proceedings of the 2005 USENIX Annual Technical Conference, April 2005.
[13] R. L. McGregor, C. D. Antonopoulos, and D. S. Nikolopoulos, "Scheduling Algorithms for Effective Thread Pairing on Hybrid Multiprocessors," Proceedings of the Nineteenth International Parallel and Distributed Processing Symposium, Denver, CO, April 2005.
[14] A. Fedorova, M. Seltzer, and M. D. Smith, "A Non-Work-Conserving Operating System Scheduler for SMT Processors," Proceedings of the Workshop on the Interaction between Operating Systems and Computer Architecture (WIOSCA), in conjunction with ISCA-33, Boston, MA, June 2006.


More information

What SMT can do for You. John Hague, IBM Consultant Oct 06

What SMT can do for You. John Hague, IBM Consultant Oct 06 What SMT can do for ou John Hague, IBM Consultant Oct 06 100.000 European Centre for Medium Range Weather Forecasting (ECMWF): Growth in HPC performance 10.000 teraflops sustained 1.000 0.100 0.010 VPP700

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning

A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning G. Edward Suh, Srinivas Devadas, and Larry Rudolph Laboratory for Computer Science MIT Cambridge, MA 239 suh,devadas,rudolph

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

Computer Architecture Lecture 24: Memory Scheduling

Computer Architecture Lecture 24: Memory Scheduling 18-447 Computer Architecture Lecture 24: Memory Scheduling Prof. Onur Mutlu Presented by Justin Meza Carnegie Mellon University Spring 2014, 3/31/2014 Last Two Lectures Main Memory Organization and DRAM

More information

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004 ABSTRACT Title of thesis: STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS Chungsoo Lim, Master of Science, 2004 Thesis directed by: Professor Manoj Franklin Department of Electrical

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Enhancements to Linux I/O Scheduling

Enhancements to Linux I/O Scheduling Enhancements to Linux I/O Scheduling Seetharami R. Seelam, UTEP Rodrigo Romero, UTEP Patricia J. Teller, UTEP William Buros, IBM-Austin 21 July 2005 Linux Symposium 2005 1 Introduction Dynamic Adaptability

More information

Ron Kalla, Balaram Sinharoy, Joel Tendler IBM Systems Group

Ron Kalla, Balaram Sinharoy, Joel Tendler IBM Systems Group Simultaneous Multi-threading Implementation in POWER5 -- IBM's Next Generation POWER Microprocessor Ron Kalla, Balaram Sinharoy, Joel Tendler IBM Systems Group Outline Motivation Background Threading Fundamentals

More information

Exploring different level of parallelism Instruction-level parallelism (ILP): how many of the operations/instructions in a computer program can be performed simultaneously 1. e = a + b 2. f = c + d 3.

More information

Chip-Multithreading Systems Need A New Operating Systems Scheduler

Chip-Multithreading Systems Need A New Operating Systems Scheduler Chip-Multithreading Systems Need A New Operating Systems Scheduler Alexandra Fedorova Christopher Small Daniel Nussbaum Margo Seltzer Harvard University, Sun Microsystems Sun Microsystems Sun Microsystems

More information

One-Level Cache Memory Design for Scalable SMT Architectures

One-Level Cache Memory Design for Scalable SMT Architectures One-Level Cache Design for Scalable SMT Architectures Muhamed F. Mudawar and John R. Wani Computer Science Department The American University in Cairo mudawwar@aucegypt.edu rubena@aucegypt.edu Abstract

More information

Exploitation of instruction level parallelism

Exploitation of instruction level parallelism Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering

More information

Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review

Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Bijay K.Paikaray Debabala Swain Dept. of CSE, CUTM Dept. of CSE, CUTM Bhubaneswer, India Bhubaneswer, India

More information

Scheduling the Intel Core i7

Scheduling the Intel Core i7 Third Year Project Report University of Manchester SCHOOL OF COMPUTER SCIENCE Scheduling the Intel Core i7 Ibrahim Alsuheabani Degree Programme: BSc Software Engineering Supervisor: Prof. Alasdair Rawsthorne

More information

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Based on papers by: A.Fedorova, M.Seltzer, C.Small, and D.Nussbaum Pisa November 6, 2006 Multithreaded Chip

More information

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji Beyond ILP Hemanth M Bharathan Balaji Multiscalar Processors Gurindar S Sohi Scott E Breach T N Vijaykumar Control Flow Graph (CFG) Each node is a basic block in graph CFG divided into a collection of

More information

Multithreading: Exploiting Thread-Level Parallelism within a Processor

Multithreading: Exploiting Thread-Level Parallelism within a Processor Multithreading: Exploiting Thread-Level Parallelism within a Processor Instruction-Level Parallelism (ILP): What we ve seen so far Wrap-up on multiple issue machines Beyond ILP Multithreading Advanced

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

IMPLEMENTING HARDWARE MULTITHREADING IN A VLIW ARCHITECTURE

IMPLEMENTING HARDWARE MULTITHREADING IN A VLIW ARCHITECTURE IMPLEMENTING HARDWARE MULTITHREADING IN A VLIW ARCHITECTURE Stephan Suijkerbuijk and Ben H.H. Juurlink Computer Engineering Laboratory Faculty of Electrical Engineering, Mathematics and Computer Science

More information

Beyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy

Beyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy EE482: Advanced Computer Organization Lecture #13 Processor Architecture Stanford University Handout Date??? Beyond ILP II: SMT and variants Lecture #13: Wednesday, 10 May 2000 Lecturer: Anamaya Sullery

More information

Methods for Modeling Resource Contention on Simultaneous Multithreading Processors

Methods for Modeling Resource Contention on Simultaneous Multithreading Processors Methods for Modeling Resource Contention on Simultaneous Multithreading Processors Tipp Moseley, Joshua L. Kihm, Daniel A. Connors, and Dirk Grunwald Department of Computer Science Department of Electrical

More information

Simultaneous Multithreading Architecture

Simultaneous Multithreading Architecture Simultaneous Multithreading Architecture Virendra Singh Indian Institute of Science Bangalore Lecture-32 SE-273: Processor Design For most apps, most execution units lie idle For an 8-way superscalar.

More information

Database Workload. from additional misses in this already memory-intensive databases? interference could be a problem) Key question:

Database Workload. from additional misses in this already memory-intensive databases? interference could be a problem) Key question: Database Workload + Low throughput (0.8 IPC on an 8-wide superscalar. 1/4 of SPEC) + Naturally threaded (and widely used) application - Already high cache miss rates on a single-threaded machine (destructive

More information

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2. Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Problem 0 Consider the following LSQ and when operands are

More information

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor 1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic

More information

Multithreaded Processors. Department of Electrical Engineering Stanford University

Multithreaded Processors. Department of Electrical Engineering Stanford University Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread

More information

Is Intel s Hyper-Threading Technology Worth the Extra Money to the Average User?

Is Intel s Hyper-Threading Technology Worth the Extra Money to the Average User? Is Intel s Hyper-Threading Technology Worth the Extra Money to the Average User? Andrew Murray Villanova University 800 Lancaster Avenue, Villanova, PA, 19085 United States of America ABSTRACT In the mid-1990

More information

Simultaneous Multithreading (SMT)

Simultaneous Multithreading (SMT) Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide issue

More information

A Simple Model for Estimating Power Consumption of a Multicore Server System

A Simple Model for Estimating Power Consumption of a Multicore Server System , pp.153-160 http://dx.doi.org/10.14257/ijmue.2014.9.2.15 A Simple Model for Estimating Power Consumption of a Multicore Server System Minjoong Kim, Yoondeok Ju, Jinseok Chae and Moonju Park School of

More information

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2. Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Problem 1 Consider the following LSQ and when operands are

More information

CS 152 Computer Architecture and Engineering. Lecture 18: Multithreading

CS 152 Computer Architecture and Engineering. Lecture 18: Multithreading CS 152 Computer Architecture and Engineering Lecture 18: Multithreading Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste

More information

Parallel Processing SIMD, Vector and GPU s cont.

Parallel Processing SIMD, Vector and GPU s cont. Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP

More information

Multithreading Processors and Static Optimization Review. Adapted from Bhuyan, Patterson, Eggers, probably others

Multithreading Processors and Static Optimization Review. Adapted from Bhuyan, Patterson, Eggers, probably others Multithreading Processors and Static Optimization Review Adapted from Bhuyan, Patterson, Eggers, probably others Schedule of things to do By Wednesday the 9 th at 9pm Please send a milestone report (as

More information

Dynamically Controlled Resource Allocation in SMT Processors

Dynamically Controlled Resource Allocation in SMT Processors Dynamically Controlled Resource Allocation in SMT Processors Francisco J. Cazorla, Alex Ramirez, Mateo Valero Departament d Arquitectura de Computadors Universitat Politècnica de Catalunya Jordi Girona

More information

Chapter-5 Memory Hierarchy Design

Chapter-5 Memory Hierarchy Design Chapter-5 Memory Hierarchy Design Unlimited amount of fast memory - Economical solution is memory hierarchy - Locality - Cost performance Principle of locality - most programs do not access all code or

More information

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example 1 Which is the best? 2 Lecture 05 Performance Metrics and Benchmarking 3 Measuring & Improving Performance (if planes were computers...) Plane People Range (miles) Speed (mph) Avg. Cost (millions) Passenger*Miles

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Hyperthreading Technology

Hyperthreading Technology Hyperthreading Technology Aleksandar Milenkovic Electrical and Computer Engineering Department University of Alabama in Huntsville milenka@ece.uah.edu www.ece.uah.edu/~milenka/ Outline What is hyperthreading?

More information

Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero

Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero The Nineteenth International Conference on Parallel Architectures and Compilation Techniques (PACT) 11-15

More information

15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 20: Main Memory II Prof. Onur Mutlu Carnegie Mellon University Today SRAM vs. DRAM Interleaving/Banking DRAM Microarchitecture Memory controller Memory buses

More information

Lecture 1: Introduction

Lecture 1: Introduction Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline

More information

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt High Performance Systems Group

More information

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Year & Semester : III/VI Section : CSE-1 & CSE-2 Subject Code : CS2354 Subject Name : Advanced Computer Architecture Degree & Branch : B.E C.S.E. UNIT-1 1.

More information

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Ravi Rajwar and Jim Goodman University of Wisconsin-Madison International Symposium on Microarchitecture, Dec. 2001 Funding

More information

Exploring the Effects of Hyperthreading on Scientific Applications

Exploring the Effects of Hyperthreading on Scientific Applications Exploring the Effects of Hyperthreading on Scientific Applications by Kent Milfeld milfeld@tacc.utexas.edu edu Kent Milfeld, Chona Guiang, Avijit Purkayastha, Jay Boisseau TEXAS ADVANCED COMPUTING CENTER

More information

Lecture 14: Multithreading

Lecture 14: Multithreading CS 152 Computer Architecture and Engineering Lecture 14: Multithreading John Wawrzynek Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~johnw

More information

A task migration algorithm for power management on heterogeneous multicore Manman Peng1, a, Wen Luo1, b

A task migration algorithm for power management on heterogeneous multicore Manman Peng1, a, Wen Luo1, b 5th International Conference on Advanced Materials and Computer Science (ICAMCS 2016) A task migration algorithm for power management on heterogeneous multicore Manman Peng1, a, Wen Luo1, b 1 School of

More information

Efficient Evaluation and Management of Temperature and Reliability for Multiprocessor Systems

Efficient Evaluation and Management of Temperature and Reliability for Multiprocessor Systems Efficient Evaluation and Management of Temperature and Reliability for Multiprocessor Systems Ayse K. Coskun Electrical and Computer Engineering Department Boston University http://people.bu.edu/acoskun

More information

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Elec525 Spring 2005 Raj Bandyopadhyay, Mandy Liu, Nico Peña Hypothesis Some modern processors use a prefetching unit at the front-end

More information

Prefetch Threads for Database Operations on a Simultaneous Multi-threaded Processor

Prefetch Threads for Database Operations on a Simultaneous Multi-threaded Processor Prefetch Threads for Database Operations on a Simultaneous Multi-threaded Processor Kostas Papadopoulos December 11, 2005 Abstract Simultaneous Multi-threading (SMT) has been developed to increase instruction

More information

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 18 Multicore Computers

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 18 Multicore Computers William Stallings Computer Organization and Architecture 8 th Edition Chapter 18 Multicore Computers Hardware Performance Issues Microprocessors have seen an exponential increase in performance Improved

More information

Techniques for Efficient Processing in Runahead Execution Engines

Techniques for Efficient Processing in Runahead Execution Engines Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu

More information

Understanding The Effects of Wrong-path Memory References on Processor Performance

Understanding The Effects of Wrong-path Memory References on Processor Performance Understanding The Effects of Wrong-path Memory References on Processor Performance Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt The University of Texas at Austin 2 Motivation Processors spend

More information

Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-742 Fall 2012, Parallel Computer Architecture, Lecture 9: Multithreading

More information

Advanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Advanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Advanced Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Modern Microprocessors More than just GHz CPU Clock Speed SPECint2000

More information

CEC 450 Real-Time Systems

CEC 450 Real-Time Systems CEC 450 Real-Time Systems Lecture 6 Accounting for I/O Latency September 28, 2015 Sam Siewert A Service Release and Response C i WCET Input/Output Latency Interference Time Response Time = Time Actuation

More information

An In-order SMT Architecture with Static Resource Partitioning for Consumer Applications

An In-order SMT Architecture with Static Resource Partitioning for Consumer Applications An In-order SMT Architecture with Static Resource Partitioning for Consumer Applications Byung In Moon, Hongil Yoon, Ilgu Yun, and Sungho Kang Yonsei University, 134 Shinchon-dong, Seodaemoon-gu, Seoul

More information

Integrated CPU and Cache Power Management in Multiple Clock Domain Processors

Integrated CPU and Cache Power Management in Multiple Clock Domain Processors Integrated CPU and Cache Power Management in Multiple Clock Domain Processors Nevine AbouGhazaleh, Bruce Childers, Daniel Mossé & Rami Melhem Department of Computer Science University of Pittsburgh HiPEAC

More information

CSAIL. Computer Science and Artificial Intelligence Laboratory. Massachusetts Institute of Technology

CSAIL. Computer Science and Artificial Intelligence Laboratory. Massachusetts Institute of Technology CSAIL Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Dynamic Cache Partioning for Simultaneous Multithreading Systems Ed Suh, Larry Rudolph, Srinivas Devadas

More information

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

ASSEMBLY LANGUAGE MACHINE ORGANIZATION ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction

More information

SPECULATIVE MULTITHREADED ARCHITECTURES

SPECULATIVE MULTITHREADED ARCHITECTURES 2 SPECULATIVE MULTITHREADED ARCHITECTURES In this Chapter, the execution model of the speculative multithreading paradigm is presented. This execution model is based on the identification of pairs of instructions

More information

CSE 502 Graduate Computer Architecture. Lec 11 Simultaneous Multithreading

CSE 502 Graduate Computer Architecture. Lec 11 Simultaneous Multithreading CSE 502 Graduate Computer Architecture Lec 11 Simultaneous Multithreading Larry Wittie Computer Science, StonyBrook University http://www.cs.sunysb.edu/~cse502 and ~lw Slides adapted from David Patterson,

More information

IBM's POWER5 Micro Processor Design and Methodology

IBM's POWER5 Micro Processor Design and Methodology IBM's POWER5 Micro Processor Design and Methodology Ron Kalla IBM Systems Group Outline POWER5 Overview Design Process Power POWER Server Roadmap 2001 POWER4 2002-3 POWER4+ 2004* POWER5 2005* POWER5+ 2006*

More information

Simultaneous Multithreading and the Case for Chip Multiprocessing

Simultaneous Multithreading and the Case for Chip Multiprocessing Simultaneous Multithreading and the Case for Chip Multiprocessing John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 522 Lecture 2 10 January 2019 Microprocessor Architecture

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,

More information

Dynamic Cache Partitioning for CMP/SMT Systems

Dynamic Cache Partitioning for CMP/SMT Systems Dynamic Cache Partitioning for CMP/SMT Systems G. E. Suh (suh@mit.edu), L. Rudolph (rudolph@mit.edu) and S. Devadas (devadas@mit.edu) Massachusetts Institute of Technology Abstract. This paper proposes

More information

Adaptive Cache Partitioning on a Composite Core

Adaptive Cache Partitioning on a Composite Core Adaptive Cache Partitioning on a Composite Core Jiecao Yu, Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Scott Mahlke Computer Engineering Lab University of Michigan, Ann Arbor, MI {jiecaoyu, lukefahr,

More information

Simultaneous Multithreading: a Platform for Next Generation Processors

Simultaneous Multithreading: a Platform for Next Generation Processors Simultaneous Multithreading: a Platform for Next Generation Processors Paulo Alexandre Vilarinho Assis Departamento de Informática, Universidade do Minho 4710 057 Braga, Portugal paulo.assis@bragatel.pt

More information

Main Points of the Computer Organization and System Software Module

Main Points of the Computer Organization and System Software Module Main Points of the Computer Organization and System Software Module You can find below the topics we have covered during the COSS module. Reading the relevant parts of the textbooks is essential for a

More information

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,

More information

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:

More information

Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers

Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers Microsoft ssri@microsoft.com Santhosh Srinath Onur Mutlu Hyesoon Kim Yale N. Patt Microsoft Research

More information

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor Single-Issue Processor (AKA Scalar Processor) CPI IPC 1 - One At Best 1 - One At best 1 From Single-Issue to: AKS Scalar Processors CPI < 1? How? Multiple issue processors: VLIW (Very Long Instruction

More information

CS 654 Computer Architecture Summary. Peter Kemper

CS 654 Computer Architecture Summary. Peter Kemper CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining

More information

15-740/ Computer Architecture Lecture 10: Runahead and MLP. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 10: Runahead and MLP. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 10: Runahead and MLP Prof. Onur Mutlu Carnegie Mellon University Last Time Issues in Out-of-order execution Buffer decoupling Register alias tables Physical

More information

Thread to Strand Binding of Parallel Network Applications in Massive Multi-Threaded Systems

Thread to Strand Binding of Parallel Network Applications in Massive Multi-Threaded Systems Thread to Strand Binding of Parallel Network Applications in Massive Multi-Threaded Systems Petar Radojković Vladimir Čakarević Javier Verdú Alex Pajuelo Francisco J. Cazorla Mario Nemirovsky Mateo Valero

More information

Balancing Thoughput and Fairness in SMT Processors

Balancing Thoughput and Fairness in SMT Processors Balancing Thoughput and Fairness in SMT Processors Kun Luo Jayanth Gummaraju Manoj Franklin ECE Department Dept of Electrical Engineering ECE Department and UMACS University of Maryland Stanford University

More information

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5)

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5) Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5) ILP vs. Parallel Computers Dynamic Scheduling (Section 3.4, 3.5) Dynamic Branch Prediction (Section 3.3, 3.9, and Appendix C) Hardware

More information

Advanced Processor Architecture

Advanced Processor Architecture Advanced Processor Architecture Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong

More information

Superscalar Processors

Superscalar Processors Superscalar Processors Increasing pipeline length eventually leads to diminishing returns longer pipelines take longer to re-fill data and control hazards lead to increased overheads, removing any a performance

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

CS 152 Computer Architecture and Engineering

CS 152 Computer Architecture and Engineering CS 152 Computer Architecture and Engineering Lecture 19 Advanced Processors III 2006-11-2 John Lazzaro (www.cs.berkeley.edu/~lazzaro) TAs: Udam Saini and Jue Sun www-inst.eecs.berkeley.edu/~cs152/ 1 Last

More information

CSAIL. Computer Science and Artificial Intelligence Laboratory. Massachusetts Institute of Technology

CSAIL. Computer Science and Artificial Intelligence Laboratory. Massachusetts Institute of Technology CSAIL Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Dynamic Cache Partitioning for Simultaneous Multithreading Systems Ed Suh, Larry Rudolph, Srini Devadas

More information