Methods for Modeling Resource Contention on Simultaneous Multithreading Processors


Tipp Moseley, Joshua L. Kihm, Daniel A. Connors, and Dirk Grunwald
Department of Computer Science and Department of Electrical and Computer Engineering, University of Colorado, Boulder, CO

Abstract

Simultaneous multithreading (SMT) seeks to improve the computation throughput of a processor core by sharing primary resources such as functional units, issue bandwidth, and caches. SMT designs increase utilization and generally improve overall throughput, but the amount of improvement is highly dependent on competition for shared resources between the scheduled threads. This variability has implications for operating system scheduling, simulation techniques, and fairness. Existing techniques recognize the implications of thread interaction but do little to profile or predict it. The modeling approach presented in this paper uses data collected from performance counters on two different hardware implementations of Pentium-4 Hyper-Threading processors to demonstrate the effects of thread interaction. Techniques are described for fitting linear regression models and recursive partitioning trees that use the counters to make online predictions of performance (expressed as instructions per cycle); these predictions can be used by the operating system to guide scheduling decisions. A detailed analysis of the effectiveness of each of these techniques is presented.

1. Introduction

By leveraging advances in semiconductor technologies, system developers are exploring the paradigms of System-on-a-Chip (SoC) processors, Chip Multiprocessors (CMP), and Multithreaded (MT) architectures. This evolution dictates that future high-performance systems will integrate tens of multithreaded processor cores on a single chip die, resulting in hundreds of concurrent program threads sharing system resources.
These designs will be the cornerstone not only of high-performance computing and server environments but also of general-purpose and embedded domains. For the optimal design and runtime management of such systems, it is necessary to understand how multiple threads interact when sharing hardware. In order to build systems software (compilers, operating systems, run-time systems) that understands the complete view of multiple cores, it is first necessary to build effective models of multithreaded core execution, which will likely be the basis for multi-core designs. By supporting multiple hardware thread contexts, multithreaded architectures address the growing processor-memory gap by tolerating the memory latencies of individual threads. Several multithreaded processor models have been proposed. Coarse-Grained Multi-Threaded (CGMT) [1] processors issue instructions from a single thread each cycle and switch between threads on long-latency instructions, such as cache misses, or on definable time intervals. Alternate hardware thread contexts can perform useful work, increasing throughput, where a single thread would stall the processor. IBM's PowerPC RS64-IV [4] is a commercial implementation of a coarse-grained multithreading processor. In general, since such systems switch between threads for distinct intervals, contention is limited to longer-lifetime resources such as caches and branch predictors. On the other hand, Simultaneous Multithreading (SMT) [14][24][23] processors share the resources (ALUs, branch target buffers, caches, etc.) of one physical processor between multiple virtual processors that execute simultaneously each cycle. The SMT design is intended to have a low design overhead for out-of-order processors, allowing it to be added to existing processor designs without significant cost.
It is estimated that adding SMT support to the Compaq Alpha EV8 processor required only an additional 5% of die area, and researchers at Intel found similar costs for their implementation of SMT, called Hyper-Threading [15]. The major trade-off of adapting existing microarchitecture designs to handle simultaneous threads is that processor efficiency becomes more directly coupled to individual thread characteristics and inter-thread resource contention. As such, it is critical to develop accurate models for the systems software of SMT processors. The IBM POWER5 [11] consists of two two-way SMT cores that share a single L2 cache, resulting in a total of four thread contexts per die and very complex thread interactions. The most commonly available SMT processor is the Intel Pentium-4 processor with Hyper-Threading [9]. Hyper-Threading is technically similar to the SMT designs described in the research literature, although it has unique characteristics. In particular, many resources, such as the cache system, microarchitectural registers, and execution units, are shared between virtual processors as they are in SMT. Other resources, such as the re-order and load/store buffers, are partitioned, and some, including the instruction translation look-aside buffer (ITLB), are duplicated for each virtual processor. When running a conventional operating system on a Pentium-4 with Hyper-Threading enabled, each physical processor appears to the operating system as two distinct processors, and the base operating system does not need detailed knowledge that certain processors are in fact logical processors.

Figure 1. Level 2 cache misses per instruction for 176.gcc when run alone and paired with 179.art.

Employing SMT in hardware generally has considerable benefit, increasing both utilization and throughput. However, increased utilization comes at a cost; threads may compete for resources that simply are not abundant enough to be shared.
For example, Figure 1 shows the number of L2 cache misses per instruction retired (miss frequency) for the SPEC CPU2000 [22] benchmark 176.gcc when run alone and when run coscheduled on a Hyper-Threading processor with 179.art, over a period of 15 billion operations of gcc. The figure shows that for a significant portion of execution, the L2 miss frequency is considerably higher when paired than when run alone (the regions of the graph where alone and paired are about the same are most likely due to high numbers of compulsory misses in certain program phases). Since such interactions can occur on any resource that is shared, it is necessary to be able to accurately predict how competition for a set of shared resources can affect overall performance. As multithreaded multi-core systems emerge, it becomes increasingly important for operating systems to be aware of application behavior to assess job-scheduling opportunities and ensure fair access to resources. Rather than a set of ad hoc heuristics, the operating system should use a model of the processor architecture that uses program properties to provide insight into how and which applications should be co-scheduled. Deriving such a model can be complex, because it may require considerable insight into the internal machine organization. However, such deep insight may be a detriment: oftentimes, intuition is not validated by experimental data, and assumptions about component interactions may not hold in practice. For example, Snavely [19] felt that mixing integer and floating-point applications would be a good scheduling heuristic; while intuitively appealing, this would only be true if the functional units were the performance bottleneck for the processor. Ideally, a model should be automatically derivable using on-line measurement of programs; at a minimum, the model must capture the unique characteristics of the machine that influence program performance.
In this paper, we evaluate two mechanisms to automate the process of deriving a machine model for SMT processors. We are interested in characterizing thread interaction on two real processors; since instructions per cycle (IPC) summarizes aggregate performance, we concern ourselves with how online measurements of program properties can be used to predict IPC. The goal of this work is to determine what kinds of models provide good prediction and whether the models for different processors are similar or different. Use of these predictions to influence scheduling decisions in an actual OS was previously presented [16]. Although this evaluation is for SMT processors, the techniques are applicable to SMT, CGMT, and multi-core designs. We show that statistical prediction tools can predict IPC with good accuracy; for example, across both processors we evaluate, roughly 60% of the predicted IPCs are within 20% of the actual IPC. We also show that it is important to model specific processors: a model derived for one processor design is a poor predictor for a different processor design. These results indicate not only that formal statistical models are useful for predicting IPC, both by simplifying the model-building process and by improving prediction accuracy, but also that processor-specific models are essential.

The rest of this paper is organized as follows. Section 2 discusses related work in the analysis of SMT processors. Section 3 gives an overview of constructing an accurate SMT resource model using hardware counter information. Section 4 presents and compares several models derived from the Pentium-4. Section 5 summarizes the results and discusses applications of this method in future work.

2. Related Work

Since contention on shared resources can cause variations in throughput on a multithreaded system, operating systems have a direct role in the performance of such machines. An operating system that is aware of the underlying processor configuration can combine this information with runtime characteristics of a thread to adapt the process schedule to increase throughput. The work on thread symbiosis by Snavely et al. [20, 21] presents a method for scheduling threads on an SMT processor when there are more runnable threads than contexts. The Sample-Optimize-Symbios (SOS) scheduler performs as its name suggests. In the sample phase, data is collected. Following this, an optimized schedule is calculated based on performance counter attributes recorded during the sample phase. A number of predictors are suggested; however, they are all ad hoc heuristics based on intuition or knowledge of the microarchitecture (e.g., using data cache miss rate or IPC to indicate likely pairings). Finally, in the symbios stage, groups of jobs that are predicted to fare well are scheduled concurrently.
By comparison, this paper presents a methodical, statistical model that can be used to derive a performance predictor. Chandra et al. [6] introduce a model related to our work for mathematically predicting cache interference based on data re-use distances. Moseley et al. [16] describe various techniques for leveraging hardware performance monitoring in dynamic optimization and scheduling. Scheduling decisions are made by sampling performance counters and using a linear model to decide which factors contribute the most to thread interference. Bulpin [5] uses a similar technique to profile threads, applying a linear model to several potential thread combinations to plan which threads to coschedule. Scheduling decisions based on proposed monitoring of fine-grained cache behavior in SMT processors are explored in [13]. Grunwald et al. [8] study microarchitectural denial of service by measuring how malicious threads degrade the throughput of the threads they are paired with on SMT. Van Biesbrouck et al. [3] present methods for using SimPoint [17] to guide simulation for SMT processors. Co-phase effects of multiple simultaneous threads and the observed thread interactions are further explored in [12]. In [7], a model is presented for identifying and quantifying interaction costs in a processor, which can be used to identify performance bottlenecks and focus optimization effort. Although these techniques target single-threaded, superscalar architectures, they can be applied to multithreaded environments.

3. Methodology

In statistics, a regression model is used to predict the change in a response variable given the values of a number of factors, or explanatory variables. Any number of explanatory variables may be used in a regression model, but a good regression model should have low error (meaning it makes good predictions) and, from a practical perspective, use the smallest number of explanatory variables needed.
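To make this trade-off between error and the number of explanatory variables concrete, one simple approach is to rank candidate variables by how much of the response variance each explains on its own and keep only the strongest few. The sketch below illustrates the idea on synthetic data; the counter names are borrowed from Table 2, but the data and the ranking procedure are illustrative assumptions, not the paper's actual selection method.

```python
import numpy as np

def single_variable_r2(x, y):
    """R^2 of the one-variable least-squares fit y ~ a*x + b."""
    a, b = np.polyfit(x, y, 1)
    resid = y - (a * x + b)
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)
n = 500
# Synthetic per-cycle counter rates (names borrowed from Table 2).
counters = {
    "B": rng.uniform(0.0, 0.2, n),      # retired branches per cycle
    "L2M": rng.uniform(0.0, 0.01, n),   # L2 misses per cycle
    "FPUOP": rng.uniform(0.0, 0.3, n),  # FP uops per cycle (no effect here)
}
# Synthetic IPC driven mostly by branches, somewhat by L2 misses.
ipc = 1.5 - 3.0 * counters["B"] - 40.0 * counters["L2M"] + rng.normal(0, 0.05, n)

# Rank counters by explained IPC variance; a parsimonious model keeps the top few.
ranking = sorted(counters, key=lambda c: single_variable_r2(counters[c], ipc),
                 reverse=True)
print(ranking)
```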
In order to perform on-line predictions, we wanted to use different regression models to predict IPC. Since an operating system mainly selects which programs to co-schedule, the explanatory variables that make the most sense to monitor are the individual actions of programs. Though many other factors, such as temperature or voltage levels, may influence processor performance, they were not taken into consideration in this work. In practice, we want to use the past explanatory values of two (arbitrarily selected) programs to predict the IPC when those two programs are run concurrently on the same processor; in other words, given vectors x_{t-1} and y_{t-1} of explanatory values for one time period, we want to predict the aggregate IPC for the next time period (i.e., IPC_t = par(x_{t-1}, y_{t-1})). In order to do this, a set of explanatory values needs to be selected; we use the processor performance counters for this purpose. In addition, a way to automatically select a regression function is needed; we consider two mechanisms: linear regression and recursively partitioned decision trees.

3.1. Experimental Configuration

To demonstrate the need for a method to easily derive models for thread interaction, experiments are duplicated on two significantly different models of the Pentium-4 processor. The Northwood is the second-generation Hyper-Threading system, and the Nocona is based on the third-generation Prescott architecture. Each physical processor has two logical contexts from which instructions are issued

simultaneously. Table 1 shows the primary differences between the two experimental configurations. Experiments were conducted using a Linux kernel modified to support the collection and logging of hardware performance counters. Pair-wise combinations of benchmarks from the SPEC CPU2000 benchmark suite were evaluated using the reference input set. The performance counters were sampled every 25 million cycles and each time the operating system scheduler was invoked (generally every 100 ms in Linux 2.6). A 25-million-cycle sample period was chosen because it is small enough to isolate specific phases in program behavior, yet large enough not to cause significant overhead (less than 1%). Table 2 contains the list of performance counters that were collected. The Pentium-4 processor is equipped with 18 performance counters that can be configured to count hundreds of different events.

                            Northwood    Nocona
  Frequency                 2.53 GHz     3.4 GHz
  Issue Width               3            4
  Pipeline Stages           20           31
  L1 Dcache Size            8 kB         16 kB
  L1 Dcache Associativity   4-way        8-way
  L1 Dcache Latency         2 cycles     4 cycles
  Trace Cache               12k µops     16k µops
  L2 Cache Size             512 kB       1024 kB
  L2 Cache Associativity    8-way        8-way
  L2 Cache Latency          7 cycles     11 cycles
  FSB                       400 MHz      800 MHz
  Memory                    768 MB       2 GB

Table 1. Architectural differences between the Northwood and Nocona models of the Intel Pentium-4 microprocessors used for experiments. In addition, Nocona features a 64-bit extension to the IA-32 architecture: Intel Extended Memory 64 Technology (EM64T).

  Set 1: B (retired branches), TCLM (trace cache lookup misses), L2M (second-level cache misses), FPUOP (retired floating-point µops), I (instructions)
  Set 2: L2H (second-level cache hits), RBMP (retired branch mispredictions), ITLBH (instruction TLB hits), I (instructions)
  Set 3: ITLBM (instruction TLB misses), DTLBM (data TLB misses), CPC (pipeline clears), I (instructions)

Table 2. Performance metrics recorded for application characterization. Metrics/events are normalized to per-cycle event counts.
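The per-cycle normalization applied to the metrics in Table 2 can be sketched as follows. The sample layout and field names here are illustrative assumptions, not the actual kernel interface used in the experiments: two consecutive raw counter snapshots are differenced and divided by the elapsed cycle count.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    """Raw counter totals captured at one sampling point (hypothetical layout)."""
    cycles: int
    branches: int
    l2_misses: int
    fp_uops: int
    instructions: int

def per_cycle_events(prev: Sample, curr: Sample) -> dict:
    """Convert two consecutive raw samples into per-cycle event rates,
    the normalization used for the Table 2 metrics."""
    dc = curr.cycles - prev.cycles
    if dc <= 0:
        raise ValueError("non-increasing cycle count between samples")
    return {
        "B": (curr.branches - prev.branches) / dc,
        "L2M": (curr.l2_misses - prev.l2_misses) / dc,
        "FPUOP": (curr.fp_uops - prev.fp_uops) / dc,
        "IPC": (curr.instructions - prev.instructions) / dc,
    }

# Two samples taken 25 million cycles apart, matching the sampling period.
s0 = Sample(cycles=0, branches=0, l2_misses=0, fp_uops=0, instructions=0)
s1 = Sample(cycles=25_000_000, branches=2_500_000, l2_misses=50_000,
            fp_uops=1_000_000, instructions=20_000_000)
rates = per_cycle_events(s0, s1)
print(rates["IPC"])  # 20M instructions over 25M cycles -> 0.8
```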
With Hyper-Threading enabled, however, the performance counters are shared between the two logical processors. To count the same events from both contexts simultaneously, the number of counters is reduced to 9 per logical processor. To further complicate matters, there are complex rules detailing which counter configuration registers can be associated with specific counters, and two counters must be allocated to count instructions. This makes it impossible to monitor some combinations of counters, such as L1 misses and L2 misses, for two threads simultaneously. Given these inherent limitations of sampling on the Pentium-4, Table 2 is divided into three sets. Counters were chosen in an attempt to represent all resources that could be a source of contention between threads. However, not all counters represent a resource that is directly shared. For example, the ITLB is duplicated per thread on both processor models, but high ITLB miss rates could correspond to greater pressure on the trace cache. Initially, we used multiple runs of applications to sample many different counters; we then used an analysis of variance to determine which counters contributed most to the IPC variation. These counters monitor hardware resources where threads could interfere with each other; more prior knowledge of the microarchitecture could eliminate this step. The first set of counters shown caused the most variation in IPC; the other two sets caused less. Since only one set of counters can be used at a time, the remainder of this paper uses only the first set of counters.

3.2. Multiple Linear Regression

Multiple linear regression is a statistical technique that attempts to describe a single response variable (RV) as a linear sum of two or more explanatory variables (EV), e.g., RV = EV_1 + EV_2 + ... + EV_n. Typically, the coefficients of the model are chosen such that they minimize the mean square error between the prediction and a set of observed data.
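A minimal least-squares fit of this form can be sketched in Python; the data here is synthetic (standing in for the measured counter rates), and the coefficients and noise level are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
# Synthetic explanatory variables (stand-ins for per-cycle counter rates).
EV = rng.uniform(0.0, 0.2, size=(n, 3))
true_coef = np.array([-2.0, -8.0, 1.5])
# Synthetic response: a linear combination of the EVs plus noise.
ipc = 1.2 + EV @ true_coef + rng.normal(0, 0.02, n)

# Ordinary least squares for RV = b0 + b1*EV_1 + b2*EV_2 + b3*EV_3,
# minimizing the mean square error over the observed data.
X = np.column_stack([np.ones(n), EV])
coef, *_ = np.linalg.lstsq(X, ipc, rcond=None)
mse = float(np.mean((ipc - X @ coef) ** 2))
```

Interaction between factors can be modeled the same way by appending products of EV columns (EV_1*EV_2, and so on) as extra columns of X before solving.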
Using the statistical package R [18], we apply multiple linear regression to model IPC as the RV using each of the three sets of counters. Additionally, R allows for interaction between the individual factors using standard analysis-of-variance techniques; interactions are modeled by the regression model RV = EV_1 + EV_2 + ... + EV_n + EV_1*EV_2 + EV_1*EV_3 + ... + EV_1*EV_2*...*EV_n. Interactions can be quite strong in some cases.

3.3. Recursive Partitioning

Recursive partitioning is a methodology for automatically developing a decision tree to partition data sets based on a training set. Tree-based models are commonly used in the fields of biology and artificial intelligence to automatically encapsulate knowledge and make it more usable. Instead of trying to group data into bins, we assign an IPC prediction to each leaf of the decision tree. By making the decisions based on observed counter values, the multithreaded IPC can be predicted in this manner. The rpart [2] package for the R language for statistical computing is a powerful yet easy-to-use classification tool. Since IPC is a continuous variable, we use the anova mode of classification. In this method, each time the tree is split during construction, the result is a reduction in the residual sum of squares (thereby reducing overall error).

4. Results

4.1. Recursive Partitioning Decision Trees

Figures 2 and 3 show the recursive partitioning decision trees for counter Set 1 for both processors evaluated. Each tree illustrates the monitored processor events (explanatory variables) as the interior nodes and the predicted IPC (response variable) as the leaf nodes. The events of counter Set 1 that are dominant factors are L2M (L2 cache misses), B (branches), and FPUOP (floating-point unit operations). Each event value is normalized to be a per-cycle event count. In the recursive partitioning algorithm, splits that occur higher in the tree are the most important factors in the model. Northwood splits first on branches, then on floating-point operations and L2 misses, but the Nocona model is more focused on L2 misses and branches. There are a number of interesting aspects found in comparing the Northwood and Nocona decision trees. First, Nocona actually has a bigger L2 cache.
However, both the L1 and L2 caches on Nocona are slower (by 2 and 4 cycles, respectively) and have half the L2 cache bandwidth [9][10]. Since the Nocona is based on the Prescott architecture, which was designed for operation at much higher frequencies than were actually achieved, the slower caches make sense from a design perspective, but they hinder performance at lower clock speeds. Likewise, reports [9][10] indicate that one of the additions in the Prescott line, on which Nocona is based, is an improved branch predictor. This would explain why the number of branches (B) was not as dominant in the decision tree of the Nocona as it was for the Northwood. Overall, this data supports the need for machine- and processor-specific models, as two reasonably similar processors yield models that are quite different.

Figure 4. Comparison of percentage of samples falling within error bounds (cumulative distribution) for recursive partitioning and multiple linear regression models based on the first set of performance counters for the Northwood and Nocona processors.

4.2. Comparison of Prediction Accuracy

Since there is not a standard method for comparing the two types of models, each model is cross-validated against a different set of samples. This proves to be more intuitive both for comparing the models and for understanding how well an individual model performs. The cumulative distributions of the samples across error bounds are shown in Figure 4 for counter Set 1 for each of the processors tested. The horizontal axis is the error threshold, defined by how far the predicted value is from the actual value. The vertical axis is the percentage of samples within an error threshold. For example, for Northwood, the linear models predict about 60% of samples within 20% of their actual IPC, while the decision tree only predicts about 42% of samples within 20% of their actual IPC.

Figure 2. A decision tree derived from counters in Set 1 for Northwood to predict total IPC.

Figure 3. A decision tree derived from counters in Set 1 for Nocona to predict total IPC.

In Figure 4, higher values indicate better prediction, and curves that increase more steeply indicate that more samples are predicted with higher accuracy. The models for counter Set 1 are fairly good predictors of IPC. However, the models derived for the other counter sets (Set 2 and Set 3) were quite poor; the R^2 values ranged from 0.03 to 0.51, both of which are unacceptable. Therefore, for the sake of space, only results from counter Set 1 are discussed. The results show that taking interaction between factors into account performs only slightly better than not. For the Northwood processor, the linear models greatly outperform the recursive partitioning algorithm, but for the Nocona the performance is almost identical, with a slight edge to the decision tree algorithm; this is mostly due to the increased accuracy of the decision tree on Nocona. Although the decision tree for Northwood is not as good a predictor as a linear model, it is still good enough to be considered worthwhile. Prediction accuracy aside, the interesting result of these models is the difference in importance of factors between machines.

The SPEC CPU2000 benchmark suite includes a wide variety of applications, so it may not accurately represent the typical workload of a single system. This makes it desirable to tailor a model to a specific workload in addition to the features of the processor. In Figure 5, models are derived separately for the SPEC INT and SPEC FP benchmarks using counter Set 1 on Nocona.
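The anova-mode regression tree and the error-threshold curves of Figure 4 can both be illustrated with a small from-scratch sketch. This is not rpart itself, only a minimal reimplementation of the same idea (split where the residual sum of squares drops the most, predict the leaf mean), run on synthetic data and cross-validated on held-out samples.

```python
import numpy as np

def fit_tree(X, y, depth, min_leaf=25):
    """Recursively partition (X, y): at each node, choose the split that most
    reduces the residual sum of squares (as in rpart's anova mode); each leaf
    stores the mean IPC of its samples."""
    if depth == 0 or len(y) <= 2 * min_leaf:
        return float(y.mean())
    best = None  # (score, feature index, threshold)
    for j in range(X.shape[1]):
        order = np.argsort(X[:, j])
        xs, ys = X[order, j], y[order]
        csum = np.cumsum(ys)
        total = csum[-1]
        for i in range(min_leaf, len(ys) - min_leaf):
            left, n_l, n_r = csum[i - 1], i, len(ys) - i
            # Post-split SSE up to an additive constant: lower is better.
            score = -(left ** 2) / n_l - (total - left) ** 2 / n_r
            if best is None or score < best[0]:
                best = (score, j, (xs[i - 1] + xs[i]) / 2)
    _, j, thr = best
    mask = X[:, j] <= thr
    return (j, thr,
            fit_tree(X[mask], y[mask], depth - 1, min_leaf),
            fit_tree(X[~mask], y[~mask], depth - 1, min_leaf))

def predict_one(node, x):
    """Walk the tree to a leaf and return its IPC prediction."""
    while isinstance(node, tuple):
        j, thr, lo, hi = node
        node = lo if x[j] <= thr else hi
    return node

rng = np.random.default_rng(2)
n = 1000
X = rng.uniform(0.0, 0.2, size=(n, 3))  # synthetic B, L2M, FPUOP rates
ipc = 1.5 - 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0, 0.05, n)

tree = fit_tree(X[:500], ipc[:500], depth=4)  # train on half the samples
pred = np.array([predict_one(tree, x) for x in X[500:]])
actual = ipc[500:]

# Fraction of cross-validation samples within each relative error threshold:
# one point per threshold on a Figure 4-style cumulative curve.
fracs = {t: float(np.mean(np.abs(pred - actual) / actual <= t))
         for t in (0.05, 0.10, 0.20)}
print(fracs)
```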
The accuracy curve for FP alone is slightly better than in Figure 4, and in this case recursive partitioning is more competitive with the linear models. The results for the integer set of benchmarks come as a surprise; the prediction accuracy is significantly higher than for the FP or combined models (over 80% of samples are predicted within 20% of their actual IPC). This could mean either that the integer benchmarks are intrinsically more predictable or that some metric of the floating-point benchmarks is not being effectively captured by the model.

Figure 5. Evaluation of models created for Nocona using the benchmark subsets SPEC INT and SPEC FP to train.

Both linear regression and recursive partitioning perform well as models for both processors. Given the similarity in architecture, it may also seem safe to assume that a model for one processor will perform suitably as a model for another. Figure 6 shows the results when applying the Nocona models to samples from the Northwood processor. Each of the incorrect models performs about the same relative to the others when applied to samples from the wrong processor, but all are far less accurate (predicting less than 20% of the samples within 20% of the actual value) than the processor-specific models. This result shows that if processors are to be modeled using the methods described above, even small changes in architecture are significant enough to require a unique model.

Figure 6. Comparison of percentage of samples falling within error bounds (cumulative distribution) with the Nocona models applied to Northwood.

5. Conclusion

This paper approaches the problem of resource contention on SMT processors. We apply two techniques, linear modeling and recursive partitioning, to performance counters collected from two real SMT processors. The results show that information from a small number of available performance counters is highly predictive of IPC when using either method. In addition to comparing across generations of Intel Hyper-Threading processors, our future work includes comparing these modeling techniques on other multithreaded and multicore architectures, such as the IBM POWER5 and the next-generation Intel Itanium-2 (Montecito).
The complexity of this work lies not in the architectural differences between processors, but in the available performance counters and how they are accessed. Additionally, work is underway using interpolation techniques to combine data from multiple experiments with performance counters that cannot be used together (e.g., L1 data cache and L2 cache misses) in order to construct even more accurate models.

References

[1] A. Agarwal, J. Kubiatowicz, D. Kranz, B. H. Lim, D. Yeung, G. D'Souza, and M. Parkin. Sparcle: An evolutionary processor design for large-scale multiprocessors. IEEE Micro, 13(3):48-61, 1993.
[2] E. J. Atkinson and T. M. Therneau. An introduction to recursive partitioning. Technical report, Mayo Foundation, Feb.
[3] M. V. Biesbrouck, T. Sherwood, and B. Calder. A co-phase matrix to guide simultaneous multithreading simulation. In Proceedings of the 2004 International Symposium on Performance Analysis of Systems and Software, May 2004.
[4] J. M. Borkenhagen, R. J. Eickemeyer, R. N. Kalla, and S. R. Kunkel. A multithreaded PowerPC processor for commercial servers. IBM Journal of Research and Development, 44(6), November 2000.
[5] J. Bulpin. Operating System Support for Simultaneous Multithreaded Processors. PhD thesis, University of Cambridge, Cambridge, UK, Feb.
[6] D. Chandra, F. Guo, S. Kim, and Y. Solihin. Predicting inter-thread cache contention on a chip multi-processor architecture. In Proceedings of the 11th International Symposium on High Performance Computer Architecture (HPCA), February 2005.
[7] B. A. Fields, R. Bodík, M. D. Hill, and C. J. Newburn. Using interaction costs for microarchitectural bottleneck analysis. In MICRO, 2003.
[8] D. Grunwald and S. Ghiasi. Microarchitectural denial of service: Insuring microarchitectural fairness. In Proceedings of the 35th Annual ACM/IEEE International Symposium on Microarchitecture. IEEE Computer Society Press, 2002.
[9] Intel Corporation. Special issue on Intel Hyper-Threading in Pentium-4 processors. Intel Technology Journal, 1(1), January.
[10] Intel Corporation. The microarchitecture of the Intel Pentium-4 processor on 90nm technology. Intel Technology Journal, 1(1), February.
[11] R. N. Kalla, B. Sinharoy, and J. M. Tendler. IBM POWER5 chip: A dual-core multithreaded processor. IEEE Micro, 24(2):40-47, 2004.
[12] J. L. Kihm, T. Moseley, and D. A. Connors. A mathematical model for accurately balancing co-phase effects in simulated multithreaded systems. In Proceedings of the ISCA Workshop on Modeling, Benchmarking, and Simulation (MoBS), June 2005.
[13] J. L. Kihm, A. Settle, A. Janiszewski, and D. A. Connors. Understanding the impact of inter-thread cache interference on ILP in modern SMT processors. Journal of Instruction Level Parallelism, 7(2), 2005.
[14] V. Krishnan and J. Torrellas. A chip-multiprocessor architecture with speculative multithreading. IEEE Transactions on Computers, 48(9), 1999.
[15] D. T. Marr, F. Binns, D. L. Hill, G. Hinton, D. A. Koufaty, J. A. Miller, and M. Upton. Hyper-Threading technology architecture and microarchitecture. Intel Technology Journal, 6(1):4-15, February 2002.
[16] T. Moseley, A. Shye, V. J. Reddi, M. Iyer, D. Fay, D. Hodgdon, J. L. Kihm, A. Settle, D. Grunwald, and D. A. Connors. Dynamic run-time architecture techniques for enabling continuous optimization. In Proceedings of the 2005 International Conference on Computing Frontiers, May 2005.
[17] E. Perelman, G. Hamerly, M. V. Biesbrouck, T. Sherwood, and B. Calder. Using SimPoint for accurate and efficient simulation. In SIGMETRICS, 2003.
[18] R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
[19] A. Snavely and D. M. Tullsen. Symbiotic jobscheduling for a simultaneous multithreaded processor. In Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems. ACM Press, 2000.
[20] A. Snavely and D. M. Tullsen. Symbiotic jobscheduling for a simultaneous multithreading processor. In Architectural Support for Programming Languages and Operating Systems, 2000.
[21] A. Snavely, D. M. Tullsen, and G. Voelker. Symbiotic jobscheduling with priorities for a simultaneous multithreading processor. In Proceedings of the 2002 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems. ACM Press, 2002.
[22] Standard Performance Evaluation Corporation. The SPEC CPU 2000 benchmark suite, 2000.
[23] D. M. Tullsen, S. J. Eggers, and H. M. Levy. Simultaneous multithreading: Maximizing on-chip parallelism. In 22nd Annual International Symposium on Computer Architecture, June 1995.
[24] D. M. Tullsen, J. L. Lo, S. J. Eggers, and H. M. Levy. Supporting fine-grained synchronization on a simultaneous multithreading processor. In International Symposium on Architectural Support for Programming Languages and Operating Systems, pages 54-58, 2000.


More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

Prefetch Threads for Database Operations on a Simultaneous Multi-threaded Processor

Prefetch Threads for Database Operations on a Simultaneous Multi-threaded Processor Prefetch Threads for Database Operations on a Simultaneous Multi-threaded Processor Kostas Papadopoulos December 11, 2005 Abstract Simultaneous Multi-threading (SMT) has been developed to increase instruction

More information

Beyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy

Beyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy EE482: Advanced Computer Organization Lecture #13 Processor Architecture Stanford University Handout Date??? Beyond ILP II: SMT and variants Lecture #13: Wednesday, 10 May 2000 Lecturer: Anamaya Sullery

More information

Simultaneous Multithreading on Pentium 4

Simultaneous Multithreading on Pentium 4 Hyper-Threading: Simultaneous Multithreading on Pentium 4 Presented by: Thomas Repantis trep@cs.ucr.edu CS203B-Advanced Computer Architecture, Spring 2004 p.1/32 Overview Multiple threads executing on

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

One-Level Cache Memory Design for Scalable SMT Architectures

One-Level Cache Memory Design for Scalable SMT Architectures One-Level Cache Design for Scalable SMT Architectures Muhamed F. Mudawar and John R. Wani Computer Science Department The American University in Cairo mudawwar@aucegypt.edu rubena@aucegypt.edu Abstract

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

Multithreaded Processors. Department of Electrical Engineering Stanford University

Multithreaded Processors. Department of Electrical Engineering Stanford University Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread

More information

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor 1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic

More information

Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-742 Fall 2012, Parallel Computer Architecture, Lecture 9: Multithreading

More information

An In-order SMT Architecture with Static Resource Partitioning for Consumer Applications

An In-order SMT Architecture with Static Resource Partitioning for Consumer Applications An In-order SMT Architecture with Static Resource Partitioning for Consumer Applications Byung In Moon, Hongil Yoon, Ilgu Yun, and Sungho Kang Yonsei University, 134 Shinchon-dong, Seodaemoon-gu, Seoul

More information

Kaisen Lin and Michael Conley

Kaisen Lin and Michael Conley Kaisen Lin and Michael Conley Simultaneous Multithreading Instructions from multiple threads run simultaneously on superscalar processor More instruction fetching and register state Commercialized! DEC

More information

Hyperthreading Technology

Hyperthreading Technology Hyperthreading Technology Aleksandar Milenkovic Electrical and Computer Engineering Department University of Alabama in Huntsville milenka@ece.uah.edu www.ece.uah.edu/~milenka/ Outline What is hyperthreading?

More information

32 Hyper-Threading on SMP Systems

32 Hyper-Threading on SMP Systems 32 Hyper-Threading on SMP Systems If you have not read the book (Performance Assurance for IT Systems) check the introduction to More Tasters on the web site http://www.b.king.dsl.pipex.com/ to understand

More information

A Comparison of Capacity Management Schemes for Shared CMP Caches

A Comparison of Capacity Management Schemes for Shared CMP Caches A Comparison of Capacity Management Schemes for Shared CMP Caches Carole-Jean Wu and Margaret Martonosi Princeton University 7 th Annual WDDD 6/22/28 Motivation P P1 P1 Pn L1 L1 L1 L1 Last Level On-Chip

More information

Compatible Phase Co-Scheduling on a CMP of Multi-Threaded Processors Ali El-Moursy, Rajeev Garg, David H. Albonesi and Sandhya Dwarkadas

Compatible Phase Co-Scheduling on a CMP of Multi-Threaded Processors Ali El-Moursy, Rajeev Garg, David H. Albonesi and Sandhya Dwarkadas Compatible Phase Co-Scheduling on a CMP of Multi-Threaded Processors Ali El-Moursy, Rajeev Garg, David H. Albonesi and Sandhya Dwarkadas Depments of Electrical and Computer Engineering and of Computer

More information

LATENCY-AWARE WRITE BUFFER RESOURCE CONTROL IN MULTITHREADED CORES

LATENCY-AWARE WRITE BUFFER RESOURCE CONTROL IN MULTITHREADED CORES LATENCY-AWARE WRITE BUFFER RESOURCE CONTROL IN MULTITHREADED CORES Shane Carroll and Wei-Ming Lin Department of Electrical and Computer Engineering, The University of Texas at San Antonio, San Antonio,

More information

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji Beyond ILP Hemanth M Bharathan Balaji Multiscalar Processors Gurindar S Sohi Scott E Breach T N Vijaykumar Control Flow Graph (CFG) Each node is a basic block in graph CFG divided into a collection of

More information

CS 152 Computer Architecture and Engineering. Lecture 18: Multithreading

CS 152 Computer Architecture and Engineering. Lecture 18: Multithreading CS 152 Computer Architecture and Engineering Lecture 18: Multithreading Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste

More information

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

Module 18: TLP on Chip: HT/SMT and CMP Lecture 39: Simultaneous Multithreading and Chip-multiprocessing TLP on Chip: HT/SMT and CMP SMT TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012

More information

Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research

Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research Joel Hestness jthestness@uwalumni.com Lenni Kuff lskuff@uwalumni.com Computer Science Department University of

More information

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 250P: Computer Systems Architecture Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Simultaneous Multithreading (SMT)

Simultaneous Multithreading (SMT) #1 Lec # 2 Fall 2003 9-10-2003 Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing

More information

EECS 470. Lecture 18. Simultaneous Multithreading. Fall 2018 Jon Beaumont

EECS 470. Lecture 18. Simultaneous Multithreading. Fall 2018 Jon Beaumont Lecture 18 Simultaneous Multithreading Fall 2018 Jon Beaumont http://www.eecs.umich.edu/courses/eecs470 Slides developed in part by Profs. Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi,

More information

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004 ABSTRACT Title of thesis: STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS Chungsoo Lim, Master of Science, 2004 Thesis directed by: Professor Manoj Franklin Department of Electrical

More information

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor Single-Issue Processor (AKA Scalar Processor) CPI IPC 1 - One At Best 1 - One At best 1 From Single-Issue to: AKS Scalar Processors CPI < 1? How? Multiple issue processors: VLIW (Very Long Instruction

More information

Lecture 14: Multithreading

Lecture 14: Multithreading CS 152 Computer Architecture and Engineering Lecture 14: Multithreading John Wawrzynek Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~johnw

More information

The Intel move from ILP into Multi-threading

The Intel move from ILP into Multi-threading The Intel move from ILP into Multi-threading Miguel Pires Departamento de Informática, Universidade do Minho Braga, Portugal migutass@hotmail.com Abstract. Multicore technology came into consumer market

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1) Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Simultaneous Multithreading (SMT)

Simultaneous Multithreading (SMT) Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1996 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide issue

More information

ILP Ends TLP Begins. ILP Limits via an Oracle

ILP Ends TLP Begins. ILP Limits via an Oracle ILP Ends TLP Begins Today s topics: Explore a perfect machine unlimited budget to see where ILP goes answer: not far enough Look to TLP & multi-threading for help everything has it s issues we ll look

More information

Simultaneous Multithreading: a Platform for Next Generation Processors

Simultaneous Multithreading: a Platform for Next Generation Processors Simultaneous Multithreading: a Platform for Next Generation Processors Paulo Alexandre Vilarinho Assis Departamento de Informática, Universidade do Minho 4710 057 Braga, Portugal paulo.assis@bragatel.pt

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections ) Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14) 1 ILP Limits The perfect processor: Infinite registers (no WAW or WAR hazards) Perfect branch direction and target

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Based on papers by: A.Fedorova, M.Seltzer, C.Small, and D.Nussbaum Pisa November 6, 2006 Multithreaded Chip

More information

Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors

Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors Wenun Wang and Wei-Ming Lin Department of Electrical and Computer Engineering, The University

More information

Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review

Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Bijay K.Paikaray Debabala Swain Dept. of CSE, CUTM Dept. of CSE, CUTM Bhubaneswer, India Bhubaneswer, India

More information

A Framework for Providing Quality of Service in Chip Multi-Processors

A Framework for Providing Quality of Service in Chip Multi-Processors A Framework for Providing Quality of Service in Chip Multi-Processors Fei Guo 1, Yan Solihin 1, Li Zhao 2, Ravishankar Iyer 2 1 North Carolina State University 2 Intel Corporation The 40th Annual IEEE/ACM

More information

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating

More information

Ron Kalla, Balaram Sinharoy, Joel Tendler IBM Systems Group

Ron Kalla, Balaram Sinharoy, Joel Tendler IBM Systems Group Simultaneous Multi-threading Implementation in POWER5 -- IBM's Next Generation POWER Microprocessor Ron Kalla, Balaram Sinharoy, Joel Tendler IBM Systems Group Outline Motivation Background Threading Fundamentals

More information

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:

More information

Pipelined Hash-Join on Multithreaded Architectures

Pipelined Hash-Join on Multithreaded Architectures Pipelined Hash-Join on Multithreaded Architectures Philip Garcia University of Wisconsin-Madison Madison, WI 53706 USA pcgarcia@wisc.edu Henry F. Korth Lehigh University Bethlehem, PA 805 USA hfk@lehigh.edu

More information

Multithreaded Value Prediction

Multithreaded Value Prediction Multithreaded Value Prediction N. Tuck and D.M. Tullesn HPCA-11 2005 CMPE 382/510 Review Presentation Peter Giese 30 November 2005 Outline Motivation Multithreaded & Value Prediction Architectures Single

More information

Exploitation of instruction level parallelism

Exploitation of instruction level parallelism Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering

More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information

Threshold-Based Markov Prefetchers

Threshold-Based Markov Prefetchers Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this

More information

Simultaneous Multithreading and the Case for Chip Multiprocessing

Simultaneous Multithreading and the Case for Chip Multiprocessing Simultaneous Multithreading and the Case for Chip Multiprocessing John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 522 Lecture 2 10 January 2019 Microprocessor Architecture

More information

Chapter 7. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 7 <1>

Chapter 7. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 7 <1> Chapter 7 Digital Design and Computer Architecture, 2 nd Edition David Money Harris and Sarah L. Harris Chapter 7 Chapter 7 :: Topics Introduction (done) Performance Analysis (done) Single-Cycle Processor

More information

Techniques for Efficient Processing in Runahead Execution Engines

Techniques for Efficient Processing in Runahead Execution Engines Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu

More information

Simultaneous Multithreading Architecture

Simultaneous Multithreading Architecture Simultaneous Multithreading Architecture Virendra Singh Indian Institute of Science Bangalore Lecture-32 SE-273: Processor Design For most apps, most execution units lie idle For an 8-way superscalar.

More information

Multithreading: Exploiting Thread-Level Parallelism within a Processor

Multithreading: Exploiting Thread-Level Parallelism within a Processor Multithreading: Exploiting Thread-Level Parallelism within a Processor Instruction-Level Parallelism (ILP): What we ve seen so far Wrap-up on multiple issue machines Beyond ILP Multithreading Advanced

More information

CPU Resource Reservation for Simultaneous Multi-Thread Systems. Hiroshi Inoue, Takao Moriyama, Yasushi Negishi, and Moriyoshi Ohara

CPU Resource Reservation for Simultaneous Multi-Thread Systems. Hiroshi Inoue, Takao Moriyama, Yasushi Negishi, and Moriyoshi Ohara RT676 Computer Science 13 pages Research Report September 12, 6 CPU Resource Reservation for Simultaneous Multi-Thread Systems Hiroshi Inoue, Takao Moriyama, Yasushi Negishi, and Moriyoshi Ohara IBM Research,

More information

A Study on Optimally Co-scheduling Jobs of Different Lengths on CMP

A Study on Optimally Co-scheduling Jobs of Different Lengths on CMP A Study on Optimally Co-scheduling Jobs of Different Lengths on CMP Kai Tian Kai Tian, Yunlian Jiang and Xipeng Shen Computer Science Department, College of William and Mary, Virginia, USA 5/18/2009 Cache

More information

Advanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Advanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Advanced Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Modern Microprocessors More than just GHz CPU Clock Speed SPECint2000

More information

Control Hazards. Prediction

Control Hazards. Prediction Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional

More information

Multi-core Programming Evolution

Multi-core Programming Evolution Multi-core Programming Evolution Based on slides from Intel Software ollege and Multi-ore Programming increasing performance through software multi-threading by Shameem Akhter and Jason Roberts, Evolution

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Multi-{Socket,,Thread} Getting More Performance Keep pushing IPC and/or frequenecy Design complexity (time to market) Cooling (cost) Power delivery (cost) Possible, but too

More information

EECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?)

EECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?) Evolution of Processor Performance So far we examined static & dynamic techniques to improve the performance of single-issue (scalar) pipelined CPU designs including: static & dynamic scheduling, static

More information

Improving Real-Time Performance on Multicore Platforms Using MemGuard

Improving Real-Time Performance on Multicore Platforms Using MemGuard Improving Real-Time Performance on Multicore Platforms Using MemGuard Heechul Yun University of Kansas 2335 Irving hill Rd, Lawrence, KS heechul@ittc.ku.edu Abstract In this paper, we present a case-study

More information

Superscalar Processors

Superscalar Processors Superscalar Processors Increasing pipeline length eventually leads to diminishing returns longer pipelines take longer to re-fill data and control hazards lead to increased overheads, removing any a performance

More information

Efficient Evaluation and Management of Temperature and Reliability for Multiprocessor Systems

Efficient Evaluation and Management of Temperature and Reliability for Multiprocessor Systems Efficient Evaluation and Management of Temperature and Reliability for Multiprocessor Systems Ayse K. Coskun Electrical and Computer Engineering Department Boston University http://people.bu.edu/acoskun

More information

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,

More information

Statistical Simulation of Chip Multiprocessors Running Multi-Program Workloads

Statistical Simulation of Chip Multiprocessors Running Multi-Program Workloads Statistical Simulation of Chip Multiprocessors Running Multi-Program Workloads Davy Genbrugge Lieven Eeckhout ELIS Depment, Ghent University, Belgium Email: {dgenbrug,leeckhou}@elis.ugent.be Abstract This

More information

Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero

Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero The Nineteenth International Conference on Parallel Architectures and Compilation Techniques (PACT) 11-15

More information

LIMITS OF ILP. B649 Parallel Architectures and Programming

LIMITS OF ILP. B649 Parallel Architectures and Programming LIMITS OF ILP B649 Parallel Architectures and Programming A Perfect Processor Register renaming infinite number of registers hence, avoids all WAW and WAR hazards Branch prediction perfect prediction Jump

More information

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee

More information

A task migration algorithm for power management on heterogeneous multicore Manman Peng1, a, Wen Luo1, b

A task migration algorithm for power management on heterogeneous multicore Manman Peng1, a, Wen Luo1, b 5th International Conference on Advanced Materials and Computer Science (ICAMCS 2016) A task migration algorithm for power management on heterogeneous multicore Manman Peng1, a, Wen Luo1, b 1 School of

More information

Measurement-based Analysis of TCP/IP Processing Requirements

Measurement-based Analysis of TCP/IP Processing Requirements Measurement-based Analysis of TCP/IP Processing Requirements Srihari Makineni Ravi Iyer Communications Technology Lab Intel Corporation {srihari.makineni, ravishankar.iyer}@intel.com Abstract With the

More information

Staged Memory Scheduling

Staged Memory Scheduling Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie Mellon University, *AMD Research June 12 th 2012 Executive Summary Observation:

More information

Lecture: SMT, Cache Hierarchies. Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1)

Lecture: SMT, Cache Hierarchies. Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) Lecture: SMT, Cache Hierarchies Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized

More information

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures
