Methods for Modeling Resource Contention on Simultaneous Multithreading Processors


Tipp Moseley, Joshua L. Kihm, Daniel A. Connors, and Dirk Grunwald
Department of Computer Science and Department of Electrical and Computer Engineering, University of Colorado, Boulder, CO

Abstract

Simultaneous multithreading (SMT) seeks to improve the computation throughput of a processor core by sharing primary resources such as functional units, issue bandwidth, and caches. SMT designs increase utilization and generally improve overall throughput, but the amount of improvement is highly dependent on competition for shared resources between the scheduled threads. This variability has implications for operating system scheduling, simulation techniques, and fairness. Existing techniques recognize the implications of thread interaction but do little to profile or predict it. The modeling approach presented in this paper uses data collected from performance counters on two different hardware implementations of Pentium-4 Hyper-Threading processors to demonstrate the effects of thread interaction. Techniques are described for fitting linear regression models and recursive partitioning trees that use the counters to make online predictions of performance (expressed as instructions per cycle); these predictions can be used by the operating system to guide scheduling decisions. A detailed analysis of the effectiveness of each of these techniques is presented.

1. Introduction

By leveraging advances in semiconductor technologies, system developers are exploring the paradigms of System-on-a-Chip (SoC) processors, Chip Multiprocessors (CMP), and Multithreaded (MT) architectures. This evolution dictates that future high-performance systems will integrate tens of multithreaded processor cores on a single chip die, resulting in hundreds of concurrent program threads sharing system resources.
These designs will be the cornerstone not only of high-performance computing and server environments but also of general-purpose and embedded domains. For the optimal design and runtime management of such systems, it is necessary to understand how multiple threads interact when sharing hardware. In order to build systems software (compilers, operating systems, run-time systems) that understands the complete view of multiple cores, it is first necessary to build effective models of multithreaded core execution, which will likely be the basis for multi-core designs. By supporting multiple hardware thread contexts, multithreaded architectures address the growing processor-memory gap by tolerating the memory latencies of individual threads. Several multithreaded processor models have been proposed. Coarse-Grained Multi-Threaded (CGMT) [1] processors issue instructions from a single thread each cycle and switch between threads on long-latency instructions, such as cache misses, or on definable time intervals. Alternate hardware thread contexts can perform useful work, increasing throughput, where a single thread would stall the processor. IBM's PowerPC RS64-IV [4] is a commercial implementation of a coarse-grained multithreading processor. In general, since such systems switch between threads for distinct intervals, contention is limited to longer-lifetime resources such as caches and branch predictors. On the other hand, Simultaneous Multithreading (SMT) [14][24][23] processors share the resources (ALUs, branch target buffers, caches, etc.) of one physical processor between multiple virtual processors that execute simultaneously each cycle. The SMT design is intended to have a low design overhead for out-of-order processors, allowing it to be added to existing processor designs without significant cost.
It is estimated that adding SMT support to the Compaq Alpha EV8 processor required only an additional 5% of die area, and researchers at Intel found similar costs for their implementation of SMT, called Hyper-Threading [15]. The major trade-off of adapting existing microarchitecture designs to handle simultaneous threads is that processor efficiency becomes more directly coupled to individual thread characteristics and inter-thread resource contention. As such, it is critical to develop accurate models for the systems software of SMT processors. The IBM POWER5 [11] consists of two two-way SMT cores that share a single L2 cache, resulting in a total of four thread contexts per die and very complex thread interactions. The most commonly available SMT processor is the Intel Pentium-4 processor with Hyper-Threading [9]. Hyper-Threading is technically similar to the SMT designs described in the research literature, although it has unique characteristics. In particular, many resources, such as the cache system, microarchitectural registers, and execution units, are shared between virtual processors as they are in SMT. Other resources, such as the re-order and load/store buffers, are partitioned, and some, including the instruction translation look-aside buffer (ITLB), are duplicated for each virtual processor. When running a conventional operating system on a Pentium-4 with Hyper-Threading enabled, each physical processor appears to the operating system as two distinct processors, and the base operating system does not need detailed knowledge that certain processors are in fact logical processors.

Figure 1. Level 2 cache misses per instruction for 176.gcc when run alone and paired with 179.art.

Employing SMT in hardware generally has considerable benefit, increasing both utilization and throughput. However, increased utilization comes at a cost; threads may compete for resources that simply are not abundant enough to be shared.
For example, Figure 1 shows the number of L2 cache misses per instruction retired (miss frequency) for the SPEC CPU2000 [22] benchmark 176.gcc when run alone and when run coscheduled on a Hyper-Threading processor with 179.art, over a period of 15 billion operations of gcc. The figure shows that for a significant portion of execution, the L2 miss frequency is considerably higher when paired than when run alone (the regions of the graph where alone and paired are about the same are most likely due to high numbers of compulsory misses in certain program phases). Since such interactions can occur on any resource that is shared, it is necessary to be able to accurately predict how competition for a set of shared resources can affect overall performance. As multithreaded multi-core systems emerge, it becomes increasingly important for operating systems to be aware of application behavior to assess job-scheduling opportunities and ensure fair access to resources. Rather than a set of ad hoc heuristics, the operating system should use a model of the processor architecture that uses program properties to provide insight into how and which applications should be co-scheduled. Deriving such a model can be complex, because it may require considerable insight into the internal machine organization. However, such deep insight may be a detriment: oftentimes, intuition is not validated by experimental data, and assumptions about component interactions may not hold in practice. For example, Snavely [19] felt that mixing integer and floating-point applications would be a good scheduling heuristic; while intuitively appealing, this would only be true if the functional units were the performance bottleneck for the processor. Ideally, a model should be automatically derivable using on-line measurement of programs; at a minimum, the model must capture the unique characteristics of the machine that influence program performance.
In this paper, we evaluate two mechanisms to automate the process of deriving a machine model for SMT processors. We are interested in characterizing thread interaction on two real processors; since instructions per cycle (IPC) summarizes aggregate performance, we concern ourselves with how online measurements of program properties can be used to predict IPC. The goal of this work is to determine what kinds of models provide good prediction and whether the models for different processors are similar or different. Use of these predictions to influence scheduling decisions in an actual OS was previously presented [16]. Although this evaluation is for SMT processors, the techniques are applicable to SMT, CGMT, and multi-core designs. We show that statistical prediction tools can predict IPC with good accuracy; for example, across both processors we evaluate, roughly 60% of the predicted IPCs are within 20% of the actual IPC. We also show that it is important to model specific processors: a model derived for one processor design is a poor predictor for a different processor design. These results indicate not only that formal statistical models are useful for predicting IPC, both by simplifying the model-building process and by improving prediction accuracy, but also that processor-specific models are essential.

The rest of this paper is organized as follows. Section 2 discusses related work in the analysis of SMT processors. Section 3 gives an overview of constructing an accurate SMT resource model using hardware counter information. Section 4 presents and compares several models derived from the Pentium-4. Section 5 summarizes the results and discusses applications of this method in future work.

2. Related Work

Since contention on shared resources can cause variations in throughput on a multithreaded system, operating systems have a direct role in the performance of such machines. An operating system that is aware of the underlying processor configuration can combine this information with runtime characteristics of a thread to adapt the process schedule to increase throughput. The work on thread symbiosis by Snavely et al. [20, 21] presents a method for scheduling threads on an SMT processor when there are more runnable threads than contexts. The Sample-Optimize-Symbios (SOS) scheduler performs as its name suggests. In the sample phase, data is collected. Following this, an optimized schedule is calculated based on performance counter attributes recorded during the sample phase. A number of predictors are suggested; however, they are all ad hoc heuristics based on intuition or knowledge of the microarchitecture (e.g., using data cache miss rate or IPC to indicate likely pairings). Finally, in the symbios stage, groups of jobs that are predicted to fare well are scheduled concurrently.
By comparison, this paper presents a methodical, statistical model that can be used to derive a performance predictor. Chandra et al. [6] introduce a model related to our work for mathematically predicting cache interference based on data re-use distances. Moseley et al. [16] describe various techniques for leveraging hardware performance monitoring in dynamic optimization and scheduling. Scheduling decisions are made by sampling performance counters and using a linear model to decide which factors contribute the most to thread interference. Bulpin [5] uses a similar technique to profile threads, applying a linear model to several potential thread combinations to plan which threads to coschedule. Scheduling decisions based on proposed monitoring of fine-grained cache behavior in SMT processors are explored in [13]. Grunwald et al. [8] study microarchitectural denial of service by measuring how malicious threads degrade the throughput of the threads they are paired with on SMT. Van Biesbrouck et al. [3] present methods for using SimPoint [17] to guide simulation for SMT processors. Co-phase effects of multiple simultaneous threads and the observed thread interactions are further explored in [12]. In [7], a model is presented for identifying and quantifying interaction costs in a processor, which can be used to identify performance bottlenecks and focus optimization effort. Although these techniques target single-threaded, superscalar architectures, they can be applied to multithreaded environments.

3. Methodology

In statistics, a regression model is used to predict the change in a response variable given the values of a number of factors, or explanatory variables. Any number of explanatory variables may be used in a regression model, but a good regression model should have low error (meaning it makes good predictions) and, from a practical perspective, use the smallest number of explanatory variables needed.
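To make this trade-off between error and the number of explanatory variables concrete, one simple approach is to rank candidate variables by how much of the response variance each explains on its own and keep only the strongest few. The sketch below illustrates the idea on synthetic data; the counter names are borrowed from Table 2, but the data and the ranking procedure are illustrative assumptions, not the paper's actual selection method.

```python
import numpy as np

def single_variable_r2(x, y):
    """R^2 of the one-variable least-squares fit y ~ a*x + b."""
    a, b = np.polyfit(x, y, 1)
    resid = y - (a * x + b)
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)
n = 500
# Synthetic per-cycle counter rates (names borrowed from Table 2).
counters = {
    "B": rng.uniform(0.0, 0.2, n),      # retired branches per cycle
    "L2M": rng.uniform(0.0, 0.01, n),   # L2 misses per cycle
    "FPUOP": rng.uniform(0.0, 0.3, n),  # FP uops per cycle (no effect here)
}
# Synthetic IPC driven mostly by branches, somewhat by L2 misses.
ipc = 1.5 - 3.0 * counters["B"] - 40.0 * counters["L2M"] + rng.normal(0, 0.05, n)

# Rank counters by explained IPC variance; a parsimonious model keeps the top few.
ranking = sorted(counters, key=lambda c: single_variable_r2(counters[c], ipc),
                 reverse=True)
print(ranking)
```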
In order to perform on-line predictions, we wanted to use different regression models to predict IPC. Since an operating system mainly selects which programs to co-schedule, the explanatory variables that make the most sense to monitor are the individual actions of programs. Though many other factors, such as temperature or voltage levels, may influence processor performance, they were not taken into consideration in this work. In practice, we want to use the past explanatory values of two (arbitrarily selected) programs to predict the IPC when those two programs are run concurrently on the same processor; in other words, given vectors x_{t-1} and y_{t-1} of explanatory values for one time period, we want to predict the aggregate IPC for the next time period (i.e., IPC_t = par(x_{t-1}, y_{t-1})). In order to do this, a set of explanatory values needs to be selected; we use the processor performance counters for this purpose. In addition, a way to automatically select a regression function is needed; we consider two mechanisms: linear regression and recursively partitioned decision trees.

3.1. Experimental Configuration

To demonstrate the need for a method to easily derive models for thread interaction, experiments are duplicated on two significantly different models of the Pentium-4 processor. The Northwood is the second-generation Hyper-Threading system, and the Nocona is based on the third-generation Prescott architecture. Each physical processor has two logical contexts from which instructions are issued

simultaneously. Table 1 shows the primary differences between the two experimental configurations. Experiments were conducted using a Linux kernel modified to support the collection and logging of hardware performance counters. Pair-wise combinations of benchmarks from the SPEC CPU2000 benchmark suite were evaluated using the reference input set. The performance counters were sampled every 25 million cycles and each time the operating system scheduler was invoked (generally every 100 ms in Linux 2.6). A 25-million-cycle sample period was chosen because it is small enough to isolate specific phases in program behavior, yet large enough not to cause significant overhead (less than 1%). Table 2 contains the list of performance counters that were collected. The Pentium-4 processor is equipped with 18 performance counters that can be configured to count hundreds of different events.

                            Northwood    Nocona
  Frequency                 2.53 GHz     3.4 GHz
  Issue Width               3            4
  Pipeline Stages           20           31
  L1 Dcache Size            8 kB         16 kB
  L1 Dcache Associativity   4-way        8-way
  L1 Dcache Latency         2 cycles     4 cycles
  Trace Cache               12k µops     16k µops
  L2 Cache Size             512 kB       1024 kB
  L2 Cache Associativity    8-way        8-way
  L2 Cache Latency          7 cycles     11 cycles
  FSB                       400 MHz      800 MHz
  Memory                    768 MB       2 GB

Table 1. Architectural differences between the Northwood and Nocona models of the Intel Pentium-4 microprocessors used for experiments. In addition, Nocona features a 64-bit extension to the IA-32 architecture: Intel Extended Memory 64 Technology (EM64T).

  Set 1: B (retired branches), TCLM (trace cache lookup misses), L2M (second-level cache misses), FPUOP (retired floating-point µops), I (instructions)
  Set 2: L2H (second-level cache hits), RBMP (retired branch mispredictions), ITLBH (instruction TLB hits), I (instructions)
  Set 3: ITLBM (instruction TLB misses), DTLBM (data TLB misses), CPC (pipeline clears), I (instructions)

Table 2. Performance metrics recorded for application characterization. Metrics/events are normalized to per-cycle event counts.
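The per-cycle normalization applied to the metrics in Table 2 can be sketched as follows. The sample layout and field names here are illustrative assumptions, not the actual kernel interface used in the experiments: two consecutive raw counter snapshots are differenced and divided by the elapsed cycle count.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    """Raw counter totals captured at one sampling point (hypothetical layout)."""
    cycles: int
    branches: int
    l2_misses: int
    fp_uops: int
    instructions: int

def per_cycle_events(prev: Sample, curr: Sample) -> dict:
    """Convert two consecutive raw samples into per-cycle event rates,
    the normalization used for the Table 2 metrics."""
    dc = curr.cycles - prev.cycles
    if dc <= 0:
        raise ValueError("non-increasing cycle count between samples")
    return {
        "B": (curr.branches - prev.branches) / dc,
        "L2M": (curr.l2_misses - prev.l2_misses) / dc,
        "FPUOP": (curr.fp_uops - prev.fp_uops) / dc,
        "IPC": (curr.instructions - prev.instructions) / dc,
    }

# Two samples taken 25 million cycles apart, matching the sampling period.
s0 = Sample(cycles=0, branches=0, l2_misses=0, fp_uops=0, instructions=0)
s1 = Sample(cycles=25_000_000, branches=2_500_000, l2_misses=50_000,
            fp_uops=1_000_000, instructions=20_000_000)
rates = per_cycle_events(s0, s1)
print(rates["IPC"])  # 20M instructions over 25M cycles -> 0.8
```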
With Hyper-Threading enabled, however, the performance counters are shared between the two logical processors. To count the same events from both contexts simultaneously, the number of counters is reduced to 9 per logical processor. To further complicate matters, there are complex rules detailing which counter configuration registers can be associated with specific counters, and two counters must be allocated to count instructions. This makes it impossible to monitor some combinations of counters, such as L1 misses and L2 misses, for two threads simultaneously. Given these inherent limitations of sampling on the Pentium-4, Table 2 is divided into three sets. Counters were chosen in an attempt to represent all resources that could be a source of contention between threads. However, not all counters represent a resource that is directly shared. For example, the ITLB is duplicated per thread on both processor models, but high ITLB miss rates could correspond to greater pressure on the trace cache. Initially, we used multiple runs of applications to sample many different counters; we then used an analysis of variance to determine which counters contributed most to the IPC variation. These counters monitor hardware resources where threads could interfere with each other; more prior knowledge of the microarchitecture could eliminate this step. The first set of counters shown caused the most variation in IPC; the other two sets caused less. Since only one set of counters can be used at a time, the remainder of this paper uses only the first set of counters.

3.2. Multiple Linear Regression

Multiple linear regression is a statistical technique that attempts to describe a single response variable (RV) as a linear sum of two or more explanatory variables (EV), e.g., RV = EV_1 + EV_2 + ... + EV_n. Typically, the coefficients of the model are chosen such that they minimize the mean square error between the prediction and a set of observed data.
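A minimal least-squares fit of this form can be sketched in Python; the data here is synthetic (standing in for the measured counter rates), and the coefficients and noise level are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
# Synthetic explanatory variables (stand-ins for per-cycle counter rates).
EV = rng.uniform(0.0, 0.2, size=(n, 3))
true_coef = np.array([-2.0, -8.0, 1.5])
# Synthetic response: a linear combination of the EVs plus noise.
ipc = 1.2 + EV @ true_coef + rng.normal(0, 0.02, n)

# Ordinary least squares for RV = b0 + b1*EV_1 + b2*EV_2 + b3*EV_3,
# minimizing the mean square error over the observed data.
X = np.column_stack([np.ones(n), EV])
coef, *_ = np.linalg.lstsq(X, ipc, rcond=None)
mse = float(np.mean((ipc - X @ coef) ** 2))
```

Interaction between factors can be modeled the same way by appending products of EV columns (EV_1*EV_2, and so on) as extra columns of X before solving.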
Using the statistical package R [18], we apply multiple linear regression to model IPC as the RV using each of the three sets of counters. Additionally, R allows for interaction between the individual factors using standard analysis-of-variance techniques; interactions are modeled by the regression model RV = EV_1 + EV_2 + ... + EV_n + EV_1*EV_2 + EV_1*EV_3 + ... + EV_1*EV_2*...*EV_n. Interactions can be quite strong in some cases.

3.3. Recursive Partitioning

Recursive partitioning is a methodology for automatically developing a decision tree to partition data sets based on a training set. Tree-based models are commonly used in the fields of biology and artificial intelligence to automatically encapsulate knowledge and make it more usable. Instead of trying to group data into bins, we assign an IPC prediction to each leaf of the decision tree. By making the decisions based on observed counter values, the multithreaded IPC can be predicted in this manner. The rpart [2] package for the R language for statistical computing is a powerful yet easy-to-use classification tool. Since IPC is a continuous variable, we use the anova mode of classification. In this method, each time the tree is split during construction, the result is a reduction in the residual sum of squares (thereby reducing overall error).

4. Results

4.1. Recursive Partitioning Decision Trees

Figures 2 and 3 show the recursive partitioning decision trees for counter Set 1 for both processors evaluated. Each tree illustrates the monitored processor events (explanatory variables) as the interior nodes and the predicted IPC (response variable) as the leaf nodes. The events of counter Set 1 that are dominant factors are L2M (L2 cache misses), B (branches), and FPUOP (floating-point unit operations). Each event value is normalized to be a per-cycle event count. In the recursive partitioning algorithm, splits that occur higher in the tree are the most important factors in the model. Northwood splits first on branches, then on floating-point operations and L2 misses, but the Nocona model is more focused on L2 misses and branches. There are a number of interesting aspects found in comparing the Northwood and Nocona decision trees. First, Nocona actually has a bigger L2 cache.
However, both the L1 and L2 caches on Nocona are slower (by 2 and 4 cycles, respectively) and have half the L2 cache bandwidth [9][10]. Since the Nocona is based on the Prescott architecture, which was designed for operation at much higher frequencies than were actually achieved, the slower caches make sense from a design perspective, but they hinder performance at lower clock speeds. Likewise, reports [9][10] indicate that one of the additions in the Prescott line, on which Nocona is based, is an improved branch predictor. This would explain why the number of branches (B) was not as dominant in the decision tree of the Nocona as it was for the Northwood. Overall, this data supports the need for machine- and processor-specific models, as two reasonably similar processors yield models that are quite different.

Figure 4. Comparison of percentage of samples falling within error bounds (cumulative distribution) for recursive partitioning and multiple linear regression models based on the first set of performance counters for the Northwood and Nocona processors.

4.2. Comparison of Prediction Accuracy

Since there is not a standard method for comparing the two types of models, each model is cross-validated against a different set of samples. This proves to be more intuitive both for comparing the models and for understanding how well an individual model performs. The cumulative distributions of the samples across error bounds are shown in Figure 4 for counter Set 1 for each of the processors tested. The horizontal axis is the error threshold, defined by how far the predicted value is from the actual value. The vertical axis is the percentage of samples within an error threshold. For example, for Northwood, the linear models predict about 60% of samples within 20% of their actual IPC, while the decision tree only predicts about 42% of samples within 20% of their actual IPC.

Figure 2. A decision tree derived from counters in Set 1 for Northwood to predict total IPC.

Figure 3. A decision tree derived from counters in Set 1 for Nocona to predict total IPC.

In Figure 4, higher values indicate better prediction, and curves that increase more steeply indicate that more samples are predicted with higher accuracy. The models for counter Set 1 are fairly good predictors of IPC. However, the models derived for the other counter sets (Set 2 and Set 3) were quite poor; the R^2 values ranged from 0.03 to 0.51, both of which are unacceptable. Therefore, for the sake of space, only results from counter Set 1 are discussed. The results show that taking interaction between factors into account performs only slightly better than not. For the Northwood processor, the linear models greatly outperform the recursive partitioning algorithm, but for the Nocona the performance is almost identical, with a slight edge to the decision tree algorithm; this is mostly due to the increased accuracy of the decision tree on Nocona. Although the decision tree for Northwood is not as good a predictor as a linear model, it is still good enough to be considered worthwhile. Prediction accuracy aside, the interesting result of these models is the difference in importance of factors between machines.

The SPEC CPU2000 benchmark suite includes a wide variety of applications, so it may not accurately represent the typical workload of a single system. This makes it desirable to tailor a model to a specific workload in addition to the features of the processor. In Figure 5, models are derived separately for the SPEC INT and SPEC FP benchmarks using counter Set 1 on Nocona.
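The anova-mode regression tree and the error-threshold curves of Figure 4 can both be illustrated with a small from-scratch sketch. This is not rpart itself, only a minimal reimplementation of the same idea (split where the residual sum of squares drops the most, predict the leaf mean), run on synthetic data and cross-validated on held-out samples.

```python
import numpy as np

def fit_tree(X, y, depth, min_leaf=25):
    """Recursively partition (X, y): at each node, choose the split that most
    reduces the residual sum of squares (as in rpart's anova mode); each leaf
    stores the mean IPC of its samples."""
    if depth == 0 or len(y) <= 2 * min_leaf:
        return float(y.mean())
    best = None  # (score, feature index, threshold)
    for j in range(X.shape[1]):
        order = np.argsort(X[:, j])
        xs, ys = X[order, j], y[order]
        csum = np.cumsum(ys)
        total = csum[-1]
        for i in range(min_leaf, len(ys) - min_leaf):
            left, n_l, n_r = csum[i - 1], i, len(ys) - i
            # Post-split SSE up to an additive constant: lower is better.
            score = -(left ** 2) / n_l - (total - left) ** 2 / n_r
            if best is None or score < best[0]:
                best = (score, j, (xs[i - 1] + xs[i]) / 2)
    _, j, thr = best
    mask = X[:, j] <= thr
    return (j, thr,
            fit_tree(X[mask], y[mask], depth - 1, min_leaf),
            fit_tree(X[~mask], y[~mask], depth - 1, min_leaf))

def predict_one(node, x):
    """Walk the tree to a leaf and return its IPC prediction."""
    while isinstance(node, tuple):
        j, thr, lo, hi = node
        node = lo if x[j] <= thr else hi
    return node

rng = np.random.default_rng(2)
n = 1000
X = rng.uniform(0.0, 0.2, size=(n, 3))  # synthetic B, L2M, FPUOP rates
ipc = 1.5 - 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0, 0.05, n)

tree = fit_tree(X[:500], ipc[:500], depth=4)  # train on half the samples
pred = np.array([predict_one(tree, x) for x in X[500:]])
actual = ipc[500:]

# Fraction of cross-validation samples within each relative error threshold:
# one point per threshold on a Figure 4-style cumulative curve.
fracs = {t: float(np.mean(np.abs(pred - actual) / actual <= t))
         for t in (0.05, 0.10, 0.20)}
print(fracs)
```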
The accuracy curve for FP alone is slightly better than in Figure 4, and in this case recursive partitioning is more competitive with the linear models. The results for the integer set of benchmarks come as a surprise; the prediction accuracy is significantly higher than for the FP or combined models (over 80% of samples are predicted within 20% of their actual IPC). This could mean either that the integer benchmarks are intrinsically more predictable or that some metric of the floating-point benchmarks is not being effectively captured by the model.

Figure 5. Evaluation of models created for Nocona using the benchmark subsets SPEC INT and SPEC FP to train.

Both linear regression and recursive partitioning perform well as models for both processors. Given the similarity in architecture, it may also seem safe to assume that a model for one processor will perform suitably as a model for another. Figure 6 shows the results when applying the Nocona models to samples from the Northwood processor. Each of the incorrect models performs about the same relative to the others when applied to samples from the wrong processor, but all are far less accurate (predicting less than 20% of the samples within 20% of the actual value) than the processor-specific models. This result shows that if processors are to be modeled using the methods described above, even small changes in architecture are significant enough to require a unique model.

Figure 6. Comparison of percentage of samples falling within error bounds (cumulative distribution) with the Nocona models applied to Northwood.

5. Conclusion

This paper approaches the problem of resource contention on SMT processors. We apply two techniques, linear modeling and recursive partitioning, to performance counters collected from two real SMT processors. The results show that information from a small number of available performance counters is highly predictive of IPC when using either method. In addition to comparing across generations of Intel Hyper-Threading processors, our future work includes comparing these modeling techniques on other multithreaded and multicore architectures, such as the IBM POWER5 and the next-generation Intel Itanium-2 (Montecito).
The complexity of this work lies not in the architectural differences between processors, but in the available performance counters and how they are accessed. Additionally, work is underway using interpolation techniques to combine data from multiple experiments with performance counters that cannot be used together (e.g., L1 data cache and L2 cache misses) in order to construct even more accurate models.

References

[1] A. Agarwal, J. Kubiatowicz, D. Kranz, B. H. Lim, D. Yeung, G. D'Souza, and M. Parkin. Sparcle: An evolutionary processor design for large-scale multiprocessors. IEEE Micro, 13(3):48-61, 1993.
[2] E. J. Atkinson and T. M. Therneau. An introduction to recursive partitioning. Technical report, Mayo Foundation, Feb.
[3] M. V. Biesbrouck, T. Sherwood, and B. Calder. A co-phase matrix to guide simultaneous multithreading simulation. In Proceedings of the 2004 International Symposium on Performance Analysis of Systems and Software, May 2004.
[4] J. M. Borkenhagen, R. J. Eickemeyer, R. N. Kalla, and S. R. Kunkel. A multithreaded PowerPC processor for commercial servers. IBM Journal of Research and Development, 44(6), November 2000.
[5] J. Bulpin. Operating System Support for Simultaneous Multithreaded Processors. PhD thesis, University of Cambridge, Cambridge, UK, Feb.
[6] D. Chandra, F. Guo, S. Kim, and Y. Solihin. Predicting inter-thread cache contention on a chip multi-processor architecture. In Proceedings of the 11th International Symposium on High Performance Computer Architecture (HPCA), February 2005.
[7] B. A. Fields, R. Bodík, M. D. Hill, and C. J. Newburn. Using interaction costs for microarchitectural bottleneck analysis. In MICRO, 2003.
[8] D. Grunwald and S. Ghiasi. Microarchitectural denial of service: Insuring microarchitectural fairness. In Proceedings of the 35th Annual ACM/IEEE International Symposium on Microarchitecture. IEEE Computer Society Press, 2002.
[9] Intel Corporation. Special issue on Intel Hyper-Threading in Pentium-4 processors. Intel Technology Journal, 1(1), January.
[10] Intel Corporation. The microarchitecture of the Intel Pentium-4 processor on 90nm technology. Intel Technology Journal, 1(1), February.
[11] R. N. Kalla, B. Sinharoy, and J. M. Tendler. IBM POWER5 chip: A dual-core multithreaded processor. IEEE Micro, 24(2):40-47, 2004.
[12] J. L. Kihm, T. Moseley, and D. A. Connors. A mathematical model for accurately balancing co-phase effects in simulated multithreaded systems. In Proceedings of the ISCA Workshop on Modeling, Benchmarking, and Simulation (MoBS), June 2005.
[13] J. L. Kihm, A. Settle, A. Janiszewski, and D. A. Connors. Understanding the impact of inter-thread cache interference on ILP in modern SMT processors. Journal of Instruction Level Parallelism, 7(2), 2005.
[14] V. Krishnan and J. Torrellas. A chip-multiprocessor architecture with speculative multithreading. IEEE Transactions on Computers, 48(9), 1999.
[15] D. T. Marr, F. Binns, D. L. Hill, G. Hinton, D. A. Koufaty, J. A. Miller, and M. Upton. Hyper-Threading technology architecture and microarchitecture. Intel Technology Journal, 6(1):4-15, February 2002.
[16] T. Moseley, A. Shye, V. J. Reddi, M. Iyer, D. Fay, D. Hodgdon, J. L. Kihm, A. Settle, D. Grunwald, and D. A. Connors. Dynamic run-time architecture techniques for enabling continuous optimization. In Proceedings of the 2005 International Conference on Computing Frontiers, May 2005.
[17] E. Perelman, G. Hamerly, M. V. Biesbrouck, T. Sherwood, and B. Calder. Using SimPoint for accurate and efficient simulation. In SIGMETRICS, 2003.
[18] R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
[19] A. Snavely and D. M. Tullsen. Symbiotic jobscheduling for a simultaneous multithreaded processor. In Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems. ACM Press, 2000.
[20] A. Snavely and D. M. Tullsen. Symbiotic jobscheduling for a simultaneous multithreading processor. In Architectural Support for Programming Languages and Operating Systems, 2000.
[21] A. Snavely, D. M. Tullsen, and G. Voelker. Symbiotic jobscheduling with priorities for a simultaneous multithreading processor. In Proceedings of the 2002 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems. ACM Press, 2002.
[22] Standard Performance Evaluation Corporation. The SPEC CPU 2000 benchmark suite, 2000.
[23] D. M. Tullsen, S. J. Eggers, and H. M. Levy. Simultaneous multithreading: Maximizing on-chip parallelism. In 22nd Annual International Symposium on Computer Architecture, June 1995.
[24] D. M. Tullsen, J. L. Lo, S. J. Eggers, and H. M. Levy. Supporting fine-grained synchronization on a simultaneous multithreading processor. In International Symposium on Architectural Support for Programming Languages and Operating Systems, pages 54-58, 2000.


More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

Prefetch Threads for Database Operations on a Simultaneous Multi-threaded Processor

Prefetch Threads for Database Operations on a Simultaneous Multi-threaded Processor Prefetch Threads for Database Operations on a Simultaneous Multi-threaded Processor Kostas Papadopoulos December 11, 2005 Abstract Simultaneous Multi-threading (SMT) has been developed to increase instruction

More information

Beyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy

Beyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy EE482: Advanced Computer Organization Lecture #13 Processor Architecture Stanford University Handout Date??? Beyond ILP II: SMT and variants Lecture #13: Wednesday, 10 May 2000 Lecturer: Anamaya Sullery

More information

Simultaneous Multithreading on Pentium 4

Simultaneous Multithreading on Pentium 4 Hyper-Threading: Simultaneous Multithreading on Pentium 4 Presented by: Thomas Repantis trep@cs.ucr.edu CS203B-Advanced Computer Architecture, Spring 2004 p.1/32 Overview Multiple threads executing on

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

One-Level Cache Memory Design for Scalable SMT Architectures

One-Level Cache Memory Design for Scalable SMT Architectures One-Level Cache Design for Scalable SMT Architectures Muhamed F. Mudawar and John R. Wani Computer Science Department The American University in Cairo mudawwar@aucegypt.edu rubena@aucegypt.edu Abstract

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

Multithreaded Processors. Department of Electrical Engineering Stanford University

Multithreaded Processors. Department of Electrical Engineering Stanford University Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread

More information

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor 1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic

More information

Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-742 Fall 2012, Parallel Computer Architecture, Lecture 9: Multithreading

More information

An In-order SMT Architecture with Static Resource Partitioning for Consumer Applications

An In-order SMT Architecture with Static Resource Partitioning for Consumer Applications An In-order SMT Architecture with Static Resource Partitioning for Consumer Applications Byung In Moon, Hongil Yoon, Ilgu Yun, and Sungho Kang Yonsei University, 134 Shinchon-dong, Seodaemoon-gu, Seoul

More information

Kaisen Lin and Michael Conley

Kaisen Lin and Michael Conley Kaisen Lin and Michael Conley Simultaneous Multithreading Instructions from multiple threads run simultaneously on superscalar processor More instruction fetching and register state Commercialized! DEC

More information

Hyperthreading Technology

Hyperthreading Technology Hyperthreading Technology Aleksandar Milenkovic Electrical and Computer Engineering Department University of Alabama in Huntsville milenka@ece.uah.edu www.ece.uah.edu/~milenka/ Outline What is hyperthreading?

More information

32 Hyper-Threading on SMP Systems

32 Hyper-Threading on SMP Systems 32 Hyper-Threading on SMP Systems If you have not read the book (Performance Assurance for IT Systems) check the introduction to More Tasters on the web site http://www.b.king.dsl.pipex.com/ to understand

More information

A Comparison of Capacity Management Schemes for Shared CMP Caches

A Comparison of Capacity Management Schemes for Shared CMP Caches A Comparison of Capacity Management Schemes for Shared CMP Caches Carole-Jean Wu and Margaret Martonosi Princeton University 7 th Annual WDDD 6/22/28 Motivation P P1 P1 Pn L1 L1 L1 L1 Last Level On-Chip

More information

Compatible Phase Co-Scheduling on a CMP of Multi-Threaded Processors Ali El-Moursy, Rajeev Garg, David H. Albonesi and Sandhya Dwarkadas

Compatible Phase Co-Scheduling on a CMP of Multi-Threaded Processors Ali El-Moursy, Rajeev Garg, David H. Albonesi and Sandhya Dwarkadas Compatible Phase Co-Scheduling on a CMP of Multi-Threaded Processors Ali El-Moursy, Rajeev Garg, David H. Albonesi and Sandhya Dwarkadas Depments of Electrical and Computer Engineering and of Computer

More information

LATENCY-AWARE WRITE BUFFER RESOURCE CONTROL IN MULTITHREADED CORES

LATENCY-AWARE WRITE BUFFER RESOURCE CONTROL IN MULTITHREADED CORES LATENCY-AWARE WRITE BUFFER RESOURCE CONTROL IN MULTITHREADED CORES Shane Carroll and Wei-Ming Lin Department of Electrical and Computer Engineering, The University of Texas at San Antonio, San Antonio,

More information

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji Beyond ILP Hemanth M Bharathan Balaji Multiscalar Processors Gurindar S Sohi Scott E Breach T N Vijaykumar Control Flow Graph (CFG) Each node is a basic block in graph CFG divided into a collection of

More information

CS 152 Computer Architecture and Engineering. Lecture 18: Multithreading

CS 152 Computer Architecture and Engineering. Lecture 18: Multithreading CS 152 Computer Architecture and Engineering Lecture 18: Multithreading Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste

More information

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

Module 18: TLP on Chip: HT/SMT and CMP Lecture 39: Simultaneous Multithreading and Chip-multiprocessing TLP on Chip: HT/SMT and CMP SMT TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012

More information

Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research

Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research Joel Hestness jthestness@uwalumni.com Lenni Kuff lskuff@uwalumni.com Computer Science Department University of

More information

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 250P: Computer Systems Architecture Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Simultaneous Multithreading (SMT)

Simultaneous Multithreading (SMT) #1 Lec # 2 Fall 2003 9-10-2003 Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing

More information

EECS 470. Lecture 18. Simultaneous Multithreading. Fall 2018 Jon Beaumont

EECS 470. Lecture 18. Simultaneous Multithreading. Fall 2018 Jon Beaumont Lecture 18 Simultaneous Multithreading Fall 2018 Jon Beaumont http://www.eecs.umich.edu/courses/eecs470 Slides developed in part by Profs. Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi,

More information

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004 ABSTRACT Title of thesis: STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS Chungsoo Lim, Master of Science, 2004 Thesis directed by: Professor Manoj Franklin Department of Electrical

More information

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor Single-Issue Processor (AKA Scalar Processor) CPI IPC 1 - One At Best 1 - One At best 1 From Single-Issue to: AKS Scalar Processors CPI < 1? How? Multiple issue processors: VLIW (Very Long Instruction

More information

Lecture 14: Multithreading

Lecture 14: Multithreading CS 152 Computer Architecture and Engineering Lecture 14: Multithreading John Wawrzynek Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~johnw

More information

The Intel move from ILP into Multi-threading

The Intel move from ILP into Multi-threading The Intel move from ILP into Multi-threading Miguel Pires Departamento de Informática, Universidade do Minho Braga, Portugal migutass@hotmail.com Abstract. Multicore technology came into consumer market

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1) Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Simultaneous Multithreading (SMT)

Simultaneous Multithreading (SMT) Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1996 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide issue

More information

ILP Ends TLP Begins. ILP Limits via an Oracle

ILP Ends TLP Begins. ILP Limits via an Oracle ILP Ends TLP Begins Today s topics: Explore a perfect machine unlimited budget to see where ILP goes answer: not far enough Look to TLP & multi-threading for help everything has it s issues we ll look

More information

Simultaneous Multithreading: a Platform for Next Generation Processors

Simultaneous Multithreading: a Platform for Next Generation Processors Simultaneous Multithreading: a Platform for Next Generation Processors Paulo Alexandre Vilarinho Assis Departamento de Informática, Universidade do Minho 4710 057 Braga, Portugal paulo.assis@bragatel.pt

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections ) Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14) 1 ILP Limits The perfect processor: Infinite registers (no WAW or WAR hazards) Perfect branch direction and target

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Based on papers by: A.Fedorova, M.Seltzer, C.Small, and D.Nussbaum Pisa November 6, 2006 Multithreaded Chip

More information

Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors

Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors Wenun Wang and Wei-Ming Lin Department of Electrical and Computer Engineering, The University

More information

Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review

Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Bijay K.Paikaray Debabala Swain Dept. of CSE, CUTM Dept. of CSE, CUTM Bhubaneswer, India Bhubaneswer, India

More information

A Framework for Providing Quality of Service in Chip Multi-Processors

A Framework for Providing Quality of Service in Chip Multi-Processors A Framework for Providing Quality of Service in Chip Multi-Processors Fei Guo 1, Yan Solihin 1, Li Zhao 2, Ravishankar Iyer 2 1 North Carolina State University 2 Intel Corporation The 40th Annual IEEE/ACM

More information

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating

More information

Ron Kalla, Balaram Sinharoy, Joel Tendler IBM Systems Group

Ron Kalla, Balaram Sinharoy, Joel Tendler IBM Systems Group Simultaneous Multi-threading Implementation in POWER5 -- IBM's Next Generation POWER Microprocessor Ron Kalla, Balaram Sinharoy, Joel Tendler IBM Systems Group Outline Motivation Background Threading Fundamentals

More information

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:

More information

Pipelined Hash-Join on Multithreaded Architectures

Pipelined Hash-Join on Multithreaded Architectures Pipelined Hash-Join on Multithreaded Architectures Philip Garcia University of Wisconsin-Madison Madison, WI 53706 USA pcgarcia@wisc.edu Henry F. Korth Lehigh University Bethlehem, PA 805 USA hfk@lehigh.edu

More information

Multithreaded Value Prediction

Multithreaded Value Prediction Multithreaded Value Prediction N. Tuck and D.M. Tullesn HPCA-11 2005 CMPE 382/510 Review Presentation Peter Giese 30 November 2005 Outline Motivation Multithreaded & Value Prediction Architectures Single

More information

Exploitation of instruction level parallelism

Exploitation of instruction level parallelism Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering

More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information

Threshold-Based Markov Prefetchers

Threshold-Based Markov Prefetchers Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this

More information

Simultaneous Multithreading and the Case for Chip Multiprocessing

Simultaneous Multithreading and the Case for Chip Multiprocessing Simultaneous Multithreading and the Case for Chip Multiprocessing John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 522 Lecture 2 10 January 2019 Microprocessor Architecture

More information

Chapter 7. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 7 <1>

Chapter 7. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 7 <1> Chapter 7 Digital Design and Computer Architecture, 2 nd Edition David Money Harris and Sarah L. Harris Chapter 7 Chapter 7 :: Topics Introduction (done) Performance Analysis (done) Single-Cycle Processor

More information

Techniques for Efficient Processing in Runahead Execution Engines

Techniques for Efficient Processing in Runahead Execution Engines Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu

More information

Simultaneous Multithreading Architecture

Simultaneous Multithreading Architecture Simultaneous Multithreading Architecture Virendra Singh Indian Institute of Science Bangalore Lecture-32 SE-273: Processor Design For most apps, most execution units lie idle For an 8-way superscalar.

More information

Multithreading: Exploiting Thread-Level Parallelism within a Processor

Multithreading: Exploiting Thread-Level Parallelism within a Processor Multithreading: Exploiting Thread-Level Parallelism within a Processor Instruction-Level Parallelism (ILP): What we ve seen so far Wrap-up on multiple issue machines Beyond ILP Multithreading Advanced

More information

CPU Resource Reservation for Simultaneous Multi-Thread Systems. Hiroshi Inoue, Takao Moriyama, Yasushi Negishi, and Moriyoshi Ohara

CPU Resource Reservation for Simultaneous Multi-Thread Systems. Hiroshi Inoue, Takao Moriyama, Yasushi Negishi, and Moriyoshi Ohara RT676 Computer Science 13 pages Research Report September 12, 6 CPU Resource Reservation for Simultaneous Multi-Thread Systems Hiroshi Inoue, Takao Moriyama, Yasushi Negishi, and Moriyoshi Ohara IBM Research,

More information

A Study on Optimally Co-scheduling Jobs of Different Lengths on CMP

A Study on Optimally Co-scheduling Jobs of Different Lengths on CMP A Study on Optimally Co-scheduling Jobs of Different Lengths on CMP Kai Tian Kai Tian, Yunlian Jiang and Xipeng Shen Computer Science Department, College of William and Mary, Virginia, USA 5/18/2009 Cache

More information

Advanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Advanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Advanced Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Modern Microprocessors More than just GHz CPU Clock Speed SPECint2000

More information

Control Hazards. Prediction

Control Hazards. Prediction Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional

More information

Multi-core Programming Evolution

Multi-core Programming Evolution Multi-core Programming Evolution Based on slides from Intel Software ollege and Multi-ore Programming increasing performance through software multi-threading by Shameem Akhter and Jason Roberts, Evolution

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Multi-{Socket,,Thread} Getting More Performance Keep pushing IPC and/or frequenecy Design complexity (time to market) Cooling (cost) Power delivery (cost) Possible, but too

More information

EECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?)

EECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?) Evolution of Processor Performance So far we examined static & dynamic techniques to improve the performance of single-issue (scalar) pipelined CPU designs including: static & dynamic scheduling, static

More information

Improving Real-Time Performance on Multicore Platforms Using MemGuard

Improving Real-Time Performance on Multicore Platforms Using MemGuard Improving Real-Time Performance on Multicore Platforms Using MemGuard Heechul Yun University of Kansas 2335 Irving hill Rd, Lawrence, KS heechul@ittc.ku.edu Abstract In this paper, we present a case-study

More information

Superscalar Processors

Superscalar Processors Superscalar Processors Increasing pipeline length eventually leads to diminishing returns longer pipelines take longer to re-fill data and control hazards lead to increased overheads, removing any a performance

More information

Efficient Evaluation and Management of Temperature and Reliability for Multiprocessor Systems

Efficient Evaluation and Management of Temperature and Reliability for Multiprocessor Systems Efficient Evaluation and Management of Temperature and Reliability for Multiprocessor Systems Ayse K. Coskun Electrical and Computer Engineering Department Boston University http://people.bu.edu/acoskun

More information

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,

More information

Statistical Simulation of Chip Multiprocessors Running Multi-Program Workloads

Statistical Simulation of Chip Multiprocessors Running Multi-Program Workloads Statistical Simulation of Chip Multiprocessors Running Multi-Program Workloads Davy Genbrugge Lieven Eeckhout ELIS Depment, Ghent University, Belgium Email: {dgenbrug,leeckhou}@elis.ugent.be Abstract This

More information

Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero

Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero The Nineteenth International Conference on Parallel Architectures and Compilation Techniques (PACT) 11-15

More information

LIMITS OF ILP. B649 Parallel Architectures and Programming

LIMITS OF ILP. B649 Parallel Architectures and Programming LIMITS OF ILP B649 Parallel Architectures and Programming A Perfect Processor Register renaming infinite number of registers hence, avoids all WAW and WAR hazards Branch prediction perfect prediction Jump

More information

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee

More information

A task migration algorithm for power management on heterogeneous multicore Manman Peng1, a, Wen Luo1, b

A task migration algorithm for power management on heterogeneous multicore Manman Peng1, a, Wen Luo1, b 5th International Conference on Advanced Materials and Computer Science (ICAMCS 2016) A task migration algorithm for power management on heterogeneous multicore Manman Peng1, a, Wen Luo1, b 1 School of

More information

Measurement-based Analysis of TCP/IP Processing Requirements

Measurement-based Analysis of TCP/IP Processing Requirements Measurement-based Analysis of TCP/IP Processing Requirements Srihari Makineni Ravi Iyer Communications Technology Lab Intel Corporation {srihari.makineni, ravishankar.iyer}@intel.com Abstract With the

More information

Staged Memory Scheduling

Staged Memory Scheduling Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie Mellon University, *AMD Research June 12 th 2012 Executive Summary Observation:

More information

Lecture: SMT, Cache Hierarchies. Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1)

Lecture: SMT, Cache Hierarchies. Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) Lecture: SMT, Cache Hierarchies Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized

More information

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures
