Methods for Modeling Resource Contention on Simultaneous Multithreading Processors
Tipp Moseley, Joshua L. Kihm, Daniel A. Connors, and Dirk Grunwald
Department of Computer Science and Department of Electrical and Computer Engineering
University of Colorado, Boulder, CO

Abstract

Simultaneous multithreading (SMT) seeks to improve the computation throughput of a processor core by sharing primary resources such as functional units, issue bandwidth, and caches. SMT designs increase utilization and generally improve overall throughput, but the amount of improvement is highly dependent on competition for shared resources between the scheduled threads. This variability has implications for operating system scheduling, simulation techniques, and fairness. Although existing techniques recognize the implications of thread interaction, they do little to profile and predict it. The modeling approach presented in this paper uses data collected from performance counters on two different hardware implementations of Pentium-4 Hyper-Threading processors to demonstrate the effects of thread interaction. Techniques are described for fitting linear regression models and recursive partitioning trees that use the counters to make online predictions of performance (expressed as instructions per cycle); these predictions can be used by the operating system to guide scheduling decisions. A detailed analysis of the effectiveness of each of these techniques is presented.

1. Introduction

By leveraging the advances in semiconductor technologies, system developers are exploring the paradigms of System-on-a-Chip (SoC) processors, Chip Multiprocessors (CMP), and Multithreaded (MT) architectures. This evolution dictates that future high-performance systems will integrate tens of multithreaded processor cores on a single chip die, resulting in hundreds of concurrent program threads sharing system resources.
These designs will be the cornerstone not only of high-performance computing and server environments, but will also emerge in general-purpose and embedded domains. For the optimal design and runtime management of such systems, it is necessary to understand how multiple threads interact when sharing hardware. In order to build systems software (compilers, operating systems, run-time systems) that understands the complete view of multiple cores, it is first necessary to build effective models of multithreaded core execution, which will likely be the basis for multi-core designs. By supporting multiple hardware thread contexts, multithreaded architectures address the growing processor-memory gap by tolerating the memory latencies of individual threads. Several multithreaded processor models have been proposed. Coarse-Grained Multithreaded (CGMT) [1] processors issue instructions from a single thread each cycle and switch between threads on long-latency instructions such as cache misses, or at definable time intervals. Alternate hardware thread contexts can perform useful work, increasing throughput, where a single thread would stall the processor. IBM released the PowerPC RS64-IV [4], a commercial implementation of a coarse-grained multithreading processor. In general, since such systems switch between threads for distinct intervals, contention is limited to longer-lifetime resources such as caches and branch predictors. On the other hand, Simultaneous Multithreading (SMT) [14][24][23] processors share the resources (ALUs, branch target buffers, caches, etc.) of one physical processor between multiple virtual processors that execute simultaneously each cycle. The SMT design is intended to have a low design overhead for out-of-order processors, allowing it to be added to existing processor designs without significant cost.
It is estimated that adding SMT support to the Compaq Alpha EV8 processor required only an additional 5% of die area, and researchers at Intel found similar costs for their implementation of SMT, called Hyper-Threading [15]. The major trade-off of adapting existing microarchitecture designs to handle simultaneous threads is that processor efficiency becomes more directly coupled to individual thread characteristics and inter-thread resource contention. As such, it is critical to develop accurate models for the systems software of SMT processors. The IBM POWER5 [11] consists of two two-way SMT cores that share a single L2 cache, resulting in a total of four thread contexts per die and very complex thread interactions. The most commonly available SMT processor is the Intel Pentium-4 processor with Hyper-Threading [9]. Hyper-Threading is technically similar to the SMT designs described in the research literature, although it has unique characteristics. In particular, many resources, such as the cache system, microarchitectural registers, and execution units, are shared between virtual processors as they are in SMT. Other resources, such as the re-order and load/store buffers, are partitioned, and some, including the instruction translation look-aside buffer (ITLB), are duplicated for each virtual processor. When running a conventional operating system on a Pentium-4 with Hyper-Threading enabled, each physical processor appears to the operating system as two distinct processors, and the base operating system does not need detailed knowledge that these are in fact logical processors.

Figure 1. Level 2 cache misses per instruction for 176.gcc when run alone and when paired with 179.art (L2 miss frequency vs. instructions of gcc, in billions).

Employing SMT in hardware generally has considerable benefit, increasing both utilization and throughput. However, increased utilization comes at a cost; threads may compete for resources that simply are not abundant enough to be shared.
For example, Figure 1 shows the number of L2 cache misses per instruction retired (miss frequency) for the SPEC CPU2000 [22] benchmark 176.gcc when run alone and when run co-scheduled on a Hyper-Threading processor with 179.art, over a period of 15 billion operations of gcc. The figure shows that for a significant portion of execution, the L2 miss frequency is considerably higher when paired than when run alone (the regions of the graph where the alone and paired curves are about the same are most likely due to high numbers of compulsory misses in certain program phases). Since such interactions can occur on any shared resource, it is necessary to be able to accurately predict how competition for a set of shared resources affects overall performance. As multithreaded multi-core systems emerge, it becomes increasingly important for operating systems to be aware of application behavior in order to assess job scheduling opportunities and ensure fair access to resources. Rather than relying on a set of ad hoc heuristics, the operating system should use a model of the processor architecture that uses program properties to provide insight into how and which applications should be co-scheduled. Deriving such a model can be complex, because it may require considerable insight into the internal machine organization. However, such deep insight may even be a detriment: often, intuition is not validated by experimental data, and assumptions about component interactions may not hold in practice. For example, Snavely [19] suggested that mixing integer and floating-point applications would be a good scheduling heuristic; while intuitively appealing, this would only be true if the functional units were the performance bottleneck of the processor. Ideally, a model should be derivable automatically using on-line measurement of programs; at a minimum, the model must capture the unique characteristics of the machine that influence program performance.
In this paper, we evaluate two mechanisms to automate the process of deriving a machine model for SMT processors. We are interested in characterizing thread interaction on two real processors; since instructions per cycle (IPC) summarizes aggregate performance, we concern ourselves with how online measurements of program properties can be used to predict IPC. The goal of this work is to determine what kinds of models provide good prediction and whether the models for different processors are similar or different. The use of these predictions to influence scheduling decisions in an actual OS was previously presented [16]. Although this evaluation is for SMT processors, the techniques are applicable to SMT, CGMT, and multi-core designs. We show that statistical prediction tools can predict IPC with good accuracy; for example, across both processors we evaluate, about 60% of the predicted IPCs are within 0.2 of the actual IPC. We also show that it is important to model specific processors: a model derived for one processor design is a poor predictor for a different processor design. These results indicate not only that formal statistical models are useful for predicting IPC, both by simplifying the model-building process and by improving prediction accuracy, but also that processor-specific models are essential.

The rest of this paper is organized as follows. Section 2 discusses related work in the analysis of SMT processors. Section 3 gives an overview of constructing an accurate SMT resource model using hardware counter information. Section 4 presents and compares several models derived from the Pentium-4. Section 5 summarizes the results and discusses applications of this method in future work.

2. Related Work

Since contention on shared resources can cause variations in throughput on a multithreaded system, operating systems have a direct role in the performance of such machines. An operating system that is aware of the underlying processor configuration can combine this information with runtime characteristics of a thread to adapt the process schedule and increase throughput. The work on thread symbiosis by Snavely et al. [20, 21] presents a method for scheduling threads on an SMT processor when there are more runnable threads than contexts. The Sample-Optimize-Symbios (SOS) scheduler performs as the name suggests. In the sample phase, data is collected. Following this, an optimized schedule is calculated based on performance counter attributes recorded during the sample phase. A number of predictors are suggested; however, they are all ad hoc heuristics based on intuition or knowledge of the microarchitecture (e.g., using data cache miss rate or IPC to indicate likely pairings). Finally, in the symbios stage, groups of jobs that are predicted to fare well are scheduled concurrently.
By comparison, this paper presents a methodical, statistical model that can be used to derive a performance predictor. Chandra et al. [6] introduce a model related to our work for mathematically predicting cache interference based on data re-use distances. Moseley et al. [16] describe various techniques for leveraging hardware performance monitoring in dynamic optimization and scheduling; scheduling decisions are made by sampling performance counters and using a linear model to decide which factors contribute the most to thread interference. Bulpin [5] uses a similar technique to profile threads, applying a linear model to several potential thread combinations to choose which threads to co-schedule. Scheduling decisions based on proposed monitoring of fine-grained cache behavior in SMT processors are explored in [13]. Grunwald et al. [8] study microarchitectural denial of service by measuring how malicious threads degrade the throughput of threads they are paired with on SMT. Van Biesbrouck et al. [3] present methods for using SimPoint [17] to guide simulation of SMT processors. Co-phase effects of multiple simultaneous threads and the observed thread interactions are further explored in [12]. In [7], a model is presented for identifying and quantifying interaction costs in a processor, which can be used to identify performance bottlenecks and focus optimization effort. Although these techniques target single-threaded, superscalar architectures, they can be applied to multithreaded environments.

3. Methodology

In statistics, a regression model is used to predict the change in a response variable for given values of a number of factors, or explanatory variables. Any number of explanatory variables may be used in a regression model, but a good regression model should have low error (meaning it makes good predictions) and, from a practical perspective, use the fewest explanatory variables needed.
In order to perform on-line predictions, we want to use different regression models to predict IPC. Since an operating system mainly selects which programs to co-schedule, the explanatory variables that make the most sense to monitor are the individual actions of programs. Though many other factors, such as temperature or voltage levels, may influence processor performance, they were not taken into consideration in this work. In practice, we want to use the past explanatory values of two (arbitrarily selected) programs to predict the IPC when those two programs are run concurrently on the same processor; in other words, given vectors x_{t-1} and y_{t-1} of explanatory values for one time period, we want to predict the aggregate IPC for the next time period (e.g., IPC_t = par(x_{t-1}, y_{t-1})). In order to do this, a set of explanatory values needs to be selected; we use the processor performance counters for this purpose. In addition, a way to automatically select a regression function is needed; we consider two mechanisms: linear regression and recursively partitioned decision trees.

3.1. Experimental Configuration

To demonstrate the need for a method to easily derive models of thread interaction, experiments are duplicated on two significantly different models of the Pentium-4 processor. The Northwood is a second-generation Hyper-Threading system, and the Nocona is based on the third-generation Prescott architecture. Each physical processor has two logical contexts from which instructions are issued
simultaneously. Table 1 shows the primary differences between the two experimental configurations.

Table 1. Architectural differences between the Northwood and Nocona models of the Intel Pentium-4 microprocessors used for experiments. In addition, Nocona features a 64-bit extension to the IA-32 architecture: Intel Extended Memory 64 Technology (EM64T).

                            Northwood    Nocona
  Frequency                 2.53 GHz     3.4 GHz
  Issue Width               3            4
  Pipeline Stages
  L1 Dcache Size            8 kB         16 kB
  L1 Dcache Associativity   4-way        8-way
  L1 Dcache Latency         2 cycles     4 cycles
  Trace Cache               12k µops     16k µops
  L2 Cache Size             512 kB       1024 kB
  L2 Cache Associativity    8-way        8-way
  L2 Cache Latency          7 cycles     11 cycles
  FSB                       400 MHz      800 MHz
  Memory                    768 MB       2 GB

Table 2. Performance metrics recorded for application characterization. Metrics are normalized to per-cycle event counts.

  Set 1: B (retired branches), TCLM (trace cache lookup misses), L2M (second-level cache misses), FPUOP (retired floating-point µops), I (instructions)
  Set 2: L2H (second-level cache hits), RBMP (retired branch mispredictions), ITLBH (instruction TLB hits), I (instructions)
  Set 3: ITLBM (instruction TLB misses), DTLBM (data TLB misses), CPC (pipeline clears), I (instructions)

Experiments were conducted using a Linux kernel modified to support the collection and logging of hardware performance counters. Pairwise combinations of benchmarks from the SPEC CPU2000 benchmark suite were evaluated using the reference input set. The performance counters were sampled every 25 million cycles and each time the operating system scheduler was invoked (generally every 100 ms in Linux 2.6). A 25-million-cycle sample period was chosen because it is small enough to isolate specific phases in program behavior, yet large enough not to cause significant overhead (less than 1%). Table 2 contains the list of performance counters that were collected. The Pentium-4 processor is equipped with 18 performance counters that can be configured to count hundreds of different events.
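As a concrete illustration of the data this sampling produces, the sketch below (plain Python with illustrative names, not the authors' code) pairs the per-cycle-normalized counter vectors of two co-scheduled threads from one sample interval with the aggregate IPC observed in the next interval, which is the form of training row the regression models described below consume:

```python
# Sketch of assembling training rows for the IPC predictor: counter
# vectors from interval t-1 become the explanatory variables for the
# aggregate IPC observed at interval t. All names are illustrative.

def make_rows(counters_a, counters_b, pair_ipc):
    """counters_a, counters_b: per-cycle-normalized counter vectors for
    the two co-scheduled threads, one vector per sample interval.
    pair_ipc: aggregate IPC of the pair, one value per interval."""
    rows = []
    for t in range(1, len(pair_ipc)):
        features = counters_a[t - 1] + counters_b[t - 1]  # concatenate x, y
        rows.append((features, pair_ipc[t]))              # predict next IPC
    return rows

# toy data: two counters per thread (say, L2M and B, events per cycle)
a = [[0.001, 0.04], [0.002, 0.05]]
b = [[0.003, 0.02], [0.001, 0.03]]
ipc = [1.1, 0.9]
print(make_rows(a, b, ipc))
# -> [([0.001, 0.04, 0.003, 0.02], 0.9)]
```

Note that only samples from interval t-1 feed the prediction for interval t, matching the on-line setting in which the scheduler must act before the next interval runs.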
With Hyper-Threading enabled, however, the performance counters are shared between the two logical processors. To count the same events from both contexts simultaneously, the number of counters is effectively reduced to 9 per logical processor. To further complicate matters, there are complex rules detailing which counter configuration registers can be associated with specific counters, and two counters must be allocated to count instructions. This makes it impossible to monitor some combinations of counters, such as L1 misses and L2 misses, for two threads simultaneously. Given these inherent limitations of sampling on the Pentium-4, Table 2 is divided into three sets. Counters were chosen in an attempt to represent all resources that could be a source of contention between threads. However, not all counters represent a resource that is directly shared. For example, the ITLB is duplicated per thread on both processor models, but high ITLB miss rates could correspond to greater pressure on the trace cache. Initially, we used multiple runs of applications to sample many different counters; we then used an analysis of variance to determine which counters contributed most to the IPC variation. These counters monitor hardware resources where threads could interfere with each other; more prior knowledge of the microarchitecture could eliminate this step. The first set of counters shown caused the most variation in IPC, and the other two sets caused less. Since only one set of counters can be used at a time, the remainder of this paper uses only the first set of counters.

3.2. Multiple Linear Regression

Multiple linear regression is a statistical technique that attempts to describe a single response variable (RV) as a linear sum of two or more explanatory variables (EV), e.g., RV = EV_1 + EV_2 + ... + EV_n. Typically, the coefficients of the EVs are chosen such that they minimize the mean square error between the prediction and a set of observed data.
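A minimal sketch of fitting such a model by ordinary least squares, solving the normal equations (X^T X) beta = (X^T y) with Gauss-Jordan elimination (plain Python for illustration; the paper itself uses R, and all names here are illustrative):

```python
# Ordinary least squares via the normal equations (X^T X) beta = X^T y.
# An intercept column is prepended; the solver is Gauss-Jordan elimination.

def fit_linear(X, y):
    X = [[1.0] + row for row in X]            # prepend intercept term
    n, k = len(X), len(X[0])
    # augmented normal-equation matrix [X^T X | X^T y]
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(k)]
    for col in range(k):                      # Gauss-Jordan elimination
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]       # partial pivoting
        for r in range(k):
            if r != col and A[col][col]:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[r][k] / A[r][r] for r in range(k)]   # [b0, b1, ..., b_k-1]

def predict(beta, row):
    return beta[0] + sum(b * v for b, v in zip(beta[1:], row))

# sanity check on exactly linear data: y = 2 + 3*EV1 - 1*EV2
X = [[0, 0], [1, 0], [0, 1], [1, 1]]
y = [2, 5, 1, 4]
beta = fit_linear(X, y)
print([round(b, 6) for b in beta])
# -> [2.0, 3.0, -1.0]
```

In practice the explanatory variables would be the per-cycle counter values of the two threads, and the response variable the aggregate IPC, exactly as in the models above.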
Using the statistical package R [18], we apply multiple linear regression to model IPC as the RV using each of the three sets of counters. Additionally, R allows for interaction between the individual factors using standard analysis of variance techniques; interactions are modeled by the regression model RV = EV_1 + EV_2 + ... + EV_n + EV_1*EV_2 + EV_1*EV_3 + ... + EV_1*EV_2*...*EV_n. Interactions can be quite strong in some cases.

3.3. Recursive Partitioning

Recursive partitioning is a methodology for automatically developing a decision tree to partition data sets based on a training set. Tree-based models are commonly used in the fields of biology and artificial intelligence to automatically encapsulate knowledge and make it more usable. Instead of trying to group data into bins, we assign an IPC prediction to each leaf of the decision tree. By making the decisions based on observed counter values, the multithreaded IPC can be predicted in this manner. The rpart [2] package for the R language for statistical computing is a powerful yet easy-to-use classification tool. Since IPC is a continuous variable, we use the anova mode of classification. In this method, each split made during tree construction reduces the residual sum of squares (thereby reducing overall error).

4. Results

4.1. Recursive Partitioning Decision Trees

Figures 2 and 3 show the recursive partitioning decision trees for counter Set 1 for both processors evaluated. Each tree shows the monitored processor events (explanatory variables) as internal nodes and the predicted IPC (response variable) at the leaf nodes. The events of counter Set 1 that are dominant factors are L2M (L2 cache misses), B (branches), and FPUOP (floating-point unit operations). Each event value is normalized to a per-cycle event count. In the recursive partitioning algorithm, splits that occur higher in the tree are the most important factors in the model. Northwood splits first on branches, then on floating-point operations and L2 misses, while the Nocona model is more focused on L2 misses and branches. There are a number of interesting aspects in comparing the Northwood and Nocona decision trees. First, Nocona actually has a bigger L2 cache.
However, both the L1 and L2 caches on Nocona are slower (by 2 and 4 cycles, respectively), and Nocona has half the L2 cache bandwidth [9][10]. Since the Nocona is based on the Prescott architecture, which was designed for operation at much higher frequencies than were actually achieved, the slower caches make sense from a design perspective, but they hinder performance at lower clock speeds. Likewise, reports [9][10] indicate that one of the additions in the Prescott line, on which Nocona is based, is an improved branch predictor. This would explain why the number of branches (B) is not as dominant in the decision tree for Nocona as it is for Northwood. Overall, this data supports the need for machine- and processor-specific models, as two reasonably similar processors yield models that are quite different.

Figure 2. A decision tree derived from counters in Set 1 for Northwood to predict total IPC.

Figure 3. A decision tree derived from counters in Set 1 for Nocona to predict total IPC.

Figure 4. Comparison of the percentage of samples falling within error bounds (cumulative distribution) for recursive partitioning and multiple linear regression models based on the first set of performance counters, for the Northwood and Nocona processors.

4.2. Comparison of Prediction Accuracy

Since there is no standard method for comparing the two types of models, each model is cross-validated against a different set of samples. This proves to be more intuitive both for comparing models and for understanding how well an individual model performs. The cumulative distributions of the samples across error bounds are shown in Figure 4 for counter Set 1 for each of the processors tested. The horizontal axis is the error threshold, defined by how far the predicted value is from the actual value. The vertical axis is the percentage of samples within an error threshold. For example, for Northwood, the linear models predict about 60% of samples within 0.2 of their actual IPC, while the decision tree predicts only about 42% of samples within 0.2 of their actual IPC. In Figure 4, higher values indicate better prediction, and curves that rise more steeply indicate that more samples are predicted with higher accuracy.

The models for counter Set 1 are fairly good predictors of IPC. However, the models derived for the other counter sets (Set 2 and Set 3) were quite poor; their R^2 values ranged from 0.03 to 0.51, both of which are unacceptable. Therefore, for the sake of space, only results from counter Set 1 are discussed. The results show that taking interaction between factors into account performs only slightly better than not doing so. For the Northwood processor, the linear models greatly outperform the recursive partitioning algorithm, but for the Nocona the performance is almost identical, with a slight edge to the decision tree algorithm, mostly because the decision tree is more accurate on Nocona. Although the decision tree for Northwood is not as good a predictor as a linear model, it is still good enough to be considered worthwhile.

Prediction accuracy aside, the interesting result of these models is the difference in the importance of factors between machines. The SPEC 2000 benchmark suite includes a wide variety of applications, so it may not accurately represent the typical workload of a single system. This makes it desirable to tailor a model to a specific workload in addition to the features of the processor. In Figure 5, models are derived separately for the SPEC INT and SPEC FP benchmarks using counter Set 1 on Nocona.
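The accuracy measure used in these comparisons, the fraction of cross-validation samples whose predicted IPC falls within a given error threshold of the measured IPC, can be sketched as follows (Python for illustration, with made-up sample data, not the paper's measurements):

```python
# Sketch of the within-threshold accuracy measure: the fraction of
# samples whose predicted IPC is within `threshold` of the measured IPC.
# Sweeping the threshold produces the cumulative curves of Figure 4.

def within_threshold(predicted, actual, threshold):
    hits = sum(1 for p, a in zip(predicted, actual)
               if abs(p - a) <= threshold)
    return hits / len(actual)

# illustrative predictions and measurements for five sample intervals
pred = [1.08, 0.85, 1.40, 0.60, 1.00]
meas = [1.00, 0.90, 1.10, 0.65, 1.30]
print(within_threshold(pred, meas, 0.10))
# -> 0.6  (3 of 5 samples predicted within 0.1 IPC)
```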
The accuracy curves for FP alone are slightly better than in Figure 4, and in this case recursive partitioning is more competitive with the linear models. The results for the integer set of benchmarks come as a surprise; the prediction accuracy is significantly higher than for the FP or combined models (over 80% of samples are predicted within 0.2 of their actual IPC). This could mean either that the integer benchmarks are intrinsically more predictable, or that some metric of the floating-point benchmarks is not being effectively captured by the model.

Figure 5. Evaluation of models created for Nocona using the benchmark subsets SPEC INT and SPEC FP to train.

Both linear regression and recursive partitioning perform well as models for both processors. Given the similarity in architecture, it may seem safe to assume that a model for one processor will perform suitably as a model for another. Figure 6 shows the results when applying the Nocona models to samples from the Northwood processor. The incorrect models all perform about the same relative to one another when applied to samples from the wrong processor, but they are far less accurate (predicting fewer than 20% of samples within 0.2 of the actual value) than the processor-specific models. This result shows that if processors are to be modeled using the methods described above, even small changes in architecture are significant enough to require a unique model.

Figure 6. Comparison of the percentage of samples falling within error bounds (cumulative distribution) with the Nocona models applied to Northwood.

5. Conclusion

This paper approaches the problem of resource contention on SMT processors. We apply two techniques, linear modeling and recursive partitioning, to performance counters collected from two real SMT processors. The results show that information from a small number of available performance counters is highly predictive of IPC with either method. In addition to comparing across generations of Intel Hyper-Threading processors, our future work includes applying these modeling techniques to other multithreaded and multicore architectures such as the IBM POWER5 and the next-generation Intel Itanium-2 (Montecito).
This work is complicated not by the architectural differences between processors, but by the available performance counters and how they are accessed. Additionally, work is underway using interpolation techniques to combine data from multiple experiments with performance counters that cannot be used together (e.g., L1 data cache and L2 cache misses) in order to construct even more accurate models.

References

[1] A. Agarwal, J. Kubiatowicz, D. Kranz, B. H. Lim, D. Yeung, G. D'Souza, and M. Parkin. Sparcle: An evolutionary processor design for large-scale multiprocessors. IEEE Micro, 13(3):48–61.
[2] E. J. Atkinson and T. M. Therneau. An introduction to recursive partitioning. Technical report, Mayo Foundation, Feb.
[3] M. V. Biesbrouck, T. Sherwood, and B. Calder. A co-phase matrix to guide simultaneous multithreading simulation. In Proceedings of the 2004 International Symposium on Performance Analysis of Systems and Software, May.
[4] J. M. Borkenhagen, R. J. Eickemeyer, R. N. Kalla, and S. R. Kunkel. A multithreaded PowerPC processor for commercial servers. IBM Journal of Research and Development, 44(6), November.
[5] J. Bulpin. Operating System Support for Simultaneous Multithreaded Processors. PhD thesis, University of Cambridge, Cambridge, UK, Feb.
[6] D. Chandra, F. Guo, S. Kim, and Y. Solihin. Predicting inter-thread cache contention on a chip multi-processor architecture. In Proceedings of the 11th International Symposium on High Performance Computer Architecture (HPCA), Feb.
[7] B. A. Fields, R. Bodík, M. D. Hill, and C. J. Newburn. Using interaction costs for microarchitectural bottleneck analysis. In MICRO.
[8] D. Grunwald and S. Ghiasi. Microarchitectural denial of service: insuring microarchitectural fairness. In Proceedings of the 35th Annual ACM/IEEE International Symposium on Microarchitecture. IEEE Computer Society Press.
[9] Intel Corporation. Special issue on Intel Hyper-Threading in Pentium-4 processors. Intel Technology Journal, 1(1), January.
[10] Intel Corporation. The microarchitecture of the Intel Pentium-4 processor on 90nm technology. Intel Technology Journal, 1(1), February.
[11] R. N. Kalla, B. Sinharoy, and J. M. Tendler. IBM POWER5 chip: a dual-core multithreaded processor. IEEE Micro, 24(2):40–47.
[12] J. L. Kihm, T. Moseley, and D. A. Connors. A mathematical model for accurately balancing co-phase effects in simulated multithreaded systems. In Proceedings of the ISCA Workshop on Modeling, Benchmarking, and Simulation (MoBS), June 2005.
[13] J. L. Kihm, A. Settle, A. Janiszewski, and D. A. Connors. Understanding the impact of inter-thread cache interference on ILP in modern SMT processors. Journal of Instruction Level Parallelism, 7(2).
[14] V. Krishnan and J. Torrellas. A chip-multiprocessor architecture with speculative multithreading. IEEE Transactions on Computers, 48(9).
[15] D. T. Marr, F. Binns, D. L. Hill, G. Hinton, D. A. Koufaty, J. A. Miller, and M. Upton. Hyper-Threading technology architecture and microarchitecture. Intel Technology Journal, 6(1):4–15, Feb.
[16] T. Moseley, A. Shye, V. J. Reddi, M. Iyer, D. Fay, D. Hodgdon, J. L. Kihm, A. Settle, D. Grunwald, and D. A. Connors. Dynamic run-time architecture techniques for enabling continuous optimization. In Proceedings of the 2005 International Conference on Computing Frontiers, May 2005.
[17] E. Perelman, G. Hamerly, M. V. Biesbrouck, T. Sherwood, and B. Calder. Using SimPoint for accurate and efficient simulation. In SIGMETRICS.
[18] R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
[19] A. Snavely and D. M. Tullsen. Symbiotic jobscheduling for a simultaneous multithreaded processor. In Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems. ACM Press.
[20] A. Snavely and D. M. Tullsen. Symbiotic jobscheduling for a simultaneous multithreading processor. In Architectural Support for Programming Languages and Operating Systems.
[21] A. Snavely, D. M. Tullsen, and G. Voelker. Symbiotic jobscheduling with priorities for a simultaneous multithreading processor. In Proceedings of the 2002 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems. ACM Press.
[22] Standard Performance Evaluation Corporation. The SPEC CPU 2000 benchmark suite.
[23] D. M. Tullsen, S. J. Eggers, and H. M. Levy. Simultaneous multithreading: maximizing on-chip parallelism. In 22nd Annual International Symposium on Computer Architecture, June.
[24] D. M. Tullsen, J. L. Lo, S. J. Eggers, and H. M. Levy. Supporting fine-grained synchronization on a simultaneous multithreading processor. In International Symposium on Architectural Support for Programming Languages and Operating Systems, pages 54–58, 2000.
More informationPrefetch Threads for Database Operations on a Simultaneous Multi-threaded Processor
Prefetch Threads for Database Operations on a Simultaneous Multi-threaded Processor Kostas Papadopoulos December 11, 2005 Abstract Simultaneous Multi-threading (SMT) has been developed to increase instruction
More informationBeyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy
EE482: Advanced Computer Organization Lecture #13 Processor Architecture Stanford University Handout Date??? Beyond ILP II: SMT and variants Lecture #13: Wednesday, 10 May 2000 Lecturer: Anamaya Sullery
More informationSimultaneous Multithreading on Pentium 4
Hyper-Threading: Simultaneous Multithreading on Pentium 4 Presented by: Thomas Repantis trep@cs.ucr.edu CS203B-Advanced Computer Architecture, Spring 2004 p.1/32 Overview Multiple threads executing on
More informationCISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP
CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer
More informationOne-Level Cache Memory Design for Scalable SMT Architectures
One-Level Cache Design for Scalable SMT Architectures Muhamed F. Mudawar and John R. Wani Computer Science Department The American University in Cairo mudawwar@aucegypt.edu rubena@aucegypt.edu Abstract
More informationOutline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??
Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross
More informationMultithreaded Processors. Department of Electrical Engineering Stanford University
Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread
More informationCPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor
1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic
More informationComputer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-742 Fall 2012, Parallel Computer Architecture, Lecture 9: Multithreading
More informationAn In-order SMT Architecture with Static Resource Partitioning for Consumer Applications
An In-order SMT Architecture with Static Resource Partitioning for Consumer Applications Byung In Moon, Hongil Yoon, Ilgu Yun, and Sungho Kang Yonsei University, 134 Shinchon-dong, Seodaemoon-gu, Seoul
More informationKaisen Lin and Michael Conley
Kaisen Lin and Michael Conley Simultaneous Multithreading Instructions from multiple threads run simultaneously on superscalar processor More instruction fetching and register state Commercialized! DEC
More informationHyperthreading Technology
Hyperthreading Technology Aleksandar Milenkovic Electrical and Computer Engineering Department University of Alabama in Huntsville milenka@ece.uah.edu www.ece.uah.edu/~milenka/ Outline What is hyperthreading?
More information32 Hyper-Threading on SMP Systems
32 Hyper-Threading on SMP Systems If you have not read the book (Performance Assurance for IT Systems) check the introduction to More Tasters on the web site http://www.b.king.dsl.pipex.com/ to understand
More informationA Comparison of Capacity Management Schemes for Shared CMP Caches
A Comparison of Capacity Management Schemes for Shared CMP Caches Carole-Jean Wu and Margaret Martonosi Princeton University 7 th Annual WDDD 6/22/28 Motivation P P1 P1 Pn L1 L1 L1 L1 Last Level On-Chip
More informationCompatible Phase Co-Scheduling on a CMP of Multi-Threaded Processors Ali El-Moursy, Rajeev Garg, David H. Albonesi and Sandhya Dwarkadas
Compatible Phase Co-Scheduling on a CMP of Multi-Threaded Processors Ali El-Moursy, Rajeev Garg, David H. Albonesi and Sandhya Dwarkadas Depments of Electrical and Computer Engineering and of Computer
More informationLATENCY-AWARE WRITE BUFFER RESOURCE CONTROL IN MULTITHREADED CORES
LATENCY-AWARE WRITE BUFFER RESOURCE CONTROL IN MULTITHREADED CORES Shane Carroll and Wei-Ming Lin Department of Electrical and Computer Engineering, The University of Texas at San Antonio, San Antonio,
More informationBeyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji
Beyond ILP Hemanth M Bharathan Balaji Multiscalar Processors Gurindar S Sohi Scott E Breach T N Vijaykumar Control Flow Graph (CFG) Each node is a basic block in graph CFG divided into a collection of
More informationCS 152 Computer Architecture and Engineering. Lecture 18: Multithreading
CS 152 Computer Architecture and Engineering Lecture 18: Multithreading Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste
More informationModule 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT
TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012
More informationReducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research
Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research Joel Hestness jthestness@uwalumni.com Lenni Kuff lskuff@uwalumni.com Computer Science Department University of
More information250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019
250P: Computer Systems Architecture Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr
More informationMicroarchitecture Overview. Performance
Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make
More informationSimultaneous Multithreading (SMT)
#1 Lec # 2 Fall 2003 9-10-2003 Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing
More informationEECS 470. Lecture 18. Simultaneous Multithreading. Fall 2018 Jon Beaumont
Lecture 18 Simultaneous Multithreading Fall 2018 Jon Beaumont http://www.eecs.umich.edu/courses/eecs470 Slides developed in part by Profs. Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi,
More informationABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004
ABSTRACT Title of thesis: STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS Chungsoo Lim, Master of Science, 2004 Thesis directed by: Professor Manoj Franklin Department of Electrical
More informationCPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor
Single-Issue Processor (AKA Scalar Processor) CPI IPC 1 - One At Best 1 - One At best 1 From Single-Issue to: AKS Scalar Processors CPI < 1? How? Multiple issue processors: VLIW (Very Long Instruction
More informationLecture 14: Multithreading
CS 152 Computer Architecture and Engineering Lecture 14: Multithreading John Wawrzynek Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~johnw
More informationThe Intel move from ILP into Multi-threading
The Intel move from ILP into Multi-threading Miguel Pires Departamento de Informática, Universidade do Minho Braga, Portugal migutass@hotmail.com Abstract. Multicore technology came into consumer market
More informationMicroarchitecture Overview. Performance
Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make
More informationLecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)
Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling
More informationCISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP
CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer
More informationSimultaneous Multithreading (SMT)
Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1996 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide issue
More informationILP Ends TLP Begins. ILP Limits via an Oracle
ILP Ends TLP Begins Today s topics: Explore a perfect machine unlimited budget to see where ILP goes answer: not far enough Look to TLP & multi-threading for help everything has it s issues we ll look
More informationSimultaneous Multithreading: a Platform for Next Generation Processors
Simultaneous Multithreading: a Platform for Next Generation Processors Paulo Alexandre Vilarinho Assis Departamento de Informática, Universidade do Minho 4710 057 Braga, Portugal paulo.assis@bragatel.pt
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More informationLecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )
Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14) 1 ILP Limits The perfect processor: Infinite registers (no WAW or WAR hazards) Perfect branch direction and target
More informationHardware-Based Speculation
Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register
More informationPerformance of Multithreaded Chip Multiprocessors and Implications for Operating System Design
Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Based on papers by: A.Fedorova, M.Seltzer, C.Small, and D.Nussbaum Pisa November 6, 2006 Multithreaded Chip
More informationEfficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors
Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors Wenun Wang and Wei-Ming Lin Department of Electrical and Computer Engineering, The University
More informationRelative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review
Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Bijay K.Paikaray Debabala Swain Dept. of CSE, CUTM Dept. of CSE, CUTM Bhubaneswer, India Bhubaneswer, India
More informationA Framework for Providing Quality of Service in Chip Multi-Processors
A Framework for Providing Quality of Service in Chip Multi-Processors Fei Guo 1, Yan Solihin 1, Li Zhao 2, Ravishankar Iyer 2 1 North Carolina State University 2 Intel Corporation The 40th Annual IEEE/ACM
More informationECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation
ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating
More informationRon Kalla, Balaram Sinharoy, Joel Tendler IBM Systems Group
Simultaneous Multi-threading Implementation in POWER5 -- IBM's Next Generation POWER Microprocessor Ron Kalla, Balaram Sinharoy, Joel Tendler IBM Systems Group Outline Motivation Background Threading Fundamentals
More informationComputer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:
More informationPipelined Hash-Join on Multithreaded Architectures
Pipelined Hash-Join on Multithreaded Architectures Philip Garcia University of Wisconsin-Madison Madison, WI 53706 USA pcgarcia@wisc.edu Henry F. Korth Lehigh University Bethlehem, PA 805 USA hfk@lehigh.edu
More informationMultithreaded Value Prediction
Multithreaded Value Prediction N. Tuck and D.M. Tullesn HPCA-11 2005 CMPE 382/510 Review Presentation Peter Giese 30 November 2005 Outline Motivation Multithreaded & Value Prediction Architectures Single
More informationExploitation of instruction level parallelism
Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering
More informationPerformance of Multicore LUP Decomposition
Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations
More informationThreshold-Based Markov Prefetchers
Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this
More informationSimultaneous Multithreading and the Case for Chip Multiprocessing
Simultaneous Multithreading and the Case for Chip Multiprocessing John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 522 Lecture 2 10 January 2019 Microprocessor Architecture
More informationChapter 7. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 7 <1>
Chapter 7 Digital Design and Computer Architecture, 2 nd Edition David Money Harris and Sarah L. Harris Chapter 7 Chapter 7 :: Topics Introduction (done) Performance Analysis (done) Single-Cycle Processor
More informationTechniques for Efficient Processing in Runahead Execution Engines
Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu
More informationSimultaneous Multithreading Architecture
Simultaneous Multithreading Architecture Virendra Singh Indian Institute of Science Bangalore Lecture-32 SE-273: Processor Design For most apps, most execution units lie idle For an 8-way superscalar.
More informationMultithreading: Exploiting Thread-Level Parallelism within a Processor
Multithreading: Exploiting Thread-Level Parallelism within a Processor Instruction-Level Parallelism (ILP): What we ve seen so far Wrap-up on multiple issue machines Beyond ILP Multithreading Advanced
More informationCPU Resource Reservation for Simultaneous Multi-Thread Systems. Hiroshi Inoue, Takao Moriyama, Yasushi Negishi, and Moriyoshi Ohara
RT676 Computer Science 13 pages Research Report September 12, 6 CPU Resource Reservation for Simultaneous Multi-Thread Systems Hiroshi Inoue, Takao Moriyama, Yasushi Negishi, and Moriyoshi Ohara IBM Research,
More informationA Study on Optimally Co-scheduling Jobs of Different Lengths on CMP
A Study on Optimally Co-scheduling Jobs of Different Lengths on CMP Kai Tian Kai Tian, Yunlian Jiang and Xipeng Shen Computer Science Department, College of William and Mary, Virginia, USA 5/18/2009 Cache
More informationAdvanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University
Advanced Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Modern Microprocessors More than just GHz CPU Clock Speed SPECint2000
More informationControl Hazards. Prediction
Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional
More informationMulti-core Programming Evolution
Multi-core Programming Evolution Based on slides from Intel Software ollege and Multi-ore Programming increasing performance through software multi-threading by Shameem Akhter and Jason Roberts, Evolution
More informationCSE502: Computer Architecture CSE 502: Computer Architecture
CSE 502: Computer Architecture Multi-{Socket,,Thread} Getting More Performance Keep pushing IPC and/or frequenecy Design complexity (time to market) Cooling (cost) Power delivery (cost) Possible, but too
More informationEECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?)
Evolution of Processor Performance So far we examined static & dynamic techniques to improve the performance of single-issue (scalar) pipelined CPU designs including: static & dynamic scheduling, static
More informationImproving Real-Time Performance on Multicore Platforms Using MemGuard
Improving Real-Time Performance on Multicore Platforms Using MemGuard Heechul Yun University of Kansas 2335 Irving hill Rd, Lawrence, KS heechul@ittc.ku.edu Abstract In this paper, we present a case-study
More informationSuperscalar Processors
Superscalar Processors Increasing pipeline length eventually leads to diminishing returns longer pipelines take longer to re-fill data and control hazards lead to increased overheads, removing any a performance
More informationEfficient Evaluation and Management of Temperature and Reliability for Multiprocessor Systems
Efficient Evaluation and Management of Temperature and Reliability for Multiprocessor Systems Ayse K. Coskun Electrical and Computer Engineering Department Boston University http://people.bu.edu/acoskun
More informationComputer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13
Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,
More informationStatistical Simulation of Chip Multiprocessors Running Multi-Program Workloads
Statistical Simulation of Chip Multiprocessors Running Multi-Program Workloads Davy Genbrugge Lieven Eeckhout ELIS Depment, Ghent University, Belgium Email: {dgenbrug,leeckhou}@elis.ugent.be Abstract This
More informationEfficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero
Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero The Nineteenth International Conference on Parallel Architectures and Compilation Techniques (PACT) 11-15
More informationLIMITS OF ILP. B649 Parallel Architectures and Programming
LIMITS OF ILP B649 Parallel Architectures and Programming A Perfect Processor Register renaming infinite number of registers hence, avoids all WAW and WAR hazards Branch prediction perfect prediction Jump
More informationPerformance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference
The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee
More informationA task migration algorithm for power management on heterogeneous multicore Manman Peng1, a, Wen Luo1, b
5th International Conference on Advanced Materials and Computer Science (ICAMCS 2016) A task migration algorithm for power management on heterogeneous multicore Manman Peng1, a, Wen Luo1, b 1 School of
More informationMeasurement-based Analysis of TCP/IP Processing Requirements
Measurement-based Analysis of TCP/IP Processing Requirements Srihari Makineni Ravi Iyer Communications Technology Lab Intel Corporation {srihari.makineni, ravishankar.iyer}@intel.com Abstract With the
More informationStaged Memory Scheduling
Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie Mellon University, *AMD Research June 12 th 2012 Executive Summary Observation:
More informationLecture: SMT, Cache Hierarchies. Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1)
Lecture: SMT, Cache Hierarchies Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized
More informationA Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures
A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures W.M. Roshan Weerasuriya and D.N. Ranasinghe University of Colombo School of Computing A Comparative
More informationProcessor (IV) - advanced ILP. Hwansoo Han
Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle
More informationA Study for Branch Predictors to Alleviate the Aliasing Problem
A Study for Branch Predictors to Alleviate the Aliasing Problem Tieling Xie, Robert Evans, and Yul Chu Electrical and Computer Engineering Department Mississippi State University chu@ece.msstate.edu Abstract
More informationOnline Course Evaluation. What we will do in the last week?
Online Course Evaluation Please fill in the online form The link will expire on April 30 (next Monday) So far 10 students have filled in the online form Thank you if you completed it. 1 What we will do
More informationQuantitative study of data caches on a multistreamed architecture. Abstract
Quantitative study of data caches on a multistreamed architecture Mario Nemirovsky University of California, Santa Barbara mario@ece.ucsb.edu Abstract Wayne Yamamoto Sun Microsystems, Inc. wayne.yamamoto@sun.com
More informationChapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed
More informationHyperthreading 3/25/2008. Hyperthreading. ftp://download.intel.com/technology/itj/2002/volume06issue01/art01_hyper/vol6iss1_art01.
Hyperthreading ftp://download.intel.com/technology/itj/2002/volume06issue01/art01_hyper/vol6iss1_art01.pdf Hyperthreading is a design that makes everybody concerned believe that they are actually using
More informationAdvanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University
Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:
More informationA New Approach to Determining the Time-Stamping Counter's Overhead on the Pentium Pro Processors *
A New Approach to Determining the Time-Stamping Counter's Overhead on the Pentium Pro Processors * Hsin-Ta Chiao and Shyan-Ming Yuan Department of Computer and Information Science National Chiao Tung University
More informationSimultaneous Multithreading Processor
Simultaneous Multithreading Processor Paper presented: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor James Lue Some slides are modified from http://hassan.shojania.com/pdf/smt_presentation.pdf
More informationEfficient Program Power Behavior Characterization
Efficient Program Power Behavior Characterization Chunling Hu Daniel A. Jiménez Ulrich Kremer Department of Computer Science {chunling, djimenez, uli}@cs.rutgers.edu Rutgers University, Piscataway, NJ
More informationABSTRACT. Integration of multiple processor cores on a single die, relatively constant die
ABSTRACT Title of dissertation: Symbiotic Subordinate Threading (SST) Rania Mameesh, Doctor of Philosophy, 2007 Dissertation directed by: Dr Manoj Franklin Electrical and Computer Engineering Department
More informationAdvanced Processor Architecture
Advanced Processor Architecture Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong
More informationMore on Conjunctive Selection Condition and Branch Prediction
More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused
More informationCS 426 Parallel Computing. Parallel Computing Platforms
CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:
More informationLecture 18: Core Design, Parallel Algos
Lecture 18: Core Design, Parallel Algos Today: Innovations for ILP, TLP, power and parallel algos Sign up for class presentations 1 SMT Pipeline Structure Front End Front End Front End Front End Private/
More informationA Predictable Simultaneous Multithreading Scheme for Hard Real-Time
A Predictable Simultaneous Multithreading Scheme for Hard Real-Time Jonathan Barre, Christine Rochange, and Pascal Sainrat Institut de Recherche en Informatique de Toulouse, Université detoulouse-cnrs,france
More informationControl Hazards. Branch Prediction
Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional
More informationThe Implications of Multi-core
The Implications of Multi- What I want to do today Given that everyone is heralding Multi- Is it really the Holy Grail? Will it cure cancer? A lot of misinformation has surfaced What multi- is and what
More information