Evaluating the Performance Impact of Hardware Thread Priorities in Simultaneous Multithreaded Processors using SPEC CPU2000

Mitesh R. Meswani and Patricia J. Teller
Department of Computer Science, University of Texas at El Paso
{mmeswani,

Abstract

Applications executing on Simultaneous Multithreaded (SMT) processors face interference from parallel execution contexts, which can significantly reduce overall system performance. This interference differs from that observed in SMP or other multiprocessor systems because resources are shared at a finer, cycle-level granularity. The IBM POWER5 processor provides user-adjustable hardware thread priorities to throttle SMT threads and, thus, improve overall system performance. This paper evaluates the significance of these hardware thread priorities on the execution times of pairs of co-scheduled processes. The experiments were run on a trace-driven simulator of the POWER5 processor using SPEC CPU2000 benchmarks. The results show that the performance differences between the best and worst cases range from 7% to 30%, and that the default priorities do not perform best most of the time. This result shows that it is worthwhile to gain a better understanding of the effect of priorities on application interference and to use this information to develop heuristics for automatically setting priorities in the future.

1. Introduction and Background

In the past few years, system architects have tackled the utilization problem at the processor level by taking advantage of an application's Instruction Level Parallelism (ILP). By exploiting ILP, multiple independent instructions can be executed in parallel, thus improving processor utilization. Design innovations like superscalar processors combined with out-of-order execution allow independent instructions to execute in parallel within the microarchitecture.
Modern superscalar processors with wide instruction fetch and issue windows further improve utilization by allowing the completion of more than one Instruction Per Cycle (IPC). Intelligent branch predictors allow speculative execution of instructions down the predicted path and improve IPC when the prediction is correct. However, even with ILP, processor utilization is low. This low utilization is due to several factors. For example, without correct branch prediction, an application's ILP is limited by the number of instructions in a basic block, and today's wide-issue, out-of-order, superscalar processors may have to resolve a branch every one or two clock cycles. In addition, the use of pointers and indirect memory references reduces the accuracy of branch predictions. Furthermore, the high cost of access to the various levels of memory makes the average stall costly.

Explicit hardware multithreading has emerged as a design choice to help solve the processor utilization problem. The literature [7] discusses different forms of explicit hardware multithreading, ranging from interleaved execution to overlapped execution. The basic concept in all of these forms is to increase the potential for ILP by considering instruction streams from multiple applications. In this paper we use the terms hardware threads, contexts, and threads interchangeably. Simultaneous multithreading (SMT) [6] allows multiple hardware threads or contexts to execute in parallel. These hardware threads share the resources of the microarchitecture and can execute simultaneously as long as there is no conflict for shared resources. SMT allows a processor to improve the IPC rate of the system at the cost of a drop in individual application performance.
Ideally, the execution time of a batch of programs executed on an SMT processor with n simultaneous hardware threads is bounded, on the lower end, by the execution time of the programs executed on a single processor and, on the upper end, by the execution time of the programs executed on an n-way Shared Memory Multiprocessor (SMP). An application executing on an SMT processor faces interference for shared resources from applications executing on the other hardware contexts. The applications can interfere in the microarchitecture/pipeline for functional units, in the memory subsystem due to working-set interference, or on the system bus due to competing bus transactions. This interference differs from that observed in SMP or other multiprocessor systems due to the finer, cycle-level granularity. It could result in situations where the overall system IPC becomes even lower than that achieved in single-threaded mode, which is not the ideal/expected case. The interference is highly dependent on the characteristics of the simultaneously executing applications and their patterns of shared-resource access.

SMT is available on IBM's POWER5 processor [8] and on Intel processors as Hyper-Threading [9]. The Sun Solaris 9 and 10, Windows XP, Windows Server 2003, Linux 2.6, and AIX 5L operating systems support SMT processors. They provide separate run queues for each executing hardware context and, for load-balancing purposes, consider each context as a separate processor. The IBM processor has a set of environment tuning knobs [2] to optimize SMT performance. These knobs allow selective disabling of SMT, fixed-time idling of a hardware context, and assignment of individual thread priorities to grant higher-priority threads more decode cycles and, thus, a larger share of the processing time. Some of these knobs are used by the operating system to implement fast spin locks. In this research we study the effect of hardware thread priorities on the execution time of co-scheduled application pairs.

The remainder of the paper is organized as follows. Section 2 discusses related research. The experimental platform, including the implementation of hardware thread priorities, and the workloads used in the study are described in Sections 3 and 4, respectively. Section 5 presents the method used for data collection, while Section 6 explains the experimental methodology. The results are presented and discussed in Section 7. Finally, Section 8 presents conclusions and future work.

2. Related Research

The SMT processor design was first introduced by Tullsen, et al. [6] as a simulated system driven by Alpha binaries. Snavely, et al. [10] studied an operating system (OS) process scheduler targeted at SMT processors; the SOS (sample, optimize, symbios) "symbiotic" scheduler collects performance data of various job mixes during the sample phase and then optimizes the schedule during the optimize phase.
The inherent scheduling algorithm, which considers all possible co-schedules, can take exponential time to execute. Nakajima, et al. [11] studied process scheduling optimization on a hyper-threaded processor with a dual-core package for non-multiprogrammed workloads. This work involved load balancing of the set of processes across processors to meet level-two cache and floating-point requirements. Bulpin, et al. [12] studied the design of a hyper-threading-aware process scheduler using SPEC CPU2000 [3] benchmarks on Pentium 4 processors. The scheduler keeps track of individual application performance as the ratio of its Single-Threaded performance to its SMT performance. The scheduler records every pair's co-performance, which is the sum of the individual performance ratios. When a scheduling decision has to be made for a thread, the co-performance data is used to calculate the dynamic priorities, and the application that works best with the co-scheduled application is selected. This scheduler improved overall system performance by 3.2% over the native Linux scheduler. In [12] dynamically calculated process priorities were used to guide the co-scheduling of applications and to achieve the least application interference. Our work extends this idea by exploring the dynamic changing of hardware, rather than software, thread priorities in order to enhance performance. Changing priorities at the hardware, rather than operating system, level provides finer-grained control of the impact that a process can have on the performance of a co-scheduled process. McGregor, et al. [13] studied the design of an OS process scheduler for Intel's Hyper-Threaded processor using NAS OpenMP benchmarks. This research evaluated process pairing and enabling/disabling Hyper-Threading using one of three metrics: L3 cache misses, bus transactions, or number of stall cycles. Using these metrics, they adjusted the number of running processes and the process pairing on a processor.
The study was limited to compute-intensive applications. Moreover, it did not keep track of processor affinity; with process migration, processes may not be able to take advantage of saved cache state. Our work differs from [13] by allowing co-scheduling, using different thread priorities, of applications that may have otherwise been executed in Single-Threaded mode. This allows increased opportunities to improve overall system performance. In our research we do not consider off-chip migration and, hence, the applications can take advantage of any saved on-chip cache state. In the future we will extend the analysis to memory-intensive and I/O-intensive classes of applications. Fedorova, et al. [14] designed an OS process scheduler for Chip Multithreading (CMT) systems. The CMT systems they studied are not superscalar and are targeted more at threaded applications. This work considered adjusting the compute time assigned to an application based on its fair share of the caches, e.g., in an n-threaded system, the performance obtained with a cache of 1/n-th the size. Using a detailed regression model, the expected IPC of a process is dynamically predicted based on it having its fair share of the cache. Then the compute time allocated to the process is adjusted to make up for deviation from this expected, fair IPC.

For a small class of workloads, these approaches have reduced interference not considered by the current naïve schedulers. However, policies based on simple heuristics [13] or customized models [14] cannot be extended to other application classes, whereas policies that consider all possible co-schedules [10] cannot scale to increasing numbers of hardware contexts. Our work, which uses SMT tuning knobs, such as those provided by IBM processors [2], along with a characterization of their effects on inter-application interference for various application classes, may yield a solution that could automatically set the knobs to attain the best possible performance. In addition, this methodology can scale to increasing numbers of hardware contexts.

3. Experimental Platform

The experiments and the data collection, explained in Sections 5 and 6, were carried out on the trace-driven IBM Performance Simulator for Linux on POWER [1]. This simulator provides a suite of performance models for various IBM POWER processor models, including the POWER5 processor. The simulator requires instruction traces as input and gives as output various metrics, such as Cycles Per Instruction (CPI), functional unit usage statistics, and an instruction histogram. The simulator for the POWER5 processor model supports both the Simultaneous Multithreading (SMT) mode and the Single-Threaded mode.

Figure 1. POWER5 Processor Layout

The POWER5 processor simulated and studied, shown in Figure 1, has two identical cores running at 1.65 GHz, each with a 32KB L1 instruction cache and a 64KB L1 data cache. Each pair of cores shares a unified 1.9MB L2 cache. The processor supports two SMT threads per core. This processor has hardware logic to throttle the interference of a thread. The processor's Dynamic Resource Balance (DRB) logic can be used to temporarily reduce the number of decode cycles allocated to a thread based on resource-usage thresholds, such as the number of outstanding L2 cache misses. In addition to the hardwired throttling, the processor supports software-controlled hardware thread priorities. Depending on the application's privilege level, there are eight software-adjustable priorities. The difference in priorities between two threads controls the proportion of decode cycles allocated to each hardware thread.
A higher-priority thread is allowed to decode more instructions, allowing its tasks to progress faster. In this paper we study the priorities that can be set by user applications, which are two, three, and four. Given X as the priority of thread 0 and Y as the priority of thread 1, equation 1 shows the formula used to calculate the decode cycle allocation for the lower-priority thread; the decode cycle allocation for the higher-priority thread is obtained by subtracting this result from one.

    decode fraction (lower-priority thread) = 1 / 2^(|X - Y| + 1)    (1)

The decode cycle allocations associated with the possible thread priorities are shown in Table 1; a complete list is available in the related IBM Red Book [2]. The processor assigns the default priority of 4 to both threads.

Thread 0 Priority | Thread 1 Priority | Thread 0 Decode | Thread 1 Decode
        4         |         2         |       7/8       |       1/8
        4         |         3         |       3/4       |       1/4
        4         |         4         |       1/2       |       1/2
        3         |         4         |       1/4       |       3/4
        2         |         4         |       1/8       |       7/8

Table 1. Effect of Thread Priorities on Decode Cycle Allocation

4. Workload

We use 14 workloads, each composed of two applications from the industry-standard SPEC CPU2000 [3] benchmark suite, to measure the performance of the CPU and memory subsystem. The applications used include the floating-point benchmarks swim, art, mesa, and mgrid, and the integer benchmarks gzip, vpr, twolf, and vortex. The application swim is a weather prediction program, while art uses neural networks to perform image recognition. mesa is a 3-D graphics library and mgrid is a multi-grid solver. The application gzip implements data compression algorithms, while vpr is an FPGA circuit placement and routing program. vortex is a single-user, object-oriented database transaction program and twolf is a placement and global routing application used in the process of creating the lithography artwork needed for the production of microchips.

5. Data Collection

To drive the simulations and, thus, collect performance data, we first had to capture instruction traces for the applications executing on a POWER5 processor.
Next, the simulator was configured to capture the information needed to understand the effect of hardware thread priorities on the execution time of a pair of co-scheduled applications. The mechanism used to capture traces is explained in Section 5.1 and the configuration of the simulator is briefly discussed in Section 5.2.
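The decode-cycle allocation defined by equation 1 and Table 1 can be sketched in a few lines of Python; this helper is ours, for illustration only, and is not part of the simulator or its configuration:

```python
# Decode-cycle shares for two POWER5 SMT threads with user-settable
# priorities (2-4). Per equation 1, the lower-priority thread receives
# 1 / 2^(|X - Y| + 1) of the decode cycles; the other thread gets the rest.
# Illustrative helper only, not part of the IBM simulator.

def decode_share(x: int, y: int) -> tuple[float, float]:
    """Return (thread0_share, thread1_share) for priorities x and y."""
    low = 1.0 / 2 ** (abs(x - y) + 1)  # equation 1: lower-priority share
    high = 1.0 - low                   # remainder goes to the other thread
    return (high, low) if x >= y else (low, high)

# Reproduce Table 1's five distinct co-schedules:
for x, y in [(4, 2), (4, 3), (4, 4), (3, 4), (2, 4)]:
    print((x, y), decode_share(x, y))
```

Note that equal priorities need no special case: with X = Y, equation 1 yields 1/2 for both threads, matching the default 4,4 row of Table 1.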

5.1. Instruction Tracing

The Linux tool ITrace [5] was used to capture the instruction traces of unmodified application binaries. The traces are captured in a main-memory buffer, the size of which is limited to 200 MB per CPU. The applications were executed on one processor of a two-processor POWER5 machine running the Linux kernel. The system has 5GB of main memory and a 70GB hard disk. The study by Gomez, et al. [4] showed that the four integer benchmarks reach a stable CPI after the first 50 million instructions. Accordingly, we allow a warm-up time of one minute for each application before capturing traces. The traces for the floating-point benchmarks are captured after executing each application for half of its total execution time. The trace tool captured between 13 million and 28 million instructions for each benchmark.

5.2. Simulator Configuration

The trace-driven simulator can be configured to use either the SMT mode or the Single-Threaded mode. We use the SMT mode for executing the co-scheduled application pairs. The simulator in SMT mode exits when one of the instruction traces completes. The leftover instructions from the incomplete trace are executed in Single-Threaded mode. This is done because the goal of the research is to study the effect of thread priorities on the execution time required to complete both traces. Accordingly, the total time for executing both traces is computed by adding the times from the SMT and Single-Threaded modes.

6. Experimental Methodology

The hypothesis of our research is that hardware thread priorities have a significant performance impact on the execution times of co-scheduled application pairs. Consequently, the goal of our experiments is to verify this hypothesis by comparing the execution times of different application co-schedules. While co-scheduling may cause individual applications to slow down, the time to execute the co-scheduled pair of applications may be reduced.
The process of creating co-scheduled pairs is described in Section 6.1, while Section 6.2 describes the performance metrics used for evaluation.

6.1. Application Co-schedules

For a given pair of applications, each can be assigned to one of the two threads of the simulated processor. Since each thread has three possible priorities, this gives a total of 18 combinations. However, the number of combinations can be reduced since the hardware threads are identical, i.e., executing application X with priority i on thread 0 and application Y with priority j on thread 1 is the same as executing application X with priority i on thread 1 and application Y with priority j on thread 0. This reduces the possible combinations to nine. Moreover, from equation 1, since only the difference in priority matters, the number of distinct pairs reduces to five. The co-schedules for one application pair are shown in Table 2. We study pairs where both applications are either integer or floating-point. Additionally, we study two mixed pairs where one application is integer and one is floating-point. Thus, there are 14 pairs, giving a total of 70 different co-schedules. The different pairings are shown in Table 3.

Thread 0 Priority (Executing Application X) | Thread 1 Priority (Executing Application Y)
                     4                      |                      2
                     4                      |                      3
                     4                      |                      4
                     3                      |                      4
                     2                      |                      4

Table 2. Co-schedules for Application Pairs X, Y

Integer Application Pairs | Floating-Point Application Pairs | Mixed Pairs
gzip, twolf               | swim, mgrid                      | swim, gzip
gzip, vortex              | swim, mesa                       | art, vpr
gzip, vpr                 | swim, art                        |
twolf, vortex             | mgrid, mesa                      |
twolf, vpr                | mgrid, art                       |
vortex, vpr               | mesa, art                        |

Table 3. Application Pairs X, Y

6.2. Performance Metrics

The simulator outputs the total number of cycles required to execute a pair of applications, with defined thread priorities, in SMT mode. Additionally, for each trace, it outputs the number of completed instructions. As explained in Section 5.2, once one trace is completed, the remaining instructions from the other trace are executed in Single-Threaded mode.
The total number of cycles required to execute a co-scheduled pair of applications is computed by adding the total cycles in SMT mode and in Single-Threaded mode. The performance metric used to evaluate whether thread priorities have a significant performance impact on the execution of a pair of applications is computed by taking the difference between the best-case and worst-case total number of cycles reported for the five different co-schedules investigated (see Table 2 above). We also compute the difference between the best case and the default case, i.e., with each thread assigned a priority of 4.
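The cycle accounting and comparison metric described above can be sketched as follows. The function names are ours, the sample inputs are the best-case and worst-case cycle counts reported for the (gzip, twolf) pair in Table 4, and we assume the percentage difference is taken relative to the best case:

```python
# Sketch of the paper's cycle accounting and comparison metric.
# Helper names are ours; using the best case as the percentage base
# is an assumption (it reproduces the 7.39%-29.45% range in Section 7).

def total_cycles(smt_cycles: int, leftover_st_cycles: int) -> int:
    """Total cost of a co-schedule: SMT-mode cycles until one trace
    finishes, plus Single-Threaded cycles for the leftover trace."""
    return smt_cycles + leftover_st_cycles

def pct_difference(best: int, other: int) -> float:
    """Percentage by which `other` exceeds `best`."""
    return 100.0 * (other - best) / best

best, worst = 41_107_097, 47_015_237  # gzip, twolf (Table 4)
print(f"{pct_difference(best, worst):.2f}%")  # prints 14.37%
```

Applied to the (twolf, vortex) and (swim, gzip) rows of Table 4, the same function yields 7.39% and 29.45%, the endpoints of the range reported in Section 7.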

Application Pair | Best Total Cycles | Worst Total Cycles | Default Total Cycles | % Diff. Best vs. Worst | % Diff. Best vs. Default | Best Priorities | Worst Priorities
gzip, twolf      | 41,107,097 | 47,015,237 | 43,452,…   | 14.37 | … | 4,3 | 2,4
gzip, vortex     | 38,449,210 | 43,138,538 | 39,974,…   | 12.20 | … | 4,2 | 2,4
gzip, vpr        | 39,674,997 | 44,251,721 | 41,818,…   | 11.54 | … | 4,3 | 2,4
twolf, vortex    | 35,711,356 | 38,348,825 | 35,711,356 |  7.39 | … | 4,4 | 2,4
twolf, vpr       | 37,301,850 | 40,711,131 | 37,301,850 |  9.14 | … | 4,4 | 4,2
vortex, vpr      | 34,172,959 | 37,054,265 | 34,529,…   |  8.43 | … | …,4 | 4,2
swim, mgrid      | 34,809,876 | 38,236,266 | 34,809,876 |  9.84 | … | 4,4 | 4,2
swim, mesa       | 32,221,419 | 39,797,984 | 33,063,…   | 23.51 | … | …   | …
swim, art        | 29,793,892 | 33,311,936 | 29,793,892 | 11.81 | … | 4,4 | 4,2
mgrid, mesa      | 30,911,156 | 37,952,726 | 32,205,…   | 22.78 | … | …,4 | 4,2
mgrid, art       | 27,109,275 | 31,520,482 | 27,109,275 | 16.27 | … | 4,4 | 4,2
mesa, art        | 27,870,271 | 34,869,714 | 28,711,…   | 25.11 | … | 4,2 | 2,4
swim, gzip       | 39,334,954 | 50,919,210 | 44,438,…   | 29.45 | … | …,4 | 4,2
art, vpr         | 31,965,969 | 36,215,448 | 34,149,…   | 13.29 | … | …,4 | 4,2

Table 4. Difference in Performance between Best and Worst Cases and between Best and Default Cases, and Thread Priorities for Best and Worst Cases

7. Results

Column five of Table 4 shows the percentage difference between the best-case and worst-case number of cycles required for the execution of each application pair under the various priority co-schedules shown in Table 1. As can be seen from these results, the performance differences between the best and worst cases are between 7.39% and 29.45%. Columns seven and eight of Table 4 give the thread priorities associated with the best- and worst-case performance, respectively, for each application pair. The first column shows the co-scheduled pair X, Y; the seventh and eighth columns show the thread priorities as i, j, such that i is the priority of the thread executing application X and j is the priority of the thread executing application Y. As can be seen from this table, the thread priorities for the best- and worst-case performance depend on the applications that are co-scheduled.
For nine of the fourteen application pairs, the best-case performance is not that associated with the case in which both threads have the default priority of 4. Column six of Table 4 shows the difference between the best-case number of cycles required for the execution of each application pair under the various priority co-schedules shown in Table 1 and the default case, i.e., with each thread having priority 4. The application pairs (twolf, vortex), (twolf, vpr), (swim, mgrid), (swim, art), and (mgrid, art) experience best performance under the default thread-priority settings, whereas the remaining nine application pairs experience best performance under different priority settings and, for them, the differences in performance between the settings that result in best performance and the default setting range from 1.03% to 12.97%. Although, with the exception of the pair (swim, gzip), this difference is small, it will be interesting to discover the performance differences for other applications.

Application Pair | Single-Threaded Total Cycles | % Diff. between Worst and Single-Threaded Cases
swim, mesa       | 38,810,…                     | …
swim, art        | 33,001,…                     | …
swim, gzip       | 50,669,…                     | …

Table 5. Difference between Single-Threaded and Worst-Case Performance

Table 5 shows application pairs for which the worst-case performance (the total number of cycles to execute the application pair) is worse than that of the Single-Threaded case, i.e., worse than the sum of the number of cycles required for each individual application in Single-Threaded mode without SMT on. The negative numbers in the last column of Table 5 show the percentage of extra cycles that the worst cases required relative to the Single-Threaded cases.

The performance differences presented in this section show that application interference on SMT processors depends on both individual application characteristics and on SMT priorities. As shown by our study, the default SMT priorities do not always yield the best performance, and, in the best cases, the effect of application interference can be reduced by intelligently setting thread priorities. In addition, as shown in this section, in the worst case, hardware thread priorities can cause inter-application interference to reduce performance to less than that of single-threaded execution. Hence, a detailed understanding of the performance impact of thread priorities can garner performance rewards while avoiding worst-case scenarios.

8. Conclusions and Future Work

This work shows that hardware thread priorities on SMT processors can have a significant impact on overall performance and that the default priorities assigned to hardware threads on a POWER5 processor do not always yield the best performance. The differences between the best-case and worst-case number of cycles required for the execution of two applications using five different pairs of thread priorities show that a detailed understanding of the effect of SMT priorities on inter-application interference could result in significant performance gains. We will continue to expand this work by using the simulator's functional unit usage data to characterize the effect of thread priorities on inter-application interference. In the future, we intend to expand the applications studied, adding I/O- and memory-intensive applications, and to implement dynamic modification of hardware thread priorities in real systems. Finally, we aim to develop heuristics that will allow us to build models that can, given an application pair's performance behavior, effectively set hardware thread priorities dynamically. Besides thread priorities, there are tuning knobs like SMT Snooze and SMT mode off that are of interest to us as well.
Acknowledgements

We would like to thank John Griswell from IBM for providing the simulator binaries. In addition, we want to acknowledge that this work was supported by an IBM Faculty Award, an IBM SUR (Supported University Research) grant, and a University of Texas STAR (Science and Technology Acquisition and Retention) Program award, which were awarded to Dr. Teller.

9. References

[1] alphaWorks: IBM Performance Simulator for Linux on POWER: Overview, accessed 08/15/2006.
[2] Advanced POWER Virtualization on IBM eServer p5 Servers: Architecture and Performance Considerations, SG245768.
[3] SPEC - Standard Performance Evaluation Corporation, accessed 08/15/2006.
[4] I. Gómez, L. Piñuel, M. Prieto, and F. Tirado, "Analysis of Simulation-adapted SPEC 2000 Benchmarks," ACM SIGARCH Computer Architecture News, 30(4):4-10, September 2002.
[5] ITrace for Linux/PPC.
[6] D. M. Tullsen, S. J. Eggers, and H. M. Levy, "Simultaneous Multithreading: Maximizing On-chip Parallelism," Proceedings of the 22nd International Symposium on Computer Architecture (ISCA '95), IEEE Computer Society, June 1995.
[7] T. Ungerer, B. Robic, and J. Silc, "A Survey of Processors with Explicit Multithreading," ACM Computing Surveys (CSUR), 35(1):29-63, March 2003.
[8] R. Kalla, B. Sinharoy, and J. M. Tendler, "IBM POWER5 Chip: a Dual-Core Multithreaded Processor," IEEE Micro, 24(2):40-47, March 2004.
[9] G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, and P. Roussel, "The Microarchitecture of the Pentium 4 Processor," Intel Technology Journal, 5(1):1-13, February 2001.
[10] A. Snavely and D. M. Tullsen, "Symbiotic Job Scheduling for a Simultaneous Multithreading Processor," Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '00), ACM Press, November 2000.
[11] J. Nakajima and V. Pallipadi, "Enhancements for Hyper-Threading Technology in the Operating System: Seeking the Optimal Scheduling," Proceedings of the 2nd Workshop on Industrial Experiences with Systems Software, The USENIX Association, December 2002.
[12] J. Bulpin and I. Pratt, "Hyper-Threading Aware Process Scheduling Heuristics," Proceedings of the 2005 USENIX Annual Technical Conference, April 2005.
[13] R. L. McGregor, C. D. Antonopoulos, and D. S. Nikolopoulos, "Scheduling Algorithms for Effective Thread Pairing on Hybrid Multiprocessors," Proceedings of the Nineteenth International Parallel and Distributed Processing Symposium, Denver, CO, April 2005.
[14] A. Fedorova, M. Seltzer, and M. D. Smith, "A Non-Work-Conserving Operating System Scheduler for SMT Processors," Proceedings of the Workshop on the Interaction between Operating Systems and Computer Architecture (WIOSCA), in conjunction with ISCA-33, Boston, MA, June 2006.


More information

What SMT can do for You. John Hague, IBM Consultant Oct 06

What SMT can do for You. John Hague, IBM Consultant Oct 06 What SMT can do for ou John Hague, IBM Consultant Oct 06 100.000 European Centre for Medium Range Weather Forecasting (ECMWF): Growth in HPC performance 10.000 teraflops sustained 1.000 0.100 0.010 VPP700

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning

A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning G. Edward Suh, Srinivas Devadas, and Larry Rudolph Laboratory for Computer Science MIT Cambridge, MA 239 suh,devadas,rudolph

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

Computer Architecture Lecture 24: Memory Scheduling

Computer Architecture Lecture 24: Memory Scheduling 18-447 Computer Architecture Lecture 24: Memory Scheduling Prof. Onur Mutlu Presented by Justin Meza Carnegie Mellon University Spring 2014, 3/31/2014 Last Two Lectures Main Memory Organization and DRAM

More information

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004 ABSTRACT Title of thesis: STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS Chungsoo Lim, Master of Science, 2004 Thesis directed by: Professor Manoj Franklin Department of Electrical

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Enhancements to Linux I/O Scheduling

Enhancements to Linux I/O Scheduling Enhancements to Linux I/O Scheduling Seetharami R. Seelam, UTEP Rodrigo Romero, UTEP Patricia J. Teller, UTEP William Buros, IBM-Austin 21 July 2005 Linux Symposium 2005 1 Introduction Dynamic Adaptability

More information

Ron Kalla, Balaram Sinharoy, Joel Tendler IBM Systems Group

Ron Kalla, Balaram Sinharoy, Joel Tendler IBM Systems Group Simultaneous Multi-threading Implementation in POWER5 -- IBM's Next Generation POWER Microprocessor Ron Kalla, Balaram Sinharoy, Joel Tendler IBM Systems Group Outline Motivation Background Threading Fundamentals

More information

Exploring different level of parallelism Instruction-level parallelism (ILP): how many of the operations/instructions in a computer program can be performed simultaneously 1. e = a + b 2. f = c + d 3.

More information

Chip-Multithreading Systems Need A New Operating Systems Scheduler

Chip-Multithreading Systems Need A New Operating Systems Scheduler Chip-Multithreading Systems Need A New Operating Systems Scheduler Alexandra Fedorova Christopher Small Daniel Nussbaum Margo Seltzer Harvard University, Sun Microsystems Sun Microsystems Sun Microsystems

More information

One-Level Cache Memory Design for Scalable SMT Architectures

One-Level Cache Memory Design for Scalable SMT Architectures One-Level Cache Design for Scalable SMT Architectures Muhamed F. Mudawar and John R. Wani Computer Science Department The American University in Cairo mudawwar@aucegypt.edu rubena@aucegypt.edu Abstract

More information

Exploitation of instruction level parallelism

Exploitation of instruction level parallelism Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering

More information

Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review

Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Bijay K.Paikaray Debabala Swain Dept. of CSE, CUTM Dept. of CSE, CUTM Bhubaneswer, India Bhubaneswer, India

More information

Scheduling the Intel Core i7

Scheduling the Intel Core i7 Third Year Project Report University of Manchester SCHOOL OF COMPUTER SCIENCE Scheduling the Intel Core i7 Ibrahim Alsuheabani Degree Programme: BSc Software Engineering Supervisor: Prof. Alasdair Rawsthorne

More information

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Based on papers by: A.Fedorova, M.Seltzer, C.Small, and D.Nussbaum Pisa November 6, 2006 Multithreaded Chip

More information

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji Beyond ILP Hemanth M Bharathan Balaji Multiscalar Processors Gurindar S Sohi Scott E Breach T N Vijaykumar Control Flow Graph (CFG) Each node is a basic block in graph CFG divided into a collection of

More information

Multithreading: Exploiting Thread-Level Parallelism within a Processor

Multithreading: Exploiting Thread-Level Parallelism within a Processor Multithreading: Exploiting Thread-Level Parallelism within a Processor Instruction-Level Parallelism (ILP): What we ve seen so far Wrap-up on multiple issue machines Beyond ILP Multithreading Advanced

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

IMPLEMENTING HARDWARE MULTITHREADING IN A VLIW ARCHITECTURE

IMPLEMENTING HARDWARE MULTITHREADING IN A VLIW ARCHITECTURE IMPLEMENTING HARDWARE MULTITHREADING IN A VLIW ARCHITECTURE Stephan Suijkerbuijk and Ben H.H. Juurlink Computer Engineering Laboratory Faculty of Electrical Engineering, Mathematics and Computer Science

More information

Beyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy

Beyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy EE482: Advanced Computer Organization Lecture #13 Processor Architecture Stanford University Handout Date??? Beyond ILP II: SMT and variants Lecture #13: Wednesday, 10 May 2000 Lecturer: Anamaya Sullery

More information

Methods for Modeling Resource Contention on Simultaneous Multithreading Processors

Methods for Modeling Resource Contention on Simultaneous Multithreading Processors Methods for Modeling Resource Contention on Simultaneous Multithreading Processors Tipp Moseley, Joshua L. Kihm, Daniel A. Connors, and Dirk Grunwald Department of Computer Science Department of Electrical

More information

Simultaneous Multithreading Architecture

Simultaneous Multithreading Architecture Simultaneous Multithreading Architecture Virendra Singh Indian Institute of Science Bangalore Lecture-32 SE-273: Processor Design For most apps, most execution units lie idle For an 8-way superscalar.

More information

Database Workload. from additional misses in this already memory-intensive databases? interference could be a problem) Key question:

Database Workload. from additional misses in this already memory-intensive databases? interference could be a problem) Key question: Database Workload + Low throughput (0.8 IPC on an 8-wide superscalar. 1/4 of SPEC) + Naturally threaded (and widely used) application - Already high cache miss rates on a single-threaded machine (destructive

More information

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2. Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Problem 0 Consider the following LSQ and when operands are

More information

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor 1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic

More information

Multithreaded Processors. Department of Electrical Engineering Stanford University

Multithreaded Processors. Department of Electrical Engineering Stanford University Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread

More information

Is Intel s Hyper-Threading Technology Worth the Extra Money to the Average User?

Is Intel s Hyper-Threading Technology Worth the Extra Money to the Average User? Is Intel s Hyper-Threading Technology Worth the Extra Money to the Average User? Andrew Murray Villanova University 800 Lancaster Avenue, Villanova, PA, 19085 United States of America ABSTRACT In the mid-1990

More information

Simultaneous Multithreading (SMT)

Simultaneous Multithreading (SMT) Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide issue

More information

A Simple Model for Estimating Power Consumption of a Multicore Server System

A Simple Model for Estimating Power Consumption of a Multicore Server System , pp.153-160 http://dx.doi.org/10.14257/ijmue.2014.9.2.15 A Simple Model for Estimating Power Consumption of a Multicore Server System Minjoong Kim, Yoondeok Ju, Jinseok Chae and Moonju Park School of

More information

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2. Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Problem 1 Consider the following LSQ and when operands are

More information

CS 152 Computer Architecture and Engineering. Lecture 18: Multithreading

CS 152 Computer Architecture and Engineering. Lecture 18: Multithreading CS 152 Computer Architecture and Engineering Lecture 18: Multithreading Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste

More information

Parallel Processing SIMD, Vector and GPU s cont.

Parallel Processing SIMD, Vector and GPU s cont. Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP

More information

Multithreading Processors and Static Optimization Review. Adapted from Bhuyan, Patterson, Eggers, probably others

Multithreading Processors and Static Optimization Review. Adapted from Bhuyan, Patterson, Eggers, probably others Multithreading Processors and Static Optimization Review Adapted from Bhuyan, Patterson, Eggers, probably others Schedule of things to do By Wednesday the 9 th at 9pm Please send a milestone report (as

More information

Dynamically Controlled Resource Allocation in SMT Processors

Dynamically Controlled Resource Allocation in SMT Processors Dynamically Controlled Resource Allocation in SMT Processors Francisco J. Cazorla, Alex Ramirez, Mateo Valero Departament d Arquitectura de Computadors Universitat Politècnica de Catalunya Jordi Girona

More information

Chapter-5 Memory Hierarchy Design

Chapter-5 Memory Hierarchy Design Chapter-5 Memory Hierarchy Design Unlimited amount of fast memory - Economical solution is memory hierarchy - Locality - Cost performance Principle of locality - most programs do not access all code or

More information

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example 1 Which is the best? 2 Lecture 05 Performance Metrics and Benchmarking 3 Measuring & Improving Performance (if planes were computers...) Plane People Range (miles) Speed (mph) Avg. Cost (millions) Passenger*Miles

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Hyperthreading Technology

Hyperthreading Technology Hyperthreading Technology Aleksandar Milenkovic Electrical and Computer Engineering Department University of Alabama in Huntsville milenka@ece.uah.edu www.ece.uah.edu/~milenka/ Outline What is hyperthreading?

More information

Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero

Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero The Nineteenth International Conference on Parallel Architectures and Compilation Techniques (PACT) 11-15

More information

15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 20: Main Memory II Prof. Onur Mutlu Carnegie Mellon University Today SRAM vs. DRAM Interleaving/Banking DRAM Microarchitecture Memory controller Memory buses

More information

Lecture 1: Introduction

Lecture 1: Introduction Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline

More information

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt High Performance Systems Group

More information

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Year & Semester : III/VI Section : CSE-1 & CSE-2 Subject Code : CS2354 Subject Name : Advanced Computer Architecture Degree & Branch : B.E C.S.E. UNIT-1 1.

More information

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Ravi Rajwar and Jim Goodman University of Wisconsin-Madison International Symposium on Microarchitecture, Dec. 2001 Funding

More information

Exploring the Effects of Hyperthreading on Scientific Applications

Exploring the Effects of Hyperthreading on Scientific Applications Exploring the Effects of Hyperthreading on Scientific Applications by Kent Milfeld milfeld@tacc.utexas.edu edu Kent Milfeld, Chona Guiang, Avijit Purkayastha, Jay Boisseau TEXAS ADVANCED COMPUTING CENTER

More information

Lecture 14: Multithreading

Lecture 14: Multithreading CS 152 Computer Architecture and Engineering Lecture 14: Multithreading John Wawrzynek Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~johnw

More information

A task migration algorithm for power management on heterogeneous multicore Manman Peng1, a, Wen Luo1, b

A task migration algorithm for power management on heterogeneous multicore Manman Peng1, a, Wen Luo1, b 5th International Conference on Advanced Materials and Computer Science (ICAMCS 2016) A task migration algorithm for power management on heterogeneous multicore Manman Peng1, a, Wen Luo1, b 1 School of

More information

Efficient Evaluation and Management of Temperature and Reliability for Multiprocessor Systems

Efficient Evaluation and Management of Temperature and Reliability for Multiprocessor Systems Efficient Evaluation and Management of Temperature and Reliability for Multiprocessor Systems Ayse K. Coskun Electrical and Computer Engineering Department Boston University http://people.bu.edu/acoskun

More information

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Elec525 Spring 2005 Raj Bandyopadhyay, Mandy Liu, Nico Peña Hypothesis Some modern processors use a prefetching unit at the front-end

More information

Prefetch Threads for Database Operations on a Simultaneous Multi-threaded Processor

Prefetch Threads for Database Operations on a Simultaneous Multi-threaded Processor Prefetch Threads for Database Operations on a Simultaneous Multi-threaded Processor Kostas Papadopoulos December 11, 2005 Abstract Simultaneous Multi-threading (SMT) has been developed to increase instruction

More information

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 18 Multicore Computers

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 18 Multicore Computers William Stallings Computer Organization and Architecture 8 th Edition Chapter 18 Multicore Computers Hardware Performance Issues Microprocessors have seen an exponential increase in performance Improved

More information

Techniques for Efficient Processing in Runahead Execution Engines

Techniques for Efficient Processing in Runahead Execution Engines Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu

More information

Understanding The Effects of Wrong-path Memory References on Processor Performance

Understanding The Effects of Wrong-path Memory References on Processor Performance Understanding The Effects of Wrong-path Memory References on Processor Performance Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt The University of Texas at Austin 2 Motivation Processors spend

More information

Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-742 Fall 2012, Parallel Computer Architecture, Lecture 9: Multithreading

More information

Advanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Advanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Advanced Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Modern Microprocessors More than just GHz CPU Clock Speed SPECint2000

More information

CEC 450 Real-Time Systems

CEC 450 Real-Time Systems CEC 450 Real-Time Systems Lecture 6 Accounting for I/O Latency September 28, 2015 Sam Siewert A Service Release and Response C i WCET Input/Output Latency Interference Time Response Time = Time Actuation

More information

An In-order SMT Architecture with Static Resource Partitioning for Consumer Applications

An In-order SMT Architecture with Static Resource Partitioning for Consumer Applications An In-order SMT Architecture with Static Resource Partitioning for Consumer Applications Byung In Moon, Hongil Yoon, Ilgu Yun, and Sungho Kang Yonsei University, 134 Shinchon-dong, Seodaemoon-gu, Seoul

More information

Integrated CPU and Cache Power Management in Multiple Clock Domain Processors

Integrated CPU and Cache Power Management in Multiple Clock Domain Processors Integrated CPU and Cache Power Management in Multiple Clock Domain Processors Nevine AbouGhazaleh, Bruce Childers, Daniel Mossé & Rami Melhem Department of Computer Science University of Pittsburgh HiPEAC

More information

CSAIL. Computer Science and Artificial Intelligence Laboratory. Massachusetts Institute of Technology

CSAIL. Computer Science and Artificial Intelligence Laboratory. Massachusetts Institute of Technology CSAIL Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Dynamic Cache Partioning for Simultaneous Multithreading Systems Ed Suh, Larry Rudolph, Srinivas Devadas

More information

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

ASSEMBLY LANGUAGE MACHINE ORGANIZATION ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction

More information

SPECULATIVE MULTITHREADED ARCHITECTURES

SPECULATIVE MULTITHREADED ARCHITECTURES 2 SPECULATIVE MULTITHREADED ARCHITECTURES In this Chapter, the execution model of the speculative multithreading paradigm is presented. This execution model is based on the identification of pairs of instructions

More information

CSE 502 Graduate Computer Architecture. Lec 11 Simultaneous Multithreading

CSE 502 Graduate Computer Architecture. Lec 11 Simultaneous Multithreading CSE 502 Graduate Computer Architecture Lec 11 Simultaneous Multithreading Larry Wittie Computer Science, StonyBrook University http://www.cs.sunysb.edu/~cse502 and ~lw Slides adapted from David Patterson,

More information

IBM's POWER5 Micro Processor Design and Methodology

IBM's POWER5 Micro Processor Design and Methodology IBM's POWER5 Micro Processor Design and Methodology Ron Kalla IBM Systems Group Outline POWER5 Overview Design Process Power POWER Server Roadmap 2001 POWER4 2002-3 POWER4+ 2004* POWER5 2005* POWER5+ 2006*

More information

Simultaneous Multithreading and the Case for Chip Multiprocessing

Simultaneous Multithreading and the Case for Chip Multiprocessing Simultaneous Multithreading and the Case for Chip Multiprocessing John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 522 Lecture 2 10 January 2019 Microprocessor Architecture

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,

More information

Dynamic Cache Partitioning for CMP/SMT Systems

Dynamic Cache Partitioning for CMP/SMT Systems Dynamic Cache Partitioning for CMP/SMT Systems G. E. Suh (suh@mit.edu), L. Rudolph (rudolph@mit.edu) and S. Devadas (devadas@mit.edu) Massachusetts Institute of Technology Abstract. This paper proposes

More information

Adaptive Cache Partitioning on a Composite Core

Adaptive Cache Partitioning on a Composite Core Adaptive Cache Partitioning on a Composite Core Jiecao Yu, Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Scott Mahlke Computer Engineering Lab University of Michigan, Ann Arbor, MI {jiecaoyu, lukefahr,

More information

Simultaneous Multithreading: a Platform for Next Generation Processors

Simultaneous Multithreading: a Platform for Next Generation Processors Simultaneous Multithreading: a Platform for Next Generation Processors Paulo Alexandre Vilarinho Assis Departamento de Informática, Universidade do Minho 4710 057 Braga, Portugal paulo.assis@bragatel.pt

More information

Main Points of the Computer Organization and System Software Module

Main Points of the Computer Organization and System Software Module Main Points of the Computer Organization and System Software Module You can find below the topics we have covered during the COSS module. Reading the relevant parts of the textbooks is essential for a

More information

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,

More information

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:

More information

Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers

Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers Microsoft ssri@microsoft.com Santhosh Srinath Onur Mutlu Hyesoon Kim Yale N. Patt Microsoft Research

More information

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor Single-Issue Processor (AKA Scalar Processor) CPI IPC 1 - One At Best 1 - One At best 1 From Single-Issue to: AKS Scalar Processors CPI < 1? How? Multiple issue processors: VLIW (Very Long Instruction

More information

CS 654 Computer Architecture Summary. Peter Kemper

CS 654 Computer Architecture Summary. Peter Kemper CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining

More information

15-740/ Computer Architecture Lecture 10: Runahead and MLP. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 10: Runahead and MLP. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 10: Runahead and MLP Prof. Onur Mutlu Carnegie Mellon University Last Time Issues in Out-of-order execution Buffer decoupling Register alias tables Physical

More information

Thread to Strand Binding of Parallel Network Applications in Massive Multi-Threaded Systems

Thread to Strand Binding of Parallel Network Applications in Massive Multi-Threaded Systems Thread to Strand Binding of Parallel Network Applications in Massive Multi-Threaded Systems Petar Radojković Vladimir Čakarević Javier Verdú Alex Pajuelo Francisco J. Cazorla Mario Nemirovsky Mateo Valero

More information

Balancing Thoughput and Fairness in SMT Processors

Balancing Thoughput and Fairness in SMT Processors Balancing Thoughput and Fairness in SMT Processors Kun Luo Jayanth Gummaraju Manoj Franklin ECE Department Dept of Electrical Engineering ECE Department and UMACS University of Maryland Stanford University

More information

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5)

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5) Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5) ILP vs. Parallel Computers Dynamic Scheduling (Section 3.4, 3.5) Dynamic Branch Prediction (Section 3.3, 3.9, and Appendix C) Hardware

More information

Advanced Processor Architecture

Advanced Processor Architecture Advanced Processor Architecture Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong

More information

Superscalar Processors

Superscalar Processors Superscalar Processors Increasing pipeline length eventually leads to diminishing returns longer pipelines take longer to re-fill data and control hazards lead to increased overheads, removing any a performance

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

CS 152 Computer Architecture and Engineering

CS 152 Computer Architecture and Engineering CS 152 Computer Architecture and Engineering Lecture 19 Advanced Processors III 2006-11-2 John Lazzaro (www.cs.berkeley.edu/~lazzaro) TAs: Udam Saini and Jue Sun www-inst.eecs.berkeley.edu/~cs152/ 1 Last

More information

CSAIL. Computer Science and Artificial Intelligence Laboratory. Massachusetts Institute of Technology

CSAIL. Computer Science and Artificial Intelligence Laboratory. Massachusetts Institute of Technology CSAIL Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Dynamic Cache Partitioning for Simultaneous Multithreading Systems Ed Suh, Larry Rudolph, Srini Devadas

More information