Applications of Thread Prioritization in SMT Processors


Steven E. Raasch & Steven K. Reinhardt
Electrical Engineering and Computer Science Department
The University of Michigan
1301 Beal Avenue
Ann Arbor, MI, USA

Abstract

Previous work in multithreading, and specifically in simultaneous multithreading (SMT), has focused primarily on increasing total instruction throughput. While this focus is sufficient in some application domains, widespread deployment of multithreaded processors will require robust behavior across a variety of platforms. For instance, interactive systems must be concerned with the execution latency of foreground user-interface threads. Multiuser systems must be concerned with fair allocation of throughput among competing users. A multithreaded processor that seeks solely to maximize throughput will favor efficient threads at the expense of any potential latency or fairness issues. We show that a very simple fetch-stage prioritization scheme can substantially reduce the latency impact of multithreading on a selected foreground thread while continuing to provide a throughput improvement over single-threaded execution. When all threads have equal priority, rotating the high-priority designation among the threads reduces the processor's bias against less efficient threads, resulting in a more even throughput distribution across the threads. We also show that even when latency and fairness are not a concern, rotating thread prioritization has a positive effect on cache and branch predictor utilization. Unfortunately, although our simple prioritized multithreading scheme provides these benefits while improving utilization over a single-threaded processor, total throughput falls well short of existing throughput-oriented fetch policies. Our ongoing work focuses on more sophisticated prioritization algorithms, potentially incorporating branch confidence estimators, that will maintain these benefits while increasing total throughput.

1 Introduction

Multithreading is a well-known technique for increasing the utilization of a processor core, and thus total processor throughput, by sharing the core among several independent threads of control. Processor resources that would be unused or underused by any single thread due to cache misses or program dependencies can be applied to the execution of another thread. Two recent trends have heightened interest in multithreading. First, semiconductor fabrication technology is now capable of producing superscalar microprocessors whose peak throughput potential is far beyond the throughput that can be extracted from most single-threaded applications. Second, operating system, compiler, and language support for multithreading is becoming more widespread, as exemplified by Windows NT and Java.

Simultaneous multithreading (SMT) is a promising form of multithreading proposed by Tullsen et al. [1]. SMT enables very fine-grained resource sharing in a dynamic out-of-order superscalar processor core. By multiplexing resources among threads within a single cycle, as well as across cycles, total throughput can be improved significantly over a single-threaded processor.

Studies of SMT processors [1][2], as with earlier multithreading studies and systems [3][4][5][6][7][8], have focused almost exclusively on improving overall processor throughput. The implicit assumption is that all threads and all instructions are equally important, so maximizing instructions per cycle is good regardless of which instructions are executed. The effects of this assumption are particularly pronounced in Tullsen et al.'s study of SMT fetch policies [2], in which they increase throughput by explicitly favoring threads that use the processor efficiently.

This assumption is reasonable for application domains traditionally targeted by multithreading, such as scientific computation and database servers, where all the threads are components of some larger parallel application. However, for multithreaded microprocessors to escape niche markets, they must benefit a wide range of platforms, including portable and desktop PCs and shared multiuser systems. These systems have constraints that violate the "all instructions are equal" assumption of traditional multithreading. For example, in interactive systems one or more threads are directly responsible for user interaction. As user interfaces move into new modes such as speech and 3D graphics, this interaction can be computationally expensive. If a multithreaded processor decides to favor a more resource-efficient thread over a user-interaction thread, the user may see an intolerable and potentially unbounded latency increase.

One solution is to disable multithreading when a latency-critical thread is active. However, this approach wastes the additional throughput capability that multithreading was designed to exploit. Instead, we propose thread prioritization, in which software associates priorities with active threads and the processor incorporates these priorities into its instruction fetch policy. In this paper, we examine only the simplest thread prioritization policy: a single thread is identified as the high-priority thread, and the processor fetches instructions for this thread whenever possible. Lower-priority threads are given the opportunity to fetch instructions only when the high-priority thread is stalled.

Thread prioritization can be useful even when all threads are logically of equal priority. As mentioned above, a processor that is maximizing throughput will favor threads that use the processor efficiently. This results in an unfair allocation of resources among threads. While this is not an issue in many environments, a multiuser system would like to guarantee fair allocation of resources across all users. By rotating the high-priority designation among all active threads, we can reduce the bias against less efficient threads and improve the fairness of CPU allocation.

We also demonstrate that even when latency and fairness are not a concern, rotating thread prioritization has a positive effect on cache and branch predictor utilization. Unfortunately, although our simple prioritized multithreading scheme provides these benefits while improving utilization over a single-threaded processor, total throughput falls well short of existing throughput-oriented fetch policies. Our ongoing work focuses on more sophisticated prioritization algorithms, potentially incorporating branch confidence estimators, that will maintain these benefits while increasing total throughput.
We provide additional background and describe our experimental methodology in Section 2. Section 3 describes the three potential areas that may benefit from thread prioritization: limiting latency effects, improving fairness, and reducing cache and branch predictor contention. Section 4 concludes with a discussion of ongoing and future work.

2 Background and methodology

When an SMT processor has more threads than instruction-cache fetch ports, it can fetch from only a subset of the threads each cycle. Several fetch policies are introduced and evaluated in [2]. This paper draws on two of these policies: Round-Robin (RR) and Instruction-Count (I-Count, or IC). We also studied simple prioritized extensions of both Round-Robin and I-Count. In describing these policies, we define an active thread as any thread that is allowed to execute instructions. A thread is eligible to fetch instructions if it is active and has no outstanding instruction-cache miss. A fetch opportunity for a thread occurs when the thread's fetch address is supplied to an instruction-cache port, resulting in either the addition of one or more instructions to the fetch queue or an instruction-cache miss.

In the case of a full fetch queue, no thread is given the opportunity to fetch. This situation must be handled explicitly, since thread starvation can result if a cycle with a full fetch queue is counted as a fetch opportunity.

The Round-Robin (RR) policy maintains an arbitrarily ordered list of eligible threads. After the thread or threads at the top of the list are given the opportunity to fetch, these threads are rotated to the bottom of the list. The Round-Robin policy is fair in that, over a number of cycles, each thread has an equal opportunity to fetch new instructions.

The Instruction-Count (I-Count or IC) policy counts the number of instructions from active threads that are currently in the instruction buffers but have not yet been issued to a function unit. This policy gives fetch opportunities to the eligible threads that have the fewest instructions in the pipeline, under the assumption that these threads are moving instructions through the CPU quickly and hence making the most efficient use of the pipeline. Threads with fewer active instructions are also less likely to exhibit data dependencies or to be stalled as a result of a cache miss.

We defined two additional fetch policies that incorporate a software-defined fetch priority for each thread. Fetch opportunities are given first to the highest-priority threads; lower-priority threads are given fetch opportunities only when higher-priority threads are unable to make use of the full fetch bandwidth. The two policies, Prioritized Round-Robin (PR) and Prioritized I-Count (PI), differ only in the policy used to select among threads at the same priority level (RR and IC, respectively). In this paper, we use at most two priority levels: a single foreground thread runs at high priority while the background thread(s) share the same lower priority.

We simulated the behavior of these policies using a modified version of the sim-outorder simulator from the SimpleScalar tool set [9]. The original simulator was extended by replicating the necessary machine context for multithreading, adding support for multiple address spaces, and increasing the coverage and types of collected statistics. The processor model is based on Sohi's RUU [13] and is similar to the instruction-queue model used in Hewlett-Packard's PA-8000 processors. The fetch stage feeds instructions to a simple fetch/decode queue. The decode stage picks instructions from this queue, decodes and renames them, then places them into the Register Update Unit (RUU). Loads and stores are broken into an address generation and a memory reference: the address-generation portion is placed in the RUU, while the memory-reference portion is placed into the Load/Store Queue (LSQ). Together, the RUU and LSQ serve as a combination of global reservation station, rename register file, and re-order buffer. The processor maintains precise exceptions by committing instructions from the RUU only in fetch order.

We specified an advanced processor with numerous function units and reasonable on-chip instruction and data caches. Details of the model can be found in Table 1. Although we have studied processors with multiported instruction caches, for clarity we focus on single-ported instruction caches in this paper.
Table 1: Simulated Processor Configuration

  L1 Instruction Cache             32K bytes, 4-way associative, single ported
  L1 Data Cache                    32K bytes, 4-way associative, dual ported
  Unified L2 Cache                 1M bytes, 4-way associative
  Branch Predictor                 Two-level: 11-bit global history register, 2048-entry PHT, 512-entry BTB, 16-entry RAS (per thread)
  Fetch/Decode/Issue/Commit Width  8 instructions / cycle
  Integer Function Units           6 ALU, 2 Multiply
  FP Function Units                4 Add, 2 Multiply
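For concreteness, the policy definitions above can be sketched as follows. This is an illustrative Python sketch, not the actual sim-outorder code; the Thread fields are hypothetical names for the per-thread state just described. The sketch selects the thread that receives the single fetch opportunity in a cycle:

    # Sketch of the four fetch policies for a single-ported instruction cache.
    from collections import deque

    class Thread:
        def __init__(self, tid, priority=0):
            self.tid = tid
            self.priority = priority    # software-defined fetch priority
            self.icache_miss = False    # outstanding I-cache miss => ineligible
            self.unissued = 0           # fetched-but-not-yet-issued instructions

    def select_fetch_thread(rr_order, policy, fetch_queue_full=False):
        """rr_order: deque of Threads in round-robin order.
        Returns the thread granted this cycle's fetch opportunity, or None."""
        if fetch_queue_full:
            return None                 # a full fetch queue is not counted as a
                                        # fetch opportunity (avoids starvation)
        eligible = [t for t in rr_order if not t.icache_miss]
        if not eligible:
            return None
        if policy in ("PR", "PI"):      # prioritized variants: consider only
            top = max(t.priority for t in eligible)   # the highest level present
            eligible = [t for t in eligible if t.priority == top]
        if policy in ("IC", "PI"):      # I-Count: fewest in-flight instructions
            return min(eligible, key=lambda t: t.unissued)
        chosen = eligible[0]            # RR (or PR within a priority level):
        rr_order.remove(chosen)         # take the topmost eligible thread and
        rr_order.append(chosen)         # rotate it to the bottom of the list
        return chosen

Note that when all priorities are equal, PR degenerates to RR and PI to IC, matching the definitions above.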

[Figure 1. Throughput vs. latency for various fetch policies: total throughput (IPC, x-axis) against normalized foreground-thread latency (y-axis), with one curve each for RR, IC, and PR.]

3 Applications of prioritization

This section examines each of the potential areas that may benefit from thread prioritization: limiting latency effects, improving fairness, and reducing cache and branch predictor contention.

3.1 Limiting latency effects

Our initial goal in this study was to find a way to address the issue of increased program runtime when several threads are present in the processor. We first looked at the effects on some foreground thread when background threads were introduced into our simulator. To accomplish this, we configured our simulator to stop when our foreground thread completed 10 million instructions, and ran simulations with zero, one, two, and three background threads. We refer to the time required to complete the 10 million instructions as the latency of the foreground thread. By comparing the number of simulated cycles from the run with no background threads against the other runs, we determined the relative increase in latency due to the background threads.

Our simulations use a mix of integer benchmarks from SPEC95, where the foreground thread has been chosen to be perl. Background threads include m88ksim, compress, and ijpeg. Figure 1 plots relative latency versus total throughput for zero, one, two, and three background threads using the Round-Robin (RR), I-Count (IC), and Prioritized Round-Robin (PR) fetch policies. We see that for both the RR and IC fetch policies, adding a second thread increases the total processor throughput from 2.33 to more than 3.41 IPC. Unfortunately, this 46% increase in throughput comes at the expense of a 42% increase in latency for our foreground thread. As a second and third background thread are added, the latency increases further. Though the total throughput increases significantly, each thread's portion of the available throughput steadily decreases.

Ideally, we would like the addition of background threads to increase throughput without affecting the runtime of the foreground thread. Our goal is to have the curves move from their starting point (a single thread) to the right (increasing throughput) without moving up (no increase in latency); i.e., we are aiming for the lower-right corner of the graph.

Our prioritized fetch policy specifies to the processor that the foreground thread should always be given the opportunity to fetch instructions when it is eligible (not suffering an instruction-cache miss). The background threads are allowed to use whatever fetch bandwidth the foreground thread is unable to use. As the figure shows, adding a second thread in this situation results in an increase in total throughput from 2.33 to 2.91 IPC. Here, we have improved throughput by 25%, but have increased our foreground thread latency by only 6%. Beyond the first background thread, the prioritized schemes degrade quickly, but even with three background threads the latency never increases beyond 13%.
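These metrics are straightforward to reproduce from raw simulation counters. The sketch below (our illustration, with hypothetical function names, not part of the simulator) restates them and checks the arithmetic of the numbers quoted above:

    # Section 3.1 metrics, restated as code (illustrative only).
    def normalized_latency(cycles_with_background, cycles_alone):
        """Cycles the foreground thread needs for its 10 million instructions
        with background threads present, relative to running alone."""
        return cycles_with_background / cycles_alone

    def throughput_gain(ipc_multi, ipc_single):
        """Fractional total-throughput improvement over single-threaded IPC."""
        return ipc_multi / ipc_single - 1.0

    # Sanity check against the numbers quoted above: 2.33 -> 3.41 IPC is ~46%.
    assert round(throughput_gain(3.41, 2.33), 2) == 0.46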
3.2 Improving fairness

For applications where processing resources are being shared between a number of users, we would like to avoid giving preferential treatment to any one thread. This principle of fairness can directly conflict with the goal of maximizing throughput.
By design, the I-Count fetch policy increases total throughput by favoring threads that are more efficient. Similarly, Culler et al. [7] observed that switch-on-miss (non-simultaneous) multithreading favors threads with lower miss rates, improving overall cache performance at the expense of fairness.

To evaluate the fairness of different fetch policies, we measured the speedup of individual threads (relative to single-threaded performance) when run in a three-thread workload. As a reference point, we also implemented time-slice (TS) scheduling, where each thread has sole possession of the processor for some period of time before relinquishing it to another thread, which then has sole possession. The speedup values for several threads are plotted for different workloads and fetch policies in Figure 2.

[Figure 2. Thread speedups for four three-thread workloads: speedup (y-axis) vs. thread number (x-axis), where TS = time slicing, RR = round robin, IC = instruction-count, PI = prioritized IC. Standard deviations of the speedups, per policy: Workload 1: TS 0.010, RR 0.061, IC 0.066, PI 0.012; Workload 2: TS 0.009, RR 0.068, IC 0.082, PI 0.050; Workload 3: TS 0.006, RR 0.062, IC 0.074, PI 0.043; Workload 4: TS 0.021, RR 0.049, IC 0.062, PI 0.019.]

What we would like to see for each fetch policy is a nearly horizontal line with large speedup values, indicating that each of the three threads achieves a similar speedup, i.e., that they all suffer the same performance penalty. As we would expect, the time-slicing policy is quite fair, the standard deviation of the speedup values being no more than 0.021. The RR and IC policies have the largest speedup values, but with significantly different speedups for the individual threads; standard deviations for these policies range from 0.049 to 0.082. As might be expected, the RR policy does exhibit less unfairness than IC, but the difference is surprisingly small. Although RR is fair in distributing fetch opportunities, useful throughput is still biased toward threads with fewer instruction-cache misses or less sensitivity to branch predictor or cache interference (see the following section).
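The fairness metric can be stated compactly in code. In the sketch below (our illustration), the cycle counts are invented placeholders, not measurements behind Figure 2:

    # Fairness as the spread of per-thread speedups (illustrative sketch).
    from statistics import pstdev

    def thread_speedups(cycles_alone, cycles_in_mix):
        """Per-thread speedup: runtime alone / runtime in the multithreaded
        mix, for a fixed instruction count per thread."""
        return [alone / mixed
                for alone, mixed in zip(cycles_alone, cycles_in_mix)]

    speedups = thread_speedups(cycles_alone=[10e6, 12e6, 11e6],   # placeholders
                               cycles_in_mix=[24e6, 31e6, 26e6])  # placeholders
    print(pstdev(speedups))  # lower deviation = more even penalty = fairer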

To counteract this bias, we extended our prioritized RR and IC policies to allow us to rotate the high-priority designation among the active threads. Assuming that all threads are of logically equal priority, the operating system rotates (time-slices) the high-priority designation among all the threads, giving each equal time as the foreground thread. (Note that by varying the time each thread spends at high priority, the operating system gains significant control over CPU allocation without disabling multithreading, a capability not present in non-prioritized policies. We can view the foreground/background thread experiments in the preceding section as a special case of this more general model in which the priorities are never changed.)

Unlike the Round-Robin or I-Count policies, the rotating priority policy gives each thread a macroscopic interval during which it can exploit all of the processor's resources to the best of its ability. Unlike the time-sliced CPU, we have not completely lost the ability to get work done on background threads during these intervals. However, since we are forcing the processor to give a fetch opportunity to a thread when it may not have done so under a purely throughput-oriented policy, we expect that our throughput will suffer compared to RR or IC.

We ran the new policy with a priority rotation scheduled to occur every 75 million cycles. The resulting curves in Figure 2 are marked PI (Prioritized I-Count) and demonstrate improved fairness over the pure RR and IC policies (for the same workloads), and improved speedup over the time-slicing policy. In one instance (Workload 3), the rotating priority policy flattens the curve by increasing the speedup of one thread by approximately 10% at the expense of lower speedups for the other threads.
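The rotation itself requires very little mechanism. A sketch is shown below (our illustration, reusing the hypothetical Thread record from the Section 2 sketch); giving threads unequal interval lengths would provide the CPU-allocation control noted parenthetically above:

    # Rotating-priority schedule: every interval, the high-priority designation
    # moves to the next thread; all others share the same lower priority.
    ROTATION_INTERVAL = 75_000_000   # cycles per turn, as in our experiments

    def apply_rotation(threads, cycle):
        """Mark exactly one thread high-priority, based on the current cycle."""
        turn = (cycle // ROTATION_INTERVAL) % len(threads)
        for i, t in enumerate(threads):
            t.priority = 1 if i == turn else 0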
3.3 Reducing cache and branch predictor contention

Though threads within an SMT processor may not interact directly, they share execution resources and thus do affect each other. The exact nature of these interactions will vary widely with the processor architecture and the workload being executed. An excellent example is illustrated in Figure 3, a latency-throughput graph similar to Figure 1 in which the foreground thread is ijpeg in its initialization phase.

[Figure 3. Throughput vs. latency (ijpeg initialization phase): total throughput (IPC, x-axis) against normalized latency (y-axis) for the RR and PR fetch policies.]

As the figure indicates, when running alone, this phase of ijpeg's execution achieves an impressive throughput of 4.4 IPC. It is so efficient in its use of processor resources that, for the RR and IC fetch policies, the introduction of any background thread causes a serious drop in total processor throughput and a 100% increase in foreground thread latency. The use of a prioritized fetch policy allows the background threads to execute at a rate of less than 0.01 IPC. As a result, the data points for the prioritized round-robin policy plot essentially on top of one another. Luckily, this type of behavior seems to be rare.

We can quantify the contention effects of SMT by examining changes in cache miss rates and branch prediction accuracy, as was done previously by Hily and Seznec [10][11]. We looked at two base cases: an SMT processor using the round-robin fetch policy, and a single-threaded out-of-order processor using time-slicing to run multiple threads. The simulated branch predictor accuracies, L1 data cache miss rates, and processor throughput for the perl benchmarks are shown in Figure 4.
[Figure 4. Cache and branch predictor effects: D-cache miss rate, branch prediction accuracy, and total IPC versus number of threads, for the Time-Slice, Round-Robin, and Prioritized RR policies.]

For the time-slicing processor, contention from additional threads degrades both branch predictor and cache performance, leading to slightly lower overall throughput. The SMT processor sees an even greater degradation in predictor and cache performance, but provides increased throughput nonetheless. Time-slicing provides more favorable predictor and cache performance because it allows a single thread to run alone for a comparatively long period of time (8.5 to 12.5 million cycles for the figures shown). This unperturbed running time allows the branch predictor and caches to warm up without interference, leading to good performance for the running thread. The SMT processor's simultaneous threads do not allow any one thread to avoid interference in the branch predictor or cache as it would in the single-threaded case.

Our rotating priority policy from the previous section should reduce interference similarly. In each scheduling interval, the foreground (high-priority) thread will receive a dominant fraction of the execution cycles, allowing it to warm up and exploit the branch predictor and caches with reduced (though non-zero) interference from other threads.

We ran experiments that gave each thread two equal-length periods as the foreground thread. Our results for the prioritized round-robin (PR) fetch policy are also shown in Figure 4. As expected, branch prediction and cache performance improve over the pure Round-Robin policy. This improved predictor and cache performance relative to RR does not translate into improved throughput for this simple prioritization scheme on these workloads. However, we expect that better prioritization schemes and/or more demanding workloads may translate this increased resource efficiency into an overall performance improvement.

4 Conclusions and future work

The prioritized fetch policies that we have developed have shown themselves to be effective in reducing the impact of background threads on the foreground thread's latency. These policies also provide the ability to trade speedup for fairness in applications where this is important. Finally, we have shown that a prioritized scheme can limit the pathological behavior of some workloads by addressing the problem of resource contention between threads.

Though our rotating priority scheme clearly reduces cache and branch predictor contention, this utilization improvement does not produce an overall throughput gain over non-prioritizing policies with our simple prioritization scheme and our workload mix. However, we hypothesize that a throughput gain may be realized using more sophisticated prioritization policies and/or workloads with higher resource contention. One policy currently under investigation uses branch confidence measures to temporarily reduce the priority of threads executing down a low-confidence path. An SMT processor is ideally suited to use branch confidence because of its ability to dynamically reallocate processor resources to other threads. By allocating these resources to threads on higher-confidence paths of execution, performance should be improved and contention for resources between threads should be reduced.

To the extent that we can execute background threads without significantly increasing the latency of a foreground thread, prioritization makes available free execution cycles that can be exploited in novel ways. Even when only a single application thread is available, these free cycles may be used in support of that application to manage or optimize cache or branch prediction resources, to collect profiling data, or even to re-optimize application code on the fly.

Acknowledgments

This work was supported by IBM, Intel, and the National Science Foundation under award CCR.

References

[1] D. M. Tullsen, S. J. Eggers, and H. M. Levy. Simultaneous Multithreading: Maximizing On-Chip Parallelism. In 22nd Annual International Symposium on Computer Architecture, June 1995.

[2] D. M. Tullsen, S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, and R. L. Stamm. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor. In 23rd Annual International Symposium on Computer Architecture, May 1996.

[3] B. J. Smith. Architecture and Applications of the HEP Multiprocessor Computer System. In Proceedings of the SPIE, vol. 298, 1981.

[4] G. Alverson et al. Exploiting Heterogeneous Parallelism on a Multithreaded Multiprocessor.
In Proceedings of the International Conference on Supercomputing, July 1992.

[5] A. Agarwal, B. H. Lim, D. Kranz, and J. Kubiatowicz. APRIL: A Processor Architecture for Multiprocessing. In 17th Annual International Symposium on Computer Architecture, May 1990.

[6] S. W. Keckler and W. J. Dally. Processor Coupling: Integrating Compile Time and Runtime Scheduling for Parallelism. In 19th Annual International Symposium on Computer Architecture, May 1992.

[7] D. E. Culler, M. Gunter, and J. C. Lee. Analysis of Multithreaded Microprocessors under Multiprogramming. Univ. of California, Berkeley Computer Science Division Tech. Report No. UCB/CSD 92/687, May 1992.

[8] J. Laudon, A. Gupta, and M. Horowitz. Interleaving: A Multithreading Technique Targeting Multiprocessors and Workstations. In Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VI), October 1994.

[9] D. Burger and T. M. Austin. The SimpleScalar Tool Set, Version 2.0. Technical Report #1342, University of Wisconsin-Madison Computer Sciences Department, June 1997.

[10] S. Hily and A. Seznec. Branch Prediction and Simultaneous Multithreading. In 1996 International Conference on Parallel Architectures and Compilation Techniques, October 1996.

[11] S. Hily and A. Seznec. Contention on 2nd Level Cache May Limit the Effectiveness of Simultaneous Multithreading. Internal Publication #1086, IRISA, Campus Universitaire de Beaulieu, February 1997.

[12] T.-Y. Yeh and Y. Patt. Two-Level Adaptive Branch Prediction. In Proceedings of the 24th Annual International Symposium on Microarchitecture, November 1991.

[13] G. S. Sohi. Instruction Issue Logic for High-Performance, Interruptible, Multiple Functional Unit, Pipelined Computers. IEEE Transactions on Computers, 39(3), March 1990.


More information

C152 Laboratory Exercise 3

C152 Laboratory Exercise 3 C152 Laboratory Exercise 3 Professor: Krste Asanovic TA: Christopher Celio Department of Electrical Engineering & Computer Science University of California, Berkeley March 7, 2011 1 Introduction and goals

More information

Towards a More Efficient Trace Cache

Towards a More Efficient Trace Cache Towards a More Efficient Trace Cache Rajnish Kumar, Amit Kumar Saha, Jerry T. Yen Department of Computer Science and Electrical Engineering George R. Brown School of Engineering, Rice University {rajnish,

More information

32 Hyper-Threading on SMP Systems

32 Hyper-Threading on SMP Systems 32 Hyper-Threading on SMP Systems If you have not read the book (Performance Assurance for IT Systems) check the introduction to More Tasters on the web site http://www.b.king.dsl.pipex.com/ to understand

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Getting CPI under 1: Outline

Getting CPI under 1: Outline CMSC 411 Computer Systems Architecture Lecture 12 Instruction Level Parallelism 5 (Improving CPI) Getting CPI under 1: Outline More ILP VLIW branch target buffer return address predictor superscalar more

More information

MPEG-2 Video Decompression on Simultaneous Multithreaded Multimedia Processors

MPEG-2 Video Decompression on Simultaneous Multithreaded Multimedia Processors MPEG- Video Decompression on Simultaneous Multithreaded Multimedia Processors Heiko Oehring Ulrich Sigmund Theo Ungerer VIONA Development GmbH Karlstr. 7 D-733 Karlsruhe, Germany uli@viona.de VIONA Development

More information

Simultaneous Multithreading and the Case for Chip Multiprocessing

Simultaneous Multithreading and the Case for Chip Multiprocessing Simultaneous Multithreading and the Case for Chip Multiprocessing John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 522 Lecture 2 10 January 2019 Microprocessor Architecture

More information

Computer Architecture: Branch Prediction. Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Branch Prediction. Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Branch Prediction Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-447 Spring 2013, Computer Architecture, Lecture 11: Branch Prediction

More information

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1]

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1] EE482: Advanced Computer Organization Lecture #7 Processor Architecture Stanford University Tuesday, June 6, 2000 Memory Systems and Memory Latency Lecture #7: Wednesday, April 19, 2000 Lecturer: Brian

More information

Adaptive Cache Partitioning on a Composite Core

Adaptive Cache Partitioning on a Composite Core Adaptive Cache Partitioning on a Composite Core Jiecao Yu, Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Scott Mahlke Computer Engineering Lab University of Michigan, Ann Arbor, MI {jiecaoyu, lukefahr,

More information

LIMITS OF ILP. B649 Parallel Architectures and Programming

LIMITS OF ILP. B649 Parallel Architectures and Programming LIMITS OF ILP B649 Parallel Architectures and Programming A Perfect Processor Register renaming infinite number of registers hence, avoids all WAW and WAR hazards Branch prediction perfect prediction Jump

More information