Applications of Thread Prioritization in SMT Processors
Steven E. Raasch & Steven K. Reinhardt
Electrical Engineering and Computer Science Department
The University of Michigan
1301 Beal Avenue
Ann Arbor, MI USA

Abstract

Previous work in multithreading, and specifically in simultaneous multithreading (SMT), has focused primarily on increasing total instruction throughput. While this focus is sufficient in some application domains, widespread deployment of multithreaded processors will require robust behavior across a variety of platforms. For instance, interactive systems must be concerned with the execution latency of foreground user-interface threads. Multiuser systems must be concerned with fair allocation of throughput among competing users. A multithreaded processor that seeks solely to maximize throughput will favor efficient threads at the expense of any potential latency or fairness issues. We show that a very simple fetch-stage prioritization scheme can substantially reduce the latency impact of multithreading on a selected foreground thread while continuing to provide a throughput improvement over single-threaded execution. When all threads have equal priority, rotating the high-priority designation among the threads reduces the processor's bias against less efficient threads, resulting in a more even throughput distribution across the threads. We also show that even when latency and fairness are not a concern, rotating thread prioritization has a positive effect on cache and branch predictor utilization. Unfortunately, although our simple prioritized multithreading scheme provides these benefits while improving utilization over a single-threaded processor, total throughput falls well short of existing throughput-oriented fetch policies. Our ongoing work focuses on more sophisticated prioritization algorithms, potentially incorporating branch confidence estimators, that will maintain these benefits while increasing total throughput.

1 Introduction

Multithreading is a well-known technique for increasing the utilization of a processor core, and thus total processor throughput, by sharing the core among several independent threads of control. Processor resources that would be unused or underused by any single thread due to cache misses or program dependencies can be applied to the execution of another thread.

Two recent trends have heightened interest in multithreading. First, semiconductor fabrication technology is now capable of producing superscalar microprocessors whose peak throughput potential is far beyond the throughput that can be extracted from most single-threaded applications. Second, operating system, compiler, and language support for multithreading is becoming more widespread, as exemplified by Windows NT and Java.

Simultaneous multithreading (SMT) is a promising form of multithreading proposed by Tullsen et al. [1]. SMT enables very fine-grained resource sharing in a dynamic out-of-order superscalar processor core. By multiplexing resources among threads within a single cycle, as well as across
cycles, total throughput can be improved significantly over a single-threaded processor.

Studies of SMT processors [1][2], as with earlier multithreading studies and systems [3][4][5][6][7][8], have focused almost exclusively on improving overall processor throughput. The implicit assumption is that all threads and all instructions are equally important, so maximizing instructions per cycle is good regardless of which instructions are executed. The effects of this assumption are particularly pronounced in Tullsen et al.'s study of SMT fetch policies [2], in which they increase throughput by explicitly favoring threads that use the processor efficiently.

This assumption is reasonable for application domains such as scientific computation and database servers, the domains traditionally targeted by multithreading, where all the threads are components of some larger parallel application. However, for multithreaded microprocessors to escape niche markets, they must benefit a wide range of platforms, including portable and desktop PCs and shared multiuser systems. These systems have constraints that violate the "all instructions are equal" assumption of traditional multithreading. For example, in interactive systems one or more threads are directly responsible for user interaction. As user interfaces move into new modes such as speech and 3D graphics, this interaction can be computationally expensive. If a multithreaded processor decides to favor a more resource-efficient thread over a user-interaction thread, the user may see an intolerable and potentially unbounded latency increase.

One solution is to disable multithreading when a latency-critical thread is active. However, this approach wastes the additional throughput capability that multithreading was designed to exploit. Instead, we propose thread prioritization, in which software associates priorities with active threads, and the processor incorporates these priorities into its instruction fetch policy. In this paper, we examine only the simplest thread prioritization policy: a single thread is identified as the high-priority thread, and the processor fetches instructions for this thread whenever possible. Lower-priority threads are given the opportunity to fetch instructions only when the high-priority thread is stalled.

Thread prioritization can be useful even when all threads are logically of equal priority. As mentioned above, a processor that is maximizing throughput will favor threads that use the processor efficiently. This results in an unfair allocation of resources among threads. While this is not an issue in many environments, a multiuser system would like to guarantee fair allocation of resources across all users. By rotating the high-priority designation among all active threads, we can reduce the bias against less efficient threads and improve the fairness of CPU allocation. We also demonstrate that even when latency and fairness are not a concern, rotating thread prioritization has a positive effect on cache and branch predictor utilization. Unfortunately, although our simple prioritized multithreading scheme provides these benefits while improving utilization over a single-threaded processor, total throughput falls well short of existing throughput-oriented fetch policies. Our ongoing work focuses on more sophisticated prioritization algorithms, potentially incorporating branch confidence estimators, that will maintain these benefits while increasing total throughput.
We provide additional background and describe our experimental methodology in Section 2. Section 3 describes the three potential areas that may benefit from thread prioritization: limiting latency effects, increasing fairness, and reducing cache and branch predictor conflicts. Section 4 concludes with a discussion of ongoing and future work.

2 Background and methodology

When an SMT processor has more threads than instruction-cache fetch ports, it can fetch from only a subset of the threads each cycle. Several fetch policies are introduced and evaluated in [2]. This paper will draw from two of these policies: Round-Robin (RR) and Instruction-Count (I-Count, or IC). We also studied simple prioritized extensions of both Round-Robin and I-Count. In describing these policies, we define an active thread as any thread that is allowed to execute instructions. A thread is eligible to fetch instructions if it is active and has no outstanding instruction cache miss. A fetch opportunity for a thread occurs when the thread's fetch address is supplied to an instruction-cache port, resulting in either the addition of one or more instructions to the fetch queue or an instruction-cache miss.
In the case of a full fetch queue, no thread is given the opportunity to fetch. This situation must be handled explicitly, since thread starvation can result if this state is counted as a fetch opportunity.

The Round-Robin (RR) policy maintains an arbitrarily ordered list of eligible threads. After the thread or threads at the top of the list are given the opportunity to fetch, these threads are rotated to the bottom of the list. The Round-Robin policy is fair in that, over a number of cycles, each thread has equal opportunity to fetch new instructions.

The Instruction-Count (I-Count or IC) policy counts the number of instructions from active threads that are currently in the instruction buffers but have not yet been issued to a function unit. This policy gives fetch opportunities to the eligible threads that have the fewest instructions in the pipeline, under the assumption that these threads are moving instructions through the CPU quickly and hence making the most efficient use of the pipeline. Threads with fewer active instructions are also less likely to exhibit data dependencies or to be stalled as a result of a cache miss.

We defined two additional fetch policies that incorporate a software-defined fetch priority for each thread. Fetch opportunities are given first to the highest-priority threads. Lower-priority threads are given fetch opportunities only when higher-priority threads are unable to make use of the full fetch bandwidth. The two policies, Prioritized Round-Robin (PR) and Prioritized I-Count (PI), differ only in the policy used to select among threads at the same priority level (RR and IC, respectively); a sketch of this selection logic appears at the end of this section. In this paper, we use at most two priority levels: a single foreground thread runs at a high priority while the background thread(s) share the same lower priority.

We simulated the behavior of these policies using a modified version of the sim-outorder simulator from the SimpleScalar tool set [9]. The original simulator was extended by replicating the necessary machine context for multithreading, adding support for multiple address spaces, and increasing the coverage and types of collected statistics. The processor model is based on Sohi's Register Update Unit (RUU) [13]. This model is similar to the instruction queue model used in Hewlett-Packard's PA-8000 processors. The fetch stage feeds instructions to a simple fetch/decode queue. The decode stage picks instructions from this queue, decodes and renames them, then places them into the RUU. Loads and stores are broken into an address generation and a memory reference: the address-generation portion is placed in the RUU, while the memory-reference portion is placed into the Load/Store Queue (LSQ). The RUU and LSQ together serve as a combination of global reservation station, rename register file, and re-order buffer. The processor maintains precise exceptions by committing instructions from the RUU only in fetch order.

We specified an advanced processor with numerous function units and reasonable on-chip instruction and data caches. Details of the model can be found in Table 1. Although we have studied processors with multiported instruction caches, for clarity we focus on single-ported instruction caches in this paper.
Table 1: Simulated Processor Configuration

L1 Instruction Cache:            32K bytes, 4-way associative, single-ported
L1 Data Cache:                   32K bytes, 4-way associative, dual-ported
Unified L2 Cache:                1M bytes, 4-way associative
Branch Predictor:                Two-Level: 11-bit Global History Register, 2048-entry PHT, 512-entry BTB, 16-entry RAS (per thread)
Fetch/Decode/Issue/Commit Width: 8 instructions / cycle
Integer Function Units:          6 ALU, 2 Multiply
FP Function Units:               4 Add, 2 Multiply
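To make the arbitration concrete, the sketch below implements the per-cycle fetch-thread selection just described. It is a hedged reconstruction, not code from our modified sim-outorder: the thread structure, its field names, and the single-winner selection (one fetch port) are illustrative assumptions, and the full-fetch-queue case, in which no thread may fetch, is left to the caller.

/* Sketch of per-cycle fetch-thread selection for the RR, IC, PR, and PI
 * policies. Priority dominates (PR/PI); ties are broken by round-robin
 * order (RR/PR) or by fewest unissued instructions (IC/PI). */
#include <stdio.h>
#include <stdbool.h>

#define NTHREADS 4

struct thread {
    int  id;
    bool active;        /* allowed to execute instructions */
    bool icache_miss;   /* outstanding I-cache miss: not eligible */
    int  priority;      /* software-assigned; higher fetches first */
    int  icount;        /* fetched but not yet issued instructions */
    int  rr_order;      /* position in the round-robin rotation */
};

/* Return the thread given this cycle's fetch opportunity, or -1. */
static int select_fetch_thread(struct thread t[], int n, bool use_icount)
{
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (!t[i].active || t[i].icache_miss)
            continue;                          /* not eligible this cycle */
        if (best < 0 || t[i].priority > t[best].priority ||
            (t[i].priority == t[best].priority &&
             (use_icount ? t[i].icount  < t[best].icount
                         : t[i].rr_order < t[best].rr_order)))
            best = i;
    }
    return best;   /* caller rotates rr_order after an RR/PR fetch */
}

int main(void)
{
    struct thread t[NTHREADS] = {
        {0, true, false, 1, 12, 3},   /* high-priority foreground thread */
        {1, true, false, 0,  9, 0},
        {2, true, true,  0,  1, 1},   /* ineligible: I-cache miss */
        {3, true, false, 0,  7, 2},
    };
    printf("PR picks thread %d, PI picks thread %d\n",
           select_fetch_thread(t, NTHREADS, false),
           select_fetch_thread(t, NTHREADS, true));
    t[0].priority = 0;                /* equal priorities: plain RR and IC */
    printf("RR picks thread %d, IC picks thread %d\n",
           select_fetch_thread(t, NTHREADS, false),
           select_fetch_thread(t, NTHREADS, true));
    return 0;
}

With the foreground thread at high priority, both PR and PI pick it; once priorities are equal, RR falls back to rotation order while IC picks the thread with the fewest unissued instructions.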
[Figure 1. Throughput vs. latency for various fetch policies. Axes: normalized latency vs. total throughput (IPC); one curve each for the RR, IC, and PR policies.]

3 Applications of prioritization

This section examines each of the potential areas that may benefit from thread prioritization: limiting latency effects, improving fairness, and reducing cache and branch predictor contention.

3.1 Limiting latency effects

Our initial goal in this study was to find a way to address the issue of increased program runtime when several threads are present in the processor. We first looked at the effects on some foreground thread when background threads were introduced into our simulator. To accomplish this, we configured our simulator to stop when our foreground thread completed 10 million instructions and ran simulations with zero, one, two, and three background threads. We refer to the time required to complete the 10 million instructions as the latency of the foreground thread. By comparing the number of simulated cycles from the runs where there are no background threads to the others, we determined the relative increase in latency due to the background threads. Our simulations use a mix of integer benchmarks from SPEC95, where the foreground thread has been chosen to be perl. Background threads include m88ksim, compress, and ijpeg.

Figure 1 plots relative latency versus total throughput for zero, one, two, and three background threads using the Round-Robin (RR), I-Count (IC), and Prioritized Round-Robin (PR) fetch policies. We see that for both the RR and IC fetch policies, adding a second thread increases the total processor throughput from 2.33 to more than 3.41 IPC. Unfortunately, this 46% increase in throughput comes at the expense of a 42% increase in latency for our foreground thread. As a second and third background thread are added, the latency increases further. Though the total throughput increases significantly, each thread's portion of the available throughput steadily decreases.

Ideally, we would like the addition of background threads to increase throughput without affecting the runtime of the foreground thread. Our goal is to have the curves move from their starting point (a single thread) to the right (increasing throughput) without moving up (no increase in latency); i.e., we are aiming for the lower-right corner of the graph.

Our prioritized fetch policy specifies to the processor that the foreground thread should always be given the opportunity to fetch instructions when it is eligible (not suffering an instruction-cache miss). The background threads are allowed to use whatever fetch bandwidth the foreground thread is unable to use. As the figure shows, adding a second thread in this situation results in an increase in total throughput from 2.33 to 2.91 IPC. Here, we have improved throughput by 25%, but have increased our foreground thread latency by only 6%. Beyond the first background thread, the prioritized schemes degrade quickly, but, even with three background threads, the latency never increases beyond 13%.
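The normalization behind Figure 1 is just a ratio of cycle counts. The short program below reproduces this arithmetic from the IPC and latency figures quoted above; the cycle counts are back-derived from those figures for illustration and are not the simulator's measured outputs.

/* Worked example of the latency/throughput normalization in Figure 1. */
#include <stdio.h>

int main(void)
{
    const double fg_insts = 10e6;            /* foreground instructions run */
    /* Single-threaded baseline: 2.33 IPC, all of it foreground. */
    const double base_cycles = fg_insts / 2.33;

    /* Two-thread RR/IC run: total IPC 3.41, foreground latency up 42%. */
    double cycles = base_cycles * 1.42;
    printf("RR/IC: normalized latency %.2f, throughput gain %.0f%%\n",
           cycles / base_cycles, (3.41 / 2.33 - 1.0) * 100.0);

    /* Two-thread PR run: total IPC 2.91, foreground latency up only 6%. */
    cycles = base_cycles * 1.06;
    printf("PR:    normalized latency %.2f, throughput gain %.0f%%\n",
           cycles / base_cycles, (2.91 / 2.33 - 1.0) * 100.0);
    return 0;
}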
3.2 Improving fairness

For applications where processing resources are being shared between a number of users, we would like to be able to avoid giving preferential treatment to any one thread. This principle of fairness can directly conflict with the goal of maximizing throughput. By design, the I-Count fetch policy increases total throughput by favoring threads that are more efficient. Similarly, Culler et al. [7] observed that switch-on-miss (non-simultaneous) multithreading favors threads with lower miss rates, improving overall cache performance at the expense of fairness.

To evaluate the fairness of different fetch policies, we measured the speedup of individual threads (relative to single-threaded performance) when run in a three-thread workload. As a reference point, we also implemented time-slice (TS) scheduling, where each thread has sole possession of the processor for some period of time before relinquishing it to another thread, which then has sole possession.

[Figure 2. Thread speedups: TS = time slicing, RR = round robin, IC = instruction-count, PI = prioritized IC. Each panel plots speedup vs. thread number for one workload; per-policy standard deviations in parentheses. Workload 1: TS (0.010), RR (0.061), IC (0.066), PI (0.012). Workload 2: TS (0.009), RR (0.068), IC (0.082), PI (0.050). Workload 3: TS (0.006), RR (0.062), IC (0.074), PI (0.043). Workload 4: TS (0.021), RR (0.049), IC (0.062), PI (0.019).]

The speedup values for several threads are plotted for different workloads and fetch policies in Figure 2. What we would like to see for each fetch policy is a nearly horizontal line with large speedup values, indicating that each of the three threads has a similar speedup, i.e., they all suffer the same performance penalty. As we would expect, the time-slicing policy is quite fair, the standard deviation of the speedup values being no more than 0.021. The RR and IC policies have the largest speedup values, but with significantly different speedups for the individual threads. Standard deviations for these policies range from 0.049 to 0.082. As might be expected, the RR policy does exhibit less unfairness than IC, but the difference is surprisingly small. Although RR is fair in distributing fetch opportunities, useful throughput is still biased toward threads with fewer instruction-cache misses or less sensitivity to branch predictor or cache interference (see the following section).
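The fairness metric itself is simple to state: each thread's throughput in the multithreaded mix divided by its single-threaded throughput, with the standard deviation of those ratios across threads as the unfairness measure. The sketch below computes it; the per-thread IPC values are hypothetical placeholders, not numbers from our experiments.

/* Per-thread speedup and the standard-deviation fairness metric. */
#include <stdio.h>
#include <math.h>

#define NT 3

int main(void)
{
    /* IPC of each thread when run alone and within the 3-thread mix
     * (illustrative values only). */
    double alone[NT] = {2.3, 1.9, 2.8};
    double mixed[NT] = {1.1, 0.7, 1.5};

    double speedup[NT], mean = 0.0, var = 0.0;
    for (int i = 0; i < NT; i++) {
        speedup[i] = mixed[i] / alone[i];
        mean += speedup[i] / NT;
    }
    for (int i = 0; i < NT; i++)
        var += (speedup[i] - mean) * (speedup[i] - mean) / NT;

    for (int i = 0; i < NT; i++)
        printf("thread %d speedup %.3f\n", i, speedup[i]);
    printf("std dev (unfairness): %.3f\n", sqrt(var));
    return 0;
}

(Compile with the math library, e.g. cc fairness.c -lm; the file name is arbitrary.)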
To counteract this bias, we extended our prioritized RR and IC policies to allow us to rotate the high-priority designation among the active threads. Assuming that all threads are of logically equal priority, the operating system rotates (time-slices) the high-priority designation among all the threads, giving each equal time as the foreground thread. (Note that by varying the time each thread spends at high priority, the operating system gains significant control over CPU allocation without disabling multithreading, a capability not present in non-prioritized policies. We can view the foreground/background thread experiments in the preceding section as a special case of this more general model where the priorities are not changed.) Unlike the Round-Robin or I-Count policies, the rotating priority policy gives each thread a macroscopic interval during which it can exploit all of the processor's resources to the best of its ability. Unlike the time-sliced CPU, we have not completely lost the ability to get work done on background threads during these intervals. However, since we are forcing the processor to give a fetch opportunity to a thread where it may not have done so under a purely throughput-oriented policy, we expect that our throughput will suffer compared to RR or IC.

We ran the new policy with a priority rotation scheduled to occur every 75 million cycles. The resulting curves in Figure 2 are marked PI (Prioritized I-Count) and demonstrate improved fairness over the pure RR and IC policies (for the same workloads), and improved speedup over the time-slicing policy. In one instance (Workload 3), the rotating priority policy flattens the curve by increasing the speedup of one thread by approximately 10% at the expense of lower speedups for the other threads.

3.3 Reducing cache and branch predictor contention

Though threads within an SMT processor may not interact directly, they share execution resources, and thus do have an effect on each other. The exact nature of these interactions will vary widely with the processor architecture and the workload being executed. An excellent example of this is illustrated in Figure 3.

[Figure 3. Throughput vs. latency (ijpeg initialization phase). Axes: normalized latency vs. total throughput (IPC); curves for the RR and PR policies.]

This graph is a latency-throughput graph similar to Figure 1. In this case, the foreground thread is ijpeg in its initialization phase. As the figure indicates, when running alone, this phase of ijpeg's execution achieves an impressive throughput of 4.4 IPC. It is so efficient in its use of processor resources that, for the RR and IC fetch policies, the introduction of any background thread causes a serious drop in total processor throughput and a 100% increase in foreground thread latency. The use of a prioritized fetch policy allows the background threads to execute at a rate of less than 0.01 IPC. As a result, the data points for the prioritized round-robin policy plot essentially on top of one another. Luckily, this type of behavior seems to be rare.

We can quantify the contention effects of SMT by examining changes in cache miss rates and branch prediction accuracy, as was done previously by Hily and Seznec [10][11]. We looked at two base cases: an SMT processor using the round-robin fetch policy and a single-threaded out-of-order processor using time-slicing to run multiple threads. The simulated branch predictor accuracies, L1 data cache miss rates, and processor throughput for the perl benchmarks are shown in Figure 4.

[Figure 4. Cache and branch predictor effects: L1 D-cache miss rate, branch prediction accuracy, and total IPC vs. number of threads, for the time-slice, round-robin, and prioritized RR policies.]
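A minimal sketch of this bookkeeping follows: it compares one thread's L1 data-cache miss rate and branch prediction accuracy between a run alone and a run within an SMT mix. The counter values are hypothetical and the statistics structure is an illustrative assumption; our modified simulator gathers these counts internally.

/* Contention bookkeeping: miss-rate and prediction-accuracy deltas. */
#include <stdio.h>

struct stats {
    long accesses, misses;        /* L1 D-cache */
    long branches, mispredicts;   /* branch predictor */
};

static void report(const char *label, struct stats s)
{
    printf("%-14s miss rate %.2f%%, prediction accuracy %.2f%%\n", label,
           100.0 * s.misses / s.accesses,
           100.0 * (s.branches - s.mispredicts) / s.branches);
}

int main(void)
{
    /* Hypothetical counts for one thread, alone vs. in a 4-thread mix. */
    struct stats alone = {1000000,  8000, 200000,  8000};
    struct stats smt   = {1000000, 16000, 200000, 14000};
    report("alone:", alone);
    report("4-thread SMT:", smt);
    return 0;
}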
For the time-slicing processor, contention from additional threads degrades both branch predictor and cache performance, leading to slightly lower overall throughput. The SMT processor sees an even greater degradation in predictor and cache performance, but provides increased throughput nonetheless.

Time-slicing provides more favorable predictor and cache performance because it allows a single thread to run alone for a comparatively long period of time (8.5 to 12.5 million cycles for the figures shown). This unperturbed running time allows the branch predictor and caches to warm up without interference, leading to good performance for the running thread. The SMT processor's simultaneous threads do not allow any one thread to avoid interference in the branch predictor or cache as it would in the single-thread case.

Our rotating priority policy from the previous section should reduce interference similarly. In each scheduling interval, the foreground (high-priority) thread will receive a dominant fraction of the execution cycles, allowing it to warm up and exploit the branch predictor and caches with reduced (though non-zero) interference from other threads.
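The rotation mechanism itself is simple enough to sketch. The loop below moves the single high-priority designation round-robin among the threads at a fixed interval: the 75-million-cycle quantum matches the experiment above, while the names and event-loop structure are illustrative assumptions rather than our simulator's implementation. Making the per-thread quanta unequal is exactly the CPU-allocation knob noted in Section 3.2.

/* Rotating the high-priority designation among equal-priority threads. */
#include <stdio.h>

#define NT 3

int main(void)
{
    /* Cycles each thread spends as the foreground thread. Equal shares
     * mirror the experiment above; unequal shares bias CPU allocation
     * without disabling multithreading. */
    long quantum[NT] = {75000000, 75000000, 75000000};
    int priority[NT] = {1, 0, 0};   /* thread 0 starts as foreground */

    int fg = 0;
    long now = 0;
    for (int step = 0; step < 6; step++) {   /* a few rotations */
        now += quantum[fg];
        priority[fg] = 0;                    /* demote old foreground */
        fg = (fg + 1) % NT;
        priority[fg] = 1;                    /* promote the next thread */
        printf("cycle %ld: thread %d (priority %d) is now foreground\n",
               now, fg, priority[fg]);
    }
    return 0;
}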
We ran experiments that gave each thread two equal-length periods as the foreground thread. Our results for the prioritized round-robin (PR) fetch policy are also found in Figure 4. As expected, we have improved branch prediction and cache performance over the pure Round-Robin policy. This improved predictor and cache performance relative to RR does not translate into improved throughput for this simple prioritization scheme on these workloads. However, we expect that better prioritization schemes and/or more demanding workloads may translate this increased resource efficiency into an overall performance improvement.

4 Conclusions and future work

The prioritized fetch policies that we have developed have shown themselves to be effective in reducing the impact of background threads on the foreground thread's latency. These policies also have the ability to trade speedup for fairness in applications where this is important. Finally, we have shown that a prioritized scheme can limit the pathological behavior of some workloads by addressing the problem of resource contention between threads.

Though our rotating priority scheme clearly reduces cache and branch predictor contention, this utilization improvement does not produce an overall throughput gain over non-prioritizing policies with our simple prioritization scheme and our workload mix. However, we hypothesize that a throughput gain may be realized using more sophisticated prioritization policies and/or workloads with higher resource contention. One policy currently under investigation uses branch confidence measures to temporarily reduce the priority of threads executing down a low-confidence path. An SMT processor is ideally suited to use branch confidence because of the processor's ability to dynamically reallocate processor resources to other threads. By allocating these resources to threads on higher-confidence paths of execution, performance should be improved and contention for resources between threads should be reduced.

To the extent that we can execute background threads without significantly increasing the latency of a foreground thread, prioritization makes available free execution cycles that can be exploited in novel ways. Even when only a single application thread is available, these free cycles may be used in support of that application to manage or optimize cache or branch prediction resources, to collect profiling data, or even to re-optimize application code on the fly.

Acknowledgments

This work was supported by IBM, Intel, and the National Science Foundation under award CCR.

References

[1] D. M. Tullsen, S. J. Eggers, and H. M. Levy. Simultaneous Multithreading: Maximizing On-Chip Parallelism. In 22nd Annual International Symposium on Computer Architecture, June 1995.
[2] D. M. Tullsen, S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, and R. L. Stamm. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor. In 23rd Annual International Symposium on Computer Architecture, May 1996.
[3] B. J. Smith. Architecture and Applications of the HEP Multiprocessor Computer System. In Proceedings of the SPIE, volume 298, 1981.
[4] G. Alverson et al. Exploiting Heterogeneous Parallelism on a Multithreaded Multiprocessor.
In Proceedings of the International Conference on Supercomputing, July 1992.
[5] A. Agarwal, B. H. Lim, D. Kranz, and J. Kubiatowicz. APRIL: A Processor Architecture for Multiprocessing. In 17th Annual International Symposium on Computer Architecture, May 1990.
[6] S. W. Keckler and W. J. Dally. Processor Coupling: Integrating Compile Time and Runtime Scheduling for Parallelism. In 19th Annual International Symposium on Computer Architecture, May 1992.
[7] D. E. Culler, M. Gunter, and J. C. Lee. Analysis of Multithreaded Microprocessors under Multiprogramming. Univ. of California, Berkeley, Computer Science Division Tech. Report No. UCB/CSD 92/687, May 1992.
[8] J. Laudon, A. Gupta, and M. Horowitz. Interleaving: A Multithreading Technique Targeting Multiprocessors and Workstations. In Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VI), October 1994.
[9] D. Burger and T. M. Austin. The SimpleScalar Tool Set, Version 2.0. Technical Report #1342, University of Wisconsin-Madison Computer Sciences Department, June 1997.
[10] S. Hily and A. Seznec. Branch Prediction and Simultaneous Multithreading. In 1996 International Conference on Parallel Architectures and Compilation Techniques, October 1996.
[11] S. Hily and A. Seznec. Contention on 2nd Level Cache May Limit the Effectiveness of Simultaneous Multithreading. Internal Publication #1086, IRISA, February 1997.
[12] T.-Y. Yeh and Y. Patt. Two-Level Adaptive Branch Prediction. In Proceedings of the 24th Annual International Symposium on Microarchitecture, November 1991.
[13] G. S. Sohi. Instruction Issue Logic for High-Performance, Interruptible, Multiple Functional Unit, Pipelined Computers. IEEE Transactions on Computers, 39(3), March 1990.