Balancing Throughput and Fairness in SMT Processors


Kun Luo (ECE Department, University of Maryland, College Park, MD), Jayanth Gummaraju (Dept. of Electrical Engineering, Stanford University, Stanford, CA), Manoj Franklin (ECE Department and UMIACS, University of Maryland, College Park, MD)

Abstract

Simultaneous Multithreading (SMT) is an execution model that executes multiple threads in parallel within a single processor pipeline. Usually, an SMT processor uses shared instruction queues to collect instructions from the different threads. Hence, an SMT processor's performance depends on how the instruction fetch unit fills these instruction queues every cycle. In the recent past, many schemes have been proposed for fetching instructions into the SMT pipeline. These schemes focused on increasing the throughput, using the number of instructions and the number of low-confidence branch predictions currently in the pipeline to decide which threads to fetch from. The goal of this paper is to investigate fetch policies that find a balance between fairness and throughput. We present metrics to quantify fairness. We then discuss techniques that use a set of pipeline system variables to achieve balanced throughput and fairness. Finally, we evaluate several fetch policies. Our evaluation confirms that many of our fetch policies provide a good balance between throughput and fairness.

1 Introduction

Simultaneous multithreading (SMT) is a processing model in which multiple thread contexts are simultaneously active in a single pipeline [1] [2] [9]. The performance contribution of each thread depends on the amount of resources available to it. Unfortunately, the relationship between a thread's performance and the amount of resources allocated to it is rarely linear. As a thread receives more resources, its performance increases somewhat uniformly up to a point, beyond which the increase tends to be marginal. Importantly, the instruction fetch unit, which supplies instructions from different threads to the rest of the pipeline, can control such resource allocation by slowing down or speeding up the fetch rate of specific threads.

Tullsen et al. [8] investigated several fetch policies for SMT processors. Among these, a policy called ICOUNT was found to provide the best performance. This policy gives the highest priority to the thread with the fewest instructions in the pipeline; the underlying assumption is that such threads are likely to make the most efficient use of processor resources. This policy, however, does not consider the probability of each thread being on a wrong speculative path. Luo et al. [4] addressed this problem with a fetch policy (LC-BPCOUNT) that considers the number of outstanding low-confidence branch predictions to prioritize threads. Threads with the fewest low-confidence branch predictions are given the highest priority, with the pipeline instruction count used as a tie-breaker, if required. This fetch policy provides very high throughput, as it allocates the processor resources to threads that are very likely to be on their correct paths. However, it does not attempt to be fair. If a thread has many low-confidence branch predictions, it is likely to be consistently passed over in favor of threads that have fewer low-confidence branch predictions.

Both fairness and throughput are important when using an SMT processor.
Whereas higher throughput ensures higher utilization of processor resources, fairness ensures that all threads are given equal opportunity and that no thread is forced to starve. Typically, different threads have different rates at which their instructions execute, and hence different native throughputs. Therefore, considering only throughput will result in always giving priority to threads with high instruction execution rates, causing the rest to suffer. In a multi-user environment, users experiencing slow response times will not be very pleased. On the other hand, considering only fairness will result in an inefficient use of pipeline resources, because a thread may often receive resources when it is least capable of utilizing them. On the surface, it appears that naive approaches to increase throughput work against fairness, and vice versa. This paper investigates SMT fetch policies, with special emphasis on enhancing both throughput and fairness.

We present and evaluate fetch policies that are geared to provide balanced throughput and fairness. The basic idea behind these policies is to use a combination of prioritizing and selective gating, based on pipeline status. Our experimental analysis shows that it is indeed possible to achieve high throughput and fairness at the same time in SMT processors.

The rest of this paper is organized as follows. Section 2 reviews background information on the SMT processor and SMT fetch policies. Section 3 discusses fairness and metrics for ensuring fairness. Section 4 describes ways of using different types of feedback from the pipeline for obtaining throughput as well as fairness. Section 5 presents the experimental results, and Section 6 presents the conclusions.

2 Simultaneous Multithreading

2.1 SMT Processor

The SMT processing engine, like most other processing engines, comprises two major parts: the fetch engine and the execute engine. The fetch engine fills the instruction queues (IQs) with instructions, and includes the i-cache, the branch predictor, the fetch unit, the decode unit, and the register rename unit, as shown in Figure 1. The execute engine drains the instruction queues, and includes the instruction issue logic, the functional units, the memory hierarchy, the result forwarding mechanism, and the reorder buffer.

[Figure 1: Block Diagram of an SMT Processor. The fetch engine (fetch unit, integer and load/store instruction queues) feeds the execute engine (functional units, data cache).]

The key features of the processor can be summarized as follows:

1. The major resources are shared by all active threads.
2. Every clock cycle, instructions from all active threads compete for each of the shared resources.
3. Instructions of a thread are fetched and committed strictly in program order.
4. Each of the resources in the processor has very limited buffering capabilities.
5. Once an instruction enters the processor pipeline, it is not pre-empted (i.e., discarded without execution) from the pipeline unless it is found to be from an incorrect path.

Therefore, it is very important to select the right instructions every cycle at the fetch part of the pipeline, where instructions enter the pipeline. In this paper, we focus on utilizing the available fetch bandwidth to bring the 'best' instructions into the IQs.

2.2 SMT Fetch Policies

Tullsen et al. [8] studied several fetch policies for SMT processors, so as to improve on the simple round-robin priority policy with the use of feedback from the processor pipeline. Of these, a policy called ICOUNT was found to provide good throughput. ICOUNT gives the highest priority every cycle to those threads that have the fewest instructions in the processor pipeline. The motivation for this policy is two-fold: (i) give the highest priority to threads whose instructions are moving through the pipeline efficiently, and (ii) provide an even mix of instructions from the available threads. This naturally prevents any one thread from monopolizing the processor pipeline. However, it does not take into account whether a particular thread is in the correct path of execution or not. Because of control speculation, many of the instructions in an SMT pipeline can be from wrong paths. Wrong-path instructions not only have zero contribution to the throughput, but also tie up valuable resources, preventing correct-path instructions from being executed.
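As a rough illustration (a sketch in Python; the paper gives no code, so the Thread class and its field names are assumptions), ICOUNT-style prioritization amounts to a simple sort on per-thread pipeline occupancy:

```python
# A minimal sketch of ICOUNT-style fetch prioritization; the Thread class
# and field names are hypothetical.
from dataclasses import dataclass

@dataclass
class Thread:
    tid: int
    insts_in_pipeline: int   # instructions currently between fetch and commit

def icount_priority(threads):
    """Order threads for fetch: fewest in-flight instructions first."""
    return sorted(threads, key=lambda t: t.insts_in_pipeline)

# Thread 2 has the fewest instructions in the pipeline, so it receives the
# highest fetch priority this cycle.
threads = [Thread(0, 24), Thread(1, 10), Thread(2, 3)]
print([t.tid for t in icount_priority(threads)])   # [2, 1, 0]
```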
For SMT processors, speculation control of individual threads is beneficial to the overall performance, because the resources that would have been spent on wrong-path instructions of one thread can instead be diverted for use by other threads that are on the right path. To increase the overall performance of SMT processors, we have to reduce the number of incorrectly speculated instructions, so as to save resources for non-speculative or correctly speculated instructions. Luo et al. [4] investigated the use of confidence estimators [3] [5] to reduce the number of wrong-path instructions. Individual threads are prioritized every cycle based on the number of outstanding low-confidence branch predictions. In essence, speculation control is used to improve the overall SMT throughput, as the resources that would have been spent on wrong-path instructions of one thread instead get diverted to other threads that are on the right path. However, this scheme pays no attention to fairness. In fact, its basic principle works against fairness!
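Continuing the illustrative sketches (again with hypothetical names, not the paper's code), LC-BPCOUNT prioritization is a two-key sort: the outstanding low-confidence branch prediction count first, and the pipeline instruction count as the tie-breaker:

```python
# A minimal sketch of LC-BPCOUNT prioritization: fewest outstanding
# low-confidence branch predictions first, pipeline instruction count as
# the tie-breaker. The Thread class and field names are hypothetical.
from dataclasses import dataclass

@dataclass
class Thread:
    tid: int
    low_conf_branches: int   # unresolved low-confidence branch predictions
    insts_in_pipeline: int

def lc_bpcount_priority(threads):
    # The tuple key gives the two-level comparison: C first, then I.
    return sorted(threads, key=lambda t: (t.low_conf_branches,
                                          t.insts_in_pipeline))

# Threads 1 and 2 tie on the low-confidence count; the instruction count
# breaks the tie in favor of thread 2.
threads = [Thread(0, 2, 5), Thread(1, 0, 30), Thread(2, 0, 12)]
print([t.tid for t in lc_bpcount_priority(threads)])   # [2, 1, 0]
```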

3 Fairness Metrics

If the SMT fetch unit focuses only on maximizing the instruction throughput, then threads with low branch prediction accuracies and low throughputs may not get a fair share of the SMT resources. The end result will be that these threads proceed at a slow pace, even though the overall throughput of the SMT processor is very high. Therefore, every cycle, it is important that the SMT fetch unit pay special attention to fairness when deciding which threads to fetch from. Notice that by fairness we do not mean that all threads should have the same throughput; instead, we mean that all threads should get an equal opportunity to utilize resources.

Although the term fairness is quite familiar, quantifying it is a somewhat difficult endeavor! There are several different ways in which fairness can be defined and measured. One option is to use the metric thread fetch priority, to see if each thread gets more or less equal priority during fetching. With this metric, a straightforward approach to guarantee fairness is round robin fetching, in which the threads take strict turns at fetching. With round robin fetching, however, the overall instruction throughput may be severely affected. This is because this fetch policy gives no consideration to a thread's ability to utilize additional resources allocated to it at that time. It also does not consider the thread's probability of being on the correct execution path.

Another possible metric for fairness is individual thread speedup, i.e., the speedup experienced by each thread with respect to the throughput it obtained when executed in single-thread mode. We can consider the variance (or standard deviation) in thread speedups; if all threads experience similar speedups, then we may conclude that the fetch policy is fair (although the threads may not have had the same fetch priority). This metric has the drawback that it does not give any importance to the overall throughput. It is conceivable that a fetch policy slows down every thread equally, and still provides fairness!

Based on the discussion so far, instead of having two different metrics, one for throughput and another for fairness, it seems better to use a single metric that encapsulates both. One such metric is the harmonic mean of the individual thread speedups. (The weighted speedup metric used in [7] is obtained by taking the arithmetic mean of the speedup values and multiplying it by the number of threads. The arithmetic mean does not capture fairness, whereas the harmonic mean tends to be lower if one or more threads have lower speedups, thereby capturing some effect of the lack of fairness.) Whatever metric we use to quantify fairness, however noble the intentions, it is conceivable to find some amount of criticism against that metric.
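To make the contrast concrete, here is a small numerical illustration (the speedup values are invented for the example) of why the harmonic mean penalizes unfairness while the arithmetic mean does not:

```python
# A worked example: the harmonic mean drops sharply when one thread starves,
# whereas the arithmetic mean can still look healthy.
from statistics import harmonic_mean

fair   = [0.60, 0.55, 0.65, 0.60, 0.60]   # all threads slow down evenly
unfair = [1.00, 1.00, 1.00, 0.90, 0.10]   # one thread starves

for label, s in (("fair", fair), ("unfair", unfair)):
    arith = sum(s) / len(s)
    print(f"{label}: arithmetic={arith:.2f} harmonic={harmonic_mean(s):.2f}")
# fair:   arithmetic=0.60 harmonic=0.60
# unfair: arithmetic=0.80 harmonic=0.35  <- the starving thread drags it down
```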
4 Prioritizing and Gating

In order to achieve high throughput as well as fairness, we take feedback from the processor pipeline and modify the thread fetch priorities accordingly. Feedback from the pipeline can be collected in terms of system variables such as the low-confidence branch prediction count and the instruction count. There are at least two ways to use the pipeline system variables: either to prioritize the threads (prioritizing) or to stop fetching from specific threads (gating).

4.1 Fetch Prioritizing

The current value of a pipeline system variable can be used to assign fetch priorities to the threads. For instance, in the ICOUNT fetch policy, the pipeline instruction count values of the threads determine their fetch priorities for the current cycle. In order to obtain good throughput as well as fairness, we may need to consider multiple system variables at the same time. When multiple system variables are used for prioritizing, there must be some means to combine the fetch priority values given by each variable. One way to achieve this is to assign a weight to each system variable and calculate an overall fetch priority for every thread. The drawback of this approach is that it requires the fetch unit to do complex calculations every cycle. A more feasible approach is to perform prioritization in multiple steps or levels, with each level representing a system variable. That is, one of the variables is first used to partition the threads into different priority groups (with the members of each group having equal priority), then another variable is used to further prioritize threads that fall into the same group, and so on.

An important aspect to consider while doing multi-level partitioning is deciding the partitioning granularity at each level. Consider the pipeline variable instruction count. If the value of this variable is used as such for prioritizing at the first level, chances are that a definite fetch priority order will be established after the first level itself. This leaves the remaining system variables unused, and the policy becomes the same as ICOUNT. Therefore, we need to consider a range of values to be in one group. If this range is too large, then all of the threads may fall into the same group, leaving the current partitioning level unused.
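As an illustrative sketch of this two-level scheme (the tuple layout and bucket arithmetic are assumptions; a granularity of 3 matches the policy evaluated in Section 5):

```python
# A minimal sketch of two-level prioritization with a partitioning
# granularity. Level 1 buckets threads by instruction count in ranges of
# GRANULARITY slots; level 2 orders a bucket by low-confidence branch count.
GRANULARITY = 3   # instruction-count range treated as one priority group

def two_level_priority(threads):
    # threads: list of (tid, insts_in_pipeline, low_conf_branches) tuples
    return sorted(threads,
                  key=lambda t: (t[1] // GRANULARITY,  # level 1: I, coarse
                                 t[2]))                # level 2: C, exact

# Threads 0 and 1 fall into the same instruction-count bucket (3..5), so
# the low-confidence count decides between them at the second level.
print(two_level_priority([(0, 5, 2), (1, 3, 0), (2, 9, 0)]))
# -> [(1, 3, 0), (0, 5, 2), (2, 9, 0)]
```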

4.2 Fetch Gating

The previous subsection discussed how threads can be prioritized based on pipeline variables. Next, we discuss the importance of using gating in addition to prioritization for providing fairness. For motivation, consider the data in Figure 2, which shows how the single-thread throughputs vary when the IQ size is varied. We can see that the performance of a single thread begins to saturate once the IQ size allocated to it grows beyond a modest number of slots. This means that it is not very efficient to provide a thread with more IQ slots than that. Once a thread begins to occupy that many IQ slots, it is better to temporarily stop fetching from that thread altogether and use the fetch bandwidth to fetch from other threads, until the situation improves. That is, the fetch unit will temporarily throttle fetch activity from those threads, even if they are very likely to be on the correct paths. Such fetch gating helps towards providing fairness, without affecting the throughput in a significant manner. When multiple system variables are used for fetch gating, they can all be applied simultaneously.

[Figure 2: Single-Thread Throughputs for Different IQ Sizes. X-axis: Resources Allocated (IQ slots).]
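A minimal sketch of such gating (the low-confidence limit of 2 is the default given in Section 5.1; the instruction-count limit is an assumption, since the saturation point read off Figure 2 is not legible in this transcription):

```python
# A minimal sketch of fetch gating on two pipeline variables applied
# simultaneously. Threshold values: LOW_CONF_LIMIT is the paper's default;
# IQ_SLOT_LIMIT is an assumed saturation point.
from dataclasses import dataclass

@dataclass
class Thread:
    tid: int
    insts_in_pipeline: int
    low_conf_branches: int

IQ_SLOT_LIMIT = 24    # assumed IQ-occupancy saturation point
LOW_CONF_LIMIT = 2    # default gating threshold from Section 5.1

def fetch_candidates(threads):
    # A thread is dropped from this cycle's fetch candidates if it trips
    # either threshold; prioritization then runs over the survivors.
    return [t for t in threads
            if t.insts_in_pipeline <= IQ_SLOT_LIMIT
            and t.low_conf_branches <= LOW_CONF_LIMIT]
```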
5 Experimental Evaluation

This section presents an experimental evaluation of the different fetch prioritizing and gating techniques that are geared towards providing balanced throughput and fairness in SMT processors.

5.1 Evaluation Setup

The experiments in this section are conducted using a simulator derived from the public-domain SMT simulator developed by Tullsen et al. [8]. The simulator executes unmodified Alpha object code, and models the fetch engine and the execute engine of an SMT processor. Some of the simulator parameters are fixed as follows. The instruction pipeline has 9 stages, based on the Alpha pipeline, but includes extra cycles for accessing a large register file. Functional unit latencies are also based on the Alpha processor. The processor has 64-slot integer and floating-point IQs, 12 integer functional units, and 6 floating-point units. The memory hierarchy has 64 KB 2-way set-associative instruction and data caches, a 1024 KB 2-way set-associative on-chip L2 cache, and a 4 MB off-chip cache. Cache line sizes are all 64 bytes. All the on-chip caches are 8-way banked. Cache miss penalties are 6 cycles to the L2 cache, another 12 cycles to the L3 cache, and another 62 cycles to main memory. The fetch unit fetches a maximum of f instructions in a cycle from up to 2 threads, where f is the fetch bandwidth. The second thread gets an opportunity only if f instructions could not be fetched from the first thread, due to events such as cache misses. We simulate two fetch sizes, 8 and 16, and a fetch bandwidth of 8 instructions per cycle.

Workload: Our workload consists of the following 5 programs from the SPEC95 integer benchmark suite: compress95, gcc, go, li, and ijpeg. These programs have different individual IPC values ranging from 2 to 4, and have different branch misprediction rates. We compiled each program with gcc at the -O4 optimization level. The measurement strategy is kept the same as that used in [8]: each data point is collected by simulating the SMT processor for a total of 500 million instructions.

Metrics: We use the following 3 metrics to get a clear idea of the throughput and fairness associated with each fetch policy:

Throughput: We measure throughput in terms of IPC (instructions per cycle). We measure the IPC of each thread, as well as the overall IPC of the SMT processor.

Thread Fetch Priority: This metric measures the average fetch priority given to each thread, with 0 being the highest priority.

Thread Speedup: This metric indicates the speedup experienced by each thread in SMT mode compared to executing the thread in single-thread mode.
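Returning to the fetch scheme in the setup above, a sketch of the per-cycle fetch loop (the helper fetch_from is hypothetical):

```python
# A minimal sketch of the simulated fetch scheme: at most f instructions
# per cycle from up to two threads, the second thread receiving only the
# bandwidth the first thread leaves unused.
F = 8   # fetch bandwidth, instructions per cycle

def fetch_cycle(prioritized_threads, fetch_from):
    """fetch_from(thread, n) fetches up to n instructions and returns how
    many were actually fetched (fewer on events such as i-cache misses)."""
    fetched = 0
    for thread in prioritized_threads[:2]:   # at most two threads per cycle
        fetched += fetch_from(thread, F - fetched)
        if fetched >= F:                     # bandwidth exhausted
            break
    return fetched
```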

Pipeline System Variables: The pipeline system variables that we use for our fetch policies are:

Round Robin-based Priority (R): We keep track of the fetch priorities assigned to the threads in the last several cycles.

Instruction Count (I): Every cycle, we measure the number of instructions from each thread in the pipeline.

Low-Confidence Branch Prediction Count (C): We use a JRS confidence estimator [5] to assess the quality of each branch prediction. This estimator parallels the structure of the gshare branch predictor [6], and uses a table of miss distance counters (MDCs) to keep a record of branch prediction correctness. Each MDC is a saturating resetting counter. Correct predictions increment the corresponding MDC, whereas incorrect predictions reset the MDC to zero. A branch is considered to have high confidence only when the MDC has reached a particular threshold value called the MDC-threshold. The default MDC-threshold value is set to 8. The default gating threshold is set to 2; that is, when gating is used, no instructions are fetched from a thread if it has more than 2 unresolved low-confidence branch predictions.

5.2 Fetch Policies Simulated

We start with three existing policies (round robin, ICOUNT, and LC-BPCOUNT) and enhance them with the aim of achieving balanced throughput and fairness. The basic round robin policy is good for fairness, but poor for throughput, partly because it does not consider wrong-path instructions. Therefore, we enhance it by adding gating. The basic ICOUNT policy does not consider the probability of threads being on wrong paths, so we enhance it by using the low-confidence branch prediction count for prioritizing or gating. The basic LC-BPCOUNT policy does not limit the number of instructions in the pipeline from a thread, so we enhance it by adding gating based on the instruction count. Altogether, we evaluate the following 7 fetch policies. For naming these policies, we use the notation (A,B)p + (C,D)g, where A, B, C, and D are pipeline system variables, each one of {R, I, C}. Variables A and B are used for prioritization, in that order, and variables C and D are used for gating.

1. Rp: basic round robin policy
2. Rp + (I,C)g: round robin policy augmented with gating based on instruction count and low-confidence branch prediction count
3. Ip: basic ICOUNT policy
4. Ip + (I,C)g: ICOUNT policy augmented with gating based on instruction count and low-confidence branch prediction count
5. (I,C)p: 2-level prioritization using instruction count and low-confidence branch prediction count. The first level, which uses the instruction count, uses a granularity of 3 for partitioning
6. (C,I)p: basic LC-BPCOUNT policy, which is a 2-level prioritizing policy
7. (C,I)p + Ig: basic LC-BPCOUNT policy, augmented with gating based on instruction count
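The C variable used by several of these policies comes from the JRS estimator described above. A sketch of its counter mechanism (the table size, counter width, and gshare-style index are assumptions; the MDC-threshold of 8 is the paper's default):

```python
# A minimal sketch of a JRS-style miss distance counter (MDC) table.
TABLE_SIZE = 4096
MDC_MAX = 15           # assumed saturation value of the counter
MDC_THRESHOLD = 8      # default MDC-threshold

mdc_table = [0] * TABLE_SIZE

def index_of(pc, global_history):
    return (pc ^ global_history) % TABLE_SIZE   # gshare-style hashing

def update(idx, prediction_was_correct):
    # Saturating resetting counter: correct predictions increment the MDC,
    # an incorrect prediction resets it to zero.
    if prediction_was_correct:
        mdc_table[idx] = min(mdc_table[idx] + 1, MDC_MAX)
    else:
        mdc_table[idx] = 0

def high_confidence(idx):
    # A prediction is high confidence once its MDC reaches the threshold;
    # otherwise it counts toward the thread's low-confidence total C.
    return mdc_table[idx] >= MDC_THRESHOLD
```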
5.3 Throughput

Let us first look at the throughput results for the different fetch policies. Figure 3 presents the thread IPCs as well as the overall IPCs obtained with the different fetch policies, for fetch sizes of 8 (left figure) and 16 (right figure). The results are very similar for both fetch sizes. In both figures, the Y-axis indicates the IPC, and the X-axis lays out the different fetch policies. Each fetch policy has 6 histogram bars; the first 5 correspond to the IPCs of the 5 threads, and the last one corresponds to the overall IPC.

Among the different fetch policies, the round robin fetch policy (Rp) has the lowest IPC. The main reasons for this are that, when deciding to fetch from a particular thread, (i) it does not consider whether that thread is likely to be on a wrong speculative path at that time, and (ii) it does not consider the appropriateness of allocating more resources to that thread at that time. Notice that when the basic round robin scheme is enhanced by gating based on instruction count as well as low-confidence branch prediction count (the Rp + (I,C)g policy), the overall IPC increases by 14% for a fetch size of 8 and 22% for a fetch size of 16. What is interesting to note is that this increase in overall IPC is obtained without sacrificing the individual thread IPCs of any of the 5 threads! In fact, the individual thread IPCs are higher for all threads; the maximum increase is seen for ijpeg, as its branch predictions have very high confidence. This highlights the utility of enhancing the round robin fetch policy with the use of gating.

Next, consider the third fetch policy, namely ICOUNT (Ip). The overall IPC is much better than that of the round robin policy, but is slightly lower than that of round robin enhanced by gating. (Although the overall IPCs obtained for the Rp + (I,C)g and Ip policies are very close, the individual thread IPCs differ by a large extent between the two policies.)

[Figure 3: SMT Throughput Values for (i) Fetch Size = 8; (ii) Fetch Size = 16.]

The individual thread IPCs are also consistently higher than those of the round robin policy. The ICOUNT policy, as mentioned earlier, gives the highest priority to threads that have the fewest instructions in the pipeline, and that are thus very likely to clear the pipeline sooner. The result is that the IPCs of all threads increase, compared to those obtained with the round robin policy. When the basic ICOUNT policy is enhanced by gating based on instruction count as well as low-confidence branch prediction count (the Ip + (I,C)g policy), the overall IPC increases by 2.7% (for fetch size 8). When the basic ICOUNT policy is enhanced by an additional level of prioritization using the low-confidence branch prediction count (the (I,C)p policy), the IPC values are very similar to those obtained with gating.

Finally, consider the two fetch policies with the low-confidence branch prediction count as the first prioritizing level: the (C,I)p policy (LC-BPCOUNT) and the (C,I)p + Ig policy. These two policies register the highest overall IPC values. The use of the low-confidence branch prediction count at the first prioritizing level has a marked impact on the number of wrong-path instructions that enter the pipeline. However, as we will see later, these two policies are not very fair.

5.4 Thread Fetch Priority

So far, we were primarily analyzing throughput results, with some consideration given to fairness. Next, let us consider metrics that are more germane to fairness. The first metric we consider in this category is thread fetch priority. Figure 4 presents the average thread fetch priority values obtained for the two fetch sizes. Notice that a lower value for the priority metric indicates higher priority. From the figure, we can see that the round robin policy provides more or less the same average fetch priority for all threads, as expected. The maximum discrepancy in average fetch priority is seen for the (C,I)p + Ig policy. As we saw in Figure 3, this policy has the highest throughput. Next, let us look at the thread fetch priorities obtained with the other fetch policies. When the basic round robin scheme is enhanced by gating based on instruction count as well as low-confidence branch prediction count (i.e., the Rp + (I,C)g policy), the average fetch priorities become somewhat varied. These observations are consistent for both fetch sizes.

5.5 Thread Speedup

We have seen metrics that focus primarily on either throughput or fairness. Next, let us consider metrics that address both throughput and fairness. These are the harmonic mean and standard deviation of the relative throughputs obtained for the individual threads, compared to their throughputs in single-thread mode. Considering the relative throughputs helps to factor out the inherent discrepancies between the different threads. Figure 5 presents these results. The format of this figure is similar to that of the previous three figures. The first 5 bars for each fetch policy indicate the relative IPC values for the 5 threads. The 6th and 7th bars indicate the harmonic mean and standard deviation, respectively, of the 5 relative IPC values.

Let us analyze the results presented in Figure 5.

[Figure 4: Average Thread Fetch Priorities Obtained in SMT Mode: (i) Fetch Size = 8; (ii) Fetch Size = 16.]

[Figure 5: Relative IPCs Obtained for the Different Threads when Executed in SMT Mode. Relative IPCs are calculated with respect to Single-Thread IPCs. (i) Fetch Size = 8; (ii) Fetch Size = 16. Bars per policy: compress95, go, gcc, li, ijpeg, harmonic mean, and standard deviation.]

Looking at the harmonic mean values, the last 4 policies (Ip + (I,C)g, (I,C)p, (C,I)p, and (C,I)p + Ig) have very high values for both fetch sizes. Among these, the first two have low standard deviations, and the last two have high standard deviations. Therefore, the first two policies (Ip + (I,C)g and (I,C)p), which have high harmonic means as well as low standard deviations, are good at providing balanced throughput and fairness. Their overall IPC values (cf. Figure 3) are very similar, and reasonably high. If we want to give somewhat higher preference to throughput, along with some fairness, then the last policy ((C,I)p + Ig) is a good one to use.

6 Summary and Conclusions

Simultaneous Multithreading (SMT) permits multiple threads to execute in parallel within a single processor. Usually, an SMT processor uses shared instruction queues to collect instructions from the different threads. Hence, an SMT processor's performance depends on how the instruction fetch unit fills these instruction queues. On each cycle, the fetch unit must judiciously decide which threads to fetch instructions from.

This paper addressed the issue of balancing throughput and fairness when fetching instructions into the SMT pipeline.

We highlighted the importance of achieving a balance between throughput and fairness. We also proposed metrics that consider both fairness and throughput. Then, we investigated techniques that utilize pipeline system variables to obtain balanced throughput and fairness. The basic idea is to use multiple system variables at the same time for prioritizing the threads, and to temporarily stop fetching from some threads. By limiting the amount of resources given to a particular thread, resources can be better distributed to achieve higher throughput as well as fairness.

Our experimental evaluation showed that the basic round robin policy provides very low throughput, although it provides good fairness when the average thread fetch priority is considered as the metric. Enhancements to the round robin policy provide throughput improvements to all threads, though not necessarily by the same amount. This means that the use of average thread fetch priority as a metric is not necessarily a good idea. When better metrics such as the harmonic mean and standard deviation of the individual thread speedups were used, we found that enhancements to the basic ICOUNT policy provide good throughput as well as fairness. These enhancements include gating with instruction count and low-confidence branch prediction count, or two-level prioritization with the low-confidence branch prediction count at the second level. We also found that very high throughputs, with moderate fairness, can be achieved by adding instruction count-based gating to a 2-level prioritizer that uses the low-confidence branch prediction count at the first level and the instruction count at the second level.

Acknowledgements

This work was supported by the U.S. National Science Foundation (NSF) through a CAREER grant (MIP) and a regular grant (CCR). We are thankful to the reviewers for their helpful and insightful comments.

References

[1] G. E. Daddis, Jr. and H. C. Torng, "The Concurrent Execution of Multiple Instruction Streams on Superscalar Processors," Proc. International Conference on Parallel Processing (ICPP), pp. I:76-83, 1991.
[2] S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, R. L. Stamm, and D. M. Tullsen, "Simultaneous Multithreading: A Platform for Next-Generation Processors," IEEE Micro, September/October 1997.
[3] D. Grunwald, A. Klauser, S. Manne, and A. Pleszkun, "Confidence Estimation for Speculation Control," Proc. 25th International Symposium on Computer Architecture (ISCA), 1998.
[4] K. Luo, M. Franklin, S. S. Mukherjee, and A. Seznec, "Boosting SMT Performance by Speculation Control," Proc. 15th International Parallel & Distributed Processing Symposium (IPDPS), 2001.
[5] E. Jacobsen, E. Rotenberg, and J. E. Smith, "Assigning Confidence to Conditional Branch Predictions," Proc. 29th International Symposium on Microarchitecture (MICRO-29), December 1996.
[6] S. McFarling, "Combining Branch Predictors," WRL Technical Note TN-36, June 1993.
[7] A. Snavely and D. M. Tullsen, "Symbiotic Job Scheduling for a Simultaneous Multithreading Processor," Proc. 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IX), 2000.
[8] D. M. Tullsen, S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, and R. L. Stamm, "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor," Proc. 23rd International Symposium on Computer Architecture (ISCA), May 1996.
[9] W. Yamamoto and M. Nemirovsky, "Increasing Superscalar Performance Through Multistreaming," Proc. IFIP WG10.3 Working Conference on Parallel Architectures and Compilation Techniques (PACT '95), pp. 49-58, 1995.


More information

Execution-based Prediction Using Speculative Slices

Execution-based Prediction Using Speculative Slices Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers

More information

ECE404 Term Project Sentinel Thread

ECE404 Term Project Sentinel Thread ECE404 Term Project Sentinel Thread Alok Garg Department of Electrical and Computer Engineering, University of Rochester 1 Introduction Performance degrading events like branch mispredictions and cache

More information

The Alpha Microprocessor: Out-of-Order Execution at 600 Mhz. R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA

The Alpha Microprocessor: Out-of-Order Execution at 600 Mhz. R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA The Alpha 21264 Microprocessor: Out-of-Order ution at 600 Mhz R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA 1 Some Highlights z Continued Alpha performance leadership y 600 Mhz operation in

More information

The Use of Multithreading for Exception Handling

The Use of Multithreading for Exception Handling The Use of Multithreading for Exception Handling Craig Zilles, Joel Emer*, Guri Sohi University of Wisconsin - Madison *Compaq - Alpha Development Group International Symposium on Microarchitecture - 32

More information

A Study of Control Independence in Superscalar Processors

A Study of Control Independence in Superscalar Processors A Study of Control Independence in Superscalar Processors Eric Rotenberg, Quinn Jacobson, Jim Smith University of Wisconsin - Madison ericro@cs.wisc.edu, {qjacobso, jes}@ece.wisc.edu Abstract An instruction

More information

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2. Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Problem 1 Consider the following LSQ and when operands are

More information

Selective Fill Data Cache

Selective Fill Data Cache Selective Fill Data Cache Rice University ELEC525 Final Report Anuj Dharia, Paul Rodriguez, Ryan Verret Abstract Here we present an architecture for improving data cache miss rate. Our enhancement seeks

More information

Parallel Processing SIMD, Vector and GPU s cont.

Parallel Processing SIMD, Vector and GPU s cont. Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP

More information

1993. (BP-2) (BP-5, BP-10) (BP-6, BP-10) (BP-7, BP-10) YAGS (BP-10) EECC722

1993. (BP-2) (BP-5, BP-10) (BP-6, BP-10) (BP-7, BP-10) YAGS (BP-10) EECC722 Dynamic Branch Prediction Dynamic branch prediction schemes run-time behavior of branches to make predictions. Usually information about outcomes of previous occurrences of branches are used to predict

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

An Intelligent Fetching algorithm For Efficient Physical Register File Allocation In Simultaneous Multi-Threading CPUs

An Intelligent Fetching algorithm For Efficient Physical Register File Allocation In Simultaneous Multi-Threading CPUs International Journal of Computer Systems (ISSN: 2394-1065), Volume 04 Issue 04, April, 2017 Available at http://www.ijcsonline.com/ An Intelligent Fetching algorithm For Efficient Physical Register File

More information

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Ravi Rajwar and Jim Goodman University of Wisconsin-Madison International Symposium on Microarchitecture, Dec. 2001 Funding

More information

Software-Controlled Multithreading Using Informing Memory Operations

Software-Controlled Multithreading Using Informing Memory Operations Software-Controlled Multithreading Using Informing Memory Operations Todd C. Mowry Computer Science Department University Sherwyn R. Ramkissoon Department of Electrical & Computer Engineering University

More information

Efficient Resource Allocation on a Dynamic Simultaneous Multithreaded Architecture Ortiz-Arroyo, Daniel

Efficient Resource Allocation on a Dynamic Simultaneous Multithreaded Architecture Ortiz-Arroyo, Daniel Aalborg Universitet Efficient Resource Allocation on a Dynamic Simultaneous Multithreaded Architecture Ortiz-Arroyo, Daniel Publication date: 2006 Document Version Early version, also known as pre-print

More information

Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat

Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat Security-Aware Processor Architecture Design CS 6501 Fall 2018 Ashish Venkat Agenda Common Processor Performance Metrics Identifying and Analyzing Bottlenecks Benchmarking and Workload Selection Performance

More information

Threshold-Based Markov Prefetchers

Threshold-Based Markov Prefetchers Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this

More information

Understanding The Effects of Wrong-path Memory References on Processor Performance

Understanding The Effects of Wrong-path Memory References on Processor Performance Understanding The Effects of Wrong-path Memory References on Processor Performance Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt The University of Texas at Austin 2 Motivation Processors spend

More information

Assigning Confidence to Conditional Branch Predictions

Assigning Confidence to Conditional Branch Predictions Assigning Confidence to Conditional Branch Predictions Erik Jacobsen, Eric Rotenberg, and J. E. Smith Departments of Electrical and Computer Engineering and Computer Sciences University of Wisconsin-Madison

More information

A Dynamic Multithreading Processor

A Dynamic Multithreading Processor A Dynamic Multithreading Processor Haitham Akkary Microcomputer Research Labs Intel Corporation haitham.akkary@intel.com Michael A. Driscoll Department of Electrical and Computer Engineering Portland State

More information