Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors


Wenun Wang and Wei-Ming Lin
Department of Electrical and Computer Engineering, The University of Texas at San Antonio, San Antonio, TX, USA

Abstract

Effective distribution of critical shared resources among concurrently executing threads is key to improving overall system performance in Simultaneous Multi-Threading (SMT) processors. One of the most critical shared resources is the physical register file in the rename stage, and a disproportional distribution of these rename registers can easily render it a bottleneck along the pipeline stages. In this paper, we propose an architectural-level physical register file allocation algorithm to better utilize the register file. Once the overall physical register file utilization exceeds a certain threshold, the thread with the highest occupancy is temporarily suspended in order to allow other threads more space to proceed, achieving a higher throughput. Our simulations based on M-Sim [1] show that the proposed technique improves IPC (Instructions Per Clock cycle) by a very significant margin (up to 25%) without sacrificing execution fairness among the threads.

1 Introduction

To enhance overall performance in an SMT system by improving resource allocation fairness, several techniques have been proposed that modify the fetch policy. ICOUNT [2] gives priority to threads with the fewest instructions in the decode, rename and issue stages. DCRA [3] monitors the usage of resources by each thread and gives a higher share of the available resources to memory-intensive threads. Several techniques target the utilization of critical shared resources, for example the physical register file. In [4], the performance for a given register file size is improved in an SMT system by permitting the compiler and operating system to expedite the deallocation process.
Another technique focused on early release of physical registers is proposed in [5], in which a virtual register window is employed to achieve early release and reduce the register file pressure. In [6], a technique is proposed to reduce register file cost by delaying the allocation of physical registers until a later pipeline stage. However, these techniques either require the support of an intelligent compiler and operating system [4][6] or extra hardware modifications in other pipeline stages [5], which can be too system-specific and/or complex to be practical.

In this paper, we propose an efficient physical register file allocation scheme based on the notion of suspending a thread when system resources are heavily contended. This is achieved by blocking the thread with the highest utilization of physical registers whenever the current overall utilization is relatively high. Our proposed technique operates at the architectural level and requires no modification to the operating system or compiler. Furthermore, the technique is a stand-alone process in the rename stage, and thus no modification is required in other pipeline stages. Our simulation results show that the proposed technique improves IPC by up to 24% and 25%, and enhances the Harmonic IPC (a performance indicator of execution fairness) by up to 30.48% and 40.47%, on a 4-threaded workload and an 8-threaded workload respectively.

2 Simulation Environment

2.1 Simulator

We use M-Sim [1], a multi-threaded microarchitectural simulation environment, to estimate the performance of our proposed method in an SMT system. Table 1 gives the detailed configuration of the simulated processor.

2.2 Workloads

Simulations on simultaneous multi-threading use workloads of mixed SPEC CPU2006 benchmark suites [8] with mixtures of various levels of ILP.
The ILP classification of each mix is obtained by initializing it in accordance with the procedure described for the SimPoint tool [9] and simulating it individually in a SimpleScalar environment.

Copyright ISCA, SEDE 2016, September 26-28, 2016, Denver, Colorado, USA

Table 1: Configuration of Simulated Processor

Machine Width: 8 wide fetch/dispatch/issue/commit
L/S Queue Size: 48-entry load/store queue
ROB & IQ Size: 128-entry ROB, 32-entry IQ
Functional Units & Latency (total/issue): 4 Int Add (1/1), 1 Int Mult (3/1) / Div (20/19), 2 Load/Store (1/1), 4 FP Add (2/1), 1 FP Mult (4/1) / Div (12/12) / Sqrt (24/24)
Physical Registers: integer and floating point, as specified in the paper
L1 I-cache: 64KB, 2-way set associative, 64-byte line
L1 D-cache: 64KB, 4-way set associative, 64-byte line, write back, 1-cycle access latency
L2 Cache (unified): 512KB, 16-way set associative, 64-byte line, write back, 10-cycle access latency
BTB: 512-entry, 4-way set associative
Branch Predictor: bimod, 2K-entry
Pipeline Structure: 5-stage front-end (fetch-dispatch), scheduling (register file access: 2 stages), execution, write back, commit
Memory Configuration: 32-bit wide, 300-cycle access latency

Three types of ILP classification are identified: low ILP (memory bound), medium ILP, and high ILP (execution bound). Table 2 gives the 4-threaded and 8-threaded simulated workloads with various mixtures of ILP classification types.
Table 2: SPEC CPU2006 4-threaded and 8-threaded Workloads

4-threaded:
Mix1: libquantum, dealii, gromacs, namd
Mix2: soplex, leslie3d, povray, milc
Mix3: hmmer, sjeng, gobmk, gcc
Mix4: lbm, cactusadm, xalancbmk, bzip2
Mix5: libquantum, dealii, gobmk, gcc
Mix6: gromacs, namd, soplex, leslie3d
Mix7: dealii, gromacs, lbm, cactusadm
Mix8: libquantum, namd, xalancbmk, bzip2
Mix9: povray, milc, cactusadm, xalancbmk
Mix10: hmmer, sjeng, lbm, bzip2

8-threaded:
Mix1: libquantum, dealii, gromacs, namd, soplex, leslie3d, povray, milc
Mix2: libquantum, dealii, gromacs, namd, lbm, cactusadm, xalancbmk, bzip2
Mix3: hmmer, sjeng, gobmk, gcc, lbm, cactusadm, xalancbmk, bzip2
Mix4: libquantum, dealii, gromacs, soplex, leslie3d, povray, lbm, cactusadm
Mix5: dealii, gromacs, namd, xalancbmk, hmmer, cactusadm, milc, bzip2
Mix6: gromacs, namd, sjeng, gobmk, gcc, lbm, cactusadm, xalancbmk

2.3 Metrics

For a multi-threaded workload, the total combined IPC is a typical indicator used to measure overall performance, defined as the sum of each thread's IPC:

    Overall IPC = sum_{i=1}^{n} IPC_i    (1)

where n denotes the number of threads per mix in the system. However, in order to preclude starvation effects among threads, the so-called Harmonic IPC is also adopted, which reflects the degree of execution fairness among the threads, namely,

    Harmonic IPC = n / sum_{i=1}^{n} (1 / IPC_i)    (2)

In this paper, these two indicators are used to compare the proposed algorithm to the baseline (default) system. The following metric indicates the IPC improvement percentage averaged over the selected mixes (instead of the improvement percentage of the averaged IPC), which is applied to both Overall IPC and Harmonic IPC, namely,

    Imp% = ( sum_{j=1}^{m} (IPC_new(j) - IPC_baseline(j)) / IPC_baseline(j) * 100% ) / m    (3)

where m denotes the number of mixes in the workload.

3 Motivation

This section is devoted to analyzing the distribution of physical registers among multiple threads.
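The three metrics defined in Section 2.3 (Eqs. 1-3) are simple enough to sketch directly. The following Python helpers are illustrative only (they are not part of M-Sim); the function names are our own:

```python
def overall_ipc(ipcs):
    """Eq. (1): total combined IPC, the sum of per-thread IPC values."""
    return sum(ipcs)

def harmonic_ipc(ipcs):
    """Eq. (2): n / sum(1/IPC_i); a starved thread (tiny IPC_i)
    drags this metric down, so it reflects execution fairness."""
    n = len(ipcs)
    return n / sum(1.0 / ipc for ipc in ipcs)

def avg_improvement_pct(new, baseline):
    """Eq. (3): per-mix improvement percentages averaged over the m
    mixes (not the improvement of the averaged IPC)."""
    m = len(new)
    return sum((a - b) / b * 100.0 for a, b in zip(new, baseline)) / m
```

For example, four threads each at IPC 2.0 give Harmonic IPC 2.0, while one starved thread near zero would pull the harmonic value toward zero even though the Overall IPC barely changes.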
Note that, in general, competition for the floating-point register file is not as severe as that for the integer one [7]. Thus, throughout the rest of this paper, the register file under discussion refers to the integer one.

We first analyze the occupancy rate of the physical register file to gauge the demand for it. Figure 1 (Average Overall Physical Register Utilization of the 4-threaded Workload with R_t = 192) shows the overall physical register utilization with the register file size (denoted as R_t) set at 192. Each point represents an average result from the 10 mixes of the 4-threaded workload. As the

results show, in nearly 80% of simulation clock cycles the physical register file is fully occupied, which clearly indicates the demand for a larger register file.

We further analyze the utilization distribution of physical registers among threads on the 4-threaded workload to determine the seriousness of single-thread dominance. Figure 2 (Highest Register Utilization among Threads of the 4-threaded Workload with R_t = 192) shows the percentage-of-time distribution of the highest individual thread utilization of the shared registers per mix with R_t = 192. If the registers were perfectly (equally) distributed among 4 threads, the highest individual utilization would be 25%. However, as shown in Figure 2, except for mix6, over 40% of the shared registers are occupied by one single thread for most of the simulation time. Especially for mix9 and mix10, in nearly 60% of simulation clock cycles over 80% of the shared registers are occupied by one single thread. Such single-thread dominance can easily become a sustained bottleneck for the other threads' progress.

Figure 3: Flowchart of the Proposed Algorithm

4 Proposed Method

Our proposed technique is based on the conjecture that, when the overall utilization of the register file exceeds a certain threshold indicating an ongoing congestion, it may be worthwhile to suspend the most demanding thread from acquiring more registers, sacrificing a certain degree of thread-level parallelism for a better flow of thread processing.
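The single-thread dominance measured in Section 3 can be quantified in post-processing. The sketch below is a hypothetical helper (not M-Sim code) that, given per-cycle occupancy samples, computes how often one thread alone exceeds a given share of the shared registers, as in the Figure 2 analysis:

```python
def dominance_fraction(per_cycle_occupancy, r_shared, share=0.4):
    """Fraction of sampled cycles in which a single thread holds more
    than `share` of the R_r shared registers.
    per_cycle_occupancy: list of per-cycle lists, one register count
    per thread; r_shared: number of shared registers (R_r)."""
    dominated = sum(
        1 for regs in per_cycle_occupancy if max(regs) > share * r_shared
    )
    return dominated / len(per_cycle_occupancy)
```

With share=0.4, a result near 1.0 would correspond to the mixes in Figure 2 where one thread holds over 40% of the shared registers for most of the simulation.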
The basic concept of our proposed technique is that the thread with the highest physical register utilization (occupancy) will be suspended in the rename stage, without being allocated new physical registers, whenever the overall register utilization is above a set threshold. Figure 3 shows the flowchart of the proposed features built into the original default scheme. The proposed thread-suspending technique is imposed on a window basis, with a fixed, preset number of clock cycles per window. That is, suspension of the thread identified at the end of the current window will last through the next window if the set threshold is exceeded in the current window.

Note that the physical register file is only partially shared among threads in an SMT system: a portion of it is dedicated to the architectural registers of each thread, with the rest (denoted as R_r) being shared among the threads. All utilization measurements in this paper are with respect to R_r. During the current window, the register utilization of thread i is measured, denoted as U_I(i), as is the overall utilization of all threads, denoted as U_O. If the average number of registers used in a window is R_O, then

    U_O = R_O / R_r

and

    U_I(i) = R(i) / R_r

where R(i) is the average number of registers occupied by thread i. At the end of each window, the proposed algorithm checks whether the overall utilization is higher than the preset overall threshold (denoted as Th_O) to determine whether the suspension scheme needs to be initiated for the next window; that is, when U_O > Th_O.
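The end-of-window check described above can be sketched as follows. This is a minimal illustration of feature 1 only, with hypothetical names (not M-Sim code); avg_regs[i] stands for R(i), the average number of shared registers held by thread i over the window just ended:

```python
def suspension_target(avg_regs, r_shared, th_o):
    """Return the index of the thread to suspend for the next window,
    or None if the overall threshold Th_O is not exceeded."""
    u_i = [r / r_shared for r in avg_regs]   # U_I(i) = R(i) / R_r
    u_o = sum(u_i)                           # U_O = R_O / R_r
    if u_o > th_o:                           # congestion detected
        # suspend the highest register-occupying thread
        return max(range(len(u_i)), key=u_i.__getitem__)
    return None                              # no suspension next window
```

For instance, with R_r = 160 and window averages of 80, 30, 20 and 20 registers, U_O = 0.9375 exceeds Th_O = 0.72, so thread 0 would be suspended for the next window.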

In this algorithm, for the sake of avoiding thread starvation, no thread will be suspended for two consecutive windows. Our observation shows that such a prevention scheme is mostly not needed, since the register utilization of a suspended thread normally decreases significantly during the current window, so that it does not remain the highest register-occupying thread at the end of the window. Such a status change comes from the fact that, even when a thread is suspended from renaming during the current window, its already-renamed instructions can still continue to finish and commit, releasing its occupied shared resources along the way for other threads to use. This newly added feature is indicated in the lightly shaded box in Figure 3 as feature 1.

Due to a combination of register file size and the threads' concurrent behaviors, exceeding the overall utilization threshold Th_O may not truly reflect the targeted scenario of congestion and imbalanced usage. Namely, a universally set Th_O may not be applicable to cases with smaller register file sizes (R_r): U_O will be mostly high in such cases, but this does not necessarily indicate an imbalanced usage among the threads. Due to this concern, another threshold is instilled to ensure that the suspension mechanism is imposed only when the highest per-thread occupancy exceeds this threshold, the so-called highest individual utilization threshold, denoted as Th_IH; that is, when

    U_IH > Th_IH

where U_IH denotes the highest per-thread occupancy, defined as

    U_IH = max_i{U_I(i)} / U_O

which is measured against the overall usage instead of against R_r. That is, under this new feature (indicated as feature 2 in Figure 3), not only does the overall threshold have to be exceeded, but this individual threshold condition has to be satisfied as well.

In this approach, the size of the register file will affect the effectiveness of our proposed technique. A larger register file diminishes the benefits of our technique, since the competition among threads is less severe. For a smaller register file, even with the highest-utilization thread suspended, congestion can remain severe due to the limited size. However, the main compromise between the benefits and drawbacks lies in the value settings of Th_O and Th_IH. If these thresholds are set too high, the intention of our proposed technique, preventing shared resources from being dominated by one single thread, will not be realized, since the threshold conditions are less likely to be satisfied. On the other hand, thresholds set too low may unnecessarily suspend threads in a condition which does not really call for the scheme.

Also note that the preset window size, denoted as W, is another parameter that may greatly affect the effectiveness of the approach. If the window size is set too small, there may not be enough time for the suspended thread to release enough registers for others to use before the thread comes back in the next window. On the other hand, a window size set too large may reduce the effectiveness of the proposed technique in capturing the threads' temporal behavioral changes, while at the same time potentially degrading execution fairness among the threads.

5 Simulation Results

5.1 Feature I: with Overall Utilization Threshold Th_O

As shown in Figure 3, the technique is first applied with the first feature only, using the overall utilization threshold (Th_O) to determine when the suspension mechanism is to be imposed. The first simulation runs investigate the effect of window size on the effectiveness of the proposed algorithm. An ample register file size is chosen for the 4-threaded workload, with R_t = 256. Figure 4 (Average IPC Improvement vs. Window Size on a 4-threaded Workload with R_t = 256) shows the average IPC improvement with different window sizes; four different Th_O values are used to provide a more comprehensive comparison across window sizes. IPC improvement continues to rise as the window size (W) increases until about 8000 clock cycles (cc), and the optimal Th_O value is around 72%. Based on this observation, all our following simulation runs use W = 8000.

Simulations are also performed to disclose the effect of the adopted register file size on IPC performance. Results are shown in Figure 5 for different threshold (Th_O) values, for the 4-threaded workload and the 8-threaded workload respectively. Note that the mechanism defaults back to the original system when Th_O is set to 100%, in which the threshold

condition is never satisfied, leading to no suspension at all. The average IPC improvement from this technique can reach up to about 25% for both workloads. In the 4-threaded case, the improvement percentage in general increases as R_t increases from 160 to 224, but drops significantly when it further increases to 256, an indication of a suitable range for applying the proposed mechanism. That is, when the register file size is not large enough, congestion remains prevalent, and thus suspending only one thread may not provide the optimal benefit. On the other hand, when the register file size is sufficiently large, the benefit of the proposed mechanism is not as obvious, indicating that the need for thread suspension is not as critical. A similar scenario occurs in the 8-threaded case, where the performance improvement peaks at R_t = 576 and then drops significantly when the size is set higher.

Figure 5: Average IPC Improvement vs. Overall Physical Register Utilization Threshold (Th_O) on a 4-threaded Workload and 8-threaded Workload with W = 8000

A summary of the optimal Th_O value for each register file size is given in Table 3 for the two workloads respectively. The optimal threshold value continues to decrease as the size of the register file increases, which ascertains our conjecture on the use of this threshold: a smaller register file demands a higher threshold to indicate the need for thread suspension.

Table 3: Optimal Th_O vs. Physical Register File Size

4 threads (R_t = 160, 192, 224, 256): Th_O = 98%, 95%, 95%, 80%
8 threads: Th_O = 98%, 95%, 80%, 80%, 80%

5.2 Feature II: with Individual Utilization Threshold Th_IH

As displayed in Figure 3, the second proposed feature imposes an additional individual utilization threshold, Th_IH, as a second condition required for thread suspension. To clearly see the effect of this extra condition, combined with the overall utilization threshold Th_O, on the overall performance, a simulation is performed on a variety of register file sizes with both features turned on. Figure 6 (Average Overall IPC Improvement vs. Overall Physical Register Utilization Threshold on a 4-threaded Workload with R_t = 160 and R_t = 192) shows the average IPC improvement under these two threshold values for register file sizes of 160 and 192 on the 4-threaded workload. Three different Th_IH values are used: 40%, 60%, and 80%. A fourth plot with Th_IH = 0 represents the case of not employing this extra threshold, since the condition is then always satisfied. For the cases with smaller register files (R_t = 160 and R_t = 192), Th_IH = 40% is too small to properly reflect single-thread dominance, which leads to no further improvement over not employing this threshold (Th_IH = 0). Employing a higher Th_IH can further increase the performance improvement by up to 7%, depending on the Th_O selected. This result verifies our aforementioned claim that satisfying the overall utilization threshold (Th_O) condition may not indicate resource dominance by one or a few threads, especially with a smaller R_t or a smaller Th_O.

5.3 Execution Fairness

When a selective thread-suspension mechanism like the proposed one is imposed for the sake of increasing overall performance, there is a potential that execution unfairness or even disastrous thread starvation might

occur. Thus, another measurement is carried out in our simulations using the Harmonic IPC (defined in Eq. 2) as a performance indicator. Figure 7 (Harmonic IPC Improvement vs. Physical Register File Size on a 4-threaded Workload and 8-threaded Workload) shows the Harmonic IPC improvement with varying register file size on the two workloads. Values of Th_O and Th_IH are set to their respective values for optimal overall IPC improvement, so as to reveal the greatest potential execution unfairness. As the results show, our proposed technique not only improves the overall IPC but also enhances the Harmonic IPC, by up to 30.48% and 40.47% on the 4-threaded workload and 8-threaded workload respectively.

6 Conclusion

In this paper, an efficient physical register file allocation scheme is proposed to reduce register file pressure in an SMT system. By preventing the physical register file from being overwhelmingly occupied by one single thread, the technique delivers a very significant performance improvement without sacrificing execution fairness. A further extension of this technique could select not just one but potentially several threads for temporary suspension, which is especially relevant when the register file is small and/or the number of threads in the system is large. Such an extension would require a more intelligent threshold setting and a more reliable real-time mechanism for reading congestion and thread dominance.

7 Acknowledgement

This research is partially supported by funding from a National Science Foundation Award.

References

[1] J. Sharkey, "M-Sim: A Flexible, Multithreaded Simulation Environment," Tech. Report CS-TR-05-DP1, Department of Computer Science, SUNY Binghamton.

[2] D. M. Tullsen, S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo and R. L.
Stamm, "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor," in Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996.

[3] F. J. Cazorla, A. Ramirez, M. Valero and E. Fernandez, "Dynamically Controlled Resource Allocation in SMT Processors," in Proceedings of the 37th International Symposium on Microarchitecture, Dec. 2004.

[4] J. L. Lo, S. S. Parekh, S. J. Eggers, H. M. Levy and D. M. Tullsen, "Software-Directed Register Deallocation for Simultaneous Multithreading Processors," IEEE Transactions on Parallel and Distributed Systems, Vol. 10, Issue 9, Sept. 1999.

[5] T. Monreal, A. Gonzalez, M. Valero, J. Gonzalez and V. Vinals, "Dynamic Register Renaming through Virtual-Physical Registers," Journal of Instruction-Level Parallelism, Vol. 2, 2000.

[6] E. Quinones, J. Parcerisa and A. Gonzalez, "Leveraging Register Windows to Reduce Physical Registers to the Bare Minimum," IEEE Transactions on Computers, Vol. 59, No. 12, Dec. 2010.

[7] N. Zhang and W.-M. Lin, "Efficient Physical Register File Allocation in Simultaneous Multi-Threading CPUs," 33rd IEEE International Performance Computing and Communications Conference (IPCCC 2014), Austin, Texas, December 5-7, 2014.

[8] Standard Performance Evaluation Corporation (SPEC) website.

[9] T. Sherwood, E. Perelman, G. Hamerly and B. Calder, "Automatically Characterizing Large Scale Program Behavior," in Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, October 2002.


More information

A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors In Proceedings of the th International Symposium on High Performance Computer Architecture (HPCA), Madrid, February A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

More information

Runahead Threads to Improve SMT Performance

Runahead Threads to Improve SMT Performance Runahead Threads to Improve SMT Performance Tanausú Ramírez 1, Alex Pajuelo 1, Oliverio J. Santana 2, Mateo Valero 1,3 1 Universitat Politècnica de Catalunya, Spain. {tramirez,mpajuelo,mateo}@ac.upc.edu.

More information

Optimizing SMT Processors for High Single-Thread Performance

Optimizing SMT Processors for High Single-Thread Performance University of Maryland Inistitute for Advanced Computer Studies Technical Report UMIACS-TR-2003-07 Optimizing SMT Processors for High Single-Thread Performance Gautham K. Dorai, Donald Yeung, and Seungryul

More information

Saving Register-File Leakage Power by Monitoring Instruction Sequence in ROB

Saving Register-File Leakage Power by Monitoring Instruction Sequence in ROB Saving Register-File Leakage Power by Monitoring Instruction Sequence in ROB Wann-Yun Shieh * and Hsin-Dar Chen Department of Computer Science and Information Engineering Chang Gung University, Taiwan

More information

AB-Aware: Application Behavior Aware Management of Shared Last Level Caches

AB-Aware: Application Behavior Aware Management of Shared Last Level Caches AB-Aware: Application Behavior Aware Management of Shared Last Level Caches Suhit Pai, Newton Singh and Virendra Singh Computer Architecture and Dependable Systems Laboratory Department of Electrical Engineering

More information

DCache Warn: an I-Fetch Policy to Increase SMT Efficiency

DCache Warn: an I-Fetch Policy to Increase SMT Efficiency DCache Warn: an I-Fetch Policy to Increase SMT Efficiency Francisco J. Cazorla, Alex Ramirez, Mateo Valero Departament d Arquitectura de Computadors Universitat Politècnica de Catalunya Jordi Girona 1-3,

More information

Performance Characterization of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor

Performance Characterization of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor Performance Characterization of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor Sarah Bird ϕ, Aashish Phansalkar ϕ, Lizy K. John ϕ, Alex Mericas α and Rajeev Indukuru α ϕ University

More information

Paired ROBs: A Cost-Effective Reorder Buffer Sharing Strategy for SMT Processors

Paired ROBs: A Cost-Effective Reorder Buffer Sharing Strategy for SMT Processors Paired ROBs: A Cost-Effective Reorder Buffer Sharing Strategy for SMT Processors R. Ubal, J. Sahuquillo, S. Petit and P. López Department of Computing Engineering (DISCA) Universidad Politécnica de Valencia,

More information

High Performance Memory Requests Scheduling Technique for Multicore Processors

High Performance Memory Requests Scheduling Technique for Multicore Processors High Performance Memory Requests Scheduling Technique for Multicore Processors Walid El-Reedy Electronics and Comm. Engineering Cairo University, Cairo, Egypt walid.elreedy@gmail.com Ali A. El-Moursy Electrical

More information

The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory

The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory Lavanya Subramanian* Vivek Seshadri* Arnab Ghosh* Samira Khan*

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero

Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero The Nineteenth International Conference on Parallel Architectures and Compilation Techniques (PACT) 11-15

More information

Techniques for Efficient Processing in Runahead Execution Engines

Techniques for Efficient Processing in Runahead Execution Engines Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu

More information

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji Beyond ILP Hemanth M Bharathan Balaji Multiscalar Processors Gurindar S Sohi Scott E Breach T N Vijaykumar Control Flow Graph (CFG) Each node is a basic block in graph CFG divided into a collection of

More information

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:

More information

Improving Data Cache Performance via Address Correlation: An Upper Bound Study

Improving Data Cache Performance via Address Correlation: An Upper Bound Study Improving Data Cache Performance via Address Correlation: An Upper Bound Study Peng-fei Chuang 1, Resit Sendag 2, and David J. Lilja 1 1 Department of Electrical and Computer Engineering Minnesota Supercomputing

More information

Filtered Runahead Execution with a Runahead Buffer

Filtered Runahead Execution with a Runahead Buffer Filtered Runahead Execution with a Runahead Buffer ABSTRACT Milad Hashemi The University of Texas at Austin miladhashemi@utexas.edu Runahead execution dynamically expands the instruction window of an out

More information

UCB CS61C : Machine Structures

UCB CS61C : Machine Structures inst.eecs.berkeley.edu/~cs61c UCB CS61C : Machine Structures Lecture 36 Performance 2010-04-23 Lecturer SOE Dan Garcia How fast is your computer? Every 6 months (Nov/June), the fastest supercomputers in

More information

Dynamic Scheduling. CSE471 Susan Eggers 1

Dynamic Scheduling. CSE471 Susan Eggers 1 Dynamic Scheduling Why go out of style? expensive hardware for the time (actually, still is, relatively) register files grew so less register pressure early RISCs had lower CPIs Why come back? higher chip

More information

Adaptive Cache Memories for SMT Processors

Adaptive Cache Memories for SMT Processors Adaptive Cache Memories for SMT Processors Sonia Lopez, Oscar Garnica, David H. Albonesi, Steven Dropsho, Juan Lanchares and Jose I. Hidalgo Department of Computer Engineering, Rochester Institute of Technology,

More information

Energy Proportional Datacenter Memory. Brian Neel EE6633 Fall 2012

Energy Proportional Datacenter Memory. Brian Neel EE6633 Fall 2012 Energy Proportional Datacenter Memory Brian Neel EE6633 Fall 2012 Outline Background Motivation Related work DRAM properties Designs References Background The Datacenter as a Computer Luiz André Barroso

More information

Simultaneous Multithreading (SMT)

Simultaneous Multithreading (SMT) #1 Lec # 2 Fall 2003 9-10-2003 Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing

More information

Improving Cache Performance using Victim Tag Stores

Improving Cache Performance using Victim Tag Stores Improving Cache Performance using Victim Tag Stores SAFARI Technical Report No. 2011-009 Vivek Seshadri, Onur Mutlu, Todd Mowry, Michael A Kozuch {vseshadr,tcm}@cs.cmu.edu, onur@cmu.edu, michael.a.kozuch@intel.com

More information

Cache Friendliness-aware Management of Shared Last-level Caches for High Performance Multi-core Systems

Cache Friendliness-aware Management of Shared Last-level Caches for High Performance Multi-core Systems 1 Cache Friendliness-aware Management of Shared Last-level Caches for High Performance Multi-core Systems Dimitris Kaseridis, Member, IEEE, Muhammad Faisal Iqbal, Student Member, IEEE and Lizy Kurian John,

More information

NightWatch: Integrating Transparent Cache Pollution Control into Dynamic Memory Allocation Systems

NightWatch: Integrating Transparent Cache Pollution Control into Dynamic Memory Allocation Systems NightWatch: Integrating Transparent Cache Pollution Control into Dynamic Memory Allocation Systems Rentong Guo 1, Xiaofei Liao 1, Hai Jin 1, Jianhui Yue 2, Guang Tan 3 1 Huazhong University of Science

More information

L1-Bandwidth Aware Thread Allocation in Multicore SMT Processors

L1-Bandwidth Aware Thread Allocation in Multicore SMT Processors L1-Bandwidth Aware Thread Allocation in Multicore SMT Processors Josué Feliu, Julio Sahuquillo, Salvador Petit, and José Duato Department of Computer Engineering (DISCA) Universitat Politècnica de València

More information

An Analysis of Microarchitecture Vulnerability to Soft Errors on Simultaneous Multithreaded Architectures

An Analysis of Microarchitecture Vulnerability to Soft Errors on Simultaneous Multithreaded Architectures An Analysis of Microarchitecture Vulnerability to Soft Errors on Simultaneous Multithreaded Architectures Wangyuan Zhang, Xin Fu, Tao Li and José Fortes Department of Electrical and Computer Engineering,

More information

High Performance Memory Requests Scheduling Technique for Multicore Processors

High Performance Memory Requests Scheduling Technique for Multicore Processors High Performance Memory Requests Scheduling Technique for Multicore Processors by Walid Ahmed Mohamed El-Reedy A Thesis Submitted to the Faculty of Engineering at Cairo University In Partial Fulfillment

More information

A Comparison of Capacity Management Schemes for Shared CMP Caches

A Comparison of Capacity Management Schemes for Shared CMP Caches A Comparison of Capacity Management Schemes for Shared CMP Caches Carole-Jean Wu and Margaret Martonosi Princeton University 7 th Annual WDDD 6/22/28 Motivation P P1 P1 Pn L1 L1 L1 L1 Last Level On-Chip

More information

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Min Kyu Jeong, Doe Hyun Yoon^, Dam Sunwoo*, Michael Sullivan, Ikhwan Lee, and Mattan Erez The University of Texas at Austin Hewlett-Packard

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Energy Models for DVFS Processors

Energy Models for DVFS Processors Energy Models for DVFS Processors Thomas Rauber 1 Gudula Rünger 2 Michael Schwind 2 Haibin Xu 2 Simon Melzner 1 1) Universität Bayreuth 2) TU Chemnitz 9th Scheduling for Large Scale Systems Workshop July

More information

Improving Cache Performance by Exploi7ng Read- Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A.

Improving Cache Performance by Exploi7ng Read- Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Improving Cache Performance by Exploi7ng Read- Write Disparity Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Jiménez Summary Read misses are more cri?cal than write misses

More information

DEMM: a Dynamic Energy-saving mechanism for Multicore Memories

DEMM: a Dynamic Energy-saving mechanism for Multicore Memories DEMM: a Dynamic Energy-saving mechanism for Multicore Memories Akbar Sharifi, Wei Ding 2, Diana Guttman 3, Hui Zhao 4, Xulong Tang 5, Mahmut Kandemir 5, Chita Das 5 Facebook 2 Qualcomm 3 Intel 4 University

More information

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,

More information

15-740/ Computer Architecture Lecture 21: Superscalar Processing. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 21: Superscalar Processing. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 21: Superscalar Processing Prof. Onur Mutlu Carnegie Mellon University Announcements Project Milestone 2 Due November 10 Homework 4 Out today Due November 15

More information

A Content Aware Integer Register File Organization

A Content Aware Integer Register File Organization A Content Aware Integer Register File Organization Rubén González, Adrián Cristal, Daniel Ortega, Alexander Veidenbaum and Mateo Valero Universitat Politècnica de Catalunya HP Labs Barcelona University

More information

Multithreaded Value Prediction

Multithreaded Value Prediction Multithreaded Value Prediction N. Tuck and D.M. Tullesn HPCA-11 2005 CMPE 382/510 Review Presentation Peter Giese 30 November 2005 Outline Motivation Multithreaded & Value Prediction Architectures Single

More information

Multithreaded Processors. Department of Electrical Engineering Stanford University

Multithreaded Processors. Department of Electrical Engineering Stanford University Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread

More information

Lecture 14: Multithreading

Lecture 14: Multithreading CS 152 Computer Architecture and Engineering Lecture 14: Multithreading John Wawrzynek Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~johnw

More information

Near-Threshold Computing: How Close Should We Get?

Near-Threshold Computing: How Close Should We Get? Near-Threshold Computing: How Close Should We Get? Alaa R. Alameldeen Intel Labs Workshop on Near-Threshold Computing June 14, 2014 Overview High-level talk summarizing my architectural perspective on

More information

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 250P: Computer Systems Architecture Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr

More information

Footprint-based Locality Analysis

Footprint-based Locality Analysis Footprint-based Locality Analysis Xiaoya Xiang, Bin Bao, Chen Ding University of Rochester 2011-11-10 Memory Performance On modern computer system, memory performance depends on the active data usage.

More information

Beyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy

Beyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy EE482: Advanced Computer Organization Lecture #13 Processor Architecture Stanford University Handout Date??? Beyond ILP II: SMT and variants Lecture #13: Wednesday, 10 May 2000 Lecturer: Anamaya Sullery

More information

Bias Scheduling in Heterogeneous Multi-core Architectures

Bias Scheduling in Heterogeneous Multi-core Architectures Bias Scheduling in Heterogeneous Multi-core Architectures David Koufaty Dheeraj Reddy Scott Hahn Intel Labs {david.a.koufaty, dheeraj.reddy, scott.hahn}@intel.com Abstract Heterogeneous architectures that

More information

Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures

Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures Nagi N. Mekhiel Department of Electrical and Computer Engineering Ryerson University, Toronto, Ontario M5B 2K3

More information

Minimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era

Minimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era Minimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era Dimitris Kaseridis Electrical and Computer Engineering The University of Texas at Austin Austin, TX, USA kaseridis@mail.utexas.edu

More information

Motivation. Banked Register File for SMT Processors. Distributed Architecture. Centralized Architecture

Motivation. Banked Register File for SMT Processors. Distributed Architecture. Centralized Architecture Motivation Banked Register File for SMT Processors Jessica H. Tseng and Krste Asanović MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA 02139, USA BARC2004 Increasing demand on

More information

Speculative Execution for Hiding Memory Latency

Speculative Execution for Hiding Memory Latency Speculative Execution for Hiding Memory Latency Alex Pajuelo, Antonio Gonzalez and Mateo Valero Departament d Arquitectura de Computadors Universitat Politècnica de Catalunya Barcelona-Spain {mpajuelo,

More information

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1) Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling

More information

Mapping of Applications to Heterogeneous Multi-cores Based on Micro-architecture Independent Characteristics

Mapping of Applications to Heterogeneous Multi-cores Based on Micro-architecture Independent Characteristics Mapping of Applications to Heterogeneous Multi-cores Based on Micro-architecture Independent Characteristics Jian Chen, Nidhi Nayyar and Lizy K. John Department of Electrical and Computer Engineering The

More information

Simultaneous Multithreading (SMT)

Simultaneous Multithreading (SMT) Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1996 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide issue

More information

Lecture: SMT, Cache Hierarchies. Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1)

Lecture: SMT, Cache Hierarchies. Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) Lecture: SMT, Cache Hierarchies Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized

More information

Lightweight Memory Tracing

Lightweight Memory Tracing Lightweight Memory Tracing Mathias Payer*, Enrico Kravina, Thomas Gross Department of Computer Science ETH Zürich, Switzerland * now at UC Berkeley Memory Tracing via Memlets Execute code (memlets) for

More information

Understanding The Effects of Wrong-path Memory References on Processor Performance

Understanding The Effects of Wrong-path Memory References on Processor Performance Understanding The Effects of Wrong-path Memory References on Processor Performance Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt The University of Texas at Austin 2 Motivation Processors spend

More information

Threaded Multiple Path Execution

Threaded Multiple Path Execution Threaded Multiple Path Execution Steven Wallace Brad Calder Dean M. Tullsen Department of Computer Science and Engineering University of California, San Diego fswallace,calder,tullseng@cs.ucsd.edu Abstract

More information

Perceptron Learning for Reuse Prediction

Perceptron Learning for Reuse Prediction Perceptron Learning for Reuse Prediction Elvira Teran Zhe Wang Daniel A. Jiménez Texas A&M University Intel Labs {eteran,djimenez}@tamu.edu zhe2.wang@intel.com Abstract The disparity between last-level

More information

Handout 2 ILP: Part B

Handout 2 ILP: Part B Handout 2 ILP: Part B Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP

More information

Dynamic Instruction Scheduling For Microprocessors Having Out Of Order Execution

Dynamic Instruction Scheduling For Microprocessors Having Out Of Order Execution Dynamic Instruction Scheduling For Microprocessors Having Out Of Order Execution Suresh Kumar, Vishal Gupta *, Vivek Kumar Tamta Department of Computer Science, G. B. Pant Engineering College, Pauri, Uttarakhand,

More information

CS A Large, Fast Instruction Window for Tolerating. Cache Misses 1. Tong Li Jinson Koppanalil Alvin R. Lebeck. Department of Computer Science

CS A Large, Fast Instruction Window for Tolerating. Cache Misses 1. Tong Li Jinson Koppanalil Alvin R. Lebeck. Department of Computer Science CS 2002 03 A Large, Fast Instruction Window for Tolerating Cache Misses 1 Tong Li Jinson Koppanalil Alvin R. Lebeck Jaidev Patwardhan Eric Rotenberg Department of Computer Science Duke University Durham,

More information

A Hybrid Adaptive Feedback Based Prefetcher

A Hybrid Adaptive Feedback Based Prefetcher A Feedback Based Prefetcher Santhosh Verma, David M. Koppelman and Lu Peng Department of Electrical and Computer Engineering Louisiana State University, Baton Rouge, LA 78 sverma@lsu.edu, koppel@ece.lsu.edu,

More information

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5)

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5) Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5) ILP vs. Parallel Computers Dynamic Scheduling (Section 3.4, 3.5) Dynamic Branch Prediction (Section 3.3, 3.9, and Appendix C) Hardware

More information

Boost Sequential Program Performance Using A Virtual Large. Instruction Window on Chip Multicore Processor

Boost Sequential Program Performance Using A Virtual Large. Instruction Window on Chip Multicore Processor Boost Sequential Program Performance Using A Virtual Large Instruction Window on Chip Multicore Processor Liqiang He Inner Mongolia University Huhhot, Inner Mongolia 010021 P.R.China liqiang@imu.edu.cn

More information

Adaptive Cache Partitioning on a Composite Core

Adaptive Cache Partitioning on a Composite Core Adaptive Cache Partitioning on a Composite Core Jiecao Yu, Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Scott Mahlke Computer Engineering Lab University of Michigan, Ann Arbor, MI {jiecaoyu, lukefahr,

More information

Lecture-13 (ROB and Multi-threading) CS422-Spring

Lecture-13 (ROB and Multi-threading) CS422-Spring Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue

More information

MPEG-2 Video Decompression on Simultaneous Multithreaded Multimedia Processors

MPEG-2 Video Decompression on Simultaneous Multithreaded Multimedia Processors MPEG- Video Decompression on Simultaneous Multithreaded Multimedia Processors Heiko Oehring Ulrich Sigmund Theo Ungerer VIONA Development GmbH Karlstr. 7 D-733 Karlsruhe, Germany uli@viona.de VIONA Development

More information

CS 152 Computer Architecture and Engineering. Lecture 18: Multithreading

CS 152 Computer Architecture and Engineering. Lecture 18: Multithreading CS 152 Computer Architecture and Engineering Lecture 18: Multithreading Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste

More information

Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative

Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative Emre Kültürsay *, Mahmut Kandemir *, Anand Sivasubramaniam *, and Onur Mutlu * Pennsylvania State University Carnegie Mellon University

More information

The Impact of Resource Sharing Control on the Design of Multicore Processors

The Impact of Resource Sharing Control on the Design of Multicore Processors The Impact of Resource Sharing Control on the Design of Multicore Processors Chen Liu 1 and Jean-Luc Gaudiot 2 1 Department of Electrical and Computer Engineering, Florida International University, 10555

More information