On Performance, Transistor Count and Chip Space Assessment of Multimedia-enhanced Simultaneous Multithreaded Processors


Ulrich Sigmund, Marc Steinhaus, and Theo Ungerer

VIONA Development GmbH, Karlstr. 7, D-76133 Karlsruhe, Germany
Institute of Computer Design and Fault Tolerance, University of Karlsruhe, D-76128 Karlsruhe, Germany

Abstract

This paper gives a cost/benefit analysis of simultaneous multithreaded (SMT) processors with multimedia enhancements. We carefully assess the performance, transistor count, and chip space of each simulated processor model. We focus our investigations on three different sets of processor configurations: one set with an abundance of resources, a second set with a more realistic memory hierarchy, and a third set of contemporary scaled processor models. Comparing the single-threaded -issue models with the -threaded -issue SMT models shows that the maximum processor models require a % increase in transistor count and a 9% increase in chip space, but yield a threefold speedup; the models with realistic memory hierarchy require a 3% increase in transistor count and a 53% increase in chip space, but yield a nearly twofold speedup; and the contemporary scaled models require a 9% increase in transistor count and a 7% increase in chip space, resulting in a .5-fold speedup.

1 Introduction

A multithreaded processor is able to pursue multiple threads of control in parallel within the processor pipeline. The functional units are multiplexed between the thread contexts; most approaches store the thread contexts in different register sets on the processor chip, and latencies are masked by switching to another thread. A simultaneous multithreaded (SMT) processor, in particular, issues instructions from several threads simultaneously: it combines a wide-issue superscalar processor with multithreading. SMT approaches are simulated and evaluated with workloads that typically consist of several different, simultaneously executing Spec95 or database OLTP benchmark programs. Most of these simulations showed that an 8-threaded SMT is able to reach a two- to threefold throughput increase over single-threaded superscalar processors for multi-programmed or multithreaded workloads (see e.g. [1-4]). In consequence, recent announcements by industry concern a 4-threaded 8-issue SMT Alpha processor of DEC/Compaq [5] and the MAJC-5200 processor of Sun, which features two 4-threaded processors on a single die [6].

It is unfair, however, to compare the performance of multithreaded processor models with single-threaded processor models that apply otherwise the same configuration parameters. The resources of the single-threaded model should be adjusted such that the same chip space or the same transistor count is covered as in the multithreaded model; otherwise only statements about the scaling of the SMT technique with respect to performance are possible. Burns and Gaudiot [7] performed a first step in the right direction by estimating the layout area of SMT. They identified which layout blocks are affected by SMT, determined the scaling of the chip space requirements using an O-calculus, and compared SMT versus single-threaded processor space requirements by scaling an R10000-based layout to 0.18 µm technology. Our target is to set performance in relation to the number of transistors and the chip space required to implement the simulated features.
We supplemented our existing SMT processor simulator with a tool that estimates the transistor count and chip space requirements of the simulated features, applying a more detailed analysis than previously known approaches. Our processor model is based on a wide-issue superscalar general-purpose processor model with the six-stage pipeline of the PowerPC 604 [8], but it is enhanced by simultaneous multithreading, by combined integer/multimedia units, and by on-chip RAM memory towards a multimedia-enhanced SMT processor. We choose a hand-coded, multithreaded MPEG-2 video decompression as the example multimedia workload. We already reported on optimizations for processor models with an abundance of resources in [9] and on realistic processor models in [10]. In the following we study the performance, transistor count, and chip space of three different sets of SMT processor models. Section 2 introduces the baseline processor model, the simulator and workload, and the transistor count and chip-space estimator. Section 3 shows our simulation and estimation results; Section 4 concludes the paper.

2 Evaluation Methodology

2.1 Baseline Processor Model

Our multimedia-enhanced SMT processor model (see Fig. 1) features single or multiple fetch (IF) and decode (ID) units (one per thread), a single rename/issue (RI) unit, multiple decoupled reservation stations, several execution units (three to six simple integer/multimedia units, a complex integer/multimedia unit, a branch unit, a thread unit, and a global and one or two local load/store units), a single retirement (RT) and write-back (WB) unit, rename registers, a branch target address cache (BTAC), and separate I- and D-caches that are shared by all active threads. The D-cache is a non-blocking write-back cache with write allocation. Loads and stores of the same thread are performed out of order unless an address conflict arises. We further relax access ordering so that loads and stores of other threads can pass loads or stores with unavailable memory addresses, while succeeding loads and stores of the same thread are blocked; this is possible without consistency violation for our multithreaded algorithm and improves IPC performance by about 0. for the -threaded -issue models (see [9, 11]).

All simple integer/multimedia units share a single reservation station, which is able to dispatch three to six instructions per cycle; all other reservation stations are separate per execution unit. We employ thread-specific instruction buffers (between IF and ID), issue buffers (between ID and RI), and reorder buffers (in front of RT). Each thread executes in a separate architectural register set; however, there is no fixed allocation between threads and execution units. The pipeline performs in-order instruction fetch, decode, and rename/issue to the reservation stations, out-of-order dispatch from the reservation stations to the execution units, out-of-order execution, and in-order retirement and write-back. The rename/issue stage simultaneously selects instructions from all issue buffers up to its maximum issue bandwidth (the SMT feature).

Fig. 1: The SMT multimedia processor model (block diagram: memory interface, I- and D-cache, local RAM memory, global and local load/store units, thread control unit, per-thread IF and ID units, BTAC, rename registers, RI stage, simple and complex integer/multimedia units, branch unit, RT and WB stages)

The integer units are enhanced by MMX-style multimedia processing capabilities (the multimedia unit feature). We employ a thread control unit for thread start, stop, and synchronization and for I/O operations. We also employ a 3 KB local on-chip RAM memory (enough for the constants and variables of the simulation workload), accessed by one or two local load/store units. Simulations without the on-chip RAM showed an IPC decrease of about 0. for all configurations (see [11]). We use the DLX instruction set enhanced by thread control and multimedia instructions. All current simulations apply McFarling's gshare branch predictor [12] with 2-bit counters and a global branch history, which in general yields slightly better results for all models. The misprediction penalty is 5 cycles.
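The gshare scheme is only referenced here, not described; as a reminder, the following minimal sketch (illustrative Python, not the authors' simulator code; the 12-bit history and the resulting 4K-entry table are placeholder assumptions, not the configuration used in the paper) shows how a gshare predictor indexes a table of 2-bit saturating counters with the branch address XOR-ed with a global history register [12].

```python
class GsharePredictor:
    """Minimal gshare sketch: 2-bit saturating counters indexed by PC XOR global history.
    History length and table size are illustrative assumptions."""

    def __init__(self, history_bits=12):
        self.history_bits = history_bits
        self.table = [1] * (1 << history_bits)  # 2-bit counters, initialized weakly not-taken
        self.history = 0                        # global branch history register

    def _index(self, pc):
        return ((pc >> 2) ^ self.history) & ((1 << self.history_bits) - 1)

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2  # counter value 2 or 3 means predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        # saturate the 2-bit counter towards the actual branch outcome
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)
        # shift the outcome into the global history
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.history_bits) - 1)
```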

We choose to fix the following parameters for all simulations:
- 32-bit general-purpose registers (per thread),
- rename registers (per thread; a fixed total number in the last set of simulations),
- issue buffers (per thread) with a limited lookup depth (different values in the last set of simulations),
- (off-chip) main memory large enough to store the whole simulation workload,
- set-associative D- and I-caches,
- a fixed-width system bus, and
- a BTAC of fixed size.
We vary the number of threads from 1 to 8 and the issue bandwidth from 1 to 8 in each figure.

2.2 Simulator and Workload

The simulator is an execution-based simulator that models all internal structures of the microprocessor model. We choose a multithreaded MPEG-2 video decompression algorithm as workload. MPEG-2 video decompression can be partitioned into the following six steps:
1. header decode,
2. Huffman decode,
3. inverse quantization,
4. IDCT (inverse discrete cosine transform),
5. motion compensation, and
6. display.
Steps 1 and 2 have to be executed sequentially. Steps 3 to 5 can be executed in parallel for all blocks (and macro blocks) of a single image. The MPEG-2 decompression is made multithreaded by splitting the task into a single parser thread, eight threads for macro block decoding, and an additional display thread. The parser thread executes the first two steps of the decompression and activates up to eight macro block decoding threads that perform steps three to five. The display thread transfers the decompressed images into the frame buffer for display (step 6). The average usage of the instructions, assuming local RAM storage for table look-ups, is given in Table 1. The maximum performance of the decompression is bound by the sequential parser thread. Our studies show that the sequential part covers approximately 5.5 % of the executed instructions (depending on the bit rate and size of the encoded material), which by an Amdahl-style argument bounds the theoretical speedup over the sequential algorithm at roughly 1/0.055, i.e., at most about 18. Two different videos are used as data-stream workload for the MPEG-2 algorithm. One to two seconds of video are decompressed by the simulator per run. The produced picture frames can be visually and digitally analyzed to verify the correct operation of the workload routine and the simulator.

Instruction type                       Average use (%)
Integer/multimedia shift and add       5.0
Complex integer/multimedia multiply    3.
Local load/store                       0.
Global load/store                      7.7
Branch                                 9.9
Frame buffer I/O                       .
Thread control                         .7

Table 1: Average usage of instructions

2.3 Transistor Count and Chip-Space Estimator

We furthermore devised a transistor count and chip-space estimation tool that works as follows. To estimate the chip space and the number of transistors, we use an analytical method for memory-based structures like register files or internal queues and an empirical method for logic blocks like control logic and functional units. For the analytical method we calculate the number of bit cells needed for the memory-based structures and the number of ports to access them. Based on this information we calculate the number of transistors, assuming four transistors to implement a basic bit cell, two transistors per write port, and one transistor per read port. To calculate the chip space of a memory-based structure we estimate the area of a basic cell in accordance with [13]. The basic area of a bit cell is increased in height and width by the number of ports.
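Written as a formula (our notation, not the paper's: B bit cells in the structure, W write ports, and R read ports per cell), the transistor rule just stated is

\[
  T_{\mathrm{mem}} \;\approx\; B \,(4 + 2W + R).
\]

The chip-space rule correspondingly lets both the height and the width of a bit cell grow linearly with the port count, so the per-cell area grows roughly quadratically in the number of ports.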
We use the information about the basic bit cell area and the number of ports to calculate the area of each cell, and from that we estimate the whole chip space. To be independent of the chip technology we apply the half feature size λ as the measure of length [14]; e.g., 1 mm² in 0.5 µm technology equals 16 million λ². For the non-memory-based parts of the processor we measure the floor plans of existing processors. With this empirical approach we estimate the sizes of the basic logic blocks of the processors and, using this data, calculate the necessary chip space of the logic blocks. To estimate the transistor count of hypothetical processors, our tool uses the average transistor density of non-memory-based structures of real processors, obtained by measuring floor plans (SPARC and HP PA-8000) together with additional information about the transistor counts of the measured logic blocks. We validated the tool by estimating the PowerPC 604 configuration [8] and reached the same 3.6 million transistor count and nearly the same die size as the real processor. For a detailed description of the estimation tool see [15].
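As an illustration of the analytical part of this estimation approach, the sketch below applies the stated rules: four transistors per bit cell, two per write port, one per read port, cell height and width growing with the port count, and areas expressed in λ². It is a simplified stand-in for the authors' tool; the base cell dimensions and the port pitch are placeholder assumptions, not calibrated values.

```python
def memory_structure_estimate(bits, write_ports, read_ports,
                              base_cell_h=10.0, base_cell_w=10.0, port_pitch=4.0):
    """Estimate transistor count and area (in lambda^2) of a multiported memory-based
    structure, following the rules of Section 2.3. base_cell_h, base_cell_w and
    port_pitch (in lambda) are placeholder values, not the tool's calibrated numbers."""
    ports = write_ports + read_ports
    # 4 transistors per basic bit cell, 2 per write port, 1 per read port
    transistors = bits * (4 + 2 * write_ports + read_ports)
    # the basic bit cell grows in height and width with the number of ports
    cell_area = (base_cell_h + port_pitch * ports) * (base_cell_w + port_pitch * ports)
    return transistors, bits * cell_area


def mm2_to_lambda2(area_mm2, feature_size_um):
    """Convert an area in mm^2 to lambda^2, with lambda = half the feature size.
    Example: 1 mm^2 at 0.5 um corresponds to 16 million lambda^2."""
    lam_um = feature_size_um / 2.0
    lam_per_mm = 1000.0 / lam_um
    return area_mm2 * lam_per_mm ** 2


if __name__ == "__main__":
    # hypothetical example: a 32-entry, 32-bit register file with 4 write and 8 read ports
    t, a = memory_structure_estimate(bits=32 * 32, write_ports=4, read_ports=8)
    print(f"{t} transistors, {a:.0f} lambda^2")
    print(f"1 mm^2 at 0.5 um = {mm2_to_lambda2(1.0, 0.5):.0f} lambda^2")
```

Summing such per-structure estimates over all register files, buffers, and queues of a configuration, and adding the empirically derived logic-block figures, would then give the kind of per-configuration totals reported in Figs. 3, 5, and 7.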

3 Performance vs. Hardware Cost Estimation

3.1 Processor Models with an Abundance of Resources

In this paper we focus on three sets of processor models; the full experimental results are published in [11]. Figure 2 shows the SMT processor models that yielded the highest throughput reached with our simulator. The fetch and decode bandwidth is scaled from one fetch unit (decode unit), which is able to fetch (decode) a single instruction per cycle in the single-threaded single-issue model, to eight fetch units (decode units), which are each able to fetch (decode) up to eight instructions per cycle in the 8-threaded 8-issue configuration. We notice that multithreading is very effective if the issue bandwidth is at least four. In particular, the two- and four-threaded models lead to a high performance increase in the multiple-issue models. There is little improvement for the single-threaded (the contemporary superscalar case) and the two-threaded models when the issue bandwidth is increased further. These models need as many resources as a potential processor of the years up to 2009 on the basis of the SIA National Roadmap for Semiconductors prognosis [16]. The transistor count of these models reaches a maximum of nearly 300 M transistors per processor chip for the 8-threaded 8-issue model, including the large I- and D-caches (see Fig. 3).

Fig. 2: Maximum processor models (IPC over the number of threads and the issue bandwidth; the configuration lists the fetch and decode bandwidth, reservation station sizes, issue buffers and lookup depth, retirement buffers, rename registers, local load/store units, integer/multimedia units, result buses, the D-cache fill burst rate, and MB-sized D- and I-caches)

Figure 3: Transistor count and chip space estimation of the maximum processor models (chip space in M λ², transistor count in K transistors)

3.2 Processor Models with a Realistic Memory Hierarchy

Second, we lessened the transistor and space requirements to processor models with the resource capacity of contemporary processors and performed further optimization steps. The processor models in Fig. 2 use a D-cache (and an I-cache) in the MB range, able to host the full workload. We reduced the I- and D-cache sizes to realistic values in the KB range in Fig. 4. Because of the small workload code size, the reduction of the I-cache size yields only an insignificant performance decrease, but a large transistor count and area reduction. The reduction of the D-cache size, in contrast, strongly affects the overall performance of all models. We also reduced the cache line burst rate from 1:1:1:1 down to a more realistic 3:1:1:1 processor cycles. On the other hand, memory bus utilization is already very high, and memory bandwidth is critical for SMT processor performance. We therefore choose to double the cache line burst length to 3:1:1:1:1:1:1:1 instead of a costly doubling of the memory bus speed or the bus width (see [17] for a discussion of the trade-offs in the memory hierarchy). The diagrams in Fig. 5 show that the required transistor count and chip space of the - and -threaded -issue configurations already allow an implementation of these models with contemporary VLSI technology.

Fig. 4: Optimized maximum processor models with a more realistic memory hierarchy (IPC over the number of threads and the issue bandwidth; configuration as in Fig. 2 except for a 3:1:1:1:1:1:1:1 D-cache fill burst rate and KB-sized D- and I-caches)

Figure 5: Transistor count and chip space estimation of the optimized maximum processor models with a more realistic memory hierarchy (chip space in M λ², transistor count in K transistors)
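To make the burst-rate trade-off discussed above concrete, the small sketch below computes cache-line fill times from such burst schedules. It assumes the usual reading of the a:b:b:... notation (cycles until the first data transfer, then cycles per subsequent transfer) and an illustrative bus width; it is not taken from the authors' simulator.

```python
def fill_cycles(schedule):
    """Total bus cycles to fill a cache line for a burst schedule such as [3, 1, 1, 1]."""
    return sum(schedule)

def bandwidth(schedule, bytes_per_beat=8):
    """Average bytes transferred per bus cycle for the given burst schedule."""
    return len(schedule) * bytes_per_beat / fill_cycles(schedule)

# a 4-beat burst at 3:1:1:1 versus an 8-beat burst at 3:1:1:1:1:1:1:1
short_burst = [3, 1, 1, 1]
long_burst = [3] + [1] * 7
print(fill_cycles(short_burst), bandwidth(short_burst))  # 6 cycles, ~5.3 bytes/cycle
print(fill_cycles(long_burst), bandwidth(long_burst))    # 10 cycles, 6.4 bytes/cycle
```

Doubling the burst length raises the sustained bandwidth per line fill without touching the bus speed or the bus width, which is the trade-off argued for above.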

3.3 Contemporary Scaled Processor Models

Our third set of processor models imposes further resource restrictions: smaller reservation stations, less fetch and decode bandwidth, fewer result buses, and one instead of two local load/store units. The number of rename registers is fixed, and the fetch and decode bandwidth is fixed at two fetch (and two decode) units, each fetching (decoding) a fixed number of instructions, in contrast to the previous sets of configurations, which scale these parameters with the issue bandwidth and the number of threads. The transistor count is .5 M transistors for the -threaded -issue model and . M transistors for the -threaded -issue model (see Fig. 7). Both are less than the 15.2 M transistors of the recent DEC Alpha 21264 processor (the Alpha 21264 [18] features a different microarchitecture than our PowerPC-based models). Our performance evaluations in Fig. 6 reach IPC values for the - to -threaded, - to -issue models that are just about one IPC below those of the corresponding processor models of Fig. 4. The performance increase of the latter, however, is paid for with roughly .5 times the number of transistors and six times the chip size.

Figure 6: Realistic processor models (IPC over the number of threads and the issue bandwidth; configuration with fixed fetch and decode bandwidth, smaller reservation stations, buffers, and lookup depth, fewer rename registers and result buses, one local load/store unit, three integer/multimedia units, a 3:1:1:1:1:1:1:1 D-cache fill burst rate, and KB-sized D- and I-caches)

Figure 7: Transistor count and chip space estimation of the realistic processor models (chip space in M λ², transistor count in K transistors)

4 Conclusions

We simulated various multimedia-enhanced SMT processor models using a hand-coded multithreaded MPEG-2 video decompression algorithm as workload. Our performance estimation showed a roughly threefold IPC increase of the -threaded -issue over the single-threaded -issue processor models in all three sets of configurations. The transistor count and chip space estimation showed that the additional hardware cost of an 8-threaded SMT processor over a single-threaded processor is a negligible 5% transistor increase, but a 3% chip space increase, for a 300 M transistor chip of the years up to 2009; it requires a 7% increase of the transistor amount for the processor models with realistic memory hierarchy (Figs. 4 and 5), and % more for the contemporary scaled processor models (Figs. 6 and 7). The chip space increase of the -threaded -issue over the single-threaded -issue model with realistic memory hierarchy is about 07%, and 3% in the case of the contemporary scaled processor models.

Even more favorable is the comparison of the single-threaded -issue models with the -threaded -issue SMT model. The maximum processor models require a % increase in transistor count and a 9% increase in chip space, but yield a threefold speedup; the models with realistic memory hierarchy require a 3% increase in transistor count and a 53% increase in chip space, but yield a nearly twofold speedup; and the contemporary scaled models require a 9% increase in transistor count and a 7% increase in chip space, resulting in a .5-fold speedup. These observations show that SMT is an attractive design feature for future processors and already an alternative for contemporary processors.

Our processor models are specifically tailored to the characteristics of our multithreaded MPEG-2 algorithm; other workloads might favor different configurations. Our transistor count and chip space estimation tool will soon be adapted to the processor model that underlies the more widely used SimpleScalar simulator.

5 References

[1] Tullsen, D. M., Eggers, S. J., and Levy, H. M.: Simultaneous Multithreading: Maximizing On-Chip Parallelism. 22nd Ann. Int. Symp. on Computer Architecture, Santa Margherita Ligure, Italy, July 1995.
[2] Tullsen, D. M., Eggers, S. J., Levy, H. M., Lo, J. L., and Stamm, R. L.: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor. 23rd Ann. Int. Symp. on Computer Architecture, Philadelphia, May 1996, 191-202.
[3] Sigmund, U., Ungerer, Th.: Evaluating a Multithreaded Superscalar Microprocessor Versus a Multiprocessor Chip. 4th PASA Workshop on Parallel Systems and Algorithms, Forschungszentrum Jülich, Germany, World Scientific Publishing, April 1996.
[4] Gulati, M., Bagherzadeh, N.: Performance Study of a Multithreaded Superscalar Microprocessor. 2nd Int. Symp. on High-Performance Computer Architecture, February 1996.
[5] Emer, J.: Simultaneous Multithreading: Multiplying Alpha's Performance. Microprocessor Forum, San Jose, CA, October 1999.
[6] Gwennap, L.: MAJC Gives VLIW a New Twist. Microprocessor Report, September 13, 1999.
[7] Burns, J., Gaudiot, J.-L.: Quantifying the SMT Layout Overhead - Does SMT Pull Its Weight? 6th Int. Symp. on High-Performance Computer Architecture (HPCA-6), Toulouse, France, January 2000.
[8] Song, S. P., Denman, M., Chang, J.: The PowerPC 604 RISC Microprocessor. IEEE Micro, October 1994, 8-17.
[9] Oehring, H., Sigmund, U., Ungerer, Th.: Simultaneous Multithreading and Multimedia. Workshop on Multi-Threaded Execution, Architecture and Compilation (MTEAC '99), held in conjunction with HPCA-5, Orlando, January 1999.
[10] Oehring, H., Sigmund, U., Ungerer, Th.: MPEG-2 Video Decompression on Simultaneous Multithreaded Multimedia Processors. 1999 Int. Conf. on Parallel Architectures and Compilation Techniques (PACT '99), Newport Beach, CA, October 1999.
[11] Sigmund, U.: Entwurf und Evaluierung mehrfädig superskalarer Prozessortechniken im Hinblick auf Multimedia. Doctoral Thesis, Universität Karlsruhe, April 2000 (in German).
[12] McFarling, S.: Combining Branch Predictors. Technical Note TN-36, DEC Western Research Laboratory, 1993.
[13] Lopez, D., Llosa, J., Valero, M., and Ayguade, E.: Resource Widening Versus Replication: Limits and Performance-Cost Trade-off. 12th Int. Conf. on Supercomputing (ICS-12), 1998.
[14] Mead, C., Conway, L.: Introduction to VLSI Systems. Addison-Wesley, Reading, MA, 1980.
[15] Steinhaus, M.: Complexity of Processors. Master's Thesis, Universität Karlsruhe, October 2000 (in German).
[16] Semiconductor Industry Association: The National Technology Roadmap for Semiconductors.
[17] Sigmund, U., Ungerer, Th.: Memory Hierarchy Studies of Multimedia-enhanced Simultaneous Multithreaded Processors for MPEG-2 Video Decompression. Workshop on Multi-Threaded Execution, Architecture and Compilation (MTEAC-3), Toulouse, January 2000.
[18] Gwennap, L.: Digital Sets New Standard. Microprocessor Report, Vol. 10, No. 14, October 1996.
