On Performance, Transistor Count and Chip Space Assessment of Multimedia-enhanced Simultaneous Multithreaded Processors
Ulrich Sigmund, Marc Steinhaus, and Theo Ungerer

VIONA Development GmbH, Karlstr. 7, D-733 Karlsruhe, Germany
Institute of Computer Design and Fault Tolerance, University of Karlsruhe, D-7 Karlsruhe, Germany

Abstract

This paper gives a cost/benefit analysis of simultaneous multithreaded (SMT) processors with multimedia enhancements. We carefully assess the performance, transistor count, and chip space of each simulated processor model. We focus our investigations on three different sets of processor configurations: one set with an abundance of resources, a second set with a more realistic memory hierarchy, and a third set with contemporary scaled processor models. Comparing the single-threaded -issue models with the -threaded -issue SMT models shows that the maximum processor models require a % increase in transistor count and a 9% increase in chip space, but yield a threefold speedup; the models with realistic memory hierarchy require a 3% increase in transistor count and a 53% increase in chip space, but yield a nearly twofold speedup; and the contemporary scaled models require a 9% increase in transistor count and a 7% increase in chip space, resulting in a .5-fold speedup.

1 Introduction

A multithreaded processor is able to pursue multiple threads of control in parallel within the processor pipeline. The functional units are multiplexed between the thread contexts; most approaches store the thread contexts in different register sets on the processor chip, and latencies are masked by switching to another thread. A simultaneous multithreaded (SMT) processor, in particular, issues instructions from several threads simultaneously: it combines a wide-issue superscalar processor with multithreading. SMT approaches are simulated and evaluated with workloads typically consisting of several different, simultaneously executing Spec95 or database OLTP benchmark programs.
Most of the simulations showed that an -threaded SMT is able to reach a two- to threefold throughput increase over single-threaded superscalar processors for multi-programmed or multithreaded workloads (see e.g. [1-4]). In consequence, recent announcements by industry concern a -threaded -issue SMT Alpha processor of DEC/Compaq [5] and the MAJC-5200 processor of Sun, which features two -threaded processors on a single die [6]. It is unfair, however, to compare the performance of multithreaded processor models with that of single-threaded processor models applying otherwise the same configuration parameters. The resources of the single-threaded model should be adjusted such that the same chip space or the same transistor count is covered as in the multithreaded model; otherwise, only statements about the scaling of the SMT technique with respect to performance are possible. Burns and Gaudiot [7] performed a first step in the right direction by estimating the layout area of SMT. They identified which layout blocks are affected by SMT, determined the scaling of chip-space requirements using an O-calculus, and compared SMT versus single-threaded processor space requirements by scaling an R10000-based layout to 0. µm technology. Our target is to set performance in relation to the number of transistors and the chip space required to implement the simulated features. We supplemented our existing SMT processor simulator with a tool that estimates the transistor count and chip-space requirements of the simulated features, applying a more detailed analysis than previously known approaches. Our processor model is based on a wide-issue superscalar general-purpose processor model with the -stage pipeline of the PowerPC 620 [8], but is enhanced by simultaneous multithreading and by combined integer/multimedia units and on-chip RAM memory towards a multimedia-enhanced SMT processor. We chose a hand-coded, multithreaded MPEG-2 video decompression as an example of multimedia workloads.
We already reported on optimizations for processor models with an abundance of resources in [9] and on realistic processor models in [10]. In the following we study the performance, transistor count, and chip space of three different sets of SMT processor models. Section 2 introduces the baseline processor model, the simulator and workload, and the transistor count and chip-space estimator. Section 3 shows our simulation and estimation results; Section 4 concludes the paper.
2 Evaluation Methodology

2.1 Baseline Processor Model

Our multimedia-enhanced SMT processor model (see Fig. 1) features single or multiple fetch (IF) and decode (ID) units (one per thread), a single rename/issue (RI) unit, multiple decoupled reservation stations, up to twelve execution units (three to six integer/multimedia units, a complex integer/multimedia unit, a branch unit, a thread unit, a global and one or two local load/store units), a single retirement (RT) and write-back (WB) unit, rename registers, a branch target address cache (BTAC), and separate I- and D-caches that are shared by all active threads. The D-cache is a non-blocking write-back cache with write-allocation. Loads and stores of the same thread are performed out of order unless an address conflict arises. We further relax access ordering so that loads and stores of other threads can pass loads or stores with unavailable memory addresses, while succeeding loads and stores of the same thread are blocked; this is possible without consistency violation for our multithreaded algorithm. It improves IPC performance by about 0. for the -threaded -issue models (see [9, 11]).

[Fig. 1 block diagram: memory interface with I-cache, D-cache, local RAM memory, and I/O; per-thread IF and ID units; rename registers, BTAC, and the rename/issue (RI) stage; reservation stations feeding the branch, simple and complex integer/multimedia, thread control, and global/local load/store units; retirement (RT) and write-back (WB).]

All simple integer/multimedia units share a single reservation station which is able to dispatch three to six instructions per cycle; all other reservation stations are separate per execution unit. We employ thread-specific instruction buffers (between IF and ID), issue buffers (between ID and RI), and reorder buffers (in front of RT). Each thread executes in a separate architectural register set; however, there is no fixed allocation between threads and (execution) units.
The pipeline performs in-order instruction fetch, decode, and rename/issue to the reservation stations, out-of-order dispatch from the reservation stations to the execution units, out-of-order execution, and in-order retirement and write-back. The rename/issue stage simultaneously selects instructions from all issue buffers up to its maximum issue bandwidth (SMT feature).

Fig. 1: The SMT Multimedia Processor Model

The integer units are enhanced by MMX-style multimedia processing capabilities (multimedia unit feature). We employ a thread control unit for thread start, stop, and synchronization, and for I/O operations. We also employ a 3 KB local on-chip RAM memory (enough for the constants and variables of the simulation workload) accessed by one or two local load/store units. Simulations without the on-chip RAM showed an IPC decrease of about 0. for all configurations (see [11]). We use the DLX instruction set enhanced by thread control and by multimedia instructions. All current simulations apply McFarling's gshare branch predictor [12] (with K 2-bit counters and an -bit history), which in general yields slightly better results for all models. The misprediction penalty is 5 cycles.
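The gshare scheme cited above can be sketched as follows. This is an illustrative reimplementation, not the simulator's code; the table size and history length are placeholder values, since the exact configuration is not fully legible in this copy.

```python
class GsharePredictor:
    """Sketch of McFarling's gshare: a table of 2-bit saturating
    counters indexed by (branch PC) XOR (global branch history)."""

    def __init__(self, history_bits=12):
        self.size = 1 << history_bits        # counter table entries
        self.table = [1] * self.size         # 2-bit counters, weakly not-taken
        self.ghr = 0                         # global history register

    def _index(self, pc):
        # XOR of PC and history spreads different branches over the table
        return (pc ^ self.ghr) & (self.size - 1)

    def predict(self, pc):
        # Counter values 2 and 3 mean "predict taken"
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
        # Shift the actual outcome into the global history
        self.ghr = ((self.ghr << 1) | int(taken)) & (self.size - 1)
```

After a few repetitions of a strongly biased branch, the counter addressed by the (by then stable) history saturates and the branch is predicted correctly.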
We chose to fix the following parameters for all simulations:
- 3-bit general-purpose registers (per thread),
- 0 rename registers (per thread; but a fixed number in the last set of simulations),
- 3-entry issue buffers (per thread) with a -entry lookup depth (, respectively, in the last set of simulations),
- MB (off-chip) main memory (enough to store the whole simulation workload),
- -way set-associative D- and I-caches,
- -bit system bus, and
- a 0-entry BTAC.
We vary the number of threads from to and the issue bandwidth from to in each figure.

2.2 Simulator and Workload

The simulator is an execution-based simulator that models all internal structures of the microprocessor model. We chose a multithreaded MPEG-2 video decompression algorithm as workload. The MPEG-2 video decompression can be partitioned into the following six steps:
1. header decode,
2. Huffman decode,
3. inverse quantization,
4. IDCT (inverse discrete cosine transform),
5. motion compensation, and
6. display.
Steps 1 and 2 have to be executed sequentially. Steps 3 to 5 can be executed in parallel for all blocks (and macro blocks) of a single image. The MPEG-2 decompression is made multithreaded by splitting the task into a single parser thread, eight threads for macro block decoding, and an additional display thread. The parser thread executes the first two steps of the decompression and activates up to eight macro block decoding threads that perform steps three to five. The display thread transfers the decompressed images for display into the frame buffer (step 6). The average usage of the instructions, assuming local RAM storage for table look-up, is given in Table 1. The maximum performance of the decompression is bound by the sequential parser thread. Our studies show that the sequential part covers approximately 5.5% of the executed instructions (depending on the bit rate and size of the encoded material). This results in a theoretical speedup of at most .5 compared to the sequential algorithm.
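The speedup bound imposed by the sequential parser thread follows Amdahl's law; a minimal sketch, using the ~5.5% sequential share reported above (the thread count is illustrative, matching the eight macro-block decoding threads of the workload):

```python
def amdahl_speedup(serial_fraction, n_threads):
    """Amdahl's law: attainable speedup when serial_fraction of the
    work is sequential and the rest is spread over n_threads."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_threads)

# ~5.5% sequential parser share, eight parallel decoding threads:
bound = amdahl_speedup(0.055, 8)   # roughly 5.8
```

As the number of threads grows, the bound approaches 1/0.055, i.e. the sequential fraction alone caps the achievable speedup.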
Two different videos are used as data-stream workload for the MPEG-2 algorithm. One to two seconds of video are decompressed by the simulator per simulator run. The produced picture frames can be visually and digitally analyzed to verify the correct working of the workload routine and the simulator.

Table 1: Average usage of instructions

  Instruction type                       Average use (%)
  Integer/multimedia shift and add       5.0
  Complex integer/multimedia multiply    3.
  Local load/store                       0.
  Global load/store                      7.7
  Branch                                 9.9
  Frame buffer I/O                       .
  Thread control                         .7

2.3 Transistor Count and Chip-Space Estimator

We devised a transistor count and chip-space estimation tool that works as follows. To estimate the chip space and the number of transistors, we use an analytical method for memory-based structures like register files or internal queues, and an empirical method for logic blocks like control logic and functional units. For the analytical method we calculate the number of bit cells needed for the memory-based structures and the number of ports to access them. Based on this information, we calculate the number of transistors, assuming four transistors to implement a basic bit cell, two transistors per write port, and one transistor per read port. To calculate the chip space of a memory-based structure we estimate the area of a basic cell in accordance with [13]. The basic area of a bit cell is increased in height and width by the number of ports. From the basic bit-cell area and the number of ports we calculate the cell area, and from that we estimate the whole chip space. To be independent of chip technology we apply the half-feature size λ as the measure of length [14]; e.g., 1 mm² in 0.5 µm technology equals 16 million λ². For the non-memory-based parts of the processor we measure the floor plans of existing processors. With this empirical approach we estimate the sizes of the basic logic blocks of the processors.
Using this data, we calculate the necessary chip space of the logic blocks. To estimate the transistor count of hypothetical processors with our tool, we calculate the average transistor density of the non-memory-based structures of real processors by measuring floor plans (SPARC and HP PA-000) and by using additional information about the transistor count of the measured logic blocks. We validated the tool by estimating the PowerPC 620 configuration [8] and reached the same 3. million transistor count and nearly the same die size as the real processor. For a detailed description of the estimation tool see [15].
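The analytical part of the estimator can be sketched as follows. The transistor-counting rule (four transistors per bit cell, two per write port, one per read port) and the port-dependent cell growth are taken from the text; the geometry constants and the register-file example are illustrative assumptions, not the tool's calibration.

```python
def memory_transistors(bits, read_ports, write_ports):
    """Transistor count of a memory-based structure: 4 transistors
    per basic bit cell, +2 per write port, +1 per read port."""
    per_cell = 4 + 2 * write_ports + 1 * read_ports
    return bits * per_cell

def memory_area_lambda2(bits, read_ports, write_ports,
                        base_h=10, base_w=10, pitch=6):
    """Chip-space estimate in units of lambda^2: the basic bit cell
    grows in both height and width with the number of ports.
    base_h, base_w, and pitch are hypothetical geometry constants."""
    ports = read_ports + write_ports
    cell_h = base_h + pitch * ports      # one wire pitch per port
    cell_w = base_w + pitch * ports
    return bits * cell_h * cell_w

# Example: a 32-entry, 32-bit register file with 8 read / 4 write ports
bits = 32 * 32
transistors = memory_transistors(bits, read_ports=8, write_ports=4)  # 20480
```

Because λ is the half-feature size, such area figures stay comparable across process generations; dividing by λ² for a concrete technology converts them back to mm².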
3 Performance vs. Hardware Cost Estimation

3.1 Processor Models with an Abundance of Resources

In this paper we focus on three sets of processor models; the full experimental results are published in [11]. Figure 2 shows the SMT processor models that yielded the highest throughput reached with our simulator. Fetch and decode bandwidth is scaled from one fetch unit (decode unit), which is able to fetch (decode) a single instruction per cycle in the single-threaded single-issue model, to eight fetch units (decode units), which are each able to fetch (decode) up to eight instructions per cycle in the -threaded -issue configuration. We notice that multithreading is very effective if the issue bandwidth is at least four. In particular, the two- and four-threaded models lead to a high performance increase in the multiple-issue models. There is little improvement for the single-threaded (the contemporary superscalar case) and the two-threaded models when the issue bandwidth is increased. These models need as many resources as a potential processor of the years 00 to 009 on the basis of the SIA National Technology Roadmap for Semiconductors prognosis [16]. The transistor count of these models reaches a maximum of nearly 300 M transistors per processor chip for the -threaded -issue model, including the large I- and D-caches (see Fig. 3).

[Fig. 2 bar chart: IPC over issue bandwidth and number of threads. Configuration sidebar: reservation stations 5 each; issue buffers 3 each; lookup depth each; retirement buffers 3 each; rename registers; local load/store units; integer/multimedia units; result buses 0; D-cache fill burst rate :::; D-cache MB; I-cache MB.]

Fig. 2: Maximum processor models

[Figure 3 bar chart: transistor count (K transistors) and chip space (size in M λ²) of the maximum processor models.]

Figure 3: Transistor count and chip space estimation of maximum processor models
3.2 Processor Models with Realistic Memory Hierarchy

Second, we lessened the transistor and space requirements to processor models with the resource capacity of contemporary processors and performed further optimization steps. The processor models in Fig. 2 use a MB D-cache (and a MB I-cache) able to host the full workload. We reduced the I- and D-cache sizes to realistic values of KB each in Fig. 4. Because of the small workload code size of only KB, reducing the I-cache size down to KB yields only an insignificant performance decrease, but a large transistor count and area reduction. The reduction of the D-cache size strongly affects overall performance for all models. We also reduced the cache line burst rate from ::: down to a more realistic 3::: processor cycles. On the other hand, memory bus utilization is already very high, and memory bandwidth is critical for SMT processor performance. We chose to double the cache line burst length to 3::::::: instead of a costly doubling of the memory bus speed or the bus width (see [17] for a discussion of the trade-offs of the memory hierarchy). The diagrams in Fig. 5 show that the required transistor count and chip space of the - and -threaded -issue configurations already allow an implementation of these models with contemporary VLSI technology.

[Fig. 4 bar chart: IPC over issue bandwidth and number of threads. Configuration sidebar: reservation stations 5 each; issue buffers 3 each; lookup depth each; retirement buffer 3 each; rename registers; local load/store units; integer/multimedia units; result buses 0; D-cache fill burst rate 3:::::::; D-cache KB; I-cache KB.]

Fig. 4: Optimized maximum processor models with more realistic memory hierarchy

[Figure 5 bar chart: transistor count (K transistors) and chip space (M λ²) of the optimized maximum processor models.]

Figure 5: Transistor count and chip space estimation of the optimized maximum processor models with more realistic memory hierarchy
3.3 Contemporary Scaled Processor Models

Our third set of processor models imposes further resource restrictions: smaller reservation stations, less fetch and decode bandwidth, fewer result buses, and one instead of two local load/store units. The number of rename registers is fixed, and the fetch and decode bandwidth is fixed at two fetch (and two decode) units with instructions fetched (and decoded) each, in contrast to the previous sets of configurations, which scale these parameters with the issue bandwidth and the number of threads. The transistor count is .5 M transistors for the -threaded -issue model and . M transistors for the -threaded -issue model (see Fig. 7). Both are less than the 5. M transistors of the recent DEC Alpha 21264 processor [18] (which features a different microarchitecture than our PowerPC-based models). Our performance evaluations in Fig. 6 reach IPC values for the - to -threaded, - to -issue models that are just about one IPC less than those of the corresponding processor models of Fig. 4. However, the performance increase of the latter is paid for by roughly .5 times the number of transistors and six times the chip size.

[Fig. 6 bar chart: IPC over issue bandwidth and number of threads. Configuration sidebar: reservation stations 3 each; issue buffers each; lookup depth each; retirement buffers each; rename registers; local load/store unit; integer/multimedia units 3; result buses; D-cache fill burst rate 3:::::::; D-cache KB; I-cache KB.]

Figure 6: Realistic processor models

[Figure 7 bar chart: transistor count (K transistors) and chip space (M λ²) of the realistic processor models.]

Figure 7: Transistor count and chip space estimation of realistic processor models
4 Conclusions

We simulated various multimedia-enhanced SMT processor models using a hand-coded multithreaded MPEG-2 video decompression algorithm as workload. Our performance estimation showed a roughly threefold IPC increase of the -threaded -issue over the single-threaded -issue processor models in all three sets of configurations. The transistor count and chip space estimation showed that the additional hardware cost of an -threaded SMT processor over a single-threaded processor is a negligible 5% transistor increase, but a 3% chip space increase, for a 300 M transistor chip of the years 00 to 009; it requires a 7% increase in transistor count for the processor models with realistic memory hierarchy (Figs. 4 and 5), and % more for the contemporary scaled processor models (Figs. 6 and 7). The chip space increase of the -threaded -issue over the single-threaded -issue model with realistic memory hierarchy is about 07%, and 3% in the case of the contemporary scaled processor models. Even more favorable is the comparison of the single-threaded -issue models with the -threaded -issue SMT model: the maximum processor models require a % increase in transistor count and a 9% increase in chip space, but yield a threefold speedup; the models with realistic memory hierarchy require a 3% increase in transistor count and a 53% increase in chip space, but yield a nearly twofold speedup; and the contemporary scaled models require a 9% increase in transistor count and a 7% increase in chip space, resulting in a .5-fold speedup. These observations show that SMT is an attractive design feature for future processors and already an alternative for contemporary processors. Our processor models are specifically tailored to the characteristics of our multithreaded MPEG-2 algorithm; other workloads might favor different configurations. Our transistor count and chip space estimation tool will soon be adapted to the processor model that underlies the more widely used SimpleScalar simulator.
5 References

[1] Tullsen, D. M., Eggers, S. J., and Levy, H. M.: Simultaneous Multithreading: Maximizing On-Chip Parallelism. 22nd Ann. Int. Symp. on Computer Architecture, Santa Margherita Ligure, Italy, July 1995.
[2] Tullsen, D. M., Eggers, S. J., Levy, H. M., Lo, J. L., and Stamm, R. L.: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor. 23rd Ann. Int. Symp. on Computer Architecture, Philadelphia, May 1996.
[3] Sigmund, U., Ungerer, Th.: Evaluating a Multithreaded Superscalar Microprocessor Versus a Multiprocessor Chip. 4th PASA Workshop on Parallel Systems and Algorithms, Forschungszentrum Jülich, Germany, World Scientific Publishing, April 1996.
[4] Gulati, M., Bagherzadeh, N.: Performance Study of a Multithreaded Superscalar Microprocessor. 2nd Int. Symp. on High-Performance Computer Architecture, February 1996.
[5] Emer, J.: Simultaneous Multithreading: Multiplying Alpha's Performance. Microprocessor Forum 1999, San Jose, CA, Oct. 1999.
[6] Gwennap, L.: MAJC Gives VLIW a New Twist. Microprocessor Report, Sept. 13, 1999.
[7] Burns, J., Gaudiot, J.-L.: Quantifying the SMT Layout Overhead - Does SMT Pull Its Weight? HPCA-6, Toulouse, France, Jan. 2000.
[8] Song, S. P., Denman, M., Chang, J.: The PowerPC 620 RISC Microprocessor. IEEE Micro, October 1994.
[9] Oehring, H., Sigmund, U., Ungerer, Th.: Simultaneous Multithreading and Multimedia. Workshop on Multithreaded Execution, Architecture and Compilation (MTEAC'99), in connection with the HPCA-5 Conf., Orlando, Jan. 1999.
[10] Oehring, H., Sigmund, U., Ungerer, Th.: MPEG-2 Video Decompression on Simultaneous Multithreaded Multimedia Processors. 1999 Int. Conf. on Parallel Architectures and Compilation Techniques (PACT '99), Newport Beach, CA, Oct. 1999.
[11] Sigmund, U.: Entwurf und Evaluierung mehrfädig superskalarer Prozessortechniken im Hinblick auf Multimedia (Design and Evaluation of Multithreaded Superscalar Processor Techniques for Multimedia; in German). Doctoral thesis, Universität Karlsruhe, April 2000.
[12] McFarling, S.: Combining Branch Predictors. DEC WRL Technical Note TN-36, DEC Western Research Laboratory, 1993.
[13] Lopez, D., Llosa, J., Valero, M., and Ayguade, E.: Resource Widening Versus Replication: Limits and Performance-Cost Trade-offs. Int. Conf. on Supercomputing (ICS), 1998.
[14] Mead, C., Conway, L.: Introduction to VLSI Systems. Addison-Wesley, Reading, MA, 1980.
[15] Steinhaus, M.: Complexity of Processors. Master's thesis, Universität Karlsruhe, Oct. 2000 (in German).
[16] Semiconductor Industry Association: The National Technology Roadmap for Semiconductors.
[17] Sigmund, U., Ungerer, Th.: Memory Hierarchy Studies of Multimedia-enhanced Simultaneous Multithreaded Processors for MPEG-2 Video Decompression. Workshop on Multi-Threaded Execution, Architecture and Compilation (MTEAC-3), Toulouse, Jan. 2000.
[18] Gwennap, L.: Digital Sets New Standard. Microprocessor Report, Vol. 10, Oct. 1996.
More informationMotivation. Banked Register File for SMT Processors. Distributed Architecture. Centralized Architecture
Motivation Banked Register File for SMT Processors Jessica H. Tseng and Krste Asanović MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA 02139, USA BARC2004 Increasing demand on
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More information15-740/ Computer Architecture Lecture 21: Superscalar Processing. Prof. Onur Mutlu Carnegie Mellon University
15-740/18-740 Computer Architecture Lecture 21: Superscalar Processing Prof. Onur Mutlu Carnegie Mellon University Announcements Project Milestone 2 Due November 10 Homework 4 Out today Due November 15
More informationCS425 Computer Systems Architecture
CS425 Computer Systems Architecture Fall 2017 Multiple Issue: Superscalar and VLIW CS425 - Vassilis Papaefstathiou 1 Example: Dynamic Scheduling in PowerPC 604 and Pentium Pro In-order Issue, Out-of-order
More informationArchitectural Considerations for Network Processor Design. EE 382C Embedded Software Systems. Prof. Evans
Architectural Considerations for Network Processor Design EE 382C Embedded Software Systems Prof. Evans Department of Electrical and Computer Engineering The University of Texas at Austin David N. Armstrong
More informationThe Use of Multithreading for Exception Handling
The Use of Multithreading for Exception Handling Craig Zilles, Joel Emer*, Guri Sohi University of Wisconsin - Madison *Compaq - Alpha Development Group International Symposium on Microarchitecture - 32
More informationCISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP
CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer
More informationBeyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy
EE482: Advanced Computer Organization Lecture #13 Processor Architecture Stanford University Handout Date??? Beyond ILP II: SMT and variants Lecture #13: Wednesday, 10 May 2000 Lecturer: Anamaya Sullery
More informationHP PA-8000 RISC CPU. A High Performance Out-of-Order Processor
The A High Performance Out-of-Order Processor Hot Chips VIII IEEE Computer Society Stanford University August 19, 1996 Hewlett-Packard Company Engineering Systems Lab - Fort Collins, CO - Cupertino, CA
More informationIntegrated circuit processing technology offers
Theme Feature A Single-Chip Multiprocessor What kind of architecture will best support a billion transistors? A comparison of three architectures indicates that a multiprocessor on a chip will be easiest
More informationUsing a Serial Cache for. Energy Efficient Instruction Fetching
Using a Serial Cache for Energy Efficient Instruction Fetching Glenn Reinman y Brad Calder z y Computer Science Department, University of California, Los Angeles z Department of Computer Science and Engineering,
More informationOutline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??
Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross
More informationModule 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R A case study in modern microarchitecture.
Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R10000 A case study in modern microarchitecture Overview Stage 1: Fetch Stage 2: Decode/Rename Branch prediction Branch
More informationA Fine-Grain Multithreading Superscalar Architecture
A Fine-Grain Multithreading Superscalar Architecture Mat Loikkanen and Nader Bagherzadeh Department of Electrical and Computer Engineering University of California, Irvine loik, nader@ece.uci.edu Abstract
More informationCISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP
CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer
More informationInstruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov
Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov Dealing With Control Hazards Simplest solution to stall pipeline until branch is resolved and target address is calculated
More informationHyperthreading Technology
Hyperthreading Technology Aleksandar Milenkovic Electrical and Computer Engineering Department University of Alabama in Huntsville milenka@ece.uah.edu www.ece.uah.edu/~milenka/ Outline What is hyperthreading?
More informationEfficient Resource Allocation on a Dynamic Simultaneous Multithreaded Architecture Ortiz-Arroyo, Daniel
Aalborg Universitet Efficient Resource Allocation on a Dynamic Simultaneous Multithreaded Architecture Ortiz-Arroyo, Daniel Publication date: 2006 Document Version Early version, also known as pre-print
More informationArchitectural Differences nc. DRAM devices are accessed with a multiplexed address scheme. Each unit of data is accessed by first selecting its row ad
nc. Application Note AN1801 Rev. 0.2, 11/2003 Performance Differences between MPC8240 and the Tsi106 Host Bridge Top Changwatchai Roy Jenevein risc10@email.sps.mot.com CPD Applications This paper discusses
More informationChapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST
Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism
More informationCS146 Computer Architecture. Fall Midterm Exam
CS146 Computer Architecture Fall 2002 Midterm Exam This exam is worth a total of 100 points. Note the point breakdown below and budget your time wisely. To maximize partial credit, show your work and state
More informationLecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)
Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling
More informationComputer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley
Computer Systems Architecture I CSE 560M Lecture 10 Prof. Patrick Crowley Plan for Today Questions Dynamic Execution III discussion Multiple Issue Static multiple issue (+ examples) Dynamic multiple issue
More informationHigh-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs
High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs October 29, 2002 Microprocessor Research Forum Intel s Microarchitecture Research Labs! USA:
More informationCS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS
CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight
More informationCS 654 Computer Architecture Summary. Peter Kemper
CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining
More informationLecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )
Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14) 1 ILP Limits The perfect processor: Infinite registers (no WAW or WAR hazards) Perfect branch direction and target
More informationAR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors
AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors Computer Sciences Department University of Wisconsin Madison http://www.cs.wisc.edu/~ericro/ericro.html ericro@cs.wisc.edu High-Performance
More informationAdvanced issues in pipelining
Advanced issues in pipelining 1 Outline Handling exceptions Supporting multi-cycle operations Pipeline evolution Examples of real pipelines 2 Handling exceptions 3 Exceptions In pipelined execution, one
More informationA survey of new research directions in microprocessors
Microprocessors and Microsystems 24 (2000) 175 190 www.elsevier.nl/locate/micpro A survey of new research directions in microprocessors J. Šilc a, *, T. Ungerer b, B. Robic c a Computer Systems Department,
More informationEfficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors
Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors Wenun Wang and Wei-Ming Lin Department of Electrical and Computer Engineering, The University
More informationReal-time Scheduling on Multithreaded Processors
Real-time Scheduling on Multithreaded Processors J. Kreuzinger, A. Schulz, M. Pfeffer, Th. Ungerer Institute for Computer Design, and Fault Tolerance University of Karlsruhe D-76128 Karlsruhe, Germany
More informationComparing Multiported Cache Schemes
Comparing Multiported Cache Schemes Smaїl Niar University of Valenciennes, France Smail.Niar@univ-valenciennes.fr Lieven Eeckhout Koen De Bosschere Ghent University, Belgium {leeckhou,kdb}@elis.rug.ac.be
More informationTradeoff between coverage of a Markov prefetcher and memory bandwidth usage
Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Elec525 Spring 2005 Raj Bandyopadhyay, Mandy Liu, Nico Peña Hypothesis Some modern processors use a prefetching unit at the front-end
More informationUnderstanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures
Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures Nagi N. Mekhiel Department of Electrical and Computer Engineering Ryerson University, Toronto, Ontario M5B 2K3
More informationSimultaneous Multithreading Architecture
Simultaneous Multithreading Architecture Virendra Singh Indian Institute of Science Bangalore Lecture-32 SE-273: Processor Design For most apps, most execution units lie idle For an 8-way superscalar.
More informationDynamic Branch Prediction
#1 lec # 6 Fall 2002 9-25-2002 Dynamic Branch Prediction Dynamic branch prediction schemes are different from static mechanisms because they use the run-time behavior of branches to make predictions. Usually
More information250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019
250P: Computer Systems Architecture Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr
More informationValue Compression for Efficient Computation
Value Compression for Efficient Computation Ramon Canal 1, Antonio González 12 and James E. Smith 3 1 Dept of Computer Architecture, Universitat Politècnica de Catalunya Cr. Jordi Girona, 1-3, 08034 Barcelona,
More informationCache Implications of Aggressively Pipelined High Performance Microprocessors
Cache Implications of Aggressively Pipelined High Performance Microprocessors Timothy J. Dysart, Branden J. Moore, Lambert Schaelicke, Peter M. Kogge Department of Computer Science and Engineering University
More informationDynamic Scheduling. CSE471 Susan Eggers 1
Dynamic Scheduling Why go out of style? expensive hardware for the time (actually, still is, relatively) register files grew so less register pressure early RISCs had lower CPIs Why come back? higher chip
More informationSeveral Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining
Several Common Compiler Strategies Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Basic Instruction Scheduling Reschedule the order of the instructions to reduce the
More informationLecture 14: Multithreading
CS 152 Computer Architecture and Engineering Lecture 14: Multithreading John Wawrzynek Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~johnw
More informationReduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction
ISA Support Needed By CPU Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction So far we have dealt with control hazards in instruction pipelines by: 1 2 3 4 Assuming that the branch
More informationComputer Architecture
Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,
More informationAdvanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University
Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:
More informationIntroduction. Summary. Why computer architecture? Technology trends Cost issues
Introduction 1 Summary Why computer architecture? Technology trends Cost issues 2 1 Computer architecture? Computer Architecture refers to the attributes of a system visible to a programmer (that have
More informationExecution-based Prediction Using Speculative Slices
Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers
More informationMemory Systems IRAM. Principle of IRAM
Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several
More informationApplications of Thread Prioritization in SMT Processors
Applications of Thread Prioritization in SMT Processors Steven E. Raasch & Steven K. Reinhardt Electrical Engineering and Computer Science Department The University of Michigan 1301 Beal Avenue Ann Arbor,
More informationDDM-CMP: Data-Driven Multithreading on a Chip Multiprocessor
DDM-CMP: Data-Driven Multithreading on a Chip Multiprocessor Kyriakos Stavrou, Paraskevas Evripidou, and Pedro Trancoso Department of Computer Science, University of Cyprus, 75 Kallipoleos Ave., P.O.Box
More informationMainstream Computer System Components CPU Core 2 GHz GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation
Mainstream Computer System Components CPU Core 2 GHz - 3.0 GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation One core or multi-core (2-4) per chip Multiple FP, integer
More informationMultithreaded Value Prediction
Multithreaded Value Prediction N. Tuck and D.M. Tullesn HPCA-11 2005 CMPE 382/510 Review Presentation Peter Giese 30 November 2005 Outline Motivation Multithreaded & Value Prediction Architectures Single
More informationDynamic Instruction Scheduling For Microprocessors Having Out Of Order Execution
Dynamic Instruction Scheduling For Microprocessors Having Out Of Order Execution Suresh Kumar, Vishal Gupta *, Vivek Kumar Tamta Department of Computer Science, G. B. Pant Engineering College, Pauri, Uttarakhand,
More informationChapter 4. The Processor
Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified
More informationMainstream Computer System Components
Mainstream Computer System Components Double Date Rate (DDR) SDRAM One channel = 8 bytes = 64 bits wide Current DDR3 SDRAM Example: PC3-12800 (DDR3-1600) 200 MHz (internal base chip clock) 8-way interleaved
More informationARCHITECTURAL APPROACHES TO REDUCE LEAKAGE ENERGY IN CACHES
ARCHITECTURAL APPROACHES TO REDUCE LEAKAGE ENERGY IN CACHES Shashikiran H. Tadas & Chaitali Chakrabarti Department of Electrical Engineering Arizona State University Tempe, AZ, 85287. tadas@asu.edu, chaitali@asu.edu
More information15-740/ Computer Architecture Lecture 12: Issues in OoO Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/7/2011
15-740/18-740 Computer Architecture Lecture 12: Issues in OoO Execution Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/7/2011 Reviews Due next Monday Mutlu et al., Runahead Execution: An Alternative
More informationLimiting the Number of Dirty Cache Lines
Limiting the Number of Dirty Cache Lines Pepijn de Langen and Ben Juurlink Computer Engineering Laboratory Faculty of Electrical Engineering, Mathematics and Computer Science Delft University of Technology
More informationA Mechanism for Verifying Data Speculation
A Mechanism for Verifying Data Speculation Enric Morancho, José María Llabería, and Àngel Olivé Computer Architecture Department, Universitat Politècnica de Catalunya (Spain), {enricm, llaberia, angel}@ac.upc.es
More information