A Superscalar RISC Processor with 160 FPRs for Large Scale Scientific Processing

Kentaro Shimada *1, Tatsuya Kawashimo *1, Makoto Hanawa *1, Ryo Yamagata *2, and Eiki Kamada *2
*1 Central Research Laboratory, Hitachi, Ltd.
*2 General Purpose Computer Division, Hitachi, Ltd.

Abstract

We have developed a superscalar RISC processor for the super technical server HITACHI SR8000. The processor includes architectural features dedicated to scientific applications in which massive amounts of data in the main memory must be processed. These features are 160 floating-point registers (FPRs) and the simultaneous execution of up to 16 prefetch instructions. Any of the 160 FPRs can be accessed by instructions through the slide windowed registers scheme. The execution mechanism for prefetch instructions keeps the out-of-order superscalar processor efficient despite the long latency of the main memory. The processor is manufactured using 0.25-µm CMOS technology. Our evaluation demonstrated that the processor achieves over 3 floating-point operations per cycle and a memory throughput of over 12 bytes per cycle.

1. Introduction

Many supercomputers dedicated to large scale scientific processing have been developed. Some of these computers have used vector processors that have vector registers

and execute vector instructions [4]. Recently, however, RISC processors, especially superscalar RISCs, have demonstrated high performance. With current CMOS technology, RISC processors have also reduced the cost of achieving high performance and have enabled multiprocessor machines with over 1000 processors. We have therefore developed a new superscalar RISC processor for the super technical server HITACHI SR8000, a multiprocessor supercomputer that uses these superscalar RISC processors as its processing engines.

When designing a superscalar RISC processor for supercomputers, two primary requirements of large scale scientific processing become important considerations:

a) Large scale scientific processing requires high performance on floating-point operations.

b) Large scale scientific processing requires high-speed access to extremely large bodies of floating-point data in the main memory.

To ensure high performance on floating-point operations, we used a superpipelined architecture, in which the processor is highly pipelined to raise the instruction throughput, in addition to the superscalar architecture. A superpipelined architecture usually results in long instruction latency, but we have hidden this latency by unrolling loops and by using software pipelining. Because they process large amounts of floating-point data, scientific applications usually contain many loops, and these loops are often very suitable for unrolling or software

pipelining. To exploit unrolling and software pipelining effectively, we provided 160 floating-point registers (FPRs) in the processor's instruction-set architecture.

To enable fast access to large bodies of floating-point data, we made the access throughput of the processor's memory bus as large as 16 bytes per cycle; this throughput is set by the implementation limits of the processor. Large scale scientific processing often uses multi-dimensional arrays to express data. When the elements of such arrays are in the IEEE 8-byte double-precision format, a two-dimensional 1024-by-1024 array, for example, occupies 8 Mbytes. However, the typical size of the on-chip cache of today's RISC processors ranges from 64 Kbytes to 1 Mbyte, and a cache that could hold arrays as large as 8 Mbytes would be too large to implement on a chip. Therefore we designed the processor so that all of the data initially resides in the main memory and is transferred from the main memory to the on-chip cache before it is used in any operation. Figure 1 shows a simple example of matrix addition.

      DO 10 J=1,1024
      DO 10 I=1,1024
      C(I,J) = A(I,J)+B(I,J)
   10 CONTINUE

Figure 1 A Simple Example of Matrix Addition

In Figure 1, the size of matrices A, B, and C is 8 Mbytes each; the elements of matrices A and B are read only once, while the elements of C are written once and never

read. Thus, the execution time of the program depends entirely on the throughput of the main memory access rather than on the throughput of the cache access, or even on that of the floating-point operations. With a main memory access throughput of 16 bytes per cycle, we can achieve 16/24, or about 0.67, floating-point operations per cycle, because 24 bytes of data (two 8-byte values to be read and one 8-byte value to be written) are transferred to or from the main memory for each addition. RISC processors not designed for large scale scientific processing assume, in principle, that all necessary data is in the cache, and are usually not capable of such a large throughput to or from the main memory; the typical throughput of such processors is from 4 to 8 bytes per cycle.

We also had to consider the factor that relates the memory-access performance to the memory-access throughput: the memory-access latency. The memory-access latency is usually very long compared with the processor's cycle time, and it grows as the processor's operating frequency rises. When designing the basic profile of this processor, we anticipated that the memory latency would be a hundred or more processor cycles. If this long memory latency is not hidden, the effective memory-access performance will be degraded despite the high memory-access throughput. Conventional vector-type supercomputers hide the latency by using vector registers, but superscalar RISC processors need another method.
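To make the byte-per-operation reasoning above concrete, the short sketch below computes the throughput-bound performance for the matrix addition of Figure 1. The 24-bytes-per-addition figure and the 16-bytes-per-cycle bus width come from the text; the rest is just arithmetic.

    #include <stdio.h>

    /* Throughput-bound performance estimate for C(I,J) = A(I,J) + B(I,J).
     * Each addition reads two 8-byte values and writes one, so it moves
     * 24 bytes of main-memory traffic per floating-point operation.
     */
    int main(void)
    {
        const double bus_bytes_per_cycle = 16.0;  /* processor memory bus        */
        const double bytes_per_flop      = 24.0;  /* 2 reads + 1 write, 8 B each */

        double flopc = bus_bytes_per_cycle / bytes_per_flop;
        printf("bound: %.2f floating-point operations per cycle\n", flopc);
        /* prints: bound: 0.67 floating-point operations per cycle */
        return 0;
    }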

We solved this memory latency problem by using a software data prefetch scheme. In this scheme, prefetch instructions are inserted at optimized points before and within loops by the compiler; then, at execution time, many of the prefetch instructions are activated to fetch the required data from the main memory into the on-chip cache well before the load instructions read the on-chip cache.

The processor was manufactured using 0.25-µm CMOS technology. The instruction set architecture of the processor is based on the 64-bit PowerPC *1 [5]. In the rest of this paper, we describe the architectural features of the processor for large scale scientific processing, give an overview of the processor, explain the mechanism of the features, and then evaluate and discuss their effect on processor performance.

2. Architectural Features

Because a superscalar processor for large scale scientific processing must be capable of high floating-point performance and of high-throughput access to large data in the main memory, we incorporated the 160 FPRs and a software prefetch scheme with simultaneous execution of 16 prefetch instructions.

To ensure high floating-point performance, we designed the throughput of the processor's floating-point instructions to be two instructions per cycle. The instruction set of the processor includes fused floating-point multiply-add instructions; thus, the processor has a throughput of up to four floating-point operations per cycle. To keep this throughput high, we used a superpipelined architecture in addition to the superscalar one. To enable enough floating-point instructions to be active, we

*1 PowerPC is a trademark of the International Business Machines Corporation.

used 160 FPRs. Processor instructions access this large number of FPRs in two ways: some instructions can designate FPR numbers from 0 to 127 (the others can designate only numbers from 0 to 31), and all instructions that refer to the FPRs use the slide windowed registers scheme [1][3]. The extended instructions that can designate FPR numbers up to 127 include the floating-point arithmetic instructions and the floating-point load and store instructions. The slide windowed registers scheme enables all instructions, including those that can designate FPR numbers only from 0 to 31, to access any of the 160 FPRs.

To increase the throughput of large data access from the main memory, we made the throughput of the processor's memory bus 16 bytes per cycle. We also made the throughput of the processor's on-chip level-one cache twice that of the memory bus, namely 32 bytes per cycle; because prefetched data must be written into the on-chip cache before it is read from the on-chip cache, the on-chip cache throughput must be no less than twice that of the memory bus. Since we designed the load and store instruction throughput to be two instructions per cycle, we provided load-FPR-pair and store-FPR-pair instructions that move 16 bytes of data (two IEEE double-precision floating-point values) into or out of a pair of FPRs, to obtain the full cache throughput with each instruction.

We used the software prefetch scheme to solve the memory latency problem. For this scheme, we made the processor able to execute up to 16 prefetch instructions simultaneously.

On top of this, we have implemented a special superscalar execution method for the prefetch instructions. The latency of a prefetch instruction is the same as the memory latency, a hundred or more cycles, so using the same execution method as for other, short-latency instructions would make the superscalar control facility ineffective. The central idea of our method is that the superscalar facility of the processor issues and then forgets the prefetch instructions; the issued prefetch instructions are tracked by a special-purpose facility called the block transfer buffer. In this way, the processor can effectively execute prefetch instructions whose latency is a hundred cycles or more. The detailed execution mechanism for prefetch instructions is described in section 4.2.

3. Overview of the Processor's Internal Structure

The internal structure of the processor, including the pipeline stages, is shown in Figure 2. The processor has on the chip a 64-Kbyte two-way set-associative instruction cache and a 128-Kbyte four-way set-associative write-through data cache with a block size of 128 bytes. Up to eight instructions are fetched from the instruction cache into the instruction buffer every cycle, in the IF1 and IF2 stages. Instructions are extracted from the instruction buffer and decoded, and their FPR numbers are translated for the slide windowed registers, at the IF3 stage. Then the registers, both the FPRs and the general purpose registers (GPRs), are renamed, and up to five instructions are dispatched in program order to the reservation stations corresponding to their instruction type in the

D1 and D2 stages. There are four reservation stations: FRS for floating-point instructions; ARS for load and store instructions, including prefetch instructions; GRS for fixed-point instructions; and BRS for branch instructions.

[Figure 2: Overview of the Processor's Internal Structure. Block diagram showing the pipeline stages (IF1-IF3, D1-D2, E0-E5, C1-C3), the 64-Kbyte 2-way instruction cache, the 128-Kbyte 4-way data cache (32-byte width), the 32-entry reorder buffer, the reservation stations (FRS, ARS, GRS, BRS), the rename and architectural registers (FRR/FPR, GRR/GPR), the load-store queue, the store buffer, the 16-entry block transfer buffer, and the 16-byte-wide external bus.]

All dispatched instructions are registered with the reorder buffer at the time of dispatch. The reorder buffer has 32 entries that keep track of individual instructions, and it tracks the sequential control flow of the program while instructions are issued from the reservation stations and executed out of order. Specifically, the reorder buffer allocates entries for the dispatched instructions in program order at the D2 stage, and waits for the

instructions to finish their execution out of order. It then puts the results of their execution back into program order and checks, at the C1 stage, whether any exception was reported. After that, at the C2 and C3 stages, the reorder buffer instructs each execution unit to write back data from the rename registers into the architectural registers, such as the FPRs and GPRs, and, in the case of a store instruction, to move the data into the store buffer. In parallel, the reorder buffer releases the entries in which the instructions were registered, at the C2 stage. This process is called completion. The reorder buffer of the processor completes up to six instructions every cycle.

The number of issue and execution stages varies with the kind of instruction. For example, fixed-point add instructions are issued at the E0 stage, their source registers are read at the E1 stage, and they are executed at the E2 stage. Floating-point multiply-add instructions, on the other hand, are issued at the E0 stage, their source registers are read at the E1 stage, and they are executed from the E2 stage through the E5 stage. The results of these instructions are written into the rename registers for the GPRs (GRRs) or those for the FPRs (FRRs). Load and store instructions are also issued at the E0 stage, and their source registers are read at the E1 stage so that their data addresses can be calculated at the E2 stage. The calculated addresses are written into the load-store queue, which resolves address conflicts between load and store instructions that are executed out of order.
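As a rough illustration of the dispatch/complete discipline described above, the sketch below models a 32-entry reorder buffer as a circular queue: entries are allocated in program order, marked finished out of order, and retired in order, up to six per cycle. This is a simplified model for exposition, not the processor's actual logic, and all identifiers are ours.

    #include <stdbool.h>
    #include <stdio.h>

    #define ROB_ENTRIES 32   /* reorder buffer size, from the text    */
    #define COMPLETE_W   6   /* completion width: six inst. per cycle */

    struct rob {
        bool finished[ROB_ENTRIES];
        int  head, tail, count;   /* head = oldest instruction */
    };

    /* Allocate one entry at dispatch (D2 stage); returns entry id or -1. */
    int rob_dispatch(struct rob *r)
    {
        if (r->count == ROB_ENTRIES)
            return -1;                      /* dispatch stalls when full */
        int id = r->tail;
        r->finished[id] = false;
        r->tail = (r->tail + 1) % ROB_ENTRIES;
        r->count++;
        return id;
    }

    /* An execution unit reports out-of-order finish for entry `id`. */
    void rob_finish(struct rob *r, int id) { r->finished[id] = true; }

    /* C1-C3 stages: retire up to COMPLETE_W oldest finished entries,
     * strictly in program order; returns how many completed this cycle. */
    int rob_complete(struct rob *r)
    {
        int n = 0;
        while (n < COMPLETE_W && r->count > 0 && r->finished[r->head]) {
            r->head = (r->head + 1) % ROB_ENTRIES;
            r->count--;
            n++;
        }
        return n;
    }

    int main(void)
    {
        struct rob r = { .head = 0, .tail = 0, .count = 0 };
        int a = rob_dispatch(&r), b = rob_dispatch(&r);
        rob_finish(&r, b);                            /* younger finishes first */
        printf("completed: %d\n", rob_complete(&r));  /* 0: a not done yet      */
        rob_finish(&r, a);
        printf("completed: %d\n", rob_complete(&r));  /* 2: a then b, in order  */
        return 0;
    }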

Load instructions then refer to the data cache in the E3 and E4 stages, and send the required data to the GRRs or FRRs at the E5 stage. Store instructions wait for the reorder buffer to instruct them to put the required data into the store buffer and to complete them. A prefetch instruction puts its data address into the block transfer buffer and waits for the reorder buffer to instruct it to fetch from the main memory, as described later.

[Figure 3: Processor Chip Photograph, with the major blocks labeled: floating-point unit, reorder buffer, FPRs, instruction control, fixed-point unit, GPRs, branch unit, load-store unit, data cache, instruction cache, and bus control.]

Figure 3 is a photograph of the processor chip on which the major blocks, such as the reorder buffer and the FPRs, are visible.

4. Mechanism of the Architectural Features

4.1 Slide Windowed Registers for 160 FPRs

Slide windowed registers were originally developed for a two-way in-order superscalar processor [1]. We have modified them for use in a processor with an out-of-order superscalar and superpipelined architecture. The number of the physical FPR (the physical register number PN) is translated from the register number designated in the instruction (the logical register number LN) with the following formula:

    PN = (LN + SWBS) mod 128    (GN <= LN)
    PN = LN                     (0 <= LN < GN)

where GN is the number of registers in the global part (the part that does not slide), and SWBS is the slide window base value, which is held in the special purpose register SWSW (slide window status word). GN is selected from 4, 8, 16, or 32, and SWBS is an even number from 0 to 126. Figure 4 shows an example in which GN is 16 and SWBS is 2.

[Figure 4: An Example of Mapping through the Slide Windowed Registers (SWBS = 2, GN = 16). The diagram maps logical register numbers (LN) to physical register numbers (PN), distinguishing the slide part from the global part, and marking which LNs any instruction can designate and which only an extended instruction can designate.]
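The mapping can be written directly as a small function. The sketch below implements the translation formula above; the choice of C and all identifier names are ours, and since corner cases (such as SWBS values that would map a slide-part LN onto the global part) are not spelled out in the text, the code follows the published formula literally.

    #include <stdio.h>

    /* Translate a logical FPR number (LN) to a physical FPR number (PN)
     * per the slide windowed registers formula in section 4.1:
     *   PN = (LN + SWBS) mod 128   for GN <= LN
     *   PN = LN                    for 0 <= LN < GN
     * GN is one of {4, 8, 16, 32}; SWBS is an even number in 0..126.
     */
    unsigned translate_fpr(unsigned ln, unsigned swbs, unsigned gn)
    {
        if (ln < gn)
            return ln;                 /* global part: identity mapping */
        return (ln + swbs) % 128;      /* slide part: shifted by SWBS   */
    }

    int main(void)
    {
        /* Figure 4's example: GN = 16, SWBS = 2. */
        printf("LN 5   -> PN %u\n", translate_fpr(5, 2, 16));    /* 5  */
        printf("LN 16  -> PN %u\n", translate_fpr(16, 2, 16));   /* 18 */
        printf("LN 127 -> PN %u\n", translate_fpr(127, 2, 16));  /* 1  */
        return 0;
    }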

To slide the windowed registers smoothly, we introduced a special instruction, SSWSTP (slide sliding window step). The SSWSTP instruction changes the SWBS value in the SWSW register, thereby sliding the windowed registers in the slide part. This instruction is executed speculatively at the IF3 stage so that the immediately following instructions can use the changed SWBS value for their FPR number translation; as a result, SSWSTP executes with no penalty. A special mechanism recovers speculatively changed SWBS values when the speculation fails because of a branch prediction miss; to speed up the recovery, this mechanism includes buffers that record the SWBS value at the dispatch of each branch instruction.

4.2 Execution Mechanism for the Prefetch Instruction

The execution mechanism for the prefetch instruction is designed for a main-memory latency so long that the superscalar facility of the processor, namely the reorder buffer, cannot keep track of the prefetch instructions effectively. If the reorder buffer kept track of the prefetch instructions, one of its entries would stay occupied until the prefetched data returned from the main memory, which could take more than one hundred cycles. Consequently, the number of entries required to sustain the instruction throughput would be a hundred times the number of active instructions; in other words, the reorder buffer would need a few hundred entries, which is not feasible.

In the execution mechanism of the processor, a prefetch instruction is tracked by a special facility called the block transfer buffer instead of by the reorder buffer (Figure 2). When a prefetch instruction is executed, only its address calculation is performed, and the calculated result is placed into the block transfer buffer. Then, after checking for address exceptions, the reorder buffer instructs the prefetch instruction to fetch the required data from the main memory, and frees the entry for the prefetch instruction without waiting for the data to return. While the instruction is fetching the data from the main memory, functions such as address-conflict checks are handled by the block transfer buffer. The block transfer buffer has 16 entries, so up to 16 prefetch instructions can be active simultaneously. When the required data returns from the main memory, it is written into the on-chip data cache according to the data address registered in the block transfer buffer.

This execution mechanism for the prefetch instruction enables the processor to effectively handle a main memory latency a hundred cycles long without requiring a reorder buffer of a few hundred entries. Because the main memory throughput of the processor is 16 bytes per cycle, a block transfer buffer of 16 entries can cover a main memory latency of up to 128 cycles (= the on-chip cache block size of 128 bytes x 16 active prefetches / the memory throughput of 16 bytes per cycle).
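The 128-cycle bound generalizes: with E outstanding prefetches of B-byte cache blocks, the sustainable memory throughput at a latency of L cycles is min(peak, E x B / L). A minimal sketch of this bound, using the paper's parameters (16 entries, 128-byte blocks, 16-byte-per-cycle peak), is shown below; it reproduces the 128-cycle break-even point here and the throughput limits discussed later in section 5.3.

    #include <stdio.h>

    /* Sustainable main-memory throughput with a fixed number of
     * outstanding prefetches: min(peak, entries * block / latency).
     */
    int main(void)
    {
        const double entries = 16.0, block = 128.0, peak = 16.0;
        const int latencies[] = { 50, 100, 128, 150, 200 };

        for (int i = 0; i < 5; i++) {
            double bound = entries * block / latencies[i];
            double eff   = bound < peak ? bound : peak;
            printf("latency %3d cycles -> throughput limit %5.2f B/cycle\n",
                   latencies[i], eff);
        }
        /* At 128 cycles the limit equals the 16 B/cycle peak; at 200
         * cycles it drops to 10.24 B/cycle, as noted in section 5.3. */
        return 0;
    }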

5. Evaluation

5.1 Evaluation Environment

To evaluate the processor architecture, we ran logic-design simulations using the logic design data of the actual processor. Since the evaluation environment had to let us identify the processor's bottlenecks with sufficient accuracy, we excluded bottlenecks outside the processor, such as the overhead of main memory control, refresh control, and so on, and idealized the memory system so that the processor bottleneck would show clearly in the performance numbers. To ensure accuracy, we used the actual design file of the processor. The simulator used in this environment was developed mainly to verify the logic design of the processor, and its results are identical to the real cycle-by-cycle behavior of the processor.

5.2 Effect of the Large Number of FPRs

First, we evaluated the effect of the large number of FPRs. We used the calculation of the inner product of two arrays as a sample program; its source program is shown in Figure 5. To evaluate the effect of the number of FPRs, we built four model codes from this source program, using 6, 20, 40, or 80 FPRs.

      DO 10 I=1,N
      S = S + A(I) * B(I)
   10 CONTINUE

Figure 5 Source Program of the Inner Product

Figure 6 shows the results for the four codes under the conditions that all

necessary data was in the cache, with a fixed loop length. The performance is expressed in floating-point operations per cycle (FLOPC), and the instruction throughput is expressed in instructions per cycle (IPC).

[Figure 6: Effect of the Number of FPRs. Performance (floating-point operations per cycle) and instruction throughput (instructions per cycle) plotted against the number of FPRs used.]

When the number of FPRs was small, not enough instructions became active to hide the latency of the processor pipelines, and the performance was low. As the number of FPRs increased, the performance rose, reaching 3.15 FLOPC, which is 78.7% of the ideal performance (4 FLOPC), for the code using 80 FPRs. Thus, for this simple inner-product program, 80 FPRs were enough to obtain fairly good performance despite the latency of the highly pipelined instruction execution.
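To see why additional FPRs help, consider the kind of transformation the model codes presumably applied: unrolling the loop of Figure 5 and accumulating into several independent registers lets many multiply-adds be in flight at once. The C sketch below is our own illustration of that transformation, not the actual model code.

    /* Inner product with 8-way unrolling and independent accumulators.
     * Each s[k] carries its own dependence chain, so up to eight
     * multiply-adds can overlap in the pipeline instead of serializing
     * through one register; more accumulators (hence more registers)
     * hide more pipeline latency.  Illustrative sketch only; n is
     * assumed to be a multiple of 8 for brevity.
     */
    double inner_product_unrolled(const double *a, const double *b, int n)
    {
        double s[8] = { 0.0 };
        for (int i = 0; i < n; i += 8)
            for (int k = 0; k < 8; k++)
                s[k] += a[i + k] * b[i + k];
        return s[0] + s[1] + s[2] + s[3] + s[4] + s[5] + s[6] + s[7];
    }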

For more complex and larger programs, the number of FPRs required for good performance rises; for example, the matrix multiplication program used in the following section required 104 FPRs.

5.3 Memory Throughput and Effect of Prefetch

Next, we evaluated the processing speed and the memory throughput versus the main memory latency, and the effect of the software prefetch scheme. For this evaluation, one sample program was the 80-FPR model code of the inner product from the previous evaluation, with prefetch instructions inserted. We also prepared two versions of matrix multiplication: one in which all the required data was assumed to be in the cache (the version with no prefetch instructions inserted) and one with prefetch instructions inserted. The source program of the matrix multiplication is shown in Figure 7; its model code was produced by unrolling the outermost loop 16 times, the middle loop 2 times, and the innermost loop 2 times.

      DO 10 I=1,N
      DO 10 J=1,N
      S = 0
      DO 20 K=1,N
      S = S + A(I,K) * B(K,J)
   20 CONTINUE
      C(I,J) = S
   10 CONTINUE

Figure 7 The Source Program of Matrix Multiplication

In the case of the matrix multiplication, the loop unrollings reduced the amount of data that had to be transferred from the main memory per floating-point operation. If no unrolling were applied, the amount of data for each

floating-point operation would be 8 bytes (one floating-point value), the same as for the inner-product sample program of Figure 5. With the unrollings, the amount of data is reduced to 2.25 bytes (9/32 of a floating-point value) per operation, because 36 floating-point values (16 x 2 A(I,K)s and 2 x 2 B(K,J)s) are read once and used for 128 floating-point operations (64 floating-point multiplications and 64 floating-point additions). In terms of memory throughput, the matrix multiplication code without unrolling and the four inner-product codes need a memory throughput of 32 bytes per cycle to reach the ideal performance of 4 FLOPC, whereas the unrolled matrix multiplication code needs only 9 bytes per cycle. Moreover, with the outermost loop unrolled 16 times, the unrolled data access size becomes 128 bytes (= 8 bytes x 16). This is equal to the on-chip cache block size, so the prefetch instruction can be applied very effectively.

The performance and the memory throughput of the inner product and of the matrix multiplication are shown in Figure 8 for the all-in-cache case (the version with no prefetch instructions inserted) and for main memory latencies of 50, 80, 100, 150, and 200 cycles.

For the inner product, the performance of 3.15 FLOPC and the memory throughput of 25.0 bytes per cycle in the all-in-cache case fell by about half when the data was out of the cache and prefetch instructions were used to fetch it from the main memory.
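The bytes-per-operation figures above follow directly from the unrolling factors. The short sketch below reproduces them; the unroll factors (16, 2, 2) and the 8-byte element size are from the text, and the code is just the arithmetic, not a model of the generated code.

    #include <stdio.h>

    /* Memory traffic per floating-point operation for the matrix
     * multiplication of Figure 7, unrolled 16 (I) x 2 (J) x 2 (K).
     * Per unrolled iteration: 16*2 A values and 2*2 B values are read,
     * and 16*2*2 multiply-adds (two flops each) are performed.
     */
    int main(void)
    {
        const double elem  = 8.0;                   /* IEEE double, bytes */
        const double reads = 16 * 2 + 2 * 2;        /* 36 values          */
        const double flops = 16 * 2 * 2 * 2.0;      /* 128 operations     */

        double bytes_per_flop = reads * elem / flops;   /* 2.25 bytes     */
        printf("traffic: %.2f bytes per flop\n", bytes_per_flop);
        printf("needed at 4 FLOPC: %.0f bytes per cycle\n",
               bytes_per_flop * 4.0);                   /* 9 bytes/cycle  */
        return 0;
    }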

[Figure 8: Performance vs. Main Memory Latency. Two panels, (a) Inner Product and (b) Matrix Multiplication, plot performance (floating-point operations per cycle) and memory throughput (bytes per cycle) against main memory latency in cycles, from the in-cache case up to 200 cycles, with reference lines at the peak cache throughput of 32 bytes/cycle, the peak main memory throughput of 16 bytes/cycle, and the memory throughput limited by the main memory latency.]

This is because the memory throughput of the processor was 16 bytes per cycle, whereas the memory throughput required to achieve 4 FLOPC was 32 bytes per cycle. Nevertheless, for main memory latencies from 50 to 100 cycles, the performance degradation was very small because the execution mechanism of the prefetch instruction worked well. In Figure 8, the effective memory throughput for a main memory latency of 50 cycles was 12.2 bytes per cycle, 76.0% of the 16-byte-per-cycle peak; for a latency of 100 cycles it was 11.4 bytes per cycle, 71.4% of the peak. For a main memory latency of 150 or 200 cycles, the memory throughput could not be sustained and the performance was degraded.

In the matrix multiplication, 3.78 FLOPC (94.6% of the peak performance of 4 FLOPC) was achieved in the all-in-cache case. The required memory throughput of 9 bytes per cycle was much less than the memory throughput of the processor, and also much less than the throughput limit imposed by the main memory latency, so the performance was not heavily degraded for main memory latencies of 50, 80, 100, or even 150 cycles. In all of these cases the performance was more than 3 FLOPC, and the mechanism of the prefetch instruction worked well. With a main memory latency of 200 cycles, the memory throughput was limited to 10.24 bytes per cycle (= the 128-byte cache block size x 16 active prefetches / the latency of 200 cycles), which is close to the required throughput, and the performance fell to 2.65 FLOPC.

6. Conclusion

We have developed a superscalar RISC processor for large scale scientific processing. Our goals were high performance on floating-point operations and maximal throughput for large data access from the main memory. To increase floating-point performance, we equipped the processor with a large number of FPRs, and introduced extended instructions that can designate FPR numbers from 0 to 127 as well as slide windowed registers, hiding the latency of the processor pipeline. To maximize the effective throughput of large data access, we used a software prefetch scheme and implemented a special superscalar execution method for the prefetch instructions. Our evaluation demonstrated that the large number of FPRs improved performance

significantly, and that the execution mechanism of the prefetch instructions worked well despite the long latency of the main memory access.

References

[1] Shimamura, K., Tanaka, S., Shimomura, T., Hotta, T., Kamada, E., Sawamoto, H., Shimizu, T., and Nakazawa, K.: A Superscalar RISC Processor with Pseudo Vector Processing Feature, Proc. of the International Conference on Computer Design '95, IEEE (1995)

[2] Nakazawa, K., Nakamura, H., Imori, H., and Kawabe, S.: Pseudo Vector Processor based on Register-Windowed Superscalar Pipeline, Proc. of Supercomputing '92, IEEE (1992)

[3] Nakamura, H., Imori, H., Nakazawa, K., Boku, T., Nakata, I., Yamashita, Y., Wada, H., and Inagami, Y.: A Scalar Architecture for Pseudo Vector Processing based on Slide-Windowed Registers, Proc. of the International Conference on Supercomputing '93, ACM (1993)

[4] Kitai, K., Isobe, T., Tanaka, Y., Tamaki, Y., Fukagawa, M., Tanaka, T., and Inagami, Y.: Parallel Processing Architecture for the Hitachi S-3800 Shared-Memory Vector Multiprocessor, Proc. of the International Conference on Supercomputing '93, ACM (1993)

[5] May, C., et al. (eds.): The PowerPC Architecture: A Specification for a New Family of RISC Processors, Second Edition, Morgan Kaufmann Publishers, Inc. (1994)


More information

Assuming ideal conditions (perfect pipelining and no hazards), how much time would it take to execute the same program in: b) A 5-stage pipeline?

Assuming ideal conditions (perfect pipelining and no hazards), how much time would it take to execute the same program in: b) A 5-stage pipeline? 1. Imagine we have a non-pipelined processor running at 1MHz and want to run a program with 1000 instructions. a) How much time would it take to execute the program? 1 instruction per cycle. 1MHz clock

More information

Keywords and Review Questions

Keywords and Review Questions Keywords and Review Questions lec1: Keywords: ISA, Moore s Law Q1. Who are the people credited for inventing transistor? Q2. In which year IC was invented and who was the inventor? Q3. What is ISA? Explain

More information

A superscalar machine is one in which multiple instruction streams allow completion of more than one instruction per cycle.

A superscalar machine is one in which multiple instruction streams allow completion of more than one instruction per cycle. CS 320 Ch. 16 SuperScalar Machines A superscalar machine is one in which multiple instruction streams allow completion of more than one instruction per cycle. A superpipelined machine is one in which a

More information

Architectures for Instruction-Level Parallelism

Architectures for Instruction-Level Parallelism Low Power VLSI System Design Lecture : Low Power Microprocessor Design Prof. R. Iris Bahar October 0, 07 The HW/SW Interface Seminar Series Jointly sponsored by Engineering and Computer Science Hardware-Software

More information

Metodologie di Progettazione Hardware-Software

Metodologie di Progettazione Hardware-Software Metodologie di Progettazione Hardware-Software Advanced Pipelining and Instruction-Level Paralelism Metodologie di Progettazione Hardware/Software LS Ing. Informatica 1 ILP Instruction-level Parallelism

More information

16.10 Exercises. 372 Chapter 16 Code Improvement. be translated as

16.10 Exercises. 372 Chapter 16 Code Improvement. be translated as 372 Chapter 16 Code Improvement 16.10 Exercises 16.1 In Section 16.2 we suggested replacing the instruction r1 := r2 / 2 with the instruction r1 := r2 >> 1, and noted that the replacement may not be correct

More information

Superscalar Organization

Superscalar Organization Superscalar Organization Nima Honarmand Instruction-Level Parallelism (ILP) Recall: Parallelism is the number of independent tasks available ILP is a measure of inter-dependencies between insns. Average

More information

Website for Students VTU NOTES QUESTION PAPERS NEWS RESULTS

Website for Students VTU NOTES QUESTION PAPERS NEWS RESULTS Advanced Computer Architecture- 06CS81 Hardware Based Speculation Tomasulu algorithm and Reorder Buffer Tomasulu idea: 1. Have reservation stations where register renaming is possible 2. Results are directly

More information

Techniques for Mitigating Memory Latency Effects in the PA-8500 Processor. David Johnson Systems Technology Division Hewlett-Packard Company

Techniques for Mitigating Memory Latency Effects in the PA-8500 Processor. David Johnson Systems Technology Division Hewlett-Packard Company Techniques for Mitigating Memory Latency Effects in the PA-8500 Processor David Johnson Systems Technology Division Hewlett-Packard Company Presentation Overview PA-8500 Overview uction Fetch Capabilities

More information

ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti

ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and Jim Smith Pipelining to Superscalar Forecast Real

More information