SUPERSCALAR AND VLIW PROCESSORS

Similar documents
Superscalar Processors Ch 13. Superscalar Processing (5) Computer Organization II 10/10/2001. New dependency for superscalar case? (8) Name dependency

Superscalar Processors Ch 14

Superscalar Processing (5) Superscalar Processors Ch 14. New dependency for superscalar case? (8) Output Dependency?

Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

What is Superscalar? CSCI 4717 Computer Architecture. Why the drive toward Superscalar? What is Superscalar? (continued) In class exercise

Pipeline Architecture RISC

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 14 Instruction Level Parallelism and Superscalar Processors

TECH. CH14 Instruction Level Parallelism and Superscalar Processors. What is Superscalar? Why Superscalar? General Superscalar Organization

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Hardware-based Speculation

Processor: Superscalars Dynamic Scheduling

Dynamic Scheduling. CSE471 Susan Eggers 1

REDUCED INSTRUCTION SET COMPUTERS (RISC)

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

Architectures & instruction sets R_B_T_C_. von Neumann architecture. Computer architecture taxonomy. Assembly language.

Computer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

Superscalar Machines. Characteristics of superscalar processors

Organisasi Sistem Komputer

5008: Computer Architecture

RISC & Superscalar. COMP 212 Computer Organization & Architecture. COMP 212 Fall Lecture 12. Instruction Pipeline no hazard.

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation

Case Study IBM PowerPC 620

Static vs. Dynamic Scheduling

E0-243: Computer Architecture

Lecture 19: Instruction Level Parallelism

Advanced Computer Architecture

Chapter. Out of order Execution

Vorlesung / Course IN2075: Mikroprozessoren / Microprocessors

Advanced processor designs

CS252 Graduate Computer Architecture Lecture 6. Recall: Software Pipelining Example

Pipelining and Vector Processing

Getting CPI under 1: Outline

ECE 341. Lecture # 15

Superscalar Processors

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

Review: Compiler techniques for parallelism Loop unrolling Ÿ Multiple iterations of loop in software:

COSC 6385 Computer Architecture - Pipelining

Multi-cycle Instructions in the Pipeline (Floating Point)

Handout 2 ILP: Part B

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25

CPE 631 Lecture 11: Instruction Level Parallelism and Its Dynamic Exploitation

Page 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution

This Set. Scheduling and Dynamic Execution Definitions From various parts of Chapter 4. Description of Three Dynamic Scheduling Methods

Hardware-Based Speculation

Superscalar Processors

This Set. Scheduling and Dynamic Execution Definitions From various parts of Chapter 4. Description of Two Dynamic Scheduling Methods

Load1 no Load2 no Add1 Y Sub Reg[F2] Reg[F6] Add2 Y Add Reg[F2] Add1 Add3 no Mult1 Y Mul Reg[F2] Reg[F4] Mult2 Y Div Reg[F6] Mult1

Copyright 2012, Elsevier Inc. All rights reserved.

Multiple Instruction Issue. Superscalars

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)

Lecture 5: Instruction Pipelining. Pipeline hazards. Sequential execution of an N-stage task: N Task 2

Preventing Stalls: 1

Course on Advanced Computer Architectures

SISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version:

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

Instruction-Level Parallelism (ILP)

CS152 Computer Architecture and Engineering. Complex Pipelines

Basic Computer Architecture

High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Advanced Computer Architecture CMSC 611 Homework 3. Due in class Oct 17 th, 2012

EECC551 Exam Review 4 questions out of 6 questions

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

Computer Architecture

EEC 581 Computer Architecture. Instruction Level Parallelism (3.6 Hardware-based Speculation and 3.7 Static Scheduling/VLIW)

Advanced Computer Architecture

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

EECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?)

Instruction Level Parallelism

Lecture 26: Parallel Processing. Spring 2018 Jason Tang

COSC4201 Instruction Level Parallelism Dynamic Scheduling

EITF20: Computer Architecture Part3.2.1: Pipeline - 3

Good luck and have fun!

EE 4683/5683: COMPUTER ARCHITECTURE

EEC 581 Computer Architecture. Lec 7 Instruction Level Parallelism (2.6 Hardware-based Speculation and 2.7 Static Scheduling/VLIW)

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S

Lecture 7: Pipelining Contd. More pipelining complications: Interrupts and Exceptions

Advanced issues in pipelining

ECE 587 Advanced Computer Architecture I

Parallelism. Execution Cycle. Dual Bus Simple CPU. Pipelining COMP375 1

ELEC 5200/6200 Computer Architecture and Design Fall 2016 Lecture 9: Instruction Level Parallelism

Chapter 4 The Processor 1. Chapter 4D. The Processor

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation

PowerPC 620 Case Study

LSU EE 4720 Dynamic Scheduling Study Guide Fall David M. Koppelman. 1.1 Introduction. 1.2 Summary of Dynamic Scheduling Method 3

CISC 662 Graduate Computer Architecture. Lecture 10 - ILP 3

CMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3. Complications With Long Instructions

Donn Morrison Department of Computer Science. TDT4255 ILP and speculation

Announcements. ECE4750/CS4420 Computer Architecture L11: Speculative Execution I. Edward Suh Computer Systems Laboratory

Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro)

Superscalar Processor Design

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

CS 2410 Mid term (fall 2015) Indicate which of the following statements is true and which is false.

CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions

Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.

Adapted from David Patterson s slides on graduate computer architecture

Processor (IV) - advanced ILP. Hwansoo Han

Main Points of the Computer Organization and System Software Module

Transcription:

Datorarkitektur I Fö 10-1 Datorarkitektur I Fö 10-2 What is a Superscalar Architecture? SUPERSCALAR AND VLIW PROCESSORS A superscalar architecture is one in which several instructions can be initiated simultaneously and executed independently. 1. What is a Superscalar Architecture? 3. Features of Superscalar Architectures 4. Data Dependencies 5. Policies for Parallel Instruction Execution Pipelining allows several instructions to be executed at the same time, but they have to be in different pipeline stages at a given moment. Superscalar architectures include all features of pipelining but, in addition, there can be several instructions executing simultaneously in the same pipeline stage. 6. Register Renaming 7. VLIW Processors Datorarkitektur I Fö 10-3 Datorarkitektur I Fö 10-4 What is a Superscalar Architecture? (cont d) Superscalar Architectures Pipelined execution Clock cycle Instr. i Instr. i+1 Instr. i+2 Instr. i+3 Instr. i+4 Instr. i+5 1 2 3 4 5 6 7 8 FI DI CO FO EI WO FI 9 10 11 DI CO FO EI WO Superscalar architectures allow several instructions to be issued and completed per clock cycle. A superscalar architecture consists of a number of pipelines that are working in parallel. Depending on the number and kind of parallel units available, a certain number of instructions can be executed in parallel. Superscalar execution In the following example a floating point and two integer operations can be issued and executed simultaneously; each unit is pipelined and can execute several operations in different pipeline stages. Clock cycle 1 2 3 4 5 6 7 8 9 10 11 Instr. i Instr. i+1 Instr. i+2 Instr. i+3 Instr. i+4 Instr. i+5 FI FI DI CO FO EI WO DI CO FO EI WO

Datorarkitektur I Fö 10-5 Datorarkitektur I Fö 10-6 Memory Limitations on Parallel Execution Superscalar Architectures (cont d) Instr. buffer Fetch & Addr. calc. & Branch pred. Decode & Rename & Dispatch Floating point unit Integer unit Instruction issuing Instr. window (queues, reservation stations, etc.) Integer unit Register Files Commit The situations which prevent instructions to be executed in parallel by a superscalar architecture are very similar to those which prevent an efficient execution on any pipelined architecture (see pipeline hazards - lecture 6/7). The consequences of these situations on superscalar architectures are more severe than those on simple pipelines, because the potential of parallelism in superscalars is greater and, thus, a greater opportunity is lost. Instruction cache Datorarkitektur I Fö 10-7 Datorarkitektur I Fö 10-8 Limitations on Parallel Execution (cont d) Three categories of limitations have to be considered: 1. Resource conflicts: - They occur if two or more instructions compete for the same resource (register, memory, functional unit) at the same time; they are similar to structural hazards discussed with pipelines. Introducing several parallel pipelined units, superscalar architectures try to reduce a part of possible resource conflicts. 2. Control (procedural) dependency: - The presence of branches creates major problems in assuring an optimal parallelism. How to reduce branch penalties has been discussed in lectures 7&8. - If instructions are of variable length, they cannot be fetched and issued in parallel; an instruction has to be decoded in order to identify the following one and to fetch it superscalar techniques are efficiently applicable to RISCs, with fixed instruction length and format. Data Dependencies Three types of data dependencies can be identified: 1. True data dependency 2. Output dependency 3. Antidependency 3. Data conflicts: - Data conflicts are produced by data dependencies between instructions in the program. Because superscalar architectures provide a great liberty in the order in which instructions can be issued and completed, data dependencies have to be considered with much attention.

Datorarkitektur I Fö 10-9 Datorarkitektur I Fö 10-10 True Data Dependency Output Dependency True data dependency exists when the output of one instruction is required as an input to a subsequent instruction: ADD R2,R4,R5 R2 R4 + R5 An output dependency exists if two instructions are writing into the same location; if the second instruction writes before the first one, an error occurs: ADD R4,R2,R5 R4 R2 + R5 True data dependencies are intrinsic features of the user s program. They cannot be eliminated by compiler or hardware techniques. True data dependencies have to be detected and treated: the addition above cannot be executed before the result of the multiplication is available. - The simplest solution is to stall the adder until the multiplier has finished. - In order to avoid the adder to be stalled, the compiler or hardware can find other instructions which can be executed by the adder until the result of the multiplication is available. Antidependency An antidependency exists if an instruction uses a location as an operand while a following one is writing into that location; if the first one is still using the location when the second one writes into it, an error occurs: ADD R3,R2,R5 R3 R2 + R5 Datorarkitektur I Fö 10-11 Datorarkitektur I Fö 10-12 The Nature of Output Dependency and Antidependency Policies for Parallel Instruction Execution Output dependencies and antidependencies are not intrinsic features of the executed program; they are not real data dependencies but storage conflicts. Output dependencies and antidependencies are only the consequence of the manner in which the programmer or the compiler are using registers (or memory locations). They are produced by the competition of several instructions for the same register. In the previous examples the conflicts are produced only because: - the output dependency: R4 is used by both instructions to store the result; - the antidependency: R3 is used by the second instruction to store the result; The examples could be written without dependencies by using additional registers: The ability of a superscalar processor to execute instructions in parallel is determined by: 1. the number and nature of parallel pipelines (this determines the number and nature of instructions that can be fetched and executed at the same time); 2. the mechanism that the processor uses to find independent instructions (instructions that can be executed in parallel). The policies used for instruction execution are characterized by the following two factors: and ADD R7,R2,R5 R7 R2 + R5 ADD R6,R2,R5 R6 R2 + R5 1. the order in which instructions are issued for execution; 2. the order in which instructions are completed (they write results into registers and memory locations).

Datorarkitektur I Fö 10-13 Datorarkitektur I Fö 10-14 Policies for Parallel Instruction Execution (cont d) Policies for Parallel Instruction Execution (cont d) The simplest policy is to execute and complete instructions in their sequential order. This, however, gives little chances to find instructions which can be executed in parallel. In order to improve parallelism the processor has to look ahead and try to find independent instructions to execute in parallel. Example We consider the superscalar architecture: Two instructions can be fetched and decoded at a time; Three functional units can work in parallel: floating point unit, integer adder, integer multiplier; Two instructions can be written back (completed) at a time; We consider the following instruction sequence: Instructions will be executed in an order different from the strictly sequential one, with the restriction that the result must be correct. Execution policies: 1. In-order issue with in-order completion. 2. In-order issue with out-of-order completion. 3. Out-of-order issue with out-of-order completion. I1: ADDF R12,R13,R14 R12 R13 + R14 (float. pnt.) I2: ADD R1,R8,R9 R1 R8 + R9 I3: MUL R4,R2,R3 R4 R2 * R3 I4: MUL R5,R6,R7 R5 R6 * R7 I5: ADD R10,R5,R7 R10 R5 + R7 I6: ADD R11,R2,R3 R11 R2 + R3 I1 requires two cycles to execute; I3 and I4 are in conflict for the same functional unit; I5 depends on the value produced by I4 (we have a true data dependency between I4 and I5); I2, I5 and I6 are in conflict for the same functional unit; Datorarkitektur I Fö 10-15 Datorarkitektur I Fö 10-16 In-Order Issue with In-Order Completion Instructions are issued in the exact order that would correspond to sequential execution; results are written (completion) in the same order. In-Order Issue with In-Order Completion (cont d) Decode/ Issue - An instruction cannot be issued before the previous one has been issued; - An instruction completes only after the previous one has completed. Execute Writeback/ Complete Cycle I1 I2 1 I3 I4 I1 I2 2 I5 I6 I1 3 I3 I1 I2 4 I4 I3 5 I5 I4 6 I6 I5 7 I6 8 The processor detects and handles (by stalling) true data dependencies and resource conflicts. As instructions are issued and completed in their strict order, the resulting parallelism is very much dependent on the way the program is written/ compiled. If I3 and I6 switch position, the pairs I6-I4 and I5-I3 can be executed in parallel (see following slide). We are interested in techniques which are not compiler based but allow the hardware alone to detect instructions which can be executed in parallel and to issue them. - To guarantee in-order completion, instruction issuing stalls when there is a conflict and when the unit requires more than one cycle to execute;

Datorarkitektur I Fö 10-17 Datorarkitektur I Fö 10-18 In-Order Issue with In-Order Completion (cont d) In-Order Issue with In-Order Completion (cont d) If the compiler generates this sequence: I1: ADDF R12,R13,R14 R12 R13 + R14 (float. pnt.) I2: ADD R1,R8,R9 R1 R8 + R9 I6: ADD R11,R2,R3 R11 R2 + R3 I4: MUL R5,R6,R7 R5 R6 * R7 I5: ADD R10,R5,R7 R10 R5 + R7 I3: MUL R4,R2,R3 R4 R2 * R3 I6-I4 and I5-I3 could be executed in parallel Decode/ Issue Execute Writeback/ Complete Cycle I1 I2 1 I6 I4 I1 I2 2 I5 I3 I1 3 I6 I4 I1 I2 4 I5 I3 I6 I4 5 I5 I3 6 7 8 With in-order issue&in-order completion the processor has not to bother about outputdependency and antidependency! It has only to detect true data dependencies. No one of the two dependencies will be violated if instructions are issued/completed in-order: output dependency Antidependency ADD R4,R2,R5 R4 R2 + R5 ADD R3,R2,R5 R3 R2 + R5 The sequence needs only 6 cycles instead of 8. Datorarkitektur I Fö 10-19 Datorarkitektur I Fö 10-20 Out-of-Order Issue with Out-of-Order Completion Out-of-Order Issue with Out-of-Order Completion (cont d) With in-order issue, no new instruction can be issued when the processor has detected a conflict and is stalled, until after the conflict has been resolved. We consider the instruction sequence in slide 15. I6 can be now issued before I5 and in parallel with I4; the sequence takes only 6 cycles (compared to 8 if we have in-order issue&in-order completion). The processor is not allowed to look ahead for further instructions, which could be executed in parallel with the current ones. Out-of-order issue tries to resolve the above problem. Taking the set of decoded instructions the processor looks ahead and issues any instruction, in any order, as long as the program execution is correct. Decode/ Issue Execute Writeback/ Complete Cycle I1 I2 1 I3 I4 I1 I2 2 I5 I6 I1 I3 I2 3 I6 I4 I1 I3 4 I5 I4 I6 5 I5 6 7 8

Datorarkitektur I Fö 10-21 Datorarkitektur I Fö 10-22 Out-of-Order Issue with Out-of-Order Completion (cont d) With out-of-order issue&out-of-order completion the processor has to bother about true data dependency and both about output-dependency and antidependency! Output dependency can be violated (the addition completes before the multiplication): ADD R4,R2,R5 R4 R2 + R5 Antidependency can be violated (the operand in R3 is used after it has been over-written): ADD R3,R2,R5 R3 R2 + R5 Register Renaming Output dependencies and antidependencies can be treated similarly to true data dependencies as normal conflicts. Such conflicts are solved by delaying the execution of a certain instruction until it can be executed. Parallelism could be improved by eliminating output dependencies and antidependencies, which are not real data dependencies (see slide 12). Output dependencies and antidependencies can be eliminated by automatically allocating new registers to values, when such a dependency has been detected. This technique is called register renaming. The output dependency is eliminated by allocating, for example, R6 to the value R2+R5: ADD R4,R2,R5 R4 R2 + R5 (ADD R6,R2,R5 R6 R2 + R5) The same is true for the antidependency below: ADD R3,R2,R5 R3 R2 + R5 (ADD R6,R2,R5 R6 R2 + R5) Datorarkitektur I Fö 10-23 Datorarkitektur I Fö 10-24 Final Comments on Superscalars The following main techniques are characteristic for superscalar processors: 1. additional pipelined units which are working in parallel; 2. out-of-order issue&out-of-order completion; 3. register renaming. All of the above techniques are aimed to enhance performance. Experiments have shown: - without the other techniques, only adding additional units is not efficient; - out-of-order issue is extremely important; it allows to look ahead for independent instructions; - register renaming can improve performance with more than 30%; in this case performance is limited only by true dependencies. - it is important to provide a fetching/decoding capacity so that ~16 instructions are buffered for lookahead. Some Architectures PowerPC 604 six independent execution units: - Branch execution unit - Load/Store unit - 3 Integer units - Floating-point unit in-order issue Power PC 620 provides in addition to the 604 out-of-order issue Pentium three independent execution units: - 2 Integer units - Floating point unit in-order issue two instructions issued per clock cycle. Pentium II to 4 provides in addition to the Pentium out-of-order issue five instructions can be issued in one cycle

Datorarkitektur I Fö 10-25 Datorarkitektur I Fö 10-26 What is Good and what is Bad with Superscalars? The Alternative: VLIW Processors Good The hardware solves everything: - Hardware detects potential parallelism between instructions; - Hardware tries to issue as many instructions as possible in parallel. - Hardware solves register renaming. Binary compatibility - If functional units are added in a new version of the architecture or some other improvements have been made to the architecture (without changing the instruction sets), old programs can benefit from the additional potential of parallelism. Why? Because the new hardware will issue the old instruction sequence in a more efficient way. VLIW architectures rely on compile-time detection of parallelism the compiler analysis the program and detects operations to be executed in parallel; such operations are packed into one large instruction. After one instruction has been fetched all the corresponding operations are issued in parallel. No hardware is needed for run-time detection of parallelism. The instruction window problem is solved: the compiler can potentially analyse the whole program in order to detect parallel operations. Bad Very complex - Much hardware is needed for run-time detection. There is a limit in how far we can go with this technique. - Power consumption can be very large! The instruction window is limited this limits the capacity to detect potentially parallel instructions Datorarkitektur I Fö 10-27 Datorarkitektur I Fö 10-28 VLIW Processors VLIW Processors (cont d) Memory Reg. File FUs FU-1 FU-2 op 1 op 2 op 3 op 4 instruction - 1 Instruction fetch unit Instruction decode unit FU-3 FU-4 Register Files op 1 op 2 empty op 4 instruction - 2 FU-n Execution unit op 1 empty po op 3 op 4 instruction - 3 Detection of parallelism and packaging of operations into instructions is done, by the compiler, off-line.

Datorarkitektur I Fö 10-29 Advantages and Problems with VLIW Processors Advantages Simpler hardware: - the number of FUs can be increased without needing additional sophisticated hardware to detect parallelism, like in superscalars. - Power consumption can be reduced. Good compilers can detect parallelism based on global analysis of the whole program (no instruction window problem). Problems Large number of registers needed in order to keep all FUs active (to store operands and results). Large data transport capacity is needed between FUs and the register file and between register files and memory. High bandwidth between instruction cache and fetch unit. Example: one instruction with 7 operations, each 24 bits 168 bits/instruction. Large code size, partially because unused operations wasted bits in instruction word. Incomputability of binary code For example: If for a new version of the processor additional FUs are introduced the number of operations possible to execute in parallel is increased the instruction word changes old binary code cannot be run on this processor.