Instruction Level Parallelism


Instruction Level Parallelism: Dynamic Scheduling, the Scoreboard Technique, the Tomasulo Algorithm, Speculation, the Reorder Buffer, Superscalar Processors.

Definition of ILP. ILP = potential overlap of execution among unrelated instructions. Overlapping is possible if there are: no Structural Hazards; no RAW, WAR, or WAW Hazards; no Control Hazards.

Pipeline CPI = Ideal CPI + Structural Stalls + Data Hazard Stalls + Control Stalls
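As a quick illustration (hypothetical stall counts, not data from the course): with an ideal CPI of 1 and, on average, 0.1 structural stalls, 0.6 data-hazard stalls, and 0.3 control stalls per instruction, Pipeline CPI = 1 + 0.1 + 0.6 + 0.3 = 2.0, i.e., half the ideal throughput.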

Instruction Level Parallelism. Two strategies to support ILP: Dynamic Scheduling: depend on the hardware to locate parallelism. Static Scheduling: rely on software to identify potential parallelism. Hardware-intensive approaches dominate the desktop and server markets.

Review: Summary of Pipelining Basics. Hazards limit performance: Structural: need more HW resources. Data: need forwarding, compiler scheduling. Control: early evaluation of the PC, delayed branch, branch prediction. Increasing the length of the pipe (superpipelining) increases the impact of hazards. Pipelining helps instruction bandwidth, not latency.

Review: Summary of Pipelining Basics. Interrupts, the instruction set, and FP make pipelining harder. Compilers reduce the cost of data and control hazards: load delay slots, branch delay slots, branch prediction. Today: longer pipelines; better branch prediction, more instruction parallelism?

Basic Assumptions. We consider single-issue processors. The Instruction Fetch stage precedes the Issue stage and may fetch either into an Instruction Register or into a queue of pending instructions; instructions are then issued from the IR or from the queue. The Execution stage may require multiple cycles, depending on the operation type.

Key Idea: Dynamic Scheduling. Problem: data dependences that cannot be hidden with bypassing or forwarding cause hardware stalls of the pipeline. Solution: allow instructions behind a stall to proceed; HW rearranges the instruction execution to reduce stalls. Enables out-of-order execution and completion (commit). First implemented in the CDC 6600 (1963).

Dynamic Scheduling. Advantages: enables handling cases of dependence unknown at compile time; simplifies the compiler; allows code compiled for one pipeline to run efficiently on a different pipeline. Disadvantages: significant increase in HW complexity; could generate imprecise exceptions.

Example 1:
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F12,F8,F14
RAW hazard: ADDD stalls for F0 (waiting for DIVD to complete). Without dynamic scheduling, SUBD would stall even though it is not data dependent on anything in the pipeline.

Example 2:
LD F6, 34(R2)
LD F2, 45(R3)
MULTD F0, F2, F4
SUBD F8, F6, F2
DIVD F10, F0, F6
ADDD F6, F8, F2
Analyze dependences and hazards (a dependence-classification sketch follows below).
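A minimal Python sketch (not from the slides) that classifies the RAW/WAR/WAW dependences in Example 2 by comparing each destination against later sources and destinations; it ignores intervening redefinitions, which is fine for this short sequence.

instrs = [
    ("LD",    "F6",  ["R2"]),
    ("LD",    "F2",  ["R3"]),
    ("MULTD", "F0",  ["F2", "F4"]),
    ("SUBD",  "F8",  ["F6", "F2"]),
    ("DIVD",  "F10", ["F0", "F6"]),
    ("ADDD",  "F6",  ["F8", "F2"]),
]

for i, (op_i, dst_i, srcs_i) in enumerate(instrs):
    for op_j, dst_j, srcs_j in instrs[i + 1:]:
        if dst_i in srcs_j:   # a later instruction reads an earlier result
            print(f"RAW on {dst_i}: {op_i} -> {op_j}")
        if dst_j in srcs_i:   # a later instruction writes a register read earlier
            print(f"WAR on {dst_j}: {op_i} -> {op_j}")
        if dst_j == dst_i:    # two instructions write the same register
            print(f"WAW on {dst_i}: {op_i} -> {op_j}")

Among others, this reports the RAW on F2 (LD -> MULTD), the WAR on F6 (DIVD -> ADDD), and the WAW on F6 (LD -> ADDD).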

Problems? How do we prevent WAR and WAW hazards? How do we deal with variable latency? Forwarding for RAW hazards becomes harder.

Pipeline timing for Example 2 (stages by clock cycle; hazard annotations from the original diagram):
LD F6,34(R2):   IF ID EX MEM WB
LD F2,45(R3):   IF ID EX MEM WB
MULTD F0,F2,F4: IF ID stall M1 M2 M3 M4 M5 M6 M7 M8 M9 M10 MEM WB   (RAW on F2)
SUBD F8,F6,F2:  IF ID A1 A2 MEM WB                                  (RAW on F2)
DIVD F10,F0,F6: IF ID stall stall stall stall stall stall stall stall stall D1 D2 ...  (RAW on F0)
ADDD F6,F8,F2:  IF ID A1 A2 MEM stall stall stall stall stall stall WB  (WAR on F6)

Scoreboard Dynamic Scheduling Algorithm

Scoreboard basic scheme. Out-of-order execution divides the ID stage:
1. Issue: decode instructions, check for structural hazards.
2. Read operands: wait until no data hazards, then read operands.
Instructions execute whenever they are not dependent on previous instructions and there are no hazards. The scoreboard allows instructions to execute whenever 1 & 2 hold, not waiting for prior instructions.

Scoreboard basic scheme. We distinguish when an instruction begins execution and when it completes execution: between the two times, the instruction is in execution. We assume the pipeline allows multiple instructions in execution at the same time; that requires multiple functional units, pipelined functional units, or both. CDC 6600 (1963): in-order issue, out-of-order execution, out-of-order completion (commit). No forwarding! Imprecise interrupt/exception model for now!

Scoreboard Architecture (CDC 6600). [Figure: the register file is connected to the functional units (two FP multipliers, an FP divider, an FP adder, and an integer unit) and to memory, all under control of the centralized SCOREBOARD.]

Scoreboard Scheme. The scoreboard replaces ID, EX, WB with 4 stages; the ID stage is split in two parts: Issue (decode and check structural hazards) and Read Operands (wait until no data hazards). The scoreboard allows instructions without dependencies to execute: in-order issue BUT out-of-order read operands => out-of-order execution and completion. All instructions pass through the issue stage in order, but they can be stalled or bypass each other in the read-operands stage and thus enter execution out of order, which implies out-of-order completion.

Scoreboard Implications. Out-of-order completion => WAR and WAW hazards can occur. Solutions for WAR: stall write-back until the registers have been read; read registers only during the Read Operands stage.

Scoreboard Implications. Solution for WAW: detect the hazard and stall issue of the new instruction until the other instruction completes. No register renaming. Need to have multiple instructions in the execution phase => multiple execution units or pipelined execution units. The scoreboard keeps track of dependencies and the state of operations.

Scoreboard Scheme. All hazard detection and resolution is centralized in the scoreboard: every instruction goes through the scoreboard, where a record of its data dependences is constructed. The scoreboard then determines when the instruction can read its operands and begin execution. If the scoreboard decides the instruction cannot execute immediately, it monitors every change and decides when the instruction can execute. The scoreboard also controls when the instruction can write its result into the destination register.

Exception handling. Problem with out-of-order completion: must preserve exception behavior as in in-order execution. Solution: ensure that no instruction can generate an exception until the processor knows that the instruction raising the exception will be executed.

Imprecise exceptions. An exception is imprecise if the processor state when the exception is raised does not look exactly as if the instructions were executed in order. Imprecise exceptions can occur because: the pipeline may have already completed instructions that are later in program order than the instruction causing the exception; the pipeline may have not yet completed some instructions that are earlier in program order than the instruction causing the exception. Imprecise exceptions make it difficult to restart execution after handling.

Four Stages of Scoreboard Control. 1. Issue: decode the instruction and check for structural hazards & WAW hazards. Instructions are issued in program order (for hazard checking). If a functional unit for the instruction is free and no other active instruction has the same destination register (no WAW), the scoreboard issues the instruction to the functional unit and updates its internal data structures. If a structural hazard or a WAW hazard exists, then instruction issue stalls, and no further instructions will issue until these hazards are cleared.

Four Stages of Scoreboard Control. Note that when the issue stage stalls, it causes the buffer between Instruction Fetch and Issue to fill: if the buffer has a single entry, IF stalls; if the buffer is a queue of multiple instructions, IF stalls when the queue fills.

Four Stages of Scoreboard Control. 2. Read Operands: wait until no data hazards, then read operands. A source operand is available if no earlier issued active instruction will write it, or if a functional unit is currently writing its value into the register. When the source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution. RAW hazards are resolved dynamically in this step, and instructions may be sent into execution out of order. No forwarding of data in this model.

Four Stages of Scoreboard Control. 3. Execution: operate on the operands. The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution. FUs are characterized by: latency (the effective time used to complete one operation) and initiation interval (the number of cycles that must elapse between issuing two operations to the same functional unit).

Four Stages of Scoreboard Control. 4. Write result: check for WAR hazards and finish execution. Once the scoreboard is aware that the functional unit has completed execution, it checks for WAR hazards. If none, it writes the result. If there is a WAR, it stalls the completing instruction.

WAR/WAW Example:
DIVD F0,F2,F4
ADDD F6,F0,F8
SUBD F8,F8,F14
MULTD F6,F10,F8
WAR: SUBD writes F8, which ADDD must still read. WAW: MULTD writes F6, the same destination as ADDD. The scoreboard would stall SUBD in the WB stage, waiting until ADDD reads F0 and F8, and MULTD in the issue stage until ADDD writes F6. Both can be solved through register renaming.

Scoreboard structure: three parts.
1. Instruction status.
2. Functional Unit status. Indicates the state of the functional unit (FU): Busy - indicates whether the unit is busy or not. Op - the operation to perform in the unit (+, -, etc.). Fi - destination register. Fj, Fk - source register numbers. Qj, Qk - functional units producing source registers Fj, Fk. Rj, Rk - flags indicating when Fj, Fk are ready; the flags are set to No after the operands are read.
3. Register result status. Indicates which functional unit will write each register; blank if no pending instruction will write that register.

Detailed Scoreboard Pipeline Control:

Issue - wait until: the FU is not busy and Result(D) is empty (no WAW). Bookkeeping: Busy(FU) <- Yes; Op(FU) <- op; Fi(FU) <- D; Fj(FU) <- S1; Fk(FU) <- S2; Qj <- Result(S1); Qk <- Result(S2); Rj <- not Qj; Rk <- not Qk; Result(D) <- FU.

Read operands - wait until: Rj and Rk. Bookkeeping: Rj <- No; Rk <- No.

Execution complete - wait until: functional unit done.

Write result - wait until: for all f, (Fj(f) != Fi(FU) or Rj(f) = No) and (Fk(f) != Fi(FU) or Rk(f) = No). Bookkeeping: for all f, if Qj(f) = FU then Rj(f) <- Yes; for all f, if Qk(f) = FU then Rk(f) <- Yes; Result(Fi(FU)) <- 0; Busy(FU) <- No.

Scoreboard Example. Instructions:
LD F6, 34+R2; LD F2, 45+R3; MULTD F0, F2, F4; SUBD F8, F6, F2; DIVD F10, F0, F6; ADDD F6, F8, F2.
Three tables are tracked at each clock cycle: the Instruction status (Issue / Read operands / Execution complete / Write result cycles for each instruction), the Functional unit status (Time, Name, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk for the Integer, Mult1, Mult2, Add, and Divide units), and the Register result status (which FU will write each of F0...F30). Initially all units are not busy and the register result status is blank. A code sketch of these structures and checks follows.
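A minimal Python sketch of the structures and wait conditions above (field and unit names are assumptions for illustration, not the CDC 6600 hardware):

class FUStatus:
    def __init__(self):
        self.busy = False
        self.op = self.Fi = self.Fj = self.Fk = None
        self.Qj = self.Qk = None     # FUs that will produce the source operands
        self.Rj = self.Rk = False    # source operands ready (and not yet read)?

fus = {"Integer": FUStatus(), "Mult1": FUStatus(), "Mult2": FUStatus(),
       "Add": FUStatus(), "Divide": FUStatus()}
result = {}                          # register result status: reg -> producing FU

def can_issue(fu_name, dest):
    # stall on a structural hazard (FU busy) or a WAW hazard (pending write to dest)
    return not fus[fu_name].busy and dest not in result

def issue(fu_name, op, dest, s1, s2):
    fu = fus[fu_name]
    fu.busy, fu.op, fu.Fi, fu.Fj, fu.Fk = True, op, dest, s1, s2
    fu.Qj, fu.Qk = result.get(s1), result.get(s2)
    fu.Rj, fu.Rk = fu.Qj is None, fu.Qk is None
    result[dest] = fu_name

def read_operands(fu_name):          # allowed only when Rj and Rk hold (no RAW)
    fu = fus[fu_name]
    if fu.Rj and fu.Rk:
        fu.Rj = fu.Rk = False        # flags cleared once the operands are read
        return True
    return False

def can_write_result(fu_name):       # WAR check: nobody still has to read the old Fi
    fu = fus[fu_name]
    return all((f.Fj != fu.Fi or not f.Rj) and (f.Fk != fu.Fi or not f.Rk)
               for f in fus.values() if f is not fu)

def write_result(fu_name):
    fu = fus[fu_name]
    for f in fus.values():           # wake up every unit waiting on this result
        if f.Qj == fu_name: f.Qj, f.Rj = None, True
        if f.Qk == fu_name: f.Qk, f.Rk = None, True
    if result.get(fu.Fi) == fu_name:
        del result[fu.Fi]
    fu.busy = False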

16 Scoreboard Example: Cycle 1 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 LD F2 45+ R3 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F6 R2 Yes Mult1 No Mult2 No Add No Divide No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 1 FU Integer 31 Scoreboard Example Cycle 2 Instruction status Read ExecutionWrite Instruction j k Issue operands complete Result LD F6 34+ R2 1 2 LD F2 45+ R3 MULT F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDDF6 F8 F2 Functional unit status dest S1 S2 FU for j FU for k Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F6 R2 Yes Mult1 No Mult2 No Add No Divide No Register result status Clock F0 F2 F4 F6 F8 F10 F12... F30 2 FU Integer Issue 2nd load? Integer Pipeline Full Cannot exec 2 nd Load due to structural hazard on Integer Unit Issue stalls 32 16

17 Scoreboard Example Cycle 3 Instruction status Read ExecutionWrite Instruction j k Issue operands complete Result LD F6 34+ R LD F2 45+ R3 MULT F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDDF6 F8 F2 Functional unit status dest S1 S2 FU for j FU for k Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F6 R2 Yes Mult1 No Mult2 No Add No Divide No Register result status Clock F0 F2 F4 F6 F8 F10 F12... F30 3 FU Integer Issue stalls 33 Scoreboard Example: Cycle 4 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R3 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Mult1 Mult2 Add Divide No No No No No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 4 FU Integer Issue stalls Write F

18 Scoreboard Example: Cycle 5 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R3 5 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F2 R3 Yes Mult1 No Mult2 No Add No Divide No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 5 FU Integer The 2 nd load is issued 35 Scoreboard Example: Cycle 6 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R3 5 6 MULTD F0 F2 F4 6 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F2 R3 Yes Mult1 Yes Mult F0 F2 F4 Integer No Yes Mult2 No Add No Divide No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 6 FU Mult1 Integer MULT is issued but has to wait for F2 from LOAD (RAW Hazard on F2) 36 18

19 Scoreboard Example: Cycle 7 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 SUBD F8 F6 F2 7 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F2 R3 No Mult1 Yes Mult F0 F2 F4 Integer No Yes Mult2 No Add Yes Sub F8 F6 F2 Integer Yes No Divide No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 7 FU Mult1 Integer Add Read multiply operands? Now SUBD can be issued to ADD Functional Unit 37 Scoreboard Example: Cycle 8a (First half of clock cycle) Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 SUBD F8 F6 F2 7 DIVD F10 F0 F6 8 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F2 R3 No Mult1 Yes Mult F0 F2 F4 Integer No Yes Mult2 No Add Yes Sub F8 F6 F2 Integer Yes No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 8 FU Mult1 Integer Add Divide DIVD is issued but there is another RAW hazard (F0) from MULTD -> DIVD has to wait for F

20 Scoreboard Example: Cycle 8b (Second half of clock cycle) Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 SUBD F8 F6 F2 7 DIVD F10 F0 F6 8 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Yes Mult F0 F2 F4 Yes Yes Mult2 No Add Yes Sub F8 F6 F2 Yes Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 8 FU Mult1 Add Divide Load completes (Writes F2), and operands for MULT an SUBD are ready 39 Scoreboard Example: Cycle 9 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Note Remaining Integer No 10 Mult1 Yes Mult F0 F2 F4 Yes Yes Mult2 No 2Add Yes Sub F8 F6 F2 Yes Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 9 FU Mult1 Add Divide Read operands for MULTD & SUBD. Issue ADDD? No for structural hazard on ADD Functional Unit MULTD and SUBD are sent in execution in parallel: Latency of 10 cycles for MULTD and 2 cycles for SUBD 40 20

21 Scoreboard Example: Cycle 10 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 9Mult1 Yes Mult F0 F2 F4 No No Mult2 No 1Add Yes Sub F8 F6 F2 No No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 10 FU Mult1 Add Divide 41 Scoreboard Example: Cycle 11 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 8Mult1 Yes Mult F0 F2 F4 No No Mult2 No 0Add Yes Sub F8 F6 F2 No No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 11 FU Mult1 Add Divide SUBD ends execution 42 21

22 Scoreboard Example: Cycle 12 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 7Mult1 Yes Mult F0 F2 F4 No No Mult2 No Add No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 12 FU Mult1 Divide SUBD writes result in F8 43 Scoreboard Example: Cycle 13 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F2 13 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 6Mult1 Yes Mult F0 F2 F4 No No Mult2 No Add Yes Add F6 F8 F2 Yes Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 13 FU Mult1 Add Divide ADDD can be issued DIVD still waits for operand F0 from MULTD 44 22

23 Scoreboard Example: Cycle 14 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 5Mult1 Yes Mult F0 F2 F4 No No Mult2 No 2 Add Yes Add F6 F8 F2 Yes Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 14 FU Mult1 Add Divide ADDD reads operands 45 Scoreboard Example: Cycle 15 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 4Mult1 Yes Mult F0 F2 F4 No No Mult2 No 1 Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 15 FU Mult1 Add Divide ADDD starts execution 46 23

24 Scoreboard Example: Cycle 16 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 3Mult1 Yes Mult F0 F2 F4 No No Mult2 No 0 Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 16 FU Mult1 Add Divide ADDD ends execution 47 Scoreboard Example: Cycle 17 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F WAR Hazard! Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 2Mult1 Yes Mult F0 F2 F4 No No Mult2 No Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 17 FU Mult1 Add Divide Why not write result of ADD??? DIVD must first read F6 (before ADDD write F6), but DIVD cannot read operands until MULTD writes F

25 Scoreboard Example: Cycle 18 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 1Mult1 Yes Mult F0 F2 F4 No No Mult2 No Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 18 FU Mult1 Add Divide 49 Scoreboard Example: Cycle 19 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 0Mult1 Yes Mult F0 F2 F4 No No Mult2 No Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 19 FU Mult1 Add Divide MULTD ends execution 50 25

26 Scoreboard Example: Cycle 20 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 No Mult2 No Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Yes Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 20 FU Add Divide MULTD writes in F0 51 Scoreboard Example: Cycle 21 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 No Mult2 No Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Yes Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 21 FU Add Divide DIVD can read operands WAR Hazard is now gone

27 Scoreboard Example: Cycle 22 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 No Mult2 No Add No 39 Divide Yes Div F10 F0 F6 No No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 22 FU Divide DIVD has read its operands in previous cycle ADDD can write the result in F6 53 (skipping some cycles ) 54 27

28 Scoreboard Example: Cycle 61 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 No Mult2 No Add No 0 Divide Yes Div F10 F0 F6 No No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 61 FU Divide DIVD ends execution 55 Scoreboard Example: Cycle 62 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Mult1 Mult2 Add Divide No No No No No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 62 FU DIVD writes in F

Review: Scoreboard Example, Cycle 62 (final state, same table as above): in-order issue; out-of-order execute & commit.

CDC 6600 Scoreboard. Speedup of 2.5 w.r.t. no dynamic scheduling; speedup of 1.7 from compiler reorganization of the instructions; BUT slow memory (no cache) limits the benefit. Limitations of the 6600 scoreboard: no forwarding hardware; limited to instructions in a basic block (small window); small number of functional units (structural hazards), especially integer/load-store units; does not issue on structural hazards; waits for WAR hazards; prevents WAW hazards.

Summary. Instruction Level Parallelism (ILP) in SW or HW; loop-level parallelism is easiest to see. SW parallelism: dependencies are defined for the program; hazards arise if the HW cannot resolve them. SW dependencies and compiler sophistication determine whether the compiler can unroll loops; memory dependencies are the hardest to determine. HW exploiting ILP: works when dependence cannot be known at compile time; code for one machine runs well on another. Key idea of the Scoreboard: allow instructions behind a stall to proceed (Decode => Issue instr & read operands); enables out-of-order execution => out-of-order completion. The ID stage is checked for both structural and data dependencies.

Tomasulo Dynamic Scheduling Algorithm

Tomasulo Algorithm. Another dynamic scheduling algorithm: allows execution to proceed in the presence of dependences. Invented at IBM 3 years after the CDC 6600, for the IBM 360/91. Same goal: high performance without special compilers. Led to: Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC.

Tomasulo Algorithm vs. Scoreboard: control & buffers distributed with the Functional Units (FUs) vs. centralized in the scoreboard; the FU buffers are called reservation stations and hold pending operands. Registers in instructions are replaced by values or pointers to reservation stations (RS); this is called register renaming; it avoids WAR and WAW hazards by renaming results using RS numbers. More reservation stations than registers, so it can do optimizations compilers can't. Results go to the FUs from the RSs, not through the registers, over a Common Data Bus that broadcasts results to all FUs. Loads and stores are treated as FUs with RSs as well. Integer instructions can go past branches, allowing FP ops beyond the basic block in the FP queue.

Tomasulo Algorithm Basics. The control logic and the buffers are distributed with the FUs (vs. centralized in the scoreboard). Operand buffers are called reservation stations; each instruction is an entry of a reservation station, and its operands are replaced by values or pointers (Register Renaming).

Tomasulo Algorithm Basics. Register renaming makes it possible to avoid WAR and WAW hazards. There are more reservation stations than registers (so it can do better optimizations than a compiler). Results are dispatched to the other FUs through a Common Data Bus (CDB). Load/stores are treated as FUs.

Tomasulo Algorithm for an FPU.

Reservation Station Components: Tag identifying the RS. OP = the operation to perform. Vj, Vk = values of the source operands (Vk holds the offset for loads). Qj, Qk = pointers to the RSs that will produce Vj, Vk (a zero value = the source operand is already available in Vj or Vk). Busy = indicates the RS is busy. Note: only one of the V-field or the Q-field is valid for each operand. (A sketch of one entry follows below.)
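A minimal Python sketch of one reservation-station entry (field names follow the slide; the class itself is illustrative, not the IBM 360/91 design):

from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    tag: str                     # identifies this RS (e.g. "Add1")
    busy: bool = False
    op: Optional[str] = None     # operation to perform
    Vj: Optional[float] = None   # value of the first source operand, if available
    Vk: Optional[float] = None   # value of the second source operand (offset for loads)
    Qj: Optional[str] = None     # tag of the RS producing Vj (None = Vj is valid)
    Qk: Optional[str] = None     # tag of the RS producing Vk (None = Vk is valid)

    def ready(self) -> bool:
        # execution may begin once both Q fields are clear (both operands present)
        return self.busy and self.Qj is None and self.Qk is None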

Other components. The RF and the store buffer have a Value (V) and a Pointer (Q) field. The Pointer (Q) field holds the number of the reservation station producing the result to be stored in the RF or store buffer; if zero, no active instruction is producing the result (the RF or store buffer content is the correct value). Load buffers have an address field (A) and a busy field. Store buffers also have an address field (A). A holds the info for the memory address calculation of the load/store: initially it contains the instruction offset (immediate field); after address calculation it stores the effective address.

First stage of Tomasulo Algorithm: ISSUE. Get an instruction I from the head of the instruction queue (maintained in FIFO order to ensure the correct data flow). If it is an FP op, check whether an RS is empty (i.e., check for structural hazards): if there is no empty RS => structural hazard, and the instruction stalls. If the operands are not in the RF, keep track of the FUs that will produce them.

First stage of Tomasulo Algorithm: ISSUE. Rename registers. WAR resolution: if I writes Rx, which is read by an instruction K already issued, K already knows the value of Rx or knows what instruction will write it; so the RF can be linked to I. WAW resolution: since we use in-order issue, the RF can be linked to I.

Second stage of Tomasulo Algorithm: Execution. When both operands are ready, execute; if they are not ready, watch the Common Data Bus for results. By delaying execution until the operands are available, RAW hazards are avoided. Notice that several instructions could become ready in the same clock cycle for the same FU.

Second stage of Tomasulo Algorithm. Loads and stores: a two-step execution process. First step: compute the effective address when the base register is available, and place it in the load or store buffer. Loads in the load buffer execute as soon as the memory unit is available; stores in the store buffer wait for the value to be stored before being sent to the memory unit. Loads and stores are kept in program order through the effective-address calculation, which helps in preventing hazards through memory. To preserve exception behavior: no instruction can initiate execution until all branches preceding it in program order have completed. If branch prediction is used, the CPU must know that the prediction was correct before beginning execution of the following instructions. (Speculation allows better results!)

Third stage of Tomasulo Algorithm: Write result. When the result is available, write it on the Common Data Bus and from there into the RF and into all RSs (including store buffers) waiting for this result; stores also write data to memory during this stage. Mark the reservation station available. (A sketch of the broadcast follows below.)
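A sketch of the Write Result broadcast (a hypothetical helper building on the ReservationStation sketch above, not the IBM 360/91 logic verbatim): one broadcast on the CDB wakes up every waiting consumer.

def broadcast_on_cdb(tag, value, stations, registers, reg_status):
    """tag = RS that produced value; every consumer snooping the CDB matches on it."""
    for rs in stations:                        # all RSs (and store buffers) snoop the bus
        if rs.Qj == tag:
            rs.Vj, rs.Qj = value, None
        if rs.Qk == tag:
            rs.Vk, rs.Qk = value, None
    for reg, src in list(reg_status.items()):  # the register file snoops too
        if src == tag:
            registers[reg] = value
            del reg_status[reg]                # no pending producer for reg anymore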

The Common Data Bus. A common data bus is a data+source bus. In the IBM 360/91: data = 64 bits, source = 4 bits. Each FU must perform an associative lookup in its RSs.

Tomasulo algorithm (some details). Loads and stores go through a functional unit for effective-address computation before proceeding to the load and store buffers. Loads take a second execution step to access memory, then go to Write Result to send the value from memory to the RF and/or RSs. Stores complete their execution in their Write Result stage (writing data to memory). All writes occur in Write Result, simplifying the Tomasulo algorithm.

Tomasulo algorithm (some details). A load and a store can be done in a different order, provided they access different memory locations; otherwise, a WAR (interchanging a load-store sequence) or a RAW (interchanging a store-load sequence) may result (a WAW if two stores are interchanged). Two loads can be reordered freely. To detect such hazards, the data memory addresses associated with any earlier memory operation must have been computed by the CPU (e.g., address computation executed in program order).

Tomasulo algorithm (some details). Load executed out of order with respect to a previous store: assume addresses are computed in program order. When the load address has been computed, it can be compared with the A fields in the active store buffers: in the case of a match, the load is not sent to the load buffer until the conflicting store completes. Stores must check for matching addresses in both load and store buffers (dynamic disambiguation, an alternative to the static disambiguation performed by the compiler). Drawback: the amount of hardware required. Each RS must contain a fast associative buffer, and the single CDB may limit performance. (A sketch of the check follows below.)
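A minimal sketch of the dynamic disambiguation checks just described (buffer entries with busy and A fields, as assumed above):

def load_may_proceed(load_addr, store_buffers):
    # a load is held back if any active store buffer targets the same address
    return all(not sb.busy or sb.A != load_addr for sb in store_buffers)

def store_may_proceed(store_addr, load_buffers, store_buffers):
    # stores must check both buffers for a matching effective address
    no_load_clash = all(not lb.busy or lb.A != store_addr for lb in load_buffers)
    no_store_clash = all(not sb.busy or sb.A != store_addr for sb in store_buffers)
    return no_load_clash and no_store_clash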

39 Tomasulo s example Cycle 1 Instruction status Write Instruction j k Issue Execute Result LD F6 34+ R2 1 LD F2 45+ R3 MULTF0 F2 F4 SUBDF8 F6 F2 DIVD F10 F0 F6 ADDDF6 F8 F2 v1 q1 v2 q2 v1 q1 v2 q2 Load1 34 v(r2) add1 Load2 add2 EXLoad EXADD mult1 mult2 EXMUL v1 q1 v2 q2 RF q Load1 77 Tomasulo s example Cycle 2 Instruction status Write Instruction j k Issue Execute Result LD F6 34+ R2 1 2 LD F2 45+ R3 2 MULTF0 F2 F4 SUBDF8 F6 F2 DIVD F10 F0 F6 ADDDF6 F8 F2 v1 q1 v2 q2 v1 q1 v2 q2 v(r2) Load1 34 add1 Load2 45 v(r3) add2 EXLoad 34 v(r2) EXADD mult1 mult2 EXMUL v1 q1 v2 q2 RF q Load2 Load

40 Tomasulo s example Cycle 3 Instruction status Write Instruction j k Issue Execute Result LD F6 34+ R2 1 2 LD F2 45+ R3 2 MULTF0 F2 F4 3 SUBDF8 F6 F2 DIVD F10 F0 F6 ADDDF6 F8 F2 v1 q1 v2 q2 v1 q1 v2 q2 Load1 34 v(r2) add1 Load2 45 v(r3) add2 EXLoad 34 v(r2) EXADD v1 q1 v2 q2 mult1 Load2 v(f4) mult2 EXMUL RF q mult1 Load2 Load1 79 Tomasulo s example Cycle 4 Instruction status Write Instruction j k Issue Execute Result LD F6 34+ R LD F2 45+ R3 2 MULTF0 F2 F4 3 SUBDF8 F6 F2 4 DIVD F10 F0 F6 ADDDF6 F8 F2 v1 q1 v2 q2 v1 q1 v2 q2 v(r2) v(f6) load2 Load1 34 add1 Load2 45 v(r3) add2 EXLoad 34 v(r2) EXADD CDB v1 q1 v2 q2 Load2 v(f4) mult1 mult2 EXMUL RF q mult1 Load2 v(f6) add1 Forwarding is provided Writes on RF (F6) and RS through CDB 80 40

41 Tomasulo s example Cycle 5 Instruction status Write Instruction j k Issue Execute Result LD F6 34+ R LD F2 45+ R3 2 5 MULTF0 F2 F4 3 SUBDF8 F6 F2 4 DIVD F10 F0 F6 5 ADDDF6 F8 F2 v1 q1 v2 q2 v1 q1 v2 q2 v(f6) load2 Load1 add1 Load2 add2 45 v(r3) EXLoad 45 v(r3) EXADD v1 q1 v2 q2 Load2 v(f4) mult1 mult2 v(f6) mult1 EXMUL RF q mult1 Load2 v(f6) add1 mult2 81 Tomasulo s example Cycle 6 Instruction status Write Instruction j k Issue Execute Result LD F6 34+ R LD F2 45+ R3 2 5 MULTF0 F2 F4 3 SUBDF8 F6 F2 4 DIVD F10 F0 F6 5 ADDDF6 F8 F2 6 v1 q1 v2 q2 v1 q1 v2 q2 v(f6) load2 Load1 add1 Load2 add2 load2 45 v(r3) add1 EXLoad 45 v(r3) EXADD v1 q1 v2 q2 Load2 v(f4) mult1 mult2 v(f6) mult1 EXMUL RF q mult1 Load2 add2 add1 mult2 WAR on F6 has been eliminated: ADDD will write in F6 and DIVD has already read v(f6) in v

42 Tomasulo s example Cycle 7 Instruction status Write Instruction j k Issue Execute Result LD F6 34+ R LD F2 45+ R MULTF0 F2 F4 3 SUBDF8 F6 F2 4 DIVD F10 F0 F6 5 ADDDF6 F8 F2 6 v1 q1 v2 q2 v1 q1 v2 q2 Load1 add1 v(f6) v(f2) Load2 45 v(r3) add2 add1 v(f2) EXLoad 45 v(r3) CDB EXADD v1 q1 v2 q2 mult1 v(f2) v(f4) mult2 v(f6) mult1 EXMUL RF q mult1 v(f2) add2 add1 mult2 Forwarding is provided Writes on RF (F2) and RSs through CDB 83 Tomasulo s example Cycle 8 Instruction status Write Instruction j k Issue Execute Result LD F6 34+ R LD F2 45+ R MULTF0 F2 F4 3 8 SUBDF8 F6 F2 4 8 DIVD F10 F0 F6 5 ADDDF6 F8 F2 6 v1 q1 v2 q2 v1 q1 v2 q2 v(f6) v(f2) Load1 add1 Load2 add2 v(f2) add1 EXLoad EXADD v(f2) v(f6) v1 q1 v2 q2 v(f2) v(f4) mult1 mult2 v(f6) mult1 EXMUL v(f2) v(f4) RF q mult1 v(f2) add2 add1 mult

43 Tomasulo s example Cycle 10 Instruction status Write Instruction j k Issue Execute Result LD F6 34+ R LD F2 45+ R MULTF0 F2 F SUBDF8 F6 F DIVD F10 F0 F6 5 ADDDF6 F8 F2 6 Latency MULTD: 2 cycles Latency SUBD: 2 cycles v1 q1 v2 q2 v1 q1 v2 q2 Load1 add1 v(f6) v(f2) Load2 add2 v(f8) v(f2) EXLoad EXADD v(f6) v(f2) CDB v1 q1 v2 q2 mult1 v(f2) v(f4) mult2 v(f6) v(f0) EXMUL v(f2) v(f4) CDB RF q v(f0) v(f2) add2 v(f8) mult2 85 Tomasulo s example Cycle 11 Instruction status Write Instruction j k Issue Execute Result LD F6 34+ R LD F2 45+ R MULTF0 F2 F SUBDF8 F6 F DIVD F10 F0 F ADDDF6 F8 F v1 q1 v2 q2 v1 q1 v2 q2 Load1 add1 Load2 add2 v(f8) v(f2) EXLoad EXADD v(f8) v(f2) v1 q1 v2 q2 mult1 mult2 v(f6) v(f0) EXMUL v(f6) v(f0) RF q v(f0) v(f2) add2 v(f8) mult

44 Tomasulo s example Cycle 61 Instruction status Write Instruction j k Issue Execute Result LD F6 34+ R LD F2 45+ R MULTF0 F2 F SUBDF8 F6 F DIVD F10 F0 F Latency DIVD: 50 cycles ADDDF6 F8 F Latency ADDD: 2 cycles Load1 Load2 EXLoad v1 q1 v2 q2 v1 q1 v2 q2 add1 add2 EXADD v1 q1 v2 q2 mult1 mult2 v(f6) v(f0) EXMUL v(f6) v(f0) CDB RF q v(f0) v(f2) v(f6) v(f8) v(f10) 87 Compare to Scoreboard Cycle 62 Instruction status: Read Exec Write Write Instruction j k Issue Oper CompResult IssueExec Resul LD F6 34+ R LD F2 45+ R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Why take longer on scoreboard/6600? Structural Hazards Lack of forwarding 88 44

Tomasulo (IBM) versus Scoreboard (CDC).
Tomasulo (IBM): issue window size = 5; no issue on structural hazards; WAR and WAW avoided with renaming; results broadcast from the FUs; control distributed on the RSs; allows loop unrolling in HW.
Scoreboard (CDC): issue window size = 12; no issue on structural hazards; completion stalled for WAW and WAR hazards; results written back to the registers; control centralized in the Scoreboard.

Limits to Instruction Level Parallelism: branches; exceptions. (Non-)precise: operand integrity for the exception handler. (Non-)exact: handler modifications are seen by instructions after the exception.

Tomasulo Drawbacks. Complexity: a large amount of hardware (the delays of the 360/91, MIPS 10000, IBM 620?); many associative stores (CDB) at high speed. Performance is limited by the Common Data Bus; multiple CDBs => more FU logic for parallel associative stores.

Summary (1). HW exploiting ILP: works when dependence can't be known at compile time; code for one machine runs well on another. Key idea of the Scoreboard: allow instructions behind a stall to proceed (Decode => Issue instr & read operands); enables out-of-order execution => out-of-order completion; the ID stage is checked for both structural & data dependencies. The original version didn't handle forwarding and had no automatic register renaming.

Summary (2). Reservation stations: renaming to a larger set of registers + buffering of source operands. Prevents registers from becoming the bottleneck. Avoids the WAR and WAW hazards of the Scoreboard. Allows loop unrolling in HW. Not limited to basic blocks (integer units get ahead, beyond branches). Helps cache misses as well. Lasting contributions: dynamic scheduling, register renaming, load/store disambiguation. The 360/91 descendants are the Pentium II, PowerPC 604, MIPS R10000, HP-PA 8000, Alpha.

HW-based Speculation: ReOrder Buffer

HW support for more ILP. Speculation: allow an instruction to execute without leaving any consequences (including exceptions) if the branch is not actually taken ("HW undo"); called boosting. Combine branch prediction, to choose which instructions to execute, with dynamic scheduling, to execute before branches are resolved.

HW support for more ILP. Separate speculative bypassing of results from real bypassing of results. When an instruction is no longer speculative, write the boosted results (instruction commit) or discard them => execute out of order but commit in order, to prevent any irrevocable action (state update or exception) until the instruction commits.

HW-based Speculation. HW-based speculation combines 3 ideas: dynamic branch prediction to choose which instructions to execute; speculation to execute instructions before control dependences are resolved; dynamic scheduling supporting out-of-order execution but in-order commit, to prevent any irrevocable action (such as a register update or taking an exception) until an instruction commits.

HW support for more ILP. Need a HW buffer for the results of uncommitted instructions: the ReOrder Buffer (ROB). [Figure: the FP op queue feeds the reservation stations in front of the FP adders; the reorder buffer sits between the functional units and the FP registers.]

ReOrder Buffer (ROB). A buffer to hold the results of instructions that have finished execution but are not yet committed, and to pass results among instructions that may have been speculated. Supports out-of-order execution but in-order commit. Speculative Tomasulo Algorithm with ROB: pointers are directed toward ROB slots; a register or memory is updated only when the instruction reaches the head of the ROB (that is, when the instruction is no longer speculative).

ReOrder Buffer (ROB). The ROB completely replaces the store buffers. The renaming function of the reservation stations is replaced by the ROB; reservation stations are now used only to buffer instructions and operands to the FUs (to reduce structural hazards). Pointers are now directed toward ROB slots. Processors with a ROB can dynamically execute while maintaining a precise interrupt model, because instruction commit happens in order.

ReOrder Buffer. 4 fields: Instruction type. Destination: RF number (for load and ALU ops) or memory address (for stores). Value: holds the value of the instruction result until the instruction commits. Ready: indicates that the instruction has completed execution and the value is ready. (A sketch of one entry follows below.)

ReOrder Buffer. The reorder buffer can be an operand source => more registers, like a reservation station. Use the reorder buffer number instead of the reservation station number when execution completes; it supplies operands between execution complete & commit. Once an instruction commits, its result is put into the register. Instructions commit in order; as a result, it's easy to undo speculated instructions on mispredicted branches or on exceptions.
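A minimal sketch of one ROB entry with the four fields just listed (the Python names are assumptions):

from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class ROBEntry:
    itype: str                               # "branch", "store", or "load/alu"
    dest: Optional[Union[str, int]] = None   # register number, or memory address for stores
    value: Optional[float] = None            # result held here until commit
    ready: bool = False                      # execution finished, value is valid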

ReOrder Buffer (ROB). Originally (1988) introduced to solve the precise interrupt problem; generalized to guarantee sequential consistency. Basically, the ROB is a circular buffer with a head pointer (indicating the next free entry) and a tail pointer (indicating the instruction that will commit, i.e., leave the ROB, first).

ReOrder Buffer (ROB). Instructions are written in the ROB in strict program order: when an instruction is issued, an entry is allocated to it in sequence. The entry indicates the status of the instruction: issued (i), in execution (x), finished (f) (+ other items!). An instruction can commit (retire) iff: 1. it has finished, and 2. all previous instructions have already retired.

ReOrder Buffer (ROB). [Figure: the ROB as a circular buffer - the tail points to the next instruction to be retired, active entries carry status i/x/f, and the head points to the first free entry; subsequent instructions are allocated to subsequent entries, in order.]

ReOrder Buffer (ROB). Only retiring instructions can complete, i.e., update architectural registers and memory. The ROB can support both speculative execution and exception handling. Speculative execution: each ROB entry is extended to include a speculative status field, indicating whether the instruction has been executed speculatively; finished instructions cannot retire as long as they are in speculative status. (A retirement sketch follows below.)
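A sketch of in-order retirement from the circular ROB (the status codes 'i'/'x'/'f' follow the slide; the speculative flag and field names are assumptions). Entries are retired from the tail only while they are finished and non-speculative:

def retire(rob, tail, regfile):
    """rob: list used circularly, None = free entry; returns the new tail."""
    while rob[tail] is not None:
        entry = rob[tail]
        if entry.status != 'f' or entry.speculative:
            break                             # next-to-retire not ready: stop, in order
        if entry.itype != "branch":
            regfile[entry.dest] = entry.value # architectural state updated only now
        rob[tail] = None                      # free the entry
        tail = (tail + 1) % len(rob)
    return tail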

ReOrder Buffer (ROB). Interrupt handling: exceptions generated in connection with instruction execution are made precise by accepting the exception request only when the instruction becomes the next to retire (exceptions are processed in order). The ROB can also be used for shelving (in this case: Deferred scheduling, Register renaming Instruction Shelf, DRIS).

Hardware-based Speculation. The outcome of branches is speculated and the program is executed as if the speculation were correct (simple dynamic scheduling would only fetch and decode, not execute!). Mechanisms are necessary to handle incorrect speculation: hardware speculation extends dynamic scheduling.

Hardware-based Speculation. Combines: 1. dynamic branch prediction, to choose which instructions to execute; 2. speculation, to allow executing instructions before dependencies are resolved (with the ability to undo the effects of incorrectly speculated sequences); 3. dynamic scheduling, to deal with different combinations of basic blocks.

Hardware-based Speculation. Issue an instruction dependent on a branch before the branch result is known. Commit is always made in order: the commit of a speculative instruction is made only when the branch outcome is known. The same holds for exceptions (synchronous or asynchronous deviations of control flow). Follows the predicted flow of data values to choose when to execute an instruction; essentially, a data-flow mode of execution: instructions execute as soon as their operands are available.

Hardware-based Speculation. Adopted in the PowerPC 603/604, MIPS R10000/R12000, Pentium II/III/4, AMD K5/K6, Athlon. Extends the hardware support for the Tomasulo algorithm: to support speculation, the commit phase is separated from the execution phase, and the reorder buffer is introduced.

Hardware-based Speculation. Basic Tomasulo algorithm: an instruction writes its result in the register file, where subsequent instructions find it. With speculation, results are written only when the instruction commits and it is known whether the instruction had to be executed. Key idea: executing out of order, committing in order. Boosting.

Speculative Tomasulo's Algorithm.

Speculative Tomasulo's Algorithm. Boosting needs a buffer for uncommitted results (the reorder buffer). Each entry in the ROB contains four fields: the Instruction type field indicates whether the instruction is a branch (no destination result), a store (memory address destination), or a load/ALU op (register destination); the Destination field supplies the register number (for loads and ALU instructions) or the memory address (for stores) where the result should be written; the Value field holds the value of the result until the instruction commits; the Ready field indicates that the instruction has completed execution and the value is ready.

ReOrder Buffer Extension. The ROB completely replaces the store buffers: stores execute in two steps, the second one when the instruction commits. The renaming function of the reservation stations is completely replaced by the ROB; reservation stations now only queue operations (and operands) to the FUs between the time they issue and the time they begin execution. Results are tagged with the ROB entry number rather than with the RS number; the ROB entry assigned to an instruction must be tracked in the reservation stations.

ReOrder Buffer Extension. All instructions, excluding incorrectly predicted branches (or incorrectly speculated loads), commit when reaching the head of the ROB. When an incorrectly predicted branch reaches the head of the ROB, wrong speculation is indicated: the ROB is flushed, and execution restarts at the correct successor of the branch. Speculative actions are easily undone. Processors with a ROB can dynamically execute while maintaining a precise interrupt model: if instruction Ij causes an interrupt, the CPU waits until Ij reaches the head of the ROB and takes the interrupt, flushing all other pending instructions.

Steps of Speculative Tomasulo's Algorithm (1). 1. Issue: get an instruction from the queue; both an RS and a ROB slot must be free (when the ROB is full, stop issuing instructions until an entry is free); dispatch the operation, indicating in which ROB slot it must write. 2. Execution: when both operands are ready, execute; if not, watch the CDB. 3. Write Result: write on the CDB and into the ROB, as well as to any RS waiting for this result; mark the RS available.

Steps of Speculative Tomasulo's Algorithm (2). 4. Commit: 3 different possible sequences (sketched below): 1. Normal commit: the instruction reaches the head of the ROB and the result is present in the buffer; the result is stored in the register and the instruction is removed from the ROB. 2. Store commit: as above, but memory rather than a register is updated. 3. The instruction is a branch with incorrect prediction: this indicates that the speculation was wrong; the ROB is flushed ("graduation"), and execution restarts at the correct successor of the branch. If the branch was correctly predicted, the branch is finished.
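A sketch of the three commit cases (the entry fields, the flush() helper, and the return convention are assumptions, not the textbook interface):

def commit(entry, regfile, memory, rob):
    """Commit the entry at the head of the ROB; return a restart PC on misprediction."""
    if entry.itype == "branch":
        if entry.mispredicted:
            rob.flush()                    # squash all speculated instructions
            return entry.correct_target    # restart fetch at the right successor
        return None                        # correctly predicted: branch is finished
    if entry.itype == "store":
        memory[entry.dest] = entry.value   # memory is updated only at commit
    else:
        regfile[entry.dest] = entry.value  # normal commit: write the register
    return None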

Exception Handling. Do not recognize the exception until the instruction is ready to commit. If a speculated instruction raises an exception, record it in the ROB. If the branch was mispredicted and the instruction should not have been executed, flush the exception. If the instruction reaches the head of the ROB and is no longer speculative, the exception is taken.

Speculative Tomasulo's Algorithm.

Hazards through memory. WAW and WAR hazards through memory are eliminated with speculation, since the actual memory updating occurs in order. For RAW hazards through memory, two restrictions are introduced: no load can initiate the second step of its execution if an active ROB entry due to a store has a Destination field matching the A field of the load; the program order for the computation of load addresses is maintained with respect to all previous stores. Some speculative machines bypass the value directly from the store to the load when a RAW is detected.

HW Register Renaming. There is still a reorder buffer, but it does not keep the results: it only enforces in-order commit. The register file is extended with extra registers to hold speculative values. When issuing an instruction, rename all the speculative operands to the speculative registers; on commit, copy the speculative register into the real one. Operands are read from the RF (real or speculative) or via the CDB. (A sketch follows below.)
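A minimal sketch of rename-at-issue with a free list and a map table (all names are assumed; values produced by the FUs are written under the speculative register's key):

free_list = ["P8", "P9", "P10", "P11"]       # extra (speculative) physical registers
rename_map = {}                              # architectural reg -> speculative reg

def rename_at_issue(dest, srcs):
    # sources read the speculative version if one exists, else the real register
    renamed_srcs = [rename_map.get(s, s) for s in srcs]
    phys = free_list.pop(0)                  # allocate a speculative register
    rename_map[dest] = phys                  # later readers of dest use phys
    return phys, renamed_srcs

def commit_rename(dest, regfile):
    phys = rename_map.pop(dest)              # speculative value becomes architectural
    regfile[dest] = regfile[phys]
    free_list.append(phys)                   # recycle the physical register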

HW Register Renaming. [Figure: the CDB and virtual registers feed a rename-and-issue stage (V+R), followed by RF read, reservation stations, and FUs; a virtual register file (VRF) and a real register file (RRF) sit beside the ROB, which maps virtual to physical registers.]

Register renaming vs. ROB: instruction commit is simpler than with a ROB; deallocating registers is more complex; the dynamic mapping of architectural to physical registers complicates design and debugging. Used in the PowerPC 603/604, Pentium II/III/4, MIPS 10000/12000, Alpha 21264; 20 to 80 registers are added.

An example: organization of the Intel Pentium Pro and PowerPC 604. [Figure: the PC and branch prediction drive the instruction cache and instruction queue; a decode/dispatch unit feeds six reservation stations in front of the functional units (branch, two integer, floating point, complex integer, load/store); results flow through the reorder buffer and the commit unit back to the register file and data cache.]

Speculating through multiple branches. Speculating on multiple branches simultaneously is a benefit in the case of: very high branch frequency, or significant clustering of branches, or long delays in the functional units. It complicates speculation recovery, but is otherwise straightforward. More complex: predicting and speculating on more than one branch per cycle.

Limitations of ILP? Basic questions: how much ILP can be found in applications; what is needed to exploit more ILP. Both compiler technology and architecture are instrumental in pushing these limits!

Limitations of ILP? Hardware model adopted to perform the evaluations: an "ideal processor": all artificial constraints on ILP are removed; the only limitations are due to the actual data flow through registers and/or memory. Assumptions: 1. Register renaming: an infinite number of registers is available => all WAR and WAW hazards are avoided, and an unbounded number of instructions can begin execution simultaneously. 2. Branch prediction: perfect => all conditional branches are predicted exactly. 3. Jump prediction: all jumps are predicted perfectly => combined with the previous assumption, this leads to a processor with perfect speculation + an unbounded buffer of instructions available for execution. 4. Memory address alias analysis: all memory addresses are known perfectly, and a load can be moved before a store provided the addresses are not identical.

Hardware model. Assumptions 2 and 3 => no control dependencies; assumptions 1 and 4 eliminate all but true data dependencies. Any instruction can be scheduled on the cycle immediately following the execution of the predecessor upon which it depends => control and address speculation are subsumed as perfect.

Further initial assumptions: the CPU can issue an unlimited number of instructions at once, looking arbitrarily far ahead in the computation; no restrictions on the types of instructions that can be executed in one cycle (including loads and stores); all functional unit latencies = 1, so any sequence of dependent instructions can issue on successive cycles. Instructions in execution are said to be "in flight". Perfect caches = all loads and stores execute in one cycle => only the fundamental limits to ILP are taken into account. Obviously, the results obtained are VERY optimistic! (no such CPU can be realized). Benchmark programs used: six from SPEC92 (three FP-intensive ones, three integer ones).

Limits on window size. Dynamic analysis is necessary to approach perfect branch prediction (impossible at compile time!). A perfect dynamically-scheduled CPU should:
1. Look arbitrarily far ahead to find a set of instructions to issue, predicting all branches perfectly;
2. Rename all register uses (=> no WAW, WAR hazards);
3. Determine whether there are data dependencies among the instructions in the issue packet, renaming if necessary;
4. Determine whether memory dependencies exist among the issuing instructions, and handle them;
5. Provide enough replicated functional units to allow all ready instructions to issue.

Limits on window size, maximum issue count: the analysis is quite complex! E.g., determining the data dependencies among n issuing register-register instructions (number of registers unbounded) means comparing the two sources of each instruction against the destinations of all earlier instructions in the packet, for a total number of comparisons of sum_{i=1}^{n-1} 2i = 2 * (n-1)n/2 = n^2 - n.

Limits on window size, maximum issue count: window size = 2000 => almost four million comparisons! Even issuing 50 instructions requires 2450 comparisons => the number of instructions to be considered for issue is obviously limited! Existing CPUs: a limited number of registers + search for dependence pairs + in-order issue limit the cost; dependent instructions are handled by the renaming process => dependent renaming in one clock cycle; once instructions are issued, detection of dependencies is handled in a distributed fashion by the reservation stations or scoreboard. (A quick numeric check follows below.)

Limits on window size, maximum issue count: all instructions in the window must be kept in the processor => the number of comparisons required at each cycle = maximum completion rate x window size x number of operands per instruction => total window size is limited by storage + comparisons + limited issue rate (today: window sizes requiring over 2000 comparisons!).
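A quick check of the n^2 - n comparison count quoted above (plain Python, illustrative only):

def comparisons(n):
    return n * n - n          # sum of 2i for i = 1 .. n-1

print(comparisons(2000))      # 3998000 -> "almost four million"
print(comparisons(50))        # 2450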

Limits on window size, maximum issue count. Real CPUs have a more limited number of functional units (e.g., no more than 2 memory references per clock, nor 2 FP operations) + a limited number of buses + a limited number of RF ports => all limit the number of instructions initiated in the same clock => the maximum number of instructions that may issue, begin execution, or commit in the same clock cycle is much smaller than the window size.

Limits on window size, maximum issue count. Experimental data: the maximum parallelism uncovered falls sharply as the window size is reduced; for the best-performing program, from 150 for an infinite window to 35 for a 128-instruction window. At low window sizes (32 or less) all programs exhibit more or less the same level of parallelism.

Effects of realistic branch and jump prediction. Perfect branch prediction is obviously impossible: when prediction is not highly accurate, mispredicted branches become a barrier to finding parallelism. Branch prediction mechanisms are a major point of optimization in leading-edge CPUs.

Effects of finite registers. Reducing the number of registers available for renaming has a great impact on the extraction of available parallelism, increasingly relevant with an increasing intrinsic level of parallelism in a benchmark!

Imperfect Alias Analysis. Perfect analysis at compile time is impossible (run-time computed memory references, pointer-accessed variables, etc.). Run-time alias analysis, a priori (if no constraint is placed on the number of simultaneous memory references), requires an unlimited number of comparisons.

Imperfect Alias Analysis. Consider three models of memory alias analysis, in addition to perfect analysis: 1. Global/stack perfect: assumes perfect prediction for all global and stack references, and a conflict on all heap references (based on improvements in compiler technology). 2. Inspection: accesses are examined to see if they can be determined not to interfere at compile time; also, accesses based on registers that point to different allocation areas (e.g., the global area and the stack area) are assumed never to alias. 3. None: all memory references are assumed to conflict.

Imperfect Alias Analysis. Model 1 gives results quite similar to perfect alias analysis; model 2 is not much better than model 3. In practice, dynamically scheduled CPUs rely on dynamic memory disambiguation. Three factors limit its efficiency: 1. To achieve perfect dynamic disambiguation for a load, it is necessary to know the memory addresses of all previous stores that have not yet committed. Memory address speculation: the dependency is assumed not to exist, or else is predicted through a HW mechanism, and the load is stalled if a dependency is predicted. To check prediction correctness, the CPU examines the destination address of each completing store that precedes the given load in program order; if a dependency that should have been enforced occurs, the CPU uses a speculative restart mechanism to redo the load and the following instructions (supported with a suitable instruction-set extension). 2. Only a small number of memory references can be disambiguated per clock cycle. 3. The number of load/store buffers determines how much earlier or later in the instruction stream a load or a store can be moved.

Multiple-Issue Processors

Multiple-Issue Processors. Basic idea to get CPI < 1: issuing multiple instructions per cycle. Two variations: Superscalar, and (Very) Long Instruction Word ((V)LIW). Anticipated success led to the use of Instructions Per Clock cycle (IPC) vs. CPI.

Multiple-Issue Processors. Superscalar: a varying number of instructions/cycle (1 to 8), scheduled by the compiler (statically) or by HW (dynamically, by Tomasulo): IBM PowerPC, Sun UltraSparc, DEC Alpha, HP 8000. (Very) Long Instruction Word ((V)LIW): a fixed number of instructions (4-16) scheduled by the compiler, which puts the ops into wide templates: the joint HP/Intel agreement in 1999/2000, Intel Architecture-64 (IA-64), a 64-bit address style named Explicitly Parallel Instruction Computer (EPIC), with explicit dependences in the issue packet marked by the compiler; for example, Itanium.

Superscalar Processors.

Superscalar Processors. Issue a varying number of instructions at each clock cycle. If instructions are dependent, only the consecutive ready instructions are issued (in-order issue). This decision is made at run time by the processor => variability in the issue rate (dynamic issue capability).

Superscalar Processors can be: statically scheduled (do not allow instructions behind stalls to proceed); dynamically scheduled (allow instructions behind RAW hazards to proceed); dynamically scheduled and speculative.

How to optimize code for Superscalar Processors (1). Original loop:
Loop: LD   F0,0(R1)
      ADDD F4,F0,F2
      SD   0(R1),F4
      SUBI R1,R1,#8
      BNEZ R1,LOOP

Unrolled loop:
Loop: LD   F0,0(R1)
      LD   F6,-8(R1)
      LD   F10,-16(R1)
      LD   F14,-24(R1)
      ADDD F4,F0,F2
      ADDD F8,F6,F2
      ADDD F12,F10,F2
      ADDD F16,F14,F2
      SD   0(R1),F4
      SD   -8(R1),F8
      SD   -16(R1),F12
      SUBI R1,R1,#32
      BNEZ R1,LOOP
      SD   8(R1),F16   ; 8-32 = -24

The loop is unrolled 4 times (load/addd/store): RAW hazards have been reduced, but there are resource conflicts on the pipelines (cannot execute 2 LDs in parallel).

76 How to optimize code for Superscalar processors (2)

          Integer instruction     FP instruction        clk_cycle
    Loop: LD   F0,0(R1)                                     1
          LD   F6,-8(R1)                                    2
          LD   F10,-16(R1)        ADDD F4,F0,F2             3
          LD   F14,-24(R1)        ADDD F8,F6,F2             4
          LD   F18,-32(R1)        ADDD F12,F10,F2           5
          SD   0(R1),F4           ADDD F16,F14,F2           6
          SD   -8(R1),F8          ADDD F20,F18,F2           7
          SD   -16(R1),F12                                  8
          SD   -24(R1),F16                                  9
          SUBI R1,R1,#40                                   10
          BNEZ R1,LOOP                                     11
          SD   8(R1),F20          ; 8-40 = -32             12

The loop is unrolled 5 times and scheduled for a dual-issue pipeline (one integer and one FP instruction per cycle). 151 Superscalar Processors: Examples

77 The PowerPC 620 ['94] Superscalar Architecture Similar to the MIPS R10000 and HP PA 8000. Fetch, issue, and completion of up to 4 instructions per clock cycle. Six separate execution units, each buffered by reservation stations. 153 PowerPC Architecture Speculative Tomasulo with register renaming: an extended register file holds the speculative result of an instruction until the instruction commits, and the ROB only enforces in-order commit. Advantage: operands are always available from a single location (no additional complex logic is needed to read result values out of the ROB)

78 PowerPC 620 architecture 155 PowerPC functional units 2 integer units (XSU0, XSU1) with 0-cycle latency [+, -, shift, ...]; 1 complex integer function unit (MCFXU) for integer multiply (pipelined) and divide (unpipelined), with latencies from 3 to 20 cycles; 1 load/store unit, with latency 1 for integer loads and 2 for FP loads

79 PowerPC functional units 1 FPU with latencies of 2 cycles for multiply, add, and multiply-add, and 31 cycles for DP FP divide (fully pipelined except for divide); 1 BRU, which completes branches and informs the fetch unit of mispredictions, and includes the condition register used for conditional branches. 157 PowerPC Pipeline Fetch: the fetch unit loads the decode queue with instructions from the cache. The next address is predicted through a 256-entry, two-way set-associative BTB; a branch prediction buffer (BPB) is used on a miss in the BTB

80 PowerPC Pipeline Instruction decode: instructions are decoded and inserted into an 8-entry instruction queue. Instruction issue: up to 4 instructions are taken from the 8-entry instruction queue and issued to the reservation stations; a rename register and a reorder buffer entry are allocated for each issued instruction. If either cannot be allocated, issue stalls (a sketch of this check appears below). 159 PowerPC Pipeline Execution: an instruction proceeds to execute when all its operands are available; at the end, the result is written on the result bus and the completion unit is notified that the instruction has completed
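A C sketch of the issue-stage resource check described above. This is an assumed model of the 620's issue logic, not its actual implementation: issue stalls as soon as a rename register or a ROB entry cannot be allocated for the next queued instruction.

    typedef struct { int free_rename_regs; int free_rob_entries; } resources_t;

    /* Try to issue up to `width` of the `queued` instructions, in order;
     * returns how many actually issued this cycle. */
    int issue(resources_t *r, int width, int queued) {
        int issued = 0;
        while (issued < width && issued < queued) {
            if (r->free_rename_regs == 0 || r->free_rob_entries == 0)
                break;              /* stall: no rename register or ROB slot */
            r->free_rename_regs--;  /* rename reg holds the speculative result */
            r->free_rob_entries--;  /* ROB entry enforces in-order commit */
            issued++;
        }
        return issued;
    }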

81 PowerPC Pipeline If the instruction is a mispredicted branch, the IFU and the IC(ompletion)U are notified: instruction fetch restarts, and the ICU discards all the speculated instructions after the branch and frees their rename buffers. Commit: when all previous instructions have committed, the result is committed to the RF and the rename buffer is freed. Stores also commit from the store buffer to memory. 161 Performance results IPC ranges from under 1 to 1.8. IPC=4 is not reached because: FUs are not replicated for every instruction type (structural hazards); instruction-level parallelism is limited, or buffering is insufficient

82 PowerPC G4e 32K instruction cache, 32K data cache, 7 pipeline stages. 163 G4e Pipeline Stages 1 and 2 - Instruction Fetch: fetches instructions from the L1 cache, with a fetch width of up to four instructions per clock cycle and 9 cycles of delay in the case of a miss

83 G4e Pipeline Stage 3 - Decode/Dispatch: gets instructions from a 12-entry instruction queue, decodes them, and sends each to the issue queue of its class. The G4e's decoder can dispatch up to three instructions per clock cycle to the next stage. 165 G4e Pipeline Stage 4 - Issue: gets an instruction from each of the (FIFO) issue queues: the Floating-Point Issue Queue (FIQ), the Vector Issue Queue (VIQ) (AltiVec), and the General Instruction Queue (GIQ), and puts it in a reservation station (enabling out-of-order execution)

84 G4e Pipeline Stage 5 - Execute: instructions pass from the reservation station queues into their respective functional units and are executed. Stages 6 and 7 - Complete and Write-Back. 167 P6 Processor Family: Intel Pentium Pro, II/III 3-way superscalar. Basic idea: three engines:

85 P6 Pipeline Fetch/Decode Unit: decodes instructions, converting them into micro-ops that represent the instruction's work, and puts them into the instruction pool in order. Dispatch/Execute Unit: out-of-order issue from the instruction pool through a reservation station, and out-of-order execution of micro-ops. Retire Unit: reorders the instructions and commits speculative results to the architectural state. 169 P6 Instruction Decode 8 pipeline stages. The decoder fetches 16 bytes per clock cycle from the cache; 3 parallel decoders convert most instructions into one or more triadic micro-ops, while some instructions need microcode (several micro-ops) to be executed. Throughput = 6 micro-ops per clock cycle. The Register Alias Table (RAT) unit converts logical register references into virtual register references (40 registers); a sketch of this renaming step follows. In-order issue to the reservation stations and reorder buffer
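A minimal C sketch of register renaming through a Register Alias Table, in the spirit of the P6 RAT but not Intel's actual logic; the sizes and the lack of free-list recycling are simplifying assumptions. Each architectural register maps to the newest physical register that will produce its value, which removes WAR and WAW hazards.

    #define NUM_ARCH_REGS 8    /* e.g., the x86 integer registers */
    #define NUM_PHYS_REGS 40   /* P6-style pool of renaming registers */

    typedef struct { int rat[NUM_ARCH_REGS]; int next_free; } rename_t;

    /* Rename one micro-op "dst <- src1 op src2"; returns the physical
     * register allocated for dst. */
    int rename_uop(rename_t *r, int dst, int src1, int src2,
                   int *phys_src1, int *phys_src2) {
        *phys_src1 = r->rat[src1];               /* current source mappings */
        *phys_src2 = r->rat[src2];
        int p = r->next_free++ % NUM_PHYS_REGS;  /* allocate a new phys reg */
        r->rat[dst] = p;                         /* later readers of dst see p */
        return p;
    }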

86 P6 Instruction Dispatch/Execute A 20-entry reservation station unit; out-of-order execution proceeds through the 3 pipeline stages of the dispatch/execute unit. A micro-op executes when: all its operands are ready, and the resource it needs is ready. Maximum throughput: 5 micro-ops/cycle. If a micro-op is a branch, its outcome is compared with the address predicted in the fetch phase; if mispredicted, the JEU changes the status of all the micro-ops behind the branch and removes them from the instruction pool. 171 P6 Instruction Retire The retire unit looks for micro-ops that have been executed and can be removed from the pool, writing results to their original architectural targets. This is done in order, by committing an instruction only if: all previous instructions have been committed, and the instruction itself has been executed. Up to 3 micro-ops can be retired per clock cycle

87 Pentium 4 New NetBurst micro-architecture; 20 pipeline stages (hyper-pipeline); 1.4 GHz to 2 GHz. 3 prefetching mechanisms: a hardware instruction prefetcher (based on the BTB); software-controlled data cache prefetching; an L3->L2 data and instruction hardware prefetcher

88 Pentium 4 Execution Trace Cache The TC stores decoded IA-32 instructions (micro-ops), removing decoding costs from the critical path: 12K micro-ops, with a fetch bandwidth of 3 micro-ops per cycle. It stores traces built across predicted branches; some instructions, however, still need microcode from ROM. 175 Pentium 4 The branch penalty can be much more than 10 cycles. A BTB is used; on a miss in the BTB, static prediction is applied (backward = taken, forward = not taken; see the sketch below). Software branch hints inserted during trace construction can override the static prediction
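The static prediction rule mentioned above (backward taken, forward not-taken) as a small C sketch. This is only the BTB-miss fallback policy, not Pentium 4's actual circuitry; the rationale is that backward branches usually close loops and therefore iterate.

    #include <stdbool.h>
    #include <stdint.h>

    /* Predict taken iff the branch jumps backward. */
    bool static_predict(uint32_t branch_pc, uint32_t target_pc) {
        return target_pc < branch_pc;   /* back = taken, forward = not taken */
    }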

89 Pentium 4 Execution Units and Issue Ports 177 Pentium 4 1 load and 1 store can issue each cycle. Loads can be reordered with respect to other loads and stores, and can execute speculatively; up to 4 load misses may be outstanding. Load/store forwarding is supported

90 X86 Frequency scaling The 286 through the P5 would run at similar clock rates if they were all implemented in the same silicon process technology (they have similar pipeline depths). 179 AMD Athlon K7 A nine-issue (micro-op), super-pipelined, superscalar x86 processor: multiple x86 instruction decoders (producing triadic micro-ops); three out-of-order, superscalar, fully pipelined floating-point execution units; three out-of-order, superscalar, pipelined integer units; three out-of-order, superscalar, pipelined address calculation units; a 72-entry instruction control unit (ROB)

91 AMD Athlon K7 181 AMD Athlon K7 The Instruction Control Unit contains a reorder buffer and distributed reservation stations that hold operands while OPs wait to be scheduled. The Integer Instruction Scheduler picks OPs for execution based on their operand availability and issues them to the functional units or the address generation units (a sketch of this selection step follows). The functional units perform transformations on data and return their results to the reorder buffer, while the address generation units send calculated memory addresses to the Load/Store Unit for further processing.
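A minimal out-of-order "select" step in the spirit of the scheduler described above; this is an assumed model, not AMD's implementation. Unlike the in-order issue sketch shown earlier for plain superscalar issue, any waiting OP whose operands are all ready may be picked, regardless of program order.

    #include <stdbool.h>

    #define WINDOW 72   /* matches the 72-entry instruction control unit */

    typedef struct { bool valid; bool src_ready[2]; } op_t;

    /* Returns the index of one ready OP to issue, or -1 if none is ready. */
    int select_ready(const op_t win[]) {
        for (int i = 0; i < WINDOW; i++)
            if (win[i].valid && win[i].src_ready[0] && win[i].src_ready[1])
                return i;   /* oldest-first scan; real selects may prioritize */
        return -1;
    }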
