Instruction Level Parallelism


Instruction Level Parallelism: Dynamic Scheduling, the Scoreboard Technique, the Tomasulo Algorithm, Speculation, the Reorder Buffer, Superscalar Processors.

Definition of ILP. ILP = potential overlap of execution among unrelated instructions. Overlapping is possible if there are: no Structural Hazards; no RAW, WAR, or WAW Hazards; no Control Hazards.

Pipeline CPI = Ideal CPI + Structural Stalls + Data Hazard Stalls + Control Stalls
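As a quick illustration (hypothetical stall counts, not data from the course): with an ideal CPI of 1 and, on average, 0.1 structural stalls, 0.6 data-hazard stalls, and 0.3 control stalls per instruction, Pipeline CPI = 1 + 0.1 + 0.6 + 0.3 = 2.0, i.e., half the ideal throughput.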

Instruction Level Parallelism. Two strategies to support ILP: Dynamic Scheduling: depend on the hardware to locate parallelism. Static Scheduling: rely on software to identify potential parallelism. Hardware-intensive approaches dominate the desktop and server markets.

Review: Summary of Pipelining Basics. Hazards limit performance: Structural: need more HW resources. Data: need forwarding, compiler scheduling. Control: early evaluation of the PC, delayed branch, branch prediction. Increasing the length of the pipe (superpipelining) increases the impact of hazards. Pipelining helps instruction bandwidth, not latency.

Review: Summary of Pipelining Basics. Interrupts, the instruction set, and FP make pipelining harder. Compilers reduce the cost of data and control hazards: load delay slots, branch delay slots, branch prediction. Today: longer pipelines; better branch prediction, more instruction parallelism?

Basic Assumptions. We consider single-issue processors. The Instruction Fetch stage precedes the Issue stage and may fetch either into an Instruction Register or into a queue of pending instructions; instructions are then issued from the IR or from the queue. The Execution stage may require multiple cycles, depending on the operation type.

Key Idea: Dynamic Scheduling. Problem: data dependences that cannot be hidden with bypassing or forwarding cause hardware stalls of the pipeline. Solution: allow instructions behind a stall to proceed; HW rearranges the instruction execution to reduce stalls. Enables out-of-order execution and completion (commit). First implemented in the CDC 6600 (1963).

Dynamic Scheduling. Advantages: enables handling cases of dependence unknown at compile time; simplifies the compiler; allows code compiled for one pipeline to run efficiently on a different pipeline. Disadvantages: significant increase in HW complexity; could generate imprecise exceptions.

Example 1:
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F12,F8,F14
RAW hazard: ADDD stalls for F0 (waiting for DIVD to complete). Without dynamic scheduling, SUBD would stall even though it is not data dependent on anything in the pipeline.

Example 2:
LD F6, 34(R2)
LD F2, 45(R3)
MULTD F0, F2, F4
SUBD F8, F6, F2
DIVD F10, F0, F6
ADDD F6, F8, F2
Analyze dependences and hazards (a dependence-classification sketch follows below).
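A minimal Python sketch (not from the slides) that classifies the RAW/WAR/WAW dependences in Example 2 by comparing each destination against later sources and destinations; it ignores intervening redefinitions, which is fine for this short sequence.

instrs = [
    ("LD",    "F6",  ["R2"]),
    ("LD",    "F2",  ["R3"]),
    ("MULTD", "F0",  ["F2", "F4"]),
    ("SUBD",  "F8",  ["F6", "F2"]),
    ("DIVD",  "F10", ["F0", "F6"]),
    ("ADDD",  "F6",  ["F8", "F2"]),
]

for i, (op_i, dst_i, srcs_i) in enumerate(instrs):
    for op_j, dst_j, srcs_j in instrs[i + 1:]:
        if dst_i in srcs_j:   # a later instruction reads an earlier result
            print(f"RAW on {dst_i}: {op_i} -> {op_j}")
        if dst_j in srcs_i:   # a later instruction writes a register read earlier
            print(f"WAR on {dst_j}: {op_i} -> {op_j}")
        if dst_j == dst_i:    # two instructions write the same register
            print(f"WAW on {dst_i}: {op_i} -> {op_j}")

Among others, this reports the RAW on F2 (LD -> MULTD), the WAR on F6 (DIVD -> ADDD), and the WAW on F6 (LD -> ADDD).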

Problems? How do we prevent WAR and WAW hazards? How do we deal with variable latency? Forwarding for RAW hazards becomes harder.

Pipeline timing for Example 2 (stages by clock cycle; hazard annotations from the original diagram):
LD F6,34(R2):   IF ID EX MEM WB
LD F2,45(R3):   IF ID EX MEM WB
MULTD F0,F2,F4: IF ID stall M1 M2 M3 M4 M5 M6 M7 M8 M9 M10 MEM WB   (RAW on F2)
SUBD F8,F6,F2:  IF ID A1 A2 MEM WB                                  (RAW on F2)
DIVD F10,F0,F6: IF ID stall stall stall stall stall stall stall stall stall D1 D2 ...  (RAW on F0)
ADDD F6,F8,F2:  IF ID A1 A2 MEM stall stall stall stall stall stall WB  (WAR on F6)

Scoreboard Dynamic Scheduling Algorithm

Scoreboard basic scheme. Out-of-order execution divides the ID stage:
1. Issue: decode instructions, check for structural hazards.
2. Read operands: wait until no data hazards, then read operands.
Instructions execute whenever they are not dependent on previous instructions and there are no hazards. The scoreboard allows instructions to execute whenever 1 & 2 hold, not waiting for prior instructions.

Scoreboard basic scheme. We distinguish when an instruction begins execution and when it completes execution: between the two times, the instruction is in execution. We assume the pipeline allows multiple instructions in execution at the same time; that requires multiple functional units, pipelined functional units, or both. CDC 6600 (1963): in-order issue, out-of-order execution, out-of-order completion (commit). No forwarding! Imprecise interrupt/exception model for now!

Scoreboard Architecture (CDC 6600). [Figure: the register file is connected to the functional units (two FP multipliers, an FP divider, an FP adder, and an integer unit) and to memory, all under control of the centralized SCOREBOARD.]

Scoreboard Scheme. The scoreboard replaces ID, EX, WB with 4 stages; the ID stage is split in two parts: Issue (decode and check structural hazards) and Read Operands (wait until no data hazards). The scoreboard allows instructions without dependencies to execute: in-order issue BUT out-of-order read operands => out-of-order execution and completion. All instructions pass through the issue stage in order, but they can be stalled or bypass each other in the read-operands stage and thus enter execution out of order, which implies out-of-order completion.

Scoreboard Implications. Out-of-order completion => WAR and WAW hazards can occur. Solutions for WAR: stall write-back until the registers have been read; read registers only during the Read Operands stage.

Scoreboard Implications. Solution for WAW: detect the hazard and stall issue of the new instruction until the other instruction completes. No register renaming. Need to have multiple instructions in the execution phase => multiple execution units or pipelined execution units. The scoreboard keeps track of dependencies and the state of operations.

Scoreboard Scheme. All hazard detection and resolution is centralized in the scoreboard: every instruction goes through the scoreboard, where a record of its data dependences is constructed. The scoreboard then determines when the instruction can read its operands and begin execution. If the scoreboard decides the instruction cannot execute immediately, it monitors every change and decides when the instruction can execute. The scoreboard also controls when the instruction can write its result into the destination register.

Exception handling. Problem with out-of-order completion: must preserve exception behavior as in in-order execution. Solution: ensure that no instruction can generate an exception until the processor knows that the instruction raising the exception will be executed.

Imprecise exceptions. An exception is imprecise if the processor state when the exception is raised does not look exactly as if the instructions were executed in order. Imprecise exceptions can occur because: the pipeline may have already completed instructions that are later in program order than the instruction causing the exception; the pipeline may have not yet completed some instructions that are earlier in program order than the instruction causing the exception. Imprecise exceptions make it difficult to restart execution after handling.

Four Stages of Scoreboard Control. 1. Issue: decode the instruction and check for structural hazards & WAW hazards. Instructions are issued in program order (for hazard checking). If a functional unit for the instruction is free and no other active instruction has the same destination register (no WAW), the scoreboard issues the instruction to the functional unit and updates its internal data structures. If a structural hazard or a WAW hazard exists, then instruction issue stalls, and no further instructions will issue until these hazards are cleared.

Four Stages of Scoreboard Control. Note that when the issue stage stalls, it causes the buffer between Instruction Fetch and Issue to fill: if the buffer has a single entry, IF stalls; if the buffer is a queue of multiple instructions, IF stalls when the queue fills.

Four Stages of Scoreboard Control. 2. Read Operands: wait until no data hazards, then read operands. A source operand is available if no earlier issued active instruction will write it, or if a functional unit is currently writing its value into the register. When the source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution. RAW hazards are resolved dynamically in this step, and instructions may be sent into execution out of order. No forwarding of data in this model.

Four Stages of Scoreboard Control. 3. Execution: operate on the operands. The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution. FUs are characterized by: latency (the effective time used to complete one operation) and initiation interval (the number of cycles that must elapse between issuing two operations to the same functional unit).

Four Stages of Scoreboard Control. 4. Write result: check for WAR hazards and finish execution. Once the scoreboard is aware that the functional unit has completed execution, it checks for WAR hazards. If none, it writes the result. If there is a WAR, it stalls the completing instruction.

WAR/WAW Example:
DIVD F0,F2,F4
ADDD F6,F0,F8
SUBD F8,F8,F14
MULTD F6,F10,F8
WAR: SUBD writes F8, which ADDD must still read. WAW: MULTD writes F6, the same destination as ADDD. The scoreboard would stall SUBD in the WB stage, waiting until ADDD reads F0 and F8, and MULTD in the issue stage until ADDD writes F6. Both can be solved through register renaming.

Scoreboard structure: three parts.
1. Instruction status.
2. Functional Unit status. Indicates the state of the functional unit (FU): Busy - indicates whether the unit is busy or not. Op - the operation to perform in the unit (+, -, etc.). Fi - destination register. Fj, Fk - source register numbers. Qj, Qk - functional units producing source registers Fj, Fk. Rj, Rk - flags indicating when Fj, Fk are ready; the flags are set to No after the operands are read.
3. Register result status. Indicates which functional unit will write each register; blank if no pending instruction will write that register.

Detailed Scoreboard Pipeline Control:

Issue - wait until: the FU is not busy and Result(D) is empty (no WAW). Bookkeeping: Busy(FU) <- Yes; Op(FU) <- op; Fi(FU) <- D; Fj(FU) <- S1; Fk(FU) <- S2; Qj <- Result(S1); Qk <- Result(S2); Rj <- not Qj; Rk <- not Qk; Result(D) <- FU.

Read operands - wait until: Rj and Rk. Bookkeeping: Rj <- No; Rk <- No.

Execution complete - wait until: functional unit done.

Write result - wait until: for all f, (Fj(f) != Fi(FU) or Rj(f) = No) and (Fk(f) != Fi(FU) or Rk(f) = No). Bookkeeping: for all f, if Qj(f) = FU then Rj(f) <- Yes; for all f, if Qk(f) = FU then Rk(f) <- Yes; Result(Fi(FU)) <- 0; Busy(FU) <- No.

Scoreboard Example. Instructions:
LD F6, 34+R2; LD F2, 45+R3; MULTD F0, F2, F4; SUBD F8, F6, F2; DIVD F10, F0, F6; ADDD F6, F8, F2.
Three tables are tracked at each clock cycle: the Instruction status (Issue / Read operands / Execution complete / Write result cycles for each instruction), the Functional unit status (Time, Name, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk for the Integer, Mult1, Mult2, Add, and Divide units), and the Register result status (which FU will write each of F0...F30). Initially all units are not busy and the register result status is blank. A code sketch of these structures and checks follows.
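A minimal Python sketch of the structures and wait conditions above (field and unit names are assumptions for illustration, not the CDC 6600 hardware):

class FUStatus:
    def __init__(self):
        self.busy = False
        self.op = self.Fi = self.Fj = self.Fk = None
        self.Qj = self.Qk = None     # FUs that will produce the source operands
        self.Rj = self.Rk = False    # source operands ready (and not yet read)?

fus = {"Integer": FUStatus(), "Mult1": FUStatus(), "Mult2": FUStatus(),
       "Add": FUStatus(), "Divide": FUStatus()}
result = {}                          # register result status: reg -> producing FU

def can_issue(fu_name, dest):
    # stall on a structural hazard (FU busy) or a WAW hazard (pending write to dest)
    return not fus[fu_name].busy and dest not in result

def issue(fu_name, op, dest, s1, s2):
    fu = fus[fu_name]
    fu.busy, fu.op, fu.Fi, fu.Fj, fu.Fk = True, op, dest, s1, s2
    fu.Qj, fu.Qk = result.get(s1), result.get(s2)
    fu.Rj, fu.Rk = fu.Qj is None, fu.Qk is None
    result[dest] = fu_name

def read_operands(fu_name):          # allowed only when Rj and Rk hold (no RAW)
    fu = fus[fu_name]
    if fu.Rj and fu.Rk:
        fu.Rj = fu.Rk = False        # flags cleared once the operands are read
        return True
    return False

def can_write_result(fu_name):       # WAR check: nobody still has to read the old Fi
    fu = fus[fu_name]
    return all((f.Fj != fu.Fi or not f.Rj) and (f.Fk != fu.Fi or not f.Rk)
               for f in fus.values() if f is not fu)

def write_result(fu_name):
    fu = fus[fu_name]
    for f in fus.values():           # wake up every unit waiting on this result
        if f.Qj == fu_name: f.Qj, f.Rj = None, True
        if f.Qk == fu_name: f.Qk, f.Rk = None, True
    if result.get(fu.Fi) == fu_name:
        del result[fu.Fi]
    fu.busy = False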

16 Scoreboard Example: Cycle 1 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 LD F2 45+ R3 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F6 R2 Yes Mult1 No Mult2 No Add No Divide No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 1 FU Integer 31 Scoreboard Example Cycle 2 Instruction status Read ExecutionWrite Instruction j k Issue operands complete Result LD F6 34+ R2 1 2 LD F2 45+ R3 MULT F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDDF6 F8 F2 Functional unit status dest S1 S2 FU for j FU for k Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F6 R2 Yes Mult1 No Mult2 No Add No Divide No Register result status Clock F0 F2 F4 F6 F8 F10 F12... F30 2 FU Integer Issue 2nd load? Integer Pipeline Full Cannot exec 2 nd Load due to structural hazard on Integer Unit Issue stalls 32 16

17 Scoreboard Example Cycle 3 Instruction status Read ExecutionWrite Instruction j k Issue operands complete Result LD F6 34+ R LD F2 45+ R3 MULT F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDDF6 F8 F2 Functional unit status dest S1 S2 FU for j FU for k Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F6 R2 Yes Mult1 No Mult2 No Add No Divide No Register result status Clock F0 F2 F4 F6 F8 F10 F12... F30 3 FU Integer Issue stalls 33 Scoreboard Example: Cycle 4 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R3 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Mult1 Mult2 Add Divide No No No No No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 4 FU Integer Issue stalls Write F

18 Scoreboard Example: Cycle 5 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R3 5 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F2 R3 Yes Mult1 No Mult2 No Add No Divide No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 5 FU Integer The 2 nd load is issued 35 Scoreboard Example: Cycle 6 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R3 5 6 MULTD F0 F2 F4 6 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F2 R3 Yes Mult1 Yes Mult F0 F2 F4 Integer No Yes Mult2 No Add No Divide No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 6 FU Mult1 Integer MULT is issued but has to wait for F2 from LOAD (RAW Hazard on F2) 36 18

19 Scoreboard Example: Cycle 7 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 SUBD F8 F6 F2 7 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F2 R3 No Mult1 Yes Mult F0 F2 F4 Integer No Yes Mult2 No Add Yes Sub F8 F6 F2 Integer Yes No Divide No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 7 FU Mult1 Integer Add Read multiply operands? Now SUBD can be issued to ADD Functional Unit 37 Scoreboard Example: Cycle 8a (First half of clock cycle) Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 SUBD F8 F6 F2 7 DIVD F10 F0 F6 8 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F2 R3 No Mult1 Yes Mult F0 F2 F4 Integer No Yes Mult2 No Add Yes Sub F8 F6 F2 Integer Yes No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 8 FU Mult1 Integer Add Divide DIVD is issued but there is another RAW hazard (F0) from MULTD -> DIVD has to wait for F

20 Scoreboard Example: Cycle 8b (Second half of clock cycle) Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 SUBD F8 F6 F2 7 DIVD F10 F0 F6 8 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Yes Mult F0 F2 F4 Yes Yes Mult2 No Add Yes Sub F8 F6 F2 Yes Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 8 FU Mult1 Add Divide Load completes (Writes F2), and operands for MULT an SUBD are ready 39 Scoreboard Example: Cycle 9 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Note Remaining Integer No 10 Mult1 Yes Mult F0 F2 F4 Yes Yes Mult2 No 2Add Yes Sub F8 F6 F2 Yes Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 9 FU Mult1 Add Divide Read operands for MULTD & SUBD. Issue ADDD? No for structural hazard on ADD Functional Unit MULTD and SUBD are sent in execution in parallel: Latency of 10 cycles for MULTD and 2 cycles for SUBD 40 20

21 Scoreboard Example: Cycle 10 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 9Mult1 Yes Mult F0 F2 F4 No No Mult2 No 1Add Yes Sub F8 F6 F2 No No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 10 FU Mult1 Add Divide 41 Scoreboard Example: Cycle 11 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 8Mult1 Yes Mult F0 F2 F4 No No Mult2 No 0Add Yes Sub F8 F6 F2 No No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 11 FU Mult1 Add Divide SUBD ends execution 42 21

22 Scoreboard Example: Cycle 12 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 7Mult1 Yes Mult F0 F2 F4 No No Mult2 No Add No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 12 FU Mult1 Divide SUBD writes result in F8 43 Scoreboard Example: Cycle 13 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F2 13 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 6Mult1 Yes Mult F0 F2 F4 No No Mult2 No Add Yes Add F6 F8 F2 Yes Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 13 FU Mult1 Add Divide ADDD can be issued DIVD still waits for operand F0 from MULTD 44 22

23 Scoreboard Example: Cycle 14 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 5Mult1 Yes Mult F0 F2 F4 No No Mult2 No 2 Add Yes Add F6 F8 F2 Yes Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 14 FU Mult1 Add Divide ADDD reads operands 45 Scoreboard Example: Cycle 15 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 4Mult1 Yes Mult F0 F2 F4 No No Mult2 No 1 Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 15 FU Mult1 Add Divide ADDD starts execution 46 23

24 Scoreboard Example: Cycle 16 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 3Mult1 Yes Mult F0 F2 F4 No No Mult2 No 0 Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 16 FU Mult1 Add Divide ADDD ends execution 47 Scoreboard Example: Cycle 17 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F WAR Hazard! Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 2Mult1 Yes Mult F0 F2 F4 No No Mult2 No Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 17 FU Mult1 Add Divide Why not write result of ADD??? DIVD must first read F6 (before ADDD write F6), but DIVD cannot read operands until MULTD writes F

25 Scoreboard Example: Cycle 18 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 1Mult1 Yes Mult F0 F2 F4 No No Mult2 No Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 18 FU Mult1 Add Divide 49 Scoreboard Example: Cycle 19 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 0Mult1 Yes Mult F0 F2 F4 No No Mult2 No Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 19 FU Mult1 Add Divide MULTD ends execution 50 25

26 Scoreboard Example: Cycle 20 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 No Mult2 No Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Yes Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 20 FU Add Divide MULTD writes in F0 51 Scoreboard Example: Cycle 21 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 No Mult2 No Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Yes Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 21 FU Add Divide DIVD can read operands WAR Hazard is now gone

27 Scoreboard Example: Cycle 22 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 No Mult2 No Add No 39 Divide Yes Div F10 F0 F6 No No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 22 FU Divide DIVD has read its operands in previous cycle ADDD can write the result in F6 53 (skipping some cycles ) 54 27

28 Scoreboard Example: Cycle 61 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 No Mult2 No Add No 0 Divide Yes Div F10 F0 F6 No No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 61 FU Divide DIVD ends execution 55 Scoreboard Example: Cycle 62 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Mult1 Mult2 Add Divide No No No No No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 62 FU DIVD writes in F

Review: Scoreboard Example, Cycle 62 (final state, same table as above): in-order issue; out-of-order execute & commit.

CDC 6600 Scoreboard. Speedup of 2.5 w.r.t. no dynamic scheduling; speedup of 1.7 from compiler reorganization of the instructions; BUT slow memory (no cache) limits the benefit. Limitations of the 6600 scoreboard: no forwarding hardware; limited to instructions in a basic block (small window); small number of functional units (structural hazards), especially integer/load-store units; does not issue on structural hazards; waits for WAR hazards; prevents WAW hazards.

Summary. Instruction Level Parallelism (ILP) in SW or HW; loop-level parallelism is easiest to see. SW parallelism: dependencies are defined for the program; hazards arise if the HW cannot resolve them. SW dependencies and compiler sophistication determine whether the compiler can unroll loops; memory dependencies are the hardest to determine. HW exploiting ILP: works when dependence cannot be known at compile time; code for one machine runs well on another. Key idea of the Scoreboard: allow instructions behind a stall to proceed (Decode => Issue instr & read operands); enables out-of-order execution => out-of-order completion. The ID stage is checked for both structural and data dependencies.

Tomasulo Dynamic Scheduling Algorithm

Tomasulo Algorithm. Another dynamic scheduling algorithm: allows execution to proceed in the presence of dependences. Invented at IBM 3 years after the CDC 6600, for the IBM 360/91. Same goal: high performance without special compilers. Led to: Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC.

Tomasulo Algorithm vs. Scoreboard: control & buffers distributed with the Functional Units (FUs) vs. centralized in the scoreboard; the FU buffers are called reservation stations and hold pending operands. Registers in instructions are replaced by values or pointers to reservation stations (RS); this is called register renaming; it avoids WAR and WAW hazards by renaming results using RS numbers. More reservation stations than registers, so it can do optimizations compilers can't. Results go to the FUs from the RSs, not through the registers, over a Common Data Bus that broadcasts results to all FUs. Loads and stores are treated as FUs with RSs as well. Integer instructions can go past branches, allowing FP ops beyond the basic block in the FP queue.

Tomasulo Algorithm Basics. The control logic and the buffers are distributed with the FUs (vs. centralized in the scoreboard). Operand buffers are called reservation stations; each instruction is an entry of a reservation station, and its operands are replaced by values or pointers (Register Renaming).

Tomasulo Algorithm Basics. Register renaming makes it possible to avoid WAR and WAW hazards. There are more reservation stations than registers (so it can do better optimizations than a compiler). Results are dispatched to the other FUs through a Common Data Bus (CDB). Load/stores are treated as FUs.

Tomasulo Algorithm for an FPU.

Reservation Station Components: Tag identifying the RS. OP = the operation to perform. Vj, Vk = values of the source operands (Vk holds the offset for loads). Qj, Qk = pointers to the RSs that will produce Vj, Vk (a zero value = the source operand is already available in Vj or Vk). Busy = indicates the RS is busy. Note: only one of the V-field or the Q-field is valid for each operand. (A sketch of one entry follows below.)
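A minimal Python sketch of one reservation-station entry (field names follow the slide; the class itself is illustrative, not the IBM 360/91 design):

from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    tag: str                     # identifies this RS (e.g. "Add1")
    busy: bool = False
    op: Optional[str] = None     # operation to perform
    Vj: Optional[float] = None   # value of the first source operand, if available
    Vk: Optional[float] = None   # value of the second source operand (offset for loads)
    Qj: Optional[str] = None     # tag of the RS producing Vj (None = Vj is valid)
    Qk: Optional[str] = None     # tag of the RS producing Vk (None = Vk is valid)

    def ready(self) -> bool:
        # execution may begin once both Q fields are clear (both operands present)
        return self.busy and self.Qj is None and self.Qk is None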

Other components. The RF and the store buffer have a Value (V) and a Pointer (Q) field. The Pointer (Q) field holds the number of the reservation station producing the result to be stored in the RF or store buffer; if zero, no active instruction is producing the result (the RF or store buffer content is the correct value). Load buffers have an address field (A) and a busy field. Store buffers also have an address field (A). A holds the info for the memory address calculation of the load/store: initially it contains the instruction offset (immediate field); after address calculation it stores the effective address.

First stage of Tomasulo Algorithm: ISSUE. Get an instruction I from the head of the instruction queue (maintained in FIFO order to ensure the correct data flow). If it is an FP op, check whether an RS is empty (i.e., check for structural hazards): if there is no empty RS => structural hazard, and the instruction stalls. If the operands are not in the RF, keep track of the FUs that will produce them.

First stage of Tomasulo Algorithm: ISSUE. Rename registers. WAR resolution: if I writes Rx, which is read by an instruction K already issued, K already knows the value of Rx or knows what instruction will write it; so the RF can be linked to I. WAW resolution: since we use in-order issue, the RF can be linked to I.

Second stage of Tomasulo Algorithm: Execution. When both operands are ready, execute; if they are not ready, watch the Common Data Bus for results. By delaying execution until the operands are available, RAW hazards are avoided. Notice that several instructions could become ready in the same clock cycle for the same FU.

Second stage of Tomasulo Algorithm. Loads and stores: a two-step execution process. First step: compute the effective address when the base register is available, and place it in the load or store buffer. Loads in the load buffer execute as soon as the memory unit is available; stores in the store buffer wait for the value to be stored before being sent to the memory unit. Loads and stores are kept in program order through the effective-address calculation, which helps in preventing hazards through memory. To preserve exception behavior: no instruction can initiate execution until all branches preceding it in program order have completed. If branch prediction is used, the CPU must know that the prediction was correct before beginning execution of the following instructions. (Speculation allows better results!)

Third stage of Tomasulo Algorithm: Write result. When the result is available, write it on the Common Data Bus and from there into the RF and into all RSs (including store buffers) waiting for this result; stores also write data to memory during this stage. Mark the reservation station available. (A sketch of the broadcast follows below.)
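A sketch of the Write Result broadcast (a hypothetical helper building on the ReservationStation sketch above, not the IBM 360/91 logic verbatim): one broadcast on the CDB wakes up every waiting consumer.

def broadcast_on_cdb(tag, value, stations, registers, reg_status):
    """tag = RS that produced value; every consumer snooping the CDB matches on it."""
    for rs in stations:                        # all RSs (and store buffers) snoop the bus
        if rs.Qj == tag:
            rs.Vj, rs.Qj = value, None
        if rs.Qk == tag:
            rs.Vk, rs.Qk = value, None
    for reg, src in list(reg_status.items()):  # the register file snoops too
        if src == tag:
            registers[reg] = value
            del reg_status[reg]                # no pending producer for reg anymore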

The Common Data Bus. A common data bus is a data+source bus. In the IBM 360/91: data = 64 bits, source = 4 bits. Each FU must perform an associative lookup in its RSs.

Tomasulo algorithm (some details). Loads and stores go through a functional unit for effective-address computation before proceeding to the load and store buffers. Loads take a second execution step to access memory, then go to Write Result to send the value from memory to the RF and/or RSs. Stores complete their execution in their Write Result stage (writing data to memory). All writes occur in Write Result, simplifying the Tomasulo algorithm.

Tomasulo algorithm (some details). A load and a store can be done in a different order, provided they access different memory locations; otherwise, a WAR (interchanging a load-store sequence) or a RAW (interchanging a store-load sequence) may result (a WAW if two stores are interchanged). Two loads can be reordered freely. To detect such hazards, the data memory addresses associated with any earlier memory operation must have been computed by the CPU (e.g., address computation executed in program order).

Tomasulo algorithm (some details). Load executed out of order with respect to a previous store: assume addresses are computed in program order. When the load address has been computed, it can be compared with the A fields in the active store buffers: in the case of a match, the load is not sent to the load buffer until the conflicting store completes. Stores must check for matching addresses in both load and store buffers (dynamic disambiguation, an alternative to the static disambiguation performed by the compiler). Drawback: the amount of hardware required. Each RS must contain a fast associative buffer, and the single CDB may limit performance. (A sketch of the check follows below.)
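A minimal sketch of the dynamic disambiguation checks just described (buffer entries with busy and A fields, as assumed above):

def load_may_proceed(load_addr, store_buffers):
    # a load is held back if any active store buffer targets the same address
    return all(not sb.busy or sb.A != load_addr for sb in store_buffers)

def store_may_proceed(store_addr, load_buffers, store_buffers):
    # stores must check both buffers for a matching effective address
    no_load_clash = all(not lb.busy or lb.A != store_addr for lb in load_buffers)
    no_store_clash = all(not sb.busy or sb.A != store_addr for sb in store_buffers)
    return no_load_clash and no_store_clash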

39 Tomasulo s example Cycle 1 Instruction status Write Instruction j k Issue Execute Result LD F6 34+ R2 1 LD F2 45+ R3 MULTF0 F2 F4 SUBDF8 F6 F2 DIVD F10 F0 F6 ADDDF6 F8 F2 v1 q1 v2 q2 v1 q1 v2 q2 Load1 34 v(r2) add1 Load2 add2 EXLoad EXADD mult1 mult2 EXMUL v1 q1 v2 q2 RF q Load1 77 Tomasulo s example Cycle 2 Instruction status Write Instruction j k Issue Execute Result LD F6 34+ R2 1 2 LD F2 45+ R3 2 MULTF0 F2 F4 SUBDF8 F6 F2 DIVD F10 F0 F6 ADDDF6 F8 F2 v1 q1 v2 q2 v1 q1 v2 q2 v(r2) Load1 34 add1 Load2 45 v(r3) add2 EXLoad 34 v(r2) EXADD mult1 mult2 EXMUL v1 q1 v2 q2 RF q Load2 Load

40 Tomasulo s example Cycle 3 Instruction status Write Instruction j k Issue Execute Result LD F6 34+ R2 1 2 LD F2 45+ R3 2 MULTF0 F2 F4 3 SUBDF8 F6 F2 DIVD F10 F0 F6 ADDDF6 F8 F2 v1 q1 v2 q2 v1 q1 v2 q2 Load1 34 v(r2) add1 Load2 45 v(r3) add2 EXLoad 34 v(r2) EXADD v1 q1 v2 q2 mult1 Load2 v(f4) mult2 EXMUL RF q mult1 Load2 Load1 79 Tomasulo s example Cycle 4 Instruction status Write Instruction j k Issue Execute Result LD F6 34+ R LD F2 45+ R3 2 MULTF0 F2 F4 3 SUBDF8 F6 F2 4 DIVD F10 F0 F6 ADDDF6 F8 F2 v1 q1 v2 q2 v1 q1 v2 q2 v(r2) v(f6) load2 Load1 34 add1 Load2 45 v(r3) add2 EXLoad 34 v(r2) EXADD CDB v1 q1 v2 q2 Load2 v(f4) mult1 mult2 EXMUL RF q mult1 Load2 v(f6) add1 Forwarding is provided Writes on RF (F6) and RS through CDB 80 40

41 Tomasulo s example Cycle 5 Instruction status Write Instruction j k Issue Execute Result LD F6 34+ R LD F2 45+ R3 2 5 MULTF0 F2 F4 3 SUBDF8 F6 F2 4 DIVD F10 F0 F6 5 ADDDF6 F8 F2 v1 q1 v2 q2 v1 q1 v2 q2 v(f6) load2 Load1 add1 Load2 add2 45 v(r3) EXLoad 45 v(r3) EXADD v1 q1 v2 q2 Load2 v(f4) mult1 mult2 v(f6) mult1 EXMUL RF q mult1 Load2 v(f6) add1 mult2 81 Tomasulo s example Cycle 6 Instruction status Write Instruction j k Issue Execute Result LD F6 34+ R LD F2 45+ R3 2 5 MULTF0 F2 F4 3 SUBDF8 F6 F2 4 DIVD F10 F0 F6 5 ADDDF6 F8 F2 6 v1 q1 v2 q2 v1 q1 v2 q2 v(f6) load2 Load1 add1 Load2 add2 load2 45 v(r3) add1 EXLoad 45 v(r3) EXADD v1 q1 v2 q2 Load2 v(f4) mult1 mult2 v(f6) mult1 EXMUL RF q mult1 Load2 add2 add1 mult2 WAR on F6 has been eliminated: ADDD will write in F6 and DIVD has already read v(f6) in v

42 Tomasulo s example Cycle 7 Instruction status Write Instruction j k Issue Execute Result LD F6 34+ R LD F2 45+ R MULTF0 F2 F4 3 SUBDF8 F6 F2 4 DIVD F10 F0 F6 5 ADDDF6 F8 F2 6 v1 q1 v2 q2 v1 q1 v2 q2 Load1 add1 v(f6) v(f2) Load2 45 v(r3) add2 add1 v(f2) EXLoad 45 v(r3) CDB EXADD v1 q1 v2 q2 mult1 v(f2) v(f4) mult2 v(f6) mult1 EXMUL RF q mult1 v(f2) add2 add1 mult2 Forwarding is provided Writes on RF (F2) and RSs through CDB 83 Tomasulo s example Cycle 8 Instruction status Write Instruction j k Issue Execute Result LD F6 34+ R LD F2 45+ R MULTF0 F2 F4 3 8 SUBDF8 F6 F2 4 8 DIVD F10 F0 F6 5 ADDDF6 F8 F2 6 v1 q1 v2 q2 v1 q1 v2 q2 v(f6) v(f2) Load1 add1 Load2 add2 v(f2) add1 EXLoad EXADD v(f2) v(f6) v1 q1 v2 q2 v(f2) v(f4) mult1 mult2 v(f6) mult1 EXMUL v(f2) v(f4) RF q mult1 v(f2) add2 add1 mult

43 Tomasulo s example Cycle 10 Instruction status Write Instruction j k Issue Execute Result LD F6 34+ R LD F2 45+ R MULTF0 F2 F SUBDF8 F6 F DIVD F10 F0 F6 5 ADDDF6 F8 F2 6 Latency MULTD: 2 cycles Latency SUBD: 2 cycles v1 q1 v2 q2 v1 q1 v2 q2 Load1 add1 v(f6) v(f2) Load2 add2 v(f8) v(f2) EXLoad EXADD v(f6) v(f2) CDB v1 q1 v2 q2 mult1 v(f2) v(f4) mult2 v(f6) v(f0) EXMUL v(f2) v(f4) CDB RF q v(f0) v(f2) add2 v(f8) mult2 85 Tomasulo s example Cycle 11 Instruction status Write Instruction j k Issue Execute Result LD F6 34+ R LD F2 45+ R MULTF0 F2 F SUBDF8 F6 F DIVD F10 F0 F ADDDF6 F8 F v1 q1 v2 q2 v1 q1 v2 q2 Load1 add1 Load2 add2 v(f8) v(f2) EXLoad EXADD v(f8) v(f2) v1 q1 v2 q2 mult1 mult2 v(f6) v(f0) EXMUL v(f6) v(f0) RF q v(f0) v(f2) add2 v(f8) mult

44 Tomasulo s example Cycle 61 Instruction status Write Instruction j k Issue Execute Result LD F6 34+ R LD F2 45+ R MULTF0 F2 F SUBDF8 F6 F DIVD F10 F0 F Latency DIVD: 50 cycles ADDDF6 F8 F Latency ADDD: 2 cycles Load1 Load2 EXLoad v1 q1 v2 q2 v1 q1 v2 q2 add1 add2 EXADD v1 q1 v2 q2 mult1 mult2 v(f6) v(f0) EXMUL v(f6) v(f0) CDB RF q v(f0) v(f2) v(f6) v(f8) v(f10) 87 Compare to Scoreboard Cycle 62 Instruction status: Read Exec Write Write Instruction j k Issue Oper CompResult IssueExec Resul LD F6 34+ R LD F2 45+ R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Why take longer on scoreboard/6600? Structural Hazards Lack of forwarding 88 44

Tomasulo (IBM) versus Scoreboard (CDC).
Tomasulo (IBM): issue window size = 5; no issue on structural hazards; WAR and WAW avoided with renaming; results broadcast from the FUs; control distributed on the RSs; allows loop unrolling in HW.
Scoreboard (CDC): issue window size = 12; no issue on structural hazards; completion stalled for WAW and WAR hazards; results written back to the registers; control centralized in the Scoreboard.

Limits to Instruction Level Parallelism: branches; exceptions. (Non-)precise: operand integrity for the exception handler. (Non-)exact: handler modifications are seen by instructions after the exception.

Tomasulo Drawbacks. Complexity: a large amount of hardware (the delays of the 360/91, MIPS 10000, IBM 620?); many associative stores (CDB) at high speed. Performance is limited by the Common Data Bus; multiple CDBs => more FU logic for parallel associative stores.

Summary (1). HW exploiting ILP: works when dependence can't be known at compile time; code for one machine runs well on another. Key idea of the Scoreboard: allow instructions behind a stall to proceed (Decode => Issue instr & read operands); enables out-of-order execution => out-of-order completion; the ID stage is checked for both structural & data dependencies. The original version didn't handle forwarding and had no automatic register renaming.

Summary (2). Reservation stations: renaming to a larger set of registers + buffering of source operands. Prevents registers from becoming the bottleneck. Avoids the WAR and WAW hazards of the Scoreboard. Allows loop unrolling in HW. Not limited to basic blocks (integer units get ahead, beyond branches). Helps cache misses as well. Lasting contributions: dynamic scheduling, register renaming, load/store disambiguation. The 360/91 descendants are the Pentium II, PowerPC 604, MIPS R10000, HP-PA 8000, Alpha.

HW-based Speculation: ReOrder Buffer

HW support for more ILP. Speculation: allow an instruction to execute without leaving any consequences (including exceptions) if the branch is not actually taken ("HW undo"); called boosting. Combine branch prediction, to choose which instructions to execute, with dynamic scheduling, to execute before branches are resolved.

HW support for more ILP. Separate speculative bypassing of results from real bypassing of results. When an instruction is no longer speculative, write the boosted results (instruction commit) or discard them => execute out of order but commit in order, to prevent any irrevocable action (state update or exception) until the instruction commits.

HW-based Speculation. HW-based speculation combines 3 ideas: dynamic branch prediction to choose which instructions to execute; speculation to execute instructions before control dependences are resolved; dynamic scheduling supporting out-of-order execution but in-order commit, to prevent any irrevocable action (such as a register update or taking an exception) until an instruction commits.

HW support for more ILP. Need a HW buffer for the results of uncommitted instructions: the ReOrder Buffer (ROB). [Figure: the FP op queue feeds the reservation stations in front of the FP adders; the reorder buffer sits between the functional units and the FP registers.]

ReOrder Buffer (ROB). A buffer to hold the results of instructions that have finished execution but are not yet committed, and to pass results among instructions that may have been speculated. Supports out-of-order execution but in-order commit. Speculative Tomasulo Algorithm with ROB: pointers are directed toward ROB slots; a register or memory is updated only when the instruction reaches the head of the ROB (that is, when the instruction is no longer speculative).

ReOrder Buffer (ROB). The ROB completely replaces the store buffers. The renaming function of the reservation stations is replaced by the ROB; reservation stations are now used only to buffer instructions and operands to the FUs (to reduce structural hazards). Pointers are now directed toward ROB slots. Processors with a ROB can dynamically execute while maintaining a precise interrupt model, because instruction commit happens in order.

ReOrder Buffer. 4 fields: Instruction type. Destination: RF number (for load and ALU ops) or memory address (for stores). Value: holds the value of the instruction result until the instruction commits. Ready: indicates that the instruction has completed execution and the value is ready. (A sketch of one entry follows below.)

ReOrder Buffer. The reorder buffer can be an operand source => more registers, like a reservation station. Use the reorder buffer number instead of the reservation station number when execution completes; it supplies operands between execution complete & commit. Once an instruction commits, its result is put into the register. Instructions commit in order; as a result, it's easy to undo speculated instructions on mispredicted branches or on exceptions.
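A minimal sketch of one ROB entry with the four fields just listed (the Python names are assumptions):

from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class ROBEntry:
    itype: str                               # "branch", "store", or "load/alu"
    dest: Optional[Union[str, int]] = None   # register number, or memory address for stores
    value: Optional[float] = None            # result held here until commit
    ready: bool = False                      # execution finished, value is valid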

ReOrder Buffer (ROB). Originally (1988) introduced to solve the precise interrupt problem; generalized to guarantee sequential consistency. Basically, the ROB is a circular buffer with a head pointer (indicating the next free entry) and a tail pointer (indicating the instruction that will commit, i.e., leave the ROB, first).

ReOrder Buffer (ROB). Instructions are written in the ROB in strict program order: when an instruction is issued, an entry is allocated to it in sequence. The entry indicates the status of the instruction: issued (i), in execution (x), finished (f) (+ other items!). An instruction can commit (retire) iff: 1. it has finished, and 2. all previous instructions have already retired.

ReOrder Buffer (ROB). [Figure: the ROB as a circular buffer - the tail points to the next instruction to be retired, active entries carry status i/x/f, and the head points to the first free entry; subsequent instructions are allocated to subsequent entries, in order.]

ReOrder Buffer (ROB). Only retiring instructions can complete, i.e., update architectural registers and memory. The ROB can support both speculative execution and exception handling. Speculative execution: each ROB entry is extended to include a speculative status field, indicating whether the instruction has been executed speculatively; finished instructions cannot retire as long as they are in speculative status. (A retirement sketch follows below.)
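A sketch of in-order retirement from the circular ROB (the status codes 'i'/'x'/'f' follow the slide; the speculative flag and field names are assumptions). Entries are retired from the tail only while they are finished and non-speculative:

def retire(rob, tail, regfile):
    """rob: list used circularly, None = free entry; returns the new tail."""
    while rob[tail] is not None:
        entry = rob[tail]
        if entry.status != 'f' or entry.speculative:
            break                             # next-to-retire not ready: stop, in order
        if entry.itype != "branch":
            regfile[entry.dest] = entry.value # architectural state updated only now
        rob[tail] = None                      # free the entry
        tail = (tail + 1) % len(rob)
    return tail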

ReOrder Buffer (ROB). Interrupt handling: exceptions generated in connection with instruction execution are made precise by accepting the exception request only when the instruction becomes the next to retire (exceptions are processed in order). The ROB can also be used for shelving (in this case: Deferred scheduling, Register renaming Instruction Shelf, DRIS).

Hardware-based Speculation. The outcome of branches is speculated and the program is executed as if the speculation were correct (simple dynamic scheduling would only fetch and decode, not execute!). Mechanisms are necessary to handle incorrect speculation: hardware speculation extends dynamic scheduling.

Hardware-based Speculation. Combines: 1. dynamic branch prediction, to choose which instructions to execute; 2. speculation, to allow executing instructions before dependencies are resolved (with the ability to undo the effects of incorrectly speculated sequences); 3. dynamic scheduling, to deal with different combinations of basic blocks.

Hardware-based Speculation. Issue an instruction dependent on a branch before the branch result is known. Commit is always made in order: the commit of a speculative instruction is made only when the branch outcome is known. The same holds for exceptions (synchronous or asynchronous deviations of control flow). Follows the predicted flow of data values to choose when to execute an instruction; essentially, a data-flow mode of execution: instructions execute as soon as their operands are available.

Hardware-based Speculation. Adopted in the PowerPC 603/604, MIPS R10000/R12000, Pentium II/III/4, AMD K5/K6, Athlon. Extends the hardware support for the Tomasulo algorithm: to support speculation, the commit phase is separated from the execution phase, and the reorder buffer is introduced.

Hardware-based Speculation. Basic Tomasulo algorithm: an instruction writes its result in the register file, where subsequent instructions find it. With speculation, results are written only when the instruction commits and it is known whether the instruction had to be executed. Key idea: executing out of order, committing in order. Boosting.

Speculative Tomasulo's Algorithm.

Speculative Tomasulo's Algorithm. Boosting needs a buffer for uncommitted results (the reorder buffer). Each entry in the ROB contains four fields: the Instruction type field indicates whether the instruction is a branch (no destination result), a store (memory address destination), or a load/ALU op (register destination); the Destination field supplies the register number (for loads and ALU instructions) or the memory address (for stores) where the result should be written; the Value field holds the value of the result until the instruction commits; the Ready field indicates that the instruction has completed execution and the value is ready.

ReOrder Buffer Extension. The ROB completely replaces the store buffers: stores execute in two steps, the second one when the instruction commits. The renaming function of the reservation stations is completely replaced by the ROB; reservation stations now only queue operations (and operands) to the FUs between the time they issue and the time they begin execution. Results are tagged with the ROB entry number rather than with the RS number; the ROB entry assigned to an instruction must be tracked in the reservation stations.

ReOrder Buffer Extension. All instructions, excluding incorrectly predicted branches (or incorrectly speculated loads), commit when reaching the head of the ROB. When an incorrectly predicted branch reaches the head of the ROB, wrong speculation is indicated: the ROB is flushed, and execution restarts at the correct successor of the branch. Speculative actions are easily undone. Processors with a ROB can dynamically execute while maintaining a precise interrupt model: if instruction Ij causes an interrupt, the CPU waits until Ij reaches the head of the ROB and takes the interrupt, flushing all other pending instructions.

Steps of Speculative Tomasulo's Algorithm (1). 1. Issue: get an instruction from the queue; both an RS and a ROB slot must be free (when the ROB is full, stop issuing instructions until an entry is free); dispatch the operation, indicating in which ROB slot it must write. 2. Execution: when both operands are ready, execute; if not, watch the CDB. 3. Write Result: write on the CDB and into the ROB, as well as to any RS waiting for this result; mark the RS available.

Steps of Speculative Tomasulo's Algorithm (2). 4. Commit: 3 different possible sequences (sketched below): 1. Normal commit: the instruction reaches the head of the ROB and the result is present in the buffer; the result is stored in the register and the instruction is removed from the ROB. 2. Store commit: as above, but memory rather than a register is updated. 3. The instruction is a branch with incorrect prediction: this indicates that the speculation was wrong; the ROB is flushed ("graduation"), and execution restarts at the correct successor of the branch. If the branch was correctly predicted, the branch is finished.
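A sketch of the three commit cases (the entry fields, the flush() helper, and the return convention are assumptions, not the textbook interface):

def commit(entry, regfile, memory, rob):
    """Commit the entry at the head of the ROB; return a restart PC on misprediction."""
    if entry.itype == "branch":
        if entry.mispredicted:
            rob.flush()                    # squash all speculated instructions
            return entry.correct_target    # restart fetch at the right successor
        return None                        # correctly predicted: branch is finished
    if entry.itype == "store":
        memory[entry.dest] = entry.value   # memory is updated only at commit
    else:
        regfile[entry.dest] = entry.value  # normal commit: write the register
    return None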

Exception Handling. Do not recognize the exception until the instruction is ready to commit. If a speculated instruction raises an exception, record it in the ROB. If the branch was mispredicted and the instruction should not have been executed, flush the exception. If the instruction reaches the head of the ROB and is no longer speculative, the exception is taken.

Speculative Tomasulo's Algorithm.

Hazards through memory. WAW and WAR hazards through memory are eliminated with speculation, since the actual memory updating occurs in order. For RAW hazards through memory, two restrictions are introduced: no load can initiate the second step of its execution if an active ROB entry due to a store has a Destination field matching the A field of the load; the program order for the computation of load addresses is maintained with respect to all previous stores. Some speculative machines bypass the value directly from the store to the load when a RAW is detected.

HW Register Renaming. There is still a reorder buffer, but it does not keep the results: it only enforces in-order commit. The register file is extended with extra registers to hold speculative values. When issuing an instruction, rename all the speculative operands to the speculative registers; on commit, copy the speculative register into the real one. Operands are read from the RF (real or speculative) or via the CDB. (A sketch follows below.)
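A minimal sketch of rename-at-issue with a free list and a map table (all names are assumed; values produced by the FUs are written under the speculative register's key):

free_list = ["P8", "P9", "P10", "P11"]       # extra (speculative) physical registers
rename_map = {}                              # architectural reg -> speculative reg

def rename_at_issue(dest, srcs):
    # sources read the speculative version if one exists, else the real register
    renamed_srcs = [rename_map.get(s, s) for s in srcs]
    phys = free_list.pop(0)                  # allocate a speculative register
    rename_map[dest] = phys                  # later readers of dest use phys
    return phys, renamed_srcs

def commit_rename(dest, regfile):
    phys = rename_map.pop(dest)              # speculative value becomes architectural
    regfile[dest] = regfile[phys]
    free_list.append(phys)                   # recycle the physical register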

HW Register Renaming. [Figure: the CDB and virtual registers feed a rename-and-issue stage (V+R), followed by RF read, reservation stations, and FUs; a virtual register file (VRF) and a real register file (RRF) sit beside the ROB, which maps virtual to physical registers.]

Register renaming vs. ROB: instruction commit is simpler than with a ROB; deallocating registers is more complex; the dynamic mapping of architectural to physical registers complicates design and debugging. Used in the PowerPC 603/604, Pentium II/III/4, MIPS 10000/12000, Alpha 21264; 20 to 80 registers are added.

An example: organization of the Intel Pentium Pro and PowerPC 604. [Figure: the PC and branch prediction drive the instruction cache and instruction queue; a decode/dispatch unit feeds six reservation stations in front of the functional units (branch, two integer, floating point, complex integer, load/store); results flow through the reorder buffer and the commit unit back to the register file and data cache.]

Speculating through multiple branches. Speculating on multiple branches simultaneously is a benefit in the case of: very high branch frequency, or significant clustering of branches, or long delays in the functional units. It complicates speculation recovery, but is otherwise straightforward. More complex: predicting and speculating on more than one branch per cycle.

Limitations of ILP? Basic questions: how much ILP can be found in applications; what is needed to exploit more ILP. Both compiler technology and architecture are instrumental in pushing these limits!

Limitations of ILP? Hardware model adopted to perform the evaluations: an "ideal processor": all artificial constraints on ILP are removed; the only limitations are due to the actual data flow through registers and/or memory. Assumptions: 1. Register renaming: an infinite number of registers is available => all WAR and WAW hazards are avoided, and an unbounded number of instructions can begin execution simultaneously. 2. Branch prediction: perfect => all conditional branches are predicted exactly. 3. Jump prediction: all jumps are predicted perfectly => combined with the previous assumption, this leads to a processor with perfect speculation + an unbounded buffer of instructions available for execution. 4. Memory address alias analysis: all memory addresses are known perfectly, and a load can be moved before a store provided the addresses are not identical.

Hardware model. Assumptions 2 and 3 => no control dependencies; assumptions 1 and 4 eliminate all but true data dependencies. Any instruction can be scheduled on the cycle immediately following the execution of the predecessor upon which it depends => control and address speculation are subsumed as perfect.

Further initial assumptions: the CPU can issue an unlimited number of instructions at once, looking arbitrarily far ahead in the computation; no restrictions on the types of instructions that can be executed in one cycle (including loads and stores); all functional unit latencies = 1, so any sequence of dependent instructions can issue on successive cycles. Instructions in execution are said to be "in flight". Perfect caches = all loads and stores execute in one cycle => only the fundamental limits to ILP are taken into account. Obviously, the results obtained are VERY optimistic! (no such CPU can be realized). Benchmark programs used: six from SPEC92 (three FP-intensive ones, three integer ones).

Limits on window size. Dynamic analysis is necessary to approach perfect branch prediction (impossible at compile time!). A perfect dynamically-scheduled CPU should:
1. Look arbitrarily far ahead to find a set of instructions to issue, predicting all branches perfectly;
2. Rename all register uses (=> no WAW, WAR hazards);
3. Determine whether there are data dependencies among the instructions in the issue packet, renaming if necessary;
4. Determine whether memory dependencies exist among the issuing instructions, and handle them;
5. Provide enough replicated functional units to allow all ready instructions to issue.

Limits on window size, maximum issue count: the analysis is quite complex! E.g., determining the data dependencies among n issuing register-register instructions (number of registers unbounded) means comparing the two sources of each instruction against the destinations of all earlier instructions in the packet, for a total number of comparisons of sum_{i=1}^{n-1} 2i = 2 * (n-1)n/2 = n^2 - n.

Limits on window size, maximum issue count: window size = 2000 => almost four million comparisons! Even issuing 50 instructions requires 2450 comparisons => the number of instructions to be considered for issue is obviously limited! Existing CPUs: a limited number of registers + search for dependence pairs + in-order issue limit the cost; dependent instructions are handled by the renaming process => dependent renaming in one clock cycle; once instructions are issued, detection of dependencies is handled in a distributed fashion by the reservation stations or scoreboard. (A quick numeric check follows below.)

Limits on window size, maximum issue count: all instructions in the window must be kept in the processor => the number of comparisons required at each cycle = maximum completion rate x window size x number of operands per instruction => total window size is limited by storage + comparisons + limited issue rate (today: window sizes requiring over 2000 comparisons!).
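A quick check of the n^2 - n comparison count quoted above (plain Python, illustrative only):

def comparisons(n):
    return n * n - n          # sum of 2i for i = 1 .. n-1

print(comparisons(2000))      # 3998000 -> "almost four million"
print(comparisons(50))        # 2450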

Limits on window size, maximum issue count. Real CPUs have a more limited number of functional units (e.g., no more than 2 memory references per clock, nor 2 FP operations) + a limited number of buses + a limited number of RF ports => all limit the number of instructions initiated in the same clock => the maximum number of instructions that may issue, begin execution, or commit in the same clock cycle is much smaller than the window size.

Limits on window size, maximum issue count. Experimental data: the maximum parallelism uncovered falls sharply as the window size is reduced; for the best-performing program, from 150 for an infinite window to 35 for a 128-instruction window. At low window sizes (32 or less) all programs exhibit more or less the same level of parallelism.

Effects of realistic branch and jump prediction. Perfect branch prediction is obviously impossible: when prediction is not highly accurate, mispredicted branches become a barrier to finding parallelism. Branch prediction mechanisms are a major point of optimization in leading-edge CPUs.

Effects of finite registers. Reducing the number of registers available for renaming has a great impact on the extraction of available parallelism, increasingly relevant with an increasing intrinsic level of parallelism in a benchmark!

Imperfect Alias Analysis. Perfect analysis at compile time is impossible (run-time computed memory references, pointer-accessed variables, etc.). Run-time alias analysis, a priori (if no constraint is placed on the number of simultaneous memory references), requires an unlimited number of comparisons.

Imperfect Alias Analysis. Consider three models of memory alias analysis, in addition to perfect analysis: 1. Global/stack perfect: assumes perfect prediction for all global and stack references, and a conflict on all heap references (based on improvements in compiler technology). 2. Inspection: accesses are examined to see if they can be determined not to interfere at compile time; also, accesses based on registers that point to different allocation areas (e.g., the global area and the stack area) are assumed never to alias. 3. None: all memory references are assumed to conflict.

Imperfect Alias Analysis. Model 1 gives results quite similar to perfect alias analysis; model 2 is not much better than model 3. In practice, dynamically scheduled CPUs rely on dynamic memory disambiguation. Three factors limit its efficiency: 1. To achieve perfect dynamic disambiguation for a load, it is necessary to know the memory addresses of all previous stores that have not yet committed. Memory address speculation: the dependency is assumed not to exist, or else is predicted through a HW mechanism, and the load is stalled if a dependency is predicted. To check prediction correctness, the CPU examines the destination address of each completing store that precedes the given load in program order; if a dependency that should have been enforced occurs, the CPU uses a speculative restart mechanism to redo the load and the following instructions (supported with a suitable instruction-set extension). 2. Only a small number of memory references can be disambiguated per clock cycle. 3. The number of load/store buffers determines how much earlier or later in the instruction stream a load or a store can be moved.

Multiple-Issue Processors

Multiple-Issue Processors. Basic idea to get CPI < 1: issuing multiple instructions per cycle. Two variations: Superscalar, and (Very) Long Instruction Word ((V)LIW). Anticipated success led to the use of Instructions Per Clock cycle (IPC) vs. CPI.

Multiple-Issue Processors. Superscalar: a varying number of instructions/cycle (1 to 8), scheduled by the compiler (statically) or by HW (dynamically, by Tomasulo): IBM PowerPC, Sun UltraSparc, DEC Alpha, HP 8000. (Very) Long Instruction Word ((V)LIW): a fixed number of instructions (4-16) scheduled by the compiler, which puts the ops into wide templates: the joint HP/Intel agreement in 1999/2000, Intel Architecture-64 (IA-64), a 64-bit address style named Explicitly Parallel Instruction Computer (EPIC), with explicit dependences in the issue packet marked by the compiler; for example, Itanium.

Superscalar Processors.

Superscalar Processors. Issue a varying number of instructions at each clock cycle. If instructions are dependent, only the consecutive ready instructions are issued (in-order issue). This decision is made at run time by the processor => variability in the issue rate (dynamic issue capability).

Superscalar Processors can be: statically scheduled (do not allow instructions behind stalls to proceed); dynamically scheduled (allow instructions behind RAW hazards to proceed); dynamically scheduled and speculative.

How to optimize code for Superscalar Processors (1). Original loop:
Loop: LD   F0,0(R1)
      ADDD F4,F0,F2
      SD   0(R1),F4
      SUBI R1,R1,#8
      BNEZ R1,LOOP

Unrolled loop:
Loop: LD   F0,0(R1)
      LD   F6,-8(R1)
      LD   F10,-16(R1)
      LD   F14,-24(R1)
      ADDD F4,F0,F2
      ADDD F8,F6,F2
      ADDD F12,F10,F2
      ADDD F16,F14,F2
      SD   0(R1),F4
      SD   -8(R1),F8
      SD   -16(R1),F12
      SUBI R1,R1,#32
      BNEZ R1,LOOP
      SD   8(R1),F16   ; 8-32 = -24

The loop is unrolled 4 times (load/addd/store): RAW hazards have been reduced, but there are resource conflicts on the pipelines (cannot execute 2 LDs in parallel).

76 How to optimize code for Superscalar processors (2)

          Integer instruction     FP instruction        clk_cycle
    Loop: LD   F0,0(R1)                                     1
          LD   F6,-8(R1)                                    2
          LD   F10,-16(R1)        ADDD F4,F0,F2             3
          LD   F14,-24(R1)        ADDD F8,F6,F2             4
          LD   F18,-32(R1)        ADDD F12,F10,F2           5
          SD   0(R1),F4           ADDD F16,F14,F2           6
          SD   -8(R1),F8          ADDD F20,F18,F2           7
          SD   -16(R1),F12                                  8
          SD   -24(R1),F16                                  9
          SUBI R1,R1,#40                                   10
          BNEZ R1,LOOP                                     11
          SD   8(R1),F20          ; 8-40 = -32             12

The loop is unrolled 5 times and scheduled for a dual-issue pipeline (one integer and one FP instruction per cycle). 151 Superscalar Processors: Examples

77 The PowerPC 620 ['94] Superscalar Architecture Similar to the MIPS R10000 and HP PA 8000. Fetch, issue, and completion of up to 4 instructions per clock cycle. Six separate execution units, each buffered by reservation stations. 153 PowerPC Architecture Speculative Tomasulo with register renaming: an extended register file holds the speculative result of an instruction until the instruction commits, and the ROB only enforces in-order commit. Advantage: operands are always available from a single location (no additional complex logic is needed to read result values out of the ROB)

78 PowerPC 620 architecture 155 PowerPC functional units 2 integer units (XSU0, XSU1) with 0-cycle latency [+, -, shift, ...]; 1 complex integer function unit (MCFXU) for integer multiply (pipelined) and divide (unpipelined), with latencies from 3 to 20 cycles; 1 load/store unit, with latency 1 for integer loads and 2 for FP loads

79 PowerPC functional units 1 FPU with latencies of 2 cycles for multiply, add, and multiply-add, and 31 cycles for DP FP divide (fully pipelined except for divide); 1 BRU, which completes branches and informs the fetch unit of mispredictions, and includes the condition register used for conditional branches. 157 PowerPC Pipeline Fetch: the fetch unit loads the decode queue with instructions from the cache. The next address is predicted through a 256-entry, two-way set-associative BTB; a branch prediction buffer (BPB) is used on a miss in the BTB

80 PowerPC Pipeline Instruction decode: instructions are decoded and inserted into an 8-entry instruction queue. Instruction issue: up to 4 instructions are taken from the 8-entry instruction queue and issued to the reservation stations; a rename register and a reorder buffer entry are allocated for each issued instruction. If either cannot be allocated, issue stalls (a sketch of this check appears below). 159 PowerPC Pipeline Execution: an instruction proceeds to execute when all its operands are available; at the end, the result is written on the result bus and the completion unit is notified that the instruction has completed
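A C sketch of the issue-stage resource check described above. This is an assumed model of the 620's issue logic, not its actual implementation: issue stalls as soon as a rename register or a ROB entry cannot be allocated for the next queued instruction.

    typedef struct { int free_rename_regs; int free_rob_entries; } resources_t;

    /* Try to issue up to `width` of the `queued` instructions, in order;
     * returns how many actually issued this cycle. */
    int issue(resources_t *r, int width, int queued) {
        int issued = 0;
        while (issued < width && issued < queued) {
            if (r->free_rename_regs == 0 || r->free_rob_entries == 0)
                break;              /* stall: no rename register or ROB slot */
            r->free_rename_regs--;  /* rename reg holds the speculative result */
            r->free_rob_entries--;  /* ROB entry enforces in-order commit */
            issued++;
        }
        return issued;
    }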

81 PowerPC Pipeline If the instruction is a mispredicted branch, the IFU and the IC(ompletion)U are notified: instruction fetch restarts, and the ICU discards all the speculated instructions after the branch and frees their rename buffers. Commit: when all previous instructions have committed, the result is committed to the RF and the rename buffer is freed. Stores also commit from the store buffer to memory. 161 Performance results IPC ranges from under 1 to 1.8. IPC=4 is not reached because: FUs are not replicated for every instruction type (structural hazards); instruction-level parallelism is limited, or buffering is insufficient

82 PowerPC G4e 32K instruction cache, 32K data cache, 7 pipeline stages. 163 G4e Pipeline Stages 1 and 2 - Instruction Fetch: fetches instructions from the L1 cache, with a fetch width of up to four instructions per clock cycle and 9 cycles of delay in the case of a miss

83 G4e Pipeline Stage 3 - Decode/Dispatch: gets instructions from a 12-entry instruction queue, decodes them, and sends each to the issue queue of its class. The G4e's decoder can dispatch up to three instructions per clock cycle to the next stage. 165 G4e Pipeline Stage 4 - Issue: gets an instruction from each of the (FIFO) issue queues: the Floating-Point Issue Queue (FIQ), the Vector Issue Queue (VIQ) (AltiVec), and the General Instruction Queue (GIQ), and puts it in a reservation station (enabling out-of-order execution)

84 G4e Pipeline Stage 5 - Execute: instructions pass from the reservation station queues into their respective functional units and are executed. Stages 6 and 7 - Complete and Write-Back. 167 P6 Processor Family: Intel Pentium Pro, II/III 3-way superscalar. Basic idea: three engines:

85 P6 Pipeline Fetch/Decode Unit: decodes instructions, converting them into micro-ops that represent the instruction's work, and puts them into the instruction pool in order. Dispatch/Execute Unit: out-of-order issue from the instruction pool through a reservation station, and out-of-order execution of micro-ops. Retire Unit: reorders the instructions and commits speculative results to the architectural state. 169 P6 Instruction Decode 8 pipeline stages. The decoder fetches 16 bytes per clock cycle from the cache; 3 parallel decoders convert most instructions into one or more triadic micro-ops, while some instructions need microcode (several micro-ops) to be executed. Throughput = 6 micro-ops per clock cycle. The Register Alias Table (RAT) unit converts logical register references into virtual register references (40 registers); a sketch of this renaming step follows. In-order issue to the reservation stations and reorder buffer
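A minimal C sketch of register renaming through a Register Alias Table, in the spirit of the P6 RAT but not Intel's actual logic; the sizes and the lack of free-list recycling are simplifying assumptions. Each architectural register maps to the newest physical register that will produce its value, which removes WAR and WAW hazards.

    #define NUM_ARCH_REGS 8    /* e.g., the x86 integer registers */
    #define NUM_PHYS_REGS 40   /* P6-style pool of renaming registers */

    typedef struct { int rat[NUM_ARCH_REGS]; int next_free; } rename_t;

    /* Rename one micro-op "dst <- src1 op src2"; returns the physical
     * register allocated for dst. */
    int rename_uop(rename_t *r, int dst, int src1, int src2,
                   int *phys_src1, int *phys_src2) {
        *phys_src1 = r->rat[src1];               /* current source mappings */
        *phys_src2 = r->rat[src2];
        int p = r->next_free++ % NUM_PHYS_REGS;  /* allocate a new phys reg */
        r->rat[dst] = p;                         /* later readers of dst see p */
        return p;
    }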

86 P6 Instruction Dispatch/Execute A 20-entry reservation station unit; out-of-order execution proceeds through the 3 pipeline stages of the dispatch/execute unit. A micro-op executes when: all its operands are ready, and the resource it needs is ready. Maximum throughput: 5 micro-ops/cycle. If a micro-op is a branch, its outcome is compared with the address predicted in the fetch phase; if mispredicted, the JEU changes the status of all the micro-ops behind the branch and removes them from the instruction pool. 171 P6 Instruction Retire The retire unit looks for micro-ops that have been executed and can be removed from the pool, writing results to their original architectural targets. This is done in order, by committing an instruction only if: all previous instructions have been committed, and the instruction itself has been executed. Up to 3 micro-ops can be retired per clock cycle

87 Pentium 4 New NetBurst micro-architecture; 20 pipeline stages (hyper-pipeline); 1.4 GHz to 2 GHz. 3 prefetching mechanisms: a hardware instruction prefetcher (based on the BTB); software-controlled data cache prefetching; an L3->L2 data and instruction hardware prefetcher

88 Pentium 4 Execution Trace Cache The TC stores decoded IA-32 instructions (micro-ops), removing decoding costs from the critical path: 12K micro-ops, with a fetch bandwidth of 3 micro-ops per cycle. It stores traces built across predicted branches; some instructions, however, still need microcode from ROM. 175 Pentium 4 The branch penalty can be much more than 10 cycles. A BTB is used; on a miss in the BTB, static prediction is applied (backward = taken, forward = not taken; see the sketch below). Software branch hints inserted during trace construction can override the static prediction
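The static prediction rule mentioned above (backward taken, forward not-taken) as a small C sketch. This is only the BTB-miss fallback policy, not Pentium 4's actual circuitry; the rationale is that backward branches usually close loops and therefore iterate.

    #include <stdbool.h>
    #include <stdint.h>

    /* Predict taken iff the branch jumps backward. */
    bool static_predict(uint32_t branch_pc, uint32_t target_pc) {
        return target_pc < branch_pc;   /* back = taken, forward = not taken */
    }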

89 Pentium 4 Execution Units and Issue Ports 177 Pentium 4 1 load and 1 store can issue each cycle. Loads can be reordered with respect to other loads and stores, and can execute speculatively; up to 4 load misses may be outstanding. Load/store forwarding is supported

90 X86 Frequency scaling The 286 through the P5 would run at similar clock rates if they were all implemented in the same silicon process technology (they have similar pipeline depths). 179 AMD Athlon K7 A nine-issue (micro-op), super-pipelined, superscalar x86 processor: multiple x86 instruction decoders (producing triadic micro-ops); three out-of-order, superscalar, fully pipelined floating-point execution units; three out-of-order, superscalar, pipelined integer units; three out-of-order, superscalar, pipelined address calculation units; a 72-entry instruction control unit (ROB)

91 AMD Athlon K7 181 AMD Athlon K7 The Instruction Control Unit contains a reorder buffer and distributed reservation stations that hold operands while OPs wait to be scheduled. The Integer Instruction Scheduler picks OPs for execution based on their operand availability and issues them to the functional units or the address generation units (a sketch of this selection step follows). The functional units perform transformations on data and return their results to the reorder buffer, while the address generation units send calculated memory addresses to the Load/Store Unit for further processing.
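A minimal out-of-order "select" step in the spirit of the scheduler described above; this is an assumed model, not AMD's implementation. Unlike the in-order issue sketch shown earlier for plain superscalar issue, any waiting OP whose operands are all ready may be picked, regardless of program order.

    #include <stdbool.h>

    #define WINDOW 72   /* matches the 72-entry instruction control unit */

    typedef struct { bool valid; bool src_ready[2]; } op_t;

    /* Returns the index of one ready OP to issue, or -1 if none is ready. */
    int select_ready(const op_t win[]) {
        for (int i = 0; i < WINDOW; i++)
            if (win[i].valid && win[i].src_ready[0] && win[i].src_ready[1])
                return i;   /* oldest-first scan; real selects may prioritize */
        return -1;
    }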
