DYNAMIC INSTRUCTION SCHEDULING WITH SCOREBOARD


1 DYNAMIC INSTRUCTION SCHEDULING WITH SCOREBOARD
Slides by: Pedro Tomás
Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 3, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011
ADVANCED COMPUTER ARCHITECTURES / ARQUITECTURAS AVANÇADAS DE COMPUTADORES (AAC)

2 Outline
Dynamic instruction scheduling: scoreboard
- Overview
- Data structures associated with the scoreboard

3 Review of static scheduling: compiler techniques to extract parallelism

// C Code
for (i = 99; i >= 0; i--)
    A[i] = A[i] + K;

// Assembly Code
Cont: L.D    F0,0(R2)
      ADD.D  F2,F0,F1
      S.D    0(R2),F2
      DSUBI  R2,R2,#8
      BNE    R2,R1,Cont

// Straightforward implementation (with pipeline stalls)
Cont: L.D    F0,0(R2)
      Stall
      ADD.D  F2,F0,F1
      Stall
      Stall
      S.D    0(R2),F2
      DSUBI  R2,R2,#8
      BNE    R2,R1,Cont
      Stall

9 cycles per iteration

4 Review of static scheduling: compiler techniques to extract parallelism

// C Code
for (i = 99; i >= 0; i--)
    A[i] = A[i] + K;

// Simple instruction scheduling: the S.D is moved after the branch,
// with its offset adjusted to compensate for the earlier DSUBI
Cont: L.D    F0,0(R2)
      Stall
      ADD.D  F2,F0,F1
      DSUBI  R2,R2,#8
      BNE    R2,R1,Cont
      S.D    8(R2),F2

6 cycles per iteration. Speedup = 9/6 = 1.5

5 Review of static scheduling: compiler techniques to extract parallelism

// C Code
for (i = 99; i >= 0; i--)
    A[i] = A[i] + K;

// Loop unrolling (4x) and instruction scheduling
Cont: L.D    F0,0(R2)
      L.D    F2,-8(R2)
      L.D    F3,-16(R2)
      L.D    F4,-24(R2)
      ADD.D  F0,F0,F1
      ADD.D  F2,F2,F1
      ADD.D  F3,F3,F1
      ADD.D  F4,F4,F1
      S.D    0(R2),F0
      S.D    -8(R2),F2
      S.D    -16(R2),F3
      DSUBI  R2,R2,#32
      BNE    R2,R1,Cont
      S.D    8(R2),F4

14 cycles for 4 iterations = 3.5 cycles per iteration. Speedup = 9/3.5 ≈ 2.57
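The same unrolling transformation can be sketched in plain Python. This is a hedged illustration only: the function name add_k_unrolled and the unroll factor of 4 are assumptions, and in Python the benefit is just fewer loop-overhead steps, not pipeline scheduling.

```python
# Hypothetical illustration of unrolling A[i] = A[i] + K by a factor of 4.
def add_k_unrolled(A, K):
    i = len(A) - 1
    # One loop-overhead update (index decrement + branch test) now covers
    # four element updates, mirroring the single DSUBI/BNE pair covering
    # four L.D/ADD.D/S.D groups in the unrolled assembly above.
    while i >= 3:
        A[i]     += K
        A[i - 1] += K
        A[i - 2] += K
        A[i - 3] += K
        i -= 4
    while i >= 0:          # clean-up loop for leftover iterations
        A[i] += K
        i -= 1
    return A
```

A real compiler would also need such a clean-up loop whenever the trip count is not a multiple of the unroll factor.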

6 Review of static scheduling: compiler techniques to extract parallelism

// C Code
for (i = 99; i >= 0; i--)
    A[i] = A[i] + K;

// Software pipelining: the steady-state loop stores one element,
// adds the next and loads the one after that
      L.D    F0,0(R2)      ; prologue
      ADD.D  F2,F0,F1
      L.D    F0,-8(R2)
Cont: S.D    0(R2),F2
      ADD.D  F2,F0,F1
      L.D    F0,-16(R2)
      DSUBI  R2,R2,#8
      BNE    R2,R1,Cont
      S.D    0(R2),F2      ; epilogue
      ADD.D  F2,F0,F1
      S.D    -8(R2),F2

5 cycles per iteration. Speedup = 9/5 = 1.8
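The software-pipelined schedule above can be rendered in Python as a sketch (add_k_swp is a hypothetical name, and the prologue/epilogue split is an assumption): in the steady-state loop, iteration i stores the result of element i+2, adds element i+1, and loads element i, so three loop bodies are in flight at once.

```python
# Hedged Python rendering of software pipelining for A[i] = A[i] + K,
# walking indices from n-1 down to 0 as in the assembly above.
def add_k_swp(A, K):
    n = len(A)
    if n < 3:                   # too short to pipeline: plain loop
        for i in range(n):
            A[i] += K
        return A
    # Prologue: fill the pipeline (mirrors the L.D / ADD.D / L.D before Cont:)
    acc = A[n - 1] + K          # ADD.D for element n-1
    cur = A[n - 2]              # L.D for element n-2
    i = n - 3
    while i >= 0:
        A[i + 2] = acc          # S.D  : store element i+2 (two steps ahead)
        acc = cur + K           # ADD.D: add element i+1   (one step ahead)
        cur = A[i]              # L.D  : load element i
        i -= 1
    # Epilogue: drain the pipeline (mirrors the S.D / ADD.D / S.D tail)
    A[1] = acc
    A[0] = cur + K
    return A
```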

7 Review of static scheduling: compiler techniques to extract parallelism

Applying the techniques in a VLIW system (5 issue slots, loop unrolled 12x):

#  | Memory 1        | Memory 2        | FP 1             | FP 2             | Integer 1
1  | L.D F0,0(R2)    | L.D F2,-8(R2)   |                  |                  |
2  | L.D F3,-16(R2)  | L.D F4,-24(R2)  |                  |                  |
3  | L.D F5,-32(R2)  | L.D F6,-40(R2)  | ADD.D F0,F0,F1   | ADD.D F2,F2,F1   |
4  | L.D F7,-48(R2)  | L.D F8,-56(R2)  | ADD.D F3,F3,F1   | ADD.D F4,F4,F1   |
5  | L.D F9,-64(R2)  | L.D F10,-72(R2) | ADD.D F5,F5,F1   | ADD.D F6,F6,F1   |
6  | L.D F11,-80(R2) | L.D F12,-88(R2) | ADD.D F7,F7,F1   | ADD.D F8,F8,F1   |
7  | S.D 0(R2),F0    | S.D -8(R2),F2   | ADD.D F9,F9,F1   | ADD.D F10,F10,F1 |
8  | S.D -16(R2),F3  | S.D -24(R2),F4  | ADD.D F11,F11,F1 | ADD.D F12,F12,F1 |
9  | S.D -32(R2),F5  | S.D -40(R2),F6  |                  |                  |
10 | S.D -48(R2),F7  | S.D -56(R2),F8  |                  |                  | DSUBI R2,R2,#96
11 | S.D 32(R2),F9   | S.D 24(R2),F10  |                  |                  | BNE R2,R1,Cont
12 | S.D 16(R2),F11  | S.D 8(R2),F12   |                  |                  |

1.09 cycles per iteration. Speedup = 9/(12/11) = 8.25

8 Dynamic instruction scheduling
Can we do better using dynamic approaches?

9 Dynamic scheduling: general idea

Can dynamic scheduling be used to reduce the number of cycles per instruction (CPI)?
Static scheduling by the compiler changes the order of the instructions in order to reduce the number of stalls in the pipeline.
Dynamic scheduling uses the same idea: whenever a hazard forces the pipeline to stall, try to issue another instruction, e.g.,

DIV.D  F0,F1,F2
SUB.D  F15,F0,F3
ADD.D  F6,F4,F5

The SUB.D instruction must stall until the division ends, which might take dozens of cycles.
The ADD.D instruction has no dependency on the previous instructions, so it can proceed: out-of-order execution.

10 Dynamic scheduling: hazards with dynamic scheduling

In-order pipelined architectures can only generate RAW (read after write) hazards.
Out-of-order pipelined architectures generate more hazards, e.g., WAR (write after read):

IN ORDER:              OUT OF ORDER (dynamic scheduling):
DIV.D F0,F1,F2         DIV.D F0,F1,F2
SUB.D F15,F0,F3        ADD.D F3,F4,F5    <- WAR on F3 with SUB.D
ADD.D F3,F4,F5         SUB.D F15,F0,F3

Observation: as seen in static scheduling, name dependencies (i.e., WAR and WAW dependencies) can be solved by register renaming.

11 Dynamic scheduling: hazards with dynamic scheduling

In-order pipelined architectures can only generate RAW (read after write) hazards.
Out-of-order pipelined architectures generate more hazards, e.g., WAW (write after write):

IN ORDER:              OUT OF ORDER (dynamic scheduling):
DIV.D F0,F1,F2         DIV.D F0,F1,F2
SUB.D F15,F0,F3        ADD.D F15,F1,F2   <- WAW on F15 with SUB.D
S.D   0(R2),F15        SUB.D F15,F0,F3
ADD.D F15,F1,F2        S.D   0(R2),F15

Observation: as seen in static scheduling, name dependencies (i.e., WAR and WAW dependencies) can be solved by register renaming.
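The hazard definitions of these two slides can be condensed into a small checker. This is a minimal sketch under an assumed encoding: each instruction is a tuple (op, dest, src1, src2), with dest=None for stores; hazards(a, b) classifies the dependence of a later instruction b on an earlier instruction a.

```python
# Classify RAW / WAR / WAW hazards between two instructions,
# where `a` comes before `b` in program order.
def hazards(a, b):
    _, a_dst, *a_srcs = a
    _, b_dst, *b_srcs = b
    found = set()
    if a_dst is not None and a_dst in b_srcs:
        found.add("RAW")                 # b reads what a writes
    if b_dst is not None and b_dst in a_srcs:
        found.add("WAR")                 # b writes what a reads
    if a_dst is not None and a_dst == b_dst:
        found.add("WAW")                 # both write the same register
    return found

# The examples from the slides:
div = ("DIV.D", "F0", "F1", "F2")
sub = ("SUB.D", "F15", "F0", "F3")
add = ("ADD.D", "F3", "F4", "F5")
```

With these definitions, hazards(div, sub) reports the RAW on F0 and hazards(sub, add) reports the WAR on F3 that out-of-order execution exposes.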

12 Dynamic scheduling: scoreboard general overview

The scoreboard is a centralized dynamic scheduling technique that monitors each instruction waiting to be dispatched for execution.
Once it determines that all the source operands and the required functional unit are available, it dispatches the instruction so that it can be executed.
It monitors multiple instructions and dispatches the first instruction(s) to have all dependencies satisfied.
The scoreboard does not eliminate WAR and WAW hazards (it only stalls on them), since it does not implement register renaming.
Out-of-order execution can also lead to structural conflicts, which must be solved by the scoreboard.

[Figure: new instructions arrive from instruction memory through IF into an instruction queue; the oldest ready instruction in the queue is dispatched to EX/MEM and then WB]

13 Dynamic scheduling: scoreboard implementation

Divide the ID/OF stage in two parts:

ISSUE
- Instruction decoding and verification of structural and WAW hazards
- Once all structural and WAW conflicts are solved, issue the instruction to the dispatch stage
- Instruction issue is performed in-order

READ OPERANDS (dispatch)
- Wait until all data hazards are solved before reading the operands from the register file and dispatching the instruction to execution
- Instruction dispatch is performed out-of-order

[Figure: IF and ISSUE proceed in order; dispatch (DISP.), EX/MEM and WB proceed out-of-order]

14 Dynamic scheduling: scoreboard implementation

Divide the ID/OF stage in two parts, Issue and Read Operands; instructions are only executed after solving all conflicts.

Scoreboarding was introduced in the CDC 6600 mainframe. First delivered to the Lawrence Radiation Laboratory, part of the University of California at Berkeley, in 1964, the CDC 6600 was used primarily for high-energy nuclear physics research. It achieved a peak performance of 1 MFLOPS (3x faster than its fastest predecessor). The CDC 6600 had 16 separate functional units (all non-pipelined): 4x FP, 7x integer, 5x load/store.

15 Dynamic scheduling with scoreboard: Issue stage

Resolves all structural and WAW hazards. When an instruction is received from the IF stage, perform the following steps:
a) Decode the instruction
b) Verify whether any active instruction has the same destination register as the current instruction (check for WAW hazards)
c) Verify whether the target functional unit is available (check for structural hazards)
d) If no WAW or structural hazard is found, issue the instruction to the Read Operands (dispatch) stage; otherwise stall the pipeline
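Steps b) to d) amount to two table lookups. A minimal sketch, assuming the scoreboard's tables are reduced to two dictionaries (fu_busy and reg_result are hypothetical names: one flag per functional unit, and one entry per register that an active instruction will write):

```python
# Hedged sketch of the issue-stage test.
def can_issue(fu_busy, reg_result, fu_name, dest_reg):
    if fu_busy.get(fu_name, False):
        return False          # step c: target FU occupied -> structural hazard
    if dest_reg in reg_result:
        return False          # step b: an active instruction writes dest -> WAW hazard
    return True               # step d: safe to issue
```

Failing either test stalls issue, and (because issue is in-order) everything behind the stalled instruction as well.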

16 Dynamic scheduling with scoreboard: Read Operands (dispatch) stage

Resolves all RAW hazards. When an instruction is received from the Issue stage, perform the following steps:
a) Verify that the source operands are available, by checking whether any earlier issued active instruction will still write to them (check for RAW hazards)
b) If no RAW hazard is found, dispatch the instruction to the execution stage
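The RAW test in step a) can be sketched the same way. Here reg_result is an assumed dictionary mirroring the register result status table (each register an active instruction will write, mapped to the producing FU):

```python
# Hedged sketch of the dispatch-stage (read operands) test.
def can_dispatch(reg_result, src_regs):
    # RAW hazard iff some earlier issued, still-active instruction
    # will still write one of this instruction's source registers.
    return all(r not in reg_result for r in src_regs)
```

Because this check is repeated every cycle for every issued-but-not-dispatched instruction, instructions leave this stage out of order: whichever one first sees all its sources clear is dispatched first.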

17 Dynamic scheduling with scoreboard: Execute (and memory) stage

Execute the operation. When finished, inform the scoreboard that the functional unit has completed execution.
No forwarding is performed; the scoreboard must solve all hazards through the register file.

18 Dynamic scheduling with scoreboard: Write back stage

Resolves all WAR hazards before writing the results. Verify whether a WAR hazard exists by checking that:
- some preceding instruction (in-order issue) has not yet read its operands, and
- one of its operands is the register the completing instruction is trying to write, and
- the missing operand comes from another instruction
If no WAR hazard exists, allow the result to be written; otherwise delay writing the result to the destination register.
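The conditions above reduce to one membership test. The input shape is an assumption: unread_sources lists, for each earlier-issued instruction that is still active, the source registers it has not yet read (in a scoreboard, operands are read together at dispatch).

```python
# Hedged sketch of the write-back WAR test.
def can_write_back(unread_sources, dest_reg):
    # WAR hazard iff some earlier-issued instruction still has to
    # read the register this instruction wants to write.
    return all(dest_reg not in unread for unread in unread_sources)
```

For the slide's running example: while SUB.D F15,F0,F3 has not yet read its operands, a completing ADD.D that targets F3 must hold its result in the functional unit.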

19 Dynamic scheduling with scoreboard: Write back stage

Resolves all WAR hazards before writing the results (same checks as on the previous slide).

EXAMPLE:
IN ORDER:              OUT OF ORDER (dynamic scheduling):
DIV.D F0,F1,F2         DIV.D F0,F1,F2
SUB.D F15,F0,F3        ADD.D F3,F4,F5
ADD.D F3,F4,F5         SUB.D F15,F0,F3

The WAR hazard is solved by retaining the ADD.D instruction in the WB stage until the SUB.D has read the value of F3.

20 Dynamic scheduling with scoreboard: information on the scoreboard

Instruction status
- Where the instruction is: issue, read operands, execute (EX), write back (WB)

Status of each functional unit (FU)
- Busy: FU is occupied
- Op: operation being executed (e.g., +, -)
- Fi: destination register number
- Fj, Fk: source register numbers
- Qj, Qk: FUs computing the value of source registers Fj, Fk
- Rj, Rk: flags indicating whether the register value is already available

Register result status
- Indicates which functional unit will produce the value of each register
- If an active (executing) instruction writes to a register, indicate the corresponding functional unit name; otherwise leave the entry empty
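The three tables above can be sketched as Python data structures (class and field names are assumptions; the issue method only shows how Qj/Qk and Rj/Rk are filled in from the register result status when an instruction is issued):

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class FUStatus:
    busy: bool = False          # Busy: FU occupied
    op: Optional[str] = None    # Op: operation being executed
    fi: Optional[str] = None    # Fi: destination register
    fj: Optional[str] = None    # Fj, Fk: source registers
    fk: Optional[str] = None
    qj: Optional[str] = None    # Qj, Qk: FU producing Fj/Fk (None if ready)
    qk: Optional[str] = None
    rj: bool = False            # Rj, Rk: source value already available?
    rk: bool = False

@dataclass
class Scoreboard:
    fu: Dict[str, FUStatus] = field(default_factory=dict)     # FU status table
    reg_result: Dict[str, str] = field(default_factory=dict)  # reg -> producing FU

    def issue(self, fu_name, op, fi, fj, fk):
        u = self.fu[fu_name]
        u.busy, u.op, u.fi, u.fj, u.fk = True, op, fi, fj, fk
        u.qj = self.reg_result.get(fj)   # pending producer of Fj, if any
        u.qk = self.reg_result.get(fk)
        u.rj = u.qj is None              # ready iff nobody will still write it
        u.rk = u.qk is None
        self.reg_result[fi] = fu_name    # register result status update
```

Issuing L.D F2,45(R3) on INT and then MUL.D F0,F2,F4 on MULT1 leaves the multiplier with Qj=INT and Rj=False, exactly the situation in the worked example that follows.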

21 Dynamic scheduling with scoreboard: example

Consider the execution of the instructions below on a processor with non-pipelined functional units:
- 1x integer ALU, with 1 cycle latency (INT)
- 1x FP multiplier, with 10 cycles latency (MULT1)
- 1x FP adder/subtractor, with 2 cycles latency (ADD)
- 1x INT/FP divider, with 40 cycles latency (DIV)

L.D   F6,34(R2)
L.D   F2,45(R3)
MUL.D F0,F2,F4
SUB.D F8,F6,F2
DIV.D F10,F0,F6
ADD.D F6,F8,F2

The load/store unit has 2 cycles latency (address calculation + memory access), with the computation of the effective address performed on the integer ALU.

22 Dynamic scheduling with scoreboard: information on the scoreboard

[Empty scoreboard for the example: instruction status rows for the six instructions above; FU status rows for INT, MULT1, ADD and DIV; register result status row (producing FU per register)]

23 Information on the scoreboard: ISSUE stage

At issue, the scoreboard:
- checks whether the target FU is available (FU status table)
- checks for WAW conflicts (register result status table)
- on issue, updates the target FU entry and the target register entry

24 Information on the scoreboard: DISPATCH stage

At dispatch, the scoreboard:
- checks whether the operands are ready (Rj/Rk flags, set at issue)
- dispatches all operations that have their data ready

25 Information on the scoreboard: WB stage

At write back, the scoreboard:
- checks for WAR hazards, by checking whether any instruction is still waiting to read a value originated from a preceding instruction
- updates the Ready (Rj/Rk) information if necessary
- clears the corresponding register entry in the register result status

26 Dynamic scheduling with scoreboard: scoreboard update example
CYCLE 1: Issue of L.D F6,34(R2) is performed, since the INT unit is not busy.
FU status (updated on rising edge): INT: Busy=Yes, Op=L.D, Fi=F6, Fj=#34, Fk=R2, Rj=Yes, Rk=Yes
Register result status (updated on rising edge): F6 -> INT

27 Dynamic scheduling with scoreboard: scoreboard update example
CYCLE 2: Since all operands are ready, the first load is dispatched for execution; the second load cannot issue until the INT unit becomes available.
FU status: INT (2 cycles left): Busy=Yes, Op=L.D, Fi=F6, Fj=#34, Fk=R2
Register result status: F6 -> INT

28 Dynamic scheduling with scoreboard: scoreboard update example
CYCLE 4: The INT unit informs the scoreboard that execution has finished.
FU status: INT (done): Busy=Yes, Op=L.D, Fi=F6, Fj=#34, Fk=R2
Register result status: F6 -> INT

29 Dynamic scheduling with scoreboard: scoreboard update example
CYCLE 5: The scoreboard checks that no WAR conflict exists and allows the first load to complete. The INT entry and the F6 register result entry are cleared at the end of the cycle (rising edge).

30 Dynamic scheduling with scoreboard: scoreboard update example
CYCLE 6: The second load can finally issue, since the INT unit is now available.
FU status (updated at the end of the cycle, rising edge): INT: Busy=Yes, Op=L.D, Fi=F2, Fj=#45, Fk=R3, Rj=Yes, Rk=Yes
Register result status: F2 -> INT

31 Dynamic scheduling with scoreboard: scoreboard update example
CYCLE 7: The second load is dispatched for execution, while the MUL.D is issued.
FU status:
- INT (2 cycles left): Busy=Yes, Op=L.D, Fi=F2, Fj=#45, Fk=R3
- MULT1: Busy=Yes, Op=MUL.D, Fi=F0, Fj=F2, Fk=F4, Qj=INT, Rj=No, Rk=Yes
Register result status: F0 -> MULT1, F2 -> INT

32 Dynamic scheduling with scoreboard: scoreboard update example
CYCLE 8: The SUB.D is issued; the MUL.D cannot be dispatched until operand F2 is written to the register file.
FU status:
- INT (1 cycle left): Busy=Yes, Op=L.D, Fi=F2, Fj=#45, Fk=R3
- MULT1: Busy=Yes, Op=MUL.D, Fi=F0, Fj=F2, Fk=F4, Qj=INT, Rj=No, Rk=Yes
- ADD: Busy=Yes, Op=SUB.D, Fi=F8, Fj=F6, Fk=F2, Qk=INT, Rj=Yes, Rk=No
Register result status: F0 -> MULT1, F2 -> INT, F8 -> ADD

33 Dynamic scheduling with scoreboard: scoreboard update example
CYCLE 9: The second load finishes execution, while the DIV.D is issued; neither the MUL.D nor the SUB.D can be dispatched, because operand F2 has not yet been written to the register file.
FU status:
- INT (done): Busy=Yes, Op=L.D, Fi=F2, Fj=#45, Fk=R3
- MULT1: Busy=Yes, Op=MUL.D, Fi=F0, Fj=F2, Fk=F4, Qj=INT, Rj=No, Rk=Yes
- ADD: Busy=Yes, Op=SUB.D, Fi=F8, Fj=F6, Fk=F2, Qk=INT, Rj=Yes, Rk=No
- DIV: Busy=Yes, Op=DIV.D, Fi=F10, Fj=F0, Fk=F6, Qj=MULT1, Rj=No, Rk=Yes
Register result status: F0 -> MULT1, F2 -> INT, F8 -> ADD, F10 -> DIV

34 Dynamic scheduling with scoreboard: scoreboard update example
CYCLE 10: The second load writes its result to the register file; the MUL.D and SUB.D see operand F2 become ready at the end of the cycle (rising edge), while the DIV.D keeps waiting for operand F0 (to be produced by MULT1). The ADD.D cannot issue: it generates a structural hazard, since the ADD unit is occupied.
FU status:
- MULT1: Busy=Yes, Op=MUL.D, Fi=F0, Fj=F2, Fk=F4, Rj=Yes, Rk=Yes (updated at end of cycle)
- ADD: Busy=Yes, Op=SUB.D, Fi=F8, Fj=F6, Fk=F2, Rj=Yes, Rk=Yes (updated at end of cycle)
- DIV: Busy=Yes, Op=DIV.D, Fi=F10, Fj=F0, Fk=F6, Qj=MULT1, Rj=No, Rk=Yes
Register result status: F0 -> MULT1, F8 -> ADD, F10 -> DIV

35 Dynamic scheduling with scoreboard: scoreboard update example
CYCLE 11: The MUL.D and SUB.D can be dispatched simultaneously; the ADD.D still cannot be issued, because the ADD unit is busy.
FU status:
- MULT1 (10 cycles left): Busy=Yes, Op=MUL.D, Fi=F0, Fj=F2, Fk=F4
- ADD (2 cycles left): Busy=Yes, Op=SUB.D, Fi=F8, Fj=F6, Fk=F2
- DIV: Busy=Yes, Op=DIV.D, Fi=F10, Fj=F0, Fk=F6, Qj=MULT1, Rj=No, Rk=Yes
Register result status: F0 -> MULT1, F8 -> ADD, F10 -> DIV

36 Dynamic scheduling with scoreboard: scoreboard update example
CYCLE 13: The SUB.D finishes execution.
FU status:
- MULT1 (8 cycles left): Busy=Yes, Op=MUL.D, Fi=F0, Fj=F2, Fk=F4
- ADD (done): Busy=Yes, Op=SUB.D, Fi=F8, Fj=F6, Fk=F2
- DIV: Busy=Yes, Op=DIV.D, Fi=F10, Fj=F0, Fk=F6, Qj=MULT1, Rj=No, Rk=Yes
Register result status: F0 -> MULT1, F8 -> ADD, F10 -> DIV

37 Dynamic scheduling with scoreboard: scoreboard update example
CYCLE 14: The SUB.D writes its result; the ADD entry and the F8 register result entry are cleared.
FU status:
- MULT1 (7 cycles left): Busy=Yes, Op=MUL.D, Fi=F0, Fj=F2, Fk=F4
- DIV: Busy=Yes, Op=DIV.D, Fi=F10, Fj=F0, Fk=F6, Qj=MULT1, Rj=No, Rk=Yes
Register result status: F0 -> MULT1, F10 -> DIV

38 Dynamic scheduling with scoreboard: scoreboard update example
CYCLE 15: The ADD.D no longer faces a structural hazard and can be issued.
FU status:
- MULT1 (6 cycles left): Busy=Yes, Op=MUL.D, Fi=F0, Fj=F2, Fk=F4
- ADD: Busy=Yes, Op=ADD.D, Fi=F6, Fj=F8, Fk=F2, Rj=Yes, Rk=Yes
- DIV: Busy=Yes, Op=DIV.D, Fi=F10, Fj=F0, Fk=F6, Qj=MULT1, Rj=No, Rk=Yes
Register result status: F0 -> MULT1, F6 -> ADD, F10 -> DIV

39 Dynamic scheduling with scoreboard: scoreboard update example
CYCLE 16: The ADD.D has all operands ready and can be dispatched.
FU status:
- MULT1 (5 cycles left): Busy=Yes, Op=MUL.D, Fi=F0, Fj=F2, Fk=F4
- ADD (2 cycles left): Busy=Yes, Op=ADD.D, Fi=F6, Fj=F8, Fk=F2
- DIV: Busy=Yes, Op=DIV.D, Fi=F10, Fj=F0, Fk=F6, Qj=MULT1, Rj=No, Rk=Yes
Register result status: F0 -> MULT1, F6 -> ADD, F10 -> DIV

40 Dynamic scheduling with scoreboard: scoreboard update example
CYCLE 18: The ADD.D has finished executing.
FU status:
- MULT1 (3 cycles left): Busy=Yes, Op=MUL.D, Fi=F0, Fj=F2, Fk=F4
- ADD (done): Busy=Yes, Op=ADD.D, Fi=F6, Fj=F8, Fk=F2
- DIV: Busy=Yes, Op=DIV.D, Fi=F10, Fj=F0, Fk=F6, Qj=MULT1, Rj=No, Rk=Yes
Register result status: F0 -> MULT1, F6 -> ADD, F10 -> DIV

41 Dynamic scheduling with scoreboard: scoreboard update example
CYCLE 19: The ADD.D cannot write its result until the DIV.D instruction has read its operands (WAR hazard on F6).
FU status:
- MULT1 (2 cycles left): Busy=Yes, Op=MUL.D, Fi=F0, Fj=F2, Fk=F4
- ADD (waiting to write): Busy=Yes, Op=ADD.D, Fi=F6, Fj=F8, Fk=F2
- DIV: Busy=Yes, Op=DIV.D, Fi=F10, Fj=F0, Fk=F6, Qj=MULT1, Rj=No, Rk=Yes
Register result status: F0 -> MULT1, F6 -> ADD, F10 -> DIV

42 Dynamic scheduling with scoreboard: scoreboard update example
CYCLE 21: The MUL.D instruction has finished executing.
FU status:
- MULT1 (done): Busy=Yes, Op=MUL.D, Fi=F0, Fj=F2, Fk=F4
- ADD (waiting to write): Busy=Yes, Op=ADD.D, Fi=F6, Fj=F8, Fk=F2
- DIV: Busy=Yes, Op=DIV.D, Fi=F10, Fj=F0, Fk=F6, Qj=MULT1, Rj=No, Rk=Yes
Register result status: F0 -> MULT1, F6 -> ADD, F10 -> DIV

43 Dynamic scheduling with scoreboard: scoreboard update example
CYCLE 22: The MUL.D writes its result to F0; the MULT1 entry is cleared and the DIV.D's F0 operand becomes ready (Rj=Yes).
FU status:
- ADD (waiting to write): Busy=Yes, Op=ADD.D, Fi=F6, Fj=F8, Fk=F2
- DIV: Busy=Yes, Op=DIV.D, Fi=F10, Fj=F0, Fk=F6, Rj=Yes, Rk=Yes
Register result status: F6 -> ADD, F10 -> DIV

44 Dynamic scheduling with scoreboard: scoreboard update example
CYCLE 23: The DIV.D reads its operands and is dispatched for execution.
FU status:
- ADD (waiting to write): Busy=Yes, Op=ADD.D, Fi=F6, Fj=F8, Fk=F2
- DIV (40 cycles left): Busy=Yes, Op=DIV.D, Fi=F10, Fj=F0, Fk=F6
Register result status: F6 -> ADD, F10 -> DIV

45 Dynamic scheduling with scoreboard: scoreboard update example
CYCLE 24: The ADD.D can write its result to the register file, since no WAR hazards remain.
FU status:
- DIV (39 cycles left): Busy=Yes, Op=DIV.D, Fi=F10, Fj=F0, Fk=F6
Register result status: F10 -> DIV

46 Dynamic scheduling with scoreboard: scoreboard update example
CYCLE 64: The DIV.D writes its result to register F10; the scoreboard is now empty.

47 Dynamic scheduling with scoreboard: scoreboard update example (summary)

Final instruction status (cycle of each stage):
L.D   F6,34(R2)   issue 1,  dispatch 2,  execute done 4,  write back 5
L.D   F2,45(R3)   issue 6,  dispatch 7,  execute done 9,  write back 10
MUL.D F0,F2,F4    issue 7,  dispatch 11, execute done 21, write back 22
SUB.D F8,F6,F2    issue 8,  dispatch 11, execute done 13, write back 14
DIV.D F10,F0,F6   issue 9,  dispatch 23, execute done 63, write back 64
ADD.D F6,F8,F2    issue 15, dispatch 16, execute done 18, write back 24

Issue is performed in-order; dispatch (operand read), execute and write back are performed out-of-order.

48 Dynamic scheduling with scoreboard: limitations

- Centralized control: status and control signals are generated in the scoreboard unit, where all the scheduling information is kept
- Forwarding techniques cannot be easily applied: the scoreboard must be informed whenever a register is read
- WAW hazards completely block the pipeline; WAR hazards delay writing results to the register file, thus delaying the dispatch of new instructions (register renaming is not applied!)
- The scoreboard is a complex structure that may require multiple updates in a single clock cycle

NEXT: a distributed dynamic scheduling technique

49 Next lesson
Dynamic techniques to extract parallelism: Tomasulo


More information

5008: Computer Architecture

5008: Computer Architecture 5008: Computer Architecture Chapter 2 Instruction-Level Parallelism and Its Exploitation CA Lecture05 - ILP (cwliu@twins.ee.nctu.edu.tw) 05-1 Review from Last Lecture Instruction Level Parallelism Leverage

More information

Review: Evaluating Branch Alternatives. Lecture 3: Introduction to Advanced Pipelining. Review: Evaluating Branch Prediction

Review: Evaluating Branch Alternatives. Lecture 3: Introduction to Advanced Pipelining. Review: Evaluating Branch Prediction Review: Evaluating Branch Alternatives Lecture 3: Introduction to Advanced Pipelining Two part solution: Determine branch taken or not sooner, AND Compute taken branch address earlier Pipeline speedup

More information

Good luck and have fun!

Good luck and have fun! Midterm Exam October 13, 2014 Name: Problem 1 2 3 4 total Points Exam rules: Time: 90 minutes. Individual test: No team work! Open book, open notes. No electronic devices, except an unprogrammed calculator.

More information

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenkovic, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Instruction

More information

Chapter 3: Instruction Level Parallelism (ILP) and its exploitation. Types of dependences

Chapter 3: Instruction Level Parallelism (ILP) and its exploitation. Types of dependences Chapter 3: Instruction Level Parallelism (ILP) and its exploitation Pipeline CPI = Ideal pipeline CPI + stalls due to hazards invisible to programmer (unlike process level parallelism) ILP: overlap execution

More information

Instruction Frequency CPI. Load-store 55% 5. Arithmetic 30% 4. Branch 15% 4

Instruction Frequency CPI. Load-store 55% 5. Arithmetic 30% 4. Branch 15% 4 PROBLEM 1: An application running on a 1GHz pipelined processor has the following instruction mix: Instruction Frequency CPI Load-store 55% 5 Arithmetic 30% 4 Branch 15% 4 a) Determine the overall CPI

More information

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: March 16, 2001 Prof. David A. Patterson Computer Science 252 Spring

More information

CPE 631 Lecture 11: Instruction Level Parallelism and Its Dynamic Exploitation

CPE 631 Lecture 11: Instruction Level Parallelism and Its Dynamic Exploitation Lecture 11: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenkovic, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Instruction

More information

Tomasulo s Algorithm

Tomasulo s Algorithm Tomasulo s Algorithm Architecture to increase ILP Removes WAR and WAW dependencies during issue WAR and WAW Name Dependencies Artifact of using the same storage location (variable name) Can be avoided

More information

吳俊興高雄大學資訊工程學系. October Example to eleminate WAR and WAW by register renaming. Tomasulo Algorithm. A Dynamic Algorithm: Tomasulo s Algorithm

吳俊興高雄大學資訊工程學系. October Example to eleminate WAR and WAW by register renaming. Tomasulo Algorithm. A Dynamic Algorithm: Tomasulo s Algorithm EEF011 Computer Architecture 計算機結構 吳俊興高雄大學資訊工程學系 October 2004 Example to eleminate WAR and WAW by register renaming Original DIV.D ADD.D S.D SUB.D MUL.D F0, F2, F4 F6, F0, F8 F6, 0(R1) F8, F10, F14 F6,

More information

Superscalar Architectures: Part 2

Superscalar Architectures: Part 2 Superscalar Architectures: Part 2 Dynamic (Out-of-Order) Scheduling Lecture 3.2 August 23 rd, 2017 Jae W. Lee (jaewlee@snu.ac.kr) Computer Science and Engineering Seoul NaMonal University Download this

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation 1 Branch Prediction Basic 2-bit predictor: For each branch: Predict taken or not

More information

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism

More information

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenković, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Tomasulo

More information

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e Instruction Level Parallelism Appendix C and Chapter 3, HP5e Outline Pipelining, Hazards Branch prediction Static and Dynamic Scheduling Speculation Compiler techniques, VLIW Limits of ILP. Implementation

More information

Handout 2 ILP: Part B

Handout 2 ILP: Part B Handout 2 ILP: Part B Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP

More information

Lecture: Pipeline Wrap-Up and Static ILP

Lecture: Pipeline Wrap-Up and Static ILP Lecture: Pipeline Wrap-Up and Static ILP Topics: multi-cycle instructions, precise exceptions, deep pipelines, compiler scheduling, loop unrolling, software pipelining (Sections C.5, 3.2) 1 Multicycle

More information

Page # CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Michela Taufer

Page # CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Michela Taufer CISC 662 Graduate Computer Architecture Lecture 8 - ILP 1 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer Architecture,

More information

Chapter 3 & Appendix C Part B: ILP and Its Exploitation

Chapter 3 & Appendix C Part B: ILP and Its Exploitation CS359: Computer Architecture Chapter 3 & Appendix C Part B: ILP and Its Exploitation Yanyan Shen Department of Computer Science and Engineering Shanghai Jiao Tong University 1 Outline 3.1 Concepts and

More information

Instruction Level Parallelism (ILP)

Instruction Level Parallelism (ILP) Instruction Level Parallelism (ILP) Pipelining supports a limited sense of ILP e.g. overlapped instructions, out of order completion and issue, bypass logic, etc. Remember Pipeline CPI = Ideal Pipeline

More information

Four Steps of Speculative Tomasulo cycle 0

Four Steps of Speculative Tomasulo cycle 0 HW support for More ILP Hardware Speculative Execution Speculation: allow an instruction to issue that is dependent on branch, without any consequences (including exceptions) if branch is predicted incorrectly

More information

Instruction Level Parallelism

Instruction Level Parallelism Instruction Level Parallelism The potential overlap among instruction execution is called Instruction Level Parallelism (ILP) since instructions can be executed in parallel. There are mainly two approaches

More information

Page 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer

Page 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer CISC 662 Graduate Computer Architecture Lecture 8 - ILP 1 Michela Taufer Pipeline CPI http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson

More information

ELE 818 * ADVANCED COMPUTER ARCHITECTURES * MIDTERM TEST *

ELE 818 * ADVANCED COMPUTER ARCHITECTURES * MIDTERM TEST * ELE 818 * ADVANCED COMPUTER ARCHITECTURES * MIDTERM TEST * SAMPLE 1 Section: Simple pipeline for integer operations For all following questions we assume that: a) Pipeline contains 5 stages: IF, ID, EX,

More information

CS433 Homework 2 (Chapter 3)

CS433 Homework 2 (Chapter 3) CS433 Homework 2 (Chapter 3) Assigned on 9/19/2017 Due in class on 10/5/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies

More information

CS433 Homework 3 (Chapter 3)

CS433 Homework 3 (Chapter 3) CS433 Homework 3 (Chapter 3) Assigned on 10/3/2017 Due in class on 10/17/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies

More information

CS425 Computer Systems Architecture

CS425 Computer Systems Architecture CS425 Computer Systems Architecture Fall 2018 Static Instruction Scheduling 1 Techniques to reduce stalls CPI = Ideal CPI + Structural stalls per instruction + RAW stalls per instruction + WAR stalls per

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation Introduction Pipelining become universal technique in 1985 Overlaps execution of

More information

Complex Pipelining COE 501. Computer Architecture Prof. Muhamed Mudawar

Complex Pipelining COE 501. Computer Architecture Prof. Muhamed Mudawar Complex Pipelining COE 501 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Diversified Pipeline Detecting

More information

Website for Students VTU NOTES QUESTION PAPERS NEWS RESULTS

Website for Students VTU NOTES QUESTION PAPERS NEWS RESULTS Advanced Computer Architecture- 06CS81 Hardware Based Speculation Tomasulu algorithm and Reorder Buffer Tomasulu idea: 1. Have reservation stations where register renaming is possible 2. Results are directly

More information

CACHE MEMORIES ADVANCED COMPUTER ARCHITECTURES. Slides by: Pedro Tomás

CACHE MEMORIES ADVANCED COMPUTER ARCHITECTURES. Slides by: Pedro Tomás CACHE MEMORIES Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 2 and Appendix B, John L. Hennessy and David A. Patterson, Morgan Kaufmann,

More information

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor Single-Issue Processor (AKA Scalar Processor) CPI IPC 1 - One At Best 1 - One At best 1 From Single-Issue to: AKS Scalar Processors CPI < 1? How? Multiple issue processors: VLIW (Very Long Instruction

More information

Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.

Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2. Instruction-Level Parallelism and its Exploitation: PART 2 Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.8)

More information

CS433 Homework 2 (Chapter 3)

CS433 Homework 2 (Chapter 3) CS Homework 2 (Chapter ) Assigned on 9/19/2017 Due in class on 10/5/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies on collaboration..

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

Complications with long instructions. CMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3. How slow is slow?

Complications with long instructions. CMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3. How slow is slow? Complications with long instructions CMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3 Long Instructions & MIPS Case Study So far, all MIPS instructions take 5 cycles But haven't talked

More information

Page 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution

Page 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 15: Instruction Level Parallelism and Dynamic Execution March 11, 2002 Prof. David E. Culler Computer Science 252 Spring 2002

More information

Compiler Optimizations. Lecture 7 Overview of Superscalar Techniques. Memory Allocation by Compilers. Compiler Structure. Register allocation

Compiler Optimizations. Lecture 7 Overview of Superscalar Techniques. Memory Allocation by Compilers. Compiler Structure. Register allocation Lecture 7 Overview of Superscalar Techniques CprE 581 Computer Systems Architecture, Fall 2013 Reading: Textbook, Ch. 3 Complexity-Effective Superscalar Processors, PhD Thesis by Subbarao Palacharla, Ch.1

More information

Advanced issues in pipelining

Advanced issues in pipelining Advanced issues in pipelining 1 Outline Handling exceptions Supporting multi-cycle operations Pipeline evolution Examples of real pipelines 2 Handling exceptions 3 Exceptions In pipelined execution, one

More information

ECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 8 Instruction-Level Parallelism Part 1

ECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 8 Instruction-Level Parallelism Part 1 ECE 252 / CPS 220 Advanced Computer Architecture I Lecture 8 Instruction-Level Parallelism Part 1 Benjamin Lee Electrical and Computer Engineering Duke University www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class_ece252fall11.html

More information

Exploiting ILP with SW Approaches. Aleksandar Milenković, Electrical and Computer Engineering University of Alabama in Huntsville

Exploiting ILP with SW Approaches. Aleksandar Milenković, Electrical and Computer Engineering University of Alabama in Huntsville Lecture : Exploiting ILP with SW Approaches Aleksandar Milenković, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Basic Pipeline Scheduling and Loop

More information

Instruction Level Parallelism

Instruction Level Parallelism Instruction Level Parallelism Dynamic scheduling Scoreboard Technique Tomasulo Algorithm Speculation Reorder Buffer Superscalar Processors 1 Definition of ILP ILP=Potential overlap of execution among unrelated

More information

ILP: Instruction Level Parallelism

ILP: Instruction Level Parallelism ILP: Instruction Level Parallelism Tassadaq Hussain Riphah International University Barcelona Supercomputing Center Universitat Politècnica de Catalunya Introduction Introduction Pipelining become universal

More information

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor 1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic

More information

EITF20: Computer Architecture Part3.2.1: Pipeline - 3

EITF20: Computer Architecture Part3.2.1: Pipeline - 3 EITF20: Computer Architecture Part3.2.1: Pipeline - 3 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Dynamic scheduling - Tomasulo Superscalar, VLIW Speculation ILP limitations What we have done

More information

COSC 6385 Computer Architecture - Instruction Level Parallelism (II)

COSC 6385 Computer Architecture - Instruction Level Parallelism (II) COSC 6385 Computer Architecture - Instruction Level Parallelism (II) Edgar Gabriel Spring 2016 Data fields for reservation stations Op: operation to perform on source operands S1 and S2 Q j, Q k : reservation

More information

Lecture: Static ILP. Topics: compiler scheduling, loop unrolling, software pipelining (Sections C.5, 3.2)

Lecture: Static ILP. Topics: compiler scheduling, loop unrolling, software pipelining (Sections C.5, 3.2) Lecture: Static ILP Topics: compiler scheduling, loop unrolling, software pipelining (Sections C.5, 3.2) 1 Static vs Dynamic Scheduling Arguments against dynamic scheduling: requires complex structures

More information

Lecture-13 (ROB and Multi-threading) CS422-Spring

Lecture-13 (ROB and Multi-threading) CS422-Spring Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue

More information

Updated Exercises by Diana Franklin

Updated Exercises by Diana Franklin C-82 Appendix C Pipelining: Basic and Intermediate Concepts Updated Exercises by Diana Franklin C.1 [15/15/15/15/25/10/15] Use the following code fragment: Loop: LD R1,0(R2) ;load R1 from address

More information

CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions

CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis6627 Powerpoint Lecture Notes from John Hennessy

More information

Instruction Pipelining Review

Instruction Pipelining Review Instruction Pipelining Review Instruction pipelining is CPU implementation technique where multiple operations on a number of instructions are overlapped. An instruction execution pipeline involves a number

More information

Complex Pipelining. Motivation

Complex Pipelining. Motivation 6.823, L10--1 Complex Pipelining Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Motivation 6.823, L10--2 Pipelining becomes complex when we want high performance in the presence

More information

Graduate Computer Architecture. Chapter 3. Instruction Level Parallelism and Its Dynamic Exploitation

Graduate Computer Architecture. Chapter 3. Instruction Level Parallelism and Its Dynamic Exploitation Graduate Computer Architecture Chapter 3 Instruction Level Parallelism and Its Dynamic Exploitation 1 Overview Instruction level parallelism Dynamic Scheduling Techniques Scoreboarding (Appendix A.8) Tomasulo

More information

EECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?)

EECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?) Evolution of Processor Performance So far we examined static & dynamic techniques to improve the performance of single-issue (scalar) pipelined CPU designs including: static & dynamic scheduling, static

More information

Minimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline

Minimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline Instruction Pipelining Review: MIPS In-Order Single-Issue Integer Pipeline Performance of Pipelines with Stalls Pipeline Hazards Structural hazards Data hazards Minimizing Data hazard Stalls by Forwarding

More information

Dynamic Scheduling. Better than static scheduling Scoreboarding: Tomasulo algorithm:

Dynamic Scheduling. Better than static scheduling Scoreboarding: Tomasulo algorithm: LECTURE - 13 Dynamic Scheduling Better than static scheduling Scoreboarding: Used by the CDC 6600 Useful only within basic block WAW and WAR stalls Tomasulo algorithm: Used in IBM 360/91 for the FP unit

More information

CMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3. Complications With Long Instructions

CMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3. Complications With Long Instructions CMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3 Long Instructions & MIPS Case Study Complications With Long Instructions So far, all MIPS instructions take 5 cycles But haven't talked

More information

Getting CPI under 1: Outline

Getting CPI under 1: Outline CMSC 411 Computer Systems Architecture Lecture 12 Instruction Level Parallelism 5 (Improving CPI) Getting CPI under 1: Outline More ILP VLIW branch target buffer return address predictor superscalar more

More information

EE 4683/5683: COMPUTER ARCHITECTURE

EE 4683/5683: COMPUTER ARCHITECTURE EE 4683/5683: COMPUTER ARCHITECTURE Lecture 4A: Instruction Level Parallelism - Static Scheduling Avinash Kodi, kodi@ohio.edu Agenda 2 Dependences RAW, WAR, WAW Static Scheduling Loop-carried Dependence

More information

Solutions to exercises on Instruction Level Parallelism

Solutions to exercises on Instruction Level Parallelism Solutions to exercises on Instruction Level Parallelism J. Daniel García Sánchez (coordinator) David Expósito Singh Javier García Blas Computer Architecture ARCOS Group Computer Science and Engineering

More information

CS 152 Computer Architecture and Engineering. Lecture 13 - Out-of-Order Issue and Register Renaming

CS 152 Computer Architecture and Engineering. Lecture 13 - Out-of-Order Issue and Register Renaming CS 152 Computer Architecture and Engineering Lecture 13 - Out-of-Order Issue and Register Renaming Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://wwweecsberkeleyedu/~krste

More information