INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing


UNIVERSIDADE TÉCNICA DE LISBOA
INSTITUTO SUPERIOR TÉCNICO
Departamento de Engenharia Informática

Architectures for Embedded Computing
MEIC-A, MEIC-T, MERC
Lecture Slides (English version)

Lecture 09
Title: Dynamic and Speculative Scheduling
Summary: Dynamic scheduling (centralized control: scoreboard; distributed control: Tomasulo algorithm); speculative scheduling.

2010/2011
Nuno.Roma@ist.utl.pt

Previous Class

In the previous class: pipelining implementation problems:
- multi-cycle instructions;
- super-pipelining;
- interruptions;
- code optimization for pipeline execution.

Prof. Nuno Roma, ACE 2010/11 - DEI-IST

Road Map / Summary

Today: dynamic scheduling:
- centralized control: scoreboard;
- distributed control: Tomasulo algorithm;
- speculative scheduling.

Bibliography: Computer Architecture: A Quantitative Approach, Sections A.7, 2.4 and 2.6.

Dynamic Scheduling

Static scheduling:
- Instructions are read from memory and executed in the pipeline in the same order as they occur in the program;
- If there is a data dependency that cannot be hidden, the hazard-detection hardware stalls the pipeline by suspending the fetching of new instructions;
- To avoid this penalization, the compiler can schedule/intercalate other instructions that do not interfere with the dependency/hazard.

Compiler/static solutions: interleaving of instructions; loop unrolling; etc.

Dynamic scheduling: the hardware (on its own) re-arranges the instruction schedule in order to:
- Reduce hazards and dependencies that were unknown at compile time: simplification of the compiler;
- Mask unpredictable delays (e.g. cache misses) by executing other operations that do not impose any dependencies: performance improvement.

Structural and data hazards are checked in the ID stage: as soon as the conditions required to execute an instruction are assured, it is released from the ID stage.

Execution can be done out-of-order: each instruction is executed as soon as its operands are available. Out-of-order execution implies out-of-order completion!

The hardware may change the execution order of the instructions while preserving the program's functionality. This is particularly useful whenever the execution stage is implemented by several concurrent processing units, e.g. floating-point computational units.

The ID stage is now separated into 2 sub-steps:
- Issue: decode and check for structural hazards;
- Read operands: wait until no data hazards remain, then read the operands.

Pipeline: F - I - R - X - M - W

All instructions pass through the issue step in their original order (in-order issue); instructions may be stalled or bypassed by others that were blocked in the read-operands step (out-of-order read), using a pending-instructions queue.

Centralized control: scoreboarding

Scoreboarding allows an out-of-order execution of instructions as soon as the required resources are available and there isn't any data dependency.

Potential WAR and WAW hazards:

  WAR hazard on F8:         WAW hazard on F10:
  DIV.D F0, F2, F4          DIV.D F0, F2, F4
  ADD.D F10, F0, F8         ADD.D F10, F0, F8
  SUB.D F8, F8, F14         SUB.D F10, F8, F14

Several instructions may simultaneously be in the EX stage (in distinct execution units or in different phases of the same execution unit).
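The three data-hazard classes above can be checked mechanically. Below is a minimal sketch (Python, not from the slides; the tuple encoding of instructions is an assumption made for illustration):

```python
# Classify data hazards between two instructions, where `later`
# follows `earlier` in program order.
# Each instruction is encoded as (dest_register, [source_registers]).

def hazards(earlier, later):
    e_dst, e_src = earlier
    l_dst, l_src = later
    found = set()
    if e_dst in l_src:      # later reads what earlier writes
        found.add("RAW")
    if l_dst in e_src:      # later writes what earlier still reads
        found.add("WAR")
    if l_dst == e_dst:      # both write the same register
        found.add("WAW")
    return found

# DIV.D F0,F2,F4 ; ADD.D F10,F0,F8  -> RAW on F0
print(hazards(("F0", ["F2", "F4"]), ("F10", ["F0", "F8"])))   # {'RAW'}
# ADD.D F10,F0,F8 ; SUB.D F8,F8,F14 -> WAR on F8
print(hazards(("F10", ["F0", "F8"]), ("F8", ["F8", "F14"])))  # {'WAR'}
# ADD.D F10,F0,F8 ; SUB.D F10,F8,F14 -> WAW on F10
print(hazards(("F10", ["F0", "F8"]), ("F10", ["F8", "F14"]))) # {'WAW'}
```

The same three checks are what the scoreboard hardware performs, spread across its issue, read-operands and write stages.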

Centralized control: scoreboarding

The scoreboard records all dependencies and hazards that occur along the program, determining when each instruction can:

- Issue: checks for structural and WAW data hazards, introducing stalls whenever necessary. Example of a WAW hazard (both ADD.D and SUB.D write F10):
    DIV.D F0, F2, F4
    ADD.D F10, F0, F8
    SUB.D F10, F8, F14

- Read operands: checks for RAW data hazards, introducing stalls whenever necessary. Example of a RAW hazard (ADD.D reads F0, produced by DIV.D):
    DIV.D F0, F2, F4
    ADD.D F10, F0, F8
    SUB.D F8, F8, F14

- Execute: when the result is ready, each EX stage notifies the scoreboard that it has completed execution. This step replaces the EX step of the original pipeline and takes multiple cycles in the FP pipeline.

- Write/store: checks for WAR data hazards, introducing stalls whenever necessary. Example of a WAR hazard (SUB.D writes F8, which ADD.D has not yet read):
    DIV.D F0, F2, F4
    ADD.D F10, F0, F8
    SUB.D F8, F8, F14

The operands are read only when all of them are available at the register bank: there is NO forwarding!

Centralized control: scoreboarding

The scoreboard is structured in 3 tables:

- Instruction Status: indicates in which of the four steps each instruction currently is;

- Functional Unit Status: indicates the status of each functional unit, using the following 9 fields:
    Busy - whether the unit is busy or not;
    Op - type of operation;
    Fi - destination register;
    Fj, Fk - source-register numbers;
    Qj, Qk - functional units producing source registers Fj and Fk;
    Rj, Rk - flags indicating that Fj and Fk are ready and not yet read; set to No after the operands are read;

- Register Result Status: indicates which functional unit will write each register.
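The functional-unit status fields and the issue check can be sketched in a few lines. This is a minimal illustration of the two conditions the scoreboard tests at issue time (structural hazard and WAW hazard), not an actual scoreboard implementation; all names are assumptions:

```python
from dataclasses import dataclass

# One Functional Unit Status entry, with the 9 fields from the slide.
@dataclass
class FUStatus:
    busy: bool = False
    op: str = ""
    Fi: str = ""      # destination register
    Fj: str = ""      # source register j
    Fk: str = ""      # source register k
    Qj: str = ""      # unit producing Fj ("" means ready)
    Qk: str = ""      # unit producing Fk
    Rj: bool = False  # Fj ready and not yet read
    Rk: bool = False  # Fk ready and not yet read

def can_issue(unit: FUStatus, dest: str, register_result: dict) -> bool:
    """Issue only if the unit is free (no structural hazard) and no
    other unit is already registered to write `dest` (no WAW hazard).
    `register_result` models the Register Result Status table."""
    structural = unit.busy
    waw = dest in register_result
    return not structural and not waw

register_result = {"F0": "Mult1"}            # Mult1 will write F0
print(can_issue(FUStatus(), "F0", register_result))  # False: WAW on F0
print(can_issue(FUStatus(), "F8", register_result))  # True
```

The read-operands step would then wait until Rj and Rk are both set, mirroring the "no forwarding" rule stated above.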

Centralized control: scoreboarding - example:

  L.D   F6, 34(R2)
  L.D   F2, 45(R3)
  MUL.D F0, F2, F4
  SUB.D F8, F6, F2
  DIV.D F10, F0, F6
  ADD.D F6, F8, F2

Current status (first Load finished; second Load still in the Integer unit):

Functional Unit Status:
  Unit     Busy  Op    Fi   Fj   Fk   Qj     Qk   Rj  Rk
  Integer  Y     Load  F2   R3               --   Y   --
  Mult1    Y     Mult  F0   F2   F4   Int    --   N   Y
  Mult2    N
  Add      Y     Sub   F8   F6   F2   --     Int  Y   N
  Div      Y     Div   F10  F0   F6   Mult1  --   N   Y

Register Result Status:
  F0: Mult1   F2: Int   F8: Add   F10: Div

Question: what is the scoreboard status immediately before instruction MUL.D writes its result?

Functional Unit Status:
  Unit     Busy  Op    Fi   Fj   Fk   Qj     Qk   Rj  Rk
  Integer  N
  Mult1    Y     Mult  F0   F2   F4               N   N
  Mult2    N
  Add      Y     Add   F6   F8   F2               N   N
  Div      Y     Div   F10  F0   F6   Mult1       N   Y

Register Result Status:
  F0: Mult1   F6: Add   F10: Div

Notes:
- The DIV.D instruction still didn't read its operands, due to its data dependency on MUL.D;
- The ADD.D instruction has already read its operands and is under execution.

Question: what is the scoreboard status immediately before instruction DIV.D writes its result?

Functional Unit Status:
  Unit     Busy  Op    Fi   Fj   Fk   Qj  Qk  Rj  Rk
  Integer  N
  Mult1    N
  Mult2    N
  Add      N
  Div      Y     Div   F10  F0   F6           N   N

Register Result Status:
  F10: Div

Note: the ADD.D instruction completed its execution as soon as instruction DIV.D got into the read-operands stage.

First implementation of a scoreboard: CDC 6600 (1964).

Centralized control: scoreboarding

Achieved speedup (CDC 6600):
- 1.7 with Fortran applications;
- 2.5 with applications that were programmed and optimized in assembly.

Problems: large number of buses.

Limitations:
- The number of scoreboard entries determines how far ahead the pipeline can look for independent instructions;
- The number and types of functional units.

Distributed control: Tomasulo algorithm

First implementation: IBM 360/91 (1964). Motivation:
- only 4 floating-point registers;
- very long memory accesses;
- very long latency of the floating-point units.

Minimization of data hazards:
- RAW: each instruction is executed only when all its operands are available;
- WAR/WAW: register renaming: all destination registers are renamed, including those with a pending read or write from an earlier instruction, so that an out-of-order write does not affect any instruction that depends on an earlier value of the operand.

Distributed control: Tomasulo algorithm

Example (using temporary registers S and T):

  Original:                 Renamed:
  DIV.D F0, F2, F4          DIV.D F0, F2, F4
  ADD.D F6, F0, F8          ADD.D S, F0, F8
  S.D   F6, 0(R1)           S.D   S, 0(R1)
  SUB.D F8, F10, F14        SUB.D T, F10, F14
  MUL.D F6, F10, F8         MUL.D F6, F10, T

With this renaming, the MUL.D instruction can write over F6 before the S.D instruction is finalized.

Solution used to minimize the hazards: register renaming, accomplished through the use of reservation stations.
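The renaming step can be sketched as a simple pass over the instruction stream: every write allocates a fresh name, and every read uses the most recent name of its source. This is a minimal illustration with an unbounded pool of temporaries (real hardware uses the finite set of reservation stations); the tuple encoding is an assumption:

```python
# Register renaming sketch: program is a list of
# (opcode, dest_register_or_None, [source_registers]).

def rename(program):
    mapping = {}                                 # arch reg -> current name
    fresh = iter(f"T{i}" for i in range(1000))   # temporary-name pool
    out = []
    for op, dest, srcs in program:
        srcs = [mapping.get(s, s) for s in srcs] # read the current names
        if dest is not None:
            mapping[dest] = next(fresh)          # a new name per write
            out.append((op, mapping[dest], srcs))
        else:                                    # stores have no register dest
            out.append((op, None, srcs))
    return out

prog = [("DIV.D", "F0", ["F2", "F4"]),
        ("ADD.D", "F6", ["F0", "F8"]),
        ("S.D",   None, ["F6", "R1"]),
        ("SUB.D", "F8", ["F10", "F14"]),
        ("MUL.D", "F6", ["F10", "F8"])]
for insn in rename(prog):
    print(insn)
```

In the output, ADD.D and MUL.D write two different temporaries (so the WAW on F6 disappears), and SUB.D writes a temporary that ADD.D never read (so the WAR on F8 disappears), exactly the effect of S and T in the example above.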

Each reservation station stores the information of an instruction that has passed to the Issued status and is currently waiting for the release of the required execution unit or for the availability of the required operands.

Once in the Issued status, an instruction no longer indexes its operands by register number or memory address; from then on, each operand is indexed by the number of the reservation station holding the instruction that will produce its value.

Memory writes, pending in the store buffer, wait until both the value to be written and the corresponding effective address are available.

Each reservation station monitors the common data bus (CDB), where the several results are transferred, looking for its required operands: a data-forwarding mechanism that eliminates the RAW hazards.

The reservation stations represent an extension of the register bank, providing an easy solution to most WAW and WAR hazards; since the number of reservation stations is greater than the number of registers, the WAR and WAW hazards are easily solved.

Control hazards? No instruction is initiated until all previous branch predictions are confirmed.

Information accommodated in each reservation station:
- Op: type of operation;
- Qj, Qk: reservation stations that will produce the corresponding source operands (a value of zero indicates that the source operand is already available in Vj or Vk);
- Vj, Vk: values of the source operands;
- A: effective address (load/store);
- Busy: whether the reservation station is busy or free.

Example 1:
  L.D   F6, 32(R2)
  L.D   F2, 44(R3)
  MUL.D F0, F2, F4
  SUB.D F8, F2, F6
  DIV.D F10, F0, F6
  ADD.D F6, F8, F2
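The reservation-station fields and the CDB snooping that fills them can be sketched as follows. This is an illustrative fragment only (the station names and values are invented), showing how a broadcast result replaces a pending Qj/Qk tag with a value:

```python
from dataclasses import dataclass
from typing import Optional

# One reservation-station entry, with the fields listed on the slide.
# An empty Qj/Qk plays the role of "zero": the operand is already in Vj/Vk.
@dataclass
class RS:
    busy: bool = False
    op: str = ""
    Vj: Optional[float] = None
    Vk: Optional[float] = None
    Qj: str = ""
    Qk: str = ""
    A: Optional[int] = None   # effective address (load/store)

def cdb_broadcast(stations, producer, value):
    """Every busy station snoops the CDB: when the station named
    `producer` broadcasts `value`, waiting operands are filled in."""
    for rs in stations.values():
        if rs.busy and rs.Qj == producer:
            rs.Vj, rs.Qj = value, ""
        if rs.busy and rs.Qk == producer:
            rs.Vk, rs.Qk = value, ""

# Add1 waits for its j-operand from Load2; the k-operand is already known.
stations = {"Add1": RS(busy=True, op="SUB", Qj="Load2", Vk=3.0)}
cdb_broadcast(stations, "Load2", 7.5)       # Load2 finishes with result 7.5
print(stations["Add1"].Vj, stations["Add1"].Qj)   # 7.5 ''
```

Because the value arrives over the CDB instead of through the register bank, the dependent instruction can start as soon as the broadcast happens: this is the forwarding mechanism mentioned above.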

(1) After the first Load has finished its execution:

Reservation Stations:
  Name   Busy  Op    Vj  Vk                Qj     Qk     A
  Load1  N
  Load2  Y     Load                                      44+Regs[R3]
  Add1   Y     SUB       Mem[32+Regs[R2]]  Load2
  Add2   Y     ADD                         Add1   Load2
  Add3   N
  Mult1  Y     MUL       Regs[F4]          Load2
  Mult2  Y     DIV       Mem[32+Regs[R2]]  Mult1

Register status (Qi):
  F0: Mult1   F2: Load2   F6: Add2   F8: Add1   F10: Mult2

Note: the ADD.D instruction, associated with a WAR hazard in the WB phase, can finish even before the DIV instruction starts its execution!

(2) MUL.D is ready to write its result:

Reservation Stations:
  Name   Busy  Op    Vj                Vk                Qj     Qk  A
  Load1  N
  Load2  N
  Add1   N
  Add2   N
  Add3   N
  Mult1  Y     MUL   Mem[44+Regs[R3]]  Regs[F4]
  Mult2  Y     DIV                     Mem[32+Regs[R2]]  Mult1

Register status (Qi):
  F0: Mult1   F10: Mult2

Note: the ADD.D instruction has already finished!

Example 2 (loop):

  Loop: L.D    F0, 0(R1)
        MUL.D  F4, F0, F2
        S.D    F4, 0(R1)
        DADDIU R1, R1, -8
        BNE    R1, R2, Loop    ; branches if R1 != R2

Status with two iterations in flight (L.D, MUL.D and S.D of both iterations issued):

Reservation Stations:
  Name    Busy  Op     Vj  Vk        Qj     Qk  A
  Load1   Y     Load                            Regs[R1]+0
  Load2   Y     Load                            Regs[R1]-8
  Add1    N
  Add2    N
  Add3    N
  Mult1   Y     MUL        Regs[F2]  Load1
  Mult2   Y     MUL        Regs[F2]  Load2
  Store1  Y     Store                Mult1      Regs[R1]
  Store2  Y     Store                Mult2      Regs[R1]-8

Register status (Qi):
  F0: Load2   F4: Mult2

Notes:
(1) The loop is dynamically unrolled by the hardware, which renames the operands using the reservation stations as extra registers; none of the instructions has finished yet.
(2) The memory accesses (load/store instructions) can be executed out-of-order provided that they use distinct addresses.

Implementation of the Tomasulo algorithm. Advantages:
- Hazard detection and execution control are now distributed, contrary to what happens with the scoreboard: the information stored in the reservation station of each functional unit determines when an instruction can start its execution;
- Results are passed directly to the functional units through the reservation stations, instead of being passed through the registers: elimination of WAW and WAR hazards.

Speculation

Speculation consists in:
- Allowing the anticipated execution of an instruction that is still conditioned (e.g. by a previous branch condition);
- Passing its result to other instructions, while preventing them from irreversibly committing their final results until the instruction is no longer speculative.

It combines three techniques:
- Dynamic branch prediction, to select the instructions that should be executed;
- Speculation, to allow the execution of instructions before their control dependencies have been resolved, while assuring the ability to roll back the effects of incorrectly speculated code sequences;
- Dynamic scheduling, to schedule the several pipeline stages.

Although the instructions execute out-of-order, the final commit step is executed in the correct order.

Contrasting with dynamic scheduling techniques, with speculative scheduling the instructions continue through the fetch, issue and execute stages, assuming that the previous branch prediction is correct.

Extension of the Tomasulo algorithm to allow speculation:
- How: distinguish between by-passing the results between instructions (using the reservation stations) and the effective write to the register bank;
- Why: to allow the speculative execution of instructions (with by-pass of their results), while preventing those results from being effectively written until the instruction is no longer speculative, i.e., until the previous branch predictions have been confirmed (instruction commit).

The instructions are executed out-of-order, but they finish (commit) in the original order.

Dynamic scheduling: after the fetch and issue stages, the instructions are blocked in the reservation stations until their operand values are available. Implementation of branches: the instructions after the branch are not executed while the branch condition is not evaluated.

Speculative scheduling with a Reorder Buffer: all instructions whose operand values or origins are already known pass through the fetch, issue and execute stages. Implementation of branches: the instructions after the branch are executed and their results are temporarily stored in the Reorder Buffer, until the branch condition is known.

Reorder Buffer:
- Assumes a role quite similar to the reservation stations, by supplying partial results to subsequent instructions;
- Each entry contains 4 fields:
    Type of instruction - e.g. branch/store/register operation;
    Destination - target register or memory address;
    Value - instruction result;
    Ready - conclusion of the operation.

Commit phase: occurs when the instruction reaches the top of the Reorder Buffer and its result is ready.

If the branch prediction was correct: the target register/memory position is updated with the new value.

If the branch prediction was incorrect: the speculation was wrong and the Reorder Buffer is completely discarded, restarting the execution at the instruction that corresponds to the correct branch target.
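The two commit outcomes can be sketched as a single step operating on the head of the buffer. This is an illustrative model only (entry fields follow the slide; the dict encoding and function names are assumptions):

```python
from collections import deque

# Sketch of in-order commit from a reorder buffer. Each entry carries
# the 4 fields from the slide (type, dest, value, ready) plus, for
# branches, whether the prediction turned out to be correct.

def commit_step(rob, registers):
    """Commit the head entry if it is ready; flush the whole ROB on a
    mispredicted branch. Returns 'commit', 'flush' or 'stall'."""
    if not rob or not rob[0]["ready"]:
        return "stall"                       # head has not finished yet
    head = rob.popleft()                     # in-order: always the head
    if head["type"] == "branch":
        if not head["prediction_ok"]:
            rob.clear()                      # discard all speculative work
            return "flush"                   # restart at the correct target
        return "commit"
    registers[head["dest"]] = head["value"]  # architectural state update
    return "commit"

regs = {}
rob = deque([
    {"type": "reg", "dest": "F6", "value": 1.5, "ready": True},
    {"type": "branch", "ready": True, "prediction_ok": False},
    {"type": "reg", "dest": "F2", "value": 9.9, "ready": True},
])
print(commit_step(rob, regs))   # commit  (F6 <- 1.5)
print(commit_step(rob, regs))   # flush   (mispredicted branch)
print(len(rob), regs)           # 0 {'F6': 1.5}
```

Note that the entry after the mispredicted branch never reaches the registers: only the commit stage touches architectural state, which is what makes the rollback (and precise interruptions) possible.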

In-order commit! Precise interruptions!

Example 1:
  L.D   F6, 32(R2)
  L.D   F2, 44(R3)
  MUL.D F0, F2, F4
  SUB.D F8, F6, F2
  DIV.D F10, F0, F6
  ADD.D F6, F8, F2

Reorder Buffer:
  #  Busy  Instruction         Status        Destination  Value
  1  N     L.D F6, 32(R2)      Commit        F6           Mem[32+Regs[R2]]
  2  N     L.D F2, 44(R3)      Commit        F2           Mem[44+Regs[R3]]
  3  Y     MUL.D F0, F2, F4    Write result  F0           #2 * Regs[F4]
  4  Y     SUB.D F8, F6, F2    Write result  F8           #1 - #2
  5  Y     DIV.D F10, F0, F6   Execute       F10
  6  Y     ADD.D F6, F8, F2    Write result  F6           #4 + #2

Reservation Stations:
  Name   Busy  Op     Vj                Vk                Qj  Qk  Dest  A
  Load1  N
  Load2  N
  Add1   N
  Add2   N
  Add3   N
  Mult1  N     MUL.D  Mem[44+Regs[R3]]  Regs[F4]              --  #3
  Mult2  Y     DIV.D                    Mem[32+Regs[R2]]  #3      #5

Register status:
  Register    F0  F6  F8  F10   (remaining registers not busy)
  Re-Order #  3   6   4   5
  Busy        Y   Y   Y   Y

Note: by the time the MUL.D instruction is ready to commit, only the two L.D instructions have committed, although several others have already finished their execution.

Example 2 (loop):

  Loop: L.D    F0, 0(R1)
        MUL.D  F4, F0, F2
        S.D    F4, 0(R1)
        DADDIU R1, R1, -8
        BNE    R1, R2, Loop    ; branches if R1 != R2

43 Reorder Buffer

 #   Busy  Instruction         Status        Destination  Value
 1   N     L.D F0, 0(R1)       Commit        F0           Mem[0+Regs[R1]]
 2   N     MUL.D F4, F0, F2    Commit        F4           #1 × Regs[F2]
 3   Y     S.D F4, 0(R1)       Write result  0+Regs[R1]   #2
 4   Y     DADDIU R1, R1, -8   Write result  R1           Regs[R1]-8
 5   Y     BNE R1, R2, Loop    Write result
 6   N     L.D F0, 0(R1)       Write result  F0           Mem[#4]
 7   N     MUL.D F4, F0, F2    Write result  F4           #6 × Regs[F2]
 8   Y     S.D F4, 0(R1)       Write result  0+Regs[R1]   #7
 9   Y     DADDIU R1, R1, -8   Write result  R1           #4 - 8
 10  Y     BNE R1, R2, Loop    Write result

Only the first L.D and MUL.D have Committed, although all the other instructions have already finished their execution.

Reservation Stations

 Name   Busy  Op  Vj  Vk  Qj  Qk  Dest  A
 (all empty)

FP Register Status

             F0  F1  F2  F3  F4  F5  F6  F7  F8  F9  F10
 Re-Order #  6               7
 Busy        Y   N   N   N   Y   N   N   N   N   N   N

Prof. Nuno Roma ACE 2010/11 - DEI-IST 42 / 45
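The "Re-Order #" row of the register status table acts as a rename map: each newly issued instruction points its destination register at its own ROB entry, so the second iteration's writes to F0 and F4 do not conflict with the first iteration's (no WAW/WAR stalls). A hedged sketch of that mapping (an illustrative model, not the lecture's hardware):

```python
# Sketch: the register status table as a rename map. Issuing an
# instruction makes its ROB entry the current "name" of its destination
# register; a younger writer simply takes over the name.
rename = {}          # architectural register -> newest ROB entry number
program = [          # FP part of two iterations of the loop, in issue order
    (1, "L.D",   "F0"), (2, "MUL.D", "F4"),
    (6, "L.D",   "F0"), (7, "MUL.D", "F4"),
]
for rob_num, op, dest in program:
    rename[dest] = rob_num

# Consumers read the *current* mapping, so MUL.D #7 waits on load #6,
# never on the first iteration's load #1:
assert rename == {"F0": 6, "F4": 7}
```

This matches the table above: F0 is renamed to entry #6 and F4 to entry #7, the youngest writers of each register.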

44 Interrupt handling?

The exception is recorded in the Reorder Buffer entry corresponding to the instruction under execution;
It is only handled, at the Commit phase, if the instruction turns out NOT to be speculative.

Prof. Nuno Roma ACE 2010/11 - DEI-IST 43 / 45
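The rule above can be sketched as follows (a minimal illustrative model with invented names, not the lecture's hardware): an exception is only recorded when it occurs, and only raised if the faulting entry actually reaches the head of the ROB, which makes exceptions precise and never taken on a mispredicted path.

```python
# Sketch: exceptions are deferred to Commit. An entry is a tuple
# (name, faulted, squashed_by_branch), given in program order.
from collections import deque

def run(entries):
    rob = deque(entries)
    while rob:
        name, faulted, squashed = rob.popleft()
        if squashed:             # wrong-path instruction: flush everything
            rob.clear()          # younger entries (and their faults) vanish
            return f"flushed at {name}"
        if faulted:              # reached the head, so it is not speculative:
            return f"exception at {name}"   # take a precise trap
    return "clean"

# A fault on a wrong-path instruction is never taken:
assert run([("BNE", False, True), ("L.D", True, False)]) == "flushed at BNE"
# The same fault on the committed path is taken, precisely and in order:
assert run([("BNE", False, False), ("L.D", True, False)]) == "exception at L.D"
```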

45

Prof. Nuno Roma ACE 2010/11 - DEI-IST 44 / 45

Next class:
Multiple-issue processors;
Superscalar processors;
Very Long Instruction Word (VLIW) processors;
Code optimization for multiple-issue processors;
Multi-threading.

Prof. Nuno Roma ACE 2010/11 - DEI-IST 45 / 45


More information

References EE457. Out of Order (OoO) Execution. Instruction Scheduling (Re-ordering of instructions)

References EE457. Out of Order (OoO) Execution. Instruction Scheduling (Re-ordering of instructions) EE457 Out of Order (OoO) Execution Introduction to Dynamic Scheduling of Instructions (The Tomasulo Algorithm) By Gandhi Puvvada References EE557 Textbook Prof Dubois EE557 Classnotes Prof Annavaram s

More information

ECE 505 Computer Architecture

ECE 505 Computer Architecture ECE 505 Computer Architecture Pipelining 2 Berk Sunar and Thomas Eisenbarth Review 5 stages of RISC IF ID EX MEM WB Ideal speedup of pipelining = Pipeline depth (N) Practically Implementation problems

More information

CS433 Midterm. Prof Josep Torrellas. October 19, Time: 1 hour + 15 minutes

CS433 Midterm. Prof Josep Torrellas. October 19, Time: 1 hour + 15 minutes CS433 Midterm Prof Josep Torrellas October 19, 2017 Time: 1 hour + 15 minutes Name: Instructions: 1. This is a closed-book, closed-notes examination. 2. The Exam has 4 Questions. Please budget your time.

More information

Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro)

Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers) physical register file that is the same size as the architectural registers

More information

CS 152 Computer Architecture and Engineering. Lecture 13 - Out-of-Order Issue and Register Renaming

CS 152 Computer Architecture and Engineering. Lecture 13 - Out-of-Order Issue and Register Renaming CS 152 Computer Architecture and Engineering Lecture 13 - Out-of-Order Issue and Register Renaming Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://wwweecsberkeleyedu/~krste

More information

CS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism

CS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism CS 252 Graduate Computer Architecture Lecture 4: Instruction-Level Parallelism Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://wwweecsberkeleyedu/~krste

More information

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Computer Architectures 521480S Dynamic Branch Prediction Performance = ƒ(accuracy, cost of misprediction) Branch History Table (BHT) is simplest

More information

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 04

More information

Good luck and have fun!

Good luck and have fun! Midterm Exam October 13, 2014 Name: Problem 1 2 3 4 total Points Exam rules: Time: 90 minutes. Individual test: No team work! Open book, open notes. No electronic devices, except an unprogrammed calculator.

More information

EECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?)

EECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?) Evolution of Processor Performance So far we examined static & dynamic techniques to improve the performance of single-issue (scalar) pipelined CPU designs including: static & dynamic scheduling, static

More information

Chapter 4 The Processor 1. Chapter 4D. The Processor

Chapter 4 The Processor 1. Chapter 4D. The Processor Chapter 4 The Processor 1 Chapter 4D The Processor Chapter 4 The Processor 2 Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline

More information

CS433 Midterm. Prof Josep Torrellas. October 16, Time: 1 hour + 15 minutes

CS433 Midterm. Prof Josep Torrellas. October 16, Time: 1 hour + 15 minutes CS433 Midterm Prof Josep Torrellas October 16, 2014 Time: 1 hour + 15 minutes Name: Alias: Instructions: 1. This is a closed-book, closed-notes examination. 2. The Exam has 4 Questions. Please budget your

More information

Computer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley

Computer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley Computer Systems Architecture I CSE 560M Lecture 10 Prof. Patrick Crowley Plan for Today Questions Dynamic Execution III discussion Multiple Issue Static multiple issue (+ examples) Dynamic multiple issue

More information

Advanced Computer Architecture CMSC 611 Homework 3. Due in class Oct 17 th, 2012

Advanced Computer Architecture CMSC 611 Homework 3. Due in class Oct 17 th, 2012 Advanced Computer Architecture CMSC 611 Homework 3 Due in class Oct 17 th, 2012 (Show your work to receive partial credit) 1) For the following code snippet list the data dependencies and rewrite the code

More information