EE 457 Unit 9a. Exploiting ILP Out-of-Order Execution

Size: px
Start display at page:

Download "EE 457 Unit 9a. Exploiting ILP Out-of-Order Execution"

Transcription

1 1 EE 457 Unit 9a Exploiting ILP Out-of-Order Execution

2 2 Credits Some of the material in this presentation is taken from: Computer Architecture: A Quantitative Approach John Hennessy & David Patterson Some of the material in this presentation is derived from course notes and slides from Prof. Michel Dubois (USC) Prof. Murali Annavaram (USC) Prof. David Patterson (UC Berkeley)

3 3 Outline Instruction Level Parallelism In-order pipeline From academic 5-stage pipeline To 8-stage MIPS R4000 pipeline Superscalar, superpipelined Out-of-Order Execution Out-of-order completion (Problem: Exceptions) In-order completion Thread Level Parallelism Chip multithreading (CMT) Chip multiprocessors (CMP)

4 4 8-Stage Pipeline In 5-stage pipeline, we place the I-Cache and D-Cache in one stage each When we add VM translation (i.e. TLB), cache tag check, and data access we need more stages to balance the delay Single Memory Stage in 5- Stage Pipeline Expand to three stages when TLB and Cache access is balanced D-Cache D-TLB Cache Data Cache Tag Lookup Tag Check

5 5 8-Stage Pipeline R4000 uses pipelined instruction and data caches The instruction is available at the end of the IS stage but the tag check is done in RF This is harmless since it s fine to start the decode and read registers even if it s an invalid tag so long as we know by the end of the clock For the data cache we must perform the tag check before we can start writing into the register file TLB Data/Tag Access Tag Check TLB Data/Tag Access Tag Check IF IS RF EX DF DS TC WB IM Reg ALU DM Reg

6 6 Load Dependency Issue Latency LW,0() 8-stage pipeline has a two-cycle load delay Possible since data is available at end of DS and can be forwarded If tag check indicates a miss the pipeline can be backed up (re-execute the instruction when the data is available IM Reg ALU DM Reg IM Reg ALU DM Reg IM Reg ALU DM Reg ADD,, IM Reg ALU DM Reg

7 SUPERSCALAR & SUPERPIPELINING 7

8 Superpipelining Superscalar 8 Overview Superscalar = More than 1 instruction completing per clock cycle 2-way superscalar = Proc. that can issue 2 instructions per clock cycle Success is sensitive to ability to find independent instructions to issue in the same cycle Superpipelining = Many small stages to boost clock freq. Success depends of finding instructions to schedule in the shadow of data and control hazards Instruction 1 Instruc. Fetch Instruc. Decode Execute Data Memory Write back Instruction 2 Instruc. Fetch Instruc. Decode Execute Data Memory Write back Superscalar: Executing more than 1 instruction per clock cycle (CPI < 1) Instruction 1 IF1 IF2 ID EX DM1 DM2 DM3 WB Instruction 2 IF1 IF2 ID EX DM1 DM2 DM3 WB Superpipelining: Divide logic into many short stages (Higher Clock Frequency)

9 LD/ST Slot Integer Slot 9 2-way Superscalar One ALU & Data transfer (LW/SW) instruction can be issued at the same time Instruction Pipeline Stages ALU or branch IF ID EX MEM WB LW/SW IF ID EX MEM WB ALU or branch IF ID EX MEM WB LW/SW IF ID EX MEM WB ALU or branch IF ID EX MEM WB LW/SW IF ID EX MEM WB PC I-Cache 2 instructions Reg. File (4 Read, 2 Write) ALU Addr. Calc. D-Cache

10 OUT-OF-ORDER EXECUTION 10

11 11 Instruction Level Parallelism (ILP) Although a program defines a sequential ordering of instructions, in reality many instructions can be executed in parallel. ILP refers to the process of finding instructions from a single program/thread of execution that can be executed in parallel Data flow (data dependencies) is what truly limits ordering Independent instructions (no data dependencies) can be executed at the same time) Control hazards also provide some ordering constraints Program Order (In-order) lw $s3,0($s4) and $t3,$t2,$t3 add $t0,$t0,$s4 or $t5,$t3,$t2 sub $t1,$t1,$t2 beq $t0,$t8,l1 xor $s0,$t1,$s2 Dependency Graph We may perform execution out-of-order LW ADD SUB AND BEQ XOR OR Cycle 1: Cycle 2: Cycle 3: lw $s3,0($s4) / add $t0,$t0,$s4 / sub $t1,$t1,$t2 / and $t3,$t2,$t3 / beq $t0,$t8,l1 / / or $t5,$t3,$t2 / / xor $s0,$t1,$s2 /

12 12 Out-of-Order Motivation Hide the impact of dynamic events such as a cache miss Out-of-Order (OoO) Execution Let independent instructions behind a stalled instruction execute Important aspect: Completion Ordering Out-of-Order completion: Let the independent instruction that has been executed update state (registers + memory) before the stalled instruction completes Problem: Exception handling In-Order completion: Let the independent instructions execute but buffer their results until the stalled instruction completes

13 13 Out-of-Order Execution Io = In-order Execution Execution here means producing the results not necessarily writing them to a register or memory Completion means committing/writing the results to register file or memory While we say out-of-order execution we really mean: IoI (IoD) => OoOE => IoC IoI / IoD = In-order Issue/Dispatch IoC = In-order completion

14 14 IoC or OoC? IoI (IoD) => OoOE => IoC In-order completion is necessary to support exceptions [exact state at time of exception] We will present the concept of OoOC (out-of-order completion) then roll back to IoC OoOC Issues Brancheswe should not commit an instruction that came after (in program order) a branch Solution: Stall dispatching instructions after a branch until we resolve the outcome LW,0() // cache miss BEQ,$0,L1 ADD,, // What if we execute this ADD out of order

15 15 Basic Blocks Basic Block (def.) = Sequence of instructions that will always be executed together No conditional branches out No branch targets coming in Also called straight-line code Average size: 5-7 instrucs. lw $s3,0($s4) and $t3,$t2,$t3 L1: add $t0,$t0,$s4 or $t5,$t3,$t2 sub $t1,$t1,$t2 beq $t0,$t8,l1 xor $s0,$t1,$s2 Instructions in a basic block can be overlapped if there are no data dependencies Control dependences really limit our window of possible instructions to overlap W/o extra hardware, we can only overlap execution of instructions within a basic block This is a basic block (starts w/ target, ends with branch)

16 16 Scheduling Strategies Static Scheduling Compiler re-orders instructions in such a way that no dependencies will be violated and allows for OoOE Dynamic Scheduling HW implementing the Tomasulo algorithm or other similar approach will re-order instructions to allow for OoOE More Advanced Concepts Branch prediction and speculative execution (execution beyond a branch flushing if incorrect) will be covered later

17 17 Static Scheduling Strengths Hardware simplicity [Better clock rate] Power/energy advantage Compiler has a global view of the program anyway, so it should be able to do a good job Very predictable: static performance predictions are reliable Weaknesses Requires re-compilation to take advantage of new/modified architecture Cannot foresee dynamic (data-dependent) events Cache miss, conditional branches (can only recedule instructions in a basic block) Cannot precompute memory addresses No good solution for precise exceptions with out-of-order completion

18 18 Where to Stall? In 5-stage pipeline (in-order execution) RAW dependency was solved by Forwarding (preferably) or Stalling Dependent instructions stalled in the ID stage if necessary ADD,, stall LW IM Reg ALU DM Reg

19 Prior ALU Result Data Mem. or ALU result Instruction Register Pipeline Stage Register Pipeline Stage Register D-Cache PC I-Cache Pipeline Stage Register Mem WB WB 19 Where to Stall? Simple 5-stage pipeline: Dependent instructions cannot be stalled in the EX stage Why? What if ADD was also dependent on the instruction in WB ADD has no place to buffer that forwarded value Thus we stall in ID so we can use the Register File to grab dependent values. Further stalling in ID incurs only 1 cycle penalty as would stalling in EX. PCWrite IRWrite 4 + rs rt. 5 5 IF.Flush HDU Control Read Reg. 1 # Read Reg. 2 # Write Reg. # Write Data Stall Read data 1 Read data 2 Register File Ex Mem WB ALUSelA FLUSH 0 1 Sh. Left ALU Zero Res. Branch 0 1 MemRead & MemWrite 0 1 MemToReg Reset Sign Extend rs rt rd ALUSelB Forwarding Unit ALUSrc 0 1 Regwrite & WriteReg# Regwrite, WriteReg#

20 20 Where to Stall? But to implement OoO execution, we cannot stall in the decode stage since that would prevent any further issuing of instructions Thus, now we will issue to queues for each of the multiple functional units and have the instruction stall in the queue until it is ready Queues + Functional Units ALU MUL IM Reg DM Reg DIV Stalling here would plug up the pipeline DM (Cache)

21 Prior ALU Result Data Mem. or ALU result Instruction Register Pipeline Stage Register Pipeline Stage Register D-Cache PC I-Cache Pipeline Stage Register Mem WB WB 21 Forwarding in OoO Execution In 5-stage pipeline junior instructions carried their source register IDs into the EX stage to be compared with destination register ID s of their seniors (earlier instructions) But in OoO execution, we may have many instructions (seniors) in front of us and cannot afford to perform so many comparisons 0 (as well as handling the 1 FLUSH PCWrite case where many seniors are producing new version of a register) 0 Stall Instead, the IF.Flush dispatch unit will explicitly tell the dependent instruction who to receive data Control from Branch 4 +. IRWrite rs rt 5 5 HDU Read Reg. 1 # Read Reg. 2 # Write Reg. # Write Data Read data 1 Read data 2 Register File Ex Mem WB ALUSelA 0 1 Sh. Left 2 ALU Zero Res. MemRead & MemWrite 0 1 MemToReg Reset Sign Extend rs rt rd ALUSelB Forwarding Unit ALUSrc 0 1 Regwrite & WriteReg# Regwrite, WriteReg#

22 22 Tomasulo s Plan OoO Execution Multiple functional units Integer ALU, Data memory, Multiplier, Divider Queues between ID and EX stages (in place of ID/EX register) Allows later instructions to keep issuing even if earlier ones are stalled

23 23 OoO Execution Problems For the time No branch prediction No speculative execution beyond a branch So we simply stall on a conditional branch For the time, no support for precise exceptions Even then

24 24 RAW, WAR, and WAW RAW = Read After Write lw, 40() add $9,, WAR = Write After Read add $9,, say is not available yet lw, 40() WAW = Write After Write add $9,, say is not available yet lw $9, 40() Why would anyone produce one result in $9 without utilizing that result? Why would he overwrite it with another result? How is this possible?

25 25 WAW can easily occur How is it possible? In OoO execution, instructions before the branch and after the branch can co-exist Consider multiple iterations of a loop body for(i=max; i!= 0; i--) A[i] = A[i] * 3; L1: lw, 40() mult,, sw, 40() addi,,-4 bne, $0,L1 Original Code L1: lw, 40() mult,, sw, 40() addi,,-4 bne, $0,L1 L1: lw, 40() mult,, sw, 40() addi,,-4 bne, $0,L1

26 26 Another WAW Example Say a company gives standard bonus to most of the employees and a higher bonus to managers The software may set a default value to the standard bonus and then overwrite for the special case int x = standard_bonus; if (manager) x = special_bonus; set_bonus(x);

27 Name Depdencies 27 RAW, WAR, and WAW Some terminology to remember RAW = Read After Write lw, 40() add $9,, WAR = Write After Read add $9,, lw, 40() WAW = Write After Write add $9,, lw $9, 40() RAW A true dependency WAR An anti-dependency WAW An anti-dependency Note: no information is communicated in WAR/WAW hazards

28 28 RAW, WAR, and WAW In-order execution: We need to deal with RAW only Out-of-order execution Now we need to deal with WAR and WAW hazards besides RAW Any of these hazards seem to prevent re-ordering instructions and executing them out-of-order

29 29 Register Renaming We have limited architectural registers Registers the instruction set is aware of We could have more physical registers Actual registers part of the register file Assume Delayed lw, 40() sw, 40() lw, 60() sw, 60() It is clear the compiler is using as a temporary register If there is a delay in obtaining the first part of the code cannot proceed Unfortunately, the second part of the code cannot proceed because of the name dependency for

30 30 Register Renaming If we had 64 registers instead of 32 registers, then perhaps the compiler might have used 8 instead of and we could have executed the second part of the code before the first part lw, 40() sw, 40() lw 8, 60() add 8, 8, 8 sw 8, 60() This is an example of a name-dependency

31 31 Register Renaming Four different temporary registers can be used here as shown:, 8, 8 and 8 We could even simply use aliases or code names: LION, TIGER, CAT, BEAR lw, 40() add 8,, sw 8, 40() lw LION, 40() add TIGER, LION, LION sw TIGER, 40() lw 8, 60() add 8, 8, 8 sw 8, 60() lw CAT, 60() add BEAR, CAT, CAT sw BEAR, 60()

32 32 Increasing Number of Registers Can a later implementation provide 64 registers (instead of 32) while maintaining binary compatibility with previously compiled code? Answer: Yes / No NO Why? Machine code has 5-bit fields for register ID s

33 33 Increasing Number of Registers Answer: Cannot change the number of architectural registers Instead we will perform Register Renaming through Tagging Registers This solves name dependency problems (WAR and WAW) while attending to true dependency (RAW) through waiting in queues Please be sure you understand this!

34 34 Tagging process sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST RF 1 1 Issue Logic RST = Register Status Table RF = Register File INT ALU T1: SQRT Val / 0 Val INT MUL/DIV/SQRT Load/ Store

35 35 Tagging process: CC1 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST T1 RF 1 1 Issue Logic RST = Register Status Table RF = Register File INT ALU T1: SQRT Val / 0 Val INT MUL/DIV/SQRT Load/ Store

36 36 Tagging process: CC2 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST T1 T2 RF 1 1 Issue Logic RST = Register Status Table RF = Register File INT ALU T1: SQRT Val / 0 Val T2: LW T1 / 40 INT MUL/DIV/SQRT Load/ Store

37 37 Tagging process: CC3 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST T1 T2 T3 RF 1 1 Issue Logic RST = Register Status Table RF = Register File T3: ADD T2 / T2 T1: SQRT Val / 0 Val T2: LW T1 / 40 INT ALU INT MUL/DIV/SQRT Load/ Store

38 38 Tagging process: CC4 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST T1 T3 RF 1 1 Issue Logic RST = Register Status Table RF = Register File T3: ADD T2 / T2 T1: SQRT Val / 0 Val INT ALU INT MUL/DIV/SQRT SW T3 / T1 / 40 T2: LW T1 / 40 Load/ Store

39 39 Tagging process: CC5 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST T1 T3 T4 RF 1 1 Issue Logic RST = Register Status Table RF = Register File T3: ADD T2 / T2 T1: SQRT Val / 0 Val INT ALU INT MUL/DIV/SQRT T4: LW val / 60 SW T3 / T1 / 40 T2: LW T1 / 40 Load/ Store

40 40 Tagging process: CC6 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST T1 T4 T5 RF 1 1 Issue Logic RST = Register Status Table RF = Register File T5: ADD T4 / T4 T3: ADD T2 / T2 T1: SQRT Val / 0 Val INT ALU INT MUL/DIV/SQRT T4: LW val / 60 SW T3 / T1 / 40 T2: LW T1 / 40 Load/ Store

41 41 Tagging process: CC7 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST T1 T5 RF 1 1 Issue Logic RST = Register Status Table RF = Register File T5: ADD T4 / T4 T3: ADD T2 / T2 T1: SQRT Val / 0 Val INT ALU INT MUL/DIV/SQRT T4: LW val / 60 SW T3 / T1 / 40 T2: LW T1 / 40 Load/ Store T4: Read 0x1111

42 42 Tagging process: CC8 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST T1 T5 => null RF 0x T5: Sum 0x2222 Issue Logic RST = Register Status Table RF = Register File T5: ADD 0x1111 / 0x1111 T3: ADD T2 / T2 T1: SQRT Val / 0 Val INT ALU INT MUL/DIV/SQRT SW T3 / T1 / 40 T2: LW T1 / 40 Load/ Store T5: Sum 0x2222

43 43 Tagging process: CC9 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST T1 null RF 0x Issue Logic RST = Register Status Table RF = Register File T3: ADD T2 / T2 T1: SQRT Val / 0 Val INT ALU INT MUL/DIV/SQRT SW 0x2222, val / 60 SW T3 / T1 / 40 T2: LW T1 / 40 Load/ Store T5: Sum 0x2222

44 44 Tagging process: CC10 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST T1 => null RF 0x T1: SQRT 0xacd0 Issue Logic RST = Register Status Table RF = Register File T3: ADD T2 / T2 T1: SQRT Val / 0 Val INT ALU INT MUL/DIV/SQRT SW T3 / T1 / 40 T2: LW T1 / 40 Load/ Store T1: SQRT 0xacd0

45 45 Tagging process: CC11 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST RF 0xacd0 0x Issue Logic RST = Register Status Table RF = Register File T3: ADD T2 / T2 INT ALU INT MUL/DIV/SQRT SW T3 / 0xacd0 / 40 T2: LW 0xacd0 / 40 Load/ Store T2: Read 0x5678

46 46 Tagging process: CC12 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST RF 0xacd0 0x Issue Logic RST = Register Status Table RF = Register File T3: ADD 0x5678 / 0x5678 INT ALU INT MUL/DIV/SQRT SW T3 / 0xacd0 / 40 Load/ Store T3: Sum 0xACF0

47 47 Tagging process: CC13 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST RF 0xacd0 0x Issue Logic RST = Register Status Table RF = Register File SW 0xacf0 / 0xacd0 / 40 INT ALU INT MUL/DIV/SQRT Load/ Store

48 48 Tomasulo s Algorithm Dispatch/Issue unit decodes and dispatches instructions For destination operand, an instruction carries a TAG (but not the actual register name) For source operands, an instruction carries either the values (if TAG is null in RST) or TAGs of the operands (but not the actual register name) When an instruction executes and produces a result it broadcasts the result and its destination TAG Any instruction waiting can compare its SRC tags with the destination tag and grab the value if they match If entry in RST matches the TAG then this instruction is the latest producer of the register and the value will be written to the RF

49 49 Register Renaming sqrt, 0 add,, add,, add,, add,, RST T1, T2, T3, T4 RF 1 1 Issue Logic RST = Register Status Table RF = Register File T4: ADD T3 / T3 T3: ADD T2 / T2 T2: ADD T1 / T1 T1: SQRT Val / 0 Val INT ALU INT MUL/DIV/SQRT Load/ Store

50 50 Unique TAGs Like SSN, we need a unique TAG SSN s are reused. Similarly TAGS can be reused TAGs are similar to number TOKEN Helps to create a virtual queue. We do not need that here In State Bank of India, the cashier issues brass token to customers trying to draw money as an ID (and not at all to put them in any virtual queue / ordering). Token numbers are in random order. The cashier verifies the signature in the record rooms, returns with money, calls the token number and issues the money. Tokens are reclaimed & reused.

51 51 Tags (= Tokens) How many tokens should the bank casheir have to start with? What happens if the tokens run out? Does the cashier need to have any order in holding tokens and issuing tokens? Do they have to collect the tokens back?

52 52 TAG FIFO To issue and collect tokens (TAGS) use a circular FIFO (First- In/First-Out) unit While the FIFO order is not important here, a FIFO is the easiest to implement in hardware compared to a random order in a pile Filled (with say) 64 tokens (in any order) initially on reset Tokens return in any order FIFO s are taught in EE 560 Put tokens back in the FIFO and reissue TAG FIFO TAG FIFO TAG FIFO wp rp wp 2 63 rp wp rp FULL 2 Tokens issued 1 Tokens returned

53 Int. Queue L/S Queue Div Queue Mult. Queue Reg. File 53 Organization for OoO Execution I-Cache Instruc. Queue TAG FIFO Block Diagram Adapted from Prof. Michel Dubois (Simplified for EE 457) Register Status Table Dispatch Integer / Branch D-Cache Div Mul Issue Unit CDB

54 54 Front-End & Back-End IFQ (Instruction Fetch Queue) A FIFO structure Dispatch (Issue) Unit Includes RST, RF, Tag FIFO Load/Store and other Issue Queues Issue Units Functional units CDB (Common Data Bus) Like a public address system that everyone can see/hear when data is produced

55 55 More Tomasulo Algorithm Front End Instructions are fetched They are stored in a FIFO (IFQ) When instruction reached the head of the IFQ it is Decoded Dispatched to an issue queue/functional unit Even if some of the inputs are not ready (takes TAGs) Back End Instructions in issue queues wait for their input operands Once register operands are ready instructions can be scheduled for execution provided they will not conflict for the CDB or their functional unit Instructions execute in their functional unit and their result is put on the CDB All instructions in queues and the register file watch the CDB and grab the value they are waiting for when it is produced

56 56 Bottleneck in Tomasulo s Approach The CDB!!! Do all instructions use the CDB? No, not SW, J (jump), BEQ

57 57 Load/Store Queue (LSQ) Performs Address calculation Memory disambiguation RAW, WAR, WAW hazards due to memory reads and writes // Is there a dependency here? SW,0() LW,0() // What about here? SW, 1000() LW, 0()

58 58 Address Calculation for LW/SW EE 557 approach for address calculation Loads & store in 2 sub-instructions 1 instruction computes address and is dispatched to integer ALU 1 instruction access data cache and is issued to LSQ Address is communicated from integer ALU to LSQ via CDB forwarding using a tag EE 560/457 approach Use a dedicated adder in the LSQ to compute address (so just 1 dispatched instruction)

59 59 Memory Disambiguation EE 557: Issue to a cache from LSQ Loads can issue to a cache when their address is ready Stores can issue to cache when both address & data is ready Memory hazards (RAW, WAR, WAW) are resolved in the LSQ Load can issue to cache if no store with same address is before it Store can issue to cache if no store or load with same address before it Otherwise access waits in LSQ If an address is unknown it is assumed to be the same Worst case to enforce correctness The process of figuring out and comparing memory address is called disambiguation

60 60 Memory Disambiguation RAW sw, 2000($0) lw, 2000($0) This later lw can proceed only if there is no store ahead of it with the same address WAW sw, 2000($0) sw, 2000($0) This later sw can proceed only if there is no store ahead of it with the same address WAR lw, 2000($0) sw, 2000($0) This later sw can proceed only if there is no load ahead of it with the same address

61 61 LSQ Ordering Maintaining instructions in the order of arrival Issue order/program order in a queue Is this necessary and/or desirable? In the case of LSQ? Necessary! To enforce memory disambiguation In the case of Integer, MUL, DIV queues? Desirable, so that an earlier instruction gets executed whenever possible, thereby reducing queue pressure from too many instructions waiting on it

62 62 Priority Priority (based on the order of arrival among ready instructions) Is it necessary or is it desirable? Local priority within queues? Global priority across the queues?

63 63 Issue Unit CDB availability constraint Pipelined functional unit vs. multicycle unit Conflict resolution Round-robin priority adequate?, well,

64 64 Conditional Branches Dispatch unit stops dispatching until the branch is resolved CDB broadcasts the result of the branch Dispatching continues at the target or fall-through instruction Successful branch shall cause flushing of IFQ Since we stop dispatching instructions after a branch, does it mean that this branch is the last instruction to be executed in the back-end? Is it possible that the back-end holds simultaneously A. Some instructions dispatched before the branch and B. Some instructions issued after the branch

65 65 Structural & Control Hazards Structural Stalls I-Fetch must stall if the IFQ is full Dispatch must stall if all entries in the desired functional unit s Issue Queue is occupied Instructions cannot be scheduled in case of conflicts for the CDB or functional unit Control Dispatcher stalls when it reaches a branch Branches are dispatched to integer queue They wait for their operands if needed Put their result on CDB If untaken, dispatch unit resumes If taken, then dispatch clears the IFQ and resumes at target Precise exceptions not supported

66 BACKUP 66

67 67 Tagging Registers: CC1 DOG sqrt, 0 lw, 40() Orange means dispatched and SQRT is a long-latency computation RST DOG RF sw, 40() 1 1 lw, 60() sw, 60() Destination Dependent source RST = Register Status Table RF = Register File

68 68 Tagging Registers: CC2 sqrt, 0 LION DOG lw, 40() DOG Orange means dispatched and SQRT is a long-latency computation RST DOG LION RF sw, 40() 1 1 lw, 60() sw, 60() Destination Dependent source RST = Register Status Table RF = Register File

69 69 Tagging Registers: CC3 sqrt, 0 LION lw, 40() TIGER DOG LION DOG LION Orange means dispatched and SQRT is a long-latency computation RST DOG TIGER RF sw, 40() 1 1 lw, 60() sw, 60() Destination Dependent source RST = Register Status Table RF = Register File

70 70 Tagging Registers: CC4 sqrt, 0 LION lw, 40() TIGER DOG LION DOG LION Orange means dispatched and SQRT is a long-latency computation RST DOG TIGER RF TIGER sw, 40() 1 1 lw, 60() sw, 60() Destination Dependent source RST = Register Status Table RF = Register File

71 71 Tagging Registers Review sqrt, 0 LION lw, 40() TIGER TIGER DOG LION sw, 40() lw, 60() sw, 60() DOG LION Dispatch unit decodes and dispatches instructions For destination operand, an instruction carreis a TAG (but not the actual register name) For source operands, an instruction carries either the values (if no TAG in RST) or TAGs of the operands (but not the actual register name) When 1 RST DOG TIGER 1 RF

72 Int. Queue L/S Queue Div Queue Mult. Queue Reg. File 72 Organization for OoO Execution I-Cache Instruc. Queue TAG FIFO Block Diagram Adapted from Prof. Michel Dubois (Simplified for EE 457) Register Status Table Dispatch Integer / Branch D-Cache Div Mul Issue Unit CDB

73 73 Multiple Functional Units We now provide multiple functional units After decode, issue to a queue, stalling if the unit is busy or waiting for data dependency to resolve Queues + Functional Units ALU MUL IM Reg DM Reg DIV DM (Cache)

74 74 Functional Unit Latencies Int. ALU, Addr. Calc. EX An added complication of out-of-order execution & completion: WAW & WAR hazards FP Add A1 A2 A3 A4 Int. & FP MUL M1 M2 M3 M4 M5 M6 M7 Int. & FP DIV Look Ahead: Tomasulo Algorithm will help absorb latency of different functional units and cache miss latency by allowing other ready instruction proceed out-of-order Functional Unit Latency (Required stalls cycles between dependent [RAW] instrucs.) Initiation Interval (Distance between 2 independent instructions requiring the same FU) Integer ALU 0 1 FP Add 3 1 FP Mul. 6 1 FP Div

75 Int. Queue L/S Queue Div Queue Mult. Queue Reg. File 75 OoO Execution w/ ROB ROB allows for OoO execution but in-order completion I-Cache D-Cache Instruc. Queue ROB (Reorder Buffer) Br. Pred. Buffer Dispatch Exceptions? No problem Integer / Branch Exec. Unit D-Cache L/S Buffer Addr. Buffer Div Mul Issue Unit CDB

Credits. EE 457 Unit 9a. Outline. 8-Stage Pipeline. Exploiting ILP Out-of-Order Execution

Credits. EE 457 Unit 9a. Outline. 8-Stage Pipeline. Exploiting ILP Out-of-Order Execution Credits EE 457 Unit 9a Exploiting ILP Out-of-Order Execution Some of the material in this presentation is taken from: Computer Architecture: A Quantitative Approach John Hennessy & David Patterson Some

More information

References EE457. Out of Order (OoO) Execution. Instruction Scheduling (Re-ordering of instructions)

References EE457. Out of Order (OoO) Execution. Instruction Scheduling (Re-ordering of instructions) EE457 Out of Order (OoO) Execution Introduction to Dynamic Scheduling of Instructions (The Tomasulo Algorithm) By Gandhi Puvvada References EE557 Textbook Prof Dubois EE557 Classnotes Prof Annavaram s

More information

Handout 2 ILP: Part B

Handout 2 ILP: Part B Handout 2 ILP: Part B Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP

More information

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1 CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level

More information

Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining

Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining Single-Cycle Design Problems Assuming fixed-period clock every instruction datapath uses one

More information

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: March 16, 2001 Prof. David A. Patterson Computer Science 252 Spring

More information

Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: March 16, 2001 Prof. David A. Patterson Computer Science 252 Spring

More information

Lecture-13 (ROB and Multi-threading) CS422-Spring

Lecture-13 (ROB and Multi-threading) CS422-Spring Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue

More information

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism

More information

CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions

CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis6627 Powerpoint Lecture Notes from John Hennessy

More information

5008: Computer Architecture

5008: Computer Architecture 5008: Computer Architecture Chapter 2 Instruction-Level Parallelism and Its Exploitation CA Lecture05 - ILP (cwliu@twins.ee.nctu.edu.tw) 05-1 Review from Last Lecture Instruction Level Parallelism Leverage

More information

Page 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution

Page 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 15: Instruction Level Parallelism and Dynamic Execution March 11, 2002 Prof. David E. Culler Computer Science 252 Spring 2002

More information

DYNAMIC AND SPECULATIVE INSTRUCTION SCHEDULING

DYNAMIC AND SPECULATIVE INSTRUCTION SCHEDULING DYNAMIC AND SPECULATIVE INSTRUCTION SCHEDULING Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 3, John L. Hennessy and David A. Patterson,

More information

E0-243: Computer Architecture

E0-243: Computer Architecture E0-243: Computer Architecture L1 ILP Processors RG:E0243:L1-ILP Processors 1 ILP Architectures Superscalar Architecture VLIW Architecture EPIC, Subword Parallelism, RG:E0243:L1-ILP Processors 2 Motivation

More information

CISC 662 Graduate Computer Architecture. Lecture 10 - ILP 3

CISC 662 Graduate Computer Architecture. Lecture 10 - ILP 3 CISC 662 Graduate Computer Architecture Lecture 10 - ILP 3 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Load1 no Load2 no Add1 Y Sub Reg[F2] Reg[F6] Add2 Y Add Reg[F2] Add1 Add3 no Mult1 Y Mul Reg[F2] Reg[F4] Mult2 Y Div Reg[F6] Mult1

Load1 no Load2 no Add1 Y Sub Reg[F2] Reg[F6] Add2 Y Add Reg[F2] Add1 Add3 no Mult1 Y Mul Reg[F2] Reg[F4] Mult2 Y Div Reg[F6] Mult1 Instruction Issue Execute Write result L.D F6, 34(R2) L.D F2, 45(R3) MUL.D F0, F2, F4 SUB.D F8, F2, F6 DIV.D F10, F0, F6 ADD.D F6, F8, F2 Name Busy Op Vj Vk Qj Qk A Load1 no Load2 no Add1 Y Sub Reg[F2]

More information

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1) Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1) ILP vs. Parallel Computers Dynamic Scheduling (Section 3.4, 3.5) Dynamic Branch Prediction (Section 3.3) Hardware Speculation and Precise

More information

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e Instruction Level Parallelism Appendix C and Chapter 3, HP5e Outline Pipelining, Hazards Branch prediction Static and Dynamic Scheduling Speculation Compiler techniques, VLIW Limits of ILP. Implementation

More information

Instruction-Level Parallelism and Its Exploitation

Instruction-Level Parallelism and Its Exploitation Chapter 2 Instruction-Level Parallelism and Its Exploitation 1 Overview Instruction level parallelism Dynamic Scheduling Techniques es Scoreboarding Tomasulo s s Algorithm Reducing Branch Cost with Dynamic

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation 1 Branch Prediction Basic 2-bit predictor: For each branch: Predict taken or not

More information

COSC4201 Instruction Level Parallelism Dynamic Scheduling

COSC4201 Instruction Level Parallelism Dynamic Scheduling COSC4201 Instruction Level Parallelism Dynamic Scheduling Prof. Mokhtar Aboelaze Parts of these slides are taken from Notes by Prof. David Patterson (UCB) Outline Data dependence and hazards Exposing parallelism

More information

ELEC 5200/6200 Computer Architecture and Design Fall 2016 Lecture 9: Instruction Level Parallelism

ELEC 5200/6200 Computer Architecture and Design Fall 2016 Lecture 9: Instruction Level Parallelism ELEC 5200/6200 Computer Architecture and Design Fall 2016 Lecture 9: Instruction Level Parallelism Ujjwal Guin, Assistant Professor Department of Electrical and Computer Engineering Auburn University,

More information

Processor: Superscalars Dynamic Scheduling

Processor: Superscalars Dynamic Scheduling Processor: Superscalars Dynamic Scheduling Z. Jerry Shi Assistant Professor of Computer Science and Engineering University of Connecticut * Slides adapted from Blumrich&Gschwind/ELE475 03, Peh/ELE475 (Princeton),

More information

COSC 6385 Computer Architecture - Pipelining

COSC 6385 Computer Architecture - Pipelining COSC 6385 Computer Architecture - Pipelining Fall 2006 Some of the slides are based on a lecture by David Culler, Instruction Set Architecture Relevant features for distinguishing ISA s Internal storage

More information

LECTURE 10. Pipelining: Advanced ILP

LECTURE 10. Pipelining: Advanced ILP LECTURE 10 Pipelining: Advanced ILP EXCEPTIONS An exception, or interrupt, is an event other than regular transfers of control (branches, jumps, calls, returns) that changes the normal flow of instruction

More information

Static vs. Dynamic Scheduling

Static vs. Dynamic Scheduling Static vs. Dynamic Scheduling Dynamic Scheduling Fast Requires complex hardware More power consumption May result in a slower clock Static Scheduling Done in S/W (compiler) Maybe not as fast Simpler processor

More information

Chapter 4 The Processor 1. Chapter 4D. The Processor

Chapter 4 The Processor 1. Chapter 4D. The Processor Chapter 4 The Processor 1 Chapter 4D The Processor Chapter 4 The Processor 2 Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline

More information

Adapted from David Patterson s slides on graduate computer architecture

Adapted from David Patterson s slides on graduate computer architecture Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Basic Compiler Techniques for Exposing ILP Advanced Branch Prediction Dynamic Scheduling Hardware-Based Speculation

More information

COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

More information

CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution. Prof. Yanjing Li University of Chicago

CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution. Prof. Yanjing Li University of Chicago CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution Prof. Yanjing Li University of Chicago Administrative Stuff! Lab2 due tomorrow " 2 free late days! Lab3 is out " Start early!! My office

More information

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25 CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25 http://inst.eecs.berkeley.edu/~cs152/sp08 The problem

More information

Chapter 3 (CONT II) Instructor: Josep Torrellas CS433. Copyright J. Torrellas 1999,2001,2002,2007,

Chapter 3 (CONT II) Instructor: Josep Torrellas CS433. Copyright J. Torrellas 1999,2001,2002,2007, Chapter 3 (CONT II) Instructor: Josep Torrellas CS433 Copyright J. Torrellas 1999,2001,2002,2007, 2013 1 Hardware-Based Speculation (Section 3.6) In multiple issue processors, stalls due to branches would

More information

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University EE382A Lecture 7: Dynamic Scheduling Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 7-1 Announcements Project proposal due on Wed 10/14 2-3 pages submitted

More information

EE557--FALL 1999 MAKE-UP MIDTERM 1. Closed books, closed notes

EE557--FALL 1999 MAKE-UP MIDTERM 1. Closed books, closed notes NAME: STUDENT NUMBER: EE557--FALL 1999 MAKE-UP MIDTERM 1 Closed books, closed notes Q1: /1 Q2: /1 Q3: /1 Q4: /1 Q5: /15 Q6: /1 TOTAL: /65 Grade: /25 1 QUESTION 1(Performance evaluation) 1 points We are

More information

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenković, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Tomasulo

More information

CS433 Midterm. Prof Josep Torrellas. October 16, Time: 1 hour + 15 minutes

CS433 Midterm. Prof Josep Torrellas. October 16, Time: 1 hour + 15 minutes CS433 Midterm Prof Josep Torrellas October 16, 2014 Time: 1 hour + 15 minutes Name: Alias: Instructions: 1. This is a closed-book, closed-notes examination. 2. The Exam has 4 Questions. Please budget your

More information

EECC551 Exam Review 4 questions out of 6 questions

EECC551 Exam Review 4 questions out of 6 questions EECC551 Exam Review 4 questions out of 6 questions (Must answer first 2 questions and 2 from remaining 4) Instruction Dependencies and graphs In-order Floating Point/Multicycle Pipelining (quiz 2) Improving

More information

Multi-cycle Instructions in the Pipeline (Floating Point)

Multi-cycle Instructions in the Pipeline (Floating Point) Lecture 6 Multi-cycle Instructions in the Pipeline (Floating Point) Introduction to instruction level parallelism Recap: Support of multi-cycle instructions in a pipeline (App A.5) Recap: Superpipelining

More information

Chapter 4 The Processor 1. Chapter 4B. The Processor

Chapter 4 The Processor 1. Chapter 4B. The Processor Chapter 4 The Processor 1 Chapter 4B The Processor Chapter 4 The Processor 2 Control Hazards Branch determines flow of control Fetching next instruction depends on branch outcome Pipeline can t always

More information

Multiple Instruction Issue. Superscalars

Multiple Instruction Issue. Superscalars Multiple Instruction Issue Multiple instructions issued each cycle better performance increase instruction throughput decrease in CPI (below 1) greater hardware complexity, potentially longer wire lengths

More information

Out of Order Processing

Out of Order Processing Out of Order Processing Manu Awasthi July 3 rd 2018 Computer Architecture Summer School 2018 Slide deck acknowledgements : Rajeev Balasubramonian (University of Utah), Computer Architecture: A Quantitative

More information

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri Department of Computer and IT Engineering University of Kurdistan Computer Architecture Pipelining By: Dr. Alireza Abdollahpouri Pipelined MIPS processor Any instruction set can be implemented in many

More information

This Set. Scheduling and Dynamic Execution Definitions From various parts of Chapter 4. Description of Three Dynamic Scheduling Methods

This Set. Scheduling and Dynamic Execution Definitions From various parts of Chapter 4. Description of Three Dynamic Scheduling Methods 10-1 Dynamic Scheduling 10-1 This Set Scheduling and Dynamic Execution Definitions From various parts of Chapter 4. Description of Three Dynamic Scheduling Methods Not yet complete. (Material below may

More information

Predict Not Taken. Revisiting Branch Hazard Solutions. Filling the delay slot (e.g., in the compiler) Delayed Branch

Predict Not Taken. Revisiting Branch Hazard Solutions. Filling the delay slot (e.g., in the compiler) Delayed Branch branch taken Revisiting Branch Hazard Solutions Stall Predict Not Taken Predict Taken Branch Delay Slot Branch I+1 I+2 I+3 Predict Not Taken branch not taken Branch I+1 IF (bubble) (bubble) (bubble) (bubble)

More information

Advanced issues in pipelining

Advanced issues in pipelining Advanced issues in pipelining 1 Outline Handling exceptions Supporting multi-cycle operations Pipeline evolution Examples of real pipelines 2 Handling exceptions 3 Exceptions In pipelined execution, one

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

Instruction Level Parallelism. ILP, Loop level Parallelism Dependences, Hazards Speculation, Branch prediction

Instruction Level Parallelism. ILP, Loop level Parallelism Dependences, Hazards Speculation, Branch prediction Instruction Level Parallelism ILP, Loop level Parallelism Dependences, Hazards Speculation, Branch prediction Basic Block A straight line code sequence with no branches in except to the entry and no branches

More information

Hardware-based Speculation

Hardware-based Speculation Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions

More information

Page # CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Michela Taufer

Page # CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Michela Taufer CISC 662 Graduate Computer Architecture Lecture 8 - ILP 1 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer Architecture,

More information

CAD for VLSI 2 Pro ject - Superscalar Processor Implementation

CAD for VLSI 2 Pro ject - Superscalar Processor Implementation CAD for VLSI 2 Pro ject - Superscalar Processor Implementation 1 Superscalar Processor Ob jective: The main objective is to implement a superscalar pipelined processor using Verilog HDL. This project may

More information

Instruction Level Parallelism

Instruction Level Parallelism Instruction Level Parallelism Dynamic scheduling Scoreboard Technique Tomasulo Algorithm Speculation Reorder Buffer Superscalar Processors 1 Definition of ILP ILP=Potential overlap of execution among unrelated

More information

Complex Pipelining COE 501. Computer Architecture Prof. Muhamed Mudawar

Complex Pipelining COE 501. Computer Architecture Prof. Muhamed Mudawar Complex Pipelining COE 501 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Diversified Pipeline Detecting

More information

Outline Review: Basic Pipeline Scheduling and Loop Unrolling Multiple Issue: Superscalar, VLIW. CPE 631 Session 19 Exploiting ILP with SW Approaches

Outline Review: Basic Pipeline Scheduling and Loop Unrolling Multiple Issue: Superscalar, VLIW. CPE 631 Session 19 Exploiting ILP with SW Approaches Session xploiting ILP with SW Approaches lectrical and Computer ngineering University of Alabama in Huntsville Outline Review: Basic Pipeline Scheduling and Loop Unrolling Multiple Issue: Superscalar,

More information

ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5)

ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5) Instruction-Level Parallelism and its Exploitation: PART 1 ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5) Project and Case

More information

Course on Advanced Computer Architectures

Course on Advanced Computer Architectures Surname (Cognome) Name (Nome) POLIMI ID Number Signature (Firma) SOLUTION Politecnico di Milano, July 9, 2018 Course on Advanced Computer Architectures Prof. D. Sciuto, Prof. C. Silvano EX1 EX2 EX3 Q1

More information

Website for Students VTU NOTES QUESTION PAPERS NEWS RESULTS

Website for Students VTU NOTES QUESTION PAPERS NEWS RESULTS Advanced Computer Architecture- 06CS81 Hardware Based Speculation Tomasulu algorithm and Reorder Buffer Tomasulu idea: 1. Have reservation stations where register renaming is possible 2. Results are directly

More information

CPE 631 Lecture 11: Instruction Level Parallelism and Its Dynamic Exploitation

CPE 631 Lecture 11: Instruction Level Parallelism and Its Dynamic Exploitation Lecture 11: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenkovic, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Instruction

More information

Four Steps of Speculative Tomasulo cycle 0

Four Steps of Speculative Tomasulo cycle 0 HW support for More ILP Hardware Speculative Execution Speculation: allow an instruction to issue that is dependent on branch, without any consequences (including exceptions) if branch is predicted incorrectly

More information

EITF20: Computer Architecture Part3.2.1: Pipeline - 3

EITF20: Computer Architecture Part3.2.1: Pipeline - 3 EITF20: Computer Architecture Part3.2.1: Pipeline - 3 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Dynamic scheduling - Tomasulo Superscalar, VLIW Speculation ILP limitations What we have done

More information

Computer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士

Computer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士 Computer Architecture 计算机体系结构 Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II Chao Li, PhD. 李超博士 SJTU-SE346, Spring 2018 Review Hazards (data/name/control) RAW, WAR, WAW hazards Different types

More information

Page 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer

Page 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer CISC 662 Graduate Computer Architecture Lecture 8 - ILP 1 Michela Taufer Pipeline CPI http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson

More information

Chapter 3: Instruction Level Parallelism (ILP) and its exploitation. Types of dependences

Chapter 3: Instruction Level Parallelism (ILP) and its exploitation. Types of dependences Chapter 3: Instruction Level Parallelism (ILP) and its exploitation Pipeline CPI = Ideal pipeline CPI + stalls due to hazards invisible to programmer (unlike process level parallelism) ILP: overlap execution

More information

Chapter 4 The Processor 1. Chapter 4A. The Processor

Chapter 4 The Processor 1. Chapter 4A. The Processor Chapter 4 The Processor 1 Chapter 4A The Processor Chapter 4 The Processor 2 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware

More information

Instruction Pipelining Review

Instruction Pipelining Review Instruction Pipelining Review Instruction pipelining is CPU implementation technique where multiple operations on a number of instructions are overlapped. An instruction execution pipeline involves a number

More information

Thomas Polzer Institut für Technische Informatik

Thomas Polzer Institut für Technische Informatik Thomas Polzer tpolzer@ecs.tuwien.ac.at Institut für Technische Informatik Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =

More information

This Set. Scheduling and Dynamic Execution Definitions From various parts of Chapter 4. Description of Two Dynamic Scheduling Methods

This Set. Scheduling and Dynamic Execution Definitions From various parts of Chapter 4. Description of Two Dynamic Scheduling Methods 10 1 Dynamic Scheduling 10 1 This Set Scheduling and Dynamic Execution Definitions From various parts of Chapter 4. Description of Two Dynamic Scheduling Methods Not yet complete. (Material below may repeat

More information

CS 152 Computer Architecture and Engineering. Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming

CS 152 Computer Architecture and Engineering. Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming CS 152 Computer Architecture and Engineering Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming John Wawrzynek Electrical Engineering and Computer Sciences University of California at

More information

Advanced Computer Architecture

Advanced Computer Architecture Advanced Computer Architecture Chapter 1 Introduction into the Sequential and Pipeline Instruction Execution Martin Milata What is a Processors Architecture Instruction Set Architecture (ISA) Describes

More information

Advanced Computer Architecture

Advanced Computer Architecture Advanced Computer Architecture 1 L E C T U R E 4: D A T A S T R E A M S I N S T R U C T I O N E X E C U T I O N I N S T R U C T I O N C O M P L E T I O N & R E T I R E M E N T D A T A F L O W & R E G I

More information

The Processor Pipeline. Chapter 4, Patterson and Hennessy, 4ed. Section 5.3, 5.4: J P Hayes.

The Processor Pipeline. Chapter 4, Patterson and Hennessy, 4ed. Section 5.3, 5.4: J P Hayes. The Processor Pipeline Chapter 4, Patterson and Hennessy, 4ed. Section 5.3, 5.4: J P Hayes. Pipeline A Basic MIPS Implementation Memory-reference instructions Load Word (lw) and Store Word (sw) ALU instructions

More information

MIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14

MIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14 MIPS Pipelining Computer Organization Architectures for Embedded Computing Wednesday 8 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy 4th Edition, 2011, MK

More information

DYNAMIC INSTRUCTION SCHEDULING WITH SCOREBOARD

DYNAMIC INSTRUCTION SCHEDULING WITH SCOREBOARD DYNAMIC INSTRUCTION SCHEDULING WITH SCOREBOARD Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 3, John L. Hennessy and David A. Patterson,

More information

Super Scalar. Kalyan Basu March 21,

Super Scalar. Kalyan Basu March 21, Super Scalar Kalyan Basu basu@cse.uta.edu March 21, 2007 1 Super scalar Pipelines A pipeline that can complete more than 1 instruction per cycle is called a super scalar pipeline. We know how to build

More information

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering

More information

Advanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017

Advanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017 Advanced Parallel Architecture Lessons 5 and 6 Annalisa Massini - Pipelining Hennessy, Patterson Computer architecture A quantitive approach Appendix C Sections C.1, C.2 Pipelining Pipelining is an implementation

More information

Scoreboard information (3 tables) Four stages of scoreboard control

Scoreboard information (3 tables) Four stages of scoreboard control Scoreboard information (3 tables) Instruction : issued, read operands and started execution (dispatched), completed execution or wrote result, Functional unit (assuming non-pipelined units) busy/not busy

More information

1 Hazards COMP2611 Fall 2015 Pipelined Processor

1 Hazards COMP2611 Fall 2015 Pipelined Processor 1 Hazards Dependences in Programs 2 Data dependence Example: lw $1, 200($2) add $3, $4, $1 add can t do ID (i.e., read register $1) until lw updates $1 Control dependence Example: bne $1, $2, target add

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

The basic structure of a MIPS floating-point unit

The basic structure of a MIPS floating-point unit Tomasulo s scheme The algorithm based on the idea of reservation station The reservation station fetches and buffers an operand as soon as it is available, eliminating the need to get the operand from

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation Introduction Pipelining become universal technique in 1985 Overlaps execution of

More information

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

More information

EEC 581 Computer Architecture. Lec 7 Instruction Level Parallelism (2.6 Hardware-based Speculation and 2.7 Static Scheduling/VLIW)

EEC 581 Computer Architecture. Lec 7 Instruction Level Parallelism (2.6 Hardware-based Speculation and 2.7 Static Scheduling/VLIW) EEC 581 Computer Architecture Lec 7 Instruction Level Parallelism (2.6 Hardware-based Speculation and 2.7 Static Scheduling/VLIW) Chansu Yu Electrical and Computer Engineering Cleveland State University

More information

Exploitation of instruction level parallelism

Exploitation of instruction level parallelism Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering

More information

Lecture 9. Pipeline Hazards. Christos Kozyrakis Stanford University

Lecture 9. Pipeline Hazards. Christos Kozyrakis Stanford University Lecture 9 Pipeline Hazards Christos Kozyrakis Stanford University http://eeclass.stanford.edu/ee18b 1 Announcements PA-1 is due today Electronic submission Lab2 is due on Tuesday 2/13 th Quiz1 grades will

More information

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor. COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor The Processor - Introduction

More information

Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.

Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2. Instruction-Level Parallelism and its Exploitation: PART 2 Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.8)

More information

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 250P: Computer Systems Architecture Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr

More information

EE 457 Midterm Summer 14 Redekopp Name: Closed Book / 105 minutes No CALCULATORS Score: / 100

EE 457 Midterm Summer 14 Redekopp Name: Closed Book / 105 minutes No CALCULATORS Score: / 100 EE 47 Midterm Summer 4 Redekopp Name: Closed Book / minutes No CALCULATORS Score: /. (7 pts.) Short Answer [Fill in the blanks or select the correct answer] a. If a control signal must be valid during

More information

Lecture 19: Instruction Level Parallelism

Lecture 19: Instruction Level Parallelism Lecture 19: Instruction Level Parallelism Administrative: Homework #5 due Homework #6 handed out today Last Time: DRAM organization and implementation Today Static and Dynamic ILP Instruction windows Register

More information

EEC 581 Computer Architecture. Instruction Level Parallelism (3.6 Hardware-based Speculation and 3.7 Static Scheduling/VLIW)

EEC 581 Computer Architecture. Instruction Level Parallelism (3.6 Hardware-based Speculation and 3.7 Static Scheduling/VLIW) 1 EEC 581 Computer Architecture Instruction Level Parallelism (3.6 Hardware-based Speculation and 3.7 Static Scheduling/VLIW) Chansu Yu Electrical and Computer Engineering Cleveland State University Overview

More information

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenkovic, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Instruction

More information

Computer Architecture Spring 2016

Computer Architecture Spring 2016 Computer Architecture Spring 2016 Lecture 02: Introduction II Shuai Wang Department of Computer Science and Technology Nanjing University Pipeline Hazards Major hurdle to pipelining: hazards prevent the

More information

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

Spring 2010 Prof. Hyesoon Kim. Thanks to Prof. Loh & Prof. Prvulovic

Spring 2010 Prof. Hyesoon Kim. Thanks to Prof. Loh & Prof. Prvulovic Spring 2010 Prof. Hyesoon Kim Thanks to Prof. Loh & Prof. Prvulovic C/C++ program Compiler Assembly Code (binary) Processor 0010101010101011110 Memory MAR MDR INPUT Processing Unit OUTPUT ALU TEMP PC Control

More information

CS252 Graduate Computer Architecture Lecture 6. Recall: Software Pipelining Example

CS252 Graduate Computer Architecture Lecture 6. Recall: Software Pipelining Example CS252 Graduate Computer Architecture Lecture 6 Tomasulo, Implicit Register Renaming, Loop-Level Parallelism Extraction Explicit Register Renaming John Kubiatowicz Electrical Engineering and Computer Sciences

More information

Advanced Computer Architecture CMSC 611 Homework 3. Due in class Oct 17 th, 2012

Advanced Computer Architecture CMSC 611 Homework 3. Due in class Oct 17 th, 2012 Advanced Computer Architecture CMSC 611 Homework 3 Due in class Oct 17 th, 2012 (Show your work to receive partial credit) 1) For the following code snippet list the data dependencies and rewrite the code

More information

COSC121: Computer Systems. ISA and Performance

COSC121: Computer Systems. ISA and Performance COSC121: Computer Systems. ISA and Performance Jeremy Bolton, PhD Assistant Teaching Professor Constructed using materials: - Patt and Patel Introduction to Computing Systems (2nd) - Patterson and Hennessy

More information

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections ) Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14) 1 ILP Limits The perfect processor: Infinite registers (no WAW or WAR hazards) Perfect branch direction and target

More information

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1 Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1 Introduction Chapter 4.1 Chapter 4.2 Review: MIPS (RISC) Design Principles Simplicity favors regularity fixed size instructions small number

More information