EE457: Out of Order (OoO) Execution. Instruction Scheduling (Re-ordering of instructions)




EE457 Out of Order (OoO) Execution
Introduction to Dynamic Scheduling of Instructions (The Tomasulo Algorithm)
By Gandhi Puvvada

References
-- EE557 Textbook
-- Prof Dubois' EE557 Classnotes
-- Prof Annavaram's slides
-- Prof Patterson's Lecture slides

Instruction Scheduling (Re-ordering of instructions)
-- We will limit our discussion to scheduling of instructions mostly within the basic block (basic block = a straight-line code sequence with no branches)
-- The compiler can perform static instruction scheduling
-- The Tomasulo Algorithm lets us schedule instructions dynamically (in hardware)

Static Scheduling (based on Prof Dubois' slides)
Strengths
-- Hardware simplicity
-- Compiler has a global view of the code
Weaknesses
-- cannot be CPU-implementation specific
-- cannot foresee dynamic events
-- cache misses
-- data-dependent delays
-- conditional branches
-- cannot pre-compute memory addresses


Simple 5-stage pipeline
-- In-order execution
-- RAW dependency: solve it by forwarding; if not, by stalling
-- Dependent instructions are stalled in the ID stage
[Figure: 5-stage pipeline with stages IF, ID, EX, M, WB and memories IM, DM]

Simple 5-stage pipeline: Dependent instructions are stalled in the ID stage
[Figure: pipeline diagram with lw and and instructions]

Simple 5-stage pipeline: Dependent instructions cannot be stalled in the EX stage. Why?
[Figure: pipeline diagram with lw and and instructions]
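The "if not, by stalling" case above is the familiar load-use hazard: a value loaded by lw is not available in time to be forwarded to the very next instruction. A minimal sketch of the textbook ID-stage stall check follows. The signal names (`id_ex_mem_read`, `id_ex_rt`, `if_id_rs`, `if_id_rt`) are assumptions borrowed from the usual MIPS pipeline-register naming; they do not appear in these slides.

```python
def must_stall_in_id(id_ex_mem_read: bool, id_ex_rt: int,
                     if_id_rs: int, if_id_rt: int) -> bool:
    """Return True when the instruction in ID must be held for one clock.

    Assumed signals (hypothetical names, standard MIPS style):
      id_ex_mem_read : the instruction now in EX is a load (lw)
      id_ex_rt       : destination register number of that load
      if_id_rs/rt    : source register numbers of the instruction in ID
    """
    # The load's data only exists after the M stage, so a dependent
    # instruction sitting in ID cannot get it by forwarding next clock.
    return id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt)
```

For example, a lw writing register 9 followed immediately by an instruction reading register 9 makes the check return True, so the dependent instruction waits in ID for one clock.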

Tomasulo's plan: OoO (Out of order) execution
-- Provide multiple functional units (for simplicity, we avoid talking about the floating point execution unit and floating point register file)
-- Stall, after decoding, in queues
[Figure: IF and ID stages (with IM), then Queues and Functional units (Divide, Multiply, Integer, Load/Store with DM), then WB]

-- Multiple functional units (say, Integer, DM, Multiplier, Divider)
-- Queues between ID and EX stages (in place of the ID/EX register)

Out of order execution?! Problems all over??!!
-- For the time being, no branch prediction and no speculative execution beyond branches; just stall on a conditional branch
-- No support for precise exceptions
-- Even then,

RAW, WAR, and WAW
-- RAW = Read After Write
   add $9,, ;
-- WAR = Write after Read
   add $9,, ;
-- WAW = Write after Write
   add $9,, ; lw $9, 40();
-- WAW? How is it possible? Consider a printer or a FIFO

Name Dependences
RAW, WAR, and WAW (some terminology to remember)
-- RAW = Read After Write: a true dependency
   add $9,, ;
-- WAR = Write after Read: an anti-dependency
   add $9,, ;
-- WAW = Write after Write: an output dependency
   add $9,, ; lw $9, 40();

RAW, WAR, and WAW
-- In-order execution: we need to deal with RAW only
-- Out of order execution: now we need to deal with WAR and WAW besides RAW

Limited Architectural Registers, More Physical Registers: Register Renaming
   sw, 40(); lw, 60(); sw, 60();
-- It is clear that the compiler is using as a temporary register
-- If there is a delay in obtaining, the first part of the code cannot proceed
-- Unfortunately, the second part of the code cannot proceed either, because of the name dependency for

-- If we had 64 registers instead of 32 registers, then perhaps the compiler might have used 8 instead of and we could have executed the second part of the code before the first part!
   sw, 40(); lw 8, 60(); add 8, 8, 8; sw 8, 60();
-- This is an example of name dependency
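The three definitions above reduce to comparing the destination and source registers of an earlier and a later instruction. The sketch below is only an illustration: the `Instr` record, the function name, and the register numbers used in the usage note are assumptions made for the example and are not taken from the slides.

```python
from typing import NamedTuple, Optional, Set, Tuple


class Instr(NamedTuple):
    text: str                 # assembly text, for display only
    dest: Optional[str]       # register written (None for sw, branches)
    srcs: Tuple[str, ...]     # registers read


def dependences(first: Instr, second: Instr) -> Set[str]:
    """Classify register dependences of `second` (later) on `first` (earlier)."""
    found = set()
    # RAW (true dependency): the later instruction reads what the earlier writes.
    if first.dest is not None and first.dest in second.srcs:
        found.add("RAW")
    # WAR (anti-dependency): the later instruction writes what the earlier reads.
    if second.dest is not None and second.dest in first.srcs:
        found.add("WAR")
    # WAW (output dependency): both write the same register.
    if first.dest is not None and first.dest == second.dest:
        found.add("WAW")
    return found
```

With the hypothetical pair `add $9, $2, $3` followed by `sub $4, $9, $5`, the function reports RAW. Following the same `add` with `add $2, $6, $7` gives WAR, and following it with `lw $9, 40($1)` gives WAW. Only RAW forces the later instruction to wait for a value; WAR and WAW are conflicts over a register name.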

Four different temporary registers can be used here as shown: , $18, 8, and 8 (or called with coded names LION, TIGER, CAT, and ANT)
   add $18,, ; sw $18, 40(); lw 8, 60(); add 8, 8, 8; sw 8, 60();

   lw LION, 40();
   add TIGER, LION, LION;
   sw TIGER, 40();
   lw CAT, 60();
   add ANT, CAT, CAT;
   sw ANT, 60();

Can a later implementation provide 64 registers (instead of 32) while maintaining binary compatibility with previously compiled codes?
Answer: Yes / No. Why?
Answer: Cannot change the number of Architectural Registers

Register Renaming Through Tagging Registers
-- This solves name dependency problems (WAR and WAW) while attending to true dependency (RAW) through waiting in queues

[Figure: the sequence square_root, $10; sw, 40(); lw, 60(); sw, 60(); with the destination and dependent source marked, next to the RST and RF]
RST = Register Status Table
RF = Register File

[Figures: successive snapshots of the RST and RF as the sequence square_root, $10; sw, 40(); lw, 60(); sw, 60(); is dispatched]

Dispatch unit decodes and dispatches instructions
-- For the destination operand, an instruction carries a TAG (but not the actual register name)!
-- For source operands, an instruction carries either the values or the TAGs of the operands (but not the actual register names)!

TAGs for destinations or sources or for both?
-- A new tag is assigned to the destination register of the instruction being dispatched
-- For each of the source registers (source operands) of the instruction being dispatched, either the value of the source register (if it has not been previously tagged) or the existing tag associated with the source register (if it has been tagged already) is conveyed to the instruction
-- If a tag is conveyed for a source, then the instruction needs to wait for the original instruction with that destination tag to go on to the CDB and announce the value

Unique TAG
-- Like SSN, we need a unique TAG
-- SSNs are reused; similarly, TAGs can be reused
-- TAGs are similar to numbered TOKENs: "Take a number" vs "Take a token"

TAGs (= Tokens)
-- How many tokens should the bank cashier have to start with?
-- In State Bank of India, they issue brass tokens to customers waiting for service
-- Tokens are reclaimed and reused
-- What happens if the tokens run out?
-- Does he need to have any order in holding tokens and issuing tokens?
-- Does he have to collect tokens back?
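The dispatch rules above (a new tag for the destination; a value or an existing tag for each source) can be sketched as a small model of the RST and RF. This is an illustration under stated assumptions, not the course's design: the class and method names are hypothetical, tags are handed out from a plain queue, and the register file is updated on a CDB broadcast only when the RST still holds the broadcast tag.

```python
from collections import deque


class Dispatcher:
    """Toy model of the dispatch unit's RST (Register Status Table) and RF."""

    def __init__(self, num_tags=64):
        self.reg_file = {f"${i}": 0 for i in range(32)}  # RF: register -> value
        self.rst = {}                       # RST: register -> tag of pending writer
        self.free_tags = deque(range(num_tags))

    def dispatch(self, op, dest, srcs):
        # Sources are looked up BEFORE the destination is retagged, so that
        # an instruction like add R, R, R waits on the older writer of R.
        operands = []
        for reg in srcs:
            if reg in self.rst:
                operands.append(("tag", self.rst[reg]))       # must wait on CDB
            else:
                operands.append(("value", self.reg_file[reg]))  # ready now
        dest_tag = None
        if dest is not None:
            dest_tag = self.free_tags.popleft()   # assumes a tag is available
            self.rst[dest] = dest_tag             # latest writer owns the name
        return {"op": op, "dest_tag": dest_tag, "operands": operands}

    def cdb_broadcast(self, tag, value):
        # Write the RF only if this tag is still the latest writer of a
        # register; an older, superseded writer must not clobber the RF.
        for reg, pending in list(self.rst.items()):
            if pending == tag:
                self.reg_file[reg] = value
                del self.rst[reg]
        self.free_tags.append(tag)                # the token is reclaimed
```

Dispatching a hypothetical `lw $2, 40($1)` and then `add $2, $2, $2` gives the add two source operands carrying the load's tag and a fresh tag for its own destination. When the load's tag later appears on the CDB, the RF entry for `$2` is left alone because the RST already points to the add's tag.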

TAG FIFO (FIFOs are taught in EE560)
-- To issue and collect tokens (TAGs), use a circular FIFO (First-in-First-Out) unit
-- Filled with (say) 64 tokens (in any order) initially on reset
-- Tokens return out of order anyway
-- Put tokens back in stack and issue
[Figure: TAG FIFO with pointers wp and rp, shown full, after 2 tokens issued, and after 1 token returned]

Simplified for EE457
Block diagram provided by Prof Dubois
[Figure: block diagram with Int Divider, Integer, Multiplier, and Issue Unit]

Front-End & Back-End
-- IFQ: Instruction Fetch Queue (a FIFO structure)
-- Dispatch unit (including RST, RF, Tag FIFO)
-- Load Store and other Issue Queues
-- Issue Unit
-- Functional units
-- CDB (Common Data Bus)
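The TAG FIFO described above can be modeled as a circular buffer that starts full of tokens. The sketch below assumes that rp is the read pointer (the next token to issue) and wp is the write pointer (where a returned token is stored), and it uses a separate counter to tell full from empty. The class and method names are hypothetical.

```python
class TagFifo:
    """Circular FIFO of free tags, filled with all tokens on reset."""

    def __init__(self, depth=64):
        self.depth = depth
        self.slots = list(range(depth))  # any initial order would do
        self.rp = 0                      # read pointer: next tag to issue
        self.wp = 0                      # write pointer: next slot for a returned tag
        self.count = depth               # free tags currently held

    def issue(self):
        """Hand out one tag, or None when no tokens are left (dispatch stalls)."""
        if self.count == 0:
            return None
        tag = self.slots[self.rp]
        self.rp = (self.rp + 1) % self.depth
        self.count -= 1
        return tag

    def give_back(self, tag):
        """Reclaim a tag; tags may come back in any order."""
        if self.count == self.depth:
            raise ValueError("more tags returned than were issued")
        self.slots[self.wp] = tag
        self.wp = (self.wp + 1) % self.depth
        self.count += 1
```

Because a returned token is written at wp regardless of its value, the FIFO never needs the tokens to come back in the order they were issued; it only needs each token to come back exactly once.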


Bottleneck in the design
-- CDB = Common Data Bus
-- Do all instructions use the CDB? sw? j (jump)? beq
[Figure labels: Address calculation, load store queue, Memory disambiguation]

Address calculation for lw and sw
-- EE557 approach for address calculation
-- Memory Disambiguation (EE557)
-- EE457/560 approach for address calculation: dedicated adder, to compute the address, attached to the load-store queue

Memory Disambiguation
-- RAW: sw, 2000($0); lw, 2000($0);
   This later lw can proceed only if there is no store ahead of it with the same address
-- WAW: sw, 2000($0); sw, 2000($0);
   This later sw can proceed only if there is no store ahead of it with the same address
-- WAR: lw, 2000($0); sw, 2000($0);
   This later sw can proceed only if there is no load ahead of it with the same address

Maintaining instructions in the order of arrival (issue order/program order) in a queue
Is it necessary or is it desirable?
-- In the case of the L-S Queue? NECESSARY, to enforce memory disambiguation rules
-- In the case of Integer and other queues (mult queue, div queue)? DESIRABLE, so that an earlier instruction gets executed whenever possible, thereby perhaps avoiding too many instructions waiting on it
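The three rules above can be checked by scanning the older entries of a load-store queue kept in program order. The sketch below is an illustration only: the `MemOp` record and `can_proceed` function are hypothetical names, and treating an older operation whose address is not yet computed as a possible conflict is a conservative assumption that the slides do not state.

```python
from typing import List, NamedTuple, Optional


class MemOp(NamedTuple):
    kind: str             # "lw" or "sw"
    addr: Optional[int]   # effective address, None while not yet computed


def can_proceed(queue: List[MemOp], index: int) -> bool:
    """May queue[index] access memory now? The queue is in program order."""
    me = queue[index]
    if me.addr is None:
        return False                      # own address not computed yet
    for older in queue[:index]:
        if me.kind == "lw" and older.kind == "lw":
            continue                      # two loads never conflict
        # Remaining pairs: sw then lw (RAW), sw then sw (WAW), lw then sw (WAR).
        # Assumption: an unknown older address is treated as a match.
        if older.addr is None or older.addr == me.addr:
            return False
    return True
```

With the queue `sw 2000, lw 2000, lw 3000`, the first load must wait behind the store (RAW) while the second load is free to go ahead of it. This only works because the queue preserves program order, which is why ordering is necessary in the L-S queue.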

Priority (based on the order of arrival) among instructions ready to execute
Is it necessary or is it desirable?
-- Local priority within the queues
-- Issue Unit
-- CDB availability constraint
-- Pipelined functional unit vs multi-cycle functional unit
-- Global priority across the queues
-- Conflict resolution: round-robin priority adequate? Well,

Conditional branches
-- Dispatch unit stops dispatching until the branch is resolved
-- CDB broadcasts the result of the branch
-- Dispatching continues thereafter either at the fall-through instruction or at the target instruction
-- A successful branch shall cause flushing of the IFQ, very much like a jump

Conditional branches
-- Since we stop dispatching instructions after a branch, does it mean that this branch is the last instruction to be executed in the back-end?
-- Is it possible that the back-end holds simultaneously (a) some instructions dispatched before the branch and (b) some instructions issued after the branch was resolved?
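The two levels of priority mentioned above can be sketched separately: a local rule that picks the oldest ready instruction inside one queue, and a global arbiter that decides which queue gets the single CDB. The sketch uses plain round-robin for the global choice only because the slide raises it as a candidate; whether it is adequate is left open there. All names below are hypothetical.

```python
def oldest_ready(queue):
    """Local priority: first ready entry in arrival order, or None.

    `queue` is a list of dicts kept in order of arrival, each with a
    boolean 'ready' field (all source operands available).
    """
    for entry in queue:
        if entry["ready"]:
            return entry
    return None


class RoundRobinArbiter:
    """Global priority: rotate the grant among the requesting queues."""

    def __init__(self, names):
        self.names = list(names)
        self.next = 0                      # index with highest priority now

    def grant(self, requests):
        """Return the name of the one requester granted this clock, or None."""
        n = len(self.names)
        for offset in range(n):
            i = (self.next + offset) % n
            if self.names[i] in requests:
                self.next = (i + 1) % n    # winner gets lowest priority next
                return self.names[i]
        return None
```

If the integer and divide queues both request on two consecutive clocks, the arbiter grants the integer queue first and the divide queue second, so neither queue can starve the other of the CDB.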

Tomasulo Loop Example (based on Prof Annavaram's lecture slide)
   Loop: LW, 40($1);
         MULT, ;
         SW, 40($1);
         ADDI $1, $1, -4;
         BNE $1, $0, Loop;
-- Assume Multiply takes 4 clocks
-- Assume first load takes 8 clocks (cache miss), second load takes 1 clock (hit)

How could Tomasulo overlap iterations of loops?
-- The destination registers get different TAGs in different iterations
-- These tags were given in place of the source operands to the dependent instructions following them

Say, only two iterations. Let us unroll the two iterations:
   Loop: LW, 40($1);
         MULT, ;
         SW, 40($1);
         ADDI $1, $1, -4;
         BNE $1, $0, Loop;
   Loop: LW, 40($1);
         MULT, ;
         SW, 40($1);
         ADDI $1, $1, -4;
         BNE $1, $0, Loop;
[Figure annotations mark the destination register and the dependent source register(s)]

-- Because, there is no reorder buffer
-- Note: Your EE560 project will use a reorder buffer!
