EE 457 Unit 9a. Exploiting ILP Out-of-Order Execution
|
|
- Stephen Osborne
- 6 years ago
- Views:
Transcription
1 1 EE 457 Unit 9a Exploiting ILP Out-of-Order Execution
2 2 Credits Some of the material in this presentation is taken from: Computer Architecture: A Quantitative Approach John Hennessy & David Patterson Some of the material in this presentation is derived from course notes and slides from Prof. Michel Dubois (USC) Prof. Murali Annavaram (USC) Prof. David Patterson (UC Berkeley)
3 3 Outline Instruction Level Parallelism In-order pipeline From academic 5-stage pipeline To 8-stage MIPS R4000 pipeline Superscalar, superpipelined Out-of-Order Execution Out-of-order completion (Problem: Exceptions) In-order completion Thread Level Parallelism Chip multithreading (CMT) Chip multiprocessors (CMP)
4 4 8-Stage Pipeline In 5-stage pipeline, we place the I-Cache and D-Cache in one stage each When we add VM translation (i.e. TLB), cache tag check, and data access we need more stages to balance the delay Single Memory Stage in 5- Stage Pipeline Expand to three stages when TLB and Cache access is balanced D-Cache D-TLB Cache Data Cache Tag Lookup Tag Check
5 5 8-Stage Pipeline R4000 uses pipelined instruction and data caches The instruction is available at the end of the IS stage but the tag check is done in RF This is harmless since it s fine to start the decode and read registers even if it s an invalid tag so long as we know by the end of the clock For the data cache we must perform the tag check before we can start writing into the register file TLB Data/Tag Access Tag Check TLB Data/Tag Access Tag Check IF IS RF EX DF DS TC WB IM Reg ALU DM Reg
6 6 Load Dependency Issue Latency LW,0() 8-stage pipeline has a two-cycle load delay Possible since data is available at end of DS and can be forwarded If tag check indicates a miss the pipeline can be backed up (re-execute the instruction when the data is available IM Reg ALU DM Reg IM Reg ALU DM Reg IM Reg ALU DM Reg ADD,, IM Reg ALU DM Reg
7 SUPERSCALAR & SUPERPIPELINING 7
8 Superpipelining Superscalar 8 Overview Superscalar = More than 1 instruction completing per clock cycle 2-way superscalar = Proc. that can issue 2 instructions per clock cycle Success is sensitive to ability to find independent instructions to issue in the same cycle Superpipelining = Many small stages to boost clock freq. Success depends of finding instructions to schedule in the shadow of data and control hazards Instruction 1 Instruc. Fetch Instruc. Decode Execute Data Memory Write back Instruction 2 Instruc. Fetch Instruc. Decode Execute Data Memory Write back Superscalar: Executing more than 1 instruction per clock cycle (CPI < 1) Instruction 1 IF1 IF2 ID EX DM1 DM2 DM3 WB Instruction 2 IF1 IF2 ID EX DM1 DM2 DM3 WB Superpipelining: Divide logic into many short stages (Higher Clock Frequency)
9 LD/ST Slot Integer Slot 9 2-way Superscalar One ALU & Data transfer (LW/SW) instruction can be issued at the same time Instruction Pipeline Stages ALU or branch IF ID EX MEM WB LW/SW IF ID EX MEM WB ALU or branch IF ID EX MEM WB LW/SW IF ID EX MEM WB ALU or branch IF ID EX MEM WB LW/SW IF ID EX MEM WB PC I-Cache 2 instructions Reg. File (4 Read, 2 Write) ALU Addr. Calc. D-Cache
10 OUT-OF-ORDER EXECUTION 10
11 11 Instruction Level Parallelism (ILP) Although a program defines a sequential ordering of instructions, in reality many instructions can be executed in parallel. ILP refers to the process of finding instructions from a single program/thread of execution that can be executed in parallel Data flow (data dependencies) is what truly limits ordering Independent instructions (no data dependencies) can be executed at the same time) Control hazards also provide some ordering constraints Program Order (In-order) lw $s3,0($s4) and $t3,$t2,$t3 add $t0,$t0,$s4 or $t5,$t3,$t2 sub $t1,$t1,$t2 beq $t0,$t8,l1 xor $s0,$t1,$s2 Dependency Graph We may perform execution out-of-order LW ADD SUB AND BEQ XOR OR Cycle 1: Cycle 2: Cycle 3: lw $s3,0($s4) / add $t0,$t0,$s4 / sub $t1,$t1,$t2 / and $t3,$t2,$t3 / beq $t0,$t8,l1 / / or $t5,$t3,$t2 / / xor $s0,$t1,$s2 /
12 12 Out-of-Order Motivation Hide the impact of dynamic events such as a cache miss Out-of-Order (OoO) Execution Let independent instructions behind a stalled instruction execute Important aspect: Completion Ordering Out-of-Order completion: Let the independent instruction that has been executed update state (registers + memory) before the stalled instruction completes Problem: Exception handling In-Order completion: Let the independent instructions execute but buffer their results until the stalled instruction completes
13 13 Out-of-Order Execution Io = In-order Execution Execution here means producing the results not necessarily writing them to a register or memory Completion means committing/writing the results to register file or memory While we say out-of-order execution we really mean: IoI (IoD) => OoOE => IoC IoI / IoD = In-order Issue/Dispatch IoC = In-order completion
14 14 IoC or OoC? IoI (IoD) => OoOE => IoC In-order completion is necessary to support exceptions [exact state at time of exception] We will present the concept of OoOC (out-of-order completion) then roll back to IoC OoOC Issues Brancheswe should not commit an instruction that came after (in program order) a branch Solution: Stall dispatching instructions after a branch until we resolve the outcome LW,0() // cache miss BEQ,$0,L1 ADD,, // What if we execute this ADD out of order
15 15 Basic Blocks Basic Block (def.) = Sequence of instructions that will always be executed together No conditional branches out No branch targets coming in Also called straight-line code Average size: 5-7 instrucs. lw $s3,0($s4) and $t3,$t2,$t3 L1: add $t0,$t0,$s4 or $t5,$t3,$t2 sub $t1,$t1,$t2 beq $t0,$t8,l1 xor $s0,$t1,$s2 Instructions in a basic block can be overlapped if there are no data dependencies Control dependences really limit our window of possible instructions to overlap W/o extra hardware, we can only overlap execution of instructions within a basic block This is a basic block (starts w/ target, ends with branch)
16 16 Scheduling Strategies Static Scheduling Compiler re-orders instructions in such a way that no dependencies will be violated and allows for OoOE Dynamic Scheduling HW implementing the Tomasulo algorithm or other similar approach will re-order instructions to allow for OoOE More Advanced Concepts Branch prediction and speculative execution (execution beyond a branch flushing if incorrect) will be covered later
17 17 Static Scheduling Strengths Hardware simplicity [Better clock rate] Power/energy advantage Compiler has a global view of the program anyway, so it should be able to do a good job Very predictable: static performance predictions are reliable Weaknesses Requires re-compilation to take advantage of new/modified architecture Cannot foresee dynamic (data-dependent) events Cache miss, conditional branches (can only recedule instructions in a basic block) Cannot precompute memory addresses No good solution for precise exceptions with out-of-order completion
18 18 Where to Stall? In 5-stage pipeline (in-order execution) RAW dependency was solved by Forwarding (preferably) or Stalling Dependent instructions stalled in the ID stage if necessary ADD,, stall LW IM Reg ALU DM Reg
19 Prior ALU Result Data Mem. or ALU result Instruction Register Pipeline Stage Register Pipeline Stage Register D-Cache PC I-Cache Pipeline Stage Register Mem WB WB 19 Where to Stall? Simple 5-stage pipeline: Dependent instructions cannot be stalled in the EX stage Why? What if ADD was also dependent on the instruction in WB ADD has no place to buffer that forwarded value Thus we stall in ID so we can use the Register File to grab dependent values. Further stalling in ID incurs only 1 cycle penalty as would stalling in EX. PCWrite IRWrite 4 + rs rt. 5 5 IF.Flush HDU Control Read Reg. 1 # Read Reg. 2 # Write Reg. # Write Data Stall Read data 1 Read data 2 Register File Ex Mem WB ALUSelA FLUSH 0 1 Sh. Left ALU Zero Res. Branch 0 1 MemRead & MemWrite 0 1 MemToReg Reset Sign Extend rs rt rd ALUSelB Forwarding Unit ALUSrc 0 1 Regwrite & WriteReg# Regwrite, WriteReg#
20 20 Where to Stall? But to implement OoO execution, we cannot stall in the decode stage since that would prevent any further issuing of instructions Thus, now we will issue to queues for each of the multiple functional units and have the instruction stall in the queue until it is ready Queues + Functional Units ALU MUL IM Reg DM Reg DIV Stalling here would plug up the pipeline DM (Cache)
21 Prior ALU Result Data Mem. or ALU result Instruction Register Pipeline Stage Register Pipeline Stage Register D-Cache PC I-Cache Pipeline Stage Register Mem WB WB 21 Forwarding in OoO Execution In 5-stage pipeline junior instructions carried their source register IDs into the EX stage to be compared with destination register ID s of their seniors (earlier instructions) But in OoO execution, we may have many instructions (seniors) in front of us and cannot afford to perform so many comparisons 0 (as well as handling the 1 FLUSH PCWrite case where many seniors are producing new version of a register) 0 Stall Instead, the IF.Flush dispatch unit will explicitly tell the dependent instruction who to receive data Control from Branch 4 +. IRWrite rs rt 5 5 HDU Read Reg. 1 # Read Reg. 2 # Write Reg. # Write Data Read data 1 Read data 2 Register File Ex Mem WB ALUSelA 0 1 Sh. Left 2 ALU Zero Res. MemRead & MemWrite 0 1 MemToReg Reset Sign Extend rs rt rd ALUSelB Forwarding Unit ALUSrc 0 1 Regwrite & WriteReg# Regwrite, WriteReg#
22 22 Tomasulo s Plan OoO Execution Multiple functional units Integer ALU, Data memory, Multiplier, Divider Queues between ID and EX stages (in place of ID/EX register) Allows later instructions to keep issuing even if earlier ones are stalled
23 23 OoO Execution Problems For the time No branch prediction No speculative execution beyond a branch So we simply stall on a conditional branch For the time, no support for precise exceptions Even then
24 24 RAW, WAR, and WAW RAW = Read After Write lw, 40() add $9,, WAR = Write After Read add $9,, say is not available yet lw, 40() WAW = Write After Write add $9,, say is not available yet lw $9, 40() Why would anyone produce one result in $9 without utilizing that result? Why would he overwrite it with another result? How is this possible?
25 25 WAW can easily occur How is it possible? In OoO execution, instructions before the branch and after the branch can co-exist Consider multiple iterations of a loop body for(i=max; i!= 0; i--) A[i] = A[i] * 3; L1: lw, 40() mult,, sw, 40() addi,,-4 bne, $0,L1 Original Code L1: lw, 40() mult,, sw, 40() addi,,-4 bne, $0,L1 L1: lw, 40() mult,, sw, 40() addi,,-4 bne, $0,L1
26 26 Another WAW Example Say a company gives standard bonus to most of the employees and a higher bonus to managers The software may set a default value to the standard bonus and then overwrite for the special case int x = standard_bonus; if (manager) x = special_bonus; set_bonus(x);
27 Name Depdencies 27 RAW, WAR, and WAW Some terminology to remember RAW = Read After Write lw, 40() add $9,, WAR = Write After Read add $9,, lw, 40() WAW = Write After Write add $9,, lw $9, 40() RAW A true dependency WAR An anti-dependency WAW An anti-dependency Note: no information is communicated in WAR/WAW hazards
28 28 RAW, WAR, and WAW In-order execution: We need to deal with RAW only Out-of-order execution Now we need to deal with WAR and WAW hazards besides RAW Any of these hazards seem to prevent re-ordering instructions and executing them out-of-order
29 29 Register Renaming We have limited architectural registers Registers the instruction set is aware of We could have more physical registers Actual registers part of the register file Assume Delayed lw, 40() sw, 40() lw, 60() sw, 60() It is clear the compiler is using as a temporary register If there is a delay in obtaining the first part of the code cannot proceed Unfortunately, the second part of the code cannot proceed because of the name dependency for
30 30 Register Renaming If we had 64 registers instead of 32 registers, then perhaps the compiler might have used 8 instead of and we could have executed the second part of the code before the first part lw, 40() sw, 40() lw 8, 60() add 8, 8, 8 sw 8, 60() This is an example of a name-dependency
31 31 Register Renaming Four different temporary registers can be used here as shown:, 8, 8 and 8 We could even simply use aliases or code names: LION, TIGER, CAT, BEAR lw, 40() add 8,, sw 8, 40() lw LION, 40() add TIGER, LION, LION sw TIGER, 40() lw 8, 60() add 8, 8, 8 sw 8, 60() lw CAT, 60() add BEAR, CAT, CAT sw BEAR, 60()
32 32 Increasing Number of Registers Can a later implementation provide 64 registers (instead of 32) while maintaining binary compatibility with previously compiled code? Answer: Yes / No NO Why? Machine code has 5-bit fields for register ID s
33 33 Increasing Number of Registers Answer: Cannot change the number of architectural registers Instead we will perform Register Renaming through Tagging Registers This solves name dependency problems (WAR and WAW) while attending to true dependency (RAW) through waiting in queues Please be sure you understand this!
34 34 Tagging process sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST RF 1 1 Issue Logic RST = Register Status Table RF = Register File INT ALU T1: SQRT Val / 0 Val INT MUL/DIV/SQRT Load/ Store
35 35 Tagging process: CC1 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST T1 RF 1 1 Issue Logic RST = Register Status Table RF = Register File INT ALU T1: SQRT Val / 0 Val INT MUL/DIV/SQRT Load/ Store
36 36 Tagging process: CC2 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST T1 T2 RF 1 1 Issue Logic RST = Register Status Table RF = Register File INT ALU T1: SQRT Val / 0 Val T2: LW T1 / 40 INT MUL/DIV/SQRT Load/ Store
37 37 Tagging process: CC3 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST T1 T2 T3 RF 1 1 Issue Logic RST = Register Status Table RF = Register File T3: ADD T2 / T2 T1: SQRT Val / 0 Val T2: LW T1 / 40 INT ALU INT MUL/DIV/SQRT Load/ Store
38 38 Tagging process: CC4 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST T1 T3 RF 1 1 Issue Logic RST = Register Status Table RF = Register File T3: ADD T2 / T2 T1: SQRT Val / 0 Val INT ALU INT MUL/DIV/SQRT SW T3 / T1 / 40 T2: LW T1 / 40 Load/ Store
39 39 Tagging process: CC5 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST T1 T3 T4 RF 1 1 Issue Logic RST = Register Status Table RF = Register File T3: ADD T2 / T2 T1: SQRT Val / 0 Val INT ALU INT MUL/DIV/SQRT T4: LW val / 60 SW T3 / T1 / 40 T2: LW T1 / 40 Load/ Store
40 40 Tagging process: CC6 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST T1 T4 T5 RF 1 1 Issue Logic RST = Register Status Table RF = Register File T5: ADD T4 / T4 T3: ADD T2 / T2 T1: SQRT Val / 0 Val INT ALU INT MUL/DIV/SQRT T4: LW val / 60 SW T3 / T1 / 40 T2: LW T1 / 40 Load/ Store
41 41 Tagging process: CC7 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST T1 T5 RF 1 1 Issue Logic RST = Register Status Table RF = Register File T5: ADD T4 / T4 T3: ADD T2 / T2 T1: SQRT Val / 0 Val INT ALU INT MUL/DIV/SQRT T4: LW val / 60 SW T3 / T1 / 40 T2: LW T1 / 40 Load/ Store T4: Read 0x1111
42 42 Tagging process: CC8 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST T1 T5 => null RF 0x T5: Sum 0x2222 Issue Logic RST = Register Status Table RF = Register File T5: ADD 0x1111 / 0x1111 T3: ADD T2 / T2 T1: SQRT Val / 0 Val INT ALU INT MUL/DIV/SQRT SW T3 / T1 / 40 T2: LW T1 / 40 Load/ Store T5: Sum 0x2222
43 43 Tagging process: CC9 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST T1 null RF 0x Issue Logic RST = Register Status Table RF = Register File T3: ADD T2 / T2 T1: SQRT Val / 0 Val INT ALU INT MUL/DIV/SQRT SW 0x2222, val / 60 SW T3 / T1 / 40 T2: LW T1 / 40 Load/ Store T5: Sum 0x2222
44 44 Tagging process: CC10 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST T1 => null RF 0x T1: SQRT 0xacd0 Issue Logic RST = Register Status Table RF = Register File T3: ADD T2 / T2 T1: SQRT Val / 0 Val INT ALU INT MUL/DIV/SQRT SW T3 / T1 / 40 T2: LW T1 / 40 Load/ Store T1: SQRT 0xacd0
45 45 Tagging process: CC11 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST RF 0xacd0 0x Issue Logic RST = Register Status Table RF = Register File T3: ADD T2 / T2 INT ALU INT MUL/DIV/SQRT SW T3 / 0xacd0 / 40 T2: LW 0xacd0 / 40 Load/ Store T2: Read 0x5678
46 46 Tagging process: CC12 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST RF 0xacd0 0x Issue Logic RST = Register Status Table RF = Register File T3: ADD 0x5678 / 0x5678 INT ALU INT MUL/DIV/SQRT SW T3 / 0xacd0 / 40 Load/ Store T3: Sum 0xACF0
47 47 Tagging process: CC13 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST RF 0xacd0 0x Issue Logic RST = Register Status Table RF = Register File SW 0xacf0 / 0xacd0 / 40 INT ALU INT MUL/DIV/SQRT Load/ Store
48 48 Tomasulo s Algorithm Dispatch/Issue unit decodes and dispatches instructions For destination operand, an instruction carries a TAG (but not the actual register name) For source operands, an instruction carries either the values (if TAG is null in RST) or TAGs of the operands (but not the actual register name) When an instruction executes and produces a result it broadcasts the result and its destination TAG Any instruction waiting can compare its SRC tags with the destination tag and grab the value if they match If entry in RST matches the TAG then this instruction is the latest producer of the register and the value will be written to the RF
49 49 Register Renaming sqrt, 0 add,, add,, add,, add,, RST T1, T2, T3, T4 RF 1 1 Issue Logic RST = Register Status Table RF = Register File T4: ADD T3 / T3 T3: ADD T2 / T2 T2: ADD T1 / T1 T1: SQRT Val / 0 Val INT ALU INT MUL/DIV/SQRT Load/ Store
50 50 Unique TAGs Like SSN, we need a unique TAG SSN s are reused. Similarly TAGS can be reused TAGs are similar to number TOKEN Helps to create a virtual queue. We do not need that here In State Bank of India, the cashier issues brass token to customers trying to draw money as an ID (and not at all to put them in any virtual queue / ordering). Token numbers are in random order. The cashier verifies the signature in the record rooms, returns with money, calls the token number and issues the money. Tokens are reclaimed & reused.
51 51 Tags (= Tokens) How many tokens should the bank casheir have to start with? What happens if the tokens run out? Does the cashier need to have any order in holding tokens and issuing tokens? Do they have to collect the tokens back?
52 52 TAG FIFO To issue and collect tokens (TAGS) use a circular FIFO (First- In/First-Out) unit While the FIFO order is not important here, a FIFO is the easiest to implement in hardware compared to a random order in a pile Filled (with say) 64 tokens (in any order) initially on reset Tokens return in any order FIFO s are taught in EE 560 Put tokens back in the FIFO and reissue TAG FIFO TAG FIFO TAG FIFO wp rp wp 2 63 rp wp rp FULL 2 Tokens issued 1 Tokens returned
53 Int. Queue L/S Queue Div Queue Mult. Queue Reg. File 53 Organization for OoO Execution I-Cache Instruc. Queue TAG FIFO Block Diagram Adapted from Prof. Michel Dubois (Simplified for EE 457) Register Status Table Dispatch Integer / Branch D-Cache Div Mul Issue Unit CDB
54 54 Front-End & Back-End IFQ (Instruction Fetch Queue) A FIFO structure Dispatch (Issue) Unit Includes RST, RF, Tag FIFO Load/Store and other Issue Queues Issue Units Functional units CDB (Common Data Bus) Like a public address system that everyone can see/hear when data is produced
55 55 More Tomasulo Algorithm Front End Instructions are fetched They are stored in a FIFO (IFQ) When instruction reached the head of the IFQ it is Decoded Dispatched to an issue queue/functional unit Even if some of the inputs are not ready (takes TAGs) Back End Instructions in issue queues wait for their input operands Once register operands are ready instructions can be scheduled for execution provided they will not conflict for the CDB or their functional unit Instructions execute in their functional unit and their result is put on the CDB All instructions in queues and the register file watch the CDB and grab the value they are waiting for when it is produced
56 56 Bottleneck in Tomasulo s Approach The CDB!!! Do all instructions use the CDB? No, not SW, J (jump), BEQ
57 57 Load/Store Queue (LSQ) Performs Address calculation Memory disambiguation RAW, WAR, WAW hazards due to memory reads and writes // Is there a dependency here? SW,0() LW,0() // What about here? SW, 1000() LW, 0()
58 58 Address Calculation for LW/SW EE 557 approach for address calculation Loads & store in 2 sub-instructions 1 instruction computes address and is dispatched to integer ALU 1 instruction access data cache and is issued to LSQ Address is communicated from integer ALU to LSQ via CDB forwarding using a tag EE 560/457 approach Use a dedicated adder in the LSQ to compute address (so just 1 dispatched instruction)
59 59 Memory Disambiguation EE 557: Issue to a cache from LSQ Loads can issue to a cache when their address is ready Stores can issue to cache when both address & data is ready Memory hazards (RAW, WAR, WAW) are resolved in the LSQ Load can issue to cache if no store with same address is before it Store can issue to cache if no store or load with same address before it Otherwise access waits in LSQ If an address is unknown it is assumed to be the same Worst case to enforce correctness The process of figuring out and comparing memory address is called disambiguation
60 60 Memory Disambiguation RAW sw, 2000($0) lw, 2000($0) This later lw can proceed only if there is no store ahead of it with the same address WAW sw, 2000($0) sw, 2000($0) This later sw can proceed only if there is no store ahead of it with the same address WAR lw, 2000($0) sw, 2000($0) This later sw can proceed only if there is no load ahead of it with the same address
61 61 LSQ Ordering Maintaining instructions in the order of arrival Issue order/program order in a queue Is this necessary and/or desirable? In the case of LSQ? Necessary! To enforce memory disambiguation In the case of Integer, MUL, DIV queues? Desirable, so that an earlier instruction gets executed whenever possible, thereby reducing queue pressure from too many instructions waiting on it
62 62 Priority Priority (based on the order of arrival among ready instructions) Is it necessary or is it desirable? Local priority within queues? Global priority across the queues?
63 63 Issue Unit CDB availability constraint Pipelined functional unit vs. multicycle unit Conflict resolution Round-robin priority adequate?, well,
64 64 Conditional Branches Dispatch unit stops dispatching until the branch is resolved CDB broadcasts the result of the branch Dispatching continues at the target or fall-through instruction Successful branch shall cause flushing of IFQ Since we stop dispatching instructions after a branch, does it mean that this branch is the last instruction to be executed in the back-end? Is it possible that the back-end holds simultaneously A. Some instructions dispatched before the branch and B. Some instructions issued after the branch
65 65 Structural & Control Hazards Structural Stalls I-Fetch must stall if the IFQ is full Dispatch must stall if all entries in the desired functional unit s Issue Queue is occupied Instructions cannot be scheduled in case of conflicts for the CDB or functional unit Control Dispatcher stalls when it reaches a branch Branches are dispatched to integer queue They wait for their operands if needed Put their result on CDB If untaken, dispatch unit resumes If taken, then dispatch clears the IFQ and resumes at target Precise exceptions not supported
66 BACKUP 66
67 67 Tagging Registers: CC1 DOG sqrt, 0 lw, 40() Orange means dispatched and SQRT is a long-latency computation RST DOG RF sw, 40() 1 1 lw, 60() sw, 60() Destination Dependent source RST = Register Status Table RF = Register File
68 68 Tagging Registers: CC2 sqrt, 0 LION DOG lw, 40() DOG Orange means dispatched and SQRT is a long-latency computation RST DOG LION RF sw, 40() 1 1 lw, 60() sw, 60() Destination Dependent source RST = Register Status Table RF = Register File
69 69 Tagging Registers: CC3 sqrt, 0 LION lw, 40() TIGER DOG LION DOG LION Orange means dispatched and SQRT is a long-latency computation RST DOG TIGER RF sw, 40() 1 1 lw, 60() sw, 60() Destination Dependent source RST = Register Status Table RF = Register File
70 70 Tagging Registers: CC4 sqrt, 0 LION lw, 40() TIGER DOG LION DOG LION Orange means dispatched and SQRT is a long-latency computation RST DOG TIGER RF TIGER sw, 40() 1 1 lw, 60() sw, 60() Destination Dependent source RST = Register Status Table RF = Register File
71 71 Tagging Registers Review sqrt, 0 LION lw, 40() TIGER TIGER DOG LION sw, 40() lw, 60() sw, 60() DOG LION Dispatch unit decodes and dispatches instructions For destination operand, an instruction carreis a TAG (but not the actual register name) For source operands, an instruction carries either the values (if no TAG in RST) or TAGs of the operands (but not the actual register name) When 1 RST DOG TIGER 1 RF
72 Int. Queue L/S Queue Div Queue Mult. Queue Reg. File 72 Organization for OoO Execution I-Cache Instruc. Queue TAG FIFO Block Diagram Adapted from Prof. Michel Dubois (Simplified for EE 457) Register Status Table Dispatch Integer / Branch D-Cache Div Mul Issue Unit CDB
73 73 Multiple Functional Units We now provide multiple functional units After decode, issue to a queue, stalling if the unit is busy or waiting for data dependency to resolve Queues + Functional Units ALU MUL IM Reg DM Reg DIV DM (Cache)
74 74 Functional Unit Latencies Int. ALU, Addr. Calc. EX An added complication of out-of-order execution & completion: WAW & WAR hazards FP Add A1 A2 A3 A4 Int. & FP MUL M1 M2 M3 M4 M5 M6 M7 Int. & FP DIV Look Ahead: Tomasulo Algorithm will help absorb latency of different functional units and cache miss latency by allowing other ready instruction proceed out-of-order Functional Unit Latency (Required stalls cycles between dependent [RAW] instrucs.) Initiation Interval (Distance between 2 independent instructions requiring the same FU) Integer ALU 0 1 FP Add 3 1 FP Mul. 6 1 FP Div
75 Int. Queue L/S Queue Div Queue Mult. Queue Reg. File 75 OoO Execution w/ ROB ROB allows for OoO execution but in-order completion I-Cache D-Cache Instruc. Queue ROB (Reorder Buffer) Br. Pred. Buffer Dispatch Exceptions? No problem Integer / Branch Exec. Unit D-Cache L/S Buffer Addr. Buffer Div Mul Issue Unit CDB
Credits. EE 457 Unit 9a. Outline. 8-Stage Pipeline. Exploiting ILP Out-of-Order Execution
Credits EE 457 Unit 9a Exploiting ILP Out-of-Order Execution Some of the material in this presentation is taken from: Computer Architecture: A Quantitative Approach John Hennessy & David Patterson Some
More informationReferences EE457. Out of Order (OoO) Execution. Instruction Scheduling (Re-ordering of instructions)
EE457 Out of Order (OoO) Execution Introduction to Dynamic Scheduling of Instructions (The Tomasulo Algorithm) By Gandhi Puvvada References EE557 Textbook Prof Dubois EE557 Classnotes Prof Annavaram s
More informationHandout 2 ILP: Part B
Handout 2 ILP: Part B Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP
More informationCSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1
CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level
More informationComputer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining
Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining Single-Cycle Design Problems Assuming fixed-period clock every instruction datapath uses one
More informationPage 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls
CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: March 16, 2001 Prof. David A. Patterson Computer Science 252 Spring
More informationRecall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls
CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: March 16, 2001 Prof. David A. Patterson Computer Science 252 Spring
More informationLecture-13 (ROB and Multi-threading) CS422-Spring
Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue
More informationCISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1
CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer
More informationNOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline
CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism
More informationCISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions
CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis6627 Powerpoint Lecture Notes from John Hennessy
More information5008: Computer Architecture
5008: Computer Architecture Chapter 2 Instruction-Level Parallelism and Its Exploitation CA Lecture05 - ILP (cwliu@twins.ee.nctu.edu.tw) 05-1 Review from Last Lecture Instruction Level Parallelism Leverage
More informationPage 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution
CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 15: Instruction Level Parallelism and Dynamic Execution March 11, 2002 Prof. David E. Culler Computer Science 252 Spring 2002
More informationDYNAMIC AND SPECULATIVE INSTRUCTION SCHEDULING
DYNAMIC AND SPECULATIVE INSTRUCTION SCHEDULING Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 3, John L. Hennessy and David A. Patterson,
More informationE0-243: Computer Architecture
E0-243: Computer Architecture L1 ILP Processors RG:E0243:L1-ILP Processors 1 ILP Architectures Superscalar Architecture VLIW Architecture EPIC, Subword Parallelism, RG:E0243:L1-ILP Processors 2 Motivation
More informationCISC 662 Graduate Computer Architecture. Lecture 10 - ILP 3
CISC 662 Graduate Computer Architecture Lecture 10 - ILP 3 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer
More informationLoad1 no Load2 no Add1 Y Sub Reg[F2] Reg[F6] Add2 Y Add Reg[F2] Add1 Add3 no Mult1 Y Mul Reg[F2] Reg[F4] Mult2 Y Div Reg[F6] Mult1
Instruction Issue Execute Write result L.D F6, 34(R2) L.D F2, 45(R3) MUL.D F0, F2, F4 SUB.D F8, F2, F6 DIV.D F10, F0, F6 ADD.D F6, F8, F2 Name Busy Op Vj Vk Qj Qk A Load1 no Load2 no Add1 Y Sub Reg[F2]
More informationChapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)
Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1) ILP vs. Parallel Computers Dynamic Scheduling (Section 3.4, 3.5) Dynamic Branch Prediction (Section 3.3) Hardware Speculation and Precise
More informationInstruction Level Parallelism. Appendix C and Chapter 3, HP5e
Instruction Level Parallelism Appendix C and Chapter 3, HP5e Outline Pipelining, Hazards Branch prediction Static and Dynamic Scheduling Speculation Compiler techniques, VLIW Limits of ILP. Implementation
More informationInstruction-Level Parallelism and Its Exploitation
Chapter 2 Instruction-Level Parallelism and Its Exploitation 1 Overview Instruction level parallelism Dynamic Scheduling Techniques es Scoreboarding Tomasulo s s Algorithm Reducing Branch Cost with Dynamic
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation 1 Branch Prediction Basic 2-bit predictor: For each branch: Predict taken or not
More informationCOSC4201 Instruction Level Parallelism Dynamic Scheduling
COSC4201 Instruction Level Parallelism Dynamic Scheduling Prof. Mokhtar Aboelaze Parts of these slides are taken from Notes by Prof. David Patterson (UCB) Outline Data dependence and hazards Exposing parallelism
More informationELEC 5200/6200 Computer Architecture and Design Fall 2016 Lecture 9: Instruction Level Parallelism
ELEC 5200/6200 Computer Architecture and Design Fall 2016 Lecture 9: Instruction Level Parallelism Ujjwal Guin, Assistant Professor Department of Electrical and Computer Engineering Auburn University,
More informationProcessor: Superscalars Dynamic Scheduling
Processor: Superscalars Dynamic Scheduling Z. Jerry Shi Assistant Professor of Computer Science and Engineering University of Connecticut * Slides adapted from Blumrich&Gschwind/ELE475 03, Peh/ELE475 (Princeton),
More informationCOSC 6385 Computer Architecture - Pipelining
COSC 6385 Computer Architecture - Pipelining Fall 2006 Some of the slides are based on a lecture by David Culler, Instruction Set Architecture Relevant features for distinguishing ISA s Internal storage
More informationLECTURE 10. Pipelining: Advanced ILP
LECTURE 10 Pipelining: Advanced ILP EXCEPTIONS An exception, or interrupt, is an event other than regular transfers of control (branches, jumps, calls, returns) that changes the normal flow of instruction
More informationStatic vs. Dynamic Scheduling
Static vs. Dynamic Scheduling Dynamic Scheduling Fast Requires complex hardware More power consumption May result in a slower clock Static Scheduling Done in S/W (compiler) Maybe not as fast Simpler processor
More informationChapter 4 The Processor 1. Chapter 4D. The Processor
Chapter 4 The Processor 1 Chapter 4D The Processor Chapter 4 The Processor 2 Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline
More informationAdapted from David Patterson s slides on graduate computer architecture
Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Basic Compiler Techniques for Exposing ILP Advanced Branch Prediction Dynamic Scheduling Hardware-Based Speculation
More informationCOMPUTER ORGANIZATION AND DESIGN
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle
More informationCMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution. Prof. Yanjing Li University of Chicago
CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution Prof. Yanjing Li University of Chicago Administrative Stuff! Lab2 due tomorrow " 2 free late days! Lab3 is out " Start early!! My office
More informationCS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25
CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25 http://inst.eecs.berkeley.edu/~cs152/sp08 The problem
More informationChapter 3 (CONT II) Instructor: Josep Torrellas CS433. Copyright J. Torrellas 1999,2001,2002,2007,
Chapter 3 (CONT II) Instructor: Josep Torrellas CS433 Copyright J. Torrellas 1999,2001,2002,2007, 2013 1 Hardware-Based Speculation (Section 3.6) In multiple issue processors, stalls due to branches would
More informationEE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University
EE382A Lecture 7: Dynamic Scheduling Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 7-1 Announcements Project proposal due on Wed 10/14 2-3 pages submitted
More informationEE557--FALL 1999 MAKE-UP MIDTERM 1. Closed books, closed notes
NAME: STUDENT NUMBER: EE557--FALL 1999 MAKE-UP MIDTERM 1 Closed books, closed notes Q1: /1 Q2: /1 Q3: /1 Q4: /1 Q5: /15 Q6: /1 TOTAL: /65 Grade: /25 1 QUESTION 1(Performance evaluation) 1 points We are
More informationCPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation
Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenković, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Tomasulo
More informationCS433 Midterm. Prof Josep Torrellas. October 16, Time: 1 hour + 15 minutes
CS433 Midterm Prof Josep Torrellas October 16, 2014 Time: 1 hour + 15 minutes Name: Alias: Instructions: 1. This is a closed-book, closed-notes examination. 2. The Exam has 4 Questions. Please budget your
More informationEECC551 Exam Review 4 questions out of 6 questions
EECC551 Exam Review 4 questions out of 6 questions (Must answer first 2 questions and 2 from remaining 4) Instruction Dependencies and graphs In-order Floating Point/Multicycle Pipelining (quiz 2) Improving
More informationMulti-cycle Instructions in the Pipeline (Floating Point)
Lecture 6 Multi-cycle Instructions in the Pipeline (Floating Point) Introduction to instruction level parallelism Recap: Support of multi-cycle instructions in a pipeline (App A.5) Recap: Superpipelining
More informationChapter 4 The Processor 1. Chapter 4B. The Processor
Chapter 4 The Processor 1 Chapter 4B The Processor Chapter 4 The Processor 2 Control Hazards Branch determines flow of control Fetching next instruction depends on branch outcome Pipeline can t always
More informationMultiple Instruction Issue. Superscalars
Multiple Instruction Issue Multiple instructions issued each cycle better performance increase instruction throughput decrease in CPI (below 1) greater hardware complexity, potentially longer wire lengths
More informationOut of Order Processing
Out of Order Processing Manu Awasthi July 3 rd 2018 Computer Architecture Summer School 2018 Slide deck acknowledgements : Rajeev Balasubramonian (University of Utah), Computer Architecture: A Quantitative
More informationDepartment of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri
Department of Computer and IT Engineering University of Kurdistan Computer Architecture Pipelining By: Dr. Alireza Abdollahpouri Pipelined MIPS processor Any instruction set can be implemented in many
More informationThis Set. Scheduling and Dynamic Execution Definitions From various parts of Chapter 4. Description of Three Dynamic Scheduling Methods
10-1 Dynamic Scheduling 10-1 This Set Scheduling and Dynamic Execution Definitions From various parts of Chapter 4. Description of Three Dynamic Scheduling Methods Not yet complete. (Material below may
More informationPredict Not Taken. Revisiting Branch Hazard Solutions. Filling the delay slot (e.g., in the compiler) Delayed Branch
branch taken Revisiting Branch Hazard Solutions Stall Predict Not Taken Predict Taken Branch Delay Slot Branch I+1 I+2 I+3 Predict Not Taken branch not taken Branch I+1 IF (bubble) (bubble) (bubble) (bubble)
More informationAdvanced issues in pipelining
Advanced issues in pipelining 1 Outline Handling exceptions Supporting multi-cycle operations Pipeline evolution Examples of real pipelines 2 Handling exceptions 3 Exceptions In pipelined execution, one
More informationHardware-Based Speculation
Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register
More informationInstruction Level Parallelism. ILP, Loop level Parallelism Dependences, Hazards Speculation, Branch prediction
Instruction Level Parallelism ILP, Loop level Parallelism Dependences, Hazards Speculation, Branch prediction Basic Block A straight line code sequence with no branches in except to the entry and no branches
More informationHardware-based Speculation
Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions
More informationPage # CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Michela Taufer
CISC 662 Graduate Computer Architecture Lecture 8 - ILP 1 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer Architecture,
More informationCAD for VLSI 2 Pro ject - Superscalar Processor Implementation
CAD for VLSI 2 Pro ject - Superscalar Processor Implementation 1 Superscalar Processor Ob jective: The main objective is to implement a superscalar pipelined processor using Verilog HDL. This project may
More informationInstruction Level Parallelism
Instruction Level Parallelism Dynamic scheduling Scoreboard Technique Tomasulo Algorithm Speculation Reorder Buffer Superscalar Processors 1 Definition of ILP ILP=Potential overlap of execution among unrelated
More informationComplex Pipelining COE 501. Computer Architecture Prof. Muhamed Mudawar
Complex Pipelining COE 501 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Diversified Pipeline Detecting
More informationOutline Review: Basic Pipeline Scheduling and Loop Unrolling Multiple Issue: Superscalar, VLIW. CPE 631 Session 19 Exploiting ILP with SW Approaches
Session xploiting ILP with SW Approaches lectrical and Computer ngineering University of Alabama in Huntsville Outline Review: Basic Pipeline Scheduling and Loop Unrolling Multiple Issue: Superscalar,
More informationILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5)
Instruction-Level Parallelism and its Exploitation: PART 1 ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5) Project and Case
More informationCourse on Advanced Computer Architectures
Surname (Cognome) Name (Nome) POLIMI ID Number Signature (Firma) SOLUTION Politecnico di Milano, July 9, 2018 Course on Advanced Computer Architectures Prof. D. Sciuto, Prof. C. Silvano EX1 EX2 EX3 Q1
More informationWebsite for Students VTU NOTES QUESTION PAPERS NEWS RESULTS
Advanced Computer Architecture- 06CS81 Hardware Based Speculation Tomasulu algorithm and Reorder Buffer Tomasulu idea: 1. Have reservation stations where register renaming is possible 2. Results are directly
More informationCPE 631 Lecture 11: Instruction Level Parallelism and Its Dynamic Exploitation
Lecture 11: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenkovic, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Instruction
More informationFour Steps of Speculative Tomasulo cycle 0
HW support for More ILP Hardware Speculative Execution Speculation: allow an instruction to issue that is dependent on branch, without any consequences (including exceptions) if branch is predicted incorrectly
More informationEITF20: Computer Architecture Part3.2.1: Pipeline - 3
EITF20: Computer Architecture Part3.2.1: Pipeline - 3 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Dynamic scheduling - Tomasulo Superscalar, VLIW Speculation ILP limitations What we have done
More informationComputer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士
Computer Architecture 计算机体系结构 Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II Chao Li, PhD. 李超博士 SJTU-SE346, Spring 2018 Review Hazards (data/name/control) RAW, WAR, WAW hazards Different types
More informationPage 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer
CISC 662 Graduate Computer Architecture Lecture 8 - ILP 1 Michela Taufer Pipeline CPI http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson
More informationChapter 3: Instruction Level Parallelism (ILP) and its exploitation. Types of dependences
Chapter 3: Instruction Level Parallelism (ILP) and its exploitation Pipeline CPI = Ideal pipeline CPI + stalls due to hazards invisible to programmer (unlike process level parallelism) ILP: overlap execution
More informationChapter 4 The Processor 1. Chapter 4A. The Processor
Chapter 4 The Processor 1 Chapter 4A The Processor Chapter 4 The Processor 2 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware
More informationInstruction Pipelining Review
Instruction Pipelining Review Instruction pipelining is CPU implementation technique where multiple operations on a number of instructions are overlapped. An instruction execution pipeline involves a number
More informationThomas Polzer Institut für Technische Informatik
Thomas Polzer tpolzer@ecs.tuwien.ac.at Institut für Technische Informatik Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =
More informationThis Set. Scheduling and Dynamic Execution Definitions From various parts of Chapter 4. Description of Two Dynamic Scheduling Methods
10 1 Dynamic Scheduling 10 1 This Set Scheduling and Dynamic Execution Definitions From various parts of Chapter 4. Description of Two Dynamic Scheduling Methods Not yet complete. (Material below may repeat
More informationCS 152 Computer Architecture and Engineering. Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming
CS 152 Computer Architecture and Engineering Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming John Wawrzynek Electrical Engineering and Computer Sciences University of California at
More informationAdvanced Computer Architecture
Advanced Computer Architecture Chapter 1 Introduction into the Sequential and Pipeline Instruction Execution Martin Milata What is a Processors Architecture Instruction Set Architecture (ISA) Describes
More informationAdvanced Computer Architecture
Advanced Computer Architecture 1 L E C T U R E 4: D A T A S T R E A M S I N S T R U C T I O N E X E C U T I O N I N S T R U C T I O N C O M P L E T I O N & R E T I R E M E N T D A T A F L O W & R E G I
More informationThe Processor Pipeline. Chapter 4, Patterson and Hennessy, 4ed. Section 5.3, 5.4: J P Hayes.
The Processor Pipeline Chapter 4, Patterson and Hennessy, 4ed. Section 5.3, 5.4: J P Hayes. Pipeline A Basic MIPS Implementation Memory-reference instructions Load Word (lw) and Store Word (sw) ALU instructions
More informationMIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14
MIPS Pipelining Computer Organization Architectures for Embedded Computing Wednesday 8 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy 4th Edition, 2011, MK
More informationDYNAMIC INSTRUCTION SCHEDULING WITH SCOREBOARD
DYNAMIC INSTRUCTION SCHEDULING WITH SCOREBOARD Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 3, John L. Hennessy and David A. Patterson,
More informationSuper Scalar. Kalyan Basu March 21,
Super Scalar Kalyan Basu basu@cse.uta.edu March 21, 2007 1 Super scalar Pipelines A pipeline that can complete more than 1 instruction per cycle is called a super scalar pipeline. We know how to build
More informationEN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)
EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering
More informationAdvanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017
Advanced Parallel Architecture Lessons 5 and 6 Annalisa Massini - Pipelining Hennessy, Patterson Computer architecture A quantitive approach Appendix C Sections C.1, C.2 Pipelining Pipelining is an implementation
More informationScoreboard information (3 tables) Four stages of scoreboard control
Scoreboard information (3 tables) Instruction : issued, read operands and started execution (dispatched), completed execution or wrote result, Functional unit (assuming non-pipelined units) busy/not busy
More information1 Hazards COMP2611 Fall 2015 Pipelined Processor
1 Hazards Dependences in Programs 2 Data dependence Example: lw $1, 200($2) add $3, $4, $1 add can t do ID (i.e., read register $1) until lw updates $1 Control dependence Example: bne $1, $2, target add
More informationOutline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??
Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross
More informationThe basic structure of a MIPS floating-point unit
Tomasulo s scheme The algorithm based on the idea of reservation station The reservation station fetches and buffers an operand as soon as it is available, eliminating the need to get the operand from
More informationComputer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation Introduction Pipelining become universal technique in 1985 Overlaps execution of
More informationCOMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle
More informationEEC 581 Computer Architecture. Lec 7 Instruction Level Parallelism (2.6 Hardware-based Speculation and 2.7 Static Scheduling/VLIW)
EEC 581 Computer Architecture Lec 7 Instruction Level Parallelism (2.6 Hardware-based Speculation and 2.7 Static Scheduling/VLIW) Chansu Yu Electrical and Computer Engineering Cleveland State University
More informationExploitation of instruction level parallelism
Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering
More informationLecture 9. Pipeline Hazards. Christos Kozyrakis Stanford University
Lecture 9 Pipeline Hazards Christos Kozyrakis Stanford University http://eeclass.stanford.edu/ee18b 1 Announcements PA-1 is due today Electronic submission Lab2 is due on Tuesday 2/13 th Quiz1 grades will
More informationChapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor The Processor - Introduction
More informationHardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.
Instruction-Level Parallelism and its Exploitation: PART 2 Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.8)
More information250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019
250P: Computer Systems Architecture Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr
More informationEE 457 Midterm Summer 14 Redekopp Name: Closed Book / 105 minutes No CALCULATORS Score: / 100
EE 47 Midterm Summer 4 Redekopp Name: Closed Book / minutes No CALCULATORS Score: /. (7 pts.) Short Answer [Fill in the blanks or select the correct answer] a. If a control signal must be valid during
More informationLecture 19: Instruction Level Parallelism
Lecture 19: Instruction Level Parallelism Administrative: Homework #5 due Homework #6 handed out today Last Time: DRAM organization and implementation Today Static and Dynamic ILP Instruction windows Register
More informationEEC 581 Computer Architecture. Instruction Level Parallelism (3.6 Hardware-based Speculation and 3.7 Static Scheduling/VLIW)
1 EEC 581 Computer Architecture Instruction Level Parallelism (3.6 Hardware-based Speculation and 3.7 Static Scheduling/VLIW) Chansu Yu Electrical and Computer Engineering Cleveland State University Overview
More informationCPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation
Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenkovic, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Instruction
More informationComputer Architecture Spring 2016
Computer Architecture Spring 2016 Lecture 02: Introduction II Shuai Wang Department of Computer Science and Technology Nanjing University Pipeline Hazards Major hurdle to pipelining: hazards prevent the
More informationENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design
ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University
More informationSpring 2010 Prof. Hyesoon Kim. Thanks to Prof. Loh & Prof. Prvulovic
Spring 2010 Prof. Hyesoon Kim Thanks to Prof. Loh & Prof. Prvulovic C/C++ program Compiler Assembly Code (binary) Processor 0010101010101011110 Memory MAR MDR INPUT Processing Unit OUTPUT ALU TEMP PC Control
More informationCS252 Graduate Computer Architecture Lecture 6. Recall: Software Pipelining Example
CS252 Graduate Computer Architecture Lecture 6 Tomasulo, Implicit Register Renaming, Loop-Level Parallelism Extraction Explicit Register Renaming John Kubiatowicz Electrical Engineering and Computer Sciences
More informationAdvanced Computer Architecture CMSC 611 Homework 3. Due in class Oct 17 th, 2012
Advanced Computer Architecture CMSC 611 Homework 3 Due in class Oct 17 th, 2012 (Show your work to receive partial credit) 1) For the following code snippet list the data dependencies and rewrite the code
More informationCOSC121: Computer Systems. ISA and Performance
COSC121: Computer Systems. ISA and Performance Jeremy Bolton, PhD Assistant Teaching Professor Constructed using materials: - Patt and Patel Introduction to Computing Systems (2nd) - Patterson and Hennessy
More informationLecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )
Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14) 1 ILP Limits The perfect processor: Infinite registers (no WAW or WAR hazards) Perfect branch direction and target
More informationLecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1
Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1 Introduction Chapter 4.1 Chapter 4.2 Review: MIPS (RISC) Design Principles Simplicity favors regularity fixed size instructions small number
More information