
Snapshot (37): all six instructions have issued and both loads have written their results.

  Instruction          Issue  Execute  Write result
  L.D   F6, 34(R2)
  L.D   F2, 45(R3)
  MUL.D F0, F2, F4
  SUB.D F8, F2, F6
  DIV.D F10, F0, F6
  ADD.D F6, F8, F2

  Name   Busy  Op   Vj       Vk       Qj     Qk  A
  Load1  no
  Load2  no
  Add1   Y     Sub  Reg[F2]  Reg[F6]
  Add2   Y     Add           Reg[F2]  Add1
  Add3   no
  Mult1  Y     Mul  Reg[F2]  Reg[F4]
  Mult2  Y     Div           Reg[F6]  Mult1
  Int    no

  Register status (Qi): F0 -> Mult1, F6 -> Add2, F8 -> Add1, F10 -> Mult2

Snapshot (38): SUB.D has written its result and freed Add1, so Add2 now holds both operand values.

  Name   Busy  Op   Vj       Vk       Qj     Qk  A
  Load1  no
  Load2  no
  Add1   no
  Add2   Y     Add  Reg[F8]  Reg[F2]
  Add3   no
  Mult1  Y     Mul  Reg[F2]  Reg[F4]
  Mult2  Y     Div           Reg[F6]  Mult1
  Int    no

  Register status (Qi): F0 -> Mult1, F6 -> Add2, F10 -> Mult2
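The issue-time bookkeeping behind these snapshots can be sketched in a few lines of Python. This is an illustrative sketch, not the lecture's code: the helper name `issue` and the dictionary layout are invented. Each source operand is either captured as a ready value (Vj/Vk) or tagged with the producing reservation station (Qj/Qk), according to the register status table Qi.

```python
# Sketch of Tomasulo issue: fill Vj/Vk with ready values, or Qj/Qk with
# the tag of the reservation station that will produce the operand.

def issue(op, dest, src1, src2, qi, stations, name):
    """Issue one instruction to reservation station `name`."""
    rs = {"Busy": "Y", "Op": op, "Vj": None, "Vk": None, "Qj": None, "Qk": None}
    for v, q, src in (("Vj", "Qj", src1), ("Vk", "Qk", src2)):
        if src in qi:                 # operand still being produced
            rs[q] = qi[src]           # record the producer's tag
        else:                         # operand available in register file
            rs[v] = f"Reg[{src}]"
    stations[name] = rs
    qi[dest] = name                   # dest will be written by this station
    return rs

qi, stations = {}, {}
issue("Mul", "F0", "F2", "F4", qi, stations, "Mult1")
issue("Sub", "F8", "F2", "F6", qi, stations, "Add1")
issue("Div", "F10", "F0", "F6", qi, stations, "Mult2")
issue("Add", "F6", "F8", "F2", qi, stations, "Add2")

# Matches snapshot (37): DIV.D waits on Mult1 for F0, ADD.D waits on Add1 for F8.
assert stations["Mult2"]["Qj"] == "Mult1" and stations["Mult2"]["Vk"] == "Reg[F6]"
assert stations["Add2"]["Qj"] == "Add1" and stations["Add2"]["Vk"] == "Reg[F2]"
assert qi == {"F0": "Mult1", "F8": "Add1", "F10": "Mult2", "F6": "Add2"}
```

Note how the WAR and WAW hazards on F6 disappear: ADD.D writes its result to Add2's tag, not directly to F6, so earlier readers of F6 are unaffected.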

Snapshot (39): the adds have completed; only the multiplier stations remain busy.

  Name   Busy  Op   Vj       Vk       Qj     Qk  A
  Load1  no
  Load2  no
  Add1   no
  Add2   no
  Add3   no
  Mult1  Y     Mul  Reg[F2]  Reg[F4]
  Mult2  Y     Div           Reg[F6]  Mult1
  Int    no

  Register status (Qi): F0 -> Mult1, F10 -> Mult2

Figure 3.6 (40): the basic structure of the MIPS floating-point unit using Tomasulo's algorithm. IF and Issue feed the reservation stations and load/store buffers in front of the functional units (int unit, FP add, FP mult., Div., lw/sw unit with Mem); results are written back to the register file over the CDB.

Another example (41):

  Instruction        Issue  Execute  Write result
  L.D   F0, 0(R1)
  MUL.D F4, F0, F2
  S.D   F4, 0(R1)
  L.D   F0, 8(R1)
  MUL.D F4, F0, F2
  S.D   F4, 8(R1)

  Name    Busy  Op   Vj  Vk       Qj     Qk     A
  Load1   y     ld                              Reg[R1]+0
  Load2   y     ld                              Reg[R1]+8
  store1  y     sd                       Mult1  Reg[R1]+0
  store2  y     sd                       Mult2  Reg[R1]+8
  Add     no
  Mult1   Y     Mul      Reg[F2]  Load1
  Mult2   Y     Mul      Reg[F2]  Load2

  Register status (Qi): F0 -> Load2, F4 -> Mult2

Hardware-based Speculation (3.6) (42)

The goal is to allow instructions after a branch to start execution before the outcome of the branch is confirmed, with no consequences (including exceptions) if it is determined that the instruction should not have executed.
- Use dynamic branch prediction and out-of-order execution.
- Use a reorder buffer (ROB) to reorder instructions after execution.
- Commit results to registers and memory in order.
- All uncommitted results can be flushed if a branch is mispredicted.
- Service interrupts only when an instruction is ready to commit.
- The reservation station can be freed as soon as the instruction enters the reorder buffer.
- For each register R, the status table keeps the number of the ROB entry reserved by the instruction that will write into R (instead of the reservation station number).
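The in-order commit discipline described above can be sketched with a toy ROB. The structure and helper name `commit_ready` are invented for illustration; the point is that architectural state is updated only at the head of the queue, so everything younger than a mispredicted branch can still be discarded.

```python
# Sketch of in-order commit from a reorder buffer: results reach the
# architectural registers only when the instruction is at the ROB head.

from collections import deque

def commit_ready(rob, regs):
    """Commit completed instructions from the ROB head, in order."""
    committed = []
    while rob and rob[0]["done"]:
        entry = rob.popleft()
        regs[entry["dest"]] = entry["value"]   # architectural update at commit
        committed.append(entry["instr"])
    return committed

rob = deque([
    {"instr": "L.D F6",   "dest": "F6", "value": 1.0,  "done": True},
    {"instr": "L.D F2",   "dest": "F2", "value": 2.0,  "done": True},
    {"instr": "MUL.D F0", "dest": "F0", "value": None, "done": False},
    {"instr": "SUB.D F8", "dest": "F8", "value": -1.0, "done": True},
])

regs = {}
done = commit_ready(rob, regs)
# Only the two loads commit: SUB.D is done, but MUL.D ahead of it is not.
assert done == ["L.D F6", "L.D F2"]
assert "F8" not in regs      # SUB.D's result is still speculative
```

This is exactly why interrupts can be serviced precisely: nothing past the head has touched a register or memory yet.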

Reorder buffers (ROB) (43)

- Each ROB entry has 3 fields: instruction, destination, value.
- When issuing a new instruction, read a register value from the ROB if the status table indicates that the source instruction is still in the ROB. Hence, ROBs supply operands between execution complete and commit => more virtual registers.
- The ROB forms a circular queue ordered in issue order. Once an instruction reaches the head of the queue, it commits its result into the register file or memory. Hence, it is easy to undo speculated instructions on mispredicted branches or on exceptions.
- The pipe should be flushed as soon as a misprediction is discovered; all earlier instructions should still commit (problems??).

[Block diagram: IF and Issue feed the reservation stations and load/store buffers; the functional units (Div., FP add, FP mult., int unit, lw/sw unit with Mem) write back over the CDB into the reorder buffer, which commits to the register file.]

Steps of the Speculative Tomasulo Algorithm (44): combining branch prediction with dynamic scheduling.

1. Issue (sometimes called dispatch): if a reservation station and a ROB entry are free, issue the instruction to the reservation station after reading ready registers and renaming non-ready registers.
2. Execution (sometimes called issue): when both operands are ready, execute; if not ready, watch the CDB for the result; when both are in the reservation station, execute (this checks RAW).
3. Write result (WB): write on the CDB to all awaiting reservation stations and send the result to the ROB; mark the reservation station available.
4. Commit (sometimes called graduation): when the instruction is at the head of the ROB, update the register (or memory) with the result and free the ROB entry. A mispredicted branch flushes all non-committed instructions.
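The recovery action in the Commit step, where a mispredicted branch flushes all non-committed instructions, can be sketched as follows. The list-of-dicts ROB and the helper name `flush_on_mispredict` are hypothetical:

```python
# Sketch of branch-misprediction recovery: when a mispredicted branch
# reaches the ROB head, the branch itself retires and every younger
# (wrong-path) instruction is squashed before it can commit.

def flush_on_mispredict(rob):
    """If the head is a mispredicted branch, squash everything behind it."""
    if rob and rob[0].get("mispredicted"):
        squashed = [e["instr"] for e in rob[1:]]
        rob.clear()      # no wrong-path result ever reached a register
        return squashed
    return []

rob = [
    {"instr": "BNEZ R1", "mispredicted": True},
    {"instr": "L.D F0"},     # wrong-path instructions
    {"instr": "MUL.D F4"},
]
squashed = flush_on_mispredict(rob)
```

Because commit is in order, this flush is a single bulk discard; no register or memory state has to be rolled back.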

Example (45): assume that MUL.D just finished WB and its result is entering the ROB, SUB.D and ADD.D are already in the ROB, and DIV.D is in a reservation station waiting for the result of MUL.D.

  Instruction          Issued  Execute  In ROB  Committed
  L.D   F6, 34(R2)                              x
  L.D   F2, 45(R3)                              x
  MUL.D F0, F2, F4
  SUB.D F8, F2, F6
  DIV.D F10, F0, F6
  ADD.D F6, F8, F2

Reservation station status (46):

  Name   Busy  Op   Vj       Vk       Qj  Qk    Dest.  A
  Load1  no
  Load2  no
  Add1   Y     sub  Reg[F2]  Reg[F6]             ROB4
  Add2   Y     add  Reg[F8]  Reg[F2]             ROB6
  Add3   no
  Mult1  Y     Mul  Reg[F2]  Reg[F4]             ROB3
  Mult2  Y     Div  Reg[F0]                ROB6  ROB5

ROB status:

  Name  Busy  Instruction        State             Dest.  Value
  ROB1  no    L.D F6, 34(R2)     Commit            F6     xxx
  ROB2  no    L.D F2, 45(R3)     Commit            F2     xxx
  ROB3  yes   MUL.D F0, F2, F4   Wrote result      F0     xxx
  ROB4  yes   SUB.D F8, F2, F6   Wrote result      F8     xxx
  ROB5  yes   DIV.D F10, F0, F6  Issued/executing  F10
  ROB6  yes   ADD.D F6, F8, F2   Wrote result      F6     xxx

  Register status (Qi): F0 -> ROB3, F6 -> ROB6, F8 -> ROB4, F10 -> ROB5

Note: reservation stations Add1 and Add2 may be released after their instructions finish the write back.

Renaming Registers (47)

A common variation of the speculative design:
- The reorder buffer keeps instruction information but not the result.
- The register file is extended with extra renaming registers to hold speculative results.
- A renaming register is allocated at issue; the result is written into the renaming register when execution completes; the renaming register is copied to the real register on commit.
- Operands are read either from the register file (real or speculative) or via the Common Data Bus.
- Advantage: operands always come from a single source (the extended register file).

  Name  Busy  Instruction        State             Dest.  R. Reg.
  ROB1  no    L.D F6, 34(R2)     Commit            F6     1
  ROB2  no    L.D F2, 45(R3)     Commit            F2     13
  ROB3  yes   MUL.D F0, F2, F4   Write result      F0     15
  ROB4  yes   SUB.D F8, F2, F6   Write result      F8     2
  ROB5  yes   DIV.D F10, F0, F6  Issued/executing  F10
  ROB6  yes   ADD.D F6, F8, F2   Write result      F6     14

  Renaming registers: RR0 RR1 RR2 RR3 ... RR13 RR14 RR15

Multiple Issue Processors (3.7) (48)

Issue more than one instruction per clock cycle:
- VLIW processors (static scheduling by the compiler).
- Superscalar processors:
  - statically scheduled (in-order execution);
  - dynamically scheduled (out-of-order execution).
This results in CPI < 1 (instructions per clock, IPC > 1).

The fetch unit gets an issue packet, which contains multiple instructions, and the issue unit issues 1-8 instructions every cycle (usually, the issue unit is itself pipelined). This requires independent instructions, multiple available resources, and accurate branch prediction, and it leads to multiple instruction completions per cycle.
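The renaming-register variation can be sketched with two invented helpers, `writeback` and `commit`: the speculative result lives in a renaming register from write-back until commit, when it is copied into the architectural register file.

```python
# Sketch of the renaming-register scheme: speculative results are held in
# an extended register file and copied to the real register only at commit.

def writeback(rename_map, rename_regs, dest, rr, value):
    """Execution completed: result lands in renaming register rr."""
    rename_map[dest] = rr
    rename_regs[rr] = value

def commit(rename_map, rename_regs, arch_regs, dest):
    """Commit: copy the renaming register into the real register."""
    rr = rename_map.pop(dest)
    arch_regs[dest] = rename_regs[rr]

rename_map, rename_regs = {}, {}
arch_regs = {"F6": 0.0}                                  # committed state
writeback(rename_map, rename_regs, "F6", "RR14", 3.5)    # e.g. ADD.D's result
assert arch_regs["F6"] == 0.0    # architectural state unchanged until commit
commit(rename_map, rename_regs, arch_regs, "F6")
assert arch_regs["F6"] == 3.5
```

On a misprediction, recovery is just discarding the rename map entries for uncommitted instructions; the architectural file was never touched.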

Five primary approaches (49)

A statically scheduled MIPS superscalar (50)

Assume two execution pipes:
- one for load, store, branch, and integer operations;
- one for floating-point operations.
IF fetches 2 instructions and ID processes 2 instructions, which increases the load on the register file.

  Pipeline: IF  ID  E/int | E/fp  Mem  WB

  Type              Pipe stages
  Int. instruction  IF  ID  E   MEM WB
  FP instruction    IF  ID  E   MEM WB
  Int. instruction      IF  ID  E   MEM WB
  FP instruction        IF  ID  E   MEM WB
  Int. instruction          IF  ID  E   MEM WB
  FP instruction            IF  ID  E   MEM WB
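The pairing rule for the two pipes can be sketched as a simple predicate. The classification sets below are assumed from the slide's description (integer, load, store, and branch ops in one pipe, FP ops in the other) and are not an exhaustive opcode list:

```python
# Toy sketch of the dual-issue rule on this statically scheduled
# superscalar: a pair issues together only if the first instruction goes
# to the int/load/store/branch pipe and the second to the FP pipe.

INT_OPS = {"L.D", "S.D", "DADDI", "BNEZ"}        # int/load/store/branch pipe
FP_OPS = {"ADD.D", "SUB.D", "MUL.D", "DIV.D"}    # FP pipe

def can_dual_issue(op1, op2):
    return op1 in INT_OPS and op2 in FP_OPS

assert can_dual_issue("L.D", "ADD.D")
assert not can_dual_issue("ADD.D", "MUL.D")   # two FP ops: second slot stalls
```

This rigid slotting is what makes the hardware simple, and also what limits CPI = 0.5 to code with exactly a 50/50 int/FP mix.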

Loop Unrolling in Superscalar (51)

Assume: L.D to ADD.D: 1 cycle; ADD.D to S.D: 2 cycles.

        Integer instruction  FP instruction      Clock cycle
  Loop: L.D  F0, 0(R1)                           1
        L.D  F6, -8(R1)                          2
        L.D  F10, -16(R1)    ADD.D F4, F0, F2    3
        L.D  F14, -24(R1)    ADD.D F8, F6, F2    4
        L.D  F18, -32(R1)    ADD.D F12, F10, F2  5
        S.D  F4, 0(R1)       ADD.D F16, F14, F2  6
        S.D  F8, -8(R1)      ADD.D F20, F18, F2  7
        S.D  F12, -16(R1)                        8
        S.D  F16, -24(R1)                        9
        S.D  F20, -32(R1)                        10
        DADDI R1, R1, -40                        11
        BNEZ R1, Loop                            12

Unrolled 5 times to avoid delays: 12 clocks, or 2.4 clocks per iteration.

MIPS superscalar (52)

  Pipeline: IF  ID  E/int | E/fp  Mem  WB

While the integer/FP split is simple for the hardware, we can get a CPI of 0.5 only for programs with exactly 50% FP operations and no hazards. If two instructions issue every cycle, decode and issue become harder: even with only 2 units, the hardware must examine 2 opcodes and 6 register specifiers and decide whether 1 or 2 instructions can issue. The 1-cycle load delay expands to 3 instructions in the superscalar: the instruction in the right half of the pair can't use the loaded value, nor can the instructions in the next issue slot. Control hazards are twice as expensive.

VLIW architectures (53)

- The compiler packs multiple independent operations into one single instruction, e.g., 1 integer operation, 2 FP ops, 1 memory reference, and 1 branch: instruction = 5 * 24 = 120 bits.
- Lock-step issue of instructions (one per cycle).
- In a pure VLIW approach, the compiler resolves all hazards (which are resolved by hardware in MIPS superscalars).

[Block diagram: int, FP, FP, load/store, and BR units share one register file; the load/store unit connects to/from memory, the BR unit to the PC.]

Loop Unrolling in VLIW (54)

  Memory            Memory            FP                  FP                  Int. op/           Clock
  reference 1       reference 2       operation 1         operation 2         branch
  L.D F0, 0(R1)     L.D F6, -8(R1)                                                               1
  L.D F10, -16(R1)  L.D F14, -24(R1)                                                             2
  L.D F18, -32(R1)  L.D F22, -40(R1)  ADD.D F4, F0, F2    ADD.D F8, F6, F2                       3
  L.D F26, -48(R1)                    ADD.D F12, F10, F2  ADD.D F16, F14, F2                     4
                                      ADD.D F20, F18, F2  ADD.D F24, F22, F2                     5
  S.D F4, 0(R1)     S.D F8, -8(R1)    ADD.D F28, F26, F2                                         6
  S.D F12, -16(R1)  S.D F16, -24(R1)                                                             7
  S.D F20, -32(R1)  S.D F24, -40(R1)                                          DADDI R1, R1, -48  8
  S.D F28, -0(R1)                                                             BNEZ R1, Loop      9

Unrolled 7 times: 9 clocks, or 1.3 clocks per iteration. Average: 2.5 ops per clock, 50% efficiency. Note: VLIW needs more registers (15 vs. 11 in the superscalar).
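The slide's VLIW numbers can be checked directly: 7 unrolled iterations contribute 7 loads, 7 adds, and 7 stores, plus the DADDI and BNEZ loop overhead, all packed into 9 five-slot instruction words.

```python
# Reproducing the slide's arithmetic for the VLIW schedule above.

ops = 7 + 7 + 7 + 2          # loads + adds + stores + loop overhead
cycles, slots = 9, 5         # 9 instruction words, 5 issue slots each
iterations = 7

assert round(cycles / iterations, 1) == 1.3      # clocks per iteration
assert round(ops / cycles, 1) == 2.6             # the slide rounds to ~2.5
assert round(ops / (cycles * slots), 2) == 0.51  # ~50% slot efficiency
```

Roughly half the slots in every instruction word are empty, which is the usual cost of lock-step VLIW issue on code with limited parallelism.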