Scoreboard information (3 tables) Four stages of scoreboard control

Similar documents
Load1 no Load2 no Add1 Y Sub Reg[F2] Reg[F6] Add2 Y Add Reg[F2] Add1 Add3 no Mult1 Y Mul Reg[F2] Reg[F4] Mult2 Y Div Reg[F6] Mult1

COSC4201 Instruction Level Parallelism Dynamic Scheduling

Processor: Superscalars Dynamic Scheduling

Reduction of Data Hazards Stalls with Dynamic Scheduling So far we have dealt with data hazards in instruction pipelines by:

Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.

Instruction-Level Parallelism and Its Exploitation

Tomasulo s Algorithm

Chapter 3: Instruction Level Parallelism (ILP) and its exploitation. Types of dependences

Dynamic Scheduling. Better than static scheduling Scoreboarding: Tomasulo algorithm:

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation

CISC 662 Graduate Computer Architecture. Lecture 10 - ILP 3

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)

Pipelining: Issue instructions in every cycle (CPI 1) Compiler scheduling (static scheduling) reduces impact of dependences

Detailed Scoreboard Pipeline Control. Three Parts of the Scoreboard. Scoreboard Example Cycle 1. Scoreboard Example. Scoreboard Example Cycle 3

What is ILP? Instruction Level Parallelism. Where do we find ILP? How do we expose ILP?

The basic structure of a MIPS floating-point unit

ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5)

Metodologie di Progettazione Hardware-Software

Static vs. Dynamic Scheduling

COSC 6385 Computer Architecture - Pipelining (II)

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation

吳俊興高雄大學資訊工程學系. October Example to eleminate WAR and WAW by register renaming. Tomasulo Algorithm. A Dynamic Algorithm: Tomasulo s Algorithm

Hardware-based Speculation

Review: Compiler techniques for parallelism Loop unrolling Ÿ Multiple iterations of loop in software:

Website for Students VTU NOTES QUESTION PAPERS NEWS RESULTS

Good luck and have fun!

CPE 631 Lecture 11: Instruction Level Parallelism and Its Dynamic Exploitation

Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Instruction Level Parallelism

Advantages of Dynamic Scheduling

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

Instruction Level Parallelism (ILP)

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

CS252 Graduate Computer Architecture Lecture 6. Recall: Software Pipelining Example

Graduate Computer Architecture. Chapter 3. Instruction Level Parallelism and Its Dynamic Exploitation

CS252 Graduate Computer Architecture Lecture 8. Review: Scoreboard (CDC 6600) Explicit Renaming Precise Interrupts February 13 th, 2010

DYNAMIC INSTRUCTION SCHEDULING WITH SCOREBOARD

Page 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution

Superscalar Architectures: Part 2

Adapted from David Patterson s slides on graduate computer architecture

CPE 631 Lecture 09: Instruction Level Parallelism and Its Dynamic Exploitation

COSC 6385 Computer Architecture - Instruction Level Parallelism (II)

ELEC 5200/6200 Computer Architecture and Design Fall 2016 Lecture 9: Instruction Level Parallelism

EITF20: Computer Architecture Part3.2.1: Pipeline - 3

Chapter 4 The Processor 1. Chapter 4D. The Processor

Latencies of FP operations used in chapter 4.

EECC551 Exam Review 4 questions out of 6 questions

Chapter 3 (CONT II) Instructor: Josep Torrellas CS433. Copyright J. Torrellas 1999,2001,2002,2007,

Multicycle ALU Operations 2/28/2011. Diversified Pipelines The Path Toward Superscalar Processors. Limitations of Our Simple 5 stage Pipeline

DYNAMIC SPECULATIVE EXECUTION

Instruction Frequency CPI. Load-store 55% 5. Arithmetic 30% 4. Branch 15% 4

5008: Computer Architecture

Computer Architecture Homework Set # 3 COVER SHEET Please turn in with your own solution

CS 152 Computer Architecture and Engineering. Lecture 13 - Out-of-Order Issue and Register Renaming

Lecture-13 (ROB and Multi-threading) CS422-Spring

Instruction Level Parallelism

Instruction Level Parallelism. Taken from

Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro)

CS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism

Complex Pipelining. Motivation

Review: Evaluating Branch Alternatives. Lecture 3: Introduction to Advanced Pipelining. Review: Evaluating Branch Prediction

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 9 Instruction-Level Parallelism Part 2

Complex Pipelining: Out-of-order Execution & Register Renaming. Multiple Function Units

Lecture 4: Introduction to Advanced Pipelining

Compiler Optimizations. Lecture 7 Overview of Superscalar Techniques. Memory Allocation by Compilers. Compiler Structure. Register allocation

Computer Architectures. Chapter 4. Tien-Fu Chen. National Chung Cheng Univ.

Hardware-based Speculation

CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions

Advanced Computer Architecture CMSC 611 Homework 3. Due in class Oct 17 th, 2012

What is instruction level parallelism?

Preventing Stalls: 1

Handout 2 ILP: Part B

Chapter 3 & Appendix C Part B: ILP and Its Exploitation

Complex Pipelining COE 501. Computer Architecture Prof. Muhamed Mudawar

Branch prediction ( 3.3) Dynamic Branch Prediction

This Set. Scheduling and Dynamic Execution Definitions From various parts of Chapter 4. Description of Three Dynamic Scheduling Methods

University of Southern California Department of Electrical Engineering EE557 Fall 2001 Instructor: Michel Dubois Homework #3.

Dynamic Scheduling. CSE471 Susan Eggers 1

Super Scalar. Kalyan Basu March 21,

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

CS 423 Computer Architecture Spring Lecture 04: A Superscalar Pipeline

ESE 545 Computer Architecture Instruction-Level Parallelism (ILP) and Static & Dynamic Instruction Scheduling Instruction level parallelism

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

Topics. Digital Systems Architecture EECE EECE Predication, Prediction, and Speculation

Four Steps of Speculative Tomasulo cycle 0

Copyright 2012, Elsevier Inc. All rights reserved.

332 Advanced Computer Architecture Chapter 3. Dynamic scheduling, out-of-order execution, register renaming and speculative execution

ECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 8 Instruction-Level Parallelism Part 1

E0-243: Computer Architecture

CS 614 COMPUTER ARCHITECTURE II FALL 2004

Processor Architecture

Functional Units. Registers. The Big Picture: Where are We Now? The Five Classic Components of a Computer Processor Input Control Memory

CS433 Homework 3 (Chapter 3)

DYNAMIC AND SPECULATIVE INSTRUCTION SCHEDULING

CSE 502 Graduate Computer Architecture. Lec 8-10 Instruction Level Parallelism

NOW Handout Page 1. Outline. Csci 211 Computer System Architecture. Lec 4 Instruction Level Parallelism. Instruction Level Parallelism

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation

References EE457. Out of Order (OoO) Execution. Instruction Scheduling (Re-ordering of instructions)

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor

Transcription:

Scoreboard information (3 tables) Instruction : issued, read operands and started execution (dispatched), completed execution or wrote result, Functional unit (assuming non-pipelined units) busy/not busy operation (if unit can perform more than one operation) destination register - F i source registers (containing source operands) - F j and F k the unit producing the source operands (if stall to avoid RAW hazards) - Q j and Q k flags to indicate that source operands are ready -- R j and R k result : indicates the functional unit that contains an instruction which will write into each register (if any) (24) Four stages of scoreboard control Issue only if no structural, WAR or WAW hazards. Issue (and reserve the functional unit) if the functional unit is free and» No issued instruction (in the wait queue) will write to the destination register (to avoid WAW)» No issued instruction (in the wait queue) will read from the destination register (to avoid WAR) otherwise, stall, and block subsequent instructions the fetch unit stalls when the queue between the fetch and the issue stages is full (may be only one buffer). Read operands only if no RAW hazard. If a RAW hazard is detected, wait until the operands are ready, When the operands are ready, read the registers and move to the execution stage, Note that instructions may proceed to the E stage out-of-order. Execution. When execution terminates, notify the score board. Write result to register file (25) Page 1

Example: Instruction Func. unit Instruction Issue Read op. Exec. Completed Write result L.D F6, 34(R2) L.D F2, 45(R3) MUL.D F0, F2, F4 SUB.D F8, F6, F2 DIV.D F10, F0, F6 ADD.D F6, F8, F2 Unit Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F2 R3 Yes Mult1 Yes Mult F0 F2 F4 Int. No Yes Mult2 No Add Yes Sub F8 F6 F2 Int. Yes No divide Yes Div F10 F0 F6 Mult1 No Yes Func. U Mult1 Int. Add Div done Not in Redundant information (26) Example: when L.D F2, 45(R3) is writing Instruction Func. unit Instruction Issue Read op. Exec. Completed Write result L.D F6, 34(R2) L.D F2, 45(R3) MUL.D F0, F2, F4 SUB.D F8, F6, F2 DIV.D F10, F0, F6 ADD.D F4, F8, F2 Unit Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F2 R3 Yes Mult1 Yes Mult F0 F2 F4 Yes Yes Mult2 No Add Yes Sub F8 F6 F2 Yes Yes divide Yes Div F10 F0 F6 Mult1 No Yes Func. U Mult1 Add Div (27) done Not in Page 2

Example: 3 cycles after SUB.D finished writing Instruction Issue Read op. Exec. Completed Write result Instruction Func. unit L.D F6, 34(R2) L.D F2, 45(R3) MUL.D F0, F2, F4 SUB.D F8, F6, F2 DIV.D F10, F0, F6 ADD.D F4, F8, F2 Unit Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Yes Mult F0 F2 F4 Yes Yes Mult2 No Add Yes add F4 F8 F2 Yes Yes divide Yes Div F10 F0 F6 Mult1 No Yes FU Mult1 Add Div (28) Limitations of the scoreboard approach No forwarding Structural hazards are cleared before instruction issue WAW and WAR hazards are cleared before instruction issue Did not discuss control hazards Execution units are not pipelined Possible enhancement If we can have k write-backs to registers per cycle and k parallel buses between registers and pipeline units, them k functional units may be released per cycle k instructions may be dispatched per cycles. k instructions may be issued per cycle. Need to extend the scoreboard to the case where the execution units are pipelined? (29) Page 3

The Tomasulo approach Introduced for IBM 360/91 in 1966 Main improvements over the scoreboard approach: Uses forwarding on a Common Data Bus (CDB) -- more efficient dealing with RAW hazards, avoids WAR hazards by reading the operands in the instruction-issue order, instead of stalling at issue. To accomplish this an instruction reads an available operand before waiting for the other. Avoids WAW hazards by renaming the registers (using the id of a reservation station rather than the register id) The control information and logic are distributed to the functional unit and not centralized. (30) The architecture for the Tomasulo scheme CDB lw/sw unit Mem file int unit FP add FP mult. Write result to Reg IF Issue Div. Reservation Stations and load/store buffers Each reservation station has an id and is used by one instruction during the lifetime of this instruction. Each unit has one or more reservation stations Reservation stations play the role of temporary registers (renaming) (31) Page 4

Book keeping in Tomasulo s algorithm Instruction : issued, executing or writing result, Reservation stations (functional units) : busy/not busy operation (if unit can perform more than one operation) source operands (data values) - V j and V k the reservation stations producing the source operands (if stall to avoid RAW hazards) - Q j and Q k Address field, A, for load/store buffers (store effective address) result : indicates the reservation station that contains an instruction which will write into each register (if any) (32) Three stages of control Issue If a reservation station is available for the needed functional unit» read ready operands» for operands that are not ready, rename the register to the reservation station that will produce it, Reservation stations for load/store instructions are called load/store buffers. Execution. Monitor the CDB for the operand that is not ready, When both operands are available, execute. If more than one station per unit, only one unit can start execution. Do not start execution before previous branches have completed. Write result. Write to CDB (and to registers) -- may be a structural hazards if only one CDB bus. Make the reservation station (the functional unit) available. (33) Page 5

Load and store instructions: Uses load/store buffers, and each buffer is like a reservation station. Address calculation (put result in buffer), then memory operation The result of a load is put on the CDB Stores are executed in program order (loads in any order) Performs memory disambiguation between store and load buffers, Remarks: May have more reservation stations than registers (a large virtual register space) The original Tomasulo algorithm was introduced before caches were incorporated into commercial processors If more than one issued instruction writes into a register, only the last one does the actual write (no WAW hazards). (34) Example lw/sw unit CDB Mem file int unit FP mult/div FP add Write result to Reg IF Issue # of reservation stations - 2 load/store buffers - 2 add - 3 mult/div - 1 int unit (35) Page 6

Instruction Issue Execute Write result L.D F6, 34(R2) L.D F2, 45(R3) MUL.D F0, F2, F4 SUB.D F8, F2, F6 DIV.D F10, F0, F6 ADD.D F6, F8, F2 done Name Busy Op Vj Vk Qj Qk A Load1 no Load2 Y Load 45+Reg[R3] Add1 Y Sub Reg[F6] Load2 Add2 Y Add Add1 Load2 Add3 no Mult1 Y Mul Reg[F4] Load2 Mult2 Y Div Reg[F6] Mult1 Int no Qi Mult1 load2 Add2 Add1 Mult2 (36) Page 7