
CS152 Computer Architecture and Engineering
Lecture 17: Dynamic Scheduling: Tomasulo
March 20, 2001
John Kubiatowicz (http.cs.berkeley.edu/~kubitron)
lecture slides: http://www-inst.eecs.berkeley.edu/~cs152/

Review: Compiler techniques for parallelism
Loop unrolling - multiple iterations of the loop in software:
- Amortizes loop overhead over several iterations
- Gives more opportunity for scheduling around stalls
Software pipelining - take one instruction from each of several iterations of the loop:
- Software overlapping of loop iterations
- Today will show hardware overlapping of loop iterations
Very Long Instruction Word machines (VLIW) - multiple operations coded in a single, long instruction:
- Requires a sophisticated compiler to decide which operations can be done in parallel
Trace scheduling - find the common path and schedule code as if branches didn't exist (+ add "fixup code")
All of these require additional registers.

Review: Software Pipelining
Observation: if iterations of a loop are independent, then we can get more ILP by taking instructions from different iterations.
Software pipelining: reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (~ Tomasulo in SW).
(diagram: a software-pipelined iteration assembled from pieces of iterations 0 through 4)

Review: Software Pipelining Example
Before: unrolled 3 times
 1 LD    F0,0(R1)
 2 ADDD  F4,F0,F2
 3 SD    0(R1),F4
 4 LD    F6,-8(R1)
 5 ADDD  F8,F6,F2
 6 SD    -8(R1),F8
 7 LD    F10,-16(R1)
 8 ADDD  F12,F10,F2
 9 SD    -16(R1),F12
10 SUBI  R1,R1,#24
11 BNEZ  R1,LOOP
After: software pipelined (overlapped ops)
 1 SD    0(R1),F4    ; Stores M[i]
 2 ADDD  F4,F0,F2    ; Adds to M[i-1]
 3 LD    F0,-16(R1)  ; Loads M[i-2]
 4 SUBI  R1,R1,#8
 5 BNEZ  R1,LOOP
Symbolic loop unrolling:
- Maximize result-use distance
- Less code space than unrolling
- Fill and drain the pipe only once per loop, vs. once per unrolled iteration with loop unrolling
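
The reorganization is mechanical enough to sketch. Below is a minimal Python illustration (my own, not from the slides; BODY and software_pipeline are made-up names) of how the pipelined kernel is assembled by taking each stage of the loop body from a different original iteration; the fill and drain phases fall out as the prologue and epilogue.

```python
# A minimal sketch: software pipelining as "symbolic loop unrolling".
# One original iteration is the body LD -> ADDD -> SD; in the pipelined kernel,
# stage s of the body works on behalf of original iteration i - s, so the SD,
# ADDD and LD issued together belong to three different iterations.

BODY = ["LD", "ADDD", "SD"]      # stages of one original iteration, in dependence order

def software_pipeline(n_iters, body):
    """Return (prologue, kernel, epilogue) as lists of cycles of (stage, iteration)."""
    depth = len(body)
    schedule = []
    for i in range(n_iters + depth - 1):
        # in cycle i, stage s works on original iteration i - s (oldest stage first)
        cycle = [(body[s], i - s) for s in reversed(range(depth)) if 0 <= i - s < n_iters]
        schedule.append(cycle)
    return (schedule[:depth - 1],          # prologue: fill the pipe
            schedule[depth - 1:n_iters],   # kernel: steady state, one result per cycle
            schedule[n_iters:])            # epilogue: drain the pipe

prologue, kernel, epilogue = software_pipeline(5, BODY)
print(kernel[0])   # [('SD', 0), ('ADDD', 1), ('LD', 2)] -- same shape as the
                   # software-pipelined body above: store, add and load taken from
                   # three consecutive iterations
```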

Review: Software Pipelining with Loop Unrolling in VLIW
Clock  Memory reference 1   Memory reference 2   FP operation 1     FP op. 2   Int. op / branch
1      LD  F0,-48(R1)       ST 0(R1),F4          ADDD F4,F0,F2
2      LD  F6,-56(R1)       ST -8(R1),F8         ADDD F8,F6,F2                 SUBI R1,R1,#24
3      LD  F10,-40(R1)      ST 8(R1),F12         ADDD F12,F10,F2               BNEZ R1,LOOP
Software pipelined across 9 iterations of the original loop. In each iteration of the above loop we:
- Store to m, m-8, m-16 (iterations i-3, i-2, i-1)
- Compute for m-24, m-32, m-40 (iterations i, i+1, i+2)
- Load from m-48, m-56, m-64 (iterations i+3, i+4, i+5)
9 results in 9 cycles, or 1 clock per iteration. Average: 3.3 ops per clock, 66% efficiency.
Note: software pipelining needs fewer registers (only using 7 registers here; was using 15).

Review: Dynamic hardware for out-of-order execution
HW exploitation of ILP:
- Works when dependences can't be known at compile time
- Code for one machine runs well on another
Key idea of the Scoreboard: allow instructions behind a stall to proceed (Decode => issue instruction & read operands):
- Enables out-of-order execution => out-of-order completion
- The ID stage checked for both structural and data dependences
- The original version didn't handle forwarding
- No automatic register renaming => stalls for WAR and WAW hazards
Are these fundamental limitations? (No!)

Review: Scoreboard Architecture (CDC 6600)
(diagram: Registers and Memory feed the Functional Units - FP Mult, FP Mult, FP Divide, FP Add, Integer - all controlled by the SCOREBOARD)

Review: Four Stages of Scoreboard Control
Issue: decode instructions & check for structural hazards.
- Instructions are issued in program order (for hazard checking)
- Don't issue if there is a structural hazard
- Don't issue if the instruction is output-dependent on any previously issued but uncompleted instruction (no WAW hazards)
Read operands: wait until no data hazards, then read operands.
- All real dependences (RAW hazards) are resolved in this stage
- No forwarding of data in this model!
Execution: operate on operands (EX).
- The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution.
Write result: finish execution (WB).
- Stall until there are no WAR hazards with previous instructions. Example:
    DIVD F0,F2,F4
    ADDD F10,F0,F8
    SUBD F8,F8,F14
  The CDC 6600 scoreboard would stall SUBD until ADDD reads its operands.
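
As a rough illustration of the four-stage control just described (a sketch under my own naming, not CDC 6600 hardware detail), the hazard checks can be phrased as three predicates: structural and WAW checks at Issue, the RAW check at Read operands, and the WAR check at Write result.

```python
# Minimal sketch of the scoreboard's hazard rules (illustrative names only).
# 'pending_writes' plays the role of the register result status: it maps a
# register to the functional unit that has been issued but not yet written it.

pending_writes = {}            # register -> functional unit

def can_issue(fu_free, dest):
    # Issue: stall on a structural hazard (no free FU) or a WAW hazard
    # (an earlier, uncompleted instruction will still write 'dest')
    return fu_free and dest not in pending_writes

def can_read_operands(srcs):
    # Read operands: stall until no RAW hazard; with no forwarding in this
    # model, every source must already have been written back
    return all(r not in pending_writes for r in srcs)

def can_write_result(dest, earlier_instrs):
    # Write result: stall on a WAR hazard, i.e. while an earlier instruction
    # that has not yet read its operands still needs the old value of 'dest'
    return all(dest not in i["srcs_not_yet_read"] for i in earlier_instrs)
```

With the DIVD/ADDD/SUBD example above, SUBD's write of F8 fails can_write_result until ADDD has read its copy of F8, which is exactly the stall the slide describes.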

Review: Data Structures for Scoreboard
Instruction status (Issue / Read operands / Execution complete / Write result) is tracked for the example sequence:
  LD    F6,34+R2
  LD    F2,45+R3
  MULTD F0,F2,F4
  SUBD  F8,F6,F2
  DIVD  F10,F0,F6
  ADDD  F6,F8,F2
Functional unit status (one row per unit: Integer, Mult1, Mult2, Add, Divide): Time, Name, Busy, Op, Fi (destination), Fj/Fk (sources S1/S2), Qj/Qk (FUs producing the sources), Rj/Rk (source-ready flags).
Register result status (FU row): which functional unit will write each register.

How are WAR and WAW hazards handled in Scoreboard?
- WAR hazards are handled by stalling in the Write-back stage
- WAW hazards are handled by stalling in the Issue stage
Are either of these real hazards? Consider the following WAR hazard:
  Add $1, $2, $3
  Sub $3, $5, $4
  Add $2, $3, $5
Why not rename it:
  Add $1, $2, $3
  Sub $7, $5, $4
  Add $2, $7, $5
Now the WAR hazard has disappeared!

The Big Picture: Where are We Now?
The five classic components of a computer: Processor (Control + Datapath), Memory, Input, Output.
Today's topics:
- Recap of last lecture
- Hardware loop unrolling with the Tomasulo algorithm
- Administrivia
- Speculation, branch prediction
- Reorder buffers

Another Dynamic Algorithm: Tomasulo Algorithm
For the IBM 360/91, about 3 years after the CDC 6600 (1966).
Goal: high performance without special compilers.
Differences between the IBM 360 and CDC 6600 ISAs:
- IBM has only 2 register specifiers per instruction vs. 3 in the CDC 6600
- IBM has 4 FP registers vs. 8 in the CDC 6600
- IBM has memory-register ops
Why study it? It led to the Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, ...
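
The renaming step itself is easy to mechanize. The sketch below (a hypothetical helper of my own; it renames every destination rather than only $3, which is what hardware renaming effectively does) applies the idea to the three-instruction example above.

```python
# Minimal sketch: renaming destinations to fresh registers removes WAR (and WAW) hazards.

def rename(instrs, first_free=7):
    """instrs: list of (dest, src1, src2) register numbers. Returns the renamed list."""
    mapping = {}                  # architectural register -> its current physical name
    free = first_free             # next fresh register number to hand out
    out = []
    for dest, a, b in instrs:
        a, b = mapping.get(a, a), mapping.get(b, b)   # sources read the latest name
        mapping[dest] = free                          # the destination gets a new name
        out.append((free, a, b))
        free += 1
    return out

prog = [(1, 2, 3),   # Add $1, $2, $3
        (3, 5, 4),   # Sub $3, $5, $4  -- WAR on $3 against the first Add
        (2, 3, 5)]   # Add $2, $3, $5
print(rename(prog))  # [(7, 2, 3), (8, 5, 4), (9, 8, 5)] -- no destination is reused,
                     # so neither WAR nor WAW stalls are needed
```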

Tomasulo Algorithm vs. Scoreboard
- Control & buffers are distributed with the Functional Units (FUs) vs. centralized in the scoreboard; the FU buffers are called "reservation stations" and hold pending operands
- Registers in instructions are replaced by values or by pointers to reservation stations (RS); this is called register renaming, and it avoids WAR and WAW hazards
- There are more reservation stations than registers, so it can do optimizations compilers can't
- Results go to the FUs from the RSs, not through the registers, over a Common Data Bus that broadcasts results to all FUs
- Loads and stores are treated as FUs with RSs as well
- Integer instructions can go past branches, allowing FP ops beyond the basic block in the FP queue

Tomasulo Organization
(diagram: the FP Op Queue and Load Buffers feed, from memory, into the Reservation Stations - three Add stations in front of the FP adders, two Mult stations in front of the FP multipliers; the FP Registers and Store Buffers connect to memory; everything is tied together by the Common Data Bus (CDB))

Reservation Station Components
Op: operation to perform in the unit (e.g., + or -)
Vj, Vk: values of the source operands
- Store buffers have a V field: the result to be stored
Qj, Qk: reservation stations producing the source registers (the value to be written)
- Note: no ready flags as in the Scoreboard; Qj,Qk = 0 => ready
- Store buffers only have Qi, for the RS producing the result
Busy: indicates that the reservation station or FU is busy
Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.

Three Stages of Tomasulo Algorithm
1. Issue - get an instruction from the FP Op Queue:
   If a reservation station is free (no structural hazard), control issues the instruction & sends the operands (renames registers).
2. Execution - operate on operands (EX):
   When both operands are ready, execute; if not ready, watch the Common Data Bus for the result.
3. Write result - finish execution (WB):
   Write on the Common Data Bus to all awaiting units; mark the reservation station available.
Normal data bus: data + destination ("go to" bus).
Common data bus: data + source ("come from" bus):
- 64 bits of data + 4 bits of Functional Unit source address
- Write if it matches the expected Functional Unit (the one producing the result)
- Does the broadcast
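
A minimal sketch of those components (illustrative field and function names of my own, not the 360/91 control logic) shows how Issue copies either a value or a producing-station tag into the entry and updates the register result status, which is the renaming step.

```python
# Minimal sketch: a reservation-station entry and the Issue step.  A source is carried
# either as a value (Vj/Vk) or as the tag of the station that will produce it (Qj/Qk);
# Qj = Qk = None means "ready".

class RS:
    def __init__(self, name):
        self.name, self.busy = name, False
        self.op = self.vj = self.vk = self.qj = self.qk = None

regs     = {}    # architectural register -> value
reg_stat = {}    # register result status: register -> tag of the RS that will write it

def issue(rs, op, dest, src_j, src_k):
    """Issue one instruction into a free reservation station."""
    assert not rs.busy            # structural hazard: the caller must pick a free station
    rs.busy, rs.op = True, op
    for src, v_field, q_field in ((src_j, "vj", "qj"), (src_k, "vk", "qk")):
        if src in reg_stat:                        # operand still being computed:
            setattr(rs, q_field, reg_stat[src])    #   record who will produce it
        else:                                      # operand available now:
            setattr(rs, v_field, regs.get(src, 0)) #   copy the value into the station
    reg_stat[dest] = rs.name                       # rename: dest now "points at" this RS
```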

Tomasulo Example
Code: LD F6,34+R2 / LD F2,45+R3 / MULTD F0,F2,F4 / SUBD F8,F6,F2 / DIVD F10,F0,F6 / ADDD F6,F8,F2.
The tables track instruction status (Issue / Exec complete / Write result), the reservation stations (Load1-3, Add1-3, Mult1-2 with Busy, Op, Vj, Vk, Qj, Qk), and the register result status (FU row). At cycle 0 everything is empty.

Tomasulo Example: Cycle 1
LD F6,34+R2 issues to Load1 (address 34+R2); register result status: F6 <- Load1.

Tomasulo Example: Cycle 2
LD F2,45+R3 issues to Load2 (address 45+R3); F2 <- Load2.
Note: unlike the CDC 6600, we can have multiple loads outstanding.

Tomasulo Example: Cycle 3
MULTD F0,F2,F4 issues to Mult1 with Vk = R(F4) and Qj = Load2; F0 <- Mult1. Load1 completes execution.
Note: register names are removed ("renamed") in the reservation stations; MULTD is issued here, vs. the scoreboard.
Load1 completing: what is waiting for it?

Tomasulo Example: Cycle 4
Load1 writes its result M(A1) on the CDB; F6 is now M(A1). Load2 completes execution. SUBD F8,F6,F2 issues to Add1 with Vj = M(A1) and Qk = Load2; F8 <- Add1.
Load2 completing: what is waiting for it?

Tomasulo Example: Cycle 5
Load2 writes M(A2); F2 is now M(A2). Add1 (SUBD) now has both operands and starts executing (2 cycles remaining); Mult1 (MULTD) gets M(A2) and R(F4) and starts executing (10 cycles remaining). DIVD F10,F0,F6 issues to Mult2 with Vk = M(A1) and Qj = Mult1; F10 <- Mult2.

Tomasulo Example: Cycle 6
ADDD F6,F8,F2 issues to Add2 with Vk = M(A2) and Qj = Add1; F6 <- Add2.
Issue ADDD here, vs. the scoreboard?

Tomasulo Example: Cycle 7
Add1 (SUBD) completes execution; the ADDD in Add2 is still waiting for its result.
Add1 completing: what is waiting for it?
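
The write-result behaviour seen in cycles 4 and 5, where the loads broadcast and the waiting stations capture the values, can be sketched by extending the toy model above (again, my own illustrative code, not the hardware):

```python
# Minimal sketch: Write Result broadcasts (tag, value) on the Common Data Bus; every
# reservation station and the register file snoop the bus and capture the value if the
# tag matches what they are waiting for.

def write_result(tag, value, stations, regs, reg_stat):
    for rs in stations:                        # all reservation stations snoop the CDB
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
    for reg, producer in list(reg_stat.items()):
        if producer == tag:                    # only the latest renamer updates the register
            regs[reg] = value
            del reg_stat[reg]
```

In cycle 4, for example, Load1 broadcasts (Load1, M(A1)): Add1 captures it into Vj, and F6 is updated because the register result status still names Load1 as its producer.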

Tomasulo Example: Cycle 8
Add1 writes its result (M-M) on the CDB; F8 is now (M-M). Add2 (ADDD) captures it and starts executing.

Tomasulo Example: Cycle 9
Add2 (ADDD) and Mult1 (MULTD) continue executing.

Tomasulo Example: Cycle 10
Add2 (ADDD) completes execution.
Add2 completing: what is waiting for it?

Tomasulo Example: Cycle 11
Add2 writes its result; F6 is now (M-M)+M(A2). Only Mult1 (MULTD) and Mult2 (DIVD) remain.
Write result of ADDD here, vs. the scoreboard? All quick instructions complete in this cycle!

Tomasulo Example: Cycles 12-14
Mult1 (MULTD) keeps executing (3, then 2, then 1 cycle remaining); Mult2 (DIVD) still waits on Mult1 for F0.

Tomasulo Example: Cycle 15
Mult1 (MULTD) completes execution.

Tomasulo Example: Cycle 16
Mult1 writes its result M*F4 on the CDB; F0 is now M*F4. Mult2 (DIVD) captures it and begins its 40-cycle divide.
Faster-than-light computation (skip a couple of cycles)...

Tomasulo Example: Cycle 55
Mult2 (DIVD) has 1 cycle of execution remaining.

Tomasulo Example: Cycle 56
Mult2 (DIVD) completes execution.
Mult2 is completing: what is waiting for it?

Tomasulo Example: Cycle 57
Mult2 (DIVD) writes its result at cycle 57 and the example is done.
Once again: in-order issue, out-of-order execution and completion.

Compare to Scoreboard: Cycle 62
                     Scoreboard (Issue/Read/Comp/Result)   Tomasulo (Issue/Comp/Result)
LD    F6,34+R2        1   2   3   4                         1   3   4
LD    F2,45+R3        5   6   7   8                         2   4   5
MULTD F0,F2,F4        6   9  19  20                         3  15  16
SUBD  F8,F6,F2        7   9  11  12                         4   7   8
DIVD  F10,F0,F6       8  21  61  62                         5  56  57
ADDD  F6,F8,F2       13  14  16  22                         6  10  11
Why does it take longer on the scoreboard? Structural hazards and lack of forwarding.

Tomasulo v. Scoreboard (IBM 360/91 v. CDC 6600)
- Pipelined functional units (6 load, 3 store, 3 +, 2 ×/÷) vs. multiple functional units (1 load/store, 1 +, 2 ×, 1 ÷)
- Window size: ~14 instructions vs. ~5 instructions
- Issue on structural hazard: same
- WAR: renaming avoids the stall vs. stall at completion
- WAW: renaming avoids the stall vs. stall at issue
- Results: broadcast from the FU vs. write/read registers
- Control: reservation stations vs. central scoreboard

Tomasulo Drawbacks
- Complexity: delays of the 360/91, MIPS 10000, IBM 620?
- Many associative stores (CDB) at high speed
- Performance limited by the Common Data Bus:
  Multiple CDBs => more FU logic for parallel associative stores

Administrivia
- Extension on Lab 5: you have until the Wednesday (4/4) after Spring break. Use it wisely: our test program will be quite extensive. Remember: a working processor is necessary for full credit.
- Tomorrow: sections are back in the classroom. TAs will be going over some code-scheduling examples and answering pipelining questions.
- More info on the things we have been talking about the last two lectures: Computer Architecture: A Quantitative Approach by John Hennessy and David Patterson.

Administrivia: New Pentium-4 Architecture!
- Microprocessor Report, August 2000
- 20 pipeline stages! Drive stages => wire delay!
- Trace cache: caching paths through the code for quick decoding
- Renaming: similar to the Tomasulo architecture
- Branch and DATA prediction!
(diagram: pipelines of the Pentium, the original 586, and the Pentium-II/III, the original 686)

Tomasulo Loop Example
Loop: LD    F0,0(R1)
      MULTD F4,F0,F2
      SD    F4,0(R1)
      SUBI  R1,R1,#8
      BNEZ  R1,Loop
- Assume a multiply takes 4 clocks
- Assume the first load takes 8 clocks (cache miss) and the second load takes 1 clock (hit)
- To be clear, we will show clocks for SUBI and BNEZ; in reality, the integer instructions run ahead

Loop Example: Cycle 0
Two iterations of the loop are shown in the instruction-status table; the Load1-3, Store1-3 and Mult1-2 reservation stations are all empty; R1 = 80.

Loop Example: Cycle 1
The first LD F0,0(R1) issues to Load1 (address 80); F0 <- Load1.

Loop Example: Cycle 2
The first MULTD F4,F0,F2 issues to Mult1 with Vk = R(F2) and Qj = Load1; F4 <- Mult1.

Loop Example: Cycle 3
The first SD F4,0(R1) issues to Store1 (address 80), waiting on Mult1 for its value.
Implicit renaming sets up the dataflow graph.

What does this mean physically?
(diagram: Load Buffer 1 holds addr 80; reservation station Mult1 holds "mul R(F2)" waiting on Load1; the FP registers show F0 <- Load1 and F4 <- Mult1; Store Buffer 1 holds addr 80 waiting on Mult1; everything is connected by the Common Data Bus)

Loop Example: Cycle 4
The first load is still outstanding (cache miss). Dispatching SUBI.

Loop Example: Cycle 5
SUBI executes: R1 becomes 72. And the BNEZ instruction is dispatched.

Loop Example: Cycle 6
The second LD F0,0(R1) issues to Load2 (address 72); F0 <- Load2.
Notice that F0 never sees the load from location 80.

Loop Example: Cycle 7
The second MULTD F4,F0,F2 issues to Mult2 with Vk = R(F2) and Qj = Load2; F4 <- Mult2.
The register file is completely detached from iteration 1.

Loop Example: Cycle 8
The second SD F4,0(R1) issues to Store2 (address 72), waiting on Mult2.
The first and second iterations are completely overlapped.

What does this mean physically?
(diagram: the Load Buffers hold addresses 80 and 72; Mult1 and Mult2 each hold "mul R(F2)" waiting on Load1 and Load2 respectively; the FP registers show F0 <- Load2 and F4 <- Mult2; the Store Buffers hold addr 80 <- Mult1 and addr 72 <- Mult2)
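
Replaying these issue steps through the toy issue() sketch from earlier (illustrative only) shows why F0 never sees the value from location 80: each iteration's LD renames F0 to a different load buffer, and each MULTD keeps the tag it captured at issue time.

```python
# Minimal sketch (reusing RS, issue, regs and reg_stat from the earlier sketch):
# issuing two loop iterations back-to-back.  Iteration 2 re-renames F0 to Load2,
# but iteration 1's MULTD still carries the tag Load1 it captured when it issued.

load1, load2 = RS("Load1"), RS("Load2")
mult1, mult2 = RS("Mult1"), RS("Mult2")

issue(load1, "LD",    "F0", "R1", "R1")   # iteration 1: F0 <- Load1
issue(mult1, "MULTD", "F4", "F0", "F2")   # iteration 1: waits with Qj = Load1
issue(load2, "LD",    "F0", "R1", "R1")   # iteration 2: F0 now <- Load2
issue(mult2, "MULTD", "F4", "F0", "F2")   # iteration 2: waits with Qj = Load2

print(mult1.qj, mult2.qj, reg_stat["F0"]) # Load1 Load2 Load2
```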

Loop Example Cycle 8 1 LD F0 0 R1 1 Load1 Yes 80 1 MULTD F4 F0 F2 2 Load2 Yes 72 1 SD F4 0 R1 3 Load3 2 LD F0 0 R1 6 Store1 Yes 80 Mult1 2 MULTD F4 F0 F2 7 Store2 Yes 72 Mult2 2 SD F4 0 R1 8 Store3 Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8 Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop 8 72 Fu Load2 Mult2 First and Second iteration completely overlapped Lec16.53 What does this mean physically? )URP0HP )32S 4XHXH %XIIHUV addr: addr: 80 80 addr: addr: 72 72 $GG $GG $GG )3DGGHUV 0XOW mul R(F2) 0XOW mul R(F2) 5HVHUYDWLRQ 6WDWLRQV &RPPRQ'DWD%XV&'% )35HJLVWHUV F0: F0: Load2 Load2 F4: F4: Mult2 Mult2 Load1 Load2 )3PXOWLSOLHUV 6WRUH %XIIHUV Addr: Addr: 80 80Mult1 Addr: Addr: 72 72Mult2 7R0HP Lec16.54 Loop Example Cycle 9 1 LD F0 0 R1 1 9 Load1 Yes 80 1 MULTD F4 F0 F2 2 Load2 Yes 72 1 SD F4 0 R1 3 Load3 2 LD F0 0 R1 6 Store1 Yes 80 Mult1 2 MULTD F4 F0 F2 7 Store2 Yes 72 Mult2 2 SD F4 0 R1 8 Store3 Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8 Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop 9 72 Fu Load2 Mult2 Load1 completing: who is waiting? te: Dispatching SUBI Lec16.55 Loop Example Cycle 10 1 LD F0 0 R1 1 9 10 Load1 1 MULTD F4 F0 F2 2 Load2 Yes 72 1 SD F4 0 R1 3 Load3 2 LD F0 0 R1 6 10 Store1 Yes 80 Mult1 2 MULTD F4 F0 F2 7 Store2 Yes 72 Mult2 2 SD F4 0 R1 8 Store3 4 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #8 Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop 10 64 Fu Load2 Mult2 Load2 completing: who is waiting? te: Dispatching BNEZ Lec16.56

Loop Example Cycle 11 1 LD F0 0 R1 1 9 10 Load1 1 MULTD F4 F0 F2 2 Load2 1 SD F4 0 R1 3 Load3 Yes 64 2 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult1 2 MULTD F4 F0 F2 7 Store2 Yes 72 Mult2 2 SD F4 0 R1 8 Store3 3 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #8 4 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop 11 64 Fu Load3 Mult2 Next load in sequence Lec16.57 Loop Example Cycle 12 1 LD F0 0 R1 1 9 10 Load1 1 MULTD F4 F0 F2 2 Load2 1 SD F4 0 R1 3 Load3 Yes 64 2 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult1 2 MULTD F4 F0 F2 7 Store2 Yes 72 Mult2 2 SD F4 0 R1 8 Store3 2 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #8 3 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop 12 64 Fu Load3 Mult2 Why not issue third multiply? Lec16.58 Loop Example Cycle 13 1 LD F0 0 R1 1 9 10 Load1 1 MULTD F4 F0 F2 2 Load2 1 SD F4 0 R1 3 Load3 Yes 64 2 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult1 2 MULTD F4 F0 F2 7 Store2 Yes 72 Mult2 2 SD F4 0 R1 8 Store3 1 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #8 2 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop 13 64 Fu Load3 Mult2 Lec16.59 Loop Example Cycle 14 1 LD F0 0 R1 1 9 10 Load1 1 MULTD F4 F0 F2 2 14 Load2 1 SD F4 0 R1 3 Load3 Yes 64 2 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult1 2 MULTD F4 F0 F2 7 Store2 Yes 72 Mult2 2 SD F4 0 R1 8 Store3 0 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #8 1 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop 14 64 Fu Load3 Mult2 Mult1 completing. Who is waiting? Lec16.60

Loop Example Cycle 15 1 LD F0 0 R1 1 9 10 Load1 1 MULTD F4 F0 F2 2 14 15 Load2 1 SD F4 0 R1 3 Load3 Yes 64 2 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R2 2 MULTD F4 F0 F2 7 15 Store2 Yes 72 Mult2 2 SD F4 0 R1 8 Store3 Mult1 SUBI R1 R1 #8 0 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop 15 64 Fu Load3 Mult2 Mult2 completing. Who is waiting? Lec16.61 Loop Example Cycle 16 1 LD F0 0 R1 1 9 10 Load1 1 MULTD F4 F0 F2 2 14 15 Load2 1 SD F4 0 R1 3 Load3 Yes 64 2 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R2 2 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R2 2 SD F4 0 R1 8 Store3 Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8 Mult2 BNEZ R1 Loop 16 64 Fu Load3 Mult1 Lec16.62 Loop Example Cycle 17 1 LD F0 0 R1 1 9 10 Load1 1 MULTD F4 F0 F2 2 14 15 Load2 1 SD F4 0 R1 3 Load3 Yes 64 2 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R2 2 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R2 2 SD F4 0 R1 8 Store3 Yes 64 Mult1 Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8 Mult2 BNEZ R1 Loop 17 64 Fu Load3 Mult1 Loop Example Cycle 18 1 LD F0 0 R1 1 9 10 Load1 1 MULTD F4 F0 F2 2 14 15 Load2 1 SD F4 0 R1 3 18 Load3 Yes 64 2 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R2 2 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R2 2 SD F4 0 R1 8 Store3 Yes 64 Mult1 Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8 Mult2 BNEZ R1 Loop 18 64 Fu Load3 Mult1 Lec16.63 Lec16.64

Loop Example Cycle 19 1 LD F0 0 R1 1 9 10 Load1 1 MULTD F4 F0 F2 2 14 15 Load2 1 SD F4 0 R1 3 18 19 Load3 Yes 64 2 LD F0 0 R1 6 10 11 Store1 2 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R2 2 SD F4 0 R1 8 19 Store3 Yes 64 Mult1 Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8 Mult2 BNEZ R1 Loop 19 64 Fu Load3 Mult1 Loop Example Cycle 20 1 LD F0 0 R1 1 9 10 Load1 1 MULTD F4 F0 F2 2 14 15 Load2 1 SD F4 0 R1 3 18 19 Load3 Yes 64 2 LD F0 0 R1 6 10 11 Store1 2 MULTD F4 F0 F2 7 15 16 Store2 2 SD F4 0 R1 8 19 20 Store3 Yes 64 Mult1 Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8 Mult2 BNEZ R1 Loop 20 64 Fu Load3 Mult1 Lec16.65 Lec16.66 Why can Tomasulo overlap iterations of loops? Register renaming Multiple iterations use different physical destinations for registers (dynamic loop unrolling). Replace static register names from code with dynamic register pointers Effectively increases size of register file Permit instruction issue to advance past integer control flow operations. Crucial: integer unit must get ahead of floating point unit so that we can issue multiple iterations Other idea: Tomasulo building DataFlow graph. Recall: Unrolled Loop That Minimizes Stalls 1 Loop:LD F0,0(R1) 2 LD F6,-8(R1) 3 LD F10,-16(R1) 4 LD F14,-24(R1) 5 ADDD F4,F0,F2 6 ADDD F8,F6,F2 7 ADDD F12,F10,F2 8 ADDD F16,F14,F2 9 SD 0(R1),F4 10 SD -8(R1),F8 11 SD -16(R1),F12 12 SUBI R1,R1,#32 13 BNEZ R1,LOOP 14 SD 8(R1),F16 ; 8-32 = -24 FORFNF\FOHVRUSHULWHUDWLRQ 8VHGQHZUHJLVWHUV!UHJLVWHUUHQDPLQJ Lec16.67 Lec16.68

Summary #1/2
Reservation stations: renaming to a larger set of registers + buffering of source operands.
- Prevents the registers from being the bottleneck
- Avoids the WAR and WAW hazards of the Scoreboard
- Allows loop unrolling in HW
Not limited to basic blocks (the integer unit gets ahead, beyond branches). Helps with cache misses as well.
Lasting contributions:
- Dynamic scheduling
- Register renaming
- Load/store disambiguation

Summary #2/2
Dynamic hardware schemes can unroll loops dynamically in hardware!
- BUT: what about precise interrupts? Out-of-order execution => out-of-order completion!
- BUT: what about branches? We can unroll loops in hardware only if we can get past branches. Next time: branch prediction!
- How do we issue multiple instructions per cycle and still do out-of-order execution? We must increase instruction issue and retire bandwidth.
The 360/91's descendants are the Pentium II, PowerPC 604, MIPS R10000, HP-PA 8000, and Alpha 21264.