CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution. Prof. Yanjing Li University of Chicago


Administrative Stuff
- Lab 2 due tomorrow (2 free late days)
- Lab 3 is out: start early!
- My office hours today moved to tomorrow (announcement on Piazza)

Lecture Outline
- Review: branch prediction
- Out-of-order (OOO) execution
  - Motivation
  - How it works
  - Discussion

Review: Gshare Branch Predictor
[Block diagram: the global branch history (which direction earlier branches went) is XORed with the program counter to index a direction predictor (e.g., 2-bit counters), which answers "taken?". In parallel, the address of the current instruction indexes the BTB (Branch Target Buffer); on a hit, it supplies the target address. The next fetch address is the BTB target if predicted taken, else PC + 4.]

Two Levels of Gshare
- First level: global branch history register (GHR, N bits) XORed with the PC forms the index
- Second level: a Pattern History Table (PHT) of 2-bit counters, one per history pattern
  - Each counter records the direction the branch took the last time the same history was seen
[Diagram: GHR xor PC produces an index into the PHT of 2-bit counters.]

Branch Prediction Using a 2-bit Counter
[State diagram: four states: strongly taken, weakly taken, weakly not-taken, strongly not-taken. An "actually taken" outcome moves the state toward strongly taken; an "actually not-taken" outcome moves it toward strongly not-taken. The two taken states predict taken; the two not-taken states predict not-taken.]
Change prediction after 2 consecutive mistakes.

2-bit Counter: Another Scheme
[State diagram: the same four states (strongly taken, weakly taken, weakly not-taken, strongly not-taken) with a different set of transitions between them, giving a different update policy than the scheme above.]

Review: Dependency Handling in the Pipeline
- Software vs. hardware
  - Software-based instruction scheduling: static scheduling
  - Hardware-based instruction scheduling: dynamic scheduling
- What information does the compiler not know that makes static scheduling difficult?
  - Answer: anything that is determined at run time, e.g., variable-length operation latencies, memory addresses, branch directions

Example: Load-Use Dependency
Consider this sequence, which requires 1 stall:

    LDUR X2, [X1, #20]
    AND  X4, X2, X5
    OR   X8, X3, X6

Static scheduling re-orders the instructions so that no stall is needed:

    LDUR X2, [X1, #20]
    OR   X8, X3, X6
    AND  X4, X2, X5

But what if the load sometimes takes 100 cycles to execute?

Another Example: Instructions w/ Variable Latencies
[Pipeline diagram: after F and D, an integer add spends 1 cycle in E, an integer mul spends 4 cycles, and an FP mul spends 8, all before R and W; a cache miss can stretch a load's latency much further still.]

Dependency Handling
Consider the following two pieces of code:

    IMUL R3 ← R1, R2     |     LD   R3 ← R1(0)
    ADD  R3 ← R3, R1     |     ADD  R3 ← R3, R1
    ADD  R1 ← R6, R7     |     ADD  R1 ← R6, R7
    IMUL R5 ← R6, R8     |     IMUL R5 ← R6, R8
    ADD  R7 ← R3, R5     |     ADD  R7 ← R3, R5

- In both cases, the first ADD stalls the whole pipeline!
  - The ADD cannot dispatch because its source register is unavailable
  - Later independent instructions cannot get executed
- IMUL and LD can take a really long time
  - The latency of LD is unknown until runtime (cache hit vs. miss)

How to Do Better?
- Hardware has knowledge of dynamic events on a per-instruction basis (i.e., at a very fine granularity)
  - Cache misses
  - Branch mispredictions
  - Load/store addresses
- Wouldn't it be nice if hardware did the scheduling of instructions?
- Hardware-based dynamic instruction scheduling enables OOO execution
  - Tradeoffs vs. static scheduling?

Benefits of OOO
[Pipeline timing diagrams comparing in-order and out-of-order execution of the sequence below: in-order takes 15 cycles, out-of-order takes 12, because independent instructions execute under the long IMULs instead of stalling behind them.]

    IMUL R3 ← R1, R2
    ADD  R3 ← R3, R1
    ADD  R1 ← R6, R7
    IMUL R5 ← R6, R8
    ADD  R7 ← R3, R5

Assume: IMUL takes 4 EX cycles, ADD takes 1 EX cycle.

Out-of-Order Execution

Out-of-Order Execution
- Idea
  - Move the dependent instructions out of the way of independent ones (so that the independent ones can execute)
- Approach
  - Monitor the source values of each instruction
  - When all source values of an instruction are available, fire (i.e., dispatch) the instruction
  - Retire each instruction in program order
- Benefit
  - Latency tolerance: allows independent instructions to execute and complete in the presence of a long-latency operation

Illustration of an OOO Pipeline
[Diagram: F and D proceed in order; a SCHEDULE stage feeds the functional units (integer add, integer mul, FP mul, load/store), which execute out of order with varying latencies; a REORDER stage and W restore program order. A tag-and-value broadcast bus connects FU outputs back to the scheduler.]
- Two humps
  - Hump 1: reservation stations (the scheduling window)
  - Hump 2: reorder buffer (the instruction window or active window)

Dynamic Scheduling: Tomasulo's Algorithm
- Invented by Robert Tomasulo
  - Used in the IBM 360/91 floating point units
  - Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units," IBM Journal of R&D, Jan. 1967.
- Variants are used in many high-performance processors

Key Ideas of Tomasulo's Algorithm
1. Register renaming
   - Track true dependencies by linking the consumer of a value to the producer
2. Buffer instructions in reservation stations until they are ready to execute
   - Keep track of the readiness of source values
   - An instruction wakes up and is dispatched to the appropriate functional unit (FU) once all of its sources are ready
     - If multiple instructions are awake, need to select one per FU

Register Renaming
- Output and anti dependencies are not true dependencies
  - Why? They exist only because there are not enough register IDs (i.e., names) in the ISA
- The register ID is renamed to the reservation station (RS) entry that will hold the register's value
  - Register ID → RS entry ID
  - Architectural register ID → physical register ID
  - After renaming, the RS entry ID is used to refer to the register
- This eliminates anti- and output dependencies
  - As if there were a large number of registers, even though the ISA supports only a small number

Register Renaming Using the RAT
- RAT: Register Alias Table (aka Register Rename Table)
[Table: one entry per architectural register (X0 to X9), each with valid?, tag, and value fields. A valid entry holds the register's committed value, and its tag is don't-care; an invalid entry holds the tag of the RS entry that will produce the value (here RS entries 7, 3, 13, and 4 are pending), and its value field is don't-care.]
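A minimal sketch of how the RAT is used, in Python. The class and method names are illustrative, and the single-character RS tags follow the lecture's exercise; a real RAT is a lookup table, not a dict of dicts.

```python
# Renaming via a Register Alias Table (RAT): each architectural register
# maps either to a committed value (valid) or to the tag of the RS entry
# that will produce it. Names here are illustrative, not from hardware.

class RAT:
    def __init__(self, regs):
        # valid=True: 'val' holds the value; valid=False: wait on 'tag'
        self.map = {r: {"valid": True, "tag": None, "val": 0} for r in regs}

    def read_source(self, reg):
        """Rename a source: return its value if ready, else the producer's tag."""
        e = self.map[reg]
        return ("value", e["val"]) if e["valid"] else ("tag", e["tag"])

    def rename_dest(self, reg, rs_tag):
        """Rename a destination: the RS entry ID becomes the register's new name.
        This is what eliminates WAR/WAW hazards on 'reg'."""
        self.map[reg] = {"valid": False, "tag": rs_tag, "val": None}

    def broadcast(self, rs_tag, value):
        """CDB broadcast: a matching tag (latest writer) captures the value."""
        for e in self.map.values():
            if not e["valid"] and e["tag"] == rs_tag:
                e.update(valid=True, val=value, tag=None)

rat = RAT(["X1", "X2", "X3"])
rat.rename_dest("X3", "r")                       # MUL X3 <- ... allocated to RS r
assert rat.read_source("X3") == ("tag", "r")     # consumers link to the producer
rat.broadcast("r", 2)                            # MUL completes, broadcasts tag r
assert rat.read_source("X3") == ("value", 2)     # X3 is valid again
```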

Better Register Renaming Techniques
- Rename through the ROB
- Rename through a merged register file
  - Hinton et al., "The Microarchitecture of the Pentium 4 Processor," Intel Technology Journal, 2001.

Tomasulo's Machine: IBM 360/91
[Block diagram: load buffers (from memory), the FP registers, and instructions from the instruction unit feed reservation stations in front of the FP functional units; store buffers write to memory; an operation bus distributes operations, and the common data bus broadcasts FU results back to the reservation stations, registers, and store buffers.]

Tomasulo's Algorithm
- If no reservation station is available, stall; otherwise insert the instruction, with renamed operands (source value/tag), into the reservation station
- While in the reservation station, each instruction:
  - Watches the common data bus (CDB) for the tags of its sources
  - When a tag is seen, grabs the value for that source and keeps it in the reservation station
  - When both operands are available, the instruction is ready to be dispatched
- Dispatch the instruction to the functional unit (FU) when it is ready
  - If multiple instructions are ready at the same time and require the same FU, need logic to select one
- After the instruction finishes in the FU:
  - Arbitrate for the CDB
  - Put the tagged value onto the CDB (tag broadcast)
  - The register file, RS, and RAT are connected to the CDB
    - Each register contains a tag indicating the latest writer to the register
    - If a tag in the register file, RS, or RAT matches the broadcast tag, write the broadcast value into that entry (and set its valid bit)
  - Reclaim the rename tag (i.e., free the corresponding RS entry)
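The wakeup/select part of the steps above can be sketched as follows. This is a simplified model under stated assumptions: one entry selected per call, oldest-first selection (one common policy; the slides only say "need logic to select one"), and field names of my choosing.

```python
# Sketch of reservation-station wakeup and select. Each RS entry watches
# the CDB for its source tags; once both sources hold values, the entry
# is ready and may be selected for its FU. Names are illustrative.

class RSEntry:
    def __init__(self, tag, op, src1, src2):
        self.tag, self.op = tag, op
        # each source is ('value', v) once ready, or ('tag', t) while waiting
        self.src1, self.src2 = src1, src2

    def ready(self):
        return self.src1[0] == "value" and self.src2[0] == "value"

    def capture(self, cdb_tag, value):
        """Watch the CDB: grab the broadcast value when a source tag matches."""
        if self.src1 == ("tag", cdb_tag):
            self.src1 = ("value", value)
        if self.src2 == ("tag", cdb_tag):
            self.src2 = ("value", value)

def select(rs_entries):
    """Pick one ready entry for the FU (oldest-first, a common policy)."""
    for e in rs_entries:
        if e.ready():
            return e
    return None

# ADD in RS entry 'a' waits on the producer tagged 'r', then wakes up:
rs = [RSEntry("a", "ADD", ("tag", "r"), ("value", 4))]
assert select(rs) is None        # still waiting on tag 'r'
rs[0].capture("r", 2)            # CDB broadcasts tag 'r' with value 2
assert select(rs).tag == "a"     # entry is now ready to dispatch
```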

An Exercise

    MUL X3  ← X1, X2
    ADD X5  ← X3, X4
    ADD X7  ← X2, X6
    ADD X10 ← X8, X9
    MUL X11 ← X7, X10
    ADD X5  ← X5, X11

- Assume ADD takes 4 execute cycles and MUL takes 6
- Assume one adder and one multiplier in hardware
- Assume operations are done entirely using registers (no memory access)
- Pipeline stages: F D E W

Drawing Template
[Worksheet: the instruction sequence above, an empty RAT (registers X1 to X11 with valid?/tag/value fields), and empty reservation stations: entries a, b, c, d for the ADD unit and r, s, t, v for the MUL unit.]

Cycle-by-Cycle Walkthrough
[Each cycle's slide shows the pipeline timing diagram plus the RAT and reservation station contents; the key events are summarized below. Initially every register is valid, with Xn = n.]
- Cycle 1: the first MUL is fetched.
- Cycle 2: the first MUL is decoded and allocated to RS entry r; X3 is renamed to tag r.
- Cycle 3: the MUL in RS entry r starts to execute, since both of its operands are valid.
- Cycle 4: the ADD in RS entry a waits, since its source X3 (tag r) is not yet valid.
- Cycle 5: the ADD in RS entry b (X7 ← X2 + X6) starts to execute.
- Cycle 6: the ADD in RS entry c (X10 ← X8 + X9) starts to execute.
- Cycle 7: the second MUL, in RS entry t, waits on tags b and c. Pay attention to how register renaming removes the WAW hazard on X5: the second write to X5 gets its own tag, d.
- Cycle 8: completed results are broadcast over the CDB to wake up dependent instructions (both the RAT and the RS entries are checked for matching tags).
- Cycle 9: assuming 2 register write ports and forwarding, the ADD in RS entry a can be dispatched; RS entries r and b are reclaimed. X3 = 2 and X7 = 8 are written back.
- Cycle 10: the second MUL (RS entry t) is dispatched, now that X7 = 8 and X10 = 17 are available.
- Cycles 11-15: the in-flight ADD and MUL continue to execute.
- Cycle 16: the second MUL completes (X11 = 136), and the last ADD is dispatched.
- Cycles 17-19: the last ADD executes.
- Cycle 20: the last ADD writes back, X5 = 142, and all instructions have completed.
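The final register values in the walkthrough can be cross-checked with a quick dataflow evaluation. This just runs the six instructions in program order (the dataflow outcome is the same regardless of execution order), starting from the initial state Xn = n shown in cycle 1.

```python
# Dataflow check of the exercise: evaluate the six instructions and
# compare against the final RAT values from the cycle-by-cycle slides.

regs = {f"X{i}": i for i in range(1, 12)}   # initial state: Xn = n

program = [
    ("MUL", "X3", "X1", "X2"),
    ("ADD", "X5", "X3", "X4"),
    ("ADD", "X7", "X2", "X6"),
    ("ADD", "X10", "X8", "X9"),
    ("MUL", "X11", "X7", "X10"),
    ("ADD", "X5", "X5", "X11"),
]

for op, dst, s1, s2 in program:
    regs[dst] = regs[s1] * regs[s2] if op == "MUL" else regs[s1] + regs[s2]

# Matches the walkthrough: X3=2, X7=8, X10=17, X11=8*17=136, X5=6+136=142
assert regs["X3"] == 2 and regs["X7"] == 8 and regs["X10"] == 17
assert regs["X11"] == 136 and regs["X5"] == 142
```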