COSC121: Computer Systems. ISA and Performance


COSC121: Computer Systems. ISA and Performance
Jeremy Bolton, PhD, Assistant Teaching Professor
Constructed using materials:
- Patt and Patel, Introduction to Computing Systems (2nd)
- Patterson and Hennessy, Computer Organization and Design (4th)
**A special thanks to Eric Roberts and Mary Jane Irwin

Notes Project 3 Assigned. HW posted soon. Read PH.1 and PH.4

Outline
- ISA and performance
  - CISC
  - RISC
- Details of pipelining
- Avoiding hazards (and avoiding stalls)
  - Data: stalls and no-ops, forwarding
  - Branch: branch delay, scheduling, prediction (prediction schemes), unrolling loops

This week our journey takes us: COSC 121: Computer Systems
[Figure: system layer stack from software down to hardware: Application (Browser); Operating System (Win, Linux), Compiler, Assembler, Drivers; Instruction Set Architecture; Processor, Memory, I/O system; Datapath & Control; Digital Design; Circuit Design; transistors. Neighboring courses labeled in the figure: COSC 255: Operating Systems and COSC 120: Computer Hardware.]

Evaluating ISAs
- Design-time metrics: Can it be implemented, in how long, at what cost? Can it be programmed? Ease of compilation?
- Static metrics: How many bytes does the program occupy in memory?
- Dynamic metrics: How many instructions are executed? How many bytes does the processor fetch to execute the program? How many clocks are required per instruction (CPI)? How "lean" a clock is practical?
- Best metric: time to execute the program! It depends on the instruction set, the processor organization, and compilation techniques: $\text{Time} = \text{Inst. Count} \times \text{CPI} \times \text{Cycle Time}$
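As a rough illustration of the "best metric", execution time can be computed as instruction count times CPI times cycle time. The sketch below uses made-up numbers for two hypothetical machines; the function name and all values are assumptions for illustration, not measurements from the lecture.

```python
def execution_time(inst_count, cpi, cycle_time_s):
    """CPU time = instruction count x cycles per instruction x seconds per cycle."""
    return inst_count * cpi * cycle_time_s

# Hypothetical numbers, chosen only to show how the three factors trade off.
risc_like = execution_time(inst_count=4_000_000, cpi=1.2, cycle_time_s=200e-12)
cisc_like = execution_time(inst_count=1_500_000, cpi=4.0, cycle_time_s=250e-12)

print(f"RISC-like: {risc_like * 1e3:.2f} ms, CISC-like: {cisc_like * 1e3:.2f} ms")
```

With these assumed values the machine that executes more instructions still finishes first, because its lower CPI and shorter cycle more than compensate.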

RISC vs CISC
- Ideologies for ISA design
- Two extremes:
  - Build very complex instructions that can execute multiple or complex operations as 1 instruction (CISC)
  - Build very simple instructions that execute quickly (RISC)

CISC Architecture The simplest way to examine the advantages and disadvantages of RISC architecture is by contrasting it with its predecessor: CISC (Complex Instruction Set Computers) architecture.

Multiplying Two numbers in Memory On the right is a diagram representing the storage scheme for a generic computer. The main memory is divided into locations numbered from (row) 1: (column) 1 to (row) 6: (column) 4.

Multiplying Two numbers in Memory The execution unit is responsible for carrying out all computations. However, the execution unit can only operate on data that has been loaded into one of the six registers (A, B, C, D, E, or F).

Multiplying Two numbers in Memory Let's say we want to find the product of two numbers - one stored in location 2,3 and another stored in location 5,2 - and then store the product back in the location 2,3.

The CISC Approach The primary goal of CISC architecture is to complete a task in as few lines of assembly as possible. This is achieved by building processor hardware that is capable of understanding and executing a series of operations.

The CISC Approach For this particular task, a CISC processor would come prepared with a specific instruction (we'll call it "MULT"). When executed, this instruction loads the two values into separate registers, multiplies the operands in the execution unit, and then stores the product in the appropriate register. Thus, the entire task of multiplying two numbers can be completed with one instruction: MULT 2:3, 5:2

The CISC Approach MULT is what is known as a "complex instruction." It operates directly on the computer's memory banks and does not require the programmer to explicitly call any loading or storing functions. It closely resembles a command in a higher level language. For instance, if we let "a" represent the value of 2:3 and "b" represent the value of 5:2, then this command is identical to the C statement "a = a * b."

The CISC Approach One of the primary advantages of this system is that the compiler has to do very little work to translate a high-level language statement into assembly. Because the length of the code is relatively short, very little RAM is required to store instructions. The emphasis is put on building complex instructions directly into the hardware.

The RISC Approach RISC processors only use simple instructions that can be executed within one clock cycle (amortized via pipelining). Thus, the "MULT" command described above could be divided into three separate commands: "LOAD," which moves data from the memory bank to a register, "PROD," which finds the product of two operands located within the registers, and "STORE," which moves data from a register to the memory banks.

The RISC Approach In order to perform the exact series of steps described in the CISC approach, a programmer would need to code four lines of assembly:
    LOAD A, 2:3
    LOAD B, 5:2
    PROD A, B
    STORE 2:3, A

The RISC Approach At first, this may seem like a much less efficient way of completing the operation. Because there are more lines of code, more RAM is needed to store the assembly-level instructions. The compiler must also perform more work to convert a high-level language statement into code of this form.

The RISC Approach However, the RISC strategy also brings some very important advantages. Because each instruction requires only one clock cycle to execute, the entire program will execute in approximately the same amount of time as the multi-cycle "MULT" command. These RISC "reduced instructions" require fewer transistors of hardware space than the complex instructions, leaving more room for general-purpose registers. Because all of the instructions execute in a uniform amount of time (i.e., one clock), pipelining is possible and effective.

Pipeline Performance
[Figure: instruction timing diagrams comparing single-cycle execution ($T_c$ = 800 ps) with pipelined execution ($T_c$ = 200 ps).]

Pipeline Speedup
- If all stages are balanced (i.e., all take the same time):
  $\text{Time between instructions}_{\text{pipelined}} = \dfrac{\text{Time between instructions}_{\text{nonpipelined}}}{\text{Number of stages}}$
- If not balanced, speedup is less
- Speedup is due to increased throughput
  - Latency (time for each instruction) does not decrease
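The speedup relation can be checked numerically. The sketch below assumes five stage latencies that sum to the 800 ps single-cycle period and whose slowest stage gives the 200 ps pipelined period; the individual stage values and the function names are assumptions for illustration.

```python
def clock_periods(stage_latencies_ps):
    """Single-cycle period is the sum of the stages; pipelined period is the slowest stage."""
    return sum(stage_latencies_ps), max(stage_latencies_ps)

def total_time_ps(n_instructions, stage_latencies_ps, pipelined):
    single, piped = clock_periods(stage_latencies_ps)
    if pipelined:
        # Fill the pipeline once, then complete one instruction per cycle.
        return (len(stage_latencies_ps) + n_instructions - 1) * piped
    return n_instructions * single

# Assumed stage latencies (IF, ID, EX, MEM, WB) in picoseconds.
stages = [200, 100, 200, 200, 100]

for n in (3, 1_000_000):
    speedup = total_time_ps(n, stages, False) / total_time_ps(n, stages, True)
    print(n, round(speedup, 3))
```

Because the assumed stages are not balanced, the speedup approaches 800/200 = 4 for long instruction streams rather than the stage count of 5.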

CISC:
1. Emphasis on hardware
2. Includes multi-clock complex instructions
3. Memory-to-memory: "LOAD" and "STORE" incorporated in instructions
4. Small code sizes, high cycles per second
5. Transistors used for storing complex instructions
RISC:
1. Emphasis on software
2. Single-clock, reduced instruction only
3. Register to register: "LOAD" and "STORE" are independent instructions
4. Low cycles per second, large code sizes
5. Spends more transistors on memory registers
http://cs.stanford.edu/people/eroberts/courses/soco/projects/risc/developments/index.html

MIPS (RISC) Design Principles
- Simplicity favors regularity
  - fixed size instructions
  - small number of instruction formats
- Smaller is faster
  - limited instruction set
  - limited number of registers in register file**
- Make the common case fast
  - arithmetic operands from the register file (load-store machine)
  - allow instructions to contain immediate operands
- Good design demands good compromises
  - three instruction formats

MIPS Arithmetic Instructions
- MIPS assembly language arithmetic statements:
    add $t0, $s1, $s2
    sub $t0, $s1, $s2
- Each arithmetic instruction performs one operation
- Each specifies exactly three operands that are all contained in the datapath's register file ($t0, $s1, $s2): destination = source1 op source2
- Instruction format (R format), with the field values for sub $t0, $s1, $s2: op = 0, rs = 17, rt = 18, rd = 8, shamt = 0, funct = 0x22
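To make the R format concrete, the following sketch packs the six fields (6, 5, 5, 5, 5, and 6 bits) into one 32-bit word. The helper name `encode_r` and the small lookup tables are assumptions made for this example; the register numbers and function codes are the standard MIPS values.

```python
# Register numbers for the three registers used on the slide.
REGS = {"$t0": 8, "$s1": 17, "$s2": 18}
# R-format function codes for the two instructions on the slide.
FUNCT = {"add": 0x20, "sub": 0x22}

def encode_r(mnemonic, rd, rs, rt, shamt=0):
    """Pack op | rs | rt | rd | shamt | funct into one 32-bit instruction word."""
    op = 0  # arithmetic R-format instructions use opcode 0
    return ((op << 26) | (REGS[rs] << 21) | (REGS[rt] << 16)
            | (REGS[rd] << 11) | (shamt << 6) | FUNCT[mnemonic])

sub_word = encode_r("sub", "$t0", "$s1", "$s2")
add_word = encode_r("add", "$t0", "$s1", "$s2")
print(hex(sub_word), hex(add_word))
```

Only the funct field differs between the two words, which is why the slide describes funct as augmenting the opcode.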

MIPS Instruction Fields
- MIPS fields are given names to make them easier to refer to: op, rs, rt, rd, shamt, funct
  - op (6 bits): opcode that specifies the operation
  - rs (5 bits): register file address of the first source operand
  - rt (5 bits): register file address of the second source operand
  - rd (5 bits): register file address of the result's destination
  - shamt (5 bits): shift amount (for shift instructions)
  - funct (6 bits): function code augmenting the opcode

MIPS Register File
- Holds thirty-two 32-bit registers
  - Two read ports and
  - One write port
[Figure: register file block with 5-bit src1 addr, src2 addr, and dst addr inputs, a 32-bit write data input, a write control input, and 32-bit src1 data and src2 data outputs; 32 locations of 32 bits each.]
- Registers are
  - Faster than main memory
  - Easier for a compiler to use, e.g., (A*B) - (C*D) - (E*F) can do multiplies in any order
  - Can hold variables so that code density improves (since registers are named with fewer bits than a memory location)

Aside: MIPS Register Convention
Name        Register Number   Usage                    Preserve on call?
$zero       0                 constant 0 (hardware)    n.a.
$at         1                 reserved for assembler   n.a.
$v0 - $v1   2-3               returned values          no
$a0 - $a3   4-7               arguments                no
$t0 - $t7   8-15              temporaries              no
$s0 - $s7   16-23             saved values             yes
$t8 - $t9   24-25             temporaries              no
$gp         28                global pointer           yes
$sp         29                stack pointer            yes
$fp         30                frame pointer            yes
$ra         31                return addr              yes
By standard: $t registers are caller save, $s registers are callee save

Review: Why Pipeline? For Performance!
[Figure: pipeline diagram with time (clock cycles) on the horizontal axis and instruction order (Inst 0 to Inst 4) on the vertical axis; each instruction passes through IM, Reg, DM, Reg. The first cycles are marked "Time to fill the pipeline".]
- Once the pipeline is full, one instruction is completed every cycle, so CPI = 1

Review: MIPS Pipeline Data and Control Paths
[Figure: pipelined MIPS datapath showing the PC, Instruction Memory, Register File, Sign Extend, Shift left 2, adders, and Data Memory, separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers, with the Control unit and the control signals PCSrc, RegWrite, ALUSrc, ALUOp, Branch, MemWrite, MemRead, MemtoReg, and RegDst.]

Review: Can Pipelining Get Us Into Trouble?
- Yes: Pipeline Hazards
  - structural hazards: attempt to use the same resource by two different instructions at the same time
  - data hazards: attempt to use data before it is ready
    - An instruction's source operand(s) are produced by a prior instruction still in the pipeline
  - control hazards: attempt to make a decision about program control flow before the condition has been evaluated and the new PC target address calculated
    - branch and jump instructions, exceptions
- Pipeline control must detect the hazard and then take action to resolve hazards

Review: Register Usage Can Cause Data Hazards
- Read before write data hazard
[Figure: pipeline diagram for the sequence add $1, ; sub $4,$1,$5 ; and $6,$1,$7 ; or $8,$1,$9 ; xor $4,$1,$5. The value of $1 is 10 in the early clock cycles, changes from 10 to -20 in the cycle where the add writes it back, and is -20 afterwards.]

One Way to Fix a Data Hazard
[Figure: pipeline diagram for add $1, followed by two stall cycles, then sub $4,$1,$5 and and $6,$1,$7.]
- Can fix data hazard by waiting (stall), but this impacts CPI

Another Way to Fix a Data Hazard
[Figure: pipeline diagram for add $1, ; sub $4,$1,$5 ; and $6,$1,$7 ; or $8,$1,$9 ; xor $4,$1,$5 with forwarding paths drawn between stages.]
- Fix data hazards by forwarding results as soon as they are available to where they are needed

Data Forwarding (aka Bypassing)
- Take the result from the earliest point that it exists in any of the pipeline state registers and forward it to the functional units (e.g., the ALU) that need it that cycle
- For the ALU functional unit: the inputs can come from any pipeline register rather than just from ID/EX by
  - adding multiplexors to the inputs of the ALU
  - connecting the Rd write data in EX/MEM or MEM/WB to either (or both) of the EX stage's Rs and Rt ALU mux inputs
  - adding the proper control hardware to control the new muxes
- Other functional units may need similar forwarding logic (e.g., the DM)
- With forwarding, the pipeline can achieve a CPI of 1 even in the presence of data dependencies

Data Forwarding Control Conditions
1. EX Forward Unit (forwards the result from the previous instr. to either input of the ALU):
    if (EX/MEM.RegWrite
        and (EX/MEM.RegisterRd != 0)
        and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
      ForwardA = 10
    if (EX/MEM.RegWrite
        and (EX/MEM.RegisterRd != 0)
        and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
      ForwardB = 10
2. MEM Forward Unit (forwards the result from the second previous instr. to either input of the ALU):
    if (MEM/WB.RegWrite
        and (MEM/WB.RegisterRd != 0)
        and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
      ForwardA = 01
    if (MEM/WB.RegWrite
        and (MEM/WB.RegisterRd != 0)
        and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
      ForwardB = 01

Forwarding Illustration
[Figure: pipeline diagram for add $1, ; sub $4,$1,$5 ; and $6,$7,$1, with the EX forwarding path feeding the sub and the MEM forwarding path feeding the and.]

Yet Another Complication!
- Another potential data hazard can occur when there is a conflict between the result of the WB stage instruction and the MEM stage instruction: which should be forwarded?
[Figure: pipeline diagram for the sequence
    add $1,$1,$2
    add $1,$1,$3
    add $1,$1,$4 ]

Corrected Data Forwarding Control Conditions
1. EX Forward Unit (forwards the result from the previous instr. to either input of the ALU):
    if (EX/MEM.RegWrite
        and (EX/MEM.RegisterRd != 0)
        and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
      ForwardA = 10
    if (EX/MEM.RegWrite
        and (EX/MEM.RegisterRd != 0)
        and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
      ForwardB = 10
2. MEM Forward Unit (forwards the result from the previous or second previous instr. to either input of the ALU):
    if (MEM/WB.RegWrite
        and (MEM/WB.RegisterRd != 0)
        and (EX/MEM.RegisterRd != ID/EX.RegisterRs)
        and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
      ForwardA = 01
    if (MEM/WB.RegWrite
        and (MEM/WB.RegisterRd != 0)
        and (EX/MEM.RegisterRd != ID/EX.RegisterRt)
        and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
      ForwardB = 01
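The intent of the corrected conditions, that the more recent EX/MEM result wins over the older MEM/WB result, can be sketched in software. This is a minimal model, not the hardware itself: the `PipeReg` class and function names are hypothetical, and the select value 00 is assumed here to mean "no forwarding, use the register file value".

```python
from dataclasses import dataclass

@dataclass
class PipeReg:
    reg_write: bool  # RegWrite control bit carried in this pipeline register
    rd: int          # destination register number (RegisterRd)

def forward_select(src, ex_mem, mem_wb):
    """Return the 2-bit mux select for one ALU input whose source register is src."""
    if ex_mem.reg_write and ex_mem.rd != 0 and ex_mem.rd == src:
        return 0b10  # EX hazard: take the most recent result from EX/MEM
    if mem_wb.reg_write and mem_wb.rd != 0 and mem_wb.rd == src:
        return 0b01  # MEM hazard: take the older result from MEM/WB
    return 0b00      # assumed encoding for "no forwarding"

def forwarding_unit(id_ex_rs, id_ex_rt, ex_mem, mem_wb):
    """Return (ForwardA, ForwardB) for the instruction currently in EX."""
    return (forward_select(id_ex_rs, ex_mem, mem_wb),
            forward_select(id_ex_rt, ex_mem, mem_wb))

# Third instruction of: add $1,$1,$2 ; add $1,$1,$3 ; add $1,$1,$4
# Both earlier adds write $1, so both EX/MEM and MEM/WB match Rs = $1.
forward_a, forward_b = forwarding_unit(1, 4, PipeReg(True, 1), PipeReg(True, 1))
print(format(forward_a, "02b"), format(forward_b, "02b"))
```

Checking the EX/MEM match first is one way to express the priority; it selects the second add's result for the third add, which is the value the program order requires.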

Datapath with Forwarding Hardware
[Figure: pipelined MIPS datapath with a Forward Unit added in the EX stage. The Forward Unit takes ID/EX.RegisterRs, ID/EX.RegisterRt, EX/MEM.RegisterRd, and MEM/WB.RegisterRd as inputs and controls the muxes on the ALU inputs.]

Memory-to-Memory Copies
- For loads immediately followed by stores (memory-to-memory copies), a stall can be avoided by adding forwarding hardware from the MEM/WB register to the data memory input
  - Would need to add a Forward Unit and a mux to the MEM stage
[Figure: pipeline diagram for the sequence
    lw $1,4($2)
    sw $1,4($3) ]

Forwarding with Load-use Data Hazards
[Figure: pipeline diagram for lw $1,4($2) ; sub $4,$1,$5 ; and $6,$1,$7 ; or $8,$1,$9 ; xor $4,$1,$5, with one stall inserted after the lw so that every later instruction slips by one cycle.]
- Will still need one stall cycle even with forwarding

Load-use Hazard Detection Unit
- Need a Hazard detection Unit in the ID stage that inserts a stall between the load and its use
1. ID Hazard detection Unit:
    if (ID/EX.MemRead
        and ((ID/EX.RegisterRt = IF/ID.RegisterRs)
          or (ID/EX.RegisterRt = IF/ID.RegisterRt)))
      stall the pipeline
- The first line tests to see if the instruction now in the EX stage is a lw; the next two lines check to see if the destination register of the lw matches either source register of the instruction in the ID stage (the load-use instruction)
- After this one cycle stall, the forwarding logic can handle the remaining data hazards
- Rescheduling can help here (more later)
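The detection condition translates almost directly into code. The sketch below is a software model of that single test; the function and parameter names are assumptions chosen to mirror the pipeline register field names.

```python
def load_use_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in ID must wait one cycle behind a load in EX."""
    return id_ex_mem_read and (id_ex_rt == if_id_rs or id_ex_rt == if_id_rt)

# lw $1,4($2) is in EX and sub $4,$1,$5 is in ID: the sub reads $1, so stall.
stall_sub = load_use_stall(True, 1, 1, 5)

# Same lw in EX, but the instruction in ID reads $6 and $7: no stall needed.
no_stall = load_use_stall(True, 1, 6, 7)

# An instruction in EX that is not a load (MemRead = 0) never triggers this stall.
not_a_load = load_use_stall(False, 1, 1, 5)
print(stall_sub, no_stall, not_a_load)
```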

Hazard/Stall Hardware
- Along with the Hazard Unit, we have to implement the stall
- Prevent the instructions in the IF and ID stages from progressing down the pipeline, done by preventing the PC register and the IF/ID pipeline register from changing
  - Hazard detection Unit controls the writing of the PC (PC.write) and IF/ID (IF/ID.write) registers
- Insert a bubble between the lw instruction (in the EX stage) and the load-use instruction (in the ID stage) (i.e., insert a noop in the execution stream)
  - Set the control bits in the EX, MEM, and WB control fields of the ID/EX pipeline register to 0 (noop). The Hazard Unit controls the mux that chooses between the real control values and the 0s.
- Let the lw instruction and the instructions after it in the pipeline (before it in the code) proceed normally down the pipeline

Adding the Hazard/Stall Hardware
[Figure: pipelined MIPS datapath with a Hazard Unit added in the ID stage. The Hazard Unit takes ID/EX.MemRead and ID/EX.RegisterRt as inputs, controls the writing of the PC and IF/ID registers, and drives a mux that selects between the real control values and 0. The Forward Unit remains in the EX stage.]

Control Hazards
- Occur when the flow of instruction addresses is not sequential (i.e., the next PC is not PC + 4); incurred by change of flow instructions
  - Unconditional branches (j, jal, jr)
  - Conditional branches (beq, bne)
  - Exceptions*
- Possible approaches
  - Stall (impacts CPI)
  - Move decision point as early in the pipeline as possible, thereby reducing the number of stall cycles
  - Delay decision and scheduling (requires compiler support), out-of-order execution
  - Predict and hope for the best!
- Control hazards occur less frequently than data hazards, but there is nothing as effective against control hazards as forwarding is for data hazards

Control Hazards
- Stall / Flush
- Add hardware / optimize ISA
  - Determine branch condition and target as early as possible
  - VLIW
- Scheduling / Out-of-Order Execution
  - Static (by compiler)
  - Dynamic (need to implement hardware)
    - Register renaming
- Predict branch
  - Static or dynamic
  - Don't commit until branch outcome determined
- Loop Unrolling
  - Done by compiler

Datapath Branch and Jump Hardware
[Figure: pipelined MIPS datapath with jump support added: the jump target is formed from PC+4[31-28] and the instruction's address field shifted left by 2, and a mux controlled by Jump selects it alongside the PCSrc mux. The Forward Unit remains in the EX stage.]

Jumps Incur One Stall
- Jumps are not decoded until ID, so one flush is needed
  - To flush, set IF.Flush to zero the instruction field of the IF/ID pipeline register (turning it into a noop)
[Figure: pipeline diagram for j, one flushed instruction, then the j target.]
- Fix jump hazard by waiting (flush)
- Fortunately, jumps are very infrequent: only 3% of the SPECint instruction mix

Two Types of Stalls
- Noop instruction (or bubble) inserted between two instructions in the pipeline (as done for load-use situations)
  - Keep the instructions earlier in the pipeline (later in the code) from progressing down the pipeline for a cycle ("stall" them in place with write control signals)
  - Insert noop by zeroing control bits in the pipeline register at the appropriate stage
  - Let the instructions later in the pipeline (earlier in the code) progress normally down the pipeline
  - Result: all operations in pipeline are simply stalled.
- Flushes (or instruction squashing), where an instruction in the pipeline is replaced with a noop instruction (as done for instructions located sequentially after j instructions)
  - Zero the control bits for the instruction to be flushed
  - Result: the flushed instruction is clobbered and never executed.

Supporting ID Stage Jumps
[Figure: pipelined MIPS datapath with the jump decision made in the ID stage; a 0 input is added at the instruction memory output so that the fetched instruction can be replaced in the IF/ID register. The jump target is formed from PC+4[31-28] and the shifted address field. The Forward Unit remains in the EX stage.]

Review: Branch Instructions Cause Control Hazards
- Dependencies backward in time cause hazards
[Figure: pipeline diagram for beq, lw, Inst 3, Inst 4.]

One Way to Fix a Branch Control Hazard
[Figure: pipeline diagram for beq, three flushed instructions, then the beq target and Inst 3.]
- Fix branch hazard by waiting (flush), but this affects CPI

Another Way to Fix a Branch Control Hazard
- Move branch decision hardware back to as early in the pipeline as possible, i.e., during the decode cycle
[Figure: pipeline diagram for beq, one flushed instruction, then the beq target and Inst 3.]
- Fix branch hazard by waiting (flush)

Reducing the Delay of Branches
- Move the branch decision hardware back to the EX stage
  - Reduces the number of stall (flush) cycles to two
  - Adds an and gate and a 2x1 mux to the EX timing path
- Add hardware to compute the branch target address and evaluate the branch decision to the ID stage
  - Reduces the number of stall (flush) cycles to one (like with jumps)
    - But now need to add forwarding hardware in ID stage
  - Computing branch target address can be done in parallel with RegFile read (done for all instructions, only used when needed)
  - Comparing the registers can't be done until after RegFile read, so comparing and updating the PC adds a mux, a comparator, and an and gate to the ID timing path
- For deeper pipelines, branch decision points can be even later in the pipeline, incurring more stalls

Supporting ID Stage Branches
[Figure: pipelined MIPS datapath with the branch decision moved to the ID stage: a Compare unit on the RegFile read data outputs, the branch target adder (Shift left 2 and Add) in ID, the IF.Flush signal, the Hazard Unit, and Forward Units in both the ID and EX stages.]

Issues with Basic Scheduling
- How do we prevent WAR and WAW hazards?
- How do we deal with variable latency?
Instruction          Stages in order (clock cycles 1 to 17)
LD    F6,34(R2)      IF ID EX MEM WB
LD    F2,45(R3)      IF ID EX MEM WB
MULTD F0,F2,F4       IF ID stall M1 M2 M3 M4 M5 M6 M7 M8 M9 M10 MEM WB
SUBD  F8,F6,F2       IF ID A1 A2 MEM WB
DIVD  F10,F0,F6      IF ID stall stall stall stall stall stall stall stall stall D1 D2
ADDD  F6,F8,F2       IF ID A1 A2 MEM WB
Hazards marked in the figure: WAR, RAW

Delayed Branches and (Static) Scheduling
- If the branch hardware has been moved to the ID stage, then we can eliminate all branch stalls with delayed branches, which are defined as always executing the next sequential instruction after the branch instruction; the branch takes effect after that next instruction
  - MIPS compiler moves an instruction to immediately after the branch that is not affected by the branch (a safe instruction), thereby hiding the branch delay
- With deeper pipelines, the branch delay grows, requiring more than one delay slot
  - Delayed branches have lost popularity compared to more expensive but more flexible (dynamic) hardware branch prediction
  - Growth in available transistors has made hardware branch prediction relatively cheaper

Code Scheduling to Avoid Stalls
- Reorder code to avoid use of load result in the next instruction
- C code for A = B + E; C = B + F;
Original order (13 cycles):
            lw  $t1, 0($t0)
            lw  $t2, 4($t0)
    stall   add $t3, $t1, $t2
            sw  $t3, 12($t0)
            lw  $t4, 8($t0)
    stall   add $t5, $t1, $t4
            sw  $t5, 16($t0)
Reordered (11 cycles):
            lw  $t1, 0($t0)
            lw  $t2, 4($t0)
            lw  $t4, 8($t0)
            add $t3, $t1, $t2
            sw  $t3, 12($t0)
            add $t5, $t1, $t4
            sw  $t5, 16($t0)
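The 13-cycle and 11-cycle totals can be reproduced with a small counter. The sketch below assumes a 5-stage pipeline with forwarding, where the only stall is one cycle when an instruction uses the result of the load immediately before it; the parser handles only the lw, sw, and three-register forms used here, and all function names are hypothetical.

```python
def parse(line):
    """Return (opcode, destination register or None, list of source registers)."""
    op, rest = line.split(None, 1)
    args = [a.strip() for a in rest.split(",")]
    if op == "lw":   # lw rt, offset(base): writes rt, reads base
        return op, args[0], [args[1][args[1].index("(") + 1:-1]]
    if op == "sw":   # sw rt, offset(base): reads rt and base, writes nothing
        return op, None, [args[0], args[1][args[1].index("(") + 1:-1]]
    return op, args[0], args[1:]  # op rd, rs, rt

def count_cycles(program, stages=5):
    """Cycles = one per instruction + pipeline fill + load-use stalls (simplified model)."""
    stalls = 0
    prev = None
    for line in program:
        cur = parse(line)
        if prev and prev[0] == "lw" and prev[1] in cur[2]:
            stalls += 1  # result of the preceding load is needed right away
        prev = cur
    return len(program) + (stages - 1) + stalls

original = ["lw $t1, 0($t0)", "lw $t2, 4($t0)", "add $t3, $t1, $t2", "sw $t3, 12($t0)",
            "lw $t4, 8($t0)", "add $t5, $t1, $t4", "sw $t5, 16($t0)"]
reordered = ["lw $t1, 0($t0)", "lw $t2, 4($t0)", "lw $t4, 8($t0)", "add $t3, $t1, $t2",
             "sw $t3, 12($t0)", "add $t5, $t1, $t4", "sw $t5, 16($t0)"]
print(count_cycles(original), count_cycles(reordered))
```

Moving the third load up separates each load from its first use, which removes both stall cycles without changing what the code computes.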

Scheduling Branch Delay Slots
A. From before branch:
      add $1,$2,$3
      if $2=0 then
        [delay slot]
   becomes
      if $2=0 then
        add $1,$2,$3
B. From branch target:
      sub $4,$5,$6      (at the branch target)
      add $1,$2,$3
      if $1=0 then
        [delay slot]
   becomes
      add $1,$2,$3
      if $1=0 then
        sub $4,$5,$6
C. From fall through:
      add $1,$2,$3
      if $1=0 then
        [delay slot]
      sub $4,$5,$6
   becomes
      add $1,$2,$3
      if $1=0 then
        sub $4,$5,$6
- A is the best choice: fills delay slot and reduces IC
- In B and C, the sub instruction may need to be copied, increasing IC
- In B and C, must be okay to execute sub when branch fails
- Or Branch Prediction Hazard

Dynamic Scheduling
- A major limitation of the simple pipelining techniques is in-order execution
  - If an instruction is stalled in the pipeline, all the instructions behind it must wait
  - Even if there would be enough hardware resources to execute them
- Solution: let the instructions behind the stalled instruction proceed
- Split the Instruction Decode phase of the pipeline into:
  - Issue: decode instruction and check for structural hazards
  - Read operands: wait until no data hazards, then read operands
- We will have out-of-order execution and out-of-order completion of the instructions.

Dynamic Pipeline Scheduling
- Allow the CPU to execute instructions out of order to avoid stalls
  - **But commit result to registers in order**
- Another example:
    lw   $t0, 20($s2)
    addu $t1, $t0, $t2
    sub  $s4, $s4, $t3
    add  $t5, $s4, $s4
  - Can start sub while addu is waiting for lw
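Why the sub can start early becomes visible once the read-after-write dependencies are listed. The sketch below hand-encodes the four instructions as (opcode, destination, sources) tuples and finds, for each one, which earlier instruction produces a value it reads; the encoding and the function name are assumptions for illustration, not a model of real issue logic.

```python
# (opcode, destination, sources) for each instruction in the example above
program = [
    ("lw",   "$t0", ["$s2"]),
    ("addu", "$t1", ["$t0", "$t2"]),
    ("sub",  "$s4", ["$s4", "$t3"]),
    ("add",  "$t5", ["$s4", "$s4"]),
]

def raw_producers(program):
    """For each instruction, list the indices of earlier instructions whose result it reads."""
    deps = []
    for i, (_, _, srcs) in enumerate(program):
        producers = set()
        for src in srcs:
            # Only the most recent earlier writer of src matters.
            for j in range(i - 1, -1, -1):
                if program[j][1] == src:
                    producers.add(j)
                    break
        deps.append(sorted(producers))
    return deps

deps = raw_producers(program)
for (op, _, _), d in zip(program, deps):
    print(op, "depends on", [program[j][0] for j in d])
```

The sub has no producer among the earlier instructions, so nothing forces it to wait behind the addu, while the final add must wait for the sub.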

Why Do Dynamic Scheduling?
- Why not just let the compiler schedule code?
- Not all stalls are predictable
- Can't always schedule around branches
  - Branch outcome is dynamically determined
- Different implementations of an ISA have different latencies and hazards
- E.g., scoreboarding and Tomasulo's algorithm** (an example is upcoming if time permits)

(Static) Branch Prediction
- Resolve branch hazards by assuming a given outcome and proceeding without waiting to see the actual branch outcome
1. Predict not taken: always predict branches will not be taken and continue to fetch from the sequential instruction stream; only when the branch is taken does the pipeline stall
  - If taken, flush instructions after the branch (earlier in the pipeline)
    - in IF, ID, and EX stages if branch logic in MEM: three stalls
    - in IF and ID stages if branch logic in EX: two stalls
    - in IF stage if branch logic in ID: one stall
  - Ensure that those flushed instructions haven't changed the machine state: automatic in the MIPS pipeline since machine state changing operations are at the tail end of the pipeline (MemWrite (in MEM) or RegWrite (in WB))
  - Restart the pipeline at the branch destination

Flushing with Misprediction (Not Taken)
[Figure: pipeline diagram, instructions listed by address:
     4  beq $1,$2,2
     8  sub $4,$1,$5    (flushed)
    16  and $6,$1,$7
    20  or  $8,$1,$9 ]
- To flush the IF stage instruction, assert IF.Flush to zero the instruction field of the IF/ID pipeline register (transforming it into a noop)

Branching Structures
Predict not taken works well for top-of-the-loop branching structures. But such loops have a jump at the bottom of the loop to return to the top, and so incur the jump stall overhead.
    Loop: beq $1,$2,Out
          1st loop instr
          ...
          last loop instr
          j Loop
    Out:  fall out instr
Predict not taken doesn't work well for bottom-of-the-loop branching structures.
    Loop: 1st loop instr
          2nd loop instr
          ...
          last loop instr
          bne $1,$2,Loop
          fall out instr

Branch Prediction, cont.
Resolve branch hazards by assuming a given outcome and proceeding.
2. Predict taken: predict that branches will always be taken.
  Predict taken always incurs one stall cycle (if the branch destination hardware has been moved to the ID stage).
  Is there a way to cache the address of the branch target instruction?
As the branch penalty increases (for deeper pipelines), a simple static prediction scheme will hurt performance. With more hardware, it is possible to predict branch behavior dynamically during program execution.
3. Dynamic branch prediction: predict branches at run time using run-time information.

Dynamic Branch Prediction
A branch prediction buffer (aka branch history table, BHT) in the IF stage is addressed by the lower bits of the PC. It contains bit(s), passed to the ID stage through the IF/ID pipeline register, that tell whether the branch was taken the last time it was executed.
The prediction bit may predict incorrectly (it may be a wrong prediction for this branch on this iteration, or it may come from a different branch with the same low-order PC bits), but this doesn't affect correctness, just performance.
The branch decision occurs in the ID stage, after determining that the fetched instruction is a branch and checking the prediction bit(s).
If the prediction is wrong, flush the incorrect instruction(s) in the pipeline, restart the pipeline with the right instruction, and invert the prediction bit(s).
A 4096-bit BHT varies from 1% misprediction (nasa7, tomcatv) to 18% (eqntott).
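The indexing and aliasing behavior described above can be sketched in a few lines. This is an illustrative model only: the class name, the power-of-two table size, and the choice to drop the two byte-offset bits of the PC before indexing are assumptions, not details given in the slides.

```python
class BranchHistoryTable:
    """One prediction bit per entry, indexed by low-order PC bits."""

    def __init__(self, entries=4096):
        # Assumption: `entries` is a power of two, so a mask selects the index.
        self.mask = entries - 1
        self.bits = [False] * entries  # False means "predict not taken"

    def _index(self, pc):
        # MIPS instructions are word aligned, so skip the two low PC bits.
        return (pc >> 2) & self.mask

    def predict(self, pc):
        return self.bits[self._index(pc)]

    def update(self, pc, taken):
        """Record the actual outcome; return True if the prediction was right."""
        idx = self._index(pc)
        correct = self.bits[idx] == taken
        self.bits[idx] = taken  # for one bit, this inverts it on a misprediction
        return correct


# Two different branches whose low-order PC bits collide in a tiny table.
bht = BranchHistoryTable(entries=4)
bht.update(0x100, taken=True)
print(bht.predict(0x110))  # the second branch inherits the first one's history
```

The collision changes only the prediction, never the architectural result, which matches the point that a wrong bit costs performance but not correctness.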

Branch Target Buffer
The BHT predicts when a branch is taken, but does not tell where it's taken to!
A branch target buffer (BTB) in the IF stage caches the branch target address, but we also need to fetch the next sequential instruction. The prediction bit in IF/ID selects which next instruction will be loaded into IF/ID at the next clock edge.
  This would need a two-read-port instruction memory,
  or the BTB can cache the branch-taken instruction while the instruction memory is fetching the next sequential instruction.
If the prediction is correct, stalls can be avoided no matter which direction the branch goes.
[Figure: the PC supplies the read address to both the instruction memory and the BTB]

1-bit Prediction Scheme
A 1-bit predictor will be incorrect twice for a loop branch: once on the first iteration and once when the branch is finally not taken.
Assume predict_bit = 0 to start (indicating branch not taken) and that loop control is at the bottom of the loop code.
1. First time through the loop, the predictor mispredicts the branch, since the branch is taken back to the top of the loop; invert the prediction bit (predict_bit = 1).
2. As long as the branch is taken (looping), the prediction is correct.
3. Exiting the loop, the predictor again mispredicts the branch, since this time the branch is not taken and control falls out of the loop; invert the prediction bit (predict_bit = 0).
    Loop: 1st loop instr
          2nd loop instr
          ...
          last loop instr
          bne $1,$2,Loop
          fall out instr
For 10 times through the loop we have an 80% prediction accuracy for a branch that is taken 90% of the time.
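The 80% figure can be checked by replaying the branch outcomes through a one-bit predictor. A minimal sketch; the function name and the encoding of outcomes as booleans (True for taken) are illustrative choices:

```python
def one_bit_accuracy(outcomes, predict_bit=False):
    """Replay branch outcomes (True = taken) through a 1-bit predictor
    and return the fraction predicted correctly."""
    correct = 0
    for taken in outcomes:
        if predict_bit == taken:
            correct += 1
        else:
            predict_bit = taken  # invert the bit on a misprediction
    return correct / len(outcomes)


# Ten trips through the loop: the bottom branch is taken 9 times, then falls out.
loop_outcomes = [True] * 9 + [False]
print(one_bit_accuracy(loop_outcomes))  # 0.8
```

The two misses are the first iteration (bit starts at 0) and the fall-out, matching steps 1 and 3.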

2-bit Predictors
A 2-bit scheme can give higher accuracy, since a prediction must be wrong twice before the prediction bit is changed.
Scenario: consecutive loops. When the first loop ends, the branch prediction will likely fail, but the prediction strategy will not change. Thus the predict-taken strategy at the beginning of the next loop will likely succeed (with a 1-bit predictor, this prediction would fail).
    Loop: 1st loop instr
          2nd loop instr
          ...
          last loop instr
          bne $1,$2,Loop
          fall out instr
[Figure: four-state FSM with states 11 and 10 (Predict Taken) and 01 and 00 (Predict Not Taken), transitions labeled Taken / Not taken; annotated "right 9 times", "wrong on loop fall out", and "right on 1st iteration"]
The BHT also stores the initial FSM state.
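The consecutive-loops scenario can be replayed in code. This sketch uses a 2-bit saturating counter, which is one common way to build a 2-bit predictor and is an assumption here: the exact transitions of the slide's FSM may differ. Both predictors are assumed to start in a taken state, and the function names are illustrative.

```python
def one_bit_misses(outcomes, bit=True):
    """Count mispredictions of a 1-bit predictor (True = taken)."""
    misses = 0
    for taken in outcomes:
        if bit != taken:
            misses += 1
            bit = taken
    return misses


def two_bit_misses(outcomes, counter=3):
    """Count mispredictions of a 2-bit saturating counter.
    Counter values 2 and 3 predict taken; 0 and 1 predict not taken."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        # Move one step toward the observed outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


# Two consecutive runs of a 10-iteration loop with the branch at the bottom.
two_loops = ([True] * 9 + [False]) * 2
print(one_bit_misses(two_loops))  # 3
print(two_bit_misses(two_loops))  # 2
```

The 1-bit predictor pays an extra miss on re-entering the loop, while the 2-bit counter only misses on each fall-out.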

Speculation
Speculation is used to allow execution of future instructions that (may) depend on the speculated instruction.
  Speculate on the outcome of a conditional branch (branch prediction)
  Speculate that a store (for which we don't yet know the address) that precedes a load does not refer to the same address, allowing the load to be scheduled before the store (load speculation)
Must have (hardware and/or software) mechanisms for
  checking to see if the guess was correct
  recovering from the effects of the instructions that were executed speculatively if the guess was incorrect

Multiple-Issue Processor Styles
Static multiple-issue processors (aka VLIW)
  Decisions on which instructions to execute simultaneously are made statically (at compile time by the compiler)
  E.g., Intel Itanium and Itanium 2 for the IA-64 ISA: EPIC (Explicitly Parallel Instruction Computing)
    128-bit bundles containing three instructions, each 41 bits, plus a 5-bit template field (which specifies which FU each instruction needs)
    Five functional units (Int, MMedia, DMem, FP, Branch)
    Extensive support for speculation and predication
Dynamic multiple-issue processors (aka superscalar)
  Decisions on which instructions to execute simultaneously (in the range of 2 to 8) are made dynamically (at run time by the hardware)
  E.g., IBM Power series, Pentium 4, MIPS R10K, AMD Barcelona

Multiple-Issue Datapath Responsibilities
Must handle, with a combination of hardware and software fixes, the fundamental limitations of
  How many instructions to issue in one clock cycle (issue slots)
  Storage (data) dependencies, aka data hazards
    Limitation more severe in a SS/VLIW processor
  Procedural dependencies, aka control hazards
  Resource conflicts, aka structural hazards
    A SS/VLIW processor has a much larger number of potential resource conflicts
    Functional units may have to arbitrate for result buses and register-file write ports
Register renaming and reservation stations can help.

Static Multiple Issue Machines (VLIW)
Static multiple-issue processors (aka VLIW) use the compiler (at compile time) to statically decide which instructions to issue and execute simultaneously.
  Issue packet: the set of instructions that are bundled together and issued in one clock cycle; think of it as one large instruction with multiple operations
  The mix of instructions in the packet (bundle) is usually restricted: a single instruction with several predefined fields
  The compiler does static branch prediction and code scheduling to reduce (control) or eliminate (data) hazards
VLIWs have
  Multiple functional units
  Multi-ported register files
  Wide program bus

An Example: A VLIW MIPS
Consider a 2-issue MIPS with a 2-instruction, 64-bit bundle:
  First slot: Op (R format) or Branch (I format)
  Second slot: Load or Store (I format)
Instructions are always fetched, decoded, and issued in pairs. If one instruction of the pair cannot be used, it is replaced with a noop.
Need 4 read ports and 2 write ports, and a separate memory address adder.

A MIPS VLIW (2-issue) Datapath
[Figure: datapath with PC, adders (including PC + 4), instruction memory, register file (write address and write data ports), two sign-extend units, and data memory]

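Register Renaming
Example:
    DIV.D F0,F2,F4
    ADD.D F6,F0,F8
    S.D   F6,0(R1)
    SUB.D F8,F10,F14
    MUL.D F6,F10,F8
Annotations: antidependence on F8 (ADD.D reads it, then SUB.D writes it); antidependence plus name dependence on F6 (S.D reads it and ADD.D writes it before MUL.D writes it).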

Register Renaming
Example, after renaming F6 to S and F8 to T:
    DIV.D F0,F2,F4
    ADD.D S,F0,F8
    S.D   S,0(R1)
    SUB.D T,F10,F14
    MUL.D F6,F10,T
Now only RAW hazards remain, and the code can be scheduled around them.
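The effect of renaming can be checked mechanically by listing the register dependences in each version of the example. This is a simplified sketch: it compares every pair of instructions without accounting for intervening redefinitions, which is adequate for this five-instruction sequence but not in general. The tuple encoding and the function name `dependences` are illustrative.

```python
def dependences(program):
    """program: list of (name, dest, sources). Returns a list of
    (kind, register, earlier, later) with kind in RAW, WAR, WAW."""
    found = []
    for i, (name_i, dest_i, srcs_i) in enumerate(program):
        for name_j, dest_j, srcs_j in program[i + 1:]:
            if dest_i is not None and dest_i in srcs_j:
                found.append(("RAW", dest_i, name_i, name_j))  # true dependence
            if dest_j is not None and dest_j in srcs_i:
                found.append(("WAR", dest_j, name_i, name_j))  # antidependence
            if dest_i is not None and dest_i == dest_j:
                found.append(("WAW", dest_i, name_i, name_j))  # output dependence
    return found


original = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D", None, ("F6", "R1")),      # a store writes no register
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]
renamed = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "S", ("F0", "F8")),
    ("S.D", None, ("S", "R1")),
    ("SUB.D", "T", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "T")),
]

for label, prog in (("original", original), ("renamed", renamed)):
    print(label, sorted({kind for kind, *_ in dependences(prog)}))
```

The original sequence reports RAW, WAR, and WAW entries; after renaming only RAW entries remain.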

Register Renaming
Name dependence but no true data dependence.
Register renaming is provided by reservation stations (RS). Each contains:
  The instruction
  Buffered operand values (when available)
  Reservation station number of the instruction providing the operand values
An RS fetches and buffers an operand as soon as it becomes available (not necessarily involving the register file).
Pending instructions designate the RS to which they will send their output.
  Result values are broadcast on a result bus, called the common data bus (CDB)
  Only the last output updates the register file
As instructions are issued, the register specifiers are renamed with the reservation station.
There may be more reservation stations than registers.
Reservation stations and the reorder buffer effectively provide register renaming.

Loop Unrolling
Replicate the loop body to expose more parallelism
  Reduces loop-control overhead
Use different registers per replication
  Called register renaming
  Avoids loop-carried anti-dependencies
    Store followed by a load of the same register
    Aka name dependence: reuse of a register name

Code Scheduling Example (with VLIW)
Consider the following loop code:
    lp: lw   $t0,0($s1)    # $t0=array element
        addu $t0,$t0,$s2   # add val in $s2
        sw   $t0,0($s1)    # store result
        addi $s1,$s1,-4    # decrement pointer
        bne  $s1,$0,lp     # branch if $s1 != 0
Must schedule the instructions to avoid pipeline stalls:
  Instructions in one bundle must be independent
  Must separate load-use instructions from their loads by one cycle
  Notice that the first two instructions have a load-use dependency; the next two and the last two have data dependencies
  Assume branches are perfectly predicted by the hardware

The Scheduled (out-of-order) Code (Not Unrolled with VLIW)
         ALU or branch        Data transfer      CC
    lp:                       lw $t0,0($s1)      1
         addi $s1,$s1,-4                         2
         addu $t0,$t0,$s2                        3
         bne $s1,$0,lp        sw $t0,4($s1)      4
Four clock cycles to execute 5 instructions, for a CPI of 0.8 (versus the best case of 0.5) and an IPC of 1.25 (versus the best case of 2.0). Noops don't count towards performance!

Loop Unrolling
Loop unrolling: multiple copies of the loop body are made, and instructions from different iterations are scheduled together as a way to increase ILP.
Apply loop unrolling (4 times for our example) and then schedule the resulting code:
  Eliminate unnecessary loop overhead instructions
  Schedule so as to avoid load-use hazards
During unrolling, the compiler applies register renaming to eliminate all data dependencies that are not true data dependencies.

Unrolled Code Example
    lp: lw   $t0,0($s1)     # $t0=array element
        lw   $t1,-4($s1)    # $t1=array element
        lw   $t2,-8($s1)    # $t2=array element
        lw   $t3,-12($s1)   # $t3=array element
        addu $t0,$t0,$s2    # add scalar in $s2
        addu $t1,$t1,$s2    # add scalar in $s2
        addu $t2,$t2,$s2    # add scalar in $s2
        addu $t3,$t3,$s2    # add scalar in $s2
        sw   $t0,0($s1)     # store result
        sw   $t1,-4($s1)    # store result
        sw   $t2,-8($s1)    # store result
        sw   $t3,-12($s1)   # store result
        addi $s1,$s1,-16    # decrement pointer
        bne  $s1,$0,lp      # branch if $s1 != 0
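To see that the unrolled loop computes the same result as the original, both can be modeled at a high level. This is a sketch under stated assumptions: memory is a Python dict keyed by byte address, the element count is a multiple of 4, and the names `rolled` and `unrolled` are illustrative. It models only the data flow, not timing.

```python
def rolled(mem, s1, s2):
    """Original loop: one array element per iteration."""
    mem = dict(mem)
    while True:
        t0 = mem[s1]        # lw   $t0,0($s1)
        t0 = t0 + s2        # addu $t0,$t0,$s2
        mem[s1] = t0        # sw   $t0,0($s1)
        s1 -= 4             # addi $s1,$s1,-4
        if s1 == 0:         # bne  $s1,$0,lp falls through when $s1 == 0
            return mem


def unrolled(mem, s1, s2):
    """Loop unrolled 4 times, with a separate temporary per copy."""
    mem = dict(mem)
    while True:
        t0, t1, t2, t3 = mem[s1], mem[s1 - 4], mem[s1 - 8], mem[s1 - 12]
        t0, t1, t2, t3 = t0 + s2, t1 + s2, t2 + s2, t3 + s2
        mem[s1], mem[s1 - 4], mem[s1 - 8], mem[s1 - 12] = t0, t1, t2, t3
        s1 -= 16            # one pointer update and one branch per 4 elements
        if s1 == 0:
            return mem


# Eight words at byte addresses 4..32; the pointer starts at the last element.
memory = {addr: addr * 10 for addr in range(4, 36, 4)}
print(rolled(memory, 32, 5) == unrolled(memory, 32, 5))  # True
```

The unrolled version executes one addi and one bne per four elements instead of per element, which is the loop-overhead saving the slides refer to.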

The Scheduled Code (Unrolled)
         ALU or branch        Data transfer      CC
    lp:  addi $s1,$s1,-16     lw $t0,0($s1)      1
                              lw $t1,12($s1)     2
         addu $t0,$t0,$s2     lw $t2,8($s1)      3
         addu $t1,$t1,$s2     lw $t3,4($s1)      4
         addu $t2,$t2,$s2     sw $t0,16($s1)     5
         addu $t3,$t3,$s2     sw $t1,12($s1)     6
                              sw $t2,8($s1)      7
         bne $s1,$0,lp        sw $t3,4($s1)      8
Eight clock cycles to execute 14 instructions, for a CPI of 0.57 (versus the best case of 0.5) and an IPC of 1.75 (versus the best case of 2.0).
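The CPI and IPC figures for the two schedules come from the same pair of ratios. A small sketch of the arithmetic, using the cycle and instruction counts from the two schedules; the helper name `cpi_ipc` is illustrative:

```python
def cpi_ipc(cycles, instructions):
    """Return (cycles per instruction, instructions per cycle)."""
    return cycles / instructions, instructions / cycles


# Not unrolled: 5 useful instructions in 4 clock cycles.
print(cpi_ipc(4, 5))

# Unrolled 4 times: 14 useful instructions in 8 clock cycles.
cpi, ipc = cpi_ipc(8, 14)
print(round(cpi, 2), ipc)
```

Noops are excluded from the instruction count, so the best case for a 2-issue machine stays at a CPI of 0.5 and an IPC of 2.0.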

Summary
All modern-day processors use pipelining for performance (a CPI of 1 and a fast CC).
Pipeline clock rate is limited by the slowest pipeline stage, so designing a balanced pipeline is important.
Must detect and resolve hazards:
  Structural hazards: resolved by designing the pipeline correctly
  Data hazards
    Stall (impacts CPI)
    Forward (requires hardware support)
  Control hazards: put the branch decision hardware in as early a stage in the pipeline as possible
    Stall (impacts CPI)
    Delay decision (requires compiler support)
    Static and dynamic prediction (requires hardware support)
Scheduling and speculation can reduce stalls.
Multiple-issue designs, including VLIW, can improve ILP.

Appendix
Jeremy Bolton, PhD, Assistant Teaching Professor
Constructed using materials:
  - Patt and Patel, Introduction to Computing Systems (2nd)
  - Patterson and Hennessy, Computer Organization and Design (4th)
A special thanks to Eric Roberts and Mary Jane Irwin

Dynamic Scheduling Algorithm: Tomasulo's Algorithm
    DIV.D F0, F2, F4
    ADD.D S, F0, F8
    S.D   S, 0(R1)
    SUB.D T, F10, F14
    MUL.D F6, F10, T
Register renaming is implemented through reservation stations (rs) per functional unit:
  An rs buffers an operand as soon as it is available, which avoids WAR hazards
  Pending instructions designate the rs that will provide their inputs, which avoids WAW hazards
  The last write in a sequence of writes to the same register is the one that actually updates the register
Hazard detection and execution control are decentralized.
Instruction results are passed directly to the FUs from the rs, rather than from registers, through the common data bus (CDB).
Nov. 2, 2004 Lec. 7

FP Unit and Load-Store Unit Using Tomasulo's Algorithm

Dynamically Scheduled CPU
[Figure: block diagram of a dynamically scheduled CPU, annotated as follows]
  Preserves dependencies
  Hold pending operands
  Results also sent to any waiting reservation stations
  Reorder buffer for register writes
  Can supply operands for issued instructions

Three Stages of Tomasulo Algorithm
1. Issue: get an instruction from the FP op queue
  Stall if there is a structural hazard, i.e., no space in the rs.
  If a reservation station (rs) is free, the issue logic issues the instruction to the rs and reads operands into the rs if they are ready (register renaming, which solves WAR).
  Mark the destination register as waiting for this latest instruction, even if the previous instruction writing to this register hasn't completed (solves WAW hazards).
2. Execution: operate on operands (EX)
  When both operands are ready, execute; if not ready, watch the CDB for the result (solves RAW).
3. Write result: finish execution (WB)
  Write on the common data bus to all awaiting units; mark the reservation station available.
  Write the result into the destination register if its status is r (solves WAW).
Normal data bus: data + destination ("go to" bus)
CDB: data + source ("come from" bus)
  64 bits of data + 4 bits of functional unit source address
  Write if it matches the expected functional unit (the one that produces the result)
  Does broadcast

Reservation Station Components
Op: operation to perform in the unit (e.g., + or -)
Vj, Vk: values of the source operands
Qj, Qk: names of the RSs that will provide the source operands. A value of zero means the source operand is already available in Vj or Vk, or is not necessary.
Busy: indicates that the reservation station or FU is busy
Register file status Qi: indicates which functional unit will write each register, if one exists. Blank (0) when no pending instruction will write that register, meaning that the value is already available.
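The fields above, together with the common data bus broadcast, can be modeled with a small data structure. This is an illustrative sketch rather than the hardware's actual organization: `None` stands in for the "zero/blank" tag, the register status is a dict from register name to producing station, and the operand values 2.0 and 10.0 are made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # operand values, once available
    vk: Optional[float] = None
    qj: Optional[str] = None     # producing station; None means value is in vj
    qk: Optional[str] = None

    def ready(self):
        """An instruction may execute once neither operand is pending."""
        return self.busy and self.qj is None and self.qk is None


def broadcast(stations, register_status, registers, tag, value):
    """CDB write: station `tag` has produced `value`."""
    for rs in stations:
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
    # Only registers still waiting on this station are updated.
    for reg in [r for r, producer in register_status.items() if producer == tag]:
        registers[reg] = value
        del register_status[reg]


# MULTD F4,F0,F2 sits in Mult1 waiting for the load in Load1 to deliver F0.
registers = {"F2": 2.0}
register_status = {"F0": "Load1", "F4": "Mult1"}
mult1 = ReservationStation("Mult1", busy=True, op="MULTD", vk=2.0, qj="Load1")

broadcast([mult1], register_status, registers, "Load1", 10.0)
print(mult1.ready(), mult1.vj, registers["F0"], register_status)
```

Because waiting stations match on the producer's tag rather than on a register name, a later instruction that renames F0 would simply replace the `register_status` entry, and the old result would never be written to the register file.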

Tomasulo Loop Example
    Loop: LD    F0 0 R1
          MULTD F4 F0 F2
          SD    F4 0 R1
          SUBI  R1 R1 #8
          BNEZ  R1 Loop
Assume multiply takes 4 clocks.
Assume the first load takes 8 clocks (cache miss) and the second load takes 1 clock (hit).
To be clear, will show clocks for SUBI, BNEZ.
Reality: integer instructions ahead.

Loop Example
Instruction status:
ITER  Instruction     Issue  Exec Comp  Write Result
1     LD F0 0 R1      -      -          -
1     MULTD F4 F0 F2  -      -          -
1     SD F4 0 R1      -      -          -
2     LD F0 0 R1      -      -          -
2     MULTD F4 F0 F2  -      -          -
2     SD F4 0 R1      -      -          -
Load and store buffers:
Name    Busy  Addr  Fu
Load1   No    -     -
Load2   No    -     -
Load3   No    -     -
Store1  No    -     -
Store2  No    -     -
Store3  No    -     -
Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
-     Add1   No    -      -      -      -      -
-     Add2   No    -      -      -      -      -
-     Add3   No    -      -      -      -      -
-     Mult1  No    -      -      -      -      -
-     Mult2  No    -      -      -      -      -
Register result status (Clock 0, R1 = 80):
    F0     F2  F4     F6  F8  F10  F12  ...  F30
Fu  -      -   -      -   -   -    -         -

Loop Example Cycle 1
Instruction status:
ITER  Instruction     Issue  Exec Comp  Write Result
1     LD F0 0 R1      1      -          -
1     MULTD F4 F0 F2  -      -          -
1     SD F4 0 R1      -      -          -
2     LD F0 0 R1      -      -          -
2     MULTD F4 F0 F2  -      -          -
2     SD F4 0 R1      -      -          -
Load and store buffers:
Name    Busy  Addr  Fu
Load1   Yes   80    -
Load2   No    -     -
Load3   No    -     -
Store1  No    -     -
Store2  No    -     -
Store3  No    -     -
Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
-     Add1   No    -      -      -      -      -
-     Add2   No    -      -      -      -      -
-     Add3   No    -      -      -      -      -
-     Mult1  No    -      -      -      -      -
-     Mult2  No    -      -      -      -      -
Register result status (Clock 1, R1 = 80):
    F0     F2  F4     F6  F8  F10  F12  ...  F30
Fu  Load1  -   -      -   -   -    -         -

Loop Example Cycle 2
Instruction status:
ITER  Instruction     Issue  Exec Comp  Write Result
1     LD F0 0 R1      1      -          -
1     MULTD F4 F0 F2  2      -          -
1     SD F4 0 R1      -      -          -
2     LD F0 0 R1      -      -          -
2     MULTD F4 F0 F2  -      -          -
2     SD F4 0 R1      -      -          -
Load and store buffers:
Name    Busy  Addr  Fu
Load1   Yes   80    -
Load2   No    -     -
Load3   No    -     -
Store1  No    -     -
Store2  No    -     -
Store3  No    -     -
Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
-     Add1   No    -      -      -      -      -
-     Add2   No    -      -      -      -      -
-     Add3   No    -      -      -      -      -
-     Mult1  Yes   Multd  -      R(F2)  Load1  -
-     Mult2  No    -      -      -      -      -
Register result status (Clock 2, R1 = 80):
    F0     F2  F4     F6  F8  F10  F12  ...  F30
Fu  Load1  -   Mult1  -   -   -    -         -

Loop Example Cycle 3
Instruction status:
ITER  Instruction     Issue  Exec Comp  Write Result
1     LD F0 0 R1      1      -          -
1     MULTD F4 F0 F2  2      -          -
1     SD F4 0 R1      3      -          -
2     LD F0 0 R1      -      -          -
2     MULTD F4 F0 F2  -      -          -
2     SD F4 0 R1      -      -          -
Load and store buffers:
Name    Busy  Addr  Fu
Load1   Yes   80    -
Load2   No    -     -
Load3   No    -     -
Store1  Yes   80    Mult1
Store2  No    -     -
Store3  No    -     -
Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
-     Add1   No    -      -      -      -      -
-     Add2   No    -      -      -      -      -
-     Add3   No    -      -      -      -      -
-     Mult1  Yes   Multd  -      R(F2)  Load1  -
-     Mult2  No    -      -      -      -      -
Register result status (Clock 3, R1 = 80):
    F0     F2  F4     F6  F8  F10  F12  ...  F30
Fu  Load1  -   Mult1  -   -   -    -         -
Note: implicit renaming sets up the dataflow graph.

Loop Example Cycle 4
Instruction status:
ITER  Instruction     Issue  Exec Comp  Write Result
1     LD F0 0 R1      1      -          -
1     MULTD F4 F0 F2  2      -          -
1     SD F4 0 R1      3      -          -
2     LD F0 0 R1      -      -          -
2     MULTD F4 F0 F2  -      -          -
2     SD F4 0 R1      -      -          -
Load and store buffers:
Name    Busy  Addr  Fu
Load1   Yes   80    -
Load2   No    -     -
Load3   No    -     -
Store1  Yes   80    Mult1
Store2  No    -     -
Store3  No    -     -
Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
-     Add1   No    -      -      -      -      -
-     Add2   No    -      -      -      -      -
-     Add3   No    -      -      -      -      -
-     Mult1  Yes   Multd  -      R(F2)  Load1  -
-     Mult2  No    -      -      -      -      -
Register result status (Clock 4, R1 = 80):
    F0     F2  F4     F6  F8  F10  F12  ...  F30
Fu  Load1  -   Mult1  -   -   -    -         -
Note: dispatching the SUBI instruction.

Loop Example Cycle 5
Instruction status:
ITER  Instruction     Issue  Exec Comp  Write Result
1     LD F0 0 R1      1      -          -
1     MULTD F4 F0 F2  2      -          -
1     SD F4 0 R1      3      -          -
2     LD F0 0 R1      -      -          -
2     MULTD F4 F0 F2  -      -          -
2     SD F4 0 R1      -      -          -
Load and store buffers:
Name    Busy  Addr  Fu
Load1   Yes   80    -
Load2   No    -     -
Load3   No    -     -
Store1  Yes   80    Mult1
Store2  No    -     -
Store3  No    -     -
Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
-     Add1   No    -      -      -      -      -
-     Add2   No    -      -      -      -      -
-     Add3   No    -      -      -      -      -
-     Mult1  Yes   Multd  -      R(F2)  Load1  -
-     Mult2  No    -      -      -      -      -
Register result status (Clock 5, R1 = 72):
    F0     F2  F4     F6  F8  F10  F12  ...  F30
Fu  Load1  -   Mult1  -   -   -    -         -
Note: and the BNEZ instruction.

Loop Example Cycle 6
Instruction status:
ITER  Instruction     Issue  Exec Comp  Write Result
1     LD F0 0 R1      1      -          -
1     MULTD F4 F0 F2  2      -          -
1     SD F4 0 R1      3      -          -
2     LD F0 0 R1      6      -          -
2     MULTD F4 F0 F2  -      -          -
2     SD F4 0 R1      -      -          -
Load and store buffers:
Name    Busy  Addr  Fu
Load1   Yes   80    -
Load2   Yes   72    -
Load3   No    -     -
Store1  Yes   80    Mult1
Store2  No    -     -
Store3  No    -     -
Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
-     Add1   No    -      -      -      -      -
-     Add2   No    -      -      -      -      -
-     Add3   No    -      -      -      -      -
-     Mult1  Yes   Multd  -      R(F2)  Load1  -
-     Mult2  No    -      -      -      -      -
Register result status (Clock 6, R1 = 72):
    F0     F2  F4     F6  F8  F10  F12  ...  F30
Fu  Load2  -   Mult1  -   -   -    -         -
Note: F0 never sees the load from location 80.

Loop Example Cycle 7
Instruction status:
ITER  Instruction     Issue  Exec Comp  Write Result
1     LD F0 0 R1      1      -          -
1     MULTD F4 F0 F2  2      -          -
1     SD F4 0 R1      3      -          -
2     LD F0 0 R1      6      -          -
2     MULTD F4 F0 F2  7      -          -
2     SD F4 0 R1      -      -          -
Load and store buffers:
Name    Busy  Addr  Fu
Load1   Yes   80    -
Load2   Yes   72    -
Load3   No    -     -
Store1  Yes   80    Mult1
Store2  No    -     -
Store3  No    -     -
Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
-     Add1   No    -      -      -      -      -
-     Add2   No    -      -      -      -      -
-     Add3   No    -      -      -      -      -
-     Mult1  Yes   Multd  -      R(F2)  Load1  -
-     Mult2  Yes   Multd  -      R(F2)  Load2  -
Register result status (Clock 7, R1 = 72):
    F0     F2  F4     F6  F8  F10  F12  ...  F30
Fu  Load2  -   Mult2  -   -   -    -         -
Note: the register file is completely detached from computation. The first and second iterations are completely overlapped.

Loop Example Cycle 8
Instruction status:
ITER  Instruction     Issue  Exec Comp  Write Result
1     LD F0 0 R1      1      -          -
1     MULTD F4 F0 F2  2      -          -
1     SD F4 0 R1      3      -          -
2     LD F0 0 R1      6      -          -
2     MULTD F4 F0 F2  7      -          -
2     SD F4 0 R1      8      -          -
Load and store buffers:
Name    Busy  Addr  Fu
Load1   Yes   80    -
Load2   Yes   72    -
Load3   No    -     -
Store1  Yes   80    Mult1
Store2  Yes   72    Mult2
Store3  No    -     -
Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
-     Add1   No    -      -      -      -      -
-     Add2   No    -      -      -      -      -
-     Add3   No    -      -      -      -      -
-     Mult1  Yes   Multd  -      R(F2)  Load1  -
-     Mult2  Yes   Multd  -      R(F2)  Load2  -
Register result status (Clock 8, R1 = 72):
    F0     F2  F4     F6  F8  F10  F12  ...  F30
Fu  Load2  -   Mult2  -   -   -    -         -

Loop Example Cycle 9
Instruction status:
ITER  Instruction     Issue  Exec Comp  Write Result
1     LD F0 0 R1      1      9          -
1     MULTD F4 F0 F2  2      -          -
1     SD F4 0 R1      3      -          -
2     LD F0 0 R1      6      -          -
2     MULTD F4 F0 F2  7      -          -
2     SD F4 0 R1      8      -          -
Load and store buffers:
Name    Busy  Addr  Fu
Load1   Yes   80    -
Load2   Yes   72    -
Load3   No    -     -
Store1  Yes   80    Mult1
Store2  Yes   72    Mult2
Store3  No    -     -
Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
-     Add1   No    -      -      -      -      -
-     Add2   No    -      -      -      -      -
-     Add3   No    -      -      -      -      -
-     Mult1  Yes   Multd  -      R(F2)  Load1  -
-     Mult2  Yes   Multd  -      R(F2)  Load2  -
Register result status (Clock 9, R1 = 72):
    F0     F2  F4     F6  F8  F10  F12  ...  F30
Fu  Load2  -   Mult2  -   -   -    -         -
Note: Load1 completing; who is waiting? Dispatching SUBI.

Loop Example Cycle 10
Instruction status:
ITER  Instruction     Issue  Exec Comp  Write Result
1     LD F0 0 R1      1      9          10
1     MULTD F4 F0 F2  2      -          -
1     SD F4 0 R1      3      -          -
2     LD F0 0 R1      6      10         -
2     MULTD F4 F0 F2  7      -          -
2     SD F4 0 R1      8      -          -
Load and store buffers:
Name    Busy  Addr  Fu
Load1   No    -     -
Load2   Yes   72    -
Load3   No    -     -
Store1  Yes   80    Mult1
Store2  Yes   72    Mult2
Store3  No    -     -
Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
-     Add1   No    -      -      -      -      -
-     Add2   No    -      -      -      -      -
-     Add3   No    -      -      -      -      -
4     Mult1  Yes   Multd  M[80]  R(F2)  -      -
-     Mult2  Yes   Multd  -      R(F2)  Load2  -
Register result status (Clock 10, R1 = 64):
    F0     F2  F4     F6  F8  F10  F12  ...  F30
Fu  Load2  -   Mult2  -   -   -    -         -
Note: Load2 completing; who is waiting? Dispatching BNEZ.

Loop Example Cycle 11
Instruction status:
ITER  Instruction     Issue  Exec Comp  Write Result
1     LD F0 0 R1      1      9          10
1     MULTD F4 F0 F2  2      -          -
1     SD F4 0 R1      3      -          -
2     LD F0 0 R1      6      10         11
2     MULTD F4 F0 F2  7      -          -
2     SD F4 0 R1      8      -          -
Load and store buffers:
Name    Busy  Addr  Fu
Load1   No    -     -
Load2   No    -     -
Load3   Yes   64    -
Store1  Yes   80    Mult1
Store2  Yes   72    Mult2
Store3  No    -     -
Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
-     Add1   No    -      -      -      -      -
-     Add2   No    -      -      -      -      -
-     Add3   No    -      -      -      -      -
3     Mult1  Yes   Multd  M[80]  R(F2)  -      -
4     Mult2  Yes   Multd  M[72]  R(F2)  -      -
Register result status (Clock 11, R1 = 64):
    F0     F2  F4     F6  F8  F10  F12  ...  F30
Fu  Load3  -   Mult2  -   -   -    -         -
Note: next load in sequence.

Loop Example Cycle 12
Instruction status:
ITER  Instruction     Issue  Exec Comp  Write Result
1     LD F0 0 R1      1      9          10
1     MULTD F4 F0 F2  2      -          -
1     SD F4 0 R1      3      -          -
2     LD F0 0 R1      6      10         11
2     MULTD F4 F0 F2  7      -          -
2     SD F4 0 R1      8      -          -
Load and store buffers:
Name    Busy  Addr  Fu
Load1   No    -     -
Load2   No    -     -
Load3   Yes   64    -
Store1  Yes   80    Mult1
Store2  Yes   72    Mult2
Store3  No    -     -
Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
-     Add1   No    -      -      -      -      -
-     Add2   No    -      -      -      -      -
-     Add3   No    -      -      -      -      -
2     Mult1  Yes   Multd  M[80]  R(F2)  -      -
3     Mult2  Yes   Multd  M[72]  R(F2)  -      -
Register result status (Clock 12, R1 = 64):
    F0     F2  F4     F6  F8  F10  F12  ...  F30
Fu  Load3  -   Mult2  -   -   -    -         -
Note: why not issue the third multiply?

Loop Example Cycle 13
Instruction status:
ITER  Instruction     Issue  Exec Comp  Write Result
1     LD F0 0 R1      1      9          10
1     MULTD F4 F0 F2  2      -          -
1     SD F4 0 R1      3      -          -
2     LD F0 0 R1      6      10         11
2     MULTD F4 F0 F2  7      -          -
2     SD F4 0 R1      8      -          -
Load and store buffers:
Name    Busy  Addr  Fu
Load1   No    -     -
Load2   No    -     -
Load3   Yes   64    -
Store1  Yes   80    Mult1
Store2  Yes   72    Mult2
Store3  No    -     -
Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
-     Add1   No    -      -      -      -      -
-     Add2   No    -      -      -      -      -
-     Add3   No    -      -      -      -      -
1     Mult1  Yes   Multd  M[80]  R(F2)  -      -
2     Mult2  Yes   Multd  M[72]  R(F2)  -      -
Register result status (Clock 13, R1 = 64):
    F0     F2  F4     F6  F8  F10  F12  ...  F30
Fu  Load3  -   Mult2  -   -   -    -         -

Loop Example Cycle 14
Instruction status:
ITER  Instruction     Issue  Exec Comp  Write Result
1     LD F0 0 R1      1      9          10
1     MULTD F4 F0 F2  2      14         -
1     SD F4 0 R1      3      -          -
2     LD F0 0 R1      6      10         11
2     MULTD F4 F0 F2  7      -          -
2     SD F4 0 R1      8      -          -
Load and store buffers:
Name    Busy  Addr  Fu
Load1   No    -     -
Load2   No    -     -
Load3   Yes   64    -
Store1  Yes   80    Mult1
Store2  Yes   72    Mult2
Store3  No    -     -
Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
-     Add1   No    -      -      -      -      -
-     Add2   No    -      -      -      -      -
-     Add3   No    -      -      -      -      -
0     Mult1  Yes   Multd  M[80]  R(F2)  -      -
1     Mult2  Yes   Multd  M[72]  R(F2)  -      -
Register result status (Clock 14, R1 = 64):
    F0     F2  F4     F6  F8  F10  F12  ...  F30
Fu  Load3  -   Mult2  -   -   -    -         -
Note: Mult1 completing. Who is waiting?

Loop Example Cycle 15
Instruction status:
ITER  Instruction     Issue  Exec Comp  Write Result
1     LD F0 0 R1      1      9          10
1     MULTD F4 F0 F2  2      14         15
1     SD F4 0 R1      3      -          -
2     LD F0 0 R1      6      10         11
2     MULTD F4 F0 F2  7      15         -
2     SD F4 0 R1      8      -          -
Load and store buffers:
Name    Busy  Addr  Fu
Load1   No    -     -
Load2   No    -     -
Load3   Yes   64    -
Store1  Yes   80    [80]*R2
Store2  Yes   72    Mult2
Store3  No    -     -
Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
-     Add1   No    -      -      -      -      -
-     Add2   No    -      -      -      -      -
-     Add3   No    -      -      -      -      -
-     Mult1  No    -      -      -      -      -
0     Mult2  Yes   Multd  M[72]  R(F2)  -      -
Register result status (Clock 15, R1 = 64):
    F0     F2  F4     F6  F8  F10  F12  ...  F30
Fu  Load3  -   Mult2  -   -   -    -         -
Note: Mult2 completing. Who is waiting?

Loop Example Cycle 16
Instruction status:
ITER  Instruction     Issue  Exec Comp  Write Result
1     LD F0 0 R1      1      9          10
1     MULTD F4 F0 F2  2      14         15
1     SD F4 0 R1      3      -          -
2     LD F0 0 R1      6      10         11
2     MULTD F4 F0 F2  7      15         16
2     SD F4 0 R1      8      -          -
Load and store buffers:
Name    Busy  Addr  Fu
Load1   No    -     -
Load2   No    -     -
Load3   Yes   64    -
Store1  Yes   80    [80]*R2
Store2  Yes   72    [72]*R2
Store3  No    -     -
Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
-     Add1   No    -      -      -      -      -
-     Add2   No    -      -      -      -      -
-     Add3   No    -      -      -      -      -
-     Mult1  Yes   Multd  -      R(F2)  Load3  -
-     Mult2  No    -      -      -      -      -
Register result status (Clock 16, R1 = 64):
    F0     F2  F4     F6  F8  F10  F12  ...  F30
Fu  Load3  -   Mult1  -   -   -    -         -

Loop Example Cycle 17
Instruction status:
ITER  Instruction     Issue  Exec Comp  Write Result
1     LD F0 0 R1      1      9          10
1     MULTD F4 F0 F2  2      14         15
1     SD F4 0 R1      3      -          -
2     LD F0 0 R1      6      10         11
2     MULTD F4 F0 F2  7      15         16
2     SD F4 0 R1      8      -          -
Load and store buffers:
Name    Busy  Addr  Fu
Load1   No    -     -
Load2   No    -     -
Load3   Yes   64    -
Store1  Yes   80    [80]*R2
Store2  Yes   72    [72]*R2
Store3  Yes   64    Mult1
Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
-     Add1   No    -      -      -      -      -
-     Add2   No    -      -      -      -      -
-     Add3   No    -      -      -      -      -
-     Mult1  Yes   Multd  -      R(F2)  Load3  -
-     Mult2  No    -      -      -      -      -
Register result status (Clock 17, R1 = 64):
    F0     F2  F4     F6  F8  F10  F12  ...  F30
Fu  Load3  -   Mult1  -   -   -    -         -