Computer Architecture. Lecture 4: Instruction-Level Parallelism II. Chao Li, PhD.

Computer Architecture. Lecture 4: Instruction-Level Parallelism II. Chao Li, PhD. SJTU-SE346, Spring 2018

Review
- Hazards (data/name/control): RAW, WAR, WAW hazards
- Different types of functional units
- Dynamic scheduling and out-of-order execution
- Scoreboarding approach vs. Tomasulo's approach

Outline
- Superscalar Pipeline
- Speculation
- VLIW and EPIC

Limitations of Scalar Pipelines
A scalar instruction pipeline is a single pipeline with multiple stages organized in a linear, sequential order. It has three key limitations:
- Upper bound on throughput: only one instruction per machine cycle, and deeper pipelining has limited cost-effectiveness
- Inefficient unification into a single pipeline: different instructions need different hardware resources and have long or variable latencies
- Rigid execution: one stalled instruction stalls all trailing instructions (backward propagation)

Improving Simple Scalar Pipelines
Three extensions alleviate the three limitations:
- Ext. 1: Make it parallel (superscalar pipelines): fetch multiple instructions in every machine cycle and decode them in parallel
- Ext. 2: Make it diversified: multiple, heterogeneous FUs in the EX stage, with an appropriate number and mix of functional units
- Ext. 3: Make it dynamic: trailing instructions can bypass a stalled leading instruction, using complex multi-entry buffers for instructions in flight

Multiple-Issue Processors
To get CPI < 1, the CPU must be capable of issuing more than one instruction per cycle. In a k-issue processor, each stage accepts k instructions:
- Alpha 21264: four-issue
- MIPS R10000: four-issue
- Intel P4: three-issue
Multi-issue processors come in two basic flavors: superscalar processors and VLIW (very long instruction word) processors.

From Scalar to Superscalar
- 5-stage i486 scalar pipeline: IF, D1, D2, EX, WB
- 5-stage Pentium superscalar pipeline: the same stages with width = 2 (two instructions per stage)
Performance speedup:
- Scalar: measured with respect to a non-pipelined design; primarily determined by the depth of the scalar pipeline
- Superscalar: measured with respect to a scalar pipeline; primarily determined by the width of the parallel pipeline
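The two speedup definitions above can be made concrete with a small back-of-the-envelope model (a sketch under idealized assumptions: no hazards or stalls; the function names are mine):

```python
import math

def scalar_speedup(depth, n_instructions):
    """Speedup of a depth-stage scalar pipeline over a non-pipelined design.

    Non-pipelined: depth cycles per instruction.
    Pipelined: depth cycles to fill, then one instruction per cycle.
    """
    unpipelined = depth * n_instructions
    pipelined = depth + (n_instructions - 1)
    return unpipelined / pipelined

def superscalar_speedup(width, n_instructions, depth=5):
    """Speedup of a width-wide superscalar pipeline over the scalar pipeline."""
    scalar = depth + (n_instructions - 1)
    superscalar = depth + math.ceil(n_instructions / width) - 1
    return scalar / superscalar
```

For a long instruction stream the first ratio approaches the pipeline depth and the second approaches the pipeline width, matching the slide's two observations.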

Superscalar Pipeline Example
A parallel, diversified pipeline: width = 3, with 4 pipelined FUs (ALU; MEM1-MEM2; FP1-FP3; BR) and 2 multi-entry buffers:
- Dispatch Buffer: between the in-order front end (IF, ID) and out-of-order execution; may be centralized (Pentium Pro) or distributed (PowerPC 604)
- Completion Buffer: restores in-order completion at WB

Classifying ILP Machines
Baseline machine (simple scalar; stages IF, ID, EX, WB):
- Instruction-level parallelism required to fully utilize it = 1 instruction/cycle
- Instructions issued per cycle (IPC) = 1
- Simple operation latency = 1

Classifying ILP Machines
Superscalar machine (width n):
- Instruction-level parallelism required to fully utilize it = n instructions/cycle
- Instructions issued per cycle (IPC) = n
- Simple operation latency = 1

Classifying ILP Machines
Superpipelined machine (degree m):
- Instruction-level parallelism required to fully utilize it = m instructions per base cycle
- IPC = 1, but the cycle time is 1/m of the base machine's
- Simple operation latency = m (minor) cycles

Classifying ILP Machines
VLIW machine:
- Instruction-level parallelism required to fully utilize it = n operations/cycle
- IPC = n operations/cycle = 1 VLIW instruction/cycle (e.g., EX executes 3 ops at once)
- Simple operation latency = 1

Superscalar vs. Superpipelined
- Roughly equivalent performance: if n = m, both have about the same throughput
- Parallelism is exposed in different ways: spatial (superscalar) vs. temporal (superpipelined)
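The rough equivalence of an n-wide superscalar and a degree-m superpipelined machine can be checked numerically (an idealized sketch with no hazards; the function names are mine):

```python
import math

def superscalar_time(n, instrs, stages=4):
    """Completion time, in base cycles, of an n-wide superscalar pipeline."""
    return stages + math.ceil(instrs / n) - 1

def superpipelined_time(m, instrs, stages=4):
    """Completion time, in base cycles, of a degree-m superpipelined machine:
    each base cycle is subdivided m ways, so one instruction issues
    every 1/m base cycle."""
    return stages + (instrs - 1) / m
```

With n = m, both times converge to instrs/n base cycles for long instruction streams: spatial and temporal parallelism deliver the same throughput.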

Outline
- Superscalar Pipeline
- Speculation
- VLIW and EPIC

Branch Prediction
About 17% of the instructions in gcc are conditional branches.

Instruction   Taken known?       Target known?
BEQZ/BNEZ     After Reg. Fetch   After Inst. Fetch
J             Always taken       After Inst. Fetch
JR            Always taken       After Reg. Fetch

- Two fundamental components: branch target speculation and branch condition speculation
- Branch penalties limit performance: Performance = f(accuracy, cost of misprediction)
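The relation Performance = f(accuracy, cost of misprediction) can be illustrated with a simple CPI model (a hypothetical example: the 17% branch fraction is the gcc figure from the slide, while the accuracy and penalty values are made-up inputs):

```python
def effective_cpi(base_cpi, branch_frac, accuracy, penalty):
    """Add the average misprediction cost to the base CPI:
    each branch mispredicts with probability (1 - accuracy)."""
    return base_cpi + branch_frac * (1 - accuracy) * penalty

# e.g. base CPI 1.0, 17% branches, 90% accuracy, 3-cycle penalty:
# effective CPI = 1.0 + 0.17 * 0.10 * 3 = 1.051
```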

Static Branch Prediction
Static: predict branch behavior at compile time (an individual branch is often highly biased toward taken or not taken)
- Predict every branch as always taken
- Predict on the basis of branch direction: backward branches (backward JZ) predicted taken, forward branches (forward JZ) predicted not taken
- Predict on the basis of profile information collected from earlier runs
Dynamic: predict based on the observed behavior of each branch
- Predictions for a branch may change over the life of a program
- HW support: branch history tables, branch target buffers, etc.

History-based Prediction
- k bits of the fetch PC index a 2^k-entry Branch History Table (BHT) with 2 bits per entry, in parallel with the instruction cache access
- The BHT supplies the taken/not-taken prediction; the target PC is formed from the opcode and offset (PC + offset)
- A 4K-entry BHT with 2 bits/entry yields roughly 80-90% correct predictions

2-bit Branch Prediction
- Four states: 11 and 10 predict taken; 01 and 00 predict not taken
- A taken branch moves the counter toward 11; a not-taken branch moves it toward 00
- The prediction changes only after two consecutive mispredictions
n-bit branch prediction?
- n > 2: not much more accurate than the 2-bit version
- n == 1: will likely predict incorrectly twice per loop (once at exit, once at reentry)
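The four-state FSM above maps directly onto a saturating counter (a minimal sketch; the class name is mine):

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 3 (11) and 2 (10) predict taken,
    states 1 (01) and 0 (00) predict not taken."""

    def __init__(self, state=3):
        self.state = state

    def predict(self):
        return self.state >= 2          # True = predict taken

    def update(self, taken):
        # Saturate at 3 (strongly taken) and 0 (strongly not taken)
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)
```

Starting from strongly taken, a single not-taken outcome does not flip the prediction; only the second consecutive misprediction does, which is exactly the slide's point.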

Branch Target Buffer
Each BTB entry contains two fields:
- Branch instruction address (BIA)
- Branch target address (BTA)
The fetch PC is matched against the BIA field in parallel with the I-cache access; on a hit, the BTA field provides the speculative target address while a branch-history FSM predicts taken/not-taken.

Misprediction Recovery
- In-order processor: kill all instructions in the pipeline behind the branch; no instruction issued after the branch can write back before the branch resolves
- Out-of-order processor: multiple instructions following the branch in program order can complete before the branch resolves

Precise Exceptions
An exception is precise if:
- It is associated with a particular faulting instruction Fi
- All instructions prior to Fi (in program order) complete execution
- Fi and all subsequent instructions appear to have never started
Precise exceptions and misprediction recovery are the same problem. Maintaining precise exceptions implies in-order completion, which requires a HW buffer for the results of uncommitted instructions that preserves:
- the order in which instructions are completed
- the order in which memory is accessed

Discussion
Although instructions executed out of order have no visible architectural effect on registers or memory, they do have micro-architectural side effects.

Hardware-based Speculation
Hardware-based speculation combines:
- Dynamic branch prediction
- Dynamic scheduling across basic blocks
- Out-of-order execution with precise exceptions (in-order commit)
Commit: write an instruction's result to the architectural state. An instruction graduates when it is the oldest in flight and its result is valid.

Reorder Buffer (ROB)
A completion buffer managed as a circular queue (FIFO):
- Contains all instructions that are in flight
- Holds each result between completion and commit
ROB fields:
- Instruction type: branch/store/ALU
- Destination: register number or memory address
- Value: holds the result until the instruction commits
Two pointers: the tail points to the next entry to be allocated; the head points to the next instruction to commit.
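The circular-queue management described above can be sketched as follows (a simplified model, not any particular machine's design; the field and method names are mine):

```python
class ROB:
    """Reorder buffer: allocate at the tail, commit in order from the head."""

    def __init__(self, size):
        self.size = size
        self.entries = [None] * size
        self.head = 0      # next instruction to commit
        self.tail = 0      # next entry to be allocated
        self.count = 0

    def allocate(self, itype, dest):
        """Issue: reserve the next entry; return its tag, or None if full (stall)."""
        if self.count == self.size:
            return None
        tag = self.tail
        self.entries[tag] = {"type": itype, "dest": dest,
                             "value": None, "ready": False}
        self.tail = (self.tail + 1) % self.size
        self.count += 1
        return tag

    def complete(self, tag, value):
        """Write: record the result; it stays buffered until commit."""
        self.entries[tag]["value"] = value
        self.entries[tag]["ready"] = True

    def commit(self):
        """Commit the head entry if ready; otherwise wait (return None)."""
        if self.count == 0 or not self.entries[self.head]["ready"]:
            return None
        entry = self.entries[self.head]
        self.entries[self.head] = None
        self.head = (self.head + 1) % self.size
        self.count -= 1
        return (entry["dest"], entry["value"])
```

Even if a younger instruction finishes first, `commit()` refuses to retire it until the older head entry is ready, which is what makes the speculation recoverable.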

Tomasulo + ROB
Tomasulo's hardware extended to support speculation, as in:
- PowerPC 603/604/G3/G4
- MIPS R10000/R12000
- Intel Pentium II/III/4
- Alpha 21264
- AMD K5/K6/Athlon
The combined ROB + reservation-station structure is called the instruction window; its size determines the degree of ILP the machine can achieve. Four stages are now needed: issue, execute, write, commit.

Tomasulo+ROB Example
The slides step through cycles 1-23 for this loop (table snapshots condensed below):

LP: I1 ADD.D F4, F2, F0
    I2 MUL.D F8, F4, F2
    I3 ADD.D F6, F8, F6
    I4 SUB.D F8, F2, F0
    I5 SUB.I
    I6 BNEZ

Initial FP registers: F0 = 0.0, F2 = 2.0, F4 = 4.0, F6 = 6.0, F8 = 8.0

- Cycles 1-6: I1-I6 issue in order, each allocating the next ROB entry (rob0-rob5) and renaming its destination register to that entry; I2 and I3 wait in reservation stations for the rob0 and rob1 operands
- Cycle 5: I1's result (2.0) broadcasts with tag rob0, and waiting entries capture it; I1 commits once it reaches the head
- Cycles 7-8: the predicted-taken branch lets the second iteration issue (its I1 takes rob6, its I2 takes rob0); renaming F8 to rob0 overwrites the rob3 mapping, but I4's value is preserved in the ROB
- Cycle 9: the ROB is full, so issue stalls
- Cycles 10-17: remaining results broadcast (rob6 = 2.0, rob1 = 4.0, ...), and completed head entries commit in order
- Cycle 18: the first iteration's BNEZ resolves as mispredicted (not taken)
- Cycle 19: all entries after the branch (the second iteration's instructions) are flushed
- Cycles 20-23: the surviving entries commit in order (I3 writes F6 = 10.0, I4 writes F8 = 2.0, then I5 and I6); final state: F4 = 2.0, F6 = 10.0, F8 = 2.0

Outline
- Superscalar Pipeline
- Speculation
- VLIW and EPIC

VLIW Processors
VLIW (Very Long Instruction Word) processors:
- A fixed number of operations are formatted as one big instruction (typically 3 operations today, e.g., Intel Itanium)
- A few startup companies attempted to commercialize VLIW architectures in the 1980s
Goal: high performance with lower hardware complexity:
- Less multiple-issue hardware
- Simpler instruction dispatch
- No structural hazard checking logic

Architecture Comparison
- Inst size: CISC varies; RISC fixed (typically 32 bits); VLIW fixed
- Inst format: CISC field placement varies; RISC and VLIW use regular, consistent placement of fields
- Inst semantics: CISC complex, possibly many dependent ops per instruction; RISC almost always one simple operation; VLIW multiple, independent simple operations
- Registers: CISC few, sometimes special; RISC and VLIW many, general-purpose
- Memory: CISC reg-mem architecture; RISC and VLIW load/store architecture
- Hardware: CISC exploits microcode; RISC and VLIW are not microcoded

Compiler Support for VLIW Processors
The complexity of scheduling is moved into the compiler, which creates each VLIW word, detects hazards, hides latencies, and optimizes to fill the slots in each instruction:
- Structural hazards: no two operations to the same functional unit; no two operations to the same memory bank
- Data hazards: no data hazards among the instructions in a bundle
- Control hazards: predicated execution and static branch prediction
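The structural-hazard rule above (no more operations per bundle than a functional-unit class has slots) is the core of bundle formation. A toy packer might look like this (a greedy sketch that ignores data hazards and latencies; all names are mine):

```python
def pack_bundles(ops, slots):
    """Greedily pack operations into VLIW bundles.

    ops: functional-unit class of each operation, in program order.
    slots: per-bundle slot budget for each class, e.g. {"mem": 2, "fp": 2}.
    Only structural slot limits are checked, not data dependences.
    """
    bundles = []
    current, used = [], {}
    for fu in ops:
        if used.get(fu, 0) >= slots.get(fu, 0):
            # This FU class is exhausted in the current bundle: start a new one
            bundles.append(current)
            current, used = [], {}
        current.append(fu)
        used[fu] = used.get(fu, 0) + 1
    if current:
        bundles.append(current)
    return bundles
```

A real VLIW compiler also reorders operations and applies loop unrolling or software pipelining to keep the slots full, but the slot-budget check is the same.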

Loop Unrolling
Loop unrolling replicates the loop body multiple times, adjusting the loop termination code.
- Without any scheduling: (per-cycle issue table not preserved in the transcription)
- Unrolled four times and scheduled: 14 clock cycles, or 3.5 per iteration

Loop Unrolling in VLIW

clk  Memory reference 1   Memory reference 2   FP operation 1      FP operation 2      Int. op / branch
1    LD F0, 0(R1)         LD F6, -8(R1)
2    LD F10, -16(R1)      LD F14, -24(R1)
3    LD F18, -32(R1)      LD F22, -40(R1)      ADDD F4, F0, F2     ADDD F8, F6, F2
4    LD F26, -48(R1)                           ADDD F12, F10, F2   ADDD F16, F14, F2
5                                              ADDD F20, F18, F2   ADDD F24, F22, F2
6    SD 0(R1), F4         SD -8(R1), F8        ADDD F28, F26, F2
7    SD -16(R1), F12      SD -24(R1), F16
8    SD -32(R1), F20      SD -40(R1), F24                                              SUBI R1, R1, #48
9    SD -0(R1), F28                                                                    BNEZ R1, LOOP

- Unrolled 7 times to avoid delays
- 7 results in 9 cycles, or 1.3 clocks per iteration (a 1.8x speedup)
- Average: 2.5 ops per clock; 50% slot efficiency
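The slide's efficiency numbers follow from simple arithmetic (the 23-operation count is my tally of the filled slots in the table; the function name is mine):

```python
def vliw_stats(cycles, iterations, total_ops, issue_width):
    """Clocks per iteration and slot-utilization efficiency of a VLIW schedule."""
    clocks_per_iter = cycles / iterations
    efficiency = total_ops / (cycles * issue_width)
    return clocks_per_iter, efficiency

# 9 cycles, 7 iterations, 23 operations on a 5-slot machine:
# about 1.3 clocks per iteration, with roughly half the slots filled
```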

Explicitly Parallel Instruction Computing
EPIC (a style of architecture) is an evolution of VLIW that has absorbed many of the best ideas of superscalar processors. The EPIC philosophy:
- Provide the ability to design the desired plan of execution (POE) at compile time
- Provide features that permit the compiler to "play the statistics"
- Provide the ability to communicate the POE to the hardware
Intel/HP Itanium (IA-64) follows the EPIC philosophy: 6-wide, 10-stage pipeline at 800 MHz.

Summary
Superscalar Pipeline
- Limitations of scalar processors; basic features of superscalar pipelines
- Multi-issue processors; dispatch buffer and completion buffer
- Classification of ILP machines
- Rationale of branch prediction; 2-bit prediction
Speculation
- Precise exceptions
- Reorder buffer (ROB); Tomasulo's algorithm with a ROB
VLIW and EPIC
- CISC vs. RISC vs. VLIW
- Loop unrolling
- The concept of EPIC