What is ILP? Instruction Level Parallelism. Where do we find ILP? How do we expose ILP?

What is ILP?

Instruction Level Parallelism, or "Declaration of Independence":
- The characteristic of a program that certain instructions are independent, and can potentially be executed in parallel.
- Any mechanism that creates, identifies, or exploits the independence of instructions, allowing them to be executed in parallel.

Why do we want/need ILP? In a superscalar architecture? What about a scalar architecture?

Where do we find ILP?
- In basic blocks? 15-20% of (dynamic) instructions are branches in typical code, so basic blocks are short.
- Across basic blocks? How?

How do we expose ILP? By moving instructions around. How?

for (i = 1; i <= 1000; i++)
    x[i] = x[i] * s;

Exposing ILP in Software
- Instruction scheduling (changes ILP within a basic block)
- Loop unrolling (allows ILP across iterations by putting instructions from multiple iterations in the same basic block)
- Others (trace scheduling, software pipelining) we'll talk about later

A sample loop:

Loop: LD   F0, 0(R1)   ;F0=array element, R1=X[]
      MULD F4, F0, F2  ;multiply scalar in F2
      SD   F4, 0(R1)   ;store result
      ADDI R1, R1, 8   ;increment pointer 8B (DW)
      SLT  R3, R1, R2  ;R2 = &X[1001]
      BNEZ R3, Loop    ;branch R3!=zero
      NOP              ;delayed branch slot

Where are the dependencies and stalls?

Operation   Latency (stalls)
FP Mult     6 (5)
LD          2 (1)
Int ALU     1 (0)

The next two techniques applied to this loop are instruction scheduling and loop unrolling.
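Loop unrolling can also be shown at the source level. A minimal C sketch (function names are illustrative; the unroll factor of 4 assumes the trip count divides evenly, as 1000 does):

```c
#define N 1000

/* Original: one multiply per iteration, plus loop overhead each time. */
void scale(double *x, double s) {
    for (int i = 0; i < N; i++)
        x[i] = x[i] * s;
}

/* Unrolled by 4: the four multiplies in the body are independent, so a
 * scheduler can overlap them, and the loop overhead (increment,
 * compare, branch) is paid once per four elements instead of once per
 * element. Assumes N is divisible by 4. */
void scale_unrolled(double *x, double s) {
    for (int i = 0; i < N; i += 4) {
        x[i]     = x[i]     * s;
        x[i + 1] = x[i + 1] * s;
        x[i + 2] = x[i + 2] * s;
        x[i + 3] = x[i + 3] * s;
    }
}
```

Both versions compute the same result; the unrolled body simply gives the compiler (or hardware) a larger basic block to schedule.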

Register Renaming

SD   0(R1), F4
LD   F10, 8(R1)
ADDI R1, R1, 16
MULD F14, F10, F2
SD   -8(R1), F14

LD   F10, 8(R1)
MULD F14, F10, F2
ADDI R1, R1, 16
SD   0(R1), F4
SD   -8(R1), F14

Remember: dependences are a property of the code; whether a dependence becomes a HW hazard depends on the given pipeline.

The compiler must respect (true) data dependences (RAW):
- Instruction i produces a result used by instruction j, or
- Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.

Easy to determine for registers (fixed names); hard for memory:
- Does 100(R4) = 20(R6)?
- From different loop iterations, does 20(R6) = 20(R6)?

Other kinds of dependence, also called name (false) dependences: two instructions use the same name but don't exchange data.
- Antidependence (WAR): instruction j writes a register or memory location that instruction i reads, and instruction i is executed first.
- Output dependence (WAW): instructions i and j write the same register or memory location; the ordering between the instructions must be preserved.
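The same effect shows up at the source level. In this illustrative C sketch, reusing one temporary creates WAR/WAW name dependences that serialize the statements; giving each value its own name removes them, leaving only true RAW dependences:

```c
/* With a single temporary t, statement 3 must not write t until
 * statement 1 (WAW on t) and statement 2 (WAR on t) are done,
 * even though the two multiplies don't exchange data. */
double sum_of_products_serial(const double *x, const double *y) {
    double t;
    double r = 0.0;
    t = x[0] * y[0];   /* 1: write t          */
    r += t;            /* 2: read t (RAW)     */
    t = x[1] * y[1];   /* 3: write t again -- name dependence */
    r += t;
    return r;
}

/* "Renamed": t0 and t1 are distinct names, so the two multiplies are
 * independent and may execute in parallel; only the RAW dependences
 * into the final add remain. */
double sum_of_products_renamed(const double *x, const double *y) {
    double t0 = x[0] * y[0];
    double t1 = x[1] * y[1];
    return t0 + t1;
}
```

This is exactly what register renaming does for the unrolled loop above: F10/F14 replace a second use of F0/F4.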

Name Dependences Are Also Harder for Memory Accesses
- Does 100(R4) = 20(R6)?
- From different loop iterations, does 20(R6) = 20(R6)?
- Our example required the compiler to know that if R1 doesn't change, then 0(R1) != -8(R1): there were no dependences between some loads and stores, so they could be moved past each other.

Control Dependences

Compilers must also preserve control dependences. Example:

if (c1) I1;
if (c2) I2;

I1 is control dependent on c1, and I2 is control dependent on c2 but not on c1.

Two (obvious) constraints on control dependences:
1. An instruction that is control dependent on a branch cannot be moved before the branch in such a way that its execution is no longer controlled by the branch.
2. An instruction that is not control dependent on a branch cannot be moved after the branch in such a way that its execution becomes controlled by the branch.

Code motion can be done in SW or HW. Why SW? Why HW?

Control dependences may be relaxed to get parallelism, as long as we get the same effect: preserve the order of exceptions and the data flow.
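The first constraint can be illustrated in C (names are illustrative). Hoisting the division above its guarding branch would change behavior exactly when the guard would have skipped it:

```c
/* The divide is control dependent on (b != 0). Moving it above the
 * branch would execute a / b even when b == 0, which for integer
 * division can trap -- an exception the original code never raises.
 * This is why control dependences constrain code motion unless the
 * mover can preserve exception behavior (e.g., via speculation
 * support). */
int guarded_div(int a, int b) {
    int r = 0;
    if (b != 0)
        r = a / b;   /* control dependent on the branch */
    return r;
}
```

Relaxing this safely is what "preserve order of exceptions and data flow" refers to: HW speculation or compiler speculation may move the divide, but must suppress or defer any spurious exception.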

Key Points
- You can find, create, and exploit Instruction Level Parallelism in SW or HW.
- Loop-level parallelism is usually easiest to see.
- Dependences exist in a program; they become hazards only if the HW cannot resolve them.
- Dependences and compiler sophistication determine whether the compiler can/should unroll loops.

HW Schemes: Instruction Parallelism

Why in HW, at run time?
- Works when the dependence can't be known until run time (variable latency, control-dependent data dependence).
- Can schedule differently every time through the code.
- The compiler is simpler, and code compiled for one machine runs well on another.

Key idea: allow instructions behind a stall to proceed:

DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F12,F8,F14

SUBD depends on neither DIVD nor ADDD, so it need not wait for them. This enables out-of-order execution => out-of-order completion.

First HW ILP Technique: Out-of-Order Issue / Dynamic Scheduling

Problem: we need to get stalled instructions out of the ID stage so that subsequent instructions can begin execution. We must separate detection of structural hazards from detection of data hazards, i.e., split ID into two stages:
- Issue: decode, check for structural hazards.
- Read operands: read operands when there are NO data hazards.

That is, we must be able to issue even when a data hazard exists; instructions issue in order but proceed to EX out of order.

DIVD  F0,F2,F4    (10 cycles)
ADDD  F10,F0,F8   (4 cycles)
SUBD  F12,F8,F14  (4 cycles)
ADDD  F20,F2,F3
MULTD F13,F12,F2  (6 cycles)
ADDD  F4,F1,F3
ADDD  F5,F4,F13

Issue is in-order; execution is out-of-order (assume several FP ADD units).
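In-order issue with out-of-order completion can be sketched with a toy timing model. This is a simplified sketch, not any real machine: it assumes unlimited functional units, one issue per cycle, no WAR/WAW stalls, and that an instruction begins execution once both source registers are ready:

```c
#include <string.h>

enum { NREG = 32 };

typedef struct {
    const char *name;
    int dest, src1, src2;  /* register numbers (F0 -> 0, etc.) */
    int latency;           /* cycles spent in execution        */
} Instr;

/* completion[i] = cycle at which instruction i's result is ready.
 * Issue is strictly in order (instruction i issues at cycle i+1),
 * but completion order depends on operand readiness and latency. */
void schedule(const Instr *prog, int n, int *completion) {
    int reg_ready[NREG];             /* cycle each register is ready */
    memset(reg_ready, 0, sizeof reg_ready);
    for (int i = 0; i < n; i++) {
        int start = i + 1;           /* in-order issue, 1 per cycle */
        if (reg_ready[prog[i].src1] > start) start = reg_ready[prog[i].src1];
        if (reg_ready[prog[i].src2] > start) start = reg_ready[prog[i].src2];
        completion[i] = start + prog[i].latency;
        reg_ready[prog[i].dest] = completion[i];
    }
}
```

Running the first three instructions above through this model, ADDD waits on DIVD's F0 (RAW), but SUBD starts as soon as it issues and completes before both: out-of-order completion from in-order issue.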

CDC 6600 Scoreboard
- Enables dynamic scheduling: allows instructions to proceed when their dependences are satisfied.

Scoreboard Implications
- Out-of-order completion => WAR, WAW hazards?
- Solution for WAR: stall WB until the earlier (in program order) instruction has read its operands.
- For WAW, must detect the hazard: stall in ID until the other instruction completes.

DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F8,F8,F14

Scoreboard Implications, cont.
- Need multiple instructions in the execution phase => multiple execution units or pipelined execution units.
- The scoreboard keeps track of dependences and the state of operations.
- The scoreboard replaces ID, EX, WB with 4 stages.

Four Stages of Scoreboard Control

1. Issue: decode instruction and check for structural hazards (ID1).
   If a functional unit for the instruction is free and no other active instruction has the same destination register (WAW), the scoreboard issues the instruction to the functional unit and updates its internal data structures. If a structural or WAW hazard exists, instruction issue stalls, and no further instructions will issue until these hazards are cleared.

2. Read operands: wait until no data hazards, then read operands (ID2).
   A source operand is available if no earlier issued active instruction is going to write it, or if the register containing the operand is being written by a currently active functional unit. When the source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution. The scoreboard resolves RAW hazards dynamically in this step, and instructions may be sent into execution out of order.

Four Stages of Scoreboard Control, cont.

3. Execution: operate on operands (EX).
   The functional unit begins execution upon receiving its operands. When the result is ready, it notifies the scoreboard that it has completed execution.

4. Write result: finish execution (WB).
   Once the scoreboard is aware that the functional unit has completed execution, it checks for WAR hazards. If there are none, it writes the result; if there is a WAR hazard, it stalls the instruction.

Example:
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F8,F8,F14
The CDC 6600 scoreboard would stall SUBD until ADDD reads its operands.

Scoreboard Hardware

(Figure: registers and data buses connecting the functional units -- two FP multipliers, an FP divider, an FP adder, and an integer unit -- with control/status lines between each unit and the scoreboard.)

Three Parts of the Scoreboard

1. Instruction status: which of the 4 steps the instruction is in.
2. Functional unit status: indicates the state of the functional unit (FU). 9 fields for each functional unit:
   - Busy: indicates whether the unit is busy or not
   - Op: operation to perform in the unit (e.g., + or -)
   - Fi: destination register
   - Fj, Fk: source-register numbers
   - Qj, Qk: functional units producing source registers Fj, Fk
   - Rj, Rk: flags indicating when Fj, Fk are ready
3. Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
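The three tables above map directly onto simple data structures. A hedged C sketch (field names follow the slide; sizes and encodings are illustrative, not the CDC 6600's actual widths):

```c
#include <stdbool.h>

/* 1. Instruction status: which of the four stages each instruction is in. */
typedef enum { STAGE_NONE, ISSUE, READ_OPERANDS, EXECUTE, WRITE_RESULT } Stage;

/* 2. Functional unit status: the nine fields kept per unit. */
typedef struct {
    bool busy;     /* Busy: is the unit in use?                       */
    char op[8];    /* Op: operation to perform (e.g., "+", "-")       */
    int  fi;       /* Fi: destination register number                 */
    int  fj, fk;   /* Fj, Fk: source register numbers                 */
    int  qj, qk;   /* Qj, Qk: FU producing Fj / Fk (-1 = value ready) */
    bool rj, rk;   /* Rj, Rk: Fj / Fk ready and not yet read?         */
} FUStatus;

/* 3. Register result status: which FU (if any) will write each register. */
enum { NREG = 32 };
typedef struct {
    int producer[NREG];   /* index of producing FU, or -1 if none pending */
} RegStatus;
```

Issue fills in one FUStatus entry and claims the destination in RegStatus; read-operands waits for rj/rk; write-result clears the entries and releases the register, which is how the scoreboard answers the WAW, RAW, and WAR checks in stages 1, 2, and 4.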