CPE 631 Session 19: Exploiting ILP with SW Approaches. Outline: Review of Basic Pipeline Scheduling and Loop Unrolling; Multiple Issue: Superscalar, VLIW.

Session 19: Exploiting ILP with SW Approaches
Electrical and Computer Engineering, University of Alabama in Huntsville

Outline:
- Review: Basic Pipeline Scheduling and Loop Unrolling
- Multiple Issue: Superscalar, VLIW
- Software Pipelining

Basic Pipeline Scheduling: Example

Simple loop:

  for (i = 1; i <= 1000; i++)
      x[i] = x[i] + s;

Assumed latencies (stall cycles between the instruction producing a result and the instruction using it):

  Instruction producing result   Instruction using result   Latency (clock cycles)
  FP ALU op                      Another FP ALU op          3
  FP ALU op                      Store double               2
  Load double                    FP ALU op                  1
  Load double                    Store double               0
  Integer op                     Integer op                 0

  ; R1 points to the last element in the array
  ; for simplicity, we assume that x[1] is at address 0
  Loop: L.D   F0,0(R1)    ; F0 = array element
        ADD.D F4,F0,F2    ; add scalar in F2
        S.D   0(R1),F4    ; store result
        SUBI  R1,R1,#8    ; decrement pointer (8 bytes per double)
        BNEZ  R1,Loop     ; branch when R1 != 0

Revised loop to minimise stalls:

  1. Loop: L.D   F0,0(R1)
  2.       SUBI  R1,R1,#8
  3.       ADD.D F4,F0,F2
  4.       (stall)
  5.       BNEZ  R1,Loop    ; delayed branch
  6.       S.D   8(R1),F4   ; altered and interchanged with SUBI

SUBI is moved up, and BNEZ and S.D are swapped by changing the address in S.D (the pointer has already been decremented). This gives 6 clocks per iteration (1 stall), but only 3 of the instructions (L.D, ADD.D, S.D) do the actual work of processing the array => unroll the loop 4 times to improve the potential for instruction scheduling.
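The stall accounting above can be sanity-checked with a small model. This helper is hypothetical (not the course's simulator): it charges one issue slot per instruction plus the producer-to-consumer stall cycles from the latency table, and deliberately ignores branch-resolution delays and delay slots.

```python
# Minimal single-issue timing sketch (hypothetical helper, not from the slides).
# One issue slot per instruction, plus producer->consumer stalls from the
# latency table above. Branch delays and delay slots are NOT modeled.

LATENCY = {  # (producer class, consumer class) -> stall cycles
    ("FPALU", "FPALU"): 3,
    ("FPALU", "STORE"): 2,
    ("LOAD",  "FPALU"): 1,
    ("LOAD",  "STORE"): 0,
    ("INT",   "INT"):   0,
}

def cycles(seq):
    """seq: list of (class, dest, sources). Returns the last issue cycle."""
    produced = {}                      # reg -> (producer class, issue cycle)
    cycle = 0
    for cls, dest, srcs in seq:
        cycle += 1                     # one issue slot per instruction
        for s in srcs:                 # stall until each operand is usable
            if s in produced:
                pcls, pcyc = produced[s]
                cycle = max(cycle, pcyc + LATENCY.get((pcls, cls), 0) + 1)
        if dest:
            produced[dest] = (cls, cycle)
    return cycle

# The original and scheduled loop bodies from the slide:
original  = [("LOAD", "F0", ["R1"]), ("FPALU", "F4", ["F0", "F2"]),
             ("STORE", None, ["R1", "F4"]), ("INT", "R1", ["R1"]),
             ("BRANCH", None, ["R1"])]
scheduled = [("LOAD", "F0", ["R1"]), ("INT", "R1", ["R1"]),
             ("FPALU", "F4", ["F0", "F2"]), ("BRANCH", None, ["R1"]),
             ("STORE", None, ["R1", "F4"])]
```

Under these simplified latencies the model reproduces the 6-clock scheduled iteration; for the unscheduled loop it reports 8 clocks, less than the slide's count, because branch-related delays are not charged.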

Unrolled Loop

  Loop: L.D   F0,0(R1)
              (1 cycle stall)
        ADD.D F4,F0,F2
              (2 cycles stall)
        S.D   0(R1),F4      ; drop SUBI & BNEZ
        L.D   F6,-8(R1)
        ADD.D F8,F6,F2
        S.D   -8(R1),F8     ; drop SUBI & BNEZ
        L.D   F10,-16(R1)
        ADD.D F12,F10,F2
        S.D   -16(R1),F12   ; drop SUBI & BNEZ
        L.D   F14,-24(R1)
        ADD.D F16,F14,F2
        S.D   -24(R1),F16
        SUBI  R1,R1,#32
        BNEZ  R1,Loop

This loop runs in 28 cc (14 stalls) per iteration: each L.D has one stall, each ADD.D two, SUBI one, BNEZ one, plus the 14 instruction issue cycles. That is 28/4 = 7 cycles for each element of the array (even slower than the scheduled version)! => Rewrite the loop to minimize stalls.

Unrolled Loop that Minimises Stalls

  Loop: L.D   F0,0(R1)
        L.D   F6,-8(R1)
        L.D   F10,-16(R1)
        L.D   F14,-24(R1)
        ADD.D F4,F0,F2
        ADD.D F8,F6,F2
        ADD.D F12,F10,F2
        ADD.D F16,F14,F2
        S.D   0(R1),F4
        S.D   -8(R1),F8
        SUBI  R1,R1,#32
        S.D   16(R1),F12    ; 16-32 = -16
        BNEZ  R1,Loop
        S.D   8(R1),F16     ; 8-32 = -24

This loop runs in 14 cycles (no stalls) per iteration, or 14/4 = 3.5 cycles for each element. Assumptions that make this possible:
- move the L.Ds before the S.Ds
- move the final S.D after SUBI and BNEZ
- use different registers
When is it safe for the compiler to do such changes?

Superscalar MIPS
- 2 instructions per cycle: 1 FP & 1 anything else
- Fetch 64 bits/clock cycle; Int on left, FP on right
- Can only issue the 2nd instruction if the 1st instruction issues
- More ports for the FP registers to do an FP load & FP op in a pair
Note: FP operations extend the EX cycle.

Loop Unrolling in Superscalar (unrolled 5 times to avoid delays)

  Integer instruction          FP instruction
  Loop: L.D  F0,0(R1)
        L.D  F6,-8(R1)
        L.D  F10,-16(R1)       ADD.D F4,F0,F2
        L.D  F14,-24(R1)       ADD.D F8,F6,F2
        L.D  F18,-32(R1)       ADD.D F12,F10,F2
        S.D  0(R1),F4          ADD.D F16,F14,F2
        S.D  -8(R1),F8         ADD.D F20,F18,F2
        S.D  -16(R1),F12
        SUBI R1,R1,#40
        S.D  16(R1),F16
        BNEZ R1,Loop
        S.D  8(R1),F20

This loop runs in 12 cycles (no stalls) per iteration, or 12/5 = 2.4 cycles for each element of the array.
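The same transformation in scalar code may make the register renaming easier to see. This is an illustrative sketch (names are hypothetical): the body of x[i] += s is replicated 4 times with distinct temporaries, mirroring the slide's F0/F6/F10/F14, with a single index update and branch per iteration. The array length is assumed to be a multiple of 4; a real compiler would also emit a cleanup loop for the remainder.

```python
# Loop unrolling by 4, as sketched in the slide: 4 loads into renamed
# temporaries, 4 adds, 4 stores, then ONE index decrement and ONE branch.
# Assumes len(x) is a multiple of 4 (no cleanup loop in this sketch).
def add_scalar_unrolled(x, s):
    i = len(x)
    while i > 0:
        a, b, c, d = x[i-1], x[i-2], x[i-3], x[i-4]    # 4 loads
        a += s; b += s; c += s; d += s                 # 4 independent adds
        x[i-1], x[i-2], x[i-3], x[i-4] = a, b, c, d    # 4 stores
        i -= 4                                         # one decrement + branch
    return x
```

Because the four copies use different temporaries, the adds are independent and a scheduler is free to interleave them, exactly as in the stall-free unrolled MIPS loop above.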

The VLIW Approach
- VLIWs use multiple independent functional units
- VLIWs package the multiple operations into one very long instruction
- The compiler is responsible for choosing the instructions to be issued simultaneously

Loop Unrolling in VLIW (unrolled 7 times to avoid delays)

  Mem ref 1         Mem ref 2         FP op 1           FP op 2           Int/branch
  L.D F0,0(R1)      L.D F6,-8(R1)
  L.D F10,-16(R1)   L.D F14,-24(R1)
  L.D F18,-32(R1)   L.D F22,-40(R1)   ADD.D F4,F0,F2    ADD.D F8,F6,F2
  L.D F26,-48(R1)                     ADD.D F12,F10,F2  ADD.D F16,F14,F2
                                      ADD.D F20,F18,F2  ADD.D F24,F22,F2
  S.D 0(R1),F4      S.D -8(R1),F8     ADD.D F28,F26,F2
  S.D -16(R1),F12   S.D -24(R1),F16                                       SUBI R1,R1,#56
  S.D 24(R1),F20    S.D 16(R1),F24
  S.D 8(R1),F28                                                           BNEZ R1,Loop

7 results in 9 clocks, or 1.3 clocks per element (1.8X). Average: 2.5 ops per clock, 50% efficiency. Note: VLIW needs more registers than the superscalar version.

Software Pipelining
Observation: if the iterations of a loop are independent, then we can get more ILP by taking instructions from different iterations.
Software pipelining reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (~ Tomasulo in SW).

[Figure: several overlapped iterations; a software-pipelined iteration takes one instruction from each.]

Software Pipelining Example

Before: unrolled 3 times

   1  L.D    F0,0(R1)
   2  ADD.D  F4,F0,F2
   3  S.D    0(R1),F4
   4  L.D    F6,-8(R1)
   5  ADD.D  F8,F6,F2
   6  S.D    -8(R1),F8
   7  L.D    F10,-16(R1)
   8  ADD.D  F12,F10,F2
   9  S.D    -16(R1),F12
  10  SUBUI  R1,R1,#24
  11  BNEZ   R1,LOOP

After: software pipelined

  S.D    0(R1),F4     ; stores into M[i]
  ADD.D  F4,F0,F2     ; adds to M[i-1]
  L.D    F0,-16(R1)   ; loads M[i-2]
  SUBUI  R1,R1,#8
  BNEZ   R1,LOOP

5 cycles per iteration. Software pipelining is symbolic loop unrolling:
- maximize the result-use distance
- less code space than unrolling
- fill & drain the pipe only once per loop, vs. once per each unrolled iteration in loop unrolling
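The software-pipelined steady state above can be sketched in scalar code. This is an illustrative model (the function name and structure are hypothetical): each trip through the loop stores the result for element i-2, computes the add for element i-1, and loads element i, with an explicit prologue to fill the pipe and an epilogue to drain it.

```python
# Scalar sketch of the software-pipelined loop: per "iteration" we
# store result i-2, add for i-1, and load element i. The prologue fills
# the pipe once and the epilogue drains it once, mirroring the slide.
def add_scalar_swp(x, s):
    n = len(x)
    if n < 3:                  # too short to pipeline; plain loop instead
        return [v + s for v in x]
    out = list(x)
    f0 = out[0]                # prologue: load M[0]
    f4 = f0 + s                # prologue: add for M[0]
    f0 = out[1]                # prologue: load M[1]
    for i in range(2, n):      # steady state: one store, one add, one load
        out[i-2] = f4          # store result for M[i-2]
        f4 = f0 + s            # add for M[i-1]
        f0 = out[i]            # load M[i]
    out[n-2] = f4              # epilogue: drain the pipe
    out[n-1] = f0 + s
    return out
```

Note how the load, add, and store in the steady-state body come from three different iterations of the original loop, so no iteration waits on its own load-use or add-store latency.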

Statically Scheduled Superscalar
E.g., a four-issue static superscalar: 4 instructions make one issue packet. Fetch examines each instruction in the packet in program order; an instruction cannot be issued if it would cause a structural or data hazard, either due to an instruction earlier in the issue packet or due to an instruction already in execution. The processor can therefore issue from 0 to 4 instructions per clock cycle.

[Figure: Tomasulo organization — FP op queue and instruction unit feeding 6 load buffers, 3 store buffers, reservation stations (3 for the FP adders, 2 for the FP multipliers), and the FP registers, all connected by the common data bus (CDB) to and from memory.]

Multiple Issue with Tomasulo: issue 2 instructions per clock cycle.

  Loop: L.D     F0,0(R1)
        ADD.D   F4,F0,F2
        S.D     0(R1),F4
        DADDIU  R1,R1,#-8
        BNE     R1,R2,Loop

Assumptions:
- one FP and one integer operation can be issued per cycle
- resources: 1 integer ALU (integer ops + effective address calculation), a separate pipelined FP functional unit for each operation type, branch prediction hardware, 1 CDB
- 2 cc latency for loads, 3 cc for FP add
- branches are single issue, and branch prediction is perfect

  Iter  Instruction           Issue  Exec begins  Mem access  Write at CDB  Comment
  1     L.D    F0,0(R1)        1       2            3           4           first issue
  1     ADD.D  F4,F0,F2        1       5                        8           wait for L.D
  1     S.D    0(R1),F4        2       3            9                       wait for ADD.D
  1     DADDIU R1,R1,#-8       2       4                        5           wait for ALU
  1     BNE    R1,R2,Loop      3       6                                    wait for DADDIU
  2     L.D    F0,0(R1)        4       7            8           9           wait for BNE
  2     ADD.D  F4,F0,F2        4      10                       13           wait for L.D
  2     S.D    0(R1),F4        5       8           14                       wait for ADD.D
  2     DADDIU R1,R1,#-8       5       9                       10           wait for ALU
  2     BNE    R1,R2,Loop      6      11                                    wait for DADDIU
  3     L.D    F0,0(R1)        7      12           13          14           wait for BNE
  3     ADD.D  F4,F0,F2        7      15                       18           wait for L.D
  3     S.D    0(R1),F4        8      13           19                       wait for ADD.D
  3     DADDIU R1,R1,#-8       8      14                       15           wait for ALU
  3     BNE    R1,R2,Loop      9      16                                    wait for DADDIU
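The in-order packet rule for the four-issue static superscalar can be captured in a toy model. This helper is hypothetical (not from the slides): it walks the packet in program order and stops at the first instruction that would hit a structural hazard (its functional unit is already taken this cycle) or a data hazard (it reads a register written earlier in the same packet).

```python
# Toy issue-packet check for a four-issue static superscalar: issue stops
# at the FIRST hazard, so 0..4 instructions issue per clock.
def issue_packet(packet, busy_units=frozenset()):
    """packet: list of (unit, dest, sources). Returns number issued (0..4)."""
    used = set(busy_units)     # units already claimed (in execution, or this packet)
    written = set()            # registers written earlier in the packet
    issued = 0
    for unit, dest, srcs in packet[:4]:
        if unit in used or any(s in written for s in srcs):
            break              # in-order issue: everything after also stalls
        used.add(unit)
        if dest:
            written.add(dest)
        issued += 1
    return issued
```

For example, a packet whose second instruction reads a register produced by the first issues only one instruction, even if the remaining slots are hazard-free, because issue is strictly in program order.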

Resource Usage (single integer ALU, one CDB; entries are iteration/instruction)

  Clock  Integer ALU  FP ALU     Data cache  CDB
   2     1/L.D
   3     1/S.D                   1/L.D
   4     1/DADDIU                            1/L.D
   5                  1/ADD.D                1/DADDIU
   7     2/L.D
   8     2/S.D                   2/L.D       1/ADD.D
   9     2/DADDIU                1/S.D       2/L.D
  10                  2/ADD.D                2/DADDIU
  12     3/L.D
  13     3/S.D                   3/L.D       2/ADD.D
  14     3/DADDIU                2/S.D       3/L.D
  15                  3/ADD.D                3/DADDIU
  18                                         3/ADD.D
  19                             3/S.D

Bottleneck: DADDIU waits for the integer ALU, which is used by S.D for effective-address calculation. Fixes: add one ALU dedicated to effective address calculation, and use 2 CDBs. Then the dual-issue Tomasulo pipeline behaves as follows:

  Iter  Instruction           Issue  Exec begins  Mem access  Write at CDB  Comment
  1     L.D    F0,0(R1)        1       2            3           4           first issue
  1     ADD.D  F4,F0,F2        1       5                        8           wait for L.D
  1     S.D    0(R1),F4        2       3            9                       wait for ADD.D
  1     DADDIU R1,R1,#-8       2       3                        4           executes earlier
  1     BNE    R1,R2,Loop      3       5                                    wait for DADDIU
  2     L.D    F0,0(R1)        4       6            7           8           wait for BNE
  2     ADD.D  F4,F0,F2        4       9                       12           wait for L.D
  2     S.D    0(R1),F4        5       7           13                       wait for ADD.D
  2     DADDIU R1,R1,#-8       5       6                        7           executes earlier
  2     BNE    R1,R2,Loop      6       8                                    wait for DADDIU
  3     L.D    F0,0(R1)        7       9           10          11           wait for BNE
  3     ADD.D  F4,F0,F2        7      12                       15           wait for L.D
  3     S.D    0(R1),F4        8      10           16                       wait for ADD.D
  3     DADDIU R1,R1,#-8       8       9                       10           executes earlier
  3     BNE    R1,R2,Loop      9      11                                    wait for DADDIU

Resource Usage (with a separate address adder and two CDBs)

  Clock  Int ALU    Addr adder  FP ALU    Data cache  CDB#1      CDB#2
   2     1/DADDIU   1/L.D
   3                1/S.D                 1/L.D
   4                                                  1/L.D      1/DADDIU
   5                            1/ADD.D
   6     2/DADDIU   2/L.D
   7                2/S.D                 2/L.D       2/DADDIU
   8                                                  2/L.D      1/ADD.D
   9     3/DADDIU   3/L.D       2/ADD.D   1/S.D
  10                3/S.D                 3/L.D       3/DADDIU
  11                                                  3/L.D
  12                            3/ADD.D               2/ADD.D
  13                                      2/S.D
  15                                                  3/ADD.D
  16                                      3/S.D

What about Precise Interrupts?
The state of the machine should look as if no instruction beyond the faulting instruction has issued. Tomasulo had in-order issue, out-of-order execution, and out-of-order completion. We need to fix the out-of-order completion aspect so that we can find a precise breakpoint in the instruction stream.

Relationship between precise interrupts and speculation
Speculation is "guess and check". It is important for branch prediction: we need to take our best shot at predicting branch direction. If we speculate and are wrong, we need to back up and restart execution at the point where we predicted incorrectly — which is exactly the same problem as precise exceptions! The technique for both precise interrupts/exceptions and speculation is in-order completion, or commit.

HW support for precise interrupts
Need a HW buffer for the results of uncommitted instructions: the reorder buffer.
- 3 fields per entry: instruction type, destination, value
- Use the reorder buffer number instead of the reservation station number when execution completes
- The reorder buffer supplies operands between execution complete & commit
- (The reorder buffer can be an operand source => more registers, like the reservation stations)
- Instructions commit in order
- Once an instruction commits, its result is put into the register file
- As a result, it is easy to undo speculated instructions on mispredicted branches or on exceptions

[Figure: FP op queue and FP adders as before, with the reorder buffer inserted between the functional units and the register file.]

Four Steps of the Speculative Tomasulo Algorithm
1. Issue — get an instruction from the FP op queue. If a reservation station and a reorder buffer slot are free, issue the instruction & send the operands & the reorder buffer number for the destination (this stage is sometimes called "dispatch").
2. Execution — operate on operands (EX). When both operands are ready, execute; if not ready, watch the CDB for the result; when both operands are in the reservation station, execute. This step checks RAW hazards (and is sometimes called "issue").
3. Write result — finish execution (WB). Write on the Common Data Bus to all awaiting FUs & to the reorder buffer; mark the reservation station available.
4. Commit — update the register with the reorder buffer result. When the instruction is at the head of the reorder buffer & its result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer. A mispredicted branch flushes the reorder buffer. (Commit is sometimes called "graduation".)
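The commit step above is easy to mis-read, so here is a minimal sketch (a hypothetical class, not the slides' hardware): entries enter in issue order, results arrive out of order, and commit only ever pops the head once its result is present, so the register file is updated strictly in program order; a branch misprediction simply discards the buffer.

```python
# Minimal reorder-buffer sketch: in-order issue and commit, out-of-order
# result writeback. Architectural registers change only at commit.
from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()        # oldest (head) first

    def issue(self, dest):
        e = {"dest": dest, "value": None, "ready": False}
        self.entries.append(e)        # allocated in program order
        return e                      # FUs write their result here, not to regs

    def write_result(self, entry, value):
        entry["value"], entry["ready"] = value, True   # CDB broadcast

    def commit(self, regs):
        committed = []
        while self.entries and self.entries[0]["ready"]:
            e = self.entries.popleft()        # only the head may commit
            regs[e["dest"]] = e["value"]      # architectural state updated
            committed.append(e["dest"])
        return committed

    def flush(self):                  # mispredicted branch or exception
        self.entries.clear()          # uncommitted results vanish; regs untouched
```

Even if a younger instruction finishes first, its value sits in the buffer until every older instruction has committed, which is what makes both precise exceptions and speculation recovery possible.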

What are the hardware complexities with the reorder buffer (ROB)?

[Figure: reorder buffer entry fields — destination register, result, valid, exceptions?, program counter — with a compare network between the FP op queue, the reorder table, the registers, and the FP adders.]

How do you find the latest version of a register?
- (As specified by the Smith paper) you need an associative comparison network
- Alternatively, use a "future file", or just use the register result status buffer to track which specific reorder buffer entry will deliver the value
Also, the ROB needs as many ports as the register file.
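The associative lookup can be sketched as follows (a hypothetical helper): to read a register, scan the ROB from youngest to oldest for the most recent entry that writes it, and fall back to the committed register file if none does. Real hardware does this with one comparator per entry, or avoids the scan entirely with a register status table.

```python
# Sketch of the associative "latest version of a register" lookup: the
# youngest matching ROB entry wins; if it is not ready yet, the reader
# gets a tag to watch on the CDB instead of a value.
def read_operand(reg, rob_entries, regs):
    """rob_entries: list of entry dicts in program order (oldest first)."""
    for e in reversed(rob_entries):          # youngest match wins
        if e["dest"] == reg:
            return e["value"] if e["ready"] else ("wait", e)
    return regs[reg]                         # no in-flight writer: committed value
```

The cost being discussed on the slide is exactly this scan: done in hardware, it is a comparator per ROB entry plus priority logic, replicated for every operand read port.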