Digital Systems Architecture: Software Approaches to ILP, Part 2

Digital Systems Architecture
EECE 33-01 / EECE 292-02
Software Approaches to ILP, Part 2
Dr. William H. Robinson
March 5, 200
http://eecs.vanderbilt.edu/courses/eece33/

Topics
"A deja vu is usually a glitch in the Matrix. It happens when they change something." (Trinity, The Matrix)
- Loop-carried dependencies
- Software pipelining
- Compiler speculation

Processor Case Studies
- GPU:  Daniel & Cole: Nvidia GeForce; Adam D. & Stephen: ATi GPU
- DSP:  Adam N. & Brendan: SHARC II
- RISC: Amogh & Raj: Sun SPARC
- CISC: Jehanzeb & Sylvester: Pentium

Ideas To Reduce Stalls (Chapter 3, Chapter 4)

    Technique                                   Reduces
    Dynamic scheduling                          Data hazard stalls
    Dynamic branch prediction                   Control stalls
    Issuing multiple instructions per cycle     Ideal CPI
    Speculation                                 Data and control stalls
    Dynamic memory disambiguation               Data hazard stalls involving memory
    Loop unrolling                              Control hazard stalls
    Basic compiler pipeline scheduling          Data hazard stalls
    Compiler dependence analysis                Ideal CPI and data hazard stalls
    Software pipelining and trace scheduling    Ideal CPI and data hazard stalls
    Compiler speculation                        Ideal CPI, data and control stalls
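Most of the rest of the lecture develops the compiler rows of this table, starting with loop unrolling. As a bridge to the DLX listings on the next slide, here is a hedged C-level sketch of the same transformation; the function name, the array x, and the scalar s are assumptions for illustration (the slides give only the assembly):

    /* Four-way unrolling of: for (i = 0; i < n; i++) x[i] = x[i] + s;
     * A minimal sketch; assumes n is a multiple of 4, matching the
     * slide's assumption that the trip count divides evenly by 4. */
    void add_scalar_unrolled(double *x, double s, int n)
    {
        for (int i = 0; i < n; i += 4) {   /* one loop test/branch per 4 elements */
            x[i]     = x[i]     + s;
            x[i + 1] = x[i + 1] + s;
            x[i + 2] = x[i + 2] + s;
            x[i + 3] = x[i + 3] + s;
        }
    }

Unrolling amortizes the loop overhead (the SUBI/BNEZ below) over four elements and exposes independent operations that a scheduler can interleave to hide latency.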

Unroll Loop Four Times (Straightforward Way)

     1 Loop: LD    F0,0(R1)
     2       stall
     3       ADDD  F4,F0,F2
     4       stall
     5       stall
     6       SD    0(R1),F4
     7       LD    F6,-8(R1)
     8       stall
     9       ADDD  F8,F6,F2
    10       stall
    11       stall
    12       SD    -8(R1),F8
    13       LD    F10,-16(R1)
    14       stall
    15       ADDD  F12,F10,F2
    16       stall
    17       stall
    18       SD    -16(R1),F12
    19       LD    F14,-24(R1)
    20       stall
    21       ADDD  F16,F14,F2
    22       stall
    23       stall
    24       SD    -24(R1),F16
    25       SUBI  R1,R1,#32
    26       stall
    27       BNEZ  R1,LOOP
    28       NOP

15 + 4 x (1 + 2) + 1 = 28 clock cycles, or 7 cycles per iteration (14 instructions plus a NOP, four load/add stall groups, and one stall after SUBI). Assumes the loop trip count is a multiple of 4. Rewrite the loop to minimize stalls.

Unrolled Loop that Minimizes Stalls

     1 Loop: LD    F0,0(R1)
     2       LD    F6,-8(R1)
     3       LD    F10,-16(R1)
     4       LD    F14,-24(R1)
     5       ADDD  F4,F0,F2
     6       ADDD  F8,F6,F2
     7       ADDD  F12,F10,F2
     8       ADDD  F16,F14,F2
     9       SD    0(R1),F4
    10       SD    -8(R1),F8
    11       SD    -16(R1),F12
    12       SUBI  R1,R1,#32
    13       BNEZ  R1,LOOP
    14       SD    8(R1),F16    ; 8 - 32 = -24

14 clock cycles, or 3.5 cycles per iteration. Latencies assumed: LD to ADDD, 1 cycle; ADDD to SD, 2 cycles.

What assumptions were made when the code was moved?
- OK to move the store past SUBI even though SUBI changes the register it uses (hence the 8(R1) offset).
- OK to move loads before stores: do they get the right data?
- When is it safe for the compiler to make such changes?

Adapted from John Kubiatowicz's CS 252 lecture notes. Copyright 2003 UCB.

Compiler Checks for Dependencies

The compiler is concerned with dependencies in the program:

Data dependence: instruction i produces a result used by instruction j, or instruction j is data dependent on instruction k and instruction k is data dependent on instruction i (a dependence chain).

Anti-dependence: instruction j writes a register or memory location that instruction i reads, and instruction i executes first.

Output dependence: instruction i and instruction j write the same register or memory location; the ordering between the instructions must be preserved.

Control dependence: an instruction that is control dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch; and an instruction that is not control dependent on a branch cannot be moved after the branch, so that its execution becomes controlled by the branch.
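A hedged C illustration of these four dependence classes; the statements and variable names are invented for the example:

    /* Dependence classes among statements S1..S5 (illustrative only). */
    int dependence_examples(int a, int b)
    {
        int x, y;
        x = a + b;        /* S1 */
        y = x * 2;        /* S2: data (true) dependence on S1: reads x after S1 writes it */
        a = b + 1;        /* S3: anti-dependence on S1: writes a, which S1 read */
        x = b - 1;        /* S4: output dependence on S1: both write x; order must be preserved */
        if (a > 0)        /* branch */
            y = y + x;    /* S5: control dependent on the branch above */
        return y;
    }

Data and control dependences limit reordering; anti- and output dependences are name dependences that renaming (by hardware or by the compiler) can remove.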

Detecting and Enhancing Loop-Level Parallelism (LLP)

Example: where are the data dependencies? (A, B, C distinct and non-overlapping)

    for (i = 1; i <= 100; i = i + 1) {
        A[i+1] = A[i] + C[i];    /* S1 */
        B[i+1] = B[i] + A[i+1];  /* S2 */
    }

1. S2 uses the value A[i+1] computed by S1 in the same iteration.
2. S1 uses a value computed by S1 in an earlier iteration: iteration i computes A[i+1], which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1]. This is a loop-carried dependence between iterations; it implies that the iterations are dependent and cannot be executed in parallel. This was not the case for our prior example, where each iteration was distinct.

Example: where are the data dependencies? (A, B, C, D distinct and non-overlapping)

    for (i = 1; i <= 100; i = i + 1) {
        A[i] = A[i] + B[i];      /* S1 */
        B[i+1] = C[i] + D[i];    /* S2 */
    }

1. There is no dependence from S1 to S2. If there were, there would be a cycle in the dependencies and the loop would not be parallel. S1 does use B[i], written by S2 in the previous iteration, but since that loop-carried dependence is not circular, interchanging the two statements will not affect the execution of S2.
2. On the first iteration of the loop, statement S1 depends on the value of B[1] computed prior to initiating the loop.

Eliminating the Loop-Carried Dependence

Original loop, iteration by iteration (the loop-carried dependence: B[i+1] written by S2 in iteration i is read by S1 in iteration i+1):

    Iteration 1:   A[1] = A[1] + B[1];         B[2] = C[1] + D[1];
    Iteration 2:   A[2] = A[2] + B[2];         B[3] = C[2] + D[2];
    ...
    Iteration 99:  A[99] = A[99] + B[99];      B[100] = C[99] + D[99];
    Iteration 100: A[100] = A[100] + B[100];   B[101] = C[100] + D[100];

Modified parallel loop (the dependence is now within an iteration, not loop-carried):

    A[1] = A[1] + B[1];                  /* start-up code */
    for (i = 1; i <= 99; i = i + 1) {
        B[i+1] = C[i] + D[i];
        A[i+1] = A[i+1] + B[i+1];
    }
    B[101] = C[100] + D[100];            /* completion code */

Iteration by iteration:

    Start-up:      A[1] = A[1] + B[1];
    Iteration 1:   B[2] = C[1] + D[1];       A[2] = A[2] + B[2];
    ...
    Iteration 99:  B[100] = C[99] + D[99];   A[100] = A[100] + B[100];
    Completion:    B[101] = C[100] + D[100];
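A minimal compilable sketch of this transformation; the test data and the equality check are assumptions added for illustration. It runs both versions and confirms they produce the same arrays:

    #include <stdio.h>

    #define N 100

    int main(void)
    {
        /* Indices 1..N+1 are used, as in the slides, so size the arrays N+2. */
        double A[N + 2], B[N + 2], A2[N + 2], B2[N + 2], C[N + 2], D[N + 2];
        int i, ok = 1;

        for (i = 0; i <= N + 1; i++) {           /* arbitrary test data */
            A[i] = A2[i] = 0.5 * i;
            B[i] = B2[i] = 0.25 * i;
            C[i] = 1.0 * i;
            D[i] = 2.0 * i;
        }

        /* Original loop: loop-carried dependence from S2 to S1. */
        for (i = 1; i <= N; i++) {
            A[i] = A[i] + B[i];                  /* S1 */
            B[i + 1] = C[i] + D[i];              /* S2 */
        }

        /* Transformed loop: the dependence stays within one iteration. */
        A2[1] = A2[1] + B2[1];                   /* start-up code */
        for (i = 1; i <= N - 1; i++) {
            B2[i + 1] = C[i] + D[i];
            A2[i + 1] = A2[i + 1] + B2[i + 1];
        }
        B2[N + 1] = C[N] + D[N];                 /* completion code */

        for (i = 0; i <= N + 1; i++)             /* both versions must agree */
            if (A[i] != A2[i] || B[i] != B2[i])
                ok = 0;
        printf(ok ? "loops agree\n" : "MISMATCH\n");
        return 0;
    }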

ILP Compiler Support: Loop-Carried Dependence Detection

Compilers can increase the utilization of ILP by better detection of instruction dependencies. To detect a loop-carried dependence in a loop, the compiler can use the GCD test, which is based on the following. If an array element with index a*i + b is stored, and the element with index c*i + d of the same array is loaded, where the index i runs from m to n, a dependence exists if the following two conditions hold:

1. There are two iteration indices j and k with m <= j, k <= n (within the iteration limits).
2. The loop stores into the array element indexed by a*j + b and later loads the same element indexed by c*k + d; that is, a*j + b = c*k + d with j < k.

The Greatest Common Divisor (GCD) Test

If a loop-carried dependence exists, then GCD(c, a) must divide (d - b). Equivalently: if GCD(c, a) does not divide (d - b), no dependence is possible. The converse does not hold: the GCD condition can be satisfied even when no dependence exists, because the test does not take the loop bounds into account. (A C sketch of the test follows the next slide.)

Example:

    for (i = 1; i <= 100; i = i + 1) {
        x[2*i + 3] = x[2*i] * 5.0;
    }

Index of element stored: a*i + b = 2*i + 3, so a = 2, b = 3.
Index of element loaded: c*i + d = 2*i, so c = 2, d = 0.
GCD(a, c) = 2 and (d - b) = -3; 2 does not divide -3, so no dependence is possible.

Showing the Example Loop's Iterations to Be Independent

    Iteration i    Index of x loaded (2i)    Index of x stored (2i + 3)
    1              2                         5
    2              4                         7
    3              6                         9
    4              8                         11
    5              10                        13
    6              12                        15
    7              14                        17

The loads touch only even-indexed elements and the stores only odd-indexed elements, so no iteration ever reads an element written by another. What if GCD(a, c) had divided (d - b)? Then the test is inconclusive and a dependence must be assumed.

Software Pipelining

Observation: if the iterations of a loop are independent, then we can get ILP by taking instructions from different iterations. Software pipelining reorganizes loops so that each iteration of the pipelined loop is made from instructions chosen from different iterations of the original loop ("Tomasulo in software").

[Figure: a software-pipelined iteration draws one instruction from each of several consecutive iterations (0 through 4) of the original loop.]
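The promised sketch of the GCD test as a C helper; the function and parameter names are invented, and it implements only the divisibility condition above, ignoring loop bounds as the slides note:

    static int gcd(int x, int y)               /* Euclid's algorithm */
    {
        while (y != 0) {
            int t = x % y;
            x = y;
            y = t;
        }
        return x < 0 ? -x : x;
    }

    /* Store index: a*i + b.  Load index: c*i + d.
     * Returns 0 when a loop-carried dependence is impossible
     * (GCD(c, a) does not divide d - b); returns 1 when a dependence
     * may exist. The test is conservative: it can answer 1 even when
     * the loop bounds rule the dependence out. */
    int gcd_test_may_depend(int a, int b, int c, int d)
    {
        int g = gcd(c, a);
        if (g == 0)                /* both indices constant (a = c = 0) */
            return b == d;
        return (d - b) % g == 0;
    }

For the slide's loop, gcd_test_may_depend(2, 3, 2, 0) evaluates (0 - 3) % 2, which is nonzero, and returns 0: no dependence is possible.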

Software Pipelining Example

Show a software-pipelined version of this code:

    Loop: L.D    F0,0(R1)
          ADD.D  F4,F0,F2
          S.D    F4,0(R1)
          DADDUI R1,R1,#-8
          BNE    R1,R2,LOOP

Before: unrolled 3 times

     1  L.D    F0,0(R1)
     2  ADD.D  F4,F0,F2
     3  S.D    F4,0(R1)
     4  L.D    F0,-8(R1)
     5  ADD.D  F4,F0,F2
     6  S.D    F4,-8(R1)
     7  L.D    F0,-16(R1)
     8  ADD.D  F4,F0,F2
     9  S.D    F4,-16(R1)
    10  DADDUI R1,R1,#-24
    11  BNE    R1,R2,LOOP

After: software pipelined

        L.D    F0,0(R1)     ; start-up code
        ADD.D  F4,F0,F2     ; start-up code
        L.D    F0,-8(R1)    ; start-up code
     1  S.D    F4,0(R1)     ; stores M[i]
     2  ADD.D  F4,F0,F2     ; adds to M[i-1]
     3  L.D    F0,-16(R1)   ; loads M[i-2]
     4  DADDUI R1,R1,#-8
     5  BNE    R1,R2,LOOP
        S.D    F4,0(R1)     ; finish code
        ADD.D  F4,F0,F2     ; finish code
        S.D    F4,-8(R1)    ; finish code

[Figure: timeline contrasting software pipelining, which overlaps operations from successive iterations in a steady state, with loop unrolling.]

Trace Scheduling

Parallelism across IF branches vs. LOOP branches. Two steps:
- Trace selection: find a likely sequence of basic blocks (a trace) forming a long, statically predicted or profile-predicted stretch of straight-line code.
- Trace compaction: squeeze the trace into a few VLIW instructions; bookkeeping code is needed in case the prediction is wrong.
The compiler undoes a bad guess (discards values in registers). Subtle compiler bugs mean a wrong answer rather than merely poorer performance; there are no hardware interlocks.

Predicated Execution

Avoid branch prediction by turning branches into conditionally executed instructions:

    if (x) then A = B op C else NOP

If the condition is false, the instruction neither stores its result nor causes an exception. The expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have a conditional move; PA-RISC can annul any following instruction; IA-64 has 64 1-bit condition fields selected for conditional execution of any instruction. This transformation is called if-conversion (see the C sketch after the next slide).

Drawbacks of conditional instructions:
- An annulled instruction still takes a clock cycle.
- The pipeline stalls if the condition is evaluated late.
- Complex conditions reduce effectiveness, since the condition becomes known late in the pipeline.

Increasing Parallelism

Theory: move an instruction across a branch to increase the size of a basic block, and thus increase parallelism. The primary difficulty is avoiding exceptions. For example:

    if (a != 0) c = b / a;

Moving the divide above the branch may cause a divide-by-zero error in some cases. Methods for increasing speculation include:
1. Use a set of status bits (poison bits) associated with the registers; they signal that an instruction's result is invalid until some later time.
2. The result of an instruction isn't written until it's certain that the instruction is no longer speculative.

Adapted from John Kubiatowicz's CS 252 lecture notes. Copyright 2003 UCB.
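To make the if-conversion idea from the predication slide concrete, a hedged C sketch; the variable and function names are invented, and the branchless form mirrors what a conditional move or a predicated instruction does:

    /* Branchy form: the assignment is control dependent on the test. */
    int op_branchy(int x, int a, int b, int c)
    {
        if (x)
            a = b + c;       /* A = B op C */
        return a;
    }

    /* If-converted form: the operation executes unconditionally and the
     * condition merely selects the result, so there is no branch to
     * predict. Compilers typically emit a conditional move (e.g., x86
     * CMOV) for this pattern. Note the slide's drawback: the add now
     * costs a cycle even when its result is thrown away. */
    int op_predicated(int x, int a, int b, int c)
    {
        int t = b + c;
        return x ? t : a;
    }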

Example: Compiler Speculation

Code for:

    if (A == 0) A = B; else A = A + 4;

Assume A is at 0(R3) and B is at 0(R2); assume R14 is unused and available.

Original code:

        LW   R1, 0(R3)     ; load A
        BNEZ R1, L1        ; test A
        LW   R1, 0(R2)     ; if clause
        J    L2            ; skip else
    L1: ADDI R1, R1, #4    ; else clause
    L2: SW   0(R3), R1     ; store A

Speculated code:

        LW   R1, 0(R3)     ; load A
        LW   R14, 0(R2)    ; speculative load of B
        BEQZ R1, L3        ; other branch of the if
        ADDI R14, R1, #4   ; else clause
    L3: SW   0(R3), R14    ; non-speculative store

Poison Bit Example on Page 36

If the LW* produces an exception, a poison bit is set on its destination register. If a later instruction tries to use that register, the exception is raised at that point.

Speculated code with a poison bit:

        LW   R1, 0(R3)     ; load A
        LW*  R14, 0(R2)    ; speculative load of B; sets R14's poison bit on a fault
        BEQZ R1, L3        ; other branch of the if
        ADDI R14, R1, #4   ; else clause
    L3: SW   0(R3), R14    ; non-speculative store; faults here if R14 is poisoned

Summary

- Compilers are used to statically identify ILP.
- Loop-carried dependencies make loop unrolling less effective.
- Compilers can improve performance with software pipelining, trace scheduling, and speculation.

Adapted from John Kubiatowicz's CS 252 lecture notes. Copyright 2003 UCB.