EE 4683/5683: COMPUTER ARCHITECTURE

Lecture 4A: Instruction Level Parallelism - Static Scheduling
Avinash Kodi, kodi@ohio.edu

Agenda
- Dependences: RAW, WAR, WAW
- Static Scheduling
- Loop-carried Dependence

Recall from Pipelining Review

Pipeline CPI = Ideal pipeline CPI + Structural Stalls + Data Hazard Stalls + Control Stalls
- Ideal pipeline CPI: a measure of the maximum performance attainable by the implementation
- Structural hazards: the HW cannot support this combination of instructions
- Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
- Control hazards: caused by the delay between fetching instructions and deciding on changes in control flow (branches and jumps)

Instruction Level Parallelism

Instruction-Level Parallelism (ILP): overlap the execution of instructions to improve performance.
Two approaches to exploit ILP:
1. Rely on hardware to help discover and exploit the parallelism dynamically (e.g., Pentium 4, AMD Opteron, IBM Power), and
2. Rely on software technology to find the parallelism statically at compile time (e.g., Itanium 2)
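As a quick numeric illustration of the Pipeline CPI equation above, here is a minimal C sketch; the stall contributions (0.05, 0.15, 0.10 stall cycles per instruction) are made-up values for the example, not measurements of any particular machine.

  #include <stdio.h>

  int main(void) {
      /* Hypothetical per-instruction stall contributions (assumed values). */
      double ideal_cpi      = 1.00;  /* one instruction completes per cycle */
      double structural     = 0.05;  /* structural-hazard stalls            */
      double data_hazard    = 0.15;  /* data-hazard stalls                  */
      double control_hazard = 0.10;  /* control-hazard stalls               */

      /* Pipeline CPI = Ideal CPI + Structural + Data Hazard + Control Stalls */
      double pipeline_cpi = ideal_cpi + structural + data_hazard + control_hazard;

      printf("Pipeline CPI = %.2f\n", pipeline_cpi);                 /* 1.30 */
      printf("Slowdown vs. ideal = %.2fx\n", pipeline_cpi / ideal_cpi);
      return 0;
  }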

Instruction-Level Parallelism (ILP)

The ILP within a basic block (BB) is quite small
- BB: a straight-line code sequence with no branches in except at the entry and no branches out except at the exit
- With an average dynamic branch frequency of 15% to 25%, only 4 to 7 instructions execute between a pair of branches
- Plus, the instructions in a BB are likely to depend on each other
To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks
Simplest: loop-level parallelism, exploiting parallelism among the iterations of a loop, e.g.,

  for (i=1; i<=1000; i=i+1)
    x[i] = x[i] + y[i];

Loop-Level Parallelism

Exploit loop-level parallelism by unrolling the loop, either
1. dynamically, via branch prediction, or
2. statically, via loop unrolling by the compiler
Determining instruction dependence is critical to loop-level parallelism
If two instructions are
- parallel, they can execute simultaneously in a pipeline of arbitrary depth without causing any stalls (assuming no structural hazards)
- dependent, they are not parallel and must be executed in order, although they may often be partially overlapped

Data Dependence and Hazards

Instr J is data dependent (aka truly dependent) on Instr I if:
1. Instr J tries to read an operand before Instr I writes it
     I: add r1,r2,r3
     J: sub r4,r1,r3
2. or Instr J is data dependent on Instr K, which is dependent on Instr I
If two instructions are data dependent, they cannot execute simultaneously or be completely overlapped
A data dependence in the instruction sequence reflects a data dependence in the source code; the effect of the original data dependence must be preserved
If a data dependence causes a hazard in the pipeline, it is called a Read After Write (RAW) hazard

ILP and Data Dependences, Hazards

HW/SW must preserve program order: the order the instructions would execute in if run sequentially, as determined by the original source program
Dependences are a property of programs
The presence of a dependence indicates the potential for a hazard, but the actual hazard and the length of any stall are properties of the pipeline
Importance of data dependences:
1. indicates the possibility of a hazard
2. determines the order in which results must be calculated
3. sets an upper bound on how much parallelism can possibly be exploited
HW/SW goal: exploit parallelism by preserving program order only where it affects the outcome of the program

Name Dependence #1: Anti-dependence

Name dependence: when two instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name; there are two versions of name dependence
Instr J writes an operand before Instr I reads it
  I: sub r4,r1,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7
Called an anti-dependence by compiler writers. This results from reuse of the name r1
If an anti-dependence causes a hazard in the pipeline, it is called a Write After Read (WAR) hazard

Name Dependence #2: Output dependence

Instr J writes an operand before Instr I writes it
  I: sub r1,r4,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7
Called an output dependence by compiler writers. This also results from the reuse of the name r1
If an output dependence causes a hazard in the pipeline, it is called a Write After Write (WAW) hazard
Instructions involved in a name dependence can execute simultaneously if the name used in the instructions is changed so the instructions do not conflict
Register renaming resolves name dependences for registers, either by the compiler or by HW
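To make register renaming concrete, here is a minimal C sketch; the variable names r1, r4, r6, r8 simply mimic registers and the values are made up for illustration. Reusing the name r1 creates a WAR name dependence between the first two statements; giving the second computation a fresh name removes it without changing the results.

  #include <stdio.h>

  int main(void) {
      int r1 = 5, r2 = 2, r3 = 3, r7 = 4;

      /* Before renaming: r1 is reused, so the add is ordered after the
         sub by a WAR name dependence even though no data flows between them. */
      int r4 = r1 - r3;        /* I: reads r1                       */
      r1     = r2 + r3;        /* J: writes r1  -> WAR with I       */
      int r6 = r1 * r7;        /* K: reads J's r1 -> true RAW on J  */

      /* After renaming: J and K use the fresh name r8, so I and the renamed J
         no longer share a register and could execute in either order. */
      int r8  = r2 + r3;       /* renamed J                         */
      int r6b = r8 * r7;       /* renamed K, still RAW on renamed J */

      printf("r4=%d r6=%d r6b=%d\n", r4, r6, r6b);   /* r6 == r6b */
      return 0;
  }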

Control Dependencies

Every instruction is control dependent on some set of branches, and, in general, these control dependencies must be preserved to preserve program order

  if p1 {
    S1;
  };
  if p2 {
    S2;
  }

S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1

Software Techniques - Example

This code adds a scalar to a vector:

  for (i=1000; i>0; i=i-1)
    x[i] = x[i] + s;

Assume the following latencies for all examples, and ignore the delayed branch in these examples:

  Instruction producing result   Instruction using result   Stalls between (clock cycles)
  FP ALU op                      Another FP ALU op          3
  FP ALU op                      Store double               2
  Load double                    FP ALU op                  1
  Load double                    Store double               0
  Integer op                     Integer op                 0

FP Loop: Where are the Hazards?

First translate into MIPS code (to simplify, assume 8 is the lowest address):

  Loop: L.D    F0,0(R1)   ;F0=vector element
        ADD.D  F4,F0,F2   ;add scalar from F2
        S.D    0(R1),F4   ;store result
        DADDUI R1,R1,-8   ;decrement pointer 8 bytes (DW)
        BNEZ   R1,Loop    ;branch if R1 != zero

The relevant latencies from the table above: FP ALU op to another FP ALU op, 3 cycles; FP ALU op to store double, 2 cycles; load double to FP ALU op, 1 cycle.

FP Loop Showing Stalls

  1 Loop: L.D    F0,0(R1)   ;F0=vector element
  2       stall
  3       ADD.D  F4,F0,F2   ;add scalar in F2
  4       stall
  5       stall
  6       S.D    0(R1),F4   ;store result
  7       DADDUI R1,R1,-8   ;decrement pointer 8 bytes (DW)
  8       stall             ;assumes we can't forward to the branch
  9       BNEZ   R1,Loop    ;branch if R1 != zero

9 clock cycles per iteration. Can we rewrite the code to minimize the stalls?

Revised FP Loop Minimizing Stalls

  1 Loop: L.D    F0,0(R1)
  2       DADDUI R1,R1,-8
  3       ADD.D  F4,F0,F2
  4       stall
  5       stall
  6       S.D    8(R1),F4   ;altered offset since DADDUI was moved up
  7       BNEZ   R1,Loop

Swap the DADDUI and the S.D by changing the address offset of the S.D (latency assumptions as above).
7 clock cycles per iteration, but just 3 are for execution (L.D, ADD.D, S.D) and 4 are loop overhead. How can we make the loop faster?

Unroll Loop Four Times (straightforward way)

   1 Loop: L.D    F0,0(R1)
   2       stall               ;1 cycle stall
   3       ADD.D  F4,F0,F2
   4       stall               ;2 cycles stall
   5       stall
   6       S.D    0(R1),F4     ;drop DADDUI & BNEZ
   7       L.D    F6,-8(R1)
   9       ADD.D  F8,F6,F2
  12       S.D    -8(R1),F8    ;drop DADDUI & BNEZ
  13       L.D    F10,-16(R1)
  15       ADD.D  F12,F10,F2
  18       S.D    -16(R1),F12  ;drop DADDUI & BNEZ
  19       L.D    F14,-24(R1)
  21       ADD.D  F16,F14,F2
  24       S.D    -24(R1),F16
  25       DADDUI R1,R1,#-32   ;decrement pointer by 4*8
  26       BNEZ   R1,LOOP

27 clock cycles, or 6.75 per iteration (assumes the iteration count is a multiple of 4). Can we rewrite the loop to minimize the stalls?

Unrolled Loop That Minimizes Stalls

   1 Loop: L.D    F0,0(R1)
   2       L.D    F6,-8(R1)
   3       L.D    F10,-16(R1)
   4       L.D    F14,-24(R1)
   5       ADD.D  F4,F0,F2
   6       ADD.D  F8,F6,F2
   7       ADD.D  F12,F10,F2
   8       ADD.D  F16,F14,F2
   9       S.D    0(R1),F4
  10       S.D    -8(R1),F8
  11       S.D    -16(R1),F12
  12       DSUBUI R1,R1,#32
  13       S.D    8(R1),F16    ;8 - 32 = -24
  14       BNEZ   R1,LOOP

14 clock cycles, or 3.5 per iteration

Unrolled Loop Detail

We do not usually know the upper bound of the loop. Suppose it is n, and we would like to unroll the loop to make k copies of the body. Instead of a single unrolled loop, we generate a pair of consecutive loops:
- the 1st executes (n mod k) times and has a body that is the original loop
- the 2nd is the unrolled body, surrounded by an outer loop that iterates (n/k) times
For large values of n, most of the execution time will be spent in the unrolled loop, as sketched below.
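The two-loop split described above can be written directly in C. The sketch below is illustrative only: the function name, the array x, the scalar s, and the unroll factor of 4 are assumptions for the example, not taken from the lecture's MIPS code. The first loop runs the n mod 4 leftover iterations with the original body; the second runs the body unrolled four times, n/4 times.

  #include <stddef.h>

  /* Add scalar s to every element of x[0..n-1], unrolled by a factor of 4. */
  void add_scalar_unrolled(double *x, size_t n, double s) {
      size_t i = 0;

      /* 1st loop: executes (n mod 4) times with the original loop body. */
      for (; i < n % 4; i++)
          x[i] = x[i] + s;

      /* 2nd loop: unrolled body, iterates (n / 4) times. */
      for (; i < n; i += 4) {
          x[i]     = x[i]     + s;
          x[i + 1] = x[i + 1] + s;
          x[i + 2] = x[i + 2] + s;
          x[i + 3] = x[i + 3] + s;
      }
  }

For large n, almost all of the execution time is spent in the second, unrolled loop, which is the one the compiler then schedules as on the previous slide.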

5 Loop Unrolling Decisions

Unrolling requires understanding how one instruction depends on another and how the instructions can be changed or reordered given the dependences:
1. Determine that unrolling the loop is useful by finding that the loop iterations are independent (except for the loop maintenance code)
2. Use different registers to avoid unnecessary constraints forced by using the same registers for different computations
3. Eliminate the extra test and branch instructions and adjust the loop termination and iteration code
4. Determine that the loads and stores in the unrolled loop can be interchanged by observing that loads and stores from different iterations are independent; this transformation requires analyzing memory addresses and finding that they do not refer to the same address
5. Schedule the code, preserving any dependences needed to yield the same result as the original code

When Safe to Unroll Loop?

Example: Where are the data dependences? (A, B, C are distinct and non-overlapping)

  for (i=1; i<=100; i=i+1) {
    A[i+1] = A[i] + C[i];      /* S1 */
    B[i+1] = B[i] + A[i+1];    /* S2 */
  }

When Safe to Unroll Loop?

Example: Where are the data dependences? (A, B, C are distinct and non-overlapping)

  for (i=1; i<=100; i=i+1) {
    A[i+1] = A[i] + C[i];      /* S1 */
    B[i+1] = B[i] + A[i+1];    /* S2 */
  }

1. S2 uses the value A[i+1] computed by S1 in the same iteration.
2. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1], which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1].
This is a loop-carried dependence between iterations: it implies that the iterations are dependent and cannot be executed in parallel.

Sometimes it is possible to have a loop-carried dependence but still parallelize

Example:

  for (i=1; i<=100; i=i+1) {
    A[i] = A[i] + B[i];        /* S1 */
    B[i+1] = C[i] + D[i];      /* S2 */
  }

Where are the dependences here? Is the loop parallelizable, and how?

Sometimes it is possible to have a loop-carried dependence but still parallelize

S1 uses the value assigned in the previous iteration by statement S2, so there is a loop-carried dependence:

  for (i=1; i<=100; i=i+1) {
    A[i] = A[i] + B[i];        /* S1 */
    B[i+1] = C[i] + D[i];      /* S2 */
  }

Unlike the earlier loop-carried dependence, this is not a circular dependence: while statement S1 depends on S2, S2 does not depend on S1. When there is no circular dependence, the loop can be parallelized. Let's write the code for the parallel version:

  A[1] = A[1] + B[1];
  for (i=1; i<=99; i=i+1) {
    B[i+1] = C[i] + D[i];      /* old S2 */
    A[i+1] = A[i+1] + B[i+1];  /* old S1 */
  }
  B[101] = C[100] + D[100];

The dependence between the two statements is no longer loop-carried, so different iterations may be overlapped, provided the statements within each iteration are kept in order.
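As a sanity check on the transformation above, the small self-contained C program below runs both the original loop and the restructured version and compares the resulting A and B arrays; the array contents are made-up test values, not part of the lecture.

  #include <stdio.h>
  #include <string.h>

  int main(void) {
      /* Indices 1..100 are used; B also needs room for index 101. */
      double A1[102], B1[102], A2[102], B2[102], C[102], D[102];
      for (int i = 0; i <= 101; i++) {
          A1[i] = A2[i] = i * 0.5;
          B1[i] = B2[i] = i * 0.25;
          C[i] = i * 1.5;
          D[i] = i * 2.0;
      }

      /* Original loop: loop-carried but non-circular dependence. */
      for (int i = 1; i <= 100; i++) {
          A1[i] = A1[i] + B1[i];              /* S1 */
          B1[i + 1] = C[i] + D[i];            /* S2 */
      }

      /* Restructured version from the slide. */
      A2[1] = A2[1] + B2[1];
      for (int i = 1; i <= 99; i++) {
          B2[i + 1] = C[i] + D[i];            /* old S2 */
          A2[i + 1] = A2[i + 1] + B2[i + 1];  /* old S1 */
      }
      B2[101] = C[100] + D[100];

      int same = memcmp(A1, A2, sizeof A1) == 0 && memcmp(B1, B2, sizeof B1) == 0;
      printf("results match: %s\n", same ? "yes" : "no");
      return 0;
  }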