CISC 662 Graduate Computer Architecture
Lecture 8 - ILP 1
Michela Taufer
http://www.cis.udel.edu/~taufer/teaching/cis662f07

PowerPoint lecture notes from John Hennessy and David Patterson's Computer Architecture, 4th edition. Additional teaching material from Jelena Mirkovic (U Del) and John Kubiatowicz (UC Berkeley).

Pipeline CPI

Pipeline CPI (I)
Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls
Techniques to reduce stalls, by type of stall (seen so far):
- Forwarding and bypassing - potential data hazard stalls
- Delayed branches and simple branch scheduling - control hazard stalls
- Basic compiler pipeline scheduling - data hazard stalls
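
To make the formula concrete, here is a minimal C sketch of how pipeline CPI could be computed from stall counts; the instruction and stall counts are hypothetical values chosen for illustration, not measurements from the course.

#include <stdio.h>

/* Minimal sketch: pipeline CPI as ideal CPI plus average stalls per
 * instruction, following the formula above.  All numbers are made up. */
int main(void) {
    double ideal_cpi    = 1.0;       /* one instruction per cycle, ideally */
    long   instructions = 1000000;   /* hypothetical instruction count     */
    long   structural   = 20000;     /* stall cycles by cause (made up)    */
    long   data_raw     = 150000;
    long   data_war     = 0;         /* avoided in the in-order 5-stage pipe */
    long   data_waw     = 0;
    long   control      = 80000;

    long total_stalls = structural + data_raw + data_war + data_waw + control;
    double cpi = ideal_cpi + (double)total_stalls / (double)instructions;
    printf("Pipeline CPI = %.2f\n", cpi);   /* 1 + 0.25 = 1.25 here */
    return 0;
}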

Pipeline CPI (II)
Techniques to reduce stalls, by type of stall (we will see in the next few weeks):
- Compiler pipeline scheduling - data hazard stalls
- Loop unrolling - control hazard stalls
- Branch prediction - control stalls
- Dynamic scheduling (scoreboarding) - data hazard stalls from true dependences
- Dynamic scheduling with renaming - data hazard stalls and stalls from antidependences and output dependences
- Dynamic memory disambiguation - data hazard stalls through memory
- Hardware speculation - data hazard and control hazard stalls
- Issuing multiple instructions per cycle - ideal CPI

Pipeline CPI (III)
Techniques to reduce stalls, by type of stall (we will not cover in this course):
- Compiler dependence analysis, software pipelining, trace scheduling - ideal CPI, data hazard stalls
- Hardware support for compiler speculation - ideal CPI, data hazard stalls, branch hazard stalls

Dependences

Dependences and Hazards
Dependences are a property of programs. If two instructions are data dependent, they cannot execute simultaneously. Whether a dependence results in a hazard, and whether that hazard actually causes a stall, are properties of the pipeline organization. Data dependences may occur through registers or memory.

Dependences and Hazards
The presence of a dependence indicates the potential for a hazard, but the actual hazard and the length of any stall are properties of the pipeline.
- If two instructions are independent, they can be executed in parallel.
- Otherwise they must execute in order, although they may partially overlap.
A data dependence:
- Indicates that there is a possibility of a hazard,
- Determines the order in which results must be calculated, and
- Sets an upper bound on the amount of parallelism that can be exploited.

Types of Dependences
- Data (true) dependences
- Name dependences
- Control dependences

Data Dependences
Instruction j is data dependent on instruction i if:
- Instruction i produces a result that may be used by instruction j, or
- Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.

LOOP: L.D     F0, 0(R1)
      ADD.D   F4, F0, F2
      S.D     F4, 0(R1)
      DADDUI  R1, R1, #-8
      BNE     R1, R2, LOOP

For the same loop: what effect do we get if we move the branch condition test to the EX phase? Is this a RAW, WAW, or WAR hazard?

Data Dependences
Dependences through registers are easy:
      lw  r10, 10(r11)
      add r12, r10, r8
Just compare register names.
Dependences through memory are harder:
      sw  r10, 4(r2)
      lw  r6,  0(r4)
Is r2+4 = r4+0? If so they are dependent; if not, they are not.
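
A minimal C sketch of the same point (the function and pointer names are hypothetical): whether a store and a later load conflict through memory depends on run-time addresses, which often cannot be resolved from the code alone.

/* Sketch: register dependences are visible from names; memory dependences
 * depend on run-time addresses (pointer aliasing), so they are hard to
 * detect statically.  Names here are illustrative only. */
double maybe_dependent(double *a, double *b) {
    a[1] = 2.0 * a[0];   /* store, like: sw r10, 4(r2)  with r2 = a */
    double x = b[0];     /* load,  like: lw r6, 0(r4)   with r4 = b */
    /* The load depends on the store only if &a[1] == &b[0], i.e. the
     * addresses r2+4 and r4+0 are equal; otherwise they are independent. */
    return x;
}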

Data Dependences
Data dependences can be overcome by:
- Leaving the dependence in place but avoiding the hazard
- Eliminating the dependence by transforming the code

Name Dependences (I)
Instructions i and j use the same register or memory location.
- Antidependence: instruction j writes a location that instruction i reads. Is this a RAW, WAW, or WAR hazard?
- Output dependence: instruction j writes a location that instruction i writes. Is this a RAW, WAW, or WAR hazard?
Since there is no data flow between the instructions, they can be renamed and executed in parallel - register renaming.

Name Dependences (II)
Antidependence: instruction j writes a register or memory location that instruction i reads:
      i: add r6, r5, r4
      j: sub r5, r8, r11
Output dependence: instructions i and j write the same register or memory location. The ordering must be preserved to leave the correct value in the register:
      add r7, r4, r3
      div r7, r2, r8
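
A minimal C sketch of the renaming idea (variable names are hypothetical, not from the slides): reusing a name creates anti- and output dependences, and giving each value a fresh name removes them without changing the result.

/* Sketch: name dependences come from reusing a storage name, not from data
 * flow.  Renaming t into t1/t2 removes the WAR and WAW hazards. */
double with_name_dependence(double a, double b, double c, double d) {
    double t = a + b;     /* write t                                    */
    double u = t * c;     /* read t                                     */
    t = c - d;            /* write t again: WAR with the read above,
                             WAW with the first write                   */
    return u + t;
}

double after_renaming(double a, double b, double c, double d) {
    double t1 = a + b;    /* each value gets its own name               */
    double u  = t1 * c;
    double t2 = c - d;    /* no false dependences remain; t1*c and c-d
                             could now execute in parallel              */
    return u + t2;
}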

Control Dependences
Branches incur some penalty: while the target and condition are evaluated, we cannot be sure which instruction comes next.
- We have to guess.
- We have to reorder instructions so that we execute useful instructions while waiting for the branch.
- The main goal is not to affect the correctness of the program.

Control Dependences
An instruction j is control dependent on i if the execution of j is controlled by instruction i.
      i: if (a < b)
      j:     a = a + 1;
j is control dependent on i.
1. An instruction that is control dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch.
2. An instruction that is not control dependent on a branch cannot be moved after the branch, so that its execution is controlled by the branch.

Control Dependences
Preserve exception behavior and data flow.
Instruction reordering should not cause exception reordering:
      DADDU R2, R3, R4
      BEQZ  R2, L1
      LW    R1, 0(R2)
L1:   ...
Only those exceptions that would surely occur are allowed.
Instructions after the branch depend on it, and on all instructions prior to the branch, for correct execution:
      DADDU R1, R2, R3
      BEQZ  R4, L
      DSUBU R1, R5, R6
L:    OR    R7, R1, R8

Preserving Exception Behavior
A simple pipeline preserves control dependences since it executes programs in program order.
      daddu r2, r3, r4
      beqz  r2, L1
      lw    r1, 0(r2)
L1:   ...
Can we move the lw before the branch? (Don't worry, it is OK to violate control dependences as long as we can preserve the program semantics.)

Preserving Exception Behavior
Corollary: any change in the ordering of instructions should not change how exceptions are raised in a program.

Preserving Data Flow
Consider the following example:
      daddu r1, r2, r3
      beqz  r4, L
      dsubu r1, r5, r6
L:    or    r7, r1, r8
What can you say about the value of r1 used by the or instruction?
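
A minimal C sketch of the data-flow question (the function name is hypothetical): which definition of r1 reaches the or depends on the branch outcome, so any reordering must preserve that.

/* Sketch: the value of r1 used by the "or" comes either from the daddu or
 * from the dsubu, depending on the branch.  Moving or removing the dsubu
 * would change which definition reaches the use and break the data flow. */
int dataflow_example(int r2, int r3, int r4, int r5, int r6, int r8) {
    int r1 = r2 + r3;       /* daddu r1, r2, r3                     */
    if (r4 != 0)            /* beqz r4, L  (fall through if r4 != 0) */
        r1 = r5 - r6;       /* dsubu r1, r5, r6                     */
    return r1 | r8;         /* L: or r7, r1, r8                     */
}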

Preserving Data Flow
Corollary: preserving data dependences alone is not sufficient when changing program order. We must also preserve the data flow.
These two principles together allow us to execute instructions in a different order and still maintain the program semantics. This is the foundation upon which ILP processors are built.

Instruction Level Parallelism
- The amount of parallelism within a basic block is very small.
- We must exploit parallelism across multiple basic blocks:
  - Pipelining
  - Out-of-order execution

Dynamic Scheduling
- The techniques we have learned so far are static scheduling techniques: forwarding, delayed branches, flushing the pipeline, predict taken, predict not taken.
- The compiler detects dependences and schedules instruction execution to minimize hazards.
- The pipeline executes instructions in order, detects hazards, and inserts stalls.
- Dynamic scheduling overcomes data hazards by out-of-order execution.

Out-of-Order Execution
- If some instruction is stalled, check the following instructions to see whether they can proceed (i.e., they have no hazards with previous instructions).
- Check for structural and data hazards.
- An instruction can be issued as soon as its operands are available.
- Out-of-order issue means out-of-order completion, the possibility of WAR and WAW hazards, and problems with exception handling.

Loop Unrolling and Scheduling

Can we make CPI closer to 1?
Let's assume full pipelining: if we have a 4-cycle latency, then we need 3 instructions between a producing instruction and its use:
      multf $F0, $F2, $F4
      delay-1
      delay-2
      delay-3
      addf  $F6, $F10, $F0
[Pipeline diagram: multf, the three delay slots, and addf flow through Fetch, Decode, Ex1-Ex4, WB; arrows mark the earliest forwarding points for 1-cycle and for 4-cycle instructions.]

FP Loop: Where are the Hazards?
Loop: LD    F0, 0(R1)    ;F0 = vector element
      ADDD  F4, F0, F2   ;add scalar from F2
      SD    0(R1), F4    ;store result
      SUBI  R1, R1, 8    ;decrement pointer by 8 bytes (DW)
      BNEZ  R1, Loop     ;branch if R1 != zero
      NOP                ;delayed branch slot

Instruction producing result   Instruction using result   Latency (clock cycles)
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0
Integer op                     Integer op                 0

Where are the stalls?

FP Loop Showing Stalls
1  Loop: LD    F0, 0(R1)    ;F0 = vector element
2        stall
3        ADDD  F4, F0, F2   ;add scalar in F2
4        stall
5        stall
6        SD    0(R1), F4    ;store result
7        SUBI  R1, R1, 8    ;decrement pointer by 8 bytes (DW)
8        BNEZ  R1, Loop     ;branch if R1 != zero
9        stall              ;delayed branch slot
(Same latency table as above.)
9 clocks: rewrite the code to minimize stalls?

Revised FP Loop Minimizing Stalls
1  Loop: LD    F0, 0(R1)
2        stall
3        ADDD  F4, F0, F2
4        SUBI  R1, R1, 8
5        BNEZ  R1, Loop     ;delayed branch
6        SD    8(R1), F4    ;altered when moved past SUBI
Swap BNEZ and SD by changing the address of SD.
(Same latency table as above.)
6 clocks: unroll the loop 4 times to make the code faster?
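
For reference, a C sketch of the kind of loop this assembly implements (the array, scalar, and bound are hypothetical; this is the usual x[i] = x[i] + s pattern, not code taken from the slides):

/* Sketch: the scalar loop behind the FP example above.  Each iteration
 * loads an element, adds a scalar, stores it back, and decrements the
 * index, which matches the LD/ADDD/SD/SUBI/BNEZ sequence. */
void scale_add(double *x, double s, int n) {
    for (int i = n - 1; i >= 0; i--)
        x[i] = x[i] + s;
}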

Unroll Loop Four Times (straightforward way)
1  Loop: LD    F0, 0(R1)
2        ADDD  F4, F0, F2
3        SD    0(R1), F4     ;drop SUBI & BNEZ
4        LD    F6, -8(R1)
5        ADDD  F8, F6, F2
6        SD    -8(R1), F8    ;drop SUBI & BNEZ
7        LD    F10, -16(R1)
8        ADDD  F12, F10, F2
9        SD    -16(R1), F12  ;drop SUBI & BNEZ
10       LD    F14, -24(R1)
11       ADDD  F16, F14, F2
12       SD    -24(R1), F16
13       SUBI  R1, R1, #32   ;alter to 4*8
14       BNEZ  R1, LOOP
15       NOP
Each LD is followed by a 1-cycle stall and each ADDD by a 2-cycle stall, so the loop takes 15 + 4 x (1+2) = 27 clock cycles, or 6.8 per iteration. Assumes R1 is a multiple of 4.
Rewrite the loop to minimize stalls?

Unrolled Loop That Minimizes Stalls
1  Loop: LD    F0, 0(R1)
2        LD    F6, -8(R1)
3        LD    F10, -16(R1)
4        LD    F14, -24(R1)
5        ADDD  F4, F0, F2
6        ADDD  F8, F6, F2
7        ADDD  F12, F10, F2
8        ADDD  F16, F14, F2
9        SD    0(R1), F4
10       SD    -8(R1), F8
11       SD    -16(R1), F12
12       SUBI  R1, R1, #32
13       BNEZ  R1, LOOP
14       SD    8(R1), F16    ; 8 - 32 = -24
14 clock cycles, or 3.5 per iteration.
What assumptions were made when the code was moved? (See the C sketch below.)
- It is OK to move the store past SUBI even though SUBI changes the register.
- It is OK to move loads before stores: do we still get the right data?
- When is it safe for the compiler to make such changes?

Loop Level Parallelism
- Loop level parallelism can be turned into ILP by unrolling loops, or exploited directly by vector machines.
- Turning loop level parallelism into ILP by unrolling the loop can be done:
  - Statically, by the compiler
  - Dynamically, by the hardware
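
A minimal C sketch of the same transformation (hypothetical function, assuming the trip count is a multiple of 4): the loop body is replicated four times and the loop overhead is paid once per four elements.

/* Sketch: 4x unrolling of x[i] = x[i] + s, assuming n is a multiple of 4.
 * The four load/add/store groups are independent, so a compiler (or an
 * out-of-order core) can overlap them to hide the FP latencies. */
void scale_add_unrolled(double *x, double s, int n) {
    for (int i = n - 1; i >= 3; i -= 4) {
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }
}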

Branch Prediction

Reducing Branch Penalty
Branch penalty in dynamically scheduled processors: cycles wasted on pipeline flushing after mis-predicted branches.
To reduce the branch penalty:
- Predict branch/jump instructions AND the branch direction (taken or not taken)
- Predict the branch/jump target address (for taken branches)
- Speculatively execute instructions along the predicted path

What to Use and What to Predict
Available information:
- Current predicted PC
- Past branch history (direction and target)
What to predict:
- Conditional branch instructions: branch direction and target address
- Jump instructions: target address
- Procedure call/return: target address
May need the instruction predecoded.
[Block diagram: the PC feeds the instruction memory (IM) and the predictors; the predictors produce pred_pc and prediction info for the fetched PC and instruction; feedback from resolved branches updates the predictors.]

Mis-prediction Detection and Feedback
Detection:
- At the end of decoding: the target address is known at decoding and does not match - flush the fetch stage.
- At commit (most cases): wrong branch direction, or the target address does not match - flush the whole pipeline (at EXE in the MIPS R10000).
Feedback:
- Any time a mis-prediction is detected
- At a branch's commit (at EXE: called speculative update)
[Pipeline diagram: FETCH, RENAME, REB/ROB, SCHD, EXE, WB, COMMIT, with the predictors updated by feedback from the later stages.]

Branch Direction Prediction
Predict the branch direction: taken or not taken (T/NT).
      BNE R1, R2, L1   (taken: jump to L1; not taken: fall through)
- Static prediction: the compiler decides the direction.
- Dynamic prediction: the hardware decides the direction using dynamic information:
  1. 1-bit branch-prediction buffer
  2. 2-bit branch-prediction buffer
  3. Correlating branch prediction buffer
  4. Tournament branch predictor
  5. and more

Predictor for a Single Branch
General form: 1. access the predictor state (indexed by the PC), 2. predict the outcome (output T/NT), 3. feed back the actual outcome (T/NT) to update the state.
1-bit prediction: [State diagram: two states, 1 = predict taken and 0 = predict not taken; a taken outcome moves the state to 1, a not-taken outcome moves it to 0.]
A C sketch of this single-branch predictor is given below.
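
A minimal C sketch of the general form above (type and function names are hypothetical): one bit of state per branch, read to predict and overwritten by the actual outcome.

#include <stdbool.h>

/* Sketch of a single-branch 1-bit predictor: state 1 predicts taken,
 * state 0 predicts not taken; the actual outcome simply replaces the state. */
typedef struct { unsigned char state; } OneBitPredictor;   /* holds 0 or 1 */

static bool predict(const OneBitPredictor *p) {
    return p->state == 1;                 /* step 2: output T/NT         */
}

static void feedback(OneBitPredictor *p, bool taken) {
    p->state = taken ? 1 : 0;             /* step 3: update from outcome */
}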

Branch History Table of 1-bit Predictors
- The BHT is also called the branch prediction buffer in the textbook.
- We could use only one 1-bit predictor, but its accuracy would be low.
- BHT: use a table of simple predictors, indexed by bits from the PC.
- Similar to a direct-mapped cache: more entries cost more, but give fewer conflicts and higher accuracy.
- A BHT can also contain more complex predictors.
[Diagram: k bits of the branch address index a table of 2^k prediction entries.]

1-bit BHT Weakness
Example: in a loop, a 1-bit BHT will cause 2 mis-predictions. Consider a loop of 9 iterations before exit:
      for (...) {
          for (i = 0; i < 9; i++)
              a[i] = a[i] * 2.0;
      }
- At the end of the loop, when it exits instead of looping as before.
- The first time through the loop on the next pass through the code, when it predicts exit instead of looping.
Only 80% accuracy, even though the loop branch is taken 90% of the time.

2-bit Saturating Counter
Solution: a 2-bit scheme where we change the prediction only if we get a misprediction twice (Figure 3.7, p. 249).
[State diagram: four states, 11 and 10 predict taken, 01 and 00 predict not taken; a taken outcome moves the counter up toward 11, a not-taken outcome moves it down toward 00, so the prediction flips only after two consecutive mispredictions.]
Adds hysteresis to the decision-making process.
A C sketch of a BHT of 2-bit saturating counters follows.
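
A minimal C sketch of a BHT of 2-bit saturating counters, combining the two slides above (the table size and indexing are illustrative assumptions, not from the course):

#include <stdbool.h>
#include <stdint.h>

/* Sketch: a branch history table of 2-bit saturating counters, indexed by
 * the low bits of the branch PC (like a direct-mapped cache, so different
 * branches may alias).  Counters 2 and 3 predict taken, 0 and 1 not taken. */
#define BHT_BITS 12
#define BHT_SIZE (1u << BHT_BITS)

static uint8_t bht[BHT_SIZE];                 /* each entry holds 0..3 */

static unsigned bht_index(uint32_t pc) {
    return (pc >> 2) & (BHT_SIZE - 1);        /* drop byte offset, keep k bits */
}

static bool bht_predict(uint32_t pc) {
    return bht[bht_index(pc)] >= 2;           /* states 10, 11 -> predict taken */
}

static void bht_update(uint32_t pc, bool taken) {
    uint8_t *c = &bht[bht_index(pc)];
    if (taken)  { if (*c < 3) (*c)++; }       /* saturate at 11 */
    else        { if (*c > 0) (*c)--; }       /* saturate at 00 */
}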

Correlating Branches
A code example showing the potential. C code:
      if (d == 0)
          d = 1;
      if (d == 1)
          ...
Assembly code (with d in R1):
      BNEZ   R1, L1        ;first branch (d != 0)
      DADDIU R1, R0, #1    ;d == 0, so set d = 1
L1:   DADDIU R3, R1, #-1
      BNEZ   R3, L2        ;second branch (d != 1)
      ...
L2:
Observation: if the first BNEZ is not taken, then the second BNEZ is also not taken.

Correlating Branch Predictor
Idea: the taken/not-taken behavior of recently executed branches is related to the behavior of the next branch (as well as to that branch's own history). The behavior of recent branches then selects between, say, 2 predictions of the next branch, and only that prediction is updated.
(1,1) predictor: 1 bit of global history, 1-bit local predictors.
[Diagram: the branch address (4 bits) indexes a table of 1-bit per-branch local predictors; a 1-bit global branch history (0 = not taken) selects which prediction is used.]

Correlating Branch Predictor
General form: an (m, n) predictor uses m bits of global history and n-bit local predictors, recording correlation among m+1 branches. Simple implementation: the global history can be stored in a shift register.
Example: a (2,2) predictor with 2 bits of global history and 2-bit local predictors.
[Diagram: the branch address (4 bits) indexes a table of 2-bit per-branch local predictors; a 2-bit global branch history (01 = not taken then taken) selects the prediction.]
A C sketch of a (2,2)-style predictor appears below.
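
A minimal C sketch in the spirit of the (2,2) scheme (sizes and indexing are illustrative assumptions): the 2-bit global history selects one of four 2-bit counters within the entry chosen by the branch address.

#include <stdbool.h>
#include <stdint.h>

/* Sketch of a (2,2) correlating predictor: 2 bits of global history choose
 * one of 4 two-bit saturating counters in the entry selected by the PC. */
#define ENTRIES 1024                      /* per-branch entries (illustrative) */

static uint8_t counters[ENTRIES][4];      /* [pc index][global history], 0..3 */
static unsigned ghist;                    /* 2-bit global history shift reg    */

static bool corr_predict(uint32_t pc) {
    unsigned idx = (pc >> 2) & (ENTRIES - 1);
    return counters[idx][ghist & 3] >= 2; /* counter 2 or 3 -> predict taken  */
}

static void corr_update(uint32_t pc, bool taken) {
    unsigned idx = (pc >> 2) & (ENTRIES - 1);
    uint8_t *c = &counters[idx][ghist & 3];
    if (taken)  { if (*c < 3) (*c)++; }
    else        { if (*c > 0) (*c)--; }
    ghist = ((ghist << 1) | (taken ? 1u : 0u)) & 3;   /* shift in the outcome */
}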

Accuracy of Different Schemes
[Figure 3.15, p. 206: frequency of mispredictions for a 4096-entry 2-bit BHT, a 2-bit BHT with unlimited entries, and a 1024-entry (2,2) BHT.]

Accuracy of Return Address Predictor

Branch Target Buffer
Branch Target Buffer (BTB): the address of the branch indexes the buffer to get the prediction AND the branch target address (if taken).
Note: we must check for a branch match now, since we can't use a wrong branch address.
Example: BTB combined with a BHT.
[Diagram: the PC of the instruction in fetch is compared against the branch PCs stored in the BTB; if there is no match, the branch is not predicted and the next PC = PC + 4; if there is a match, the instruction is a branch and the predicted PC is used as the next PC; entries may carry extra prediction state bits.]
A C sketch of a BTB lookup is given below.
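
A minimal C sketch of the BTB lookup described above (a direct-mapped organization and the entry fields are assumptions for illustration):

#include <stdbool.h>
#include <stdint.h>

/* Sketch: a direct-mapped branch target buffer.  On a tag match the fetch
 * stage uses the stored target as the next PC; otherwise it uses PC + 4. */
#define BTB_ENTRIES 512

typedef struct {
    bool     valid;
    uint32_t branch_pc;       /* tag: PC of the branch                     */
    uint32_t target_pc;       /* predicted next PC if the branch is taken  */
} BtbEntry;

static BtbEntry btb[BTB_ENTRIES];

static uint32_t next_fetch_pc(uint32_t pc) {
    BtbEntry *e = &btb[(pc >> 2) % BTB_ENTRIES];
    if (e->valid && e->branch_pc == pc)
        return e->target_pc;          /* predicted branch: follow the target */
    return pc + 4;                    /* not predicted: proceed normally     */
}

static void btb_update(uint32_t pc, uint32_t target) {
    BtbEntry *e = &btb[(pc >> 2) % BTB_ENTRIES];
    e->valid = true;
    e->branch_pc = pc;
    e->target_pc = target;            /* record or refresh the taken target  */
}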

Deadlines
Week  Date    Topic                                           Reading / Quiz
4     Sep 25  Lec07 - Multi-cycles                            App A.7; Chap 2
      Sep 29  Homework 1 due
5     Sep 30  Homework review
5     Oct 2   Lec08 - Instruction Level Parallelism (ILP)     Q3
6     Oct 7   Lec09 - Dynamic Scheduling: Scoreboard
6     Oct 9   Lec10 - Dynamic Scheduling: Tomasulo
7     Oct 14  Lec11 - Hardware Speculation
7     Oct 16  Lec12 - Multiple Issue
      Oct 20  Homework 2 due
8     Oct 21  Homework review
8     Oct 23  Midterm exam                                    Chap 3; App C
9     Oct 28  Lec13 - Study of the Limitations of ILP
9     Oct 30  Lec14 - Review Cache and Review Virtual Memory  Q4