CISC 662 Graduate Computer Architecture
Lecture 8 - ILP 1: Pipeline CPI
Michela Taufer
http://www.cis.udel.edu/~taufer/teaching/cis662f07

PowerPoint lecture notes from John Hennessy and David Patterson's Computer Architecture, 4th edition. Additional teaching material from Jelena Mirkovic (U Del) and John Kubiatowicz (UC Berkeley).

Pipeline CPI (I)

Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls

Techniques to reduce stalls, by the type of stall they address (seen so far):
- Forwarding and bypassing - potential data hazard stalls
- Delayed branches and simple branch scheduling - control hazard stalls
- Basic compiler pipeline scheduling - data hazard stalls

Pipeline CPI (II)

Techniques to reduce stalls, by the type of stall they address (covered in the next few weeks):
- Compiler pipeline scheduling - data hazard stalls
- Loop unrolling - control hazard stalls
- Branch prediction - control stalls
- Dynamic scheduling (scoreboarding) - data hazard stalls from true dependences
- Dynamic scheduling with renaming - data hazard stalls, plus stalls from antidependences and output dependences
- Dynamic memory disambiguation - data hazard stalls involving memory
- Hardware speculation - data hazard and control hazard stalls
- Issuing multiple instructions per cycle - ideal CPI
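To see how the components of the equation combine, here is a minimal sketch in C; all the stall rates are invented numbers for illustration, not measurements.

  #include <stdio.h>

  int main(void) {
      /* Pipeline CPI = ideal CPI plus the average stall cycles per
         instruction from each category. All rates are assumptions. */
      double ideal_cpi  = 1.00;
      double structural = 0.05;  /* structural stalls per instruction */
      double raw        = 0.20;  /* RAW stalls per instruction */
      double war        = 0.00;  /* WAR stalls (none in a simple 5-stage pipeline) */
      double waw        = 0.00;  /* WAW stalls (none in a simple 5-stage pipeline) */
      double control    = 0.15;  /* control stalls per instruction */

      double cpi = ideal_cpi + structural + raw + war + waw + control;
      printf("Pipeline CPI = %.2f\n", cpi);  /* prints 1.40 for these rates */
      return 0;
  }

Each technique in the lists above attacks one of these terms; the ideal CPI itself only drops below 1 with multiple issue.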

Pipeline CPI (III)

Techniques to reduce stalls, by the type of stall they address (not covered in this course):
- Compiler dependence analysis, software pipelining, trace scheduling - ideal CPI, data hazard stalls
- Hardware support for compiler speculation - ideal CPI, data hazard stalls, branch hazard stalls

Dependences

Dependences and Hazards

Dependences are a property of programs: if two instructions are data dependent, they cannot execute simultaneously. Whether a dependence results in a hazard, and whether that hazard actually causes a stall, are properties of the pipeline organization. For example, a load followed immediately by a use of the loaded value is dependent on any machine, but it costs one stall cycle on a pipeline with forwarding and several on one without. Data dependences may occur through registers or through memory.

The presence of a dependence indicates the potential for a hazard, but the actual hazard and the length of any stall are properties of the pipeline. If two instructions are independent, they can be executed in parallel; otherwise they must execute in order, although they may partially overlap. A data dependence:
- indicates that there is a possibility of a hazard,
- determines the order in which results must be calculated, and
- sets an upper bound on the amount of parallelism that can be exploited.

Types of Dependences

- Data (true) dependences
- Name dependences
- Control dependences

Data Dependences

Instruction j is data dependent on instruction i if:
- instruction i produces a result that may be used by instruction j, or
- instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i (data dependence is transitive).

  LOOP: L.D    F0, 0(R1)
        ADD.D  F4, F0, F2
        S.D    F4, 0(R1)
        DADDUI R1, R1, #-8
        BNE    R1, R2, LOOP

What effect do we get if we move the branch condition test to the EX phase?

Data Dependences

Dependences through registers are easy to detect:

  lw  r10, 10(r11)
  add r12, r10, r8

just compare register names. Dependences through memory are harder:

  sw r10, 4(r2)
  lw r6, 0(r4)

Is r2+4 = r4+0? If so, the instructions are dependent; if not, they are not. Is this a RAW, WAW, or WAR hazard?
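Answering the address question above is exactly what dynamic memory disambiguation hardware must do. A minimal sketch in C (function and parameter names are invented): a store followed by a load from the same effective address is a RAW dependence carried through memory.

  #include <stdbool.h>
  #include <stdint.h>

  /* Does "sw r10, store_off(store_base)" conflict with a later
     "lw r6, load_off(load_base)"? They touch the same word exactly
     when the effective addresses match; the register values are
     assumed to be known by the time the check is made. */
  bool store_load_conflict(uint64_t store_base, int64_t store_off,
                           uint64_t load_base,  int64_t load_off) {
      return store_base + store_off == load_base + load_off;
  }

For the example, the check is r2 + 4 == r4 + 0, which can only be resolved once both registers have their values.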

Data Dependences

Data dependences can be overcome by:
- leaving the dependence in place but avoiding the hazard, or
- eliminating the dependence by transforming the code.

Name Dependences (I)

Instructions i and j use the same register or memory location, but no value flows between them.
- Antidependence: instruction j writes a location that instruction i reads. Is this a RAW, WAW, or WAR hazard?
- Output dependence: instruction j writes a location that instruction i writes. Is this a RAW, WAW, or WAR hazard?
Since there is no data flow between the instructions, they can be renamed and executed in parallel - register renaming.

Name Dependences (II)

Antidependence: instruction j writes a register or memory location that instruction i reads:

  i: add r6,r5,r4
  j: sub r5,r8,r11

Output dependence: instructions i and j write the same register or memory location. The ordering must be preserved to leave the correct value in the register:

  add r7,r4,r3
  div r7,r2,r8

Control Dependences

Branches incur some penalty while the target and condition are evaluated: we cannot be sure which instruction comes next, so we have to guess, and we reorder instructions so that we execute useful work while waiting for the branch. The main goal is not to affect the correctness of the program.
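To make renaming concrete, a minimal sketch in C (register counts and names are illustrative assumptions): giving every write a fresh physical register removes the WAR hazard in the antidependence example above, since j's write to r5 no longer touches the register i reads.

  #include <stdio.h>

  #define ARCH_REGS 32
  #define PHYS_REGS 64   /* free-list management and recycling omitted */

  static int rename_map[ARCH_REGS];   /* architectural -> physical */
  static int next_free = ARCH_REGS;

  int rename_src(int r) { return rename_map[r]; }   /* reads use the current mapping */
  int rename_dst(int r) { rename_map[r] = next_free++; return rename_map[r]; }

  int main(void) {
      for (int r = 0; r < ARCH_REGS; r++) rename_map[r] = r;
      /* i: add r6,r5,r4    j: sub r5,r8,r11  (antidependence on r5) */
      printf("i reads  r5 as p%d\n", rename_src(5));
      printf("j writes r5 as p%d\n", rename_dst(5));  /* fresh register: no WAR hazard */
      return 0;
  }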

Control Dependences

An instruction j is control dependent on instruction i if the execution of j is controlled by i:

  i: if (a < b)
  j:     a = a + 1;

Here j is control dependent on i.
1. An instruction that is control dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch.
2. An instruction that is not control dependent on a branch cannot be moved after the branch, so that its execution becomes controlled by the branch.

Control Dependences

Preserve exception behavior and data flow.
- Instruction reordering should not cause exception reordering:

        DADDU R2, R3, R4
        BEQZ  R2, L1
        LW    R1, 0(R2)
    L1: ...

  Only those exceptions are allowed that would surely occur.
- Instructions after the branch depend on it, and on all instructions prior to the branch, for correct execution:

        DADDU R1, R2, R3
        BEQZ  R4, L
        DSUBU R1, R5, R6
    L:  OR    R7, R1, R8

Preserving Exception Behavior

A simple pipeline preserves control dependences since it executes programs in program order:

      daddu r2,r3,r4
      beqz  r2,L1
      lw    r1,0(r2)
  L1: ...

Can we move the lw before the branch? (Don't worry: it is OK to violate control dependences as long as we can preserve the program semantics.)

Corollary: any change in the ordering of instructions must not change how exceptions are raised in the program.
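A C analogue may make the lw question concrete (a sketch; the null-pointer check stands in for the BEQZ guard). Hoisting the load above the branch can introduce a fault the original program never raises, which is exactly the exception behavior we must preserve.

  #include <stddef.h>

  /* Original order: the load is control dependent on the check, so
     it can fault only when the program would have executed it anyway. */
  int original(int *p) {
      if (p == NULL)      /* BEQZ R2, L1  */
          return 0;
      return *p;          /* LW R1, 0(R2) */
  }

  /* Reordered: the load is hoisted above the check. If p is NULL,
     this version faults where the original did not: exception
     behavior has changed, so the move is illegal without hardware
     or compiler support for speculation. */
  int reordered(int *p) {
      int v = *p;         /* speculative load */
      if (p == NULL)
          return 0;
      return v;
  }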

Preserving Data Flow

Consider the following example:

     daddu r1,r2,r3
     beqz  r4,L
     dsubu r1,r5,r6
  L: or    r7,r1,r8

What can you say about the value of r1 used by the or instruction?

Corollary: preserving data dependences alone is not sufficient when changing program order; we must also preserve the data flow. (The r1 that reaches the or comes from the daddu if the branch is taken and from the dsubu if it is not.) These two principles together allow us to execute instructions in a different order and still maintain the program semantics. This is the foundation on which ILP processors are built.

Instruction Level Parallelism

- The amount of parallelism within a basic block is very small.
- We must exploit parallelism across multiple basic blocks: pipelining, out-of-order execution.

Dynamic Scheduling

- The techniques we have learned so far are static scheduling techniques: forwarding, delayed branches, pipeline flushes, predict-taken, predict-not-taken. The compiler detects dependences and schedules instruction execution to minimize hazards; the pipeline executes instructions in order, detects hazards, and inserts stalls.
- Dynamic scheduling overcomes data hazards by out-of-order execution.
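In C terms the r1 example looks as follows (a sketch; variable and function names are invented). The value flowing into the final use depends on the branch outcome, which is why a scheduler must preserve data flow and not just the individual dependences.

  int dataflow(int r2, int r3, int r4, int r5, int r6, int r8) {
      int r1 = r2 + r3;    /* daddu r1,r2,r3                      */
      if (r4 != 0)         /* beqz r4,L : fall through if r4 != 0 */
          r1 = r5 - r6;    /* dsubu r1,r5,r6                      */
      return r1 | r8;      /* L: or r7,r1,r8 -- which r1 arrives
                              here depends on the branch, so
                              reordering either write to r1
                              changes the program's meaning       */
  }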

Out-of-Order Execution

- If an instruction is stalled, check the following instructions to see whether they can proceed (i.e., they have no hazards with previous instructions).
- Check for structural and data hazards: an instruction can be issued as soon as its operands are available.
- Out-of-order issue means out-of-order completion, the possibility of WAR and WAW hazards, and problems with exception handling.

Loop Unrolling and Scheduling

Can We Make CPI Closer to 1?

Let's assume full pipelining. If an operation has a 4-cycle latency, we need 3 instructions between the producing instruction and its use:

  multf $F0,$F2,$F4
  delay-1
  delay-2
  delay-3
  addf  $F6,$F10,$F0

[Figure: pipeline diagram (Fetch, Decode, Ex1-Ex4, WB) contrasting the earliest forwarding point for 1-cycle instructions with that for 4-cycle instructions; the multf result reaches the addf only after Ex4.]

FP Loop: Where are the Hazards?

  Loop: LD   F0,0(R1)   ;F0 = vector element
        ADDD F4,F0,F2   ;add scalar from F2
        SD   0(R1),F4   ;store result
        SUBI R1,R1,8    ;decrement pointer 8 bytes (DW)
        BNEZ R1,Loop    ;branch if R1 != zero
        NOP             ;delayed branch slot

Latencies between a producing and a using instruction:

  Instruction producing result | Instruction using result | Latency (clock cycles)
  FP ALU op                    | Another FP ALU op        | 3
  FP ALU op                    | Store double             | 2
  Load double                  | FP ALU op                | 1
  Load double                  | Store double             | 0
  Integer op                   | Integer op               | 0

Where are the stalls?
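For reference, the loop above simply adds a scalar to every element of a double-precision vector, walking the pointer downward by 8 bytes per element. A source-level sketch in C (the array name and trip count are assumptions):

  /* C-level view of the FP loop: x[i] = x[i] + s over an array of
     doubles, iterating from the top of the array down, as the MIPS
     code does with its decrementing pointer. Names are invented. */
  void scalar_add(double *x, long n, double s) {
      for (long i = n - 1; i >= 0; i--)
          x[i] = x[i] + s;
  }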

FP Loop Showing Stalls

  1 Loop: LD   F0,0(R1)   ;F0 = vector element
  2       stall
  3       ADDD F4,F0,F2   ;add scalar in F2
  4       stall
  5       stall
  6       SD   0(R1),F4   ;store result
  7       SUBI R1,R1,8    ;decrement pointer 8 bytes (DW)
  8       BNEZ R1,Loop    ;branch if R1 != zero
  9       stall           ;delayed branch slot

With the latencies from the previous slide, this takes 9 clocks per iteration. Can we rewrite the code to minimize stalls?

Revised FP Loop Minimizing Stalls

  1 Loop: LD   F0,0(R1)
  2       stall
  3       ADDD F4,F0,F2
  4       SUBI R1,R1,8
  5       BNEZ R1,Loop    ;delayed branch
  6       SD   8(R1),F4   ;offset altered when moved past SUBI

We swapped BNEZ and SD by changing the address of the SD: 6 clocks per iteration. Can we unroll the loop 4 times to make the code faster?

Unroll Loop Four Times (straightforward way)

  1 Loop: LD   F0,0(R1)      ;1 cycle stall after each LD
  2       ADDD F4,F0,F2      ;2 cycles stall after each ADDD
  3       SD   0(R1),F4      ;drop SUBI & BNEZ
  4       LD   F6,-8(R1)
  5       ADDD F8,F6,F2
  6       SD   -8(R1),F8     ;drop SUBI & BNEZ
  7       LD   F10,-16(R1)
  8       ADDD F12,F10,F2
  9       SD   -16(R1),F12   ;drop SUBI & BNEZ
 10       LD   F14,-24(R1)
 11       ADDD F16,F14,F2
 12       SD   -24(R1),F16
 13       SUBI R1,R1,#32     ;altered to 4*8
 14       BNEZ R1,LOOP
 15       NOP

15 instructions + 4 x (1 + 2) stall cycles = 27 clock cycles, or 6.8 per iteration. Assumes the number of iterations (R1/8) is a multiple of 4. Can we rewrite the loop to minimize stalls?

Unrolled Loop That Minimizes Stalls

  1 Loop: LD   F0,0(R1)
  2       LD   F6,-8(R1)
  3       LD   F10,-16(R1)
  4       LD   F14,-24(R1)
  5       ADDD F4,F0,F2
  6       ADDD F8,F6,F2
  7       ADDD F12,F10,F2
  8       ADDD F16,F14,F2
  9       SD   0(R1),F4
 10       SD   -8(R1),F8
 11       SD   -16(R1),F12
 12       SUBI R1,R1,#32
 13       BNEZ R1,LOOP
 14       SD   8(R1),F16    ;8 - 32 = -24

14 clock cycles, or 3.5 per iteration. What assumptions were made when we moved the code?
- It is OK to move the store past the SUBI even though SUBI changes the register used in the address (we compensate by altering the offset).
- It is OK to move loads before stores: do we still get the right data?
- When is it safe for the compiler to make such changes?
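At the source level, the unrolled loop corresponds to the C transformation sketched below (assuming, as the slide does, that the trip count is a multiple of 4). The compiler-scheduled MIPS code additionally interleaves the loads, adds, and stores across iterations; here we show only the unrolling itself.

  /* 4x unrolling of the scalar-add loop. Assumes n % 4 == 0. */
  void scalar_add_unrolled(double *x, long n, double s) {
      for (long i = n - 1; i >= 3; i -= 4) {
          x[i]     += s;
          x[i - 1] += s;
          x[i - 2] += s;
          x[i - 3] += s;
      }
  }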

Loop Level Parallelism

Loop level parallelism can be exploited:
- as ILP, by unrolling loops
- by vector machines

Turning loop level parallelism into ILP (unrolling the loop) can be done:
- statically, by the compiler
- dynamically, by the hardware

Branch Predictions

Reducing Branch Penalty

Branch penalty in dynamically scheduled processors: wasted cycles due to pipeline flushing on mispredicted branches. To reduce the branch penalty:
- predict branch/jump instructions AND the branch direction (taken or not taken)
- predict the branch/jump target address (for taken branches)
- speculatively execute instructions along the predicted path

What to Use and What to Predict

Available information:
- the current predicted PC
- past branch history (direction and target)

What to predict:
- conditional branch instructions: branch direction and target address
- jump instructions: target address
- procedure calls/returns: target address

Instructions may need to be predecoded.
[Figure: fetch-stage block diagram in which the PC indexes both the instruction memory (IM) and the predictors; the predictors return pred_pc and prediction info, and the PC & instruction plus mis-prediction feedback flow back to update them.]
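All of the predictors that follow plug into the same fetch-time contract: given a PC, produce a direction (and, for taken branches, a target) before the branch even decodes, then learn from feedback once it resolves. As a warm-up, a minimal static scheme in C ("backward taken, forward not taken" is a classic heuristic used here for illustration, not something the slides prescribe):

  #include <stdbool.h>
  #include <stdint.h>

  /* Static direction prediction needs no history bits at all:
     loop-closing branches jump backward and are usually taken, so
     predict backward branches taken and forward branches not taken. */
  bool static_predict_taken(uint64_t branch_pc, uint64_t target_pc) {
      return target_pc < branch_pc;   /* backward branch -> predict taken */
  }

The dynamic predictors on the next slides replace this fixed rule with state that adapts to each branch's actual behavior.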

Mis-prediction Detection and Feedback

Detection:
- At the end of decoding: the target address becomes known at decode and does not match the prediction; flush the fetch stage.
- At commit (most cases): wrong branch direction, or the target address does not match; flush the whole pipeline. (The MIPS R10000 detects mis-predictions at EXE.)

Feedback to the predictors:
- any time a mis-prediction is detected
- at a branch's commit (updating already at EXE is called speculative update)

[Figure: pipeline diagram FETCH, RENAME (with a reorder buffer, ROB), SCHD, EXE, WB, COMMIT, with feedback paths from the later stages back to the predictors in FETCH.]

Branch Direction Prediction

Predict the branch direction: taken or not taken (T/NT).

  BNE R1, R2, L1   ;taken: continue at L1; not taken: fall through

- Static prediction: the compiler decides the direction.
- Dynamic prediction: the hardware decides the direction using dynamic information:
  1. 1-bit branch-prediction buffer
  2. 2-bit branch-prediction buffer
  3. correlating branch-prediction buffer
  4. tournament branch predictor
  5. and more

Predictor for a Single Branch

General form: (1) access the predictor with the PC, (2) output the prediction (T/NT), (3) feed the actual outcome (T/NT) back.
A 1-bit predictor has two states: in state 1 it predicts taken, in state 0 not taken; a taken outcome moves it to state 1, a not-taken outcome moves it to state 0.

Branch History Table of 1-bit Predictors

- The BHT is also called a branch-prediction buffer in the textbook.
- We could use a single 1-bit predictor, but its accuracy would be low.
- BHT: a table of simple predictors indexed by k bits of the branch address (2^k entries), similar to a direct-mapped cache. More entries cost more, but give fewer conflicts and higher accuracy.
- A BHT can also contain more complex predictors.
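A minimal 1-bit BHT sketch in C (the table size and the choice of index bits are illustrative assumptions): prediction is a table lookup on low-order PC bits, and update simply overwrites the entry with the last outcome.

  #include <stdbool.h>
  #include <stdint.h>

  #define K 10                               /* 2^10 = 1024 entries */
  static uint8_t bht[1 << K];                /* 1 = predict taken, 0 = not taken */

  /* Drop the low 2 bits (word-aligned instructions), keep K index
     bits. As in a direct-mapped cache, different branches may collide. */
  static unsigned bht_index(uint64_t pc) { return (pc >> 2) & ((1u << K) - 1); }

  bool bht_predict(uint64_t pc)             { return bht[bht_index(pc)] != 0; }
  void bht_update (uint64_t pc, bool taken) { bht[bht_index(pc)] = taken ? 1 : 0; }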

1-bit BHT Weakness

Example: on a loop branch, a 1-bit BHT causes 2 mis-predictions per loop execution. Consider a loop of 9 iterations before exit:

  for ( ... ) {
      for (i = 0; i < 9; i++)
          a[i] = a[i] * 2.0;
  }

- at the end of the loop, when it exits instead of looping as before
- the first time through the loop on the next execution of the code, when it predicts exit instead of looping

Only 80% accuracy, even though the branch loops 90% of the time.

2-bit Saturating Counter

Solution: a 2-bit scheme that changes the prediction only after mis-predicting twice (Figure 3.7, p. 249).

[Figure: 4-state diagram. States 11 and 10 predict taken; states 01 and 00 predict not taken. Each taken outcome moves the counter one state toward 11 and each not-taken outcome one state toward 00, saturating at both ends.]

This adds hysteresis to the decision-making process.

Correlating Branches

Code example showing the potential:

  if (d == 0)
      d = 1;
  if (d == 1)
      ...

Assembly code:

      BNEZ   R1, L1       ;branch 1: d != 0?
      DADDIU R1, R0, #1   ;d = 1
  L1: DADDIU R3, R1, #-1
      BNEZ   R3, L2       ;branch 2: d != 1?
      ...
  L2: ...

Observation: if BNEZ1 is not taken, then BNEZ2 is not taken.

Correlating Branch Predictor

Idea: the taken/not-taken behavior of recently executed branches is related to the behavior of the next branch (as well as to that branch's own history). The behavior of recent branches then selects between, say, 2 predictions for the next branch, and only the selected prediction is updated.

(1,1) predictor: 1 bit of global history, 1 bit of local history.
[Figure: the branch address (4 bits) indexes pairs of 1-bit local predictors; a 1-bit global branch history (0 = not taken) selects which predictor of the pair supplies the prediction.]
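Returning to the saturating counter: a minimal 2-bit BHT sketch in C (the table size and index bits are illustrative assumptions). The counter encoding follows the state diagram: 11 and 10 predict taken, 01 and 00 predict not taken, and updates saturate at both ends, so a single anomalous outcome no longer flips the prediction.

  #include <stdbool.h>
  #include <stdint.h>

  #define K2 10
  static uint8_t ctr[1 << K2];             /* each entry: counter 0..3 */

  static unsigned ctr_index(uint64_t pc) { return (pc >> 2) & ((1u << K2) - 1); }

  bool bht2_predict(uint64_t pc) {
      return ctr[ctr_index(pc)] >= 2;      /* states 10 and 11 predict taken */
  }

  void bht2_update(uint64_t pc, bool taken) {
      uint8_t *c = &ctr[ctr_index(pc)];
      if (taken)  { if (*c < 3) (*c)++; }  /* saturate at 11 */
      else        { if (*c > 0) (*c)--; }  /* saturate at 00 */
  }

On the loop example above, the counter drops only from 11 to 10 on the exit mis-prediction, so the branch is still predicted taken on re-entry: one mis-prediction per loop execution instead of two.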

Correlating Branch Predictor

General form: an (m, n) predictor uses m bits of global history and n bits of local history per predictor, recording the correlation among m+1 branches. Simple implementation: the global history can be stored in a shift register.

Example: (2,2) predictor, 2-bit global, 2-bit local.
[Figure: the branch address (4 bits) indexes groups of 2-bit local predictors; a 2-bit global branch history (e.g., 01 = not taken then taken) selects which predictor in the group supplies the prediction.]

Accuracy of Different Schemes (Figure 3.15, p. 206)

[Figure: bar chart of mis-prediction frequency (0 to 0.20) on the benchmarks nasa7, tomcatv, spice, gcc, and eqntott for three schemes: a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry (2,2) BHT.]

Accuracy of Return Address Predictor

[Figure only; not transcribed.]

Branch Target Buffer

Branch Target Buffer (BTB): the address of the branch indexes the table to get the predicted target PC AND the branch address (for taken branches). Note: we must check that the branch address matches, since we cannot use the wrong branch's target.

Example: BTB combined with a BHT.
[Figure: at FETCH, the PC of the instruction is compared against the Branch PC field of a BTB entry, which also holds the Predicted PC and extra prediction state bits. No match: the instruction is not predicted to be a branch; proceed normally (Next PC = PC + 4). Match: the instruction is a branch; use the Predicted PC as the next PC.]
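Putting the pieces together, a minimal (2,2) predictor sketch in C matching the slide's example (4 branch-address bits, 2-bit global history kept in a shift register; the choice of index bits is an assumption). The global history selects one of four 2-bit counters per table entry.

  #include <stdbool.h>
  #include <stdint.h>

  static uint8_t corr[16][4];   /* [4 addr bits][2-bit global history] -> counter 0..3 */
  static uint8_t ghist = 0;     /* 2-bit global history shift register */

  static unsigned corr_index(uint64_t pc) { return (pc >> 2) & 0xF; }

  bool corr_predict(uint64_t pc) {
      return corr[corr_index(pc)][ghist] >= 2;   /* >= 2 means predict taken */
  }

  void corr_update(uint64_t pc, bool taken) {
      uint8_t *c = &corr[corr_index(pc)][ghist];
      if (taken) { if (*c < 3) (*c)++; } else { if (*c > 0) (*c)--; }
      ghist = (uint8_t)(((ghist << 1) | (taken ? 1 : 0)) & 0x3);  /* shift in the outcome */
  }

Only the counter selected by the current global history is updated, exactly as the slides describe for the (1,1) case.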