As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor.

Hiroaki Kobayashi

Branches will arrive up to n times faster in an n-issue processor, and providing an instruction stream to the processor will require that we predict the outcome of branches. Amdahl's Law reminds us that the relative impact of control stalls grows as the potential CPI of such machines falls.

Branch Prediction Schemes
- Static branch prediction by the compiler: predict-taken/predict-not-taken schemes, the delayed branch scheme
- Dynamic branch prediction by hardware: branch-prediction buffer, branch history table, branch target buffer

The goal of these mechanisms is to allow the processor to resolve the outcome of a branch early.

Branch-Prediction Buffer
A small memory indexed by the lower portion of the address of the branch instruction. The memory contains bits that say whether the branch was recently taken or not. If the prediction turns out to be wrong, the prediction bits are updated. Prediction accuracy depends on the number of bits and the scheme!

1-bit scheme
A single bit holds the direction of the last branch. If the prediction turns out to be wrong, the bit is inverted and stored back. A simple and low-cost scheme, but low accuracy: even if a branch is almost always taken, it will likely predict incorrectly twice, rather than once, when it is not taken.

Example: Consider a loop branch whose behavior is taken nine times in a row, then not taken once. What is the prediction accuracy for this branch? It mispredicts on the first and last loop iterations, so the accuracy for this branch that is taken 90% of the time is only 80%!
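The 1-bit scheme above can be checked with a short simulation (a minimal sketch; the predictor simply remembers the outcome of the last execution of the branch):

```python
def simulate_1bit(outcomes, initial=True):
    """Simulate a 1-bit branch predictor; returns prediction accuracy.
    state holds the direction of the last branch (True = taken)."""
    state = initial
    correct = 0
    for taken in outcomes:
        if state == taken:
            correct += 1
        state = taken  # on a miss, the bit is inverted and stored back
    return correct / len(outcomes)

# Loop branch taken nine times in a row, then not taken once, repeated.
pattern = ([True] * 9 + [False]) * 100
print(simulate_1bit(pattern))  # 0.801, i.e. ~80%: two misses per ten
                               # branches after the first pass
```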

2-bit scheme
A prediction must miss twice before it is changed. A 2-bit saturating counter holds information about the recent branch behavior: the counter is incremented on a taken branch and decremented on an untaken branch. An n-bit extension is possible, but studies of n-bit predictors have shown that 2-bit predictors do almost as well, so most systems rely on 2-bit branch predictors rather than the more general n-bit predictors.

Performance of a branch-prediction buffer with 4K entries on SPEC: because the integer programs (li, eqntott, espresso, and gcc) have a higher branch frequency, the prediction accuracy has a larger impact on their performance. The prediction accuracy of a 4K-entry 2-bit buffer is compared against an infinite buffer.
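On the same nine-taken/one-not-taken loop branch, a 2-bit saturating counter mispredicts only the single not-taken execution (a minimal sketch; states 0-1 predict not taken, 2-3 predict taken):

```python
def simulate_2bit(outcomes, state=3):
    """2-bit saturating counter predictor: states 0-1 predict not taken,
    states 2-3 predict taken. Incremented on a taken branch, decremented
    on an untaken branch; a prediction must miss twice to flip."""
    correct = 0
    for taken in outcomes:
        if (state >= 2) == taken:
            correct += 1
        state = min(3, state + 1) if taken else max(0, state - 1)
    return correct / len(outcomes)

# Same loop branch: taken nine times in a row, then not taken once.
pattern = ([True] * 9 + [False]) * 100
print(simulate_2bit(pattern))  # 0.9: only the not-taken execution misses
```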

Correlating Branch Predictors
Neither the number of entries nor the number of bits is the limiting factor! We need to consider the correlation among branches for more accurate branch prediction: the n-bit predictor schemes use only the recent behavior of a single branch to predict the future behavior of that branch.

Example:
if (aa==2) aa=0;
if (bb==2) bb=0;
if (aa!=bb) { ... }

    DSUBUI R3,R1,#2
    BNEZ   R3,L1       ;branch b1 (aa!=2)
    DADD   R1,R0,R0    ;aa=0
L1: DSUBUI R3,R2,#2
    BNEZ   R3,L2       ;branch b2 (bb!=2)
    DADD   R2,R0,R0    ;bb=0
L2: DSUBU  R3,R1,R2    ;R3=aa-bb
    BEQZ   R3,L3       ;branch b3 (aa==bb)

The behavior of b3 is correlated with b1 and b2: if b1 and b2 are both not taken, then aa = bb = 0 and b3 must be taken. So some outcome combinations of (b1, b2, b3) may occur, but the others never occur!

Example:
if (d==0) d=1;
if (d==1) { ... }

    BNEZ   R1,L1       ;branch b1 (d!=0)
    DADDIU R1,R0,#1    ;d==0, so d=1
L1: DADDIU R3,R1,#-1
    BNEZ   R3,L2       ;branch b2 (d!=1)
    ...
L2:

Possible execution sequences for this code fragment:

Initial value of d | d==0? | b1        | Value of d before b2 | d==1? | b2
0                  | Yes   | Not taken | 1                    | Yes   | Not taken
1                  | No    | Taken     | 1                    | Yes   | Not taken
2                  | No    | Taken     | 2                    | No    | Taken

Behavior of a 1-bit predictor initialized to not taken, with d alternating between 2 and 0:

d | b1 prediction | b1 action | New b1 prediction | b2 prediction | b2 action | New b2 prediction
2 | NT            | T         | T                 | NT            | T         | T
0 | T             | NT        | NT                | T             | NT        | NT
2 | NT            | T         | T                 | NT            | T         | T
0 | T             | NT        | NT                | T             | NT        | NT

All the branches are mispredicted!
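When d alternates between 2 and 0, each branch flips direction on every execution, so independent 1-bit predictors get every prediction wrong. A quick simulation (a sketch; the two-branch stream encoding is our own):

```python
def one_bit_per_branch(branch_outcomes, n_branches=2):
    """Independent 1-bit predictors, one per branch, initialized to
    not taken. Each predictor just remembers that branch's last outcome."""
    pred = [False] * n_branches
    misses = 0
    for branch, taken in branch_outcomes:
        if pred[branch] != taken:
            misses += 1
        pred[branch] = taken
    return misses

# d alternates 2, 0, 2, 0, ...: outcomes of (b1, b2) per iteration are
# (Taken, Taken) then (Not taken, Not taken), repeated.
stream = [(0, True), (1, True), (0, False), (1, False)] * 50
print(one_bit_per_branch(stream), "of", len(stream))  # 200 of 200: every
                                                      # branch mispredicted
```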

Prediction bits are provided for each case of the correlating branch patterns:

Prediction bits | Prediction if last branch not taken | Prediction if last branch taken
NT/NT           | NT                                  | NT
NT/T            | NT                                  | T
T/NT            | T                                   | NT
T/T             | T                                   | T

The action of the 1-bit predictor with 1 bit of correlation, initialized to not taken/not taken, with d alternating between 2 and 0:

d | b1 prediction | b1 action | New b1 prediction | b2 prediction | b2 action | New b2 prediction
2 | NT/NT         | T         | T/NT              | NT/NT         | T         | NT/T
0 | T/NT          | NT        | T/NT              | NT/T          | NT        | NT/T
2 | T/NT          | T         | T/NT              | NT/T          | T         | NT/T
0 | T/NT          | NT        | T/NT              | NT/T          | NT        | NT/T

The only mispredictions are on the first iteration! This is a (1,1) predictor: it uses the behavior of the last branch to choose from among a pair of 1-bit branch predictors.

(m,n) predictor: uses the behavior of the last m branches to choose from among 2^m branch predictors, each of which is an n-bit predictor for a single branch. A (2,2) branch-prediction buffer uses a 2-bit global history to choose from among four predictors for each branch address. A 2-bit predictor with no global history is simply a (0,2) predictor.
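The (1,1) scheme can be checked on the same alternating pattern (a minimal sketch; the global history is a single bit holding the outcome of the most recently executed branch, and the stream encoding is our own):

```python
def correlating_11(branch_outcomes, n_branches=2):
    """(1,1) correlating predictor: one global history bit selects between
    a pair of 1-bit predictors per branch, initialized to NT/NT."""
    pred = [[False, False] for _ in range(n_branches)]  # [if last NT, if last T]
    last = False  # outcome of the most recently executed branch
    misses = 0
    for branch, taken in branch_outcomes:
        slot = 1 if last else 0
        if pred[branch][slot] != taken:
            misses += 1
        pred[branch][slot] = taken  # 1-bit predictor stores the last outcome
        last = taken
    return misses

# d alternates 2, 0: outcomes of (b1, b2) are (T, T), (N, N), repeated.
stream = [(0, True), (1, True), (0, False), (1, False)] * 50
print(correlating_11(stream))  # 2: both misses are in the first iteration
```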

Branch-Target Buffer
To reduce the branch penalty in these pipelines, we need to know what address to fetch by the end of IF, not in ID. We can get a hint from the instruction address: the branch-target buffer.
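The lookup can be sketched as a small direct-mapped table consulted during IF (a minimal sketch; the size, tag handling, and update policy here are illustrative assumptions, not details from the slides):

```python
# Direct-mapped branch-target buffer sketch, storing only taken branches.
SIZE = 1024  # number of entries, indexed by low-order bits of the branch PC

btb_tags = [None] * SIZE
btb_targets = [0] * SIZE

def btb_lookup(pc):
    """During IF: on a hit, predict taken and fetch from the stored target
    on the next cycle; on a miss, fetch sequentially."""
    i = (pc >> 2) % SIZE
    if btb_tags[i] == pc:
        return btb_targets[i]   # predicted taken: redirect fetch
    return pc + 4               # fall through: sequential fetch

def btb_update(pc, target, taken):
    """After the branch resolves: store taken branches; evict on not taken."""
    i = (pc >> 2) % SIZE
    if taken:
        btb_tags[i], btb_targets[i] = pc, target
    elif btb_tags[i] == pc:
        btb_tags[i] = None      # prediction was wrong: remove the entry

btb_update(0x400, 0x800, True)
print(hex(btb_lookup(0x400)))   # 0x800 (hit: predicted target)
print(hex(btb_lookup(0x404)))   # 0x408 (miss: sequential fetch)
```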

Penalties for all possible combinations of whether the branch is in the buffer and what it actually does, assuming we store only taken branches in the buffer:

Instruction in buffer | Prediction | Actual branch | Penalty cycles
Yes                   | Taken      | Taken         | 0
Yes                   | Taken      | Not taken     | 2
No                    |            | Taken         | 2
No                    |            | Not taken     | 0

Example: Determine the total branch penalty for a branch-target buffer, assuming the penalty cycles for individual mispredictions from the table above. Make the following assumptions about the accuracy and hit rate:
- Prediction accuracy is 90% (for instructions in the buffer).
- Hit rate in the buffer is 90% (for branches predicted taken).
- Assume that 60% of the branches are taken.
Hints: calculate
- Probability (branch in buffer, but actually not taken)
- Probability (branch not in buffer, but actually taken)
Branch penalty = sum over these events of (probability of event) x (penalty)

Multiple-Issue Processors
Goal: decrease the CPI to less than one! Allow multiple instructions to issue in a clock cycle: superscalar processors and VLIW (very long instruction word) processors.

Common name               | Issue structure | Hazard detection | Scheduling               | Distinguishing characteristic           | Example
Superscalar (static)      | Dynamic         | Hardware         | Static                   | In-order execution                      | Sun UltraSPARC II/III
Superscalar (dynamic)     | Dynamic         | Hardware         | Dynamic                  | Some out-of-order execution             | IBM Power2
Superscalar (speculative) | Dynamic         | Hardware         | Dynamic with speculation | Out-of-order execution with speculation | Pentium III/4, MIPS R10K, Alpha 21264, HP PA 8500, IBM RS64III
VLIW/LIW                  | Static          | Software         | Static                   | No hazards between issue packets        | Trimedia, i860
EPIC*                     | Mostly static   | Mostly software  | Mostly static            | Explicit dependences marked by compiler | Itanium

*EPIC: Explicitly Parallel Instruction Computers
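A worked solution to the branch-target-buffer exercise can be sketched as follows, taking the standard assumptions for this example (90% prediction accuracy, 90% hit rate, 60% of branches taken) and the 2-cycle penalties for the two mispredicted cases:

```python
# Worked branch-target-buffer penalty calculation (standard assumptions).
hit_rate = 0.90   # branch found in the buffer
accuracy = 0.90   # prediction accuracy for branches in the buffer
taken = 0.60      # fraction of branches actually taken

# The two events that cost 2 cycles each:
p_hit_mispredict = hit_rate * (1 - accuracy)  # in buffer, actually not taken
p_miss_taken = (1 - hit_rate) * taken         # not in buffer, actually taken

branch_penalty = 2 * p_hit_mispredict + 2 * p_miss_taken
print(branch_penalty)  # ~0.30 cycles of penalty per branch
```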

Number of instructions per clock (issue width): the necessary hazard checks among up to the maximum issue width of instructions must complete in one clock cycle!

Instruction issue mechanisms:
- Statically scheduled, using compiler techniques. In-order execution: if some instruction in the instruction stream is dependent or doesn't meet the issue criteria, only the instructions preceding that one in the instruction sequence will be issued.
- Dynamically scheduled, using techniques based on Tomasulo's algorithm. Out-of-order execution: instructions will be issued as long as no hazards occur.

Statically scheduled superscalar: instructions issue in order, and all pipeline hazards are checked at issue time. An integer instruction and an FP instruction can issue as a pair, with successive pairs offset by one stage:

Instruction type    | Pipe stages
Integer instruction | IF ID EX MEM WB
FP instruction      | IF ID EX MEM WB
Integer instruction |    IF ID EX MEM WB
FP instruction      |    IF ID EX MEM WB
Integer instruction |       IF ID EX MEM WB
FP instruction      |       IF ID EX MEM WB

Dynamically Scheduled Superscalar
Goal: instructions issue at least until the hardware runs out of reservation stations.

Example: implementation of a two-issue dynamically scheduled processor. Consider the execution of the following simple loop, which adds a scalar in F2 to each element of a vector in memory:

Loop: LD     F0,0(R1)   ;F0=array element
      ADDD   F4,F0,F2   ;add scalar in F2
      SD     F4,0(R1)   ;store result
      DADDIU R1,R1,#-8  ;decrement pointer by 8 bytes
      BNE    R1,R2,Loop ;branch if R1!=R2

Assumptions:
- The processor can issue two instructions on every clock cycle: one integer operation and one FP operation.
- Latencies: one cycle for the integer ALU, two cycles for loads, three cycles for the FP add.

[Timing table for three iterations of the loop, giving the issue, execute, memory-access, and write-CDB cycle of each instruction. The comments show the constraints: each ADDD waits for its LD, each SD waits for the FP result, the DADDIU waits for the integer ALU, and each new iteration's LD waits for the BNE to complete.]
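The dependence structure of the loop can be made concrete with a tiny sketch of the serial chain each iteration contains (illustrative labels; the one-cycle store memory access is our own assumption, the other latencies are the ones stated above):

```python
# Per-iteration dependent chain under the stated latencies: the ADDD needs
# the LD result, and the SD needs the ADDD result, so these three form a
# serial chain no matter how many instructions issue per clock.
latencies = {"LD": 2, "ADDD": 3, "SD": 1}  # SD memory access assumed 1 cycle

chain = ["LD", "ADDD", "SD"]
chain_cycles = sum(latencies[i] for i in chain)
print(chain_cycles)  # 6 cycles of dependent work per iteration
```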

[Resource usage table: for each clock cycle, which iteration's instruction occupies the integer ALU, the FP ALU, the data cache, and the CDB.]

The issue rate is higher than the instruction completion rate: the integer ALU becomes a bottleneck!

[Timing table for three iterations after adding a separate address adder: each ADDD still waits for its LD and each new iteration's LD still waits for the BNE to complete, but each DADDIU now executes earlier.]

[Resource usage table for the improved configuration: per clock cycle, the integer ALU, address adder, FP ALU, data cache, and two CDBs.]

The instruction completion rate improves with the separate address adder and the second CDB.

- There is an imbalance between the functional unit structure of the pipeline and our example loop. This imbalance makes it impossible to fully use the FP units; to remedy it, we would need fewer dependent integer operations per loop.
- The amount of overhead per loop iteration is very high: two out of five instructions (the DADDIU and the BNE) are overhead.
- The control hazard, which prevents us from starting the next LD before we know whether the branch was correctly predicted, causes a one-cycle penalty on every loop iteration.