CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation, Branch Prediction


Michela Taufer
http://www.cis.udel.edu/~taufer/teaching/cis6627

PowerPoint lecture notes from John Hennessy and David Patterson's Computer Architecture, 4th edition. Additional teaching material from Jelena Mirkovic (U Del), John Kubiatowicz (UC Berkeley), and Soner Oender (Michigan Technological University).

Reducing Branch Penalty

The branch penalty in dynamically scheduled processors is the cycles wasted flushing the pipeline on mis-predicted branches. To reduce the branch penalty:
- Predict branch/jump instructions AND the branch direction (taken or not taken)
- Predict the branch/jump target address (for taken branches)
- Speculatively execute instructions along the predicted path

What to Use and What to Predict

Available information:
- The current (predicted) PC
- Past branch history (direction and target)

What to predict:
- Conditional branch instructions: branch direction and target address
- Jump instructions: target address
- Procedure calls/returns: target address

Instructions may need to be pre-decoded. [Figure: the fetch PC goes to instruction memory and to the predictors; the predictors return pred_pc and prediction info, and resolved branch outcomes are fed back to them.]
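The predict-and-feedback loop described above can be sketched in a few lines. This is a minimal illustration, not from the slides; the class and function names are made up for the sketch.

```python
# Illustrative sketch: the fetch stage consults a predictor with the
# current PC and receives a predicted next PC; the back end later feeds
# the actual outcome back so the predictor can update its state.

class TrivialPredictor:
    """Always predicts not-taken: next PC = PC + 4 (4-byte instructions)."""
    def predict(self, pc):
        return pc + 4              # predicted next fetch PC

    def feedback(self, pc, taken, target):
        pass                       # a real predictor would update state here

def fetch_step(pc, predictor):
    # Fetch the instruction at pc (omitted here) and pick the next fetch
    # PC from the predictor; on a misprediction the pipeline is flushed
    # and fetch restarts from the correct target.
    return predictor.predict(pc)

p = TrivialPredictor()
next_pc = fetch_step(0x400000, p)
```

A real front end would also carry the prediction info down the pipeline so the resolving stage knows what was predicted.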

Mis-prediction Detection and Feedback

Detection:
- At the end of decoding: the target address is known at decode and does not match the prediction. Flush the fetch stage.
- At commit (most cases): wrong branch direction, or the target address does not match. Flush the whole pipeline. (The MIPS R10000 detects at EXE.)

Feedback:
- Any time a mis-prediction is detected
- At a branch's commit (updating earlier, at EXE, is called speculative update)

[Figure: pipeline stages FETCH, RENAME, ROB, SCHD, EXE, WB, COMMIT, with feedback paths from the later stages to the predictors.]

Branch Direction Prediction

Predict the branch direction: taken or not taken (T/NT). Example: BNE R1, R2, L1 either falls through (not taken) or jumps to L1 (taken).

- Static prediction: the compiler decides the direction.
- Dynamic prediction: hardware decides the direction using dynamic information:
  1. 1-bit branch-prediction buffer
  2. 2-bit branch-prediction buffer
  3. Correlating branch prediction buffer
  4. Tournament branch predictor
  5. and more

Predictor for a Single Branch

General form: 1. access the predictor with the PC; 2. it outputs a prediction (T/NT); 3. the actual outcome (T/NT) is fed back to update the predictor state. [Figure: a 1-bit predictor is a two-state machine (state 1 = predict taken, state 0 = predict not taken) that flips state on each mis-prediction.]

Branch History Table (BHT) of 1-bit Predictors

- Also called a branch-prediction buffer in the textbook.
- A single 1-bit predictor could serve all branches, but its accuracy is low.
- A BHT is a table of simple predictors indexed by bits of the PC, similar to a direct-mapped cache: more entries cost more, but give fewer conflicts and higher accuracy. A k-bit index selects among 2^k entries.
- A BHT can also contain more complex predictors.
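A 1-bit BHT as described above can be sketched as follows (illustrative names; the PC is shifted right by 2 to drop the byte offset of word-aligned instructions):

```python
class OneBitBHT:
    """1-bit branch history table: one taken/not-taken bit per entry,
    indexed by low-order PC bits like a direct-mapped cache, so
    different branches can conflict on the same entry."""
    def __init__(self, k):
        self.table = [0] * (2 ** k)   # 0 = predict not taken, 1 = taken
        self.mask = (2 ** k) - 1

    def predict(self, pc):
        return self.table[(pc >> 2) & self.mask] == 1

    def feedback(self, pc, taken):
        # A 1-bit predictor simply records the last outcome.
        self.table[(pc >> 2) & self.mask] = 1 if taken else 0

bht = OneBitBHT(k=4)
assert bht.predict(0x40) is False    # initial state: predict not taken
bht.feedback(0x40, taken=True)
assert bht.predict(0x40) is True     # flips after a single outcome
```

Note that PCs 0x40 and 0x80 alias to the same entry with k=4, which is exactly the conflict effect that larger tables reduce.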

1-bit BHT Weakness

Example: in a loop, a 1-bit BHT causes 2 mis-predictions per loop execution. Consider a loop of 9 iterations before exit:

for ( ) {
    for (i = 0; i < 9; i++)
        a[i] = a[i] * 2.0;
}

- At the end of the loop, when it exits instead of looping as before
- The first time through the loop on the next time through the code, when it predicts exit instead of looping

So accuracy is only 80% even though the branch is taken 90% of the time.

2-bit Saturating Counter

Solution: a 2-bit scheme that changes the prediction only after two consecutive mis-predictions (Figure 3.7, p. 249). States 11 and 10 predict taken; states 01 and 00 predict not taken. A taken outcome moves the counter up toward 11, a not-taken outcome moves it down toward 00. This adds hysteresis to the decision-making process.

Correlating Branches

Code example showing the potential:

if (d == 0) d = 1;
if (d == 1) ...

Assembly code:

    BNEZ   R1, L1
    DADDIU R1, R0, #1
L1: DADDIU R3, R1, #-1
    BNEZ   R3, L2
L2: ...

Observation: if the first BNEZ is not taken (d == 0), then d is set to 1, R3 becomes 0, and the second BNEZ is not taken either.

Correlating Branch Predictor

Idea: the taken/not-taken behavior of recently executed branches is correlated with the behavior of the next branch (as well as with that branch's own history). The behavior of recent branches then selects between, say, two predictions of the next branch, and only the selected prediction is updated.

(1,1) predictor: 1-bit global history, 1-bit local predictors. [Figure: the branch address (4 bits) indexes a row of 1-bit local predictors; a 1-bit global branch history (0 = not taken) selects which column supplies the prediction.]
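The loop example can be checked with a small simulation. This sketch (illustrative names) compares a single 1-bit predictor against a 2-bit saturating counter over two back-to-back executions of the 9-iteration loop branch:

```python
class OneBit:
    def __init__(self): self.bit = 1              # start predicting taken
    def predict(self): return self.bit == 1
    def update(self, taken): self.bit = 1 if taken else 0

class TwoBitSaturating:
    """Counter 0..3; predict taken for 2 and 3; two consecutive
    mis-predictions are needed to flip the prediction (hysteresis)."""
    def __init__(self): self.ctr = 3              # start strongly taken
    def predict(self): return self.ctr >= 2
    def update(self, taken):
        self.ctr = min(3, self.ctr + 1) if taken else max(0, self.ctr - 1)

def mispredictions(pred, outcomes):
    wrong = 0
    for taken in outcomes:
        wrong += (pred.predict() != taken)
        pred.update(taken)
    return wrong

# Two back-to-back passes of the loop branch: taken 8 times, then not
# taken at the exit, then the same again.
loop = ([True] * 8 + [False]) * 2
one_bit_wrong = mispredictions(OneBit(), loop)
two_bit_wrong = mispredictions(TwoBitSaturating(), loop)
```

In steady state the 1-bit predictor mis-predicts twice per pass (the exit and the re-entry), while the 2-bit counter mis-predicts only once (the exit), matching the slide's 80% vs. 90% observation.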

Correlating Branch Predictor

General form: an (m, n) predictor uses m bits of global history and n-bit local predictors, recording correlation among m+1 branches. A simple implementation stores the global history in a shift register.

Example: a (2,2) predictor with 2-bit global history and 2-bit local predictors. [Figure: the branch address (4 bits) indexes a row of 2-bit local predictors; the 2-bit global branch history (01 = not taken, then taken) selects the column.]

Accuracy of Different Schemes

[Figure 3.15, p. 206: frequency of mis-predictions for a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry (2,2) BHT.]

Accuracy of Return Address Predictor

[Figure: accuracy of a return address predictor.]

Branch Target Buffer

A Branch Target Buffer (BTB) is indexed by the branch address to get the prediction AND the branch target address (if taken). Note: the fetched PC must be checked against the stored branch PC, since using the wrong entry would supply the wrong branch address.

Example: a BTB combined with a BHT. In fetch, the PC of the instruction is compared against the branch PCs stored in the BTB. If there is no match, the instruction is not predicted as a branch and fetch proceeds normally (next PC = PC + 4). If there is a match, the instruction is a branch, and the predicted PC is used as the next PC, guided by extra prediction state bits.
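The (m, n) scheme can be sketched directly from the description above. This minimal sketch fixes n = 2 (2-bit saturating counters) and keeps the m-bit global history in a shift register, as on the slides; names and sizing are illustrative:

```python
class CorrelatingPredictor:
    """(m, 2) correlating predictor: m bits of global history select
    among 2**m 2-bit counters in the row chosen by the branch address."""
    def __init__(self, m, addr_bits):
        self.m = m
        self.ghist = 0                                  # global shift register
        self.table = [[1] * (2 ** m) for _ in range(2 ** addr_bits)]
        self.mask = (2 ** addr_bits) - 1

    def predict(self, pc):
        return self.table[(pc >> 2) & self.mask][self.ghist] >= 2

    def feedback(self, pc, taken):
        row = self.table[(pc >> 2) & self.mask]
        ctr = row[self.ghist]
        row[self.ghist] = min(3, ctr + 1) if taken else max(0, ctr - 1)
        # Shift the outcome into the m-bit global history register.
        self.ghist = ((self.ghist << 1) | (1 if taken else 0)) & ((1 << self.m) - 1)

p = CorrelatingPredictor(m=2, addr_bits=4)   # the slides' (2,2) predictor

# Train on a strictly alternating branch (T, NT, T, NT, ...): with the
# recent outcomes in the global history, the pattern becomes predictable.
wrong_last_20 = 0
for i in range(100):
    taken = (i % 2 == 0)
    if i >= 80 and p.predict(0x40) != taken:
        wrong_last_20 += 1
    p.feedback(0x40, taken)
```

A plain 2-bit counter cannot learn this alternating pattern, while the correlating predictor converges after a few iterations and then predicts it perfectly.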

Hardware-Based Speculation

Exploiting more ILP requires overcoming the limitation of control dependence. With branch prediction, we allowed the processor to continue issuing instructions past a branch based on a prediction:
- Those fetched instructions do not modify the processor state.
- They are squashed if the prediction is incorrect.

We now allow the processor to execute these instructions before we know whether it is OK to execute them:
- We need to correctly restore the processor state if such an instruction should not have been executed.
- We need to pass the results of these instructions to later instructions as if the program were simply following that path.

[Figure: a control-flow graph with two branches, B1 (x < y?) and B2, with assignments to A, B, C, and D along the taken and not-taken paths, and a final use of d.]

Assume the processor predicts B1 taken (T) and executes. What will happen if the prediction was wrong? What value of each variable should be used if the processor predicts both B1 and B2 taken (T) and executes instructions along the way?

In order to execute instructions speculatively, we need to provide the means:
- to roll back the values of both registers and memory to their correct values upon a misprediction, and
- to communicate speculatively calculated values to the new uses of those values.

Both can be provided by a simple structure called the Reorder Buffer (ROB).

Reorder Buffer

The ROB is a simple circular array with a head and a tail pointer:
- New instructions are allocated a position at the tail in program order.
- Each entry provides a location for storing the instruction's result.
- New instructions look for source values starting from the tail back.
- When the instruction at the head completes and becomes non-speculative, its values are committed and it is removed from the buffer.

Each entry has 3 fields: instruction, destination, and value. Entries can be operand sources, so the ROB acts as an extended register set (like the reservation stations) and supplies operands between execution complete and commit. When execution completes, results are tagged with the reorder buffer number instead of the reservation station. Once an instruction commits, its result is put into the register file. As a result, it is easy to undo speculated instructions on mispredicted branches or on exceptions.

Steps of the Speculative Tomasulo Algorithm

1. Issue (get the instruction from the FP op queue)
   1. Check whether the reorder buffer is full.
   2. Check whether a reservation station is available.
   3. Access the register file and the reorder buffer for the current values of the source operands.
   4. Send the instruction, its reorder buffer slot number, and the source operands to the reservation station.

2. Execute (operate on operands, EX)
   When both operands are ready and a functional unit is available, the instruction executes. This step checks RAW hazards: as long as operands are not ready, the reservation station watches the CDB for results. Once issued, the instruction stays in the reservation station until it has both operands.
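The circular head/tail structure described above can be sketched in a few dozen lines. This is an illustrative model, not a hardware description; all names are made up for the sketch:

```python
class ReorderBuffer:
    """Minimal ROB sketch: a circular array with head and tail pointers.
    Entries are allocated at the tail in program order; younger
    instructions look up operands from the tail backward; the head entry
    commits to the register file once it is done (non-speculative)."""
    def __init__(self, size):
        self.entries = [None] * size
        self.head = self.tail = 0
        self.count = 0

    def allocate(self, instr, dest):
        assert self.count < len(self.entries), "ROB full: stall issue"
        slot = self.tail
        self.entries[slot] = {"instr": instr, "dest": dest,
                              "value": None, "done": False}
        self.tail = (self.tail + 1) % len(self.entries)
        self.count += 1
        return slot              # reservation stations tag results with this

    def lookup(self, reg):
        """Newest in-flight value of reg, searching tail -> head.
        May return None if the producer has not yet computed it."""
        for i in range(self.count):
            slot = (self.tail - 1 - i) % len(self.entries)
            if self.entries[slot]["dest"] == reg:
                return self.entries[slot]["value"]
        return None              # no in-flight producer: use the register file

    def write_result(self, slot, value):
        # Broadcast on the CDB: the result lands in the ROB entry.
        self.entries[slot]["value"] = value
        self.entries[slot]["done"] = True

    def commit(self, regfile):
        """Retire the head entry if it is finished; True on commit."""
        e = self.entries[self.head]
        if e is None or not e["done"]:
            return False
        regfile[e["dest"]] = e["value"]
        self.entries[self.head] = None
        self.head = (self.head + 1) % len(self.entries)
        self.count -= 1
        return True

rob, regs = ReorderBuffer(4), {}
a = rob.allocate("LD F6, 34(R2)", "F6")
b = rob.allocate("ADDD F6, F8, F2", "F6")
rob.write_result(b, 7.0)
assert rob.commit(regs) is False      # in-order: head not done yet
assert rob.lookup("F6") == 7.0        # newest F6 comes from the ADDD
rob.write_result(a, 3.0)
assert rob.commit(regs) and regs["F6"] == 3.0
assert rob.commit(regs) and regs["F6"] == 7.0
```

The usage at the end shows both roles at once: the ROB forwards the newest speculative F6 to consumers, yet still commits the two writes to F6 in program order.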

3. Write result (finish execution, WB)
   Write the result on the Common Data Bus to all awaiting functional units and to the reorder buffer. Mark the reservation station available.

4. Commit (update the register file with the reorder buffer result)
   When the instruction reaches the head of the reorder buffer, its result is present, and no exceptions are associated with it, the instruction becomes non-speculative:
   - Update the register file with the result (or store to memory).
   - Remove the instruction from the reorder buffer.
   A mispredicted branch flushes the reorder buffer.

[Figure: the basic structure of a MIPS FP unit extended with a reorder buffer.]

Recall: Four Steps of the Speculative Tomasulo Algorithm

1. Issue: get the instruction from the FP op queue. If a reservation station and a reorder buffer slot are free, issue the instruction and send the operands and the reorder buffer number for the destination (this stage is sometimes called dispatch).
2. Execute: operate on the operands (EX). When both operands are ready, execute; if not ready, watch the CDB for results; checks RAW hazards (sometimes called issue).
3. Write result: finish execution (WB). Write on the Common Data Bus to all awaiting functional units and the reorder buffer; mark the reservation station available.
4. Commit: update the register file with the reorder buffer result. When the instruction is at the head of the reorder buffer and its result is present, update the register file (or store to memory) and remove the instruction from the reorder buffer. A mispredicted branch flushes the reorder buffer (this stage is sometimes called graduation).
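The commit step's branch handling can be sketched on its own. This is a hypothetical, list-based ROB (oldest entry first), purely to illustrate the flush-on-mispredict rule in step 4:

```python
def commit_head(rob, regfile):
    """Try to retire the oldest ROB entry (rob[0]). rob is a list of
    dicts, oldest first; this representation is illustrative only."""
    if not rob or not rob[0]["done"]:
        return None                       # head not finished: nothing commits
    head = rob.pop(0)
    if head.get("is_branch") and head["pred_taken"] != head["actual_taken"]:
        rob.clear()                       # flush all younger, speculative entries
        return ("flush", head["actual_target"])
    regfile[head["dest"]] = head["value"] # non-speculative: update architectural state
    return ("commit", head["dest"])

# A mispredicted branch at the head, with one younger speculative entry:
rob = [
    {"done": True, "is_branch": True, "pred_taken": True,
     "actual_taken": False, "actual_target": 0x44},
    {"done": True, "is_branch": False, "dest": "F2", "value": 1.0},
]
regs = {}
result = commit_head(rob, regs)
```

Because the younger entry never reached the head, its result for F2 is discarded along with the rest of the buffer, and the register file is untouched.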

Tomasulo with a Reorder Buffer: Example

[Figure: datapath with the FP op queue, reservation stations, FP adders and FP multipliers, the register file, paths to and from memory, the Common Data Bus, and a reorder buffer with entries ROB1 (oldest) through ROB7 (newest), each holding destination, value, instruction type, and a done bit.]

Example program:

LD    F6, 34(R2)
LD    F2, 45(R3)
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Cycle-by-cycle events:

- Cycle 1: the first load issues.
- Cycle 2: the first load executes; the second load issues.
- Cycle 3: both loads execute; MULTD issues.
- Cycle 4: the first load writes its result; the second load executes; SUBD issues (MULTD is stalled in its reservation station waiting for F2).
- Cycle 5: the first load commits; the second load writes its result; DIVD issues.
- Cycle 6: the second load commits; MULTD (1/10) and SUBD (1/2) execute; ADDD issues.
- Cycle 7: MULTD (2/10) and SUBD (2/2) execute.
- Cycle 8: MULTD executes (3/10); SUBD writes its result.
- Cycle 9: MULTD executes (4/10); ADDD executes (1/2); SUBD waits to commit behind MULTD.
- Cycle 10: MULTD executes (5/10); ADDD executes (2/2).
- Cycle 11: MULTD executes (6/10); ADDD writes its result.
- Faster-than-light computation (skip a couple of cycles).
- Cycle 16: MULTD writes its result.
- Cycle 17: MULTD commits; DIVD executes (1/40).
- Cycle 18: SUBD commits; DIVD executes (2/40).
- Faster-than-light computation (skip a couple of cycles).
- Cycle 57: DIVD writes its result.
- Cycle 58: DIVD commits.
- Cycle 59: ADDD commits.

Summary (cycle numbers):

Instruction          Issue   Execute   Write result   Commit
LD    F6, 34(R2)       1       2-3          4            5
LD    F2, 45(R3)       2       3-4          5            6
MULTD F0, F2, F4       3       6-15        16           17
SUBD  F8, F6, F2       4       6-7          8           18
DIVD  F10, F0, F6      5      17-56        57           58
ADDD  F6, F8, F2       6       9-10        11           59

Note how in-order commit delays SUBD until MULTD commits, and ADDD until DIVD commits, even though their results were written long before.
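The commit cycles in this example follow mechanically from in-order commit. A small sketch reproduces the timetable under a few stated assumptions: one issue per cycle, unlimited functional units, a result usable the cycle after "write result", one in-order commit per cycle, and EX latencies of 2 cycles for loads and adds/subtracts, 10 for multiply, 40 for divide:

```python
# Assumed EX latencies, taken from the example's trace.
LAT = {"LD": 2, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}

program = [  # (op, destination, source registers)
    ("LD",    "F6",  []),
    ("LD",    "F2",  []),
    ("MULTD", "F0",  ["F2", "F4"]),
    ("SUBD",  "F8",  ["F6", "F2"]),
    ("DIVD",  "F10", ["F0", "F6"]),
    ("ADDD",  "F6",  ["F8", "F2"]),
]

write_cycle = {}       # register -> cycle its in-flight value is broadcast
rows, last_commit = [], 0
for i, (op, dest, srcs) in enumerate(program):
    issue = i + 1                                       # one issue per cycle
    # EX starts the cycle after issue, once every source has been written.
    ready = max([write_cycle.get(s, 0) + 1 for s in srcs] + [issue + 1])
    write = ready + LAT[op]                             # EX occupies [ready, write-1]
    commit = last_commit = max(write + 1, last_commit + 1)  # in program order
    write_cycle[dest] = write
    rows.append((op, dest, issue, ready, write, commit))

for op, dest, issue, ex_start, write, commit in rows:
    print(f"{op:5} {dest:3}  issue {issue:2}  EX {ex_start}-{write - 1}"
          f"  write {write:2}  commit {commit:2}")
```

Running it reproduces exactly the summary table above, including the long commit stalls of SUBD and ADDD behind the multiply and divide.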