Lecture 8: Dynamic Branch Prediction, Superscalar and VLIW
Computer Architectures 521480S

Dynamic Branch Prediction
Performance = f(accuracy, cost of misprediction)
- The Branch History Table (BHT), also called a branch-prediction buffer, is the simplest scheme.
- The lower bits of the branch address index a table of 1-bit values that record whether the branch was taken last time; if it was taken last time, predict taken again. Initially, all bits are set to predict that every branch is taken.
- Problem: in a loop, a 1-bit BHT causes two mispredictions per execution of the loop:
  - at the end of the loop, when it exits instead of looping as before, and
  - on the first iteration the next time through the code, when it predicts exit instead of looping.

  LOOP: LOAD R1, 100(R2)
        MUL  R6, R6, R1
        SUBI R2, R2, #4
        BNEZ R2, LOOP
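The two-mispredictions-per-loop behavior can be reproduced with a small simulation. This is an illustrative sketch, not from the lecture: the predictor below models a single 1-bit BHT entry tracking the loop's BNEZ branch.

```python
# Sketch of one 1-bit BHT entry: it remembers only the last outcome.
def simulate_1bit(outcomes, initial=True):
    """Count mispredictions for a sequence of branch outcomes."""
    prediction = initial          # slides: initially predict taken
    mispredictions = 0
    for taken in outcomes:
        if prediction != taken:
            mispredictions += 1
        prediction = taken        # 1-bit predictor: record last outcome
    return mispredictions

# BNEZ outcomes for a 4-iteration loop: taken, taken, taken, not taken.
one_run = [True, True, True, False]
print(simulate_1bit(one_run))      # → 1 (only the loop exit mispredicts)
print(simulate_1bit(one_run * 2))  # → 3 (the 2nd run adds 2: loop entry and exit)
```

Every execution of the loop after the first costs two mispredictions, exactly as the slide describes.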

Dynamic Branch Prediction
Solution: a 2-bit predictor scheme that changes its prediction only after mispredicting twice in a row (Figure 4.13, p. 264).
[State diagram: four states. The two "Predict Taken" states move toward "Predict Not Taken" only after two consecutive not-taken (NT) outcomes; the two "Predict Not Taken" states move toward "Predict Taken" only after two consecutive taken (T) outcomes.]
This idea can be extended to n-bit saturating counters:
- increment the counter when the branch is taken;
- decrement the counter when the branch is not taken;
- if the counter >= 2^(n-1), predict the branch is taken; otherwise predict not taken.
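A sketch of the n-bit saturating counter just described (illustrative, not from the lecture; the counter predicts taken when it is at or above 2^(n-1), and saturates at 0 and 2^n - 1):

```python
def make_counter(n=2, value=0):
    """n-bit saturating counter predictor; returns (predict, update)."""
    state = {"c": value}
    max_val = (1 << n) - 1        # saturate at 2^n - 1
    threshold = 1 << (n - 1)      # predict taken when counter >= 2^(n-1)

    def predict():
        return state["c"] >= threshold

    def update(taken):
        if taken:
            state["c"] = min(state["c"] + 1, max_val)
        else:
            state["c"] = max(state["c"] - 1, 0)

    return predict, update

def mispredictions(outcomes, n=2, start=3):
    predict, update = make_counter(n, start)
    misses = 0
    for taken in outcomes:
        if predict() != taken:
            misses += 1
        update(taken)
    return misses

# Same 4-iteration loop run twice as in the 1-bit example: starting in
# "strongly taken" (counter = 3), only the loop exit mispredicts.
print(mispredictions([True, True, True, False] * 2))  # → 2 (1 per run, not 2)
```

Compared with the 1-bit scheme, the single not-taken outcome at loop exit no longer flips the prediction, so re-entering the loop predicts correctly.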

2-bit BHT Accuracy
Mispredictions occur because:
- the branch is encountered for the first time;
- the guess for that branch is wrong (e.g., at the end of a loop); or
- the table returns the history of a different branch that maps to the same index (can happen in large programs).
With a 4096-entry 2-bit table, misprediction rates vary by program:
- 1% for nasa7 and tomcatv (lots of loops with many iterations)
- 9% for spice
- 12% for gcc
- 18% for eqntott (few loops, relatively hard to predict)
A 4096-entry table is about as good as an infinite table. Instead of using a separate table, the branch-prediction bits can be stored in the instruction cache.

Correlating Branches
Hypothesis: recent branches are correlated; that is, the behavior of recently executed branches affects the prediction of the current branch.
Idea: record whether the m most recently executed branches were taken or not taken, and use that pattern to select the proper branch history table.
In general, an (m, n) predictor records the last m branches to select between 2^m history tables, each with n-bit counters. The old 2-bit BHT is then a (0, 2) predictor.

Correlating Branches
Often the behavior of one branch is correlated with the behavior of other branches. For example:

C code:
    if (aa == 2) aa = 0;
    if (bb == 2) bb = 0;
    if (aa != bb) cc = 4;

DLX code:
        SUBI R3, R1, #2
        BNEZ R3, L1       ; branch if aa != 2
        ADD  R1, R0, R0   ; aa = 0
    L1: SUBI R3, R2, #2
        BNEZ R3, L2       ; branch if bb != 2
        ADD  R2, R0, R0   ; bb = 0
    L2: SUBI R3, R1, R2
        BEQZ R3, L3       ; branch if aa == bb
        ADDI R4, R0, #4   ; cc = 4
    L3: ...

If the first two branches are not taken, the third one will be taken.

Correlating Predictors
Correlating predictors, or two-level predictors, use the behavior of other branches to predict whether the current branch is taken.
- An (m, n) predictor uses the behavior of the last m branches to choose among 2^m n-bit predictors.
- The branch predictor is accessed using the k low-order bits of the branch address together with the m-bit global history.
- The number of bits needed to implement an (m, n) predictor that uses k bits of the branch address is 2^m x n x 2^k.
- In the figure, which shows a (2, 2) predictor indexed by the branch address, m = 2, n = 2, k = 4: 2^2 x 2 x 2^4 = 128 bits.
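The storage formula can be checked directly; this small sketch just restates the slide's arithmetic:

```python
def predictor_bits(m, n, k):
    """Storage cost in bits of an (m, n) predictor indexed by k address bits:
    2^m tables, each with 2^k entries of n bits."""
    return (2 ** m) * n * (2 ** k)

print(predictor_bits(2, 2, 4))  # → 128, the slide's (2,2) example with k = 4
print(predictor_bits(0, 2, 12)) # → 8192, a plain 4096-entry 2-bit BHT as a (0,2) predictor
```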

Accuracy of Different Schemes (Figure 4.21, p. 272)
[Bar chart: frequency of mispredictions, on a 0%–18% scale, for three schemes — 4,096 entries with 2 bits per entry, unlimited entries with 2 bits per entry, and 1,024 entries of a (2,2) predictor — across nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, espresso, eqntott, and li. The 1,024-entry (2,2) predictor outperforms the 2-bit BHT even when the latter has unlimited entries.]

Branch-Target Buffers
- DLX computes the branch target in the ID stage, which leads to a one-cycle stall when the branch is taken.
- A branch-target buffer, or branch-target cache, stores the predicted target address of branches that are predicted to be taken. Branches not in the buffer are predicted not taken.
- The branch-target buffer is accessed during the IF stage, based on the k low-order bits of the branch address. If the branch is in the buffer and is predicted correctly, the one-cycle stall is eliminated.
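The lookup just described can be sketched as follows. This is a hypothetical software model, not the lecture's hardware: a real BTB is a small cache, and the class name, tag check, and example addresses are illustrative.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB indexed by the k low-order PC bits."""

    def __init__(self, k):
        self.k = k
        self.table = {}                 # index -> (instruction PC, predicted target PC)

    def _index(self, pc):
        return pc & ((1 << self.k) - 1)

    def lookup(self, pc):
        """Consulted during IF: a hit means 'predicted-taken branch'."""
        entry = self.table.get(self._index(pc))
        if entry is not None and entry[0] == pc:   # tag check against full PC
            return entry[1]                        # use predicted target as next PC
        return None                                # miss: predict not taken

    def insert(self, pc, target):
        """Enter a branch once it resolves as taken."""
        self.table[self._index(pc)] = (pc, target)

btb = BranchTargetBuffer(k=4)
btb.insert(0x40, 0x10)
print(hex(btb.lookup(0x40)))  # → 0x10: hit, fetch redirected during IF
print(btb.lookup(0x44))       # → None: not in buffer, predicted not taken
```

Note that the tag check matters: a different branch aliasing to the same index (e.g., PC 0x50 with k = 4) must miss rather than steal 0x40's target.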

Branch Target Buffer
[Figure: the k low-order bits of the PC index a table whose entries hold the PC of a branch instruction and its predicted target PC.]
- The fetch PC is compared against the stored instruction PC:
  - no match: predict not a taken branch, proceed normally;
  - match: predict a taken branch, use the predicted PC as the next PC.
- For predictors with more than a single bit, prediction information must also be stored in each entry.
- A variation: instead of storing the predicted target PC, store the target instruction itself.

Issuing Multiple Instructions/Cycle
Two variations:
- Superscalar: a varying number of instructions per cycle (1 to 8), scheduled by the compiler or by hardware (Tomasulo). Examples: IBM PowerPC, Sun UltraSPARC, DEC Alpha, HP 8000.
- (Very) Long Instruction Word ((V)LIW): a fixed number of instructions (4 to 16) scheduled by the compiler, with operations packed into wide instruction templates.
The anticipated success of multiple issue led to the use of Instructions Per Clock cycle (IPC) instead of CPI.

Superscalar DLX
Superscalar DLX: 2 instructions per cycle; 1 FP op and 1 other.
- Fetch 64 bits per clock cycle; the integer instruction on the left, the FP instruction on the right.
- Can only issue the 2nd instruction if the 1st instruction issues.
- Need 2 more ports on the FP register file, to do an FP load or FP store in parallel with an FP op.

  Type             Pipe stages
  Int. instruction IF ID EX MEM WB
  FP instruction   IF ID EX MEM WB
  Int. instruction    IF ID EX MEM WB
  FP instruction      IF ID EX MEM WB
  Int. instruction       IF ID EX MEM WB
  FP instruction         IF ID EX MEM WB

- The 1-cycle load delay expands to 3 instruction slots in the 2-way superscalar: the instruction in the right half of the same word cannot use the result, nor can the instructions in the next word.
- Branches likewise have a delay of 3 instruction slots.

Unrolled Loop that Minimizes Stalls for Scalar

   1 Loop: LD   F0,0(R1)
   2       LD   F6,-8(R1)
   3       LD   F10,-16(R1)
   4       LD   F14,-24(R1)
   5       ADDD F4,F0,F2
   6       ADDD F8,F6,F2
   7       ADDD F12,F10,F2
   8       ADDD F16,F14,F2
   9       SD   0(R1),F4
  10       SD   -8(R1),F8
  11       SUBI R1,R1,#32
  12       SD   16(R1),F12
  13       BNEZ R1,LOOP
  14       SD   8(R1),F16

Latencies: LD to ADDD, 1 stall cycle; ADDD to SD, 2 stall cycles.
14 clock cycles, or 3.5 per iteration.

Loop Unrolling in Superscalar

  Integer instruction     FP instruction     Clock cycle
  Loop: LD F0,0(R1)                          1
        LD F6,-8(R1)                         2
        LD F10,-16(R1)    ADDD F4,F0,F2      3
        LD F14,-24(R1)    ADDD F8,F6,F2      4
        LD F18,-32(R1)    ADDD F12,F10,F2    5
        SD 0(R1),F4       ADDD F16,F14,F2    6
        SD -8(R1),F8      ADDD F20,F18,F2    7
        SD -16(R1),F12                       8
        SUBI R1,R1,#40                       9
        SD 16(R1),F16                        10
        BNEZ R1,LOOP                         11
        SD 8(R1),F20                         12

Unrolled 5 times to avoid delays (+1 iteration due to the superscalar load delay).
12 clocks, or 2.4 clocks per iteration.

Dynamic Scheduling in Superscalar
How to issue two instructions per cycle and keep in-order instruction issue for Tomasulo?
- Assume 1 integer + 1 floating-point instruction: 1 Tomasulo control unit for integer, 1 for floating point.
- Issue at 2x the clock rate, so that issue remains in order.
- Only FP loads might cause a dependency between integer and FP issue:
  - Replace the load reservation station with a load queue; operands must be read in the order they are fetched.
  - A load checks addresses in the store queue to avoid a RAW violation.
  - A store checks addresses in the load and store queues to avoid WAR and WAW violations.
This is called a decoupled architecture.
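The address checks above can be sketched as follows. This is a hypothetical simplification, not the lecture's hardware: the queues are modeled as plain lists of pending memory addresses, and the function names are illustrative.

```python
def load_may_proceed(load_addr, store_queue):
    """RAW check: a load must wait while an older store to the
    same address is still pending in the store queue."""
    return all(addr != load_addr for addr in store_queue)

def store_may_proceed(store_addr, load_queue, store_queue):
    """WAR check against pending loads from the same address, and
    WAW check against pending stores to the same address."""
    return (all(addr != store_addr for addr in load_queue) and
            all(addr != store_addr for addr in store_queue))

print(load_may_proceed(0x100, [0x200, 0x300]))   # → True: no conflict
print(load_may_proceed(0x100, [0x100, 0x300]))   # → False: RAW hazard
print(store_may_proceed(0x200, [0x200], []))     # → False: WAR hazard
```

Real hardware does these comparisons associatively across all queue entries in a single cycle; the point here is only which queue each kind of access must check.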

Limits of Superscalar
While the integer/FP split is simple for the hardware, it achieves a CPI of 0.5 only for programs with:
- exactly 50% FP operations, and
- no hazards.
If more instructions issue at the same time, decode and issue become more difficult: even a 2-scalar must examine 2 opcodes and 6 register specifiers, and decide whether 1 or 2 instructions can issue. Issue rates of modern processors vary between 2 and 8 instructions per cycle.

VLIW Processors
Very Long Instruction Word (VLIW) processors:
- Trade instruction space for simple decoding: the long instruction word has room for many operations.
- By definition, all the operations the compiler puts in the long instruction word can execute in parallel.
- E.g., 2 integer operations, 2 FP ops, 2 memory references, and 1 branch; at 16 to 24 bits per field, the word is 7 x 16 = 112 to 7 x 24 = 168 bits wide.
- Needs a compilation technique that schedules across branches.
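The field arithmetic above, as a quick sketch (just restating the slide's example format):

```python
# 7 operation fields per long instruction word:
ops_per_word = 2 + 2 + 2 + 1   # 2 integer, 2 FP, 2 memory refs, 1 branch

# Word width for the slide's range of field sizes:
widths = [ops_per_word * bits for bits in (16, 24)]
print(widths)  # → [112, 168] bits per instruction word
```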

Loop Unrolling in VLIW

  Memory ref 1     Memory ref 2     FP op 1          FP op 2          Int. op/branch   Clock
  LD F0,0(R1)      LD F6,-8(R1)                                                        1
  LD F10,-16(R1)   LD F14,-24(R1)                                                      2
  LD F18,-32(R1)   LD F22,-40(R1)   ADDD F4,F0,F2    ADDD F8,F6,F2                     3
  LD F26,-48(R1)                    ADDD F12,F10,F2  ADDD F16,F14,F2                   4
                                    ADDD F20,F18,F2  ADDD F24,F22,F2                   5
  SD 0(R1),F4      SD -8(R1),F8     ADDD F28,F26,F2                                    6
  SD -16(R1),F12   SD -24(R1),F16                                                      7
  SD -32(R1),F20   SD -40(R1),F24                                     SUBI R1,R1,#48   8
  SD -0(R1),F28                                                       BNEZ R1,LOOP     9

Unrolled 7 times to avoid delays: 7 results in 9 clocks, or 1.3 clocks per iteration.
VLIW needs more registers.

Limits to Multi-Issue Machines
Limitations specific to either the superscalar or the VLIW implementation:
- decode/issue complexity in superscalar;
- VLIW code size: unrolled loops plus wasted fields in the long instruction word;
- VLIW lock-step execution: 1 hazard stalls all instructions;
- VLIW binary compatibility is a practical weakness.
Inherent limitations of ILP:
- 1 branch in 5 instructions: how to keep a 5-way VLIW busy?
- Functional-unit latencies mean many operations must be scheduled: about pipeline depth x number of functional units independent instructions are needed to keep everything busy.

Branch Prediction Summary
- Branch History Table: 2 bits per entry for loop accuracy.
- Correlation: recently executed branches are correlated with the next branch.
- Branch-Target Buffer: includes the branch target address if the branch is predicted taken.
- Superscalar and VLIW achieve CPI < 1:
  - superscalar is more hardware-dependent (dynamic);
  - VLIW is more compiler-dependent (static).
- The more instructions issue at the same time, the larger the penalties for hazards.