Quiz 5 Mini project #1 solution Mini project #2 assigned Stalling recap Branches!


Control Hazards

Today
- Quiz 5
- Mini project #1 solution
- Mini project #2 assigned
- Stalling recap
- Branches!

Key Points: Control Hazards
- Control hazards occur when we don't know which instruction to execute next. They are mostly caused by branches.
- Strategies for dealing with them:
  - Stall
  - Guess! This leads to speculation and to flushing the pipeline.
- Strategies for making better guesses.
- Understand the difference between a stall and a flush.

Ideal Operation
(pipeline diagram)

Stalling for Load
Load $s1, 0($s1)
Addi $t1, $s1, 4
(pipeline diagram: only four stages are occupied -- what's in Mem?)
All stages of the pipeline earlier than the stall stand still. To stall, we insert a noop in place of the instruction and freeze the earlier stages of the pipeline.

Inserting Noops
Load $s1, 0($s1)
Addi $t1, $s1, 4
(pipeline diagram: the inserted noop is in Mem)
To stall, we insert a noop in place of the instruction and freeze the earlier stages of the pipeline.

Control Hazards
Computing the new PC:
add $s1, $s3, $s2
sub $s6, $s5, $s2
beq $s6, $s7, somewhere
and $s2, $s3, $s1

Computing the PC
Non-branch instruction: PC = PC + 4
When is the PC ready? No hazard.

Computing the PC
Branch instructions, e.g. bne $s1, $s2, offset:
if ($s1 != $s2) { PC = PC + offset; } else { PC = PC + 4; }
When is the value ready?
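As a sketch, the next-PC selection above can be written in C (the function and argument names are illustrative, not actual hardware; real MIPS shifts the instruction-count offset left by 2 and adds it to PC+4, which the slide's "PC + offset" abbreviates):

```c
#include <stdint.h>

/* Hypothetical sketch of next-PC selection for a MIPS-style branch.
   The offset is in instructions, so it is shifted left by 2 to get a
   byte offset and added to PC+4. */
uint32_t next_pc(uint32_t pc, int taken, int32_t offset)
{
    if (taken)
        return (pc + 4) + ((uint32_t)offset << 2);  /* branch target */
    return pc + 4;                                  /* fall through  */
}
```

The hazard is precisely that `taken` comes from a register comparison, so its value is not known until EXE.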

Solution #1: Stall on Branches
This worked for loads and ALU ops. But wait! When do we know whether the instruction is a branch? We would have to stall on every instruction.

There is a constant control hazard: we don't even know what kind of instruction we have until decode. What do we do?

Smart ISA Design
add $s0, $t0, $t1
sub $t2, $s0, $t3
Make it very easy to tell whether the instruction is a branch -- maybe a single bit, or just a couple. Decoding these bits is nearly trivial. In MIPS the branches and jumps are opcodes 0-7, so if the high-order bits of the opcode are zero, it's a control instruction.

Dealing with Branches: Option 1 -- Stall
sll $s4, $t6, $t5
bne $t2, $s0, somewhere
add $s0, $t0, $t1
and $s4, $t0, $t1
(pipeline diagram: the instructions after the bne stall in Fetch; no instructions are in Decode or Execute)
What does this do to our CPI? Our speedup?

Performance Impact of Stalling
ET = I * CPI * CT
Branches are about 1 in 5 instructions. What's the CPI for branches? 1 + 2 = 3.
ET = 1 * (0.2*3 + 0.8*1) * 1 = 1.4
Amdahl's law: speedup = 1/(0.2*3 + 0.8) = 0.714 -- a slowdown.
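A quick check of the slide's arithmetic in C (a sketch, not part of the slides; the function names are made up):

```c
/* Stall-on-every-branch cost, per the slide: branches are 20% of
   instructions and each costs 2 extra cycles, for a branch CPI of 3. */
double stall_cpi(void)
{
    return 0.2 * 3.0 + 0.8 * 1.0;   /* average CPI = 1.4 */
}

/* Relative performance versus the ideal CPI-of-1 pipeline. */
double stall_speedup(void)
{
    return 1.0 / stall_cpi();       /* about 0.714: a slowdown */
}
```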

Option 2: The Compiler
Use branch delay slots: the next N instructions after a branch are always executed. Much like load-delay slots.
Good: simple hardware.
Bad: N cannot change. MIPS has one branch delay slot, and it's a big pain!

Delay Slots
bne $t2, $s0, somewhere
add $t2, $s4, $t1    <- branch delay slot
add $s0, $t0, $t1
...
somewhere: sub $t2, $s0, $t3
(pipeline diagram: the delay-slot instruction executes whether or not the branch is taken)

Option 3: Simple Prediction
Can a processor tell the future? For not-taken branches, the new PC is ready immediately, so let's just assume the branch is not taken. This is also called branch prediction or control speculation. What if we are wrong?

Predict Not-Taken
bne $t2, $s0, somewhere
sll $s4, $t6, $t5
bne $t2, $s4, else
and $s4, $t0, $t1
add $s0, $t0, $t1
...
else: sub $t2, $s0, $t3
(pipeline diagram: the two instructions after the second bne become noops)
We start the add and the and, and then, when we discover the branch outcome, we squash them -- that is, we turn them into noops.

Simple Static Prediction
"Static" means before run time. Many prediction schemes are possible:
- Predict taken. Pros? Loops are common.
- Predict not-taken. Pros? Not all branches are for loops.
- Backward taken / forward not taken: the best of both worlds.
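Backward-taken/forward-not-taken reduces to checking the sign of the branch offset; a minimal sketch (the function name is made up):

```c
#include <stdint.h>

/* BTFNT static prediction: loop branches jump backward (negative
   offset), so predict taken exactly when the offset is negative. */
int predict_btfnt(int32_t offset)
{
    return offset < 0;   /* 1 = predict taken, 0 = predict not taken */
}
```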

Implementing Backward Taken / Forward Not Taken
(datapath diagrams: the sign bit of the offset and the comparison result drive PC selection, and a bubble is inserted to flush the pipeline on a misprediction. The figures show the standard five-stage datapath -- PC, instruction memory, register file, sign extend, shift left 2, target-address adder, ALU, data memory -- with the IFetch/Dec, Dec/Exec, Exec/Mem, and Mem/WB pipeline registers.)

Implementing Backward Taken / Forward Not Taken
Changes in control:
- New inputs to the control unit: the sign of the offset and the result of the branch.
- New outputs from control: the flush signal, which inserts noop bits in the datapath and control.

Performance Impact
ET = I * CPI * CT
Backward taken / forward not taken is 80% accurate. Branches are 20% of instructions. Changing the pipeline increases the cycle time by 10%. What is the speedup of BT/FNT compared to just stalling on every branch?

Performance Impact
ET = I * CPI * CT
Backward taken / forward not taken is 80% accurate. Branches are 20% of instructions. Changing the front end increases the cycle time by 10%. What is the speedup of BT/FNT compared to just stalling on every branch?
BT/FNT: CPI = 0.2*0.2*(1 + 2) + (1 - 0.2*0.2)*1 = 1.08; CT = 1.1; ET = 1.188
Stall: CPI = 0.2*3 + 0.8*1 = 1.4; CT = 1; ET = 1.4
Speedup = 1.4/1.188 = 1.18
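The two execution times can be re-derived in a few lines of C (a sketch of the slide's arithmetic; the helper names are made up):

```c
/* BTFNT: 4% of instructions (20% branches * 20% mispredicted) pay the
   2-cycle penalty, and the cycle time grows by 10%. */
double btfnt_et(void)
{
    double cpi = 0.2 * 0.2 * (1.0 + 2.0) + (1.0 - 0.2 * 0.2) * 1.0; /* 1.08 */
    return cpi * 1.1;                                               /* 1.188 */
}

/* Stalling: every branch pays the 2-cycle penalty; cycle time unchanged. */
double stall_et(void)
{
    return (0.2 * 3.0 + 0.8 * 1.0) * 1.0;                           /* 1.4 */
}
```

The ratio stall_et()/btfnt_et() comes out to about 1.18, matching the slide.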

The Importance of Pipeline Depth
There are two important parameters of the pipeline that determine the impact of branches on performance:
- Branch decode time -- how many cycles it takes to identify a branch (in our case, less than 1).
- Branch resolution time -- how many cycles until the real branch outcome is known (in our case, 2 cycles).

Pentium 4 Pipeline (Willamette)
1. Branches take 19 cycles to resolve.
2. Identifying a branch takes 4 cycles. (The P4 fetches 3 instructions per cycle.)
3. Stalling is not an option.
Pentium 4 pipelines peaked at 31 stages! Current CPUs have about 12-14 stages.

Performance Impact
ET = I * CPI * CT
Backward taken / forward not taken is 80% accurate. Branches are 20% of instructions. Changing the front end increases the cycle time by 10%. What is the speedup of BT/FNT compared to just stalling on every branch?
BT/FNT: CPI = 0.2*0.2*(1 + 2) + (1 - 0.2*0.2)*1 = 1.08; CT = 1.1; ET = 1.188
Stall: CPI = 0.2*3 + 0.8*1 = 1.4; CT = 1; ET = 1.4
Speedup = 1.4/1.188 = 1.18
What if the resolution time were 20 cycles instead of 2?

Performance Impact
ET = I * CPI * CT
Backward taken / forward not taken is 80% accurate. Branches are 20% of instructions. Changing the front end increases the cycle time by 10%. With a 20-cycle resolution time, what is the speedup of BT/FNT compared to just stalling on every branch?
BT/FNT: CPI = 0.2*0.2*(1 + 20) + (1 - 0.2*0.2)*1 = 1.80; CT = 1.1; ET = 1.98
Stall: CPI = 0.2*(1 + 20) + 0.8*1 = 5; CT = 1; ET = 5
Speedup = 5/1.98 = 2.53
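The same comparison can be parameterized by the resolution penalty (a sketch using the slide's assumptions of 20% branches, 80% accuracy, and a 10% cycle-time cost; the function name is made up):

```c
/* Speedup of BTFNT prediction over stall-on-every-branch as a function
   of the branch resolution penalty in cycles (2 in our pipeline,
   around 20 in a Pentium 4-era one). */
double btfnt_speedup(double penalty)
{
    double btfnt_et = (0.2 * 0.2 * (1.0 + penalty)
                       + (1.0 - 0.2 * 0.2) * 1.0) * 1.1;
    double stall_et = (0.2 * (1.0 + penalty) + 0.8 * 1.0) * 1.0;
    return stall_et / btfnt_et;
}
```

At a 2-cycle penalty the speedup is about 1.18; at 20 cycles it is about 2.5 -- deeper pipelines make prediction far more valuable.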

Dynamic Branch Prediction
Long pipelines demand higher accuracy than static schemes can deliver. Instead of making the guess once, make it every time we see the branch. Predict future behavior based on past behavior.

Predictable Control
Use previous branch behavior to predict future branch behavior. When is branch behavior predictable?

Predictable Control
Use previous branch behavior to predict future branch behavior. When is branch behavior predictable?
- Loops -- for(i = 0; i < 10; i++) {}. 9 taken branches, 1 not-taken branch. All 10 are pretty predictable.
- Run-time constants -- Foo(int v) { for (i = 0; i < 1000; i++) { if (v) {...} } }. The branch is always taken or always not taken.
- Correlated control -- a = 10; b = <something usually larger than a>; if (a > 10) {} if (b > 10) {}
- Function calls -- LibraryFunction() converts to a jr (jump register) instruction, but its target is always the same. BaseClass *t; // t is usually of a subclass, SubClass. t->someVirtualFunction() // will usually call the same function.

Dynamic Predictor 1: The Simplest Thing
Predict that this branch will go the same way as the previous one.
Pros? Dead simple: just keep a bit in the fetch stage. Works OK for simple loops, and the compiler might be able to arrange things to make it work better.
Cons? An unpredictable branch in a loop will mess everything up, and it can't tell the difference between branches.

Dynamic Prediction 2: A Table of Bits
Give each branch its own bit in a table, indexed by the low-order bits of the PC, and look up the prediction bit for the branch.
How big does the table need to be? Infinite! Bigger is better, but don't mess with the cycle time.
Pros: it can differentiate between branches. Bad behavior by one won't mess up others... mostly (why not always?).
Cons: accuracy is still not great.

Dynamic Prediction 2: A Table of Bits
for(i = 0; i < 10; i++) {
  for(j = 0; j < 4; j++) {
  }
}
iteration  actual     prediction  new prediction
1          taken      not taken   taken
2          taken      taken       taken
3          taken      taken       taken
4          not taken  taken       not taken
1          taken      not taken   taken
2          taken      taken       taken
3          taken      taken       taken
What's the accuracy for the inner loop's branch? 50% -- 2 mispredictions per loop.
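The miss rate above can be checked by simulating a single prediction bit (a sketch; since only one branch is simulated, the PC-indexed table collapses to one bit):

```c
/* Simulate a 1-bit predictor on the inner loop's pattern T,T,T,N.
   In a real table the bit would be selected by the low-order PC bits;
   here there is only one branch, so a single bit suffices. */
int one_bit_misses(int trips)
{
    int bit = 1;                    /* initially predict taken */
    int pattern[4] = {1, 1, 1, 0};  /* 3 taken, then 1 not taken */
    int miss = 0;
    for (int t = 0; t < trips; t++)
        for (int i = 0; i < 4; i++) {
            if (bit != pattern[i])
                miss++;             /* prediction was wrong */
            bit = pattern[i];       /* remember the last outcome */
        }
    return miss;
}
```

In steady state it misses twice per trip (iterations 1 and 4), matching the 50% accuracy on the table.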

Dynamic Prediction 3: A Table of Counters
Instead of a single bit, keep two. This gives four possible states. Taken branches move the state to the right; not-taken branches move it to the left.
State                     Prediction
00 -- strongly not taken  not taken
01 -- weakly not taken    not taken
10 -- weakly taken        taken
11 -- strongly taken      taken
The net effect is that we wait a bit to change our mind.

Dynamic Prediction 3: A Table of Counters
for(i = 0; i < 10; i++) {
  for(j = 0; j < 4; j++) {
  }
}
iteration  actual     state           prediction  new state
1          taken      weakly taken    taken       strongly taken
2          taken      strongly taken  taken       strongly taken
3          taken      strongly taken  taken       strongly taken
4          not taken  strongly taken  taken       weakly taken
1          taken      weakly taken    taken       strongly taken
2          taken      strongly taken  taken       strongly taken
3          taken      strongly taken  taken       strongly taken
What's the accuracy for the inner loop's branch (starting in weakly taken)? 1 misprediction (25%) per loop.
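The same experiment with a 2-bit saturating counter (a sketch; states 0-3 encode strongly-not-taken through strongly-taken, starting in weakly taken as on the slide):

```c
/* Simulate a 2-bit saturating counter on the pattern T,T,T,N.
   States: 0 = strongly not taken, 1 = weakly not taken,
           2 = weakly taken,       3 = strongly taken. */
int two_bit_misses(int trips)
{
    int state = 2;                  /* start in weakly taken */
    int pattern[4] = {1, 1, 1, 0};
    int miss = 0;
    for (int t = 0; t < trips; t++)
        for (int i = 0; i < 4; i++) {
            int pred = (state >= 2);   /* top bit is the prediction */
            if (pred != pattern[i])
                miss++;
            if (pattern[i]) {          /* saturating update */
                if (state < 3) state++;
            } else {
                if (state > 0) state--;
            }
        }
    return miss;
}
```

The counter misses only on the not-taken iteration: one miss per trip, i.e. 75% accuracy, as the table shows.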

Two-Bit Prediction
The two-bit prediction scheme is used very widely and in many ways:
- Make a table of 2-bit predictors.
- Devise a way to associate a 2-bit predictor with each dynamic branch.
- Use that branch's 2-bit predictor to make the prediction.
In the previous example we associated the predictors with branches using the PC. We'll call this per-PC prediction.

Associating Predictors with Branches: Using the Low-Order PC Bits
When is branch behavior predictable?
- Loops -- for(i = 0; i < 10; i++) {}. 9 taken branches, 1 not-taken branch. OK -- we miss one per loop.
- Run-time constants -- Foo(int v) { for (i = 0; i < 1000; i++) { if (v) {...} } }. The branch is always taken or not taken. Good.
- Correlated control -- a = 10; b = <something usually larger than a>; if (a > 10) {} if (b > 10) {}. Poor -- no help.
- Function calls -- LibraryFunction() converts to a jr (jump register) instruction; BaseClass *t; t->someVirtualFunction(). Not applicable.

Predicting Loop Branches Revisited
for(i = 0; i < 10; i++) {
  for(j = 0; j < 4; j++) {
  }
}
iteration  actual
1          taken
2          taken
3          taken
4          not taken
1          taken
2          taken
3          taken
4          not taken
1          taken
2          taken
3          taken
4          not taken
What's the pattern we need to identify?

Dynamic Prediction 4: Global Branch History
Instead of using the PC to choose the predictor, use a bit vector made up of the previous branch outcomes.
iteration          actual     branch history  steady-state prediction
1                  taken      11111
2                  taken      11111
3                  taken      11111
4                  not taken  11111
outer loop branch  taken      11110           taken
1                  taken      11101           taken
2                  taken      11011           taken
3                  taken      10111           taken
4                  not taken  01111           not taken
outer loop branch  taken      11110           taken
1                  taken      11101           taken
2                  taken      11011           taken
3                  taken      10111           taken
4                  not taken  01111           not taken
(pattern repeats)
Nearly perfect.
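A sketch of history-indexed prediction on the inner loop's T,T,T,N pattern (3 history bits are enough to separate the four loop positions; the table size and names are illustrative):

```c
/* Global-history prediction: the last HBITS branch outcomes select an
   entry in a table of prediction bits. For a period-4 pattern, each of
   the four positions sees a distinct 3-bit history, so after warm-up
   the predictor is perfect. */
#define HBITS 3

int ghist_misses(int trips)
{
    int table[1 << HBITS] = {0};   /* one prediction bit per history */
    unsigned hist = 0;             /* global history shift register  */
    int pattern[4] = {1, 1, 1, 0};
    int miss = 0;
    for (int t = 0; t < trips; t++)
        for (int i = 0; i < 4; i++) {
            unsigned idx = hist & ((1u << HBITS) - 1);
            if (table[idx] != pattern[i])
                miss++;
            table[idx] = pattern[i];   /* train on the actual outcome */
            hist = (hist << 1) | (unsigned)pattern[i];
        }
    return miss;
}
```

All mispredictions occur during warm-up; the steady-state miss count is zero no matter how many trips are run.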

Dynamic Prediction 4: Global Branch History
How long should the history be? Infinite is a bad choice -- we would learn nothing. Imagine N bits of history and a loop that executes K iterations:
- If K <= N, history will do well.
- If K > N, history will do poorly, since the history register will always be all 1's for the last K-N iterations, and we will mispredict the last branch.

Associating Predictors with Branches: Global History
When is branch behavior predictable?
- Loops -- for(i = 0; i < 10; i++) {}. 9 taken branches, 1 not-taken branch. Pretty good, as long as the history is not too long.
- Run-time constants -- Foo(int v) { for (i = 0; i < 1000; i++) { if (v) {...} } }. The branch is always taken or not taken. Not so great.
- Correlated control -- a = 10; b = <something usually larger than a>; if (a > 10) {} if (b > 10) {}. Good.
- Function calls -- LibraryFunction() converts to a jr (jump register) instruction; BaseClass *t; t->someVirtualFunction(). Not applicable.