What about branches? Branch outcomes are not known until EXE. What are our options?


What about branches? Branch outcomes are not known until EXE. What are our options? 1

Control Hazards 2

Today: Quiz; Control Hazards; Midterm review; Return your papers. 3

Key Points: Control Hazards Control hazards occur when we don't know what the next instruction is. They are mostly caused by branches. Strategies for dealing with them: Stall. Guess! Leads to speculation. Flushing the pipeline. Strategies for making better guesses. Understand the difference between stall and flush. 4

Control Hazards Computing the new PC. (Pipeline diagram: add $s1, $s3, $s2; sub $s6, $s5, $s2; beq $s6, $s7, somewhere; and $s2, $s3, $s1, each moving through Fetch, Decode, Mem, Write back.) 5

Computing the PC Non-branch instructions: PC = PC + 4. When is the PC ready? 6

Computing the PC Branch instructions: bne $s1, $s2, offset means if ($s1 != $s2) { PC = PC + offset; } else { PC = PC + 4; } When is the value ready? 7

Computing the PC Wait, when do we know? if (instruction is a branch) { if ($s1 != $s2) { PC = PC + offset; } else { PC = PC + 4; } } else { PC = PC + 4; } 8
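
A minimal C sketch of this next-PC decision, written as a single function; the names pc, is_branch, rs_val, rt_val, and offset are illustrative, not from the slides. The catch, as the next slides point out, is that is_branch is only known after decode and the register comparison only after execute:

    #include <stdint.h>

    /* Next-PC selection for the simplified bne of the slides.
       Real MIPS computes PC + 4 + (offset << 2); the slides abbreviate
       this as PC + offset. */
    uint32_t next_pc(uint32_t pc, int is_branch,
                     int32_t rs_val, int32_t rt_val, int32_t offset)
    {
        if (is_branch && rs_val != rt_val)  /* taken bne */
            return pc + offset;
        return pc + 4;                      /* everything else */
    }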

There is a constant control hazard We don't even know what kind of instruction we have until decode. Let's consider the non-branch case first. What do we do? 9

Option 1: Smart ISA design (Pipeline diagram: four instructions, each going Fetch, Decode, Mem, Write back in successive cycles.) Make it very easy to tell if the instruction is a branch -- maybe a single bit or just a couple. Decode is trivial. Pre-decode -- do part of decode when the instruction comes on chip. More on this later. 10

Option 2: The compiler Use branch delay slots: the next N instructions after a branch are always executed. Good: simple hardware. Bad: N cannot change. 11

Delay slots. Cycles (taken): bne $t2, $s0, somewhere Fetch Decode Mem Write back; Branch Delay: add $t2, $s4, $t1; add $s0, $t0, $t1 Fetch Decode Mem Write back;... somewhere: sub $t2, $s0, $t3 Fetch Decode Mem... 12

Option 4: Stall Cycles: add $s0, $t0, $t1 Fetch Decode Mem Write back; bne $t2, $s0, somewhere Fetch Decode Mem Write back; sub $t2, $s0, $t3 Stall Fetch Decode; sub $t2, $s0, $t3 Fetch Decode. What does this do to our CPI? Speedup? 13

Performance impact of stalling ET = I * CPI * CT. Branches are about 1 in 5 instructions. What's the CPI for branches? 1 + 2 = 3 (this is really the CPI for the instruction that follows the branch). ET = 1 * (0.2*3 + 0.8*1) * 1 = 1.4. Speedup = 1/(0.2*3 + 0.8*1) = 1/1.4 = 0.714. 14
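
A quick C check of that arithmetic, using the slide's numbers (20% branches, 2 stall cycles); this is just a sketch to make the CPI term explicit:

    #include <stdio.h>

    int main(void) {
        double branch_frac  = 0.2;  /* 1 in 5 instructions is a branch */
        double stall_cycles = 2.0;  /* cycles lost waiting for EXE     */
        double cpi = branch_frac * (1.0 + stall_cycles)
                   + (1.0 - branch_frac) * 1.0;        /* = 1.4 */
        printf("CPI = %.2f, ET = %.2f, speedup vs. ideal = %.3f\n",
               cpi, 1.0 * cpi * 1.0, 1.0 / cpi);       /* 1.40 1.40 0.714 */
        return 0;
    }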

Option 2: Simple Prediction Can a processor tell the future? For not-taken branches, the new PC is ready immediately. Let's just assume the branch is not taken. Also called branch prediction or control speculation. What if we are wrong? 15

Predict Not-taken Cycles: Not-taken: bne $t2, $s0, somewhere Fetch Decode Mem Write back. Taken: bne $t2, $s4, else Fetch Decode Mem Write back; add $s0, $t0, $t1 Fetch Decode Mem Write back Squash;... else: sub $t2, $s0, $t3 Fetch Decode. We start the add, and then, when we discover the branch outcome, we squash it. We flush the pipeline. 16

Simple static Prediction Static means decided before run time. Many prediction schemes are possible. Predict taken Pros? Loops are common. Predict not-taken Pros? Not all branches are for loops. Backward Taken/Forward Not Taken Best of both worlds. 17
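
A minimal sketch of the backward-taken/forward-not-taken rule in C, assuming the predictor can look at the sign of the branch offset (a negative offset means a backward branch, which is usually a loop):

    #include <stdint.h>

    /* BTFNT: predict taken for backward branches, not taken for forward ones. */
    int predict_taken_btfnt(int32_t branch_offset)
    {
        return branch_offset < 0;
    }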

Implementing Backward taken/forward not taken (datapath figure) 18

Implementing Backward taken/forward not taken Compute target. (Pipelined datapath figure: PC, instruction memory, register file, ALU, data memory, sign extend (16 to 32), and shift-left-2/add logic for the branch target, with pipeline registers IFetch/Dec, Dec/Exec, Exec/Mem, Mem/WB and an "insert bubble" control.) 19

Implementing Backward taken/forward not taken Changes in control: New inputs to the control unit: the sign of the offset; the result of the branch. New outputs from control: the flush signal, which inserts noop bits in the datapath and control. 20
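
A behavioral sketch of when that flush signal fires, assuming (as on these slides) that the guess comes from the offset sign and the real outcome is known at execute; the function and argument names are illustrative:

    #include <stdint.h>
    #include <stdbool.h>

    /* Asserted in the execute stage when the BTFNT guess was wrong;
       the wrongly fetched instructions behind the branch become noops. */
    bool flush_needed(int32_t branch_offset, bool branch_taken)
    {
        bool predicted_taken = (branch_offset < 0);  /* BTFNT guess       */
        return predicted_taken != branch_taken;      /* mispredict: flush */
    }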

Performance Impact ET = I * CPI * CT. Backward taken, forward not taken is 80% accurate. Branches are 20% of instructions. Changing the front end increases the cycle time by 10%. What is the speedup of BT/FNT compared to just stalling on every branch? 21

Performance Impact ET = I * CPI * CT. Backward taken, forward not taken is 80% accurate. Branches are 20% of instructions. Changing the front end increases the cycle time by 10%. What is the speedup of BT/FNT compared to just stalling on every branch? Btfnt: CPI = 0.2*0.2*(1 + 2) + (1 - 0.2*0.2)*1 = 1.08; CT = 1.1; ET = 1.188. Stall: CPI = 0.2*3 + 0.8*1 = 1.4; CT = 1; ET = 1.4. Speedup = 1.4/1.188 = 1.18. 22
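
The same comparison as a small C helper, parameterized so the deeper-pipeline case a few slides later falls out of the same formula; the function name et() and its parameters are mine, not the slides':

    #include <stdio.h>

    /* ET per instruction = CPI * CT (I is factored out of the comparison). */
    static double et(double branch_frac, double mispredict_rate,
                     double penalty, double cycle_time)
    {
        double bad = branch_frac * mispredict_rate;     /* branches that pay */
        return (bad * (1.0 + penalty) + (1.0 - bad) * 1.0) * cycle_time;
    }

    int main(void) {
        double btfnt = et(0.2, 0.2, 2.0, 1.1);  /* 80% accurate, +10% CT: 1.188 */
        double stall = et(0.2, 1.0, 2.0, 1.0);  /* stall = always wrong: 1.40   */
        printf("speedup = %.2f\n", stall / btfnt);  /* ~1.18 */
        return 0;
    }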

The Importance of Pipeline Depth There are two important parameters of the pipeline that determine the impact of branches on performance: Branch decode time -- how many cycles does it take to identify a branch (in our case, this is less than 1). Branch resolution time -- cycles until the real branch outcome is known (in our case, this is 2 cycles). 23

Pentium 4 pipeline 1. Branches take 19 cycles to resolve. 2. Identifying a branch takes 4 cycles. 3. Stalling is not an option.

Performance Impact ET = I * CPI * CT. Backward taken, forward not taken is 80% accurate. Branches are 20% of instructions. Changing the front end increases the cycle time by 10%. What is the speedup of BT/FNT compared to just stalling on every branch? What if the 2-cycle resolution penalty were 20? Btfnt: CPI = 0.2*0.2*(1 + 2) + (1 - 0.2*0.2)*1 = 1.08; CT = 1.1; ET = 1.188. Stall: CPI = 0.2*3 + 0.8*1 = 1.4; CT = 1; ET = 1.4. Speedup = 1.4/1.188 = 1.18. 25

Performance Impact ET = I * CPI * CT. Backward taken, forward not taken is 80% accurate. Branches are 20% of instructions. Changing the front end increases the cycle time by 10%. What is the speedup of BT/FNT compared to just stalling on every branch, with a 20-cycle resolution penalty? Btfnt: CPI = 0.2*0.2*(1 + 20) + (1 - 0.2*0.2)*1 = 1.8; CT = 1.1; ET = 1.98. Stall: CPI = 0.2*(1 + 20) + 0.8*1 = 5; CT = 1; ET = 5. Speedup = 5/1.98 = 2.53. 26
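
Plugging the 20-cycle resolution penalty into the hypothetical et() helper sketched after slide 22 reproduces these numbers:

    printf("%.2f %.2f %.2f\n",
           et(0.2, 0.2, 20.0, 1.1),   /* Btfnt ET = 1.98 */
           et(0.2, 1.0, 20.0, 1.0),   /* Stall ET = 5.00 */
           et(0.2, 1.0, 20.0, 1.0) / et(0.2, 0.2, 20.0, 1.1));  /* ~2.5x */

The longer a branch takes to resolve, the more a good guess is worth.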

Dynamic Branch Prediction Long pipes demand higher accuracy than static schemes can deliver. Instead of making the guess once, make it every time we see the branch. Predict future behavior based on past behavior. 27
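
One classic way to "predict future behavior based on past behavior" is a table of 2-bit saturating counters indexed by the branch's address. The slides stop before any specific scheme, so the C below is only a generic sketch of the idea, not necessarily the predictor covered later:

    #include <stdint.h>
    #include <stdbool.h>

    #define BHT_ENTRIES 1024
    static uint8_t bht[BHT_ENTRIES];   /* 2-bit counters, 0..3; >= 2 means taken */

    static bool predict(uint32_t pc) {
        return bht[(pc >> 2) % BHT_ENTRIES] >= 2;
    }

    static void update(uint32_t pc, bool taken) {
        uint8_t *c = &bht[(pc >> 2) % BHT_ENTRIES];
        if (taken  && *c < 3) (*c)++;   /* strengthen "taken"     */
        if (!taken && *c > 0) (*c)--;   /* strengthen "not taken" */
    }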