Pipeline design. Mehran Rezaei


How Can We Improve the Performance?
Exec Time = IC * CPI * CCT
Optimization    IC   CPI   CCT
Source level    *
Compiler        *    *
ISA             *    *
Organization         *     *
Technology                 *
With pipelining:
  We want a clock rate 5 times faster.
  Single-cycle machine: CPI is one.
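As a rough worked example of the formula above, the sketch below compares a single-cycle machine with an ideal 5-stage pipeline. The instruction count and the 10 ns cycle time are made-up numbers chosen only for illustration, and the pipeline is assumed to keep CPI at one with no hazards.

```python
def exec_time(ic, cpi, cct_ns):
    """Execution time in nanoseconds: IC * CPI * CCT."""
    return ic * cpi * cct_ns

IC = 1_000_000                                      # hypothetical instruction count
single_cycle = exec_time(IC, cpi=1, cct_ns=10.0)    # hypothetical 10 ns cycle
pipelined = exec_time(IC, cpi=1, cct_ns=10.0 / 5)   # ideal case: clock 5x faster, CPI still 1
speedup = single_cycle / pipelined
print(speedup)
```

Under these idealized assumptions the speedup equals the number of stages; hazards and stalls reduce it in practice.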

Analogy: order, pay, pick up

Pipelining
[Figure: pipelining in a microprocessor. Instructions (lw, add, sub, sw, and, or) overlap as they move through the fetch, decode, ALU, mem, and writeback stages.]

Pipeline design
Break the execution of the instruction into cycles.
Design a separate datapath stage for the execution performed during each cycle.
Build pipeline registers to communicate between the stages.

[Figure: single-cycle datapath with its control unit and control signals (ExtOp, RegDst, ALUSrc, ALUCtr, ALUOp, MemtoReg, Branch).]

[Figure: the same datapath divided into five stages: IF, ID, EXE, MEM, WB.]

Instruction Fetch
Design a datapath that can fetch an instruction from memory every cycle.
  Use the PC to index memory to read the instruction.
  Increment the PC (assume no branches for now).
Write everything needed to complete execution to the pipeline register (IF/ID).
  The next stage will read this pipeline register.
  Note that the pipeline register must be edge triggered.

[Figures: one datapath slice per stage, with the pipeline registers each stage reads and writes.]
IF: writes PC+4 and the instruction (Inst.) into the IF/ID registers.
ID: reads PC+4 and Inst. from IF/ID; writes PC+4, RegA, RegB, IMM, Rt, and Rd into the ID/EXE registers.
EXE: reads the ID/EXE registers; writes the branch target address (Br. Tr. Add.), ALUres, RegB, and Rt/Rd into the EXE/MEM registers.
MEM: reads the EXE/MEM registers; writes the memory output (Mem), ALUres, and Rt/Rd into the MEM/WB registers.
WB: reads Mem, ALUres, and Rt/Rd from the MEM/WB registers.

Example
Run the following code on our pipeline machine:
  add $,$0,$3
  lw $,20($2)
  sub $5,$6,$6
  sw $7,8($8)
  add $9,$,$3

[Figures: five cycle-by-cycle snapshots of the pipelined datapath. One instruction of the example enters the pipeline each cycle while the earlier ones advance a stage; the register file contents (R0 to R9) are shown in each snapshot.]

Recall: single-cycle control!
[Figure: single-cycle datapath with ideal instruction memory, a register file of 32 32-bit registers addressed by Rs, Rt, and Rd, an ALU, ideal data memory, and a control block that turns the instruction into control signals.]

Stationary Control
The Main Control generates the control signals during Reg/Dec.
  Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later.
  Control signals for Mem (MemWr, Branch) are used 2 cycles later.
  Control signals for Wr (MemtoReg, RegWr) are used 3 cycles later.
[Figure: control signals travel with the instruction through the pipeline registers. ID/Ex holds ExtOp, ALUSrc, ALUOp, RegDst, MemWr, Branch, MemtoReg, and RegWr; Ex/Mem holds MemWr, Branch, MemtoReg, and RegWr; Mem/Wr holds MemtoReg and RegWr.]

Datapath + Stationary Control
[Figure: pipelined datapath (Next PC, Inst. Mem, Decode, Reg. File, Exec, Mem Access, Reg. File write) with the ex, me, and wb control fields carried alongside each instruction through the pipeline registers.]


Pipeline timing diagram
Cycle:         1    2    3    4    5    6    7    8    9
add $,$0,$3    IF   ID   EXE  MEM  WB
lw $,20($2)         IF   ID   EXE  MEM  WB
sub $5,$6,$6             IF   ID   EXE  MEM  WB
sw $7,8($8)                   IF   ID   EXE  MEM  WB
add $9,$,$3                        IF   ID   EXE  MEM  WB
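The timing diagram can be generated mechanically. The sketch below is one minimal way to do it, assuming an ideal pipeline with no hazards or stalls; the function names and the use of bare mnemonics are illustrative choices, not part of the slides.

```python
STAGES = ["IF", "ID", "EXE", "MEM", "WB"]

def ideal_schedule(instructions, stages=STAGES):
    """Return {instruction index: {cycle: stage}} assuming no stalls."""
    schedule = {}
    for i, _ in enumerate(instructions):
        # Instruction i is fetched in cycle i + 1 and advances one stage per cycle.
        schedule[i] = {i + 1 + s: stage for s, stage in enumerate(stages)}
    return schedule

def render(instructions, schedule):
    """Format the schedule as one text row per instruction."""
    last = max(max(row) for row in schedule.values())
    lines = []
    for i, name in enumerate(instructions):
        cells = [schedule[i].get(c, "") for c in range(1, last + 1)]
        lines.append(f"{name:<5}" + "".join(f"{cell:<5}" for cell in cells))
    return "\n".join(lines)

program = ["add", "lw", "sub", "sw", "add"]
sched = ideal_schedule(program)
total_cycles = max(max(row) for row in sched.values())
print(render(program, sched))
print(total_cycles)
```

With n instructions and k stages the last write-back lands in cycle n + k - 1, which is why five instructions finish in nine cycles here.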

Hazards
What are they?
How do you detect them?
How do you deal with them?

[Figure: pipelined datapath with the pipeline-register fields labeled: PC+4 and instruction (IF/ID); PC+4, valA, valB, IMM, dest (ID/EXE); target, ALUres, eq?, valB, dest (EXE/MEM); mdata, ALUres, dest (MEM/WB).]

Pipeline cycles for add
IF - Fetch: read instruction from memory
ID - Decode: read source operands from the register file
EXE - Execute: calculate sum
MEM - Memory: pass results to next stage
WB - Write back: write sum (ALUres) into register file

Hazard
add $,$2,$3    IF   ID   EXE  MEM  WB     (register one is written in WB)
sub $,$5,$          IF   ID   EXE  MEM  WB     (register one is read in ID)
If we are not careful, we will read the wrong value!
If sub is supposed to read the updated value (not the stale one), how many instructions must come between add and sub?

[Figure: pipeline snapshot with add followed by sub in the datapath, showing the register file contents (R0 to R9).]

Hazard
Cycle:         1    2    3       4       5           6    7    8
add $,$2,$3    IF   ID   EXE     MEM     WB (write)
sub $,$5,$          IF   hazard  hazard  ID (read)   EXE  MEM  WB

Class work
What are the data hazards in this piece of code?
  add $,$2,$3
  sub $2,$,$3
  xor $,$3,$5
  nor $5,$2,$
  add $5,$3,$5

What to do with them?
Avoid
  Make sure there are no hazards in the code.
Detect and Stall
  If hazards exist, stall the processor until they go away.
Detect and Forward
  If hazards exist, fix up the pipeline to get the correct value (if possible).

First approach: avoid all hazards
Assume the programmer (or the compiler) knows about the processor implementation.
  Make sure no hazards exist.
Suppose there is an instruction called noop.
  Put noops between any dependent instructions.
add $,$2,$3    IF   ID   EXE  MEM  WB
noop
noop
sub $,$5,$                    IF   ID   EXE  MEM  WB

What is the problem with this solution?
Old programs (legacy code) may not run correctly on new implementations
  Longer pipelines need more noops
Programs get larger as noops are included
  Especially a problem for machines that try to execute more than one instruction every cycle
  Intel EPIC: often 25% - 40% of instructions are noops
Program execution is slower
  CPI is 1, but some instructions are noops
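To see why a CPI of one can still mean slower execution, the sketch below computes cycles per useful (non-noop) instruction. The function name is hypothetical, and the model assumes every instruction, noop or not, costs exactly one cycle.

```python
def cycles_per_useful_instruction(noop_fraction, base_cpi=1.0):
    """Cycles spent per non-noop instruction when a fraction of the
    instruction stream is noops (assumption: each instruction takes base_cpi cycles)."""
    return base_cpi / (1.0 - noop_fraction)

no_padding = cycles_per_useful_instruction(0.0)
quarter_noops = cycles_per_useful_instruction(0.25)
print(no_padding, quarter_noops)
```

With a quarter of the instructions being noops, each useful instruction effectively costs about 1.33 cycles even though the nominal CPI is unchanged.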

The second solution: detect and stall
Detect:
  Compare rega with previous DestRegs (5-bit operand fields)
  Compare regb with previous DestRegs (5-bit operand fields)
Stall:
  Keep current instructions in fetch and decode
  Pass a noop to execute
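The detect-and-stall rule can be sketched in a few lines. This is only an illustration of the comparison logic: the dictionary layout, the NOOP constant, and the function names are assumptions made for the example, and instructions without a destination are represented with None.

```python
NOOP = {"op": "noop", "reg_a": None, "reg_b": None, "dest": None}

def hazard_detected(reg_a, reg_b, dest_regs_in_flight):
    """True when either source register matches a destination register
    of an instruction that has not written back yet."""
    return any(d is not None and d in (reg_a, reg_b) for d in dest_regs_in_flight)

def decode_step(instr, dest_regs_in_flight):
    """Return (instruction passed to execute, stall flag).
    On a hazard the instruction stays in decode and a noop goes to execute."""
    if hazard_detected(instr["reg_a"], instr["reg_b"], dest_regs_in_flight):
        return NOOP, True
    return instr, False

# Hypothetical instruction reading registers 1 and 5 and writing register 4.
sub = {"op": "sub", "reg_a": 1, "reg_b": 5, "dest": 4}
stalled = decode_step(sub, dest_regs_in_flight=[1, None])    # register 1 still pending
cleared = decode_step(sub, dest_regs_in_flight=[None, None]) # nothing pending
print(stalled[1], cleared[1])
```

In hardware the same test is a handful of 5-bit comparators whose outputs are OR-ed into a single stall signal.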


Hazard (add is at address 0x00)
Cycle:         1    2    3       4       5           6    7    8
add $,$2,$3    IF   ID   EXE     MEM     WB (write)
sub $,$5,$          IF   hazard  hazard  ID (read)   EXE  MEM  WB

[Figures: pipeline snapshots for the first and second halves of cycles 1 to 3. add is fetched in cycle 1; in cycle 2 add is decoded while sub is fetched; in the first half of cycle 3, with sub in decode and add in execute, the hazard is detected.]

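Hazard detected
[Figure: comparators between the IF/ID and ID/EX pipeline registers compare rega and regb of the instruction in decode against the destination registers of the instructions ahead of it; a match raises the hazard signal.]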

What next?
Detect:
  Compare rega with previous DestRegs (5-bit operand fields)
  Compare regb with previous DestRegs (5-bit operand fields)
Stall:
  Keep current instructions in fetch and decode
  Pass a noop to execute

[Figures: pipeline snapshots from the second half of cycle 3 to the second half of cycle 5. While the hazard is detected, sub is held in decode and noops are passed to execute; once add reaches write back in cycle 5, sub reads its operands and moves on, with the two noops ahead of it in the pipeline.]

Timing graph
Time: 1 2 3 4 5 6 7 8 9 10 11 12 13
add $,$2,$3: IF ID EX ME WB
sub $,$,$5: IF noop noop ID EX ME WB
add $6,$,$7: IF ID EX ME WB
lw $6,0($8): IF ID EX ME WB
sw $6,3($): IF noop noop ID EX ME
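The stalled timing can be reproduced with a small scheduler. The sketch below is a simplified model, not the slides' exact machine: it assumes the register file is written in the first half of a cycle and read in the second half, so a dependent instruction may decode in the same cycle its producer writes back. The program tuples use hypothetical register numbers.

```python
def stall_schedule(program):
    """program: list of (name, dest_reg or None, source_regs).
    Returns one dict per instruction with the cycle of each stage
    after decode and the number of stall cycles it suffered."""
    rows = []
    last_writer_wb = {}   # register -> WB cycle of its most recent writer
    prev_id = 1           # makes the first instruction decode in cycle 2
    for name, dest, sources in program:
        # Earliest cycle in which all source registers hold updated values.
        ready = max((last_writer_wb.get(r, 0) for r in sources), default=0)
        id_cycle = max(prev_id + 1, ready)
        row = {"name": name, "ID": id_cycle, "EX": id_cycle + 1,
               "ME": id_cycle + 2, "WB": id_cycle + 3,
               "stalls": id_cycle - (prev_id + 1)}
        if dest is not None:
            last_writer_wb[dest] = row["WB"]
        rows.append(row)
        prev_id = id_cycle
    return rows

# Hypothetical pair: the second instruction reads the register the first one writes.
rows = stall_schedule([("add", 1, (2, 3)), ("sub", 4, (1, 5))])
for row in rows:
    print(row)
```

For this pair the dependent instruction waits two cycles and writes back in cycle 8, matching the shape of the first two rows of the timing graph.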

Problems with the second solution
CPI is still the same as before, so there is no improvement in performance.
The only improvements are in code size and in the compiler no longer being responsible for detecting data hazards.
In fact, the system now runs slower. Why?

The third solution
Detect the data hazard.
The add instruction calculates its result in the execute cycle.
Forward the result to the decode stage of the sub instruction.
Therefore sub does not need to wait until the result is written back into the register file.
More control is needed: the result must be taken from somewhere other than the register file.

The third solution: detect and forward
Detect: same as detect and stall
  Except that all hazards are treated differently, i.e., you can't logical-or the hazard signals
Forward:
  New bypass datapaths route computed data to where it is needed
  New MUX and control to pick the right data
Beware: stalling may still be required even in the presence of forwarding
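The MUX control for forwarding might look like the sketch below. It follows the common textbook arrangement of forwarding into the ALU inputs in the execute stage, which may differ from the datapath drawn in these slides; the function names, the tuple layout, and the treatment of register 0 as never forwarded are all assumptions for the example.

```python
def select_operand(src_reg, ex_mem, mem_wb):
    """Choose the source of one ALU operand.
    ex_mem and mem_wb are (dest_reg, writes_register) pairs describing the
    instructions one and two stages ahead. The younger result wins."""
    if src_reg != 0 and ex_mem[1] and ex_mem[0] == src_reg:
        return "EX/MEM"
    if src_reg != 0 and mem_wb[1] and mem_wb[0] == src_reg:
        return "MEM/WB"
    return "REGFILE"

def load_use_stall(src_regs, id_ex_is_load, id_ex_dest):
    """A load's data is not available until after the memory stage, so an
    instruction that needs it immediately must still stall one cycle."""
    return bool(id_ex_is_load and id_ex_dest != 0 and id_ex_dest in src_regs)

from_ex_mem = select_operand(1, ex_mem=(1, True), mem_wb=(1, True))
from_mem_wb = select_operand(1, ex_mem=(2, True), mem_wb=(1, True))
from_regfile = select_operand(1, ex_mem=(2, True), mem_wb=(3, False))
needs_stall = load_use_stall((6, 1), id_ex_is_load=True, id_ex_dest=6)
print(from_ex_mem, from_mem_wb, from_regfile, needs_stall)
```

Each hazard now selects a different bypass path, which is why the hazard signals can no longer simply be OR-ed together, and the load case shows why stalling does not disappear entirely.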

[Figures: pipeline snapshots with forwarding, from the first half of cycle 3 to the first half of cycle 5. The hazard between sub and add is detected in cycle 3 and resolved through the forwarding (FW) paths instead of stalling; new hazards are flagged in cycles 4 and 5 as add $6,$,$7, lw $6,0($8), and sw $6,3($) enter the pipeline.]

What else can go wrong in our pipelined CPU?
Control hazards
Exceptions:
  First of all, what are exceptions?
  And how do you handle exceptions in a pipelined processor with 5 instructions in flight?

Control Hazard
What is a control hazard?
How does the pipelined CPU handle control hazards?

[Figure: pipelined datapath with branch support. The control unit and ALU unit drive beq/bne handling; the pipeline registers carry PC+4, valA, valB, IMM, target, ALUres, eq?, and mdata.]

What happens in executing BEQ?
Fetch: read instruction from memory
Decode: read source operands from the register file
Execute: calculate target address and test for equality
Memory: send target to PC if the test is equal
Write back: nothing left to do

Example
C code:
  y=y*2;
  x=0;
  for(j=00;j>0;j--){ x++; z--; }
  y--;
  x=x*3;
  z=z+x;
Assembly (address, instruction):
  00 add $3,$3,$3
  04 add $2,$0,$0
  08 li $5,00
  12 addi $2,$2,1
  16 addi $,$,-1
  20 addi $5,$5,-1
  24 bne $5,$0,12
  28 addi $3,$3,-1
  32 add $5,$2,$0
  36 add $2,$2,$2
  40 add $2,$2,$5
  44 add $,$,$2

What do you observe from the example?
How many times is the branch taken? How many times is it not taken?
What happens each time the branch instruction is executed?
What happens next?
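One way to answer the first question is to trace the loop directly. The sketch below mirrors the structure of the example loop (increment x, decrement j, branch back while j is not zero); the function name and the iteration count passed in are illustrative, and the model assumes the loop starts with j at least 1.

```python
def run_loop(n):
    """Trace a count-down loop of n iterations and count how often the
    loop-closing bne is taken and not taken."""
    taken = not_taken = 0
    x = 0
    j = n
    while True:
        x += 1              # x++
        j -= 1              # j--
        if j != 0:          # bne: branch back to the top of the loop
            taken += 1
        else:               # fall through to the code after the loop
            not_taken += 1
            break
    return taken, not_taken, x

taken, not_taken, x = run_loop(100)   # hypothetical iteration count
print(taken, not_taken, x)
```

For any n the branch is taken n - 1 times and falls through once, so a loop-closing branch is taken almost every time it executes.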

Surprise!
  12 addi $2,$2,1
  ...
  24 bne $5,$0,-4
  28 addi $3,$3,-1
  32 add $5,$2,$0
  36 add $2,$2,$2
Cycle:  1    2    3    4    5    6    7    8    9
24      IF   ID   EXE  MEM  WB
28           IF   ID   EXE  MEM  WB
32                IF   ID   EXE  MEM  WB
36                     IF   ID   EXE  MEM  WB
12                          IF   ID   EXE  MEM  WB

Solutions
Avoid
  Make sure there are no hazards in the code.
Detect and Stall
  Delay fetch until the branch is resolved.
Speculate and Squash-if-Wrong
  Go ahead and fetch more instructions in case the guess is correct, but stop them if they shouldn't have been executed.

Avoid
Don't have branch instructions!
  Maybe a little impractical
Delay taking the branch:
  dbeq R,R2,offset
  dbne R,R2,offset
  Instructions at PC+4, PC+8, etc. will execute before deciding whether to fetch from PC+4+offset.
  (If no useful instructions can be placed after dbeq, noops must be inserted.)

Consider our example again
Original:
  00 add $3,$3,$3
  04 add $2,$0,$0
  08 li $5,00
  12 addi $2,$2,1
  16 addi $,$,-1
  20 addi $5,$5,-1
  24 bne $5,$0,-4
  28 addi $3,$3,-1
  32 add $5,$2,$0
  36 add $2,$2,$2
  40 add $2,$2,$5
  44 add $,$,$2
With noops after the branch:
  00 add $3,$3,$3
  04 add $2,$0,$0
  08 li $5,00
  12 addi $2,$2,1
  16 addi $,$,-1
  20 addi $5,$5,-1
  24 bne $5,$0,-4
  28 noop
  32 noop
  36 noop
  40 addi $3,$3,-1
  44 add $5,$2,$0
  48 add $2,$2,$2
  52 add $2,$2,$5
  56 add $,$,$2

Can we do better?
First version:
  00 add $3,$3,$3
  04 add $2,$0,$0
  08 li $5,00
  12 addi $5,$5,-1
  16 dbne $5,$0,-2
  20 addi $,$,-1
  24 addi $2,$2,1
  28 noop
  32 addi $3,$3,-1
  36 add $5,$2,$0
  40 add $2,$2,$2
  44 add $2,$2,$5
  48 add $,$,$2
Second version:
  00 add $3,$3,$3
  04 add $2,$0,$0
  08 li $5,00
  12 dbne $5,$0,-1
  16 addi $5,$5,-1
  20 addi $,$,-1
  24 addi $2,$2,1
  28 addi $3,$3,-1
  32 add $5,$2,$0
  36 add $2,$2,$2
  40 add $2,$2,$5
  44 add $,$,$2
This code generates wrong results.

Problems with this solution
Old programs (legacy code) may not run correctly on new implementations
  Longer pipelines need more instructions/noops after a delayed beq
Programs get larger as noops are included
  Especially a problem for machines that try to execute more than one instruction every cycle
  Intel EPIC: often 25% - 40% of instructions are noops
Program execution is slower
  CPI equals 1, but some instructions are noops

Detect and Stall (hardware approach)
Detection:
  Must wait until decode
  Compare opcode to beq
  Alternatively, this is just another control signal
Stall:
  Keep current instructions in fetch
  Pass noop to decode stage (not execute!)

Our example again
  00 add $3,$3,$3
  04 add $2,$0,$0
  08 li $5,00
  12 addi $2,$2,1
  16 addi $,$,-1
  20 addi $5,$5,-1
  24 bne $5,$0,-4
  28 addi $3,$3,-1
  32 add $5,$2,$0
  36 add $2,$2,$2
  40 add $2,$2,$5
  44 add $,$,$2

[Figures: five pipeline snapshots of detect-and-stall on the bne at address 24. Once bne is recognized in decode, fetch is held at address 28 and a noop is passed to decode each cycle while bne advances through execute and memory; after the branch resolves, the addi at address 12 is fetched, with three noops ahead of it in the pipeline.]

What seems to be the problem?
CPI increases every time a branch is detected!
Is that necessary? Not always!
Only about ½ of the time is the branch taken.
Let's assume that it is NOT taken.
  In this case, we can ignore the beq or bne (treat it like a noop).
  Keep fetching from PC+4.
What if we are wrong?
  OK, as long as we do not COMPLETE any instructions we mistakenly executed (i.e., don't perform writeback).

Speculate and Squash
Speculate: assume not equal
  Keep fetching from PC+4 until we know that the branch is really taken
Squash: stop bad instructions if taken
  Send a noop to Decode, Execute, and Memory
  Send target address to PC
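A back-of-the-envelope comparison of the two hardware approaches is sketched below. The three-cycle penalty corresponds to the three noops sent to decode, execute, and memory, and the one-half taken rate is the estimate quoted in these slides; the 20% branch frequency is a made-up figure for illustration only.

```python
def cpi_with_branches(branch_fraction, penalty_cycles, fraction_penalized):
    """Average CPI when a fraction of the branches each cost extra cycles
    (assumes a base CPI of 1 and no other stalls)."""
    return 1.0 + branch_fraction * fraction_penalized * penalty_cycles

BRANCH_FRACTION = 0.2   # hypothetical: one instruction in five is a branch
PENALTY = 3             # noops to decode, execute, and memory

stall_cpi = cpi_with_branches(BRANCH_FRACTION, PENALTY, 1.0)    # every branch pays
squash_cpi = cpi_with_branches(BRANCH_FRACTION, PENALTY, 0.5)   # only taken branches pay
print(stall_cpi, squash_cpi)
```

Under these assumptions speculation halves the branch overhead, from 0.6 to 0.3 extra cycles per instruction, because only the taken branches are squashed.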

Our example again
  00 add $3,$3,$3
  04 add $2,$0,$0
  08 li $5,00
  12 addi $2,$2,1
  16 addi $,$,-1
  20 addi $5,$5,-1
  24 bne $5,$0,-4
  28 addi $3,$3,-1
  32 add $5,$2,$0
  36 add $2,$2,$2
  40 add $2,$2,$5
  44 add $,$,$2

[Figure: squashing after the bne at address 24 turns out to be taken. The instructions fetched from addresses 28, 32, and 36 are replaced with noops in decode, execute, and memory, and the target address is sent to the PC.]

Performance problem, again
CPI increases every time a branch is taken!
  About ½ of the time
Is that necessary?
  No! But how can you fetch from the target before you even know the previous instruction is a branch, much less whether it is taken?

[Figures: three snapshots of the datapath extended with a table of branch PCs (bpc) and targets next to the fetch stage. The PC is compared against the stored bpc so that the target of the bne at address 24 can be fetched without waiting for the branch to resolve.]

Branch Prediction
Predict not taken: ~50% accurate
Predict backward taken: ~65% accurate
Predict same as last time: ~80% accurate
Pentium: ~85% accurate
Pentium Pro: ~92% accurate
Best paper designs: ~96% accurate
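The simple schemes in this list are easy to try on a single loop-closing branch. The sketch below is a toy experiment, not a reproduction of the accuracy figures above, which come from whole programs: the outcome sequence (99 taken, then 1 not taken) is a hypothetical loop, and the "same as last time" predictor is assumed to start out predicting not taken.

```python
def accuracy(outcomes, predictor):
    """Fraction of correct predictions for a sequence of branch outcomes
    (True = taken) under one of three simple schemes."""
    correct = 0
    last = False                    # assumed initial state: not taken
    for taken in outcomes:
        if predictor == "not_taken":
            guess = False
        elif predictor == "backward_taken":
            guess = True            # a loop-closing branch jumps backward
        elif predictor == "last_time":
            guess = last
        else:
            raise ValueError(predictor)
        correct += (guess == taken)
        last = taken
    return correct / len(outcomes)

outcomes = [True] * 99 + [False]    # hypothetical loop of 100 iterations
results = {p: accuracy(outcomes, p)
           for p in ("not_taken", "backward_taken", "last_time")}
print(results)
```

On this one branch, always predicting not taken is right only once, predicting backward taken misses only the loop exit, and predicting the last outcome misses the first and last iterations.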