CENG 3420 Lecture 06: Pipeline

Bei Yu, byu@cse.cuhk.edu.hk, Spring 2019

Outline
- Pipeline Motivations
- Pipeline Hazards
- Exceptions
- Background: Flip-Flop Control Signals

Review: Instruction Critical Paths
- Calculate the cycle time assuming negligible delays (for muxes, control unit, sign extend, PC access, shift left 2, wires) except: Instruction and Data Memory (4 ns), ALU and adders (2 ns), Register File access for a read or write (1 ns). (A worked check follows the table.)

    Instr.    I Mem   Reg Rd   ALU Op   D Mem   Reg Wr   Total
    R-type    4 ns    1 ns     2 ns             1 ns      8 ns
    load      4 ns    1 ns     2 ns     4 ns    1 ns     12 ns
    store     4 ns    1 ns     2 ns     4 ns             11 ns
    beq       4 ns    1 ns     2 ns                       7 ns
    jump      4 ns                                        4 ns
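A quick check of the totals (my arithmetic, not an extra slide): the load path is the longest, so it sets the single-cycle clock:

    T_{\text{load}} = 4 + 1 + 2 + 4 + 1 = 12\ \text{ns} \quad\Rightarrow\quad f_{\max} = \frac{1}{12\ \text{ns}} \approx 83\ \text{MHz}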

Review: Single Cycle Disadvantages & Advantages
- Uses the clock cycle inefficiently: the clock cycle must be timed to accommodate the slowest instruction, which is especially problematic for more complex instructions like floating-point multiply
  [timing diagram: lw finishes well before cycle 1 ends, so the sw in cycle 2 starts late and the remaining time is wasted]
- May be wasteful of area, since some functional units (e.g., adders) must be duplicated because they cannot be shared within a clock cycle
- But it is simple and easy to understand

How Can We Make It Faster?
- Start fetching and executing the next instruction before the current one has completed
  - Pipelining: (all?) modern processors are pipelined for performance
  - Remember the performance equation: CPU time = CPI * CC * IC
- Under ideal conditions and with a large number of instructions, the speedup from pipelining is approximately equal to the number of pipe stages: a five-stage pipeline is nearly five times faster because the clock cycle is nearly five times shorter
- Fetch (and execute) more than one instruction at a time: superscalar processing (stay tuned)

The Five Stages of a Load Instruction
[pipeline diagram: lw passes through IFetch, Dec, Exec, Mem, WB in cycles 1-5]
- IFetch: instruction fetch and update PC
- Dec: register fetch and instruction decode
- Exec: execute R-type; calculate memory address
- Mem: read/write the data from/to the Data Memory
- WB: write the result data into the register file

A Pipelined MIPS Processor
- Start the next instruction before the current one has completed
  - Improves throughput: the total amount of work done in a given time
  - Instruction latency (execution time, delay time, response time: the time from the start of an instruction to its completion) is not reduced
  [pipeline diagram: lw, sw, and an R-type instruction each occupy IFetch, Dec, Exec, Mem, WB in successive cycles 1-8]
- The clock cycle (pipeline stage time) is limited by the slowest stage
- Some stages don't need the whole clock cycle (e.g., WB)
- For some instructions, some stages are wasted cycles (i.e., nothing is done during that cycle for that instruction)

Single Cycle versus Pipeline
- Single-cycle implementation (CC = 800 ps): lw completes in one 800 ps cycle, then sw takes the next cycle
- Pipeline implementation (CC = 200 ps): lw, sw, and an R-type instruction overlap, each spending 200 ps per stage
- To complete an entire instruction in the pipelined case takes 1000 ps (compared to 800 ps for the single-cycle case). Why?
- How long does each take to complete 1,000,000 adds? (A worked estimate follows.)
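A worked answer to the two questions above (my arithmetic; the slide leaves them as exercises). The pipelined latency is 5 x 200 ps = 1000 ps because every instruction now passes through five stages and the cycle must cover the slowest stage; what improves is throughput:

    T_{\text{single}} = 10^{6} \times 800\ \text{ps} = 800\ \mu\text{s}, \qquad T_{\text{pipelined}} \approx (10^{6} + 4) \times 200\ \text{ps} \approx 200\ \mu\text{s}

The speedup is about 4x (the clock-cycle ratio 800/200), slightly less than the ideal 5x because the stages are not perfectly balanced (800/5 = 160 ps < 200 ps).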

Pipelining the MIPS ISA
- What makes it easy:
  - All instructions are the same length (32 bits), so we can fetch in the 1st stage and decode in the 2nd stage
  - Few instruction formats (three) with symmetry across formats, so we can begin reading the register file in the 2nd stage
  - Memory operations occur only in loads and stores, so we can use the execute stage to calculate memory addresses
  - Each instruction writes at most one result (i.e., changes the machine state) and does it in the last few pipeline stages (MEM or WB)
  - Operands must be aligned in memory, so a single data transfer takes only one data memory access

MIPS Pipeline Datapath Additions/Modifications
- State registers between each pipeline stage to isolate them
[datapath figure: pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB separate the stages IF (IFetch), ID (Dec), EX (Execute), MEM (MemAccess), and WB (WriteBack); the PC, instruction memory, register file, ALU, sign extend (16 to 32 bits), and data memory are all driven by the system clock]

MIPS Pipeline Control Path Modifications
- All control signals can be determined during Decode and held in the state registers between pipeline stages
[datapath figure: the Control block's outputs (RegWrite, ALUSrc, ALU Op, Branch, MemRead, MemtoReg, RegDst, PCSrc) ride down the pipeline in the ID/EX, EX/MEM, and MEM/WB registers]

Pipeline Control
- IF stage: read Instruction Memory (always asserted) and write PC (on the system clock)
- ID stage: no optional control signals to set
- Remaining stages (a small C decode sketch follows the table):

               EX Stage                         MEM Stage                  WB Stage
    Instr   Reg Dst  ALU Op1  ALU Op0  ALU Src  Brch  Mem Read  Mem Write  Reg Write  Mem toReg
    R          1        1        0        0       0      0          0          1          0
    lw         0        0        0        1       0      1          0          1          1
    sw         X        0        0        1       0      0          1          0          X
    beq        X        0        1        0       1      0          0          0          X
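As an illustration of the table, a small C sketch of main-control decode for these four instruction classes (the struct layout is mine; the opcodes 0x00, 0x23, 0x2B, 0x04 are the standard MIPS encodings for R-type, lw, sw, beq; don't-cares are written as 0):

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {                      /* carried down the pipeline:   */
        bool reg_dst, alu_src;            /*   EX-stage signals           */
        uint8_t alu_op;                   /*   EX-stage, 2 bits           */
        bool branch, mem_read, mem_write; /*   MEM-stage signals          */
        bool reg_write, mem_to_reg;       /*   WB-stage signals           */
    } ctrl_t;

    static ctrl_t decode(uint8_t opcode) {  /* done once, in ID */
        switch (opcode) {
        case 0x00: return (ctrl_t){1, 0, 0x2, 0, 0, 0, 1, 0};  /* R-type */
        case 0x23: return (ctrl_t){0, 1, 0x0, 0, 1, 0, 1, 1};  /* lw     */
        case 0x2B: return (ctrl_t){0, 1, 0x0, 0, 0, 1, 0, 0};  /* sw     */
        case 0x04: return (ctrl_t){0, 0, 0x1, 1, 0, 0, 0, 0};  /* beq    */
        default:   return (ctrl_t){0};                         /* treat as nop */
        }
    }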

Graphically Representing the MIPS Pipeline
- Can help with answering questions like:
  - How many cycles does it take to execute this code?
  - What is the ALU doing during cycle 4?
  - Is there a hazard, why does it occur, and how can it be fixed?

Other Pipeline Structures Are Possible
- What about the (slow) multiply operation?
  - Make the clock twice as slow, or let multiply take two cycles (since it doesn't use the DM stage) [figure: pipeline with an added MUL stage]
- What if the data memory access is twice as slow as the instruction memory?
  - Make the clock twice as slow, or let data memory access take two cycles (and keep the same clock rate) [figure: IM, Reg, DM1, DM2, Reg stages]

Other Sample Pipeline Alternatives
- ARM7: a three-stage pipeline, IM (PC update, IM access), Reg (decode, reg access), EX (op, DM access, shift/rotate, commit result / write back)
- XScale: a seven-stage pipeline (IM1, IM2, Reg, SHFT, DM1, DM2, Reg) covering PC update / BTB access / start of IM access, IM access, decode, reg 1 access, shift/rotate, reg 2 access, the op, start of DM access / exception, DM write, and reg write

Why Pipeline? For Performance!
[pipeline diagram: instructions Inst 0 through Inst 4 enter the pipeline in successive clock cycles]
- Once the pipeline is full, one instruction is completed every cycle, so CPI = 1
- The first few cycles are spent filling the pipeline

Can Pipelining Get Us Into Trouble?
- Yes: pipeline hazards
  - Structural hazards: a required resource is busy
  - Data hazards: attempt to use data before it is ready
  - Control hazards: deciding on a control action depends on a previous instruction
- Can usually resolve hazards by waiting: pipeline control must detect the hazard and take action to resolve it

Structural Hazards
- Conflict for use of a resource
- In a MIPS pipeline with a single memory:
  - Load/store requires data access
  - Instruction fetch requires instruction access
- Hence, pipelined datapaths require separate instruction/data memories, or separate instruction/data caches
- The register file is also both read and written in the same cycle; the next slides show how both conflicts are resolved

Resolve Structural Hazard 1: Memory
[pipeline diagram: lw reads data from memory in its Mem stage during the same cycle that a later instruction (Inst 3) reads its instruction from memory in its fetch stage]
- Fix with separate instruction and data memories (I$ and D$)

Resolve Structural Hazard 2: Register File
[pipeline diagram: add $1,... writes $1 in its WB stage during the same cycle that add $2,$1,... reads $1 in its Dec stage; one clock edge controls register writing, the other controls loading of the pipeline state registers]
- Fix the register file access hazard by doing reads in the second half of the cycle and writes in the first half

Data Hazards: Register Usage
- Dependencies backward in time cause hazards:
    add $1,...
    sub $4,$1,$5
    and $6,$1,$7
    or  $8,$1,$9
    xor $4,$1,$5
- Read-before-write data hazard

Data Hazards: Load Memory
- Dependencies backward in time cause hazards:
    lw  $1,4($2)
    sub $4,$1,$5
    and $6,$1,$7
    or  $8,$1,$9
    xor $4,$1,$5
- Load-use data hazard

Resolve Data Hazards 1: Insert Stalls
- Can fix a data hazard by waiting (stalling), but this impacts CPI:
    add $1,...
    stall
    stall
    sub $4,$1,$5
    and $6,$1,$7

Resolve Data Hazards 2: Forwarding
- Fix data hazards by forwarding results as soon as they are available to where they are needed:
    add $1,...
    sub $4,$1,$5
    and $6,$1,$7
    or  $8,$1,$9
    xor $4,$1,$5

Forward Unit Output Signals
[figure: the Forward Unit's ForwardA and ForwardB outputs select each ALU input from the register file, the EX/MEM result, or the MEM/WB result]

Datapath with Forwarding Hardware
[datapath figure: a Forward Unit in the EX stage compares ID/EX.RegisterRs and ID/EX.RegisterRt against EX/MEM.RegisterRd and MEM/WB.RegisterRd and steers the ALU input muxes]

Data Forwarding Control Conditions (a C sketch follows)
1. EX forward unit (forwards the result from the previous instruction to either input of the ALU):
    if (EX/MEM.RegWrite and (EX/MEM.RegisterRd != 0)
        and (EX/MEM.RegisterRd == ID/EX.RegisterRs)) ForwardA = 10
    if (EX/MEM.RegWrite and (EX/MEM.RegisterRd != 0)
        and (EX/MEM.RegisterRd == ID/EX.RegisterRt)) ForwardB = 10
2. MEM forward unit (forwards the result from the second previous instruction to either input of the ALU):
    if (MEM/WB.RegWrite and (MEM/WB.RegisterRd != 0)
        and (MEM/WB.RegisterRd == ID/EX.RegisterRs)) ForwardA = 01
    if (MEM/WB.RegWrite and (MEM/WB.RegisterRd != 0)
        and (MEM/WB.RegisterRd == ID/EX.RegisterRt)) ForwardB = 01
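A minimal C sketch of the two forward units above (struct and field names are illustrative, not from the slides). Because the EX/MEM check runs last and can overwrite the MEM/WB choice, the most recent result wins, which is also the intent of the corrected MEM forward unit a few slides later:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        bool    ex_mem_regwrite;  uint8_t ex_mem_rd;   /* from EX/MEM register */
        bool    mem_wb_regwrite;  uint8_t mem_wb_rd;   /* from MEM/WB register */
        uint8_t id_ex_rs, id_ex_rt;                    /* from ID/EX register  */
    } fwd_in_t;

    /* 2-bit ALU-input selects: 00 = register file, 10 = EX/MEM, 01 = MEM/WB. */
    static void forward_unit(const fwd_in_t *in, uint8_t *fwd_a, uint8_t *fwd_b) {
        *fwd_a = 0; *fwd_b = 0;
        if (in->mem_wb_regwrite && in->mem_wb_rd != 0) {        /* MEM hazard */
            if (in->mem_wb_rd == in->id_ex_rs) *fwd_a = 0x1;    /* binary 01  */
            if (in->mem_wb_rd == in->id_ex_rt) *fwd_b = 0x1;
        }
        if (in->ex_mem_regwrite && in->ex_mem_rd != 0) {        /* EX hazard  */
            if (in->ex_mem_rd == in->id_ex_rs) *fwd_a = 0x2;    /* binary 10  */
            if (in->ex_mem_rd == in->id_ex_rt) *fwd_b = 0x2;
        }
    }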

Forwarding Illustration
[pipeline diagram:
    add $1,...
    sub $4,$1,$5   (receives $1 via EX forwarding)
    and $6,$7,$1   (receives $1 via MEM forwarding)]

Yet Another Complication!
- Another potential data hazard can occur when there is a conflict between the result of the WB-stage instruction and the MEM-stage instruction: which should be forwarded?
    add $1,$1,$2
    add $1,$1,$3
    add $1,$1,$4

EX: Corrected MEM Forward Unit
- MEM forward unit (forward from MEM/WB only if the EX/MEM instruction is not also writing the same register):
    if (MEM/WB.RegWrite and (MEM/WB.RegisterRd != 0)
        and (EX/MEM.RegisterRd != ID/EX.RegisterRs)
        and (MEM/WB.RegisterRd == ID/EX.RegisterRs)) ForwardA = 01
    if (MEM/WB.RegWrite and (MEM/WB.RegisterRd != 0)
        and (EX/MEM.RegisterRd != ID/EX.RegisterRt)
        and (MEM/WB.RegisterRd == ID/EX.RegisterRt)) ForwardB = 01

Memory-to-Memory Copies
- For loads immediately followed by stores (memory-to-memory copies), we can avoid a stall by adding forwarding hardware from the MEM/WB register to the data memory input
  - Would need to add a Forward Unit and a mux to the MEM stage
    lw $1,4($2)
    sw $1,4($3)

Forwarding with Load-use Data Hazards
[pipeline diagram: lw $1,4($2) followed immediately by sub $4,$1,$5, and $6,$1,$7, or $8,$1,$9, xor $4,$1,$5; the loaded value is not available until the end of lw's MEM stage, after sub's EX stage has already begun]
- Will still need one stall cycle even with forwarding

Forwarding with Load-use Data Hazards (with stall inserted)
[pipeline diagram: lw $1,4($2), one stall (bubble), then sub $4,$1,$5, and $6,$1,$7, or $8,$1,$9, xor $4,$1,$5; after the stall the loaded value can be forwarded into sub's EX stage]
- Will still need one stall cycle even with forwarding

Load-use Hazard Detection Unit (optional)
- Need a hazard detection unit in the ID stage that inserts a stall between the load and its use (a C sketch follows)
  1. ID hazard detection unit:
     if (ID/EX.MemRead
         and ((ID/EX.RegisterRt == IF/ID.RegisterRs)
           or (ID/EX.RegisterRt == IF/ID.RegisterRt)))
        stall the pipeline
- The first line tests whether the instruction now in the EX stage is a lw; the next two lines check whether the destination register of the lw matches either source register of the instruction in the ID stage (the load-use instruction)
- After this one-cycle stall, the forwarding logic can handle the remaining data hazards
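In the same illustrative C style as the forwarding sketch above, the load-use check reduces to one boolean (field names are hypothetical):

    #include <stdbool.h>
    #include <stdint.h>

    /* True when the instruction in EX is a load whose destination register
       is a source of the instruction currently in ID: stall for one cycle. */
    static bool loaduse_stall(bool id_ex_memread, uint8_t id_ex_rt,
                              uint8_t if_id_rs, uint8_t if_id_rt) {
        return id_ex_memread &&
               (id_ex_rt == if_id_rs || id_ex_rt == if_id_rt);
    }

When it returns true, the Hazard Unit deasserts PC.Write and IF/ID.Write and zeroes the control bits entering ID/EX, as the next slide's datapath shows.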

Adding the Hazard/Stall Hardware (optional)
[datapath figure: a Hazard Unit in the ID stage monitors ID/EX.MemRead and ID/EX.RegisterRt; on a load-use hazard it deasserts PC.Write and IF/ID.Write and muxes zeros into the control bits entering ID/EX; the Forward Unit remains in the EX stage]

Control Hazards
- Occur when the flow of instruction addresses is not sequential (i.e., not PC = PC + 4); incurred by change-of-flow instructions:
  - Unconditional branches (j, jal, jr)
  - Conditional branches (beq, bne)
  - Exceptions
- Possible approaches:
  - Stall (impacts CPI)
  - Move the decision point as early in the pipeline as possible, thereby reducing the number of stall cycles
  - Delay the decision (requires compiler support)
  - Predict and hope for the best!
- Control hazards occur less frequently than data hazards, but there is nothing as effective against control hazards as forwarding is for data hazards

Control Hazards 1: Jumps Incur One Stall
- Jumps are not decoded until ID, so one flush is needed
  - To flush, assert IF.Flush to zero the instruction field of the IF/ID pipeline register (turning it into a nop)
  [pipeline diagram: the instruction fetched after j is flushed, then the target instruction is fetched]
- Fortunately, jumps are very infrequent: only 3% of the SPECint instruction mix

Datapath Branch and Jump Hardware
[datapath figure: a Jump mux selects the jump target, formed from PC+4[31-28] concatenated with the instruction's jump field shifted left 2; PCSrc selects the branch target computed by the branch adder]

Supporting ID Stage Jumps
[datapath figure: the same jump hardware, with the jump resolved in the ID stage and a zero muxed in to flush the instruction that follows the jump]

Control Hazards 2: Branch Instructions
- Dependencies backward in time cause hazards
[pipeline diagram: beq followed by lw, Inst 3, Inst 4; the instructions after beq are fetched before the branch outcome is known]

One Way to Fix a Branch Control Hazard
- Fix the branch hazard by waiting (flushing), but this affects CPI
[pipeline diagram: beq is followed by three flushes before the branch-target instruction and Inst 3 are fetched]

Another Way to Fix a Branch Control Hazard
- Move the branch decision hardware back to as early in the pipeline as possible, i.e., into the decode cycle
[pipeline diagram: beq is followed by only one flush before the branch-target instruction and Inst 3 are fetched]

Two Types of Stalls
- Nop instruction (or bubble) inserted between two instructions in the pipeline (as done for load-use situations)
  - Keep the instructions earlier in the pipeline (later in the code) from progressing down the pipeline for a cycle ("bounce" them in place with the write control signals)
  - Insert the nop by zeroing the control bits in the pipeline register at the appropriate stage
  - Let the instructions later in the pipeline (earlier in the code) progress normally down the pipeline
- Flushes (or instruction squashing), where an instruction in the pipeline is replaced with a nop instruction (as done for instructions located sequentially after j instructions)
  - Zero the control bits for the instruction to be flushed

Reducing the Delay of Branches
- Move the branch decision hardware back to the EX stage
  - Reduces the number of stall (flush) cycles to two
  - Adds an AND gate and a 2x1 mux to the EX timing path
- Add hardware to compute the branch target address and evaluate the branch decision in the ID stage
  - Reduces the number of stall (flush) cycles to one (as with jumps), but now forwarding hardware is needed in the ID stage
  - Computing the branch target address can be done in parallel with the RegFile read (done for all instructions, only used when needed)
  - Comparing the registers can't be done until after the RegFile read, so comparing and updating the PC adds a mux, a comparator, and an AND gate to the ID timing path
- For deeper pipelines, the branch decision point can be even later in the pipeline, incurring more stalls

ID Branch Forwarding Issues
- MEM/WB forwarding is taken care of by the normal RegFile write-before-read operation
- Need to forward from the EX/MEM pipeline stage to the ID comparison hardware for cases like the right-hand column below (in the left-hand column, $1 is already in WB):

    WB:  add3 $1,...          WB:  add3 $3,...
    MEM: add2 $3,...          MEM: add2 $1,...
    EX:  add1 $4,...          EX:  add1 $4,...
    ID:  beq $1,$2,Loop       ID:  beq $1,$2,Loop
    IF:  next_seq_instr       IF:  next_seq_instr

- Conditions (forward the result from the second previous instruction to either input of the compare; a C sketch follows):
    if (IDcontrol.Branch and (EX/MEM.RegisterRd != 0)
        and (EX/MEM.RegisterRd == IF/ID.RegisterRs)) ForwardC = 1
    if (IDcontrol.Branch and (EX/MEM.RegisterRd != 0)
        and (EX/MEM.RegisterRd == IF/ID.RegisterRt)) ForwardD = 1
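The same two conditions in the illustrative C style used earlier (ForwardC/ForwardD steer the EX/MEM result onto the ID-stage compare inputs; names are hypothetical):

    #include <stdbool.h>
    #include <stdint.h>

    static void id_branch_forward(bool id_branch, uint8_t ex_mem_rd,
                                  uint8_t if_id_rs, uint8_t if_id_rt,
                                  bool *forward_c, bool *forward_d) {
        /* Mirrors the slide: forward only for branches and a nonzero Rd. */
        bool hit = id_branch && (ex_mem_rd != 0);
        *forward_c = hit && (ex_mem_rd == if_id_rs);
        *forward_d = hit && (ex_mem_rd == if_id_rt);
    }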

ID Branch Forwarding Issues, con't
- If the instruction immediately before the branch produces one of the branch source operands, then a stall needs to be inserted (between the beq and add1), since the EX-stage operation occurs at the same time as the ID-stage branch compare:

    WB:  add3 $3,...
    MEM: add2 $4,...
    EX:  add1 $1,...
    ID:  beq $1,$2,Loop
    IF:  next_seq_instr

  - Bounce the beq (in ID) and next_seq_instr (in IF) in place (the ID Hazard Unit deasserts PC.Write and IF/ID.Write)
  - Insert a stall between the add in the EX stage and the beq in the ID stage by zeroing the control bits going into the ID/EX pipeline register (done by the ID Hazard Unit)
- If the branch is found to be taken, then flush the instruction currently in IF (IF.Flush)

Supporting ID Stage Branches (optional)
[datapath figure: the branch target adder and a register Compare unit move into the ID stage; a Branch Hazard Unit drives IF.Flush and the stall controls, and two forward units feed the compare inputs]

Delayed Branches
- If the branch hardware has been moved to the ID stage, then we can eliminate all branch stalls with delayed branches, which are defined as always executing the next sequential instruction after the branch instruction; the branch takes effect after that next instruction
  - The MIPS compiler moves an instruction that is not affected by the branch (a safe instruction) into the slot immediately after the branch, thereby hiding the branch delay
- With deeper pipelines, the branch delay grows, requiring more than one delay slot
  - Delayed branches have lost popularity compared to more expensive but more flexible (dynamic) hardware branch prediction
  - Growth in available transistors has made hardware branch prediction relatively cheaper

Scheduling Branch Delay Slots
A. From before the branch:
       add $1,$2,$3
       if $2=0 then
         [delay slot]
   becomes
       if $2=0 then
         add $1,$2,$3
B. From the branch target:
       sub $4,$5,$6
       ...
       add $1,$2,$3
       if $1=0 then
         [delay slot]
   becomes
       add $1,$2,$3
       if $1=0 then
         sub $4,$5,$6
C. From the fall-through:
       add $1,$2,$3
       if $1=0 then
         [delay slot]
       sub $4,$5,$6
   becomes
       add $1,$2,$3
       if $1=0 then
         sub $4,$5,$6
- A is the best choice: it fills the delay slot and reduces IC
- In B and C, the sub instruction may need to be copied, increasing IC
- In B and C, it must be okay to execute sub when the branch fails

Static Branch Prediction
- Resolve branch hazards by assuming a given outcome and proceeding without waiting to see the actual branch outcome
1. Predict not taken: always predict branches will not be taken and continue to fetch from the sequential instruction stream; the pipeline stalls only when a branch is taken
  - If taken, flush the instructions after the branch (earlier in the pipeline):
    - in the IF, ID, and EX stages if the branch logic is in MEM: three stalls
    - in the IF and ID stages if the branch logic is in EX: two stalls
    - in the IF stage if the branch logic is in ID: one stall
  - Ensure that those flushed instructions haven't changed the machine state: automatic in the MIPS pipeline, since state-changing operations are at the tail end of the pipeline (MemWrite in MEM, RegWrite in WB)
  - Restart the pipeline at the branch destination

Flushing with Misprediction (Not Taken)
[pipeline diagram, instruction addresses on the left:
     4: beq $1,$2,2
     8: sub $4,$1,$5   (flushed)
    16: and $6,$1,$7   (branch target)
    20: or  $8,$1,$9]
- To flush the IF-stage instruction, assert IF.Flush to zero the instruction field of the IF/ID pipeline register (transforming it into a nop)

Branching Structures
- Predict-not-taken works well for top-of-the-loop branching structures:
    Loop: beq $1,$2,Out
          1st loop instr
          ...
          last loop instr
          j Loop
    Out:  fall out instr
  - But such loops have a jump at the bottom of the loop to return to the top, and so incur the jump stall overhead
- Predict-not-taken doesn't work well for bottom-of-the-loop branching structures:
    Loop: 1st loop instr
          2nd loop instr
          ...
          last loop instr
          bne $1,$2,Loop
          fall out instr

Static Branch Prediction, con't
- Resolve branch hazards by assuming a given outcome and proceeding
2. Predict taken: predict that branches will always be taken
  - Predict taken always incurs one stall cycle (if the branch destination hardware has been moved to the ID stage)
  - Is there a way to cache the address of the branch target instruction?
- As the branch penalty increases (for deeper pipelines), a simple static prediction scheme will hurt performance; with more hardware, it is possible to predict branch behavior dynamically during program execution
3. Dynamic branch prediction: predict branches at run time using run-time information

Dynamic Branch Prediction
- A branch prediction buffer (a.k.a. branch history table, BHT) in the IF stage, addressed by the lower bits of the PC, contains bit(s) passed to the ID stage through the IF/ID pipeline register that tell whether the branch was taken the last time it was executed (a lookup/update sketch follows)
  - The prediction bit may predict incorrectly (it may be a wrong prediction for this branch this iteration, or it may come from a different branch with the same low-order PC bits), but this doesn't affect correctness, just performance
  - The branch decision occurs in the ID stage after determining that the fetched instruction is a branch and checking the prediction bit(s)
  - If the prediction is wrong, flush the incorrect instruction(s) in the pipeline, restart the pipeline with the right instruction, and invert the prediction bit(s)
  - A 4096-bit BHT varies from 1% misprediction (nasa7, tomcatv) to 18% (eqntott)
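A minimal C sketch of the 1-bit BHT lookup and update just described (sizes and names are illustrative; 4096 entries echoes the 4096-bit table quoted above):

    #include <stdbool.h>
    #include <stdint.h>

    #define BHT_ENTRIES 4096
    static bool bht[BHT_ENTRIES];       /* one bit per entry: taken last time? */

    static unsigned bht_index(uint32_t pc) {
        return (pc >> 2) & (BHT_ENTRIES - 1);   /* low-order word-address bits */
    }

    /* IF stage: predict.  ID stage: on a mispredict, flush and invert the bit. */
    static bool bht_predict(uint32_t pc)             { return bht[bht_index(pc)]; }
    static void bht_update (uint32_t pc, bool taken) { bht[bht_index(pc)] = taken; }

Two branches whose PCs share the same low-order bits share an entry; that aliasing is exactly what the slide notes can hurt accuracy but not correctness.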

Branch Target Buffer
- The BHT predicts when a branch is taken, but does not tell where it's taken to!
  - A branch target buffer (BTB) in the IF stage can cache the branch target address, but we also need to fetch the next sequential instruction; the prediction bit in IF/ID selects which next instruction will be loaded into IF/ID at the next clock edge
    - Would need a two-read-port instruction memory
  - Or the BTB can cache the branch-taken instruction while the instruction memory is fetching the next sequential instruction
  [figure: the PC indexes both the BTB and the instruction memory]
- If the prediction is correct, stalls can be avoided no matter which direction the branches go

1-bit Prediction Accuracy
- A 1-bit predictor will be incorrect twice when not taken
- Assume predict_bit = 0 to start (indicating branch not taken) and that loop control is at the bottom of the loop code:
  1. First time through the loop, the predictor mispredicts the branch, since the branch is taken back to the top of the loop; invert the prediction bit (predict_bit = 1)
  2. As long as the branch is taken (looping), the prediction is correct
  3. Exiting the loop, the predictor again mispredicts the branch, since this time the branch is not taken (falling out of the loop); invert the prediction bit (predict_bit = 0)

    Loop: 1st loop instr
          2nd loop instr
          ...
          last loop instr
          bne $1,$2,Loop
          fall out instr

- For 10 times through the loop we have 80% prediction accuracy for a branch that is taken 90% of the time (arithmetic below)
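A quick arithmetic check of that claim (my numbers, matching the slide's scenario): the two mispredictions are the first execution (predicted not taken, actually taken) and the loop-exit execution (predicted taken, actually not taken):

    \text{accuracy} = \frac{10 - 2}{10} = 80\%, \qquad \text{taken rate} = \frac{9}{10} = 90\%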

2-bit Predictors
- A 2-bit scheme can give 90% accuracy, since a prediction must be wrong twice before the prediction bit is changed (right 9 times out of 10 for the loop below; a C sketch follows)
[FSM figure: states 11 and 10 predict taken, 01 and 00 predict not taken; taken/not-taken outcomes move the state between them. Annotations: right 9 times, wrong on loop fall-out, right on the 1st iteration]

    Loop: 1st loop instr
          2nd loop instr
          ...
          last loop instr
          bne $1,$2,Loop
          fall out instr

- The BHT also stores the initial FSM state
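A minimal C sketch of one 2-bit entry, written as the common saturating-counter variant (the slide's FSM may differ in one transition; names are illustrative; a real BHT would hold an array of these indexed by low-order PC bits):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint8_t bht2_t;   /* 0..3: 11 and 10 predict taken, 01 and 00 not taken */

    static bool predict_taken(bht2_t s) { return s >= 2; }

    /* Move toward 3 on a taken branch, toward 0 on a not-taken branch,
       so a single surprise cannot flip the prediction. */
    static bht2_t bht2_update(bht2_t s, bool taken) {
        if (taken) return (s < 3) ? s + 1 : 3;
        else       return (s > 0) ? s - 1 : 0;
    }

For the loop above, the state sits at 11 while looping, mispredicts once on fall-out (dropping to 10), and still predicts taken when the loop is re-entered.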

Dealing with Exceptions
- Exceptions (a.k.a. interrupts) are just another form of control hazard. Exceptions arise from:
  - R-type arithmetic overflow
  - Trying to execute an undefined instruction
  - An I/O device request
  - An OS service request (e.g., a page fault, TLB exception)
  - A hardware malfunction
- The pipeline has to stop executing the offending instruction in midstream, let all prior instructions complete, flush all following instructions, set a register to show the cause of the exception, save the address of the offending instruction, and then jump to a prearranged address (the address of the exception handler code)
- The software (OS) looks at the cause of the exception and deals with it

Two Types of Exceptions
- Interrupts: asynchronous to program execution
  - Caused by external events
  - May be handled between instructions, so the instructions currently active in the pipeline can complete before control passes to the OS interrupt handler
  - Simply suspend and resume the user program
- Traps (exceptions): synchronous to program execution
  - Caused by internal events
  - The condition must be remedied by the trap handler for that instruction, so the offending instruction must be stopped midstream in the pipeline and control passed to the OS trap handler
  - The offending instruction may be retried (or simulated by the OS) and the program may continue, or it may be aborted

Where in the Pipeline Exceptions Occur

    Exception                Stage(s)   Synchronous?
    Arithmetic overflow      EX         yes
    Undefined instruction    ID         yes
    TLB or page fault        IF, MEM    yes
    I/O service request      any        no
    Hardware malfunction     any        no

- Beware that multiple exceptions can occur simultaneously in a single clock cycle

Multiple Simultaneous Exceptions
[pipeline diagram: in a single cycle, one instruction can take a D$ page fault, another an arithmetic overflow, another an undefined instruction, and another an I$ page fault]
- Hardware sorts the exceptions so that the earliest instruction is the one interrupted first

Additions to MIPS to Handle Exceptions (optional)
- Cause register (records exceptions): hardware to record the exception in Cause, plus a signal to control writes to it (CauseWrite)
- EPC register (records the addresses of the offending instructions): hardware to record in EPC the address of the offending instruction, plus a signal to control writes to it (EPCWrite)
  - Exception software must match the exception to the instruction
- A way to load the PC with the address of the exception handler
  - Expand the PC input mux with a new input hardwired to the exception handler address (e.g., 8000 0180 hex for arithmetic overflow)
- A way to flush the offending instruction and the ones that follow it (a rough C sketch follows)
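A rough C sketch of the hardware response just listed (register and constant names follow the slide; the ExcCode values are the usual MIPS Cause encodings; flushing is abbreviated to a comment):

    #include <stdint.h>

    #define EXC_HANDLER_ADDR 0x80000180u      /* handler address from the slide */
    enum { CAUSE_OVF = 12, CAUSE_RI = 10 };   /* overflow, reserved instruction  */

    typedef struct { uint32_t pc, epc, cause; /* ...pipeline regs omitted... */ } cpu_t;

    static void raise_exception(cpu_t *s, uint32_t offending_pc, uint32_t cause) {
        s->cause = cause;            /* CauseWrite asserted                    */
        s->epc   = offending_pc;     /* EPCWrite asserted                      */
        /* IF.Flush / ID.Flush / EX.Flush: zero the younger instructions' bits */
        s->pc    = EXC_HANDLER_ADDR; /* extra, hardwired input on the PC mux   */
    }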

Datapath with Controls for Exceptions (optional)
[datapath figure: Cause and EPC registers are added near the EX stage, the PC mux gains a hardwired input of 8000 0180 hex, and IF.Flush, ID.Flush, and EX.Flush zero out the offending instruction and the ones that follow it]

Summary
- All modern-day processors use pipelining for performance (a CPI of 1 and a fast CC)
- The pipeline clock rate is limited by the slowest pipeline stage, so designing a balanced pipeline is important
- Must detect and resolve hazards
  - Structural hazards: resolved by designing the pipeline correctly
  - Data hazards:
    - Stall (impacts CPI)
    - Forward (requires hardware support)
  - Control hazards: put the branch decision hardware in as early a pipeline stage as possible
    - Stall (impacts CPI)
    - Delay the decision (requires compiler support)
    - Static and dynamic prediction (requires hardware support)
- Pipelining complicates exception handling

Clocking Methodologies
- A clocking methodology defines when signals can be read and when they can be written
  [clock waveform: rising (positive) edge, falling (negative) edge, clock cycle]
  - Clock rate = 1/(clock cycle); e.g., a 10 ns clock cycle = 100 MHz clock rate, a 1 ns clock cycle = 1 GHz clock rate
- State element design choices:
  - Level-sensitive latch
  - Master-slave and edge-triggered flip-flops

Review: Latches vs. Flip-Flops
- Output is equal to the stored value inside the element
- Change of state (value) is based on the clock
  - Latches: output changes whenever the inputs change and the clock is asserted (level-sensitive methodology)
    - Two-sided timing constraint
  - Flip-flops: output changes only on a clock edge (edge-triggered methodology)
    - One-sided timing constraint
- A clocking methodology defines when signals can be read and written: we would NOT want to read a signal at the same time it was being written

Review: Design a Latch
- Store one bit of information: cross-coupled inverters
- How to change the value stored? SR latch with R (reset) and S (set) signals; other latch structures exist

Review: Design a Flip-Flop
- Based on gated latches
- Master-slave positive-edge-triggered D flip-flop

Review: Latch and Flip-Flop
- A latch is level-sensitive
- A flip-flop is edge-triggered

Our Implementation
- An edge-triggered methodology
- Typical execution:
  - Read the contents of some state elements
  - Send the values through some combinational logic
  - Write the results to one or more state elements
  [figure: State element 1 -> combinational logic -> State element 2, all driven by the clock; one clock cycle per pass]
- Assumes state elements are written on every clock cycle; if not, an explicit write control signal is needed: the write occurs only when both the write control is asserted and the clock edge occurs