Hakim Weatherspoon CS 3410 Computer Science Cornell University


Hakim Weatherspoon CS 3410 Computer Science Cornell University The slides are the product of many rounds of teaching CS 3410 by Professors Weatherspoon, Bala, Bracy, McKee, and Sirer.

[Single-cycle datapath figure: PC, +4, instruction memory, register file, control, immediate extend, ALU, branch compare (=?), data memory (addr, d_in, d_out), new-PC selection.]

Single-cycle processor. Advantages: one cycle per instruction keeps the logic and the clocking simple. Disadvantages: since instructions take different amounts of time to finish, memory and the functional units are not efficiently utilized, and the cycle time is set by the longest delay (the load instruction). The best possible CPI is 1 (actually < 1 with parallelism), but the MIPS rating is lower and the clock period longer (lower clock frequency); hence, lower performance.

Multi-cycle processor. Advantages: better MIPS rating and a smaller clock period (higher clock frequency); hence, better performance than the single-cycle processor. Disadvantages: higher CPI than the single-cycle processor. Pipelining: we want better performance, i.e. a small CPI (close to 1) together with a high MIPS rating and a short clock period (high clock frequency).

Parallelism Pipelining Both!

Alice, Bob. They don't always get along.

Saw Drill Glue Paint

N pieces, each built following same sequence: Saw Drill Glue Paint

Alice owns the room Bob can enter when Alice is finished Repeat for remaining tasks No possibility for conflicts

Timeline (hours 1-8), Alice then Bob. Latency: 4 hours/task. Throughput: 1 task/4 hrs. Concurrency: 1. Total elapsed time: 4*N hours. CPI = 4. Can we do better?

Partition room into stages of a pipeline Dave Carol Bob Alice One person owns a stage at a time 4 stages 4 people working simultaneously Everyone moves right in lockstep

It still takes all four stages for one job to complete.

Timeline (hours 1-7). Latency: 4 hrs/task. Throughput: 1 task/hr. Concurrency: 4. CPI = 1.

Time 1 2 3 4 5 6 7 8 9 10 What if drilling takes twice as long, but gluing and paint take ½ as long? Latency: Throughput: CPI =

Timeline (cycles 1-10): first task done after 4 cycles, second after 6 cycles, third after 8 cycles. What if drilling takes twice as long, but gluing and painting take ½ as long? Latency: 4 cycles/task. Throughput: 1 task/2 cycles. CPI = 2.

Principle: Throughput increased by parallel execution Balanced pipeline very important Else slowest stage dominates performance Pipelining: Identify pipeline stages Isolate stages from each other Resolve pipeline hazards (next lecture)

Single Cycle vs Pipelined Processor

Single-cycle: insn0.fetch, dec, exec, then insn1.fetch, dec, exec.
Pipelined: insn0.fetch; insn0.dec overlapped with insn1.fetch; insn0.exec overlapped with insn1.dec; then insn1.exec.

5 stage Pipeline Implementation Working Example Hazards Structural Data Hazards Control Hazards 24

Review: single-cycle processor. [Single-cycle datapath figure: PC, +4, instruction memory, register file, control, immediate extend, ALU, branch compare, data memory.]

[Same datapath with the five phases labeled: Instruction Fetch, Instruction Decode, Execute, Memory, Write Back.]

[Datapath redrawn with stage boundaries: Fetch, Decode, Execute, Memory, WB.]

[Pipelined datapath figure: Instruction Fetch, Instruction Decode, Execute, Memory, Write Back, separated by the IF/ID, ID/EX, EX/MEM, MEM/WB pipeline registers; each register latches data values (PC+4, A, B, imm, D, M) plus ctrl bits for the later stages.]

Cycle:  1  2  3  4   5   6   7   8   9
add     IF ID EX MEM WB
nand       IF ID EX  MEM WB
lw            IF ID  EX  MEM WB
add              IF  ID  EX  MEM WB
sw                   IF  ID  EX  MEM WB
Latency: 5 cycles. Throughput: 1 insn/cycle. Concurrency: 5. CPI = 1

Break the datapath into multiple cycles (here 5). Parallel execution increases throughput. A balanced pipeline is very important: the slowest stage determines the clock rate, and imbalance kills performance. Add pipeline registers (flip-flops) for isolation: each stage begins by reading values from its input latch and ends by writing values to its output latch. Resolve hazards.
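As a rough illustration (my own sketch, not from the slides): in an ideal, hazard-free 5-stage pipeline, N instructions take N + 4 cycles, so CPI approaches 1 and the speedup over a single-cycle design (whose clock is roughly five times longer) approaches the number of stages.

/* Illustrative sketch (not course code): ideal 5-stage pipeline cycle counts. */
#include <stdio.h>

int main(void) {
    const int stages = 5;
    const long n = 1000;                          /* instructions */

    long pipelined_cycles   = n + (stages - 1);   /* fill the pipe, then 1 insn/cycle */
    long single_cycle_equiv = n * stages;         /* same work measured in short cycles */

    printf("pipelined: %ld cycles, CPI = %.3f\n",
           pipelined_cycles, (double)pipelined_cycles / n);
    printf("ideal speedup over single-cycle: %.2fx\n",
           (double)single_cycle_equiv / pipelined_cycles);
    return 0;
}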


Stage     | Perform functionality                                                                                           | Latch values of interest
Fetch     | Use PC to index program memory; increment PC                                                                    | Instruction bits (to be decoded); PC+4 (to compute branch targets)
Decode    | Decode instruction, generate control signals, read register file                                                | Control information, Rd index, immediates, offsets, register values (Ra, Rb), PC+4 (to compute branch targets)
Execute   | Perform ALU operation; compute targets (PC+4+offset, etc.) in case this is a branch; decide if branch is taken | Control information, Rd index, etc.; result of ALU operation; value in case this is a store instruction
Memory    | Perform load/store if needed (address is the ALU result)                                                        | Control information, Rd index, etc.; result of load; pass result from execute
Writeback | Select value, write to register file                                                                             | —
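As a rough sketch of what each pipeline register latches (field names are my own, not the course's reference design), the table above maps onto structures like these:

/* Rough sketch of the four pipeline registers; field names are illustrative. */
#include <stdint.h>

typedef struct {                 /* IF/ID */
    uint32_t insn;               /* instruction bits, to be decoded          */
    uint32_t pc_plus_4;          /* for branch-target computation            */
} IF_ID;

typedef struct {                 /* ID/EX */
    uint32_t a, b;               /* register values Ra, Rb                   */
    uint32_t imm;                /* sign-extended immediate / offset         */
    uint32_t pc_plus_4;
    uint8_t  rd;                 /* destination register index               */
    uint16_t ctrl;               /* control signals for later stages         */
} ID_EX;

typedef struct {                 /* EX/MEM */
    uint32_t alu_result;         /* also the memory address                  */
    uint32_t b;                  /* value to store, if this is a store       */
    uint8_t  rd;
    uint16_t ctrl;
} EX_MEM;

typedef struct {                 /* MEM/WB */
    uint32_t mem_data;           /* result of a load                         */
    uint32_t alu_result;         /* passed through from execute              */
    uint8_t  rd;
    uint16_t ctrl;
} MEM_WB;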

Stage 1: Instruction Fetch Fetch a new instruction every cycle Current PC is index to instruction memory Increment the PC at end of cycle (assume no branches for now) Write values of interest to pipeline register (IF/ID) Instruction bits (for later decoding) PC+4 (for later computing branch targets)

[Fetch datapath figure: instruction memory addressed by PC; PC incremented by 4. The next-PC mux (pc_sel) chooses among PC+4, pc_rel (PC-relative, e.g. BEQ, BNE), pc_abs (absolute, e.g. J and JAL: (PC+4)[31..28] concatenated with target and 00), and pc_reg (register, e.g. JR). The instruction bits and PC+4 are written into the IF/ID register for the rest of the pipeline.]

Stage 2: Instruction Decode On every cycle: Read IF/ID pipeline register to get instruction bits Decode instruction, generate control signals Read from register file Write values of interest to pipeline register (ID/EX) Control information, Rd index, immediates, offsets, Contents of Ra, Rb PC+4 (for computing branch targets later)

[Decode datapath figure: the IF/ID register feeds the instruction to the decoder and the register file (read ports Ra, Rb; write port Rd/D with write enable WE); the immediate is sign-extended; the decode result, destination, A, B, imm, and PC+4 are written into the ID/EX register along with ctrl bits for the rest of the pipeline.]

Stage 3: Execute On every cycle: Read ID/EX pipeline register to get values and control bits Perform ALU operation Compute targets (PC+4+offset, etc.) in case this is a branch Decide if jump/branch should be taken Write values of interest to pipeline register (EX/MEM) Control information, Rd index, Result of ALU operation Value in case this is a memory store instruction

[Execute datapath figure: the ID/EX register supplies A, B, imm, and PC+4 to the ALU and to the branch/jump target adder; pc_sel chooses among pc_reg, pc_rel, and pc_abs if the branch is taken; the ALU result (D), B, and ctrl bits are written into the EX/MEM register.]

Stage 4: Memory On every cycle: Read EX/MEM pipeline register to get values and control bits Perform memory load/store if needed address is ALU result Write values of interest to pipeline register (MEM/WB) Control information, Rd index, Result of memory operation Pass result of ALU operation

[Memory datapath figure: the EX/MEM register supplies the address (ALU result D) and the store value (B) to data memory (addr, d_in, d_out); the branch decision and target feed back to the Fetch-stage pc_sel mux; the memory result (M), D, and ctrl bits are written into the MEM/WB register.]

Stage 5: Write back On every cycle: Read MEM/WB pipeline register to get values and control bits Select value and write to register file

[Write-back datapath figure: the MEM/WB register supplies M and D; a mux selects the result and the dest index for the register-file write.]

[Full pipelined datapath figure: instruction memory, +4, register file, ALU, data memory, with the IF/ID, ID/EX, EX/MEM, MEM/WB registers carrying PC+4, inst, OP, Rt, Rd, imm, A, B, D, and M between stages.]

Consider a non pipelined processor with clock period C (e.g., 50 ns). If you divide the processor into N stages (e.g., 5), your new clock period will be: A. C B. N C. less than C/N D. C/N E. greater than C/N 50

Consider a non-pipelined processor with clock period C (e.g., 50 ns). If you divide the processor into N stages (e.g., 5), your new clock period will be: A. C B. N C. less than C/N D. C/N E. greater than C/N. Answer: E; the stages cannot be perfectly balanced and every stage adds pipeline-register overhead, so the clock period ends up greater than C/N.
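A back-of-the-envelope sketch of why the answer is greater than C/N; the 1 ns register overhead and the 25% slowest-stage share below are assumed numbers for illustration, not values from the slides:

/* Back-of-the-envelope sketch: splitting a 50 ns datapath into 5 stages.
 * The register overhead and the stage imbalance are assumed for illustration. */
#include <stdio.h>

int main(void) {
    double C = 50.0;                       /* original clock period, ns */
    int    N = 5;
    double reg_overhead = 1.0;             /* pipeline-register delay, ns (assumed) */
    double slowest_fraction = 0.25;        /* slowest stage gets 25% of the logic (assumed) */

    double ideal     = C / N;                                /* 10.0 ns */
    double realistic = C * slowest_fraction + reg_overhead;  /* 13.5 ns */

    printf("ideal C/N       = %.1f ns\n", ideal);
    printf("realistic clock = %.1f ns (> C/N)\n", realistic);
    return 0;
}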

Pipelining is a powerful technique to mask latencies and increase throughput. Logically, instructions execute one at a time; physically, instructions execute in parallel (instruction-level parallelism). Abstraction promotes decoupling: interface (ISA) vs. implementation (pipeline).

Instructions are the same length (32 bits), so they are easy to fetch and then decode. There are only 3 instruction formats, so it is easy to route bits between stages and a register source can be read before even knowing what the instruction is. Memory is accessed through lw and sw only, and memory is accessed after the ALU.

5 stage Pipeline Implementation Working Example Hazards Structural Data Hazards Control Hazards 54

add  r3 r1 r2
nand r6 r4 r5
lw   r4 20(r2)
add  r5 r2 r5
sw   r7 12(r3)
Assume an 8-register machine.

[Cycle-by-cycle datapath snapshots for the working example. The datapath shows the register file (R0-R7), the sign extender, the ALU, data memory, and the IF/ID, ID/EX, EX/MEM, MEM/WB pipeline registers with the values they latch each cycle.
Time 1: fetch add 3 1 2
Time 2: fetch nand 6 4 5
Time 3: fetch lw 4 20(2)
Time 4: fetch add 5 2 5
Time 5: fetch sw 7 12(3)
Times 6-9: no more instructions are fetched; the five instructions drain through the pipeline. The register file starts as R1=36, R2=9, R3=12, R4=18, R5=7, R6=41, R7=22 and ends as R3=45, R6=3, R4=99 (the loaded value), R5=16, with the sw writing r7 (22) to memory address 57.
Slides thanks to Sally McKee.]

Pipelining is great because: A. You can fetch and decode the same instruction at the same time. B. You can fetch two instructions at the same time. C. You can fetch one instruction while decoding another. D. Instructions only need to visit the pipeline stages that they require. E. C and D 67

Pipelining is great because: A. You can fetch and decode the same instruction at the same time. B. You can fetch two instructions at the same time. C. You can fetch one instruction while decoding another. D. Instructions only need to visit the pipeline stages that they require. E. C and D. Answer: C; in this pipeline every instruction passes through all five stages, so D does not hold.


5 stage Pipeline Implementation Working Example Hazards Structural Data Hazards Control Hazards 70

Correctness problems associated with processor design:
1. Structural hazards: the same resource is needed for different purposes at the same time (possible: ALU, register file, memory).
2. Data hazards: an instruction's output is needed before it's available.
3. Control hazards: the next instruction's PC is unknown at the time of fetch.

Dependence: a relationship between two insns. Data: two insns use the same storage location. Control: one insn affects whether another executes at all. Not a bad thing; programs would be boring otherwise. Enforced by making the older insn go before the younger one; this happens naturally in single-/multi-cycle designs, but not in a pipeline. Hazard: a dependence plus the possibility of wrong insn order. The effects of wrong insn order cannot be allowed to become externally visible. Hazards are a bad thing: most solutions either complicate the hardware or reduce performance.

Data hazards: register file (RF) reads occur in stage 2 (ID); RF writes occur in stage 5 (WB); the RF is written in the first ½ of the cycle and read in the second ½ of the cycle. x10: add r3, r1, r2; x14: sub r5, r3, r4. 1. Is there a dependence? 2. Is there a hazard? A) Yes B) No C) Cannot tell with the information given.

Answer: A) Yes for both. The sub reads r3 in ID two cycles before the add writes it back in WB.

Which of the following statements is true? A. Whether there is a data dependence between two instructions depends on the machine the program is running on. B. Whether there is a data hazard between two instructions depends on the machine the program is running on. C. Both A & B D. Neither A nor B 75

Which of the following statements is true? A. Whether there is a data dependence between two instructions depends on the machine the program is running on. B. Whether there is a data hazard between two instructions depends on the machine the program is running on. C. Both A & B D. Neither A nor B. Answer: B; a dependence is a property of the program, while a hazard depends on the pipeline the program runs on.

Clock cycle:      1  2  3  4   5   6   7   8   9
add r3, r1, r2    IF ID EX MEM WB
sub r5, r3, r4       IF ID EX  MEM WB
lw  r6, 4(r3)           IF ID  EX  MEM WB
or  r5, r3, r5             IF  ID  EX  MEM WB
sw  r6, 12(r3)                 IF  ID  EX  MEM WB

add r3, r1, r2
sub r5, r3, r4
lw  r6, 4(r3)
or  r5, r3, r5
sw  r6, 12(r3)
How many data hazards due to r3 only? A) 1 B) 2 C) 3 D) 4 E) 5

[Same pipeline diagram, with arrows drawn from the add's WB stage back to the ID stages of the instructions that read r3.] Backwards arrows require time travel.

Data hazards: register file reads occur in stage 2 (ID); register file writes occur in stage 5 (WB); the next instructions may read values about to be written, e.g. add r3, r1, r2 followed by sub r5, r3, r4. How to detect?

Detecting data hazards. [Pipelined datapath with detection logic in Decode: sub r5, r3, r4 is in ID while add r3, r1, r2 is ahead in the pipeline.] Stall if IF/ID.Ra != 0 && (IF/ID.Ra == ID/EX.Rd || IF/ID.Ra == EX/MEM.Rd || IF/ID.Ra == MEM/WB.Rd).

Data hazards: register file reads occur in stage 2 (ID); register file writes occur in stage 5 (WB); the next instructions may read values about to be written. How to detect? Logic in the ID stage:
stall = (IF/ID.Ra != 0 && (IF/ID.Ra == ID/EX.Rd || IF/ID.Ra == EX/M.Rd || IF/ID.Ra == M/WB.Rd)) (same for Rb)
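A minimal C sketch of this stall-detection logic (an illustration, not the course's reference code; field names are mine, and register index 0 never causes a hazard because it is hard-wired to zero):

/* Hedged sketch: stall detection in the ID stage. */
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint8_t ra, rb; } IFID;           /* source registers of the insn in ID */
typedef struct { uint8_t rd; } IDEX, EXMEM, MEMWB; /* destination register per stage     */

static bool matches(uint8_t src, uint8_t rd) {
    return src != 0 && src == rd;                  /* r0 never causes a hazard           */
}

bool detect_stall(IFID ifid, IDEX idex, EXMEM exmem, MEMWB memwb) {
    bool ra_hazard = matches(ifid.ra, idex.rd) || matches(ifid.ra, exmem.rd) || matches(ifid.ra, memwb.rd);
    bool rb_hazard = matches(ifid.rb, idex.rd) || matches(ifid.rb, exmem.rd) || matches(ifid.rb, memwb.rd);
    return ra_hazard || rb_hazard;                 /* stall the ID-stage instruction     */
}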

[Detecting data hazards: the hazard-detection logic sits in the Decode stage and compares the IF/ID source registers Ra and Rb against the destination register Rd held in ID/EX, EX/MEM, and MEM/WB.]

Data hazards occur when an operand (register) depends on the result of a previous instruction that may not have been computed yet. A pipelined processor needs to detect data hazards.

What to do if data hazard detected?

What to do if data hazard detected? A) Wait/Stall B) Reorder in Software (SW) C) Forward/Bypass D) All the above E) None. We will use some other method

1. Do nothing: change the ISA to match the implementation ("Hey compiler: don't create code with data hazards!"). We can do better than this.
2. Stall: pause the current and subsequent instructions until it is safe to proceed.
3. Forward/bypass: forward the data value to where it is needed (only works if the value actually exists already).

How to stall an instruction in the ID stage: prevent the IF/ID pipeline register update (this stalls the ID-stage instruction); convert the ID-stage instruction into a nop for later stages (an innocuous bubble passes through the pipeline); prevent the PC update (this stalls the next, IF-stage, instruction).

[Stalling datapath: add r3,r1,r2 / sub r5,r3,r5 / or r6,r3,r4 / add r6,r3,r8 in flight. If a hazard is detected: WE=0 on the PC and the IF/ID register, and MemWr=0, RegWr=0 for the instruction leaving ID.]

time Clock cycle 1 2 3 4 5 6 7 8 add r3, r1, r2 sub r5, r3, r5 or r6, r3, r4 add r6, r3, r8

Clock cycle:      1  2  3  4  5  6  7  8  9
add r3, r1, r2    IF ID Ex M  W              (r3 = 10 before the add, r3 = 20 after)
sub r5, r3, r5       IF ID ID ID ID Ex M  W  (3 stalls)
or  r6, r3, r4          IF IF IF IF ID Ex M
add r6, r3, r8                      IF ID Ex

[Datapath snapshots of the stall: add r3,r1,r2 proceeds down the pipeline while sub r5,r3,r5 is held in Decode (WE=0 on PC and IF/ID) and a nop (MemWr=0, RegWr=0) is inserted into Execute each stalled cycle; or r6,r3,r4 waits in Fetch. STALL CONDITION MET while IF/ID.rA != 0 && (IF/ID.rA == ID/Ex.Rd || IF/ID.rA == Ex/M.Rd || IF/ID.rA == M/W.Rd).]


How to stall an instruction in ID stage prevent IF/ID pipeline register update stalls the ID stage instruction convert ID stage instr into nop for later stages innocuous bubble passes through pipeline prevent PC update stalls the next (IF stage) instruction

Data hazards occur when an operand (register) depends on the result of a previous instruction that may not have been computed yet. A pipelined processor needs to detect data hazards. Stalling, i.e. preventing a dependent instruction from advancing, is one way to resolve data hazards. Stalling introduces NOPs ("bubbles") into the pipeline. Introduce NOPs by (1) preventing the PC from updating, (2) preventing writes to the IF/ID register, and (3) preventing writes to memory and to the register file. Bubbles in the pipeline significantly decrease performance.
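A minimal sketch of the three actions listed above as control-signal assignments (signal names are mine, loosely following the slides' WE / MemWr / RegWr labels; illustration only, not the course's reference implementation):

/* Freeze PC and IF/ID, and squash the ID-stage instruction into a nop
 * so a bubble flows down the pipeline. */
#include <stdbool.h>

struct control {
    bool pc_write;        /* allow PC to update             */
    bool ifid_write;      /* allow IF/ID register to update */
    bool id_mem_write;    /* MemWr for the insn leaving ID  */
    bool id_reg_write;    /* RegWr for the insn leaving ID  */
};

void apply_stall(struct control *c, bool stall) {
    if (stall) {
        c->pc_write     = false;   /* (1) stall the instruction in IF        */
        c->ifid_write   = false;   /* (2) stall the instruction in ID        */
        c->id_mem_write = false;   /* (3) turn the ID-stage insn into a nop: */
        c->id_reg_write = false;   /*     no memory write, no register write */
    }
}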

1. Do nothing: change the ISA to match the implementation ("Compiler: don't create code with data hazards!"). Nice try; we can do better than this.
2. Stall: pause the current and subsequent instructions until it is safe to proceed.
3. Forward/bypass: forward the data value to where it is needed (only works if the value actually exists already).

Forwarding bypasses some pipeline stages by forwarding a result directly to a dependent instruction's operand (register). Three types of forwarding/bypass: forwarding from the Ex/Mem registers to the Ex stage (M→Ex); forwarding from the Mem/WB register to the Ex stage (W→Ex); register-file bypass.

[Forwarding datapath figure: a forward unit compares Ra/Rb of the instruction in Execute against the Rd, WE, and control (MC) bits held in Ex/Mem and Mem/WB, and steers the bypassed values into the ALU inputs.]

Three types of forwarding/bypass: forwarding from the Ex/Mem registers to the Ex stage (M→Ex); forwarding from the Mem/WB register to the Ex stage (W→Ex); register-file bypass.

[M→Ex forwarding: add r3, r1, r2 followed immediately by sub r5, r3, r1.] Problem: EX needs the ALU result that is still in the MEM stage. Solution: add a bypass from EX/MEM.D to the start of EX.

[M→Ex bypass datapath: sub r5, r3, r1 in Execute, add r3, r1, r2 in Memory.] Detection logic in the Ex stage: forward = (Ex/M.WE && Ex/M.Rd != 0 && ID/Ex.Ra == Ex/M.Rd) (same for Rb).

[W→Ex forwarding: add r3, r1, r2, then sub r5, r3, r1, then or r6, r3, r4; when the or is in Execute, the add is in Write Back.] Problem: EX needs the value being written by WB. Solution: add a bypass from the final WB value to the start of EX.

[W→Ex bypass datapath: or r6, r3, r4 in Execute, sub r5, r3, r1 in Memory, add r3, r1, r2 in Write Back.] Detection logic: forward = (M/WB.WE && M/WB.Rd != 0 && ID/Ex.Ra == M/WB.Rd && not (Ex/M.WE && Ex/M.Rd != 0 && ID/Ex.Ra == Ex/M.Rd)) (same for Rb).
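Putting the two detection equations together, a forwarding mux for one ALU operand might look like this sketch (my own illustration; the if/else ordering gives EX/MEM priority over MEM/WB, which is what the "not (...)" term above expresses):

/* Sketch of a forwarding mux for one ALU operand (Ra). Names are illustrative. */
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool we; uint8_t rd; uint32_t value; } FwdSrc;

uint32_t forward_operand(uint8_t ra, uint32_t regfile_value,
                         FwdSrc exmem, FwdSrc memwb) {
    if (ra != 0 && exmem.we && exmem.rd == ra)
        return exmem.value;        /* M -> Ex forwarding (most recent result) */
    if (ra != 0 && memwb.we && memwb.rd == ra)
        return memwb.value;        /* W -> Ex forwarding                      */
    return regfile_value;          /* no hazard (or handled by RF bypass)     */
}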

[Register-file bypass: add r3, r1, r2 is in Write Back while add r6, r3, r8 is in Decode.] Problem: reading a value that is currently being written. Solution: just negate the register-file clock: writes happen at the end of the first half of each clock cycle, reads happen during the second half of each clock cycle.

Clock cycle:      1  2  3  4  5  6  7  8
add r3, r1, r2    IF ID Ex M  W
sub r5, r3, r1       IF ID Ex M  W
or  r6, r3, r4          IF ID Ex M  W
add r6, r3, r8             IF ID Ex M  W
With forwarding (M→Ex, W→Ex, and register-file bypass), no stalls are needed.

5 stage Pipeline Implementation Working Example Hazards Structural Data Hazards Control Hazards 110

time Clock cycle 1 2 3 4 5 6 7 8 add r3, r1, r2 sub r5, r3, r4 lw r6, 4(r3) or r5, r3, r5 sw r6, 12(r3) 111

[Pipeline diagram for the add / sub / lw / or / sw sequence above, with arrows showing where values are produced and where they are needed.] Backwards arrows require time travel.

[Datapath with lw r4, 20(r8) in Memory and or r6, r3, r4 in Execute.] Data dependency after a load instruction: the value is not available until after the M stage, so the next instruction cannot proceed if it is dependent. THE KILLER HAZARD.

[Load-use hazard walkthrough: lw r4, 20(r8) followed by or r6, r3, r4. The or is stalled for one cycle in Decode while a nop is inserted into Execute; the loaded value is then forwarded to the or.]
Clock cycle:      1  2  3   4  5  6  7
lw r4, 20(r8)    IF ID Ex  M  W
or r6, r3, r4       IF ID* ID Ex M  W    (one stall)

[Hazard detection combined with the forward unit.] Stall = ID/Ex.MemRead && (IF/ID.Ra == ID/Ex.Rd || IF/ID.Rb == ID/Ex.Rd)
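A small sketch of this load-use stall condition (illustration only; field names are mine, and as with the earlier stall logic, register 0 is excluded):

/* The instruction in EX is a load and the instruction in ID needs its result,
 * so stall for one cycle and then forward. */
#include <stdbool.h>
#include <stdint.h>

bool load_use_stall(bool idex_mem_read, uint8_t idex_rd,
                    uint8_t ifid_ra, uint8_t ifid_rb) {
    return idex_mem_read && idex_rd != 0 &&
           (idex_rd == ifid_ra || idex_rd == ifid_rb);
}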

[Same hazard-detection/forwarding datapath figure.] The most frequent 3410 non-solution to load-use hazards. Why is this solution so so so so so so awful?

Forwarding values directly from Memory to the Execute stage without storing them in a register first: A. Does not remove the need to stall. B. Adds one too many possible inputs to the ALU. C. Will cause the pipeline register to have the wrong value. D. Halves the frequency of the processor. E. Both A & D 121

Forwarding values directly from Memory to the Execute stage without storing them in a register first: A. Does not remove the need to stall. B. Adds one too many possible inputs to the ALU. C. Will cause the pipeline register to have the wrong value. D. Halves the frequency of the processor. E. Both A & D 122

Two MIPS solutions. MIPS 2000/3000: a load delay slot; the ISA says results of loads are not available until one cycle later, and the assembler inserts a nop or reorders instructions to fill the delay slot. MIPS 4000 onwards: stall; but really, the programmer/compiler reorders code to avoid stalling in the load delay slot.

Data hazards occur when an operand (register) depends on the result of a previous instruction that may not have been computed yet. A pipelined processor needs to detect data hazards. Stalling, i.e. preventing a dependent instruction from advancing, is one way to resolve data hazards. Stalling introduces NOPs ("bubbles") into a pipeline. Introduce NOPs by (1) preventing the PC from updating, (2) preventing writes to the IF/ID register, and (3) preventing writes to memory and to the register file. Bubbles (nops) in the pipeline significantly decrease performance. Forwarding bypasses some pipeline stages by forwarding a result directly to a dependent instruction's operand (register). Better performance than stalling.

Find all hazards, and say how they are resolved: add r3, r1, r2 nand r5, r3, r4 add r2, r6, r3 lw r6, 24(r3) sw r6, 12(r2)

Find all hazards, and say how they are resolved: add r3, r1, r2 nand r5, r3, r4 add r2, r6, r3 lw r6, 24(r3) sw r6, 12(r2) 5 Hazards

Find all hazards, and say how they are resolved: add r3, r1, r2 nand r5, r3, r4 add r2, r6, r3 lw r6, 24(r3) sw r6, 12(r2) Forwarding from Ex/M ID/Ex (M Ex) Forwarding from M/W ID/Ex (W Ex) RegisterFile (RF) Bypass Forwarding from M/W ID/Ex (W Ex) Stall + Forwarding from M/W ID/Ex (W Ex) 5 Hazards

Find all hazards, and say how they are resolved: add r3, r1, r2 sub r3, r2, r1 nand r4, r3, r1 or r0, r3, r4 xor r1, r4, r3 sb r4, 1(r0)

Find all hazards, and say how they are resolved: add r3, r1, r2 sub r3, r2, r1 nand r4, r3, r1 or r0, r3, r4 xor r1, r4, r3 sb r4, 1(r0) Hours and hours of debugging!

Delay Slot(s) Modify ISA to match implementation Stall Pause current and all subsequent instructions Forward/Bypass Try to steal correct value from elsewhere in pipeline Otherwise, fall back to stalling or require a delay slot Tradeoffs?

5 stage Pipeline Implementation Working Example Hazards Structural Data Hazards Control Hazards 131

i = 0;
do { n += 2; i++; } while (i < max);
i = 7;
n--;
Assume: i → r1, n → r2, max → r3
x10        addiu r1, r0, 0    # i = 0
x14 Loop:  addiu r2, r2, 2    # n += 2
x18        addiu r1, r1, 1    # i++
x1c        blt   r1, r3, Loop # i < max?
x20        addiu r1, r0, 7    # i = 7
x24        subi  r2, r2, 1    # n--

Control hazards: instructions are fetched in stage 1 (IF), but branch and jump decisions occur in stage 3 (EX), so the next PC is not known until 2 cycles after the branch/jump. x1c blt r1, r3, Loop; x20 addiu r1, r0, 7; x24 subi r2, r2, 1. Branch not taken? No problem! Branch taken? We have just fetched the next 2 instructions (the addiu and the subi): zap & flush.

[Datapath with branch resolution in Execute: blt r1,r3,Loop reaches Ex while addiu r1,r0,7 and subi r2,r2,1 have already been fetched. If the branch is taken: prevent the PC update, clear the IF/ID latch, and turn the two wrongly fetched instructions into NOPs while the branch continues; the new PC is the loop target (x14). Two bubbles for every taken branch? OUCH!!!]

1. Delay slot (you MUST do this): the MIPS ISA says the 1 insn after a control insn is always executed, whether the branch is taken or not.
2. Resolve the branch at Decode (some groups do this for Project 3, your choice): move the branch calculation from EX to ID. Alternative: just zap the 2nd instruction when the branch is taken.
3. Branch prediction: not in 3410, but every processor worth anything does this (no offense!).

[Datapath timing: blt r1, r3, Loop in Execute (F D X), addiu r1, r0, 7 in Decode (F D), subi r2, r2, 1 in Fetch (F); if the branch is taken, the two younger instructions are zapped.]

i = 0;
do { n += 2; i++; } while (i < max);
i = 7;
n--;
Assume: i → r1, n → r2, max → r3
x10        addiu r1, r0, 0    # i = 0
x14 Loop:  addiu r2, r2, 2    # n += 2
x18        addiu r1, r1, 1    # i++
x1c        blt   r1, r3, Loop # i < max?
x20        nop                # delay slot
x24        addiu r1, r0, 7    # i = 7
x28        subi  r2, r2, 1    # n--

[With the nop in the delay slot but the branch still resolved in Execute, the instruction fetched after the delay slot (x24 addiu r1, r0, 7) must still be zapped on a taken branch. With the branch resolved in Decode as well, the fetch after the delay slot already targets x14 (Loop: addiu r2, r2, 2), so no zapping is needed.]

With the delay slot:
x10        addiu r1, r0, 0    # i = 0
x14 Loop:  addiu r2, r2, 2    # n += 2
x18        addiu r1, r1, 1    # i++
x1c        blt   r1, r3, Loop # i < max?
x20        nop
Compiler transforms the code to fill the delay slot:
x10        addiu r1, r0, 0    # i = 0
x14 Loop:  addiu r1, r1, 1    # i++
x18        blt   r1, r3, Loop # i < max?
x1c        addiu r2, r2, 2    # n += 2

[blt r1, r3, Loop in Execute, addiu r2, r2, 2 (now in the delay slot) in Decode, Loop: addiu r1, r1, 1 being fetched at the branch target.] Note: the insn in the delay slot will always be executed, whether the branch is taken or not.

Most processors support speculative execution: guess the direction of the branch, allow instructions to move through the pipeline, and zap them later if the guess turns out to be wrong. A must for long pipelines.

Pipeline so far: guess (predict) that the branch will not be taken. We can do better! Make the prediction based on the last branch: predict taken if the last branch was taken, or predict not taken if the last branch was not taken. Need one bit to keep track of the last branch.

What is the accuracy of this 1-bit branch predictor? Wrong twice per loop: once on loop entry and once on exit. We can do better with 2 bits.
while (r3 != 0) { ...; r3--; }
Top:  BEQZ r3, End
      ...
      J Top
End:  ...

[2-bit predictor state machine: states Predict Taken 2 (PT2), Predict Taken 1 (PT1), Predict Not Taken 1 (PNT1), Predict Not Taken 2 (PNT2). A taken branch (T) moves the state toward Predict Taken 2; a not-taken branch (NT) moves it toward Predict Not Taken 2. The prediction only flips after two wrong guesses in a row.]
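A small sketch of a 2-bit saturating-counter predictor matching the state machine above (my own illustration, not course code):

#include <stdbool.h>

enum state { STRONG_NOT_TAKEN, WEAK_NOT_TAKEN, WEAK_TAKEN, STRONG_TAKEN };

static enum state s = WEAK_NOT_TAKEN;

bool predict_taken(void) {
    return s == WEAK_TAKEN || s == STRONG_TAKEN;   /* PT1 or PT2 */
}

void train(bool taken) {
    if (taken) {
        if (s != STRONG_TAKEN) s++;        /* move toward "predict taken"     */
    } else {
        if (s != STRONG_NOT_TAKEN) s--;    /* move toward "predict not taken" */
    }
}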

Control hazards: is the branch taken or not? Performance penalty: stall and flush. Reducing the cost of control hazards: move the branch decision from Ex to ID (2 nops down to 1 nop); delay slot (the compiler puts useful work in the delay slot; an ISA-level fix); branch prediction (correct: great! wrong: flush the pipeline, performance penalty).
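To see the payoff in numbers, a rough effective-CPI sketch (the 20% branch frequency, 60% taken rate, and 85% prediction accuracy are assumptions for illustration, not data from the lecture):

/* effective CPI = 1 + (branches per insn) * (flush penalty) * (fraction paying it) */
#include <stdio.h>

static double effective_cpi(double branch_freq, double penalty, double pay_fraction) {
    return 1.0 + branch_freq * penalty * pay_fraction;
}

int main(void) {
    double f = 0.20;   /* branches per instruction (assumed) */
    printf("resolve in EX, taken (60%%) flush 2: CPI = %.2f\n", effective_cpi(f, 2, 0.60));
    printf("resolve in ID, taken (60%%) flush 1: CPI = %.2f\n", effective_cpi(f, 1, 0.60));
    printf("predicted, 15%% mispredict, flush 1: CPI = %.2f\n", effective_cpi(f, 1, 0.15));
    return 0;
}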

Data hazards Control hazards Structural hazards resource contention so far: impossible because of ISA and pipeline design

Data hazards: register file reads occur in stage 2 (ID); register file writes occur in stage 5 (WB); the next instructions may read values soon to be written. Control hazards: a branch instruction may change the PC in stage 3 (EX); the next instructions have already started executing. Structural hazards: resource contention; so far impossible because of the ISA and pipeline design.

Data hazards occur when an operand (register) depends on the result of a previous instruction that may not have been computed yet. Pipelined processors need to detect data hazards. Stalling, i.e. preventing a dependent instruction from advancing, is one way to resolve data hazards. Stalling introduces NOPs ("bubbles") into a pipeline. Introduce NOPs by (1) preventing the PC from updating, (2) preventing writes to the IF/ID register, and (3) preventing writes to memory and to the register file. Nops significantly decrease performance. Forwarding bypasses some pipeline stages by forwarding a result directly to a dependent instruction's operand (register). Better performance than stalling.

Control hazards occur because the PC following a control instruction is not known until the control instruction is executed. If the branch is taken, the wrongly fetched instructions must be zapped: a 1-cycle performance penalty (with the branch resolved in Decode). Delay slots can recover performance lost to control hazards: the instruction in the delay slot will always be executed, and software (the compiler) must make use of the delay slot, putting a nop there if it cannot find a useful instruction. We can reduce the cost of a control hazard by moving the branch decision and target calculation from the Ex stage to the ID stage; with a delay slot, this removes the need to flush instructions on taken branches.