Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3.

Similar documents
Processor (II) - pipelining. Hwansoo Han

Full Datapath. Chapter 4 The Processor 2

Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN

The Processor (3) Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

CSEE 3827: Fundamentals of Computer Systems

Full Datapath. Chapter 4 The Processor 2

Pipeline Hazards. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri

3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle?

LECTURE 3: THE PROCESSOR

Lecture Topics. Announcements. Today: Data and Control Hazards (P&H ) Next: continued. Exam #1 returned. Milestone #5 (due 2/27)

Chapter 4 The Processor 1. Chapter 4B. The Processor

COMPUTER ORGANIZATION AND DESIGN

Pipelining. CSC Friday, November 6, 2015

Determined by ISA and compiler. We will examine two MIPS implementations. A simplified version A more realistic pipelined version

ECE260: Fundamentals of Computer Engineering

Computer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

Chapter 4 The Processor 1. Chapter 4A. The Processor

Chapter 4. The Processor

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

Outline Marquette University

Chapter 4. The Processor

EIE/ENE 334 Microprocessors

Chapter 4. The Processor

Chapter 4. The Processor

MIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14

Thomas Polzer Institut für Technische Informatik

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1

zhandling Data Hazards The objectives of this module are to discuss how data hazards are handled in general and also in the MIPS architecture.

Chapter 4 (Part II) Sequential Laundry

Computer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM

Computer Organization and Structure. Bing-Yu Chen National Taiwan University

Pipelined datapath Staging data. CS2504, Spring'2007 Dimitris Nikolopoulos

The Processor. Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut. CSE3666: Introduction to Computer Architecture

ECE473 Computer Architecture and Organization. Pipeline: Data Hazards

COMPUTER ORGANIZATION AND DESI

Chapter 4. The Processor. Jiang Jiang

ECE260: Fundamentals of Computer Engineering

LECTURE 9. Pipeline Hazards

Pipelined Datapath. Reading. Sections Practice Problems: 1, 3, 8, 12 (2) Lecture notes from MKP, H. H. Lee and S.

Pipelined Datapath. Reading. Sections Practice Problems: 1, 3, 8, 12

ELE 655 Microprocessor System Design

14:332:331 Pipelined Datapath

COMPUTER ORGANIZATION AND DESIGN

Chapter 4. The Processor

Midnight Laundry. IC220 Set #19: Laundry, Co-dependency, and other Hazards of Modern (Architecture) Life. Return to Chapter 4

COMP2611: Computer Organization. The Pipelined Processor

ECS 154B Computer Architecture II Spring 2009

CENG 3420 Lecture 06: Pipeline

1 Hazards COMP2611 Fall 2015 Pipelined Processor

DEE 1053 Computer Organization Lecture 6: Pipelining

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition. Chapter 4. The Processor

EE557--FALL 1999 MIDTERM 1. Closed books, closed notes

Data Hazards Compiler Scheduling Pipeline scheduling or instruction scheduling: Compiler generates code to eliminate hazard

4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3. Emil Sekerinski, McMaster University, Fall Term 2015/16

ECE 154A Introduction to. Fall 2012

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

Chapter 4. The Processor

ECE154A Introduction to Computer Architecture. Homework 4 solution

Design a MIPS Processor (2/2)

CS 251, Winter 2018, Assignment % of course mark

ECEC 355: Pipelining

ECE Exam II - Solutions November 8 th, 2017

CS 110 Computer Architecture. Pipelining. Guest Lecture: Shu Yin. School of Information Science and Technology SIST

COSC 6385 Computer Architecture - Pipelining

SI232 Set #20: Laundry, Co-dependency, and other Hazards of Modern (Architecture) Life. Chapter 6 ADMIN. Reading for Chapter 6: 6.1,

5 th Edition. The Processor We will examine two MIPS implementations A simplified version A more realistic pipelined version

CS 251, Winter 2019, Assignment % of course mark

CSCI 402: Computer Architectures. Fengguang Song Department of Computer & Information Science IUPUI. Today s Content

Lecture 2: Processor and Pipelining 1

Pipelining. Pipeline performance

Instruction word R0 R1 R2 R3 R4 R5 R6 R8 R12 R31

CS 61C: Great Ideas in Computer Architecture Pipelining and Hazards

Full Datapath. CSCI 402: Computer Architectures. The Processor (2) 3/21/19. Fengguang Song Department of Computer & Information Science IUPUI

Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining

Pipeline Data Hazards. Dealing With Data Hazards

Pipelining. Ideal speedup is number of stages in the pipeline. Do we achieve this? 2. Improve performance by increasing instruction throughput ...

Computer Architecture. Lecture 6.1: Fundamentals of

CS/CoE 1541 Exam 1 (Spring 2019).

CS 2506 Computer Organization II Test 2. Do not start the test until instructed to do so! printed

Basic Instruction Timings. Pipelining 1. How long would it take to execute the following sequence of instructions?

What is Pipelining? Time per instruction on unpipelined machine Number of pipe stages

COSC121: Computer Systems. ISA and Performance

Lecture 05: Pipelining: Basic/ Intermediate Concepts and Implementation

Advanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017

(Basic) Processor Pipeline

Computer Organization and Structure

MIPS An ISA for Pipelining

CS 61C: Great Ideas in Computer Architecture. Lecture 13: Pipelining. Krste Asanović & Randy Katz

ECE260: Fundamentals of Computer Engineering

Pipelining. Maurizio Palesi

Unresolved data hazards. CS2504, Spring'2007 Dimitris Nikolopoulos

Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Lecture 9. Pipeline Hazards. Christos Kozyrakis Stanford University

CS232 Final Exam May 5, 2001

Chapter 4. The Processor

Transcription:

Pipelining Analogy Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =2n/05n+15 2n/0.5n 1.5 4 = number of stages 4.5 An Overview of Pipelining 37 Chapter 4 The Processor 1

MIPS Pipeline Five stages, one step per stage 1. IF: Instruction Fetch from memory 2. ID: Instruction Decode & register read 3. EX: Execute operation or calculate address 4. MEM: Access MEMory operand 5. WB: Write result Back to register 38 Chapter 4 The Processor 2

Pipeline Performance Assume time for stages is 100ps for register read or write 200ps for other stages Compare pipelined datapath with single-cycle datapath Instr Instr fetch Register read ALU op Memory access Register write Total time lw 200ps 100 ps 200ps 200ps 100 ps 800ps sw 200ps 100 ps 200ps 200ps 700ps R-format 200ps 100 ps 200ps 100 ps 600ps beq 200ps 100 ps 200ps 500ps 39 Chapter 4 The Processor 3

Pipeline Performance Single-cycle (T c = 800ps) Pipelined (T c = 200ps) 40 Chapter 4 The Processor 4

Pipeline Speedup If all stages are balanced i.e., all take the same time Time between instructions pipelined = Time between instructions nonpipelined Number of stages If not balanced, speedup is less Speedup due to increased throughput Latency (time for each instruction) does not decrease 41 Chapter 4 The Processor 5

MIPS Pipeline Stages (review) Instruction Fetch (IF): The instruction is fetched from memory and placed in the instruction register (IR). Instruction decode (ID): The bits of the instruction are decoded into control signals. Operands are moved from registers or immediate fields to working registers. For branch instructions, the branch condition is tested and the branch address computed. Execution (EX): The instruction is executed. Specifically, if the instruction is an arithmetic or logical operation, its results are computed. If it is a load-store instruction, the address is computed. All this is done by an elaborate logic circuit called the arithmeticlogical unit (ALU). MEMory read/write (MEM): If the instruction is a load-store, the memory is read or written. Write Back (WB): The results of the operation are written to the destination register. 42 Chapter 4 The Processor 6

Pipelining and ISA Design MIPS ISA designed for pipelining All instructions are 32-bits Easier to fetch and decode in one cycle c.f. x86: 1- to 17-byte instructions Few and regular instruction formats Can decode and read registers in one step Load/store addressing Can calculate address in 3 rd stage, access memory in 4 th stage Alignment of memory operands Memory access takes only one cycle 43 Chapter 4 The Processor 7

Hazards Situations that prevent starting the next instruction ti in the next cycle Structure hazards A required resource is busy Data hazard Need to wait for previous instruction to complete its data read/write Control hazard Deciding on control action depends d on previous instruction 44 Chapter 4 The Processor 8

Structure Hazards Conflict for use of a resource Execution of Instruction combination in one clock cycle is not supported. (Combine Washer + Dryer) In MIPS pipeline with a single memory Load/store requires data access Instruction fetch would have to stall for that cycle Would cause a pipeline bubble Hence, pipelined datapaths require separate instruction/data memories Or separate instruction/data caches 45 Chapter 4 The Processor 9

Structure Hazards lw $4, 400($0) Single memory for data and program? 46 Chapter 4 The Processor 10

Data Hazards An instruction depends on completion of data access by a previous instruction add $s0, $t0, $t1 sub $t2, $s0, $t3 47 Chapter 4 The Processor 11

Forwarding (aka Bypassing) Use result when it is computed Don t wait for it to be stored in a register Requires extra connections in the datapath 48 Chapter 4 The Processor 12

Load-Use Data Hazard Can t always avoid stalls by forwarding If value not computed when needed Can t forward backward in time! 49 Chapter 4 The Processor 13

Code Scheduling to Avoid Stalls Reorder code to avoid use of load result in the next instruction C code for A = B + E; C = B + F; lw $t1, 0($t0) lw $t1, 0($t0) lw $t2, 4($t0) lw $t2, 4($t0) stall add $t3, $t1, $t2 lw $t4, 8($t0) sw $t3, 12($t0) add $t3, $t1, $t2 lw $t4, 8($t0) sw $t3, 12($t0) stall add $t5, $t1, $t4 add $t5, $t1, $t4 sw $t5, 16($t0) sw $t5, 16($t0) 13 cycles 11 cycles 50 Chapter 4 The Processor 14

Control/Branch Hazards Branch determines flow of control Fetching next instruction depends on branch outcome Pipeline can t always fetch correct instruction Still working on ID stage of branch In MIPS pipeline Need to compare registers and compute target early in the pipeline Add hardware to do it in ID stage 51 Chapter 4 The Processor 15

Stall on Branch Wait until branch outcome determined before fetching next instruction 52 Chapter 4 The Processor 16

Branch Prediction Longer pipelines can t readily determine branch outcome early Stall penalty becomes unacceptable Predict outcome of branch Only stall if prediction is wrong In MIPS pipeline Can predict branches not taken Fetch instruction after branch, with no delay 53 Chapter 4 The Processor 17

MIPS with Predict Not Taken Prediction correct Prediction incorrect 54 Chapter 4 The Processor 18

More-Realistic Branch Prediction Static branch prediction Based on typical branch behavior Example: loop and if-statement branches Predict backward branches taken Predict forward branches not taken Dynamic branch prediction Hardware measures actual branch behavior e.g., record recent history of each branch Assume future behavior will continue the trend When wrong, stall while re-fetching, and update history 55 Chapter 4 The Processor 19

Pipeline Summary The BIG Picture Pipelining improves performance by increasing instruction throughput Executes multiple instructions in parallel Each instruction has the same latency Subject to hazards Structure, data, control Instruction set design affects complexity of pipeline implementation 56 Chapter 4 The Processor 20

MIPS Pipelined Datapath 4.6 Pipelined Datapath and Control MEM Right-to-left t flow leads to hazards WB 57 Chapter 4 The Processor 21

Pipeline Instruction Execution Single cycle datapath Chapter 4 The Processor 58 Chapter 4 The Processor 22

Pipeline registers Need registers between stages To hold information produced in previous cycle 59 64, 128, 97, and 64 Chapter 4 The Processor 23

Pipeline Operation Cycle-by-cycle flow of instructions through the pipelined datapath Single-clock-cycle pipeline diagram Shows pipeline usage in a single cycle Highlight resources used c.f. multi-clock-cycle diagram Graph of operation over time We ll look at single-clock-cycle diagrams for load & store 60 64, 128, 97, and 64 Chapter 4 The Processor 24

IF for Load, Store, Right Half Shaded: Read Left Half Shaded: Write 61 64, 128, 97, and 64 Chapter 4 The Processor 25

ID for Load, Store, IF/ID =16-bit immediate field the register numbers to read the two registers. All three values are stored in the ID/EX pipeline register, along with the 62 incremented PC address. 64, 128, 97, and 64 Chapter 4 The Processor 26

EX for Load EX/MEM=contents of register 1 [from ID/EX ] + the sign-extended immediate 63 64, 128, 97, and 64 Chapter 4 The Processor 27

MEM for Load MEM/WB = is loaded with the contents from memory location [address EX/MEM] 64 64, 128, 97, and 64 Chapter 4 The Processor 28

WB for Load Wrong register number 65 Chapter 4 The Processor 29

Corrected Datapath for Load 66 Chapter 4 The Processor 30

EX for Store 67 Chapter 4 The Processor 31

MEM for Store 68 Chapter 4 The Processor 32

WB for Store 69 Chapter 4 The Processor 33

Multi-Cycle Pipeline Diagram Form showing resource usage 70 Chapter 4 The Processor 34

Multi-Cycle Pipeline Diagram Traditional form 71 Chapter 4 The Processor 35

Single-Cycle Pipeline Diagram State of pipeline in a given cycle 72 Chapter 4 The Processor 36

Pipelined Control (Simplified) 73 Chapter 4 The Processor 37

Pipelined Control Control signals derived from instruction As in single-cycle implementation 74 Chapter 4 The Processor 38

Pipelined Control 75 Chapter 4 The Processor 39

Data Hazards in ALU Instructions Consider this sequence: sub $2, $1,$3 # Register $2 written by sub and $12,$2,$5 # 1st operand($2) depends on sub or $13,$6,$2 # 2nd operand($2) depends on sub add $14,$2,$2 # 1st($2) & 2nd($2) depend on sub sw $15,100($2) 100($2) # Base ($2) depends d on sub We can resolve hazards with forwarding How do we detect when to forward? 4.7 Data Hazard ds: Forwarding vs. Stalling 76 Chapter 4 The Processor 40

Dependencies & Forwarding 77 Chapter 4 The Processor 41

Detecting the Need to Forward Pass register numbers along pp pipeline e.g., ID/EX.RegisterRs = register number for Rs sitting in ID/EX pipeline register ALU operand register numbers in EX stage are given by ID/EX.RegisterRs, ID/EX.RegisterRt Data hazards when 1a. EX/MEM.RegisterRd = ID/EX.RegisterRs 1b. EX/MEM.RegisterRd = ID/EX.RegisterRt 2a. MEM/WB.RegisterRd = ID/EX.RegisterRs 2b. MEM/WB.RegisterRd = ID/EX.RegisterRt Fwd from EX/MEM pipeline reg Fwd from MEM/WB pipeline reg 78 Chapter 4 The Processor 42

Detecting the Need to Forward EX/MEM.RegisterRd = ID/EX.RegisterRs = $2 The sub-and is a type 1a hazard. sub $2, $1,$3 # Register $2 written by sub and $12,$2,$5 # 1st operand($2) depends on sub or $13,$6,$2 # 2nd operand($2) depends on sub add $14,$2,$2 # 1st($2) & 2nd($2) depend on sub sw $15,100($2) # Base ($2) depends on sub The sub-or is a type 2b hazard:mem/wb.registerrd = ID/EX.RegisterRt = $2 The two dependences on sub-add are not hazards because the register file supplies the proper data during the ID stage of add. There is no data hazard between sub and sw because sw reads $2 the clock cycle after sub writes $2. 79 Chapter 4 The Processor 43

Detecting the Need to Forward But only if forwarding instruction will write to a register! EX/MEM.RegWrite, MEM/WB.RegWrite And only if Rd for that instruction is not $zero EX/MEM.RegisterRd 0, MEM/WB.RegisterRd 0 80 Chapter 4 The Processor 44

Forwarding Paths 81 Chapter 4 The Processor 45

Forwarding Paths 82 Chapter 4 The Processor 46

Forwarding Conditions EX hazard if (EX/MEM.RegWrite and (EX/MEM.RegisterRd 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) ForwardA = 10 if (EX/MEM.RegWrite and (EX/MEM.RegisterRd 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) ForwardB = 10 MEM hazard if (MEM/WB.RegWrite and (MEM/WB.RegisterRd 0) and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01 if (MEM/WB.RegWrite and (MEM/WB.RegisterRd 0) and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01 83 Chapter 4 The Processor 47

Forwarding Registers The control values for the forwarding multiplexors. 84 Chapter 4 The Processor 48

Double Data Hazard Consider the sequence: add $1,$1,$2 add $1,$1,$3 add $1,$1,$4 Both hazards occur Want to use the most recent Revise MEM hazard condition Only fwd if EX hazard a condition dto isn t true 85 Chapter 4 The Processor 49

Revised Forwarding Condition MEM hazard if (MEM/WB.RegWrite and (MEM/WB.RegisterRd 0) and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01 if (MEM/WB.RegWrite and (MEM/WB.RegisterRd 0) and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01 86 Chapter 4 The Processor 50

Datapath with Forwarding 87 Chapter 4 The Processor 51

Load-Use Data Hazard tests to see if the instruction is a load if (ID/EX.MemRead and ((ID/EX.RegisterRt = IF/ID.RegisterRs) or (ID/EX.RegisterRt = IF/ID.RegisterRt))) stall the pipeline Check to see if the destination register field of the load in the EX stage matches either source register of the instruction in the ID stage. 88 Chapter 4 The Processor 52

Load-Use Data Hazard Need to stall for one cycle 89 Chapter 4 The Processor 53

Load-Use Hazard Detection Check when using instruction is decoded in ID stage ALU operand register numbers in ID stage are given by IF/ID.RegisterRs, IF/ID.RegisterRt Load-use hazard when ID/EX.MemRead and ((ID/EX.RegisterRt = IF/ID.RegisterRs) or (ID/EX.RegisterRt = IF/ID.RegisterRt)) If detected, stall and insert bubble 90 Chapter 4 The Processor 54

How to Stall the Pipeline Force control values in ID/EX register to 0 EX, MEM and WB do nop (no-operation) Prevent update of PC and IF/ID register Using instruction is decoded again Following instruction is fetched again 1-cycle stall allows MEM to read data for lw Can subsequently forward to EX stage 91 Chapter 4 The Processor 55

Stall/Bubble in the Pipeline Stall inserted here 92 Chapter 4 The Processor 56

Stall/Bubble in the Pipeline Or, more accurately 93 Chapter 4 The Processor 57

Datapath with Hazard Detection 94 Chapter 4 The Processor 58

Stalls and Performance The BIG Picture Stalls reduce performance But are required to get correct results Compiler can arrange code to avoid hazards and stalls Requires knowledge of the pipeline structure 95 Chapter 4 The Processor 59

Branch Hazards If branch outcome determined in MEM 4.8 Control Hazards beq rs, rt, offset if rs= rt then PC [PC] + 4 + 4*offset else PC [PC] + 4 Flush these instructions (Set control values to 0) PC 96 Chapter 4 The Processor 60

Branch Hazards beq rs, rt, offset if rs= rt then PC [PC] + 4 + (4*offset) else PC [PC] + 4 bgez rs, offset if rs 0 then PC [PC] + 4 + (4*offset) else PC [PC] + 4 bgtz rs, offset if rs > 0 then PC [PC] + 4 + (4*offset) else PC [PC] + 4 97 Chapter 4 The Processor 61

Reducing Branch Delay Move hardware to determine outcome to ID stage Target address adder Register comparator Example: branch taken 36: sub $10, $4, $8 40: beq $1, $3, 7 44: and $12, $2, $5 48: or $13, $2, $6 52: add $14, $4, $2 56: slt $15, $6, $7... 72: lw $4, 50($7) 98 Chapter 4 The Processor 62

Example: Branch Taken 99 Chapter 4 The Processor 63

Example: Branch Taken 100 Chapter 4 The Processor 64