

Thomas Polzer (tpolzer@ecs.tuwien.ac.at), Institut für Technische Informatik. VO Rechnerstrukturen.

Pipelined laundry: overlapping execution. Parallelism improves performance.
Four loads: speedup = 8/3.5 = 2.3. Non-stop: speedup ≈ number of stages.
[Figure: sequential vs. pipelined laundry timelines, tasks A to D, 6 PM to 2 AM.]

Five stages, one step per stage:
IF: instruction fetch from (instruction) memory
ID: instruction decode and register read
EX: execute operation or calculate address
MEM: access (data) memory operand
WB: write result back to register


Assume register read or write takes 100 ps and the other stages take 200 ps. Comparing the pipelined datapath with the single-cycle datapath:

Instr     Instr fetch  Reg read  ALU op  Mem access  Reg write  Total time
lw        200 ps       100 ps    200 ps  200 ps      100 ps     800 ps
sw        200 ps       100 ps    200 ps  200 ps                 700 ps
R-format  200 ps       100 ps    200 ps              100 ps     600 ps
beq       200 ps       100 ps    200 ps                         500 ps
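The totals in this table can be reproduced mechanically (a sketch of my own; the stage names in the dictionaries are informal labels, not from the slides):

```python
# Per-stage latencies from the table above (in ps).
STAGE_TIME = {"IF": 200, "REG_READ": 100, "ALU": 200, "MEM": 200, "REG_WRITE": 100}

# Which stages each instruction class actually uses.
USES = {
    "lw":       ["IF", "REG_READ", "ALU", "MEM", "REG_WRITE"],
    "sw":       ["IF", "REG_READ", "ALU", "MEM"],
    "R-format": ["IF", "REG_READ", "ALU", "REG_WRITE"],
    "beq":      ["IF", "REG_READ", "ALU"],
}

totals = {instr: sum(STAGE_TIME[s] for s in stages) for instr, stages in USES.items()}
print(totals)  # lw: 800, sw: 700, R-format: 600, beq: 500

# The single-cycle clock must fit the slowest instruction; the pipelined
# clock only has to fit the slowest stage.
print(max(totals.values()))       # 800 (ps) -> single-cycle Tc
print(max(STAGE_TIME.values()))   # 200 (ps) -> pipelined Tc
```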

Single-cycle datapath: Tc = 800 ps. Pipelined datapath: Tc = 200 ps. [Figure: instruction timing in both designs.]

A k-stage pipeline increases throughput by a factor of k (ideally); the execution time of an individual instruction is unchanged. Limits:
- balance of the stages
- filling the pipeline
- dependencies (hazards)
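The "filling the pipeline" limit can be made concrete with a toy model (my own sketch, not from the slides): n instructions on a k-stage pipeline with stage time T take (k + n - 1)·T, versus n·k·T unpipelined, so the speedup only approaches k for large n.

```python
# Toy model of ideal k-stage pipeline speedup (illustrative sketch).
def unpipelined_time(n, k, t):
    """n instructions, k stages, t = time per stage (balanced stages assumed)."""
    return n * k * t

def pipelined_time(n, k, t):
    """The first instruction needs k cycles; every later one retires per cycle."""
    return (k + n - 1) * t

n, k, t = 1000, 5, 200  # 200 ps per stage, as in the example above
speedup = unpipelined_time(n, k, t) / pipelined_time(n, k, t)
print(round(speedup, 2))  # 4.98, close to k = 5

# The four-load laundry example: 4 "instructions", 4 stages.
print(round(unpipelined_time(4, 4, 1) / pipelined_time(4, 4, 1), 2))  # 2.29, ~ the 2.3 above
```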

The MIPS ISA was designed for pipelining:
- All instructions are 32 bits: easier to fetch and decode in one cycle (cf. x86: 1- to 17-byte instructions)
- Few and regular instruction formats: can decode and read registers in one step
- Load/store addressing: can calculate the address in the 3rd stage and access memory in the 4th stage
- Alignment of memory operands: memory access takes only one cycle

[Figure: single-cycle datapath annotated with the five stages: IF (instruction fetch), ID (instruction decode/register file read), EX (execute/address calculation), MEM (memory access), WB (write back).]

[Figure: pipelined datapath with the IF/ID, ID/EX, EX/MEM and MEM/WB pipeline registers inserted between the stages.]

Cycle-by-cycle flow of instructions through the pipelined datapath. A single-clock-cycle pipeline diagram shows pipeline usage in a single cycle and highlights the resources used. We'll look at single-clock-cycle diagrams for load and store.

PC: visible pipeline register.

Executed by all instructions.


Wrong register number.


Form showing resource usage.

Traditional form.

State of the pipeline in a given cycle.


Control signals are derived from the instruction, as in the single-cycle implementation.


Hazards: situations that prevent starting the next instruction in the next cycle.
- Structural hazard: a required resource is busy
- Data hazard: need to wait for a previous instruction to complete its data read/write
- Control hazard: deciding on a control action depends on a previous instruction

Structural hazard: conflict for the use of a resource. In a MIPS pipeline with a single memory, a load/store requires data access, so an instruction fetch would have to stall for that cycle, causing a pipeline bubble. Hence, a pipelined datapath requires separate instruction/data memories, or separate instruction/data caches.

Read-after-write (RAW): a register is read before a previous instruction has written it.
Write-after-read (WAR): a register is written before a previous instruction has read it.
Write-after-write (WAW): a register written by an instruction is later overwritten by a logically earlier instruction, i.e. the writes complete out of order.

Data hazard: an instruction depends on completion of a data access by a previous instruction.
add $s0, $t0, $t1
sub $t2, $s0, $t3
or  $t4, $s0, $s1
and $s3, $s0, $s2
sw  $s4, 0($s0)
[Pipeline diagram: each instruction passes through IM, Reg, ALU, DM, Reg.]

Fix the register-file access hazard by doing reads in the second half of the cycle and writes in the first half.
add $s0, $t0, $t1
sub $t2, $s0, $t3
or  $t4, $s0, $s1
and $s3, $s0, $s2
sw  $s4, 0($s0)
[Pipeline diagram: add's register write and a later instruction's register read fall into the same cycle.]

Without forwarding, two nop instructions (bubbles) must separate add from the dependent sub:
add $s0, $t0, $t1
nop
nop
sub $t2, $s0, $t3
or  $t4, $s0, $s1

Use a result as soon as it is computed. Don't wait for it to be stored in a register; this requires extra connections in the datapath.
add $s0, $t0, $t1
sub $t2, $s0, $t3

How do we detect when to forward?
add $s0, $t0, $t1
sub $t2, $s0, $t3
or  $t4, $s0, $s1
and $s3, $s0, $s2
sw  $s4, 0($s0)

Pass register numbers along the pipeline, e.g. ID/EX.RegisterRs = register number for Rs sitting in the ID/EX pipeline register. The ALU operand register numbers in the EX stage are given by ID/EX.RegisterRs and ID/EX.RegisterRt. Data hazards occur when
1a. EX/MEM.RegisterRd = ID/EX.RegisterRs   (forward from the EX/MEM pipeline register)
1b. EX/MEM.RegisterRd = ID/EX.RegisterRt   (forward from the EX/MEM pipeline register)
2a. MEM/WB.RegisterRd = ID/EX.RegisterRs   (forward from the MEM/WB pipeline register)
2b. MEM/WB.RegisterRd = ID/EX.RegisterRt   (forward from the MEM/WB pipeline register)

But only if the forwarding instruction will write to a register (EX/MEM.RegWrite, MEM/WB.RegWrite), and only if Rd for that instruction is not $zero (EX/MEM.RegisterRd ≠ 0, MEM/WB.RegisterRd ≠ 0).


EX hazard:
if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) ForwardA = 10
if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) ForwardB = 10
MEM hazard:
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01

Consider the sequence:
add $t1, $t2, $t3
add $t1, $t1, $t4
add $t1, $t1, $t5
Both hazards occur, but we want to use the most recent result. Revise the MEM hazard condition: only forward if the EX hazard condition is not true.

How do we detect when to forward?
add $t1, $t2, $t3
add $t1, $t1, $t4
add $t1, $t1, $t5
[Pipeline diagram: both EX/MEM and MEM/WB hold results destined for $t1.]

MEM hazard (revised):
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
    and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
             and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
    and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
    and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
             and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
    and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01
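The EX- and MEM-hazard conditions can be combined into one selection function per ALU operand, with the EX hazard checked first so the most recent result wins. The following is a behavioral sketch of my own (field names mirror the slides' pipeline-register notation; the dict-based encoding is an assumption for illustration):

```python
def forward_select(ex_mem, mem_wb, id_ex_src):
    """Mux control for one ALU operand (ForwardA or ForwardB).
    '10' = forward from EX/MEM, '01' = forward from MEM/WB, '00' = register file.
    ex_mem / mem_wb: dicts with 'RegWrite' (bool) and 'RegisterRd' (int);
    id_ex_src: the source register number (Rs or Rt) held in ID/EX."""
    if ex_mem["RegWrite"] and ex_mem["RegisterRd"] != 0 \
            and ex_mem["RegisterRd"] == id_ex_src:
        return "10"  # EX hazard: most recent result has priority
    if mem_wb["RegWrite"] and mem_wb["RegisterRd"] != 0 \
            and mem_wb["RegisterRd"] == id_ex_src:
        return "01"  # MEM hazard: only reached if the EX condition is false
    return "00"

# Double data hazard from the add/add/add sequence above: both pipeline
# registers hold a result for $t1 (register 9), but EX/MEM must win.
ex_mem = {"RegWrite": True, "RegisterRd": 9}
mem_wb = {"RegWrite": True, "RegisterRd": 9}
print(forward_select(ex_mem, mem_wb, 9))  # '10'
```

Ordering the two if-statements implements the revised MEM condition: the MEM forward is taken only when the EX hazard condition is not true.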


We cannot always avoid stalls by forwarding: if the value has not been computed when it is needed, we cannot forward backward in time.
lw  $s0, 0($t0)
sub $t2, $s0, $t3

With a bubble inserted:
lw  $s0, 0($t0)
nop
sub $t2, $s0, $t3

Load-use hazard detection: check when the using instruction is decoded in the ID stage. The ALU operand register numbers in the ID stage are given by IF/ID.RegisterRs and IF/ID.RegisterRt. A load-use hazard occurs when
ID/EX.MemRead and
  ((ID/EX.RegisterRt = IF/ID.RegisterRs) or (ID/EX.RegisterRt = IF/ID.RegisterRt))
If detected, stall and insert a bubble.
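The detection condition above translates almost literally into code. A sketch of my own (dict fields mirror the slides' notation; the concrete register numbers follow the standard MIPS numbering, e.g. $s0 = 16):

```python
def load_use_hazard(id_ex, if_id):
    """Stall detection performed in the ID stage.
    id_ex: {'MemRead': bool, 'RegisterRt': int} for the instruction in EX;
    if_id: {'RegisterRs': int, 'RegisterRt': int} for the instruction in ID."""
    return id_ex["MemRead"] and id_ex["RegisterRt"] in (
        if_id["RegisterRs"], if_id["RegisterRt"])

# lw $s0, 0($t0) in EX (loads into $s0 = reg 16),
# sub $t2, $s0, $t3 in ID (reads $s0 = 16 and $t3 = 11): must stall.
print(load_use_hazard({"MemRead": True, "RegisterRt": 16},
                      {"RegisterRs": 16, "RegisterRt": 11}))  # True

# A non-load in EX never triggers the load-use stall.
print(load_use_hazard({"MemRead": False, "RegisterRt": 16},
                      {"RegisterRs": 16, "RegisterRt": 11}))  # False
```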

To stall: force the control values in the ID/EX register to 0, so EX, MEM and WB perform a nop (no operation), and prevent the update of the PC and the IF/ID register, so the using instruction is decoded again and the following instruction is fetched again. The 1-cycle stall allows MEM to read the data for lw, which can subsequently be forwarded to the EX stage.

Stall inserted as a bubble:
lw  $s0, 0($t0)
sub $t2, $s0, $t3   (paused for one cycle, then continues through EX, MEM, WB)


Reorder code to avoid using a load result in the next instruction. C code: A = B + E; C = B + F;

Unscheduled (13 cycles, one stall before each add):
lw  $t1, 0($t0)
lw  $t2, 4($t0)
add $t3, $t1, $t2   # stall
sw  $t3, 12($t0)
lw  $t4, 8($t0)
add $t5, $t1, $t4   # stall
sw  $t5, 16($t0)

Scheduled (11 cycles):
lw  $t1, 0($t0)
lw  $t2, 4($t0)
lw  $t4, 8($t0)
add $t3, $t1, $t2
sw  $t3, 12($t0)
add $t5, $t1, $t4
sw  $t5, 16($t0)

Stalls reduce performance but are required to get correct results. The compiler can arrange code to avoid hazards and stalls; this requires knowledge of the pipeline structure.

A branch determines the flow of control: fetching the next instruction depends on the branch outcome, and the pipeline can't always fetch the correct instruction.

If the branch outcome is determined in MEM: flush the instructions fetched after the branch (set their control values to 0) and update the PC.

Control hazards arise when the flow of instruction addresses is not sequential:
- unconditional branches (j, jal, jr)
- conditional branches (beq, bne)
- exceptions
Possible approaches:
- stall (impacts CPI)
- move the decision point as early in the pipeline as possible
- delay the decision (requires compiler support)
- predict and hope for the best!
Control hazards occur less frequently than data hazards, but there is nothing as effective against control hazards as forwarding is for data hazards.

Wait until the branch outcome is determined before fetching the next instruction. Conservative approach: stall immediately after fetching a branch, wait until the outcome is known, then fetch from the branch address. Requires extra hardware (a comparator and an address adder in the ID stage).

Move the hardware that determines the outcome to the ID stage to reduce the cost of a taken branch. Needed: a target address adder and a register comparator.

If a comparison register is the destination of the 2nd or 3rd preceding ALU instruction:
add $1, $2, $3        IF ID EX MEM WB
add $4, $5, $6           IF ID EX MEM WB
...                         IF ID EX MEM WB
beq $1, $4, target             IF ID EX MEM WB
This can be resolved using forwarding.

If a comparison register is the destination of the preceding ALU instruction or of the 2nd preceding load instruction, 1 stall cycle is needed:
lw  $1, addr          IF ID EX MEM WB
add $4, $5, $6           IF ID EX MEM WB
beq (stalled)               IF ID
beq $1, $4, target             ID EX MEM WB

If a comparison register is the destination of the immediately preceding load instruction, 2 stall cycles are needed:
lw  $1, addr             IF ID EX MEM WB
beq (stalled)               IF ID
beq (stalled)                  ID
beq $1, $zero, target             ID EX MEM WB

Longer pipelines can't readily determine the branch outcome early, and the stall penalty becomes unacceptable. Instead, predict the outcome of the branch and only stall if the prediction is wrong. In the MIPS pipeline we can predict branches as not taken and fetch the instruction after the branch with no delay.

[Figure: predict-not-taken pipeline flow, prediction correct vs. prediction incorrect.]

Static branch prediction: based on typical branch behavior. Example: loop and if-statement branches; predict backward branches taken and forward branches not taken.
Dynamic branch prediction: hardware measures actual branch behavior, e.g. records the recent history of each branch, and assumes future behavior will continue the trend. When wrong, stall while re-fetching and update the history.

In deeper and superscalar pipelines the branch penalty is more significant, so use dynamic prediction: a branch prediction buffer (aka branch history table), indexed by recent branch instruction addresses, stores the outcome (taken/not taken). To execute a branch: check the table and expect the same outcome; start fetching from the fall-through or the target. If wrong, flush the pipeline and flip the prediction.

A 1-bit predictor will be incorrect twice for a loop branch. Assume predict_bit = 0 to start (branch not taken) and loop control at the bottom of the loop:

Loop: 1st loop instr
      2nd loop instr
      ...
      last loop instr
      bne $1, $2, Loop
      fall-out instr

- The first time through the loop, the predictor mispredicts, since the branch is taken back to the top of the loop; invert the prediction bit (predict_bit = 1)
- As long as the branch is taken (looping), the prediction is correct
- On exiting the loop, the predictor again mispredicts, since this time the branch is not taken, falling out of the loop; invert the prediction bit (predict_bit = 0)
For 10 times through the loop we have an 80 % prediction accuracy for a branch that is taken 90 % of the time.
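The 80 % figure is easy to check by simulation (my own sketch; the predictor starts at "not taken", as assumed above):

```python
def one_bit_predictor(outcomes, predict=False):
    """Simulate a 1-bit predictor; return the fraction of correct predictions.
    outcomes: list of booleans, True = branch taken; predict = initial bit."""
    correct = 0
    for taken in outcomes:
        correct += (predict == taken)
        predict = taken  # 1 bit of state: always remember the last outcome
    return correct / len(outcomes)

# One visit to a 10-iteration loop: branch taken 9 times, falls through once.
outcomes = [True] * 9 + [False]
print(one_bit_predictor(outcomes))  # 0.8 -> exactly the two mispredictions above
```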

A 2-bit scheme can give 90 % accuracy, since a prediction must be wrong twice before it is changed: one bit for what the branch is supposed to do, one bit for what it did last time. States:
00  predict no branch
01  predict no branch one more time
10  predict branch one more time
11  predict branch
[Figure: 2-bit saturating state machine; branch/no-branch outcomes move between the states.]
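Simulating a 2-bit saturating counter on repeated visits to the same 10-iteration loop shows the 90 % figure: only the loop exit mispredicts on each visit. A sketch of my own (the initial state "strongly taken" is my assumption; the counter encoding 0..3 with 2,3 = predict taken is the standard one):

```python
def two_bit_predictor(outcomes, state=3):
    """2-bit saturating counter: states 0,1 predict not taken; 2,3 predict taken."""
    correct = 0
    for taken in outcomes:
        correct += ((state >= 2) == taken)
        # Saturating update: move toward taken (max 3) or not taken (min 0).
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)

# Revisit the 10-iteration loop 100 times: one miss (the exit) per visit.
outcomes = ([True] * 9 + [False]) * 100
print(two_bit_predictor(outcomes))  # 0.9
```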

Even with a predictor, we still need to calculate the target address: a 1-cycle penalty for a taken branch. Alternative: a branch target buffer, a cache of target addresses indexed by the PC when the instruction is fetched. On a hit, if the instruction is a branch predicted taken, the target can be fetched immediately.

Unexpected events requiring a change in the flow of control; different ISAs use the terms differently.
- Exception: arises within the CPU, e.g. undefined opcode, overflow, syscall, ...
- Interrupt: from an external I/O controller
Dealing with them without sacrificing performance is hard.

In MIPS, exceptions are managed by the System Control Coprocessor (CP0):
- Save the PC of the offending (or interrupted) instruction: in MIPS, the Exception Program Counter (EPC)
- Save an indication of the problem: in MIPS, the Cause register
- Jump to the handler

Pipelining: executing multiple instructions in parallel. To increase ILP:
- Deeper pipeline: less work per stage, hence a shorter clock cycle
- Multiple issue: replicate pipeline stages (multiple pipelines) and start multiple instructions per clock cycle. CPI < 1, so use instructions per cycle (IPC) instead; e.g. a 4-way multiple-issue machine has peak CPI = 0.25 and peak IPC = 4, but dependencies reduce this in practice.

Static multiple issue (at compile time): the compiler groups instructions to be issued together, packages them into issue slots, and detects and avoids hazards.
Dynamic multiple issue (during execution): the CPU examines the instruction stream and chooses instructions to issue each cycle; the compiler can help by reordering instructions, but the CPU resolves hazards using advanced techniques at runtime.

Speculation: guess what to do with an instruction and start the operation as soon as possible. Check whether the guess was right: if so, complete the operation; if not, roll back and do the right thing. Common to both static and dynamic multiple issue. Examples:
- Speculate on a branch outcome and execute the instructions after the branch; roll back if the path taken is different
- Speculate that a store preceding a load doesn't refer to the same address; roll back if the location is updated

The compiler can reorder instructions; the hardware can look ahead for instructions to execute, buffering results until it determines they are actually needed (written to the registers or memory) and flushing the buffers on incorrect speculation.

What if an exception occurs on a speculatively executed instruction?
- Static speculation: can add ISA support for deferring exceptions
- Dynamic speculation: can buffer exceptions until instruction completion (which may not occur)

The compiler groups instructions into issue packets: groups of instructions that can be issued in a single cycle, determined by the pipeline resources required. Think of an issue packet as a very long instruction specifying multiple concurrent operations: Very Long Instruction Word (VLIW).

The compiler must remove some or all hazards: it reorders instructions into issue packets such that there are no dependencies within a packet (possibly some dependencies between packets), padding with nop if necessary.

Two-issue packets: one ALU/branch instruction plus one load/store instruction, 64-bit aligned (ALU/branch first, then load/store); pad an unused slot with nop.

Address  Instruction type  Pipeline stages
n        ALU/branch        IF ID EX MEM WB
n + 4    Load/store        IF ID EX MEM WB
n + 8    ALU/branch           IF ID EX MEM WB
n + 12   Load/store           IF ID EX MEM WB
n + 16   ALU/branch              IF ID EX MEM WB
n + 20   Load/store              IF ID EX MEM WB


More instructions execute in parallel.
- EX data hazard: forwarding avoided stalls with single issue, but now we can't use an ALU result in a load/store in the same packet, e.g. add $t0, $s0, $s1 / load $s2, 0($t0) must be split into two packets, effectively a stall
- Load-use hazard: still one cycle of use latency, but now two instructions are affected
More aggressive scheduling is required.

Schedule this for dual-issue MIPS:
Loop: lw   $t0, 0($s1)       # $t0 = array element
      addu $t0, $t0, $s2     # add scalar in $s2
      sw   $t0, 0($s1)       # store result
      addi $s1, $s1, -4      # decrement pointer
      bne  $s1, $zero, Loop  # branch if $s1 != 0

      ALU/branch             Load/store       Cycle
Loop: nop                    lw $t0, 0($s1)   1
      addi $s1, $s1, -4      nop              2
      addu $t0, $t0, $s2     nop              3
      bne  $s1, $zero, Loop  sw $t0, 4($s1)   4

IPC = 5/4 = 1.25 (cf. peak IPC = 2)

Replicate the loop body to expose more parallelism and reduce loop-control overhead. Use different registers per replication: the compiler applies register renaming to eliminate all data dependencies that are not true data dependencies.

lp: lw   $t0, 0($s1)    # $t0 = array element
    lw   $t1, -4($s1)   # $t1 = array element
    lw   $t2, -8($s1)   # $t2 = array element
    lw   $t3, -12($s1)  # $t3 = array element
    addu $t0, $t0, $s2  # add scalar in $s2
    addu $t1, $t1, $s2  # add scalar in $s2
    addu $t2, $t2, $s2  # add scalar in $s2
    addu $t3, $t3, $s2  # add scalar in $s2
    sw   $t0, 0($s1)    # store result
    sw   $t1, -4($s1)   # store result
    sw   $t2, -8($s1)   # store result
    sw   $t3, -12($s1)  # store result
    addi $s1, $s1, -16  # decrement pointer
    bne  $s1, $zero, lp # branch if $s1 != 0

      ALU/branch             Load/store        Cycle
Loop: addi $s1, $s1, -16     lw $t0, 0($s1)    1
      nop                    lw $t1, 12($s1)   2
      addu $t0, $t0, $s2     lw $t2, 8($s1)    3
      addu $t1, $t1, $s2     lw $t3, 4($s1)    4
      addu $t2, $t2, $s2     sw $t0, 16($s1)   5
      addu $t3, $t3, $s2     sw $t1, 12($s1)   6
      nop                    sw $t2, 8($s1)    7
      bne  $s1, $zero, Loop  sw $t3, 4($s1)    8

IPC = 14/8 = 1.75. Closer to 2, but at the cost of registers and code size.

Superscalar processors: the CPU decides whether to issue 0, 1, 2, ... instructions each cycle, avoiding structural and data hazards. This avoids the need for compiler scheduling (though it may still help); code semantics are ensured by the CPU.

Allow the CPU to execute instructions out of order to avoid stalls, but commit results to registers in order. Example:
lw   $t0, 20($s2)
addu $t1, $t0, $t2
sub  $s4, $s4, $t3
slti $t5, $s4, 20
The CPU can start sub while addu is waiting for lw.

A dynamically scheduled CPU preserves dependencies; reservation stations hold pending operands; a reorder buffer for register (memory) writes can supply operands for issued instructions; results are also sent to any waiting reservation stations.

Why not just let the compiler schedule code?
- Not all stalls are predictable, e.g. cache misses
- Can't always schedule around branches: the branch outcome is dynamically determined
- Different implementations of an ISA have different latencies and hazards

Does multiple issue work? Yes, but not as much as we'd like:
- Programs have real dependencies that limit ILP
- Some dependencies are hard to eliminate
- Some parallelism is hard to expose
- Memory delays and limited bandwidth make it hard to keep pipelines full
- Speculation can help if done well

Pipelining is easy (!): the basic idea is easy, but the devil is in the details, e.g. detecting data hazards. Pipelining is independent of technology: so why haven't we always done pipelining? More transistors make more advanced techniques feasible; today, power concerns lead to less aggressive designs.

Poor ISA design can make pipelining harder, e.g. complex instruction sets (such as IA-32): significant overhead to make pipelining work (the IA-32 micro-op approach), or complex addressing modes.

The ISA influences the design of datapath and control, and datapath and control influence the design of the ISA. Pipelining improves instruction throughput using parallelism: more instructions are completed per second, but the latency of each instruction is not reduced. Hazards: structural, data, control. Multiple issue and dynamic scheduling (ILP): dependencies limit the achievable parallelism, and complexity leads to the power wall.

Exercise: consider a five-stage pipeline. Unconditional jumps are resolved at the end of the second stage, while the target of conditional jumps is only known at the end of stage three. Jump prediction uses the predict-not-taken strategy and the CPU has an ideal CPI of one. The frequency of conditional jumps is 20 %, that of unconditional jumps 5 %, and in 60 % of the cases the conditional jumps are taken. How much faster would the machine be without branch hazards?

Conditional jumps: 20 % of all instructions; 60 % of the predictions are incorrect; cost of an erroneous prediction: 2 cycles.
Unconditional jumps: 5 % of all instructions; all predictions are wrong; cost of an erroneous prediction: 1 cycle.

Execution time without branch hazards: ET_ideal = N * CPI_ideal * T_clk, with CPI_ideal = 1.
Execution time including branch hazards: ET_real = N * CPI_real * T_clk, with
CPI_real = 0.75 + 0.2 * (0.4 * 1 + 0.6 * 3) + 0.05 * 2 = 1.29
Performance comparison: ET_real / ET_ideal = CPI_real / CPI_ideal = 1.29. The CPU would be 29 % faster.
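The CPI calculation can be reproduced step by step (a sketch of my own; variable names are mine):

```python
# Branch-hazard CPI for the exercise above.
f_cond, f_uncond = 0.20, 0.05        # instruction-mix frequencies
p_taken = 0.60                       # taken (= mispredicted) conditional jumps
penalty_cond, penalty_uncond = 2, 1  # extra cycles on a wrong prediction

cpi_real = (
    (1 - f_cond - f_uncond) * 1                                 # ordinary instructions
    + f_cond * ((1 - p_taken) * 1 + p_taken * (1 + penalty_cond))  # conditional jumps
    + f_uncond * (1 + penalty_uncond)                           # unconditional jumps
)
print(round(cpi_real, 2))  # 1.29
```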

Exercise: assume a MIPS processor with a five-stage pipeline. No forwarding is implemented and jump instructions are executed in stage 2. Add nop instructions to the following program such that it executes correctly, then optimize the program to minimize the overhead.
     ori  $s4, $zero, 12
     or   $s1, $zero, $zero
sum: lw   $t2, 0($s4)
     add  $s1, $s1, $t2
     addi $s4, $s4, -4
     bne  $s4, $zero, sum

Inserting nop instructions:
     ori  $s4, $zero, 12
     or   $s1, $zero, $zero
     nop
sum: lw   $t2, 0($s4)
     nop
     nop
     add  $s1, $s1, $t2
     addi $s4, $s4, -4
     nop
     nop
     bne  $s4, $zero, sum

Optimized code:
     ori  $s4, $zero, 12
     or   $s1, $zero, $zero
     nop
sum: lw   $t2, 0($s4)
     addi $s4, $s4, -4
     nop
     add  $s1, $s1, $t2
     bne  $s4, $zero, sum

Exercise: reorder the following MIPS code to minimize pipeline stalls. Forwarding is implemented and load instructions are executed in stage four (one load-delay slot).
lw  $t1, 0($t2)
lw  $t3, 4($t2)
add $t4, $t1, $t3
add $t4, $t4, $s2
lw  $s1, 0($t6)
lw  $t7, 4($t6)
sub $t8, $s1, $t7
sw  $t8, 8($t6)

Reordered code:
lw  $t1, 0($t2)
lw  $t3, 4($t2)
lw  $s1, 0($t6)
add $t4, $t1, $t3
lw  $t7, 4($t6)
add $t4, $t4, $s2
sub $t8, $s1, $t7
sw  $t8, 8($t6)

Exercise: draw the pipeline diagram for the following program:
    ori  $s4, $zero, 20
lp: lw   $t2, 0($s4)
    mult $s1, $s1, $t2
    addi $s4, $s4, -4
    bne  $s4, $zero, lp
    nop
Forwarding is not implemented and the multiplication takes three cycles in stage EX. Branch targets are calculated in stage ID. If there are any RAW dependencies, the pipeline is stalled until the data is available. Assume that the loop is executed five times.

    ori  $s4, $zero, 20    IF ID EX ME WB
lp: lw   $t2, 0($s4)       IF ID st st EX ME WB
    mult $s1, $s1, $t2     IF st st ID st st EX EX EX ME WB
    addi $s4, $s4, -4      IF st st ID st st EX ME WB
    bne  $s4, $zero, lp    IF st st ID st st EX ...
    nop                    IF st st (aborted)
    lw   $t2, 0($s4)       IF ...
(each row starts one cycle after the previous one)
Number of stalls: 2 + 5 * 3 * 2 cycles. Number of control-hazard cycles: 4 * 1.

Exercise: draw the pipeline diagram of the following program. How many stalls occur? Which value is stored at memory address zero, if the initial values in memory are:
Address  Value
4        2
8        3
12       4

      lw   $t1, 12($zero)
      addi $t2, $zero, 8
loop: lw   $t3, 0($t2)
      addi $t2, $t2, -4
      bne  $t2, $zero, loop
      mult $t1, $t1, $t3
      ori  $t1, $t1, 1
      sw   $t1, 0($zero)
The multiplication needs three cycles in stage EX. Forwarding is implemented. The jump is executed at the end of the ID stage and the processor uses a branch-delay slot of length one.

lw   $t1, 12($zero)    IF ID EX ME WB
addi $t2, $zero, 8     IF ID EX ME WB
lw   $t3, 0($t2)       IF ID EX ME WB
addi $t2, $t2, -4      IF ID EX ME WB
bne  $t2, $zero, loop  IF ID EX ME WB
mult $t1, $t1, $t3     IF ID EX EX EX ME WB
lw   $t3, 0($t2)       IF ID st st EX ME WB
addi $t2, $t2, -4      IF st st ID EX ME WB
bne  $t2, $zero, loop  IF ID EX ME WB
mult $t1, $t1, $t3     IF ID EX EX EX ME ...
ori  $t1, $t1, 1       IF ID st st EX ...
sw   $t1, 0($zero)     IF st st ID ...
(each row starts one cycle after the previous one)
Number of stalls: 2 * 2 cycles. Value stored at address 0: 25.
