Pipeline Overview. Dr. Jiang Li. Adapted from the slides provided by the authors. Jiang Li, Ph.D. Department of Computer Science

Similar documents
Lecture 3. Pipelining. Dr. Soner Onder CS 4431 Michigan Technological University 9/23/2009 1

Computer Architecture

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)

CISC 662 Graduate Computer Architecture Lecture 6 - Hazards

Modern Computer Architecture

Lecture 2: Processor and Pipelining 1

CSE 502 Graduate Computer Architecture. Lec 3-5 Performance + Instruction Pipelining Review

COSC 6385 Computer Architecture - Pipelining

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

Advanced Computer Architecture

CSE 502 Graduate Computer Architecture. Lec 3-5 Performance + Instruction Pipelining Review

Lecture 05: Pipelining: Basic/ Intermediate Concepts and Implementation

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

MIPS An ISA for Pipelining

CSE 502 Graduate Computer Architecture. Lec 4-6 Performance + Instruction Pipelining Review

Pipelining! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar DEIB! 30 November, 2017!

Computer Architecture and System

Computer Architecture and System

Computer Architecture Spring 2016

COSC4201 Pipelining. Prof. Mokhtar Aboelaze York University

CISC 662 Graduate Computer Architecture Lecture 6 - Hazards

Pipelining: Hazards Ver. Jan 14, 2014

Advanced Computer Architecture Pipelining

Appendix A. Overview

Overview. Appendix A. Pipelining: Its Natural! Sequential Laundry 6 PM Midnight. Pipelined Laundry: Start work ASAP

Chapter 4 The Processor 1. Chapter 4B. The Processor

EITF20: Computer Architecture Part2.2.1: Pipeline-1

Pipelining: Basic and Intermediate Concepts

CPE Computer Architecture. Appendix A: Pipelining: Basic and Intermediate Concepts

Pipeline Review. Review

EITF20: Computer Architecture Part2.2.1: Pipeline-1

Lecture 1: Introduction

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

CSE 533: Advanced Computer Architectures. Pipelining. Instructor: Gürhan Küçük. Yeditepe University

Appendix C: Pipelining: Basic and Intermediate Concepts

EITF20: Computer Architecture Part2.2.1: Pipeline-1

The Big Picture Problem Focus S re r g X r eg A d, M lt2 Sub u, Shi h ft Mac2 M l u t l 1 Mac1 Mac Performance Focus Gate Source Drain BOX

CMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3. Complications With Long Instructions

Basic Pipelining Concepts

Chapter 4 The Processor 1. Chapter 4A. The Processor

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1

Appendix C. Abdullah Muzahid CS 5513

CISC 662 Graduate Computer Architecture Lecture 5 - Pipeline. Pipelining. Pipelining the Idea. Similar to assembly line in a factory:

The Processor Pipeline. Chapter 4, Patterson and Hennessy, 4ed. Section 5.3, 5.4: J P Hayes.

Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Instruction Pipelining Review

CISC 662 Graduate Computer Architecture Lecture 5 - Pipeline Pipelining

Data Hazards Compiler Scheduling Pipeline scheduling or instruction scheduling: Compiler generates code to eliminate hazard

The Processor: Improving the performance - Control Hazards

ISA. CSE 4201 Minas E. Spetsakis based on Computer Architecture by Hennessy and Patterson

What is Pipelining? Time per instruction on unpipelined machine Number of pipe stages

第三章 Instruction-Level Parallelism and Its Dynamic Exploitation. 陈文智 浙江大学计算机学院 2014 年 10 月

Outline. Pipelining basics The Basic Pipeline for DLX & MIPS Pipeline hazards. Handling exceptions Multi-cycle operations

Minimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline

Computer Systems Architecture Spring 2016

HY425 Lecture 05: Branch Prediction

Processor Architecture

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri

LECTURE 3: THE PROCESSOR

Pipeline Hazards. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Advanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017

Pipelining. Maurizio Palesi

Appendix C. Instructor: Josep Torrellas CS433. Copyright Josep Torrellas 1999, 2001, 2002,

Lecture 9. Pipeline Hazards. Christos Kozyrakis Stanford University

(Basic) Processor Pipeline

Instruction Level Parallelism. ILP, Loop level Parallelism Dependences, Hazards Speculation, Branch prediction

ECE154A Introduction to Computer Architecture. Homework 4 solution

CS 110 Computer Architecture. Pipelining. Guest Lecture: Shu Yin. School of Information Science and Technology SIST

Outline. A pipelined datapath Pipelined control Data hazards and forwarding Data hazards and stalls Branch (control) hazards Exception

COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN

Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining

ECS 154B Computer Architecture II Spring 2009

The Processor. Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut. CSE3666: Introduction to Computer Architecture

Chapter 4. The Processor

What is Pipelining? RISC remainder (our assumptions)

CENG 3420 Lecture 06: Pipeline

Instr. execution impl. view

Chapter 4. The Processor

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 6 Pipelining Part 1

MIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14

Graduate Computer Architecture. Chapter 3. Instruction Level Parallelism and Its Dynamic Exploitation

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor

ECE232: Hardware Organization and Design

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition. Chapter 4. The Processor

Thomas Polzer Institut für Technische Informatik

Very Simple MIPS Implementation

Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3.

DLX Unpipelined Implementation

CS 61C: Great Ideas in Computer Architecture Pipelining and Hazards

Full Datapath. Chapter 4 The Processor 2

Computer Architecture. Lecture 6.1: Fundamentals of

CSEE 3827: Fundamentals of Computer Systems

Improving Performance: Pipelining

ECE 505 Computer Architecture

1 Hazards COMP2611 Fall 2015 Pipelined Processor

Very Simple MIPS Implementation

ELE 655 Microprocessor System Design

Computer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM

ECE331: Hardware Organization and Design

Transcription:

Pipeline Overview Dr. Jiang Li Adapted from the slides provided by the authors

Outline MIPS An ISA for Pipelining 5 stage pipelining Structural and Data Hazards Forwarding Branch Schemes Exceptions and Interrupts Conclusion

A "Typical" RISC ISA 32-bit fixed format instruction (3 formats) 32 32-bit GPR (R0 contains zero, DP take pair) 3-address, reg-reg arithmetic instruction Single address mode for load/store: base + displacement no indirection Simple branch conditions Delayed branch

Example: MIPS ( MIPS) ister-ister 31 26 25 2120 16 15 1110 6 5 0 Op 31 26 25 2120 16 15 0 Op Op Rs1 ister-immediate Branch Op Rs2 Rs1 Rd immediate 31 26 25 0 Rd target Opx 31 26 25 2120 16 15 0 Jump / Call Rs1 Rs2/Opx immediate

Datapath vs Control Datapath Controller signals Control Points Datapath: Storage, FU, interconnect sufficient to perform the desired functions Inputs are Control Points Outputs are signals Controller: State machine to orchestrate operation on the data path Based on desired function and signals

Approaching an ISA Instruction Set Architecture Defines set of operations, instruction format, hardware supported data types, named storage, addressing modes, sequencing Meaning of each instruction is described by RTL on architected registers and memory Given technology constraints assemble adequate datapath Architected storage mapped to actual storage Function units to do all the required operations Possible additional storage (eg. MAR, MBR, ) Interconnect to move information among regs and FUs Map each instruction to sequence of RTLs Collate sequences into symbolic controller state transition diagram (STD) Lower symbolic STD to control points Implement controller

5 Steps of MIPS Datapath Instruction Fetch Instr. Decode. Fetch Execute Addr. Calc Memory Access Write Back Next PC 4 Adder Next SEQ PC RS1 Zero? MUX Address Memory Inst RS2 RD File MUX MUX Data Memory L M D MUX IR <= mem[pc]; PC <= PC + 4 Imm Sign Extend [IR rd ] <= [IR rs ] op IRop [IR rt ] WB Data

Inst. Set Processor Controller IR <= mem[pc]; PC <= PC + 4 br JSR JR jmp A <= [IR rs ]; B <= [IR rt ] RR opfetch-dcd RI LD ST if bop(a,b) PC <= PC+IR im PC <= IR jaddr r <= A op IRop B r <= A op IRop IR im r <= A + IR im WB <= r WB <= r WB <= Mem[r] [IR rd ] <= WB [IR rd ] <= WB [IR rd ] <= WB

WB Data Pipelined MIPS Datapath Instruction Fetch Instr. Decode. Fetch Execute Addr. Calc Memory Access Write Back Next PC 4 Adder Next SEQ PC RS1 Next SEQ PC Zero? MUX Address Memory IR <= mem[pc]; PC <= PC + 4 IF/ID RS2 Imm File Sign Extend ID/EX MUX MUX EX/MEM Data Memory MEM/WB MUX A <= [IR rs ]; B <= [IR rt ] RD RD RD rslt <= A op IRop WB <= rslt B

Visualizing Pipelining Time (clock cycles) I n s t r. Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 O r d e r

Pipelining is not quite that easy! Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away) Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock) Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).

One Memory Port/Structural Hazards Time (clock cycles) Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 I n s t r. O r d e r Load Instr 1 Instr 2 Instr 3 Instr 4

One Memory Port/Structural Hazards Time (clock cycles) Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 I n s t r. O r d e r Load Instr 1 Instr 2 Stall Instr 3 Bubble Bubble Bubble Bubble Bubble How do you bubble the pipe?

Speed Up Equation for Pipelining CPI pipelined Ideal CPI Average Stall cycles per Inst Speedup Ideal CPI Ideal CPI Ideal CPI Pipeline stall CPI Ideal CPI Pipeline stall CPI Cycle Time Cycle Time Cycle Time unpipelined pipelined pipelined Cycle Time Pipeline depth pipelined For simple RISC pipeline, CPI = 1: Speedup Pipeline 1 Pipeline depth stall CPI

Example: Dual-port vs. Single-port Machine A: Dual ported memory ( Harvard Architecture ) without the structural hazard Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate with the hazard Ideal CPI = 1 for both Loads are 40% of instructions executed Average Instruction time_a= CPI x clock cycle time Average Instruction time_b = (1 + 0.4) x clock cycle time/1.05 SpeedUp = Average Instruction time_b/ Average Instruction time_a = 1.33 Machine A is 1.33 times faster

Data Hazard on R1 Time (clock cycles) IF ID/RF EX MEM WB I n s t r. add r1,r2,r3 sub r4,r1,r3 O r d e r and r6,r1,r7 or r8,r1,r9 xor r10,r1,r11

Three Generic Data Hazards Read After Write (RAW) Instr J tries to read operand before Instr I writes it I: add r1,r2,r3 J: sub r4,r1,r3 Caused by a Dependence (in compiler nomenclature). This hazard results from an actual need for communication.

Three Generic Data Hazards Write After Read (WAR) Instr J writes operand before Instr I reads it I: sub r4,r1,r3 J: add r1,r2,r3 K: mul r6,r1,r7 Called an anti-dependence by compiler writers. This results from reuse of the name r1. Can t happen in MIPS 5 stage pipeline because: All instructions take 5 stages, and Reads are always in stage 2, and Writes are always in stage 5

Three Generic Data Hazards Write After Write (WAW) Instr J writes operand before Instr I writes it. I: sub r1,r4,r3 J: add r1,r2,r3 K: mul r6,r1,r7 Called an output dependence by compiler writers This also results from the reuse of name r1. Can t happen in MIPS 5 stage pipeline because: All instructions take 5 stages, and Writes are always in stage 5 Will see WAR and WAW in more complicated pipes

Forwarding to Avoid Data Hazard Time (clock cycles) I n s t r. add r1,r2,r3 sub r4,r1,r3 O r d e r and r6,r1,r7 or r8,r1,r9 xor r10,r1,r11

HW Change for Forwarding NextPC isters ID/EX mux mux EX/MEM Data Memory MEM/WR Immediate mux What circuit detects and resolves this hazard?

Forwarding to Avoid LW-SW Data Hazard Time (clock cycles) I n s t r. add r1,r2,r3 lw r4, 0(r1) O r d e r sw r4,12(r1) or r8,r6,r9 xor r10,r9,r11

Data Hazard Even with Forwarding Time (clock cycles) I n s t r. lw r1, 0(r2) sub r4,r1,r6 O r d e r and r6,r1,r7 or r8,r1,r9

Data Hazard Even with Forwarding Time (clock cycles) I n s t r. O r d e r lw r1, 0(r2) sub r4,r1,r6 and r6,r1,r7 Bubble Bubble or r8,r1,r9 Bubble How is this detected?

Software Scheduling to Avoid Load Hazards Try producing fast code for a = b + c; d = e f; assuming a, b, c, d,e, and f in memory. Slow code: LW Rb,b Fast code: LW Rb,b LW Rc,c LW Rc,c ADD Ra,Rb,Rc LW Re,e SW a,ra ADD Ra,Rb,Rc LW Re,e LW Rf,f LW Rf,f SW a,ra SUB Rd,Re,Rf SUB Rd,Re,Rf SW d,rd SW d,rd

Control Hazard on Branches Three Stage Stall 10: beq r1,r3,36 14: and r2,r3,r5 18: or r6,r1,r7 22: add r8,r1,r9 36: xor r10,r1,r11 What do you do with the 3 instructions in between? How do you do it? Where is the commit?

Branch Stall Impact If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9! Two part solution: Determine branch taken or not sooner, AND Compute taken branch address earlier MIPS branch tests if register = 0 or 0 MIPS Solution: Move Zero test to ID/RF stage Adder to calculate new PC in ID/RF stage 1 clock cycle penalty for branch versus 3

WB Data Pipelined MIPS Datapath Instruction Fetch Instr. Decode. Fetch Execute Addr. Calc Memory Access Write Back Next PC 4 Adder Next SEQ PC RS1 Adder MUX Zero? Address Memory IF/ID RS2 File ID/EX MUX EX/MEM Data Memory MEM/WB MUX Imm Sign Extend RD RD RD Interplay of instruction set design and cycle time.

Four Branch Hazard Alternatives #1: Stall until branch direction is clear #2: Predict Branch Not Taken Execute successor instructions in sequence Squash instructions in pipeline if branch actually taken Advantage of late pipeline state update 47% MIPS branches not taken on average PC+4 already calculated, so use it to get next instruction #3: Predict Branch Taken 53% MIPS branches taken on average But haven t calculated branch target address in MIPS MIPS still incurs 1 cycle branch penalty Other machines: branch target known before outcome

Four Branch Hazard Alternatives #4: Delayed Branch Define branch to take place AFTER a following instruction branch instruction sequential successor 1 sequential successor 2... sequential successor n branch target if taken Branch delay of length n 1 slot delay allows proper decision and branch target address in 5 stage pipeline MIPS uses this

Scheduling Branch Delay Slots A. From before branch B. From branch target C. From fall through add $1,$2,$3 if $2=0 then delay slot sub $4,$5,$6 add $1,$2,$3 if $1=0 then delay slot add $1,$2,$3 if $1=0 then delay slot sub $4,$5,$6 becomes becomes becomes add $1,$2,$3 if $2=0 then if $1=0 then add $1,$2,$3 add $1,$2,$3 if $1=0 then sub $4,$5,$6 add $14,$15,$16 add $14,$15,$16 sub $4,$5,$6 A is the best choice, fills delay slot & reduces instruction count (IC) In B, the sub instruction may need to be copied, increasing IC In B and C, must be okay to execute sub/add when branch fails

Delayed Branch Compiler effectiveness for single branch delay slot: Fills about 60% of branch delay slots About 80% of instructions executed in branch delay slots useful in computation About 48% (60% x 80%) of slots usefully filled Delayed Branch downside: As processor go to deeper pipelines and multiple issue, the branch delay grows and need more than one delay slot Delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches Growth in available transistors has made dynamic approaches relatively cheaper

Evaluating Branch Alternatives Pipeline speedup = Pipeline depth 1 +Branch frequency Branch penalty Assume 4% unconditional branch, 6% conditional branch-untaken, 10% conditional branch-taken Scheduling Branch CPI speedup v. speedup v. scheme penalty unpipelined stall Stall pipeline 3 1.48 3.38 1.0 Predict taken 1 1.16 4.31 1.27 Predict not taken 1 1.10 4.55 1.35 Delayed branch 0.5 1.08 4.63 1.37

Problems with Pipelining Exception: An unusual event happens to an instruction during its execution Examples: divide by zero, undefined opcode Interrupt: Hardware signal to switch the processor to a new instruction stream Example: a sound card interrupts when it needs more audio output samples (an audio click happens if it is left waiting) Problem: Exception or interrupt must appear to be between 2 instructions (I i and I i+1 ) The effect of all instructions up to and including I i is totally completed No effect of any instruction after I i can take place The interrupt (exception) handler either aborts program or restarts at instruction I i+1

Precise Exceptions in Static Pipelines Key observation: architected state only change in memory and register write stages.