EITF20: Computer Architecture Part2.2.1: Pipeline-1

Similar documents
EITF20: Computer Architecture Part2.2.1: Pipeline-1

EITF20: Computer Architecture Part2.2.1: Pipeline-1

CPE Computer Architecture. Appendix A: Pipelining: Basic and Intermediate Concepts

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

Lecture 3. Pipelining. Dr. Soner Onder CS 4431 Michigan Technological University 9/23/2009 1

Computer Architecture

Pipelining. Maurizio Palesi

COSC 6385 Computer Architecture - Pipelining

Overview. Appendix A. Pipelining: Its Natural! Sequential Laundry 6 PM Midnight. Pipelined Laundry: Start work ASAP

Pipeline Overview. Dr. Jiang Li. Adapted from the slides provided by the authors. Jiang Li, Ph.D. Department of Computer Science

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

Appendix A. Overview

Instruction Pipelining Review

Page 1. Pipelining: Its Natural! Chapter 3. Pipelining. Pipelined Laundry Start work ASAP. Sequential Laundry A B C D. 6 PM Midnight

COSC4201 Pipelining. Prof. Mokhtar Aboelaze York University

What is Pipelining? Time per instruction on unpipelined machine Number of pipe stages

The Big Picture Problem Focus S re r g X r eg A d, M lt2 Sub u, Shi h ft Mac2 M l u t l 1 Mac1 Mac Performance Focus Gate Source Drain BOX

Modern Computer Architecture

Minimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline

Lecture 7 Pipelining. Peng Liu.

Instruction Level Parallelism. ILP, Loop level Parallelism Dependences, Hazards Speculation, Branch prediction

Appendix C: Pipelining: Basic and Intermediate Concepts

CSE 533: Advanced Computer Architectures. Pipelining. Instructor: Gürhan Küçük. Yeditepe University

What is Pipelining? RISC remainder (our assumptions)

CS 110 Computer Architecture. Pipelining. Guest Lecture: Shu Yin. School of Information Science and Technology SIST

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)

Chapter 4 The Processor 1. Chapter 4A. The Processor

MIPS An ISA for Pipelining

Multi-cycle Instructions in the Pipeline (Floating Point)

Pipelining. Principles of pipelining. Simple pipelining. Structural Hazards. Data Hazards. Control Hazards. Interrupts. Multicycle operations

Computer Architecture. Lecture 6.1: Fundamentals of

Basic Pipelining Concepts

Pipelining: Basic and Intermediate Concepts

MIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1

Advanced issues in pipelining

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

Pipeline: Introduction

Outline. Pipelining basics The Basic Pipeline for DLX & MIPS Pipeline hazards. Handling exceptions Multi-cycle operations

Appendix C. Instructor: Josep Torrellas CS433. Copyright Josep Torrellas 1999, 2001, 2002,

CPS104 Computer Organization and Programming Lecture 19: Pipelining. Robert Wagner

Advanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017

Computer Systems Architecture Spring 2016

Pipelining. Each step does a small fraction of the job All steps ideally operate concurrently

Instr. execution impl. view

3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle?

Lecture 05: Pipelining: Basic/ Intermediate Concepts and Implementation

PIPELINING: HAZARDS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah

CISC 662 Graduate Computer Architecture Lecture 6 - Hazards

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri

Appendix C. Abdullah Muzahid CS 5513

Pipelining. Principles of pipelining. Simple pipelining. Structural Hazards. Data Hazards. Control Hazards. Interrupts. Multicycle operations

COMPUTER ORGANIZATION AND DESIGN

LECTURE 10. Pipelining: Advanced ILP

CMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3. Complications With Long Instructions

Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

Predict Not Taken. Revisiting Branch Hazard Solutions. Filling the delay slot (e.g., in the compiler) Delayed Branch

CS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism

Pipelining! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar DEIB! 30 November, 2017!

Data Hazards Compiler Scheduling Pipeline scheduling or instruction scheduling: Compiler generates code to eliminate hazard

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

CS 61C: Great Ideas in Computer Architecture Pipelining and Hazards

ECE 154A Introduction to. Fall 2012

Lecture 7: Pipelining Contd. More pipelining complications: Interrupts and Exceptions

Lecture 15: Pipelining. Spring 2018 Jason Tang

Advanced Computer Architecture

EITF20: Computer Architecture Part3.2.1: Pipeline - 3

Outline Marquette University

Computer Architecture Spring 2016

CSCI 402: Computer Architectures. Fengguang Song Department of Computer & Information Science IUPUI. Today s Content

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

Pipelining: Hazards Ver. Jan 14, 2014

Computer Systems Architecture I. CSE 560M Lecture 5 Prof. Patrick Crowley

Chapter 4. The Processor

ECE 486/586. Computer Architecture. Lecture # 12

The Processor Pipeline. Chapter 4, Patterson and Hennessy, 4ed. Section 5.3, 5.4: J P Hayes.

Instruction Pipelining

CSEE 3827: Fundamentals of Computer Systems

LECTURE 3: THE PROCESSOR

Lecture 2: Processor and Pipelining 1

Full Datapath. CSCI 402: Computer Architectures. The Processor (2) 3/21/19. Fengguang Song Department of Computer & Information Science IUPUI

CMCS Mohamed Younis CMCS 611, Advanced Computer Architecture 1

The Processor. Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut. CSE3666: Introduction to Computer Architecture

POLITECNICO DI MILANO. Exception handling. Donatella Sciuto:

Full Datapath. Chapter 4 The Processor 2

Execution/Effective address

Chapter 4. The Processor

Slide Set 7. for ENCM 501 in Winter Term, Steve Norman, PhD, PEng

Computer System. Agenda

Complications with long instructions. CMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3. How slow is slow?

ECE 4750 Computer Architecture, Fall 2017 T05 Integrating Processors and Memories

Determined by ISA and compiler. We will examine two MIPS implementations. A simplified version A more realistic pipelined version

Pipelining. CSC Friday, November 6, 2015

C.1 Introduction. What Is Pipelining? C-2 Appendix C Pipelining: Basic and Intermediate Concepts

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition. Chapter 4. The Processor

Instruction Pipelining

Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome

Advanced Computer Architecture Pipelining

Transcription:

EITF20: Computer Architecture Part2.2.1: Pipeline-1 Liang Liu liang.liu@eit.lth.se 1

Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle operations Summary 2

Computer architecture Design and analysis ISA Orgnization (microarchitecture) Implementation To meet requirements of Functionality (application, standards ) Price Performance Power Reliability Dependability Compatability.. 3

Previous lecture Instruction set architecture 4

Instruction set principles Classification of instruction sets What s needed in an instruction set? Addressing Operands Operations Control Flow Encoding The impact of the compiler The MIPS instruction set architecture 5

Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle operations Summary 6

The Assembly Line... 7

The general pipeline principle Start again from laundry room Small laundry has one washer, one dryer and one folder, it takes 110 minutes to finish one load: Washer takes 40 minutes Dryer takes 50 minutes Folding takes 20 minutes Need to do 4 laundries 8

Not very smart way... Time 40 50 20 40 50 20 40 50 20 40 50 20 1 Laundries 2 3 4 110 min Total = N*(Washer+ Dryer+Folder) = 440 mins 9

If we pipelining Time 1 40 50 50 50 50 20 Laundries 2 3 4 The washer waits for the dryer for 10 minutes Total = Washer+N*Max(Washer,Dryer,Folder)+Folder = 260 mins 10

Pipeline Facts Time Multiple (independent) tasks operating simultaneously Laundries 1 2 3 40 50 50 50 50 20 Pipelining doesn t help latency of single task, it helps throughput of entire workload Pipeline rate limited by slowest pipeline stage Unbalanced lengths of pipe stages reduces speedup Potential speedup Number of pipe stages 4 11

Two basic digital components a b c Combinational Logic F Always: z <= F(a, b, c); z D Register clk Q Rising clock edge if clk event and clk= 1 then Q <= D; Propagation delay: After presenting new inputs Worst case delay before producing correct output clk D Q 1 1 2 2 3 3 time 12

Clock Frequency (RTL) What is the maximum clock frequency? Reg & & & Reg clk clk Register (perfect) Setup time: Tsu 0ps Hold time: Th 0ps AND-gate Propagation delay: Tprop 250ps 250 3=0.75ns f=1.333ghz 13

Pipeline stages Back to digital systems F X H P(X) Clock cycle G i i+1 i+2 i+3 Input X i X i+1 X i+2 X i+3 F Reg G Reg F(X i ) G(X i ) F(X i+1 ) G(X i+1 ) F(X i+2 ) G(X i+2 ) H Reg H(X i ) H(X i+1 ) H(X i+2 ) 14

One core the MIPS data-path 15

Classic RISC 5-stage pipeline Passed To Next Stage IR <- Mem[PC] NPC <- PC + 4 Instruction Fetch (IF): Send out the PC and fetch the instruction from memory into the instruction register (IR); increment the PC by 4 to address the next sequential instruction. IR (reg) holds the instruction that will be used in the next stage. NPC (reg) holds the value of the next PC (either sequential or jump). 16

Classic RISC 5-stage pipeline Passed To Next Stage A <- Regs[IR6..IR10]; B <- Regs[IR10..IR15]; Imm <- ((IR16) ##IR16-31 Instruction Decode/Register Fetch Cycle (ID): Decode the instruction and access the register file to read the registers. The outputs of the general purpose registers are read into two temporary registers (A & B) for use in later clock cycles. Extend the sign of the lower 16 bits of the Instruction Register (immediate). 17

Classic RISC 5-stage pipeline Passed To Next Stage A <- A func. B cond = 0; Execute Address Calculation (EX): Perform an operation (for an ALU) or an address calculation (if it s a load/store or a Branch). If an ALU, actually do the operation. If a (memory) address calculation, figure out how to obtain the address Branch decision 18

Classic RISC 5-stage pipeline Passed To Next Stage A = Mem[prev. B] or Mem[prev. B] = A MEMORY ACCESS (MEM): If this is an ALU, do nothing. If a load or store, then access memory. 19

Classic RISC 5-stage pipeline Passed To Next Stage Regs <- A, B; WRITE BACK (WB): Update the registers (GPR) from either the ALU or from the data loaded. 20

Power breakdown 21

Power breakdown (Cortex A-15) 22

Speed up Time = Instructions Cycles Time Program Program * Instruction * Cycle 23

Speed up Non-Pipelined Instruction Order lw $1, 100($0) lw $2, 200($0) lw $3, 300($0) 0 200 400 600 800 1000 1200 1400 1600 1800 Instruction Fetch REG RD ALU 800ps MEM REG WR Instruction Fetch REG RD ALU 800ps MEM REG WR Instruction Fetch Time Pipelined Instruction Order lw $1, 100($0) lw $2, 200($0) lw $3, 300($0) 0 200 400 600 800 1000 1200 1400 1600 Instruction Fetch 200ps REG RD Instruction Fetch 200ps ALU REG RD Instruction Fetch MEM ALU REG RD REG WR MEM ALU REG WR MEM REG WR 200ps 200ps 200ps 200ps 200ps 800ps Time 24

Speed up (idea case) Consider an un-pipelined processor. Assume: It has a 1 ns clock cycle and it uses 4 cycles for ALU operations and branches and 5 cycles for memory operations Relative frequencies of these operations are 40%, 20%, and 40% Pipelined: Due to clock skew and setup, pipelining adds 0.2ns of overhead to the clock The speedup in the instruction execution rate we can gain: Average instruction execution time = 1 ns * ((40% + 20%)*4 + 40%*5) = 4.4ns Speedup from pipeline = Average instruction time unpipelined/average instruction time pipelined = 4.4ns/1.2ns = 3.7 25

Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle operations Summary 26

Fundamental limitations Hazards can prevent next instruction from executing during its designated clock cycle: Structural hazards: Simultaneous use of a HW resource Data hazards: Data dependencies between instructions Control hazards: Change in program flow A way of solving hazards is to serialize the execution by inserting bubbles that effectively stall the pipeline (interlock). 27

Structure hazard ALU ALU ALU ALU ALU Time (clock cycles) Cycle 1 Cycle 2 Cycle 3 Cycle 4Cycle 5 Cycle 6 Cycle 7 Load Ifetch Reg DMem Reg Instr 1 Ifetch Reg DMem Reg Instr 2 Ifetch Reg DMem Reg Instr 3 Ifetch Reg DMem Reg Instr 4 Ifetch Reg DMem Reg 28 When two or more different instructions want to use same hardware resource in same cycle, e.g., MEM uses the same memory port (only one memory port)

A simple solution ALU ALU ALU ALU Time (clock cycles) Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Load Ifetch Reg DMem Reg Instr 1 Ifetch Reg DMem Reg Instr 2 Ifetch Reg DMem Reg Stall Bubble Bubble Bubble Bubble Bubble Instr 3 Ifetch Reg DMem Reg 29

Other solutions Stall low cost, simple Increases CPI use for rare case since stalling has performance effect Pipeline hardware resource useful for multi-cycle resources good performance sometimes complex Replicate resource good performance increases cost (+ maybe interconnect delay) useful for cheap resources 30

Dual/single port memory (65nm) Size Single-port Dual-port 64*16 12um 2 /bit 23um 2 /bit 256*16 4.6um 2 /bit 8um 2 /bit 512*16 4um 2 /bit 6um 2 /bit 31

Data hazard ALU ALU ALU ALU ALU Time (clock cycles) IF ID/RF EX MEM WB add r1,r2,r3 Ifetch Reg DMem Reg sub r4,r1,r3 Ifetch Reg DMem Reg and r6,r1,r7 Ifetch Reg DMem Reg or r8,r1,r9 Ifetch Reg DMem Reg xor r10,r1,r11 Ifetch Reg DMem Reg The use of the result of the ADD instruction in the next 3 instructions causes a hazard, since the register is not written until after those instructions read it. 32

Fundamental types of data hazard RAW (Read-After-Write) Instruction i + 1 reads A and i modifies A Instruction i+1 reads old A! WAR (Write-After-Read) Instruction i + 1 modifies A and instruction i reads new A WAW (Write-After-Write) Instructions i and i + 1 both modifies A The value in A is the one written by instruction i (RAR?) 33

Strategies for data hazard Interlock Wait for hazard to clear by holding dependent instruction in issue stage Forwarding Resolve hazard earlier by bypassing value as soon as available 34

Interlock and forwarding add x1, x3, x5 sub x2, x1, x4 F D F X D F M X D F W M X D F W M X D add x1, x3, x5 W M bubble bubble W X M W bubble Instruction interlocked in decode stage sub x2, x1, x4 F D X M W F D X M W add x1, x3, x5 sub x2, x1, x4 Bypass around ALU with no bubbles 35

PC Hardware support of forwarding Inst. Register Registers Imm B A ALU Store Fetch Decode EXecute Memory Writeback Instruction Cache Data Cache 36

Data hazard with forwarding Required date from MEM 37

Data hazard with forwarding 38

Software scheduling of load Try producing fast code for a = b + c; d = e - f; assuming a, b, c, d, e, and f are in memory LD LD DADD SD LD LD DSUB SD R1, B R2, C R3,R1,R2 R3, A R5, F R4, E R6,R4,R5 R6, D F D F X D F M X D F W M X D F W M X D W M W X M W How many stalls? How many stalls with hardware forwarding? 39

Software scheduling of load Try producing fast code for a = b + c; d = e - f; assuming a, b, c, d, e, and f are in memory. Architecture dependent optimization 40

Scheduling vs un-scheduling scheduled unscheduled gcc 31% spice 42% 14% tex 25% 54% 65% 0% 20% 40% 60% 80% % loads stalling pipeline 41

Control hazard Control hazard Need to find the destination of a branch, and can t fetch any new instructions until we know that destination Assume: branches are not resolved until the MEM stage Three wasted clock cycles: two stalls one extra instruction fetch (IF) If branch is not taken, the extra IF not needed 42

Hardware support to reduce control hazard Calculate target address and test condition in ID 1 clock cycle branch penalty instead of 3! 43

Four control hazard alternatives Stall until branch condition and target is known Predict branch not taken Execute successor instructions in sequence Squash instructions in pipeline if the branch is actually taken Works well if state is updated (WRITE) late in the pipeline 33 % MIPS conditional branches not taken on average one cycle penalty if taken 44

Four control hazard alternatives Predict Branch taken (benefit in our case?) 67 % MIPS conditional branches taken on average MIPS calculates target address in ID stage! Still one cycle penalty Delayed branch Define branch to take place after a following instruction Slot delay allows proper decision and branch target address calculation in a 5 stage pipeline such as MIPS 45

Compiler support for delay branch Scheduling from before is safe (if independent) Scheduling from target or fall through is not always safe 46

Pipeline speed up 47

Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle operations Summary 48

What s hard to implement Exceptions (fault, interrupt) 49

Exceptions PC Inst. Mem D Decode E + M Data Mem W PC address Exception Illegal Opcode Overflow Data address Exceptions When an interrupt occurs: How to stop the pipeline? How to restart the pipeline? Who caused the interrupt? A pipeline implements precise exceptions if: All instructions before the faulting instruction can complete All instructions after (and including) the faulting instruction can safely be restarted 50

Exceptions are difficult in pipeline We need to be able to restart an instruction that causes an exception: Force a trap instruction (e.g., some special routine call to handle the exception) into the pipeline Turn off all writes for the faulting instruction Save the PC for the faulting instruction to be used in return from exception handling If delayed branch is used we may need to save several PC s 51

Solution for simple MIPS Need to add control and datapaths to support exceptions and interrupts. When an exception or interrupt occurs, the following must be done: EPC <= PC Cause <= (cause code for event) Status <= (fault) PC <= (handler address) To return from an exception or datapath, the following must be done: PC <= EPC Status <= (fault clear) 52

Exceptions are difficult in pipeline Exceptions may be generated out-of-(program) order 53

Solution for simple MIPS Add a hardware status vector containing exceptions Pass along with instruction in the pipeline Hold exception flags in pipeline until commit point Turn of writes when an exception entered in the status vector Handle exceptions from status vector in WB (in program order) If exception at commit: update Cause and EPC registers, kill all stages, inject handler PC into fetch stage Inject external interrupts at commit point (override others) Handle at commit point NOT at exception point 54

Solution for simple MIPS Cause EPC PC Inst. Mem D Decode E + M Data Mem W PC address Exception Exc D Illegal Opcode Exc E Overflow Exc M Data address Exceptions Select Handler PC Kill F Stage PC D Kill D Stage PC E Kill E Stage PC M Asynchrono us Interrupts Kill Writeback F D X M W F D X M W 55

Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle operations Summary 56

Multi-cycle instruction in pipeline (FP) FP Instruction Latency Initiation Rate Add, Subtract 3 1 Multiply 6 1 Divide 24 25 57

Parallelism between integer and FP Instructions are issued in order Instructions may be completed out of order 58

Pipeline hazard Structural hazards RAW hazards WAW harzards WAR hazards 59

Pipeline hazard Structural hazards. Stall in ID stage if: The functional unit is occupied (applicable to DIV only) Any instruction already executing will reach the MEM/WB stage at the same time as this one RAW hazards: Normal bypassing from MEM and WB stages Stall in ID stage if any of the source operands is destination operand in any of the FP functional units 60

Pipeline hazard WAR hazards? There are no WAR-hazards since the operands are read (in ID) before the EX-stages in the pipeline WAW hazard SUB finishes before DIV which will overwrite the result from SUB! are eliminated by stalling SUB until DIV reaches MEM stage When WAW hazard is a problem? How about exception? 61

Exception Suppose the SUB instruction generates an arithmetic trap DIV instruction hasn t completed ADD instruction have completed Imprecise interrupt signaling: Another problem with out-of-order completion 62

Summary Pipelining (ILP): Speeds up throughput, not latency Speedup #stages Hazards limit performance, generate stalls: Structural: need more HW Data (RAW,WAR,WAW): need forwarding and compiler scheduling Control: delayed branch, branch prediction Complications: Precise exceptions may be difficult to implement 63