Appendix C. Abdullah Muzahid CS 5513

A "Typical" RISC ISA
- 32-bit fixed-format instructions (3 formats)
- 32 32-bit general-purpose registers (R0 contains zero)
- Single addressing mode for load/store: base + displacement, no indirection
- Simple branch conditions

Example: MIPS instruction formats

Register-Register (e.g. ADD, SUB)
  Bits:   31-26  25-21  20-16  15-11  10-0
  Field:  Op     Rs     Rt     Rd

Register-Immediate (e.g. ADDI, SUBI, loads, stores)
  Bits:   31-26  25-21  20-16  15-0
  Field:  Op     Rs     Rt     immediate

Branch (e.g. BEQZ)
  Bits:   31-26  25-21  20-16  15-0
  Field:  Op     Rs     0      immediate

Implementation of RISC Instructions

1. Instruction fetch cycle (IF)
   IR  <- Mem[PC]        ; IR holds the instruction
   NPC <- PC + 4

2. Instruction decode/register fetch cycle (ID)
   A   <- Regs[rs]       ; decode the instruction
   B   <- Regs[rt]       ; in the meantime
   Imm <- sign-extended immediate field of IR
   ; A, B, and Imm are all read; it is OK if some of them are not needed

3. Execution/effective address cycle (EX)
   Memory reference:    ALUOutput <- A + Imm
   Reg-Reg (ALU op):    ALUOutput <- A op B
   Reg-Immed (ALU op):  ALUOutput <- A op Imm
   Branch:              ALUOutput <- NPC + (Imm << 2)   ; address of target
                        cond <- (A op 0)                ; op is "equal" or "not equal"
   /* note: no instruction needs to do 2 of these operations */
   /* note: Imm holds a word count for branches; shift it left by 2 to get the byte offset added to the PC */

4. Memory access/branch completion cycle (MEM)   /* only for LD, ST, BR */
   Memory access:
     LMD <- Mem[ALUOutput]     ; for loads (LMD is the load memory data register)
     Mem[ALUOutput] <- B       ; for stores
   Branch:
     if (cond) PC <- ALUOutput
     else      PC <- NPC

5. Write-back cycle (WB)
   Reg-Reg ALU instr:  Regs[rd] <- ALUOutput
   Reg-Imm ALU instr:  Regs[rt] <- ALUOutput
   Load instruction:   Regs[rt] <- LMD

Branches take 4 cycles; the rest of the instructions take 5 cycles.

Now we will try to pipeline it. We need:
- At the end of each cycle, the data is stored in some registers (PC, LMD, Imm, A, B, ...). This allows other instructions to execute too.

If a program has 20% branches, 40% loads/stores, and 40% other instructions, what is the CPI?
A) 4.8   B) 4.2   C) 5   D) 4

Copyright Josep Torrellas 1999, 2001, 2002
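The question can be worked as a weighted average. The sketch below assumes the cycle counts from the write-back slide (branches 4 cycles, everything else 5); the names `CYCLES`, `average_cpi`, and `quiz_mix` are made up for illustration.

```python
# Cycle counts assumed from the preceding slide: branches finish in 4 cycles,
# every other instruction uses all 5.
CYCLES = {"branch": 4, "load_store": 5, "other": 5}


def average_cpi(mix, cycles=CYCLES):
    """Weighted-average CPI for an instruction mix given as fractions."""
    assert abs(sum(mix.values()) - 1.0) < 1e-9, "mix fractions must sum to 1"
    return sum(fraction * cycles[kind] for kind, fraction in mix.items())


quiz_mix = {"branch": 0.20, "load_store": 0.40, "other": 0.40}
print(f"CPI = {average_cpi(quiz_mix):.1f}")  # 0.2*4 + 0.4*5 + 0.4*5
```

Under these assumptions the result matches option A.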

Pipelining
- Multiple instructions are overlapped in execution
- Each is in a different stage
- Each stage is called a pipe stage or segment
- Throughput: number of instructions completed per cycle
- Each step takes a machine cycle
- Want to balance the work in each stage
- Ideally: time per instruction (pipelined) = time per instruction on the non-pipelined machine / number of pipe stages

Figure A.1 Simple RISC pipeline. On each clock cycle, another instruction is fetched and begins its 5-cycle execution.

                     Clock number
Instruction number   1    2    3    4    5    6    7    8    9
Instruction i        IF   ID   EX   MEM  WB
Instruction i + 1         IF   ID   EX   MEM  WB
Instruction i + 2              IF   ID   EX   MEM  WB
Instruction i + 3                   IF   ID   EX   MEM  WB
Instruction i + 4                        IF   ID   EX   MEM  WB
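The diagram in Figure A.1 follows one rule: instruction i + k enters IF in clock cycle k + 1 and advances one stage per cycle. A minimal sketch of that rule, assuming no stalls; `ideal_schedule` and `render` are hypothetical helper names.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def ideal_schedule(num_instructions):
    """For each instruction, map clock cycle -> stage, assuming no stalls."""
    return [
        {start + offset: stage for offset, stage in enumerate(STAGES)}
        for start in range(1, num_instructions + 1)
    ]


def render(schedule):
    """Return the diagram as text, one row per instruction."""
    last_cycle = max(max(row) for row in schedule)
    lines = []
    for i, row in enumerate(schedule):
        cells = [f"{row.get(cycle, ''):<5}" for cycle in range(1, last_cycle + 1)]
        lines.append(f"i+{i}: " + "".join(cells).rstrip())
    return "\n".join(lines)


print(render(ideal_schedule(5)))
```

With five instructions and five stages, the last instruction writes back in cycle 9, which is why the figure has nine clock columns.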

Events on every pipe stage

IF (any instruction):
  IF/ID.IR <- Mem[PC];
  IF/ID.NPC, PC <- (if ((EX/MEM.opcode == branch) & EX/MEM.cond) {EX/MEM.ALUOutput} else {PC+4});

ID (any instruction):
  ID/EX.A <- Regs[IF/ID.IR[rs]]; ID/EX.B <- Regs[IF/ID.IR[rt]];
  ID/EX.NPC <- IF/ID.NPC; ID/EX.IR <- IF/ID.IR;
  ID/EX.Imm <- sign-extend(IF/ID.IR[immediate field]);

EX:
  ALU instruction:
    EX/MEM.IR <- ID/EX.IR;
    EX/MEM.ALUOutput <- ID/EX.A func ID/EX.B; or
    EX/MEM.ALUOutput <- ID/EX.A op ID/EX.Imm;
  Load or store instruction:
    EX/MEM.IR <- ID/EX.IR;
    EX/MEM.ALUOutput <- ID/EX.A + ID/EX.Imm;
    EX/MEM.B <- ID/EX.B;
  Branch instruction:
    EX/MEM.ALUOutput <- ID/EX.NPC + (ID/EX.Imm << 2);
    EX/MEM.cond <- (ID/EX.A == 0);

MEM:
  ALU instruction:
    MEM/WB.IR <- EX/MEM.IR;
    MEM/WB.ALUOutput <- EX/MEM.ALUOutput;
  Load or store instruction:
    MEM/WB.IR <- EX/MEM.IR;
    MEM/WB.LMD <- Mem[EX/MEM.ALUOutput]; or
    Mem[EX/MEM.ALUOutput] <- EX/MEM.B;

WB:
  ALU instruction:
    Regs[MEM/WB.IR[rd]] <- MEM/WB.ALUOutput; or
    Regs[MEM/WB.IR[rt]] <- MEM/WB.ALUOutput;
  Load only:
    Regs[MEM/WB.IR[rt]] <- MEM/WB.LMD;

How to make it work?
- Use separate I and D caches
- Register file can be read or written in 0.5 cycles
- PC: incremented in IF; if a branch is taken, PC + (Imm << 2) is added in EX
- Cannot keep any state in IR: need to move it to another register every cycle (see picture)
- The pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB subsume the temporary ones, e.g. the destination register in a LD

Control of the pipeline: set the control of the 4 MUXes

Selects PC+4 or the branch target address

MUX is set by whether it is a branch or not; selects PC + 4 or Reg[rs]

MUX is set by whether it is a reg-reg ALU op or not; selects Reg[rt] or Immediate

MUX is set by whether it is a load or not; selects data or ALU output

One more MUX should be here. Why?

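A final MUX (not shown) in WB chooses the field in IR that determines which register stores the result:
- in reg-reg ALU: MEM/WB.IR[16..20] (rd)
- in reg-imm ALU and LD: MEM/WB.IR[11..15] (rt)
These bit positions count from the most significant bit as bit 0, so they correspond to bits 15-11 (rd) and 20-16 (rt) in the instruction-format slide.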

Example
- Unpipelined: 10 ns cycle time; 4 cycles for ALU (40%) and branch (20%) instructions, 5 cycles for memory instructions (40%)
- Pipelining adds 1 ns to the clock
- What is the speedup in execution rate?

Unpipelined: avg inst time = clock * avg CPI = 10 ns * ((40% + 20%) * 4 + 40% * 5) = 44 ns
Pipelined:   avg inst time = clock * avg CPI = 11 ns * 1 = 11 ns
Speedup = 44 / 11 = 4
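The same arithmetic as a short script, for checking other mixes or clock overheads. This is only a sketch of the slide's calculation; `avg_instruction_time` is a made-up helper name.

```python
def avg_instruction_time(clock_ns, mix_and_cycles):
    """Clock period times the weighted-average CPI.

    mix_and_cycles is a list of (fraction of instructions, cycles) pairs.
    """
    cpi = sum(fraction * cycles for fraction, cycles in mix_and_cycles)
    return clock_ns * cpi


# Unpipelined: 10 ns clock; ALU 40% and branch 20% at 4 cycles, memory 40% at 5.
unpipelined = avg_instruction_time(10.0, [(0.40, 4), (0.20, 4), (0.40, 5)])
# Pipelined: clock stretched to 11 ns, ideal CPI of 1.
pipelined = avg_instruction_time(11.0, [(1.0, 1)])
speedup = unpipelined / pipelined

print(f"{unpipelined:.0f} ns / {pipelined:.0f} ns = speedup {speedup:.1f}")
```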

Pipeline Hazards
Situations that prevent the next instruction from executing in its designated clock cycle:
- Structural: resource conflicts. E.g. 2 people want to use 1 laptop at the same time.
- Data: an instruction depends on the result of a previous one. E.g. all the exam and homework grades are required before calculating the final grade.
- Control: results from instructions that change the PC, e.g. BEQZ. First choose your course and then buy books.
The pipeline may have to stall.

CPI_pipelined = Ideal CPI + pipeline stall clock cycles per instruction

Structural Hazards
- Some combinations of instructions cannot be accommodated because of resource conflicts
- Usually because some functional unit is not pipelined: two instructions using it cannot proceed back to back
- Or some resource has not been replicated enough, e.g. 1 register file port, or a combined I/D memory
- Result: pipeline stall, as if we had inserted a bubble

                     Clock cycle number
Instruction          1     2     3     4     5     6     7     8     9     10
Load instruction     IF    ID    EX    MEM   WB
Instruction i + 1          IF    ID    EX    MEM   WB
Instruction i + 2                IF    ID    EX    MEM   WB
Instruction i + 3                      stall IF    ID    EX    MEM   WB
Instruction i + 4                                  IF    ID    EX    MEM   WB
Instruction i + 5                                        IF    ID    EX    MEM
Instruction i + 6                                              IF    ID    EX

Figure A.5 A pipeline stalled for a structural hazard: a load with one memory port. As shown here, the load

Example
- Machine 1: separate I and D memories
- Machine 2: unified I/D memory, clock rate 1.05 times higher
- 40% of instructions are data accesses
- Which is faster?

Avg. inst. time_1 = CPI * Clock cycle time_1 = 1 * Clock cycle time_1
Avg. inst. time_2 = CPI * Clock cycle time_2 = (1 + 0.4 * 1) * (Clock cycle time_1 / 1.05) = 1.3 * Clock cycle time_1

Machine 1 is faster, even though its clock is slower.

Why allow structural hazards?
- Reduce cost
- Speed up the functional unit
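A quick numeric check of the comparison, expressed in units of Machine 1's clock cycle. It assumes, as the slide does, an ideal CPI of 1 and one stall cycle per data access on the unified-memory machine; `relative_instruction_time` is a hypothetical helper.

```python
def relative_instruction_time(cpi, clock_speedup=1.0):
    """Average instruction time, measured in Machine 1 clock cycles."""
    return cpi / clock_speedup


# Machine 1: separate I and D memories, so no structural stalls.
machine1 = relative_instruction_time(cpi=1.0)
# Machine 2: each data access (40% of instructions) costs one stall cycle,
# but the clock runs 1.05 times faster.
machine2 = relative_instruction_time(cpi=1.0 + 0.4 * 1, clock_speedup=1.05)

print(f"Machine 1: {machine1:.2f}, Machine 2: {machine2:.2f}")
print("Machine 1 is faster" if machine1 < machine2 else "Machine 2 is faster")
```

The unrounded ratio is 1.4 / 1.05, which the slide rounds to 1.3.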

Data Hazards
Occur because pipelining changes the order of read/write accesses to operands

ADD R1, R2, R3
SUB R4, R5, R1
AND R6, R1, R7
OR  R8, R1, R9
XOR R10, R1, R11

Feed the ALU result back from EX/MEM or MEM/WB to the ALU input

Forwarding, Bypassing, or Short-Circuiting

Write into the register file in the 1st half of the clock cycle; read from the register file in the 2nd half

Forwarding
Need a forwarding path to the data memory input

ADD R1, R2, R3
LW  R4, 0(R1)
SW  12(R1), R4

Need a forwarding path from the memory output to the memory input

HW Change for Forwarding
[Figure: datapath with forwarding paths. Labels: NextPC, Registers, ID/EX, mux, ALU, EX/MEM, Data Memory, MEM/WR, Immediate]

Another Example

LD   R1, 0(R2)
DSUB R4, R1, R5
AND  R6, R1, R7
OR   R8, R1, R9

Without stall:
LD   R1,0(R2)   IF    ID    EX    MEM   WB
DSUB R4,R1,R5         IF    ID    EX    MEM   WB
AND  R6,R1,R7               IF    ID    EX    MEM   WB
OR   R8,R1,R9                     IF    ID    EX    MEM   WB

With stall:
LD   R1,0(R2)   IF    ID    EX    MEM   WB
DSUB R4,R1,R5         IF    ID    stall EX    MEM   WB
AND  R6,R1,R7               IF    stall ID    EX    MEM   WB
OR   R8,R1,R9                     stall IF    ID    EX    MEM   WB

All later instructions from the hazard point are stalled.

How to handle these hazards

1. Add hardware (a pipeline interlock) to detect the hazard and stall the pipeline until the hazard is cleared. The CPI of the stalled instruction (DSUB above) increases by 1.

2. Pipeline scheduling by the compiler: avoid putting a load immediately before an instruction that uses the loaded register.

   a = b + c
   d = e - f

   Unscheduled          Scheduled
   lw  Rb, b            lw  Rb, b
   lw  Rc, c            lw  Rc, c
   add Ra, Rb, Rc       lw  Re, e
   sw  Ra, a            add Ra, Rb, Rc
   lw  Re, e            lw  Rf, f
   lw  Rf, f            sw  Ra, a
   sub Rd, Re, Rf       sub Rd, Re, Rf
   sw  Rd, d            sw  Rd, d

- Pipeline scheduling can increase the register count required
- It is easier if scheduling happens within basic blocks: a basic block is a straight-line code sequence with no transfers in or out, except at the beginning or end
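To see what the scheduled version buys, the sketch below counts load-use stalls in both orderings. It assumes full forwarding, so the only stall is a load immediately followed by a reader of the loaded register. The tuple encoding and the name `load_use_stalls` are made up for illustration.

```python
# Each instruction: (opcode, destination register or None, source registers).
UNSCHEDULED = [
    ("lw", "Rb", []), ("lw", "Rc", []), ("add", "Ra", ["Rb", "Rc"]),
    ("sw", None, ["Ra"]), ("lw", "Re", []), ("lw", "Rf", []),
    ("sub", "Rd", ["Re", "Rf"]), ("sw", None, ["Rd"]),
]
SCHEDULED = [
    ("lw", "Rb", []), ("lw", "Rc", []), ("lw", "Re", []),
    ("add", "Ra", ["Rb", "Rc"]), ("lw", "Rf", []), ("sw", None, ["Ra"]),
    ("sub", "Rd", ["Re", "Rf"]), ("sw", None, ["Rd"]),
]


def load_use_stalls(program):
    """Count loads whose result is read by the very next instruction."""
    stalls = 0
    for (op, dest, _), (_, _, next_sources) in zip(program, program[1:]):
        if op == "lw" and dest in next_sources:
            stalls += 1
    return stalls


print("unscheduled stalls:", load_use_stalls(UNSCHEDULED))
print("scheduled stalls:  ", load_use_stalls(SCHEDULED))
```

In the unscheduled order, `add` follows the load of Rc and `sub` follows the load of Rf, so two stalls occur; the scheduled order separates each load from its first use. Note that the scheduled code keeps Re live across the `add`, which is the register-pressure cost the slide mentions.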

Classifying Data Hazards

     Inst i   Inst (i + j)
1.   Wr       Rd
2.   Wr       Wr
3.   Rd       Wr
4.   Rd       Rd

Classifying Data Hazards
- RAW (Read after Write): i + 1 tries to read before i writes
    ADD R1
    ADD R7, R1
- WAW (Write after Write): i + 1 tries to write before i writes. Not possible in MIPS. Why?
- WAR (Write after Read): i + 1 tries to write before i reads. Not possible in MIPS because instructions read first in ID and write in WB. Occurs when some instructions write early and others read late.
- RAR (Read after Read): no hazard
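The classification depends only on which registers each instruction reads and writes. A minimal sketch, assuming each instruction is described by a pair of register sets; `classify` is a hypothetical name, and the function reports potential hazards by name without deciding whether a given pipeline can actually expose them.

```python
def classify(earlier, later):
    """Name the data-hazard classes between two instructions.

    Each instruction is (registers written, registers read), both sets.
    """
    writes_i, reads_i = earlier
    writes_j, reads_j = later
    found = []
    if writes_i & reads_j:
        found.append("RAW")  # later reads what earlier writes
    if writes_i & writes_j:
        found.append("WAW")  # both write the same register
    if reads_i & writes_j:
        found.append("WAR")  # later overwrites what earlier reads
    return found


add = ({"R1"}, {"R2", "R3"})   # ADD R1, R2, R3
sub = ({"R4"}, {"R5", "R1"})   # SUB R4, R5, R1
print(classify(add, sub))      # SUB reads R1, which ADD writes
```

Two instructions that only read a common register produce an empty list, matching the RAR case.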

Control of MIPS Pipeline
- Passing from ID to EX: the instruction is issued
- All data hazards are detected in ID
- Comparators detect whether two register numbers are the same
- The only problem comes with a load in EX and a use in ID, as shown in the table
- Insert a bubble if there is a read in ID, a load in EX, and the read register number matches the load's destination register number

Situation: no dependence
  LD   R1,45(R2)
  DADD R5,R6,R7
  DSUB R8,R6,R7
  OR   R9,R6,R7
  Action: R1 is not used by the three following instructions, so no action is needed.

Situation: dependence requiring a stall
  LD   R1,45(R2)
  DADD R5,R1,R7
  DSUB R8,R6,R7
  OR   R9,R6,R7
  Action: comparators detect the use of R1 in DADD and stall DADD (and succeeding instructions) before DADD enters EX.

Situation: dependence defeated by forwarding
  LD   R1,45(R2)
  DADD R5,R6,R7
  DSUB R8,R1,R7
  OR   R9,R6,R7
  Action: comparators detect the use of R1 in DSUB and forward the loaded value in time for DSUB to enter EX.

Situation: dependence, but accesses in order
  LD   R1,45(R2)
  DADD R5,R6,R7
  DSUB R8,R6,R7
  OR   R9,R1,R7
  Action: OR reads R1 in the 2nd half of ID, while the write occurred in the 1st half (WB of LD).
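The four rows of the table reduce to one question: how far behind the load is the first instruction that reads the loaded register? A sketch of that decision, assuming the 5-stage pipeline with forwarding described above; the encoding and the names `ACTIONS` and `load_action` are hypothetical.

```python
# Distance (in instructions) from the load to the first reader of its result.
ACTIONS = {
    1: "stall",                   # load in EX while the reader is in ID
    2: "forward",                 # loaded value forwarded in time for EX
    3: "register-file ordering",  # write in 1st half cycle, read in 2nd
}


def load_action(code):
    """code[0] is the load; every entry is (destination, [sources])."""
    load_dest = code[0][0]
    for distance, (_, sources) in enumerate(code[1:4], start=1):
        if load_dest in sources:
            return ACTIONS[distance]
    return "no action"


# Second row of the table: DADD reads R1 right after the load.
stall_case = [("R1", ["R2"]), ("R5", ["R1", "R7"]),
              ("R8", ["R6", "R7"]), ("R9", ["R6", "R7"])]
print(load_action(stall_case))
```

Only the distance-1 case needs the interlock; the other two are handled by forwarding and by the split-cycle register file.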

Control Hazards: Branches
- When a branch is executed, it may or may not be taken
- If taken, the PC is not changed until the end of EX (the end of address calculation)

Branch          IF   ID   EX   MEM  WB
Successor            IF   IF   IF   ID   EX   MEM  WB
Successor + 1                       IF   ID   EX   MEM  WB

Overall, 2 cycles are lost.

Reducing Branch Stalls
Do, as soon as possible:
- Find out whether or not the branch is taken
- Find out the target address
How?
- Move the zero test (condition test) to ID
- Compute the target in ID (instead of EX) -> requires an extra adder
- Therefore: only 1 clock cycle stall (branch delay)

Branch instruction     IF   ID   EX   MEM  WB
Branch successor            IF   IF   ID   EX   MEM  WB
Branch successor + 1                  IF   ID   EX   MEM
Branch successor + 2                       IF   ID   EX

Still 10% - 30% performance loss
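The effect of the branch delay on CPI follows from the earlier relation CPI_pipelined = ideal CPI + stall cycles per instruction. The sketch below assumes an ideal CPI of 1 and that every branch pays the full penalty; the 20% branch frequency is borrowed from the earlier speedup example, and `pipelined_cpi` is a made-up name.

```python
def pipelined_cpi(branch_fraction, branch_penalty, ideal_cpi=1.0):
    """Ideal CPI plus the average branch stall cycles per instruction."""
    return ideal_cpi + branch_fraction * branch_penalty


resolved_in_ex = pipelined_cpi(branch_fraction=0.20, branch_penalty=2)
resolved_in_id = pipelined_cpi(branch_fraction=0.20, branch_penalty=1)

print(f"branch resolved in EX: CPI = {resolved_in_ex:.1f}")
print(f"branch resolved in ID: CPI = {resolved_in_id:.1f}")
```

One way to read the 10% to 30% figure, under these assumptions: with a 1-cycle penalty, the CPI grows by exactly the branch frequency, so programs with 10% to 30% branches lose that much relative to the ideal CPI of 1.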