Improving Performance: Pipelining

Similar documents
COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

Chapter 4. The Processor

Chapter 4. The Processor

Chapter 4. The Processor

Processor (I) - datapath & control. Hwansoo Han

The Processor. Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut. CSE3666: Introduction to Computer Architecture

COMPUTER ORGANIZATION AND DESIGN

Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN

Chapter 4. The Processor. Computer Architecture and IC Design Lab

Predict Not Taken. Revisiting Branch Hazard Solutions. Filling the delay slot (e.g., in the compiler) Delayed Branch

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor

The Processor: Datapath and Control. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Chapter 4 The Processor 1. Chapter 4A. The Processor

COMPUTER ORGANIZATION AND DESIGN. The Hardware/Software Interface. Chapter 4. The Processor: A Based on P&H

What is Pipelining? Time per instruction on unpipelined machine Number of pipe stages

3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle?

LECTURE 3: THE PROCESSOR

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 4: Datapath and Control

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1

4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3. Emil Sekerinski, McMaster University, Fall Term 2015/16

Lecture 7 Pipelining. Peng Liu.

CS3350B Computer Architecture Quiz 3 March 15, 2018

Computer Organization and Structure

Systems Architecture

COSC 6385 Computer Architecture - Pipelining

Advanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017

Data Hazards Compiler Scheduling Pipeline scheduling or instruction scheduling: Compiler generates code to eliminate hazard

The Processor (1) Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

What is Pipelining? RISC remainder (our assumptions)

Lecture 4: Review of MIPS. Instruction formats, impl. of control and datapath, pipelined impl.

COMPUTER ORGANIZATION AND DESI

CENG 3420 Lecture 06: Datapath

Chapter 4. The Processor. Instruction count Determined by ISA and compiler. We will examine two MIPS implementations

Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining

Chapter 4. The Processor Designing the datapath

COMP303 - Computer Architecture Lecture 8. Designing a Single Cycle Datapath

Processor Design Pipelined Processor (II) Hung-Wei Tseng

Computer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM

Pipelining. CSC Friday, November 6, 2015

Computer Organization and Structure. Bing-Yu Chen National Taiwan University

MIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14

Determined by ISA and compiler. We will examine two MIPS implementations. A simplified version A more realistic pipelined version

EECS 151/251A Fall 2017 Digital Design and Integrated Circuits. Instructor: John Wawrzynek and Nicholas Weaver. Lecture 13 EE141

ECE232: Hardware Organization and Design

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri

Review: Abstract Implementation View

CPU Organization (Design)

EECS150 - Digital Design Lecture 10- CPU Microarchitecture. Processor Microarchitecture Introduction

Mark Redekopp and Gandhi Puvvada, All rights reserved. EE 357 Unit 15. Single-Cycle CPU Datapath and Control

CPE 335 Computer Organization. Basic MIPS Architecture Part I

Pipeline design. Mehran Rezaei

CENG 3420 Computer Organization and Design. Lecture 06: MIPS Processor - I. Bei Yu

CSE 378 Midterm 2/12/10 Sample Solution

ECE260: Fundamentals of Computer Engineering

CSCI 402: Computer Architectures. Fengguang Song Department of Computer & Information Science IUPUI. Today s Content

Inf2C - Computer Systems Lecture Processor Design Single Cycle

Computer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM

Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3.

The Processor Pipeline. Chapter 4, Patterson and Hennessy, 4ed. Section 5.3, 5.4: J P Hayes.

The MIPS Processor Datapath

EE557--FALL 1999 MAKE-UP MIDTERM 1. Closed books, closed notes

Outline. A pipelined datapath Pipelined control Data hazards and forwarding Data hazards and stalls Branch (control) hazards Exception

T = I x CPI x C. Both effective CPI and clock cycle C are heavily influenced by CPU design. CPI increased (3-5) bad Shorter cycle good

Instruction Level Parallelism. ILP, Loop level Parallelism Dependences, Hazards Speculation, Branch prediction

Full Datapath. CSCI 402: Computer Architectures. The Processor (2) 3/21/19. Fengguang Song Department of Computer & Information Science IUPUI

Lecture Topics. Announcements. Today: Single-Cycle Processors (P&H ) Next: continued. Milestone #3 (due 2/9) Milestone #4 (due 2/23)

Chapter 4. The Processor

Pipeline Overview. Dr. Jiang Li. Adapted from the slides provided by the authors. Jiang Li, Ph.D. Department of Computer Science

Computer Architecture

Full Datapath. Chapter 4 The Processor 2

Chapter 4. The Processor

Chapter 4. The Processor

COMP2611: Computer Organization. The Pipelined Processor

Advanced Computer Architecture Pipelining

Review: Evaluating Branch Alternatives. Lecture 3: Introduction to Advanced Pipelining. Review: Evaluating Branch Prediction

EIE/ENE 334 Microprocessors

Major CPU Design Steps

EECS150 - Digital Design Lecture 9- CPU Microarchitecture. Watson: Jeopardy-playing Computer

The overall datapath for RT, lw,sw beq instrucution

EC 413 Computer Organization - Fall 2017 Problem Set 3 Problem Set 3 Solution

Instruction Pipelining Review

Instruction Level Parallelism

COMPUTER ORGANIZATION AND DESIGN

The Processor: Datapath & Control

Lecture 9: Case Study MIPS R4000 and Introduction to Advanced Pipelining Professor Randy H. Katz Computer Science 252 Spring 1996

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition. Chapter 4. The Processor

CO Computer Architecture and Programming Languages CAPL. Lecture 18 & 19

Introduction. Datapath Basics

Pipelining. Ideal speedup is number of stages in the pipeline. Do we achieve this? 2. Improve performance by increasing instruction throughput ...

CPE 335. Basic MIPS Architecture Part II

Systems Architecture I

Advanced Computer Architecture

CSEE 3827: Fundamentals of Computer Systems

CS61C : Machine Structures

Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome

Instr. execution impl. view

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

Transcription:

Improving Performance: Pipelining Memory General registers Memory ID EXE MEM WB Instruction Fetch (includes PC increment) ID Instruction Decode + fetching values from general purpose registers EXE EXEcute arithmetic/logic operations or address computation MEM MEMory access or branch completion WB Write Back results to general purpose registers (a.k.a. Commit) Inf3 Computer Architecture - 2011-2012 1

Phases of Instruction Execution Instruction Fetch InstructionRegister = Mem (INST, PC) Decoding Generate datapath control signals Determine register operands Operand Assembly Trivial for some ISAs, not for others E.g. select between literal or register operand; operand pre-scaling Sometimes considered to part of the Decode phase Function Evaluation or Address Calculation Add, subtract, shift, logical, etc. Address calculation is simply unsigned addition Memory Access (if required) Load: Data = Mem(DATA, MemAddress, Size) Store: MemWrite (DATA, MemAddress, WriteData, Size) Completion Update processor state modified by this instruction Interrupts or exceptions may prevent state update from taking place Inf3 Computer Architecture - 2011-2012 2

Instruction fetch from Instruction Cache at address given by PC Increment PC, i.e. PC = PC + sizeof(instruction) 4 Add PC Instruction memory Address Data Inf3 Computer Architecture - 2011-2012 3

MIPS R-type instruction format (revision) 6 bits 5 bits 5 bits 5 bits 5 bits 6 bits opcode reg rs reg rt reg rd shamt funct Destination register for R-type format add $1, $2, $3 special $2 $3 $1 add sll $4, $5, 16 special $5 $4 16 sll Inf3 Computer Architecture - 2011-2012 4

MIPS I-type instruction format (revision) 6 bits 5 bits 5 bits 16 bits opcode reg rs reg rt immediate value/addr Destination register for Load lw $1, offset($2) lw $2 $1 address offset beq $4, $5,.L001 beq $4 $5 (PC -.L001) >> 2 addi $1, $2, -10 addi $2 $1 0xfff6 Inf3 Computer Architecture - 2011-2012 5

ing Registers Use source register fields to address the register file and read two registers Select the destination register address, according to the format 4 Add PC Instruction memory Address Data inst [25:21] inst [20:16] Register File Addr 0 Addr 1 Data 0 Data 1 inst [15:11] Write Addr Write Data RegDst Inf3 Computer Architecture - 2011-2012 6

Extracting the literal operand Sign-extend the 16-bit literal field, for those instructions that have a literal 4 Add PC Instruction memory Address Data inst [25:21] inst [20:16] Register File Addr 0 Addr 1 Data 0 Data 1 inst [15:11] Write Addr Write Data RegDst inst [15:0] Sign extend Verilog lit = { {16{inst[15]}}, inst[15:0] } Inf3 Computer Architecture - 2011-2012 7

Performing the Arithmetic Perform arithmetic or logical operation on Data 0 and either Data 1 or the sign-extended literal 4 Add PC Instruction memory Address Data inst [25:21] inst [20:16] Register File Addr 0 Addr 1 Data 0 Data 1 ALU inst [15:11] Write Addr Write Data RegDst inst [15:0] Sign extend Inf3 Computer Architecture - 2011-2012 8

Inside the ALU Adder, Logic Unit, and Barrel Shifter are separate combinational logic blocks AndOp XorOp OrOp Logic unit A B + A Cout Add B Cin m u x ==0 Zero Result SubtractOp Barrel shifter B [4:0] LeftOp SignedOp ShiftOp Inf3 Computer Architecture - 2011-2012 9

Computing Branch Displacements Compute sum of PC and scaled, sign-extended literal displacement Can t share ALU, it might be needed for comparisons during branch operations 4 Add << 2 Add PCsrc PC Instruction memory Address Data inst [25:21] inst [20:16] Register File Addr 0 Addr 1 Data 0 Data 1 ALU inst [15:11] Write Addr Write Data RegDst inst [15:0] Sign extend Inf3 Computer Architecture - 2010-2011 10

Accessing Memory Loads & Stores Load and Store instructions use the ALU result as the effective address Store instructions use Data 1 as the store data 4 Add << 2 Add PCsrc PC Instruction memory Address Data inst [25:21] inst [20:16] inst [15:11] Register File Addr 0 Addr 1 Write Addr Write Data Data 0 Data 1 ALU MemRd MemWr Data Memory Address Write data data LoadReg RegDst inst [15:0] Sign extend Inf3 Computer Architecture - 2010-2011 11

Decoding Instructions Control signals driven by combinational logic, based on instruction opcode 4 Add inst [31:26] Decode logic LoadReg MemWr MemRd PCsrc ALUop ALUsrc << 2 Add PC Instruction memory Address Data inst [25:21] inst [20:16] inst [15:11] inst [5:0] RegDst Register File Addr 0 Addr 1 Write Addr Write Data Data 0 Data 1 ALU ALU decode zero Data Memory Address Write data data inst [15:0] Sign extend Inf3 Computer Architecture - 2010-2011 12

Pipelined Instruction Execution action Phases of Instruction Execution Fetch Decode Execute Memory Write clock Write Write Write Write Write Write Memory Memory Memory Memory Memory Memory Execute Execute Execute Execute Execute Execute Decode Decode Decode Decode Decode Decode 1 Fetch 2 Fetch Fetch 3 Fetch 4 Fetch 5 Fetch 2 time Inf3 Computer Architecture - 2010-2011 13

CPU Pipeline Structure DEC EX MEM WB Decode logic EX MEM MEM 4 Add PC+4 WB PC+4 << 2 Add WB bpc WB [31:26] PC Instruction memory Address Data [25:21] [20:16] Register File Addr 0 Addr 1 Write Data Write Addr Data 0 Data 1 ALU zero Branch decision Data Memory Address Write data data [15:0] Sign extend 6 ALU decode [15:11] Inf3 Computer Architecture - 2010-2011 14

Implementation Issues: Pipeline balance Each pipeline stage is a combinational logic network Registered inputs and outputs Longest circuit delay through all stages determines clock period D D Q Q Pipeline Stage Logic D D Q Q Ideally, all delays through every pipeline stage are identical In practice this is hard to achieve clk1 D Q clk2 Clock tree clock Inf3 Computer Architecture - 2010-2011 15

Representing a sequence of instructions Space-time diagram of pipeline Think of each instruction as a time-shifted pipeline c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 Instruction 1 Instruction 2 Instruction 3 Instruction 4 Instruction 5 Inf3 Computer Architecture - 2010-2011 16

Information flow constraints Information from one instruction to any successor, must always move from left to right c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 Instruction 1 Instruction 2 Instruction 3 Instruction 4 Instruction 5 Inf3 Computer Architecture - 2010-2011 17

Another way to represent pipeline timing A similar, and slightly simpler, way to represent pipeline timing: Clock cycles progress left to right Instructions progress top to bottom Time at which each instruction is present in each pipeline stage is shown by labelling appropriate cell with pipeline name This form is used in H&P, and throughout the remainder of these notes. Instruction \ cycle 1 2 3 4 5 6 7 8 9 instruction 1 DEC EX MEM WB instruction 2 DEC EX MEM WB instruction 3 DEC EX MEM WB instruction 4 DEC EX MEM WB instruction 5 DEC EX MEM WB Inf3 Computer Architecture - 2010-2011 18

Pipeline Hazards Hazards are pipeline events that restrict the pipeline flow They occur in circumstances where two or more activities cannot proceed in parallel There are three types of hazard: Structural Hazards Arise from resource conflicts, when a set of actions have to be performed sequentially because there is not sufficient resource to operate in parallel Data Hazards Occur when one instruction depends on the result of a previous instruction, and that result is not yet available. These hazards are exposed by the overlapped execution of instructions in a pipeline Control Hazards These arise from the pipelining of branch instructions, and other activities that change the PC. Inf3 Computer Architecture - 2010-2011 19

Structural Hazards Multi-cycle operations Memory or register file port restrictions Example structural hazard caused by having only one memory port Instruction \ cycle 1 2 3 4 5 6 7 8 9 10 lw $1, ($2) DEC EX M EM WB instruction 2 DEC EX M EM WB instruction 3 DEC EX M EM WB instruction 4 DEC EX M EM WB instruction 5 DEC EX M EM WB Effect is to STALL instruction 4, delaying its entry to by one cycle Instruction \ cycle 1 2 3 4 5 6 7 8 9 10 lw $1, ($2) DEC EX M EM WB instruction 2 DEC EX M EM WB instruction 3 DEC EX M EM WB instruction 4 DEC EX M EM WB instruction 5 DEC EX M EM WB Inf3 Computer Architecture - 2010-2011 20

Data Hazards Overlapped execution of instructions means information may be required before it is available. c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 ADD R1, R2, R3 SUB R4, R1, R5 AND R6, R1, r7 OR R8, r1, R9 XOR R10, R1, R11 Inf3 Computer Architecture - 2010-2011 21

Data hazards lead to pipeline stalls SUB instruction must wait until R1 has been written to register file All subsequent instructions are similarly delayed c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 ADD R1, R2, R3 SUB R4, R1, R5 STALL AND R6, R1, r7 OR R8, r1, R9 XOR R10, R1, R11 Inf3 Computer Architecture - 2010-2011 22

Minimising data hazards by data-forwarding Key idea is to bypass the register file and forward information, as soon as it becomes available within the pipeline, to the place it is needed. c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 ADD R1, R2, R3 SUB R4, R1, R5 AND R6, R1, r7 OR R8, r1, R9 XOR R10, R1, R11 Inf3 Computer Architecture - 2010-2011 23

CPU pipeline showing forwarding paths DEC EX MEM WB Decode logic EX MEM MEM PC 4 Add Instruction memory Address Data PC+4 [31:26] [25:21] [20:16] Dependency checks Register File Addr 0 Addr 1 Write Data Write Addr Data 0 Data 1 WB PC+4 Add << 2 ALU WB bpc zero Branch decision Data Memory Address Write data data WB [15:0] Sign extend 6 ALU decode [15:11] Inf3 Computer Architecture - 2010-2011 24

Data hazards requiring a stall Hazards involving the use of a Load result usually require a stall, even if forwarding is implemented c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 LW R1, (R2) SUB R4, R1, R5 STALL Reg ALU Mem Reg AND R6, R1, r7 OR R8, r1, R9 XOR R10, R1, R11 Inf3 Computer Architecture - 2010-2011 25

Code scheduling to avoid stalls (before) Hazards involving the use of a Load may be avoided by reordering the code c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 LW R1, 2(R2) LW R3, 4(R1) Reg STALL ALU Mem Reg ADD R4, R4, R3 Reg STALL ALU Mem Reg ADD R1, R1, 4 SUB R9, R9, 1 Inf3 Computer Architecture - 2010-2011 26

Code scheduling to avoid stalls (after) SUB is entirely independent of other instructions place after 1 st load ADD to R1 can be placed after LW to R3 to hide the load delay on R3 c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 LW R1, 2(R2) SUB R9, R9, 1 LW R3, 4(R1) ADD R1, R1, 4 ADD R4, R4, R3 Inf3 Computer Architecture - 2010-2011 27

General Performance Impact of Hazards Speedup from pipelining: S = CPI unpipelined x clock unpipelined CPI pipelined clock pipelined CPI pipelined = ideal CPI + stall cycles per instruction = 1 + stall cycles per instruction CPI unpipelined ~ pipeline depth clock unpipelined clock pipelined ~ 1 S = pipeline depth 1 + stall cycles per instruction Inf3 Computer Architecture - 2010-2011 28

Impact of Empty Load-delay Slots on CPI 3 2.5 2 FP structural stalls FP result stalls CPI 1.5 1 0.5 0 compress eqntott espresso gcc li doduc Benchmark ear hydro2d mdljdp su2cor Branch stalls Load stalls Base CPI H&P Fig. A.48 Bottom-line: CPI increase of 0.01 to 0.27 cycles Inf3 Computer Architecture - 2010-2011 29

Control Hazards When a branch is executed, PC is not affected until the branch instruction reaches the MEM stage. By this time 3 instructions have been fetched from the fall-through path. c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 BEQZ R1, label SUB R4, R2, R5 Kill instructions in EX, DEC and as they move forwards AND R6, R2, r7 OR R8, r2, R9 : : label: XOR R10, R1, R11 Inf3 Computer Architecture - 2010-2011 30

Effect of branch penalty on CPI In this example pipeline the cost of each branch is: 1 cycle, if the branch is not taken 4 cycles, if the branch is taken If an equal number of branches are taken and not taken, and if 20% of all instructions are branches (a reasonable assumption), then CPI = 0.8 + 0.2*2.5 = 1.3 This is a significant reduction in performance If the pipeline was deeper, with 2 stages for ALU and 2 stages for Decode, then: Cost of taken branch would be 6 cycles CPI = 0.8 + 0.2*3.5 = 1.5 Deeper pipelines have greater branch penalties, and potentially higher CPI Pentium 4 (Prescott) had 31 pipeline stages! (this was too deep) Several important techniques have been developed to reduce branch penalties Early branch outcome Delayed branches Branch prediction (static and dynamic) Inf3 Computer Architecture - 2010-2011 31

Early branch outcome calculation - BEQZ, BNEZ DEC EX MEM WB Decode logic EX MEM MEM 4 Add PC+4 << 2 Add WB WB WB [31:26] RD0 == 0? PC Instruction memory Address Data [25:21] [20:16] Register File Addr 0 Addr 1 Write Data Write Addr Data 0 Data 1 ALU Data Memory Address Write data data [15:0] Sign extend 6 ALU decode [15:11] Inf3 Computer Architecture - 2010-2011 32

Delayed branch execution Always execute the instruction immediately after the branch, regardless of branch outcome. c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 SUB R4, R2, R5 BEQZ R1, label OR R8, r2, R9 : : Before: instruction after the branch gets killed if the branch is taken label: XOR R10, R1, R11 BEQZ R1, label SUB R4, R2, R5 label: XOR R10, R1, R11 Branch delay slot After: by moving the SUB instruction into the branch delay slot, and executing it unconditionally, the 1-cycle penalty is eliminated Inf3 Computer Architecture - 2010-2011 33

Impact of Branch Hazards on CPI 3 2.5 2 FP structural stalls FP result stalls CPI 1.5 1 0.5 0 compress eqntott espresso gcc li doduc Benchmark ear hydro2d mdljdp su2cor Branch stalls Load stalls Base CPI H&P Fig. A.48 Bottom-line: CPI increase of 0.06 to 0.62 cycles Inf3 Computer Architecture - 2010-2011 34