What is Pipelining?


What is Pipelining? Pipelining is a key implementation technique used to make fast CPUs. It is an implementation technique whereby multiple instructions are overlapped in execution, taking advantage of the parallelism that exists among the actions needed to execute an instruction. Like an assembly line, each step in the pipeline completes a part of an instruction; these steps are called pipe stages or pipe segments. The time required to move an instruction one step down the pipeline is a processor cycle. So, assuming ideal conditions, the time per instruction on the pipelined processor is equal to:

Time per instruction (pipelined) = Time per instruction on unpipelined machine / Number of pipe stages
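A toy check of this formula; the 10 ns instruction time and 5 stages below are illustrative numbers, not from the slides:

```python
# Ideal-case pipelining: time per instruction divides by the stage count.
# The concrete numbers here are assumptions chosen for illustration.
def pipelined_time_per_instruction(unpipelined_time_ns, num_stages):
    """Ideal time per instruction on a pipelined machine."""
    return unpipelined_time_ns / num_stages

print(pipelined_time_per_instruction(10.0, 5))  # -> 2.0
```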

RISC reminder (our assumptions) Key features: All operations on data apply to data in registers (these instructions take either two registers or a register and a sign-extended immediate, operate on them, and store the result into a third register). The only operations that affect memory are load and store (these instructions take a register source, called the base register, and an immediate field, called the offset, as operands). The instruction formats are few in number, with all instructions typically being one size. Branches are conditional (for now we consider only comparisons for equality between two registers).

A simple implementation without pipelining Every instruction is implemented in at most 5 clock cycles, as follows: 1. Instruction fetch cycle (IF): send the PC to memory, fetch the instruction, update the PC. 2. Instruction decode/register fetch cycle (ID): decode the instruction, read the source registers, {do the equality test}, {sign-extend the offset field}, {compute the possible branch target}. 3. Execution/effective address cycle (EX): different functions are performed depending on the instruction type (memory reference, register-register, register-immediate). 4. Memory access (MEM): if load, the memory does a read using the effective address; if store, the memory writes the data from the second register. 5. Write-back cycle (WB): write the result into the register file (register-register ALU instruction or load instruction).

Five-Stage Pipeline Each of the clock cycles becomes a pipe stage. The stages execute in parallel, with one instruction entering the pipeline per cycle. Although each instruction still takes 5 cycles to complete, the CPI can drop from 5 to 1 (in the ideal case). Is it really so simple?
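The CPI claim can be illustrated with a quick cycle count, a sketch assuming the ideal case with no hazards: filling the pipeline costs 4 extra cycles once, after which one instruction completes every cycle.

```python
# Ideal 5-stage pipeline, no hazards (an assumption for this sketch):
# n instructions take 5*n cycles unpipelined, but only (5-1)+n pipelined.
def cycles_unpipelined(n, stages=5):
    return stages * n

def cycles_pipelined(n, stages=5):
    return (stages - 1) + n  # fill cost, then one completion per cycle

print(cycles_unpipelined(100))        # -> 500
print(cycles_pipelined(100))          # -> 104
print(cycles_pipelined(100) / 100)    # CPI approaches 1 as n grows
```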

The pipeline data paths [Figure: five instructions overlapped in time across clock cycles CC1-CC8; each instruction passes through the instruction memory (IM), register file read (REG), data memory (DM), and register file write (REG), shifted one cycle later per instruction.]

Three observations It is not so simple! What can happen on every clock cycle? For example, a single ALU cannot be asked to compute an effective address and perform an arithmetic operation at the same time. Happily, the major functional units are used in different cycles, so overlapping the execution of multiple instructions introduces relatively few conflicts. This fact rests on three observations: We use separate instruction and data memories (typically implemented as caches). The register file is used in two stages: it is read in ID and written in WB. To start a new instruction every clock, we must increment and store the PC every clock, and this must be done during the IF stage in preparation for the next instruction.

The pipeline data paths [Figure: the datapath skeleton with pipeline registers (such as EX/MEM and MEM/WB) separating adjacent stages.]

Basic Performance Issues Pipelining increases the CPU instruction throughput, but it does not reduce the execution time of an individual instruction; a program runs faster because of higher instruction throughput, not lower pipeline latency. An example: consider an unpipelined processor with a 1 ns clock cycle that uses 4 cycles for ALU operations and branches and 5 cycles for memory operations. Assume the relative frequencies of these operations are 40%, 20%, and 40%, respectively. The total clock overhead for the pipelined version is 0.2 ns. The average instruction execution time on the unpipelined processor is clock cycle * average CPI = 1 ns * ((40% + 20%) * 4 + 40% * 5) = 4.4 ns. For the pipelined processor the average instruction execution time is 1.2 ns (the clock must run at the speed of the slowest stage plus overhead). The speedup from pipelining is then 4.4 ns / 1.2 ns = 3.7 times.
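The slide's arithmetic can be reproduced directly:

```python
# Reproduces the slide's example: 1 ns clock; ALU ops and branches take
# 4 cycles, memory ops 5 cycles; frequencies 40% / 20% / 40%; 0.2 ns of
# clock overhead per pipelined cycle.
clock_ns = 1.0
avg_cpi_unpipelined = (0.40 + 0.20) * 4 + 0.40 * 5   # = 4.4
time_unpipelined = clock_ns * avg_cpi_unpipelined     # 4.4 ns

# Pipelined: one instruction per cycle, but the clock runs at the speed
# of the slowest stage (1 ns) plus the overhead (0.2 ns).
time_pipelined = clock_ns + 0.2                       # 1.2 ns

speedup = time_unpipelined / time_pipelined
print(round(speedup, 1))  # -> 3.7
```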

Pipeline Hazards Hazards prevent the next instruction in the instruction stream from executing during its designated clock cycle. Structural hazards arise from resource conflicts, when the hardware cannot support all possible combinations of instructions simultaneously in overlapped execution. Data hazards arise when an instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline. Control hazards arise from the pipelining of branches and other instructions that change the PC.

Performance of Pipes with Stalls I A stall causes the pipeline performance to degrade from the ideal performance. Let's start from the previous formula:

Speedup = Average instruction time unpipelined / Average instruction time pipelined
        = (CPI unpipelined * Clock cycle unpipelined) / (CPI pipelined * Clock cycle pipelined)

The ideal CPI on a pipelined processor is almost always 1. Hence, pipelining can be viewed as decreasing the CPI, and the pipelined CPI is:

CPI pipelined = Ideal CPI + Pipeline stall clock cycles per instruction
             = 1 + Pipeline stall clock cycles per instruction

Performance of Pipes with Stalls II When we ignore the cycle time overhead (the clock cycles are equal), the speedup can be expressed as:

Speedup = CPI unpipelined / (1 + Pipeline stall cycles per instruction)

If all instructions take the same number of cycles, which must also equal the number of pipeline stages (the depth of the pipeline), then CPI unpipelined equals the pipeline depth, so:

Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)

This leads to the result that pipelining can improve performance by the depth of the pipeline (if there are no pipeline stalls).

Performance of Pipes with Stalls III Now assume that the CPI of the unpipelined processor, as well as that of the pipelined one, is 1; pipelining is then viewed as decreasing the clock cycle time:

Speedup = 1 / (1 + Pipeline stall cycles per instruction) * (Clock cycle unpipelined / Clock cycle pipelined)

When the pipe stages are perfectly balanced and there are no overheads, the clock cycle of the pipelined processor is smaller than that of the unpipelined processor by a factor equal to the pipeline depth, so the speedup is expressed by:

Speedup from pipelining = 1 / (1 + Pipeline stall cycles per instruction) * Pipeline depth

This leads to the conclusion that, if there are no stalls, the speedup is equal to the number of pipeline stages.
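The final formula is easy to evaluate; in this sketch the 5-stage depth and the stall rate of 0.5 cycles per instruction are assumed numbers, not from the slides:

```python
# Speedup = pipeline depth / (1 + stall cycles per instruction), assuming
# perfectly balanced stages and no overhead. Numbers are illustrative.
def speedup_from_pipelining(pipeline_depth, stall_cycles_per_instr):
    return pipeline_depth / (1 + stall_cycles_per_instr)

print(speedup_from_pipelining(5, 0.0))            # no stalls -> 5.0 (= depth)
print(round(speedup_from_pipelining(5, 0.5), 2))  # -> 3.33
```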

Structural Hazards The overlapped execution of instructions requires pipelining of functional units and duplication of resources to allow all possible combinations of instructions in the pipeline. If some combination of instructions cannot be accommodated because of resource conflicts, the processor is said to have a structural hazard (this also occurs when some functional unit is not fully pipelined). It can happen when two instructions need simultaneous access to a shared resource such as memory or the register file. To resolve this, we stall one of the instructions until the required resource is available. A stall is commonly called a pipeline bubble.

Structural Hazards: an example [Figure: with a single memory port, a load performing its data memory access collides with a later instruction (Instr. 3) being fetched from the same memory in the same clock cycle.]

Structural Hazards: an example - solution

Instruction   Clock cycle number
              1    2    3    4      5    6    7    8    9    10
Load          IF   ID   EX   MEM    WB
Instr. 1           IF   ID   EX     MEM  WB
Instr. 2                IF   ID     EX   MEM  WB
Instr. 3                     stall  IF   ID   EX   MEM  WB
Instr. 4                            IF   ID   EX   MEM  WB
Instr. 5                                 IF   ID   EX   MEM

Structural hazard cost Let's assume: data references constitute 40% of the mix; the ideal CPI is 1; the clock rate of the processor with the structural hazard is 1.05 times higher than that of the processor without the hazard. Is the pipeline without the structural hazard faster, and by how much?

Average instr. time = CPI * Clock cycle time
                    = (1 + 0.4 * 1) * (Clock cycle time_ideal / 1.05)
                    = 1.3 * Clock cycle time_ideal

The processor without the structural hazard is about 1.3 times faster.
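The example's arithmetic, checked in Python:

```python
# Slide's structural-hazard example: data references are 40% of the mix,
# each paying a 1-cycle stall; ideal CPI is 1; the hazardous design has a
# 1.05x faster clock, so its cycle time is (ideal cycle time) / 1.05.
cpi_with_hazard = 1 + 0.4 * 1          # = 1.4
cycle_with_hazard = 1 / 1.05           # relative to the ideal cycle time

relative_time = cpi_with_hazard * cycle_with_hazard
print(round(relative_time, 2))  # -> 1.33, i.e. about 1.3x slower
```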

Data hazards Data hazards occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially executing instructions on an unpipelined processor. Let's consider the execution of the following instructions:

DADD R1,R2,R3
DSUB R4,R1,R5
AND  R6,R1,R7
OR   R8,R1,R9
XOR  R10,R1,R11

An example of data hazard

DADD R1,R2,R3
DSUB R4,R1,R5
AND  R6,R1,R7
OR   R8,R1,R9
XOR  R10,R1,R11

Three (or two) of the instructions cause a hazard, since they read R1 before DADD writes it back in WB. (If the register file is written in the first half of the clock cycle and read in the second half, OR reads the correct value and only DSUB and AND are hazarded.)

Data hazard - solution To solve the problem we use a technique called forwarding (also known as bypassing or short-circuiting). Forwarding works as follows: The ALU result from both the EX/MEM and MEM/WB pipeline registers is always fed back to the ALU inputs. If the forwarding hardware detects that a previous ALU operation has written the register corresponding to a source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file.
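The selection logic described above can be sketched in a few lines; this is a simplified model of the mux control, not the slides' actual hardware, and the function and register names are assumptions:

```python
# Forwarding decision for one ALU source operand: prefer the newest
# matching result from the EX/MEM or MEM/WB pipeline register over the
# (possibly stale) register-file read. A simplified sketch.
def forward_source(src_reg, ex_mem_dest, mem_wb_dest):
    """Return which value the ALU input mux should select."""
    if ex_mem_dest == src_reg:
        return "EX/MEM"    # result of the immediately preceding ALU op
    if mem_wb_dest == src_reg:
        return "MEM/WB"    # result from two instructions back
    return "REGFILE"       # no hazard: use the register-file value

# DADD R1,R2,R3 followed by DSUB R4,R1,R5: when DSUB reaches EX,
# DADD's R1 result sits in the EX/MEM pipeline register.
print(forward_source("R1", "R1", None))  # -> EX/MEM
print(forward_source("R5", "R1", None))  # -> REGFILE
```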

Data hazard - solution

DADD R1,R2,R3
DSUB R4,R1,R5
AND  R6,R1,R7
OR   R8,R1,R9
XOR  R10,R1,R11

We use the forwarding paths (marked blue in the figure) instead of the register-file reads (marked red) to avoid the data hazard.

Data hazard - next example To prevent a stall in this sequence, we would need to forward the values of the ALU output and memory unit output from the pipeline registers to the ALU and data memory inputs:

DADD R1,R2,R3
LD   R4,0(R1)
SD   R4,12(R1)

Data Hazards Requiring Stalls

LD   R1,0(R2)
DSUB R4,R1,R5
AND  R6,R1,R7
OR   R8,R1,R9

The load instruction can bypass its result to the AND and OR instructions, but not to the DSUB, since that would mean forwarding the result in negative time.

Pipeline interlock

Instruction      Clock cycle number
                 1    2    3      4      5    6    7    8    9
LD R1,0(R2)      IF   ID   EX     MEM    WB
DSUB R4,R1,R5         IF   ID     stall  EX   MEM  WB
AND R6,R1,R7               IF     stall  ID   EX   MEM  WB
OR R8,R1,R9                       stall  IF   ID   EX   MEM  WB

The load instruction has a delay or latency that cannot be eliminated by forwarding alone; we need to add a pipeline interlock.
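The interlock's decision can be sketched as a simple check; this is a simplified model assuming the 5-stage pipeline above, not the slides' actual hardware:

```python
# Load interlock check: stall the instruction in ID when the instruction
# in EX is a load whose destination register it reads, because the loaded
# value is not available until after MEM and cannot be forwarded in time.
def needs_load_interlock(id_source_regs, ex_is_load, ex_dest_reg):
    return ex_is_load and ex_dest_reg in id_source_regs

# LD R1,0(R2) in EX while DSUB R4,R1,R5 is in ID: stall one cycle.
print(needs_load_interlock({"R1", "R5"}, True, "R1"))   # -> True
# An ordinary ALU instruction in EX can forward its result: no stall.
print(needs_load_interlock({"R1", "R7"}, False, "R1"))  # -> False
```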

Branch Hazards The instruction after the branch is fetched, but the instruction is ignored, and the fetch is restarted once the branch target is known:

Instruction           1    2    3      4    5    6    7    8    9
Branch instruction    IF   ID   EX     MEM  WB
Branch successor           IF   stall  IF   ID   EX   MEM  WB
Branch successor +1                         IF   ID   EX   MEM  WB
Branch successor +2                              IF   ID   EX   MEM

Reducing pipeline branch penalties Simple compile-time schemes (static): Freeze (or flush) the pipeline, holding or deleting any instruction after the branch until the branch destination is known (previous slide). Predicted-not-taken (predicted-untaken): treat every branch as not taken, simply allowing the hardware to continue as if the branch were not executed; care must be taken not to change the processor state until the branch outcome is definitely known. An alternative scheme is to treat every branch as taken: as soon as the branch is decoded and the target address is computed, we assume the branch to be taken and begin fetching and executing at the target.
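The cost of such a scheme can be roughed out with a CPI model; the numbers below (20% branch frequency, 60% of branches taken, 1-cycle penalty) are illustrative assumptions, not from the slides:

```python
# Effective CPI under predicted-not-taken: only branches that turn out to
# be taken pay the penalty. All concrete numbers here are assumptions.
def effective_cpi(branch_freq, taken_fraction, penalty_cycles, ideal_cpi=1.0):
    return ideal_cpi + branch_freq * taken_fraction * penalty_cycles

# 20% branches, 60% of them taken, 1-cycle penalty:
print(round(effective_cpi(0.20, 0.60, 1), 2))  # -> 1.12
```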

Predicted-not-taken scheme

Untaken branch instr.   IF   ID   EX    MEM   WB
Instruction +1               IF   ID    EX    MEM   WB
Instruction +2                    IF    ID    EX    MEM   WB
Instruction +3                          IF    ID    EX    MEM   WB
Instruction +4                                IF    ID    EX    MEM   WB

Taken branch instr.     IF   ID   EX    MEM   WB
Instruction +1               IF   idle  idle  idle  idle
Branch target                     IF    ID    EX    MEM   WB
Branch target +1                        IF    ID    EX    MEM   WB
Branch target +2                              IF    ID    EX    MEM   WB

Delayed Branch In a delayed branch, the execution order with a branch delay of one is:

branch instruction
sequential successor
branch target (if taken)

The sequential successor sits in the branch delay slot. This instruction is executed whether or not the branch is taken. It is possible to have a branch delay longer than one; in practice, however, almost all processors with delayed branches have a single-instruction delay.

The behavior of a delayed branch

Untaken branch instr.        IF   ID   EX   MEM  WB
Branch delay instr. i+1           IF   ID   EX   MEM  WB
Instruction i+2                        IF   ID   EX   MEM  WB
Instruction i+3                             IF   ID   EX   MEM  WB
Instruction i+4                                  IF   ID   EX   MEM  WB

Taken branch instr.          IF   ID   EX   MEM  WB
Branch delay instr. i+1           IF   ID   EX   MEM  WB
Branch target                          IF   ID   EX   MEM  WB
Branch target +1                            IF   ID   EX   MEM  WB
Branch target +2                                 IF   ID   EX   MEM  WB