COSC 6385 Computer Architecture - Pipelining

COSC 6385 Computer Architecture - Pipelining, Fall 2006. Some of the slides are based on a lecture by David Culler.

Instruction Set Architecture
Relevant features for distinguishing ISAs:
- Internal storage
- Memory addressing
- Type and size of operands
- Operations
- Instructions for flow control
- Encoding of the ISA

Pipelining
Pipelining is an implementation technique whereby multiple instructions are overlapped in execution:
- Split an expensive operation into several sub-operations
- Execute the sub-operations in a staggered manner
Real-world analogy: an assembly line in car manufacturing
- Each station is doing something different
- Each station is working on a separate car
Pipelining increases throughput, but does not reduce the latency of an operation.
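The throughput-vs-latency point can be checked with a toy timing model. All numbers here are illustrative assumptions (5 stages of 1 ns each), not from the slides:

```python
# Toy model: an operation takes 5 ns unpipelined; pipelining splits it
# into 5 stages of 1 ns each. (Illustrative numbers only.)

def unpipelined(n_ops, op_time_ns=5.0):
    """Total time when each operation must finish before the next starts."""
    return n_ops * op_time_ns

def pipelined(n_ops, n_stages=5, stage_time_ns=1.0):
    """Fill the pipe once (n_stages cycles), then finish one op per cycle."""
    return (n_stages + (n_ops - 1)) * stage_time_ns

# The latency of a single operation is unchanged: 5 ns either way.
print(unpipelined(1))      # 5.0
print(pipelined(1))        # 5.0
# But throughput for many operations approaches one op per stage time.
print(unpipelined(1000))   # 5000.0
print(pipelined(1000))     # 1004.0
```

For a long instruction stream the speedup approaches the number of stages, which is the ideal-speedup claim the later slides quantify.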

Classes of instructions
- ALU instructions: take either 2 registers as operands, or 1 register and one 16-bit immediate offset; results are stored in a 3rd register
- Load and store instructions
- Branches and jumps

Typical implementation of an instruction (I)
1. Instruction fetch cycle (IF)
- Send the PC to memory and fetch the current instruction
- Update the PC to the next sequential PC (+4 bytes)
2. Instruction decode/register fetch cycle (ID)
- Decode the instruction
- Read the registers named by the source register specifiers from the register file
- Sign-extend offset fields if needed
- Compute the possible branch target address

Typical implementation of an instruction (II)
3. Execution/effective address cycle (EX)
- Adds the base register and the offset to form the effective address, or
- Performs the ALU operation on the values read from the register file, or
- Performs the ALU operation on a value read from a register and the sign-extended immediate
4. Memory access cycle (MEM)
- If the instruction is a load, read memory using the effective address computed in step 3
- If the instruction is a store, write the data from the second register read of the register file to the effective address
5. Write-back cycle (WB)
- Write the result into the register file
- From memory for a load instruction
- From the ALU for an ALU instruction
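The five steps can be sketched as a tiny interpreter for a MIPS-like subset (add, lw, sw). The tuple encoding and dict-based register/memory state are our own toy representation, not the DLX encoding:

```python
# Minimal sketch of the five execution steps for a MIPS-like subset.
# Registers and memory are plain dicts; instructions are tuples.

def execute(instr, regs, mem, pc):
    """Run one instruction through the five logical steps; return new PC."""
    pc += 4                              # 1. IF: advance PC by 4 bytes
    op = instr[0]                        # 2. ID: decode, read sources
    if op == "add":                      # add rd, rs, rt
        _, rd, rs, rt = instr
        regs[rd] = regs[rs] + regs[rt]   # 3. EX: ALU op; 5. WB from ALU
    elif op == "lw":                     # lw rt, offset(rs)
        _, rt, offset, rs = instr
        addr = regs[rs] + offset         # 3. EX: effective address
        regs[rt] = mem[addr]             # 4. MEM read; 5. WB from memory
    elif op == "sw":                     # sw rt, offset(rs)
        _, rt, offset, rs = instr
        addr = regs[rs] + offset         # 3. EX: effective address
        mem[addr] = regs[rt]             # 4. MEM write; no WB
    return pc

regs = {"r1": 0, "r2": 3, "r3": 4, "r4": 0}
mem = {16: 7}
pc = execute(("add", "r1", "r2", "r3"), regs, mem, 0)
print(regs["r1"], pc)   # 7 4
```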

Typical implementation of an instruction (III)
[Datapath diagram: the five stages Instruction Fetch, Instruction Decode/Register Fetch, Execute/Address Calculation, Memory Access, and Write Back; the PC and a +4 adder feed the instruction memory, the register file (RS1, RS2, RD) feeds the ALU with zero detect, and the data memory's output (LMD) and the ALU result are selected by MUXes for write-back, with a sign-extension unit for the immediate.]

Datapath (I)
Fetching instructions and incrementing the program counter (PC)
[Diagram: the PC supplies the read address to the instruction memory; an adder increments the PC by 4.]

Datapath (II)
ALU instructions, e.g. add R1, R2, R3
- Register number inputs are 5 bits wide if you have 32 (= 2^5) registers
- ALU operation control signal (4 bits)
- Register write control signal
[Diagram: the register file with two read ports (Read data 1/2) and one write port feeds the ALU, which produces a result and a Zero flag.]

Datapath (III)
Load/store instructions, e.g. LW R1, offset(R2)
[Diagram: data memory with MemRead/MemWrite control signals; a sign-extension unit widens the 16-bit offset to 32 bits.]
Basic steps for a load/store operation:
- Sign-extend the offset from 16 to 32 bits
- Add the sign-extended offset to R2
- Load the content of the resulting address into R1, or store the data from R1 to the resulting memory address
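The sign-extension step above can be sketched directly. A minimal sketch in Python, using an int to stand in for the 32-bit result:

```python
def sign_extend16(value):
    """Sign-extend a 16-bit two's-complement value to a (32-bit) integer.
    If the sign bit (bit 15) is set, the upper bits are filled with ones,
    which in Python's unbounded ints means subtracting 2**16."""
    value &= 0xFFFF              # keep only the low 16 bits
    if value & 0x8000:           # sign bit set: negative offset
        value -= 0x10000
    return value

print(sign_extend16(0x0004))   # 4
print(sign_extend16(0xFFFC))   # -4
```

With the offset sign-extended, the effective address is simply `regs["R2"] + sign_extend16(offset)`.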

Datapath (IV)
Combining load/store and ALU instructions
[Diagram: one MUX selects between the second register read and the sign-extended immediate as the ALU's second input (ALU source); another MUX selects between the ALU result and the data memory's read data (memory-to-register) for write-back.]

Datapath (V)
Branches, e.g. beq R1, R2, offset
Basic steps for a branch-equal instruction:
- Compute the branch target address:
- Sign-extend the offset field
- Shift the offset field left by 2 bits to ensure a word offset
- Add the shifted, sign-extended offset to the PC
- Compare registers R1 and R2
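The target-address computation can be sketched as follows. Note that, as the next slide's datapath shows, the addend is PC+4 (the incremented PC from the instruction datapath):

```python
def branch_target(pc, offset16):
    """Branch target for beq: PC+4 plus the sign-extended 16-bit offset
    shifted left by 2 bits (word offset, as in the slide)."""
    off = offset16 & 0xFFFF
    if off & 0x8000:             # sign-extend the 16-bit offset
        off -= 0x10000
    return (pc + 4) + (off << 2)

print(branch_target(0, 1))      # 8: one word past the fall-through
print(branch_target(100, -1))   # 100: a backward branch to itself
```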

Datapath (VI)
Implementation of branches, e.g. beq R1, R2, offset
[Diagram: the sign-extended offset is shifted left by 2 and added to PC+4 from the instruction datapath to form the branch target; the ALU compares R1 and R2 and sends its result to the branch control logic.]

Visualizing pipelining
[Diagram: instructions in program order flowing through the pipeline over clock cycles 1-7; each instruction passes through the IF, ID, EX, MEM, and WB stages, offset by one cycle from its predecessor.]

Effects of pipelining
- A pipeline of depth n requires n times the memory bandwidth of a non-pipelined processor at the same clock rate
- Separate data and instruction caches eliminate some memory conflicts
- The register file is used in stage ID and in WB
- Usually not a conflict, since writes are executed in the first half of the clock cycle and reads in the second half
- Instructions in the pipeline should not attempt to use the same hardware resources at the same time
- Introduce pipeline registers between successive stages of the pipeline
- Registers are named after the stages they connect (e.g. IF/ID, ID/EX, etc.)

[Pipelined datapath diagram: the five stages Instruction Fetch, Instruction Decode/Register Fetch, Execute/Address Calculation, Memory Access, and Write Back, separated by the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB.]

Pipeline Hazards
Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
- Structural hazards: the hardware cannot support this combination of instructions
- Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
- Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

One Memory Port / Structural Hazards
[Diagram: a load followed by Instr 1-4 over cycles 1-7; with a single memory port, a later instruction's fetch falls in the same cycle as the load's memory access.]

One Memory Port / Structural Hazards
[Diagram: the same sequence with the conflict resolved by a stall; Instr 3 is delayed one cycle, inserting a bubble into the pipeline.]

Speedup Equation for Pipelining

CPI_pipelined = Ideal CPI + Average stall cycles per instruction

Speedup = (Ideal CPI x Pipeline depth) / (Ideal CPI + Pipeline stall CPI) x (Cycle time_unpipelined / Cycle time_pipelined)

For the simple RISC pipeline, Ideal CPI = 1:

Speedup = Pipeline depth / (1 + Pipeline stall CPI) x (Cycle time_unpipelined / Cycle time_pipelined)
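The speedup formula translates directly into a small helper; the parameter names are ours:

```python
def pipeline_speedup(depth, stall_cpi, ideal_cpi=1.0, cycle_ratio=1.0):
    """Speedup of a pipelined machine over an unpipelined one.
    cycle_ratio = unpipelined cycle time / pipelined cycle time.
    With ideal CPI = 1 this reduces to depth / (1 + stall_cpi) * cycle_ratio."""
    return ideal_cpi * depth / (ideal_cpi + stall_cpi) * cycle_ratio

print(pipeline_speedup(5, 0.0))   # 5.0: ideal 5-stage pipeline, no stalls
print(pipeline_speedup(5, 0.4))   # ~3.57: 0.4 stall cycles per instruction
```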

Example: Dual-port vs. Single-port
- Machine A: dual-ported memory ("Harvard architecture")
- Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate
- Ideal CPI = 1 for both; loads are 40% of the instructions executed

Speedup_A = Pipeline depth / (1 + 0) x (clock_unpipe / clock_pipe) = Pipeline depth
Speedup_B = Pipeline depth / (1 + 0.4 x 1) x (clock_unpipe / (clock_unpipe / 1.05)) = (Pipeline depth / 1.4) x 1.05 = 0.75 x Pipeline depth
Speedup_A / Speedup_B = Pipeline depth / (0.75 x Pipeline depth) = 1.33

Machine A is 1.33 times faster.
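The slide's arithmetic can be reproduced numerically. The pipeline depth value is an arbitrary placeholder, since it cancels out of the ratio:

```python
# Slide's numbers: ideal CPI = 1, loads are 40% of instructions,
# machine B stalls one cycle per load but clocks 1.05x faster.
depth = 8.0                                # arbitrary; cancels in the ratio

speedup_a = depth / (1 + 0.0)              # dual-ported memory: no stalls
speedup_b = depth / (1 + 0.4 * 1) * 1.05   # 0.4 stall CPI, faster clock

print(round(speedup_b / depth, 2))         # 0.75: B's speedup per stage
print(round(speedup_a / speedup_b, 2))     # 1.33: A is 1.33x faster
```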

Data Hazard on R1
[Diagram: the sequence below flowing through the IF, ID/RF, EX, MEM, WB stages over successive clock cycles; the instructions after the add read r1 before the add has written it back.]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11

Three Generic Data Hazards
Read After Write (RAW): Instr J tries to read an operand before Instr I writes it
I: add r1,r2,r3
J: sub r4,r1,r3
Caused by a "dependence" (in compiler nomenclature). This hazard results from an actual need for communication.

Three Generic Data Hazards
Write After Read (WAR): Instr J writes an operand before Instr I reads it
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Called an "anti-dependence" by compiler writers. This results from the reuse of the name r1.
Can't happen in our 5-stage pipeline because:
- All instructions take 5 stages, and
- Reads are always in stage 2, and
- Writes are always in stage 5

Three Generic Data Hazards
Write After Write (WAW): Instr J writes an operand before Instr I writes it
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Called an "output dependence" by compiler writers. This also results from the reuse of the name r1.
Can't happen in the DLX 5-stage pipeline because:
- All instructions take 5 stages, and
- Writes are always in stage 5
We will see WAR and WAW in more complicated pipes.
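The three hazard classes reduce to set membership tests on destination and source registers. A minimal sketch, with our own (dest, sources) tuple encoding:

```python
def hazards(instr_i, instr_j):
    """Classify data hazards between instr_i and a LATER instr_j.
    Each instruction is (dest_reg, set_of_source_regs); dest_reg may be
    None for instructions that write no register."""
    di, si = instr_i
    dj, sj = instr_j
    found = set()
    if di is not None and di in sj:
        found.add("RAW")        # J reads what I writes: true dependence
    if dj is not None and dj in si:
        found.add("WAR")        # J writes what I reads: anti-dependence
    if di is not None and di == dj:
        found.add("WAW")        # both write the same name: output dependence
    return found

# add r1,r2,r3 then sub r4,r1,r3: a true (RAW) dependence on r1.
print(hazards(("r1", {"r2", "r3"}), ("r4", {"r1", "r3"})))   # {'RAW'}
```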

Forwarding to Avoid Data Hazard
[Diagram: the same sequence over the clock cycles, with forwarding paths routing the add's ALU result directly to the ALU inputs of the dependent instructions, avoiding the stalls.]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11

Data Hazard Even with Forwarding
[Diagram: a load followed by dependent instructions; the loaded value is not available until after MEM, too late to forward to the immediately following sub's EX stage.]
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9

Data Hazard Even with Forwarding
[Diagram: the same sequence with a one-cycle stall (bubble) inserted after the load, after which the loaded value can be forwarded to the sub.]
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
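The load-use case above is the one hazard that forwarding cannot remove in this pipeline. A small stall counter sketching that rule, with our own (op, dest, sources) tuple encoding:

```python
def load_use_stalls(program):
    """Count bubbles for load-use hazards in a 5-stage pipeline with full
    forwarding: a load's result is available only after MEM, so an
    immediately following instruction that reads it must stall one cycle.
    Instructions are (op, dest_reg, set_of_source_regs) tuples."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        if prev[0] == "lw" and prev[1] in curr[2]:
            stalls += 1
    return stalls

prog = [("lw",  "r1", {"r2"}),
        ("sub", "r4", {"r1", "r6"}),   # uses r1 right after the load: stall
        ("and", "r6", {"r1", "r7"}),   # one cycle later: forwardable
        ("or",  "r8", {"r1", "r9"})]
print(load_use_stalls(prog))           # 1
```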

Branches: Pipelined Datapath
[Diagram: the pipelined datapath with the branch-target adder and the zero test in the pipeline; the branch target and outcome drive the next-PC MUX.]

Four Branch Hazard Alternatives
#1: Stall until the branch direction is clear
#2: Predict branch not taken
- Execute successor instructions in sequence
- Squash instructions in the pipeline if the branch is actually taken
- Advantage of late pipeline state update
- 47% of branches are not taken on average
- PC+4 is already calculated, so use it to get the next instruction
#3: Predict branch taken
- 53% of branches are taken on average
- But we haven't calculated the branch target address yet, so this still incurs a 1-cycle branch penalty
- Other machines: branch target known before outcome
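The cost of a prediction scheme can be expressed as extra CPI. A sketch using the slide's 53% taken rate; the 20% branch frequency and 1-cycle mispredict penalty are assumptions for illustration:

```python
def branch_cpi(branch_frac, taken_frac, penalty_taken, penalty_not_taken=0):
    """Extra CPI from branches under predict-not-taken: only taken
    branches pay the flush penalty. branch_frac is the fraction of all
    instructions that are branches."""
    return branch_frac * (taken_frac * penalty_taken
                          + (1 - taken_frac) * penalty_not_taken)

# 53% taken (from the slide), assumed 20% branch frequency, 1-cycle penalty:
print(round(branch_cpi(0.20, 0.53, 1), 3))   # 0.106 extra cycles per instruction
```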

Four Branch Hazard Alternatives
#4: Delayed branch
- Define the branch to take place AFTER a following instruction:
  branch instruction
  sequential successor 1
  sequential successor 2
  ...
  sequential successor n
  branch target if taken
- Branch delay of length n
- A 1-slot delay allows a proper decision and branch target address in the 5-stage pipeline

Delayed Branch
Where to get instructions to fill the branch delay slot?
- From before the branch instruction
- From the target address: only valuable when the branch is taken
- From the fall-through: only valuable when the branch is not taken
Compiler effectiveness for a single branch delay slot:
- Fills about 60% of branch delay slots
- About 80% of the instructions executed in branch delay slots are useful in computation
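Combining the slide's two figures gives the net fraction of delay slots that end up doing useful work, a quick back-of-the-envelope check:

```python
# From the slide: ~60% of delay slots are filled, and ~80% of the
# instructions placed in them do useful work.
filled = 0.60
useful_when_filled = 0.80
net_useful = round(filled * useful_when_filled, 2)
print(net_useful)   # 0.48: roughly half of all delay slots do useful work
```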