ECE260: Fundamentals of Computer Engineering


Pipelining
James Moscola
Dept. of Engineering & Computer Science, York College of Pennsylvania
Based on Computer Organization and Design, 5th Edition, by Patterson & Hennessy

What is Pipelining?
- Pipelining is an implementation technique in which multiple instructions are overlapped in execution
- A pipeline has multiple stages, where each stage performs a specific task
- The single-cycle datapath is divided into smaller combinational logic elements separated by sequential storage elements
- Instructions pass through the pipeline one stage at a time
- Allows multiple instructions to be worked on concurrently
  - Reduces the time to complete multiple instructions
  - Does not reduce the time to finish any one single instruction
- A processor pipeline behaves similarly to an assembly line

Pipelining Analogy
- Pipelined laundry: overlapping execution of different stages
- Parallelism improves performance
- Non-pipelined version (8 hours): wash, then dry, then fold, then store; repeat for each load
- Pipelined version (3.5 hours): wash, then dry AND start the next wash, etc.
  - Allows stages to operate on multiple loads of laundry concurrently, reducing the time to finish all laundry (improves throughput)
  - Each load of laundry still takes the same amount of time
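The laundry arithmetic above can be checked in a few lines of Python; a minimal sketch, assuming four loads and four half-hour stages (wash, dry, fold, store):

```python
STAGE_HOURS = 0.5  # assumed duration of each laundry stage
NUM_STAGES = 4     # wash, dry, fold, store
NUM_LOADS = 4

# Non-pipelined: each load runs all four stages before the next load starts.
sequential_hours = NUM_LOADS * NUM_STAGES * STAGE_HOURS

# Pipelined: the first load fills the pipeline, then one stage-time per load.
pipelined_hours = (NUM_STAGES + NUM_LOADS - 1) * STAGE_HOURS

print(sequential_hours, pipelined_hours)  # 8.0 3.5
```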

MIPS Single-Cycle Datapath (not pipelined yet)

MIPS Pipelined Datapath
- The single-cycle MIPS datapath can be divided into the classic 5-stage pipeline:
  - IF: Instruction Fetch
  - ID: Instruction Decode and Register Read
  - EX: Execution (or address calculation)
  - MEM: Data Memory Access
  - WB: Write Back
- Divide the single-cycle datapath to reduce the amount of combinational logic between sequential elements
  - Place pipeline registers between each stage
- (Figure: the single-cycle datapath with dashed lines indicating where the stages will be separated)

The MIPS Pipeline: a Classic 5-Stage Pipeline
Five stages, one step per stage:
1) IF: Instruction fetch from memory
   - Uses the PC to address instruction memory and retrieve the instruction
2) ID: Instruction decode & register read
   - The controller decodes the instruction while values are retrieved from the register file
3) EX: Execute the arithmetic operation or calculate an address
   - Sometimes referred to as the ALU stage
4) MEM: Access memory
   - Retrieves a value from data memory (if necessary)
5) WB: Write the result back to the register file
   - Results of arithmetic and load instructions are written back to the destination register
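The stage-per-cycle behavior of the five stages can be sketched as a small scheduling function; this is an illustrative model (names and representation are my own), not the hardware itself:

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_schedule(num_instructions):
    """Map each instruction to the clock cycle in which each stage runs,
    assuming one instruction enters the pipeline per cycle (no hazards)."""
    return {f"I{i + 1}": {stage: i + s for s, stage in enumerate(STAGES)}
            for i in range(num_instructions)}

schedule = pipeline_schedule(3)
# Instruction 1 fetches in cycle 0; instruction 2 executes in cycle 3;
# instruction 3 writes back in cycle 6.
print(schedule["I1"]["IF"], schedule["I2"]["EX"], schedule["I3"]["WB"])  # 0 3 6
```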

Pipeline Performance
Assume the time for the pipeline stages is as follows: 100 ps for ID and WB; 200 ps for the others (IF, EX, MEM).

Instruction | IF (ps) | ID (ps) | EX (ps) | MEM (ps) | WB (ps) | Total Latency (ps)
lw          |   200   |   100   |   200   |   200    |   100   |   800
sw          |   200   |   100   |   200   |   200    |         |   700
R-type      |   200   |   100   |   200   |          |   100   |   600
beq         |   200   |   100   |   200   |          |         |   500

- The lw instruction uses all stages of the pipeline
- If implemented in a single-cycle datapath, an 800 ps clock period is required (1.25 GHz)
- If implemented in a pipelined datapath, the longest stage of the pipeline determines the clock period: 200 ps in the example shown here (5 GHz)
- The longest stage of the pipeline is the critical path
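The latency table follows directly from the per-stage times; a quick cross-check in Python, encoding the stage usage of each instruction type as listed in the table:

```python
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Stages actually used by each instruction type, per the table above.
STAGES_USED = {
    "lw":     ["IF", "ID", "EX", "MEM", "WB"],
    "sw":     ["IF", "ID", "EX", "MEM"],
    "R-type": ["IF", "ID", "EX", "WB"],
    "beq":    ["IF", "ID", "EX"],
}

latency_ps = {op: sum(STAGE_PS[s] for s in used) for op, used in STAGES_USED.items()}
single_cycle_period = max(latency_ps.values())  # 800 ps -> 1.25 GHz clock
pipelined_period = max(STAGE_PS.values())       # 200 ps -> 5 GHz clock

print(latency_ps)  # {'lw': 800, 'sw': 700, 'R-type': 600, 'beq': 500}
```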

Pipeline Performance (continued)
- Three lw instructions in a single-cycle datapath: each instruction completes all of IF, ID, EX, MEM, and WB before the next one begins; clock period of 800 ps
- Three lw instructions in a pipelined datapath: successive instructions overlap, each one stage behind the previous; clock period of 200 ps

Pipeline Speedup
- Measured as the time between instructions: the amount of time between starting instruction x and starting instruction x+1
- If all pipeline stages are balanced (i.e., they all take the same amount of time):

  Time between instructions (pipelined) = Time between instructions (non-pipelined) / Number of pipeline stages

- If the pipeline stages are not balanced, the speedup is less
- Pipeline speedup is due to increased throughput
  - i.e., less time between instructions
  - i.e., more instructions executed per unit time
- The time to execute each individual instruction does not decrease
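A small numeric check of the speedup claim, assuming the 800 ps single-cycle period and 200 ps pipelined period from the earlier example: because the stages are unbalanced, the speedup approaches 4x (the 800/200 period ratio) rather than the ideal 5x.

```python
def pipeline_speedup(n, single_cycle_ps=800, stage_ps=200, stages=5):
    """Speedup for n instructions: single-cycle total time over pipelined
    total time (fill the 5-stage pipe, then one instruction per cycle)."""
    single_total = n * single_cycle_ps
    pipelined_total = (stages + n - 1) * stage_ps
    return single_total / pipelined_total

print(round(pipeline_speedup(3), 2))          # 1.71 for just 3 instructions
print(round(pipeline_speedup(1_000_000), 2))  # 4.0 in the limit, not 5.0
```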

Hazards
Situations that prevent starting the next instruction on the next clock cycle. Three main types:
- Structural hazard: the next instruction cannot execute because a required hardware resource is busy
- Data hazard: the next instruction needs to wait for a previous instruction to complete its data read/write (data dependence)
- Control hazard: the next instruction depends on a control action of a previous instruction

Structural Hazards
- The next instruction cannot execute because a required hardware resource is busy (a conflict for use of a resource)
- Example: if MIPS had only a single memory shared as both instruction and data memory
  - Could not read the next instruction while also performing the MEM stage of a load/store instruction
  - Load/store instructions require access to memory
  - Instruction fetch would have to stall for the cycle in which the load/store accesses memory
  - Stalling instruction fetch would cause a pipeline bubble
- Pipelined datapaths therefore require separate instruction/data memories (or separate instruction/data caches)
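The single-memory conflict can be located with simple pipeline arithmetic; a sketch under the usual assumptions (one instruction fetched per cycle, MEM is the fourth stage):

```python
MEM_STAGE_INDEX = 3  # IF=0, ID=1, EX=2, MEM=3, WB=4

def conflicting_fetch(load_issue_cycle):
    """With a single shared memory, return the cycle whose instruction fetch
    collides with a load's MEM access (instruction i fetches in cycle i)."""
    return load_issue_cycle + MEM_STAGE_INDEX

# A lw entering the pipeline in cycle 0 reaches MEM in cycle 3, exactly when
# the fourth instruction tries to fetch from the same memory.
print(conflicting_fetch(0))  # 3
```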

Data Hazards
- The next instruction needs to wait for a previous instruction to complete its data read/write
- A data dependence exists between the result of one instruction and the next:

  add $s0, $t0, $t1
  sub $t2, $s0, $t3

- The sub instruction requires $s0 as an input, but the add instruction will not yet have executed its WB stage!
  - The register file will contain stale data for $s0 when sub requests it
- A naive solution might just stall the pipeline by inserting bubbles, thereby allowing the WB of add to happen before the ID of sub (there's a better way...)
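The dependence in the add/sub pair can be detected mechanically; a minimal sketch, modeling each instruction as a (destination, sources) pair of my own devising rather than real decoded MIPS:

```python
def has_raw_hazard(producer, consumer):
    """True if the consumer reads a register the producer writes
    (a read-after-write dependence)."""
    dest, _ = producer
    _, sources = consumer
    return dest is not None and dest in sources

add_instr = ("$s0", ["$t0", "$t1"])  # add $s0, $t0, $t1
sub_instr = ("$t2", ["$s0", "$t3"])  # sub $t2, $s0, $t3

print(has_raw_hazard(add_instr, sub_instr))  # True: sub reads $s0 too early
```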

Forwarding (a.k.a. Bypassing) (a.k.a. the Better Way)
- Use the result of an instruction immediately after it is computed
  - Don't wait for it to be stored in a register
  - Requires extra hardware in the datapath (i.e., some wires and control)
- The result of instruction x is forwarded to instruction x+1 (bypassing the WB stage)
- The add instruction still completes normally and writes the register file during its WB stage

Load-Use Data Hazard
- Forwarding works very well in many cases, but not all
- Consider when a lw instruction precedes an instruction that uses the value being loaded
  - Cannot forward data backwards in time!
  - Must stall the pipeline and insert a bubble so the output of the MEM stage can be forwarded to the input of the EX stage of the next instruction
- Stalling the pipeline reduces instruction throughput (i.e., decreases performance)
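The load-use rule can be stated as a tiny stall calculator; a sketch assuming the classic 5-stage MIPS pipeline with full forwarding (ALU results reach the next instruction's EX in time; loaded values do not):

```python
def stalls_needed(producer_is_load, distance):
    """Bubbles required between a producing instruction and a dependent
    consumer, where distance=1 means the consumer is the very next
    instruction. With forwarding, only the load-use case needs a stall."""
    if producer_is_load and distance == 1:
        return 1  # lw result arrives after MEM, too late for the next EX
    return 0      # forwarding covers every other case

print(stalls_needed(producer_is_load=True, distance=1))   # 1: load-use hazard
print(stalls_needed(producer_is_load=False, distance=1))  # 0: forwarding covers it
print(stalls_needed(producer_is_load=True, distance=2))   # 0: value ready in time
```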

Code Scheduling to Avoid Stalls
- Be a clever programmer (or compiler) and avoid stalls
- Reorder code to avoid using a load result in the next instruction
- Below, two implementations of the same C code take different numbers of clock cycles to complete

Sample C code:

  a = b + e;
  c = b + f;

Implementation #1 — requires 13 clock cycles (two stalls required; a stall is inserted before each add that immediately follows a lw it depends on):

  lw  $t1, 0($t0)
  lw  $t2, 4($t0)
  add $t3, $t1, $t2
  sw  $t3, 12($t0)
  lw  $t4, 8($t0)
  add $t5, $t1, $t4
  sw  $t5, 16($t0)

Implementation #2 — requires 11 clock cycles (no stalls required):

  lw  $t1, 0($t0)
  lw  $t2, 4($t0)
  lw  $t4, 8($t0)
  add $t3, $t1, $t2
  sw  $t3, 12($t0)
  add $t5, $t1, $t4
  sw  $t5, 16($t0)

(In the pipelined diagram, each sw takes 5 clock cycles to complete.)
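The 13- vs 11-cycle counts can be reproduced with a simple model: one instruction issues per cycle, the 5-stage pipeline adds 4 fill cycles, and a bubble is inserted whenever a lw result is used by the very next instruction. The tuple encoding here is my own illustration, not a real assembler:

```python
def cycle_count(program, stages=5):
    """program: list of (dest, sources, is_load) tuples in issue order."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        dest, _, prev_is_load = prev
        _, sources, _ = curr
        if prev_is_load and dest in sources:
            stalls += 1  # load-use hazard: insert one bubble
    return (stages - 1) + len(program) + stalls

impl1 = [  # loads interleaved with dependent adds
    ("$t1", ["$t0"], True), ("$t2", ["$t0"], True),
    ("$t3", ["$t1", "$t2"], False), (None, ["$t3", "$t0"], False),
    ("$t4", ["$t0"], True), ("$t5", ["$t1", "$t4"], False),
    (None, ["$t5", "$t0"], False),
]
impl2 = [  # loads hoisted to the top, so no load-use pairs remain
    ("$t1", ["$t0"], True), ("$t2", ["$t0"], True), ("$t4", ["$t0"], True),
    ("$t3", ["$t1", "$t2"], False), (None, ["$t3", "$t0"], False),
    ("$t5", ["$t1", "$t4"], False), (None, ["$t5", "$t0"], False),
]
print(cycle_count(impl1), cycle_count(impl2))  # 13 11
```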

Control Hazards (a.k.a. Branch Hazards)
- A branch instruction determines the flow of control in a program
  - The next instruction to fetch and execute after a branch instruction depends on the branch outcome
- The outcome of a branch instruction isn't actually known until the EX stage
  - The pipeline can't always fetch the correct instruction (it doesn't yet know which instruction to get)
  - The pipeline is attempting to fetch the next instruction while the branch instruction is still in the ID stage
- Multiple solutions:
  - Stall until the result of the branch is determined (NYEH HEH HEH)
  - Guessing (more common than you'd guess)
  - Use a branch delay slot

Stall on Branch
- Wait until the branch outcome is determined before fetching the next instruction
- Without any additional hardware, this would require a stall of 3 clock cycles for each branch
  - Very inefficient looping behavior!
- Could add some additional hardware to get the result of the branch after the ID stage of the pipeline
  - Would still require one cycle of stalling

Branch Prediction (Yes, Guessing!)
- If the stall penalty is unacceptable, try to predict the result of a branch before the actual result is available
- Many implementations always assume that the branch is NOT taken and fetch the instruction that immediately follows the branch (this is the common case)
  - No stalls are inserted; the pipeline continues full speed ahead
  - If the prediction was correct, there is no penalty
- Sometimes the prediction will be wrong
  - A stall must be inserted when the prediction is wrong
  - The result of the wrongly fetched/executed instruction is discarded
- In the MIPS pipeline:
  - Can predict branches not taken
  - Fetches the instruction after the branch, with no delay
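The cost of predict-not-taken can be summarized in one line; a sketch with assumed numbers (a 1-cycle misprediction penalty, as when the branch is resolved after ID), not figures from the slides:

```python
def avg_cycles_per_branch(taken_fraction, mispredict_penalty=1):
    """Expected issue slots consumed per branch under predict-not-taken:
    1 cycle when the guess is right, plus the flush penalty when the
    branch is actually taken."""
    return 1 + taken_fraction * mispredict_penalty

print(avg_cycles_per_branch(0.0))  # 1.0: branches never taken, full speed
print(avg_cycles_per_branch(0.5))  # 1.5: half the branches pay the penalty
```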

MIPS with Predict Not Taken
- (Figure: predict-not-taken behavior when the prediction is correct, i.e., the branch was not taken and didn't need to be)
- (Figure: predict-not-taken behavior when the prediction is incorrect, i.e., the branch was predicted not taken but was actually taken)

Branch Delay Slots
- Rearrange instructions: put a useful instruction after the branch instruction
- The instruction following the branch (in the delay slot) is always executed
  - Pick an instruction that would have been executed regardless of whether the branch is taken or not taken
- A suitable choice for the branch delay slot is the instruction that immediately precedes the branch (the assembler can swap the two instructions)
  - The branch instruction cannot depend on the result of the preceding instruction

Before (the assembler can swap the add and beq instructions without affecting the program's results):

  add $t0, $t0, $t0
  beq $s0, $s1, label
  or  $t1, $t0, $t2
  ...

After (the add instruction is in the branch delay slot and will always execute):

  beq $s0, $s1, label
  add $t0, $t0, $t0
  or  $t1, $t0, $t2
  ...
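The assembler's safety check for the swap can be sketched as a predicate; register names here mirror the example above, but the encoding is illustrative:

```python
def can_fill_delay_slot(prev_dest, branch_sources):
    """The instruction before the branch may move into the delay slot only if
    the branch does not read the register that instruction writes."""
    return prev_dest not in branch_sources

# add $t0, $t0, $t0 / beq $s0, $s1, label: beq reads $s0 and $s1, not $t0,
# so the assembler may swap them and place add in the delay slot.
print(can_fill_delay_slot("$t0", ["$s0", "$s1"]))  # True
# But if the branch compared $t0, the swap would be illegal:
print(can_fill_delay_slot("$t0", ["$t0", "$s1"]))  # False
```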