Pipelined Processor Design

Pipelined Processor Design
Pipelined Implementation: MIPS
Virendra Singh
Computer Design and Test Lab, Indian Institute of Science (IISc), Bangalore
virendra@computer.org
Advance Computer Architecture
http://www.serc.iisc.ernet.in/~viren/courses/aca/aca.htm

ILP: Instruction Level Parallelism
Single-cycle and multi-cycle datapaths execute one instruction at a time.
How can we get better performance?
Answer: execute multiple instructions at a time.
  Pipelining: enhance a multi-cycle datapath to fetch one instruction every cycle.
  Parallelism: fetch multiple instructions every cycle.

Automobile Team Assembly
Four tasks, 1 hour each.
1 car assembled every four hours
6 cars per day
180 cars per month
2,160 cars per year (12 months of 180 cars)

Automobile Assembly Line
Task 1 (1 hour): Mechanical
Task 2 (1 hour): Electrical
Task 3 (1 hour): Painting
Task 4 (1 hour): Testing
First car assembled in 4 hours (pipeline latency), thereafter 1 car per hour.
21 cars on the first day, thereafter 24 cars per day
717 cars per month
8,637 cars per year

Throughput: Team Assembly
[Timing diagram: the red car passes through Mechanical, Electrical, Painting, and Testing and is completed before the blue car is started.]
Time of assembling one car = n hours, where n is the number of nearly equal subtasks, each requiring 1 unit of time.
Throughput = 1/n cars per unit time

Throughput: Assembly Line
[Timing diagram: cars 1 to 4 each pass through Mechanical, Electrical, Painting, and Testing, each car starting one time unit after the previous one.]
Time to complete first car = n time units (latency)
Cars completed in time T = T - n + 1
Throughput = 1 - (n - 1)/T cars per unit time
\[
\frac{\text{Throughput (assembly line)}}{\text{Throughput (team assembly)}}
= \frac{1 - (n-1)/T}{1/n}
= n - \frac{n(n-1)}{T} \to n \quad \text{as } T \to \infty
\]
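The throughput formulas can be checked numerically. The following is a minimal sketch, not part of the original slides; the function names are hypothetical, and the 30-day month and 360-day year are assumptions inferred from the car counts quoted for the assembly line.

```python
def cars_completed(total_time: int, n_tasks: int) -> int:
    """Cars finished by an assembly line after total_time units: T - n + 1."""
    if total_time < n_tasks:
        return 0  # the first car is still inside the line
    return total_time - n_tasks + 1


def throughput_ratio(total_time: int, n_tasks: int) -> float:
    """Assembly-line throughput divided by team-assembly throughput (1/n)."""
    line_throughput = cars_completed(total_time, n_tasks) / total_time
    team_throughput = 1 / n_tasks
    return line_throughput / team_throughput


if __name__ == "__main__":
    n = 4  # Mechanical, Electrical, Painting, Testing
    # Assumed periods: one day, a 30-day month, a 360-day year (in hours).
    for hours in (24, 720, 8640):
        print(hours, cars_completed(hours, n), round(throughput_ratio(hours, n), 3))
```

With n = 4 the ratio moves toward 4 as T grows, which is the limit stated on the slide.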

Some Features of Assembly Line
Task 1 (1 hour): Mechanical
Task 2 (1 hour): Electrical, with electrical parts delivered just in time (JIT)
Task 3 (1 hour): Painting
Task 4 (1 hour): Testing
Defect found: stall the assembly line to fix the cause of the defect.
The 3 cars in the assembly line are suspects, to be removed (flush pipeline).

Pipelining in a Computer
Divide the datapath into nearly equal tasks, to be performed serially and requiring non-overlapping resources.
Insert registers at task boundaries in the datapath; registers pass the output data from one task as input data to the next task.
Synchronize tasks with a clock having a cycle time that just exceeds the time required by the longest task.
Break each instruction down into a fixed number of tasks so that instructions can be executed in a staggered fashion.

Single-Cycle Datapath
IF = instr. fetch; ID = instr. decode (also reg. file read); EX = execution (ALU operation); MEM = data access; WB = write back (reg. file write)

Instruction class                    IF    ID    EX    MEM   WB    Total time
lw                                   2ns   1ns   2ns   2ns   1ns   8ns
sw                                   2ns   1ns   2ns   2ns   idle  8ns
R-format (add, sub, and, or, slt)    2ns   1ns   2ns   idle  1ns   8ns
B-format (beq)                       2ns   1ns   2ns   idle  idle  8ns

idle: no operation on data; idle time equalizes instruction length to a fixed clock period.

Execution Time: Single-Cycle
Time (ns)         0 to 8             8 to 16            16 to 24
lw $1, 100($0)    IF ID EX MEM WB
lw $2, 200($0)                       IF ID EX MEM WB
lw $3, 300($0)                                          IF ID EX MEM WB
Clock cycle time = 8 ns
Total time for executing three lw instructions = 24 ns

Pipelined Datapath
Each stage occupies one 2 ns clock cycle, so every instruction class takes 5 x 2 ns = 10 ns.

Instruction class                    IF    ID    EX    MEM   WB    Total time
lw                                   2ns   2ns   2ns   2ns   2ns   10ns
sw                                   2ns   2ns   2ns   2ns   2ns   10ns
R-format (add, sub, and, or, slt)    2ns   2ns   2ns   2ns   2ns   10ns
B-format (beq)                       2ns   2ns   2ns   2ns   2ns   10ns

ID and WB need only 1 ns of their 2 ns slots. Where a stage performs no operation on data, idle time is inserted to equalize instruction lengths.

Execution Time: Pipeline
Time (ns)         0-2   2-4   4-6   6-8   8-10  10-12 12-14
lw $1, 100($0)    IF    ID    EX    MEM   WB
lw $2, 200($0)          IF    ID    EX    MEM   WB
lw $3, 300($0)                IF    ID    EX    MEM   WB
Clock cycle time = 2 ns, four times faster than single-cycle clock
Total time for executing three lw instructions = 14 ns
Performance ratio = Single-cycle time / Pipeline time = 24 / 14 = 1.7

Pipeline Performance
Clock cycle time = 2 ns
1,003 lw instructions:
  Total time for executing 1,003 lw instructions = 2,014 ns
  Performance ratio = Single-cycle time / Pipeline time = 8,024 / 2,014 = 3.98
10,003 lw instructions:
  Performance ratio = 80,024 / 20,014 = 3.998, close to the clock cycle ratio (4)
Pipeline performance approaches the clock-cycle ratio for long programs.
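The numbers on the two execution-time slides follow from one formula: a five-stage pipeline needs 5 + N - 1 cycles for N instructions when no stalls occur. A small sketch under that no-stall assumption (the function and constant names are illustrative, not from the slides):

```python
STAGES = 5            # IF, ID, EX, MEM, WB
PIPE_CLOCK_NS = 2     # pipelined clock cycle time from the slides
SINGLE_CLOCK_NS = 8   # single-cycle clock cycle time from the slides


def single_cycle_time_ns(n_instr: int) -> int:
    """Every instruction takes one full 8 ns clock cycle."""
    return n_instr * SINGLE_CLOCK_NS


def pipelined_time_ns(n_instr: int) -> int:
    """Fill the pipeline once, then complete one instruction per cycle."""
    if n_instr == 0:
        return 0
    return (STAGES + n_instr - 1) * PIPE_CLOCK_NS


def performance_ratio(n_instr: int) -> float:
    return single_cycle_time_ns(n_instr) / pipelined_time_ns(n_instr)


if __name__ == "__main__":
    for n in (3, 1_003, 10_003):
        print(n, single_cycle_time_ns(n), pipelined_time_ns(n),
              round(performance_ratio(n), 3))
```

The ratio stays below 4 for any finite N because the four fill cycles are never recovered; they only become negligible as N grows.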

Single-Cycle Datapath
IF: instr. fetch; ID: instr. decode, reg. file read; EX: execute, address calc.; MEM: mem. access; WB: write back
[Figure: single-cycle datapath with PC, PC + 4 adder, instr. mem., reg. file, sign ext., shift left 2, ALU, data mem., four multiplexers, and CONTROL. Instruction fields: opcode (bits 26-31), 21-25, 16-20, 11-15, 0-15, 0-5. Control signals: RegDst, RegWrite, Branch, ALUSrc, ALUOp, MemWrite, MemRead, MemtoReg.]

Pipelining of RISC Instructions
Steps: Fetch Instruction, Examine Opcode, Fetch Operands, Perform Operation, Store Result
IF:  Instruction Fetch
ID:  Instruction Decode and Fetch operands
EX:  Execute
MEM: Memory Operation
WB:  Write Back to Reg file
Although an instruction takes five clock cycles, one instruction is completed every cycle.

Pipeline Registers
[Figure: the single-cycle datapath (PC, instr. mem., reg. file, sign ext., shift left 2, ALU, data mem., CONTROL) with pipeline registers inserted at the stage boundaries.]
This requires a CONTROL not too different from single-cycle.

Pipeline Register Functions
Four pipeline registers are added, one at each stage boundary:

Register (between stages)   Data held
IF/ID                       PC+4, Instruction word (IW)
ID/EX                       PC+4, R1, R2, IW(0-15) sign ext., IW(11-15)
EX/MEM                      PC+4, zero, ALU result, R2, IW(11-15) or IW(16-20)
MEM/WB                      M[ALU result], ALU result, IW(11-15) or IW(16-20)

Pipelined Datapath
[Figure: pipelined datapath with PC, PC + 4 adder, instr. mem., reg. file, sign ext., shift left 2, ALU, data mem., and pipeline registers. The destination register number is taken from bits 11-15 for R-type and from bits 16-20 for I-type lw.]

Five-Cycle Pipeline
[Figure: one instruction passing through the pipeline in clock cycles CC1 to CC5, from instruction memory (IM) in CC1 to data memory (DM) in CC4.]

Add Instruction
add $t0, $s1, $s2
Machine instruction word:
  000000  10001  10010  01000  00000  100000
  opcode  $s1    $s2    $t0           function
CC1 (IF):  fetch from IM
CC2 (ID):  read $s1, read $s2
CC3 (EX):  add $s1 + $s2
CC4 (MEM): no data memory access
CC5 (WB):  write $t0
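The instruction word for add can be reproduced by packing the R-format fields. This is an illustrative sketch: the helper names are hypothetical, the register table lists only the registers used in these slides, and the 5-bit field between $t0 and the function code is the standard MIPS shift amount, which is zero for add.

```python
# Standard MIPS register numbers for the registers used in the slides.
REG = {"$t0": 8, "$t1": 9, "$s1": 17, "$s2": 18}
ADD_FUNCT = 0b100000  # function code for add


def encode_r_type(rs: str, rt: str, rd: str, funct: int,
                  shamt: int = 0, opcode: int = 0) -> int:
    """Pack opcode(6) rs(5) rt(5) rd(5) shamt(5) funct(6) into 32 bits."""
    return ((opcode << 26) | (REG[rs] << 21) | (REG[rt] << 16)
            | (REG[rd] << 11) | (shamt << 6) | funct)


def r_fields(word: int) -> str:
    """Show a 32-bit word split into the six R-format fields."""
    b = format(word, "032b")
    return " ".join([b[0:6], b[6:11], b[11:16], b[16:21], b[21:26], b[26:32]])


if __name__ == "__main__":
    word = encode_r_type("$s1", "$s2", "$t0", ADD_FUNCT)  # add $t0, $s1, $s2
    print(r_fields(word), hex(word))
```

Note that the assembly operand order (destination first) differs from the field order in the word (sources first), which is why the call passes $s1 and $s2 before $t0.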

Pipelined Datapath Executing add
[Figure: pipelined datapath annotated for add $t0, $s1, $s2. Register numbers s1 (bits 21-25) and s2 (bits 16-20) address the reg. file, the values $s1 and $s2 go to the ALU, and the destination t0 comes from bits 11-15 (R-type).]

Load Instruction
lw $t0, 1200($t1)
Machine instruction word:
  100011  01001  01000  0000 0100 1011 0000
  opcode  $t1    $t0    1200
CC1 (IF):  fetch from IM
CC2 (ID):  read $t1, sign ext. 1200
CC3 (EX):  add $t1 + 1200
CC4 (MEM): read M[addr]
CC5 (WB):  write $t0

Pipelined Datapath Executing lw
[Figure: pipelined datapath annotated for lw $t0, 1200($t1). Register number t1 (bits 21-25) addresses the reg. file, the value $t1 and the sign-extended offset 1200 (bits 0-15) go to the ALU, the ALU output is the data mem. address, and the destination t0 comes from bits 16-20 (I-type lw).]

Store Instruction
sw $t0, 1200($t1)
Machine instruction word:
  101011  01001  01000  0000 0100 1011 0000
  opcode  $t1    $t0    1200
CC1 (IF):  fetch from IM
CC2 (ID):  read $t1, sign ext. 1200
CC3 (EX):  add $t1 + 1200 (addr)
CC4 (MEM): write $t0 to M[addr]
CC5 (WB):  no register write
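The lw and sw words can be derived the same way from the I-format fields. A minimal sketch with hypothetical helper names; it uses the standard MIPS opcodes for lw and sw and shows that the offset 1200 is 0x04B0 as a 16-bit immediate.

```python
REG = {"$t0": 8, "$t1": 9}                  # standard MIPS register numbers
OPCODE = {"lw": 0b100011, "sw": 0b101011}   # standard MIPS opcodes


def encode_i_type(op: str, rt: str, offset: int, base: str) -> int:
    """Pack opcode(6) base(5) rt(5) immediate(16) into 32 bits."""
    imm = offset & 0xFFFF  # keep the low 16 bits (two's complement offset)
    return (OPCODE[op] << 26) | (REG[base] << 21) | (REG[rt] << 16) | imm


def i_fields(word: int) -> str:
    """Show a 32-bit word split into the four I-format fields."""
    b = format(word, "032b")
    return " ".join([b[0:6], b[6:11], b[11:16], b[16:32]])


if __name__ == "__main__":
    for op in ("lw", "sw"):
        word = encode_i_type(op, "$t0", 1200, "$t1")  # op $t0, 1200($t1)
        print(op, i_fields(word), hex(word))
```

The two words differ only in the opcode field; the base register, the data register, and the immediate are identical.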

Pipelined Datapath Executing sw
[Figure: pipelined datapath annotated for sw $t0, 1200($t1). Register numbers t1 (bits 21-25) and t0 (bits 16-20) address the reg. file, the value $t1 and the sign-extended offset 1200 (bits 0-15) go to the ALU, the ALU output is the data mem. address, and the value $t0 is the data written.]

Executing a Program
Consider a five-instruction segment:
  lw  $10, 20($1)
  sub $11, $2, $3
  add $12, $3, $4
  lw  $13, 24($1)
  add $14, $5, $6

Program Execution
Program instructions    CC1   CC2   CC3   CC4   CC5
lw  $10, 20($1)         IF    ID    EX    MEM   WB
sub $11, $2, $3               IF    ID    EX    MEM
add $12, $3, $4                     IF    ID    EX
lw  $13, 24($1)                           IF    ID
add $14, $5, $6                                 IF
IF uses the instruction memory (IM); MEM uses the data memory (DM).

CC5
IF:  add $14, $5, $6
ID:  lw $13, 24($1)
EX:  add $12, $3, $4
MEM: sub $11, $2, $3
WB:  lw $10, 20($1)
[Figure: pipelined datapath with one of the five instructions in each stage.]
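The CC5 snapshot follows from a simple rule: in a stall-free five-stage pipeline, instruction i (counting from 0) is in stage s (counting from 0) during cycle i + s + 1. A short sketch of that rule, with illustrative names and no hazards modelled:

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

PROGRAM = [
    "lw $10, 20($1)",
    "sub $11, $2, $3",
    "add $12, $3, $4",
    "lw $13, 24($1)",
    "add $14, $5, $6",
]


def stage_occupancy(program: list, cycle: int) -> dict:
    """Map each stage to the instruction it holds in the given clock cycle.

    Assumes one instruction is fetched per cycle starting in cycle 1 and
    that no stalls occur; empty stages map to None.
    """
    occupancy = {}
    for stage_index, stage in enumerate(STAGES):
        instr_index = cycle - 1 - stage_index
        in_range = 0 <= instr_index < len(program)
        occupancy[stage] = program[instr_index] if in_range else None
    return occupancy


if __name__ == "__main__":
    for stage, instr in stage_occupancy(PROGRAM, 5).items():
        print(f"{stage}: {instr}")
```

Calling the function for cycles 6 to 9 shows the pipeline draining, with the front stages going empty one by one.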

Advantages of Pipeline
After the fifth cycle (CC5), one instruction is completed each cycle; CPI ≈ 1, neglecting the initial pipeline latency of 5 cycles.
Pipeline latency is defined as the number of stages in the pipeline, or the number of clock cycles after which the first instruction is completed.
The clock cycle time is about one quarter of that of the single-cycle datapath and about the same as that of the multicycle datapath.
For multicycle datapath, CPI = 3..
So, pipelined execution is faster, but...

Science is always wrong. It never solves a problem without creating ten more.
George Bernard Shaw

Pipeline Hazards
Definition: A hazard in a pipeline is a situation in which the next instruction cannot complete execution one clock cycle after completion of the present instruction.
Three types of hazards:
  Structural hazard (resource conflict)
  Data hazard
  Control hazard

Structural Hazard
Two instructions cannot execute at the same time because of a resource conflict.
Example: Consider a computer with a common data and instruction memory. The fourth cycle of a lw instruction requires a data memory access (memory read), and in the same cycle the fourth instruction requires its instruction fetch (memory read). This causes a memory resource conflict.

Example of Structural Hazard
Common data and instr. mem. (IM/DM)
Program instructions    CC1   CC2   CC3   CC4   CC5
lw  $10, 20($1)         IF    ID    EX    MEM   WB
sub $11, $2, $3               IF    ID    EX    MEM
add $12, $3, $4                     IF    ID    EX
lw  $13, 24($1)                           IF    ID
In CC4 the memory is needed by two instructions: lw $10 (MEM) and lw $13 (IF).
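The conflict in this example can be found mechanically. The sketch below is an illustration under stated assumptions, not the slides' method: a single memory shared by IF and MEM, one instruction fetched per cycle, no stalls, and only lw and sw using the MEM stage for a memory access. The function name is hypothetical.

```python
MEMORY_OPS = {"lw", "sw"}  # instructions that access data memory in MEM


def structural_conflicts(mnemonics: list) -> list:
    """Return (cycle, mem_instr_index, fetch_instr_index) for each conflict.

    Instruction i (0-based) is fetched in cycle i + 1 and reaches MEM in
    cycle i + 4, which is the cycle in which instruction i + 3 is fetched.
    """
    conflicts = []
    for i, op in enumerate(mnemonics):
        if op not in MEMORY_OPS:
            continue
        mem_cycle = i + 4
        fetch_index = mem_cycle - 1  # instruction being fetched in that cycle
        if fetch_index < len(mnemonics):
            conflicts.append((mem_cycle, i, fetch_index))
    return conflicts


if __name__ == "__main__":
    program = ["lw", "sub", "add", "lw"]
    print(structural_conflicts(program))
```

For the four-instruction program above the only conflict is in cycle 4, between the first lw (index 0, in MEM) and the second lw (index 3, in IF).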

Possible Remedies for Structural Hazards
Provide duplicate hardware resources in the datapath.
Control unit or compiler can insert delays (no-op cycles) between instructions. This is known as a pipeline stall or bubble.

Stall (Bubble) for Structural Hazard
Program instructions    CC1   CC2   CC3   CC4      CC5
lw  $10, 20($1)         IF    ID    EX    MEM      WB
sub $11, $2, $3               IF    ID    EX       MEM
add $12, $3, $4                     IF    ID       EX
Stall (bubble)                            bubble
lw  $13, 24($1)                                    IF
The common memory (IM/DM) is used by only one instruction in each cycle.

Data Hazard
Data hazard means that an instruction cannot be completed because the needed data, to be generated by another instruction in the pipeline, is not available.
Example: consider two instructions:
  add $s0, $t0, $t1
  sub $t2, $s0, $t3    # needs $s0

Example of Data Hazard
Program instructions    CC1   CC2   CC3   CC4   CC5
add $s0, $t0, $t1       IF    ID    EX    MEM   WB     (write s0 in CC5)
sub $t2, $s0, $t3             IF    ID    EX    MEM    (read s0 and t3 in CC3)
We need to read s0 from the reg file in cycle 3, but s0 will not be written in the reg file until cycle 5.
However, s0 will only be used in cycle 4, and it is available at the end of cycle 3.

Forwarding or Bypassing
Output of a resource used by an instruction is forwarded to the input of some resource being used by another instruction.
Forwarding can eliminate some, but not all, data hazards.

Forwarding for Data Hazard
Program instructions    CC1   CC2   CC3   CC4   CC5
add $s0, $t0, $t1       IF    ID    EX    MEM   WB     (write s0 in CC5)
sub $t2, $s0, $t3             IF    ID    EX    MEM    (read s0 and t3 in CC3)
Forwarding: the result of add, available at the end of CC3, is passed directly to sub for use in CC4.

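Forwarding Unit Hardware
[Figure: a Forwarding Unit takes the source reg. IDs from the opcode and the control signals, and drives two forwarding multiplexers (FORW. MUX) at the ALU inputs. The data mem. and the MUX that selects the value sent to the reg. file are also shown.]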
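The slides do not spell out the forwarding unit's decision logic. The following sketch uses the conditions commonly given in textbook treatments of the five-stage MIPS pipeline, which is an assumption here: a source register is taken from the EX/MEM pipeline register if the instruction one ahead writes it, otherwise from MEM/WB if the instruction two ahead writes it, otherwise from the reg. file. The pipeline register names and the function name are hypothetical.

```python
def forward_source(src_reg: int,
                   ex_mem_reg_write: bool, ex_mem_rd: int,
                   mem_wb_reg_write: bool, mem_wb_rd: int) -> str:
    """Select where one ALU operand should come from.

    Register 0 is never forwarded because $0 is hardwired to zero in MIPS.
    The EX/MEM check comes first so the most recent result wins.
    """
    if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == src_reg:
        return "EX/MEM"
    if mem_wb_reg_write and mem_wb_rd != 0 and mem_wb_rd == src_reg:
        return "MEM/WB"
    return "REGFILE"


if __name__ == "__main__":
    S0, T3 = 16, 11  # MIPS register numbers of $s0 and $t3
    # add $s0, $t0, $t1 is one stage ahead of sub $t2, $s0, $t3.
    print(forward_source(S0, True, S0, False, 0))  # operand $s0
    print(forward_source(T3, True, S0, False, 0))  # operand $t3
```

In the add/sub example only the $s0 operand is forwarded; $t3 is read from the reg. file as usual.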

Forwarding Alone May Not Work
Program instructions    CC1   CC2   CC3   CC4   CC5
lw  $s0, 20($s1)        IF    ID    EX    MEM   WB     (write s0 in CC5)
sub $t2, $s0, $t3             IF    ID    EX    MEM    (read s0 and t3 in CC3)
Data is needed by sub in CC4 (data hazard), but it is available from memory only at the end of cycle 4.

Use Bubble and Forwarding
[Timing diagram, CC1 to CC5: lw $s0, 20($s1) writes s0 in CC5. A stall (bubble) is inserted before sub $t2, $s0, $t3, which delays sub by one cycle so that the data read from memory by lw can be forwarded to it.]

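Hazard Detection Unit Hardware
[Figure: a Hazard Detection Unit takes the source reg. IDs from the opcode of the fetched instruction. On a hazard it disables the PC write and makes the NOP MUX pass 0 in place of the control signals. The Forwarding Unit with its two forwarding multiplexers (FORW. MUX), the data mem., the control signals, and the path to the reg. file are also shown.]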
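The slides show the hazard detection unit only as a block diagram. As a sketch of what it has to decide, the code below uses the load-use condition commonly given for the five-stage MIPS pipeline (an assumption, not taken from the slides): stall when the instruction in EX is a load whose destination is a source of the instruction in ID. All signal and function names are hypothetical.

```python
def load_use_hazard(id_ex_mem_read: bool, id_ex_rt: int,
                    if_id_rs: int, if_id_rt: int) -> bool:
    """True when the instruction in ID needs the result of a load in EX."""
    return bool(id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt))


def hazard_controls(id_ex_mem_read: bool, id_ex_rt: int,
                    if_id_rs: int, if_id_rt: int) -> dict:
    """Control outputs for one cycle.

    On a stall the PC and the fetched instruction are held, and zeroed
    control signals (a NOP, i.e. a bubble) are sent down the pipeline.
    """
    stall = load_use_hazard(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt)
    return {"pc_write": not stall, "if_id_write": not stall, "insert_nop": stall}


if __name__ == "__main__":
    S0, T3 = 16, 11  # MIPS register numbers of $s0 and $t3
    # lw $s0, 20($s1) is in EX while sub $t2, $s0, $t3 is in ID.
    print(hazard_controls(True, S0, S0, T3))
```

After the one-cycle bubble the loaded value can be forwarded, so no further stall is needed for this pair.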

Resolving Hazards
Hazards are resolved by the hazard detection and forwarding units.
The compiler's understanding of how these units work can improve performance.

Avoiding Stall by Code Reorder
C code:
  A = B + E;
  C = B + F;
MIPS code:
  lw  $t1, 0($t0)      # $t1 written
  lw  $t2, 4($t0)      # $t2 written
  add $t3, $t1, $t2    # $t1, $t2 needed
  sw  $t3, 12($t0)
  lw  $t4, 8($t0)      # $t4 written
  add $t5, $t1, $t4    # $t4 needed
  sw  $t5, 16($t0)

Reordered Code
C code:
  A = B + E;
  C = B + F;
MIPS code:
  lw  $t1, 0($t0)
  lw  $t2, 4($t0)
  lw  $t4, 8($t0)
  add $t3, $t1, $t2    # no hazard
  sw  $t3, 12($t0)
  add $t5, $t1, $t4    # no hazard
  sw  $t5, 16($t0)
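The benefit of reordering can be counted. The sketch below assumes a pipeline with forwarding in which the only remaining stall is one cycle when an instruction uses a register loaded by the lw immediately before it. The tuple representation and the function name are illustrative, not from the slides.

```python
# Each instruction: (mnemonic, destination register or None, source registers)
ORIGINAL = [
    ("lw",  "$t1", ["$t0"]),
    ("lw",  "$t2", ["$t0"]),
    ("add", "$t3", ["$t1", "$t2"]),
    ("sw",  None,  ["$t3", "$t0"]),
    ("lw",  "$t4", ["$t0"]),
    ("add", "$t5", ["$t1", "$t4"]),
    ("sw",  None,  ["$t5", "$t0"]),
]

REORDERED = [
    ("lw",  "$t1", ["$t0"]),
    ("lw",  "$t2", ["$t0"]),
    ("lw",  "$t4", ["$t0"]),
    ("add", "$t3", ["$t1", "$t2"]),
    ("sw",  None,  ["$t3", "$t0"]),
    ("add", "$t5", ["$t1", "$t4"]),
    ("sw",  None,  ["$t5", "$t0"]),
]


def load_use_stalls(program: list) -> int:
    """Count one stall per instruction that reads the previous lw's result."""
    stalls = 0
    for previous, current in zip(program, program[1:]):
        prev_op, prev_dest, _ = previous
        _, _, sources = current
        if prev_op == "lw" and prev_dest in sources:
            stalls += 1
    return stalls


if __name__ == "__main__":
    print("original:", load_use_stalls(ORIGINAL))
    print("reordered:", load_use_stalls(REORDERED))
```

Under these assumptions the original order stalls after each of the two loads that feed an add directly, while moving lw $t4 up removes both stalls without changing the result.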

Control Hazard
Instruction to be fetched is not known!
Example: the instruction being executed is branch-type, which will determine the next instruction:
      add $4, $5, $6
      beq $1, $2, 40
      next instruction
      ...
  40  and $7, $8, $9

Stall on Branch
Program instructions                    CC1   CC2   CC3      CC4   CC5
add $4, $5, $6                          IF    ID    EX       MEM   WB
beq $1, $2, 40                                IF    ID       EX    MEM
Stall (bubble)                                      bubble
next instruction or and $7, $8, $9                           IF    ID

Why Only One Stall?
Extra hardware in the ID phase:
  Additional ALU to compute the branch address
  Comparator to generate the zero signal
  Hazard detection unit writes the branch address in PC

Ways to Handle Branch
Stall or bubble
Branch prediction:
  Heuristics
    Next instruction
    Prediction based on statistics (dynamic)
    Hardware decision (dynamic)
  Prediction error: pipeline flush
Delayed branch

Delayed Branch Example
Stall on branch:
        add $4, $5, $6
        beq $1, $2, skip
        next instruction
        ...
  skip  or  $7, $8, $9
Delayed branch:
        beq $1, $2, skip
        add $4, $5, $6      # instruction executed irrespective of branch decision
        next instruction
        ...
  skip  or  $7, $8, $9

Delayed Branch
Program instructions                          CC1   CC2   CC3   CC4   CC5
beq $1, $2, skip                              IF    ID    EX    MEM   WB
add $4, $5, $6                                      IF    ID    EX    MEM
next instruction or skip: or $7, $8, $9                   IF    ID    EX

Summary: Hazards
Structural hazards
  Cause: resource conflict
  Remedies: (i) duplicate hardware resources, (ii) stall (bubble)
Data hazards
  Cause: data unavailability
  Remedies: (i) forwarding, (ii) stall (bubble), (iii) code reordering
Control hazards
  Cause: out-of-sequence execution (branch or jump)
  Remedies: (i) stall (bubble), (ii) branch prediction/pipeline flush, (iii) delayed branch/pipeline flush

Thank You