CS 110 Computer Architecture. Pipelining. Guest Lecture: Shu Yin. School of Information Science and Technology SIST


CS 110 Computer Architecture: Pipelining
Guest Lecture: Shu Yin
http://shtech.org/courses/ca/
School of Information Science and Technology (SIST), ShanghaiTech University
Slides based on UC Berkeley's CS61C

Summary: Single-Cycle Processor
Five steps to design a processor:
1. Analyze instruction set → datapath requirements
2. Select set of datapath components & establish clock methodology
3. Assemble datapath meeting the requirements
4. Analyze implementation of each instruction to determine setting of control points that effect the register transfer
5. Assemble the control logic (formulate logic equations, design circuits)
[Figure: processor = control + datapath, connected to memory and input/output.]

Single Cycle Performance
Assume times for actions are 100 ps for a register read or write and 200 ps for other events. What is the clock period?

Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
lw        200 ps       100 ps         200 ps  200 ps         100 ps          800 ps
sw        200 ps       100 ps         200 ps  200 ps                         700 ps
R-format  200 ps       100 ps         200 ps                 100 ps          600 ps
beq       200 ps       100 ps         200 ps                                 500 ps

Clock rate (cycles/second = Hz) = 1 / Period (seconds/cycle)

Single Cycle Performance (cont.)
(Same timing table as above: lw 800 ps, sw 700 ps, R-format 600 ps, beq 500 ps.)
What can we do to improve the clock rate? Will this improve performance as well? We want an increased clock rate to mean faster programs.
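As a sanity check on the table above, a short Python sketch (stage times taken from the slide; the dictionary layout is just for illustration) computes the single-cycle clock period as the slowest instruction's total time:

```python
# Per-stage times in ps for each instruction, from the table above.
STAGE_TIMES = {
    "lw":       [200, 100, 200, 200, 100],  # fetch, reg read, ALU, mem, reg write
    "sw":       [200, 100, 200, 200],
    "R-format": [200, 100, 200, 100],
    "beq":      [200, 100, 200],
}

# Single-cycle design: the clock period must fit the slowest instruction.
period_ps = max(sum(stages) for stages in STAGE_TIMES.values())
freq_ghz = 1000 / period_ps  # 1000 ps per ns; 1/ns = GHz

print(period_ps, freq_ghz)  # 800 ps -> 1.25 GHz
```

Every instruction, even a 500 ps beq, is forced to take the full 800 ps cycle, which is exactly the inefficiency pipelining attacks.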

Gotta Do Laundry
Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, fold, and put away:
- Washer takes 30 minutes
- Dryer takes 30 minutes
- Folder takes 30 minutes
- Stasher takes 30 minutes to put clothes into drawers

Sequential Laundry
[Timeline figure, 6 PM to 2 AM: loads A-D run back-to-back, sixteen 30-minute slots.]
Sequential laundry takes 8 hours for 4 loads.

Pipelined Laundry
[Timeline figure, 6 PM to 9:30 PM: loads A-D overlap, seven 30-minute slots.]
Pipelined laundry takes 3.5 hours for 4 loads!

Pipelining Lessons (1/2)
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
- Multiple tasks operate simultaneously using different resources
- Potential speedup = number of pipe stages
- Time to fill the pipeline and time to drain it reduce the speedup: 2.3x (8/3.5) vs. 4x (8/2) in this example
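The fill/drain effect is easy to check numerically. The sketch below (stage count and times from the laundry example) compares the sequential and pipelined schedules:

```python
STAGES = 4      # wash, dry, fold, stash
STAGE_MIN = 30  # minutes per stage
LOADS = 4

# Sequential: each load runs all stages before the next starts.
sequential = LOADS * STAGES * STAGE_MIN        # 480 min = 8 hours

# Pipelined: the first load fills the pipe (STAGES slots), then each
# additional load finishes one stage-time later.
pipelined = (STAGES + LOADS - 1) * STAGE_MIN   # 210 min = 3.5 hours

print(sequential / pipelined)            # ~2.3x actual speedup
print(sequential / (LOADS * STAGE_MIN))  # 4x ideal, ignoring fill/drain
```

As the number of loads grows, the fill/drain term `STAGES - 1` is amortized away and the actual speedup approaches the ideal 4x.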

Pipelining Lessons (2/2)
- Suppose a new dryer takes 20 minutes and a new folder takes 20 minutes. How much faster is the pipeline?
- Pipeline rate is limited by the slowest pipeline stage
- Unbalanced lengths of pipe stages reduce the speedup

Execution Steps in the MIPS Datapath
1) IFtch: Instruction Fetch, Increment PC
2) Dcd: Instruction Decode, Read Registers
3) Exec: Mem-ref: Calculate Address; Arith-log: Perform Operation
4) Mem: Load: Read Data from Memory; Store: Write Data to Memory
5) WB: Write Data Back to Register

Single Cycle Datapath
[Datapath figure: PC → instruction memory (+4 for next PC) → registers (rd, rs, rt), imm → ALU → data memory, spanning the five stages: 1. Instruction Fetch, 2. Decode/Register Read, 3. Execute, 4. Memory, 5. Write Back.]

Pipeline Registers
[Same five-stage datapath figure, with pipeline registers inserted between the stages.]
We need registers between stages to hold the information produced in the previous cycle.

More Detailed Pipeline (figure)

IF for Load, Store, ... (figure)

ID for Load, Store, ... (figure)

EX for Load (figure)

MEM for Load (figure)

WB for Load (figure): Oops! Wrong register number! The destination register number must travel through the pipeline registers with its instruction; otherwise WB writes using the rd field of a newer instruction.

Corrected Datapath for Load (figure)

Pipelined Execution Representation
[Diagram: six instructions, each going IF → ID → EX → MEM → WB, staggered one clock cycle apart along the time axis.]
Every instruction must take the same number of steps, so some stages will idle, e.g. the MEM stage for any arithmetic instruction.

Graphical Pipeline Diagrams
[Datapath figure: PC (MUX, +4) → instruction memory → register file (rd, rs, rt, imm) → ALU → data memory, labeled with the five stages: 1. Instruction Fetch, 2. Decode/Register Read, 3. Execute, 4. Memory, 5. Write Back.]
Use the datapath figure to represent the pipeline: IF ID EX Mem WB

Graphical Pipeline Representation
(RegFile: left half is write, right half is read)
[Diagram, time in clock cycles: Load, Add, Store, Sub, Or each flow through I$ → Reg → ALU → D$ → Reg, each starting one cycle after the previous instruction.]

Pipelining Performance (1/3)
Use Tc ("time between completion of instructions") to measure speedup.
Ideal speedup = number of pipe stages; equality is only achieved if the stages are balanced (i.e. take the same amount of time). If not balanced, the speedup is reduced.
Speedup comes from increased throughput; the latency of each instruction does not decrease.

Pipelining Performance (2/3)
Assume the time for stages is 100 ps for a register read or write and 200 ps for other stages.

Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
lw        200 ps       100 ps         200 ps  200 ps         100 ps          800 ps
sw        200 ps       100 ps         200 ps  200 ps                         700 ps
R-format  200 ps       100 ps         200 ps                 100 ps          600 ps
beq       200 ps       100 ps         200 ps                                 500 ps

What is the pipelined clock rate? Compare the pipelined datapath with the single-cycle datapath.

Pipelining Performance (3/3)
Single-cycle: Tc = 800 ps → f = 1.25 GHz
Pipelined: Tc = 200 ps → f = 5 GHz
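The 4x frequency gain is per cycle; for whole programs the pipeline fill/drain overhead also matters. A small sketch (timings from the slides; the instruction counts are chosen only for illustration):

```python
SINGLE_TC_PS = 800  # single-cycle clock period
PIPE_TC_PS = 200    # pipelined clock period (slowest stage)
STAGES = 5

def single_cycle_time(n_instr):
    # One full 800 ps cycle per instruction.
    return n_instr * SINGLE_TC_PS

def pipelined_time(n_instr):
    # First instruction takes STAGES cycles; each later one adds one cycle.
    return (STAGES + n_instr - 1) * PIPE_TC_PS

for n in (3, 1_000_000):
    print(n, single_cycle_time(n) / pipelined_time(n))
# Speedup approaches 800/200 = 4x for long programs, less for short ones.
```

This assumes no hazards; the stalls discussed later push the real speedup below this bound.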

Question: Logic in some stages takes 200 ps and in some 100 ps. The clk-to-Q delay is 30 ps and the setup time is 20 ps. What is the maximum clock frequency at which a pipelined design can operate?
A: 10 GHz
B: 5 GHz
C: 6.7 GHz
D: 4.35 GHz
E: 4 GHz
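A quick check of the arithmetic behind this question, assuming (as the question intends) that the critical path is the slowest stage's logic plus the pipeline register's clk-to-Q and setup overhead:

```python
slowest_logic_ps = 200  # slowest stage logic delay
clk_to_q_ps = 30        # pipeline register clk-to-Q delay
setup_ps = 20           # pipeline register setup time

# The clock period must cover logic delay plus register overhead.
min_period_ps = slowest_logic_ps + clk_to_q_ps + setup_ps  # 250 ps
max_freq_ghz = 1000 / min_period_ps

print(max_freq_ghz)  # 4.0 GHz
```

The 100 ps stages don't matter: the clock is set by the slowest stage.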

Question: Which statement is false?
A: Pipelining increases instruction throughput
B: Pipelining increases instruction latency
C: Pipelining increases clock frequency
D: Pipelining decreases the number of components

Pipelining Hazards
A hazard is a situation that prevents starting the next instruction in the next clock cycle.
1) Structural hazard: a required resource is busy (e.g. needed in multiple stages)
2) Data hazard: a data dependency between instructions; we need to wait for the previous instruction to complete its data read/write
3) Control hazard: the flow of execution depends on a previous instruction

1. Structural Hazards
A conflict for use of a resource. In a MIPS pipeline with a single memory:
- A load/store requires memory access for data
- The instruction fetch would have to stall for that cycle, causing a pipeline bubble
Hence, pipelined datapaths require separate instruction/data memories. Separate L1 I$ and L1 D$ take care of this.

Structural Hazard #1: Single Memory
[Pipeline diagram: Load and Instr 1-4 staggered across clock cycles; one instruction's I$ fetch and another's D$ access hit the same memory in the same cycle.]
Trying to read the same memory twice in the same clock cycle.

Structural Hazard #2: Registers (1/2)
[Pipeline diagram: Load and Instr 1-4; one instruction's register read overlaps another's register write in the same cycle.]
Can we read and write registers simultaneously?

Structural Hazard #2: Registers (2/2)
Two different solutions have been used:
1) Split RegFile access in two: write during the 1st half and read during the 2nd half of each clock cycle. Possible because RegFile access is VERY fast (takes less than half the time of a stage)
2) Build the RegFile with independent read and write ports
Conclusion: reading and writing registers during the same clock cycle is okay. Structural hazards can always be removed by adding hardware resources.

2. Data Hazards (1/2)
Consider the following sequence of instructions:
add $t0, $t1, $t2
sub $t4, $t0, $t3
and $t5, $t0, $t6
or  $t7, $t0, $t8
xor $t9, $t0, $t10
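Every later instruction reads $t0, but add does not write it back until its WB stage. A small Python sketch (a simplified RAW-hazard checker written for illustration; it assumes no forwarding and a split-cycle register file, so only instructions issued 3+ cycles later see the new value) flags the dependent instructions:

```python
# (op, dest, sources) per instruction, from the sequence above.
instrs = [
    ("add", "$t0", ("$t1", "$t2")),
    ("sub", "$t4", ("$t0", "$t3")),
    ("and", "$t5", ("$t0", "$t6")),
    ("or",  "$t7", ("$t0", "$t8")),
    ("xor", "$t9", ("$t0", "$t10")),
]

def raw_hazards(instrs):
    """Return (producer, consumer) pairs where the consumer reads a
    register before the producer has written it back.  With a write-then-
    read register file, only the next two instructions are affected."""
    hazards = []
    for i, (prod_op, dest, _) in enumerate(instrs):
        for j in range(i + 1, min(i + 3, len(instrs))):
            cons_op, _, srcs = instrs[j]
            if dest in srcs:
                hazards.append((prod_op, cons_op))
    return hazards

print(raw_hazards(instrs))  # [('add', 'sub'), ('add', 'and')]
```

The or reads $t0 in the same cycle that add writes it back, which the split-cycle register file already handles; only sub and and truly hazard.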

2. Data Hazards (2/2)
Dependencies whose data flows backward in time are hazards.
[Pipeline diagram: add $t0,$t1,$t2 (IF ID/RF EX MEM WB) writes $t0 in WB, while the following sub, and, or, and xor instructions read $t0 in earlier cycles.]

Data Hazard Solution: Forwarding
Forward the result as soon as it is available; it's OK that it's not stored in the RegFile yet.
[Pipeline diagram: the result of add $t0,$t1,$t2 is forwarded from its EX/MEM stages directly to the dependent instructions before it is written back.]

Datapath for Forwarding (1/2): What changes need to be made here? (figure)

Datapath for Forwarding (2/2): Handled by the forwarding unit. (figure)

Data Hazard: Loads (1/4)
Recall: dependencies whose data flows backward in time are hazards.
lw  $t0, 0($t1)
sub $t3, $t0, $t2
The loaded value is not available until after MEM, so we can't solve all cases with forwarding. We must stall the instruction dependent on the load, then forward (more hardware).

Data Hazard: Loads (2/4)
The hardware stalls the pipeline; this is called a hardware interlock.
[Pipeline diagram: lw $t0, 0($t1) (IF ID/RF EX MEM WB), then sub $t3,$t0,$t2, and $t5,$t0,$t4, or $t7,$t0,$t6, with a bubble inserted after the load.]
Schematically this is what we want, but in reality stalls are done horizontally. How do we stall just part of the pipeline?

Data Hazard: Loads (3/4)
The stalled instruction is converted to a bubble, which acts like a nop.
[Pipeline diagram: lw $t0, 0($t1); sub $t3,$t0,$t2 repeats its stage one cycle later; then and $t5,$t0,$t4 and or $t7,$t0,$t6.]
The first two pipe stages stall by repeating their stage one cycle later.

Data Hazard: Loads (4/4)
The slot after a load is called the load delay slot.
If the instruction in that slot uses the result of the load, the hardware interlock will stall it for one cycle.
Letting the hardware stall the instruction in the delay slot is equivalent to putting an explicit nop in the slot (except the latter uses more code space).
Idea: let the compiler put an unrelated instruction in that slot → no stall!

Code Scheduling to Avoid Stalls
Reorder code to avoid use of a load result in the next instruction!
MIPS code for D=A+B; E=A+C;

# Method 1 (13 cycles):        # Method 2 (11 cycles):
lw  $t1, 0($t0)                lw  $t1, 0($t0)
lw  $t2, 4($t0)                lw  $t2, 4($t0)
add $t3, $t1, $t2  # stall!    lw  $t4, 8($t0)
sw  $t3, 12($t0)               add $t3, $t1, $t2
lw  $t4, 8($t0)                sw  $t3, 12($t0)
add $t5, $t1, $t4  # stall!    add $t5, $t1, $t4
sw  $t5, 16($t0)               sw  $t5, 16($t0)
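The 13 vs. 11 cycle counts can be reproduced with a tiny simulator (illustrative Python; it assumes full forwarding, so the only stall is the one-cycle bubble when an instruction uses the result of the immediately preceding load):

```python
STAGES = 5

def cycles(instrs):
    """instrs: list of (op, dest, sources).  Total cycles for a 5-stage
    pipeline with forwarding: fill time plus one instruction per cycle,
    plus one bubble per load-use stall."""
    stalls = 0
    for prev, curr in zip(instrs, instrs[1:]):
        op, dest, _ = prev
        if op == "lw" and dest in curr[2]:
            stalls += 1
    return len(instrs) + STAGES - 1 + stalls

method1 = [
    ("lw",  "$t1", ("$t0",)),
    ("lw",  "$t2", ("$t0",)),
    ("add", "$t3", ("$t1", "$t2")),  # uses $t2 right after its load -> stall
    ("sw",  None,  ("$t3", "$t0")),
    ("lw",  "$t4", ("$t0",)),
    ("add", "$t5", ("$t1", "$t4")),  # uses $t4 right after its load -> stall
    ("sw",  None,  ("$t5", "$t0")),
]
# Method 2 hoists the third load above the first add, removing both stalls.
method2 = [method1[0], method1[1], method1[4],
           method1[2], method1[3], method1[5], method1[6]]

print(cycles(method1), cycles(method2))  # 13 11
```

Seven instructions take 7 + 4 = 11 cycles with no stalls; the two load-use bubbles in Method 1 account for the difference.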

3. Control Hazards
A branch determines the flow of control:
- Fetching the next instruction depends on the branch outcome
- The pipeline can't always fetch the correct instruction, e.g. while a BEQ or BNE is still in the ID stage of the MIPS pipeline
Simple solution, Option 1: stall on every branch until the branch condition is resolved. This would add 2 bubbles/clock cycles for every branch (~20% of instructions executed)!

Stall => 2 Bubbles/Clocks
[Pipeline diagram: beq followed by Instr 1-4, with two bubbles inserted after the branch.]
Where do we do the compare for the branch?

Control Hazard: Branching
Optimization #1: insert a special branch comparator in Stage 2. As soon as the instruction is decoded (the opcode identifies it as a branch), immediately make the decision and set the new value of the PC.
Benefit: since the branch completes in Stage 2, only one unnecessary instruction is fetched, so only one no-op is needed.
Side note: this means branches are idle in Stages 3, 4 and 5.

One Clock Cycle Stall
[Pipeline diagram: beq followed by Instr 1-4, with a single bubble after the branch.]
Branch comparator moved to the Decode stage.

Control Hazards: Branching
Option #2: predict the outcome of a branch, and fix up if the guess was wrong.
- Must cancel all instructions in the pipeline that depended on the wrong guess; this is called flushing the pipeline
- Simplest hardware if we predict that all branches are NOT taken. Why?

Control Hazards: Branching
Option #3: redefine branches.
- Old definition: if we take the branch, none of the instructions after the branch get executed by accident
- New definition: whether or not we take the branch, the single instruction immediately following the branch gets executed (the branch-delay slot)
Delayed branch means we always execute the instruction after the branch. This optimization is used with MIPS.

Example: Nondelayed vs. Delayed Branch

Nondelayed Branch:        Delayed Branch:
  or  $8, $9, $10           add $1, $2, $3
  add $1, $2, $3            sub $4, $5, $6
  sub $4, $5, $6            beq $1, $4, Exit
  beq $1, $4, Exit          or  $8, $9, $10
  xor $10, $1, $11          xor $10, $1, $11
Exit:                     Exit:

Control Hazards: Branching
Notes on the branch-delay slot:
- Worst case: put a nop in the branch-delay slot
- Better case: place some instruction preceding the branch in the branch-delay slot, as long as the change doesn't affect the logic of the program
- Re-ordering instructions is a common way to speed up programs; the compiler usually finds such an instruction more than 70% of the time
- Jumps also have a delay slot

Greater Instruction-Level Parallelism (ILP) (§4.10 Parallelism and Advanced Instruction-Level Parallelism)
- Deeper pipeline (5 => 10 => 15 stages): less work per stage => shorter clock cycle
- Multiple issue (superscalar): replicate pipeline stages => multiple pipelines; start multiple instructions per clock cycle. CPI < 1, so use instructions per cycle (IPC). E.g., a 4 GHz 4-way multiple-issue machine peaks at 16 BIPS: peak CPI = 0.25, peak IPC = 4. But dependencies reduce this in practice.
- Out-of-order execution: reorder instructions dynamically in hardware to reduce the impact of hazards
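The peak-throughput numbers quoted above follow directly from clock rate and issue width; a quick check:

```python
clock_ghz = 4    # 4 GHz clock
issue_width = 4  # 4-way multiple issue

peak_ipc = issue_width               # instructions per cycle at best
peak_cpi = 1 / peak_ipc              # cycles per instruction at best
peak_bips = clock_ghz * issue_width  # billions of instructions per second

print(peak_bips, peak_cpi, peak_ipc)  # 16 0.25 4
```

Real programs fall short of these peaks because data and control dependencies prevent filling every issue slot every cycle.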

Hyperthreading
[Datapath figure: two PCs and two register files share the same instruction memory, data memory, and combinational-logic blocks across the five stages.]
- Duplicate all elements that hold state (registers)
- Use the same CL blocks
- Use muxes to select which state to use every clock cycle => run 2 totally independent threads (same memory → shared memory!)
- Speedup? No obvious speedup, but it makes use of the CL blocks when resources are unavailable (e.g. waiting for memory)

Intel Nehalem i7
- Hyperthreading: about 5% die area; up to 30% speed gain (BUT < 0% is also possible)
- Pipeline: 20-24 stages!
- Out-of-order execution:
  1. Instruction fetch
  2. Instruction dispatch to an instruction queue
  3. The instruction waits in the queue until its input operands are available => an instruction can leave the queue before earlier, older instructions
  4. The instruction is issued to the appropriate functional unit and executed by that unit
  5. The results are queued
  6. Write to registers only after all older instructions have their results written

In Conclusion
- Pipelining increases throughput by overlapping the execution of multiple instructions in different pipestages
- Pipestages should be balanced for the highest clock rate
- Three types of pipeline hazard limit performance:
  - Structural (always fixable with more hardware)
  - Data (use interlocks or bypassing to resolve)
  - Control (reduce impact with branch prediction or branch delay slots)