EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)

Similar documents
Pipeline Overview. Dr. Jiang Li. Adapted from the slides provided by the authors. Jiang Li, Ph.D. Department of Computer Science

Lecture 3. Pipelining. Dr. Soner Onder CS 4431 Michigan Technological University 9/23/2009 1

Computer Architecture

CISC 662 Graduate Computer Architecture Lecture 6 - Hazards

Modern Computer Architecture

CSE 502 Graduate Computer Architecture. Lec 3-5 Performance + Instruction Pipelining Review

CSE 502 Graduate Computer Architecture. Lec 3-5 Performance + Instruction Pipelining Review

COSC 6385 Computer Architecture - Pipelining

Lecture 2: Processor and Pipelining 1

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

Advanced Computer Architecture

CSE 502 Graduate Computer Architecture. Lec 4-6 Performance + Instruction Pipelining Review

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

Lecture 05: Pipelining: Basic/ Intermediate Concepts and Implementation

Computer Architecture Spring 2016

COSC4201 Pipelining. Prof. Mokhtar Aboelaze York University

Computer Architecture and System

CISC 662 Graduate Computer Architecture Lecture 6 - Hazards

MIPS An ISA for Pipelining

Computer Architecture and System

Pipelining: Hazards Ver. Jan 14, 2014

Overview. Appendix A. Pipelining: Its Natural! Sequential Laundry 6 PM Midnight. Pipelined Laundry: Start work ASAP

Appendix A. Overview

Pipelining! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar DEIB! 30 November, 2017!

Advanced Computer Architecture Pipelining

Data Hazards Compiler Scheduling Pipeline scheduling or instruction scheduling: Compiler generates code to eliminate hazard

Pipeline Review. Review

Lecture 1: Introduction

CPE Computer Architecture. Appendix A: Pipelining: Basic and Intermediate Concepts

Pipelining: Basic and Intermediate Concepts

Chapter 4 The Processor 1. Chapter 4B. The Processor

CMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3. Complications With Long Instructions

EITF20: Computer Architecture Part2.2.1: Pipeline-1

EITF20: Computer Architecture Part2.2.1: Pipeline-1

第三章 Instruction-Level Parallelism and Its Dynamic Exploitation. 陈文智 浙江大学计算机学院 2014 年 10 月

Appendix C: Pipelining: Basic and Intermediate Concepts

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri

Basic Pipelining Concepts

The Big Picture Problem Focus S re r g X r eg A d, M lt2 Sub u, Shi h ft Mac2 M l u t l 1 Mac1 Mac Performance Focus Gate Source Drain BOX

EITF20: Computer Architecture Part2.2.1: Pipeline-1

Advanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017

The Processor: Improving the performance - Control Hazards

Instruction Pipelining Review

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

CISC 662 Graduate Computer Architecture Lecture 5 - Pipeline. Pipelining. Pipelining the Idea. Similar to assembly line in a factory:

CSE 533: Advanced Computer Architectures. Pipelining. Instructor: Gürhan Küçük. Yeditepe University

LECTURE 3: THE PROCESSOR

Lecture 9. Pipeline Hazards. Christos Kozyrakis Stanford University

Appendix C. Abdullah Muzahid CS 5513

Chapter 4 The Processor 1. Chapter 4A. The Processor

Pipeline Hazards. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

CISC 662 Graduate Computer Architecture Lecture 5 - Pipeline Pipelining

Outline. Pipelining basics The Basic Pipeline for DLX & MIPS Pipeline hazards. Handling exceptions Multi-cycle operations

Minimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline

The Processor Pipeline. Chapter 4, Patterson and Hennessy, 4ed. Section 5.3, 5.4: J P Hayes.

Full Datapath. Chapter 4 The Processor 2

Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Outline. A pipelined datapath Pipelined control Data hazards and forwarding Data hazards and stalls Branch (control) hazards Exception

(Basic) Processor Pipeline

Pipelining. Maurizio Palesi

HY425 Lecture 05: Branch Prediction

Appendix C. Instructor: Josep Torrellas CS433. Copyright Josep Torrellas 1999, 2001, 2002,

COMPUTER ORGANIZATION AND DESIGN

ECE154A Introduction to Computer Architecture. Homework 4 solution

ECE232: Hardware Organization and Design

Processor Architecture

Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining

What is Pipelining? Time per instruction on unpipelined machine Number of pipe stages

CS 110 Computer Architecture. Pipelining. Guest Lecture: Shu Yin. School of Information Science and Technology SIST

1 Hazards COMP2611 Fall 2015 Pipelined Processor

Chapter 4. The Processor

ISA. CSE 4201 Minas E. Spetsakis based on Computer Architecture by Hennessy and Patterson

DLX Unpipelined Implementation

COMPUTER ORGANIZATION AND DESIGN

Computer Architecture. Lecture 6.1: Fundamentals of

Thomas Polzer Institut für Technische Informatik

CS 61C: Great Ideas in Computer Architecture Pipelining and Hazards

Control Hazards - branching causes problems since the pipeline can be filled with the wrong instructions.

Pipelining. Each step does a small fraction of the job All steps ideally operate concurrently

Chapter 4. The Processor

The Processor. Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut. CSE3666: Introduction to Computer Architecture

MIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14

Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome

Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome

Computer Systems Architecture Spring 2016

Instruction Level Parallelism. ILP, Loop level Parallelism Dependences, Hazards Speculation, Branch prediction

Instruction word R0 R1 R2 R3 R4 R5 R6 R8 R12 R31

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition. Chapter 4. The Processor

ECS 154B Computer Architecture II Spring 2009

ELE 655 Microprocessor System Design

Midnight Laundry. IC220 Set #19: Laundry, Co-dependency, and other Hazards of Modern (Architecture) Life. Return to Chapter 4

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor

Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3.

Pipeline design. Mehran Rezaei

3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle?

ECE331: Hardware Organization and Design

CSEE 3827: Fundamentals of Computer Systems

Instr. execution impl. view

Very Simple MIPS Implementation

Transcription:

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building 3-513 wuct@cs.sjtu.edu.cn

Download Lectures ftp://public.sjtu.edu.cn User: wuct Password: wuct123456

Computer Architecture A Quantitative Approach, Fifth Edition Appendix C Pipelining 3

5 Steps of a (pre-pipelined) MIPS Datapath 4 Stages: 1 2 3 4 5 Instruction Instr. Decode Execute Memory Write Fetch. Fetch Addr. Calc Access Back Next PC PC Address 4 Adder Memory IR Inst RTL Actions:. Transfer Language IR <= mem[pc]; #stage 1 PC <= PC + 4 Next SEQ PC RS1 RS2 RD Imm File Sign Extend MUX MUX Zero? MUX Data Memory L M D MUX WB Data [IR rd ] <= ([Ir rs ] op IRop [IR rt ]) #op is done in stages 2-5

WB Data 5-Stage MIPS Datapath (has pipeline latches) Stages: 1 Instruction 2 Instr. Decode 3 Execute 4 Memory 5 Fetch. Fetch Addr. Calc Access Next PC 4 Adder Next SEQ PC RS1 Next SEQ PC Zero? MUX Write Back Address Memory IR <= mem[pc]; #1 PC <= PC + 4 A <= [IR rs ]; #2 B <= [IR rt ] rslt <= A op IRop #3 B WB <= rslt #4 [IR rd ] <= WB #5 IF/ID RS2 Imm File Sign Extend ID/EX MUX MUX EX/MEM RD RD RD Data Memory MEM/WB MUX 5

Instruction Set Processor Controller IR <= mem[pc]; PC <= PC + 4 br JAL JR jmp A <= [IR rs ]; B <= [IR rt ] RR opfetch-decode RI LD ST if bop(a,b) PC <= PC+IR im PC <= IR jaddr r <= A op IRop B r <= A op IRop IR im r <= A + IR im WB <= r WB <= r WB <= Mem[r] [IR rd ] <= WB [IR rd ] <= WB [IR rd ] <= WB 6

WB Data 5-Stage MIPS Datapath (has pipeline latches) Stages: 1 Instruction 2 Instr. Decode 3 Execute 4 Memory 5 Fetch. Fetch Addr. Calc Access Write Back Next PC 4 Adder Next SEQ PC RS1 Next SEQ PC Zero? MUX Address Memory IF/ID RS2 File ID/EX MUX MUX EX/MEM Data Memory MEM/WB MUX Imm Sign Extend RD RD RD Data stationary control local decode for each instruction phase / pipeline stage 7

Visualizing Pipelining Time (clock cycles) I n s t r. Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 O r d e r 8

Pipelining is not quite that easy! Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle Structural hazards: HW cannot support this combination of instructions (having a single person to fold and put clothes away at same time) Data hazards: Instruction depends on result of prior instruction still in the pipeline (having a missing sock in a later wash; cannot put away) Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps). 9

One Memory_Port / Structural_Hazards Time (clock cycles) Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 I n s t r. O r d e r Load Instr 1 Instr 2 Instr 3 Instr 4 10

One Memory Port/Structural Hazards Time (clock cycles) Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 I n s t r. O r d e r Load Instr 1 Instr 2 Stall Instr 3 Bubble Bubble Bubble Bubble Bubble How do you bubble the pipe? 11

Code SpeedUp Equation for Pipelining CPI pipelined Ideal CPI Average Stall cycles per Inst Speedup Ideal CPI Pipeline depth Ideal CPI Pipeline stall CPI Cycle Time Cycle Time unpipeline d pipelined For simple RISC pipeline, Ideal CPI = 1: Speedup 1 Pipeline depth Pipeline stall CPI Cycle Cycle Time Time unpipelined pipelined 12

Example: Dual-port vs. Single-port Machine A: Dual ported memory ( Harvard Architecture ) Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate Ideal CPI = 1 for both Assume loads are 20% of instructions executed SpeedUp A = Pipeline Depth/(1 + 0) x (clock unpipe /clock pipe ) = Pipeline Depth SpeedUp B = Pipeline Depth/(1 + 0.2 x 1) x (clock unpipe /(clock unpipe / 1.05) = (Pipeline Depth/1.20) x 1.05 {105/120 = 7/8} = 0.875 x Pipeline Depth SpeedUp A / SpeedUp B = Pipeline Depth/(0.875 x Pipeline Depth) = 1.14 Machine A is 1.14 times faster 13

Data Hazard on ister R1 (If No Forwarding) Time (clock cycles) I n s t r. add r1,r2,r3 sub r4,r1,r3 IF ID/RF EX MEM WB No forwarding needed since write reg in 1 st half cycle, read reg in 2 nd half cycle. O r d e r and r6,r1,r7 or r8,r1,r9 xor r10,r1,r11 14

Three Generic Data Hazards Read After Write (RAW) Instr J tries to read operand before Instr I writes it I: add r1,r2,r3 J: sub r4,r1,r3 Caused by a (True) Dependence (in compiler nomenclature). This hazard results from an actual need for communicating a new data value. 15

Three Generic Data Hazards Write After Read (WAR) Instr J writes operand before Instr I reads it I: sub r4,r1,r3 J: add r1,r2,r3 K: mul r6,r1,r7 Called an anti-dependence by compiler writers. This results from reuse of the name r1. Cannot happen in MIPS 5 stage pipeline because: All instructions take 5 stages, and ister reads are always in stage 2, and 16 ister writes are always in stage 5

Three Generic Data Hazards Write After Write (WAW) Instr J writes operand before Instr I writes it. I: sub r1,r4,r3 J: add r1,r2,r3 K: mul r6,r1,r7 Called an output dependence by compiler writers This also results from the reuse of name r1. Cannot happen in MIPS 5 stage pipeline because: All instructions take 5 stages, and ister writes are always in stage 5 Will see WAR and WAW in more complicated pipes 17

I n s t r. Forwarding to Avoid Data Hazard Time (clock cycles) add r1,r2,r3 sub r4,r1,r3 Forwarding of outputs needed as inputs 1 & 2 cycles later. Forwarding of LW MEM outputs to SW MEM or inputs 1 or 2 cycles later. Need no forwarding since write reg is in 1 st half cycle, read reg in 2 nd half cycle. O r d e r and r6,r1,r7 or r8,r1,r9 xor r10,r1,r11 18

HW Datapath Changes (in red) for Forwarding NextPC isters ID/EX mux mux EX/MEM To forward output 1 cycle to inputs Data Memory MEM/WR To forward, MEM 2 cycles to (From LW Data Memory) Immediate mux mux (From ) What circuit detects and resolves this hazard? 19 To forward MEM 1 cycle to SW MEM input

Forwarding Avoids - & LW-SW Data Hazards Time (clock cycles) I n s t r. add r1,r2,r3 lw r4, 0(r1) O r d e r sw r4,12(r1) or r8,r6,r9 xor r10,r9,r11 20

LW- Data Hazard Even with Forwarding Time (clock cycles) I n s t r. lw r1, 0(r2) sub r4,r1,r6 No forwarding needed since write reg in 1 st half cycle, read reg in 2 nd half cycle. O r d e r and r6,r1,r7 or r8,r1,r9 21

I n s t r. O r d e r Data Hazard Even with Forwarding Time (clock cycles) lw r1, 0(r2) sub r4,r1,r6 and r6,r1,r7 Bubble Bubble No forwarding needed since write reg in 1 st half cycle, read reg in 2 nd half cycle. or r8,r1,r9 Bubble How is this hazard detected? 22

Software Scheduling to Avoid Load Hazards Try producing fast code with no stalls for a = b + c; d = e f; assuming a, b, c, d,e, and f are in memory. Slow code: LW Rb,b LW Rc,c Stall ===> Stall ===> ADD Ra,Rb,Rc SW a,ra LW Re,e LW Rf,f SUB Rd,Re,Rf SW d,rd Fast code (no stalls): LW Rb,b LW Rc,c LW Re,e ADD Ra,Rb,Rc LW Rf,f SW a,ra SUB Rd,Re,Rf SW Important 23 technique! d,rd Compiler optimizes for performance. Hardware checks for safety.

Outline MIPS An ISA for Pipelining 5 stage pipelining Structural and Data Hazards Forwarding Branch Schemes Exceptions and Interrupts Conclusion 24

WB Data 5-Stage MIPS Datapath (has pipeline latches) Stages: 1 Instruction 2 Instr. Decode 3 Execute 4 Memory 5 Write Fetch. Fetch Addr. Calc Access Back Next PC 4 Adder Next SEQ PC RS1 Next SEQ PC Zero? MUX Address Memory IF/ID RS2 File ID/EX MUX MUX EX/MEM Data Memory MEM/WB MUX Will move red circuits to 2 nd stage to make branch delays shorter Imm Sign Extend RD RD RD Old simple design put branch completion in stage 4 (Mem) 25

Control Hazard on Branch - Three Cycle Stall (Caused if Decide Branches in 4 th Stage) 10: beq r1,r3,34 ID/RF MEM 14: and r2,r3,r5 18: or r6,r1,r7 22: add r8,r1,r9 34: xor r10,r1,r11 What can be done with the 3 instructions between beq & xor? Code between beq&xor must not start until know beq not branch => 3 stalls Adding 3 cycle stall after every branch (1/7 of instructions) => CPI += 3/7. BAD! 26

Branch Stall Impact if Commit in Stage 4 If CPI = 1 and 15% of instructions are branches, Stall 3 cycles => new CPI = 1.45 (1+3*.15) Too much! Two-part solution: Determine sooner whether branch taken or not, AND Compute taken branch address earlier MIPS branch tests if register = 0 or 0 Original 1986 MIPS Solution: Move zero_test to ID/RF (Inst Decode & ister Fetch) stage(2) (4=MEM) Add extra adder to calculate new PC (Program Counter) in ID/RF stage Result is 1 clock cycle penalty for branch versus 3 when decided in MEM 27

New Pipelined MIPS Datapath: Faster Branch WB Data Stages: 1 Instruction 2 Instr. Decode 3 Execute 4 Memory 5 Write Fetch. Fetch Addr. Calc Access Back Next PC 4 Adder Next SEQ PC RS1 Adder MUX Zero? Address Memory The fast_branch design needs a slightly longer stage 2 cycle time, making the clock a little slower for all stages. IF/ID RS2 Imm 28 File Sign Extend MUX EX/MEM RD RD RD Data Memory Example of interplay of instruction set design and cycle time. ID/EX MEM/WB MUX

Four Branch Hazard Alternatives #1: Stall until branch direction is clearly known #2: Predict Branch Not Taken Easy Solution Execute the next instructions in sequence PC+4 already calculated, so use it to get next instruction Nullify bad instructions in pipeline if branch is actually taken Nullify easier since pipeline state updates are late (MEM, WB) 47% MIPS branches not taken on average 29

Four Branch Hazard Alternatives #3: Predict Branch Taken 53% MIPS branches taken on average But have not calculated branch target address in MIPS MIPS still incurs 1 cycle branch penalty Some other CPUs: branch target known before outcome 30

Last of Four Branch Hazard Alternatives #4: Delayed Branch (Used Only in 1st MIPS Killer Micro ) Define branch to take place AFTER a following instruction branch instruction sequential successor 1 sequential successor 2... sequential successor n branch target if taken Branch delay of length n 1 slot delay allows proper decision and branch target address in 5 stage pipeline MIPS 1 st used this (Later versions of MIPS did not; pipeline deeper) 31

Scheduling Branch Delay Slots A. From before branch B. From branch target C. From fall through add $1,$2,$3 if $2=0 then delay slot sub $4,$5,$6 add $1,$2,$3 if $1=0 then delay slot A is the best choice, fills delay slot & reduces instruction count (IC) In B, the sub instruction may need to be copied, increasing IC In B and C, must be okay to execute an extra sub when branch fails 32 add $1,$2,$3 if $1=0 then delay slot sub $4,$5,$6 becomes becomes becomes sub $4,$5,$6 add $1,$2,$3 if $2=0 then if $1=0 then add $1,$2,$3 add $1,$2,$3 if $1=0 then sub $4,$5,$6 sub $4,$5,$6

Delayed Branch Not Used in Modern CPUs Compiler effectiveness 1/2 for single branch delay slot: Fills about 60% of branch delay slots About 80% of instructions executed in branch delay slots useful in computation Only half of (60% x 80%) slots usefully filled; cannot fill 2 or more 33

Delayed Branch Not Used in Modern CPUs Delayed Branch downside: As processor designs use deeper pipelines and multiple issue, the branch delay grows and needs many more delay slots Delayed branching soon lost effectiveness and popularity compared to more expensive but more flexible dynamic approaches Growth in available transistors soon permitted dynamic approaches that keep records of branch locations, taken/not-taken decisions, and target addresses Multi-issue 2 => 3 delay slots needed, 4 => 7 slots, 8 => 15 slots 34

Evaluating Branch Alternatives Pipeline speedup = Assume 4% unconditional jump, 10% conditional branch-taken, 6% conditional branch-not-taken, base CPI = 1. Scheduling Branch CPI speedup vs. speedup vs. Scheme penalty no-pipe 5 cycles stall_pipeline Stall pipeline (Stage 4) 3 1.60 3.1 1.00 Predict taken (Stage 2) 1 1.20 4.2 1.33 Predict not taken (St.2) 1 1.14 4.4 1.40 Delayed branch (Stg 2) 0.5 1.10 4.5 1.45 (Sample 1.60=1+3(4+10+6)% (4.5=5/1.10) (1.45=1.6/1.1) calcu- 1.20=1+1(4+10+6)% (to calculate taken target) lations) 1.14=1+1(4+10)% Pipeline depth 1 +Branch frequency Branch penalty 35 (refetch for jump, taken-branch)

Another Problem with Pipelining Exception: An unusual event happens to an instruction during its execution {caused by instructions executing} Examples: divide by zero, undefined opcode Interrupt: Hardware signal to switch the processor to a new instruction stream {not directly caused by code} Example: a sound card interrupts when it needs more audio output samples (an audio click happens if it is left waiting) Precise Interrupt Problem: Must seem as if the exception or interrupt appeared between 2 instructions (I i and I i+1 ) although several instructions were executing at the time All instructions up to and including I i are totally completed No effect of any instruction after I i is allowed to be saved After a precise interrupt, the interrupt (exception) handler either aborts the program or restarts at instruction I i+1 36

Precise Exceptions in Static Pipelines Stages: F D E M W Fetch Decode Execute Memory Key observation: Architected states change only in memory (M) and register write (W) stages. 37

And In Conclusion: Control and Pipelining Quantify and summarize performance Ratios, Geometric Mean, Multiplicative Standard Deviation F&P: Benchmarks age, disks fail, single-point failure Control via State Machines and Microprogramming Just overlap tasks; easy if tasks are independent Speed Up Pipeline Depth; if ideal CPI is 1, then: Speedup 1 Pipeline depth Pipeline stall CPI Hazards limit performance on computers by stalling: Structural: need more HW resources Data (RAW,WAR,WAW): need forwarding, compiler scheduling Control: delayed branch or branch (taken/not-taken) prediction Exceptions and interrupts add complexity 38 Cycle Cycle Time Time unpipelined pipelined

Homework C.1 39