CSE 502 Graduate Computer Architecture. Lec 3-5 Performance + Instruction Pipelining Review


CSE 502 Graduate Computer Architecture
Lec 3-5 Performance + Instruction Pipelining Review
Larry Wittie, Computer Science, StonyBrook University
http://www.cs.sunysb.edu/~cse502 and ~lw
Slides adapted from David Patterson, UC-Berkeley cs252-s06
2/8,10,15/2011 CSE502-S11, Lec 03+4+5-perf & pipes

Review from last lecture
- Tracking and extrapolating technology is part of the architect's responsibility
- Expect bandwidth in disks, DRAM, networks, and processors to improve by at least as much as the square of the improvement in latency
- Quantify cost (vs. price)
  - IC cost: f(area^2) + learning curve, volume, commodity, price margins
- Quantify dynamic and static power
  - Capacitance x Voltage^2 x frequency; energy vs. power
- Quantify dependability
  - Reliability (MTTF vs. FIT), Availability (MTTF/(MTTF+MTTR))

Outline
- Review
- F&P: Benchmarks age, disks fail, single-point fail danger
- 502 Administrivia
- MIPS: An ISA for Pipelining
- 5 stage pipelining
- Structural and Data Hazards
- Forwarding
- Branch Schemes
- Exceptions and Interrupts
- Conclusion

Fallacies and Pitfalls (1/2)
- Fallacies: commonly held misconceptions
  - When discussing a fallacy, we try to give a counterexample.
- Pitfalls: easily made mistakes
  - Often generalizations of principles that are true only in a limited context
- The text shows Fallacies and Pitfalls to help you avoid these errors
- Fallacy: Benchmarks remain valid indefinitely
  - Once a benchmark becomes popular, there is tremendous pressure to improve performance by targeted optimizations or by aggressive interpretation of the rules for running the benchmark: "benchmarksmanship."
  - 70 benchmarks in the 5 SPEC releases up to 2000; 70% were dropped from the next release because they were no longer useful
- Pitfall: A single point of failure
  - Rule of thumb for fault-tolerant systems: make sure that every component is redundant so that no single component failure can bring down the whole system (e.g., power supply)
  - Lab rule of thumb: "Don't buy one of anything."

Fallacies and Pitfalls (2/2)
- Fallacy: Rated MTTF of disks is 1,200,000 hours, or about 140 years, so disks practically never fail
  - But disk lifetime is 5 years => replace a disk every 5 years; on average, 28 replacements (140/5) would pass before a failure, so disks "never fail"
  - A better unit: % that fail (1.2M-hour MTTF = 833 FIT)
  - Failures over lifetime: 1000 disks for 5 years = 1000 x (5 x 365 x 24) x 833 / 10^9 = 36,485,000 / 10^6 = 36.5, about 37, so 3.7% (37/1000) fail over a 5-year lifetime (1.2M-hour MTTF)
  - But this is under pristine conditions: little vibration, narrow temperature range, no power failures
- Real world:
  - 3% to 6% of SCSI drives fail per year: 3400-6800 FIT, or 150,000 to 300,000 hour MTTF [Gray & van Ingen 05]
  - 3% to 7% of ATA (Advanced Technology Attachment) drives fail per year: 3400-8000 FIT, or 125,000 to 300,000 hour MTTF [Gray & van Ingen 05]
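The MTTF/FIT arithmetic on this slide can be checked with a short Python sketch. The function names are illustrative only (they are not from the lecture), and the sketch assumes the slide's convention that FIT counts failures per 10^9 device-hours and that a year has 365 x 24 hours.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, the value used on the slide


def mttf_to_fit(mttf_hours: float) -> float:
    """Failures per 10^9 device-hours for a given MTTF in hours."""
    return 1e9 / mttf_hours


def expected_failures(n_disks: int, years: float, fit: float) -> float:
    """Expected number of failures among n_disks over the given period."""
    return n_disks * years * HOURS_PER_YEAR * fit / 1e9


def annual_rate_to_fit(annual_failure_rate: float) -> float:
    """Convert a fraction of drives failing per year into FIT."""
    return annual_failure_rate / HOURS_PER_YEAR * 1e9


print(round(mttf_to_fit(1_200_000)))                # 833
print(round(expected_failures(1000, 5, 833), 1))    # 36.5
print(round(annual_rate_to_fit(0.03)))              # 3425
```

The last line shows where the "3400 FIT" figure for a 3% annual failure rate comes from.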

CSE502: Administrivia
Instructor: Prof Larry Wittie
Office/Lab: 1308 CompSci, lw AT icdotsunysbdotedu
Office Hrs: TuTh, 3:45-5:15 pm, if door open, or appt.
T. A.: Unlikely
Class: TuTh, 2:20-3:40 pm, 2120 Comp Sci
Text: Computer Architecture: A Quantitative Approach, 4th Ed. (Oct, 2006), ISBN 0123704901 or 978-0123704900, $65/($50?) Amazon; $77 used SBU, S11
Web page: http://www.cs.sunysb.edu/~cse502/
Current reading: Appendix A (back of CAQA4) (Chap 1 was last week)
(New edition this fall 2011, so little 4ed trade-back value)

Outline: Review; F&P: Benchmarks age, disks fail, single-points fail; 502 Administrivia; MIPS: An ISA for Pipelining; 5 stage pipelining; Structural and Data Hazards; Forwarding; Branch Schemes; Exceptions and Interrupts; Conclusion

A "Typical" RISC ISA
- 32-bit fixed-format instructions (only 3 formats: R, I, J)
- 32 32-bit GPRs (R0 contains zero; DP takes a pair)
- 3-address, reg-reg arithmetic instructions
- Single address mode for load/store: base + displacement
  - no indirection (since it needs another memory access)
- Simple branch conditions (e.g., single-bit: 0 or not?)
- (Delayed branch: ineffective in deep pipelines, so no longer used)
See: SPARC, MIPS, HP PA-RISC, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3

Example: MIPS
Register-Register (R Format): arithmetic operations
  bits 31-26: Op | 25-21: Rs1 | 20-16: Rs2 | 15-11: Rd | 10-6 and 5-0: Opx
Register-Immediate (I Format): all immediate arithmetic ops
  bits 31-26: Op | 25-21: Rs1 | 20-16: Rd | 15-0: immediate
Branch (I Format): moderate relative distance conditional branches
  bits 31-26: Op | 25-21: Rs1 | 20-16: Rs2/Opx | 15-0: immediate
Jump / Call (J Format): long distance jumps
  bits 31-26: Op | 25-0: target
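Because every format places its fields at fixed bit positions, decoding reduces to shifts and masks. The Python sketch below extracts each field from a 32-bit word. It assumes the bit ranges listed on the slide and treats Opx as all of bits 10-0 (the slide gives no separate name for bits 10-6); the function names and the sample word are illustrative, not part of the lecture.

```python
def bits(word: int, hi: int, lo: int) -> int:
    """Extract bits hi..lo (inclusive) from a 32-bit word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def sign_extend16(value: int) -> int:
    """Interpret a 16-bit field as a two's-complement number."""
    return value - (1 << 16) if value & 0x8000 else value


def decode(word: int) -> dict:
    """Return every field; which ones are meaningful depends on the format."""
    return {
        "op": bits(word, 31, 26),        # all formats
        "rs1": bits(word, 25, 21),       # R, I, branch
        "rs2_or_rd": bits(word, 20, 16), # Rs2 (R, branch) or Rd (I)
        "rd": bits(word, 15, 11),        # R format only
        "opx": bits(word, 10, 0),        # R format only (assumed bits 10-0)
        "imm": sign_extend16(bits(word, 15, 0)),  # I and branch formats
        "target": bits(word, 25, 0),     # J format only
    }


# Illustrative R-format word: Op=0, Rs1=2, Rs2=3, Rd=1, Opx=0x20
word = (0 << 26) | (2 << 21) | (3 << 16) | (1 << 11) | 0x20
fields = decode(word)
print(fields["op"], fields["rs1"], fields["rs2_or_rd"], fields["rd"], hex(fields["opx"]))
```

Decoding all fields unconditionally mirrors what fixed-format hardware can do: the register numbers can be read before the opcode has been interpreted.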

Datapath vs Control
[Figure: Controller and Datapath blocks; the controller drives the datapath's control points, and the datapath returns signals to the controller.]
- Datapath: storage, functional units, and interconnections sufficient to perform the desired functions
  - Inputs are control points
  - Outputs are signals
- Controller: state machine to orchestrate operation on the datapath
  - Based on desired function and signals

Approaching an ISA
- Instruction Set Architecture
  - Defines set of operations, instruction format, hardware-supported data types, named storage, addressing modes, sequencing
- Meaning of each instruction is described by RTL (register transfer language) on architected registers and memory
- Given technology constraints, assemble adequate datapath
  - Architected storage mapped to actual storage
  - Function Units (FUs) to do all the required operations
  - Possible additional storage (e.g., internal registers MAR, MDR, IR: Memory Address Register, Memory Data Register, Instruction Register)
  - Interconnect to move information among registers and function units
- Map each instruction to a sequence of RTL operations
- Collate sequences into symbolic controller state transition diagram (STD)
- Lower symbolic STD to control points
- Implement controller

5 Steps of a (pre-pipelined) MIPS Datapath (Figure A.2, Page A-8)
Stages:
  1. Instruction Fetch
  2. Instr. Decode / Reg. Fetch
  3. Execute / Addr. Calc
  4. Memory Access
  5. Write Back
[Figure: datapath with Next PC, PC, adder (+4), instruction memory, IR, Next SEQ PC, register file (RS1, RS2, RD), Imm sign extend, MUXes, Zero? test, ALU, data memory, LMD, and WB Data.]
RTL (Register Transfer Language) actions:
  IR <= mem[PC]; PC <= PC + 4                    # stage 1
  Reg[IR_rd] <= Reg[IR_rs] op_IRop Reg[IR_rt]    # op is done in stages 2-5

5-Stage MIPS Datapath (has pipeline latches) (Figure A.3, Page A-9)
Stages:
  1. Instruction Fetch
  2. Instr. Decode / Reg. Fetch
  3. Execute / Addr. Calc
  4. Memory Access
  5. Write Back
[Figure: the datapath divided by the pipeline latches IF/ID, ID/EX, EX/MEM, and MEM/WB; RD is carried along through the latches to the write-back stage.]
RTL actions by stage:
  IR <= mem[PC]; PC <= PC + 4            # 1
  A <= Reg[IR_rs]; B <= Reg[IR_rt]       # 2
  rslt <= A op_IRop B                    # 3
  WB <= rslt                             # 4
  Reg[IR_rd] <= WB                       # 5
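The five RTL steps for a register-register instruction can be traced in software. The following Python sketch is a deliberately simplified, unpipelined walk through those steps; the tuple encoding of instructions, the `ALU_OPS` table, and the function name are assumptions made for illustration and do not model the pipeline latches themselves.

```python
import operator

# Hypothetical operation table; the slide only writes "op_IRop".
ALU_OPS = {
    "add": operator.add,
    "sub": operator.sub,
    "and": operator.and_,
    "or": operator.or_,
    "xor": operator.xor,
}


def execute_rr(pc, imem, regs):
    """Run one register-register instruction through the five RTL steps."""
    ir = imem[pc // 4]            # 1: IR <= mem[PC]
    pc = pc + 4                   #    PC <= PC + 4
    op, rd, rs, rt = ir
    a, b = regs[rs], regs[rt]     # 2: A <= Reg[IR_rs]; B <= Reg[IR_rt]
    rslt = ALU_OPS[op](a, b)      # 3: rslt <= A op_IRop B
    wb = rslt                     # 4: WB <= rslt
    if rd != 0:                   # R0 always contains zero
        regs[rd] = wb             # 5: Reg[IR_rd] <= WB
    return pc


regs = [0] * 32
regs[2], regs[3] = 5, 7
imem = [("add", 1, 2, 3), ("sub", 4, 1, 3)]  # add r1,r2,r3 ; sub r4,r1,r3
pc = 0
while pc < 4 * len(imem):
    pc = execute_rr(pc, imem, regs)
print(regs[1], regs[4])  # 12 5
```

Run sequentially like this, `sub` always sees the new `r1`; the hazards discussed later arise only once the steps of different instructions overlap in time.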

Instruction Set Processor Controller
[Figure: controller state transition diagram, listed here as RTL per state.]
All instructions:
  fetch:           IR <= mem[PC]; PC <= PC + 4
  opFetch-decode:  A <= Reg[IR_rs]; B <= Reg[IR_rt]
Then, by instruction class (states labeled JAL, br, JR, jmp, RR, RI, LD, ST on the slide):
  br:   if bop(A,b) PC <= PC + IR_im
  jmp:  PC <= IR_jaddr
  RR:   r <= A op_IRop B;      WB <= r;        Reg[IR_rd] <= WB
  RI:   r <= A op_IRop IR_im;  WB <= r;        Reg[IR_rd] <= WB
  LD:   r <= A + IR_im;        WB <= Mem[r];   Reg[IR_rd] <= WB

5-Stage MIPS Datapath (has pipeline latches) (Figure A.3, Page A-9)
[Figure: the five-stage datapath with IF/ID, ID/EX, EX/MEM, and MEM/WB latches.]
Data-stationary control: local decode for each instruction phase / pipeline stage

Visualizing Pipelining (Figure A.2, Page A-8)
[Figure: pipeline diagram with instructions in program order on the vertical axis and time in clock cycles (Cycle 1 to Cycle 7) on the horizontal axis.]
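Since the figure itself did not survive, a text version of the same kind of diagram can be generated. This Python sketch assumes an ideal five-stage pipeline with no stalls, where instruction i enters IF in cycle i+1; the stage abbreviations and the function name are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_chart(n_instr: int) -> list:
    """Row i lists the stage that instruction i occupies in each clock cycle."""
    n_cycles = n_instr + len(STAGES) - 1
    chart = []
    for i in range(n_instr):
        row = [""] * n_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = name  # instruction i reaches stage s in cycle i + s
        chart.append(row)
    return chart


chart = pipeline_chart(3)
print("     " + " ".join(f"C{c + 1:<3}" for c in range(len(chart[0]))))
for i, row in enumerate(chart):
    print(f"I{i}:  " + " ".join(f"{cell:<4}" for cell in row))
```

With three instructions the chart spans seven cycles, matching the Cycle 1 to Cycle 7 axis of the slide: once the pipeline is full, one instruction completes per cycle.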

Pipelining is not quite that easy!
- Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
  - Structural hazards: HW cannot support this combination of instructions (like having a single person to fold and put clothes away at the same time)
  - Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (like having a missing sock in a later wash; cannot put away)
  - Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

One Memory Port / Structural Hazards (Figure A.4, Page A-14)
[Figure: pipeline diagram of Load, Instr 1, Instr 2, Instr 3, Instr 4 in program order versus time in clock cycles (Cycle 1 to Cycle 7).]
With a single memory port, the Load's data access (MEM, cycle 4) conflicts with the instruction fetch (IF) of Instr 3 in the same cycle.

One Memory Port / Structural Hazards (Similar to Figure A.5, Page A-15)
[Figure: pipeline diagram of Load, Instr 1, Instr 2, Stall, Instr 3 versus time in clock cycles (Cycle 1 to Cycle 7); the stall appears as a row of bubbles, one per stage, ahead of Instr 3.]
How do you "bubble" the pipe?

Code Speedup Equation for Pipelining

$$\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Average stall cycles per instruction}$$

$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}}$$

For a simple RISC pipeline, Ideal CPI = 1:

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}}$$

Example: Dual-port vs. Single-port
- Machine A: Dual-ported memory ("Harvard Architecture")
- Machine B: Single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate
- Ideal CPI = 1 for both
- Assume loads are 20% of instructions executed

SpeedUp_A = Pipeline Depth / (1 + 0) x (clock_unpipe / clock_pipe) = Pipeline Depth
SpeedUp_B = Pipeline Depth / (1 + 0.2 x 1) x (clock_unpipe / (clock_unpipe / 1.05))
          = (Pipeline Depth / 1.20) x 1.05    {105/120 = 7/8}
          = 0.875 x Pipeline Depth
SpeedUp_A / SpeedUp_B = Pipeline Depth / (0.875 x Pipeline Depth) = 1.14

Machine A is 1.14 times faster.
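The comparison can be reproduced numerically. The Python sketch below evaluates the speedup equation for both machines; the function name and the choice of a depth of 5 are assumptions for illustration (the ratio between the two machines does not depend on the depth).

```python
def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0, ideal_cpi=1.0):
    """Speedup over the unpipelined machine.

    clock_ratio is cycle_time_unpipelined / cycle_time_pipelined.
    """
    return ideal_cpi * depth / (ideal_cpi + stall_cpi) * clock_ratio


DEPTH = 5  # assumed depth; it cancels out of the A/B ratio

# Machine A: dual-ported memory, no structural stalls.
speedup_a = pipeline_speedup(DEPTH, stall_cpi=0.0)

# Machine B: loads (20% of instructions) each stall 1 cycle, clock is 1.05x faster.
speedup_b = pipeline_speedup(DEPTH, stall_cpi=0.2 * 1, clock_ratio=1.05)

print(round(speedup_b / DEPTH, 3))        # 0.875
print(round(speedup_a / speedup_b, 2))    # 1.14
```

The faster clock of Machine B recovers only part of what the load stalls cost, which is why the dual-ported design still wins.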

Data Hazard on Register R1 (If No Forwarding) (Figure A.6, Page A-16)
[Figure: pipeline diagram (stages IF, ID/RF, EX, MEM, WB) of the following instructions in program order versus time in clock cycles.]
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11
No forwarding is needed when the register is written in the 1st half of a cycle and read in the 2nd half of the same cycle.

Three Generic Data Hazards
- Read After Write (RAW): Instr J tries to read operand before Instr I writes it
    I: add r1,r2,r3
    J: sub r4,r1,r3
- Caused by a (true) "dependence" (in compiler nomenclature). This hazard results from an actual need for communicating a new data value.

Three Generic Data Hazards
- Write After Read (WAR): Instr J writes operand before Instr I reads it
    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
- Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1".
- Cannot happen in the MIPS 5 stage pipeline because:
  - All instructions take 5 stages, and
  - Register reads are always in stage 2, and
  - Register writes are always in stage 5

Three Generic Data Hazards
- Write After Write (WAW): Instr J writes operand before Instr I writes it
    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
- Called an "output dependence" by compiler writers. This also results from the reuse of name "r1".
- Cannot happen in the MIPS 5 stage pipeline because:
  - All instructions take 5 stages, and
  - Register writes are always in stage 5
- Will see WAR and WAW in more complicated pipes
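The three hazard classes correspond to three simple register-name comparisons between an earlier instruction I and a later instruction J. The Python sketch below classifies the examples from these slides; the `(dest, src1, src2)` tuple encoding and the function name are assumptions for illustration. It reports dependences by name only; whether a dependence becomes an actual hazard depends on the pipeline, as the slides note for the 5 stage MIPS.

```python
def classify(instr_i, instr_j):
    """Return the dependences of later instruction J on earlier instruction I.

    Each instruction is a (dest, src1, src2) tuple of register names.
    """
    dest_i, *srcs_i = instr_i
    dest_j, *srcs_j = instr_j
    found = set()
    if dest_i in srcs_j:
        found.add("RAW")  # J reads what I writes (true dependence)
    if dest_j in srcs_i:
        found.add("WAR")  # J overwrites what I reads (anti-dependence)
    if dest_i == dest_j:
        found.add("WAW")  # both write the same name (output dependence)
    return found


# add r1,r2,r3 ; sub r4,r1,r3
print(classify(("r1", "r2", "r3"), ("r4", "r1", "r3")))  # {'RAW'}
# sub r4,r1,r3 ; add r1,r2,r3
print(classify(("r4", "r1", "r3"), ("r1", "r2", "r3")))  # {'WAR'}
# sub r1,r4,r3 ; add r1,r2,r3
print(classify(("r1", "r4", "r3"), ("r1", "r2", "r3")))  # {'WAW'}
```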

Forwarding to Avoid Data Hazard (Figure A.7, Page A-19)
[Figure: pipeline diagram of the following instructions in program order versus time in clock cycles, with forwarding paths drawn between stages.]
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11
- Forwarding of ALU outputs needed as ALU inputs 1 & 2 cycles later.
- Forwarding of LW MEM outputs to SW MEM or ALU inputs 1 or 2 cycles later.
- Need no forwarding when the register is written in the 1st half of a cycle and read in the 2nd half of the same cycle.

HW Datapath Changes (in red) for Forwarding (Figure A.23, Page A-37)
[Figure: NextPC, Registers, and Immediate feed ID/EX; muxes select the ALU inputs; then EX/MEM, Data Memory, and MEM/WR. Forwarding paths and muxes are added:]
- To forward ALU output 1 cycle to ALU inputs
- To forward ALU, MEM 2 cycles to ALU (from LW Data Memory, from ALU)
- To forward MEM 1 cycle to SW MEM input
What circuit detects and resolves this hazard?
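One common answer to the question on this slide is a forwarding unit that compares the source register of the instruction entering the ALU with the destination registers held in the later pipeline latches. The Python sketch below models that comparison for a single ALU input. It is a simplified illustration under stated assumptions, not the lecture's circuit: the function name, argument names, and returned labels are hypothetical, and the latch the slide calls MEM/WR is written MEM/WB here.

```python
def forward_select(src_reg, ex_mem_rd, ex_mem_writes, mem_wb_rd, mem_wb_writes):
    """Choose the source for one ALU input of the instruction in EX.

    src_reg        register number the instruction wants to read
    ex_mem_rd      destination of the instruction 1 cycle ahead (in EX/MEM)
    mem_wb_rd      destination of the instruction 2 cycles ahead (in MEM/WB)
    *_writes       True if that instruction writes a register at all
    """
    # Prefer the most recent producer; R0 always contains zero, so never forward it.
    if ex_mem_writes and ex_mem_rd != 0 and ex_mem_rd == src_reg:
        return "EX/MEM"   # ALU output forwarded 1 cycle later
    if mem_wb_writes and mem_wb_rd != 0 and mem_wb_rd == src_reg:
        return "MEM/WB"   # ALU or load result forwarded 2 cycles later
    return "REGFILE"      # no hazard: use the value read in ID


# sub r4,r1,r3 directly after add r1,r2,r3: r1 comes from EX/MEM.
print(forward_select(1, ex_mem_rd=1, ex_mem_writes=True, mem_wb_rd=0, mem_wb_writes=False))
# and r6,r1,r7 two instructions after the add: r1 comes from MEM/WB.
print(forward_select(1, ex_mem_rd=4, ex_mem_writes=True, mem_wb_rd=1, mem_wb_writes=True))
```

Checking EX/MEM before MEM/WB matters when two earlier instructions both write the same register: the younger result is the one the program order requires.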

Forwarding Avoids ALU-ALU & LW-SW Data Hazards (Figure A.8, Page A-20)
[Figure: pipeline diagram of the following instructions in program order versus time in clock cycles.]
  add r1,r2,r3
  lw  r4, 0(r1)
  sw  r4,12(r1)
  or  r8,r6,r9
  xor r10,r9,r11

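LW-ALU Data Hazard Even with Forwarding (Figure A.9, Page A-21)
[Figure: pipeline diagram of the following instructions in program order versus time in clock cycles.]
  lw  r1, 0(r2)
  sub r4,r1,r6
  and r6,r1,r7
  or  r8,r1,r9
The loaded value is not available until the end of lw's MEM stage, but sub needs it at the start of its EX stage in that same cycle, so forwarding alone cannot remove this hazard.
No forwarding is needed when the register is written in the 1st half of a cycle and read in the 2nd half of the same cycle.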

Data Hazard Even with Forwarding (Similar to Figure A.10, Page A-21)
[Figure: pipeline diagram of the following instructions versus time in clock cycles; a bubble is inserted into sub, and, and or, delaying each by one cycle.]
  lw  r1, 0(r2)
  sub r4,r1,r6    (bubble)
  and r6,r1,r7    (bubble)
  or  r8,r1,r9    (bubble)
No forwarding is needed when the register is written in the 1st half of a cycle and read in the 2nd half of the same cycle.
How is this hazard detected?

Software Scheduling to Avoid Load Hazards
Try producing fast code with no stalls for
    a = b + c;
    d = e - f;
assuming a, b, c, d, e, and f are in memory.

Slow code:                      Fast code (no stalls):
              LW  Rb,b                LW  Rb,b
              LW  Rc,c                LW  Rc,c
  Stall ===>  ADD Ra,Rb,Rc            LW  Re,e
              SW  a,Ra                ADD Ra,Rb,Rc
              LW  Re,e                LW  Rf,f
              LW  Rf,f                SW  a,Ra
  Stall ===>  SUB Rd,Re,Rf            SUB Rd,Re,Rf
              SW  d,Rd                SW  d,Rd

Compiler optimizes for performance. Hardware checks for safety. Important technique!
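The stall counts for the two schedules can be checked mechanically. The Python sketch below counts load-use stalls under a simplified model: with forwarding, one stall cycle is needed only when an instruction uses as an ALU operand the register loaded by the instruction immediately before it. The tuple encoding, the function name, and the treatment of SW (its stored value is not an ALU operand, since it can be forwarded to the MEM stage) are assumptions for illustration.

```python
def load_use_stalls(program):
    """Count 1-cycle stalls caused by an LW followed directly by a consumer.

    program is a list of (opcode, dest, alu_sources) tuples.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        if prev_op == "LW" and prev_dest in cur_sources:
            stalls += 1
    return stalls


slow = [
    ("LW", "Rb", ()), ("LW", "Rc", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("SW", None, ()),
    ("LW", "Re", ()), ("LW", "Rf", ()),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ()),
]
fast = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()), ("SW", None, ()),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ()),
]
print(load_use_stalls(slow), load_use_stalls(fast))  # 2 0
```

Both schedules contain the same eight instructions; the fast one simply places an independent instruction after each load whose result is needed next.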

Outline: Review; F&P: Benchmarks age, disks fail, single-points fail; 502 Administrivia; MIPS: An ISA for Pipelining; 5 stage pipelining; Structural and Data Hazards; Forwarding; Branch Schemes; Exceptions and Interrupts; Conclusion

5-Stage MIPS Datapath (has pipeline latches) (Figure A.3, Page A-9)
[Figure: the five-stage datapath with IF/ID, ID/EX, EX/MEM, and MEM/WB latches, with the branch-resolution circuits shown in red.]
- Old simple design put branch completion in stage 4 (Mem)
- Will move red circuits to 2nd stage to make branch delays shorter

Control Hazard on Branch - Three Cycle Stall (Caused if Decide Branches in 4th Stage)
[Figure: pipeline diagram of the following instructions (address: instruction) versus time in clock cycles.]
  10: beq r1,r3,34
  14: and r2,r3,r5
  18: or  r6,r1,r7
  22: add r8,r1,r9
  34: xor r10,r1,r11
- What can be done with the 3 instructions between beq & xor?
- Code between beq & xor must not start until we know that beq does not branch => 3 stalls
- Adding a 3 cycle stall after every branch (1/7 of instructions) => CPI += 3/7. BAD!

Branch Stall Impact if Commit in Stage 4
- If CPI = 1 and 15% of instructions are branches, a stall of 3 cycles => new CPI = 1.45 (1 + 3 x 0.15). Too much!
- Two-part solution:
  - Determine sooner whether branch taken or not, AND
  - Compute taken branch address earlier
- MIPS branch tests if register = 0 or != 0
- Original 1986 MIPS solution:
  - Move zero_test to ID/RF (Inst Decode & Register Fetch) stage (2) (4 = MEM)
  - Add extra adder to calculate new PC (Program Counter) in ID/RF stage
  - Result is 1 clock cycle penalty for branch versus 3 when decided in MEM

New Pipelined MIPS Datapath: Faster Branch (Figure A.24, page A-38)
[Figure: the five-stage datapath with the Zero? test and an extra adder placed in stage 2 (Instr. Decode / Reg. Fetch).]
- The fast_branch design needs a slightly longer stage 2 cycle time, making the clock a little slower for all stages.
- Example of interplay of instruction set design and cycle time.

Four Branch Hazard Alternatives
#1: Stall until branch direction is clearly known
#2: Predict Branch Not Taken (easy solution)
  - Execute the next instructions in sequence
  - PC+4 already calculated, so use it to get next instruction
  - Nullify bad instructions in pipeline if branch is actually taken
  - Nullify is easier since pipeline state updates are late (MEM, WB)
  - 47% of MIPS branches not taken on average
#3: Predict Branch Taken
  - 53% of MIPS branches taken on average
  - But have not calculated branch target address in MIPS
    - MIPS still incurs 1 cycle branch penalty
    - Some other CPUs: branch target known before outcome

Last of Four Branch Hazard Alternatives
#4: Delayed Branch (Used Only in 1st MIPS "Killer Micro")
  - Define branch to take place AFTER a following instruction:
      branch instruction
      sequential successor 1
      sequential successor 2
      ...
      sequential successor n
      branch target if taken
    (successors 1 to n form the branch delay of length n)
  - 1 slot delay allows proper decision and branch target address in 5 stage pipeline
  - MIPS 1st used this (later versions of MIPS did not; pipeline deeper)

Scheduling Branch Delay Slots (Fig A.14)

A. From before branch:
     add $1,$2,$3
     if $2=0 then
         [delay slot]
   becomes
     if $2=0 then
         add $1,$2,$3

B. From branch target:
     sub $4,$5,$6
     add $1,$2,$3
     if $1=0 then
         [delay slot]
   becomes
     sub $4,$5,$6
     add $1,$2,$3
     if $1=0 then
         sub $4,$5,$6

C. From fall through:
     add $1,$2,$3
     if $1=0 then
         [delay slot]
     sub $4,$5,$6
   becomes
     add $1,$2,$3
     if $1=0 then
         sub $4,$5,$6

- A is the best choice: fills delay slot & reduces instruction count (IC)
- In B, the sub instruction may need to be copied, increasing IC
- In B and C, must be okay to execute an extra sub when branch fails

Delayed Branch Not Used in Modern CPUs
- Compiler effectiveness is about 1/2 for a single branch delay slot:
  - Fills about 60% of branch delay slots
  - About 80% of instructions executed in branch delay slots are useful in computation
  - So only about half (60% x 80% = 48%) of slots are usefully filled; cannot fill 2 or more
- Delayed branch downside: as processor designs use deeper pipelines and multiple issue, the branch delay grows and needs many more delay slots
  - Delayed branching soon lost effectiveness and popularity compared to more expensive but more flexible dynamic approaches
  - Growth in available transistors soon permitted dynamic approaches that keep records of branch locations, taken/not-taken decisions, and target addresses
  - Multi-issue 2 => 3 delay slots needed, 4 => 7 slots, 8 => 15 slots

Evaluating Branch Alternatives

$$\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$$

Assume 4% unconditional jump, 10% conditional branch-taken, 6% conditional branch-not-taken, base CPI = 1.

Scheduling scheme             Branch penalty   CPI    Speedup vs. no-pipe (5 cycles)   Speedup vs. stall_pipeline
Stall pipeline (Stage 4)      3                1.60   3.1                              1.00
Predict taken (Stage 2)       1                1.20   4.2                              1.33
Predict not taken (Stage 2)   1                1.14   4.4                              1.40
Delayed branch (Stage 2)      0.5              1.10   4.5                              1.45

Sample calculations:
  1.60 = 1 + 3 x (4+10+6)%
  1.20 = 1 + 1 x (4+10+6)%   (to calculate taken target)
  1.14 = 1 + 1 x (4+10)%     (refetch for jump, taken-branch)
  4.5  = 5 / 1.10
  1.45 = 1.6 / 1.1
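The table can be regenerated from the stated instruction mix. The Python sketch below recomputes CPI and both speedup columns; the dictionary layout and names are illustrative assumptions, and the penalties and the fraction of instructions that pay them are taken from the sample calculations on the slide.

```python
JUMP, TAKEN, NOT_TAKEN = 0.04, 0.10, 0.06  # instruction mix from the slide
DEPTH = 5                                  # unpipelined machine takes 5 cycles
ALL_BRANCHES = JUMP + TAKEN + NOT_TAKEN

# scheme -> (branch penalty in cycles, fraction of instructions that pay it)
SCHEMES = {
    "Stall pipeline (Stage 4)":    (3.0, ALL_BRANCHES),
    "Predict taken (Stage 2)":     (1.0, ALL_BRANCHES),
    "Predict not taken (Stage 2)": (1.0, JUMP + TAKEN),
    "Delayed branch (Stage 2)":    (0.5, ALL_BRANCHES),
}


def evaluate(schemes=SCHEMES):
    """Return scheme -> (CPI, speedup vs. no-pipe, speedup vs. stall pipeline)."""
    cpi = {name: 1.0 + penalty * freq for name, (penalty, freq) in schemes.items()}
    stall_cpi = cpi["Stall pipeline (Stage 4)"]
    return {name: (c, DEPTH / c, stall_cpi / c) for name, c in cpi.items()}


for name, (c, vs_nopipe, vs_stall) in evaluate().items():
    print(f"{name:<28} CPI={c:.2f}  vs no-pipe={vs_nopipe:.2f}  vs stall={vs_stall:.2f}")
```

Predict-not-taken beats predict-taken here only because the not-taken branches (6%) avoid the penalty entirely, while predict-taken still pays one cycle to compute the target.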

Another Problem with Pipelining
- Exception: an unusual event happens to an instruction during its execution {caused by instructions executing}
  - Examples: divide by zero, undefined opcode
- Interrupt: hardware signal to switch the processor to a new instruction stream {not directly caused by code}
  - Example: a sound card interrupts when it needs more audio output samples (an audio "click" happens if it is left waiting)
- Precise Interrupt Problem: it must seem as if the exception or interrupt appeared between 2 instructions (I_i and I_(i+1)), although several instructions were executing at the time
  - All instructions up to and including I_i are totally completed
  - No effect of any instruction after I_i is allowed to be saved
- After a precise interrupt, the interrupt (exception) handler either aborts the program or restarts at instruction I_(i+1)

Precise Exceptions in Static Pipelines
[Figure: pipeline with stages F (Fetch), D (Decode), E (Execute), M (Memory), W (register Write).]
Key observation: architected states change only in the memory (M) and register write (W) stages.

And In Conclusion: Control and Pipelining
- Quantify and summarize performance
  - Ratios, Geometric Mean, Multiplicative Standard Deviation
- F&P: Benchmarks age, disks fail, single-point failure
- Control via State Machines and Microprogramming
- Just overlap tasks; easy if tasks are independent
- Speedup <= Pipeline Depth; if ideal CPI is 1, then:

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}}$$

- Hazards limit performance on computers by stalling:
  - Structural: need more HW resources
  - Data (RAW, WAR, WAW): need forwarding, compiler scheduling
  - Control: delayed branch or branch (taken/not-taken) prediction
- Exceptions and interrupts add complexity
- For next time: Read Appendix C.

Unused Slides Spr 11

CSE502: Administrivia
http://www.cs.sunysb.edu/~lw/teaching/cse502/doldf07/
Last year's slides are in ~lw/teaching/cse502/doldf08/
  DoldF07/lec01-intro.pdf
  DoldF07/lec02-intro.pdf
  DoldF07/lec03-pipe.pdf
  DoldF07/lec04-cache.pdf
  DoldF07/lec05-dynamic-sched.pdf
  DoldF07/lec06-dynamic-schedB.pdf
  DoldF07/lec07-ILP limits.pdf
  DoldF07/lec07-limitsILP_SMT.pdf
  DoldF07/lec08-SMT.pdf
  DoldF07/lec09-Vector.pdf
  DoldF07/lec10-Modern Vector.pdf
  DoldF07/lec11-SMP.pdf
  DoldF07/lec12-Snoop+MTreview.pdf
  DoldF07/lec12-Snoop+MTreviewPreliminary.pdf
  DoldF07/lec14-directory.pdf
  DoldF07/lec16-T1 MP.pdf
  DoldF07/lec17-memoryhier.pdf
  DoldF07/lec18-VM memhier2.pdf
  DoldF07/lec19-storage.pdf
  DoldF07/lec20-review.pdf