COSC 6385 Computer Architecture - Pipelining (II)


Edgar Gabriel
Spring 2018

Performance evaluation of pipelines (I)

General speedup formula:

  Speedup = Time_old / Time_new
          = (IC_old × ClockCycle_old × CPI_old) / (IC_new × ClockCycle_new × CPI_new)

For a fixed application, let us assume that IC_old = IC_new:

  Speedup = (ClockCycle_old × CPI_old) / (ClockCycle_new × CPI_new)

If we assume additionally that the CPU has the same frequency, i.e. ClockCycle_old = ClockCycle_new:

  Speedup = CPI_old / CPI_new
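The speedup formula above can be checked with a short script. This is a sketch under the slide's assumptions (same instruction count, same clock frequency); the variable names are illustrative, not from the lecture:

```python
# Sketch of the general performance equation:
# CPU time = instruction count x cycles per instruction x cycle time.
# Names (ic, cpi, clock_cycle) are my own, not from the slides.

def exec_time(ic, cpi, clock_cycle):
    """Execution time in seconds."""
    return ic * cpi * clock_cycle

def speedup(time_old, time_new):
    return time_old / time_new

# Fixed application (same IC) and same clock frequency:
# the speedup reduces to CPI_old / CPI_new.
t_old = exec_time(ic=1_000_000, cpi=4.0, clock_cycle=1e-9)
t_new = exec_time(ic=1_000_000, cpi=1.0, clock_cycle=1e-9)
print(speedup(t_old, t_new))  # 4.0
```

With identical IC and clock cycle, only the CPI ratio survives, which is exactly the last line of the derivation above.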

Performance evaluation of pipelines (II)

If looking at individual classes of instructions:

  Speedup_overall = Time_old / Time_new
                  = (ClockCycle_old × Σ_{i=1..n} IC_i,old × CPI_i,old) / (ClockCycle_new × Σ_{i=1..n} IC_i,new × CPI_i,new)

Assuming IC_total is identical in both architectures, and with f_i = IC_i / IC_total:

  Speedup_overall = (ClockCycle_old × Σ_{i=1..n} f_i × CPI_i,old) / (ClockCycle_new × Σ_{i=1..n} f_i × CPI_i,new)

Comparing pipelined and non-pipelined execution

An ideal pipeline produces one result per clock cycle: ideal CPI = 1.

  Speedup = Time_non-pipelined / Time_pipelined

Ideally Time_pipelined = Time_non-pipelined / pipeline_stages, i.e. the maximum speedup equals the number of pipeline stages.

Using the average instruction execution time (AvIETime):

  Speedup = AvIETime_non-pipelined / AvIETime_pipelined
          = (CPI_non-pipelined × ClockCycle_non-pipelined) / (CPI_pipelined × ClockCycle_pipelined)

Comparing pipelined and non-pipelined execution (II)

Realistically: CPI_pipelined = ideal CPI + pipeline stall cycles per instruction. Thus:

  Speedup = AvIETime_non-pipelined / AvIETime_pipelined
          = (CPI_non-pipelined × ClockCycle_non-pipelined) / ((1 + PipelineStallCyclesPerInstr) × ClockCycle_pipelined)

If the clock cycle time is constant:

  Speedup = CPI_non-pipelined / (1 + PipelineStallCyclesPerInstr)

Example I

(A) Given a non-pipelined processor:
- 1 ns clock cycle time
- 4 cycles for ALU operations
- 4 cycles for branches
- 5 cycles for memory operations

(B) Given also a pipelined processor:
- 1.2 ns clock cycle time

Both (A) and (B) execute
- 40% ALU operations
- 20% branches
- 40% memory operations

What is the speedup of (B) over (A) due to pipelining?
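The constant-clock-cycle case above can be sketched in a few lines (a toy model with made-up numbers, assuming an ideal pipelined CPI of 1):

```python
# Speedup of a pipelined over a non-pipelined machine when both use the
# same clock cycle time: CPI_nonpipelined / (1 + stall cycles per instr).

def pipeline_speedup(cpi_nonpipelined, stalls_per_instr):
    # Realistic pipelined CPI = ideal CPI (1) + stall cycles per instruction
    cpi_pipelined = 1.0 + stalls_per_instr
    return cpi_nonpipelined / cpi_pipelined

print(pipeline_speedup(4.0, 0.0))  # 4.0  (ideal pipeline, no stalls)
print(pipeline_speedup(4.0, 1.0))  # 2.0  (one stall cycle per instruction)
```

The second call shows how quickly stall cycles eat into the ideal speedup: a single stall per instruction halves it.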

Example I (continued)

For machine (A):

  AvIETime(A) = ClockCycle_A × Σ_{i=1..n} f_i × CPI_i
              = 1 ns × (0.4 × 4 + 0.2 × 4 + 0.4 × 5) = 4.4 ns

For machine (B), assuming ideal CPI (= 1):

  AvIETime(B) = ClockCycle_B × Σ_{i=1..n} f_i × CPI_i
              = 1.2 ns × (0.4 × 1 + 0.2 × 1 + 0.4 × 1) = 1.2 ns

Thus:

  Speedup = AvIETime(A) / AvIETime(B) = 4.4 ns / 1.2 ns ≈ 3.7

Exceptions

Exceptions interrupt the normal instruction execution order, e.g.:
- I/O device request
- Invoking an OS service from an application
- Tracing execution
- Breakpoint or FP arithmetic anomaly (e.g. overflow)
- Page fault
- Misaligned memory access
- Memory protection violation
- Hardware malfunction
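The numbers of Example I can be reproduced with a small helper that computes the frequency-weighted average instruction execution time (a sketch; the function and variable names are my own):

```python
# Reproducing Example I: AvIETime as a frequency-weighted sum of
# per-class CPIs, times the clock cycle time.

def av_ie_time(clock_cycle_ns, mix):
    """mix: list of (fraction, cpi) pairs, one per instruction class."""
    return clock_cycle_ns * sum(f * cpi for f, cpi in mix)

# Machine (A), non-pipelined: 1 ns cycle; ALU 4, branch 4, memory 5 cycles;
# mix: 40% ALU, 20% branches, 40% memory operations.
a = av_ie_time(1.0, [(0.4, 4), (0.2, 4), (0.4, 5)])   # = 4.4 ns

# Machine (B), pipelined: 1.2 ns cycle, ideal CPI = 1 for every class.
b = av_ie_time(1.2, [(0.4, 1), (0.2, 1), (0.4, 1)])   # = 1.2 ns

print(round(a, 2), round(b, 2), round(a / b, 2))  # 4.4 1.2 3.67
```

The 3.67 matches the slide's rounded result of 3.7: pipelining wins despite machine (B)'s 20% slower clock.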

Classification of exceptions

Problem with pipelining: different stages of the pipeline can raise exceptions, leading to a different order of exceptions than in the unpipelined case.

Classes of exceptions:
1. Synchronous vs. asynchronous
2. User requested vs. coerced
3. User maskable vs. user non-maskable
4. Within vs. between instructions
5. Resume vs. terminate

Exceptions (continued)

Most problematic: exceptions raised within instructions, where the instruction must be resumed. Another program must be invoked to save the state of the interrupted program. Pipelines capable of handling such exceptions are called restartable.

Pipeline stage | Possible exceptions
IF  | Page fault on instruction fetch; misaligned memory access; memory protection violation
ID  | Undefined or illegal opcode
EX  | Arithmetic exception
MEM | Page fault on data fetch; misaligned memory access; memory protection violation
WB  | None

Exceptions (continued)

Since an exception cannot be handled immediately when it occurs:
- A status vector associated with the instruction records the exception
- The status vector is carried along with the instruction
- Writing of data values is disabled if the status vector is set
- In WB the status vector is checked and the exception is handled

=> The exception of instruction i is handled before the exception of instruction i+1
=> Since no data values are written back, the register file is not changed -> the instruction can be repeated

Multi-cycle instructions

Not all instructions take the same number of cycles to finish! Floating-point instructions can take many cycles to complete.

- Latency: number of intervening cycles between an instruction that produces a result and an instruction that uses the result. Usually: depth of the EX stage - 1
- Initiation interval: number of cycles that must elapse between issuing two operations of a given type

Multi-cycle instructions/pipelines increase the probability of WAW and RAW hazards occurring.

Example of a multi-cycle pipeline

[Figure: IF and ID feed four parallel execution units, which feed MEM and WB:
- EX: integer unit
- FP/integer multiply unit: stages M1 M2 M3 M4 M5 M6 M7
- FP add unit: stages A1 A2 A3 A4
- FP/integer divide unit: DIV (non-pipelined)]

Functional unit | Latency | Initiation interval
ALU             | 0       | 1
Data memory     | 1       | 1
FP add          | 3       | 1
FP multiply     | 6       | 1
FP divide       | 24      | 25

Instruction level parallelism

Exploit parallelism between independent instructions. This is
- limited by data dependencies
- limited by branches

Example:

  for (i=0; i<n; i++) {
      c[i] = a[i] + b[i];
  }

Each iteration of the loop is independent. Exploiting that fact is not trivial because of register reuse!
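With the latency definition above (intervening cycles between producer and consumer), the stalls a dependent instruction sees can be sketched as follows. This is a simplified model I am assuming, ignoring structural hazards and forwarding details: each independent instruction scheduled in between hides one cycle of latency.

```python
# Latencies from the table above: number of intervening cycles between
# an instruction producing a result and an instruction using it.
LATENCY = {
    "ALU": 0,
    "Data memory": 1,
    "FP add": 3,
    "FP multiply": 6,
    "FP divide": 24,
}

def stall_cycles(producer, independent_instrs_between):
    """Stall cycles of a dependent instruction in this simplified model:
    every independent instruction in between hides one latency cycle."""
    return max(0, LATENCY[producer] - independent_instrs_between)

print(stall_cycles("FP multiply", 0))  # 6 (consumer directly after MUL.D)
print(stall_cycles("FP add", 3))       # 0 (latency fully hidden)
```

This is why a compiler that finds independent work to place between an FP multiply and its consumer can hide up to six cycles of latency.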

Instruction level parallelism (continued)

Data dependencies:
- True dependence: instruction i produces a result required by instruction i+k, k>0 (RAW), sharing a register or a memory location
- Name dependencies: use of the same register or memory location without data flow
  - Antidependence: instruction i+k writes a register/memory location read by instruction i (WAR). No problem if instructions are not reordered
  - Output dependence: instruction i and instruction i+k write the same register/memory location (WAW). No problem if instructions are not reordered
- Control dependence: determines the ordering of an instruction i with respect to a branch

Dynamic scheduling

Up to now:
- Instructions are issued in program order
- If an instruction is stalled in the pipeline, no later instruction can proceed

  DIV.D F0, F2, F4
  ADD.D F10, F0, F8
  SUB.D F12, F8, F14

(SUB.D is independent, but stalls behind ADD.D, which waits on DIV.D's F0.)

In order to allow out-of-order execution, the ID stage is split into two parts:
- Instruction issue: decode the instruction and check for structural hazards
- Read operands: read the operands when there is no data hazard
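The three register-dependence classes above can be illustrated with a small classifier. This is a sketch: the encoding of an instruction as a (destination, sources) tuple is my own, and only register names are inspected (memory dependences are ignored).

```python
# Classify the dependence of a later instruction j on an earlier
# instruction i, looking only at register operands.
# Instruction encoding (assumed): (dest_reg, [source_regs]).

def classify(i, j):
    i_dest, i_srcs = i
    j_dest, j_srcs = j
    deps = []
    if i_dest in j_srcs:
        deps.append("RAW")   # true dependence: j reads what i writes
    if j_dest in i_srcs:
        deps.append("WAR")   # antidependence: j overwrites what i reads
    if j_dest == i_dest:
        deps.append("WAW")   # output dependence: both write the same reg
    return deps

div = ("F0", ["F2", "F4"])    # DIV.D F0, F2, F4
add = ("F10", ["F0", "F8"])   # ADD.D F10, F0, F8
sub = ("F8", ["F8", "F14"])   # SUB.D F8, F8, F14 (writes F8, read by ADD.D)

print(classify(div, add))  # ['RAW'] -- ADD.D needs DIV.D's F0
print(classify(add, sub))  # ['WAR'] -- SUB.D overwrites F8 read by ADD.D
```

The second result is exactly the WAR hazard that makes reordering SUB.D ahead of ADD.D unsafe without renaming.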

Dynamic scheduling (II)

Out-of-order execution introduces the possibility of WAR and WAW hazards:

  In program order:        Reordered:
  DIV.D F0, F2, F4         DIV.D F0, F2, F4
  ADD.D F10, F0, F8        SUB.D F8, F8, F14
  SUB.D F8, F8, F14        ADD.D F10, F0, F8

(Moving SUB.D ahead of ADD.D creates a WAR hazard on F8.)

Out-of-order execution only improves performance if
- multiple instructions can be executed at once
- multiple functional units are available

All instructions pass through the issue stage in order; instructions can be bypassed in the read-operand stage.

Algorithms allowing instructions to execute out of order:
- Scoreboarding
- Tomasulo's approach

Scoreboarding

First implemented in the CDC 6600. Assumptions for the following slides:
- 2 multipliers
- 1 adder
- 1 divider
- 1 integer unit

Each instruction goes through the scoreboard:
- The scoreboard determines when an instruction can execute
- The scoreboard monitors the usage of the execution units
- The scoreboard monitors when a result can be written to the destination register

Scoreboarding (II)

The four steps of scoreboarding (replacing ID, EX and WB):
1. Issue: only if the functional unit is free and no other active instruction has the same destination register
2. Read operands: the scoreboard monitors the availability of the operands
3. Execution
4. Write result: when execution is done, the scoreboard checks for WAR hazards and stalls the instruction if necessary

Scoreboarding (III)

Scoreboard data structures:
- Instruction status: which of the four steps the instruction is in
- Functional unit status: the status of each functional unit:
  - Busy: indicates whether the unit is busy or not
  - Op: operation to be performed
  - Fi: destination register number
  - Fj, Fk: source register numbers
  - Qj, Qk: functional units producing source registers Fj, Fk
  - Rj, Rk: flags indicating whether Fj, Fk are ready; set to No after the operands are read
- Register result status: which functional unit will write which register
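The bookkeeping above can be sketched with plain dictionaries. The field names follow the slides; everything else (the unit name, the helper functions) is my own illustration, covering only the issue step:

```python
# Minimal sketch of two scoreboard tables: functional-unit status and
# register-result status. The issue-stage rule implemented here: issue
# only if the unit is free and no active instruction already has the
# same destination register (this prevents WAW hazards).

fu_status = {
    "Integer": {"Busy": False, "Op": None, "Fi": None,
                "Fj": None, "Fk": None, "Qj": None, "Qk": None,
                "Rj": False, "Rk": False},
}
register_result = {}   # register -> functional unit that will write it

def can_issue(unit, dest_reg):
    return not fu_status[unit]["Busy"] and dest_reg not in register_result

def issue(unit, op, dest, src1, src2):
    fu = fu_status[unit]
    fu.update(Busy=True, Op=op, Fi=dest, Fj=src1, Fk=src2,
              # Qj/Qk: units still producing the sources (if any);
              # Rj/Rk: a source is ready if no unit is pending on it.
              Qj=register_result.get(src1), Qk=register_result.get(src2),
              Rj=src1 not in register_result, Rk=src2 not in register_result)
    register_result[dest] = unit

print(can_issue("Integer", "F6"))  # True
issue("Integer", "Load", "F6", "R2", None)
print(can_issue("Integer", "F6"))  # False (unit busy, F6 pending)
```

The read-operand, execute, and write-result steps would update Rj/Rk and clear the tables again; they are omitted here to keep the sketch short.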

Scoreboarding example

  L.D   F6, 34(R2)
  L.D   F2, 45(R3)
  MUL.D F0, F2, F4
  SUB.D F8, F6, F2
  DIV.D F10, F0, F6
  ADD.D F6, F8, F2

The following slides are based on a lecture by Jelena Mirkovic, University of Delaware:
http://www.cis.udel.edu/~sunshine/courses/f04/cis662/class10.pdf

Assumptions: ADD and SUB take 2 clock cycles, MULT takes 10 clock cycles, DIV takes 40 clock cycles.

[The per-cycle scoreboard tables of the original slides are condensed into the cycle-by-cycle trace below.]

Time=1:  First load is issued (integer unit: Op=Load, Fi=F6).
Time=2:  First load reads operands; second load cannot issue (structural hazard on the integer unit).
Time=3:  First load completes execution; second load cannot issue (structural hazard).
Time=4:  First load writes its result; second load cannot issue (structural hazard).
Time=5:  Second load is issued.
Time=6:  Second load reads operands; MUL.D is issued.
Time=7:  Second load completes execution; MUL.D is stalled waiting for F2; SUB.D is issued.
Time=8:  Second load writes its result; MUL.D and SUB.D are stalled waiting for F2; DIV.D is issued.
Time=9:  MUL.D and SUB.D read operands; DIV.D is stalled waiting for F0; ADD.D cannot issue (structural hazard on the add unit).
Time=10: MUL.D executes (1 of 10 cycles); SUB.D executes (1 of 2 cycles); DIV.D is stalled (F0).
Time=11: MUL.D executes (2/10); SUB.D completes execution; DIV.D is stalled (F0).
Time=12: MUL.D executes (3/10); SUB.D writes its result; DIV.D is stalled (F0).
Time=13: MUL.D executes (4/10); DIV.D is stalled (F0); ADD.D is issued.
Time=14: MUL.D executes (5/10); DIV.D is stalled (F0); ADD.D reads operands.
Time=15: MUL.D executes (6/10); DIV.D is stalled (F0); ADD.D executes (1 of 2 cycles).
Time=16: MUL.D executes (7/10); DIV.D is stalled (F0); ADD.D completes execution.
Time=17: MUL.D executes (8/10); DIV.D is stalled (F0); ADD.D is stalled (WAR hazard on F6 with DIV.D).
Time=19: MUL.D completes execution; DIV.D is stalled (F0); ADD.D is stalled (WAR hazard on F6).
Time=20: MUL.D writes its result; DIV.D is stalled (F0); ADD.D is stalled (WAR hazard on F6).
Time=21: DIV.D reads operands; ADD.D is stalled (WAR hazard on F6).
Time=22: DIV.D executes (1 of 40 cycles); ADD.D writes its result (the WAR hazard is resolved once DIV.D has read F6).
Time=61: DIV.D completes execution.
Time=62: DIV.D writes its result.

Scoreboarding (IV)

The performance of scoreboarding depends on
- the amount of parallelism available among the instructions
- the number of scoreboard entries
- the number and types of functional units
- the presence of antidependences and output dependences