CMCS Mohamed Younis CMCS 611, Advanced Computer Architecture 1

Similar documents
CMSC 611: Advanced Computer Architecture

COSC4201. Prof. Mokhtar Aboelaze York University

Execution/Effective address

Instruction Pipelining Review

Lecture 6 MIPS R4000 and Instruction Level Parallelism. Computer Architectures S

Lecture 9: Case Study MIPS R4000 and Introduction to Advanced Pipelining Professor Randy H. Katz Computer Science 252 Spring 1996

Review: Evaluating Branch Alternatives. Lecture 3: Introduction to Advanced Pipelining. Review: Evaluating Branch Prediction

Complications with long instructions. CMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3. How slow is slow?

Minimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline

Multi-cycle Instructions in the Pipeline (Floating Point)

CMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3. Complications With Long Instructions

Predict Not Taken. Revisiting Branch Hazard Solutions. Filling the delay slot (e.g., in the compiler) Delayed Branch

COMP 4211 Seminar Presentation

LECTURE 10. Pipelining: Advanced ILP

Lecture 4: Introduction to Advanced Pipelining

Administrivia. CMSC 411 Computer Systems Architecture Lecture 6. When do MIPS exceptions occur? Review: Exceptions. Answers to HW #1 posted

Advanced issues in pipelining

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)

Chapter3 Pipelining: Basic Concepts

Chapter 3. Pipelining. EE511 In-Cheol Park, KAIST

CPE 631 Lecture 09: Instruction Level Parallelism and Its Dynamic Exploitation

Instruction-Level Parallelism and Its Exploitation

CMCS Mohamed Younis CMCS 611, Advanced Computer Architecture 1

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation

Floating Point/Multicycle Pipelining in DLX

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

Instruction Level Parallelism

Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

COSC 6385 Computer Architecture - Pipelining (II)

CPE 631 Lecture 11: Instruction Level Parallelism and Its Dynamic Exploitation

Computer Systems Architecture I. CSE 560M Lecture 5 Prof. Patrick Crowley

ELEC 5200/6200 Computer Architecture and Design Fall 2016 Lecture 9: Instruction Level Parallelism

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

COSC4201 Instruction Level Parallelism Dynamic Scheduling

CISC 662 Graduate Computer Architecture. Lecture 10 - ILP 3

Good luck and have fun!

09-1 Multicycle Pipeline Operations

Slide Set 8. for ENCM 501 in Winter Steve Norman, PhD, PEng

ELE 818 * ADVANCED COMPUTER ARCHITECTURES * MIDTERM TEST *

Appendix C. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Page 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution

Reduction of Data Hazards Stalls with Dynamic Scheduling So far we have dealt with data hazards in instruction pipelines by:

Processor: Superscalars Dynamic Scheduling

CISC 662 Graduate Computer Architecture Lecture 7 - Multi-cycles

Instruction Level Parallelism. ILP, Loop level Parallelism Dependences, Hazards Speculation, Branch prediction

ECE 4750 Computer Architecture, Fall 2017 T05 Integrating Processors and Memories

Metodologie di Progettazione Hardware-Software

Static vs. Dynamic Scheduling

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation

ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5)

CS252 Graduate Computer Architecture Lecture 5. Interrupt Controller CPU. Interrupts, Software Scheduling around Hazards February 1 st, 2012

CISC 662 Graduate Computer Architecture Lecture 7 - Multi-cycles. Interrupts and Exceptions. Device Interrupt (Say, arrival of network message)

EECC551 Exam Review 4 questions out of 6 questions

Updated Exercises by Diana Franklin

Instruction Frequency CPI. Load-store 55% 5. Arithmetic 30% 4. Branch 15% 4

吳俊興高雄大學資訊工程學系. October Example to eleminate WAR and WAW by register renaming. Tomasulo Algorithm. A Dynamic Algorithm: Tomasulo s Algorithm

CS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism

There are different characteristics for exceptions. They are as follows:

Lecture 7: Pipelining Contd. More pipelining complications: Interrupts and Exceptions

CS433 Homework 3 (Chapter 3)

Appendix C: Pipelining: Basic and Intermediate Concepts

Advanced Computer Architecture. Chapter 4: More sophisticated CPU architectures

CS252 Graduate Computer Architecture Lecture 6. Recall: Software Pipelining Example

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

CS433 Homework 2 (Chapter 3)

Instr. execution impl. view

Complex Pipelining: Out-of-order Execution & Register Renaming. Multiple Function Units

CS433 Homework 2 (Chapter 3)

EITF20: Computer Architecture Part2.2.1: Pipeline-1

Handout 2 ILP: Part B

Chapter 3 (CONT II) Instructor: Josep Torrellas CS433. Copyright J. Torrellas 1999,2001,2002,2007,

Super Scalar. Kalyan Basu March 21,

EITF20: Computer Architecture Part2.2.1: Pipeline-1

Advanced Computer Architecture CMSC 611 Homework 3. Due in class Oct 17 th, 2012

Announcement. ECE475/ECE4420 Computer Architecture L4: Advanced Issues in Pipelining. Edward Suh Computer Systems Laboratory

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

Pipelining. Principles of pipelining. Simple pipelining. Structural Hazards. Data Hazards. Control Hazards. Interrupts. Multicycle operations

Data Hazards Compiler Scheduling Pipeline scheduling or instruction scheduling: Compiler generates code to eliminate hazard

Pipelining: Issue instructions in every cycle (CPI 1) Compiler scheduling (static scheduling) reduces impact of dependences

Donn Morrison Department of Computer Science. TDT4255 ILP and speculation

What is ILP? Instruction Level Parallelism. Where do we find ILP? How do we expose ILP?

Overview. Appendix A. Pipelining: Its Natural! Sequential Laundry 6 PM Midnight. Pipelined Laundry: Start work ASAP

Hardware-based Speculation

MIPS ISA AND PIPELINING OVERVIEW Appendix A and C

DYNAMIC INSTRUCTION SCHEDULING WITH SCOREBOARD

EITF20: Computer Architecture Part2.2.1: Pipeline-1

Tomasulo s Algorithm

Appendix C. Instructor: Josep Torrellas CS433. Copyright Josep Torrellas 1999, 2001, 2002,

Complex Pipelining COE 501. Computer Architecture Prof. Muhamed Mudawar

T T T T T T N T T T T T T T T N T T T T T T T T T N T T T T T T T T T T T N.

Website for Students VTU NOTES QUESTION PAPERS NEWS RESULTS

Lecture 4: Advanced Pipelines. Data hazards, control hazards, multi-cycle in-order pipelines (Appendix A.4-A.10)

Course on Advanced Computer Architectures

Lecture-13 (ROB and Multi-threading) CS422-Spring

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

Four Steps of Speculative Tomasulo cycle 0

Graduate Computer Architecture. Chapter 3. Instruction Level Parallelism and Its Dynamic Exploitation

The basic structure of a MIPS floating-point unit

Hardware-based Speculation

Instruction Level Parallelism

Transcription:

CMCS 611-101 Advanced Computer Architecture Lecture 9 Pipeline Implementation Challenges October 5, 2009 www.csee.umbc.edu/~younis/cmsc611/cmsc611.htm Mohamed Younis CMCS 611, Advanced Computer Architecture 1

Previous Lecture: Lecture s Overview Control hazards Limiting the effect of control hazards via branch prediction and delay Static branch prediction techniques and performance Branch delay issues and performance Exceptions handling Categorizing of exception based on types and handling requirements Issues of stopping and restarting instructions in a pipeline Precise exception handling and the conditions that enable it Instructions set effects on complicating pipeline design This Lecture Pipelining floating point operations An example pipeline: MIPS R4000 Mohamed Younis CMCS 611, Advanced Computer Architecture 2

Floating-Point Pipeline It is impractical to require that all MIPS floating point operations complete in one clock cycle (complex logic and/or very long clock cycle) A floating-point pipeline differs from integer instructions pipeline in: The EX cycle may be repeated as many times as need to complete the operation There may be multiple floating-point functional units Integer & FP instructions EX stage not pipelined stall if EX takes more than one cycle A stall will occur if the instruction to be issued will either cause a structural or a data hazard No instruction is assumed to enter the EX stage as long the previous instruction has not finished it Multi-cycle EX phase Mohamed Younis CMCS 611, Advanced Computer Architecture 3

Multi-cycle FP Pipeline 7-stages pipelined Integer ALU FP multiple Non-pipelined DIV operation stalling the whole pipeline for 24 cycle Pipelined FP addition with latency of 3 cycles MULTD IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB ADDD IF ID A1 A2 A3 A4 MEM WB LD IF ID EX MEM WB SD IF ID EX MEM WB Example: blue indicate where data is needed and red when result is available Mohamed Younis CMCS 611, Advanced Computer Architecture 4

Multi-cycle FP Pipeline EX Phase Example: assume that floating point add, subtract and multiply can be performed in stages while integer and FP divide id cannot Latency: the number of intervening cycles between an instruction that produces a result and an instruction that uses it Since most operations consume their operands at the beginning of the EX stage, latency is usually the number of stages in the EX an instruction uses Naturally long latency increases the frequency of RAW hazards Initiation (Repeat) interval: the number of cycles that must elapse between issuing two operations of a given type Functional unit Latency Initiation interval Integer ALU 0 1 Data memory y( (integer and FP loads) 1 1 FP add 3 1 FP multiply (also integer multiply) 6 1 FP divide (also integer divide) 24 25 Example of typical latency of floating point operations Mohamed Younis CMCS 611, Advanced Computer Architecture 5

Issues: FP Pipeline Challenges Because the divide unit is not fully pipelined, structural hazards can occur Because the instructions have varying running times, the number of register writes required in a cycle can be larger than 1 WAW hazards are possible, since instructions no longer reach WB in order WAR hazards are NOT possible, since register reads are still taking place during the ID stage Instructions can complete in a different order than they were issued, causing problems with exceptions Longer latency of operations makes stalls for RAW hazards more frequent Instruction Clock cycle number 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 LD F4, 0(R2) IF ID EX MEM WB MULTD F0, F4, F6 IF ID stall M1 M2 M3 M4 M5 M6 M7 MEM WB ADDD F2, F0, F8 IF stall ID stall stall stall stall stall stall A1 A2 A3 A4 MEM SD 0(R2), F2 IF stall stall stall stall stall stall ID EX stall stall stall MEM Example of RAW hazard caused by the long latency Mohamed Younis CMCS 611, Advanced Computer Architecture 6

Write-Caused Structural Hazard Instruction Clock cycle number 1 2 3 4 5 6 7 8 9 10 11 MULTD F0, F4, F6 IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB IF ID EX MEM WB IF ID EX MEM WB ADDD F2, F4, F6 IF ID A1 A2 A3 A4 MEM WB IF ID EX MEM WB IF ID EX MEM WB LD F2, 0(R2) IF ID EX MEM WB At cycle 11, the MULTD, ADDD and LD instructions ti will try to write back causing structural hazard if there is only one write port Additional write ports are not cost effective since they are rarely used and it is better to detect and resolve the structural hazard Structural hazards can be detected at the ID stage and the instruction will be stalled to avoid a conflict with another at the WB Alternatively, structural hazards can be handled at the MEM or WB by rescheduling the usage of the write port Mohamed Younis CMCS 611, Advanced Computer Architecture 7

Instruction WAW Data Hazards Clock cycle number 1 2 3 4 5 6 7 8 9 10 11 MULTD F0, F4, F6 IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB IF ID EX MEM WB IF ID EX MEM WB ADDD F2, F4, F6 IF ID A1 A2 A3 A4 MEM WB IF ID EX MEM WB LD F2, 0(R2) IF ID EX MEM WB. IF ID EX MEM WB WAW hazards can be corrected by either: Stopping the latter instruction (LD in example) at the MEM until it is safe Preventing the first instruction from overwriting the register Correcting WAW Hazards at cycle 11 is not problematic unless there is an instruction between ADDD and LD that read F2 causing RAW hazard WAW hazards can be detected at the ID stage and thus the first instruction can be converted to no-op Since WAW hazards are generally very rare, designers usually go with the simplest solution Mohamed Younis CMCS 611, Advanced Computer Architecture 8

Detecting Hazards Hazards can occur among FP instructions and among FP and integer instructions Using separate register files for integer and FP limits potential hazards to just FP load and store instructions Assuming all checks are to be performed in the ID phase, hazards can be detected through the following steps: Check for structural hazards: Wait if the functional unit is busy (Divides in our case) Make sure the register write port is available when needed Check for a RAW data hazard Requires knowledge of latency and initiation interval to decide when to forward and when to stall Check for a WAW data hazard Write completion has to be estimated at the ID stage to check with other instructions in the pipeline Data hazard detection and forwarding logic can be derived from values stored at the data stationary between the different stages Mohamed Younis CMCS 611, Advanced Computer Architecture 9

Maintaining Precise Exception Pipelining FP instructions can cause out-of-order completion since FP operations are different in length Although data hazard solutions prevent the effect of out-of-order instruction completion on the semantics, exception handling can be problematic Example: DIVD F0, F2, F4 ADDD and SUBD will complete ADDD F10, F10, F8 before DIVD completes without SUBD F12, F12, F14 causing any data hazards What if an exception occurs while DIVD executes in a point of time after ADDD overwrites F10!! There are four possible approaches to deal with FP exception handling: Settle for imprecise exceptions (restrict # active FP instructions) IEEE floating point standard requires precise exceptions Some supercomputers p still uses this approach Some machines offer slow precise and fast imprecise exceptions Buffer the results of all operations until previous instructions complete Complex and expensive design (many comparators and large MUX) History or future register file are practical implementations that allow for restoring original machine state in case of exception Mohamed Younis CMCS 611, Advanced Computer Architecture 10

Precise Exception Approaches (Cont.) Four possible approaches to deal with FP exception handling (cont.): Allow imprecise exceptions and get the handler to clean up any miss Simply save PC and some state about the interrupting instruction and all out-of-order completed instructions The trap handler will consider the state modification caused by finished instructions and prepare machine to resume correctly Issues: consider the following example Instruction 1 : A long-running instr. that eventually get interrupted Instructions 2 (n-1) : A series of instructions that are not complete Instruction n : An instruction that is finished The trap handler will have a complex job to deal with The compiler can simplify the problem by grouping FP instructions so that the trap does not have to worry about unrelated instructions s Allow instruction issue to continue only if previous instruction are guaranteed to cause no exceptions: Mainly applied in the execution phase Used on MIPS R4000 and Intel Pentium Mohamed Younis CMCS 611, Advanced Computer Architecture 11

Performance of FP Pipeline Average number of stalls per instruction mainly affected by latency and number of cycles before results are used but not by instruction frequency (except for divide) Stalls are mainly caused by: Latency of the divide instruction RAW hazards WAW hazards are possible but rare in practice P SPEC Benc chmark F Add/Subtract t Compares Multiply Divide Divide Structural Mohamed Younis CMCS 611, Advanced Computer Architecture 12

Performance of FP Pipeline (Cont.) FP results stall FP Compare stall FP SPEC Benc chmark Branch/Load stalls FP Structural The total number of stalls/instr. range from 0.65 to 1.21 with average 0.87 FP result stalls dominate in all cases with average of 0.71 (82% of all stalls) Compares generate an average of 0.1 stalls and are second largest source Mohamed Younis CMCS 611, Advanced Computer Architecture 13

MIPS 4000 Pipeline Implements the MIPS-3 64-bit instruction Uses 8 stages pipeline through pipelining instruction and data cache access Deeper pipeline allows for higher clock rate but increases load/branch delays IF: First thalf of finstr. fetch; PC selection, initiation iti of finstruction ti cache access IS: Second half of the instruction fetch completing instruction cache access RF: Instr. Decode, register fetch, hazard checking and instr. cache hit detection EX: ALU operations, effective address calculation, and condition evaluation DF: Data fetch, first half of data cache access DS: Second half of data fetch, completion of data cache access TC: Tag check, determine whether the data cache access hit WB: Write back for loads and register-register operations Mohamed Younis CMCS 611, Advanced Computer Architecture 14

Load Delays Data value are available at the end of the DS cycle, cause a two cycle delay for load instructions Pipelined cache access increases the need for forwarding and complicates the forwarding logic A cache miss will stall the pipeline additional one (or more) cycles Mohamed Younis CMCS 611, Advanced Computer Architecture 15

Branching Delays Branch conditions are computed during EX stage extending the basic branch delay to 3 cycles MIPS allows for a single-cycle delay branching and a predict-not-taken strategy for the remaining two branching delay cycles Mohamed Younis CMCS 611, Advanced Computer Architecture 16

MIPS 4000 FP Pipeline Three FP functional units: adder, multiplier and divider, with the adder logic serving as the final step of the a multiply and divide Stage Functional Unit Description A FP adder Mantissa ADD stage D FP Divider Divide pipeline stage E FP Multiplier Exception test stage M FP Multiplier First stage of multiplier N FP Multiplier Second stage of multiplier R FP adder Rounding stage S FP adder Operand shift stage U Unpack FP numbers The FP functional unit can be thought as having eight different stages, combined in different order to execute various FP operations An instruction can use a stage zero or multiple times and in different orders FP Instruction Latency Pipeline stages Add/Subtract 4 U, S+A, A+R, R+S Multiply 8 U, E+M, M, M, M, N, N+A, R Divide 36 U, A, R, D 27, D+A, D+R, D+A, D+R, A, R Compare 3 U, A, R Mohamed Younis CMCS 611, Advanced Computer Architecture 17

Performance of MIPS Pipeline Pipeli ine CPI Base Load stalls Branch stalls FP result stalls FP structural stalls SPEC92 benchmark Mohamed Younis CMCS 611, Advanced Computer Architecture 18

Summary Conclusion Exceptions handling Categorizing of exception based on types and handling requirements Issues of stopping and restarting instructions in a pipeline Precise exception handling and the conditions that enable it Instructions set effects on complicating pipeline design Pipelining floating point instruction Multi-cycle operation of the pipeline execution phase Hazard detection and resolution Exception handling and FP pp pipeline performance An example pipeline: MIPS R4000 Next Lecture Instruction level parallelism Dynamic instruction scheduling Reading assignment includes sections A.5 & A.6, in the textbook Mohamed Younis CMCS 611, Advanced Computer Architecture 19