COSC4201. Prof. Mokhtar Aboelaze York University

Similar documents
Execution/Effective address

Lecture 9: Case Study MIPS R4000 and Introduction to Advanced Pipelining Professor Randy H. Katz Computer Science 252 Spring 1996

Lecture 6 MIPS R4000 and Instruction Level Parallelism. Computer Architectures S

CMCS Mohamed Younis CMCS 611, Advanced Computer Architecture 1

Review: Evaluating Branch Alternatives. Lecture 3: Introduction to Advanced Pipelining. Review: Evaluating Branch Prediction

Lecture 4: Introduction to Advanced Pipelining

COMP 4211 Seminar Presentation

Predict Not Taken. Revisiting Branch Hazard Solutions. Filling the delay slot (e.g., in the compiler) Delayed Branch

Advanced issues in pipelining

Complications with long instructions. CMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3. How slow is slow?

09-1 Multicycle Pipeline Operations

Multi-cycle Instructions in the Pipeline (Floating Point)

CMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3. Complications With Long Instructions

Chapter3 Pipelining: Basic Concepts

Administrivia. CMSC 411 Computer Systems Architecture Lecture 6. When do MIPS exceptions occur? Review: Exceptions. Answers to HW #1 posted

COSC4201 Instruction Level Parallelism Dynamic Scheduling

Appendix C. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S

Instruction Pipelining Review

Metodologie di Progettazione Hardware-Software

CS252 Graduate Computer Architecture Lecture 5. Interrupt Controller CPU. Interrupts, Software Scheduling around Hazards February 1 st, 2012

CS425 Computer Systems Architecture

Computer Systems Architecture I. CSE 560M Lecture 5 Prof. Patrick Crowley

Instruction Level Parallelism

Static vs. Dynamic Scheduling

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)

Minimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline

Super Scalar. Kalyan Basu March 21,

CMSC 611: Advanced Computer Architecture

CS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism

Floating Point/Multicycle Pipelining in DLX

Lecture 7: Pipelining Contd. More pipelining complications: Interrupts and Exceptions

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

Lecture-13 (ROB and Multi-threading) CS422-Spring

Appendix C: Pipelining: Basic and Intermediate Concepts

Advanced Computer Architecture. Chapter 4: More sophisticated CPU architectures

Processor: Superscalars Dynamic Scheduling

Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Complex Pipelining: Out-of-order Execution & Register Renaming. Multiple Function Units

Updated Exercises by Diana Franklin

Good luck and have fun!

LECTURE 10. Pipelining: Advanced ILP

Basic Pipelining Concepts

CSE4201 Instruction Level Parallelism. Branch Prediction

Course on Advanced Computer Architectures

There are different characteristics for exceptions. They are as follows:

Four Steps of Speculative Tomasulo cycle 0

Review: Pipelining. Key to pipelining: smooth flow Making all instructions the same length can increase performance!

Instruction Level Parallelism. Taken from

Hardware-Based Speculation

Reduction of Data Hazards Stalls with Dynamic Scheduling So far we have dealt with data hazards in instruction pipelines by:

Pipeline issues. Pipeline hazard: RaW. Pipeline hazard: RaW. Calcolatori Elettronici e Sistemi Operativi. Hazards. Data hazard.

CPE 631 Lecture 09: Instruction Level Parallelism and Its Dynamic Exploitation

Multicycle ALU Operations 2/28/2011. Diversified Pipelines The Path Toward Superscalar Processors. Limitations of Our Simple 5 stage Pipeline

Load1 no Load2 no Add1 Y Sub Reg[F2] Reg[F6] Add2 Y Add Reg[F2] Add1 Add3 no Mult1 Y Mul Reg[F2] Reg[F4] Mult2 Y Div Reg[F6] Mult1

ELEC 5200/6200 Computer Architecture and Design Fall 2016 Lecture 9: Instruction Level Parallelism

Instruction Frequency CPI. Load-store 55% 5. Arithmetic 30% 4. Branch 15% 4

Hardware-based Speculation

ELE 818 * ADVANCED COMPUTER ARCHITECTURES * MIDTERM TEST *

Pipelining. Principles of pipelining. Simple pipelining. Structural Hazards. Data Hazards. Control Hazards. Interrupts. Multicycle operations

Instr. execution impl. view

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

Page 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

CS152 Computer Architecture and Engineering Lecture 15

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

COSC4201 Pipelining. Prof. Mokhtar Aboelaze York University

Exploiting ILP with SW Approaches. Aleksandar Milenković, Electrical and Computer Engineering University of Alabama in Huntsville

The basic structure of a MIPS floating-point unit

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation

Lecture 4: Advanced Pipelines. Data hazards, control hazards, multi-cycle in-order pipelines (Appendix A.4-A.10)

CS252 Graduate Computer Architecture Lecture 6. Recall: Software Pipelining Example

Instruction-Level Parallelism and Its Exploitation

Pipelining: Issue instructions in every cycle (CPI 1) Compiler scheduling (static scheduling) reduces impact of dependences

Page 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer

EECC551 Exam Review 4 questions out of 6 questions

Advanced Computer Architecture CMSC 611 Homework 3. Due in class Oct 17 th, 2012

EITF20: Computer Architecture Part2.2.1: Pipeline-1

EE557--FALL 1999 MAKE-UP MIDTERM 1. Closed books, closed notes

Copyright 2012, Elsevier Inc. All rights reserved.

Functional Units. Registers. The Big Picture: Where are We Now? The Five Classic Components of a Computer Processor Input Control Memory

EITF20: Computer Architecture Part2.2.1: Pipeline-1

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects

What is ILP? Instruction Level Parallelism. Where do we find ILP? How do we expose ILP?

Overview. Appendix A. Pipelining: Its Natural! Sequential Laundry 6 PM Midnight. Pipelined Laundry: Start work ASAP

Dynamic Control Hazard Avoidance

CS152 Computer Architecture and Engineering Lecture 15 Advanced Pipelining Dave Patterson (

Instruction Level Parallelism

COSC 6385 Computer Architecture - Pipelining (II)

CPE 631 Lecture 11: Instruction Level Parallelism and Its Dynamic Exploitation

Administrivia. CMSC 411 Computer Systems Architecture Lecture 14 Instruction Level Parallelism (cont.) Control Dependencies

Appendix A. Overview

Improving Performance: Pipelining

Instruction Level Parallelism. ILP, Loop level Parallelism Dependences, Hazards Speculation, Branch prediction

Appendix C. Instructor: Josep Torrellas CS433. Copyright Josep Torrellas 1999, 2001, 2002,

5008: Computer Architecture

Lecture: Pipeline Wrap-Up and Static ILP

Hardware-Based Speculation

Data Hazards Compiler Scheduling Pipeline scheduling or instruction scheduling: Compiler generates code to eliminate hazard

Transcription:

COSC4201 Chapter 3 Multi Cycle Operations Prof. Mokhtar Aboelaze York University Based on Slides by Prof. L. Bhuyan (UCR) Prof. M. Shaaban (RTI) 1 Multicycle Operations More than one function unit, each require a variable number of cycles. INT FP/* ID MEM WB FP+ FP/ / 2

Multi cycles operations Assuming the following Function Unit latency initiation period Integer ALU 0 1 Data Memory 1 1 FP add 3 1 FP Multiply 6 1 FP Divide 24 24 Notice that FP add and multiply are pipelined (4 and 7 stages pipeline respectively). Latency is the number of cycles between an instruction that produces a result and another one that uses the result. 3 4

Multicycle operations MULTD ID M1 M2 M3 M4 M5 M6 M7 MEM WB ADDD ID A1 A2 A3 A4 MEM WB LD ID MEM WB SD ID MEM WB Stages in red indicates when data are needed, in blue indicates when data are produced Need to introduce more pipeline registers A1/A2,.. 5 Hazards and Forwarding Because the divide unit is not pipelined, structural hazards may arise Because of different running times. We may need to do more than one register write in a single cycle WAW hazard is now possible, WAR is not since they all read in one stage Instructions can complete in different order, more complicated exception handling Because of the longer latency, more RAW hazard 6

Hazards and Forwarding Instruction 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 LD F4,0(R2) ID MEM WB MUL F0,F4,F6 ID S M1 M2 M3 M4 M5 M6 M7 MEM WB ADDD F2,F0,F8 ID S S S S S S S A1 A2 A3 A4 SD F2,0(R2) ID S S S S S S S S S S MEM Substantially longer stall and forwarding 7 Hazard and Forwarding Instruction 1 2 3 4 5 6 7 8 9 10 11 MULTD F0,F4,F6 ID M1 M2 M3 M4 M5 M6 M7 MEM WB.. ADDD F2,F4,F6 ID A1 A2 A3 A4 MEM WB LD F8,0(R2) ID MEM WB Three different instruction writing in the same cycle, LD is issued one cycle earlier, with destination of F2, that will lead to WAW hazard 8

Hazards and Forwarding One way to deal with multiple writes is to have multiple write ports, but it may be rarely used. Another way is to detect the structural hazard by using an interlock 1. We can track the use of the write port before it is issued (ID stage) and stall 2. Or, we can detect this hazard at entering the MEM stage, it is easier to detect, and we can choose which instruction to proceed (the one with the longest latency?) 9 Maintaining Precise Exception DIVF F0,F2,F4 AD F10,F10,F8 SUBF F12,F12,F14 This is known as out of order completion What if SUBF causes an Exception after AD is completed but before DIVF is, or if DIVF caused an exception after both AD and SUBF completed, there is no way to maintain a precise exception since AD destroys one of its operands 10

Maintaining Precise Exception (sol 1) Early solution is to ignore the problem More recent ones, are to introduce two modes of operations, fast but with imprecise exception, and slow with precise exception. DEC Alpha 2104, Power1 and Power-2, MIPS R8000 11 Maintaining Precise Exception (sol 2) Buffer the results of an operation until all the operations before it are completed. Costly, especially with long pipes. One variation is called history file, old values are stored in the history file and can be restored in case of exception Or, we an use future file, where new values are stored until all proceeding instructions are completed. 12

Maintaining Precise Exception (sol 3) Allow the exception to become imprecise, but we have to keep enough information so that the exception handling routine can recover. These information are usually the PC addresses of the instructions that were in the pipe during the exception, who finished, and who not. 13 Maintaining Precise Exception (sol 4) A hybrid technique, Allow the instructions to be issued only if we are certain that all the instructions before the issuing instruction will complete without causing an exception That guarantees that no instruction after the interrupting one will be completed, and all instructions before it will complete. Must check for exception early in the stage. 14

MIPS4000 8 Stage Pipeline: first half of fetching of instruction; PC selection happens here as well as initiation of instruction cache access. second half of access to instruction cache. instruction decode and register fetch, hazard checking and also instruction cache hit detection. execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation. data fetch, first half of access to data cache. second half of access to data cache. TC tag check, determine whether the data cache access hit. WB write back for loads and register-register operations. 8 Stages: What is impact on Load delay? Branch delay? Why? 15 MIPS4000 2 cycle load delay (data is ready after ) TC WB TC 3 cycles branch delay, MIOS4000 has a singly cycle branch delay scheduling with a predict taken for the remaining 2 TC WB TC 16

MIPS4000 Floating Point Pipeline FP Adder, FP Multiplier, FP Divider Last step of FP Multiplier/Divider uses FP Adder HW 8 kinds of stages in FP units: Stage Functional unit Description A FP adder Mantissa ADD stage D FP divider Divide pipeline stage E FP multiplier Exception test stage M FP multiplier First stage of multiplier N FP multiplier Second stage of multiplier R FP adder Rounding stage S FP adder Operand shift stage U Unpack FP numbers 17 MIPS4000 Floating Point Pipeline FP Instr Pipeline stages Add, Subtract U,S+A,A+R,R+S Multiply U,E+M,M,M,M,N,N+A,R Divide U,A,R,D 28,D+A,D+R, D+R, D+A, D+R, A, R Square root U,E,(A+R) 108,A,R Negate U,S Absolute value U,S FP compare U,A,R OP Latency Initiation interval ADD,SUB 4 3 MUL 8 4 DIV 36 35 SQRT 112 111 NEG,ABS 2 1 COMP 3 2 18

AMPLE MUL SUE 0 1 2 3 4 5 6 7 8 9 10 MUL Issue U M M M M N N,A R ADD Issue U S,A A,R R,S ADD Issue U S,A A,R R,S ADD Stall U S,A A,R R,S ADD Stall U S,A A,R R,S ADD Issue U S,A A,R R,S ADD Issue U S,A A,R R,S ADD Issue U S,A A,R R,S The interaction between a multiply issued at time 0, and add issued between 1 and 7, in all except 2 we can proceed without stalls, in these two we have to stall 19 MIPS4000 Performance Not ideal CPI of 1: Load stalls (1 or 2 clock cycles) Branch stalls (2 cycles + unfilled slots) FP result stalls: RAW data hazard (latency) FP structural stalls: Not enough FP hardware (parallelism) 4.5 4 3.5 3 2.5 2 1.5 1 0.5 0 eqntott espresso gcc li doduc nasa7 ora spice2g6 su2cor tomcatv Base Load stalls Branch stalls FP result stalls FP structural stalls 20