Execution/Effective address

Similar documents
COSC4201. Prof. Mokhtar Aboelaze York University

Lecture 9: Case Study MIPS R4000 and Introduction to Advanced Pipelining Professor Randy H. Katz Computer Science 252 Spring 1996

Lecture 6 MIPS R4000 and Instruction Level Parallelism. Computer Architectures S

CMCS Mohamed Younis CMCS 611, Advanced Computer Architecture 1

Review: Evaluating Branch Alternatives. Lecture 3: Introduction to Advanced Pipelining. Review: Evaluating Branch Prediction

Lecture 4: Introduction to Advanced Pipelining

COMP 4211 Seminar Presentation

Predict Not Taken. Revisiting Branch Hazard Solutions. Filling the delay slot (e.g., in the compiler) Delayed Branch

Complications with long instructions. CMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3. How slow is slow?

09-1 Multicycle Pipeline Operations

Appendix C. Instructor: Josep Torrellas CS433. Copyright Josep Torrellas 1999, 2001, 2002,

Advanced issues in pipelining

CMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3. Complications With Long Instructions

COSC4201 Pipelining. Prof. Mokhtar Aboelaze York University

Multi-cycle Instructions in the Pipeline (Floating Point)

Overview. Appendix A. Pipelining: Its Natural! Sequential Laundry 6 PM Midnight. Pipelined Laundry: Start work ASAP

Instruction Pipelining Review

Appendix A. Overview

CAD for VLSI 2 Pro ject - Superscalar Processor Implementation

Chapter3 Pipelining: Basic Concepts

Instruction Level Parallelism

Appendix C. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Pipelining: Basic and Intermediate Concepts

Computer System. Hiroaki Kobayashi 6/16/2010. Ver /16/2010 Computer Science 1

Computer System. Agenda

Instruction Pipelining

Instruction Pipelining

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S

Administrivia. CMSC 411 Computer Systems Architecture Lecture 6. When do MIPS exceptions occur? Review: Exceptions. Answers to HW #1 posted

CS252 Graduate Computer Architecture Lecture 5. Interrupt Controller CPU. Interrupts, Software Scheduling around Hazards February 1 st, 2012

Metodologie di Progettazione Hardware-Software

EITF20: Computer Architecture Part2.2.1: Pipeline-1

EITF20: Computer Architecture Part2.2.1: Pipeline-1

ECSE 425 Lecture 6: Pipelining

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

Minimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline

CS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism

Lecture-13 (ROB and Multi-threading) CS422-Spring

Super Scalar. Kalyan Basu March 21,

Outline. Pipelining basics The Basic Pipeline for DLX & MIPS Pipeline hazards. Handling exceptions Multi-cycle operations

Data Hazards Compiler Scheduling Pipeline scheduling or instruction scheduling: Compiler generates code to eliminate hazard

CS425 Computer Systems Architecture

Lecture 7: Pipelining Contd. More pipelining complications: Interrupts and Exceptions

Advanced Computer Architecture. Chapter 4: More sophisticated CPU architectures

Static vs. Dynamic Scheduling

CPE Computer Architecture. Appendix A: Pipelining: Basic and Intermediate Concepts

Updated Exercises by Diana Franklin

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

Pipelining. Principles of pipelining. Simple pipelining. Structural Hazards. Data Hazards. Control Hazards. Interrupts. Multicycle operations

Appendix C. Abdullah Muzahid CS 5513

Computer Systems Architecture I. CSE 560M Lecture 5 Prof. Patrick Crowley

EITF20: Computer Architecture Part2.2.1: Pipeline-1

LECTURE 10. Pipelining: Advanced ILP

Floating Point/Multicycle Pipelining in DLX

CSE 533: Advanced Computer Architectures. Pipelining. Instructor: Gürhan Küçük. Yeditepe University

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Complex Pipelining: Out-of-order Execution & Register Renaming. Multiple Function Units

Good luck and have fun!

Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Load1 no Load2 no Add1 Y Sub Reg[F2] Reg[F6] Add2 Y Add Reg[F2] Add1 Add3 no Mult1 Y Mul Reg[F2] Reg[F4] Mult2 Y Div Reg[F6] Mult1

Instruction Level Parallelism. Taken from

Appendix C: Pipelining: Basic and Intermediate Concepts

Instruction Frequency CPI. Load-store 55% 5. Arithmetic 30% 4. Branch 15% 4

Four Steps of Speculative Tomasulo cycle 0

Hardware-Based Speculation

Basic Pipelining Concepts

Page 1. Pipelining: Its Natural! Chapter 3. Pipelining. Pipelined Laundry Start work ASAP. Sequential Laundry A B C D. 6 PM Midnight

Instruction Level Parallelism

Instruction-Level Parallelism and Its Exploitation

Instruction Level Parallelism. ILP, Loop level Parallelism Dependences, Hazards Speculation, Branch prediction

Multicycle ALU Operations 2/28/2011. Diversified Pipelines The Path Toward Superscalar Processors. Limitations of Our Simple 5 stage Pipeline

Review: Pipelining. Key to pipelining: smooth flow Making all instructions the same length can increase performance!

There are different characteristics for exceptions. They are as follows:

Pipelining. Maurizio Palesi

CMSC 611: Advanced Computer Architecture

Processor: Superscalars Dynamic Scheduling

Page 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer

The basic structure of a MIPS floating-point unit

CPE 631 Lecture 09: Instruction Level Parallelism and Its Dynamic Exploitation

Dynamic Control Hazard Avoidance

ECE 4750 Computer Architecture, Fall 2017 T05 Integrating Processors and Memories

What is Pipelining? Time per instruction on unpipelined machine Number of pipe stages

CS152 Computer Architecture and Engineering Lecture 15

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

Improving Performance: Pipelining

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

Functional Units. Registers. The Big Picture: Where are We Now? The Five Classic Components of a Computer Processor Input Control Memory

Advanced Computer Architecture Pipelining

Reduction of Data Hazards Stalls with Dynamic Scheduling So far we have dealt with data hazards in instruction pipelines by:

Pipeline issues. Pipeline hazard: RaW. Pipeline hazard: RaW. Calcolatori Elettronici e Sistemi Operativi. Hazards. Data hazard.

Page 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution

MIPS ISA AND PIPELINING OVERVIEW Appendix A and C

Hardware-based Speculation

ELEC 5200/6200 Computer Architecture and Design Fall 2016 Lecture 9: Instruction Level Parallelism

CS152 Computer Architecture and Engineering Lecture 15 Advanced Pipelining Dave Patterson (

EECC551 Exam Review 4 questions out of 6 questions

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

Transcription:

Pipelined RC 69 Pipelined RC Instruction Fetch IR mem[pc] NPC PC+4 Instruction Decode/Operands fetch A Regs[rs]; B regs[rt]; Imm sign extended immediate field Execution/Effective address Memory Ref ALUOutput A + Imm 70

Pipelined RC Execution/Effective address Reg Reg Op - ALUOutput A func B Reg-Immediate - ALUOutput A func Imm Branch - ALUOutput NPC + (IMM <<2) Memory Access Cycle Memory Referrence - LMD Mem[ALUOutput] or Mem[ALUOutput] B; 71 Pipelined RC Memory Access Cycle Branch - If (cond) PC ALUOutput Write Back Reg-Reg ALU OP - Regs[rd] ALUOutput Reg-Imm - Regs[rt] ALUOutput Load - Regs[rt] LMD 72

Multicycle Operations More than one function unit, each require a variable number of cycles. 73 Multi cycles operations Assuming the following Function Unit latency initiation period Integer ALU 0 1 Data Memory 1 1 FP add 3 1 FP Multiply 6 1 FP Divide 24 24 Notice that FP add and multiply are pipelined (4 and 7 stages pipeline respectively). Latency is the number of cycles between an instruction that produces a result and another one that uses the result. 74

75 Multicycle operations MULTD ID M1 M2 M3 M4 M5 M6 M7 MEM WB ADDD ID A1 A2 A3 A4 MEM WB LD ID MEM WB SD ID MEM WB Stages in red indicates when data are needed, in blue indicates when data are produced Need to introduce more pipeline registers A1/A2,.. 76

Hazards and Forwarding Because the divide unit is not pipelined, structural hazards may arise Because of different running times. We may need to do more than one register write in a single cycle WAW hazard is now possible, WAR is not since they all read in one stage Instructions can complete in different order, more complicated exception handling Because of the longer latency, more RAW hazard 77 Hazards and Forwarding Instruction 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 LD F4,0(R2) ID MEM WB MUL F0,F4,F6 ID S M1 M2 M3 M4 M5 M6 M7 MEM WB ADDD F2,F0,F8 ID S S S S S S S A1 A2 A3 A4 SD F2,0(R2) ID S S S S S S S S S S MEM Substantially longer stall and forwarding 78

Hazard and Forwarding Instruction 1 2 3 4 5 6 7 8 9 10 11 MULTD F0,F4,F6 ID M1 M2 M3 M4 M5 M6 M7 MEM WB.. ADDD F2,F4,F6 ID A1 A2 A3 A4 MEM WB LD F8,0(R2) ID MEM WB Three different instruction writing in the same cycle, LD is issued one cycle earlier, with destination of F2, that will lead to WAW hazard 79 Hazards and Forwarding One way to deal with multiple writes is to have multiple write ports, but it may be rarely used. Another way is to detect the structural hazard by using an interlock 1. We can track the use of the write port before it is issued (ID stage) and stall 2. Or, we can detect this hazard at entering the MEM stage, it is easier to detect, and we can choose which instruction to proceed (the one with the longest latency?) 80

Maintaining Precise Exception DIVF F0,F2,F4 AD F10,F10,F8 SUBF F12,F12,F14 This is known as out of order completion What if SUBF causes an Exception after AD is completed but before DIVF is, or if DIVF caused an exception after both AD and SUBF completed, there is no way to maintain a precise exception since AD destroys one of its operands 81 Maintaining Precise Exception (sol 1) Early solution is to ignore the problem More recent ones, are to inrtoduce two modes of operations, fast but with imprecise exception, and slow with precise exception. DEC Alpha 2104, Power1 and Power-2, MIPS R8000 82

Maintaining Precise Exception (sol 2) Buffer the results of an operation until all the operations before it are completed. Costly, especially with long pipes. One variation is called history file, old values are stored in the history file and can be restored in case of exception Or, we an use future file, where new values are stored until all proceeding instructions are completed. 83 Maintaining Precise Exception (sol 3) Allow the exception to become imprecise, but we have to keep enough information so that the exception handling routine can recover. These information are usually the PC addresses of the instructions that were in the pipe during the exception, who finished, and who not. 84

Maintaining Precise Exception (sol 4) A hybrid technique, Allow the instructions to be issued only if we are certain that all the instructions before the issuing instruction will complete without causing an exception That guarantees that no instruction after the interrupting one will be completed, and all instructions before it will complete. Must check for exception early in the stage. 85 MIPS R4000 86

MIPS4000 8 Stage Pipeline: first half of fetching of instruction; PC selection happens here as well as initiation of instruction cache access. second half of access to instruction cache. instruction decode and register fetch, hazard checking and also instruction cache hit detection. execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation. data fetch, first half of access to data cache. DS second half of access to data cache. TC tag check, determine whether the data cache access hit. WB write back for loads and register-register operations. 8 Stages: What is impact on Load delay? Branch delay? Why? 87 MIPS4000 2 cycle load delay (data is ready after DS) DS TC DS WB TC DS 3 cycles branch delay, MIPS4000 has a singly cycle branch delay scheduling with a predict taken for the remaining 2 DS TC DS WB TC DS 88

Performance of branch Schemes Consider an R4000-style pipeline with the following penalty for branches What is the addition to the CPI if we assume the following. Unconditional=4%, conditional untaken =10%, conditional taken = 6% Branch Scheme Penalty unconditional Penalty untaken Penalty taken Flush pipeline 2.0 3 3 Predict taken 2.0 3 2 Predict untaken 2.0 0 3 89 Perormance Scheme Uncond Untaken Taken total Frequency 4% 10% 6% 20% stall 0.04*2=0.08 0.1*3=0.3 0.06*3=0.18 0.56 Predict taken 0.04*2=0.08 0.1*3=0.3 0.06*2=0.12 0.50 Predict untaken 0.04*2=0.08 0 0.06*3=0.18 0.26 90

MIPS4000 Floating Point Pipeline FP Adder, FP Multiplier, FP Divider Last step of FP Multiplier/Divider uses FP Adder HW 8 kinds of stages in FP units: Stage Functional unit Description A FP adder Mantissa ADD stage D FP divider Divide pipeline stage E FP multiplier Exception test stage M FP multiplier First stage of multiplier N FP multiplier Second stage of multiplier R FP adder Rounding stage S FP adder Operand shift stage U Unpack FP numbers 91 MIPS4000 Floating Point Pipeline FP Instr Pipeline stages Add, Subtract U,S+A,A+R,R+S Multiply U,E+M,M,M,M,N,N+A,R Divide U,A,R,D 28,D+A,D+R, D+R, D+A, D+R, A, R Square root U,E,(A+R) 108,A,R Negate U,S Absolute value U,S FP compare U,A,R OP Latency Initiation interval ADD,SUB 4 3 MUL 8 4 DIV 36 35 SQRT 112 111 NEG,ABS 2 1 COMP 3 2 92

AMPLE MUL SUE 0 1 2 3 4 5 6 7 8 9 10 MUL Issue U E,M M M M N N,A R ADD Issue U S,A A,R R,S ADD Issue U S,A A,R R,S ADD Issue U S,A A,R R,S ADD Stall U S,A A,R R,S ADD Stall U S,A A,R R,S ADD Issue U S,A A,R R,S ADD Issue U S,A A,R R,S Multiply issued at time 0, the table shows if we can issue or stall an ADD instruction n cycles after the MULThe interaction between a multiply issued at time 0, and add issued between 1 and 7, in all except 2 we can proceed without stalls, in these two we have to stall 93 MIPS4000 Performance Not ideal CPI of 1: Load stalls (1 or 2 clock cycles) Branch stalls (2 cycles + unfilled slots) FP result stalls: RAW data hazard (latency) FP structural stalls: Not enough FP hardware (parallelism) 4.5 4 3.5 3 2.5 2 1.5 1 0.5 0 eqntott espresso gcc li doduc nasa7 ora spice2g6 su2cor tomcatv Base Load stalls Branch stalls FP result stalls FP structural stalls 94