CMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3. Complications With Long Instructions

Similar documents
Complications with long instructions. CMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3. How slow is slow?

Administrivia. CMSC 411 Computer Systems Architecture Lecture 6. When do MIPS exceptions occur? Review: Exceptions. Answers to HW #1 posted

CMCS Mohamed Younis CMCS 611, Advanced Computer Architecture 1

Instruction Pipelining Review

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)

Pipeline Overview. Dr. Jiang Li. Adapted from the slides provided by the authors. Jiang Li, Ph.D. Department of Computer Science

Computer Architecture

COSC 6385 Computer Architecture - Pipelining

ECE 4750 Computer Architecture, Fall 2017 T05 Integrating Processors and Memories

Pipelining! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar DEIB! 30 November, 2017!

Execution/Effective address

COSC4201. Prof. Mokhtar Aboelaze York University

Appendix C. Instructor: Josep Torrellas CS433. Copyright Josep Torrellas 1999, 2001, 2002,

Lecture 3. Pipelining. Dr. Soner Onder CS 4431 Michigan Technological University 9/23/2009 1

Minimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline

COMP 4211 Seminar Presentation

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

LECTURE 10. Pipelining: Advanced ILP

Predict Not Taken. Revisiting Branch Hazard Solutions. Filling the delay slot (e.g., in the compiler) Delayed Branch

MIPS An ISA for Pipelining

CISC 662 Graduate Computer Architecture Lecture 7 - Multi-cycles

Computer Architecture Spring 2016

CISC 662 Graduate Computer Architecture Lecture 7 - Multi-cycles. Interrupts and Exceptions. Device Interrupt (Say, arrival of network message)

Data Hazards Compiler Scheduling Pipeline scheduling or instruction scheduling: Compiler generates code to eliminate hazard

Advanced issues in pipelining

Multi-cycle Instructions in the Pipeline (Floating Point)

Modern Computer Architecture

Basic Pipelining Concepts

Appendix C: Pipelining: Basic and Intermediate Concepts

Pipelining: Hazards Ver. Jan 14, 2014

EITF20: Computer Architecture Part2.2.1: Pipeline-1

CISC 662 Graduate Computer Architecture Lecture 6 - Hazards

HY425 Lecture 05: Branch Prediction

09-1 Multicycle Pipeline Operations

EE557--FALL 1999 MAKE-UP MIDTERM 1. Closed books, closed notes

LECTURE 3: THE PROCESSOR

EITF20: Computer Architecture Part2.2.1: Pipeline-1

Chapter 3. Pipelining. EE511 In-Cheol Park, KAIST

Instruction Frequency CPI. Load-store 55% 5. Arithmetic 30% 4. Branch 15% 4

Announcement. ECE475/ECE4420 Computer Architecture L4: Advanced Issues in Pipelining. Edward Suh Computer Systems Laboratory

Page 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution

CPE 631 Lecture 09: Instruction Level Parallelism and Its Dynamic Exploitation

ELE 818 * ADVANCED COMPUTER ARCHITECTURES * MIDTERM TEST *

EITF20: Computer Architecture Part2.2.1: Pipeline-1

COSC4201 Pipelining. Prof. Mokhtar Aboelaze York University

Lecture 05: Pipelining: Basic/ Intermediate Concepts and Implementation

CMSC411 Fall 2013 Midterm 1

COSC4201 Instruction Level Parallelism Dynamic Scheduling

Lecture 6 MIPS R4000 and Instruction Level Parallelism. Computer Architectures S

Lecture 9. Pipeline Hazards. Christos Kozyrakis Stanford University

Copyright 2012, Elsevier Inc. All rights reserved.

Overview. Appendix A. Pipelining: Its Natural! Sequential Laundry 6 PM Midnight. Pipelined Laundry: Start work ASAP

Instruction Level Parallelism. ILP, Loop level Parallelism Dependences, Hazards Speculation, Branch prediction

Graduate Computer Architecture. Chapter 3. Instruction Level Parallelism and Its Dynamic Exploitation

Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Four Steps of Speculative Tomasulo cycle 0

Appendix A. Overview

Review: Evaluating Branch Alternatives. Lecture 3: Introduction to Advanced Pipelining. Review: Evaluating Branch Prediction

Pipeline Hazards. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Lecture 9: Case Study MIPS R4000 and Introduction to Advanced Pipelining Professor Randy H. Katz Computer Science 252 Spring 1996

ECE154A Introduction to Computer Architecture. Homework 4 solution

Reduction of Data Hazards Stalls with Dynamic Scheduling So far we have dealt with data hazards in instruction pipelines by:

ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5)

Appendix C. Abdullah Muzahid CS 5513

第三章 Instruction-Level Parallelism and Its Dynamic Exploitation. 陈文智 浙江大学计算机学院 2014 年 10 月

Hardware-based Speculation

Processor: Superscalars Dynamic Scheduling

Chapter 4 The Processor 1. Chapter 4A. The Processor

Lecture 4: Review of MIPS. Instruction formats, impl. of control and datapath, pipelined impl.

Static vs. Dynamic Scheduling

MIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)

Appendix C. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

ECE232: Hardware Organization and Design

Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining

DYNAMIC INSTRUCTION SCHEDULING WITH SCOREBOARD

Lecture 2: Processor and Pipelining 1

Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3.

The basic structure of a MIPS floating-point unit

Lecture 7: Pipelining Contd. More pipelining complications: Interrupts and Exceptions

Advanced Computer Architecture. Chapter 4: More sophisticated CPU architectures

CPE Computer Architecture. Appendix A: Pipelining: Basic and Intermediate Concepts

CMSC 611: Advanced Computer Architecture

Slide Set 8. for ENCM 501 in Winter Steve Norman, PhD, PEng

CSE 502 Graduate Computer Architecture. Lec 4-6 Performance + Instruction Pipelining Review

What is Pipelining? Time per instruction on unpipelined machine Number of pipe stages

CS252 Graduate Computer Architecture Lecture 5. Interrupt Controller CPU. Interrupts, Software Scheduling around Hazards February 1 st, 2012

3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle?

Pipelining: Issue instructions in every cycle (CPI 1) Compiler scheduling (static scheduling) reduces impact of dependences

What is ILP? Instruction Level Parallelism. Where do we find ILP? How do we expose ILP?

Lecture 4: Advanced Pipelines. Data hazards, control hazards, multi-cycle in-order pipelines (Appendix A.4-A.10)

Chapter3 Pipelining: Basic Concepts

Advanced Computer Architecture Pipelining

Load1 no Load2 no Add1 Y Sub Reg[F2] Reg[F6] Add2 Y Add Reg[F2] Add1 Add3 no Mult1 Y Mul Reg[F2] Reg[F4] Mult2 Y Div Reg[F6] Mult1

EECS 252 Graduate Computer Architecture Lecture 4. Control Flow (continued) Interrupts. Review: Ordering Properties of basic inst.

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri

Transcription:

CMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3 Long Instructions & MIPS Case Study Complications With Long Instructions So far, all MIPS instructions take 5 cycles But haven't talked yet about the floating point instructions Take it on faith that floating point instructions are inherently slower than integer arithmetic instructions doubters may consult Appendix H in H&P online CMSC 411-6 (from Patterson) 2

How Slow Is Slow? Some typical times: latency is the number of cycles between an instruction that produces a result and one that uses it initiation interval is the number of cycles between two instructions of the same kind (for example, two ADD.Fs) Instruction Latency Initiation uses 0 1 Load/store 1 1 ADD.F,SUB.F 3 1 DIV.F 24 25 CMSC 411-6 (from Patterson) 3 Examples If we have a sequence of integer instructions ADD SUB AND OR SLLI then there are no delays in the pipeline, because initiation=1 means can start one of these instructions every cycle latency=0 means that results from one instruction will be available when the next instruction needs them CMSC 411-6 (from Patterson) 4

Examples (cont.) If we have a sequence of floating point instructions ADD.F SUB.F Then initiation=1 means that can start SUB.F one cycle behind ADD.F But latency=3 means that this will work right only if SUB.F doesn't need ADD.F s results If it does need the results, then need 3 instructions in between ADD.F and SUB.F to prevent bubbles in the pipeline CMSC 411-6 (from Patterson) 5 Functional Units CMSC 411-6 (from Patterson) 6

Examples (cont.) MUL.D IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB ADD.D IF ID A1 A2 A3 A4 MEM WB L.D IF ID EX MEM WB S.D IF ID EX MEM WB Italics shows where data is needed blue where a result is available CMSC 411-6 (from Patterson) 7 Hazards Caused By Long Instructions The floating point adder and multiplier are pipelined, but the divider is not - that is why the initiation interval for divide is 25 A program will run very slowly if it does too many of these! It will also run slowly if the results of the divide are needed too soon CMSC 411-6 (from Patterson) 8

FP Stalls From RAW Hazards Inst. 1 2 3 4 5 6 7 8 9 L.D F4,0(R2) IF ID EX MEM WB MUL.D F0,F4,F6 IF ID stall M1 M2 M3 M4 M5 ADD.D F2,F0,F8 IF stall ID stall stall stall stall S.D F2,0(R2) stall IF stall stall stall stall Inst. 10 11 12 13 14 15 16 17 L.D MUL.D M6 M7 MEM WB ADD.D stall stall A1 A2 A3 A4 MEM WB S.D stall stall ID EX stall stall stall MEM CMSC 411-6 (from Patterson) 9 Long Instructions (cont.) It is possible that two instructions enter the WB stage at the same time ADD.D IF ID A1 A2 A3 A4 MEM WB LD IF ID MEM WB DADD IF ID MEM WB DADD IF ID MEM WB A structural hazard CMSC 411-6 (from Patterson) 10

Long Instructions (cont.) Instructions can finish in the wrong order This can cause WAW hazards This violation of WB ordering defeats the previous strategy for precise exception handling problem is out-of-order completion 1) stop fetching 2) turn off writes 3) let pipeline drain 4) handle DIV.D F0, F2, F4 ADD R1, R1, R2 SUB.D F10, F12, F14 What happens if sub faults? And then div? What about R1? CMSC 411-6 (from Patterson) 11 WAW Structural Hazard 1 2 3 4 5 6 7 8 9 10 11 MUL.D F0,F4,F6 IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB IF ID EX MEM WB IF ID EX MEM WB ADD.D F2,F4,F6 IF ID A1 A2 A3 A4 MEM WB IF ID EX MEM WB IF ID EX MEM WB L.D F2,0(R2) IF ID EX MEM WB CMSC 411-6 (from Patterson) 12

Possible Fixes Give up and just do imprecise exception handling tempting, but very annoying to users Delay WB until all previous instructions complete since so many instructions can be active, this is expensive - requires a lot of supporting hardware Write, to memory, a history file of register and memory changes so can undo instructions if necessary or keep a future file of computed results that are waiting for MEM or WB CMSC 411-6 (from Patterson) 13 Possible Fixes (cont.) Let the exception handler finish the instructions in the pipeline and then restart the pipe at the next instruction Have the floating point units diagnose exceptions in their first or second stages, so can handle them by methods that work well for handling integer exceptions CMSC 411-6 (from Patterson) 14

How To Detect Hazards In ID Early detection would prevent trouble Check for structural hazards: will the divide unit clear in time? will WB be possible when we need it? Check for RAW data hazards: will all source registers be available when needed? Check for WAW data hazards: Is the destination register for any ADD.D, multiply or divide instruction the same register as the destination for this instruction? If anything dangerous could happen, delay the execute cycle so no conflict occurs CMSC 411-6 (from Patterson) 15 Review MIPS Instruction Format ister-ister 31 26 25 2120 16 15 1110 6 5 0 op rs rt rd opx rd rs OP rt Examples ADD R1,R2,R3 // R1 R2+R3 MUL R1,R2,R3 // R1 R2*R3 ister-immediate 31 26 25 2120 16 15 0 op rs rt immediate rt rs OP immed Examples ADDI R1,R2,8 // R1 R2+8 LW R1,4(R2) // R1 MEM (R2+4) SW R1,4(R2) // MEM (R2+4) R1 rs OP rt CMSC 411-3 (from Patterson) 16

Review Pipeline isters Instruction Fetch Instr. Decode. Fetch Execute Addr. Calc Memory Access Write Back Next PC 4 Adder Next SEQ PC RS Next SEQ PC Zero? MUX Address Memory IF/ID RT File ID/EX MUX MUX EX/MEM Data Memory MEM/WB MUX Imm Sign Extend RD RD RD WB Data Pipeline register s instruction register (IR) fields: IR[op] IR[rs] IR[rt] IR[rd] CMSC 411-4 (from Patterson) 17 Data Hazard Without Forwarding Time (clock cycles) lw r1,4(r2) Ifetch DMem add r3,r1,r4 Ifetch DMem CMSC 411-5 (from Patterson) 18

Data Hazard With Forwarding Time (clock cycles) lw r1,4(r2) Ifetch DMem add r3,r1,r4 Ifetch DMem CMSC 411-5 (from Patterson) 19 RAW Data Hazard Detection (LW/ADD) Time (clock cycles) lw r1,4(r2) Ifetch DMem add r3,r1,r4 Ifetch DMem At time = cycle 2 IF/ID.IR[op] = ADD IF/ID.IR[rs] = r1 IF/ID.IR[rt] = r4 IF/ID.IR[rd] = r3 ID/EX.IR[op] = LW ID/EX.IR[rs] = r2 ID/EX.IR[rt] = r1 So insert pipeline stall due to RAW hazard if ID/EX.IR[op] = LW IF/ID.IR[op] = ADD ID/EX.IR[rt] = IF/ID.IR[rs] OR ID/EX.IR[rt] = IF/ID.IR[rt] CMSC 411-5 (from Patterson) 20

RAW Data Hazard Detection (LW/LW) Time (clock cycles) lw r1,4(r2) Ifetch DMem lw r3,8(r1) Ifetch DMem At time = cycle 2 IF/ID.IR[op] = LW IF/ID.IR[rs] = r1 IF/ID.IR[rt] = r3 ID/EX.IR[op] = LW ID/EX.IR[rs] = r2 ID/EX.IR[rt] = r1 So insert pipeline stall due to RAW hazard if ID/EX.IR[op] = LW IF/ID.IR[op] = LW ID/EX.IR[rt] = IF/ID.IR[rs] CMSC 411-5 (from Patterson) 21 WAW Data Hazard Detection (MUL.D/ADD.D) MUL.D r1,r2,r3 ADD.D r1,r4,r5 If at time x IF/ID.IR[op] = ADD.D IF/ID.IR[rd] = r1 M1/M2.IR[op] = MUL.D M1/M2.IR[rd] = r1 Then ADD.D would write to r1 at time x+6, while MUL.D would write r1 at time x=7, reversing order of writes So insert pipeline stall due to WAW hazard if M1/M2.IR[op] = MUL.D IF/ID.IR[op] = ADD.D M1/M2.IR[rd] = IF/ID.IR[rd] CMSC 411-5 (from Patterson) 22

A Case Study: MIPS R4000 MIPS R4000 Introduced 1991, one of the first 64-bit CPUs Sony PSP (2004) used 0.3 GHz R4000 Deep 8 stage pipeline to get higher clock rates extra stages come from memory accesses techniques called superpipelining CMSC 411-7 (from Patterson) 23 MIPS R4000 Pipeline Stages IF 1st half instruction fetch PC selection and start instruction cache access IS 2nd half instruction fetch complete instruction cache access RF instruction decode, register fetch, hazard checking, instruction cache hit detection EX execution includes effective address computation, operation, branch target computation and condition evaluation CMSC 411-7 (from Patterson) 24

MIPS R4000 Pipeline (cont.) DF 1 st half data fetch 1 st half of data cache access DS 2 nd half data fetch complete data cache access TC tag check determine whether data cache access hit WB write back for loads and operations CMSC 411-7 (from Patterson) 25 MIPS R4000 Pipeline (cont.) A 2 cycle load delay Might need to restart ADDD s CMSC 411-7 (from Patterson) 26

MIPS R4000 Pipeline (cont.) A 3 cycle branch delay 1 delay slot + 2 cycle stall for taken branch (untaken just delay slot) CMSC 411-7 (from Patterson) 27 Forwarding Deeper pipeline increases number of levels of forwarding for operations 4 possible sources for an bypass» EX/DF» DF/DS» DS/TC» TC/WB CMSC 411-7 (from Patterson) 28

Floating Point Pipeline 3 functional units divider, multiplier, adder Double precision FP ops take from 2 (negate) up to 112 cycles (square root) Effectively 8 stages, combined in different orders for various FP operations one copy of each stage, and some instructions use a stage zero or more times, and in different orders Overall, rather complicated see H&P for more details CMSC 411-7 (from Patterson) 29 R4000 Pipeline Performance 4 major causes of pipeline stalls load stalls from using load result 1 or 2 cycles after load branch stalls 2 cycles on every taken branch, or empty branch delay slot FP result stalls RAW hazards for an FP operand FP structural stalls from conflicts for functional units in FP pipeline CMSC 411-7 (from Patterson) 30

SPEC92 Benchmarks Assuming a perfect cache 5 integer and 5 floating-point programs CMSC 411-7 (from Patterson) 31 Pitfalls Unexpected hazards do occur for example, when a branch is taken before a previous instruction finishes Extensive pipelining can slow a machine down, or lead to worse cost-performance more complex hardware can cause a longer clock cycle, killing the benefits of more pipelining CMSC 411-7 (from Patterson) 32

Pitfalls (cont.) A poor compiler can make a good machine look bad compiler writers need to understand the architecture in order to» optimize efficiently and» avoid hazards better to eliminate useless instructions, than make them run faster CMSC 411-7 (from Patterson) 33