Predict Not Taken. Revisiting Branch Hazard Solutions. Filling the delay slot (e.g., in the compiler) Delayed Branch

Similar documents
ECE 486/586. Computer Architecture. Lecture # 12

Instruction Pipelining Review

Instruction Level Parallelism. ILP, Loop level Parallelism Dependences, Hazards Speculation, Branch prediction

Minimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline

There are different characteristics for exceptions. They are as follows:

Review: Evaluating Branch Alternatives. Lecture 3: Introduction to Advanced Pipelining. Review: Evaluating Branch Prediction

Announcement. ECE475/ECE4420 Computer Architecture L4: Advanced Issues in Pipelining. Edward Suh Computer Systems Laboratory

Lecture 9: Case Study MIPS R4000 and Introduction to Advanced Pipelining Professor Randy H. Katz Computer Science 252 Spring 1996

Chapter3 Pipelining: Basic Concepts

Appendix C: Pipelining: Basic and Intermediate Concepts

LECTURE 10. Pipelining: Advanced ILP

Lecture 6 MIPS R4000 and Instruction Level Parallelism. Computer Architectures S

Advanced issues in pipelining

Chapter 3. Pipelining. EE511 In-Cheol Park, KAIST

COMP 4211 Seminar Presentation

CMCS Mohamed Younis CMCS 611, Advanced Computer Architecture 1

COSC 6385 Computer Architecture - Pipelining (II)

COSC4201. Prof. Mokhtar Aboelaze York University

ECE 4750 Computer Architecture, Fall 2017 T05 Integrating Processors and Memories

Overview. Appendix A. Pipelining: Its Natural! Sequential Laundry 6 PM Midnight. Pipelined Laundry: Start work ASAP

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

Appendix A. Overview

Execution/Effective address

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

CISC 662 Graduate Computer Architecture Lecture 7 - Multi-cycles

Complications with long instructions. CMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3. How slow is slow?

Improving Performance: Pipelining

CISC 662 Graduate Computer Architecture Lecture 7 - Multi-cycles. Interrupts and Exceptions. Device Interrupt (Say, arrival of network message)

Pipelining. Principles of pipelining. Simple pipelining. Structural Hazards. Data Hazards. Control Hazards. Interrupts. Multicycle operations

09-1 Multicycle Pipeline Operations

Instr. execution impl. view

Multi-cycle Instructions in the Pipeline (Floating Point)

Lecture 4: Introduction to Advanced Pipelining

CMCS Mohamed Younis CMCS 611, Advanced Computer Architecture 1

Lecture 7: Pipelining Contd. More pipelining complications: Interrupts and Exceptions

CS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism

Basic Pipelining Concepts

Complex Pipelining: Out-of-order Execution & Register Renaming. Multiple Function Units

Appendix C. Instructor: Josep Torrellas CS433. Copyright Josep Torrellas 1999, 2001, 2002,

CMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3. Complications With Long Instructions

Appendix C. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Pipelining: Basic and Intermediate Concepts

Instruction Level Parallelism

Advanced Computer Architecture Pipelining

Improvement: Correlating Predictors

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S

Floating Point/Multicycle Pipelining in DLX

Pipelining. Principles of pipelining. Simple pipelining. Structural Hazards. Data Hazards. Control Hazards. Interrupts. Multicycle operations

Computer Systems Architecture I. CSE 560M Lecture 5 Prof. Patrick Crowley

Data Hazards Compiler Scheduling Pipeline scheduling or instruction scheduling: Compiler generates code to eliminate hazard

ECE473 Computer Architecture and Organization. Pipeline: Control Hazard

EITF20: Computer Architecture Part2.2.1: Pipeline-1

Administrivia. CMSC 411 Computer Systems Architecture Lecture 6. When do MIPS exceptions occur? Review: Exceptions. Answers to HW #1 posted

COSC4201 Pipelining. Prof. Mokhtar Aboelaze York University

Instruction Frequency CPI. Load-store 55% 5. Arithmetic 30% 4. Branch 15% 4

HY425 Lecture 09: Software to exploit ILP

EITF20: Computer Architecture Part2.2.1: Pipeline-1

HY425 Lecture 09: Software to exploit ILP

Advanced Computer Architecture. Chapter 4: More sophisticated CPU architectures

Lecture-13 (ROB and Multi-threading) CS422-Spring

Four Steps of Speculative Tomasulo cycle 0

Updated Exercises by Diana Franklin

The Big Picture Problem Focus S re r g X r eg A d, M lt2 Sub u, Shi h ft Mac2 M l u t l 1 Mac1 Mac Performance Focus Gate Source Drain BOX

Metodologie di Progettazione Hardware-Software

Slide Set 8. for ENCM 501 in Winter Steve Norman, PhD, PEng

EECC551 Exam Review 4 questions out of 6 questions

CS252 Graduate Computer Architecture Lecture 5. Interrupt Controller CPU. Interrupts, Software Scheduling around Hazards February 1 st, 2012

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

Page 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer

Super Scalar. Kalyan Basu March 21,

EITF20: Computer Architecture Part2.2.1: Pipeline-1

Pipeline Complexities. Pipeline Hazards

LECTURE 3: THE PROCESSOR

Advanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017

Instruction-Level Parallelism and Its Exploitation

Very short answer questions. "True" and "False" are considered short answers.

Pipeline Overview. Dr. Jiang Li. Adapted from the slides provided by the authors. Jiang Li, Ph.D. Department of Computer Science

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

ECE/CS 552: Pipeline Hazards

Processor: Superscalars Dynamic Scheduling

CS422 Computer Architecture

Anti-Inspiration. It s true hard work never killed anybody, but I figure, why take the chance? Ronald Reagan, US President

Outline. Pipelining basics The Basic Pipeline for DLX & MIPS Pipeline hazards. Handling exceptions Multi-cycle operations

MIPS An ISA for Pipelining

Functional Units. Registers. The Big Picture: Where are We Now? The Five Classic Components of a Computer Processor Input Control Memory

COSC 6385 Computer Architecture - Pipelining

Pipelined Processors. Ideal Pipelining. Example: FP Multiplier. 55:132/22C:160 Spring Jon Kuhl 1

The CPU Pipeline. MIPS R4000 Microprocessor User's Manual 43

Computer Architecture

Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

COMPUTER ORGANIZATION AND DESIGN

EITF20: Computer Architecture Part3.2.1: Pipeline - 3

Instruction Level Parallelism. Taken from

Page # CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Michela Taufer

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

CS425 Computer Systems Architecture

CS425 Computer Systems Architecture

Page 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution

Complex Pipelining COE 501. Computer Architecture Prof. Muhamed Mudawar

Lecture 19: Memory Hierarchy Five Ways to Reduce Miss Penalty (Second Level Cache) Admin

Transcription:

branch taken Revisiting Branch Hazard Solutions Stall Predict Not Taken Predict Taken Branch Delay Slot Branch I+1 I+2 I+3 Predict Not Taken branch not taken Branch I+1 IF (bubble) (bubble) (bubble) (bubble) Branch Target T+1 Delayed Branch Branch I+1 (delay slot) I+2 I+3 branch not taken Branch Filling the delay slot (e.g., in the compiler) lw R1, 10000(R7) add R5, R6, R1 beqz R5, label: sub R8, R1, R3 add R4, R8, R9 and R2, R4, R8 Can be done when? Improves performance when? I+1 (delay slot) Branch Target T+1 branch taken label: add R2, R5, R8

Problems filling delay slot Branch Likely 1. need to predict direction of branch to be most effective 2. limited by correctness restriction correctness restriction can be removed by a canceling branch branch likely or branch not likely e.g., beqz likely delay slot instruction fall-through instruction squashed/nullified/canceled if branch not taken Branch likely I+1 (delay slot) I+2 I+3 Branch likely I+1 (delay slot) Branch Target T+1 IF (bubble) (bubble) (bubble) (bubble) branch not taken branch taken Branch Performance CPI = BCPI + pipeline stalls from branches per instruction = 1.0 + branch frequency * branch penalty assume 20% branches, 67% taken: branch taken not taken scheme penalty penalty CPI stall predict taken predict not taken delayed branch Static Branch Prediction Static branch prediction takes place at compile time, dynamic branch prediction during program execution static bp done by software, dynamic bp done in hardware Static branch prediction enables more effective code scheduling around hazards (how?) more effective use of delay slots Misprediction rate 2 20% 1 12% 10% 22% 18% 11% 12% 6% 9% 10% 1 0% compress eqntott espresso gcc li doduc Benchmark ear hydro2d mdljdp su2cor

MIPS Integer Pipeline Performance But now, the real world interrupts... Only stalls for load hazards and branch hazards, both of which can be reduced (but not eliminated) by software 14% 14% Pipelining is not as easy as we have made it seem so far... interrupts and exceptions long-latency instructions 12% 10% 9% Branch stalls Percentage of all instructions that stall Load stalls 8% 6% 4% 4% 3% 4% 7% 7% 2% 0% compress eqntott espresso gcc li Benchmark Exceptions and Interrupts Transfer of control flow (to an exception handler) without an explicit branch or jump are often unpredictable examples I/O device request OS system call arithmetic overflow/underflow FP error page fault memory-protection violation hardware error undefined instruction Classes of Exceptions synchronous vs. asynchronous user-initiated vs. coerced user maskable vs. nonmaskable within instruction vs. between instructions resume vs. terminate when the pipeline can be stopped just before the faulting instruction, and can be restarted from there (if necessary), the pipeline supports precise exceptions

Basic Exception Methodology turn off writes for faulting instruction and following force a trap into the pipeline at the next IF save the PC of the faulting instruction (not quite enough for delayed branches) Exceptions Can Occur In Several Places in the pipeline IF -- page fault on memory access, misaligned memory access, memory-protection violation ID -- illegal opcode EX -- arithmetic exception MEM -- page fault, misaligned access, memory-protection WB -- none (and, of course, asynchronous can happen anytime) LW ADD SUB Simplifying Exceptions in the ISA Each instruction changes machine state only once autoincrement string operations condition codes Each instruction changes machine state at the end of the pipeline (when you know it will not cause an exception) Handling Multicycle Operations Unrealistic to expect that all operations take the same amount of time to execute FP, some memory operations will take longer This violates some of the assumptions of our simple pipeline

Multiple Execution Pipelines New problems EX FP/integer multiply M1 M2 M3 M4 M5 M6 M7 IF ID MEM WB FP adder A1 A2 A3 A4 FP/integer divider DIV structural hazards divide unit WB stage WAW hazards are possible out-of-order completion WAR hazards still not possible Integer unit FU Latency Initiation interval Integer 0 1 Memory 1 1 FP add 3 1 FP multiply 6 1 FP divide 24 24 structural hazards and WAW hazards Hazard Detection in the ID stage structural hazards divide unit WB stage ADDD IF ID A1 A2 A3 A4 MEM WB...... LD An instruction can only issue (proceed past the ID stage) when: there are no structural hazards (divide unit is free, WB port will be free when needed) no WAR data hazards no WAW hazards with instructions in long pipes WAW hazards ADDD F8,... IF ID A1 A2 A3 A4 MEM WB LD F8,...

IF MIPS R4000 Pipeline IS RF EX DF DS TC WB R4000 Data Load Hazard Time (in clock cycles) CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9 CC 10 CC 11 FIGURE 3.50 The eight-stage pipeline structure of the R4000 uses pipelined instruction and data caches. LW R1 scalar, superpipelined IF first half of fetching of instruction; PC selection happens here as well as initiation of instruction cache access. IS second half of access to instruction cache. RF instruction decode and register fetch, hazard checking and also instruction cache hit detection. EX execution, which includes effective address calculation, operation, and branch target computation and condition evaluation. DF data fetch, first half of access to data cache. DS second half of access to data cache. TC tag check, determine whether the data cache access hit. WB write back for loads and register-register operations. Instruction 1 Instruction 2 Is there an integer arithmetic data hazard? ADD R2, R1 R4000 Data Load Hazard R4000 Branch Hazard Time (in clock cycles) CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9 CC10 CC11 BZ lw R1,... IF IS RF EX DF DS TC WB add R3, R1, R2 IF IS RF (stall) (stall) EX DF Instruction 1 Instruction 2 2-cycle load delay Instruction 3 Target predict not taken, branch delay slot not taken -> no penalty (unless branch likely or no delay slot instruction) taken -> 2 stall cycles if delay slot instruction used

R4000 Performance Key Points Data Hazards can be significantly reduced by forwarding Branch hazards can be reduced by early computation of condition and target, branch delay slots, branch prediction Data hazard and branch hazard reduction require complex compiler support Exceptions are hard, precise exceptions are really hard variable-length instructions introduce structural hazards, WAW hazards, more RAW hazards