Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

Outline: pipelining and hazards; branch prediction; static and dynamic scheduling; speculation; compiler techniques and VLIW; limits of ILP.

Implementation of a RISC ISA: Stages
Instruction Fetch (IF); Instruction Decode/Register Fetch (ID), with fixed-field decoding; Execution/Effective Address (EX); Memory Access (MEM); Write Back (WB).

MIPS Datapath
[Figure: five-stage MIPS datapath. Stages: Instruction Fetch (IF), Instruction Decode/Register Fetch (ID), Execute/Address Calculation (EX), Memory Access (MEM), Write Back (WB).]

Multiple Issue Integer Pipeline
[Figure: multiple-issue integer pipeline with two instruction registers (IR0, IR1) sharing the register file, across the IF, ID, EX, MEM, and WB stages.]

Pipeline Performance
An unpipelined processor has a 1 ns clock cycle. ALU operations and branches take 4 cycles, and memory operations take 5 cycles. The relative frequencies of ALU operations, branches, and memory operations are 40%, 20%, and 40%, respectively. Suppose that, because of clock skew and setup, pipelining adds 0.2 ns of overhead to the clock. What is the speedup?
Average instruction execution time = Clock cycle × Average CPI
$\text{CPI} = \sum_{i=1}^{n} \frac{IC_i}{\text{Instruction count}} \times \text{CPI}_i$
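
The question above can be worked through numerically. The following is a minimal sketch, assuming the pipelined machine sustains an ideal CPI of 1 (the slide does not state this), so that its time per instruction is simply the lengthened clock cycle. The names average_cpi and MIX are illustrative only.

```python
# Worked example of the pipelining speedup question.
# Assumption (not stated on the slide): the pipelined machine sustains
# an ideal CPI of 1, so its time per instruction is one (slower) clock.

def average_cpi(mix):
    """mix: list of (relative frequency, cycles) pairs."""
    return sum(freq * cycles for freq, cycles in mix)

UNPIPELINED_CLOCK_NS = 1.0
PIPELINE_OVERHEAD_NS = 0.2

# (frequency, cycles): ALU operations, branches, memory operations
MIX = [(0.40, 4), (0.20, 4), (0.40, 5)]

unpipelined_ns = UNPIPELINED_CLOCK_NS * average_cpi(MIX)
pipelined_ns = (UNPIPELINED_CLOCK_NS + PIPELINE_OVERHEAD_NS) * 1.0
speedup = unpipelined_ns / pipelined_ns

print(f"unpipelined: {unpipelined_ns:.1f} ns per instruction")
print(f"pipelined:   {pipelined_ns:.1f} ns per instruction")
print(f"speedup:     {speedup:.2f}")
```

Under these assumptions the average unpipelined instruction takes 4.4 ns and the pipelined one 1.2 ns, a speedup of roughly 3.7.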

Dependences; Pipeline Hazards: Structural and Data

Outline: data dependences; name dependences; structural hazards; data hazards; stalling and forwarding.

Basic Block
A straight-line code sequence with no branches in except at the entry and no branches out except at the exit.
    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-8
          BNE    R1, R2, Loop

Dependence
    for (i=0; i<=999; i=i+1)
        x[i] = x[i] + a;
Data dependence (RAW). Name dependences (WAR, WAW), which register renaming can remove.
    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-8
          BNE    R1, R2, Loop
Both of the following instructions write F4, a name dependence:
    ADD.D F4, F0, F2
    ADD.D F4, F6, F8
Hazard: overlap during execution could change the order of access to the operand involved in the dependence.
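
The three dependence types can be checked mechanically from the registers each instruction reads and writes. The following is a minimal sketch that considers register operands only (dependences through memory are ignored); the Instr record and the classify helper are hypothetical names introduced for illustration.

```python
# Minimal sketch: classify register dependences between two instructions.
# Instr and classify() are hypothetical names used only for illustration.
from typing import NamedTuple, Optional, Set, Tuple

class Instr(NamedTuple):
    text: str
    dest: Optional[str]      # register written, or None (e.g. a store)
    srcs: Tuple[str, ...]    # registers read

def classify(first: Instr, second: Instr) -> Set[str]:
    """Dependences of `second` on `first` (first is earlier in program order)."""
    kinds = set()
    if first.dest is not None and first.dest in second.srcs:
        kinds.add("RAW")   # true data dependence: second reads what first writes
    if second.dest is not None and second.dest in first.srcs:
        kinds.add("WAR")   # antidependence: second overwrites what first reads
    if first.dest is not None and first.dest == second.dest:
        kinds.add("WAW")   # output dependence: both write the same register
    return kinds

load = Instr("L.D    F0, 0(R1)", "F0", ("R1",))
add = Instr("ADD.D  F4, F0, F2", "F4", ("F0", "F2"))
store = Instr("S.D    F4, 0(R1)", None, ("F4", "R1"))
dec = Instr("DADDUI R1, R1, #-8", "R1", ("R1",))
add2 = Instr("ADD.D  F4, F6, F8", "F4", ("F6", "F8"))

print(classify(load, add))    # dependence through F0
print(classify(add, store))   # dependence through F4
print(classify(store, dec))   # dependence through R1
print(classify(add, add2))    # dependence through F4
```

In the loop, the load feeds the add and the add feeds the store (RAW), the DADDUI overwrites R1 after the store has read it (WAR), and the two ADD.D instructions both write F4 (WAW).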

Hazards
Program order: ILP preserves program order only where it affects the outcome of the program.
Structural hazards: resource conflicts.
Data hazards: RAW, WAW, WAR.
Control hazards: whether or not an instruction should be executed depends on a control decision made by an earlier instruction.

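Structural Hazard
Cycle   1    2    3    4    5    6    7    8    9
i1      MEM  ID   EX   MEM  WB
i2           MEM  ID   EX   MEM  WB
i3                MEM  ID   EX   MEM  WB
i4                     MEM  ID   EX   MEM  WB
i5                          MEM  ID   EX   MEM  WB
HAZARD: from cycle 4 onward, an instruction fetch (the first MEM in each row) and a data access (the second MEM) need the memory in the same cycle.
Examples: a unified memory; the register file, accessed in both WB and ID.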

Cost of a Load Structural Hazard
Data references constitute 40% of the instruction mix. The ideal CPI is 1 (with no structural hazards). Assume that the processor with the structural hazard has a clock rate 1.1 times higher than that of the processor without the hazard. Which processor is faster, and by how much?
$\text{Avg. instruction time} = \text{CPI} \times \text{Clock cycle time}$
Without the hazard: $\text{Avg. instruction time}_{\text{ideal}} = 1 \times \text{Clock cycle time}_{\text{ideal}}$
With the hazard, each data reference costs one stall cycle:
$\text{Avg. instruction time} = (1 + 0.4 \times 1) \times \frac{\text{Clock cycle time}_{\text{ideal}}}{1.1} \approx 1.27 \times \text{Clock cycle time}_{\text{ideal}}$
The processor without the structural hazard is therefore about 1.27 times faster.

Data Hazards
R1 is updated in the WB stage.
    R1 ← R2 + R3
    R4 ← R1 + R5
[Figure: MIPS datapath with an instruction register (IR) carried through each pipeline stage.]

Data Hazard
Time (clock cycles)   1    2    3    4    5    6
R1 ← R2 + R3          IF   ID   EX   MA   WB
R4 ← R1 + R5               IF   ID   EX   MA   WB
R4 ← R1 + R5 reads R1 in ID before R1 ← R2 + R3 writes it in WB: wrong register values! How to overcome this hazard?

Stalled Stages and Pipeline Bubbles
[Figure: pipeline diagram for R1 ← R2 + R3 followed by R4 ← R1 + R5. The second instruction is held in ID (stalled stages) and the instructions behind it are held in IF, while three nops (bubbles) pass through EX, MA, and WB. A resource-usage table shows which of I1 to I5 occupies each stage in each cycle.]

Resolving Data Hazards: stalling one of the instructions; data forwarding (bypassing); scheduling hazardous instructions away from each other.

Stalling (Interlocking)
[Figure: MIPS datapath with stall-condition logic that inserts a NOP into the pipeline, shown for R1 ← R2 + R3 followed by an instruction that reads R1.]

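Pipeline Performance
With equal clock cycle times and an ideal pipelined CPI of 1:
$\text{Speedup}_{\text{pipelining}} = \frac{\text{CPI}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}}}$
$\text{Speedup}_{\text{pipelining}} = \frac{\text{Pipeline depth}}{1 + \text{Stall cycles per instruction}}$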
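
The step between the two speedup expressions can be sketched as follows. The derivation assumes that both machines have the same clock cycle time, that the pipelined machine has an ideal CPI of 1, and that the unpipelined CPI equals the pipeline depth; these are the usual textbook assumptions and are not spelled out on the slide.

```latex
\begin{align*}
\text{Speedup}_{\text{pipelining}}
  &= \frac{\text{CPI}_{\text{unpipelined}} \times \text{Clock cycle}_{\text{unpipelined}}}
          {\text{CPI}_{\text{pipelined}} \times \text{Clock cycle}_{\text{pipelined}}}
   = \frac{\text{CPI}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}}} \\
\text{CPI}_{\text{pipelined}}
  &= 1 + \text{Stall cycles per instruction} \\
\text{Speedup}_{\text{pipelining}}
  &= \frac{\text{Pipeline depth}}{1 + \text{Stall cycles per instruction}}
\end{align*}
```

For example, a 5-stage pipeline that averages 0.4 stall cycles per instruction would reach a speedup of 5 / 1.4, or about 3.57, rather than the ideal 5.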

Forwarding
    DADD R1, R2, R3
    DSUB R4, R1, R5
    AND  R6, R1, R7
    OR   R8, R1, R9
    XOR  R10, R1, R11
[Figure: pipeline diagram over time (clock cycles); DADD, DSUB, and AND each pass through IM, REG, ALU, DM, REG, one cycle apart.]

Forwarding
Before bypassing: R4 ← R1 + R5 is stalled behind R1 ← R2 + R3 (stalled stages), so CPI > 1.
After bypassing: both instructions flow through IF, ID, EX, MA, WB without stalling, so CPI = 1.

Cost of Forwarding
In longer pipelines? In multiple-issue pipelines? Have all the dependences been resolved?

Forwarding
Forwarding cannot solve all data dependence problems.
    LD  R2, 4(R1)
    ADD R4, R2, R3
Time (clock cycles)   1    2    3    4    5    6
LD                    IM   REG  ALU  DM   REG
ADD                        IM   REG  ALU  DM   REG

Forwarding: Stall Condition
Forwarding cannot solve all data dependence problems.
    LD  R2, 4(R1)
    ADD R4, R2, R3
Time (clock cycles)   1    2    3    4      5    6    7
LD                    IM   REG  ALU  DM     REG
ADD                        IM   REG  STALL  ALU  DM   REG
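
Why forwarding removes the ALU-to-ALU stall but not the load-use stall comes down to when the result exists and when the consumer needs it. The following is a minimal timing sketch, assuming a classic 5-stage pipeline with one instruction issued per cycle, full forwarding, and operands needed at the start of the consumer's EX stage. The function and table names are hypothetical.

```python
# Minimal timing sketch for forwarding. Assumptions (illustrative, not
# from the slides): classic 5-stage pipeline, one instruction issued per
# cycle, full forwarding, operands needed at the start of the consumer's EX.

STAGE_INDEX = {"IF": 0, "ID": 1, "EX": 2, "MEM": 3, "WB": 4}

# Stage at whose end the result can first be forwarded.
RESULT_READY_AFTER = {"alu": "EX", "load": "MEM"}

def stalls_with_forwarding(producer_kind: str, distance: int) -> int:
    """Stall cycles for a consumer `distance` instructions after the producer.

    distance=1 means the consumer immediately follows the producer.
    The producer is taken to enter IF in cycle 0.
    """
    ready_cycle = STAGE_INDEX[RESULT_READY_AFTER[producer_kind]]  # ready at END of this cycle
    needed_cycle = distance + STAGE_INDEX["EX"]                   # needed at START of this cycle
    return max(0, ready_cycle + 1 - needed_cycle)

print(stalls_with_forwarding("alu", 1))   # ALU op followed by a dependent ALU op
print(stalls_with_forwarding("load", 1))  # LD R2, 4(R1) then ADD R4, R2, R3
print(stalls_with_forwarding("load", 2))  # one independent instruction in between
```

Under these assumptions only the load followed immediately by its consumer needs a stall, and a single independent instruction between the two is enough to hide it.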

Instruction Level Parallelism: Static Scheduling

Outline: ILP; multicycle instructions; loop unrolling and scheduling; superscalar pipelines.

ILP
Instruction-level parallelism: overlap among instructions, through pipelining or multiple instruction execution.
What determines the degree of ILP? Dependences, a property of the program, and hazards, a property of the pipeline.

Pipeline Scheduling
Reorder instructions so that dependent instructions are far enough apart.
Done by the compiler, before the program runs: static instruction scheduling.
Done by the hardware, while the program is running: dynamic instruction scheduling.

Static vs. Dynamic Scheduling
Dynamic scheduling requires complex structures to identify independent instructions (scoreboards, issue queue), with high power consumption, low clock speed, and high design and verification effort.
Static scheduling: the compiler can compute instruction latencies and dependences.

Pipeline Scheduling
Original program (total execution cycles: 7):
    LW   R3, 0(R1)
    stall
    ADDI R5, R3, 1
    ADD  R2, R2, R3
    LW   R13, 0(R11)
    stall
    ADD  R12, R13, R3
Scheduled code (total execution cycles: 5):
    LW   R3, 0(R1)
    LW   R13, 0(R11)
    ADDI R5, R3, 1
    ADD  R2, R2, R3
    ADD  R12, R13, R3
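
The two cycle counts on this slide can be reproduced with a small model. The following is a minimal sketch, assuming full forwarding, so that the only stall is one cycle when an instruction reads a register loaded by the instruction immediately before it, and counting cycles as issue slots (pipeline fill and drain are ignored). The issue_cycles helper is a hypothetical name.

```python
# Minimal sketch that reproduces the cycle counts on the slide.
# Assumptions (illustrative): full forwarding, a one-cycle load-use
# stall, and cycles counted as issue slots (fill and drain ignored).

def issue_cycles(program):
    """program: list of (opcode, dest, sources) tuples in program order."""
    cycles = 0
    prev_load_dest = None
    for opcode, dest, sources in program:
        if prev_load_dest is not None and prev_load_dest in sources:
            cycles += 1                  # load-use stall (bubble)
        cycles += 1                      # the instruction itself
        prev_load_dest = dest if opcode == "LW" else None
    return cycles

original = [
    ("LW",   "R3",  ("R1",)),
    ("ADDI", "R5",  ("R3",)),
    ("ADD",  "R2",  ("R2", "R3")),
    ("LW",   "R13", ("R11",)),
    ("ADD",  "R12", ("R13", "R3")),
]
# Move the second load up so that neither load is followed by its consumer.
scheduled = [original[0], original[3], original[1], original[2], original[4]]

print(issue_cycles(original))   # 7
print(issue_cycles(scheduled))  # 5
```

The reordering is legal here because the second load depends on none of the instructions it moves past.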

Why is Pipelining Hard to Implement? Interrupts, Exceptions, Traps, etc.

Outline: exception handling; precise and imprecise exceptions; exceptions in OoO pipelines.

Exceptions: events that request the attention of the processor.

Stopping and Restarting Execution
Trap instruction, turn off writes, save PC, save processor state, (disable exceptions), exception handler, RFE. Precise exceptions.
Pipeline stage   Problem exceptions occurring
IF               Page fault on instruction fetch; misaligned memory access; memory protection violation
ID               Undefined or illegal opcode
EX               Arithmetic exception
MEM              Page fault on data fetch; misaligned memory access; memory protection violation
WB               None

Precise Exception Handling LD

Precise Exceptions
Time (clock cycles)   1    2    3    4    5    6
LD                    IF   ID   EX   MEM  WB
DADD                       IF   ID   EX   MEM  WB
Multiple exceptions in the same cycle. Early exception by a later instruction.
Instruction status vector: check before commit.
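
The instruction status vector idea can be illustrated with a toy model. The following is a minimal sketch, assuming a 5-stage in-order pipeline, one instruction issued per cycle, and exceptions acted on only when an instruction reaches commit. The fault causes and all function names are hypothetical.

```python
# Minimal sketch of an instruction status vector. Assumptions
# (illustrative only): 5-stage in-order pipeline, one instruction issued
# per cycle, exceptions handled only at commit. All names are hypothetical.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def detection_cycle(issue_cycle, stage):
    """Cycle in which an exception raised in `stage` is detected."""
    return issue_cycle + STAGES.index(stage)

def first_exception_at_commit(program):
    """program: list of (name, {stage: cause}) in program order.

    Each instruction carries its own status vector (stage -> cause).
    Checking it at commit handles exceptions in program order, even when
    a later instruction faulted earlier in time.
    """
    for name, status_vector in program:
        for stage in STAGES:             # earliest stage first
            if stage in status_vector:
                return name, stage, status_vector[stage]
    return None

program = [
    ("LD",   {"MEM": "data page fault"}),
    ("DADD", {"IF": "instruction page fault"}),
]

for issue_cycle, (name, vector) in enumerate(program, start=1):
    for stage, cause in vector.items():
        print(name, cause, "detected in cycle", detection_cycle(issue_cycle, stage))

print(first_exception_at_commit(program))
```

In this example the DADD fault is detected in cycle 2 and the LD fault in cycle 4, yet the LD fault is the one reported, because LD comes first in program order.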

Precise Exceptions Instruction Status Vector: Check before commit

Multi-cycle Operations Pipeline

Precise Exceptions
    DIV.D F0, F2, F4
    ADD.D F10, F10, F8
    SUB.D F12, F12, F14
Out-of-order completion.
Can't ignore exceptions: virtual memory, IEEE 754.
Fast mode vs. slow mode with precise exceptions.
Store results of earlier operations in a buffer: history file, future file.
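
Out-of-order completion and the effect of buffering results can be seen in a toy model. The following is a minimal sketch in which the execution latencies are hypothetical values chosen for illustration (the slides do not give them), one instruction issues per cycle, and the three instructions are treated as independent. The commit_order helper models a generic in-order result buffer, not specifically the history file or future file.

```python
# Minimal sketch of out-of-order completion with multi-cycle operations.
# The latencies are hypothetical values chosen for illustration only.

EX_LATENCY = {"DIV.D": 24, "ADD.D": 4, "SUB.D": 4}   # assumed EX cycles

program = [
    "DIV.D F0, F2, F4",
    "ADD.D F10, F10, F8",
    "SUB.D F12, F12, F14",
]

def completion_cycles(instrs):
    """Cycle in which each instruction finishes executing."""
    done = []
    for issue_cycle, text in enumerate(instrs, start=1):
        opcode = text.split()[0]
        done.append((issue_cycle + EX_LATENCY[opcode], text))
    return done

def commit_order(instrs):
    """In-order commit: a result waits in a buffer until every earlier
    instruction has committed, so state changes in program order."""
    commits, last_commit = [], 0
    for done_cycle, text in completion_cycles(instrs):
        last_commit = max(done_cycle, last_commit + 1)
        commits.append((last_commit, text))
    return commits

for cycle, text in sorted(completion_cycles(program)):
    print("completes at", cycle, ":", text)
for cycle, text in commit_order(program):
    print("commits at  ", cycle, ":", text)
```

With these assumed latencies ADD.D and SUB.D finish long before DIV.D. Without buffering, ADD.D would already have overwritten F10 if DIV.D then raised an exception.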

References
HP5e, Appendix C: Pipelining: Basic and Intermediate Concepts.
HP5e, Chapter 3: Instruction-Level Parallelism and Its Exploitation.
Smith and Pleszkun, "Implementing Precise Interrupts in Pipelined Processors," IEEE Transactions on Computers, 1988.