ECE 505 Computer Architecture
Pipelining 2
Berk Sunar and Thomas Eisenbarth

Review
The 5 stages of RISC: IF, ID, EX, MEM, WB
Ideal speedup of pipelining = pipeline depth (N)
In practice, speedup is reduced by implementation problems (slower clock) and hazards (stalls):
    Speedup = Pipeline depth (N) / (1 + Average number of stalls per instruction)
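
As a quick numeric illustration of the speedup formula, here is a minimal Python sketch; the pipeline depth of 5 is from the slide, while the 0.25 stalls per instruction is an assumed number used only for the example.

```python
def pipeline_speedup(depth, stalls_per_instruction):
    """Speedup = pipeline depth / (1 + average stalls per instruction)."""
    return depth / (1 + stalls_per_instruction)

# Assumed workload: on average every instruction incurs 0.25 stall cycles.
print(pipeline_speedup(depth=5, stalls_per_instruction=0.25))  # -> 4.0
```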

Review
Types of hazards:
- Structural hazards: resource conflicts
- Data hazards: a result is needed before it has been written back
- Control hazards: branches

Overview of Data Hazards
Data hazards occur when one instruction depends on a data value produced by a preceding instruction that is still in the pipeline.
Approaches to resolving data hazards:
- Schedule: the programmer explicitly avoids scheduling instructions that would create data hazards
- Stall: hardware includes control logic that freezes earlier stages until the preceding instruction has finished producing the data value
- Bypass: the hardware datapath allows values to be sent to an earlier stage before the preceding instruction has left the pipeline (see the sketch after this list)
- Speculate: guess that there is no problem; if the guess is wrong, kill the speculative instruction and restart
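
To make the bypass (forwarding) option concrete, below is a minimal Python sketch of the usual forwarding check in a 5-stage MIPS-style pipeline: an operand needed in EX can come from the EX/MEM or MEM/WB pipeline register. The `Instr` record and its field names are illustrative assumptions, not part of the lecture, and the load-use case (which still requires a stall) is ignored.

```python
from collections import namedtuple

# Hypothetical instruction record: destination and source registers.
Instr = namedtuple("Instr", ["name", "rd", "rs", "rt"])

def forwarding_sources(ex_mem, mem_wb, in_ex):
    """Report which source operands of the instruction in EX could be
    forwarded from the EX/MEM or MEM/WB pipeline registers.
    Simplified: ignores the load-use hazard, which needs a stall."""
    forwards = {}
    for src in ("rs", "rt"):
        reg = getattr(in_ex, src)
        if reg == 0:            # R0 is hard-wired to zero, never forwarded
            continue
        if ex_mem and ex_mem.rd == reg:
            forwards[src] = "EX/MEM"
        elif mem_wb and mem_wb.rd == reg:
            forwards[src] = "MEM/WB"
    return forwards

add_ = Instr("ADD R1,R2,R3", rd=1, rs=2, rt=3)
sub_ = Instr("SUB R2,R4,R1", rd=2, rs=4, rt=1)
print(forwarding_sources(ex_mem=add_, mem_wb=None, in_ex=sub_))
# -> {'rt': 'EX/MEM'}: R1 is forwarded from the ADD still in the pipeline
```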

Review
Consider the sequence: ADD R1,R2,R3 / SUB R2,R4,R1 / OR R1,R6,R3
- RAW (true dependency): SUB reads R1, which ADD writes. Handled by forwarding or a stall.
- WAR (anti-dependency): SUB writes R2, which the earlier ADD reads.
- WAW (output dependency): OR writes R1, which ADD also writes.
WAR and WAW are name dependences: they cannot cause hazards in the 5-stage pipeline; in general they are handled by enforcing the order or by register renaming.
Dependences through memory? ST R1, 100(R3) followed by LD R2, 20(R4) may refer to the same location.
The presence of a dependence is a property of the program; whether a hazard occurs, and the number of stalls, are properties of the pipeline organization.
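
The classification above can be expressed as a small Python sketch; the tuple encoding of an instruction as (destination register, list of source registers) is an assumption made only for illustration.

```python
def classify_dependences(first, second):
    """Classify register dependences of `second` on an earlier `first`.
    Each instruction is (dest_register, [source_registers])."""
    d1, srcs1 = first
    d2, srcs2 = second
    deps = []
    if d1 in srcs2:
        deps.append("RAW (true dependence)")
    if d2 in srcs1:
        deps.append("WAR (anti-dependence)")
    if d1 == d2:
        deps.append("WAW (output dependence)")
    return deps

add_ = (1, [2, 3])   # ADD R1, R2, R3
sub_ = (2, [4, 1])   # SUB R2, R4, R1
print(classify_dependences(add_, sub_))
# -> ['RAW (true dependence)', 'WAR (anti-dependence)']
```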

Control Hazards
Too many branches: a basic block in MIPS is typically 3-6 instructions, e.g.
    if (p1) { S1; S2; S3; }
A basic block is a straight-line code sequence with no branches.
Can we improve ILP across branches? Can we execute instructions without committing them? We must:
- Preserve exception behavior
- Preserve data flow

Exception Behavior
Preserving exception behavior: reordering must not cause new exceptions to be raised.
    ADD  R2, R3, R4
    BEQZ R2, L1
    LD   R1, 0(R2)
L1:
What is the problem with moving the LD before the BEQZ? (If the branch is taken because R2 = 0, the hoisted load accesses address 0 and may raise a memory exception that the original program never raises.)

Data Flow
Data flow: the flow of useful data values between registers.
    ADD  R1, R2, R3
    BEQZ R4, L
    SUB  R1, R5, R6
L:  OR   R7, R1, R8
Does the OR depend on the ADD or on the SUB? It depends on whether the branch is taken; any reordering must preserve this data flow, so the OR must see the R1 value of whichever instruction actually executes last.

Dealing with Branches
1. Always stall: one stall cycle on every branch.
    ADD  R1, R2, R3
    BEQZ R4, L
    SUB  R1, R5, R6
L:  OR   R7, R1, R8
2. Predict not-taken: keep fetching from the fall-through path. On a wrong guess (branch taken), the wrongly fetched instruction is turned into a no-op (after IF), so the 1-cycle stall is paid only on wrong guesses.

Dealing with Branches
3. Predict taken: not useful in the 5-stage MIPS. Why? (The branch target is not known any earlier than the branch outcome, so a stall is needed anyway.) But it may be useful in more complex architectures.
4. Delayed branch: find a useful instruction to put in the delay slot.

Dealing with Branches
Where does the delay-slot instruction come from?
From before the branch:
    ADD R1,R2,R3
    IF R2=0 then
    [delay slot]
becomes
    IF R2=0 then
    ADD R1,R2,R3
The perfect choice: the delay-slot instruction is always useful.
From the target:
    SUB R4,R5,R6
    ...
    ADD R1,R2,R3
    IF R1=0 then
    [delay slot]
becomes
    ADD R1,R2,R3
    IF R1=0 then
    SUB R4,R5,R6
Useful only if the branch is taken.
From the fall-through:
    ADD R1,R2,R3
    IF R1=0 then
    [delay slot]
    OR R7,R8,R9
    SUB R4,R5,R6
becomes
    ADD R1,R2,R3
    IF R1=0 then
    OR R7,R8,R9
    SUB R4,R5,R6
Useful only if the branch is not taken.
The delay slot can be arranged by the compiler. Static or dynamic?

Static Branch Prediction

Performance
Speedup = Pipeline depth (N) / (1 + Average number of stalls per instruction)
Avg # of stalls = Branch frequency x Branch penalty
Smaller branch frequency? Smaller branch penalty?
To reduce the penalty, resolve the branch sooner AND compute the branch address sooner: use a zero test and a dedicated adder in the ID stage.
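
A minimal Python sketch of the "Avg # of stalls = Branch frequency x Branch penalty" relation; the 15% branch frequency and the two penalties (branch resolved late vs. resolved in ID) are assumed numbers for illustration, not figures from the slide.

```python
# Assumed numbers for illustration: base CPI of 1, 15% branch frequency,
# and two possible branch penalties (resolved late vs. resolved in ID).
BASE_CPI = 1.0
BRANCH_FREQ = 0.15

for penalty in (3, 1):
    stalls = BRANCH_FREQ * penalty   # avg # of stalls per instruction
    print(f"penalty={penalty}: stalls/instr={stalls:.2f}, CPI={BASE_CPI + stalls:.2f}")
# penalty=3: stalls/instr=0.45, CPI=1.45
# penalty=1: stalls/instr=0.15, CPI=1.15
```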

Simple Implementation of MIPS

Performance
Example: in the MIPS R4000, the branch architecture works as follows:
Compare always-stall, predict-taken, and predict-not-taken for a code with:
Why didn't we consider delayed branches?

Performance
Example: the branch penalties for the MIPS R4000:
Which scheme is the best?
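
The R4000 penalty table itself did not survive the transcription, so the frequencies and penalties below are hypothetical placeholders; the sketch only shows how such a comparison is carried out: weight each branch class by its frequency and by the penalty the scheme charges it, then pick the scheme with the smallest extra CPI.

```python
# Hypothetical branch mix and per-scheme penalties (illustration only,
# not the actual MIPS R4000 numbers).
branch_mix = {"unconditional": 0.04, "cond_untaken": 0.06, "cond_taken": 0.10}

penalties = {                       # cycles charged per branch class
    "always_stall":      {"unconditional": 2, "cond_untaken": 3, "cond_taken": 3},
    "predict_taken":     {"unconditional": 2, "cond_untaken": 3, "cond_taken": 2},
    "predict_not_taken": {"unconditional": 2, "cond_untaken": 0, "cond_taken": 3},
}

for scheme, pen in penalties.items():
    extra_cpi = sum(freq * pen[cls] for cls, freq in branch_mix.items())
    print(f"{scheme}: extra CPI from branches = {extra_cpi:.2f}")
# The scheme with the smallest extra CPI is the best for this branch mix.
```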

Dynamic Branch Prediction
Branch-prediction buffer / branch history table: a small cache of branch outcomes, managed at run time (not at compile time).
1-bit branch prediction: state 1 = predict taken, state 0 = predict not taken; each outcome simply overwrites the bit (taken -> 1, not taken -> 0).
Example:
    for (i = 0; i < 3; i++) s = s + x[i];
    for (i = 0; i < 3; i++) y[i] = y[i] + s;
    Prediction:  T  T  T  T  N  T  T  T
    Actual:      T  T  T  N  T  T  T  N
    Stall?                Y  Y        Y
In steady state the 1-bit scheme miss-guesses twice per loop: on the final (not-taken) iteration and again on the first iteration of the next run.
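
A minimal Python sketch of the 1-bit predictor run on the outcome stream from the table above; the boolean encoding of outcomes and the initial "predict taken" state are the only assumptions.

```python
def simulate_1bit(outcomes, initial=True):
    """1-bit predictor: predict the last outcome seen for this branch."""
    state = initial             # True = predict taken
    mispredicts = 0
    for taken in outcomes:
        if state != taken:
            mispredicts += 1
        state = taken           # remember only the most recent outcome
    return mispredicts

# Outcome stream from the slide: T T T N T T T N
trace = [True, True, True, False, True, True, True, False]
print(simulate_1bit(trace))     # -> 3 mispredictions
```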

Dynamic Branch Prediction
2-bit branch prediction: a saturating counter per branch. States 11 and 10 predict taken, states 01 and 00 predict not taken; a taken branch moves the counter up, a not-taken branch moves it down, so the prediction flips only after two consecutive miss-guesses.
Example (same loops as before):
    for (i = 0; i < 3; i++) s = s + x[i];
    for (i = 0; i < 3; i++) y[i] = y[i] + s;
    Prediction:  T  T  T  T  T  T  T  T
    Actual:      T  T  T  N  T  T  T  N
    Stall?                Y           Y
With the 2-bit scheme there is only one miss-guess (stall) per run of a loop, on the final not-taken iteration.
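
The corresponding sketch for the 2-bit saturating counter on the same outcome stream; starting in the strongly-taken state (11) is an assumption consistent with the prediction row above.

```python
def simulate_2bit(outcomes, initial=3):
    """2-bit saturating counter: 3,2 predict taken; 1,0 predict not taken."""
    counter = initial           # 3 = strongly taken (state 11)
    mispredicts = 0
    for taken in outcomes:
        predict_taken = counter >= 2
        if predict_taken != taken:
            mispredicts += 1
        # saturating update toward the actual outcome
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return mispredicts

trace = [True, True, True, False, True, True, True, False]
print(simulate_2bit(trace))     # -> 2 mispredictions (vs. 3 for the 1-bit scheme)
```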

Dynamic Branch Prediction

Improving ILP (more to come)
- Loop unrolling
- Correlating branch prediction
- Tournament prediction
- Dynamic scheduling: scoreboard, Tomasulo's algorithm, reservation stations
- Hardware-based speculation
- Multiple issue