Control Hazards - branching causes problems since the pipeline can be filled with the wrong instructions.

The five pipeline stages and their actions:

IF  (Instruction Fetch)          - Read next instruction into CPU and increment PC by 4.
ID  (Instruction Decode)         - Determine opcode; read registers; compare registers (if branch); sign-extend immediate if needed; compute target address of branch; update PC if branch.
EX  (Execution / Effective addr) - Calculate using operands prepared in ID:
                                     memory ref: add base reg to offset to form effective address
                                     reg-reg ALU: ALU performs specified calculation
                                     reg-immediate ALU: ALU performs specified calculation
MEM (Memory access)              - load: read memory from effective address into pipeline register
                                     store: write reg value from ID stage to memory at effective address
WB  (Write-back)                 - ALU or load instruction: write result into register file.

[Pipeline datapath diagram: instruction memory and decoder, register file, ALU, and data memory, with the IF/ID, ID/EX, EX/MEM, and MEM/WB latches separating the five stages.]

Assembly-language code example -- two possible streams of instructions after the branch:

      BEQ R3, R8, ELSE
      ADD R8, R5, R6
      ...
ELSE: OR  R3, R3, R2

In what stage is the outcome of the comparison known? The ADD should not be executed if the branch is taken.
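To visualize how instructions overlap in the five stages, here is a small C sketch that prints the cycle-by-cycle stage occupancy of an ideal pipeline. It is purely illustrative: it models no stalls and no hazards, so it shows the overlap itself, not the branch problem.

    /* Minimal sketch (not from the notes): print which stage each of
     * three instructions occupies in each cycle of an ideal 5-stage
     * pipeline with no stalls or hazards modeled. */
    #include <stdio.h>

    int main(void) {
        const char *stages[] = { "IF", "ID", "EX", "MEM", "WB" };
        const char *instr[]  = { "BEQ R3, R8, ELSE", "ADD R8, R5, R6", "OR R3, R3, R2" };
        int n = 3, nstages = 5;

        printf("%-18s", "cycle:");
        for (int c = 1; c <= n + nstages - 1; c++) printf("%5d", c);
        printf("\n");

        for (int i = 0; i < n; i++) {          /* instruction i enters IF in cycle i+1 */
            printf("%-18s", instr[i]);
            for (int c = 1; c <= n + nstages - 1; c++) {
                int s = c - 1 - i;             /* stage index occupied in cycle c */
                printf("%5s", (s >= 0 && s < nstages) ? stages[s] : "");
            }
            printf("\n");
        }
        return 0;
    }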

Assume the branch is taken (cycle 1 would be the IF of whatever instruction precedes the branch):

                              Time (clock cycles)
Instruction                2     3     4     5     6     7     8
BEQ R3, R8, ELSE           IF    ID    EX    MEM   WB
ADD R8, R5, R6                   IF    ID    EX    MEM   WB      <- changed to NOOP
ELSE: OR R3, R3, R2                    IF    ID    EX    MEM   WB

If the branch is taken, then there is a branch penalty of 1 cycle, i.e., the instruction after a taken conditional branch (the ADD) must be thrown away by changing it to a NOOP (NO-OPeration) instruction. During clock cycle 3, the ID phase of the BEQ instruction compared register R3 with R8 and computed the target address of the branch (i.e., the location of the ELSE label). If the branch is not taken and we continue to fetch instructions sequentially, then there is no branch penalty.

How could we reduce the branch penalty in the pipeline above?

Delayed Branching (used in MIPS) - redefine the branch such that one (or two) instruction(s) after the branch will always be executed. The compiler (and assembler) automatically rearranges code to fill the delayed-branch slot(s) with instructions that can always be executed. Instructions in the delayed-branch slot(s) do not need to be flushed after branching. If no instruction can be found to fill the delayed-branch slot(s), then a NOOP instruction is inserted.

Without Delayed Branching             With Delayed Branching
      SUB R5, R2, R1                        BEQ R3, R8, ELSE
      BEQ R3, R8, ELSE                      SUB R5, R2, R1    # delay slot: always done
      ...                                   ...
ELSE: ADD R3, R3, R2                  ELSE: ADD R3, R3, R2

Due to data dependences, the instruction before the branch cannot always be moved into the branch-delay slot.
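To see what the 1-cycle taken-branch penalty costs overall, here is a quick back-of-the-envelope calculation in C. The branch and taken fractions are made-up illustrative numbers, not from these notes:

    /* Sketch: effective CPI when every taken branch costs one flushed
     * instruction (the 1-cycle branch penalty above). */
    #include <stdio.h>

    int main(void) {
        double base_cpi      = 1.0;   /* ideal pipelined CPI                       */
        double branch_frac   = 0.20;  /* assumed: 20% of instructions are branches */
        double taken_frac    = 0.60;  /* assumed: 60% of branches are taken        */
        double taken_penalty = 1.0;   /* cycles lost per taken branch              */

        double cpi = base_cpi + branch_frac * taken_frac * taken_penalty;
        printf("effective CPI = %.2f\n", cpi);   /* 1.12 with these numbers */
        return 0;
    }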

Other alternatives to consider are:

The Instruction at the Target of the Branch

Without Delayed Branching             With Delayed Branching
LOOP: ADD R7, R8, R9                        ADD R7, R8, R9
      SUB R3, R2, R1                  LOOP: SUB R3, R2, R1
      BEQ R3, R0, LOOP                      BEQ R3, R0, LOOP
      MUL R4, R5, R6                        ADD R7, R8, R9    # delay slot
                                            MUL R4, R5, R6

Can this technique always be used?

The Instruction From the Fall-Through of the Branch

Without Delayed Branching             With Delayed Branching
      SUB R3, R2, R1                        SUB R3, R2, R1
      BEQ R3, R0, ELSE                      BEQ R3, R0, ELSE
      ADD R8, R5, R6                        ADD R8, R5, R6    # delay slot
      ...                                   ...
ELSE: ADD R3, R3, R2                  ELSE: ADD R3, R3, R2

Can this technique always be used?

Note: In the simple 5-stage pipeline of MIPS, delayed branching worked well. However, more modern processors have much longer pipelines and issue multiple instructions per clock cycle, so the branch delay becomes longer and a single delay slot is insufficient. Therefore, delayed branching is not used much today; instead, more flexible dynamic branch prediction is used.

Branch Prediction - another approach to reducing the branch penalty. Main idea: predict whether the branch will be taken and fetch accordingly.

Fixed Techniques:
a) Predict never taken - continue to fetch sequentially. If the branch is not taken, then there are no wasted fetches.
b) Predict always taken - fetch from the branch target as soon as possible. (From analyzing program behavior, > 50% of branches are taken.)

Static Techniques: Predict by opcode - the compiler helps by having different opcodes based on the likely outcome of the branch. Consider the HLL construct and the assembly language (AL) generated for it:

HLL                          AL
while (x > 0) do                         CMP x, #0
    {loop body}                          BR_LE_PREDICT_NOT_TAKEN END_WHILE
end while                                {loop body}
                                         ...
                             END_WHILE:

Studies have found about a 75-82% successful prediction rate using this technique.

Dynamic Techniques: try to improve prediction by recording the history of each conditional branch. We need to store one or more history bits to reflect whether the most recent executions of the branch were taken or not.

Problem: How do we avoid always fetching the instruction after the branch?

BLE R3, 0, END_WHILE     IF   ID   EX   ...
                              ^ need the target of the branch to fetch the next instruction, but it's not calculated yet!

Plus, how do we know that we have just fetched a branch, since it has not been decoded yet?
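To make the dynamic idea concrete, here is a minimal C sketch of a 1-bit dynamic predictor: one history bit per table entry records whether the branch was taken the last time it executed. The table size and PC indexing are assumptions of the sketch:

    #include <stdbool.h>
    #include <stdint.h>

    #define TABLE_SIZE 1024               /* assumed size, power of two */

    static bool history[TABLE_SIZE];      /* 1 history bit per entry    */

    static unsigned index_of(uint32_t pc) {
        return (pc >> 2) & (TABLE_SIZE - 1);   /* word-aligned PCs      */
    }

    bool predict(uint32_t pc) {           /* true = predict taken       */
        return history[index_of(pc)];
    }

    void update(uint32_t pc, bool taken) {/* record the actual outcome  */
        history[index_of(pc)] = taken;
    }

Note that this table alone still cannot supply the branch target during IF, which is exactly the problem the Branch Target Buffer solves next.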

Solution: Branch Target Buffer (BTB) (I might use the terms Branch Prediction Buffer or Branch History Table) - a small, fully-associative cache to store information about the most recently executed branch instructions. With a fully-associative cache, you supply a tag/key value to search for across the whole cache in parallel. In a BTB, the branch instruction address acts as the tag, since that's what you know about an instruction at the beginning of the IF stage.

Each BTB entry contains:

Valid Bit | Branch Instr Addr (Tag Field) | Target Address | Prediction Bits

During the IF stage, the Branch Target Buffer is checked to see if the instruction being fetched is a branch instruction (if the addresses match). If the instruction is a branch instruction and it is in the Branch Target Buffer, then the target address and prediction can be supplied by the BTB by the end of the IF stage for the branch instruction.

If the branch instruction is in the Branch Target Buffer, will the target address supplied correspond to the correct instruction to be executed next?

What if the instruction is a branch instruction and it is not in the Branch Target Buffer?

Should the Branch Target Buffer contain entries for unconditional as well as conditional branch instructions?
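As an illustration of the lookup performed during IF, here is a minimal C sketch of a fully-associative BTB. The entry count, the linear search standing in for the hardware's parallel tag match, and the 2-bit prediction encoding are all assumptions of the sketch:

    #include <stdbool.h>
    #include <stdint.h>

    #define BTB_ENTRIES 16

    struct btb_entry {
        bool     valid;
        uint32_t branch_addr;   /* tag: address of the branch instruction   */
        uint32_t target_addr;   /* address to fetch from if predicted taken */
        uint8_t  prediction;    /* 2-bit saturating counter, 0..3           */
    };

    static struct btb_entry btb[BTB_ENTRIES];

    /* Hardware probes every entry in parallel; a loop models that here.
     * Returns true and fills *target if pc hits and the entry predicts taken. */
    bool btb_lookup(uint32_t pc, uint32_t *target) {
        for (int i = 0; i < BTB_ENTRIES; i++) {
            if (btb[i].valid && btb[i].branch_addr == pc) {
                if (btb[i].prediction >= 2) {   /* states 2,3 = predict taken */
                    *target = btb[i].target_addr;
                    return true;
                }
                return false;                   /* hit, but predicted not taken */
            }
        }
        return false;                           /* miss: fetch sequentially */
    }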

The table shows the advantage of using a Branch Target Buffer to improve the accuracy of branch prediction. It shows the impact of the past n branches on prediction accuracy (percent correct) for several types of instruction mix:

  n    Compiler    Business    Scientific
  0      64.1        64.4         70.4
  1      91.9        95.2         86.6
  2      93.3        96.5         90.8
  3      93.7        96.6         91.0
  4      94.5        96.8         91.8
  5      94.7        97.0         92.0

Notice: 1) the big jump from using the knowledge of just 1 past branch to predict the branch; 2) the big jump in going from using 1 to 2 past branches to predict the branch for scientific applications. What types of data do scientific applications spend most of their time processing? What would be true about the code for processing this type of data?

Typically, two prediction bits are used so that two wrong predictions in a row are needed to change the prediction -- see page 570.

[2-bit prediction state machine: states 11 and 10 predict taken; states 01 and 00 predict not taken. A taken branch moves one state toward 11; a not-taken branch moves one state toward 00, so the prediction flips only after two wrong predictions in a row.]

How does this help for nested loops?
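The two-bit scheme above is just a saturating counter. A minimal C sketch of the predict and update rules (the 0..3 encoding of the four states is an implementation choice of the sketch):

    #include <stdbool.h>

    bool predict_taken(unsigned counter) {
        return counter >= 2;                               /* states 10 and 11 */
    }

    unsigned update_counter(unsigned counter, bool taken) {
        if (taken)  return counter < 3 ? counter + 1 : 3;  /* move toward 11 */
        else        return counter > 0 ? counter - 1 : 0;  /* move toward 00 */
    }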

Consider the nested loops:

for (i = 1; i <= 100; i++) {
    for (j = 1; j <= 100; j++) {
        <do something>
    }
}

Branch penalty with 1 history bit (tracking the conditional exit branch of the inner loop, which is tested for j = 1 to 101 and taken only on the last test):

                   PREDICTION      ACTUAL      NEXT PREDICTION   PENALTY
i = 1:   j = 1     miss in BHT     not taken   not taken         0
         j = 2     not taken       not taken   not taken         0
         ...
         j = 100   not taken       not taken   not taken         0
         j = 101   not taken       taken       taken             1
i = 2:   j = 1     taken           not taken   not taken         1
         j = 2     not taken       not taken   not taken         0
         ...
         j = 100   not taken       not taken   not taken         0
         j = 101   not taken       taken       taken             1
(and so on for i = 3 to 100)

Penalties associated with conditional branches = (1 + 2 * 99) + 1 = 200
                                                  inner loop     outer loop

(The outer-loop branch likewise starts with a miss in the BHT and mispredicts only on its final, taken test.)

Consider the same nested loops:

for (i = 1; i <= 100; i++) {
    for (j = 1; j <= 100; j++) {
        <do something>
    }
}

Branch penalty with 2 history bits (same inner-loop exit branch; states 00/01 predict not taken, 10/11 predict taken):

                   PREDICTION        ACTUAL      NEXT PREDICTION   PENALTY
i = 1:   j = 1     miss in BHT       not taken   not taken (00)    0
         ...
         j = 100   not taken (00)    not taken   not taken (00)    0
         j = 101   not taken (00)    taken       not taken (01)    1
i = 2:   j = 1     not taken (01)    not taken   not taken (00)    0
         ...
         j = 100   not taken (00)    not taken   not taken (00)    0
         j = 101   not taken (00)    taken       not taken (01)    1
(and so on for i = 3 to 100)

Penalties associated with conditional branches = (1 * 100) + 1 = 101
                                                  inner loop     outer loop

A single taken outcome at j = 101 moves the counter only from 00 to 01, so the prediction never flips to taken, and the j = 1 test of the next pass is no longer mispredicted.
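The two penalty counts can be checked with a quick simulation. A minimal C sketch, assuming (as in the tables above) a while-style inner loop whose exit branch is tested 101 times per outer iteration and taken only on the last test:

    /* Count mispredictions of the inner loop's exit branch, comparing
     * 1 history bit against a 2-bit saturating counter.
     * Expected output: 199 and 100, matching the inner-loop terms above. */
    #include <stdbool.h>
    #include <stdio.h>

    int main(void) {
        int miss1 = 0, miss2 = 0;
        bool hist = false;    /* 1-bit predictor: last outcome (init: not taken) */
        unsigned ctr = 0;     /* 2-bit counter: 0 = strongly not taken           */

        for (int i = 1; i <= 100; i++) {
            for (int j = 1; j <= 101; j++) {
                bool taken = (j == 101);          /* exit branch taken on last test */

                if (hist != taken) miss1++;       /* 1-bit: predict last outcome    */
                hist = taken;

                if ((ctr >= 2) != taken) miss2++; /* 2-bit: predict taken if 2 or 3 */
                ctr = taken ? (ctr < 3 ? ctr + 1 : 3) : (ctr > 0 ? ctr - 1 : 0);
            }
        }
        printf("1 history bit: %d mispredictions\n", miss1);   /* 199 = 1 + 2*99 */
        printf("2-bit counter: %d mispredictions\n", miss2);   /* 100 = 1*100    */
        return 0;
    }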

Correlating Predictor - instead of just using local information about a branch (its past branch behavior) to make a prediction, more global information about the behavior of some number of recently executed branches can be used to improve the branch prediction.

Tournament Predictor - multiple predictions for each branch are recorded, with a selection mechanism that chooses which predictor to use. A typical tournament predictor might contain two predictions for each branch:
1) one based on local information
2) one based on global branch behavior
By tracking the performance of both predictors, the selector can choose the predictor that has been the most accurate, as sketched below.
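A minimal C sketch of that selection mechanism. The table sizes, the indexing, and the use of simple 1-bit component predictors are assumptions made to keep the sketch short; real tournament predictors use stronger components:

    #include <stdbool.h>
    #include <stdint.h>

    #define N 1024

    static bool     local_hist[N];    /* local: last outcome of this branch          */
    static bool     global_pred[N];   /* global: outcome last seen after this history */
    static uint32_t ghr;              /* global history register                     */
    static uint8_t  chooser[N];       /* 2-bit: 0,1 favor local; 2,3 favor global    */

    static unsigned pc_idx(uint32_t pc) { return (pc >> 2) % N; }
    static unsigned gh_idx(void)        { return ghr % N; }

    bool predict(uint32_t pc) {
        bool lp = local_hist[pc_idx(pc)];
        bool gp = global_pred[gh_idx()];
        return (chooser[pc_idx(pc)] >= 2) ? gp : lp;
    }

    void update(uint32_t pc, bool taken) {
        unsigned pi = pc_idx(pc), gi = gh_idx();
        bool local_ok  = (local_hist[pi]  == taken);
        bool global_ok = (global_pred[gi] == taken);

        /* Train the chooser toward whichever component was right
         * when exactly one of the two was right. */
        if (global_ok && !local_ok && chooser[pi] < 3) chooser[pi]++;
        if (local_ok && !global_ok && chooser[pi] > 0) chooser[pi]--;

        /* Update both components and the global history. */
        local_hist[pi]  = taken;
        global_pred[gi] = taken;
        ghr = (ghr << 1) | (taken ? 1u : 0u);
    }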