ECE 571 Advanced Microprocessor-Based Design Lecture 7

Vince Weaver
http://www.eece.maine.edu/~vweaver
vincent.weaver@maine.edu
9 February 2016

Announcements
- HW2 grades are ready.
- HW3 is posted. Be careful with the backslashes when cutting and pasting.

The Branch Problem
- With a pipelined processor, you may have to stall until the branch outcome is known before you can correctly fetch the next instruction.
- Conditional branches are common: on average every 5th instruction [cite?].
- What can you do to speed things up?

Branch Prediction
- One solution is speculative execution: guess which way the branch goes.
- Good branch predictors have a 95% or higher hit rate.
- Downsides? If the guess is wrong, the in-flight instructions are thrown out and have to be replayed. Speculation also wastes power.

Branch Predictor Implementations

Static Prediction of Conditional Branches
- Backward taken.
- Forward not taken.
- Can be used as a fallback when nothing else is available.

Common Access Patterns: For Loop

    for (i = 0; i < 100; i++) SOMETHING;

        mov  r1, #0
    label:
        SOMETHING
        add  r1, r1, #1
        cmp  r1, #100
        bne  label

For Loop Branch Behavior
- No branch predictor: 100 stalls.
- Another way to avoid the problem: a branch delay slot (as on MIPS).
- Static BTFN (backward taken, forward not taken) prediction: 99 times predicted right, 1 time wrong (at loop exit), for a 99% correct prediction rate.
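The 99-out-of-100 figure can be checked with a small simulation. The following is only a sketch: the function names and the two addresses are made up for illustration, and it models nothing but the loop-closing `bne label`.

```c
#include <stdbool.h>
#include <stdint.h>

/* BTFN: a branch whose target is at a lower address than the branch itself
 * is assumed to close a loop and is predicted taken. */
bool btfn_predict_taken(uint32_t branch_pc, uint32_t target_pc)
{
    return target_pc < branch_pc;
}

/* Count correct BTFN predictions for the loop-closing branch of a loop that
 * runs `iterations` times. Both addresses are hypothetical. */
unsigned btfn_correct_for_loop(unsigned iterations)
{
    const uint32_t label_pc  = 0x1000;  /* hypothetical address of "label" */
    const uint32_t branch_pc = 0x1010;  /* hypothetical address of "bne"   */
    unsigned correct = 0;

    for (unsigned i = 1; i <= iterations; i++) {
        bool actually_taken = (i != iterations);  /* falls through on exit */
        if (btfn_predict_taken(branch_pc, label_pc) == actually_taken)
            correct++;
    }
    return correct;
}
```

Calling `btfn_correct_for_loop(100)` counts the 99 correct predictions; the single miss is the final fall-through.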

Common Access Patterns: While Loop

    x = 0;
    while (x < 100) { SOMETHING; x++; }

        mov  r1, #0
    label:
        cmp  r1, #100
        bge  done
        SOMETHING
        add  r1, r1, #1
        b    label
    done:

While Loop Branch Behavior
- No branch predictor: 100 stalls (unless there is a branch delay slot).
- Static BTFN prediction: 99 times predicted right, 1 time wrong (at loop exit), for a 99% correct prediction rate.
- An optimizing compiler may translate this to a for loop (why?).

Common Access Patterns: Do/While Loop

    x = 0;
    do { SOMETHING; x++; } while (x < 100);

        mov  r1, #0
    label:
        SOMETHING
        add  r1, r1, #1
        cmp  r1, #100
        blt  label
    done:

Do/While Loop Branch Behavior
- No branch predictor: 100 stalls (unless there is a branch delay slot).
- Static BTFN prediction: 99 times predicted right, 1 time wrong (at loop exit), for a 99% correct prediction rate.

Notes
- An optimizing compiler will turn all of the loops above into the same for-loop code (tried it).
- Why? Because loop unrolling becomes possible?

Common Access Patterns: If/Then

    if (x) { FOO } else { BAR }

        cmp  r1, #0
        beq  else
    then:
        FOO
        b    done
    else:
        BAR
    done:

ARM, using conditional execution:

        cmp  r1, #0
        FOOne
        BAReq

Common Access Patterns: If/Then Behavior
- If x is true, static prediction is right 100% of the time; if x is false, 0%.
- Assuming x is completely random, the average miss rate is 50%.
- ARM can use conditional execution/predication to avoid the branch in simple scenarios.

How Can We Improve Things?

Branch Prediction Hints
- Give the compiler (or assembler) hints: likely() (maps to __builtin_expect()) and unlikely().
- On some processors (P4), the hint feeds static prediction.
- On others, the compiler just moves unlikely blocks out of the way for better L1 I-cache performance.
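As a sketch of how such hints are usually written, the macros below follow the Linux kernel style. `__builtin_expect` is a GCC/Clang builtin; the fallback branch and the `sum_array` function are hypothetical, added only to make the example self-contained.

```c
#include <stddef.h>

/* Hint macros in the Linux kernel style. Compilers without
 * __builtin_expect get a plain pass-through. */
#if defined(__GNUC__) || defined(__clang__)
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define likely(x)   (x)
#define unlikely(x) (x)
#endif

/* Hypothetical example: sum an array, treating a NULL pointer as rare. */
long sum_array(const int *values, size_t count)
{
    if (unlikely(values == NULL))
        return -1;  /* rare error path; the compiler may move it out of line */

    long total = 0;
    for (size_t i = 0; i < count; i++)
        total += values[i];
    return total;
}
```

Whether the hint changes the generated code at all depends on the compiler, optimization level and target.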

Dynamic Branch Prediction

One-bit Branch History Table
- A table, likely indexed by the lower bits of the instruction address. (Why low bits and not high bits?)
- Can have more advanced indexing to avoid aliasing.
- No tag bits, unlike caches: aliasing does not affect correct program execution.
- One bit per entry, indicating what the branch did last time.
- Update when a branch miss happens.

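[Figure: a branch history table with entries 0 through f. The low bits of the address of the branch at 0x1000 0003 (bne PC+45) select entry 3, which reads "taken".]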

Branch History Table Behavior
- Two misses for each time through the loop: wrong at the exit of the loop, then wrong again when the loop restarts (so actually worse than static on loops).
- If/Then: potentially better than static if there are long runs of true/false, but can be worse if completely random.
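The two-misses-per-pass behavior can be reproduced with a short simulation. This is a sketch with a hypothetical function name, a single table entry (no aliasing), and an assumed initial state of "taken".

```c
#include <stdbool.h>

/* Run a loop-closing branch through a one-bit predictor. The branch is taken
 * (iterations - 1) times and then falls through; the whole loop is entered
 * `passes` times. Returns the number of mispredictions. */
unsigned one_bit_loop_misses(unsigned iterations, unsigned passes)
{
    bool last_outcome = true;  /* assumed initial state: predict taken */
    unsigned misses = 0;

    for (unsigned p = 0; p < passes; p++) {
        for (unsigned i = 1; i <= iterations; i++) {
            bool taken = (i != iterations);
            if (last_outcome != taken) {
                misses++;
                last_outcome = taken;  /* the bit only changes on a miss */
            }
        }
    }
    return misses;
}
```

With 100 iterations, the first pass misses once (at exit) and every later pass misses twice (on re-entry and at exit), so ten passes give 19 misses.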

Aliasing
- Is it bad? Good?
- Does the O/S need to save the predictor state on a context switch?
- Do you need to save it when entering a low-power state?

Two-bit Saturating Counter
- Use a saturating 2-bit counter.
- If the counter is 3 or 2, predict taken; if 1 or 0, predict not taken.
- It takes two misses or hits to switch from one extreme to the other, letting loops take only one mispredict.
- Needs to be updated on every branch, not just on a mispredict.

[Figure: state diagram of the 2-bit saturating counter]

    State  Meaning        On Taken  On NT
    00     Strong NT      01        00
    01     Weak NT        10        00
    10     Weak Taken     11        01
    11     Strong Taken   11        10
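The counter is small enough to write out directly. The sketch below uses hypothetical function names and an assumed initial state of strong taken; it replays the same loop scenario to show that each pass now costs only one mispredict.

```c
#include <stdbool.h>
#include <stdint.h>

/* States: 0 strong not-taken, 1 weak not-taken, 2 weak taken, 3 strong taken */
bool counter_predict_taken(uint8_t counter)
{
    return counter >= 2;
}

/* Move one step toward the observed outcome, saturating at 0 and 3. */
uint8_t counter_update(uint8_t counter, bool taken)
{
    if (taken)
        return counter < 3 ? (uint8_t)(counter + 1) : 3;
    return counter > 0 ? (uint8_t)(counter - 1) : 0;
}

/* Mispredictions for a loop-closing branch: taken (iterations - 1) times,
 * then not taken, repeated for `passes` entries into the loop. */
unsigned two_bit_loop_misses(unsigned iterations, unsigned passes)
{
    uint8_t counter = 3;  /* assumed initial state: strong taken */
    unsigned misses = 0;

    for (unsigned p = 0; p < passes; p++) {
        for (unsigned i = 1; i <= iterations; i++) {
            bool taken = (i != iterations);
            if (counter_predict_taken(counter) != taken)
                misses++;
            counter = counter_update(counter, taken);  /* every branch */
        }
    }
    return misses;
}
```

At loop exit the counter only drops from strong taken to weak taken, so the first branch of the next pass is still predicted taken.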

Local vs Global History
- Can use branch history as an index into the tables.
- Use a shift register to hold the history.
- Global: the history covers all branches.
- Local: store branch history on a branch-by-branch basis.
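A minimal sketch of the two kinds of history registers follows. The 4-bit history length, the 16-entry local table, the newest-outcome-in-bit-0 convention, and all names are assumptions chosen for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define HISTORY_BITS  4u   /* assumed history length */
#define LOCAL_ENTRIES 16u  /* assumed number of per-branch history slots */

/* Shift the newest outcome into bit 0 (1 = taken) and drop the oldest. */
uint32_t history_shift(uint32_t history, bool taken)
{
    uint32_t mask = (1u << HISTORY_BITS) - 1u;
    return ((history << 1) | (taken ? 1u : 0u)) & mask;
}

struct histories {
    uint32_t global_history;                /* shared by all branches */
    uint32_t local_history[LOCAL_ENTRIES];  /* one slot per branch (by PC) */
};

/* Record one branch outcome in both the global and the local history. */
void histories_record(struct histories *h, uint32_t pc, bool taken)
{
    h->global_history = history_shift(h->global_history, taken);
    uint32_t *local = &h->local_history[pc % LOCAL_ENTRIES];
    *local = history_shift(*local, taken);
}
```

After recording taken at one branch, not taken at a second, and taken again at the first, the global history reads T N T while the first branch's local history reads T T.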

Global Predictor
[Figure: the global history register (N N T N) indexes a table of 2-bit counters; the selected counter reads 1 1, predicting taken.]

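Local Predictor
[Figure: the address of the branch at 0x8000 0001 (bne PC+45) selects that branch's own history (N N T N), which indexes a table of 2-bit counters; the selected counter reads 0 0, predicting not taken.]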

Correlating / Two-Level Predictors
- Take history into account.
- Break the prediction for a branch out into multiple locations based on history.
[Figure: the branch at 0x1000 0002 (bge PC+45) and the global history (T N) together select one of several 2-bit saturating counters; the selected counter reads 1 1, predicting taken.]

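gshare
- XORs the global history with the address bits to choose which line to use.
- Gives the benefits of a 2-level predictor without the extra circuitry.
[Figure: the low address bits of the branch at 0x1000 000e (bgt PC+45), 1110, are XORed with the global history T T T T (1111) to select a 2-bit counter; the selected entry predicts not taken.]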
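The index computation can be sketched in a few lines. The 4-bit table index matches the figure's example, but the table size, struct layout, and function names are assumptions; a real design would typically also drop the alignment bits of the PC first.

```c
#include <stdbool.h>
#include <stdint.h>

#define GSHARE_INDEX_BITS 4u  /* assumed: 16-entry table, as in the figure */
#define GSHARE_ENTRIES    (1u << GSHARE_INDEX_BITS)

struct gshare {
    uint8_t  counters[GSHARE_ENTRIES];  /* 2-bit saturating counters */
    uint32_t history;                   /* global history, newest in bit 0 */
};

/* XOR the low PC bits with the global history to pick a counter. */
unsigned gshare_index(uint32_t pc, uint32_t history)
{
    return (pc ^ history) & (GSHARE_ENTRIES - 1u);
}

bool gshare_predict(const struct gshare *g, uint32_t pc)
{
    return g->counters[gshare_index(pc, g->history)] >= 2;
}

/* Train the selected counter, then shift the outcome into the history. */
void gshare_update(struct gshare *g, uint32_t pc, bool taken)
{
    uint8_t *c = &g->counters[gshare_index(pc, g->history)];
    if (taken && *c < 3)
        (*c)++;
    else if (!taken && *c > 0)
        (*c)--;
    g->history = ((g->history << 1) | (taken ? 1u : 0u)) & (GSHARE_ENTRIES - 1u);
}
```

For the figure's example, `gshare_index(0x1000000e, 0xF)` computes 1110 XOR 1111 = 0001, so entry 1 is used.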

Tournament Predictors
- Which to use? Local or global? Have both.
- How to know which one to use? Predict it!
- A 2-bit counter remembers which was best.
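One way to sketch the chooser is shown below. The convention that high counter values favor the global predictor, the update-only-on-disagreement rule, and the function names are assumptions for illustration, not a description of any particular processor.

```c
#include <stdbool.h>
#include <stdint.h>

/* 2-bit chooser: 0 or 1 trusts the local predictor, 2 or 3 the global one. */
bool tournament_predict(uint8_t chooser, bool local_pred, bool global_pred)
{
    return chooser >= 2 ? global_pred : local_pred;
}

/* Move the chooser toward whichever component was right, but only when the
 * two components disagree about being right. Saturates at 0 and 3. */
uint8_t tournament_update(uint8_t chooser, bool local_pred, bool global_pred,
                          bool taken)
{
    bool local_ok  = (local_pred == taken);
    bool global_ok = (global_pred == taken);

    if (global_ok && !local_ok && chooser < 3)
        return (uint8_t)(chooser + 1);
    if (local_ok && !global_ok && chooser > 0)
        return (uint8_t)(chooser - 1);
    return chooser;
}
```

The component predictors themselves are trained separately on every branch; the chooser only records which of them has been more trustworthy.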

Perceptron
- There are actually branch prediction competitions.
- The winner the past few times has been a perceptron predictor.
- Neural networks.
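To give a rough idea of what a perceptron predictor computes, here is a single-perceptron sketch. The history length, the training threshold `THETA`, and all names are assumptions; a real predictor keeps a table of perceptrons selected by PC and clamps the weights to a few bits, both of which are omitted here.

```c
#include <stdbool.h>
#include <stdlib.h>

#define HISTORY_LEN 8   /* assumed global history length */
#define THETA       29  /* assumed training threshold */

struct perceptron {
    int weights[HISTORY_LEN + 1];  /* weights[0] is the bias weight */
};

/* history[i] is +1 for a taken branch and -1 for a not-taken branch.
 * A result >= 0 means "predict taken". */
int perceptron_output(const struct perceptron *p, const int history[HISTORY_LEN])
{
    int y = p->weights[0];
    for (int i = 0; i < HISTORY_LEN; i++)
        y += p->weights[i + 1] * history[i];
    return y;
}

/* Train on a mispredict, or when the output was not confident enough. */
void perceptron_train(struct perceptron *p, const int history[HISTORY_LEN],
                      bool taken)
{
    int y = perceptron_output(p, history);
    int t = taken ? 1 : -1;
    bool predicted_taken = (y >= 0);

    if (predicted_taken != taken || abs(y) <= THETA) {
        p->weights[0] += t;
        for (int i = 0; i < HISTORY_LEN; i++)
            p->weights[i + 1] += t * history[i];
    }
}
```

Each weight ends up measuring how strongly the branch outcome correlates with one earlier branch, which is why long histories can be used without an exponentially large table.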

Comparing Predictors
- Branch miss rate alone is not enough.
- Usually the total number of bits needed is factored in.
- May also need to keep track of the logic needed, if it is complex.

Branch Target Buffer
- Predicts the actual destination address of a branch.
- Indexed by the whole PC.
- May be looked up before you even know the instruction is a branch.
- Only need to store predicted-taken branches. (Why? Because not-taken branches fall through as per normal.)
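A direct-mapped BTB can be sketched as follows. The 256-entry size, the use of the full PC as a tag, and the names are assumptions; the point is only that a lookup needs nothing but the fetch PC.

```c
#include <stdbool.h>
#include <stdint.h>

#define BTB_ENTRIES 256u  /* assumed table size */

struct btb_entry {
    bool     valid;
    uint32_t branch_pc;  /* full PC kept as a tag */
    uint32_t target;     /* predicted destination */
};

struct btb {
    struct btb_entry entries[BTB_ENTRIES];
};

/* Look up a fetch PC. A hit means "this is a predicted-taken branch" and
 * stores its target; a miss means fetch simply falls through. */
bool btb_lookup(const struct btb *b, uint32_t pc, uint32_t *target)
{
    const struct btb_entry *e = &b->entries[pc % BTB_ENTRIES];
    if (!e->valid || e->branch_pc != pc)
        return false;
    *target = e->target;
    return true;
}

/* Record a taken branch, replacing whatever shared its slot. */
void btb_insert(struct btb *b, uint32_t pc, uint32_t target)
{
    struct btb_entry *e = &b->entries[pc % BTB_ENTRIES];
    e->valid = true;
    e->branch_pc = pc;
    e->target = target;
}
```

Unlike the direction predictor's history table, this sketch checks a tag, because jumping to the wrong branch's target on an ordinary instruction would cost a needless flush.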

Return Address Stack
- Function calls can confuse the BTB: multiple locations branch to the same spot, so which return address should be predicted?
- Keep a stack of return addresses for function calls.
- Playing games with size optimization and fallthrough/tail-call optimization can confuse it.
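A return address stack can be sketched as a small circular buffer. The depth of 8, the overwrite-oldest overflow policy, and the names are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define RAS_DEPTH 8u  /* assumed depth */

struct ras {
    uint32_t entries[RAS_DEPTH];
    unsigned top;    /* slot the next push will write */
    unsigned count;  /* number of valid entries, at most RAS_DEPTH */
};

/* On a call: remember where the matching return should go. When the stack
 * is full, the oldest entry is silently overwritten. */
void ras_push(struct ras *r, uint32_t return_addr)
{
    r->entries[r->top] = return_addr;
    r->top = (r->top + 1u) % RAS_DEPTH;
    if (r->count < RAS_DEPTH)
        r->count++;
}

/* On a return: predict the most recently pushed address.
 * Returns false when there is nothing left to predict from. */
bool ras_pop(struct ras *r, uint32_t *predicted)
{
    if (r->count == 0)
        return false;
    r->top = (r->top + RAS_DEPTH - 1u) % RAS_DEPTH;
    r->count--;
    *predicted = r->entries[r->top];
    return true;
}
```

Call chains deeper than the stack lose their oldest entries, so the outermost returns are mispredicted; code that returns without a matching call misaligns the stack in a similar way.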

Adjusting the Predictor on the Fly
- Some processors let you configure the predictor at runtime. The MIPS R12000 let you; ARM possibly does.
- Why is this useful? In theory, if you have a known workload you can pick the predictor that works best.
- Also, for realtime work you want something deterministic, like static prediction.
- Also good for simulator validation.

Cortex A9 Branch Predictor
From the manual: a two-level prediction mechanism, comprising:
- a two-way BTAC of 512 entries, organized as two-way x 256 entries
- a Global History Buffer (GHB) with 4096 2-bit predictors
- a return stack with eight 32-bit entries.
It is also capable of predicting state changes from ARM to Thumb, and from Thumb to ARM.

Example
Code in the perf_event validation tests for generic events:
http://web.eece.maine.edu/~vweaver/projects/perf_events/validation/

Example Results

    Part 1
    Testing a loop with 1500000 branches (100 times):
    On a simple loop like this, miss rate should be very small.
    Adjusting domain to 0,0,0 for ARM
    Average number of branch misses: 685

    Part 2
    Adjusting domain to 0,0,0 for ARM
    Testing a function that branches based on a random number
    The loop has 7710798 branches.
    500000 are random branches; 250699 of those were taken
    Adjusting domain to 0,0,0 for ARM
    Out of 7710798 branches, 291081 were mispredicted
    Assuming a good random number generator and no freaky luck
    The mispredicts should be roughly between 125000 and 375000
    Testing branch-misses generalized event... PASSED

Value Prediction
- Can we use this mechanism to help with other performance issues? What about caches?
- Can we predict values loaded from memory? This is Load Value Prediction.
- You can, sometimes with reasonable success, but apparently it is not worth the trouble, as no vendors have ever implemented it.