Hardware Speculation Support


Hardware Speculation Support

Conditional instructions
- The most common form is the conditional move. The branch sequence
      BNEZ  R1, L        ; if
      MOV   R2, R3       ; then
    L:                   ; else
  can be replaced by the single instruction
      CMOVZ R2, R3, R1
- Other variants: conditional loads and stores; nullification instructions in HP-PA, which add/subtract two operands, store the sum, and skip the following instruction if the sum is 0 (page C-20 of the text)
- Alpha, MIPS, SPARC, PowerPC, and the P6 all have simple conditional moves
- A conditional move can sometimes eliminate a branch when there is a single instruction in the "then" part of an "if" statement; in these cases it converts a control dependence into a data dependence
- A win, since in global scheduling control dependence is the key limiting complexity
Chapter 4 page 66
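The control-to-data-dependence transformation can be sketched in a minimal way (Python standing in for machine code; the function names are mine):

```python
def cmovz(dest, src, cond):
    """Conditional move: return src if cond == 0, else keep dest.
    Mirrors CMOVZ R2, R3, R1 -- a data dependence on cond replaces a branch."""
    return src if cond == 0 else dest

# Branching version of: if (r1 == 0) r2 = r3;
def with_branch(r1, r2, r3):
    if r1 != 0:          # BNEZ R1, L  -- control dependence
        return r2
    return r3            # MOV R2, R3

# Branch-free version computes the same result via a data dependence on r1.
assert with_branch(0, 5, 9) == cmovz(5, 9, 0) == 9   # condition holds: move
assert with_branch(7, 5, 9) == cmovz(5, 9, 7) == 5   # condition fails: keep
```

The point is that the scheduler no longer has to reason about which path executes: the select is just another arithmetic operation.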

Conditional Instruction Limitations

- Exception semantics: if the condition fails, the instruction must have no effect; hence an exception raised by a conditional instruction must be handled according to the result of the condition evaluation
- The type of exception is another factor: a memory protection violation vs. a page fault
- Not useful for more complicated control flow, since that would require multiple conditions to be specified in one conditional instruction
- Wasted resources: speculated instructions still take time to execute; this tends to work well in the superscalar case (like our simple 2-way model), where the resource would otherwise be wasted anyway
- Cycle-time and CPI issues: conditional instructions are more complex, so the danger is that they consume more cycles or lengthen the cycle time; since they are mainly useful for short control flows, their use may not be the common case, and we don't want to slow down the real common case to support the uncommon case
Chapter 4 page 67

Compiler Speculation with HW Support

Ideal view
- of course, do conditional things in advance of the branch
- nuke them if the branch goes the wrong way
- also control exception behavior if the branch goes the wrong way

Limits
- speculated values cannot clobber any real results
- exceptions cannot cause any destructive activity

HW support
- poison bits: set on a register when an exception occurs; a regular instruction that tries to use the register faults
- the HW (and OS) ignore the exception until the instruction commits
- speculative instructions and their results must be tagged as speculative until the condition is resolved; on a misprediction, the speculative results and exceptions can be discarded
- boosting: provide separate shadow resources for boosted instruction results; if the condition resolves in favor of the boosted path, those results are committed to the real registers (note this won't work for memory)
Chapter 4 page 68
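The poison-bit mechanism above can be sketched as a toy simulation, assuming a simple register model (all class and function names here are hypothetical):

```python
# Minimal sketch of poison-bit exception deferral for speculative loads.
class Reg:
    def __init__(self, value=0):
        self.value = value
        self.poison = False   # set when a speculative instruction faults

def spec_load(reg, addr, memory):
    """Speculative load: on a fault, poison the register instead of trapping."""
    if addr not in memory:
        reg.poison = True     # defer the exception
    else:
        reg.value = memory[addr]
        reg.poison = False

def use(reg):
    """A regular (non-speculative) instruction: using a poisoned register faults."""
    if reg.poison:
        raise RuntimeError("deferred exception: poisoned register used")
    return reg.value

mem = {0x100: 42}
r1 = Reg()
spec_load(r1, 0x100, mem)    # valid speculative load
assert use(r1) == 42
spec_load(r1, 0x999, mem)    # would fault; poison the register instead
assert r1.poison
# If the branch resolves the other way, r1 is simply discarded and
# the deferred exception is never seen.
```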

Aggression Levels in Speculation

Consider the if-then-else case: if condition-block then then-block else else-block
- Traditional conservative method: execute the blocks in order; having the compiler fill branch delay slots can help
- Using prediction: start the predicted path while evaluating the condition; either continue or nuke based on the condition result
- Aggressive: start all 3 blocks; when the condition is known, nuke the unselected path; this implies lots of resources, but the idea may still be used, just dampened by real resource limitations
Chapter 4 page 69

Hardware-Based Speculation

A combination of 3 key ideas; the effect is a data-flow-with-speculation execution model:
- dynamic branch prediction
- speculation: allow the speculated blocks to start before condition resolution
- dynamic scheduling (Tomasulo-style approach)

Advantages
- more instruction-ordering flexibility: things tend to run as soon as they can
- dynamic memory disambiguation is possible where a compiler would have to be more conservative
- dynamic branch prediction works considerably better than the static variant
- able to maintain a precise exception model; it isn't free, but it can be done
- a HW-based method, so it doesn't require compensation or bookkeeping code
- relieves the compiler of difficult machine-specific tuning and optimization duties

Approach
- allow out-of-order issue but require in-order commit (the point where an instruction is no longer speculative)
- prevent speculative instructions from performing destructive state changes
- this involves adding a reorder buffer to hold completed but not-yet-committed instructions
- the reorder buffer contains virtual registers (similar to reservation stations) and becomes a bypass source
Chapter 4 page 70

The Speculative DLX

[Figure: speculative DLX datapath -- instruction unit feeding the FP op queue and reorder buffer; reservation stations, FP registers, FP adders and FP multipliers; load results and data to memory flow through the reorder buffer]

Note: it looks a lot like the Tomasulo DLX, but the reorder buffer takes on most of the work of the CDB (Common Data Bus).
Chapter 4 page 71

Steps in Speculative Execution

Issue (or dispatch)
- get the instruction from the queue
- issue if a reservation station AND a reorder buffer slot are available
- send the operands if they are in the registers or the reorder buffer; otherwise stall

Execute
- the reservation station waits, grabbing results off the CDB if necessary
- when all operands are there, execution happens

Write Result
- the result is posted to the reorder buffer via the CDB
- waiting reservation stations can grab it as well

Commit (or graduate)
- when the instruction reaches the head of the reorder buffer, its value is posted to the registers or memory
- on an incorrect branch, the incorrect successor entries in the reorder buffer (although some of them may have completed execution) will be nuked; this nuke may flush the entire buffer and FP op queue and restart instruction fetch at the appropriate spot
- if there is an exception, it is taken at this step
Chapter 4 page 72
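The in-order-commit and misprediction-flush behavior above can be sketched as a toy reorder buffer (a deliberately simplified structure; a real ROB also tracks reservation-station state, destination registers, and values):

```python
from collections import deque

# Toy reorder buffer illustrating in-order commit and misprediction flush.
class ROB:
    def __init__(self):
        self.entries = deque()          # head = oldest (next to commit)

    def issue(self, name):
        """Issue: allocate a ROB slot in program order."""
        self.entries.append({"name": name, "done": False, "mispredict": False})
        return self.entries[-1]

    def commit(self):
        """Commit from the head only -- in-order commit. Returns committed names."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            head = self.entries.popleft()
            if head["mispredict"]:
                self.entries.clear()    # nuke all younger, speculative entries
                break
            committed.append(head["name"])
        return committed

rob = ROB()
a = rob.issue("ADD")
b = rob.issue("BEQ")                 # branch that will be mispredicted
c = rob.issue("MUL")                 # issued speculatively past the branch
a["done"] = b["done"] = c["done"] = True
b["mispredict"] = True
assert rob.commit() == ["ADD"]       # ADD commits; the bad branch flushes MUL
assert len(rob.entries) == 0
```

Note that MUL had already "completed execution" but never committed, which is exactly why destructive state changes must wait for commit.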

ILP Simulation Studies

Done by tracing instruction and data references in benchmarks.

Hardware model -- the ideal case
- register renaming: infinite virtual registers, so no WAW or WAR sensitivity
- branch prediction is perfect
- jump prediction (even for computed jumps) is also perfect
- memory disambiguation: also perfect

How many instructions would issue on the perfect machine every cycle?
- gcc 54.8, espresso 62.6, li 17.9, fpppp 75.2, doduc 118.7, tomcatv 150.1
- huge amounts of loop parallelism in the FP SPEC codes
Chapter 4 page 73

Getting More Real: Effects of Limiting the Issue Window Size

Application  Win=inf  Win=512  Win=128  Win=32  Win=8  Win=4
GCC               55       10       10       8      4      3
Espresso          63       15       13       8      4      3
Li                18       12       11       9      4      3
fpppp             75       49       35      14      5      3
doduc            119       16       15       9      4      3
tomcatv          150       45       34      14      6      3

Ambitious in 1995: the PA-8000 tries Win=56 (28 load/store and 28 non-memory).
Chapter 4 page 74

Effects of Realistic Branch Prediction

Schemes used
- Perfect
- Selective (97% accurate, 48K bits): uses a correlating 2-bit predictor and a non-correlating 2-bit predictor, plus a selector to choose between the two; the prediction buffer has 8K slots (13 address bits from the branch), with 3 entries per slot: non-correlating, correlating, selector
- Standard 2-bit: 512 entries (9 address bits), plus a 16-entry buffer to predict returns
- Static: based on profile; predicts either taken or not-taken, but the prediction stays fixed
- None
Chapter 4 page 75
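The "standard 2-bit" scheme is a table of 2-bit saturating counters indexed by low branch-address bits. A minimal sketch (class and method names are mine):

```python
# Minimal 2-bit saturating-counter branch predictor (the "standard 2-bit"
# scheme: 512 entries indexed by low bits of the branch address).
class TwoBitPredictor:
    def __init__(self, entries=512):
        self.entries = entries
        self.counters = [1] * entries   # states 0-1 predict not-taken, 2-3 taken

    def index(self, pc):
        return (pc >> 2) % self.entries  # low address bits select a counter

    def predict(self, pc):
        return self.counters[self.index(pc)] >= 2

    def update(self, pc, taken):
        i = self.index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

p = TwoBitPredictor()
pc = 0x4000
for _ in range(3):                # a loop branch, taken repeatedly
    p.update(pc, True)
assert p.predict(pc)              # now predicts taken
p.update(pc, False)               # one not-taken (loop exit) does not flip it
assert p.predict(pc)              # hysteresis: still predicts taken
```

The selective scheme layers two such predictors (one correlating, one not) under a third table of counters that learns which of the two to trust per branch.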

Results of Prediction Models

Application  Perfect  Selective  Standard 2-bit  Static  None
GCC               35          9               6       6     2
Espresso          41         12               7       6     2
Li                16         10               6       7     2
fpppp             61         48              46      45    29
doduc             58         15              13      14     4
tomcatv           60         46              45      45    19

Window size = 2K and issue limit = 64.
Note the effective equivalence between the standard 2-bit predictor and a compiler-based profiled static predictor, which costs nothing in hardware.
Chapter 4 page 76

Effects of Limiting the Renaming Registers

Application  Infinite  256  128   64   32  None
GCC                11   10   10    9    5     4
Espresso           15   15   13   10    5     4
Li                 12   12   12   11    6     5
fpppp              59   49   35   20    5     4
doduc              29   16   15   11    5     5
tomcatv            54   45   44   28    7     5

Note this assumes an amazing machine: a 97-98% correct predictor that takes 150K bits to implement, a 2K window, and 2K jump and 2K return predictors.
Note that even the PowerPC 620, with only 4-issue capability, has just 12 FP renaming registers and 8 more for the integer pipe.
Are infinite renaming registers needed?
Chapter 4 page 77
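Why the size of the renaming pool matters can be sketched with a toy renamer backed by a finite free list (a simplified structure; a real renamer also returns physical registers to the free list at commit):

```python
# Sketch of register renaming with a finite physical-register free list.
class Renamer:
    def __init__(self, n_phys):
        self.free = list(range(n_phys))   # free physical registers
        self.map = {}                     # architectural -> physical mapping

    def rename_dest(self, arch_reg):
        """Give the destination a fresh physical register, removing WAW/WAR
        hazards on arch_reg. Returns None (a rename stall) if none is free."""
        if not self.free:
            return None                   # pool exhausted: issue must stall
        phys = self.free.pop()
        self.map[arch_reg] = phys
        return phys

r = Renamer(n_phys=2)
p1 = r.rename_dest("R1")          # two in-flight writes to R1 get
p2 = r.rename_dest("R1")          # distinct physical regs: no WAW hazard
assert p1 is not None and p2 is not None and p1 != p2
assert r.rename_dest("R2") is None   # only 2 physical regs: must stall
```

With too few physical registers, issue stalls long before the window or the predictor becomes the bottleneck, which is what the fpppp and tomcatv columns show.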

Models for Memory Alias Analysis

- Perfect: no mistakes; the unrealistic limit
- Global/Stack Perfect: represents the best compiler analysis to date; perfect prediction for global and stack references; heap references are assumed to conflict (because of pointers)
- Inspection: no conflict if the pointers are to different allocation areas; also no conflict when using the same register with different offsets
- None: all memory references are assumed to conflict
Chapter 4 page 78
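The "Inspection" model amounts to two cheap syntactic checks; a minimal sketch (the reference representation and function name are hypothetical):

```python
# Toy "Inspection" disambiguation test: two memory references are provably
# independent if they use the same base register with different constant
# offsets, or point into different allocation areas (e.g. stack vs. global).
def may_conflict(ref_a, ref_b):
    # Same base register, different offsets: provably distinct addresses.
    if ref_a["base"] == ref_b["base"] and ref_a["offset"] != ref_b["offset"]:
        return False
    # Different allocation areas: cannot overlap.
    if ref_a["area"] != ref_b["area"]:
        return False
    return True          # otherwise, conservatively assume a conflict

a = {"base": "R1", "offset": 0, "area": "stack"}
b = {"base": "R1", "offset": 8, "area": "stack"}
c = {"base": "R2", "offset": 0, "area": "stack"}
assert not may_conflict(a, b)   # same base, different offsets: independent
assert may_conflict(a, c)       # distinct bases in the same area: assume conflict
```

Everything Inspection cannot prove independent falls back to a conservative conflict, which is why it trails Global/Stack Perfect so badly on the FP codes.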

Memory Alias Effects

Application  Perfect  Global/Stack Perfect  Inspection  None
GCC               10                     7           4     3
Espresso          15                     7           5     5
Li                12                     9           4     3
fpppp             49                    49           4     3
doduc             16                    16           6     4
tomcatv           45                    45           5     4

- Perfect global and stack analysis is not too realistic; array dependences may be a problem
- Perfect analysis of global and stack references is a factor of 2 better than inspection, and it matches Perfect on the FP benchmarks because they have no heap references
- Recent research on alias analysis for pointers should further improve the handling of pointers to the heap
Chapter 4 page 79

Toward a Realizable Processor

Something we can conceive might be possible in 5 years:
- 64-issue with no issue restrictions; the "no restriction" part is disputable (e.g., 64 memory references in the same cycle may be a problem)
- selective predictor with 1K entries; 16-entry return predictor
- dynamic perfect memory disambiguation
- register renaming with 64 additional FP registers and 64 additional integer registers
Chapter 4 page 80

Amount of Realizable ILP in 5 Years

App.      Infinite  Win=256  Win=128  Win=64  Win=32  Win=16  Win=8  Win=4
GCC             10       10       10       9       8       6      4      3
Espresso        15       15       13      10       8       6      4      2
Li              12       12       11      11       9       6      4      3
fpppp           52       47       35      22      14       8      5      3
doduc           17       16       15      12       9       7      4      3
tomcatv         56       45       34      22      14       9      6      3

For more recent developments, see the September 1997 issue of IEEE Computer. With a billion transistors on a chip, what is the best way to spend them: a simple processor with large on-chip caches and a high clock rate, or more ILP with smaller caches and a slower clock rate?
Chapter 4 page 81

Recent Machines (see Figure 4.60)

CPU           Year  Clock(MHz)  Issue struct.  Sched.   Max issue  Ld-St  Int  Float  Branch  SPEC Int/FP
Power1        1991      66      Dynamic        Static       4        1     1     1      1     60/80
HP 7100       1992     100      Static         Static       2        1     1     1      1     80/150
Alpha 21064   1993     150      Dynamic        Static       2        1     1     1      1     100/150
SuperSparc    1993      50      Dynamic        Static       3        1     1     1      1     75/85
Power2        1994      67      Dynamic        Static       6        2     2     2      2     95/270
MIPS TFP      1994      75      Dynamic        Static       4        2     2     2      1     100/310
Pentium       1994      66      Dynamic        Static       2        2     2     1      1     65/65
Alpha 21164   1995     300      Static         Static       4        2     2     2      1     330/500
UltraSparc    1995     167      Dynamic        Static       4        1     1     1      1     275/305
Intel P6      1995     150      Dynamic        Dynamic      3        1     2     1      1     >200 int
Hal R1        1995     154      Dynamic        Dynamic      4        1     2     1      1     255/330
PowerPC 620   1995     133      Dynamic        Dynamic      4        1     1     1      1     225/300
MIPS R10000   1995     200      Dynamic        Dynamic      4        1     2     2      1     300/600
HP PA-8000    1996     200      Dynamic        Static       6        2     2     2      1     >360/>550

Chapter 4 page 82