Hardware Speculation Support



1 Hardware Speculation Support
Conditional instructions
  - Most common form is the conditional move
        BNEZ  R1, L         ; if
        MOV   R2, R3        ; then
    L:                      ; else
    can be replaced by
        CMOVZ R2, R3, R1
  - Other variants
      - conditional loads and stores
      - nullification instructions in HP-PA: add/sub two operands, store the sum, and cause the following instruction to be skipped if the sum is 0 (page C-20 of the text)
  - Alpha, MIPS, SPARC, PowerPC, and P6 all have simple conditional moves
  - Can sometimes eliminate a branch when the "then" part of an "if" statement is a single instruction
      - in these cases it changes a control dependence into a data dependence
      - a win, since in global scheduling control dependence is the key limiting complexity
Chapter 4 page 66
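The branch and conditional-move sequences on this slide can be compared with a small Python sketch. The function names `with_branch` and `with_cmovz` are hypothetical, and the sketch models only the final value of R2, not pipeline timing.

```python
def with_branch(r1, r2, r3):
    """BNEZ R1, L / MOV R2, R3 / L: -- the move is control dependent on the branch."""
    if r1 != 0:          # BNEZ R1, L: branch around the move when R1 is nonzero
        return r2        # R2 keeps its old value
    return r3            # MOV R2, R3


def with_cmovz(r1, r2, r3):
    """CMOVZ R2, R3, R1 -- always executes; the new R2 is data dependent on R1."""
    return r3 if r1 == 0 else r2


if __name__ == "__main__":
    for r1 in (0, 1, -5):
        print(r1, with_branch(r1, 10, 20), with_cmovz(r1, 10, 20))
```

Both functions return the same R2 for every input; the difference is that the second form has no branch for the hardware to predict.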

2 Conditional Instruction Limitations
Exceptions
  - semantics must be that if the condition fails, the instruction has no effect
  - hence if a conditional instruction raises an exception, it must be handled according to the result of the condition evaluation
  - the type of exception also matters: a memory protection violation vs. a page fault
Not useful for more complicated control flow
  - it would require multiple conditions to be specified in the conditional instruction
Wasted resources
  - speculated instructions still take time to execute
  - tends to work well in the superscalar case (like our simple 2-way model), where the resource would otherwise be wasted anyway
Cycle-time or CPI issues
  - conditional instructions are more complex
  - the danger is that they may consume more cycles or lengthen the cycle time
  - they are mainly useful for short control flows, so their use may not be the common case
  - we don't want to slow down the real common case to support the uncommon case

3 Compiler Speculation with HW Support
Ideal view
  - do conditional things in advance of the branch
  - nuke them if the branch goes the wrong way
  - also control exception behavior if the branch goes the wrong way
Limits
  - speculated values cannot clobber any real results
  - exceptions cannot cause any destructive activity
HW support
  - poison bits: set on a register when an exception occurs; fault if a regular instruction tries to use that register
  - HW (and OS) ignores the exception until the instruction commits
  - speculative instructions and results must be tagged as speculative until the condition is resolved; if the prediction was incorrect, speculative results and exceptions can be discarded
  - boosting: provide separate shadow resources for boosted instruction results; if the condition resolves to select the boosted path, these results are committed to the real registers (note this won't work for memory)
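The poison-bit mechanism can be illustrated with a minimal Python sketch. The class `RegFile`, the exception `PoisonFault`, and the dictionary standing in for memory are all assumptions made for illustration; propagation of poison through further speculative instructions is not modeled.

```python
class PoisonFault(Exception):
    """Raised when a regular (non-speculative) instruction reads a poisoned register."""


class RegFile:
    def __init__(self, n=8):
        self.value = [0] * n
        self.poison = [False] * n

    def spec_load(self, dst, memory, addr):
        """Speculative load: a bad address poisons dst instead of trapping."""
        if addr in memory:
            self.value[dst] = memory[addr]
            self.poison[dst] = False
        else:
            self.poison[dst] = True      # exception deferred, not taken

    def write(self, dst, v):
        """Regular write: overwrites the register and clears any poison."""
        self.value[dst] = v
        self.poison[dst] = False

    def read(self, src):
        """Regular read: the deferred exception surfaces here."""
        if self.poison[src]:
            raise PoisonFault(f"R{src} holds the result of a faulting speculative instruction")
        return self.value[src]


if __name__ == "__main__":
    regs = RegFile()
    regs.spec_load(1, {0x100: 42}, 0x200)   # faulting speculative load: no trap yet
    try:
        regs.read(1)
    except PoisonFault as e:
        print("fault on use:", e)
```

If the branch goes the other way and R1 is never read by a regular instruction, the fault never surfaces, which is the intended behavior.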

4 Aggression Levels in Speculation
Consider the if-then-else case
    if condition-block then then-block else else-block
Traditional conservative method
  - do them in order
  - filling in branch delay slots by the compiler can help
Using prediction
  - start the predicted path while evaluating the condition
  - either continue or nuke based on the condition result
Aggressive
  - start all 3 blocks; when the condition is known, nuke the unselected path
  - implies lots of resources, but the idea may be used, just dampened by real resource limitations

5 Hardware-Based Speculation
Combo of 3 key ideas (the effect is a data-flow with speculation model)
  - dynamic branch prediction
  - speculation: allow the speculated blocks to start before condition resolution
  - dynamic scheduling (Tomasulo-style approach)
Advantages
  - more instruction order flexibility: things tend to run as soon as they can
  - dynamic memory disambiguation is possible where the compiler would have to be more conservative
  - dynamic branch prediction works considerably better than the static variant
  - able to maintain a precise exception model: it isn't free, but it can be done
  - HW-based method, so it doesn't require compensation or bookkeeping code
  - relieves the compiler from difficult machine-specific tuning and optimization duties
Approach
  - allow out-of-order execution but require in-order commit (the point where an instruction is no longer speculative)
  - prevent speculative instructions from performing destructive state changes
  - involves adding a reorder buffer to hold completed but not committed instructions
  - the reorder buffer contains virtual registers (similar to the reservation stations) and becomes a bypass source

6 The Speculative DLX
[Figure: block diagram of the speculative DLX. Labels: From Instruction Unit, Reorder Buffer, FP Op. Queue, To Memory, Data, Load Results, Reg#, Reservation Stations, FP Registers, FP Multipliers, FP Adders, CDB (Common Data Bus)]
Note: looks a lot like the Tomasulo DLX; the reorder buffer takes on most of the work

7 Steps in Speculative Execution
Issue (or dispatch)
  - get instruction from the queue
  - issue if a reservation station AND a reorder buffer slot are available; otherwise stall
  - send operands if they are in the registers or the reorder buffer
Execute
  - reservation station waits, grabbing results off the CDB if necessary
  - when all operands are there, execution happens
Write Result
  - result posted to the reorder buffer via the CDB
  - waiting reservation stations can grab it as well
Commit (or graduate)
  - when the instruction reaches the head of the reorder buffer, its value is posted to the registers or memory
  - if it is an incorrectly predicted branch, the incorrect successor entries in the reorder buffer (although some of them may have completed execution) are nuked
  - this nuke may flush the entire buffer and FP op queue and restart IF at the appropriate spot
  - if there is an exception, it is taken at this step
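The commit step can be sketched in Python as follows. This is a simplified model under stated assumptions: the `Entry` fields and the `commit` function are hypothetical names, only register destinations are modeled (no stores), and a mispredicted branch flushes every younger entry.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    dest: Optional[str]          # destination register, or None for a branch
    value: int = 0
    done: bool = False           # True once the result was written via the CDB
    mispredicted: bool = False   # meaningful for branches only


def commit(rob, regs):
    """Retire finished entries from the head of the reorder buffer, in program order.

    Returns (number_retired, flushed). Entries behind an unfinished head wait,
    even if they have already completed execution.
    """
    retired = 0
    while rob and rob[0].done:
        entry = rob.popleft()
        retired += 1
        if entry.dest is not None:
            regs[entry.dest] = entry.value   # state becomes architectural here
        if entry.mispredicted:
            rob.clear()                      # nuke all younger, speculative entries
            return retired, True
    return retired, False


if __name__ == "__main__":
    regs = {}
    rob = deque([Entry("F0", 1), Entry("F2", 2, done=True)])
    print(commit(rob, regs), regs)   # head not done: nothing retires
    rob[0].done = True
    print(commit(rob, regs), regs)   # both retire in order
```

Because register state changes only inside `commit`, discarding a wrong path needs no undo step: the speculative values never left the buffer.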

8 ILP Simulation Studies
Done by tracing instruction and data references in benchmarks
Hardware model: the ideal case
  - register renaming: infinite virtual registers, so no WAW or WAR sensitivity
  - branch prediction is perfect
  - jump prediction (even computed) is also perfect
  - memory disambiguation: also perfect
How many instructions would issue on the perfect machine every cycle?
[Chart: results for gcc, espresso, li, fpppp, doduc, tomcatv]
Huge amounts of loop parallelism in the FP SPEC codes
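A rough Python sketch of how such a limit study can measure ILP on the ideal machine: with perfect renaming and prediction, only true (RAW) dependences constrain the schedule. The trace format, unit latencies, and the name `ideal_ilp` are assumptions for illustration, not the setup of the study behind the slide.

```python
def ideal_ilp(trace):
    """Average instructions per cycle when only RAW dependences matter.

    trace: list of (dest, sources) pairs in program order; dest may be None.
    Every instruction is assumed to take one cycle.
    """
    if not trace:
        return 0.0
    ready = {}    # register -> cycle in which its most recent value is produced
    depth = 0     # length of the longest dependence chain so far
    for dest, srcs in trace:
        # An instruction issues one cycle after its latest producer.
        cycle = 1 + max((ready.get(s, 0) for s in srcs), default=0)
        if dest is not None:
            ready[dest] = cycle   # rewriting a register never delays anyone (renaming)
        depth = max(depth, cycle)
    return len(trace) / depth


if __name__ == "__main__":
    trace = [("r1", []), ("r2", []), ("r3", ["r1", "r2"]), ("r1", []), ("r4", ["r1"])]
    print(ideal_ilp(trace))
```

The fourth instruction reuses `r1` without waiting for the third to read the old value, which is exactly the WAR insensitivity that infinite renaming registers provide.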

9 Getting More Real
Effects of limiting the issue window size
[Table 1: results by application (GCC, Espresso, Li, fpppp, doduc, tomcatv) for Win=infinite, Win=512, Win=128, Win=32, Win=8, Win=4]
Ambitious in 1995: PA-8000 trying Win=56 (28 load/store and 28 non-memory)
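The effect of a finite window can be sketched in Python by letting only the oldest `window` not-yet-issued instructions compete for issue each cycle. This window definition, the unit latencies, and the name `windowed_ilp` are assumptions for illustration; the study behind the table may define the window differently.

```python
def windowed_ilp(trace, window):
    """Instructions per cycle with RAW dependences and a limited issue window.

    trace: list of (dest, sources) pairs in program order; window must be >= 1.
    """
    if not trace:
        return 0.0
    # Record, for each instruction, the indices of the instructions it depends on.
    last_writer = {}
    producers = []
    for i, (dest, srcs) in enumerate(trace):
        producers.append([last_writer[s] for s in srcs if s in last_writer])
        if dest is not None:
            last_writer[dest] = i

    issued_at = [None] * len(trace)
    pending = list(range(len(trace)))
    cycle = 0
    while pending:
        cycle += 1
        candidates = pending[:window]    # only the oldest `window` instructions are visible
        ready = [i for i in candidates
                 if all(issued_at[p] is not None for p in producers[i])]
        for i in ready:                  # mark after the scan, so same-cycle results are not used
            issued_at[i] = cycle
        ready_set = set(ready)
        pending = [i for i in pending if i not in ready_set]
    return len(trace) / cycle


if __name__ == "__main__":
    trace = [("a", []), ("b", ["a"]), ("c", []), ("d", ["c"])]
    for w in (4, 2, 1):
        print(w, windowed_ilp(trace, w))
```

With two independent two-instruction chains, shrinking the window hides the second chain behind the first, so the measured ILP drops as the window narrows.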

10 Effects of Realistic Branch Prediction
Schemes used
  - Perfect
  - Selective (97% accurate with 48K bits)
      - uses a correlating 2-bit and a non-correlating 2-bit predictor plus a selector to choose between the two
      - prediction buffer has 8K slots (13 address bits from the branch)
      - 3 entries per slot: non-correlating, correlating, select
  - Standard 2-bit
      - 512 entries (9 address bits)
      - plus a 16-entry buffer to predict returns
  - Static
      - based on profile: predict either T or NT, but it stays fixed
  - None
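The "standard 2-bit" scheme can be sketched in Python as a table of saturating counters indexed by low-order branch address bits. The class name, the initial counter value, and the 4-byte instruction alignment are assumptions for illustration; the 16-entry return buffer is not modeled.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters; values 2 and 3 predict taken."""

    def __init__(self, index_bits=9):
        self.mask = (1 << index_bits) - 1
        self.counters = [1] * (1 << index_bits)   # assumed start: weakly not-taken

    def _index(self, pc):
        return (pc >> 2) & self.mask              # assumes 4-byte instructions

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def accuracy(predictor, outcomes, pc=0x400):
    """Fraction of correct predictions for one branch with the given outcome history."""
    hits = 0
    for taken in outcomes:
        hits += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return hits / len(outcomes)


if __name__ == "__main__":
    loop_branch = ([True] * 9 + [False]) * 2      # a 10-iteration loop run twice
    print(accuracy(TwoBitPredictor(), loop_branch))
```

After warm-up, the counter mispredicts only the loop exit; the single not-taken outcome moves it from 3 to 2, so the next loop entry is still predicted taken.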

11 Results of Prediction Models
[Table: results by application (GCC, Espresso, Li, fpppp, doduc, tomcatv) for the Perfect, Selective, Standard 2-bit, Static, and None predictors]
Window size = 2K and issue limit = 64
Note: effective equivalence between the standard 2-bit predictor and a compiler-based profile static predictor, which costs nothing in hardware

12 Effects of Limiting the Renaming Registers
[Table: results by application (GCC, Espresso, Li, fpppp, doduc, tomcatv) for renaming register counts ranging from Infinite to None]
Note this assumes an amazing machine:
  - 97-98% correct predictor, which takes 150K bits to implement
  - 2K window
  - 64 issue capability
  - 2K jump and 2K return predictors
Note even the PowerPC 620 only has 12 FP renaming registers and 8 more for the integer pipe
Are infinite renaming registers needed?

13 Models for Memory Alias Analysis
Perfect
  - no mistakes: the unrealistic limit
Global/Stack Perfect
  - represents the best compiler analysis to date
  - perfect prediction for global and stack references
  - assume heap references conflict (because of pointers)
Inspection
  - if the pointers are to different allocation areas, then no conflict
  - also no conflict when using the same register with different offsets
None
  - all memory references are assumed to conflict
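The four alias models can be contrasted with a small Python sketch. The `MemRef` record and `may_conflict` function are hypothetical; in particular, treating any pair that involves a heap reference as conflicting under Global/Stack Perfect is a conservative reading of the slide, and the inspection rule assumes the base register is not modified between the two references.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemRef:
    base: str      # base register named in the instruction
    offset: int    # constant offset in the instruction
    area: str      # "global", "stack", or "heap"
    address: int   # actual run-time address (known only to the perfect models)


def may_conflict(a, b, model):
    """Return True if the model must assume references a and b touch the same location."""
    if model == "none":
        return True
    if model == "perfect":
        return a.address == b.address
    if model == "global_stack":
        if a.area == "heap" or b.area == "heap":
            return True                    # assumed: heap references always conflict
        return a.address == b.address      # perfect for global and stack references
    if model == "inspection":
        if a.area != b.area:
            return False                   # different allocation areas
        if a.base == b.base and a.offset != b.offset:
            return False                   # same register, different offsets
        return True
    raise ValueError(f"unknown model: {model}")


if __name__ == "__main__":
    a = MemRef("r1", 0, "stack", 100)
    b = MemRef("r1", 8, "stack", 108)
    c = MemRef("r2", 0, "heap", 500)
    for model in ("perfect", "global_stack", "inspection", "none"):
        print(model, may_conflict(a, b, model), may_conflict(a, c, model))
```

Every pair that a model reports as conflicting adds a dependence the scheduler must honor, so the more conservative the model, the less ILP the simulated machine can extract.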

14 Memory Alias Effects
[Table: results by application (GCC, Espresso, Li, fpppp, doduc, tomcatv) for the Perfect, Global/Stack Perfect, Inspection, and None alias models]
  - Perfect global and stack analysis is not too realistic: array dependences may be a problem
  - Perfect analysis of global and stack references is a factor of 2 better than inspection, and it is perfect for the f.p. benchmarks because no heap references exist in these benchmarks
  - Recent research on alias analysis for pointers should further improve the handling of pointers to the heap

15 Toward a Realizable Processor
Something we can conceive might be possible in 5 years
  - 64 issue with no issue restrictions
      - the no-restriction part is disputable (e.g., 64 memory references in the same cycle may be a problem)
  - selective predictor: 1K entries
  - 16-entry return predictor
  - dynamic perfect memory disambiguation
  - register renaming with 64 additional FP regs and 64 additional integer regs

16 Amount of Realizable ILP in 5 Years
[Table: results by application (Gcc, Espresso, Li, fpppp, doduc, tomcatv) for Infinite, Win=256, Win=128, Win=64, Win=32, Win=16, Win=8, Win=4]
For more recent developments, see IEEE Computer, Sept issue
Billion transistors on a chip: what is the best way to spend them?
  - a simple processor with large on-chip caches and a high clock rate, OR
  - explore more ILP with smaller caches and a slower clock rate?

17 Recent Machines (see Figure 4.60)
Table 1 columns: CPU, Year, Clock MHz, Issue Structure, Sched., Max Issue, Load-St. Issue, Int Issue, Float Issue, Branch Issue, SPEC Int/Float
CPU          Issue Structure   Sched.    SPEC Int/Float
Power        Dynamic           Static    /80
HP           Static            Static    /150
Alpha        Dynamic           Static    /150
SuperSparc   Dynamic           Static    /85
Power        Dynamic           Static    /270
MIPS TFP     Dynamic           Static    /310
Pentium      Dynamic           Static    /65
Alpha        Static            Static    /500
UltraSparc   Dynamic           Static    /305
Intel P      Dynamic           Dynamic   >200 int
Hal R        Dynamic           Dynamic   /330
PowerPC      Dynamic           Dynamic   /300
MIPS R       Dynamic           Dynamic   /600
HP PA        Dynamic           Static    >360/>550
