Hardware Speculation Support

Conditional instructions
- Most common form is the conditional move. The branch sequence
      BNEZ  R1, L      ;if
      MOV   R2, R3     ;then
    L:                 ;else
  is replaced by the single instruction
      CMOVZ R2, R3, R1
- Other variants: conditional loads and stores
- Nullification instructions in HP-PA: add/sub two operands, store the sum, and cause the following instruction to be skipped if the sum is 0 (page C-20 in the text)
- Alpha, MIPS, SPARC, PowerPC, and the P6 all have simple conditional moves
- Sometimes a conditional move can eliminate a branch when there is a single instruction in the ``then'' part of an ``if'' statement
  - in these cases it changes a control dependence into a data dependence
  - a win, since in global scheduling control dependence is the key limiting complexity
Chapter 4 page 66
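The effect of this if-conversion can be seen in a toy Python model (the function names are illustrative, not from the text): the conditional-move form computes the same result as the branch form, but as a pure function of its operand values.

```python
def branch_version(r2, r3, r1):
    """Branch form: BNEZ R1, L skips the move when R1 != 0."""
    if r1 == 0:          # fall through: execute the ``then'' part
        r2 = r3          # MOV R2, R3
    return r2

def cmovz(r2, r3, r1):
    """CMOVZ R2, R3, R1: move R3 into R2 only when R1 == 0.

    The result is a pure function of the operand values, so the
    control dependence has become a data dependence.
    """
    return r3 if r1 == 0 else r2

# Both forms agree on every input combination:
assert all(cmovz(a, b, c) == branch_version(a, b, c)
           for a in (1, 2) for b in (3, 4) for c in (0, 5))
```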
Conditional Instruction Limitations

- Exceptions
  - the semantics must be that if the condition fails, the instruction has no effect
  - hence if an exception happens on a conditional instruction, we must handle it properly based on the result of the condition evaluation
  - another factor is the type of exception: a memory protection violation vs. a page fault
- Not useful for more complicated control flow
  - it would require multiple conditions to be specified in the conditional instruction
- Wasted resources
  - speculated instructions still take time to execute
  - tends to work well in the superscalar case (like our simple 2-way model) where the resource would otherwise be wasted anyway
- Cycle-time or CPI issues
  - conditional instructions are more complex
  - the danger is that they may consume more cycles or lengthen the cycle time
  - note that they are mainly useful for short control flows, so their use may not be the common case
  - we don't want to slow down the real common case to support the uncommon case
Compiler Speculation with HW Support

- Ideal view
  - do conditional things in advance of the branch; of course, nuke them if the branch goes the wrong way
  - also control exception behavior if the branch goes the wrong way
- Limits
  - speculated values cannot clobber any real results
  - exceptions cannot cause any destructive activity
- HW support
  - poison bits: set on a register when a speculative instruction takes an exception; fault if a regular instruction tries to use that register
  - HW (and OS) ignore the exception until the instruction commits
  - speculative instructions and results must be tagged as speculative until the condition is resolved -- on a misprediction, speculative results and exceptions can be discarded
  - boosting: provide separate shadow resources for boosted instruction results; if the condition resolves selecting the boosted path, these results are committed to the real registers (note this won't work for memory)
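The poison-bit idea can be sketched in a few lines of Python (a toy model -- the class and method names are assumptions, not an actual ISA definition): a speculative instruction that faults only marks its destination register, and the fault is delivered later, if and only if a non-speculative instruction reads that register.

```python
class RegFile:
    """Toy register file with poison bits (illustrative model only)."""

    def __init__(self, n):
        self.val = [0] * n
        self.poison = [False] * n

    def spec_write(self, rd, value, faulted=False):
        # A speculative instruction that takes an exception does not trap;
        # it just sets the poison bit on its destination register.
        self.poison[rd] = faulted
        if not faulted:
            self.val[rd] = value

    def read(self, rs, speculative=False):
        # A regular (non-speculative) reader of a poisoned register
        # finally takes the deferred fault.
        if self.poison[rs] and not speculative:
            raise RuntimeError(f"deferred exception on r{rs}")
        return self.val[rs]

    def squash(self, rd):
        # Branch went the other way: the speculative result and its
        # pending exception are simply discarded.
        self.poison[rd] = False
```

For example, `rf.spec_write(1, 0, faulted=True)` followed by `rf.squash(1)` discards the pending exception, while a non-speculative `rf.read(1)` before the squash would raise it.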
Aggression Levels in Speculation

Consider the if-then-else case: if condition-block then then-block else else-block
- Traditional conservative method
  - do them in order
  - filling in branch delay slots by the compiler can help
- Using prediction
  - start the predicted path while evaluating the condition
  - either continue or nuke based on the condition result
- Aggressive
  - start all 3 blocks; when the condition is known, nuke the unselected path
  - implies lots of resources, but the idea may still be used - just dampened by real resource limitations
Hardware-Based Speculation

- Combination of 3 key ideas - the effect is a data-flow-with-speculation model
  - dynamic branch prediction
  - speculation - allow the speculated blocks to start before condition resolution
  - dynamic scheduling (Tomasulo-style approach)
- Advantages
  - more instruction-order flexibility - things tend to run as soon as they can
  - dynamic memory disambiguation is possible where the compiler would have to be more conservative
  - dynamic branch prediction works considerably better than the static variant
  - able to maintain a precise exception model - it isn't free, but it can be done
  - HW-based method, so it doesn't require compensation or book-keeping code
  - relieves the compiler of difficult machine-specific tuning and optimization duties
- Approach
  - allow out-of-order issue but require in-order commit (the point where an instruction is no longer speculative)
  - prevent speculative instructions from performing destructive state changes
  - involves adding a reorder buffer to hold completed but not committed instructions
  - the reorder buffer contains virtual registers (similar to reservation stations) and becomes a bypass source
The Speculative DLX

[Block diagram: the instruction unit feeds the FP op. queue and the reorder buffer (data, load results, reg#); reservation stations sit in front of the FP adders and FP multipliers, alongside the FP registers; stores go from the reorder buffer to memory.]

Note: this looks a lot like the Tomasulo DLX, but the reorder buffer takes on most of the work (CDB = Common Data Bus).
Steps in Speculative Execution

- Issue (or dispatch)
  - get an instruction from the queue
  - issue if there is an available reservation station AND an available reorder buffer slot
  - send operands if they are in a register or the reorder buffer; otherwise stall
- Execute
  - the reservation station waits, grabbing results off the CDB if necessary
  - when all operands are there, execution happens
- Write Result
  - the result is posted to the reorder buffer via the CDB
  - waiting reservation stations can grab it as well
- Commit (or graduate)
  - when an instruction reaches the head of the reorder buffer, its value is posted to the registers or memory
  - on an incorrect branch, the incorrect successor entries in the reorder buffer (although some of them may have completed execution) are nuked
  - this nuke may flush the entire buffer and FP op queue and restart IF at the appropriate spot
  - if there is an exception, it is taken at this step
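The four steps above can be sketched as a toy reorder buffer in Python (names and structure are illustrative; a real ROB also tracks destination registers, store addresses, and branch outcomes). The key property it demonstrates is out-of-order completion with in-order commit.

```python
from collections import deque

class ReorderBuffer:
    """Toy reorder buffer: out-of-order completion, in-order commit."""

    def __init__(self, size):
        self.size = size
        self.buf = deque()          # entries kept in issue (program) order

    def issue(self, tag):
        if len(self.buf) >= self.size:
            return False            # structural stall: no free ROB slot
        self.buf.append({"tag": tag, "done": False, "value": None})
        return True

    def write_result(self, tag, value):
        for e in self.buf:          # result posted via the CDB
            if e["tag"] == tag:
                e["done"], e["value"] = True, value

    def commit(self):
        # Only the head entry may commit, and only once its result is ready.
        if self.buf and self.buf[0]["done"]:
            return self.buf.popleft()["value"]
        return None

rob = ReorderBuffer(4)
rob.issue("i1"); rob.issue("i2")
rob.write_result("i2", 20)          # i2 finishes first...
assert rob.commit() is None         # ...but cannot commit ahead of i1
rob.write_result("i1", 10)
assert rob.commit() == 10           # in-order commit: i1, then i2
assert rob.commit() == 20
```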
ILP Simulation Studies

- Done by tracing instruction and data references in benchmarks
- Hardware model -- the ideal case
  - register renaming - infinite virtual registers, so no WAW or WAR sensitivity
  - branch prediction is perfect
  - jump prediction (even for computed jumps) is also perfect
  - memory disambiguation - also perfect
- How many instructions would issue on the perfect machine every cycle?
  - gcc - 54.8, espresso - 62.6, li - 17.9, fpppp - 75.2, doduc - 118.7, tomcatv - 150.1
- Huge amounts of loop parallelism in the FP SPEC codes
Getting More Real

Effects of limiting the issue window size:

  Application  Win=infinite  Win=512  Win=128  Win=32  Win=8  Win=4
  GCC                    55       10       10       8      4      3
  Espresso               63       15       13       8      4      3
  Li                     18       12       11       9      4      3
  fpppp                  75       49       35      14      5      3
  doduc                 119       16       15       9      4      3
  tomcatv               150       45       34      14      6      3

- Ambitious in 1995: the PA-8000 is trying Win=56 (28 load/store and 28 non-memory)
Effects of Realistic Branch Prediction

Schemes used:
- Perfect
- Selective (97% accurate, with 48K bits)
  - uses a correlating 2-bit and a non-correlating 2-bit predictor, plus a selector to choose between the two
  - the prediction buffer has 8K slots (13 address bits from the branch), 3 entries per slot - non-correlating, correlating, selector
- Standard 2-bit
  - 512 entries (9 address bits), plus a 16-entry buffer to predict RETURNs
- Static
  - based on profile - predict either T or NT, but the prediction stays fixed
- None
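A minimal sketch of the standard 2-bit scheme in Python (one saturating counter per table entry, indexed by the low branch-address bits; the class name and parameters are illustrative): a branch must mispredict twice before the prediction flips, which is what makes the scheme robust on loop branches.

```python
class TwoBitPredictor:
    """Toy standard 2-bit saturating-counter branch predictor."""

    def __init__(self, bits=9):
        # 2**bits entries (512 for 9 address bits), each a counter 0..3;
        # start in state 1 (weakly not-taken).
        self.table = [1] * (1 << bits)
        self.mask = (1 << bits) - 1

    def predict(self, pc):
        # Counter states 2 and 3 predict taken; 0 and 1 predict not-taken.
        return self.table[pc & self.mask] >= 2

    def update(self, pc, taken):
        # Saturating increment on taken, decrement on not-taken.
        i = pc & self.mask
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

p = TwoBitPredictor()
pc = 0x40
for _ in range(3):                  # train on a repeatedly taken loop branch
    p.update(pc, True)
assert p.predict(pc) is True
p.update(pc, False)                 # one loop exit does not flip the prediction
assert p.predict(pc) is True
```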
Results of Prediction Models

  Application  Perfect  Selective  Standard 2-bit  Static  None
  GCC               35          9               6       6     2
  Espresso          41         12               7       6     2
  Li                16         10               6       7     2
  fpppp             61         48              46      45    29
  doduc             58         15              13      14     4
  tomcatv           60         46              45      45    19

- Window size = 2K and issue limit = 64
- Note: effective equivalence between the standard 2-bit predictor and a compiler-based profile static predictor, which costs nothing in hardware
Effects of Limiting the Renaming Registers

  Application  Infinite  256  128  64  32  None
  GCC                11   10   10   9   5     4
  Espresso           15   15   13  10   5     4
  Li                 12   12   12  11   6     5
  fpppp              59   49   35  20   5     4
  doduc              29   16   15  11   5     5
  tomcatv            54   45   44  28   7     5

- Note this assumes an amazing machine:
  - a 97-98% correct predictor, which takes 150K bits to implement
  - a 2K window
  - 2K jump and 2K return predictors
- Note that even the PowerPC 620, with only a 4-issue capability, has just 12 FP renaming registers and 8 more for the integer pipe
- Are infinite renaming registers needed?
Models for Memory Alias Analysis

- Perfect
  - no mistakes - the unrealistic limit
- Global/stack perfect
  - represents the best compiler analysis to date
  - perfect prediction for global and stack references
  - assume heap references conflict (because of pointers)
- Inspection
  - if pointers are to different allocation areas, then no conflict
  - also no conflict when using the same register with different offsets
- None
  - all memory references are assumed to conflict
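The inspection model can be sketched as a Python predicate (a toy illustration; the tuple fields are assumptions, not the simulator's actual representation): it answers "may these two references conflict?" using only the two syntactic rules above, and says yes for everything else.

```python
def may_alias(ref_a, ref_b):
    """Inspection-style alias check on (area, base_register, offset) refs.

    Under inspection:
      - references to different allocation areas (e.g. 'global' vs.
        'stack') cannot conflict;
      - references through the same base register with different
        offsets cannot conflict;
      - everything else is conservatively assumed to conflict.
    """
    area_a, base_a, off_a = ref_a
    area_b, base_b, off_b = ref_b
    if area_a != area_b:
        return False
    if base_a == base_b and off_a != off_b:
        return False
    return True

assert may_alias(("stack", "r30", 8), ("global", "r1", 0)) is False
assert may_alias(("stack", "r30", 8), ("stack", "r30", 16)) is False
assert may_alias(("heap", "r4", 0), ("heap", "r5", 0)) is True  # conservative
```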
Memory Alias Effects

  Application  Perfect  Global/Stack Perfect  Inspection  None
  GCC               10                     7           4     3
  Espresso          15                     7           5     5
  Li                12                     9           4     3
  fpppp             49                    49           4     3
  doduc             16                    16           6     4
  tomcatv           45                    45           5     4

- Perfect global and stack analysis is not too realistic -- array dependences may be a problem
- Perfect analysis of global and stack references is a factor of 2 better than inspection, and is perfect for the f.p. benchmarks because no heap references exist in these benchmarks
- Recent research on alias analysis for pointers should further improve the handling of pointers to the heap
Toward a Realizable Processor

Something we can conceive might be possible in 5 years:
- 64-issue with no issue restrictions
  - the no-restriction part is disputable (e.g., 64 memory references in the same cycle may be a problem)
- selective predictor with 1K entries; 16-entry return predictor
- dynamic perfect memory disambiguation
- register renaming with 64 additional FP registers and 64 additional integer registers
Amount of Realizable ILP in 5 Years

  Application  Infinite  Win=256  Win=128  Win=64  Win=32  Win=16  Win=8  Win=4
  GCC                10       10       10       9       8       6      4      3
  Espresso           15       15       13      10       8       6      4      2
  Li                 12       12       11      11       9       6      4      3
  fpppp              52       47       35      22      14       8      5      3
  doduc              17       16       15      12       9       7      4      3
  tomcatv            56       45       34      22      14       9      6      3

- For more recent developments, see the IEEE Computer, Sept. 1997 issue
- A billion transistors on a chip: what is the best way to spend them?
  - a simple processor with large on-chip caches and a high clock rate, OR
  - explore more ILP with smaller caches and a slower clock rate?
Recent Machines (see Figure 4.60)

  CPU          Year  Clock MHz  Issue Struct.  Sched.   Max Issue  Ld/St  Int  Float  Branch  SPEC Int/FP
  Power1       1991         66  Dynamic        Static           4      1    1      1       1        60/80
  HP 7100      1992        100  Static         Static           2      1    1      1       1       80/150
  Alpha 21064  1993        150  Dynamic        Static           2      1    1      1       1      100/150
  SuperSparc   1993         50  Dynamic        Static           3      1    1      1       1        75/85
  Power2       1994         67  Dynamic        Static           6      2    2      2       2       95/270
  MIPS TFP     1994         75  Dynamic        Static           4      2    2      2       1      100/310
  Pentium      1994         66  Dynamic        Static           2      2    2      1       1        65/65
  Alpha 21164  1995        300  Static         Static           4      2    2      2       1      330/500
  UltraSparc   1995        167  Dynamic        Static           4      1    1      1       1      275/305
  Intel P6     1995        150  Dynamic        Dynamic          3      1    2      1       1     >200 int
  Hal R1       1995        154  Dynamic        Dynamic          4      1    2      1       1      255/330
  PowerPC 620  1995        133  Dynamic        Dynamic          4      1    1      1       1      225/300
  MIPS R10000  1995        200  Dynamic        Dynamic          4      1    2      2       1      300/600
  HP PA-8000   1996        200  Dynamic        Static           6      2    2      2       1   >360/>550