As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor.

Hiroaki Kobayashi

Branches will arrive up to n times faster in an n-issue processor, and providing an instruction stream to the processor will require that we predict the outcome of branches. Amdahl's Law reminds us that the relative impact of control stalls will be larger with the lower potential CPI in such machines.

Branch Prediction Schemes
Static branch prediction by the compiler: predict-taken/predict-not-taken schemes, delayed branch scheme.
Dynamic branch prediction by hardware: branch-prediction buffer, branch history table, branch-target buffer.
The goal of these mechanisms is to allow the processor to resolve the outcome of a branch early.
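The Amdahl's Law remark can be made concrete with a small calculation. The sketch below assumes a simple additive model, effective CPI = ideal CPI + branch frequency x misprediction rate x penalty. The function name and all numeric values (20% branches, 10% mispredicted, 3-cycle penalty) are illustrative assumptions, not figures from the slides.

```python
def branch_stall_share(ideal_cpi, branch_freq, mispredict_rate, penalty_cycles):
    """Share of total CPI spent on branch stalls (simple additive CPI model)."""
    stall_cpi = branch_freq * mispredict_rate * penalty_cycles
    return stall_cpi / (ideal_cpi + stall_cpi)


# Hypothetical workload: 20% branches, 10% mispredicted, 3-cycle penalty.
single_issue = branch_stall_share(1.0, 0.20, 0.10, 3)   # ideal CPI = 1
four_issue = branch_stall_share(0.25, 0.20, 0.10, 3)    # ideal CPI = 0.25

print(f"single-issue: {single_issue:.1%}, four-issue: {four_issue:.1%}")
```

Under these assumed numbers the same absolute stall cost takes up a several times larger share of execution time on the four-issue machine, which is the point of the remark.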

Branch-Prediction Buffer
A small memory indexed by the lower portion of the address of the branch instruction. The memory contains bits that say whether the branch was recently taken or not. If the prediction turns out to be wrong, the prediction bits are updated. Prediction accuracy depends on the prediction bits and schemes!

1-bit scheme
One bit holds the information about the last branch direction. If the prediction turns out to be wrong, the bit is inverted and stored back.
Simple and low-cost scheme, but low accuracy: even if a branch is almost always taken, it will likely be predicted incorrectly twice, rather than once, when it is not taken.

Example
Consider a loop branch whose behavior is taken nine times in a row, then not taken once. What is the prediction accuracy for this branch?
Mispredictions occur on the first and last loop iterations. The prediction accuracy for this branch, which is taken 90% of the time, is only 80%!
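The loop-branch example can be checked with a few lines of simulation. This is a minimal sketch of a single 1-bit predictor entry, assuming the loop is executed repeatedly so that the bit starts each pass in the not-taken state left by the previous loop exit. The function name is hypothetical.

```python
def one_bit_accuracy(outcomes, prediction=False):
    """Simulate one 1-bit predictor entry; True means taken."""
    correct = 0
    for taken in outcomes:
        if prediction == taken:
            correct += 1
        prediction = taken  # the bit always records the last direction
    return correct / len(outcomes)


# Loop branch: taken nine times, then not taken once, repeated 100 times.
loop_pass = [True] * 9 + [False]
accuracy = one_bit_accuracy(loop_pass * 100)
print(f"1-bit accuracy: {accuracy:.0%}")
```

Each pass through the loop costs two mispredictions: one on the first iteration (the bit still says not taken) and one on the exit.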

A prediction must miss twice before it is changed. A 2-bit saturating counter holds the information about the recent branch behavior. The counter is incremented on a taken branch and decremented on an untaken branch.
An n-bit extension is possible, but studies of n-bit predictors have shown that 2-bit predictors do almost as well. Most systems rely on 2-bit branch predictors rather than the more general n-bit predictors.

Figure: Performance of a branch-prediction buffer with 4K entries on SPEC89.
Because the integer programs (li, eqntott, espresso, and gcc) have a higher branch frequency, prediction accuracy has a larger impact on their performance.

Figure: Prediction accuracy of a 4K-entry 2-bit buffer versus an infinite buffer.
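The 2-bit saturating counter described above can be sketched the same way. The code assumes counter values 0 to 3, with 2 and 3 meaning "predict taken", and a counter that is already warmed up to 3. These encodings and the function name are assumptions for illustration.

```python
def two_bit_accuracy(outcomes, counter=3):
    """Simulate one 2-bit saturating counter; values 2 and 3 predict taken."""
    correct = 0
    for taken in outcomes:
        predict_taken = counter >= 2
        if predict_taken == taken:
            correct += 1
        # Increment on taken, decrement on not taken, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / len(outcomes)


# Same loop branch as before: taken nine times, then not taken once.
loop_pass = [True] * 9 + [False]
accuracy = two_bit_accuracy(loop_pass * 100)
print(f"2-bit accuracy: {accuracy:.0%}")
```

The single not-taken exit only moves the counter from 3 to 2, so the first iteration of the next pass is still predicted taken and only the exit itself is mispredicted.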

Neither the number of entries nor the number of bits per entry is the limiting factor! We need to consider the correlation among branches for more accurate branch prediction!
2-bit predictor schemes use only the recent behavior of a single branch to predict the future behavior of that branch.

Example:
    if (aa==2) aa=0;
    if (bb==2) bb=0;
    if (aa!=bb) {

            DSUBUI R3,R1,#2
            BNEZ   R3,L1      ;branch b1 (aa!=2)
            DADD   R1,R0,R0   ;aa=0
    L1:     DSUBUI R3,R2,#2
            BNEZ   R3,L2      ;branch b2 (bb!=2)
            DADD   R2,R0,R0   ;bb=0
    L2:     DSUBU  R3,R1,R2   ;R3=aa-bb
            BEQZ   R3,L3      ;branch b3 (aa==bb)

Here aa and bb are held in R1 and R2. Only some combinations of outcomes for b1, b2, and b3 may occur; the others never occur! For example, if b1 and b2 are both not taken, then aa and bb are both 0, so b3 must be taken.

Example
    if (d==0) d=1;
    if (d==1)

            BNEZ   R1,L1      ;branch b1 (d!=0)
            DADDIU R1,R0,#1   ;d==0, so d=1
    L1:     DADDIU R3,R1,#-1
            BNEZ   R3,L2      ;branch b2 (d!=1)
            ...
    L2:

Possible execution sequences for the code fragment:

    Initial value of d   d==0?   b1          Value of d before b2   d==1?   b2
    0                    Yes     Not taken   1                      Yes     Not taken
    1                    No      Taken       1                      Yes     Not taken
    2                    No      Taken       2                      No      Taken

Behavior of a 1-bit predictor initialized to not taken, with d alternating between 2 and 0 (T = taken, N = not taken):

    d=?   b1 prediction   b1 action   New b1 prediction   b2 prediction   b2 action   New b2 prediction
    2     N               T           T                   N               T           T
    0     T               N           N                   T               N           N
    2     N               T           T                   N               T           T
    0     T               N           N                   T               N           N

All the branches are mispredicted!
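The all-mispredicted behavior is easy to reproduce. The sketch below models the two branches of the d example with one 1-bit predictor each, both initialized to not taken. It assumes the fragment tests d==0 and then d==1 and that d alternates between 2 and 0, as in the textbook version of this example. The function names are hypothetical.

```python
def branch_outcomes(d):
    """Return (b1_taken, b2_taken) for: if (d==0) d=1; if (d==1) ..."""
    b1_taken = d != 0          # BNEZ skips the assignment when d != 0
    if d == 0:
        d = 1
    b2_taken = d != 1          # second BNEZ is taken when d != 1
    return b1_taken, b2_taken


def one_bit_mispredictions(d_values):
    """Count mispredictions with an independent 1-bit predictor per branch."""
    predictions = [False, False]   # both predictors start at not taken
    misses = 0
    for d in d_values:
        for branch, taken in enumerate(branch_outcomes(d)):
            if predictions[branch] != taken:
                misses += 1
            predictions[branch] = taken
    return misses


misses = one_bit_mispredictions([2, 0, 2, 0])
print(f"{misses} mispredictions out of 8 branches")
```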

Prediction bits are provided for each case of the branch-correlation patterns.

    Prediction bits   Prediction if last branch not taken   Prediction if last branch taken
    N/N               N                                     N
    N/T               N                                     T
    T/N               T                                     N
    T/T               T                                     T

The action of the 1-bit predictor with 1 bit of correlation, initialized to not-taken/not-taken:

    d=?   b1 prediction   b1 action   New b1 prediction   b2 prediction   b2 action   New b2 prediction
    2     N/N             T           T/N                 N/N             T           N/T
    0     T/N             N           T/N                 N/T             N           N/T
    2     T/N             T           T/N                 N/T             T           N/T
    0     T/N             N           T/N                 N/T             N           N/T

The only misprediction is on the first iteration!
(1,1) predictor: it uses the behavior of the last branch to choose from among a pair of 1-bit branch predictors.

(m,n) predictor: use the behavior of the last m branches to choose from 2^m branch predictors, each of which is an n-bit predictor for a single branch.
A (2,2) branch-prediction buffer uses a 2-bit global history to choose from among four predictors for each branch address.
A 2-bit predictor with no global history is simply a (0,2) predictor.
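A (1,1) predictor can be sketched by giving each branch two 1-bit entries and selecting between them with the outcome of the most recently executed branch. The code below assumes the same d example (tests for d==0 and d==1, d alternating between 2 and 0) and assumes the global history starts as "not taken". All names are hypothetical.

```python
def branch_outcomes(d):
    """Return (b1_taken, b2_taken) for: if (d==0) d=1; if (d==1) ..."""
    b1_taken = d != 0
    if d == 0:
        d = 1
    b2_taken = d != 1
    return b1_taken, b2_taken


def correlating_mispredictions(d_values):
    """Count mispredictions of a (1,1) predictor on branches b1 and b2."""
    # table[branch][last_taken] is a 1-bit prediction; all start not taken.
    table = [[False, False], [False, False]]
    last_taken = False   # assumed initial 1-bit global history
    misses = 0
    for d in d_values:
        for branch, taken in enumerate(branch_outcomes(d)):
            if table[branch][last_taken] != taken:
                misses += 1
            table[branch][last_taken] = taken   # update only the selected entry
            last_taken = taken                  # shift in the newest outcome
    return misses


misses = correlating_mispredictions([2, 0, 2, 0])
print(f"{misses} mispredictions out of 8 branches")
```

Both mispredictions fall in the first iteration (one for b1, one for b2); after that the entry selected by the global history always holds the right direction for this pattern.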

To reduce the branch penalty for pipelines, we need to know what address to fetch by the end of IF, not in ID.
Get a hint from the instruction address: Branch-Target Buffer
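The idea of looking up the next fetch address from the instruction address alone can be sketched as a small tagged table. This is a simplified model, assuming 4-byte instructions, a direct-mapped table, and a buffer that stores only taken branches. The class, its methods, and the table size are hypothetical.

```python
class BranchTargetBuffer:
    """Direct-mapped table from branch PC to predicted target address."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.table = {}   # index -> (branch_pc, target_pc)

    def _index(self, pc):
        return (pc >> 2) % self.entries   # assumes 4-byte instructions

    def next_fetch_pc(self, pc):
        """Called during IF: predicted target on a hit, else the fall-through PC."""
        entry = self.table.get(self._index(pc))
        if entry is not None and entry[0] == pc:
            return entry[1]
        return pc + 4

    def update(self, pc, taken, target):
        """Called when the branch resolves: keep only taken branches."""
        index = self._index(pc)
        if taken:
            self.table[index] = (pc, target)
        elif self.table.get(index, (None, None))[0] == pc:
            del self.table[index]


btb = BranchTargetBuffer()
first_guess = btb.next_fetch_pc(0x100)    # miss: fall through
btb.update(0x100, taken=True, target=0x80)
second_guess = btb.next_fetch_pc(0x100)   # hit: predicted target
print(hex(first_guess), hex(second_guess))
```

Because the lookup needs only the PC, it can run in parallel with the instruction fetch itself, which is what lets the predicted address be ready by the end of IF.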

Penalties for all possible combinations of whether the branch is in the buffer and what it actually does, assuming we store only taken branches in the buffer:

    Instruction in buffer   Prediction   Actual branch   Penalty cycles
    Yes                     Taken        Taken           0
    Yes                     Taken        Not taken       2
    No                                   Taken           2
    No                                   Not taken       0

Example
Determine the total branch penalty for a branch-target buffer, assuming the penalty cycles for individual mispredictions from the table above. Make the following assumptions about the prediction accuracy and hit rate:
Prediction accuracy is 90% (for instructions in the buffer).
Hit rate in the buffer is 90% (for branches predicted taken).
Assume that 60% of the branches are taken.
Hints: calculate
Probability (branch in buffer, but actually not taken)
Probability (branch not in buffer, but actually taken)
Branch penalty = (the sum of the probabilities of the two events) x (penalty)

Goal: Decrease the CPI to less than one!
Allow multiple instructions to issue in a clock cycle: superscalar processors and VLIW (very long instruction word) processors.

    Common name                 Issue structure   Hazard detection   Scheduling                 Distinguishing characteristic             Examples
    Superscalar (static)        Dynamic           Hardware           Static                     In-order execution                        Sun UltraSPARC II/III
    Superscalar (dynamic)       Dynamic           Hardware           Dynamic                    Some out-of-order execution               IBM Power2
    Superscalar (speculative)   Dynamic           Hardware           Dynamic with speculation   Out-of-order execution with speculation   Pentium III/4, MIPS R10K, Alpha 21264, HP PA 8500, IBM RS64III
    VLIW/LIW                    Static            Software           Static                     No hazards between issue packets          Trimedia, i860
    EPIC*                       Mostly static     Mostly software    Mostly static              Explicit dependences marked by compiler   Itanium

*EPIC: Explicitly Parallel Instruction Computers
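The hints in the branch-target buffer penalty example earlier in this section can be followed step by step in code. The sketch below assumes a 2-cycle penalty for both penalized cases and 90% for both the hit rate and the prediction accuracy, following the textbook version of the example. These numbers and the function name are assumptions to replace with your own.

```python
def btb_branch_penalty(hit_rate, accuracy, penalty_cycles=2):
    """Expected penalty cycles per branch for a branch-target buffer."""
    # Event 1: branch found in the buffer (predicted taken) but not taken.
    p_in_buffer_not_taken = hit_rate * (1.0 - accuracy)
    # Event 2: taken branch not found in the buffer. The hit rate is quoted
    # for branches predicted taken, so this estimate uses the miss rate directly.
    p_not_in_buffer_taken = 1.0 - hit_rate
    return (p_in_buffer_not_taken + p_not_in_buffer_taken) * penalty_cycles


penalty = btb_branch_penalty(hit_rate=0.90, accuracy=0.90)
print(f"{penalty:.2f} penalty cycles per branch")
```

With the assumed values the two event probabilities are 0.09 and 0.10, giving (0.09 + 0.10) x 2 = 0.38 cycles per branch.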

Number of instructions per clock (issue width)
The necessary hazard checks among up to the maximum issue width of instructions must complete in one clock cycle!

Instruction issue mechanisms
Statically scheduled using compiler techniques. In-order execution: if some instruction in the instruction stream is dependent or doesn't meet the issue criteria, only the instructions preceding that one in the instruction sequence will be issued.
Dynamically scheduled using techniques based on Tomasulo's algorithm. Out-of-order execution: instructions will be issued as long as no hazards occur.

Instructions issue in order and all pipeline hazards are checked for at issue time.

    Instruction type       Pipe stages
    Integer instruction    IF   ID   EX   MEM  WB
    FP instruction         IF   ID   EX   EX   EX   WB
    Integer instruction         IF   ID   EX   MEM  WB
    FP instruction              IF   ID   EX   EX   EX   WB
    Integer instruction              IF   ID   EX   MEM  WB
    FP instruction                   IF   ID   EX   EX   EX   WB
    Integer instruction                   IF   ID   EX   MEM  WB
    FP instruction                        IF   ID   EX   EX   EX   WB
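The in-order issue rule can be illustrated with a toy model. The sketch below assumes an issue packet of at most one integer and one FP instruction per clock, and starts a new clock whenever the next instruction needs a busy unit or reads a register written earlier in the same packet. It ignores multi-cycle latencies and all other hazards, and the instruction encoding is hypothetical.

```python
def issue_in_order(instructions):
    """Return the issue clock of each (unit, dest, sources) instruction."""
    clocks = []
    clock, busy_units, written = 1, set(), set()
    for unit, dest, sources in instructions:
        depends = any(src in written for src in sources)
        if unit in busy_units or depends:
            # This instruction cannot join the packet: only the instructions
            # preceding it issue this clock, and it waits for the next one.
            clock += 1
            busy_units, written = set(), set()
        busy_units.add(unit)
        if dest is not None:
            written.add(dest)
        clocks.append(clock)
    return clocks


independent_pair = [("int", "R1", ["R2"]), ("fp", "F0", ["F2"])]
dependent_pair = [("int", "F0", ["R1"]), ("fp", "F4", ["F0", "F2"])]
print(issue_in_order(independent_pair), issue_in_order(dependent_pair))
```

The independent pair issues together, while the dependent pair is split across two clocks, which is the behavior the in-order rule describes.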

Goal: Instructions issue at least until the hardware runs out of reservation stations.

Example: Implementation of a two-issue dynamically scheduled processor
Consider the execution of the following simple loop, which adds a scalar in F2 to each element of a vector in memory.

    Loop:   LD     F0,0(R1)     ;F0=array element
            ADD    F4,F0,F2     ;add scalar in F2
            SD     F4,0(R1)     ;store result
            DADDIU R1,R1,#-8    ;decrement pointer by 8 bytes
            BNE    R1,R2,Loop   ;branch R1!=R2

Assumptions
It can issue two instructions on every clock cycle: one integer operation and one FP operation.
Latencies: one cycle for the integer ALU, two cycles for loads, three cycles for FP add.

[Timing table. Columns: Iteration number, Instruction, Issue at, Executes, Memory access at, Write CDB at, Comment. The cycle numbers are not recoverable.]

    Iteration   Instruction         Comment
    1           LD F0,0(R1)         First issue
    1           ADD F4,F0,F2        Wait for LD
    1           SD F4,0(R1)
    1           DADDIU R1,R1,#-8    Wait for ALU
    1           BNE R1,R2,Loop
    2           LD F0,0(R1)         Wait for BNE complete
    2           ADD F4,F0,F2        Wait for LD
    2           SD F4,0(R1)
    2           DADDIU R1,R1,#-8    Wait for ALU
    2           BNE R1,R2,Loop
    3           LD F0,0(R1)         Wait for BNE complete
    3           ADD F4,F0,F2        Wait for LD
    3           SD F4,0(R1)
    3           DADDIU R1,R1,#-8    Wait for ALU
    3           BNE R1,R2,Loop

[Resource usage table. Columns: Clock #, Int ALU, FP ALU, D-Cache, CDB. Each entry is iteration number/instruction. The clock and iteration numbers are not recoverable.]

The issue rate is higher than the instruction completion rate: the integer ALU becomes a bottleneck!

[Timing table. Columns: Iteration number, Instruction, Issue at, Executes, Memory access at, Write CDB at, Comment. The cycle numbers are not recoverable.]

    Iteration   Instruction         Comment
    1           LD F0,0(R1)         First issue
    1           ADD F4,F0,F2        Wait for LD
    1           SD F4,0(R1)
    1           DADDIU R1,R1,#-8    Executes earlier
    1           BNE R1,R2,Loop
    2           LD F0,0(R1)         Wait for BNE complete
    2           ADD F4,F0,F2        Wait for LD
    2           SD F4,0(R1)
    2           DADDIU R1,R1,#-8    Executes earlier
    2           BNE R1,R2,Loop
    3           LD F0,0(R1)         Wait for BNE complete
    3           ADD F4,F0,F2        Wait for LD
    3           SD F4,0(R1)
    3           DADDIU R1,R1,#-8    Executes earlier
    3           BNE R1,R2,Loop

[Resource usage table. Columns: Clock #, Integer ALU, Address Adder, FP ALU, D-Cache, CDB #1, CDB #2. Each entry is iteration number/instruction. The clock and iteration numbers are not recoverable.]

With the separate address adder and the second CDB, the instruction completion rate improves.

There is an imbalance between the functional unit structure of the pipeline and the example loop. This imbalance means that it is impossible to fully use the FP units. To remedy this, we would need fewer dependent integer operations per loop.
The amount of overhead per loop iteration is very high: two out of five instructions (DADDIU and BNE) are overhead.
The control hazard, which prevents us from starting the next LD before we know whether the branch was correctly predicted, causes a one-cycle penalty on every loop iteration.