COSC 6385 Computer Architecture. Instruction Level Parallelism



COSC 6385 Computer Architecture
Instruction Level Parallelism
Spring 2013

Instruction Level Parallelism
- Pipelining allows for overlapping the execution of instructions
- Limitations on the (pipelined) execution of instructions
  - Data dependencies
  - Control dependencies
- Minimizing the effect of the limitations can be done in
  - Hardware
  - Software (compiler)

Compiler techniques
- Scheduling instructions to minimize the number of stall cycles
- Loop unrolling: modify a loop such that multiple iterations of the loop are executed at once
  - Reduces the number of instructions that control the loop
  - Increases binary size

    for (i = 0; i < N; i++)
        x[i] = x[i] + s;

    for (i = 0; i < N; i += k) {
        x[i]     = x[i]     + s;
        x[i+1]   = x[i+1]   + s;
        x[i+2]   = x[i+2]   + s;
        ...
        x[i+k-1] = x[i+k-1] + s;
    }

Loop unrolling
- Might require two loops to deal with the situation where N%k != 0

    for (i = 0; i < N%k; i++) {
        x[i] = x[i] + s;
    }
    for ( ; i < N; i += k) {
        x[i]     = x[i]     + s;
        x[i+1]   = x[i+1]   + s;
        x[i+2]   = x[i+2]   + s;
        ...
        x[i+k-1] = x[i+k-1] + s;
    }
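The two-loop scheme above can be sketched as compilable C. This is a minimal illustration, not what a particular compiler emits: the unroll factor is fixed at k = 4, and the function names `add_scalar` and `add_scalar_unrolled4` are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Reference version: one element per loop iteration. */
void add_scalar(double *x, size_t n, double s) {
    assert(x != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] + s;
}

/* Unrolled by an assumed factor of k = 4.
 * The first loop handles the n % 4 leftover elements, so the main
 * loop always processes a multiple of 4 elements. */
void add_scalar_unrolled4(double *x, size_t n, double s) {
    assert(x != NULL || n == 0);
    size_t i;
    for (i = 0; i < n % 4; i++)
        x[i] = x[i] + s;
    for (; i < n; i += 4) {
        x[i]     = x[i]     + s;
        x[i + 1] = x[i + 1] + s;
        x[i + 2] = x[i + 2] + s;
        x[i + 3] = x[i + 3] + s;   /* last index is i + k - 1 */
    }
}
```

Both functions produce the same array contents for any n. The unrolled version executes the loop test and increment roughly a quarter as often, at the cost of a larger loop body.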

An example

    for (i = 1000; i > 0; i--) {
        x[i] = x[i] + s;
    }

    Loop: L.D     F0, 0(R1)
          ADD.D   F4, F0, F2
          S.D     F4, 0(R1)
          DADDUI  R1, R1, #-8
          BNE     R1, R2, Loop

Assumptions

    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU                         FP ALU                     3
    FP ALU                         Store                      2
    Load FP                        FP ALU                     1
    Load FP                        Store                      0

- Latency: number of intervening cycles between an instruction that produces a result and an instruction that uses the result
- 1 cycle branch delay

An example (III)

                                   Cycle
    Loop: L.D     F0, 0(R1)        1
          stall                    2     wait for F0 to propagate
          ADD.D   F4, F0, F2       3
          stall                    4     wait for ADD to complete
          stall                    5     wait for ADD to complete
          S.D     F4, 0(R1)        6
          DADDUI  R1, R1, #-8      7
          stall                    8     wait for R1 to propagate
          BNE     R1, R2, Loop     9
          stall                    10    branch delay slot

Rescheduling the code

                                   Cycle
    Loop: L.D     F0, 0(R1)        1
          DADDUI  R1, R1, #-8      2
          ADD.D   F4, F0, F2       3
          stall                    4
          BNE     R1, R2, Loop     5
          S.D     F4, 8(R1)        6     delayed branch slot

- Each loop iteration consists of 3 instructions of actual work (load, add, store) and 3 cycles of loop overhead

Loop unrolling

    Loop: L.D     F0, 0(R1)
          ADD.D   F4, F0, F2
          S.D     F4, 0(R1)        /* Drop DADDUI & BNE */
          L.D     F6, -8(R1)
          ADD.D   F8, F6, F2
          S.D     F8, -8(R1)       /* Drop DADDUI & BNE */
          L.D     F10, -16(R1)
          ADD.D   F12, F10, F2
          S.D     F12, -16(R1)     /* Drop DADDUI & BNE */
          L.D     F14, -24(R1)
          ADD.D   F16, F14, F2
          S.D     F16, -24(R1)
          DADDUI  R1, R1, #-32
          BNE     R1, R2, Loop

Loop unrolling (II)
- Eliminates three branches and three decrements
- Reduces loop overhead
- The previous code sequence still contains many stalls, since many operations still depend on each other
- Requires more registers
- Increases the code size

Scheduled version of the unrolled loop

    Loop: L.D     F0, 0(R1)
          L.D     F6, -8(R1)
          L.D     F10, -16(R1)
          L.D     F14, -24(R1)
          ADD.D   F4, F0, F2
          ADD.D   F8, F6, F2
          ADD.D   F12, F10, F2
          ADD.D   F16, F14, F2
          S.D     F4, 0(R1)
          S.D     F8, -8(R1)
          DADDUI  R1, R1, #-32
          S.D     F12, 16(R1)
          BNE     R1, R2, Loop
          S.D     F16, 8(R1)

Scheduled version of the unrolled loop (II)
- Non-trivial transformations in the previous code
  - Adjust the offsets of the S.D instructions that now follow the DADDUI (-16 becomes 16 and -24 becomes 8, because R1 has already been decremented by 32)
  - Determine that it is legal to move the S.D after DADDUI and BNE
  - Determine that loads and stores of different iterations can be interchanged

Branch prediction
- No instruction is allowed to initiate execution until all branches preceding the instruction have completed
- Static techniques to avoid branch hazards
  - Stall
  - Predict not taken
  - Predict taken
  - Delayed branch
  -> do not take the previous behavior of branches into account

Dynamic branch prediction
- Algorithms that use the previous executions of a branch to predict the outcome of its next execution
- Techniques for dynamic branch prediction
  - 1-bit branch prediction buffer
  - 2-bit branch prediction buffer
  - Correlating branch prediction buffer
  - Branch target buffer
  - Return address predictors
  - Tournament predictors

1-bit Branch Prediction Buffer (I)
- Branch prediction buffer: small memory area indexed by the lower portion of the address of the branch instruction
- Records whether the branch was taken the last time or not (1 bit is sufficient)
- Please note: several branches might share the same buffer entry, since the full branch instruction address is not used to index the branch prediction buffer

1-bit Branch Prediction Buffer (II)
- Limitations
  - Even for a regular loop (embedded in another large loop), the 1-bit branch prediction buffer will mispredict at least the first and the last iteration
  - First iteration: the bit has been set to not-taken by the last iteration of the previous execution of the same loop, but the branch will be taken
  - Last iteration: the bit says taken, but the branch won't be taken
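The first/last-iteration limitation can be checked with a small simulation. The sketch below models a single 1-bit entry for the loop-closing branch of an inner loop; the function name and its parameters are hypothetical, and aliasing between branches is ignored.

```c
#include <assert.h>
#include <stdbool.h>

/* Counts mispredictions of one 1-bit predictor entry for the backward
 * branch of an inner loop with `inner` iterations that is executed
 * `outer` times. The branch is taken on every iteration except the
 * last one. `bit` is the initial prediction (true = taken). */
int mispredictions_1bit(int outer, int inner, bool bit) {
    assert(outer >= 0 && inner >= 1);
    int misses = 0;
    for (int o = 0; o < outer; o++) {
        for (int it = 0; it < inner; it++) {
            bool taken = (it != inner - 1);   /* loop exit falls through */
            if (bit != taken)
                misses++;
            bit = taken;                      /* remember only the last outcome */
        }
    }
    return misses;
}
```

For example, `mispredictions_1bit(10, 8, false)` returns 20: two mispredictions for each of the 10 executions of the inner loop, although the branch is taken 7 times out of 8.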

2-bit Branch Prediction Buffer
- A prediction must miss twice before the prediction is changed
- Can be extended to n bits

    [Figure: state diagram with four states. States 11 and 10 predict taken, states 01 and 00 predict not taken. Transitions are labeled Taken / Not taken.]

Correlated branches
- For a (1,1) predictor, each branch has two different branch prediction buffers, written X / Y
  - X: predictor used in case the previous branch in the application has not been taken
  - Y: predictor used in case the previous branch in the application has been taken
- The contents of the two branch prediction buffers are determined by the branch to which they belong
- Which of the two branch prediction buffers is used depends on the outcome of the previous branch in the application
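A common way to realize a 2-bit predictor is a saturating counter. The sketch below assumes that encoding (0 and 1 predict not taken, 2 and 3 predict taken); the exact transitions of the state diagram on the slide may differ, and all names here are illustrative.

```c
#include <assert.h>
#include <stdbool.h>

/* 2-bit saturating counter (an assumed encoding):
 * 0 and 1 predict not taken, 2 and 3 predict taken. */
typedef unsigned char counter2;

bool predict2(counter2 c) {
    return c >= 2;
}

/* Move one step towards 3 on a taken branch, towards 0 otherwise. */
counter2 update2(counter2 c, bool taken) {
    assert(c <= 3);
    if (taken)
        return (counter2)(c < 3 ? c + 1 : 3);
    return (counter2)(c > 0 ? c - 1 : 0);
}

/* Same nested-loop branch as for the 1-bit case: the inner loop branch
 * is taken on every iteration except the last one. */
int mispredictions_2bit(int outer, int inner, counter2 c) {
    assert(outer >= 0 && inner >= 1);
    int misses = 0;
    for (int o = 0; o < outer; o++) {
        for (int it = 0; it < inner; it++) {
            bool taken = (it != inner - 1);
            if (predict2(c) != taken)
                misses++;
            c = update2(c, taken);
        }
    }
    return misses;
}
```

Starting from the strongly-taken state, `mispredictions_2bit(10, 8, 3)` returns 10: only the loop exit is mispredicted, because a single not-taken outcome moves the counter from 3 to 2 and the prediction stays "taken".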

Correlated branches - example

    if (d == 0) d = 1;
    if (d == 1) ...

          BNEZ    R1, L1           ! branch b1
          DADDIU  R1, R0, #1
    L1:   DADDIU  R3, R1, #-1
          BNEZ    R3, L2           ! branch b2
          ...
    L2:

    Initial value of d   d==0?   b1          Value of d before b2   d==1?   b2
    2                    No      Taken       2                      No      Taken
    0                    Yes     Not taken   1                      Yes     Not taken
    2                    No      Taken       2                      No      Taken
    0                    Yes     Not taken   1                      Yes     Not taken

Correlated branches - example

    d=?   BPB b1   b1 act.   BPB b2   b2 act.
    2     NT/NT              NT/NT

- The branch prediction buffers for the branches b1 and b2 are assumed to hold the prediction "not taken" for both options (previous branch not taken / taken)

Correlated branches - example

    d=?   BPB b1   b1 act.   BPB b2   b2 act.
    2     NT/NT              NT/NT

- Assume the BPB for b1 uses its "previous branch not taken" predictor, because the previous branch in the application has not been taken
- BPB for b1 predicts that b1 will not be taken

Correlated branches - example

    d=?   BPB b1   b1 act.   BPB b2   b2 act.
    2     NT/NT    T         NT/NT

- BPB for b1 predicts that b1 will not be taken
- b1 is taken (see table for d=2)

Correlated branches - example

    d=?   BPB b1   b1 act.   BPB b2   b2 act.
    2     NT/NT    T         NT/NT
          T/NT

- Update the "previous branch has not been taken" part of the BPB for b1 to Taken
- Because b1 has been taken, the "last branch has been taken" part of BPB b2 will be used
- BPB b2 predicts that b2 will not be taken

Correlated branches - example

    d=?   BPB b1   b1 act.   BPB b2   b2 act.
    2     NT/NT    T         NT/NT    T
          T/NT               NT/T

- b2 is taken (see table for d=2)
- Update the "previous branch has been taken" part of the BPB for b2 to Taken
- Because b2 has been taken, the "last branch has been taken" part of BPB b1 will be used
- BPB b1 predicts that b1 will not be taken

Correlated branches - example

    d=?   BPB b1   b1 act.   BPB b2   b2 act.
    2     NT/NT    T         NT/NT    T
    0     T/NT     NT        NT/T

- b1 is not taken (see table for d=0): matches the prediction!
- The update of BPB b1 does not modify any entry
- Because b1 has not been taken, the "last branch has not been taken" part of BPB b2 will be used
- BPB b2 predicts that b2 will not be taken

Correlated branches
- A (2,1) correlated branch predictor
  - Uses the behavior of the last 2 branches to choose from 2^2 = 4 different predictions
  - Uses a 1-bit predictor for each of the 4 prediction buffers, written A / B / C / D
    - A: predictor used in case the previous 2 branches in the application have both not been taken (00)
    - B: predictor used in case the second-last branch was not taken and the last branch was taken (01)
    - C: predictor used in case the second-last branch was taken and the last branch was not taken (10)
    - D: predictor used in case the previous 2 branches in the application have both been taken (11)
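The (1,1) trace for b1 and b2 can be replayed in code. The sketch below is an illustration under the same assumptions as the example (all predictor bits start as "not taken", and the branch before the first b1 was not taken); the type `bpb11` and the function `simulate_b1_b2` are names invented here.

```c
#include <assert.h>
#include <stdbool.h>

/* One (1,1) entry: pred[0] is used when the previous branch was not
 * taken, pred[1] when it was taken. Each holds a 1-bit prediction
 * (true = taken). */
typedef struct { bool pred[2]; } bpb11;

/* Runs the code fragment "if (d==0) d=1; if (d==1) ..." once for each
 * value in d_values and returns the total number of mispredictions. */
int simulate_b1_b2(const int *d_values, int n) {
    assert(d_values != 0 || n == 0);
    bpb11 b1 = {{false, false}}, b2 = {{false, false}};
    bool last = false;                /* outcome of the previous branch */
    int misses = 0;
    for (int k = 0; k < n; k++) {
        int d = d_values[k];

        bool t1 = (d != 0);           /* b1: BNEZ R1, L1 */
        if (b1.pred[last] != t1)
            misses++;
        b1.pred[last] = t1;           /* update only the entry that was used */
        last = t1;

        if (d == 0)
            d = 1;

        bool t2 = (d != 1);           /* b2: BNEZ R3, L2 */
        if (b2.pred[last] != t2)
            misses++;
        b2.pred[last] = t2;
        last = t2;
    }
    return misses;
}
```

For the sequence d = 2, 0, 2, 0 the function returns 2: only the very first executions of b1 and b2 are mispredicted, after which the buffers hold T/NT for b1 and NT/T for b2 and every later prediction is correct.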

Correlated branches
- How do we know which of the four sections of our branch predictor to use?
- Need to record the behavior of all branches in the application

    Initial value of d   d==0?   b1          Value of d before b2   d==1?   b2
    2                    No      Taken       2                      No      Taken
    0                    Yes     Not taken   1                      Yes     Not taken
    2                    No      Taken       2                      No      Taken
    0                    Yes     Not taken   1                      Yes     Not taken

Correlated branches
- For a (2,n) branch predictor, the last two branches are relevant
- 2-bit global branch history, e.g. 11 (implemented using a 2-bit shift register)

Correlated Branches
- Idea: taken/not taken of recently executed branches is related to the behavior of the next branch (as well as the history of that branch's behavior)
- The behavior of recent branches then selects between, say, 4 predictions of the next branch, updating just that prediction
- (2,2) predictor: 2-bit global, 2-bit local

    [Figure: the branch address (4 bits) selects a row of 2-bit per-branch local predictors; the 2-bit global branch history (01 = not taken then taken) selects one predictor in that row, which supplies the prediction.]

Slide based on a lecture by David A. Patterson, University of California, Berkeley

Accuracy of Different Schemes

    [Figure: bar chart of the frequency of mispredictions (0% to 20%) for the benchmarks nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, espresso, eqntott and li. Three schemes are compared: 4,096 entries with 2 bits per entry, unlimited entries with 2 bits per entry, and 1,024 entries (2,2).]

Slide based on a lecture by David A. Patterson, University of California, Berkeley
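The (2,2) organization described above can be sketched as a small table of 2-bit counters indexed by branch address and global history. This is one possible layout, not the slide's exact hardware: the saturating-counter update, the use of `pc >> 2` as the row index (assuming 4-byte instructions), and all names are assumptions of this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ADDR_BITS 4   /* low branch-address bits used as row index */
#define HIST_BITS 2   /* length of the global branch history */

typedef struct {
    uint8_t  counter[1 << ADDR_BITS][1 << HIST_BITS];  /* 2-bit counters, 0..3 */
    unsigned history;                                  /* last HIST_BITS outcomes */
} predictor22;

static unsigned row_of(uint32_t pc) {
    return (pc >> 2) & ((1u << ADDR_BITS) - 1);
}

/* The global history picks one of the four counters in the branch's row. */
bool p22_predict(const predictor22 *p, uint32_t pc) {
    assert(p != 0);
    return p->counter[row_of(pc)][p->history] >= 2;
}

/* Update only the counter that was selected, then shift the outcome into
 * the history register (older outcome ends up in the higher bit, so
 * 01 means "not taken, then taken"). */
void p22_update(predictor22 *p, uint32_t pc, bool taken) {
    assert(p != 0);
    uint8_t *c = &p->counter[row_of(pc)][p->history];
    if (taken) {
        if (*c < 3) (*c)++;
    } else {
        if (*c > 0) (*c)--;
    }
    p->history = ((p->history << 1) | (taken ? 1u : 0u))
                 & ((1u << HIST_BITS) - 1);
}
```

A zero-initialized `predictor22` starts with every counter at "strongly not taken" and an all-not-taken history. With 4 address bits and 2 history bits the table holds 16 x 4 = 64 counters, that is 128 bits of prediction state.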

Branch Target Buffers
- Branch Target Buffer (BTB): the address of the branch is used as an index to get the prediction AND the branch target address (if taken)

    [Figure: the PC of the instruction in FETCH is compared (=?) against the stored branch PC of each entry. Each entry holds the branch PC, the predicted PC, and extra prediction state bits.]

  - No: branch not predicted, proceed normally (Next PC = PC+4)
  - Yes: instruction is a branch; use the predicted PC as next PC

Slide based on a lecture by David A. Patterson, University of California, Berkeley

Need Address at Same Time as Prediction (II)
- Send PC to memory and branch target buffer (BTB)
- Entry found in BTB?
  - No: is the instruction a taken branch?
    - No: normal execution
    - Yes: enter branch address and next PC into BTB
  - Yes: send out predicted PC. Taken branch?
    - No: mispredicted branch, kill fetched instruction
    - Yes: branch correctly predicted
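The BTB decision flow can be expressed as a lookup at fetch time plus an update once the branch outcome is known. The following is a minimal direct-mapped sketch: the table size, the 4-byte instruction width, and the choice to invalidate an entry after a not-taken outcome are assumptions of this example, and the names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BTB_ENTRIES 16   /* assumed size */

typedef struct {
    bool     valid;
    uint32_t branch_pc;      /* full branch address, compared at fetch */
    uint32_t predicted_pc;   /* target to fetch from if predicted taken */
} btb_entry;

typedef struct { btb_entry e[BTB_ENTRIES]; } btb;

static unsigned btb_index(uint32_t pc) {
    return (pc >> 2) % BTB_ENTRIES;
}

/* Fetch stage: a matching entry supplies the next PC, otherwise PC + 4. */
uint32_t btb_next_pc(const btb *t, uint32_t pc) {
    assert(t != 0);
    const btb_entry *en = &t->e[btb_index(pc)];
    if (en->valid && en->branch_pc == pc)
        return en->predicted_pc;
    return pc + 4;
}

/* Called when the branch at `pc` has been resolved. */
void btb_resolve(btb *t, uint32_t pc, bool taken, uint32_t target) {
    assert(t != 0);
    btb_entry *en = &t->e[btb_index(pc)];
    bool hit = en->valid && en->branch_pc == pc;
    if (taken) {
        /* Taken branch: enter (or refresh) branch address and target. */
        en->valid = true;
        en->branch_pc = pc;
        en->predicted_pc = target;
    } else if (hit) {
        /* Predicted taken but not taken: drop the entry (assumed policy). */
        en->valid = false;
    }
}
```

Because the full branch address is stored and compared, two branches that map to the same index never use each other's target; the later one simply misses and falls back to PC + 4.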

Return Addresses and Tournament Predictors
- Return address
  - A return is a register-indirect branch: its target address is hard to predict
  - Save return addresses in a small stack
  - 8 to 16 entries reduce the miss rate dramatically
- Tournament predictors
  - Combine multiple prediction algorithms
  - Keep track of which predictor was the most accurate for the last executions of a branch
  - Allow different prediction algorithms to be used for different types of branches
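A return address predictor can be sketched as a small circular stack: a call pushes its return address, and a return pops the most recent one as the predicted target. The size of 8 entries, the overwrite-oldest overflow policy, and the names below are assumptions of this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RAS_ENTRIES 8   /* assumed size, within the 8 to 16 range */

typedef struct {
    uint32_t addr[RAS_ENTRIES];
    unsigned top;     /* index of the next free slot, wraps around */
    unsigned count;   /* number of valid entries, at most RAS_ENTRIES */
} ras;

/* On a call: remember where the matching return should go.
 * When the stack is full, the oldest entry is overwritten. */
void ras_push(ras *r, uint32_t return_addr) {
    assert(r != 0);
    r->addr[r->top] = return_addr;
    r->top = (r->top + 1) % RAS_ENTRIES;
    if (r->count < RAS_ENTRIES)
        r->count++;
}

/* On a return: pop the most recent return address as the prediction.
 * Returns false if the stack is empty and no prediction is available. */
bool ras_pop(ras *r, uint32_t *predicted) {
    assert(r != 0 && predicted != 0);
    if (r->count == 0)
        return false;
    r->top = (r->top + RAS_ENTRIES - 1) % RAS_ENTRIES;
    r->count--;
    *predicted = r->addr[r->top];
    return true;
}
```

As long as calls are nested no deeper than `RAS_ENTRIES`, every return is predicted correctly, which is why such a small structure is effective.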
