Chapter 4. Advanced Pipelining and Instruction-Level Parallelism
In-Cheol Park, Dept. of EE, KAIST
Outline
- Instruction-level parallelism
- Loop unrolling
- Dependence: data / name / control dependence
- Loop-level parallelism
- Dynamic scheduling: scoreboarding / Tomasulo approach
- Hardware branch prediction: branch prediction buffer / branch target buffer
- Multiple issue
- Compiler support for ILP: software pipelining / trace scheduling
- Hardware support for parallelism
- Studies of ILP
- Real example: PowerPC 620
Instruction-level parallelism (ILP)
- ILP: parallelism among instruction sequences
- How to exploit it: pipelining, multiple-issue processors
- How much is available: limited by the number of instructions in a basic block; with a branch frequency of about 15-20%, a basic block holds only 5-7 instructions on average
- Therefore ILP must be exploited across multiple basic blocks
- Loop-level parallelism: loop unrolling, either statically or dynamically; vector instructions on a vector processor
Techniques for exploiting ILP
- Dynamic scheduling: scoreboarding / Tomasulo approach
- Hardware branch prediction: branch prediction buffer / branch target buffer
- Multiple issue: superscalar / VLIW
- Compiler support for ILP: software pipelining / trace scheduling
- Hardware support for parallelism: conditional instructions, poison bits / boosting, Tomasulo + reorder buffer
Running example

    for (i = 1; i <= 1000; i++)
        x[i] = x[i] + s;

    Loop: LD    F0, 0(R1)    ; F0 = array element
          ADDD  F4, F0, F2   ; add scalar in F2
          SD    0(R1), F4    ; store result
          SUBI  R1, R1, #8   ; decrement pointer (8 bytes per DW)
          BNEZ  R1, Loop     ; branch if R1 != zero

Without scheduling (clock cycle issued on the right):

    Loop: LD    F0, 0(R1)    1
          stall              2
          ADDD  F4, F0, F2   3
          stall              4
          stall              5
          SD    0(R1), F4    6
          SUBI  R1, R1, #8   7
          BNEZ  R1, Loop     8
          stall              9

With scheduling:

    Loop: LD    F0, 0(R1)
          stall
          ADDD  F4, F0, F2
          SUBI  R1, R1, #8
          BNEZ  R1, Loop     ; delayed branch
          SD    8(R1), F4    ; offset altered; interchanged with SUBI
Unrolling 4 iterations

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4     ; drop SUBI & BNEZ
          LD    F6, -8(R1)
          ADDD  F8, F6, F2
          SD    -8(R1), F8    ; drop SUBI & BNEZ
          LD    F10, -16(R1)
          ADDD  F12, F10, F2
          SD    -16(R1), F12  ; drop SUBI & BNEZ
          LD    F14, -24(R1)
          ADDD  F16, F14, F2
          SD    -24(R1), F16  ; drop SUBI & BNEZ
          SUBI  R1, R1, #32
          BNEZ  R1, Loop

After scheduling:

    Loop: LD    F0, 0(R1)
          LD    F6, -8(R1)
          LD    F10, -16(R1)
          LD    F14, -24(R1)
          ADDD  F4, F0, F2
          ADDD  F8, F6, F2
          ADDD  F12, F10, F2
          ADDD  F16, F14, F2
          SD    0(R1), F4
          SD    -8(R1), F8
          SD    -16(R1), F12
          SUBI  R1, R1, #32
          BNEZ  R1, Loop
          SD    8(R1), F16    ; 8 - 32 = -24; fills the branch delay slot
Why unrolling helps
- Reduces loop overhead by eliminating branches and improving scheduling
- Allows instructions from different iterations to be scheduled together, exposing more computations that can be overlapped to minimize stalls
- Uses different registers for each iteration, which increases the number of registers required
- When the upper bound n of the loop is not known at compile time, make k copies of the loop body and execute (n mod k) iterations of the original body followed by (n / k) iterations of the unrolled body
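The unknown-upper-bound trick can be sketched in a few lines; `body` here is a hypothetical stand-in for one iteration of the original loop:

```python
def strip_mined_loop(n, k, body):
    """Execute `body` n times: first the (n mod k) leftover iterations
    with the original body, then (n // k) passes of a k-way unrolled copy."""
    i = 0
    for _ in range(n % k):       # leftover iterations, original loop body
        body(i)
        i += 1
    for _ in range(n // k):      # unrolled loop: k copies of the body per pass
        for _ in range(k):
            body(i)
            i += 1
    return i                     # total iterations executed

# For the running example, n = 1000 and k = 4: 0 leftover + 250 unrolled passes
assert strip_mined_loop(1000, 4, lambda i: None) == 1000
```

For n = 7 and k = 4 this runs 3 leftover iterations plus one unrolled pass, covering all 7 elements in order.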
Data dependence (RAW)
- Data-dependent instructions cannot execute simultaneously
- Whether a dependence actually causes a hazard or a stall is a property of the pipeline
- To avoid stalls:
  - Maintain the dependence but avoid the hazard: data forwarding, scheduling (code rearrangement)
  - Eliminate the dependence by transforming the code
- A data dependence that flows through memory is difficult to detect
Name dependence (WAR, WAW)
- No value is transmitted between the instructions
- Two types: anti-dependence (WAR) and output dependence (WAW)
- Removed by register renaming, either by the compiler or in hardware; only true dependences remain
Control dependence
- Every instruction, except those in the first basic block, is control dependent on some set of branches
- An instruction that is control dependent on a branch cannot be moved before or after the branch
- Can be worked around with delayed branches, speculation, and loop unrolling
Loop-carried dependence: a dependence between different iterations of a loop.

    for (i = 1; i <= 100; i++)
        x[i] = x[i] + s;              /* no loop-carried dependence: parallel */

    for (i = 1; i <= 100; i++) {
        a[i]   = a[i] + b[i];
        b[i+1] = c[i] + d[i];         /* b: loop-carried dependence */
    }

The second loop can be transformed so that the dependence stays within a single iteration:

    a[1] = a[1] + b[1];
    for (i = 1; i <= 99; i++) {
        b[i+1] = c[i] + d[i];
        a[i+1] = a[i+1] + b[i+1];
    }
    b[101] = c[100] + d[100];
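The transformation can be checked numerically; this sketch uses plain Python lists (1-indexed by convention, index 0 unused) to confirm that both versions produce identical a and b:

```python
import random

def original(a, b, c, d):
    for i in range(1, 101):
        a[i] = a[i] + b[i]
        b[i + 1] = c[i] + d[i]        # loop-carried: read back in iteration i+1

def transformed(a, b, c, d):
    a[1] = a[1] + b[1]                # peeled first statement
    for i in range(1, 100):
        b[i + 1] = c[i] + d[i]        # dependence now within one iteration
        a[i + 1] = a[i + 1] + b[i + 1]
    b[101] = c[100] + d[100]          # peeled last statement

random.seed(0)
c = [random.random() for _ in range(102)]
d = [random.random() for _ in range(102)]
a1 = [random.random() for _ in range(102)]
b1 = [random.random() for _ in range(102)]
a2, b2 = list(a1), list(b1)
original(a1, b1, c, d)
transformed(a2, b2, c, d)
assert a1 == a2 and b1 == b2          # same results, element by element
```

The results match exactly because both versions perform the same additions on the same values, just regrouped across iteration boundaries.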
Dynamic scheduling
- Static scheduling: compiler techniques to reduce hazards and stalls
- Dynamic scheduling: hardware rearranges instruction execution to reduce stalls
  - Out-of-order execution, out-of-order completion
  - Instructions begin execution as soon as their operands are available
  - Drawback: imprecise exceptions
- The ID stage is split into two stages:
  - Issue: decode instructions, check for structural hazards
  - Read operands: wait until no data hazards remain, then read the operands
- Two techniques: scoreboarding and the Tomasulo approach (reservation stations)
Scoreboarding
- Used in the CDC 6600; the goal is to execute an instruction as early as possible
- Instructions can be issued and executed if they do not depend on any active or stalled instruction
- Gave a performance improvement of 1.7x for FORTRAN programs and 2.5x for hand-coded assembly programs
- The scoreboard:
  - Centralizes all hazard detection and resolution
  - Constructs a record of data dependences and monitors every change in the hardware
  - Determines when an instruction can read its operands and begin execution
  - Controls when an instruction can write its result into the destination
Four steps of scoreboard control
- Issue: check for structural and WAW hazards; if either exists, the issue stalls
- Read operands: read source operands when they are available and begin execution; this resolves RAW hazards dynamically (there is no forwarding of data)
- Execute: the functional unit notifies the scoreboard when execution completes
- Write result: check for WAR hazards, and stall the instruction if necessary
What limits scoreboard performance
- The amount of parallelism available among the instructions, i.e., within a basic block
- The scoreboard size, which determines the instruction window: the set of instructions examined as candidates for potential execution
- The number and types of functional units
- The presence of anti-dependences and output dependences
Tomasulo approach
- Used by the IBM 360/91 floating-point unit (1967)
- Scoreboarding + register renaming: eliminates WAW and WAR hazards by renaming registers
- Reservation stations:
  - Sit in front of each functional unit and hold issued instructions
  - Distribute hazard detection and execution control
  - Fetch and buffer each operand as soon as it is available
- A result is passed directly over the common data bus (CDB) to the reservation stations waiting for it
Three steps of the Tomasulo algorithm
- Issue: get an instruction from the instruction queue and issue it if there is an empty reservation station; send operands that are already in registers, or otherwise a tag denoting the reservation station that will produce each operand; if no station is empty, stall until one is free
- Execute: if an operand is not yet available, monitor the CDB while waiting for it to be computed; when it appears, place it into the reservation station; when both operands are available, execute the operation
- Write result: write the result on the CDB, and from there into the registers and any reservation stations waiting for it
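The tag mechanism at the heart of these steps can be sketched as follows; station and register names are illustrative, not the 360/91's actual layout:

```python
# Each register holds either ('val', number) or ('tag', station_name).
regs = {r: ('val', 0.0) for r in ['F0', 'F2', 'F4']}
stations = {}                        # station name -> (op, operand1, operand2)

def issue(name, op, dst, s1, s2):
    """Issue to station `name`: capture operands as values if ready,
    otherwise as tags of the stations that will produce them."""
    stations[name] = (op, regs[s1], regs[s2])
    regs[dst] = ('tag', name)        # rename: dst now waits on this station

def broadcast(name, value):
    """Write result on the CDB: wake up waiting stations and registers."""
    for sname, (op, a, b) in stations.items():
        a = ('val', value) if a == ('tag', name) else a
        b = ('val', value) if b == ('tag', name) else b
        stations[sname] = (op, a, b)
    for r, status in regs.items():
        if status == ('tag', name):
            regs[r] = ('val', value)
    del stations[name]

regs['F0'] = ('val', 2.0)
regs['F2'] = ('val', 3.0)
issue('Add1', '+', 'F4', 'F0', 'F2')   # F4 renamed to tag Add1
issue('Mul1', '*', 'F0', 'F4', 'F2')   # Mul1 captures Add1's tag for F4
broadcast('Add1', 2.0 + 3.0)           # CDB: Mul1 and F4 pick up 5.0
```

Because Mul1 captured a tag rather than a register name, later writes to F4 or F0 cannot disturb it: WAR and WAW hazards disappear with the renaming.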
Hardware branch prediction
- Prediction is made dynamically at run time, so it changes if the branch changes its behavior while the program is running
- Predicts both the branch direction and the branch target address
- Usually implemented with branch prediction buffers
Branch prediction buffer
- A small memory indexed by the low-order bits of the branch instruction's address
  - Aliasing occurs between branch instructions that share those low-order bits
- Accessed with the instruction address during the IF stage; holds recent branch history
- One bit of history mispredicts twice for every change in direction:

      Actual:  T T T T T T N T T T
      Predict: ? T T T T T T N T T

- An n-bit saturating up/down counter does better; a 2-bit counter is enough for almost all applications
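A minimal sketch of the 2-bit counter (the state encoding is the usual 0-3 saturating scheme):

```python
def predict_2bit(outcomes, state=3):
    """2-bit saturating counter: states 0-1 predict not-taken, 2-3 taken.
    The counter moves one step toward each observed outcome."""
    mispredicts = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            mispredicts += 1
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return mispredicts

# The pattern above: a single N in a run of T's costs only one misprediction
assert predict_2bit([True]*6 + [False] + [True]*3) == 1
```

On the same sequence a 1-bit predictor mispredicts twice: once at the N and again at the T that follows it.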
Correlating predictors
- Integer programs have a higher branch frequency but lower prediction accuracy than FP programs
- The buffer hit rate is not the limiting factor, and increasing the number of bits per predictor has little impact
- Correlating predictor (two-level predictor): the current branch's outcome is heavily affected by the direction of the most recent branches. Example:

      if (d == 0)      /* branch b1 */
          d = 1;
      if (d == 1)      /* branch b2: correlated with b1 */
          ...
The same example in assembly:

        BNEZ  R1, L1       ; branch b1 (taken if d != 0)
        ADDI  R1, R0, #1   ; d == 0, so d = 1
    L1: SUBI  R3, R1, #1
        BNEZ  R3, L2       ; branch b2 (taken if d != 1)
        ...
    L2:

If b1 is not taken, then d = 1 afterwards and b2 cannot be taken: b2's outcome is fully determined by b1's.
Two-level predictors
- Indexed by the branch address combined with the direction history of the last branches
- (m, n) predictor:
  - m: the last m branches are used (usually m > 10); the direction history is called a pattern
  - n: each entry is an n-bit counter (usually n = 2)
- A purely pattern-based predictor, which does not use the branch address to index the table, produces quite good results
- gshare (widely used): (branch address XOR history pattern) is used to index the predictor
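A gshare sketch; the history length and table size here are illustrative, not from any particular machine:

```python
class Gshare:
    def __init__(self, hist_bits=12):
        self.mask = (1 << hist_bits) - 1
        self.table = [2] * (1 << hist_bits)   # 2-bit counters, init weakly taken
        self.history = 0                      # global direction history (pattern)

    def index(self, pc):
        # branch address XOR history pattern selects the counter
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        return self.table[self.index(pc)] >= 2

    def update(self, pc, taken):
        i = self.index(pc)
        self.table[i] = min(self.table[i] + 1, 3) if taken else max(self.table[i] - 1, 0)
        self.history = ((self.history << 1) | int(taken)) & self.mask

g = Gshare(hist_bits=4)
for _ in range(10):
    g.update(0x40, False)         # a branch that is never taken
assert g.predict(0x40) is False   # learned not-taken for this (pc, history) pair
```

The XOR folds the pattern into the index, so the same branch can train different counters under different histories, which is what captures correlation.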
Branch target buffer (BTB)
- A cache that stores the predicted address of the next instruction after a branch
- Accessed during the IF stage, i.e., before the instruction is decoded, so the hardware must learn from the BTB itself that the fetched instruction is a branch
- Stores only predicted-taken branches, so a hit also says the fetched instruction is a taken branch
- With a 2-bit counter predictor, use both a target buffer and a separate prediction buffer
BTB variations
- Store one or more target instructions instead of the target PC
  - Branch folding: if the only function of the branch is to change the PC, the branch itself can be eliminated
- Return address stack
  - A BTB predicts the target of a return instruction poorly: returns are indirect jumps whose destination varies at run time, and the majority of indirect jumps are procedure returns
  - A small stack that caches the most recent return addresses predicts them well
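A return address stack can be sketched as a small stack that evicts its oldest entry when full (the depth is illustrative):

```python
class ReturnStack:
    """Small stack caching the most recent return addresses."""
    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def call(self, return_addr):
        if len(self.stack) == self.depth:
            self.stack.pop(0)            # full: discard the oldest entry
        self.stack.append(return_addr)

    def predict_return(self):
        return self.stack.pop() if self.stack else None  # None: no prediction

rs = ReturnStack()
rs.call(0x100)                           # outer call pushes its return address
rs.call(0x200)                           # nested call
assert rs.predict_return() == 0x200      # innermost return predicted first
assert rs.predict_return() == 0x100
```

As long as call depth stays within the stack depth, every return is predicted exactly; deeper nesting loses only the oldest addresses.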
Superscalar
- Issues a varying number of independent instructions per clock, subject to hardware constraints
- Simplest organization: one integer + one floating-point instruction per cycle
  - Requires more ports in the register file
  - Harder in a CISC whose instruction length is variable
- The effect of stalls or delays is more severe than in scalar machines
- Scheduling can be static (by the compiler) or dynamic (by hardware)
  - Issue two or more instructions to reservation stations per cycle, or pipeline the issue stage so that it runs two or more times faster
  - For load/store instructions, out-of-order execution is undesirable, so they go through a queue (cf. decoupled architectures)
VLIW (Very Long Instruction Word)
- Issues a fixed number of operations formatted as one large instruction or as a fixed instruction packet
- It is extremely difficult to determine in hardware whether multiple instructions are independent, so an efficient compiler is essential to build long sequences of instructions that can execute in parallel
- Limitations:
  - Needs a large number of functional units, and much more parallelism than a single VLIW instruction holds
  - Needs a large number of ports on the register file and memory, and large memory bandwidth
  - Instruction words are often not full, wasting instruction bits and functional units
  - No binary code compatibility across implementations
Detecting loop-level parallelism
- A loop is parallel unless it has a recurrence: a variable defined based on its own value in an earlier iteration
- Dependence distance: the larger the distance, the more potential parallelism
- Array references: assume the indices are affine (index = a*i + b)
  - GCD test: for two indices a*i + b and c*j + d, there is no dependence if GCD(a, c) does not divide (d - b)
- Renaming removes name dependences
- Dependence analysis is challenging: pointers, indirect indexing, and lack of run-time information (a dependence may be possible yet never occur at run time)
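The GCD test is easy to state in code; note that it is conservative: failing the test means a dependence is possible, not that one exists.

```python
from math import gcd

def no_dependence(a, b, c, d):
    """Accesses x[a*i + b] and x[c*j + d] with affine indices:
    no dependence is possible if gcd(a, c) does not divide (d - b)."""
    return (d - b) % gcd(a, c) != 0

# x[2*i] vs x[2*j + 1]: even vs odd indices can never collide
assert no_dependence(2, 0, 2, 1) is True
# x[2*i] vs x[4*j + 2]: gcd(2, 4) = 2 divides 2, so a dependence is possible
assert no_dependence(2, 0, 4, 2) is False
```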
Software pipelining
- Reorganizes a loop so that each iteration of the software-pipelined code is made from instructions chosen from different iterations of the original loop
- Interleaves instructions from different iterations without unrolling the loop
- Requires prolog (start-up) code and epilog (clean-up) code
Trace scheduling
- Finds parallelism across branches; makes the frequent paths run faster
- Trace selection: choose a likely sequence of basic blocks
- Trace compaction: global code scheduling, with code motion and speculative execution
- Compensation (book-keeping) code is often inserted on the off-trace paths to ensure correctness; one example is an inverse operation that undoes a computation moved above a branch
Conditional instructions
- If the condition is true, the instruction executes normally; otherwise execution continues as if it were a NOP
- Conditional move instructions are employed in recent processors
Compiler support for speculation
- Ignore exceptions for speculative instructions: simply return an undefined value for any exception that would cause termination; renaming must still be done in software
- Poison bits: a poison bit is added to every register, and a bit is added to every instruction to indicate whether it is speculative; renaming is done in software
- Boosting: provides the renaming and buffering in hardware, much as Tomasulo's algorithm does
  - A boosted instruction is labeled with the predicted direction of the branch it was moved above
  - Results of boosted instructions are forwarded to and used by other boosted instructions
Hardware-based speculation
- Dynamic branch prediction + speculation + dynamic scheduling
- Advantages:
  - Can disambiguate memory references dynamically
  - Hardware branch prediction is superior to software prediction
  - Completely precise exceptions
  - Does not require compensation code
  - Good performance across different implementations of an architecture, i.e., binary compatibility
- Implemented as Tomasulo's algorithm + a reorder buffer
Four steps with a reorder buffer
- Issue (dispatch): issue if there is an empty reservation station and an empty slot in the reorder buffer
- Execute: monitor the CDB while waiting for operands to be computed, then execute the operation
- Write result: write the result on the CDB and into the reorder buffer
- Commit: update the register with the result when the instruction reaches the head of the reorder buffer; if a mispredicted branch reaches the head, the reorder buffer is flushed and execution restarts at the correct successor of the branch
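The commit step can be sketched with a FIFO; the entries and flush behavior are simplified (a real reorder buffer also holds PCs, exception flags, and so on):

```python
from collections import deque

regs = {'R1': 0, 'R2': 0}
rob = deque()                       # in-order FIFO of speculative results

def write_result(entry):            # entry: (dest_reg, value, mispredicted)
    rob.append(entry)

def commit():
    """Retire from the head only; a mispredicted branch flushes everything behind it."""
    dest, value, mispredicted = rob.popleft()
    if mispredicted:
        rob.clear()                 # squash all younger (wrong-path) instructions
        return 'flush'
    regs[dest] = value              # architectural state is updated in order
    return 'commit'

write_result(('R1', 42, False))
write_result((None, None, True))    # mispredicted branch
write_result(('R2', 7, False))      # wrong-path instruction behind the branch
assert commit() == 'commit' and regs['R1'] == 42
assert commit() == 'flush' and len(rob) == 0
assert regs['R2'] == 0              # wrong-path result never reached the registers
```

Because registers change only at commit, the machine can always roll back to a precise state at any mispredicted branch or exception.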
Studies of ILP: the ideal (perfect) processor
- Perfect register renaming, branch prediction, jump prediction, and memory-address alias analysis
- Unlimited issue: enough functional units for every ready instruction to issue, looking arbitrarily far ahead for instructions to issue
- One-cycle execution latency
Constraints studied, added one at a time
- Limited window size: a 2K-entry window with issue of up to 64 instructions
- Realistic branch and jump prediction: an aggressive predictor with 8K entries plus 2K jump and return predictors
- Finite renaming registers: 256 renaming registers
- Imperfect alias analysis
Results figure: 2K window + issue of 64 instructions
Results figure: 2K window + issue of 64 instructions + aggressive predictor with 8K entries + 2K jump and return predictors
Results figure: 2K window + issue of 64 instructions + aggressive predictor with 8K entries + 2K jump and return predictors + 256 renaming registers
A realizable processor
- Up to 64 instruction issues per clock
- A selective predictor with 1K entries and a 16-entry return predictor
- Perfect disambiguation of memory references, done dynamically
- Register renaming with 64 additional integer registers and 64 additional FP registers
- No cache misses, and unit latencies
PowerPC 620
- 64-bit advanced superscalar processor: can fetch, issue, and complete up to 4 instructions per cycle
- Speculative execution past up to 4 unresolved branches
- Register renaming: 8 extra integer registers and 8 extra FP registers
- Reservation stations with a 16-entry reorder buffer
- Six execution units:
  - 3 integer units with 2 reservation stations each: two simple integer units (XSU0, XSU1) and one complex integer function unit (XCFXU)
  - 1 branch unit (BPU) with 2 reservation stations
  - 1 load/store unit (LSU) with 3 reservation stations
  - 1 FP unit (FPU)
- 8-entry instruction queue
- Static/dynamic branch prediction in the fetch and dispatch stages:
  - 256-entry, 2-way set-associative branch target address cache
  - 2048-entry branch history table
- Caches: 32KB 8-way set-associative non-blocking data cache, and 32KB 8-way set-associative instruction cache
- Bus interface: 40-bit address bus and 128-bit data bus, split transactions, pipelined snooping bus protocol, MESI coherence
Fetch stalls
- Branch misprediction, despite a 256-entry 2-way set-associative BTB, a 2K-entry branch prediction buffer, and a return stack
- Instruction cache misses: not serious because of a perfect off-chip cache and partial cache-line fills
Issue stalls
- No reservation station available
- No rename registers available
- Reorder buffer full
- Contention for the same functional unit
- Miscellaneous: shortage of register-file read ports, serialization on special registers
Execution stalls and overall losses
- Source operand unavailable: insufficient ILP, or too few buffers
- Functional unit unavailable: fixable by increasing the number of FUs or adding pipelining to the un-pipelined units
- Out-of-order execution disallowed: serialization
- Overall, the biggest losses are:
  1. Functional-unit limitations, especially the LSU
  2. Losses in fetch, issue, and execution
  3. ILP limitations and finite buffering