INSTRUCTION LEVEL PARALLELISM Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 2 and Appendix H, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011 ADVANCED COMPUTER ARCHITECTURES ARQUITECTURAS AVANÇADAS DE COMPUTADORES (AAC)
Outline 2 Increasing processing speedup: Speedup from pipelining Speedup from instruction scheduling and loop unrolling Instruction Level Parallelism Static (compiler) vs Dynamic (processor) instruction scheduling approaches VLIW processors
Review of pipelining architectures 3 5-stage pipeline: IF, ID, EX, MEM, WB. Data hazards solved by: the WB stage using half a cycle (writes on the falling edge); forwarding from the MEM and WB stages to EX. Control hazards: stall until the dependency is solved (i.e., the next PC is known); delayed branch; static branch prediction. What is the speedup from pipelining?
4 Evaluating pipelining architectures Speedups

Speedup = T_NoPipeline / T_Pipeline
        = (#Instructions × CPI_NoPipeline × T_CLK,NoPipeline) / (#Instructions × CPI_Pipeline × T_CLK,Pipeline)

With CPI_NoPipeline = 1 and

CPI_Pipeline = CPI_Ideal + #Stalls / #Instructions = 1 + SPI

(SPI = average number of stalls per instruction), and considering that pipelining is performed by evenly balancing the critical path of all stages, so that

#Stages = T_CLK,NoPipeline / T_CLK,Pipeline

we obtain:

Speedup = #Stages / (1 + SPI)

Objective: maximize the number of pipeline stages while minimizing the average number of stalls per instruction. Notice: keep the critical path of all pipeline stages balanced.
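The derivation above can be checked numerically. A minimal sketch (the 5-stage and 0.30-stalls-per-instruction figures are the ones used in the control-hazard slides that follow):

```python
# Speedup of an N-stage pipeline over a single-cycle design, assuming the
# critical path is evenly balanced across stages: Speedup = N / (CPI_ideal + SPI).
def pipeline_speedup(n_stages, spi, cpi_ideal=1.0):
    """spi = average number of stalls per instruction."""
    return n_stages / (cpi_ideal + spi)

# With no stalls the speedup equals the number of stages:
print(pipeline_speedup(5, 0.0))             # 5.0
# With 0.30 stalls per instruction:
print(round(pipeline_speedup(5, 0.30), 2))  # 3.85
```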
Evaluating pipelining architectures Speedups 5

Energy gain:

E_NoPipeline / E_Pipeline = (T_NoPipeline × P_NoPipeline) / (T_Pipeline × P_Pipeline)
                          = Speedup × (P_dyn + P_static)_NoPipeline / (P_dyn + P_static)_Pipeline
                          = [N / (1 + SPI)] × (α C V² f + V²/R) / (α C V² (N f) + V²/R)

Dynamic power consumption: P_dyn = α C V² f, where α is a switching factor and C is the effective capacitance of the IC; the pipelined design runs at frequency N f.
Static power consumption: P_static = V I = V²/R, where R is the effective resistance for the leakage currents.
N — number of pipeline stages; SPI — average number of stalls per instruction.
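The energy ratio can be sketched the same way. Note what the formula implies: with negligible static power and no stalls the gain collapses to 1, i.e., deeper pipelining alone does not reduce dynamic energy. All parameter values below are illustrative, not from the slides:

```python
def energy_gain(n_stages, spi, alpha, cap, volt, freq, leak_res):
    """E_no_pipeline / E_pipeline; the pipelined design clocks at n_stages * freq."""
    speedup = n_stages / (1.0 + spi)
    p_no_pipe = alpha * cap * volt**2 * freq + volt**2 / leak_res
    p_pipe = alpha * cap * volt**2 * (n_stages * freq) + volt**2 / leak_res
    return speedup * p_no_pipe / p_pipe

# With negligible leakage (large R) and no stalls, total energy is unchanged:
print(round(energy_gain(5, 0.0, alpha=0.5, cap=1e-9, volt=1.0, freq=1e9, leak_res=1e12), 3))  # 1.0
```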
Evaluating pipelining architectures Influence of control instructions 6

Consider a typical 5-stage pipelining architecture where: around 15% of instructions are conditional or unconditional jumps; in 60% of cases, the sequence of instructions is broken; all data dependencies are resolved by forwarding.

Resolution of control hazards         Penalty cycles  Stalls per instruction  Speedup
Pipeline Stall (BR CTRL @ EX stage)   2               0.30                    3.85

2 penalty cycles times 15% of instructions results in an average of 0.30 stalls per instruction.

Speedup = Number of Stages / (1 + stalls per instruction)
Evaluating pipelining architectures Influence of control instructions 7

Consider a typical 5-stage pipelining architecture where: around 15% of instructions are conditional or unconditional jumps; in 60% of cases, the sequence of instructions is broken; all data dependencies are resolved by forwarding.

Resolution of control hazards         Penalty cycles  Stalls per instruction  Speedup
Pipeline Stall (BR CTRL @ EX stage)   2               0.30                    3.85
Branch Predict Taken                  1               0.15                    4.35

A static branch-predict-taken strategy has limited benefit here since the jump address is only known in a later pipeline stage. Thus, 1 penalty cycle times 15% of instructions results in an average of 0.15 stalls per instruction.

Speedup = Number of Stages / (1 + stalls per instruction)
Evaluating pipelining architectures Influence of control instructions 8

Consider a typical 5-stage pipelining architecture where: around 15% of instructions are conditional or unconditional jumps; in 60% of cases, the sequence of instructions is broken; all data dependencies are resolved by forwarding.

Resolution of control hazards         Penalty cycles  Stalls per instruction  Speedup
Pipeline Stall (BR CTRL @ EX stage)   2               0.30                    3.85
Branch Predict Taken                  1               0.15                    4.35
Branch Predict Not Taken              1               0.09                    4.59

A static predict-not-taken strategy has no performance loss when the jump is not taken, since the following instructions are already in the pipeline; there is only a penalty when the jump is taken. Thus, 1 penalty cycle times 15% of instructions times the 60% of cases where the jump is taken results in an average of 0.09 stalls per instruction.

Speedup = Number of Stages / (1 + stalls per instruction)
Evaluating pipelining architectures Influence of control instructions 9

Consider a typical 5-stage pipelining architecture where: around 15% of instructions are conditional or unconditional jumps; in 60% of cases, the sequence of instructions is broken; all data dependencies are resolved by forwarding.

Resolution of control hazards         Penalty cycles  Stalls per instruction  Speedup
Pipeline Stall (BR CTRL @ EX stage)   2               0.30                    3.85
Branch Predict Taken                  1               0.15                    4.35
Branch Predict Not Taken              1               0.09                    4.59
Delayed Branch                        0.5 (50%)       0.075                   4.65

A delayed-branch strategy has no performance loss if the delay slot can be filled with a useful instruction. However, in 50% of cases the delay slot remains empty (i.e., a NOP instruction is used to fill it). Thus, 1 penalty cycle times 50% of cases times 15% of instructions results in an average of 0.075 stalls per instruction.

Speedup = Number of Stages / (1 + stalls per instruction)
Evaluating pipelining architectures Influence of control instructions 10

Consider a typical 5-stage pipelining architecture where: around 15% of instructions are conditional or unconditional jumps; in 60% of cases, the instruction sequence is broken (the branch is taken); all data dependencies are resolved by forwarding.

Resolution of control hazards               Penalty cycles  Stalls per instruction  Speedup
Pipeline Stall (BR CTRL @ EX stage)         2               0.300                   3.85
Branch Predict Taken                        1               0.150                   4.35
Branch Predict Not Taken                    1               0.090                   4.59
Delayed Branch                              0.5 (50%)       0.075                   4.65
Delayed Branch + Branch Predict Not Taken   0.5 (50%)       0.045                   4.78

Speedup = Number of Stages / (1 + stalls per instruction)
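The whole table can be reproduced from the stated assumptions (15% branches, branch taken in 60% of cases, 5 stages). A quick sketch:

```python
branch_frac, taken_frac, n_stages = 0.15, 0.60, 5

# Average stalls per instruction (SPI) for each control-hazard strategy:
spi = {
    "pipeline stall":              2 * branch_frac,                   # 0.300
    "predict taken":               1 * branch_frac,                   # 0.150
    "predict not taken":           1 * branch_frac * taken_frac,      # 0.090
    "delayed branch":              0.5 * branch_frac,                 # 0.075 (slot empty 50% of the time)
    "delayed + predict not taken": 0.5 * branch_frac * taken_frac,    # 0.045
}
for name, s in spi.items():
    print(f"{name:28s} SPI = {s:.3f}  speedup = {n_stages / (1 + s):.2f}")
```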
11 Static instruction scheduling Can we improve processor performance just by wisely re-ordering the instructions?
Static instruction scheduling Example architecture 12

IF ID EX MEM WB — MIPS architecture with 1 delayed branch slot and forwarding.

Instruction latency: the number of wait cycles between generating a result and consuming it.

Producing       Consuming instruction
instruction     INT ALU   Store   FP ALU
INT ALU         0         0       0
Load            1         0       1
FP ALU          -         2       3

E.g.: the INT ALU → INT ALU latency is zero, which means the instructions can be consecutive; the Load → INT ALU latency is one, so if the instructions are consecutive, one stall cycle is generated.
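The latency table can be encoded directly. The helper below is an illustrative sketch (names are mine, not from the slides): it gives the stall cycles between a producer and a consumer separated by a given instruction distance.

```python
# Producer -> consumer wait-cycle table from the slide (MIPS-like, with forwarding).
LATENCY = {
    ("INT_ALU", "INT_ALU"): 0, ("INT_ALU", "STORE"): 0, ("INT_ALU", "FP_ALU"): 0,
    ("LOAD",    "INT_ALU"): 1, ("LOAD",    "STORE"): 0, ("LOAD",    "FP_ALU"): 1,
    ("FP_ALU",  "STORE"):   2, ("FP_ALU",  "FP_ALU"): 3,
}

def stalls(producer, consumer, distance=1):
    """Stall cycles when `consumer` issues `distance` instructions after `producer`."""
    return max(0, LATENCY.get((producer, consumer), 0) - (distance - 1))

print(stalls("LOAD", "FP_ALU"))      # 1: back-to-back load/use stalls one cycle
print(stalls("FP_ALU", "STORE", 3))  # 0: two instructions in between hide the latency
```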
Static instruction scheduling Example assembly code 13

double X[100];
for (i = 99; i >= 0; i--)
    X[i] = X[i] + k;

; R2 contains the address of X[99]
; F1 contains the value of k
Cont: L.D    F0,0(R2)    ; F0 ← M[R2+0]  (load double)
      ADD.D  F2,F0,F1    ; F2 ← F0 + F1  (add double)
      S.D    0(R2),F2    ; M[R2+0] ← F2  (store double)
      DSUBI  R2,R2,#8    ; R2 ← R2 − 8
      BNE    R2,R1,Cont  ; PC ← Cont if R2 ≠ R1

Note: registers F0, F1, F2, ... are used to store double-precision floating-point numbers.
Static instruction scheduling Program execution 14

(Producer → consumer latencies as in the table on the previous slide.)

Cont: L.D    F0,0(R2)     ; cycle 1
      (stall)             ; cycle 2  (Load → FP ALU)
      ADD.D  F2,F0,F1     ; cycle 3
      (stall)             ; cycle 4  (FP ALU → Store)
      (stall)             ; cycle 5
      S.D    0(R2),F2     ; cycle 6
      DSUBI  R2,R2,#8     ; cycle 7
      (stall)             ; cycle 8  (branch depends on R2)
      BNE    R2,R1,Cont   ; cycle 9

9 cycles per iteration
Static instruction scheduling Instruction re-ordering 15

(Producer → consumer latencies as in the table on the previous slides.)

Original:                      Re-ordered:
Cont: L.D   F0,0(R2)           Cont: L.D    F0,0(R2)
      ADD.D F2,F0,F1                 ADD.D  F2,F0,F1
      S.D   0(R2),F2                 DSUBI  R2,R2,#8
      DSUBI R2,R2,#8                 BNED   R2,R1,Cont
      BNE   R2,R1,Cont               S.D    8(R2),F2   ; delay slot; offset adjusted for the DSUBI
Static instruction scheduling Speedup from instruction re-ordering 16

Cont: L.D    F0,0(R2)     ; cycle 1
      (stall)             ; cycle 2  (Load → FP ALU)
      ADD.D  F2,F0,F1     ; cycle 3
      DSUBI  R2,R2,#8     ; cycle 4
      BNED   R2,R1,Cont   ; cycle 5
      S.D    8(R2),F2     ; cycle 6  (delay slot)

6 cycles per iteration

Re-scheduling speedup = 9 / 6 = 1.5, i.e., 50% faster
Static instruction scheduling Speedup from loop unrolling 17

Cont: L.D    F0,0(R2)      ; cycle 1
      L.D    F2,-8(R2)     ; cycle 2
      L.D    F3,-16(R2)    ; cycle 3
      L.D    F4,-24(R2)    ; cycle 4
      ADD.D  F0,F0,F1      ; cycle 5
      ADD.D  F2,F2,F1      ; cycle 6
      ADD.D  F3,F3,F1      ; cycle 7
      ADD.D  F4,F4,F1      ; cycle 8
      S.D    0(R2),F0      ; cycle 9
      S.D    -8(R2),F2     ; cycle 10
      S.D    -16(R2),F3    ; cycle 11
      DSUBI  R2,R2,#32     ; cycle 12
      BNED   R2,R1,Cont    ; cycle 13
      S.D    8(R2),F4      ; cycle 14 (delay slot; offset adjusted for the DSUBI)

14 cycles per 4 iterations = 3.5 cycles per iteration
Re-scheduling speedup = 9 / 3.5 ≈ 2.57
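Putting the three versions side by side, the per-iteration cycle counts give the speedups quoted above:

```python
# Cycles per iteration for the three schedules of the same loop.
cycles_per_iter = {"original": 9.0, "re-ordered": 6.0, "unrolled x4": 14 / 4}

base = cycles_per_iter["original"]
for name, c in cycles_per_iter.items():
    print(f"{name}: {c} cycles/iteration, speedup {base / c:.2f}")
# original 1.00, re-ordered 1.50, unrolled x4 2.57
```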
18 Instruction Level Parallelism Additional speedups can be obtained by issuing multiple instructions in a single clock cycle. This exploits instruction-level parallelism (ILP): the simultaneous execution of up to N instructions per clock cycle allows decreasing the ideal CPI from 1 to 1/N.
Exploring Instruction Level Parallelism (ILP) Static vs Dynamic Approaches 19 Very Long Instruction Word (VLIW) processors (e.g., Itanium): Static instruction re-scheduling HAZARDS: Solved by the compiler by statically analysing the dependencies Allows more complex analysis Superscalar processors (e.g., Intel, ARM): Dynamic instruction re-scheduling HAZARDS: Solved in real-time using dedicated hardware structures Simpler analysis but can take into account conflicts that are only visible during execution Requires: Multiple functional units Simultaneous fetch and decode of multiple instructions Leads to additional conflicts
Exploring Instruction Level Parallelism (ILP) Static vs Dynamic Approaches 20 Very Long Instruction Word (VLIW) processors (e.g., Itanium): static instruction re-scheduling INSTRUCTION SCHEDULING: The compiler analyses the dependencies and identifies instruction level parallelism (ILP) Using the collected information, the compiler generates groups of instructions (packets) that can be executed in parallel Allows reducing the hardware resources for controlling the processor, but cannot extract parallelism only visible during execution Superscalar processors (e.g., Intel, ARM): Dynamic instruction re-scheduling INSTRUCTION SCHEDULING: Dynamic resolution of conflicts that takes into account information only known during execution; typical algorithms: Scoreboard (centralized) and Tomasulo (distributed) Uses out-of-order instruction execution and register renaming to solve the dependencies and extract ILP Leads to more complex control mechanisms and requires additional hardware resources
Exploring Instruction Level Parallelism (ILP) Static vs Dynamic Approaches 21

Very Long Instruction Word (VLIW) processors (e.g., Itanium): static instruction re-scheduling. INSTRUCTION SCHEDULING: example case where parallelism cannot be extracted at compile time.

Superscalar processors (e.g., Intel, ARM): dynamic instruction re-scheduling. INSTRUCTION SCHEDULING: example case where parallelism is only known during instruction execution:

S.D  100(R1),R2
L.D  R3,20(R4)

Is 100+R1 = 20+R4? The addresses are only known at run time.
Exploring Instruction Level Parallelism (ILP) VLIW processors 22

The compiler formats the instructions into packets. Each packet consists of a set of independent instructions; if there are dependencies, they must be explicitly marked. The need for hardware conflict identification is thereby substantially reduced. In each clock cycle: the Instruction Fetch (IF) stage fetches a packet from memory; the Instruction Decode (ID) stage decodes the complete packet and issues it to execution.
23 Very Long Instruction Word (VLIW) Processors MIPS extension to VLIW

A MIPS-VLIW extension with 5 different execution units, which translates into packets of at most 5 instructions:

#  | Memory 1         | Memory 2         | FP 1             | FP 2              | Integer
1  | L.D F0,0(R0)     | L.D F2,-8(R0)    |                  |                   |
2  | L.D F3,-16(R0)   | L.D F4,-24(R0)   |                  |                   |
3  | L.D F5,-32(R0)   | L.D F6,-40(R0)   | ADD.D F0,F0,F1   | ADD.D F2,F2,F1    |
4  | L.D F7,-48(R0)   | L.D F8,-56(R0)   | ADD.D F3,F3,F1   | ADD.D F4,F4,F1    |
5  | L.D F9,-64(R0)   | L.D F10,-72(R0)  | ADD.D F5,F5,F1   | ADD.D F6,F6,F1    |
6  |                  |                  | ADD.D F7,F7,F1   | ADD.D F8,F8,F1    |
7  | S.D 0(R0),F0     | S.D -8(R0),F2    | ADD.D F9,F9,F1   | ADD.D F10,F10,F1  |
8  | S.D -16(R0),F3   | S.D -24(R0),F4   |                  |                   |
9  | S.D -32(R0),F5   | S.D -40(R0),F6   |                  |                   | DSUBI R0,R0,#80
10 | S.D 32(R0),F7    | S.D 24(R0),F8    |                  |                   | BNED R0,R1,Cont
11 | S.D 16(R0),F9    | S.D 8(R0),F10    |                  |                   |

(After the DSUBI in cycle 9, the remaining store offsets are adjusted by +80.)

Takes 11 cycles to execute 10 iterations of the loop — an average of 1.1 cycles per iteration
Speedup vs. original case = 9 / 1.1 ≈ 8.18
Speedup vs. re-ordered = 3.5 / 1.1 ≈ 3.18
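The per-iteration cost, the functional-unit utilization and both speedups for this schedule can be checked with a few lines:

```python
# VLIW schedule from the slide: 5 issue slots, 11 cycles, 10 loop iterations.
slots_per_cycle, cycles, iterations = 5, 11, 10
ops = 10 + 10 + 10 + 2          # 10 loads, 10 adds, 10 stores, plus DSUBI and BNED

cpi_iter = cycles / iterations                  # cycles per iteration
utilization = ops / (cycles * slots_per_cycle)  # filled slots / total slots

print(cpi_iter)                                          # 1.1
print(round(utilization * 100))                          # 58 (%)
print(round(9 / cpi_iter, 2), round(3.5 / cpi_iter, 2))  # 8.18 3.18
```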
24 Very Long Instruction Word (VLIW) Processors MIPS extension to VLIW

A MIPS-VLIW extension with 5 different execution units, which translates into packets of at most 5 instructions. (The 11-cycle schedule from the previous slide is repeated here.)

Which of the two speedups should be used to compare the architectures? What is the maximum achievable speedup?
Speedup vs. original case = 9 / 1.1 ≈ 8.18
Speedup vs. re-ordered = 3.5 / 1.1 ≈ 3.18
Very Long Instruction Word (VLIW) Processors Drawbacks 25

Requires a large number of floating-point registers to expose the parallelism. To fully expose the parallelism and maximise the use of the available functional units, loop unrolling must be applied aggressively. Even then, functional-unit utilization is low: e.g., the previous case achieves an average utilization of only 58% (32 operations in 55 issue slots). The actual speedup is lower than ideal: in the previous case the ideal speedup is 5x, whereas the real value (assuming the critical path remains the same) is 3.18x. VLIW processors also require a large bandwidth to the register file: the previous case requires reading from 4 FP and 4 integer registers and writing to 4 FP and 3 integer registers. Finally, code incompatibility is a major drawback.
Moving beyond VLIW processors Explicitly Parallel Instruction Computing (EPIC) 26

VLIW ISAs are not backward compatible between implementations, i.e., different packet widths and/or a different set of available functional units break existing binaries. The variability of memory access times (due to CPU caches and RAM) also makes the compiler's job harder. Explicitly Parallel Instruction Computing (EPIC): instead of grouping instructions in fixed packets, group them in bundles and use a stop bit to indicate dependencies between bundles. New load instructions decrease memory-access-time variability, e.g., software prefetching and speculative loading. New branch instructions combine multiple branch conditions in a single bundle. Predicated execution allows a bundle to be executed only when a condition holds.
Static scheduling architectures Intel IA-64 Architecture 27

EPIC (Explicitly Parallel Instruction Computing): the compiler identifies the parallelism and schedules the instruction execution, indicating which instructions can be performed in parallel; the architecture includes hardware support for instruction scheduling. Instructions are organized in groups that can be executed in parallel, coded in 128-bit bundles of 3 instructions. Registers: 128 × 64-bit registers with 1 poison bit per register (32 GPR + 96 stacked); 128 × 82-bit FP registers (using the IEEE 80-bit format); 64 × 1-bit predicate registers; 8 × 64-bit branch registers for indirect branches.
Static scheduling architectures Intel IA-64 Architecture 28

Execution unit | Instruction type                       | Example
Unit-I         | Integer ALU (A); non-integer ALU (I)   | addition, subtraction, ...; bit test, move, ...
Unit-M         | Integer ALU (A); memory access (M)     | addition, subtraction, ...; integer/FP load/store
Unit-F         | Floating point (F)                     | floating-point operations
Unit-B         | Branches (B)                           | jumps/calls
L+X            | Extended (L+X)                         | extended immediate

24 possible bundle patterns (» indicates a new parallel section, i.e., the end of parallelism), e.g.:
Pattern 0: M I I    Pattern 1: M I I»    Pattern 2: M I» I    Pattern 10: M» M I    Pattern 29: M F B»

Each instruction is coded in 41 bits. The bundle's 5 most significant bits state the bundle pattern; the 6 least significant bits of each instruction specify predication.
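The bit budget works out exactly: 3 × 41-bit instruction slots plus a 5-bit template field fill the 128-bit bundle. A small sketch of unpacking such a bundle (the field ordering below — template in the least significant bits — is an assumption for illustration):

```python
INSTR_BITS, TEMPLATE_BITS, SLOTS = 41, 5, 3
assert TEMPLATE_BITS + SLOTS * INSTR_BITS == 128  # 5 + 3*41 = 128 bits

def unpack_bundle(bundle):
    """Split a 128-bit bundle (given as an int) into (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    mask = (1 << INSTR_BITS) - 1
    slots = [(bundle >> (TEMPLATE_BITS + INSTR_BITS * i)) & mask for i in range(SLOTS)]
    return template, slots

# Hypothetical bundle: template 0x10 with the value 7 in slot 0.
template, slots = unpack_bundle((7 << 5) | 0x10)
print(template, slots)  # 16 [7, 0, 0]
```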
Static scheduling architectures Intel IA-64 Architecture 29

                                  | Itanium  | Itanium2
Pipeline stages                   | 10       | 8
Issued instructions per cycle     | 6        | 6
Functional units:
  Integer (Type I)                | 2        | 2
  Load/Store (Type M)             | 2        | 4
  Floating point (Type F)         | 3        | 3
  Branch (Type B)                 | 3        | 3
Latencies:
  Floating point                  | 4        | 4
  Branch misprediction            | up to 9  | up to 6
Architectures comparison Intel IA-64 (Itanium2) vs IA-32e (Pentium 4) 30

[Figure: SPEC CPU INT2006 execution times in seconds (100–1500 s) for Pentium 4 @ 3.8 GHz vs. Itanium2 @ 1.6 GHz, across 400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 445.gobmk, 456.hmmer, 458.sjeng, 462.libquantum, 464.h264ref, 471.omnetpp, 473.astar and 483.xalancbmk]
Architectures comparison Intel IA-64 vs IA-32e 31

[Figure: SPEC CINT2000 and CFP2000 scores (0–3000) for Itanium @ 0.8 GHz (ICC 5, 2001), Itanium 2 @ 1 GHz, 3M L3 (ICC 7, 2002), Itanium 2 @ 1.6 GHz, 6M L3 (ICC 8), Pentium 4 @ 1.3 GHz (ICC 5, 2002), Pentium 4 @ 3.07 GHz (ICC 7), and Core 2 E6300 @ 2.66 GHz (ICC 9, 2007)]
32 Next lesson Compiler techniques to extract parallelism Local techniques Global techniques
Very Long Instruction Word (VLIW) Processors Extracting ILP and scheduling 33

Local techniques: parallelism is simpler to exploit when loop unrolling leads to a long sequence of independent instructions; Software Pipelining is an effective technique for extracting parallelism and scheduling instructions through symbolic loop unrolling. Global techniques: identifying and exploiting parallelism requires moving instructions across branches to resolve control dependencies; complex algorithms must be used, and these achieve sub-optimal solutions. Techniques that use hardware support to extract parallelism can also be used.
Parallelizing loops 34

To extract parallelism from loops, each iteration must be independent from the previous ones.

Parallelizable loop:
for (i = 0; i < N; i++)
    A[i] = A[i] + K;
Iteration i is independent of every other iteration j; the iterations can be performed in any order.

Non-parallelizable loop:
for (i = 1; i < N; i++)
    A[i] = A[i-1] + K;
Iteration i depends on iteration i-1, so iteration i must be performed before iteration i+1.
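The difference can be seen directly in a short sketch: the independent loop gives the same result whatever the iteration order, while the loop-carried dependence forces sequential order (the values of K and N below are arbitrary):

```python
K, N = 3, 8
A = list(range(N))

# Independent iterations: forward and reverse order give the same result.
fwd = A[:]
for i in range(N):
    fwd[i] = fwd[i] + K
rev = A[:]
for i in reversed(range(N)):
    rev[i] = rev[i] + K
assert fwd == rev

# Loop-carried dependence: each element needs the previous iteration's result,
# so the iterations cannot be reordered or run in parallel.
B = A[:]
for i in range(1, N):
    B[i] = B[i-1] + K
print(B)  # a running chain: B[i] = B[0] + i*K
```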