Parallelization. Saman Amarasinghe. Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology


1 Spring 2010 Parallelization Saman Amarasinghe Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

2 Outline Why Parallelism Parallel Execution Parallelizing Compilers Dependence Analysis Increasing Parallelization Opportunities

3 Moore's Law From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006. [Chart: performance (vs. VAX-11/780) and number of transistors per chip over time, from the 8086 through the 486, Pentium, P2, P3, P4, and Itanium 2, with the 52%/year growth era giving way to ??%/year.] Saman Amarasinghe 6.035 MIT Fall 2006. From David Patterson.

4 Uniprocessor Performance (SPECint) From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006. [Chart: SPECint performance and number of transistors per chip over time, from the 8086 through the Pentium, P2, P3, P4, and Itanium 2.] Saman Amarasinghe 6.035 MIT Fall 2006. From David Patterson.

5 Multicores Are Here! [Chart: number of cores per chip versus year, from single-core processors (Pentium, P2, P3, P4, Athlon, Itanium, Itanium 2, PA-8800, Power4, Power6, Opteron, Yonah) up through multicores such as Xbox360, Opteron 4P, Xeon MP, Tanglewood, Cell, Niagara, Raw, Raza XLR, Cavium Octeon, Broadcom 1480, Cisco CSR-1, Intel Tflops, Picochip PC102, and Ambric AM2045.] Saman Amarasinghe 6.035 MIT Fall 2006

6 Issues with Parallelism
Amdahl's Law: Any computation can be analyzed in terms of a portion that must be executed sequentially, Ts, and a portion that can be executed in parallel, Tp. Then for n processors: T(n) = Ts + Tp/n. T(∞) = Ts, thus the maximum speedup is (Ts + Tp)/Ts.
Load Balancing: The work is distributed among processors so that all processors are kept busy while the parallel task is executed.
Granularity: The size of the parallel regions between synchronizations, or the ratio of computation (useful work) to communication (overhead).
Saman Amarasinghe 6.035 MIT Fall 2006
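A quick numeric sketch of Amdahl's Law, assuming an arbitrary 10%/90% split between sequential and parallel work (the numbers are illustrative, not from the slides):

  #include <stdio.h>

  int main(void) {
      double Ts = 10.0, Tp = 90.0;                  /* sequential and parallel portions; T(1) = 100 */
      for (int n = 1; n <= 1024; n *= 4) {
          double Tn = Ts + Tp / n;                  /* T(n) = Ts + Tp/n */
          printf("n=%4d  T(n)=%6.2f  speedup=%5.2f\n", n, Tn, (Ts + Tp) / Tn);
      }
      printf("max speedup = %.2f\n", (Ts + Tp) / Ts);
      return 0;
  }

Even with 1024 processors the speedup stays below (Ts + Tp)/Ts = 10, because the sequential 10% dominates.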

7 Outline Why Parallelism Parallel Execution Parallelizing Compilers Dependence Analysis Increasing Parallelization Opportunities

8 Types of Parallelism Instruction Level Parallelism (ILP): scheduling and hardware. Task Level Parallelism (TLP): mainly by hand. Loop Level Parallelism (LLP) or Data Parallelism: hand or compiler generated. Pipeline Parallelism: hardware or streaming. Divide and Conquer Parallelism: recursive functions. Saman Amarasinghe 6.035 MIT Fall 2006

9 Why Loops? 90% of the execution time is in 10% of the code, mostly in loops. If parallel, can get good performance (load balancing). Relatively easy to analyze. Saman Amarasinghe 6.035 MIT Fall 2006

10 Programmer Defined Parallel Loop FORALL: no loop-carried dependences, fully parallel. FORACROSS: some loop-carried dependences. Saman Amarasinghe 6.035 MIT Fall 2006

11 Parallel Execution Example
FORPAR I = 0 to N
  A[I] = A[I] + 1
Block Distribution: the program gets mapped into
  Iters = ceiling(N/NUMPROC);
  FOR P = 0 to NUMPROC-1
    FOR I = P*Iters to MIN((P+1)*Iters, N)
      A[I] = A[I] + 1
SPMD (Single Program, Multiple Data) Code
  If(myPid == 0) { Iters = ceiling(N/NUMPROC); }
  Barrier();
  FOR I = myPid*Iters to MIN((myPid+1)*Iters, N)
    A[I] = A[I] + 1
  Barrier();
Saman Amarasinghe 6.035 MIT Fall 2006

12 Parallel Execution Example
FORPAR I = 0 to N
  A[I] = A[I] + 1
Block Distribution: the program gets mapped into
  Iters = ceiling(N/NUMPROC);
  FOR P = 0 to NUMPROC-1
    FOR I = P*Iters to MIN((P+1)*Iters, N)
      A[I] = A[I] + 1
Code that forks a function
  Iters = ceiling(N/NUMPROC);
  ParallelExecute(func);
  void func(integer myPid) {
    FOR I = myPid*Iters to MIN((myPid+1)*Iters, N)
      A[I] = A[I] + 1
  }
Saman Amarasinghe 6.035 MIT Fall 2006
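A minimal C sketch of the fork-a-function mapping above, using POSIX threads; N, NUMPROC, and the names func and ParallelExecute mirror the slide's pseudocode and are otherwise arbitrary (compile with -pthread):

  #include <pthread.h>
  #include <stdio.h>

  #define N        1000
  #define NUMPROC  4

  static double A[N];
  static long Iters;                          /* ceiling(N / NUMPROC) */

  /* Worker body: each thread handles one contiguous block of iterations. */
  static void *func(void *arg) {
      long myPid = (long)arg;
      long lo = myPid * Iters;
      long hi = (myPid + 1) * Iters;
      if (hi > N) hi = N;
      for (long i = lo; i < hi; i++)
          A[i] = A[i] + 1;
      return NULL;
  }

  /* ParallelExecute: fork NUMPROC threads running the body, then join them. */
  static void ParallelExecute(void *(*body)(void *)) {
      pthread_t thr[NUMPROC];
      for (long p = 0; p < NUMPROC; p++)
          pthread_create(&thr[p], NULL, body, (void *)p);
      for (long p = 0; p < NUMPROC; p++)
          pthread_join(thr[p], NULL);
  }

  int main(void) {
      Iters = (N + NUMPROC - 1) / NUMPROC;    /* ceiling division */
      ParallelExecute(func);
      printf("A[0] = %f\n", A[0]);
      return 0;
  }

Joining the threads at the end plays the role of the final Barrier() in the SPMD version.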

13 Parallel Execution SPMD: need to get all the processors to execute the control flow; extra synchronization overhead, or redundant computation on all processors, or both; stack: private or shared? Fork: local variables are not visible within the function; either make the variables used/defined in the loop body global, or pass and return them as arguments; function call overhead. Saman Amarasinghe 6.035 MIT Fall 2006

14 Parallel Thread Basics Create separate threads: create an OS thread and (hopefully) it will be run on a separate core; pthread_create(&thr, NULL, &entry_point, NULL). Overhead in thread creation: create a separate stack, get the OS to allocate a thread. Thread pool: create all the threads (= num cores) at the beginning; keep N-1 idling on a barrier during sequential execution; get them to run parallel code by each executing a function; back to the barrier when the parallel region is done. Saman Amarasinghe 6.035 MIT Fall 2006
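A rough C sketch of such a thread pool using POSIX barriers; the thread count, the parallel_region function pointer, and the shutdown flag are illustrative choices, not taken from the slides (compile with -pthread):

  #include <pthread.h>
  #include <stdio.h>

  #define NTHREADS 4

  static pthread_barrier_t start_bar, done_bar;
  static void (*parallel_region)(int id);      /* work the pool runs next */
  static volatile int shutting_down = 0;

  /* Pool workers cycle between two barriers: wake up, run the region, report back. */
  static void *pool_worker(void *arg) {
      int id = (int)(long)arg;
      for (;;) {
          pthread_barrier_wait(&start_bar);    /* idle here during sequential execution */
          if (shutting_down) break;
          parallel_region(id);
          pthread_barrier_wait(&done_bar);     /* parallel region is done */
      }
      return NULL;
  }

  static void say_hello(int id) { printf("hello from worker %d\n", id); }

  int main(void) {
      pthread_t thr[NTHREADS - 1];
      pthread_barrier_init(&start_bar, NULL, NTHREADS);
      pthread_barrier_init(&done_bar, NULL, NTHREADS);
      for (long i = 1; i < NTHREADS; i++)      /* the main thread acts as worker 0 */
          pthread_create(&thr[i - 1], NULL, pool_worker, (void *)i);

      parallel_region = say_hello;             /* run one parallel region */
      pthread_barrier_wait(&start_bar);
      say_hello(0);
      pthread_barrier_wait(&done_bar);

      shutting_down = 1;                       /* release the pool and join */
      pthread_barrier_wait(&start_bar);
      for (int i = 1; i < NTHREADS; i++)
          pthread_join(thr[i - 1], NULL);
      return 0;
  }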

15 Outline Why Parallelism Parallel Execution Parallelizing Compilers Dependence Analysis Increasing Parallelization Opportunities

16 Parallelizing Compilers Finding FORALL loops out of FOR loops. Examples:
FOR I = 0 to 5
  A[I] = A[I] + 1
FOR I = 0 to 5
  A[I] = A[I+6] + 1
FOR I = 0 to 5
  A[2*I] = A[2*I + 1] + 1
Saman Amarasinghe 6.035 MIT Fall 2006

17 Iteration Space N deep loops: n-dimensional discrete cartesian space. Normalized loops: assume step size = 1.
FOR I = 0 to 6
  FOR J = I to 7
[Iteration space plot over I and J.]
Iterations are represented as coordinates in iteration space: i = [i1, i2, i3, ..., in]. Saman Amarasinghe 6.035 MIT Fall 2006

18 Iteration Space N deep loops: n-dimensional discrete cartesian space. Normalized loops: assume step size = 1.
FOR I = 0 to 6
  FOR J = I to 7
Iterations are represented as coordinates in iteration space. Sequential execution order of iterations: lexicographic order [0,0], [0,1], [0,2], ..., [0,6], [0,7], [1,1], [1,2], ..., [1,6], [1,7], [2,2], ..., [2,6], [2,7], ..., [6,6], [6,7]. Saman Amarasinghe 6.035 MIT Fall 2006

19 Iteration Space N deep loops: n-dimensional discrete cartesian space. Normalized loops: assume step size = 1.
FOR I = 0 to 6
  FOR J = I to 7
Iterations are represented as coordinates in iteration space. Sequential execution order of iterations: lexicographic order. Iteration i is lexicographically less than j, i < j, iff there exists c s.t. i1 = j1, i2 = j2, ..., i(c-1) = j(c-1) and ic < jc. Saman Amarasinghe 6.035 MIT Fall 2006
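A small C helper that states the same ordering test operationally; the array representation of an iteration vector is my own assumption:

  #include <stdbool.h>

  /* "Iteration i executes before iteration j" for an n-deep loop nest:
   * true iff the first differing component of i is the smaller one. */
  bool lex_less(const int *i, const int *j, int n) {
      for (int c = 0; c < n; c++) {
          if (i[c] < j[c]) return true;    /* i1..i(c-1) equal and ic < jc */
          if (i[c] > j[c]) return false;
      }
      return false;                        /* identical iterations */
  }

For example, lex_less((int[]){1, 7}, (int[]){2, 2}, 2) is true, matching the order [1,7] before [2,2] listed above.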

20 Iteration Space N deep loops: n-dimensional discrete cartesian space. Normalized loops: assume step size = 1.
FOR I = 0 to 6
  FOR J = I to 7
An affine loop nest: loop bounds are integer linear functions of constants, loop-constant variables, and outer loop indexes; array accesses are integer linear functions of constants, loop-constant variables, and loop indexes. Saman Amarasinghe 6.035 MIT Fall 2006

21 Iteration Space N deep loops: n-dimensional discrete cartesian space. Normalized loops: assume step size = 1.
FOR I = 0 to 6
  FOR J = I to 7
Affine loop nest: the iteration space as a set of linear inequalities: 0 ≤ I, I ≤ 6, I ≤ J, J ≤ 7. Saman Amarasinghe 6.035 MIT Fall 2006

22 Data Space M dimensional arrays: m-dimensional discrete cartesian space (a hypercube). Integer A(10); Float B(5, 6). Saman Amarasinghe 6.035 MIT Fall 2006

23 Dependences
True dependence: a = ...; ... = a
Anti dependence: ... = a; a = ...
Output dependence: a = ...; a = ...
Definition: A data dependence exists between dynamic instances i and j iff either i or j is a write operation, i and j refer to the same variable, and i executes before j. How about array accesses within loops? Saman Amarasinghe 6.035 MIT Fall 2006

24 Outline Why Parallelism Parallel Execution Parallelizing Compilers Dependence Analysis Increasing Parallelization Opportunities

25 Array Accesses in a loop
FOR I = 0 to 5
  A[I] = A[I] + 1
[Iteration space and data space diagrams.]
Saman Amarasinghe 6.035 MIT Fall 2006

26 Array Accesses in a loop
FOR I = 0 to 5
  A[I] = A[I] + 1
[Diagram: each iteration I reads and writes only its own element A[I]; no accesses cross between iterations.]
Saman Amarasinghe 6.035 MIT Fall 2006

27 Array Accesses in a loop
FOR I = 0 to 5
  A[I+1] = A[I] + 1
[Diagram: iteration I writes A[I+1], which iteration I+1 then reads; a loop-carried true dependence between consecutive iterations.]
Saman Amarasinghe 6.035 MIT Fall 2006

28 Array Accesses in a loop
FOR I = 0 to 5
  A[I] = A[I+2] + 1
[Diagram: iteration I reads A[I+2], which iteration I+2 later writes; a loop-carried anti dependence at distance 2.]
Saman Amarasinghe 6.035 MIT Fall 2006

29 Array Accesses in a loop
FOR I = 0 to 5
  A[2*I] = A[2*I+1] + 1
[Diagram: writes touch the even elements, reads touch the odd elements; no two iterations access the same location.]
Saman Amarasinghe 6.035 MIT Fall 2006

30 Distance Vectors A loop has distance d if there exists a data dependence from iteration i to j and d = j - i.
dv = [0]: FOR I = 0 to 5: A[I] = A[I] + 1
dv = [1]: FOR I = 0 to 5: A[I+1] = A[I] + 1
dv = [2]: FOR I = 0 to 5: A[I] = A[I+2] + 1
dv = [1], [2], ... = [*]: FOR I = 0 to 5: A[I] = A[0] + 1
Saman Amarasinghe 6.035 MIT Fall 2006

31 Multi-Dimensional Dependence
FOR I = 1 to n
  FOR J = 1 to n
    A[I, J] = A[I, J-1] + 1
dv = [0, 1]
Saman Amarasinghe 6.035 MIT Fall 2006

32 Multi-Dimensional Dependence
FOR I = 1 to n
  FOR J = 1 to n
    A[I, J] = A[I, J-1] + 1
dv = [0, 1]
FOR I = 1 to n
  FOR J = 1 to n
    A[I, J] = A[I+1, J] + 1
dv = [1, 0]
Saman Amarasinghe 6.035 MIT Fall 2006

33 Outline Dependence Analysis Increasing Parallelization Opportunities

34 What is the Dependence?
FOR I = 1 to n
  FOR J = 1 to n
    A[I, J] = A[I-1, J+1] + 1
[Iteration space diagram over I and J.]
Saman Amarasinghe 6.035 MIT Fall 2006

35 What is the Dependence?
FOR I = 1 to n
  FOR J = 1 to n
    A[I, J] = A[I-1, J+1] + 1
[Iteration space diagram with the dependence arrows drawn in.]
Saman Amarasinghe 6.035 MIT Fall 2006

36 What is the Dependence?
FOR I = 1 to n
  FOR J = 1 to n
    A[I, J] = A[I-1, J+1] + 1
dv = [1, -1]
Saman Amarasinghe 6.035 MIT Fall 2006

37 What is the Dependence?
FOR I = 1 to n
  FOR J = 1 to n
    A[I, J] = A[I-1, J+1] + 1
FOR I = 1 to n
  FOR J = 1 to n
    A[I] = A[I-1] + 1
Saman Amarasinghe 6.035 MIT Fall 2006

38 What is the Dependence?
FOR I = 1 to n
  FOR J = 1 to n
    A[I, J] = A[I-1, J+1] + 1
dv = [1, -1]
FOR I = 1 to n
  FOR J = 1 to n
    B[I] = B[I-1] + 1
dv = [1, 0], [1, 1], [1, 2], [1, 3], ... = [1, *]
Saman Amarasinghe 6.035 MIT Fall 2006

39 What is the Dependence?
FOR i = 1 to N-1
  FOR j = 1 to N-1
    A[i,j] = A[i,j-1] + A[i-1,j];
dv = [0, 1], [1, 0]
Saman Amarasinghe 6.035 MIT Fall 2006

40 Recognizing FORALL Loops Find data dependences in the loop: for every pair of array accesses to the same array, if the first access has at least one dynamic instance (an iteration) in which it refers to a location in the array that the second access also refers to in at least one of the later dynamic instances (iterations), then there is a data dependence between the statements. (Note that the same access can depend on itself: output dependences.) Definition: a loop-carried dependence is a dependence that crosses a loop iteration boundary. If there are no loop-carried dependences, the loop is parallelizable. Saman Amarasinghe 6.035 MIT Fall 2006

41 Data Dependence Analysis I: Distance Vector method II: Integer Programming Saman Amarasinghe 6.035 MIT Fall 2006

42 Distance Vector Method The i-th loop is parallelizable iff, for every dependence d = [d1, ..., di, ..., dn], either one of d1, ..., d(i-1) is > 0, or d1, ..., di are all = 0. Saman Amarasinghe 6.035 MIT Fall 2006
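A compact C sketch of this test; the fixed-width dependence-vector array and the function name are my own choices, and the wildcard * entries from the earlier slides would need a separate encoding:

  #include <stdbool.h>

  /* Loop level `level` (1-based) of an n-deep nest is parallelizable iff every
   * dependence vector d has some positive component among d1..d(level-1), or
   * has d1..d(level) all equal to zero. */
  bool loop_parallelizable(int level, const int dv[][8], int ndeps, int n) {
      for (int k = 0; k < ndeps; k++) {
          bool outer_positive = false;     /* some d1..d(level-1) > 0 */
          bool all_zero = true;            /* d1..d(level) all == 0   */
          for (int c = 0; c < level && c < n; c++) {
              if (c < level - 1 && dv[k][c] > 0) outer_positive = true;
              if (dv[k][c] != 0) all_zero = false;
          }
          if (!outer_positive && !all_zero)
              return false;                /* this dependence is carried by loop `level` */
      }
      return true;
  }

With dv = {{1, -1}} this reports the outer loop as not parallelizable and the inner loop as parallelizable, matching the answers on slide 45 below.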

43 Is the Loop Parallelizable?
dv = [0]: Yes. FOR I = 0 to 5: A[I] = A[I] + 1
dv = [1]: No. FOR I = 0 to 5: A[I+1] = A[I] + 1
dv = [2]: No. FOR I = 0 to 5: A[I] = A[I+2] + 1
dv = [*]: No. FOR I = 0 to 5: A[I] = A[0] + 1
Saman Amarasinghe 6.035 MIT Fall 2006

44 Are the Loops Parallelizable?
FOR I = 1 to n
  FOR J = 1 to n
    A[I, J] = A[I, J-1] + 1
dv = [0, 1]: outer loop Yes, inner loop No.
FOR I = 1 to n
  FOR J = 1 to n
    A[I, J] = A[I+1, J] + 1
dv = [1, 0]: outer loop No, inner loop Yes.
Saman Amarasinghe 6.035 MIT Fall 2006

45 Are the Loops Parallelizable?
FOR I = 1 to n
  FOR J = 1 to n
    A[I, J] = A[I-1, J+1] + 1
dv = [1, -1]: outer loop No, inner loop Yes.
FOR I = 1 to n
  FOR J = 1 to n
    B[I] = B[I-1] + 1
dv = [1, *]: outer loop No, inner loop Yes.
Saman Amarasinghe 6.035 MIT Fall 2006

46 Integer Programming Method Example:
FOR I = 0 to 5
  A[I+1] = A[I] + 1
Is there a loop-carried dependence between A[I+1] and A[I]? That is, are there two distinct iterations iw and ir such that A[iw+1] is the same location as A[ir]?
  ∃ integers iw, ir: 0 ≤ iw, ir ≤ 5; iw ≠ ir; iw + 1 = ir
Is there a dependence between A[I+1] and A[I+1]? That is, are there two distinct iterations i1 and i2 such that A[i1+1] is the same location as A[i2+1]?
  ∃ integers i1, i2: 0 ≤ i1, i2 ≤ 5; i1 ≠ i2; i1 + 1 = i2 + 1
Saman Amarasinghe 6.035 MIT Fall 2006

47 Integer Programming Method Formulation
FOR I = 0 to 5
  A[I+1] = A[I] + 1
∃ an integer vector i such that Â i ≤ b, where Â is an integer matrix and b is an integer vector.
Saman Amarasinghe 6.035 MIT Fall 2006

48 Iteration Space N deep loops: n-dimensional discrete cartesian space.
FOR I = 0 to 5
  A[I+1] = A[I] + 1
Affine loop nest: the iteration space as a set of linear inequalities: 0 ≤ I, I ≤ 5.
Saman Amarasinghe 6.035 MIT Fall 2006

49 Integer Programming Method Formulation
FOR I = 0 to 5
  A[I+1] = A[I] + 1
∃ an integer vector i such that Â i ≤ b, where Â is an integer matrix and b is an integer vector.
Our problem formulation for A[i] and A[i+1]:
  ∃ integers iw, ir: 0 ≤ iw, ir ≤ 5; iw ≠ ir; iw + 1 = ir
iw ≠ ir is not an affine function: divide into 2 problems, Problem 1 with iw < ir and Problem 2 with ir < iw. If either problem has a solution, there exists a dependence.
How about iw + 1 = ir? Add two inequalities to a single problem: iw + 1 ≤ ir, and ir ≤ iw + 1.
Saman Amarasinghe 6.035 MIT Fall 2006

50 Integer Programming Formulation
FOR I = 0 to 5
  A[I+1] = A[I] + 1
Problem 1:
  0 ≤ iw
  iw ≤ 5
  0 ≤ ir
  ir ≤ 5
  iw < ir
  iw + 1 ≤ ir
  ir ≤ iw + 1
Saman Amarasinghe 6.035 MIT Fall 2006

51 Integer Programming Formulation
FOR I = 0 to 5
  A[I+1] = A[I] + 1
Problem 1 (each constraint rewritten in ≤ form):
  0 ≤ iw         ->  -iw ≤ 0
  iw ≤ 5         ->   iw ≤ 5
  0 ≤ ir         ->  -ir ≤ 0
  ir ≤ 5         ->   ir ≤ 5
  iw < ir        ->   iw - ir ≤ -1
  iw + 1 ≤ ir    ->   iw - ir ≤ -1
  ir ≤ iw + 1    ->  -iw + ir ≤ 1
Saman Amarasinghe 6.035 MIT Fall 2006

52 Integer Programming Formulation
Problem 1 as Â i ≤ b, with i = (iw, ir):
  -iw ≤ 0, iw ≤ 5, -ir ≤ 0, ir ≤ 5, iw - ir ≤ -1, -iw + ir ≤ 1
(Problem 2, with ir < iw, is formulated the same way; the matrix form is written out below.)
Saman Amarasinghe 6.035 MIT Fall 2006
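Spelled out as a math block; this is reconstructed from the inequalities above rather than copied from the slide:

  \hat{A}\, i \le b : \quad
  \begin{pmatrix} -1 & 0 \\ 1 & 0 \\ 0 & -1 \\ 0 & 1 \\ 1 & -1 \\ -1 & 1 \end{pmatrix}
  \begin{pmatrix} i_w \\ i_r \end{pmatrix}
  \le
  \begin{pmatrix} 0 \\ 5 \\ 0 \\ 5 \\ -1 \\ 1 \end{pmatrix}

Any integer solution, for example (iw, ir) = (0, 1), certifies the loop-carried dependence.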

53 Generalization
An affine loop nest:
FOR i1 = fl1(c1...ck) to fu1(c1...ck)
 FOR i2 = fl2(i1, c1...ck) to fu2(i1, c1...ck)
  ...
   FOR in = fln(i1...i(n-1), c1...ck) to fun(i1...i(n-1), c1...ck)
    A[fa1(i1...in, c1...ck), fa2(i1...in, c1...ck), ..., fam(i1...in, c1...ck)]
Solve 2*n problems of the form:
  i1 = j1, i2 = j2, ..., i(n-1) = j(n-1), in < jn
  i1 = j1, i2 = j2, ..., i(n-1) = j(n-1), jn < in
  i1 = j1, i2 = j2, ..., i(n-1) < j(n-1)
  i1 = j1, i2 = j2, ..., j(n-1) < i(n-1)
  ...
  i1 = j1, i2 < j2
  i1 = j1, j2 < i2
  i1 < j1
  j1 < i1
Saman Amarasinghe 6.035 MIT Fall 2006

54 Outline Why Parallelism Parallel Execution Parallelizing Compilers Dependence Analysis Increasing Parallelization Opportunities

55 Increasing Parallelization Opportunities Scalar Privatization, Reduction Recognition, Induction Variable Identification, Array Privatization, Loop Transformations, Granularity of Parallelism, Interprocedural Parallelization. Saman Amarasinghe 6.035 MIT Fall 2006

56 Scalar Privatization Example:
FOR i = 1 to n
  X = A[i] * 3;
  B[i] = X;
Is there a loop-carried dependence? What is the type of dependence? Saman Amarasinghe 6.035 MIT Fall 2006

57 Privatization Analysis: any anti and output loop-carried dependences. Eliminate by assigning in local context:
FOR i = 1 to n
  integer Xtmp;
  Xtmp = A[i] * 3;
  B[i] = Xtmp;
Eliminate by expanding into an array:
FOR i = 1 to n
  Xtmp[i] = A[i] * 3;
  B[i] = Xtmp[i];
Saman Amarasinghe 6.035 MIT Fall 2006

58 Privatization Need a final assignment to maintain the correct value after the loop nest. Eliminate by assigning in local context:
FOR i = 1 to n
  integer Xtmp;
  Xtmp = A[i] * 3;
  B[i] = Xtmp;
  if (i == n) X = Xtmp;
Eliminate by expanding into an array:
FOR i = 1 to n
  Xtmp[i] = A[i] * 3;
  B[i] = Xtmp[i];
X = Xtmp[n];
Saman Amarasinghe 6.035 MIT Fall 2006

59 Another Example How about loop-carried true dependences? Example:
FOR i = 1 to n
  X = X + A[i];
Is this loop parallelizable? Saman Amarasinghe 6.035 MIT Fall 2006

60 Reduction Recognition Analysis: only associative operations, and the result is never used within the loop. Transformation:
Integer Xtmp[NUMPROC];
Barrier();
FOR i = myPid*Iters to MIN((myPid+1)*Iters, n)
  Xtmp[myPid] = Xtmp[myPid] + A[i];
Barrier();
If(myPid == 0) {
  FOR p = 0 to NUMPROC-1
    X = X + Xtmp[p];
}
Saman Amarasinghe 6.035 MIT Fall 2006
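A minimal C sketch of the same reduction transformation with POSIX threads; NUMPROC, the array contents, and the per-thread block size are placeholders (compile with -pthread):

  #include <pthread.h>
  #include <stdio.h>

  #define N        1000
  #define NUMPROC  4

  static double A[N], Xtmp[NUMPROC];   /* one partial sum slot per processor */
  static long Iters;

  static void *partial_sum(void *arg) {
      long myPid = (long)arg;
      long hi = (myPid + 1) * Iters;
      if (hi > N) hi = N;
      for (long i = myPid * Iters; i < hi; i++)
          Xtmp[myPid] += A[i];         /* each thread updates only its own slot */
      return NULL;
  }

  int main(void) {
      double X = 0.0;
      for (long i = 0; i < N; i++) A[i] = 1.0;
      Iters = (N + NUMPROC - 1) / NUMPROC;

      pthread_t thr[NUMPROC];
      for (long p = 0; p < NUMPROC; p++)
          pthread_create(&thr[p], NULL, partial_sum, (void *)p);
      for (long p = 0; p < NUMPROC; p++)
          pthread_join(thr[p], NULL);

      for (int p = 0; p < NUMPROC; p++)   /* sequential combine, as on the slide */
          X += Xtmp[p];
      printf("X = %f\n", X);              /* expect 1000.0 */
      return 0;
  }

The transformation relies on the operation being associative, as the analysis above requires.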

61 Induction Variables Example:
FOR i = 0 to N
  A[i] = 2^i;
After strength reduction:
t = 1
FOR i = 0 to N
  A[i] = t;
  t = t*2;
What happened to the loop-carried dependences? Need to do the opposite of this: perform induction variable analysis, and rewrite IVs as a function of the loop variable. Saman Amarasinghe 6.035 MIT Fall 2006
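A small C sketch contrasting the two forms; the array size and the use of a shift for 2^i are my choices for the example:

  #include <stdio.h>
  #include <stdint.h>

  #define N 32

  /* Strength-reduced form: t is an induction variable that carries a
   * loop dependence from one iteration to the next. */
  void fill_serial(uint64_t A[N + 1]) {
      uint64_t t = 1;
      for (int i = 0; i <= N; i++) { A[i] = t; t = t * 2; }
  }

  /* Rewritten so each iteration computes the value from the loop variable
   * alone (A[i] = 2^i as a shift): no loop-carried dependence, so the loop
   * can become a FORALL. */
  void fill_forall(uint64_t A[N + 1]) {
      for (int i = 0; i <= N; i++) A[i] = (uint64_t)1 << i;
  }

  int main(void) {
      uint64_t a[N + 1], b[N + 1];
      fill_serial(a);
      fill_forall(b);
      for (int i = 0; i <= N; i++)
          if (a[i] != b[i]) { printf("mismatch at %d\n", i); return 1; }
      printf("identical results\n");
      return 0;
  }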

62 Array Privatization Similar to scalar privatization; however, the analysis is more complex. Array Data Dependence Analysis: checks if two iterations access the same location. Array Data Flow Analysis: checks if two iterations access the same value. Transformations: similar to scalar privatization; a private copy for each processor, or expand with an additional dimension. Saman Amarasinghe 6.035 MIT Fall 2006

63 Loop Transformations A loop may not be parallel as is. Example:
FOR i = 1 to N-1
  FOR j = 1 to N-1
    A[i,j] = A[i,j-1] + A[i-1,j];
Saman Amarasinghe 6.035 MIT Fall 2006

64 Loop Transformations A loop may not be parallel as is. Example:
FOR i = 1 to N-1
  FOR j = 1 to N-1
    A[i,j] = A[i,j-1] + A[i-1,j];
After loop skewing (i_new = i_old + j_old - 1, j_new = j_old):
FOR i = 1 to 2*N-3
  FORPAR j = max(1, i-N+2) to min(i, N-1)
    A[i-j+1,j] = A[i-j+1,j-1] + A[i-j,j];
Saman Amarasinghe 6.035 MIT Fall 2006
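A plain C rendering of the skewed nest above, with the FORPAR written as an ordinary for loop whose iterations are independent and could be distributed across threads; N is arbitrary:

  #define N 8
  double A[N][N];

  /* Wavefront traversal: for a fixed i (one anti-diagonal), the j iterations
   * do not depend on each other, so the inner loop is parallelizable. */
  void skewed(void) {
      for (int i = 1; i <= 2 * N - 3; i++) {
          int jlo = (1 > i - N + 2) ? 1 : i - N + 2;   /* max(1, i-N+2) */
          int jhi = (i < N - 1) ? i : N - 1;           /* min(i, N-1)   */
          for (int j = jlo; j <= jhi; j++)             /* FORPAR on the slide */
              A[i - j + 1][j] = A[i - j + 1][j - 1] + A[i - j][j];
      }
  }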

65 Granularity of Parallelism Example:
FOR i = 1 to N-1
  FOR j = 1 to N-1
    A[i,j] = A[i,j] + A[i-1,j];
Gets transformed into:
FOR i = 1 to N-1
  Barrier();
  FOR j = 1 + myPid*Iters to MIN((myPid+1)*Iters, N-1)
    A[i,j] = A[i,j] + A[i-1,j];
  Barrier();
Inner-loop parallelism can be expensive: startup and teardown overhead of parallel regions, a lot of synchronization; it can even lead to slowdowns. Saman Amarasinghe 6.035 MIT Fall 2006

66 Granularity of Parallelism Inner-loop parallelism can be expensive. Solutions: don't parallelize if the amount of work within the loop is too small, or transform into outer-loop parallelism. Saman Amarasinghe 6.035 MIT Fall 2006

67 Outer Loop Parallelism Example:
FOR i = 1 to N-1
  FOR j = 1 to N-1
    A[i,j] = A[i,j] + A[i-1,j];
After loop interchange (transpose):
FOR j = 1 to N-1
  FOR i = 1 to N-1
    A[i,j] = A[i,j] + A[i-1,j];
Gets mapped into:
Barrier();
FOR j = 1 + myPid*Iters to MIN((myPid+1)*Iters, N-1)
  FOR i = 1 to N-1
    A[i,j] = A[i,j] + A[i-1,j];
Barrier();
Saman Amarasinghe 6.035 MIT Fall 2006

68 Unimodular Transformations Interchange, reverse and skew. Use a matrix transformation: I_new = A I_old.
Interchange: [i_new, j_new] = [[0, 1], [1, 0]] [i_old, j_old]
Reverse:     [i_new, j_new] = [[-1, 0], [0, 1]] [i_old, j_old]
Skew:        [i_new, j_new] = [[1, 1], [0, 1]] [i_old, j_old]
Saman Amarasinghe 6.035 MIT Fall 2006

69 Legality of Transformations A unimodular transformation with matrix A is valid iff, for all dependence vectors v, the first non-zero entry in Av is positive. Example:
FOR i = 1 to N-1
  FOR j = 1 to N-1
    A[i,j] = A[i,j] + A[i-1,j];
dv = [1, 0]
Interchange A = [[0, 1], [1, 0]]: A dv = [0, 1], legal.
Reverse A = [[-1, 0], [0, 1]]: A dv = [-1, 0], illegal.
Skew A = [[1, 1], [0, 1]]: A dv = [1, 0], legal.
Saman Amarasinghe 6.035 MIT Fall 2006
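A small C sketch of the legality check for 2-D unimodular transformations; the matrix and dependence-vector representations are my own assumptions:

  #include <stdbool.h>

  /* A 2x2 unimodular transformation T is legal iff, for every dependence
   * vector v, the first non-zero entry of T*v is positive. */
  bool transform_legal(const int T[2][2], const int dv[][2], int ndeps) {
      for (int k = 0; k < ndeps; k++) {
          int tv0 = T[0][0] * dv[k][0] + T[0][1] * dv[k][1];
          int tv1 = T[1][0] * dv[k][0] + T[1][1] * dv[k][1];
          int first = (tv0 != 0) ? tv0 : tv1;   /* first non-zero entry (0 if both zero) */
          if (first < 0)
              return false;                     /* T*v is lexicographically negative */
      }
      return true;
  }

With dv = {{1, 0}}, the interchange matrix above yields [0, 1] (legal) while the reversal matrix yields [-1, 0] (illegal).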

70 Interprocedural Parallelization Function calls will make a loop unparallelizable: a reduction of the available parallelism, leaving mostly inner-loop parallelism. Solutions: interprocedural analysis, or inlining. Saman Amarasinghe 6.035 MIT Fall 2006

71 Interprocedural Parallelization Issues: the same function is reused many times; analyze a function on each trace (possibly exponential), or analyze a function once (the unrealizable-path problem). Interprocedural analysis: needs to update all the analyses, complex analysis, can be expensive. Inlining: works with existing analyses, but large code bloat can be very expensive. Saman Amarasinghe 6.035 MIT Fall 2006

72 Summary Multicores are here; we need parallelism to keep the performance gains. Programmer-defined or compiler-extracted parallelism. Automatic parallelization of loops with arrays requires data dependence analysis: the iteration space and data space abstraction, and an integer programming problem. Many optimizations that'll increase parallelism. Saman Amarasinghe 6.035 MIT Fall 2006

73 MIT OpenCourseWare http://ocw.mit.edu 6.035 Computer Language Engineering, Spring 2010. For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms
