Lecture 11 Loop Transformations for Parallelism and Locality

Lecture 11: Loop Transformations for Parallelism and Locality
1. Examples
2. Affine Partitioning: Do-all
3. Affine Partitioning: Pipelining
Readings: Chapter , ,

Shared Memory Machines
Performance on shared address space multiprocessors: parallelism & locality.
[Figure: processors P connected to memories M through an interconnect]

Parallelism and Locality
Parallelism does NOT imply speedup!
Parallel performance: improve locality with loop transformations; minimize communication; operations that use the same data are executed on the same processor.
Sequential performance: improve locality with loop transformations; minimize cache misses; operations that use the same data are executed close together in time.

Loop Permutation (Loop Interchange)
Original nest:
  for i = 1 to 4
    for j = 1 to 3
      Z[i,j] = Z[i-1,j]
After interchange:
  for j = 1 to 3
    for i = 1 to 4
      Z[i,j] = Z[i-1,j]
The interchange is the unimodular mapping
  [ j' ]   [ 0 1 ] [ i ]
  [ i' ] = [ 1 0 ] [ j ]
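A minimal C sketch of the interchange (not from the slides; the bounds i = 1..4, j = 1..3, the array size, and the function name are assumptions chosen to match the running example):

    /* Loop interchange: the dependence Z[i-1][j] -> Z[i][j] is carried by the
     * i loop, so moving j outermost leaves an outer loop whose iterations are
     * independent of each other.  (In row-major C the original form has the
     * better spatial locality; the interchange trades that for an outer loop
     * that could be run in parallel.) */
    void interchange_demo(int Z[5][4]) {
        /* Original nest: i outer, j inner. */
        for (int i = 1; i <= 4; i++)
            for (int j = 1; j <= 3; j++)
                Z[i][j] = Z[i - 1][j];

        /* Interchanged nest: same iterations, same dependences, j outer. */
        for (int j = 1; j <= 3; j++)
            for (int i = 1; i <= 4; i++)
                Z[i][j] = Z[i - 1][j];
    }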

Loop Fusion
Before fusion:
  for i = 1 to 4
    T[i] = A[i] + B[i]        (s1)
  for i = 1 to 4
    C[i] = T[i] x T[i]        (s2)
After fusion:
  for i = 1 to 4
    T[i] = A[i] + B[i]        (s1)
    C[i] = T[i] x T[i]        (s2)
Affine mappings to the fused loop index:
  s1: [j] = [1] [i]
  s2: [j] = [1] [i']

Loop Transformations
Unimodular transforms on loop nests: interchange, skewing, reversal.
Cross-statement transforms: loop fusion, loop fission, re-indexing.
How do we combine them to get parallelism and locality?
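A C sketch of the fusion shown above (the wrapper function, array sizes, and element type are assumptions; the statements are the slide's s1 and s2):

    /* Loop fusion: after fusion, each T[i] is consumed by s2 immediately
     * after s1 produces it, so it is still in cache (or a register) when
     * it is read. */
    void fusion_demo(const double A[5], const double B[5],
                     double T[5], double C[5]) {
        /* Before fusion: two loops over the same index range. */
        for (int i = 1; i <= 4; i++)
            T[i] = A[i] + B[i];            /* s1 */
        for (int i = 1; i <= 4; i++)
            C[i] = T[i] * T[i];            /* s2 */

        /* After fusion: one loop, s1 and s2 back to back. */
        for (int i = 1; i <= 4; i++) {
            T[i] = A[i] + B[i];            /* s1 */
            C[i] = T[i] * T[i];            /* s2 */
        }
    }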

Affine Partitioning: A Contrived but Illustrative Example
  FOR j = 1 TO n
    FOR i = 1 TO n
      A[i,j] = A[i,j] + B[i-1,j];    (S1)
      B[i,j] = A[i,j-1] * B[i,j];    (S2)
[Figure: the iteration spaces of S1 and S2 in the (i, j) plane]

Best Parallelization Scheme
The algorithm finds an affine partition mapping for each instruction:
  S1: execute iteration (i, j) on processor i - j.
  S2: execute iteration (i, j) on processor i - j + 1.
SPMD code (let p be the processor's ID number):
  if (1-n <= p <= n) then
    if (1 <= p) then
      B[p,1] = A[p,0] * B[p,1];                          (S2)
    for i1 = max(1, 1+p) to min(n, n-1+p) do
      A[i1,i1-p] = A[i1,i1-p] + B[i1-1,i1-p];            (S1)
      B[i1,i1-p+1] = A[i1,i1-p] * B[i1,i1-p+1];          (S2)
    if (p <= 0) then
      A[n+p,n] = A[n+p,n] + B[n+p-1,n];                  (S1)
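A C sketch of the SPMD code above, as the routine one processor would run (the function name, parameter layout, and element type are assumptions; the index arithmetic is transcribed from the slide):

    /* Processor p's share under the partition S1 -> i-j, S2 -> i-j+1.
     * Arrays are indexed 0..n in both dimensions; valid ids are 1-n <= p <= n. */
    void worker(int p, int n, double A[n + 1][n + 1], double B[n + 1][n + 1]) {
        if (p < 1 - n || p > n)
            return;
        if (1 <= p)                                    /* leading S2 instance  */
            B[p][1] = A[p][0] * B[p][1];
        int lo = (1 + p > 1) ? 1 + p : 1;              /* max(1, 1+p)          */
        int hi = (n - 1 + p < n) ? n - 1 + p : n;      /* min(n, n-1+p)        */
        for (int i1 = lo; i1 <= hi; i1++) {
            A[i1][i1 - p] = A[i1][i1 - p] + B[i1 - 1][i1 - p];          /* S1 */
            B[i1][i1 - p + 1] = A[i1][i1 - p] * B[i1][i1 - p + 1];      /* S2 */
        }
        if (p <= 0)                                    /* trailing S1 instance */
            A[n + p][n] = A[n + p][n] + B[n + p - 1][n];
    }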

2. Iteration Space
  FOR i = 0 to 5
    FOR j = i to 7
An n-deep loop nest defines an n-dimensional polytope; the iterations are the coordinates in this iteration space. Assume each iteration index is incremented in its loop. The sequential execution order is lexicographic order:
  [0,0], [0,1], ..., [0,6], [0,7], [1,1], ..., [1,6], [1,7], ...

Maximum Parallelism & No Communication
[Figure: accesses F1 i1 + f1 and F2 i2 + f2 map loop iterations into the array; mappings C1 i1 + c1 and C2 i2 + c2 assign iterations to processor IDs]
For every pair of data-dependent accesses F1 i1 + f1 and F2 i2 + f2, find C1, c1, C2, c2 such that, for all i1, i2:
  F1 i1 + f1 = F2 i2 + f2  implies  C1 i1 + c1 = C2 i2 + c2,
with the objective of maximizing the rank of C1, C2.
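Written out in LaTeX (standard notation; the shapes of the matrices and vectors are the usual ones and are not spelled out on the slide), the space-partition constraint is:

    % Communication-free (space) partition: whenever two data-dependent
    % accesses touch the same array element, the affine processor mappings
    % must place the two iterations on the same processor.
    \[
      \forall \mathbf{i}_1, \mathbf{i}_2 :\quad
      F_1 \mathbf{i}_1 + \mathbf{f}_1 = F_2 \mathbf{i}_2 + \mathbf{f}_2
      \;\Longrightarrow\;
      C_1 \mathbf{i}_1 + \mathbf{c}_1 = C_2 \mathbf{i}_2 + \mathbf{c}_2,
    \]
    \[
      \text{maximizing } \operatorname{rank}(C_1),\ \operatorname{rank}(C_2).
    \]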

Rank of Partitioning = Degree of Parallelism
  Affine mapping                            Rank   Mapped to the same processor
  p = [0 0] [i; j] = 0                       0     all iterations
  p = [0 1] [i; j] = j                       1     iterations that differ only in i
  [p1; p2] = [1 0; 0 1] [i; j] = [i; j]      2     no two distinct iterations

Example 1: Loop Transform
  Z[i,j] = Z[i-1,j]
Find an affine partitioning, i.e. c1, c2, c0 such that
  p = [c1 c2] [i; j] + c0
Suppose iterations (i, j) and (i', j') refer to the same location:
  i' = i - 1, j' = j
No communication means
  c1 i' + c2 j' + c0 = c1 i + c2 j + c0
  c1 (i - 1) + c2 j + c0 = c1 i + c2 j + c0
  =>  c1 = 0, so p = c2 j + c0
Pick the simplest c2, c0: c2 = 1, c0 = 0, giving p = j.
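A sketch (not slide code) of the partition p = j from Example 1, realized with OpenMP; the bounds follow the running example, and the pragma is simply ignored if the code is compiled without OpenMP support:

    /* p = j: all iterations with the same j run on the same thread, so the
     * dependence Z[i-1][j] -> Z[i][j] never crosses threads and no
     * communication or synchronization is needed inside the nest. */
    void example1(int Z[5][4]) {
        #pragma omp parallel for
        for (int j = 1; j <= 3; j++)
            for (int i = 1; i <= 4; i++)
                Z[i][j] = Z[i - 1][j];
    }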

Code Generation
Naive: each processor visits all the iterations and executes an iteration only if it owns it.
Optimization: remove the unnecessary looping and condition evaluation.

Code Generation
Original statement (loops over i and j implicit):
  Z[i,j] = Z[i-1,j]
Partition: p = j.
Naive:
  for P = 1 to 3
    if (j == P)
      Z[i,j] = Z[i-1,j]
Optimized:
  for P = 1 to 3
    Z[i,P] = Z[i-1,P]
SPMD (single program multiple data) code, on processor P:
  Z[i,P] = Z[i-1,P]
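A C rendering of the two code-generation strategies for one processor P (the explicit i and j bounds are taken from the running example, not from the slide):

    /* Naive: processor P walks the whole iteration space and executes only
     * the iterations it owns, i.e. those with j == P. */
    void naive(int P, int Z[5][4]) {
        for (int i = 1; i <= 4; i++)
            for (int j = 1; j <= 3; j++)
                if (j == P)
                    Z[i][j] = Z[i - 1][j];
    }

    /* Optimized: the ownership test is folded into the subscripts and
     * bounds, so processor P touches only column P and evaluates no
     * conditionals. */
    void optimized(int P, int Z[5][4]) {
        for (int i = 1; i <= 4; i++)
            Z[i][P] = Z[i - 1][P];
    }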

Loop Permutation (Loop Interchange)
  Z[i,j] = Z[i-1,j]
  for P = 1 to 3
    Z[i,P] = Z[i-1,P]
The permutation corresponds to the affine mapping
  [ p' ]   [ 0 1 ] [ i ]
  [ i' ] = [ 1 0 ] [ j ]

Example 2: Loop Fusion
  for i = 1 to 4
    T[i] = A[i] + B[i]        (s1)
  for i = 1 to 4
    C[i] = T[i] x T[i]        (s2)
Find an affine partitioning, i.e. c1,1, c1,0, c2,1, c2,0 such that
  s1: [p] = [c1,1] [i]  + c1,0
  s2: [p] = [c2,1] [i'] + c2,0
Suppose iterations i and i' refer to the same location: i' = i.
No communication means
  c1,1 i + c1,0 = c2,1 i' + c2,0
  =>  c1,1 = c2,1 and c1,0 = c2,0
Pick the simplest values: c1,1 = c2,1 = 1, c1,0 = c2,0 = 0, giving p = i and p = i'.
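A small SPMD sketch in C of the partition p = i just derived (the function name and array sizes are assumptions): giving both statements the same mapping is exactly what makes the fused, communication-free code on the next slide possible.

    /* Processor P owns element P of every array: it runs s1 and then s2 on
     * that element, so the value T[P] never has to move between processors. */
    void fused_worker(int P, const double A[5], const double B[5],
                      double T[5], double C[5]) {
        T[P] = A[P] + B[P];        /* s1 */
        C[P] = T[P] * T[P];        /* s2 */
    }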

Loop Fusion
Original:
  for i = 1 to 4
    T[i] = A[i] + B[i]        (s1)
  for i = 1 to 4
    C[i] = T[i] x T[i]        (s2)
Naive code for the partition p = i:
  for P = 1 to 4
    for i = 1 to 4
      if (i == P) T[i] = A[i] + B[i]     (s1)
    for i = 1 to 4
      if (i == P) C[i] = T[i] x T[i]     (s2)
Affine mappings:
  s1: [p] = [1] [i]
  s2: [p] = [1] [i']
Optimized code:
  for P = 1 to 4
    T[P] = A[P] + B[P]        (s1)
    C[P] = T[P] x T[P]        (s2)

Example 3: 2 Nested, Parallel Loops
  Z[i,j] = Z[i,j] + 1
Find an affine partitioning, i.e. c1, c2, c0 such that
  p = [c1 c2] [i; j] + c0
Suppose iterations (i, j) and (i', j') refer to the same location: i' = i, j' = j.
No communication means
  c1 i' + c2 j' + c0 = c1 i + c2 j + c0
  c1 i + c2 j + c0 = c1 i + c2 j + c0
which imposes no constraints. There are two basis vectors, [c1 c2] = [1 0] or [c1 c2] = [0 1], and therefore two answers for p: two degrees of parallelism:
  [ p1 ]   [ 1 0 ] [ i ]
  [ p2 ] = [ 0 1 ] [ j ]
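Since any combination of the two basis vectors is a valid partition, both loop dimensions can run in parallel; a sketch with a collapsed OpenMP nest (not slide code; bounds as in the code-generation example that follows):

    /* Two degrees of parallelism for Z[i][j] = Z[i][j] + 1: no iteration
     * depends on any other, so the whole 2-D iteration space can be
     * distributed; collapse(2) hands out (p1, p2) pairs to threads. */
    void example3(int Z[5][4]) {
        #pragma omp parallel for collapse(2)
        for (int p1 = 1; p1 <= 4; p1++)
            for (int p2 = 1; p2 <= 3; p2++)
                Z[p1][p2] = Z[p1][p2] + 1;
    }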

Example 3: 2 Nested, Parallel Loops
  Z[i,j] = Z[i,j] + 1
  [ p1 ]   [ 1 0 ] [ i ]
  [ p2 ] = [ 0 1 ] [ j ]
Naive code:
  for p1 = 1 to 4
    for p2 = 1 to 3
      if (i == p1 & j == p2)
        Z[i,j] = Z[i,j] + 1
Optimized code:
  for p1 = 1 to 4
    for p2 = 1 to 3
      Z[p1,p2] = Z[p1,p2] + 1

Optimizing Arbitrary Loop Nesting Using Affine Partitions (chotst, NAS)
      DO 1 J = 0, N
         I0 = MAX ( -M, -J )
         DO 2 I = I0, -1
            DO 3 JJ = I0 - I, -1
               DO 3 L = 0, NMAT
    3             A(L,I,J) = A(L,I,J) - A(L,JJ,I+J) * A(L,I+JJ,J)
            DO 2 L = 0, NMAT
    2          A(L,I,J) = A(L,I,J) * A(L,0,I+J)
         DO 4 L = 0, NMAT
    4       EPSS(L) = EPS * A(L,0,J)
         DO 5 JJ = I0, -1
            DO 5 L = 0, NMAT
    5          A(L,0,J) = A(L,0,J) - A(L,JJ,J) ** 2
         DO 1 L = 0, NMAT
    1       A(L,0,J) = 1. / SQRT ( ABS (EPSS(L) + A(L,0,J)) )

      DO 6 I = 0, NRHS
         DO 7 K = 0, N
            DO 8 L = 0, NMAT
    8          B(I,L,K) = B(I,L,K) * A(L,0,K)
            DO 7 JJ = 1, MIN (M, N-K)
               DO 7 L = 0, NMAT
    7             B(I,L,K+JJ) = B(I,L,K+JJ) - A(L,-JJ,K+JJ) * B(I,L,K)
         DO 6 K = N, 0, -1
            DO 9 L = 0, NMAT
    9          B(I,L,K) = B(I,L,K) * A(L,0,K)
            DO 6 JJ = 1, MIN (M, K)
               DO 6 L = 0, NMAT
    6             B(I,L,K-JJ) = B(I,L,K-JJ) - A(L,-JJ,K) * B(I,L,K)
[Figure: affine partitions for arrays A, B, and EPSS, each along the L dimension]

Chotst: Results with Affine Partitioning + Blocking
(Unimodular transformation is a subset of affine partitioning, applicable only to perfect loop nests.)
[Figure: speedup vs. number of processors for "Unimodular + Blocking" and "Affine Partitioning + Blocking"]

Summary of Affine Partitioning
Communication-free partitioning:
[Figure: accesses F1 i1 + f1 and F2 i2 + f2 map loop iterations into the array; mappings C1 i1 + c1 and C2 i2 + c2 assign iterations to processor IDs]

Advanced Topic: Pipelining
SOR (Successive Over-Relaxation): an example
  for i = 1 to m
    for j = 1 to n
      A[i,j] = c * (A[i-1,j] + A[i,j-1])
[Figure: the (i, j) iteration space with its dependences]

Finding the Maximum Degree of Pipelining
[Figure: accesses F1 i1 + f1 and F2 i2 + f2 map loop iterations into the array; mappings C1 i1 + c1 and C2 i2 + c2 assign iterations to time stages]
For every pair of data-dependent accesses F1 i1 + f1 and F2 i2 + f2, let B1 i1 + b1 >= 0 and B2 i2 + b2 >= 0 be the corresponding loop bound constraints. Find C1, c1, C2, c2 such that, for all i1, i2 with
  B1 i1 + b1 >= 0,  B2 i2 + b2 >= 0,  i1 executed before i2,  and  F1 i1 + f1 = F2 i2 + f2,
we have
  C1 i1 + c1 <= C2 i2 + c2,
with the objective of maximizing the rank of C1, C2.
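A C sketch of one way to realize the pipelined parallelism in SOR (not slide code): skew the nest into time stages w = i + j; every iteration in a stage depends only on iterations in the previous stage, so each stage's iterations can run in parallel, with a barrier between stages.

    /* SOR over A[0..m][0..n]: A[i][j] depends on A[i-1][j] and A[i][j-1],
     * both of which lie on the previous anti-diagonal w - 1.  The outer w
     * loop is the time stage (synchronization); the inner loop is the
     * parallel stage. */
    void sor_wavefront(int m, int n, double c, double A[m + 1][n + 1]) {
        for (int w = 2; w <= m + n; w++) {              /* time stage     */
            int lo = (w - n > 1) ? w - n : 1;           /* max(1, w-n)    */
            int hi = (w - 1 < m) ? w - 1 : m;           /* min(m, w-1)    */
            #pragma omp parallel for
            for (int i = lo; i <= hi; i++) {            /* parallel stage */
                int j = w - i;
                A[i][j] = c * (A[i - 1][j] + A[i][j - 1]);
            }
        }
    }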

Key Insight
Choice in the time mapping => (pipelined) parallelism.
Rank(C) - 1 degrees of parallelism with 1 degree of synchronization; blocks with Rank(C) dimensions can be created.
Finding time partitions is not as straightforward as finding space partitions: we need to deal with linear inequalities. This is solved using Farkas' Lemma; there is no simple intuitive proof.

Summary of Affine Partitioning
Communication-free:
[Figure: accesses F1 i1 + f1 and F2 i2 + f2 from the loops into the array, with mappings C1 i1 + c1 and C2 i2 + c2 onto processor IDs]
Pipelining:
[Figure: the same accesses, with mappings C1 i1 + c1 and C2 i2 + c2 onto time stages]
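For reference, the affine form of Farkas' Lemma that the time-partitioning step leans on (a standard statement, not spelled out on the slide): an affine function is nonnegative over a nonempty polyhedron exactly when it is a nonnegative affine combination of the polyhedron's constraints.

    % Affine form of Farkas' Lemma, for a nonempty polyhedron {x : Ax + b >= 0}
    % and an affine function psi:
    \[
      \psi(\mathbf{x}) \ge 0 \;\text{ for all } \mathbf{x} \text{ with } A\mathbf{x} + \mathbf{b} \ge \mathbf{0}
      \quad\Longleftrightarrow\quad
      \exists\, \lambda_0 \ge 0,\ \boldsymbol{\lambda} \ge \mathbf{0} :\;
      \psi(\mathbf{x}) \equiv \lambda_0 + \boldsymbol{\lambda}^{\mathsf{T}} (A\mathbf{x} + \mathbf{b}).
    \]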
