1 Linear Loop Transformations for Locality Enhancement
2 Story so far: Cache performance can be improved by tiling and permutation. Permutation of a perfectly nested loop can be modeled as a linear transformation on the iteration space of the loop nest. Legality of permutation can be determined from the dependence matrix of the loop nest. Transformed code can be generated using an ILP calculator.
3 Theory for permutations applies to other loop transformations that can be modeled as linear transformations: skewing, reversal, scaling. Transformation matrix: T (a non-singular matrix). Dependence matrix: D, in which each column is a distance/direction vector. Dependence matrix of transformed program: T*D. Legality: every column of T*D must be lexicographically positive. Small complication with code generation if scaling is included.
4 Overview of lecture: Linear loop transformations: permutation, skewing, reversal, scaling. Compositions of linear loop transformations: matrix viewpoint. Locality matrix: model for temporal and spatial locality. Two key ideas: 1. Height reduction. 2. Making a loop nest fully permutable. A unified algorithm for improving cache performance. Lecture based on work we did for HP's compiler product line.
5 Key algorithm: finding a basis for the null space of an integer matrix. Definition: a vector x is in the null space of a matrix M if Mx = 0. It is easy to find a basis for the null space of a matrix in column-echelon form. General (m x z) matrix M: find a unimodular (z x z) matrix U so that MU = L is in column-echelon form. If x is a vector in the null space of L, Ux is in the null space of M. If B is a basis for the null space of L, UB is a basis for the null space of M.
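A minimal sketch of this idea in Python. The 1 x 2 matrix M below is illustrative (the slide's own example is garbled): reduce M to column-echelon form with a unimodular U, read off the null-space basis of L, and map it back through U.

```python
# Illustrative example: M = [1 -1]; any x with x1 = x2 is in its null space.
M = [[1, -1]]

# Unimodular column operation: add column 1 to column 2, i.e. U = [[1,1],[0,1]].
U = [[1, 1],
     [0, 1]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

L = matmul(M, U)          # column-echelon form
assert L == [[1, 0]]

# Null space of L is spanned by e2 = (0,1); map it back through U.
e2 = [[0], [1]]
basis = matmul(U, e2)     # (1,1): a null-space basis vector for M
assert matmul(M, basis) == [[0]]
```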
6 Example: reduce M to column-echelon form L via M*U = L; a basis for the null space of M is obtained by applying U to the null-space basis of L. (The matrix entries of this slide were lost in transcription.)
7 Some observations: We have used reduction to column-echelon form: MU = L. We can also view this as a matrix decomposition: M = LQ, where Q is the inverse of the U from the previous step. Analogy: QR factorization of real matrices. Unimodular matrices are like orthogonal matrices in the reals. Special form of this decomposition: M = LQ where all diagonal elements of L are non-negative. This is called a pseudo-Hermite normal form of the matrix.
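Continuing the illustrative example above, the same reduction read backwards is the decomposition M = LQ with Q = U^{-1}:

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

M = [[1, -1]]
L = [[1, 0]]              # column-echelon form of M
Q = [[1, -1],             # Q = U^{-1}, the inverse of [[1,1],[0,1]]
     [0,  1]]

# The reduction M*U = L, read backwards, is the decomposition M = L*Q.
assert matmul(L, Q) == M

# Q is unimodular: |det Q| = 1.
det = Q[0][0] * Q[1][1] - Q[0][1] * Q[1][0]
assert det in (1, -1)
```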
8 Motivating example: wavefront code.
DO I = 1,N
  DO J = 1,N
    X(I,J) = X(I-1,J+1)...
Dependence between two iterations: the iterations touch the same location. Dependence matrix = (1,-1)^T => potential for exploiting data reuse! There are N iteration points between executions of dependent iterations. Can we schedule dependent iterations closer together?
9 For now, focus only on reuse opportunities between dependent iterations. This exploits only one kind of temporal locality. There are other kinds of locality that are important: iterations that read from the same location (input dependences), and spatial locality. Both are easy to add to the basic model, as we will see later.
10 Exploiting temporal locality in wavefront: we have studied two transformations, permutation and tiling. Permutation is illegal! Tiling is illegal!
11 Height Reduction
12 One solution: schedule iterations along 45-degree planes! The transformation is legal, and dependent iterations are scheduled close together, so it is good for locality. Can we view this in terms of loop transformations? Loop skewing followed by loop reversal.
13 Loop skewing: a linear loop transformation. Skewing of inner loop by outer loop: (U,V)^T = T*(I,J)^T with T = [[1,0],[k,1]] (k is some fixed integer). Skewing of the inner loop by an outer loop is always legal. New dependence vectors: compute T*D. In this example, D = (1,-1)^T and T*D = (1,0)^T. This skewing has changed the dependence vector, but it has not brought dependent iterations closer together...
14 Skewing outer loop by inner loop: (U,V)^T = T*(I,J)^T with T = [[1,k],[0,1]]. Skewing of the outer loop by the inner loop is not necessarily legal. In this example, D = (1,-1)^T and T*D = (0,-1)^T, which is lexicographically negative: incorrect. Dependent iterations are closer together (good), but the program is illegal (bad). How do we fix this?
15 Loop reversal: a linear loop transformation. [U] = [-1][I]:
DO I = 1,N          DO U = -N,-1
  X(I) = I+2    vs    X(-U) = -U+2
Transformation matrix = [-1]. Another example: 2-D loop, reverse the inner loop: (U,V)^T = [[1,0],[0,-1]]*(I,J)^T. Legality of loop reversal: apply the transformation matrix to all dependences and verify that each remains lexicographically positive. Code generation: easy.
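The legality check can be sketched in a few lines of Python, using the wavefront dependence (1,-1) from the earlier slide as the example:

```python
def lex_positive(v):
    """True iff the first non-zero component of v is positive."""
    for x in v:
        if x != 0:
            return x > 0
    return False

def apply(T, d):
    return [sum(T[i][j] * d[j] for j in range(len(d))) for i in range(len(T))]

d = [1, -1]                       # wavefront dependence vector

inner_reversal = [[1, 0],
                  [0, -1]]
assert lex_positive(apply(inner_reversal, d))      # (1,1): legal here

outer_reversal = [[-1, 0],
                  [0, 1]]
assert not lex_positive(apply(outer_reversal, d))  # (-1,-1): illegal
```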
16 Need for composite transformations. Transformation: skewing followed by reversal. In the final program, dependent iterations are close together! Composition of linear transformations = another linear transformation! The composite transformation matrix is [[1,1],[0,-1]] = [[1,0],[0,-1]] * [[1,1],[0,1]]. How do we synthesize this composite transformation?
17 Some facts about permutation/reversal/skewing: Transformation matrices for permutation/reversal/skewing are unimodular. Any composition of these transformations can be represented by a unimodular matrix. Any unimodular matrix can be decomposed into a product of permutation/reversal/skewing matrices. Legality of a composite transformation T: check that every column of T*D is lexicographically positive. (Proof: T3(T2(T1*D)) = (T3*T2*T1)*D.) Code generation algorithm: original bounds A*I <= b; transformation U = T*I; new bounds computed from A*T^{-1}*U <= b.
18 Synthesizing composite transformations using matrix-based approaches: rather than reason about sequences of transformations, we can reason about the single matrix that represents the composite transformation. Enabling abstraction: the dependence matrix.
19 Height reduction: move reuse into inner loops! The dependence vector is (1,-1)^T. We prefer not to have dependent iterations in different outer-loop iterations, so the dependence vector of the transformed program should look like (0,+)^T. So T*(1,-1)^T = (0,+)^T. This says the first row of T is orthogonal to (1,-1)^T, so the first row of T can be chosen to be (1 1). What about the second row?
20 The second row of T (call it s) should satisfy the following properties: s should be linearly independent of the first row of T (non-singular matrix); T*D should be a lexicographically positive vector; the determinant of the matrix should be +1 or -1 (unimodularity). One choice: (0 -1), giving T = [[1,1],[0,-1]].
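All three conditions on the chosen T can be checked directly; a small Python sketch for the wavefront:

```python
# Wavefront: d = (1,-1); first row (1,1) orthogonal to d, second row (0,-1).
d = [1, -1]
T = [[1, 1],
     [0, -1]]

def apply(T, d):
    return [sum(T[i][j] * d[j] for j in range(len(d))) for i in range(len(T))]

Td = apply(T, d)
assert Td[0] == 0                  # first row is orthogonal to d
assert Td == [0, 1]                # T*d is lexicographically positive

det = T[0][0] * T[1][1] - T[0][1] * T[1][0]
assert det in (1, -1)              # unimodular, so the rows are independent
```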
21 General questions: 1. How do we choose the first few rows of the transformation matrix to push reuses down? 2. How do we fill in the rest of the matrix to get a unimodular matrix?
22 Choosing the first few rows of the transformation matrix: for now, assume the dependence matrix D has only distances in it. Algorithm: let B be a basis for the null space of D^T; use the transposes of the basis vectors as the first few rows of the transformation matrix. Example: study the running example in the previous slides.
23 Extension to direction vectors: easy. Consider D = (+, +, 2, -2)^T, whose first two rows contain direction entries. Let the first row of the transformation be (r1 r2 r3 r4). Since this row must be orthogonal to D, it is clear that r1 and r2 must be zero. Ignore the 1st and 2nd rows of D and find a basis for the null space of the remaining rows: (1 1). Pad with zeros to get the partial transformation: (0 0 1 1). General picture: knock out the rows of D with direction entries, use the algorithm for distance vectors on what remains, and pad the resulting null-space vectors with 0s.
24 Filling in the rest of the matrix: a completion procedure generates a non-singular matrix T given the first few rows of the matrix. It turns out to be easier if we drop the unimodularity requirement. Problem: how do we generate code if the transformation matrix is a general non-singular matrix?
25 Completion procedure. Goal: generate a complete transformation from a partially specified one. Given: a dependence matrix D and a partial transformation P (the first few rows). Precondition: the rows of P are linearly independent, and P*D violates no dependences yet (no column of P*D is lexicographically negative). Generate a non-singular matrix T that extends P and that satisfies all the dependences in D (every column of T*D lexicographically positive).
26 Completion algorithm (iterative): 1. From D, delete all dependences d for which P*d is lexicographically positive. 2. If D is now empty, use null-space basis determination to add more rows to P to generate a non-singular matrix. 3. Otherwise, let k be the first row of (reduced) D that has a non-zero entry. Use e_k (the unit row vector with 1 in column k and zero everywhere else) as the next row of the transformation. 4. Repeat from step (1). Proof: need to show that e_k is linearly independent of the existing rows of the partial transformation and does not violate any dependences. Easy (see the Li/Pingali paper).
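A runnable sketch of this completion loop, under one simplifying assumption: when the reduced D becomes empty, we extend P with unit rows that raise its rank, standing in for the slide's null-space-basis step (the result is still non-singular, though the specific rows may differ from a true null-space completion).

```python
from fractions import Fraction

def lex_positive(v):
    for x in v:
        if x != 0:
            return x > 0
    return False

def rank(rows):
    """Rank via Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in r] for r in rows]
    r = 0
    for c in range(len(m[0]) if m else 0):
        piv = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if piv is None:
            continue
        m[r], m[piv] = m[piv], m[r]
        for i in range(len(m)):
            if i != r and m[i][c] != 0:
                f = m[i][c] / m[r][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
    return r

def complete(P, D, n):
    """Extend partial transformation P (list of rows) to a non-singular
    n x n matrix satisfying the dependences D (list of column vectors)."""
    P = [list(r) for r in P]
    D = [list(d) for d in D]
    while len(P) < n:
        # Step 1: delete dependences already satisfied by P.
        D = [d for d in D
             if not lex_positive([sum(p[i] * d[i] for i in range(n)) for p in P])]
        if not D:
            # Step 2: fill out with independent unit rows (see note above).
            for k in range(n):
                e = [int(i == k) for i in range(n)]
                if rank(P + [e]) > rank(P):
                    P.append(e)
                    if len(P) == n:
                        break
            break
        # Step 3: first row of the reduced D with a non-zero entry.
        k = next(i for i in range(n) if any(d[i] != 0 for d in D))
        P.append([int(i == k) for i in range(n)])
    return P

# Wavefront: D = {(1,-1)}, partial transformation (1 1).
T = complete([[1, 1]], [[1, -1]], 2)
assert T == [[1, 1], [1, 0]]
assert lex_positive([sum(r[i] * [1, -1][i] for i in range(2)) for r in T])
```

Note the filter keeps a dependence only while P*d is not yet lexicographically positive, which is exactly the deletion rule of step 1.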
27 Example: partial transformation = (0 0 1 1). A dependence is satisfied by the partial transformation when its product with the rows chosen so far is lexicographically positive; deleting the satisfied dependences leaves a new, smaller partial dependence matrix. (The matrix entries of this slide were lost in transcription.)
28 All dependences are now satisfied, so we are left with the problem of completing the transformation with rows independent of the existing ones. A basis for the null space of the partial transformation supplies the remaining rows of T.
29 Problem: the transformation matrix T we obtain is non-singular, but it may not be unimodular. Two solutions, based on the pseudo-Hermite normal form decomposition T = LQ, where L is lower-triangular with positive diagonal elements: 1. Use Q as the transformation matrix. 2. Develop code-generation technology for non-singular matrices. Question: how do we know this is legal, and is good for locality? First, let us understand what transformations are modeled by non-singular matrices that are not unimodular.
30 Loop scaling: change the step size of a loop.
DO I = 1,N        DO U = 2,2*N,2
  Y(I) = I    vs    Y(U/2) = U/2
Scaling matrix: [2], a non-unimodular transformation!
31 Non-singular linear loop transformations: permutation/skewing/reversal/scaling. Scaling, like reversal, is not important by itself. Any non-singular integer matrix can be decomposed into a product of permutation/skewing/reversal/scaling matrices. Standard decompositions: (pseudo-)Hermite form T = L*Q, where Q is unimodular and L is lower triangular with positive diagonal elements; (pseudo-)Smith normal form T = U*D*V, where U and V are unimodular and D is diagonal with positive elements. Including scaling in the repertoire of loop transformations makes it easier to synthesize transformations but complicates code generation.
32 Code generation with a non-singular transformation T. Code generation for a unimodular matrix U: original bounds A*I <= b; transformation J = U*I; new bounds computed from A*U^{-1}*J <= b. Key problem here: T^{-1} is not necessarily an integer matrix.
33 Difficulties. Example, with T = [[-2,4],[1,1]] (i.e. U = -2I+4J, V = I+J):
DO I = 1,3                DO U = -2,10,2
  DO J = 1,3         =>     DO V = -U/2 + 3*max(1,ceil((U/2+1)/2)), -U/2 + 3*min(3,floor((U/2+3)/2)), 3
    A(4J-2I+3,I+J) = J        A(U+3,V) = (U+2V)/6
Key problems: how do we determine integer lower and upper bounds? (Rounding up from rational bounds is not correct.) How do we determine step sizes? Solution: use the Hermite normal form decomposition T = L*Q.
34 Running example: factorize T = L*Q, with T = [[-2,4],[1,1]], L = [[2,0],[-1,3]], Q = [[-1,2],[0,1]].
Initial iteration space:
DO I = 1,3
  DO J = 1,3
    A(4J-2I+3,I+J) = J
Auxiliary iteration space (transform by Q, i.e. P = -I+2J, Q = J):
DO P = -1,5
  DO Q = max(1,ceil((P+1)/2)), min(3,floor((P+3)/2))
    A(2P+3,-P+3Q) = Q
Final iteration space (transform by L, i.e. U = 2P, V = -P+3Q):
DO U = -2,10,2
  DO V = -U/2 + 3*max(1,ceil((U/2+1)/2)), -U/2 + 3*min(3,floor((U/2+3)/2)), 3
    A(U+3,V) = (U+2V)/6
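The factorization and the generated bounds can be checked mechanically. The concrete L and Q below are one valid pseudo-Hermite factorization of T = [[-2,4],[1,1]] (an assumption spelled out here, since the slide's numbers are partly garbled); the test enumerates the final loop nest and compares it against the image of the original 3 x 3 iteration space under T.

```python
import math

T = [[-2, 4], [1, 1]]          # (U,V) = (4J-2I, I+J)
L = [[2, 0], [-1, 3]]          # lower triangular, positive diagonal
Q = [[-1, 2], [0, 1]]          # unimodular (det = -1)

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

assert matmul(L, Q) == T

# Iteration points of the transformed space, computed directly from T.
direct = {(-2 * i + 4 * j, i + j) for i in range(1, 4) for j in range(1, 4)}

# Iteration points enumerated by the generated loop nest, with bounds read
# off from the auxiliary space and from L.
generated = set()
for u in range(-2, 11, 2):                   # DO U = -2,10,2
    p = u // 2
    qlo = max(1, math.ceil((p + 1) / 2))
    qhi = min(3, math.floor((p + 3) / 2))
    for q in range(qlo, qhi + 1):
        generated.add((u, -p + 3 * q))       # V = -U/2 + 3Q
assert generated == direct
```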
35 Using a non-singular matrix T as the transformation matrix: given T = L*Q, where L is lower-triangular with positive diagonal elements and Q is unimodular, use Q as the transformation to generate bounds for the auxiliary space, then read off the bounds for the final iteration space from L. Running example: [[-2,4],[1,1]] = [[2,0],[-1,3]] * [[-1,2],[0,1]]. Auxiliary bounds: -1 <= p <= 5, max(1, ceil((p+1)/2)) <= q <= min(3, floor((p+3)/2)). Final bounds, via (u,v)^T = L*(p,q)^T: -2 <= u <= 10 with step 2, and -u/2 + 3*max(1, ceil((u/2+1)/2)) <= v <= -u/2 + 3*min(3, floor((u/2+3)/2)) with step 3.
36 Change of variables in the body: (i,j)^T = T^{-1}*(u,v)^T = [[-1/6, 4/6],[1/6, 2/6]] * (u,v)^T. It is good to use strength reduction to replace the integer divisions and multiplications by additions. Step sizes of the loops: the diagonal entries of L. Bottom line: the transformation synthesis procedure can focus on producing a non-singular matrix; unimodularity is not important. A complete transformation with a non-singular matrix is also needed for generating code for distributed-memory parallel machines and similar problems (see Wolfe's book).
37 Solution 2 to code generation with non-singular transformation matrices. Lemma: if L is an n x n lower-triangular matrix with positive diagonal elements, L maps lexicographically positive vectors into lexicographically positive vectors. Proof: consider d = L*v where v is a lexicographically positive vector, and show that d is lexicographically positive as well. Intuition: L does not change the order in which iterations are done; it just renumbers them! Auxiliary-space iterations are performed in the same order as the corresponding iterations in the final space, so the memory system behavior of the auxiliary program is the same as the memory system behavior of the final program!
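The lemma is easy to check exhaustively on a small grid (the 2 x 2 matrix below is illustrative, not from the slides):

```python
def lex_positive(v):
    for x in v:
        if x != 0:
            return x > 0
    return False

# An arbitrary lower-triangular matrix with positive diagonal elements.
L = [[2, 0],
     [1, 3]]

def apply(L, v):
    return [sum(L[i][j] * v[j] for j in range(len(v))) for i in range(len(L))]

# Every lexicographically positive vector in a small grid stays
# lexicographically positive under L.
for a in range(-3, 4):
    for b in range(-3, 4):
        v = [a, b]
        if lex_positive(v):
            assert lex_positive(apply(L, v))
```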
38 For memory system optimization, we can use a non-singular transformation matrix T as follows: compute the Hermite normal form decomposition T = L*Q and use Q as the transformation matrix!
39 Putting it all together for the wavefront example:
DO I = 1,N
  DO J = 1,N
    X(I,J) = X(I-1,J+1)...
Dependence matrix = (1,-1)^T. Find rows orthogonal to the space spanned by the dependence vectors, and use them as the partial transformation. In this example, the orthogonal subspace is spanned by the vector (1 1). Apply the completion procedure to the partial transformation to generate the full transformation: we would add the row (1 0). So the transformation is [[1,1],[1,0]].
40 General picture: exploit dependences + spatial locality + read-only data reuse. 1. A dependence matrix D with dependence vectors. 2. A locality matrix L (containing D) with locality vectors. The locality matrix has additional vectors to account for spatial locality and read-only data reuse. Intuitively, we would like to scan along the directions in L in the innermost loops, provided the dependences permit it.
41 Modeling spatial locality: scan the iteration space points so that we get unit-stride accesses (or some multiple thereof). Example from Assignment 1:
DO I = ...
  DO J = ...
    C[I,J] = A[I,J] + C[J,I]
If the arrays are stored in column-major order, we should interchange the two loops. How do we determine this in general?
42 Spatial locality matrix: a matrix containing a basis for the null space of the truncated access matrix. Example: array stored in column-major order, reference X[A*i + a]. The array element accessed in iteration i1 is X[A*i1 + a]; the element accessed in iteration i2 is X[A*i2 + a]. Suppose we do iteration i1 and then iteration i2. We want these two accesses to be within a column of the array. So A'*(i1 - i2) = 0, where A' is A with its first row deleted. So we want to solve A'*R = 0, where R is the "reuse direction". Solution: R is any vector in the null space of A'.
43 In what direction(s) can we move so that we get accesses along the first dimension for the most part? Example: for a 2 x 2 access matrix A, delete the first row to get A' and take a basis of its null space; that basis is the spatial locality matrix. (The entries of this slide's example were lost in transcription.)
44 Example from Assignment 1: exploiting spatial locality in the MVM accesses to A.
FOR I = 1,N
  FOR J = 1,N
    Y(I) = Y(I) + A(I,J)*X(J)
The reference matrix for A is [[1,0],[0,1]]. The truncated matrix is (0 1). So the spatial locality matrix is (1 0)^T.
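This computation can be sketched directly:

```python
# Reference A(I,J) in the MVM loop: the access matrix is the 2x2 identity.
A_ref = [[1, 0],
         [0, 1]]

# Column-major storage: two accesses fall in the same column of A when all
# subscripts except the first agree, so delete the first row of A_ref.
A_trunc = A_ref[1:]            # the truncated matrix (0 1)

# A reuse direction R must satisfy A_trunc * R = 0; here R = (1,0) works.
R = [1, 0]
assert all(sum(row[j] * R[j] for j in range(2)) == 0 for row in A_trunc)
# Moving along I (with J fixed) walks down a column of A: unit stride.
```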
45 For spatial locality: perform height reduction on the locality matrix. For our example, we get the transformation [[0,1],[1,0]]. That is, we would permute the loops, which is correct.
46 Read-only data reuse: add reuse directions to the locality matrix.
DO I = 1,N
  DO J = 1,N
    DO K = 1,N
      C[I,J] = C[I,J] + A[I,K]*B[K,J]
Consider the reuse vector for the reference A[I,K]: iterations (I1,J1,K1) and (I2,J2,K2) touch the same element of A when I1 = I2 and K1 = K2, i.e. when I2 - I1 = 0 and K2 - K1 = 0, while J2 - J1 is unconstrained.
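A tiny brute-force check of this reuse condition on a 2 x 2 x 2 iteration space:

```python
from collections import defaultdict

# C[I,J] += A[I,K]*B[K,J]: which iterations read the same element of A?
readers = defaultdict(list)
n = 2
for i in range(n):
    for j in range(n):
        for k in range(n):
            readers[(i, k)].append((i, j, k))   # A[I,K] depends only on I and K

# Any two iterations touching the same A element differ only in J.
for its in readers.values():
    (i1, j1, k1), (i2, j2, k2) = its[0], its[-1]
    assert (i2 - i1, k2 - k1) == (0, 0) and j2 - j1 == 1
# So the reuse direction for A[I,K] is along J: the vector (0,1,0).
```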
47 The reuse vector is (0,1,0)^T. Considering all the references (A[I,K], B[K,J], C[I,J]), we get a locality matrix that contains (0,1,0)^T, (1,0,0)^T, and (0,0,1)^T.
48 General algorithm for height reduction: compute the locality matrix L, which contains all vectors along which we would like to exploit locality (dependences + read-only data reuse + spatial locality). Determine a basis for the null space of L^T and use the basis vectors as the first few rows of the transformation. Call the completion algorithm to generate a complete transformation T. Do a pseudo-Hermite form decomposition of the transformation matrix and use the unimodular part as the transformation. Flexibility: if the null space of L^T has only the zero vector in it, it may be good to drop some of the "less important" read-only reuse and spatial locality vectors from consideration.
49 In some codes, height reduction may fail since the locality vectors span the entire space. Example: MMM as shown on the previous slide (actually even MVM, if we want to exploit locality in both the matrix A and the vectors x and y). We can still exploit most of the locality provided we can tile the loops. Tiling is not always legal... Solution: apply linear loop transformations to enable tiling.
50 Linear loop transformations to enable tiling
51 In general, tiling is not legal: for the wavefront, D = (1,-1)^T and tiling is illegal! Tiling is legal if the loops are fully permutable (all permutations of the loops are legal); in particular, tiling is legal if all entries in the dependence matrix are non-negative. Can we always convert a perfectly nested loop nest into a fully permutable loop nest? When we can, how do we do it?
52 Theorem: if all dependence vectors are distance vectors, we can convert the entire loop nest into a fully permutable loop nest. General idea: skew inner loops by outer loops sufficiently to make all negative entries non-negative. Example: wavefront. The dependence matrix is (1,-1)^T, and the dependence matrix of the transformed program must have all non-negative entries. The first row of the transformation can be (1 0), and the second row of the transformation can be (m 1) (for any m >= 1).
53 Transformation to make the first row with negative entries into a row with non-negative entries: (a) for each negative entry in the first row with negative entries, find the first positive number in the corresponding column; call the rows containing these positive entries a, b, etc. (b) skew the row with negative entries by appropriate multiples of rows a, b, ... For the example on the slide (guard entry p2 above -n in one column, and guards p1 and p3 above -m and -k in the others): multiple of row a = ceiling(n/p2), multiple of row b = ceiling(max(m/p1, k/p3)). Transformation: the identity matrix with ceiling(n/p2) and ceiling(max(m/p1, k/p3)) placed in the skewed row, in the columns of rows a and b.
54 General algorithm for making a loop nest fully permutable: 1. If all entries in the dependence matrix are non-negative, done. 2. Otherwise, apply the algorithm on the previous slide to the first row with negative entries. 3. Generate the new dependence matrix. 4. If there are no negative entries, done; otherwise, go to step (2).
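A compact sketch of this skewing loop for the distance-only case. It assumes each negative entry has a positive guard entry above it in its column, which holds because every dependence column is a lexicographically positive distance vector:

```python
import math

def make_fully_permutable(D):
    """Skew each row with negative entries by multiples of earlier rows with
    positive guard entries in the same columns (distance vectors only).
    D is the dependence matrix as a list of rows; returns (T, new D)."""
    n = len(D)
    D = [row[:] for row in D]
    T = [[int(i == j) for j in range(n)] for i in range(n)]
    for r in range(n):
        for c, entry in enumerate(D[r]):
            if entry < 0:
                g = next(i for i in range(r) if D[i][c] > 0)   # guard row
                f = math.ceil(-entry / D[g][c])                # skew factor
                D[r] = [x + f * y for x, y in zip(D[r], D[g])]
                T[r] = [x + f * y for x, y in zip(T[r], T[g])]
    return T, D

# Wavefront: single dependence column (1,-1), i.e. rows [1] and [-1].
T, D2 = make_fully_permutable([[1], [-1]])
assert T == [[1, 0], [1, 1]]                 # skew inner loop by outer loop
assert all(x >= 0 for row in D2 for x in row)
```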
55 Result of tiling the transformed wavefront: tiling generates a 4-deep loop nest. Not as nice as the height reduction solution, but it will work fine for locality enhancement except at tile boundaries (and the number of boundary points is small compared to the number of interior points).
56 What happens with direction vectors? In general, we cannot make the loop nest fully permutable; the best we can do is to make some of the loops fully permutable. Example: D = (+, -, +)^T. We try to make the outermost loops fully permutable, so we would interchange the second and third loops, and then tile the first two loops only. Idea: the algorithm will find bands of fully permutable loop nests.
57 Example for the general algorithm: a 5-deep loop nest. Begin the first band of fully permutable loops with the first loop. The second row has a -ve direction entry which cannot be knocked out, but we can interchange the fifth row with the second row to continue the band. (The dependence matrix entries of this slide were lost in transcription.)
58 After the interchange, the new second loop can be part of the first fully permutable band. Knock out the -ve distances in the third row by adding 2 * the second row to the third row.
59 We cannot make the band any bigger, so terminate the band, drop all satisfied dependences, and start a second band. In this case, all dependences are satisfied, so the last two loops form a second band of fully permutable loops.
60 How do we determine the transformation matrix for performing these operations? If you have n loops, start with T = I (n x n). Every time you apply a transformation to the dependence matrix, apply the same transformation to T. At the end, T is your transformation matrix. (Proof sketch: U2*(U1*D) = (U2*(U1*I))*D.)
61 For the previous example, we start with the 5 x 5 identity; interchanging the second and fifth rows gives
1 0 0 0 0
0 0 0 0 1
0 0 1 0 0
0 0 0 1 0
0 1 0 0 0
62 Adding 2 * the second row to the third row then gives
1 0 0 0 0
0 0 0 0 1
0 0 1 0 2
0 0 0 1 0
0 1 0 0 0
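The same bookkeeping, replayed in a few lines of Python:

```python
# Start with the 5x5 identity and replay the two row operations from the
# example: interchange rows 2 and 5, then add 2*(new row 2) to row 3.
n = 5
T = [[int(i == j) for j in range(n)] for i in range(n)]

T[1], T[4] = T[4], T[1]                        # interchange rows 2 and 5
T[2] = [x + 2 * y for x, y in zip(T[2], T[1])]  # row 3 += 2 * row 2

assert T == [[1, 0, 0, 0, 0],
             [0, 0, 0, 0, 1],
             [0, 0, 1, 0, 2],
             [0, 0, 0, 1, 0],
             [0, 1, 0, 0, 0]]
```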
63 Conversion to fully permutable bands: 1. Compute the dependence matrix D. Current-row = row 1. 2. Begin a new fully permutable band. 3. If all entries in D are non-negative, add all remaining loops to the current band and exit. 4. Otherwise, find the first row r with one or more negative entries. Each negative entry must be guarded by a positive entry above it in its dependence vector. Add all loops between current-row and row (r-1) to the current band. 5. If row r does not have both positive and negative directions, fix it as follows: if all direction entries are negative, negate the row. At this point, all direction entries must be positive, and any remaining negative entries must be distances. If negative entries remain, determine the appropriate multiples of the guard
64 rows to be added to row r (treating "+" guard entries as 1) to convert all negative entries in row r to strictly positive. Update the transformation matrix and add row r to the current band. Current-row = r+1. Go to step 3. 6. Otherwise, see if a row below the current row that does not have both positive and negative directions can be interchanged legally with the current row. If so, perform the interchange, update the transformation matrix, and go to step 4. Otherwise, terminate the current fully permutable band, drop all satisfied dependence vectors, and go to step 2.
65 ompute locality matrix L.. Perform height reduction by nding a basis for null space of 2. important locality vectors. less ompute resulting dependence matrix after height reduction. 3. all entries are non-negative, declare loop nest to be fully If permutable. Otherwise, apply algorithm to convert to bands of fully 4. loop nests. permutable Perform pseudo-hermite normal form decomposition of 5. transformation matrix, and use unimodular extract to resulting code transformation. perform Tile fully permutable bands choosing appropriate tile sizes. 6. Ye Grande Locality Enhancement Algorithm L T. If null space is too small, you have the option of dropping 65
66 Summary: Height reduction can also be viewed as outer-loop parallelization: if no dependences are carried by the outer loops, the outer loops are parallel! First paper on transformation synthesis in this style: Leslie Lamport (1969) on multiprocessor scheduling. The algorithm here is based on Wei Li's Cornell PhD thesis (Wei is now a compiler hot-shot at Intel!). A version of this algorithm was implemented by us in HP's compiler product line. It is easy to combine height reduction and conversion to fully permutable band form into a single transformation matrix synthesis, but you can work that out for yourself.
More informationi=1 i=2 i=3 i=4 i=5 x(4) x(6) x(8)
Vectorization Using Reversible Data Dependences Peiyi Tang and Nianshu Gao Technical Report ANU-TR-CS-94-08 October 21, 1994 Vectorization Using Reversible Data Dependences Peiyi Tang Department of Computer
More informationArray Optimizations in OCaml
Array Optimizations in OCaml Michael Clarkson Cornell University clarkson@cs.cornell.edu Vaibhav Vaish Cornell University vaibhav@cs.cornell.edu May 7, 2001 Abstract OCaml is a modern programming language
More informationLARP / 2018 ACK : 1. Linear Algebra and Its Applications - Gilbert Strang 2. Autar Kaw, Transforming Numerical Methods Education for STEM Graduates
Triangular Factors and Row Exchanges LARP / 28 ACK :. Linear Algebra and Its Applications - Gilbert Strang 2. Autar Kaw, Transforming Numerical Methods Education for STEM Graduates Then there were three
More informationMultiple View Geometry in Computer Vision
Multiple View Geometry in Computer Vision Prasanna Sahoo Department of Mathematics University of Louisville 1 Projective 3D Geometry (Back to Chapter 2) Lecture 6 2 Singular Value Decomposition Given a
More informationLecture 11 Loop Transformations for Parallelism and Locality
Lecture 11 Loop Transformations for Parallelism and Locality 1. Examples 2. Affine Partitioning: Do-all 3. Affine Partitioning: Pipelining Readings: Chapter 11 11.3, 11.6 11.7.4, 11.9-11.9.6 1 Shared Memory
More informationData-centric Transformations for Locality Enhancement
Data-centric Transformations for Locality Enhancement Induprakas Kodukula Keshav Pingali September 26, 2002 Abstract On modern computers, the performance of programs is often limited by memory latency
More information(Sparse) Linear Solvers
(Sparse) Linear Solvers Ax = B Why? Many geometry processing applications boil down to: solve one or more linear systems Parameterization Editing Reconstruction Fairing Morphing 2 Don t you just invert
More information: How to Write Fast Numerical Code ETH Computer Science, Spring 2016 Midterm Exam Wednesday, April 20, 2016
ETH login ID: (Please print in capital letters) Full name: 263-2300: How to Write Fast Numerical Code ETH Computer Science, Spring 2016 Midterm Exam Wednesday, April 20, 2016 Instructions Make sure that
More informationExact Side Eects for Interprocedural Dependence Analysis Peiyi Tang Department of Computer Science The Australian National University Canberra ACT 26
Exact Side Eects for Interprocedural Dependence Analysis Peiyi Tang Technical Report ANU-TR-CS-92- November 7, 992 Exact Side Eects for Interprocedural Dependence Analysis Peiyi Tang Department of Computer
More informationCOMP 558 lecture 19 Nov. 17, 2010
COMP 558 lecture 9 Nov. 7, 2 Camera calibration To estimate the geometry of 3D scenes, it helps to know the camera parameters, both external and internal. The problem of finding all these parameters is
More informationAn Improved Measurement Placement Algorithm for Network Observability
IEEE TRANSACTIONS ON POWER SYSTEMS, VOL. 16, NO. 4, NOVEMBER 2001 819 An Improved Measurement Placement Algorithm for Network Observability Bei Gou and Ali Abur, Senior Member, IEEE Abstract This paper
More informationComputational Geometry
Lecture 12: Lecture 12: Motivation: Terrains by interpolation To build a model of the terrain surface, we can start with a number of sample points where we know the height. Lecture 12: Motivation: Terrains
More informationFinding two vertex connected components in linear time
TO 1 Finding two vertex connected components in linear time Guy Kortsarz TO 2 ackward edges Please read the DFS algorithm (its in the lecture notes). The DFS gives a tree. The edges on the tree are called
More informationComputing and Informatics, Vol. 36, 2017, , doi: /cai
Computing and Informatics, Vol. 36, 2017, 566 596, doi: 10.4149/cai 2017 3 566 NESTED-LOOPS TILING FOR PARALLELIZATION AND LOCALITY OPTIMIZATION Saeed Parsa, Mohammad Hamzei Department of Computer Engineering
More information2. Mathematical Notation Our mathematical notation follows Dijkstra [5]. Quantication over a dummy variable x is written (Q x : R:x : P:x). Q is the q
Parallel Processing Letters c World Scientic Publishing Company ON THE SPACE-TIME MAPPING OF WHILE-LOOPS MARTIN GRIEBL and CHRISTIAN LENGAUER Fakultat fur Mathematik und Informatik Universitat Passau D{94030
More informationParallelization. Saman Amarasinghe. Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology
Spring 2 Parallelization Saman Amarasinghe Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Outline Why Parallelism Parallel Execution Parallelizing Compilers
More informationAN EFFICIENT IMPLEMENTATION OF NESTED LOOP CONTROL INSTRUCTIONS FOR FINE GRAIN PARALLELISM 1
AN EFFICIENT IMPLEMENTATION OF NESTED LOOP CONTROL INSTRUCTIONS FOR FINE GRAIN PARALLELISM 1 Virgil Andronache Richard P. Simpson Nelson L. Passos Department of Computer Science Midwestern State University
More informationInteger Programming Theory
Integer Programming Theory Laura Galli October 24, 2016 In the following we assume all functions are linear, hence we often drop the term linear. In discrete optimization, we seek to find a solution x
More informationLecture 5: Duality Theory
Lecture 5: Duality Theory Rajat Mittal IIT Kanpur The objective of this lecture note will be to learn duality theory of linear programming. We are planning to answer following questions. What are hyperplane
More informationCompiler Optimizations. Chapter 8, Section 8.5 Chapter 9, Section 9.1.7
Compiler Optimizations Chapter 8, Section 8.5 Chapter 9, Section 9.1.7 2 Local vs. Global Optimizations Local: inside a single basic block Simple forms of common subexpression elimination, dead code elimination,
More informationLecture 18 Restoring Invariants
Lecture 18 Restoring Invariants 15-122: Principles of Imperative Computation (Spring 2018) Frank Pfenning In this lecture we will implement heaps and operations on them. The theme of this lecture is reasoning
More informationOutline. Why Parallelism Parallel Execution Parallelizing Compilers Dependence Analysis Increasing Parallelization Opportunities
Parallelization Outline Why Parallelism Parallel Execution Parallelizing Compilers Dependence Analysis Increasing Parallelization Opportunities Moore s Law From Hennessy and Patterson, Computer Architecture:
More informationc Nawaaz Ahmed 000 ALL RIGHTS RESERVED
LOCALITY ENHANCEMENT OF IMPERFECTLY-NESTED LOOP NESTS A Dissertation Presented to the Faculty of the Graduate School of Cornell University in Partial Fulllment of the Requirements for the Degree of Doctor
More informationExploring Parallelism At Different Levels
Exploring Parallelism At Different Levels Balanced composition and customization of optimizations 7/9/2014 DragonStar 2014 - Qing Yi 1 Exploring Parallelism Focus on Parallelism at different granularities
More informationObjective. We will study software systems that permit applications programs to exploit the power of modern high-performance computers.
CS 612 Software Design for High-performance Architectures 1 computers. CS 412 is desirable but not high-performance essential. Course Organization Lecturer:Paul Stodghill, stodghil@cs.cornell.edu, Rhodes
More informationLecture 3: Camera Calibration, DLT, SVD
Computer Vision Lecture 3 23--28 Lecture 3: Camera Calibration, DL, SVD he Inner Parameters In this section we will introduce the inner parameters of the cameras Recall from the camera equations λx = P
More informationAn Ecient Approximation Algorithm for the. File Redistribution Scheduling Problem in. Fully Connected Networks. Abstract
An Ecient Approximation Algorithm for the File Redistribution Scheduling Problem in Fully Connected Networks Ravi Varadarajan Pedro I. Rivera-Vega y Abstract We consider the problem of transferring a set
More informationEssential constraints: Data Dependences. S1: a = b + c S2: d = a * 2 S3: a = c + 2 S4: e = d + c + 2
Essential constraints: Data Dependences S1: a = b + c S2: d = a * 2 S3: a = c + 2 S4: e = d + c + 2 Essential constraints: Data Dependences S1: a = b + c S2: d = a * 2 S3: a = c + 2 S4: e = d + c + 2 S2
More informationComputational Methods CMSC/AMSC/MAPL 460. Vectors, Matrices, Linear Systems, LU Decomposition, Ramani Duraiswami, Dept. of Computer Science
Computational Methods CMSC/AMSC/MAPL 460 Vectors, Matrices, Linear Systems, LU Decomposition, Ramani Duraiswami, Dept. of Computer Science Some special matrices Matlab code How many operations and memory
More informationThe Polyhedral Model (Transformations)
The Polyhedral Model (Transformations) Announcements HW4 is due Wednesday February 22 th Project proposal is due NEXT Friday, extension so that example with tool is possible (see resources website for
More information15.082J and 6.855J. Lagrangian Relaxation 2 Algorithms Application to LPs
15.082J and 6.855J Lagrangian Relaxation 2 Algorithms Application to LPs 1 The Constrained Shortest Path Problem (1,10) 2 (1,1) 4 (2,3) (1,7) 1 (10,3) (1,2) (10,1) (5,7) 3 (12,3) 5 (2,2) 6 Find the shortest
More informationDual-fitting analysis of Greedy for Set Cover
Dual-fitting analysis of Greedy for Set Cover We showed earlier that the greedy algorithm for set cover gives a H n approximation We will show that greedy produces a solution of cost at most H n OPT LP
More informationLecture Notes for Chapter 2: Getting Started
Instant download and all chapters Instructor's Manual Introduction To Algorithms 2nd Edition Thomas H. Cormen, Clara Lee, Erica Lin https://testbankdata.com/download/instructors-manual-introduction-algorithms-2ndedition-thomas-h-cormen-clara-lee-erica-lin/
More informationRanking Functions. Linear-Constraint Loops
for Linear-Constraint Loops Amir Ben-Amram 1 for Loops Example 1 (GCD program): while (x > 1, y > 1) if x
More informationColumn and row space of a matrix
Column and row space of a matrix Recall that we can consider matrices as concatenation of rows or columns. c c 2 c 3 A = r r 2 r 3 a a 2 a 3 a 2 a 22 a 23 a 3 a 32 a 33 The space spanned by columns of
More informationTilings of the Euclidean plane
Tilings of the Euclidean plane Yan Der, Robin, Cécile January 9, 2017 Abstract This document gives a quick overview of a eld of mathematics which lies in the intersection of geometry and algebra : tilings.
More informationIntroduction to Software Engineering
Introduction to Software Engineering (CS350) Lecture 17 Jongmoon Baik Testing Conventional Applications 2 Testability Operability it operates cleanly Observability the results of each test case are readily
More informationCalibrating a Structured Light System Dr Alan M. McIvor Robert J. Valkenburg Machine Vision Team, Industrial Research Limited P.O. Box 2225, Auckland
Calibrating a Structured Light System Dr Alan M. McIvor Robert J. Valkenburg Machine Vision Team, Industrial Research Limited P.O. Box 2225, Auckland New Zealand Tel: +64 9 3034116, Fax: +64 9 302 8106
More informationCompiler Optimizations. Chapter 8, Section 8.5 Chapter 9, Section 9.1.7
Compiler Optimizations Chapter 8, Section 8.5 Chapter 9, Section 9.1.7 2 Local vs. Global Optimizations Local: inside a single basic block Simple forms of common subexpression elimination, dead code elimination,
More informationAM 121: Intro to Optimization Models and Methods Fall 2017
AM 121: Intro to Optimization Models and Methods Fall 2017 Lecture 10: Dual Simplex Yiling Chen SEAS Lesson Plan Interpret primal simplex in terms of pivots on the corresponding dual tableau Dictionaries
More informationThe Matrix-Tree Theorem and Its Applications to Complete and Complete Bipartite Graphs
The Matrix-Tree Theorem and Its Applications to Complete and Complete Bipartite Graphs Frankie Smith Nebraska Wesleyan University fsmith@nebrwesleyan.edu May 11, 2015 Abstract We will look at how to represent
More informationNotes for Lecture 20
U.C. Berkeley CS170: Intro to CS Theory Handout N20 Professor Luca Trevisan November 13, 2001 Notes for Lecture 20 1 Duality As it turns out, the max-flow min-cut theorem is a special case of a more general
More informationSegmentation: Clustering, Graph Cut and EM
Segmentation: Clustering, Graph Cut and EM Ying Wu Electrical Engineering and Computer Science Northwestern University, Evanston, IL 60208 yingwu@northwestern.edu http://www.eecs.northwestern.edu/~yingwu
More informationOptimizing Cache Performance in Matrix Multiplication. UCSB CS240A, 2017 Modified from Demmel/Yelick s slides
Optimizing Cache Performance in Matrix Multiplication UCSB CS240A, 2017 Modified from Demmel/Yelick s slides 1 Case Study with Matrix Multiplication An important kernel in many problems Optimization ideas
More informationReducing Memory Requirements of Nested Loops for Embedded Systems
Reducing Memory Requirements of Nested Loops for Embedded Systems 23.3 J. Ramanujam Λ Jinpyo Hong Λ Mahmut Kandemir y A. Narayan Λ Abstract Most embedded systems have limited amount of memory. In contrast,
More informationIncreasing Parallelism of Loops with the Loop Distribution Technique
Increasing Parallelism of Loops with the Loop Distribution Technique Ku-Nien Chang and Chang-Biau Yang Department of pplied Mathematics National Sun Yat-sen University Kaohsiung, Taiwan 804, ROC cbyang@math.nsysu.edu.tw
More informationLecture 57 Dynamic Programming. (Refer Slide Time: 00:31)
Programming, Data Structures and Algorithms Prof. N.S. Narayanaswamy Department of Computer Science and Engineering Indian Institution Technology, Madras Lecture 57 Dynamic Programming (Refer Slide Time:
More informationTheory and Algorithms for the Generation and Validation of Speculative Loop Optimizations
Theory and Algorithms for the Generation and Validation of Speculative Loop Optimizations Ying Hu Clark Barrett Benjamin Goldberg Department of Computer Science New York University yinghubarrettgoldberg
More informationChapter 3. Quadric hypersurfaces. 3.1 Quadric hypersurfaces Denition.
Chapter 3 Quadric hypersurfaces 3.1 Quadric hypersurfaces. 3.1.1 Denition. Denition 1. In an n-dimensional ane space A; given an ane frame fo;! e i g: A quadric hypersurface in A is a set S consisting
More informationCluster algebras and infinite associahedra
Cluster algebras and infinite associahedra Nathan Reading NC State University CombinaTexas 2008 Coxeter groups Associahedra and cluster algebras Sortable elements/cambrian fans Infinite type Much of the
More informationModule 5: Performance Issues in Shared Memory and Introduction to Coherence Lecture 9: Performance Issues in Shared Memory. The Lecture Contains:
The Lecture Contains: Data Access and Communication Data Access Artifactual Comm. Capacity Problem Temporal Locality Spatial Locality 2D to 4D Conversion Transfer Granularity Worse: False Sharing Contention
More informationOutline. Issues with the Memory System Loop Transformations Data Transformations Prefetching Alias Analysis
Memory Optimization Outline Issues with the Memory System Loop Transformations Data Transformations Prefetching Alias Analysis Memory Hierarchy 1-2 ns Registers 32 512 B 3-10 ns 8-30 ns 60-250 ns 5-20
More informationCatalan Numbers. Table 1: Balanced Parentheses
Catalan Numbers Tom Davis tomrdavis@earthlink.net http://www.geometer.org/mathcircles November, 00 We begin with a set of problems that will be shown to be completely equivalent. The solution to each problem
More informationCOT 6936: Topics in Algorithms! Giri Narasimhan. ECS 254A / EC 2443; Phone: x3748
COT 6936: Topics in Algorithms! Giri Narasimhan ECS 254A / EC 2443; Phone: x3748 giri@cs.fiu.edu http://www.cs.fiu.edu/~giri/teach/cot6936_s12.html https://moodle.cis.fiu.edu/v2.1/course/view.php?id=174
More informationLinear Equations in Linear Algebra
1 Linear Equations in Linear Algebra 1.2 Row Reduction and Echelon Forms ECHELON FORM A rectangular matrix is in echelon form (or row echelon form) if it has the following three properties: 1. All nonzero
More information/ Approximation Algorithms Lecturer: Michael Dinitz Topic: Linear Programming Date: 2/24/15 Scribe: Runze Tang
600.469 / 600.669 Approximation Algorithms Lecturer: Michael Dinitz Topic: Linear Programming Date: 2/24/15 Scribe: Runze Tang 9.1 Linear Programming Suppose we are trying to approximate a minimization
More informationModule 18: Loop Optimizations Lecture 35: Amdahl s Law. The Lecture Contains: Amdahl s Law. Induction Variable Substitution.
The Lecture Contains: Amdahl s Law Induction Variable Substitution Index Recurrence Loop Unrolling Constant Propagation And Expression Evaluation Loop Vectorization Partial Loop Vectorization Nested Loops
More information