Program transformations and optimizations in the polyhedral framework

Size: px

Start display at page:

Download "Program transformations and optimizations in the polyhedral framework"

Florence Brittany Skinner
5 years ago
Views:

1 Program transformations and optimizations in the polyhedral framework Louis-Noël Pouchet Dept. of Computer Science UCLA May 14, st Polyhedral Spring School St-Germain au Mont d or, France

2 : Overview UCLA 2

3 Overview & Motivation: Agenda for the Lecture Disclaimer: this lecture is (mainly) for new Ph.D students 1 Program Representation in the Polyhedral Model 2 Applying loop transformations in the model 3 Exact data dependence representation 4 Scheduling using Farkas-based methods 5 Scheduling using approximate data dependences If time permits... UCLA 3

4 Overview & Motivation: Running Example: DGEMM Example (dgemm) /* C := alpha*a*b + beta*c */ for (i = 0; i < ni; i++) for (j = 0; j < nj; j++) S1: C[i][j] *= beta; for (i = 0; i < ni; i++) for (j = 0; j < nj; j++) for (k = 0; k < nk; ++k) S2: C[i][j] += alpha * A[i][k] * B[k][j]; Loop transformation: permute(i,k,s2) Execution time (in s) on this laptop, GCC 4.2, ni=nj=nk=512 version -O0 -O1 -O2 -O3 -vec original permute UCLA 4

5 Overview & Motivation: Running Example: fdtd-2d Example (fdtd-2d) } for(t = 0; t < tmax; t++) { for (j = 0; j < ny; j++) ey[0][j] = _edge_[t]; for (i = 1; i < nx; i++) for (j = 0; j < ny; j++) ey[i][j] = ey[i][j] - 0.5*(hz[i][j]-hz[i-1][j]); for (i = 0; i < nx; i++) for (j = 1; j < ny; j++) ex[i][j] = ex[i][j] - 0.5*(hz[i][j]-hz[i][j-1]); for (i = 0; i < nx - 1; i++) for (j = 0; j < ny - 1; j++) hz[i][j] = hz[i][j] - 0.7* (ex[i][j+1] - ex[i][j] + ey[i+1][j]-ey[i][j]); Loop transformation: polyhedralopt(fdtd-2d) Execution time (in s) on this laptop, GCC 4.2, 64x1024x1024 version -O0 -O1 -O2 -O3 -vec original polyhedralopt UCLA 5

6 Overview & Motivation: Manual transformations? Example (fdtd-2d) } for(t = 0; t < tmax; t++) { for (j = 0; j < ny; j++) ey[0][j] = _edge_[t]; for (i = 1; i < nx; i++) for (j = 0; j < ny; j++) ey[i][j] = ey[i][j] - 0.5*(hz[i][j]-hz[i-1][j]); for (i = 0; i < nx; i++) for (j = 1; j < ny; j++) ex[i][j] = ex[i][j] - 0.5*(hz[i][j]-hz[i][j-1]); for (i = 0; i < nx - 1; i++) for (j = 0; j < ny - 1; j++) hz[i][j] = hz[i][j] - 0.7* (ex[i][j+1] - ex[i][j] + ey[i+1][j]-ey[i][j]); UCLA 6

7 Overview & Motivation: NO NO NO!!! Example (fdtd-2d tiled) for (c0 = 0; c0 <= (((ny + 2 * tmax + -3) * 32 < 0?((32 < 0?-((-(ny + 2 * tmax + -3) ) / 32) : -((-(ny + 2 * tmax + -3) ) / 32))) : (ny + 2 * tmax + -3) / 32)); ++c0) { #pragma omp parallel for private(c3, c4, c2, c5) for (c1 = (((c0 * 2 < 0?-(-c0 / 2) : ((2 < 0?(-c ) / -2 : (c ) / 2)))) > (((32 * c0 + -tmax + 1) * 32 < 0?-(-(32 * c0 + -tmax + 1) / 32) : ((32 < 0?(-(32 * c0 + -tmax + 1) ) / -32 : (32 * c0 + -tmax ) / 32))))?((c0 * 2 < 0?-(-c0 / 2) : ((2 < 0?(-c ) / -2 : (c ) / 2)))) : (((32 * c0 + -tmax + 1) * 32 < 0?-(-(32 * c0 + -tmax + 1) / 32) : ((32 < 0?(-(32 * c0 + -tmax + 1) ) / -32 : (32 * c0 + -tmax ) / 32))))); c1 <= (((((((ny + tmax + -2) * 32 < 0?((32 < 0?-((-(ny + tmax + -2) ) / 32) : -((-(ny + tmax + -2) ) / 32))) : (ny + tmax + -2) / 32)) < (((32 * c0 + ny + 30) * 64 < 0?((64 < 0?-((-(32 * c0 + ny + 30) ) / 64) : -((-(32 * c0 + ny + 30) ) / 64))) : (32 * c0 + ny + 30) / 64))?(((ny + tmax + -2) * 32 < 0?((32 < 0?-((-(ny + tmax + -2) ) / 32) : -((-(ny + tmax + -2) ) / 32))) : (ny + tmax + -2) / 32)) : (((32 * c0 + ny + 30) * 64 < 0?((64 < 0?-((-(32 * c0 + ny + 30) ) / 64) : -((-(32 * c0 + ny + 30) ) / 64))) : (32 * c0 + ny + 30) / 64)))) < c0?(((((ny + tmax + -2) * 32 < 0?((32 < 0?-((-(ny + tmax + -2) ) / 32) : -((-(ny + tmax + -2) ) / 32))) : (ny + tmax + -2) / 32)) < (((32 * c0 + ny + 30) * 64 < 0?((64 < 0?-((-(32 * c0 + ny + 30) ) / 64) : -((-(32 * c0 + ny + 30) ) / 64))) : (32 * c0 + ny + 30) / 64))?(((ny + tmax + -2) * 32 < 0?((32 < 0?-((-(ny + tmax + -2) ) / 32) : -((-(ny + tmax + -2) ) / 32))) : (ny + tmax + -2) / 32)) : (((32 * c0 + ny + 30) * 64 < 0?((64 < 0?-((-(32 * c0 + ny + 30) ) / 64) : -((-(32 * c0 + ny + 30) ) / 64))) : (32 * c0 + ny + 30) / 64)))) : c0)); ++c1) { for (c2 = c0 + -c1; c2 <= (((((tmax + nx + -2) * 32 < 0?((32 < 0?-((-(tmax + nx + -2) ) / 32) : -((-(tmax + nx + -2) ) / 32))) : (tmax + nx + -2) / 32)) < (((32 * c * c1 + nx + 30) * 32 < 0?((32 < 0?-((-(32 * c * c1 + nx + 30) ) / 32) : -((-(32 * c * c1 + nx + 30) ) / 32))) : (32 * c * c1 + nx + 30) / 32))?(((tmax + nx + -2) * 32 < 0?((32 < 0?-((-(tmax + nx + -2) ) / 32) : -((-(tmax + nx + -2) ) / 32))) : (tmax + nx + -2) / 32)) : (((32 * c * c1 + nx + 30) * 32 < 0?((32 < 0?-((-(32 * c * c1 + nx + 30) ) / 32) : -((-(32 * c * c1 + nx + 30) ) / 32))) : (32 * c * c1 + nx + 30) / 32)))); ++c2) { if (c0 == 2 * c1 && c0 == 2 * c2) { for (c3 = 16 * c0; c3 <= ((tmax + -1 < 16 * c0 + 30?tmax + -1 : 16 * c0 + 30)); ++c3) if (c0 % 2 == 0) (ey[0])[0] = (_edge_[c3]);... (200 more lines!) Performance gain: 2-6 on modern multicore platforms UCLA 6

8 Overview & Motivation: Loop Transformations in Production Compilers Limitations of standard syntactic frameworks: Composition of transformations may be tedious composability rules / applicability Parametric loop bounds, imperfectly nested loops are challenging Look at the examples! Approximate dependence analysis Miss parallelization opportunities (among many others) (Very) conservative performance models UCLA 7

9 Overview & Motivation: Achievements of Polyhedral Compilation The polyhedral model: Model/apply seamlessly arbitrary compositions of transformations Automatic handling of imperfectly nested, parametric loop structures Many loop transformation can be modeled Exact dependence analysis on a class of programs Unleash the power of automatic parallelization Aggressive multi-objective program restructuring (parallelism, SIMD, cache, etc.) Requires computationally expensive algorithms Usually NP-complete / exponential complexity Requires careful problem statement/representation UCLA 8

10 Overview & Motivation: Research around Polyhedral Compilation "first generation" Provided the theoretical foundations Focused on high-level/abstract objectives Was questioned about how practical is this model "second generation" Provided practical / "more scalable" algorithms Focused on concrete performance improvement Still often questioned about how practical is this model Thanks for the legacy! ;-) UCLA 9

11 Overview & Motivation: Perspective on 25 Years of Computing "first generation" Parallel machines were... rare Era of increased program performance for free Intel 486 DX2: 54 MIPS, 66MHz "second generation" Parallel machines are ubiquitous (multi-core, SIMD, GPUs) Very difficult to get good performance Intel Core i7: MIPS, 3.33 GHz UCLA 10

12 Overview & Motivation: Perspective on 25 Years of Computing "first generation" Parallel machines were... rare Era of increased program performance for free Intel 486 DX2: 54 MIPS, 66MHz "second generation" Parallel machines are ubiquitous (multi-core, SIMD, GPUs) Very difficult to get good performance Intel Core i7: MIPS, 3.33 GHz More complex problem can be solved, but more complex objectives are needed anyway UCLA 10

13 Overview & Motivation: Disclaimer This lecture is: Partial polyhedral model affine scheduling - Google Scholar 5/14/13 12:57 AM Web Images More louisnoel.pouchet@gmail.com polyhedral model affine scheduling Scholar About 3,520 results (0.06 sec) My Citations 3 Articles Iterative optimization in the polyhedral model: Part II, multidimensional time LN Pouchet, C Bastoul, A Cohen, J Cavazos - ACM SIGPLAN Notices, dl.acm.org Legal documents loops) where iterative affine schedul- ing has never been attempted before. 4. We demonstrate significant performance improvements on 3 different architectures. The paper is structured as follows. Section 2 introduces multi- dimensional scheduling in the polyhedral model.... Any time Cited by 80 Related articles All 20 versions Cite Since 2013 Since 2012 Iterative optimization in the polyhedral model: Part I, one-dimensional time Since 2009 LN Pouchet, C Bastoul, A Cohen - Code Generation and, ieeexplore.ieee.org Custom range that can be ex- pressed as one-dimensional affine schedules (using a polyhedral abstraction); this... ing approaches on loop transformation sequences, or state-of-the-art affine scheduling; such smaller... of the optimal code, a confirmation that building a predictive model for loop... Sort by relevance Cited by 85 Related articles All 29 versions Cite Sort by date Automatic transformations for communication-minimized parallelization and locality optimization in the polyhedral model include patents U Bondhugula, M Baskaran, S Krishnamoorthy - Compiler, Springer include citations... The polyhedral model provides powerful abstractions to optimize loop nests with regular accesses. Affine transformations in this model capture a com- plex sequence of execution... Although a significant body of research has addressed affine scheduling and partitioning, the... Create alert Cited by 60 Related articles All 19 versions Cite Biased (strongly!) Provocative at times (hopefully ;-) ) [PDF] from inria.fr [PDF] from inria.fr [PDF] from ohio-state.edu Memory reuse analysis in the polyhedral model [PDF] from psu.edu D Wilde, S Rajopadhye - Euro-Par'96 Parallel Processing, Springer... Currently, the user chooses the transformations based on the analyses enabled by the polyhedral UCLA model, specifies it... Scheduling: There has been much research on the SARE scheduling problem 11 [9, 17, 8, i, j) - i + j and tx (i) = 2i are valid (one dimensional) affine schedules for... Cited by 38 Related articles All 10 versions Cite

14 Overview & Motivation: Polyhedral Program Representation UCLA 12

15 Polyhedral Program Representation: Basics Example: DGEMM Example (dgemm) /* C := alpha*a*b + beta*c */ for (i = 0; i < ni; i++) for (j = 0; j < nj; j++) { S1: C[i][j] *= beta; for (k = 0; k < nk; ++k) S2: C[i][j] += alpha * A[i][k] * B[k][j]; } This code has: imperfectly nested loops multiple statements parametric loop bounds UCLA 13

16 Polyhedral Program Representation: Basics Granularity of Program Representation DGEMM has: 3 loops For loops in the code, while loops Control-flow graph analysis 2 (syntactic) statements Input source code? Basic block? ASM instructions? S1 is executed ni nj times dynamic instances of the statement Does not (necessarily) correspond to reality! UCLA 14

17 Polyhedral Program Representation: Basics Some Observations Reasoning at the loop/statement level: Some loop transformation may be very difficult to implement How to fuse loops with different loop bounds? How to permute triangular loops? How to unroll-and-jam triangular loops? How to apply time-tiling?... Statements may operate on the same array while being independent UCLA 15

18 Polyhedral Program Representation: Basics Some Motivations for Polyhedral Transformations Known problem: scheduling of task graph Obvious limitations: task graph is not finite / size depends on problem / effective code generation almost impossible Alternative approach: use loop transformations solve all above limitation BUT the problem is to find a sequence that implements the order we want AND also how to apply/compose them Desired features: ability to reason at the instance level (as for task graph scheduling) ability to easily apply/compose loop transformations UCLA 16

19 The Polyhedral Model: The Polyhedral Model UCLA 17

20 The Polyhedral Model: Overview Polyhedral Program Optimization: a Three-Stage Process 1 Analysis: from code to model Existing prototype tools URUK, Omega, ChiLL... ISL, Loopo PolyOpt/PoCC (Clan-Candl-LetSee-Pluto-Ponos-Cloog-Polylib-PIPLib-ISL-FM-...) GCC GRAPHITE, LLVM Polly (now in mainstream) Reservoir Labs R-Stream, IBM XL/Poly UCLA 18

21 The Polyhedral Model: Overview Polyhedral Program Optimization: a Three-Stage Process 1 Analysis: from code to model Existing prototype tools URUK, Omega, ChiLL... ISL, Loopo PolyOpt/PoCC (Clan-Candl-LetSee-Pluto-Ponos-Cloog-Polylib-PIPLib-ISL-FM-...) GCC GRAPHITE, LLVM Polly (now in mainstream) Reservoir Labs R-Stream, IBM XL/Poly 2 Transformation in the model Build and select a program transformation UCLA 18

22 The Polyhedral Model: Overview Polyhedral Program Optimization: a Three-Stage Process 1 Analysis: from code to model Existing prototype tools URUK, Omega, ChiLL... ISL, Loopo PolyOpt/PoCC (Clan-Candl-LetSee-Pluto-Ponos-Cloog-Polylib-PIPLib-ISL-FM-...) GCC GRAPHITE, LLVM Polly (now in mainstream) Reservoir Labs R-Stream, IBM XL/Poly 2 Transformation in the model Build and select a program transformation 3 Code generation: from model to code "Apply" the transformation in the model Regenerate syntactic (AST-based) code UCLA 18

23 The Polyhedral Model: Overview Motivating Example [1/2] Example for (i = 0; i <= 1; ++i) for (j = 0; j <= 2; ++j) A[i][j] = i * j; Program execution: 1: A[0][0] = 0 * 0; 2: A[0][1] = 0 * 1; 3: A[0][2] = 0 * 2; 4: A[1][0] = 1 * 0; 5: A[1][1] = 1 * 1; 6: A[1][2] = 1 * 2; UCLA 19

24 The Polyhedral Model: Overview Motivating Example [2/2] A few observations: Statement is executed 6 times There is a different values for i, j associated to these 6 instances There is an order on them (the execution order) A rough analogy: polyhedral compilation is about (statically) scheduling tasks, where tasks are statement instances, or operations UCLA 20

25 The Polyhedral Model: Overview Polyhedral Representation of Programs Static Control Parts Loops have affine control only (over-approximation otherwise) UCLA 21

26 The Polyhedral Model: Overview Polyhedral Representation of Programs Static Control Parts Loops have affine control only (over-approximation otherwise) Iteration domain: represented as integer polyhedra for (i=1; i<=n; ++i). for (j=1; j<=n; ++j).. if (i<=n-j+2)... s[i] =... D S1 = i j n 1 0 UCLA 21

27 The Polyhedral Model: Overview Polyhedral Representation of Programs Static Control Parts Loops have affine control only (over-approximation otherwise) Iteration domain: represented as integer polyhedra Memory accesses: static references, represented as affine functions of x S and p for (i=0; i<n; ++i) {. s[i] = 0;. for (j=0; j<n; ++j).. s[i] = s[i]+a[i][j]*x[j]; } f s ( x S2 ) = [ ]. [ f a ( x S2 ) = ]. f x ( x S2 ) = [ ]. x S2 n 1 x S2 n 1 x S2 n 1 UCLA 21

28 The Polyhedral Model: Overview Polyhedral Representation of Programs Static Control Parts Loops have affine control only (over-approximation otherwise) Iteration domain: represented as integer polyhedra Memory accesses: static references, represented as affine functions of x S and p Data dependence between S1 and S2: a subset of the Cartesian product of D S1 and D S2 (exact analysis) for (i=1; i<=3; ++i) {. s[i] = 0;. for (j=1; j<=3; ++j).. s[i] = s[i] + 1; } D S1δS2 : i S1. i S2 j S2 1 = 0 0 S1 iterations S2 iterations i UCLA 21

29 The Polyhedral Model: Overview Math Corner UCLA 22

30 The Polyhedral Model: Math Corner The Affine Qualifier Definition (Affine function) A function f : K m K n is affine if there exists a vector b K n and a matrix A K m n such that: x K m, f ( x) = A x + b Definition (Affine half-space) An affine half-space of K m (affine constraint) is defined as the set of points: { x K m a. x b} UCLA 23

31 The Polyhedral Model: Math Corner Polyhedron (Implicit Representation) Definition (Polyhedron) A set S K m is a polyhedron if there exists a system of a finite number of inequalities A x b such that: P = { x K m A x b} Equivalently, it is the intersection of finitely many half-spaces. Definition (Polytope) A polytope is a bounded polyhedron. UCLA 24

32 The Polyhedral Model: Math Corner Integer Polyhedron Definition (Z-polyhedron) It is a polyhedron where all its extreme points are integer valued Definition (Integer hull) The integer hull of a rational polyhedron P is the largest set of integer points such that each of these points is in P. For the moment, we will "say" an integer polyhedron is a polyhedron of integer points (language abuse) UCLA 25

33 The Polyhedral Model: Math Corner Rational and Integer Polytopes UCLA 26

34 The Polyhedral Model: Math Corner Another View of Polyhedra The dual representation models a polyhedron as a combination of lines L and rays R (forming the polyhedral cone) and vertices V (forming the polytope) Definition (Dual representation) P : { x Q n x = L λ + R µ + V ν, µ 0, ν 0, ν i = 1} i Definition (Face) A face F of P is the intersection of P with a supporting hyperplane of P. We have: dim(f ) dim(p ) Definition (Facet) A facet F of P is a face of P such that: dim(f ) = dim(p ) 1 UCLA 27

35 The Polyhedral Model: Math Corner Some Equivalence Properties Theorem (Fundamental Theorem on Polyhedral Decomposition) If P is a polyhedron, then it can be decomposed as a polytope V plus a polyhedral cone L. Theorem (Equivalence of Representations) Every polyhedron has both an implicit and dual representation Chernikova s algorithm can compute the dual representation from the implicit one The Dual representation is heavily used in polyhedral compilation Some works operate on the constraint-based representation (Pluto) UCLA 28

36 The Polyhedral Model: Math Corner Parametric Polyhedra Definition (Paramteric polyhedron) Given n the vector of symbolic parameters, P is a parametric polyhedron if it is defined by: P = { x K m A x B n + b} Requires to adapt theory and tools to parameters Can become nasty: case distinctions (QUAST) Reflects nicely the program context UCLA 29

37 The Polyhedral Model: Math Corner Some Useful Algorithms All extended to parametric polyhedra: Compute the facets of a polytope: PolyLib [Wilde et al] Compute the volume of a polytope (number of points): Barvinok [Claus/Verdoolaege] Scan a polytope (code generation): CLooG [Quillere/Bastoul] Find the lexicographic minimum: PIP [Feautrier] UCLA 30

38 The Polyhedral Model: Math Corner Operations on Polyhedra Definition (Intersection) The intersection of two convex sets P 1 and P 2 is a convex set P : P = { x K m x P 1 x P 2 } Definition (Union) The union of two convex sets P 1 and P 2 is a set P : P = { x K m x P 1 x P 2 } The union of two convex sets may not be a convex set UCLA 31

39 The Polyhedral Model: Math Corner Lattices Definition (Lattice) A subset L in Q n is a lattice if is generated by integral combination of finitely many vectors: a 1,a 2,...,a n (a i Q n ). If the a i vectors have integral coordinates, L is an integer lattice. Definition (Z-polyhedron) A Z-polyhedron is the intersection of a polyhedron and an affine integral full dimensional lattice. UCLA 32

40 The Polyhedral Model: Math Corner Pictured Example Example of a Z-polyhedron: Q 1 = {i,j 0 i 5, 0 3j 20} L 1 = {2i + 1,3j + 5 i,j Z} Z 1 = Q 1 L 1 UCLA 33

41 The Polyhedral Model: Math Corner Quick Facts on Z-polyhedra Iteration domains are in fact Z-polyhedra with unit lattice Intersection of Z-polyhedra is not convex in general Union is complex to compute Can count points, can optimize, can scan Implementation available for most operations in PolyLib UCLA 34

42 The Polyhedral Model: Math Corner Image and Preimage Definition (Image) The image of a polyhedron P Z n by an affine function f : Z n Z m is a Z-polyhedron P : P = {f ( x) Z m x P } Definition (Preimage) The preimage of a polyhedron P Z n by an affine function f : Z n Z m is a Z-polyhedron P : P = { x Z n f ( x) P } We have Image(f 1,P ) = Preimage(f,P ) if f is invertible. UCLA 35

43 The Polyhedral Model: Math Corner Relation Between Image, Preimage and Z-polyhedra The image of a Z-polyhedron by an unimodular function is a Z-polyhedron The preimage of a Z-polyhedron by an affine function is a Z-polyhedron The image of a polyhedron by an affine invertible function is a Z-polyhedron The preimage of a Z-polyhedron by an affine function is a Z-polyhedron The image by a non-invertible function is not a Z-polyhedron UCLA 36

44 The Polyhedral Model: Math Corner Program Transformations UCLA 37

45 The Polyhedral Model: Modeling Programs What Can Be Modeled? Exact vs. approximate representation: Exact representation of iteration domains Static control flow Affine loop bounds (includes min/max/integer division) Affine conditionals (conjunction/disjunction) Approximate representation of iteration domains Use affine over-approximations, predicate statement executions Full-function support UCLA 38

46 Program Transformation: Key Intuition Programs are represented with geometric shapes Transforming a program is about modifying those shapes rotation, skewing, stretching,... But we need here to assume some order to scan points UCLA 39

47 Program Transformation: Affine Transformations Interchange Transformation The transformation matrix is the identity with a permutation of two rows. j ( ) i j i ( ) i j = = [ ] ( ) i j j i 0 1 ( ) i 1 0 j (a) original polyhedron (b) transformation function (c) target polyhedron A x + a 0 y = T x (AT 1 ) y + a 0 do i = 1, 2 do j = 1, 3 S(i,j) do i = 1, 3 do j = 1, 2 S(i=j,j=i ) UCLA 40

48 Program Transformation: Affine Transformations Reversal Transformation The transformation matrix is the identity with one diagonal element replaced by 1. j j i = i 1 0 ( ) i j ( ) i j = [ ] ( ) i j 1 0 ( ) i 0 1 j (a) original polyhedron (b) transformation function (c) target polyhedron A x + a 0 y = T x (AT 1 ) y + a 0 do i = 1, 2 do j = 1, 3 S(i,j) do i = -1, -2, -1 do j = 1, 3 S(i=3-i,j=j ) UCLA 40

49 Program Transformation: Affine Transformations Coumpound Transformation The transformation matrix is the composition of an interchange and reversal j j i = i 1 0 ( ) i j ( ) i j = [ ] ( ) i j 0 1 ( ) i 1 0 j (a) original polyhedron (b) transformation function (c) target polyhedron A x + a 0 y = T x (AT 1 ) y + a 0 do i = 1, 2 do j = 1, 3 S(i,j) do j = -1, -3, -1 do i = 1, 2 S(i=4-j,j=i ) UCLA 40

50 Program Transformation: Affine Transformations Coumpound Transformation The transformation matrix is the composition of an interchange and reversal j j i = i 1 0 ( ) i j ( ) i j = [ ] ( ) i j 0 1 ( ) i 1 0 j (a) original polyhedron (b) transformation function (c) target polyhedron A x + a 0 y = T x (AT 1 ) y + a 0 do i = 1, 2 do j = 1, 3 S(i,j) do j = -1, -3, -1 do i = 1, 2 S(i=4-j,j=i ) UCLA 40

51 Program Transformation: So, What is This Matrix? We know how to generate code for arbitrary matrices with integer coefficients Arbitrary number of rows (but fixed number of columns) Arbitrary value for the coefficients Through code generation, the number of dynamic instances is preserved But this is not true for the transformed polyhedra! Some classification: The matrix is unimodular The matrix is full rank and invertible The matrix is full rank The matrix has only integral coefficients The matrix has rational coefficients UCLA 41

52 Program Transformation: A Reverse History Perspective 1 CLooG: arbitrary matrix 2 Affine Mappings 3 Unimodular framework 4 SARE 5 SURE UCLA 42

53 Program Transformation: Program Transformations Original Schedule for (i = 0; i < n; ++i) for (j = 0; j < n; ++j){ S1: C[i][j] = 0; for (k = 0; k < n; ++k) S2: C[i][j] += A[i][k]* B[k][j]; } i ( ) Θ S1. x S1 = j n 1 ( ) i j Θ S2. x S2 = k n 1 for (i = 0; i < n; ++i) for (j = 0; j < n; ++j){ C[i][j] = 0; for (k = 0; k < n; ++k) C[i][j] += A[i][k]* B[k][j]; } Represent Static Control Parts (control flow and dependences must be statically computable) Use code generator (e.g. CLooG) to generate C code from polyhedral representation (provided iteration domains + schedules) UCLA 43

54 Program Transformation: Program Transformations Original Schedule for (i = 0; i < n; ++i) for (j = 0; j < n; ++j){ S1: C[i][j] = 0; for (k = 0; k < n; ++k) S2: C[i][j] += A[i][k]* B[k][j]; } i ( ) Θ S1. x S1 = j n 1 ( ) i j Θ S2. x S2 = k n 1 for (i = 0; i < n; ++i) for (j = 0; j < n; ++j){ C[i][j] = 0; for (k = 0; k < n; ++k) C[i][j] += A[i][k]* B[k][j]; } Represent Static Control Parts (control flow and dependences must be statically computable) Use code generator (e.g. CLooG) to generate C code from polyhedral representation (provided iteration domains + schedules) UCLA 43

55 Program Transformation: Program Transformations Original Schedule for (i = 0; i < n; ++i) for (j = 0; j < n; ++j){ S1: C[i][j] = 0; for (k = 0; k < n; ++k) S2: C[i][j] += A[i][k]* B[k][j]; } i ( ) Θ S1. x S1 = j n 1 ( ) i j Θ S2. x S2 = k n 1 for (i = 0; i < n; ++i) for (j = 0; j < n; ++j){ C[i][j] = 0; for (k = 0; k < n; ++k) C[i][j] += A[i][k]* B[k][j]; } Represent Static Control Parts (control flow and dependences must be statically computable) Use code generator (e.g. CLooG) to generate C code from polyhedral representation (provided iteration domains + schedules) UCLA 43

56 Program Transformation: Program Transformations Distribute loops for (i = 0; i < n; ++i) for (j = 0; j < n; ++j){ S1: C[i][j] = 0; for (k = 0; k < n; ++k) S2: C[i][j] += A[i][k]* B[k][j]; } i ( ) Θ S1. x S1 = j n 1 ( ) i j Θ S2. x S2 = k n 1 for (i = 0; i < n; ++i) for (j = 0; j < n; ++j) C[i][j] = 0; for (i = n; i < 2*n; ++i) for (j = 0; j < n; ++j) for (k = 0; k < n; ++k) C[i-n][j] += A[i-n][k]* B[k][j]; All instances of S1 are executed before the first S2 instance UCLA 43

57 Program Transformation: Program Transformations Distribute loops + Interchange loops for S2 for (i = 0; i < n; ++i) for (j = 0; j < n; ++j){ S1: C[i][j] = 0; for (k = 0; k < n; ++k) S2: C[i][j] += A[i][k]* B[k][j]; } i ( ) Θ S1. x S1 = j n 1 ( ) i j Θ S2. x S2 = k n 1 for (i = 0; i < n; ++i) for (j = 0; j < n; ++j) C[i][j] = 0; for (k = n; k < 2*n; ++k) for (j = 0; j < n; ++j) for (i = 0; i < n; ++i) C[i][j] += A[i][k-n]* B[k-n][j]; The outer-most loop for S2 becomes k UCLA 43

58 Program Transformation: Program Transformations Illegal schedule for (i = 0; i < n; ++i) for (j = 0; j < n; ++j){ S1: C[i][j] = 0; for (k = 0; k < n; ++k) S2: C[i][j] += A[i][k]* B[k][j]; } i ( ) Θ S1. x S1 = j n 1 ( ) i j Θ S2. x S2 = k n 1 for (k = 0; k < n; ++k) for (j = 0; j < n; ++j) for (i = 0; i < n; ++i) C[i][j] += A[i][k]* B[k][j]; for (i = n; i < 2*n; ++i) for (j = 0; j < n; ++j) C[i-n][j] = 0; All instances of S1 are executed after the last S2 instance UCLA 43

59 Program Transformation: Program Transformations A legal schedule for (i = 0; i < n; ++i) for (j = 0; j < n; ++j){ S1: C[i][j] = 0; for (k = 0; k < n; ++k) S2: C[i][j] += A[i][k]* B[k][j]; } i ( ) Θ S1. x S1 = j n 1 ( ) i j Θ S2. x S2 = k n 1 for (i = n; i < 2*n; ++i) for (j = 0; j < n; ++j) C[i][j] = 0; for (k= n+1; k<= 2*n; ++k) for (j = 0; j < n; ++j) for (i = 0; i < n; ++i) C[i][j] += A[i][k-n-1]* B[k-n-1][j]; Delay the S2 instances Constraints must be expressed between Θ S1 and Θ S2 UCLA 43

60 Program Transformation: Program Transformations Implicit fine-grain parallelism for (i = 0; i < n; ++i) for (j = 0; j < n; ++j){ S1: C[i][j] = 0; for (k = 0; k < n; ++k) S2: C[i][j] += A[i][k]* B[k][j]; } i Θ S1. x S1 = ( ). j n 1 i j Θ S2. x S2 = ( ). k n 1 for (i = 0; i < n; ++i) pfor (j = 0; j < n; ++j) C[i][j] = 0; for (k = n; k < 2*n; ++k) pfor (j = 0; j < n; ++j) pfor (i = 0; i < n; ++i) C[i][j] += A[i][k-n]* B[k-n][j]; Less (linear) rows than loop depth remaining dimensions are parallel UCLA 43

61 Program Transformation: Program Transformations Representing a schedule for (i = 0; i < n; ++i) for (j = 0; j < n; ++j){ S1: C[i][j] = 0; for (k = 0; k < n; ++k) S2: C[i][j] += A[i][k]* B[k][j]; } i ( ) Θ S1. x S1 = j n 1 ( ) i j Θ S2. x S2 = k n 1 for (i = n; i < 2*n; ++i) for (j = 0; j < n; ++j) C[i][j] = 0; for (k= n+1; k<= 2*n; ++k) for (j = 0; j < n; ++j) for (i = 0; i < n; ++i) C[i][j] += A[i][k-n-1]* B[k-n-1][j]; Θ. x = ( ) ( i j i j k n n 1 1 ) T UCLA 43

62 Program Transformation: Program Transformations Representing a schedule for (i = 0; i < n; ++i) for (j = 0; j < n; ++j){ S1: C[i][j] = 0; for (k = 0; k < n; ++k) S2: C[i][j] += A[i][k]* B[k][j]; } i ( ) Θ S1. x S1 = j n 1 ( ) i j Θ S2. x S2 = k n 1 for (i = n; i < 2*n; ++i) for (j = 0; j < n; ++j) C[i][j] = 0; for (k= n+1; k<= 2*n; ++k) for (j = 0; j < n; ++j) for (i = 0; i < n; ++i) C[i][j] += A[i][k-n-1]* B[k-n-1][j]; Θ. x = ( ) ( i j i j k n n 1 1 ) T ı p c UCLA 43

63 Program Transformation: Program Transformations Representing a schedule for (i = 0; i < n; ++i) for (j = 0; j < n; ++j){ S1: C[i][j] = 0; for (k = 0; k < n; ++k) S2: C[i][j] += A[i][k]* B[k][j]; } i ( ) Θ S1. x S1 = j n 1 ( ) i j Θ S2. x S2 = k n 1 for (i = n; i < 2*n; ++i) for (j = 0; j < n; ++j) C[i][j] = 0; for (k= n+1; k<= 2*n; ++k) for (j = 0; j < n; ++j) for (i = 0; i < n; ++i) C[i][j] += A[i][k-n-1]* B[k-n-1][j]; ı p c Transformation reversal skewing interchange fusion distribution peeling shifting Description Changes the direction in which a loop traverses its iteration range Makes the bounds of a given loop depend on an outer loop counter Exchanges two loops in a perfectly nested loop, a.k.a. permutation Fuses two loops, a.k.a. jamming Splits a single loop nest into many, a.k.a. fission or splitting Extracts one iteration of a given loop Allows to reorder loops UCLA 43

64 Program Transformation: Fusion in the Polyhedral Model 0 N for (i = 0; i <= N; ++i) { Blue(i); Red(i); } Perfectly aligned fusion UCLA 44

65 Program Transformation: Fusion in the Polyhedral Model 0 1 N N+1 Blue(0); for (i = 1; i <= N; ++i) { Blue(i); Red(i-1); } Red(N); Fusion with shift of 1 Not all instances are fused UCLA 44

66 Program Transformation: Fusion in the Polyhedral Model 0 P N N+P for (i = 0; i < P; ++i) Blue(i); for (i = P; i <= N; ++i) { Blue(i); Red(i-P); } for (i = N+1; i <= N+P; ++i) Red(i-P); Fusion with parametric shift of P Automatic generation of prolog/epilog code UCLA 44

67 Program Transformation: Fusion in the Polyhedral Model 0 P N N+P for (i = 0; i < P; ++i) Blue(i); for (i = P; i <= N; ++i) { Blue(i); Red(i-P); } for (i = N+1; i <= N+P; ++i) Red(i-P); Many other transformations may be required to enable fusion: interchange, skewing, etc. UCLA 44

68 Program Transformation: Scheduling Matrix Definition (Affine multidimensional schedule) Given a statement S, an affine schedule Θ S of dimension m is an affine form on the d outer loop iterators x S and the p global parameters n. Θ S Z m (d+p+1) can be written as: θ 1,1... θ 1,d+p+1 Θ S ( x S ) =... x S n θ m,1... θ m,d+p+1 1 Θ S k denotes the kth row of Θ S. Definition (Bounded affine multidimensional schedule) Θ S is a bounded schedule if θ S i,j [x,y] with x,y Z UCLA 45

69 Program Transformation: Another Representation One can separate coefficients of Θ into: 1 The iterator coefficients 2 The parameter coefficients 3 The constant coefficients One can also enforce the schedule dimension to be 2d+1. A d-dimensional square matrix for the linear part represents composition of interchange/skewing/slowing A d n matrix for the parametric part represents (parametric) shift A d + 1 vector for the scalar offset represents statement interleaving See URUK for instance UCLA 46

70 Program Transformation: Computing the 2d+1 Identity Schedule θs1 ("xs1 ) = (0, i, 0) θs2 ("xs2 ) = (0, i, 1, j, 0) θs3 0 i 0 S1 S2 S3 S4 S5 S6 UCLA S1 j S3 0 j 0 S2 S4 1 2 k S6 0 S5 47

71 examples updating the data layout and array access functions. This table is not complete (e.g., it lacks index-set splitting and datalayout Transformation transformations), but Catalogue it demonstrates [1/2] the expressiveness of the unified representation. Program Transformation: Syntax Effect UNIMODULAR(P,U) S S cop P β S,A S U.A S ; Γ S U.Γ S SHIFT(P,M) S S cop P β S,Γ S Γ S + M CUTDOM(P,c) S S cop P β S,Λ S ( AddRow Λ S,0,c/gcd(c 1,...,c d S +d lv S +dgp+1)) d S d S + 1; Λ S AddCol(Λ S,c,0); β S AddRow(β S,l,0); EXTEND(P,l,c) S S cop P β S, A S AddRow(AddCol(A S,c,0),l,1 l ); Γ S AddRow(Γ S,l,0); (A,F) Lhs S R hs S,F AddRow(F,l,0) ADDLOCALVAR(P) S S cop P β S,dlv S ds lv + 1; ΛS AddCol(Λ S,d S + 1,0); (A,F) Lhs S R hs S,F AddCol(F,dS + 1,0) PRIVATIZE(A,l) S S cop, (A,F) Lhs S R hs S,F AddRow(F,l,1 l) CONTRACT(A,l) S S cop, (A,F) Lhs S R hs S,F RemRow(F,l) FUSION(P,o) b = max{β S dim(p)+1 (P,o) βs } + 1 Move((P,o + 1),(P,o + 1),b); Move(P,(P,o + 1), 1) FISSION(P,o,b) Move(P,(P,o,b),1); Move((P,o + 1),(P,o + 1), b) MOTION(P,T ) if dim(p) + 1 = dim(t) then b = max{β S dim(p) P βs } + 1 else b = 1 Move(pfx(T,dim(T) 1),T,b) S S cop P β S,β S β S + T pfx(p,dim(t)) Move(P,P, 1) Fig. 18. Some classical transformation primitives UCLA 48

72 Program Transformation: Transformation Catalogue [2/2] 30 Girbal, Vasilache et al. Syntax Effect Comments INTERCHANGE(P,o) S S cop P β S swap rows o and o + 1, { U = Id S 1 o,o 1 o+1,o+1 +1 o,o+1 +1 o+1,o ; UNIMODULAR(β S,U) SKEW(P,l,c,s) S S cop P β S, { U = Id S + s 1 l,c ; UNIMODULAR(β S,U) REVERSE(P,o) S S cop P β S, { U = Id S 2 1 o,o ; UNIMODULAR(β S,U) STRIPMINE(P,k) S S cop P β S, c = dim(p); EXTEND(β S,c,c); u = d S + dlv S + d gp + 1; CUTDOM(β S, k 1 c + (A S c+1,γs c+1 )); CUTDOM(β S,k 1 c (A S c+1,γs c+1 ) + (k 1)1 u) TILE(P,o,k 1,k 2 ) S S cop (P,o) β S, STRIPMINE((P,o),k 2 ); STRIPMINE(P,k 1 ); INTERCHANGE((P,0),dim(P)) add the skew factor put a -1 in (o,o) insert intermediate loop constant column k i c i c+1 i c+1 k i c + k 1 strip outer loop strip inner loop interchange Fig. 19. Composition of transformation primitives UCLA 49

73 Program Transformation: Some Final Observations Some classical pitfalls The number of rows of Θ does not correspond to actual parallel levels Scalar rows vs. linear rows Linear independence Parametric shift for domains without parametric bound... UCLA 50

74 Program Transformation: Data Dependences UCLA 51

75 Data Dependences: Purpose of Dependence Analysis Not all program transformations preserve the semantics Semantics is preserved if the dependence are preserved In standard frameworks, it usually means reordering statements/loops Statements containing dependent references should not be executed in a different order Granularity: usually a reference to an array In the polyhedral framework, it means reordering statement instances Statement instances containing dependent references should not be executed in a different order Granularity: a reference to an array cell UCLA 52

76 Data Dependences: Example Exercise: Compute the set of cells of A accessed Example for (i = 0; i < N; ++i) for (j = i; j < N; ++j) A[2i + 3][4j] = i * j; D S : {i,j 0 i < N, i j < N} Function: f A : {2i + 3,4j i,j Z} Image(f A,D S ) is the set of cells of A accessed (a Z-polyhedron): Polyhedron: Q : {i,j 3 i < 2N + 2, 0 j < 4N} Lattice: L : {2i + 3,4j i,j Z} UCLA 53

77 Data Dependences: Data Dependence Definition (Bernstein conditions) Given two references, there exists a dependence between them if the three following conditions hold: they reference the same array (cell) one of this access is a write the two associated statements are executed Three categories of dependences: RAW (Read-After-Write, aka flow): first reference writes, second reads WAR (Write-After-Read, aka anti): first reference reads, second writes WAW (Write-After-Write, aka output): both references writes Another kind: RAR (Read-After-Read dependences), used for locality/reuse computations UCLA 54

78 Data Dependences: Some Terminology on Dependence Relations We categorize the dependence relation in three kinds: Uniform dependences: the distance between two dependent iterations is a constant ex: i i + 1 ex: i,j i + 1,j + 1 Non-uniform dependences: the distance between two dependent iterations varies during the execution ex: i i + j ex: i 2i Parametric dependences: at least a parameter is involved in the dependence relation ex: i i + N ex: i + N j + M UCLA 55

79 Data Dependences: Dependence Polyhedra Dependence Polyhedron [1/3] Principle: model all pairs of instances in dependence Definition (Dependence of statement instances) A statement S depends on a statement R (written R S) if there exists an operation S( x S ) and R( x R ) and a memory location m such that: 1 S( x S ) and R( x R ) refer to the same memory location m, and at least one of them writes to that location, 2 x S and x R belongs to the iteration domain of R and S, 3 in the original sequential order, S( x S ) is executed before R( x R ). UCLA 56

80 Data Dependences: Dependence Polyhedra Dependence Polyhedron [2/3] 1 Same memory location: equality of the subscript functions of a pair of references to the same array: F S x S + a S = F R x R + a R. 2 Iteration domains: both S and R iteration domains can be described using affine inequalities, respectively A S x S + c S 0 and A R x R + c R 0. 3 Precedence order: each case corresponds to a common loop depth, and is called a dependence level. For each dependence level l, the precedence constraints are the equality of the loop index variables at depth lesser to l: x R,i = x S,i for i < l and x R,l > x S,l if l is less than the common nesting loop level. Otherwise, there is no additional constraint and the dependence only exists if S is textually before R. Such constraints can be written using linear inequalities: P l,s x S P l,r x R + b 0. UCLA 57

81 Data Dependences: Dependence Polyhedra Dependence Polyhedron [3/3] The dependence polyhedron for R S at a given level l and for a given pair of references f R,f S is described as [Feautrier/Bastoul]: ( ) xs D R,S,fR,f S,l : D + d = x R F S F R A S 0 0 A R PS P R ( ) a S a R xs c + S x R c R b = 0 0 A few properties: We can always build the dep polyhedron for a given pair of affine array accesses (it is convex) It is exact, if the iteration domain and the access functions are also exact it is over-approximated if the iteration domain or the access function is an approximation UCLA 58

82 Data Dependences: Dependence Polyhedra Polyhedral Dependence Graph Definition (Polyhedral Dependence Graph) The Polyhedral Dependence Graph is a multi-graph with one vertex per syntactic program statement S i, with edges S i S j for each dependence polyhedron D Si,S j. UCLA 59

83 Data Dependences: Dependence Polyhedra Scheduling UCLA 60

84 Affine Scheduling: Affine Scheduling Definition (Affine schedule) Given a statement S, a p-dimensional affine schedule Θ R is an affine form on the outer loop iterators x S and the global parameters n. It is written: Θ S ( x S ) = T S x S n, T S K p dim( x S)+dim( n)+1 1 A schedule assigns a timestamp to each executed instance of a statement If T is a vector, then Θ is a one-dimensional schedule If T is a matrix, then Θ is a multidimensional schedule Question: does it translate to sequential loops? UCLA 61

85 Affine Scheduling: Legal Program Transformation Definition (Precedence condition) Given Θ R a schedule for the instances of R, Θ S a schedule for the instances of S. Θ R and Θ S are legal schedules if x R, x S D R,S : denotes the lexicographic ordering. Θ R ( x R ) Θ S ( x S ) (a 1,...,a n ) (b 1,...,b m ) iff i, 1 i min(n,m) s.t. (a 1,...,a i 1 ) = (b 1,...,b i 1 ) and a i < b i UCLA 62

86 Affine Scheduling: Scheduling in the Polyhedral Model Constraints: The schedule must respect the precedence condition, for all dependent instances Dependence constraints can be turned into constraints on the solution set Scheduling: Among all possibilities, one has to be picked Optimal solution requires to consider all legal possible schedules Question: is this always true? UCLA 63

87 Affine Scheduling: One-Dimensional Affine Schedules For now, we focus on 1-d schedules Example for (i = 1; i < N; ++i) A[i] = A[i - 1] + A[i] + A[i + 1]; Simple program: 1 loop, 1 polyhedral statement 2 dependences: RAW: A[i] A[i - 1] WAR: A[i + 1] A[i] UCLA 64

88 Affine Scheduling: Checking the Legality of a Schedule Exercise: given the dependence polyhedra, check if a schedule is legal eq eq D 1 : i S i S D 2 : n i S i S n Θ = i 2 Θ = i UCLA 65

89 Affine Scheduling: Checking the Legality of a Schedule Exercise: given the dependence polyhedra, check if a schedule is legal eq eq D 1 : i S i S D 2 : n i S i S n Θ = i 2 Θ = i Solution: check for the emptiness of the polyhedron [ ] i S D P : i S i. i S S n 1 where: i S i S gets the consumer instances scheduled after the producer ones For Θ = i, it is i S i S, which is non-empty UCLA 65

90 Affine Scheduling: A (Naive) Scheduling Approach Pick a schedule for the program statements Check if it respects all dependences This is called filtering Limitations: How to use this in combination of an objective function? The density of legal 1-d affine schedules is low: matmult locality fir h264 crout i-bounds 1,1 1,1 0,1 1,1 3,3 c-bounds 1,1 1,1 0,3 0,4 3,3 #Sched #Legal UCLA 66

91 Affine Scheduling: Objectives for a Good Scheduling Algorithm Build a legal schedule! Embed some properties in this legal schedule latency: minimize the time of the last iteration delay: minimize the time between the first and last iteration parallelism / placement permutability (for tiling)... A possible "simple" two-step approach: Find the solution set of all legal affine schedules Find an ILP/PIP formulation for the objective function(s) UCLA 67

92 Affine Scheduling: The Precedence Constraint (Again!) Precedence constraint adapted to 1-d schedules: Definition (Causality condition for schedules) Given D R,S, Θ R and Θ S are legal iff for each pair of instances in dependence: Θ R ( x R ) < Θ S ( x S ) Equivalently: R,S = Θ S ( x S ) Θ R ( x R ) 1 0 All functions R,S which are non-negative over the dependence polyhedron represent legal schedules For the instances which are not in dependence, we don t care First step: how to get all non-negative functions over a polyhedron? UCLA 68

93 Affine Scheduling: Affine Form of the Farkas Lemma Lemma (Affine form of Farkas lemma) Let D be a nonempty polyhedron defined by A x + b 0. Then any affine function f ( x) is non-negative everywhere in D iff it is a positive affine combination: f ( x) = λ 0 + λ T (A x + b), with λ 0 0 and λ 0 λ 0 and λ T are called the Farkas multipliers. UCLA 69

94 Affine Scheduling: The Farkas Lemma: Example Function: f (x) = ax + b Domain of x: {1 x 3} x 1 0, x Farkas lemma: f (x) 0 f (x) = λ 0 + λ 1 (x 1) + λ 2 ( x + 3) The system to solve: λ 1 λ 2 = a λ 0 λ 1 + 3λ 2 = b λ 0 0 λ 1 0 λ 2 0 UCLA 70

95 Affine Scheduling: One-dimensional Schedules Example: Semantics Preservation (1-D) Affine Schedules Legal Distinct Schedules UCLA 71

96 Affine Scheduling: One-dimensional Schedules Example: Semantics Preservation (1-D) Affine Schedules Legal Distinct Schedules - Causality condition Property (Causality condition for schedules) Given RδS, Θ R and Θ S are legal iff for each pair of instances in dependence: Θ R ( x R ) < Θ S ( x S ) Equivalently: R,S = Θ S ( x S ) Θ R ( x R ) 1 0 UCLA 71

97 Affine Scheduling: One-dimensional Schedules Example: Semantics Preservation (1-D) Affine Schedules Legal Distinct Schedules - Causality condition - Farkas Lemma Lemma (Affine form of Farkas lemma) Let D be a nonempty polyhedron defined by A x + b 0. Then any affine function f ( x) is non-negative everywhere in D iff it is a positive affine combination: f ( x) = λ 0 + λ T (A x + b), with λ 0 0 and λ 0. λ 0 and λ T are called the Farkas multipliers. UCLA 71

98 Affine Scheduling: One-dimensional Schedules Example: Semantics Preservation (1-D) Affine Schedules Valid Farkas Multipliers Legal Distinct Schedules - Causality condition - Farkas Lemma UCLA 71

99 Affine Scheduling: One-dimensional Schedules Example: Semantics Preservation (1-D) Affine Schedules Valid Farkas Multipliers Many to one Legal Distinct Schedules - Causality condition - Farkas Lemma UCLA 71

100 Affine Scheduling: One-dimensional Schedules Example: Semantics Preservation (1-D) Affine Schedules Valid Farkas Multipliers Legal Distinct Schedules - Causality condition - Farkas Lemma - Identification ( ( ) ) Θ S ( x S ) Θ R ( x R ) 1 = λ 0 + λ T xr D R,S + d R,S 0 x S D RδS i R : λ D1,1 λ D1,2 + λ D1,3 λ D1,4 i S : λ D1,1 + λ D1,2 + λ D1,5 λ D1,6 j S : λ D1,7 λ D1,8 n : λ D1,4 + λ D1,6 + λ D1,8 1 : λ D1,0 UCLA 71

101 Affine Scheduling: One-dimensional Schedules Example: Semantics Preservation (1-D) Affine Schedules Valid Farkas Multipliers Legal Distinct Schedules - Causality condition - Farkas Lemma - Identification ( ( ) ) Θ S ( x S ) Θ R ( x R ) 1 = λ 0 + λ T xr D R,S + d R,S 0 x S D RδS i R : t 1R = λ D1,1 λ D1,2 + λ D1,3 λ D1,4 i S : t 1S = λ D1,1 + λ D1,2 + λ D1,5 λ D1,6 j S : t 2S = λ D1,7 λ D1,8 n : t 3S t 2R = λ D1,4 + λ D1,6 + λ D1,8 1 : t 4S t 3R 1 = λ D1,0 UCLA 71

102 Affine Scheduling: One-dimensional Schedules Example: Semantics Preservation (1-D) Affine Schedules Valid Farkas Multipliers Legal Distinct Schedules - Causality condition - Farkas Lemma - Identification - Projection Solve the constraint system Use (purpose-optimized) Fourier-Motzkin projection algorithm Reduce redundancy Detect implicit equalities UCLA 71

103 Affine Scheduling: One-dimensional Schedules Example: Semantics Preservation (1-D) Affine Schedules Valid Farkas Multipliers Valid Transformation Coefficients Legal Distinct Schedules - Causality condition - Farkas Lemma - Identification - Projection UCLA 71

104 Affine Scheduling: One-dimensional Schedules Example: Semantics Preservation (1-D) Affine Schedules Valid Farkas Multipliers Valid Transformation Coefficients Bijection Legal Distinct Schedules - Causality condition - Farkas Lemma - Identification - Projection One point in the space one set of legal schedules w.r.t. the dependences UCLA 71

105 Affine Scheduling: One-dimensional Schedules Scheduling Algorithm for Multiple Dependences Algorithm Compute the schedule constraints for each dependence Intersect all sets of constraints Output is a convex solution set of all legal one-dimensional schedules Computation is fast, but requires eliminating variables in a system of inequalities: projection Can be computed as soon as the dependence polyhedra are known UCLA 72

106 Affine Scheduling: One-dimensional Schedules Objective Function Idea: bound the latency of the schedule and minimize this bound Theorem (Schedule latency bound) If all domains are bounded, and if there exists at least one 1-d schedule Θ, then there exists at least one affine form in the structure parameters: L = u. n + w such that: x R, L Θ R ( x R ) Objective function: min{ u,w u. n + w Θ 0} Subject to Θ is a legal schedule, and θ i 0 In many cases, it is equivalent to take the lexicosmallest point in the polytope of non-negative legal schedules UCLA 73

107 Affine Scheduling: One-dimensional Schedules Example min{ u,w u. n + w Θ 0} : Θ R = 0, Θ S = k + 1 Example parfor (i = 0; i < N; ++i) parfor (j = 0; j < N; ++j) C[i][j] = 0; for (k = 1; k < N + 1; ++k) parfor (i = 0; i < N; ++i) parfor (j = 0; j < N; ++j) C[i][j] += A[i][k-1] + B[k-1][j]; UCLA 74

108 Affine Scheduling: One-dimensional Schedules Limitations of One-dimensional Schedules Not all programs have a legal one-dimensional schedule Question: does this program have a 1-d schedule? Example for (i = 1; i < N - 1; ++i) for (j = 1; j < N - 1; ++j) A[i][j] = A[i-1][j-1] + A[i+1][j] + A[i][j+1]; Not all compositions of transformation are possible Interchange in inner-loops Fusion / distribution of inner-loops UCLA 75

109 Affine Scheduling: One-dimensional Schedules Multidimensional Scheduling UCLA 76

110 Affine Scheduling: Multidimensional Scheduling Multidimensional Scheduling Some program does not have a legal 1-d schedule It means, it s not possible to enforce the precedence condition for all Example dependences for (i = 0; i < N; ++i) for (j = 0; j < N; ++j) s += s; Intuition: multidimensional time means nested time loops The precedence constraint needs to be adapted to multidimensional time UCLA 77

111 Affine Scheduling: Multidimensional Scheduling Dependence Satisfaction Definition (Strong dependence satisfaction) Given D R,S, the dependence is strongly satisfied at schedule level k if x R, x S D R,S, Θ S k ( x S) Θ R k ( x R) 1 Definition (Weak dependence satisfaction) Given D R,S, the dependence is weakly satisfied at dimension k if x R, x S D R,S, Θ S k ( x S) Θ R k ( x R ) 0 x R, x S D R,S, Θ S k ( x S) = Θ R k ( x R ) UCLA 78

112 Affine Scheduling: Multidimensional Scheduling Program Legality and Existence Results All dependence must be strongly satisfied for the program to be correct Once a dependence is strongly satisfied at level k, it does not contribute to the constraints of level k + i Unlike with 1-d schedules, it is always possible to build a legal multidimensional schedule for a SCoP [Feautrier] Theorem (Existence of an affine schedule) Every static control program has a multdidimensional affine schedule UCLA 79

113 Affine Scheduling: Multidimensional Scheduling Reformulation of the Precedence Condition We introduce variable δ D R,S 1 to model the dependence satisfaction Considering the first row of the scheduling matrices, to preserve the precedence relation we have: D R,S, x R, x S D R,S, Θ S 1 ( x S) Θ R 1 ( x R) δ D R,S 1 δ D R,S 1 {0,1} Lemma (Semantics-preserving affine schedules) Given a set of affine schedules Θ R,Θ S... of dimension m, the program semantics is preserved if: D R,S, p {1,...,m}, δ D R,S p = 1 j < p, δ D R,S j = 0 j p, x R, x S D R,S, Θ S p( x S ) Θ R p ( x R ) δ D R,S j UCLA 80

114 Affine Scheduling: Multidimensional Scheduling Space of All Affine Schedules Objective: Design an ILP which operates on all scheduling coefficients Easier optimality reasoning: the space contains all schedules (hence necesarily the optimal one) Examples: maximal fusion, maximal coarse-grain parallelism, best idea: locality, etc. Combine all coefficients of all rows of the scheduling function into a single solution set Find a convex encoding for the lexicopositivity of dependence satisfaction A dependence must be weakly satisfied until it is strongly satisfied Once it is strongly satisfied, it must not constrain subsequent levels UCLA 81

115 Affine Scheduling: Multidimensional Scheduling Schedule Lower Bound Idea: Bound the schedule latency with a lower bound which does not prevent to find all solutions Intuitively: Θ S ( x S ) Θ R ( x R ) δ if the dependence has not been strongly satisfied Θ S ( x S ) Θ R ( x R ) if it has Lemma (Schedule lower bound) Given Θ R k, ΘS k such that each coefficient value is bounded in [x,y]. Then there exists K Z such that: max ( Θ S k ( x S) Θ R k ( x R ) ) > K. n K UCLA 82

Affine Scheduling: Space of Semantics-Preserving Affine Schedules Space of Semantics-Preserving Affine Schedules All unique bounded affine multidimensional

116 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Space of Semantics-Preserving Affine Schedules All unique bounded affine multidimensional schedules All unique semantics-preserving affine multidimensional schedules 1 point 1 unique semantically equivalent program (up to affine iteration reordering) UCLA 83

117 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Semantics Preservation Definition (Causality condition) Given Θ R a schedule for the instances of R, Θ S a schedule for the instances of S. Θ R and Θ S preserve the dependence D R,S if x R, x S D R,S : Θ R ( x R ) Θ S ( x S ) denotes the lexicographic ordering. (a 1,...,a n ) (b 1,...,b m ) iff i, 1 i min(n,m) s.t. (a 1,...,a i 1 ) = (b 1,...,b i 1 ) and a i < b i UCLA 84

118 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Lexico-positivity of Dependence Satisfaction Θ R ( x R ) Θ S ( x S ) is equivalently written Θ S ( x S ) Θ R ( x R ) 0 UCLA 85

119 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Lexico-positivity of Dependence Satisfaction Θ R ( x R ) Θ S ( x S ) is equivalently written Θ S ( x S ) Θ R ( x R ) 0 Considering the row p of the scheduling matrices: Θ S p( x S ) Θ R p ( x R ) δ p UCLA 85

120 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Lexico-positivity of Dependence Satisfaction Θ R ( x R ) Θ S ( x S ) is equivalently written Θ S ( x S ) Θ R ( x R ) 0 Considering the row p of the scheduling matrices: Θ S p( x S ) Θ R p ( x R ) δ p δp 1 implies no constraints on δ k, k > p δp 0 is required if k < p, δ k 1 UCLA 85

121 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Lexico-positivity of Dependence Satisfaction Θ R ( x R ) Θ S ( x S ) is equivalently written Θ S ( x S ) Θ R ( x R ) 0 Considering the row p of the scheduling matrices: Θ S p( x S ) Θ R p ( x R ) δ p δp 1 implies no constraints on δ k, k > p δp 0 is required if k < p, δ k 1 Schedule lower bound: Lemma (Schedule lower bound) Given Θ R k, ΘS k such that each coefficient value is bounded in [x,y]. Then there exists K Z such that: x R, x S, Θ S k ( x S) Θ R k ( x R ) > K. n K UCLA 85

122 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Convex Form of All Bounded Affine Schedules Lemma (Convex form of semantics-preserving affine schedules) Given a set of affine schedules Θ R,Θ S... of dimension m, the program semantics is preserved if the three following conditions hold: (i) D R,S, δ D R,S p {0,1} (ii) m D R,S, δ D R,S p = 1 p=1 (iii) D R,S, p {1,...,m}, x R, x S D R,S, UCLA 86

123 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Convex Form of All Bounded Affine Schedules Lemma (Convex form of semantics-preserving affine schedules) Given a set of affine schedules Θ R,Θ S... of dimension m, the program semantics is preserved if the three following conditions hold: (i) D R,S, δ D R,S p {0,1} (ii) m D R,S, δ D R,S p = 1 p=1 (iii) D R,S, p {1,...,m}, x R, x S D R,S, UCLA 86

124 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Convex Form of All Bounded Affine Schedules Lemma (Convex form of semantics-preserving affine schedules) Given a set of affine schedules Θ R,Θ S... of dimension m, the program semantics is preserved if the three following conditions hold: (i) D R,S, δ D R,S p {0,1} (ii) m D R,S, δ D R,S p = 1 p=1 (iii) D R,S, p {1,...,m}, x R, x S D R,S, Θ S p( x S ) Θ R p ( x R ) δ D R,S p UCLA 86

125 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Convex Form of All Bounded Affine Schedules Lemma (Convex form of semantics-preserving affine schedules) Given a set of affine schedules Θ R,Θ S... of dimension m, the program semantics is preserved if the three following conditions hold: (i) D R,S, δ D R,S p {0,1} (ii) m D R,S, δ D R,S p = 1 p=1 (iii) D R,S, p {1,...,m}, x R, x S D R,S, Θ S p( x S ) Θ R p ( x R ) δ D R,S p p 1 k=1 δ D R,S k.(k. n + K) UCLA 86

126 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Convex Form of All Bounded Affine Schedules Lemma (Convex form of semantics-preserving affine schedules) Given a set of affine schedules Θ R,Θ S... of dimension m, the program semantics is preserved if the three following conditions hold: (i) D R,S, δ D R,S p {0,1} (ii) m D R,S, δ D R,S p = 1 p=1 (iii) D R,S, p {1,...,m}, x R, x S D R,S, Θ S p( x S ) Θ R p ( x R ) δ D R,S p 1 p + k=1 δ D R,S k.(k. n + K) 0 Use Farkas lemma to build all non-negative functions over a polyhedron (here, the dependence polyhedra) [Feautrier,92] UCLA 86

127 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Convex Form of All Bounded Affine Schedules Lemma (Convex form of semantics-preserving affine schedules) Given a set of affine schedules Θ R,Θ S... of dimension m, the program semantics is preserved if the three following conditions hold: (i) D R,S, δ D R,S p {0,1} (ii) m D R,S, δ D R,S p = 1 p=1 (iii) D R,S, p {1,...,m}, x R, x S D R,S, Θ S p( x S ) Θ R p ( x R ) δ D R,S p 1 p + k=1 δ D R,S k.(k. n + K) 0 Use Farkas lemma to build all non-negative functions over a polyhedron (here, the dependence polyhedra) [Feautrier,92] Bounded coefficients required [Vasilache,07] UCLA 86

128 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Maximal Fine-Grain Parallelism Objectives: Have as few dimensions as possible carrying a dependence For dimension k p..1: min δ D R,S k D R,S We use lexicographic optimization UCLA 87

129 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Key Observations Is all of this really necessary? We have encoded one objective per row of Θ Question: do we need to solve this large ILP/LP? UCLA 88

130 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Feautrier s Greedy Algorithm Main idea: 1 Start at row 1 of Θ 2 Build the set of legal one-dimensional schedules 3 Maximize the number of dependences strongly solved (maxδ i ) 4 Remove strongly solved dependences from P 5 Goto 1 This is a row-by-row decomposition of the scheduling problem UCLA 89

131 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Key Properties of Feautrier s Algorithm It terminates It finds "optimal" fine-grain parallelism Granularity of dependence satisfaction: all-or-nothing Example for (i = 0; i < 2 * N; ++i) A[i] = A[2 * N - i]; UCLA 90

132 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Key Observations Is all of this really necessary? Question: do we need to consider all statements at once? Insight: the PDG gives structural information about dependences Decomposition of the PDG into strongly-connected components UCLA 91

133 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Feautrier s Scheduler UCLA 92

134 Affine Scheduling: Space of Semantics-Preserving Affine Schedules More Observations Some problems may be decomposed without loss of optimality The PDG gives extra information about further problem decomposition Still, is all of this really necessary? Question: can we use additional knowledge about dependences? Uniform, non-uniform and parametric dependences UCLA 93

135 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Cost Functions UCLA 94

136 Scheduling for Performance: Objectives for Good Scheduling Fine-grain parallelism is nice, but... It has little connection with modern SIMD parallelism No information about the quality of the generated code Ignores all the important performance objectives: Data locality / TLB / Cache consideration Multi-core parallelism (sync-free, with barrier) SIMD vectorization Question: how to find a FAST schedule for a modern processor? UCLA 95

137 Scheduling for Performance: Performance Distribution for 1-D Schedules [1/2] 2e+09 matmult 4e+09 locality 1.8e e e+09 3e+09 Cycles 1.4e e+09 Cycles 2.5e+09 2e+09 1e+09 original 1.5e+09 8e+08 6e Transformation identifier 1e+09 original 5e Transformation identifier Figure : Performance distribution for matmult and locality UCLA 96

138 Scheduling for Performance: Performance Distribution for 1-D Schedules [2/2] 1.42e+09 crout 1.34e+09 crout 1.4e e e e e e+09 Cycles 1.34e e+09 Cycles 1.3e e e e+09 original 1.28e e+09 original 1.26e Transformation identifier 1.26e Transformation identifier (a) GCC -O3 (b) ICC -fast Figure : The effect of the compiler UCLA 97

139 Scheduling for Performance: Quantitative Analysis: The Hypothesis Extremely large generated spaces: > points we must leverage static and dynamic characteristics to build traversal mechanisms Hypothesis: It is possible to statically order the impact on performance of transformation coefficients, that is, decompose the search space in subspaces where the performance variation is maximal or reduced First rows of Θ are more performance impacting than the last ones UCLA 98

140 Scheduling for Performance: Observations on the Performance Distribution Performance distribution - 8x8 DCT Performance improvement Best Average Worst for (i = 0; i < M; i++) for (j = 0; j < M; j++) { tmp[i][j] = 0.0; for (k = 0; k < M; k++) tmp[i][j] += block[i][k] * cos1[j][k]; } for (i = 0; i < M; i++) for (j = 0; j < M; j++) { sum2 = 0.0; for (k = 0; k < M; k++) sum2 += cos1[i][k] * tmp[k][j]; block[i][j] = ROUND(sum2); } Point index for the first schedule row Extensive study of 8x8 Discrete Cosine Transform (UTDSP) Search space analyzed: = different legal program versions UCLA 99

141 Scheduling for Performance: Observations on the Performance Distribution Performance improvement Performance distribution - 8x8 DCT Best Average Worst Θ : Point index for the first schedule row Extensive study of 8x8 Discrete Cosine Transform (UTDSP) Search space analyzed: = different legal program versions UCLA 99

142 Scheduling for Performance: Observations on the Performance Distribution Performance distribution - 8x8 DCT Best Average Worst best average worst Performance improvement Point index for the first schedule row Take one specific value for the first row Try the possible values for the second row UCLA 99

143 Scheduling for Performance: Observations on the Performance Distribution Performance distribution - 8x8 DCT Performance distribution (sorted) - 8x8 DCT Best Average Worst Performance improvement Point index for the first schedule row Point index of the second schedule dimension, first one fixed Take one specific value for the first row Try the possible values for the second row Very low proportion of best points: < 0.02% UCLA 99

144 Scheduling for Performance: Observations on the Performance Distribution Performance distribution - 8x8 DCT Best Average Worst Large performance variation Performance improvement Point index for the first schedule row Performance variation is large for good values of the first row UCLA 99

145 Scheduling for Performance: Observations on the Performance Distribution Performance distribution - 8x8 DCT Best Average Worst Small performance variation Performance improvement Point index for the first schedule row Performance variation is large for good values of the first row It is usually reduced for bad values of the first row UCLA 99

146 Scheduling for Performance: Scanning The Space of Program Versions The search space: Performance variation indicates to partition the space: ı > p > c Non-uniform distribution of performance No clear analytical property of the optimization function Build dedicated heuristic and genetic operators aware of these static and dynamic characteristics UCLA 100

147 Scheduling for Performance: The Quest for Good Objective Functions For data locality, loop tiling is key But what is the cost of tiling? Is tiling the only criterion? For coarse-grain parallelism, doall parallelization is key But what is the cost of parallelization? UCLA 101

148 Scheduling for Performance: Dependence Distance Minimization Idea: minimize the delay between instances accessing the same data Formulation in the polyhedral model: Expression of the delay through parametric form Use all dependences (including RAR) Definition (Dependence distance minimization) u k. n + w k Θ S ( x S ) Θ R ( x R ) x R, x S D R,S (1) u k N p,w k N UCLA 102

149 Scheduling for Performance: Key Observations Minimizing d = u k. n + w k minimize the dependence distance When d = 0 then Θ R k ( x R) = Θ S k ( x S) 0 Θ R k ( x R ) Θ S k ( x S) 0 d gives an indication of the communication volume between hyperplanes UCLA 103

150 Scheduling for Performance: Tiling An Overview of Tiling Tiling: partition the computation into atomic blocs Early work in the late 80 s Motivation: data locality improvement + parallelization UCLA 104

151 Scheduling for Performance: Tiling An Overview of Tiling Tiling the iteration space It must be valid (dependence analysis required) It may require pre-transformation Unimodular transformation framework limitations Supported in current compilers, but limited applicability Challenges: imperfectly nested loops, parametric loops, pre-transformations, tile shape,... Tile size selection Critical for locality concerns: determines the footprint Empirical search of the best size (problem + machine specific) Parametric tiling makes the generated code valid for any tile size UCLA 105

computation into blocks Note we consider only rectangular

152 Scheduling for Performance: The Tiling Hyperplane Method Tiling in the Polyhedral Model Tiling partition the computation into blocks Note we consider only rectangular tiling here For tiling to be legal, such a partitioning must be legal UCLA 106

153 Scheduling for Performance: The Tiling Hyperplane Method Key Ideas of the Tiling Hyperplane Algorithm Affine transformations for communication minimal parallelization and locality optimization of arbitrarily nested loop sequences [Bondhugula et al, CC 08 & PLDI 08] Compute a set of transformations to make loops tilable Try to minimize synchronizations Try to maximize locality (maximal fusion) Result is a set of permutable loops, if possible Strip-mining / tiling can be applied Tiles may be sync-free parallel or pipeline parallel Algorithm always terminates (possibly by splitting loops/statements) UCLA 107

154 Scheduling for Performance: The Tiling Hyperplane Method Legality of Tiling Theorem (Legality of Tiling) Given Θ R k,θs k two one-dimensional schedules. They are valid tiling hyperplanes if D R,S, x R, x S D R,S, Θ S k ( x S) Θ R k ( x R ) 0 For a schedule to be a legal tiling hyperplane, all communications must go forward: Forward Communication Only [Griebl] All dependences must be considered at each level, including the previously strongly satisfied Equivalence between loop permutability and loop tilability UCLA 108

155 Scheduling for Performance: The Tiling Hyperplane Method Greedy Algorithm for Tiling Hyperplane Computation 1 Start from the outer-most level, find the set of FCO schedules 2 Select one which minimize the distance between dependent iterations 3 Mark dependences strongly satisfied by this schedule, but do not remove them 4 Formulate the problem for the next level (FCO), adding orthogonality constraints (linear independence) 5 solve again, etc. Special treatment when no permutable band can be found: splitting A few properties: Result is a set of permutable/tilable outer loops, when possible It exhibits coarse-grain parallelism Maximal fusion achieved to improve locality UCLA 109

156 Scheduling for Performance: The Tiling Hyperplane Method Example: 1D-Jacobi 1-D Jacobi (imperfectly nested) for (t=1; t<m; t++) { for (i=2; i<n!1; i++) { S: b[i] = 0.333*(a[i!1]+a[i]+a[i+1]); } for (j=2; j<n!1; j++) { T: a[j] = b[j]; } } Louisiana State University 1 The Ohio State University UCLA 110

157 Scheduling for Performance: The Tiling Hyperplane Method Example: 1D-Jacobi 1-D Jacobi (imperfectly nested) The resulting transformation is equivalent to a constant shift of one for T relative to S, fusion (j and i are named the same as a result), and skewing the fused i loop with respect to the t loop by a factor of two. The (1,0) hyperplane has the least communication: no dependence crosses more than one hyperplane instance along it. Louisiana State University 2 The Ohio State University UCLA 111

158 Scheduling for Performance: The Tiling Hyperplane Method Example: 1D-Jacobi Transforming S i i! Louisiana State University 3 t t! The Ohio State University UCLA 112

159 Scheduling for Performance: The Tiling Hyperplane Method Example: 1D-Jacobi Transforming T j j! Louisiana State University 4 t t! The Ohio State University UCLA 113

160 Scheduling for Performance: The Tiling Hyperplane Method Example: 1D-Jacobi Interleaving S and T i! j! t! t! Louisiana State University 5 The Ohio State University UCLA 114

161 Scheduling for Performance: The Tiling Hyperplane Method Example: 1D-Jacobi Interleaving S and T t Louisiana State University 6 The Ohio State University UCLA 115

162 Scheduling for Performance: The Tiling Hyperplane Method Example: 1D-Jacobi 1-D Jacobi (imperfectly nested) transformed code for (t0=0;t0<=m-1;t0++) { S : b[2]=0.333*(a[2-1]+a[2]+a[2+1]); for (t1=2*t0+3;t1<=2*t0+n-2;t1++) { S: b[-2*t0+t1]=0.333*(a[-2*t0+t1-1]+a[-2*t0+t1] +a[-2*t0+t1+1]); T: a[-2*t0+t1-1]=b[-2*t0+t1-1]; } T : a[n-2]=b[n-2]; } Louisiana State University 7 The Ohio State University UCLA 116

Scheduling for Performance: The Tiling Hyperplane Method Example: 1D-Jacobi 1-D Jacobi (imperfectly nested) transformed code for (t0=0;t0<=m-1;t0++) { S : b[2]=0.

163 Scheduling for Performance: The Tiling Hyperplane Method Example: 1D-Jacobi 1-D Jacobi (imperfectly nested) transformed code for (t0=0;t0<=m-1;t0++) { S : b[2]=0.333*(a[2-1]+a[2]+a[2+1]); for (t1=2*t0+3;t1<=2*t0+n-2;t1++) { S: b[-2*t0+t1]=0.333*(a[-2*t0+t1-1]+a[-2*t0+t1] +a[-2*t0+t1+1]); T: a[-2*t0+t1-1]=b[-2*t0+t1-1]; } T : a[n-2]=b[n-2]; }!!!!!!! Louisiana State University 8 The Ohio State University UCLA 117

CS671 Parallel Programming in the Many-Core Era

1 CS671 Parallel Programming in the Many-Core Era Polyhedral Framework for Compilation: Polyhedral Model Representation, Data Dependence Analysis, Scheduling and Data Locality Optimizations December 3,