Program transformations and optimizations in the polyhedral framework

Size: px
Start display at page:

Download "Program transformations and optimizations in the polyhedral framework"

Transcription

1 Program transformations and optimizations in the polyhedral framework Louis-Noël Pouchet Dept. of Computer Science UCLA May 14, st Polyhedral Spring School St-Germain au Mont d or, France

2 : Overview UCLA 2

3 Overview & Motivation: Agenda for the Lecture Disclaimer: this lecture is (mainly) for new Ph.D students 1 Program Representation in the Polyhedral Model 2 Applying loop transformations in the model 3 Exact data dependence representation 4 Scheduling using Farkas-based methods 5 Scheduling using approximate data dependences If time permits... UCLA 3

4 Overview & Motivation: Running Example: DGEMM Example (dgemm) /* C := alpha*a*b + beta*c */ for (i = 0; i < ni; i++) for (j = 0; j < nj; j++) S1: C[i][j] *= beta; for (i = 0; i < ni; i++) for (j = 0; j < nj; j++) for (k = 0; k < nk; ++k) S2: C[i][j] += alpha * A[i][k] * B[k][j]; Loop transformation: permute(i,k,s2) Execution time (in s) on this laptop, GCC 4.2, ni=nj=nk=512 version -O0 -O1 -O2 -O3 -vec original permute UCLA 4

5 Overview & Motivation: Running Example: fdtd-2d Example (fdtd-2d) } for(t = 0; t < tmax; t++) { for (j = 0; j < ny; j++) ey[0][j] = _edge_[t]; for (i = 1; i < nx; i++) for (j = 0; j < ny; j++) ey[i][j] = ey[i][j] - 0.5*(hz[i][j]-hz[i-1][j]); for (i = 0; i < nx; i++) for (j = 1; j < ny; j++) ex[i][j] = ex[i][j] - 0.5*(hz[i][j]-hz[i][j-1]); for (i = 0; i < nx - 1; i++) for (j = 0; j < ny - 1; j++) hz[i][j] = hz[i][j] - 0.7* (ex[i][j+1] - ex[i][j] + ey[i+1][j]-ey[i][j]); Loop transformation: polyhedralopt(fdtd-2d) Execution time (in s) on this laptop, GCC 4.2, 64x1024x1024 version -O0 -O1 -O2 -O3 -vec original polyhedralopt UCLA 5

6 Overview & Motivation: Manual transformations? Example (fdtd-2d) } for(t = 0; t < tmax; t++) { for (j = 0; j < ny; j++) ey[0][j] = _edge_[t]; for (i = 1; i < nx; i++) for (j = 0; j < ny; j++) ey[i][j] = ey[i][j] - 0.5*(hz[i][j]-hz[i-1][j]); for (i = 0; i < nx; i++) for (j = 1; j < ny; j++) ex[i][j] = ex[i][j] - 0.5*(hz[i][j]-hz[i][j-1]); for (i = 0; i < nx - 1; i++) for (j = 0; j < ny - 1; j++) hz[i][j] = hz[i][j] - 0.7* (ex[i][j+1] - ex[i][j] + ey[i+1][j]-ey[i][j]); UCLA 6

7 Overview & Motivation: NO NO NO!!! Example (fdtd-2d tiled) for (c0 = 0; c0 <= (((ny + 2 * tmax + -3) * 32 < 0?((32 < 0?-((-(ny + 2 * tmax + -3) ) / 32) : -((-(ny + 2 * tmax + -3) ) / 32))) : (ny + 2 * tmax + -3) / 32)); ++c0) { #pragma omp parallel for private(c3, c4, c2, c5) for (c1 = (((c0 * 2 < 0?-(-c0 / 2) : ((2 < 0?(-c ) / -2 : (c ) / 2)))) > (((32 * c0 + -tmax + 1) * 32 < 0?-(-(32 * c0 + -tmax + 1) / 32) : ((32 < 0?(-(32 * c0 + -tmax + 1) ) / -32 : (32 * c0 + -tmax ) / 32))))?((c0 * 2 < 0?-(-c0 / 2) : ((2 < 0?(-c ) / -2 : (c ) / 2)))) : (((32 * c0 + -tmax + 1) * 32 < 0?-(-(32 * c0 + -tmax + 1) / 32) : ((32 < 0?(-(32 * c0 + -tmax + 1) ) / -32 : (32 * c0 + -tmax ) / 32))))); c1 <= (((((((ny + tmax + -2) * 32 < 0?((32 < 0?-((-(ny + tmax + -2) ) / 32) : -((-(ny + tmax + -2) ) / 32))) : (ny + tmax + -2) / 32)) < (((32 * c0 + ny + 30) * 64 < 0?((64 < 0?-((-(32 * c0 + ny + 30) ) / 64) : -((-(32 * c0 + ny + 30) ) / 64))) : (32 * c0 + ny + 30) / 64))?(((ny + tmax + -2) * 32 < 0?((32 < 0?-((-(ny + tmax + -2) ) / 32) : -((-(ny + tmax + -2) ) / 32))) : (ny + tmax + -2) / 32)) : (((32 * c0 + ny + 30) * 64 < 0?((64 < 0?-((-(32 * c0 + ny + 30) ) / 64) : -((-(32 * c0 + ny + 30) ) / 64))) : (32 * c0 + ny + 30) / 64)))) < c0?(((((ny + tmax + -2) * 32 < 0?((32 < 0?-((-(ny + tmax + -2) ) / 32) : -((-(ny + tmax + -2) ) / 32))) : (ny + tmax + -2) / 32)) < (((32 * c0 + ny + 30) * 64 < 0?((64 < 0?-((-(32 * c0 + ny + 30) ) / 64) : -((-(32 * c0 + ny + 30) ) / 64))) : (32 * c0 + ny + 30) / 64))?(((ny + tmax + -2) * 32 < 0?((32 < 0?-((-(ny + tmax + -2) ) / 32) : -((-(ny + tmax + -2) ) / 32))) : (ny + tmax + -2) / 32)) : (((32 * c0 + ny + 30) * 64 < 0?((64 < 0?-((-(32 * c0 + ny + 30) ) / 64) : -((-(32 * c0 + ny + 30) ) / 64))) : (32 * c0 + ny + 30) / 64)))) : c0)); ++c1) { for (c2 = c0 + -c1; c2 <= (((((tmax + nx + -2) * 32 < 0?((32 < 0?-((-(tmax + nx + -2) ) / 32) : -((-(tmax + nx + -2) ) / 32))) : (tmax + nx + -2) / 32)) < (((32 * c * c1 + nx + 30) * 32 < 0?((32 < 0?-((-(32 * c * c1 + nx + 30) ) / 32) : -((-(32 * c * c1 + nx + 30) ) / 32))) : (32 * c * c1 + nx + 30) / 32))?(((tmax + nx + -2) * 32 < 0?((32 < 0?-((-(tmax + nx + -2) ) / 32) : -((-(tmax + nx + -2) ) / 32))) : (tmax + nx + -2) / 32)) : (((32 * c * c1 + nx + 30) * 32 < 0?((32 < 0?-((-(32 * c * c1 + nx + 30) ) / 32) : -((-(32 * c * c1 + nx + 30) ) / 32))) : (32 * c * c1 + nx + 30) / 32)))); ++c2) { if (c0 == 2 * c1 && c0 == 2 * c2) { for (c3 = 16 * c0; c3 <= ((tmax + -1 < 16 * c0 + 30?tmax + -1 : 16 * c0 + 30)); ++c3) if (c0 % 2 == 0) (ey[0])[0] = (_edge_[c3]);... (200 more lines!) Performance gain: 2-6 on modern multicore platforms UCLA 6

8 Overview & Motivation: Loop Transformations in Production Compilers Limitations of standard syntactic frameworks: Composition of transformations may be tedious composability rules / applicability Parametric loop bounds, imperfectly nested loops are challenging Look at the examples! Approximate dependence analysis Miss parallelization opportunities (among many others) (Very) conservative performance models UCLA 7

9 Overview & Motivation: Achievements of Polyhedral Compilation The polyhedral model: Model/apply seamlessly arbitrary compositions of transformations Automatic handling of imperfectly nested, parametric loop structures Many loop transformation can be modeled Exact dependence analysis on a class of programs Unleash the power of automatic parallelization Aggressive multi-objective program restructuring (parallelism, SIMD, cache, etc.) Requires computationally expensive algorithms Usually NP-complete / exponential complexity Requires careful problem statement/representation UCLA 8

10 Overview & Motivation: Research around Polyhedral Compilation "first generation" Provided the theoretical foundations Focused on high-level/abstract objectives Was questioned about how practical is this model "second generation" Provided practical / "more scalable" algorithms Focused on concrete performance improvement Still often questioned about how practical is this model Thanks for the legacy! ;-) UCLA 9

11 Overview & Motivation: Perspective on 25 Years of Computing "first generation" Parallel machines were... rare Era of increased program performance for free Intel 486 DX2: 54 MIPS, 66MHz "second generation" Parallel machines are ubiquitous (multi-core, SIMD, GPUs) Very difficult to get good performance Intel Core i7: MIPS, 3.33 GHz UCLA 10

12 Overview & Motivation: Perspective on 25 Years of Computing "first generation" Parallel machines were... rare Era of increased program performance for free Intel 486 DX2: 54 MIPS, 66MHz "second generation" Parallel machines are ubiquitous (multi-core, SIMD, GPUs) Very difficult to get good performance Intel Core i7: MIPS, 3.33 GHz More complex problem can be solved, but more complex objectives are needed anyway UCLA 10

13 Overview & Motivation: Disclaimer This lecture is: Partial polyhedral model affine scheduling - Google Scholar 5/14/13 12:57 AM Web Images More louisnoel.pouchet@gmail.com polyhedral model affine scheduling Scholar About 3,520 results (0.06 sec) My Citations 3 Articles Iterative optimization in the polyhedral model: Part II, multidimensional time LN Pouchet, C Bastoul, A Cohen, J Cavazos - ACM SIGPLAN Notices, dl.acm.org Legal documents loops) where iterative affine schedul- ing has never been attempted before. 4. We demonstrate significant performance improvements on 3 different architectures. The paper is structured as follows. Section 2 introduces multi- dimensional scheduling in the polyhedral model.... Any time Cited by 80 Related articles All 20 versions Cite Since 2013 Since 2012 Iterative optimization in the polyhedral model: Part I, one-dimensional time Since 2009 LN Pouchet, C Bastoul, A Cohen - Code Generation and, ieeexplore.ieee.org Custom range that can be ex- pressed as one-dimensional affine schedules (using a polyhedral abstraction); this... ing approaches on loop transformation sequences, or state-of-the-art affine scheduling; such smaller... of the optimal code, a confirmation that building a predictive model for loop... Sort by relevance Cited by 85 Related articles All 29 versions Cite Sort by date Automatic transformations for communication-minimized parallelization and locality optimization in the polyhedral model include patents U Bondhugula, M Baskaran, S Krishnamoorthy - Compiler, Springer include citations... The polyhedral model provides powerful abstractions to optimize loop nests with regular accesses. Affine transformations in this model capture a com- plex sequence of execution... Although a significant body of research has addressed affine scheduling and partitioning, the... Create alert Cited by 60 Related articles All 19 versions Cite Biased (strongly!) Provocative at times (hopefully ;-) ) [PDF] from inria.fr [PDF] from inria.fr [PDF] from ohio-state.edu Memory reuse analysis in the polyhedral model [PDF] from psu.edu D Wilde, S Rajopadhye - Euro-Par'96 Parallel Processing, Springer... Currently, the user chooses the transformations based on the analyses enabled by the polyhedral UCLA model, specifies it... Scheduling: There has been much research on the SARE scheduling problem 11 [9, 17, 8, i, j) - i + j and tx (i) = 2i are valid (one dimensional) affine schedules for... Cited by 38 Related articles All 10 versions Cite

14 Overview & Motivation: Polyhedral Program Representation UCLA 12

15 Polyhedral Program Representation: Basics Example: DGEMM Example (dgemm) /* C := alpha*a*b + beta*c */ for (i = 0; i < ni; i++) for (j = 0; j < nj; j++) { S1: C[i][j] *= beta; for (k = 0; k < nk; ++k) S2: C[i][j] += alpha * A[i][k] * B[k][j]; } This code has: imperfectly nested loops multiple statements parametric loop bounds UCLA 13

16 Polyhedral Program Representation: Basics Granularity of Program Representation DGEMM has: 3 loops For loops in the code, while loops Control-flow graph analysis 2 (syntactic) statements Input source code? Basic block? ASM instructions? S1 is executed ni nj times dynamic instances of the statement Does not (necessarily) correspond to reality! UCLA 14

17 Polyhedral Program Representation: Basics Some Observations Reasoning at the loop/statement level: Some loop transformation may be very difficult to implement How to fuse loops with different loop bounds? How to permute triangular loops? How to unroll-and-jam triangular loops? How to apply time-tiling?... Statements may operate on the same array while being independent UCLA 15

18 Polyhedral Program Representation: Basics Some Motivations for Polyhedral Transformations Known problem: scheduling of task graph Obvious limitations: task graph is not finite / size depends on problem / effective code generation almost impossible Alternative approach: use loop transformations solve all above limitation BUT the problem is to find a sequence that implements the order we want AND also how to apply/compose them Desired features: ability to reason at the instance level (as for task graph scheduling) ability to easily apply/compose loop transformations UCLA 16

19 The Polyhedral Model: The Polyhedral Model UCLA 17

20 The Polyhedral Model: Overview Polyhedral Program Optimization: a Three-Stage Process 1 Analysis: from code to model Existing prototype tools URUK, Omega, ChiLL... ISL, Loopo PolyOpt/PoCC (Clan-Candl-LetSee-Pluto-Ponos-Cloog-Polylib-PIPLib-ISL-FM-...) GCC GRAPHITE, LLVM Polly (now in mainstream) Reservoir Labs R-Stream, IBM XL/Poly UCLA 18

21 The Polyhedral Model: Overview Polyhedral Program Optimization: a Three-Stage Process 1 Analysis: from code to model Existing prototype tools URUK, Omega, ChiLL... ISL, Loopo PolyOpt/PoCC (Clan-Candl-LetSee-Pluto-Ponos-Cloog-Polylib-PIPLib-ISL-FM-...) GCC GRAPHITE, LLVM Polly (now in mainstream) Reservoir Labs R-Stream, IBM XL/Poly 2 Transformation in the model Build and select a program transformation UCLA 18

22 The Polyhedral Model: Overview Polyhedral Program Optimization: a Three-Stage Process 1 Analysis: from code to model Existing prototype tools URUK, Omega, ChiLL... ISL, Loopo PolyOpt/PoCC (Clan-Candl-LetSee-Pluto-Ponos-Cloog-Polylib-PIPLib-ISL-FM-...) GCC GRAPHITE, LLVM Polly (now in mainstream) Reservoir Labs R-Stream, IBM XL/Poly 2 Transformation in the model Build and select a program transformation 3 Code generation: from model to code "Apply" the transformation in the model Regenerate syntactic (AST-based) code UCLA 18

23 The Polyhedral Model: Overview Motivating Example [1/2] Example for (i = 0; i <= 1; ++i) for (j = 0; j <= 2; ++j) A[i][j] = i * j; Program execution: 1: A[0][0] = 0 * 0; 2: A[0][1] = 0 * 1; 3: A[0][2] = 0 * 2; 4: A[1][0] = 1 * 0; 5: A[1][1] = 1 * 1; 6: A[1][2] = 1 * 2; UCLA 19

24 The Polyhedral Model: Overview Motivating Example [2/2] A few observations: Statement is executed 6 times There is a different values for i, j associated to these 6 instances There is an order on them (the execution order) A rough analogy: polyhedral compilation is about (statically) scheduling tasks, where tasks are statement instances, or operations UCLA 20

25 The Polyhedral Model: Overview Polyhedral Representation of Programs Static Control Parts Loops have affine control only (over-approximation otherwise) UCLA 21

26 The Polyhedral Model: Overview Polyhedral Representation of Programs Static Control Parts Loops have affine control only (over-approximation otherwise) Iteration domain: represented as integer polyhedra for (i=1; i<=n; ++i). for (j=1; j<=n; ++j).. if (i<=n-j+2)... s[i] =... D S1 = i j n 1 0 UCLA 21

27 The Polyhedral Model: Overview Polyhedral Representation of Programs Static Control Parts Loops have affine control only (over-approximation otherwise) Iteration domain: represented as integer polyhedra Memory accesses: static references, represented as affine functions of x S and p for (i=0; i<n; ++i) {. s[i] = 0;. for (j=0; j<n; ++j).. s[i] = s[i]+a[i][j]*x[j]; } f s ( x S2 ) = [ ]. [ f a ( x S2 ) = ]. f x ( x S2 ) = [ ]. x S2 n 1 x S2 n 1 x S2 n 1 UCLA 21

28 The Polyhedral Model: Overview Polyhedral Representation of Programs Static Control Parts Loops have affine control only (over-approximation otherwise) Iteration domain: represented as integer polyhedra Memory accesses: static references, represented as affine functions of x S and p Data dependence between S1 and S2: a subset of the Cartesian product of D S1 and D S2 (exact analysis) for (i=1; i<=3; ++i) {. s[i] = 0;. for (j=1; j<=3; ++j).. s[i] = s[i] + 1; } D S1δS2 : i S1. i S2 j S2 1 = 0 0 S1 iterations S2 iterations i UCLA 21

29 The Polyhedral Model: Overview Math Corner UCLA 22

30 The Polyhedral Model: Math Corner The Affine Qualifier Definition (Affine function) A function f : K m K n is affine if there exists a vector b K n and a matrix A K m n such that: x K m, f ( x) = A x + b Definition (Affine half-space) An affine half-space of K m (affine constraint) is defined as the set of points: { x K m a. x b} UCLA 23

31 The Polyhedral Model: Math Corner Polyhedron (Implicit Representation) Definition (Polyhedron) A set S K m is a polyhedron if there exists a system of a finite number of inequalities A x b such that: P = { x K m A x b} Equivalently, it is the intersection of finitely many half-spaces. Definition (Polytope) A polytope is a bounded polyhedron. UCLA 24

32 The Polyhedral Model: Math Corner Integer Polyhedron Definition (Z-polyhedron) It is a polyhedron where all its extreme points are integer valued Definition (Integer hull) The integer hull of a rational polyhedron P is the largest set of integer points such that each of these points is in P. For the moment, we will "say" an integer polyhedron is a polyhedron of integer points (language abuse) UCLA 25

33 The Polyhedral Model: Math Corner Rational and Integer Polytopes UCLA 26

34 The Polyhedral Model: Math Corner Another View of Polyhedra The dual representation models a polyhedron as a combination of lines L and rays R (forming the polyhedral cone) and vertices V (forming the polytope) Definition (Dual representation) P : { x Q n x = L λ + R µ + V ν, µ 0, ν 0, ν i = 1} i Definition (Face) A face F of P is the intersection of P with a supporting hyperplane of P. We have: dim(f ) dim(p ) Definition (Facet) A facet F of P is a face of P such that: dim(f ) = dim(p ) 1 UCLA 27

35 The Polyhedral Model: Math Corner Some Equivalence Properties Theorem (Fundamental Theorem on Polyhedral Decomposition) If P is a polyhedron, then it can be decomposed as a polytope V plus a polyhedral cone L. Theorem (Equivalence of Representations) Every polyhedron has both an implicit and dual representation Chernikova s algorithm can compute the dual representation from the implicit one The Dual representation is heavily used in polyhedral compilation Some works operate on the constraint-based representation (Pluto) UCLA 28

36 The Polyhedral Model: Math Corner Parametric Polyhedra Definition (Paramteric polyhedron) Given n the vector of symbolic parameters, P is a parametric polyhedron if it is defined by: P = { x K m A x B n + b} Requires to adapt theory and tools to parameters Can become nasty: case distinctions (QUAST) Reflects nicely the program context UCLA 29

37 The Polyhedral Model: Math Corner Some Useful Algorithms All extended to parametric polyhedra: Compute the facets of a polytope: PolyLib [Wilde et al] Compute the volume of a polytope (number of points): Barvinok [Claus/Verdoolaege] Scan a polytope (code generation): CLooG [Quillere/Bastoul] Find the lexicographic minimum: PIP [Feautrier] UCLA 30

38 The Polyhedral Model: Math Corner Operations on Polyhedra Definition (Intersection) The intersection of two convex sets P 1 and P 2 is a convex set P : P = { x K m x P 1 x P 2 } Definition (Union) The union of two convex sets P 1 and P 2 is a set P : P = { x K m x P 1 x P 2 } The union of two convex sets may not be a convex set UCLA 31

39 The Polyhedral Model: Math Corner Lattices Definition (Lattice) A subset L in Q n is a lattice if is generated by integral combination of finitely many vectors: a 1,a 2,...,a n (a i Q n ). If the a i vectors have integral coordinates, L is an integer lattice. Definition (Z-polyhedron) A Z-polyhedron is the intersection of a polyhedron and an affine integral full dimensional lattice. UCLA 32

40 The Polyhedral Model: Math Corner Pictured Example Example of a Z-polyhedron: Q 1 = {i,j 0 i 5, 0 3j 20} L 1 = {2i + 1,3j + 5 i,j Z} Z 1 = Q 1 L 1 UCLA 33

41 The Polyhedral Model: Math Corner Quick Facts on Z-polyhedra Iteration domains are in fact Z-polyhedra with unit lattice Intersection of Z-polyhedra is not convex in general Union is complex to compute Can count points, can optimize, can scan Implementation available for most operations in PolyLib UCLA 34

42 The Polyhedral Model: Math Corner Image and Preimage Definition (Image) The image of a polyhedron P Z n by an affine function f : Z n Z m is a Z-polyhedron P : P = {f ( x) Z m x P } Definition (Preimage) The preimage of a polyhedron P Z n by an affine function f : Z n Z m is a Z-polyhedron P : P = { x Z n f ( x) P } We have Image(f 1,P ) = Preimage(f,P ) if f is invertible. UCLA 35

43 The Polyhedral Model: Math Corner Relation Between Image, Preimage and Z-polyhedra The image of a Z-polyhedron by an unimodular function is a Z-polyhedron The preimage of a Z-polyhedron by an affine function is a Z-polyhedron The image of a polyhedron by an affine invertible function is a Z-polyhedron The preimage of a Z-polyhedron by an affine function is a Z-polyhedron The image by a non-invertible function is not a Z-polyhedron UCLA 36

44 The Polyhedral Model: Math Corner Program Transformations UCLA 37

45 The Polyhedral Model: Modeling Programs What Can Be Modeled? Exact vs. approximate representation: Exact representation of iteration domains Static control flow Affine loop bounds (includes min/max/integer division) Affine conditionals (conjunction/disjunction) Approximate representation of iteration domains Use affine over-approximations, predicate statement executions Full-function support UCLA 38

46 Program Transformation: Key Intuition Programs are represented with geometric shapes Transforming a program is about modifying those shapes rotation, skewing, stretching,... But we need here to assume some order to scan points UCLA 39

47 Program Transformation: Affine Transformations Interchange Transformation The transformation matrix is the identity with a permutation of two rows. j ( ) i j i ( ) i j = = [ ] ( ) i j j i 0 1 ( ) i 1 0 j (a) original polyhedron (b) transformation function (c) target polyhedron A x + a 0 y = T x (AT 1 ) y + a 0 do i = 1, 2 do j = 1, 3 S(i,j) do i = 1, 3 do j = 1, 2 S(i=j,j=i ) UCLA 40

48 Program Transformation: Affine Transformations Reversal Transformation The transformation matrix is the identity with one diagonal element replaced by 1. j j i = i 1 0 ( ) i j ( ) i j = [ ] ( ) i j 1 0 ( ) i 0 1 j (a) original polyhedron (b) transformation function (c) target polyhedron A x + a 0 y = T x (AT 1 ) y + a 0 do i = 1, 2 do j = 1, 3 S(i,j) do i = -1, -2, -1 do j = 1, 3 S(i=3-i,j=j ) UCLA 40

49 Program Transformation: Affine Transformations Coumpound Transformation The transformation matrix is the composition of an interchange and reversal j j i = i 1 0 ( ) i j ( ) i j = [ ] ( ) i j 0 1 ( ) i 1 0 j (a) original polyhedron (b) transformation function (c) target polyhedron A x + a 0 y = T x (AT 1 ) y + a 0 do i = 1, 2 do j = 1, 3 S(i,j) do j = -1, -3, -1 do i = 1, 2 S(i=4-j,j=i ) UCLA 40

50 Program Transformation: Affine Transformations Coumpound Transformation The transformation matrix is the composition of an interchange and reversal j j i = i 1 0 ( ) i j ( ) i j = [ ] ( ) i j 0 1 ( ) i 1 0 j (a) original polyhedron (b) transformation function (c) target polyhedron A x + a 0 y = T x (AT 1 ) y + a 0 do i = 1, 2 do j = 1, 3 S(i,j) do j = -1, -3, -1 do i = 1, 2 S(i=4-j,j=i ) UCLA 40

51 Program Transformation: So, What is This Matrix? We know how to generate code for arbitrary matrices with integer coefficients Arbitrary number of rows (but fixed number of columns) Arbitrary value for the coefficients Through code generation, the number of dynamic instances is preserved But this is not true for the transformed polyhedra! Some classification: The matrix is unimodular The matrix is full rank and invertible The matrix is full rank The matrix has only integral coefficients The matrix has rational coefficients UCLA 41

52 Program Transformation: A Reverse History Perspective 1 CLooG: arbitrary matrix 2 Affine Mappings 3 Unimodular framework 4 SARE 5 SURE UCLA 42

53 Program Transformation: Program Transformations Original Schedule for (i = 0; i < n; ++i) for (j = 0; j < n; ++j){ S1: C[i][j] = 0; for (k = 0; k < n; ++k) S2: C[i][j] += A[i][k]* B[k][j]; } i ( ) Θ S1. x S1 = j n 1 ( ) i j Θ S2. x S2 = k n 1 for (i = 0; i < n; ++i) for (j = 0; j < n; ++j){ C[i][j] = 0; for (k = 0; k < n; ++k) C[i][j] += A[i][k]* B[k][j]; } Represent Static Control Parts (control flow and dependences must be statically computable) Use code generator (e.g. CLooG) to generate C code from polyhedral representation (provided iteration domains + schedules) UCLA 43

54 Program Transformation: Program Transformations Original Schedule for (i = 0; i < n; ++i) for (j = 0; j < n; ++j){ S1: C[i][j] = 0; for (k = 0; k < n; ++k) S2: C[i][j] += A[i][k]* B[k][j]; } i ( ) Θ S1. x S1 = j n 1 ( ) i j Θ S2. x S2 = k n 1 for (i = 0; i < n; ++i) for (j = 0; j < n; ++j){ C[i][j] = 0; for (k = 0; k < n; ++k) C[i][j] += A[i][k]* B[k][j]; } Represent Static Control Parts (control flow and dependences must be statically computable) Use code generator (e.g. CLooG) to generate C code from polyhedral representation (provided iteration domains + schedules) UCLA 43

55 Program Transformation: Program Transformations Original Schedule for (i = 0; i < n; ++i) for (j = 0; j < n; ++j){ S1: C[i][j] = 0; for (k = 0; k < n; ++k) S2: C[i][j] += A[i][k]* B[k][j]; } i ( ) Θ S1. x S1 = j n 1 ( ) i j Θ S2. x S2 = k n 1 for (i = 0; i < n; ++i) for (j = 0; j < n; ++j){ C[i][j] = 0; for (k = 0; k < n; ++k) C[i][j] += A[i][k]* B[k][j]; } Represent Static Control Parts (control flow and dependences must be statically computable) Use code generator (e.g. CLooG) to generate C code from polyhedral representation (provided iteration domains + schedules) UCLA 43

56 Program Transformation: Program Transformations Distribute loops for (i = 0; i < n; ++i) for (j = 0; j < n; ++j){ S1: C[i][j] = 0; for (k = 0; k < n; ++k) S2: C[i][j] += A[i][k]* B[k][j]; } i ( ) Θ S1. x S1 = j n 1 ( ) i j Θ S2. x S2 = k n 1 for (i = 0; i < n; ++i) for (j = 0; j < n; ++j) C[i][j] = 0; for (i = n; i < 2*n; ++i) for (j = 0; j < n; ++j) for (k = 0; k < n; ++k) C[i-n][j] += A[i-n][k]* B[k][j]; All instances of S1 are executed before the first S2 instance UCLA 43

57 Program Transformation: Program Transformations Distribute loops + Interchange loops for S2 for (i = 0; i < n; ++i) for (j = 0; j < n; ++j){ S1: C[i][j] = 0; for (k = 0; k < n; ++k) S2: C[i][j] += A[i][k]* B[k][j]; } i ( ) Θ S1. x S1 = j n 1 ( ) i j Θ S2. x S2 = k n 1 for (i = 0; i < n; ++i) for (j = 0; j < n; ++j) C[i][j] = 0; for (k = n; k < 2*n; ++k) for (j = 0; j < n; ++j) for (i = 0; i < n; ++i) C[i][j] += A[i][k-n]* B[k-n][j]; The outer-most loop for S2 becomes k UCLA 43

58 Program Transformation: Program Transformations Illegal schedule for (i = 0; i < n; ++i) for (j = 0; j < n; ++j){ S1: C[i][j] = 0; for (k = 0; k < n; ++k) S2: C[i][j] += A[i][k]* B[k][j]; } i ( ) Θ S1. x S1 = j n 1 ( ) i j Θ S2. x S2 = k n 1 for (k = 0; k < n; ++k) for (j = 0; j < n; ++j) for (i = 0; i < n; ++i) C[i][j] += A[i][k]* B[k][j]; for (i = n; i < 2*n; ++i) for (j = 0; j < n; ++j) C[i-n][j] = 0; All instances of S1 are executed after the last S2 instance UCLA 43

59 Program Transformation: Program Transformations A legal schedule for (i = 0; i < n; ++i) for (j = 0; j < n; ++j){ S1: C[i][j] = 0; for (k = 0; k < n; ++k) S2: C[i][j] += A[i][k]* B[k][j]; } i ( ) Θ S1. x S1 = j n 1 ( ) i j Θ S2. x S2 = k n 1 for (i = n; i < 2*n; ++i) for (j = 0; j < n; ++j) C[i][j] = 0; for (k= n+1; k<= 2*n; ++k) for (j = 0; j < n; ++j) for (i = 0; i < n; ++i) C[i][j] += A[i][k-n-1]* B[k-n-1][j]; Delay the S2 instances Constraints must be expressed between Θ S1 and Θ S2 UCLA 43

60 Program Transformation: Program Transformations Implicit fine-grain parallelism for (i = 0; i < n; ++i) for (j = 0; j < n; ++j){ S1: C[i][j] = 0; for (k = 0; k < n; ++k) S2: C[i][j] += A[i][k]* B[k][j]; } i Θ S1. x S1 = ( ). j n 1 i j Θ S2. x S2 = ( ). k n 1 for (i = 0; i < n; ++i) pfor (j = 0; j < n; ++j) C[i][j] = 0; for (k = n; k < 2*n; ++k) pfor (j = 0; j < n; ++j) pfor (i = 0; i < n; ++i) C[i][j] += A[i][k-n]* B[k-n][j]; Less (linear) rows than loop depth remaining dimensions are parallel UCLA 43

61 Program Transformation: Program Transformations Representing a schedule for (i = 0; i < n; ++i) for (j = 0; j < n; ++j){ S1: C[i][j] = 0; for (k = 0; k < n; ++k) S2: C[i][j] += A[i][k]* B[k][j]; } i ( ) Θ S1. x S1 = j n 1 ( ) i j Θ S2. x S2 = k n 1 for (i = n; i < 2*n; ++i) for (j = 0; j < n; ++j) C[i][j] = 0; for (k= n+1; k<= 2*n; ++k) for (j = 0; j < n; ++j) for (i = 0; i < n; ++i) C[i][j] += A[i][k-n-1]* B[k-n-1][j]; Θ. x = ( ) ( i j i j k n n 1 1 ) T UCLA 43

62 Program Transformation: Program Transformations Representing a schedule for (i = 0; i < n; ++i) for (j = 0; j < n; ++j){ S1: C[i][j] = 0; for (k = 0; k < n; ++k) S2: C[i][j] += A[i][k]* B[k][j]; } i ( ) Θ S1. x S1 = j n 1 ( ) i j Θ S2. x S2 = k n 1 for (i = n; i < 2*n; ++i) for (j = 0; j < n; ++j) C[i][j] = 0; for (k= n+1; k<= 2*n; ++k) for (j = 0; j < n; ++j) for (i = 0; i < n; ++i) C[i][j] += A[i][k-n-1]* B[k-n-1][j]; Θ. x = ( ) ( i j i j k n n 1 1 ) T ı p c UCLA 43

63 Program Transformation: Program Transformations Representing a schedule for (i = 0; i < n; ++i) for (j = 0; j < n; ++j){ S1: C[i][j] = 0; for (k = 0; k < n; ++k) S2: C[i][j] += A[i][k]* B[k][j]; } i ( ) Θ S1. x S1 = j n 1 ( ) i j Θ S2. x S2 = k n 1 for (i = n; i < 2*n; ++i) for (j = 0; j < n; ++j) C[i][j] = 0; for (k= n+1; k<= 2*n; ++k) for (j = 0; j < n; ++j) for (i = 0; i < n; ++i) C[i][j] += A[i][k-n-1]* B[k-n-1][j]; ı p c Transformation reversal skewing interchange fusion distribution peeling shifting Description Changes the direction in which a loop traverses its iteration range Makes the bounds of a given loop depend on an outer loop counter Exchanges two loops in a perfectly nested loop, a.k.a. permutation Fuses two loops, a.k.a. jamming Splits a single loop nest into many, a.k.a. fission or splitting Extracts one iteration of a given loop Allows to reorder loops UCLA 43

64 Program Transformation: Fusion in the Polyhedral Model 0 N for (i = 0; i <= N; ++i) { Blue(i); Red(i); } Perfectly aligned fusion UCLA 44

65 Program Transformation: Fusion in the Polyhedral Model 0 1 N N+1 Blue(0); for (i = 1; i <= N; ++i) { Blue(i); Red(i-1); } Red(N); Fusion with shift of 1 Not all instances are fused UCLA 44

66 Program Transformation: Fusion in the Polyhedral Model 0 P N N+P for (i = 0; i < P; ++i) Blue(i); for (i = P; i <= N; ++i) { Blue(i); Red(i-P); } for (i = N+1; i <= N+P; ++i) Red(i-P); Fusion with parametric shift of P Automatic generation of prolog/epilog code UCLA 44

67 Program Transformation: Fusion in the Polyhedral Model 0 P N N+P for (i = 0; i < P; ++i) Blue(i); for (i = P; i <= N; ++i) { Blue(i); Red(i-P); } for (i = N+1; i <= N+P; ++i) Red(i-P); Many other transformations may be required to enable fusion: interchange, skewing, etc. UCLA 44

68 Program Transformation: Scheduling Matrix Definition (Affine multidimensional schedule) Given a statement S, an affine schedule Θ S of dimension m is an affine form on the d outer loop iterators x S and the p global parameters n. Θ S Z m (d+p+1) can be written as: θ 1,1... θ 1,d+p+1 Θ S ( x S ) =... x S n θ m,1... θ m,d+p+1 1 Θ S k denotes the kth row of Θ S. Definition (Bounded affine multidimensional schedule) Θ S is a bounded schedule if θ S i,j [x,y] with x,y Z UCLA 45

69 Program Transformation: Another Representation One can separate coefficients of Θ into: 1 The iterator coefficients 2 The parameter coefficients 3 The constant coefficients One can also enforce the schedule dimension to be 2d+1. A d-dimensional square matrix for the linear part represents composition of interchange/skewing/slowing A d n matrix for the parametric part represents (parametric) shift A d + 1 vector for the scalar offset represents statement interleaving See URUK for instance UCLA 46

70 Program Transformation: Computing the 2d+1 Identity Schedule θs1 ("xs1 ) = (0, i, 0) θs2 ("xs2 ) = (0, i, 1, j, 0) θs3 0 i 0 S1 S2 S3 S4 S5 S6 UCLA S1 j S3 0 j 0 S2 S4 1 2 k S6 0 S5 47

71 examples updating the data layout and array access functions. This table is not complete (e.g., it lacks index-set splitting and datalayout Transformation transformations), but Catalogue it demonstrates [1/2] the expressiveness of the unified representation. Program Transformation: Syntax Effect UNIMODULAR(P,U) S S cop P β S,A S U.A S ; Γ S U.Γ S SHIFT(P,M) S S cop P β S,Γ S Γ S + M CUTDOM(P,c) S S cop P β S,Λ S ( AddRow Λ S,0,c/gcd(c 1,...,c d S +d lv S +dgp+1)) d S d S + 1; Λ S AddCol(Λ S,c,0); β S AddRow(β S,l,0); EXTEND(P,l,c) S S cop P β S, A S AddRow(AddCol(A S,c,0),l,1 l ); Γ S AddRow(Γ S,l,0); (A,F) Lhs S R hs S,F AddRow(F,l,0) ADDLOCALVAR(P) S S cop P β S,dlv S ds lv + 1; ΛS AddCol(Λ S,d S + 1,0); (A,F) Lhs S R hs S,F AddCol(F,dS + 1,0) PRIVATIZE(A,l) S S cop, (A,F) Lhs S R hs S,F AddRow(F,l,1 l) CONTRACT(A,l) S S cop, (A,F) Lhs S R hs S,F RemRow(F,l) FUSION(P,o) b = max{β S dim(p)+1 (P,o) βs } + 1 Move((P,o + 1),(P,o + 1),b); Move(P,(P,o + 1), 1) FISSION(P,o,b) Move(P,(P,o,b),1); Move((P,o + 1),(P,o + 1), b) MOTION(P,T ) if dim(p) + 1 = dim(t) then b = max{β S dim(p) P βs } + 1 else b = 1 Move(pfx(T,dim(T) 1),T,b) S S cop P β S,β S β S + T pfx(p,dim(t)) Move(P,P, 1) Fig. 18. Some classical transformation primitives UCLA 48

72 Program Transformation: Transformation Catalogue [2/2] 30 Girbal, Vasilache et al. Syntax Effect Comments INTERCHANGE(P,o) S S cop P β S swap rows o and o + 1, { U = Id S 1 o,o 1 o+1,o+1 +1 o,o+1 +1 o+1,o ; UNIMODULAR(β S,U) SKEW(P,l,c,s) S S cop P β S, { U = Id S + s 1 l,c ; UNIMODULAR(β S,U) REVERSE(P,o) S S cop P β S, { U = Id S 2 1 o,o ; UNIMODULAR(β S,U) STRIPMINE(P,k) S S cop P β S, c = dim(p); EXTEND(β S,c,c); u = d S + dlv S + d gp + 1; CUTDOM(β S, k 1 c + (A S c+1,γs c+1 )); CUTDOM(β S,k 1 c (A S c+1,γs c+1 ) + (k 1)1 u) TILE(P,o,k 1,k 2 ) S S cop (P,o) β S, STRIPMINE((P,o),k 2 ); STRIPMINE(P,k 1 ); INTERCHANGE((P,0),dim(P)) add the skew factor put a -1 in (o,o) insert intermediate loop constant column k i c i c+1 i c+1 k i c + k 1 strip outer loop strip inner loop interchange Fig. 19. Composition of transformation primitives UCLA 49

73 Program Transformation: Some Final Observations Some classical pitfalls The number of rows of Θ does not correspond to actual parallel levels Scalar rows vs. linear rows Linear independence Parametric shift for domains without parametric bound... UCLA 50

74 Program Transformation: Data Dependences UCLA 51

75 Data Dependences: Purpose of Dependence Analysis Not all program transformations preserve the semantics Semantics is preserved if the dependence are preserved In standard frameworks, it usually means reordering statements/loops Statements containing dependent references should not be executed in a different order Granularity: usually a reference to an array In the polyhedral framework, it means reordering statement instances Statement instances containing dependent references should not be executed in a different order Granularity: a reference to an array cell UCLA 52

76 Data Dependences: Example Exercise: Compute the set of cells of A accessed Example for (i = 0; i < N; ++i) for (j = i; j < N; ++j) A[2i + 3][4j] = i * j; D S : {i,j 0 i < N, i j < N} Function: f A : {2i + 3,4j i,j Z} Image(f A,D S ) is the set of cells of A accessed (a Z-polyhedron): Polyhedron: Q : {i,j 3 i < 2N + 2, 0 j < 4N} Lattice: L : {2i + 3,4j i,j Z} UCLA 53

77 Data Dependences: Data Dependence Definition (Bernstein conditions) Given two references, there exists a dependence between them if the three following conditions hold: they reference the same array (cell) one of this access is a write the two associated statements are executed Three categories of dependences: RAW (Read-After-Write, aka flow): first reference writes, second reads WAR (Write-After-Read, aka anti): first reference reads, second writes WAW (Write-After-Write, aka output): both references writes Another kind: RAR (Read-After-Read dependences), used for locality/reuse computations UCLA 54

78 Data Dependences: Some Terminology on Dependence Relations We categorize the dependence relation in three kinds: Uniform dependences: the distance between two dependent iterations is a constant ex: i i + 1 ex: i,j i + 1,j + 1 Non-uniform dependences: the distance between two dependent iterations varies during the execution ex: i i + j ex: i 2i Parametric dependences: at least a parameter is involved in the dependence relation ex: i i + N ex: i + N j + M UCLA 55

79 Data Dependences: Dependence Polyhedra Dependence Polyhedron [1/3] Principle: model all pairs of instances in dependence Definition (Dependence of statement instances) A statement S depends on a statement R (written R S) if there exists an operation S( x S ) and R( x R ) and a memory location m such that: 1 S( x S ) and R( x R ) refer to the same memory location m, and at least one of them writes to that location, 2 x S and x R belongs to the iteration domain of R and S, 3 in the original sequential order, S( x S ) is executed before R( x R ). UCLA 56

80 Data Dependences: Dependence Polyhedra Dependence Polyhedron [2/3] 1 Same memory location: equality of the subscript functions of a pair of references to the same array: F S x S + a S = F R x R + a R. 2 Iteration domains: both S and R iteration domains can be described using affine inequalities, respectively A S x S + c S 0 and A R x R + c R 0. 3 Precedence order: each case corresponds to a common loop depth, and is called a dependence level. For each dependence level l, the precedence constraints are the equality of the loop index variables at depth lesser to l: x R,i = x S,i for i < l and x R,l > x S,l if l is less than the common nesting loop level. Otherwise, there is no additional constraint and the dependence only exists if S is textually before R. Such constraints can be written using linear inequalities: P l,s x S P l,r x R + b 0. UCLA 57

81 Data Dependences: Dependence Polyhedra Dependence Polyhedron [3/3] The dependence polyhedron for R S at a given level l and for a given pair of references f R,f S is described as [Feautrier/Bastoul]: ( ) xs D R,S,fR,f S,l : D + d = x R F S F R A S 0 0 A R PS P R ( ) a S a R xs c + S x R c R b = 0 0 A few properties: We can always build the dep polyhedron for a given pair of affine array accesses (it is convex) It is exact, if the iteration domain and the access functions are also exact it is over-approximated if the iteration domain or the access function is an approximation UCLA 58

82 Data Dependences: Dependence Polyhedra Polyhedral Dependence Graph Definition (Polyhedral Dependence Graph) The Polyhedral Dependence Graph is a multi-graph with one vertex per syntactic program statement S i, with edges S i S j for each dependence polyhedron D Si,S j. UCLA 59

83 Data Dependences: Dependence Polyhedra Scheduling UCLA 60

84 Affine Scheduling: Affine Scheduling Definition (Affine schedule) Given a statement S, a p-dimensional affine schedule Θ R is an affine form on the outer loop iterators x S and the global parameters n. It is written: Θ S ( x S ) = T S x S n, T S K p dim( x S)+dim( n)+1 1 A schedule assigns a timestamp to each executed instance of a statement If T is a vector, then Θ is a one-dimensional schedule If T is a matrix, then Θ is a multidimensional schedule Question: does it translate to sequential loops? UCLA 61

85 Affine Scheduling: Legal Program Transformation Definition (Precedence condition) Given Θ R a schedule for the instances of R, Θ S a schedule for the instances of S. Θ R and Θ S are legal schedules if x R, x S D R,S : denotes the lexicographic ordering. Θ R ( x R ) Θ S ( x S ) (a 1,...,a n ) (b 1,...,b m ) iff i, 1 i min(n,m) s.t. (a 1,...,a i 1 ) = (b 1,...,b i 1 ) and a i < b i UCLA 62

86 Affine Scheduling: Scheduling in the Polyhedral Model Constraints: The schedule must respect the precedence condition, for all dependent instances Dependence constraints can be turned into constraints on the solution set Scheduling: Among all possibilities, one has to be picked Optimal solution requires to consider all legal possible schedules Question: is this always true? UCLA 63

87 Affine Scheduling: One-Dimensional Affine Schedules For now, we focus on 1-d schedules Example for (i = 1; i < N; ++i) A[i] = A[i - 1] + A[i] + A[i + 1]; Simple program: 1 loop, 1 polyhedral statement 2 dependences: RAW: A[i] A[i - 1] WAR: A[i + 1] A[i] UCLA 64

88 Affine Scheduling: Checking the Legality of a Schedule Exercise: given the dependence polyhedra, check if a schedule is legal eq eq D 1 : i S i S D 2 : n i S i S n Θ = i 2 Θ = i UCLA 65

89 Affine Scheduling: Checking the Legality of a Schedule Exercise: given the dependence polyhedra, check if a schedule is legal eq eq D 1 : i S i S D 2 : n i S i S n Θ = i 2 Θ = i Solution: check for the emptiness of the polyhedron [ ] i S D P : i S i. i S S n 1 where: i S i S gets the consumer instances scheduled after the producer ones For Θ = i, it is i S i S, which is non-empty UCLA 65

90 Affine Scheduling: A (Naive) Scheduling Approach Pick a schedule for the program statements Check if it respects all dependences This is called filtering Limitations: How to use this in combination of an objective function? The density of legal 1-d affine schedules is low: matmult locality fir h264 crout i-bounds 1,1 1,1 0,1 1,1 3,3 c-bounds 1,1 1,1 0,3 0,4 3,3 #Sched #Legal UCLA 66

91 Affine Scheduling: Objectives for a Good Scheduling Algorithm Build a legal schedule! Embed some properties in this legal schedule latency: minimize the time of the last iteration delay: minimize the time between the first and last iteration parallelism / placement permutability (for tiling)... A possible "simple" two-step approach: Find the solution set of all legal affine schedules Find an ILP/PIP formulation for the objective function(s) UCLA 67

92 Affine Scheduling: The Precedence Constraint (Again!) Precedence constraint adapted to 1-d schedules: Definition (Causality condition for schedules) Given D R,S, Θ R and Θ S are legal iff for each pair of instances in dependence: Θ R ( x R ) < Θ S ( x S ) Equivalently: R,S = Θ S ( x S ) Θ R ( x R ) 1 0 All functions R,S which are non-negative over the dependence polyhedron represent legal schedules For the instances which are not in dependence, we don t care First step: how to get all non-negative functions over a polyhedron? UCLA 68

93 Affine Scheduling: Affine Form of the Farkas Lemma Lemma (Affine form of Farkas lemma) Let D be a nonempty polyhedron defined by A x + b 0. Then any affine function f ( x) is non-negative everywhere in D iff it is a positive affine combination: f ( x) = λ 0 + λ T (A x + b), with λ 0 0 and λ 0 λ 0 and λ T are called the Farkas multipliers. UCLA 69

94 Affine Scheduling: The Farkas Lemma: Example Function: f (x) = ax + b Domain of x: {1 x 3} x 1 0, x Farkas lemma: f (x) 0 f (x) = λ 0 + λ 1 (x 1) + λ 2 ( x + 3) The system to solve: λ 1 λ 2 = a λ 0 λ 1 + 3λ 2 = b λ 0 0 λ 1 0 λ 2 0 UCLA 70

95 Affine Scheduling: One-dimensional Schedules Example: Semantics Preservation (1-D) Affine Schedules Legal Distinct Schedules UCLA 71

96 Affine Scheduling: One-dimensional Schedules Example: Semantics Preservation (1-D) Affine Schedules Legal Distinct Schedules - Causality condition Property (Causality condition for schedules) Given RδS, Θ R and Θ S are legal iff for each pair of instances in dependence: Θ R ( x R ) < Θ S ( x S ) Equivalently: R,S = Θ S ( x S ) Θ R ( x R ) 1 0 UCLA 71

97 Affine Scheduling: One-dimensional Schedules Example: Semantics Preservation (1-D) Affine Schedules Legal Distinct Schedules - Causality condition - Farkas Lemma Lemma (Affine form of Farkas lemma) Let D be a nonempty polyhedron defined by A x + b 0. Then any affine function f ( x) is non-negative everywhere in D iff it is a positive affine combination: f ( x) = λ 0 + λ T (A x + b), with λ 0 0 and λ 0. λ 0 and λ T are called the Farkas multipliers. UCLA 71

98 Affine Scheduling: One-dimensional Schedules Example: Semantics Preservation (1-D) Affine Schedules Valid Farkas Multipliers Legal Distinct Schedules - Causality condition - Farkas Lemma UCLA 71

99 Affine Scheduling: One-dimensional Schedules Example: Semantics Preservation (1-D) Affine Schedules Valid Farkas Multipliers Many to one Legal Distinct Schedules - Causality condition - Farkas Lemma UCLA 71

100 Affine Scheduling: One-dimensional Schedules Example: Semantics Preservation (1-D) Affine Schedules Valid Farkas Multipliers Legal Distinct Schedules - Causality condition - Farkas Lemma - Identification ( ( ) ) Θ S ( x S ) Θ R ( x R ) 1 = λ 0 + λ T xr D R,S + d R,S 0 x S D RδS i R : λ D1,1 λ D1,2 + λ D1,3 λ D1,4 i S : λ D1,1 + λ D1,2 + λ D1,5 λ D1,6 j S : λ D1,7 λ D1,8 n : λ D1,4 + λ D1,6 + λ D1,8 1 : λ D1,0 UCLA 71

101 Affine Scheduling: One-dimensional Schedules Example: Semantics Preservation (1-D) Affine Schedules Valid Farkas Multipliers Legal Distinct Schedules - Causality condition - Farkas Lemma - Identification ( ( ) ) Θ S ( x S ) Θ R ( x R ) 1 = λ 0 + λ T xr D R,S + d R,S 0 x S D RδS i R : t 1R = λ D1,1 λ D1,2 + λ D1,3 λ D1,4 i S : t 1S = λ D1,1 + λ D1,2 + λ D1,5 λ D1,6 j S : t 2S = λ D1,7 λ D1,8 n : t 3S t 2R = λ D1,4 + λ D1,6 + λ D1,8 1 : t 4S t 3R 1 = λ D1,0 UCLA 71

102 Affine Scheduling: One-dimensional Schedules Example: Semantics Preservation (1-D) Affine Schedules Valid Farkas Multipliers Legal Distinct Schedules - Causality condition - Farkas Lemma - Identification - Projection Solve the constraint system Use (purpose-optimized) Fourier-Motzkin projection algorithm Reduce redundancy Detect implicit equalities UCLA 71

103 Affine Scheduling: One-dimensional Schedules Example: Semantics Preservation (1-D) Affine Schedules Valid Farkas Multipliers Valid Transformation Coefficients Legal Distinct Schedules - Causality condition - Farkas Lemma - Identification - Projection UCLA 71

104 Affine Scheduling: One-dimensional Schedules Example: Semantics Preservation (1-D) Affine Schedules Valid Farkas Multipliers Valid Transformation Coefficients Bijection Legal Distinct Schedules - Causality condition - Farkas Lemma - Identification - Projection One point in the space one set of legal schedules w.r.t. the dependences UCLA 71

105 Affine Scheduling: One-dimensional Schedules Scheduling Algorithm for Multiple Dependences Algorithm Compute the schedule constraints for each dependence Intersect all sets of constraints Output is a convex solution set of all legal one-dimensional schedules Computation is fast, but requires eliminating variables in a system of inequalities: projection Can be computed as soon as the dependence polyhedra are known UCLA 72

106 Affine Scheduling: One-dimensional Schedules Objective Function Idea: bound the latency of the schedule and minimize this bound Theorem (Schedule latency bound) If all domains are bounded, and if there exists at least one 1-d schedule Θ, then there exists at least one affine form in the structure parameters: L = u. n + w such that: x R, L Θ R ( x R ) Objective function: min{ u,w u. n + w Θ 0} Subject to Θ is a legal schedule, and θ i 0 In many cases, it is equivalent to take the lexicosmallest point in the polytope of non-negative legal schedules UCLA 73

107 Affine Scheduling: One-dimensional Schedules Example min{ u,w u. n + w Θ 0} : Θ R = 0, Θ S = k + 1 Example parfor (i = 0; i < N; ++i) parfor (j = 0; j < N; ++j) C[i][j] = 0; for (k = 1; k < N + 1; ++k) parfor (i = 0; i < N; ++i) parfor (j = 0; j < N; ++j) C[i][j] += A[i][k-1] + B[k-1][j]; UCLA 74

108 Affine Scheduling: One-dimensional Schedules Limitations of One-dimensional Schedules Not all programs have a legal one-dimensional schedule Question: does this program have a 1-d schedule? Example for (i = 1; i < N - 1; ++i) for (j = 1; j < N - 1; ++j) A[i][j] = A[i-1][j-1] + A[i+1][j] + A[i][j+1]; Not all compositions of transformation are possible Interchange in inner-loops Fusion / distribution of inner-loops UCLA 75

109 Affine Scheduling: One-dimensional Schedules Multidimensional Scheduling UCLA 76

110 Affine Scheduling: Multidimensional Scheduling Multidimensional Scheduling Some program does not have a legal 1-d schedule It means, it s not possible to enforce the precedence condition for all Example dependences for (i = 0; i < N; ++i) for (j = 0; j < N; ++j) s += s; Intuition: multidimensional time means nested time loops The precedence constraint needs to be adapted to multidimensional time UCLA 77

111 Affine Scheduling: Multidimensional Scheduling Dependence Satisfaction Definition (Strong dependence satisfaction) Given D R,S, the dependence is strongly satisfied at schedule level k if x R, x S D R,S, Θ S k ( x S) Θ R k ( x R) 1 Definition (Weak dependence satisfaction) Given D R,S, the dependence is weakly satisfied at dimension k if x R, x S D R,S, Θ S k ( x S) Θ R k ( x R ) 0 x R, x S D R,S, Θ S k ( x S) = Θ R k ( x R ) UCLA 78

112 Affine Scheduling: Multidimensional Scheduling Program Legality and Existence Results All dependence must be strongly satisfied for the program to be correct Once a dependence is strongly satisfied at level k, it does not contribute to the constraints of level k + i Unlike with 1-d schedules, it is always possible to build a legal multidimensional schedule for a SCoP [Feautrier] Theorem (Existence of an affine schedule) Every static control program has a multdidimensional affine schedule UCLA 79

113 Affine Scheduling: Multidimensional Scheduling Reformulation of the Precedence Condition We introduce variable δ D R,S 1 to model the dependence satisfaction Considering the first row of the scheduling matrices, to preserve the precedence relation we have: D R,S, x R, x S D R,S, Θ S 1 ( x S) Θ R 1 ( x R) δ D R,S 1 δ D R,S 1 {0,1} Lemma (Semantics-preserving affine schedules) Given a set of affine schedules Θ R,Θ S... of dimension m, the program semantics is preserved if: D R,S, p {1,...,m}, δ D R,S p = 1 j < p, δ D R,S j = 0 j p, x R, x S D R,S, Θ S p( x S ) Θ R p ( x R ) δ D R,S j UCLA 80

114 Affine Scheduling: Multidimensional Scheduling Space of All Affine Schedules Objective: Design an ILP which operates on all scheduling coefficients Easier optimality reasoning: the space contains all schedules (hence necesarily the optimal one) Examples: maximal fusion, maximal coarse-grain parallelism, best idea: locality, etc. Combine all coefficients of all rows of the scheduling function into a single solution set Find a convex encoding for the lexicopositivity of dependence satisfaction A dependence must be weakly satisfied until it is strongly satisfied Once it is strongly satisfied, it must not constrain subsequent levels UCLA 81

115 Affine Scheduling: Multidimensional Scheduling Schedule Lower Bound Idea: Bound the schedule latency with a lower bound which does not prevent to find all solutions Intuitively: Θ S ( x S ) Θ R ( x R ) δ if the dependence has not been strongly satisfied Θ S ( x S ) Θ R ( x R ) if it has Lemma (Schedule lower bound) Given Θ R k, ΘS k such that each coefficient value is bounded in [x,y]. Then there exists K Z such that: max ( Θ S k ( x S) Θ R k ( x R ) ) > K. n K UCLA 82

116 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Space of Semantics-Preserving Affine Schedules All unique bounded affine multidimensional schedules All unique semantics-preserving affine multidimensional schedules 1 point 1 unique semantically equivalent program (up to affine iteration reordering) UCLA 83

117 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Semantics Preservation Definition (Causality condition) Given Θ R a schedule for the instances of R, Θ S a schedule for the instances of S. Θ R and Θ S preserve the dependence D R,S if x R, x S D R,S : Θ R ( x R ) Θ S ( x S ) denotes the lexicographic ordering. (a 1,...,a n ) (b 1,...,b m ) iff i, 1 i min(n,m) s.t. (a 1,...,a i 1 ) = (b 1,...,b i 1 ) and a i < b i UCLA 84

118 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Lexico-positivity of Dependence Satisfaction Θ R ( x R ) Θ S ( x S ) is equivalently written Θ S ( x S ) Θ R ( x R ) 0 UCLA 85

119 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Lexico-positivity of Dependence Satisfaction Θ R ( x R ) Θ S ( x S ) is equivalently written Θ S ( x S ) Θ R ( x R ) 0 Considering the row p of the scheduling matrices: Θ S p( x S ) Θ R p ( x R ) δ p UCLA 85

120 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Lexico-positivity of Dependence Satisfaction Θ R ( x R ) Θ S ( x S ) is equivalently written Θ S ( x S ) Θ R ( x R ) 0 Considering the row p of the scheduling matrices: Θ S p( x S ) Θ R p ( x R ) δ p δp 1 implies no constraints on δ k, k > p δp 0 is required if k < p, δ k 1 UCLA 85

121 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Lexico-positivity of Dependence Satisfaction Θ R ( x R ) Θ S ( x S ) is equivalently written Θ S ( x S ) Θ R ( x R ) 0 Considering the row p of the scheduling matrices: Θ S p( x S ) Θ R p ( x R ) δ p δp 1 implies no constraints on δ k, k > p δp 0 is required if k < p, δ k 1 Schedule lower bound: Lemma (Schedule lower bound) Given Θ R k, ΘS k such that each coefficient value is bounded in [x,y]. Then there exists K Z such that: x R, x S, Θ S k ( x S) Θ R k ( x R ) > K. n K UCLA 85

122 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Convex Form of All Bounded Affine Schedules Lemma (Convex form of semantics-preserving affine schedules) Given a set of affine schedules Θ R,Θ S... of dimension m, the program semantics is preserved if the three following conditions hold: (i) D R,S, δ D R,S p {0,1} (ii) m D R,S, δ D R,S p = 1 p=1 (iii) D R,S, p {1,...,m}, x R, x S D R,S, UCLA 86

123 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Convex Form of All Bounded Affine Schedules Lemma (Convex form of semantics-preserving affine schedules) Given a set of affine schedules Θ R,Θ S... of dimension m, the program semantics is preserved if the three following conditions hold: (i) D R,S, δ D R,S p {0,1} (ii) m D R,S, δ D R,S p = 1 p=1 (iii) D R,S, p {1,...,m}, x R, x S D R,S, UCLA 86

124 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Convex Form of All Bounded Affine Schedules Lemma (Convex form of semantics-preserving affine schedules) Given a set of affine schedules Θ R,Θ S... of dimension m, the program semantics is preserved if the three following conditions hold: (i) D R,S, δ D R,S p {0,1} (ii) m D R,S, δ D R,S p = 1 p=1 (iii) D R,S, p {1,...,m}, x R, x S D R,S, Θ S p( x S ) Θ R p ( x R ) δ D R,S p UCLA 86

125 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Convex Form of All Bounded Affine Schedules Lemma (Convex form of semantics-preserving affine schedules) Given a set of affine schedules Θ R,Θ S... of dimension m, the program semantics is preserved if the three following conditions hold: (i) D R,S, δ D R,S p {0,1} (ii) m D R,S, δ D R,S p = 1 p=1 (iii) D R,S, p {1,...,m}, x R, x S D R,S, Θ S p( x S ) Θ R p ( x R ) δ D R,S p p 1 k=1 δ D R,S k.(k. n + K) UCLA 86

126 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Convex Form of All Bounded Affine Schedules Lemma (Convex form of semantics-preserving affine schedules) Given a set of affine schedules Θ R,Θ S... of dimension m, the program semantics is preserved if the three following conditions hold: (i) D R,S, δ D R,S p {0,1} (ii) m D R,S, δ D R,S p = 1 p=1 (iii) D R,S, p {1,...,m}, x R, x S D R,S, Θ S p( x S ) Θ R p ( x R ) δ D R,S p 1 p + k=1 δ D R,S k.(k. n + K) 0 Use Farkas lemma to build all non-negative functions over a polyhedron (here, the dependence polyhedra) [Feautrier,92] UCLA 86

127 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Convex Form of All Bounded Affine Schedules Lemma (Convex form of semantics-preserving affine schedules) Given a set of affine schedules Θ R,Θ S... of dimension m, the program semantics is preserved if the three following conditions hold: (i) D R,S, δ D R,S p {0,1} (ii) m D R,S, δ D R,S p = 1 p=1 (iii) D R,S, p {1,...,m}, x R, x S D R,S, Θ S p( x S ) Θ R p ( x R ) δ D R,S p 1 p + k=1 δ D R,S k.(k. n + K) 0 Use Farkas lemma to build all non-negative functions over a polyhedron (here, the dependence polyhedra) [Feautrier,92] Bounded coefficients required [Vasilache,07] UCLA 86

128 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Maximal Fine-Grain Parallelism Objectives: Have as few dimensions as possible carrying a dependence For dimension k p..1: min δ D R,S k D R,S We use lexicographic optimization UCLA 87

129 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Key Observations Is all of this really necessary? We have encoded one objective per row of Θ Question: do we need to solve this large ILP/LP? UCLA 88

130 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Feautrier s Greedy Algorithm Main idea: 1 Start at row 1 of Θ 2 Build the set of legal one-dimensional schedules 3 Maximize the number of dependences strongly solved (maxδ i ) 4 Remove strongly solved dependences from P 5 Goto 1 This is a row-by-row decomposition of the scheduling problem UCLA 89

131 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Key Properties of Feautrier s Algorithm It terminates It finds "optimal" fine-grain parallelism Granularity of dependence satisfaction: all-or-nothing Example for (i = 0; i < 2 * N; ++i) A[i] = A[2 * N - i]; UCLA 90

132 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Key Observations Is all of this really necessary? Question: do we need to consider all statements at once? Insight: the PDG gives structural information about dependences Decomposition of the PDG into strongly-connected components UCLA 91

133 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Feautrier s Scheduler UCLA 92

134 Affine Scheduling: Space of Semantics-Preserving Affine Schedules More Observations Some problems may be decomposed without loss of optimality The PDG gives extra information about further problem decomposition Still, is all of this really necessary? Question: can we use additional knowledge about dependences? Uniform, non-uniform and parametric dependences UCLA 93

135 Affine Scheduling: Space of Semantics-Preserving Affine Schedules Cost Functions UCLA 94

136 Scheduling for Performance: Objectives for Good Scheduling Fine-grain parallelism is nice, but... It has little connection with modern SIMD parallelism No information about the quality of the generated code Ignores all the important performance objectives: Data locality / TLB / Cache consideration Multi-core parallelism (sync-free, with barrier) SIMD vectorization Question: how to find a FAST schedule for a modern processor? UCLA 95

137 Scheduling for Performance: Performance Distribution for 1-D Schedules [1/2] 2e+09 matmult 4e+09 locality 1.8e e e+09 3e+09 Cycles 1.4e e+09 Cycles 2.5e+09 2e+09 1e+09 original 1.5e+09 8e+08 6e Transformation identifier 1e+09 original 5e Transformation identifier Figure : Performance distribution for matmult and locality UCLA 96

138 Scheduling for Performance: Performance Distribution for 1-D Schedules [2/2] 1.42e+09 crout 1.34e+09 crout 1.4e e e e e e+09 Cycles 1.34e e+09 Cycles 1.3e e e e+09 original 1.28e e+09 original 1.26e Transformation identifier 1.26e Transformation identifier (a) GCC -O3 (b) ICC -fast Figure : The effect of the compiler UCLA 97

139 Scheduling for Performance: Quantitative Analysis: The Hypothesis Extremely large generated spaces: > points we must leverage static and dynamic characteristics to build traversal mechanisms Hypothesis: It is possible to statically order the impact on performance of transformation coefficients, that is, decompose the search space in subspaces where the performance variation is maximal or reduced First rows of Θ are more performance impacting than the last ones UCLA 98

140 Scheduling for Performance: Observations on the Performance Distribution Performance distribution - 8x8 DCT Performance improvement Best Average Worst for (i = 0; i < M; i++) for (j = 0; j < M; j++) { tmp[i][j] = 0.0; for (k = 0; k < M; k++) tmp[i][j] += block[i][k] * cos1[j][k]; } for (i = 0; i < M; i++) for (j = 0; j < M; j++) { sum2 = 0.0; for (k = 0; k < M; k++) sum2 += cos1[i][k] * tmp[k][j]; block[i][j] = ROUND(sum2); } Point index for the first schedule row Extensive study of 8x8 Discrete Cosine Transform (UTDSP) Search space analyzed: = different legal program versions UCLA 99

141 Scheduling for Performance: Observations on the Performance Distribution Performance improvement Performance distribution - 8x8 DCT Best Average Worst Θ : Point index for the first schedule row Extensive study of 8x8 Discrete Cosine Transform (UTDSP) Search space analyzed: = different legal program versions UCLA 99

142 Scheduling for Performance: Observations on the Performance Distribution Performance distribution - 8x8 DCT Best Average Worst best average worst Performance improvement Point index for the first schedule row Take one specific value for the first row Try the possible values for the second row UCLA 99

143 Scheduling for Performance: Observations on the Performance Distribution Performance distribution - 8x8 DCT Performance distribution (sorted) - 8x8 DCT Best Average Worst Performance improvement Point index for the first schedule row Point index of the second schedule dimension, first one fixed Take one specific value for the first row Try the possible values for the second row Very low proportion of best points: < 0.02% UCLA 99

144 Scheduling for Performance: Observations on the Performance Distribution Performance distribution - 8x8 DCT Best Average Worst Large performance variation Performance improvement Point index for the first schedule row Performance variation is large for good values of the first row UCLA 99

145 Scheduling for Performance: Observations on the Performance Distribution Performance distribution - 8x8 DCT Best Average Worst Small performance variation Performance improvement Point index for the first schedule row Performance variation is large for good values of the first row It is usually reduced for bad values of the first row UCLA 99

146 Scheduling for Performance: Scanning The Space of Program Versions The search space: Performance variation indicates to partition the space: ı > p > c Non-uniform distribution of performance No clear analytical property of the optimization function Build dedicated heuristic and genetic operators aware of these static and dynamic characteristics UCLA 100

147 Scheduling for Performance: The Quest for Good Objective Functions For data locality, loop tiling is key But what is the cost of tiling? Is tiling the only criterion? For coarse-grain parallelism, doall parallelization is key But what is the cost of parallelization? UCLA 101

148 Scheduling for Performance: Dependence Distance Minimization Idea: minimize the delay between instances accessing the same data Formulation in the polyhedral model: Expression of the delay through parametric form Use all dependences (including RAR) Definition (Dependence distance minimization) u k. n + w k Θ S ( x S ) Θ R ( x R ) x R, x S D R,S (1) u k N p,w k N UCLA 102

149 Scheduling for Performance: Key Observations Minimizing d = u k. n + w k minimize the dependence distance When d = 0 then Θ R k ( x R) = Θ S k ( x S) 0 Θ R k ( x R ) Θ S k ( x S) 0 d gives an indication of the communication volume between hyperplanes UCLA 103

150 Scheduling for Performance: Tiling An Overview of Tiling Tiling: partition the computation into atomic blocs Early work in the late 80 s Motivation: data locality improvement + parallelization UCLA 104

151 Scheduling for Performance: Tiling An Overview of Tiling Tiling the iteration space It must be valid (dependence analysis required) It may require pre-transformation Unimodular transformation framework limitations Supported in current compilers, but limited applicability Challenges: imperfectly nested loops, parametric loops, pre-transformations, tile shape,... Tile size selection Critical for locality concerns: determines the footprint Empirical search of the best size (problem + machine specific) Parametric tiling makes the generated code valid for any tile size UCLA 105

152 Scheduling for Performance: The Tiling Hyperplane Method Tiling in the Polyhedral Model Tiling partition the computation into blocks Note we consider only rectangular tiling here For tiling to be legal, such a partitioning must be legal UCLA 106

153 Scheduling for Performance: The Tiling Hyperplane Method Key Ideas of the Tiling Hyperplane Algorithm Affine transformations for communication minimal parallelization and locality optimization of arbitrarily nested loop sequences [Bondhugula et al, CC 08 & PLDI 08] Compute a set of transformations to make loops tilable Try to minimize synchronizations Try to maximize locality (maximal fusion) Result is a set of permutable loops, if possible Strip-mining / tiling can be applied Tiles may be sync-free parallel or pipeline parallel Algorithm always terminates (possibly by splitting loops/statements) UCLA 107

154 Scheduling for Performance: The Tiling Hyperplane Method Legality of Tiling Theorem (Legality of Tiling) Given Θ R k,θs k two one-dimensional schedules. They are valid tiling hyperplanes if D R,S, x R, x S D R,S, Θ S k ( x S) Θ R k ( x R ) 0 For a schedule to be a legal tiling hyperplane, all communications must go forward: Forward Communication Only [Griebl] All dependences must be considered at each level, including the previously strongly satisfied Equivalence between loop permutability and loop tilability UCLA 108

155 Scheduling for Performance: The Tiling Hyperplane Method Greedy Algorithm for Tiling Hyperplane Computation 1 Start from the outer-most level, find the set of FCO schedules 2 Select one which minimize the distance between dependent iterations 3 Mark dependences strongly satisfied by this schedule, but do not remove them 4 Formulate the problem for the next level (FCO), adding orthogonality constraints (linear independence) 5 solve again, etc. Special treatment when no permutable band can be found: splitting A few properties: Result is a set of permutable/tilable outer loops, when possible It exhibits coarse-grain parallelism Maximal fusion achieved to improve locality UCLA 109

156 Scheduling for Performance: The Tiling Hyperplane Method Example: 1D-Jacobi 1-D Jacobi (imperfectly nested) for (t=1; t<m; t++) { for (i=2; i<n!1; i++) { S: b[i] = 0.333*(a[i!1]+a[i]+a[i+1]); } for (j=2; j<n!1; j++) { T: a[j] = b[j]; } } Louisiana State University 1 The Ohio State University UCLA 110

157 Scheduling for Performance: The Tiling Hyperplane Method Example: 1D-Jacobi 1-D Jacobi (imperfectly nested) The resulting transformation is equivalent to a constant shift of one for T relative to S, fusion (j and i are named the same as a result), and skewing the fused i loop with respect to the t loop by a factor of two. The (1,0) hyperplane has the least communication: no dependence crosses more than one hyperplane instance along it. Louisiana State University 2 The Ohio State University UCLA 111

158 Scheduling for Performance: The Tiling Hyperplane Method Example: 1D-Jacobi Transforming S i i! Louisiana State University 3 t t! The Ohio State University UCLA 112

159 Scheduling for Performance: The Tiling Hyperplane Method Example: 1D-Jacobi Transforming T j j! Louisiana State University 4 t t! The Ohio State University UCLA 113

160 Scheduling for Performance: The Tiling Hyperplane Method Example: 1D-Jacobi Interleaving S and T i! j! t! t! Louisiana State University 5 The Ohio State University UCLA 114

161 Scheduling for Performance: The Tiling Hyperplane Method Example: 1D-Jacobi Interleaving S and T t Louisiana State University 6 The Ohio State University UCLA 115

162 Scheduling for Performance: The Tiling Hyperplane Method Example: 1D-Jacobi 1-D Jacobi (imperfectly nested) transformed code for (t0=0;t0<=m-1;t0++) { S : b[2]=0.333*(a[2-1]+a[2]+a[2+1]); for (t1=2*t0+3;t1<=2*t0+n-2;t1++) { S: b[-2*t0+t1]=0.333*(a[-2*t0+t1-1]+a[-2*t0+t1] +a[-2*t0+t1+1]); T: a[-2*t0+t1-1]=b[-2*t0+t1-1]; } T : a[n-2]=b[n-2]; } Louisiana State University 7 The Ohio State University UCLA 116

163 Scheduling for Performance: The Tiling Hyperplane Method Example: 1D-Jacobi 1-D Jacobi (imperfectly nested) transformed code for (t0=0;t0<=m-1;t0++) { S : b[2]=0.333*(a[2-1]+a[2]+a[2+1]); for (t1=2*t0+3;t1<=2*t0+n-2;t1++) { S: b[-2*t0+t1]=0.333*(a[-2*t0+t1-1]+a[-2*t0+t1] +a[-2*t0+t1+1]); T: a[-2*t0+t1-1]=b[-2*t0+t1-1]; } T : a[n-2]=b[n-2]; }!!!!!!! Louisiana State University 8 The Ohio State University UCLA 117

CS671 Parallel Programming in the Many-Core Era

CS671 Parallel Programming in the Many-Core Era 1 CS671 Parallel Programming in the Many-Core Era Polyhedral Framework for Compilation: Polyhedral Model Representation, Data Dependence Analysis, Scheduling and Data Locality Optimizations December 3,

More information

Iterative Optimization in the Polyhedral Model: Part I, One-Dimensional Time

Iterative Optimization in the Polyhedral Model: Part I, One-Dimensional Time Iterative Optimization in the Polyhedral Model: Part I, One-Dimensional Time Louis-Noël Pouchet, Cédric Bastoul, Albert Cohen and Nicolas Vasilache ALCHEMY, INRIA Futurs / University of Paris-Sud XI March

More information

Polyhedral Compilation Foundations

Polyhedral Compilation Foundations Polyhedral Compilation Foundations Louis-Noël Pouchet pouchet@cse.ohio-state.edu Dept. of Computer Science and Engineering, the Ohio State University Feb 22, 2010 888.11, Class #5 Introduction: Polyhedral

More information

Polyhedral Compilation Foundations

Polyhedral Compilation Foundations Polyhedral Compilation Foundations Louis-Noël Pouchet pouchet@cse.ohio-state.edu Dept. of Computer Science and Engineering, the Ohio State University Feb 15, 2010 888.11, Class #4 Introduction: Polyhedral

More information

The Polyhedral Compilation Framework

The Polyhedral Compilation Framework The Polyhedral Compilation Framework Louis-Noël Pouchet Dept. of Computer Science and Engineering Ohio State University pouchet@cse.ohio-state.edu October 20, 2011 Introduction: Overview of Today s Lecture

More information

Tiling: A Data Locality Optimizing Algorithm

Tiling: A Data Locality Optimizing Algorithm Tiling: A Data Locality Optimizing Algorithm Previously Performance analysis of existing codes Data dependence analysis for detecting parallelism Specifying transformations using frameworks Today Usefulness

More information

Iterative Optimization in the Polyhedral Model

Iterative Optimization in the Polyhedral Model Iterative Optimization in the Polyhedral Model Louis-Noël Pouchet, INRIA Saclay / University of Paris-Sud 11, France January 18th, 2010 Ph.D Defense Introduction: A Brief History... A Quick look backward:

More information

The Polyhedral Model (Transformations)

The Polyhedral Model (Transformations) The Polyhedral Model (Transformations) Announcements HW4 is due Wednesday February 22 th Project proposal is due NEXT Friday, extension so that example with tool is possible (see resources website for

More information

Computing and Informatics, Vol. 36, 2017, , doi: /cai

Computing and Informatics, Vol. 36, 2017, , doi: /cai Computing and Informatics, Vol. 36, 2017, 566 596, doi: 10.4149/cai 2017 3 566 NESTED-LOOPS TILING FOR PARALLELIZATION AND LOCALITY OPTIMIZATION Saeed Parsa, Mohammad Hamzei Department of Computer Engineering

More information

Automatic Polyhedral Optimization of Stencil Codes

Automatic Polyhedral Optimization of Stencil Codes Automatic Polyhedral Optimization of Stencil Codes ExaStencils 2014 Stefan Kronawitter Armin Größlinger Christian Lengauer 31.03.2014 The Need for Different Optimizations 3D 1st-grade Jacobi smoother Speedup

More information

Louis-Noël Pouchet

Louis-Noël Pouchet Internship Report, jointly EPITA CSI 2006 and UPS Master 2 (March - September 2006) Louis-Noël Pouchet WHEN ITERATIVE OPTIMIZATION MEETS THE POLYHEDRAL MODEL: ONE-DIMENSIONAL

More information

Legal and impossible dependences

Legal and impossible dependences Transformations and Dependences 1 operations, column Fourier-Motzkin elimination us use these tools to determine (i) legality of permutation and Let generation of transformed code. (ii) Recall: Polyhedral

More information

Polyhedral-Based Data Reuse Optimization for Configurable Computing

Polyhedral-Based Data Reuse Optimization for Configurable Computing Polyhedral-Based Data Reuse Optimization for Configurable Computing Louis-Noël Pouchet 1 Peng Zhang 1 P. Sadayappan 2 Jason Cong 1 1 University of California, Los Angeles 2 The Ohio State University February

More information

Loop Nest Optimizer of GCC. Sebastian Pop. Avgust, 2006

Loop Nest Optimizer of GCC. Sebastian Pop. Avgust, 2006 Loop Nest Optimizer of GCC CRI / Ecole des mines de Paris Avgust, 26 Architecture of GCC and Loop Nest Optimizer C C++ Java F95 Ada GENERIC GIMPLE Analyses aliasing data dependences number of iterations

More information

PolyOpt/C. A Polyhedral Optimizer for the ROSE compiler Edition 0.2, for PolyOpt/C March 12th Louis-Noël Pouchet

PolyOpt/C. A Polyhedral Optimizer for the ROSE compiler Edition 0.2, for PolyOpt/C March 12th Louis-Noël Pouchet PolyOpt/C A Polyhedral Optimizer for the ROSE compiler Edition 0.2, for PolyOpt/C 0.2.1 March 12th 2012 Louis-Noël Pouchet This manual is dedicated to PolyOpt/C version 0.2.1, a framework for Polyhedral

More information

Compiling for Advanced Architectures

Compiling for Advanced Architectures Compiling for Advanced Architectures In this lecture, we will concentrate on compilation issues for compiling scientific codes Typically, scientific codes Use arrays as their main data structures Have

More information

Polyhedral Operations. Algorithms needed for automation. Logistics

Polyhedral Operations. Algorithms needed for automation. Logistics Polyhedral Operations Logistics Intermediate reports late deadline is Friday March 30 at midnight HW6 (posted) and HW7 (posted) due April 5 th Tuesday April 4 th, help session during class with Manaf,

More information

Parametric Multi-Level Tiling of Imperfectly Nested Loops*

Parametric Multi-Level Tiling of Imperfectly Nested Loops* Parametric Multi-Level Tiling of Imperfectly Nested Loops* Albert Hartono 1, Cedric Bastoul 2,3 Sriram Krishnamoorthy 4 J. Ramanujam 6 Muthu Baskaran 1 Albert Cohen 2 Boyana Norris 5 P. Sadayappan 1 1

More information

Linear Loop Transformations for Locality Enhancement

Linear Loop Transformations for Locality Enhancement Linear Loop Transformations for Locality Enhancement 1 Story so far Cache performance can be improved by tiling and permutation Permutation of perfectly nested loop can be modeled as a linear transformation

More information

Integer Programming Theory

Integer Programming Theory Integer Programming Theory Laura Galli October 24, 2016 In the following we assume all functions are linear, hence we often drop the term linear. In discrete optimization, we seek to find a solution x

More information

An Overview to. Polyhedral Model. Fangzhou Jiao

An Overview to. Polyhedral Model. Fangzhou Jiao An Overview to Polyhedral Model Fangzhou Jiao Polyhedral Model A framework for performing loop transformation Loop representation: using polytopes to achieve fine-grain representation of program Loop transformation:

More information

Iterative Optimization in the Polyhedral Model: Part II, Multidimensional Time

Iterative Optimization in the Polyhedral Model: Part II, Multidimensional Time Iterative Optimization in the Polyhedral Model: Part II, Multidimensional Time Louis-Noël Pouchet 1,3, Cédric Bastoul 1,3, Albert Cohen 1,3 and John Cavazos 2,3 1 ALCHEMY Group, INRIA Futurs and Paris-Sud

More information

UNIVERSITÉ DE PARIS-SUD 11 U.F.R. SCIENTIFIQUE D ORSAY

UNIVERSITÉ DE PARIS-SUD 11 U.F.R. SCIENTIFIQUE D ORSAY UNIVERSITÉ DE PARIS-SUD 11 U.F.R. SCIENTIFIQUE D ORSAY In Partial Fulfillment of the Requirements for the Degree of DOCTOR OF PHILOSOPHY Specialty: Computer Science Louis-Noël POUCHET Subject: ITERATIVE

More information

Loop Transformations: Convexity, Pruning and Optimization

Loop Transformations: Convexity, Pruning and Optimization Loop Transformations: Convexity, Pruning and Optimization Louis-Noël Pouchet The Ohio State University pouchet@cse.ohio-state.edu Uday Bondhugula IBM T.J. Watson Research Center ubondhug@us.ibm.com Cédric

More information

Convex Geometry arising in Optimization

Convex Geometry arising in Optimization Convex Geometry arising in Optimization Jesús A. De Loera University of California, Davis Berlin Mathematical School Summer 2015 WHAT IS THIS COURSE ABOUT? Combinatorial Convexity and Optimization PLAN

More information

PLuTo: A Practical and Fully Automatic Polyhedral Program Optimization System

PLuTo: A Practical and Fully Automatic Polyhedral Program Optimization System PLuTo: A Practical and Fully Automatic Polyhedral Program Optimization System Uday Bondhugula J. Ramanujam P. Sadayappan Dept. of Computer Science and Engineering Dept. of Electrical & Computer Engg. and

More information

A Crash Course in Compilers for Parallel Computing. Mary Hall Fall, L2: Transforms, Reuse, Locality

A Crash Course in Compilers for Parallel Computing. Mary Hall Fall, L2: Transforms, Reuse, Locality A Crash Course in Compilers for Parallel Computing Mary Hall Fall, 2008 1 Overview of Crash Course L1: Data Dependence Analysis and Parallelization (Oct. 30) L2 & L3: Loop Reordering Transformations, Reuse

More information

A polyhedral loop transformation framework for parallelization and tuning

A polyhedral loop transformation framework for parallelization and tuning A polyhedral loop transformation framework for parallelization and tuning Ohio State University Uday Bondhugula, Muthu Baskaran, Albert Hartono, Sriram Krishnamoorthy, P. Sadayappan Argonne National Laboratory

More information

Review. Loop Fusion Example

Review. Loop Fusion Example Review Distance vectors Concisely represent dependences in loops (i.e., in iteration spaces) Dictate what transformations are legal e.g., Permutation and parallelization Legality A dependence vector is

More information

Combinatorial Geometry & Topology arising in Game Theory and Optimization

Combinatorial Geometry & Topology arising in Game Theory and Optimization Combinatorial Geometry & Topology arising in Game Theory and Optimization Jesús A. De Loera University of California, Davis LAST EPISODE... We discuss the content of the course... Convex Sets A set is

More information

Loop Transformations: Convexity, Pruning and Optimization

Loop Transformations: Convexity, Pruning and Optimization Loop Transformations: Convexity, Pruning and Optimization Louis-Noël Pouchet, Uday Bondhugula, Cédric Bastoul, Albert Cohen, Jagannathan Ramanujam, Ponnuswamy Sadayappan, Nicolas Vasilache To cite this

More information

The Challenges of Non-linear Parameters and Variables in Automatic Loop Parallelisation

The Challenges of Non-linear Parameters and Variables in Automatic Loop Parallelisation The Challenges of Non-linear Parameters and Variables in Automatic Loop Parallelisation Armin Größlinger December 2, 2009 Rigorosum Fakultät für Informatik und Mathematik Universität Passau Automatic Loop

More information

Math 5593 Linear Programming Lecture Notes

Math 5593 Linear Programming Lecture Notes Math 5593 Linear Programming Lecture Notes Unit II: Theory & Foundations (Convex Analysis) University of Colorado Denver, Fall 2013 Topics 1 Convex Sets 1 1.1 Basic Properties (Luenberger-Ye Appendix B.1).........................

More information

Advanced Operations Research Techniques IE316. Quiz 2 Review. Dr. Ted Ralphs

Advanced Operations Research Techniques IE316. Quiz 2 Review. Dr. Ted Ralphs Advanced Operations Research Techniques IE316 Quiz 2 Review Dr. Ted Ralphs IE316 Quiz 2 Review 1 Reading for The Quiz Material covered in detail in lecture Bertsimas 4.1-4.5, 4.8, 5.1-5.5, 6.1-6.3 Material

More information

Polyhedral Compilation Foundations

Polyhedral Compilation Foundations Polyhedral Complaton Foundatons Lous-Noël Pouchet pouchet@cse.oho-state.edu Dept. of Computer Scence and Engneerng, the Oho State Unversty Feb 8, 200 888., Class # Introducton: Polyhedral Complaton Foundatons

More information

Autotuning. John Cavazos. University of Delaware UNIVERSITY OF DELAWARE COMPUTER & INFORMATION SCIENCES DEPARTMENT

Autotuning. John Cavazos. University of Delaware UNIVERSITY OF DELAWARE COMPUTER & INFORMATION SCIENCES DEPARTMENT Autotuning John Cavazos University of Delaware What is Autotuning? Searching for the best code parameters, code transformations, system configuration settings, etc. Search can be Quasi-intelligent: genetic

More information

The Polyhedral Model Is More Widely Applicable Than You Think

The Polyhedral Model Is More Widely Applicable Than You Think The Polyhedral Model Is More Widely Applicable Than You Think Mohamed-Walid Benabderrahmane 1 Louis-Noël Pouchet 1,2 Albert Cohen 1 Cédric Bastoul 1 1 ALCHEMY group, INRIA Saclay / University of Paris-Sud

More information

Polyhedral Search Space Exploration in the ExaStencils Code Generator

Polyhedral Search Space Exploration in the ExaStencils Code Generator Preprint version before issue assignment Polyhedral Search Space Exploration in the ExaStencils Code Generator STEFAN KRONAWITTER and CHRISTIAN LENGAUER, University of Passau, Germany Performance optimization

More information

Advanced Operations Research Techniques IE316. Quiz 1 Review. Dr. Ted Ralphs

Advanced Operations Research Techniques IE316. Quiz 1 Review. Dr. Ted Ralphs Advanced Operations Research Techniques IE316 Quiz 1 Review Dr. Ted Ralphs IE316 Quiz 1 Review 1 Reading for The Quiz Material covered in detail in lecture. 1.1, 1.4, 2.1-2.6, 3.1-3.3, 3.5 Background material

More information

Polly Polyhedral Optimizations for LLVM

Polly Polyhedral Optimizations for LLVM Polly Polyhedral Optimizations for LLVM Tobias Grosser - Hongbin Zheng - Raghesh Aloor Andreas Simbürger - Armin Grösslinger - Louis-Noël Pouchet April 03, 2011 Polly - Polyhedral Optimizations for LLVM

More information

Loop Transformations, Dependences, and Parallelization

Loop Transformations, Dependences, and Parallelization Loop Transformations, Dependences, and Parallelization Announcements HW3 is due Wednesday February 15th Today HW3 intro Unimodular framework rehash with edits Skewing Smith-Waterman (the fix is in!), composing

More information

UNIVERSITÉ DE PARIS-SUD 11 U.F.R. SCIENTIFIQUE D ORSAY

UNIVERSITÉ DE PARIS-SUD 11 U.F.R. SCIENTIFIQUE D ORSAY UNIVERSITÉ DE PARIS-SUD 11 U.F.R. SCIENTIFIQUE D ORSAY In Partial Fulfillment of the Requirements for the Degree of DOCTOR OF PHILOSOPHY Specialty: Computer Science Louis-Noël POUCHET Subject: ITERATIVE

More information

Mathematical and Algorithmic Foundations Linear Programming and Matchings

Mathematical and Algorithmic Foundations Linear Programming and Matchings Adavnced Algorithms Lectures Mathematical and Algorithmic Foundations Linear Programming and Matchings Paul G. Spirakis Department of Computer Science University of Patras and Liverpool Paul G. Spirakis

More information

Tiling: A Data Locality Optimizing Algorithm

Tiling: A Data Locality Optimizing Algorithm Tiling: A Data Locality Optimizing Algorithm Announcements Monday November 28th, Dr. Sanjay Rajopadhye is talking at BMAC Friday December 2nd, Dr. Sanjay Rajopadhye will be leading CS553 Last Monday Kelly

More information

In this chapter we introduce some of the basic concepts that will be useful for the study of integer programming problems.

In this chapter we introduce some of the basic concepts that will be useful for the study of integer programming problems. 2 Basics In this chapter we introduce some of the basic concepts that will be useful for the study of integer programming problems. 2.1 Notation Let A R m n be a matrix with row index set M = {1,...,m}

More information

maximize c, x subject to Ax b,

maximize c, x subject to Ax b, Lecture 8 Linear programming is about problems of the form maximize c, x subject to Ax b, where A R m n, x R n, c R n, and b R m, and the inequality sign means inequality in each row. The feasible set

More information

Null space basis: mxz. zxz I

Null space basis: mxz. zxz I Loop Transformations Linear Locality Enhancement for ache performance can be improved by tiling and permutation Permutation of perfectly nested loop can be modeled as a matrix of the loop nest. dependence

More information

Linear programming and duality theory

Linear programming and duality theory Linear programming and duality theory Complements of Operations Research Giovanni Righini Linear Programming (LP) A linear program is defined by linear constraints, a linear objective function. Its variables

More information

Control flow graphs and loop optimizations. Thursday, October 24, 13

Control flow graphs and loop optimizations. Thursday, October 24, 13 Control flow graphs and loop optimizations Agenda Building control flow graphs Low level loop optimizations Code motion Strength reduction Unrolling High level loop optimizations Loop fusion Loop interchange

More information

A Compiler Framework for Optimization of Affine Loop Nests for General Purpose Computations on GPUs

A Compiler Framework for Optimization of Affine Loop Nests for General Purpose Computations on GPUs A Compiler Framework for Optimization of Affine Loop Nests for General Purpose Computations on GPUs Muthu Manikandan Baskaran 1 Uday Bondhugula 1 Sriram Krishnamoorthy 1 J. Ramanujam 2 Atanas Rountev 1

More information

Solving the Live-out Iterator Problem, Part I

Solving the Live-out Iterator Problem, Part I Louis-Noël Pouchet pouchet@cse.ohio-state.edu Dept. of Computer Science and Engineering, the Ohio State University September 2010 888.11 Outline: Reminder: step-by-step methodology 1 Problem definition:

More information

Loop Transformations! Part II!

Loop Transformations! Part II! Lecture 9! Loop Transformations! Part II! John Cavazos! Dept of Computer & Information Sciences! University of Delaware! www.cis.udel.edu/~cavazos/cisc879! Loop Unswitching Hoist invariant control-flow

More information

Lecture 4: Rational IPs, Polyhedron, Decomposition Theorem

Lecture 4: Rational IPs, Polyhedron, Decomposition Theorem IE 5: Integer Programming, Spring 29 24 Jan, 29 Lecture 4: Rational IPs, Polyhedron, Decomposition Theorem Lecturer: Karthik Chandrasekaran Scribe: Setareh Taki Disclaimer: These notes have not been subjected

More information

6.189 IAP Lecture 11. Parallelizing Compilers. Prof. Saman Amarasinghe, MIT IAP 2007 MIT

6.189 IAP Lecture 11. Parallelizing Compilers. Prof. Saman Amarasinghe, MIT IAP 2007 MIT 6.189 IAP 2007 Lecture 11 Parallelizing Compilers 1 6.189 IAP 2007 MIT Outline Parallel Execution Parallelizing Compilers Dependence Analysis Increasing Parallelization Opportunities Generation of Parallel

More information

The Polytope Model: Past, Present, Future

The Polytope Model: Past, Present, Future The Polytope Model: Past, Present, Future Paul Feautrier ENS de Lyon Paul.Feautrier@ens-lyon.fr 8 octobre 2009 1 / 39 What is a Model? What is a Polytope? Basis of the Polytope Model Fundamental Algorithms

More information

Polytopes Course Notes

Polytopes Course Notes Polytopes Course Notes Carl W. Lee Department of Mathematics University of Kentucky Lexington, KY 40506 lee@ms.uky.edu Fall 2013 i Contents 1 Polytopes 1 1.1 Convex Combinations and V-Polytopes.....................

More information

Tiling: A Data Locality Optimizing Algorithm

Tiling: A Data Locality Optimizing Algorithm Tiling: A Data Locality Optimizing Algorithm Previously Unroll and Jam Homework PA3 is due Monday November 2nd Today Unroll and Jam is tiling Code generation for fixed-sized tiles Paper writing and critique

More information

GRAPHITE: Polyhedral Analyses and Optimizations

GRAPHITE: Polyhedral Analyses and Optimizations GRAPHITE: Polyhedral Analyses and Optimizations for GCC Sebastian Pop 1 Albert Cohen 2 Cédric Bastoul 2 Sylvain Girbal 2 Georges-André Silber 1 Nicolas Vasilache 2 1 CRI, École des mines de Paris, Fontainebleau,

More information

Investigating Mixed-Integer Hulls using a MIP-Solver

Investigating Mixed-Integer Hulls using a MIP-Solver Investigating Mixed-Integer Hulls using a MIP-Solver Matthias Walter Otto-von-Guericke Universität Magdeburg Joint work with Volker Kaibel (OvGU) Aussois Combinatorial Optimization Workshop 2015 Outline

More information

Lecture 9 Basic Parallelization

Lecture 9 Basic Parallelization Lecture 9 Basic Parallelization I. Introduction II. Data Dependence Analysis III. Loop Nests + Locality IV. Interprocedural Parallelization Chapter 11.1-11.1.4 CS243: Parallelization 1 Machine Learning

More information

Lecture 9 Basic Parallelization

Lecture 9 Basic Parallelization Lecture 9 Basic Parallelization I. Introduction II. Data Dependence Analysis III. Loop Nests + Locality IV. Interprocedural Parallelization Chapter 11.1-11.1.4 CS243: Parallelization 1 Machine Learning

More information

Linear Programming in Small Dimensions

Linear Programming in Small Dimensions Linear Programming in Small Dimensions Lekcija 7 sergio.cabello@fmf.uni-lj.si FMF Univerza v Ljubljani Edited from slides by Antoine Vigneron Outline linear programming, motivation and definition one dimensional

More information

Convex Hull Representation Conversion (cddlib, lrslib)

Convex Hull Representation Conversion (cddlib, lrslib) Convex Hull Representation Conversion (cddlib, lrslib) Student Seminar in Combinatorics: Mathematical Software Niklas Pfister October 31, 2014 1 Introduction In this report we try to give a short overview

More information

Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory

Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory Roshan Dathathri Thejas Ramashekar Chandan Reddy Uday Bondhugula Department of Computer Science and Automation

More information

Conic Duality. yyye

Conic Duality.  yyye Conic Linear Optimization and Appl. MS&E314 Lecture Note #02 1 Conic Duality Yinyu Ye Department of Management Science and Engineering Stanford University Stanford, CA 94305, U.S.A. http://www.stanford.edu/

More information

Polar Duality and Farkas Lemma

Polar Duality and Farkas Lemma Lecture 3 Polar Duality and Farkas Lemma October 8th, 2004 Lecturer: Kamal Jain Notes: Daniel Lowd 3.1 Polytope = bounded polyhedron Last lecture, we were attempting to prove the Minkowsky-Weyl Theorem:

More information

Polyhedral Computation Today s Topic: The Double Description Algorithm. Komei Fukuda Swiss Federal Institute of Technology Zurich October 29, 2010

Polyhedral Computation Today s Topic: The Double Description Algorithm. Komei Fukuda Swiss Federal Institute of Technology Zurich October 29, 2010 Polyhedral Computation Today s Topic: The Double Description Algorithm Komei Fukuda Swiss Federal Institute of Technology Zurich October 29, 2010 1 Convexity Review: Farkas-Type Alternative Theorems Gale

More information

1. Lecture notes on bipartite matching

1. Lecture notes on bipartite matching Massachusetts Institute of Technology 18.453: Combinatorial Optimization Michel X. Goemans February 5, 2017 1. Lecture notes on bipartite matching Matching problems are among the fundamental problems in

More information

Advanced Compiler Construction Theory And Practice

Advanced Compiler Construction Theory And Practice Advanced Compiler Construction Theory And Practice Introduction to loop dependence and Optimizations 7/7/2014 DragonStar 2014 - Qing Yi 1 A little about myself Qing Yi Ph.D. Rice University, USA. Associate

More information

Dual-fitting analysis of Greedy for Set Cover

Dual-fitting analysis of Greedy for Set Cover Dual-fitting analysis of Greedy for Set Cover We showed earlier that the greedy algorithm for set cover gives a H n approximation We will show that greedy produces a solution of cost at most H n OPT LP

More information

Combined Iterative and Model-driven Optimization in an Automatic Parallelization Framework

Combined Iterative and Model-driven Optimization in an Automatic Parallelization Framework Combined Iterative and Model-driven Optimization in an Automatic Parallelization Framework Louis-Noël Pouchet The Ohio State University pouchet@cse.ohio-state.edu Uday Bondhugula IBM T.J. Watson Research

More information

DM545 Linear and Integer Programming. Lecture 2. The Simplex Method. Marco Chiarandini

DM545 Linear and Integer Programming. Lecture 2. The Simplex Method. Marco Chiarandini DM545 Linear and Integer Programming Lecture 2 The Marco Chiarandini Department of Mathematics & Computer Science University of Southern Denmark Outline 1. 2. 3. 4. Standard Form Basic Feasible Solutions

More information

Math 414 Lecture 2 Everyone have a laptop?

Math 414 Lecture 2 Everyone have a laptop? Math 44 Lecture 2 Everyone have a laptop? THEOREM. Let v,...,v k be k vectors in an n-dimensional space and A = [v ;...; v k ] v,..., v k independent v,..., v k span the space v,..., v k a basis v,...,

More information

Data Dependences and Parallelization

Data Dependences and Parallelization Data Dependences and Parallelization 1 Agenda Introduction Single Loop Nested Loops Data Dependence Analysis 2 Motivation DOALL loops: loops whose iterations can execute in parallel for i = 11, 20 a[i]

More information

Lecture 6: Faces, Facets

Lecture 6: Faces, Facets IE 511: Integer Programming, Spring 2019 31 Jan, 2019 Lecturer: Karthik Chandrasekaran Lecture 6: Faces, Facets Scribe: Setareh Taki Disclaimer: These notes have not been subjected to the usual scrutiny

More information

LooPo: Automatic Loop Parallelization

LooPo: Automatic Loop Parallelization LooPo: Automatic Loop Parallelization Michael Claßen Fakultät für Informatik und Mathematik Düsseldorf, November 27 th 2008 Model-Based Loop Transformations model-based approach: map source code to an

More information

CS 293S Parallelism and Dependence Theory

CS 293S Parallelism and Dependence Theory CS 293S Parallelism and Dependence Theory Yufei Ding Reference Book: Optimizing Compilers for Modern Architecture by Allen & Kennedy Slides adapted from Louis-Noël Pouche, Mary Hall End of Moore's Law

More information

MATH 890 HOMEWORK 2 DAVID MEREDITH

MATH 890 HOMEWORK 2 DAVID MEREDITH MATH 890 HOMEWORK 2 DAVID MEREDITH (1) Suppose P and Q are polyhedra. Then P Q is a polyhedron. Moreover if P and Q are polytopes then P Q is a polytope. The facets of P Q are either F Q where F is a facet

More information

Minimizing Memory Strides using Integer Transformations of Parametric Polytopes

Minimizing Memory Strides using Integer Transformations of Parametric Polytopes Minimizing Memory Strides using Integer Transformations of Parametric Polytopes Rachid Seghir, Vincent Loechner and Benoît Meister ICPS/LSIIT UMR CNRS-ULP 7005 Pôle API, Bd Sébastien Brant F-67400 Illkirch

More information

ACTUALLY DOING IT : an Introduction to Polyhedral Computation

ACTUALLY DOING IT : an Introduction to Polyhedral Computation ACTUALLY DOING IT : an Introduction to Polyhedral Computation Jesús A. De Loera Department of Mathematics Univ. of California, Davis http://www.math.ucdavis.edu/ deloera/ 1 What is a Convex Polytope? 2

More information

Lecture 3. Corner Polyhedron, Intersection Cuts, Maximal Lattice-Free Convex Sets. Tepper School of Business Carnegie Mellon University, Pittsburgh

Lecture 3. Corner Polyhedron, Intersection Cuts, Maximal Lattice-Free Convex Sets. Tepper School of Business Carnegie Mellon University, Pittsburgh Lecture 3 Corner Polyhedron, Intersection Cuts, Maximal Lattice-Free Convex Sets Gérard Cornuéjols Tepper School of Business Carnegie Mellon University, Pittsburgh January 2016 Mixed Integer Linear Programming

More information

Basic Algorithms for Periodic-Linear Inequalities and Integer Polyhedra

Basic Algorithms for Periodic-Linear Inequalities and Integer Polyhedra Basic Algorithms for Periodic-Linear Inequalities and Integer Polyhedra Alain Ketterlin / Camus IMPACT 2018: January, 23, 2018 Motivation Periodic-Linear Inequalities The Omicron Test Decomposition Motivation

More information

THEORY OF LINEAR AND INTEGER PROGRAMMING

THEORY OF LINEAR AND INTEGER PROGRAMMING THEORY OF LINEAR AND INTEGER PROGRAMMING ALEXANDER SCHRIJVER Centrum voor Wiskunde en Informatica, Amsterdam A Wiley-Inter science Publication JOHN WILEY & SONS^ Chichester New York Weinheim Brisbane Singapore

More information

Applied Integer Programming

Applied Integer Programming Applied Integer Programming D.S. Chen; R.G. Batson; Y. Dang Fahimeh 8.2 8.7 April 21, 2015 Context 8.2. Convex sets 8.3. Describing a bounded polyhedron 8.4. Describing unbounded polyhedron 8.5. Faces,

More information

Discrete Optimization 2010 Lecture 5 Min-Cost Flows & Total Unimodularity

Discrete Optimization 2010 Lecture 5 Min-Cost Flows & Total Unimodularity Discrete Optimization 2010 Lecture 5 Min-Cost Flows & Total Unimodularity Marc Uetz University of Twente m.uetz@utwente.nl Lecture 5: sheet 1 / 26 Marc Uetz Discrete Optimization Outline 1 Min-Cost Flows

More information

More Data Locality for Static Control Programs on NUMA Architectures

More Data Locality for Static Control Programs on NUMA Architectures More Data Locality for Static Control Programs on NUMA Architectures Adilla Susungi 1, Albert Cohen 2, Claude Tadonki 1 1 MINES ParisTech, PSL Research University 2 Inria and DI, Ecole Normale Supérieure

More information

C&O 355 Lecture 16. N. Harvey

C&O 355 Lecture 16. N. Harvey C&O 355 Lecture 16 N. Harvey Topics Review of Fourier-Motzkin Elimination Linear Transformations of Polyhedra Convex Combinations Convex Hulls Polytopes & Convex Hulls Fourier-Motzkin Elimination Joseph

More information

be a polytope. has such a representation iff it contains the origin in its interior. For a generic, sort the inequalities so that

be a polytope. has such a representation iff it contains the origin in its interior. For a generic, sort the inequalities so that ( Shelling (Bruggesser-Mani 1971) and Ranking Let be a polytope. has such a representation iff it contains the origin in its interior. For a generic, sort the inequalities so that. a ranking of vertices

More information

Ranking Functions. Linear-Constraint Loops

Ranking Functions. Linear-Constraint Loops for Linear-Constraint Loops Amir Ben-Amram 1 for Loops Example 1 (GCD program): while (x > 1, y > 1) if x

More information

Lecture 5: Duality Theory

Lecture 5: Duality Theory Lecture 5: Duality Theory Rajat Mittal IIT Kanpur The objective of this lecture note will be to learn duality theory of linear programming. We are planning to answer following questions. What are hyperplane

More information

Outline. Why Parallelism Parallel Execution Parallelizing Compilers Dependence Analysis Increasing Parallelization Opportunities

Outline. Why Parallelism Parallel Execution Parallelizing Compilers Dependence Analysis Increasing Parallelization Opportunities Parallelization Outline Why Parallelism Parallel Execution Parallelizing Compilers Dependence Analysis Increasing Parallelization Opportunities Moore s Law From Hennessy and Patterson, Computer Architecture:

More information

3. The Simplex algorithmn The Simplex algorithmn 3.1 Forms of linear programs

3. The Simplex algorithmn The Simplex algorithmn 3.1 Forms of linear programs 11 3.1 Forms of linear programs... 12 3.2 Basic feasible solutions... 13 3.3 The geometry of linear programs... 14 3.4 Local search among basic feasible solutions... 15 3.5 Organization in tableaus...

More information

Parallelization. Saman Amarasinghe. Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

Parallelization. Saman Amarasinghe. Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Spring 2 Parallelization Saman Amarasinghe Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Outline Why Parallelism Parallel Execution Parallelizing Compilers

More information

Affine and Unimodular Transformations for Non-Uniform Nested Loops

Affine and Unimodular Transformations for Non-Uniform Nested Loops th WSEAS International Conference on COMPUTERS, Heraklion, Greece, July 3-, 008 Affine and Unimodular Transformations for Non-Uniform Nested Loops FAWZY A. TORKEY, AFAF A. SALAH, NAHED M. EL DESOUKY and

More information

Essential constraints: Data Dependences. S1: a = b + c S2: d = a * 2 S3: a = c + 2 S4: e = d + c + 2

Essential constraints: Data Dependences. S1: a = b + c S2: d = a * 2 S3: a = c + 2 S4: e = d + c + 2 Essential constraints: Data Dependences S1: a = b + c S2: d = a * 2 S3: a = c + 2 S4: e = d + c + 2 Essential constraints: Data Dependences S1: a = b + c S2: d = a * 2 S3: a = c + 2 S4: e = d + c + 2 S2

More information

Chapter 4 Concepts from Geometry

Chapter 4 Concepts from Geometry Chapter 4 Concepts from Geometry An Introduction to Optimization Spring, 2014 Wei-Ta Chu 1 Line Segments The line segment between two points and in R n is the set of points on the straight line joining

More information

Lecture 2 September 3

Lecture 2 September 3 EE 381V: Large Scale Optimization Fall 2012 Lecture 2 September 3 Lecturer: Caramanis & Sanghavi Scribe: Hongbo Si, Qiaoyang Ye 2.1 Overview of the last Lecture The focus of the last lecture was to give

More information

Lecture 2 - Introduction to Polytopes

Lecture 2 - Introduction to Polytopes Lecture 2 - Introduction to Polytopes Optimization and Approximation - ENS M1 Nicolas Bousquet 1 Reminder of Linear Algebra definitions Let x 1,..., x m be points in R n and λ 1,..., λ m be real numbers.

More information

Matrix Multiplication

Matrix Multiplication Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2018 1 / 32 Outline 1 Matrix operations Importance Dense and sparse

More information

60 2 Convex sets. {x a T x b} {x ã T x b}

60 2 Convex sets. {x a T x b} {x ã T x b} 60 2 Convex sets Exercises Definition of convexity 21 Let C R n be a convex set, with x 1,, x k C, and let θ 1,, θ k R satisfy θ i 0, θ 1 + + θ k = 1 Show that θ 1x 1 + + θ k x k C (The definition of convexity

More information