Parametric Multi-Level Tiling of Imperfectly Nested Loops*

1 Parametric Multi-Level Tiling of Imperfectly Nested Loops*
Albert Hartono (1), Cedric Bastoul (2,3), Sriram Krishnamoorthy (4), J. Ramanujam (6), Muthu Baskaran (1), Albert Cohen (2), Boyana Norris (5), P. Sadayappan (1)
(1) Ohio State University
(2) INRIA Saclay
(3) Paris-Sud 11 University
(4) Pacific Northwest National Laboratory
(5) Argonne National Laboratory
(6) Louisiana State University
* Funded by NSF

2 One Slide Summary
Imperfectly nested loops are common in practice.
A parametric tiled loop generator can provide valuable compiler support for auto-tuning.
Current general solutions for tiled code generation:
  Parametric tiling of perfect loop nests
  Non-parametric tiling of imperfect loop nests
  Both use the polyhedral model and ILP machinery.
  Constraint: the inequalities of the loop bounds must be linear in the loop iterators and problem sizes, which is a problem with parametric tile sizes.
We have recently developed a hybrid solution for parametric tiling of imperfect loop nests.

3 Loop Tiling
Key loop transformation for both:
  Efficient coarse-grained parallel execution
  Data locality optimization
Original loop nest:
    for (i=1; i<=7; i++)
      for (j=1; j<=6; j++)
        S(i,j);
Tiled loop nest (tile sizes ti x tj):
    for (it=1; it<=7; it+=ti)                  /* inter-tile loops */
      for (jt=1; jt<=6; jt+=tj)
        for (i=it; i<=min(7,it+ti-1); i++)     /* intra-tile loops */
          for (j=jt; j<=min(6,jt+tj-1); j++)
            S(i,j);
[Figure: the 7 x 6 iteration space (axes i, j) before and after tiling]
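The equivalence of the two loop nests can be checked mechanically. The following is a minimal sketch, assuming that S(i,j) is modeled as incrementing a visit counter and that the intra-tile upper bounds are inclusive (i <= min(7,it+ti-1)), so that the last iteration of each tile is not skipped. The names tiled_covers_once, visits, and min_int are illustrative and do not come from the slides.

```c
#include <string.h>

#define NI 7
#define NJ 6

/* Assumption: S(i,j) is modeled as incrementing a visit counter. */
static int visits[NI + 1][NJ + 1];

static int min_int(int a, int b) { return a < b ? a : b; }

/* Runs the tiled nest with tile sizes ti x tj (both >= 1) and returns 1
   if every iteration (i,j) of the original 7 x 6 nest ran exactly once. */
int tiled_covers_once(int ti, int tj)
{
    memset(visits, 0, sizeof visits);
    for (int it = 1; it <= NI; it += ti)              /* inter-tile loops */
        for (int jt = 1; jt <= NJ; jt += tj)
            for (int i = it; i <= min_int(NI, it + ti - 1); i++)   /* intra-tile loops */
                for (int j = jt; j <= min_int(NJ, jt + tj - 1); j++)
                    visits[i][j]++;                   /* S(i,j) */
    for (int i = 1; i <= NI; i++)
        for (int j = 1; j <= NJ; j++)
            if (visits[i][j] != 1)
                return 0;
    return 1;
}
```

Calling tiled_covers_once(3, 4) exercises partial tiles in both dimensions, since neither 7 nor 6 is a multiple of the tile size.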

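4
    for (i=1; i<n; i++)
      for (j=2; j<n; j++)
        S1: a[i][j] = a[j][i] + a[i][j-1];
Iteration vector: $x_{S1} = (i, j)^T$
Iteration domain: $1 \le i \le n-1$, $2 \le j \le n-1$
In matrix form, with one row per inequality:
\[ I_{S1}:\ \begin{pmatrix} 1 & 0 & 0 & -1 \\ -1 & 0 & 1 & -1 \\ 0 & 1 & 0 & -2 \\ 0 & -1 & 1 & -1 \end{pmatrix} \begin{pmatrix} x_{S1} \\ n \\ 1 \end{pmatrix} \ge 0 \]
Statement instances correspond to integer points in polyhedra, which are described by systems of linear inequalities.
[Figure: iteration domain of S1 in the (i, j) plane]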
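As a small illustration of the correspondence between statement instances and integer points, the sketch below encodes the four bounds of S1 (i >= 1, i <= n-1, j >= 2, j <= n-1) as rows of a constraint matrix applied to (i, j, n, 1) and compares membership in that polyhedron with the loop bounds. The matrix layout and the names D, in_domain, and domain_matches_loops are assumptions made for this example.

```c
/* Each row r encodes the inequality D[r] . (i, j, n, 1) >= 0. */
static const int D[4][4] = {
    { 1,  0, 0, -1},   /*  i - 1     >= 0 */
    {-1,  0, 1, -1},   /* -i + n - 1 >= 0 */
    { 0,  1, 0, -2},   /*  j - 2     >= 0 */
    { 0, -1, 1, -1},   /* -j + n - 1 >= 0 */
};

/* Returns 1 if (i, j) satisfies every inequality for problem size n. */
int in_domain(int i, int j, int n)
{
    const int x[4] = {i, j, n, 1};
    for (int r = 0; r < 4; r++) {
        int dot = 0;
        for (int c = 0; c < 4; c++)
            dot += D[r][c] * x[c];
        if (dot < 0)
            return 0;
    }
    return 1;
}

/* Compares polyhedron membership with the loop bounds of the nest
   "for (i=1; i<n; i++) for (j=2; j<n; j++)" over a bounding box. */
int domain_matches_loops(int n)
{
    for (int i = -2; i <= n + 2; i++)
        for (int j = -2; j <= n + 2; j++) {
            int in_loops = (i >= 1 && i < n && j >= 2 && j < n);
            if (in_domain(i, j, n) != in_loops)
                return 0;
        }
    return 1;
}
```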

5
Example with N=4, M=3:
    for (i=0; i<N; i++) {
      for (j=0; j<N; j++)
        for (k=0; k<N; k++)
          S1;
      for (p=0; p<M; p++)
        S2;
    }
Uniform, powerful abstraction for imperfect loop nests
Uniform, powerful handling of parametric loop bounds
Loop transform == affine scheduling functions
  => An arbitrary sequence of transforms == a change of affine coefficients

6
Input Program
  -> Loops to polyhedra
  -> Data dependence analysis
  -> Transforms (affine functions)
  -> Code generation: polyhedra to loops
  -> Output Program

7 Parametric Tiled Code Generation
Original loop nest:
    for (i=1; i<=n; i++)
      for (j=1; j<=n; j++)
        S(i,j);
Tile loop i with tile size Ti and loop j with tile size Tj:
    for (it=1; it<=n; it+=Ti)
      for (jt=1; jt<=n; jt+=Tj)
        for (i=it; i<=min(n,it+Ti-1); i++)
          for (j=jt; j<=min(n,jt+Tj-1); j++)
            S(i,j);
Tiled code generation is straightforward for rectangular, perfectly nested loops.
Tiled code generation is more challenging if:
  Inner loop bounds depend on outer loops
  Data dependences make rectangular tiling illegal
  Loops are imperfectly nested
The polyhedral compilation model enables tiled code generation for arbitrary affine codes with imperfectly nested loops.

8 Loop Code Generation from Polyhedra
Code generation in a polyhedral compiler framework: the process of converting a polyhedral representation of computations back into loop structures
CLooG:
  State-of-the-art polyhedral code generator
  Takes statement domains and affine schedules to generate transformed code
  Uses an efficient polyhedral scanning algorithm to generate imperfectly nested loops that scan a union of polyhedra (corresponding to statement domains)

9 Loop Code Generation from Polyhedra (cont.)
Input loop nests:
    for (i=1; i<=N; i++)
      for (j=i; j<=N; j++)
        S1(i,j);
    for (i=1; i<=M; i++)      /* M<N */
      for (j=1; j<=N; j++)
        S2(i,j);
Generated code scanning the union of the two statement domains:
    for (j=1; j<=N; j++) {
      S1(1,j);
      S2(1,j);
    }
    for (i=2; i<=M; i++) {
      for (j=1; j<=i-1; j++)
        S2(i,j);
      for (j=i; j<=N; j++) {
        S1(i,j);
        S2(i,j);
      }
    }
    for (i=M+1; i<=N; i++)
      for (j=i; j<=N; j++)
        S1(i,j);
[Figure: statement domains of S1 and S2 in the (i, j) plane, with M and N marked on the i axis and N on the j axis]
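A quick way to sanity-check scanning code of this kind is to count statement instances. The sketch below assumes 1 <= M < N and bounds the middle loop by M, since S2 is defined only for i <= M and the last loop starts at M+1. It increments a counter for every instance the generated code executes, decrements it for every instance of the two original nests, and reports whether all counters end at zero. The names scan_matches, s1, s2, and MAXN are hypothetical.

```c
#include <string.h>

#define MAXN 16

static int s1[MAXN + 1][MAXN + 1];   /* net count of S1(i,j) instances */
static int s2[MAXN + 1][MAXN + 1];   /* net count of S2(i,j) instances */

/* Returns 1 if the generated code executes exactly the statement
   instances of the original nests. Requires 1 <= M < N <= MAXN. */
int scan_matches(int N, int M)
{
    memset(s1, 0, sizeof s1);
    memset(s2, 0, sizeof s2);

    /* Generated code: each executed instance adds one. */
    for (int j = 1; j <= N; j++) { s1[1][j]++; s2[1][j]++; }
    for (int i = 2; i <= M; i++) {
        for (int j = 1; j <= i - 1; j++) s2[i][j]++;
        for (int j = i; j <= N; j++) { s1[i][j]++; s2[i][j]++; }
    }
    for (int i = M + 1; i <= N; i++)
        for (int j = i; j <= N; j++) s1[i][j]++;

    /* Original nests: each instance subtracts one. */
    for (int i = 1; i <= N; i++)
        for (int j = i; j <= N; j++) s1[i][j]--;
    for (int i = 1; i <= M; i++)
        for (int j = 1; j <= N; j++) s2[i][j]--;

    /* Any nonzero entry is a missing, extra, or duplicated instance. */
    for (int i = 0; i <= MAXN; i++)
        for (int j = 0; j <= MAXN; j++)
            if (s1[i][j] != 0 || s2[i][j] != 0)
                return 0;
    return 1;
}
```

This check compares only the sets of executed instances, not their order, so it says nothing about whether data dependences are preserved.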

10 Tiled Code Generation in Polyhedral Model
Tile sizes = 32 x 32
Original loop:
    for (i=1; i<=N; i++)
      for (j=1; j<=N; j++)
        S(i,j);
Statement domain: $1 \le i \le N$, $1 \le j \le N$
Affine schedule: identity, $(i, j) \mapsto (i, j)$
Tiled loop:
    for (it=0; it<=floord(N,32); it++)
      for (jt=0; jt<=floord(N,32); jt++)
        for (i=max(1,32*it); i<=min(N,32*it+31); i++)
          for (j=max(1,32*jt); j<=min(N,32*jt+31); j++)
            S(i,j);
Statement domain (tiled): $1 \le i \le N$, $1 \le j \le N$, $0 \le i - 32\,it \le 31$, $0 \le j - 32\,jt \le 31$
Affine schedule (tiled): identity, $(it, jt, i, j) \mapsto (it, jt, i, j)$
Constraint of the polyhedral model and ILP machinery: the inequalities of the loop bounds must be linear in the loop iterators and symbolic parameters.
[Figure: N x N iteration space (axes i, j) partitioned into 32 x 32 tiles]
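To see why this constraint gets in the way of parametric tile sizes, it helps to compare the tile-membership inequalities for a fixed size with those for a symbolic size. In the comparison below, $T$ is an assumed name for a parametric tile size; it does not appear on the slide.

```latex
\begin{align*}
\text{fixed size } 32:\quad & 0 \le i - 32\,it \le 31
  && \text{affine in the iterators } (it, i) \\
\text{parametric size } T:\quad & 0 \le i - T \cdot it \le T - 1
  && \text{contains the product } T \cdot it
\end{align*}
```

The product of a parameter and an iterator is not an affine term, so the second system falls outside what the polyhedral model and its ILP machinery can represent directly.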

11 Parametric Tiling: Perfectly Nested Loop
Input loop nest:
    for (i=lbi; i<=ubi; i++)
      for (j=lbj(i); j<=ubj(i); j++)
        S(i,j);
Output pseudocode:
    for it {
      [compute lbv]
      [compute ubv]
      if (lbv<ubv) {
        [prolog j]
        [full tiles j]
        [epilog j]
      } else {
        [untiled j]
      }
    }
    [epilog i]
[Figure: iteration space (axes i, j) split along i into full tiles and one partial tile; some tile segments along i contain full tiles and others contain none]
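The pseudocode can be made concrete for one specific domain. The sketch below is an illustration under stated assumptions, not the PrimeTile output: it assumes an upper-triangular domain (lbj(i) = i, ubj(i) = n), takes lbv as the largest lower bound and ubv as the smallest upper bound over the rows of a tile segment, and models S(i,j) as a visit counter. The function tile_triangle and its return convention are hypothetical.

```c
#include <string.h>

#define MAXN 64
static int visits[MAXN + 1][MAXN + 1];

/* Assumed example bounds: upper-triangular domain, j runs from i to n. */
static int lbj(int i) { return i; }
static int ubj(int i, int n) { (void)i; return n; }

/* Tiles 1 <= i <= n, lbj(i) <= j <= ubj(i) with parametric sizes Ti x Tj.
   Returns the number of iterations executed inside full tiles, or -1 if
   some iteration was not executed exactly once. Requires n <= MAXN. */
int tile_triangle(int n, int Ti, int Tj)
{
    int full = 0, it, jt, i, j;
    memset(visits, 0, sizeof visits);
    for (it = 1; it + Ti - 1 <= n; it += Ti) {       /* full tiles along i */
        int lbv = lbj(it), ubv = ubj(it, n);
        for (i = it + 1; i < it + Ti; i++) {         /* [compute lbv], [compute ubv] */
            if (lbj(i) > lbv) lbv = lbj(i);
            if (ubj(i, n) < ubv) ubv = ubj(i, n);
        }
        if (lbv < ubv) {
            for (i = it; i < it + Ti; i++)           /* [prolog j] */
                for (j = lbj(i); j < lbv; j++) visits[i][j]++;
            for (jt = lbv; jt + Tj - 1 <= ubv; jt += Tj)   /* [full tiles j] */
                for (i = it; i < it + Ti; i++)
                    for (j = jt; j < jt + Tj; j++) { visits[i][j]++; full++; }
            for (i = it; i < it + Ti; i++)           /* [epilog j] */
                for (j = jt; j <= ubj(i, n); j++) visits[i][j]++;
        } else {
            for (i = it; i < it + Ti; i++)           /* [untiled j] */
                for (j = lbj(i); j <= ubj(i, n); j++) visits[i][j]++;
        }
    }
    for (i = it; i <= n; i++)                        /* [epilog i] */
        for (j = lbj(i); j <= ubj(i, n); j++) visits[i][j]++;
    for (i = 1; i <= n; i++)                         /* check coverage */
        for (j = 1; j <= n; j++)
            if (visits[i][j] != (j >= lbj(i) && j <= ubj(i, n))) return -1;
    return full;
}
```

The full-tile loops have constant trip counts Ti and Tj and no min/max in their bounds, which is what makes them a good target for later optimizations such as unrolling.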

12 Parametric Tiling: Imperfectly Nested Loops
Input loop nest:
    for (i=lbi; i<=ubi; i++) {
      for (j1=lbj1(i); j1<=ubj1(i); j1++)
        S1(i,j1);
      for (j2=lbj2(i); j2<=ubj2(i); j2++)
        S2(i,j2);
    }
Output pseudocode:
    for it {
      [compute lbv1,ubv1,lbv2,ubv2]
      if (lbv1<ubv1) {
        [prolog j1]
        [full tiles j1]
        if (lbv2<ubv2) {
          [epilog j1 + prolog j2]
          [full tiles j2]
          [epilog j2]
        } else {
          [epilog j1 + untiled j2]
        }
      } else {
        /* omitted */
      }
    }
    [epilog i]
The omitted else branch (no full tiles for j1):
    if (lbv2<ubv2) {
      [untiled j1 + prolog j2]
      [tiled j2]
      [epilog j2]
    } else {
      [untiled j1 + untiled j2]
    }
Code segments joined by "+" are combined and interleaved.
[Figure: one tile segment along the i dimension (axes i, j), showing the statement domains of S1 and S2, the bounds lbv1, ubv1, lbv2, ubv2, and the partial regions S1a, S1b, S2a, S2b that are combined and interleaved]

13 Multi-Level Tiling
Essential for: exploiting data locality in deep multi-level memory hierarchies
Approach: boundary tiles can be recursively tiled using smaller tile sizes
[Figure: iteration space (axes i, j) with 3 levels of tiling]
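The recursive treatment of boundary tiles can be sketched in one dimension. The example below is a simplified illustration, assuming a list of tile sizes ordered from largest to smallest: each level places as many full tiles as fit, and whatever is left at the boundary is handed to the next, smaller tile size. The function multilevel_tile_1d and its parameters are hypothetical names.

```c
/* Covers the range [lb, ub] with full tiles of sizes T[0..levels-1],
   largest first. Each level tiles what the previous level left over at
   the boundary. Stores the number of full tiles placed at each level in
   tiles_per_level and returns the number of leftover iterations that
   would run untiled. All tile sizes must be >= 1. */
int multilevel_tile_1d(int lb, int ub, const int *T, int levels,
                       int *tiles_per_level)
{
    int pos = lb;
    for (int l = 0; l < levels; l++) {
        tiles_per_level[l] = 0;
        for (; pos + T[l] - 1 <= ub; pos += T[l]) {
            /* full tile [pos, pos + T[l] - 1]: the intra-tile loop
               for this level would be executed here */
            tiles_per_level[l]++;
        }
    }
    return ub - pos + 1;
}
```

For example, with tile sizes {32, 8, 2} over the range 1..100, the first level places three tiles (1..96), the second level places none because only four iterations remain, and the third level covers 97..100 with two tiles.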


14 Implementation: PrimeTile
A Parametric Multi-Level Tiler for Imperfect Loop Nests
Pre-processing:
  Input: loop nest sequence
  Pluto: iteration space polyhedra + affine schedules for rectangular tileability
  Modified CLooG: rectangularly tileable loop code (with complete embedding information, meaning all statements in a loop nest have the same number of surrounding loops)
Tiling:
  Parser + AST Generator: loop ASTs
  Loop Tiling Transformer: parametric multi-level tiled loop ASTs
  Code Generator: parametric multi-level tiled code

15 Experiments
Platform: Xeon workstation with dual quad-core E5462 Xeon processors (8 cores total) running at 2.8 GHz (1600 MHz FSB), with 32 KB L1 cache, 12 MB of L2 cache (6 MB shared per core pair), and 16 GB of DDR2 FBDIMM RAM, running a Linux kernel (x86-64)
Compiler: GCC, options: -O3
Comparisons with other tiled-code generators:
    Tiled code generator   Tile sizes   Loop nest structure
    HiTLOG                 Parametric   Perfect
    Pluto                  Fixed        Imperfect
    PrimeTile              Parametric   Imperfect

16 Benchmarks
    Name        Description                                Imperfect nest   Require skewing   Input problem size
    LU          LU factorization                           Yes              No                N=25
    2D FDTD     2D Finite Difference Time Domain method    Yes              Yes               T=2, N=2
    1D Jacobi   1D Jacobi method                           Yes              Yes               T=2, N=6x1 6
    Cholesky    Cholesky factorization                     Yes              No                N=5
    TriSolver   Triangular solver                          Yes              No                N=3
    Seidel      3D Gauss Seidel                            No               Yes               T=2, N=2
    DSYRK       Symmetric rank k update                    No               No                N=3
    DTRMM       Triangular matrix multiplication           No               No                N=3

17 Efficiency of Code Generation
[Figure: two plots, LU and Cholesky; x-axis: levels of tiling; y-axis: generation time (seconds); series: Pluto, PrimeTile (full), PrimeTile (no boundary tiling)]

18 Efficiency of Code Generation (cont.)
[Figure: two plots, DSYRK and DTRMM; x-axis: levels of tiling; y-axis: generation time (seconds); series: Pluto, PrimeTile (full), PrimeTile (no boundary tiling), HiTLOG]
Fully polyhedral fixed tiled code generation does not scale.
Double benefit of PrimeTile: better scalability and parametric tiling.

19 Performance of Generated Tiled Code
[Figure: execution time (seconds) of Pluto, PrimeTile, and HiTLOG code for LU, 2D FDTD, 1D Jacobi, Cholesky, TriSolver, Seidel, DSYRK, and DTRMM]
Parametric tiled code efficiency is comparable to or better than that of fixed tiled code.

20 Impact of Separation of Partial and Full Tiles
[Figure: execution time (seconds) for LU, 2D FDTD, 1D Jacobi, Cholesky, and TriSolver; series: Pluto, PrimeTile, Pluto (unroll/jam), PrimeTile (unroll), PrimeTile (regtile)]

21 Impact of Separation of Partial and Full Tiles
[Figure: execution time (seconds) for Seidel, DSYRK, and DTRMM; series: Pluto, PrimeTile, HiTLOG, Pluto (unroll/jam), PrimeTile (unroll), PrimeTile (regtile), HiTLOG (unroll), HiTLOG (regtile)]
Identification of full-tile loops enables downstream optimization (e.g., register tiling).

22 Summary
Developed an effective general approach to parametric multi-level tiling of imperfectly nested affine loops.
Achieved separation of partial tiles from full tiles, thereby enabling optimizations such as register tiling.
Ongoing and follow-up work targets parallel parametric tiling of affine imperfect loop nests.
Software download:
  1. A beta release of PrimeTile
  2. A modified version of CLooG

23 Thank You!
