Automatic Polyhedral Optimization of Stencil Codes

1 Automatic Polyhedral Optimization of Stencil Codes
ExaStencils 2014
Stefan Kronawitter, Armin Größlinger, Christian Lengauer

2 The Need for Different Optimizations
3D 1st-grade Jacobi smoother
[Figure: speedup over number of threads for the naive implementation, basic transformations, temporal blocking, and temporal & spatial blocking; panels: Ivy Bridge (6 MB cache) and BlueGene/Q (32 MB cache)]

3 Why Use the Polyhedron Model?
Transformations / optimizations
- simple (if considered in isolation)
- rather complex in combination
Benefits of the polyhedron model
- easy composition of transformations
- correctness, even on boundary

7 Manual Transformations?
Input: fdtd-2d
    for (t = 0; t < tmax; t++) {
      for (j = 0; j < ny; j++)
        ey[0][j] = _edge_[t];
      for (i = 1; i < nx; i++)
        for (j = 0; j < ny; j++)
          ey[i][j] = ey[i][j] - 0.5*(hz[i][j]-hz[i-1][j]);
      for (i = 0; i < nx; i++)
        for (j = 1; j < ny; j++)
          ex[i][j] = ex[i][j] - 0.5*(hz[i][j]-hz[i][j-1]);
      for (i = 0; i < nx - 1; i++)
        for (j = 0; j < ny - 1; j++)
          hz[i][j] = hz[i][j] - 0.7*(ex[i][j+1] - ex[i][j] + ey[i+1][j]-ey[i][j]);
    }

8 Automatic Transformations! Optimized: fdtd-2d tiled [code from Luis-Noël Pouchet] for (c0 = 0; c0 <= (((ny + 2 * tmax + -3) * 32 < 0?((32 < 0?-((-(ny + 2 * tmax + -3) ) / 32) : -((-(ny + 2 * tmax + -3) ) / 32))) : (ny + 2 * tmax + -3) / 32)); ++c0) { #pragma omp parallel for private(c3, c4, c2, c5) for (c1 = (((c0 * 2 < 0?-(-c0 / 2) : ((2 < 0?(-c ) / -2 : (c ) / 2)))) > (((32 * c0 + -tmax + 1) * 32 < 0?-(-(32 * c0 + -tmax + 1) / 32) : ((32 < 0?(-(32 * c0 + -tmax + 1) ) / -32 : (32 * c0 + -tmax ) / 32))))?((c0 * 2 < 0?-(-c0 / 2) : ((2 < 0?(-c ) / -2 : (c ) / 2)))) : (((32 * c0 + -tmax + 1) * 32 < 0?-(-(32 * c0 + -tmax + 1) / 32) : ((32 < 0?(-(32 * c0 + -tmax + 1) ) / -32 : (32 * c0 + -tmax ) / 32))))); c1 <= (((((((ny + tmax + -2) * 32 < 0?((32 < 0?-((-(ny + tmax + -2) ) / 32) : -((-(ny + tmax + -2) ) / 32))) : (ny + tmax + -2) / 32)) < (((32 * c0 + ny + 30) * 64 < 0?((64 < 0?-((-(32 * c0 + ny + 30) ) / 64) : -((-(32 * c0 + ny + 30) ) / 64))) : (32 * c0 + ny + 30) / 64))?(((ny + tmax + -2) * 32 < 0?((32 < 0?-((-(ny + tmax + -2) ) / 32) : -((-(ny + tmax + -2) ) / 32))) : (ny + tmax + -2) / 32)) : (((32 * c0 + ny + 30) * 64 < 0?((64 < 0?-((-(32 * c0 + ny + 30) ) / 64) : -((-(32 * c0 + ny + 30) ) / 64))) : (32 * c0 + ny + 30) / 64)))) < c0?(((((ny + tmax + -2) * 32 < 0?((32 < 0?-((-(ny + tmax + -2) ) / 32) : -((-(ny + tmax + -2) ) / 32))) : (ny + tmax + -2) / 32)) < (((32 * c0 + ny + 30) * 64 < 0?((64 < 0?-((-(32 * c0 + ny + 30) ) / 64) : -((-(32 * c0 + ny + 30) ) / 64))) : (32 * c0 + ny + 30) / 64))?(((ny + tmax + -2) * 32 < 0?((32 < 0?-((-(ny + tmax + -2) ) / 32) : -((-(ny + tmax + -2) ) / 32))) : (ny + tmax + -2) / 32)) : (((32 * c0 + ny + 30) * 64 < 0?((64 < 0?-((-(32 * c0 + ny + 30) ) / 64) : -((-(32 * c0 + ny + 30) ) / 64))) : (32 * c0 + ny + 30) / 64)))) : c0)); ++c1) { for (c2 = c0 + -c1; c2 <= (((((tmax + nx + -2) * 32 < 0?((32 < 0?-((-(tmax + nx + -2) ) / 32) : -((-(tmax + nx + -2) ) / 32))) : (tmax + nx + -2) / 32)) < (((32 * c * c1 + nx + 30) * 32 < 
0?((32 < 0?-((-(32 * c * c1 + nx + 30) ) / 32) : -((-(32 * c * c1 + nx + 30) ) / 32))) : (32 * c * c1 + nx + 30) / 32))?(((tmax + nx + -2) * 32 < 0?((32 < 0?-((-(tmax + nx + -2) ) / 32) : -((-(tmax + nx + -2) ) / 32))) : (tmax + nx + -2) / 32)) : (((32 * c * c1 + nx + 30) * 32 < 0?((32 < 0?-((-(32 * c * c1 + nx + 30) ) / 32) : -((-(32 * c * c1 + nx + 30) ) / 32))) : (32 * c * c1 + nx + 30) / 32)))); ++c2) { if (c0 == 2 * c1 && c0 == 2 * c2) { for (c3 = 16 * c0; c3 <= ((tmax + -1 < 16 * c0 + 30?tmax + -1 : 16 * c0 + 30)); ++c3) if (c0 % 2 == 0) (ey[0])[0] = (_edge_[c3]);... (200 more lines!) 4 / 20

9 Polyhedron Model
    for (i=1; i<=n; ++i)
      for (j=1; j<=n-i+1; ++j)
        A[i][j] = A[i-1][j] + A[i][j-1];
Iteration domain
- 1 <= i <= n
- 1 <= j <= n - i + 1
Dependences
- (i, j) -> (i + 1, j)
- (i, j) -> (i, j + 1)
Transformation
- t = i + j - 1
- p = j
Transformed iteration domain
- 1 <= t <= n
- 1 <= p <= t
Transformed dependences
- (t, p) -> (t + 1, p)
- (t, p) -> (t + 1, p + 1)
    for (t=1; t<=n; ++t)
      #pragma omp parallel for
      for (p=1; p<=t; ++p)
        A[t-p+1][p] =...;
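
The transformation t = i + j - 1, p = j can be checked on a small instance. The following is a minimal sketch, not the authors' generated code: the problem size N, the boundary values, and all function names are assumptions made for illustration. It runs the original and the transformed loop nest on identical inputs and compares the results; the right-hand side of the transformed statement is obtained by substituting i = t - p + 1 and j = p.

```c
#include <assert.h>
#include <string.h>

#define N 8  /* hypothetical problem size, chosen only for illustration */

/* Row 0 and column 0 act as boundary values read by the stencil. */
static void init(long A[N + 2][N + 2]) {
    for (int i = 0; i < N + 2; ++i)
        for (int j = 0; j < N + 2; ++j)
            A[i][j] = (i == 0 || j == 0) ? i + j + 1 : 0;
}

/* Original loop nest: triangular iteration domain in (i, j). */
static void run_original(long A[N + 2][N + 2]) {
    for (int i = 1; i <= N; ++i)
        for (int j = 1; j <= N - i + 1; ++j)
            A[i][j] = A[i - 1][j] + A[i][j - 1];
}

/* Transformed loop nest: t = i + j - 1, p = j, hence i = t - p + 1.
 * All dependences go from t to t + 1, so the p loop carries none.
 * The pragma is ignored when compiling without OpenMP support. */
static void run_transformed(long A[N + 2][N + 2]) {
    for (int t = 1; t <= N; ++t) {
        #pragma omp parallel for
        for (int p = 1; p <= t; ++p)
            A[t - p + 1][p] = A[t - p][p] + A[t - p + 1][p - 1];
    }
}

/* Returns 1 if both loop nests produce identical arrays. */
int transformation_preserves_result(void) {
    long a[N + 2][N + 2], b[N + 2][N + 2];
    assert(N >= 1);
    init(a);
    init(b);
    run_original(a);
    run_transformed(b);
    return memcmp(a, b, sizeof a) == 0;
}
```

Calling transformation_preserves_result() should return 1 for any N, since the transformed schedule respects both dependences.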

14 Constraints on the Input
Data structures
- only scalars and arrays allowed
- alias information must be available
    for (i=0; i<n; ++i) {
      P = P->next;    // bad: linked list
      *c = A[i];      // does c point inside A or B?
      A[i] = B[i+3];  // do A and B alias?
    }
for the blue variables: additional information is essential (are there hidden dependences?)

15 Constraints on Input
Loop bounds and array subscripts
- must be affine in surrounding loop variables and parameters
- iteration domain is Z-polyhedron
    for (i=0; i<n; ++i)
      for (j=0; j<=i; ++j) {
        A[(i*i+i)/2+j] = B[2*j][i-j];  // linearized triangle
        C[n*i] = D[C[j]];
      }
marked parts are relevant for model extraction: affine and non-affine

16 Stencil Optimizations
Layers
- L1: Continuous Domain & Continuous Model
- L2: Discrete Domain & Discrete Model
- L3: Algorithmic Components & Parameters
- L4: Complete Program Specification
- IR: Intermediate Representation, Polyhedral Model
Extraction point
- representation already executable (C-like)
- but with some abstract elements, e.g.
  - abstract communication node
  - loop node for grid traversal

17 Optimizations
Red-black Gauss-Seidel smoother
- first, all red, then all black points are updated in place
- each traversal updates every second element only
- more pressure on cache and reduced memory bandwidth
- more colors for larger stencils possible
Generate color splitting
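
To make the color splitting concrete, here is a minimal hand-written sketch of one red-black Gauss-Seidel sweep. It assumes a 2D 5-point stencil with zero right-hand side on an n x n row-major grid with fixed boundary values; the function name and the data layout are illustrative assumptions, not taken from the slides.

```c
#include <assert.h>

/* One red-black sweep: points with even (i + j) are "red", the others
 * "black". All red points are updated in place first, then all black
 * points, so each traversal touches every second element only. */
void red_black_sweep(double *u, int n) {
    assert(n >= 3);
    for (int color = 0; color < 2; ++color) {      /* 0: red, 1: black */
        for (int i = 1; i < n - 1; ++i) {
            /* first interior column in row i that has the current color */
            int j0 = 1 + ((i + 1 + color) % 2);
            for (int j = j0; j < n - 1; j += 2)
                u[i * n + j] = 0.25 * (u[(i - 1) * n + j] + u[(i + 1) * n + j]
                                     + u[i * n + j - 1] + u[i * n + j + 1]);
        }
    }
}
```

Within one color no point reads another point of the same color, so the inner loops of each color pass carry no dependences.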

18 Temporal Blocking
Normal update
[Figure: x-y grid showing input and result]

21 Temporal Blocking
Combination of two subsequent updates
[Figure: x-y grid showing input, intermediate, and result]
- block size can be determined automatically
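
The combination of two subsequent updates can be sketched for a 1D 3-point Jacobi stencil. This is a simple overlapped variant that recomputes intermediate values at block edges; it is an illustration under assumptions (1D instead of 2D, a fixed block size BLOCK, hypothetical function names) and not necessarily the scheme the authors generate.

```c
#include <assert.h>

#define BLOCK 4  /* hypothetical block size; the slides determine it automatically */

static double avg3(const double *a, int i) {
    return (a[i - 1] + a[i] + a[i + 1]) / 3.0;
}

/* Normal update, applied twice: each sweep streams over the whole array.
 * Elements 0 and n - 1 are fixed boundary values. */
void two_sweeps(const double *in, double *tmp, double *out, int n) {
    assert(n >= 3);
    tmp[0] = out[0] = in[0];
    tmp[n - 1] = out[n - 1] = in[n - 1];
    for (int i = 1; i < n - 1; ++i) tmp[i] = avg3(in, i);
    for (int i = 1; i < n - 1; ++i) out[i] = avg3(tmp, i);
}

/* Blocked version: for each block [lo, hi) both updates are performed
 * while the block's data is still in cache. */
void two_sweeps_blocked(const double *in, double *out, int n) {
    assert(n >= 3);
    double mid[BLOCK + 2];  /* intermediate values for indices lo-1 .. hi */
    out[0] = in[0];
    out[n - 1] = in[n - 1];
    for (int lo = 1; lo < n - 1; lo += BLOCK) {
        int hi = (lo + BLOCK < n - 1) ? lo + BLOCK : n - 1;
        for (int k = lo - 1; k <= hi; ++k)   /* first update, with halo */
            mid[k - lo + 1] = (k == 0 || k == n - 1) ? in[k] : avg3(in, k);
        for (int i = lo; i < hi; ++i)        /* second update */
            out[i] = (mid[i - lo] + mid[i - lo + 1] + mid[i - lo + 2]) / 3.0;
    }
}
```

Both functions should produce the same output up to rounding; the blocked one trades a small amount of redundant computation at block edges for fewer passes over main memory.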

25 Vectorization
Find transformations that allow SIMD parallelism
- innermost loop must be parallel
- subsequent iterations must access neighbouring grid elements
    for (i =..)
      for (j =..; ++j)
        A[..][..+j] =.. B[..][..+j];
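
A loop nest that meets both vectorization conditions might look as follows. This is a minimal sketch: the function name, the scaling operation, and the row-major layout are assumptions for illustration. The restrict qualifiers state that A and B do not alias, and the OpenMP simd pragma is only a hint that is ignored when OpenMP is not enabled.

```c
#include <assert.h>

/* The innermost loop runs over j, which indexes contiguous memory in a
 * row-major layout, and no iteration depends on another one. */
void scale_rows(int rows, int cols, const float *restrict B,
                float *restrict A, float factor) {
    assert(rows >= 0 && cols >= 0);
    for (int i = 0; i < rows; ++i) {
        #pragma omp simd
        for (int j = 0; j < cols; ++j)
            A[i * cols + j] = factor * B[i * cols + j];
    }
}
```

If the loops were interchanged, consecutive iterations of the innermost loop would access elements that are cols apart, which defeats the neighbouring-access condition.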

26 Domain-Specific Extensions
Polyhedron model already allows
- color splitting
- (temporal) blocking
- vectorization
Required extensions
- reductions
      for (i=0; i<n; ++i)
        sum = sum + A[i];
- non-cuboid grids

28 Support for Reductions
Example: sum reduction
    for (i=0; i<n; ++i)
      sum = sum + A[i];
- sum is updated in each iteration
Extracted model
- purely sequential execution order
[Figure: iterations along the i axis]
Domain knowledge tells
[Figure: transformed model with axes t and p]
- because + is associative
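
The effect of exploiting associativity can be sketched by splitting the sum into independent partial sums that are combined afterwards. The number of parts, the chunking, and the function names below are assumptions for illustration. The sketch uses integers so that both orders give exactly the same result; with floating-point data, reordering may change the rounding.

```c
#include <assert.h>

#define PARTS 4  /* hypothetical number of independent partial sums */

/* Execution order as written in the source: purely sequential. */
long sum_sequential(const long *A, int n) {
    long sum = 0;
    for (int i = 0; i < n; ++i)
        sum = sum + A[i];
    return sum;
}

/* Reordered execution: each part sums its own chunk without touching
 * the others, so the outer loop may run in parallel. */
long sum_partial(const long *A, int n) {
    assert(n >= 0);
    long partial[PARTS] = {0};
    #pragma omp parallel for
    for (int p = 0; p < PARTS; ++p) {
        int begin = (int)((long)n * p / PARTS);
        int end = (int)((long)n * (p + 1) / PARTS);
        for (int i = begin; i < end; ++i)
            partial[p] += A[i];
    }
    long sum = 0;
    for (int p = 0; p < PARTS; ++p)  /* short sequential combine step */
        sum += partial[p];
    return sum;
}
```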

31 Support for non-cuboid Grids
Example: triangular grid
[Figure: triangular grid in the x-y plane]
- grid is stored contiguously in memory
Usage
    for (i=..)
      for (j=..)
        .. A[(i*i+i)/2+j] ..;
- array access is not affine
Domain knowledge tells
    for (i=..)
      for (j=..)
        .. A[i][j] ..;
- is equivalent to access above (with respect to the dependences)
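
Why the two accesses are equivalent with respect to the dependences can be seen from the index function itself: for 0 <= j <= i, the expression (i*i+i)/2+j assigns every pair (i, j) its own memory position, so two accesses touch the same element exactly when their (i, j) pairs are equal. The following minimal sketch checks this; the function names are hypothetical.

```c
#include <assert.h>

/* Position of element (i, j), 0 <= j <= i, in a row-wise packed triangle. */
int tri_index(int i, int j) {
    assert(0 <= j && j <= i);
    return (i * i + i) / 2 + j;
}

/* Returns 1 if enumerating the triangle row by row yields the
 * consecutive positions 0, 1, 2, ..., i.e. the mapping is one-to-one
 * and leaves no gaps. */
int tri_index_is_contiguous(int n) {
    int expected = 0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j <= i; ++j)
            if (tri_index(i, j) != expected++)
                return 0;
    return 1;
}
```

Row i starts at i(i+1)/2 and holds i + 1 elements, so row i + 1 starts at i(i+1)/2 + i + 1 = (i+1)(i+2)/2, which is why the check succeeds for every n.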

34 (De-)Serialization for Communication
Layers
- L1: Continuous Domain & Continuous Model
- L2: Discrete Domain & Discrete Model
- L3: Algorithmic Components & Parameters
- L4: Complete Program Specification
- IR: Intermediate Representation, Polyhedral Model
Generation point
- late in transformation process
- shortly before target code generation

35 Generation of (De-)Serialization Code
Communication
[Figure: node with north neighbour and east neighbour]
- data exchange with all direct neighbours required
- some elements must be sent to multiple nodes
- avoid loading these elements repeatedly from main memory
- fill send buffers simultaneously

36 Serialization Code
2D example
[Figure: grid with neighbours N, W, E, S]
Model representation for N:
    [n] -> { N[x,y]->[x,y]: x=1 and 1<=y<=n-2 }
S, E and W analogously
    W(1, 1);
    for (int x=1; x<n-1; ++x)
      N(1, x);
    E(1, n-2);
    for (int y=2; y<n-2; ++y) {
      W(y, 1);
      E(y, n-2);
    }
    W(n-2, 1);
    for (int x=1; x<n-1; ++x)
      S(n-2, x);
    E(n-2, n-2);
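
One possible reading of the generated traversal is that N, S, W and E each append a grid element to the send buffer of the corresponding neighbour. The sketch below follows the loop structure of the 2D example under that assumption; the buffer type, its capacity, and all names are hypothetical. It assumes n >= 4, so that the first and last inner rows are distinct.

```c
#include <assert.h>

#define MAXBUF 64  /* hypothetical send buffer capacity */

typedef struct { double data[MAXBUF]; int len; } SendBuffer;

typedef struct {
    const double *grid;  /* n x n values, row-major */
    int n;
    SendBuffer north, south, west, east;
} Exchange;

static void push(SendBuffer *b, double v) {
    assert(b->len < MAXBUF);
    b->data[b->len++] = v;
}

static double at(const Exchange *e, int row, int col) {
    return e->grid[row * e->n + col];
}

/* Single traversal of the inner boundary in memory order; corner
 * elements go into two buffers while their row is being processed. */
void fill_send_buffers(Exchange *e) {
    int n = e->n;
    assert(n >= 4 && n - 2 <= MAXBUF);
    e->north.len = e->south.len = e->west.len = e->east.len = 0;

    push(&e->west, at(e, 1, 1));
    for (int x = 1; x < n - 1; ++x)
        push(&e->north, at(e, 1, x));
    push(&e->east, at(e, 1, n - 2));

    for (int y = 2; y < n - 2; ++y) {
        push(&e->west, at(e, y, 1));
        push(&e->east, at(e, y, n - 2));
    }

    push(&e->west, at(e, n - 2, 1));
    for (int x = 1; x < n - 1; ++x)
        push(&e->south, at(e, n - 2, x));
    push(&e->east, at(e, n - 2, n - 2));
}
```

For a 5 x 5 grid, each buffer ends up with three elements, and the element at row 1, column 1 appears in both the west and the north buffer.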

39 Summary
Stencil optimizations
- polyhedral representation extracted from IR
- domain knowledge:
  - grid size
  - number of pre- and post-smoothing steps
- transformations and their ordering
- correct treatment of boundaries
- can be performed completely automatically
(De-)Serialization
- polyhedral representation created directly
- generate optimal grid traversal code according to memory layout
- can be performed completely automatically

41 End
Thank you for your attention! Any questions?
