High Performance Stencil Code Generation with Lift. Bastian Hagedorn Larisa Stoltzfus Michel Steuwer Sergei Gorlatch Christophe Dubach

Size: px

Start display at page:

Download "High Performance Stencil Code Generation with Lift. Bastian Hagedorn Larisa Stoltzfus Michel Steuwer Sergei Gorlatch Christophe Dubach"

Junior Long
6 years ago
Views:

1 High Performance Stencil Code Generation with Lift Bastian Hagedorn Larisa Stoltzfus Michel Steuwer Sergei Gorlatch Christophe Dubach

2 Why Stencil Computations? Stencil computations are a class of kernels which update neighboring array elements according to a fixed pattern, called stencil. Frequently occur in: Medical Imaging Physics Simulations Machine Learning PDE Solvers

3 Why Stencil Computations? Stencil compute time: HPC Center München 49% 2017 HPC Center Stuttgart 68% 2016 Frequently occur in: Medical Imaging Physics Simulations Machine Learning PDE Solvers

4 Yet Another Stencil Paper? ICS'09 ICS'05 PLDI' CGO'12 SC'10 CLUSTER' CGO' CLUSTER'17 WOLFHPC'16 CGO'18

5 Domain Specific Languages PATUS Pochoir PARTANS Halide... DSL

6 Exploiting Domain Knowledge PATUS Multicore CPU Pochoir GPU PARTANS HPC Mobile Xeon Phi Halide KNC KNL... DSL... Hardware

7 Exploiting Domain Knowledge Stencil DSLs Multicore CPU GPU HPC Mobile Xeon Phi KNC KNL... Hardware

8 Exploiting Domain Knowledge Linear Algebra DSLs Multicore CPU GPU Stencil DSLs HPC Mobile Xeon Phi N-Body DSLs KNC KNL... Hardware...

9 approaching Performance Portability Linear Algebra DSLs Stencil DSLs N-Body DSLs Universal High Performance Code Generator Multicore CPU GPU HPC Mobile Xeon Phi KNC KNL... Hardware...

10 approaching Performance Portability Linear Algebra DSLs Stencil DSLs N-Body DSLs Lift Multicore CPU GPU HPC Mobile Xeon Phi KNC KNL... Hardware...

11 DSL DSL Lift Multicore CPU GPU HPC Mobile Xeon Phi DSL KNC KNL... Hardware

12 DSL DSL Lift DSL High-Level IR Multicore CPU GPU HPC Mobile Xeon Phi KNC KNL... Hardware

13 DSL DSL Lift DSL High-Level IR Explore Optimizations by rewriting Multicore CPU GPU HPC Mobile Xeon Phi KNC KNL [CASES'16]... Hardware

14 DSL DSL Lift DSL High-Level IR Explore Optimizations by rewriting [CASES'16] Low-Level Program Multicore CPU GPU HPC Mobile Xeon Phi KNC KNL... Hardware

15 DSL DSL Lift DSL High-Level IR Explore Optimizations by rewriting Low-Level Program Multicore CPU GPU HPC Mobile Xeon Phi KNC KNL [CASES'16] Code Generation [CGO'17]... Hardware

17 Lift's High-level Primitives map( ) reduce( ) split(n) join zip

18 Lift's High-level Primitives map( ) dotproduct.lift reduce( ) * split(n) join zip a b +

19 Lift's High-level Primitives map( ) dotproduct.lift reduce( ) * split(n) + join zip a b reduce(+,0, map(*, zip(a,b)))

20 Lift's High-level Primitives map( ) dotproduct.lift reduce( ) * split(n) + join zip a b reduce(+,0, map(*, zip(a,b)))

21 Lift's High-level Primitives map( ) dotproduct.lift reduce( ) * split(n) + join zip a b reduce(+,0, map(*, zip(a,b)))

22 Lift's High-level Primitives map( ) stencil.lift? reduce( ) split(n) join zip Can we express stencil computations in Lift? Let's look at a simple stencil example...

23 What ARE Stencil Computations? 3-point-stencil.c for (int i = 0; i int sum = 0; for ( int j = int pos = pos = pos pos = pos sum += A[ B[ i ] = sum ; } < N ; i ++) { -1; i + < 0 > N pos j <= 1; j ++) { j;? 0 : pos; - 1? N - 1 : pos; ]; } A B

24 What ARE Stencil Computations? 3-point-stencil.c for (int i = 0; i int sum = 0; for ( int j = int pos = pos = pos pos = pos sum += A[ B[ i ] = sum ; } < N ; i ++) { -1; i + < 0 > N pos j <= 1; j ++) { j;? 0 : pos; - 1? N - 1 : pos; ]; }

25 What ARE Stencil Computations? 3-point-stencil.c for (int i = 0; i int sum = 0; for ( int j = int pos = pos = pos pos = pos sum += A[ B[ i ] = sum ; } < N ; i ++) { -1; i + < 0 > N pos j <= 1; j ++) { j;? 0 : pos; - 1? N - 1 : pos; ]; }

26 Stencil computations in Lift map( ) reduce( ) split(n) join zip 3-point-stencil.lift

27 Stencil computations in Lift map( ) 3-point-stencil.lift reduce( ) split(n) join zip stencil Add specialized primitive: Job done?

28 Stencil computations in Lift map( ) 3-point-stencil.lift reduce( ) split(n) join zip stencil Add specialized primitive: Job done? No Reuse of existing primitives and optimizations Domain-specific rather than generic Multidimensional? is it composable?

29 Decomposing Stencil Computations 3-point-stencil.c for (int i = 0; i int sum = 0; for ( int j = int pos = pos = pos pos = pos sum += A[ B[ i ] = sum ; } < N ; i ++) { -1; i + < 0 > N pos j <= 1; j ++) { j;? 0 : pos; - 1? N - 1 : pos; ]; }

30 Decomposing Stencil Computations 3-point-stencil.c for (int i = 0; i int sum = 0; for ( int j = int pos = pos = pos pos = pos sum += A[ B[ i ] = sum ; } < N ; i ++) { -1; i + < 0 > N pos j <= 1; j ++) { // ( a ) j;? 0 : pos; - 1? N - 1 : pos; ]; } (a) access neighborhoods for every element

31 Decomposing Stencil Computations 3-point-stencil.c for (int i = 0; i int sum = 0; for ( int j = int pos = pos = pos pos = pos sum += A[ B[ i ] = sum ; } < N ; i ++) { -1; i + < 0 > N pos j <= 1; j ++) { // ( a ) j;? 0 : pos; // ( b ) - 1? N - 1 : pos; ]; } (a) access neighborhoods for every element (b) specify boundary handling

32 Decomposing Stencil Computations 3-point-stencil.c for (int i = 0; i int sum = 0; for ( int j = int pos = pos = pos pos = pos sum += A[ B[ i ] = sum ; } < N ; i ++) { -1; i + < 0 > N pos j <= 1; j ++) { // ( a ) j;? 0 : pos; // ( b ) - 1? N - 1 : pos; ]; } // ( c ) (a) access neighborhoods for every element (b) specify boundary handling (c) apply stencil function to neighborhoods

33 Decomposing Stencil Computations 3-point-stencil.c for (int i = 0; i int sum = 0; for ( int j = int pos = pos = pos pos = pos sum += A[ B[ i ] = sum ; } < N ; i ++) { -1; i + < 0 > N pos j <= 1; j ++) { // ( a ) j;? 0 : pos; // ( b ) - 1? N - 1 : pos; ]; } // ( c ) (a) access neighborhoods for every element (b) specify boundary handling (c) apply stencil function to neighborhoods

34 Stencil computations in Lift map( ) 3-point-stencil.lift reduce( ) split(n) join zip??????????????????

35 Stencil computations in Lift map( ) 3-point-stencil.lift reduce( ) split(n) join zip Reuse map allows to reuse existing rewrite rules Simplicity???????????? one primitive per task Multidimensional easily composable??? map

36 Boundary Handling Using Pad pad ( reindexing ) pad ( constant ) C C pad-reindexing.lift pad-constant.lift clamp(i, n) = (i < 0)? 0 : ((i >= n)? n-1:i) constant(i, n) = C pad(1,1,clamp, [a,b,c,d]) = [a,a,b,c,d,d] pad(1,1,constant, [a,b,c,d]) = [C,a,b,c,d,C]

37 Neighborhood Creation using Slide size slide-example.lift slide(3,1,[a,b,c,d,e]) = [[a,b,c],[b,c,d],[c,d,e]] step...

38 Applying Stencil function using Map sum-neighborhoods.lift map(nbh => reduce(add, 0.0f, nbh))

39 Putting it Together map( ) stencil1d.lift reduce( ) split(n) join zip pad(l,r,b) slide(n,s) pad def stencil1d = fun(a => map(reduce(add, 0.0f), slide(3,1, pad(1,1,clamp,a)))) slide map

40 m co De to are expressed as compositions of intuitive, generic 1D primitives se po Multidimensional stencil computations se po om -C Re

41 m co De to are expressed as compositions of intuitive, generic 1D primitives se po Multidimensional stencil computations se po om -C Re map2(sumnbh, slide2(3,1, pad2(1,1,clamp,input)))

42 m co De to are expressed as compositions of intuitive, generic 1D primitives se po Multidimensional stencil computations se po om -C Re map2(sumnbh, slide2(3,1, pad2(1,1,clamp,input)))

43 Multidimensional Boundary Handling using PAD 2 input pad2 = map(pad(l,r,b,pad(l,r,b,input)))

44 Multidimensional Boundary Handling using PAD 2 input pad2 = map(pad(l,r,b,pad(l,r,b,input)))

45 Multidimensional Boundary Handling using PAD 2 input pad2 = map(pad(l,r,b,pad(l,r,b,input)))

46 m co De to are expressed as compositions of intuitive, generic 1D primitives se po Multidimensional stencil computations se po om -C Re map2(sum, slide2(3,1, pad2(1,1,clamp,input)))

47 m co De to are expressed as compositions of intuitive, generic 1D primitives se po Multidimensional stencil computations se po om -C Re map2(sum, slide2(3,1, pad2(1,1,clamp,input)))

48 m co De to are expressed as compositions of intuitive, generic 1D primitives se po Multidimensional stencil computations se po om -C Re map2(sum, slide2(3,1, pad2(1,1,clamp,input)))

49 m co De to are expressed as compositions of intuitive, generic 1D primitives se po Multidimensional stencil computations se po om -C Re map3(sum, slide3(3,1, pad3(1,1,clamp,input)))

computations se po om -C Re map3(sum, slide3(3,1,

50 m co De to are expressed as compositions of intuitive, generic 1D primitives se po Multidimensional stencil computations se po om -C Re map3(sum, slide3(3,1, pad3(1,1,clamp,input))) Advantages: Compact Language Reuse Rewrites Simple Compilation

52 Reusing existing Rewrite Rules Divide & Conquer map(f, A)

53 Reusing existing Rewrite Rules Divide & Conquer map(f, A) join(map(map(f), split(n, A)))

54 Optimization: Overlapped Tiling Exploit Locality Close neighborhoods share elements that can be grouped in tiles Shared Memory Fast memory can be used to cache tiles

55 Optimization: Overlapped Tiling Shared memory size Exploit Locality Close neighborhoods share elements that can be grouped in tiles Shared Memory Fast memory can be used to cache tiles

56 Optimization: Overlapped Tiling overlap Exploit Locality Close neighborhoods share elements that can be grouped in tiles Shared Memory Fast memory can be used to cache tiles

57 Optimization: Overlapped Tiling overlap Exploit Locality Close neighborhoods share elements that can be grouped in tiles tile containing three neighborhoods Shared Memory Fast memory can be used to cache tiles

58 Overlapped Tiling as a rewrite rule overlapped tiling rule map(f, slide(3,1,input)) u v

59 Overlapped Tiling as a rewrite rule overlapped tiling rule map(f, slide(3,1,input)) join(map(tile map(f, slide(3,1,tile)), slide(u,v,input))) u v

60 Overlapped Tiling as a rewrite rule overlapped tiling rule map(f, slide(3,1,input)) join(map(tile map(f, slide(3,1,tile)), slide(u,v,input))) u v

62 Comparison with Hand-Optimized codes higher is better Lift achieves the same performance as hand optimized code

63 Comparison with polyhedral compilation higher is better Lift outperforms state-of-the-art optimizing compilers

64 Stencil Computations in LIft We added: 2 Primitives pad, slide 1 Rewrite Rule overlapped tiling High-Level IR

65 Lift is open Source! more info at: lift-project.org Paper CGO Artifact Bastian Hagedorn: Source Code

High Performance Stencil Code Generation with Lift

High Performance Stencil Code Generation with Lift Bastian Hagedorn Larisa Stoltzfus Michel Steuwer University of Münster Münster, Germany b.hagedorn@wwu.de The University of Edinburgh Edinburgh, Scotland,