STELLA: A Domain Specific Language for Stencil Computations


1 STELLA: A Domain Specific Language for Stencil Computations
Center for Climate Systems Modeling, ETH Zurich
Tobias Gysi, Department of Computer Science, ETH Zurich
Oliver Fuhrer, Federal Office of Meteorology and Climatology, MeteoSwiss, Zurich
Mauro Bianco, Swiss National Supercomputing Centre, CSCS, Lugano
Thomas C. Schulthess, Swiss National Supercomputing Centre, CSCS, Lugano
Work funded by the HP2C and PASC initiatives
Dagstuhl Seminar, 14th April

2 Outline
- Dynamical core of COSMO
- STELLA: DSL using C++ template metaprogramming
- STELLA DSL elements
- Exploiting performance: loop & kernel fusion, data locality
- Extending parallelization models

3 MOTIVATION: COSMO MODEL
- The dynamical core of COSMO solves the Navier-Stokes equations using finite difference methods on structured grids.
- Stencils on 3D grids: 2-7 km resolution in the horizontal, 60/80 levels in the vertical.
- Split-explicit time integrator, with slow processes integrated using a Runge-Kutta method.
- Collection of Fortran operators (advection, diffusion, fast waves, etc.) on prognostic variables (u, v, w, T, qx, ...).

4 COSMO time step
[Diagram: Initialization; time step Δt: Boundary conditions, Physics, Dynamics, Data assimilation, Halo-update, Diagnostics, I/O; Cleanup]

5 COSMO Dynamical Core
Dynamical code: ~35 stencils per time step
[Diagram: the Dynamics step of the COSMO time step, with the labels Vertical Diffusion (compute T), Vertical Diffusion (compute U, V, W), Horizontal Diffusion, Time Integrator (RungeKutta, FastWaves, N small time steps), Start comm / Wait comm (U, V, W)]

6 Algorithmic Motifs In COSMO
Horizontal PDE operators explicitly solved -> compact stencils with horizontal data dependencies

for k=1, ke {
  for i=1, ie-1 {
    for j=1, je-1 {
      lap(i,j) = -4*U(i,j) + U(i+1,j) + U(i-1,j) + U(i,j+1) + U(i,j-1)
    }
  }
  for i=1, ie-1 {
    for j=1, je-1 {
      diff(i,j) = -4*lap(i,j) + lap(i+1,j) + lap(i-1,j) + lap(i,j+1) + lap(i,j-1)
    }
  }
}
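The two loop nests of this slide can be sketched as self-contained C++ for a single horizontal plane. This is only an illustration: `Field2D`, `laplacian_at` and `diffusion` are hypothetical names, not COSMO or STELLA code, and leaving points at zero where the stencil does not fit is an assumed boundary treatment.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: a dense 2D field stored row-major (j fastest).
struct Field2D {
    std::size_t ni, nj;
    std::vector<double> data;
    Field2D(std::size_t ni_, std::size_t nj_, double init = 0.0)
        : ni(ni_), nj(nj_), data(ni_ * nj_, init) {}
    double& operator()(std::size_t i, std::size_t j) { return data[i * nj + j]; }
    double operator()(std::size_t i, std::size_t j) const { return data[i * nj + j]; }
};

// 5-point Laplacian at (i, j); the caller guarantees that all neighbours exist.
inline double laplacian_at(const Field2D& f, std::size_t i, std::size_t j) {
    return -4.0 * f(i, j) + f(i + 1, j) + f(i - 1, j) + f(i, j + 1) + f(i, j - 1);
}

// Two passes: lap = L(u) on the interior, then diff = L(lap) one point further in.
// Points where the stencil does not fit are left at 0 (assumed boundary handling).
inline Field2D diffusion(const Field2D& u) {
    assert(u.ni >= 5 && u.nj >= 5);
    Field2D lap(u.ni, u.nj), diff(u.ni, u.nj);
    for (std::size_t i = 1; i + 1 < u.ni; ++i)
        for (std::size_t j = 1; j + 1 < u.nj; ++j)
            lap(i, j) = laplacian_at(u, i, j);
    for (std::size_t i = 2; i + 2 < u.ni; ++i)
        for (std::size_t j = 2; j + 2 < u.nj; ++j)
            diff(i, j) = laplacian_at(lap, i, j);
    return diff;
}
```

The second pass reads `lap` at the four neighbours of each point, so it can only start once the first pass has finished on that neighbourhood. This is the horizontal data dependency named on the slide.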

7 Algorithmic Motifs In COSMO
Vertical PDE operators implicitly solved -> tridiagonal systems

// forward
for j=1, je {
  for i=1, ie {
    for k=2, ke {
      c(i,j,k) = 1.0 / (b(i,j,k) - c(i,j,k-1) * a(i,j,k))
      d(i,j,k) = (d(i,j,k) - d(i,j,k-1) * a(i,j,k)) * c(i,j,k)
    }
  }
}
// backward substitution
for j=1, je {
  for i=1, ie {
    for k=ke-1, 1, -1 {
      x(i,j,k) = d(i,j,k) - c(i,j,k) * x(i,j,k+1)
    }
  }
}
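The slide only shows the loop skeleton of the vertical solve. For reference, a minimal sketch of the standard Thomas algorithm for one vertical column is given below. The function name `thomas_solve` and the vector-based interface are assumptions made for illustration; the sketch does no pivoting and therefore assumes a well-conditioned (for example diagonally dominant) system.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Solves a[k]*x[k-1] + b[k]*x[k] + c[k]*x[k+1] = d[k] for one column, k = 0..n-1.
// a[0] and c[n-1] are ignored. c and d are taken by value and overwritten.
inline std::vector<double> thomas_solve(const std::vector<double>& a,
                                        const std::vector<double>& b,
                                        std::vector<double> c,
                                        std::vector<double> d) {
    const std::size_t n = b.size();
    assert(n > 0 && a.size() == n && c.size() == n && d.size() == n);

    // Forward sweep: eliminate the lower diagonal, level by level.
    c[0] /= b[0];
    d[0] /= b[0];
    for (std::size_t k = 1; k < n; ++k) {
        const double m = 1.0 / (b[k] - a[k] * c[k - 1]);
        c[k] *= m;
        d[k] = (d[k] - a[k] * d[k - 1]) * m;
    }

    // Backward substitution: start at the last level and walk towards k = 0.
    std::vector<double> x(n);
    x[n - 1] = d[n - 1];
    for (std::size_t k = n - 1; k-- > 0;)
        x[k] = d[k] - c[k] * x[k + 1];
    return x;
}
```

Both sweeps carry a dependency from one level to the next, so a single column cannot be parallelized in K with this algorithm; the available parallelism comes from the independent (i, j) columns.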

8 Model development

$$U = U(x, z, t), \qquad U(x, z, t=0) = U_0(x, z)$$
$$\frac{\partial U}{\partial t} = \alpha \nabla^4 U = \alpha \nabla^2 (\nabla^2 U)$$
$$\nabla^2 U_{i,j} = \frac{U_{i+1,j} + U_{i-1,j} + U_{i,j+1} + U_{i,j-1} - 4 U_{i,j}}{\Delta^2}$$

for i=1, ie-1
  for j=1, je-1
    lap(i,j) = -4*U(i,j) + U(i+1,j) + U(i-1,j) + U(i,j+1) + U(i,j-1)
for i=1, ie-1
  for j=1, je-1
    diff(i,j) = -4*lap(i,j) + lap(i+1,j) + lap(i-1,j) + lap(i,j+1) + lap(i,j-1)


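10

$$U = U(x, z, t), \qquad U(x, z, t=0) = U_0(x, z)$$
$$\frac{\partial U}{\partial t} = \alpha \nabla^4 U = \alpha \nabla^2 (\nabla^2 U)$$
$$\nabla^2 U_{i,j} = \frac{U_{i+1,j} + U_{i-1,j} + U_{i,j+1} + U_{i,j-1} - 4 U_{i,j}}{\Delta^2}$$

?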

11

$$\nabla^2 U_{i,j} = \frac{U_{i+1,j} + U_{i-1,j} + U_{i,j+1} + U_{i,j-1} - 4 U_{i,j}}{\Delta^2}$$

static void Lap(Context ctx) {
  ctx[lap::Center()] = ctx[u::at(iplus1)] + ctx[u::at(iminus1)]
                     + ctx[u::at(jplus1)] + ctx[u::at(jminus1)]
                     - 4*ctx[u::Center()];
}


12

const int i = threadIdx.x;
const int j = threadIdx.y;
int i_h = 0;
int j_h = 0;
if (j < 2) {
  i_h = i;
  j_h = (j == 0 ? -1 : blockDim.y);
} else if (j < 4 && i <= blockDim.y) {
  i_h = (j == 2 ? -1 : blockDim.x);
  j_h = i;
}
for (int k = 0; k < kdim; ++k) {
  lap(i,j) = -4.0 * U(i,j,k) + U(i+1,j,k) + U(i-1,j,k) + U(i,j+1,k) + U(i,j-1,k);
  if (i_h != 0 || j_h != 0)
    lap(i_h,j_h) = -4.0 * U(i_h,j_h,k) + U(i_h+1,j_h,k) + U(i_h-1,j_h,k) + U(i_h,j_h+1,k) + U(i_h,j_h-1,k);
  __syncthreads();
  result(i,j) = -4*(i,j,k) - alpha(i,j,k) * (lap(i,j,k) - lap(i-1,j,k) + lap(i,j,k) - lap(i,j-1,k));
}

13 STELLA DSL using C++ template metaprogramming
- Separates model and algorithm from hardware-specific implementation and optimizations.
- Single source code compiling for multiple architectures; performance portable.
- Concise syntax, close to the mathematical description of operators.
- Same toolchain as the full model (access to debugger/data fields, interoperability with other programming languages).
- Supports CPU & GPU backends.
- Used to rewrite a full dynamical core (COSMO).

14 How it works?
- Operators are encapsulated as type information (functors).
- Types are composed/organized and manipulated.

template<typename Context>
struct LapStage {
  static void Do(Context ctx) {
    ctx[lap::Center()] = ctx[u::at(iplus1)] + ctx[u::at(iminus1)]
                       + ctx[u::at(jplus1)] + ctx[u::at(jminus1)]
                       - 4*ctx[u::Center()];
  }
};

- Specific backends expand the code of operators into template kernels with loop structures.
- Many optimizations (memory layout, tiling, looping, data locality) depend on the backend.
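The idea of passing an operator as a type to a backend that owns the loops can be illustrated in plain C++. The sketch below is not STELLA code: `Context`, `LapStage` and `NaiveCpuBackend` are hypothetical stand-ins, and the backend is a plain sequential loop chosen only to show where the loop structure lives.

```cpp
#include <cassert>
#include <vector>

// Hypothetical context: the view a stage gets of the fields around one grid point.
struct Context {
    const std::vector<double>& u;
    std::vector<double>& lap;
    int nj, i, j;
    double u_at(int di, int dj) const { return u[(i + di) * nj + (j + dj)]; }
    double& lap_center() { return lap[i * nj + j]; }
};

// The operator is a type: it only states what happens at one grid point.
struct LapStage {
    static void Do(Context& ctx) {
        ctx.lap_center() = ctx.u_at(1, 0) + ctx.u_at(-1, 0) + ctx.u_at(0, 1)
                         + ctx.u_at(0, -1) - 4.0 * ctx.u_at(0, 0);
    }
};

// A "backend" is a template that owns the loop structure and is instantiated
// with the stage type. This one loops sequentially over the interior points.
template <typename Stage>
struct NaiveCpuBackend {
    static void Apply(const std::vector<double>& u, std::vector<double>& lap,
                      int ni, int nj) {
        assert(static_cast<int>(u.size()) == ni * nj);
        assert(static_cast<int>(lap.size()) == ni * nj);
        for (int i = 1; i < ni - 1; ++i) {
            for (int j = 1; j < nj - 1; ++j) {
                Context ctx{u, lap, nj, i, j};
                Stage::Do(ctx);  // resolved at compile time, so it can be inlined
            }
        }
    }
};
```

A call such as `NaiveCpuBackend<LapStage>::Apply(u, lap, ni, nj)` applies the stage to every interior point. Swapping in a different backend template would change tiling or loop order without touching `LapStage`.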

15 Stencil Stages
Stencil stages describe the single operation applied to each grid point. Parallel execution in the IJ plane is assumed.

template<typename TEnv>
struct Lap {
  STENCIL_STAGE(TEnv)
  STAGE_PARAMETER(u)
  STAGE_PARAMETER(lap)
  static void Do(Context ctx, FullDomain) {
    ctx[lap::Center()] = -4*ctx[u::Center()]
                       + ctx[u::at(iplus1)] + ctx[u::at(iminus1)]
                       + ctx[u::at(jplus1)] + ctx[u::at(jminus1)];
  }
};

16 Full example
Instantiate data fields (memory layout abstracted, backend dependent):

IJKRealField in_, out_;

StencilCompiler::Build(
  Stencil,
  pack_parameters(
    Param<in, cin, cdatafield>(in_),
    Param<out, cinout, cdatafield>(out_)
  ),
  define_temporaries(
    StencilBuffer<lap, double, KRange<FullDomain,0,0> >()
  ),
  define_loops(
    define_sweep<ckincrement>(
      define_stages(
        StencilStage<Lap, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(),
        StencilStage<Lap2, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >()
      )
    )
  )
);

17 Full example
Pack the parameters and associate them to placeholders used in stages:

  pack_parameters(
    Param<in, cin, cdatafield>(in_),
    Param<out, cinout, cdatafield>(out_)
  ),

18 Full example
Define temporary buffers used by the stencil:

  define_temporaries(
    StencilBuffer<lap, double, KRange<FullDomain,0,0> >()
  ),

19 Full example
Compose multiple stencil stages (operations with data dependencies are placed in different stages) and build sweeps (sequential loops in K):

  define_loops(
    define_sweep<ckincrement>(
      define_stages(
        StencilStage<Lap, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(),
        StencilStage<Lap2, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >()
      )
    )
  )

20 Computation on the fly or buffering?

$$\frac{\partial U}{\partial t} = \alpha \nabla^4 U = \alpha \nabla^2 (\nabla^2 U)$$
$$\nabla^2 U_{i,j} = \frac{U_{i+1,j} + U_{i-1,j} + U_{i,j+1} + U_{i,j-1} - 4 U_{i,j}}{\Delta^2}$$

21 Using DSL Temporaries / Using DSL Functions

Using DSL Temporaries:

StencilCompiler::Build(
  stencil,
  pack_parameters(...),
  define_temporaries(
    StencilBuffer<lap, double, KRange<FullDomain,0,0> >()
  ),
  define_loops(
    define_sweep<ckincrement>(
      define_stages(
        StencilStage<Lap, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(),
        StencilStage<Lap2, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >()
      )
    )
  )
);

Using DSL Functions:

template<typename TEnv>
struct Lap {
  STENCIL_STAGE(TEnv)
  STAGE_PARAMETER(u)
  STAGE_PARAMETER(diff)
  static void Do(Context ctx, FullDomain) {
    ctx[diff::Center()] = ctx[Call<LapFn>::With(Call<LapFn>::With(u::Center()))];
  }
};

StencilCompiler::Build(
  //...
  define_loops(
    define_sweep<ckincrement>(
      define_stages(
        StencilStage<Lap, IJRange<cIndented,0,0,0,0>, KRange<FullDomain,0,0> >()
      )
    )
  )
);
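The trade-off between the two variants can be made concrete with a small plain C++ sketch. It is not STELLA code: `Grid`, `lap5`, `diffusion_buffered` and `diffusion_on_the_fly` are hypothetical names, and the nested lambda only mimics the idea of calling a Laplacian function on the result of another Laplacian function.

```cpp
#include <cassert>
#include <vector>

// Hypothetical dense 2D grid, row-major.
struct Grid {
    int ni, nj;
    std::vector<double> v;
    Grid(int ni_, int nj_) : ni(ni_), nj(nj_), v(ni_ * nj_, 0.0) {}
    double& at(int i, int j) { return v[i * nj + j]; }
    double at(int i, int j) const { return v[i * nj + j]; }
};

// 5-point Laplacian of any callable f(i, j).
template <typename F>
double lap5(const F& f, int i, int j) {
    return f(i + 1, j) + f(i - 1, j) + f(i, j + 1) + f(i, j - 1) - 4.0 * f(i, j);
}

// Buffering: the inner Laplacian is stored in a temporary field and read back.
inline Grid diffusion_buffered(const Grid& u) {
    Grid lap(u.ni, u.nj), out(u.ni, u.nj);
    auto U = [&u](int i, int j) { return u.at(i, j); };
    for (int i = 1; i < u.ni - 1; ++i)
        for (int j = 1; j < u.nj - 1; ++j)
            lap.at(i, j) = lap5(U, i, j);
    auto L = [&lap](int i, int j) { return lap.at(i, j); };
    for (int i = 2; i < u.ni - 2; ++i)
        for (int j = 2; j < u.nj - 2; ++j)
            out.at(i, j) = lap5(L, i, j);
    return out;
}

// On the fly: no temporary field; the inner Laplacian is recomputed for each
// of the five points the outer Laplacian reads.
inline Grid diffusion_on_the_fly(const Grid& u) {
    Grid out(u.ni, u.nj);
    auto U = [&u](int i, int j) { return u.at(i, j); };
    auto L = [&U](int i, int j) { return lap5(U, i, j); };
    for (int i = 2; i < u.ni - 2; ++i)
        for (int j = 2; j < u.nj - 2; ++j)
            out.at(i, j) = lap5(L, i, j);
    return out;
}
```

Both functions produce the same field. The buffered variant performs one inner Laplacian per point but needs storage and a second pass; the on-the-fly variant needs neither but evaluates the inner Laplacian five times per output point.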

22 [Figure: ∇²U, ∇⁴U, ∇²U]

23 Loop and Kernel Fusion
STELLA applies loop fusion to stages in a sweep (GPU only).

define_loops(
  define_sweep<ckincrement>(
    define_stages(
      StencilStage<Lap, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(),
      StencilStage<Result, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >()
    )
  )
);

const int i = threadIdx.x;
const int j = threadIdx.y;
int i_h = 0;
int j_h = 0;
if (j < 2) {
  i_h = i;
  j_h = (j == 0 ? -1 : blockDim.y);
} else if (j < 4 && i <= blockDim.y) {
  i_h = (j == 2 ? -1 : blockDim.x);
  j_h = i;
}
for (int k = 0; k < kdim; ++k) {
  lap(i,j) = -4.0 * U(i,j,k) + U(i+1,j,k) + U(i-1,j,k) + U(i,j+1,k) + U(i,j-1,k);
  if (i_h != 0 || j_h != 0)
    lap(i_h,j_h) = -4.0 * U(i_h,j_h,k) + U(i_h+1,j_h,k) + U(i_h-1,j_h,k) + U(i_h,j_h+1,k) + U(i_h,j_h-1,k);
  __syncthreads();
  result(i,j) = -4*(i,j,k) - alpha(i,j,k) * (lap(i,j,k) - lap(i-1,j,k) + lap(i,j,k) - lap(i,j-1,k));
}


24 ! Loop fusion requires duplicated computation at halo points (CUDA block: 32x?)
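The cost of fusing two dependent stages can be illustrated with a sequential CPU analogue of the tiled GPU kernel. The sketch is an assumption-laden illustration rather than STELLA output: `fused_diffusion` and `FusedResult` are hypothetical names, each tile plays the role of a CUDA block, and the small local `lap` array stands in for shared memory.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct FusedResult {
    std::vector<double> out;  // n x n, row-major; zero where the stencil does not fit
    long lap_evals;           // number of Laplacian evaluations performed
};

// Fused Lap + Result over square tiles of an n x n plane.
inline FusedResult fused_diffusion(const std::vector<double>& u, int n, int tile) {
    assert(n >= 5 && tile >= 1 && static_cast<int>(u.size()) == n * n);
    FusedResult r{std::vector<double>(n * n, 0.0), 0};
    const int w = tile + 2;          // tile width plus a one-point halo on each side
    std::vector<double> lap(w * w);  // tile-local buffer (stand-in for shared memory)
    for (int i0 = 2; i0 < n - 2; i0 += tile) {
        for (int j0 = 2; j0 < n - 2; j0 += tile) {
            const int i1 = std::min(i0 + tile, n - 2);
            const int j1 = std::min(j0 + tile, n - 2);
            // Stage 1: Laplacian on the tile and on its halo. Halo points are
            // computed again by the neighbouring tiles.
            for (int i = i0 - 1; i <= i1; ++i) {
                for (int j = j0 - 1; j <= j1; ++j) {
                    lap[(i - i0 + 1) * w + (j - j0 + 1)] =
                        -4.0 * u[i * n + j] + u[(i + 1) * n + j] + u[(i - 1) * n + j]
                        + u[i * n + j + 1] + u[i * n + j - 1];
                    ++r.lap_evals;
                }
            }
            // Stage 2: consume the tile-local Laplacian without leaving the tile.
            for (int i = i0; i < i1; ++i) {
                for (int j = j0; j < j1; ++j) {
                    const int c = (i - i0 + 1) * w + (j - j0 + 1);
                    r.out[i * n + j] = -4.0 * lap[c] + lap[c + w] + lap[c - w]
                                     + lap[c + 1] + lap[c - 1];
                }
            }
        }
    }
    return r;
}
```

For `n = 10` and `tile = 3` the fused version evaluates the Laplacian 100 times (four tiles of 5 x 5 points), whereas two separate passes would evaluate it 64 times (once per point of the 8 x 8 region that is needed). The extra evaluations are the duplicated halo work that fusion trades for not writing the intermediate field to main memory.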

25 Loop and Kernel Fusion
STELLA applies kernel fusion to all sweeps of a stencil.

define_loops(
  define_sweep<ckincrement>(
    define_stages(
      StencilStage<ForwardStage, IJRange<cIndented,0,0,0,0>, KRange<FullDomain,0,0> >()
    )
  ),
  define_sweep<ckdecrement>(
    define_stages(
      StencilStage<BackwardStage, IJRange<cIndented,0,0,0,0>, KRange<FullDomain,0,0> >()
    )
  )
);

26 Loop and Kernel Fusion
Performance implications of loop & kernel fusion

27 Data locality?
DSL elements for describing data reuse

define_loops(
  define_sweep<ckincrement>(
    define_caches(
      KCache<buff, clocal, KWindow<-2,1>, KRange<FullDomain> >()
    ),
    define_stages(
      StencilStage<Avg, IJRange<cIndented,0,0,0,0>, KRange<FullDomain,0,0> >(),
      StencilStage<Interp, IJRange<cIndented,0,0,0,0>, KRange<FullDomain,0,0> >()
    )
  )
)

[Diagram: Avg, Interp]


28 Data locality?
DSL elements for describing data reuse
- KCache: caches a window of a vertical column in registers:
    KCache<buff, clocal, KWindow<-2,1>, KRange<FullDomain> >
- IJKCache: caches a 3D window in shared memory:
    IJKCache<buff, cfill, IJWindow<-2,2,-2,2,-2,0>, KRange<FullDomain,0,0> >
- Policies: cfill, clocal, cflush, cfillandflush
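The effect of caching a vertical window can be sketched in plain C++ for a single column. The example is a simplified assumption, not the Avg/Interp stages of the slide: the intermediate `buff[k] = in[k] * in[k]` and the consumer `out[k] = 0.5 * (buff[k-1] + buff[k])` are made up for illustration, and the window spans only two levels.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference: the intermediate buff is a full temporary column in memory.
inline std::vector<double> column_with_buffer(const std::vector<double>& in) {
    const std::size_t n = in.size();
    std::vector<double> buff(n), out(n, 0.0);
    for (std::size_t k = 0; k < n; ++k) buff[k] = in[k] * in[k];
    for (std::size_t k = 1; k < n; ++k) out[k] = 0.5 * (buff[k - 1] + buff[k]);
    return out;
}

// K-cached: the same result in one sweep, with the window of buff held in two
// local variables that a compiler can keep in registers.
inline std::vector<double> column_with_kcache(const std::vector<double>& in) {
    const std::size_t n = in.size();
    std::vector<double> out(n, 0.0);
    double below = 0.0;                       // holds buff[k-1]
    for (std::size_t k = 0; k < n; ++k) {
        const double center = in[k] * in[k];  // buff[k], never written to memory
        if (k > 0) out[k] = 0.5 * (below + center);
        below = center;                       // slide the window up by one level
    }
    return out;
}
```

Because the sweep visits the levels in order, each intermediate value is produced once and consumed while it is still in the window, so the temporary column disappears entirely. This corresponds to a cache whose contents are neither filled from nor flushed to memory.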

29 KCache and IJKCache effect in COSMO dycore, K20c

30 Performance on the COSMO Dycore
Overall speedup of 1.8x for CPU and 5.8x for GPU with respect to the legacy code.

31 Scalability

32 Reusing operators
Some operators are applied to multiple fields, such as the concentrations of certain species in the atmosphere (e.g. pollen, sea salt, volcanic ash, ...).

$$\frac{\partial q_s}{\partial t} = \alpha \nabla^4 q_s, \qquad s \in \{1, \dots, 100\}, \qquad \alpha = \alpha(x, y, z)$$

Processing an operator on multiple fields in a single kernel is beneficial due to:
- Reuse of common intermediate computations
- Data locality of input coefficients or fields
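The benefit of handling all tracers in one pass can be sketched as follows. The example is deliberately simplified and rests on assumptions: it is one-dimensional, it uses a second-order diffusion step instead of the fourth-order operator of the slide, and `diffuse_tracers`, `alpha` and `mask` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Field = std::vector<double>;

// One pass over the grid points; all tracers are updated at each point.
// Simplified update (assumption): for every tracer s,
//   out_s[i] = q_s[i] + alpha[i] * mask[i] * (q_s[i-1] - 2 q_s[i] + q_s[i+1]).
inline std::vector<Field> diffuse_tracers(const std::vector<Field>& q,
                                          const Field& alpha, const Field& mask) {
    std::vector<Field> out = q;  // boundary points are copied unchanged
    const std::size_t n = alpha.size();
    assert(mask.size() == n);
    for (const Field& f : q) assert(f.size() == n);
    for (std::size_t i = 1; i + 1 < n; ++i) {
        // Shared by all tracers: loaded and combined once per grid point.
        const double coeff = alpha[i] * mask[i];
        for (std::size_t s = 0; s < q.size(); ++s)
            out[s][i] = q[s][i] + coeff * (q[s][i - 1] - 2.0 * q[s][i] + q[s][i + 1]);
    }
    return out;
}
```

Calling the operator once per tracer would reload `alpha` and `mask` and recompute `coeff` for every field. With the tracer loop innermost, the coefficient is computed once per point and stays in a register while all tracers are processed.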

33 Expandable parameters

// setup the tracer stencil
StencilCompiler::Build(
  stencil_,
  "HorizontalDiffusionTracers",
  repository.calculationdomain(),
  StencilConfiguration<Real, HorizontalDiffusionTracersBlockSize>(),
  pack_parameters(
    /* output fields */
    ExpandableParam<data_out, cinout>(tracout.begin(), tracout.end()),
    /* input fields */
    ExpandableParam<data_in, cin>(tracin.begin(), tracin.end()),
    Param<hdmasktr, cin>(repository.hdmask()),
    Param<ofahdx, cin>(repository.ofahdx()),
    Param<ofahdy, cin>(repository.ofahdy()),
    Param<crlato, cin>(repository.crlato()),
    Param<crlatu, cin>(repository.crlatu())
  ),
  define_temporaries(
    StencilExpandableBuffer<lap, Real, KRange<FullDomain,0,0> >(),
    StencilExpandableBuffer<flx, Real, KRange<FullDomain,0,0> >(),
    StencilExpandableBuffer<fly, Real, KRange<FullDomain,0,0> >(),
    StencilExpandableBuffer<rxp, Real, KRange<FullDomain,0,0> >(),
    StencilExpandableBuffer<rxm, Real, KRange<FullDomain,0,0> >()
  ),
  define_loops(
    define_sweep<ckincrement>(
      define_caches(
        IJCache<lap, KRange<FullDomain,0,0> >(),
        IJCache<flx, KRange<FullDomain,0,0> >(),
        IJCache<fly, KRange<FullDomain,0,0> >(),
        IJCache<rxp, KRange<FullDomain,0,0> >(),
        IJCache<rxm, KRange<FullDomain,0,0> >()
      ),
      define_stages(
        StencilStage<LapStage, IJRange<cComplete,-2,2,-2,2>, KRange<FullDomain,0,0> >(),
        StencilStage<FluxStage, IJRange<cComplete,-2,1,-2,1>, KRange<FullDomain,0,0> >(),
        StencilStage<RXStage, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(),
        StencilStage<LimitFluxStage, IJRange<cIndented,-1,0,-1,0>, KRange<FullDomain,0,0> >(),
        StencilStage<DataStage, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >()
      )
    )
  )
);

34 Expandable parameters

  pack_parameters(
    /* output fields */
    ExpandableParam<NumTracersPerStencil::value, data_out, cinout>(tracout.begin(), tracout.end()),
    /* input fields */
    ExpandableParam<NumTracersPerStencil::value, data_in, cin>(tracin.begin(), tracin.end()),
    Param<hdmasktr, cin>(repository.hdmask()),
    Param<ofahdx, cin>(repository.ofahdx()),
    Param<ofahdy, cin>(repository.ofahdy()),
    Param<crlato, cin>(repository.crlato()),
    Param<crlatu, cin>(repository.crlatu())
  ),

Expandable buffers:

  define_temporaries(
    StencilExpandableBuffer<lap, Real, KRange<FullDomain,0,0> >(),
    StencilExpandableBuffer<flx, Real, KRange<FullDomain,0,0> >(),
    StencilExpandableBuffer<fly, Real, KRange<FullDomain,0,0> >(),
    StencilExpandableBuffer<rxp, Real, KRange<FullDomain,0,0> >(),
    StencilExpandableBuffer<rxm, Real, KRange<FullDomain,0,0> >()
  ),

35 Expandable parameters
Expandable buffers (the StencilExpandableBuffer entries in define_temporaries) and expandable caches:

  define_caches(
    IJCache<lap, KRange<FullDomain,0,0> >(),
    IJCache<flx, KRange<FullDomain,0,0> >(),
    IJCache<fly, KRange<FullDomain,0,0> >(),
    IJCache<rxp, KRange<FullDomain,0,0> >(),
    IJCache<rxm, KRange<FullDomain,0,0> >()
  ),


36 Extending the Parallelization Model
- Tiling in the IJ plane
- Blocks are extended with halo points
- Parallel execution of the IJ plane
- K dimension executed sequentially by each (CUDA/OpenMP) thread
- Warp: 32 threads

37 Parallelization Model: KParallel
- Compact stencils of horizontal operators can be parallelized in K:
  - Vertical dependencies only on input fields.
  - Only horizontal dependencies on intermediate computed values.
- K parallelization increases parallelism and occupancy on GPUs.

38 Parallelization Model: KParallel

StencilCompiler::Build(
  stencil,
  pack_parameters(...),
  define_temporaries(
    StencilBuffer<lap, double, KRange<FullDomain,0,0> >()
  ),
  define_loops(
    define_sweep<ckparallel>(
      define_stages(
        StencilStage<Lap, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(),
        StencilStage<Result, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >()
      )
    )
  )
);

39 Parallelization Model: KParallel
[Figure: dycore stencils on 32x32, K20c]

40 Parallelization Model: Parallel Tridiagonal Solver
- Basic Thomas algorithm (sequential in K) for large domains.
- Performance deteriorates for small domains, due to lack of parallelism.
- STELLA integrates a parallel HPCR algorithm (Jeremy Appleyard, NVIDIA).
- Additionally, the system coefficients can be prepared on the fly in the same stencil, based on the prognostic variables.

41 Parallelization Model: Parallel Tridiagonal Solves
STELLA keyword to trigger an HPCR algorithm for the tridiagonal solve

StencilCompiler::Build(
  stencil_,
  "TridiagonalSolve_HPCR",
  repository.calculationdomain(),
  StencilConfiguration<Real, TridiagonalSolve_HPCRBlockSize>(),
  pack_parameters(
    /* output fields */
    Param<w_out, cinout>(repository.w_out()),
    Param<acol_dat, cinout>(repository.acol()),
    Param<bcol_dat, cinout>(repository.bcol()),
    Param<ccol_dat, cinout>(repository.ccol()),
    Param<dcol_dat, cinout>(repository.dcol())
  ),
  define_loops(
    define_sweep<ctridiagonalsolve>(
      define_stages(
        StencilStage<SetupStage, IJRange<cIndented,0,0,0,0>, KRange<FullDomain,0,0> >(),
        StencilStage<TridiagonalSolveFWStage, IJRange<cIndented,0,0,0,0>, KRange<FullDomain,0,0> >(),
        StencilStage<WriteOutputStage, IJRange<cIndented,0,0,0,0>, KRange<FullDomain,0,0> >()
      )
    )
  )
);

42 HPCR performance vs Thomas for domain sizes of 32 x J
[Figure: x-axis: J domain size; series: Thomas, HPCR]

43 Conclusions
- STELLA was used to port the dynamical core of COSMO to GPUs:
  - Retains a single source code.
  - Performance portable: 2 backends, CPU (1.8x) and GPU (5.8x).
  - C++ template metaprogramming -> interoperability.
- Extended parallelization modes further exploit performance for characteristic algorithmic motifs of STELLA.

44 BACKUP

