STELLA: A Domain Specific Language for Stencil Computations


1 STELLA: A Domain Specific Language for Stencil Computations
[...], Center for Climate Systems Modeling, ETH Zurich
Tobias Gysi, Department of Computer Science, ETH Zurich
Oliver Fuhrer, Federal Office of Meteorology and Climatology, MeteoSwiss, Zurich
Mauro Bianco, Swiss National Supercomputing Centre, CSCS Lugano
Thomas C. Schulthess, Swiss National Supercomputing Centre, CSCS Lugano
Work funded by the HP2C and PASC initiatives
Dagstuhl Seminar, 14th April

2 Outline
- Dynamical core of COSMO
- STELLA: DSL using C++ template metaprogramming
- STELLA DSL elements
- Exploiting performance: loop & kernel fusion, data locality
- Extending parallelization models

3 MOTIVATION: COSMO MODEL
- The dynamical core of COSMO solves the Navier-Stokes equations using finite difference methods on structured grids.
- Stencils on 3D grids: 2-7 km resolution in the horizontal, 60/80 levels in the vertical.
- Split-explicit time integrator, with slow processes integrated using a Runge-Kutta method.
- Collection of (Fortran) operators (advection, diffusion, fast waves, etc.) on prognostic variables (u, v, w, T, qx, ...).

4 COSMO time step
[Diagram: Initialization; then, for every time step Δt: boundary conditions, physics, dynamics, data assimilation, halo update, diagnostics, I/O; finally cleanup.]

5 COSMO Dynamical Core
Dynamical core: ~35 stencils per time step.
[Diagram: the Dynamics phase of the time step, expanded. Labels: vertical diffusion (compute T, ...); time integrator (FastWaves, RungeKutta) with N small time steps; start comm / wait comm for U, V, W; vertical diffusion (compute U, V, W); horizontal diffusion.]

6 Algorithmic Motifs In COSMO
Horizontal PDE operators are solved explicitly -> compact stencils with horizontal data dependencies

    for k=1, ke {
      for i=1, ie-1 {
        for j=1, je-1 {
          lap(i,j) = -4*U(i,j) + U(i+1,j) + U(i-1,j) + U(i,j+1) + U(i,j-1)
        }
      }
      for i=1, ie-1 {
        for j=1, je-1 {
          diff(i,j) = -4*lap(i,j) + lap(i+1,j) + lap(i-1,j) + lap(i,j+1) + lap(i,j-1)
        }
      }
    }
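The two-stage horizontal motif above can be sketched in plain C++. This is only an illustration under stated assumptions, not code from COSMO or STELLA: `Field2D`, `laplacian_at` and `double_laplacian` are hypothetical names, the grid is 2D (one K level), and the second stage is restricted to points whose neighbours were all written by the first stage.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: a dense 2D field, not a STELLA data field.
struct Field2D {
    std::size_t ni, nj;
    std::vector<double> data;
    Field2D(std::size_t ni_, std::size_t nj_) : ni(ni_), nj(nj_), data(ni_ * nj_, 0.0) {}
    double& operator()(std::size_t i, std::size_t j) { return data[i * nj + j]; }
    double operator()(std::size_t i, std::size_t j) const { return data[i * nj + j]; }
};

// Five-point Laplacian at (i, j); the grid-spacing factor is omitted, as on the slide.
inline double laplacian_at(const Field2D& f, std::size_t i, std::size_t j) {
    return -4.0 * f(i, j) + f(i + 1, j) + f(i - 1, j) + f(i, j + 1) + f(i, j - 1);
}

// Stage 1 writes lap on [1, n-2]; stage 2 reads lap and writes diff on [2, n-3].
// Points outside those ranges are left at zero.
Field2D double_laplacian(const Field2D& u) {
    assert(u.ni >= 5 && u.nj >= 5);
    Field2D lap(u.ni, u.nj), diff(u.ni, u.nj);
    for (std::size_t i = 1; i + 1 < u.ni; ++i)
        for (std::size_t j = 1; j + 1 < u.nj; ++j)
            lap(i, j) = laplacian_at(u, i, j);
    for (std::size_t i = 2; i + 2 < u.ni; ++i)
        for (std::size_t j = 2; j + 2 < u.nj; ++j)
            diff(i, j) = laplacian_at(lap, i, j);
    return diff;
}
```

The second loop nest can only start once the first has finished for all neighbouring points, which is the horizontal data dependency the slide refers to.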

7 Algorithmic Motifs In COSMO
Vertical PDE operators are solved implicitly -> tridiagonal systems

    // forward sweep
    for j=1, je {
      for i=1, ie {
        for k=2, ke {
          c(i,j,k) = 1.0 / (b(i,j,k) - c(i,j,k-1) * a(i,j,k))
          d(i,j,k) = (d(i,j,k) - d(i,j,k-1) * a(i,j,k)) * c(i,j,k)
        }
      }
    }
    // backward substitution
    for j=1, je {
      for i=1, ie {
        for k=ke-1, 1, -1 {
          x(i,j,k) = d(i,j,k) - c(i,j,k) * x(i,j,k+1)
        }
      }
    }
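For reference, the vertical motif corresponds to the textbook Thomas algorithm applied to each (i, j) column. The sketch below is a generic single-column version, not the COSMO implementation; the function name `thomas_solve` and the convention that `a`, `b`, `c` hold the lower, main and upper diagonals are assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Solves a[k]*x[k-1] + b[k]*x[k] + c[k]*x[k+1] = d[k] for one column.
// a[0] and c[n-1] are ignored. No pivoting is done, so the sketch assumes
// a well-conditioned (e.g. diagonally dominant) system.
std::vector<double> thomas_solve(const std::vector<double>& a,
                                 const std::vector<double>& b,
                                 std::vector<double> c,   // copied: overwritten below
                                 std::vector<double> d) { // copied: overwritten below
    const std::size_t n = b.size();
    assert(n > 0 && a.size() == n && c.size() == n && d.size() == n);

    // Forward sweep: eliminate the lower diagonal, level by level.
    c[0] /= b[0];
    d[0] /= b[0];
    for (std::size_t k = 1; k < n; ++k) {
        const double inv = 1.0 / (b[k] - a[k] * c[k - 1]);
        c[k] *= inv;
        d[k] = (d[k] - a[k] * d[k - 1]) * inv;
    }

    // Backward substitution: runs from the last level down to the first.
    std::vector<double> x(n);
    x[n - 1] = d[n - 1];
    for (std::size_t k = n - 1; k-- > 0;)
        x[k] = d[k] - c[k] * x[k + 1];
    return x;
}
```

Both loops carry a dependency from one level to the next, so each column is inherently sequential in K; parallelism comes from the independent columns in the IJ plane.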

8 Model development

$U = U(x, z, t)$, with initial condition $U(x, z, t=0) = u_0(x, z)$

$\frac{\partial U}{\partial t} = \alpha \nabla^4 U = \alpha \nabla^2 (\nabla^2 U)$

$\nabla^2 U_{i,j} = \frac{U_{i+1,j} + U_{i-1,j} + U_{i,j+1} + U_{i,j-1} - 4 U_{i,j}}{\Delta^2}$

    for i=1, ie-1
      for j=1, je-1
        lap(i,j) = -4*U(i,j) + U(i+1,j) + U(i-1,j) + U(i,j+1) + U(i,j-1)
    for i=1, ie-1
      for j=1, je-1
        diff(i,j) = -4*lap(i,j) + lap(i+1,j) + lap(i-1,j) + lap(i,j+1) + lap(i,j-1)


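10

$U = U(x, z, t)$, with initial condition $U(x, z, t=0) = u_0(x, z)$

$\frac{\partial U}{\partial t} = \alpha \nabla^4 U = \alpha \nabla^2 (\nabla^2 U)$

$\nabla^2 U_{i,j} = \frac{U_{i+1,j} + U_{i-1,j} + U_{i,j+1} + U_{i,j-1} - 4 U_{i,j}}{\Delta^2}$

?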

11

$\nabla^2 U_{i,j} = \frac{U_{i+1,j} + U_{i-1,j} + U_{i,j+1} + U_{i,j-1} - 4 U_{i,j}}{\Delta^2}$

    static void Lap(Context ctx)
    {
        ctx[lap::Center()] =
            ctx[u::At(iplus1)] + ctx[u::At(iminus1)] +
            ctx[u::At(jplus1)] + ctx[u::At(jminus1)] -
            4*ctx[u::Center()];
    }

12

    const int i = threadIdx.x;
    const int j = threadIdx.y;
    int i_h = 0;
    int j_h = 0;
    if (j < 2) {
        i_h = i;
        j_h = (j==0 ? -1 : blockDim.y);
    }
    else if (j < 4 && i <= blockDim.y) {
        i_h = (j==2 ? -1 : blockDim.x);
        j_h = i;
    }
    for (int k=0; k < kdim; ++k) {
        lap(i,j) = -4 * U(i,j,k) + U(i+1,j,k) + U(i-1,j,k) + U(i,j+1,k) + U(i,j-1,k);
        if (i_h != 0 || j_h != 0)
            lap(i_h,j_h) = -4 * U(i_h,j_h,k) + U(i_h+1,j_h,k) + U(i_h-1,j_h,k) + U(i_h,j_h+1,k) + U(i_h,j_h-1,k);
        __syncthreads();
        result(i,j) = -4*(i,j,k) - alpha(i,j,k)*( lap(i,j,k) - lap(i-1,j,k) + lap(i,j,k) - lap(i,j-1,k) );
    }

13 STELLA DSL using C++ template metaprogramming
- Separates model and algorithm from hardware-specific implementation and optimizations.
- Single source code compiling for multiple architectures, performance portable.
- Concise syntax, close to the mathematical description of operators.
- Same toolchain as the full model (access to debugger/data fields, interoperability with other programming languages).
- Supports CPU & GPU backends.
- Used to rewrite a full dynamical core (COSMO).

14 How does it work?
- Operators are encapsulated as type information (functors).
- Types are composed, organized and manipulated.

    template<typename Context>
    struct LapStage
    {
        static void Do(Context ctx)
        {
            ctx[lap::Center()] =
                ctx[u::At(iplus1)] + ctx[u::At(iminus1)] +
                ctx[u::At(jplus1)] + ctx[u::At(jminus1)] -
                4*ctx[u::Center()];
        }
    };

- Specific backends expand the code of operators into template kernels with loop structures.
- Many optimizations (memory layout, tiling, looping, data locality) depend on the backend.
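The idea of "operators as types, expanded by a backend into loops" can be illustrated with a much smaller stand-in. Everything in this sketch is hypothetical and far simpler than STELLA: `Context` is a plain struct rather than STELLA's context, `apply_on_interior` plays the role of a backend, and fields are flat row-major `std::vector<double>` arrays.

```cpp
#include <cassert>
#include <vector>

// Hypothetical minimal context: exposes the fields relative to the current point.
struct Context {
    const std::vector<double>* in;
    std::vector<double>* out;
    int nj;    // extent of the j dimension (row-major storage)
    int i, j;  // current grid point, set by the backend loop
    double u(int di, int dj) const { return (*in)[(i + di) * nj + (j + dj)]; }
    double& lap() const { return (*out)[i * nj + j]; }
};

// The operator is a type with a static Do method; it contains no loops.
struct LapStage {
    static void Do(Context& ctx) {
        ctx.lap() = ctx.u(1, 0) + ctx.u(-1, 0) + ctx.u(0, 1) + ctx.u(0, -1)
                    - 4.0 * ctx.u(0, 0);
    }
};

// A toy "backend": takes the stage as a template parameter and supplies the
// loop nest. A different backend could tile, reorder or parallelize these loops
// without touching LapStage.
template <typename Stage>
void apply_on_interior(const std::vector<double>& in, std::vector<double>& out,
                       int ni, int nj) {
    assert(ni >= 3 && nj >= 3);
    assert(static_cast<int>(in.size()) == ni * nj && in.size() == out.size());
    Context ctx{&in, &out, nj, 0, 0};
    for (int i = 1; i < ni - 1; ++i) {
        for (int j = 1; j < nj - 1; ++j) {
            ctx.i = i;
            ctx.j = j;
            Stage::Do(ctx);
        }
    }
}
```

A call such as `apply_on_interior<LapStage>(in, out, ni, nj)` resolves the stage at compile time, so the compiler can inline `Do` into the loop body.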

15 Stencil Stages
- Stencil stages describe the single operation applied to each grid point.
- Parallel execution in the IJ plane is assumed.

    template<typename TEnv>
    struct Lap
    {
        STENCIL_STAGE(TEnv)
        STAGE_PARAMETER(u)
        STAGE_PARAMETER(lap)

        static void Do(Context ctx, FullDomain)
        {
            ctx[lap::Center()] =
                -4*ctx[u::Center()] +
                ctx[u::At(iplus1)] + ctx[u::At(iminus1)] +
                ctx[u::At(jplus1)] + ctx[u::At(jminus1)];
        }
    };

16 Full example
Instantiate data fields (memory layout abstracted, backend dependent):

    IJKRealField in_, out_;

    StencilCompiler::Build(
        Stencil,
        pack_parameters(
            Param<in, cIn, cDataField>(in_),
            Param<out, cInOut, cDataField>(out_)
        ),
        define_temporaries(
            StencilBuffer<lap, double, KRange<FullDomain,0,0> >()
        ),
        define_loops(
            define_sweep<cKIncrement>(
                define_stages(
                    StencilStage<Lap, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(),
                    StencilStage<Lap2, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >()
                )
            )
        )
    );

17 Full example
Pack the parameters and associate them with the placeholders used in the stages (same code as slide 16; the relevant part is):

    pack_parameters(
        Param<in, cIn, cDataField>(in_),
        Param<out, cInOut, cDataField>(out_)
    ),

18 Full example
Define the temporary buffers used by the stencil (same code as slide 16; the relevant part is):

    define_temporaries(
        StencilBuffer<lap, double, KRange<FullDomain,0,0> >()
    ),

19 Full example
Compose multiple stencil stages (operations with data dependencies are placed in different stages) and build sweeps (sequential loops in K). Same code as slide 16; the relevant part is:

    define_loops(
        define_sweep<cKIncrement>(
            define_stages(
                StencilStage<Lap, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(),
                StencilStage<Lap2, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >()
            )
        )
    )

20 Computation on the fly or buffering?

$\frac{\partial U}{\partial t} = \alpha \nabla^4 U = \alpha \nabla^2 (\nabla^2 U)$

$\nabla^2 U_{i,j} = \frac{U_{i+1,j} + U_{i-1,j} + U_{i,j+1} + U_{i,j-1} - 4 U_{i,j}}{\Delta^2}$

21 Computing $\nabla^4 U$ from $\nabla^2 U$: DSL temporaries vs. DSL functions

Using DSL temporaries:

    StencilCompiler::Build(
        stencil,
        pack_parameters(...),
        define_temporaries(
            StencilBuffer<lap, double, KRange<FullDomain,0,0> >()
        ),
        define_loops(
            define_sweep<cKIncrement>(
                define_stages(
                    StencilStage<Lap, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(),
                    StencilStage<Lap2, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >()
                )
            )
        )
    );

Using DSL functions:

    template<typename TEnv>
    struct Lap
    {
        STENCIL_STAGE(TEnv)
        STAGE_PARAMETER(u)
        STAGE_PARAMETER(diff)

        static void Do(Context ctx, FullDomain)
        {
            ctx[diff::Center()] = ctx[Call<LapFn>::With(Call<LapFn>::With(u::Center()))];
        }
    };

    StencilCompiler::Build(
        //...
        define_loops(
            define_sweep<cKIncrement>(
                define_stages(
                    StencilStage<Lap, IJRange<cIndented,0,0,0,0>, KRange<FullDomain,0,0> >()
                )
            )
        )
    );
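The trade-off between the two variants can be made concrete with a plain C++ sketch. It is a simplified stand-in, not STELLA code: it works in 1D with the three-point second difference, and the names `second_difference`, `diff_buffered` and `diff_on_the_fly` are invented for the example. The counter records how often the inner operator is evaluated.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// 1D second difference of any callable f(int) -> double.
template <typename F>
double second_difference(const F& f, int i) {
    return f(i - 1) - 2.0 * f(i) + f(i + 1);
}

struct DiffResult {
    std::vector<double> diff;  // second difference applied twice
    long inner_evals;          // evaluations of the inner second difference
};

// "Temporaries" variant: the inner result is stored once and read back.
DiffResult diff_buffered(const std::vector<double>& u) {
    const int n = static_cast<int>(u.size());
    assert(n >= 5);
    std::vector<double> lap(u.size(), 0.0), diff(u.size(), 0.0);
    long evals = 0;
    auto u_at = [&u](int i) -> double { return u[static_cast<std::size_t>(i)]; };
    for (int i = 1; i < n - 1; ++i) {
        lap[static_cast<std::size_t>(i)] = second_difference(u_at, i);
        ++evals;
    }
    auto lap_at = [&lap](int i) -> double { return lap[static_cast<std::size_t>(i)]; };
    for (int i = 2; i < n - 2; ++i)
        diff[static_cast<std::size_t>(i)] = second_difference(lap_at, i);
    return {diff, evals};
}

// "Functions" variant: the inner operator is recomputed wherever it is needed.
DiffResult diff_on_the_fly(const std::vector<double>& u) {
    const int n = static_cast<int>(u.size());
    assert(n >= 5);
    std::vector<double> diff(u.size(), 0.0);
    long evals = 0;
    auto u_at = [&u](int i) -> double { return u[static_cast<std::size_t>(i)]; };
    auto lap_at = [&u_at, &evals](int i) -> double {
        ++evals;
        return second_difference(u_at, i);
    };
    for (int i = 2; i < n - 2; ++i)
        diff[static_cast<std::size_t>(i)] = second_difference(lap_at, i);
    return {diff, evals};
}
```

Both functions return the same values; the buffered variant pays with an extra array and an extra pass over memory, the on-the-fly variant with roughly three times as many inner evaluations in this 1D setting.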


23 Loop and Kernel Fusion
STELLA applies loop fusion to the stages in a sweep (GPU only):

    define_loops(
        define_sweep<cKIncrement>(
            define_stages(
                StencilStage<Lap, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(),
                StencilStage<Result, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >()
            )
        )
    );

The slide shows this definition next to the fused CUDA kernel listed on slide 12, in which both stages are evaluated inside a single loop over k.

24 Loop fusion requires duplicated computation at the halo points of a CUDA block (32x?)
Same CUDA kernel as on slide 12; the part that handles the halo is:

    int i_h = 0;
    int j_h = 0;
    if (j < 2) {
        i_h = i;
        j_h = (j==0 ? -1 : blockDim.y);
    }
    else if (j < 4 && i <= blockDim.y) {
        i_h = (j==2 ? -1 : blockDim.x);
        j_h = i;
    }
    ...
    if (i_h != 0 || j_h != 0)
        lap(i_h,j_h) = -4 * U(i_h,j_h,k) + U(i_h+1,j_h,k) + U(i_h-1,j_h,k) + U(i_h,j_h+1,k) + U(i_h,j_h-1,k);
    __syncthreads();
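The cost of fusing two dependent stages inside one tile can be illustrated on the CPU. The sketch below is a hypothetical 1D analogue, not the STELLA GPU backend: each tile recomputes the first stage on one extra point per side (its halo) into a small scratch buffer, which stands in for shared memory, and then applies the second stage. `diff_fused_tiled` and `FusedResult` are names made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct FusedResult {
    std::vector<double> diff;  // second difference applied twice
    long lap_evals;            // first-stage evaluations, including halo duplicates
};

FusedResult diff_fused_tiled(const std::vector<double>& u, int block) {
    const int n = static_cast<int>(u.size());
    assert(n >= 5 && block >= 1);
    std::vector<double> diff(u.size(), 0.0);
    std::vector<double> tile_lap(block + 2);  // tile plus one halo point per side
    long evals = 0;
    for (int s = 2; s < n - 2; s += block) {
        const int e = std::min(s + block, n - 2);  // tile covers [s, e)
        // Stage 1 on the tile and its halo: points s-1 .. e.
        for (int i = s - 1; i <= e; ++i) {
            tile_lap[i - (s - 1)] = u[i - 1] - 2.0 * u[i] + u[i + 1];
            ++evals;
        }
        // Stage 2 on the tile only, reading stage 1 from the scratch buffer.
        for (int i = s; i < e; ++i) {
            const int l = i - (s - 1);
            diff[i] = tile_lap[l - 1] - 2.0 * tile_lap[l] + tile_lap[l + 1];
        }
    }
    return {diff, evals};
}
```

With ten points and tiles of three, the first stage is evaluated ten times instead of the eight needed without tiling; the two extra evaluations are the halo points shared by neighbouring tiles.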

25 Loop and Kernel Fusion
STELLA applies kernel fusion to all sweeps of a stencil:

    define_loops(
        define_sweep<cKIncrement>(
            define_stages(
                StencilStage<ForwardStage, IJRange<cIndented,0,0,0,0>, KRange<FullDomain,0,0> >()
            )
        ),
        define_sweep<cKDecrement>(
            define_stages(
                StencilStage<BackwardStage, IJRange<cIndented,0,0,0,0>, KRange<FullDomain,0,0> >()
            )
        )
    );

26 Loop and Kernel Fusion
[Figure: performance implications of loop & kernel fusion]

27 Data locality? DSL elements for describing data reuse

    define_loops(
        define_sweep<cKIncrement>(
            define_caches(
                KCache<buff, cLocal, KWindow<-2,1>, KRange<FullDomain> >()
            ),
            define_stages(
                StencilStage<Avg, IJRange<cIndented,0,0,0,0>, KRange<FullDomain,0,0> >(),
                StencilStage<Interp, IJRange<cIndented,0,0,0,0>, KRange<FullDomain,0,0> >()
            )
        )
    )

[Diagram: stages Avg and Interp]

28 Data locality? DSL elements for describing data reuse
- KCache: caches a window of a vertical column in registers:
      KCache<buff, cLocal, KWindow<-2,1>, KRange<FullDomain> >
- IJKCache: caches a 3D window in shared memory:
      IJKCache<buff, cFill, IJWindow<-2,2,-2,2,-2,0>, KRange<FullDomain,0,0> >
- Policies: cFill, cLocal, cFlush, cFillAndFlush
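What a K cache buys can be shown with a hand-written column loop. This is a sketch under assumptions, not what STELLA generates: the two stages are invented for the example (an average of neighbouring levels, followed by a vertical difference of that average), and the cached window is just two levels wide instead of the `KWindow<-2,1>` on the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One vertical column, two dependent stages:
//   stage 1: buff[k] = 0.5 * (in[k] + in[k+1])
//   stage 2: out[k]  = buff[k] - buff[k-1]
// The intermediate "buff" never touches memory: only the two levels that
// stage 2 needs are kept in scalars (registers) and shifted as k advances.
std::vector<double> column_with_kcache(const std::vector<double>& in) {
    const std::size_t n = in.size();
    assert(n >= 3);
    std::vector<double> out(n, 0.0);
    double buff_km1 = 0.5 * (in[0] + in[1]);  // buff at level k-1
    for (std::size_t k = 1; k + 1 < n; ++k) {
        const double buff_k = 0.5 * (in[k] + in[k + 1]);
        out[k] = buff_k - buff_km1;
        buff_km1 = buff_k;  // slide the window one level up
    }
    return out;
}
```

Because the sweep is sequential in K, each intermediate value is computed once and reused at the next level, which is why this kind of cache fits the K direction in particular.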

29 KCache and IJKCache effect in the COSMO dycore
[Figure: measurements on a K20c]

30 Performance on the COSMO Dycore
Overall speedup of 1.8x for CPU and 5.8x for GPU with respect to the legacy code.

31 Scalability

32 Reusing operators
Some operators are applied to multiple fields, such as the concentrations of certain species in the atmosphere (e.g. pollen, sea salt, volcanic ash, ...):

$\frac{\partial q_s}{\partial t} = \alpha \nabla^4 q_s, \quad s \in \{1, \dots, 100\}, \quad \alpha = \alpha(x, y, z)$

Processing an operator on multiple fields in a single kernel is beneficial due to:
- Reuse of common intermediate computations
- Data locality of input coefficients or fields
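A minimal illustration of the second benefit, with assumptions stated up front: the operator is simplified to a 1D second difference rather than the fourth-order operator above, tracers are plain vectors, and `diffuse_tracers` is a hypothetical name. The point is only the loop order: the coefficient is loaded once per grid point and reused for every tracer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Field = std::vector<double>;

// Applies out_s[i] = in_s[i] + alpha[i] * (in_s[i-1] - 2*in_s[i] + in_s[i+1])
// to every tracer s in a single pass over the grid. Boundary points are untouched.
void diffuse_tracers(const std::vector<Field>& in, const Field& alpha,
                     std::vector<Field>& out) {
    assert(in.size() == out.size());
    const std::size_t n = alpha.size();
    for (std::size_t s = 0; s < in.size(); ++s)
        assert(in[s].size() == n && out[s].size() == n);

    for (std::size_t i = 1; i + 1 < n; ++i) {
        const double a = alpha[i];  // loaded once, shared by all tracers
        for (std::size_t s = 0; s < in.size(); ++s) {
            const Field& q = in[s];
            out[s][i] = q[i] + a * (q[i - 1] - 2.0 * q[i] + q[i + 1]);
        }
    }
}
```

Running the operator once per tracer instead would reload `alpha` for every field.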

33 Expandable parameters

    // setup the tracer stencil
    StencilCompiler::Build(
        stencil_,
        "HorizontalDiffusionTracers",
        repository.calculationdomain(),
        StencilConfiguration<Real, HorizontalDiffusionTracersBlockSize>(),
        pack_parameters(
            /* output fields */
            ExpandableParam<data_out, cInOut>(tracout.begin(), tracout.end()),
            /* input fields */
            ExpandableParam<data_in, cIn>(tracin.begin(), tracin.end()),
            Param<hdmasktr, cIn>(repository.hdmask()),
            Param<ofahdx, cIn>(repository.ofahdx()),
            Param<ofahdy, cIn>(repository.ofahdy()),
            Param<crlato, cIn>(repository.crlato()),
            Param<crlatu, cIn>(repository.crlatu())
        ),
        define_temporaries(
            StencilExpandableBuffer<lap, Real, KRange<FullDomain,0,0> >(),
            StencilExpandableBuffer<flx, Real, KRange<FullDomain,0,0> >(),
            StencilExpandableBuffer<fly, Real, KRange<FullDomain,0,0> >(),
            StencilExpandableBuffer<rxp, Real, KRange<FullDomain,0,0> >(),
            StencilExpandableBuffer<rxm, Real, KRange<FullDomain,0,0> >()
        ),
        define_loops(
            define_sweep<cKIncrement>(
                define_caches(
                    IJCache<lap, KRange<FullDomain,0,0> >(),
                    IJCache<flx, KRange<FullDomain,0,0> >(),
                    IJCache<fly, KRange<FullDomain,0,0> >(),
                    IJCache<rxp, KRange<FullDomain,0,0> >(),
                    IJCache<rxm, KRange<FullDomain,0,0> >()
                ),
                define_stages(
                    StencilStage<LapStage, IJRange<cComplete,-2,2,-2,2>, KRange<FullDomain,0,0> >(),
                    StencilStage<FluxStage, IJRange<cComplete,-2,1,-2,1>, KRange<FullDomain,0,0> >(),
                    StencilStage<RXStage, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(),
                    StencilStage<LimitFluxStage, IJRange<cIndented,-1,0,-1,0>, KRange<FullDomain,0,0> >(),
                    StencilStage<DataStage, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >()
                )
            )
        )
    );

34 Expandable parameters
Same tracer stencil setup as on slide 33. On this slide the expandable parameters take the number of tracers per stencil as their first template argument, and the temporaries are annotated "Expandable buffers":

    /* output fields */
    ExpandableParam<NumTracersPerStencil::value, data_out, cInOut>(tracout.begin(), tracout.end()),
    /* input fields */
    ExpandableParam<NumTracersPerStencil::value, data_in, cIn>(tracin.begin(), tracin.end()),
    ...
    define_temporaries(
        StencilExpandableBuffer<lap, Real, KRange<FullDomain,0,0> >(),
        StencilExpandableBuffer<flx, Real, KRange<FullDomain,0,0> >(),
        StencilExpandableBuffer<fly, Real, KRange<FullDomain,0,0> >(),
        StencilExpandableBuffer<rxp, Real, KRange<FullDomain,0,0> >(),
        StencilExpandableBuffer<rxm, Real, KRange<FullDomain,0,0> >()
    ),

35 Expandable parameters
Same tracer stencil setup as on slide 33, carrying two annotations: "Expandable buffers" (the StencilExpandableBuffer temporaries) and "Expandable caches" (the IJCache entries):

    define_temporaries(
        StencilExpandableBuffer<lap, Real, KRange<FullDomain,0,0> >(),
        ...
    ),
    define_loops(
        define_sweep<cKIncrement>(
            define_caches(
                IJCache<lap, KRange<FullDomain,0,0> >(),
                ...
            ),
            ...

36 Extending the Parallelization Model
- Tiling in the IJ plane
- Blocks are extended with halo points
- Parallel execution in the IJ plane
- K dimension executed sequentially by each (CUDA/OpenMP) thread
- Warp: 32 threads

37 Parallelization Model: KParallel
- Compact stencils of horizontal operators can be parallelized in K.
- Vertical dependencies only on input fields.
- Only horizontal dependencies on intermediate computed values.
- K parallelization increases parallelism and occupancy on GPUs.

38 Parallelization Model: KParallel

    StencilCompiler::Build(
        stencil,
        pack_parameters(...),
        define_temporaries(
            StencilBuffer<lap, double, KRange<FullDomain,0,0> >()
        ),
        define_loops(
            define_sweep<cKParallel>(
                define_stages(
                    StencilStage<Lap, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(),
                    StencilStage<Result, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >()
                )
            )
        )
    );

39 Parallelization Model: KParallel
[Figure: dycore stencils on a 32x32 domain, K20c]

40 Parallelization Model: Parallel Tridiagonal Solver
- Basic Thomas algorithm (sequential in K) for large domains.
- Performance deteriorates for small domains due to the lack of parallelism.
- STELLA integrates a parallel HPCR algorithm (Jeremy Appleyard, NVIDIA).
- Additionally, the system coefficients can be prepared on the fly in the same stencil, based on the prognostic variables.

41 Parallelization Model: Parallel Tridiagonal Solves
STELLA keyword to trigger an HPCR algorithm for the tridiagonal solve:

    StencilCompiler::Build(
        stencil_,
        "TridiagonalSolve_HPCR",
        repository.calculationdomain(),
        StencilConfiguration<Real, TridiagonalSolve_HPCRBlockSize>(),
        pack_parameters(
            /* output fields */
            Param<w_out, cInOut>(repository.w_out()),
            Param<acol_dat, cInOut>(repository.acol()),
            Param<bcol_dat, cInOut>(repository.bcol()),
            Param<ccol_dat, cInOut>(repository.ccol()),
            Param<dcol_dat, cInOut>(repository.dcol())
        ),
        define_loops(
            define_sweep<ctridiagonalsolve>(
                define_stages(
                    StencilStage<SetupStage, IJRange<cIndented,0,0,0,0>, KRange<FullDomain,0,0> >(),
                    StencilStage<TridiagonalSolveFWStage, IJRange<cIndented,0,0,0,0>, KRange<FullDomain,0,0> >(),
                    StencilStage<WriteOutputStage, IJRange<cIndented,0,0,0,0>, KRange<FullDomain,0,0> >()
                )
            )
        )
    );

42 HPCR performance vs. Thomas for domain sizes of 32 x (J domain size)
[Figure: plot with the series Thomas and HPCR]

43 Conclusions
- STELLA was used to port the dynamical core of COSMO to GPUs.
- Retains a single source code.
- Performance portable: two backends, CPU (1.8x) and GPU (5.8x).
- C++ template metaprogramming -> interoperability.
- Extended parallelization modes further exploit performance for characteristic algorithmic motifs of STELLA.



More information

GPU & High Performance Computing (by NVIDIA) CUDA. Compute Unified Device Architecture Florian Schornbaum

GPU & High Performance Computing (by NVIDIA) CUDA. Compute Unified Device Architecture Florian Schornbaum GPU & High Performance Computing (by NVIDIA) CUDA Compute Unified Device Architecture 29.02.2008 Florian Schornbaum GPU Computing Performance In the last few years the GPU has evolved into an absolute

More information

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou Study and implementation of computational methods for Differential Equations in heterogeneous systems Asimina Vouronikoy - Eleni Zisiou Outline Introduction Review of related work Cyclic Reduction Algorithm

More information

Physis: An Implicitly Parallel Framework for Stencil Computa;ons

Physis: An Implicitly Parallel Framework for Stencil Computa;ons Physis: An Implicitly Parallel Framework for Stencil Computa;ons Naoya Maruyama RIKEN AICS (Formerly at Tokyo Tech) GTC12, May 2012 1 è Good performance with low programmer produc;vity Mul;- GPU Applica;on

More information

Technische Universität München. GPU Programming. Rüdiger Westermann Chair for Computer Graphics & Visualization. Faculty of Informatics

Technische Universität München. GPU Programming. Rüdiger Westermann Chair for Computer Graphics & Visualization. Faculty of Informatics GPU Programming Rüdiger Westermann Chair for Computer Graphics & Visualization Faculty of Informatics Overview Programming interfaces and support libraries The CUDA programming abstraction An in-depth

More information

Efficient 3D Stencil Computations Using CUDA

Efficient 3D Stencil Computations Using CUDA Efficient 3D Stencil Computations Using CUDA Marcin Krotkiewski,Marcin Dabrowski October 2011 Abstract We present an efficient implementation of 7 point and 27 point stencils on high-end Nvidia GPUs. A

More information

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka USE OF FOR Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague Mini workshop on advanced numerical methods

More information

Code Generators for Stencil Auto-tuning

Code Generators for Stencil Auto-tuning Code Generators for Stencil Auto-tuning Shoaib Kamil with Cy Chan, Sam Williams, Kaushik Datta, John Shalf, Katherine Yelick, Jim Demmel, Leonid Oliker Diagnosing Power/Performance Correctness Where this

More information

Implementation of Adaptive Coarsening Algorithm on GPU using CUDA

Implementation of Adaptive Coarsening Algorithm on GPU using CUDA Implementation of Adaptive Coarsening Algorithm on GPU using CUDA 1. Introduction , In scientific computing today, the high-performance computers grow

More information

Convolution Soup: A case study in CUDA optimization. The Fairmont San Jose Joe Stam

Convolution Soup: A case study in CUDA optimization. The Fairmont San Jose Joe Stam Convolution Soup: A case study in CUDA optimization The Fairmont San Jose Joe Stam Optimization GPUs are very fast BUT Poor programming can lead to disappointing performance Squeaking out the most speed

More information

High Performance Computing and GPU Programming

High Performance Computing and GPU Programming High Performance Computing and GPU Programming Lecture 1: Introduction Objectives C++/CPU Review GPU Intro Programming Model Objectives Objectives Before we begin a little motivation Intel Xeon 2.67GHz

More information

OpenACC programming for GPGPUs: Rotor wake simulation

OpenACC programming for GPGPUs: Rotor wake simulation DLR.de Chart 1 OpenACC programming for GPGPUs: Rotor wake simulation Melven Röhrig-Zöllner, Achim Basermann Simulations- und Softwaretechnik DLR.de Chart 2 Outline Hardware-Architecture (CPU+GPU) GPU computing

More information

GPU Acceleration of the Longwave Rapid Radiative Transfer Model in WRF using CUDA Fortran. G. Ruetsch, M. Fatica, E. Phillips, N.

GPU Acceleration of the Longwave Rapid Radiative Transfer Model in WRF using CUDA Fortran. G. Ruetsch, M. Fatica, E. Phillips, N. GPU Acceleration of the Longwave Rapid Radiative Transfer Model in WRF using CUDA Fortran G. Ruetsch, M. Fatica, E. Phillips, N. Juffa Outline WRF and RRTM Previous Work CUDA Fortran Features RRTM in CUDA

More information

Introduction to GPGPU and GPU-architectures

Introduction to GPGPU and GPU-architectures Introduction to GPGPU and GPU-architectures Henk Corporaal Gert-Jan van den Braak http://www.es.ele.tue.nl/ Contents 1. What is a GPU 2. Programming a GPU 3. GPU thread scheduling 4. GPU performance bottlenecks

More information

Introduction to GPU programming. Introduction to GPU programming p. 1/17

Introduction to GPU programming. Introduction to GPU programming p. 1/17 Introduction to GPU programming Introduction to GPU programming p. 1/17 Introduction to GPU programming p. 2/17 Overview GPUs & computing Principles of CUDA programming One good reference: David B. Kirk

More information

Parallelization Using a PGAS Language such as X10 in HYDRO and TRITON

Parallelization Using a PGAS Language such as X10 in HYDRO and TRITON Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe Parallelization Using a PGAS Language such as X10 in HYDRO and TRITON Marc Tajchman* a a Commissariat à l énergie atomique

More information

Porting The Spectral Element Community Atmosphere Model (CAM-SE) To Hybrid GPU Platforms

Porting The Spectral Element Community Atmosphere Model (CAM-SE) To Hybrid GPU Platforms Porting The Spectral Element Community Atmosphere Model (CAM-SE) To Hybrid GPU Platforms http://www.scidacreview.org/0902/images/esg13.jpg Matthew Norman Jeffrey Larkin Richard Archibald Valentine Anantharaj

More information

GPU Developments for the NEMO Model. Stan Posey, HPC Program Manager, ESM Domain, NVIDIA (HQ), Santa Clara, CA, USA

GPU Developments for the NEMO Model. Stan Posey, HPC Program Manager, ESM Domain, NVIDIA (HQ), Santa Clara, CA, USA GPU Developments for the NEMO Model Stan Posey, HPC Program Manager, ESM Domain, NVIDIA (HQ), Santa Clara, CA, USA NVIDIA HPC AND ESM UPDATE TOPICS OF DISCUSSION GPU PROGRESS ON NEMO MODEL 2 NVIDIA GPU

More information

Towards an Efficient CPU-GPU Code Hybridization: a Simple Guideline for Code Optimizations on Modern Architecture with OpenACC and CUDA

Towards an Efficient CPU-GPU Code Hybridization: a Simple Guideline for Code Optimizations on Modern Architecture with OpenACC and CUDA Towards an Efficient CPU-GPU Code Hybridization: a Simple Guideline for Code Optimizations on Modern Architecture with OpenACC and CUDA L. Oteski, G. Colin de Verdière, S. Contassot-Vivier, S. Vialle,

More information

CS 677: Parallel Programming for Many-core Processors Lecture 6

CS 677: Parallel Programming for Many-core Processors Lecture 6 1 CS 677: Parallel Programming for Many-core Processors Lecture 6 Instructor: Philippos Mordohai Webpage: www.cs.stevens.edu/~mordohai E-mail: Philippos.Mordohai@stevens.edu Logistics Midterm: March 11

More information

GPU Performance Nuggets

GPU Performance Nuggets GPU Performance Nuggets Simon Garcia de Gonzalo & Carl Pearson PhD Students, IMPACT Research Group Advised by Professor Wen-mei Hwu Jun. 15, 2016 grcdgnz2@illinois.edu pearson@illinois.edu GPU Performance

More information

Scientific Computing with GPUs Autotuning GEMMs Fermi GPUs

Scientific Computing with GPUs Autotuning GEMMs Fermi GPUs Parallel Processing and Applied Mathematics September 11-14, 2011 Toruń, Poland Scientific Computing with GPUs Autotuning GEMMs Fermi GPUs Innovative Computing Laboratory Electrical Engineering and Computer

More information

Advanced CUDA Optimizations. Umar Arshad ArrayFire

Advanced CUDA Optimizations. Umar Arshad ArrayFire Advanced CUDA Optimizations Umar Arshad (@arshad_umar) ArrayFire (@arrayfire) ArrayFire World s leading GPU experts In the industry since 2007 NVIDIA Partner Deep experience working with thousands of customers

More information

Outline. Single GPU Implementation. Multi-GPU Implementation. 2-pass and 1-pass approaches Performance evaluation. Scalability on clusters

Outline. Single GPU Implementation. Multi-GPU Implementation. 2-pass and 1-pass approaches Performance evaluation. Scalability on clusters Implementing 3D Finite Difference Codes on the GPU Paulius Micikevicius NVIDIA Outline Single GPU Implementation 2-pass and 1-pass approaches Performance evaluation Multi-GPU Implementation Scalability

More information

Module Memory and Data Locality

Module Memory and Data Locality GPU Teaching Kit Accelerated Computing Module 4.4 - Memory and Data Locality Tiled Matrix Multiplication Kernel Objective To learn to write a tiled matrix-multiplication kernel Loading and using tiles

More information

Efficient multigrid solvers for strongly anisotropic PDEs in atmospheric modelling

Efficient multigrid solvers for strongly anisotropic PDEs in atmospheric modelling Iterative Solvers Numerical Results Conclusion and outlook 1/22 Efficient multigrid solvers for strongly anisotropic PDEs in atmospheric modelling Part II: GPU Implementation and Scaling on Titan Eike

More information

EXPOSING PARTICLE PARALLELISM IN THE XGC PIC CODE BY EXPLOITING GPU MEMORY HIERARCHY. Stephen Abbott, March

EXPOSING PARTICLE PARALLELISM IN THE XGC PIC CODE BY EXPLOITING GPU MEMORY HIERARCHY. Stephen Abbott, March EXPOSING PARTICLE PARALLELISM IN THE XGC PIC CODE BY EXPLOITING GPU MEMORY HIERARCHY Stephen Abbott, March 26 2018 ACKNOWLEDGEMENTS Collaborators: Oak Ridge Nation Laboratory- Ed D Azevedo NVIDIA - Peng

More information

CODE-GENERATION FOR DIFFERENTIAL EQUATION SOLVERS

CODE-GENERATION FOR DIFFERENTIAL EQUATION SOLVERS CODE-GENERATION FOR DIFFERENTIAL EQUATION SOLVERS Dániel Berényi Wigner RCP, GPU Laboratory, Budapest, Hungary Perspectives of GPU Computing in Physics and Astrophysics Rome 2014. INTRODUCTION The most

More information

A Simulation of Global Atmosphere Model NICAM on TSUBAME 2.5 Using OpenACC

A Simulation of Global Atmosphere Model NICAM on TSUBAME 2.5 Using OpenACC A Simulation of Global Atmosphere Model NICAM on TSUBAME 2.5 Using OpenACC Hisashi YASHIRO RIKEN Advanced Institute of Computational Science Kobe, Japan My topic The study for Cloud computing My topic

More information

Image convolution with CUDA

Image convolution with CUDA Image convolution with CUDA Lecture Alexey Abramov abramov _at_ physik3.gwdg.de Georg-August University, Bernstein Center for Computational Neuroscience, III Physikalisches Institut, Göttingen, Germany

More information

CS 314 Principles of Programming Languages

CS 314 Principles of Programming Languages CS 314 Principles of Programming Languages Zheng Zhang Fall 2016 Dec 14 GPU Programming Rutgers University Programming with CUDA Compute Unified Device Architecture (CUDA) Mapping and managing computations

More information

Hardware/Software Co-Design

Hardware/Software Co-Design 1 / 13 Hardware/Software Co-Design Review so far Miaoqing Huang University of Arkansas Fall 2011 2 / 13 Problem I A student mentioned that he was able to multiply two 1,024 1,024 matrices using a tiled

More information

The Icosahedral Nonhydrostatic (ICON) Model

The Icosahedral Nonhydrostatic (ICON) Model The Icosahedral Nonhydrostatic (ICON) Model Scalability on Massively Parallel Computer Architectures Florian Prill, DWD + the ICON team 15th ECMWF Workshop on HPC in Meteorology October 2, 2012 ICON =

More information

Case Study - Computational Fluid Dynamics (CFD) using Graphics Processing Units

Case Study - Computational Fluid Dynamics (CFD) using Graphics Processing Units - Computational Fluid Dynamics (CFD) using Graphics Processing Units Aaron F. Shinn Mechanical Science and Engineering Dept., UIUC Summer School 2009: Many-Core Processors for Science and Engineering Applications,

More information

GPU programming basics. Prof. Marco Bertini

GPU programming basics. Prof. Marco Bertini GPU programming basics Prof. Marco Bertini CUDA: atomic operations, privatization, algorithms Atomic operations The basics atomic operation in hardware is something like a read-modify-write operation performed

More information

GPU programming. Dr. Bernhard Kainz

GPU programming. Dr. Bernhard Kainz GPU programming Dr. Bernhard Kainz Overview About myself Motivation GPU hardware and system architecture GPU programming languages GPU programming paradigms Pitfalls and best practice Reduction and tiling

More information

A Simple Guideline for Code Optimizations on Modern Architectures with OpenACC and CUDA

A Simple Guideline for Code Optimizations on Modern Architectures with OpenACC and CUDA A Simple Guideline for Code Optimizations on Modern Architectures with OpenACC and CUDA L. Oteski, G. Colin de Verdière, S. Contassot-Vivier, S. Vialle, J. Ryan Acks.: CEA/DIFF, IDRIS, GENCI, NVIDIA, Région

More information

Large-scale Gas Turbine Simulations on GPU clusters

Large-scale Gas Turbine Simulations on GPU clusters Large-scale Gas Turbine Simulations on GPU clusters Tobias Brandvik and Graham Pullan Whittle Laboratory University of Cambridge A large-scale simulation Overview PART I: Turbomachinery PART II: Stencil-based

More information

Computational Fluid Dynamics (CFD) using Graphics Processing Units

Computational Fluid Dynamics (CFD) using Graphics Processing Units Computational Fluid Dynamics (CFD) using Graphics Processing Units Aaron F. Shinn Mechanical Science and Engineering Dept., UIUC Accelerators for Science and Engineering Applications: GPUs and Multicores

More information

Sparse Linear Algebra in CUDA

Sparse Linear Algebra in CUDA Sparse Linear Algebra in CUDA HPC - Algorithms and Applications Alexander Pöppl Technical University of Munich Chair of Scientific Computing November 22 nd 2017 Table of Contents Homework - Worksheet 2

More information

ECE 408 / CS 483 Final Exam, Fall 2014

ECE 408 / CS 483 Final Exam, Fall 2014 ECE 408 / CS 483 Final Exam, Fall 2014 Thursday 18 December 2014 8:00 to 11:00 Central Standard Time You may use any notes, books, papers, or other reference materials. In the interest of fair access across

More information

Lecture 10. Stencil Methods Limits to Performance

Lecture 10. Stencil Methods Limits to Performance Lecture 10 Stencil Methods Limits to Performance Announcements Tuesday s lecture on 2/11 will be moved to room 4140 from 6.30 PM to 7.50 PM Office hours are on for today Scott B. Baden /CSE 260/ Winter

More information

From Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation

From Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation From Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation Erik Schnetter, Perimeter Institute with M. Blazewicz, I. Hinder, D. Koppelman, S. Brandt, M. Ciznicki, M.

More information

GPU Programming. Alan Gray, James Perry EPCC The University of Edinburgh

GPU Programming. Alan Gray, James Perry EPCC The University of Edinburgh GPU Programming EPCC The University of Edinburgh Contents NVIDIA CUDA C Proprietary interface to NVIDIA architecture CUDA Fortran Provided by PGI OpenCL Cross platform API 2 NVIDIA CUDA CUDA allows NVIDIA

More information

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1 Lecture 15: Introduction to GPU programming Lecture 15: Introduction to GPU programming p. 1 Overview Hardware features of GPGPU Principles of GPU programming A good reference: David B. Kirk and Wen-mei

More information

A Software Developing Environment for Earth System Modeling. Depei Qian Beihang University CScADS Workshop, Snowbird, Utah June 27, 2012

A Software Developing Environment for Earth System Modeling. Depei Qian Beihang University CScADS Workshop, Snowbird, Utah June 27, 2012 A Software Developing Environment for Earth System Modeling Depei Qian Beihang University CScADS Workshop, Snowbird, Utah June 27, 2012 1 Outline Motivation Purpose and Significance Research Contents Technology

More information

Software and Performance Engineering for numerical codes on GPU clusters

Software and Performance Engineering for numerical codes on GPU clusters Software and Performance Engineering for numerical codes on GPU clusters H. Köstler International Workshop of GPU Solutions to Multiscale Problems in Science and Engineering Harbin, China 28.7.2010 2 3

More information

Lecture 2: CUDA Programming

Lecture 2: CUDA Programming CS 515 Programming Language and Compilers I Lecture 2: CUDA Programming Zheng (Eddy) Zhang Rutgers University Fall 2017, 9/12/2017 Review: Programming in CUDA Let s look at a sequential program in C first:

More information

Introducing Overdecomposition to Existing Applications: PlasComCM and AMPI

Introducing Overdecomposition to Existing Applications: PlasComCM and AMPI Introducing Overdecomposition to Existing Applications: PlasComCM and AMPI Sam White Parallel Programming Lab UIUC 1 Introduction How to enable Overdecomposition, Asynchrony, and Migratability in existing

More information

2006: Short-Range Molecular Dynamics on GPU. San Jose, CA September 22, 2010 Peng Wang, NVIDIA

2006: Short-Range Molecular Dynamics on GPU. San Jose, CA September 22, 2010 Peng Wang, NVIDIA 2006: Short-Range Molecular Dynamics on GPU San Jose, CA September 22, 2010 Peng Wang, NVIDIA Overview The LAMMPS molecular dynamics (MD) code Cell-list generation and force calculation Algorithm & performance

More information

Review for Midterm 3/28/11. Administrative. Parts of Exam. Midterm Exam Monday, April 4. Midterm. Design Review. Final projects

Review for Midterm 3/28/11. Administrative. Parts of Exam. Midterm Exam Monday, April 4. Midterm. Design Review. Final projects Administrative Midterm - In class April 4, open notes - Review notes, readings and review lecture (before break) - Will post prior exams Design Review - Intermediate assessment of progress on project,

More information

Fast Image Processing using Halide

Fast Image Processing using Halide Fast Image Processing using Halide Andrew Adams (Google) Jonathan Ragan-Kelley (Berkeley) Zalman Stern (Google) Steven Johnson (Google) Dillon Sharlet (Google) Patricia Suriana (Google) 1 This talk The

More information

GPU Programming Using CUDA. Samuli Laine NVIDIA Research

GPU Programming Using CUDA. Samuli Laine NVIDIA Research GPU Programming Using CUDA Samuli Laine NVIDIA Research Today GPU vs CPU Different architecture, different workloads Basics of CUDA Executing code on GPU Managing memory between CPU and GPU CUDA API Quick

More information

Locality-Aware Mapping of Nested Parallel Patterns on GPUs

Locality-Aware Mapping of Nested Parallel Patterns on GPUs Locality-Aware Mapping of Nested Parallel Patterns on GPUs HyoukJoong Lee *, Kevin Brown *, Arvind Sujeeth *, Tiark Rompf, Kunle Olukotun * * Pervasive Parallelism Laboratory, Stanford University Purdue

More information

CUDA. GPU Computing. K. Cooper 1. 1 Department of Mathematics. Washington State University

CUDA. GPU Computing. K. Cooper 1. 1 Department of Mathematics. Washington State University GPU Computing K. Cooper 1 1 Department of Mathematics Washington State University 2014 Review of Parallel Paradigms MIMD Computing Multiple Instruction Multiple Data Several separate program streams, each

More information

Preparing seismic codes for GPUs and other

Preparing seismic codes for GPUs and other Preparing seismic codes for GPUs and other many-core architectures Paulius Micikevicius Developer Technology Engineer, NVIDIA 2010 SEG Post-convention Workshop (W-3) High Performance Implementations of

More information

Turbostream: A CFD solver for manycore

Turbostream: A CFD solver for manycore Turbostream: A CFD solver for manycore processors Tobias Brandvik Whittle Laboratory University of Cambridge Aim To produce an order of magnitude reduction in the run-time of CFD solvers for the same hardware

More information

Parallel algorithms for fast air pollution assessment in three dimensions

Parallel algorithms for fast air pollution assessment in three dimensions HPC-UA 2014 (Ukraine, Kyiv, Octoer 14, 2014) Parallel algorithms for fast air pollution assessment in three dimensions Bohaienko V.O. 1 1 Glushkov Institute of Cyernetic of NAS of Ukraine, Kyiv, Ukraine

More information

PhD Student. Associate Professor, Co-Director, Center for Computational Earth and Environmental Science. Abdulrahman Manea.

PhD Student. Associate Professor, Co-Director, Center for Computational Earth and Environmental Science. Abdulrahman Manea. Abdulrahman Manea PhD Student Hamdi Tchelepi Associate Professor, Co-Director, Center for Computational Earth and Environmental Science Energy Resources Engineering Department School of Earth Sciences

More information

CudaDMA: Overview and Code Examples. Brucek Khailany (NVIDIA Research) Michael Bauer (Stanford) Henry Cook (UC Berkeley)

CudaDMA: Overview and Code Examples. Brucek Khailany (NVIDIA Research) Michael Bauer (Stanford) Henry Cook (UC Berkeley) CudaDMA: Overview and Code Examples Brucek Khailany (NVIDIA Research) Michael Bauer (Stanford) Henry Cook (UC Berkeley) What is cudadma? An API for efficiently copying data from global to shared memory

More information

Matrix-free multi-gpu Implementation of Elliptic Solvers for strongly anisotropic PDEs

Matrix-free multi-gpu Implementation of Elliptic Solvers for strongly anisotropic PDEs Iterative Solvers Numerical Results Conclusion and outlook 1/18 Matrix-free multi-gpu Implementation of Elliptic Solvers for strongly anisotropic PDEs Eike Hermann Müller, Robert Scheichl, Eero Vainikko

More information