STELLA: A Domain Specific Language for Stencil Computations
1 STELLA: A Domain Specific Language for Stencil Computations
Center for Climate Systems Modeling, ETH Zurich
Tobias Gysi, Department of Computer Science, ETH Zurich
Oliver Fuhrer, Federal Office of Meteorology and Climatology, MeteoSwiss, Zurich
Mauro Bianco, Swiss National Supercomputing Centre, CSCS, Lugano
Thomas C. Schulthess, Swiss National Supercomputing Centre, CSCS, Lugano
Work funded by the HP2C and PASC initiatives
Dagstuhl Seminar, 14th April
2 Outline
- Dynamical core of COSMO
- STELLA: DSL using C++ template metaprogramming
- STELLA DSL elements
- Exploiting performance: loop & kernel fusion, data locality
- Extending parallelization models
3 MOTIVATION: COSMO MODEL
- The dynamical core of COSMO solves the Navier-Stokes equations using finite difference methods on structured grids.
- Stencils on 3D grids: 2-7 km resolution in the horizontal, 60/80 levels in the vertical.
- Split-explicit time integrator, with slow processes integrated using a Runge-Kutta method.
- Collection of (Fortran) operators (advection, diffusion, fast waves, etc.) on prognostic variables (u, v, w, T, qx, ...).
4 COSMO time step
[Figure: time-step flow. Initialization; Δt loop: Boundary conditions, Physics, Dynamics, Data assimilation, Halo-update, Diagnostics, I/O; Cleanup.]
5 COSMO Dynamical Core
Dynamical core: ~35 stencils per time step
[Figure: the Dynamics box of the time-step flow expanded into Vertical Diffusion (compute T), Vertical Diffusion (compute U, V, W), Time Integrator (FastWaves, RungeKutta; N small time steps), Horizontal Diffusion, and start/wait communication of U, V, W.]
6 Algorithmic Motifs in COSMO
Horizontal PDE operators, explicitly solved -> compact stencils with horizontal data dependencies

    for k=1, ke {
      for i=1, ie-1 {
        for j=1, je-1 {
          lap(i,j) = -4*U(i,j) + U(i+1,j) + U(i-1,j) + U(i,j+1) + U(i,j-1)
        }
      }
      for i=1, ie-1 {
        for j=1, je-1 {
          diff(i,j) = -4*lap(i,j) + lap(i+1,j) + lap(i-1,j) + lap(i,j+1) + lap(i,j-1)
        }
      }
    }
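The two dependent loop nests above can be tried out on a single horizontal plane with a few lines of plain C++. This is only a sketch of the motif, not COSMO or STELLA code: `Field2D`, `laplacian` and `double_laplacian` are hypothetical names, and the `halo` argument is an assumption used to keep the second sweep away from points where `lap` was never computed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: a dense 2D field stored row-major (not a STELLA type).
struct Field2D {
    std::size_t ni, nj;
    std::vector<double> data;
    Field2D(std::size_t ni_, std::size_t nj_, double init = 0.0)
        : ni(ni_), nj(nj_), data(ni_ * nj_, init) {}
    double& operator()(std::size_t i, std::size_t j) { return data[i * nj + j]; }
    double operator()(std::size_t i, std::size_t j) const { return data[i * nj + j]; }
};

// Five-point Laplacian (without the 1/Delta^2 factor) on all points that are
// at least `halo` points away from the boundary; other points are untouched.
void laplacian(const Field2D& in, Field2D& out, std::size_t halo) {
    assert(in.ni == out.ni && in.nj == out.nj);
    assert(halo >= 1 && in.ni > 2 * halo && in.nj > 2 * halo);
    for (std::size_t i = halo; i < in.ni - halo; ++i) {
        for (std::size_t j = halo; j < in.nj - halo; ++j) {
            out(i, j) = -4.0 * in(i, j) + in(i + 1, j) + in(i - 1, j)
                        + in(i, j + 1) + in(i, j - 1);
        }
    }
}

// diff = lap(lap(U)): the second sweep depends on neighbours of the first,
// so it can only run on points two or more away from the boundary.
Field2D double_laplacian(const Field2D& u) {
    Field2D lap(u.ni, u.nj), diff(u.ni, u.nj);
    laplacian(u, lap, 1);
    laplacian(lap, diff, 2);
    return diff;
}
```

Calling `double_laplacian` on a field with U(i,j) = i^4 gives 24 at every point where the result is defined, which is a quick sanity check of the stencil coefficients.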
7 Algorithmic Motifs in COSMO
Vertical PDE operators, implicitly solved -> tridiagonal systems

    // forward elimination
    for j=1, je {
      for i=1, ie {
        for k=2, ke {
          c(i,j,k) = 1.0 / (b(i,j,k) - c(i,j,k-1) * a(i,j,k))
          d(i,j,k) = (d(i,j,k) - d(i,j,k-1) * a(i,j,k)) * c(i,j,k)
        }
      }
    }
    // backward substitution
    for j=1, je {
      for i=1, ie {
        for k=ke-1, 1, -1 {
          x(i,j,k) = d(i,j,k) - c(i,j,k) * x(i,j,k+1)
        }
      }
    }
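For reference, the forward/backward pattern on this slide is the Thomas algorithm applied to every vertical column. A minimal single-column sketch in C++ follows; it uses the textbook form with an explicit super-diagonal `c`, which is an assumption (the slide appears to use a scaled variant), and `thomas_solve` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Solve a[k]*x[k-1] + b[k]*x[k] + c[k]*x[k+1] = d[k] for one column.
// a[0] and c[n-1] are ignored. c and d are taken by value because the
// forward sweep overwrites them.
std::vector<double> thomas_solve(const std::vector<double>& a,
                                 const std::vector<double>& b,
                                 std::vector<double> c,
                                 std::vector<double> d) {
    const std::size_t n = b.size();
    assert(n > 0 && a.size() == n && c.size() == n && d.size() == n);

    // Forward elimination: sequential in k, each level needs level k-1.
    c[0] /= b[0];
    d[0] /= b[0];
    for (std::size_t k = 1; k < n; ++k) {
        const double m = 1.0 / (b[k] - a[k] * c[k - 1]);
        c[k] *= m;
        d[k] = (d[k] - a[k] * d[k - 1]) * m;
    }

    // Backward substitution: sequential in k, from the top level downwards.
    std::vector<double> x(n);
    x[n - 1] = d[n - 1];
    for (std::size_t k = n - 1; k-- > 0;) {
        x[k] = d[k] - c[k] * x[k + 1];
    }
    return x;
}
```

Both sweeps carry a dependency from one level to the next, which is why this motif is sequential in K while remaining fully parallel across the (i, j) columns.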
8 Model development
$U = U(x, z, t)$, $U(x, z, t=0) = u_0(x, z)$
$\frac{\partial U}{\partial t} = \alpha \nabla^4 U = \alpha \nabla^2 (\nabla^2 U)$
$\nabla^2 U_{i,j} = \frac{U_{i+1,j} + U_{i-1,j} + U_{i,j+1} + U_{i,j-1} - 4 U_{i,j}}{\Delta^2}$

    for i=1, ie-1
      for j=1, je-1
        lap(i,j) = -4*U(i,j) + U(i+1,j) + U(i-1,j) + U(i,j+1) + U(i,j-1)
    for i=1, ie-1
      for j=1, je-1
        diff(i,j) = -4*lap(i,j) + lap(i+1,j) + lap(i-1,j) + lap(i,j+1) + lap(i,j-1)
11 $\nabla^2 U_{i,j} = \frac{U_{i+1,j} + U_{i-1,j} + U_{i,j+1} + U_{i,j-1} - 4 U_{i,j}}{\Delta^2}$

    static void Lap(Context ctx) {
        ctx[lap::Center()] = ctx[u::At(iplus1)] + ctx[u::At(iminus1)] +
                             ctx[u::At(jplus1)] + ctx[u::At(jminus1)] -
                             4*ctx[u::Center()];
    }
12

    const int i = threadIdx.x;
    const int j = threadIdx.y;
    int i_h = 0;
    int j_h = 0;
    if(j < 2) {
        i_h = i;
        j_h = (j==0 ? -1 : blockDim.y);
    }
    else if(j < 4 && i <= blockDim.y) {
        i_h = (j==2 ? -1 : blockDim.x);
        j_h = i;
    }
    for(int k=0; k < kdim; ++k) {
        lap(i,j) = -4 * U(i,j,k) + U(i+1,j,k) + U(i-1,j,k) + U(i,j+1,k) + U(i,j-1,k);
        if(i_h != 0 || j_h != 0)
            lap(i_h,j_h) = -4 * U(i_h,j_h,k) + U(i_h+1,j_h,k) + U(i_h-1,j_h,k) +
                           U(i_h,j_h+1,k) + U(i_h,j_h-1,k);
        __syncthreads();
        result(i,j) = -4*(i,j,k) - alpha(i,j,k)*( lap(i,j,k) - lap(i-1,j,k) +
                      lap(i,j,k) - lap(i,j-1,k) );
    }
13 STELLA: DSL using C++ template metaprogramming
- Separates model and algorithm from hardware-specific implementation and optimizations.
- Single source code compiling for multiple architectures; performance portable.
- Concise syntax, close to the mathematical description of operators.
- Same toolchain as the full model (access to debugger/data fields, interoperability with other programming languages).
- Supports CPU & GPU backends.
- Used to rewrite a full dynamical core (COSMO).
14 How does it work?
- Operators are encapsulated as type information (functors).
- Types are composed/organized and manipulated.

    template<typename Context>
    struct LapStage {
        static void Do(Context ctx) {
            ctx[lap::Center()] = ctx[u::At(iplus1)] + ctx[u::At(iminus1)] +
                                 ctx[u::At(jplus1)] + ctx[u::At(jminus1)] -
                                 4*ctx[u::Center()];
        }
    };

- Specific backends expand the code of operators into template kernels with loop structures.
- Many optimizations (memory layout, tiling, looping, data locality) depend on the backend.
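The idea of passing an operator as a type and letting a backend supply the loops can be sketched in a few lines of standard C++. None of the names below (`Context`, `LoopsIJ`, `LoopsJI`) are STELLA's; they are hypothetical stand-ins chosen to show that the same stage functor compiles unchanged into two different loop structures.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical context: gives a stage access to fields relative to (i, j).
struct Context {
    const std::vector<double>& in;
    std::vector<double>& out;
    int nj, i, j;
    double u(int di, int dj) const {
        return in[static_cast<std::size_t>((i + di) * nj + (j + dj))];
    }
    double& lap() const { return out[static_cast<std::size_t>(i * nj + j)]; }
};

// The operator is a type with a static Do method; it contains no loops.
struct LapStage {
    template <typename Ctx>
    static void Do(const Ctx& ctx) {
        ctx.lap() = ctx.u(1, 0) + ctx.u(-1, 0) + ctx.u(0, 1) + ctx.u(0, -1)
                    - 4.0 * ctx.u(0, 0);
    }
};

// Two toy "backends": each owns the loop structure and is parameterized
// by the stage type, so the compiler can inline Stage::Do into the loops.
template <typename Stage>
struct LoopsIJ {
    static void Apply(const std::vector<double>& u, std::vector<double>& out,
                      int ni, int nj) {
        assert(ni >= 3 && nj >= 3);
        assert(u.size() == out.size());
        for (int i = 1; i < ni - 1; ++i)
            for (int j = 1; j < nj - 1; ++j)
                Stage::Do(Context{u, out, nj, i, j});
    }
};

template <typename Stage>
struct LoopsJI {
    static void Apply(const std::vector<double>& u, std::vector<double>& out,
                      int ni, int nj) {
        assert(ni >= 3 && nj >= 3);
        assert(u.size() == out.size());
        for (int j = 1; j < nj - 1; ++j)   // different loop order, same stage
            for (int i = 1; i < ni - 1; ++i)
                Stage::Do(Context{u, out, nj, i, j});
    }
};
```

`LoopsIJ<LapStage>::Apply(u, out, ni, nj)` and `LoopsJI<LapStage>::Apply(u, out, ni, nj)` produce the same result; in this toy only the traversal order differs, standing in for the backend-specific choices listed on the slide.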
15 Stencil Stages
Stencil stages describe the single operation applied to each grid point. Parallel execution in the IJ plane is assumed.

    template<typename TEnv>
    struct Lap {
        STENCIL_STAGE(TEnv)
        STAGE_PARAMETER(u)
        STAGE_PARAMETER(lap)
        static void Do(Context ctx, FullDomain) {
            ctx[lap::Center()] = -4*ctx[u::Center()] +
                                 ctx[u::At(iplus1)] + ctx[u::At(iminus1)] +
                                 ctx[u::At(jplus1)] + ctx[u::At(jminus1)];
        }
    };
16 Full example
Instantiate data fields (memory layout abstracted, backend dependent):

    IJKRealField in_, out_;

    StencilCompiler::Build(
        stencil,
        pack_parameters(
            Param<in, cIn, cDataField>(in_),
            Param<out, cInOut, cDataField>(out_)
        ),
        define_temporaries(
            StencilBuffer<lap, double, KRange<FullDomain,0,0> >()
        ),
        define_loops(
            define_sweep<cKIncrement>(
                define_stages(
                    StencilStage<Lap, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(),
                    StencilStage<Lap2, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >()
                )
            )
        )
    );
17 Full example
pack_parameters(...): pack the parameters and associate them with the placeholders used in the stages (same StencilCompiler::Build call as on slide 16).
18 Full example
define_temporaries(...): define the temporary buffers used by the stencil (same StencilCompiler::Build call as on slide 16).
19 Full example
define_stages(...): compose multiple stencil stages; operations with data dependencies are placed in different stages.
define_sweep<...>(...): build sweeps (sequential loops in K).
(Same StencilCompiler::Build call as on slide 16.)
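20 Computation on the fly or buffering?
$\frac{\partial U}{\partial t} = \alpha \nabla^4 U = \alpha \nabla^2 (\nabla^2 U)$
$\nabla^2 U_{i,j} = \frac{U_{i+1,j} + U_{i-1,j} + U_{i,j+1} + U_{i,j-1} - 4 U_{i,j}}{\Delta^2}$
[Figure: stencil dependency diagrams for $\nabla^4$ and $\nabla^2$.]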
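As a worked illustration of what "on the fly" means here (a derivation added for clarity, not taken from the slides): substituting the five-point Laplacian into itself collapses the two dependent sweeps into a single 13-point stencil on the input field.

```latex
\nabla^4 U_{i,j} = \nabla^2\left(\nabla^2 U\right)_{i,j}
= \frac{1}{\Delta^4}\Big[\, 20\,U_{i,j}
  - 8\left(U_{i+1,j} + U_{i-1,j} + U_{i,j+1} + U_{i,j-1}\right)
  + 2\left(U_{i+1,j+1} + U_{i+1,j-1} + U_{i-1,j+1} + U_{i-1,j-1}\right)
  + U_{i+2,j} + U_{i-2,j} + U_{i,j+2} + U_{i,j-2} \,\Big]
```

The coefficients sum to zero, as they must for a difference operator. Evaluating this form reads 13 input values per point and needs no temporary, whereas buffering evaluates one five-point stencil per point twice and stores the intermediate $\nabla^2 U$.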
21 [Figure labels: $\nabla^4 U$, $\nabla^2 U$]
Using DSL Temporaries:

    StencilCompiler::Build(
        stencil,
        pack_parameters(...),
        define_temporaries(
            StencilBuffer<lap, double, KRange<FullDomain,0,0> >()
        ),
        define_loops(
            define_sweep<cKIncrement>(
                define_stages(
                    StencilStage<Lap, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(),
                    StencilStage<Lap2, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >()
                )
            )
        )
    );

Using DSL Functions:

    template<typename TEnv>
    struct Lap {
        STENCIL_STAGE(TEnv)
        STAGE_PARAMETER(u)
        STAGE_PARAMETER(diff)
        static void Do(Context ctx, FullDomain) {
            ctx[diff::Center()] = ctx[Call<LapFn>::With(Call<LapFn>::With(u::Center()))];
        }
    };

    StencilCompiler::Build(
        //...
        define_loops(
            define_sweep<cKIncrement>(
                define_stages(
                    StencilStage<Lap, IJRange<cIndented,0,0,0,0>, KRange<FullDomain,0,0> >()
                )
            )
        )
    );
23 Loop and Kernel Fusion
STELLA applies loop fusion to the stages in a sweep (GPU only).

    define_loops(
        define_sweep<cKIncrement>(
            define_stages(
                StencilStage<Lap, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(),
                StencilStage<Result, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >()
            )
        )
    );

Fused kernel:

    const int i = threadIdx.x;
    const int j = threadIdx.y;
    int i_h = 0;
    int j_h = 0;
    if(j < 2) {
        i_h = i;
        j_h = (j==0 ? -1 : blockDim.y);
    }
    else if(j < 4 && i <= blockDim.y) {
        i_h = (j==2 ? -1 : blockDim.x);
        j_h = i;
    }
    for(int k=0; k < kdim; ++k) {
        lap(i,j) = -4 * U(i,j,k) + U(i+1,j,k) + U(i-1,j,k) + U(i,j+1,k) + U(i,j-1,k);
        if(i_h != 0 || j_h != 0)
            lap(i_h,j_h) = -4 * U(i_h,j_h,k) + U(i_h+1,j_h,k) + U(i_h-1,j_h,k) +
                           U(i_h,j_h+1,k) + U(i_h,j_h-1,k);
        __syncthreads();
        result(i,j) = -4*(i,j,k) - alpha(i,j,k)*( lap(i,j,k) - lap(i-1,j,k) +
                      lap(i,j,k) - lap(i,j-1,k) );
    }
24 Loop and Kernel Fusion
! Loop fusion requires duplicated computation at the halo points of each CUDA block (32x?); the kernel is the fused kernel of slide 23.
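The cost of that duplication can be made concrete with a small CPU-side sketch. It is a 1D simplification under stated assumptions, not the generated CUDA kernel: `unfused` and `fused` are hypothetical names, the block is a plain index range rather than a thread block, and the second stage is simply another second difference.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused reference: two full sweeps with a full-size temporary.
std::vector<double> unfused(const std::vector<double>& u) {
    const std::size_t n = u.size();
    std::vector<double> lap(n, 0.0), res(n, 0.0);
    if (n < 5) return res;
    for (std::size_t i = 1; i < n - 1; ++i)
        lap[i] = u[i - 1] - 2.0 * u[i] + u[i + 1];
    for (std::size_t i = 2; i < n - 2; ++i)
        res[i] = lap[i - 1] - 2.0 * lap[i] + lap[i + 1];
    return res;
}

// Fused: both stages run block by block through a small block-local buffer.
// Each block also computes stage 1 on one halo point per side, so halo values
// are evaluated twice (once by each neighbouring block).
std::vector<double> fused(const std::vector<double>& u, std::size_t block,
                          std::size_t& lap_evals) {
    assert(block >= 1);
    const std::size_t n = u.size();
    std::vector<double> res(n, 0.0);
    if (n < 5) return res;
    std::vector<double> buf(block + 2);
    for (std::size_t start = 2; start < n - 2; start += block) {
        const std::size_t end = std::min(start + block, n - 2);
        // Stage 1 on [start-1, end]: block interior plus one halo point per side.
        for (std::size_t i = start - 1; i <= end; ++i) {
            buf[i - (start - 1)] = u[i - 1] - 2.0 * u[i] + u[i + 1];
            ++lap_evals;
        }
        // Stage 2 on [start, end): reads only the block-local buffer.
        for (std::size_t i = start; i < end; ++i) {
            const std::size_t b = i - (start - 1);
            res[i] = buf[b - 1] - 2.0 * buf[b] + buf[b + 1];
        }
    }
    return res;
}
```

For 20 points and a block size of 4, the fused version performs 24 stage-1 evaluations where the unfused version performs 18, in exchange for never writing the intermediate field to a full-size array.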
25 Loop and Kernel Fusion
STELLA applies kernel fusion to all sweeps of a stencil.

    define_loops(
        define_sweep<cKIncrement>(
            define_stages(
                StencilStage<ForwardStage, IJRange<cIndented,0,0,0,0>, KRange<FullDomain,0,0> >()
            )
        ),
        define_sweep<cKDecrement>(
            define_stages(
                StencilStage<BackwardStage, IJRange<cIndented,0,0,0,0>, KRange<FullDomain,0,0> >()
            )
        )
    );
26 Loop and Kernel Fusion
Performance implications of loop & kernel fusion
27 Data locality: DSL elements for describing data reuse

    define_loops(
        define_sweep<cKIncrement>(
            define_caches(
                KCache<buff, cLocal, KWindow<-2,1>, KRange<FullDomain> >()
            ),
            define_stages(
                StencilStage<Avg, IJRange<cIndented,0,0,0,0>, KRange<FullDomain,0,0> >(),
                StencilStage<Interp, IJRange<cIndented,0,0,0,0>, KRange<FullDomain,0,0> >()
            )
        )
    )

[Figure: Avg and Interp stages.]
28 Data locality: DSL elements for describing data reuse
- KCache caches a window of a vertical column in registers:
    KCache<buff, cLocal, KWindow<-2,1>, KRange<FullDomain> >()
- IJKCache caches a 3D window in shared memory:
    IJKCache<buff, cFill, IJWindow<-2,2,-2,2,-2,0>, KRange<FullDomain,0,0> >()
- Policies: cFill, cLocal, cFlush, cFillAndFlush
[Figure: IJKCache and KCache.]
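What a register-resident K window buys can be illustrated on a single vertical column. The sketch below is an assumption-laden toy, not STELLA's cache implementation: the two stages (a scaling and a three-level sum) and both function names are hypothetical, and the window covers levels k-2 to k only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference: stage 1 writes a full-size temporary column, stage 2 reads it.
std::vector<double> sweep_with_full_buffer(const std::vector<double>& in) {
    assert(in.size() >= 3);
    const std::size_t nk = in.size();
    std::vector<double> buff(nk), out(nk, 0.0);
    for (std::size_t k = 0; k < nk; ++k) buff[k] = 2.0 * in[k];
    for (std::size_t k = 2; k < nk; ++k) out[k] = buff[k - 2] + buff[k - 1] + buff[k];
    return out;
}

// Same result in one increasing-k sweep, with the temporary held in three
// local variables that slide along the column (the role a k-cache plays).
// The temporary never touches memory, which corresponds to a "local" policy.
std::vector<double> sweep_with_kcache(const std::vector<double>& in) {
    assert(in.size() >= 3);  // the window spans three levels
    const std::size_t nk = in.size();
    std::vector<double> out(nk, 0.0);
    double w0 = 0.0, w1 = 0.0, w2 = 0.0;  // buff(k-2), buff(k-1), buff(k)
    for (std::size_t k = 0; k < nk; ++k) {
        w0 = w1;             // slide the window up by one level
        w1 = w2;
        w2 = 2.0 * in[k];    // stage 1 produces buff(k)
        if (k >= 2) out[k] = w0 + w1 + w2;  // stage 2 consumes the window
    }
    return out;
}
```

Both functions return identical columns; the second one reads each input level once and allocates no temporary array.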
29 KCache and IJKCache effect in the COSMO dycore
[Figure: results on K20c.]
30 Performance on the COSMO Dycore
Overall speedup of 1.8x for CPU and 5.8x for GPU with respect to the legacy code.
31 Scalability
32 Reusing operators
Some operators are applied to multiple fields, such as the concentrations of certain species in the atmosphere (e.g. pollen, sea salt, volcanic ash).
$\frac{\partial q_s}{\partial t} = \alpha \nabla^4 q_s$, $s \in \{1, \dots, 100\}$, $\alpha = \alpha(x, y, z)$
Processing an operator on multiple fields in a single kernel is beneficial due to:
- Reuse of common intermediate computations
- Data locality of input coefficients or fields
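The data-locality argument can be sketched as a loop nest in which the tracer index is the innermost loop, so the shared coefficient is loaded once per grid point. This is an illustrative toy only: `diffuse_tracers` is a hypothetical name, the grid is 1D, and a second-order difference stands in for the fourth-order operator of the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Field = std::vector<double>;

// Apply the same diffusion-like update to every tracer q_s in one pass.
// All tracers share the coefficient field alpha; boundary points are untouched.
void diffuse_tracers(const std::vector<Field>& q_in, const Field& alpha,
                     std::vector<Field>& q_out) {
    assert(q_in.size() == q_out.size());
    const std::size_t n = alpha.size();
    for (std::size_t s = 0; s < q_in.size(); ++s) {
        assert(q_in[s].size() == n && q_out[s].size() == n);
    }
    for (std::size_t i = 1; i + 1 < n; ++i) {
        const double a = alpha[i];  // loaded once, reused for every tracer
        for (std::size_t s = 0; s < q_in.size(); ++s) {
            const Field& q = q_in[s];
            q_out[s][i] = q[i] + a * (q[i - 1] - 2.0 * q[i] + q[i + 1]);
        }
    }
}
```

Running one stencil per tracer would instead reload `alpha` for each of the fields, which is the overhead that processing several fields per kernel avoids.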
33 Expandable parameters

    // setup the tracer stencil
    StencilCompiler::Build(
        stencil_,
        "HorizontalDiffusionTracers",
        repository.calculationdomain(),
        StencilConfiguration<Real, HorizontalDiffusionTracersBlockSize>(),
        pack_parameters(
            /* output fields */
            ExpandableParam<data_out, cInOut>(tracout.begin(), tracout.end()),
            /* input fields */
            ExpandableParam<data_in, cIn>(tracin.begin(), tracin.end()),
            Param<hdmasktr, cIn>(repository.hdmask()),
            Param<ofahdx, cIn>(repository.ofahdx()),
            Param<ofahdy, cIn>(repository.ofahdy()),
            Param<crlato, cIn>(repository.crlato()),
            Param<crlatu, cIn>(repository.crlatu())
        ),
        define_temporaries(
            StencilExpandableBuffer<lap, Real, KRange<FullDomain,0,0> >(),
            StencilExpandableBuffer<flx, Real, KRange<FullDomain,0,0> >(),
            StencilExpandableBuffer<fly, Real, KRange<FullDomain,0,0> >(),
            StencilExpandableBuffer<rxp, Real, KRange<FullDomain,0,0> >(),
            StencilExpandableBuffer<rxm, Real, KRange<FullDomain,0,0> >()
        ),
        define_loops(
            define_sweep<cKIncrement>(
                define_caches(
                    IJCache<lap, KRange<FullDomain,0,0> >(),
                    IJCache<flx, KRange<FullDomain,0,0> >(),
                    IJCache<fly, KRange<FullDomain,0,0> >(),
                    IJCache<rxp, KRange<FullDomain,0,0> >(),
                    IJCache<rxm, KRange<FullDomain,0,0> >()
                ),
                define_stages(
                    StencilStage<LapStage, IJRange<cComplete,-2,2,-2,2>, KRange<FullDomain,0,0> >(),
                    StencilStage<FluxStage, IJRange<cComplete,-2,1,-2,1>, KRange<FullDomain,0,0> >(),
                    StencilStage<RXStage, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(),
                    StencilStage<LimitFluxStage, IJRange<cIndented,-1,0,-1,0>, KRange<FullDomain,0,0> >(),
                    StencilStage<DataStage, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >()
                )
            )
        )
    );
34 Expandable parameters
Same tracer stencil as on slide 33, with the expandable parameters carrying the number of tracers per stencil as a template argument:

    ExpandableParam<NumTracersPerStencil::value, data_out, cInOut>(tracout.begin(), tracout.end()),
    ExpandableParam<NumTracersPerStencil::value, data_in, cIn>(tracin.begin(), tracin.end()),

Expandable buffers: the StencilExpandableBuffer entries in define_temporaries.
35 Expandable parameters
Annotations on the same tracer stencil as on slide 33:
- Expandable buffers: the StencilExpandableBuffer entries in define_temporaries.
- Expandable caches: the IJCache entries in define_caches.
36 Extending the Parallelization Model
- Tiling in the IJ plane
- Blocks are extended with halo points
- Parallel execution in the IJ plane
- K dimension executed sequentially by each (CUDA/OpenMP) thread
- Warp: 32 threads
37 Parallelization Model: KParallel
- Compact stencils of horizontal operators can be parallelized in K.
- Vertical dependencies only on input fields.
- Only horizontal dependencies on intermediate computed values.
- K parallelization increases parallelism and occupancy on GPUs.
38 Parallelization Model: KParallel

    StencilCompiler::Build(
        stencil,
        pack_parameters(...),
        define_temporaries(
            StencilBuffer<lap, double, KRange<FullDomain,0,0> >()
        ),
        define_loops(
            define_sweep<cKParallel>(
                define_stages(
                    StencilStage<Lap, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(),
                    StencilStage<Result, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >()
                )
            )
        )
    );
39 Parallelization Model: KParallel
[Figure: dycore stencils on a 32x32 domain, K20c.]
40 Parallelization Model: Parallel Tridiagonal Solver
- Basic Thomas algorithm (sequential in K) for large domains.
- Performance deteriorates for small domains due to lack of parallelism.
- STELLA integrates a parallel HPCR algorithm (Jeremy Appleyard, NVIDIA).
- Additionally, the system coefficients can be prepared on the fly in the same stencil, based on the prognostic variables.
41 Parallelization Model: Parallel Tridiagonal Solves
STELLA keyword to trigger an HPCR algorithm for the tridiagonal solve:

    StencilCompiler::Build(
        stencil_,
        "TridiagonalSolve_HPCR",
        repository.calculationdomain(),
        StencilConfiguration<Real, TridiagonalSolve_HPCRBlockSize>(),
        pack_parameters(
            /* output fields */
            Param<w_out, cInOut>(repository.w_out()),
            Param<acol_dat, cInOut>(repository.acol()),
            Param<bcol_dat, cInOut>(repository.bcol()),
            Param<ccol_dat, cInOut>(repository.ccol()),
            Param<dcol_dat, cInOut>(repository.dcol())
        ),
        define_loops(
            define_sweep<cTridiagonalSolve>(
                define_stages(
                    StencilStage<SetupStage, IJRange<cIndented,0,0,0,0>, KRange<FullDomain,0,0> >(),
                    StencilStage<TridiagonalSolveFWStage, IJRange<cIndented,0,0,0,0>, KRange<FullDomain,0,0> >(),
                    StencilStage<WriteOutputStage, IJRange<cIndented,0,0,0,0>, KRange<FullDomain,0,0> >()
                )
            )
        )
    );
42 HPCR performance vs Thomas for domain sizes of 32 x (J domain size)
[Figure: series Thomas and HPCR.]
43 Conclusions
- STELLA was used to port the dynamical core of COSMO to GPUs:
  - Retains a single source code.
  - Performance portable.
  - 2 backends: CPU (1.8x) and GPU (5.8x).
- C++ template metaprogramming -> interoperability.
- Extended parallelization modes further exploit performance for characteristic algorithmic motifs of STELLA.
44 BACKUP
DLR.de Chart 1 OpenACC programming for GPGPUs: Rotor wake simulation Melven Röhrig-Zöllner, Achim Basermann Simulations- und Softwaretechnik DLR.de Chart 2 Outline Hardware-Architecture (CPU+GPU) GPU computing
More informationGPU Acceleration of the Longwave Rapid Radiative Transfer Model in WRF using CUDA Fortran. G. Ruetsch, M. Fatica, E. Phillips, N.
GPU Acceleration of the Longwave Rapid Radiative Transfer Model in WRF using CUDA Fortran G. Ruetsch, M. Fatica, E. Phillips, N. Juffa Outline WRF and RRTM Previous Work CUDA Fortran Features RRTM in CUDA
More informationIntroduction to GPGPU and GPU-architectures
Introduction to GPGPU and GPU-architectures Henk Corporaal Gert-Jan van den Braak http://www.es.ele.tue.nl/ Contents 1. What is a GPU 2. Programming a GPU 3. GPU thread scheduling 4. GPU performance bottlenecks
More informationIntroduction to GPU programming. Introduction to GPU programming p. 1/17
Introduction to GPU programming Introduction to GPU programming p. 1/17 Introduction to GPU programming p. 2/17 Overview GPUs & computing Principles of CUDA programming One good reference: David B. Kirk
More informationParallelization Using a PGAS Language such as X10 in HYDRO and TRITON
Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe Parallelization Using a PGAS Language such as X10 in HYDRO and TRITON Marc Tajchman* a a Commissariat à l énergie atomique
More informationPorting The Spectral Element Community Atmosphere Model (CAM-SE) To Hybrid GPU Platforms
Porting The Spectral Element Community Atmosphere Model (CAM-SE) To Hybrid GPU Platforms http://www.scidacreview.org/0902/images/esg13.jpg Matthew Norman Jeffrey Larkin Richard Archibald Valentine Anantharaj
More informationGPU Developments for the NEMO Model. Stan Posey, HPC Program Manager, ESM Domain, NVIDIA (HQ), Santa Clara, CA, USA
GPU Developments for the NEMO Model Stan Posey, HPC Program Manager, ESM Domain, NVIDIA (HQ), Santa Clara, CA, USA NVIDIA HPC AND ESM UPDATE TOPICS OF DISCUSSION GPU PROGRESS ON NEMO MODEL 2 NVIDIA GPU
More informationTowards an Efficient CPU-GPU Code Hybridization: a Simple Guideline for Code Optimizations on Modern Architecture with OpenACC and CUDA
Towards an Efficient CPU-GPU Code Hybridization: a Simple Guideline for Code Optimizations on Modern Architecture with OpenACC and CUDA L. Oteski, G. Colin de Verdière, S. Contassot-Vivier, S. Vialle,
More informationCS 677: Parallel Programming for Many-core Processors Lecture 6
1 CS 677: Parallel Programming for Many-core Processors Lecture 6 Instructor: Philippos Mordohai Webpage: www.cs.stevens.edu/~mordohai E-mail: Philippos.Mordohai@stevens.edu Logistics Midterm: March 11
More informationGPU Performance Nuggets
GPU Performance Nuggets Simon Garcia de Gonzalo & Carl Pearson PhD Students, IMPACT Research Group Advised by Professor Wen-mei Hwu Jun. 15, 2016 grcdgnz2@illinois.edu pearson@illinois.edu GPU Performance
More informationScientific Computing with GPUs Autotuning GEMMs Fermi GPUs
Parallel Processing and Applied Mathematics September 11-14, 2011 Toruń, Poland Scientific Computing with GPUs Autotuning GEMMs Fermi GPUs Innovative Computing Laboratory Electrical Engineering and Computer
More informationAdvanced CUDA Optimizations. Umar Arshad ArrayFire
Advanced CUDA Optimizations Umar Arshad (@arshad_umar) ArrayFire (@arrayfire) ArrayFire World s leading GPU experts In the industry since 2007 NVIDIA Partner Deep experience working with thousands of customers
More informationOutline. Single GPU Implementation. Multi-GPU Implementation. 2-pass and 1-pass approaches Performance evaluation. Scalability on clusters
Implementing 3D Finite Difference Codes on the GPU Paulius Micikevicius NVIDIA Outline Single GPU Implementation 2-pass and 1-pass approaches Performance evaluation Multi-GPU Implementation Scalability
More informationModule Memory and Data Locality
GPU Teaching Kit Accelerated Computing Module 4.4 - Memory and Data Locality Tiled Matrix Multiplication Kernel Objective To learn to write a tiled matrix-multiplication kernel Loading and using tiles
More informationEfficient multigrid solvers for strongly anisotropic PDEs in atmospheric modelling
Iterative Solvers Numerical Results Conclusion and outlook 1/22 Efficient multigrid solvers for strongly anisotropic PDEs in atmospheric modelling Part II: GPU Implementation and Scaling on Titan Eike
More informationEXPOSING PARTICLE PARALLELISM IN THE XGC PIC CODE BY EXPLOITING GPU MEMORY HIERARCHY. Stephen Abbott, March
EXPOSING PARTICLE PARALLELISM IN THE XGC PIC CODE BY EXPLOITING GPU MEMORY HIERARCHY Stephen Abbott, March 26 2018 ACKNOWLEDGEMENTS Collaborators: Oak Ridge Nation Laboratory- Ed D Azevedo NVIDIA - Peng
More informationCODE-GENERATION FOR DIFFERENTIAL EQUATION SOLVERS
CODE-GENERATION FOR DIFFERENTIAL EQUATION SOLVERS Dániel Berényi Wigner RCP, GPU Laboratory, Budapest, Hungary Perspectives of GPU Computing in Physics and Astrophysics Rome 2014. INTRODUCTION The most
More informationA Simulation of Global Atmosphere Model NICAM on TSUBAME 2.5 Using OpenACC
A Simulation of Global Atmosphere Model NICAM on TSUBAME 2.5 Using OpenACC Hisashi YASHIRO RIKEN Advanced Institute of Computational Science Kobe, Japan My topic The study for Cloud computing My topic
More informationImage convolution with CUDA
Image convolution with CUDA Lecture Alexey Abramov abramov _at_ physik3.gwdg.de Georg-August University, Bernstein Center for Computational Neuroscience, III Physikalisches Institut, Göttingen, Germany
More informationCS 314 Principles of Programming Languages
CS 314 Principles of Programming Languages Zheng Zhang Fall 2016 Dec 14 GPU Programming Rutgers University Programming with CUDA Compute Unified Device Architecture (CUDA) Mapping and managing computations
More informationHardware/Software Co-Design
1 / 13 Hardware/Software Co-Design Review so far Miaoqing Huang University of Arkansas Fall 2011 2 / 13 Problem I A student mentioned that he was able to multiply two 1,024 1,024 matrices using a tiled
More informationThe Icosahedral Nonhydrostatic (ICON) Model
The Icosahedral Nonhydrostatic (ICON) Model Scalability on Massively Parallel Computer Architectures Florian Prill, DWD + the ICON team 15th ECMWF Workshop on HPC in Meteorology October 2, 2012 ICON =
More informationCase Study - Computational Fluid Dynamics (CFD) using Graphics Processing Units
- Computational Fluid Dynamics (CFD) using Graphics Processing Units Aaron F. Shinn Mechanical Science and Engineering Dept., UIUC Summer School 2009: Many-Core Processors for Science and Engineering Applications,
More informationGPU programming basics. Prof. Marco Bertini
GPU programming basics Prof. Marco Bertini CUDA: atomic operations, privatization, algorithms Atomic operations The basics atomic operation in hardware is something like a read-modify-write operation performed
More informationGPU programming. Dr. Bernhard Kainz
GPU programming Dr. Bernhard Kainz Overview About myself Motivation GPU hardware and system architecture GPU programming languages GPU programming paradigms Pitfalls and best practice Reduction and tiling
More informationA Simple Guideline for Code Optimizations on Modern Architectures with OpenACC and CUDA
A Simple Guideline for Code Optimizations on Modern Architectures with OpenACC and CUDA L. Oteski, G. Colin de Verdière, S. Contassot-Vivier, S. Vialle, J. Ryan Acks.: CEA/DIFF, IDRIS, GENCI, NVIDIA, Région
More informationLarge-scale Gas Turbine Simulations on GPU clusters
Large-scale Gas Turbine Simulations on GPU clusters Tobias Brandvik and Graham Pullan Whittle Laboratory University of Cambridge A large-scale simulation Overview PART I: Turbomachinery PART II: Stencil-based
More informationComputational Fluid Dynamics (CFD) using Graphics Processing Units
Computational Fluid Dynamics (CFD) using Graphics Processing Units Aaron F. Shinn Mechanical Science and Engineering Dept., UIUC Accelerators for Science and Engineering Applications: GPUs and Multicores
More informationSparse Linear Algebra in CUDA
Sparse Linear Algebra in CUDA HPC - Algorithms and Applications Alexander Pöppl Technical University of Munich Chair of Scientific Computing November 22 nd 2017 Table of Contents Homework - Worksheet 2
More informationECE 408 / CS 483 Final Exam, Fall 2014
ECE 408 / CS 483 Final Exam, Fall 2014 Thursday 18 December 2014 8:00 to 11:00 Central Standard Time You may use any notes, books, papers, or other reference materials. In the interest of fair access across
More informationLecture 10. Stencil Methods Limits to Performance
Lecture 10 Stencil Methods Limits to Performance Announcements Tuesday s lecture on 2/11 will be moved to room 4140 from 6.30 PM to 7.50 PM Office hours are on for today Scott B. Baden /CSE 260/ Winter
More informationFrom Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation
From Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation Erik Schnetter, Perimeter Institute with M. Blazewicz, I. Hinder, D. Koppelman, S. Brandt, M. Ciznicki, M.
More informationGPU Programming. Alan Gray, James Perry EPCC The University of Edinburgh
GPU Programming EPCC The University of Edinburgh Contents NVIDIA CUDA C Proprietary interface to NVIDIA architecture CUDA Fortran Provided by PGI OpenCL Cross platform API 2 NVIDIA CUDA CUDA allows NVIDIA
More informationLecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1
Lecture 15: Introduction to GPU programming Lecture 15: Introduction to GPU programming p. 1 Overview Hardware features of GPGPU Principles of GPU programming A good reference: David B. Kirk and Wen-mei
More informationA Software Developing Environment for Earth System Modeling. Depei Qian Beihang University CScADS Workshop, Snowbird, Utah June 27, 2012
A Software Developing Environment for Earth System Modeling Depei Qian Beihang University CScADS Workshop, Snowbird, Utah June 27, 2012 1 Outline Motivation Purpose and Significance Research Contents Technology
More informationSoftware and Performance Engineering for numerical codes on GPU clusters
Software and Performance Engineering for numerical codes on GPU clusters H. Köstler International Workshop of GPU Solutions to Multiscale Problems in Science and Engineering Harbin, China 28.7.2010 2 3
More informationLecture 2: CUDA Programming
CS 515 Programming Language and Compilers I Lecture 2: CUDA Programming Zheng (Eddy) Zhang Rutgers University Fall 2017, 9/12/2017 Review: Programming in CUDA Let s look at a sequential program in C first:
More informationIntroducing Overdecomposition to Existing Applications: PlasComCM and AMPI
Introducing Overdecomposition to Existing Applications: PlasComCM and AMPI Sam White Parallel Programming Lab UIUC 1 Introduction How to enable Overdecomposition, Asynchrony, and Migratability in existing
More information2006: Short-Range Molecular Dynamics on GPU. San Jose, CA September 22, 2010 Peng Wang, NVIDIA
2006: Short-Range Molecular Dynamics on GPU San Jose, CA September 22, 2010 Peng Wang, NVIDIA Overview The LAMMPS molecular dynamics (MD) code Cell-list generation and force calculation Algorithm & performance
More informationReview for Midterm 3/28/11. Administrative. Parts of Exam. Midterm Exam Monday, April 4. Midterm. Design Review. Final projects
Administrative Midterm - In class April 4, open notes - Review notes, readings and review lecture (before break) - Will post prior exams Design Review - Intermediate assessment of progress on project,
More informationFast Image Processing using Halide
Fast Image Processing using Halide Andrew Adams (Google) Jonathan Ragan-Kelley (Berkeley) Zalman Stern (Google) Steven Johnson (Google) Dillon Sharlet (Google) Patricia Suriana (Google) 1 This talk The
More informationGPU Programming Using CUDA. Samuli Laine NVIDIA Research
GPU Programming Using CUDA Samuli Laine NVIDIA Research Today GPU vs CPU Different architecture, different workloads Basics of CUDA Executing code on GPU Managing memory between CPU and GPU CUDA API Quick
More informationLocality-Aware Mapping of Nested Parallel Patterns on GPUs
Locality-Aware Mapping of Nested Parallel Patterns on GPUs HyoukJoong Lee *, Kevin Brown *, Arvind Sujeeth *, Tiark Rompf, Kunle Olukotun * * Pervasive Parallelism Laboratory, Stanford University Purdue
More informationCUDA. GPU Computing. K. Cooper 1. 1 Department of Mathematics. Washington State University
GPU Computing K. Cooper 1 1 Department of Mathematics Washington State University 2014 Review of Parallel Paradigms MIMD Computing Multiple Instruction Multiple Data Several separate program streams, each
More informationPreparing seismic codes for GPUs and other
Preparing seismic codes for GPUs and other many-core architectures Paulius Micikevicius Developer Technology Engineer, NVIDIA 2010 SEG Post-convention Workshop (W-3) High Performance Implementations of
More informationTurbostream: A CFD solver for manycore
Turbostream: A CFD solver for manycore processors Tobias Brandvik Whittle Laboratory University of Cambridge Aim To produce an order of magnitude reduction in the run-time of CFD solvers for the same hardware
More informationParallel algorithms for fast air pollution assessment in three dimensions
HPC-UA 2014 (Ukraine, Kyiv, Octoer 14, 2014) Parallel algorithms for fast air pollution assessment in three dimensions Bohaienko V.O. 1 1 Glushkov Institute of Cyernetic of NAS of Ukraine, Kyiv, Ukraine
More informationPhD Student. Associate Professor, Co-Director, Center for Computational Earth and Environmental Science. Abdulrahman Manea.
Abdulrahman Manea PhD Student Hamdi Tchelepi Associate Professor, Co-Director, Center for Computational Earth and Environmental Science Energy Resources Engineering Department School of Earth Sciences
More informationCudaDMA: Overview and Code Examples. Brucek Khailany (NVIDIA Research) Michael Bauer (Stanford) Henry Cook (UC Berkeley)
CudaDMA: Overview and Code Examples Brucek Khailany (NVIDIA Research) Michael Bauer (Stanford) Henry Cook (UC Berkeley) What is cudadma? An API for efficiently copying data from global to shared memory
More informationMatrix-free multi-gpu Implementation of Elliptic Solvers for strongly anisotropic PDEs
Iterative Solvers Numerical Results Conclusion and outlook 1/18 Matrix-free multi-gpu Implementation of Elliptic Solvers for strongly anisotropic PDEs Eike Hermann Müller, Robert Scheichl, Eero Vainikko
More information