From Notebooks to Supercomputers: Tap the Full Potential of Your CUDA Resources with LibGeoDecomp

From Notebooks to Supercomputers: Tap the Full Potential of Your CUDA Resources with LibGeoDecomp. Andreas Schäfer (andreas.schaefer@cs.fau.de), Friedrich-Alexander-Universität Erlangen-Nürnberg. GPU Technology Conference 2013, San José, CA, March 21, 2013.

Outline 1 Applications 2 Architecture 3 Usage

1. Applications What can it do?

Parallel IMage Particle Flow (PIMPF): particle-in-cell code; 300k particles at 30 FPS; runs on PC and Android; the PC is controlled via a tablet.

Computational Fluid Dynamics: Lattice Boltzmann Method toolkit; 2D/3D solid/fluid interaction; various boundary conditions: free-slip, no-slip, moving wall, Zou/He (pressure, velocity).

Simulation of Granular Gases: n-body code; periodic boundaries; uniform distribution. Model: Institute for Multiscale Simulation, FAU.

Simulation of Dendritic Growth: 1000² cells take 8 hours; 1000³ cells would take 3 years? Cellular automaton + finite difference code. Model: Institute for Metallic Materials, FSU.

Simulation of Dendritic Growth (cont.): Parallelization options: manual parallelization; reimplement within a problem-solving environment; reimplement in a parallel programming language; port to LibGeoDecomp.

Port to LibGeoDecomp: success! Total code base: 5000 LOC (lines of code); only 100 LOC changed for the port. Video: 2 TB of output, 7 days of compute, 3 days of visualization. In active use since 2011.

2. Architecture How does it work?

Target Applications: LibGeoDecomp (Library for Geometric Decomposition codes) targets time- and space-discrete simulations, chiefly stencil codes on regular 2D/3D grids. Challenges: the computations are tightly coupled (communication, load-balancing); vectorization, ILP, caches, multi-cores, GPUs, MPI...

Architecture: Cell: user-supplied model. Simulator: generic parallelization (C++ templates); handles the spatial/temporal loops and calls back into the user code. Plug-ins for OpenMP, CUDA, MPI, HPX. Features: hierarchical parallelization, parallel IO, live-steering, in situ visualization, dynamic load-balancing...
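To make the callback pattern concrete, here is a deliberately simplified, self-contained sketch in plain C++ (1D, illustrative names only; this is not LibGeoDecomp's real API, which appears in the Usage section below). The generic run() routine plays the Simulator's role: it owns the grids, drives the time and space loops, and calls back into the user-supplied cell type.

// Minimal sketch of the Cell/Simulator callback pattern (illustrative only).
#include <cstddef>
#include <vector>

// User-supplied model: knows how to update one cell from its neighborhood.
struct HeatCell
{
    double temp;

    template<typename NEIGHBORHOOD>
    void update(const NEIGHBORHOOD& hood, std::size_t /* step */)
    {
        // toy rule: average of the two direct neighbors
        temp = 0.5 * (hood[-1].temp + hood[+1].temp);
    }
};

// Read-only view of the old grid around one cell, with periodic wrap-around.
template<typename CELL>
class Neighborhood
{
public:
    Neighborhood(const std::vector<CELL>& grid, long center) :
        grid(grid),
        center(center)
    {}

    const CELL& operator[](long offset) const
    {
        long size = static_cast<long>(grid.size());
        return grid[(center + offset + size) % size];
    }

private:
    const std::vector<CELL>& grid;
    long center;
};

// Generic "simulator": owns the grids, runs the time and space loops,
// and calls back into the user-supplied cell type.
template<typename CELL>
void run(std::vector<CELL>& grid, std::size_t steps)
{
    std::vector<CELL> buffer(grid);
    for (std::size_t t = 0; t < steps; ++t) {
        for (std::size_t i = 0; i < grid.size(); ++i) {
            Neighborhood<CELL> hood(grid, static_cast<long>(i));
            buffer[i].update(hood, t);
        }
        grid.swap(buffer);
    }
}

int main()
{
    std::vector<HeatCell> grid(16, HeatCell{0.0});
    grid[8].temp = 1.0;   // single hot spot
    run(grid, 10);        // diffuse for 10 time steps
    return 0;
}

The division of labor mirrors the slide: the user writes only the per-cell update rule, while all looping, buffering and (in the real library) parallelization machinery stays behind the generic simulator.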

Performance: Reverse Time Migration (RTM) on Tsubame 2.0, scaled to 1080 GPUs (>480k cores); weak-scaling efficiency >90%. [Plot: RTM performance in TFLOPS (0-45) versus number of GPUs (0-1200).]
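As a reading aid (standard definition, not from the slide): under weak scaling the problem size grows with the GPU count k, and the quoted efficiency compares the measured aggregate throughput P(k) against perfect linear scaling of the single-GPU throughput P(1):

E_{\mathrm{weak}}(k) = \frac{P(k)}{k \cdot P(1)}, \qquad E_{\mathrm{weak}}(1080) > 0.9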

3. Usage How can I use it?

3D Jacobi Smoother: Naïve Code. Heat dissipation with periodic boundary conditions; a well-known example that looks simple but offers plenty of potential for optimization and plenty of pitfalls.
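For reference, the update rule that all of the following kernels implement is the 6-point Jacobi average over the 3D von Neumann neighborhood, with periodic (torus) boundaries, i.e. all indices taken modulo the grid dimension:

T^{t+1}_{x,y,z} = \tfrac{1}{6}\bigl(T^{t}_{x-1,y,z} + T^{t}_{x+1,y,z} + T^{t}_{x,y-1,z} + T^{t}_{x,y+1,z} + T^{t}_{x,y,z-1} + T^{t}_{x,y,z+1}\bigr)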

3D Jacobi Smoother: Naïve Code (cont.)

#define GET(X, Y, Z) (gridold)[                            \
    ((z + X + griddim) % griddim) * griddim * griddim +    \
    ((y + Y + griddim) % griddim) * griddim +              \
    ((x + Z + griddim) % griddim)]
#define SET() (gridnew)[                                    \
    ((z + griddim) % griddim) * griddim * griddim +        \
    ((y + griddim) % griddim) * griddim +                  \
    ((x + griddim) % griddim)]

for (int z = 0; z < griddim; ++z) {
    for (int y = 0; y < griddim; ++y) {
        for (int x = 0; x < griddim; ++x) {
            SET() = (GET( 0,  0, -1) + GET( 0, -1,  0) + GET(-1,  0,  0) +
                     GET( 1,  0,  0) + GET( 0,  1,  0) + GET( 0,  0,  1)) * (1.0 / 6.0);
        }
    }
}

0.10 GLUPS (Giga Lattice Updates Per Second), 1 thread, Intel Core i7-3610QM @ 2.30 GHz (Ivy Bridge); matrix size: 512³ cells

3D Jacobi Smoother: Naïve Code + OpenMP

#pragma omp parallel for
for (int z = 0; z < griddim; ++z) {
    for (int y = 0; y < griddim; ++y) {
        for (int x = 0; x < griddim; ++x) {
            SET() = (GET( 0,  0, -1) + GET( 0, -1,  0) + GET(-1,  0,  0) +
                     GET( 1,  0,  0) + GET( 0,  1,  0) + GET( 0,  0,  1)) * (1.0 / 6.0);
        }
    }
}

0.37 GLUPS, 4 threads, Intel Core i7-3610QM @ 2.30 GHz (Ivy Bridge); speedup: 3.6

3D Jacobi Smoother: Model

class Cell
{
public:
    typedef Stencils::VonNeumann<3, 1> Stencil;
    typedef Topologies::Torus<3>::Topology Topology;

    class API : public CellAPITraits::Fixed
    {};

    static inline unsigned nanosteps()
    {
        return 1;
    }

    inline explicit Cell(const double& v = 0) :
        temp(v)
    {}

#define GET(X, Y, Z) hood[FixedCoord<X, Y, Z>()].temp
#define SET() temp

    template<typename COORD_MAP>
    void update(const COORD_MAP& hood, const unsigned& nanostep)
    {
        SET() = (GET( 0,  0, -1) + GET( 0, -1,  0) + GET(-1,  0,  0) +
                 GET( 1,  0,  0) + GET( 0,  1,  0) + GET( 0,  0,  1)) * (1.0 / 6.0);
    }

    double temp;
};

3D Jacobi Smoother: Simple

#include <libgeodecomp.h>

class Cell
{ ... };

class CellInitializer
{ ... };

int main(int argc, char *argv[])
{
    CellInitializer *init = new CellInitializer(griddim, steps);
    SerialSimulator<Cell> sim(init);
    sim.run();
    return 0;
}

0.40 GLUPS, 1 thread
0.50 GLUPS, 4 threads, Intel Core i7-3610QM @ 2.30 GHz (Ivy Bridge)

3D Jacobi Smoother: Cache Blocking

#include <libgeodecomp.h>

class Cell
{ ... };

class CellInitializer
{ ... };

int main(int argc, char *argv[])
{
    CellInitializer *init = new CellInitializer(griddim, steps);
    CacheBlockingSimulator<Cell> sim(init, 3, Coord<2>(270, 30));
    sim.run();
    return 0;
}

1.04 GLUPS, 4 threads, Intel Core i7-3610QM @ 2.30 GHz (Ivy Bridge)

3D Jacobi Smoother: GPU

#include <libgeodecomp.h>

class Cell
{ ... };

class CellInitializer
{ ... };

int main(int argc, char *argv[])
{
    CellInitializer *init = new CellInitializer(griddim, steps);
    CudaSimulator<Cell> sim(init);
    sim.run();
    return 0;
}

2.48 GLUPS, Tesla C2050
4.87 GLUPS, Tesla K20 (47x faster than the naïve code)

Summary: Self-Adapting Stencil Codes for the Grid, http://www.libgeodecomp.org. Single-node performance matters (47x speedup...); scalability != peak performance; LibGeoDecomp delivers both. Thanks!