From Notebooks to Supercomputers: Tap the Full Potential of Your CUDA Resources with LibGeoDecomp

1 From Notebooks to Supercomputers: Tap the Full Potential of Your CUDA Resources with LibGeoDecomp
Friedrich-Alexander-Universität Erlangen-Nürnberg
GPU Technology Conference 2013, San José, CA

2 Outline
1. Applications
2. Architecture
3. Usage

3 1. Applications: What can it do?

5 Parallel IMage Particle Flow (PIMPF)
- particle-in-cell
- 300k particles, 30 FPS
- runs on PC, Android
- control PC via tablet

6 Computational Fluid Dynamics
- Lattice Boltzmann Method toolkit
- 2D/3D
- solid/fluid interaction
- various boundary conditions:
  - free-slip, no-slip, moving wall
  - Zou He (pressure, velocity)

7 Simulation of Granular Gases
- n-body code
- periodic boundaries
- uniform distribution
- model: Institute for Multiscale Simulation, FAU

8 Simulation of Dendritic Growth
- , 8 hours
- , 3 years?
- cellular automaton + finite difference code
- model: Institute for Metallic Materials, FSU

10 Simulation of Dendritic Growth (cont.)
Parallelization Options
- manual parallelization
- reimplement with problem solving environment
- reimplement with parallel programming language
- port to LibGeoDecomp
Port to LibGeoDecomp: Success!
- total code base: 5000 LOC (Lines Of Code)
- port: 100 LOC changed
- video: 2 TB output
- 7 days compute, 3 days visualization
- active usage since 2011

11 2. Architecture: How does it work?

12 Target Applications
- Library for Geometric Decomposition codes (LibGeoDecomp)
- time- and space-discrete simulations
- chiefly stencil codes: regular 2D/3D grid
- challenges:
  - computations are tightly coupled (communication, load-balancing)
  - vectorization, ILP, caches, multi-cores, GPUs, MPI...

15 Architecture
- Cell: user-supplied model
- Simulator: generic parallelization (C++ templates)
  - handles spatial/temporal loops
  - calls back user code
- plug-ins for: OpenMP, CUDA, MPI, HPX
- features: hierarchical parallelization, parallel IO, live-steering, in situ visualization, dynamic load-balancing...
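
The split between a user-supplied cell and a generic simulator can be sketched in a few lines of C++. This is only an illustration of the callback pattern described on the slide, not LibGeoDecomp's implementation: `AverageCell`, `ToySimulator`, and the 1D periodic grid are hypothetical names and simplifications chosen for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical user-supplied model: a 1D cell that averages its two neighbors.
struct AverageCell {
    double value = 0.0;

    // Callback invoked by the simulator once per cell and time step.
    template <typename Neighborhood>
    void update(const Neighborhood& hood) {
        value = 0.5 * (hood(-1).value + hood(+1).value);
    }
};

// Hypothetical generic driver (not LibGeoDecomp's Simulator): it owns the
// grid, runs the temporal and spatial loops, and calls back the user's cell.
template <typename Cell>
class ToySimulator {
public:
    explicit ToySimulator(std::vector<Cell> initial)
        : oldGrid(std::move(initial)), newGrid(oldGrid) {
        assert(!oldGrid.empty());
    }

    void run(std::size_t steps) {
        const long n = static_cast<long>(oldGrid.size());
        for (std::size_t t = 0; t < steps; ++t) {      // temporal loop
            for (long i = 0; i < n; ++i) {              // spatial loop
                // Read-only view of the old grid with periodic wrap-around.
                auto hood = [this, i, n](long offset) -> const Cell& {
                    return oldGrid[static_cast<std::size_t>((i + offset + n) % n)];
                };
                Cell next = oldGrid[static_cast<std::size_t>(i)];
                next.update(hood);
                newGrid[static_cast<std::size_t>(i)] = next;
            }
            std::swap(oldGrid, newGrid);                // double buffering
        }
    }

    const std::vector<Cell>& grid() const { return oldGrid; }

private:
    std::vector<Cell> oldGrid;
    std::vector<Cell> newGrid;
};
```

Because the loops live in the driver, a different driver (threaded, distributed, GPU-based) could in principle be swapped in without touching the cell's `update`, which is the point the slide makes about plug-ins.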

16 Performance
[Figure: RTM Performance, TFLOPS over number of GPUs]
- Reverse Time Migration (RTM) on Tsubame 2.0
- scaled to 1080 GPUs (>480 k cores)
- weak-scaling efficiency >90 %
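
For readers unfamiliar with the metric: in a weak-scaling run the work per device stays constant, so ideal aggregate throughput grows linearly with the device count. A minimal sketch of the usual throughput-based definition follows; the function name and the numbers in the usage note are hypothetical and are not measurements from the talk.

```cpp
#include <cassert>

// Weak-scaling efficiency from throughput: measured aggregate performance
// on n devices divided by n times the single-device performance.
double weakScalingEfficiency(double perfOnOne, double perfOnN, int n) {
    assert(perfOnOne > 0.0 && perfOnN > 0.0 && n > 0);
    return perfOnN / (static_cast<double>(n) * perfOnOne);
}
```

With made-up values, `weakScalingEfficiency(2.0, 7.2, 4)` evaluates to roughly 0.9, i.e. 90 %.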

17 3. Usage: How can I use it?

18 3D Jacobi Smoother: Naïve Code
- heat dissipation
- periodic boundary conditions
- well-known example
- seemingly simple
- plenty of potential for optimization
- plenty of pitfalls

20 3D Jacobi Smoother: Naïve Code (cont.)
    // The modulo arithmetic wraps indices for periodic boundaries.
    #define GET(X, Y, Z) (gridold)[ \
        ((z + Z + griddim) % griddim) * griddim * griddim + \
        ((y + Y + griddim) % griddim) * griddim + \
        ((x + X + griddim) % griddim)]
    #define SET() (gridnew)[ \
        ((z + griddim) % griddim) * griddim * griddim + \
        ((y + griddim) % griddim) * griddim + \
        ((x + griddim) % griddim)]
    for (int z = 0; z < griddim; ++z) {
        for (int y = 0; y < griddim; ++y) {
            for (int x = 0; x < griddim; ++x) {
                // average of the six face neighbors
                SET() = (GET( 0,  0, -1) + GET( 0, -1,  0) + GET(-1,  0,  0) +
                         GET( 1,  0,  0) + GET( 0,  1,  0) + GET( 0,  0,  1)) * (1.0 / 6.0);
            }
        }
    }
- 0.10 GLUPS, 1 thread
- Intel Core GHz (Ivy Bridge)
- Giga Lattice Updates Per Second (GLUPS)
- matrix size: cells
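
The slide's fragment relies on variables defined elsewhere. As a self-contained sketch of the same naïve sweep plus a GLUPS measurement, the following could be used; `wrapIndex`, `jacobiSweep`, and `measureGlups` are hypothetical helper names, and the grid is assumed to be a flat `std::vector<double>` of `dim * dim * dim` cells.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

// Flat index into a dim^3 grid with periodic (torus) wrap-around.
// Valid for coordinates in [-dim, dim).
inline std::size_t wrapIndex(int x, int y, int z, int dim) {
    const std::size_t d = static_cast<std::size_t>(dim);
    const std::size_t wx = static_cast<std::size_t>((x + dim) % dim);
    const std::size_t wy = static_cast<std::size_t>((y + dim) % dim);
    const std::size_t wz = static_cast<std::size_t>((z + dim) % dim);
    return (wz * d + wy) * d + wx;
}

// One naive Jacobi sweep: each cell becomes the mean of its six neighbors.
void jacobiSweep(const std::vector<double>& oldGrid,
                 std::vector<double>& newGrid, int dim) {
    assert(dim > 0);
    assert(oldGrid.size() == static_cast<std::size_t>(dim) * dim * dim);
    assert(newGrid.size() == oldGrid.size());
    for (int z = 0; z < dim; ++z) {
        for (int y = 0; y < dim; ++y) {
            for (int x = 0; x < dim; ++x) {
                newGrid[wrapIndex(x, y, z, dim)] =
                    (oldGrid[wrapIndex(x, y, z - 1, dim)] +
                     oldGrid[wrapIndex(x, y - 1, z, dim)] +
                     oldGrid[wrapIndex(x - 1, y, z, dim)] +
                     oldGrid[wrapIndex(x + 1, y, z, dim)] +
                     oldGrid[wrapIndex(x, y + 1, z, dim)] +
                     oldGrid[wrapIndex(x, y, z + 1, dim)]) * (1.0 / 6.0);
            }
        }
    }
}

// Runs `steps` sweeps in place and returns Giga Lattice Updates Per Second:
// (cells * steps) / seconds / 1e9.
double measureGlups(std::vector<double>& grid, int dim, int steps) {
    std::vector<double> scratch(grid.size());
    const auto start = std::chrono::steady_clock::now();
    for (int t = 0; t < steps; ++t) {
        jacobiSweep(grid, scratch, dim);
        std::swap(grid, scratch);
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    const double updates = static_cast<double>(grid.size()) * steps;
    return updates / elapsed.count() * 1e-9;
}
```

The per-cell modulo operations are one of the "pitfalls" of the naïve version: they are executed for every neighbor access even though only boundary cells need wrapping.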

21 3D Jacobi Smoother: Naïve Code, OpenMP
    // parallelize the outermost (z) loop across threads
    #pragma omp parallel for
    for (int z = 0; z < griddim; ++z) {
        for (int y = 0; y < griddim; ++y) {
            for (int x = 0; x < griddim; ++x) {
                SET() = (GET( 0,  0, -1) + GET( 0, -1,  0) + GET(-1,  0,  0) +
                         GET( 1,  0,  0) + GET( 0,  1,  0) + GET( 0,  0,  1)) * (1.0 / 6.0);
            }
        }
    }
- 0.37 GLUPS, 4 threads
- Intel Core GHz (Ivy Bridge)
- speedup 3.6

23 3D Jacobi Smoother: Model
    class Cell
    {
    public:
        // 6-point neighborhood in 3D, periodic boundaries
        typedef Stencils::VonNeumann<3, 1> Stencil;
        typedef Topologies::Torus<3>::Topology Topology;
        class API : public CellAPITraits::Fixed
        {};
        static inline unsigned nanosteps()
        {
            return 1;
        }
        inline explicit Cell(const double& v = 0) : temp(v)
        {}
    #define GET(X, Y, Z) hood[FixedCoord<X, Y, Z>()].temp
    #define SET() temp
        template<typename COORD_MAP>
        void update(const COORD_MAP& hood, const unsigned& nanostep)
        {
            SET() = (GET( 0,  0, -1) + GET( 0, -1,  0) + GET(-1,  0,  0) +
                     GET( 1,  0,  0) + GET( 0,  1,  0) + GET( 0,  0,  1)) * (1.0 / 6.0);
        }
        double temp;
    };
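
To make the `hood[FixedCoord<X, Y, Z>()]` idiom less opaque, here is a minimal mock of how a neighborhood object can accept compile-time relative coordinates. It is a sketch under assumptions, not the library's code: `FixedCoordSketch`, `TempCell`, `TorusHood`, and `jacobiUpdate` are hypothetical names, and the real neighborhood types may work quite differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a compile-time relative coordinate.
template <int X, int Y, int Z>
struct FixedCoordSketch {};

struct TempCell {
    double temp = 0.0;
};

// Hypothetical neighborhood: a read-only view centered on one cell of a
// dim^3 grid with periodic (torus) boundaries.
class TorusHood {
public:
    TorusHood(const std::vector<TempCell>& cells, int gridDim,
              int centerX, int centerY, int centerZ)
        : grid(cells), dim(gridDim), x(centerX), y(centerY), z(centerZ) {
        assert(dim > 0);
        assert(grid.size() == static_cast<std::size_t>(dim) * dim * dim);
    }

    // The offsets arrive as template parameters, so they are constants
    // at compile time rather than runtime arguments.
    template <int X, int Y, int Z>
    const TempCell& operator[](FixedCoordSketch<X, Y, Z>) const {
        const std::size_t d = static_cast<std::size_t>(dim);
        const std::size_t wx = static_cast<std::size_t>((x + X + dim) % dim);
        const std::size_t wy = static_cast<std::size_t>((y + Y + dim) % dim);
        const std::size_t wz = static_cast<std::size_t>((z + Z + dim) % dim);
        return grid[(wz * d + wy) * d + wx];
    }

private:
    const std::vector<TempCell>& grid;
    int dim, x, y, z;
};

// Same update rule as on the slide, written against the sketch types.
template <typename HOOD>
double jacobiUpdate(const HOOD& hood) {
    return (hood[FixedCoordSketch< 0,  0, -1>()].temp +
            hood[FixedCoordSketch< 0, -1,  0>()].temp +
            hood[FixedCoordSketch<-1,  0,  0>()].temp +
            hood[FixedCoordSketch< 1,  0,  0>()].temp +
            hood[FixedCoordSketch< 0,  1,  0>()].temp +
            hood[FixedCoordSketch< 0,  0,  1>()].temp) * (1.0 / 6.0);
}
```

Since the update function only sees the neighborhood's interface, the same rule can run against different storage layouts or boundary handling, which is what lets a framework optimize memory access behind the user's back.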

25 3D Jacobi Smoother: Simple
    #include <libgeodecomp.h>
    class Cell { ... };
    class CellInitializer { ... };
    int main(int argc, char *argv[])
    {
        CellInitializer *init = new CellInitializer(griddim, steps);
        SerialSimulator<Cell> sim(init);
        sim.run();
        return 0;
    }
- 0.40 GLUPS, 1 thread
- 0.50 GLUPS, 4 threads
- Intel Core GHz (Ivy Bridge)

26 3D Jacobi Smoother: Cache Blocking
    #include <libgeodecomp.h>
    class Cell { ... };
    class CellInitializer { ... };
    int main(int argc, char *argv[])
    {
        CellInitializer *init = new CellInitializer(griddim, steps);
        CacheBlockingSimulator<Cell> sim(init, 3, Coord<2>(270, 30));
        sim.run();
        return 0;
    }
- 1.04 GLUPS, 4 threads
- Intel Core GHz (Ivy Bridge)
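
The slide does not show what the simulator does internally. As a general illustration of the idea behind cache blocking, the sketch below applies plain spatial loop tiling to the Jacobi sweep, so that each tile's working set is more likely to stay in cache. This is a swapped-in textbook technique, not a description of `CacheBlockingSimulator`; `torusIndex`, `tiledJacobiSweep`, and the tile-size parameters are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Flat index into a dim^3 grid with periodic wrap-around.
// Valid for coordinates in [-dim, dim).
inline std::size_t torusIndex(int x, int y, int z, int dim) {
    const std::size_t d = static_cast<std::size_t>(dim);
    const std::size_t wx = static_cast<std::size_t>((x + dim) % dim);
    const std::size_t wy = static_cast<std::size_t>((y + dim) % dim);
    const std::size_t wz = static_cast<std::size_t>((z + dim) % dim);
    return (wz * d + wy) * d + wx;
}

// Jacobi sweep with the y and z loops split into tiles of tileY x tileZ.
// Every cell is still updated exactly once, only the visiting order changes.
void tiledJacobiSweep(const std::vector<double>& oldGrid,
                      std::vector<double>& newGrid,
                      int dim, int tileY, int tileZ) {
    assert(dim > 0 && tileY > 0 && tileZ > 0);
    assert(oldGrid.size() == static_cast<std::size_t>(dim) * dim * dim);
    assert(newGrid.size() == oldGrid.size());
    for (int zz = 0; zz < dim; zz += tileZ) {
        for (int yy = 0; yy < dim; yy += tileY) {
            const int zEnd = std::min(zz + tileZ, dim);
            const int yEnd = std::min(yy + tileY, dim);
            for (int z = zz; z < zEnd; ++z) {
                for (int y = yy; y < yEnd; ++y) {
                    for (int x = 0; x < dim; ++x) {
                        newGrid[torusIndex(x, y, z, dim)] =
                            (oldGrid[torusIndex(x, y, z - 1, dim)] +
                             oldGrid[torusIndex(x, y - 1, z, dim)] +
                             oldGrid[torusIndex(x - 1, y, z, dim)] +
                             oldGrid[torusIndex(x + 1, y, z, dim)] +
                             oldGrid[torusIndex(x, y + 1, z, dim)] +
                             oldGrid[torusIndex(x, y, z + 1, dim)]) * (1.0 / 6.0);
                    }
                }
            }
        }
    }
}
```

Tiling leaves the result unchanged, so a tiled sweep can be validated against an untiled one (tile sizes equal to `dim`) before tuning tile sizes for a particular cache.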

27 3D Jacobi Smoother: GPU
    #include <libgeodecomp.h>
    class Cell { ... };
    class CellInitializer { ... };
    int main(int argc, char *argv[])
    {
        CellInitializer *init = new CellInitializer(griddim, steps);
        CudaSimulator<Cell> sim(init);
        sim.run();
        return 0;
    }
- 2.48 GLUPS, Tesla C2050

28 3D Jacobi Smoother: GPU
    #include <libgeodecomp.h>
    class Cell { ... };
    class CellInitializer { ... };
    int main(int argc, char *argv[])
    {
        CellInitializer *init = new CellInitializer(griddim, steps);
        CudaSimulator<Cell> sim(init);
        sim.run();
        return 0;
    }
- 2.48 GLUPS, Tesla C2050
- Tesla K20: 47x faster than naïve code

29 Summary
- Self-Adapting Stencil Codes for the Grid
- single node performance matters (47x speedup...)
- scalability != peak performance
- LibGeoDecomp delivers both
- thanks:
