Software and Performance Engineering for numerical codes on GPU clusters

H. Köstler
International Workshop of GPU Solutions to Multiscale Problems in Science and Engineering
Harbin, China, 28.7.2010

Contents
- WaLBerla: Applications, Software and Performance Engineering
- Numerical Results: LBM, Multigrid
- Future Work

WaLBerla: Applications

WaLBerla: a parallel structured-grid framework for various applications

Free surface flows
- The gas phase is modeled only through boundary conditions
- Examples: rising bubble, metal foam, sinking box, PEM (proton exchange membrane) fuel cell

Fluid-structure interaction: particle segregation
- 64.3 million objects in an LBM grid with 151 billion lattice cells
- Runs on 294,912 cores of a Blue Gene/P at the Jülich Supercomputing Centre

WaLBerla: Software and Performance Engineering

WaLBerla framework
- Main design goal: a massively parallel software framework for various engineering applications
- Software quality factors:
  - Usability
  - Reliability
  - Maintainability and expandability
  - Portability
  - Efficiency and scalability

Performance at all costs?
- Performance optimization: architectural factors, optimization techniques, programming techniques
- Performance engineering: simplicity, extendability, performance, effort

Performance at different scales I
- WaLBerla (C++, MPI): code management, standard implementations
- OpenCL framework for data-parallel computations on structured grids
- Low-level kernels for optimized, architecture-specific computations (C++, CUDA, intrinsics, assembler)

Performance at different scales II
- WaLBerla: heterogeneous devices, distributed-memory parallel
- Low-level / OpenCL framework: local device, shared-memory parallel

Efficient OpenCL?
- Kernels have to be adapted to the specific architecture
- Optimization techniques are necessary:
  - Vectorize computations
  - Define suitable work-groups (threads running in parallel that share local memory)
  - Reuse local memory: copy to local memory, perform the computations, then copy back to global memory
  - Ensure correct data alignment and coalesced memory accesses
- WaLBerla supports OpenCL but also offers similar strategies for heterogeneous computing in order to support, e.g., CUDA
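The local-memory pattern listed above (copy to local, compute, copy back to global) can be sketched with a plain Python emulation. This is not OpenCL code and not taken from WaLBerla; the 3-point averaging stencil, the function names, and the halo width of one cell are assumptions chosen only to make the access pattern visible.

```python
def stencil_reference(src):
    """3-point average computed directly on the 'global' array."""
    n = len(src)
    out = list(src)  # boundary values stay unchanged
    for i in range(1, n - 1):
        out[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0
    return out


def stencil_workgroups(src, group_size):
    """Same stencil, organised like work-groups that use local memory.

    Each emulated work-group:
      1) copies its tile plus a one-cell halo from global to local memory,
      2) lets every work-item compute from the local copy only,
      3) writes the result back to global memory.
    """
    n = len(src)
    out = list(src)
    for start in range(0, n, group_size):      # one iteration = one work-group
        stop = min(start + group_size, n)
        lo = max(start - 1, 0)                 # halo on the left
        hi = min(stop + 1, n)                  # halo on the right
        local = src[lo:hi]                     # step 1: global -> local
        for gid in range(start, stop):         # one iteration = one work-item
            if 0 < gid < n - 1:
                lid = gid - lo                 # index inside the local tile
                value = (local[lid - 1] + local[lid] + local[lid + 1]) / 3.0
                out[gid] = value               # step 3: local -> global
    return out


data = [float((i * i) % 7) for i in range(10)]
tiled = stencil_workgroups(data, group_size=4)
```

On a real device the benefit comes from each global value being read once per work-group instead of once per work-item that needs it; the emulation only shows that the tiled version produces the same result as the direct one.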

WaLBerla: Patch concept

WaLBerla: Sweep concept

WaLBerla: Communication concept

Numerical Results: LBM

Boltzmann equation
- Mesoscopic approach to solving the Navier-Stokes equations
- The Boltzmann equation describes the statistical distribution of one particle in a fluid:
  $$\frac{\partial f}{\partial t} + \zeta \cdot \nabla f = \Omega(f)$$
- $f$ is the probability distribution function (PDF), $\zeta$ the particle velocity, and $\Omega(f)$ the change due to collisions
- Models the behavior of fluids in statistical physics

Lattice Boltzmann Method
- The Lattice Boltzmann Method (LBM) solves the discrete Boltzmann equation
- The fluid domain is split into simple cells that are treated independently in each time step
- Microscopic numerical fluid solver
- Simple updating rules
- Handles complex boundary conditions for static and moving objects
- Inherently parallel scheme, well suited to hardware acceleration
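To make the "simple updating rules" concrete, here is a minimal sketch of one collide-and-stream time step. It is deliberately reduced to a one-dimensional D1Q3 lattice with a BGK (single relaxation time) collision and periodic boundaries; these choices, the relaxation parameter `omega`, and all function names are assumptions for illustration and are not taken from the talk, which uses D3Q19 in 3D.

```python
# D1Q3 lattice: discrete velocities and their standard weights (c_s^2 = 1/3).
C = (0, 1, -1)
W = (2.0 / 3.0, 1.0 / 6.0, 1.0 / 6.0)


def equilibrium(rho, u):
    """Second-order equilibrium distribution for density rho and velocity u."""
    return [w * rho * (1.0 + 3.0 * c * u + 4.5 * (c * u) ** 2 - 1.5 * u * u)
            for c, w in zip(C, W)]


def collide(cell, omega):
    """BGK collision: relax the PDFs of ONE cell towards equilibrium.

    Only data of this cell is used, so all cells can be processed in parallel.
    """
    rho = sum(cell)
    u = sum(c * f for c, f in zip(C, cell)) / rho
    feq = equilibrium(rho, u)
    return [f - omega * (f - fe) for f, fe in zip(cell, feq)]


def stream(cells):
    """Move every PDF to the neighbouring cell in its direction (periodic)."""
    n = len(cells)
    new = [[0.0] * len(C) for _ in range(n)]
    for x in range(n):
        for i, c in enumerate(C):
            new[(x + c) % n][i] = cells[x][i]
    return new


def step(cells, omega=1.0):
    """One LBM time step: local collision followed by streaming."""
    return stream([collide(cell, omega) for cell in cells])


def total_mass(cells):
    return sum(sum(cell) for cell in cells)


# Fluid at rest with a small density bump in the middle of 16 cells.
cells = [equilibrium(1.1 if x == 8 else 1.0, 0.0) for x in range(16)]
mass_before = total_mass(cells)
for _ in range(20):
    cells = step(cells)
mass_after = total_mass(cells)
```

The collision touches one cell at a time and the streaming only exchanges values between direct neighbours, which is the locality that makes the scheme attractive for GPUs.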

D3Q19 LBM cell
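As a small illustration of what a D3Q19 cell stores, the following sketch enumerates the 19 discrete velocities (the rest velocity, 6 axis directions, and 12 face diagonals) together with the standard lattice weights 1/3, 1/18, and 1/36. The ordering of the directions and the function name `d3q19` are arbitrary choices made here, not the layout used in WaLBerla.

```python
from fractions import Fraction
from itertools import product


def d3q19():
    """Return the D3Q19 velocity set and the matching lattice weights."""
    velocities, weights = [], []
    for c in product((-1, 0, 1), repeat=3):
        speed2 = sum(k * k for k in c)
        if speed2 == 0:
            w = Fraction(1, 3)      # rest velocity
        elif speed2 == 1:
            w = Fraction(1, 18)     # 6 axis-aligned neighbours
        elif speed2 == 2:
            w = Fraction(1, 36)     # 12 diagonal neighbours in a plane
        else:
            continue                # the 8 cube corners are not part of D3Q19
        velocities.append(c)
        weights.append(w)
    return velocities, weights


VELOCITIES, WEIGHTS = d3q19()
```

Each cell therefore holds 19 PDF values per time step, which is what determines the memory traffic of an LBM kernel.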

Distribution of work to threads
- Each thread treats one cell
- One block treats one stripe along the x-direction with its threads; thus, all cells of a block have the same y and z index
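The thread-to-cell mapping described above can be emulated in a few lines of Python. The sketch assumes a launch with one block per (y, z) pair and one thread per x position, and a linear memory layout with x as the fastest-running index; both the layout and the names `cell_index` and `launch_stripes` are assumptions, not details given in the talk.

```python
def cell_index(x, y, z, nx, ny):
    """Linear index of cell (x, y, z) when x runs fastest in memory."""
    return x + nx * (y + ny * z)


def launch_stripes(nx, ny, nz):
    """Emulate a kernel launch: grid of (ny, nz) blocks, nx threads per block.

    Returns the cell indices in the order the emulated threads visit them.
    """
    visited = []
    for block_z in range(nz):
        for block_y in range(ny):
            # All threads of this block share the same y and z index.
            for thread_x in range(nx):
                visited.append(cell_index(thread_x, block_y, block_z, nx, ny))
    return visited


order = launch_stripes(nx=4, ny=3, nz=2)
```

With this layout, neighbouring threads of a block touch neighbouring memory locations, which is the precondition for coalesced accesses.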

LBM results I

HLRS Stuttgart NEC Nehalem Cluster
- Peak performance: 62 TFlops
- Number of nodes: 700, dual-socket quad-core
- Processor: Intel Xeon (X5560) Nehalem @ 2.8 GHz, 8 MB cache
- Memory per node: 12 GB
- Disk: 60 TB
- Node-node interconnect: InfiniBand, GigE
- Accelerators: 32 nodes provide Nvidia Tesla S1070
- http://www.hlrs.de/systems/platforms/nec-nehalem-cluster/

Weak Scaling on Tesla C1070 and HLRS cluster

Heterogeneous LBM results
Figure: Investigation of the load balance with 6 CPU-only processes, each working on one block, and 2 GPU processes with varying block counts. The block size is $90^3$ lattice cells.
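The reasoning behind such a load-balance study can be sketched with a simple model: every process needs (number of blocks) / (blocks per second) for one time step, and the step is finished only when the slowest process is done. The processing rates used below are purely hypothetical placeholders, not measurements from the talk, and the function names are made up for this sketch.

```python
def step_time(blocks_per_process, rates):
    """Time of one step: the slowest process determines the overall time."""
    return max(b / r for b, r in zip(blocks_per_process, rates))


def best_gpu_block_count(cpu_rate, gpu_rate, max_blocks, n_cpu=6, n_gpu=2):
    """Try 1..max_blocks blocks per GPU process, one block per CPU process.

    Returns (blocks per GPU process, blocks processed per unit time) for the
    configuration with the highest overall throughput in this model.
    """
    rates = [cpu_rate] * n_cpu + [gpu_rate] * n_gpu
    best = None
    for k in range(1, max_blocks + 1):
        blocks = [1] * n_cpu + [k] * n_gpu
        throughput = sum(blocks) / step_time(blocks, rates)
        if best is None or throughput > best[1]:
            best = (k, throughput)
    return best


# HYPOTHETICAL rates: a GPU process handles 4 blocks in the time a CPU
# process handles 1. Real values have to be measured.
blocks_per_gpu, throughput = best_gpu_block_count(cpu_rate=1.0, gpu_rate=4.0,
                                                  max_blocks=10)
```

In this model the best configuration is the one where CPU and GPU processes finish at the same time; fewer GPU blocks leave the GPUs idle, more leave the CPUs waiting.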

Numerical Results: Multigrid

Image denoising of 3D CT volume
Data: Siemens AG, Healthcare Sector

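Denoising by diffusion
Idea: Use a nonlinear anisotropic diffusion process to denoise the image $u_0$ in the domain $\Omega$, i.e. solve the time-dependent PDE
$$\frac{\partial u}{\partial t} = \operatorname{div}(g \nabla u) \quad \text{in } \Omega \times T$$
$$\langle g \nabla u, n \rangle = 0 \quad \text{on } \partial\Omega \times T$$
$$u(x, 0) = u_0(x) \quad \text{in } \Omega$$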
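A minimal sketch of one time step of such a diffusion process is given below. It is simplified in several ways that are assumptions of this example and not statements about the method in the talk: it works in 1D, it uses a scalar Perona-Malik-type diffusivity $g = 1 / (1 + |\nabla u|^2 / \lambda^2)$ instead of an anisotropic one, and it uses an explicit Euler step with step size `tau`. The zero-flux boundary condition matches the PDE above.

```python
def diffusivity(grad, lam):
    """Assumed edge-stopping function: small where the gradient is large."""
    return 1.0 / (1.0 + (grad / lam) ** 2)


def diffusion_step(u, tau=0.2, lam=1.0):
    """One explicit step of du/dt = div(g * grad u) on a 1D grid (h = 1).

    flux[i] is the flux through the interface left of cell i; the first and
    last entries stay zero, which realises the zero-flux boundary condition.
    """
    n = len(u)
    flux = [0.0] * (n + 1)
    for i in range(n - 1):
        grad = u[i + 1] - u[i]
        flux[i + 1] = diffusivity(grad, lam) * grad
    return [u[i] + tau * (flux[i + 1] - flux[i]) for i in range(n)]


# Noisy step signal: small wiggles are smoothed, the large jump is preserved
# longer because the diffusivity is small across it.
signal = [0.0, 0.1, -0.1, 0.05, 1.0, 1.1, 0.9, 1.0]
smoothed = diffusion_step(signal)
```

Because the boundary fluxes are zero, the sum of all values is preserved by each step, and for `tau <= 0.5` the new values stay between the old minimum and maximum.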

Multigrid idea
Multigrid methods are based on two principles:
1. Smoothing property: smooth the error on the fine grid
2. Coarse grid principle: approximate the smooth error on coarser grids
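The two principles can be combined into a V-cycle, sketched here for the 1D model problem $-u'' = f$ with zero boundary values. The model problem, the Gauss-Seidel smoother, full-weighting restriction, and linear interpolation are assumptions of this sketch; the talk does not specify the components of its GPU solver. The defaults `pre=2, post=2` correspond to a (2,2)-V-cycle.

```python
import math


def smooth(u, f, h, sweeps):
    """Gauss-Seidel sweeps: damp the oscillatory error on this grid."""
    for _ in range(sweeps):
        for i in range(1, len(u) - 1):
            u[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])


def residual(u, f, h):
    r = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h)
    return r


def restrict(r):
    """Full weighting onto the grid with half as many intervals."""
    nc = (len(r) - 1) // 2
    rc = [0.0] * (nc + 1)
    for i in range(1, nc):
        rc[i] = 0.25 * r[2 * i - 1] + 0.5 * r[2 * i] + 0.25 * r[2 * i + 1]
    return rc


def prolong(ec):
    """Linear interpolation onto the grid with twice as many intervals."""
    e = [0.0] * (2 * (len(ec) - 1) + 1)
    for i in range(len(ec)):
        e[2 * i] = ec[i]
    for i in range(len(ec) - 1):
        e[2 * i + 1] = 0.5 * (ec[i] + ec[i + 1])
    return e


def v_cycle(u, f, h, pre=2, post=2):
    n = len(u) - 1                       # number of intervals, a power of 2
    if n == 2:                           # coarsest grid: one unknown, solve exactly
        u[1] = 0.5 * (u[0] + u[2] + h * h * f[1])
        return u
    smooth(u, f, h, pre)                 # principle 1: smooth the error
    rc = restrict(residual(u, f, h))     # principle 2: go to the coarse grid
    ec = v_cycle([0.0] * len(rc), rc, 2.0 * h, pre, post)
    e = prolong(ec)
    for i in range(1, n):
        u[i] += e[i]                     # coarse grid correction
    smooth(u, f, h, post)
    return u


def solve_demo(n=64, cycles=8):
    """Solve -u'' = pi^2 sin(pi x); the exact solution is sin(pi x)."""
    h = 1.0 / n
    x = [i * h for i in range(n + 1)]
    f = [math.pi ** 2 * math.sin(math.pi * xi) for xi in x]
    u = [0.0] * (n + 1)
    r0 = max(abs(v) for v in residual(u, f, h))
    for _ in range(cycles):
        v_cycle(u, f, h)
    r1 = max(abs(v) for v in residual(u, f, h))
    err = max(abs(ui - math.sin(math.pi * xi)) for ui, xi in zip(u, x))
    return r0, r1, err


initial_residual, final_residual, max_error = solve_demo()
```

After a few cycles the residual is reduced by several orders of magnitude, and the remaining difference to the exact solution is dominated by the discretization error of the grid.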

GPU single node performance
Figure: runtime for a (2,2)-V-cycle [s] versus number of unknowns (128x128x128, 256x128x128, 256x256x128, 256x256x256) on Tesla C2050 and Tesla M1060, each in single and double precision.

Weak scaling ($256^3$ unknowns per unit)
Figure: runtime for a (2,2)-V-cycle [s] versus number of processing units (axis from 0 to 16) for the CPU and GPU versions, each in single and double precision.

Strong scaling ($256^3$ unknowns)
Figure: runtime for a (2,2)-V-cycle [s] versus number of processing units (axis from 0 to 10) for the CPU and GPU versions, each in single and double precision.
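For reading weak and strong scaling plots of this kind, the usual efficiency measures can be written down in a few lines. The definitions are the standard ones; the function names are chosen for this sketch, and the runtimes in the usage lines are hypothetical placeholders, not values read from the figures.

```python
def weak_scaling_efficiency(t1, tn):
    """Problem size grows with the unit count; ideal runtime stays at t1."""
    return t1 / tn


def strong_scaling_speedup(t1, tn):
    """Problem size is fixed; speedup of n units relative to one unit."""
    return t1 / tn


def strong_scaling_efficiency(t1, tn, n):
    """Fraction of the ideal speedup n that is actually reached."""
    return t1 / (n * tn)


# HYPOTHETICAL runtimes in seconds, for illustration only.
weak_eff = weak_scaling_efficiency(t1=1.0, tn=1.25)               # 0.8
strong_eff = strong_scaling_efficiency(t1=2.0, tn=0.5, n=8)       # 0.5
```

A flat curve in a weak scaling plot corresponds to an efficiency of 1, while in a strong scaling plot the runtime ideally drops like 1/n.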

Future Work
- Performance engineering: heterogeneous computing, OpenCL optimization, better tools
- Applications: improve multigrid coarse grid performance; extend LBM boundary conditions and free surfaces