Software and Performance Engineering for numerical codes on GPU clusters

1 Software and Performance Engineering for numerical codes on GPU clusters. H. Köstler. International Workshop of GPU Solutions to Multiscale Problems in Science and Engineering, Harbin, China.

4 Contents. WaLBerla: Applications; Software and Performance Engineering. Numerical Results: LBM; Multigrid. Future Work.

5 WaLBerla: Applications

6 WaLBerla: a parallel structured-grid framework for various applications

7 Free surface flows. The gas phase is modeled only by boundary conditions. Examples: rising bubble, metal foam, sinking box, PEM (proton exchange membrane) fuel cell.

8 Fluid-structure interaction: particle segregation. 64.3 million objects in an LBM grid with 151 billion lattice cells. Runs on cores of a Blue Gene/P at the Jülich Supercomputing Centre.

9 WaLBerla: Software and Performance Engineering

10 WaLBerla framework. Main design goal: a massively parallel software framework for various engineering applications. Software quality factors: usability; reliability; maintainability and expandability; portability; efficiency and scalability.

11 Performance at all costs? Performance optimization: architectural factors, optimization techniques, programming techniques. Performance engineering: simplicity, extendability, performance, effort.

12 Performance at different scales I. WaLBerla (C++, MPI): code management, standard implementations. OpenCL framework: data-parallel computations on structured grids. Low-level kernels: optimized architecture-specific computations (C++, CUDA, intrinsics, assembler).

13 Performance at different scales II. WaLBerla: heterogeneous devices, distributed-memory parallel. Low-level kernels / OpenCL framework: local device, shared-memory parallel.

14 Efficient OpenCL? Kernels have to be adapted to the specific architecture, so optimization techniques are necessary: vectorize computations; define suitable workgroups (threads that run in parallel and share local memory); reuse local memory (copy to local memory, perform the computations, then copy back to global memory); ensure correct data alignment and coalesced memory accesses. WaLBerla supports OpenCL but also offers similar strategies for heterogeneous computing in order to support, e.g., CUDA.
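The "copy to local, compute, copy back" pattern can be illustrated without a GPU. The following is a minimal sketch in plain Python that only emulates workgroups and local memory; `WORKGROUP_SIZE`, `smooth_tiled`, and the 3-point average are hypothetical names and choices made for illustration, not WaLBerla or OpenCL code.

```python
# Emulation of the OpenCL local-memory pattern in plain Python.
# All names and the 3-point stencil are illustrative assumptions.

WORKGROUP_SIZE = 4  # hypothetical number of work-items per workgroup


def smooth_reference(global_in):
    """3-point average with clamped boundaries, reading global memory directly."""
    n = len(global_in)
    out = []
    for i in range(n):
        left = global_in[max(i - 1, 0)]
        right = global_in[min(i + 1, n - 1)]
        out.append((left + global_in[i] + right) / 3.0)
    return out


def smooth_tiled(global_in):
    """Same stencil, but each workgroup stages its tile plus halo in 'local memory'."""
    n = len(global_in)
    global_out = [0.0] * n
    for group_start in range(0, n, WORKGROUP_SIZE):
        group_end = min(group_start + WORKGROUP_SIZE, n)
        # 1) Copy the tile and one halo cell per side from global to local memory.
        local = [global_in[min(max(i, 0), n - 1)]
                 for i in range(group_start - 1, group_end + 1)]
        # 2) Every work-item computes from local memory only.
        for lid in range(group_end - group_start):
            result = (local[lid] + local[lid + 1] + local[lid + 2]) / 3.0
            # 3) Copy the result back to global memory.
            global_out[group_start + lid] = result
    return global_out
```

In this emulation each input value is read from "global memory" once per workgroup and then reused by up to three work-items from the local copy, which is the reuse the slide refers to. On a real device the staging step would additionally need a barrier between the copy and the computation.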

15 WaLBerla: Patch concept

16 WaLBerla: Sweep concept

17 WaLBerla: Communication concept

18 Numerical Results: LBM

19 Boltzmann equation. A mesoscopic approach to solving the Navier-Stokes equations. The Boltzmann equation describes the statistical distribution of one particle in a fluid: $\frac{\partial f}{\partial t} + \zeta \cdot \nabla f = \Omega(f)$, where $f$ is the probability distribution function (PDF), $\zeta$ the particle velocity, and $\Omega(f)$ the change due to collisions. It models the behavior of fluids in statistical physics.

20 Lattice Boltzmann Method. The Lattice Boltzmann Method solves the discrete Boltzmann equation. The fluid domain is split into simple cells that are treated independently in each time step. Microscopic numerical fluid solver; simple updating rules; handles complex boundary conditions for static and moving objects; inherently parallel scheme suited to hardware acceleration.
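The "simple updating rules" can be sketched as a collide step that is purely local to each cell, followed by a stream step that moves PDFs to neighbouring cells. The sketch below makes several assumptions not stated on the slide: a BGK (single relaxation time) collision operator, a 2D D2Q9 lattice instead of D3Q19 to keep it short, periodic boundaries, and a hypothetical relaxation time `TAU`.

```python
# Minimal collide-and-stream sketch (assumed: BGK collision, D2Q9, periodic grid).

# D2Q9 velocity set and the standard weights.
C = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1), (1, 1), (-1, 1), (-1, -1), (1, -1)]
W = [4 / 9] + [1 / 9] * 4 + [1 / 36] * 4
TAU = 0.8  # hypothetical relaxation time


def equilibrium(rho, ux, uy):
    """Second-order equilibrium PDFs for density rho and velocity (ux, uy)."""
    usq = ux * ux + uy * uy
    feq = []
    for (cx, cy), w in zip(C, W):
        cu = cx * ux + cy * uy
        feq.append(w * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * usq))
    return feq


def collide(cell):
    """Relax one cell's PDFs toward equilibrium; uses only that cell's data."""
    rho = sum(cell)
    ux = sum(f * cx for f, (cx, _) in zip(cell, C)) / rho
    uy = sum(f * cy for f, (_, cy) in zip(cell, C)) / rho
    feq = equilibrium(rho, ux, uy)
    return [f - (f - fe) / TAU for f, fe in zip(cell, feq)]


def step(grid):
    """One time step on a periodic grid[y][x][q]: collide, then stream."""
    ny, nx = len(grid), len(grid[0])
    post = [[collide(grid[y][x]) for x in range(nx)] for y in range(ny)]
    new = [[[0.0] * 9 for _ in range(nx)] for _ in range(ny)]
    for y in range(ny):
        for x in range(nx):
            for q, (cx, cy) in enumerate(C):
                new[(y + cy) % ny][(x + cx) % nx][q] = post[y][x][q]
    return new


def run(grid, steps):
    """Apply `steps` time steps and return the resulting grid."""
    for _ in range(steps):
        grid = step(grid)
    return grid


def total_mass(grid):
    """Sum of all PDFs; collision and periodic streaming both conserve it."""
    return sum(sum(sum(cell) for cell in row) for row in grid)
```

Because `collide` touches only one cell and `step` writes each destination entry exactly once, every cell can be updated by an independent thread, which is what makes the scheme a natural fit for GPUs.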

21 D3Q19 LBM cell
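For readers without the figure: D3Q19 means 3 dimensions and 19 discrete velocities per cell (the rest velocity, 6 axis-aligned neighbours, and 12 edge-diagonal neighbours). A small sketch that enumerates them with the standard lattice weights follows; the function name `d3q19` is a hypothetical helper, not part of WaLBerla.

```python
from itertools import product


def d3q19():
    """Return the 19 lattice velocities and their standard weights."""
    velocities, weights = [], []
    for c in product((-1, 0, 1), repeat=3):
        speed_sq = sum(k * k for k in c)
        if speed_sq == 0:
            w = 1 / 3        # rest velocity
        elif speed_sq == 1:
            w = 1 / 18       # 6 axis-aligned neighbours
        elif speed_sq == 2:
            w = 1 / 36       # 12 edge-diagonal neighbours
        else:
            continue         # the 8 cube corners are not part of D3Q19
        velocities.append(c)
        weights.append(w)
    return velocities, weights
```

Each cell therefore stores 19 PDF values, one per velocity direction.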

22 Distribution of work to threads. Each thread treats one cell. Each block treats one stripe along the x-direction with its threads, so all cells handled by a block share the same y and z index.
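The mapping described above can be written down as index arithmetic. This is a sketch under assumptions the slide does not state: blocks are numbered with y running fastest, and cells are stored with x running fastest in memory. The function names are hypothetical.

```python
def cell_of_thread(block_idx, thread_idx, ny):
    """Map a (block, thread) pair to cell coordinates (x, y, z).

    Assumes blocks are numbered y-fastest: block_idx = z * ny + y.
    """
    z, y = divmod(block_idx, ny)
    x = thread_idx
    return x, y, z


def linear_index(x, y, z, nx, ny):
    """x-fastest layout: neighbouring threads touch neighbouring addresses."""
    return (z * ny + y) * nx + x


def launch(nx, ny, nz):
    """Enumerate the cells visited by ny*nz blocks of nx threads each."""
    return [cell_of_thread(b, t, ny) for b in range(ny * nz) for t in range(nx)]
```

With this layout the threads of one block access consecutive addresses, which is the precondition for coalesced memory accesses on the GPU.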

23 LBM results I

24 HLRS Stuttgart: NEC Nehalem Cluster. Peak performance: 62 TFlops. Number of nodes: 700 (dual-socket, quad-core). Processor: Intel Xeon (X5560) GHz, 8 MB cache. Memory per node: 12 GB. Disk: 60 TB. Node-to-node interconnect: InfiniBand, GigE. Accelerators: 32 nodes provide Nvidia Tesla S

25 Weak Scaling on Tesla C1070 and HLRS cluster

26 Heterogeneous LBM results. Figure: Investigation of the load balance with 6 CPU-only processes, each working on one block, and 2 GPU processes with varying block counts. The block size is $90^3$ lattice cells.

27 Numerical Results: Multigrid

28 Image denoising of a 3D CT volume. Data: Siemens AG, Healthcare Sector.

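29 Denoising by diffusion. Idea: use a nonlinear anisotropic diffusion process to denoise the image $u_0$ in the domain $\Omega$, i.e. solve the time-dependent PDE $\frac{\partial u}{\partial t} - \operatorname{div}(g \nabla u) = 0$ in $\Omega \times T$, with $\langle g \nabla u, n \rangle = 0$ on $\partial\Omega \times T$ and $u(x, 0) = u_0(x)$ in $\Omega$.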
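To make the PDE concrete, here is a minimal 1D sketch of one way to discretize it. Everything beyond the PDE itself is an assumption: the scalar Perona-Malik-type diffusivity in `diffusivity`, the explicit Euler time stepping (the slides use multigrid, which suggests an implicit scheme instead), and the parameters `dt`, `lam`, and `h`.

```python
def diffusivity(grad, lam):
    """Assumed edge-stopping function: close to 1 in flat regions, small at edges."""
    return 1.0 / (1.0 + (grad / lam) ** 2)


def diffusion_step(u, dt, lam, h=1.0):
    """One explicit Euler step of du/dt = d/dx (g du/dx) with zero-flux boundaries."""
    n = len(u)
    # Fluxes g * du/dx at the n-1 interior cell faces; boundary faces stay zero,
    # which is the discrete form of the condition <g grad u, n> = 0.
    flux = [0.0] * (n + 1)
    for i in range(1, n):
        grad = (u[i] - u[i - 1]) / h
        flux[i] = diffusivity(grad, lam) * grad
    return [u[i] + dt * (flux[i + 1] - flux[i]) / h for i in range(n)]


def denoise(u0, steps, dt=0.2, lam=1.0):
    """Run `steps` diffusion steps starting from the noisy signal u0."""
    u = list(u0)
    for _ in range(steps):
        u = diffusion_step(u, dt, lam)
    return u
```

For example, `denoise([0.0, 0.1, -0.1, 0.05, 5.0, 5.1, 4.9, 5.05], 50)` smooths the small fluctuations on each plateau while the large jump between them diffuses much more slowly. The zero-flux boundary keeps the sum of the signal constant, and with `h = 1` the explicit step needs `dt <= 0.5` to stay stable, a restriction that an implicit scheme avoids.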

30 Multigrid idea. Multigrid methods are based on two principles: 1. Smoothing property: smooth the error on the fine grid.

31 Multigrid idea. Multigrid methods are based on two principles: 1. Smoothing property. 2. Coarse grid principle: approximate the smooth error on coarser grids.
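The two principles combine into a recursive V-cycle. The sketch below is a textbook-style illustration, not the solver used for the results in these slides: it assumes the 1D model problem $-u'' = f$ with zero boundary values, Gauss-Seidel smoothing, full-weighting restriction, and linear interpolation, with 2 pre- and 2 post-smoothing steps to mirror a (2,2)-V-cycle. All function names are hypothetical.

```python
def smooth(u, f, h, sweeps):
    """Principle 1: Gauss-Seidel sweeps damp the oscillatory part of the error."""
    for _ in range(sweeps):
        for i in range(1, len(u) - 1):
            u[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])


def residual(u, f, h):
    """r = f - A u for the standard 3-point discretization of -u''."""
    r = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h)
    return r


def restrict(r):
    """Full weighting onto the grid with half as many intervals."""
    nc = (len(r) - 1) // 2
    rc = [0.0] * (nc + 1)
    for i in range(1, nc):
        rc[i] = 0.25 * r[2 * i - 1] + 0.5 * r[2 * i] + 0.25 * r[2 * i + 1]
    return rc


def prolong(ec, n_fine):
    """Linear interpolation of the coarse correction to the fine grid."""
    e = [0.0] * (n_fine + 1)
    for i in range(len(ec) - 1):
        e[2 * i] = ec[i]
        e[2 * i + 1] = 0.5 * (ec[i] + ec[i + 1])
    e[n_fine] = ec[-1]
    return e


def v_cycle(u, f, h, pre=2, post=2):
    """One (pre, post)-V-cycle; len(u) - 1 must be a power of two, at least 2."""
    n = len(u) - 1
    if n == 2:
        u[1] = 0.5 * (u[0] + u[2] + h * h * f[1])  # exact solve: one unknown
        return u
    smooth(u, f, h, pre)
    # Principle 2: the remaining smooth error is approximated on the coarser grid.
    rc = restrict(residual(u, f, h))
    ec = v_cycle([0.0] * len(rc), rc, 2.0 * h, pre, post)
    e = prolong(ec, n)
    for i in range(1, n):
        u[i] += e[i]
    smooth(u, f, h, post)
    return u


def solve(n, cycles):
    """Run V-cycles for f = 1 and return (u, initial, final) max-norm residuals."""
    h = 1.0 / n
    f = [1.0] * (n + 1)
    u = [0.0] * (n + 1)
    r0 = max(abs(v) for v in residual(u, f, h))
    for _ in range(cycles):
        v_cycle(u, f, h)
    r1 = max(abs(v) for v in residual(u, f, h))
    return u, r0, r1
```

Calling `solve(64, 8)` runs eight V-cycles on a grid with 64 intervals; comparing the two returned residual norms shows how quickly the combination of smoothing and coarse-grid correction reduces the error, largely independent of the grid size.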

32 GPU single node performance. Figure: runtime for a (2,2)-V-cycle [s] versus number of unknowns, for Tesla C2050 and Tesla M1060, each in single and double precision.

33 Weak scaling ($256^3$ unknowns per unit). Figure: runtime for a (2,2)-V-cycle [s] versus number of processing units, for the CPU and GPU versions, each in single and double precision.

34 Strong scaling ($256^3$ unknowns). Figure: runtime for a (2,2)-V-cycle [s] versus number of processing units, for the CPU and GPU versions, each in single and double precision.

35 Future Work. Performance engineering: heterogeneous computing, OpenCL optimization, better tools. Applications: improve multigrid coarse-grid performance; extend LBM boundary conditions and free surfaces.
