A Massively Parallel Two-Phase Solver for Incompressible Fluids on Multi-GPU Clusters


1 A Massively Parallel Two-Phase Solver for Incompressible Fluids on Multi-GPU Clusters
Peter Zaspel, Michael Griebel
Institute for Numerical Simulation, Rheinische Friedrich-Wilhelms-Universität Bonn
GPU Technology Conference 2012

2 Outline
1. Introduction
2. Multi-GPU Fluid Solver
3. Uncertainty Quantification in CFD

3 Two-phase flows
- major topic in computational fluid dynamics
- simulating the interaction of two fluids such as air & water or water & oil
- interesting small-scale phenomena: surface tension effects, droplet deformation, bubble dynamics
- large-scale studies: ship construction, river simulation


4 NaSt3DGPF - A 3D two-phase Navier-Stokes solver
- 3D grid-based fluid solver for the non-stationary two-phase incompressible Navier-Stokes equations
- simulation of two interacting fluids (e.g. air, water) with high-order methods
- running on HPC clusters

5 Application: Flows through hydraulic constructions
- A real-world sluice system was successfully modeled and simulated
- Joint project with the German Federal Waterways Engineering and Research Institute

6 Bionics-motivated textile design
- development of textiles and materials which can store oil
- model in nature: specialized bees that transport oil


7 Application: Fluid animation
Coupling NaSt3DGPF and Autodesk Maya:
- integration of the full fluid simulation pipeline in Maya
- graphical setup of simulation scenes
- automatic parallel solver run
- integration of computed fluid simulations
- photo-realistic rendering of fluid surfaces
Z., Griebel: Photorealistic Visualization and Fluid Animation [...], Comp. Vis. Sci., 2012, accepted.

8 Outline
1. Introduction
2. Multi-GPU Fluid Solver
3. Uncertainty Quantification in CFD

9 NaSt3DGPF - A 3D two-phase Navier-Stokes solver
We have ported our in-house fluid solver to the GPU:
- 3D grid-based fluid solver for the non-stationary two-phase incompressible Navier-Stokes equations
- simulation of two interacting fluids (e.g. air, water)
- FD solver based on Chorin's pressure projection approach
- high-order space discretizations: e.g. 5th-order WENO
- time discretizations: 3rd-order Runge-Kutta, Adams-Bashforth
- Jacobi-preconditioned conjugate gradient solver for the Poisson equation
- complex geometries with different boundary conditions
- multi-GPU MPI parallelization by domain decomposition
References:
- Griebel, Z.: A multi-GPU accelerated solver for the three-dimensional two-phase incompressible Navier-Stokes equations. Comp. Sci. Res. Dev., 25(1-2):65-73, May [...].
- Z., Griebel: Solving incompressible two-phase flows on multi-GPU clusters. Comp. Fluids, accepted.
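The slide names a Jacobi-preconditioned conjugate gradient solver for the Poisson equation. The following is a minimal CPU sketch of that iteration in NumPy, not the solver's GPU implementation. The dense 1D test matrix, the function names and the tolerance are assumptions chosen only for illustration.

```python
import numpy as np

def jacobi_pcg(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite A with Jacobi-preconditioned CG."""
    inv_diag = 1.0 / np.diag(A)   # Jacobi preconditioner: inverse of diag(A)
    x = np.zeros_like(b)
    r = b - A @ x                 # initial residual
    z = inv_diag * r              # preconditioned residual
    p = z.copy()                  # first search direction
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        z = inv_diag * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

def poisson_1d(n):
    """Dense 1D Dirichlet Laplacian (2 on the diagonal, -1 next to it): a toy stand-in."""
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

A = poisson_1d(50)
b = np.ones(50)
x = jacobi_pcg(A, b)
```

In a parallel setting the two expensive ingredients are the matrix-vector product `A @ p` and the dot products, which is where domain decomposition and reductions come in.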

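10 Core technique for two-phase flows: Level-set method
Level-set method: representation of a free surface $\Gamma_t$ by a signed distance function $\phi : \mathbb{R}^3 \times \mathbb{R} \to \mathbb{R}$:
$$\Gamma_t = \{x \mid \phi(x, t) = 0\}, \qquad |\nabla \phi| = 1$$
Fluid phase distinction by the sign of the level-set function:
$$\phi(x, t) > 0 \text{ for } x \in \Omega_1, \qquad \phi(x, t) \le 0 \text{ for } x \in \Omega_2$$
[Figure: two domains separated by the interface, labeled $\Omega_1 : \phi > 0$ and $\Omega_2 : \phi < 0$]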
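As an illustration of the signed distance representation, here is a small NumPy sketch that builds a level-set function for a spherical bubble and checks the property $|\nabla \phi| = 1$. The geometry reuses the bubble benchmark of this talk (20 cm cube, radius 3 cm, center (10, 6, 10) cm). The grid size and the choice that the inside of the bubble has negative sign are assumptions.

```python
import numpy as np

n = 41                                   # grid points per direction (assumed)
axis = np.linspace(0.0, 20.0, n)         # coordinates in cm
X, Y, Z = np.meshgrid(axis, axis, axis, indexing="ij")

center = (10.0, 6.0, 10.0)               # bubble center in cm
radius = 3.0                             # bubble radius in cm

# Signed distance to the sphere: negative inside, zero on the surface, positive outside.
phi = np.sqrt((X - center[0])**2 + (Y - center[1])**2 + (Z - center[2])**2) - radius

# A signed distance function has gradient norm 1 (away from the center of the sphere).
h = axis[1] - axis[0]
gx, gy, gz = np.gradient(phi, h)
grad_norm = np.sqrt(gx**2 + gy**2 + gz**2)
```

The sign of `phi` at a grid point then tells which fluid occupies that point, and the zero contour is the free surface.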

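11 Two-phase Navier-Stokes equations
PDE system:
$$\rho(\phi)\left(\partial_t u + (u \cdot \nabla) u\right) = \nabla \cdot (\mu(\phi) S) - \nabla p - \sigma \kappa(\phi) \delta(\phi) \nabla \phi + \rho(\phi) g$$
$$\nabla \cdot u = 0$$
$$\partial_t \phi + u \cdot \nabla \phi = 0$$
with
$$S := \nabla u + \{\nabla u\}^T$$
$$\rho(\phi) := \rho_2 + (\rho_1 - \rho_2) H(\phi), \qquad \mu(\phi) := \mu_2 + (\mu_1 - \mu_2) H(\phi)$$
$$H(\phi) := \begin{cases} 0 & \text{if } \phi < 0 \\ \tfrac{1}{2} & \text{if } \phi = 0 \\ 1 & \text{if } \phi > 0 \end{cases}$$
Symbols:
- $u$: fluid velocity
- $p$: pressure
- $\phi$: level-set function
- $\rho$: density
- $\mu$: dynamic viscosity
- $S$: stress tensor
- $\sigma$: surface tension
- $g$: volume forces
- $\kappa$: local curvature of the fluid surface
- $\delta$: Dirac delta functional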
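The phase-dependent material parameters $\rho(\phi)$ and $\mu(\phi)$ can be sketched directly from the Heaviside definition. The snippet below is an illustration only. It assumes that phase 1 is water and phase 2 is air, and it uses rounded textbook values at 20 °C that do not come from the slides.

```python
import numpy as np

def heaviside(phi):
    """H(phi): 0 for phi < 0, 1/2 for phi = 0, 1 for phi > 0."""
    return np.heaviside(phi, 0.5)

def blend(phi, value_1, value_2):
    """Material parameter value_2 + (value_1 - value_2) * H(phi)."""
    return value_2 + (value_1 - value_2) * heaviside(phi)

# Assumed, rounded material data (not from the slides): water as phase 1, air as phase 2.
RHO_WATER, RHO_AIR = 998.0, 1.2        # density in kg/m^3
MU_WATER, MU_AIR = 1.0e-3, 1.8e-5      # dynamic viscosity in Pa s

phi = np.array([-1.0, 0.0, 1.0])       # sample level-set values
rho = blend(phi, RHO_WATER, RHO_AIR)
mu = blend(phi, MU_WATER, MU_AIR)
```

For the three sample values, `rho` takes the air density, the arithmetic mean of both densities, and the water density.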

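12 Solver algorithm based on pressure projection
For t = 1, 2, ... do:
1. set boundary conditions for $u^n$
2. compute intermediate velocity field $u^*$:
$$\frac{u^* - u^n}{\delta t} = -(u^n \cdot \nabla) u^n + g + \frac{1}{\rho(\phi^n)} \nabla \cdot (\mu(\phi^n) S^n) - \frac{1}{\rho(\phi^n)} \sigma \kappa(\phi^n) \delta(\phi^n) \nabla \phi^n$$
3. apply boundary conditions and transport level-set function:
$$\phi^* = \phi^n + \delta t \, (-u^n \cdot \nabla \phi^n)$$
4. reinitialize level-set function by solving
$$\partial_\tau d + \operatorname{sign}(\phi^*) (|\nabla d| - 1) = 0, \qquad d^0 = \phi^*$$
5. solve the pressure Poisson equation with $\phi^{n+1} = d$:
$$\nabla \cdot \left( \frac{\delta t}{\rho(\phi^{n+1})} \nabla p^{n+1} \right) = \nabla \cdot u^*$$
6. apply velocity correction:
$$u^{n+1} = u^* - \frac{\delta t}{\rho(\phi^{n+1})} \nabla p^{n+1}$$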

13 Solver algorithm based on pressure projection
(same six steps as on slide 12)
This is now done on multiple GPUs.
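The pressure projection (solve the Poisson equation, then correct the velocity) can be illustrated in a heavily simplified setting. The sketch below assumes a single fluid with constant density, a 2D periodic box and a spectral (FFT) Poisson solve. This is a swapped-in technique for illustration and differs from the variable-density finite difference PCG solver of the talk. All names are hypothetical.

```python
import numpy as np

def wavenumbers(n, length):
    """Angular wavenumber grids (kx, ky) for an n x n periodic box of side `length`."""
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=length / n)
    return np.meshgrid(k, k, indexing="ij")

def spectral_divergence(u, v, length=2.0 * np.pi):
    """Divergence of a periodic 2D velocity field, computed with FFTs."""
    kx, ky = wavenumbers(u.shape[0], length)
    return np.fft.ifft2(1j * kx * np.fft.fft2(u) + 1j * ky * np.fft.fft2(v)).real

def project(u_star, v_star, dt, rho, length=2.0 * np.pi):
    """Solve div((dt/rho) grad p) = div u*, then return u* - (dt/rho) grad p."""
    kx, ky = wavenumbers(u_star.shape[0], length)
    k2 = kx**2 + ky**2
    k2[0, 0] = 1.0                       # avoid 0/0; the mean pressure is arbitrary
    u_hat, v_hat = np.fft.fft2(u_star), np.fft.fft2(v_star)
    div_hat = 1j * kx * u_hat + 1j * ky * v_hat
    p_hat = -(rho / dt) * div_hat / k2   # Laplacian becomes -k2 in Fourier space
    p_hat[0, 0] = 0.0
    u_new = np.fft.ifft2(u_hat - (dt / rho) * 1j * kx * p_hat).real
    v_new = np.fft.ifft2(v_hat - (dt / rho) * 1j * ky * p_hat).real
    return u_new, v_new

n = 32
x = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
X, Y = np.meshgrid(x, x, indexing="ij")
# Intermediate velocity: a divergence-free vortex plus a compressive part sin(x).
u_star = np.sin(X) * np.cos(Y) + np.sin(X)
v_star = -np.cos(X) * np.sin(Y)
u_new, v_new = project(u_star, v_star, dt=0.01, rho=1.0)
```

After the correction the compressive part `sin(x)` is removed and the remaining field is divergence-free up to round-off.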

14 CPU → GPU porting process
Our approach:
1. identification of the most time-consuming parts of the CPU code (a good starting point)
2. stepwise porting with full CPU ↔ GPU data copy before and after each GPU computation and per-method memory allocation
3. continuously: GPU code validation for each porting step
4. stepwise unification of data fields and reduction of CPU ↔ GPU data transfers
5. overall optimization
Advantages:
- first results within a short period of time
- easy code validation during the porting process

15 Design principles of the GPU code (1)
General:
- CUDA as GPU programming framework
- full double-precision implementation
- linearization of 3D data fields
Memory hierarchies:
- use global memory wherever acceptable: low algorithmic complexity; L1/L2 caches are becoming more common and faster
- optimization with shared memory for the most time-critical parts
- shared-memory-based parallel reduction taken from the SDK
Compute configuration for maximized GPU occupancy:
- use of the maximum number of threads supported by the streaming multiprocessor (SM)
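Linearization of 3D data fields means storing a field with indices (i, j, k) in one flat array. The sketch below shows one possible mapping with the x index running fastest. The actual memory layout of the solver is not given on the slide, so this ordering and the function names are assumptions.

```python
import numpy as np

def linear_index(i, j, k, nx, ny):
    """Flat index of cell (i, j, k) when x runs fastest, then y, then z."""
    return i + nx * (j + ny * k)

def grid_index(idx, nx, ny):
    """Inverse mapping: recover (i, j, k) from the flat index."""
    k, rest = divmod(idx, nx * ny)
    j, i = divmod(rest, nx)
    return i, j, k

nx, ny, nz = 4, 5, 6
flat = np.arange(nx * ny * nz)                  # flat storage, value == flat index
field = flat.reshape((nx, ny, nz), order="F")   # 3D view with the same x-fastest layout
idx = linear_index(1, 2, 3, nx, ny)
```

With this layout, neighbouring cells in x are adjacent in memory, which is the access pattern that consecutive GPU threads would typically follow.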

16 Design principles of the GPU code (2)
Compute-intensive kernels:
- high instruction count per kernel
- high register usage = slow kernels (example: WENO stencil)
- solution: precompute some parts in another kernel
What remains on the CPU?
- configuration file parser
- binary/visualization data file input/output
- parallel communication

17 Multi-GPU parallelization by domain decomposition
- multi-GPU parallelization fully integrated with the distributed-memory MPI parallelization of the CPU code
- mapping: 1 GPU per 1 CPU core

18 Optimizing multi-GPU data exchanges
Prepacking of boundary data on the GPU:
[Diagram: data path between neighbouring subdomains: buffer on GPU → buffer in CPU RAM → buffer on GPU]
Overlapping communication and computation (PCG solver):
[Diagram: the matrix-vector product Ax on inner cells runs while boundary data is exchanged; the matrix-vector product on boundary cells follows and completes the result]
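The overlap idea can be sketched without MPI or CUDA: the matrix-vector product is split into inner cells, which need no neighbour data, and boundary cells, which are finished once the halo values have arrived. The example below emulates two subdomains inside one Python process with a 1D Laplacian. The names and the 1D setting are assumptions for illustration, and on real hardware the inner-cell step would run concurrently with the exchange.

```python
import numpy as np

def matvec_inner(x_local):
    """Stencil 2*x[i] - x[i-1] - x[i+1] on cells whose neighbours are all local."""
    y = np.zeros_like(x_local)
    y[1:-1] = 2.0 * x_local[1:-1] - x_local[:-2] - x_local[2:]
    return y

def matvec_boundary(y, x_local, halo_left, halo_right):
    """Finish the two cells next to the subdomain interfaces using received halo values."""
    y[0] = 2.0 * x_local[0] - halo_left - x_local[1]
    y[-1] = 2.0 * x_local[-1] - x_local[-2] - halo_right
    return y

# Global vector split across two hypothetical subdomains ("GPU 0" and "GPU 1").
x = np.arange(1.0, 9.0) ** 2
left, right = x[:4], x[4:]

# 1. Prepack the boundary values that each side would send to its neighbour.
send_from_left, send_from_right = left[-1], right[0]
# 2. Inner cells: on real hardware this overlaps with the exchange of step 1.
y_left, y_right = matvec_inner(left), matvec_inner(right)
# 3. Boundary cells, after the halos have arrived (outer boundaries: zero Dirichlet values).
y_left = matvec_boundary(y_left, left, 0.0, send_from_right)
y_right = matvec_boundary(y_right, right, send_from_left, 0.0)
y_split = np.concatenate([y_left, y_right])

# Reference: the same product computed on the undivided vector.
A = 2.0 * np.eye(8) - np.eye(8, k=1) - np.eye(8, k=-1)
y_reference = A @ x
```

The split result matches the undivided product, so the reordering changes only when the work is done, not what is computed.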

19 Results

20 Benchmarking problem: air bubble rising in water
Properties:
- domain size: 20 cm × 20 cm × 20 cm
- liquid phase: water at 20 °C
- gas phase: air at 20 °C
- surface tension: standard
- volume forces: standard gravity
- initial air bubble radius: 3 cm
- initial center position of bubble: (10 cm, 6 cm, 10 cm)

21 Performance measurements for GPUs
Perfectly fair CPU-GPU benchmarks are very hard!
1 GPU vs. 1 CPU core:
- (+) good GPU results
- (-) CPU speed unclear
- (-) not realistic wrt. price
1 GPU vs. 1 CPU socket:
- (+) better price realism
- (-) # of cores per socket?
- (-) speed per CPU core?
Performance per dollar:
- (++) best price realism
- (-) price per node / CPU?
- (-) prices subject to changes
Performance per Watt:
- (++) Green IT
- (+) power costs
- (-) high influence of config.

22 Benchmarking platforms
CPU hardware:
- dual 6-core Intel Xeon E5650 CPU, 2.67 GHz
- 24 GB DDR3 RAM
GPU cluster (8 GT200 GPUs):
- 2 workstations with 4-core Intel Core i7-920 CPU, 2.66 GHz
- 12 GB DDR3 RAM
- NVIDIA Tesla S1070 (4 GPUs)
- InfiniBand 40G QDR ConnectX
- Ubuntu Linux [...] bit
- GCC compiler, CUDA 3.2 SDK, OpenMPI [...]
GPU hardware (GF100 Fermi):
- 4-core Intel Xeon E5620 CPU, 2.40 GHz
- 6 GB DDR3 RAM
- NVIDIA Tesla C2050 GPU
GPU cluster (48 GF100 GPUs), RWTH Aachen University, Germany:
- 24 double-GPU systems with 2 Intel Xeon X[...] GHz
- 2 NVIDIA Quadro 6000 GPUs
- QDR InfiniBand
- Scientific Linux 6.1, GCC 4.4.5, OpenMPI 1.5.3, CUDA [...]

23 Performance per dollar
[Plot: speed-up on one GPU vs. simulation grid resolution. Curves: GT200 GPU vs. 6-core Xeon CPU; GF100 GPU w. ECC vs. dual 6-core Xeon CPU; GF100 GPU w/o ECC vs. dual 6-core Xeon CPU]
Costs:
- NVIDIA Tesla C1060 GPU: similar to a 6-core Xeon CPU
- new NVIDIA Tesla C2050 Fermi GPU: similar to two 6-core Xeon CPUs
Speed-ups:
- 1 core vs. 1 GPU (w. ECC): > 36x speedup
- 4-core socket vs. 1 GPU (w. ECC): > 9x speedup

24 Performance per Watt
[Plot: power consumption in kWh vs. grid resolution. Curves: dual 6-core Xeon CPU; 8 GT200 GPUs; GF100 GPU with ECC; GF100 GPU w/o ECC]
Fermi-type GPU more than two times more power-efficient

25 Multi-GPU performance (GT200 GPUs)
[Plot, strong scaling / speed-up: speed-up relative to one GT200 GPU vs. number of GPUs, for several grid resolutions, w/o and w. overlap]
[Plots, weak scaling / scale-up: weak scaling efficiency (in %) relative to one GPU vs. number of GT200 GPUs, for several grid resolutions per GPU, w/o and w. overlap]

26 Multi-GPU performance (Fermi GPUs)
[Plot, strong scaling / speed-up: speed-up relative to one Fermi GPU vs. number of Fermi GPUs, for several grid resolutions]
[Plot, weak scaling / scale-up: weak scaling efficiency (in percent) relative to one GPU vs. number of Fermi GPUs, for several grid resolutions per GPU]
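The two quantities on the scaling plots can be written down as short formulas. The sketch below uses the usual definitions: strong scaling keeps the total problem size fixed, weak scaling keeps the problem size per GPU fixed. The timings are invented purely to exercise the functions and are not measurements from the talk.

```python
def strong_scaling_speedup(time_one_gpu, time_n_gpus):
    """Speed-up for a fixed total problem size, relative to one GPU."""
    return time_one_gpu / time_n_gpus

def weak_scaling_efficiency(time_one_gpu, time_n_gpus):
    """Efficiency in percent when the problem size per GPU is held fixed."""
    return 100.0 * time_one_gpu / time_n_gpus

# Hypothetical wall-clock times in seconds (not taken from the slides).
speedup = strong_scaling_speedup(100.0, 16.0)
efficiency = weak_scaling_efficiency(100.0, 125.0)
```

Ideal strong scaling on N GPUs gives a speed-up of N, and ideal weak scaling gives an efficiency of 100 percent.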

27 Outline
1. Introduction
2. Multi-GPU Fluid Solver
3. Uncertainty Quantification in CFD

28 Uncertainty Quantification (UQ) in CFD
Current standard:
- CFD simulations for fixed and known input parameters
Problem:
- natural phenomena: input data not known exactly
- engineering: constructions / measurements always subject to perturbations
Solution:
- numerical modeling and analysis of uncertainties → Uncertainty Quantification
Joint project with M. Griebel and C. Rieger

29 High-level idea of standard UQ methods
Algorithmic idea:
1. sampling of stochastic input parameters according to some distribution
2. computation of hundreds or thousands of stochastic realizations (high-resolution simulations)
3. extraction of averaged data (expectation value), variance information, ... as a post-processing step
Properties:
- non-intrusive method
- all available solvers usable without reprogramming
Challenge:
- computation of realizations is highly parallel but extremely time-intensive
- handling of enormous data quantities in the post-processing step
- an ideal setting for multi-GPU clusters at exascale
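The three algorithmic steps can be sketched as a plain sampling loop around an unmodified solver. In the snippet below, `toy_solver` is a hypothetical stand-in for one high-resolution simulation. The uniform input distribution and the scalar output are likewise assumptions made to keep the example small.

```python
import numpy as np

def toy_solver(viscosity):
    """Hypothetical stand-in for one deterministic simulation with a scalar output."""
    return 1.0 / viscosity

rng = np.random.default_rng(0)

# 1. Sample the uncertain input parameter (assumed uniform on [0.5, 1.5]).
samples = rng.uniform(0.5, 1.5, size=1000)

# 2. One independent solver run per sample; these runs could all execute in parallel.
realizations = np.array([toy_solver(mu) for mu in samples])

# 3. Post-processing: expectation value and variance of the output.
mean = realizations.mean()
variance = realizations.var(ddof=1)
```

The solver is used as a black box, which is what makes the approach non-intrusive. For this toy model the exact expectation is ln(3), which the sample mean approximates.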

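30 Stochastic formulation of the Navier-Stokes equations
Find stochastic functions $u : D \times \mathbb{R}_+ \times \Omega \to \mathbb{R}^3$, $p : D \times \mathbb{R}_+ \times \Omega \to \mathbb{R}$ such that P-almost surely:
$$\frac{\partial}{\partial t} u(x, t, \omega) + u(x, t, \omega) \cdot \nabla u(x, t, \omega) + \frac{1}{\rho(\omega)} \left( \nabla p(x, t, \omega) - \mu(\omega) \Delta u(x, t, \omega) \right) = f(x, \omega) \quad \text{on } D$$
$$\nabla \cdot u(x, t, \omega) = 0 \quad \text{on } D$$
$$u(x, t, \omega) = g(x, \omega) \quad \text{on } \partial D$$
- $x \in D \subset \mathbb{R}^3$: physical domain
- $(\Omega, \mathcal{F}, P)$: complete probability space
- $\Omega$: set of outcomes, $\mathcal{F} \subset 2^\Omega$: σ-algebra of events, $P : \mathcal{F} \to [0, 1]$: probability measure
- $\rho, \mu : \Omega \to \mathbb{R}$, $f, g : \Omega \to \mathbb{R}^3$: stochastic functions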

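31 General setting
Formulation for the general setting:
Find stochastic functions $u : D \times \mathbb{R}_+ \times \Omega \to \mathbb{R}^l$ such that P-almost surely:
$$L(a(x, t, \omega)) \, u(x, t, \omega) = f(x, t, \omega) \quad \text{for all } x \in D \subset \mathbb{R}^d, \; t \in \mathbb{R}_+ \quad (+ \text{BC})$$
- $L$: general operator depending on parameters modeled by the random field $a$
- $a, f : D \times \mathbb{R}_+ \times \Omega \to \mathbb{R}^s$: stochastic functions
- $x \in D \subset \mathbb{R}^3$: physical domain
- $(\Omega, \mathcal{F}, P)$: complete probability space
- $\Omega$: set of outcomes, $\mathcal{F} \subset 2^\Omega$: σ-algebra of events, $P : \mathcal{F} \to [0, 1]$: probability measure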

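32 Finite noise assumption
Introduction of a finite number of random variables:
$$\Gamma_M : \Omega \to \mathbb{R}^M, \quad \omega \mapsto (Y_1(\omega), \ldots, Y_M(\omega))^T = (y^{(1)}, \ldots, y^{(M)})^T =: y, \quad M \in \mathbb{N}$$
$$\Omega_M := \Gamma_M(\Omega), \qquad (\Omega, \mathcal{F}, P) \to (\Omega_M, \mathcal{B}(\Omega_M), \rho_M(y)), \qquad \rho_M \text{ joint probability density}$$
$$E[Y_i] = 0, \quad E[Y_i Y_j] = \delta_{i,j} \quad \text{for all } i, j = 1, \ldots, M$$
Finite-dimensional representation of the stochastic space:
$$a(x, t, \omega) \approx a_M(x, t, \Gamma_M(\omega)) = a_M(x, t, y)$$
$$f(x, t, \omega) \approx f_M(x, t, \Gamma_M(\omega)) = f_M(x, t, y)$$
With further assumptions, the solution can be given as:
$$u(x, t, \omega) \approx u_M(x, t, \Gamma_M(\omega)) = u_M(x, t, y)$$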

33 Stochastic RBF collocation
1. generate $N(\Omega)$ stochastic samples (collocation points) wrt. some distribution:
$$X_{N(\Omega)} := \{y_1, \ldots, y_{N(\Omega)}\} \subset \Omega_M$$
2. solve $N(\Omega)$ discretized deterministic problems:
$$L^{(h)}(a_M^{(h)}(x, t, y_i)) \, u_M^{(h)}(x, t, y_i) = f_M^{(h)}(x, t, y_i) \quad \text{for all } x \in D \subset \mathbb{R}^d, \; t \in \mathbb{R}_+$$
3. compute the Lagrange basis $\{l_i\}$ in the trial space $V_{X_{N(\Omega)}} := \operatorname{span}\{K(\cdot, y_i) \mid y_i \in X_{N(\Omega)}\}$:
$$\{l_i(y)\}_{i=1}^{N(\Omega)}, \quad l_i(y) \in V_{X_{N(\Omega)}}, \quad l_i(y_j) = \delta_{i,j}$$
with the RBF kernel $K(\cdot, \cdot)$, i.e. $K(y_i, y_j) = k(\|y_i - y_j\|)$, e.g. $k(x) = e^{-\alpha x^2}$
4. interpolate the solution:
$$u_M^{(h)}(x, t, y) = \sum_{i=1}^{N(\Omega)} u^{(h)}(x, t, y_i) \, l_i(y)$$

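34 Computation of Lagrange basis $\{l_i(y)\}_{i=1}^{N(\Omega)}$
Given: samples in stochastic space
$$X_{N(\Omega)} := \{y_1, \ldots, y_{N(\Omega)}\} \subset \Omega_M$$
Target: Lagrange basis $\{l_i(y)\}_{i=1}^{N(\Omega)}$ in the trial space $V_{X_{N(\Omega)}} := \operatorname{span}\{K(\cdot, y_i) \mid y_i \in X_{N(\Omega)}\}$ with
$$l_i(y) \in V_{X_{N(\Omega)}}, \quad l_i(y_j) = \delta_{i,j}, \quad K(y_i, y_j) = k(\|y_i - y_j\|), \quad k(x) = e^{-\alpha x^2}$$
Evaluation of the Lagrange basis at a point $q \in \Omega_M$: solution of a linear system for each $q$,
$$A \, l(q) = R(q)$$
with
$$A = \begin{pmatrix} K(y_1, y_1) & \cdots & K(y_1, y_{N(\Omega)}) \\ \vdots & \ddots & \vdots \\ K(y_{N(\Omega)}, y_1) & \cdots & K(y_{N(\Omega)}, y_{N(\Omega)}) \end{pmatrix}, \quad l(q) = \begin{pmatrix} l_1(q) \\ \vdots \\ l_{N(\Omega)}(q) \end{pmatrix}, \quad R(q) = \begin{pmatrix} K(q, y_1) \\ \vdots \\ K(q, y_{N(\Omega)}) \end{pmatrix}$$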
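The linear system $A \, l(q) = R(q)$ can be set up and solved in a few lines of NumPy. The sketch below uses the Gaussian kernel from the slide and solves for many query points at once. The five sample points, the value of α and the function names are assumptions for illustration, and `np.linalg.solve` stands in for the GPU LU factorization.

```python
import numpy as np

def gaussian_kernel(a, b, alpha):
    """Matrix of K(a_i, b_j) = exp(-alpha * ||a_i - b_j||^2) for all rows of a and b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.exp(-alpha * np.sum(diff**2, axis=-1))

def lagrange_basis(samples, queries, alpha):
    """Return l with l[i, m] = l_i(q_m) by solving A l(q) = R(q) for all q at once."""
    A = gaussian_kernel(samples, samples, alpha)   # A[i, j] = K(y_i, y_j)
    R = gaussian_kernel(samples, queries, alpha)   # column m is R(q_m)
    return np.linalg.solve(A, R)

alpha = 2.0                                        # assumed kernel parameter
samples = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])

# Evaluating the basis at the samples themselves must give the Kronecker delta.
basis_at_samples = lagrange_basis(samples, samples, alpha)

# Interpolation of hypothetical solver outputs u(y_i) at a new point q.
u_at_samples = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
q = np.array([[0.25, 0.75]])
u_interpolated = u_at_samples @ lagrange_basis(samples, q, alpha)
```

Since $A$ depends only on the samples, a factorization of $A$ can be reused for every query point, which is why a dense LU factorization fits this step.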

35 For now: expectation value
Remember:
$$u_M^{(h)}(x, t, y) = \sum_{i=1}^{N(\Omega)} u^{(h)}(x, t, y_i) \, l_i(y)$$
Expectation value:
$$E(u_M^{(h)}(x, t, y)) = \int_{\Omega_M} u_M^{(h)}(x, t, y) \, \rho_M(y) \, dy$$
$$E(u_M^{(h)}(x, t, y)) = \sum_{i=1}^{N(\Omega)} u^{(h)}(x, t, y_i) \, E(l_i(y)), \qquad E(l_i(y)) = \int_{\Omega_M} l_i(y) \, \rho_M(y) \, dy$$
Monte Carlo using quadrature points $Q = \{q_1, \ldots, q_{|Q|}\} \subset \Omega_M$ gives
$$E(l_i(y)) \approx \frac{1}{|Q|} \sum_{q \in Q} l_i(q) \, \rho_M(q)$$
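The expectation formula reduces to one weight $E(l_i)$ per collocation point, followed by a weighted sum of the solver outputs. The sketch below assumes a uniform density $\rho_M = 1$ on the unit square and uniformly distributed quadrature points, so the Monte Carlo average needs no volume factor. The sample points, α and the test function are hypothetical.

```python
import numpy as np

def gaussian_kernel(a, b, alpha):
    """Matrix of K(a_i, b_j) = exp(-alpha * ||a_i - b_j||^2) for all rows of a and b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.exp(-alpha * np.sum(diff**2, axis=-1))

rng = np.random.default_rng(1)
alpha = 2.0
samples = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])

# Quadrature points q, drawn uniformly from the unit square (assumed Omega_M).
quad_points = rng.uniform(0.0, 1.0, size=(2000, 2))
density = np.ones(len(quad_points))                 # rho_M(q) = 1 (assumed uniform)

# Lagrange basis at all quadrature points: basis[i, m] = l_i(q_m).
A = gaussian_kernel(samples, samples, alpha)
R = gaussian_kernel(samples, quad_points, alpha)
basis = np.linalg.solve(A, R)

# Monte Carlo weights E(l_i) ~ (1/|Q|) * sum over q of l_i(q) * rho_M(q).
weights = (basis * density).mean(axis=1)

# Stand-in "simulation results": u(y) = K(y, y_1), which lies in the trial space.
u_at_samples = gaussian_kernel(samples, samples[:1], alpha)[:, 0]
expectation = u_at_samples @ weights

# Reference: average of the same function evaluated directly at the quadrature points.
reference = gaussian_kernel(quad_points, samples[:1], alpha)[:, 0].mean()
```

Because the test function lies in the trial space, the interpolant reproduces it exactly and both averages agree up to round-off. The weights depend only on the samples and the density, so they can be reused for every grid point and time step of the stored solutions.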

36 Implementation
Framework:
- configuration files for independent fluid solver runs generated by Python scripts
- simulations performed with the full multi-GPU code on the cluster with 36 Nvidia Tesla M2090 GPUs at our institute
- stochastics as post-processing on the GPU using VTK data files produced by the fluid solver
GPU programming:
- Nvidia CUDA 4.1
- Thrust (transforms, reductions, ...)
- CULA Dense R14 (double-precision LU factorization)
- CUBLAS (dot product with stride, not available in Thrust)
- custom GPU kernels (matrix setup, integration, ...)

37 First results: Uncertainty in reattachment length

38 First results: Mixing in bubble reactors

39 Summary
- NaSt3DGPF solves the two-phase incompressible Navier-Stokes equations
- Code scales well on multi-GPU clusters
- Uncertainty Quantification profits from GPUs
Thank you!
