Two-Phase Flows on Massively Parallel Multi-GPU Clusters
Peter Zaspel, Michael Griebel
Institute for Numerical Simulation, Rheinische Friedrich-Wilhelms-Universität Bonn
Workshop "Programming of Heterogeneous Systems in Physics", Jena, 5-7 October 2011
CFD computing moving forward to Exascale
- GPU computing is an important technology for next-generation Exascale cluster systems
- the world's fastest HPC cluster is based on GPUs
- original application of GPUs: rasterizing images
- now: high performance for highly parallel algorithms
- growing number of GPU-based codes available
- Are CFD codes prepared for the next generation of cluster hardware?
Two-phase flows
- a major topic in computational fluid dynamics
- simulating the interaction of two fluids, e.g. air & water or water & oil
- interesting small-scale phenomena: surface tension effects, droplet deformation, bubble dynamics
- large-scale studies: ship construction, river simulation
Two-phase flow simulation example
NaSt3DGPF - a 3D two-phase Navier-Stokes solver
We have ported our in-house fluid solver to the GPU:
- level-set formulation for the simulation of two interacting fluids
- model: two-phase incompressible Navier-Stokes equations
- 3D finite difference solver on a staggered uniform grid using Chorin's projection approach
- Jacobi-preconditioned CG solver for the pressure Poisson equation
- high-order space discretizations: e.g. 5th-order WENO
- time discretizations: 3rd-order Runge-Kutta, 2nd-order Adams-Bashforth
- complex geometries with different boundary conditions
- MPI parallelization by domain decomposition
Core technique for two-phase flows: the level-set method
- representation of a free surface $\Gamma_t$ by a signed distance function $\phi: \mathbb{R}^3 \times \mathbb{R} \to \mathbb{R}$:
  $\Gamma_t = \{\vec{x} \mid \phi(\vec{x},t) = 0\}$, $|\nabla\phi| = 1$
- fluid phase distinction by the sign of the level-set function:
  $\phi(\vec{x},t) > 0$ for $\vec{x} \in \Omega_1$, $\phi(\vec{x},t) \le 0$ for $\vec{x} \in \Omega_2$
- interface normal and curvature:
  $\vec{n} = \frac{\nabla\phi}{|\nabla\phi|}$, $\kappa = \nabla \cdot \vec{n}(\vec{x},t)$
Two-phase Navier-Stokes equations
PDE system:
  $\rho(\phi)\,(\partial_t \vec{u} + (\vec{u} \cdot \nabla)\vec{u}) = \nabla \cdot (\mu(\phi)\,S) - \nabla p - \sigma \kappa(\phi)\,\delta(\phi)\,\nabla\phi + \rho(\phi)\,\vec{g}$
  $\nabla \cdot \vec{u} = 0$
  $\partial_t \phi + \vec{u} \cdot \nabla\phi = 0$
with
  $S := \nabla\vec{u} + \{\nabla\vec{u}\}^T$
  $\rho(\phi) := \rho_2 + (\rho_1 - \rho_2)\,H(\phi)$, $\mu(\phi) := \mu_2 + (\mu_1 - \mu_2)\,H(\phi)$
  $H(\phi) := \begin{cases} 0 & \text{if } \phi < 0 \\ \tfrac{1}{2} & \text{if } \phi = 0 \\ 1 & \text{if } \phi > 0 \end{cases}$
Symbols: $\vec{u}$ fluid velocity, $p$ pressure, $\phi$ level-set function, $\rho$ density, $\mu$ dynamic viscosity, $S$ stress tensor, $\sigma$ surface tension coefficient, $\vec{g}$ volume forces, $\kappa$ local curvature of the fluid surface, $\delta$ Dirac delta functional
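To make the Heaviside blending concrete, a minimal CUDA device-function sketch is given below. It only restates the formulas above; the function and parameter names (heaviside, density, viscosity, rho1, rho2, mu1, mu2) are illustrative and not taken from NaSt3DGPF. In practice a smoothed Heaviside is often used near the interface for numerical stability.

  // Sketch: sharp Heaviside blending of material properties across the
  // interface, exactly as in the formulas above (illustrative names).
  __device__ double heaviside(double phi)
  {
      if (phi < 0.0) return 0.0;   // phase Omega_2
      if (phi > 0.0) return 1.0;   // phase Omega_1
      return 0.5;                  // on the interface
  }

  __device__ double density(double phi, double rho1, double rho2)
  {
      return rho2 + (rho1 - rho2) * heaviside(phi);   // rho(phi)
  }

  __device__ double viscosity(double phi, double mu1, double mu2)
  {
      return mu2 + (mu1 - mu2) * heaviside(phi);      // mu(phi)
  }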
Solver algorithm based on pressure projection
For $t = 1, 2, \ldots$ do:
1. set boundary conditions for $\vec{u}^n$
2. compute intermediate velocity field $\vec{u}^*$:
   $\frac{\vec{u}^* - \vec{u}^n}{\delta t} = -(\vec{u}^n \cdot \nabla)\vec{u}^n + \vec{g} + \frac{1}{\rho(\phi^n)}\,\nabla \cdot (\mu(\phi^n) S^n) - \frac{1}{\rho(\phi^n)}\,\sigma \kappa(\phi^n)\,\delta(\phi^n)\,\nabla\phi^n$
3. apply boundary conditions and transport the level-set function:
   $\phi^* = \phi^n + \delta t\,(-\vec{u}^n \cdot \nabla\phi^n)$
4. reinitialize the level-set function by solving
   $\partial_\tau d + \operatorname{sign}(\phi^*)\,(|\nabla d| - 1) = 0, \quad d_0 = \phi^*$
5. solve the pressure Poisson equation with $\phi^{n+1} = d$:
   $\nabla \cdot \left( \frac{\delta t}{\rho(\phi^{n+1})}\,\nabla p^{n+1} \right) = \nabla \cdot \vec{u}^*$
6. apply velocity correction: $\vec{u}^{n+1} = \vec{u}^* - \frac{\delta t}{\rho(\phi^{n+1})}\,\nabla p^{n+1}$
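A hedged host-side sketch of this time loop follows; the Fields type and the step functions are placeholders for groups of GPU kernels and do not correspond to the actual NaSt3DGPF code.

  // Skeleton of the pressure-projection time loop listed above.
  // "Fields" and the step functions are hypothetical placeholders.
  struct Fields;   // device pointers for u, p, phi and material parameters

  void set_velocity_bc(Fields* f)                    { /* step 1: boundary kernels  */ }
  void intermediate_velocity(Fields* f, double dt)   { /* step 2: u*                */ }
  void transport_level_set(Fields* f, double dt)     { /* step 3: phi*              */ }
  void reinitialize_level_set(Fields* f)             { /* step 4: signed distance   */ }
  void solve_pressure_poisson(Fields* f, double dt)  { /* step 5: Jacobi-PCG        */ }
  void correct_velocity(Fields* f, double dt)        { /* step 6: u^{n+1}           */ }

  void time_loop(Fields* f, double dt, int steps)
  {
      for (int n = 0; n < steps; ++n) {
          set_velocity_bc(f);                 // 1
          intermediate_velocity(f, dt);       // 2
          transport_level_set(f, dt);         // 3
          reinitialize_level_set(f);          // 4
          solve_pressure_poisson(f, dt);      // 5
          correct_velocity(f, dt);            // 6
      }
  }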
Solver algorithm based on pressure projection (continued)
The complete projection loop, steps 1-6 above, is now done on multiple GPUs.
CPU → GPU porting process
Our approach:
1. identification of the most time-consuming parts of the CPU code → good starting point
2. stepwise porting with full CPU ↔ GPU data copy before and after each GPU computation, and per-method memory allocation (sketched below)
3. continuously: GPU code validation for each porting step
4. step-wise unification of data fields and reduction of CPU ↔ GPU data transfers
5. overall optimization
Advantages:
- first results within a short period of time
- easy code validation during the porting process
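A minimal sketch of the intermediate porting stage (step 2), assuming a simple dummy kernel; all names are illustrative, not the NaSt3DGPF routines.

  // Sketch: every ported routine allocates its own device memory and copies
  // the full field in and out around the kernel call.
  #include <cuda_runtime.h>

  __global__ void ported_kernel(double* field, int n)   // stands in for any ported routine
  {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      if (idx < n) field[idx] *= 2.0;                    // dummy computation
  }

  void ported_routine(double* host_field, int n)
  {
      double* dev_field = 0;
      size_t bytes = (size_t)n * sizeof(double);

      cudaMalloc((void**)&dev_field, bytes);                                // per-method allocation
      cudaMemcpy(dev_field, host_field, bytes, cudaMemcpyHostToDevice);     // full copy CPU -> GPU

      int threads = 256;
      int blocks  = (n + threads - 1) / threads;
      ported_kernel<<<blocks, threads>>>(dev_field, n);

      cudaMemcpy(host_field, dev_field, bytes, cudaMemcpyDeviceToHost);     // full copy GPU -> CPU
      cudaFree(dev_field);     // result can now be validated against the CPU version
  }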
Design principles of the GPU code
General:
- CUDA as GPU programming framework
- full double-precision implementation
- linearization of 3D data fields
Memory hierarchies:
- use global memory wherever acceptable → low algorithmic complexity; L1 / L2 caches make this more and more popular and faster
- optimization with shared memory for the most time-critical parts
- shared-memory-based parallel reduction taken from the SDK
Compute configuration:
- for maximized GPU occupancy, use the maximum number of threads supported by a streaming multiprocessor (SM)
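A small sketch of two of these principles, linear storage of 3D fields and a compute configuration based on the device's maximum thread count; the helper names are assumptions of this example.

  // Sketch: linearization of a 3D field and a compute configuration that
  // uses the maximum thread count per block reported for the device.
  #include <cuda_runtime.h>

  inline int linear_index(int i, int j, int k, int sizeX, int sizeY)
  {
      return i + j * sizeX + k * sizeX * sizeY;   // x fastest, z slowest
  }

  void compute_configuration(int num_cells, int* blocks, int* threads)
  {
      cudaDeviceProp prop;
      cudaGetDeviceProperties(&prop, 0);
      *threads = prop.maxThreadsPerBlock;                  // maximize occupancy per SM
      *blocks  = (num_cells + *threads - 1) / *threads;    // enough blocks to cover the field
  }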
Data access patterns for complex geometry handling
Irregular data access patterns:
- different CPU loops including/excluding boundary cells
- periodic / non-periodic boundary conditions
- complex geometries: no computation on solid cells
- conditionals are expensive on GPUs
Solution:
- compute kernel operates on the whole data field
- precomputed boolean access pattern fields
- only one additional conditional and one global load operation
- measurements: faster than explicit boundary checks
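As a small sketch, the access-pattern field could be precomputed on the host as follows; the cell-type encoding is hypothetical, the point being that the kernels then test a single flag instead of several geometry and boundary conditionals.

  // Sketch: host-side precomputation of the boolean access-pattern field.
  enum CellType { FLUID = 0, SOLID = 1, BOUNDARY = 2 };   // illustrative encoding

  void build_access_pattern(const int* cell_type, char* pattern, int num_cells)
  {
      for (int idx = 0; idx < num_cells; ++idx)
          pattern[idx] = (cell_type[idx] == FLUID) ? 1 : 0;   // compute only on fluid cells
  }
  // The pattern field is copied to the GPU once and reused by all kernels.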
Typical GPU kernel

__global__ void RHSonGPU(double *RHS, char *pattern, double *U,
                         double *V, double *W, double *DX_device,
                         double *DY_device, double *DZ_device,
                         double delt, int GPUsizeX, int GPUsizeY,
                         int offX, int offY, int offZ, int GPUsize)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;  // linear index based on
                                                    // compute configuration
  int i, j, k, tmp;

  if ((idx < GPUsize) && (pattern[idx] == 1))       // data access pattern
  {
    k   = idx / (GPUsizeX * GPUsizeY);              // 3D coords computation
    tmp = idx % (GPUsizeX * GPUsizeY);
    j   = tmp / GPUsizeX;
    i   = tmp % GPUsizeX;
    i += offX; j += offY; k += offZ;                // parallel field offsets

    // calculation of the Poisson equation's right-hand side
    RHS[idx] = ((U[idx] - U[idx - 1])                   / DX_device[i] +
                (V[idx] - V[idx - GPUsizeX])            / DY_device[j] +
                (W[idx] - W[idx - GPUsizeX * GPUsizeY]) / DZ_device[k]) / delt;
  }
}
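A possible host-side launch of this kernel could look as follows; the d_*-prefixed device pointers and the thread-block size of 256 are assumptions of this sketch, not taken from the actual code.

  // Hypothetical launch of RHSonGPU; the d_* device arrays are assumed to be
  // allocated and filled beforehand.
  int threads = 256;
  int blocks  = (GPUsize + threads - 1) / threads;
  RHSonGPU<<<blocks, threads>>>(d_RHS, d_pattern, d_U, d_V, d_W,
                                d_DX, d_DY, d_DZ, delt,
                                GPUsizeX, GPUsizeY,
                                offX, offY, offZ, GPUsize);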
Further details
Compute-intensive kernels:
- high instruction count per kernel → register spilling → slow kernels (example: WENO stencil)
- solution: precompute some parts in an additional kernel (see the sketch below)
What remains on the CPU?
- configuration file parser
- binary / visualization data file input/output
- parallel communication
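A minimal sketch of the kernel-splitting idea, not the actual WENO implementation: a register-heavy computation is broken into a precomputation kernel that stages an intermediate quantity in global memory and a second, lighter kernel that consumes it.

  // Sketch: splitting one register-heavy stencil kernel into two lighter ones.
  // The "intermediate" quantity is a stand-in for, e.g., reconstructed fluxes.
  __global__ void precompute_intermediate(const double* in, double* tmp, int n)
  {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      if (idx < n)
          tmp[idx] = 0.5 * in[idx] * in[idx];        // partial result, few registers
  }

  __global__ void combine_intermediate(const double* tmp, double* out, int n)
  {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      if (idx > 0 && idx < n)
          out[idx] = tmp[idx] - tmp[idx - 1];        // remaining, register-light work
  }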
Multi-GPU parallelization by domain decomposition
- multi-GPU parallelization fully integrated with the distributed-memory MPI parallelization of the CPU code
- mapping: 1 GPU ↔ 1 CPU core, i.e. one GPU per MPI process
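A minimal sketch of the one-GPU-per-MPI-process binding; the round-robin assignment of the node-local GPUs by rank is an assumption of this example.

  // Sketch: binding one GPU to each MPI process (1 GPU <-> 1 CPU core).
  #include <mpi.h>
  #include <cuda_runtime.h>

  int main(int argc, char** argv)
  {
      MPI_Init(&argc, &argv);

      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      int num_devices;
      cudaGetDeviceCount(&num_devices);
      cudaSetDevice(rank % num_devices);   // round-robin GPU assignment per node

      // ... domain decomposition and solver time loop as on the CPU ...

      MPI_Finalize();
      return 0;
  }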
Optimizing multi-GPU data exchanges
Prepacking of boundary data:
- boundary data is packed into a buffer on the GPU, staged through a buffer in CPU RAM, and unpacked from a buffer on the neighbouring GPU (figure: GPU buffer → CPU RAM buffer → GPU buffer, in both directions)
Overlapping communication and computation (PCG solver):
- matrix-vector product Ax on inner cells, while boundary data is exchanged
- afterwards: matrix-vector product Ax on boundary cells
A sketch of the halo exchange is given below.
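A hedged sketch of the prepacked boundary exchange; the index-map-based pack/unpack kernels and all names are illustrative, and the actual code additionally overlaps this exchange with the inner-cell matrix-vector product.

  // Sketch: halo exchange with prepacking on the GPU and staging via CPU RAM.
  // halo_index / ghost_index are precomputed index maps (illustrative).
  #include <mpi.h>
  #include <cuda_runtime.h>

  __global__ void pack_halo(const double* field, const int* halo_index,
                            double* buf, int halo_size)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < halo_size) buf[i] = field[halo_index[i]];      // gather boundary cells
  }

  __global__ void unpack_halo(double* field, const int* ghost_index,
                              const double* buf, int halo_size)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < halo_size) field[ghost_index[i]] = buf[i];     // scatter into ghost cells
  }

  void exchange_halo(double* d_field, const int* d_halo_index, const int* d_ghost_index,
                     double* d_send, double* d_recv, double* h_send, double* h_recv,
                     int halo_size, int neighbour, MPI_Comm comm)
  {
      int threads = 256, blocks = (halo_size + threads - 1) / threads;
      size_t bytes = (size_t)halo_size * sizeof(double);

      pack_halo<<<blocks, threads>>>(d_field, d_halo_index, d_send, halo_size); // pack on GPU
      cudaMemcpy(h_send, d_send, bytes, cudaMemcpyDeviceToHost);                // GPU -> CPU RAM

      MPI_Sendrecv(h_send, halo_size, MPI_DOUBLE, neighbour, 0,                 // node-to-node
                   h_recv, halo_size, MPI_DOUBLE, neighbour, 0,
                   comm, MPI_STATUS_IGNORE);

      cudaMemcpy(d_recv, h_recv, bytes, cudaMemcpyHostToDevice);                   // CPU RAM -> GPU
      unpack_halo<<<blocks, threads>>>(d_field, d_ghost_index, d_recv, halo_size); // unpack on GPU
  }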
Results
Benchmarking problem: air bubble rising in water
Properties:
- domain size: 20 cm × 20 cm × 20 cm
- liquid phase: water at 20 °C
- gas phase: air at 20 °C
- surface tension: standard
- volume forces: standard gravity
- initial air bubble radius: 3 cm
- initial center position of bubble: (10 cm, 6 cm, 10 cm)
Performance measurements for GPUs
Perfectly fair CPU-GPU benchmarks are very hard!
- 1 GPU vs. 1 CPU core: + good GPU results; − CPU speed unclear, not realistic w.r.t. price
- 1 GPU vs. 1 CPU socket: + better price realism; − number of cores per socket? speed per CPU core?
- performance per dollar: ++ best price realism; − price per node / CPU? prices subject to change
- performance per Watt: ++ Green IT and power costs; − high influence of configuration
Benchmarking platforms
CPU hardware:
- dual 6-core Intel Xeon E5650 CPU, 2.67 GHz
- 24 GB DDR3 RAM
GPU hardware (GF100 "Fermi"):
- 4-core Intel Xeon E5620 CPU, 2.40 GHz
- 6 GB DDR3 RAM
- NVIDIA Tesla C2050 GPU
GPU cluster (8 GT200 GPUs):
- 2 workstations with 4-core Intel Core i7-920 CPU, 2.66 GHz, 12 GB DDR3 RAM
- NVIDIA Tesla S1070 (4 GPUs) per workstation
- InfiniBand 40G QDR ConnectX
Software: Ubuntu Linux 10.04 (64 bit), GCC 4.4.3, CUDA 3.2 SDK, OpenMPI 1.4.1
Performance per dollar
[Figure: speed-up on one GPU — GT200 GPU vs. 6-core Xeon CPU, GF100 GPU with ECC vs. dual 6-core Xeon CPU, GF100 GPU without ECC vs. dual 6-core Xeon CPU — for grid resolutions 64³, 128³, 256×128², 256²×128; measured speed-ups between 1.23 and 3.42]
- 1 CPU core vs. 1 GPU: > 41× speedup
- 1 CPU socket (4 cores) vs. 1 GPU: > 10× speedup
Performance per Watt
[Figure: power consumption in kWh for the benchmark at grid resolution 256²×128]
- dual 6-core Xeon CPU: 0.21 kWh
- 8 GT200 GPUs: 0.12 kWh
- GF100 GPU with ECC: 0.09 kWh
- GF100 GPU without ECC: 0.08 kWh
Fermi-type GPU is more than two times more power-efficient.
Multi-GPU performance (GT200 GPUs)
Strong scaling (speed-up relative to one GT200 GPU, grid resolution 256×256×256):
- measured speed-ups of 1, 1.95, 3.7, 4.89 and 6.59 on up to 8 GPUs
Weak scaling (relative to one GPU):
- grid resolution 256×256×128 per GPU: scale-ups 1, 1.1, 1.12, 1.13
- grid resolution 256×256×256 per GPU: scale-ups 1, 0.93, 1.05, 1.1
→ good strong scaling (speedup) and weak scaling (scale-up) on up to 8 GT200 GPUs
Summary
- NaSt3DGPF solves the two-phase incompressible Navier-Stokes equations
- CFD applications are well-suited for GPUs
- the code scales on next-generation multi-GPU clusters
Thanks to:
Thank you!
References:
- Griebel, M., Zaspel, P.: A multi-GPU accelerated solver for the three-dimensional two-phase incompressible Navier-Stokes equations. Computer Science - Research and Development, 25(1-2):65-73, May 2010.
- Zaspel, P., Griebel, M.: Solving Incompressible Two-Phase Flows on Massively Parallel Multi-GPU Clusters. Computers and Fluids - Special Issue: ParCFD 2011, submitted.