Two-Phase Flows on Massively Parallel Multi-GPU Clusters

Two-Phase Flows on Massively Parallel Multi-GPU Clusters
Peter Zaspel, Michael Griebel
Institute for Numerical Simulation, Rheinische Friedrich-Wilhelms-Universität Bonn
Workshop "Programming of Heterogeneous Systems in Physics", Jena, 5-7 October 2011

CFD computing moving forward to exascale
- GPU computing is an important technology for next-generation exascale cluster systems; the world's fastest HPC cluster is based on GPUs
- original application: rasterizing images; now: high performance for highly parallel algorithms
- a growing number of GPU-based codes is available
- are CFD codes prepared for the next generation of cluster hardware?

Two-phase flows
- a major topic in computational fluid dynamics
- simulating the interaction of two fluids such as air & water or water & oil
- interesting small-scale phenomena: surface tension effects, droplet deformation, bubble dynamics
- large-scale studies: ship construction, river simulation

Two-phase flow simulation example

NaSt3DGPF - a 3D two-phase Navier-Stokes solver
We have ported our in-house fluid solver to the GPU:
- level-set formulation for the simulation of two interacting fluids
- model: two-phase incompressible Navier-Stokes equations
- 3D finite difference solver on a staggered uniform grid using Chorin's projection approach
- Jacobi-preconditioned CG solver for the pressure Poisson equation
- high-order space discretizations, e.g. 5th-order WENO
- time discretizations: 3rd-order Runge-Kutta, 2nd-order Adams-Bashforth
- complex geometries with different boundary conditions
- MPI parallelization by domain decomposition

Core technique for two-phase flows: the level-set method
- representation of the free surface $\Gamma_t$ by a signed distance function $\phi: \mathbb{R}^3 \times \mathbb{R} \to \mathbb{R}$:
  $\Gamma_t = \{\vec{x} \mid \phi(\vec{x},t) = 0\}$, with $|\nabla\phi| = 1$
- fluid phase distinction by the sign of the level-set function:
  $\phi(\vec{x},t) > 0$ for $\vec{x} \in \Omega_1$ and $\phi(\vec{x},t) \le 0$ for $\vec{x} \in \Omega_2$
- normal and local curvature of the fluid surface:
  $\vec{n} = \frac{\nabla\phi}{|\nabla\phi|}$, $\kappa = \nabla \cdot \vec{n}(\vec{x},t)$

Two-phase Navier-Stokes equations
PDE system:
$\rho(\phi)\,(\partial_t \vec{u} + (\vec{u} \cdot \nabla)\vec{u}) = \nabla \cdot (\mu(\phi) S) - \nabla p - \sigma \kappa(\phi)\,\delta(\phi)\,\nabla\phi + \rho(\phi)\,\vec{g}$
$\nabla \cdot \vec{u} = 0$
$\partial_t \phi + \vec{u} \cdot \nabla\phi = 0$
with
$S := \nabla\vec{u} + (\nabla\vec{u})^T$
$\rho(\phi) := \rho_2 + (\rho_1 - \rho_2)\,H(\phi)$
$\mu(\phi) := \mu_2 + (\mu_1 - \mu_2)\,H(\phi)$
$H(\phi) := \begin{cases} 0 & \text{if } \phi < 0 \\ 1/2 & \text{if } \phi = 0 \\ 1 & \text{if } \phi > 0 \end{cases}$
Notation: $\vec{u}$ fluid velocity, $p$ pressure, $\phi$ level-set function, $\rho$ density, $\mu$ dynamic viscosity, $S$ stress tensor, $\sigma$ surface tension, $\vec{g}$ volume forces, $\kappa$ local curvature of the fluid surface, $\delta$ Dirac delta functional.
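
For illustration, the Heaviside-based material laws above translate directly into small CUDA device functions. This is a minimal sketch with our own function names, not the solver's actual code:

    // Sharp Heaviside function H(phi) as defined on the slide.
    __device__ double heaviside(double phi)
    {
        if (phi < 0.0) return 0.0;
        if (phi > 0.0) return 1.0;
        return 0.5;
    }

    // Density rho(phi) = rho2 + (rho1 - rho2) * H(phi).
    __device__ double density(double phi, double rho1, double rho2)
    {
        return rho2 + (rho1 - rho2) * heaviside(phi);
    }

    // Dynamic viscosity mu(phi) = mu2 + (mu1 - mu2) * H(phi).
    __device__ double viscosity(double phi, double mu1, double mu2)
    {
        return mu2 + (mu1 - mu2) * heaviside(phi);
    }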

Solver algorithm based on pressure projection
For n = 1, 2, ... do:
1. Set boundary conditions for $\vec{u}^n$.
2. Compute the intermediate velocity field $\vec{u}^*$:
   $\frac{\vec{u}^* - \vec{u}^n}{\delta t} = -(\vec{u}^n \cdot \nabla)\vec{u}^n + \vec{g} + \frac{1}{\rho(\phi^n)} \nabla \cdot (\mu(\phi^n) S^n) - \frac{1}{\rho(\phi^n)} \sigma \kappa(\phi^n)\,\delta(\phi^n)\,\nabla\phi^n$
3. Apply boundary conditions and transport the level-set function:
   $\phi^* = \phi^n + \delta t\,(-\vec{u}^n \cdot \nabla\phi^n)$
4. Reinitialize the level-set function by solving
   $\partial_\tau d + \operatorname{sign}(\phi^*)(|\nabla d| - 1) = 0, \quad d_0 = \phi^*$
5. Solve the pressure Poisson equation with $\phi^{n+1} = d$:
   $\nabla \cdot \left( \frac{\delta t}{\rho(\phi^{n+1})} \nabla p^{n+1} \right) = \nabla \cdot \vec{u}^*$
6. Apply the velocity correction:
   $\vec{u}^{n+1} = \vec{u}^* - \frac{\delta t}{\rho(\phi^{n+1})} \nabla p^{n+1}$
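
As a concrete illustration of step 6, a velocity-correction kernel for the x component on the staggered grid might look as follows. This is a minimal sketch in the style of the solver's typical kernel shown later; the kernel name, the sharp evaluation of rho(phi) at the cell center (rather than averaged to the staggered u-location), and the omission of parallel field offsets are our simplifications, not the actual NaSt3DGPF code:

    // Step 6 for the x velocity component:
    // u^{n+1} = u* - dt / rho(phi^{n+1}) * dp/dx
    __global__ void velocityCorrectionX(double *U, const double *P,
                                        const double *phi, const char *pattern,
                                        const double *DX_device, double delt,
                                        double rho1, double rho2,
                                        int GPUsizeX, int GPUsizeY, int GPUsize)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if ((idx < GPUsize) && (pattern[idx] == 1))   // skip non-fluid cells
        {
            int i = idx % GPUsizeX;                   // x coordinate only
            double H = (phi[idx] > 0.0) ? 1.0
                     : ((phi[idx] < 0.0) ? 0.0 : 0.5);
            double rho = rho2 + (rho1 - rho2) * H;    // rho(phi^{n+1})
            // forward pressure difference on the staggered grid
            U[idx] -= delt / rho * (P[idx + 1] - P[idx]) / DX_device[i];
        }
    }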

Solver algorithm based on pressure projection (same steps as above): all of these steps are now done on multiple GPUs.

CPU-to-GPU porting process
Our approach:
1. identification of the most time-consuming parts of the CPU code (a good starting point)
2. stepwise porting with full CPU-GPU data copies before and after each GPU computation, and per-method memory allocation
3. continuously: GPU code validation for each porting step
4. stepwise unification of data fields and reduction of CPU-GPU data transfers
5. overall optimization
Advantages: first results within a short period of time; easy code validation during the porting process.

Design principles of the GPU code
General:
- CUDA as GPU programming framework
- full double-precision implementation
- linearization of 3D data fields
Memory hierarchies:
- use global memory wherever acceptable (low algorithmic complexity; L1/L2 caches make it more and more popular and faster)
- optimization with shared memory for the most time-critical parts
- shmem-based parallel reduction used from the SDK (see the sketch after this list)
Compute configuration:
- for maximized GPU occupancy, use the maximum number of threads supported by a streaming multiprocessor (SM)
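
The shmem-based parallel reduction mentioned above follows the classic tree pattern from the CUDA SDK samples. A minimal sketch (our own simplified variant, not the solver's actual kernel; assumes the block size is a power of two):

    // Tree-based sum reduction of one block's values in shared memory.
    // Each block writes one partial sum; a second pass combines them.
    __global__ void blockSumReduce(const double *in, double *blockSums, int n)
    {
        extern __shared__ double sdata[];            // dynamic shared memory
        int tid = threadIdx.x;
        int idx = blockIdx.x * blockDim.x + threadIdx.x;

        sdata[tid] = (idx < n) ? in[idx] : 0.0;      // load or pad with zero
        __syncthreads();

        for (int s = blockDim.x / 2; s > 0; s >>= 1) // halve active threads
        {
            if (tid < s)
                sdata[tid] += sdata[tid + s];
            __syncthreads();
        }

        if (tid == 0)                                // thread 0 writes result
            blockSums[blockIdx.x] = sdata[0];
    }

Launched, e.g., as blockSumReduce<<<blocks, 256, 256 * sizeof(double)>>>(in, partial, n); such reductions occur, for instance, in the dot products of the pressure CG solver.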

Data access patterns for complex geometry handling
Irregular data access patterns:
- different CPU loops (including/excluding boundary cells)
- periodic / non-periodic boundary conditions
- complex geometries: no computation on solid cells
- conditionals are expensive on GPUs
Solution (see the sketch and the typical kernel below):
- the compute kernel operates on the whole data field
- precomputed boolean access pattern fields
- cost: one additional conditional and one global load operation
- measurements: faster than explicit boundary checks
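
A host-side precomputation of such a boolean access pattern field could look like this (a hedged sketch: the cell-flag convention, the function names, and the exclusion of the outermost boundary layer are our assumptions):

    // Build the access pattern once on the host: 1 = compute this cell,
    // 0 = skip (solid obstacle or boundary cell), then upload it to the GPU.
    #include <cuda_runtime.h>

    void buildAccessPattern(char *pattern_host, const char *isSolid_host,
                            int nx, int ny, int nz, char *pattern_device)
    {
        for (int k = 0; k < nz; ++k)
            for (int j = 0; j < ny; ++j)
                for (int i = 0; i < nx; ++i)
                {
                    int idx = i + nx * (j + ny * k);       // linearized 3D index
                    bool interior = (i > 0 && i < nx - 1 &&
                                     j > 0 && j < ny - 1 &&
                                     k > 0 && k < nz - 1); // exclude boundary layer
                    pattern_host[idx] = (interior && !isSolid_host[idx]) ? 1 : 0;
                }

        // one-time transfer; the pattern is reused by every kernel launch
        cudaMemcpy(pattern_device, pattern_host,
                   (size_t)nx * ny * nz * sizeof(char), cudaMemcpyHostToDevice);
    }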

Typical GPU kernel

    __global__ void RHSonGPU(double *RHS, char *pattern, double *U,
                             double *V, double *W, double *DX_device,
                             double *DY_device, double *DZ_device,
                             double delt, int GPUsizeX, int GPUsizeY,
                             int offX, int offY, int offZ, int GPUsize)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x; // linear index based on
                                                         // compute configuration
        int i, j, k, tmp;

        if ((idx < GPUsize) && (pattern[idx] == 1))      // data access pattern
        {
            k   = idx / (GPUsizeX * GPUsizeY);           // 3D coords computation
            tmp = idx % (GPUsizeX * GPUsizeY);
            j   = tmp / GPUsizeX;
            i   = tmp % GPUsizeX;
            i += offX; j += offY; k += offZ;             // parallel field offsets

            // calculation of the Poisson equation's right-hand side
            RHS[idx] = ((U[idx] - U[idx-1])                 / DX_device[i] +
                        (V[idx] - V[idx-GPUsizeX])          / DY_device[j] +
                        (W[idx] - W[idx-GPUsizeX*GPUsizeY]) / DZ_device[k]) / delt;
        }
    }
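
A matching host-side launch for this kernel, following the occupancy principle from the design slide (the thread count of 256 is illustrative, not the solver's actual configuration):

    // One thread per grid cell; the pattern field masks out non-fluid cells.
    int threads = 256;                                   // chosen for high occupancy
    int blocks  = (GPUsize + threads - 1) / threads;     // ceil(GPUsize / threads)
    RHSonGPU<<<blocks, threads>>>(RHS, pattern, U, V, W, DX_device, DY_device,
                                  DZ_device, delt, GPUsizeX, GPUsizeY,
                                  offX, offY, offZ, GPUsize);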

Further details
Compute-intensive kernels:
- a high instruction count per kernel leads to register spilling and thus slow kernels (example: the WENO stencil)
- solution: precompute some parts in an additional kernel (sketched below)
What remains on the CPU?
- configuration file parser
- binary/visualization data file input/output
- parallel communication
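
The kernel-splitting remedy for register spilling can be sketched as follows; the kernels and the precomputed quantity are purely illustrative placeholders, not the real WENO implementation:

    // Pass 1: precompute an intermediate quantity (e.g. smoothness indicators)
    // into a temporary field, keeping per-kernel register pressure low.
    __global__ void precomputeIndicators(double *tmp, const double *phi, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx > 0 && idx < n - 1)
            tmp[idx] = phi[idx+1] - 2.0 * phi[idx] + phi[idx-1]; // placeholder
    }

    // Pass 2: consume the precomputed field in the main stencil kernel
    // instead of recomputing it, so fewer live registers are needed here.
    __global__ void applyStencil(double *out, const double *phi,
                                 const double *tmp, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx > 0 && idx < n - 1)
            out[idx] = phi[idx] + tmp[idx];                      // placeholder
    }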

Multi-GPU parallelization by domain decomposition
The multi-GPU parallelization is fully integrated with the distributed-memory MPI parallelization of the CPU code: one GPU is assigned per CPU core (i.e., per MPI process).

Optimizing multi-GPU data exchanges
Prepacking of boundary data:
[diagram: boundary data is packed into a buffer on the sending GPU, staged through a CPU RAM buffer, and unpacked from a buffer on the receiving GPU]
Overlapping communication and computation (PCG solver):
- matrix-vector product Ax on the inner cells while the boundary data is exchanged
- then matrix-vector product Ax on the boundary cells
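
The communication/computation overlap in the PCG matrix-vector product might be structured as below. This is a minimal sketch using CUDA streams and a staged host copy (as on GT200-era hardware); all kernel and buffer names are our assumptions, not the solver's actual code:

    #include <mpi.h>
    #include <cuda_runtime.h>

    // assumed kernels (declarations only; bodies omitted in this sketch):
    __global__ void spmv_inner(const double *x, double *y);
    __global__ void spmv_boundary(const double *x, double *y, const double *halo);
    __global__ void pack_halo(const double *x, double *sendBuf);

    void matvec_overlapped(const double *x_dev, double *y_dev,
                           double *sendBuf_dev, double *recvBuf_dev,
                           double *sendBuf_host, double *recvBuf_host,
                           int haloCount, int rankLeft, int rankRight,
                           cudaStream_t computeStream, cudaStream_t copyStream,
                           int blocksInner, int blocksHalo, int threads)
    {
        size_t haloBytes = (size_t)haloCount * sizeof(double);

        // 1) inner-cell product runs asynchronously on its own stream
        spmv_inner<<<blocksInner, threads, 0, computeStream>>>(x_dev, y_dev);

        // 2) meanwhile: pack the boundary layer on the GPU, stage it to the host
        pack_halo<<<blocksHalo, threads, 0, copyStream>>>(x_dev, sendBuf_dev);
        cudaMemcpyAsync(sendBuf_host, sendBuf_dev, haloBytes,
                        cudaMemcpyDeviceToHost, copyStream);
        cudaStreamSynchronize(copyStream);

        // 3) halo exchange with the neighboring MPI ranks
        MPI_Sendrecv(sendBuf_host, haloCount, MPI_DOUBLE, rankLeft, 0,
                     recvBuf_host, haloCount, MPI_DOUBLE, rankRight, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        // 4) stage the received halo back and finish on the boundary cells
        cudaMemcpyAsync(recvBuf_dev, recvBuf_host, haloBytes,
                        cudaMemcpyHostToDevice, copyStream);
        cudaStreamSynchronize(copyStream);     // halo now resident on the GPU
        cudaStreamSynchronize(computeStream);  // inner product finished
        spmv_boundary<<<blocksHalo, threads>>>(x_dev, y_dev, recvBuf_dev);
    }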

Results

Benchmarking problem: air bubble rising in water
Properties:
- domain size: 20 cm x 20 cm x 20 cm
- liquid phase: water at 20 °C
- gas phase: air at 20 °C
- surface tension: standard
- volume forces: standard gravity
- initial air bubble radius: 3 cm
- initial center position of the bubble: (10 cm, 6 cm, 10 cm)

Performance measurements for GPUs
Perfectly fair CPU-GPU benchmarks are very hard!
- 1 GPU vs. 1 CPU core: (+) good GPU results; (-) CPU speed unclear; (-) not realistic w.r.t. price
- performance per dollar: (++) best price realism; (-) price per node / CPU? (-) prices subject to change
- 1 GPU vs. 1 CPU socket: (+) better price realism; (-) number of cores per socket? (-) speed per CPU core?
- performance per Watt: (++) Green IT; (+) power costs; (-) high influence of the configuration

Benchmarking platforms
CPU hardware:
- dual 6-core Intel Xeon E5650 CPU, 2.67 GHz
- 24 GB DDR3 RAM
GPU hardware (GF100 Fermi):
- 4-core Intel Xeon E5620 CPU, 2.40 GHz
- 6 GB DDR3 RAM
- NVIDIA Tesla C2050 GPU
GPU cluster (8 GT200 GPUs):
- 2 workstations, each with a 4-core Intel Core i7-920 CPU, 2.66 GHz, and 12 GB DDR3 RAM
- NVIDIA Tesla S1070 (4 GPUs) per workstation
- InfiniBand 40G QDR ConnectX
Software: Ubuntu Linux 10.04 64-bit, GCC 4.4.3, CUDA 3.2 SDK, OpenMPI 1.4.1

Performance per dollar
[bar chart: speed-up on one GPU for grid resolutions 64^3, 128^3, 256x128^2, 256^2x128; series: GT200 GPU vs. 6-core Xeon CPU, GF100 GPU with ECC vs. dual 6-core Xeon CPU, GF100 GPU without ECC vs. dual 6-core Xeon CPU; measured speed-ups range from 1.23 to 3.42]
- 1 CPU core vs. 1 GPU: > 41x speedup
- 1 socket (4 cores) vs. 1 GPU: > 10x speedup

Performance per Watt
[bar chart: power consumption in kWh at grid resolution 256^2x128: dual 6-core Xeon CPU 0.21, 8 GT200 GPUs 0.12, GF100 GPU with ECC 0.09, GF100 GPU without ECC 0.08]
The Fermi-type GPU is more than two times more power-efficient.

Multi-GPU performance (GT200 GPUs)
Strong scaling (grid resolution 256x256x256), speed-up relative to one GT200 GPU:
[plot: speed-ups of 1, 1.95, 3.7, 4.89, and 6.59 on 1 to 8 GPUs]
Weak scaling, scale-up relative to one GT200 GPU:
[plot: grid resolution 256x256x128 per GPU: 1, 1.1, 1.12, 1.13; grid resolution 256x256x256 per GPU: 1, 0.93, 1.05, 1.1]

Summary
- NaSt3DGPF solves the two-phase incompressible Navier-Stokes equations
- CFD applications are well-suited for GPUs
- the code scales on next-generation multi-GPU clusters
Thanks to:

Thank you!
References:
- Griebel, Zaspel: A multi-GPU accelerated solver for the three-dimensional two-phase incompressible Navier-Stokes equations. Computer Science - Research and Development, 25(1-2):65-73, May 2010.
- Zaspel, Griebel: Solving Incompressible Two-Phase Flows on Massively Parallel Multi-GPU Clusters. Computers and Fluids - Special Issue: ParCFD2011, submitted.