Massively Parallel OpenMP-MPI Implementation of the SPH Code DualSPHysics

Massively Parallel OpenMP-MPI Implementation of the SPH Code DualSPHysics. Athanasios Mokos, Benedict D. Rogers, School of Mechanical, Aeronautical and Civil Engineering, University of Manchester, UK. Funded by the eCSE (eCSE07-16). eCSE Webinar, 24 January 2018

Motivation for Research Outline of Presentation Introduction to Meshless Methods Introduction to SPH Message Passing Interface Domain Division Process Communication Asynchronous communications Results Runtime Results Optimization Dynamic Load Balancing Domain Decomposition 2D/3D Decomposition Zoltan Library Domain Decomposition Algorithms Conclusions and Future Work

Motivation for Research Primary focus on violent water flows with breaking free surface, e.g. wave impact/slamming or potentially explosive pipe flows Applications: Coastal Defence Offshore structures Dam and river flooding Experiments are expensive and require significant investment Focus on computational methods Whitehaven 2013 (North News)

Mesh-based Methods Mesh around a RAE 2822 transonic aerofoil (NASA) Most common methods used in both industry and academia: Finite Difference Method (FDM) Finite Element Method (FEM) Finite Volume Method (FVM) Robust, well-developed and mature Multiple algorithms and models Adapted for every application However

Mesh-based Methods Meshing can be complex Deformation? Mesh around an airplane hull (Photo courtesy of Siemens) Waves breaking on the shore (Photo courtesy of the University of Plymouth)

Meshless Methods Computation Points: Nodes -> Particles Particles are not connected with a grid Particles are not fixed but move with their own velocity and acceleration Each particle follows a unique trajectory Particles are described through Lagrangian derivatives: Rate of change along a trajectory

Local Interpolation Particles possess properties (density, pressure etc.) travelling with them. Particles are linked to their current neighbouring particles in space. Neighbours' values affect the properties of the particle through a summation, and particle movement is also affected by the neighbours.

Introduction to Smoothed Particle Hydrodynamics SPH is a Lagrangian meshless method: computation points (particles) move according to the governing equations (Navier-Stokes equations). Basic idea: the value of a function A(r) at a point r in space is approximated as $A(\mathbf{r}) \approx \sum_{j=1}^{N} \frac{m_j}{\rho_j} A(\mathbf{r}_j)\, W(\mathbf{r}-\mathbf{r}_j, h)$, the discrete form of $A(\mathbf{r}) = \int A(\mathbf{r}')\, W(\mathbf{r}-\mathbf{r}', h)\,\mathrm{d}\mathbf{r}'$. Properties are computed through local interpolation with a weighting function (kernel) around each particle, with a radius of influence of 2h.
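To make the interpolation concrete, here is a minimal C++ sketch of the SPH summation for a scalar field using the quintic Wendland kernel given later in the slides; the particle layout, names and the brute-force loop are illustrative assumptions, not DualSPHysics code (which restricts the sum to neighbours within 2h).

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal particle arrays: positions, masses, densities and a scalar field A.
struct Particles {
    std::vector<double> x, y, z, mass, rho, A;
};

// Quintic Wendland kernel in 3D with support radius 2h.
double wendland(double r, double h) {
    const double pi = 3.14159265358979323846;
    double q = r / h;
    if (q >= 2.0) return 0.0;
    double alphaD = 21.0 / (16.0 * pi * h * h * h);   // 3D normalisation constant
    double t = 1.0 - 0.5 * q;
    return alphaD * t * t * t * t * (2.0 * q + 1.0);
}

// SPH interpolation: A(r_i) ~ sum_j (m_j / rho_j) A_j W(|r_i - r_j|, h).
// Brute force over all particles for clarity; the kernel support makes most terms zero.
double interpolateA(const Particles& p, std::size_t i, double h) {
    double sum = 0.0;
    for (std::size_t j = 0; j < p.x.size(); ++j) {
        double dx = p.x[i] - p.x[j];
        double dy = p.y[i] - p.y[j];
        double dz = p.z[i] - p.z[j];
        double r  = std::sqrt(dx * dx + dy * dy + dz * dz);
        sum += (p.mass[j] / p.rho[j]) * p.A[j] * wendland(r, h);
    }
    return sum;
}
```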

Introduction to SPH: Navier-Stokes Equations. Continuity: $\frac{\mathrm{d}\rho_i}{\mathrm{d}t} = \sum_j m_j\, \mathbf{u}_{ij} \cdot \nabla_i W_{ij}$. Momentum: $\frac{\mathrm{d}\mathbf{u}_i}{\mathrm{d}t} = -\sum_j m_j \left( \frac{p_i}{\rho_i^2} + \frac{p_j}{\rho_j^2} \right) \nabla_i W_{ij} + \mathbf{F}_i$. These are the SPH discretisations of $\frac{\mathrm{d}\rho}{\mathrm{d}t} = -\rho\, \nabla \cdot \mathbf{u}$ and $\frac{\mathrm{d}\mathbf{u}}{\mathrm{d}t} = -\frac{1}{\rho}\nabla p + \nu \nabla^2 \mathbf{u} + \mathbf{F}$. Fluid compressibility: Incompressible SPH solves a Poisson equation for the pressure; Weakly Compressible SPH uses an equation of state.

SPH for real problems Real-life applications are complex 3D flows: multi-scale problems with long runtimes. SPH requires over 10^8 particles to model them, and must do so as quickly as possible. OPTION: use the inherent parallelism of the GPU. Photo by University of Plymouth

SPH Solver DualSPHysics Open-source project, co-developed by the Universities of Vigo, Manchester, Parma and Lisbon http://dual.sphysics.org/ Validated for violent water flows [1] Includes pre- and post-processing software http://www.youtube.com/user/dualsphysics/videos

Additional Capabilities Integration with all existing capabilities of DualSPHysics Wave Generator DEM model Floating Bodies Air-Water multiphase model Solid-Water multiphase model Object motion

Current State of DualSPHysics. GPU: highly optimised code, multiple options, pre- and post-processing tools, able to take advantage of the inherent parallelism, simulates millions of particles in a few hours. CPU: highly optimised code, multiple options, pre- and post-processing tools, OpenMP implementation, simulates millions of particles in a few months.

Current State of DualSPHysics. 2D: speedup of up to 21 for a six-year-old card compared to an 8-thread OpenMP simulation. 3D: speedup of up to 16 for a six-year-old card compared to an 8-thread OpenMP simulation.

Current State of DualSPHysics GPUs are fantastic: massively parallel, ideal for n-body simulations; low cost and energy consumption (green computing). But: still in their infancy (less developed tools and compilers); single precision to maintain speed; the multi-GPU code is not easily portable; specialised hardware and additional investment required (cannot take advantage of existing HPC infrastructure); industrial engineering companies still need convincing to invest resources and personnel. NVIDIA GTX 1080

Motivation for Research Develop a CPU code with similar capabilities to the existing GPU code that can be used in HPC installations. Massive parallelism required: the ability to scale to 1000s of cores. Currently only local parallelism (OpenMP), so communication between different processors is required: implementation of the Message Passing Interface (MPI) standard. AIM: develop a hybrid OpenMP-MPI program that can scale to 1000s of cores.

Message Passing Interface Standardised, independent and portable message passing library specification. Message passing: data is moved from one process to another through cooperative operations on each process; the recipient then selects the appropriate code to be executed. Distributed memory model; since OpenMP is already implemented, this yields a hybrid memory model.
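For readers less familiar with MPI, a minimal self-contained example of cooperative message passing between two processes is sketched below; it is generic MPI, not DualSPHysics code.

```cpp
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    std::vector<double> buffer(4, 0.0);
    if (rank == 0 && size > 1) {
        buffer = {1.0, 2.0, 3.0, 4.0};                   // data owned by process 0
        MPI_Send(buffer.data(), 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buffer.data(), 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                     // process 1 receives the data
        std::printf("rank 1 received %g ... %g\n", buffer.front(), buffer.back());
    }

    MPI_Finalize();
    return 0;
}
```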

Challenges of Integrating MPI Maintain the DualSPHysics optimisation and structure (cell-linked neighbour list [3]) and its ease of use. Reduce changes in the SPH computation, which limits the options when creating particles and cells. Need to introduce new features: focus on updating existing functions to work with multiple nodes, and create new files to handle communication and data transfer.

SPH Solver DualSPHysics Cell-linked neighbour list [3] http://dual.sphysics.org/ An algorithm that optimises neighbour searching: divide the domain into cells of side 2h (cells remain constant throughout the computation), create a list linking particles and cells, and search for neighbouring particles only in adjacent cells.
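The idea behind the cell-linked list can be illustrated with the short 2D sketch below; the flat cell grid, the assumption of a rectangular domain with origin at (0,0) and all names are illustrative, not the DualSPHysics data layout.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Particles are binned into cells of side 2h; neighbour candidates of a particle are
// then only the particles in its own cell and the (up to) 8 adjacent cells, instead
// of all N particles.
struct CellList {
    double cellSize;                                    // = 2h
    int nx, ny;                                         // number of cells in x and y
    std::vector<std::vector<std::size_t>> cells;        // particle indices per cell

    CellList(double h, double lx, double ly)
        : cellSize(2.0 * h),
          nx(static_cast<int>(std::ceil(lx / cellSize))),
          ny(static_cast<int>(std::ceil(ly / cellSize))),
          cells(static_cast<std::size_t>(nx) * ny) {}

    int cellOf(double x, double y) const {
        int cx = std::min(static_cast<int>(x / cellSize), nx - 1);
        int cy = std::min(static_cast<int>(y / cellSize), ny - 1);
        return cy * nx + cx;
    }

    void build(const std::vector<double>& x, const std::vector<double>& y) {
        for (auto& c : cells) c.clear();
        for (std::size_t i = 0; i < x.size(); ++i)
            cells[cellOf(x[i], y[i])].push_back(i);
    }

    // Visit every potential neighbour of a point; the exact r < 2h check is left to the caller.
    template <class F>
    void forEachCandidate(double xi, double yi, F&& f) const {
        int c = cellOf(xi, yi), cx = c % nx, cy = c / nx;
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx) {
                int ax = cx + dx, ay = cy + dy;
                if (ax < 0 || ay < 0 || ax >= nx || ay >= ny) continue;
                for (std::size_t j : cells[static_cast<std::size_t>(ay) * nx + ax]) f(j);
            }
    }
};
```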

Integrating MPI in DualSPHysics Single-node files JCellDivCpuSingle, JPartsLoad4 and JSphCpuSingle become the MPI files CellDivCpuMPI, ParticleLoadMPI and SphCpuMPI. Changes focused on: loading data from the pre-processing software; creating and updating the assignment of particles to cells; handling and integrating the new features.

Integrating MPI in DualSPHysics Single-node files JCellDivCpuSingle, JPartsLoad4 and JSphCpuSingle become the MPI files CellDivCpuMPI, ParticleLoadMPI and SphCpuMPI. New files created to handle node communication, domain decomposition and halo exchange: BufferMPI, DataCommMPI, HostMPI, InfoMPI, SliceMPI, SphMPI, SphHaloMPI.

Domain Decomposition Divide the domain between nodes: this allows the simulation to use more particles, with a unique particle and cell list per node. 1D decomposition through slices [2] reduces the local and global memory footprint and reduces the load on each CPU core.
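A minimal sketch of such a 1D slice decomposition is given below, assuming the cells are counted along the decomposition axis; the even split shown here ignores load balancing, which is addressed later in the presentation.

```cpp
#include <utility>
#include <vector>

// Split ncellsX slices of cells as evenly as possible among nprocs ranks and return,
// for each rank, the half-open range [first, last) of slice indices it owns.
std::vector<std::pair<int, int>> sliceDecomposition(int ncellsX, int nprocs) {
    std::vector<std::pair<int, int>> ranges(nprocs);
    int base  = ncellsX / nprocs;
    int extra = ncellsX % nprocs;        // the first 'extra' ranks get one more slice
    int first = 0;
    for (int p = 0; p < nprocs; ++p) {
        int count = base + (p < extra ? 1 : 0);
        ranges[p] = {first, first + count};
        first += count;
    }
    return ranges;
}
```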

Halo Exchange Identify neighbouring particles in another process, or particles moved from another process. Only data from the neighbouring slice (within a distance of 2h) are transferred, i.e. only the data of all potential neighbours. A halo system is used for more efficiency [3]: edge particles form the halo of the subdomain, with a similar procedure on every subdomain border.
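A sketch of selecting the edge particles that form the halo sent to one neighbouring subdomain, assuming a 1D decomposition along x with the right-hand boundary at xMax; names are illustrative.

```cpp
#include <cstddef>
#include <vector>

// Collect the indices of particles lying within 2h of the right-hand subdomain boundary;
// these edge particles form the halo sent to the neighbouring rank. The same selection
// is applied at the left boundary (and on every face for a general decomposition).
std::vector<std::size_t> collectRightHalo(const std::vector<double>& x,
                                          double xMax, double h) {
    std::vector<std::size_t> halo;
    for (std::size_t i = 0; i < x.size(); ++i)
        if (x[i] > xMax - 2.0 * h) halo.push_back(i);
    return halo;
}
```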

Asynchronous Communications Objective: minimise the waiting time for data transfer. The neighbour list of interior particles is processed while sending the data of displaced particles, and forces on interior particles are computed while receiving halo data. Processes synchronise when calculating the time step (Dominguez et al., 2013) [2].
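The overlap strategy can be expressed as the generic MPI skeleton below (one communication direction shown); the buffer layout and the placeholder force routines are assumptions of this sketch, not the DualSPHysics functions.

```cpp
#include <mpi.h>
#include <vector>

// Placeholder work routines standing in for the real SPH loops.
void computeInteriorForces() { /* forces on particles whose neighbours are all local */ }
void computeEdgeForces(const std::vector<double>&) { /* forces that need halo data */ }

// Post non-blocking receive/send, compute forces on interior particles while the
// messages are in flight, then wait and finish the edge particles with the halo data.
void timeStepWithOverlap(std::vector<double>& sendHalo,
                         std::vector<double>& recvHalo,
                         int leftRank, int rightRank) {
    MPI_Request reqs[2];

    MPI_Irecv(recvHalo.data(), static_cast<int>(recvHalo.size()), MPI_DOUBLE,
              rightRank, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendHalo.data(), static_cast<int>(sendHalo.size()), MPI_DOUBLE,
              leftRank, 0, MPI_COMM_WORLD, &reqs[1]);

    computeInteriorForces();                 // overlaps with the communication

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    computeEdgeForces(recvHalo);             // uses the received halo data
}
```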

Results Execution for 8 processes. Results identical to single-node DualSPHysics and independent of the number of processes. Portability: the code operates on both Windows and Linux and on different processor architectures.

Runtime Results (small scale) Local execution for 1-16 processes. Westmere: Xeon X5650, 2.66 GHz (2x6-core); Sandy Bridge: Xeon E5-2640, 2.5 GHz (2x6-core); Ivy Bridge: Xeon E5-2650 v2, 2.6 GHz (2x8-core); Haswell: Xeon E5-2690 v3, 2.6 GHz (2x12-core). Still-water case with 700,000 particles. Parallel efficiency: $E_p = \frac{T_1}{p\,T_p} \times 100\%$
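As a worked example of the efficiency metric (with purely illustrative timings, not measured results):

```latex
E_p = \frac{T_1}{p\,T_p}\times 100\%, \qquad
\text{e.g. } T_1 = 1200\,\mathrm{s},\; T_{12} = 125\,\mathrm{s}
\;\Rightarrow\; E_{12} = \frac{1200}{12 \times 125}\times 100\% = 80\%.
```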

Scalability (small scale) The code can be further optimised. Parallel efficiency: $E_p = \frac{T_1}{p\,T_p} \times 100\%$. Scalability issues do not allow efficient computation with ~100 processes: the 1D decomposition is not scalable and there is no load balancing. REMINDER: we need more than 10^8 particles for the target problems.

Runtime Results (small scale) Local execution for 8 processes on an Intel Xeon E5507 at 2.27 GHz. Still-water case with 160,000 particles. Synchronisation at the end of the time step slows the computation; current implementation: MPI_Allreduce.
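The synchronisation named above amounts to a global minimum reduction of the local time steps; a generic MPI sketch (variable names assumed) is:

```cpp
#include <mpi.h>

// Every process proposes a locally stable time step; the step actually used by all
// processes is the global minimum, obtained with a blocking all-reduce. This is the
// synchronisation point at the end of each step that currently slows the computation.
double globalTimeStep(double dtLocal) {
    double dtGlobal = dtLocal;
    MPI_Allreduce(&dtLocal, &dtGlobal, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);
    return dtGlobal;
}
```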

Dynamic Load Balancing Processes do not have the same workload (number of particles, inter-particle forces), and in dynamic simulations the workload of each process changes constantly. Options: 1. same number of particles; 2. same execution time. Option 1 is simpler to enforce; Option 2 has higher potential but is difficult to enforce.

The Zoltan Library Use of the Zoltan data management library [4]: a library for the development and optimisation of parallel, unstructured and adaptive codes, scalable up to 10^6 cores [4]. It includes a suite of spatial decomposition and dynamic load balancing algorithms and an unstructured communication package. Geometric decomposition algorithm: Hilbert Space Filling Curve (HSFC). Dam break at 1.1 s for 256 partitions [5]

Hilbert Space Filling Curve A continuous fractal space-filling curve (filling the entire 2D unit square) that maps 2D and 3D points to a 1D curve while maintaining spatial locality. Already used for SPH [5]. Drawback: irregular subdomain shapes (increased complexity of data transfer). Guo et al. (2015) [7]

HSFC Algorithm HSFC maps cells on a 1D curve onto the interval [0,1]. It divides the curve into N bins, where N is larger than the number of processes, then sums the bin weights from the starting point, placing a cut whenever the desired weight is reached. Bins containing a cut-off point are further refined until the desired balance is achieved.
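The cutting step can be illustrated with a simplified sketch that cuts a 1D array of bin weights, laid out along the Hilbert curve, into parts of roughly equal weight; the refinement of bins containing a cut is omitted, and this is not the Zoltan implementation itself.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Given the weights of N bins ordered along the 1D Hilbert curve, return the first bin
// of each of the nparts parts (part p owns bins [cuts[p], cuts[p+1])). A cut is placed
// whenever the running weight reaches the next multiple of totalWeight / nparts; in
// practice the bin containing a cut would be refined until the balance is acceptable.
std::vector<std::size_t> cutBins(const std::vector<double>& binWeight, int nparts) {
    double total  = std::accumulate(binWeight.begin(), binWeight.end(), 0.0);
    double target = total / nparts;

    std::vector<std::size_t> cuts(nparts + 1, binWeight.size());
    cuts[0] = 0;
    double running = 0.0;
    int nextPart = 1;
    for (std::size_t b = 0; b < binWeight.size() && nextPart < nparts; ++b) {
        running += binWeight[b];
        if (running >= nextPart * target) cuts[nextPart++] = b + 1;
    }
    return cuts;        // cuts[nparts] == total number of bins
}
```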

Using Zoltan in DualSPHysics Domain decomposition and load balancing through Zoltan. Main partitioning parameter: cells, which are significantly fewer than particles, allow for load balancing, and do not change position. Load balancing through cell weights, based on particle number [5] (current) or on execution time. Automatic migration through Zoltan_Migrate, with low complexity of the data transferred. Devine et al. (2009) [4]

Using Zoltan in DualSPHysics New arrays created: Global Cell ID, Local Cell ID, Cell Coordinates, Cell Weights. Each process only holds local data. Example: a domain divided into 64 cells containing 285 particles, with the initial domain split by 1D decomposition (slices). (Figure: 8x8 grid of Global Cell IDs, numbered 0-63 column by column.)

Using Zoltan in DualSPHysics New arrays created: Global Cell ID, Local Cell ID, Cell Coordinates, Cell Weights. Each process only holds local data. Example: a domain divided into 64 cells containing 285 particles, with the initial domain split by 1D decomposition (slices). (Figure: grid of Local Cell IDs 0-15 for the 16 cells held by one process.)

Using Zoltan in DualSPHysics Cell weights [5]: $w_C = N_{pc}/N_{pt}$, where $N_{pc}$ is the number of particles in the cell and $N_{pt}$ the total number of particles. The data are sent to Zoltan and the HSFC algorithm is applied. Zoltan output: Global Cell IDs of imported cells, Global Cell IDs of exported cells, and the destination process. Cell data are automatically migrated using the AUTO_MIGRATE option.
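A condensed sketch of how such a cell-based HSFC partition can be requested through Zoltan's documented C interface (Zoltan_Create, Zoltan_Set_Param, query callbacks, Zoltan_LB_Partition) is shown below. The cell arrays, the weight choice and the omitted error handling are assumptions of this illustration, not the DualSPHysics implementation; note also that AUTO_MIGRATE additionally requires object size/pack/unpack callbacks, which are omitted here for brevity.

```cpp
#include <mpi.h>
#include <zoltan.h>
#include <cstddef>
#include <vector>

// One Zoltan object per cell, weighted by its share of the particles and positioned at
// its centre. MPI_Init and Zoltan_Initialize are assumed to have been called already,
// and global IDs are assumed to fit in one ZOLTAN_ID_TYPE (Zoltan's default).
struct CellData {
    std::vector<ZOLTAN_ID_TYPE> globalCellId;
    std::vector<float>          weight;   // w_C = N_pc / N_pt per cell
    std::vector<double>         centre;   // x, y, z per cell
};

static int numObj(void* data, int* ierr) {
    *ierr = ZOLTAN_OK;
    return static_cast<int>(static_cast<CellData*>(data)->globalCellId.size());
}

static void objList(void* data, int, int, ZOLTAN_ID_PTR gids, ZOLTAN_ID_PTR,
                    int wgtDim, float* wgts, int* ierr) {
    CellData* c = static_cast<CellData*>(data);
    for (std::size_t i = 0; i < c->globalCellId.size(); ++i) {
        gids[i] = c->globalCellId[i];
        if (wgtDim > 0) wgts[i] = c->weight[i];
    }
    *ierr = ZOLTAN_OK;
}

static int numGeom(void*, int* ierr) { *ierr = ZOLTAN_OK; return 3; }

static void geomMulti(void* data, int, int, int numObjs, ZOLTAN_ID_PTR, ZOLTAN_ID_PTR,
                      int, double* geom, int* ierr) {
    CellData* c = static_cast<CellData*>(data);
    for (int i = 0; i < 3 * numObjs; ++i) geom[i] = c->centre[i];
    *ierr = ZOLTAN_OK;
}

void partitionCells(CellData& cells) {
    struct Zoltan_Struct* zz = Zoltan_Create(MPI_COMM_WORLD);
    Zoltan_Set_Param(zz, "LB_METHOD", "HSFC");
    Zoltan_Set_Param(zz, "OBJ_WEIGHT_DIM", "1");
    Zoltan_Set_Param(zz, "AUTO_MIGRATE", "TRUE");   // pack/unpack/size callbacks also needed

    Zoltan_Set_Num_Obj_Fn(zz, numObj, &cells);
    Zoltan_Set_Obj_List_Fn(zz, objList, &cells);
    Zoltan_Set_Num_Geom_Fn(zz, numGeom, &cells);
    Zoltan_Set_Geom_Multi_Fn(zz, geomMulti, &cells);

    int changes, nGid, nLid, nImport, nExport;
    ZOLTAN_ID_PTR importGids, importLids, exportGids, exportLids;
    int *importProcs, *importParts, *exportProcs, *exportParts;

    // Output: global IDs of the cells to import/export and their destination processes.
    Zoltan_LB_Partition(zz, &changes, &nGid, &nLid,
                        &nImport, &importGids, &importLids, &importProcs, &importParts,
                        &nExport, &exportGids, &exportLids, &exportProcs, &exportParts);

    Zoltan_LB_Free_Part(&importGids, &importLids, &importProcs, &importParts);
    Zoltan_LB_Free_Part(&exportGids, &exportLids, &exportProcs, &exportParts);
    Zoltan_Destroy(&zz);
}
```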

Using Zoltan in DualSPHysics GlobalCellID is updated: exported cells are removed and imported cells added. Particles are also imported and exported, and the data are reordered, creating a new cell-linked neighbour list. LocalCellID is updated. The algorithm is applied only when the imbalance exceeds 20%.

Particle Mapping A connection between cells and particles is needed. Existing DualSPHysics array: CellPart. CellPart can be easily mapped onto LocalCellID, and LocalCellID acts as an intermediary between CellPart and GlobalCellID. If N_c is the number of local cells: CellPart (2N_c + 5), LocalCellID (N_c), GlobalCellID (N_c).
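A minimal, purely hypothetical sketch of the mapping chain described above (the real DualSPHysics array layouts differ):

```cpp
#include <cstddef>
#include <vector>

// CellPart gives each particle's local cell index; LocalCellID and GlobalCellID are
// parallel arrays of length Nc, so the same index retrieves the domain-wide cell number.
int globalCellOfParticle(const std::vector<int>& cellPart,      // per-particle local cell
                         const std::vector<int>& globalCellID,  // length Nc
                         std::size_t particle) {
    int localCell = cellPart[particle];   // index into the Nc cells held by this process
    return globalCellID[localCell];       // corresponding global cell number
}
```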

Particle Reordering Particles need reordering to maintain local spatial locality. Currently, particle data are reordered using the single-node algorithm; doing the same for LocalCellID allows mapping to CellPart, and GlobalCellID remains constant. A better option is to reorder along the HSFC path [5]. (Figures: example cell orderings for one subdomain, first using the single-node column-by-column ordering and then following the HSFC path.)

Partition Results (Figures: 2D and 3D views of the resulting domain partitions, with the x, y and z axes labelled.)

Future Work Complete a fully working version of the DualSPHysics MPI code: halo exchange, particle exchange. Assess the code capabilities and validate. Optimisation. New I/O functions required: transition to the Hierarchical Data Format (HDF5). Execution on large HPC clusters with 1000s of cores.

Halo Exchange Halo exchange reworked using cells: neighbouring cells are explicitly known through GlobalCellID. Identify the processes the particles are in and transfer the data. The packing and unpacking algorithms are the same as in the previous code.

Particle Exchange Particles can move out of their cell, and the new cell may be in a different process. Use the cell coordinates to identify the edges of the process domain, identify the process and cell the particle moves into, and use the same packing/unpacking algorithm. This process needs to be completed before the particle data are reordered.
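A sketch of building the per-destination send lists for this particle exchange, under the assumption that each rank can look up which process owns each global cell; all names are hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// For each particle that now sits in a cell owned by another process, append its index
// to that process's send list. Packing, MPI transfer and unpacking can then reuse the
// same buffer routines as the halo exchange.
std::map<int, std::vector<std::size_t>>
buildParticleSendLists(const std::vector<int>& newGlobalCell,   // per-particle new cell
                       const std::vector<int>& cellOwnerRank,   // owner rank per global cell
                       int myRank) {
    std::map<int, std::vector<std::size_t>> sendLists;          // destination rank -> particles
    for (std::size_t i = 0; i < newGlobalCell.size(); ++i) {
        int owner = cellOwnerRank[newGlobalCell[i]];
        if (owner != myRank) sendLists[owner].push_back(i);
    }
    return sendLists;
}
```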

References
[1] Crespo, A.J.C., J.M. Dominguez, B.D. Rogers, M. Gomez-Gesteira, S. Longshaw, R. Canelas, R. Vacondio, A. Barreiro, and O. Garcia-Feal, DualSPHysics: Open-source parallel CFD solver based on Smoothed Particle Hydrodynamics (SPH). Computer Physics Communications, 2015. 187: p. 204-216.
[2] Valdez-Balderas, D., J.M. Dominguez, B.D. Rogers, and A.J.C. Crespo, Towards accelerating smoothed particle hydrodynamics simulations for free-surface flows on multi-GPU clusters. Journal of Parallel and Distributed Computing, 2013. 73(11): p. 1483-1493.
[3] Dominguez, J.M., A.J.C. Crespo, D. Valdez-Balderas, B.D. Rogers, and M. Gomez-Gesteira, New multi-GPU implementation for smoothed particle hydrodynamics on heterogeneous clusters. Computer Physics Communications, 2013. 184(8): p. 1848-1860.
[4] Devine, K., E. Boman, R. Heaphy, B. Hendrickson, and C. Vaughan, Zoltan Data Management Services for Parallel Dynamic Applications. Computing in Science & Engineering, 2002. 4(2): p. 90-97.
[5] Guo, X., B.D. Rogers, S. Lind, and P.K. Stansby, New Massively Parallel Scheme for Incompressible Smoothed Particle Hydrodynamics (ISPH) for Highly Nonlinear and Distorted Flow. Computer Physics Communications, under publication.
[6] Guo, X., S. Lind, B.D. Rogers, P.K. Stansby, and M. Ashworth, Efficient massive parallelisation for incompressible Smoothed Particle Hydrodynamics with 10^8 particles, in 8th International SPHERIC Workshop. 2013: Trondheim, Norway.
[7] Guo, X., B.D. Rogers, S. Lind, P.K. Stansby, and M. Ashworth, Exploring an Efficient Parallel Implementation Model for 3-D Incompressible Smoothed Particle Hydrodynamics, in 10th International SPHERIC Workshop. 2015.

Thank you Acknowledgements U-Man: Georgios Fourtakas, Peter Stansby, Steve Lind STFC: Xiaohu Guo, Stephen Longshaw U-Vigo: Alex Crespo, Moncho Gomez-Gesteira U-Parma: Renato Vacondio Free open-source DualSPHysics code: http://www.dual.sphysics.org

Additional Models
Viscosity (laminar): $\Gamma_i = \sum_{j}^{N} m_j \frac{4\nu_0\, \mathbf{r}_{ij}\cdot\nabla_i W_{ij}}{(\rho_i+\rho_j)\left(r_{ij}^2+\eta^2\right)}\,\mathbf{u}_{ij}$
δ-SPH: $\frac{\mathrm{d}\rho_i}{\mathrm{d}t} = \sum_{j}^{N} m_j\, \mathbf{u}_{ij}\cdot\nabla_i W_{ij} + D_i$, with $D_i = 2\delta h c_s \sum_{j}^{N} (\rho_j-\rho_i)\frac{\mathbf{r}_{ij}\cdot\nabla_i W_{ij}}{r_{ij}^2}\frac{m_j}{\rho_j}$
Quintic Wendland kernel: $W_{ij} = \alpha_D \left(1-\frac{q}{2}\right)^4 (2q+1)$, where $q = \frac{|\mathbf{r}_i-\mathbf{r}_j|}{h}$
Equation of state: $P = \frac{c_0^2\rho_0}{\gamma}\left[\left(\frac{\rho}{\rho_0}\right)^{\gamma} - 1\right]$
2nd-order velocity Verlet time-marching scheme