CHRONO::HPC DISTRIBUTED MEMORY FLUID-SOLID INTERACTION SIMULATIONS. Felipe Gutierrez, Arman Pazouki, and Dan Negrut, University of Wisconsin-Madison

CHRONO::HPC DISTRIBUTED MEMORY FLUID-SOLID INTERACTION SIMULATIONS
Felipe Gutierrez, Arman Pazouki, and Dan Negrut, University of Wisconsin-Madison
Support: Rapid Innovation Fund, U.S. Army TARDEC
ASME IDETC/CIE 2016 :: Software Tools for Computational Dynamics in Industry and Academia
Charlotte, North Carolina :: August 21-24, 2016

Motivation

The Lagrangian-Lagrangian framework
Based on the work behind Chrono::FSI:
- Fluid: Smoothed Particle Hydrodynamics (SPH)
- Solid: 3D rigid body dynamics (center-of-mass position, rigid rotation); Absolute Nodal Coordinate Formulation (ANCF) for flexible bodies (node locations and slopes)
The Lagrangian-Lagrangian approach is attractive since it:
- Is consistent with Lagrangian tracking of discrete solid components
- Allows straightforward simulation of the free-surface flows prevalent in target applications
- Maps well to parallel computing architectures (GPU, many-core, distributed memory)
Reference: A Lagrangian-Lagrangian Framework for the Simulation of Fluid-Solid Interaction Problems with Rigid and Flexible Components, University of Wisconsin-Madison, 2014

Smoothed Particle Hydrodynamics (SPH) method
- "Smoothed" refers to the smoothing kernel W(r_ab, h): the contribution of a neighboring particle b to particle a is weighted by their distance r_ab and the smoothing length h
- "Particle" refers to the summation over discrete particles
- Kernel properties; a cubic spline kernel is often used (standard forms are reproduced below for reference)
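For reference, a standard form of the SPH field approximation and of the cubic spline kernel mentioned above; the slide's own equations are not recoverable from the transcription, so these are the textbook expressions commonly used (with Monaghan-style 3D normalization):

```latex
% SPH approximation of a field A at particle a, summing over neighbors b
A(\mathbf{r}_a) \approx \sum_b \frac{m_b}{\rho_b}\, A_b\, W(r_{ab}, h),
\qquad r_{ab} = \lvert \mathbf{r}_a - \mathbf{r}_b \rvert

% Cubic spline kernel (3D normalization), q = r_ab / h
W(q, h) = \frac{1}{\pi h^3}
\begin{cases}
1 - \tfrac{3}{2}\,q^2 + \tfrac{3}{4}\,q^3, & 0 \le q < 1,\\
\tfrac{1}{4}\,(2 - q)^3, & 1 \le q < 2,\\
0, & q \ge 2.
\end{cases}
```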

SPH for fluid dynamics
- Continuity and momentum equations, discretized as sums over neighboring particles (standard forms are given below for reference)
- In the context of fluid dynamics, each particle carries fluid properties such as pressure, density, etc.
- Note: the above sums are evaluated over millions of particles.
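For reference, a commonly used SPH discretization of the continuity and momentum equations; the slide's own equations are not recoverable from the transcription, so these are the standard weakly compressible forms (with an optional artificial viscosity term $\Pi_{ab}$ and body force $\mathbf{g}$):

```latex
% Continuity: density rate of particle a from relative motion of neighbors b
\frac{d\rho_a}{dt} = \sum_b m_b\, (\mathbf{v}_a - \mathbf{v}_b) \cdot \nabla_a W_{ab}

% Momentum: symmetric pressure-gradient form
\frac{d\mathbf{v}_a}{dt} =
-\sum_b m_b \left( \frac{p_a}{\rho_a^2} + \frac{p_b}{\rho_b^2} + \Pi_{ab} \right) \nabla_a W_{ab}
+ \mathbf{g}
```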

Fluid-Solid Interaction (ongoing work)
Boundary Condition Enforcing (BCE) markers for the no-slip condition:
- Rigidly attached to the solid body, hence their velocities are those of the corresponding material points on the solid (a minimal sketch of this kinematic update follows below)
- Hydrodynamic properties are taken from the fluid
[Figure: example representation of BCE markers for rigid bodies/walls and for flexible bodies]
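A minimal sketch of the kinematic update implied by rigid attachment, i.e. that a BCE marker's velocity equals that of the corresponding material point on the solid; the types and function names here are illustrative, not the Chrono::FSI API.

```cpp
// Hypothetical sketch (not the Chrono::FSI API): velocity of a BCE marker rigidly
// attached to a body, evaluated from the body's rigid-body state.
#include <array>

using Vec3 = std::array<double, 3>;

static Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a[1]*b[2] - a[2]*b[1],
            a[2]*b[0] - a[0]*b[2],
            a[0]*b[1] - a[1]*b[0]};
}

// v_marker = v_cm + omega x (x_marker - x_cm)
Vec3 bceMarkerVelocity(const Vec3& xMarker, const Vec3& xCM,
                       const Vec3& vCM, const Vec3& omega) {
    Vec3 r = {xMarker[0] - xCM[0], xMarker[1] - xCM[1], xMarker[2] - xCM[2]};
    Vec3 w = cross(omega, r);
    return {vCM[0] + w[0], vCM[1] + w[1], vCM[2] + w[2]};
}
```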

Current SPH Model
- Runge-Kutta 2nd-order time integration; requires the force calculation to happen twice per step (a minimal sketch of such a midpoint step follows below)
- Wall boundary: density changes for boundary particles just as it does for the fluid particles
- Periodic boundary condition: markers that exit through the periodic boundary enter from the other side
[Figure: periodic boundary setup showing fluid, boundary, and ghost markers]
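A minimal sketch, under assumed data structures, of a 2nd-order midpoint step that evaluates forces twice per step and applies the periodic re-entry rule; Marker, computeForces, and the domain length L are illustrative names, and the real SPH pairwise sums are omitted.

```cpp
#include <cmath>
#include <vector>

struct Marker { double x, v, f, rho, drho; };   // one spatial dimension, for brevity

// Stand-in for the SPH force/density-rate evaluation (pairwise kernel sums omitted)
void computeForces(std::vector<Marker>& m) {
    for (auto& p : m) { p.f = 0.0; p.drho = 0.0; /* real SPH sums go here */ }
}

void rk2Step(std::vector<Marker>& m, double dt, double L) {
    std::vector<Marker> half = m;

    // First force evaluation: advance a copy of the markers to the half step
    computeForces(m);
    for (size_t i = 0; i < m.size(); ++i) {
        half[i].v   = m[i].v   + 0.5 * dt * m[i].f;
        half[i].x   = m[i].x   + 0.5 * dt * m[i].v;
        half[i].rho = m[i].rho + 0.5 * dt * m[i].drho;
    }

    // Second force evaluation, using marker properties at the half step
    computeForces(half);
    for (size_t i = 0; i < m.size(); ++i) {
        m[i].v   += dt * half[i].f;
        m[i].x   += dt * half[i].v;
        m[i].rho += dt * half[i].drho;
        // Periodic boundary: markers that exit re-enter from the other side
        m[i].x = std::fmod(std::fmod(m[i].x, L) + L, L);
    }
}
```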

Challenges for Scalable Distributed Memory Codes
- SPH is a computationally expensive method; hence, high performance computing (HPC) is necessary.
- High performance computing is hard: MPI codes are able to achieve good strong and weak scaling, but the developer is in charge of making this happen.
- Distributed memory challenges:
  - Communication bottlenecks > computation bottlenecks
  - Load imbalance
  - Heterogeneity: processor types, process variation, memory hierarchies, etc.
  - Power/temperature (becoming increasingly important)
  - Fault tolerance
- To deal with these, we seek neither full automation nor full burden on the application developer, but a good division of labor between the system and the application developers.

Solution: Charm++
- Charm++ is a generalized approach to writing parallel programs
  - An alternative to the likes of MPI, UPC, GA, etc., but not to sequential languages such as C, C++, and Fortran
- Represents: the style of writing parallel programs, the runtime system, and the entire ecosystem that surrounds it
- Three design principles: overdecomposition, migratability, asynchrony

Charm++ Design Principles
- Overdecomposition: decompose work and data units into many more pieces than processing elements (cores, nodes, ...). Not so hard: problem decomposition needs to be done anyway.
- Migratability: allow data/work units to be migratable (by the runtime and the programmer). Communication is addressed to logical units (C++ objects) as opposed to physical units, so the runtime system must keep track of these units.
- Asynchrony: message-driven execution. Let the work unit that happens to have data (a message) available execute next; the runtime selects which work unit executes next (the user can influence this scheduling).

Realization of the design principles in Charm++
- Overdecomposed entities: chares
  - Chares are C++ objects with methods designated as entry methods, which can be invoked asynchronously by remote chares
  - Chares are organized into indexed collections; each collection may have its own indexing scheme (1D, ..., 7D, sparse, bitvector or string as an index)
- Chares communicate via asynchronous invocations of entry methods: A[i].foo(...), where A is the name of a collection and i is the index of the particular chare (a minimal sketch follows below)
- It is a kind of task-based parallelism: a pool of tasks + a pool of workers; the runtime system selects what executes next
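To make the mechanics concrete, here is a minimal Charm++ sketch of a 1D chare array with an asynchronous entry method. The module, class, and method names (cellmod, Cell, foo) are illustrative, not taken from CharmSPH; only standard Charm++ machinery (.ci interface files, CBase_ inheritance, CProxy_ handles) is assumed.

```cpp
// cell.ci (Charm++ interface file), shown here as comments:
//   mainmodule cellmod {
//     mainchare Main { entry Main(CkArgMsg* m); };
//     array [1D] Cell {
//       entry Cell();
//       entry void foo(int step);   // asynchronous entry method
//     };
//   };

// cell.C
#include "cellmod.decl.h"

class Cell : public CBase_Cell {
public:
    Cell() {}
    Cell(CkMigrateMessage*) {}          // migration constructor (chares are migratable)
    void foo(int step) {                // runs on whatever PE currently hosts this chare
        CkPrintf("Cell %d working on step %d\n", thisIndex, step);
    }
};

class Main : public CBase_Main {
public:
    Main(CkArgMsg* m) {
        delete m;
        CProxy_Cell cells = CProxy_Cell::ckNew(16);  // 16 chares, placed by the runtime
        cells[3].foo(0);   // A[i].foo(...): non-blocking, message-driven invocation
        cells.foo(1);      // broadcast to every element of the collection
        // A real program would exit after a reduction or quiescence detection;
        // this sketch exits immediately.
        CkExit();
    }
};

#include "cellmod.def.h"
```

Building such a program would go through the charmc wrapper (processing the .ci file first, then compiling and linking the C++ source).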

Charm-based Parallel Model for SPH
- Hybrid decomposition (domain + force), inspired by NAMD (a molecular dynamics application)
- Domain decomposition: a 3D Cell chare array; each cell contains fluid/boundary/solid particles (the data units), indexed (x, y, z)
- Force decomposition: a 6D Compute chare array; each compute chare is associated with a pair of cells (the work units), indexed (x1, y1, z1, x2, y2, z2)
- No need to sort particles to find neighbor particles (overdecomposition implicitly takes care of it)
- Similar decomposition to LeanMD, the Charm++ molecular dynamics mini-app (an illustrative pairing sketch follows below)
Kale, et al., "Charm++ for productivity and performance," PPL Technical Report, 2011.
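As an illustration of the force decomposition, the sketch below enumerates the cell-pair indices for which compute chares would exist for one cell, assuming a LeanMD-style pairing of each cell with itself and with half of its 26 neighbors under periodic wrapping. This pairing rule is an assumption made for illustration, not necessarily the exact CharmSPH scheme.

```cpp
// Each emitted 6-tuple would index one element of the 6D Compute chare array.
#include <array>
#include <vector>

using Cell6 = std::array<int, 6>;  // (x1, y1, z1, x2, y2, z2)

std::vector<Cell6> computesForCell(int x, int y, int z, int nx, int ny, int nz) {
    std::vector<Cell6> pairs;
    pairs.push_back({x, y, z, x, y, z});              // self-interaction compute
    for (int dx = -1; dx <= 1; ++dx)
        for (int dy = -1; dy <= 1; ++dy)
            for (int dz = -1; dz <= 1; ++dz) {
                if (dx == 0 && dy == 0 && dz == 0) continue;
                // Keep only one orientation of each pair so it is created exactly once
                if (std::array<int,3>{dx, dy, dz} < std::array<int,3>{0, 0, 0}) continue;
                int x2 = (x + dx + nx) % nx;           // periodic neighbor indexing
                int y2 = (y + dy + ny) % ny;
                int z2 = (z + dz + nz) % nz;
                pairs.push_back({x, y, z, x2, y2, z2});
            }
    return pairs;   // 1 self compute + 13 pair computes per cell
}
```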

Algorithm (Charm-based SPH)
1. Initialize each Cell chare (very small subdomains)
2. For each subdomain, create the corresponding Compute chares
The following instructions happen in parallel for each Cell/Compute chare, every time step:
Cell array loop:
3. Send positions to each associated Compute chare
6. Reduce forces from each Compute chare
7. When the forces are reduced, update marker properties at the half step
Compute array loop:
4. When calcForces fires: SelfInteract or Interact
5. Send resulting forces back to the cells
Repeat steps 3-7, but calculate forces with marker properties at the half step
8. Migrate particles to neighboring cells
9. Load balance every n steps
(a structural sketch of this per-step message flow follows below)
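A structural sketch of the per-step message flow described above, written as plain C++ with the message-driven parts indicated in comments; all class, method, and proxy names here are illustrative rather than the actual CharmSPH entry methods.

```cpp
// Hedged structural sketch (not the CharmSPH source).
class Cell /* : public CBase_Cell */ {
    // step 3: send this cell's marker positions to its associated Compute chares
    void sendPositions() {
        // for each associated compute c: computeProxy[c].calcForces(myMarkers);
    }
    // steps 6-7: target of the force reduction from the computes
    void reduceForces(/* summed per-marker forces */) {
        // integrate markers to the half step, then re-send half-step properties
        // to the same computes for the second force pass (repeat steps 3-7)
    }
    // step 8: after the full step, hand off markers that left this subdomain
    void migrateParticles() { /* cellProxy[neighbor].receiveMarkers(...); */ }
    // step 9: every n steps, let the runtime rebalance chares (e.g. via AtSync)
};

class Compute /* : public CBase_Compute */ {
    // step 4: fires once positions from both cells of this pair have arrived
    void calcForces(/* positions from cell 1 and cell 2 */) {
        // SelfInteract (same cell) or Interact (cell pair): SPH pairwise sums
        // step 5: contribute the resulting forces to a reduction targeting the cells
    }
};
```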

Charm-based Parallel Model for FSI (ongoing work)
- Particles representing the solid are contained together with the fluid and boundary particles
- Solid chare array (1D array)
- Particles keep track of the index of the solid they are associated with
- Once the computes are done, they send a message (invoke an entry method) to each solid they have particles of, do a force reduction, and calculate the dynamics of the solid (a minimal sketch follows below)
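A hedged sketch (not the CharmSPH code) of the per-solid force contribution described above: after a compute finishes its SPH sums, it accumulates the hydrodynamic force on each solid it touched and sends one message per solid. The asynchronous Solid-chare entry method is abstracted here as a callback; all names are illustrative.

```cpp
#include <array>
#include <functional>
#include <map>
#include <vector>

using Vec3 = std::array<double, 3>;
struct MarkerForce { int solidIndex; Vec3 f; };   // solidIndex < 0: fluid/boundary marker

void contributeToSolids(const std::vector<MarkerForce>& forces,
                        const std::function<void(int, const Vec3&)>& sendToSolid) {
    std::map<int, Vec3> perSolid;                 // force accumulated per solid index
    for (const auto& mf : forces)
        if (mf.solidIndex >= 0)
            for (int k = 0; k < 3; ++k) perSolid[mf.solidIndex][k] += mf.f[k];
    for (const auto& s : perSolid)
        sendToSolid(s.first, s.second);           // one asynchronous message per solid,
                                                  // e.g. solidArray[i].addFluidForce(f)
}
```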

Charm++ in Practice
Achieving optimal decomposition granularity:
- Average number of markers allowed per subdomain = amount of work per chare; make sure there is enough work to hide communication
- Far too many chare objects is not optimal: memory + scheduling overheads
Hyperparameter search:
- Vary cell size: changes the total number of cells and computes (a granularity sketch follows below)
- Vary Charm++ nodes per physical node: feed the communication network at the maximum rate. This varies the number of communication and scheduling threads per node and is system specific: small clusters might only need a single Charm++ node per physical node (one communication thread), but larger clusters with different configurations might need more.
[Table: average time per time step for cell sizes 2h, 4h, 8h vs. Charm++ nodes per physical node; launch commands used:]
aprun -n 8 -N 1 -d 32 ./charmsph +ppn 31 +commap 0 +pemap 1-31
aprun -n 16 -N 2 -d 16 ./charmsph +ppn 15 +commap 0,16 +pemap 1-15:17-31
aprun -n 32 -N 4 -d 8 ./charmsph +ppn 7 +commap 0,8,16,24 +pemap 1-7:9-15:17-23:25-31
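To illustrate how the cell-size choice drives decomposition granularity, a small sketch under assumed values: a cubic domain of edge length L, an interaction radius h, and the 14-computes-per-cell pairing assumed in the earlier sketch. The numbers and parameters are illustrative, not CharmSPH results.

```cpp
#include <cmath>
#include <cstdio>

int main() {
    const double L = 1.0;      // domain edge length (assumed)
    const double h = 0.01;     // SPH interaction radius (assumed)
    for (double factor : {2.0, 4.0, 8.0}) {        // cell sizes tried: 2h, 4h, 8h
        double cellSize = factor * h;
        long n = static_cast<long>(std::floor(L / cellSize));  // cells per dimension
        long cells = n * n * n;                    // 3D Cell chare array elements
        long computes = cells * 14;                // 6D Compute chare array elements
        std::printf("cell = %.0fh: %ld^3 cells -> %ld cells, %ld computes\n",
                    factor, n, cells, computes);
    }
    return 0;
}
```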

Results: Hyperparameter Search
Hyperparameter search for optimal cell size and Charm++ nodes per physical node.
[Plot legend: Nodes denotes physical nodes (64 processors per node); h denotes the interaction radius of the SPH particles; PE denotes a Charm++ node (equivalent to an MPI rank).]

Results: Strong Scaling
Speedups calculated with respect to an 8-core run (8-504 cores).

Results: Dam-break Simulation
Figure 3: Dam-break simulation (139,332 SPH markers).
Note: plain SPH requires hand tuning for stability.

Future Work (a lot to do)
- Improve the current SPH model, following the same communication patterns for the kernel calculations:
  - Density re-initialization
  - Generalized wall boundary condition
    - Adami, S., X. Y. Hu, and N. A. Adams. "A generalized wall boundary condition for smoothed particle hydrodynamics." Journal of Computational Physics 231.21 (2012): 7057-7075.
    - Pazouki, A., B. Song, and D. Negrut. "Technical Report TR-2015-09." (2015).
  - Validation
- Hyperparameter search and scaling results on larger clusters
  - Some bugs in HPC codes only appear at 1,000+ or 10,000+ cores
  - Performance and scaling comparison against other distributed-memory SPH codes
- Fluid-Solid Interaction
  - A. Pazouki, R. Serban, and D. Negrut. "A Lagrangian-Lagrangian framework for the simulation of rigid and deformable bodies in fluid." Multibody Dynamics: Computational Methods and Applications, ISBN: 9783319072593, Springer, 2014.

Thank you! Questions?
Code available at: https://github.com/uwsbel/charmsph