Lattice Boltzmann methods on the way to exascale


Lattice Boltzmann methods on the way to exascale Ulrich Rüde LSS Erlangen and CERFACS Toulouse ulrich.ruede@fau.de Centre Européen de Recherche et de Formation Avancée en Calcul Scientifique www.cerfacs.fr Lehrstuhl für Simulation Universität Erlangen-Nürnberg www10.informatik.uni-erlangen.de 1

Outline
Motivation
Building Blocks for the Direct Simulation of Complex Flows:
1. Supercomputing: scalable algorithms, efficient software
2. Solid phase: rigid body dynamics
3. Fluid phase: lattice Boltzmann method
4. Electrostatics: finite volumes
5. Fast implicit solvers: multigrid
6. Gas phase: free surface tracking, volume of fluids
Multi-physics applications, coupling, examples, perspectives 2

Motivation Simulating Additive Manufacturing Bikas, H., Stavropoulos, P., & Chryssolouris, G. (2015). Additive manufacturing methods and modelling approaches: a critical review. The International Journal of Advanced Manufacturing Technology, 1-17. Klassen, A., Scharowsky, T., & Körner, C. (2014). Evaporation model for beam based additive manufacturing using free surface lattice Boltzmann methods. Journal of Physics D: Applied Physics, 47(27), 275303. 3

Electron Beam Melting Process (3D printing). EU project FastEBM: ARCAM (Gothenburg), TWI (Cambridge), FAU Erlangen. Generation of the powder bed. Energy transfer by electron beam: penetration depth, heat transfer. Flow dynamics: melting, melt flow, surface tension, wetting, capillary forces, contact angles, solidification. Ammer, R., Markl, M., Ljungblad, U., Körner, C., & UR (2014). Simulating fast electron beam melting with a parallel thermal free surface lattice Boltzmann method. Computers & Mathematics with Applications, 67(2), 318-330. Ammer, R., UR, Markl, M., Jüchter V., & Körner, C. (2014). Validation experiments for LBM simulations of electron beam melting. International Journal of Modern Physics C. 4

Building Block I: Current and Future High Performance Supercomputers. SuperMUC: 3 PFlops. 5

Multi-PetaFlops Supercomputers
Sunway TaihuLight: SW26010 processor, 10,649,600 cores, 260 cores (1.45 GHz) per node, 32 GiB RAM per node, 125 PFlops peak, power consumption 15.37 MW, TOP500: #1
JUQUEEN: Blue Gene/Q architecture, 458,752 PowerPC A2 cores, 16 cores (1.6 GHz) per node, 16 GiB RAM per node, 5D torus interconnect, 5.8 PFlops peak, TOP500: #13
SuperMUC (phase 1): Intel Xeon architecture, 147,456 cores, 16 cores (2.7 GHz) per node, 32 GiB RAM per node, pruned tree interconnect, 3.2 PFlops peak, TOP500: #27
What is the problem?

Only optimal algorithms qualify (TERRA)
Thought experiment with an O(N^2) algorithm; energy estimate assumes 1 nJ per operation and N^2 all-to-all communication:
computer generation      | desired problem size DoF = N | energy estimate (N^2 all-to-all)                    | TerraNeo prototype
gigascale (10^9 FLOPS)   | 10^6                         | 0.278 Wh (10 min of LED light)                      | 0.13 Wh
terascale (10^12 FLOPS)  | 10^9                         | 278 kWh (2 weeks of blow drying hair)               | 0.03 kWh
petascale (10^15 FLOPS)  | 10^12                        | 278 GWh (1 month of electricity for Hamburg)        | 27 kWh
exascale (10^18 FLOPS)   | 10^15                        | 278 PWh (100 years of world electricity production) | ?
At extreme scale: optimal complexity is a must! 7
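
To make the slide's estimate concrete, here is a small back-of-the-envelope calculation reproducing the table's energy column under the same assumption of 1 nJ per pairwise operation (an illustrative sketch, not code from the talk):

```python
# Energy estimate for an O(N^2) all-to-all algorithm at 1 nJ per operation.
ENERGY_PER_OP_J = 1e-9   # 1 nanojoule, as assumed on the slide
J_PER_KWH = 3.6e6

for scale, n in [("gigascale", 1e6), ("terascale", 1e9),
                 ("petascale", 1e12), ("exascale", 1e15)]:
    energy_j = ENERGY_PER_OP_J * n**2          # N^2 pairwise interactions
    energy_kwh = energy_j / J_PER_KWH
    print(f"{scale:10s}  N = {n:.0e}  ->  {energy_kwh:.3g} kWh")
```

Running this gives 0.278 Wh, 278 kWh, 278 GWh, and 278 PWh, matching the table above.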

Building block II: The Lagrangian View: granular media simulations with the physics engine. 1 250 000 spherical particles, 256 processors, 300 300 time steps, runtime: 48 h (including data output), texture mapping, ray tracing. Pöschel, T., & Schwager, T. (2005). Computational granular dynamics: models and algorithms. Springer Science & Business Media. 8

Lagrangian Particle Representation
A single particle is described by state variables (position x, orientation φ, translational and angular velocity v and ω), a parameterization of its shape S (e.g. geometric primitive, composite object, or mesh), and its inertia properties (mass m, principal moments of inertia I_xx, I_yy and I_zz). The Newton-Euler equations of motion for rigid bodies describe the rate of change of the state variables:
$$\frac{d}{dt}\begin{pmatrix} x(t) \\ \varphi(t) \end{pmatrix} = \begin{pmatrix} v(t) \\ Q(\varphi(t))\,\omega(t) \end{pmatrix}, \qquad M(\varphi(t))\,\frac{d}{dt}\begin{pmatrix} v(t) \\ \omega(t) \end{pmatrix} = \begin{pmatrix} f(s(t),t) \\ \tau(s(t),t) - \omega(t) \times I(\varphi(t))\,\omega(t) \end{pmatrix}$$
9
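
As an illustration of this state update (my own minimal sketch, not pe code), an explicit Euler step of the Newton-Euler equations for a single rigid body with a diagonal body-frame inertia tensor; the helper name and the simple integrator are assumptions:

```python
import numpy as np

def newton_euler_step(x, q_rot, v, w, m, I_body, f, tau, dt):
    """One explicit Euler step of the Newton-Euler equations for one rigid body.

    x: position (3,), q_rot: rotation matrix (3,3) playing the role of Q(phi),
    v, w: translational / angular velocity, I_body: principal moments (3,).
    Production codes use quaternions and better integrators; this only shows
    the structure of the equations of motion.
    """
    I_world = q_rot @ np.diag(I_body) @ q_rot.T      # inertia in the world frame
    x_new = x + dt * v                               # dx/dt = v
    # d(phi)/dt = Q(phi) w  ->  dR/dt = [w]_x R  (skew-symmetric matrix of w)
    W = np.array([[0, -w[2], w[1]], [w[2], 0, -w[0]], [-w[1], w[0], 0]])
    q_new = q_rot + dt * W @ q_rot
    v_new = v + dt * f / m                           # m dv/dt = f
    w_new = w + dt * np.linalg.solve(I_world, tau - np.cross(w, I_world @ w))
    return x_new, q_new, v_new, w_new

# One step for a unit sphere (mass 1, I = 2/5 m r^2) falling under gravity:
state = newton_euler_step(np.zeros(3), np.eye(3), np.zeros(3), np.zeros(3),
                          m=1.0, I_body=np.full(3, 0.4),
                          f=np.array([0.0, 0.0, -9.81]), tau=np.zeros(3), dt=1e-3)
```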

Hard Contacts: an alternative to the discrete element method
Hard contacts require impulses, exhibit non-differentiable but continuous trajectories, have contact reactions that are defined implicitly and, in general, non-unique, and can be solved numerically by methods from two classes; the mathematical setting is that of measure differential inclusions.
Fig.: Bouncing ball with a soft and a hard contact model.
Moreau, J., Panagiotopoulos P. (1988): Nonsmooth mechanics and applications, vol 302. Springer, Wien-New York. Popa, C., Preclik, T., & UR (2014). Regularized solution of LCP problems with application to rigid body dynamics. Numerical Algorithms, 1-12. Preclik, T. & UR (2015). Ultrascale simulations of non-smooth granular dynamics; Computational Particle Mechanics, DOI: 10.1007/s40571-015-0047-6. 10

Newton-Euler Equations for Rigid Bodies
$$\frac{d}{dt}\begin{pmatrix} x(t) \\ \varphi(t) \end{pmatrix} = \begin{pmatrix} v(t) \\ Q(\varphi(t))\,\omega(t) \end{pmatrix}, \qquad M(\varphi(t))\,\frac{d}{dt}\begin{pmatrix} v(t) \\ \omega(t) \end{pmatrix} = \begin{pmatrix} f(s(t),t) \\ \tau(s(t),t) - \omega(t) \times I(\varphi(t))\,\omega(t) \end{pmatrix}$$
Contact detection by minimizing the signed distance function. Time-continuous non-penetration constraint for hard contacts, $0 \le \xi(t) \perp \lambda_n(t) \ge 0$, and Coulomb friction. Parallelization via domain partitioning, parallel data structures for contact detection, protocol for synchronization. 11

Nonlinear Complementarity and Time Stepping
Non-penetration and Coulomb friction conditions, stated for continuous forces, for impulses, and in discrete time:
Signorini condition (non-penetration): $0 \le \xi \perp \lambda_n \ge 0$
Impact law (impulse level, at closed contacts $\xi = 0$): $0 \le v_n^+ \perp \Lambda_n \ge 0$
Friction cone condition: $\|\lambda_{to}\|_2 \le \mu\,\lambda_n$
Frictional reactions oppose slip: $\|v_{to}^+\|_2\,\lambda_{to} = -\mu\,\lambda_n\,v_{to}^+$
Discrete-time counterpart: $0 \le \tfrac{\xi}{\Delta t} + v_n'(\lambda) \perp \lambda_n \ge 0$, $\|\lambda_{to}\|_2 \le \mu\,\lambda_n$, $\|v_{to}'(\lambda)\|_2\,\lambda_{to} = -\mu\,\lambda_n\,v_{to}'(\lambda)$
Moreau, J., Panagiotopoulos P. (1988): Nonsmooth mechanics and applications, vol 302. Springer, Wien-New York. Popa, C., Preclik, T., & UR (2014). Regularized solution of LCP problems with application to rigid body dynamics. Numerical Algorithms, 1-12. Preclik, T. & UR (2015). Ultrascale simulations of non-smooth granular dynamics; Computational Particle Mechanics, DOI: 10.1007/s40571-015-0047-6. 12
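
To illustrate how such complementarity problems can be relaxed iteratively, here is a tiny projected Gauss-Seidel sketch for the frictionless special case (pure normal contacts, a linear complementarity problem). This is my own illustration, not the pe solver; the matrix and vector in the example are made-up toy numbers:

```python
import numpy as np

def projected_gauss_seidel(A, b, iters=100):
    """Solve the LCP  0 <= lam  perp  A lam + b >= 0  by projected Gauss-Seidel.

    For frictionless contacts, A is the Delassus (contact-space) matrix and
    b contains the predicted normal contact velocities.
    """
    lam = np.zeros_like(b)
    for _ in range(iters):
        for i in range(len(b)):
            residual = b[i] + A[i] @ lam - A[i, i] * lam[i]   # all other contacts
            lam[i] = max(0.0, -residual / A[i, i])            # project onto lam_i >= 0
    return lam

# Two contacts of a body resting on the ground (toy numbers):
A = np.array([[2.0, 0.5], [0.5, 2.0]])
b = np.array([-1.0, -1.0])            # negative = contacts are approaching
print(projected_gauss_seidel(A, b))   # non-negative contact impulses
```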

Regularizing Multi-Contact Problems 13

Parallel Computation
Key features of the parallelization: domain partitioning, distribution of data, contact detection, synchronization protocol, subdomain NBGS (nonlinear block Gauss-Seidel) with accumulators and corrections, aggressive message aggregation, nearest-neighbor communication; see the sketch below. Iglberger, K., & UR (2010). Massively parallel granular flow simulations with non-spherical particles. Computer Science-Research and Development, 25(1-2), 105-113. Iglberger, K., & UR (2011). Large-scale rigid body simulations. Multibody System Dynamics, 25(1), 81-95. 14
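
For illustration, a minimal sketch (not pe code) of how a regular domain partitioning maps a particle's position to the rank of the owning subdomain; the process grid layout and the linearization of ranks are assumptions of this sketch:

```python
import numpy as np

def owner_rank(pos, domain_min, domain_max, procs):
    """Map a position to the rank of the owning subdomain on a regular
    procs = (Px, Py, Pz) process grid; the rank is the linearized index."""
    rel = (np.asarray(pos) - domain_min) / (domain_max - domain_min)
    ijk = np.minimum((rel * procs).astype(int), np.array(procs) - 1)
    return int(ijk[0] + procs[0] * (ijk[1] + procs[1] * ijk[2]))

# Example: a 4x4x2 process grid on the unit cube
procs = (4, 4, 2)
dmin, dmax = np.zeros(3), np.ones(3)
print(owner_rank([0.1, 0.7, 0.9], dmin, dmax, procs))
```

Particles near a subdomain boundary additionally exist as shadow copies on the neighboring ranks, which is what the synchronization protocol keeps consistent.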

Shaker scenario with sharp-edged hard objects: 864 000 sharp-edged particles with a diameter between 0.25 mm and 2 mm. 15

PE marble run - rigid objects in complex geometry Animation by Sebastian Eibl and Christian Godenschwager 16

Scaling Results
The solver is algorithmically not optimal for dense systems, hence cannot scale unconditionally, but it is highly efficient in many cases of practical importance. Strong and weak scaling results for a constant number of iterations were obtained on SuperMUC and JUQUEEN. Largest ensembles computed: 2.8 × 10^10 non-spherical particles, 1.1 × 10^10 contacts. Granular gas scaling results and a breakdown of compute times on the Erlangen RRZE cluster Emmy.
[Figures: weak-scaling graph on the JUQUEEN supercomputer (average time per time step and 1000 particles, parallel efficiency); time-step profiles of the granular gas executed with 5 × 2 × 2 = 20 processes on a single node and with 8 × 8 × 5 = 320 processes on 16 nodes.] 17

Building Block III: Scalable Flow Simulations with the Lattice Boltzmann Method Lallemand, P., & Luo, L. S. (2000). Theory of the lattice Boltzmann method: Dispersion, dissipation, isotropy, Galilean invariance, and stability. Physical Review E, 61(6), 6546. Feichtinger, C., Donath, S., Köstler, H., Götz, J., & Rüde, U. (2011). WaLBerla: HPC software design for computational engineering simulations. Journal of Computational Science, 2(2), 105-112. 18

The Eulerian View: Lattice Boltzmann. Discretization in squares or cubes (cells). Particle distribution functions (PDFs): in 2D, 9 numbers (D2Q9); in 3D, D3Q19 (alternatives: D3Q27, etc.). Repeat (many times): stream, collide. 19

Basic Lattice Boltzmann Method: single relaxation time (SRT), macroscopic quantities, equilibrium distribution function.
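
The slide's formulas were not captured in the transcription; for reference, the standard SRT/BGK relations it refers to are (in lattice units):

```latex
% Lattice BGK (single relaxation time) update for the PDFs f_i:
f_i(\mathbf{x} + \mathbf{c}_i \Delta t,\, t + \Delta t)
  = f_i(\mathbf{x}, t)
  - \frac{\Delta t}{\tau}\,\bigl( f_i(\mathbf{x}, t) - f_i^{\mathrm{eq}}(\rho, \mathbf{u}) \bigr)

% Macroscopic quantities:
\rho = \sum_i f_i, \qquad \rho\,\mathbf{u} = \sum_i \mathbf{c}_i\, f_i

% Equilibrium distribution (low-Mach expansion, lattice weights w_i, c_s^2 = 1/3):
f_i^{\mathrm{eq}} = w_i\, \rho \left( 1
  + \frac{\mathbf{c}_i \cdot \mathbf{u}}{c_s^2}
  + \frac{(\mathbf{c}_i \cdot \mathbf{u})^2}{2 c_s^4}
  - \frac{\mathbf{u} \cdot \mathbf{u}}{2 c_s^2} \right)
```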

The stream step: move PDFs into neighboring cells. Non-local part: linear propagation to neighbors (stream step). Local part: non-linear operator (collide step). 21

The collide step: compute new PDFs modeling molecular collisions. Most collision operators can be expressed as a relaxation towards an equilibrium distribution. Equilibrium function: non-linear, depending on the conserved moments (density ρ and velocity u). 22
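
A minimal NumPy sketch of one D2Q9 BGK stream-collide cycle on a periodic grid, just to illustrate the two steps above (this is my own illustration, not waLBerla code; grid size and relaxation time are arbitrary):

```python
import numpy as np

# D2Q9 lattice: velocities c_i and weights w_i
c = np.array([[0,0],[1,0],[0,1],[-1,0],[0,-1],[1,1],[-1,1],[-1,-1],[1,-1]])
w = np.array([4/9] + [1/9]*4 + [1/36]*4)
tau, nx, ny = 0.8, 64, 64

def equilibrium(rho, u):
    cu = np.einsum('id,xyd->xyi', c, u)              # c_i . u
    usq = np.einsum('xyd,xyd->xy', u, u)
    return w * rho[..., None] * (1 + 3*cu + 4.5*cu**2 - 1.5*usq[..., None])

f = equilibrium(np.ones((nx, ny)), np.zeros((nx, ny, 2)))   # start at rest

for _ in range(10):
    # collide: relax towards equilibrium (local, non-linear)
    rho = f.sum(axis=-1)
    u = np.einsum('xyi,id->xyd', f, c) / rho[..., None]
    f += -(f - equilibrium(rho, u)) / tau
    # stream: shift each PDF to the neighbor it points to (non-local, linear)
    for i, (cx, cy) in enumerate(c):
        f[..., i] = np.roll(np.roll(f[..., i], cx, axis=0), cy, axis=1)

print(f.sum())   # total mass is conserved
```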

Performance on Coronary Arteries
Geometry; color-coded process assignment. Godenschwager, C., Schornbaum, F., Bauer, M., Köstler, H., & UR (2013). A framework for hybrid parallel flow simulations with a trillion cells in complex geometries. In Proceedings of SC13: International Conference for High Performance Computing, Networking, Storage and Analysis (p. 35). ACM.
Weak scaling: 458,752 cores of JUQUEEN, over a trillion (10^12) fluid lattice cells, cell size 1.27 µm (diameter of red blood cells: 7 µm), 2.1 × 10^12 cell updates per second, 0.41 PFlops.
Strong scaling: 32,768 cores of SuperMUC, cell size 0.1 mm, 2.1 million fluid cells, 6000+ time steps per second.

Single Node Performance
[Chart: single-node LBM kernel performance on JUQUEEN and SuperMUC for standard, optimized, and vectorized variants.]
Pohl, T., Deserno, F., Thürey, N., UR, Lammers, P., Wellein, G., & Zeiser, T. (2004). Performance evaluation of parallel large-scale lattice Boltzmann applications on three supercomputing architectures. Proceedings of the 2004 ACM/IEEE conference on Supercomputing (p. 21). IEEE Computer Society. Donath, S., Iglberger, K., Wellein, G., Zeiser, T., Nitsure, A., & UR (2008). Performance comparison of different parallel lattice Boltzmann implementations on multi-core multi-socket systems. International Journal of Computational Science and Engineering, 4(1), 3-11.
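
LBM kernels are typically memory-bandwidth bound; a rough estimate (my own sketch, not a figure from the talk) counts 19 PDF reads plus 19 writes per cell update for D3Q19 in double precision, with an assumed sustained node bandwidth:

```python
# Rough bandwidth-bound estimate of D3Q19 LBM node performance (double precision).
bytes_per_cell_update = 19 * 2 * 8      # 19 PDFs read + 19 written, 8 bytes each
node_bandwidth_gb_s = 80.0              # assumed sustained memory bandwidth (GB/s)

mlups = node_bandwidth_gb_s * 1e9 / bytes_per_cell_update / 1e6
print(f"~{mlups:.0f} MLUPS per node (million lattice-cell updates per second)")
```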

Weak scaling for TRT, lid-driven cavity, uniform grids. JUQUEEN: 16 processes per node, 4 threads per process, 2.1 × 10^12 cell updates per second (2.1 TLUPS). SuperMUC: 4 processes per node, 4 threads per process, 0.837 × 10^12 cell updates per second (0.837 TLUPS). Körner, C., Pohl, T., UR, Thürey, N., & Zeiser, T. (2006). Parallel lattice Boltzmann methods for CFD applications. In Numerical Solution of Partial Differential Equations on Parallel Computers (pp. 439-466). Springer Berlin Heidelberg. Feichtinger, C., Habich, J., Köstler, H., UR, & Aoki, T. (2015). Performance modeling and analysis of heterogeneous lattice Boltzmann simulations on CPU GPU clusters. Parallel Computing, 46, 1-13.

Partitioning and Parallelization: static block-level refinement (forest of octrees), static load balancing, compact (KiB/MiB) binary MPI IO to and from disk, allocation of block data (grids), separation of domain partitioning from simulation (optional). 26

Flow through structure of thin crystals (filter) work with Jose Pedro Galache and Antonio Gil CMT-Motores Termicos, Universitat Politecnica de Valencia 27

Parallel AMR load balancing: different views on domain partitioning. 2:1 balanced grid (used for the LBM). Distributed graph: nodes = blocks, edges explicitly stored as <block ID, process rank> pairs. Forest of octrees: octrees are not explicitly stored, but implicitly defined via block IDs (see the sketch below). 28
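
To illustrate the idea of encoding the octree position implicitly in a block ID, here is a small Morton-style sketch; it conveys the concept, while the actual encoding used in waLBerla differs in its details:

```python
def block_id(level, ix, iy, iz):
    """Encode a block's octree position as an integer: interleave the bits of
    its (ix, iy, iz) index on the given refinement level behind a marker bit,
    so parent/child relations become simple bit operations."""
    bid = 1                                   # marker bit = root of the octree
    for bit in reversed(range(level)):
        octant = (((ix >> bit) & 1)
                  | (((iy >> bit) & 1) << 1)
                  | (((iz >> bit) & 1) << 2))
        bid = (bid << 3) | octant             # descend one level
    return bid

def parent(bid):
    return bid >> 3                           # drop the last octant -> parent block

b = block_id(level=2, ix=3, iy=1, iz=0)       # a block on refinement level 2
print(bin(b), bin(parent(b)))
```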

AMR and Load Balancing with walberla Isaac, T., Burstedde, C., Wilcox, L. C., & Ghattas, O. (2015). Recursive algorithms for distributed forests of octrees. SIAM Journal on Scientific Computing, 37(5), C497-C531. Meyerhenke, H., Monien, B., & Sauerwald, T. (2009). A new diffusion-based multilevel algorithm for computing graph partitions. Journal of Parallel and Distributed Computing, 69(9), 750-761. Schornbaum, F., & Rüde, U. (2016). Massively Parallel Algorithms for the Lattice Boltzmann Method on Nonuniform Grids. SIAM Journal on Scientific Computing, 38(2), C96-C126. 29

AMR Performance
Benchmark environments:
JUQUEEN (5.0 PFLOP/s): Blue Gene/Q, 459K cores, 1 GB/core, compiler: IBM XL / IBM MPI
SuperMUC (2.9 PFLOP/s): Intel Xeon, 147K cores, 2 GB/core, compiler: Intel XE / IBM MPI
Benchmark (LBM D3Q19 TRT), avg. blocks/process (max. blocks/process):
level | initially  | after refresh | after load balance
0     | 0.383 (1)  | 0.328 (1)     | 0.328 (1)
1     | 0.656 (1)  | 0.875 (9)     | 0.875 (1)
2     | 1.313 (2)  | 3.063 (11)    | 3.063 (4)
3     | 3.500 (4)  | 3.500 (16)    | 3.500 (4)
30

AMR Performance
Benchmark environments: JUQUEEN (5.0 PFLOP/s): Blue Gene/Q, 459K cores, 1 GB/core, compiler: IBM XL / IBM MPI; SuperMUC (2.9 PFLOP/s): Intel Xeon, 147K cores, 2 GB/core, compiler: Intel XE / IBM MPI.
Benchmark (LBM D3Q19 TRT): during this refresh process all cells on the finest level are coarsened, and the same amount of fine cells is created by splitting coarser cells; 72 % of all cells change their size. 31

AMR Performance on JUQUEEN, space filling curve: Morton
Hybrid MPI+OpenMP version with SMP: 1 process = 2 cores, 8 threads.
[Plot: refresh time in seconds vs. number of cores (256 to 458,752) for 14, 58, and 197 billion cells, i.e. 31,062 / 127,232 / 429,408 cells per core.] 32

AMR Performance on JUQUEEN, diffusion load balancing
Time almost independent of the number of processes!
[Plot: refresh time in seconds vs. number of cores (256 to 458,752) for 14, 58, and 197 billion cells, i.e. 31,062 / 127,232 / 429,408 cells per core.] 33

Pore scale resolved flow in porous media Direct numerical simulation of flow through sphere packings Beetstra, R., Van der Hoef, M. A., & Kuipers, J. A. M. (2007). Drag force of intermediate Reynolds number flow past monoand bidisperse arrays of spheres. AIChE Journal, 53(2), 489-501. Tenneti, S., Garg, R., & Subramaniam, S. (2011). Drag law for monodisperse gas solid systems using particle-resolved direct numerical simulation of flow past fixed assemblies of spheres. International journal of multiphase flow, 37(9), 1072-1092. 34

Flow field and vorticity 2D slice visualized Domain size: Re = 300 Volume fraction:

Macroscopic drag correlation. The resulting drag correlation (given in the reference below) achieves an average absolute percentage error of 9.7 %. Bogner, S., Mohanty, S., & UR (2015). Drag correlation for dilute and moderately dense fluid-particle systems using the lattice Boltzmann method, International Journal of Multiphase Flow 68, 71-79. 36

Setup of random spherical structure for porous media with the PE.

Flow over porous structure
[Plots: setup; porosity and streamwise velocity profile over the height of the channel; pore geometry and streamlines.]
Fattahi, E., Waluga, C., Wohlmuth, B., & Rüde, U. (2015). Large scale lattice Boltzmann simulation for the coupling of free and porous media flow. arXiv preprint arXiv:1507.06565. To appear in LNCS, Proceedings of HPCSE 2015. 38

Turbulent flow over a permeable region: Re 3000, direct numerical simulation. Volume rendering of velocity magnitude. Periodic in X and Y direction. I10 cluster, 7 × 32 × 19 = 4256 core hours, 1,300,000 time steps, 8 times more time steps on the finest level. 39

Multi-Physics Simulations for Particulate Flows Parallel Coupling with walberla and PE Ladd, A. J. (1994). Numerical simulations of particulate suspensions via a discretized Boltzmann equation. Part 1. Theoretical foundation. Journal of Fluid Mechanics, 271(1), 285-309. Tenneti, S., & Subramaniam, S. (2014). Particle-resolved direct numerical simulation for gas-solid flow model development. Annual Review of Fluid Mechanics, 46, 199-230. Bartuschat, D., Fischermeier, E., Gustavsson, K., & UR (2016). Two computational models for simulating the tumbling motion of elongated particles in fluids. Computers & Fluids, 127, 17-35. 40

Fluid-Structure Interaction direct simulation of Particle Laden Flows (4-way coupling) Götz, J., Iglberger, K., Stürmer, M., & UR (2010). Direct numerical simulation of particulate flows on 294912 processor cores. In Proceedings of Supercomputing 2010, IEEE Computer Society. Götz, J., Iglberger, K., Feichtinger, C., Donath, S., & UR (2010). Coupling multibody dynamics and computational fluid dynamics on 8192 processor cores. Parallel Computing, 36(2), 142-151. 41

Simulation of sediment transport 42

Heterogeneous CPU-GPU Simulation. Particles: 31250, domain: 400x400x200, time steps: 400 000. Devices: 2 x M2070 + 1 Intel Westmere, runtime: 17.5 h. Fluidized beds: direct numerical simulation, fully resolved particles, fluid-structure interaction, 4-way coupling. C. Feichtinger, J. Habich, H. Köstler, U. Rüde, T. Aoki, Performance modeling and analysis of heterogeneous lattice Boltzmann simulations on CPU GPU clusters, Parallel Computing, Volume 46, July 2015, Pages 1-13. C. Feichtinger, J. Habich, H. Köstler, G. Hager, U. Rüde, G. Wellein, A flexible Patch-based lattice Boltzmann parallelization approach for heterogeneous GPU CPU clusters, Parallel Computing, Volume 37, Issue 9, September 2011, Pages 536-549. 43

Weak Scaling experiment
Götz, J., Iglberger, K., Stürmer, M., & Rüde, U. (2010, November). Direct numerical simulation of particulate flows on 294912 processor cores. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 1-11). IEEE Computer Society.
[Plot: parallel efficiency vs. number of cores (100 to 300,000) on Jugene (Blue Gene/P, Jülich Supercomputer Center) for 40x40x40 and 80x80x80 lattice cells per core.]
Largest simulation on Jugene: 8 trillion (10^12) variables per time step (LBM alone), scaling from 64 to 294 912 cores, densely packed particles, 150 994 944 000 lattice cells, 264 331 905 rigid spherical objects. 44

Building Block IV (electrostatics): Direct numerical simulation of charged particles in flow. Positively and negatively charged particles in flow subjected to a transversal electric field. Masilamani, K., Ganguly, S., Feichtinger, C., & UR (2011). Hybrid lattice-Boltzmann and finite-difference simulation of electroosmotic flow in a microchannel. Fluid Dynamics Research, 43(2), 025501. Bartuschat, D., Ritter, D., & UR (2012). Parallel multigrid for electrokinetic simulation in particle-fluid flows. In High Performance Computing and Simulation (HPCS), 2012 International Conference on (pp. 374-380). IEEE. Bartuschat, D. & UR (2015). Parallel Multiphysics Simulations of Charged Particles in Microfluidic Flows, Journal of Computational Science, Volume 8, May 2015, Pages 1-19. 45

6-way coupling (see the sketch below):
Finite volumes / MG: treat BCs, V-cycle iteration; input: charge distribution, output: electrostatic force.
LBM: treat BCs, stream-collide step; input: velocity BCs and object motion, output: hydrodynamic force.
Newtonian mechanics: collision response, object motion.
Lubrication correction: correction force computed from the object distance. 46
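
A structural sketch of such a coupled time loop, just to show the order in which the solvers exchange forces and boundary conditions; the function names and placeholder bodies are my own and are not the waLBerla/pe API:

```python
def solve_electrostatics(charges):       # finite volumes + multigrid V-cycles
    return {"electrostatic_force": 0.0}  # placeholder

def lbm_step(positions, velocities):     # treat BCs, stream-collide step
    return {"hydrodynamic_force": 0.0}   # placeholder

def lubrication_correction(gaps):        # correction force for small gaps
    return {"correction_force": 0.0}     # placeholder

def rigid_body_step(total_force, dt):    # collision response + Newton-Euler update
    return {"positions": [], "velocities": [], "charges": [], "gaps": []}

def coupled_time_loop(state, dt, n_steps):
    for _ in range(n_steps):
        el = solve_electrostatics(state["charges"])
        hydro = lbm_step(state["positions"], state["velocities"])
        lub = lubrication_correction(state["gaps"])
        # sum all force contributions and advance the rigid bodies
        total_force = (el["electrostatic_force"]
                       + hydro["hydrodynamic_force"]
                       + lub["correction_force"])
        state = rigid_body_step(total_force, dt)
    return state

coupled_time_loop({"charges": [], "positions": [], "velocities": [], "gaps": []},
                  dt=1.0, n_steps=3)
```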

Separation experiment
240 time steps of a fully 6-way coupled simulation, 400 sec on SuperMUC, weak scaling up to 32 768 cores, 7.1 million particles.
[Plots: total runtimes per component (LBM, Map, Lubr, HydrF, pe, MG, SetRHS, PtCm, ElectF) vs. number of nodes (1 to 2048); LBM and MG performance in 10^3 MFLUPS (LBM) and 10^3 MLUPS (MG) vs. number of nodes.] 47

Building Block V Volume of Fluids Method for Free Surface Flows joint work with Regina Ammer, Simon Bogner, Martin Bauer, Daniela Anderl, Nils Thürey, Stefan Donath, Thomas Pohl, C Körner, A. Delgado Körner, C., Thies, M., Hofmann, T., Thürey, N., & UR. (2005). Lattice Boltzmann model for free surface flow for modeling foaming. Journal of Statistical Physics, 121(1-2), 179-196. Donath, S., Feichtinger, C., Pohl, T., Götz, J., & UR. (2010). A Parallel Free Surface Lattice Boltzmann Method for Large- Scale Applications. Parallel Computational Fluid Dynamics: Recent Advances and Future Directions, 318. Anderl, D., Bauer, M., Rauh, C., UR, & Delgado, A. (2014). Numerical simulation of adsorption and bubble interaction in protein foams using a lattice Boltzmann method. Food & function, 5(4), 755-763. Bogner, S., Ammer, R., & Rüde, U. (2015). Boundary conditions for free interfaces with the lattice Boltzmann method. Journal of Computational Physics, 297, 1-12. 48

Free Surface Flows: a Volume-of-Fluid-like approach. Flag field: compute only in fluid cells. Special free surface conditions in interface cells. Reconstruction of curvature for surface tension. 49
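
As an illustration of the flag-field idea (a simplified sketch under my own assumptions, not the waLBerla free-surface implementation): each cell carries a fill level and a flag, and interface cells convert to fluid or gas when their fill level leaves the unit interval:

```python
import numpy as np

GAS, INTERFACE, FLUID = 0, 1, 2

def update_flags(fill, flags, eps=1e-3):
    """Convert interface cells whose fill level left [0, 1] (with a small
    hysteresis eps) into gas or fluid cells.  Only fluid cells are treated by
    the plain LBM kernel; interface cells get free-surface boundary conditions.
    The real algorithm additionally keeps the interface layer closed by
    converting neighbors, which is omitted here."""
    new_flags = flags.copy()
    interface = flags == INTERFACE
    new_flags[interface & (fill >= 1.0 + eps)] = FLUID
    new_flags[interface & (fill <= -eps)] = GAS
    return new_flags

fill = np.array([[1.2, 0.5], [-0.1, 0.9]])
flags = np.full((2, 2), INTERFACE)
print(update_flags(fill, flags))
```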

Simulation of Metal Foams
Example application in engineering: metal foam simulations. Based on LBM: free surfaces, surface tension, disjoining pressure to stabilize thin liquid films, parallelization with MPI and load balancing. Other applications: food processing, fuel cells.
Körner, C., Thies, M., Hofmann, T., Thürey, N., & UR (2005). Lattice Boltzmann model for free surface flow for modeling foaming. Journal of Statistical Physics, 121(1-2), 179-196. Donath, S., Mecke, K., Rabha, S., Buwa, V., & UR (2011). Verification of surface tension in the parallel free surface lattice Boltzmann method in walberla. Computers & Fluids, 45(1), 177-186. Thürey, N., & UR (2009). Stable free surface flows with the lattice Boltzmann method on adaptively coarsened grids. Computing and Visualization in Science, 12(5), 247-263. 50

Simulation for hygiene products (for Procter & Gamble): capillary pressure, surface tension, inclination, contact angle. 51

Additive Manufacturing Fast Electron Beam Melting Ammer, R., Markl, M., Ljungblad, U., Körner, C., & UR (2014). Simulating fast electron beam melting with a parallel thermal free surface lattice Boltzmann method. Computers & Mathematics with Applications, 67(2), 318-330. Ammer, R., UR, Markl, M., Jüchter V., & Körner, C. (2014). Validation experiments for LBM simulations of electron beam melting. International Journal of Modern Physics C. 52

Simulation of Electron Beam Melting. Simulating powder bed generation using the PE framework. A high-speed camera shows the melting step for manufacturing a hollow cylinder; waLBerla simulation. 53

Conclusions 54

Research in Computational Science is done by teams Harald Köstler Christian Godenschwager Kristina Pickl Regina Ammer Simon Bogner Ehsan Fattahi Florian Schornbaum Sebastian Kuckuk Christoph Rettinger Dominik Bartuschat Martin Bauer Christian Kuschel 55

Thank you for your attention! Videos, preprints, slides at https://www10.informatik.uni-erlangen.de 56