High Scalability of Lattice Boltzmann Simulations with Turbulence Models using Heterogeneous Clusters

SIAM PP 2014
High Scalability of Lattice Boltzmann Simulations with Turbulence Models using Heterogeneous Clusters
C. Riesinger, A. Bakhtiari, M. Schreiber
Technische Universität München
February 20, 2014

Topics
Motivation
The Lattice Boltzmann Method
Optimization: Single GPU optimization (A-B vs. A-A pattern), Multi GPU optimization (communication schemes)
Results: Runtime, Lattice Updates, Parallel Efficiency
Conclusion & Outlook

Motivation
Main goal: development of a highly scalable code for heterogeneous clusters to investigate new ideas.
Minor goal: extend the software step by step with additional features (without ever losing sight of the main goal).
Properties of our software:
application: Computational Fluid Dynamics
algorithm: Lattice Boltzmann Method
heterogeneous system: CPUs + GPUs
development platform: OpenCL
Yet another Lattice Boltzmann Code

Features
Features implemented so far:
Implementation of LBM
Tailored software design for heterogeneous clusters
Overlapping communication with computation
Support for free surfaces
Considering turbulent flows
Validation of the simulation results
Features to be implemented:
Support for (complex) geometries
Investigation of additional turbulence models
Exploiting CPU power

The Lattice Boltzmann Method

The Lattice Boltzmann Method 1/3
Boltzmann Equation
$$\frac{\partial f}{\partial t} + \vec{v} \cdot \nabla f = -\frac{1}{\tau}\left(f - f^{eq}\right)$$
where the right-hand side is the collision operator,
$f(\vec{x}, \vec{v}, t)$: probability density function, i.e. the probability of finding particles at position $\vec{x}$ and time $t$ with velocity $\vec{v}$
$f^{eq}$: equilibrium distribution
Lattice Boltzmann Method (LBM)
LBM simulates fluid flow by
tracking virtual particles (more specifically, their distributions),
colliding and streaming them (collision & streaming phase),
only along a limited number of directions (DdQq discretization schemes).

The Lattice Boltzmann Method 2/3
Discretization Schemes
d: number of dimensions
q: number of density distributions/lattice vectors
[Figure: D3Q19 lattice with numbered lattice vectors]
Collision & Streaming
During the collision phase, the collision operator is approximated (BGK model):
$$f_i(\vec{x}, t + \delta t) = f_i(\vec{x}, t) + \frac{1}{\tau}\left(f_i^{eq} - f_i\right)$$
$$f_i^{eq}(\vec{u}) = \omega_i \rho \left(1 + 3\,(\vec{e}_i \cdot \vec{u}) + \frac{9}{2}\,(\vec{e}_i \cdot \vec{u})^2 - \frac{3}{2}\,\vec{u}^2\right)$$
with $\omega_i = \frac{1}{18}$ for $i \in \{0,\dots,3,16,17\}$, $\omega_i = \frac{1}{36}$ for $i \in \{4,\dots,15\}$, and $\omega_i = \frac{1}{3}$ for $i = 18$.
The movement of the virtual particles is simulated during the streaming phase:
$$f_i(\vec{x} + \vec{e}_i, t + \delta t) = f_i(\vec{x}, t + \delta t)$$
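
To make the collision step concrete, here is a minimal sketch in plain C of the BGK collision for a single cell, assuming the lattice vectors and the weights given above are passed in; the function and parameter names are illustrative only and not taken from the presented code.

```c
/* Illustrative sketch, not the presented code.
 * BGK collision for one cell of a DdQq lattice:
 *   f_i <- f_i + (1/tau) * (f_i^eq - f_i)
 * e[i] are the lattice vectors, w[i] the weights omega_i from the slide. */
static void collide_cell(int q, const double (*e)[3], const double *w,
                         double *f, double rho, const double u[3], double tau)
{
    const double u_sq = u[0]*u[0] + u[1]*u[1] + u[2]*u[2];
    for (int i = 0; i < q; ++i) {
        const double eu  = e[i][0]*u[0] + e[i][1]*u[1] + e[i][2]*u[2];
        const double feq = w[i] * rho * (1.0 + 3.0*eu + 4.5*eu*eu - 1.5*u_sq);
        f[i] += (feq - f[i]) / tau;   /* relax toward the equilibrium distribution */
    }
}
```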

The Lattice Boltzmann Method 3/3
Densities & Velocities
Only the density distributions along the lattice vectors are stored. Densities and velocities are NOT explicitly stored per cell, but can be calculated:
$$\rho = \sum_{i=0}^{18} f_i \qquad\qquad \rho\,\vec{v} = \sum_{i=0}^{18} f_i\,\vec{e}_i$$
LBM for High Performance Computing
Data access only for the discretized density distributions
Communication only with neighboring cells
Calculations only use simple instructions
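
As a small illustration of these moment sums, the following C sketch (all identifiers are made up for this example) recovers the density and velocity of one cell from its q distribution values:

```c
/* Illustrative sketch: macroscopic moments of one cell,
 *   rho = sum_i f_i   and   v = (1/rho) * sum_i f_i * e_i            */
static void cell_moments(int q, const double (*e)[3], const double *f,
                         double *rho_out, double v_out[3])
{
    double rho = 0.0, m[3] = {0.0, 0.0, 0.0};
    for (int i = 0; i < q; ++i) {
        rho  += f[i];
        m[0] += f[i] * e[i][0];
        m[1] += f[i] * e[i][1];
        m[2] += f[i] * e[i][2];
    }
    *rho_out = rho;
    v_out[0] = m[0] / rho;   /* assumes rho > 0 */
    v_out[1] = m[1] / rho;
    v_out[2] = m[2] / rho;
}
```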

Optimization

Single GPU optimization: The A-B Pattern
Collision: the density distributions of the cell itself are read, the collision operator is executed, and the updated density distributions are written.
Propagation: the updated density distributions are copied to the corresponding neighboring cells.
To avoid race conditions, two dedicated arrays for the density distributions are required; they are used in a ping-pong-like manner.
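
A minimal host-side sketch of that ping-pong, assuming an OpenCL kernel that takes a source and a destination distribution buffer; the kernel, buffer, and function names are illustrative, not the names used in the actual software.

```c
#include <CL/cl.h>

/* Illustrative sketch: A-B pattern with two distribution buffers that are
 * swapped after every iteration, so a kernel never reads and writes the
 * same array (no race conditions). */
void run_ab_iterations(cl_command_queue queue, cl_kernel lbm_ab_step,
                       cl_mem df_a, cl_mem df_b,
                       const size_t global_size[3], int iterations)
{
    cl_mem src = df_a, dst = df_b;
    for (int it = 0; it < iterations; ++it) {
        clSetKernelArg(lbm_ab_step, 0, sizeof(cl_mem), &src);
        clSetKernelArg(lbm_ab_step, 1, sizeof(cl_mem), &dst);
        clEnqueueNDRangeKernel(queue, lbm_ab_step, 3, NULL,
                               global_size, NULL, 0, NULL, NULL);
        cl_mem tmp = src; src = dst; dst = tmp;   /* ping-pong swap */
    }
    clFinish(queue);   /* wait for all enqueued iterations to complete */
}
```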

Single GPU optimization: The A-A Pattern
Only one memory layout is used for all iterations. This saves memory by a factor of two, allowing twice as many cells per GPU.
α step (odd iterations): only one collision is performed; the accessed memory belongs to the cell itself.
β step (even iterations): two streamings and one collision are performed; the accessed memory belongs to the neighboring cells.
Data is always read from and written to the same memory locations.
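
A hedged host-side sketch of how the two steps could be driven on a single buffer; the kernel names and the odd/even assignment follow the slide, but all identifiers are illustrative.

```c
#include <CL/cl.h>

/* Illustrative sketch: A-A pattern with one distribution buffer for all
 * iterations. Odd iterations run the alpha kernel, even iterations the beta
 * kernel; each kernel writes back exactly to the locations it read. */
void run_aa_iterations(cl_command_queue queue,
                       cl_kernel kernel_alpha, cl_kernel kernel_beta,
                       cl_mem df, const size_t global_size[3], int iterations)
{
    for (int it = 1; it <= iterations; ++it) {
        cl_kernel step = (it % 2 == 1) ? kernel_alpha : kernel_beta;
        clSetKernelArg(step, 0, sizeof(cl_mem), &df);   /* same buffer every time */
        clEnqueueNDRangeKernel(queue, step, 3, NULL,
                               global_size, NULL, 0, NULL, NULL);
    }
    clFinish(queue);
}
```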

OpenCL Features
[UML diagram: CommandQueue 0..1 — * Event]
Command queue: contains a list of commands; commands can be executed in-order or out-of-order.
Event: can be used to determine the state of a command; a mechanism to implement synchronization between commands; complex dependency graphs can be realized via dependencies between events.
Only the concurrent execution of several command queues enables the parallel execution of several commands.
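
To illustrate the event mechanism, here is a small sketch (illustrative names, not the presented code) in which a second command queue waits, via an OpenCL event, for a kernel enqueued on the first queue before downloading its result:

```c
#include <CL/cl.h>

/* Illustrative sketch: the halo kernel on queue_a produces an event; the
 * download of the halo buffer on queue_b waits for that event, while both
 * queues can otherwise run concurrently. */
void enqueue_halo_then_download(cl_command_queue queue_a, cl_command_queue queue_b,
                                cl_kernel halo_kernel, cl_mem halo_buf,
                                void *host_buf, size_t bytes,
                                const size_t global_size[3])
{
    cl_event halo_done;
    clEnqueueNDRangeKernel(queue_a, halo_kernel, 3, NULL,
                           global_size, NULL, 0, NULL, &halo_done);
    /* Non-blocking read that only starts once halo_done has completed. */
    clEnqueueReadBuffer(queue_b, halo_buf, CL_FALSE, 0, bytes, host_buf,
                        1, &halo_done, NULL);
    clReleaseEvent(halo_done);
}
```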

Single Boundary Kernel - Single Command Queue (SBK-SCQ)
[Timeline diagram: the CPU handles the CPU↔CPU communication via MPI_Isend()/MPI_Irecv() and the GPU↔CPU transfers; the GPU uses one single command queue and one single kernel for all halo cells (X, Y, Z faces), followed by the computation of the inner cells. Legend: computations done by GPU, computations done by CPU, communication, command queue idle times.]
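
The overlap idea behind these communication schemes can be sketched as follows (a simplified, hedged sketch with illustrative names, reduced to a single neighbor and one halo buffer): update and download the halo cells first, start the non-blocking MPI exchange, compute the inner cells while the messages are in flight, then wait and upload the received halo.

```c
#include <CL/cl.h>
#include <mpi.h>

/* Illustrative sketch: one time step with communication hidden behind the
 * inner-cell computation. */
void lbm_step_overlapped(cl_command_queue queue,
                         cl_kernel halo_kernel, cl_kernel inner_kernel,
                         cl_mem halo_buf, double *send_buf, double *recv_buf,
                         size_t halo_bytes, size_t halo_count,
                         int neighbor_rank, const size_t global_size[3])
{
    MPI_Request reqs[2];

    /* 1. Update the halo cells and download them (blocking read for brevity). */
    clEnqueueNDRangeKernel(queue, halo_kernel, 3, NULL, global_size, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, halo_buf, CL_TRUE, 0, halo_bytes, send_buf, 0, NULL, NULL);

    /* 2. Start the non-blocking CPU<->CPU exchange. */
    MPI_Isend(send_buf, (int)halo_count, MPI_DOUBLE, neighbor_rank, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(recv_buf, (int)halo_count, MPI_DOUBLE, neighbor_rank, 0, MPI_COMM_WORLD, &reqs[1]);

    /* 3. Compute the inner cells while the messages are in flight. */
    clEnqueueNDRangeKernel(queue, inner_kernel, 3, NULL, global_size, NULL, 0, NULL, NULL);

    /* 4. Wait for the exchange and upload the received halo to the GPU. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    clEnqueueWriteBuffer(queue, halo_buf, CL_TRUE, 0, halo_bytes, recv_buf, 0, NULL, NULL);
}
```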

Multi Boundary Kernels - Single Command Queue (MBK-SCQ)
[Timeline diagram: as for SBK-SCQ, but multiple kernels are used for the halo cells (one per face), still enqueued into one single command queue, followed by the computation of the inner cells.]

Multi Boundary Kernels - Multi Command Queues (MBK-MCQ)
[Timeline diagram: multiple command queues are used for the halo cells, each with its own kernels and its own MPI_Isend()/MPI_Irecv() pair, so that the GPU↔CPU transfers and the CPU↔CPU communication of the individual faces can proceed concurrently with the computation of the inner cells.]

Results

Test Configuration
Tödi cluster in Switzerland
CPUs: AMD Opteron Interlagos 6272; cores per CPU: 16; total number of CPUs: 272; total number of CPU cores: 4,352
GPUs: NVIDIA Tesla K20X; CUDA cores per GPU: 2,688; total number of GPUs: 272 (256 used); total number of CUDA cores: 731,136

Runtime (Weak Scaling)
[Chart: normalized runtime (linear scale) versus number of GPUs (logarithmic scale, 2 to 256) for the optimal runtime and for the Basic, SBK-SCQ, MBK-SCQ, and MBK-MCQ schemes.]

Lattice Updates (Weak Scaling)
[Chart: GLUPS (logarithmic scale) versus number of GPUs (logarithmic scale, 2 to 256) for optimal scaling and for the Basic, SBK-SCQ, MBK-SCQ, and MBK-MCQ schemes; optimal scaling corresponds to about 205 GLUPS at 256 GPUs.]
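
For reference, GLUPS stands for giga lattice updates per second; assuming the standard definition of this throughput metric:

$$\text{GLUPS} = \frac{N_{\text{cells}} \cdot N_{\text{iterations}}}{t_{\text{runtime}} \cdot 10^{9}}$$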

Parallel Efficiency (Weak Scaling)
[Chart: parallel efficiency (linear scale) versus number of GPUs (logarithmic scale, 2 to 256) for optimal efficiency and for the Basic, SBK-SCQ, MBK-SCQ, and MBK-MCQ schemes.]
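
Assuming the standard weak-scaling definition (work per GPU held constant, smallest configuration as baseline), the plotted efficiency is presumably

$$E(N) = \frac{t(N_{0})}{t(N)}, \qquad N_{0} = 2\ \text{GPUs},$$

so a value of 1 means the runtime stays constant as GPUs and problem size grow together.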

Conclusion & Outlook

Conclusion & Outlook
Conclusion
The biggest gain in performance is achieved by using SBK-SCQ instead of a basic scheme.
More sophisticated schemes (MBK-MCQ) achieve results similar to less sophisticated schemes (SBK-SCQ).
It looks like NVIDIA's OpenCL implementation does not support overlapping of command queues.
The parallel efficiency decreases linearly with the number of GPUs.
Outlook
Improve the poor parallel efficiency (~40%) by running multiple command queues in parallel and by considering the network topology.
Exploit CPUs not just for communication but also for computation.

Technische Universität München
Final slide