Software and Performance Engineering for numerical codes on GPU clusters

H. Köstler
International Workshop of GPU Solutions to Multiscale Problems in Science and Engineering, Harbin, China, 28.7.2010

Contents
- WaLBerla
  - Applications
  - Software and Performance Engineering
- Numerical Results
  - LBM
  - Multigrid
- Future Work

WaLBerla: Applications

WaLBerla: a parallel structured-grid framework for various applications

Free surface flows
- The gas phase is modeled only through boundary conditions
- Examples: rising bubble, metal foam, sinking box, PEM (proton exchange membrane) fuel cell

Fluid-structure interaction: particle segregation
- 64.3 million objects in an LBM grid with 151 billion lattice cells
- Runs on 294,912 cores of a Blue Gene/P at the Jülich Supercomputing Centre

WaLBerla: Software and Performance Engineering

WaLBerla framework
- Main design goal: a massively parallel software framework for various engineering applications
- Software quality factors:
  - Usability
  - Reliability
  - Maintainability and expandability
  - Portability
  - Efficiency and scalability

Performance at all costs?
- Performance optimization
  - Architectural factors
  - Optimization techniques
  - Programming techniques
- Performance engineering
  - Simplicity
  - Extensibility
  - Performance
  - Effort

Performance at different scales I
- WaLBerla (C++, MPI): code management, standard implementations
- OpenCL framework for data-parallel computations on structured grids
- Low-level kernels for optimized, architecture-specific computations (C++, CUDA, intrinsics, assembler)

Performance at different scales II
- WaLBerla: heterogeneous devices, distributed-memory parallelism
- Low-level / OpenCL framework: local device, shared-memory parallelism

Efficient OpenCL?
- Kernels have to be adapted to the specific architecture
- Optimization techniques are necessary:
  - Vectorize computations
  - Define suitable workgroups (threads running in parallel that share local memory)
  - Reuse local memory: copy to local memory, perform the computations, then copy back to global memory
  - Ensure correct data alignment and coalesced memory accesses
- WaLBerla supports OpenCL but also offers similar strategies for heterogeneous computing in order to support e.g. CUDA
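The "copy to local, compute, copy back" pattern can be illustrated on the CPU. The following is only a sketch in plain C++, not WaLBerla or OpenCL code: the tile size `TILE`, the three-point averaging stencil, and the function names `smooth_tiled` and `smooth_reference` are all assumptions made for illustration. The small array `local` stands in for work-group local memory, and each tile is loaded together with one halo cell per side before any arithmetic is done.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed work-group size; a real kernel would tune this per architecture.
constexpr std::size_t TILE = 64;

// Three-point average with clamped boundaries, processed tile by tile.
std::vector<float> smooth_tiled(const std::vector<float>& in) {
    const std::size_t n = in.size();
    std::vector<float> out(n);
    float local[TILE + 2];  // one tile plus one halo cell on each side

    for (std::size_t start = 0; start < n; start += TILE) {
        const std::size_t len = std::min(TILE, n - start);

        // 1. Copy the tile and its halo from "global" to "local" memory.
        //    local[i] holds global element start + i - 1, clamped to [0, n-1].
        for (std::size_t i = 0; i < len + 2; ++i) {
            std::size_t g = (start + i == 0) ? 0 : start + i - 1;
            g = std::min(g, n - 1);
            local[i] = in[g];
        }

        // 2. Compute from local data only, 3. write contiguous results back.
        for (std::size_t i = 0; i < len; ++i)
            out[start + i] = (local[i] + local[i + 1] + local[i + 2]) / 3.0f;
    }
    return out;
}

// Reference version that reads "global" memory directly for every operand.
std::vector<float> smooth_reference(const std::vector<float>& in) {
    const std::size_t n = in.size();
    std::vector<float> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        const float left = in[i == 0 ? 0 : i - 1];
        const float right = in[i + 1 < n ? i + 1 : n - 1];
        out[i] = (left + in[i] + right) / 3.0f;
    }
    return out;
}
```

Both functions produce identical results; the tiled variant reads each input value from the large array once per tile instead of up to three times, which is the effect local memory reuse aims for on a GPU.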

WaLBerla: Patch concept

WaLBerla: Sweep concept

WaLBerla: Communication concept

Numerical Results: LBM

Boltzmann equation
- Mesoscopic approach to solving the Navier-Stokes equations
- The Boltzmann equation describes the statistical distribution of one particle in a fluid:

$$\frac{\partial f}{\partial t} + \zeta \cdot \nabla f = \Omega(f)$$

- $f$ is the probability distribution function (PDF), $\zeta$ the particle velocity, and $\Omega(f)$ the change due to collisions
- Models the behavior of fluids in statistical physics

Lattice Boltzmann Method
- The Lattice Boltzmann Method solves the discrete Boltzmann equation
- The fluid domain is split into simple cells that are treated independently in each time step
- Microscopic numerical fluid solver
- Simple updating rules
- Handles complex boundary conditions for static and moving objects
- Inherently parallel scheme, well suited to hardware acceleration

D3Q19 LBM cell
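A D3Q19 cell stores 19 distribution values, one per discrete velocity: the rest velocity, 6 axis-aligned directions, and 12 face-diagonal directions, with the standard lattice weights 1/3, 1/18, and 1/36. The sketch below writes this out in C++. The ordering of the directions and the single-relaxation-time (BGK) collision are assumptions for illustration; the slides do not state which collision model or memory layout is used.

```cpp
#include <array>
#include <cassert>
#include <cmath>

constexpr int Q = 19;

// D3Q19 velocities: rest, 6 axis-aligned, 12 face-diagonal (ordering is arbitrary).
constexpr int C[Q][3] = {
    { 0,  0,  0},
    { 1,  0,  0}, {-1,  0,  0}, { 0,  1,  0}, { 0, -1,  0}, { 0,  0,  1}, { 0,  0, -1},
    { 1,  1,  0}, {-1, -1,  0}, { 1, -1,  0}, {-1,  1,  0},
    { 1,  0,  1}, {-1,  0, -1}, { 1,  0, -1}, {-1,  0,  1},
    { 0,  1,  1}, { 0, -1, -1}, { 0,  1, -1}, { 0, -1,  1}};

// Standard D3Q19 lattice weights.
constexpr double weight(int q) {
    return q == 0 ? 1.0 / 3.0 : (q <= 6 ? 1.0 / 18.0 : 1.0 / 36.0);
}

using Cell = std::array<double, Q>;

// Density is the zeroth moment of the distributions.
double density(const Cell& f) {
    double rho = 0.0;
    for (double v : f) rho += v;
    return rho;
}

// Second-order equilibrium distribution for density rho and velocity u.
Cell equilibrium(double rho, const std::array<double, 3>& u) {
    Cell feq{};
    const double uu = u[0] * u[0] + u[1] * u[1] + u[2] * u[2];
    for (int q = 0; q < Q; ++q) {
        const double cu = C[q][0] * u[0] + C[q][1] * u[1] + C[q][2] * u[2];
        feq[q] = weight(q) * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * uu);
    }
    return feq;
}

// BGK collision (assumed model): relax every value toward equilibrium.
void collide_bgk(Cell& f, double tau) {
    const double rho = density(f);
    std::array<double, 3> u{0.0, 0.0, 0.0};
    for (int q = 0; q < Q; ++q)
        for (int d = 0; d < 3; ++d) u[d] += C[q][d] * f[q];
    for (double& c : u) c /= rho;
    const Cell feq = equilibrium(rho, u);
    for (int q = 0; q < Q; ++q) f[q] -= (f[q] - feq[q]) / tau;
}
```

The collision touches only the 19 values of one cell, which is why cells can be updated independently; only the subsequent streaming step moves values to neighboring cells.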

Distribution of work to threads
- Each thread treats one cell
- One block handles one stripe along the x-direction with its threads, so all cells in a block share the same y and z index
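This mapping can be emulated on the CPU to check that every cell is handled exactly once. The sketch below is plain C++ rather than CUDA; `Dim3`, `cell_index`, and `visit_counts` are hypothetical names, and the choice of which block index supplies y and which supplies z is an assumption, since the slide only states that y and z are fixed per block.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Dim3 { unsigned x, y, z; };  // stand-in for a CUDA-style 3D extent

// Linear cell index with x running fastest, so the threads of one block
// (consecutive x) touch consecutive memory addresses.
std::size_t cell_index(unsigned x, unsigned y, unsigned z, Dim3 n) {
    return x + static_cast<std::size_t>(n.x) * (y + static_cast<std::size_t>(n.y) * z);
}

// Emulates a launch with n.y * n.z blocks of n.x threads each and
// returns how often each cell was visited.
std::vector<int> visit_counts(Dim3 n) {
    std::vector<int> visits(static_cast<std::size_t>(n.x) * n.y * n.z, 0);
    for (unsigned bz = 0; bz < n.z; ++bz)          // second block index -> z (assumed)
        for (unsigned by = 0; by < n.y; ++by)      // first block index  -> y (assumed)
            for (unsigned tx = 0; tx < n.x; ++tx)  // thread index       -> x
                ++visits[cell_index(tx, by, bz, n)];
    return visits;
}
```

With x as the fastest-running index, neighboring threads read neighboring addresses, which is the precondition for coalesced memory access.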

LBM results I

HLRS Stuttgart: NEC Nehalem Cluster
- Peak performance: 62 TFlops
- Number of nodes: 700, dual-socket quad-core
- Processor: Intel Xeon (X5560) Nehalem @ 2.8 GHz, 8 MB cache
- Memory per node: 12 GB
- Disk: 60 TB
- Node-to-node interconnect: InfiniBand, GigE
- Accelerators: 32 nodes provide Nvidia Tesla S1070
- http://www.hlrs.de/systems/platforms/nec-nehalem-cluster/

Weak Scaling on Tesla C1070 and HLRS cluster

Heterogeneous LBM results
Figure: Investigation of the load balance with 6 CPU-only processes, each working on one block, and 2 GPU processes with varying block counts. The block size is 90^3 lattice cells.

Numerical Results: Multigrid

Image denoising of 3D CT volume
Data: Siemens AG, Healthcare Sector

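Denoising by diffusion
Idea: Use a nonlinear anisotropic diffusion process to denoise the image $u_0$ in the domain $\Omega$, i.e. solve the time-dependent PDE

$$\frac{\partial u}{\partial t} - \operatorname{div}(g \nabla u) = 0 \quad \text{in } \Omega \times T$$
$$\langle g \nabla u, n \rangle = 0 \quad \text{on } \partial\Omega \times T$$
$$u(x, 0) = u_0(x) \quad \text{in } \Omega$$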
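To make the structure of this PDE concrete, here is a deliberately simplified sketch: one explicit Euler time step in 1D with grid spacing 1. It is not the solver from the talk. The diffusivity `g = 1 / (1 + |grad u|^2 / lambda^2)` is an assumed Perona-Malik-style choice (the slides do not specify g), and the function names and the parameter `lambda` are hypothetical. Writing the update in flux form enforces the zero-flux boundary condition automatically, because no flux is ever computed across the two ends of the domain.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Assumed edge-stopping diffusivity: close to 1 in flat regions,
// small across strong gradients (edges).
double diffusivity(double grad, double lambda) {
    return 1.0 / (1.0 + (grad * grad) / (lambda * lambda));
}

// One explicit step of u_t = d/dx( g * du/dx ) with zero-flux boundaries.
// Since g <= 1, dt <= 0.5 keeps the step stable for unit grid spacing.
std::vector<double> diffusion_step(const std::vector<double>& u, double dt, double lambda) {
    std::vector<double> next(u);
    for (std::size_t i = 0; i + 1 < u.size(); ++i) {
        const double grad = u[i + 1] - u[i];                   // gradient on the face i | i+1
        const double flux = diffusivity(grad, lambda) * grad;  // g * du/dx on that face
        next[i] += dt * flux;                                  // what enters cell i ...
        next[i + 1] -= dt * flux;                              // ... leaves cell i+1
    }
    return next;
}
```

Because every face flux is added to one cell and subtracted from its neighbor, the mean gray value of the signal is preserved while oscillations are damped.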

Multigrid idea
Multigrid methods are based on two principles:
1. Smoothing property: smooth the error on the fine grid
2. Coarse grid principle: approximate the smooth error on coarser grids
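The two principles combine into a V-cycle. As a minimal sketch, the code below applies a (2,2)-V-cycle (two pre-smoothing and two post-smoothing sweeps) to a swapped-in model problem, the 1D Poisson equation -u'' = f with zero Dirichlet boundaries, rather than the 3D diffusion problem of the talk. Gauss-Seidel smoothing, full-weighting restriction, and linear interpolation are assumed component choices, and all function names are hypothetical. The number of interior points must be of the form 2^k - 1.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Residual r = f - A u for A = tridiag(-1, 2, -1) / h^2, zero Dirichlet boundaries.
Vec residual(const Vec& u, const Vec& f, double h) {
    const std::size_t n = u.size();
    Vec r(n);
    for (std::size_t i = 0; i < n; ++i) {
        const double left = (i > 0) ? u[i - 1] : 0.0;
        const double right = (i + 1 < n) ? u[i + 1] : 0.0;
        r[i] = f[i] - (2.0 * u[i] - left - right) / (h * h);
    }
    return r;
}

// Principle 1: Gauss-Seidel sweeps damp the oscillatory part of the error.
void smooth(Vec& u, const Vec& f, double h, int sweeps) {
    const std::size_t n = u.size();
    for (int s = 0; s < sweeps; ++s)
        for (std::size_t i = 0; i < n; ++i) {
            const double left = (i > 0) ? u[i - 1] : 0.0;
            const double right = (i + 1 < n) ? u[i + 1] : 0.0;
            u[i] = 0.5 * (h * h * f[i] + left + right);
        }
}

// One (2,2)-V-cycle; u.size() must be 2^k - 1.
void v_cycle(Vec& u, const Vec& f, double h) {
    const std::size_t n = u.size();
    assert(n % 2 == 1);
    if (n == 1) { u[0] = 0.5 * h * h * f[0]; return; }  // exact solve on coarsest grid

    smooth(u, f, h, 2);                                 // pre-smoothing
    const Vec r = residual(u, f, h);

    // Principle 2: the remaining smooth error is approximated on a coarser grid.
    const std::size_t nc = (n - 1) / 2;
    Vec rc(nc), ec(nc, 0.0);
    for (std::size_t j = 0; j < nc; ++j)                // full-weighting restriction
        rc[j] = 0.25 * (r[2 * j] + 2.0 * r[2 * j + 1] + r[2 * j + 2]);
    v_cycle(ec, rc, 2.0 * h);                           // solve A e = r recursively

    for (std::size_t j = 0; j < nc; ++j) {              // linear interpolation of e
        u[2 * j] += 0.5 * ec[j];
        u[2 * j + 1] += ec[j];
        u[2 * j + 2] += 0.5 * ec[j];
    }
    smooth(u, f, h, 2);                                 // post-smoothing
}

// Maximum norm, used to monitor the residual.
double max_norm(const Vec& v) {
    double m = 0.0;
    for (double x : v) m = std::max(m, std::fabs(x));
    return m;
}
```

A typical use is to start from `u = 0`, call `v_cycle(u, f, h)` repeatedly with `h = 1.0 / (n + 1)`, and stop once `max_norm(residual(u, f, h))` has dropped by the desired factor.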

GPU single node performance
[Figure: runtime for a (2,2)-V-cycle in seconds versus number of unknowns (128x128x128, 256x128x128, 256x256x128, 256x256x256) for Tesla C2050 and Tesla M1060, each in single and double precision]

Weak scaling (256^3 unknowns per unit)
[Figure: runtime for a (2,2)-V-cycle in seconds versus number of processing units for the CPU and GPU versions, each in single and double precision]

Strong scaling (256^3 unknowns)
[Figure: runtime for a (2,2)-V-cycle in seconds versus number of processing units for the CPU and GPU versions, each in single and double precision]
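Runtime curves from weak and strong scaling runs are commonly condensed into a parallel efficiency. The two helpers below use the usual definitions; the function names are hypothetical and no measured values from the talk are used.

```cpp
#include <cassert>
#include <cmath>

// Weak scaling: work per unit is fixed, so the ideal runtime stays constant.
// t1 = runtime on one unit, tn = runtime on n units.
double weak_efficiency(double t1, double tn) {
    return t1 / tn;
}

// Strong scaling: total work is fixed, so the ideal runtime is t1 / units.
double strong_efficiency(double t1, double tn, int units) {
    return t1 / (units * tn);
}
```

A value of 1 means ideal scaling; values below 1 indicate the share of runtime lost to communication, load imbalance, or serial work such as coarse grid solves.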

Future Work
- Performance Engineering
  - Heterogeneous computing
  - OpenCL optimization
  - Better tools
- Applications
  - Improve multigrid coarse grid performance
  - Extend LBM boundary conditions and free surfaces