OpenFOAM + GPGPU. İbrahim Özküçük

This offering is not approved or endorsed by OpenCFD Limited, the producer of the OpenFOAM software and owner of the OPENFOAM and OpenCFD trade marks.



Outline
- GPGPU vs CPU
- GPGPU plugins for OpenFOAM
- Overview of Discretization
- CUDA for FOAM Link (cufflink)
- Cusp & Thrust Libraries
- How Cufflink Works
- Performance data of Cufflink solvers
- CUDA Solvers in foam-extend-3.0
- Considerations about the future
- Linear System Solvers in OpenFOAM

GPGPU vs CPU
(Hardware comparison figures, taken from reference (1).)

OpenFOAM GPGPU Solvers
- SpeedIT Plugin to OpenFOAM: Conjugate Gradient & BiConjugate Gradient. Further information at http://speedit.vratis.com/index.php/products
- ofgpu, GPU Linear Solvers for OpenFOAM. Further information at http://www.symscape.com/gpu-openfoam
- Culises, GPU power for OpenFOAM. Further information at http://www.fluidyna.com/content/culises

Overview of Discretization
Discretization means approximating a problem in terms of discrete quantities. The FV method and others, such as the finite element and finite difference methods, all discretize the problem as follows:
- Spatial discretization: defining the solution domain by a set of points that fill and bound a region of space when connected;
- Temporal discretization: (for transient problems) dividing the time domain into a finite number of time intervals, or steps;
- Equation discretization: generating a system of algebraic equations in terms of discrete quantities defined at specific locations in the domain, from the PDEs that characterize the problem.

Linear System Solvers in OpenFOAM
- PBiCG: preconditioned bi-conjugate gradient solver for asymmetric matrices;
- PCG: preconditioned conjugate gradient solver for symmetric matrices;
- GAMG: generalized geometric-algebraic multi-grid solver;
- smoothSolver: solver using a smoother, for both symmetric and asymmetric matrices;
- diagonalSolver: diagonal solver for both symmetric and asymmetric matrices.
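The solver for each field is chosen per matrix in a case's system/fvSolution dictionary. As an illustrative example (typical entries, not taken from these slides):

```
solvers
{
    p
    {
        solver          PCG;    // symmetric matrix (pressure)
        preconditioner  DIC;
        tolerance       1e-06;
        relTol          0;
    }

    U
    {
        solver          PBiCG;  // asymmetric matrix (momentum)
        preconditioner  DILU;
        tolerance       1e-05;
        relTol          0;
    }
}
```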

Linear System Solvers in OpenFOAM
Preconditioners:
- Diagonal incomplete-Cholesky (DIC)
- Diagonal incomplete-LU (DILU)
- GAMG preconditioner
Smoothers:
- Diagonal incomplete-Cholesky (DIC)
- Diagonal incomplete-LU (DILU)
- Gauss-Seidel
Variants of DIC and DILU exist with additional Gauss-Seidel smoothing.

Interface for Linear System Solvers
(Diagram.) OpenFOAM's lduMatrix class hands the matrix A and the right-hand side b to the GPGPU linear system solver, which solves A x = b and returns the solution vector x.

CUDA for FOAM Link (cufflink)
Cuda For FOAM Link (cufflink) is an open-source library linking numerical methods written in NVIDIA's Compute Unified Device Architecture (CUDA) C/C++ programming language with OpenFOAM. Currently, the library utilizes the sparse linear solvers of Cusp and methods from Thrust to solve the linear system A x = b derived from OpenFOAM's lduMatrix class, and returns the solution vector. Cufflink is designed to exploit the coarse-grained parallelism of OpenFOAM (via domain decomposition) to allow multi-GPU parallelism at the level of the linear system solver. Currently it only supports the OpenFOAM-extend fork of the OpenFOAM code. https://code.google.com/p/cufflink-library/

CUSP: A C++ Templated Sparse Matrix Library
http://code.google.com/p/cusp-library/
Cusp is a library for sparse linear algebra and graph computations on CUDA. Cusp provides a flexible, high-level interface for manipulating sparse matrices and solving sparse linear systems. [2]
Provided template solvers: (Bi-) Conjugate Gradient (-Stabilized), GMRES
Matrix storage: CSR, COO, HYB, DIA
Provided preconditioners:
- Jacobi (diagonal) preconditioner
- Sparse approximate inverse preconditioner
- Smoothed-aggregation algebraic multigrid preconditioner

Thrust
http://code.google.com/p/thrust/
Thrust is a CUDA library of parallel algorithms with an interface resembling the C++ Standard Template Library (STL). Thrust provides a flexible, high-level interface for GPU programming that greatly enhances developer productivity. [3]

How Cufflink Works
(Diagram: OpenFOAM lduMatrix class, Thrust methods, Cusp solver on GPU.)
1. The thrust::copy method converts the lduMatrix data into COO format.
2. The data in COO format is transferred to GPU memory using CUDA code.
3. On the GPU, the COO data is converted into different formats and passed into the Cusp-based solver along with the convergence criteria.
4. Residuals are calculated based on OpenFOAM's normalized residual method.
5. The solution vector x is passed back to OpenFOAM using Thrust methods, along with GPU solver performance data.

Current Cufflink Solvers
- cufflink_ainvpbicgstab
- cufflink_ainvpcg
- cufflink_cg
- cufflink_diagpbicgstab
- cufflink_diagpcg
- cufflink_smapcg
These solvers also have parallel versions which work in multi-GPU setups using OpenFOAM's domain decomposition methods.
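A cufflink solver is selected the same way as the built-in ones, by naming it in the case's fvSolution dictionary. The entry below is an assumption for illustration (only the solver names come from this slide; the surrounding keywords are typical, not confirmed by cufflink's documentation):

```
p
{
    solver          cufflink_cg;   // hypothetical entry; check the cufflink docs
    tolerance       1e-10;
    maxIter         1000;
}
```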

Performance Data
(Taken from the Optimization, HPC, and Pre- and Post-Processing I session, 6th OpenFOAM Workshop, Penn State University, June 15th 2011.)
Preliminary results: a test problem, the 2D heat equation ∇²T = 0. Vary N from 10 to 2000, where N² = nCells.

Preliminary Results: Solver Settings
All CG solvers: tolerance 1e-10; maxIter 1000.
GAMG:
    solver                  GAMG;
    tolerance               1e-10;
    smoother                GaussSeidel;
    nPreSweeps              0;
    nPostSweeps             2;
    cacheAgglomeration      true;
    nCellsInCoarsestLevel   sqrt(nCells);
    agglomerator            faceAreaPair;
    mergeLevels             1;

Preliminary Results: Setup
- CUDA version 4.0, CUSP version 0.2, Thrust version 1.4
- Ubuntu 10.04
- CPU: dual Intel Xeon quad-core E5430, 2.66 GHz
- Motherboard: Tyan S5396
- RAM: 24 GB
- GPU: Tesla C2050, 3 GB GDDR5; 515 GFLOPS peak double precision, 1.03 TFLOPS peak single precision; 14 MPs x 32 cores/MP = 448 cores
- Host-device memory bandwidth: 1566 MB/s (motherboard specific)

Results (charts):
- Solve() time comparison: solve time [seconds] versus nCells (up to 4.5 million) for CG, DPCG, DPCG-parallel4, DPCG-parallel6-s231, GAMG, cusplink_cg, cusplink_dpcg and cusplink_smapcg.
- Speedup comparisons (Speedup = Ts/Tp = T_OFCG/T_other) versus nCells for the same solvers, including DPCG-parallel6-s161 and GAMG6.

CUDA Solvers in foam-extend-3.0
The cufflink library is built in since foam-extend-3.0, and includes the following solvers: cudaBiCGStab and cudaCG. Right now, compiling the CUDA solvers in foam-extend-3.0 is very hard due to the lack of documentation and tutorials. In the near future, improvements to the GPGPU solvers in the foam-extend fork of OpenFOAM are expected from the foam-extend community.

Considerations about the Future
- Improvements to the Cusp-based solvers could reduce the impact of the memory bottleneck between GPU and main memory.
- A different open-source sparse linear solver could replace the Cusp-based ones for better performance; however, this is not a trivial task.
- Right now, multiple GPUs on one node are supported; multi-node GPU support would be better for very large simulations where one node is not enough.
- The problem size must be big enough to compensate for the GPU memory-transfer overhead.

GPGPU vs CPU
(Additional hardware comparison figures, taken from reference (1).)

Q & A

References
1. Karl Rupp. "CPU, GPU and MIC Hardware Characteristics over Time." Retrieved from http://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-overtime/ on 21.01.2014.
2. Daniel P. Combest, P.A. Ramachandran, M.P. Dudukovic. "Implementing Fast Parallel Linear System Solvers in OpenFOAM based on CUDA." 6th OpenFOAM Workshop, Penn State University, June 15th 2011.
3. The OpenFOAM Extend Project tutorials.