OpenFOAM + GPGPU. İbrahim Özküçük


1 OpenFOAM + GPGPU İbrahim Özküçük

2 Outline
- GPGPU vs CPU
- GPGPU plugins for OpenFOAM
- Overview of Discretization
- CUDA for FOAM Link (cufflink)
- Cusp & Thrust Libraries
- How Cufflink Works
- Performance data of Cufflink solvers
- CUDA Solvers in foam-extend-3.0
- Considerations about future
- Linear System Solvers in OpenFOAM

3-4 GPGPU vs CPU [Figures taken from reference (1)]

5 OpenFOAM GPGPU Solvers
- SpeedIT Plugin to OpenFOAM: Conjugate Gradient & BiConjugate Gradient
- ofgpu: GPU Linear Solvers for OpenFOAM
- Culises: GPU power for OpenFOAM

6 Overview of Discretization
The term discretization means approximation of a problem into discrete quantities. The FV method and others, such as the finite element and finite difference methods, all discretize the problem as follows:
- Spatial discretization: defining the solution domain by a set of points that fill and bound a region of space when connected;
- Temporal discretization (for transient problems): dividing the time domain into a finite number of time intervals, or steps;
- Equation discretization: generating a system of algebraic equations in terms of discrete quantities defined at specific locations in the domain, from the PDEs that characterize the problem.

7 Linear System Solvers in OpenFOAM
- PBiCG: preconditioned bi-conjugate gradient solver for asymmetric matrices;
- PCG: preconditioned conjugate gradient solver for symmetric matrices;
- GAMG: generalized geometric-algebraic multi-grid solver;
- smoothSolver: solver using a smoother for both symmetric and asymmetric matrices;
- diagonalSolver: diagonal solver for both symmetric and asymmetric matrices.

8 Linear System Solvers in OpenFOAM
Preconditioners:
- Diagonal incomplete-Cholesky (DIC)
- Diagonal incomplete LU (DILU)
- GAMG preconditioner
Smoothers:
- Diagonal incomplete-Cholesky (DIC)
- Diagonal incomplete LU (DILU)
- Gauss-Seidel
- Variants of DIC and DILU exist with additional Gauss-Seidel smoothing
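As an illustration of what a preconditioned conjugate gradient solver does, here is a minimal sketch of CG with a diagonal (Jacobi) preconditioner, the combination the performance slides label DPCG. It is not OpenFOAM's implementation: the dense `Mat` storage and the names `matVec`, `dot` and `diagPCG` are assumptions made for readability, and the stopping test uses the plain 2-norm of the residual rather than OpenFOAM's normalized residual.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // dense storage, for illustration only

// y = A * x for a small dense matrix.
Vec matVec(const Mat& A, const Vec& x) {
    Vec y(A.size(), 0.0);
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < x.size(); ++j)
            y[i] += A[i][j] * x[j];
    return y;
}

// Inner product of two vectors of equal length.
double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Conjugate gradient with a diagonal (Jacobi) preconditioner.
// A must be symmetric positive definite; the initial guess is x = 0.
Vec diagPCG(const Mat& A, const Vec& b, double tol, int maxIter) {
    const std::size_t n = b.size();
    Vec x(n, 0.0), r = b, z(n), p(n);
    for (std::size_t i = 0; i < n; ++i) z[i] = r[i] / A[i][i];  // z = M^-1 r
    p = z;
    double rz = dot(r, z);
    for (int it = 0; it < maxIter && std::sqrt(dot(r, r)) > tol; ++it) {
        const Vec Ap = matVec(A, p);
        const double alpha = rz / dot(p, Ap);
        for (std::size_t i = 0; i < n; ++i) {
            x[i] += alpha * p[i];       // update the solution
            r[i] -= alpha * Ap[i];      // update the residual
            z[i] = r[i] / A[i][i];      // apply the preconditioner
        }
        const double rzNew = dot(r, z);
        const double beta = rzNew / rz;
        for (std::size_t i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
        rz = rzNew;
    }
    return x;
}
```

For the 2x2 system with A = [[4, 1], [1, 3]] and b = [1, 2], calling `diagPCG(A, b, 1e-10, 100)` should approach x = [1/11, 7/11]. The matrix-vector product and the dot products dominate the cost, which is why these kernels are the natural candidates for GPU offloading.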

9 Interface for Linear System Solvers
[Diagram: OpenFOAM's lduMatrix class passes A and b of the system Ax = b to the GPGPU linear system solver, which returns x_solution.]

10 CUDA for FOAM Link (cufflink)
Cuda For FOAM Link (cufflink) is an open-source library for linking numerical methods based on Nvidia's Compute Unified Device Architecture (CUDA) C/C++ programming language and OpenFOAM. Currently, the library utilizes the sparse linear solvers of Cusp and methods from Thrust to solve the linear Ax = b system derived from OpenFOAM's lduMatrix class and return the solution vector. Cufflink is designed to utilize the coarse-grained parallelism of OpenFOAM (via domain decomposition) to allow multi-GPU parallelism at the level of the linear system solver. Currently it supports only the OpenFOAM-extend fork of the OpenFOAM code.

11 CUSP: A C++ Templated Sparse Matrix Library (cusp-library)
Cusp is a library for sparse linear algebra and graph computations on CUDA. Cusp provides a flexible, high-level interface for manipulating sparse matrices and solving sparse linear systems. [2]
Provided template solvers:
- (Bi-) Conjugate Gradient (-Stabilized)
- GMRES
Matrix storage:
- CSR, COO, HYB, DIA
Provided preconditioners:
- Jacobi (diagonal) preconditioners
- Sparse approximate inverse preconditioner
- Smoothed-aggregation algebraic multigrid preconditioner
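To make two of the storage formats listed above concrete, the following sketch converts a matrix from COO (one row index, column index and value per non-zero) to CSR (row offsets plus column indices and values). It is plain host-side C++ and does not use Cusp; the `Coo` and `Csr` structs and the `cooToCsr` function are hypothetical names introduced only for this example.

```cpp
#include <cstddef>
#include <vector>

// COO: one (row, col, val) triple per non-zero entry.
struct Coo {
    int rows;
    std::vector<int> row, col;
    std::vector<double> val;
};

// CSR: rowPtr[i]..rowPtr[i+1] delimits the entries of row i.
struct Csr {
    std::vector<int> rowPtr, col;
    std::vector<double> val;
};

Csr cooToCsr(const Coo& a) {
    Csr c;
    c.rowPtr.assign(a.rows + 1, 0);
    c.col.resize(a.val.size());
    c.val.resize(a.val.size());

    // Count the entries in each row.
    for (int r : a.row) ++c.rowPtr[r + 1];

    // Prefix sum turns the counts into row start offsets.
    for (int i = 0; i < a.rows; ++i) c.rowPtr[i + 1] += c.rowPtr[i];

    // Scatter each entry into the next free slot of its row.
    std::vector<int> next(c.rowPtr.begin(), c.rowPtr.end() - 1);
    for (std::size_t k = 0; k < a.val.size(); ++k) {
        const int dst = next[a.row[k]]++;
        c.col[dst] = a.col[k];
        c.val[dst] = a.val[k];
    }
    return c;
}
```

COO is simple to build from unordered entries, while CSR stores row indices implicitly in `rowPtr`, so row-wise operations such as a matrix-vector product need no per-entry row lookup.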

12 Thrust
Thrust is a CUDA library of parallel algorithms with an interface resembling the C++ Standard Template Library (STL). Thrust provides a flexible high-level interface for GPU programming that greatly enhances developer productivity. [3]

13-18 How Cufflink Works
1. OpenFOAM's lduMatrix class supplies A and b of the system Ax = b.
2. Thrust methods: the thrust::copy method converts the lduMatrix data into COO format.
3. The data in COO format is transferred to GPU memory using CUDA code.
4. On the GPU, the COO data is converted into other formats and passed to a Cusp-based solver along with a convergence criterion.
5. Residuals are calculated using OpenFOAM's normalized residual method.
6. The solution vector x is passed back to OpenFOAM using Thrust methods, along with the GPU solver performance data.
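The conversion of lduMatrix data into COO format can be sketched on the host as follows. This is not Cufflink's code: it uses plain loops instead of thrust::copy, and the `Ldu` struct is a hypothetical stand-in for the arrays an lduMatrix holds (diagonal, lower and upper coefficients, plus owner and neighbour cell addressing per internal face). The sketch assumes the usual LDU convention in which `upper[f]` is the coefficient at (owner, neighbour) and `lower[f]` the coefficient at (neighbour, owner).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the data stored by an lduMatrix.
struct Ldu {
    std::vector<double> diag;    // one coefficient per cell
    std::vector<double> lower;   // one coefficient per internal face
    std::vector<double> upper;   // one coefficient per internal face
    std::vector<int> lowerAddr;  // owner cell of each face
    std::vector<int> upperAddr;  // neighbour cell of each face
};

// COO triples: entry k is A(row[k], col[k]) = val[k].
struct CooMatrix {
    std::vector<int> row, col;
    std::vector<double> val;
};

CooMatrix lduToCoo(const Ldu& m) {
    CooMatrix c;
    // Diagonal entries first: A(i, i) = diag[i].
    for (std::size_t i = 0; i < m.diag.size(); ++i) {
        c.row.push_back(static_cast<int>(i));
        c.col.push_back(static_cast<int>(i));
        c.val.push_back(m.diag[i]);
    }
    // Each internal face contributes one upper and one lower entry.
    for (std::size_t f = 0; f < m.upper.size(); ++f) {
        const int own = m.lowerAddr[f];
        const int nei = m.upperAddr[f];
        c.row.push_back(own);  // A(owner, neighbour) = upper[f]
        c.col.push_back(nei);
        c.val.push_back(m.upper[f]);
        c.row.push_back(nei);  // A(neighbour, owner) = lower[f]
        c.col.push_back(own);
        c.val.push_back(m.lower[f]);
    }
    return c;
}
```

The resulting triples are unordered by row, which COO permits; a sort or a conversion to another format would follow on the GPU side.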

19 Current Cufflink Solvers
- cufflink_ainvpbicgstab
- cufflink_ainvpcg
- cufflink_cg
- cufflink_diagpbicgstab
- cufflink_diagpcg
- cufflink_smapcg
These solvers also have parallel versions, which work in multi-GPU setups by using OpenFOAM's domain decomposition methods.

20 Performance Data taken from Optimization, HPC, and Pre- and Post-Processing I Session. 6th OpenFOAM Workshop, Penn State University. June 15th 2011
Preliminary Results: a test problem.
2D heat equation: $\nabla^2 T = 0$
Vary N, where $N^2$ = ncells

21 Performance Data: Preliminary Results, Solver Settings
All CG solvers: Tolerance = 1e-10; MaxIter 1000;
GAMG:
    solver GAMG;
    tolerance 1e-10;
    smoother GaussSeidel;
    nPreSweeps 0;
    nPostSweeps 2;
    cacheAgglomeration true;
    nCellsInCoarsestLevel sqrt(ncells);
    agglomerator faceAreaPair;
    mergeLevels 1;

22 Performance Data: Preliminary Results, Setup
- CUDA version 4.0
- CUSP version 0.2
- Thrust version 1.4
- Ubuntu
- CPU: Dual Intel Xeon Quad Core E GHz
- Motherboard: Tyan S5396
- RAM: 24 GB
- GPU: Tesla C2050, 3 GB GDDR5, 515 Gflops peak double precision, 1.03 Tflops peak single precision, 14 MP * 32 cores/MP = 448 cores
- Host-device memory bandwidth = 1566 MB/sec (motherboard specific)

23 Performance Data: Solve() Time Comparison
[Figure: solve time in seconds versus ncells. Series: cusplink_smapcg, GAMG, cusplink_dpcg, cusplink_cg, DPCG-parallel4, DPCG-parallel6-s231, DPCG, CG.]

24 Performance Data: Speedup Comparison
$\text{Speedup} = T_s / T_p = T_{\text{OFCG}} / T_{\text{other}}$
[Figure: speedup versus ncells. Series: DPCG, DPCG-parallel4, DPCG-parallel6-s231, DPCG-parallel6-s161, cusplink_dpcg, cusplink_cg.]

25 Performance Data: Speedup Comparison
$\text{Speedup} = T_s / T_p = T_{\text{OFCG}} / T_{\text{other}}$
[Figure: speedup versus ncells. Series: DPCG, DPCG-parallel4, DPCG-parallel6-s231, DPCG-parallel6-s161, cusplink_cg, cusplink_dpcg, GAMG, GAMG6, cusplink_smapcg.]

26 Performance Data: Speedup Comparison
$\text{Speedup} = T_s / T_p = T_{\text{OFCG}} / T_{\text{other}}$
[Figure: speedup versus ncells. Series: DPCG, DPCG-parallel4, DPCG-parallel6-s231, DPCG-parallel6-s161, cusplink_cg, cusplink_dpcg, GAMG6, GAMG, cusplink_smapcg.]

27 CUDA Solvers in foam-extend-3.0
The Cufflink library has been built in since foam-extend-3.0, which includes the following solvers: cudabicgstab, cudacg. At present, compiling the CUDA solvers in foam-extend-3.0 is very hard because documentation and tutorials are lacking. The foam-extend community expects improvements to the GPGPU solvers in the near future.

28 Considerations about Future
- Improvements to the Cusp-based solvers that reduce the effect of the memory-transfer bottleneck between the GPU and main memory.
- A different open-source sparse linear solver library could replace the Cusp-based solvers to improve performance. However, this is not a trivial task!
- At present, multiple GPUs on one node are supported; support for GPUs across multiple nodes would benefit very large simulations for which one node is not enough.
- The problem size must be large enough to compensate for the GPU memory-transfer overhead.

29 GPGPU vs CPU

30-33 GPGPU vs CPU [Figures taken from reference (1)]

34 Q & A

35 References
1. Karl Rupp. CPU, GPU and MIC Hardware Characteristics over Time. Retrieved from [URL missing] on [date missing].
Daniel P. Combest, P.A. Ramachandran, M.P. Dudukovic. Implementing Fast Parallel Linear System Solvers in OpenFOAM Based on CUDA. 6th OpenFOAM Workshop, Penn State University, June 15th 2011.
The OpenFOAM Extend Project tutorials.
