Enhanced Oil Recovery Simulation Performances on New Hybrid Architectures

Enhanced Oil Recovery Simulation Performances on New Hybrid Architectures
A. Anciaux, J-M. Gratien, O. Ricois, T. Guignon, P. Theveny, M. Hacene
Direction Technologies, Informatique et Mathématiques appliquées, IFP Energies nouvelles
GTC 2014, 26/03/2014

ArcEOR reservoir simulator

A new-generation research reservoir simulator (RS) built on the Arcane/ArcGeoSim platform:
- Parallel grid management
- Physics
- Numerical services: schemes, non-linear solvers, linear solvers

Focus on Enhanced Oil Recovery processes: thermal simulation with steam.

Linear solver inside the RS

At each time step, a non-linear system is solved with Newton's method. Each Newton iteration requires solving Ax = b (BiCGStab + preconditioner). In a typical Black Oil simulation, 80% of the time is spent in the linear solver.

Properties of A:
- Unstructured sparse matrix
- Non-symmetric
- Block CSR format (3x3 blocks: Black Oil, 3 phases)
- Adjacency graph close to the reservoir grid connectivity

GPU linear solver inside the RS

Can we accelerate the solver with a GPU? What do we need on the GPU?
- Sparse matrix-vector product (SpMV)
- Preconditioner
- Basic vector linear algebra (cuBLAS)

SpMV on GPU

Block CSR matrices: the non-zero elements are small dense blocks (1x1, 2x2, 3x3, 4x4, ...). SpMV exploits this block structure to reduce the indirection cost: one column index is stored per block instead of per scalar entry, as sketched below.
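
To make the indirection saving concrete, here is a minimal CPU-side block-CSR SpMV sketch. The layout (one column index per BxB block, block values stored contiguously and row-major inside each block) is an assumption for illustration; the actual IFPEN GPU kernel is not shown in the slides.

```cpp
#include <vector>
#include <cstddef>

// Minimal block-CSR (BCSR) SpMV sketch, y = A*x, for a fixed block size B.
template <int B>
void bcsr_spmv(int nBlockRows,
               const std::vector<int>& rowPtr,   // size nBlockRows+1
               const std::vector<int>& blockCol, // one entry per block
               const std::vector<double>& val,   // B*B values per block
               const std::vector<double>& x,
               std::vector<double>& y)           // preallocated, size nBlockRows*B
{
    for (int br = 0; br < nBlockRows; ++br) {
        double acc[B] = {0.0};
        for (int k = rowPtr[br]; k < rowPtr[br + 1]; ++k) {
            const double* blk = &val[static_cast<std::size_t>(k) * B * B];
            const double* xs  = &x[blockCol[k] * B];
            // Dense BxB block times sub-vector: the indirection (blockCol)
            // is paid once per block instead of once per scalar entry.
            for (int i = 0; i < B; ++i)
                for (int j = 0; j < B; ++j)
                    acc[i] += blk[i * B + j] * xs[j];
        }
        for (int i = 0; i < B; ++i)
            y[br * B + i] = acc[i];
    }
}
```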

SpMV on GPU (continued)

We also use the texture cache for x:
1. Bind a texture to x
2. Compute y = A·x
3. Synchronize
4. Unbind the texture

We compare against cuSPARSE ELLPACK (the best-performing cuSPARSE format on our matrices). cuSPARSE also provides a Block CSR format (BSR):
- Not as fast as point cuSPARSE ELLPACK on our systems
- Slower than our block CSR for 3x3 blocks, close to ELLPACK for 4x4
- Our kernel uses the original structure directly (no csr2ell conversion)

Single GPU SpMV performances

[Figure: SpMV throughput (MFLOPS) on NVIDIA K20c/K40c/K40c Boost (ECC off) versus an Intel E5-2680 @ 2.7 GHz, comparing the IFPEN SpMV v2 kernel and cuSPARSE ELLPACK across six matrices: Canta (3x3, n=24048), MSUR_9 (4x4, n=86240), IvaskBO (3x3, n=148716), GCSN1 (3x3, n=556594), GCS2K (3x3, n=1112946) and spe10 (2x2, n=21888426). CPU 4-core and 8-core results are shown for reference. GPU throughput reaches roughly 30000 to 45000 MFLOPS on the larger matrices; annotated speedups over the CPU grow from about x1.7 on the smallest matrix, through x4.2, to about x17 on spe10.]

Polynomial preconditioner

Neumann polynomial: P(x) = x + N x + N^2 x + N^3 x + ... + N^d x, with N = I - w·D^-1·A.
- Only requires SpMV and vector algebra
- Applied as a preconditioner: y = P(x)
- Highly parallel in every context (MPI, GPU, Pthreads)
- (Very) low numerical efficiency
- High FLOP cost: degree d means d SpMVs per application (a minimal apply sketch follows)
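
To show how little machinery this needs, here is a minimal CPU-side sketch of the Neumann polynomial apply, assuming a generic spmv callback and an inverted-diagonal array (hypothetical interface; on the GPU each step maps to one SpMV kernel plus cuBLAS-style vector updates):

```cpp
#include <vector>
#include <functional>
#include <cstddef>

// y = P(x) = x + N x + ... + N^d x, with N = I - w * D^{-1} * A.
// spmv computes out = A * in; invDiag holds the inverted diagonal of A.
void neumann_apply(
    const std::function<void(const std::vector<double>&, std::vector<double>&)>& spmv,
    const std::vector<double>& invDiag,
    double w, int degree,
    const std::vector<double>& x, std::vector<double>& y)
{
    const std::size_t n = x.size();
    std::vector<double> t = x, Az(n);
    y = x;                               // degree-0 term
    for (int k = 1; k <= degree; ++k) {
        spmv(t, Az);                     // one SpMV per polynomial degree
        for (std::size_t i = 0; i < n; ++i)
            t[i] -= w * invDiag[i] * Az[i];   // t <- (I - w D^{-1} A) t = N t
        for (std::size_t i = 0; i < n; ++i)
            y[i] += t[i];                // accumulate N^k x
    }
}
```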

ILU0 on GPU: graph coloring

With the natural ordering (ijk reservoir grid), the LU solve exhibits a low degree of parallelism. Instead:
- Color the matrix adjacency graph: each node (equation) gets a color different from its neighborhood
- Minimize the number of colors (i.e. maximize the number of nodes in each color)
- Permute the matrix by ordering the equations color by color (a greedy sequential variant is sketched below)

[Figure: matrix A before and after the color permutation, showing per-color diagonal blocks.]
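
A minimal sequential sketch of the coloring step, shown for illustration only (in practice a parallel coloring such as Jones-Plassmann would typically be used on the GPU):

```cpp
#include <vector>

// Greedy coloring of the matrix adjacency graph (CSR sparsity pattern):
// each equation receives the smallest color not used by its neighbors.
std::vector<int> greedy_coloring(int n,
                                 const std::vector<int>& rowPtr,
                                 const std::vector<int>& colInd)
{
    std::vector<int> color(n, -1);
    std::vector<char> used;  // scratch: colors taken by neighbors
    for (int i = 0; i < n; ++i) {
        used.assign(n, 0);
        for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k) {
            int j = colInd[k];
            if (j != i && color[j] >= 0) used[color[j]] = 1;
        }
        int c = 0;
        while (used[c]) ++c;   // smallest available color
        color[i] = c;
    }
    return color;
}
```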

ILU0 on GPU: permuted solve

After the permutation, each diagonal block D_c couples only same-color unknowns, and the off-diagonal blocks L_c, U_c are sparse; each step below is an SpMV-like parallel operation.

Solve L·x = y, color by color (forward):
1. x_1 = D_1^-1 · y_1
2. x_2 = D_2^-1 · (y_2 - L_2·x_1)   (SpMV)
3. ... (block triangular solve over the remaining colors)

Solve U·x = y, color by color (backward):
1. x_4 = D_4^-1 · y_4
2. x_3 = D_3^-1 · (y_3 - U_3·x_4)
3. ...

A sketch of the forward sweep is given below.
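
A minimal sketch of the forward sweep under these assumptions (hypothetical flat CSR layout for the strict lower part; the inner loop over the rows of one color is the part that becomes a single GPU kernel launch):

```cpp
#include <vector>

// Forward sweep x = L^{-1} y after color permutation, sketched on the CPU.
// colorPtr[c]..colorPtr[c+1] delimits the rows of color c (rows already
// permuted color by color); L is the strict lower triangle in CSR; invDiag
// holds the inverted diagonal; x is preallocated to y.size().
void color_lower_solve(const std::vector<int>& colorPtr,
                       const std::vector<int>& rowPtr,
                       const std::vector<int>& colInd,
                       const std::vector<double>& val,
                       const std::vector<double>& invDiag,
                       const std::vector<double>& y,
                       std::vector<double>& x)
{
    int nColors = static_cast<int>(colorPtr.size()) - 1;
    for (int c = 0; c < nColors; ++c) {
        // Rows of one color only depend on rows of earlier colors,
        // so this loop is fully parallel (one GPU kernel per color).
        for (int i = colorPtr[c]; i < colorPtr[c + 1]; ++i) {
            double s = y[i];
            for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k)
                s -= val[k] * x[colInd[k]];   // entries from earlier colors
            x[i] = invDiag[i] * s;
        }
    }
}
```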

Color ILU0: performances and drawback

Two colors hold the majority of nodes. GCS2K example (number of vertices per color):
color 0: 179635, color 1: 179634, color 2: 5754, color 3: 5630, color 4: 179, color 5: 149, color 6: 1

Average processor cycles per LU solve (K40c ECC off / E5-2680):

system    ILU0 CPU solve (1 core)   Color ILU0 GPU solve (1 GPU)   GPU/CPU acceleration
spe10     3.18e+08                  1.99e+07                       x16
GCS2K     2.75e+08                  1.67e+07                       x16.5
IvaskBO   6.30e+07                  5.14e+06                       x12.2

Drawback: coloring has a negative impact on Krylov solver convergence.

AMGP and CPR-AMG

Split the linear system Ax = r into 2x2 block form:

A = [A_11  A_12; A_21  A_22],  x = [X_1; X_2],  r = [R_1; R_2]

where A_11 is the pressure block and A_22 is the saturation block.

AMGP: only solve the pressure block with AMG: A_11 · X_1 = R_1.

AMGP and CPR-AMG (continued)

CPR-AMG: solve A_11 with AMG and the whole system with a simple preconditioner (ILU0 or polynomial), in two stages:
1. Solve A·x^(1) = r with ILU0, giving x^(1) = (X_1^(1), X_2^(1))
2. Solve A_11 · X_1^(2) = R_1 - A_11·X_1^(1) - A_12·X_2^(1) with AMG (the pressure-block residual, one SpMV)

The final preconditioner solution is (X_1^(1) + X_1^(2), X_2^(1)). We use AmgX from NVIDIA for the AMG solve. A sketch of this two-stage apply follows.
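
A minimal sketch of the two-stage CPR apply built from these equations, with hypothetical callbacks standing in for the ILU0 smoother, the pressure-row SpMV, and the AmgX solve (the pressure unknowns are assumed to be laid out first in the vectors):

```cpp
#include <vector>
#include <functional>

using Vec   = std::vector<double>;
using Apply = std::function<void(const Vec&, Vec&)>;

// Two-stage CPR-AMG apply (hypothetical interface):
//   ilu0  : x   = M_ILU0^{-1} r          (smoother on the full system)
//   spmvP : out = [A_11 A_12] * x        (pressure rows of A times full x)
//   amgP  : dx  = AMG-solve on A_11      (e.g. one AmgX cycle)
// nP is the number of pressure unknowns.
void cpr_amg_apply(const Apply& ilu0, const Apply& spmvP, const Apply& amgP,
                   int nP, const Vec& r, Vec& x)
{
    ilu0(r, x);                      // stage 1: x^(1) from ILU0 on full system
    Vec Ax(nP), rp(nP), dx(nP);
    spmvP(x, Ax);                    // pressure residual: R_1 - (A x^(1))_pressure
    for (int i = 0; i < nP; ++i)
        rp[i] = r[i] - Ax[i];
    amgP(rp, dx);                    // stage 2: AMG correction on pressure block
    for (int i = 0; i < nP; ++i)
        x[i] += dx[i];               // X_1 <- X_1^(1) + X_1^(2); X_2 unchanged
}
```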

Spe10: 30-day simulation with GPU

Total solver time inside the 30-day simulation (tolerance 1e-4), K40c ECC off / E5-2680 2 sockets:

Solver type                                      Total iterations   Total solver time (s)
ILU0 CPU, 1 core                                 3498               820
Block Jacobi ILU0 CPU, 8 cores MPI               3812               270
Block Jacobi ILU0 CPU, 16 cores MPI              3756               145
CPR-AMG CPU (IFPSolver), 1 core                  361                430
CPR-AMG CPU (IFPSolver), 8 cores MPI             497                143
CPR-AMG CPU (IFPSolver), 16 cores MPI            413                68
Color ILU0 GPU, 1 core/1 GPU                     10977              153
Poly GPU, 1 core/1 GPU                           8765               186
AMGP AmgX PMIS GPU, 1 core/1 GPU                 989                62
CPR-AMG AmgX PMIS + Poly GPU, 1 core/1 GPU       538                55

Black Oil thermal simulation

200K cells, 30-day simulation (tolerance 1e-7), easy case. K40c ECC off / E5-2680 2 sockets:

Solver type                                        Total iterations   Total solver time (s)   Total setup time (s)
ILU0 CPU, 1 core                                   297                37                      8
Block Jacobi ILU0 CPU, 8 cores MPI                 460                19.5                    x
Block Jacobi ILU0 CPU, 16 cores MPI                398                10.5                    x
CPR-AMG CPU (IFPSolver), 1 core                    131                60                      x
Color ILU0 GPU, 1 core/1 GPU                       450                6                       2.1
Poly GPU, 1 core/1 GPU                             476                8                       2.8
AMGP AmgX PMIS GPU, 1 core/1 GPU                   556                22                      12
CPR-AMG AmgX PMIS + Poly GPU, 1 core/1 GPU         143                20                      13
CPR-AMG AmgX PMIS + Color ILU0 GPU, 1 core/1 GPU   129                37                      29

GPU with MPI

Two primary objectives:
- How do we build a hybrid GPU+MPI SpMV, and does it work efficiently?
- How does the full solver behave (with the polynomial preconditioner)? Does it scale?

Test system: Bullx blades, 2x E5-2470 @ 2.3 GHz + 2x K20m per node (ECC on), 5 nodes (80 cores + 10 GPUs), InfiniBand backplane.

GPU SpMV with MPI

Split the local SpMV for process p:
1. y^(p) = A_int^(p) · x^(p)   (interior part, local data only)
2. Get x_ext^(p) with a halo/neighbour exchange
3. y_ext^(p) = A_ext^(p) · x_ext^(p)   (contribution of external columns)
4. y^(p) = y^(p) + y_ext^(p)

Reorder the local matrix to minimize y_ext^(p) and x_ext^(p): equations depending on external data are placed at the end.

GPU SpMV with MPI: workflow

The GPU computes the interior product y^(p) = A_int^(p)·x^(p) while the CPU performs the x halo exchange; the exterior product y_ext^(p) = A_ext^(p)·x_ext^(p) and the final accumulation y^(p) += y_ext^(p) follow once the exchange completes. (The timeline in the slide is not to scale.) A sketch of this overlap is shown below.
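
A minimal MPI sketch of the overlap. The helpers spmv_interior / spmv_exterior are hypothetical stand-ins for the two matrix parts; in the real code they would be GPU kernel launches running while the host posts the exchange:

```cpp
#include <mpi.h>
#include <vector>
#include <cstddef>

// Hypothetical kernels for the two matrix parts (defined elsewhere):
void spmv_interior(const std::vector<double>& x, std::vector<double>& y);
void spmv_exterior(const std::vector<double>& xExt, std::vector<double>& y);

// Overlapped distributed SpMV: the interior product proceeds while the
// halo values of x travel between neighbour ranks.
// sendIdx[q] lists the local entries of x that neighbour q needs;
// xExt is preallocated to the sum of recvCount.
void mpi_gpu_spmv(const std::vector<int>& neighbours,
                  const std::vector<std::vector<int>>& sendIdx,
                  const std::vector<int>& recvCount,
                  const std::vector<double>& x,
                  std::vector<double>& xExt,
                  std::vector<double>& y)
{
    std::vector<MPI_Request> reqs;
    std::vector<std::vector<double>> sendBuf(neighbours.size());
    int off = 0;
    for (std::size_t q = 0; q < neighbours.size(); ++q) {
        // Pack and post non-blocking sends/receives of halo data.
        for (int idx : sendIdx[q]) sendBuf[q].push_back(x[idx]);
        reqs.emplace_back();
        MPI_Isend(sendBuf[q].data(), (int)sendBuf[q].size(), MPI_DOUBLE,
                  neighbours[q], 0, MPI_COMM_WORLD, &reqs.back());
        reqs.emplace_back();
        MPI_Irecv(&xExt[off], recvCount[q], MPI_DOUBLE,
                  neighbours[q], 0, MPI_COMM_WORLD, &reqs.back());
        off += recvCount[q];
    }
    spmv_interior(x, y);            // y = A_int * x, overlaps the exchange
    MPI_Waitall((int)reqs.size(), reqs.data(), MPI_STATUSES_IGNORE);
    spmv_exterior(xExt, y);         // y += A_ext * x_ext, after halo arrives
}
```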

GPU SpMV with MPI: good news

[Figure: SpMV FLOPS for MPI+GPU on FiveSpot800-800-10_7 (2x2 blocks, n=12800000), K20m ECC on + E5-2470 @ 2.30 GHz, 80 cores + 10 GPUs; curves for MPI 1 core + 1 GPU per socket (cuSPARSE ELL and IFPEN v2) versus MPI 8 cores per socket with asynchronous communication, from 8 to 80 reserved cores.]

For this 12.8M-cell (2x2) system, 1 core + 1 GPU per socket is close to x5 faster than full-socket CPU use: 1 node with 2 GPUs is equivalent to 5 nodes in full-socket use!

GPU SpMV with MPI: bad news

[Figure: SpMV FLOPS for MPI+GPU on GCSN1 (3x3 blocks, n=556594), same test system; the GPU curves start about x3.5 ahead of the CPU but fall to about x0.7 at 80 reserved cores.]

For this 500K-cell (3x3) system, 2 cores + 1 GPU per socket is about x3.5 faster than 16 cores (a full node), but in the end the CPU is faster, thanks to the L3 cache effect once the local problem becomes small.

Stand-alone MPI+GPU solver

Test with the polynomial preconditioner (spe10 matrix): multi-GPU intrinsic scalability.

Config (cores/GPUs)     1c/1g   2c/2g   4c/4g   6c/6g   8c/8g   10c/10g
Total solver time (s)   20.7    11.5    5.8     5.1     3.7     3.2
Iterations              669     704     616     748     634     626
Acceleration            1       1.8     3.5     4       5.6     6.5

Conclusions and work in progress

- Thanks to AmgX: a good GPU CPR-AMG preconditioner (1 GPU)
- But the "everyday" preconditioner (Color ILU0) is not good enough: a new coloring algorithm for decent numerical behavior?
- MPI+GPU:
  - SpMV: OK for big systems (at least 200K equations per GPU)
  - CPR-AMG and Color ILU0: work in progress

Thanks

Special thanks to the NVIDIA AmgX team, Marat Arsaev and Joe Eaton, and to François Courteille (NVIDIA). Work partially supported by the PETALH ANR project.

www.ifpenergiesnouvelles.com