Renewable energies | Eco-friendly production | Innovative transport | Eco-efficient processes | Sustainable resources

Enhanced Oil Recovery Simulation Performances on New Hybrid Architectures
A. Anciaux, J-M. Gratien, O. Ricois, T. Guignon, P. Theveny, M. Hacene
Direction Technologie, Informatique et Mathématiques appliquées
GTC 2014, 26/03/2014
ArcEOR reservoir simulator

New-generation research reservoir simulator (RS) based on the Arcane/ArcGeoSim platform:
- Parallel grid management
- Physics
- Numerical services: schemes, non-linear solvers, linear solvers

Focus on Enhanced Oil Recovery processes: thermal simulation with steam.
Linear solver inside the RS

At each time step: a non-linear system to solve (Newton). For each Newton iteration: solve Ax = b (BiCGStab + preconditioner).
Typical Black Oil simulation: 80% of the time is spent in the linear solver.

A: unstructured sparse matrix
- Non-symmetric
- Block CSR format (3x3: Black Oil, 3 phases)
- Adjacency graph close to the reservoir grid connectivity
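As a reminder of the Krylov loop involved, here is a textbook, unpreconditioned BiCGStab in plain Python on a dense toy matrix. This is only a sketch of the iteration structure: the real solver is preconditioned and operates on Block CSR matrices on the GPU, and all names here are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, x):
    return [dot(row, x) for row in A]

def bicgstab(A, b, tol=1e-10, max_iter=200):
    """Textbook BiCGStab (van der Vorst), no preconditioner."""
    n = len(b)
    x = [0.0] * n
    r = [bi - yi for bi, yi in zip(b, matvec(A, x))]
    r_hat = r[:]                      # shadow residual
    rho = alpha = omega = 1.0
    v = [0.0] * n
    p = [0.0] * n
    for _ in range(max_iter):
        rho_new = dot(r_hat, r)
        beta = (rho_new / rho) * (alpha / omega)
        p = [ri + beta * (pi - omega * vi) for ri, pi, vi in zip(r, p, v)]
        v = matvec(A, p)
        alpha = rho_new / dot(r_hat, v)
        s = [ri - alpha * vi for ri, vi in zip(r, v)]
        if dot(s, s) ** 0.5 < tol:    # early exit: x + alpha.p is accurate
            x = [xi + alpha * pi for xi, pi in zip(x, p)]
            break
        t = matvec(A, s)
        omega = dot(t, s) / dot(t, t)
        x = [xi + alpha * pi + omega * si for xi, pi, si in zip(x, p, s)]
        r = [si - omega * ti for si, ti in zip(s, t)]
        rho = rho_new
        if dot(r, r) ** 0.5 < tol:
            break
    return x

# Small non-symmetric test system: exact solution is [0.1, 0.6]
A = [[4.0, 1.0], [2.0, 3.0]]
x = bicgstab(A, [1.0, 2.0])
```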
GPU linear solver inside the RS

Can we accelerate the solver with a GPU? What do we need on the GPU?
- Sparse matrix-vector product (SpMV)
- Preconditioner
- Basic vector linear algebra (cuBLAS)
SpMV on GPU

Block CSR matrices: non-zero elements are small blocks (1x1, 2x2, 3x3, 4x4, ...).
SpMV exploits the block structure to reduce the indirection cost.
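The indirection saving can be seen in a minimal block-CSR (BCSR) SpMV sketch: one column index is loaded per block instead of one per scalar (nine per 3x3 block). This plain-Python version mirrors the Black Oil layout; the storage names are illustrative, not the actual kernel.

```python
B = 3  # block size (3x3: Black Oil, 3 phases)

def bcsr_spmv(row_ptr, block_col, block_val, x):
    """y = A.x with A stored as dense B*B blocks (row-major) in block_val."""
    n_brows = len(row_ptr) - 1
    y = [0.0] * (n_brows * B)
    for bi in range(n_brows):
        for k in range(row_ptr[bi], row_ptr[bi + 1]):
            bj = block_col[k]           # one index load per block
            blk = block_val[k]          # B*B scalars
            for r in range(B):
                acc = 0.0
                for c in range(B):
                    acc += blk[r * B + c] * x[bj * B + c]
                y[bi * B + r] += acc
    return y

# 2x2 block matrix whose only blocks are identity diagonal blocks
I3 = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
row_ptr = [0, 1, 2]
block_col = [0, 1]
block_val = [I3, I3]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = bcsr_spmv(row_ptr, block_col, block_val, x)  # identity: y == x
```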
SpMV on GPU

We also use the texture cache for x:
1. Bind texture to x
2. Compute y = A.x
3. Synchronize
4. Unbind texture

Compared to cuSPARSE ELLPACK (the best-performing cuSPARSE format on our matrices). cuSPARSE also provides a Block CSR format (BSR):
- Not as fast as point cuSPARSE ELLPACK on our systems
- Slower than our CSR kernel for 3x3 blocks, close to ELLPACK for 4x4
- Directly uses the original structure (no csr2ell conversion)
Single GPU SpMV performance

Figure: SpMV throughput (MFLOPS) on NVIDIA K20c/K40c/K40c Boost (no ECC) compared to Intel E5-2680 @ 2.7 GHz. Series: IFPEN SpMV v2 and cuSPARSE ELLPACK on K40 and K40 Boost; CPU with 4 and 8 cores. Matrices: Canta (3x3, n=24048), MSUR_9 (4x4, n=86240), IvaskBO (3x3, n=148716), GCSN1 (3x3, n=556594), GCS2K (3x3, n=1112946), spe10 (2x2, n=21888426). GPU throughput peaks around 45 GFLOPS; annotated speedups on the chart: x1.7, x4.2 and x17.
Polynomial

Neumann polynomial: P(x) = x + N.x + N^2.x + ... + N^d.x, with N = I - w.D^-1.A
- Only requires SpMV and vector algebra
- As a preconditioner: y = P(x)
- Highly parallel in every context (MPI, GPU, Pthreads)
- (Very) low numerical efficiency
- High FLOP cost: degree d means d SpMVs per application
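Applying the Neumann polynomial reduces to d SpMVs plus vector updates, using N.t = t - w.D^-1.(A.t). A minimal sketch on a dense toy matrix, with illustrative values for the relaxation weight w and degree d:

```python
def poly_apply(A, diag, x, w=1.0, d=3):
    """y = P(x) = x + N.x + ... + N^d.x, N = I - w.D^-1.A."""
    n = len(x)
    t = x[:]           # t holds N^k.x
    y = x[:]           # accumulates the polynomial sum
    for _ in range(d):
        At = [sum(A[i][j] * t[j] for j in range(n)) for i in range(n)]  # SpMV
        t = [ti - w * Ati / di for ti, Ati, di in zip(t, At, diag)]     # N.t
        y = [yi + ti for yi, ti in zip(y, t)]
    return y

# Sanity check: with A purely diagonal and w = 1, N = 0 so P(x) = x
A = [[2.0, 0.0], [0.0, 4.0]]
y = poly_apply(A, [2.0, 4.0], [1.0, 1.0], w=1.0, d=3)
```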
ILU0 on GPU: graph coloring

With the natural ordering (ijk reservoir grid), the LU solve exhibits a low degree of parallelism.
Color the matrix adjacency graph:
- For each node (equation), set a color different from its neighbors'
- Minimize the number of colors (maximize the number of nodes per color)
- Permute the matrix by ordering the equations color by color
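A greedy sketch of the coloring step: each node takes the smallest color not used by its already-colored neighbors. The production algorithm additionally tries to minimize the color count and balance color sizes; the 2x2 grid adjacency below is purely illustrative.

```python
def greedy_color(adj):
    """adj[u] = list of neighbors of node u; returns one color per node."""
    color = [-1] * len(adj)
    for u in range(len(adj)):
        used = {color[v] for v in adj[u] if color[v] >= 0}
        c = 0
        while c in used:      # smallest color absent from the neighborhood
            c += 1
        color[u] = c
    return color

# 2x2 grid graph: edges 0-1, 0-2, 1-3, 2-3 (checkerboard -> 2 colors)
adj = [[1, 2], [0, 3], [0, 3], [1, 2]]
colors = greedy_color(adj)
```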
ILU0 on GPU: permuted solve

L_i, U_i: sparse off-diagonal blocks; D_i: diagonal blocks.

Solve L.x = y (forward, color by color):
1. x_1 = D_1^-1.y_1
2. x_2 = D_2^-1.(y_2 - L_2.x_1)   (one SpMV per color)
3. ... block triangular solve

Solve U.x = y (backward):
1. x_4 = D_4^-1.y_4
2. x_3 = D_3^-1.(y_3 - U_3.x_4)
3. ...
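The forward sweep above can be sketched as follows: within one color there are no dependencies, so each color is one parallel diagonal solve plus one SpMV against the previously solved colors. Dense toy storage and the 2-color split are illustrative assumptions.

```python
def color_lower_solve(L, diag, y, color_ptr):
    """Solve (D + L).x = y color by color; rows color_ptr[k]..color_ptr[k+1]
    belong to color k (the matrix is already permuted color by color)."""
    n = len(y)
    x = [0.0] * n
    for k in range(len(color_ptr) - 1):
        lo, hi = color_ptr[k], color_ptr[k + 1]
        for i in range(lo, hi):                          # parallel on GPU
            s = sum(L[i][j] * x[j] for j in range(lo))   # SpMV on prior colors
            x[i] = (y[i] - s) / diag[i]
    return x

L = [[0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0],
     [1.0, 1.0, 0.0]]     # rows 0-1: color 0, row 2: color 1
x = color_lower_solve(L, [2.0, 2.0, 2.0], [2.0, 4.0, 8.0], [0, 2, 3])
```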
Color ILU0: performance and drawback

Two colors hold the vast majority of the nodes; GCS2K example:

Color | Number of vertices
0     | 179635
1     | 179634
2     | 5754
3     | 5630
4     | 179
5     | 149
6     | 1

System  | ILU0 CPU solve (1 core) | Color ILU0 GPU solve (1 GPU) | GPU/CPU acceleration
spe10   | 3.18e+08                | 1.99e+07                     | 16
GCS2K   | 2.75e+08                | 1.67e+07                     | 16.5
IvaskBO | 6.30e+07                | 5.14e+06                     | 12.2

(K40c no ECC / E5-2680; average processor cycles per LU solve)

But coloring has a negative impact on Krylov solver convergence.
AMGP and CPR-AMG

Split the linear system Ax = r:
A = [A_11 A_12 ; A_21 A_22],  x = [X_1 ; X_2],  r = [R_1 ; R_2]
A_11 is the pressure block, A_22 is the saturation block.

AMGP: only solve the pressure block with AMG: A_11.X_1 = R_1
AMGP and CPR-AMG

CPR-AMG: solve A_11 with AMG and the whole system with a simple preconditioner (ILU0 or polynomial):
1. Solve A.x^(1) = r with ILU0; x^(1) = [X_1^(1) ; X_2^(1)]
2. Solve A_11.X_1^(2) = R_1 - A_11.X_1^(1) - A_12.X_2^(1) with AMG (the residual restricted to the pressure equations, one SpMV)
3. Final preconditioned solution: [X_1^(1) + X_1^(2) ; X_2^(1)]

We use AmgX from NVIDIA.
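The two-stage combination can be sketched compactly. Here `jacobi` and `amg` are hypothetical stand-ins for the real ILU0/polynomial stage and the AmgX pressure solve; on this tiny upper-triangular toy system the combination happens to be exact.

```python
def cpr_apply(A, n_p, stage1_apply, amg_solve, r):
    """Two-stage CPR: full-system preconditioner, then AMG pressure fix.
    Rows 0..n_p-1 are the pressure equations."""
    x1 = stage1_apply(r)                  # stage 1 on the whole system
    n = len(r)
    # residual of the pressure equations after stage 1 (one SpMV)
    res_p = [r[i] - sum(A[i][j] * x1[j] for j in range(n))
             for i in range(n_p)]
    dp = amg_solve(res_p)                 # stage 2: AMG on A_11
    return [x1[i] + dp[i] for i in range(n_p)] + x1[n_p:]

# Toy 2x2 system: row 0 = pressure, row 1 = saturation
A = [[2.0, 1.0], [0.0, 4.0]]
jacobi = lambda r: [r[0] / 2.0, r[1] / 4.0]   # stand-in for ILU0/poly
amg = lambda rp: [rp[0] / 2.0]                # exact A_11 solve stand-in
x = cpr_apply(A, 1, jacobi, amg, [2.0, 4.0])  # -> [0.5, 1.0] (exact here)
```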
Spe10: 30-day simulation with GPU

Total solver time inside the 30-day simulation (1e-4):

Solver type                                   | Total iterations | Total solver time (s)
ILU0 CPU, 1 core                              | 3498             | 820
Block Jacobi ILU0 CPU, 8 cores MPI            | 3812             | 270
Block Jacobi ILU0 CPU, 16 cores MPI           | 3756             | 145
CPR-AMG CPU (IFPSolver), 1 core               | 361              | 430
CPR-AMG CPU (IFPSolver), 8 cores MPI          | 497              | 143
CPR-AMG CPU (IFPSolver), 16 cores MPI         | 413              | 68
Color ILU0 GPU, 1 core/1 GPU                  | 10977            | 153
Poly GPU, 1 core/1 GPU                        | 8765             | 186
AMGP AmgX PMIS GPU, 1 core/1 GPU              | 989              | 62
CPR-AMG AmgX PMIS + Poly GPU, 1 core/1 GPU    | 538              | 55

(K40c no ECC / E5-2680, 2 sockets)
Black Oil thermal simulation

200K cells, 30-day simulation (1e-7), easy case:

Solver type                                        | Total iterations | Total solver time (s) | Total setup time (s)
ILU0 CPU, 1 core                                   | 297              | 37                    | 8
Block Jacobi ILU0 CPU, 8 cores MPI                 | 460              | 19.5                  | x
Block Jacobi ILU0 CPU, 16 cores MPI                | 398              | 10.5                  | x
CPR-AMG CPU (IFPSolver), 1 core                    | 131              | 60                    | x
Color ILU0 GPU, 1 core/1 GPU                       | 450              | 6                     | 2.1
Poly GPU, 1 core/1 GPU                             | 476              | 8                     | 2.8
AMGP AmgX PMIS GPU, 1 core/1 GPU                   | 556              | 22                    | 12
CPR-AMG AmgX PMIS + Poly GPU, 1 core/1 GPU         | 143              | 20                    | 13
CPR-AMG AmgX PMIS + Color ILU0 GPU, 1 core/1 GPU   | 129              | 37                    | 29

(K40c no ECC / E5-2680, 2 sockets)
GPU with MPI

2 primary objectives:
- How to build a hybrid GPU+MPI SpMV; does it work efficiently?
- How does the full solver behave (with the polynomial preconditioner)? Does it scale?

Test system: Bullx blades, 2x E5-2470 @ 2.3 GHz + 2x K20m per node (ECC on)
5 nodes: 80 cores + 10 GPUs, InfiniBand backplane
GPU SpMV with MPI

Split the local SpMV for process p:
1. y^(p) = A_int^(p).x^(p)
2. Get x_ext^(p) with a halo/neighbor exchange
3. y_ext^(p) = A_ext^(p).x_ext^(p)
4. y^(p) = y^(p) + y_ext^(p)

Reorder the local matrix to minimize y_ext^(p) and x_ext^(p): equations depending on external data are placed at the end.
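The interior/exterior split above can be sketched as follows: the interior product uses only local x (and can run on the GPU while the halo exchange proceeds), then the exterior contribution is accumulated. `exchange_halo` is a hypothetical stand-in for the MPI neighbor communication; dense toy storage is used for brevity.

```python
def split_spmv(A_int, A_ext, x_local, exchange_halo):
    """y = A_int.x_local + A_ext.x_ext for one MPI process."""
    n = len(x_local)
    # 1. interior product: only local x needed, overlappable with comm
    y = [sum(A_int[i][j] * x_local[j] for j in range(n)) for i in range(n)]
    x_ext = exchange_halo(x_local)        # 2. halo / neighbor exchange
    # 3-4. exterior product and accumulation
    for i in range(n):
        y[i] += sum(A_ext[i][j] * x_ext[j] for j in range(len(x_ext)))
    return y

A_int = [[1.0, 0.0], [0.0, 1.0]]
A_ext = [[1.0], [0.0]]                    # only equation 0 touches the halo
y = split_spmv(A_int, A_ext, [1.0, 2.0], lambda x: [10.0])
```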
GPU SpMV with MPI

GPU + MPI SpMV workflow: the GPU computes the interior product y^(p) = A_int^(p).x^(p), overlapped with the x exchange on the CPU; then the exterior product y_ext^(p) = A_ext^(p).x_ext^(p) and the accumulation y^(p) = y^(p) + y_ext^(p). (Timeline not to scale.)
GPU SpMV with MPI: good news

Figure: SpMV MPI+GPU FLOPS on FiveSpot800-800-10_7 (2x2, n=12800000), K20m ECC + E5-2470 @ 2.30 GHz, 80 cores + 10 GPUs; MPI 1 core + 1 GPU per socket (cuSPARSE ELL and IFPEN v2) vs. MPI 8 cores per socket with async communication, for 8 to 80 batch-reserved cores.

12.8M cells (2x2):
- Close to x5 acceleration for 1 core + 1 GPU per socket against full socket use
- 1 node with 2 GPUs is equivalent to 5 nodes with full socket use!
GPU SpMV with MPI: bad news

Figure: SpMV MPI+GPU FLOPS on GCSN1 (3x3, n=556594), K20m ECC + E5-2470 @ 2.30 GHz, 80 cores + 10 GPUs, same configurations, for 8 to 80 batch-reserved cores.

500K cells (3x3):
- x3.5 acceleration for 2 cores + 1 GPU per socket against 16 cores (full node) at low core counts, but only x0.7 at 80 cores
- In the end the CPU is faster, thanks to L3 cache effects
Standalone MPI+GPU solver

Test with the polynomial preconditioner (spe10 matrix): multi-GPU intrinsic scalability?

                      | 1c/1g | 2c/2g | 4c/4g | 6c/6g | 8c/8g | 10c/10g
Total solver time (s) | 20.7  | 11.5  | 5.8   | 5.1   | 3.7   | 3.2
Iterations            | 669   | 704   | 616   | 748   | 634   | 626
Acceleration          | 1     | 1.8   | 3.5   | 4     | 5.6   | 6.5
Conclusions and work in progress

Thanks to AmgX: a good GPU CPR-AMG preconditioner (1 GPU).
But the "everyday" preconditioner (Color ILU0) is not good enough: a new coloring algorithm for decent numerical behavior?

MPI+GPU:
- SpMV: OK with big systems (at least 200K equations per GPU)
- CPR-AMG and Color ILU0: work in progress
Thanks

Special thanks to the NVIDIA AmgX team: Marat Arsaev and Joe Eaton; and to François Courteille (NVIDIA).
Work partially supported by the PETALH ANR project.
www.ifpenergiesnouvelles.com