Enhanced Oil Recovery simulation Performances on New Hybrid Architectures


1 Enhanced Oil Recovery simulation Performances on New Hybrid Architectures
A. Anciaux, J-M. Gratien, O. Ricois, T. Guignon, P. Theveny, M. Hacene
Direction Technologies, Informatique et Mathématiques appliquées
26/03/2014, GTC 2014
Renewable energies, Eco-friendly production, Innovative transport, Eco-efficient processes, Sustainable resources

2 ArcEOR reservoir simulator
New-generation research reservoir simulator (RS) based on the Arcane/ArcGeoSim platform:
  Parallel grid management
  Physics
  Numerical services: schemes, nonlinear solvers, linear solvers
Focus on Enhanced Oil Recovery processes: thermal simulation with steam

3 Linear solver inside RS
At each time step: a nonlinear system to solve with Newton's method.
For each Newton iteration: solve Ax = b (BiCGStab + preconditioner).
Typical BO (Black Oil) simulation: 80 % of the time is spent in the linear solver.
A: unstructured sparse matrix
  Nonsymmetric
  Block CSR format (3x3: Black Oil, 3 phases)
  Adjacency graph close to the reservoir grid connectivity
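To make the "BiCGStab + preconditioner" step concrete, here is a minimal sketch of preconditioned BiCGStab in plain Python. It is not the ArcEOR implementation: the dense list-of-lists matrix, the function names and the Jacobi preconditioner in the example are assumptions chosen only to keep the example self-contained.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    # Dense stand-in for the sparse matrix-vector product (SpMV)
    return [dot(row, v) for row in A]

def bicgstab(A, b, precond=None, tol=1e-10, max_iter=100):
    """Preconditioned BiCGStab with a zero initial guess.
    precond(v) applies the preconditioner; identity when None."""
    if precond is None:
        precond = lambda vec: list(vec)
    n = len(b)
    x = [0.0] * n
    r = list(b)                      # residual b - A.x for x = 0
    r_hat = list(r)                  # fixed shadow residual
    rho = alpha = omega = 1.0
    v = [0.0] * n
    p = [0.0] * n
    b_norm = math.sqrt(dot(b, b)) or 1.0
    for it in range(1, max_iter + 1):
        rho_new = dot(r_hat, r)
        beta = (rho_new / rho) * (alpha / omega)
        p = [ri + beta * (pi - omega * vi) for ri, pi, vi in zip(r, p, v)]
        p_hat = precond(p)
        v = matvec(A, p_hat)
        alpha = rho_new / dot(r_hat, v)
        s = [ri - alpha * vi for ri, vi in zip(r, v)]
        if math.sqrt(dot(s, s)) <= tol * b_norm:
            x = [xi + alpha * ph for xi, ph in zip(x, p_hat)]
            return x, it
        s_hat = precond(s)
        t = matvec(A, s_hat)
        omega = dot(t, s) / dot(t, t)
        x = [xi + alpha * ph + omega * sh for xi, ph, sh in zip(x, p_hat, s_hat)]
        r = [si - omega * ti for si, ti in zip(s, t)]
        if math.sqrt(dot(r, r)) <= tol * b_norm:
            return x, it
        rho = rho_new
    return x, max_iter

# Small nonsymmetric example with a Jacobi (diagonal) preconditioner
A = [[4.0, 1.0, 0.0], [2.0, 5.0, 1.0], [0.0, 1.0, 3.0]]
b = [6.0, 15.0, 11.0]                # equals A applied to [1, 2, 3]
jacobi = lambda vec: [vi / A[i][i] for i, vi in enumerate(vec)]
x, iterations = bicgstab(A, b, precond=jacobi)
```

Each iteration costs two matrix-vector products and two preconditioner applications, which is why SpMV and the preconditioner are the kernels worth moving to the GPU.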

4 GPU linear solver inside RS
Can we accelerate the solver with a GPU?
What do we need on the GPU?
  Sparse matrix-vector product (SpMV)
  Preconditioner
  Basic vector linear algebra (CUBLAS)

5 SpMV on GPU
CSR block matrices: nonzero elements are small blocks (1x1, 2x2, 3x3, 4x4, ...).
SpMV exploits the block structure to reduce the indirection cost.
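The gain from the block structure can be seen in a sequential sketch of a block CSR SpMV: one column index is read per block instead of one per scalar entry. This is plain Python for illustration, not the GPU kernel from the slides; the array names (row_ptr, col_idx, blocks) are assumptions.

```python
def bsr_spmv(block_size, row_ptr, col_idx, blocks, x):
    """y = A.x for a block CSR matrix with dense square blocks."""
    b = block_size
    n_block_rows = len(row_ptr) - 1
    y = [0.0] * (n_block_rows * b)
    for i in range(n_block_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]           # one indirection per block, not per scalar
            blk = blocks[k]          # dense b x b block stored row by row
            for r in range(b):
                acc = 0.0
                for c in range(b):
                    acc += blk[r][c] * x[j * b + c]
                y[i * b + r] += acc
    return y

# Two block rows of 2x2 blocks: blocks (0,0), (0,1) and (1,1)
row_ptr = [0, 2, 3]
col_idx = [0, 1, 1]
blocks = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 1.0], [1.0, 0.0]],
    [[2.0, 0.0], [0.0, 2.0]],
]
x = [1.0, 2.0, 3.0, 4.0]
y = bsr_spmv(2, row_ptr, col_idx, blocks, x)
```

With 3x3 blocks, nine multiply-adds share a single column index, so the index traffic per floating-point operation drops by the same factor compared with scalar CSR.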

6 SpMV on GPU
We also use the texture cache for x:
  1. Bind texture with x
  2. Compute y = A.x
  3. Synchronize
  4. Unbind texture
Compared with cuSPARSE ELLPACK (best performance on our matrices).
cuSPARSE provides a block CSR format (BSR):
  Not as fast as point cuSPARSE ELLPACK on our systems
  Slower than CSR with 3x3, close to ELLPACK for 4x4
We directly use the original structure (no csr2ell).

7 Single GPU SpMV performances
Figure: SpMV performance in MFLOPS on Nvidia K20c/K40c/K40c Boost (NOECC) compared to an Intel CPU.
Series: IFPEN spmv v2 and cusparse ELLPACK on each GPU, CPU 4 cores, CPU 8 cores.
Matrices: Canta (3x3), n=24048; MSUR_9 (4x4), n=86240; IvaskBO (3x3); GCSN1 (3x3); GCS2K (3x3); spe10 (2x2).

8 Polynomial
Neumann polynomial: $P(x) = x + Nx + N^2 x + N^3 x + \dots + N^d x$ with $N = I - \omega D^{-1} A$
Only requires SpMV and vector algebra.
As preconditioner: y = P(x)
Highly parallel in every context (MPI, GPU, Pthread).
(Very) low numerical efficiency.
High FLOP cost: degree d means d SpMV to apply.
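A minimal sketch of applying the Neumann polynomial with a Horner-style recurrence, so that degree d costs exactly d matrix-vector products. The dense matrix, the function names and the choice of D as the diagonal of A are assumptions made for this example; the slides do not give the implementation.

```python
def matvec(A, v):
    # Dense stand-in for the SpMV
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def neumann_apply(A, x, degree, omega=1.0):
    """Return P(x) = x + N.x + ... + N^degree.x with N = I - omega * D^-1 * A,
    where D is taken as the diagonal of A (assumption)."""
    diag = [A[i][i] for i in range(len(A))]
    y = list(x)
    for _ in range(degree):
        Ay = matvec(A, y)            # the only SpMV of this step
        # N.y = y - omega * D^-1 * A.y, then Horner step y <- x + N.y
        y = [xi + yi - omega * ayi / di
             for xi, yi, ayi, di in zip(x, y, Ay, diag)]
    return y

A = [[4.0, 1.0], [1.0, 4.0]]
x = [1.0, 1.0]
y = neumann_apply(A, x, degree=2)    # x + N.x + N^2.x
```

Here N.x = [-0.25, -0.25] and N^2.x = [0.0625, 0.0625], so each component of y is 1 - 0.25 + 0.0625 = 0.8125. Everything in the loop is an SpMV or an element-wise vector update, which is what makes this preconditioner easy to run in parallel.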

9 ILU0 on GPU: graph coloring
With the natural order (ijk reservoir grid), the LU solve exhibits a low degree of parallelism.
Color the matrix adjacency graph: give each node (equation) a color different from those of its neighbors.
Minimize the number of colors (maximize the number of nodes in each color).
Permute the matrix by ordering equations color by color.
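The coloring and permutation steps can be sketched with a simple greedy coloring. The slides do not say which coloring algorithm is used, and a greedy pass does not guarantee the minimum number of colors, so this is only an illustration; the function names and the small grid are assumptions.

```python
def greedy_coloring(adjacency):
    """Give each node the smallest color not used by an already colored neighbor."""
    colors = [-1] * len(adjacency)
    for node in range(len(adjacency)):
        used = {colors[nb] for nb in adjacency[node] if colors[nb] >= 0}
        c = 0
        while c in used:
            c += 1
        colors[node] = c
    return colors

def color_permutation(colors):
    # Stable sort: equations grouped color by color, natural order inside a color
    return sorted(range(len(colors)), key=lambda i: colors[i])

# 2 x 3 grid numbered row by row:  0 1 2
#                                  3 4 5
adjacency = [[1, 3], [0, 2, 4], [1, 5], [0, 4], [1, 3, 5], [2, 4]]
colors = greedy_coloring(adjacency)
perm = color_permutation(colors)     # perm[new_index] = old_index
```

On this grid the result is the familiar red-black ordering: two colors, with nodes 0, 2, 4 first and nodes 1, 3, 5 second. No two nodes of the same color are coupled, so all unknowns of one color can be updated at the same time.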

10 ILU0 on GPU: permuted solve
$L_i$, $U_i$: sparse blocks
Solve $L x = y$ (block triangular solve):
  1. $x_1 = D_1^{-1} y_1$
  2. $x_2 = D_2^{-1} (y_2 - L_2 x_1)$ (SpMV)
  3. ...
Solve $U x = y$:
  1. $x_4 = D_4^{-1} y_4$
  2. $x_3 = D_3^{-1} (y_3 - U_3 x_4)$
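A sequential sketch of the color-by-color forward solve L.x = y. It assumes scalar entries instead of the small blocks used in the simulator and a matrix already permuted color by color; the names rows, diag and color_ptr are hypothetical. Inside one color no row depends on another row of the same color, so the inner loop is the part that could run in parallel.

```python
def color_forward_solve(rows, diag, color_ptr, y):
    """Solve L.x = y for a color-permuted lower triangular matrix.
    rows[i]: strictly lower entries of row i as {column: value}
    diag[i]: diagonal entry of row i
    color_ptr[c]..color_ptr[c+1]: range of rows holding color c"""
    x = [0.0] * len(y)
    for c in range(len(color_ptr) - 1):
        start, end = color_ptr[c], color_ptr[c + 1]
        # Rows of this color only reference unknowns of earlier colors,
        # so every iteration of this loop is independent of the others.
        for i in range(start, end):
            s = sum(v * x[j] for j, v in rows[i].items())
            x[i] = (y[i] - s) / diag[i]
    return x

# Two colors of three unknowns each (red-black ordering of a 2 x 3 grid)
color_ptr = [0, 3, 6]
diag = [4.0] * 6
rows = [
    {}, {}, {},                          # first color: diagonal block only
    {0: -1.0, 1: -1.0, 2: -1.0},         # second color couples to the first
    {0: -1.0, 2: -1.0},
    {1: -1.0, 2: -1.0},
]
y = [1.0] * 6
x = color_forward_solve(rows, diag, color_ptr, y)
```

The first color reduces to a diagonal scaling, and the second color is one sparse product with the first-color values followed by a diagonal scaling, which matches steps 1 and 2 of the slide.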

11 Color ILU0: performances and drawback
2 colors hold the majority of nodes (GCS2K example; table columns: color, number of vertices).
GPU/CPU acceleration, K40c NOECC / E5-2680, average processor cycles per LU solve. Table columns: system, ILU0 CPU solve (1 core), Color ILU0 GPU solve (1 GPU); rows: spe10, GCS2K (2.75e), IvaskBO (6.30e).
Coloring has a negative impact on Krylov solver convergence.

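12 AMGP and CPR-AMG
Split the linear system $Ax = r$:
$A = \begin{pmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \end{pmatrix}$, $x = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$, $r = \begin{pmatrix} R_1 \\ R_2 \end{pmatrix}$
$A_{1,1}$ is the pressure block, $A_{2,2}$ is the saturation block.
AMGP: only solve $A_{1,1}$ with AMG: $A_{1,1} X_1 = R_1$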

13 AMGP and CPR-AMG
CPR-AMG: solve $A_{1,1}$ with AMG and the whole system with a simple preconditioner (ILU0 or polynomial):
  $A x^{(1)} = r$ with ILU0, $x^{(1)} = \begin{pmatrix} X_1^{(1)} \\ X_2^{(1)} \end{pmatrix}$
  $A_{1,1} X_1^{(2)} = R_1 - A_{1,1} X_1^{(1)} - A_{1,2} X_2^{(1)}$ with AMG (the right-hand side needs an SpMV)
  The final preconditioner solution is $\begin{pmatrix} X_1^{(1)} + X_1^{(2)} \\ X_2^{(1)} \end{pmatrix}$
We use AmgX from Nvidia.
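The two CPR stages can be sketched as a function that takes the two inner solvers as parameters. In this example a Jacobi sweep stands in for ILU0 and a direct 2x2 solve stands in for AMG; both substitutions, the dense matrix and all function names are assumptions made to keep the sketch short, and nothing here uses the AmgX API.

```python
def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def cpr_apply(A, r, n_p, simple_precond, pressure_solve):
    """Two-stage CPR preconditioner; the first n_p unknowns are pressures."""
    # Stage 1: approximate solve of the whole system, A.x1 ~ r
    x1 = simple_precond(A, r)
    # Pressure residual R1 - A11.X1(1) - A12.X2(1): first n_p rows of r - A.x1
    Ax1 = matvec(A, x1)
    res_p = [r[i] - Ax1[i] for i in range(n_p)]
    # Stage 2: pressure correction A11.X1(2) = pressure residual
    A11 = [row[:n_p] for row in A[:n_p]]
    dp = pressure_solve(A11, res_p)
    return [x1[i] + dp[i] for i in range(n_p)] + x1[n_p:]

def jacobi(A, r):
    # Stand-in for ILU0 (assumption): one diagonal scaling
    return [ri / A[i][i] for i, ri in enumerate(r)]

def solve_2x2(M, b):
    # Stand-in for AMG (assumption): exact solve by Cramer's rule
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(b[0] * M[1][1] - M[0][1] * b[1]) / det,
            (M[0][0] * b[1] - M[1][0] * b[0]) / det]

A = [[4.0, 1.0, 1.0], [1.0, 4.0, 1.0], [1.0, 1.0, 4.0]]
r = [6.0, 6.0, 6.0]
z = cpr_apply(A, r, n_p=2, simple_precond=jacobi, pressure_solve=solve_2x2)
```

Because the stand-in pressure solve is exact, the pressure rows of A.z equal R1 after the correction, while the saturation unknown keeps its stage 1 value. With AMG the pressure solve is only approximate, so that equality holds up to the AMG tolerance.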

14 Spe10 30 days simulation with GPU
Total solver time inside a 30 days simulation (1e-4). Table columns: solver type, total number of iterations, total solver time (s). Rows:
  ILU0 CPU, 1 core
  Block Jacobi ILU0 CPU, 8 cores MPI
  Block Jacobi ILU0 CPU, 16 cores MPI
  CPR-AMG CPU (IFPSolver), 1 core
  CPR-AMG CPU (IFPSolver), 8 cores MPI
  CPR-AMG CPU (IFPSolver), 16 cores MPI
  Color ILU0 GPU, 1 core / 1 gpu
  Poly GPU, 1 core / 1 gpu
  AMGP AmgX PMIS GPU, 1 core / 1 gpu
  CPR-AMG AmgX PMIS, Poly GPU, 1 core / 1 gpu
Hardware: K40c NOECC / E sockets

15 Black Oil Thermal simulation
200K cells, 30 days simulation (1e-7), easy case. Table columns: solver type, total number of iterations, total solver time (s), total setup time. Rows:
  ILU0 CPU, 1 core
  Block Jacobi ILU0 CPU, 8 cores MPI
  Block Jacobi ILU0 CPU, 16 cores MPI
  CPR-AMG CPU (IFPSolver), 1 core
  Color ILU0 GPU, 1 core / 1 gpu
  Poly GPU, 1 core / 1 gpu
  AMGP AmgX PMIS GPU, 1 core / 1 gpu
  CPR-AMG AmgX PMIS, Poly GPU, 1 core / 1 gpu
  CPR-AMG AmgX PMIS, Color ILU0 GPU, 1 core / 1 gpu
Hardware: K40c NOECC / E sockets

16 GPU with MPI
2 primary objectives:
  How to make a hybrid GPU+MPI SpMV, and does it work efficiently?
  How does the full solver behave (with polynomial)? Does it scale?
Test system: Bullx Blade, 2 x E5-2470 @ 2.3 GHz + 2 x K20m per node (ECC ON)
5 nodes: 80 cores + 10 gpus
Infiniband backplane

17 GPU SpMV with MPI
Split the local SpMV for process p:
  $y^{(p)} = A^{(p)}_{int} x^{(p)}$
  Get $x^{(p)}_{ext}$ with halo/neighbor exchange
  $y^{(p)}_{ext} = A^{(p)}_{ext} x^{(p)}_{ext}$
  $y^{(p)} = y^{(p)} + y^{(p)}_{ext}$
Reorder the local matrix to minimize $y^{(p)}_{ext}$ and $x^{(p)}_{ext}$: externally dependent equations at the end.
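The interior/exterior split can be illustrated without MPI by emulating two processes in one Python script. The halo exchange is replaced by a callback that returns the neighbor values, and the local numbering convention (columns at or beyond n_local are ghost entries) is an assumption made for this sketch.

```python
def split_matrix(rows, n_local):
    """Split local rows into an interior part (local columns) and an
    exterior part (ghost columns, renumbered from 0)."""
    a_int, a_ext = [], []
    for row in rows:
        a_int.append({j: v for j, v in row.items() if j < n_local})
        a_ext.append({j - n_local: v for j, v in row.items() if j >= n_local})
    return a_int, a_ext

def spmv(rows, x):
    return [sum(v * x[j] for j, v in row.items()) for row in rows]

def local_spmv(a_int, a_ext, x_local, fetch_ghosts):
    y = spmv(a_int, x_local)         # interior part, needs no communication
    x_ext = fetch_ghosts()           # stand-in for the halo exchange
    y_ext = spmv(a_ext, x_ext)       # contribution of the neighbor values
    return [a + b for a, b in zip(y, y_ext)]

# Global matrix tridiag(-1, 2, -1) of size 4, global x = [1, 2, 3, 4],
# two emulated processes owning two rows each.
rows_p0 = [{0: 2.0, 1: -1.0}, {0: -1.0, 1: 2.0, 2: -1.0}]   # ghost 2 = global 2
rows_p1 = [{0: 2.0, 1: -1.0, 2: -1.0}, {0: -1.0, 1: 2.0}]   # ghost 2 = global 1
int0, ext0 = split_matrix(rows_p0, n_local=2)
int1, ext1 = split_matrix(rows_p1, n_local=2)
y_p0 = local_spmv(int0, ext0, [1.0, 2.0], lambda: [3.0])
y_p1 = local_spmv(int1, ext1, [3.0, 4.0], lambda: [2.0])
```

Only the rows with a nonempty exterior part depend on the exchange, so the interior product can start before the neighbor values arrive. Ordering those rows last keeps the exterior contribution in one contiguous tail of y.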

18 GPU SpMV with MPI
GPU + MPI SpMV workflow, shown as a timeline with GPU and CPU lanes (not to real time scale):
  $y^{(p)} = A^{(p)}_{int} x^{(p)}$
  x exchange
  $y^{(p)}_{ext} = A^{(p)}_{ext} x^{(p)}_{ext}$
  $y^{(p)} = y^{(p)} + y^{(p)}_{ext}$

19 GPU SpMV with MPI: good news
Figure: SpMV MPI+GPU performance in FLOPS against batch reserved cores, FiveSpot _7 matrix (2x2), K20m ECC + E5-2470@2.30GHz, 80 cores + 10 gpus.
Series: MPI 1c1gpu/s CusparseEll; MPI 1c1gpu/s IFPENV2; MPI 8c/s async com.
Data labels: 9,24E+10, 1,34E+11, 1,78E+11, 2,15E+11, 4,70E+10, 8,51E+09, 1,68E+10, 2,51E+10, 3,41E+10, 4,32E+10.
Annotations: x5.5; close to x5 acceleration for 1c+1gpu/s against full socket use.
1 node with 2 gpus is equivalent to 5 nodes with full socket use!

20 GPU SpMV with MPI: bad news
Figure: SpMV MPI+GPU performance in FLOPS against batch reserved cores, GCSN1 matrix (3x3), n=556594, K20m ECC + E5-2470@2.30GHz, 80 cores + 10 gpus.
Series: MPI 1c1gpu/s CusparseEll; MPI 1c1gpu/s IFPENV2; MPI 8c/s async com.
Data labels: 3,89E+10, 1,01E+10, 6,42E+10, 2,27E+10, 8,15E+10, 7,00E+10, 1,17E+11, 1,01E+11, 1,41E+11, 1,10E.
x3.5 acceleration for 2c+1g/s against 16c (full node).
At the end the CPU is faster, thanks to the L3 cache effect.

21 Stand alone MPI+GPU solver
Test with Polynomial (spe10 matrix): multi GPU intrinsic scalability?
Table columns: 1c/1g, 2c/2g, 4c/4g, 6c/6g, 8c/8g, 10c/10g
  Total solver time (s): 20,
  It:
  Acc: ,6 6,5

22 Conclusions and work in progress
Thanks to AmgX: good GPU CPR-AMG preconditioner (1 gpu).
But the "everyday" preconditioner (Color ILU0) is not good enough:
  New coloring algorithm for decent numerical behavior?
MPI+GPU:
  SpMV: OK with big systems (at least 200K equations per GPU)
  CPR-AMG and Color ILU0: work in progress

23 Thanks
Special thanks to the Nvidia AmgX team: Marat Arsaev and Joe Eaton
François Courteille (Nvidia)
Work partially supported by the PETALH ANR project
