S0432 NEW IDEAS FOR MASSIVELY PARALLEL PRECONDITIONERS

1 S0432 NEW IDEAS FOR MASSIVELY PARALLEL PRECONDITIONERS
John R Appleyard, Jeremy D Appleyard
Polyhedron Software
with acknowledgements to
Mark A Wakefield, Garf Bowen
Schlumberger

2 Outline of Talk
Reservoir Simulation Overview
GPUs in Reservoir Simulation
MPNF (Massively Parallel Nested Factorization) Preconditioner
Numerical Results
MPNF on Multi-GPU Workstations and Clusters
Unstructured Prismatic Grids

3 Reservoir Simulation
Computer simulation of the flow of fluids (typically oil, water, and gas) through porous underground formations.
Lots of Physics
  o Multi-phase flow
  o Compositional, flash calculations
  o Thermal
  o Enhanced recovery techniques

4 A Large Reservoir Model

5 Reservoir Simulation Numerical Methods
Spatial discretisation: structured grid (..)
Multiple coupled equations for each cell
Mixed hyperbolic/elliptic
Finite volume: material and energy conservation
Fully implicit time stepping
Newton iteration (modified) at each time step
Solve sparse linear equations at each Newton iteration
Two-level linear iteration: before every linear iteration, a few iterations on a single-phase CPR (constrained pressure residual) matrix accelerate convergence by approximating the pressure field
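
The loop structure on this slide (fully implicit step, Newton iteration inside each step, one linear solve inside each Newton iteration) can be sketched on a toy problem. This is not reservoir physics: it assumes a single scalar equation du/dt = -k*u^2, chosen only because its exact solution u0/(1 + k*u0*t) is known, so the "sparse linear solve" collapses to one division. The names `implicit_step` and `simulate` are hypothetical.

```python
def implicit_step(u_old, dt, k, tol=1e-12, max_newton=20):
    """One backward-Euler step of du/dt = -k*u**2, solved by Newton iteration."""
    u = u_old  # initial guess: value at the previous time level
    for iteration in range(1, max_newton + 1):
        residual = u - u_old + dt * k * u * u  # F(u), zero at convergence
        jacobian = 1.0 + 2.0 * dt * k * u      # dF/du, a 1x1 "matrix"
        delta = -residual / jacobian           # the linear solve of this Newton iteration
        u += delta
        if abs(delta) < tol:
            return u, iteration
    raise RuntimeError("Newton iteration did not converge")


def simulate(u0, k, dt, nsteps):
    """Advance nsteps implicit time steps; return final value and total Newton count."""
    u = u0
    newton_total = 0
    for _ in range(nsteps):
        u, iterations = implicit_step(u, dt, k)
        newton_total += iterations
    return u, newton_total
```

In a real simulator the division is replaced by a preconditioned iterative solve on a large sparse Jacobian, which is where the preconditioner discussed in the rest of the talk enters.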

6 Can we use GPUs?
A Newton iteration comprises:
Property Calculations: OK in principle
Matrix Assembly: OK
Linear Solver: Krylov subspace or AMG
  o GMRES/Orthomin: OK (dot products)
  o Matrix multiply: OK
  o Preconditioner: problematic
    Best existing algorithms have little intrinsic parallelism
    Have to reconcile remote driving forces

7 Preconditioned Krylov Subspace Method
For non-symmetric matrices, use GMRES, Orthomin, BiCGstab etc.
Given $A x = r$
Solve $B^{-1} A x = B^{-1} r$
B is the preconditioner
  o B approximates A ($B^{-1} A$ has a smaller condition number than A)
  o $B^{-1} r$ is quick to compute
  o low storage
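
To make the role of B concrete, here is a minimal sketch of the simplest preconditioned iteration, x <- x + B^-1 (r - A x). It is a preconditioned Richardson iteration, not GMRES or Orthomin, and the preconditioner is plain Jacobi (B = diag(A)) rather than MPNF; both are stand-ins chosen to keep the example short. The function names and the small test matrix are hypothetical.

```python
def matvec(A, x):
    """Dense matrix-vector product."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def jacobi_preconditioner(A):
    """Return a function applying B^-1 with B = diag(A): quick to compute, low storage."""
    inv_diag = [1.0 / A[i][i] for i in range(len(A))]
    return lambda v: [w * vi for w, vi in zip(inv_diag, v)]


def preconditioned_richardson(A, r, apply_Binv, tol=1e-10, max_iter=500):
    """Iterate x <- x + B^-1 (r - A x) until the residual is below tol."""
    x = [0.0] * len(r)
    for iteration in range(max_iter):
        residual = [ri - yi for ri, yi in zip(r, matvec(A, x))]
        if max(abs(v) for v in residual) < tol:
            return x, iteration
        correction = apply_Binv(residual)
        x = [xi + ci for xi, ci in zip(x, correction)]
    raise RuntimeError("iteration did not converge")


# Small non-symmetric, diagonally dominant system with known solution [1, 2, 3].
A = [[4.0, -1.0, 0.0],
     [-2.0, 5.0, -1.0],
     [0.0, -1.0, 3.0]]
r = matvec(A, [1.0, 2.0, 3.0])
x, iterations = preconditioned_richardson(A, r, jacobi_preconditioner(A))
```

A Krylov method uses the same two ingredients (a matrix multiply and an application of B^-1) but combines the successive corrections more cleverly, which is why the quality and cost of B dominate the solver.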

8 MPNF Preconditioner
Multi-color block incomplete LU (ILU), related to Nested Factorization
Block diagonal elements are m-kernels, typically lines of grid cells oriented along the direction with the highest transmissibilities
Both outer and inner (CPR) iterations use the same basic method

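9 Block Incomplete LU
If $A = D_A + L + U$ and $B = (D + L)(I + D^{-1} U)$, then
$B = D + L + U + L D^{-1} U$
$B = A + \varepsilon$, with $\varepsilon = D - D_A + L D^{-1} U$
Choose $D = D_A - \mathrm{approx}(L D^{-1} U)$, so that
$\varepsilon = L D^{-1} U - \mathrm{approx}(L D^{-1} U)$
Here $D_A$ is the block diagonal of A and $D$ is the block diagonal of the preconditioner B.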

10 What Approximation?
Uncorrected: may be adequate for outer iteration
  o $\mathrm{approx}(L D^{-1} U) = 0$
Unmodified: drop all except diagonal terms
  o $\mathrm{approx}(L D^{-1} U) = \mathrm{diag}(L D^{-1} U)$
Modified: add error terms to diagonal
  o Rowsum: best for CPR iteration (with relaxation)
    $\mathrm{approx}(L D^{-1} U) = \mathrm{rowsum}(L D^{-1} U)$
  o Colsum: best for single-stage preconditioner (material balance)
    $\mathrm{approx}(L D^{-1} U) = \mathrm{colsum}(L D^{-1} U)$
  o Relaxation factor $\beta$ can be applied to Rowsum and Colsum
    $\mathrm{approx}(L D^{-1} U) = \beta\,\mathrm{rowsum}(L D^{-1} U)$
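
The four choices of approx(L D^-1 U) can be compared on a small matrix. The sketch below is a scalar (point) version, so each "block" is a single number rather than an m-kernel, and it uses a dense 5-point grid matrix invented for illustration; the function names are hypothetical. It builds D by the recursion D = diag(A) - beta * approx(L D^-1 U), forms B = D + L + U + L D^-1 U, and lets one check which property of A each choice preserves.

```python
def grid_matrix(nx, ny, lower=-1.0, upper=-0.5, diag=4.5):
    """Dense 5-point matrix on an nx*ny grid (non-symmetric, diagonally dominant)."""
    n = nx * ny
    A = [[0.0] * n for _ in range(n)]
    for j in range(ny):
        for i in range(nx):
            p = j * nx + i
            A[p][p] = diag
            if i > 0:
                A[p][p - 1] = lower
            if j > 0:
                A[p][p - nx] = lower
            if i < nx - 1:
                A[p][p + 1] = upper
            if j < ny - 1:
                A[p][p + nx] = upper
    return A


def corrected_diagonal(A, mode, beta=1.0):
    """D[p] = A[p][p] - beta * approx(L D^-1 U)[p], built row by row."""
    if mode not in ("uncorrected", "unmodified", "rowsum", "colsum"):
        raise ValueError(mode)
    n = len(A)
    D = [0.0] * n
    for p in range(n):
        corr = 0.0
        for k in range(p):  # only earlier unknowns contribute
            if mode == "unmodified":   # diagonal entry of L D^-1 U
                corr += A[p][k] * A[k][p] / D[k]
            elif mode == "rowsum":     # sum of row p of L D^-1 U
                corr += A[p][k] / D[k] * sum(A[k][j] for j in range(k + 1, n))
            elif mode == "colsum":     # sum of column p of L D^-1 U
                corr += sum(A[i][k] for i in range(k + 1, n)) / D[k] * A[k][p]
        D[p] = A[p][p] - beta * corr
    return D


def build_B(A, D):
    """Form B = (D + L)(I + D^-1 U) = D + L + U + L D^-1 U explicitly."""
    n = len(A)
    B = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            fill = sum(A[i][k] * A[k][j] / D[k] for k in range(min(i, j)))
            B[i][j] = (D[i] if i == j else A[i][j]) + fill
    return B
```

With beta = 1, Colsum makes every column of B sum to the same value as the corresponding column of A, which is the material balance property mentioned on the slide; Rowsum does the same for rows, and Unmodified reproduces only the diagonal of A.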

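11 Three Color Case
$$A = \begin{pmatrix} D_A & U & \\ L & D_A & U \\ & L & D_A \end{pmatrix}, \qquad B = \begin{pmatrix} D & & \\ L & D & \\ & L & D \end{pmatrix} \begin{pmatrix} I & D^{-1} U & \\ & I & D^{-1} U \\ & & I \end{pmatrix}$$
$$D = \begin{pmatrix} T & & & \\ & T & & \\ & & T & \\ & & & T \end{pmatrix}, \qquad T = \begin{pmatrix} d & u & & \\ l & d & u & \\ & \ddots & \ddots & \ddots \\ & & l & d \end{pmatrix}$$
Here $D_A$ denotes the diagonal blocks of A and $D$ the diagonal blocks of the preconditioner B; blocks in different positions are different matrices.
Each T (tridiagonal) represents a single line of cells (m-kernel).
T terms of a given color are processed in parallel.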
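
Applying the inverse of D means solving one tridiagonal system per m-kernel, and the systems of one color do not depend on each other. A minimal sketch, assuming the standard Thomas algorithm for each T (the slides do not say which tridiagonal solver is used) and using hypothetical function names:

```python
def solve_tridiagonal(l, d, u, rhs):
    """Thomas algorithm for l[i]*x[i-1] + d[i]*x[i] + u[i]*x[i+1] = rhs[i].

    l[0] and u[-1] are ignored. No pivoting, so this assumes a well-conditioned
    (for example diagonally dominant) system.
    """
    n = len(d)
    up = [0.0] * n   # scaled super-diagonal after elimination
    rp = [0.0] * n   # scaled right-hand side after elimination
    up[0] = u[0] / d[0]
    rp[0] = rhs[0] / d[0]
    for i in range(1, n):
        denom = d[i] - l[i] * up[i - 1]
        up[i] = u[i] / denom
        rp[i] = (rhs[i] - l[i] * rp[i - 1]) / denom
    x = [0.0] * n
    x[-1] = rp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = rp[i] - up[i] * x[i + 1]
    return x


def solve_color(kernels):
    """Solve every m-kernel of one color.

    Each entry of kernels is a tuple (l, d, u, rhs). The solves are independent,
    so on a GPU each could be given to its own thread; here they run in a loop.
    """
    return [solve_tridiagonal(l, d, u, rhs) for (l, d, u, rhs) in kernels]
```

The serial dependence sits inside each solve (the forward and backward sweeps), which is why the available parallelism is set by the number of m-kernels per color.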

12 Sample Grid

13 Oscillatory 4 Color Ordering
Numbers represent colors
Oscillatory ordering: 12343212...
Better convergence than with cyclical ordering
Coefficient matrix is properly block tridiagonal
Fewer error terms
[Figure: Aerial View of Grid showing Oscillatory 4-Color Ordering]
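
The oscillatory sequence 12343212 can be generated by bouncing between color 1 and the highest color. The sketch below covers one row of m-kernels only (how the pattern is laid out in the second grid direction is not recoverable from the slide text), and the function names are hypothetical. It also shows why the ordering keeps the matrix block tridiagonal: neighbouring kernels always differ by exactly one color, whereas a cyclical ordering puts the last and first colors side by side.

```python
def oscillatory_colors(count, ncolor=4):
    """Colors 1..ncolor in the pattern 1,2,...,ncolor,...,2,1,2,... (needs ncolor >= 2)."""
    period = 2 * (ncolor - 1)
    colors = []
    for i in range(count):
        phase = i % period
        step = phase if phase < ncolor else period - phase
        colors.append(1 + step)
    return colors


def cyclical_colors(count, ncolor=4):
    """Colors 1..ncolor repeated in a plain cycle."""
    return [1 + i % ncolor for i in range(count)]


def max_color_jump(colors):
    """Largest color difference between neighbouring kernels in the row."""
    return max(abs(a - b) for a, b in zip(colors, colors[1:]))
```

A maximum jump of 1 means each color couples only to the colors immediately before and after it, giving the block tridiagonal structure; the cyclical ordering couples color 4 directly to color 1, which adds blocks far from the diagonal.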

14 Coefficient Matrix with Oscillatory Ordering
[Figure: Each m-kernel comprises one line of cells]

15 How Many Threads?
For an nx * ny * nz grid
Assume m-kernels aligned along the Z direction
There are roughly nx*ny/ncolor m-kernels of each color
Assume 1 thread per m-kernel
  o e.g. nx = ny = 100 and 4 colors gives 2500 threads: barely enough
Big problems and lots of memory needed
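
The thread estimate is simple arithmetic, sketched here so other grid sizes can be tried. The function name and the `threads_per_kernel` parameter are hypothetical; the default of one thread per m-kernel is the assumption stated on the slide.

```python
def threads_per_color(nx, ny, ncolor, threads_per_kernel=1):
    """Approximate number of concurrent threads while one color is processed.

    Assumes m-kernels are lines along Z, so there is one kernel per (x, y)
    column and the columns are shared evenly between the colors.
    """
    kernels = (nx * ny) // ncolor
    return kernels * threads_per_kernel


example = threads_per_color(100, 100, 4)  # the slide's nx = ny = 100, 4-color case
```

Note that nz does not appear: a deeper grid adds work per thread but no extra parallelism under this assumption.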

16 Parallelism within m-kernels
1. Twisted Factorization: 2 threads per m-kernel
2. Cyclic Reduction: each cycle doubles the number of threads (but increases total work)
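
A minimal sketch of the first option, assuming the usual form of twisted factorization for a tridiagonal system: one elimination sweep runs down from the top, a second runs up from the bottom, and they meet at a middle row. The two sweeps touch disjoint rows, which is what allows two threads per m-kernel; here they simply run one after the other. The function name and the choice of the meeting row are assumptions.

```python
def solve_twisted(l, d, u, rhs):
    """Solve l[i]*x[i-1] + d[i]*x[i] + u[i]*x[i+1] = rhs[i] by twisted factorization.

    l[0] and u[-1] are ignored. No pivoting is done.
    """
    n = len(d)
    m = n // 2                      # meeting row
    up = [0.0] * n                  # scaled super-diagonal (top half)
    lp = [0.0] * n                  # scaled sub-diagonal (bottom half)
    rp = [0.0] * n                  # scaled right-hand side

    # Sweep 1 (first thread): eliminate downwards over rows 0 .. m-1.
    for i in range(m):
        denom = d[i] - (l[i] * up[i - 1] if i > 0 else 0.0)
        up[i] = u[i] / denom
        rp[i] = (rhs[i] - (l[i] * rp[i - 1] if i > 0 else 0.0)) / denom

    # Sweep 2 (second thread): eliminate upwards over rows n-1 .. m+1.
    for i in range(n - 1, m, -1):
        denom = d[i] - (u[i] * lp[i + 1] if i < n - 1 else 0.0)
        lp[i] = l[i] / denom
        rp[i] = (rhs[i] - (u[i] * rp[i + 1] if i < n - 1 else 0.0)) / denom

    # Meeting row: both neighbours are expressed in terms of x[m].
    denom = d[m]
    value = rhs[m]
    if m > 0:
        denom -= l[m] * up[m - 1]
        value -= l[m] * rp[m - 1]
    if m < n - 1:
        denom -= u[m] * lp[m + 1]
        value -= u[m] * rp[m + 1]

    x = [0.0] * n
    x[m] = value / denom
    # Back-substitute outwards; again the two halves are independent.
    for i in range(m - 1, -1, -1):
        x[i] = rp[i] - up[i] * x[i + 1]
    for i in range(m + 1, n):
        x[i] = rp[i] - lp[i] * x[i - 1]
    return x
```

The total arithmetic is essentially the same as a one-directional sweep, unlike cyclic reduction, which trades extra work for more threads.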

17 Increasing the m-kernel Size
Standard m-kernel is one line
We can use blocks of lines (2x1, 2x2 etc.)
m-kernel matrix is penta/nona-diagonal instead of tridiagonal
Increases accuracy, but reduces parallelism
[Figure: Aerial View of Grid showing Oscillatory 4-Color Ordering with 2-Line m-kernels]
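
The penta/nona-diagonal claim can be checked by counting. The sketch below assumes the cells of a multi-line m-kernel are numbered layer by layer, with the lines of one layer numbered consecutively (an assumed ordering; the slides do not state it), and reports the number of diagonals in the resulting band. The function name is hypothetical.

```python
def kernel_band_width(bx, by, nz):
    """Number of diagonals in the band of a bx-by-by block of lines, nz cells long."""
    lines_per_layer = bx * by

    def index(i, j, k):
        # Layer-major numbering: all lines of layer k, then layer k+1, and so on.
        return k * lines_per_layer + j * bx + i

    half = 0  # largest index distance between two connected cells
    for k in range(nz):
        for j in range(by):
            for i in range(bx):
                p = index(i, j, k)
                for di, dj, dk in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
                    i2, j2, k2 = i + di, j + dj, k + dk
                    if i2 < bx and j2 < by and k2 < nz:
                        half = max(half, abs(index(i2, j2, k2) - p))
    return 2 * half + 1
```

Under this numbering a single line gives 3 diagonals, a 2x1 block gives 5 and a 2x2 block gives 9, matching the tridiagonal, penta-diagonal and nona-diagonal cases on the slide.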

18 Coefficient Matrix with 2-Line m-kernels
[Figure: Each m-kernel comprises 2 adjacent lines of cells]

19 Coefficient Matrix with Oscillatory Ordering
[Figure: Each m-kernel comprises one line of cells]

20 Numerical Results
30 test problems in a commercial simulator
  o from 50 cells to 900K cells
  o structured grids with many additional links (LGRs, well completions, faults etc.)
  o black oil/thermal/compositional
  o some torture tests
  o many real field-derived problems
Run on a Dell workstation with 2 x Nehalem CPU and an NVIDIA Tesla C2050 GPU with 3GB graphics memory

21 Time-step and Iteration Counts
TABLE 2 CHANGES TO TIME-STEP AND ITERATION COUNTS
Columns: Problem Number | Time Step Ratio | Non-Linear Iteration Ratio | Linear Iteration Ratio | CPR Iteration Ratio (final row: Geo mean)
Compared to JALS serial solver (fastest option)
All problems completed and validated
10% more time-steps
10% more Newton iterations
50% more linear iterations
4.5 times as many CPR iterations

22 Timing Comparisons for Large Problems
TABLE 3 MPNF PERFORMANCE ON LARGE PROBLEMS
Columns: Problem Number | Solver Only: MPNF v JALS speed-up | Entire Run: Current MPNF v JALS speed-up | Entire Run: Projected best case speed-up
JALS is a state-of-the-art single-threaded serial solver
Current: Matrix Assembly and Solver on GPU, Property Calculation code running on Intel CPU
Projected best case: assuming 10x faster Property Calculations on GPU
We only look at large problems (>100K cells)

23 Solver Speed-up v Size
[Figure: Solver Speed-up Factor (s) against Number of Grid Blocks (N), log axis up to 1E+6, annotated "s ~ N" and "Large Problems"]

24 MPNF on Multi-GPU Workstations and Clusters
GPU and cluster are not alternatives: can use both
MPNF is algorithmically unchanged by domain decomposition
  o Process m-kernels of a given color simultaneously across all domains
  o Corrections between colors are passed both within and between domains
  o Same iteration count whether on one GPU, or a cluster

25 2 GPU Speed-up
[Figure: Scaling on 2 GPUs (all test problems), speed-up against Number of Cells, log axis up to 1E+6]
Speed-up approaches theoretical maximum (2) for Large Problems

26 2 GPU Speed-up
[Figure: Scaling on 2 GPUs (model problem), speed-up against Number of Cells, log axis up to 1E+6]
Speed-up approaches theoretical maximum (2) for Large Problems

27 MPNF on Unstructured Prismatic Grids
Use map-coloring algorithm to ensure that adjacent m-kernels have different colors
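
The slide does not say which map-coloring algorithm is used. As an illustration only, the sketch below uses a simple greedy coloring, visiting the most-connected m-kernels first; it guarantees that neighbours differ in color but not that the number of colors is minimal. The function names and the small example graph are hypothetical.

```python
def greedy_coloring(adjacency):
    """Assign colors 1, 2, ... so that no two adjacent m-kernels share a color.

    adjacency maps each kernel id to the collection of its neighbouring kernel ids.
    """
    colors = {}
    # Visit the most-connected kernels first; ties keep their original order.
    order = sorted(adjacency, key=lambda node: len(adjacency[node]), reverse=True)
    for node in order:
        used = {colors[nb] for nb in adjacency[node] if nb in colors}
        color = 1
        while color in used:
            color += 1
        colors[node] = color
    return colors


def is_valid_coloring(adjacency, colors):
    """True if every pair of neighbouring kernels has different colors."""
    return all(colors[a] != colors[b] for a in adjacency for b in adjacency[a])


# Example: one central column of prisms (0) surrounded by a ring of five (1..5).
adjacency = {
    0: [1, 2, 3, 4, 5],
    1: [0, 2, 5],
    2: [0, 1, 3],
    3: [0, 2, 4],
    4: [0, 3, 5],
    5: [0, 4, 1],
}
colors = greedy_coloring(adjacency)
```

Because a greedy pass may use more colors than the theoretical minimum, it fits naturally with using a few extra colors when that helps.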

28 6 color Coefficient Matrix for Prismatic Grid

29 Prismatic Grid with Local Grid Refinement
Theoretically, only 4 colors are needed, but results may be better with 5 or 6

30 4 color Coefficient Matrix for LGR Grid

31 Conclusions
Reservoir simulators can work well on GPUs, but new solver algorithms are required
MPNF works well on our test cases
MPNF is suitable for both elliptic and coupled elliptic/hyperbolic equations
MPNF is significantly faster than a serial solver for large problems (>100K grid cells)
MPNF is well suited for implementation on multi-GPU workstations and clusters
MPNF may be adapted for unstructured grids

32 S0432 New Ideas for Massively Parallel Preconditioners The End!
