14MMFD-34 Parallel Efficiency and Algorithmic Optimality in Reservoir Simulation on GPUs


Slide 1: 14MMFD-34 Parallel Efficiency and Algorithmic Optimality in Reservoir Simulation on GPUs
K. Esler, D. Dembeck, K. Mukundakrishnan, V. Natoli, J. Shumway and Y. Zhang, Stone Ridge Technology, Bel Air, MD
J. Gilman, H-Z. Meng, iReservoir, Littleton, CO

Slide 2: Overview
1. Computing is undergoing a major transition
   - On-chip concurrency is growing rapidly; GPUs are the vanguard
   - Taking advantage of this growth requires new approaches designed for fine-grained parallelism
2. Reservoir simulation can benefit from this transition, but this requires a holistic approach
   - Algorithms come first, then figure out how to map to the hardware
   - Unaccelerated portions will limit the benefit
3. The performance advantages are significant
   - Speedups relative to existing CPU simulators can exceed 10x
   - It is possible to support the features required for real-field assets

Slide 3: Types of Parallelism
- Coarse-grained
  - N cores: divide domain into N chunks
  - Each core processes a chunk sequentially
  - Occasional communication to exchange halo data
- Fine-grained
  - Tightly coordinated threads work on entire domain
  - 10,000s of independent units of work
  - Coordination and memory access patterns are key to performance
- Current architectures require both
  - Core count and vector width are increasing: domains become too small with one domain per thread
  - Algorithms relying on domain-to-boundary ratio weaken
  - Need to utilize fine-grained parallelism on chip, domain decomposition between chips
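The weakening domain-to-boundary ratio can be illustrated with a little arithmetic. The following is a minimal sketch, not taken from the presentation: it assumes cubic chunks and a one-cell-wide halo, and the names `boundary_fraction` and `chunk_edge` are hypothetical helpers introduced only for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Fraction of cells in an n x n x n chunk that lie on its surface and so
// take part in halo exchange (assumption: one-cell-wide halo).
double boundary_fraction(std::int64_t n) {
    assert(n >= 1);
    if (n <= 2) return 1.0;  // every cell touches the chunk surface
    const double total = static_cast<double>(n) * n * n;
    const double interior = static_cast<double>(n - 2) * (n - 2) * (n - 2);
    return (total - interior) / total;
}

// Largest edge length n such that n^3 cells fit in one worker's share when
// total_cells are split evenly over `workers` (hypothetical helper).
std::int64_t chunk_edge(std::int64_t total_cells, std::int64_t workers) {
    assert(workers > 0 && total_cells >= workers);
    const std::int64_t cells = total_cells / workers;
    std::int64_t n = std::llround(std::cbrt(static_cast<double>(cells)));
    while (n * n * n > cells) --n;                     // correct rounding up
    while ((n + 1) * (n + 1) * (n + 1) <= cells) ++n;  // correct rounding down
    return n;
}
```

Under these assumptions, a 1M-cell model held as a single chunk has about 5.9% of its cells on the surface, while the same model split over 1,000 workers leaves 10 x 10 x 10 chunks in which about 48.8% of the cells are boundary cells.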

Slide 4: Oil and Gas Applications
- Seismic processing
  - GPUs are used widely in production seismic workflows
  - Natural parallelism, single hotspot, simple memory access patterns: relatively easy win
  - Significant performance boost, dense form factor, and lower costs
- Reservoir simulation
  - Some experimentation with GPUs, not yet widely adopted: why not?
  - Implicit solvers with strong convergence are hard to parallelize for GPUs
  - Many computational tasks need to be parallelized
- How can we address these challenges?

Slide 5: Project Background
- 2010: Accelerated Marathon's IMPES simulator with GPUs
  - Explicit saturation updates ported to run on multiple GPUs, CPU AMG solver
  - Demonstrated dramatic speedups relative to original CPU code
- 2011-present: GPU Algebraic Multigrid PACKage (GAMPACK)
  - Accelerates both setup and solve stages of preconditioner and solver
  - Multi-GPU implementation with a single hierarchy
  - Converges at the same rate, independent of the number of GPUs
  - Significant optimization of algorithms for higher performance
- 2013-present: Fully-implicit black-oil simulator
  - IMPES formulation not efficient for models with free gas
  - Build a new fully-implicit simulator, unimpeded by legacy heritage
  - Built from the ground up for GPUs, using fine-grained parallel algorithms

Slide 6: Optimal solvers for GPUs: Algorithms and Hardware

Slide 7: Pressure Preconditioners
- Usually the first stage of a CPR preconditioner, and the most challenging
  - CPR reduction methodology is critical to AMG success
- Elliptic problems have an extremely wide range of eigenvalues
  - Common preconditioners fail to damp low-frequency modes
  - Sometimes 1000s of iterations to converge
- Algebraic Multigrid (AMG) was designed for precisely these problems
  - Constructs hierarchy of matrices to damp low-frequency modes
  - Convergence is independent of system size (for symmetric problems)
- AMG is very challenging to accelerate on GPUs
  - Setup algorithms are complex, and most are sequential
  - Parallel variants designed for coarse parallelism
  - We have worked on this problem since 2011, developed GAMPACK
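The failure of simple methods on low-frequency modes can be seen on a textbook model problem. The sketch below is an illustration only, not the authors' solver: it assumes weighted Jacobi applied to the 1D Poisson matrix tridiag(-1, 2, -1) with n unknowns, for which error mode k is multiplied by 1 - 2*omega*sin^2(k*pi / (2*(n+1))) on every sweep. The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Per-sweep damping factor of weighted Jacobi for error mode k of the 1D
// Poisson matrix tridiag(-1, 2, -1) with n unknowns (model problem only).
double jacobi_damping(int n, int k, double omega) {
    assert(n >= 1 && k >= 1 && k <= n);
    const double pi = std::acos(-1.0);
    const double s = std::sin(k * pi / (2.0 * (n + 1)));
    return std::fabs(1.0 - 2.0 * omega * s * s);
}

// Number of sweeps needed to shrink a mode by `reduction` (e.g. 1e-6)
// when each sweep multiplies it by `damping`.
double sweeps_to_reduce(double damping, double reduction) {
    assert(damping > 0.0 && damping < 1.0);
    assert(reduction > 0.0 && reduction < 1.0);
    return std::ceil(std::log(reduction) / std::log(damping));
}
```

With n = 1000 and omega = 2/3, the most oscillatory mode (k = 1000) is damped by a factor of about 0.33 per sweep, while the smoothest mode (k = 1) is damped by only about 0.999997, so reducing it by 10^-6 would take millions of sweeps. A multigrid hierarchy addresses this by moving such modes to coarser levels where they are no longer smooth.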

Slide 8: Pressure Preconditioning: Algorithms Come First
- Example: thermal2 from Florida Sparse Matrix Collection (1.2M unknowns)
- Results from a commercial sparse matrix solver package:
- Results from BoomerAMG CPU AMG solver on dual-socket CPU:
  - 5.0 seconds, 8 iterations (10^-6 relative tolerance)
  - 6.5 seconds, 14 iterations (10^-10 relative tolerance)
- Results from GAMPACK on 1 Fermi GPU (previous generation):
  - 0.77 seconds, 8 iterations (10^-6 relative tolerance)
  - 1.00 seconds, 16 iterations (10^-10 relative tolerance)

Slide 9: CPU vs. GPU AMG
- GAMPACK on 1 Tesla K40 GPU vs. HYPRE BoomerAMG on 2x Xeon E GHz
- GPU speedup: 5x to 17x on a range of small matrices
[Figure: bar chart of solve time (s) per matrix, HYPRE vs. GAMPACK. Legible speedup labels: 5.3X, 6.5X, 7.1X, 8.9X, 12.2X, 12.7X, 13.0X, 13.0X, 13.2X]

Slide 10: Addressing Amdahl's Law

Slide 11: Avoiding Bottlenecks
- For a typical reservoir, linear solve may be 70% of total simulation time
  - The other 30% will kill performance if only the solver is accelerated
  - Even if one gets 10X for linear solve, it results in only a net 2.7X
  - Data movement between CPU and GPU makes this even worse
- Better results require addressing the remaining 30% on GPU
  - Jacobian evaluation and assembly
  - Property evaluation (little CPU time, but CPU-GPU transfer takes time)
  - Large body of GPU code: use modern OO design to manage complexity
- Avoiding an IO bottleneck
  - Handle file IO on a separate thread
  - Use efficient binary formats (e.g. HDF5), SSDs, RAID arrays, etc.
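The net 2.7X figure on this slide follows directly from Amdahl's law: with 70% of the run time accelerated 10X, the overall speedup is 1 / (0.3 + 0.7/10). A minimal sketch of that arithmetic, with illustrative function names that are not from the presentation:

```cpp
#include <cassert>

// Amdahl's law: overall speedup when `accelerated_fraction` of the original
// run time is sped up by `component_speedup` and the rest is unchanged.
double amdahl_speedup(double accelerated_fraction, double component_speedup) {
    assert(accelerated_fraction >= 0.0 && accelerated_fraction <= 1.0);
    assert(component_speedup > 0.0);
    return 1.0 / ((1.0 - accelerated_fraction) +
                  accelerated_fraction / component_speedup);
}

// Upper bound on the overall speedup as the accelerated part becomes
// infinitely fast: only the unaccelerated remainder is left.
double amdahl_limit(double accelerated_fraction) {
    assert(accelerated_fraction >= 0.0 && accelerated_fraction < 1.0);
    return 1.0 / (1.0 - accelerated_fraction);
}
```

Here `amdahl_speedup(0.7, 10.0)` evaluates to about 2.70, and `amdahl_limit(0.7)` to about 3.33: even an infinitely fast linear solver could not do better than 3.3X while the remaining 30% stays on the CPU. This simple model ignores CPU-GPU transfer time, which the slide notes makes matters worse.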

Slide 12: Breakdown of Kernel Runtime
- All computation is on GPU, other than initialization
- No dominant hotspots: runtime histogram has long tails
- Accelerating just a few kernels will limit performance
- Transfer time between CPU and GPU will exacerbate the problem

Slide 13: Meeting Requirements for Engineering Workflows

Slide 14: Current Features
- Black-oil simulator
- Model features
  - Rock, fluid, and rock-fluid properties
  - Well controls
- Fully-implicit formulation
- Standard finite volume discretization
- Graph-based engine for unstructured grids
- Several solver options, including CPR-AMG
- Dual porosity-dual permeability for fractured reservoirs
- Automatic equilibration, multiple equilibrium regions
- 2-phase and 3-phase support
- Aquifers
- Tabulated rock compaction with transmissibility modifiers
- Support for multiple regions
- Hysteresis for relative permeabilities, capillary pressure
- BHP, THP, and rate controls with secondary limits
- VFP tables
- Crossflow
- Arbitrary schedule
- I/O features
  - Support for standard industry input file formats
  - 1D and 3D output in standard formats
- Parallel execution
  - CPU threads drive GPUs on a single workstation
  - MPI for cluster-level communication
- Modern code architecture
  - Full OO approach using C++ and CUDA
  - C++ templates for generic programming
  - Use of C++11 standards, where helpful
  - template<class T> bool Foo (T a) { return PartMesh(a); }

Slide 15: Validation on Synthetic Models and Real Assets

Slide 16: Validation Procedures
- Accuracy comes first
  - Need to preserve history match and forecast for legacy models
- Results validated against a commercial simulator
  - Solid red lines are present results
  - Dotted lines are from the commercial simulator
  - Both codes use adaptive time stepping, but choose different time steps
  - Axes scales removed from proprietary production plots
- CPU test hardware (at client company)
  - Intel Xeon X5677 (4 cores, 3.47 GHz)
- GPU test system
  - 2x Xeon E v2 (Ivy Bridge-E Xeon 2.2 GHz)
  - 256 GB RAM
  - Up to 8x NVIDIA Tesla K40 GPUs

Slide 17: SPE10 Benchmark
- Industry standard benchmark
- 1.1M cells, highly heterogeneous permeability field
- 1 rate-controlled water injector, 4 BHP-controlled producers
- Watercut (and all other quantities) in agreement with commercial code

Slide 18: Conventional Assets
- Conventional A: 179k cells, numerical aquifers, hysteresis, artificial lift (VFP) tables, rate and THP controls
- Conventional B: 1.3M cells, gas cap, numerical aquifers, rate and BHP controls

Slide 19: Unconventional A
- Unconventional A: 3-phase, 1.4M cells, rock compaction with trans. multipliers, tartan grid
- Unconventional B: 3-phase dual porosity, 20M cells (10M fracture / 10M matrix), mobile gas, reversible rock compaction with trans. multipliers, tartan grid with large volume ratios

Slide 20: Performance Results

Slide 21: SPE10 Benchmark: Performance

Quantity              SPE10 (1 GPU)   SPE10 (2 GPUs)
Time steps
Newton iterations
Linear iterations
Solver setup time     31 s            20 s
Solver solve time     53 s            40 s
Jacobian time         8 s             4 s
Initialization time   7 s             7 s
Total wall time       103 s           75 s

Source for comparison    Method             Hardware   Time
Gratien et al. (2007)    FI with CPR-AMG    64 CPUs    620 s
Fung and Dogru (2008)    FI with CPR-LSPS   64 CPUs    490 s

Gratien, J.M. et al. [2007] Scalability and load balancing problems in parallel reservoir simulation. SPE Reservoir Simulation Symposium, SPE MS
Fung, L.S. and Dogru, A.H. [2008] Parallel unstructured-solver methods for simulation of complex giant reservoirs. SPE Journal, 13 (04),
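The 1-GPU and 2-GPU timings on this slide can be turned into speedup and parallel efficiency figures. The sketch below uses the standard definitions (speedup = T1 / TN, efficiency = speedup / N); the function names are illustrative and not from the presentation.

```cpp
#include <cassert>

// Speedup of a multi-worker run relative to the single-worker run.
double speedup(double t1_seconds, double tn_seconds) {
    assert(t1_seconds > 0.0 && tn_seconds > 0.0);
    return t1_seconds / tn_seconds;
}

// Parallel efficiency: speedup divided by the number of workers.
// 1.0 means perfect scaling; 1/workers means no benefit at all.
double parallel_efficiency(double t1_seconds, double tn_seconds, int workers) {
    assert(workers >= 1);
    return speedup(t1_seconds, tn_seconds) / workers;
}
```

Applied to the table, the total wall time (103 s to 75 s) gives a speedup of about 1.37 and an efficiency of about 69% on 2 GPUs. The Jacobian time (8 s to 4 s) scales perfectly, while the initialization time (7 s in both runs) does not scale at all, which is consistent with the slide's own point that unaccelerated portions limit the benefit.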

Slide 22: Real Assets Performance Results

Model              Active cells   CPU cores for commercial run   GPUs for present work   Reported speedup
Conventional A     179k           16 on Simulator I              1                       6
Conventional B     1.36M          32 on Simulator I              2                       45
Unconventional A   370k           32 on Simulator I              1                       43
Unconventional B   20M            48 on Simulator II

- Simulator I and Simulator II are commercial offerings
  - Run on dual-socket Xeon X5677 each with GHz
- Present work run on dual Xeon E GHz + 8 x Tesla K40 server
- Speedup comes from both algorithms and hardware performance

Slide 23: Performance and Parallel Efficiency on Large Models

Slide 24: Exploring Weak Scaling with Synthetic Benchmarks
- Tile the original SPE10 model horizontally
  - All copies are fully connected: nontrivial test of simulator scalability
  - Mirror properties across each boundary
  - Symmetry implies that each well production is identical to single SPE10 case
- Up to ~55M cells can be simulated on one 8-GPU server
- How do # of iterations and run time scale with number of tiles?

Slide 25: Weak Scaling
- Number of GPUs is proportional to model size (6.6M cells per GPU)
- Excludes initialization time (currently sequential code)
- Run time and linear iterations grow slowly with system size

Slide 26: Larger Models
- Single K40 GPU
  - 2-phase: 7M cells
  - 3-phase: 5M cells
- Four-GPU workstation
  - 2-phase: 28M cells
  - 3-phase: 18M cells
- Cluster with 8 K40s per 2U node
  - 2-phase: ~55M cells/node
  - 3-phase: ~35M cells/node
- We have an MPI-based solution in progress
  - We have run 220M cells and 1000 wells with 8 x 6-GPU nodes
  - Extrapolating, a billion cells should be addressable on a single rack
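The billion-cell extrapolation can be checked with simple capacity arithmetic from the per-node figures on this slide. This is a rough sketch under stated assumptions: capacity is taken to scale linearly with node count, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Number of nodes needed to hold `total_cells` when each node holds
// `cells_per_node` (ceiling division; assumes linear capacity scaling).
std::int64_t nodes_needed(std::int64_t total_cells, std::int64_t cells_per_node) {
    assert(total_cells > 0 && cells_per_node > 0);
    return (total_cells + cells_per_node - 1) / cells_per_node;
}

// Rack units occupied by `nodes` nodes of `units_per_node` each.
std::int64_t rack_units(std::int64_t nodes, std::int64_t units_per_node) {
    assert(nodes >= 0 && units_per_node > 0);
    return nodes * units_per_node;
}
```

At the 2-phase figure of ~55M cells per 2U node, one billion cells needs 19 nodes, or 38U, which would fit in a 42U rack (a common full-height size, assumed here rather than stated in the slides). At the 3-phase figure of ~35M cells per node, the same calculation gives 29 nodes.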

Slide 27: Summary
- GPUs provide exceptional speedup for real-world reservoir simulation
  - Strong solvers are critical to performance, but
  - Solvers alone are not enough for dramatic gains in performance
  - Remaining code can be accelerated despite complex feature set
  - Results are scalable to very large models
- Implications for engineering workflows
  - Results over coffee, not overnight: more realizations, more productivity
  - Geoscale models are practical, not just feasible
  - Benefits from GPUs will continue to grow: 1 TB/s of bandwidth expected by 2016

Slide 28: Acknowledgements
We would like to thank the Marathon Oil Corporation for funding, for providing asset models, and for permission to publish this work. We would like to thank NVIDIA for lending GPUs and for access to their cluster hardware.
