Advances of parallel computing. Kirill Bogachev May 2016


Demands in Simulations. Field development relies more and more on static and dynamic modeling of reservoirs, which has come a long way from being a simple material balance estimator to full-physics numerical simulators. As simulations have developed over time, the models have become more demanding and more complex: rock properties, fluid and reservoir description, well models, surface networks, compositional and thermal effects, EOR, etc. Grid dimensions are chosen based on available resources and project time frames, and proper uncertainty analysis is often skipped due to limited time.

Grid Resolution Effects. [Figure: comparison of a coarse grid (2 m x 50 m x 0.7 m) and a fine grid (1 m x 50 m x 0.7 m).]

Moore's Law. "The number of transistors in a dense integrated circuit doubles approximately every two years" - Gordon Moore, co-founder of Intel, 1965.

Evolution of microprocessors: only the number of transistors and cores continues to rise!

2005 - first mass-produced multicore CPUs. In old clusters, all computational cores are isolated by distributed memory (MPI required), and most conventional algorithms are designed around this paradigm. In shared memory systems, all cores communicate directly, which is significantly faster than communication between cluster nodes. Simulation software has to take this into account to maximize parallel performance.
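
A minimal sketch of the difference, assuming a standard MPI library and an OpenMP-capable C++ compiler (the dot product, array names and sizes are illustrative, not the simulator's actual code): on distributed memory each core sees only its own slice and partial results must cross the network, while on shared memory all cores combine their results directly in RAM.

```cpp
// Illustrative comparison: the same dot product in the two memory models.
#include <mpi.h>
#include <omp.h>
#include <vector>

// Distributed memory: each rank owns a local slice; partial sums must be
// combined with an explicit message-passing call (MPI_Init assumed done).
double dot_mpi(const std::vector<double>& x_local, const std::vector<double>& y_local) {
    double local = 0.0;
    for (long long i = 0; i < (long long)x_local.size(); ++i)
        local += x_local[i] * y_local[i];
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return global;
}

// Shared memory: every core sees the whole arrays; the reduction happens
// directly in RAM, with no network traffic at all.
double dot_threads(const std::vector<double>& x, const std::vector<double>& y) {
    double sum = 0.0;
    const long long n = (long long)x.size();
    #pragma omp parallel for reduction(+ : sum)
    for (long long i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```

Algorithms written around the first pattern do not automatically benefit from the second; the point of the slide is that the software has to be restructured to exploit the much faster in-memory path.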

HPC for Numerical Modeling: climate modeling and weather forecasting, space technologies, digital content, medicine, financial analysis, technical design. All industries run massive high-performance computing simulations on a daily basis.

In the meantime, in reservoir simulations... [Chart: parallel speed-up in reservoir simulations vs. number of cores (1 to 128); the speed-up axis tops out around 15.]

Desktops and Workstations. [Diagram: dual-socket Intel Xeon Processor E5 v4 system, up to 22 cores and up to 55 MB shared cache per CPU, 4 channels of up to DDR4 2400 MHz memory per socket.] Shared memory systems: fast interactions between the cores; no need to introduce grid domains; the system of equations can be solved directly on the matrix level.
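
As an illustration of "solving directly on the matrix level", here is a hedged sketch of the basic kernel of an iterative solver, a sparse matrix-vector product in CSR format parallelized over rows with OpenMP. The struct layout and names are assumptions made for this example, not the simulator's actual data structures.

```cpp
// Shared-memory sparse matrix-vector product (y = A * x) in CSR format.
#include <omp.h>
#include <vector>

struct CsrMatrix {
    std::vector<int>    row_ptr;  // size n_rows + 1, offsets into col_idx/values
    std::vector<int>    col_idx;  // column index of each nonzero
    std::vector<double> values;   // nonzero values
};

// Each thread handles a contiguous block of rows and reads x directly from
// shared memory, so no grid domains or boundary exchanges are needed.
void spmv(const CsrMatrix& A, const std::vector<double>& x, std::vector<double>& y) {
    const long long n_rows = (long long)A.row_ptr.size() - 1;  // y must hold n_rows entries
    #pragma omp parallel for schedule(static)
    for (long long row = 0; row < n_rows; ++row) {
        double acc = 0.0;
        for (int k = A.row_ptr[row]; k < A.row_ptr[row + 1]; ++k)
            acc += A.values[k] * x[A.col_idx[k]];
        y[row] = acc;
    }
}
```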

Desktops and Workstations. [Diagram: two NUMA nodes connected by QPI, each with its own DDR4 memory.] Memory bandwidth: up to 76 GB/s per machine (roughly 10 times the Infiniband speed). For maximum performance the software uses the following hardware features: shared memory (matrix blocks are selected automatically on the matrix level); Non-Uniform Memory Access (memory is allocated dynamically through NUMA); hyperthreading (system threads are accessed directly); fast CPU cache (big enough to fit matrix blocks); all parts of the code are parallel, not just the linear solver; special compiler settings.
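
A minimal sketch of the NUMA point, assuming OpenMP with threads pinned to cores (e.g. OMP_PROC_BIND=close and OMP_PLACES=cores set in the environment); sizes and names are illustrative. With the usual "first touch" policy, each thread initializes the pages it will later compute on, so the data lands in the memory attached to the socket running that thread.

```cpp
// NUMA-friendly "first touch" allocation sketch with OpenMP.
#include <omp.h>

int main() {
    const long long n = 1LL << 27;        // 128M doubles (~1 GiB) per array
    double* a = new double[n];            // memory reserved, pages not yet touched
    double* b = new double[n];

    // First touch: the same static schedule is used here and in the compute
    // loop, so every page stays local to the thread (and socket) that uses it.
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < n; ++i) { a[i] = 1.0; b[i] = 2.0; }

    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+ : sum)
    for (long long i = 0; i < n; ++i)
        sum += a[i] * b[i];

    delete[] a;
    delete[] b;
    return sum > 0.0 ? 0 : 1;
}
```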

High-end Desktops and Workstations.
2011: Dual Xeon X5650, (2x6) 12 cores, 2.66 GHz, 3 channels of DDR3 1333 MHz (e.g. HP Z800)
2012: Dual Xeon E5-2680, (2x8) 16 cores, 2.7 GHz, 4 channels of DDR3 1600 MHz (e.g. HP Z820)
2013: Dual Xeon E5-2697 v2, (2x12) 24 cores, 2.7 GHz, 4 channels of DDR3 1866 MHz (e.g. HP Z820)
2014: Dual Xeon E5-2697 v3, (2x14) 28 cores, 2.6 GHz, 4 channels of DDR4 2133 MHz (e.g. HP Z840)
2016: Dual Xeon E5-2699 v4, (2x22) 44 cores, 2.2 GHz, 4 channels of DDR4 2400 MHz (e.g. HP Z840)
[Chart: speed-up vs. a single core as a function of the number of threads for each system.]

High-end Desktops and Workstations.
2013: Dual Xeon E5-2697 v2, (2x12) 24 cores, 2.7 GHz, 4 channels of DDR3 1866 MHz (e.g. HP Z820)
2014: Dual Xeon E5-2697 v3, (2x14) 28 cores, 2.6 GHz, 4 channels of DDR4 2133 MHz (e.g. HP Z840)
2016: Dual Xeon E5-2699 v4, (2x22) 44 cores, 2.2 GHz, 4 channels of DDR4 2400 MHz (e.g. HP Z840)
[Chart: run time (hours) as a function of the number of threads for each system.]

Modern HPC clusters are not as complex as space shuttles anymore. Example: 8 dual-CPU nodes with 10-core Xeon E5 v2 at 2.8 GHz, 160 cores in total (= 8 workstations connected with Infiniband 56 Gb/s), 1.024 TB of DDR3 1866 MHz memory, ~$75K. Handles models with up to 300 million active grid blocks with a parallel speed-up of 80-100 times.

Solver: hybrid algorithm, removing the bottlenecks. The simulator's solver integrates both MPI and OS-thread system calls. On the node level, parallelization between CPU cores is done at the level of the solver matrix using OS threads, exploiting NUMA memory bandwidth of ~80 GB/s inside a dual-CPU node instead of the ~8 GB/s of the MPI cluster network. As a result, the number of MPI processes is limited to the number of cluster nodes, not the total number of cores. This removes one of the major performance bottlenecks: the network throughput limit! (SPE 163090)
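
A hedged sketch of that hybrid layout, with OpenMP standing in for the OS threads mentioned on the slide (standard MPI and OpenMP calls only; the printed message is illustrative): one MPI process is started per cluster node, OpenMP threads fan out across that node's cores, and only the master thread ever touches the network.

```cpp
// Hybrid MPI + OpenMP skeleton: one rank per node, threads within the node.
#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char** argv) {
    // MPI_THREAD_FUNNELED: threads exist, but only the master thread makes
    // MPI calls, so MPI endpoints equal the node count, not the core count.
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank = 0, n_ranks = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &n_ranks);

    #pragma omp parallel
    {
        // Node-local work: threads cooperate on this rank's matrix block
        // through shared memory at NUMA bandwidth.
        #pragma omp master
        std::printf("rank %d of %d running %d threads\n",
                    rank, n_ranks, omp_get_num_threads());
    }

    // Only coarse per-node data (e.g. boundary values between node-level
    // domains) crosses the network, issued by the master thread.
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
```

Under these assumptions a typical launch would look like `mpirun -np <nodes> --map-by node ./simulator` (Open MPI syntax, binary name illustrative) with OMP_NUM_THREADS set to the number of cores per node.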

Model grid domains. Suppose we have the following. Model: 3-phase with 2.5 million active grid cells. Cluster: 10 nodes x 20 cores = 200 cores in total. Conventional MPI: 200 grid domains exchanging boundary conditions. Multilevel hybrid method: 10 grid domains exchanging boundary conditions. [Charts: parallel speed-up vs. number of cores (1 to 200); the conventional-MPI axis tops out near 15, the multilevel-hybrid axis near 150.]
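
A back-of-the-envelope sketch (purely illustrative numbers, assuming roughly cubic domains) of why fewer, larger domains communicate less: the boundary of a domain grows like its volume to the power 2/3, so splitting the same 2.5 million cells into 200 domains exposes noticeably more total boundary cells than splitting it into 10, and with the hybrid method the remaining exchanges stay inside a node.

```cpp
// Rough surface-to-volume estimate: total boundary cells when a grid of
// total_cells is split into D roughly cubic domains. Purely illustrative.
#include <cmath>
#include <cstdio>

double boundary_cells(double total_cells, int domains) {
    double per_domain = total_cells / domains;
    double side = std::cbrt(per_domain);   // cells along one edge of a domain
    return domains * 6.0 * side * side;    // six faces per domain
}

int main() {
    const double cells = 2.5e6;
    std::printf("200 domains: ~%.0f boundary cells in total\n", boundary_cells(cells, 200));
    std::printf(" 10 domains: ~%.0f boundary cells in total\n", boundary_cells(cells, 10));
    return 0;
}
```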

Cluster Parallel Scalability. Old cluster: 20 dual-CPU (12-core) nodes, 40 Xeon X5650 CPUs, 240 cores, 24 GB DDR3 1333 MHz, Infiniband 40 Gb/s. New cluster: 8 dual-CPU (20-core) nodes, 16 Xeon E5-2680 v2 CPUs, 160 cores, 128 GB DDR3 1866 MHz, Infiniband 56 Gb/s. [Chart: acceleration vs. number of cores for the Xeon X5650 and Xeon E5-2680 v2 clusters.]

Testing the limits. Top-20 cluster, 512 nodes used: dual Xeon 5570 per node, 4096 cores, DDR3 1333 MHz. Model: 21.8 million active blocks, 39 wells, 3-phase black oil. Acceleration: 1328 times, from 2.5 weeks to 19 minutes. [Chart: acceleration vs. number of cores.] (SPE 163090)

Testing the limits. Model: 22 million active grid blocks, 3-phase black oil with gas cap, 200 well connections. [Chart: acceleration factor (1 to 1024, log scale) vs. number of cores (1 to 4096) for 8-core/node and 64-core/node configurations.] No sharp parallel scalability saturation is observed! The technology works for very high core-per-node densities and is ready for future CPUs! (SPE 162090)

Easy to install, easy to use. [Chart: power consumption comparison: Xeon X5650 cluster 6.4 kW, Xeon E5-2680 v2 cluster 3.2 kW, Bosch TWK 7603 3.0 kW, Tefal FV9630 2.6 kW.] In-house clusters: can be installed in a regular office space; take only 4-6 weeks to build; need an air-conditioned room and a LAN connection; are significantly more economical than 5-10 years ago.

In-house Cluster Setup. [Diagram: users' GUIs connect over the network to a dispatcher; the head node, shared storage and cluster nodes exchange data and control traffic over the cluster network.]

User Interface. Job queue management (start, stop, results view). Full graphical monitoring of simulation results at runtime (2D, 3D, wells, perforations, 3D streamlines).

Thank you!