Advances of parallel computing. Kirill Bogachev May 2016


1 Advances of parallel computing Kirill Bogachev May 2016

2 Demands in Simulations
Field development relies more and more on static and dynamic reservoir modeling, which has come a long way from simple material balance estimators to full-physics numerical simulators.
As simulation has developed over time, models have become more demanding and more complex: rock properties, fluids and reservoir description, well models, surface networks, compositional and thermal effects, EOR, etc.
Grid dimensions are chosen based on available resources and project time frames.
Proper uncertainty analysis is often skipped due to limited time.

3 Grid Resolution Effects
[Figure: coarse grid (2 m x 50 m x 0.7 m) vs. fine grid (1 m x 50 m x 0.7 m)]

4 Moore's Law
"The number of transistors in a dense integrated circuit doubles approximately every two years" - Gordon Moore, co-founder of Intel (observation made in 1965; doubling period revised to two years in 1975)
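As a rough illustration of what a two-year doubling period implies, the sketch below computes the growth factor over a span of years. The function name and the 2005 to 2016 interval are choices made for this example, not figures taken from the slide.

```python
def moore_growth_factor(years, doubling_period_years=2.0):
    """Growth factor after `years`, assuming one doubling per period."""
    return 2.0 ** (years / doubling_period_years)


# Assumed interval for illustration: 2005 to 2016 (11 years).
factor_2005_2016 = moore_growth_factor(2016 - 2005)
print(f"Transistor count growth over 11 years: about {factor_2005_2016:.0f}x")
```

Under this idealized model, eleven years corresponds to 5.5 doublings, i.e. roughly a 45-fold increase.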

5 Evolution of microprocessors
Only the number of transistors/cores continues to rise!

6 2005 - First Serial Multicore CPUs
In old clusters, all computational cores are isolated by distributed memory (MPI required). Most conventional algorithms are designed around this paradigm.
In shared memory systems, all cores communicate directly, which is significantly faster than communication between cluster nodes.
Simulation software has to take this into account to maximize parallel performance.

7 HPC for Numerical Modeling
- Climate modeling, weather forecasting
- Space technologies
- Digital content
- Medicine
- Financial analysis
- Technical design
All industries run massive high-performance computing simulations on a daily basis.

8 In the meantime, in the reservoir simulations
[Figure: parallel speed-up in reservoir simulations]

9 Desktops and Workstations
[Diagram: two Intel Xeon Processor E5 v4 CPUs, each with up to 22 cores, up to 55MB shared cache, and 4 channels of DDR4 memory]
Shared memory systems:
- Fast interactions between the cores
- No need to introduce grid domains
- The system of equations can be solved directly on the matrix level

10 Desktops and Workstations
[Diagram: two CPUs, each with its own DDR4 memory (NUMA), linked by QPI]
Machine bandwidth: up to 76GB/s (~10 times the Infiniband speed)
The software uses the following hardware features for maximum performance:
- Shared memory: blocks are selected automatically on the matrix level
- Non-Uniform Memory Access: memory is allocated dynamically through NUMA
- Hyperthreading: system threads accessed directly
- Fast CPU cache: big enough to fit matrix blocks
- All parts of code are parallel: not just linear solver
- Special compiler settings
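The "~10 times" figure can be checked with simple unit arithmetic. The sketch below assumes the comparison is against the 56Gb/s Infiniband link quoted for the cluster later in the deck and ignores link encoding overhead; both are assumptions of this example.

```python
def gbit_to_gbyte(gbit_per_s):
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbit_per_s / 8.0


memory_bandwidth_gbyte = 76.0                 # in-machine bandwidth from the slide
infiniband_gbyte = gbit_to_gbyte(56.0)        # assumed 56Gb/s link, raw rate
ratio = memory_bandwidth_gbyte / infiniband_gbyte

print(f"Infiniband: {infiniband_gbyte:.1f} GB/s, ratio: {ratio:.1f}x")
```

With these inputs the link moves about 7 GB/s, so the in-machine bandwidth is roughly an order of magnitude higher, consistent with the slide.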

11 High-end Desktops and Workstations
[Figure: speed-up vs. single core against number of threads]
- 2011: Dual Xeon X5650, (2x6) 12 cores, 2.66GHz, 3 DDR memory channels (e.g. HP Z800)
- 2012: Dual Xeon E5-2680, (2x8) 16 cores, 2.7GHz, 4 DDR memory channels (e.g. HP Z820)
- 2013: Dual Xeon E5-2697v2, (2x12) 24 cores, 2.7GHz, 4 channels DDR3 1866MHz (e.g. HP Z820)
- 2014: Dual Xeon E5-2697v3, (2x14) 28 cores, 2.6GHz, 4 channels DDR4 2133MHz (e.g. HP Z840)
- 2016: Dual Xeon E5-2699v4, (2x22) 44 cores, 2.2GHz, 4 DDR memory channels (e.g. HP Z840)

12 High-end Desktops and Workstations
[Figure: run time (hours) against number of threads]
- 2013: Dual Xeon E5-2697v2, (2x12) 24 cores, 2.7GHz, 4 channels DDR3 1866MHz (e.g. HP Z820)
- 2014: Dual Xeon E5-2697v3, (2x14) 28 cores, 2.6GHz, 4 channels DDR4 2133MHz (e.g. HP Z840)
- 2016: Dual Xeon E5-2699v4, (2x22) 44 cores, 2.2GHz, 4 DDR memory channels (e.g. HP Z840)

13 Modern HPC clusters are not as complex as space shuttles anymore
- 10-core Xeon E5v2 2.8GHz
- 8 dual-CPU nodes with 160 cores in total (= 8 workstations connected with Infiniband 56Gb/s)
- 1.024TB of DDR3 1866MHz memory
- ~ $75K
- Models with up to 300 million active grid blocks
- Parallel speed-up times

14 Solver: hybrid algorithm. Removing the bottlenecks.
The simulator solver integrates both MPI and thread system calls.
Node level: the parallelization between CPU cores is done on the level of the solver matrix using OS threads.
As a result, the number of MPI processes is limited to the number of cluster nodes, not the total number of cores.
This removes one of the major performance bottlenecks: the network throughput limit!
[Diagram: cluster node with 2 CPUs; MPI over the cluster network at ~8GB/s, OS threads on the matrix over NUMA at ~80GB/s]
SPE
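To make the node-level idea concrete, here is a minimal sketch of threads sharing one matrix and each handling a contiguous block of rows in a sparse matrix-vector product. It is not the simulator's solver: the row-dictionary storage, the function names, and the use of Python's `ThreadPoolExecutor` are assumptions made for illustration. In standard CPython the global interpreter lock also prevents real speed-up for this pure-Python arithmetic, so the example shows the data layout only.

```python
from concurrent.futures import ThreadPoolExecutor


def split_rows(n_rows, n_blocks):
    """Split 0..n_rows into n_blocks contiguous (start, stop) ranges."""
    base, extra = divmod(n_rows, n_blocks)
    ranges, start = [], 0
    for b in range(n_blocks):
        stop = start + base + (1 if b < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def matvec_block(rows, x, start, stop):
    """Compute y[start:stop] for y = A x; rows[i] maps column index -> value."""
    return [sum(v * x[j] for j, v in rows[i].items()) for i in range(start, stop)]


def threaded_matvec(rows, x, n_threads):
    """All threads read the same shared matrix and vector; no domain copies."""
    ranges = split_rows(len(rows), n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        parts = pool.map(lambda r: matvec_block(rows, x, r[0], r[1]), ranges)
        return [value for part in parts for value in part]


# Hypothetical test matrix: 1D Laplacian stencil (2 on the diagonal, -1 beside it).
n = 6
rows = [{j: (2 if j == i else -1) for j in (i - 1, i, i + 1) if 0 <= j < n}
        for i in range(n)]
y = threaded_matvec(rows, [1] * n, n_threads=3)
print(y)
```

In a hybrid scheme of the kind the slide describes, a routine like this would run inside each MPI process, so that only one process per node talks to the network.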

15 Model grid domains
Suppose we have:
Model: 3-phase, with 2.5 million active grid cells
Cluster: 10 nodes x 20 cores = 200 cores in total
Conventional MPI: 200 grid domains exchanging boundary conditions
Multilevel Hybrid method: grid domains exchanging boundary conditions
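The domain counts for this example can be worked out as below. The sketch assumes one MPI process (and one grid domain) per core in the conventional scheme, and one per node in the hybrid scheme, following the statement that the number of MPI processes is limited to the number of cluster nodes. The function and key names are hypothetical.

```python
def domain_layout(active_cells, nodes, cores_per_node):
    """Compare grid-domain counts for per-core MPI and per-node hybrid layouts."""
    total_cores = nodes * cores_per_node
    return {
        "conventional_mpi": {
            "domains": total_cores,               # assumed: one domain per core
            "cells_per_domain": active_cells // total_cores,
        },
        "hybrid": {
            "domains": nodes,                     # assumed: one domain per node
            "threads_per_domain": cores_per_node,
            "cells_per_domain": active_cells // nodes,
        },
    }


layout = domain_layout(active_cells=2_500_000, nodes=10, cores_per_node=20)
for name, info in layout.items():
    print(name, info)
```

Under these assumptions the conventional layout has 200 domains of 12,500 cells each, while the hybrid layout has 10 domains of 250,000 cells, each worked on by 20 threads, so far fewer domain boundaries have to be exchanged over the network.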

16 Cluster Parallel Scalability
[Figure: acceleration against number of cores, Xeons X5650 vs. Xeons E5-2680v2]
Old cluster: 20 dual (12 core) nodes, 40 Xeons X5650, 240 cores, 24GB DDR3 1333MHz, Infiniband 40Gb/s
New cluster: 8 dual (20 core) nodes, 16 Xeons E5-2680v2, 160 cores, 128GB DDR3 1866MHz, Infiniband 56Gb/s

17 Testing the limits
[Figure: acceleration against number of cores]
Top 20 cluster: 512 nodes used, Dual Xeon cores, DDR3 1333MHz
21.8 million active blocks, 39 wells, 3-phase black oil
Acceleration: 1328 times, from 2.5 weeks to 19 minutes
SPE
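The quoted run times and the acceleration factor are consistent with each other, as the short check below shows. It assumes that "2.5 weeks" is the single-core run time and that a week means 7 full days of computation; the names are illustrative.

```python
MINUTES_PER_WEEK = 7 * 24 * 60


def accelerated_minutes(serial_weeks, speedup):
    """Run time in minutes after dividing a serial run time by the speed-up."""
    return serial_weeks * MINUTES_PER_WEEK / speedup


serial_minutes = 2.5 * MINUTES_PER_WEEK
minutes = accelerated_minutes(serial_weeks=2.5, speedup=1328)
print(f"Serial: {serial_minutes:.0f} min, accelerated: {minutes:.1f} min")
```

Dividing 25,200 minutes by 1328 gives just under 19 minutes, matching the slide.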

18 Testing the limits
[Figure: acceleration factor against number of cores, with series labeled by cores per node (including 8-core/node)]
22 million active grid blocks, 3-phase black oil, gas cap, 200 well connections
No sharp parallel scalability saturation is observed!
Technology works for very high core/node densities, ready for future CPUs!
SPE

19 Easy to install, easy to use
[Figure: power comparison: Xeons X5650 6.4kW, Xeons E5-2680v2 3.2kW, Bosch TWK 7603 3.0kW, Tefal FV9630 2.6kW]
In-house clusters:
- Can be installed in a regular office space
- Take only 4-6 weeks to build
- Need air-conditioned room and LAN connection
- Significantly more economical than 5-10 years ago

20 In-house Cluster Setup
[Diagram components: users with GUI clients, network, dispatcher, head node, shared storage, cluster nodes, cluster network; data and control flows]

21 User Interface
- Job queue management (start, stop, results view)
- Full graphics simulation results monitoring at runtime (2D, 3D, wells, perforations, 3D streamlines)

22 Thank you!
