Advances of parallel computing. Kirill Bogachev May 2016


Demands in Simulations. Field development relies more and more on static and dynamic modeling of reservoirs, which has come a long way from being a simple material balance estimator to full-physics numerical simulators. As simulations have developed over time, the models have become more demanding and more complex: rock properties, fluid and reservoir description, well models, surface networks, compositional and thermal effects, EOR, etc. Grid dimensions are chosen based on available resources and project time frames, and proper uncertainty analysis is often skipped due to limited time.

Grid Resolution Effects. [Figure: comparison of a coarse grid (2 m x 50 m x 0.7 m) and a fine grid (1 m x 50 m x 0.7 m).]

Moore's Law. "The number of transistors in a dense integrated circuit doubles approximately every two years" - Gordon Moore, co-founder of Intel, 1965.

Evolution of microprocessors: only the number of transistors and cores continues to rise!

2005 - first mass-produced multicore CPUs. In old clusters, all computational cores are isolated by distributed memory (MPI required), and most conventional algorithms are designed around this paradigm. In shared memory systems, all cores communicate directly, which is significantly faster than communication between cluster nodes. Simulation software has to take this into account to maximize parallel performance.
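
A minimal sketch of the difference, assuming a standard MPI library and an OpenMP-capable C++ compiler (the dot product, array names and sizes are illustrative, not the simulator's actual code): on distributed memory each core sees only its own slice and partial results must cross the network, while on shared memory all cores combine their results directly in RAM.

```cpp
// Illustrative comparison: the same dot product in the two memory models.
#include <mpi.h>
#include <omp.h>
#include <vector>

// Distributed memory: each rank owns a local slice; partial sums must be
// combined with an explicit message-passing call (MPI_Init assumed done).
double dot_mpi(const std::vector<double>& x_local, const std::vector<double>& y_local) {
    double local = 0.0;
    for (long long i = 0; i < (long long)x_local.size(); ++i)
        local += x_local[i] * y_local[i];
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return global;
}

// Shared memory: every core sees the whole arrays; the reduction happens
// directly in RAM, with no network traffic at all.
double dot_threads(const std::vector<double>& x, const std::vector<double>& y) {
    double sum = 0.0;
    const long long n = (long long)x.size();
    #pragma omp parallel for reduction(+ : sum)
    for (long long i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```

Algorithms written around the first pattern do not automatically benefit from the second; the point of the slide is that the software has to be restructured to exploit the much faster in-memory path.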

HPC for Numerical Modeling: climate modeling and weather forecasting, space technologies, digital content, medicine, financial analysis, technical design. All industries run massive high-performance computing simulations on a daily basis.

In the meantime, in reservoir simulations... [Chart: parallel speed-up in reservoir simulations vs. number of cores (1 to 128); the speed-up axis tops out around 15.]

Desktops and Workstations. [Diagram: dual-socket Intel Xeon Processor E5 v4 system, up to 22 cores and up to 55 MB shared cache per CPU, 4 channels of up to DDR4 2400 MHz memory per socket.] Shared memory systems: fast interactions between the cores; no need to introduce grid domains; the system of equations can be solved directly on the matrix level.
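
As an illustration of "solving directly on the matrix level", here is a hedged sketch of the basic kernel of an iterative solver, a sparse matrix-vector product in CSR format parallelized over rows with OpenMP. The struct layout and names are assumptions made for this example, not the simulator's actual data structures.

```cpp
// Shared-memory sparse matrix-vector product (y = A * x) in CSR format.
#include <omp.h>
#include <vector>

struct CsrMatrix {
    std::vector<int>    row_ptr;  // size n_rows + 1, offsets into col_idx/values
    std::vector<int>    col_idx;  // column index of each nonzero
    std::vector<double> values;   // nonzero values
};

// Each thread handles a contiguous block of rows and reads x directly from
// shared memory, so no grid domains or boundary exchanges are needed.
void spmv(const CsrMatrix& A, const std::vector<double>& x, std::vector<double>& y) {
    const long long n_rows = (long long)A.row_ptr.size() - 1;  // y must hold n_rows entries
    #pragma omp parallel for schedule(static)
    for (long long row = 0; row < n_rows; ++row) {
        double acc = 0.0;
        for (int k = A.row_ptr[row]; k < A.row_ptr[row + 1]; ++k)
            acc += A.values[k] * x[A.col_idx[k]];
        y[row] = acc;
    }
}
```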

Desktops and Workstations. [Diagram: two NUMA nodes connected by QPI, each with its own DDR4 memory.] Memory bandwidth: up to 76 GB/s per machine (roughly 10 times the Infiniband speed). For maximum performance the software uses the following hardware features: shared memory (matrix blocks are selected automatically on the matrix level); Non-Uniform Memory Access (memory is allocated dynamically through NUMA); hyperthreading (system threads are accessed directly); fast CPU cache (big enough to fit matrix blocks); all parts of the code are parallel, not just the linear solver; special compiler settings.
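
A minimal sketch of the NUMA point, assuming OpenMP with threads pinned to cores (e.g. OMP_PROC_BIND=close and OMP_PLACES=cores set in the environment); sizes and names are illustrative. With the usual "first touch" policy, each thread initializes the pages it will later compute on, so the data lands in the memory attached to the socket running that thread.

```cpp
// NUMA-friendly "first touch" allocation sketch with OpenMP.
#include <omp.h>

int main() {
    const long long n = 1LL << 27;        // 128M doubles (~1 GiB) per array
    double* a = new double[n];            // memory reserved, pages not yet touched
    double* b = new double[n];

    // First touch: the same static schedule is used here and in the compute
    // loop, so every page stays local to the thread (and socket) that uses it.
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < n; ++i) { a[i] = 1.0; b[i] = 2.0; }

    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+ : sum)
    for (long long i = 0; i < n; ++i)
        sum += a[i] * b[i];

    delete[] a;
    delete[] b;
    return sum > 0.0 ? 0 : 1;
}
```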

High-end Desktops and Workstations.
2011: Dual Xeon X5650, (2x6) 12 cores, 2.66 GHz, 3 channels of DDR3 1333 MHz (e.g. HP Z800)
2012: Dual Xeon E5-2680, (2x8) 16 cores, 2.7 GHz, 4 channels of DDR3 1600 MHz (e.g. HP Z820)
2013: Dual Xeon E5-2697 v2, (2x12) 24 cores, 2.7 GHz, 4 channels of DDR3 1866 MHz (e.g. HP Z820)
2014: Dual Xeon E5-2697 v3, (2x14) 28 cores, 2.6 GHz, 4 channels of DDR4 2133 MHz (e.g. HP Z840)
2016: Dual Xeon E5-2699 v4, (2x22) 44 cores, 2.2 GHz, 4 channels of DDR4 2400 MHz (e.g. HP Z840)
[Chart: speed-up vs. a single core as a function of the number of threads for each system.]

High-end Desktops and Workstations.
2013: Dual Xeon E5-2697 v2, (2x12) 24 cores, 2.7 GHz, 4 channels of DDR3 1866 MHz (e.g. HP Z820)
2014: Dual Xeon E5-2697 v3, (2x14) 28 cores, 2.6 GHz, 4 channels of DDR4 2133 MHz (e.g. HP Z840)
2016: Dual Xeon E5-2699 v4, (2x22) 44 cores, 2.2 GHz, 4 channels of DDR4 2400 MHz (e.g. HP Z840)
[Chart: run time (hours) as a function of the number of threads for each system.]

Modern HPC clusters are not as complex as space shuttles anymore. Example: 8 dual-CPU nodes with 10-core Xeon E5 v2 at 2.8 GHz, 160 cores in total (= 8 workstations connected with Infiniband 56 Gb/s), 1.024 TB of DDR3 1866 MHz memory, ~$75K. Handles models with up to 300 million active grid blocks with a parallel speed-up of 80-100 times.

Solver: hybrid algorithm, removing the bottlenecks. The simulator's solver integrates both MPI and OS-thread system calls. On the node level, parallelization between CPU cores is done at the level of the solver matrix using OS threads, exploiting NUMA memory bandwidth of ~80 GB/s inside a dual-CPU node instead of the ~8 GB/s of the MPI cluster network. As a result, the number of MPI processes is limited to the number of cluster nodes, not the total number of cores. This removes one of the major performance bottlenecks: the network throughput limit! (SPE 163090)
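
A hedged sketch of that hybrid layout, with OpenMP standing in for the OS threads mentioned on the slide (standard MPI and OpenMP calls only; the printed message is illustrative): one MPI process is started per cluster node, OpenMP threads fan out across that node's cores, and only the master thread ever touches the network.

```cpp
// Hybrid MPI + OpenMP skeleton: one rank per node, threads within the node.
#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char** argv) {
    // MPI_THREAD_FUNNELED: threads exist, but only the master thread makes
    // MPI calls, so MPI endpoints equal the node count, not the core count.
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank = 0, n_ranks = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &n_ranks);

    #pragma omp parallel
    {
        // Node-local work: threads cooperate on this rank's matrix block
        // through shared memory at NUMA bandwidth.
        #pragma omp master
        std::printf("rank %d of %d running %d threads\n",
                    rank, n_ranks, omp_get_num_threads());
    }

    // Only coarse per-node data (e.g. boundary values between node-level
    // domains) crosses the network, issued by the master thread.
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
```

Under these assumptions a typical launch would look like `mpirun -np <nodes> --map-by node ./simulator` (Open MPI syntax, binary name illustrative) with OMP_NUM_THREADS set to the number of cores per node.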

Model grid domains. Suppose we have the following. Model: 3-phase with 2.5 million active grid cells. Cluster: 10 nodes x 20 cores = 200 cores in total. Conventional MPI: 200 grid domains exchanging boundary conditions. Multilevel hybrid method: 10 grid domains exchanging boundary conditions. [Charts: parallel speed-up vs. number of cores (1 to 200); the conventional-MPI axis tops out near 15, the multilevel-hybrid axis near 150.]
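
A back-of-the-envelope sketch (purely illustrative numbers, assuming roughly cubic domains) of why fewer, larger domains communicate less: the boundary of a domain grows like its volume to the power 2/3, so splitting the same 2.5 million cells into 200 domains exposes noticeably more total boundary cells than splitting it into 10, and with the hybrid method the remaining exchanges stay inside a node.

```cpp
// Rough surface-to-volume estimate: total boundary cells when a grid of
// total_cells is split into D roughly cubic domains. Purely illustrative.
#include <cmath>
#include <cstdio>

double boundary_cells(double total_cells, int domains) {
    double per_domain = total_cells / domains;
    double side = std::cbrt(per_domain);   // cells along one edge of a domain
    return domains * 6.0 * side * side;    // six faces per domain
}

int main() {
    const double cells = 2.5e6;
    std::printf("200 domains: ~%.0f boundary cells in total\n", boundary_cells(cells, 200));
    std::printf(" 10 domains: ~%.0f boundary cells in total\n", boundary_cells(cells, 10));
    return 0;
}
```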

Cluster Parallel Scalability. Old cluster: 20 dual-CPU (12-core) nodes, 40 Xeon X5650 CPUs, 240 cores, 24 GB DDR3 1333 MHz, Infiniband 40 Gb/s. New cluster: 8 dual-CPU (20-core) nodes, 16 Xeon E5-2680 v2 CPUs, 160 cores, 128 GB DDR3 1866 MHz, Infiniband 56 Gb/s. [Chart: acceleration vs. number of cores for the Xeon X5650 and Xeon E5-2680 v2 clusters.]

Testing the limits. Top-20 cluster, 512 nodes used: dual Xeon 5570 per node, 4096 cores, DDR3 1333 MHz. Model: 21.8 million active blocks, 39 wells, 3-phase black oil. Acceleration: 1328 times, from 2.5 weeks to 19 minutes. [Chart: acceleration vs. number of cores.] (SPE 163090)

Testing the limits. Model: 22 million active grid blocks, 3-phase black oil with gas cap, 200 well connections. [Chart: acceleration factor (1 to 1024, log scale) vs. number of cores (1 to 4096) for 8-core/node and 64-core/node configurations.] No sharp parallel scalability saturation is observed! The technology works for very high core-per-node densities and is ready for future CPUs! (SPE 162090)

Easy to install, easy to use. [Chart: power consumption comparison: Xeon X5650 cluster 6.4 kW, Xeon E5-2680 v2 cluster 3.2 kW, Bosch TWK 7603 3.0 kW, Tefal FV9630 2.6 kW.] In-house clusters: can be installed in a regular office space; take only 4-6 weeks to build; need an air-conditioned room and a LAN connection; are significantly more economical than 5-10 years ago.

In-house Cluster Setup. [Diagram: users' GUIs connect over the network to a dispatcher; the head node, shared storage and cluster nodes exchange data and control traffic over the cluster network.]

User Interface. Job queue management (start, stop, results view). Full graphical monitoring of simulation results at runtime (2D, 3D, wells, perforations, 3D streamlines).

Thank you!