High Performance Computing and O&G Challenges


Seismic exploration challenges

High Performance Computing and O&G challenges
- Worldwide context
- Seismic sub-surface imaging
- Computing power needs
- Accelerating technology: status and perspectives in TOTAL

Worldwide context
Worldwide O&G consumption per year: approx. 50 Gboe. Average size of a new oil field discovery: ~50 Mboe over the last 25 years.
[Chart: oil discoveries vs. oil production, in Gboe, for hydrocarbons (liquids and gas), 1920-2000.]
[Chart: industry reserves from new discoveries, 1996-2007, split into gas (Gboe) and liquids (Gb); source: IHS-Edin, April 2008.]

Principle: ultrasound
- Seismic imaging: signal frequency between 6 and 90 Hz; image resolution of some tens of meters; heterogeneous media (spatial variability of density and signal velocity); surveys of about 7,000 km², imaged down to roughly 10 km.
- Medical imaging: signal frequency around 1 MHz; image resolution of a few mm; approximately homogeneous media.

Seismic imaging in a nutshell: data acquisition, depth imaging loop, interpretation.

Wave equation: the basic tool
Seismic modeling and velocity model updates are both built on the acoustic wave equation
\[
\frac{1}{c^2}\frac{\partial^2 P}{\partial t^2}
- \left( \frac{\partial^2 P}{\partial x^2} + \frac{\partial^2 P}{\partial y^2} + \frac{\partial^2 P}{\partial z^2} \right) = 0,
\qquad P = P(x, y, z, t).
\]
Depth imaging relies either on an asymptotic approximation or on a paraxial/full approximation of this equation.

Seismic acquisition breakthrough and modeling
[Figure: evolution of acquisition geometry and cost: 1995, ~7,000 $/km², 4 x 100 m; 2005, ~12,000 $/km², 10 x 100 m; 2008, 100,000-200,000 $/km², 70 x 100 m; disk storage growing from some 100s of GB to some 10s of TB.]
Acquisition modeling is routinely applied to optimize survey acquisition design and reduce overall cost: 10,000 modeled shots, 4 weeks, 100 TF. Migration sections and illumination maps computed from the modeled seismic acquisition provide useful information to geophysicists.
Progress in seismic acquisition gives access to higher seismic data quality, but dramatically increases the size of storage and the overall cost.

Codes and computing effort: some figures
[Chart: algorithmic complexity and the computing power it requires (up to ~10^15 flops), 1995-2020: asymptotic-approximation imaging and paraxial isotropic/anisotropic imaging; isotropic/anisotropic modeling and RTM (3-18 Hz, ~56 TF); elastic modeling/RTM and isotropic/anisotropic FWI; visco-elastic modeling and elastic FWI (3-35 Hz, ~900 TF); petro-elastic inversion and visco-elastic FWI (3-55 Hz, ~9.5 PF).]
Computing resource evolution when only one parameter is changed while keeping the same algorithm; here the processing is required to complete in 8 days.

HPC: the necessary link
HPC at the petaflops scale links together (1) more efficient imaging algorithms, (2) "improved" seismic data, offshore and onshore, and (3) more realistic modeling and simulations.

HPC evolution in TOTAL: from SMP clusters to MPP and hybrid computing
[Chart: installed computing power, 2004-2009, rising from near zero to about 450 TeraFlops.]
Successive systems: SGI Altix SMP cluster (1,024 Intel IA-64), IBM SMP cluster (2,048 POWER5), SGI ICE+ MPP (10,240 Intel x86-64), SGI ICE+ MPP (16,384 Intel x86-64 + 256 GPUs).

Why MPP? Seismic depth imaging is massively parallel:
- Embarrassingly parallel over the shots: e.g. 100,000 shots can be solved in one iteration on a 100,000-CPU machine.
- Explicit data distribution and a data-parallel programming model when processing one shot.
- Reliability and performance on very large configurations.
- Impact on footprint and power dissipation.
SGI ICE+ hypercube-topology interconnect: a scalable single system, easy to manage. Taking advantage of this scalable interconnect leads to a very efficient implementation of the 3D wave-equation finite-difference solver.
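As a minimal, hedged sketch of that shot-level parallelism (hypothetical shot count and a stub in place of the real finite-difference solver; not TOTAL's production code), each MPI rank can simply process its own subset of shots, with no communication between shots:

```cuda
// Minimal sketch: distribute independent shots over MPI ranks.
// model_one_shot() is a stand-in for the per-shot wave-equation solve.
#include <mpi.h>
#include <cstdio>

static void model_one_shot(int shot_id) {
    std::printf("modeling shot %d\n", shot_id);   // illustrative stub only
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n_shots = 100000;   // e.g. one shot per core on a 100,000-CPU machine

    // Shots are independent ("embarrassingly parallel"): a static
    // round-robin assignment needs no inter-process communication.
    for (int shot = rank; shot < n_shots; shot += size)
        model_one_shot(shot);

    MPI_Finalize();
    return 0;
}
```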

Trends: multiply the number of cores. Why hybrid computing?
- Our codes can take advantage of many-core technology.
- GPUs can be seen as large multicore processors and can offer tremendous speedups; are GPUs a first step towards massive multicore technology?
- Examples of the trend: 4-core Intel processor, 80-core Intel processor (prototype), NVIDIA GPU with 240 cores (~1 TF single precision).
[Chart: 3D wave-equation solver speedup comparison: 1 core = 1, 4 cores = 4.28, 8 cores = 8.23, 1 GPU = 35.66.]
Having access to different technologies through the same interconnect can offer new perspectives in terms of programming models and application development. To reach the performance needed to solve our problems, we have no choice but to combine a highly scalable interconnect with massive multicore technology.

Accelerating technology: current status and perspectives in TOTAL
Starting in 2007 with the following questions:
1. Is it possible to take advantage of GPUs for seismic depth imaging?
2. What programming model and what programming language?
3. What about communication between host and GPU, and what is the best configuration?
4. Can the performance obtained on a single GPU be extrapolated to a full system (a cluster)?
5. What about building industrial applications (modeling, RTM) on a GPU-cluster-based system?

1. Is it possible to take advantage of GPUs?

Taking advantage of GPUs: current status
Two main applications implemented on the GPU cluster: seismic modeling and reverse time migration (RTM).
Seismic modeling motivation:
- Accelerate survey acquisition modeling; the wave-equation solver is the basic kernel of seismic depth imaging algorithms.
- Explore different implementations on GPUs (single GPU, multi-GPU).
- Explore programming models and programming languages.
RTM motivation:
- The most accurate depth imaging method, and also the most expensive.
- Take advantage of the progress made on seismic modeling (same kernel).
- Explore communication and load balancing between host and device.

Seismic modeling is based on the solution of the wave equation with an explicit time-space finite-difference scheme.
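Concretely, a standard explicit second-order time discretization of the wave equation above (a generic textbook scheme, not necessarily the exact stencil order used in TOTAL's code) reads
\[
P^{n+1}_{i,j,k} = 2\,P^{n}_{i,j,k} - P^{n-1}_{i,j,k}
 + c^{2}_{i,j,k}\,\Delta t^{2}\,\Delta_h P^{n}_{i,j,k},
\]
with, at second order in space,
\[
\Delta_h P_{i,j,k} =
 \frac{P_{i+1,j,k}-2P_{i,j,k}+P_{i-1,j,k}}{\Delta x^{2}}
+\frac{P_{i,j+1,k}-2P_{i,j,k}+P_{i,j-1,k}}{\Delta y^{2}}
+\frac{P_{i,j,k+1}-2P_{i,j,k}+P_{i,j,k-1}}{\Delta z^{2}}.
\]
Each time step touches every grid point once, which is why the kernel maps naturally onto data-parallel hardware.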

Seismic modeling: implementation
Two passes:
1. Compute the partial derivative along the z axis.
2. Apply a 2D tile sliding window.
Padding is used to ensure global-memory coalescing; partial transfers and page-locked host memory are used for host-device exchanges.
Example: model size 512x512x512, 490 time steps, on SGI ICE+ (Harpertown, 3 GHz). Tested configurations: 1 GPU vs. 4 cores (4 MPI), 8 cores (8 MPI), 16 cores (16 MPI), and 32 cores (32 MPI).
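For illustration only, here is a minimal single-pass sketch of the sliding-window idea (assumptions: a second-order stencil rather than the high-order production one, no absorbing boundaries, interior dimensions that are multiples of the tile size, and hypothetical names); it is not the exact two-pass scheme described above:

```cuda
#define TILE_X 32
#define TILE_Y 16

// One thread per interior (x, y) column; the 2D tile slides along z.
__global__ void fd_step(const float* __restrict__ p_cur,
                        const float* __restrict__ p_old,
                        float*       __restrict__ p_new,
                        const float* __restrict__ c2dt2,   // c^2 * dt^2 per cell
                        int nx, int ny, int nz,
                        float inv_dx2, float inv_dy2, float inv_dz2)
{
    // Assumes (nx-2) % TILE_X == 0 and (ny-2) % TILE_Y == 0 (no boundary test).
    int ix = blockIdx.x * TILE_X + threadIdx.x + 1;
    int iy = blockIdx.y * TILE_Y + threadIdx.y + 1;
    int tx = threadIdx.x + 1, ty = threadIdx.y + 1;

    // Shared-memory tile for the current z slice, with a one-point halo.
    __shared__ float s[TILE_Y + 2][TILE_X + 2];

    size_t stride_z = (size_t)nx * ny;
    size_t idx = (size_t)iy * nx + ix + stride_z;   // start at z = 1

    // The z neighbours of the stencil are kept in registers.
    float below = p_cur[idx - stride_z];
    float cur   = p_cur[idx];

    for (int iz = 1; iz < nz - 1; ++iz) {
        float above = p_cur[idx + stride_z];

        __syncthreads();                 // previous slice no longer needed
        s[ty][tx] = cur;
        if (threadIdx.x == 0)          s[ty][0]          = p_cur[idx - 1];
        if (threadIdx.x == TILE_X - 1) s[ty][TILE_X + 1] = p_cur[idx + 1];
        if (threadIdx.y == 0)          s[0][tx]          = p_cur[idx - nx];
        if (threadIdx.y == TILE_Y - 1) s[TILE_Y + 1][tx] = p_cur[idx + nx];
        __syncthreads();

        // Second-order Laplacian: x/y terms from shared memory, z from registers.
        float lap = (s[ty][tx - 1] - 2.f * cur + s[ty][tx + 1]) * inv_dx2
                  + (s[ty - 1][tx] - 2.f * cur + s[ty + 1][tx]) * inv_dy2
                  + (below         - 2.f * cur + above         ) * inv_dz2;

        // Explicit leapfrog update (see the scheme above).
        p_new[idx] = 2.f * cur - p_old[idx] + c2dt2[idx] * lap;

        below = cur;        // slide the register window along z
        cur   = above;
        idx  += stride_z;
    }
}
// Launch: dim3 block(TILE_X, TILE_Y), grid((nx-2)/TILE_X, (ny-2)/TILE_Y).
```

Each thread reads the x/y neighbours from shared memory and keeps the z neighbours in registers, which is the "sliding 2D shared memory" access pattern discussed later.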

Seismic modeling example
Model size 560x557x905, 22,760 time steps.
- CPU rack: SGI ICE+, 512 Harpertown cores at 3 GHz, 8 cores per shot (64 shots in parallel).
- GPU rack: SGI ICE+, 16 Harpertown blades at 3 GHz plus 8 Teslas, 2 GPUs per shot (16 shots in parallel).
Number of shots computed in 24 hours: 54 on the CPU rack vs. 232 on the GPU rack.

Reverse Time Migration (RTM)
Based on the solution of a kernel similar to the one used for seismic modeling. RTM is solved in two steps:
1. Seismic modeling and wavefield storage.
2. Seismic modeling, wavefield reload, and application of the imaging condition.
For efficiency, the seismic modeling must hide the cost of the wavefield store/load.
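For reference, the imaging condition applied at step 2 is usually the textbook cross-correlation of the two wavefields (the exact condition used in production may differ):
\[
I(x, y, z) \;=\; \sum_{\text{shots}} \sum_{t} S(x, y, z, t)\, R(x, y, z, t),
\]
where \(S\) is the source wavefield reloaded from storage and \(R\) is the receiver wavefield back-propagated from the recorded data.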

CPU-GPU data transfer optimization
Take advantage of asynchronous communications to overlap data transfers with computation: in the first implementation, 75% of the elapsed time was spent on PCIe communications.
Exploit PCIe Gen2 bandwidth, DMA-accessible pinned memory, and asynchronous communications.
Forward (FWD) and backward (BWD) passes, synchronous vs. asynchronous (%): FWD 32 vs. 15, BWD 41 vs. 25.
[Diagram: timeline showing kernel execution overlapped with data transfers.]
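A minimal, self-contained sketch of this overlap pattern (generic data chunks and hypothetical sizes stand in for the RTM wavefield snapshots; not the production code):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

#define CHECK(call) do { cudaError_t err_ = (call); \
    if (err_ != cudaSuccess) { \
        std::printf("CUDA error: %s\n", cudaGetErrorString(err_)); \
        return 1; } } while (0)

__global__ void process(float* data, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;            // stand-in for the real FD kernel
}

int main() {
    const int    n_chunks = 4;
    const size_t chunk    = 1 << 20;       // elements per chunk (hypothetical)

    // Page-locked (pinned) host memory enables true asynchronous DMA copies.
    float *h_buf = nullptr, *d_buf = nullptr;
    CHECK(cudaMallocHost((void**)&h_buf, n_chunks * chunk * sizeof(float)));
    CHECK(cudaMalloc((void**)&d_buf, n_chunks * chunk * sizeof(float)));
    for (size_t i = 0; i < n_chunks * chunk; ++i) h_buf[i] = 1.0f;

    // Two streams: while one chunk is being computed, the next one is
    // already in flight over PCIe, hiding most of the transfer time.
    cudaStream_t streams[2];
    for (int s = 0; s < 2; ++s) CHECK(cudaStreamCreate(&streams[s]));

    for (int c = 0; c < n_chunks; ++c) {
        cudaStream_t s = streams[c % 2];
        float* h = h_buf + (size_t)c * chunk;
        float* d = d_buf + (size_t)c * chunk;
        CHECK(cudaMemcpyAsync(d, h, chunk * sizeof(float),
                              cudaMemcpyHostToDevice, s));
        process<<<(unsigned)((chunk + 255) / 256), 256, 0, s>>>(d, chunk);
        CHECK(cudaMemcpyAsync(h, d, chunk * sizeof(float),
                              cudaMemcpyDeviceToHost, s));
    }
    CHECK(cudaDeviceSynchronize());

    for (int s = 0; s < 2; ++s) CHECK(cudaStreamDestroy(streams[s]));
    CHECK(cudaFree(d_buf));
    CHECK(cudaFreeHost(h_buf));
    return 0;
}
```

Without pinned memory and streams the same loop would serialize every copy and kernel, which is essentially the synchronous case measured above.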

2. Programming model and programming language
Programming model: take advantage of data parallelism, mainly by using the memory hierarchy efficiently, with different approaches:
- Increase data locality and re-use.
- Use simple 3D cache blocking, or replace it by a 2D cache sliding along the third dimension.
Number of read accesses per data point:
- RTM 2D: 4.25
- RTM 3D with texture: 29
- RTM 3D with 3D shared-memory accesses: 7.5
- RTM 3D with sliding 2D shared memory: 4

2. Programming model and programming language
Programming language:
- CUDA: progressing rapidly, with new features such as partial copies, asynchronous communications, and pinned-memory data access.
- HMPP: expresses parallelism and communications in C and Fortran source code, with direct integration into Fortran code through OpenMP-like directives. HMPP handles CPU-GPU communications and kernel launches, and manages the hardware device automatically. It is systematically used in TOTAL for developing accelerated kernels, and we work in close relation with CAPS on automatic code generation. Some delay between new CUDA features and their HMPP support should be anticipated (CUDA pre-released to CAPS?).
- OpenCL: a standard (wait and see); it will be addressed through HMPP.

3. Host-device configuration

4. From a single GPU to a multi-GPU cluster

Conclusion
Computational needs for seismic processing are nearly unbounded:
- Algorithm improvements take advantage of high-frequency seismic data (which impacts the size of the computational grid) and of seismic acquisition parameters (azimuth width, distance between receivers, ...).
- More complete solutions of the imaging solver are needed to address new challenges: foothills, heavy oil.
A pro-active R&D strategy is needed:
- Algorithms (full waveform inversion, new solvers, reservoir simulation, EM, ...).
- New hardware architectures and programming models.
- Impact of new technology on processing: several teraflops in a workstation? HPC containers for on-site and embedded processing.
Accelerating solutions:
- Good progress in understanding the use of GPUs; we have a clear strategy for programming models and application development.
- GPUs: a first step before many-core technology? We do not exclude hybrid technology (many-core + GPUs).
- We still need to work in close relation with industrial partners to improve hardware and tools.