SPECIAL SECTION: High-performance computing

MARK NOBLE, Mines ParisTech
PHILIPPE THIERRY, Intel
CEDRIC TAILLANDIER, CGGVeritas (formerly Mines ParisTech)
HENRI CALANDRA, Total

The Leading Edge, January 2010

The determination of the correct velocity structure of the near surface is a crucial step in seismic data processing and depth imaging. Generally, first-arrival traveltime tomography based on refraction data or diving waves is used to assess a velocity model of the subsurface that best explains the data. Such first-arrival traveltime tomography algorithms are very attractive for land data processing because early events in the seismic records are very often dominated by noise, and reflected events are very difficult or even impossible to identify. First arrivals, on the other hand, can generally be identified quite clearly and are very often the only data available to reconstruct the near-surface velocity structure. Seismic surveys now commonly deploy thousands of sources combined with thousands of receivers, leading to millions or even hundreds of millions of acquired traces. The area under investigation can be very large, leading to a velocity model containing millions of parameters whatever the type of parameterization. For 2D geometries, classical refraction tomography algorithms may not suffer from computational limitations; for 3D acquisitions, however, they may face severe restrictions in terms of memory requirements, computation time, or implementation. To overcome these limitations, one could reduce the amount of data used or the number of model parameters, but both options lead to a loss of information or resolution. We address these issues with the use of adjoint state techniques to compute the gradient of the traveltime misfit function.
The computational benefits of the adjoint state method, compared to the classical algorithms, are a low memory requirement, a straightforward and efficient parallelization, and an effortless implementation. Indeed, the amount of memory required by this method depends only on the size of the discretized velocity model; in other words, it is independent of the quantity of available input data. Another advantage is that the gradient calculation is carried out shot by shot; thus, computation tasks can easily be distributed over many processors. These computational properties have already been assessed and validated for 2D and 3D geometries (Taillandier et al., 2009). In this work, we present the practical implementation of our 3D first-arrival traveltime tomography algorithm based on the adjoint state technique, which can handle very large data sets. Using a 3D synthetic model, we present the current performance of this algorithm and the modifications needed to extend its scalability to the huge data sets that will become available in the coming years.

Figure 1. Flowchart illustrating the computation of the gradient of the misfit function by the adjoint state method.

The adjoint state method

The first-arrival traveltime tomography algorithm is formulated as the minimization of a least-squares misfit function S defined as

S(c) = 1/2 Σ_shots Σ_receivers [T(c) - T_obs]^2,   (1)

where T_obs are the recorded first-arrival traveltimes and T are the first-arrival traveltimes calculated for a given velocity model c. We use the first-order finite-difference eikonal solver of Podvin and Lecomte (1991) to compute the synthetic data. The adjoint state method allows deriving the gradient of the misfit function S with respect to the velocity model c as

∇S(c) = - Σ_shots λ / c^3,   (2)

where λ is the adjoint state variable. This variable is computed for each source position by solving the following partial differential equation and its boundary condition

∇ · (λ ∇T) = 0 inside the model volume, with λ (∇T · n) = T - T_obs at the receiver positions on the acquisition surface,   (3)

where n is the outward unit vector normal to the acquisition surface. Detailed mathematical developments can be found in Sei and Symes (1994) and in Leung and Qian (2006). A numerical method to solve this equation is the fast sweeping method of Zhao (2005), which has good stability and convergence properties and is easy to implement.

Practical implementation

The refraction tomography algorithm derived from the adjoint state method mainly relies on the computation of the gradient of the misfit function. Figure 1 depicts the steps leading to the assessment of the gradient. Picked traveltimes and an initial velocity model are the main inputs. For each shot, computed traveltimes are derived from forward modeling. The traveltime map is then used to solve the adjoint state partial differential equation, with the traveltime residuals as initial conditions. The summation of the gradients obtained for each shot provides the global gradient of the misfit function. Two essential properties of the adjoint state method are illustrated in this flowchart. First, for each shot, the amount of memory required by the algorithm depends only on the size of the discretized velocity model. Second, the algorithm is straightforward to parallelize by shot, which significantly reduces the computation time. Once the gradient is obtained, a local descent optimization technique, such as a steepest-descent method, is applied to iteratively minimize the misfit function. The whole iterative process, including the step-length computation and the model update, is illustrated in Figure 2. The code is written in Fortran 90 and uses the Message Passing Interface (MPI) for parallelism.

Figure 2. Flowchart illustrating the whole iterative process, including the step-length computation and the model update.

Computational performance

We now investigate the computational performance of the 3D first-arrival traveltime tomography algorithm on a fairly large data set.
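The workflow of Figures 1 and 2 can be sketched end to end on a deliberately trivial 1D analogue, in which the eikonal solver is replaced by a straight-ray traveltime sum. The following is an illustrative Python toy under those stated assumptions, not the authors' Fortran 90/MPI code; it checks the analytic (adjoint-style) gradient against a finite difference and then runs the steepest-descent loop of Figure 2 with a backtracking step-length search:

```python
import numpy as np

# Toy 1D stand-in for first-arrival tomography (illustrative assumption):
# a single ray crosses n cells of size dx, so T(c) = sum(dx / c_i), and the
# misfit is S(c) = 1/2 * (T(c) - T_obs)^2, the 1D analogue of equation 1.
dx, n = 10.0, 20
c_true = np.linspace(1500.0, 2000.0, n)   # "true" velocities (m/s)
t_obs = np.sum(dx / c_true)               # one "picked" first arrival

def forward(c):                           # stand-in for the eikonal solver
    return np.sum(dx / c)

def misfit(c):
    return 0.5 * (forward(c) - t_obs) ** 2

def gradient(c):                          # analytic gradient dS/dc_i, obtained
    return (forward(c) - t_obs) * (-dx / c**2)   # in one backward pass

# Sanity check of the analytic gradient against a finite difference.
c = np.full(n, 1500.0)                    # initial model
eps = 1e-3
c_pert = c.copy()
c_pert[7] += eps
fd = (misfit(c_pert) - misfit(c)) / eps
assert abs(fd - gradient(c)[7]) < 1e-10

# Steepest-descent loop of Figure 2: gradient, step length, model update.
for _ in range(100):
    g = gradient(c)
    step, s0 = 1e9, misfit(c)             # backtracking line search
    while misfit(c - step * g) >= s0 and step > 1.0:
        step *= 0.5
    c = c - step * g                      # model update

print(round(abs(forward(c) - t_obs), 6))  # -> 0.0 (residual has converged)
```

The key property mirrored here is that the gradient costs one extra backward pass per shot, independent of the number of picked traveltimes.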
First, we compare performance on two different clusters. The first is made of the latest Intel Xeon X5560 processors, codenamed Nehalem, at 2.8 GHz, with 8 MB of shared L3 cache and a QuickPath Interconnect (QPI) at 6.4 GT/s; each node has 18 GB of DDR3 memory at 1066 MHz for its 8 cores. The second is made of the previous generation of Intel Xeon E5472 processors, codenamed Harpertown, at 3.0 GHz, with 2 GB of memory per core. In both cases, we used up to 256 cores (i.e., 32 nodes interconnected with QDR InfiniBand).

Table 1. Main features of the synthetic 3D acquisition used for the computational evaluations.
Number of model parameters: 100 × 1000 × 1000 = 100 million
Number of observed traveltimes per shot: 160,000
Number of shots: 1000
Total number of observed traveltimes: 160,000 × 1000 = 160 million

The main features of the 3D synthetic model (Figure 3) and the simulated acquisition are summarized in Table 1. During the inversion, for each shot and each iteration, we solve three eikonal equations (forward modeling): one for the computation of the misfit function itself and two for the step-length computation.
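The memory claim (the footprint depends on the discretized model, not on the data volume) can be checked back-of-the-envelope from the grid sizes of Table 1; the single-precision (4-byte) storage below is our own assumption, not a figure from the paper:

```python
# Memory estimate for the grids of Table 1, assuming 4-byte single-precision
# values per grid point (our assumption, not a figure from the paper).
nx, ny, nz = 1000, 1000, 100
model_cells = nx * ny * nz                 # 100 million model parameters
model_gb = model_cells * 4 / 1024**3       # one float32 volume, in GB

per_shot_cells = 500 * 500 * 100           # active receiver area per shot
per_shot_gb = per_shot_cells * 4 / 1024**3

print(model_cells)                                  # 100000000
print(round(model_gb, 2), round(per_shot_gb, 2))    # 0.37 0.09
```

Even several such float volumes per shot fit comfortably within the 18 GB per node quoted above, independently of the 160 million picked traveltimes.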

Figure 3. Model used to assess the computational performance of the algorithm. (top) Initial velocity model used for the inversion. (middle left) True model used to compute the observed traveltimes. (bottom left) Inverted velocity model. (right) Horizontal velocity profiles for z = 700 m at x = 10 km (top) and x = 15 km (bottom); red = initial model, green = inverted model, black = true model.

Figure 4. Speedup (left) and elapsed time (right) comparison between the Harpertown processor (E5472) at 3.0 GHz and the Nehalem (X5560) at 2.8 GHz when working with 256 shots and up to 256 cores. The speedup curves show that increasing the number of iterations may increase scalability, since the impact of the initialization phase is attenuated; the low number of shots per core negatively impacts the scalability in this case. On the right, we see the benefit of the X5560 even at a lower clock frequency.

Figure 5. Intel MPI trace analyzer on 128 cores running 1 shot per core for a single iteration. Every part of the algorithm described in Figure 2 is visible.

Figure 6. MPI trace analyzer on 128 cores for 1 shot per core and 10 iterations per core. Apart from the initialization on the left (reading, processing, and broadcast), we see the MPI_REDUCE (yellow) performed to communicate the gradient, followed by the two broadcasts within the step-length computation (tiny vertical dark lines).

We must also compute the adjoint state variable by solving Equation 3. Although the total number of grid points for this synthetic model is 100 × 1000 × 1000, for each shot the computation is performed only where the receivers are active, that is, on a 100 × 500 × 500 grid. More than 95% of the CPU time is devoted to solving the eikonal equations (forward modeling) and computing the adjoint state variable. In our implementation, solving the adjoint state equation costs only 1.6 times as much as solving one eikonal equation. For this synthetic example, the inversion required 50 iterations and took 3 hours and 11 minutes on 256 cores.
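The 1.6 ratio ties the per-shot costs together; a two-line arithmetic check, using the 7.45 s per-eikonal time measured on the X5560 (see Table 2):

```python
# Consistency check of the per-shot CPU costs quoted in the text.
t_eikonal = 7.45                       # seconds for one eikonal solve (X5560)
t_adjoint = 1.6 * t_eikonal            # adjoint solve is ~1.6x an eikonal solve
per_shot = 3 * t_eikonal + t_adjoint   # 3 eikonal + 1 adjoint per iteration

print(round(t_adjoint, 2))             # 11.92
print(round(per_shot, 1))              # 34.3 (close to the ~35 s with overhead)
```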

To evaluate the efficiency of the algorithm, we measured the computation time for one iteration of the minimization process with an increasing number of cores. Following Amdahl's law, the speedup is defined as the ratio between the computation time on a single core and the computation time on n cores (Figure 4). As for every real-world application, it is important to estimate how sensitive the code is to the clock frequency and/or the memory bandwidth. These two parameters are almost always the most important terms in the first-order application characterization equation, generally expressed as

T_total = T_flops + T_memory + T_io + T_comms,

where the terms are the elapsed times spent in calculation, memory access, I/O, and communication, respectively. As the right side of Figure 4 shows, the X5560 at 2.8 GHz gave better results than the E5472 at 3.0 GHz thanks to its better memory bandwidth (3.5 times better in the current configuration). However, this first-arrival tomography cannot be considered a memory-bound application, as conventional tomography or reservoir simulation codes are, for example. In fact, the speedup curve of the E5472 on the left of Figure 4 does not decrease dramatically as the number of cores increases, as it would if memory bandwidth were the limit. As we will see in the following section, the runs are affected by the initialization phase of the application, which dominates in this synthetic example with only a few shots and a few iterations per shot (Figures 5 and 7).

Figure 7. Comparison of the impact of the initialization phase on 128 and 512 cores. Each core has to be aware of the whole geometry for later processing. This part has to be extracted from the main application, since the geometry has to be prepared for more efficient use during the tomography. In fact, we will have to take advantage of the shot distribution within the parallel implementation when increasing to 10,000 to 100,000 cores.
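Amdahl's law as used here can be made concrete in a couple of lines; the 5% serial fraction below is a purely hypothetical value for illustration, not a measurement from the paper:

```python
def amdahl_speedup(serial_fraction: float, n_cores: int) -> float:
    """Ideal speedup of a code whose runtime has the given serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

# A hypothetical 5% serial part (e.g., an initialization phase) caps the
# speedup at 1/0.05 = 20, no matter how many cores are used.
print(round(amdahl_speedup(0.05, 1024), 1))   # 19.6
print(round(amdahl_speedup(0.0, 1024), 1))    # 1024.0: perfectly parallel code
```

This is why removing the serial initialization as a preprocessing step, as discussed below, matters far more at high core counts than any per-core optimization.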
In such a case, we may express the total elapsed time as

T_total = T_serial + T_parallel / N_cores + T_overhead,

where T_overhead comes from synchronizations, communications, and extra work due to the parallel implementation. The perfect speedup expressed by Amdahl assumes no overhead and no serial part, i.e., a 100% parallel code. Extending Amdahl's approximation and hypotheses is beyond the scope of this paper, and many publications are available on that topic. In our case, we try to remove the serial part, since it can be performed as a preprocessing step, and, thanks to the adjoint state formalism, our tomography can take advantage of coarse-grained parallelism. Then, apart from a few (but large) collective communications, the processing of each shot does not involve much overhead (Figure 6). The first part of the application mostly corresponds to the input velocity field and acquisition geometry reading and

to the acquisition geometry processing to define adequate calculation grids. The latter is the most time-consuming, and its impact increases with the number of cores, as shown for the worst case in Figure 7, working with only 1 shot per core and 1 iteration per shot on 128 and 512 cores. This negative impact is of course smoothed when the number of shots per core and the number of inversion iterations increase (Figure 9), but it still affects the parallel part, as stated by Amdahl.

Figure 8. Illustration of the scalability (up to 1024 cores) with and without the initialization phase.

Table 2. Average CPU time required per shot on one X5560 processor at 2.8 GHz for 1 iteration of the inversion.
1 eikonal: 7.45 s
1 adjoint state: 11.93 s
3 eikonal + 1 adjoint state + overhead: 35 s

To estimate this impact, we checked the scalability up to 1024 cores with and without the preprocessing (Figure 8), considering that it can be taken out of the tomography, especially because the coordinates and traveltime picks usually come as ASCII files that must be sorted prior to the inversion. In the best case (100 shots per core), we move from a speedup of 368 with the initialization to 852 on 1024 cores without it, which is quite a good result for such a small test case of 102,400 shots. This result leads to great expectations for acquisitions of a few million shots and a few dozen iterations per shot. On the left of Figure 9, the number of shots per second is quite stable, demonstrating that the load balancing is also very good. In that example, including the initialization, we also see that a minimum of 100 shots per core is needed to reach a steady state with a constant number of shots per second. This gives around 3 million shots per day on a 256-core machine with only one iteration per shot.
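This single-iteration throughput extrapolates by simple proportionality: shots per day scale linearly with the number of cores and inversely with the number of iterations per shot. A minimal cross-check (the scaling function is our own illustration and assumes perfect load balancing):

```python
# Cross-check of the throughput arithmetic: shots/day scales linearly with
# the number of cores and inversely with the iterations per shot.
base_shots_per_day = 3_000_000            # measured: 256 cores, 1 iteration

def shots_per_day(cores: int, iterations: int) -> int:
    return int(base_shots_per_day * (cores / 256) / iterations)

print(shots_per_day(256, 50))      # 60000
print(shots_per_day(12_800, 50))   # 3000000
print(shots_per_day(128_000, 50))  # 30000000
```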
Thus, the roughly 50 iterations per shot needed in a real case would still produce 3 million shots per day on 12,800 cores (approximately 140+ Tflops peak), or 60,000 shots per day on 256 cores.

Figure 9. Impact of the initialization phase with respect to the number of shots per core. After a certain number of shots per core, the application reaches a steady state where the number of shots per second is almost constant (left). Due to a larger initialization impact, the steady state is more difficult to reach when the number of cores increases (right). The numbers on the plot denote the number of cores.

Next steps

With respect to the constant increase in acquisition size and the constant decrease in the spatial sampling of the velocity field, our goal is to take advantage of this high scalability to drive the application to a petaflops or multi-petaflops machine, roughly 100,000 to 1 million cores, considering that the number of shots may go up to 10 to 100 million in the near future. Extending the previous calculation may give 30 million shots a day on 128,000 cores (assuming 50 iterations per shot). Of course, the algorithm would have to be revisited or extended to ensure good stability in terms of load balancing and fault tolerance. In addition, such a huge case may require splitting the acquisition into a few domains with respect to the memory distributed across cluster nodes, which are becoming larger and larger shared-memory nodes. The use of mixed-parallelism programming may also be an option to limit the number of domains, taking advantage of OpenMP or pthreads within a socket, for example. Collective communications may have to be optimized to exploit the topology of future machines. With up to 1024 or 2048 cores and message sizes of a few hundred megabytes, the current industrial MPI implementations remain suitable, but with an increase of one or two orders of magnitude in the number of cores, we will have to get closer to the topology. On pure floating-point expectations, the use of SIMD instructions would be the next step to reduce the time needed per iteration, especially with the wider instructions arriving in the coming years.

Conclusion

The refraction tomography algorithm proposed in this work is promising. It gives very satisfactory results for surveys of realistic size. In terms of computational aspects, this new algorithm meets most expectations: it can be efficiently parallelized, can handle a large amount of input data, and is easy to implement.

References

Amdahl, G., 1967, Validity of the single processor approach to achieving large-scale computing capabilities: AFIPS Conference Proceedings, 30, 483-485.
Leung, S., and J. Qian, 2006, An adjoint state method for three-dimensional transmission traveltime tomography using first arrivals: Communications in Mathematical Sciences, 4, 249-266.
Podvin, P., and I. Lecomte, 1991, Finite difference computation of traveltimes in very contrasted velocity models: a massively parallel approach and its associated tools: Geophysical Journal International, 105, 271-284.
Sei, A., and W. Symes, 1994, Gradient calculation of the traveltime cost function without ray tracing: SEG Expanded Abstracts, 1351-1354.
Taillandier, C., M. Noble, H. Chauris, and H. Calandra, 2009, First-arrival traveltime tomography based on the adjoint-state method: Geophysics, 74, no. 6, WCB57-WCB66.
Zhao, H. K., 2005, A fast sweeping method for eikonal equations: Mathematics of Computation, 74, 603-627.

Corresponding author: mark.noble@mines-paristech.fr