SPECIAL SECTION: High-performance computing

MARK NOBLE, Mines ParisTech
PHILIPPE THIERRY, Intel
CEDRIC TAILLANDIER, CGGVeritas (formerly Mines ParisTech)
HENRI CALANDRA, Total

The Leading Edge, January 2010

The determination of the correct velocity structure of the near surface is a crucial step in seismic data processing and depth imaging. Generally, first-arrival traveltime tomography based on refraction data or diving waves is used to assess a velocity model of the subsurface that best explains the data. Such first-arrival traveltime tomography algorithms are very attractive for land data processing because early events in the seismic records are very often dominated by noise, and reflected events are very difficult or even impossible to identify. First arrivals, on the other hand, can generally be identified quite clearly and are very often the only data available to reconstruct the near-surface velocity structure. Seismic surveys now commonly deploy thousands of sources combined with thousands of receivers, leading to millions or even hundreds of millions of acquired traces. The area under investigation can be very large, leading to a velocity model containing millions of parameters whatever the type of parameterization. For 2D geometries, classical refraction tomography algorithms may not suffer from computational limitations; for 3D acquisitions, however, they may face severe restrictions in terms of memory requirements, computation time, or implementation. To overcome these limitations, one could reduce the amount of data used or the number of model parameters, but both options lead to a loss of information or resolution. We address these issues with the use of adjoint state techniques to compute the gradient of the traveltime misfit function.
The computational benefits of the adjoint state method, compared to the classical algorithms, are a low memory requirement, a straightforward and efficient parallelization, and an effortless implementation. Indeed, the amount of memory required by this method depends only on the size of the discretized velocity model; in other words, it is independent of the quantity of available input data. Another advantage is that the gradient calculation is carried out shot by shot; thus, computation tasks can easily be distributed over many processors. These computational properties have already been assessed and validated for 2D and 3D geometries (Taillandier et al., 2009). In this work, we present the practical implementation of our 3D first-arrival traveltime tomography algorithm based on the adjoint state technique, which can handle very large data sets. Using a 3D synthetic model, we present the current performance of this algorithm and the modifications needed to extend its scalability to the huge data sets that will become available in the coming years.

Figure 1. Flowchart illustrating the computation of the gradient of the misfit function by the adjoint state method.

The adjoint state method

The first-arrival traveltime tomography algorithm is formulated as the minimization of a least-squares misfit function S defined as

S(c) = 1/2 Σ_shots Σ_receivers [T(c) - T_obs]^2,   (1)

where T_obs are the recorded first-arrival traveltimes and T are the first-arrival traveltimes calculated for a given velocity model c. We use the first-order finite-difference eikonal solver of Podvin and Lecomte (1991) to compute the synthetic data. The adjoint state method allows deriving the gradient of the misfit function S with respect to the velocity model c as

∇S(c) = - Σ_shots λ / c^3,   (2)

where λ is the adjoint state variable. This variable is computed for each source position by solving the following partial differential equation and its boundary condition

∇ · (λ ∇T) = 0 inside the model volume, with λ (∇T · n) = T - T_obs at the receiver positions on the acquisition surface,   (3)

where n is the outward unit vector normal to the acquisition surface. Detailed mathematical developments can be found in Sei and Symes (1994) and in Leung and Qian (2006). A numerical method to solve this equation is the fast sweeping method of Zhao (2005), which has good stability and convergence properties and is easy to implement.

Practical implementation

The refraction tomography algorithm derived from the adjoint state method mainly relies on the computation of the gradient of the misfit function. Figure 1 depicts the steps leading to the assessment of the gradient. Picked traveltimes and an initial velocity model are the main inputs. For each shot, computed traveltimes are derived from forward modeling. The traveltime map is then used to solve the adjoint state partial differential equation, with the traveltime residuals as initial conditions. The summation of the gradients obtained for each shot provides the global gradient of the misfit function. Two essential properties of the adjoint state method are illustrated in this flowchart. First, for each shot, the amount of memory required by the algorithm depends only on the size of the discretized velocity model. Second, the algorithm is straightforward to parallelize by shot, which significantly reduces the computation time. Once the gradient is obtained, a local descent optimization technique, such as a steepest-descent method, is applied to iteratively minimize the misfit function. The whole iterative process, including the step-length computation and the model update, is illustrated in Figure 2. The code is written in Fortran 90 and uses the Message Passing Interface (MPI) for parallelism.

Figure 2. Flowchart illustrating the whole iterative process, including the step-length computation and the model update.

Computational performance

We now investigate the computational performance of the 3D first-arrival traveltime tomography algorithm on a fairly large data set.
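The workflow of Figures 1 and 2 can be sketched end to end on a deliberately trivial 1D analogue, in which the eikonal solver is replaced by a straight-ray traveltime sum. The following is an illustrative Python toy under those stated assumptions, not the authors' Fortran 90/MPI code; it checks the analytic (adjoint-style) gradient against a finite difference and then runs the steepest-descent loop of Figure 2 with a backtracking step-length search:

```python
import numpy as np

# Toy 1D stand-in for first-arrival tomography (illustrative assumption):
# a single ray crosses n cells of size dx, so T(c) = sum(dx / c_i), and the
# misfit is S(c) = 1/2 * (T(c) - T_obs)^2, the 1D analogue of equation 1.
dx, n = 10.0, 20
c_true = np.linspace(1500.0, 2000.0, n)   # "true" velocities (m/s)
t_obs = np.sum(dx / c_true)               # one "picked" first arrival

def forward(c):                           # stand-in for the eikonal solver
    return np.sum(dx / c)

def misfit(c):
    return 0.5 * (forward(c) - t_obs) ** 2

def gradient(c):                          # analytic gradient dS/dc_i, obtained
    return (forward(c) - t_obs) * (-dx / c**2)   # in one backward pass

# Sanity check of the analytic gradient against a finite difference.
c = np.full(n, 1500.0)                    # initial model
eps = 1e-3
c_pert = c.copy()
c_pert[7] += eps
fd = (misfit(c_pert) - misfit(c)) / eps
assert abs(fd - gradient(c)[7]) < 1e-10

# Steepest-descent loop of Figure 2: gradient, step length, model update.
for _ in range(100):
    g = gradient(c)
    step, s0 = 1e9, misfit(c)             # backtracking line search
    while misfit(c - step * g) >= s0 and step > 1.0:
        step *= 0.5
    c = c - step * g                      # model update

print(round(abs(forward(c) - t_obs), 6))  # -> 0.0 (residual has converged)
```

The key property mirrored here is that the gradient costs one extra backward pass per shot, independent of the number of picked traveltimes.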
First, we compare performance on two different clusters. The first is made of the latest Intel Xeon X5560 processors, codenamed Nehalem, at 2.8 GHz, with 8 MB of shared L3 cache and a QuickPath Interconnect (QPI) at 6.4 GT/s; each node has 18 GB of DDR3 memory at 1066 MHz for its 8 cores. The second is made of the previous generation of Intel Xeon E5472 processors, codenamed Harpertown, at 3.0 GHz, with 2 GB of memory per core. In both cases, we used up to 256 cores (i.e., 32 nodes interconnected with QDR InfiniBand).

Table 1. Main features of the synthetic 3D acquisition used for the computational evaluations.
Number of model parameters: 100 × 1000 × 1000 = 100 million
Number of observed traveltimes per shot: 160,000
Number of shots: 1000
Total number of observed traveltimes: 160,000 × 1000 = 160 million

The main features of the 3D synthetic model (Figure 3) and the simulated acquisition are summarized in Table 1. During the inversion, for each shot and each iteration, we solve three eikonal equations (forward modeling): one for the computation of the misfit function itself and two for the step-length computation.
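The memory claim (the footprint depends on the discretized model, not on the data volume) can be checked back-of-the-envelope from the grid sizes of Table 1; the single-precision (4-byte) storage below is our own assumption, not a figure from the paper:

```python
# Memory estimate for the grids of Table 1, assuming 4-byte single-precision
# values per grid point (our assumption, not a figure from the paper).
nx, ny, nz = 1000, 1000, 100
model_cells = nx * ny * nz                 # 100 million model parameters
model_gb = model_cells * 4 / 1024**3       # one float32 volume, in GB

per_shot_cells = 500 * 500 * 100           # active receiver area per shot
per_shot_gb = per_shot_cells * 4 / 1024**3

print(model_cells)                                  # 100000000
print(round(model_gb, 2), round(per_shot_gb, 2))    # 0.37 0.09
```

Even several such float volumes per shot fit comfortably within the 18 GB per node quoted above, independently of the 160 million picked traveltimes.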

Figure 3. Model used to assess the computational performance of the algorithm. (top) Initial velocity model used for the inversion. (middle left) True model used to compute the observed traveltimes. (bottom left) Inverted velocity model. (right) Horizontal velocity profiles for z = 700 m at x = 10 km (top) and x = 15 km (bottom); red = initial model, green = inverted model, black = true model.

Figure 4. Speedup (left) and elapsed time (right) comparison between the Harpertown processor (E5472) at 3.0 GHz and the Nehalem (X5560) at 2.8 GHz when working with 256 shots and up to 256 cores. The speedup curves show that increasing the number of iterations may increase scalability, since the impact of the initialization phase is attenuated; the low number of shots per core negatively impacts the scalability in this case. On the right, we see the benefit of the X5560 even at a lower clock frequency.

Figure 5. Intel MPI trace analyzer on 128 cores running 1 shot per core for a single iteration. Every part of the algorithm described in Figure 2 is visible.

Figure 6. MPI trace analyzer on 128 cores for 1 shot per core and 10 iterations per core. Apart from the initialization on the left (reading, processing, and broadcast), we see the MPI_REDUCE (yellow) performed to communicate the gradient, followed by the two broadcasts within the step-length computation (tiny vertical dark lines).

We must also compute the adjoint state variable by solving Equation 3. Although the total number of grid points for this synthetic model is 100 × 1000 × 1000, for each shot the computation is performed only where the receivers are active, that is, on a 100 × 500 × 500 grid. More than 95% of the CPU time is devoted to solving the eikonal equations (forward modeling) and computing the adjoint state variable. In our implementation, solving the adjoint state equation costs only 1.6 times as much as solving one eikonal equation. For this synthetic example, the inversion required 50 iterations and took 3 hours and 11 minutes on 256 cores.
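The 1.6 ratio ties the per-shot costs together; a two-line arithmetic check, using the 7.45 s per-eikonal time measured on the X5560 (see Table 2):

```python
# Consistency check of the per-shot CPU costs quoted in the text.
t_eikonal = 7.45                       # seconds for one eikonal solve (X5560)
t_adjoint = 1.6 * t_eikonal            # adjoint solve is ~1.6x an eikonal solve
per_shot = 3 * t_eikonal + t_adjoint   # 3 eikonal + 1 adjoint per iteration

print(round(t_adjoint, 2))             # 11.92
print(round(per_shot, 1))              # 34.3 (close to the ~35 s with overhead)
```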

To evaluate the efficiency of the algorithm, we measured the computation time for one iteration of the minimization process with an increasing number of cores. Following Amdahl's law, the speedup is defined as the ratio between the computation time on a single core and the computation time on n cores (Figure 4). As for every real-world application, it is important to estimate how sensitive the code is to the clock frequency and/or the memory bandwidth. These two parameters are almost always the most important terms in the first-order application characterization equation, generally expressed as

T_total = T_flops + T_memory + T_io + T_comms,

where the terms are the elapsed times spent in calculation, memory access, I/O, and communication, respectively. As the right side of Figure 4 shows, the X5560 at 2.8 GHz gave better results than the E5472 at 3.0 GHz thanks to its better memory bandwidth (3.5 times better in the current configuration). However, this first-arrival tomography cannot be considered a memory-bound application, as conventional tomography or reservoir simulation codes are, for example. In fact, the speedup curve of the E5472 on the left of Figure 4 does not decrease dramatically as the number of cores increases, as it would if memory bandwidth were the limit. As we will see in the following section, the runs are affected by the initialization phase of the application, which dominates in this synthetic example with only a few shots and a few iterations per shot (Figures 5 and 7).

Figure 7. Comparison of the impact of the initialization phase on 128 and 512 cores. Each core has to be aware of the whole geometry for later processing. This part has to be extracted from the main application, since the geometry has to be prepared for more efficient use during the tomography. In fact, we will have to take advantage of the shot distribution within the parallel implementation when increasing to 10,000 to 100,000 cores.
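Amdahl's law as used here can be made concrete in a couple of lines; the 5% serial fraction below is a purely hypothetical value for illustration, not a measurement from the paper:

```python
def amdahl_speedup(serial_fraction: float, n_cores: int) -> float:
    """Ideal speedup of a code whose runtime has the given serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

# A hypothetical 5% serial part (e.g., an initialization phase) caps the
# speedup at 1/0.05 = 20, no matter how many cores are used.
print(round(amdahl_speedup(0.05, 1024), 1))   # 19.6
print(round(amdahl_speedup(0.0, 1024), 1))    # 1024.0: perfectly parallel code
```

This is why removing the serial initialization as a preprocessing step, as discussed below, matters far more at high core counts than any per-core optimization.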
In such a case, we may express the total elapsed time as

T_total = T_serial + T_parallel / N_cores + T_overhead,

where T_overhead comes from synchronizations, communications, and extra work due to the parallel implementation. The perfect speedup expressed by Amdahl assumes no overhead and no serial part, i.e., a 100% parallel code. Extending Amdahl's approximation and hypotheses is beyond the scope of this paper, and many publications are available on that topic. In our case, we try to remove the serial part, since it can be performed as a preprocessing step, and, thanks to the adjoint state formalism, our tomography can take advantage of coarse-grained parallelism. Then, apart from a few (but large) collective communications, the processing of each shot does not involve much overhead (Figure 6). The first part of the application mostly corresponds to the input velocity field and acquisition geometry reading and

to the acquisition geometry processing to define adequate calculation grids. The latter is the most time-consuming, and its impact increases with the number of cores, as shown for the worst case in Figure 7, working with only 1 shot per core and 1 iteration per shot on 128 and 512 cores. This negative impact is of course smoothed when the number of shots per core and the number of inversion iterations increase (Figure 9), but it still affects the parallel part, as stated by Amdahl.

Figure 8. Illustration of the scalability (up to 1024 cores) with and without the initialization phase.

Table 2. Average CPU time required per shot on one X5560 processor at 2.8 GHz for 1 iteration of the inversion.
1 eikonal: 7.45 s
1 adjoint state: 11.93 s
3 eikonal + 1 adjoint state + overhead: 35 s

To estimate this impact, we checked the scalability up to 1024 cores with and without the preprocessing (Figure 8), considering that it can be taken out of the tomography, especially because the coordinates and traveltime picks usually come as ASCII files that must be sorted prior to the inversion. In the best case (100 shots per core), we move from a speedup of 368 with the initialization to 852 on 1024 cores without it, which is quite a good result for such a small test case of 102,400 shots. This result leads to great expectations for acquisitions of a few million shots and a few dozen iterations per shot. On the left of Figure 9, the number of shots per second is quite stable, demonstrating that the load balancing is also very good. In that example, including the initialization, we also see that a minimum of 100 shots per core is needed to reach a steady state with a constant number of shots per second. This gives around 3 million shots per day on a 256-core machine with only one iteration per shot.
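This single-iteration throughput extrapolates by simple proportionality: shots per day scale linearly with the number of cores and inversely with the number of iterations per shot. A minimal cross-check (the scaling function is our own illustration and assumes perfect load balancing):

```python
# Cross-check of the throughput arithmetic: shots/day scales linearly with
# the number of cores and inversely with the iterations per shot.
base_shots_per_day = 3_000_000            # measured: 256 cores, 1 iteration

def shots_per_day(cores: int, iterations: int) -> int:
    return int(base_shots_per_day * (cores / 256) / iterations)

print(shots_per_day(256, 50))      # 60000
print(shots_per_day(12_800, 50))   # 3000000
print(shots_per_day(128_000, 50))  # 30000000
```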
Thus, the roughly 50 iterations per shot needed in a real case would still produce 3 million shots per day on 12,800 cores (approximately 140+ Tflops peak), or 60,000 shots per day on 256 cores.

Figure 9. Impact of the initialization phase with respect to the number of shots per core. After a certain number of shots per core, the application reaches a steady state where the number of shots per second is almost constant (left). Due to a larger initialization impact, the steady state is more difficult to reach when the number of cores increases (right). The numbers on the plot denote the number of cores.

Next steps

With respect to the constant increase in acquisition size and the constant decrease in the spatial sampling of the velocity field, our goal is to take advantage of this high scalability to drive the application to a petaflops or multi-petaflops machine, roughly 100,000 to 1 million cores, considering that the number of shots may go up to 10 to 100 million in the near future. Extending the previous calculation may give 30 million shots a day on 128,000 cores (assuming 50 iterations per shot). Of course, the algorithm would have to be revisited or extended to ensure good stability in terms of load balancing and fault tolerance. In addition, such a huge case may require splitting the acquisition into a few domains with respect to the memory distributed across cluster nodes, which are becoming larger and larger shared-memory nodes. The use of mixed-parallelism programming may also be an option to limit the number of domains, taking advantage of OpenMP or pthreads within a socket, for example. Collective communications may have to be optimized to exploit the topology of future machines. With up to 1024 or 2048 cores and message sizes of a few hundred megabytes, the current industrial MPI implementations remain suitable, but with an increase of one or two orders of magnitude in the number of cores, we will have to get closer to the topology. On pure floating-point expectations, the use of SIMD instructions would be the next step to reduce the time needed per iteration, especially with the wider instructions arriving in the coming years.

Conclusion

The refraction tomography algorithm proposed in this work is promising. It gives very satisfactory results for surveys of realistic size. In terms of computational aspects, this new algorithm meets most expectations: it can be efficiently parallelized, can handle a large amount of input data, and is easy to implement.

References

Amdahl, G., 1967, Validity of the single processor approach to achieving large-scale computing capabilities: AFIPS Conference Proceedings, 30, 483-485.
Leung, S., and J. Qian, 2006, An adjoint state method for three-dimensional transmission traveltime tomography using first arrivals: Communications in Mathematical Sciences, 4, 249-266.
Podvin, P., and I. Lecomte, 1991, Finite difference computation of traveltimes in very contrasted velocity models: a massively parallel approach and its associated tools: Geophysical Journal International, 105, 271-284.
Sei, A., and W. Symes, 1994, Gradient calculation of the traveltime cost function without ray tracing: SEG Expanded Abstracts, 1351-1354.
Taillandier, C., M. Noble, H. Chauris, and H. Calandra, 2009, First-arrival traveltime tomography based on the adjoint-state method: Geophysics, 74, no. 6, WCB57-WCB66.
Zhao, H. K., 2005, A fast sweeping method for eikonal equations: Mathematics of Computation, 74, 603-627.

Corresponding author: mark.noble@mines-paristech.fr