A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography


1 A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography. He Huang, Liqiang Wang, Po Chen (University of Wyoming), John Dennis (NCAR)

2 LSQR in Seismic Tomography. Seismic tomography: imaging sub-surface geological structures using seismic waves. LSQR (Least Squares with QR factorization): an iterative numerical method for solving sparse linear systems; widely used in seismic tomography. Features: highly efficient and capable of solving different types of linear systems arising in large inversion problems; the estimated solution usually converges quickly.

3 LSQR. Solves linear systems A x = b, where A is a huge sparse matrix, x is the vector of solution variables, and b is a constant right-hand-side vector. Requires loading the entire matrix into memory. The iterative steps of LSQR consist of several basic linear algebra operations, e.g., vector scale and norm, plus two matrix-vector multiplications, y ← y + A x and x ← x + Aᵀ y, where y is initially derived from the vector b.
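To make the per-iteration cost concrete, the following sketch (plain SciPy, not the authors' parallel code) shows the Golub-Kahan bidiagonalization core that LSQR builds on: each iteration performs exactly one SpMV with A and one with Aᵀ, plus a few vector scale-and-norm operations. The solution-update recurrences and stopping tests of full LSQR are omitted.

```python
# Illustrative only: the per-iteration SpMV pattern underlying LSQR, in SciPy.
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
A = sp.random(1000, 200, density=0.01, format="csr", random_state=rng)
b = rng.standard_normal(1000)

beta = np.linalg.norm(b)
u = b / beta                      # y is initially derived from b
v = A.T @ u
alpha = np.linalg.norm(v)
v /= alpha

for _ in range(3):                # real runs take thousands of iterations
    u = A @ v - alpha * u         # SpMV 1: the "y <- y + A x" step
    beta = np.linalg.norm(u)      # scale and norm
    u /= beta
    v = A.T @ u - beta * v        # SpMV 2: the "x <- x + A^T y" step
    alpha = np.linalg.norm(v)
    v /= alpha
```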

4 Real-World Application Challenges. Memory-intensive: LSQR requires loading the entire dataset into memory, but the matrix can be huge and very sparse, and it can grow much larger depending on the geological region of interest. Compute-intensive: thousands of iterations or more, and every iteration is compute-intensive. Communication-intensive: massive communication between compute nodes.

5 General Idea. Propose a partitioning strategy based on the special structure of the matrix. Specifically, SPLSQR (Scalable Parallel LSQR) contains a novel data decomposition strategy that treats the different components of the matrix separately. The SPLSQR algorithm achieves scalable communication volume among a fixed and modest number of communication neighbors.

6 Matrix Layout. In structural seismology, the coefficient matrix A is composed of a kernel component (top) and a damping component (bottom). Kernel: <1% of the rows, >90% of the nonzeros; sparse but relatively dense compared with damping; unstructured. Damping: >99% of the rows, <10% of the nonzeros; extremely sparse; structured.

7 Major Steps of the SPLSQR Algorithm (MPI Programming Model)
(1) Reorder the damping component to minimize bandwidth.
(2) Decomposition: partition the kernel across columns; partition the damping across rows; partition the damping transpose across columns.
(3) Iterative steps (sketched in code below):
Step 1: multiply the kernel with the vector; global-sum the partial result vector; communicate with neighbors; multiply the damping with the vector.
Step 2: multiply the kernel transpose with the vector; communicate with neighbors; multiply the damping transpose with the vector.
Step 3: sum the result vector.
(4) Test convergence: if not converged, go to (3); else exit and output the final result.
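A minimal mpi4py sketch of the iterative step structure above. The names Ak_local, Ad_local, Adt_local, and exchange_with_neighbors are placeholders rather than the authors' API, and the scale/norm and convergence bookkeeping of full LSQR is omitted.

```python
# Sketch (assumed names, simplified): one SPLSQR iteration's compute/communication pattern.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def splsqr_iteration(Ak_local, Ad_local, Adt_local, x_local, yd_local,
                     exchange_with_neighbors):
    # Step 1: y <- y + A x
    yk_partial = Ak_local @ x_local              # local kernel columns times local x_i
    yk = np.empty_like(yk_partial)
    comm.Allreduce(yk_partial, yk, op=MPI.SUM)   # global sum of the partial kernel vector
    x_ext = exchange_with_neighbors(x_local)     # reconstruct the extended x_i from neighbors
    yd_local = yd_local + Ad_local @ x_ext       # local damping rows times extended x

    # Step 2: x <- x + A^T y
    xk_local = Ak_local.T @ yk                   # kernel transpose times kernel part of y
    yd_ext = exchange_with_neighbors(yd_local)   # reconstruct the extended yd_i from neighbors
    xd_local = Adt_local @ yd_ext                # transposed damping times extended yd

    # Step 3: sum the result vector: x_i = xk_i + xd_i
    return xk_local + xd_local, yd_local
```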

8 Matrix Reordering. The original damping submatrix has a large bandwidth, which leads to a huge communication volume because the MPI tasks overlap heavily after decomposition. Goal: move the nonzeros of the damping component toward the diagonal to minimize bandwidth, thereby reducing overlap in the later parallel computation and ultimately reducing communication. Perform row permutations such that the position of the leading nonzero of each row is in descending order. Result: several thin bands in the diagonal area; this structure reduces communication. (Figure: (a) original damping; (b) leading nonzeros of (a); (c) result of applying the row permutation to (a); (d) enlargement of the top of (c), clearly showing several thin bands.)
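A small SciPy illustration of the reordering idea (not the production code): rows are permuted so that their leading-nonzero columns appear in sorted order, producing the thin diagonal bands described above. Whether the sort is ascending or descending does not change the banded structure, and permuting rows (together with the matching entries of b) only reorders equations, so the least-squares solution is unchanged.

```python
import numpy as np
import scipy.sparse as sp

def reorder_by_leading_nonzero(D):
    """Permute rows of the damping matrix D so the leading-nonzero columns are sorted."""
    D = sp.csr_matrix(D)
    D.sort_indices()                             # ensure column indices are sorted per row
    nrows, ncols = D.shape
    lead = np.full(nrows, ncols)                 # empty rows sort last
    for i in range(nrows):
        start, end = D.indptr[i], D.indptr[i + 1]
        if end > start:
            lead[i] = D.indices[start]           # column index of the leading nonzero
    perm = np.argsort(lead, kind="stable")       # row permutation
    return D[perm, :], perm
```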

9 Data Decomposition. Different colors represent different MPI tasks. The matrix has a kernel component (top) and a damping component (bottom). In memory, store a single copy of the kernel component in compressed sparse column (CSC) format, and store two copies of the damping component: one copy of the original in compressed sparse row (CSR) format and one copy of the transposed matrix in CSC format. The kernel component is partitioned by columns, while the damping component is partitioned by rows. The transposed damping matrix uses the same partitioning as the kernel component.
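The sketch below shows one way to build the three per-task pieces and their storage formats; the partition bounds and the exact restriction of the transposed damping block are illustrative assumptions, and in SPLSQR each task holds only the extended neighbor slices of the vectors rather than full-length vectors.

```python
import scipy.sparse as sp

def build_local_pieces(A_kernel, A_damp, col_lo, col_hi, row_lo, row_hi):
    """Per-task storage: kernel columns in CSC, damping rows in CSR,
    and the damping transpose for the local solution indices in CSC."""
    Ak_local  = sp.csc_matrix(A_kernel[:, col_lo:col_hi])   # kernel, partitioned by columns
    Ad_local  = sp.csr_matrix(A_damp[row_lo:row_hi, :])     # damping, partitioned by rows
    Adt_local = sp.csc_matrix(A_damp[:, col_lo:col_hi].T)   # damping transpose, same column range as the kernel
    return Ak_local, Ad_local, Adt_local
```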

10 Parallel Computation: y ← y + A x (MPI Allreduce on relatively small vectors)
1. Each task multiplies its local piece of the kernel, Ak_i, with its local piece of x_i and yields its kernel part of the vector, yk_i.
2. A reduction across all tasks is performed on yk_i to combine the partial results (black).

11 Parallel Computation: y ← y + A x (continued)
3. Reconstruct the extended x_i from neighbors; e.g., the yellow task communicates with the red and blue tasks.
4. The extended x_i is multiplied with the local damping Ad_i and yields the local vector yd_i.

12 Parallel Computation: x ← x + Aᵀ y
1. Each task multiplies its local Akᵀ_i with yk to construct xk_i.

13 Parallel Computation: x ← x + Aᵀ y (continued)
2. Reconstruct the extended yd_i from neighbors; e.g., the yellow task communicates with the red and blue tasks.
3. The extended yd_i is multiplied with the local transposed damping Adt_i and yields the local vector xd_i.
4. x_i = xk_i + xd_i.

14 Communication. MPI_Allreduce is performed on the kernel part of the vector y; its length equals the number of rows in the kernel, which is trivial compared with the whole matrix. Each task compares its local vector with the required extended vector and decides which neighbors it needs to communicate with (a sketch of this exchange follows). Communication is scalable: more cores → smaller overlaps with neighbors → less communication volume.
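A fuller sketch of the exchange_with_neighbors placeholder used in the earlier code, again with assumed names: the maps need_from and send_to are computed once during setup by comparing each task's owned index range with the extended range its damping block touches, so only true neighbors ever exchange data.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def exchange_with_neighbors(x_local, my_lo, ext_lo, ext_hi, need_from, send_to):
    """Return the extended vector covering global indices [ext_lo, ext_hi).

    need_from[rank] = (lo, hi): global slice to receive from that neighbor.
    send_to[rank]   = (lo, hi): global slice of this task's data to send.
    """
    x_ext = np.zeros(ext_hi - ext_lo)
    # place the locally owned piece
    x_ext[my_lo - ext_lo : my_lo - ext_lo + x_local.size] = x_local

    reqs = []
    for nbr, (lo, hi) in need_from.items():                       # post receives
        reqs.append(comm.Irecv(x_ext[lo - ext_lo : hi - ext_lo], source=nbr))
    for nbr, (lo, hi) in send_to.items():                         # post sends
        reqs.append(comm.Isend(x_local[lo - my_lo : hi - my_lo], dest=nbr))
    MPI.Request.Waitall(reqs)
    return x_ext
```

Because only actual neighbors appear in the maps, increasing the core count shrinks the overlaps, keeping both the number of messages per task and their sizes modest; this is what makes the communication volume scalable.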

15 Optimization: Load Balance. The figures show the number of nonzeros in each MPI task (processor). Left: the nonzeros in the kernel are not evenly distributed, so evenly partitioning the columns results in load imbalance (max_nonzero/avg_nonzero = 6.71). Middle: an uneven column partition gives each MPI task a similar data load, i.e., number of nonzeros (max_nonzero/avg_nonzero = 1.00); however, perfect load balance results in an uneven partition of the vector, and thus in communication imbalance. Right: a trade-off between computation load and communication load, setting max_nonzero/avg_nonzero = 1.30.
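A sketch of the nonzero-balanced column partition (illustrative, not the authors' code): walk the kernel's columns and cut a new block whenever the running nonzero count reaches the next per-task target. In practice the cut points would then be relaxed toward an even column split until max_nonzero/avg_nonzero is about 1.30, trading computation balance against communication balance as described above.

```python
import numpy as np

def partition_columns_by_nnz(A_kernel_csc, ntasks):
    """Return column boundaries so each task gets roughly the same number of nonzeros."""
    nnz_per_col = np.diff(A_kernel_csc.indptr)        # nonzeros in each column (CSC)
    target = nnz_per_col.sum() / ntasks
    bounds, running = [0], 0.0
    for j, c in enumerate(nnz_per_col):
        running += c
        if running >= target * len(bounds) and len(bounds) < ntasks:
            bounds.append(j + 1)                      # close the current block here
    if bounds[-1] != len(nnz_per_col):
        bounds.append(len(nnz_per_col))
    return bounds                                     # task i owns columns [bounds[i], bounds[i+1])
```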

16 Experiment: Kraken. The figure shows total and communication time for 100 iterations of the SPLSQR, PLSQR, and PETSc implementations for the DEC3 data set from 60 to 1,920 cores. The execution time of the SPLSQR algorithm is 1.7x less than PETSc at 60 cores and 7.8x less at 720 cores. The communication cost of SPLSQR is over 50x less than either PLSQR or PETSc.

17 Experiment: Kraken. The figure shows total and communication time for 100 iterations of SPLSQR and PETSc for the ANGF data set from 360 to 19,200 cores. The reduction in execution time for SPLSQR versus PETSc varies from a low of 4.3x on 360 cores to a high of 9.9x on 2,400 cores. The SPLSQR algorithm reduces the communication cost versus PETSc by more than a factor of 100x.

18 Initial Experiment: Blue Waters. Perform a much larger experiment than on Kraken. The figure shows a preliminary result on EQANGF62K using Blue Waters XE nodes: scalable from 3,200 to 25,600 cores. More experiments using XE or XK nodes are planned. (Figure: LSQR execution time in seconds versus number of MPI tasks, from 3,200 to 25,600.)

19 Initial Experiment: Yellowstone. Performance comparison between our and PETSc's implementations of the LSQR algorithm for 100 LSQR iterations of the 12K dataset, from 2,400 to 9,600 processors. Tested on EQANGF125K.

20 Selecting the Best CUDA Kernel for SpMV (y ← y + A x and x ← x + Aᵀ y)
GPU-I: pass the matrix in CSR and the vector directly to cusparse<t>csrmv in the cuSPARSE library to compute both y ← y + A x and x ← x + Aᵀ y.
GPU-II: cuSPARSE supports conversion between different storage formats; convert the matrix from CSR to CSC (only once, at the first iteration), then treat the CSC matrix as a CSR matrix and invoke cusparse<t>csrmv.
GPU-III: combination of I and II; keep one copy in CSR and one copy in CSC.
GPU-IV: write our own kernel to perform x ← x + Aᵀ y without transposing the matrix, saving half the space.
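The format identity that GPU-II and GPU-III rely on can be shown with SciPy (an illustration, not the CUDA code): the CSC arrays of A are exactly the CSR arrays of Aᵀ, so a single CSR-based SpMV routine can serve both y ← y + A x and x ← x + Aᵀ y. GPU-IV instead keeps only the CSR copy and accumulates the transpose product inside a custom kernel, which saves the second copy, typically at the cost of atomic updates or an extra reduction step.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(1)
A = sp.random(300, 80, density=0.02, format="csr", random_state=rng)

A_csc = A.tocsc()            # one-time CSR -> CSC conversion (GPU-II, first iteration)
At_csr = A.T.tocsr()         # the matrix a CSR-based SpMV routine would "see"
A_csc.sort_indices(); At_csr.sort_indices()

# identical index/value arrays: CSC(A) == CSR(A^T)
assert np.array_equal(A_csc.indptr, At_csr.indptr)
assert np.array_equal(A_csc.indices, At_csr.indices)
assert np.allclose(A_csc.data, At_csr.data)

x = rng.standard_normal(A.shape[1])
y = rng.standard_normal(A.shape[0])
y_new = y + A @ x            # forward SpMV on the CSR copy
x_new = x + At_csr @ y       # transpose SpMV via the CSC arrays reinterpreted as CSR
```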

21 Strong Scalability. Strong scalability describes how the execution time varies with the number of cores for a fixed problem size. Our multi-GPU approach scales from 1 to 60 CPU cores/GPUs. The GPU approach is faster than the corresponding CPU approach and delivers better performance than PETSc.

22 Weak Scalability. Weak scalability shows how the execution time varies with the number of cores for a fixed problem size per core. The Y axis is the performance per CPU core or per GPU. Performance per CPU core or per GPU degrades slightly as the number of cores/GPUs increases. The GPU approach is faster than the corresponding CPU approach and delivers better performance than PETSc.

23 Science Impact. Many LSQR runs are required to find the optimal damping coefficients; thanks to the improved algorithm, large-scale seismic tomographic inversion is now feasible. The top figure shows the misfit reduction when trying different damping coefficients. (a) The map shows the study area: the topography and major faults (thick black lines) of southern California. (b, c, d) are perturbation maps; red regions represent velocity reductions and blue regions represent velocity increases. (b) Under-damping: the perturbations oscillate. (c) Optimal damping: the perturbations show many correlations with geological structures. (d) Over-damping: the perturbations are too smooth.

24 Conclusion and Future Work. The SPLSQR algorithm exploits the particular structure of the coefficient matrix, which includes both pseudo-dense and sparse components. We demonstrate that SPLSQR has scalable communication volume and significantly reduces communication cost compared with existing approaches. Future work: utilize GPUDirect technology to speed up communication between GPUs across the network, and run larger-scale experiments on Blue Waters.