State of Art and Project Proposals - Intensive Computation. Annalisa Massini, 2015/2016

Today's lecture. Project proposals on the following topics: Sparse Matrix-Vector Multiplication, Tridiagonal Solvers, Parallel Iterative Jacobi. During this lecture we analyze some recent papers on these topics. A project can be based on one of these papers (or a variation of it) and can be developed using Matlab or CUDA on a GPU.

Available Cluster. The available cluster consists of four identical nodes. Each node is composed of: two quad-core Intel Xeon E5520 CPUs operating at 2.26 GHz, two Tesla T10 GPUs, and a PCI Express 2.0 bus for the connection. Each T10 processor has 240 cores and 4 GB of memory; the Tesla S1070 GPU Computing System consists of four T10 processors.

SPARSE MATRIX-VECTOR MULTIPLICATION

Sparse Matrix-Vector Multiplication. Nathan Bell & Michael Garland - NVIDIA Research. Efficient Sparse Matrix-Vector Multiplication on CUDA, 2008 - NVIDIA Technical Report NVR-2008-004. Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors, 2009 - Proceedings of SC '09, Conference on High Performance Computing Networking, Storage and Analysis.

Sparse Matrix-Vector Multiplication. Sparse matrix structures arise in numerous computational disciplines, and methods for efficiently processing them are often critical to the performance of many applications. Sparse matrix-vector multiplication (SpMV) operations: are of particular importance in computational science; represent the dominant cost in many iterative methods for solving large-scale linear systems (Ax=b) and eigenvalue problems (Ax=λx), which arise in a wide variety of scientific and engineering applications.

Sparse Matrix-Vector Multiplication. With their approach, Authors aim to minimize the execution and memory divergence caused by the potential irregularity of the underlying sparse matrix. In fact, one of the primary goals of most SpMV optimizations is to mitigate the impact of irregularity in the matrix structure. Since an efficient SpMV kernel should be memory-bound, a measure of success is the fraction of peak bandwidth the kernels can achieve. Authors address this problem by focusing on choosing appropriate matrix formats and designing parallel kernels that can operate on these formats efficiently.

Sparse Matrix-Vector Multiplication. Given a particular matrix, it is essential that we select a format that is a good fit for its sparsity pattern. Authors identify three basic sparsity regimes: diagonal matrices; matrices with roughly uniform row lengths; matrices with non-uniform row lengths. Each sparsity pattern is addressed by using a different format. Authors are not concerned with modifying matrices: they restrict attention to static formats, as opposed to those suitable for rapid insertion and deletion of elements, and they do not consider transformations such as row permutation that globally restructure the matrix.

Sparse Matrix-Vector Multiplication. Authors choose well-known formats (DIA, ELL, CSR, COO): the ELLPACK format is well-suited to vector and SIMD architectures, but its efficiency degrades when the number of nonzeros per row varies; the storage efficiency of the COO format is invariant to the distribution of nonzeros per row, and the use of segmented reduction makes its performance largely invariant as well. To obtain the advantages of both, Authors combine these into a hybrid ELL/COO format: the typical number of nonzeros per row is stored in the ELL data structure, and the remaining entries of exceptional rows in the COO format. Authors compute a histogram of the row sizes and determine the largest number K such that using K columns per row in the ELL portion of the HYB matrix meets a certain objective measure.
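As an illustration of this step (not the Authors' code), the following host-side sketch shows how an ELL width K could be chosen from a histogram of row lengths; the "keep at least one third of the rows full" threshold used as the objective measure is an assumption of ours.

```cuda
// Hypothetical sketch: choose the ELL width K for a HYB (ELL + COO) matrix
// from a histogram of row lengths.  The objective used here -- keep growing K
// while at least one third of the rows still have K or more nonzeros -- is
// only an assumed example of the "objective measure".
#include <algorithm>
#include <cstddef>
#include <vector>

std::size_t choose_ell_width(const std::vector<std::size_t>& row_lengths)
{
    std::size_t max_len = 0;
    for (std::size_t len : row_lengths) max_len = std::max(max_len, len);

    // histogram[k] = number of rows with exactly k nonzeros
    std::vector<std::size_t> histogram(max_len + 1, 0);
    for (std::size_t len : row_lengths) ++histogram[len];

    std::size_t rows_at_least_k = row_lengths.size();  // rows with >= k nonzeros
    std::size_t K = 0;
    for (std::size_t k = 1; k <= max_len; ++k) {
        rows_at_least_k -= histogram[k - 1];
        if (3 * rows_at_least_k < row_lengths.size()) break;  // assumed threshold
        K = k;
    }
    return K;  // entries beyond column K in a row go to the COO part
}
```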

Sparse Matrix-Vector Multiplication. The full utilization of the GPU requires many thousands of active threads: therefore, the finer granularity of the CSR (vector) and COO kernels is advantageous when applied to matrices with a limited number of rows. Note that such matrices are not necessarily small, as the number of nonzero entries per row can still be arbitrarily large. Except for the CSR kernels, all methods benefit from full coalescing when accessing the sparse matrix format. The SpMV kernel implementations parallelize the outermost for loop over many threads and compute many operations (parallel reductions or segmented scans) in shared memory. The SpMV implementations are available as open-source software.

Sparse Matrix-Vector Multiplication. To assess the efficiency of these sparse formats and their associated kernels, Authors have collected SpMV performance data on a broad collection of matrices. All experiments are run on a system comprised of an NVIDIA GTX 285 GPU paired with an Intel Core i7 965 CPU. Each of their SpMV performance measurements is an average (arithmetic mean) over 500 trials or 3.0 seconds of execution time, whichever is less.

Sparse Matrix-Vector Multiplication. The measurements do not include the time spent transferring data between host (CPU) and device (GPU) memory, since Authors are trying to measure the performance of the kernels. In many applications, the relevant data structures are created on the device, or reside in device memory for the duration of a series of computations, so the cost of transfers is negligible. When the data is not resident in device memory, such as when the device is used to accelerate a single, self-contained component of the overall computation, transfer overhead is a potential concern.

Sparse Matrix-Vector Multiplication. SpMV kernels operate at various granularities: (1) one thread per row, (2) one warp per row, (3) one thread per nonzero element. The kernels expose different levels of parallelism on different matrices: assigning one thread per row, as in ELL, implicitly assumes that the number of rows is large enough to generate sufficient parallel work; assigning one thread per nonzero, as in COO, only requires that the total number of nonzero entries is sufficiently large.
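To make the one-thread-per-row granularity concrete, here is a minimal scalar CSR SpMV kernel in CUDA (a simplified illustration with our own naming, not the Bell & Garland kernel).

```cuda
// Minimal scalar CSR SpMV: one thread per row (illustrative sketch).
// row_ptr has n_rows+1 entries; col_idx and values hold the nonzeros of each
// row contiguously.
__global__ void csr_spmv_scalar(int n_rows,
                                const int*   row_ptr,
                                const int*   col_idx,
                                const float* values,
                                const float* x,
                                float*       y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n_rows) {
        float dot = 0.0f;
        for (int jj = row_ptr[row]; jj < row_ptr[row + 1]; ++jj)
            dot += values[jj] * x[col_idx[jj]];
        y[row] = dot;
    }
}

// Launch example: 256 threads per block, enough blocks to cover all rows.
// csr_spmv_scalar<<<(n_rows + 255) / 256, 256>>>(n_rows, row_ptr, col_idx,
//                                                values, x, y);
```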

Sparse Matrix-Vector Multiplication. The GTX 285 processor accommodates up to 30K co-resident threads and requires several thousand threads for full utilization. The performance study is executed on: a collection of synthetic examples (matrices in sparse format, varying the dimension while holding the number of nonzeros fixed); structured matrices; unstructured matrices.

Sparse Matrix-Vector Multiplication. H. V. Dang, B. Schmidt, CUDA-enabled Sparse Matrix-Vector Multiplication on GPUs using Atomic Operations, 2013 - Parallel Computing, vol. 39, no. 1, pp. 737-750. A. Ashari, N. Sedaghati, J. Eisenlohr, S. Parthasarathy, P. Sadayappan, Fast Sparse Matrix-Vector Multiplication on GPUs for Graph Applications, 2014 - Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. J. L. Greathouse, M. Daga, Efficient Sparse Matrix-Vector Multiplication on GPUs using the CSR Storage Format, 2014 - Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. Y. Liu & B. Schmidt, LightSpMV: Faster CSR-based sparse matrix-vector multiplication on CUDA-enabled GPUs, 2015 - IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP).

TRIDIAGONAL SOLVERS

Fast Tridiagonal Solvers on the GPU. Yao Zhang (University of California, Davis), Jonathan Cohen (NVIDIA), John D. Owens (University of California, Davis). PPoPP '10 - Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.

Fast Tridiagonal Solvers on the GPU. Fast solutions to tridiagonal systems of linear equations are critical for many scientific and engineering problems, as well as for real-time or interactive applications in computer graphics. Since the 1960s, a variety of parallel algorithms have been developed for solving tridiagonal systems. In this paper Authors study the performance of three parallel algorithms and their hybrid variants for solving tridiagonal linear systems on a GPU: Cyclic Reduction (CR), Parallel Cyclic Reduction (PCR), and Recursive Doubling (RD).

Fast Tridiagonal Solvers on the GPU. Authors focus on the problem of solving a large number of small tridiagonal systems in parallel, which maps well to the hierarchical and throughput-oriented architecture of the GPU. Consider a system of n linear equations of the form Ax = d, where A is a tridiagonal matrix:

A = \begin{bmatrix}
b_1 & c_1 &        &        &        \\
a_2 & b_2 & c_2    &        &        \\
    & a_3 & b_3    & c_3    &        \\
    &     & \ddots & \ddots & \ddots \\
    &     &        & a_n    & b_n
\end{bmatrix}

Fast Tridiagonal Solvers on the GPU. The classic algorithm to solve such a system is the Thomas algorithm, which is Gaussian elimination specialized to the tridiagonal case. The algorithm has two phases: forward elimination, which eliminates the lower diagonal, and backward substitution, which solves for all unknowns from last to first. The algorithm is simple, but it is inherently serial and takes 2n computation steps, because each calculation depends on the result of the immediately preceding one.
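For reference, a minimal serial sketch of the Thomas algorithm (our own formulation: no pivoting, so it assumes a well-conditioned, e.g. diagonally dominant, system).

```cuda
#include <vector>

// Serial Thomas algorithm for a tridiagonal system Ax = d with
// sub-diagonal a[1..n-1] (a[0] unused), diagonal b[0..n-1],
// super-diagonal c[0..n-2].  Illustrative sketch: no pivoting.
std::vector<double> thomas(std::vector<double> a, std::vector<double> b,
                           std::vector<double> c, std::vector<double> d)
{
    const std::size_t n = b.size();

    // Forward elimination: remove the sub-diagonal.
    for (std::size_t i = 1; i < n; ++i) {
        double m = a[i] / b[i - 1];
        b[i] -= m * c[i - 1];
        d[i] -= m * d[i - 1];
    }

    // Backward substitution: solve from the last unknown to the first.
    std::vector<double> x(n);
    x[n - 1] = d[n - 1] / b[n - 1];
    for (std::size_t i = n - 1; i-- > 0; )
        x[i] = (d[i] - c[i] * x[i + 1]) / b[i];
    return x;
}
```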

Cyclic Reduction (CR). CR consists of two phases: the forward reduction phase successively reduces a system to a smaller system with half the number of unknowns, until a system of 2 unknowns is reached; the backward substitution phase successively determines the other half of the unknowns using the previously solved values. The serial Thomas algorithm performs 8n operations while CR performs 17n operations, but on a parallel computer with n processors CR requires 2 log2 n - 1 steps while the Thomas algorithm requires 2n steps.
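As an illustration of what one forward-reduction step does, the following serial sketch (our own bookkeeping, not the paper's GPU kernel) combines every second equation with its two neighbours so that it no longer references the intervening unknowns; PCR applies the same update to every equation instead of only every second one.

```cuda
#include <cstddef>
#include <vector>

// One cyclic-reduction forward step over equations i = 1, 3, 5, ... (0-based),
// eliminating their even-indexed neighbours.  Stride-1 level only; a full CR
// solver repeats this with doubling strides and then substitutes backwards.
// Assumes the usual convention a[0] == 0 and c[n-1] == 0 (no outside neighbours).
void cr_forward_step(std::vector<double>& a, std::vector<double>& b,
                     std::vector<double>& c, std::vector<double>& d)
{
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(b.size());
    for (std::ptrdiff_t i = 1; i < n; i += 2) {
        double k1 = a[i] / b[i - 1];              // weight of the left neighbour
        bool has_right = (i + 1 < n);
        double k2 = has_right ? c[i] / b[i + 1] : 0.0;  // weight of the right one

        a[i] = -a[i - 1] * k1;                    // new coupling to x[i-2]
        c[i] = has_right ? -c[i + 1] * k2 : 0.0;  // new coupling to x[i+2]
        b[i] = b[i] - c[i - 1] * k1 - (has_right ? a[i + 1] * k2 : 0.0);
        d[i] = d[i] - d[i - 1] * k1 - (has_right ? d[i + 1] * k2 : 0.0);
    }
}
```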

Parallel Cyclic Reduction (PCR). PCR is a variant of CR that only has the forward reduction phase. The reduction mechanism and formula are the same as those of CR, but in each reduction step PCR reduces each of the current systems to two systems of half size. PCR takes 12 n log2 n operations and log2 n steps to finish. PCR requires fewer algorithmic steps than CR but does asymptotically more work per step.

Recursive Doubling (RD). The RD algorithm proposed by Stone has been reformulated in the scan (or prefix sum) form. The basic idea of RD is to express the unknowns as the product of a chain of matrices that can be evaluated in parallel using the scan primitive. This (step-efficient) version of RD takes 20 n log2 n operations and log2 n + 2 steps to finish.

Hybrid Algorithms. With respect to computational complexity, CR is the best algorithm because it is O(n) while PCR and RD are O(n log2 n). But CR suffers from a lack of parallelism at the end of the forward reduction phase and at the beginning of the backward substitution phase; on the contrary, although PCR and RD have fewer algorithmic steps, they keep more parallelism through all steps. These two observations lead Authors to develop hybrid methods on a GPU that improve CR by switching to PCR or RD when there is not enough parallelism to keep the GPU busy, thus reducing the inefficient steps.

Hybrid Algorithms. The hybrid algorithms: first reduce the system to a certain size using the forward reduction phase of CR; then solve the reduced (intermediate) system with the PCR/RD algorithm; finally, substitute the solved unknowns back into the original systems using the backward substitution phase of CR. Note that the switch to PCR or RD happens before the forward reduction has reached a 2-unknown system; PCR and RD allow the inefficient middle steps to be finished more quickly, because they have fewer algorithmic steps than CR.

System. NVIDIA GTX 280 GPU using the CUDA programming model. The GTX 280 is made up of 30 multiprocessors, and each multiprocessor contains 8 thread processors and 16 KB of on-chip shared memory. To solve hundreds of tridiagonal systems simultaneously, the solvers are mapped to the GPU's two-level hierarchical architecture, with systems mapped to blocks and equations mapped to threads.

Conclusion. In this paper Authors: have studied five tridiagonal solvers that run on a GPU, based on three algorithms (CR, PCR and RD); propose an approach to measure, analyze and improve the performance of GPU programs in terms of memory access, computation and control overhead; show that the hybrid algorithms have the best performance. For solving 512x512-unknown systems, the hybrid solvers achieve a 12.5x speedup over the multi-threaded CPU solver, and a 28x speedup over the LAPACK solver.

A Guide for Implementing Tridiagonal Solvers on GPUs. Li-Wen Chang and Wen-mei W. Hwu. 2014 - Numerical Computations with GPUs, Chapter 2.

Guide for Implementing Tridiagonal Solvers. Authors give guidelines for customizing a high-performance tridiagonal solver for GPUs. They consider a wide set of algorithms for implementing tridiagonal solvers on GPUs, both sequential and parallel. Algorithms are chosen according to the requirements of the application and to take advantage of the massive data parallelism of the GPU. Optimizations are proposed to compensate for some inherent limitations of the selected algorithms. In order to achieve high performance on GPUs, workloads have to be partitioned and computed in parallel on the stream processors.

Guide for Implementing Tridiagonal Solvers. For the tridiagonal solver, the inherent data dependence found in sequential algorithms (e.g. the Thomas algorithm and the diagonal pivoting method) limits the opportunities for partitioning the workload. On the other hand, parallel algorithms (e.g. Cyclic Reduction (CR), Parallel Cyclic Reduction (PCR), or the SPIKE algorithm) allow the partitioning of workloads, but suffer from the overheads of extra computation, barrier synchronization, or communication.

Guide for Implementing Tridiagonal Solvers. The paper presents: partitioning methods applied to divide the workload for parallel computing, where different partitioning techniques incur different types of overhead, such as computation or memory overhead; state-of-the-art optimization techniques for the independent solvers, where different algorithms might require different optimizations and optimization techniques can be combined for more robust independent solvers.

Guide for Implementing Tridiagonal Solvers. In summary, this paper: shows the most cutting-edge optimization techniques for GPUs, applied in both partitioning methods and independent solvers; demonstrates how to apply optimization techniques for building a high-performance tridiagonal solver (case study). Authors say that the main purpose of this paper is to give readers the current status of GPU tridiagonal solvers, and to inspire readers to customize GPU tridiagonal solvers to meet their application requirements (rather than to show the high performance of SPIKE-CR).

PARALLEL ITERATIVE JACOBI

CUDA-Based Jacobi's Iterative Method. Zhihui Zhang, Qinghai Miao, Ying Wang. 2009 International Forum on Computer Science-Technology and Applications (IFCSTA '09).

CUDA-Based Jacobi's Iterative Method. Jacobi's iterative method has inherent parallelism, so it is suitable to be implemented on CUDA to run concurrently on many cores. The component form of Jacobi's iterative method corresponding to the matrix form is

x_i^{(k+1)} = \frac{1}{a_{ii}} \Big( b_i - \sum_{\substack{j=1 \\ j \ne i}}^{n} a_{ij} x_j^{(k)} \Big), \quad i \in \{1,\dots,n\}, \; k \in \mathbb{N}, \qquad x^{(0)} \text{ given}

CUDA-Based Jacobi's Iterative Method. In the proposed implementation of Jacobi's iterative method, within one iteration each thread is in charge of one equation, i.e. of one component x_i. Limited by the size of shared memory and the amount of registers, 256 threads per block are launched. 9 elements of one row of A are read from global memory and stored in shared memory at a time (9 is odd, and it is chosen to avoid shared-memory bank conflicts). The vector x is shared within a block, and a segment of 1440 elements is read at a time from global memory into shared memory (1440 is a multiple of both 9 and 32: being a multiple of 9 simplifies the coding when computing the dot product over a segment, and being a multiple of 32 satisfies the optimization requirements of the hardware).
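For orientation, a minimal CUDA sketch of the one-thread-per-equation idea (a simplified dense-matrix kernel without the shared-memory tiling of A and x described by the Authors; kernel and helper names are ours).

```cuda
// One Jacobi sweep, one thread per equation (illustrative sketch only; the
// paper's kernel additionally tiles A and x through shared memory).
// A is a dense n x n matrix stored row-major.
__global__ void jacobi_sweep(int n, const float* A, const float* b,
                             const float* x_old, float* x_new)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float sigma = 0.0f;
        for (int j = 0; j < n; ++j)
            if (j != i) sigma += A[i * n + j] * x_old[j];
        x_new[i] = (b[i] - sigma) / A[i * n + i];
    }
}

// Host-side iteration (sketch): swap x_old / x_new each sweep and stop when
// the max-norm of the update falls below eps (e.g. 1e-5 in single precision).
// for (int k = 0; k < max_iter && err > eps; ++k) {
//     jacobi_sweep<<<(n + 255) / 256, 256>>>(n, d_A, d_b, d_x_old, d_x_new);
//     err = max_abs_diff(d_x_old, d_x_new, n);   // hypothetical reduction helper
//     std::swap(d_x_old, d_x_new);
// }
```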

CUDA-Based Jacobi's Iterative Method. Authors use a small enough positive number ε to control the solution precision and the iteration process. When

\max_{1 \le i \le n} \left| x_i^{(k)} - x_i^{(k-1)} \right| < \varepsilon

the process satisfies the solution precision and the final solution is the vector x^{(k)}. In the experiments, the accuracies that Authors use to control the termination of the Jacobi iteration are 1e-5 for single-precision floating point and 1e-10 for double precision.

CUDA-Based Jacobi's Iterative Method. The experimental platform is: CPU: 2 quad-core Intel Xeon X5482 CPUs @ 3.2 GHz, 8 cores in total, but the sequential program runs only on one core; GPU: NVIDIA Tesla C1060, 30 SMs, for a total of 30 x 8 = 240 cores.

A Comparison of Sequential and GPU Implementations of Iterative Methods to Compute Reachability Probabilities. Elise Cormie-Bowins - DisCoVeri Group, Department of Computer Science and Engineering, York University. 2012 - Proc. of the Workshop on GRAPH Inspection and Traversal Engineering (GRAPHITE 2012).

The parallel implementation of the Jacobi method, where each thread i computes an element of x, is used to compute the reachability probabilities of a Markov chain M = (S, P, s_0), consisting of: a nonempty and finite set S of states; a transition probability function P : S \times S \to [0,1] satisfying \sum_{s' \in S} P(s, s') = 1 for all s \in S; and an initial state s_0 \in S. The problem is: given a Markov chain M and a set of goal states GS, what is the probability of reaching a state in GS in zero or more transitions from the initial state?

To compute the reachability probabilities, one must determine the probability of the initial state leading to a state in GS. To do so, one must also determine the probabilities of reaching a state in GS from the other states. For each state s we compute x_s, the probability of reaching GS from s. The values x_s, for s \in S, can be expressed as a vector x, where

x_s = \sum_{s' \in S \setminus GS} P(s, s') \, x_{s'} + \sum_{s' \in GS} P(s, s')

The equation for x can be written as x = Ax + b. Rearranged, this becomes (I - A)x = b, where I is the identity matrix and A and b are already known from the Markov chain. So x can be found by solving the linear system (I - A)x = b. Iterative approximation methods find solutions that are within a specified margin of error of the exact solution, and can work much more quickly than direct methods.
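To make the construction concrete, a small host-side sketch (ours, not the paper's code) that assembles the dense matrix A and the vector b of the system (I - A)x = b from a transition matrix P and a goal-state set GS; forcing x_s = 1 for goal states by giving them an empty row of A and b_s = 1 is a convention we assume here.

```cuda
#include <cstddef>
#include <vector>

// Sketch: build the dense system (I - A) x = b for reachability probabilities
// from a transition matrix P (row-major, n x n) and a goal-state set GS.
// For s in GS we force x_s = 1 (empty row of A, b_s = 1); for the other
// states we follow the equation for x_s given above.
void build_reachability_system(int n,
                               const std::vector<double>& P,       // n*n, row-major
                               const std::vector<bool>&   is_goal, // size n
                               std::vector<double>&       A,       // out: n*n
                               std::vector<double>&       b)       // out: n
{
    A.assign(static_cast<std::size_t>(n) * n, 0.0);
    b.assign(n, 0.0);
    for (int s = 0; s < n; ++s) {
        if (is_goal[s]) { b[s] = 1.0; continue; }           // x_s = 1 for goal states
        for (int t = 0; t < n; ++t) {
            if (is_goal[t]) b[s]        += P[s * n + t];    // transitions into GS
            else            A[s * n + t] = P[s * n + t];    // transitions within S \ GS
        }
    }
    // The vector x solving (I - A) x = b can now be approximated with the
    // Jacobi iteration discussed above.
}
```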

Efficient implementation of Jacobi iterative method for large sparse linear systems on graphic processing units. Abal-Kassim Cheik Ahamed, Frédéric Magoulès - CentraleSupélec, Université Paris-Saclay. J. Supercomputing, 2016.

Efficient implementation of Jacobi iterative method. In general, the Jacobi method has a convergence rate slower than other iterative methods (such as the Gauss-Seidel method or Krylov methods), but it is very attractive because it is inherently parallel. Due to its mathematical properties, the Jacobi method is often chosen to derive convergence proofs of more complex methods, e.g., the convergence of asynchronous algorithms. The paper proposes to parallelize the Jacobi method on the GPU using the CUDA programming language.

Efficient implementation of Jacobi iterative method. To optimize the implementation of the Jacobi method on the GPU, Authors propose to transform the classical algorithm into a modified algorithm which computes only vector operations. The classical Jacobi iteration is given by

u^{(k+1)} = D^{-1} \big( b - (L + U)\, u^{(k)} \big), \quad k \in \mathbb{N}, \qquad u^{(0)} \text{ given}

(with A = D + L + U split into its diagonal, strictly lower and strictly upper parts), that is, in component form,

u_i^{(k+1)} = \frac{1}{a_{ii}} \Big( b_i - \sum_{\substack{j=1 \\ j \ne i}}^{n} a_{ij} u_j^{(k)} \Big), \quad i \in \{1,\dots,n\}, \; k \in \mathbb{N}, \qquad u^{(0)} \text{ given}

Efficient implementation of Jacobi iterative method. The main idea behind the modified algorithm consists in updating the approximate solution repeatedly using products of the sparse matrix with vectors. The classical Jacobi iteration can be rewritten in the vector form

u^{(k+1)} = D^{-1} \big( b - A u^{(k)} \big) + u^{(k)}, \quad k \in \mathbb{N}, \qquad u^{(0)} \text{ given}

that is, in component form,

u_i^{(k+1)} = \frac{1}{a_{ii}} \Big( b_i - \sum_{j=1}^{n} a_{ij} u_j^{(k)} \Big) + u_i^{(k)}, \quad i \in \{1,\dots,n\}, \; k \in \mathbb{N}, \qquad u^{(0)} \text{ given}

Efficient implementation of Jacobi iterative method. In the vector version, the inner loop is extended to all columns so that it becomes a complete matrix-vector product. The overall computational performance of the algebraic equation solver is linked to the performance of basic linear algebra operations. For the algorithm implementing Jacobi's method, the operations performed at each iteration step are: scaled vectors (Δu = D^{-1} r); addition of vectors (u^{(k+1)} = u^{(k)} + Δu, r = b - q); dot product or norm of a vector (||r|| = ||b - A u^{(k)}||); sparse matrix-vector multiplication (q = A u^{(k)}).
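A minimal CUDA sketch of one vector-form iteration built from these operations (simplified kernels of our own, assuming the CSR format discussed next; not the Authors' implementation). The norm of r needed for the stopping test would be computed by a separate reduction (e.g. with cuBLAS or Thrust), which is omitted here.

```cuda
// One vector-form Jacobi iteration:  q = A u,  r = b - q,  du = D^{-1} r,
// u = u + du  (illustrative sketch only).
__global__ void csr_spmv(int n, const int* row_ptr, const int* col_idx,
                         const double* val, const double* u, double* q)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double sum = 0.0;
        for (int jj = row_ptr[i]; jj < row_ptr[i + 1]; ++jj)
            sum += val[jj] * u[col_idx[jj]];
        q[i] = sum;
    }
}

// Fused element-wise update: r = b - q, then u += r / diag.
__global__ void jacobi_update(int n, const double* b, const double* q,
                              const double* diag, double* u, double* r)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double ri = b[i] - q[i];
        r[i] = ri;                 // kept for the convergence test (norm of r)
        u[i] += ri / diag[i];
    }
}

// Host-side loop (sketch):
// for (int k = 0; k < max_iter; ++k) {
//     csr_spmv<<<blocks, 256>>>(n, d_row_ptr, d_col_idx, d_val, d_u, d_q);
//     jacobi_update<<<blocks, 256>>>(n, d_b, d_q, d_diag, d_u, d_r);
//     if (norm2(d_r, n) < eps) break;   // hypothetical reduction helper
// }
```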

Efficient implementation of Jacobi iterative method. The sparse matrix-vector product (SpMV) is the basic operation. The treatment of large sparse matrices requires a good choice of storage format, one that helps the computation of the involved operations; the performance of the algorithms strongly depends on the data structure used for the sparse matrices. In the paper, Authors consider the compressed sparse row (CSR) format. The purpose is to gain efficiency both in terms of the number of arithmetic operations and of memory usage.

Test cases. Non-linear Poisson equation:

-\nabla \cdot \big( \mu \nabla u(x, y, z) \big) = f(x, y, z), \quad (x, y, z) \in \Omega

Used to study specific physical problems such as the practical study of the earth's subsurface. The problems are obtained by finite element method discretization on diverse meshes, with the global domain divided into four layers according to μ. Further test cases: the 3D Laplace equation (μ = 1) and the 3D gravitational potential equation.