Technical note: A successive over-relaxation pre-conditioner to solve mixed model equations for genetic evaluation 1


Karin Meyer 2

Animal Genetics and Breeding Unit 3, University of New England, Armidale NSW 2351, Australia

ABSTRACT: A computationally efficient preconditioned conjugate gradient algorithm with a symmetric successive over-relaxation (SSOR) preconditioner for the iterative solution of sets of mixed model equations is described. Potential computational savings of this approach are examined for an example of single-step genomic evaluation of Australian sheep. Results show that the SSOR preconditioner can substantially reduce the number of iterates required for solutions to converge compared to simpler preconditioners, with marked reductions in overall computing time.

Keywords: genetic evaluation, preconditioned conjugate gradient algorithm, SSOR preconditioner, computational requirements

INTRODUCTION

Estimation of breeding values in genetic evaluation schemes for livestock requires solving a set of mixed model equations (MME). Often, the number of equations is too large for solutions to be obtained directly, i.e. by inversion or triangular decomposition of the coefficient matrix in the MME, and iterative methods need to be employed. Early applications tended to rely on Gauss-Seidel type solution schemes, often with over-relaxation to improve convergence rates. Over the last two decades, however, the preconditioned conjugate gradient (PCG) algorithm has become the standard method to solve large systems of MME in animal breeding applications (e.g. Strandén and Lidauer, 1999; Tsuruta et al., 2001). Convergence rates of the PCG algorithm depend on the distribution of eigenvalues of the coefficient matrix in the MME. This distribution can be improved, i.e. the spread of the eigenvalues and the condition number of the matrix can be reduced, by an adequate choice of preconditioning matrix.
Loosely speaking, the more closely this matrix resembles the coefficient matrix, the fewer iterations are required. In practice, however, this needs to be balanced against the computational requirements both to set up and store the preconditioner and to apply it for each iterate, and many schemes thus rely on simple diagonal or block-diagonal matrices (Strandén et al., 2002).

This paper describes an alternative preconditioner which has the same structure as the coefficient matrix, together with an efficient implementation in a PCG algorithm. We demonstrate for an applied example that it can substantially reduce the number of iterations and overall computing time required.

1 This work was supported by Meat and Livestock Australia under grants B.BFG.0050, B.SGN.0027 and B.SGN [number missing in source]. I am indebted to A. Swan for making the example data used available.
2 Corresponding author: kmeyer@une.edu.au
3 A joint venture with the NSW Department of Primary Industries

THE SSOR PRECONDITIONER

Let

$$Ax = b \tag{1}$$

denote the system of MME to be solved, with $A$ the coefficient matrix, $x$ the vector of unknowns and $b$ the vector of right hand sides. Equivalently, solutions for $x$ can be obtained by solving

$$M^{-1}Ax = M^{-1}b \tag{2}$$

instead, with $M$ denoting the preconditioner. Typically, the matrix $M$ is chosen so that it and the product $M^{-1}r$ are easy to calculate (with $r$ denoting a vector) whilst reducing the condition number of $M^{-1}A$ (Benzi, 2002).

Saad (2003; Chapter 4) showed that an iterative solution scheme with over-relaxation is equivalent to fixed point iteration on a preconditioned system. Decompose the coefficient matrix $A$ into

$$A = L + D + L' \tag{3}$$

with $L = \{a_{ij}\}$ for $j < i$ the matrix comprising the strictly lower triangle of $A$ and $D = \mathrm{Diag}(A) = \{a_{ii}\}$ the matrix of diagonal elements. The preconditioner corresponding to a symmetric successive over-relaxation (SSOR) scheme is

$$M_{SSOR} = \frac{1}{\omega(2-\omega)}\,(D + \omega L)\,D^{-1}\,(D + \omega L') \tag{4}$$

with $0 < \omega < 2$ (Saad, 2003). Matrix $M_{SSOR}$ has the same structure (i.e. number and position of non-zero elements) as $A$, and can be used as preconditioner with a PCG algorithm. Saad (2003) stated that the choice of $\omega$ for a preconditioner is not important and recommended a value of $\omega = 1$.

Each PCG iterate then requires evaluating the product of the inverse of $M$ and a vector, $M^{-1}_{SSOR}\,r = h$. Fortunately, this can be obtained without even setting up $M_{SSOR}$ explicitly by exploiting the factorisation in (4) into lower and upper triangular matrices.
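As an illustration of applying the SSOR preconditioner through its triangular factors, the following NumPy sketch computes $h = M^{-1}_{SSOR}\,r$ from the triangles of $A$ alone and compares the result with a solve against the explicitly formed matrix in (4). It is a minimal sketch, not the implementation used in this note: dense arrays are assumed for brevity, and the function names (`forward_solve`, `backward_solve`, `ssor_apply`, `ssor_matrix`) and the small random test matrix are hypothetical.

```python
import numpy as np

def forward_solve(T, r):
    """Solve T t = r for a lower triangular T by forward substitution."""
    n = len(r)
    t = np.zeros(n)
    for i in range(n):
        t[i] = (r[i] - T[i, :i] @ t[:i]) / T[i, i]
    return t

def backward_solve(T, r):
    """Solve T h = r for an upper triangular T by back substitution."""
    n = len(r)
    h = np.zeros(n)
    for i in range(n - 1, -1, -1):
        h[i] = (r[i] - T[i, i + 1:] @ h[i + 1:]) / T[i, i]
    return h

def ssor_apply(A, r, omega=1.0):
    """Return h with M_SSOR h = r, using only the triangles of A (eq. 4)."""
    d = np.diag(A)
    lower = np.diag(d) + omega * np.tril(A, k=-1)  # D + omega * L
    t1 = forward_solve(lower, r)                   # (D + omega L) t1 = r
    t2 = d * t1                                    # t2 = D t1
    h = backward_solve(lower.T, t2)                # (D + omega L') h = t2
    return omega * (2.0 - omega) * h               # scalar factor of (4)

def ssor_matrix(A, omega=1.0):
    """Form M_SSOR explicitly; only used here to check ssor_apply."""
    D = np.diag(np.diag(A))
    L = np.tril(A, k=-1)
    return (D + omega * L) @ np.linalg.inv(D) @ (D + omega * L.T) / (omega * (2.0 - omega))

# Small symmetric positive definite example (illustrative data only)
rng = np.random.default_rng(1)
B = rng.standard_normal((6, 6))
A = B @ B.T + 6.0 * np.eye(6)
r = rng.standard_normal(6)
h = ssor_apply(A, r, omega=1.2)
print(np.allclose(h, np.linalg.solve(ssor_matrix(A, omega=1.2), r)))
```

For $\omega = 1$ the scalar factor is 1, so the triangular factors are simply the lower and upper triangles of $A$ and nothing beyond $A$ itself needs to be stored.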
Moreover, as $D + L$ in $M_{SSOR}$ (for $\omega = 1$) is simply the lower triangle of $A$, there is no computational overhead to set up $M$ or additional memory required to store it. We refer to the process of obtaining direct solutions for a system of equations with lower or upper triangular coefficient matrix as a triangular solve. Solving $M_{SSOR}\,h = r$ for $h$ then involves 3 steps (ignoring the scalar factor $1/[\omega(2-\omega)]$, which is unity for $\omega = 1$):

i) Solve $(D + \omega L)\,t_1 = r$ using a forward triangular solve, which gives $t_1 = D^{-1}(D + \omega L')\,h$;
ii) calculate $t_2 = D\,t_1$; and
iii) apply a second, backward triangular solve to $(D + \omega L')\,h = t_2$ for $h$.

However, whilst easy to implement, this implies a substantial number of multiplications, which increases the calculations required per iterate in a standard PCG algorithm to at least twice those needed for a diagonal preconditioner, i.e. $M = D$.

Improved SSOR

Improved versions of a PCG algorithm using a SSOR preconditioner have been described by Han and Zhang (2011) and Li et al. (2013). These reduce computations per iterate substantially by recognizing that calculation of the two expensive matrix-vector products, namely i) of the coefficient matrix and a vector of directions, $Ad$, and ii) of the inverse of the preconditioning matrix and a vector, $M^{-1}_{SSOR}\,r$, can be replaced. Instead, only the solutions for two triangular systems of equations are required.

We adopt the procedure of Li et al. (2013), which differs from that of Han and Zhang (2011) by an initial transformation of the system of equations. For $D = \mathrm{Diag}\{a_{ii}\}$, define a transformed system of equations

$$D^{-1/2} A D^{-1/2}\, D^{1/2} x = D^{-1/2} b \tag{5}$$

or

$$\tilde{A}\tilde{x} = \tilde{b} \tag{6}$$

Decomposing $\tilde{A}$ as above gives

$$\tilde{A} = \tilde{L} + \tilde{L}' + I \tag{7}$$

and

$$\tilde{M} = \frac{\omega}{2-\omega}\left(\tilde{L} + \frac{1}{\omega} I\right)\left(\tilde{L}' + \frac{1}{\omega} I\right) = \frac{\omega}{2-\omega}\, W W' \tag{8}$$

with $W = \tilde{L} + (1/\omega) I$,

$$\tilde{A} = W + W' - \lambda I \tag{9}$$

and $\lambda = (2-\omega)/\omega$.
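The transformation in (5) to (9) can be combined with the standard PCG recursion so that each iterate needs only two triangular solves, one with $W$ and one with $W'$. The NumPy sketch below illustrates this under stated assumptions: dense storage, a starting vector of zeros, and a stopping rule based on the squared relative change in solutions, which is an illustrative choice and not necessarily the criterion used in this note. All function names (`tri_solve`, `scaled_system`, `ssor_pcg`) are hypothetical.

```python
import numpy as np

def tri_solve(T, r, lower=True):
    """Triangular solve by substitution; entries not yet computed are zero."""
    n = len(r)
    out = np.zeros(n)
    order = range(n) if lower else range(n - 1, -1, -1)
    for i in order:
        out[i] = (r[i] - T[i] @ out) / T[i, i]
    return out

def scaled_system(A, omega=1.0):
    """Return D^{-1/2}, the scaled matrix of (6), W and lambda of (8)-(9)."""
    s = 1.0 / np.sqrt(np.diag(A))                      # diagonal of D^{-1/2}
    At = A * np.outer(s, s)                            # unit diagonal
    W = np.tril(At, k=-1) + np.eye(A.shape[0]) / omega
    lam = (2.0 - omega) / omega
    return s, At, W, lam

def ssor_pcg(A, b, omega=1.0, tol=1e-24, max_iter=500):
    """Sketch of a PCG iteration with SSOR preconditioner on the scaled system."""
    s, At, W, lam = scaled_system(A, omega)
    bt = s * b
    x = np.zeros_like(bt)
    # Set-up: y = W^{-1}(A x - b), z = -lambda y, d = W^{-T} z
    y = tri_solve(W, At @ x - bt)
    z = -lam * y
    d = tri_solve(W.T, z, lower=False)
    for it in range(1, max_iter + 1):
        yy = y @ y
        if yy == 0.0:                                  # exact solution reached
            break
        alpha = lam * yy / (d @ (2.0 * z - lam * d))   # d'Ad = d'(2z - lambda d)
        x = x + alpha * d
        if alpha**2 * (d @ d) <= tol * (x @ x):        # assumed stopping rule
            break
        # Update work vectors with one solve in W and one in W'
        y = y + alpha * (d + tri_solve(W, z - lam * d))
        beta = (y @ y) / yy
        z = beta * z - lam * y
        d = tri_solve(W.T, z, lower=False)
    return s * x, it                                   # back to original scale

# Illustrative example with a small symmetric positive definite matrix
rng = np.random.default_rng(3)
B = rng.standard_normal((10, 10))
A = B @ B.T + 10.0 * np.eye(10)
b = rng.standard_normal(10)
x, n_iter = ssor_pcg(A, b)
print(n_iter, np.allclose(x, np.linalg.solve(A, b)))
```

The product of the coefficient matrix and the direction vector never appears inside the loop: it is absorbed into the solve with $W$ through the identity in (9).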

Defining auxiliary vectors $y = W^{-1} r$ and $z = W' d$, with $r = \tilde{A}\tilde{x} - \tilde{b}$ the vector of residuals and $d$ the vector of search directions, then gives the following PCG algorithm (adapted from Li et al., 2013):

1. For a given vector of starting values for the unknowns, $\tilde{x}_0$, initialize
   $y_0 \leftarrow W^{-1}(\tilde{A}\tilde{x}_0 - \tilde{b})$
   $z_0 \leftarrow -\lambda\, y_0$
   $d_0 \leftarrow W^{-T} z_0$
2. For the $k$th iterate, compute updated solutions
   $\tilde{x}_k \leftarrow \tilde{x}_{k-1} + \alpha\, d_{k-1}$ with $\alpha = \lambda\, y_{k-1}' y_{k-1} \,/\, [\,d_{k-1}'(2 z_{k-1} - \lambda d_{k-1})\,]$
3. Check for convergence. If the chosen criterion has not been met, update work vectors
   $y_k \leftarrow y_{k-1} + \alpha\,[\,d_{k-1} + W^{-1}(z_{k-1} - \lambda d_{k-1})\,]$
   $z_k \leftarrow \beta\, z_{k-1} - \lambda\, y_k$ with $\beta = y_k' y_k \,/\, y_{k-1}' y_{k-1}$
   $d_k \leftarrow W^{-T} z_k$
4. At convergence, calculate solutions on the original scale as $x = D^{-1/2}\tilde{x}$.

The major computations per iterate are the products of $W^{-1}$ or $W^{-T}$ and a vector (in step 3). As $W$ is triangular, these can again be obtained without inverting $W$ using triangular solves.

PCG algorithms in animal breeding applications commonly involve a step resetting the search direction at regular intervals (e.g. Tsuruta et al., 2001) to reduce potential problems arising from the accumulation of rounding errors. This can be achieved in the scheme above by replacing the update of work vectors (step 3) for selected iterates with the corresponding calculations from the set-up phase (step 1).

APPLICATION

We demonstrate the utility of the SSOR preconditioner for the example considered by Meyer et al. (2015). In brief, this comprised a set of 11 traits considered in genetic evaluation of Australian meat sheep (generously made available from Meat and Livestock Australia's Sheep Genetics database and the Cooperative Research Centre for Sheep Industry Innovation), with 5.28 million records on 1.77 million animals. Including parents without records, there were 1,995,755 animals, of which 10,944 were genotyped for 48,599 single nucleotide polymorphisms. Genomic relationships were computed following Yang et al. (2010).

The model of analysis included contemporary groups as fixed effects, and animals' additive genetic effects and genetic groups (93 levels) as random effects for all traits. The latter were fitted explicitly, assigning proportions of membership for each animal. Genetic groups and animals' additive genetic effects were fitted assuming the same covariance matrix. In addition, dams' permanent environmental effects (653,068 levels) were fitted as random effects for 3 traits. This resulted in 24,161,124 equations in the mixed model.

Analyses were carried out using either the pedigree-based relationship matrix, with 6,584,393 elements in its inverse, $A^{-1}$, or combining pedigree and genomic information in a single-step model, with 66,455,483 non-zero elements in the inverse of the combined relationship matrix, $H^{-1}$ (half-stored). Furthermore, both the standard multivariate (MV) formulation and the equivalent model using a parameterisation to principal components (PC) (Meyer et al., 2015) were examined.

Computing environment and strategy

Mixed model equations were solved iteratively, using double precision computations in a preconditioned conjugate gradient algorithm, as implemented in the single-step module of our mixed model package WOMBAT (Meyer, 2007). Non-zero elements of the coefficient matrix (half-stored) in the MME were held in core using a combination of sparse matrix and dense storage. Dense diagonal blocks were assigned for genetic groups, considering all traits together, and for genotyped animals. For the MV parameterisation, the latter again used one large block for all traits. Ordering equations for genotyped animals within PC, 11 separate diagonal blocks (of size equal to the number of genotyped animals) were used for the PC model, which substantially reduced the memory required.
Sparse matrix storage for the remaining parts held diagonal elements in core and used compressed sparse row format otherwise.

Preconditioning schemes compared were a simple diagonal preconditioner (DIAG), a block-diagonal preconditioner (BLOX) and the improved SSOR scheme described above, using $\omega = 1$. Solutions were deemed to have converged when $\alpha\, d'd / x'x$ dropped below [value missing in source].

Computations were implemented using single- and multi-threaded versions of routines from the BLAS and sparse BLAS (Blackford et al., 2002) and LAPACK libraries (Anderson et al., 1999). Specifically, each PCG iterate for DIAG and BLOX required the product of the coefficient matrix and a vector, which was formed using routines DSYMV and MKL_DCSRSYMV. Multiple vector inner products (a.k.a. dot products) needed were evaluated using function DDOT. Triangular solves for SSOR employed routines DAXPY and DAXPYI and functions DDOT and DDOTI to parallelize computations within rows or columns. Use of routines DTRSV and MKL_DCSRTRSV for the latter was not pursued, as it appeared to increase the computing time required.

For BLOX, the inverses of diagonal blocks were used as preconditioners, except for genotyped animals for MV analyses. This comprised a block for all genetic group effects, separate diagonal blocks for genotyped animals (of dimension 10,944) for each PC, and diagonal blocks of size equal to the number of traits for which they were fitted for the remaining random effects levels. For MV analyses, a Cholesky decomposition of the dense diagonal block for all genotyped animals and traits was carried out and used as a preconditioner in a triangular solve. These calculations were performed using LAPACK routines DPOTRF, DPOTRI and DPOTRS and BLAS routine DSYMV.

Computations were carried out under Linux on a shared machine with 512 GB of RAM and 28 Intel Xeon CPU E [model number missing in source] cores (Intel Corporation, Santa Clara, CA), rated at 2.6 GHz, with a cache size of 35 MB. BLAS and LAPACK routines used were loaded from the Intel Math Kernel Library (Intel Corporation, Santa Clara, CA).

RESULTS

Numbers of iterates and computing time required for each of the 24 analyses are summarized in Table 1. Results clearly show the impact of the preconditioner used on the number of iterates required. Results for BLOX differ somewhat from those reported by Meyer et al. (2015) due to a slightly less stringent convergence criterion used compared to the earlier study and tweaks in the implementation since. Correlations between solutions for corresponding analyses were at least [value missing in source].

For the standard multivariate parameterization, BLOX dramatically reduced the number of iterates required compared to DIAG. This was less pronounced for the PC model, suggesting that de-correlating effects in the transformation to PC scale already achieved a considerable part of the benefits otherwise obtained by considering all traits simultaneously when using BLOX. The SSOR preconditioner further reduced the number of iterates required in all cases, on average to about a third of the corresponding number for DIAG.

Results for computing times required did not quite match the differences in numbers of iterates needed.
In the main, this reflected differences in efficiency of implementation and, for multi-threaded analyses, in scope for parallelization. While Li et al. (2013) emphasized that computations required per iterate for the improved SSOR PCG algorithm would be very similar to a standard (non-preconditioned) conjugate gradient scheme, times per iterate for our example were higher throughout for SSOR than for DIAG. Nevertheless, for single processor analyses, the overall execution time was reduced by 35% to 60%. Parameterising to principal components yielded comparable reductions in execution time for all preconditioners.

With 28 processors available, a different pattern emerged for multi-threaded analyses. Each iterate of DIAG and BLOX involved the product of the coefficient matrix and a vector, which was amenable to parallel processing (implemented through highly optimized library routines). In contrast, SSOR required two triangular solves which were carried out in order and thus offered far less opportunity for parallel execution. Similarly, times for BLOX were substantially higher than for DIAG for the MV single-step model. This was due to the size of the dense diagonal block (120,384) and the fact that using its Cholesky factor (rather than the inverse) as the preconditioner again involved a triangular solve.

DISCUSSION

A SSOR preconditioner for a PCG algorithm appears not to have been considered previously in the context of genetic evaluation. Applications for engineering problems have been described, for instance, by Chen et al. (2002), Han and Zhang (2011), Li et al. (2013) and Meng et al. (2016), with favourable reports on convergence rates achieved. Results demonstrate that, using the improved version of Li et al. (2013), it can substantially reduce the computational requirements for iterative solution of mixed model equations. Moreover, it requires about the same memory as the diagonal preconditioner and is easier to implement than a block-diagonal scheme.

The SSOR PCG scheme described is most useful for applications where the MME can be held in core or where selected parts of the coefficient matrix can be read strategically from out-of-core memory. With large amounts of RAM available on modern hardware, this is feasible for many genetic evaluation schemes for small to moderately large populations. The SSOR preconditioner can also reduce computational requirements for maximum likelihood analyses to estimate variance components which employ Monte Carlo techniques and require multiple solutions of the MME for each iterate, e.g. Matilainen et al. (2012). The scheme is likely to be less beneficial for extremely large applications using an iteration-on-data type strategy, as multiple passes through the data would be needed in each iterate, which may quickly erode the computational savings afforded by the SSOR preconditioner per se.

However, in the era of multi-threaded and highly parallel computing, the choice of algorithm needs to consider the hardware constellation targeted.
Using all processors available, there was little advantage of SSOR over DIAG in terms of overall execution time. Competitive performance of the diagonal preconditioner for parallel computing applications has been reported elsewhere (Pini and Gambolati, 1990). As outlined above, this was due to the triangular solves required in each iterate, which limited the scope for parallel processing to operations within each row or column. Studies examining PCG algorithms for massively parallel processors have thus taken a different route, proposing to approximate $M^{-1}_{SSOR}$ to avoid the need for ordered, triangular solution schemes, albeit at the expense of substantial additional memory requirements (e.g. Helfenstein and Koko, 2012). Alternative proposals to boost parallel performance of triangular solves range from identification of independent levels in the triangular matrix and appropriate scheduling (Mayer, 2009) to use of iterative schemes (Anzt et al., 2015).
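To make the idea of independent levels in a triangular matrix concrete, the following NumPy sketch assigns each row of a lower triangular matrix to a level such that all rows within one level depend only on rows in earlier levels and could, in principle, be solved concurrently. It is a generic illustration of level scheduling under the assumption of dense storage, not the algorithm of Mayer (2009) nor anything implemented for this note; the function names `solve_levels` and `level_scheduled_solve` are hypothetical.

```python
import numpy as np

def solve_levels(L):
    """Assign each row of a lower triangular matrix to a dependency level."""
    n = L.shape[0]
    level = np.zeros(n, dtype=int)
    for i in range(n):
        deps = np.nonzero(L[i, :i])[0]       # rows that row i depends on
        if deps.size:
            level[i] = level[deps].max() + 1
    return level

def level_scheduled_solve(L, r):
    """Forward solve of L t = r, processing one level of rows at a time."""
    level = solve_levels(L)
    diag = np.diag(L)
    t = np.zeros(len(r))
    for lev in range(level.max() + 1):
        rows = np.nonzero(level == lev)[0]
        # Rows within a level are mutually independent, so this update is
        # the part that could be distributed across threads.
        t[rows] = (r[rows] - L[rows] @ t) / diag[rows]
    return t

# Example: only the last row has off-diagonal elements, so two levels suffice
L = np.diag([2.0, 3.0, 4.0, 5.0])
L[3, :3] = [1.0, 1.0, 1.0]
print(solve_levels(L))
r = np.array([2.0, 3.0, 4.0, 8.0])
print(np.allclose(level_scheduled_solve(L, r), np.linalg.solve(L, r)))
```

The number of levels bounds the achievable parallelism: a bidiagonal matrix yields as many levels as rows and hence no gain, whereas a matrix with few long dependency chains has few levels.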

IMPLICATIONS

The improved preconditioned conjugate gradient algorithm with SSOR preconditioner described can substantially reduce computing times for iterative solution of mixed model equations in quantitative genetic applications. It is suitable as a drop-in replacement for existing methods for schemes setting up the mixed model equations explicitly and most advantageous for computing environments with moderate amounts of parallelization.

LITERATURE CITED

Anderson, E., Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen, 1999. LAPACK Users' Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA, Third edn. ISBN

Anzt, H., E. Chow, and J. Dongarra, 2015. Iterative sparse triangular solves for preconditioning. In: J. L. Träff, S. Hunold, and F. Versaci, eds., Euro-Par 2015: Parallel Processing, vol of Theoretical Computer Science and General Issues. Springer. ISBN ,

Benzi, M., 2002. Preconditioning techniques for large linear systems: A survey. J. Comput. Phys. 182: doi: /jcph

Blackford, L., J. Demmel, J. Dongarra, I. Duff, S. Hammarling, G. Henry, M. Heroux, L. Kaufman, A. Lumsdaine, A. Petitet, R. Pozo, K. Remington, and R. C. Whaley, 2002. An updated set of Basic Linear Algebra Subprograms (BLAS). ACM Trans. Math. Softw. 28: doi: /

Chen, R.-S., E. K.-N. Yung, C. H. Chan, D. X. Wang, and D. G. Fang, 2002. Application of the SSOR preconditioned CG algorithm to the vector FEM for 3D full-wave analysis of electromagnetic-field boundary-value problems. IEEE Trans. Microw. Theory Techn. 50: doi: /

Han, L. and Z. Zhang, 2011. Application of SSOR-PCG method with improved iteration format in FEM simulation of massive concrete. Water Sci. Engineer. 4: doi: /j.issn

Helfenstein, R. and J. Koko, 2012. Parallel preconditioned conjugate gradient algorithm on GPU. J. Comput. Appl. Math. 236: doi: /j.cam

Li, G., C. Tang, and L. Li, 2013. High-efficiency improved symmetric successive over-relaxation preconditioned conjugate gradient method for solving large-scale finite element linear equations. Appl. Math. Mech. 34: doi: /s x.

Matilainen, K., E. Mäntysaari, M. Lidauer, I. Strandén, and R. Thompson, 2012. Employing a Monte Carlo algorithm in expectation maximization restricted maximum likelihood estimation of the linear mixed model. J. Anim. Breed. Genet. 129: doi: /j x.

Mayer, J., 2009. Parallel algorithms for solving linear systems with sparse triangular matrices. Computing 86: doi: /s

Meng, Z., F. Li, X. Xu, D. Huang, and D. Zhang, 2016. Fast inversion of gravity data using the symmetric successive over-relaxation (SSOR) preconditioned conjugate gradient algorithm. Explor. Geophys. 00:Published online 16 February doi: /EG

Meyer, K., 2007. WOMBAT, a tool for mixed model analyses in quantitative genetics by REML. J. Zhejiang Univ. SCIENCE B 8: doi: /jzus.2007.b0815.

Meyer, K., A. Swan, and B. Tier, 2015. Technical note: Genetic principal component models for multi-trait single-step genomic evaluation. J. Anim. Sci. 93: doi: /jas

Pini, G. and G. Gambolati, 1990. Is a simple diagonal scaling the best preconditioner for conjugate gradients on supercomputers? Adv. Water Resour. 13: doi: / (90)90006-P.

Saad, Y., 2003. Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2nd edn. ISBN

Strandén, I. and M. Lidauer, 1999. Solving large mixed linear models using preconditioned conjugate gradient iteration. J. Dairy Sci. 82: doi: /jds.S (99)

Strandén, I., S. Tsuruta, and I. Misztal, 2002. Simple preconditioners for the conjugate gradient method: experience with test day models. J. Anim. Breed. Genet. 119: doi: /j x.

Tsuruta, S., I. Misztal, and I. Strandén, 2001. Use of the preconditioned conjugate gradient algorithm as a generic solver for mixed-model equations in animal breeding applications. J. Anim. Sci. 79: doi: / x.

Yang, J., B. Benyamin, B. P. McEvoy, S. Gordon, A. K. Henders, D. R. Nyholt, P. A. Madden, A. C. Heath, N. G. Martin, G. W. Montgomery, M. E. Goddard, and P. M. Visscher, 2010. Common SNPs explain a large proportion of the heritability for human height. Nature Genet. 42: doi: /ng

Table 1: Characteristics of the mixed model equations and computing requirements for diagonal (DIAG), block-diagonal (BLOX) and symmetric successive over-relaxation (SSOR) preconditioning schemes for single- and multi-threaded computations

                                        No. of iterates          Elapsed time (h)
Threads  Rel. a  Param. b  NNZ c     DIAG    BLOX    SSOR      DIAG    BLOX    SSOR
Single   A^-1    MV          918    5,550   1,999   1,[...]   [...]   [...]   [...]
Single   A^-1    PC        1,377    2,784   2,109   1,[...]   [...]   [...]   [...]
Single   H^-1    MV        8,162    6,691   2,910   1,[...]   [...]   [...]   [...]
Single   H^-1    PC        2,035    3,921   2,984   1,[...]   [...]   [...]   [...]
Multi    A^-1    MV          918    5,521   1,991   1,[...]   [...]   [...]   [...]
Multi    A^-1    PC        1,377    2,803   2,107   1,[...]   [...]   [...]   [...]
Multi    H^-1    MV        8,162    6,681   2,889   1,[...]   [...]   [...]   [...]
Multi    H^-1    PC        2,035    3,929   2,965   1,[...]   [...]   [...]   [...]

a Relationship matrix: A^-1 pedigree, H^-1 single step
b Parameterization: MV standard multivariate, PC principal components
c No. of non-zero elements in one triangle of coefficient matrix; in million
[...] Value truncated or missing in the source.


High Performance Dense Linear Algebra in Intel Math Kernel Library (Intel MKL) High Performance Dense Linear Algebra in Intel Math Kernel Library (Intel MKL) Michael Chuvelev, Intel Corporation, michael.chuvelev@intel.com Sergey Kazakov, Intel Corporation sergey.kazakov@intel.com

More information

BLAS: Basic Linear Algebra Subroutines I

BLAS: Basic Linear Algebra Subroutines I BLAS: Basic Linear Algebra Subroutines I Most numerical programs do similar operations 90% time is at 10% of the code If these 10% of the code is optimized, programs will be fast Frequently used subroutines

More information

BLAS: Basic Linear Algebra Subroutines I

BLAS: Basic Linear Algebra Subroutines I BLAS: Basic Linear Algebra Subroutines I Most numerical programs do similar operations 90% time is at 10% of the code If these 10% of the code is optimized, programs will be fast Frequently used subroutines

More information

Lecture 15: More Iterative Ideas

Lecture 15: More Iterative Ideas Lecture 15: More Iterative Ideas David Bindel 15 Mar 2010 Logistics HW 2 due! Some notes on HW 2. Where we are / where we re going More iterative ideas. Intro to HW 3. More HW 2 notes See solution code!

More information

SELECTIVE ALGEBRAIC MULTIGRID IN FOAM-EXTEND

SELECTIVE ALGEBRAIC MULTIGRID IN FOAM-EXTEND Student Submission for the 5 th OpenFOAM User Conference 2017, Wiesbaden - Germany: SELECTIVE ALGEBRAIC MULTIGRID IN FOAM-EXTEND TESSA UROIĆ Faculty of Mechanical Engineering and Naval Architecture, Ivana

More information

Accelerating the Hessian-free Gauss-Newton Full-waveform Inversion via Preconditioned Conjugate Gradient Method

Accelerating the Hessian-free Gauss-Newton Full-waveform Inversion via Preconditioned Conjugate Gradient Method Accelerating the Hessian-free Gauss-Newton Full-waveform Inversion via Preconditioned Conjugate Gradient Method Wenyong Pan 1, Kris Innanen 1 and Wenyuan Liao 2 1. CREWES Project, Department of Geoscience,

More information

Application of LSQR to Calibration of a MODFLOW Model: A Synthetic Study

Application of LSQR to Calibration of a MODFLOW Model: A Synthetic Study Application of LSQR to Calibration of a MODFLOW Model: A Synthetic Study Chris Muffels 1,2, Matthew Tonkin 2,3, Haijiang Zhang 1, Mary Anderson 1, Tom Clemo 4 1 University of Wisconsin-Madison, muffels@geology.wisc.edu,

More information

Sparse LU Factorization for Parallel Circuit Simulation on GPUs

Sparse LU Factorization for Parallel Circuit Simulation on GPUs Department of Electronic Engineering, Tsinghua University Sparse LU Factorization for Parallel Circuit Simulation on GPUs Ling Ren, Xiaoming Chen, Yu Wang, Chenxi Zhang, Huazhong Yang Nano-scale Integrated

More information

Implementation of a Primal-Dual Method for. SDP on a Shared Memory Parallel Architecture

Implementation of a Primal-Dual Method for. SDP on a Shared Memory Parallel Architecture Implementation of a Primal-Dual Method for SDP on a Shared Memory Parallel Architecture Brian Borchers Joseph G. Young March 27, 2006 Abstract Primal dual interior point methods and the HKM method in particular

More information

Iterative Sparse Triangular Solves for Preconditioning

Iterative Sparse Triangular Solves for Preconditioning Euro-Par 2015, Vienna Aug 24-28, 2015 Iterative Sparse Triangular Solves for Preconditioning Hartwig Anzt, Edmond Chow and Jack Dongarra Incomplete Factorization Preconditioning Incomplete LU factorizations

More information

Numerical Methods to Solve 2-D and 3-D Elliptic Partial Differential Equations Using Matlab on the Cluster maya

Numerical Methods to Solve 2-D and 3-D Elliptic Partial Differential Equations Using Matlab on the Cluster maya Numerical Methods to Solve 2-D and 3-D Elliptic Partial Differential Equations Using Matlab on the Cluster maya David Stonko, Samuel Khuvis, and Matthias K. Gobbert (gobbert@umbc.edu) Department of Mathematics

More information

(Sparse) Linear Solvers

(Sparse) Linear Solvers (Sparse) Linear Solvers Ax = B Why? Many geometry processing applications boil down to: solve one or more linear systems Parameterization Editing Reconstruction Fairing Morphing 2 Don t you just invert

More information

Speedup Altair RADIOSS Solvers Using NVIDIA GPU

Speedup Altair RADIOSS Solvers Using NVIDIA GPU Innovation Intelligence Speedup Altair RADIOSS Solvers Using NVIDIA GPU Eric LEQUINIOU, HPC Director Hongwei Zhou, Senior Software Developer May 16, 2012 Innovation Intelligence ALTAIR OVERVIEW Altair

More information

Porting the NAS-NPB Conjugate Gradient Benchmark to CUDA. NVIDIA Corporation

Porting the NAS-NPB Conjugate Gradient Benchmark to CUDA. NVIDIA Corporation Porting the NAS-NPB Conjugate Gradient Benchmark to CUDA NVIDIA Corporation Outline! Overview of CG benchmark! Overview of CUDA Libraries! CUSPARSE! CUBLAS! Porting Sequence! Algorithm Analysis! Data/Code

More information

Matrix-free IPM with GPU acceleration

Matrix-free IPM with GPU acceleration Matrix-free IPM with GPU acceleration Julian Hall, Edmund Smith and Jacek Gondzio School of Mathematics University of Edinburgh jajhall@ed.ac.uk 29th June 2011 Linear programming theory Primal-dual pair

More information

Chapter 14: Matrix Iterative Methods

Chapter 14: Matrix Iterative Methods Chapter 14: Matrix Iterative Methods 14.1INTRODUCTION AND OBJECTIVES This chapter discusses how to solve linear systems of equations using iterative methods and it may be skipped on a first reading of

More information

Blocked Schur Algorithms for Computing the Matrix Square Root

Blocked Schur Algorithms for Computing the Matrix Square Root Blocked Schur Algorithms for Computing the Matrix Square Root Edvin Deadman 1, Nicholas J. Higham 2,andRuiRalha 3 1 Numerical Algorithms Group edvin.deadman@nag.co.uk 2 University of Manchester higham@maths.manchester.ac.uk

More information

Iterative Sparse Triangular Solves for Preconditioning

Iterative Sparse Triangular Solves for Preconditioning Iterative Sparse Triangular Solves for Preconditioning Hartwig Anzt 1(B), Edmond Chow 2, and Jack Dongarra 1 1 University of Tennessee, Knoxville, TN, USA hanzt@icl.utk.edu, dongarra@eecs.utk.edu 2 Georgia

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms Chapter 4 Sparse Linear Systems Section 4.3 Iterative Methods Michael T. Heath and Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign

More information

Intel Math Kernel Library (Intel MKL) BLAS. Victor Kostin Intel MKL Dense Solvers team manager

Intel Math Kernel Library (Intel MKL) BLAS. Victor Kostin Intel MKL Dense Solvers team manager Intel Math Kernel Library (Intel MKL) BLAS Victor Kostin Intel MKL Dense Solvers team manager Intel MKL BLAS/Sparse BLAS Original ( dense ) BLAS available from www.netlib.org Additionally Intel MKL provides

More information

Sparse Matrices. Mathematics In Science And Engineering Volume 99 READ ONLINE

Sparse Matrices. Mathematics In Science And Engineering Volume 99 READ ONLINE Sparse Matrices. Mathematics In Science And Engineering Volume 99 READ ONLINE If you are looking for a ebook Sparse Matrices. Mathematics in Science and Engineering Volume 99 in pdf form, in that case

More information

S0432 NEW IDEAS FOR MASSIVELY PARALLEL PRECONDITIONERS

S0432 NEW IDEAS FOR MASSIVELY PARALLEL PRECONDITIONERS S0432 NEW IDEAS FOR MASSIVELY PARALLEL PRECONDITIONERS John R Appleyard Jeremy D Appleyard Polyhedron Software with acknowledgements to Mark A Wakefield Garf Bowen Schlumberger Outline of Talk Reservoir

More information

Module 5.5: nag sym bnd lin sys Symmetric Banded Systems of Linear Equations. Contents

Module 5.5: nag sym bnd lin sys Symmetric Banded Systems of Linear Equations. Contents Module Contents Module 5.5: nag sym bnd lin sys Symmetric Banded Systems of nag sym bnd lin sys provides a procedure for solving real symmetric or complex Hermitian banded systems of linear equations with

More information

A priori power estimation of linear solvers on multi-core processors

A priori power estimation of linear solvers on multi-core processors A priori power estimation of linear solvers on multi-core processors Dimitar Lukarski 1, Tobias Skoglund 2 Uppsala University Department of Information Technology Division of Scientific Computing 1 Division

More information

Inclusion of Aleatory and Epistemic Uncertainty in Design Optimization

Inclusion of Aleatory and Epistemic Uncertainty in Design Optimization 10 th World Congress on Structural and Multidisciplinary Optimization May 19-24, 2013, Orlando, Florida, USA Inclusion of Aleatory and Epistemic Uncertainty in Design Optimization Sirisha Rangavajhala

More information

Storage Formats for Sparse Matrices in Java

Storage Formats for Sparse Matrices in Java Storage Formats for Sparse Matrices in Java Mikel Luján, Anila Usman, Patrick Hardie, T.L. Freeman, and John R. Gurd Centre for Novel Computing, The University of Manchester, Oxford Road, Manchester M13

More information

Preconditioning for linear least-squares problems

Preconditioning for linear least-squares problems Preconditioning for linear least-squares problems Miroslav Tůma Institute of Computer Science Academy of Sciences of the Czech Republic tuma@cs.cas.cz joint work with Rafael Bru, José Marín and José Mas

More information

2nd Introduction to the Matrix package

2nd Introduction to the Matrix package 2nd Introduction to the Matrix package Martin Maechler and Douglas Bates R Core Development Team maechler@stat.math.ethz.ch, bates@r-project.org September 2006 (typeset on October 7, 2007) Abstract Linear

More information

Outline. Parallel Algorithms for Linear Algebra. Number of Processors and Problem Size. Speedup and Efficiency

Outline. Parallel Algorithms for Linear Algebra. Number of Processors and Problem Size. Speedup and Efficiency 1 2 Parallel Algorithms for Linear Algebra Richard P. Brent Computer Sciences Laboratory Australian National University Outline Basic concepts Parallel architectures Practical design issues Programming

More information

On Massively Parallel Algorithms to Track One Path of a Polynomial Homotopy

On Massively Parallel Algorithms to Track One Path of a Polynomial Homotopy On Massively Parallel Algorithms to Track One Path of a Polynomial Homotopy Jan Verschelde joint with Genady Yoffe and Xiangcheng Yu University of Illinois at Chicago Department of Mathematics, Statistics,

More information

Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation

Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation 1 Cheng-Han Du* I-Hsin Chung** Weichung Wang* * I n s t i t u t e o f A p p l i e d M

More information

Sparse Matrix Libraries in C++ for High Performance. Architectures. ferent sparse matrix data formats in order to best

Sparse Matrix Libraries in C++ for High Performance. Architectures. ferent sparse matrix data formats in order to best Sparse Matrix Libraries in C++ for High Performance Architectures Jack Dongarra xz, Andrew Lumsdaine, Xinhui Niu Roldan Pozo z, Karin Remington x x Oak Ridge National Laboratory z University oftennessee

More information

Intel Math Kernel Library (Intel MKL) Sparse Solvers. Alexander Kalinkin Intel MKL developer, Victor Kostin Intel MKL Dense Solvers team manager

Intel Math Kernel Library (Intel MKL) Sparse Solvers. Alexander Kalinkin Intel MKL developer, Victor Kostin Intel MKL Dense Solvers team manager Intel Math Kernel Library (Intel MKL) Sparse Solvers Alexander Kalinkin Intel MKL developer, Victor Kostin Intel MKL Dense Solvers team manager Copyright 3, Intel Corporation. All rights reserved. Sparse

More information

Implicit Low-Order Unstructured Finite-Element Multiple Simulation Enhanced by Dense Computation using OpenACC

Implicit Low-Order Unstructured Finite-Element Multiple Simulation Enhanced by Dense Computation using OpenACC Fourth Workshop on Accelerator Programming Using Directives (WACCPD), Nov. 13, 2017 Implicit Low-Order Unstructured Finite-Element Multiple Simulation Enhanced by Dense Computation using OpenACC Takuma

More information

VIII/2015 TECHNICAL REFERENCE GUIDE FOR

VIII/2015 TECHNICAL REFERENCE GUIDE FOR MiX99 Solving Large Mixed Model Equations Release VIII/2015 TECHNICAL REFERENCE GUIDE FOR MiX99 PRE-PROCESSOR Copyright 2015 Last update: Aug 2015 Preface Development of MiX99 was initiated to allow more

More information

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about

More information

Numerically Stable Real-Number Codes Based on Random Matrices

Numerically Stable Real-Number Codes Based on Random Matrices Numerically Stable eal-number Codes Based on andom Matrices Zizhong Chen Innovative Computing Laboratory Computer Science Department University of Tennessee zchen@csutkedu Abstract Error correction codes

More information

NAG Fortran Library Routine Document F04CAF.1

NAG Fortran Library Routine Document F04CAF.1 F04 Simultaneous Linear Equations NAG Fortran Library Routine Document Note: before using this routine, please read the Users Note for your implementation to check the interpretation of bold italicised

More information

A class of communication-avoiding algorithms for solving general dense linear systems on CPU/GPU parallel machines

A class of communication-avoiding algorithms for solving general dense linear systems on CPU/GPU parallel machines Available online at www.sciencedirect.com Procedia Computer Science 9 (2012 ) 17 26 International Conference on Computational Science, ICCS 2012 A class of communication-avoiding algorithms for solving

More information

Intel Math Kernel Library 10.3

Intel Math Kernel Library 10.3 Intel Math Kernel Library 10.3 Product Brief Intel Math Kernel Library 10.3 The Flagship High Performance Computing Math Library for Windows*, Linux*, and Mac OS* X Intel Math Kernel Library (Intel MKL)

More information

The Basic Linear Algebra Subprograms (BLAS) are an interface to commonly used fundamental linear algebra operations.

The Basic Linear Algebra Subprograms (BLAS) are an interface to commonly used fundamental linear algebra operations. TITLE Basic Linear Algebra Subprograms BYLINE Robert A. van de Geijn Department of Computer Science The University of Texas at Austin Austin, TX USA rvdg@cs.utexas.edu Kazushige Goto Texas Advanced Computing

More information

Contents. F10: Parallel Sparse Matrix Computations. Parallel algorithms for sparse systems Ax = b. Discretized domain a metal sheet

Contents. F10: Parallel Sparse Matrix Computations. Parallel algorithms for sparse systems Ax = b. Discretized domain a metal sheet Contents 2 F10: Parallel Sparse Matrix Computations Figures mainly from Kumar et. al. Introduction to Parallel Computing, 1st ed Chap. 11 Bo Kågström et al (RG, EE, MR) 2011-05-10 Sparse matrices and storage

More information

Developing a High Performance Software Library with MPI and CUDA for Matrix Computations

Developing a High Performance Software Library with MPI and CUDA for Matrix Computations Developing a High Performance Software Library with MPI and CUDA for Matrix Computations Bogdan Oancea 1, Tudorel Andrei 2 1 Nicolae Titulescu University of Bucharest, e-mail: bogdanoancea@univnt.ro, Calea

More information

2 Fundamentals of Serial Linear Algebra

2 Fundamentals of Serial Linear Algebra . Direct Solution of Linear Systems.. Gaussian Elimination.. LU Decomposition and FBS..3 Cholesky Decomposition..4 Multifrontal Methods. Iterative Solution of Linear Systems.. Jacobi Method Fundamentals

More information

SUMMARY. solve the matrix system using iterative solvers. We use the MUMPS codes and distribute the computation over many different

SUMMARY. solve the matrix system using iterative solvers. We use the MUMPS codes and distribute the computation over many different Forward Modelling and Inversion of Multi-Source TEM Data D. W. Oldenburg 1, E. Haber 2, and R. Shekhtman 1 1 University of British Columbia, Department of Earth & Ocean Sciences 2 Emory University, Atlanta,

More information

Asreml-R: an R package for mixed models using residual maximum likelihood

Asreml-R: an R package for mixed models using residual maximum likelihood Asreml-R: an R package for mixed models using residual maximum likelihood David Butler 1 Brian Cullis 2 Arthur Gilmour 3 1 Queensland Department of Primary Industries Toowoomba 2 NSW Department of Primary

More information

NAG Library Chapter Introduction. F16 Further Linear Algebra Support Routines

NAG Library Chapter Introduction. F16 Further Linear Algebra Support Routines NAG Library Chapter Introduction Contents 1 Scope of the Chapter.... 2 2 Background to the Problems... 2 3 Recommendations on Choice and Use of Available Routines... 2 3.1 Naming Scheme... 2 3.1.1 NAGnames...

More information

Very fast simulation of nonlinear water waves in very large numerical wave tanks on affordable graphics cards

Very fast simulation of nonlinear water waves in very large numerical wave tanks on affordable graphics cards Very fast simulation of nonlinear water waves in very large numerical wave tanks on affordable graphics cards By Allan P. Engsig-Karup, Morten Gorm Madsen and Stefan L. Glimberg DTU Informatics Workshop

More information

Iterative Algorithms I: Elementary Iterative Methods and the Conjugate Gradient Algorithms

Iterative Algorithms I: Elementary Iterative Methods and the Conjugate Gradient Algorithms Iterative Algorithms I: Elementary Iterative Methods and the Conjugate Gradient Algorithms By:- Nitin Kamra Indian Institute of Technology, Delhi Advisor:- Prof. Ulrich Reude 1. Introduction to Linear

More information

Batched Factorization and Inversion Routines for Block-Jacobi Preconditioning on GPUs

Batched Factorization and Inversion Routines for Block-Jacobi Preconditioning on GPUs Workshop on Batched, Reproducible, and Reduced Precision BLAS Atlanta, GA 02/25/2017 Batched Factorization and Inversion Routines for Block-Jacobi Preconditioning on GPUs Hartwig Anzt Joint work with Goran

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra)

AMS526: Numerical Analysis I (Numerical Linear Algebra) AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 5: Sparse Linear Systems and Factorization Methods Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical Analysis I 1 / 18 Sparse

More information

Approaches to Parallel Implementation of the BDDC Method

Approaches to Parallel Implementation of the BDDC Method Approaches to Parallel Implementation of the BDDC Method Jakub Šístek Includes joint work with P. Burda, M. Čertíková, J. Mandel, J. Novotný, B. Sousedík. Institute of Mathematics of the AS CR, Prague

More information

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization

More information

Analysis of the GCR method with mixed precision arithmetic using QuPAT

Analysis of the GCR method with mixed precision arithmetic using QuPAT Analysis of the GCR method with mixed precision arithmetic using QuPAT Tsubasa Saito a,, Emiko Ishiwata b, Hidehiko Hasegawa c a Graduate School of Science, Tokyo University of Science, 1-3 Kagurazaka,

More information

Performance Evaluation of a New Parallel Preconditioner

Performance Evaluation of a New Parallel Preconditioner Performance Evaluation of a New Parallel Preconditioner Keith D. Gremban Gary L. Miller October 994 CMU-CS-94-25 Marco Zagha School of Computer Science Carnegie Mellon University Pittsburgh, PA 523 This

More information

Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors

Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors Andrés Tomás 1, Zhaojun Bai 1, and Vicente Hernández 2 1 Department of Computer

More information

CHAO YANG. Early Experience on Optimizations of Application Codes on the Sunway TaihuLight Supercomputer

CHAO YANG. Early Experience on Optimizations of Application Codes on the Sunway TaihuLight Supercomputer CHAO YANG Dr. Chao Yang is a full professor at the Laboratory of Parallel Software and Computational Sciences, Institute of Software, Chinese Academy Sciences. His research interests include numerical

More information

Preconditioning Linear Systems Arising from Graph Laplacians of Complex Networks

Preconditioning Linear Systems Arising from Graph Laplacians of Complex Networks Preconditioning Linear Systems Arising from Graph Laplacians of Complex Networks Kevin Deweese 1 Erik Boman 2 1 Department of Computer Science University of California, Santa Barbara 2 Scalable Algorithms

More information

Using BLUPF90 UGA

Using BLUPF90 UGA Using BLUPF90 UGA 05-2018 BLUPF90 family programs All programs are controled by the SAME paramenter file. Extra options could be used to set non-default behaviour of each program Understanding parameter

More information

Q. Wang National Key Laboratory of Antenna and Microwave Technology Xidian University No. 2 South Taiba Road, Xi an, Shaanxi , P. R.

Q. Wang National Key Laboratory of Antenna and Microwave Technology Xidian University No. 2 South Taiba Road, Xi an, Shaanxi , P. R. Progress In Electromagnetics Research Letters, Vol. 9, 29 38, 2009 AN IMPROVED ALGORITHM FOR MATRIX BANDWIDTH AND PROFILE REDUCTION IN FINITE ELEMENT ANALYSIS Q. Wang National Key Laboratory of Antenna

More information

GS3. Andrés Legarra. March 5, Genomic Selection Gibbs Sampling Gauss Seidel

GS3. Andrés Legarra. March 5, Genomic Selection Gibbs Sampling Gauss Seidel GS3 Genomic Selection Gibbs Sampling Gauss Seidel Andrés Legarra March 5, 2008 andres.legarra [at] toulouse.inra.fr INRA, UR 631, F-31326 Auzeville, France 1 Contents 1 Introduction 3 1.1 History...............................

More information

OpenFOAM + GPGPU. İbrahim Özküçük

OpenFOAM + GPGPU. İbrahim Özküçük OpenFOAM + GPGPU İbrahim Özküçük Outline GPGPU vs CPU GPGPU plugins for OpenFOAM Overview of Discretization CUDA for FOAM Link (cufflink) Cusp & Thrust Libraries How Cufflink Works Performance data of

More information

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters 1 On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters N. P. Karunadasa & D. N. Ranasinghe University of Colombo School of Computing, Sri Lanka nishantha@opensource.lk, dnr@ucsc.cmb.ac.lk

More information

Analysis and Optimization of Power Consumption in the Iterative Solution of Sparse Linear Systems on Multi-core and Many-core Platforms

Analysis and Optimization of Power Consumption in the Iterative Solution of Sparse Linear Systems on Multi-core and Many-core Platforms Analysis and Optimization of Power Consumption in the Iterative Solution of Sparse Linear Systems on Multi-core and Many-core Platforms H. Anzt, V. Heuveline Karlsruhe Institute of Technology, Germany

More information

COMPARATIVE ANALYSIS OF POWER METHOD AND GAUSS-SEIDEL METHOD IN PAGERANK COMPUTATION

COMPARATIVE ANALYSIS OF POWER METHOD AND GAUSS-SEIDEL METHOD IN PAGERANK COMPUTATION International Journal of Computer Engineering and Applications, Volume IX, Issue VIII, Sep. 15 www.ijcea.com ISSN 2321-3469 COMPARATIVE ANALYSIS OF POWER METHOD AND GAUSS-SEIDEL METHOD IN PAGERANK COMPUTATION

More information

Step-by-Step Guide to Advanced Genetic Analysis

Step-by-Step Guide to Advanced Genetic Analysis Step-by-Step Guide to Advanced Genetic Analysis Page 1 Introduction In the previous document, 1 we covered the standard genetic analyses available in JMP Genomics. Here, we cover the more advanced options

More information

Introduction to Optimization

Introduction to Optimization Introduction to Optimization Second Order Optimization Methods Marc Toussaint U Stuttgart Planned Outline Gradient-based optimization (1st order methods) plain grad., steepest descent, conjugate grad.,

More information

High-Performance Implementation of the Level-3 BLAS

High-Performance Implementation of the Level-3 BLAS High-Performance Implementation of the Level- BLAS KAZUSHIGE GOTO The University of Texas at Austin and ROBERT VAN DE GEIJN The University of Texas at Austin A simple but highly effective approach for

More information

Performance of Implicit Solver Strategies on GPUs

Performance of Implicit Solver Strategies on GPUs 9. LS-DYNA Forum, Bamberg 2010 IT / Performance Performance of Implicit Solver Strategies on GPUs Prof. Dr. Uli Göhner DYNAmore GmbH Stuttgart, Germany Abstract: The increasing power of GPUs can be used

More information