Technical note: A successive over-relaxation pre-conditioner to solve mixed model equations for genetic evaluation¹


Karin Meyer²

Animal Genetics and Breeding Unit³, University of New England, Armidale NSW 2351, Australia

ABSTRACT: A computationally efficient preconditioned conjugate gradient algorithm with a symmetric successive over-relaxation (SSOR) preconditioner for the iterative solution of sets of mixed model equations is described. Potential computational savings of this approach are examined for an example of single-step genomic evaluation of Australian sheep. Results show that the SSOR preconditioner can substantially reduce the number of iterates required for solutions to converge compared to simpler preconditioners, with marked reductions in overall computing time.

Keywords: genetic evaluation, preconditioned conjugate gradient algorithm, SSOR preconditioner, computational requirements

INTRODUCTION

Estimation of breeding values in genetic evaluation schemes for livestock requires solving a set of mixed model equations (MME). Often, the number of equations is too large for solutions to be obtained directly, i.e. by inversion or triangular decomposition of the coefficient matrix in the MME, and iterative methods need to be employed. Early applications tended to rely on Gauss-Seidel type solution schemes, often with over-relaxation to improve convergence rates. Over the last two decades, however, the preconditioned conjugate gradient (PCG) algorithm has become the standard method to solve large systems of MME in animal breeding applications (e.g. Strandén and Lidauer, 1999; Tsuruta et al., 2001). Convergence rates of the PCG algorithm depend on the distribution of eigenvalues of the coefficient matrix in the MME. This distribution can be improved, i.e. the spread of the eigenvalues and the condition number of the matrix can be reduced, by adequate choice of a preconditioning matrix.
Loosely speaking, the more closely this matrix resembles the coefficient matrix, the fewer iterations are required. In practice, however, this needs to be balanced against the computational requirements both to set up and store the preconditioner and to apply it in each iterate, and many schemes thus rely on simple diagonal or block-diagonal matrices (Strandén et al., 2002). This paper describes an alternative preconditioner which has the same structure as the coefficient matrix, together with an efficient implementation in a PCG algorithm. We demonstrate for an applied example that it can substantially reduce the number of iterations and overall computing time required.

¹ This work was supported by Meat and Livestock Australia under grants B.BFG.0050, B.SGN.0027 and B.SGN[...]. I am indebted to A. Swan for making the example data used available.
² Corresponding author: kmeyer@une.edu.au
³ A joint venture with the NSW Department of Primary Industries

THE SSOR PRECONDITIONER

Let

$$Ax = b \tag{1}$$

denote the system of MME to be solved, with $A$ the coefficient matrix, $x$ the vector of unknowns and $b$ the vector of right hand sides. Equivalently, solutions for $x$ can be obtained by solving

$$M^{-1}Ax = M^{-1}b \tag{2}$$

instead, with $M$ denoting the preconditioner. Typically, the matrix $M$ is chosen so that it and the product $M^{-1}r$ are easy to calculate (with $r$ denoting a vector) whilst reducing the condition number of $M^{-1}A$ (Benzi, 2002).

Saad (2003; Chapter 4) showed that an iterative solution scheme with over-relaxation is equivalent to fixed point iteration on a preconditioned system. Decompose the coefficient matrix $A$ into

$$A = L + D + L' \tag{3}$$

with $L = \{a_{ij}\}$ for $j < i$ the matrix comprising the strictly lower triangle of $A$ and $D = \mathrm{Diag}(A) = \{a_{ii}\}$ the matrix of diagonal elements. The preconditioner corresponding to a symmetric successive over-relaxation (SSOR) scheme is

$$M_{SSOR} = \frac{1}{\omega(2-\omega)}\,(D + \omega L)\,D^{-1}\,(D + \omega L)' \tag{4}$$

with $0 < \omega < 2$ (Saad, 2003). Matrix $M_{SSOR}$ has the same structure (i.e. number and position of non-zero elements) as $A$, and can be used as preconditioner with a PCG algorithm. Saad (2003) stated that the choice of $\omega$ for a preconditioner is not important and recommended a value of $\omega = 1$.

Each PCG iterate then requires evaluating the product of the inverse of $M$ and a vector, $M_{SSOR}^{-1} r = h$. Fortunately, this can be obtained without even setting up $M_{SSOR}$ explicitly, by exploiting the factorisation in (4) into lower and upper triangular matrices.
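How the factorisation in (4) can be exploited is illustrated by the following minimal Python sketch. It is not taken from the paper: the function name `ssor_apply`, the use of dense NumPy arrays and the small random test matrix are assumptions made purely for illustration. The product $h = M_{SSOR}^{-1} r$ is obtained from one forward and one backward substitution, and the result is compared with an explicitly formed $M_{SSOR}$.

```python
import numpy as np


def ssor_apply(A, r, omega=1.0):
    """Return h with M_SSOR h = r, using only the lower triangle and diagonal of A.

    Illustrative helper (hypothetical name); M_SSOR itself is never formed.
    """
    n = A.shape[0]
    d = np.diag(A).copy()

    # Forward triangular solve: (D + omega * L) t1 = r
    t1 = np.zeros(n)
    for i in range(n):
        t1[i] = (r[i] - omega * (A[i, :i] @ t1[:i])) / d[i]

    # Rescale: t2 = omega * (2 - omega) * D t1 (the scalar is 1 for omega = 1)
    t2 = omega * (2.0 - omega) * d * t1

    # Backward triangular solve: (D + omega * L)' h = t2
    h = np.zeros(n)
    for i in range(n - 1, -1, -1):
        h[i] = (t2[i] - omega * (A[i + 1:, i] @ h[i + 1:])) / d[i]
    return h


# Small symmetric positive definite example (arbitrary test data)
rng = np.random.default_rng(1)
G = rng.standard_normal((6, 6))
A = G @ G.T + 6.0 * np.eye(6)
r = rng.standard_normal(6)
omega = 1.2

# Explicit M_SSOR as in (4), formed here only to check the result
D = np.diag(np.diag(A))
L = np.tril(A, k=-1)
M = (D + omega * L) @ np.linalg.inv(D) @ (D + omega * L).T / (omega * (2.0 - omega))

h = ssor_apply(A, r, omega)
print("M h reproduces r:", np.allclose(M @ h, r))
```

In a sparse setting the two loops would run over the stored non-zero elements of each row or column only, so the cost of the sketch is roughly that of one pass over the lower triangle per solve.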
Moreover, as $D + L$ in $M_{SSOR}$ (for $\omega = 1$) is simply the lower triangle of $A$, there is no computational overhead to set up $M$ or additional memory required to store it.

We refer to the process of obtaining direct solutions for a system of equations with lower or upper triangular coefficient matrix as a triangular solve. Solving $M_{SSOR} h = r$ for $h$ then involves 3 steps:

i) Solve $(D + \omega L)\, t_1 = r$ using a forward triangular solve, which gives $t_1 = D^{-1}(D + \omega L)' h$ (apart from the scalar factor $1/[\omega(2-\omega)]$, which is 1 for $\omega = 1$),
ii) calculate $t_2 = D t_1$, and
iii) apply a second, backward triangular solve to $(D + \omega L)' h = t_2$ for $h$.

However, whilst easy to implement, this implies a substantial number of multiplications, which increases the calculations required per iterate in a standard PCG algorithm to at least twice those needed for a diagonal preconditioner, i.e. $M = D$.

Improved SSOR

Improved versions of a PCG algorithm using a SSOR preconditioner have been described by Han and Zhang (2011) and Li et al. (2013). These reduce computations per iterate substantially by recognizing that calculation of the two expensive matrix-vector products, namely i) of the coefficient matrix and a vector of directions, $Ad$, and ii) of the inverse of the preconditioning matrix and a vector, $M_{SSOR}^{-1} r$, can be replaced. Instead, only the solutions for two triangular systems of equations are required.

We adopt the procedure of Li et al. (2013), which differs from that of Han and Zhang (2011) by an initial transformation of the system of equations. For $D = \mathrm{Diag}\{a_{ii}\}$, define a transformed system of equations

$$D^{-1/2} A D^{-1/2}\, D^{1/2} x = D^{-1/2} b \tag{5}$$

or

$$A^* x^* = b^* \tag{6}$$

Decomposing $A^*$ as above gives

$$A^* = L^* + L^{*\prime} + I \tag{7}$$

and

$$M^* = \frac{\omega}{2-\omega}\left(L^* + \frac{1}{\omega} I\right)\left(L^{*\prime} + \frac{1}{\omega} I\right) = \frac{\omega}{2-\omega}\, W W' \tag{8}$$

with $W = L^* + \frac{1}{\omega} I$,

$$A^* = W + W' - \lambda I \tag{9}$$

and $\lambda = (2 - \omega)/\omega$.
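The relationships in (5) to (9) are easy to check numerically. The short Python sketch below does so for a small random symmetric positive definite matrix; the variable names (`A_star`, `L_star`, `W`, `lam`) and the test data are illustrative assumptions, not part of the original note. It also confirms one consequence of (9), namely that a quadratic form $d' A^* d$ can be computed from $W' d$ without a product involving $A^*$.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((5, 5))
A = G @ G.T + 5.0 * np.eye(5)      # arbitrary SPD test matrix
omega = 1.3

# Symmetric diagonal scaling (5): A* = D^{-1/2} A D^{-1/2} has unit diagonal
s = 1.0 / np.sqrt(np.diag(A))
A_star = A * np.outer(s, s)
L_star = np.tril(A_star, k=-1)     # strictly lower triangle of A*
I = np.eye(5)

lam = (2.0 - omega) / omega
W = L_star + I / omega             # lower triangular factor in (8)

# (9): A* = W + W' - lambda * I
split_ok = np.allclose(A_star, W + W.T - lam * I)

# (8) agrees with the general SSOR form (4) applied to A* (where D = I)
M_star = (omega / (2.0 - omega)) * W @ W.T
M_ssor = (I + omega * L_star) @ (I + omega * L_star).T / (omega * (2.0 - omega))
precond_ok = np.allclose(M_star, M_ssor)

# Consequence of (9): d' A* d = d' (2 W'd - lambda d)
d = rng.standard_normal(5)
Wt_d = W.T @ d
quad_ok = np.isclose(d @ A_star @ d, d @ (2.0 * Wt_d - lam * d))

print(split_ok, precond_ok, quad_ok)
```

The last identity is what allows the denominator of the step size in a conjugate gradient scheme to be evaluated from vectors that are already available.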

Defining auxiliary vectors $y = W^{-1} r$ and $z = W' d$, with $r$ the vector of residuals and $d$ the vector of search directions, then gives the following PCG algorithm (adapted from Li et al., 2013):

1. For a given vector of starting values for the unknowns, $x^*_0$, initialize
   $y_0 \leftarrow W^{-1}(A^* x^*_0 - b^*)$
   $z_0 \leftarrow -\lambda y_0$
   $d_0 \leftarrow W^{-T} z_0$
2. For the $k$th iterate, compute updated solutions
   $x^*_k \leftarrow x^*_{k-1} + \alpha d_{k-1}$ with $\alpha = \dfrac{\lambda\, y_{k-1}' y_{k-1}}{d_{k-1}'(2 z_{k-1} - \lambda d_{k-1})}$
3. Check for convergence. If the chosen criterion has not been met, update work vectors
   $y_k \leftarrow y_{k-1} + \alpha\,[\,d_{k-1} + W^{-1}(z_{k-1} - \lambda d_{k-1})\,]$
   $z_k \leftarrow \beta z_{k-1} - \lambda y_k$ with $\beta = \dfrac{y_k' y_k}{y_{k-1}' y_{k-1}}$
   $d_k \leftarrow W^{-T} z_k$
4. At convergence, calculate solutions on the original scale as $x = D^{-1/2} x^*$.

The major computations per iterate are the products of $W^{-1}$ or $W^{-T}$ and a vector (in step 3). As $W$ is triangular, these can again be obtained without inverting $W$, using triangular solves.

PCG algorithms in animal breeding applications commonly involve a step resetting the search direction at regular intervals (e.g. Tsuruta et al., 2001) to reduce potential problems arising from the accumulation of rounding errors. This can be achieved in the scheme above by replacing the update of work vectors (step 3) for selected iterates with the corresponding calculations from the set-up phase (step 1).

APPLICATION

We demonstrate the utility of the SSOR preconditioner for the example considered by Meyer et al. (2015). In brief, this comprised a set of 11 traits considered in genetic evaluation of Australian meat sheep (generously made available from Meat and Livestock Australia's Sheep Genetics database and the Cooperative Research Centre for Sheep Industry Innovation), with 5.28 million records on 1.77 million animals. Including parents without records, there were 1,995,755 animals, of which 10,944 were genotyped for 48,599 single nucleotide polymorphisms. Genomic relationships were computed following Yang et al. (2010). The model of analysis included contemporary groups as fixed effects, and animals' additive genetic effects and genetic groups (93 levels) as random effects for all traits. The latter were fitted explicitly, assigning proportions of membership for each animal. Genetic groups and animals' additive genetic effects were fitted assuming the same covariance matrix. In addition, dams' permanent environmental effects (653,068 levels) were fitted as random effects for 3 traits. This resulted in 24,161,124 equations in the mixed model.

Analyses were carried out using either the pedigree-based relationship matrix, with 6,584,393 elements in its inverse, $A^{-1}$, or combining pedigree and genomic information in a single-step model, with 66,455,483 non-zero elements in the inverse of the combined relationship matrix, $H^{-1}$ (half-stored). Furthermore, both the standard multivariate (MV) formulation and the equivalent model using a parameterisation to principal components (PC) (Meyer et al., 2015) were examined.

Computing environment and strategy

Mixed model equations were solved iteratively, using double precision computations in a preconditioned conjugate gradient algorithm, as implemented in the single-step module of our mixed model package WOMBAT (Meyer, 2007). Non-zero elements of the coefficient matrix (half-stored) in the MME were held in core using a combination of sparse matrix and dense storage. Dense diagonal blocks were assigned for genetic groups, considering all traits together, and for genotyped animals. For the MV parameterisation, the latter again used one large block comprising effects for all traits. Ordering equations for genotyped animals within PC, 11 separate diagonal blocks (of size equal to the number of genotyped animals) were used for the PC model, which substantially reduced the memory required.
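For orientation, steps 1 to 4 of the improved SSOR PCG scheme given earlier can be sketched in Python for a small dense system. This is a minimal illustration under stated assumptions, not the WOMBAT implementation: the function names, the dense NumPy storage, the starting values of zero and the stopping rule (norm of the change in solutions relative to the norm of the solutions) are all choices made here for the example.

```python
import numpy as np


def forward_solve(W, v):
    """Solve W t = v for lower triangular W by forward substitution."""
    t = np.zeros(len(v))
    for i in range(len(v)):
        t[i] = (v[i] - W[i, :i] @ t[:i]) / W[i, i]
    return t


def backward_solve_transposed(W, v):
    """Solve W' t = v for lower triangular W by backward substitution."""
    n = len(v)
    t = np.zeros(n)
    for i in range(n - 1, -1, -1):
        t[i] = (v[i] - W[i + 1:, i] @ t[i + 1:]) / W[i, i]
    return t


def ssor_pcg(A, b, omega=1.0, tol=1e-10, max_iter=1000):
    """Improved SSOR PCG for SPD A (illustrative sketch); returns (x, iterates)."""
    n = len(b)
    s = 1.0 / np.sqrt(np.diag(A))            # diagonal of D^{-1/2}
    A_star = A * np.outer(s, s)              # scaled system, unit diagonal
    b_star = s * b
    lam = (2.0 - omega) / omega
    W = np.tril(A_star, k=-1) + np.eye(n) / omega

    # Step 1: set-up, the only place where a product with A* is formed
    x_star = np.zeros(n)                     # assumed starting values
    y = forward_solve(W, A_star @ x_star - b_star)
    z = -lam * y
    d = backward_solve_transposed(W, z)

    n_iter = 0
    for _ in range(max_iter):
        yy = y @ y
        if yy == 0.0:                        # residual is exactly zero
            break
        # Step 2: update solutions
        alpha = lam * yy / (d @ (2.0 * z - lam * d))
        x_star = x_star + alpha * d
        n_iter += 1
        # Step 3: convergence check (assumed criterion), then work vectors
        if alpha**2 * (d @ d) <= tol**2 * (x_star @ x_star):
            break
        y = y + alpha * (d + forward_solve(W, z - lam * d))
        beta = (y @ y) / yy
        z = beta * z - lam * y
        d = backward_solve_transposed(W, z)

    # Step 4: back-transform to the original scale
    return s * x_star, n_iter


# Usage on arbitrary test data, compared with a direct solve
rng = np.random.default_rng(42)
G = rng.standard_normal((40, 40))
A = G @ G.T + 40.0 * np.eye(40)
b = rng.standard_normal(40)
x, iterates = ssor_pcg(A, b)
print(iterates, np.max(np.abs(x - np.linalg.solve(A, b))))
```

Each pass through the loop involves one forward and one backward triangular solve plus a few vector operations, which is the property the improved scheme relies on. A production version would operate on sparse, half-stored triangles rather than dense arrays.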
Sparse matrix storage for the remaining parts held diagonal elements in core and used compressed sparse row format otherwise.

Preconditioning schemes compared were a simple diagonal preconditioner (DIAG), a block-diagonal preconditioner (BLOX) and the improved SSOR scheme described above, using $\omega = 1$. Solutions were deemed to have converged when $\alpha\, d'd / x'x$ dropped below [value missing in source].

Computations were implemented using single- and multi-threaded versions of routines from the BLAS and sparse BLAS (Blackford et al., 2002) and LAPACK libraries (Anderson et al., 1999). Specifically, each PCG iterate for DIAG and BLOX required the product of the coefficient matrix and a vector, which was formed using routines DSYMV and MKL_DCSRSYMV. Multiple vector inner products (a.k.a. "dot products") needed were evaluated using function DDOT. Triangular solves for SSOR employed routines DAXPY and DAXPYI and functions DDOT and DDOTI to parallelize computations within rows or columns. Use of routines DTRSV and MKL_DCSRTRSV for the latter was not pursued, as it appeared to increase the computing time required.

For BLOX, the inverses of diagonal blocks were used as preconditioners, except for genotyped animals for MV analyses. This comprised a block for all genetic group effects, separate diagonal blocks for genotyped animals (of dimension 10,944) for each PC, and diagonal blocks of size equal to the number of traits for which they were fitted for the remaining random effects levels. For MV analyses, a Cholesky decomposition of the dense diagonal block for all genotyped animals and traits was carried out and used as a preconditioner in a triangular solve. These calculations were performed using LAPACK routines DPOTRF, DPOTRI and DPOTRS and BLAS routine DSYMV.

Computations were carried out under Linux on a shared machine with 512 GB of RAM and 28 Intel Xeon CPU E[...] cores (Intel Corporation, Santa Clara, CA), rated at 2.6 GHz, with a cache size of 35 MB. BLAS and LAPACK routines used were loaded from the Intel Math Kernel Library (Intel Corporation, Santa Clara, CA).

RESULTS

Numbers of iterates and computing time required for each of the 24 analyses are summarized in Table 1. Results clearly show the impact of the preconditioner used on the number of iterates required. Results for BLOX differ somewhat from those reported by Meyer et al. (2015), due to a slightly less stringent convergence criterion than in the earlier study and tweaks in the implementation since. Correlations between solutions for corresponding analyses were at least [value missing in source].

For the standard multivariate parameterisation, BLOX dramatically reduced the number of iterates required compared to DIAG. This was less pronounced for the PC model, suggesting that de-correlating effects in the transformation to the PC scale already achieved a considerable part of the benefits otherwise obtained by considering all traits simultaneously when using BLOX. The SSOR preconditioner further reduced the number of iterates required in all cases, on average to about a third of the corresponding number for DIAG.

Results for computing times required did not quite match the differences in numbers of iterates needed.
In the main, this reflected differences in efficiency of implementation and, for multi-threaded analyses, in scope for parallelization. While Li et al. (2013) emphasized that computations required per iterate for the improved SSOR PCG algorithm would be very similar to those for a standard (non-preconditioned) conjugate gradient scheme, times per iterate for our example were higher throughout for SSOR than for DIAG. Nevertheless, for single processor analyses, the overall execution time was reduced by 35% to 60%. Parameterising to principal components yielded comparable reductions in execution time for all preconditioners.

With 28 processors available, a different pattern emerged for multi-threaded analyses. Each iterate of DIAG and BLOX involved the product of the coefficient matrix and a vector, which was amenable to parallel processing (implemented through highly optimized library routines). In contrast, SSOR required two triangular solves which had to be carried out in order and thus offered far less opportunity for parallel execution. Similarly, times for BLOX were substantially higher than for DIAG for the MV single-step model. This was due to the size of the dense diagonal block (120,384) and the fact that using its Cholesky factor (rather than the inverse) as the preconditioner again involved a triangular solve.

DISCUSSION

A SSOR preconditioner for a PCG algorithm appears not to have been considered previously in the context of genetic evaluation. Applications for engineering problems have been described, for instance, by Chen et al. (2002), Han and Zhang (2011), Li et al. (2013) and Meng et al. (2016), with favourable reports on the convergence rates achieved. Results demonstrate that, using the improved version of Li et al. (2013), it can substantially reduce the computational requirements for iterative solution of mixed model equations. Moreover, it requires about the same memory as the diagonal preconditioner and is easier to implement than a block-diagonal scheme.

The SSOR PCG scheme described is most useful for applications where the MME can be held in core or where selected parts of the coefficient matrix can be read strategically from out-of-core memory. With large amounts of RAM available on modern hardware, this is feasible for many genetic evaluation schemes for small to moderately large populations. The SSOR preconditioner can also reduce computational requirements for maximum likelihood analyses to estimate variance components which employ Monte Carlo techniques and require multiple solutions of the MME for each iterate, e.g. Matilainen et al. (2012). The scheme is likely to be less beneficial for extremely large applications using an "iteration on data" type strategy, as multiple passes through the data would be needed in each iterate, which may quickly erode the computational savings afforded by the SSOR preconditioner per se.

However, in the era of multi-threaded and highly parallel computing, the choice of algorithm needs to consider the hardware configuration targeted.
Using all processors available, there was little advantage of SSOR over DIAG in terms of overall execution time. Competitive performance of the diagonal preconditioner for parallel computing applications has been reported elsewhere (Pini and Gambolati, 1990). As outlined above, this was due to the triangular solves required in each iterate, which limited the scope for parallel processing to operations within each row or column. Studies examining PCG algorithms for massively parallel processors have thus taken a different route, suggesting that $M_{SSOR}^{-1}$ be approximated to avoid the need for ordered, triangular solution schemes, albeit at the expense of substantial additional memory requirements (e.g. Helfenstein and Koko, 2012). Alternative proposals to boost parallel performance of triangular solves range from identification of independent levels in the triangular matrix and appropriate scheduling (Mayer, 2009) to use of iterative schemes (Anzt et al., 2015).

IMPLICATIONS

The improved preconditioned conjugate gradient algorithm with SSOR preconditioner described can substantially reduce computing times for iterative solution of mixed model equations in quantitative genetic applications. It is suitable as a drop-in replacement for existing methods for schemes setting up the mixed model equations explicitly, and is most advantageous for computing environments with moderate amounts of parallelization.

LITERATURE CITED

Anderson, E., Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen, 1999. LAPACK Users' Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA, Third edn.

Anzt, H., E. Chow, and J. Dongarra, 2015. Iterative sparse triangular solves for preconditioning. In: J. L. Träff, S. Hunold, and F. Versaci, eds., Euro-Par 2015: Parallel Processing, vol. [...] of Theoretical Computer Science and General Issues. Springer.

Benzi, M., 2002. Preconditioning techniques for large linear systems: A survey. J. Comput. Phys. 182. doi: /jcph

Blackford, L., J. Demmel, J. Dongarra, I. Duff, S. Hammarling, G. Henry, M. Heroux, L. Kaufman, A. Lumsdaine, A. Petitet, R. Pozo, K. Remington, and R. C. Whaley, 2002. An updated set of Basic Linear Algebra Subprograms (BLAS). ACM Trans. Math. Softw. 28.

Chen, R.-S., E. K.-N. Yung, C. H. Chan, D. X. Wang, and D. G. Fang, 2002. Application of the SSOR preconditioned CG algorithm to the vector FEM for 3D full-wave analysis of electromagnetic-field boundary-value problems. IEEE Trans. Microw. Theory Techn. 50.

Han, L. and Z. Zhang, 2011. Application of SSOR-PCG method with improved iteration format in FEM simulation of massive concrete. Water Sci. Engineer. 4. doi: /j.issn

Helfenstein, R. and J. Koko, 2012. Parallel preconditioned conjugate gradient algorithm on GPU. J. Comput. Appl. Math. 236. doi: /j.cam

Li, G., C. Tang, and L. Li, 2013. High-efficiency improved symmetric successive over-relaxation preconditioned conjugate gradient method for solving large-scale finite element linear equations. Appl. Math. Mech. 34. doi: /s x.

Matilainen, K., E. Mäntysaari, M. Lidauer, I. Strandén, and R. Thompson, 2012. Employing a Monte Carlo algorithm in expectation maximization restricted maximum likelihood estimation of the linear mixed model. J. Anim. Breed. Genet. 129. doi: /j x.

Mayer, J., 2009. Parallel algorithms for solving linear systems with sparse triangular matrices. Computing 86. doi: /s

Meng, Z., F. Li, X. Xu, D. Huang, and D. Zhang, 2016. Fast inversion of gravity data using the symmetric successive over-relaxation (SSOR) preconditioned conjugate gradient algorithm. Explor. Geophys. 00: Published online 16 February. doi: /EG

Meyer, K., 2007. WOMBAT - a tool for mixed model analyses in quantitative genetics by REML. J. Zhejiang Univ. SCIENCE B 8. doi: /jzus.2007.b0815.

Meyer, K., A. Swan, and B. Tier, 2015. Technical note: Genetic principal component models for multi-trait single-step genomic evaluation. J. Anim. Sci. 93. doi: /jas

Pini, G. and G. Gambolati, 1990. Is a simple diagonal scaling the best preconditioner for conjugate gradients on supercomputers? Adv. Water Resour. 13. doi: / (90)90006-P.

Saad, Y., 2003. Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2nd edn.

Strandén, I. and M. Lidauer, 1999. Solving large mixed linear models using preconditioned conjugate gradient iteration. J. Dairy Sci. 82. doi: /jds.S (99)

Strandén, I., S. Tsuruta, and I. Misztal, 2002. Simple preconditioners for the conjugate gradient method: experience with test day models. J. Anim. Breed. Genet. 119. doi: /j x.

Tsuruta, S., I. Misztal, and I. Strandén, 2001. Use of the preconditioned conjugate gradient algorithm as a generic solver for mixed-model equations in animal breeding applications. J. Anim. Sci. 79. doi: / x.

Yang, J., B. Benyamin, B. P. McEvoy, S. Gordon, A. K. Henders, D. R. Nyholt, P. A. Madden, A. C. Heath, N. G. Martin, G. W. Montgomery, M. E. Goddard, and P. M. Visscher, 2010. Common SNPs explain a large proportion of the heritability for human height. Nature Genet. 42. doi: /ng

Table 1: Characteristics of the mixed model equations and computing requirements for diagonal (DIAG), block-diagonal (BLOX) and symmetric successive over-relaxation (SSOR) preconditioning schemes for single- and multi-threaded computations

Threads  Rel.(a)  Param.(b)  NNZ(c)   No. of iterates             Elapsed time (h)
                                      DIAG    BLOX    SSOR        DIAG    BLOX    SSOR
Single   A^-1     MV            918   5,550   1,999   1,[...]     [...]   [...]   [...]
                  PC          1,377   2,784   2,109   1,[...]     [...]   [...]   [...]
         H^-1     MV          8,162   6,691   2,910   1,[...]     [...]   [...]   [...]
                  PC          2,035   3,921   2,984   1,[...]     [...]   [...]   [...]
Multi    A^-1     MV            918   5,521   1,991   1,[...]     [...]   [...]   [...]
                  PC          1,377   2,803   2,107   1,[...]     [...]   [...]   [...]
         H^-1     MV          8,162   6,681   2,889   1,[...]     [...]   [...]   [...]
                  PC          2,035   3,929   2,965   1,[...]     [...]   [...]   [...]

(a) Relationship matrix: A^-1 pedigree, H^-1 single step
(b) Parameterisation: MV standard multivariate, PC principal components
(c) No. of non-zero elements in one triangle of the coefficient matrix; in millions
