A Recursive Blocked Schur Algorithm for Computing the Matrix Square Root. Deadman, Edvin and Higham, Nicholas J. and Ralha, Rui. MIMS EPrint: 2012.

Size: px

Start display at page:

Download "A Recursive Blocked Schur Algorithm for Computing the Matrix Square Root. Deadman, Edvin and Higham, Nicholas J. and Ralha, Rui. MIMS EPrint: 2012."

Gwenda Hubbard
5 years ago
Views:

A Recursive Blocked Schur Algorithm for Computig the Matrix Square Root Deadma, Edvi ad Higham, Nicholas J. ad Ralha, Rui 212 MIMS EPrit: 212.

1 A Recursive Blocked Schur Algorithm for Computig the Matrix Square Root Deadma, Edvi ad Higham, Nicholas J. ad Ralha, Rui 212 MIMS EPrit: Machester Istitute for Mathematical Scieces School of Mathematics The Uiversity of Machester Reports available from: Ad by cotactig: The MIMS Secretary School of Mathematics The Uiversity of Machester Machester, M13 9PL, UK ISSN

2 A Recursive Blocked Schur Algorithm for Computig the Matrix Square Root Edvi Deadma Uiversity of Machester Numerical Algorithms Group Nicholas J. Higham Uiversity of Machester Rui Ralha Uiversity of Miho, Portugal Abstract The Schur method for computig a matrix square root reduces the matrix to the Schur triagular form ad the computes a square root of the triagular matrix. We show that by usig a recursive blockig techique the computatio of the square root of the triagular matrix ca be made rich i matrix multiplicatio. Numerical experimets makig appropriate use of level 3 BLAS show sigificat speedups over the poit algorithm, both i the square root phase ad i the algorithm as a whole. The excellet umerical stability of the poit algorithm is show to be preserved by recursive blockig. These results are exteded to the real Schur method. Recursive blockig is also show to be effective for multiplyig triagular matrices. I. INTRODUCTION A square root of matrix A C is ay matrix satisfyig X 2 A. Matrix square roots have may applicatios, icludig i Markov models of fiace, the solutio of differetial equatios ad the computatio of the polar decompositio ad the matrix sig fuctio [9]. A square root of a matrix (if oe exists) is ot uique. However, if A has o eigevalues o the closed egative real lie the there is a uique pricipal square root A 1/2 whose eigevalues all lie i the ope right half-plae. If A is real, the so is A 1/2. For proofs of these facts ad more o the theory of matrix square roots see [9]. The most umerically stable way of computig matrix square roots is via the Schur method of Björck ad Hammarlig [3]. The matrix A is reduced to upper triagular form ad a recurrece relatio eables the square root of the triagular matrix to be computed a colum or superdiagoal at a time. I II we show that the recurrece ca be reorgaized usig a stadard blockig scheme or recursive blockig i order to make it rich i matrix multiplicatios. We show experimetally that sigificat speedups result whe level 3 BLAS are exploited i the implemetatio. I III we show that the blocked methods maitai the excellet backward stability of the o-blocked method. I IV we discuss the use of the ew approach withi the Schur method ad explai how it ca be exteded to the real Schur method of Higham [7]. We compare our implemetatios writte for the NAG Library with existig MATLAB fuctios. Fially, i V we discuss some further applicatios of recursive blockig for triagular matrices. II. THE USE OF BLOCKING IN THE SCHUR METHOD To compute A 1/2, a Schur decompositio A QT Q is obtaied, where T is upper triagular ad Q is uitary. The A 1/2 QT 1/2 Q. For the remaider of this sectio we will focus o upper triagular matrices oly. The equatio U 2 T (1) ca be solved by otig that U is also upper triagular, so that by equatig elemets, U ii U ij + U ij U jj T ij U 2 ii T ii, (2) j 1 ki+1 U ik U kj. (3) These equatios ca be solved either a colum or a superdiagoal at a time. Differet choices of sig i the scalar square roots of (2) lead to differet matrix square roots. This method will be referred to hereafter as the poit method. The algorithm ca be blocked by lettig the U ij ad T ij i (2) ad (3) refer to m m blocks, where m. The diagoal blocks U ii are the obtaied usig the poit method ad the off-diagoal blocks are obtaied by solvig the Sylvester equatios (3) usig LAPACK routie xtrsyl (where x deotes D or Z accordig to whether real or complex arithmetic is used) [2]. Level 3 BLAS ca be used i computig the righthad side of (3) so sigificat improvemets i efficiecy are expected. This approach is referred to as the block method. To test this approach, a Fortra implemetatio was writte ad compiled with gfortra o a 64 bit Itel i3 dual-core machie, usig the ACML Library for LAPACK ad BLAS calls. Complex upper triagular matrices were geerated, with radom elemets whose real ad imagiary parts were chose from the uiform distributio o [, 1). Figure 1 shows the ru times for the methods, for various differet sized matrices. A block size of 32 was chose, although the speed did ot appear to be particularly sesitive to the block size similar results were obtaied with blocks of size 16, 64 ad 128. The block method was foud to be up to 17 times faster tha the poit method. The residuals Û 2 T / T, where Û is the computed value of U, were similar for both methods. Table II shows that, for 496, over 9% of the ru time is spet i ZGEMM calls.

3 6 TABLE I Profilig of the block method for computig the square root of a triagular matrix, with 496. Format: time i secods (umber of calls). 5 4 Total time take: 3.74 Calls to loopig method:.9 (128) Calls to ZTRSYL 2.4 (8128) Calls to ZGEMM with 32: (341376) 3 poit block recursio TABLE II Profilig of the recursive method for computig the square root of a triagular matrix, with 496. Format: time i secods (umber of calls) Fig. 1. Ru times for the poit, block ad recursio methods for computig the square root of a complex triagular matrix. Total time take: 27.4 Calls to loopig method:.6 (128) Calls to ZTRSYL 2.22 (8128) Calls to ZGEMM total: 22.6 (1668) Calls to ZGEMM with (4) Calls to ZGEMM with (24) Calls to ZGEMM with (112) Calls to ZGEMM with (48) Calls to ZGEMM with (1984) Calls to ZGEMM with (864) A larger block size eables larger GEMM calls to be made. However, it leads to larger calls to the poit algorithm ad to xtrsyl (which oly uses level 2 BLAS). A recursive approach may allow icreased use of level 3 BLAS. Equatio (1) ca be rewritte as ( ) 2 ( ) U11 U 12 T11 T 12, (4) U 22 T 22 where the submatrices are of size /2 or (±1)/2 depedig o the parity of. The U11 2 T 11 ad U22 2 T 22 ca be solved recursively, util some base level is reached, at which poit the poit algorithm is used. The Sylvester equatio U 11 U 12 + U 12 U 22 T 12 ca the be solved usig a recursive algorithm devised by Josso ad Kågström [11]. I this algorithm, the Sylvester equatio AX + XB C, with A ad B triagular, is writte as ( ) ( ) A11 A 12 X11 X 12 + A 22 X 21 X 22 ( ) ( ) X11 X 12 B11 B 12 X 21 X 22 B 22 ( ) C11 C 12, C 21 C 22 where each submatrix is of size /2 or ( ± 1)/2. The A 11 X 11 + X 11 B 11 C 11 A 12 X 21, (5) A 11 X 12 + X 12 B 22 C 12 A 12 X 22 X 11 B 12, (6) A 22 X 21 + X 21 B 11 C 21, (7) A 22 X 22 + X 22 B 22 C 22 X 21 B 12. (8) Equatio (7) is solved recursively, followed by (5) ad (8), ad fially (6). At the base level a routie such as xtrsyl is used. The ru times for a Fortra implemetatio of the recursio method i complex arithmetic, with a base level of size 32, are show i Figure 1. The approach was foud to be faster tha the block method, ad up to 19 times faster tha the poit method, with similar residuals i each case. The precise choice of base level made little differece to the ru time. Table II shows that the ru time is domiated by GEMM calls ad that the time spet i ZTRSYL ad the poit algorithm is similar to the block method. The largest GEMM call uses a submatrix of size /4. III. STABILITY OF THE BLOCKED ALGORITHMS We use the stadard model of floatig poit arithmetic [8, 2.2] i which the result of a floatig poit operatio, op, o two scalars x ad y is writte as fl(x op y) (x op y)(1 + δ), δ u, where u is the uit roudoff. I aalyzig a sequece of floatig poit operatios it is useful to write [8, 3.4] (1 + δ i ) ρi 1 + θ, ρ i ±1, where i1 θ u 1 u : γ. It is also coveiet to defie γ γ c for some small iteger c whose precise value is uimportat. We use a hat deote a computed quatity ad write A for the matrix whose elemets are the absolute values of the elemets of A. Björck ad Hammarlig [3] obtaied a ormwise backward error boud for the Schur method. The computed square root X of the full matrix A satisfies X 2 A + A, where A F γ 3 X 2 F. (9) Higham [9, 6.2] obtaied a compoetwise boud for the triagular phase of the algorithm. The computed square root Û of the triagular matrix T satisfies Û 2 T + T, where T γ Û 2. (1)

4 This boud implies (9). We ow ivestigate whether the boud (1) still holds whe the triagular phase of the algorithm is blocked. Cosider the Sylvester equatio AX + XB C i matrices with triagular A ad B. Whe it is solved i the stadard way by the solutio of triagular systems the residual of the computed ˆX satisfies [8, 16.1] C (A X + XB) γ ( A X + X B ). (11) I the (o-recursive) block method, to boud T ij we must accout for the error i performig the matrix multiplicatios o the right-had side of (3). Stadard error aalysis for matrix multiplicatio yields, for blocks of size m, ( j 1 ) fl j 1 Û ik Û kj Û ik Û kj γ Û 2 ij. ki+1 ki+1 Substitutig this ito the residual for the Sylvester equatio i the off-diagoal blocks, we obtai the compoetwise boud (1). To obtai a boud for the recursive blocked method we must first check if (11) holds whe the Sylvester equatio is solved usig Josso ad Kågström s recursive algorithm. This ca be doe by iductio, assumig that (11) holds at the base level. For the iductive step, if suffices to icorporate the error estimates for the matrix multiplicatios i the right had sides of (5) (8) ito the residual boud. Iductio ca the be applied to the recursive blocked method for the square root. The bouds (1) ad (11) are assumed to hold at the base level. The iductive step is similar to the aalysis for the block method. Overall, (1) is obtaied. We coclude that both our blocked algorithms for computig the matrix square root satisfy backward error bouds of the same forms (9) ad (1) as the poit algorithm. IV. IMPLEMENTATION Whe used with full (o-triagular) matrices, more modest speedups are expected because of the sigificat overhead i computig the Schur decompositio. Figure 2 compares ru times of the MATLAB fuctio sqrtm (which does ot use ay blockig) with Fortra implemetatios of the recursive blocked method (fort_recurse) ad the poit algorithm (fort_poit), called from withi MATLAB usig a mex iterface. The recursive routie is foud to be up to 2.5 times faster tha sqrtm ad 2 times faster tha fort_poit. A extesio of the Schur method due to Higham [7] eables the square root of a real matrix to be computed without usig complex arithmetic. A real Schur decompositio of A is computed. Square roots of the 2 2 diagoal blocks of the upper quasi-triagular factor are computed usig a explicit formula. The recursio (3) ow proceeds either a block colum or a block superdiagoal at a time, where the blocks are of size 1 1, 1 2, 2 1 or 2 2 depedig o the diagoal block structure. A MATLAB implemetatio of this algorithm sqrtm_real is available i the Matrix Fuctio Toolbox [6]. The algorithm ca also be implemeted i a recursive sqrtm fort_poit fort_recurse Fig. 2. Ru times for sqrtm, fort_recurse ad fort_poit for computig the square root of a full matrix with elemets whose real ad imagiary parts are chose from the uiform radom distributio o the iterval [, 1). maer, the oly subtlety beig that the splittig poit for the recursio must be chose to avoid splittig ay 2 2 diagoal blocks. A similar error aalysis to III applies to the real recursive method, though sice oly a ormwise boud is available for the poit algorithm applied to the quasi-triagular matrix the backward error boud (1) holds i the Frobeius orm rather tha elemetwise. Figure 3 compares the ru times of sqrtm ad sqrtm_real with Fortra implemetatios of the real recursive method (fort_recurse_real) ad the real poit method (fort_poit_real), also called from withi MAT- LAB. The recursive routie is foud to be up to 6 times faster tha sqrtm ad sqrtm_real ad 2 times faster tha fort_poit_real. Both the real ad complex recursive blocked routies sped over 9% of their ru time i computig the Schur decompositio, compared with 25% for sqrtm, 44% for fort_poit, 16% for sqrtm_real ad 46% for fort_poit_real. The latter four percetages reflect the overhead of the MAT- LAB iterpreter i executig the recurreces for the (quasi-) triagular square root phase. V. FURTHER APPLICATIONS OF RECURSIVE BLOCKING We briefly metio two further applicatios of recursive blockig schemes. Curretly there are o LAPACK or BLAS routies desiged specifically for multiplyig two triagular matrices, T U V (the closest is the BLAS routie xtrmm which multiplies a triagular matrix by a full matrix). However, a block algorithm is easily derived by partitioig the matrices ito blocks. The product of two off-diagoal blocks is computed usig xgemm. The product of a off-diagoal block ad a diagoal block is computed usig xtrmm. Fially the poit method is used whe multiplyig two diagoal blocks.

5 sqrtm_real sqrtm fort_poit_real fort_recurse_real The iverse of a triagular matrix ca be computed recursively, by expadig UU 1 I as ( ) ( ) ( ) U11 U 12 (U 1 ) 11 (U 1 ) 12 I U 22 (U 1. ) 22 I The (Û 1 ) 11 ad (Û 1 ) 22 are computed recursively ad (Û 1 ) 12 is obtaied by solvig U 11 (Û 1 ) 12 + U 12 (Û 1 ) 22. Provided that forward substitutio is used, the right (or left) recursive iversio method ca be show iductively to satisfy the same right (or left) elemetwise residual boud as the poit method [4]. A Fortra implemetatio of this idea was foud to perform similarly to LAPACK code xtrtri, so o real beefit was derived from recursive blockig Fig. 3. Ru times for sqrtm, sqrtm_real, fort_recurse_real ad fort_poit_real for computig the square root of a full matrix with elemets chose from the uiform radom distributio o [, 1) poit block recursio Fig. 4. Ru times for the poit, block ad recursio methods for multiplyig radomly geerated triagular matrices. I the recursive approach, T UV is rewritte as ( ) ( ) ( ) T11 T 12 U11 U 12 V11 V 12. T 22 U 22 V 22 The T 11 U 11 V 11 ad T 22 U 22 V 22 are computed recursively ad T 12 U 11 V 12 + U 12 V 22 is computed usig two calls to xtrmm. Figure 4 shows ru times for some triagular matrix multiplicatios usig Fortra implemetatios of the poit method, stadard blockig ad recursive blockig (the block size ad base levels were both 32 i this case, although the results were ot too sesitive to the precise choice of these parameters). As for the matrix square root, the block algorithms sigificatly outperform the poit algorithm, with the recursive approach performig the best. VI. CONCLUSIONS We ivestigated two differet blockig techiques withi Björck ad Hammarlig s recurrece for computig a square root of a triagular matrix, fidig that recursive blockig gives the best performace. Neither approach etails ay loss of backward stability. We implemeted the recursive blockig with both the Schur method ad the real Schur method (which works etirely i real arithmetic) ad foud the ew codes to be sigificatly faster tha correspodig poit codes, which iclude the MATLAB fuctios sqrtm (built-i) ad sqrtm_real (from [6]). The ew codes will appear i the ext mark of the NAG Library [12]. Recursive blockig is also fruitful for multiplyig triagular matrices. Because of the importace of the (quasi-) triagular square root, which arises i algorithms for computig the matrix logarithm [1], matrix pth roots [5], ad arbitrary matrix powers [1], this computatioal kerel is a strog coteder for iclusio i ay future extesios of the BLAS. REFERENCES [1] A. H. Al-Mohy ad N. J. Higham. Improved iverse scalig ad squarig algorithms for the matrix logarithm. MIMS EPrit , The Uiversity of Machester, Oct [2] E. Aderso, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dogarra, J. Du Croz, A. Greebaum, S. Hammarlig, A. McKeey, ad D. Sorese. LAPACK Users Guide. Society for Idustrial ad Applied Mathematics, Philadelphia, PA, third editio, [3] Å. Björck ad S. Hammarlig. A Schur method for the square root of a matrix. Liear Algebra Appl., 52/53:127 14, [4] J. J. Du Croz ad N. J. Higham. Stability methods for matrix iversio. IMA J. Numer. Aal., 12(1):1 19, [5] C.-H. Guo ad N. J. Higham. A Schur Newto method for the matrix pth root ad its iverse. SIAM J. Matrix Aal. Appl., 28(3):788 84, 6. [6] N. J. Higham. The Matrix Fuctio Toolbox. ac.uk/ higham/mftoolbox. [7] N. J. Higham. Computig real square roots of a real matrix. Liear Algebra Appl., 88/89:45 43, [8] N. J. Higham. Accuracy ad Stability of Numerical Algorithms. SIAM, 2d editio, 2. [9] N. J. Higham. Fuctios of Matrices: Theory ad Computatio. SIAM, 8. [1] N. J. Higham ad L. Li. A Schur Padé algorithm for fractioal powers of a matrix. SIAM J. Matrix Aal. Appl., 32(3): , 211. [11] I. Josso ad B. Kågström. Recursive blocked algorithms for solvig triagular systems - part I: Oe-sided ad coupled Sylvester-type matrix equatios. ACM Tras. Math. Software, 28(4): , 2. [12] Numerical Algorithms Group. The NAG Fortra Library.

Blocked Schur Algorithms for Computing the Matrix Square Root. Deadman, Edvin and Higham, Nicholas J. and Ralha, Rui. MIMS EPrint: 2012.

Blocked Schur Algorithms for Computing the Matrix Square Root Deadman, Edvin and Higham, Nicholas J. and Ralha, Rui 2013 MIMS EPrint: 2012.26 Manchester Institute for Mathematical Sciences School of Mathematics