A One-Sided Jacobi Algorithm for the Symmetric Eigenvalue Problem

B. B. Zhou, R. P. Brent
E-mail: bing,rpb@cslab.anu.edu.au
Computer Sciences Laboratory
The Australian National University
Canberra, ACT 0200, Australia

M. Kahn
E-mail: Margaret.Kahn@anu.edu.au
Supercomputer Facility
The Australian National University
Canberra, ACT 0200, Australia

Copyright (c) 1994, the authors. Appeared in Proc. Third Parallel Computing Workshop, Kawasaki, Japan, November 1994.

Abstract

A method which uses one-sided Jacobi to solve the symmetric eigenvalue problem in parallel is presented. We describe a parallel ring ordering for one-sided Jacobi computation. One distinctive feature of this ordering is that it can sort column norms in each sweep, which is very important to achieve fast convergence. Experimental results on both the Fujitsu AP1000 and the Fujitsu VPP500 are reported.

1 Introduction

Jacobi methods for the symmetric eigenvalue problem have recently attracted interest because they are readily parallelisable and are more accurate than QR-based methods for solving the same problem [6]. There are two basic types of Jacobi method: one-sided Jacobi and two-sided Jacobi. The traditional two-sided Jacobi method for the symmetric eigenvalue problem works by performing a sequence of orthogonal similarity updates $A \leftarrow Q^T A Q$ with the property that each new $A$, although full, is "more diagonal" than its predecessor. Eventually, the off-diagonal entries are small enough to be ignored. Because both column and row updates are required, this method suffers from extensive communication of small amounts of data between processors in parallel computation, and from nonunit strides in vector operations. One-sided Jacobi, though originally applied to the singular value decomposition (SVD), can also be adapted for the symmetric eigenvalue problem. This method requires only column updates, so it needs less communication and is more suitable for vector pipeline computing.
Thus one-sided Jacobi is preferable to two-sided Jacobi in pipeline/parallel computation.

In a parallel implementation of the one-sided Jacobi SVD, a key problem is how to choose a reasonable, systematic order of rotations in each sweep of the computation so that a fast convergence rate is achieved. In this paper we describe a parallel ring Jacobi ordering. One distinctive feature of this ordering is that it can sort column norms, which is very important for fast convergence. The experimental results show that the algorithm adopting this ordering can achieve the same efficiency (in terms of the total number of sweeps) as the cyclic Jacobi algorithm in sequential computation.

The paper is organised as follows. The sequential one-sided Jacobi algorithm for computing the SVD is outlined in §2. Our parallel ring Jacobi ordering is introduced in §3 and the experimental results are presented in §4. The method for adapting one-sided Jacobi to the symmetric eigenvalue decomposition is described in §5. Some conclusions are given in §6.

2 Sequential One-sided Jacobi

For a matrix $A$ of order $m \times n$ ($m \ge n$) the one-sided Jacobi method produces an orthogonal matrix $V$ such that $AV = S$, where the columns of $S$ are orthogonal to within a given tolerance. The non-zero columns of $S$ can be normalised to give

\[ S = (U_r \mid 0) \begin{pmatrix} \Sigma_r & 0 \\ 0 & 0 \end{pmatrix}, \]

where $r \le n$ is the rank of $A$, and $\Sigma_r = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$. Thus $A = U_r \Sigma_r V_r^T$, where $V_r$ is an $n \times r$ matrix consisting of the first $r$ columns of $V$. This is the singular value decomposition of $A$.

The matrix $V$ can be generated as a product of plane rotations. Consider the transformation by a plane rotation:

\[ (a_i \;\; a_j) \begin{pmatrix} c & s \\ -s & c \end{pmatrix} = (a_i' \;\; a_j'), \]

where $c = \cos\theta$, $s = \sin\theta$, and $a_i$ and $a_j$ are the $i$-th and $j$-th columns of the matrix $A$. We choose $\theta$ to make $a_i'$ and $a_j'$ orthogonal.

As in the traditional Jacobi algorithm, the rotations are performed in a fixed sequence called a sweep, each sweep consisting of $n(n-1)/2$ rotations, and every column in the matrix is orthogonalised with every other column exactly once per sweep. The iterative procedure terminates if one complete sweep occurs in which all columns are orthogonal to working accuracy and no columns are interchanged. If the rotations in a sweep are chosen in a reasonable, systematic order, the convergence rate is ultimately quadratic [9, ]. Exceptional cases in which cycling occurs are easily avoided by the use of a threshold strategy [].

There are two important implementation details which determine the speed of convergence of the one-sided Jacobi method for computing the SVD. The first is the method of ordering, i.e., how to order the $n(n-1)/2$ rotations in one sweep of computation. Various orderings have been introduced in the literature. In sequential computation, the most commonly used is the cyclic Jacobi ordering (cyclic ordering by rows or by columns) [9, ]. When discussing sequential Jacobi algorithms in this paper, we assume that the cyclic ordering by rows is applied. The second important detail is the method for generating the plane rotation parameters $c$ and $s$ in each iteration. For the one-sided Jacobi method there are three main rotation algorithms, which we now describe.

Rotation Algorithm 1. This algorithm is derived from the standard two-sided Jacobi method for the eigenvalue decomposition of the matrix $B = A^T A$.
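As a rough illustration of where this derivation leads, the sketch below computes $c$ and $s$ from the three Gram entries of a column pair and applies the rotation. It assumes the textbook symmetric 2×2 Jacobi formulas and the update convention $a_i' = c\,a_i - s\,a_j$, $a_j' = s\,a_i + c\,a_j$. The function names (`rotation_params`, `orthogonalise_pair`, `dot`) are illustrative and do not come from the paper.

```python
import math


def dot(x, y):
    """Inner product of two equal-length lists of floats."""
    return sum(p * q for p, q in zip(x, y))


def rotation_params(b_ii, b_ij, b_jj):
    """Return (c, s) that annihilate b_ij in the 2x2 Gram matrix.

    Assumed formulas (standard two-sided Jacobi on [[b_ii, b_ij], [b_ij, b_jj]]):
    zeta = (b_jj - b_ii) / (2 b_ij), and t = tan(theta) is the root of
    t^2 + 2*zeta*t - 1 = 0 that is smaller in magnitude.
    """
    if b_ij == 0.0:
        return 1.0, 0.0  # columns already orthogonal: identity rotation
    zeta = (b_jj - b_ii) / (2.0 * b_ij)
    t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
    c = 1.0 / math.sqrt(1.0 + t * t)
    return c, c * t


def orthogonalise_pair(a_i, a_j):
    """Rotate two columns (given as lists) so that they become orthogonal."""
    c, s = rotation_params(dot(a_i, a_i), dot(a_i, a_j), dot(a_j, a_j))
    new_i = [c * p - s * q for p, q in zip(a_i, a_j)]
    new_j = [s * p + c * q for p, q in zip(a_i, a_j)]
    return new_i, new_j


if __name__ == "__main__":
    x, y = orthogonalise_pair([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
    print("inner product after rotation:", dot(x, y))
    print("squared norms after rotation:", dot(x, x), dot(y, y))
```

Choosing the smaller root for $t$ keeps the rotation angle at most $\pi/4$ in magnitude, which is the usual choice in Jacobi methods. With this choice the larger of the two squared norms grows and the smaller one shrinks.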
Suppose that after $k$ sweeps we have the updated matrix

\[ A^{(k)} = \bigl[\, a_1^{(k)} \;\; a_2^{(k)} \;\; \cdots \;\; a_n^{(k)} \,\bigr]. \]

To annihilate the off-diagonal element $b_{ij}^{(k)}$ of $B^{(k)} = (A^{(k)})^T A^{(k)}$ in the $(k+1)$th sweep, we first need to compute $b_{ii}^{(k)}$, $b_{ij}^{(k)}$ and $b_{jj}^{(k)}$, that is,

\[ b_{ii}^{(k)} = (a_i^{(k)})^T a_i^{(k)} = \|a_i^{(k)}\|^2, \qquad b_{ij}^{(k)} = (a_i^{(k)})^T a_j^{(k)}, \qquad b_{jj}^{(k)} = (a_j^{(k)})^T a_j^{(k)} = \|a_j^{(k)}\|^2, \]

where $\|x\|$ is the 2-norm of the vector $x$. The plane rotation factors $c$ and $s$, which are used to orthogonalise the corresponding two columns, are then generated based on the two-sided Jacobi method. It can be proved that the value of $b_{ii}^{(k)}$ is increased and the value of $b_{jj}^{(k)}$ is decreased after a plane rotation operation if $b_{ii}^{(k)} > b_{jj}^{(k)}$. Otherwise, $b_{ii}^{(k)}$ is decreased and $b_{jj}^{(k)}$ is increased.

Rotation Algorithm 2. The second algorithm, introduced by Hestenes [13], is the same as Algorithm 1 except that the columns $a_i^{(k)}$ and $a_j^{(k)}$ are swapped if $\|a_i^{(k)}\| < \|a_j^{(k)}\|$ for $i < j$ before the orthogonalisation of the two columns. Therefore, we always have $b_{ii}^{(k+1)} \ge b_{jj}^{(k+1)}$. When the cyclic ordering by rows is applied, the computed singular values will be sorted in nonincreasing order.

Rotation Algorithm 3. The third algorithm was derived by Nash [19] and implemented on the ILLIAC IV by Luk [16]. To determine the rotation parameters $c$ and $s$ for orthogonalising two columns $i$ and $j$, one extra condition has to be satisfied in this algorithm, that is,

\[ \|a_i^{(k+1)}\|^2 - \|a_i^{(k)}\|^2 = \|a_j^{(k)}\|^2 - \|a_j^{(k+1)}\|^2 \ge 0. \]
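A short worked step may help in reading this condition. Because a plane rotation is orthogonal, it preserves the sum of the squared norms of the two columns it acts on. The equality in the condition therefore holds for every choice of $c$ and $s$, and only the inequality constrains the rotation. In the sketch below, $Q$ denotes the 2×2 rotation matrix and $\|\cdot\|_F$ the Frobenius norm. Both symbols are introduced here for illustration and are not notation from the paper.

```latex
\begin{align*}
\|a_i^{(k+1)}\|^2 + \|a_j^{(k+1)}\|^2
  &= \bigl\| \bigl(a_i^{(k)} \;\; a_j^{(k)}\bigr)\, Q \bigr\|_F^2
   = \bigl\| \bigl(a_i^{(k)} \;\; a_j^{(k)}\bigr) \bigr\|_F^2
   = \|a_i^{(k)}\|^2 + \|a_j^{(k)}\|^2
   \qquad (Q Q^T = I), \\
\text{hence}\quad
\|a_i^{(k+1)}\|^2 - \|a_i^{(k)}\|^2
  &= \|a_j^{(k)}\|^2 - \|a_j^{(k+1)}\|^2 .
\end{align*}
```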

With this extra condition the rotation parameters are chosen so that $\|a_i^{(k+1)}\|$ is greater than $\|a_j^{(k+1)}\|$ after the orthogonalisation, without explicitly exchanging the two columns. As in Algorithm 2, the computed singular values will appear in nonincreasing order if the cyclic ordering by rows is applied.

It is known from numerical experiments that an implementation which uses Rotation Algorithm 2 or 3 is more efficient than one using Rotation Algorithm 1 when the cyclic ordering is applied. It is easy to verify that implicit in the cyclic ordering is a sorting procedure which can sort the values of $n$ elements into nonincreasing (or nondecreasing) order in $n(n-1)/2$ steps. Since rotation algorithms 2 and 3 always increase $b_{ii}^{(k)}$ and decrease $b_{jj}^{(k)}$ for $i < j$ when orthogonalising the two columns, the column norms tend to be sorted after each sweep of computations. Therefore, the columns and their norms tend to be approximately determined after a few sweeps and only change by a small amount during each sweep. Since the column norms are not sorted during each sweep when using rotation algorithm 1, it is possible that the norm of column $i$ is increased when two columns $i$ and $j$ are orthogonalised in one sweep, but the norm of column $j$ is increased when the two columns meet again in the next sweep. Thus there are oscillations in column norms and (empirically) it takes more sweeps for the same problem to converge. This effect was also noted in [, 0]. It is probably the main reason why applying Rotation Algorithm 2 or 3 is more efficient than applying Rotation Algorithm 1.

In order to compare the performance, in terms of the total number of sweeps, with the parallel implementations described in the following sections, we give in Table 1 some experimental results obtained on a (sequential) Sun Sparc workstation.

[Table 1: Results for the cyclic Jacobi ordering on a Sun workstation. Columns: Size, Alg. 1, Alg. 2, Alg. 3. The numeric entries are not recoverable from this copy.]

3 The Ring Jacobi Ordering

We have seen in the previous section that sorting the column norms in each sweep is a very important issue. Our experimental results confirm that if an ordering does not include a proper sorting procedure in each sweep, it may converge relatively slowly []. In this section we describe a parallel ring ordering. This ordering can not only generate the required index pairs in a minimum number of steps, but also sort column norms at the same time.

Our Jacobi ordering consists of two procedures, forward sweep and backward sweep, as illustrated in Fig. 1. They are applied alternately during the computation. In either forward or backward sweep the $n$ indices are organised into two rows. Any two indices in the same column at a step form one index pair. One index in each column is then shifted to another column as shown by the arrows so that different index pairs can be generated at the next step. The up-and-down arrow in Fig. 1 indicates the exchange of two indices in the column before one is shifted. Each sweep (forward or backward), taking $n-1$ steps, can generate $n(n-1)/2$ different Jacobi pairs, as well as sort the values of $n$ elements into nonincreasing (or nondecreasing) order.

We outline a proof that $n(n-1)/2$ different Jacobi pairs can be generated in $n-1$ steps by either a forward or backward sweep. To do this, we first permute the initial positions of the $n$ indices for the round robin ordering [5] and then show that the two orderings generate the same index pairs at any step. Since it is well known that the round robin ordering generates $n(n-1)/2$ different index pairs in $n-1$ steps, this shows the correctness of our claim. The detailed proof of this claim is tedious and is omitted.

It can easily be verified that the forward sweep and the backward sweep are essentially the same, except that one sorts the elements into nondecreasing order and the other sorts the elements into nonincreasing order. Thus we only use the forward sweep as an example to show how to sort $n$ elements into nondecreasing order. (For details see [].)

[Figure 1: The ring Jacobi ordering. (a) forward sweep and (b) backward sweep. Each panel shows the two-row index layout at successive steps.]

If the numbers in Fig. 1(a) are not considered as indices, but as the values of $n$ elements, the figure gives an example of sorting $n$ elements from nonincreasing order to nondecreasing order. In each step the smaller element in each column is placed on the top, except that in even steps the larger element is placed on the top if the column has an up-and-down arrow in it. Since the up-and-down arrow indicates the exchange of the two elements in the column, these arrows can be removed in even steps by letting the smaller elements be placed at the top of the corresponding columns. Thus, we may describe the sorting procedure as follows.

One forward sweep can be applied to sort $n$ elements into nondecreasing order. Each step in the sweep consists of two substeps. The first substep compares the two elements in each column and places the smaller one on the top and the larger one at the bottom. The second substep then shifts the elements located at the bottom to the next column according to the arrows which form a ring, as depicted in Fig. 1(a). At each odd step the two elements in the column with an up-and-down arrow have to exchange their positions before the shift takes place. The $n$ elements are sorted into nondecreasing order after $n-1$ such steps (see top row of Fig. 1(b)).

Since both index ordering and sorting can be done simultaneously in either a forward or a backward sweep, it may seem that applying these two sweeps alternately in the SVD computation is not necessary. The reason why we perform the two sweeps alternately is as follows. Suppose that the $n$ indices are initially placed in nonincreasing order.
They will be sorted into nondecreasing order during a forward sweep. However, the natural order of indices for index ordering at each step is maintained during sorting. Thus the $n(n-1)/2$ different index pairs are also generated during the computation. Although the original (nonincreasing) order is restored when a backward sweep is performed, the exchange of positions of some indices is probable. As a consequence some index pairs may not be produced during the computation. This can easily be verified by an example of sorting a small number of indices (which are initially placed in nonincreasing order) using the backward sweep.

4 Experimental Results

In order to see the importance of sorting the column norms in a parallel implementation of the one-sided Jacobi SVD, we implemented our ring ordering algorithm on the Fujitsu AP1000 at the Australian National University. In the experiment both singular values and singular vectors are computed on the AP1000, which is configured as a one-dimensional array. An algorithm without partitioning is not very useful in practice for general-purpose parallel computation because the system configuration is fixed, but the size of the user's problem may vary.

[Table 2: Results for the ring Jacobi ordering on an AP1000 configured as a linear array (T = time (sec.), S = sweeps). Columns: size, then T and S for each of Rotation Algorithms 1, 2 and 3. The cell count and the numeric entries are not recoverable from this copy.]

Our partitioning strategy is based on the method described in []. However, a major difference is that we take sorting into consideration. Assume that the given system has $p$ processors. We first divide the $n$ columns of the matrix into $2p$ blocks. (The block sizes are not necessarily the same.) At the beginning of a sweep, the columns in each block are orthogonalised with each other exactly once using the cyclic-by-row ordering. If Rotation Algorithm 2 or 3 is applied, the norms of the columns in each block should be sorted in order. We then consider each block as a super index and follow the designed ordering so that $p(2p-1)$ super index pairs can be generated in $2p-1$ super steps. In the computation of each super index pair, each column in one block must be orthogonalised with each column in the other block once only using the cyclic-by-row ordering, but no columns in the same block are orthogonalised. If a block in a super index pair is considered as the column associated with index $i$ (or index $j$), the norms of all columns in that block should be increased (or decreased) during the orthogonalisation with the columns in the other block when Rotation Algorithm 2 or 3 is applied. It is easy to show that the sorting procedure is also implemented on the completion of the sweep.

Some of the experimental results from applying different rotation algorithms are given in Table 2. It is easy to see from the table that the program adopting Rotation Algorithm 1 is not as efficient as those adopting rotation algorithms 2 or 3, especially when the problem size is large. If the total number of sweeps is counted, these results are consistent with those in Table 1 (obtained in sequential computation using the cyclic ordering by rows). In our experiment we also measured the sensitivity of the performance to the number of processors used in the computation.
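The counting behind the partitioning strategy above can be checked with a small sketch. It splits the column indices into blocks, pairs columns within each block, and then pairs every column of one block with every column of another. The number of blocks is left as a parameter. The block pairs are visited in plain `itertools.combinations` order, which only stands in for the ring ordering of super indices and does not reproduce it. The helper names `split_into_blocks` and `blocked_sweep_pairs` are hypothetical.

```python
from itertools import combinations


def split_into_blocks(n, num_blocks):
    """Split column indices 0..n-1 into contiguous blocks of near-equal size."""
    base, extra = divmod(n, num_blocks)
    blocks, start = [], 0
    for b in range(num_blocks):
        size = base + (1 if b < extra else 0)  # first `extra` blocks get one more
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks


def blocked_sweep_pairs(n, num_blocks):
    """List the column pairs (i, j), i < j, rotated in one blocked sweep."""
    blocks = split_into_blocks(n, num_blocks)
    pairs = []
    # Phase 1: inside each block, every column meets every other column once
    # (cyclic-by-rows order within the block).
    for block in blocks:
        for a in range(len(block) - 1):
            for b in range(a + 1, len(block)):
                pairs.append((block[a], block[b]))
    # Phase 2: for each pair of blocks (a "super index pair"), every column of
    # one block meets every column of the other; no same-block pairs occur.
    # ASSUMPTION: block pairs are taken in combinations() order, not ring order.
    for p, q in combinations(range(num_blocks), 2):
        for i in blocks[p]:
            for j in blocks[q]:
                pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    n, num_blocks = 10, 4
    pairs = blocked_sweep_pairs(n, num_blocks)
    print(len(pairs), "pairs,", len(set(pairs)), "distinct; expected", n * (n - 1) // 2)
```

Every column pair is produced exactly once, either in phase 1 (same block) or in phase 2 (different blocks). A blocked sweep therefore performs the same $n(n-1)/2$ rotations as an unpartitioned one.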
The sensitivity results show that the total number of sweeps required for the computation of the same SVD does not vary as the number of processors is changed. Our experimental results are thus clear evidence of how important it is to adopt a proper sorting procedure in each sweep.

We recently implemented our one-sided Jacobi SVD algorithm on a Fujitsu VPP500. Some experimental results are given in Tables 3 and 4. It can be seen from Table 3 that our algorithm achieves over one third of the peak performance for solving large problems. (The peak performance of a four-processor VPP500 is 6.4 Gflops.) We can also see from Table 4 that a linear speedup is achieved when the number of processors is varied for a given problem. These results confirm that for massively parallel computation of the singular value decomposition the best approach may be to adopt one-sided Jacobi as advocated in [, ].

[Table 3: Results on a Fujitsu VPP500. Columns: Size, Sweeps, Time (sec.), Mflop rate. The processor count and the numeric entries are not recoverable from this copy.]

[Table 4: Results on a Fujitsu VPP500 using different numbers of processors. Columns: PEs, Size, Time (sec.). The numeric entries are not recoverable from this copy.]

5 The Eigenvalue Problem

The SVD algorithm can be used to find the eigenvalues and eigenvectors of a symmetric matrix. For a symmetric matrix $A$ of size $n \times n$ the one-sided Jacobi method produces an orthogonal matrix $V$ such that $AV = S$, where $S$ has orthogonal columns. We have $S^T S = V^T A^T A V = \Sigma^2$. Thus the eigenvalues and singular values of a symmetric matrix are equal, except possibly for signs, i.e. $\lambda_i^2 = \sigma_i^2$. The signs of the eigenvalues can be obtained using the Rayleigh quotient:

\[ \lambda = \frac{v^T A v}{v^T v}. \]

If we calculate eigenvalues one by one, it is impossible to achieve peak performance. This is because matrix-vector products suffer from the need for one memory reference per multiply-add. The performance may be limited by memory accesses rather than by floating-point arithmetic. In order to achieve high efficiency we should compute all eigenvalues simultaneously using the equation $V^T A V = \Lambda$ (where $V$ is assumed to be orthonormal). To minimise the communication cost the computation is divided into two steps, i.e., $Y = AV$ and $V^T Y = \Lambda$. There are various parallel algorithms for computing matrix multiplications. We choose an efficient one which places the resulting matrix $Y$ (in the first step) in a natural order. Since $V$ and $Y$ are stored in the same manner and $\Lambda$ is diagonal, the multiplication of the two matrices in the second step only involves local operations and has operation count $O(n^2)$.

If $A$ is positive definite, an alternative way of finding the eigenvalues and eigenvectors of $A$ is by first computing the Cholesky factorization of $A$ and then performing an SVD [, ].

6 Conclusions

We have shown that the one-sided Jacobi method can achieve high efficiency with parallel orderings provided consideration is given to sorting the column norms. Our parallel ring Jacobi ordering can do both index ordering and sorting simultaneously during a sweep. The experimental results show that this ring ordering algorithm can achieve the same convergence rate as the sequential cyclic Jacobi ordering. Some experiments have been conducted on Fujitsu AP1000 and VPP500 computers.
We found that for certain problems Jacobi produces results with high accuracy, but QR-based methods do not. Finally, we point out that the parallel odd-even index ordering [] and the parallel odd-even transposition sort [1, 2] both have the same communication structure. The two procedures can be combined into a new algorithm which can efficiently implement one-sided Jacobi on general-purpose distributed memory machines [].

Acknowledgements

The work was partly supported by the Fujitsu-ANU Research Agreement. Thanks are due to Hiroshi Ina and his colleagues at Fujitsu Limited for providing access to a Fujitsu VPP500.

References

[1] S. G. Akl, Parallel Sorting Algorithms, Academic Press, Orlando, Florida, 1985.
[2] G. Baudet and D. Stevenson, "Optimal sorting algorithms for parallel computers", IEEE Trans. on Computers, C-27, 1978, pp. 84-87.
[3] C. H. Bischof, "The two-sided block Jacobi method on a hypercube", in Hypercube Multiprocessors, M. T. Heath, ed., SIAM.
[4] R. P. Brent, "Parallel algorithms for digital signal processing", Proceedings of the NATO Advanced Study Institute on Numerical Linear Algebra, Digital Signal Processing and Parallel Algorithms, Leuven, Belgium, August 1988, pp. 93-110.
[5] R. P. Brent and F. T. Luk, "The solution of singular-value and symmetric eigenvalue problems on multiprocessor arrays", SIAM J. Sci. and Stat. Comput., 6, 1985, pp. 69-84.
[6] J. Demmel and K. Veselić, "Jacobi's method is more accurate than QR", SIAM J. Sci. Stat. Comput., 1992, pp. 1204-1245.
[7] P. J. Eberlein and H. Park, "Efficient implementation of Jacobi algorithms and Jacobi sets on distributed memory architectures", J. Par. Distrib. Comput., 8, 1990, pp. 358-366.
[8] L. M. Ewerbring and F. T. Luk, "Computing the singular value decomposition on the Connection Machine", IEEE Trans. Computers, 39, 1990, pp. 152-155.
[9] G. E. Forsythe and P. Henrici, "The cyclic Jacobi method for computing the principal values of a complex matrix", Trans. Amer. Math. Soc., 94, 1960, pp. 1-23.
[10] G. R. Gao and S. J. Thomas, "An optimal parallel Jacobi-like solution method for the singular value decomposition", in Proc. Internat. Conf. Parallel Proc., 1988, pp. 47-53.
[11] G. H. Golub and C. F. Van Loan, Matrix Computations, The Johns Hopkins University Press, Baltimore, MD, second ed., 1989.
[12] P. Henrici, "On the speed of convergence of cyclic and quasicyclic Jacobi methods for computing eigenvalues of Hermitian matrices", J. Soc. Indust. Appl. Math., 6, 1958, pp. 144-162.
[13] M. R. Hestenes, "Inversion of matrices by biorthogonalization and related results", J. Soc. Indust. Appl. Math., 6, 1958, pp. 51-90.
[14] T. J. Lee, F. T. Luk and D. L. Boley, Computing the SVD on a fat-tree architecture, Report 92-33, Department of Computer Science, Rensselaer Polytechnic Institute, Troy, New York, November 1992.
[15] C. E. Leiserson, "Fat-trees: Universal networks for hardware-efficient supercomputing", IEEE Trans. Computers, C-34, 1985, pp. 892-901.
[16] F. T. Luk, "Computing the singular-value decomposition on the ILLIAC IV", ACM Trans. Math. Softw., 6, 1980, pp. 524-539.
[17] F. T. Luk, "A triangular processor array for computing singular values", Lin. Alg. Applics., 77, 1986, pp. 259-273.
[18] F. T. Luk and H. Park, "On parallel Jacobi orderings", SIAM J. Sci. and Stat. Comput., 10, 1989, pp. 18-26.
[19] J. C. Nash, "A one-sided transformation method for the singular value decomposition and algebraic eigenproblem", Comput. J., 18, 1975, pp. 74-76.
[20] P. P. M. De Rijk, "A one-sided Jacobi algorithm for computing the singular value decomposition on a vector computer", SIAM J. Sci. and Stat. Comput., 10, 1989, pp. 359-371.
[21] R. Schreiber, "Solving eigenvalue and singular value problems on an undersized systolic array", SIAM J. Sci. Stat. Comput., 7, 1986, pp. 441-451.
[22] K. Veselić and V. Hari, "A note on a one-sided Jacobi algorithm", Numerische Mathematik, 56, 1990, pp. 627-633.
[23] J. H. Wilkinson, The Algebraic Eigenvalue Problem, Clarendon Press, Oxford, 1965, pp. 277-278.
[24] B. B. Zhou and R. P. Brent, "A parallel ordering algorithm for efficient one-sided Jacobi SVD computations", to appear in Proc. of Sixth IASTED-ISMM International Conference on Parallel and Distributed Computing and Systems, Washington, DC, October 1994.
[25] B. B. Zhou and R. P. Brent, "On the parallel implementation of the one-sided Jacobi algorithm for singular value decompositions", to appear in Proc. of 3rd Euromicro Workshop on Parallel and Distributed Processing, San Remo, Italy, January 1995.