A One-Sided Jacobi Algorithm for the Symmetric Eigenvalue Problem

B. B. Zhou, R. P. Brent
E-mail: bing,rpb@cslab.anu.edu.au
Computer Sciences Laboratory, The Australian National University, Canberra, ACT, Australia

M. Kahn
E-mail: Margaret.Kahn@anu.edu.au
Supercomputer Facility, The Australian National University, Canberra, ACT, Australia

Copyright © 1994, the authors. Appeared in Proc. Third Parallel Computing Workshop, Kawasaki, Japan, November 1994.

Abstract

A method which uses one-sided Jacobi to solve the symmetric eigenvalue problem in parallel is presented. We describe a parallel ring ordering for one-sided Jacobi computation. One distinctive feature of this ordering is that it can sort column norms in each sweep, which is very important to achieve fast convergence. Experimental results on both the Fujitsu AP1000 and the Fujitsu VPP500 are reported.

Introduction

Jacobi methods for the symmetric eigenvalue problem have recently attracted interest because they are readily parallelisable and are more accurate than QR-based methods for solving the same problem [6]. There are two basic types of Jacobi method: one-sided Jacobi and two-sided Jacobi.

The traditional two-sided Jacobi method for the symmetric eigenvalue problem works by performing a sequence of orthogonal similarity updates $A \leftarrow Q^T A Q$ with the property that each new $A$, although full, is "more diagonal" than its predecessor. Eventually, the off-diagonal entries are small enough to be ignored. Because both column and row updates are required, this method suffers from extensive communication of small amounts of data between processors in parallel computation, and from nonunit strides in vector operations.

One-sided Jacobi, though originally applied to the singular value decomposition, can also be adapted for the symmetric eigenvalue problem. This method requires only column updating and so does not need as much communication and is more suitable for vector pipeline computing.
Thus one-sided Jacobi is preferable to two-sided Jacobi in pipeline/parallel computation.

In parallel implementation of the one-sided Jacobi SVD a key problem is how to choose a reasonable, systematic order of rotations in each sweep of the computation so that a fast convergence rate is achieved. In this paper we describe a parallel ring Jacobi ordering. One distinctive feature of this ordering is that it can sort column norms, which is very important for fast convergence. The experimental results show that the algorithm adopting this ordering can achieve the same efficiency (in terms of the total number of sweeps) as the cyclic Jacobi algorithm in sequential computation.

The paper is organised as follows. The sequential one-sided Jacobi algorithm for computing the SVD is outlined first. Our parallel ring Jacobi ordering is then introduced, followed by the experimental results. The method for adapting one-sided Jacobi to the symmetric eigenvalue decomposition is described next, and some conclusions are given at the end.

Sequential One-sided Jacobi

For a matrix $A$ of order $m \times n$ ($m \ge n$) the one-sided Jacobi method produces an orthogonal matrix $V$ such that $AV = S$, where the columns of $S$ are orthogonal to within a given tolerance. The non-zero columns of $S$ can be normalised to give

$$S = (U_r \mid 0) \begin{pmatrix} \Sigma_r & 0 \\ 0 & 0 \end{pmatrix},$$

where $r \le n$ is the rank of $A$, and $\Sigma_r = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$. Thus

$$A = U_r \Sigma_r V_r^T,$$

where $V_r$ is an $n \times r$ matrix consisting of the first $r$ columns of $V$. This is the singular value decomposition of $A$.

The matrix $V$ can be generated as a product of plane rotations. Consider the transformation by a plane rotation:

$$(a_i \;\; a_j) \begin{pmatrix} c & s \\ -s & c \end{pmatrix} = (a_i' \;\; a_j'),$$

where $c = \cos\theta$, $s = \sin\theta$, and $a_i$ and $a_j$ are the $i$-th and $j$-th columns of the matrix $A$. We choose $\theta$ to make $a_i'$ and $a_j'$ orthogonal.

As in the traditional Jacobi algorithm, the rotations are performed in a fixed sequence called a sweep, each sweep consisting of $n(n-1)/2$ rotations, and every column in the matrix is orthogonalised with every other column exactly once per sweep. The iterative procedure terminates if one complete sweep occurs in which all columns are orthogonal to working accuracy and no columns are interchanged. If the rotations in a sweep are chosen in a reasonable, systematic order, the convergence rate is ultimately quadratic [9, ]. Exceptional cases in which cycling occurs are easily avoided by the use of a threshold strategy [].

There are two important implementation details which determine the speed of convergence of the one-sided Jacobi method for computing the SVD. The first is the method of ordering, i.e., how to order the $n(n-1)/2$ rotations in one sweep of computation. Various orderings have been introduced in the literature. In sequential computation, the most commonly used is the cyclic Jacobi ordering (cyclic ordering by rows or by columns) [9, ]. When discussing sequential Jacobi algorithms in this paper, we assume that the cyclic ordering by rows is applied.

The second important detail is the method for generating the plane rotation parameters $c$ and $s$ in each iteration. For the one-sided Jacobi method there are three main rotation algorithms, which we now describe.

Rotation Algorithm 1

This algorithm is derived from the standard two-sided Jacobi method for the eigenvalue decomposition of the matrix $B = A^T A$.
Suppose that after $k$ sweeps we have the updated matrix

$$A^{(k)} = \left[ a_1^{(k)} \;\; a_2^{(k)} \;\; \cdots \;\; a_n^{(k)} \right].$$

To annihilate the off-diagonal element $b_{ij}^{(k)}$ of $B^{(k)} = (A^{(k)})^T A^{(k)}$ in the $(k+1)$th sweep, we first need to compute $b_{ii}^{(k)}$, $b_{ij}^{(k)}$ and $b_{jj}^{(k)}$, that is,

$$b_{ii}^{(k)} = (a_i^{(k)})^T a_i^{(k)} = \|a_i^{(k)}\|^2, \qquad b_{ij}^{(k)} = (a_i^{(k)})^T a_j^{(k)}, \qquad b_{jj}^{(k)} = (a_j^{(k)})^T a_j^{(k)} = \|a_j^{(k)}\|^2,$$

where $\|x\|$ is the 2-norm of the vector $x$. The plane rotation factors $c$ and $s$, which are used to orthogonalise the corresponding two columns, are then generated based on the two-sided Jacobi method. It can be proved that the value of $b_{ii}^{(k)}$ is increased and the value of $b_{jj}^{(k)}$ is decreased after a plane rotation operation if $b_{ii}^{(k)} > b_{jj}^{(k)}$. Otherwise, $b_{ii}^{(k)}$ is decreased and $b_{jj}^{(k)}$ is increased.

Rotation Algorithm 2

The second algorithm, introduced by Hestenes [13], is the same as Algorithm 1 except that the columns $a_i^{(k)}$ and $a_j^{(k)}$ are swapped if $\|a_i^{(k)}\| < \|a_j^{(k)}\|$ for $i < j$ before the orthogonalisation of the two columns. Therefore, we always have $b_{ii}^{(k+1)} \ge b_{jj}^{(k+1)}$. When the cyclic ordering by rows is applied, the computed singular values will be sorted in nonincreasing order.

Rotation Algorithm 3

The third algorithm was derived by Nash [19] and implemented on the ILLIAC IV by Luk [16]. To determine the rotation parameters $c$ and $s$ for orthogonalising two columns $i$ and $j$, one extra condition has to be satisfied in this algorithm, that is,

$$\|a_i^{(k+1)}\|^2 - \|a_i^{(k)}\|^2 = \|a_j^{(k)}\|^2 - \|a_j^{(k+1)}\|^2 \ge 0.$$
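The quantities $b_{ii}$, $b_{ij}$, $b_{jj}$ and the swap rule of Rotation Algorithm 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `dot`, `rotate_pair` and `one_sided_jacobi`, the tolerance `tol`, the storage of the matrix as a list of columns, and the sign convention $a_i' = c\,a_i - s\,a_j$, $a_j' = s\,a_i + c\,a_j$ are all choices made here, and the matrix $V$ is not accumulated.

```python
import math


def dot(x, y):
    """Inner product of two equal-length sequences."""
    return sum(p * q for p, q in zip(x, y))


def rotate_pair(cols, i, j, tol=1e-12):
    """Orthogonalise columns i < j in place, swapping first if needed.

    Returns True if the columns were swapped or rotated.
    """
    b_ii, b_jj = dot(cols[i], cols[i]), dot(cols[j], cols[j])
    b_ij = dot(cols[i], cols[j])
    changed = False
    if b_ii < b_jj:
        # Swap so that the column with the larger norm sits at index i.
        cols[i], cols[j] = cols[j], cols[i]
        b_ii, b_jj = b_jj, b_ii
        changed = True
    if abs(b_ij) <= tol * math.sqrt(b_ii * b_jj):
        return changed  # already orthogonal to working accuracy
    # Two-sided Jacobi formulas for the 2x2 block [[b_ii, b_ij], [b_ij, b_jj]].
    zeta = (b_jj - b_ii) / (2.0 * b_ij)
    # Small-angle root of t^2 + 2*zeta*t - 1 = 0; this sign makes b_ii grow.
    t = -math.copysign(1.0, b_ij) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
    c = 1.0 / math.sqrt(1.0 + t * t)
    s = c * t
    a_i, a_j = cols[i], cols[j]
    cols[i] = [c * x - s * y for x, y in zip(a_i, a_j)]
    cols[j] = [s * x + c * y for x, y in zip(a_i, a_j)]
    return True


def one_sided_jacobi(a_cols, max_sweeps=30, tol=1e-12):
    """Return (singular values, orthogonalised columns, sweeps used)."""
    cols = [list(col) for col in a_cols]
    n = len(cols)
    sweeps = 0
    for _ in range(max_sweeps):
        sweeps += 1
        changed = False
        for i in range(n - 1):  # cyclic ordering by rows
            for j in range(i + 1, n):
                changed = rotate_pair(cols, i, j, tol) or changed
        if not changed:  # no rotation and no swap in a full sweep
            break
    sigma = [math.sqrt(dot(col, col)) for col in cols]
    return sigma, cols, sweeps
```

Calling `one_sided_jacobi([[3.0, 4.0], [0.0, 5.0]])` passes the two columns $(3, 4)^T$ and $(0, 5)^T$; the returned column norms are the singular values, and the swap rule leaves them in nonincreasing order once a sweep completes with no change.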

With this extra condition in Rotation Algorithm 3, the rotation parameters are chosen so that $\|a_i^{(k+1)}\|$ is greater than $\|a_j^{(k+1)}\|$ after the orthogonalisation, without explicitly exchanging the two columns. As in Algorithm 2, the computed singular values will appear in nonincreasing order if the cyclic ordering by rows is applied.

It is known from numerical experiments that an implementation which uses Rotation Algorithm 2 or 3 is more efficient than one using Rotation Algorithm 1 when the cyclic ordering is applied. It is easy to verify that implicit in the cyclic ordering is a sorting procedure which can sort the values of $n$ elements into nonincreasing (or nondecreasing) order in $n(n-1)/2$ steps. Since Rotation Algorithms 2 and 3 always increase $b_{ii}^{(k)}$ and decrease $b_{jj}^{(k)}$ for $i < j$ when orthogonalising the two columns, the column norms tend to be sorted after each sweep of computations. Therefore, the columns and their norms tend to be approximately determined after a few sweeps and only change by a small amount during each sweep.

Since the column norms are not sorted during each sweep when using Rotation Algorithm 1, it is possible that the norm of column $i$ is increased when the two columns $i$ and $j$ are orthogonalised in one sweep, but the norm of column $j$ is increased when the two columns meet again in the next sweep. Thus there are oscillations in column norms and (empirically) it takes more sweeps for the same problem to converge. This effect was also noted in [, 0]. It is probably the main reason why applying Rotation Algorithm 2 or 3 is more efficient than applying Rotation Algorithm 1.

In order to compare the performance in terms of the total number of sweeps with parallel implementations which are described in the following sections, we give in Table 1 some experimental results obtained on a (sequential) Sun Sparc workstation.

Table 1: Results for the cyclic Jacobi ordering on a Sun workstation. Columns: Size, Alg. 1, Alg. 2, Alg. 3.

The Ring Jacobi Ordering

We have seen in the previous section that sorting the column norms in each sweep is a very important issue. Our experimental results confirm that if an ordering does not include a proper sorting procedure in each sweep, it may converge relatively slowly []. In this section we describe a parallel ring ordering. This ordering can not only generate the required index pairs in a minimum number of steps, but also sort column norms at the same time.

Our Jacobi ordering consists of two procedures, forward sweep and backward sweep, as illustrated in Fig. 1. They are applied alternately during the computation. In either forward or backward sweep the $n$ indices are organised into two rows. Any two indices in the same column at a step form one index pair. One index in each column is then shifted to another column as shown by the arrows so that different index pairs can be generated at the next step. The up-and-down arrow in Fig. 1 indicates the exchange of two indices in the column before one is shifted. Each sweep (forward or backward), taking $n-1$ steps, can generate $n(n-1)/2$ different Jacobi pairs, as well as sort the values of $n$ elements into nonincreasing (or nondecreasing) order.

We outline a proof that $n(n-1)/2$ different Jacobi pairs can be generated in $n-1$ steps by either a forward or backward sweep. To do this, we first permute the initial positions of the $n$ indices for the round robin ordering [] and then show that the two orderings generate the same index pairs at every step. Since it is well known that the round robin ordering generates $n(n-1)/2$ different index pairs in $n-1$ steps, this shows the correctness of our claim. The detailed proof of this claim is tedious and is omitted.

It can easily be verified that the forward sweep and the backward sweep are essentially the same, except that one sorts the elements into nondecreasing order and the other sorts the elements into nonincreasing order. Thus we only use the forward sweep as an example to show how to sort $n$ elements into nondecreasing order. (For details see [].)

Figure 1: The ring Jacobi ordering. (a) forward sweep and (b) backward sweep.

If the numbers in Fig. 1(a) are not considered as indices, but as the values of $n$ elements, the figure gives an example of sorting $n$ elements from nonincreasing order to nondecreasing order. In each step the smaller element in each column is placed on the top, except that in even steps the larger element is placed on the top if the column has an up-and-down arrow in it. Since the up-and-down arrow indicates the exchange of the two elements in the column, these arrows can be removed in even steps by letting the smaller elements be placed at the top of the corresponding columns. Thus, we may describe the sorting procedure as follows.

One forward sweep can be applied to sort $n$ elements into nondecreasing order. Each step in the sweep consists of two substeps. The first substep compares the two elements in each column and places the smaller one on the top and the larger one at the bottom. The second substep then shifts the elements located at the bottom to the next column according to the arrows which form a ring, as depicted in Fig. 1(a). At each odd step the two elements in the column with an up-and-down arrow have to exchange their positions before the shift takes place. The $n$ elements are sorted into nondecreasing order after $n-1$ such steps (see top row of Fig. 1(b)).

Since both index ordering and sorting can be done simultaneously in either a forward or a backward sweep, it may seem that applying these two sweeps alternately in the SVD computation is not necessary. The reason why we perform the two sweeps alternately is as follows. Suppose that the $n$ indices are initially placed in nonincreasing order.
They will be sorted into nondecreasing order during a forward sweep. However, the natural order of indices for index ordering at each step is maintained during sorting. Thus the $n(n-1)/2$ different index pairs are also generated during the computation. Although the original (nonincreasing) order is restored when a backward sweep is performed, the exchange of positions of some indices is probable. As a consequence some index pairs may not be produced during the computation. This can easily be verified by an example of sorting a small number of indices (which are initially placed in nonincreasing order) using the backward sweep.

Experimental Results

In order to see the importance of sorting the column norms in a parallel implementation of the one-sided Jacobi SVD, we implemented our ring ordering algorithm on the Fujitsu AP1000 at the Australian National University. In the experiment both singular values and singular vectors are computed on the AP1000, which is configured as a one-dimensional array.

An algorithm without partitioning is not very useful in practice for general-purpose parallel computation because the system configuration is fixed, but the size of the user's problem may vary.
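The correctness argument for the ring ordering in the previous section rests on the round robin ordering producing every index pair exactly once. The sketch below illustrates that property only. It is the standard round robin (circle) schedule, not the ring ordering of Fig. 1, and the function name `round_robin_steps` and the choice of which index stays fixed are assumptions made for this example.

```python
def round_robin_steps(n):
    """Yield n - 1 steps, each a list of n / 2 disjoint index pairs (n even)."""
    if n % 2:
        raise ValueError("n must be even")
    idx = list(range(n))
    for _ in range(n - 1):
        # Two rows: the first half on top, the second half reversed below,
        # so that position k is paired with position n - 1 - k.
        top, bottom = idx[: n // 2], idx[n // 2:][::-1]
        yield [tuple(sorted(pair)) for pair in zip(top, bottom)]
        # Keep idx[0] fixed and rotate the remaining indices by one place.
        idx = [idx[0], idx[-1]] + idx[1:-1]


if __name__ == "__main__":
    for step, pairs in enumerate(round_robin_steps(8), start=1):
        print(step, pairs)
```

For `n = 8` this gives 7 steps of 4 pairs each, and the 28 pairs are all different, which is $n(n-1)/2$ pairs in $n-1$ steps. Each step uses every index exactly once, so the $n/2$ rotations of a step can be carried out in parallel.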

Table 2: Results for the ring Jacobi ordering on an AP1000 with 100 cells configured as a linear array (T = time (sec.), S = sweeps). Columns: size, then T and S for each of Algorithm 1, Algorithm 2 and Algorithm 3.

Our partitioning strategy is based on the method described in []. However, a major difference is that we take sorting into consideration. Assume that the given system has $p$ processors. We first divide the $n$ columns of the matrix into $2p$ blocks. (The block sizes are not necessarily the same.) At the beginning of a sweep, the columns in each block are orthogonalised with each other exactly once using the cyclic-by-row ordering. If Rotation Algorithm 2 or 3 is applied, the norms of columns in each block should be sorted in order. We then consider each block as a super index and follow the designed ordering so that $p(2p-1)$ super index pairs can be generated in $2p-1$ super steps. In the computation of each super index pair each column in one block must be orthogonalised with each column in the other block once only using the cyclic-by-row ordering, but no columns in the same block are orthogonalised. If a block in a super index pair is considered as the column associated with index $i$ (or index $j$), the norms of all columns in that block should be increased (or decreased) during the orthogonalisation with the columns in the other block when Rotation Algorithm 2 or 3 is applied. It is easy to show that the sorting procedure is also implemented on the completion of the sweep.

Some of the experimental results from applying different rotation algorithms are given in Table 2. It is easy to see from the table that the program adopting Rotation Algorithm 1 is not as efficient as those adopting Rotation Algorithms 2 or 3, especially when the problem size is large. If the total number of sweeps is counted, these results are consistent with those in Table 1 (obtained in sequential computation using the cyclic ordering by rows). In our experiment we also measured the sensitivity of the performance to the number of processors used in the computation.
The results show that the total number of sweeps required for the computation of the same SVD will not vary as the processor number is changed. Our experimental results are thus clear evidence which shows how important it is to adopt a proper sorting procedure in each sweep.

Table 3: Results on a Fujitsu VPP500. Columns: Size, Sweeps, Time (sec.), Mflop rate.

Table 4: Results on a Fujitsu VPP500 using different numbers of processors. Columns: PEs, Size, Time (sec.).

We recently implemented our one-sided Jacobi SVD algorithm on a Fujitsu VPP500. Some experimental results are given in Tables 3 and 4. It can be seen from Table 3 that our algorithm achieves over one third of the peak performance for solving large size problems. (The peak performance of a four-processor VPP500 is 6.4 Gflops.) We can also see from Table 4 that a linear speedup is achieved as the number of processors is varied for a given problem. These results confirm that for massively parallel computation of the singular value decomposition the best approach may be to adopt one-sided Jacobi as advocated in [, ].

The Eigenvalue Problem

The SVD algorithm can be used to find the eigenvalues and eigenvectors of a symmetric matrix. For a symmetric matrix $A$ of size $n \times n$ the one-sided Jacobi method produces an orthogonal matrix $V$ such that $AV = S$, where $S$ has orthogonal columns. We have

$$S^T S = V^T A^T A V = \Sigma^2,$$

with $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$. Thus the eigenvalues and singular values of a symmetric matrix are equal, except possibly for signs, i.e. $\lambda_i = \pm\sigma_i$. The signs of the eigenvalues can be obtained using the Rayleigh quotient:

$$\lambda = \frac{v^T A v}{v^T v}.$$

If we calculate eigenvalues one by one, it is impossible to achieve peak performance. This is because matrix-vector products suffer from the need for one memory reference per multiply-add. The performance may be limited by memory accesses rather than by floating-point arithmetic. In order to achieve high efficiency we should compute all eigenvalues simultaneously using the equation $V^T A V = \Lambda$ (where $V$ is assumed to be orthonormal). To minimise the communication cost the computation is divided into two steps, i.e., $Y = AV$ and $V^T Y = \Lambda$. There are various parallel algorithms for computing matrix multiplications. We choose an efficient one which places the resulting matrix $Y$ (in the first step) in a natural order. Since $V$ and $Y$ are stored in the same manner and $\Lambda$ is diagonal, the multiplication of the two matrices in the second step only involves local operations and has operation count $O(n^2)$.

If $A$ is positive definite, an alternative way of finding the eigenvalues and eigenvectors of $A$ is by first computing the Cholesky factorization of $A$ and then performing an SVD [, ].

Conclusions

We have shown that the one-sided Jacobi method can achieve high efficiency with parallel orderings provided consideration is given to sorting the column norms. Our parallel ring Jacobi ordering can do both index ordering and sorting simultaneously during a sweep. The experimental results show that this ring ordering algorithm can achieve the same convergence rate as the sequential cyclic Jacobi ordering. Some experiments have been conducted on Fujitsu AP1000 and VPP500 computers.
We found that for certain problems Jacobi produces results with high accuracy, but QR-based methods do not.

Finally, we point out that the parallel odd-even index ordering [] and the parallel odd-even transposition sort [1, 2] both have the same communication structure. The two procedures can be combined into a new algorithm which can efficiently implement one-sided Jacobi on general-purpose distributed memory machines [].

Acknowledgements

The work was partly supported by the Fujitsu-ANU Research Agreement. Thanks are due to Hiroshi Ina and his colleagues at Fujitsu Limited for providing access to a Fujitsu VPP500.

References

[1] S. G. Akl, Parallel Sorting Algorithms, Academic Press, Orlando, Florida.
[2] G. Baudet and D. Stevenson, "Optimal sorting algorithms for parallel computers", IEEE Trans. on Computers.
[3] C. H. Bischof, "The two-sided block Jacobi method on a hypercube", in Hypercube Multiprocessors, M. T. Heath, ed., SIAM.
[4] R. P. Brent, "Parallel algorithms for digital signal processing", Proceedings of the NATO Advanced Study Institute on Numerical Linear Algebra, Digital Signal Processing and Parallel Algorithms, Leuven, Belgium.
[5] R. P. Brent and F. T. Luk, "The solution of singular-value and symmetric eigenvalue problems on multiprocessor arrays", SIAM J. Sci. and Stat. Comput.
[6] J. Demmel and K. Veselić, "Jacobi's method is more accurate than QR", SIAM J. Sci. Stat. Comput.
[7] P. J. Eberlein and H. Park, "Efficient implementation of Jacobi algorithms and Jacobi sets on distributed memory architectures", J. Par. Distrib. Comput., 1990.
[8] L. M. Ewerbring and F. T. Luk, "Computing the singular value decomposition on the Connection Machine", IEEE Trans. Computers, 1990.
[9] G. E. Forsythe and P. Henrici, "The cyclic Jacobi method for computing the principal values of a complex matrix", Trans. Amer. Math. Soc.
[10] G. R. Gao and S. J. Thomas, "An optimal parallel Jacobi-like solution method for the singular value decomposition", in Proc. Internat. Conf. Parallel Proc.
[11] G. H. Golub and C. F. Van Loan, Matrix Computations, The Johns Hopkins University Press, Baltimore, MD, second ed.
[12] P. Henrici, "On the speed of convergence of cyclic and quasicyclic Jacobi methods for computing eigenvalues of Hermitian matrices", J. Soc. Indust. Appl. Math.
[13] M. R. Hestenes, "Inversion of matrices by biorthogonalization and related results", J. Soc. Indust. Appl. Math.
[14] T. J. Lee, F. T. Luk and D. L. Boley, Computing the SVD on a fat-tree architecture, Report, Department of Computer Science, Rensselaer Polytechnic Institute, Troy, New York.
[15] C. E. Leiserson, "Fat-trees: Universal networks for hardware-efficient supercomputing", IEEE Trans. Computers.
[16] F. T. Luk, "Computing the singular-value decomposition on the ILLIAC IV", ACM Trans. Math. Softw.
[17] F. T. Luk, "A triangular processor array for computing singular values", Lin. Alg. Applics.
[18] F. T. Luk and H. Park, "On parallel Jacobi orderings", SIAM J. Sci. and Stat. Comput.
[19] J. C. Nash, "A one-sided transformation method for the singular value decomposition and algebraic eigenproblem", Comput. J.
[20] P. P. M. De Rijk, "A one-sided Jacobi algorithm for computing the singular value decomposition on a vector computer", SIAM J. Sci. and Stat. Comput.
[21] R. Schreiber, "Solving eigenvalue and singular value problems on an undersized systolic array", SIAM J. Sci. Stat. Comput.
[22] K. Veselić and V. Hari, "A note on a one-sided Jacobi algorithm", Numerische Mathematik, 1990.
[23] J. H. Wilkinson, The Algebraic Eigenvalue Problem, Clarendon Press, Oxford.
[24] B. B. Zhou and R. P. Brent, "A parallel ordering algorithm for efficient one-sided Jacobi SVD computations", to appear in Proc. of Sixth IASTED-ISMM International Conference on Parallel and Distributed Computing and Systems, Washington, DC, October 1994.
[25] B. B. Zhou and R. P. Brent, "On the parallel implementation of the one-sided Jacobi algorithm for singular value decompositions", to appear in Proc. of 3rd Euromicro Workshop on Parallel and Distributed Processing, San Remo, Italy, January 1995.
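As a small worked illustration of the procedure in the section on the eigenvalue problem, the following Python sketch orthogonalises the columns of $S = AV$ with one-sided rotations, accumulates $V$, and then recovers signed eigenvalues from Rayleigh quotients. It is a sketch under stated assumptions rather than the authors' code: the names `dot` and `symmetric_eig`, the tolerance, the cyclic-by-rows ordering without sorting, and the sequential evaluation of the Rayleigh quotients are choices made here. It also assumes that the eigenvalues of $A$ have distinct absolute values; when $\lambda$ and $-\lambda$ both occur, the columns of $V$ need not be eigenvectors of $A$.

```python
import math


def dot(x, y):
    """Inner product of two equal-length sequences."""
    return sum(p * q for p, q in zip(x, y))


def symmetric_eig(a, max_sweeps=50, tol=1e-13):
    """Eigenvalues and eigenvectors of a symmetric matrix given as a list of rows.

    Returns (eigenvalues, v) where v[k] is the k-th eigenvector.
    """
    n = len(a)
    # Columns of S = A V, starting from V = I (so S starts as A).
    s = [[float(a[r][c]) for r in range(n)] for c in range(n)]
    v = [[float(r == c) for r in range(n)] for c in range(n)]
    for _ in range(max_sweeps):
        rotated = False
        for i in range(n - 1):  # cyclic ordering by rows
            for j in range(i + 1, n):
                b_ii, b_jj = dot(s[i], s[i]), dot(s[j], s[j])
                b_ij = dot(s[i], s[j])
                if abs(b_ij) <= tol * math.sqrt(b_ii * b_jj):
                    continue
                zeta = (b_jj - b_ii) / (2.0 * b_ij)
                t = math.copysign(1.0, zeta) / (
                    abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                sn = c * t
                # Apply the same column rotation to S and to V.
                for m in (s, v):
                    m_i, m_j = m[i], m[j]
                    m[i] = [c * x - sn * y for x, y in zip(m_i, m_j)]
                    m[j] = [sn * x + c * y for x, y in zip(m_i, m_j)]
                rotated = True
        if not rotated:
            break
    # Signed eigenvalues from the Rayleigh quotient v^T A v / v^T v.
    eigenvalues = []
    for v_k in v:
        a_v = [dot(row, v_k) for row in a]
        eigenvalues.append(dot(v_k, a_v) / dot(v_k, v_k))
    return eigenvalues, v
```

For the matrix with rows `[2, 1]` and `[1, -3]` the eigenvalues are $(-1 \pm \sqrt{29})/2$, one positive and one negative. The column norms of $S$ alone give only their absolute values, and the Rayleigh quotients supply the signs.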