On the QR algorithm and updating the SVD and URV decomposition in Parallel

Marc Moonen, ESAT, Katholieke Universiteit Leuven, K. Mercierlaan 94, 3001 Heverlee, Belgium, moonen@esat.kuleuven.ac.be

Paul Van Dooren, University of Illinois at Urbana-Champaign, 1101 West Springfield Avenue, Urbana, Ill 61801, USA, vandooren@uicsl.csl.uiuc.edu

Filiep Vanpoucke, ESAT, Katholieke Universiteit Leuven, K. Mercierlaan 94, 3001 Heverlee, Belgium, vpoucke@esat.kuleuven.ac.be

Abstract

A Jacobi-type updating algorithm for the SVD or URV decomposition is developed, which is related to the QR algorithm for the symmetric eigenvalue problem. The algorithm employs one-sided transformations, and therefore provides a cheap alternative to earlier developed updating algorithms based on two-sided transformations. The present algorithm, as well as the corresponding systolic implementation, is therefore roughly twice as fast as the former method, while the tracking properties are preserved. The algorithm is also extended to the 2-matrix QSVD or QURV case. Finally, the differences with a number of closely related, recently proposed algorithms are discussed.

I. Introduction

In an earlier report [14], an adaptive algorithm was developed for tracking the singular value decomposition of a data matrix, when new observations (rows) are added continuously. The algorithm may be organized such

that it provides at each time a certain approximation to the exact singular value decomposition. It combines a Jacobi-type diagonalization scheme, based on two-sided orthogonal transformations [10], with QR updates. A systolic implementation for this algorithm is described in [15].

Here, we improve upon these results. An alternative algorithm is described, which employs only one-sided transformations. Row and column transformations are applied in an alternating fashion. The algorithm is therefore roughly twice as fast, whereas its tracking properties are the same as for the two-sided method. The corresponding systolic implementation is roughly the same, but also twice as fast.

The algorithmic development starts from a square root version of the QR algorithm for the symmetric eigenvalue problem [6, 13], section III. This algorithm is turned into a Jacobi-type algorithm, based on 2x2 transformations, by supplementing it with a permutation scheme, section IV. The resulting algorithm may then be interlaced with a QR update whenever a new row has to be worked in, such that an adaptive scheme is obtained, section V. As the algorithm is operated without shifts of the origin, it is particularly suitable for isolating a cluster of small singular values. This is precisely the aim of a URV decomposition [18]. Thus, at each time step, either an exact SVD or a URV-type approximate decomposition may be computed. In section VI, the algorithm is extended to the 2-matrix QSVD or QURV case. In contrast to the QURV method proposed in [4], no preliminary extraction of a triangular factor is performed here. Moreover, a systolic array implementation is simpler with the present version. Some of these differences are explained in the last section on related work by others.

II. Preliminaries

The starting point here is a real data matrix A (we consider real arithmetic for simplicity; the complex case is similar), which is assumed to be tall and thin, i.e. with more rows than columns. The aim is to compute its singular value decomposition

$$\underbrace{A}_{N\times m} = \underbrace{U}_{N\times m}\ \underbrace{\Sigma}_{m\times m}\ \underbrace{V^T}_{m\times m}$$

with $U^T U = V^T V = I$ and $\Sigma$ a diagonal matrix. In real-time applications, A is defined in a recursive manner, i.e. A at time k equals A at time k-1 plus one additional new observation (row):

$$A_k = \begin{bmatrix} A_{k-1} \\ a_k^T \end{bmatrix}.$$

Mostly, exponential forgetting is applied with a forget factor $\lambda < 1$, i.e.

$$A_k = \begin{bmatrix} \lambda A_{k-1} \\ a_k^T \end{bmatrix}.$$

Very often, the SVD is used for so-called `subspace tracking' applications. The matrix $A_k$ is then supposed to have a clear gap in its singular value spectrum. The larger singular values correspond to the so-called `signal subspace', the smaller singular values to the so-called `noise subspace'. The SVD may be written as

$$A = \begin{bmatrix} U_s & U_n \end{bmatrix} \begin{bmatrix} \Sigma_s & \\ & \Sigma_n \end{bmatrix} \begin{bmatrix} V_s^T \\ V_n^T \end{bmatrix}$$

where $\Sigma_s$ contains the larger `signal singular values', and $\Sigma_n$ contains the smaller `noise singular values'. The aim is not so much to compute the complete singular value decomposition as to compute a good estimate for the subspaces $V_s^T$ and $V_n^T$. Therefore, it is not necessary to have an exact diagonal matrix in the middle. An `approximate decomposition', with e.g. a triangular matrix R in the middle as follows,

$$A = \begin{bmatrix} \tilde U_s & \tilde U_n \end{bmatrix} \underbrace{\begin{bmatrix} R_s & R_{sn} \\ 0 & R_n \end{bmatrix}}_{R} \begin{bmatrix} \tilde V_s^T \\ \tilde V_n^T \end{bmatrix}$$

also reveals good estimates for the subspaces, as long as the `cross term' $\|R_{sn}\|_F$ is small, such that $R_s$, resp. $R_n$, has roughly the same singular values as $\Sigma_s$, resp. $\Sigma_n$. In [14], this is called (somewhat loosely) an `approximate SVD', whereas in [18] it is termed a `URV decomposition', referring to the separate factors. In subspace tracking applications, the aim is to have a good estimate of the subspace $V_s^T$ or $V_n^T$ at each time instant k. The `tracking error', which may be defined in terms of the angles between the exact and approximate subspaces $V_s^T$ and $\tilde V_s^T$ [14], should be small at all times. Our aim is to develop efficient adaptive and parallel algorithms for this.
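As a concrete illustration of this error measure, the following small sketch (ours, not part of the original report; numpy is assumed, and all matrix sizes and perturbation levels are arbitrary) computes the principal angles between an exact signal subspace and an approximate one:

    import numpy as np

    def subspace_angles(Vs, Vs_tilde):
        # The cosines of the principal angles between the two column spans
        # are the singular values of Vs^T Vs_tilde (orthonormal columns).
        s = np.linalg.svd(Vs.T @ Vs_tilde, compute_uv=False)
        return np.arccos(np.clip(s, -1.0, 1.0))

    rng = np.random.default_rng(0)
    # tall data matrix with a clear gap in its singular value spectrum
    A = rng.standard_normal((100, 6)) @ np.diag([10.0, 9.0, 8.0, 0.1, 0.05, 0.01])
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    Vs = Vt[:3].T                                    # exact signal subspace
    # a perturbed basis stands in for the output of an updating algorithm
    Vs_tilde, _ = np.linalg.qr(Vs + 1e-3 * rng.standard_normal(Vs.shape))
    print(subspace_angles(Vs, Vs_tilde))             # small angles = good tracking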

III. Square root QR

In this section, we focus on computing the SVD of a fixed matrix A. It is shown how a Jacobi-type algorithm may be derived from the QR algorithm for the symmetric eigenvalue problem.

The SVD of the matrix A may be computed in two steps. First, a QR decomposition is computed, resulting in

$$\underbrace{A}_{N\times m} = \underbrace{Q_A}_{N\times m}\ \underbrace{R_A}_{m\times m}$$

where $Q_A^T Q_A = I$ and $R_A$ is upper triangular. This is done in a finite number of time steps, e.g. with a sequence of Givens transformations [8]. Then an iterative procedure is applied to the triangular factor $R_A$, transforming it into a diagonal matrix. This diagonalization procedure consists in applying a sequence of plane transformations as follows, see [10, 11] for details:

   R <- R_A
   U <- I
   V <- I
   for k = 1, ..., infinity
      for i = 1, ..., m-1
         R <- U_{[i,k]}^T R V_{[i,k]}
         U <- U U_{[i,k]}
         V <- V V_{[i,k]}

The parameter i is called the pivot index. The matrices $U_{[i,k]}$ and $V_{[i,k]}$ represent orthogonal transformations in the $\{i, i+1\}$-plane. $U_{[i,k]}$ differs from the identity only in the entries $(i,i) = (i+1,i+1) = \cos\theta$, $(i,i+1) = \sin\theta$ and $(i+1,i) = -\sin\theta$. Similarly, $V_{[i,k]}$ differs from the identity only in the entries $(i,i) = (i+1,i+1) = \cos\phi$, $(i,i+1) = \sin\phi$ and $(i+1,i) = -\sin\phi$. The angles $\theta$ and $\phi$ are chosen such that applying $U_{[i,k]}$ and $V_{[i,k]}$ to R results in a zero $(i,i+1)$-entry in R, while R still remains in upper triangular form. Among the two possible solutions, one chooses the so-called outer rotations, closest to a permutation [11]. Each iteration thus essentially reduces to performing a particular 2x2 SVD on the main diagonal. At each stage we have

$$R_A = U R V^T.$$

Furthermore, each rotation reduces the norm of the off-diagonal part of R. In other words, R converges to a diagonal matrix $\Sigma$, resulting in the required SVD:

$$A = Q_A R_A = Q_A U R V^T = (Q_A U)\,\Sigma\,V^T.$$
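A single transformation pair of this procedure can be realized through the 2x2 SVD of the diagonal block, as in the following sketch (ours; it uses numpy's SVD and does not reproduce the outer-rotation sign and ordering conventions of [11]):

    import numpy as np

    def two_sided_step(R, i):
        # Zero the (i, i+1) entry of the upper triangular R (0-based index)
        # with a row rotation U and a column rotation V acting in the
        # {i, i+1} plane; R stays upper triangular.
        m = R.shape[0]
        u2, _, v2t = np.linalg.svd(R[i:i+2, i:i+2])  # 2x2 SVD on the diagonal
        U, V = np.eye(m), np.eye(m)
        U[i:i+2, i:i+2] = u2
        V[i:i+2, i:i+2] = v2t.T
        return U.T @ R @ V, U, V                     # R <- U^T R V

Sweeping the pivot index i = 1, ..., m-1 in this fashion drives R towards diagonal form while the factorization R_A = U R V^T is maintained.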

This SVD algorithm is simple, amenable to parallel implementation [11], and may be turned into an adaptive algorithm [14]. In a way, the above algorithm may be viewed as a so-called `square root' version of the original Jacobi algorithm for the symmetric eigenvalue problem, applied to $A^T A = R^T R$ [8].

What is remarkable now is that another popular algorithm for the symmetric eigenvalue problem, namely the QR algorithm, may be turned into a Jacobi-type square root algorithm too. The original QR algorithm, applied to $A^T A$, works as follows [8]:

   X_0 <- A^T A
   for k = 0, ..., infinity
      Q_k R_k = X_k   (QR factorization)
      X_{k+1} <- R_k Q_k

In each iteration, a QR factorization of $X_k$ is computed. Then the next iterate $X_{k+1}$ is obtained by reversing the order of $Q_k$ and $R_k$ and carrying out the multiplication. It is proved that, except for contrived examples, $X_k$ converges to a diagonal matrix with the eigenvalues of $A^T A$ ordered along the diagonal, i.e. $X_\infty = \Sigma^2$. A square root version of this algorithm has been derived in [6, 9, 13] (see also [5] for a related result). With $A = Q_A R_A$, one has

$$X_0 = \underbrace{R_A^T}_{\text{lower}}\ \underbrace{R_A}_{\text{upper}} \stackrel{\mathrm{def}}{=} R_0^T R_0.$$

The QR factorization of $X_0$ (cf. the first iteration) is obtained from the QR factorization of $R_0^T = Q_{0'} R_{0'}$:

$$X_0 = \underbrace{R_0^T}_{Q_{0'} R_{0'}}\ R_0 = Q_{0'} \underbrace{(R_{0'} R_0)}_{\text{upper}}.$$

The next iterate is then obtained as

$$X_1 = (R_{0'} R_0)\, Q_{0'} = R_{0'}\,(R_0 Q_{0'}) = \underbrace{R_{0'}}_{\text{upper}}\ \underbrace{R_{0'}^T}_{\text{lower}},$$

since $R_0 Q_{0'} = (Q_{0'}^T R_0^T)^T = R_{0'}^T$. Finally, the original factorization (lower times upper) is restored by computing and substituting the QR factorization $R_{0'}^T = Q_{0''} R_{0''}$:

$$X_1 = R_{0'} R_{0'}^T = R_{0''}^T Q_{0''}^T Q_{0''} R_{0''} = R_{0''}^T R_{0''} \stackrel{\mathrm{def}}{=} R_1^T R_1.$$

The above process is then repeated to compute similar factorizations for $X_2$, $X_3$, etc. One can verify that, with tidied up notation (writing $L_k$ instead of $R_{k'}^T$ and $R_{k+1}$ instead of $R_{k''}$), a square root algorithm is obtained as follows:

   R_0 <- R_A
   for k = 0, ..., infinity
      L_k <- R_k Q_{k'}          (upper x orthogonal -> lower)
      R_{k+1} <- Q_{k''}^T L_k   (orthogonal x lower -> upper)

where $R_k^T = Q_{k'} R_{k'}$ and $L_k = Q_{k''} R_{k''}$ are QR factorizations. It is seen that column and row transformations are applied in an alternating fashion, whereby an upper triangular factor is turned into a lower triangular factor and vice versa. One easily verifies that, except for contrived examples, $R_\infty = \Sigma$, such that indeed $R_\infty^T R_\infty = X_\infty = \Sigma^2$.

The above QR-type algorithm is operated without shifts of the origin [8]. Therefore, convergence to the complete singular value decomposition is likely to be very slow. On the other hand, with the zero shift, this algorithm is particularly suitable for isolating a cluster of small (close to zero) singular values. Therefore, the above algorithm rapidly converges to the URV form, as given in the previous section, and thus may be called a `URV algorithm' too.
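The iteration is easy to try out numerically. The sketch below (ours; np.linalg.qr supplies the factorizations, and the iteration count is arbitrary) performs the two flips per iteration and shows the diagonal of R approaching the singular values:

    import numpy as np

    def sqrt_qr_iteration(R):
        # One zero-shift iteration, upper -> lower -> upper, i.e. one QR
        # step on X = R^T R in square root form.
        Qp, _ = np.linalg.qr(R.T)        # R^T = Q' R'
        L = R @ Qp                       # L = R Q' = R'^T, lower triangular
        Qpp, Rpp = np.linalg.qr(L)       # L = Q'' R''
        return Rpp                       # R_next = Q''^T L = R''

    rng = np.random.default_rng(1)
    R0 = np.triu(rng.standard_normal((5, 5)))
    R = R0.copy()
    for _ in range(50):
        R = sqrt_qr_iteration(R)
    print(np.sort(np.abs(np.diag(R))))                   # approaches ...
    print(np.sort(np.linalg.svd(R0, compute_uv=False)))  # ... these values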

For more details on how this algorithm relates to (the refinement step in) the URV algorithm of [18], we refer to [13] and to Section VII.

IV. Jacobi QR

The next step is to turn the above algorithm into a Jacobi-type algorithm, based on 2x2 transformations. First we add a permutation matrix P to the above algorithm (note that $P^T = P = P^{-1}$), given as

$$P = \begin{bmatrix} & & 1 \\ & \iddots & \\ 1 & & \end{bmatrix}.$$

The algorithm then works with upper triangular matrices only:

   R_0 <- R_A
   for k = 0, ..., infinity
      (P L_k P) <- P R_k (Q_{k'} P)           (P x upper x orthogonal -> upper)
      R_{k+1} <- (P Q_{k''})^T (P L_k P) P    (orthogonal x upper x P -> upper)

Here $P L_k P$ is upper triangular because $L_k$ is lower triangular. The first part may be viewed as applying a row permutation while preserving the upper triangular structure by means of orthogonal column transformations. Similarly, the second part may be viewed as applying a column permutation while preserving the upper triangular structure by means of orthogonal row transformations. To obtain a Jacobi-type algorithm, it suffices to split up these transformations into a sequence of 2x2 transformations. We consider here the case where m is even; similar formulas apply for the case where m is odd. It is well known that P can be split up with a so-called `odd-even ordering' as follows:

$$P = \big[\underbrace{(P_{1|2} P_{3|4} \cdots P_{m-1|m})}_{\text{`odd'}}\ \underbrace{(P_{2|3} P_{4|5} \cdots P_{m-2|m-1})}_{\text{`even'}}\big]^{m/2}$$

where $P_{i|i+1}$ differs from an identity matrix only in the entries $(i,i) = (i+1,i+1) = 0$ and $(i+1,i) = (i,i+1) = 1$.
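This splitting is easily checked numerically; the following sketch (ours, with 0-based indices) verifies for m = 8 that m/2 rounds of the odd and even passes reproduce the reversal permutation P:

    import numpy as np

    def P_adj(m, i):
        # P_{i|i+1} (0-based i): identity with entries i and i+1 exchanged
        P = np.eye(m)
        P[[i, i + 1]] = P[[i + 1, i]]
        return P

    m = 8
    odd  = np.linalg.multi_dot([P_adj(m, i) for i in range(0, m - 1, 2)])
    even = np.linalg.multi_dot([P_adj(m, i) for i in range(1, m - 1, 2)])
    P = np.linalg.multi_dot([odd @ even] * (m // 2))
    print(np.allclose(P, np.fliplr(np.eye(m))))   # True: reversal permutation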

Each time a row (column) permutation $P_{i|i+1}$ is applied, a corresponding column (row) transformation restores the upper triangular structure. This is very similar to inserting appropriate permutations in the Kogbetliantz algorithm for computing the SVD of a triangular matrix [11], in order to maintain the upper triangular form at each step. With this, we finally obtain the Jacobi-type square root QR algorithm (with odd-even ordering):

   R <- R_A
   U <- I
   V <- I
   for k = 0, ..., infinity
      for j = 1, ..., m/2
         for i = 1, 3, ..., m-1 (`odd'), then i = 2, 4, ..., m-2 (`even')
            R <- P_{i|i+1} R V_{[i,j,k]}
            U <- U P_{i|i+1}
            V <- V V_{[i,j,k]}
      for j = 1, ..., m/2
         for i = 1, 3, ..., m-1 (`odd'), then i = 2, 4, ..., m-2 (`even')
            R <- U_{[i,j,k]}^T R P_{i|i+1}
            U <- U U_{[i,j,k]}
            V <- V P_{i|i+1}

Again, $U_{[i,j,k]}$ and $V_{[i,j,k]}$ represent orthogonal transformations in the $\{i, i+1\}$-plane, with rotation angles such that $U_{[i,j,k]}$ or $V_{[i,j,k]}$ restores the upper triangular form after the column or row permutation. At each stage we have

$$R_A = U R V^T$$

and again, each iteration reduces the norm of the off-diagonal part of R. In other words, R converges to a diagonal matrix, resulting in the required SVD:

$$A = Q_A R_A = Q_A U R V^T = (Q_A U)\,\Sigma\,V^T.$$
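Each elementary step of the two loops is thus a permutation followed by a single plane rotation. A minimal sketch of the two step types (ours; 0-based index i, accumulation of U and V omitted):

    import numpy as np

    def row_perm_step(R, i):
        # R <- P_{i|i+1} R V_[i]: swap rows i, i+1, then zero the resulting
        # (i+1, i) entry with a column rotation in the {i, i+1} plane.
        R = R.copy()
        R[[i, i + 1]] = R[[i + 1, i]]
        a, b = R[i + 1, i], R[i + 1, i + 1]
        h = np.hypot(a, b)
        if h > 0:
            G = np.array([[b, a], [-a, b]]) / h
            R[:, i:i + 2] = R[:, i:i + 2] @ G
        return R

    def col_perm_step(R, i):
        # R <- U_[i]^T R P_{i|i+1}: swap columns i, i+1, then zero the
        # resulting (i+1, i) entry with a row rotation.
        R = R.copy()
        R[:, [i, i + 1]] = R[:, [i + 1, i]]
        a, b = R[i, i], R[i + 1, i]
        h = np.hypot(a, b)
        if h > 0:
            G = np.array([[a, b], [-b, a]]) / h
            R[i:i + 2, :] = G @ R[i:i + 2, :]
        return R

Both steps keep R upper triangular, so a full sweep simply alternates them over the odd and even pivot indices.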

V. Parallel and adaptive SVD/URV updating

The above SVD/URV algorithm is also simple, and directly amenable to parallel implementation. The array of [11] may be used here, see Figure 1. Dots correspond to matrix entries; frames may be thought of as processors. The triangular part stores the matrix R, while V is stored in the upper square part. The matrix U is not stored, as we will not need it in the adaptive case, see below. The `odd transformations' (i = 1, 3, ...) are computed in parallel on the main diagonal (Figure 1.a), and then propagated to the blocks next outward (column transformations are propagated upwards, row transformations are propagated to the right). In Figure 1.c, this first set of transformations has moved far enough to allow the next set to be generated (`even transformations' this time). After Figure 1.d comes Figure 1.a again, etc. The only difference with [11] is that each block now performs only either row or column transformations (plus permutations), instead of two-sided transformations. The array will thus operate roughly twice as fast.

Let us now return to the recursive case, with

$$A_k = \begin{bmatrix} \lambda A_{k-1} \\ a_k^T \end{bmatrix},$$

and turn the algorithm of the previous section into an adaptive algorithm. In [15], it is shown how a Jacobi-type process may be interlaced with QR updates whenever new observations have to be worked in. The main difficulty of the systolic implementation is the fact that two computational "flows" travel in opposite directions: one flow is associated with the SVD/URV computations on a triangular matrix and with updating the corresponding column transformation matrix in a square array, and another flow is associated with applying this transformation matrix to the new incoming observation vectors.

The crux of the implementation is to patch up the flows as they cross each other in different directions. It is instructive to look at the systolic implementation first, and then derive the corresponding algorithmic description. Figure 2 is similar to Figure 1, only the interlaced updates are added. The data vectors $a_k$ are fed into the upper V-array in a skewed fashion, as indicated in the figure, and propagated to the right, in between two transformation fronts (frames). The first step is to compute the matrix-vector product $\tilde a_k^T = a_k^T V$, to put the new vector in the same `basis' as the current R matrix. This is computed on the fly, with intermediate results being passed on upwards. The resulting vector $\tilde a_k$ becomes available at the top of the square array, and is then reflected and propagated downwards, towards the triangular array. While going downwards, the $\tilde a_k$-vector crosses upgoing transformations. These should then be applied to the $\tilde a_k$-vector too, in order to obtain consistent results. The QR updating is performed in the triangular part, much as in the conventional Gentleman-Kung array [7], but the pipelining is somewhat different here (compatible with the Jacobi-type algorithm). Rotations are generated on the main diagonal, and propagated to the right. In each frame, the column and row transformations corresponding to the SVD/URV scheme are performed first, while in a second step only row transformations are performed, corresponding to the QR updating (affecting the $\tilde a_k$-components and the upper part of the 2x2 blocks). For further details concerning this array, we refer to [15].

An algorithmic description of this systolic implementation is given as follows:

   V <- I_{m x m}
   R <- O_{m x m}
   for k = 1, ..., infinity
      input new observation vector a_k
      1. Matrix-vector multiplication
         $\tilde a_k^T \leftarrow a_k^T V$
      2. QR updating
         $\begin{bmatrix} R \\ 0 \end{bmatrix} \leftarrow Q_k^T \begin{bmatrix} \lambda R \\ \tilde a_k^T \end{bmatrix}$
      3. SVD/URV steps

      for i = 1, ..., m-1
         if (2k + i) mod 2n < n
            R <- P_{i|i+1} R V_{[i,k]}
            V <- V V_{[i,k]}
         else
            R <- U_{[i,k]}^T R P_{i|i+1}
            V <- V P_{i|i+1}

Again, $U_{[i,k]}$ and $V_{[i,k]}$ represent orthogonal transformations in the $\{i, i+1\}$-plane, with rotation angles such that $U_{[i,k]}$ or $V_{[i,k]}$ restores the upper triangular form after the column or row permutation.

The backbone of the algorithm is the SVD/URV process. Whenever a new vector $a_k$ has to be worked in, this process is interlaced with a QR update (step 2), with $\tilde a_k^T = a_k^T V$ (step 1). To have an algorithm with a fixed number of operations per loop, only one sequence of transformations for i = 1, 2, ..., m-1 is performed in step 3. (From then on, the algorithm only provides an approximate decomposition: an exact diagonalization would require a possibly infinite number of SVD steps after each update.) Here we make use of the well known fact that an odd-even ordering can be re-organized into sequences of transformations, where in each sequence we have i = 1, 2, ..., m-1 (up to a different start-up phase). One then only has to be careful with the computation of the transformations, i.e. the decision whether the permutation has to be applied to the left or to the right. This is done with the `if ... then ...' statement, which is explained as follows. In Figure 2.a it is seen that the blocks on the diagonal correspond to different updates, namely k = 2 and k = 1 (one could also add k = 0 and k = -1). The sum 2k + i is then indeed constant along the diagonal. In other words, the decision whether to perform a row permutation or a column permutation should only depend on 2k + i. On the other hand, one should switch from row permutations to column permutations, or the other way around, after each (n/2)-th update (for a fixed value of i).

The above algorithm has an O(m^2) computational complexity per update (per loop). At each time step, an approximate (URV-type) decomposition is available. The performance analysis of [14] straightforwardly carries over to this algorithm. This means that the tracking error (see section II) is bounded by the time variation over m time steps; see [14] for details. The tracking experiments of [14] may be repeated here, revealing much the same results. Finally, with a systolic array of O(m^2) processors (see Figure 2), an O(m^0) throughput is achieved, which means that new vectors can be fed in at a rate which is independent of the problem size m.
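A complete update fits in a few lines of numpy. The sketch below (ours; it implements only the row-permutation branch of step 3 and omits the (2k+i)-based left/right scheduling of the systolic array) also accumulates V:

    import numpy as np

    def svd_urv_update(R, V, a, lam=0.99):
        # One update: rotate the new row into the current basis (step 1),
        # QR update with exponential forgetting (step 2), then one sequence
        # of permutation + retriangularization steps (step 3).
        m = R.shape[0]
        a_tilde = V.T @ a                                            # step 1
        _, R = np.linalg.qr(np.vstack([lam * R, a_tilde[None, :]]))  # step 2
        for i in range(m - 1):                                       # step 3
            R[[i, i + 1]] = R[[i + 1, i]]                  # P_{i|i+1} R
            ai, bi = R[i + 1, i], R[i + 1, i + 1]
            h = np.hypot(ai, bi)
            if h > 0:
                G = np.array([[bi, ai], [-ai, bi]]) / h    # column rotation
                R[:, i:i + 2] = R[:, i:i + 2] @ G
                V[:, i:i + 2] = V[:, i:i + 2] @ G          # V <- V V_[i,k]
        return R, V

Feeding in the rows a_k one by one then maintains an upper triangular R whose small singular values are isolated in the bottom block, together with the corresponding right basis V.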

VI. QSVD and QURV updating

The above updating algorithm/array is readily extended to generalized decompositions for matrix pairs, viz. the quotient singular value decomposition (QSVD) or a similar QURV decomposition. Apart from the data matrix A (N x m), a second matrix B (p x m) is given, which for most applications corresponds to an error covariance matrix $B^T B$. The idea is then mostly to replace methods which are based on the SVD or URV decomposition of A by methods which are based on decompositions of $A R_B^{-1}$, where $R_B$ is, e.g., an m x m triangular factor of B ($B = Q_B R_B$). The post-multiplication with $R_B^{-1}$ represents a pre-whitening operation. The key point is that $A R_B^{-1}$ should never be computed explicitly, because both the inverse and the multiplication may introduce additional errors, or $R_B$ may not be invertible at all. The QSVD of the matrix pair $\{A, B\}$ or $\{A, R_B\}$, which is given as follows,

$$A = Q_A \underbrace{\Theta_A (\Sigma_A R) Q^T}_{R_A}, \qquad B = Q_B \underbrace{\Theta_B (\Sigma_B R) Q^T}_{R_B},$$

reveals the SVD of $A R_B^{-1}$ in an implicit way. Here $\Sigma_A$ and $\Sigma_B$ are diagonal matrices, R is upper triangular, and $\Theta_A^T \Theta_A = \Theta_B^T \Theta_B = Q^T Q = I$. For details, the reader is referred to [17]. Starting from the square triangular factors $R_A$ and $R_B$, the QSVD may be computed with an iterative procedure, similar to the SVD procedure:

   R_1 <- R_A
   R_2 <- R_B
   Theta_A <- I
   Theta_B <- I
   Q <- I
   for k = 1, ..., infinity

      for i = 1, ..., m-1
         R_1 <- Theta_{A[i,k]}^T R_1 Q_{[i,k]}
         R_2 <- Theta_{B[i,k]}^T R_2 Q_{[i,k]}
         Theta_A <- Theta_A Theta_{A[i,k]}
         Theta_B <- Theta_B Theta_{B[i,k]}
         Q <- Q Q_{[i,k]}

The matrices $R_1$ and $R_2$ then converge to $\Sigma_A R$ and $\Sigma_B R$. The matrices $\Theta_{A[i,k]}$, $\Theta_{B[i,k]}$ and $Q_{[i,k]}$ again represent orthogonal transformations in the $\{i, i+1\}$-plane. These transformations correspond to a 2x2 QSVD with $[R_1]_{i,i+1}$ and $[R_2]_{i,i+1}$, i.e. the 2x2 submatrices on the intersection of rows i, i+1 and columns i, i+1 in $R_1$ and $R_2$ [12]. The key point is that (if $R_2$ is invertible)

$$[R_1]_{i,i+1}\,[R_2]_{i,i+1}^{-1} = [R_1 R_2^{-1}]_{i,i+1}.$$

Computing the transformations in a numerically reliable way is a problem here, see e.g. [3, 2, 1]. Again, the above QSVD algorithm may be turned into an adaptive and parallel updating algorithm, where new rows may be appended to either one or both of the matrices A and B [16].

Our aim is now to develop a QR-type QSVD algorithm, similar to what we had for the 1-matrix case. This is straightforward. The algorithm below is readily seen to be a square root version of algorithm 8.6-1 of [8] for the symmetric generalized eigenvalue problem.

   R_1 <- R_A
   R_2 <- R_B
   Q <- I
   Theta_A <- I
   Theta_B <- I

   for k = 0, ..., infinity
      for j = 1, ..., m/2
         for i = 1, 3, ..., m-1 (`odd'), then i = 2, 4, ..., m-2 (`even')
            R_1 <- P_{i|i+1} R_1 Q_{[i,j,k]}
            R_2 <- Theta_{B[i,j,k]}^T R_2 Q_{[i,j,k]}
            Q <- Q Q_{[i,j,k]}
            Theta_A <- Theta_A P_{i|i+1}
            Theta_B <- Theta_B Theta_{B[i,j,k]}
      for j = 1, ..., m/2
         for i = 1, 3, ..., m-1 (`odd'), then i = 2, 4, ..., m-2 (`even')
            R_2 <- P_{i|i+1} R_2 Q_{[i,j,k]}
            R_1 <- Theta_{A[i,j,k]}^T R_1 Q_{[i,j,k]}
            Q <- Q Q_{[i,j,k]}
            Theta_A <- Theta_A Theta_{A[i,j,k]}
            Theta_B <- Theta_B P_{i|i+1}

Unlike in the first QSVD algorithm, computing the transformations is simple here. In the first loop, a row permutation is applied to $R_1$, and then $Q_{[i,j,k]}$ is computed to upper triangularize $R_1$ again. Finally, $\Theta_{B[i,j,k]}$ is computed to upper triangularize $R_2 Q_{[i,j,k]}$. The second loop starts with a permutation on $R_2$, etc. With the above algorithm, $R_1 R_2^{-1}$ (if $R_2$ is invertible) converges first to the URV form, and then further on to diagonal form.
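For illustration, one elementary step of the first loop can be written with dense factorizations (a sketch of ours using scipy; the full QR/RQ factorizations stand in for the single plane rotations they reduce to here, up to signs):

    import numpy as np
    from scipy.linalg import qr, rq

    def qsvd_step(R1, R2, i):
        # First-loop step: permute rows i, i+1 of R1, find the column
        # transformation Q that re-triangularizes R1, apply the same Q to
        # R2, and re-triangularize R2 with a row transformation Theta_B.
        R1p = R1.copy()
        R1p[[i, i + 1]] = R1p[[i + 1, i]]   # P_{i|i+1} R1
        R1n, Qrq = rq(R1p)                  # R1p = R1n Qrq, R1n upper triangular
        Qc = Qrq.T                          # column transformation Q_[i]
        ThB, R2n = qr(R2 @ Qc)              # R2 Qc = Theta_B R2n
        return R1n, R2n, Qc, ThB

Since both factors are hit by the same column transformation, the implicit quotient is only transformed orthogonally, R1n R2n^{-1} = P_{i|i+1} (R1 R2^{-1}) Theta_B, and is never formed explicitly.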

Finally, an adaptive updating algorithm is straightforwardly obtained as follows. Only the orthogonal matrix Q is stored now.

   Q <- I_{m x m}
   R_1 <- O_{m x m}
   R_2 <- O_{m x m}
   for k = 1, ..., infinity
      input new observation vectors a_k, b_k
      1. Matrix-vector multiplications
         $\tilde a_k^T \leftarrow a_k^T Q$
         $\tilde b_k^T \leftarrow b_k^T Q$
      2. QR updating
         $\begin{bmatrix} R_1 \\ 0 \end{bmatrix} \leftarrow Q_{1k}^T \begin{bmatrix} \lambda R_1 \\ \tilde a_k^T \end{bmatrix}, \qquad \begin{bmatrix} R_2 \\ 0 \end{bmatrix} \leftarrow Q_{2k}^T \begin{bmatrix} \lambda R_2 \\ \tilde b_k^T \end{bmatrix}$
      3. SVD/URV steps
      for i = 1, ..., m-1
         if (2k + i) mod 2n < n
            R_1 <- P_{i|i+1} R_1 Q_{[i,k]}
            R_2 <- Theta_{B[i,k]}^T R_2 Q_{[i,k]}
            Q <- Q Q_{[i,k]}
         else
            R_2 <- P_{i|i+1} R_2 Q_{[i,k]}
            R_1 <- Theta_{A[i,k]}^T R_1 Q_{[i,k]}
            Q <- Q Q_{[i,k]}

The corresponding systolic array is again given in Figure 2. The square part stores and updates Q, while the triangular part stores and updates $R_1$ and $R_2$, overlaid. The inputs now correspond to overlaid data vectors $a_k$ and $b_k$. The rest is similar to the 1-matrix case.

VII. Relation to other work

The decompositions discussed in this paper and the related references [4, 6, 13, 14, 18, 19] all compute an `approximate decomposition' of a matrix A with a triangular matrix in the middle:

$$A = \begin{bmatrix} \tilde U_s & \tilde U_n \end{bmatrix} \begin{bmatrix} R_s & R_{sn} \\ 0 & R_n \end{bmatrix} \begin{bmatrix} \tilde V_s^T \\ \tilde V_n^T \end{bmatrix}$$

whereby $\sigma_{\max}(R_n) = \epsilon$ and $\sigma_{\max}(R_{sn}) = \delta$ are small, and $\sigma_{\min}(R_s) = \sigma$ is reasonably larger than $\delta$ and $\epsilon$. As a result, the singular values of $R_s$ approximate well the large singular values of A. Those of $R_n$ are good approximations of the small singular values of A only if $\delta \ll \epsilon$. The desired ordering is thus $\sigma \gg \epsilon \gg \delta$.

- In [14], this was called an `approximate SVD'. When one or more sweeps of a Kogbetliantz-type SVD algorithm are applied to the triangular array, quadratic convergence was observed for the off-diagonal part $R_{sn}$, even when the required adjacency of close singular values was not respected. With one SVD sweep one reduces the norm of $R_{sn}$ from $\delta$ to $\delta\epsilon^2/(\sigma^2 - \epsilon^2)$.

- In [19] a `URV decomposition' approach was proposed, based on estimating small singular values of a matrix A and deflating them to the $R_n$ block (hence the ordering is obtained). An adaptive version of this for matrices $A_k$ was then proposed in [18]. Finally, a refinement idea for such URV decompositions was proposed and analysed in [6, 19]. One refinement step essentially amounts to a QR decomposition and has a similar effect: it reduces the norm of $R_{sn}$ from $\delta$ to $\delta' = \delta\epsilon^2/(\sigma^2 - \epsilon^2)$, but it also flips the matrix around to a (block) lower triangular one. Therefore [19] recommends a second step to flip it over again to upper triangular form, hence reducing the norm of $R_{sn}$ further to $\delta'\epsilon^2/(\sigma^2 - \epsilon^2) = \delta\epsilon^4/(\sigma^2 - \epsilon^2)^2$. Notice that the number of operations of one SVD sweep is roughly equal to that of two QR refinement steps.

- In this paper we show that similar refinement steps can in fact be performed while preserving the upper triangular form, at no extra cost. So, instead of having an improved decomposition at the same cost as an SVD sweep, we propose a cheaper procedure with the same refinement property as an SVD sweep. Moreover, the parallel implementation fits nicely on the same array as the SVD updating scheme. Another difference with the URV approach is that no rank test is performed at any stage. We rely here on the self-ordering property of the QR algorithm and expect that in the adaptive case the smaller singular values will automatically cluster. This can of course only be expected if the noise subspace corresponding to these small singular values does not change too much with each time step k.

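The refinement effect is easy to observe numerically. The following sketch (ours; the block sizes and the values standing in for sigma, epsilon and delta are arbitrary) applies one double-flip QR step of section III to a block triangular matrix and prints the cross-term norm before and after:

    import numpy as np

    rng = np.random.default_rng(3)
    ns = 3
    Rs  = np.triu(rng.standard_normal((ns, ns))) + 5.0 * np.eye(ns)  # sigma ~ 5
    Rn  = 0.01 * np.triu(rng.standard_normal((ns, ns)))              # eps ~ 0.01
    Rsn = 0.1 * rng.standard_normal((ns, ns))                        # delta ~ 0.1
    R = np.block([[Rs, Rsn], [np.zeros((ns, ns)), Rn]])

    def qr_step(R):
        # two flips = one zero-shift QR step on R^T R (cf. section III)
        Qp, _ = np.linalg.qr(R.T)
        L = R @ Qp                  # (block) lower triangular: first step
        _, Rpp = np.linalg.qr(L)
        return Rpp                  # upper triangular again: second step

    print(np.linalg.norm(R[:ns, ns:]))           # cross term before
    print(np.linalg.norm(qr_step(R)[:ns, ns:]))  # sharply reduced after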
Extensions to updating the implicit decomposition for two matrices A and B can be found in [16] (GSVD) and [4] (GURV). The approach chosen in [4] is to extend the work of [19] to the case of an implicit decomposition. This first requires a joint QR decomposition of the matrices A and B in order to extract the common null space and the triangular factor $R_{AB}$. The resulting matrices $Q_A$ and $Q_B$ then have a joint decomposition known as the CS decomposition, which can be updated adaptively as new observations are collected [4]. Two weaknesses of this approach are the preliminary rank determination in the joint QR factorization of A and B, and the two-stage updating required when new observations are processed. The main advantage of the method is that the rank revealing part of the QURV decomposition is now concentrated in the matrix $Q_A$. In the present paper we have focused on the parallel implementation of the QURV decomposition, and for this reason we chose not to perform the preliminary extraction of the triangular factor $R_{AB}$. The resulting algorithm is again easy to implement on the SVD array of [15] and does not require the double flipping of the URV decomposition as in [19]. The algorithm is then very close to the one presented for the single matrix case, and the earlier remarks apply here again.

Acknowledgement

This research was partially sponsored by ESPRIT Basic Research Action Nr. 3280. Marc Moonen is a senior research assistant with the Belgian N.F.W.O. (National Fund for Scientific Research). Paul Van Dooren was partially supported by the Research Board of the University of Illinois at Urbana-Champaign (Grant P 1-2-68114) and by the National Science Foundation (Grant CCR 9209349). Filiep Vanpoucke is a research assistant with the Belgian N.F.W.O. Part of this research was done while the first two authors were visiting the Institute for Mathematics and its Applications, Minneapolis, USA. The Institute's financial support is gratefully acknowledged.

References

[1] G.E. Adams, A.W. Bojanczyk and F.T. Luk, `Computing the PSVD of two triangular matrices', submitted to SIAM J. Matrix Anal. Appl.

[2] Z. Bai and J. Demmel, `Computing the generalized singular value decomposition', Report No. UCB/CSD 91/645, Computer Science Division, Univ. of California, Berkeley, August 1991, submitted to SIAM J. Sci. Stat. Comp.

[3] A.W. Bojanczyk, L.M. Ewerbring, F.T. Luk, P. Van Dooren, `An accurate product SVD algorithm', Signal Processing, Vol. 25 (1991), No. 2, pp. 189-202.

[4] A. Bojanczyk, F.T. Luk and S. Qiao, `Generalizing the URV decomposition for adaptive signal parameter estimation', Proceedings of the IMA Workshop on Linear Algebra in Signal Processing, April 6-10, 1992, Minneapolis, Minnesota, to appear.

[5] J.-M. Delosme, I. Ipsen, `From Bareiss' algorithm to the stable computation of partial correlations', Journal of Computational and Applied Mathematics, Vol. 27 (1989), pp. 53-91.

[6] E.M. Dowling, L.P. Ammann, R.D. DeGroat, `A TQR-iteration based adaptive SVD for real time angle and frequency tracking', Technical Report, Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas.

[7] W.M. Gentleman, H.T. Kung, `Matrix triangularization by systolic arrays', Real-Time Signal Processing IV, Proc. SPIE, Vol. 298 (1982), pp. 19-26.

[8] G.H. Golub, C.F. Van Loan, Matrix Computations. North Oxford Academic Publishing Co., Johns Hopkins Press, 1988.

[9] V. Fernando, B. Parlett, `Accurate singular values and differential QD algorithms', Report No. UCB/PAM-554, Center for Pure and Applied Mathematics, Univ. of California, Berkeley, July 1992.

[10] E. Kogbetliantz, `Solution of linear equations by diagonalization of coefficient matrices', Quart. Appl. Math., 13 (1955), pp. 123-132.

[11] F.T. Luk, `A triangular processor array for computing singular values', Lin. Alg. Appl., 77 (1986), pp. 259-273.

[12] F.T. Luk, `A parallel method for computing the GSVD', Int. J. Parallel & Distr. Comp., 2 (1985), pp. 250-260.

[13] R. Mathias, G.W. Stewart, `A block QR algorithm and the singular value decomposition', Technical Report CS-TR 2626, Dept. of Computer Science, University of Maryland, 1992.

[14] M. Moonen, P. Van Dooren, J. Vandewalle, `An SVD updating algorithm for subspace tracking', Internal Report K.U. Leuven, ESAT/SISTA 1989-13, to appear in SIAM J. Matrix Anal. Appl., 13 (1992), No. 4.

[15] M. Moonen, P. Van Dooren, J. Vandewalle, `A systolic array for SVD updating', Internal Report K.U. Leuven, ESAT/SISTA 1990-18, to appear in SIAM J. Matrix Anal. Appl.

[16] M. Moonen, P. Van Dooren, J. Vandewalle, `A systolic algorithm for QSVD updating', Signal Processing, Vol. 25 (1991), No. 2, pp. 203-213.

[17] C.C. Paige and M. Saunders, `Towards a generalized singular value decomposition', SIAM J. Numer. Anal., Vol. 18, No. 3, pp. 398-405, 1981.

[18] G.W. Stewart, `An updating algorithm for subspace tracking', Technical Report CS-TR 2494, Dept. of Computer Science, University of Maryland, 1990. To appear in IEEE Trans. on Signal Processing.

[19] G.W. Stewart, `On an algorithm for refining a rank-revealing URV decomposition', Technical Report CS-TR 2626, Dept. of Computer Science, University of Maryland, 1991. To appear in Lin. Alg. Appl.

Figure 1: Systolic array for the Jacobi-type square root QR algorithm (panels a-d; odd transformations i = 1, 3, 5, 7 are generated on the main diagonal, then even transformations i = 2, 4, 6).

Figure 2: The same array with interlaced QR updates (panels a-d; the diagonal blocks correspond to different updates, e.g. k = 2, i = 1 and k = 1, i = 3).