
Implementation of QR Up- and Downdating on a Massively Parallel Computer

Claus Bendtsen†, Per Christian Hansen†, Kaj Madsen‡, Hans Bruun Nielsen‡, Mustafa Pınar‡

July 8, 1996

Abstract. We describe an implementation of QR up- and downdating on a massively parallel computer (the Connection Machine CM-200) and show that the algorithm maps well onto the computer. In particular, we show how the use of corrected semi-normal equations for downdating can be efficiently implemented. We also illustrate the use of our algorithms in a new LP algorithm.

Key words. up- and downdating of QR factorization, corrected semi-normal equations, CM-200.

1 Introduction

In this paper we describe an efficient implementation of updating and downdating of a QR factorization on the Connection Machine CM-200, which is a massively parallel SIMD computer [11]. Many of our considerations are general for massively parallel computers.

This project was sponsored by the Danish Center for Parallel Computer Research. M. Pınar was also sponsored by the Danish Natural Science Research Council, Grant No.

† UNI-C (Danish Computing Centre for Research and Education), Building 305, Technical University of Denmark, DK-2800 Lyngby, Denmark (Claus.Btsen@uni-c.dk, Per.Christian.Hansen@uni-c.dk).
‡ Institute for Numerical Analysis, Building 305, Technical University of Denmark, DK-2800 Lyngby, Denmark (numikm@vm.uni-c.dk, numimpi@uts.uni-c.dk).

Many linear algebra routines can be implemented very efficiently on massively parallel computers [2]. However, it is not immediately clear whether updating and downdating of a QR factorization (in particular in the case where only the triangular matrix R is stored) provides enough parallelism for an efficient SIMD implementation. The main goal of this paper is to show that this is indeed the case.

Our work was motivated by the use of QR up- and downdating in a new algorithm for linear programming described in [7]. The algorithm was implemented on an 8K CM-200 located at UNI-C [8], and since there are no routines for QR up- and downdating in the CMSSL scientific subroutine library for the Connection Machine, it was necessary to implement such routines.

Throughout the paper, we are concerned with QR factorizations of the form

  A = Q R                                                     (1)

with A ∈ R^{m×n}, Q ∈ R^{m×m}, R ∈ R^{m×n}, m ≥ n. We assume that the matrix Q is not stored, and we want to recompute the triangular factor R efficiently when a row is either appended to A (updating) or removed from A (downdating). The algorithm for updating is a classical one (see, e.g., [5, §12.6]) and the downdating algorithm is a new "hybrid" algorithm using corrected semi-normal equations from [1]. Both algorithms are numerically stable.

Our paper is organized as follows. In §2 we summarize the up- and downdating algorithms and we investigate numerically the accuracy of the downdating algorithm. Implementation details are given in §3. In §4 we illustrate the use of our implementation in connection with the above-mentioned LP algorithm.

2 Up- and Downdating Algorithms

In this section we briefly summarize the algorithms for up- and downdating of a QR factorization.

2.1 Updating

If we wish to update the matrix A by an arbitrary new row u^T, then this row can always be permuted to the top. Hence, there is no loss of generality

in assuming an updating of the form

  Ã = ( u^T )
      (  A  )

where Ã is the updated matrix. If we rewrite this equation using (1), then we obtain

  Ã = ( 1  0^T ) H,    where  H = ( u^T )
      ( 0   Q  )                  (  R  )

has upper Hessenberg form. Now, if Ã = Q̃ R̃ is the QR factorization of Ã, then it follows that we can obtain R̃ by reducing H to upper triangular form by means of orthogonal transformations.

Two types of orthogonal transformations are relevant here: Givens rotations and fast Givens rotations [5, §5.1]. In connection with the QR updating problem, the fast Givens method requires O(2.5n²) flops¹ while the classical Givens method requires O(3n²) flops. The fast Givens rotations have a reputation for being impractical because of the potential danger of over- or underflow; see [5, p. 209]. However, in this particular application this is no problem, since each row is only involved in two rotations, so the maximum growth in the elements is limited by a factor 4. We have therefore decided to use the fast Givens rotations. The detailed updating algorithm is given in the Appendix.

We mention in passing that the QR factorization provided in the CMSSL library uses a block-cyclic data layout [6]. We can easily make our updating algorithm conform with this layout by performing the fast Givens rotations in the same order as this layout (the details are straightforward and are omitted here). In this way, our routines are compatible with the CMSSL routines.

2.2 Classical (LINPACK) Downdating

If we want to remove an arbitrary row u^T from the matrix A, then again without loss of generality we can assume that A has the form

  A = ( u^T )
      (  Ã  )

¹ Here, one flop is either an addition or a multiplication.

where Ã is equal to A with the first row u^T deleted. Now let q^T denote the first row of the matrix Q in the QR factorization of A. Then there exists an orthogonal matrix G such that

  G^T q = α (1, 0, ..., 0)^T,   with α = ±1.

In particular, if G is constructed as a sequence of Givens rotations,

  G = G_{m−1} ··· G_1,

where each rotation G_i involves elements i and i+1 of q, then it follows that G^T R has upper Hessenberg form, i.e.,

  G^T R = G_1^T ··· G_{m−1}^T R = ( v^T )                     (2)
                                  (  R̃  )

Moreover, we have that

  Q G = ( α  0 )
        ( 0  Q̃ )

and therefore

  A = ( u^T ) = Q R = Q G G^T R = ( α  0 ) ( v^T ) = ( α v^T )
      (  Ã  )                     ( 0  Q̃ ) (  R̃  )   (  Q̃ R̃  )

in which we identify Ã = Q̃ R̃ as the desired QR factorization of Ã. This algorithm is mixed stable [9].

Often Q is not available because of storage considerations. Hence, the algorithm above must be modified to take this situation into account. Notice that only the n Givens transformations G_1, ..., G_n alter R; we need not determine G_{n+1}, ..., G_{m−1}. Now, if q_{1:n} denotes the first n components of q, then we have

  G_{n+1}^T ··· G_{m−1}^T q = ( q_{1:n} )   (n rows)
                              (    α    )   (1 row)           (3)
                              (    0    )   (m − n − 1 rows)

where

  α = ± (1 − ‖q_{1:n}‖₂²)^{1/2}.                              (4)

Since Q is not available, we first have to compute the necessary quantities in (3). The vector q_{1:n} can be computed from the system

  R^T q_{1:n} = u,                                            (5)

and then α is computed from (4). From the vector in (3) we can then construct G_n, ..., G_1 and apply these rotations to R as in (2); in this way we produce R̃. This is the well-known LINPACK algorithm [4, §10]. Again, we can use fast Givens transformations in our implementation without danger of over- or underflow, and we take into account the block cyclic layout of R conforming with the CMSSL library.

2.3 CSNE Downdating

In [1] it is shown that the LINPACK downdating algorithm can be arbitrarily inaccurate, because the sole use of R to form the downdating transformations may lead to a much more ill-conditioned problem than using both Q and R. It is therefore proposed in [1] to use corrected semi-normal equations (CSNE) to improve the accuracy of q_{1:n}, computed by (5), before it is used to construct the Givens transformations in (2). We summarize the algorithm here and refer to [1] for more details:

CSNE Downdating of R
1. Solve R^T q_{1:n} = u for q_{1:n}
2. Solve R v = q_{1:n} for v
3. Let t ← e_1 − A v, where e_1 = (1, 0, ..., 0)^T
4. Solve R^T δq_{1:n} = A^T t for δq_{1:n}
5. Let q_{1:n} ← q_{1:n} + δq_{1:n}
6. Solve R δv = δq_{1:n} for δv
7. Let t ← t − A δv
8. Let α ← ‖t‖₂
9. Continue using the LINPACK algorithm.

The key to the improved stability is the refinement of q_{1:n} in steps 2–5, combined with a more accurate computation of α in step 8 via δv and t (instead of Eq. (4)).

Applying the CSNE algorithm is obviously more expensive than applying the LINPACK algorithm. It is therefore recommended to use a hybrid algorithm, where the CSNE algorithm is only used if the system is ill-conditioned and the LINPACK algorithm is used otherwise [1]. As a measure of the conditioning of the system we use

  α² = 1 − ‖q_{1:n}‖₂² = ‖t‖₂²

and apply the CSNE algorithm if α² is less than a user-specified tolerance. We used a tolerance equal to 1/4, as recommended in [1].
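As a concrete illustration, the hybrid scheme summarized above (steps 1–9 plus the α² test) can be sketched in pure Python. For simplicity the sketch uses ordinary Givens rotations rather than the fast variant, and all names (`hybrid_downdate`, `solve_upper`, ...) are ours, not the paper's:

```python
# Illustrative sketch of the hybrid CSNE/LINPACK downdating step, using
# ordinary Givens rotations instead of the fast variant for clarity.
import math

def solve_upper(R, b):
    # back substitution for R x = b, R upper triangular
    n = len(b); x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - sum(R[i][j] * x[j] for j in range(i + 1, n))) / R[i][i]
    return x

def solve_upper_t(R, b):
    # forward substitution for R^T x = b
    n = len(b); x = [0.0] * n
    for i in range(n):
        x[i] = (b[i] - sum(R[j][i] * x[j] for j in range(i))) / R[i][i]
    return x

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def hybrid_downdate(R, A, tol=0.25):
    # Remove the first row u^T of A from the factor R.  CSNE refinement is
    # used when alpha^2 = 1 - ||q||^2 drops below tol (the paper uses 1/4).
    n = len(R); u = A[0]
    q = solve_upper_t(R, u)                          # step 1: R^T q = u
    alpha2 = 1.0 - sum(qi * qi for qi in q)
    if alpha2 < tol:                                 # ill conditioned: CSNE
        v = solve_upper(R, q)                        # step 2
        t = [-s for s in matvec(A, v)]; t[0] += 1.0  # step 3: t = e1 - A v
        dq = solve_upper_t(R, matvec(list(zip(*A)), t))  # step 4
        q = [a + b for a, b in zip(q, dq)]           # step 5
        dv = solve_upper(R, dq)                      # step 6
        t = [a - b for a, b in zip(t, matvec(A, dv))]    # step 7
        alpha = math.sqrt(sum(s * s for s in t))     # step 8
    else:
        alpha = math.sqrt(alpha2)
    # LINPACK part: rotations G_n ... G_1 reduce (q, alpha) to +-e_1,
    # applied simultaneously to the rows of [R; 0].
    S = [row[:] for row in R] + [[0.0] * n]
    rho = alpha
    for i in range(n - 1, -1, -1):
        h = math.hypot(q[i], rho)
        c, s = q[i] / h, rho / h
        S[i], S[i + 1] = ([c * a + s * b for a, b in zip(S[i], S[i + 1])],
                          [-s * a + c * b for a, b in zip(S[i], S[i + 1])])
        rho = h
    return S[1:]   # rows 2..n+1 hold the downdated factor
```

In exact arithmetic the returned factor satisfies R̃^T R̃ = Ã^T Ã, which is how the sketch can be checked against a direct factorization of Ã.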

2.4 Accuracy of CSNE Downdating

The updating algorithm is known to be numerically stable and to yield good accuracy, so we concentrate on studying the accuracy of the hybrid CSNE downdating algorithm. In order to test the accuracy of this algorithm and, in particular, its sensitivity to an ill-conditioned matrix A, we generate test matrices by the following strategy:

1. Generate a random (n + 1) × n matrix A with n even.
2. Modify column n/2 of A as follows:

     A_{:,n/2} ← ε A_{:,n/2} + ((1 − ε)/2) (A_{:,n/2−1} + A_{:,n/2+1}),   where 0 ≤ ε ≤ 1.

   In this way we can use ε to control the condition of A, since (for small ε) column n/2 is almost a linear combination of the columns n/2 − 1 and n/2 + 1.
3. We also study the effect of doing more than one iteration in the above algorithm, in the sense that we repeat steps 4–7 together with the additional step v ← v + δv a number of times.

The results are shown in Fig. 1 for a series of random matrices. For each value of ε we generate 10 random matrices according to the above scheme, and for each matrix we compute R̃ by means of the LINPACK algorithm, the CSNE algorithm, and the CSNE algorithm with 1 additional iteration. We compare R̃ with the matrix R̃_direct from a QR factorization of Ã (consisting of the bottom 128 rows of A). Figure 1 shows the relative error ‖R̃ − R̃_direct‖₂ / ‖R̃_direct‖₂ as a function of the parameter ε. As expected, the CSNE algorithm is superior to the LINPACK algorithm with respect to sensitivity to an ill-conditioned problem. Furthermore, we see that extra iterations in the CSNE algorithm never improve the accuracy of R̃.

3 SIMD Implementation of Up- and Downdating

3.1 General Considerations

When implementing algorithms on a parallel computer it is essential to choose an appropriate data layout in order to be able to operate simultaneously on as many processing elements as possible. On massively parallel

Figure 1: Accuracy of the hybrid CSNE algorithm and the LINPACK algorithm when downdating the test matrices described above. (The plot shows the relative error ‖R̃ − R̃_direct‖₂ / ‖R̃_direct‖₂ versus ε for the LINPACK algorithm, the CSNE algorithm, and CSNE with 1 extra iteration.)

computers we generally have three fundamentally different parallel layouts to choose from.

- Row-oriented layout: Each row is assigned to parallel processors and all the rows are stacked serially.
- Column-oriented layout: Each column is assigned to parallel processors and all the columns are stacked serially.
- Matrix-oriented layout: Rows as well as columns are assigned to parallel processors. Different data distributions are possible in this approach.

Generally some sort of matrix-oriented layout is used in conjunction with matrix operations since, even for matrices of small dimensions, this ensures that elements are allocated to all processors. The different approaches will now be addressed with respect to QR up- and downdating.

3.1.1 Row-Oriented Layout

Using this approach one can perform the fast Givens rotations in the up- and downdating in parallel without introducing communication other than broadcasting the constants α and β which define the Givens rotations. When a rotation is performed between two rows, the processors have to be activated a number of times. The resulting efficiency can be approximated as follows. Let n be the number of columns and p the number of processors, and write n = ⌊n/p⌋ p + q, where q is the remainder. Then

  E ≈ (n²/2) / [ (p²/2) ⌊n/p⌋ (⌊n/p⌋ + 1) + p (⌊n/p⌋ + 1) q ].

The drawback of this layout is that a large problem size relative to the number of processors is needed in order to obtain decent results. E.g., if n = p/2 only 25% of the processor performance is utilized, as opposed to 50% if n = p or approximately 67% if n = 2p.

The solution of the linear systems related to the downdating can be efficiently implemented using the row version of back substitution [5, p. 88], whereas transposed systems can be efficiently solved using the column version of forward substitution [5, p. 89].
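The efficiency estimate above can be checked against the utilization figures quoted in the text. A minimal sketch (the function name is ours):

```python
# Evaluate the row-oriented efficiency estimate
#   E = (n^2/2) / ( (p^2/2)*floor(n/p)*(floor(n/p)+1) + p*(floor(n/p)+1)*q )
# for the three cases quoted in the text.
def efficiency(n, p):
    b, q = divmod(n, p)              # n = b*p + q
    return (0.5 * n * n) / (0.5 * p * p * b * (b + 1) + p * (b + 1) * q)

p = 8192                             # e.g. an 8K machine
print(round(efficiency(p // 2, p), 2))   # -> 0.25
print(round(efficiency(p, p), 2))        # -> 0.5
print(round(efficiency(2 * p, p), 2))    # -> 0.67
```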

3.1.2 Column-Oriented Layout

This layout cannot be efficiently used, since the Givens rotations would have to be performed on one processor at a time.

3.1.3 Matrix-Oriented Layout

If the processors are considered as configured in a grid, then a row in the matrix will live on a row of processors. Each row as well as each column may be wrapped on the processor grid. One possible matrix-oriented layout is the block cyclic layout often used for gaining good load balance in linear algebra applications [6].

If we use a block cyclic layout then the Givens rotations can be performed in parallel without introducing communication other than broadcasting the constants α and β when the rows involved in the rotation are part of the same block, and by introducing nearest neighbor communication equal to one shift operation if they are not part of the same block. Only one shift operation is needed to ensure that all the forward/backward references live on the same processor as the active row. It can be shown that the CSNE downdating can be performed in the block cyclic domain without introducing any overhead.

The efficiency of this layout is by nature much lower than for the row-oriented layout, since only one or two rows of processors will be active simultaneously. This layout should not be discarded, though, since overall considerations of efficiency might imply using a matrix-oriented layout. In any case this layout is attractive if the alternative is refactorization.

3.2 Implementation on the Connection Machine CM-200

The updating and hybrid downdating algorithms have been implemented in CM-FORTRAN on an 8K CM-200, but the code also applies to the slower CM-2. We have implemented the row-oriented as well as the matrix-oriented layout. It has not been possible to activate only the processors on and to the right of the diagonal of the matrix, which means that we do operations on zeros.
Furthermore, in the matrix-oriented implementation it has not been possible to detect whether two rows live on the same processors, resulting in unnecessary activation of the communication primitives.

The implementation using a matrix-oriented layout is based on a (:news, :news) layout, has storage requirements of roughly three times the size of

the factorized matrix, and uses n − 1 shift operations. Since no shift operations are available for the block cyclic layout, this implementation assumes normal ordering in order to avoid one s-operation per loop iteration. Both up- and downdating perform very poorly; less than 5 Mflop/s is obtained. The bottleneck can be identified from the assembler code and is the creation of the masks related to performing the Givens rotations.

Figure 2: Performance of up- and downdating compared to the CMSSL QR factorization. (Execution time in seconds versus n for the CMSSL QR factorization, CSNE downdating, LINPACK downdating, and updating.)

² The timings were produced on the 8K CM-200 present at UNI-C with CMF compiler V. 1.2 and CMSSL library V. 3.1 Beta 2. The QR factorization timings were produced on a (:news, :news) layout since this seems to yield the highest performance.

The implementation which is based on a row-oriented layout uses a (:serial, :news) layout and has storage requirements of roughly 2n². Here the block cyclic ordering is efficiently adaptable. The performance of these routines is shown in Fig. 2 together with results from refactorization using the CMSSL library². It is seen that the CSNE downdating is an

order of magnitude slower than the updating and LINPACK downdating, but it is still an order of magnitude faster than the refactorization for large n. Since it has not been possible to avoid the excessive engagement of virtual processors, it should fairly easily be possible to halve the computational work by using a lower-level language than CM-FORTRAN.

4 An Application in Linear Programming

In this section we illustrate the use of the QR up- and downdating routines in an implementation of a new algorithm for linear programming based on a continuation algorithm. Today, such algorithms are realistic alternatives to the classical simplex method. Consider the normalized linear programming problem

  min_x c^T x   subject to   G x = b,   −e ≤ x ≤ e,           (6)

where G is m₁ × n, c is an n-vector, and e = (1, ..., 1)^T. The continuation algorithm used here [7] solves this problem by first solving its dual problem,

  min_y ‖G^T y + c‖₁ + b^T y,                                 (7)

and then detecting x from the residual vector G^T y + c. The dual problem (7) is solved by an algorithm that essentially substitutes the non-smooth 1-norm with a smooth "Huber" norm with a threshold, where the components of the residual vector G^T y + c are treated differently depending on whether they are greater than or smaller than the threshold. The key idea is to start with a large threshold and then reduce it until the solution of (7) can be identified from the solution to the "Huber" problem. This happens for a positive value of the threshold.

The main computational problem during the algorithm outlined above is to solve a series of linear systems of equations, where the coefficient matrix in each iteration step is modified by a few rank-one up- and downdates. Instead of computing a refactorization of the matrix each time, we use the up- and downdating routines described above and fall back on refactorization only in the few cases when it is necessary due to rank deficiency. See [8] for more details about the implementation of the complete algorithm.
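To make the smoothing step concrete, here is a minimal sketch of a Huber-type substitute for |r|: quadratic for residuals below the threshold, linear above it. This particular scaling is one standard choice and is not necessarily the exact function used in [7]; the function names are ours:

```python
# Huber-type smoothing of |r| with threshold gamma: small residuals are
# treated quadratically, large ones linearly, and the two pieces join
# continuously at |r| = gamma.  One standard scaling, shown for illustration.
def huber(r, gamma):
    if abs(r) <= gamma:
        return r * r / (2.0 * gamma)
    return abs(r) - gamma / 2.0

def smoothed_dual_objective(residuals, b_dot_y, gamma):
    # smooth approximation of ||G^T y + c||_1 + b^T y from (7):
    # `residuals` holds the components of G^T y + c
    return sum(huber(r, gamma) for r in residuals) + b_dot_y
```

As the threshold tends to zero, huber(r, gamma) tends to |r|, so the smoothed objective recovers the 1-norm problem; the continuation algorithm starts with a large threshold and reduces it.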
The continuation algorithm was compared with an implementation of the simplex algorithm GEN SIMPLEX provided in the CMSSL library [10].

Figure 3: Execution times for the CMSSL simplex algorithm and the continuation method, versus problem size (m₁ + n)n.

The two algorithms were tested with a number of randomly generated dense matrices G of varying size. The execution times are shown in Fig. 3 as a function of the quantity (m₁ + n)n, which is a measure of the "size" of the LP problem. Each point corresponds to the average of five measured timings. We see that the two algorithms require approximately the same execution time, and that the continuation algorithm is asymptotically faster than the simplex algorithm for these dense test problems. Without the up- and downdating routines, the continuation algorithm would be up to 10 times slower for large n. Moreover, we remark that the continuation algorithm typically produces more accurate results than the simplex algorithm. See [8] for details.

5 Conclusion

We have shown that stable up- and downdating of a QR factorization can be implemented efficiently on a massively parallel computer, even without the use of low-level routines. We have also shown that a new LP algorithm, based on our routines, is competitive with a simplex routine from the CMSSL library for the Connection Machines.

References

[1] A. Björck, H. Park & L. Eldén, Accurate downdating of least squares solutions, SIAM J. Matrix Anal. Appl. 15 (1994), to appear.
[2] J. Demmel, M. Heath & H. A. van der Vorst, Parallel numerical linear algebra, Acta Numerica (1993), 111–197.
[3] J. J. Dongarra, I. S. Duff, D. C. Sorensen & H. A. van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers, SIAM, Philadelphia.
[4] J. J. Dongarra, C. B. Moler, J. R. Bunch & G. W. Stewart, LINPACK Users' Guide, SIAM.
[5] G. H. Golub & C. F. van Loan, Matrix Computations, Second Edition, The Johns Hopkins University Press, Baltimore.

[6] W. Lichtenstein & S. L. Johnsson, Block cyclic dense linear algebra, Report TMC 215, Thinking Machines Corporation, 1991; to appear in SIAM J. Sci. Comp.
[7] K. Madsen, H. B. Nielsen & M. C. Pinar, A New Finite Continuation Algorithm for Linear Programming, Report NI-93-07, Institute for Numerical Analysis, Technical University of Denmark; submitted to SIAM J. Optim.
[8] K. Madsen, H. B. Nielsen, M. C. Pinar, C. Bendtsen & P. C. Hansen, Solving Bounded Variable Linear Problems on the Connection Machine CM-200, Report NI-93-08, Institute for Numerical Analysis, Technical University of Denmark.
[9] C. C. Paige, Error analysis of some techniques for updating orthogonal decompositions, Math. Comp. 34 (1980), 465–471.
[10] Thinking Machines Corporation, CMSSL for CM Fortran, Version 3.0, CM-200 Edition.
[11] Thinking Machines Corporation, CM-200 Technical Summary.

Appendix: Algorithms

In this appendix we give the detailed up- and downdating algorithms. The implementations are based on [5, §5].

Updating of R

function R̃ = qrud(R, u)
  [m, n] ← size(R)
  [α, β, type, δ₁, δ₂] ← fastgivens(u₁, R_{1,1}, 1, 1);  μ ← 1/√δ₁;  δ ← δ₂
  if type = 1
    R̃_{1,1:n} ← μ (β u_{1:n} + R_{1,1:n});  R_{1,1:n} ← u_{1:n} + α R_{1,1:n}
  else
    R̃_{1,1:n} ← μ (u_{1:n} + β R_{1,1:n});  R_{1,1:n} ← α u_{1:n} + R_{1,1:n}
  for i = 2 : n − 1
    [α, β, type, δ₁, δ₂] ← fastgivens(R_{i−1,i}, R_{i,i}, δ, 1);  μ ← 1/√δ₁;  δ ← δ₂
    if type = 1
      R̃_{i,i:n} ← μ (β R_{i−1,i:n} + R_{i,i:n});  R_{i,i:n} ← R_{i−1,i:n} + α R_{i,i:n}
    else
      R̃_{i,i:n} ← μ (R_{i−1,i:n} + β R_{i,i:n});  R_{i,i:n} ← α R_{i−1,i:n} + R_{i,i:n}
  [α, β, type, δ₁, δ₂] ← fastgivens(R_{n−1,n}, R_{n,n}, δ, 1);  μ ← 1/√δ₁
  if type = 1
    R̃_{n,n} ← μ (β R_{n−1,n} + R_{n,n})
  else
    R̃_{n,n} ← μ (R_{n−1,n} + β R_{n,n})

The flop count for this algorithm is approximately 5n²/2. Here, the function fastgivens is implemented as follows:

function [α, β, type, δ₁, δ₂] = fastgivens(x₁, x₂, δ₁, δ₂)
  if x₂ ≠ 0
    α ← −x₁/x₂;  β ← −α δ₂/δ₁;  γ ← −α β
    if γ ≤ 1
      type ← 1;  τ ← δ₁;  δ₁ ← (1 + γ) δ₂;  δ₂ ← (1 + γ) τ
    else
      type ← 2;  α ← 1/α;  β ← 1/β;  γ ← 1/γ
      δ₁ ← (1 + γ) δ₁;  δ₂ ← (1 + γ) δ₂
  else
    type ← 2;  α ← 0;  β ← 0

LINPACK Downdating of R

function R̃ = qrdd(R, u)
  [m, n] ← size(R)
  q ← R_{1:n,1:n}^{−T} u;  ρ ← (1 − ‖q‖₂²)^{1/2}
  [α, β, type, δ₁, δ₂] ← fastgivens(q_n, ρ, 1, 1);  μ ← 1/√δ₂;  δ ← δ₁
  if type = 1
    R̃_{n,n} ← μ R_{n,n};  R_{n,n} ← β R_{n,n};  q_n ← β q_n + ρ
  else
    R̃_{n,n} ← μ α R_{n,n};  q_n ← q_n + β ρ
  for i = n − 1 : −1 : 2
    [α, β, type, δ₁, δ₂] ← fastgivens(q_i, q_{i+1}, 1, δ);  μ ← 1/√δ₂;  δ ← δ₁
    if type = 1
      R̃_{i,i:n} ← μ (R_{i,i:n} + α R_{i+1,i:n});  R_{i,i:n} ← β R_{i,i:n} + R_{i+1,i:n};  q_i ← β q_i + q_{i+1}
    else
      R̃_{i,i:n} ← μ (α R_{i,i:n} + R_{i+1,i:n});  R_{i,i:n} ← R_{i,i:n} + β R_{i+1,i:n};  q_i ← q_i + β q_{i+1}
  [α, β, type, δ₁, δ₂] ← fastgivens(q_1, q_2, 1, δ);  μ ← 1/√δ₂
  if type = 1
    R̃_{1,1:n} ← μ (R_{1,1:n} + α R_{2,1:n})
  else
    R̃_{1,1:n} ← μ (α R_{1,1:n} + R_{2,1:n})

This algorithm also uses fastgivens, and it requires approximately 5n²/2 flops.
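For reference, the fast Givens construction and the updating routine translate directly into pure Python. This is an illustrative sketch with our own variable names, not the CM-FORTRAN production code:

```python
# Pure-Python rendering of the Appendix routines: fast Givens rotations
# (following the construction in [5, Sec. 5.1]) and updating of R by a
# new row u^T.  A sketch for checking the algorithm, not production code.
import math

def fastgivens(x1, x2, d1, d2):
    if x2 != 0.0:
        alpha = -x1 / x2
        beta = -alpha * d2 / d1
        gamma = -alpha * beta
        if gamma <= 1.0:
            typ = 1
            d1, d2 = (1.0 + gamma) * d2, (1.0 + gamma) * d1
        else:
            typ = 2
            alpha, beta, gamma = 1.0 / alpha, 1.0 / beta, 1.0 / gamma
            d1, d2 = (1.0 + gamma) * d1, (1.0 + gamma) * d2
    else:
        typ, alpha, beta = 2, 0.0, 0.0
    return alpha, beta, typ, d1, d2

def qrud(R, u):
    # update the n x n factor R for a new row u^T appended on top
    n = len(R)
    R = [row[:] for row in R]
    Rnew = [[0.0] * n for _ in range(n)]
    work, d = u[:], 1.0            # the row being chased, and its scale
    for i in range(n):
        a, b, typ, d1, d2 = fastgivens(work[i], R[i][i], d, 1.0)
        mu = 1.0 / math.sqrt(d1)   # rescale the finished row of R~
        if typ == 1:
            top  = [b * w + r for w, r in zip(work, R[i])]
            work = [w + a * r for w, r in zip(work, R[i])]
        else:
            top  = [w + b * r for w, r in zip(work, R[i])]
            work = [a * w + r for w, r in zip(work, R[i])]
        Rnew[i] = [mu * x for x in top]
        d = d2
    return Rnew
```

A quick check uses the identity R̃^T R̃ = R^T R + u u^T, which holds because the updated matrix has the same Gram matrix regardless of the orthogonal factor.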


More information

F04EBFP.1. NAG Parallel Library Routine Document

F04EBFP.1. NAG Parallel Library Routine Document F04 Simultaneous Linear Equations F04EBFP NAG Parallel Library Routine Document Note: before using this routine, please read the Users Note for your implementation to check for implementation-dependent

More information

Bias-Variance Tradeos Analysis Using Uniform CR Bound. Mohammad Usman, Alfred O. Hero, Jerey A. Fessler and W. L. Rogers. University of Michigan

Bias-Variance Tradeos Analysis Using Uniform CR Bound. Mohammad Usman, Alfred O. Hero, Jerey A. Fessler and W. L. Rogers. University of Michigan Bias-Variance Tradeos Analysis Using Uniform CR Bound Mohammad Usman, Alfred O. Hero, Jerey A. Fessler and W. L. Rogers University of Michigan ABSTRACT We quantify fundamental bias-variance tradeos for

More information

Numerical Linear Algebra

Numerical Linear Algebra Numerical Linear Algebra Probably the simplest kind of problem. Occurs in many contexts, often as part of larger problem. Symbolic manipulation packages can do linear algebra "analytically" (e.g. Mathematica,

More information

Pipeline Givens sequences for computing the QR decomposition on a EREW PRAM q

Pipeline Givens sequences for computing the QR decomposition on a EREW PRAM q Parallel Computing 32 (2006) 222 230 www.elsevier.com/locate/parco Pipeline Givens sequences for computing the QR decomposition on a EREW PRAM q Marc Hofmann a, *, Erricos John Kontoghiorghes b,c a Institut

More information

SDLS: a Matlab package for solving conic least-squares problems

SDLS: a Matlab package for solving conic least-squares problems SDLS: a Matlab package for solving conic least-squares problems Didier Henrion 1,2 Jérôme Malick 3 June 28, 2007 Abstract This document is an introduction to the Matlab package SDLS (Semi-Definite Least-Squares)

More information

Lecture 15: More Iterative Ideas

Lecture 15: More Iterative Ideas Lecture 15: More Iterative Ideas David Bindel 15 Mar 2010 Logistics HW 2 due! Some notes on HW 2. Where we are / where we re going More iterative ideas. Intro to HW 3. More HW 2 notes See solution code!

More information

Computational Methods in Statistics with Applications A Numerical Point of View. Large Data Sets. L. Eldén. March 2016

Computational Methods in Statistics with Applications A Numerical Point of View. Large Data Sets. L. Eldén. March 2016 Computational Methods in Statistics with Applications A Numerical Point of View L. Eldén SeSe March 2016 Large Data Sets IDA Machine Learning Seminars, September 17, 2014. Sequential Decision Making: Experiment

More information

Neuro-Remodeling via Backpropagation of Utility. ABSTRACT Backpropagation of utility is one of the many methods for neuro-control.

Neuro-Remodeling via Backpropagation of Utility. ABSTRACT Backpropagation of utility is one of the many methods for neuro-control. Neuro-Remodeling via Backpropagation of Utility K. Wendy Tang and Girish Pingle 1 Department of Electrical Engineering SUNY at Stony Brook, Stony Brook, NY 11794-2350. ABSTRACT Backpropagation of utility

More information

Data parallel algorithms 1

Data parallel algorithms 1 Data parallel algorithms (Guy Steele): The data-parallel programming style is an approach to organizing programs suitable for execution on massively parallel computers. In this lecture, we will characterize

More information

Blocked Schur Algorithms for Computing the Matrix Square Root. Deadman, Edvin and Higham, Nicholas J. and Ralha, Rui. MIMS EPrint: 2012.

Blocked Schur Algorithms for Computing the Matrix Square Root. Deadman, Edvin and Higham, Nicholas J. and Ralha, Rui. MIMS EPrint: 2012. Blocked Schur Algorithms for Computing the Matrix Square Root Deadman, Edvin and Higham, Nicholas J. and Ralha, Rui 2013 MIMS EPrint: 2012.26 Manchester Institute for Mathematical Sciences School of Mathematics

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms Chapter 5 Vector and Matrix Products Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign CS 554 / CSE 512 Michael T. Heath Parallel

More information

Dense matrix algebra and libraries (and dealing with Fortran)

Dense matrix algebra and libraries (and dealing with Fortran) Dense matrix algebra and libraries (and dealing with Fortran) CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Dense matrix algebra and libraries (and dealing with Fortran)

More information

A Few Numerical Libraries for HPC

A Few Numerical Libraries for HPC A Few Numerical Libraries for HPC CPS343 Parallel and High Performance Computing Spring 2016 CPS343 (Parallel and HPC) A Few Numerical Libraries for HPC Spring 2016 1 / 37 Outline 1 HPC == numerical linear

More information

Sparse Matrix Libraries in C++ for High Performance. Architectures. ferent sparse matrix data formats in order to best

Sparse Matrix Libraries in C++ for High Performance. Architectures. ferent sparse matrix data formats in order to best Sparse Matrix Libraries in C++ for High Performance Architectures Jack Dongarra xz, Andrew Lumsdaine, Xinhui Niu Roldan Pozo z, Karin Remington x x Oak Ridge National Laboratory z University oftennessee

More information

A parallel frontal solver for nite element applications

A parallel frontal solver for nite element applications INTERNATIONAL JOURNAL FOR NUMERICAL METHODS IN ENGINEERING Int. J. Numer. Meth. Engng 2001; 50:1131 1144 A parallel frontal solver for nite element applications Jennifer A. Scott ; Computational Science

More information

LINUX. Benchmark problems have been calculated with dierent cluster con- gurations. The results obtained from these experiments are compared to those

LINUX. Benchmark problems have been calculated with dierent cluster con- gurations. The results obtained from these experiments are compared to those Parallel Computing on PC Clusters - An Alternative to Supercomputers for Industrial Applications Michael Eberl 1, Wolfgang Karl 1, Carsten Trinitis 1 and Andreas Blaszczyk 2 1 Technische Universitat Munchen

More information

Performance Evaluation of BLAS on a Cluster of Multi-Core Intel Processors

Performance Evaluation of BLAS on a Cluster of Multi-Core Intel Processors Performance Evaluation of BLAS on a Cluster of Multi-Core Intel Processors Mostafa I. Soliman and Fatma S. Ahmed Computers and Systems Section, Electrical Engineering Department Aswan Faculty of Engineering,

More information

Adaptive Nonlinear Discriminant Analysis. by Regularized Minimum Squared Errors

Adaptive Nonlinear Discriminant Analysis. by Regularized Minimum Squared Errors Adaptive Nonlinear Discriminant Analysis by Regularized Minimum Squared Errors Hyunsoo Kim, Barry L Drake, and Haesun Park February 23, 2005 Abstract: Recently, kernelized nonlinear extensions of Fisher

More information

Journal of Engineering Research and Studies E-ISSN

Journal of Engineering Research and Studies E-ISSN Journal of Engineering Research and Studies E-ISS 0976-79 Research Article SPECTRAL SOLUTIO OF STEADY STATE CODUCTIO I ARBITRARY QUADRILATERAL DOMAIS Alavani Chitra R 1*, Joshi Pallavi A 1, S Pavitran

More information

Floating Point Fault Tolerance with Backward Error Assertions

Floating Point Fault Tolerance with Backward Error Assertions Floating Point Fault Tolerance with Backward Error Assertions Daniel Boley Gene H. Golub * Samy Makar Nirmal Saxena Edward J. McCluskey Computer Science Dept. Computer Science Dept. Center for Reliable

More information

IIAIIIIA-II is called the condition number. Similarly, if x + 6x satisfies

IIAIIIIA-II is called the condition number. Similarly, if x + 6x satisfies SIAM J. ScI. STAT. COMPUT. Vol. 5, No. 2, June 1984 (C) 1984 Society for Industrial and Applied Mathematics OO6 CONDITION ESTIMATES* WILLIAM W. HAGERf Abstract. A new technique for estimating the 11 condition

More information

A class of communication-avoiding algorithms for solving general dense linear systems on CPU/GPU parallel machines

A class of communication-avoiding algorithms for solving general dense linear systems on CPU/GPU parallel machines Available online at www.sciencedirect.com Procedia Computer Science 9 (2012 ) 17 26 International Conference on Computational Science, ICCS 2012 A class of communication-avoiding algorithms for solving

More information

Blocked Schur Algorithms for Computing the Matrix Square Root

Blocked Schur Algorithms for Computing the Matrix Square Root Blocked Schur Algorithms for Computing the Matrix Square Root Edvin Deadman 1, Nicholas J. Higham 2,andRuiRalha 3 1 Numerical Algorithms Group edvin.deadman@nag.co.uk 2 University of Manchester higham@maths.manchester.ac.uk

More information

LAPACK. Linear Algebra PACKage. Janice Giudice David Knezevic 1

LAPACK. Linear Algebra PACKage. Janice Giudice David Knezevic 1 LAPACK Linear Algebra PACKage 1 Janice Giudice David Knezevic 1 Motivating Question Recalling from last week... Level 1 BLAS: vectors ops Level 2 BLAS: matrix-vectors ops 2 2 O( n ) flops on O( n ) data

More information

Algebraic Iterative Methods for Computed Tomography

Algebraic Iterative Methods for Computed Tomography Algebraic Iterative Methods for Computed Tomography Per Christian Hansen DTU Compute Department of Applied Mathematics and Computer Science Technical University of Denmark Per Christian Hansen Algebraic

More information

Module 5.5: nag sym bnd lin sys Symmetric Banded Systems of Linear Equations. Contents

Module 5.5: nag sym bnd lin sys Symmetric Banded Systems of Linear Equations. Contents Module Contents Module 5.5: nag sym bnd lin sys Symmetric Banded Systems of nag sym bnd lin sys provides a procedure for solving real symmetric or complex Hermitian banded systems of linear equations with

More information

BMVC 1996 doi: /c.10.41

BMVC 1996 doi: /c.10.41 On the use of the 1D Boolean model for the description of binary textures M Petrou, M Arrigo and J A Vons Dept. of Electronic and Electrical Engineering, University of Surrey, Guildford GU2 5XH, United

More information

Comparing SIMD and MIMD Programming Modes Ravikanth Ganesan, Kannan Govindarajan, and Min-You Wu Department of Computer Science State University of Ne

Comparing SIMD and MIMD Programming Modes Ravikanth Ganesan, Kannan Govindarajan, and Min-You Wu Department of Computer Science State University of Ne Comparing SIMD and MIMD Programming Modes Ravikanth Ganesan, Kannan Govindarajan, and Min-You Wu Department of Computer Science State University of New York Bualo, NY 14260 Abstract The Connection Machine

More information

International Journal of Foundations of Computer Science c World Scientic Publishing Company DFT TECHNIQUES FOR SIZE ESTIMATION OF DATABASE JOIN OPERA

International Journal of Foundations of Computer Science c World Scientic Publishing Company DFT TECHNIQUES FOR SIZE ESTIMATION OF DATABASE JOIN OPERA International Journal of Foundations of Computer Science c World Scientic Publishing Company DFT TECHNIQUES FOR SIZE ESTIMATION OF DATABASE JOIN OPERATIONS KAM_IL SARAC, OMER E GEC_IO GLU, AMR EL ABBADI

More information

NetSolve: A Network Server. for Solving Computational Science Problems. November 27, Abstract

NetSolve: A Network Server. for Solving Computational Science Problems. November 27, Abstract NetSolve: A Network Server for Solving Computational Science Problems Henri Casanova Jack Dongarra? y November 27, 1995 Abstract This paper presents a new system, called NetSolve, that allows users to

More information

A Fast and Exact Simulation Algorithm for General Gaussian Markov Random Fields

A Fast and Exact Simulation Algorithm for General Gaussian Markov Random Fields A Fast and Exact Simulation Algorithm for General Gaussian Markov Random Fields HÅVARD RUE DEPARTMENT OF MATHEMATICAL SCIENCES NTNU, NORWAY FIRST VERSION: FEBRUARY 23, 1999 REVISED: APRIL 23, 1999 SUMMARY

More information

Sparse matrices, graphs, and tree elimination

Sparse matrices, graphs, and tree elimination Logistics Week 6: Friday, Oct 2 1. I will be out of town next Tuesday, October 6, and so will not have office hours on that day. I will be around on Monday, except during the SCAN seminar (1:25-2:15);

More information

SDLS: a Matlab package for solving conic least-squares problems

SDLS: a Matlab package for solving conic least-squares problems SDLS: a Matlab package for solving conic least-squares problems Didier Henrion, Jérôme Malick To cite this version: Didier Henrion, Jérôme Malick. SDLS: a Matlab package for solving conic least-squares

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms Chapter 3 Dense Linear Systems Section 3.3 Triangular Linear Systems Michael T. Heath and Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign

More information

MOTION. Feature Matching/Tracking. Control Signal Generation REFERENCE IMAGE

MOTION. Feature Matching/Tracking. Control Signal Generation REFERENCE IMAGE Head-Eye Coordination: A Closed-Form Solution M. Xie School of Mechanical & Production Engineering Nanyang Technological University, Singapore 639798 Email: mmxie@ntuix.ntu.ac.sg ABSTRACT In this paper,

More information

Accelerating GPU kernels for dense linear algebra

Accelerating GPU kernels for dense linear algebra Accelerating GPU kernels for dense linear algebra Rajib Nath, Stanimire Tomov, and Jack Dongarra Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville {rnath1, tomov,

More information

MATH 423 Linear Algebra II Lecture 17: Reduced row echelon form (continued). Determinant of a matrix.

MATH 423 Linear Algebra II Lecture 17: Reduced row echelon form (continued). Determinant of a matrix. MATH 423 Linear Algebra II Lecture 17: Reduced row echelon form (continued). Determinant of a matrix. Row echelon form A matrix is said to be in the row echelon form if the leading entries shift to the

More information

Convex Optimization / Homework 2, due Oct 3

Convex Optimization / Homework 2, due Oct 3 Convex Optimization 0-725/36-725 Homework 2, due Oct 3 Instructions: You must complete Problems 3 and either Problem 4 or Problem 5 (your choice between the two) When you submit the homework, upload a

More information

NAG Fortran Library Routine Document F08BHF (DTZRZF).1

NAG Fortran Library Routine Document F08BHF (DTZRZF).1 NAG Fortran Library Routine Document Note: before using this routine, please read the Users Note for your implementation to check the interpretation of bold italicised terms and other implementation-dependent

More information

Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors

Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors Andrés Tomás 1, Zhaojun Bai 1, and Vicente Hernández 2 1 Department of Computer

More information

Abstract. 1 Introduction

Abstract. 1 Introduction The performance of fast Givens rotations problem implemented with MPI extensions in multicomputers L. Fernández and J. M. García Department of Informática y Sistemas, Universidad de Murcia, Campus de Espinardo

More information

Table-Lookup Approach for Compiling Two-Level Data-Processor Mappings in HPF Kuei-Ping Shih y, Jang-Ping Sheu y, and Chua-Huang Huang z y Department o

Table-Lookup Approach for Compiling Two-Level Data-Processor Mappings in HPF Kuei-Ping Shih y, Jang-Ping Sheu y, and Chua-Huang Huang z y Department o Table-Lookup Approach for Compiling Two-Level Data-Processor Mappings in HPF Kuei-Ping Shih y, Jang-Ping Sheu y, and Chua-Huang Huang z y Department of Computer Science and Information Engineering National

More information

The LINPACK Benchmark on the Fujitsu AP 1000

The LINPACK Benchmark on the Fujitsu AP 1000 The LINPACK Benchmark on the Fujitsu AP 1000 Richard P. Brent Computer Sciences Laboratory Australian National University Canberra, Australia Abstract We describe an implementation of the LINPACK Benchmark

More information

Extra-High Speed Matrix Multiplication on the Cray-2. David H. Bailey. September 2, 1987

Extra-High Speed Matrix Multiplication on the Cray-2. David H. Bailey. September 2, 1987 Extra-High Speed Matrix Multiplication on the Cray-2 David H. Bailey September 2, 1987 Ref: SIAM J. on Scientic and Statistical Computing, vol. 9, no. 3, (May 1988), pg. 603{607 Abstract The Cray-2 is

More information

Matrices. Chapter Matrix A Mathematical Definition Matrix Dimensions and Notation

Matrices. Chapter Matrix A Mathematical Definition Matrix Dimensions and Notation Chapter 7 Introduction to Matrices This chapter introduces the theory and application of matrices. It is divided into two main sections. Section 7.1 discusses some of the basic properties and operations

More information

reasonable to store in a software implementation, it is likely to be a signicant burden in a low-cost hardware implementation. We describe in this pap

reasonable to store in a software implementation, it is likely to be a signicant burden in a low-cost hardware implementation. We describe in this pap Storage-Ecient Finite Field Basis Conversion Burton S. Kaliski Jr. 1 and Yiqun Lisa Yin 2 RSA Laboratories 1 20 Crosby Drive, Bedford, MA 01730. burt@rsa.com 2 2955 Campus Drive, San Mateo, CA 94402. yiqun@rsa.com

More information

y(b)-- Y[a,b]y(a). EQUATIONS ON AN INTEL HYPERCUBE*

y(b)-- Y[a,b]y(a). EQUATIONS ON AN INTEL HYPERCUBE* SIAM J. ScI. STAT. COMPUT. Vol. 12, No. 6, pp. 1480-1485, November 1991 ()1991 Society for Industrial and Applied Mathematics 015 SOLUTION OF LINEAR SYSTEMS OF ORDINARY DIFFERENTIAL EQUATIONS ON AN INTEL

More information

Wei Shu and Min-You Wu. Abstract. partitioning patterns, and communication optimization to achieve a speedup.

Wei Shu and Min-You Wu. Abstract. partitioning patterns, and communication optimization to achieve a speedup. Sparse Implementation of Revised Simplex Algorithms on Parallel Computers Wei Shu and Min-You Wu Abstract Parallelizing sparse simplex algorithms is one of the most challenging problems. Because of very

More information

Performance Evaluation of a New Parallel Preconditioner

Performance Evaluation of a New Parallel Preconditioner Performance Evaluation of a New Parallel Preconditioner Keith D. Gremban Gary L. Miller Marco Zagha School of Computer Science Carnegie Mellon University 5 Forbes Avenue Pittsburgh PA 15213 Abstract The

More information

J.A.J.Hall, K.I.M.McKinnon. September 1996

J.A.J.Hall, K.I.M.McKinnon. September 1996 PARSMI, a parallel revised simplex algorithm incorporating minor iterations and Devex pricing J.A.J.Hall, K.I.M.McKinnon September 1996 MS 96-012 Supported by EPSRC research grant GR/J0842 Presented at

More information

All use is subject to licence, see For any commercial application, a separate licence must be signed.

All use is subject to licence, see   For any commercial application, a separate licence must be signed. HS PAKAGE SPEIFIATION HS 2007 1 SUMMARY This routine uses the Generalized Minimal Residual method with restarts every m iterations, GMRES(m), to solve the n n unsymmetric linear system Ax = b, optionally

More information

Sparse Linear Systems

Sparse Linear Systems 1 Sparse Linear Systems Rob H. Bisseling Mathematical Institute, Utrecht University Course Introduction Scientific Computing February 22, 2018 2 Outline Iterative solution methods 3 A perfect bipartite

More information

Lab # 2 - ACS I Part I - DATA COMPRESSION in IMAGE PROCESSING using SVD

Lab # 2 - ACS I Part I - DATA COMPRESSION in IMAGE PROCESSING using SVD Lab # 2 - ACS I Part I - DATA COMPRESSION in IMAGE PROCESSING using SVD Goals. The goal of the first part of this lab is to demonstrate how the SVD can be used to remove redundancies in data; in this example

More information

Analysis of the GCR method with mixed precision arithmetic using QuPAT

Analysis of the GCR method with mixed precision arithmetic using QuPAT Analysis of the GCR method with mixed precision arithmetic using QuPAT Tsubasa Saito a,, Emiko Ishiwata b, Hidehiko Hasegawa c a Graduate School of Science, Tokyo University of Science, 1-3 Kagurazaka,

More information

Project Report. 1 Abstract. 2 Algorithms. 2.1 Gaussian elimination without partial pivoting. 2.2 Gaussian elimination with partial pivoting

Project Report. 1 Abstract. 2 Algorithms. 2.1 Gaussian elimination without partial pivoting. 2.2 Gaussian elimination with partial pivoting Project Report Bernardo A. Gonzalez Torres beaugonz@ucsc.edu Abstract The final term project consist of two parts: a Fortran implementation of a linear algebra solver and a Python implementation of a run

More information

However, m pq is just an approximation of M pq. As it was pointed out by Lin [2], more precise approximation can be obtained by exact integration of t

However, m pq is just an approximation of M pq. As it was pointed out by Lin [2], more precise approximation can be obtained by exact integration of t FAST CALCULATION OF GEOMETRIC MOMENTS OF BINARY IMAGES Jan Flusser Institute of Information Theory and Automation Academy of Sciences of the Czech Republic Pod vodarenskou vez 4, 82 08 Prague 8, Czech

More information

Network. Department of Statistics. University of California, Berkeley. January, Abstract

Network. Department of Statistics. University of California, Berkeley. January, Abstract Parallelizing CART Using a Workstation Network Phil Spector Leo Breiman Department of Statistics University of California, Berkeley January, 1995 Abstract The CART (Classication and Regression Trees) program,

More information

Problem Set 2 Geometry, Algebra, Reality

Problem Set 2 Geometry, Algebra, Reality Problem Set 2 Geometry, Algebra, Reality Applied Mathematics 121 Spring 2011 Due 5:00 PM, Friday, February 11, 2011 Announcements The assignment is due by 5:00 PM, Friday, February 11, 2011. Readings:

More information

Column-Action Methods in Image Reconstruction

Column-Action Methods in Image Reconstruction Column-Action Methods in Image Reconstruction Per Christian Hansen joint work with Tommy Elfving Touraj Nikazad Overview of Talk Part 1: the classical row-action method = ART The advantage of algebraic

More information

CS 770G - Parallel Algorithms in Scientific Computing

CS 770G - Parallel Algorithms in Scientific Computing CS 770G - Parallel lgorithms in Scientific Computing Dense Matrix Computation II: Solving inear Systems May 28, 2001 ecture 6 References Introduction to Parallel Computing Kumar, Grama, Gupta, Karypis,

More information

Section 3.1 Gaussian Elimination Method (GEM) Key terms

Section 3.1 Gaussian Elimination Method (GEM) Key terms Section 3.1 Gaussian Elimination Method (GEM) Key terms Rectangular systems Consistent system & Inconsistent systems Rank Types of solution sets RREF Upper triangular form & back substitution Nonsingular

More information

NAG Library Function Document nag_zgelsy (f08bnc)

NAG Library Function Document nag_zgelsy (f08bnc) NAG Library Function Document nag_zgelsy () 1 Purpose nag_zgelsy () computes the minimum norm solution to a complex linear least squares problem minkb Axk 2 x using a complete orthogonal factorization

More information

QR Decomposition on GPUs

QR Decomposition on GPUs QR Decomposition QR Algorithms Block Householder QR Andrew Kerr* 1 Dan Campbell 1 Mark Richards 2 1 Georgia Tech Research Institute 2 School of Electrical and Computer Engineering Georgia Institute of

More information

Ecient Cubic B-spline Image interpolation on a GPU. 1 Abstract. 2 Introduction. F. Champagnat and Y. Le Sant. September 1, 2011

Ecient Cubic B-spline Image interpolation on a GPU. 1 Abstract. 2 Introduction. F. Champagnat and Y. Le Sant. September 1, 2011 Ecient Cubic B-spline Image interpolation on a GPU F. Champagnat and Y. Le Sant September 1, 2011 1 Abstract Application of geometric transformation to images requires an interpolation step. When applied

More information

Aim. Structure and matrix sparsity: Part 1 The simplex method: Exploiting sparsity. Structure and matrix sparsity: Overview

Aim. Structure and matrix sparsity: Part 1 The simplex method: Exploiting sparsity. Structure and matrix sparsity: Overview Aim Structure and matrix sparsity: Part 1 The simplex method: Exploiting sparsity Julian Hall School of Mathematics University of Edinburgh jajhall@ed.ac.uk What should a 2-hour PhD lecture on structure

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

Parallel Auction Algorithm for Linear Assignment Problem

Parallel Auction Algorithm for Linear Assignment Problem Parallel Auction Algorithm for Linear Assignment Problem Xin Jin 1 Introduction The (linear) assignment problem is one of classic combinatorial optimization problems, first appearing in the studies on

More information

B(FOM) 2. Block full orthogonalization methods for functions of matrices. Kathryn Lund. December 12, 2017

B(FOM) 2. Block full orthogonalization methods for functions of matrices. Kathryn Lund. December 12, 2017 B(FOM) 2 Block full orthogonalization methods for functions of matrices Kathryn Lund December 12, 2017 The block full orthogonalization methods for functions of matrices (denoted B(FOM) 2, for short) are

More information

Numerical Algorithms

Numerical Algorithms Chapter 10 Slide 464 Numerical Algorithms Slide 465 Numerical Algorithms In textbook do: Matrix multiplication Solving a system of linear equations Slide 466 Matrices A Review An n m matrix Column a 0,0

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information