
Implementation of QR Up- and Downdating on a Massively Parallel Computer

Claus Bendtsen†, Per Christian Hansen†, Kaj Madsen‡, Hans Bruun Nielsen‡, Mustafa Pınar‡

July 8, 1996

Abstract. We describe an implementation of QR up- and downdating on a massively parallel computer (the Connection Machine CM-200) and show that the algorithm maps well onto the computer. In particular, we show how the use of corrected semi-normal equations for downdating can be efficiently implemented. We also illustrate the use of our algorithms in a new LP algorithm.

Key words. up- and downdating of QR factorization, corrected semi-normal equations, CM-200.

1 Introduction

In this paper we describe an efficient implementation of updating and downdating of a QR factorization on the Connection Machine CM-200, which is a massively parallel SIMD computer [11]. Many of our considerations are general for massively parallel computers.

This project was sponsored by the Danish Center for Parallel Computer Research. M. Pınar was also sponsored by the Danish Natural Science Research Council, Grant No.

† UNI-C (Danish Computing Centre for Research and Education), Building 305, Technical University of Denmark, DK-2800 Lyngby, Denmark (Claus.Btsen@uni-c.dk, Per.Christian.Hansen@uni-c.dk).
‡ Institute for Numerical Analysis, Building 305, Technical University of Denmark, DK-2800 Lyngby, Denmark (numikm@vm.uni-c.dk, numimpi@uts.uni-c.dk).

Many linear algebra routines can be implemented very efficiently on massively parallel computers [2]. However, it is not immediately clear whether updating and downdating of a QR factorization (in particular in the case where only the triangular matrix R is stored) provides enough parallelism for an efficient SIMD implementation. The main goal of this paper is to show that this is indeed the case.

Our work was motivated by the use of QR up- and downdating in a new algorithm for linear programming described in [7]. The algorithm was implemented on an 8K CM-200 located at UNI-C [8], and since there are no routines for QR up- and downdating in the CMSSL scientific subroutine library for the Connection Machine, it was necessary to implement such routines.

Throughout the paper, we are concerned with QR factorizations of the form

  A = Q R                                                     (1)

with A ∈ R^{m×n}, Q ∈ R^{m×m}, R ∈ R^{m×n}, m ≥ n. We assume that the matrix Q is not stored, and we want to recompute the triangular factor R efficiently when a row is either appended to A (updating) or removed from A (downdating). The algorithm for updating is a classical one (see, e.g., [5, §12.6]) and the downdating algorithm is a new "hybrid" algorithm using corrected semi-normal equations from [1]. Both algorithms are numerically stable.

Our paper is organized as follows. In §2 we summarize the up- and downdating algorithms and we investigate numerically the accuracy of the downdating algorithm. Implementation details are given in §3. In §4 we illustrate the use of our implementation in connection with the above-mentioned LP algorithm.

2 Up- and Downdating Algorithms

In this section we briefly summarize the algorithms for up- and downdating of a QR factorization.

2.1 Updating

If we wish to update the matrix A by an arbitrary new row u^T, then this row can always be permuted to the top. Hence, there is no loss of generality

in assuming an updating of the form

  Ã = ( u^T )
      (  A  )

where Ã is the updated matrix. If we rewrite this equation using (1), then we obtain

  Ã = ( 1  0^T ) H,    where  H = ( u^T )
      ( 0   Q  )                  (  R  )

has upper Hessenberg form. Now, if Ã = Q̃ R̃ is the QR factorization of Ã, then it follows that we can obtain R̃ by reducing H to upper triangular form by means of orthogonal transformations.

Two types of orthogonal transformations are relevant here: Givens rotations and fast Givens rotations [5, §5.1]. In connection with the QR updating problem, the fast Givens method requires O(2.5n²) flops¹ while the classical Givens method requires O(3n²) flops. The fast Givens rotations have a reputation for being impractical because of the potential danger of over- or underflow; see [5, p. 209]. However, in this particular application this is no problem, since each row is only involved in two rotations, so the maximum growth in the elements is limited by a factor 4. We have therefore decided to use the fast Givens rotations. The detailed updating algorithm is given in the Appendix.

We mention in passing that the QR factorization provided in the CMSSL library uses a block-cyclic data layout [6]. We can easily make our updating algorithm conform with this layout by performing the fast Givens rotations in the same order as this layout (the details are straightforward and are omitted here). In this way, our routines are compatible with the CMSSL routines.

2.2 Classical (LINPACK) Downdating

If we want to remove an arbitrary row u^T from the matrix A, then again without loss of generality we can assume that A has the form

  A = ( u^T )
      (  Ã  )

¹ Here, one flop is either an addition or a multiplication.

where Ã is equal to A with the first row u^T deleted. Now let q^T denote the first row of the matrix Q in the QR factorization of A. Then there exists an orthogonal matrix G such that

  G^T q = α (1, 0, ..., 0)^T,   with α = ±1.

In particular, if G is constructed as a sequence of Givens rotations,

  G = G_{m−1} ··· G_1,

where each rotation G_i involves elements i and i+1 of q, then it follows that G^T R has upper Hessenberg form, i.e.,

  G^T R = G_1^T ··· G_{m−1}^T R = ( v^T )                     (2)
                                  (  R̃  )

Moreover, we have that

  Q G = ( α  0 )
        ( 0  Q̃ )

and therefore

  A = ( u^T ) = Q R = Q G G^T R = ( α  0 ) ( v^T ) = ( α v^T )
      (  Ã  )                     ( 0  Q̃ ) (  R̃  )   (  Q̃ R̃  )

in which we identify Ã = Q̃ R̃ as the desired QR factorization of Ã. This algorithm is mixed stable [9].

Often Q is not available because of storage considerations. Hence, the algorithm above must be modified to take this situation into account. Notice that only the n Givens transformations G_1, ..., G_n alter R; we need not determine G_{n+1}, ..., G_{m−1}. Now, if q_{1:n} denotes the first n components of q, then we have

  G_{n+1}^T ··· G_{m−1}^T q = ( q_{1:n} )   (n rows)
                              (    α    )   (1 row)           (3)
                              (    0    )   (m − n − 1 rows)

where

  α = ± (1 − ‖q_{1:n}‖₂²)^{1/2}.                              (4)

Since Q is not available, we first have to compute the necessary quantities in (3). The vector q_{1:n} can be computed from the system

  R^T q_{1:n} = u,                                            (5)

and then α is computed from (4). From the vector in (3) we can then construct G_n, ..., G_1 and apply these rotations to R as in (2); in this way we produce R̃. This is the well-known LINPACK algorithm [4, §10]. Again, we can use fast Givens transformations in our implementation without danger of over- or underflow, and we take into account the block cyclic layout of R conforming with the CMSSL library.

2.3 CSNE Downdating

In [1] it is shown that the LINPACK downdating algorithm can be arbitrarily inaccurate, because the sole use of R to form the downdating transformations may lead to a much more ill-conditioned problem than using both Q and R. It is therefore proposed in [1] to use corrected semi-normal equations (CSNE) to improve the accuracy of q_{1:n}, computed by (5), before it is used to construct the Givens transformations in (2). We summarize the algorithm here and refer to [1] for more details:

CSNE Downdating of R
1. Solve R^T q_{1:n} = u for q_{1:n}
2. Solve R v = q_{1:n} for v
3. Let t ← e_1 − A v, where e_1 = (1, 0, ..., 0)^T
4. Solve R^T δq_{1:n} = A^T t for δq_{1:n}
5. Let q_{1:n} ← q_{1:n} + δq_{1:n}
6. Solve R δv = δq_{1:n} for δv
7. Let t ← t − A δv
8. Let α ← ‖t‖₂
9. Continue using the LINPACK algorithm.

The key to the improved stability is the refinement of q_{1:n} in steps 2–5, combined with a more accurate computation of α in step 8 via δv and t (instead of Eq. (4)).

Applying the CSNE algorithm is obviously more expensive than applying the LINPACK algorithm. It is therefore recommended to use a hybrid algorithm, where the CSNE algorithm is only used if the system is ill-conditioned and the LINPACK algorithm is used otherwise [1]. As a measure of the conditioning of the system we use

  α² = 1 − ‖q_{1:n}‖₂² = ‖t‖₂²

and apply the CSNE algorithm if α² is less than a user-specified tolerance. We used a tolerance equal to 1/4, as recommended in [1].
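As a concrete illustration, the hybrid scheme summarized above (steps 1–9 plus the α² test) can be sketched in pure Python. For simplicity the sketch uses ordinary Givens rotations rather than the fast variant, and all names (`hybrid_downdate`, `solve_upper`, ...) are ours, not the paper's:

```python
# Illustrative sketch of the hybrid CSNE/LINPACK downdating step, using
# ordinary Givens rotations instead of the fast variant for clarity.
import math

def solve_upper(R, b):
    # back substitution for R x = b, R upper triangular
    n = len(b); x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - sum(R[i][j] * x[j] for j in range(i + 1, n))) / R[i][i]
    return x

def solve_upper_t(R, b):
    # forward substitution for R^T x = b
    n = len(b); x = [0.0] * n
    for i in range(n):
        x[i] = (b[i] - sum(R[j][i] * x[j] for j in range(i))) / R[i][i]
    return x

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def hybrid_downdate(R, A, tol=0.25):
    # Remove the first row u^T of A from the factor R.  CSNE refinement is
    # used when alpha^2 = 1 - ||q||^2 drops below tol (the paper uses 1/4).
    n = len(R); u = A[0]
    q = solve_upper_t(R, u)                          # step 1: R^T q = u
    alpha2 = 1.0 - sum(qi * qi for qi in q)
    if alpha2 < tol:                                 # ill conditioned: CSNE
        v = solve_upper(R, q)                        # step 2
        t = [-s for s in matvec(A, v)]; t[0] += 1.0  # step 3: t = e1 - A v
        dq = solve_upper_t(R, matvec(list(zip(*A)), t))  # step 4
        q = [a + b for a, b in zip(q, dq)]           # step 5
        dv = solve_upper(R, dq)                      # step 6
        t = [a - b for a, b in zip(t, matvec(A, dv))]    # step 7
        alpha = math.sqrt(sum(s * s for s in t))     # step 8
    else:
        alpha = math.sqrt(alpha2)
    # LINPACK part: rotations G_n ... G_1 reduce (q, alpha) to +-e_1,
    # applied simultaneously to the rows of [R; 0].
    S = [row[:] for row in R] + [[0.0] * n]
    rho = alpha
    for i in range(n - 1, -1, -1):
        h = math.hypot(q[i], rho)
        c, s = q[i] / h, rho / h
        S[i], S[i + 1] = ([c * a + s * b for a, b in zip(S[i], S[i + 1])],
                          [-s * a + c * b for a, b in zip(S[i], S[i + 1])])
        rho = h
    return S[1:]   # rows 2..n+1 hold the downdated factor
```

In exact arithmetic the returned factor satisfies R̃^T R̃ = Ã^T Ã, which is how the sketch can be checked against a direct factorization of Ã.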

2.4 Accuracy of CSNE Downdating

The updating algorithm is known to be numerically stable and to yield good accuracy, so we concentrate on studying the accuracy of the hybrid CSNE downdating algorithm. In order to test the accuracy of this algorithm and, in particular, its sensitivity to an ill-conditioned matrix A, we generate test matrices by the following strategy:

1. Generate a random (n + 1) × n matrix A with n even.
2. Modify column n/2 of A as follows:

     A_{:,n/2} ← ε A_{:,n/2} + ((1 − ε)/2) (A_{:,n/2−1} + A_{:,n/2+1}),   where 0 ≤ ε ≤ 1.

   In this way we can use ε to control the condition of A, since (for small ε) column n/2 is almost a linear combination of the columns n/2 − 1 and n/2 + 1.
3. We also study the effect of doing more than one iteration in the above algorithm, in the sense that we repeat steps 4–7 together with the additional step v ← v + δv a number of times.

The results are shown in Fig. 1 for a series of random matrices. For each value of ε we generate 10 random matrices according to the above scheme, and for each matrix we compute R̃ by means of the LINPACK algorithm, the CSNE algorithm, and the CSNE algorithm with 1 additional iteration. We compare R̃ with the matrix R̃_direct from a QR factorization of Ã (consisting of the bottom 128 rows of A). Figure 1 shows the relative error ‖R̃ − R̃_direct‖₂ / ‖R̃_direct‖₂ as a function of the parameter ε. As expected, the CSNE algorithm is superior to the LINPACK algorithm with respect to sensitivity to an ill-conditioned problem. Furthermore, we see that extra iterations in the CSNE algorithm never improve the accuracy of R̃.

3 SIMD Implementation of Up- and Downdating

3.1 General Considerations

When implementing algorithms on a parallel computer it is essential to choose an appropriate data layout in order to be able to operate simultaneously on as many processing elements as possible. On massively parallel

Figure 1: Accuracy of the hybrid CSNE algorithm and the LINPACK algorithm when downdating the test matrices described above. (The plot shows the relative error ‖R̃ − R̃_direct‖₂ / ‖R̃_direct‖₂ versus ε for the LINPACK algorithm, the CSNE algorithm, and CSNE with 1 extra iteration.)

computers we generally have three fundamentally different parallel layouts to choose from.

- Row-oriented layout: Each row is assigned to parallel processors and all the rows are stacked serially.
- Column-oriented layout: Each column is assigned to parallel processors and all the columns are stacked serially.
- Matrix-oriented layout: Rows as well as columns are assigned to parallel processors. Different data distributions are possible in this approach.

Generally some sort of matrix-oriented layout is used in conjunction with matrix operations since, even for matrices of small dimensions, this ensures that elements are allocated to all processors. The different approaches will now be addressed with respect to QR up- and downdating.

3.1.1 Row-Oriented Layout

Using this approach one can perform the fast Givens rotations in the up- and downdating in parallel without introducing communication other than broadcasting the constants α and β which define the Givens rotations. When a rotation is performed between two rows, the processors have to be activated a number of times. The resulting efficiency can be approximated as follows. Let n be the number of columns and p the number of processors, and write n = ⌊n/p⌋ p + q, where q is the remainder. Then

  E ≈ (n²/2) / [ (p²/2) ⌊n/p⌋ (⌊n/p⌋ + 1) + p (⌊n/p⌋ + 1) q ].

The drawback of this layout is that a large problem size relative to the number of processors is needed in order to obtain decent results. E.g., if n = p/2 only 25% of the processor performance is utilized, as opposed to 50% if n = p or approximately 67% if n = 2p.

The solution of the linear systems related to the downdating can be efficiently implemented using the row version of back substitution [5, p. 88], whereas transposed systems can be efficiently solved using the column version of forward substitution [5, p. 89].
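The efficiency estimate above can be checked against the utilization figures quoted in the text. A minimal sketch (the function name is ours):

```python
# Evaluate the row-oriented efficiency estimate
#   E = (n^2/2) / ( (p^2/2)*floor(n/p)*(floor(n/p)+1) + p*(floor(n/p)+1)*q )
# for the three cases quoted in the text.
def efficiency(n, p):
    b, q = divmod(n, p)              # n = b*p + q
    return (0.5 * n * n) / (0.5 * p * p * b * (b + 1) + p * (b + 1) * q)

p = 8192                             # e.g. an 8K machine
print(round(efficiency(p // 2, p), 2))   # -> 0.25
print(round(efficiency(p, p), 2))        # -> 0.5
print(round(efficiency(2 * p, p), 2))    # -> 0.67
```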

3.1.2 Column-Oriented Layout

This layout cannot be efficiently used, since the Givens rotations would have to be performed on one processor at a time.

3.1.3 Matrix-Oriented Layout

If the processors are considered as configured in a grid, then a row in the matrix will live on a row of processors. Each row as well as each column may be wrapped on the processor grid. One possible matrix-oriented layout is the block cyclic layout often used for gaining good load balance in linear algebra applications [6].

If we use a block cyclic layout then the Givens rotations can be performed in parallel without introducing communication other than broadcasting the constants α and β when the rows involved in the rotation are part of the same block, and by introducing nearest neighbor communication equal to one shift operation if they are not part of the same block. Only one shift operation is needed to ensure that all the forward/backward references live on the same processor as the active row. It can be shown that the CSNE downdating can be performed in the block cyclic domain without introducing any overhead.

The efficiency of this layout is by nature much lower than for the row-oriented layout, since only one or two rows of processors will be active simultaneously. This layout should not be discarded, though, since overall considerations of efficiency might imply using a matrix-oriented layout. In any case this layout is attractive if the alternative is refactorization.

3.2 Implementation on the Connection Machine CM-200

The updating and hybrid downdating algorithms have been implemented in CM-FORTRAN on an 8K CM-200, but the code also applies to the slower CM-2. We have implemented the row-oriented as well as the matrix-oriented layout. It has not been possible to activate only the processors on and to the right of the diagonal of the matrix, which means that we do operations on zeros.
Furthermore, in the matrix-oriented implementation it has not been possible to detect whether two rows live on the same processors, resulting in unnecessary activation of the communication primitives.

The implementation using a matrix-oriented layout is based on a (:news, :news) layout, has storage requirements of roughly three times the size of

the factorized matrix, and uses n − 1 shift operations. Since no shift operations are available for the block cyclic layout, this implementation assumes normal ordering in order to avoid one s-operation per loop iteration. Both up- and downdating perform very poorly; less than 5 Mflop/s is obtained. The bottleneck can be identified from the assembler code and is the creation of the masks related to performing the Givens rotations.

Figure 2: Performance of up- and downdating compared to the CMSSL QR factorization. (Execution time in seconds versus n for the CMSSL QR factorization, CSNE downdating, LINPACK downdating, and updating.)

² The timings were produced on the 8K CM-200 present at UNI-C with CMF compiler V. 1.2 and CMSSL library V. 3.1 Beta 2. The QR factorization timings were produced on a (:news, :news) layout since this seems to yield the highest performance.

The implementation which is based on a row-oriented layout uses a (:serial, :news) layout and has storage requirements of roughly 2n². Here the block cyclic ordering is efficiently adaptable. The performance of these routines is shown in Fig. 2 together with results from refactorization using the CMSSL library². It is seen that the CSNE downdating is an

order of magnitude slower than the updating and LINPACK downdating, but it is still an order of magnitude faster than the refactorization for large n. Since it has not been possible to avoid the excessive engagement of virtual processors, it should fairly easily be possible to halve the computational work by using a lower-level language than CM-FORTRAN.

4 An Application in Linear Programming

In this section we illustrate the use of the QR up- and downdating routines in an implementation of a new algorithm for linear programming based on a continuation algorithm. Today, such algorithms are realistic alternatives to the classical simplex method. Consider the normalized linear programming problem

  min_x c^T x   subject to   G x = b,   −e ≤ x ≤ e,           (6)

where G is m₁ × n, c is an n-vector, and e = (1, ..., 1)^T. The continuation algorithm used here [7] solves this problem by first solving its dual problem,

  min_y ‖G^T y + c‖₁ + b^T y,                                 (7)

and then detecting x from the residual vector G^T y + c. The dual problem (7) is solved by an algorithm that essentially substitutes the non-smooth 1-norm with a smooth "Huber" norm with a threshold, where the components of the residual vector G^T y + c are treated differently depending on whether they are greater than or smaller than the threshold. The key idea is to start with a large threshold and then reduce it until the solution of (7) can be identified from the solution to the "Huber" problem. This happens for a positive value of the threshold.

The main computational problem during the algorithm outlined above is to solve a series of linear systems of equations, where the coefficient matrix in each iteration step is modified by a few rank-one up- and downdates. Instead of computing a refactorization of the matrix each time, we use the up- and downdating routines described above and fall back on refactorization only in the few cases when it is necessary due to rank deficiency. See [8] for more details about the implementation of the complete algorithm.
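To make the smoothing step concrete, here is a minimal sketch of a Huber-type substitute for |r|: quadratic for residuals below the threshold, linear above it. This particular scaling is one standard choice and is not necessarily the exact function used in [7]; the function names are ours:

```python
# Huber-type smoothing of |r| with threshold gamma: small residuals are
# treated quadratically, large ones linearly, and the two pieces join
# continuously at |r| = gamma.  One standard scaling, shown for illustration.
def huber(r, gamma):
    if abs(r) <= gamma:
        return r * r / (2.0 * gamma)
    return abs(r) - gamma / 2.0

def smoothed_dual_objective(residuals, b_dot_y, gamma):
    # smooth approximation of ||G^T y + c||_1 + b^T y from (7):
    # `residuals` holds the components of G^T y + c
    return sum(huber(r, gamma) for r in residuals) + b_dot_y
```

As the threshold tends to zero, huber(r, gamma) tends to |r|, so the smoothed objective recovers the 1-norm problem; the continuation algorithm starts with a large threshold and reduces it.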
The continuation algorithm was compared with an implementation of the simplex algorithm GEN SIMPLEX provided in the CMSSL library [10].

Figure 3: Execution times for the CMSSL simplex algorithm and the continuation method, versus problem size (m₁ + n)n.

The two algorithms were tested with a number of randomly generated dense matrices G of varying size. The execution times are shown in Fig. 3 as a function of the quantity (m₁ + n)n, which is a measure of the "size" of the LP problem. Each point corresponds to the average of five measured timings. We see that the two algorithms require approximately the same execution time, and that the continuation algorithm is asymptotically faster than the simplex algorithm for these dense test problems. Without the up- and downdating routines, the continuation algorithm would be up to 10 times slower for large n. Moreover, we remark that the continuation algorithm typically produces more accurate results than the simplex algorithm. See [8] for details.

5 Conclusion

We have shown that stable up- and downdating of a QR factorization can be implemented efficiently on a massively parallel computer, even without the use of low-level routines. We have also shown that a new LP algorithm, based on our routines, is competitive with a simplex routine from the CMSSL library for the Connection Machines.

References

[1] A. Björck, H. Park & L. Eldén, Accurate downdating of least squares solutions, SIAM J. Matrix Anal. Appl. 15 (1994), to appear.
[2] J. Demmel, M. Heath & H. A. van der Vorst, Parallel numerical linear algebra, Acta Numerica (1993), 111–197.
[3] J. J. Dongarra, I. S. Duff, D. C. Sorensen & H. A. van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers, SIAM, Philadelphia.
[4] J. J. Dongarra, C. B. Moler, J. R. Bunch & G. W. Stewart, LINPACK Users' Guide, SIAM.
[5] G. H. Golub & C. F. van Loan, Matrix Computations, Second Edition, The Johns Hopkins University Press, Baltimore.

[6] W. Lichtenstein & S. L. Johnsson, Block cyclic dense linear algebra, Report TMC 215, Thinking Machines Corporation, 1991; to appear in SIAM J. Sci. Comp.
[7] K. Madsen, H. B. Nielsen & M. C. Pinar, A New Finite Continuation Algorithm for Linear Programming, Report NI-93-07, Institute for Numerical Analysis, Technical University of Denmark; submitted to SIAM J. Optim.
[8] K. Madsen, H. B. Nielsen, M. C. Pinar, C. Bendtsen & P. C. Hansen, Solving Bounded Variable Linear Problems on the Connection Machine CM-200, Report NI-93-08, Institute for Numerical Analysis, Technical University of Denmark.
[9] C. C. Paige, Error analysis of some techniques for updating orthogonal decompositions, Math. Comp. 34 (1980), 465–471.
[10] Thinking Machines Corporation, CMSSL for CM Fortran, Version 3.0, CM-200 Edition.
[11] Thinking Machines Corporation, CM-200 Technical Summary.

Appendix: Algorithms

In this appendix we give the detailed up- and downdating algorithms. The implementations are based on [5, §5].

Updating of R

function R̃ = qrud(R, u)
  [m, n] ← size(R)
  [α, β, type, δ₁, δ₂] ← fastgivens(u₁, R_{1,1}, 1, 1);  μ ← 1/√δ₁;  δ ← δ₂
  if type = 1
    R̃_{1,1:n} ← μ (β u_{1:n} + R_{1,1:n});  R_{1,1:n} ← u_{1:n} + α R_{1,1:n}
  else
    R̃_{1,1:n} ← μ (u_{1:n} + β R_{1,1:n});  R_{1,1:n} ← α u_{1:n} + R_{1,1:n}
  for i = 2 : n − 1
    [α, β, type, δ₁, δ₂] ← fastgivens(R_{i−1,i}, R_{i,i}, δ, 1);  μ ← 1/√δ₁;  δ ← δ₂
    if type = 1
      R̃_{i,i:n} ← μ (β R_{i−1,i:n} + R_{i,i:n});  R_{i,i:n} ← R_{i−1,i:n} + α R_{i,i:n}
    else
      R̃_{i,i:n} ← μ (R_{i−1,i:n} + β R_{i,i:n});  R_{i,i:n} ← α R_{i−1,i:n} + R_{i,i:n}
  [α, β, type, δ₁, δ₂] ← fastgivens(R_{n−1,n}, R_{n,n}, δ, 1);  μ ← 1/√δ₁
  if type = 1
    R̃_{n,n} ← μ (β R_{n−1,n} + R_{n,n})
  else
    R̃_{n,n} ← μ (R_{n−1,n} + β R_{n,n})

The flop count for this algorithm is approximately 5n²/2. Here, the function fastgivens is implemented as follows:

function [α, β, type, δ₁, δ₂] = fastgivens(x₁, x₂, δ₁, δ₂)
  if x₂ ≠ 0
    α ← −x₁/x₂;  β ← −α δ₂/δ₁;  γ ← −α β
    if γ ≤ 1
      type ← 1;  τ ← δ₁;  δ₁ ← (1 + γ) δ₂;  δ₂ ← (1 + γ) τ
    else
      type ← 2;  α ← 1/α;  β ← 1/β;  γ ← 1/γ
      δ₁ ← (1 + γ) δ₁;  δ₂ ← (1 + γ) δ₂
  else
    type ← 2;  α ← 0;  β ← 0

LINPACK Downdating of R

function R̃ = qrdd(R, u)
  [m, n] ← size(R)
  q ← R_{1:n,1:n}^{−T} u;  ρ ← (1 − ‖q‖₂²)^{1/2}
  [α, β, type, δ₁, δ₂] ← fastgivens(q_n, ρ, 1, 1);  μ ← 1/√δ₂;  δ ← δ₁
  if type = 1
    R̃_{n,n} ← μ R_{n,n};  R_{n,n} ← β R_{n,n};  q_n ← β q_n + ρ
  else
    R̃_{n,n} ← μ α R_{n,n};  q_n ← q_n + β ρ
  for i = n − 1 : −1 : 2
    [α, β, type, δ₁, δ₂] ← fastgivens(q_i, q_{i+1}, 1, δ);  μ ← 1/√δ₂;  δ ← δ₁
    if type = 1
      R̃_{i,i:n} ← μ (R_{i,i:n} + α R_{i+1,i:n});  R_{i,i:n} ← β R_{i,i:n} + R_{i+1,i:n};  q_i ← β q_i + q_{i+1}
    else
      R̃_{i,i:n} ← μ (α R_{i,i:n} + R_{i+1,i:n});  R_{i,i:n} ← R_{i,i:n} + β R_{i+1,i:n};  q_i ← q_i + β q_{i+1}
  [α, β, type, δ₁, δ₂] ← fastgivens(q_1, q_2, 1, δ);  μ ← 1/√δ₂
  if type = 1
    R̃_{1,1:n} ← μ (R_{1,1:n} + α R_{2,1:n})
  else
    R̃_{1,1:n} ← μ (α R_{1,1:n} + R_{2,1:n})

This algorithm also uses fastgivens, and it requires approximately 5n²/2 flops.
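For reference, the fast Givens construction and the updating routine translate directly into pure Python. This is an illustrative sketch with our own variable names, not the CM-FORTRAN production code:

```python
# Pure-Python rendering of the Appendix routines: fast Givens rotations
# (following the construction in [5, Sec. 5.1]) and updating of R by a
# new row u^T.  A sketch for checking the algorithm, not production code.
import math

def fastgivens(x1, x2, d1, d2):
    if x2 != 0.0:
        alpha = -x1 / x2
        beta = -alpha * d2 / d1
        gamma = -alpha * beta
        if gamma <= 1.0:
            typ = 1
            d1, d2 = (1.0 + gamma) * d2, (1.0 + gamma) * d1
        else:
            typ = 2
            alpha, beta, gamma = 1.0 / alpha, 1.0 / beta, 1.0 / gamma
            d1, d2 = (1.0 + gamma) * d1, (1.0 + gamma) * d2
    else:
        typ, alpha, beta = 2, 0.0, 0.0
    return alpha, beta, typ, d1, d2

def qrud(R, u):
    # update the n x n factor R for a new row u^T appended on top
    n = len(R)
    R = [row[:] for row in R]
    Rnew = [[0.0] * n for _ in range(n)]
    work, d = u[:], 1.0            # the row being chased, and its scale
    for i in range(n):
        a, b, typ, d1, d2 = fastgivens(work[i], R[i][i], d, 1.0)
        mu = 1.0 / math.sqrt(d1)   # rescale the finished row of R~
        if typ == 1:
            top  = [b * w + r for w, r in zip(work, R[i])]
            work = [w + a * r for w, r in zip(work, R[i])]
        else:
            top  = [w + b * r for w, r in zip(work, R[i])]
            work = [a * w + r for w, r in zip(work, R[i])]
        Rnew[i] = [mu * x for x in top]
        d = d2
    return Rnew
```

A quick check uses the identity R̃^T R̃ = R^T R + u u^T, which holds because the updated matrix has the same Gram matrix regardless of the orthogonal factor.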


More information

F04EBFP.1. NAG Parallel Library Routine Document

F04EBFP.1. NAG Parallel Library Routine Document F04 Simultaneous Linear Equations F04EBFP NAG Parallel Library Routine Document Note: before using this routine, please read the Users Note for your implementation to check for implementation-dependent

More information

Bias-Variance Tradeos Analysis Using Uniform CR Bound. Mohammad Usman, Alfred O. Hero, Jerey A. Fessler and W. L. Rogers. University of Michigan

Bias-Variance Tradeos Analysis Using Uniform CR Bound. Mohammad Usman, Alfred O. Hero, Jerey A. Fessler and W. L. Rogers. University of Michigan Bias-Variance Tradeos Analysis Using Uniform CR Bound Mohammad Usman, Alfred O. Hero, Jerey A. Fessler and W. L. Rogers University of Michigan ABSTRACT We quantify fundamental bias-variance tradeos for

More information

Numerical Linear Algebra

Numerical Linear Algebra Numerical Linear Algebra Probably the simplest kind of problem. Occurs in many contexts, often as part of larger problem. Symbolic manipulation packages can do linear algebra "analytically" (e.g. Mathematica,

More information

Pipeline Givens sequences for computing the QR decomposition on a EREW PRAM q

Pipeline Givens sequences for computing the QR decomposition on a EREW PRAM q Parallel Computing 32 (2006) 222 230 www.elsevier.com/locate/parco Pipeline Givens sequences for computing the QR decomposition on a EREW PRAM q Marc Hofmann a, *, Erricos John Kontoghiorghes b,c a Institut

More information

SDLS: a Matlab package for solving conic least-squares problems

SDLS: a Matlab package for solving conic least-squares problems SDLS: a Matlab package for solving conic least-squares problems Didier Henrion 1,2 Jérôme Malick 3 June 28, 2007 Abstract This document is an introduction to the Matlab package SDLS (Semi-Definite Least-Squares)

More information

Lecture 15: More Iterative Ideas

Lecture 15: More Iterative Ideas Lecture 15: More Iterative Ideas David Bindel 15 Mar 2010 Logistics HW 2 due! Some notes on HW 2. Where we are / where we re going More iterative ideas. Intro to HW 3. More HW 2 notes See solution code!

More information

Computational Methods in Statistics with Applications A Numerical Point of View. Large Data Sets. L. Eldén. March 2016

Computational Methods in Statistics with Applications A Numerical Point of View. Large Data Sets. L. Eldén. March 2016 Computational Methods in Statistics with Applications A Numerical Point of View L. Eldén SeSe March 2016 Large Data Sets IDA Machine Learning Seminars, September 17, 2014. Sequential Decision Making: Experiment

More information

Neuro-Remodeling via Backpropagation of Utility. ABSTRACT Backpropagation of utility is one of the many methods for neuro-control.

Neuro-Remodeling via Backpropagation of Utility. ABSTRACT Backpropagation of utility is one of the many methods for neuro-control. Neuro-Remodeling via Backpropagation of Utility K. Wendy Tang and Girish Pingle 1 Department of Electrical Engineering SUNY at Stony Brook, Stony Brook, NY 11794-2350. ABSTRACT Backpropagation of utility

More information

Data parallel algorithms 1

Data parallel algorithms 1 Data parallel algorithms (Guy Steele): The data-parallel programming style is an approach to organizing programs suitable for execution on massively parallel computers. In this lecture, we will characterize

More information

Blocked Schur Algorithms for Computing the Matrix Square Root. Deadman, Edvin and Higham, Nicholas J. and Ralha, Rui. MIMS EPrint: 2012.

Blocked Schur Algorithms for Computing the Matrix Square Root. Deadman, Edvin and Higham, Nicholas J. and Ralha, Rui. MIMS EPrint: 2012. Blocked Schur Algorithms for Computing the Matrix Square Root Deadman, Edvin and Higham, Nicholas J. and Ralha, Rui 2013 MIMS EPrint: 2012.26 Manchester Institute for Mathematical Sciences School of Mathematics

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms Chapter 5 Vector and Matrix Products Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign CS 554 / CSE 512 Michael T. Heath Parallel

More information

Dense matrix algebra and libraries (and dealing with Fortran)

Dense matrix algebra and libraries (and dealing with Fortran) Dense matrix algebra and libraries (and dealing with Fortran) CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Dense matrix algebra and libraries (and dealing with Fortran)

More information

A Few Numerical Libraries for HPC

A Few Numerical Libraries for HPC A Few Numerical Libraries for HPC CPS343 Parallel and High Performance Computing Spring 2016 CPS343 (Parallel and HPC) A Few Numerical Libraries for HPC Spring 2016 1 / 37 Outline 1 HPC == numerical linear

More information

Sparse Matrix Libraries in C++ for High Performance. Architectures. ferent sparse matrix data formats in order to best

Sparse Matrix Libraries in C++ for High Performance. Architectures. ferent sparse matrix data formats in order to best Sparse Matrix Libraries in C++ for High Performance Architectures Jack Dongarra xz, Andrew Lumsdaine, Xinhui Niu Roldan Pozo z, Karin Remington x x Oak Ridge National Laboratory z University oftennessee

More information

A parallel frontal solver for nite element applications

A parallel frontal solver for nite element applications INTERNATIONAL JOURNAL FOR NUMERICAL METHODS IN ENGINEERING Int. J. Numer. Meth. Engng 2001; 50:1131 1144 A parallel frontal solver for nite element applications Jennifer A. Scott ; Computational Science

More information

LINUX. Benchmark problems have been calculated with dierent cluster con- gurations. The results obtained from these experiments are compared to those

LINUX. Benchmark problems have been calculated with dierent cluster con- gurations. The results obtained from these experiments are compared to those Parallel Computing on PC Clusters - An Alternative to Supercomputers for Industrial Applications Michael Eberl 1, Wolfgang Karl 1, Carsten Trinitis 1 and Andreas Blaszczyk 2 1 Technische Universitat Munchen

More information

Performance Evaluation of BLAS on a Cluster of Multi-Core Intel Processors

Performance Evaluation of BLAS on a Cluster of Multi-Core Intel Processors Performance Evaluation of BLAS on a Cluster of Multi-Core Intel Processors Mostafa I. Soliman and Fatma S. Ahmed Computers and Systems Section, Electrical Engineering Department Aswan Faculty of Engineering,

More information

Adaptive Nonlinear Discriminant Analysis. by Regularized Minimum Squared Errors

Adaptive Nonlinear Discriminant Analysis. by Regularized Minimum Squared Errors Adaptive Nonlinear Discriminant Analysis by Regularized Minimum Squared Errors Hyunsoo Kim, Barry L Drake, and Haesun Park February 23, 2005 Abstract: Recently, kernelized nonlinear extensions of Fisher

More information

Journal of Engineering Research and Studies E-ISSN

Journal of Engineering Research and Studies E-ISSN Journal of Engineering Research and Studies E-ISS 0976-79 Research Article SPECTRAL SOLUTIO OF STEADY STATE CODUCTIO I ARBITRARY QUADRILATERAL DOMAIS Alavani Chitra R 1*, Joshi Pallavi A 1, S Pavitran

More information

Floating Point Fault Tolerance with Backward Error Assertions

Floating Point Fault Tolerance with Backward Error Assertions Floating Point Fault Tolerance with Backward Error Assertions Daniel Boley Gene H. Golub * Samy Makar Nirmal Saxena Edward J. McCluskey Computer Science Dept. Computer Science Dept. Center for Reliable

More information

IIAIIIIA-II is called the condition number. Similarly, if x + 6x satisfies

IIAIIIIA-II is called the condition number. Similarly, if x + 6x satisfies SIAM J. ScI. STAT. COMPUT. Vol. 5, No. 2, June 1984 (C) 1984 Society for Industrial and Applied Mathematics OO6 CONDITION ESTIMATES* WILLIAM W. HAGERf Abstract. A new technique for estimating the 11 condition

More information

A class of communication-avoiding algorithms for solving general dense linear systems on CPU/GPU parallel machines

A class of communication-avoiding algorithms for solving general dense linear systems on CPU/GPU parallel machines Available online at www.sciencedirect.com Procedia Computer Science 9 (2012 ) 17 26 International Conference on Computational Science, ICCS 2012 A class of communication-avoiding algorithms for solving

More information

Blocked Schur Algorithms for Computing the Matrix Square Root

Blocked Schur Algorithms for Computing the Matrix Square Root Blocked Schur Algorithms for Computing the Matrix Square Root Edvin Deadman 1, Nicholas J. Higham 2,andRuiRalha 3 1 Numerical Algorithms Group edvin.deadman@nag.co.uk 2 University of Manchester higham@maths.manchester.ac.uk

More information

LAPACK. Linear Algebra PACKage. Janice Giudice David Knezevic 1

LAPACK. Linear Algebra PACKage. Janice Giudice David Knezevic 1 LAPACK Linear Algebra PACKage 1 Janice Giudice David Knezevic 1 Motivating Question Recalling from last week... Level 1 BLAS: vectors ops Level 2 BLAS: matrix-vectors ops 2 2 O( n ) flops on O( n ) data

More information

Algebraic Iterative Methods for Computed Tomography

Algebraic Iterative Methods for Computed Tomography Algebraic Iterative Methods for Computed Tomography Per Christian Hansen DTU Compute Department of Applied Mathematics and Computer Science Technical University of Denmark Per Christian Hansen Algebraic

More information

Module 5.5: nag sym bnd lin sys Symmetric Banded Systems of Linear Equations. Contents

Module 5.5: nag sym bnd lin sys Symmetric Banded Systems of Linear Equations. Contents Module Contents Module 5.5: nag sym bnd lin sys Symmetric Banded Systems of nag sym bnd lin sys provides a procedure for solving real symmetric or complex Hermitian banded systems of linear equations with

More information

BMVC 1996 doi: /c.10.41

BMVC 1996 doi: /c.10.41 On the use of the 1D Boolean model for the description of binary textures M Petrou, M Arrigo and J A Vons Dept. of Electronic and Electrical Engineering, University of Surrey, Guildford GU2 5XH, United

More information

Comparing SIMD and MIMD Programming Modes Ravikanth Ganesan, Kannan Govindarajan, and Min-You Wu Department of Computer Science State University of Ne

Comparing SIMD and MIMD Programming Modes Ravikanth Ganesan, Kannan Govindarajan, and Min-You Wu Department of Computer Science State University of Ne Comparing SIMD and MIMD Programming Modes Ravikanth Ganesan, Kannan Govindarajan, and Min-You Wu Department of Computer Science State University of New York Bualo, NY 14260 Abstract The Connection Machine

More information

International Journal of Foundations of Computer Science c World Scientic Publishing Company DFT TECHNIQUES FOR SIZE ESTIMATION OF DATABASE JOIN OPERA

International Journal of Foundations of Computer Science c World Scientic Publishing Company DFT TECHNIQUES FOR SIZE ESTIMATION OF DATABASE JOIN OPERA International Journal of Foundations of Computer Science c World Scientic Publishing Company DFT TECHNIQUES FOR SIZE ESTIMATION OF DATABASE JOIN OPERATIONS KAM_IL SARAC, OMER E GEC_IO GLU, AMR EL ABBADI

More information

NetSolve: A Network Server. for Solving Computational Science Problems. November 27, Abstract

NetSolve: A Network Server. for Solving Computational Science Problems. November 27, Abstract NetSolve: A Network Server for Solving Computational Science Problems Henri Casanova Jack Dongarra? y November 27, 1995 Abstract This paper presents a new system, called NetSolve, that allows users to

More information

A Fast and Exact Simulation Algorithm for General Gaussian Markov Random Fields

A Fast and Exact Simulation Algorithm for General Gaussian Markov Random Fields A Fast and Exact Simulation Algorithm for General Gaussian Markov Random Fields HÅVARD RUE DEPARTMENT OF MATHEMATICAL SCIENCES NTNU, NORWAY FIRST VERSION: FEBRUARY 23, 1999 REVISED: APRIL 23, 1999 SUMMARY

More information

Sparse matrices, graphs, and tree elimination

Sparse matrices, graphs, and tree elimination Logistics Week 6: Friday, Oct 2 1. I will be out of town next Tuesday, October 6, and so will not have office hours on that day. I will be around on Monday, except during the SCAN seminar (1:25-2:15);

More information

SDLS: a Matlab package for solving conic least-squares problems

SDLS: a Matlab package for solving conic least-squares problems SDLS: a Matlab package for solving conic least-squares problems Didier Henrion, Jérôme Malick To cite this version: Didier Henrion, Jérôme Malick. SDLS: a Matlab package for solving conic least-squares

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms Chapter 3 Dense Linear Systems Section 3.3 Triangular Linear Systems Michael T. Heath and Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign

More information

MOTION. Feature Matching/Tracking. Control Signal Generation REFERENCE IMAGE

MOTION. Feature Matching/Tracking. Control Signal Generation REFERENCE IMAGE Head-Eye Coordination: A Closed-Form Solution M. Xie School of Mechanical & Production Engineering Nanyang Technological University, Singapore 639798 Email: mmxie@ntuix.ntu.ac.sg ABSTRACT In this paper,

More information

Accelerating GPU kernels for dense linear algebra

Accelerating GPU kernels for dense linear algebra Accelerating GPU kernels for dense linear algebra Rajib Nath, Stanimire Tomov, and Jack Dongarra Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville {rnath1, tomov,

More information

MATH 423 Linear Algebra II Lecture 17: Reduced row echelon form (continued). Determinant of a matrix.

MATH 423 Linear Algebra II Lecture 17: Reduced row echelon form (continued). Determinant of a matrix. MATH 423 Linear Algebra II Lecture 17: Reduced row echelon form (continued). Determinant of a matrix. Row echelon form A matrix is said to be in the row echelon form if the leading entries shift to the

More information

Convex Optimization / Homework 2, due Oct 3

Convex Optimization / Homework 2, due Oct 3 Convex Optimization 0-725/36-725 Homework 2, due Oct 3 Instructions: You must complete Problems 3 and either Problem 4 or Problem 5 (your choice between the two) When you submit the homework, upload a

More information

NAG Fortran Library Routine Document F08BHF (DTZRZF).1

NAG Fortran Library Routine Document F08BHF (DTZRZF).1 NAG Fortran Library Routine Document Note: before using this routine, please read the Users Note for your implementation to check the interpretation of bold italicised terms and other implementation-dependent

More information

Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors

Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors Andrés Tomás 1, Zhaojun Bai 1, and Vicente Hernández 2 1 Department of Computer

More information

Abstract. 1 Introduction

Abstract. 1 Introduction The performance of fast Givens rotations problem implemented with MPI extensions in multicomputers L. Fernández and J. M. García Department of Informática y Sistemas, Universidad de Murcia, Campus de Espinardo

More information

Table-Lookup Approach for Compiling Two-Level Data-Processor Mappings in HPF Kuei-Ping Shih y, Jang-Ping Sheu y, and Chua-Huang Huang z y Department o

Table-Lookup Approach for Compiling Two-Level Data-Processor Mappings in HPF Kuei-Ping Shih y, Jang-Ping Sheu y, and Chua-Huang Huang z y Department o Table-Lookup Approach for Compiling Two-Level Data-Processor Mappings in HPF Kuei-Ping Shih y, Jang-Ping Sheu y, and Chua-Huang Huang z y Department of Computer Science and Information Engineering National

More information

The LINPACK Benchmark on the Fujitsu AP 1000

The LINPACK Benchmark on the Fujitsu AP 1000 The LINPACK Benchmark on the Fujitsu AP 1000 Richard P. Brent Computer Sciences Laboratory Australian National University Canberra, Australia Abstract We describe an implementation of the LINPACK Benchmark

More information

Extra-High Speed Matrix Multiplication on the Cray-2. David H. Bailey. September 2, 1987

Extra-High Speed Matrix Multiplication on the Cray-2. David H. Bailey. September 2, 1987 Extra-High Speed Matrix Multiplication on the Cray-2 David H. Bailey September 2, 1987 Ref: SIAM J. on Scientic and Statistical Computing, vol. 9, no. 3, (May 1988), pg. 603{607 Abstract The Cray-2 is

More information

Matrices. Chapter Matrix A Mathematical Definition Matrix Dimensions and Notation

Matrices. Chapter Matrix A Mathematical Definition Matrix Dimensions and Notation Chapter 7 Introduction to Matrices This chapter introduces the theory and application of matrices. It is divided into two main sections. Section 7.1 discusses some of the basic properties and operations

More information

reasonable to store in a software implementation, it is likely to be a signicant burden in a low-cost hardware implementation. We describe in this pap

reasonable to store in a software implementation, it is likely to be a signicant burden in a low-cost hardware implementation. We describe in this pap Storage-Ecient Finite Field Basis Conversion Burton S. Kaliski Jr. 1 and Yiqun Lisa Yin 2 RSA Laboratories 1 20 Crosby Drive, Bedford, MA 01730. burt@rsa.com 2 2955 Campus Drive, San Mateo, CA 94402. yiqun@rsa.com

More information

y(b)-- Y[a,b]y(a). EQUATIONS ON AN INTEL HYPERCUBE*

y(b)-- Y[a,b]y(a). EQUATIONS ON AN INTEL HYPERCUBE* SIAM J. ScI. STAT. COMPUT. Vol. 12, No. 6, pp. 1480-1485, November 1991 ()1991 Society for Industrial and Applied Mathematics 015 SOLUTION OF LINEAR SYSTEMS OF ORDINARY DIFFERENTIAL EQUATIONS ON AN INTEL

More information

Wei Shu and Min-You Wu. Abstract. partitioning patterns, and communication optimization to achieve a speedup.

Wei Shu and Min-You Wu. Abstract. partitioning patterns, and communication optimization to achieve a speedup. Sparse Implementation of Revised Simplex Algorithms on Parallel Computers Wei Shu and Min-You Wu Abstract Parallelizing sparse simplex algorithms is one of the most challenging problems. Because of very

More information

Performance Evaluation of a New Parallel Preconditioner

Performance Evaluation of a New Parallel Preconditioner Performance Evaluation of a New Parallel Preconditioner Keith D. Gremban Gary L. Miller Marco Zagha School of Computer Science Carnegie Mellon University 5 Forbes Avenue Pittsburgh PA 15213 Abstract The

More information

J.A.J.Hall, K.I.M.McKinnon. September 1996

J.A.J.Hall, K.I.M.McKinnon. September 1996 PARSMI, a parallel revised simplex algorithm incorporating minor iterations and Devex pricing J.A.J.Hall, K.I.M.McKinnon September 1996 MS 96-012 Supported by EPSRC research grant GR/J0842 Presented at

More information

All use is subject to licence, see For any commercial application, a separate licence must be signed.

All use is subject to licence, see   For any commercial application, a separate licence must be signed. HS PAKAGE SPEIFIATION HS 2007 1 SUMMARY This routine uses the Generalized Minimal Residual method with restarts every m iterations, GMRES(m), to solve the n n unsymmetric linear system Ax = b, optionally

More information

Sparse Linear Systems

Sparse Linear Systems 1 Sparse Linear Systems Rob H. Bisseling Mathematical Institute, Utrecht University Course Introduction Scientific Computing February 22, 2018 2 Outline Iterative solution methods 3 A perfect bipartite

More information

Lab # 2 - ACS I Part I - DATA COMPRESSION in IMAGE PROCESSING using SVD

Lab # 2 - ACS I Part I - DATA COMPRESSION in IMAGE PROCESSING using SVD Lab # 2 - ACS I Part I - DATA COMPRESSION in IMAGE PROCESSING using SVD Goals. The goal of the first part of this lab is to demonstrate how the SVD can be used to remove redundancies in data; in this example

More information

Analysis of the GCR method with mixed precision arithmetic using QuPAT

Analysis of the GCR method with mixed precision arithmetic using QuPAT Analysis of the GCR method with mixed precision arithmetic using QuPAT Tsubasa Saito a,, Emiko Ishiwata b, Hidehiko Hasegawa c a Graduate School of Science, Tokyo University of Science, 1-3 Kagurazaka,

More information

Project Report. 1 Abstract. 2 Algorithms. 2.1 Gaussian elimination without partial pivoting. 2.2 Gaussian elimination with partial pivoting

Project Report. 1 Abstract. 2 Algorithms. 2.1 Gaussian elimination without partial pivoting. 2.2 Gaussian elimination with partial pivoting Project Report Bernardo A. Gonzalez Torres beaugonz@ucsc.edu Abstract The final term project consist of two parts: a Fortran implementation of a linear algebra solver and a Python implementation of a run

More information

However, m pq is just an approximation of M pq. As it was pointed out by Lin [2], more precise approximation can be obtained by exact integration of t

However, m pq is just an approximation of M pq. As it was pointed out by Lin [2], more precise approximation can be obtained by exact integration of t FAST CALCULATION OF GEOMETRIC MOMENTS OF BINARY IMAGES Jan Flusser Institute of Information Theory and Automation Academy of Sciences of the Czech Republic Pod vodarenskou vez 4, 82 08 Prague 8, Czech

More information

Network. Department of Statistics. University of California, Berkeley. January, Abstract

Network. Department of Statistics. University of California, Berkeley. January, Abstract Parallelizing CART Using a Workstation Network Phil Spector Leo Breiman Department of Statistics University of California, Berkeley January, 1995 Abstract The CART (Classication and Regression Trees) program,

More information

Problem Set 2 Geometry, Algebra, Reality

Problem Set 2 Geometry, Algebra, Reality Problem Set 2 Geometry, Algebra, Reality Applied Mathematics 121 Spring 2011 Due 5:00 PM, Friday, February 11, 2011 Announcements The assignment is due by 5:00 PM, Friday, February 11, 2011. Readings:

More information

Column-Action Methods in Image Reconstruction

Column-Action Methods in Image Reconstruction Column-Action Methods in Image Reconstruction Per Christian Hansen joint work with Tommy Elfving Touraj Nikazad Overview of Talk Part 1: the classical row-action method = ART The advantage of algebraic

More information

CS 770G - Parallel Algorithms in Scientific Computing

CS 770G - Parallel Algorithms in Scientific Computing CS 770G - Parallel lgorithms in Scientific Computing Dense Matrix Computation II: Solving inear Systems May 28, 2001 ecture 6 References Introduction to Parallel Computing Kumar, Grama, Gupta, Karypis,

More information

Section 3.1 Gaussian Elimination Method (GEM) Key terms

Section 3.1 Gaussian Elimination Method (GEM) Key terms Section 3.1 Gaussian Elimination Method (GEM) Key terms Rectangular systems Consistent system & Inconsistent systems Rank Types of solution sets RREF Upper triangular form & back substitution Nonsingular

More information

NAG Library Function Document nag_zgelsy (f08bnc)

NAG Library Function Document nag_zgelsy (f08bnc) NAG Library Function Document nag_zgelsy () 1 Purpose nag_zgelsy () computes the minimum norm solution to a complex linear least squares problem minkb Axk 2 x using a complete orthogonal factorization

More information

QR Decomposition on GPUs

QR Decomposition on GPUs QR Decomposition QR Algorithms Block Householder QR Andrew Kerr* 1 Dan Campbell 1 Mark Richards 2 1 Georgia Tech Research Institute 2 School of Electrical and Computer Engineering Georgia Institute of

More information

Ecient Cubic B-spline Image interpolation on a GPU. 1 Abstract. 2 Introduction. F. Champagnat and Y. Le Sant. September 1, 2011

Ecient Cubic B-spline Image interpolation on a GPU. 1 Abstract. 2 Introduction. F. Champagnat and Y. Le Sant. September 1, 2011 Ecient Cubic B-spline Image interpolation on a GPU F. Champagnat and Y. Le Sant September 1, 2011 1 Abstract Application of geometric transformation to images requires an interpolation step. When applied

More information

Aim. Structure and matrix sparsity: Part 1 The simplex method: Exploiting sparsity. Structure and matrix sparsity: Overview

Aim. Structure and matrix sparsity: Part 1 The simplex method: Exploiting sparsity. Structure and matrix sparsity: Overview Aim Structure and matrix sparsity: Part 1 The simplex method: Exploiting sparsity Julian Hall School of Mathematics University of Edinburgh jajhall@ed.ac.uk What should a 2-hour PhD lecture on structure

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

Parallel Auction Algorithm for Linear Assignment Problem

Parallel Auction Algorithm for Linear Assignment Problem Parallel Auction Algorithm for Linear Assignment Problem Xin Jin 1 Introduction The (linear) assignment problem is one of classic combinatorial optimization problems, first appearing in the studies on

More information

B(FOM) 2. Block full orthogonalization methods for functions of matrices. Kathryn Lund. December 12, 2017

B(FOM) 2. Block full orthogonalization methods for functions of matrices. Kathryn Lund. December 12, 2017 B(FOM) 2 Block full orthogonalization methods for functions of matrices Kathryn Lund December 12, 2017 The block full orthogonalization methods for functions of matrices (denoted B(FOM) 2, for short) are

More information

Numerical Algorithms

Numerical Algorithms Chapter 10 Slide 464 Numerical Algorithms Slide 465 Numerical Algorithms In textbook do: Matrix multiplication Solving a system of linear equations Slide 466 Matrices A Review An n m matrix Column a 0,0

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information