
"Quick" Implementation of Block LU Algorithms on the CM-200

Claus Bendtsen

Abstract

The CMSSL library includes only a limited number of mathematical algorithms. Hence, when writing code for the Connection Machine one often has to implement LAPACK-style routines already developed for other architectures, in the hope that acceptable performance can be obtained relatively quickly. Because of the massively parallel structure of the CM, algorithms with a serial or global structure, such as LU factorization, tend to yield poor performance (global here meaning that elements do not interact only with elements in their neighborhood). The purpose of this note is partly to show what performance one can typically obtain when implementing a global algorithm on the CM-200 in a relatively limited period of time, and partly to examine the pros and cons of using a block algorithm. The tests have been performed on the LU factorization in a normal as well as a blocked version. The implementation uses BLAS level 3 equivalent routines, and the results have been compared with the LU factorization in the CMSSL library. The implementation and optimization of both the normal and the blocked version were carried out within 14 days altogether. The performance obtained is very disappointing: only 4% of that of the CMSSL routine for large matrices.

1 LU Factorization and Solver

The LU factorization is implemented by walking down the diagonal and subtracting outer products along the way. The outer products are calculated by means of spreads in the unoptimized version and by means of a mask and scans in the optimized version. The bottleneck of the operation is undoubtedly the spread/scan operations, since these demand a tremendous amount of communication and are thus very time consuming, even though the communication is along a single axis. Pivoting and equilibration are not implemented.
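The factorization described above, walking down the diagonal and subtracting outer products without pivoting, can be sketched serially as follows. This is a minimal illustration in plain Python, not the CM Fortran code of this note: the function name `lu_outer_product` is hypothetical, and the spreads and scans used on the CM-200 are replaced by ordinary loops.

```python
def lu_outer_product(a):
    """LU factorization without pivoting (illustrative sketch).

    Returns a new matrix whose strict lower triangle holds L (unit
    diagonal implied) and whose upper triangle holds U.
    """
    n = len(a)
    lu = [row[:] for row in a]  # work on a copy
    for k in range(n):
        pivot = lu[k][k]
        if pivot == 0.0:
            raise ZeroDivisionError("zero pivot: this sketch does no pivoting")
        # Multipliers: scale the column below the diagonal element.
        for i in range(k + 1, n):
            lu[i][k] /= pivot
        # Subtract the outer product (column k) x (row k) from the
        # trailing submatrix.
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu


# Small example matrix chosen for illustration only.
example = [[2.0, 1.0, 1.0],
           [4.0, 3.0, 3.0],
           [8.0, 7.0, 9.0]]
factors = lu_outer_product(example)
```

On a data-parallel machine the two inner loops correspond to one elementwise update of the whole trailing submatrix, which is why the cost is dominated by distributing column k and row k to all processors.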
Since the functions need global communication, the best results have been obtained by choosing a :NEWS layout. The solver is implemented in a way similar to the factorization. Both the backward and the forward substitution are performed by means of spreads (unoptimized) and scans (optimized). In the optimized version care is taken not to create unnecessary temporaries.

Random test systems lead to the performance shown in Fig. 1. The timings shown are elapsed times measured on an 8K CM-200 using double precision. The timings show a complexity for large $N$ of $O(N^{2.6})$. For the factorization the optimized version is somewhat faster than the unoptimized one (typically by a factor of two), whereas the optimized solver is the slowest. The reason is that the use of masks in the solver is more complicated (footnote 1) than for the factorization, and the overhead is too large to outperform the spreads.

Figure 1: Performance of normal LU factorization and solver. (Plot of elapsed time in seconds; curves: LU unopt., Solve unopt., LU opt., Solve opt.)

Author's address: UNI-C (The Danish Computer Center for Research and Education, DTH, DK-2800 Lyngby, Denmark), claus.bendtsen@uni-c.dk

2 Block LU Factorization and Solver

The block factorization algorithm described in [1] and [2] reorganizes Gaussian elimination so that matrix multiplication becomes the dominant operation. Since matrix multiplication is implemented efficiently on the Connection Machine, changing to the block algorithm should yield a gain. The implementation follows [1], except that the recursive structure has been changed to an iterative one because of the nature of CM-FORTRAN. The unoptimized version uses a :NEWS layout of the matrix and subscript triplets to identify the different blocks (footnote 2). The optimized version uses a :NEWS layout within each block and a :SERIAL layout across the blocks, which means that each block is spread across the entire machine and that identifying the different blocks requires no communication. Both the optimized and the unoptimized version use the optimized version of the normal factorization and the unoptimized version of the solver described in Section 1 for operations on the blocks. The implementation has been restricted to systems whose size is a multiple of the block size.

Random test systems lead to the performance shown in Figs. 2 and 3. The timings shown are elapsed times measured on an 8K CM-200 using double precision and various block sizes. For the optimized version, the time required for packing a normal layout into the combined :SERIAL and :NEWS layout has not been included in the timings, since most other calculations could be performed on the packed array using multiple-instance calls, which implies that the user could keep the data packed all the time (footnote 3). Each version was tested with different block sizes.
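As an illustration of how an iterative block factorization makes the matrix-matrix product the dominant operation, here is a serial sketch in plain Python. It assumes a common right-looking partitioned variant without pivoting; the algorithm of [1] and the CM Fortran implementation may differ in detail, and the names `lu_nopivot` and `block_lu` are hypothetical.

```python
def lu_nopivot(a):
    """Point LU without pivoting; L (unit lower) and U share one matrix."""
    n = len(a)
    lu = [row[:] for row in a]
    for k in range(n):
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu


def block_lu(a, b):
    """Iterative right-looking block LU without pivoting, block size b.

    The matrix size must be a multiple of b, as in the text.
    """
    n = len(a)
    if n % b != 0:
        raise ValueError("matrix size must be a multiple of the block size")
    lu = [row[:] for row in a]
    for k in range(0, n, b):
        e = k + b  # one past the end of the current diagonal block
        # 1. Factor the diagonal block A11 = L11 * U11 (point algorithm).
        diag = lu_nopivot([row[k:e] for row in lu[k:e]])
        for i in range(b):
            lu[k + i][k:e] = diag[i]
        # 2. Block row: U12 = inv(L11) * A12 by forward substitution.
        for j in range(e, n):
            for i in range(k, e):
                for p in range(k, i):
                    lu[i][j] -= lu[i][p] * lu[p][j]
        # 3. Block column: L21 = A21 * inv(U11) by substitution along rows.
        for i in range(e, n):
            for j in range(k, e):
                for p in range(k, j):
                    lu[i][j] -= lu[i][p] * lu[p][j]
                lu[i][j] /= lu[j][j]
        # 4. Schur complement: A22 <- A22 - L21 * U12 (matrix-matrix product).
        for i in range(e, n):
            for j in range(e, n):
                lu[i][j] -= sum(lu[i][p] * lu[p][j] for p in range(k, e))
    return lu


# Illustrative 4 x 4 matrix (built from integer factors so arithmetic is exact).
example = [[1.0, 2.0, 1.0, 3.0],
           [2.0, 5.0, 4.0, 7.0],
           [1.0, 5.0, 8.0, 8.0],
           [2.0, 5.0, 6.0, 12.0]]
blocked = block_lu(example, 2)
unblocked = lu_nopivot(example)
```

In this variant the blocked and unblocked routines produce the same factors; only the order of operations changes, so that step 4 carries most of the arithmetic as the matrix grows relative to the block size.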
The complexity for large $N$ is $O(N^{2.5})$ for the block factorization and $O(N^{2.4})$ for the block solver. Surprisingly, the "unoptimized" version of the factorization performs better than the optimized one, even though it involves large amounts of unneeded communication. A block size of 16 for small systems and of 32 or 64 for large ones seems to yield the best performance on an 8K machine. I would have expected a block size of 32 to be the best, since it maps perfectly onto the machine, using a vector size of 4 for each PE.

The reason why the optimized version performs so poorly seems to be the time spent calculating the Schur complement. At a block size of 16, 66% of the time is spent calculating the Schur complement; at a block size of 32 it is 63%, but at a block size of 64 it drops to 39%. This explains why a block size of 64 outperforms the smaller block sizes. It is also the main reason why the optimized version is slower than the unoptimized one. In the unoptimized version the Schur complement is calculated with a single matrix-matrix multiplication, whereas in the optimized version the layout results in a multiple-instance matrix-matrix multiplication, each instance having a size equal to the block size (footnote 4).

Footnote 1: Two scans are needed, one scanning upward and one scanning downward along each axis, as opposed to only one scan along each axis.

Footnote 2: This leads to a lot of communication due to the creation of temporaries.

Footnote 3: It is actually possible to avoid this packing by rearranging the equations, but this requires exact knowledge of where the different elements live as well as a longer implementation time.

Footnote 4: At a lower level than CMF, or with a different layout, this multiple-instance multiplication could be performed as a single matrix-matrix multiplication, which would reduce the execution time significantly.
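The text does not say how the complexity exponents quoted above (for example 2.5 for the block factorization) were derived from the timings. One common approach, shown here purely as an assumption, is a least-squares fit of log(time) against log(N). The function name `fit_exponent` is hypothetical and the timings below are synthetic, not measurements from this note.

```python
import math


def fit_exponent(sizes, times):
    """Least-squares slope of log(time) versus log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic timings generated from t = c * N**2.5 (NOT data from the paper).
sizes = [256, 512, 1024, 2048]
times = [1.0e-6 * n ** 2.5 for n in sizes]
exponent = fit_exponent(sizes, times)
```

With real measurements the fit would normally be restricted to the largest systems, since the exponents in the text are stated for large N only.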

Figure 2: Performance of Block LU factorization. (Plot of elapsed time in seconds; curves for the optimized and unoptimized versions with block sizes 2, 8, 16, 32, and 64.)

Figure 3: Performance of Block LU solver. (Plot of elapsed time in seconds; curves for the optimized and unoptimized versions with block sizes 2, 8, 16, 32, and 64.)

3 The CMSSL LU Factorization

In the CMSSL library, Gaussian elimination is implemented using a point algorithm and block cyclic ordering. The performance for test runs identical to those of Section 2 is shown in Figs. 4 and 5. The results have been obtained without pivoting and equilibration, using CMSSL Version 3.1 Beta 2. The complexity for large $N$ is $O(N^{1.8})$ for the factorization and $O(N^{1.5})$ for the solver.

Figure 4: Performance of CMSSL 3.1 LU factorization. (Plot of elapsed time in seconds; legend: block size.)

Figure 5: Performance of CMSSL 3.1 LU solver. (Plot of elapsed time in seconds; legend: block size.)

4 Concluding Remarks

The best results from each of the sections above are shown in Fig. 6. Flop rates are displayed instead of timings, using the approximations of $(2/3)N^3$ floating-point operations for the factorization routines and $2N^2 \times (\text{no. of right-hand sides})$ floating-point operations for the solver routines [3]. Regarding the CMSSL library, it is seen that the routines do well on large matrices and that our peak performance on large systems is only roughly 4% of the CMSSL peak performance. The reasons are discussed in Section 2. There seems to be a fair gain from using a block algorithm instead of a normal algorithm, and the implementation time needed to obtain this gain is relatively small.

Figure 6: MFlop rates for different implementations of LU factorization and solver. (Curves: CMSSL LU fact., CMSSL LU solver, LU fact., LU solver, BLU fact., BLU solver.)
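The conversion from elapsed time to the MFlop rates plotted in Fig. 6 follows directly from the operation counts given in the text. A minimal sketch, with hypothetical function names and made-up example values rather than results from this note:

```python
def lu_flops(n):
    """Approximate operation count of an N x N LU factorization: (2/3) N^3."""
    return (2.0 / 3.0) * n ** 3


def solve_flops(n, nrhs=1):
    """Approximate operation count of the solver: 2 N^2 per right-hand side."""
    return 2.0 * n ** 2 * nrhs


def mflops(flops, seconds):
    """Flop rate in MFlops for a given operation count and elapsed time."""
    if seconds <= 0.0:
        raise ValueError("elapsed time must be positive")
    return flops / seconds / 1.0e6


# Hypothetical example: a 1024 x 1024 factorization taking 10 seconds.
example_rate = mflops(lu_flops(1024), 10.0)
```

Because the same nominal operation count is used for every implementation, the resulting rates compare elapsed times on a common scale rather than counting the operations each routine actually performs.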

The overall conclusion is that, even though a "quick-and-dirty" implementation of block algorithms seems to work better than a similar implementation of point algorithms, it takes a long implementation time to obtain high performance. Since the block algorithm seems to map quite well onto the Connection Machine, it should be possible, with proper knowledge of the machine architecture, low-level programming, and sufficient time, to obtain a much higher performance rate.

5 Further Work

To obtain better performance on the LU factorization, it could be interesting to try the :BLOCK layout now supported by CMF. This would enable a single-instance matrix-matrix multiplication when computing the Schur complement while still maintaining the layout of the optimized block algorithm. Furthermore, one could probably optimize communication by doing multiple-instance scans rather than single-instance scans.

References

[1] Demmel, James W., Higham, Nicholas J., Schreiber, Robert S. Block LU Factorization. LAPACK Working Note 40, Febr

[2] Golub, Gene H., van Loan, Charles F. Matrix Computations. Second edition, The Johns Hopkins University Press,

[3] Thinking Machines Corporation. CMSSL Release Notes for the CM-200, Preliminary Documentation for Version 3.1 Beta 2. TMC, Cambridge, Massachusetts, February
