Extending CRAFT Data-Distributions for Sparse Matrices. July 1996 Technical Report No: UMA-DAC-96/11


Extending CRAFT Data-Distributions for Sparse Matrices

G. Bandera    E.L. Zapata

July 1996
Technical Report No: UMA-DAC-96/11

Published in: 2nd. European Cray MPP Workshop, Edinburgh Parallel Computing Centre, UK, July 25-26, 1996

University of Malaga, Department of Computer Architecture, C. Tecnologico, PO Box 4114, E Malaga, Spain

Extending CRAFT Data-Distributions for Sparse Matrices

Gerardo Bandera    Emilio L. Zapata
Computer Architecture Department, University of Malaga, Campus de Teatinos, 29071 Malaga, Spain.

The work described in this paper was supported by the Ministry of Education and Science (CICYT) of Spain under project TIC and by TRACS (Training and Research on Advanced Computing Systems) under the Human Capital and Mobility Programme of the European Union (grant number ERB-CHGE-CT ).

Abstract

The existing methods for distributing data across processors are not very useful for irregular problems. However, there are some new distribution techniques for this kind of problem, such as MRD (Multiple Recursive Decomposition) and BRS (Block Row Scatter), which could be applied by a data-parallel compiler. In this paper we present a performance comparison of regular (BLOCK and CYCLIC) versus pseudo-regular (BRS) data distributions for sparse matrices on the Cray T3D. We show results for one of the most useful nonstationary iterative methods for solving symmetric positive definite systems, the Sparse Conjugate Gradient algorithm, using CRAFT with its data distributions and using two hand-parallelized codes written in C and Fortran with the BRS sparse distribution.

1 Overview

In the literature there are several methods for programming parallel machines: automatic parallelizers (Polaris [6], Parafrase [16]), data-parallel compilers (VFCS [10], Adaptor [8]) and message-passing libraries (PVM [13], MPI [11], PARMACS [9]). On the Cray T3D [7], users writing a parallel application can choose between two main alternatives. On the one hand, this parallel machine can be programmed using three protocols for passing messages between processors: Parallel Virtual Machine (PVM), Message Passing Interface (MPI) and the Shared Memory routines (SHMEM [2, 3]). The first two are useful to facilitate portability to other kinds of multiprocessors. The last one is the most powerful of the three, because it deals directly with efficient communication primitives; for that reason, the communication time is reduced considerably. MPI currently achieves better performance than PVM, although the latter contains some special, faster routines. In our work, in a first step we use PVM and Fast PVM routines; later on, we use the SHMEM routines because of their speed.
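To make the contrast with message passing concrete, the following is a minimal sketch, not taken from the paper's codes, of the one-sided style of communication that SHMEM offers: a processing element writes directly into a remote, symmetrically allocated buffer, with no matching receive on the other side. The calls are written with OpenSHMEM-style names, which descend from the Cray SHMEM routines documented in [2, 3]; the exact routine names and initialization calls on the T3D differed slightly, so this is only an illustration of the programming style.

    #include <stdio.h>
    #include <shmem.h>

    #define N 4

    /* Static (global) arrays are "symmetric": they exist at the same
       address on every processing element and can be the target of puts. */
    static double remote_buf[N];

    int main(void)
    {
        double local_buf[N];
        int me, npes, i;

        shmem_init();                 /* start_pes(0) on older SHMEM libraries */
        me   = shmem_my_pe();
        npes = shmem_n_pes();

        for (i = 0; i < N; i++)
            local_buf[i] = (double)(me * N + i);

        /* One-sided transfer: this PE writes its data directly into the
           symmetric buffer of its right neighbour, no receive is needed.     */
        shmem_double_put(remote_buf, local_buf, N, (me + 1) % npes);

        shmem_barrier_all();          /* ensure all puts have completed        */

        printf("PE %d received %g ... %g\n", me, remote_buf[0], remote_buf[N-1]);

        shmem_finalize();
        return 0;
    }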

On the other hand, the second alternative for programming the Cray T3D is the data-parallel compiler CRAFT [15], whose input language is an HPF [14] subset, consisting of sequential code extended with directives that help the compiler to parallelize applications. One important difference of this data-parallel compiler with respect to others currently available is that CRAFT is not a source-to-source translator, like VFCS or Adaptor, in which the resulting code must be compiled with additional libraries (one for every different machine on which the program is to be executed). This compiler produces a binary file that only runs on the Cray T3D.

Section 2 contains a brief introduction to sparse matrices and some of their representation schemes. Section 3 presents the current regular data distributions (3.1) and the new distribution techniques that can be implemented by a data-parallel compiler (3.2). Section 4 describes the algorithm, the programming models and the input matrices used in this work. Finally, Section 5 shows the results obtained by using regular and pseudo-regular distributions with those different programming models and sparse matrices, and Section 6 presents the conclusions.

2 Sparse Matrices

A matrix is called sparse if it contains a small number of non-zero entries. A range of methods have been developed for storing sparse matrices, which enable their computations to be performed with considerable savings in terms of both memory and computation [1]. Solution schemes are often optimized to take advantage of the structure within the matrix, and in the literature particular storage methods can be found for specific classes of sparse matrices; instead, we only assume general formats throughout the paper. In our work we have considered the CRS (Compressed Row Storage) format. This scheme represents the matrix using three vectors called Data, Column and Row. The first two vectors contain the non-zero entries and their corresponding column indices; the last one points to the beginning of the entries of every row of the matrix. A similar format for representing this kind of matrix is Compressed Column Storage (CCS). Figure 1 shows the CRS format applied to an example matrix.

3 Generic Data-Distributions

3.1 Regular Distributions

Current data-parallel compilers contain the most useful regular distributions for sharing data across processors, which are BLOCK, CYCLIC and BLOCK-CYCLIC. The way to specify such distributions in the Cray data-parallel Fortran compiler is shown in Table 1. Obviously, users have to choose the distribution depending on how the data are going to be used.

    Distribution        Specification
    BLOCK               cdir$ shared V(:block)
    CYCLIC              cdir$ shared V(:block(1))
    BLOCK-CYCLIC (k)    cdir$ shared V(:block(k))

Table 1: Basic mechanisms in CRAFT for distributing a vector V.

The above distributions are useful when the code contains regular accesses to the data. Nevertheless, there are several applications with irregular data access patterns, and the current CRAFT distributions are not sufficient for this kind of application.
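As an illustration of why such accesses are irregular, the following is a minimal C sketch (not part of the original codes; all names are illustrative) of the CRS scheme of Section 2 and of the sparse matrix-vector product that the algorithm of Section 4 relies on. The column indices stored in the Column vector drive an indirect access into the dense vector, so the elements touched by each processor cannot be predicted from a BLOCK or CYCLIC mapping alone.

    /* CRS storage of an n x m sparse matrix with nnz non-zero entries.
       data[k]   : value of the k-th stored non-zero
       column[k] : column index of the k-th stored non-zero
       row[i]    : position in data/column where row i starts
                   (row[n] == nnz closes the last row)                     */
    struct crs {
        int     n, m, nnz;
        double *data;
        int    *column;
        int    *row;
    };

    /* y = A * x for a matrix A held in CRS form.  The access x[column[k]]
       is an indirection through the column vector: its pattern depends on
       the sparsity structure, which is only known at run time.            */
    void crs_matvec(const struct crs *A, const double *x, double *y)
    {
        int i, k;
        for (i = 0; i < A->n; i++) {
            double sum = 0.0;
            for (k = A->row[i]; k < A->row[i + 1]; k++)
                sum += A->data[k] * x[A->column[k]];
            y[i] = sum;
        }
    }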

Figure 1: CRS representation (Data, Column and Row vectors, denoted DA, CO and RO) for an example 10x8 matrix A with 16 non-zero entries.

The insufficiency of the regular distributions implies the necessity of defining new data distributions for use in such applications; Section 5 shows some results that emphasize that requirement.

3.2 Pseudo-Regular Distributions

The existing methods for distributing data across processors are not very useful for irregular problems. However, there are some new distribution techniques for this kind of problem, such as MRD (Multiple Recursive Decomposition) and BRS (Block Row Scatter) [17], which could be applied by a data-parallel compiler [18]. Both MRD and BRS represent the local portion of the global matrix preserving the storage scheme used for the global matrix, and they can be considered an extension of the BLOCK and CYCLIC regular distributions for sparse (pseudo-regular) applications.

MRD is a generalization of the Binary Recursive Decomposition [5] for mapping data indices onto each processor. It improves workload balance and communications by means of horizontal and vertical partitions. This distribution is useful for algorithms in which the neighbourhood properties of the elements must be preserved. Figure 2-(a) shows an example of a sparse matrix partition.

BRS is similar to the regular CYCLIC distribution, but the local representation of the submatrices is the CRS format. It is useful when the neighbourhood properties of the elements are not so important in the algorithm. The main advantage of this distribution is the possibility of combining it with cyclically distributed data. Figure 2-(b) shows an example of this partition.
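To make the BRS mapping concrete, the sketch below is our own illustration of the scheme described in [17]; the function and variable names are not from the paper. The non-zeroes of a global CRS matrix are scattered cyclically over a Pr x Pc mesh of processors, and each processor keeps its own entries again in CRS form with reduced local indices (the caller is assumed to have counted the local non-zeroes beforehand in order to allocate the local arrays).

    /* BRS (Block Row Scatter): entry (i, j) of the global matrix is owned
       by processor (i mod Pr, j mod Pc) of a Pr x Pc mesh, and its local
       coordinates on that processor are (i / Pr, j / Pc).  Dense vectors
       distributed CYCLICally match this mapping.                          */

    /* Extract the local CRS structure for processor (pr, pc) from the
       global CRS arrays data/column/row of an n-row matrix.  Local row r
       corresponds to global row pr + r*Pr.                                */
    void brs_extract_local(int n, const double *data, const int *column,
                           const int *row, int pr, int pc, int Pr, int Pc,
                           double *ldata, int *lcolumn, int *lrow)
    {
        int i, k, r = 0, pos = 0;
        for (i = pr; i < n; i += Pr) {            /* rows owned by pr      */
            lrow[r++] = pos;
            for (k = row[i]; k < row[i + 1]; k++)
                if (column[k] % Pc == pc) {       /* columns owned by pc   */
                    ldata[pos]   = data[k];
                    lcolumn[pos] = column[k] / Pc;  /* local column index  */
                    pos++;
                }
        }
        lrow[r] = pos;                            /* close the last row    */
    }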

Figure 2: Example of pseudo-regular partitions: (a) MRD, (b) BRS.

4 Comparison Benchmark

To prove the efficiency and the need of implementing new data distributions in CRAFT, we show some results obtained using CRAFT with its own regular data distributions, together with two other codes, both parallelized by hand in C and Fortran with the BRS distribution and using SHMEM routines for communication between processors.

4.1 Code and Programming Models

To check the ability of the data distributions, we use one of the most useful and well-known nonstationary iterative methods for solving symmetric positive definite systems, the Sparse Conjugate Gradient algorithm [1], without preconditioners. With this algorithm we have tested three different versions. The first version uses the CYCLIC distribution of the Cray Fortran data-parallel compiler (CRAFT) [15] for the sparse vectors. In the second, the algorithm is written in C, distributing the data with the BRS distribution and using SHMEM routines [2] for communication between processors. In the last one, the algorithm is written in Fortran, also using the BRS distribution and the SHMEM [3] routines for processor communications.

4.2 Input Matrices

To account for the effects of the dimensions, sparsity rate and pattern of the input sparse matrix, we chose for computation several very different matrices taken from the Harwell-Boeing collection [12], in which they are identified as BCSSTK14, BCSSTK29, and .

BCSSTK14 is a medium sparse matrix used in linear equation systems; BCSSTK29 and a second BCSSTK matrix are very sparse matrices used in large eigenvalue problems; and the fourth matrix contains population migration data and is relatively dense (see Table 2 for the matrix characteristics). As can be seen, the two eigenvalue matrices are very similar, whereas BCSSTK14 is very small. For these reasons, we sometimes show timings only for one of the eigenvalue matrices together with the migration matrix; when interesting, all the others will be included too.

    Matrix        Dimensions    Non-zeroes    Sparsity Rate (%)
    BCSSTK14
    BCSSTK29

Table 2: Characteristics of the benchmark matrices.

          PROGRAM SpCG
    c     Declaration Part
    c     Distribution Part
    cdir$ shared scalar1, scalar2, scalar3          ! Shared Vbles
    cdir$ geometry cyclic (:block(1))
    cdir$ shared (cyclic) :: P, Q, R, X             ! Dense Vectors
    cdir$ shared (cyclic) :: Data, Column, Row      ! Sparse Matrix Vectors
    c     Initialization
    c     Reading Sparse Matrix and Dense Vectors
    c     Vectors Preprocessing
    c     Main Loop: Convergence Part
          DO K = 1, NUM_ITERS
            IF (K .GT. 1) THEN
              scalar2 = scalar1
              scalar1 = 0.0
            ENDIF
            scalar1 = dot_product(R,R)              ! Vector-Vector Product
            IF (K .EQ. 1) THEN                      ! IF-ELSE updating P
              P = R
            ELSE
              P = R + (scalar1/scalar2) * P
            ENDIF
    cdir$   DO SHARED (I) ON Q(I)
            DO I = 1, N                             ! SPARSE Matrix-Vector Product
              Q(I) = 0.0
              DO J = Row(I), Row(I+1)-1
                Q(I) = Q(I) + Data(J)*P(Column(J))
              ENDDO
            ENDDO
            scalar3 = dot_product(P,Q)              ! Vector-Vector Product
            scalar3 = scalar1/scalar3
            X = X + scalar3 * P                     ! Update solution
            R = R - scalar3 * Q                     ! Update residuals
    c       Checking Convergence Criteria
          ENDDO
          END

Figure 3: CRAFT code for the Sparse Conjugate Gradient algorithm (SpCG). The coefficients of the system are stored in a sparse matrix A, which is the main input structure to the algorithm. This matrix is internally represented in the CRS format (that is, with the Data, Column and Row vectors).
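For comparison with the CRAFT version in Figure 3, the following sketch (again our own illustration, not the paper's BRS-C code; all names are hypothetical) shows how the hand-parallelized codes can organize the sparse matrix-vector product of the CG iteration under BRS: each processor multiplies its local CRS entries by the cyclically distributed piece of P owned by its mesh column, producing partial row sums that afterwards have to be combined across each row of the processor mesh. In the paper's codes, that combination and the exchange of vector elements are done with SHMEM routines.

    /* Local part of Q = A * P under BRS on processor (pr, pc).
       ldata/lcolumn/lrow : local CRS entries (see the BRS extraction
                            sketch in Section 3.2)
       p_local            : elements of P owned by this mesh column
                            (global index j  ->  local index j / Pc)
       q_partial          : partial sums for the local rows; the full Q(i)
                            is obtained by adding the q_partial values of
                            the Pc processors that share mesh row pr.      */
    void brs_local_matvec(int nrows_local,
                          const double *ldata, const int *lcolumn,
                          const int *lrow, const double *p_local,
                          double *q_partial)
    {
        int r, k;
        for (r = 0; r < nrows_local; r++) {
            double sum = 0.0;
            for (k = lrow[r]; k < lrow[r + 1]; k++)
                sum += ldata[k] * p_local[lcolumn[k]];
            q_partial[r] = sum;
        }
    }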

5 Data-Distributions Comparison

5.1 Comparing CRAFT Regular Distributions

Using CRAFT, programmers can distribute the data across processors using block and cyclic distributions. The choice of a distribution method has to be made depending on the access properties of the data involved in the application. Figure 4 shows a comparison of these distributions for four different matrices.

Figure 4: Dense data-distribution comparison: ratio of the BLOCK to the CYCLIC execution time of the CRAFT CG code.

In the above figure it is possible to see the improvement obtained by using cyclic distributions instead of block. This improvement grows as the number of processors is increased. Given that result, it is obvious to choose cyclic to share the data across processors, and therefore this will be the regular distribution compared with the BRS pseudo-regular distribution. In most sparse applications, the Row sparse vector is accessed many times, and this can produce delays in data access due to excessive communication. We even tried to replicate this index vector across processors to solve this problem; the results were slightly better, but almost the same.

5.2 Codes Comparison

Figure 5 shows a comparison of the execution time of the three codes explained before. The plots indicate the better performance of the BRS codes with respect to the cyclic distribution of CRAFT. At the same time, it is possible to observe the scalability properties of the application for all the codes. In these figures readers can see that the BRS-C code is even better than the BRS-Fortran one. The gain of the different codes remains constant when the number of processors is increased. This is shown in Figures 6 and 7, which contain the pairwise comparison of the three codes. These figures show that the CYCLIC-CRAFT code is about five times slower than the BRS-C code and about three times slower than the BRS-Fortran code. It is also important to note that the difference between C and Fortran is up to a factor of two.

Figure 5: CG execution time (on a logarithmic scale) for two sparse matrices: (a) a quite sparse matrix, (b) a rather dense matrix. Each plot compares the Craft code (Cyclic), the Fortran code (BRS) and the C code (BRS).

Figure 6: Improvement (time CRAFT / time BRS) of the manual BRS versions of the CG algorithm over CYCLIC-CRAFT: (a) Fortran language, (b) C language.

Continuing with these results, Figures 8 and 9 show the speedup and efficiency of all the codes for two representative matrices. These figures indicate the scalability of the problem with all the codes. In most cases Craft obtains well-scaling programs, due to the intrinsic properties of the compiler; anyway, the other two codes also obtain good results (even better with one of the two matrices).

Figure 7: Comparing the C and Fortran BRS codes (time BRS-Fortran / time BRS-C).

Figure 8: Code speedup for the Craft code (Cyclic), the Fortran code (BRS) and the C code (BRS) on two matrices.

5.3 Final Comparison

Finally, Figure 10 shows the execution time per iteration for two matrices, but now two important quantities are separated: on the one hand, the time of the sparse matrix-vector multiplication and, on the other hand, the time of the remaining dense operations. As can be seen, the time of the sparse multiplication is more than 90% of the total time of every iteration of the algorithm, so it is necessary to improve the data distribution across processors to achieve a good load balance and to decrease the number of communications. Another remark is the advantage of CRAFT due to its use of intrinsic BLAS

routines [4] for the vector operations. This kind of primitive could also be used in the SHMEM programs, and the results shown here would then have been even better.

Figure 9: Code efficiency for the Craft code (Cyclic), the Fortran code (BRS) and the C code (BRS).

Figure 10: Complete iteration execution time, split into sparse MxV and dense operations, for the CYCLIC-CRAFT, BRS-C and BRS-Fortran versions of the parallel CG code: (a) a quite sparse matrix, (b) a rather dense matrix.
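As an illustration of the remark above about intrinsic BLAS routines, the dot products of the CG iteration (scalar1 and scalar3 in Figure 3) could be computed in the hand-written codes by a BLAS call on the local pieces of the vectors, with the per-processor partial results then combined by a SHMEM reduction. The sketch below uses the C interface to the DDOT routine described in [4]; whether the CBLAS interface or the Fortran one is called is an implementation detail, and the reduction routine is only named in a comment.

    #include <cblas.h>

    /* Local contribution to a CG dot product such as scalar1 = R.R:
       each processor applies DDOT to the vector elements it owns.         */
    double local_dot(int n_local, const double *r_local)
    {
        return cblas_ddot(n_local, r_local, 1, r_local, 1);
    }

    /* The global scalar is the sum of the local_dot() values of all
       processors; in a SHMEM program this sum can be obtained with a
       reduction routine (e.g. shmem_double_sum_to_all) or with puts into
       a symmetric buffer followed by a barrier.                           */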

6 Conclusions

Throughout this work we have tried to show the necessity of defining new data distributions in compilers for parallel computers. These new techniques are especially useful for irregular applications, where the current regular distributions are clearly insufficient. The new pseudo-regular distributions, such as MRD and BRS, exploit the locality of applications using sparse matrices. These distributions can be implemented in a data-parallel compiler, because this tool has enough information to translate those schemes. The results shown here demonstrate these points, although the performance benefits could be illustrated even better by using bigger sparse matrices. The inclusion of some pseudo-regular distributions in data-parallel compilers is currently under development [18], and this will be the way to obtain a data-parallel compiler that helps with the parallelization of irregular applications.

7 Acknowledgements

We want to thank the Edinburgh Parallel Computing Centre (Scotland, UK) for the use of the Cray T3D parallel machine as well as the CRAFT data-parallel compiler.

References

[1] R. Barrett, M. Berry, T. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, H. van der Vorst, Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, SIAM, 1994.

[2] R. Barriuso, A. Knies, SHMEM User's Guide for C (Revision 2.2), Cray Research Inc., August 1994.

[3] R. Barriuso, A. Knies, SHMEM User's Guide for Fortran (Revision 2.2), Cray Research Inc., August 1994.

[4] Basic Linear Algebra Subprograms: A Quick Reference Guide, Univ. of Tennessee, Oak Ridge National Laboratory, Numerical Algorithms Group Ltd.

[5] M.J. Berger and S.H. Bokhari, A Partitioning Strategy for Nonuniform Problems on Multiprocessors, IEEE Transactions on Computers, Vol. 36, No. 5, 1987.

[6] W. Blume, R. Eigenmann, K. Faigin, J. Grout, J. Hoeflinger, D. Padua, P. Petersen, B. Pottenger, L. Rauchwerger, P. Tu, S. Weatherford, Polaris: Improving the Effectiveness of Parallelizing Compilers, Proceedings 7th Workshop on Languages and Compilers for Parallel Computing, Ithaca, NY (published by Springer-Verlag, LNCS 892), August 1994.

[7] S.P. Booth, J. Fisher, P.H. Maccallum, A.D. Simpson, Introduction to the Cray T3D at EPCC, EPCC Internal Report, University of Edinburgh, September 1995.

[8] T. Brandes, Automatic Translation of Data Parallel Programs to Message Passing Programs, GMD Internal Report Adaptor 93-, January 1993.

[9] R. Calkin, R. Hempel, H.C. Hoppe, P. Wypior, Portable Programming with the PARMACS Message-Passing Library, Parallel Computing, Special Issue on Message Passing Interfaces (to appear).

[10] B. Chapman et al., Vienna Fortran Compilation System, User's Guide, Institute for Software Technology and Parallel Systems, 1993.

[11] N. Doss, W. Gropp, E. Lusk, A. Skjellum, A Model Implementation of MPI, Technical Report, Argonne National Laboratory, 1993.

[12] I.S. Duff, R.G. Grimes, J.G. Lewis, Users' Guide for the Harwell-Boeing Sparse Matrix Collection, Research and Technology Division, Boeing Computer Services, Seattle, WA, USA, 1992.

[13] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, V. Sunderam, PVM 3 User's Guide and Reference Manual, Technical Report, Oak Ridge National Laboratory, Knoxville, Tennessee, May 1993.

[14] High Performance Fortran Language Specification, Version 1.0, Technical Report TR92-225, Rice University, May 1993. Also available as Scientific Programming 2(1-2):1-170, Spring and Summer 1993.

[15] D.M. Pase, T. MacDonald, A. Meltzer, The CRAFT Fortran Programming Model, Scientific Programming, Vol. 3, 1994.

[16] C.D. Polychronopoulos, M.B. Girkar, M.R. Haghighat, C.L. Lee, B.P. Leung, D.A. Schouten, The Structure of Parafrase-2: an Advanced Parallelizing Compiler for C and Fortran, Proceedings 2nd Workshop on Languages and Compilers for Parallel Computing, August 1989.

[17] L.F. Romero, E.L. Zapata, Data Distributions for Sparse Matrix Vector Multiplication, Parallel Computing, Vol. 21, 1995.

[18] M. Ujaldon, Data-Parallel Compilation Techniques for Sparse Matrix Applications, PhD Thesis, University of Malaga. Also available as Technical Report UMA-DAC-96/02, Department of Computer Architecture, University of Malaga, January 1996 (in Spanish).
