Extending CRAFT Data-Distributions for Sparse Matrices. July 1996 Technical Report No: UMA-DAC-96/11


Extending CRAFT Data-Distributions for Sparse Matrices

G. Bandera    E.L. Zapata

July 1996
Technical Report No: UMA-DAC-96/11

Published in: 2nd. European Cray MPP Workshop, Edinburgh Parallel Computing Centre, UK, July 25-26, 1996

University of Malaga, Department of Computer Architecture, C. Tecnologico, PO Box 4114, E Malaga, Spain

Extending CRAFT Data-Distributions for Sparse Matrices

Gerardo Bandera    Emilio L. Zapata
Computer Architecture Department, University of Malaga, Campus de Teatinos, 29071 Malaga, Spain.

The work described in this paper was supported by the Ministry of Education and Science (CICYT) of Spain under project TIC and by TRACS (Training and Research on Advanced Computing Systems) under the Human Capital and Mobility Programme of the European Union (grant number ERB-CHGE-CT ).

Abstract

The existing methods for distributing data across processors are not very useful for irregular problems. However, there are some new distribution techniques for this kind of problem, such as MRD (Multiple Recursive Decomposition) and BRS (Block Row Scatter), which could be applied by a data-parallel compiler. In this paper we present a performance comparison of regular (BLOCK and CYCLIC) versus pseudo-regular (BRS) data distributions for sparse matrices on the Cray T3D. We show results for one of the most useful nonstationary iterative methods for solving symmetric positive definite systems, the Sparse Conjugate Gradient algorithm, using CRAFT with its data distributions and using two hand-parallelized codes written in C and Fortran with the BRS sparse distribution.

1 Overview

In the literature there are several methods for programming parallel machines: automatic parallelizers (Polaris [6], Parafrase [16]), data-parallel compilers (VFCS [10], Adaptor [8]) and message-passing libraries (PVM [13], MPI [11], PARMACS [9]). On the Cray T3D [7], users writing a parallel application can choose between two main alternatives. On the one hand, this parallel machine can be programmed using three protocols for passing messages between processors: Parallel Virtual Machine (PVM), Message Passing Interface (MPI) and the Shared Memory routines (SHMEM [2, 3]). The first two are useful to facilitate portability to other kinds of multiprocessors. The last one is the most powerful of the three, because it deals directly with efficient communication primitives; for that reason, the communication time is reduced considerably. MPI currently achieves better performance than PVM, although the latter contains some special, faster routines. In our work, in a first step we use PVM and Fast PVM routines; later on, we use the SHMEM routines because of their speed.
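To make the contrast with message passing concrete, the following is a minimal sketch, not taken from the paper's codes, of the one-sided style of communication that SHMEM offers: a processing element writes directly into a remote, symmetrically allocated buffer, with no matching receive on the other side. The calls are written with OpenSHMEM-style names, which descend from the Cray SHMEM routines documented in [2, 3]; the exact routine names and initialization calls on the T3D differed slightly, so this is only an illustration of the programming style.

    #include <stdio.h>
    #include <shmem.h>

    #define N 4

    /* Static (global) arrays are "symmetric": they exist at the same
       address on every processing element and can be the target of puts. */
    static double remote_buf[N];

    int main(void)
    {
        double local_buf[N];
        int me, npes, i;

        shmem_init();                 /* start_pes(0) on older SHMEM libraries */
        me   = shmem_my_pe();
        npes = shmem_n_pes();

        for (i = 0; i < N; i++)
            local_buf[i] = (double)(me * N + i);

        /* One-sided transfer: this PE writes its data directly into the
           symmetric buffer of its right neighbour, no receive is needed.     */
        shmem_double_put(remote_buf, local_buf, N, (me + 1) % npes);

        shmem_barrier_all();          /* ensure all puts have completed        */

        printf("PE %d received %g ... %g\n", me, remote_buf[0], remote_buf[N-1]);

        shmem_finalize();
        return 0;
    }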

On the other hand, the second alternative for programming the Cray T3D is the data-parallel compiler CRAFT [15], whose input language is an HPF [14] subset, consisting of sequential code extended with directives that help the compiler to parallelize applications. One important difference of this data-parallel compiler with respect to others currently available is that CRAFT is not a source-to-source translator, like VFCS or Adaptor, in which the resulting code must be compiled with additional libraries (one for every different machine on which the program is to be executed). This compiler produces a binary file that only runs on the Cray T3D.

Section 2 contains a brief introduction to sparse matrices and some of their representation schemes. Section 3 presents the current regular data distributions (3.1) and the new distribution techniques that can be implemented by a data-parallel compiler (3.2). Section 4 describes the algorithm, the programming models and the input matrices used in this work. Finally, Section 5 shows the results obtained by using regular and pseudo-regular distributions with those different programming models and sparse matrices, and Section 6 presents the conclusions.

2 Sparse Matrices

A matrix is called sparse if it contains a small number of non-zero entries. A range of methods have been developed for storing sparse matrices, which enable their computations to be performed with considerable savings in terms of both memory and computation [1]. Solution schemes are often optimized to take advantage of the structure within the matrix, and in the literature particular storage methods can be found for specific classes of sparse matrices; instead, we only assume general formats throughout the paper. In our work we have considered the CRS (Compressed Row Storage) format. This scheme represents the matrix using three vectors called Data, Column and Row. The first two vectors contain the non-zero entries and their corresponding column indices; the last one points to the beginning of the entries of every row of the matrix. A similar format for representing this kind of matrix is Compressed Column Storage (CCS). Figure 1 shows the CRS format applied to an example matrix.

3 Generic Data-Distributions

3.1 Regular Distributions

Current data-parallel compilers contain the most useful regular distributions for sharing data across processors, which are BLOCK, CYCLIC and BLOCK-CYCLIC. The way to specify such distributions in the Cray data-parallel Fortran compiler is shown in Table 1. Obviously, users have to choose the distribution depending on how the data are going to be used.

    Distribution        Specification
    BLOCK               cdir$ shared V(:block)
    CYCLIC              cdir$ shared V(:block(1))
    BLOCK-CYCLIC (k)    cdir$ shared V(:block(k))

Table 1: Basic mechanisms in CRAFT for distributing a vector V.

The above distributions are useful when the code contains regular accesses to the data. Nevertheless, there are several applications with irregular data access patterns, and the current CRAFT distributions are not sufficient for this kind of application.
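As an illustration of why such accesses are irregular, the following is a minimal C sketch (not part of the original codes; all names are illustrative) of the CRS scheme of Section 2 and of the sparse matrix-vector product that the algorithm of Section 4 relies on. The column indices stored in the Column vector drive an indirect access into the dense vector, so the elements touched by each processor cannot be predicted from a BLOCK or CYCLIC mapping alone.

    /* CRS storage of an n x m sparse matrix with nnz non-zero entries.
       data[k]   : value of the k-th stored non-zero
       column[k] : column index of the k-th stored non-zero
       row[i]    : position in data/column where row i starts
                   (row[n] == nnz closes the last row)                     */
    struct crs {
        int     n, m, nnz;
        double *data;
        int    *column;
        int    *row;
    };

    /* y = A * x for a matrix A held in CRS form.  The access x[column[k]]
       is an indirection through the column vector: its pattern depends on
       the sparsity structure, which is only known at run time.            */
    void crs_matvec(const struct crs *A, const double *x, double *y)
    {
        int i, k;
        for (i = 0; i < A->n; i++) {
            double sum = 0.0;
            for (k = A->row[i]; k < A->row[i + 1]; k++)
                sum += A->data[k] * x[A->column[k]];
            y[i] = sum;
        }
    }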

Figure 1: CRS representation (Data, Column and Row vectors, denoted DA, CO and RO) for an example 10x8 matrix A with 16 non-zero entries.

The insufficiency of the regular distributions implies the necessity of defining new data distributions for use in such applications; Section 5 shows some results that emphasize that requirement.

3.2 Pseudo-Regular Distributions

The existing methods for distributing data across processors are not very useful for irregular problems. However, there are some new distribution techniques for this kind of problem, such as MRD (Multiple Recursive Decomposition) and BRS (Block Row Scatter) [17], which could be applied by a data-parallel compiler [18]. Both MRD and BRS represent the local portion of the global matrix preserving the storage scheme used for the global matrix, and they can be considered an extension of the BLOCK and CYCLIC regular distributions for sparse (pseudo-regular) applications.

MRD is a generalization of the Binary Recursive Decomposition [5] for mapping data indices onto each processor. It improves workload balance and communications by means of horizontal and vertical partitions. This distribution is useful for algorithms in which the neighbourhood properties of the elements must be preserved. Figure 2-(a) shows an example of a sparse matrix partition.

BRS is similar to the regular CYCLIC distribution, but the local representation of the submatrices is the CRS format. It is useful when the neighbourhood properties of the elements are not so important in the algorithm. The main advantage of this distribution is the possibility of combining it with cyclically distributed data. Figure 2-(b) shows an example of this partition.
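To make the BRS mapping concrete, the sketch below is our own illustration of the scheme described in [17]; the function and variable names are not from the paper. The non-zeroes of a global CRS matrix are scattered cyclically over a Pr x Pc mesh of processors, and each processor keeps its own entries again in CRS form with reduced local indices (the caller is assumed to have counted the local non-zeroes beforehand in order to allocate the local arrays).

    /* BRS (Block Row Scatter): entry (i, j) of the global matrix is owned
       by processor (i mod Pr, j mod Pc) of a Pr x Pc mesh, and its local
       coordinates on that processor are (i / Pr, j / Pc).  Dense vectors
       distributed CYCLICally match this mapping.                          */

    /* Extract the local CRS structure for processor (pr, pc) from the
       global CRS arrays data/column/row of an n-row matrix.  Local row r
       corresponds to global row pr + r*Pr.                                */
    void brs_extract_local(int n, const double *data, const int *column,
                           const int *row, int pr, int pc, int Pr, int Pc,
                           double *ldata, int *lcolumn, int *lrow)
    {
        int i, k, r = 0, pos = 0;
        for (i = pr; i < n; i += Pr) {            /* rows owned by pr      */
            lrow[r++] = pos;
            for (k = row[i]; k < row[i + 1]; k++)
                if (column[k] % Pc == pc) {       /* columns owned by pc   */
                    ldata[pos]   = data[k];
                    lcolumn[pos] = column[k] / Pc;  /* local column index  */
                    pos++;
                }
        }
        lrow[r] = pos;                            /* close the last row    */
    }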

Figure 2: Example of pseudo-regular partitions: (a) MRD, (b) BRS.

4 Comparison Benchmark

To prove the efficiency and the need of implementing new data distributions in CRAFT, we show some results obtained using CRAFT with its own regular data distributions, together with two other codes, both parallelized by hand in C and Fortran with the BRS distribution and using SHMEM routines for communication between processors.

4.1 Code and Programming Models

To check the ability of the data distributions, we use one of the most useful and well-known nonstationary iterative methods for solving symmetric positive definite systems, the Sparse Conjugate Gradient algorithm [1], without preconditioners. With this algorithm we have tested three different versions. The first version uses the CYCLIC distribution of the Cray Fortran data-parallel compiler (CRAFT) [15] for the sparse vectors. In the second, the algorithm is written in C, distributing the data with the BRS distribution and using SHMEM routines [2] for communication between processors. In the last one, the algorithm is written in Fortran, also using the BRS distribution and the SHMEM [3] routines for processor communications.

4.2 Input Matrices

To account for the effects of the dimensions, sparsity rate and pattern of the input sparse matrix, we chose for computation several very different matrices taken from the Harwell-Boeing collection [12], in which they are identified as BCSSTK14, BCSSTK29, and .

BCSSTK14 is a medium sparse matrix used in linear equation systems; BCSSTK29 and a second BCSSTK matrix are very sparse matrices used in large eigenvalue problems; and the fourth matrix contains population migration data and is relatively dense (see Table 2 for the matrix characteristics). As can be seen, the two eigenvalue matrices are very similar, whereas BCSSTK14 is very small. For these reasons, we sometimes show timings only for one of the eigenvalue matrices together with the migration matrix; when interesting, all the others will be included too.

    Matrix        Dimensions    Non-zeroes    Sparsity Rate (%)
    BCSSTK14
    BCSSTK29

Table 2: Characteristics of the benchmark matrices.

          PROGRAM SpCG
    c     Declaration Part
    c     Distribution Part
    cdir$ shared scalar1, scalar2, scalar3          ! Shared Vbles
    cdir$ geometry cyclic (:block(1))
    cdir$ shared (cyclic) :: P, Q, R, X             ! Dense Vectors
    cdir$ shared (cyclic) :: Data, Column, Row      ! Sparse Matrix Vectors
    c     Initialization
    c     Reading Sparse Matrix and Dense Vectors
    c     Vectors Preprocessing
    c     Main Loop: Convergence Part
          DO K = 1, NUM_ITERS
            IF (K .GT. 1) THEN
              scalar2 = scalar1
              scalar1 = 0.0
            ENDIF
            scalar1 = dot_product(R,R)              ! Vector-Vector Product
            IF (K .EQ. 1) THEN                      ! IF-ELSE updating P
              P = R
            ELSE
              P = R + (scalar1/scalar2) * P
            ENDIF
    cdir$   DO SHARED (I) ON Q(I)
            DO I = 1, N                             ! SPARSE Matrix-Vector Product
              Q(I) = 0.0
              DO J = Row(I), Row(I+1)-1
                Q(I) = Q(I) + Data(J)*P(Column(J))
              ENDDO
            ENDDO
            scalar3 = dot_product(P,Q)              ! Vector-Vector Product
            scalar3 = scalar1/scalar3
            X = X + scalar3 * P                     ! Update solution
            R = R - scalar3 * Q                     ! Update residuals
    c       Checking Convergence Criteria
          ENDDO
          END

Figure 3: CRAFT code for the Sparse Conjugate Gradient algorithm (SpCG). The coefficients of the system are stored in a sparse matrix A, which is the main input structure to the algorithm. This matrix is internally represented in the CRS format (that is, with the Data, Column and Row vectors).
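For comparison with the CRAFT version in Figure 3, the following sketch (again our own illustration, not the paper's BRS-C code; all names are hypothetical) shows how the hand-parallelized codes can organize the sparse matrix-vector product of the CG iteration under BRS: each processor multiplies its local CRS entries by the cyclically distributed piece of P owned by its mesh column, producing partial row sums that afterwards have to be combined across each row of the processor mesh. In the paper's codes, that combination and the exchange of vector elements are done with SHMEM routines.

    /* Local part of Q = A * P under BRS on processor (pr, pc).
       ldata/lcolumn/lrow : local CRS entries (see the BRS extraction
                            sketch in Section 3.2)
       p_local            : elements of P owned by this mesh column
                            (global index j  ->  local index j / Pc)
       q_partial          : partial sums for the local rows; the full Q(i)
                            is obtained by adding the q_partial values of
                            the Pc processors that share mesh row pr.      */
    void brs_local_matvec(int nrows_local,
                          const double *ldata, const int *lcolumn,
                          const int *lrow, const double *p_local,
                          double *q_partial)
    {
        int r, k;
        for (r = 0; r < nrows_local; r++) {
            double sum = 0.0;
            for (k = lrow[r]; k < lrow[r + 1]; k++)
                sum += ldata[k] * p_local[lcolumn[k]];
            q_partial[r] = sum;
        }
    }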

5 Data-Distributions Comparison

5.1 Comparing CRAFT Regular Distributions

Using CRAFT, programmers can distribute the data across processors using block and cyclic distributions. The choice of a distribution method has to be made depending on the access properties of the data involved in the application. Figure 4 shows a comparison of these distributions for four different matrices.

Figure 4: Dense data-distribution comparison: ratio of the BLOCK to the CYCLIC execution time of the CRAFT CG code.

In the above figure it is possible to see the improvement obtained by using cyclic distributions instead of block. This improvement grows as the number of processors is increased. Given that result, it is obvious to choose cyclic to share the data across processors, and therefore this will be the regular distribution compared with the BRS pseudo-regular distribution. In most sparse applications, the Row sparse vector is accessed many times, and this can produce delays in data access due to excessive communication. We even tried to replicate this index vector across processors to solve this problem; the results were slightly better, but almost the same.

5.2 Codes Comparison

Figure 5 shows a comparison of the execution time of the three codes explained before. The plots indicate the better performance of the BRS codes with respect to the cyclic distribution of CRAFT. At the same time, it is possible to observe the scalability properties of the application for all the codes. In these figures readers can see that the BRS-C code is even better than the BRS-Fortran one. The gain of the different codes remains constant when the number of processors is increased. This is shown in Figures 6 and 7, which contain the pairwise comparison of the three codes. These figures show that the CYCLIC-CRAFT code is about five times slower than the BRS-C code and about three times slower than the BRS-Fortran code. It is also important to note that the difference between C and Fortran is up to a factor of two.

Figure 5: CG execution time (on a logarithmic scale) for two sparse matrices: (a) a quite sparse matrix, (b) a rather dense matrix. Each plot compares the Craft code (Cyclic), the Fortran code (BRS) and the C code (BRS).

Figure 6: Improvement (time CRAFT / time BRS) of the manual BRS versions of the CG algorithm over CYCLIC-CRAFT: (a) Fortran language, (b) C language.

Continuing with these results, Figures 8 and 9 show the speedup and efficiency of all the codes for two representative matrices. These figures indicate the scalability of the problem with all the codes. In most cases Craft obtains well-scaling programs, due to the intrinsic properties of the compiler; anyway, the other two codes also obtain good results (even better with one of the two matrices).

Figure 7: Comparing the C and Fortran BRS codes (time BRS-Fortran / time BRS-C).

Figure 8: Code speedup for the Craft code (Cyclic), the Fortran code (BRS) and the C code (BRS) on two matrices.

5.3 Final Comparison

Finally, Figure 10 shows the execution time per iteration for two matrices, but now two important quantities are separated: on the one hand, the time of the sparse matrix-vector multiplication and, on the other hand, the time of the remaining dense operations. As can be seen, the time of the sparse multiplication is more than 90% of the total time of every iteration of the algorithm, so it is necessary to improve the data distribution across processors to achieve a good load balance and to decrease the number of communications. Another remark is the advantage of CRAFT due to its use of intrinsic BLAS

routines [4] for the vector operations. This kind of primitive could also be used in the SHMEM programs, and the results shown here would then have been even better.

Figure 9: Code efficiency for the Craft code (Cyclic), the Fortran code (BRS) and the C code (BRS).

Figure 10: Complete iteration execution time, split into sparse MxV and dense operations, for the CYCLIC-CRAFT, BRS-C and BRS-Fortran versions of the parallel CG code: (a) a quite sparse matrix, (b) a rather dense matrix.
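As an illustration of the remark above about intrinsic BLAS routines, the dot products of the CG iteration (scalar1 and scalar3 in Figure 3) could be computed in the hand-written codes by a BLAS call on the local pieces of the vectors, with the per-processor partial results then combined by a SHMEM reduction. The sketch below uses the C interface to the DDOT routine described in [4]; whether the CBLAS interface or the Fortran one is called is an implementation detail, and the reduction routine is only named in a comment.

    #include <cblas.h>

    /* Local contribution to a CG dot product such as scalar1 = R.R:
       each processor applies DDOT to the vector elements it owns.         */
    double local_dot(int n_local, const double *r_local)
    {
        return cblas_ddot(n_local, r_local, 1, r_local, 1);
    }

    /* The global scalar is the sum of the local_dot() values of all
       processors; in a SHMEM program this sum can be obtained with a
       reduction routine (e.g. shmem_double_sum_to_all) or with puts into
       a symmetric buffer followed by a barrier.                           */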

6 Conclusions

Throughout this work we have tried to show the necessity of defining new data distributions in compilers for parallel computers. These new techniques are especially useful for irregular applications, where the current regular distributions are clearly insufficient. The new pseudo-regular distributions, such as MRD and BRS, exploit the locality of applications using sparse matrices. These distributions can be implemented in a data-parallel compiler, because this tool has enough information to translate those schemes. The results shown here demonstrate these points, although the performance benefits could be illustrated even better by using bigger sparse matrices. The inclusion of some pseudo-regular distributions in data-parallel compilers is currently under development [18], and this will be the way to obtain a data-parallel compiler that helps with the parallelization of irregular applications.

7 Acknowledgements

We want to thank the Edinburgh Parallel Computing Centre (Scotland, UK) for the use of the Cray T3D parallel machine as well as the CRAFT data-parallel compiler.

References

[1] R. Barrett, M. Berry, T. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, H. van der Vorst, Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, SIAM, 1994.

[2] R. Barriuso, A. Knies, SHMEM User's Guide for C (Revision 2.2), Cray Research Inc., August 1994.

[3] R. Barriuso, A. Knies, SHMEM User's Guide for Fortran (Revision 2.2), Cray Research Inc., August 1994.

[4] Basic Linear Algebra Subprograms: A Quick Reference Guide, Univ. of Tennessee, Oak Ridge National Laboratory, Numerical Algorithms Group Ltd.

[5] M.J. Berger and S.H. Bokhari, A Partitioning Strategy for Nonuniform Problems on Multiprocessors, IEEE Transactions on Computers, Vol. 36, No. 5, 1987.

[6] W. Blume, R. Eigenmann, K. Faigin, J. Grout, J. Hoeflinger, D. Padua, P. Petersen, B. Pottenger, L. Rauchwerger, P. Tu, S. Weatherford, Polaris: Improving the Effectiveness of Parallelizing Compilers, Proceedings 7th Workshop on Languages and Compilers for Parallel Computing, Ithaca, NY (published by Springer-Verlag, LNCS 892), August 1994.

[7] S.P. Booth, J. Fisher, P.H. Maccallum, A.D. Simpson, Introduction to the Cray T3D at EPCC, EPCC Internal Report, University of Edinburgh, September 1995.

[8] T. Brandes, Automatic Translation of Data Parallel Programs to Message Passing Programs, GMD Internal Report Adaptor 93-, January 1993.

[9] R. Calkin, R. Hempel, H.C. Hoppe, P. Wypior, Portable Programming with the PARMACS Message-Passing Library, Parallel Computing, Special Issue on Message Passing Interfaces (to appear).

[10] B. Chapman et al., Vienna Fortran Compilation System, User's Guide, Institute for Software Technology and Parallel Systems, 1993.

[11] N. Doss, W. Gropp, E. Lusk, A. Skjellum, A Model Implementation of MPI, Technical Report, Argonne National Laboratory, 1993.

[12] I.S. Duff, R.G. Grimes, J.G. Lewis, Users' Guide for the Harwell-Boeing Sparse Matrix Collection, Research and Technology Division, Boeing Computer Services, Seattle, WA, USA, 1992.

[13] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, V. Sunderam, PVM 3 User's Guide and Reference Manual, Technical Report, Oak Ridge National Laboratory, Knoxville, Tennessee, May 1993.

[14] High Performance Fortran Language Specification, Version 1.0, Technical Report TR92-225, Rice University, May 1993. Also available as Scientific Programming 2(1-2):1-170, Spring and Summer 1993.

[15] D.M. Pase, T. MacDonald, A. Meltzer, The CRAFT Fortran Programming Model, Scientific Programming, Vol. 3, 1994.

[16] C.D. Polychronopoulos, M.B. Girkar, M.R. Haghighat, C.L. Lee, B.P. Leung, D.A. Schouten, The Structure of Parafrase-2: an Advanced Parallelizing Compiler for C and Fortran, Proceedings 2nd Workshop on Languages and Compilers for Parallel Computing, August 1989.

[17] L.F. Romero, E.L. Zapata, Data Distributions for Sparse Matrix Vector Multiplication, Parallel Computing, Vol. 21, 1995.

[18] M. Ujaldon, Data-Parallel Compilation Techniques for Sparse Matrix Applications, PhD Thesis, University of Malaga. Also available as Technical Report UMA-DAC-96/02, Department of Computer Architecture, University of Malaga, January 1996 (in Spanish).
