Parallel solution for finite element linear systems of equations on workstation cluster*
Aug. 2009, Volume 6, No. 8 (Serial No. 57), Journal of Communication and Computer, USA

Parallel solution for finite element linear systems of equations on workstation cluster*

FU Chao-jiang (Department of Civil Engineering, Fujian University of Technology, Fuzhou, China)

Abstract: With the advances in high-speed computer network technology, the workstation cluster is becoming the main environment for parallel processing. Finite element linear systems of equations arise throughout structural analysis in civil engineering. The preconditioned conjugate gradient method (PCGM) is an iterative method for solving finite element systems of equations with symmetric positive definite system matrices. In this paper, the PCGM algorithm is parallelized and implemented on a DELL workstation cluster. Optimization techniques for the sparse matrix-vector multiplication are adopted in the implementation, and the storage scheme is analyzed in detail. The experimental results show that the designed parallel algorithm achieves high speedup and good efficiency on a high-performance workstation cluster, illustrating the power of parallel computing in solving large problems much faster than a single processor can.

Key words: parallel computing; preconditioned conjugate gradient method; finite element method; network; workstation cluster

* Acknowledgment: This work is supported by the Fujian Province Natural Science Foundation (No. 2008J0180) and the Scientific Research Start Foundation of Fujian University of Technology (No. GY-Z0707). Corresponding author: FU Chao-jiang, Ph.D.; research field: parallel computing.

1. Introduction

In recent years, parallel computing has become an important field in scientific computing, for two main reasons. First, parallel computing makes it possible to handle large-scale problems. Second, networks of computers running Linux, connected via standard software such as MPI, form a workstation cluster, which is a low-cost parallel machine on which one can gain first experience in parallel computing. A workstation cluster is therefore poised to become a primary computing infrastructure for science and engineering: it is cheap, highly available, and provides multiple CPUs for parallel computing.

Solving finite element linear systems of equations is one of the most important problems in structural analysis in civil engineering. The application of the finite element method to engineering problems requires the solution of a system of linear equations. In general, these systems can be written as

Ax = b,

where A is an n × n sparse matrix, x is the vector of nodal unknowns, and b is the n-dimensional balance force vector. Due to the nature of the finite element method, A and b are constructed by assembling the individual element contributions,

A = Σ_{e=1}^{nel} A_e,  b = Σ_{e=1}^{nel} b_e,

where A_e are the element matrices, b_e are the element force vectors, and nel is the number of elements in the mesh. This yields a linear system with a symmetric positive definite sparse system matrix. Unfortunately, the direct methods of elementary linear algebra (e.g., Gaussian elimination) are difficult to adapt to distributed memory systems, especially if the coefficient matrix is sparse, so most sparse linear system solvers for distributed memory systems use iterative methods [1-2]. An iterative method makes an initial guess at a solution to the system and then repeatedly tries to improve that guess; it thus produces a sequence of vectors that gradually converges to the solution of the linear system. Since the finite element method yields a linear system with a symmetric positive definite sparse system matrix, PCGM can be used as the linear solver.

2. Preconditioned conjugate gradient method

To solve the finite element linear system of equations Ax = b, the PCG method is used. The CG method is an iterative technique that converges to the solution of the system through a sequence of at most n vector approximations. From any vector approximation to the solution, a search direction conjugate to the previous ones is used to determine a new approximation. In practice, only a few iterations are needed to obtain a good estimate of the solution. In the preconditioned version of the CG method, both the condition number of A and the number of iterations may be reduced by pre-multiplying the system Ax = b by a preconditioner M. Because the system matrix A has diagonal elements that are very different in magnitude, the diagonal matrix of A is adopted as the preconditioner M. The PCG method is listed below.

Preconditioned conjugate gradient algorithm
initialize: x_0; r_0 = b − A x_0; solve M u_0 = r_0; p_0 = u_0
for k = 0, 1, 2, …
 (1) α_k = (u_k, r_k)/(p_k, A p_k)
 (2) x_{k+1} = x_k + α_k p_k
 (3) r_{k+1} = r_k − α_k A p_k
 (4) solve M u_{k+1} = r_{k+1}
 (5) if (r_{k+1}, r_{k+1}) ≤ ε, return
 (6) β_k = (u_{k+1}, r_{k+1})/(u_k, r_k)
 (7) p_{k+1} = u_{k+1} + β_k p_k
end for

3. Parallel implementation of the PCG method

The PCG method can be parallelized because every vector operation of the algorithm can be parallelized separately: each processor executes scalar operations on a subset of the components of a vector, so the parallel implementation of the preconditioned conjugate gradient method is mathematically identical to the serial implementation. The parallelism comes from storing and operating on local sections of the working vectors and the system matrix. The parallel implementation of the preconditioned conjugate gradient method is presented below.
initialize: x_0; r_0 = b − A x_0; solve M u_0 = r_0; p_0 = u_0; c = dot(u_0, r_0)
M is the preconditioning matrix, M = diag(a_11, a_22, …, a_nn)
main iteration:
 (1) z = A p            /* parallel matrix-vector product */
 (2) α = c/dot(p, z)    /* parallel vector dot product */
 (3) x = x + α·p        /* BLAS daxpy operation */
 (4) r = r − α·z        /* BLAS daxpy operation */
 (5) solve M u = r for u /* solve preconditioner system */
 (6) d = dot(u, r)      /* parallel vector dot product */
 (7) if (sqrt(d) < tolerance) break
 (8) β = d/c
 (9) p = u + β·p        /* BLAS daxpy operation */
 (10) gather(p)         /* parallel gather */
 (11) c = d
 (12) go to (1)

Here BLAS (basic linear algebra subprograms) is a collection of routines that perform specific vector and matrix operations [3]. The parallel PCGM algorithm is implemented in the programming language C++, using the Message Passing Interface (MPI) standard [4] for communication between processors. For the solution of a linear system of equations of dimension n, the algorithm involves 1 matrix-vector product, 2 dot products, and 3 vector updates (daxpy operations) per iteration. The implementation is optimized with respect to memory usage by storing exactly 4 vectors of length n, the smallest number possible, and by using a function that implements the matrix-vector multiplication in matrix-free form, i.e., without storing any matrix [5-6]. The vectors stored are the approximate solution x, the search direction p, the residual vector r, and the auxiliary vector z.

The parallel form of the PCG algorithm requires 4 communication operations per iteration. To compute the matrix-vector product, processors need to exchange data with their neighboring processors; these communications are implemented with nonblocking MPI_Isend/MPI_Irecv commands. Since the dot products apply to vectors split across the processors, and since the results are needed on all processors, an MPI_Allreduce operation is required for each of the 2 dot products. With a diagonal preconditioner, no additional communication is required, since diagonal preconditioning can be applied locally to all rows contained in a process; other preconditioners, such as incomplete Cholesky, would require additional communication. The diagonal preconditioner is therefore used in this algorithm. Optimization techniques [7-8] for the sparse matrix-vector multiplication are adopted in the implementation: the serial part of each dot product, local to each processor, is implemented in a C++ inner-product function dot. The most expensive operation in each iteration is the matrix-vector product in step (1). In this paper, the computations in the PCG algorithm are reorganized in order to hide the latency of the communications. This overlap of communication with computation is implemented using asynchronous messages in the message-passing model described below. The parallelization of the product z = A p can thus be performed with an overlap between the messages and the computations.
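The serial structure of the loop above can be sketched in C++ as follows. This is a minimal illustration with a Jacobi (diagonal) preconditioner, with the matrix supplied as a callback so that a matrix-free product can be plugged in; the MPI parts (Allreduce for the dot products, halo exchange inside the matrix-vector product) are omitted, and all names are illustrative rather than taken from the paper's code.

```cpp
#include <vector>
#include <cmath>
#include <functional>

using Vec = std::vector<double>;
using MatVec = std::function<Vec(const Vec&)>;

// Serial part of the dot product; in the parallel version each processor
// computes this on its local components and an MPI_Allreduce sums the results.
static double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// PCG with Jacobi preconditioner M = diag(A); `diag` holds the diagonal of A.
// Returns the iterate once sqrt((u,r)) drops below tol, or after max_iter steps.
Vec pcg_jacobi(const MatVec& A, const Vec& b, const Vec& diag,
               double tol = 1e-10, int max_iter = 1000) {
    const std::size_t n = b.size();
    Vec x(n, 0.0), r = b, u(n), p(n), z(n);
    for (std::size_t i = 0; i < n; ++i) u[i] = r[i] / diag[i];  // solve M u = r
    p = u;
    double c = dot(u, r);
    for (int k = 0; k < max_iter; ++k) {
        z = A(p);                                                // (1) z = A p
        double alpha = c / dot(p, z);                            // (2) step length
        for (std::size_t i = 0; i < n; ++i) x[i] += alpha * p[i]; // (3) daxpy
        for (std::size_t i = 0; i < n; ++i) r[i] -= alpha * z[i]; // (4) daxpy
        for (std::size_t i = 0; i < n; ++i) u[i] = r[i] / diag[i]; // (5) M u = r
        double d = dot(u, r);                                    // (6) dot product
        if (std::sqrt(d) < tol) break;                           // (7) converged
        double beta = d / c;                                     // (8)
        for (std::size_t i = 0; i < n; ++i) p[i] = u[i] + beta * p[i]; // (9)
        c = d;                                                   // (11)
    }
    return x;
}
```

Passing the matrix as a callback is what later allows the matrix-free product of Section 4 to be dropped in without changing the solver.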
The required message contains the halo data of p, i.e., the data corresponding to the grid points in the domain of one processor that are used to compute entries of z in a neighboring processor; this message is overlapped with computations on the matrix rows corresponding to the inner part of the domain, where the inner part is the domain without its halos. The latency of the communication step in the inner product p^T z can be hidden by delaying the update of x by one iteration [9].

4. Storage scheme

The aim of the matrix storage scheme is to achieve better performance: the storage technique has an important impact on performance. There are several possibilities for storing a matrix; the key is not to store unnecessary data. The emphasis here is on developing an algorithm that can solve problems too large to be solved efficiently on a single CPU.

Storing all n² elements of A ∈ R^{n×n} explicitly is known as dense storage mode. This is the simplest possible storage scheme, and for dense matrices it is a good technique. For a sparse matrix it is not, because all zero elements are stored: since most of the n² elements are 0, this is wasteful both in memory and in runtime, as many multiplications involving zeros are needlessly computed. Simulations using this storage mode can be easily programmed in a language such as C++.

A common idea is to take advantage of the sparsity of the matrix, i.e., the low percentage of non-zero elements. In this sparse storage mode, only non-zero elements are stored, along with integer index information indicating the position of each element in the matrix. This reduces the memory requirements for the system matrix and also improves performance, as only multiplications with elements that are stored explicitly, i.e., non-zeros, are computed. To reduce the memory usage further, advantage can be taken of the fact that the coefficients of the system matrix are constants in pre-determined positions.
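As a concrete illustration of the constant-coefficient idea for the two-dimensional Poisson problem solved later in the paper, a matrix-free product z = A p can hard-code the stencil coefficients instead of storing them. The sketch below assumes the standard 5-point stencil on an N × N interior grid with a homogeneous Dirichlet boundary (a discretization consistent with, but not spelled out in, the paper); it is an illustration of the technique, not the paper's exact code.

```cpp
#include <vector>
#include <cstddef>

// Matrix-free product z = A p for the 5-point Laplacian stencil on an
// N x N interior grid with homogeneous Dirichlet boundary, unknowns
// numbered row by row: index(i, j) = i*N + j.  No matrix entries are
// stored; the coefficients (4 on the diagonal, -1 per neighbor) are
// hard-coded in the loop body.
std::vector<double> apply_laplacian(const std::vector<double>& p, std::size_t N) {
    std::vector<double> z(N * N);
    for (std::size_t i = 0; i < N; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            std::size_t k = i * N + j;
            double v = 4.0 * p[k];
            if (i > 0)     v -= p[k - N];  // neighbor in row above
            if (i < N - 1) v -= p[k + N];  // neighbor in row below
            if (j > 0)     v -= p[k - 1];  // left neighbor
            if (j < N - 1) v -= p[k + 1];  // right neighbor
            z[k] = v;
        }
    }
    return z;
}
```

In the parallel version, the rows of the grid are split across processors and the `p[k - N]`/`p[k + N]` accesses at the partition boundaries are exactly the halo data exchanged with the neighboring processors.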
A function is provided that accepts a vector p as input and returns the vector z = A p as output; each component z_k is computed as a summation of the appropriate components of p multiplied by hard-coded coefficients [5]. This matrix-vector multiplication function is inserted at the appropriate place in the C++ implementation of the PCGM. This technique is known as a matrix-free implementation, because no elements of the system matrix A are explicitly stored at all [6]. Such a matrix-free method dramatically reduces the memory requirements; it is the most efficient approach with respect to memory usage, and it is the scheme adopted in this paper.

Memory is an important limitation on the size of the system of equations that can be solved. As stated earlier, the minimum number of vectors needed to compute the solution of the linear system is 4 vectors of length n; aside from the system matrix, a prediction of 4n elements therefore accounts for almost all of the memory needed with the matrix-free method. Using the sparse storage mode, the non-zero elements of the system matrix must be stored in addition: for example, if the matrix is pentadiagonal, 5 vectors of length n are stored for the matrix, so the memory prediction is 5n elements added to the matrix-free prediction. Densely storing the matrix adds n² elements to the matrix-free prediction. The theoretical predictions of memory usage for the different storage methods, using double-precision arithmetic, are shown in Table 1.

Table 1 Predicted memory usage for the three storage methods in Megabytes (MB)
n      Dense   Sparse   Matrix-free
256    <1      <1       <1

5. Numerical test and analysis

5.1 Parallel platform

The numerical and performance tests of the parallel PCG algorithm were performed on the DELL workstation cluster of the School of Computer Engineering and Science, Shanghai University. It is a cluster with 12 processors, arranged in 1 four-processor node with 2.0 GHz Intel Xeon chips (1 MB cache) and 4 dual-processor nodes with 2.4 GHz Intel Xeon chips (512 KB cache), with 1.0 GB of memory per node. The nodes are connected by an Ethernet interconnect, and the Message Passing Interface (MPI) paradigm [10] is used. For the message-passing version, the MPICH library was chosen because of its efficiency on this computer architecture.
The MPICH library has been widely adopted as the message-passing interface of choice for many years. In the MPICH implementation, inter-processor communication is performed through a special interface provided to send a message and a matching interface on the other processor to receive it. In our program, the communications are implemented with nonblocking operations in order to overlap communication with computation.

5.2 Numerical test problem

The problem to be solved is a classical prototype problem: the Poisson equation with a homogeneous Dirichlet boundary condition, discretized using finite elements in two dimensions. This yields a linear system with a symmetric positive definite system matrix, which allows the use of the PCG method presented in this paper as the solver. The results of the computations are listed in Table 2 and Fig. 1. At each time step, the solution of the system of algebraic equations can be divided into two parts: (i) initialization, which includes the generation of the matrix A and the right-hand side vector b, the calculation of the preconditioner, and the initialization of the PCG method; and (ii) the iterations of the PCG method. Since different problem sizes imply different convergence rates, the ratio between the times required for the two parts varies as the size of the problem varies. In order to simplify the presentation of the results and avoid problems with the accuracy of the timer and perturbations to the cache behavior, we have timed both parts together.

5.3 Numerical results analysis

Memory is an important limitation on the size of the system that can be solved. Aside from the system matrix, the vectors of length n used in the method make up the bulk of the memory; any other variables are ignored.
The memory estimate is made simply by counting the number of vectors used. As stated earlier, the minimum number of vectors needed in this algorithm is 4 of length n. Table 1 shows the theoretical predictions of the memory used by the C++ code for the dense, sparse, and matrix-free implementations, respectively, when double-precision arithmetic is used. Less than 1 MB of memory is used with sparse or matrix-free storage, compared to 134 MB with dense storage, when the dimension n of the system matrix is 4096; increasing n to 16384 takes over 2 GB with dense storage. It is then not possible to compute the problem on a single processor of the workstations, because only 1 GB of memory is available to a single-processor computation, which bounds the largest system that fits in the memory of a single processor. Clearly, the matrix-free implementation is the best storage method: with it, the method is optimal with respect to memory usage, and we are able to solve problems that are much too large for single-processor computers.

Table 2 and Fig. 1 summarize the timings and speedups for this cluster, using up to 12 processors. It is readily apparent from both Table 2 and Fig. 1 that the timings continue to decrease significantly all the way up to 12 processors. When the number of processors is greater than 4, however, an apparent slow-down in the speedup is observed, because the communication overhead between processors increases significantly. The cache of each processor is of a good size for non-computational uses but does not hold much data, which leads to frequent loading of data from memory into the cache, and the nodes have a 32-bit bus that cannot serve the data as fast as the processors can consume it. With up to 4 processors, the computation runs entirely on the four-processor node; the communication overhead between processors is lower, and thus the performance is better.
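For reference, the speedup and efficiency reported in this section follow the usual definitions S_p = T_1/T_p and E_p = S_p/p, where T_1 is the single-processor time and T_p the time on p processors. A minimal sketch (illustrative helper functions, not the paper's code):

```cpp
#include <cstddef>

// Speedup S_p = T_1 / T_p: how many times faster the p-processor run is.
double speedup(double t_serial, double t_parallel) {
    return t_serial / t_parallel;
}

// Efficiency E_p = S_p / p: fraction of ideal linear speedup achieved.
double efficiency(double t_serial, double t_parallel, std::size_t nprocs) {
    return speedup(t_serial, t_parallel) / static_cast<double>(nprocs);
}
```

An efficiency near 1 indicates that communication overhead is well hidden; the drop observed beyond 4 processors corresponds to E_p falling as inter-node communication grows.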
Table 2 Timings in seconds for different n on the 12-processor cluster

Fig. 1 Plot of the speedup for different n on the 12-processor cluster

Fig. 2 Plot of the speedup for different n on the 4-processor cluster

To analyze the effect of a different parallel environment on this algorithm, a second cluster was constituted, consisting of 4 single-processor nodes with 600 MHz Intel Pentium III processors and 256 MB of memory, connected by an Ethernet interconnect. The same computations were performed on this cluster, and the results are listed in Table 3 and Fig. 2. The results for up to 4 processors show that using about 2 processors constitutes the most efficient use of the resources. Though the processor speed is lower than that of the cluster used earlier, the results are good. The speedups in Fig. 1 are higher than those in Fig. 2 for the same number of processors and the same problem size. This result shows that a high-performance parallel platform is required for parallel computing.

Table 3 Timings in seconds for different n on the 4-processor cluster

6. Conclusions

Parallel computing is a useful tool for solving large-scale problems faster. The PCG algorithm is parallelized, the parallel algorithm is implemented on two different clusters, and its performance is analyzed. The reduction in memory due to the matrix-free implementation not only allows the solution of much larger problems but also decreases computing time. The parallel performance studies show that a high-performance parallel platform is necessary to obtain excellent speedup on a cluster, and that load balancing is very important to ensure good scalability and to increase efficiency. These observations reflect the fact that the parallel PCGM algorithm involves 4 communication operations per iteration and therefore requires a tightly coupled cluster. This shows how important it is to have a workstation cluster available. The algorithm, implemented on different parallel platforms, appears to be robust.

References:
[1] Barret R. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. SIAM, Philadelphia.
[2] Law K. H. A parallel finite element solution method. Computers & Structures, 1986, 23(6).
[3] Karniadakis G. E. Parallel Scientific Computing in C++ and MPI. Cambridge University Press.
[4] Pacheco P. S. Parallel Programming with MPI. Morgan Kaufmann.
[5] Allen K. P. Efficient Parallel Computing for Solving Linear Systems of Equations. Graduate Student Seminar, University of Maryland, Baltimore County, November.
[6] Aliaga J. I., Hernandez V. Symmetric sparse matrix-vector product on distributed memory multiprocessors. Conference on Parallel Computing and Transputer Applications, Barcelona, Spain.
[7] Brown P. N., Hindmarsh A. C. Matrix-free methods for stiff systems of ODEs. SIAM J. Numer. Anal., 1986, 23.
[8] Geus R., Rollin S. Towards a fast parallel sparse symmetric matrix-vector multiplication. Parallel Computing, 2007, 27(2).
[9] Nastea S. G. Load-balanced sparse matrix-vector multiplication on parallel computers. Parallel and Distributed Computing, 1997, 46.
[10] FU Chao-jiang. The research on parallel computation of finite element structural analysis based on MPI cluster. Doctoral dissertation, Shanghai University, Shanghai, China.

(Edited by Amy, Jane)
More informationSuper Matrix Solver-P-ICCG:
Super Matrix Solver-P-ICCG: February 2011 VINAS Co., Ltd. Project Development Dept. URL: http://www.vinas.com All trademarks and trade names in this document are properties of their respective owners.
More informationEfficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs
Efficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Peter Zajac, Stefan Turek Institut für Angewandte Mathematik TU Dortmund,
More information3D Helmholtz Krylov Solver Preconditioned by a Shifted Laplace Multigrid Method on Multi-GPUs
3D Helmholtz Krylov Solver Preconditioned by a Shifted Laplace Multigrid Method on Multi-GPUs H. Knibbe, C. W. Oosterlee, C. Vuik Abstract We are focusing on an iterative solver for the three-dimensional
More informationPerformance Evaluation of a New Parallel Preconditioner
Performance Evaluation of a New Parallel Preconditioner Keith D. Gremban Gary L. Miller Marco Zagha School of Computer Science Carnegie Mellon University 5 Forbes Avenue Pittsburgh PA 15213 Abstract The
More informationCost-Effective Parallel Computational Electromagnetic Modeling
Cost-Effective Parallel Computational Electromagnetic Modeling, Tom Cwik {Daniel.S.Katz, cwik}@jpl.nasa.gov Beowulf System at PL (Hyglac) l 16 Pentium Pro PCs, each with 2.5 Gbyte disk, 128 Mbyte memory,
More informationScalability of Heterogeneous Computing
Scalability of Heterogeneous Computing Xian-He Sun, Yong Chen, Ming u Department of Computer Science Illinois Institute of Technology {sun, chenyon1, wuming}@iit.edu Abstract Scalability is a key factor
More informationAMS526: Numerical Analysis I (Numerical Linear Algebra)
AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 5: Sparse Linear Systems and Factorization Methods Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical Analysis I 1 / 18 Sparse
More informationLecture 27: Fast Laplacian Solvers
Lecture 27: Fast Laplacian Solvers Scribed by Eric Lee, Eston Schweickart, Chengrun Yang November 21, 2017 1 How Fast Laplacian Solvers Work We want to solve Lx = b with L being a Laplacian matrix. Recall
More information2 Fundamentals of Serial Linear Algebra
. Direct Solution of Linear Systems.. Gaussian Elimination.. LU Decomposition and FBS..3 Cholesky Decomposition..4 Multifrontal Methods. Iterative Solution of Linear Systems.. Jacobi Method Fundamentals
More informationTowards a complete FEM-based simulation toolkit on GPUs: Geometric Multigrid solvers
Towards a complete FEM-based simulation toolkit on GPUs: Geometric Multigrid solvers Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Peter Zajac, Stefan Turek Institut für Angewandte Mathematik TU Dortmund,
More informationLINUX. Benchmark problems have been calculated with dierent cluster con- gurations. The results obtained from these experiments are compared to those
Parallel Computing on PC Clusters - An Alternative to Supercomputers for Industrial Applications Michael Eberl 1, Wolfgang Karl 1, Carsten Trinitis 1 and Andreas Blaszczyk 2 1 Technische Universitat Munchen
More informationHow to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić
How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about
More informationA substructure based parallel dynamic solution of large systems on homogeneous PC clusters
CHALLENGE JOURNAL OF STRUCTURAL MECHANICS 1 (4) (2015) 156 160 A substructure based parallel dynamic solution of large systems on homogeneous PC clusters Semih Özmen, Tunç Bahçecioğlu, Özgür Kurç * Department
More informationIntroduction to Parallel Computing
Introduction to Parallel Computing W. P. Petersen Seminar for Applied Mathematics Department of Mathematics, ETHZ, Zurich wpp@math. ethz.ch P. Arbenz Institute for Scientific Computing Department Informatik,
More informationBlueGene/L. Computer Science, University of Warwick. Source: IBM
BlueGene/L Source: IBM 1 BlueGene/L networking BlueGene system employs various network types. Central is the torus interconnection network: 3D torus with wrap-around. Each node connects to six neighbours
More informationIntroduction to Parallel Programming for Multicore/Manycore Clusters Part II-3: Parallel FVM using MPI
Introduction to Parallel Programming for Multi/Many Clusters Part II-3: Parallel FVM using MPI Kengo Nakajima Information Technology Center The University of Tokyo 2 Overview Introduction Local Data Structure
More informationIntroduction to Parallel. Programming
University of Nizhni Novgorod Faculty of Computational Mathematics & Cybernetics Introduction to Parallel Section 9. Programming Parallel Methods for Solving Linear Systems Gergel V.P., Professor, D.Sc.,
More informationEvaluation of sparse LU factorization and triangular solution on multicore architectures. X. Sherry Li
Evaluation of sparse LU factorization and triangular solution on multicore architectures X. Sherry Li Lawrence Berkeley National Laboratory ParLab, April 29, 28 Acknowledgement: John Shalf, LBNL Rich Vuduc,
More informationHigh-Performance Linear Algebra Processor using FPGA
High-Performance Linear Algebra Processor using FPGA J. R. Johnson P. Nagvajara C. Nwankpa 1 Extended Abstract With recent advances in FPGA (Field Programmable Gate Array) technology it is now feasible
More informationGPU Implementation of Elliptic Solvers in NWP. Numerical Weather- and Climate- Prediction
1/8 GPU Implementation of Elliptic Solvers in Numerical Weather- and Climate- Prediction Eike Hermann Müller, Robert Scheichl Department of Mathematical Sciences EHM, Xu Guo, Sinan Shi and RS: http://arxiv.org/abs/1302.7193
More informationApproaches to Parallel Implementation of the BDDC Method
Approaches to Parallel Implementation of the BDDC Method Jakub Šístek Includes joint work with P. Burda, M. Čertíková, J. Mandel, J. Novotný, B. Sousedík. Institute of Mathematics of the AS CR, Prague
More informationHigh-Performance Computational Electromagnetic Modeling Using Low-Cost Parallel Computers
High-Performance Computational Electromagnetic Modeling Using Low-Cost Parallel Computers July 14, 1997 J Daniel S. Katz (Daniel.S.Katz@jpl.nasa.gov) Jet Propulsion Laboratory California Institute of Technology
More informationData mining with sparse grids
Data mining with sparse grids Jochen Garcke and Michael Griebel Institut für Angewandte Mathematik Universität Bonn Data mining with sparse grids p.1/40 Overview What is Data mining? Regularization networks
More informationGTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS. Kyle Spagnoli. Research EM Photonics 3/20/2013
GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS Kyle Spagnoli Research Engineer @ EM Photonics 3/20/2013 INTRODUCTION» Sparse systems» Iterative solvers» High level benchmarks»
More informationfspai-1.0 Factorized Sparse Approximate Inverse Preconditioner
fspai-1.0 Factorized Sparse Approximate Inverse Preconditioner Thomas Huckle Matous Sedlacek 2011 08 01 Technische Universität München Research Unit Computer Science V Scientific Computing in Computer
More informationPerformance Evaluation of a New Parallel Preconditioner
Performance Evaluation of a New Parallel Preconditioner Keith D. Gremban Gary L. Miller October 994 CMU-CS-94-25 Marco Zagha School of Computer Science Carnegie Mellon University Pittsburgh, PA 523 This
More informationAccelerating Double Precision FEM Simulations with GPUs
Accelerating Double Precision FEM Simulations with GPUs Dominik Göddeke 1 3 Robert Strzodka 2 Stefan Turek 1 dominik.goeddeke@math.uni-dortmund.de 1 Mathematics III: Applied Mathematics and Numerics, University
More informationNative mesh ordering with Scotch 4.0
Native mesh ordering with Scotch 4.0 François Pellegrini INRIA Futurs Project ScAlApplix pelegrin@labri.fr Abstract. Sparse matrix reordering is a key issue for the the efficient factorization of sparse
More informationLarge-scale Structural Analysis Using General Sparse Matrix Technique
Large-scale Structural Analysis Using General Sparse Matrix Technique Yuan-Sen Yang 1), Shang-Hsien Hsieh 1), Kuang-Wu Chou 1), and I-Chau Tsai 1) 1) Department of Civil Engineering, National Taiwan University,
More informationfspai-1.1 Factorized Sparse Approximate Inverse Preconditioner
fspai-1.1 Factorized Sparse Approximate Inverse Preconditioner Thomas Huckle Matous Sedlacek 2011 09 10 Technische Universität München Research Unit Computer Science V Scientific Computing in Computer
More informationCHAO YANG. Early Experience on Optimizations of Application Codes on the Sunway TaihuLight Supercomputer
CHAO YANG Dr. Chao Yang is a full professor at the Laboratory of Parallel Software and Computational Sciences, Institute of Software, Chinese Academy Sciences. His research interests include numerical
More informationKrishnan Suresh Associate Professor Mechanical Engineering
Large Scale FEA on the GPU Krishnan Suresh Associate Professor Mechanical Engineering High-Performance Trick Computations (i.e., 3.4*1.22): essentially free Memory access determines speed of code Pick
More informationUsing Analytic QP and Sparseness to Speed Training of Support Vector Machines
Using Analytic QP and Sparseness to Speed Training of Support Vector Machines John C. Platt Microsoft Research 1 Microsoft Way Redmond, WA 9805 jplatt@microsoft.com Abstract Training a Support Vector Machine
More informationA MATLAB Interface to the GPU
Introduction Results, conclusions and further work References Department of Informatics Faculty of Mathematics and Natural Sciences University of Oslo June 2007 Introduction Results, conclusions and further
More informationCS 542G: Solving Sparse Linear Systems
CS 542G: Solving Sparse Linear Systems Robert Bridson November 26, 2008 1 Direct Methods We have already derived several methods for solving a linear system, say Ax = b, or the related leastsquares problem
More informationParallel Performance Studies for COMSOL Multiphysics Using Scripting and Batch Processing
Parallel Performance Studies for COMSOL Multiphysics Using Scripting and Batch Processing Noemi Petra and Matthias K. Gobbert Department of Mathematics and Statistics, University of Maryland, Baltimore
More informationGPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)
GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization
More informationEfficient AMG on Hybrid GPU Clusters. ScicomP Jiri Kraus, Malte Förster, Thomas Brandes, Thomas Soddemann. Fraunhofer SCAI
Efficient AMG on Hybrid GPU Clusters ScicomP 2012 Jiri Kraus, Malte Förster, Thomas Brandes, Thomas Soddemann Fraunhofer SCAI Illustration: Darin McInnis Motivation Sparse iterative solvers benefit from
More informationA Parallel Implementation of the BDDC Method for Linear Elasticity
A Parallel Implementation of the BDDC Method for Linear Elasticity Jakub Šístek joint work with P. Burda, M. Čertíková, J. Mandel, J. Novotný, B. Sousedík Institute of Mathematics of the AS CR, Prague
More informationBDDCML. solver library based on Multi-Level Balancing Domain Decomposition by Constraints copyright (C) Jakub Šístek version 1.
BDDCML solver library based on Multi-Level Balancing Domain Decomposition by Constraints copyright (C) 2010-2012 Jakub Šístek version 1.3 Jakub Šístek i Table of Contents 1 Introduction.....................................
More informationMemory Efficient Adaptive Mesh Generation and Implementation of Multigrid Algorithms Using Sierpinski Curves
Memory Efficient Adaptive Mesh Generation and Implementation of Multigrid Algorithms Using Sierpinski Curves Michael Bader TU München Stefanie Schraufstetter TU München Jörn Behrens AWI Bremerhaven Abstract
More informationExploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology
Exploiting GPU Caches in Sparse Matrix Vector Multiplication Yusuke Nagasaka Tokyo Institute of Technology Sparse Matrix Generated by FEM, being as the graph data Often require solving sparse linear equation
More informationTransaction of JSCES, Paper No Parallel Finite Element Analysis in Large Scale Shell Structures using CGCG Solver
Transaction of JSCES, Paper No. 257 * Parallel Finite Element Analysis in Large Scale Shell Structures using Solver 1 2 3 Shohei HORIUCHI, Hirohisa NOGUCHI and Hiroshi KAWAI 1 24-62 22-4 2 22 3-14-1 3
More informationB. Tech. Project Second Stage Report on
B. Tech. Project Second Stage Report on GPU Based Active Contours Submitted by Sumit Shekhar (05007028) Under the guidance of Prof Subhasis Chaudhuri Table of Contents 1. Introduction... 1 1.1 Graphic
More informationAccelerating a Simulation of Type I X ray Bursts from Accreting Neutron Stars Mark Mackey Professor Alexander Heger
Accelerating a Simulation of Type I X ray Bursts from Accreting Neutron Stars Mark Mackey Professor Alexander Heger The goal of my project was to develop an optimized linear system solver to shorten the
More informationUppsala University Department of Information technology. Hands-on 1: Ill-conditioning = x 2
Uppsala University Department of Information technology Hands-on : Ill-conditioning Exercise (Ill-conditioned linear systems) Definition A system of linear equations is said to be ill-conditioned when
More informationParallel FEM Computation and Multilevel Graph Partitioning Xing Cai
Parallel FEM Computation and Multilevel Graph Partitioning Xing Cai Simula Research Laboratory Overview Parallel FEM computation how? Graph partitioning why? The multilevel approach to GP A numerical example
More informationLecture 3: Intro to parallel machines and models
Lecture 3: Intro to parallel machines and models David Bindel 1 Sep 2011 Logistics Remember: http://www.cs.cornell.edu/~bindel/class/cs5220-f11/ http://www.piazza.com/cornell/cs5220 Note: the entire class
More informationBenchmarking CPU Performance. Benchmarking CPU Performance
Cluster Computing Benchmarking CPU Performance Many benchmarks available MHz (cycle speed of processor) MIPS (million instructions per second) Peak FLOPS Whetstone Stresses unoptimized scalar performance,
More informationHandling Parallelisation in OpenFOAM
Handling Parallelisation in OpenFOAM Hrvoje Jasak hrvoje.jasak@fsb.hr Faculty of Mechanical Engineering and Naval Architecture University of Zagreb, Croatia Handling Parallelisation in OpenFOAM p. 1 Parallelisation
More informationIterative Sparse Triangular Solves for Preconditioning
Euro-Par 2015, Vienna Aug 24-28, 2015 Iterative Sparse Triangular Solves for Preconditioning Hartwig Anzt, Edmond Chow and Jack Dongarra Incomplete Factorization Preconditioning Incomplete LU factorizations
More informationAccelerating Finite Element Analysis in MATLAB with Parallel Computing
MATLAB Digest Accelerating Finite Element Analysis in MATLAB with Parallel Computing By Vaishali Hosagrahara, Krishna Tamminana, and Gaurav Sharma The Finite Element Method is a powerful numerical technique
More informationMAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures
MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Stan Tomov Innovative Computing Laboratory University of Tennessee, Knoxville OLCF Seminar Series, ORNL June 16, 2010
More information