
PARALLEL SPARSE CHOLESKY FACTORIZATION

JÜRGEN SCHULZE
University of Paderborn, Department of Computer Science
Fürstenallee 11, 33102 Paderborn, Germany

Sparse matrix factorization plays an important role in many numerical algorithms. In this paper we describe a scalable parallel algorithm based on the Multifrontal Method. Computational experiments on a Parsytec system with 32 processors show that large sparse matrices can be factorized in only a few seconds.

This work was supported by the German Federal Department of Science and Technology (PARALOR project) and by the EU HC&M project SCOOP.

1 Introduction

Let $A \in M(n, \mathbb{R})$, $A = (a_{ij})$, be a sparse positive definite matrix. $A$ can be factorized into a lower triangular matrix $L$ so that $A = L L^t$. $L$ is called the Cholesky factor of $A$ and can be computed column by column using Eq. (1).

$$
A = \begin{pmatrix} d & v^t \\ v & B \end{pmatrix}
  = \begin{pmatrix} \sqrt{d} & 0 \\ v/\sqrt{d} & I \end{pmatrix}
    \begin{pmatrix} 1 & 0 \\ 0 & B - vv^t/d \end{pmatrix}
    \begin{pmatrix} \sqrt{d} & v^t/\sqrt{d} \\ 0 & I \end{pmatrix}
\qquad (1)
$$

Here, $d$ denotes the first diagonal entry and $v$ is an $(n-1)$-vector. $\sqrt{d}$ and $v/\sqrt{d}$ together form the first column of $L$. The remaining columns of $L$ can be obtained by recursively applying Eq. (1) to the submatrix $B - vv^t/d$. Eq. (1) also shows that the factorization process can introduce some fill into $L$, i.e. an element $a_{ij} = 0$ may become nonzero in $L$. To demonstrate this, let $v = (v_1, \ldots, v_i, \ldots, v_j, \ldots, v_{n-1})$ with $v_i, v_j \neq 0$. Then $v_i v_j \neq 0$ in $vv^t$ and, thus, the corresponding element in $B - vv^t/d$ is nonzero even if the corresponding entry of $A$ is zero. In general, $L$ has many more nonzeros than $A$, and this fill heavily influences the performance of the overall factorization process.

It is well known that the amount of fill can be reduced by reordering the columns and rows of $A$ prior to factorization. The problem of determining the optimal ordering is NP-complete; therefore heuristics are used. All heuristics are based on the observation that a symmetric matrix can be interpreted as the adjacency matrix of an undirected graph $G$. In $G$ each node corresponds to a column in $A$. Hence, a renumbering of the nodes in $G$ gives a reordering of the columns in $A$. One of the most successful and widely used ordering heuristics is nested dissection [2]. It starts with computing a minimal node separator $S$ that divides $G$ into two equally sized parts $U$ and $V$. All nodes in $S$ are numbered higher than nodes in $U$ and $V$. The method is recursively applied to the subgraphs induced by $U$ and $V$.
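
The recursion of Eq. (1) and the effect of the ordering on fill can be tried out on a small example. The following Python sketch is not the implementation used in this paper; the function names and the 6 x 6 "arrow" test matrix (a diagonal matrix with one dense row and column) are assumptions chosen purely for illustration. It applies Eq. (1) column by column to a dense array and counts the nonzeros of L for two orderings of the same matrix.

```python
import math

def cholesky_by_eq1(A):
    """Dense Cholesky factor, computed column by column as in Eq. (1)."""
    n = len(A)
    S = [row[:] for row in A]            # lower triangle holds the current submatrix
    L = [[0.0] * n for _ in range(n)]
    for k in range(n):
        d = S[k][k]                      # first diagonal entry of the submatrix
        root = math.sqrt(d)
        L[k][k] = root
        for i in range(k + 1, n):
            L[i][k] = S[i][k] / root     # v / sqrt(d)
        for i in range(k + 1, n):        # submatrix minus v v^t / d (lower part only)
            for j in range(k + 1, i + 1):
                S[i][j] -= S[i][k] * S[j][k] / d
    return L

def arrow_matrix(n, hub):
    """Hypothetical positive definite test matrix: diagonal plus one dense row/column."""
    A = [[float(n) if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        if i != hub:
            A[i][hub] = A[hub][i] = 1.0
    return A

def nnz_lower(M, tol=1e-12):
    """Number of nonzeros on and below the diagonal."""
    return sum(1 for i in range(len(M)) for j in range(i + 1) if abs(M[i][j]) > tol)

n = 6
L_first = cholesky_by_eq1(arrow_matrix(n, hub=0))      # dense node numbered first
L_last = cholesky_by_eq1(arrow_matrix(n, hub=n - 1))   # dense node numbered last
print(nnz_lower(arrow_matrix(n, 0)), nnz_lower(L_first), nnz_lower(L_last))
```

With the dense row and column numbered first, the first elimination step couples all remaining nodes and L becomes completely dense (21 nonzeros in the lower triangle, against 11 in A). Numbering the dense node last, as a separator would be numbered in nested dissection, produces no fill at all.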

for column $j := 1$ to $n$ do
    Let $j, i_1, \ldots, i_r$ be the locations of nonzeros in column $j$ of $L$;
    Let $c_1, \ldots, c_s$ be the children of $j$ in the elimination tree;
    Form the frontal matrix $F_j$ using the update matrices of all children of $j$:

$$
F_j := \begin{pmatrix}
a_{j,j} & a_{j,i_1} & \cdots & a_{j,i_r} \\
a_{i_1,j} & & & \\
\vdots & & 0 & \\
a_{i_r,j} & & &
\end{pmatrix} \oplus U_{c_1} \oplus \cdots \oplus U_{c_s};
$$

    Factor the frontal matrix $F_j$ into

$$
\begin{pmatrix}
l_{j,j} & 0 \\
\begin{matrix} l_{i_1,j} \\ \vdots \\ l_{i_r,j} \end{matrix} & I
\end{pmatrix}
\begin{pmatrix} 1 & 0 \\ 0 & U_j \end{pmatrix}
\begin{pmatrix}
l_{j,j} & l_{i_1,j} \; \cdots \; l_{i_r,j} \\
0 & I
\end{pmatrix};
$$

od

Figure 1: Core of the multifrontal method.

2 The Multifrontal Method

The multifrontal method reduces the factorization of the sparse input matrix $A$ to the factorization of several dense submatrices [6]. The method is guided by a special data structure, the so-called elimination tree [7]. It consists of $n$ nodes, each corresponding to a column in $L$, and is defined as follows: node $p$ is the parent of node $j$ if and only if $p = \min\{i > j : l_{ij} \neq 0\}$ ($p$ is the row index of the first nonzero subdiagonal element in column $j$ of $L$). Figure 2 shows the top levels of an elimination tree induced by a nested dissection ordering with separators $S_3, S_2, S_1$. Nodes belonging to the same separator in $G$ form a chain in the elimination tree. Following the nested dissection rule, nodes in $S_3$ are numbered higher than all other nodes. Hence, $S_3$ is placed at the top of the elimination tree above $S_1, S_2$ determined in the next recursion level of the nested dissection method.

Figure 1 presents the core of the multifrontal method from an algorithmic point of view. Again, $L$ is computed column by column. A single iteration of the algorithm can be described as follows: Let $j, i_1, \ldots, i_r$ be the row indices of all nonzero elements in column $j$. Now consider all children $c_1, \ldots, c_s$ of $j$ in the elimination tree. With each child $c_l$ an update matrix $U_{c_l}$ is associated (we will show how this update matrix is computed for $j$). Together with the entries of row and column $j$ of $A$, the update matrices are summed up to form the frontal matrix $F_j$. In general, the subscripts of the update matrices are a subset of $j, i_1, \ldots, i_r$. Therefore, the update matrices have to be extended to conform with the subscripts in $F_j$. This extension together with the addition is symbolized by the operator $\oplus$ (extended-add). In the next step, Eq. (1) is applied to $F_j$. This gives column $L_{*,j} = (l_{j,j}, l_{i_1,j}, \ldots, l_{i_r,j})$ and the update matrix $U_j$ associated with $j$. Let $\tilde{F}_j$ denote $F_j$ without the first column and the first row. Then, $U_j = \tilde{F}_j - (l_{i_1,j}, \ldots, l_{i_r,j})(l_{i_1,j}, \ldots, l_{i_r,j})^t$.

Figure 2: Nested dissection ordering (left) and structure of the corresponding elimination tree (right). [Labels: graph $G$ with node sets $V_0, \ldots, V_3$ and separators $S_1, S_2, S_3$; frontal matrices $F_i, F_j, F_k$; processors $P_{00}, P_{01}, P_{10}, P_{11}$.]

3 Parallelization of the Multifrontal Method

The elimination tree has the interesting property that columns in different branches of the tree can be factorized in parallel. Thus, the elimination tree provides useful information for parallelizing the multifrontal method. The parallel algorithm can be described best with the help of a simple example (a detailed description can be found in the literature [3]). Let us assume that $P = 4$ processors are used to compute $L$ and that the processors are numbered binary from $(00)_2$ to $(11)_2$ (cf. Figure 2). Each of the 4 subtrees induced by the node sets $V_0, \ldots, V_3$ is completely mapped to a different processor. The factorization of the associated columns can be done without any communication. In the next level (levels are separated by dashed lines in Figure 2) the frontal matrix $F_i$ ($F_j$) is shared by two processors: all even columns are mapped to one of them and all odd columns to the other. The frontal matrix $F_k$ at the topmost level is distributed over all 4 processors. For example, processor $P_{10}$ stores all elements of $F_k$ with even column and odd row index. In this way a cyclic mapping of the columns and rows of a frontal matrix can be obtained for $P = 2^k$ processors using only the binary representation of the processor number. All even bits (bits are numbered from left to right and counting starts with 1) determine which columns and all odd bits which rows are mapped on the processor. In the following, we describe in more detail what data has to be exchanged between the processors when Eq. (1) is applied to $F_j$.
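
The bit rule for the cyclic mapping can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code of the parallel implementation: the function names are hypothetical, bits are assumed to be numbered from 1 starting at the leftmost (most significant) bit, the selected row (column) bits are read as a binary number that must equal the row (column) index modulo the grid side, and only an even number of bits $k$ is handled.

```python
def owner(i, j, k):
    """Processor storing element (row i, column j), for P = 2**k and even k."""
    if k % 2 != 0:
        raise ValueError("this sketch only handles an even number of bits")
    half = k // 2
    row_bits = format(i % (1 << half), "0{}b".format(half))
    col_bits = format(j % (1 << half), "0{}b".format(half))
    # Interleave: bits 1, 3, 5, ... come from the row, bits 2, 4, 6, ... from the column.
    bits = "".join(r + c for r, c in zip(row_bits, col_bits))
    return int(bits, 2)

def lower_triangle_layout(n, k):
    """Owner of every diagonal and subdiagonal element of an n x n frontal matrix."""
    return [[owner(i, j, k) for j in range(i + 1)] for i in range(n)]

# Layout of a 4 x 4 frontal matrix on P = 4 processors, printed as binary numbers.
for row in lower_triangle_layout(4, 2):
    print(" ".join(format(p, "02b") for p in row))
```

Under these assumptions an element with odd row index and even column index lands on processor $(10)_2$, and calling `owner(i, j, 4)` gives the corresponding layout for 16 processors.
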
For ease of presentation we assume that $F_0$ is a $12 \times 12$ matrix with nonzeros in rows $0, 1, \ldots, 11$ (i.e. $j = 0$) and that the matrix is mapped on 16 processors according to the rule given above. Figure 3 (left) shows for each diagonal and subdiagonal element of $F_0$ the number of the processor it is mapped on. The elements of the first column of $F_0$ are mapped on processors 0, 2, 8 and 10. These processors are called pivot processors and together compute the factor column $L_{*,0}$. In the next step, each processor computes its fraction of $U_0$. For this, the processor must have access to certain elements of $L_{*,0}$. For example, processor 9 has to compute $(l_{2,0}, l_{6,0}, l_{10,0})(l_{1,0}, l_{5,0}, l_{9,0})^t$ to obtain its subdiagonal entries of $U_0$, i.e. the entries $u_{r,c}$ with $r \in \{2, 6, 10\}$, $c \in \{1, 5, 9\}$ and $r > c$.

Figure 3: Mapping of a $12 \times 12$ frontal matrix on 16 processors (left) and logical processor grid (right).

To show how each processor receives the requested data, consider the logical processor grid given in Figure 3 (right). A horizontal and vertical hypercubic broadcast scheme [9] is used to distribute $L_{*,0}$ among the processors of the logical grid. First, each pivot processor initiates a horizontal broadcast to distribute its part of $L_{*,0}$ among all processors in the same row of the logical grid. As soon as information arrives at a diagonal processor of the grid (in Figure 3 (right) processors 0, 3, 12 and 15 are diagonal processors), a vertical broadcast is initiated to distribute the information among all processors in the same column of the grid. For example, all elements in $(l_{2,0}, l_{6,0}, l_{10,0})$ are mapped on pivot processor 8 and processor 9 receives the elements during the horizontal broadcast initiated by 8. The second vector $(l_{1,0}, l_{5,0}, l_{9,0})$ is mapped on pivot processor 2 and processor 9 receives it during the vertical broadcast initiated by diagonal processor 3. Horizontal and vertical broadcasts are implemented as additional threads to reduce the communication overhead of the parallel algorithm. The overhead is further reduced by using a block cyclic mapping scheme for the columns and rows of a frontal matrix.
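
The routing in this 16-processor example can be reproduced with a short Python sketch. It only models who sends what to whom and performs no actual communication. The function names are hypothetical, and the sketch assumes that bits are numbered from 1 at the leftmost bit, that odd bits give the row and even bits the column of a processor in the logical $4 \times 4$ grid, and that rows and columns of the frontal matrix are numbered from 0.

```python
K = 4                       # 2**K = 16 processors
SIDE = 1 << (K // 2)        # the logical grid is SIDE x SIDE

def grid_position(proc):
    """(grid row, grid column): odd bits give the row, even bits the column."""
    bits = format(proc, "0{}b".format(K))
    return int(bits[0::2], 2), int(bits[1::2], 2)

def processor_at(row, col):
    """Inverse of grid_position: interleave the row and column bits."""
    r = format(row, "0{}b".format(K // 2))
    c = format(col, "0{}b".format(K // 2))
    return int("".join(a + b for a, b in zip(r, c)), 2)

def broadcast_sources(proc, pivot_col=0):
    """Pivot processor (horizontal) and diagonal processor (vertical) serving proc."""
    row, col = grid_position(proc)
    return processor_at(row, pivot_col % SIDE), processor_at(col, col)

def needed_rows(proc, n=12, pivot_col=0):
    """Row indices of the pivot column that proc receives from each broadcast."""
    row, col = grid_position(proc)
    below = range(pivot_col + 1, n)
    horizontal = [i for i in below if i % SIDE == row]
    vertical = [i for i in below if i % SIDE == col]
    return horizontal, vertical

pivots = sorted(processor_at(r, 0) for r in range(SIDE))
diagonals = sorted(processor_at(d, d) for d in range(SIDE))
print(pivots, diagonals, broadcast_sources(9), needed_rows(9))
```

For processor 9 the sketch names pivot processor 8 as the source of the horizontal broadcast (rows 2, 6, 10 of the pivot column) and diagonal processor 3 as the source of the vertical broadcast (rows 1, 5, 9), in agreement with the example above.
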

Table 1 shows the running times in seconds for the factorization of four selected test problems on a Parsytec system with 32 nodes. Each node consists of a PowerPC 604 (133 MHz) with 64 MB memory (further technical information can be found at …). The first three columns of Table 1 show the number of columns/rows in $A$ and the number of nonzeros in $A$ and $L$. While problems GRID255 and GRID5 have a regular grid structure, problems mat2hf and mat3hf are obtained from unstructured finite element meshes. For the nested dissection ordering a multilevel graph bisection method [4] combined with the helpful set heuristic was used. Computational experiments have shown that high quality orderings can be obtained when using multilevel methods [8, 5].

Table 1: Size of matrices and running times on Parsytec system (in sec.). [Columns: problem, $N$, $|A|$, $|L|$, running times; rows: GRID255, GRID5, mat2hf, mat3hf; table entries not recoverable.]

References

Parallel algorithms for sparse matrix factorization have been extensively investigated by many researchers during the last decades. Due to space limitations only a small fraction of the relevant literature can be mentioned here.

1. R. Diekmann, B. Monien, R. Preis, Using helpful sets to improve graph bisections, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, American Mathematical Society, Volume 21.
2. A. George, Nested dissection of a regular finite element mesh, SIAM J. Num. Anal., 10 (1973).
3. A. Gupta, G. Karypis, V. Kumar, Highly Scalable Parallel Algorithms for Sparse Matrix Factorization, TR 94-63, CS-Dept., Univ. Minnesota.
4. B. Hendrickson, R. Leland, The Chaco User's Guide, Tech. Rep., Sandia Nat. Lab.
5. G. Karypis, V. Kumar, A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs, TR 95-035, CS-Dept., Univ. Minnesota.
6. J. W.-H. Liu, The Multifrontal Method for Sparse Matrix Solution: Theory and Practice, SIAM Review, 34 (1992).
7. J. W.-H. Liu, The Role of Elimination Trees in Sparse Factorization, SIAM J. Matrix Anal. Appl., Vol. 11, No. 1 (1990).
8. J. Schulze, R. Diekmann, R. Preis, Comparing Nested Dissection Orderings for Parallel Sparse Matrix Factorization, Proc. of PDPTA '95, CSREA 96-3.
9. J. Schulze, Implementation of a Parallel Algorithm for Sparse Matrix Factorization, Tech. Rep. (in preparation), CS-Dept., Univ. of Paderborn.
10. M. Yannakakis, Computing the minimum fill-in is NP-complete, SIAM J. Algebraic Discrete Methods, 2 (1981).
