Task-Oriented Parallel ILU(k) Preconditioning on Computer Cluster and Multi-core Machine

Xin Dong
College of Computer Science
Northeastern University
Boston, MA, USA
xindong@ccs.neu.edu

Gene Cooperman
College of Computer Science
Northeastern University
Boston, MA, USA
gene@ccs.neu.edu

arXiv v2 [cs.DC] 3 Oct 2010

(This work was partially supported by the National Science Foundation under Grant CCF.)

Abstract. ILU(k) is a preconditioner used for many stable iterative linear solvers. We present TPILU(k), the first efficiently parallelized ILU(k) preconditioner that maintains this stability. Iterative solvers using TPILU(k) are guaranteed to perform at least as well as the same solver using sequential ILU(k). Indeed, in experiments on a 16-core computer, TPILU(k)-based solvers performed up to 9 times faster. Normally, TPILU(k) returns exactly the same preconditioned matrix as the original sequential ILU(k) preconditioner would have returned. As an optimization, TPILU(k) attempts to use a new incomplete inverse method as a fast approximation for the original ILU(k) preconditioned matrix. This allows TPILU(k)-based solvers to compete with fast, unstable parallel solvers such as parallel Block Jacobi ILU(k). In the regime where parallel Block Jacobi fails, TPILU(k) may also fail, but it will fail fast and then revert to the slower standard parallelization that returns the same preconditioned matrix as the sequential ILU(k) algorithm. The incomplete inverse method also has benefits for sequential ILU(k)-based iterative solvers. Finally, the combined algorithm generalizes to an efficient, parallel distributed version for computer clusters.

Keywords: ILU(k), parallel computing, preconditioning, Gaussian elimination, task-oriented parallelism

I. Introduction

This work introduces a new preconditioner, TPILU(k), with good stability and performance across a range of sparse linear systems. For large sparse linear systems Ax = b, parallel iterative solvers based on ILU(k) [1], [2] often suffer from instability or performance degradation. The reason is that the domain decomposition preconditioners used there become slow or unstable with greater parallelism. This happens as they attempt to approximate a linear system by more and smaller subdomains in order to provide parallel work for an increasing number of threads. In contrast, TPILU(k) is stable and its performance increases with the number of threads.

TPILU(k) uses a novel parallel ILU(k) preconditioner as its base algorithm. However, it first tries a different, incomplete inverse submethod. The submethod either succeeds or fails fast. If it fails, the base parallel ILU(k) algorithm is used. TPILU(k) is particularly important for non-symmetric sparse matrices, whereas symmetric matrices are efficiently handled by other well-known algorithms.

Both the base algorithm and the submethod of TPILU(k) are sequentially consistent. This means that the parallel algorithm produces the same preconditioned matrix as would be produced by the sequential algorithm. This is key to preventing the slowdown and instability that exist in parallelization by domain decomposition methods.

A particular showcase for TPILU(k) is the parallelization of the triangular solve stage in each iteration. This stage consists of the forward/backward substitutions, which are a traditional bottleneck for ILU(k)-based iterative solvers.
This bottleneck is in part responsible for the turn to domain decomposition methods such as BJILU (Block Jacobi [3] ILU(k)) and PILU(k) [4], [5], [6] (denoted simply PILU below). The incomplete inverse submethod of TPILU(k) speeds up the triangular solver in cases where other methods fail to extract parallelism.

Incomplete Inverse Submethod of TPILU(k). The parallelization of triangular solvers has long been the key remaining problem in the parallelization of ILU-based iterative solvers. Solving sparse triangular systems at each iteration of the solver is a sequential bottleneck with little opportunity for parallelization. The incomplete inverse submethod is a novel algorithm for a highly parallel triangular solver. It produces approximate inverse matrices for general sparse triangular matrices, including the incomplete factors L̃ and Ũ of ILU(k) of interest here. While it does not always work, it is guaranteed to fail fast when it does not work. So, it costs little to try it in a preliminary step.

Further, since L̃ and Ũ are already approximations, also taking an approximate inverse is not an issue. The incomplete inverse submethod is inspired by ILU(k) itself. The classical idea of ILU(k) is to throw out some small (higher level) matrix entries. In a similar manner, this submethod throws out some higher level coefficients during the forward/backward substitutions in each iteration. We call this submethod TPIILU and the base algorithm TPILU.

Impact on sequential ILU(k). TPIILU also adds an important novelty even for the sequential case. Recall that ILU(k) provides sparse triangular approximations L̃ and Ũ to the matrices L and U. A problem of the form L̃Ũx = b must be solved at every iteration of the solver. Using fast approximate inverses L̃^{-1} and Ũ^{-1}, that linear equation reduces to the much faster matrix-vector multiplication below:

    x = Ũ^{-1} L̃^{-1} b
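To make the contrast concrete, the sketch below compares applying the factors by two sequential triangular solves with applying precomputed inverse factors by two sparse matrix-vector products. It is only an illustration under stated assumptions (SciPy is available; tiny factors and their exact inverses stand in for the incomplete factors and incomplete inverses); it is not the paper's implementation.

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    # Hypothetical sparse triangular factors, standing in for the ILU(k) output.
    L = sp.csr_matrix(np.array([[1., 0., 0.], [0.5, 1., 0.], [0., 0.25, 1.]]))
    U = sp.csr_matrix(np.array([[2., 1., 0.], [0., 4., 1.], [0., 0., 3.]]))
    b = np.array([1., 2., 3.])

    # Conventional preconditioner application: two inherently sequential triangular solves.
    y_solve = spla.spsolve_triangular(U, spla.spsolve_triangular(L, b, lower=True),
                                      lower=False)

    # With (incomplete) inverse factors, the same application becomes two sparse
    # matrix-vector products, which have no loop-carried dependence across rows.
    Linv = sp.csr_matrix(np.linalg.inv(L.toarray()))   # stand-in for the incomplete inverse
    Uinv = sp.csr_matrix(np.linalg.inv(U.toarray()))
    y_spmv = Uinv @ (Linv @ b)

    assert np.allclose(y_solve, y_spmv)

Because the matrix-vector products carry no dependence between rows, this form of the triangular solve stage parallelizes trivially, which is the point of the submethod.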
Balancing the time for the TPILU(k) preconditioner and the iterations of the linear solver. Next, we consider the case of difficult matrices, when domain decomposition methods fail. As before, the incomplete inverse method is tried first. If it and other methods fail, then the base TPILU(k) algorithm is used. This is guaranteed to be exactly as stable as the sequential ILU(k) algorithm, and it is also guaranteed to run faster as it uses more threads. In this regime, we are comparing with a sequential ILU(k) algorithm. The parallel TPILU(k) algorithm has a clear benefit when the preconditioning time dominates the time for the iterations of the solver. Even when the time for the TPILU(k) preconditioner does not dominate, there is an important side benefit to its parallelization. One can raise the parameter value k. The result is a higher quality preconditioned matrix, which usually results in fewer iterations for the linear solver. Of course, a higher k leads to two other problems: (a) the time to compute the preconditioned matrix rises quickly as k increases; and (b) the preconditioned matrix becomes denser with higher k, resulting in more computation in a single iteration of the solver. Often, problem (a) is the more severe one, while the density of the matrix remains manageable.

Review of Two Competitive Methods. We compare with two widely used preconditioners: BJILU and PILU. Under the framework of Euclid [7, Section 6.12], both preconditioners appear in Hypre [8], a popular linear solver package developed at Lawrence Livermore National Laboratory. BJILU and PILU are based on domain decomposition: approximate the linear system by multiple linear systems on subspaces, and then handle them in parallel. The problem with this approach is that one desires at least as many domains as the number of threads. As the number of threads increases, worse and worse approximations are made to the original linear system in order to produce ever more domains. Hence, the methods risk making an iterative solver increasingly slow or unstable.

At the opposite extreme, researchers considered sequentially consistent parallel ILU(k) preconditioners [9]. Such methods, whose parallelism comes from level/backward scheduling, were studied in the 1980s and achieved a speedup of about 4 or 5 on an Alliant FX-8 [3, page 351] and a speedup of 2 or 3 on a Cray Y-MP [10]. Although PILU provides a better speedup over sequential ILU(k), it is intended to be used only under the very strong condition that the matrix must be well-partitionable. This condition is violated by linear systems generating many fill-ins and by linear systems accommodating many threads. As noted in [7]: "if subdomains contain relatively few nodes (less than 1,000), or the problem is not well partitioned, Block Jacobi ILU may give faster solution time than PILU."

BJILU preconditioners have been successfully used for a class of linear systems, for example, those with diagonally dominant matrices. We define linear systems outside of this class as difficult problems. Solvers based on BJILU would fail to converge for difficult problems as the number of threads w grows. The reason lies in the domain decomposition method, by which a scalable parallelization must approximate the original matrix by at least w subdomains. As a result, the preconditioner is forced to ignore ever more off-diagonal matrix entries as it uses more blocks of smaller block dimension.

Features of TPILU(k). TPILU(k) is more efficient for applications that generate many difficult problems requiring a solution. We compare an iterative solver using either parallel TPILU(k) preconditioning or traditional sequential ILU(k) preconditioning. In our experiments, we observe a speedup of up to a factor of 9 over a sequential ILU(k)-based solver by using 16 threads on 16 cores (see Table XII). Depending on whether the time for the preconditioner dominates the time for the overall iterative solver, we may observe anywhere from a 30% overall speedup to the full factor of 9 (see Tables V, VI, XII).

TPILU(k) produces stable preconditioners across most classes of linear systems. The sequential ILU(k)-based solver has this property.

Further, because the time for the preconditioning often dominates, we observe a large speedup in our parallel solver, TPILU(k) (see Tables XII, XIII, XIV). In contrast, BJILU-based iterative solvers fail to converge (are unstable) on difficult problems. Even worse, BJILU preconditioning may succeed for a small number of threads, and then fail for more threads (see Tables I, III, V). While BJILU preconditioning works well sometimes, another method must be used in the cases where it fails. TPILU(k) has reliable performance as compared to PILU, whose partitioning cost varies widely. When solving the linear systems arising from 3-D 27-point stencil modelling of Laplace's equation, PILU(k)-based iterative solvers always perform worse than the sequential ILU(k)-based solvers in our experiments (see Tables VII, VIII).

TPILU(k) is implemented both for computer clusters and for multi-core machines. It takes advantage of the general parallelism existing in LU factorization for any matrix, while adhering to sequential consistency. Sequential consistency is crucial to TPILU(k) in squeezing out more speedup while maintaining stability. In addition, it helps manage potential instabilities due to round-off. Although mathematically (x+y)+z = x+(y+z), the differing round-off errors can influence the final result. For example, a sequential triangular solver may return different results depending on the order in which each unknown is reduced during the forward/backward substitutions: left-right or right-left. Debugging also becomes difficult when parallelization produces different results.

The initial implementation of TPILU(k) used dynamic scheduling of threads for load balancing. Because the fill-ins of symbolic factorization in ILU(k) occur dynamically, we cannot predict in advance how much temporary storage each thread will need to malloc. So, each thread will malloc a different region size at each step. Pre-allocating a large malloc region for each thread is not an option, since the total region allocated may no longer fit in L2 cache. When using many threads, dynamic allocation results in ping-pong access patterns between the caches of different cores. This phenomenon is not particular to numerical analysis. See, for example, a discussion of this phenomenon in a Monte Carlo simulation [11, Section 4], where the use of a per-thread allocator with thread-private malloc arenas produced a large improvement over four widely used memory allocators (ptmalloc2, ptmalloc3, hoard and tcmalloc). To eliminate this bottleneck, we were forced to switch to static round-robin scheduling of threads. We modified the standard glibc allocator to also use thread-private malloc arenas. Details are in Section IV-E, especially in Table XV. This is an important lesson for future manycore CPUs. We are not aware of other linear solvers currently using this technique.

The contributions of TPILU(k) are threefold: 1) an efficient, parallel, sequentially consistent ILU(k) preconditioner targeting difficult problems; 2) an incomplete inverse algorithm to overcome a bottleneck related to solving triangular matrices; and 3) improved cache efficiency through static round-robin thread scheduling and thread-private malloc arenas.

The rest of this paper is organized as follows. Section II briefly reviews some previous work. Section III reviews LU factorization and the sequential ILU(k) algorithm.
Section IV presents task-oriented parallel TPILU(k), including the base algorithm (Sections IV-A through IV-D) and the incomplete inverse method (Section IV-F). Section V analyzes the experimental results.

II. Related Work

ILU(k) [1] was formalized to solve the systems of linear equations arising from finite difference discretizations. In 1981, ILU(k) was extended to apply to more general problems [2]. Some early work on parallel ILU(k) preconditioners uses sequential consistency, as mentioned in Section I. There are also some sequentially consistent parallel IC preconditioners [12], [13] for symmetric linear systems.

Pivoting/reordering is the dominant technique today for parallel linear solvers. For example, pivoting is employed to reduce the number of fill-ins for the LU factorization in [14]. The work in [15] provides a pivoting algorithm to make the diagonal elements of a sparse matrix large. The methodology is to compute the optimal pivoting and preprocess the sparse matrix. Reordering is also employed to develop a parallel algorithm in [16], which targets distributed sparse matrices. PMILU [17] reorders a linear system to expose greater parallelism. The experimental results demonstrate the scalability of PMILU on four IBM Power 5 SMP nodes, each with 16 shared-memory processors.

The use of sequentially consistent parallel preconditioners is receding. For example, parallel ILUT [18] is no longer supported by Hypre, and Euclid is now recommended as a replacement. Euclid supports both low-level thread parallelism (for matrix add, matrix-vector multiply, etc.) and high-level distributed parallelism. This emphasis on purely distributed parallelism at the high level is motivated by Euclid's domain decomposition strategy.

TPILU(k) is inspired by TOP-C [19]. The task-oriented approach differs in spirit from domain decomposition, and from earlier work based on level scheduling [9], [10]. The resulting parallel algorithm is derived by identifying tasks to be computed in parallel from the original sequential algorithm. TOP-C employs a message passing interface (MPI) for distributed memory architectures. In [20], a definition of tasks suitable for dense Gaussian elimination is presented. That simple approach inspired the initial approach of this work on sparse incomplete LU factorization.

III. Review of the Sequential ILU(k) Algorithm

See [4], [21], [22] for a detailed review of ILU(k). A brief sketch is provided here. LU factorization completely decomposes a matrix A into the product of a lower triangular matrix L and an upper triangular matrix U. From L and U, one efficiently computes A^{-1} as U^{-1} L^{-1}. While the computation of L and U requires O(n^3) steps, once done, the computation of the inverse of the triangular matrices proceeds in only O(n^2) steps.

For sparse matrices, one contents oneself with solving for x in Ax = b for vectors x and b, since A^{-1}, L and U would all be hopelessly dense. Iterative solvers are often used for this purpose. An ILU(k) algorithm finds sparse approximations L̃ ≈ L and Ũ ≈ U. The preconditioned iterative solver then implicitly works with A Ũ^{-1} L̃^{-1}, which is close to the identity. For this purpose, triangular solve operations are integrated into each iteration to obtain a solution y such that L̃Ũy = p, where p varies across iterations. This yields faster convergence and better numerical stability. Here, the level limit k controls what kinds of elements are computed in the process of incomplete LU factorization.

Similarly to LU factorization, ILU(k) factorization can be implemented by the same procedure as Gaussian elimination. Moreover, it also records the elements of a lower triangular matrix L̃. Because the diagonal elements of L̃ are defined to be 1, we do not need to store them. Therefore, a single filled matrix F is sufficient to store both L̃ and Ũ.

A. Terminology for ILU(k)

For a huge sparse matrix, a standard dense format would be wasteful. Instead, we just store the position and the value of non-zero elements. Similarly, incomplete LU factorization does not insert all elements that are generated in the process of factorization. Instead, it employs some mechanisms to control how many elements are stored. ILU(k) [4] uses the level limit k as the parameter to implement a more flexible mechanism. Here we review some definitions.

Fig. 1. Fill-in f_{ij} with its causative entries f_{ih} and f_{hj}.

Definition 3.1: Filled Matrix: We call the matrix in memory the filled matrix, which is composed of all non-zero elements of F.

Definition 3.2: A fill entry, or entry for short, is an element stored in memory. (Elements that are not stored are called zero elements.)

Definition 3.3: Fill-in: Consider Figure 1. If there exists h such that i, j > h and both f_{ih} and f_{hj} are entries, then the originally zero element f_{ij} may become an entry, because the value of f_{ij} is non-zero after factorization. This element f_{ij} is called a fill-in, that is, a candidate entry. We say the fill-in f_{ij} is caused by the existence of the two entries f_{ih} and f_{hj}. The entries f_{ih} and f_{hj} are the causative entries of f_{ij}.

Definition 3.4: Level: The level associated with an entry f_{ij} is level(i,j), defined as

    level(i,j) = min_{1 ≤ h < min(i,j)} ( level(i,h) + level(h,j) + 1 ),

with the entries of the original matrix A assigned level zero.
The level limit k is used to control what kinds of fill-ins should be inserted into the filled matrix during the factorization. Only those fill-ins with a level smaller than or equal to k are inserted into the filled matrix F. Other fill-ins are ignored. This allows ILU(k) to maintain a sparse filled matrix for very small values of k.

B. ILU(k) and Its Parallelism

For LU factorization, the defining equation A = LU is expanded into

    a_{ij} = Σ_{h=1}^{min(i,j)} l_{ih} u_{hj}.

This yields the defining equations

    f_{ij} = ( a_{ij} - Σ_{h=1}^{j-1} l_{ih} u_{hj} ) / u_{jj},   j < i,

    f_{ij} = ( a_{ij} - Σ_{h=1}^{i-1} l_{ih} u_{hj} ) / l_{ii},   j ≥ i.

The computation for incomplete LU factorization follows a definition similar to the above equations, except that it skips zero elements.
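As a purely illustrative rendering of these defining equations (assuming NumPy; this is not the paper's code), the following dense, row-oriented sketch reduces each row i by every previously finished row h < i, which is exactly the update order that ILU(k) restricts to permitted entries:

    import numpy as np

    def lu_by_rows(A):
        """Row-oriented LU factorization following the defining equations.
        Returns a single filled matrix F whose strict lower part holds L
        (unit diagonal implied) and whose upper part holds U.  ILU(k)
        performs the same updates, but only on entries whose level does
        not exceed k."""
        F = A.astype(float).copy()
        n = F.shape[0]
        for i in range(n):                # reduce row i by all finished rows h < i
            for h in range(i):
                F[i, h] /= F[h, h]                      # l_ih = f_ih / u_hh
                F[i, h+1:] -= F[i, h] * F[h, h+1:]      # f_ij -= l_ih * u_hj,  j > h
        return F

    # sanity check against the defining equation A = L U
    A = np.array([[4., 3., 0.], [6., 3., 1.], [0., 2., 5.]])
    F = lu_by_rows(A)
    L, U = np.tril(F, -1) + np.eye(3), np.triu(F)
    assert np.allclose(L @ U, A)

The single array F plays the role of the filled matrix described above: the strict lower part is L̃ (its unit diagonal is implicit) and the upper part is Ũ.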

In our implementation, the matrix F is initialized to A and stored in row-major order prior to any computation. The computation can be reorganized to use the above equations in the forward direction. As each term l_{ih} u_{hj} for h < j is determined, it can immediately be subtracted from f_{ij}. Just as all the remaining rows can be reduced simultaneously by the first row in Gaussian elimination, a row-major order for ILU(k) factorization leads to a natural parallel algorithm.

Following the defining equations, the ILU(k) algorithm maintains in memory two rows: row h and row i, where h < i. Row h is used to partially reduce row i. For each possible j, the product l_{ih} u_{hj} is used to reduce the entry f_{ij}. Once we have accumulated all products l_{ih} u_{hj} for h < min(i,j), we are done.

ILU(k) is separated into two passes: symbolic factorization (Phase I), which computes the levels and inserts all fill-ins with level less than or equal to the level limit k into the filled matrix; and numeric factorization (Phase II), which computes the values of all entries in the filled matrix. Both passes follow the similar procedure described above.

Algorithm 1 illustrates the symbolic factorization phase. It determines, for each row j, the set of permitted entries, permitted(j). These are the entries for which the computed entry level, or weight, is less than or equal to k.

Algorithm 1 Symbolic factorization: Phase I of ILU(k) preconditioning
 1: // Calculate levels and permitted entry positions
 2: // Loop over rows
 3: for j = 1 to n do
 4:   // Initialization: admit entries in A, and assign them level zero
 5:   permitted(j) ← empty set  // permitted entries in row j
 6:   for t = 1 to n do  // nonzero entries in row j
 7:     if A_{j,t} ≠ 0 then
 8:       level(j,t) ← 0
 9:       insert t into permitted(j)
10:     end if
11:   end for
12: end for
13: // Row-merge update pass (applied to each row j in turn)
14: for each unprocessed i ∈ permitted(j) with i < j, in ascending order do
15:   for t ∈ permitted(i) with t > i do
16:     weight = level(j,i) + level(i,t) + 1
17:     if t ∈ permitted(j) then
18:       // already nonzero in F_{j,t}
19:       level(j,t) ← min{level(j,t), weight}
20:     else
21:       // zero in F_{j,t}
22:       if weight ≤ k then  // level control
23:         insert t into permitted(j)
24:         level(j,t) ← weight
25:       end if
26:     end if
27:   end for
28: end for
29: return permitted

Numeric factorization is simpler, but similar in spirit to the row-merge update pass of Algorithm 1. Lines 14 through 17 control the entries to be updated, and the update of line 19 is replaced by the update of the numeric value. The details are omitted. The computation from lines 15 to 27 in Algorithm 1 is referred to as one transformation. The corresponding part in Phase II is also called a transformation.
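For concreteness, here is a compact Python rendering of Algorithm 1 (an illustrative sketch, not the authors' implementation; the input format A_pattern and the 0-based indexing are assumptions of the sketch):

    def symbolic_ilu_k(A_pattern, k):
        """Sketch of Algorithm 1.  A_pattern[j] is the set of column indices
        of structural nonzeros in row j (0-based).  Returns, for each row j,
        a dict mapping each permitted column t to level(j, t)."""
        n = len(A_pattern)
        # Initialization: admit the entries of A and assign them level zero.
        level = [{t: 0 for t in A_pattern[j]} for j in range(n)]
        for j in range(n):                  # rows are finished top-down
            done = set()
            while True:
                # "each unprocessed i in permitted(j) with i < j, in ascending order"
                todo = [c for c in level[j] if c < j and c not in done]
                if not todo:
                    break
                i = min(todo)
                done.add(i)
                for t, lvl_it in list(level[i].items()):
                    if t <= i:
                        continue
                    weight = level[j][i] + lvl_it + 1
                    if t in level[j]:                 # already an entry in F
                        level[j][t] = min(level[j][t], weight)
                    elif weight <= k:                 # level control: admit the fill-in
                        level[j][t] = weight
        return level

With k = 0 no fill-ins are admitted and the factor pattern coincides with the pattern of A; larger k admits progressively more fill-ins and a denser filled matrix.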
IV. TPILU(k): Task-Oriented Parallel ILU(k) Algorithm

A. Parallel Tasks in ILU(k)

As mentioned in Section III-B, the parallelism comes from the fact that we can do row transformations in parallel, while the sequential algorithm does it top-down and row by row. We introduce the following definition to describe a general model for matrix row reductions, which is valid for Gaussian elimination as well as for ILU(k).

Definition 4.1: The frontier is the maximum number of rows that are currently reduced completely.

According to this definition, the frontier i is also the limit up to which the remaining rows can be partially reduced, except for the first unreduced row. The first unreduced row can be reduced completely. That increases the frontier by one.

In order to overlap communication and computation, the matrix is organized as bands to make the granularity of the computation adjustable. As in Figure 2, each band includes the same number of consecutive rows. To handle the case where the number of bands does not evenly divide the matrix dimension, one can pad some of the bands with one extra row each. The size of a band is the number of rows in this band. A task is associated with a band and is defined as the set of transformations applied in order to partially reduce the band up to the current frontier. For each band, the program must remember up to what column this band has been partially reduced. We call this column the current position, which is the start point of reduction for the next task attached to this band. In addition, it is important to use a variable to remember the first unreduced band. After the first unreduced band is completely reduced, it should be broadcast to all machines (or shared by all threads in the multi-core case), and the frontier should be increased by the size of the band for all machines.
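A small sketch of this band bookkeeping (illustrative Python; the names make_bands, owner_of, frontier and current_pos are not from the paper):

    def make_bands(n_rows, n_bands):
        """Split rows 0 .. n_rows-1 into n_bands bands of consecutive rows;
        the first (n_rows mod n_bands) bands are padded with one extra row."""
        base, extra = divmod(n_rows, n_bands)
        bands, start = [], 0
        for b in range(n_bands):
            size = base + (1 if b < extra else 0)
            bands.append(range(start, start + size))
            start += size
        return bands

    def owner_of(band_id, n_workers):
        """Static round-robin assignment (Section IV-B): band b is always
        reduced by worker b mod w, so only completely reduced bands ever
        need to be broadcast."""
        return band_id % n_workers

    # Per-worker bookkeeping assumed by the task model:
    #   frontier         -- number of rows reduced completely so far
    #   current_pos[b]   -- column up to which band b has been partially reduced;
    #                       the next task for band b reduces it from current_pos[b]
    #                       up to the current frontier
    #   first_unreduced  -- index of the first band that is not yet complete

    bands = make_bands(12, 4)               # four bands of three rows, as in Figure 2
    assert [len(b) for b in bands] == [3, 3, 3, 3]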

Fig. 2. View of a matrix as bands. There are four bands, each consisting of 3 rows. After the first band is reduced completely, all remaining bands can be partially reduced up to the third column in parallel.

B. Communication Overhead Reduction

Only the completely reduced bands are useful for the reduction of other bands; an intermediate result is not truly needed by other workers. If we assign a fixed group of bands to each worker, then it is unnecessary to broadcast an update message to other workers for any intermediate result. We call this strategy static load balancing. It decreases the communication overhead to a minimum by sending an update message only for the completely reduced bands.

Our algorithm assigns bands to workers in a round-robin fashion. The method avoids sending intermediate copies of partially updated bands. This produces a more regular communication pattern that fits well with the pipelined communication of the next section. A further virtue of this strategy is that it uses a fixed number of message buffers and a fixed buffer size. This avoids dynamically allocating memory for message handling. Under the strategy of static load balancing, the computations on all processors are coordinated so as to guarantee that no processor can send two update messages simultaneously. In other words, a processor must finish broadcasting an update message before it reduces another band completely.

C. Pipeline Communication for Efficient Local Network Usage

Although the bands can be reduced simultaneously, their completion follows a strict top-down order. When one band is completely reduced, it is best for the node that holds the first unreduced band to receive the result first. This strategy is implemented by the pipeline model. Following this model, all nodes are organized to form a directed ring. The message is transferred along the directed edges. Every node sends the message to its unique successor until each node has received a copy of the message. After this message is forwarded, each node uses it to update its data in memory.

Fig. 3. Pipeline model. The horizontal axis is the time step; the vertical axis is the sender id. The lines represent when the algorithm sends a message; the time step and the sender id of the source are indicated. The receiver is always the successor of the source. The message is marked by the corresponding band number. Only the first several messages are shown.

Figure 3 shows that this model achieves the aggregate bandwidth. In this figure, the horizontal axis is the time step while the vertical axis is the sender ID (the rank number of each node). Note that at most time steps several nodes participate, each of them either sending a message to its successor or receiving a message from its predecessor. Algorithm 2 illustrates the implementation that overlaps the computation with the communication based on the pipeline model. Both symbolic factorization and numeric factorization can use this model, except for the case k = 1, whose symbolic factorization reduces to trivial parallelism as discussed in the next section.

Algorithm 2 Parallel ILU(k) algorithm with the pipeline model
 1: receive from predecessor  // non-blocking receive
 2: // Loop until all bands are reduced completely
 3: while firstUnreducedBand < numberOfBands do
 4:   get new task (band ID) from the master to work on
 5:   if there was a band to work on then
 6:     doTask(band)  // reduce band using all previous bands
 7:     if band is not reduced completely then
 8:       // not reduced completely, then non-blocking test
 9:       try to receive a message for some band
10:       if a newly reduced band is received then
11:         send band to successor  // non-blocking send
12:         update our copy of the newly reduced band
13:         continue to receive and update until our band is completely reduced
14:       end if
15:     else
16:       send our reduced band to successor  // non-blocking send
17:     end if
18:   else
19:     wait until a new band is available, while in the background continuing to receive other reduced bands from our predecessor, updating our copy, and sending the reduced band to our successor
20:   end if
21: end while
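As a concrete illustration of the ring-forwarding step in Algorithm 2, the sketch below assumes mpi4py and uses blocking calls for brevity; it is not the authors' MPI implementation, which additionally overlaps this relay with reduction work through non-blocking sends and receives.

    from mpi4py import MPI          # assumption: mpi4py is available

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    succ, pred = (rank + 1) % size, (rank - 1) % size

    def relay_reduced_band(band_id, band_rows):
        """Forward one completely reduced band around the directed ring.
        The owner injects the band; every other node receives it from its
        predecessor, forwards it to its successor (unless the successor is
        the owner, which already holds it), and then uses it locally."""
        owner = band_id % size                  # static round-robin ownership
        if rank == owner:
            if size > 1:
                comm.send((band_id, band_rows), dest=succ, tag=band_id)
            return band_rows                    # the owner already has the data
        bid, rows = comm.recv(source=pred, tag=band_id)
        if succ != owner:                       # stop before the message returns home
            comm.send((bid, rows), dest=succ, tag=band_id)
        return rows

Successive bands can be in flight at different positions of the ring at the same time, which is the overlap that Figure 3 depicts.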
D. Optimization for Symbolic Factorization in the Case k = 1

One observation from Figure 1 is that level-1 entries no longer participate in symbolic factorization after they are generated. In Figure 1, if either f_{ih} or f_{hj} is an entry of level 1, the resulting fill-in f_{ij} must be an element of level 2 or level 3, so f_{ij} is not inserted into the filled matrix F. Based on this observation, we claim that for symbolic factorization, each row can be reduced independently, no matter whether the previous rows are reduced or not. This observation not only yields greater parallelism, it also allows the communication overhead of the first pass to be postponed to the second pass. We call this special algorithm TPILU(1). Considering that numeric factorization is floating-point arithmetic intensive, it is reasonable to shift all communication overhead toward the second pass. In the symbolic factorization phase, the TPILU(1) algorithm prefers to reduce each row against the original matrix, because only level-0 elements are needed.
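A sketch of this special case (illustrative Python only, not the production code): with k = 1, every admissible fill-in is caused by two level-0 entries, so each row's pattern can be computed from the original matrix alone, with no ordering constraint among rows.

    def symbolic_factorization_k1(A_pattern):
        """A_pattern[r] is the set of column indices of structural nonzeros
        in row r of A (0-based).  For k = 1, every permitted fill-in f[j][t]
        is caused by two original (level-0) entries f[j][i] and f[i][t], so
        row j depends only on the original pattern and rows may be processed
        in any order, or fully in parallel, with no communication."""
        permitted = []
        for j, cols in enumerate(A_pattern):
            fills = set(cols)                        # level-0 entries of row j
            for i in (c for c in cols if c < j):     # causative entry f[j][i]
                for t in A_pattern[i]:
                    if t > i and t not in cols:
                        fills.add(t)                 # level-1 fill-in: 0 + 0 + 1 <= k
            permitted.append(fills)
        return permitted

Because no row depends on another, the symbolic pass for k = 1 needs no inter-worker communication at all, which is why TPILU(1) can defer all communication to the numeric pass.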

In addition, the algorithm achieves better performance if it skips the newly inserted level-1 elements. This optimization can be implemented by letting all workers reduce the rows bottom-up. Meanwhile, no update is sent to any other processor. Under such circumstances, the band size for symbolic factorization does not influence the performance, because there is no synchronization among processors. However, we must use the same band size for both symbolic factorization and numeric factorization.

E. Other Optimizations

Optimization: clusters with multi-core nodes. A hybrid memory model using multiple computers, each with multiple cores, helps to further improve the performance of the TPILU(k) algorithm. On each node, we start several threads as workers and one particular thread as a communicator that handles messages between nodes. This design leads to better performance than having each thread worker communicate directly with remote threads. The reason is that the MPI_THREAD_MULTIPLE option of MPI can degrade performance.

As described near the end of Section I, the performance benefits from an extension of the malloc library beyond the POSIX standard, to give users access to thread-private malloc arenas. The TPMalloc (thread-private malloc) of [11] is just such an extension, and was used in this work. Note that this performance issue can easily be confused with a general cache issue. The literature contains ample descriptions of the cache benefits of stride access and of avoiding random access [23]. Even though the ping-pong access of a standard malloc library presents as poor cache performance, the cause lies instead in the strict adherence to a POSIX-standard malloc, whereas a thread-private one is needed [11]. Luckily, the thread-private alternative is compatible with our static load balancing strategy, in which each worker is assigned a fixed group of bands.

Optimization: efficient matrix storage. The compressed sparse row (CSR) format is used in the iteration phase due to its efficiency for arithmetic operations. However, the CSR format does not support enlarging the memory space for several rows simultaneously. Therefore, TPILU(k) initializes the matrix F in row-major format during the symbolic factorization phase. After the matrix shape is determined by symbolic factorization, TPILU(k) converts the output matrix F from row-major format to CSR format. The format transformations are implemented in the factorization phase.
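The conversion itself is straightforward. A minimal sketch (assuming NumPy/SciPy; the function name and the per-row input format are illustrative, not from the paper) that packs a grown per-row representation into the three CSR arrays:

    import numpy as np
    import scipy.sparse as sp

    def rows_to_csr(row_cols, row_vals, n_cols):
        """Convert a per-row representation (convenient while rows are still
        growing during factorization) into CSR form for the iteration phase."""
        indptr = np.zeros(len(row_cols) + 1, dtype=np.int64)
        for r, cols in enumerate(row_cols):
            indptr[r + 1] = indptr[r] + len(cols)
        indices = np.concatenate([np.asarray(c, dtype=np.int64) for c in row_cols])
        data = np.concatenate([np.asarray(v, dtype=float) for v in row_vals])
        return sp.csr_matrix((data, indices, indptr), shape=(len(row_cols), n_cols))

    # toy usage: row 0 has entries in columns 0 and 2, row 1 has an entry in column 1
    F = rows_to_csr([[0, 2], [1]], [[4.0, 1.0], [3.0]], n_cols=3)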
F. Incomplete Inverse Method

To parallelize iterative solvers, consider Equations (1), which define the forward substitution in each iteration:

    x_i = b_i - Σ_{h<i} α_{ih} x_h,   i = 1, 2, ..., n.    (1)

Equations (1) are used for solving equations of the form Lx = b. As one can see, Equations (1) must be applied in a strictly sequential order. To parallelize this triangular solver, we would like to transform Equations (1) into Equations (2):

    x_i = Σ_{t<i} (-β_{it}) b_t + b_i,   i = 1, 2, ..., n.    (2)

Combining Equations (1) with Equations (2), we have Equations (3):

    x_i = b_i - Σ_{h<i} α_{ih} x_h
        = b_i - Σ_{h<i} α_{ih} ( Σ_{t<h} (-β_{ht}) b_t + b_h )
        = Σ_{t<i} ( -( α_{it} - Σ_{t<h<i} α_{ih} β_{ht} ) ) b_t + b_i,   i = 1, 2, ..., n.    (3)

This yields the defining Equations (4):

    β_{it} = α_{it} - Σ_{t<h<i} α_{ih} β_{ht}.    (4)

Recall that LU factorization goes forward when reducing by α_{ih}:

    for j > h:  α_{ij} ← α_{ij} - α_{ih} α_{hh}^{-1} α_{hj}.    (5)

We can subtract α_{ih} α_{ht} from α_{it} for each t < h using one additional step as part of reducing by α_{ih}:

    for t < h:  α_{it} ← α_{it} - α_{ih} α_{ht}.    (6)

This additional step achieves a side effect: after row i is factored completely, each α_{it} is exactly β_{it}. The reason is that they satisfy the same defining Equations (4). As the last step, we flip the sign of each α_{it} after all rows have been factored completely:

    for t < i:  α_{it} ← -α_{it}.    (7)

This is exactly what is needed by the computation that Equations (2) define. Equations (2) also demonstrate that the lower triangular matrix (-β_{it})_{t<i} is part of L^{-1}, whose diagonal elements are still 1, just as for L.

For ILU(k) preconditioning, (L̃)^{-1} may become dense even if L̃ is sparse. To maintain sparsity, we compute incomplete inverse matrices L̃^{-1} following the steps described by Pseudocode (5), (6), (7). However, we throw out some elements by enforcing that the computation in the previously mentioned steps is executed only for elements α_{ij} satisfying level(i,j) ≤ k. Then, the factorization represented by Pseudocode (5) becomes ILU(k). Correspondingly, the computation represented by Pseudocode (6), (7) produces L̃^{-1}. The computation of L̃^{-1} can be combined with the original numeric factorization phase (by combining Pseudocode (5) and (6)). One more factorization phase is added to compute Ũ^{-1}, following a bottom-up and right-to-left order. L̃^{-1} and Ũ^{-1} reduce the triangular solve stage to matrix-vector multiplications. Then, iterative solvers are parallelized in a trivial way, since basic vector-vector and matrix-vector operations are embarrassingly parallel. We do not parallelize inner product operations, for two reasons: first, they are already fast even when done sequentially; second, parallelization of inner product operations violates sequential consistency.
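The recurrence of Equations (4)-(7) can be checked numerically in the complete (dense) case. The sketch below (an illustration, not the paper's code) computes β row by row for a unit lower triangular L and verifies that the identity minus the strict lower part of β equals L^{-1}; the incomplete variant would simply skip entries whose level exceeds k.

    import numpy as np

    def inverse_by_row_recurrence(L):
        """Compute B with B[i, t] = beta_it from Equation (4):
        beta_it = alpha_it - sum_{t<h<i} alpha_ih * beta_ht,
        for a unit lower triangular L = (alpha_ij)."""
        n = L.shape[0]
        B = np.zeros_like(L, dtype=float)
        for i in range(n):                      # rows are finished top-down
            for t in range(i):
                B[i, t] = L[i, t] - sum(L[i, h] * B[h, t] for h in range(t + 1, i))
        return B

    L = np.array([[1., 0., 0., 0.],
                  [2., 1., 0., 0.],
                  [0., 3., 1., 0.],
                  [4., 0., 5., 1.]])
    B = inverse_by_row_recurrence(L)
    Linv = np.eye(4) - np.tril(B, -1)           # off-diagonal entries of L^{-1} are -beta_it
    assert np.allclose(Linv @ L, np.eye(4))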
V. Experimental Results

Our first test platform is a cluster with an InfiniBand switch. This cluster includes two nodes, each with a single quad-core AMD Opteron 2378 processor, and one node with four quad-core Intel Xeon E5520 CPUs, for a total of 16 cores. The operating system is Linux and the compiler is gcc. The MPI library is Open MPI 1.4. We also installed Hypre (version 2.6.0b) to compare with our solver, TPILU(k). Following the default setup, we use the -O2 option to compile Hypre. We compile TPILU(k) with the same -O2 option. Three iterative solvers in Hypre use a common interface to work with Euclid: (i) the preconditioned conjugate gradient algorithm (Euclid-PCG); (ii) the preconditioned generalized minimal residual algorithm (Euclid-GMRES); and (iii) the preconditioned stabilized bi-conjugate gradient algorithm (Euclid-BICGSTAB). We use the third solver, BICGSTAB, with the default tolerance rtol = 10^{-8} to test both Euclid and TPILU(k).

A. TPILU(k) (base algorithm): Driven Cavity Problem

In this section, we consider some difficult problems from [24]. They arise from the modeling of the incompressible Navier-Stokes equations. We choose three representatives: (i) e20r3000, with a matrix of dimension 4,241 and 131,556 non-zeros; (ii) e30r3000, with a matrix of dimension 9,661 and 306,356 non-zeros; and (iii) e40r3000, with a matrix of dimension 17,281 and 553,956 non-zeros.

The experimental results on e20r3000 for both PILU (the domain decomposition ILU(k) preconditioner in Euclid) and BJILU (the Block Jacobi ILU(k) preconditioner in Euclid) are listed in Table I. In this table, the first column gives the name of the preconditioner, followed by the level k for the PILU method. We tested PILU using levels k from 1 to 6. Although the level k = 5 achieves the smallest number of iterations, the best solution time is achieved by k = 2, due to its relatively small cost for preconditioning and for a single iteration. As we increase the number of processes, the preconditioning either breaks down or its cost grows greatly. Similar behavior is observed for BJILU. In Table I, the best solution time is achieved by PILU(2) with only one worker, which is marked by an exclamation point in the first column.

For e20r3000, the base algorithm TPILU with level k = 4 leads to better performance than either PILU or BJILU. The results for TPILU(4) are collected in Table II, which consists of three parts corresponding to distributed-memory, hybrid-memory and shared-memory configurations. In the distributed-memory case, we speed up the preconditioning using four worker processes, either on the same node or on different nodes. The implementation for the hybrid-memory model leads to better performance than that for the distributed-memory model, even with the same number of workers. As we can see, this implementation allows us to speed up the preconditioning by using three worker threads per node. The implementation for the shared-memory model achieves almost linear speedup for preconditioning.

TABLE I. PILU and BJILU for e20r3000 on two quad-core nodes. Input matrix dimension 4,241; number of non-zeros 131,556. Time is in seconds. * means that the phase breaks down.

TABLE II. TPILU(4) for e20r3000 on two quad-core nodes. (Output matrix has 835,161 non-zero entries. FPAs for preconditioning: 71,297,168.)

The experimental results for PILU and BJILU on e30r3000 are listed in Table III. From this table, one can see that BJILU with 2 workers achieves the best solution time among the Euclid preconditioners. The results for TPILU(3) are collected in Table IV. This table shows that the implementation based on hybrid memory with 8 CPU cores on two nodes ties with the implementation based on shared memory with 4 CPU cores on a single node. Comparing Table III with Table IV, one sees that TPILU(k) wins again. TPILU(k) finishes in 1.04 s (Table IV), as compared to 1.47 s for BJILU and 1.64 s for PILU (Table III). As before, an exclamation point marks the best configuration for each method.

TABLE III. PILU and BJILU for e30r3000 on two quad-core nodes. Input matrix dimension 9,661; number of non-zero entries 306,356. * means that the phase breaks down.

TPILU(k) continues to win in the case of e40r3000. TPILU(k) with k = 3 finishes in 1.93 s (Table VI), as compared to 3.15 s for PILU and 3.52 s for BJILU (Table V). Surprisingly, PILU and BJILU register their best performance when executing sequentially, and not in parallel. From Table VI, we find that the implementation for the hybrid model with 8 CPU cores on two nodes is better than the implementation for the shared-memory model with 4 CPU cores on a single node. This demonstrates TPILU(k)'s strong potential for greater performance when given additional cores.

TABLE IV. TPILU(3) for e30r3000 on two quad-core nodes. (Output matrix has 1,674,369 non-zero entries. FPAs for preconditioning: 120,886,822.)

TABLE V. Euclid (PILU and BJILU) for e40r3000 on two quad-core nodes. Input matrix dimension 17,281; number of non-zeros 553,956. * means that the phase breaks down.

TABLE VI. TPILU(3) for e40r3000 on two quad-core nodes. (Output matrix has 3,070,709 non-zero entries. FPAs for preconditioning: 224,029,062.)

B. Incomplete Inverse: Matrices from 3D 27-point Central Differencing

We tested PILU (the domain decomposition ILU(k) preconditioner in Euclid) on a linear system generated by 3D 27-point central differencing for Poisson's equation. Using both k = 0 and k = 1, we measure the preconditioning time, the number of iterations and the total time for five configurations: 1 process; 2 processes on one node; 2 processes on two nodes; 4 processes on one node; and 4 processes on two nodes. The purpose is to find the best configuration on the two nodes with quad-core AMD CPUs. Experimental results are listed in Table VII. (Note that PILU uses processes, instead of threads, for each subdomain.)

TABLE VII. PILU for the 3D 27-point stencil using two quad-core nodes (k = 0).

From Table VII, we can see that the preconditioning time increases greatly when we increase the level from 0 to 1. In addition, it grows rapidly and soon dominates as the number of processes increases. The results in this table also demonstrate that the parallelism from matrix reordering influences the number of iterations, even though the level k is kept the same. Under such circumstances, the sequential algorithm wins over PILU in the contest for the best solution time. This is not an accident. We also test the linear systems coming from 3D 27-point stencil modelling on a range of larger grid sizes. The sequential algorithm still achieves the best solution time, as seen from the results listed in Table VIII. In Table VIII, we also list the number of floating point arithmetic (FPA) operations and the number of non-zeros in the output matrix.

FPA operations are measured by inserting a counter in our program. From this table, we can see that the number of FPAs is relatively small. This implies that FPAs do not dominate the computation. Hence, the cost of ILU(k) preconditioning lies primarily in the row handling, which consists of scanning each element in a row to merge the result from a previously completely factored row that corresponds to the current element. Therefore, the cost is determined by the number of rows and the total number of non-zero elements (non-zero entries) in the output matrix.

TABLE VIII. Optimal solution time with PILU for 3D 27-point central differencing on larger grid sizes using two quad-core nodes. (k = 0. Time is in seconds. Grid size n^3 means n x n x n.)

For all these test cases, TPIILU (the incomplete inverse submethod of TPILU(k)) decreases the preconditioning time and the time per iteration in the solution phase, while maintaining the same number of iterations. This leads to better performance, as seen in Table IX.

TABLE IX. TPIILU with k = 0 for 3D 27-point central differencing on larger grid sizes using one quad-core node. (The grid size for each row is the same as in Table VIII.)

C. Incomplete Inverse: Cage Model for DNA Electrophoresis

In this section, we study linear systems for the cage model of DNA electrophoresis [25]. The model describes the drift, induced by a constant electric field, of homogeneously charged polymers through a gel. We test on the two largest cases: (i) cage14, with a matrix of dimension 1,505,785 and 27,130,349 non-zeros; and (ii) cage15, with a matrix of dimension 5,154,859 and 99,199,551 non-zeros. To solve these systems, ILU(0) is sufficient, requiring only a few iterations to converge. For the first problem, we use one of the quad-core nodes and obtain a speedup of 2.17 using 4 threads, as seen in Table X. For the second problem, we use the 16-core node and obtain a speedup of 2.93 using 8 threads. The results are collected in Table XI.

TABLE X. TPILU(k) for cage14 using one quad-core node. (Matrix dimension is 1,505,785.)

TABLE XI. TPILU(k) for cage15 using one 16-core node. (Matrix dimension is 5,154,859.)

According to our statistics, the total number of floating point arithmetic operations is 128,136,984 for cage14 preconditioning and 476,940,712 for cage15 preconditioning. The ratio of the number of FPAs to the number of non-zero entries is less than 5 for both cases. This implies that ILU(k) preconditioning essentially just passes through these matrices, with few FPAs. We speculate that no ILU(k)-based iterative solver could compete with this incomplete inverse preconditioning approach.

D. Incomplete Inverse for ns3Da: 3D Navier-Stokes

The linear system ns3Da [26] is a computational fluid dynamics problem used as a test case in FEMLAB, which is developed by Comsol, Inc. Because there are zero diagonal elements in the matrix, we use TPILU(1) as the preconditioner for the iterative solver. For this case, the preconditioning is FPA intensive. As seen in Table XII, TPILU(1) achieves a speedup of 8.93 using 16 threads. We also test on a cluster consisting of 33 nodes.
Each node has 4 CPU cores (dual-processor, dual-core): 2.0 GHz Intel Xeon EM64T processors, with either 8 GB or 16 GB of memory per node. The nodes are connected by a Gigabit Ethernet network.

TABLE XII. TPILU(k) for ns3Da on one 16-core node. (Matrix dimension is 20,414.)

The nodes are configured with Linux 2.6.9, gcc, and MPICH2 1.0 as the MPI library. TPILU(k) obtains a speedup of 7.22 times (Table XIII) for distributed memory and 7.82 times (Table XIV) for hybrid memory.

TABLE XIII. TPILU(k) for ns3Da with 16 nodes on a cluster.

TABLE XIV. TPILU(k) for ns3Da with 8 nodes on a cluster, with 1 dedicated communication thread and 3 worker threads per process, and one process per node.

E. TPMalloc Performance

To verify whether TPMalloc (thread-private malloc) improves the performance of symbolic factorization, we compare the results from two cases: the glibc standard malloc and TPMalloc. The factorization time for e40r3000 on the 16-core machine is listed in Table XV. The first column of this table is the number of threads involved. We use 1 to 8 threads and measure the symbolic factorization time and the numeric factorization time. From Table XV, we can see that the time for numeric factorization is not greatly influenced by whether TPMalloc or the standard malloc is used. However, symbolic factorization scales much better with TPMalloc. This is critical, since the symbolic factorization time dominates.

TABLE XV. TPILU(3) for e40r3000 on one 16-core node: symbolic and numeric factorization time with the standard malloc and with thread-private malloc.

Fig. 4. Three-dimensional comparison of BJILU, PILU and TPILU(k). The x axis is difficulty, the y axis is speedup, and the third axis, z, is density. Dotted lines indicate that a line extends outside of the x-y plane. Legend: ABC is BJILU; ADE is PILU; FGIH is TPIILU (incomplete inverse); HIJ is TPILU (base algorithm).

F. Conclusion

The experimental results can be graphically summarized by Figure 4. While there are regions in which each of BJILU, PILU and TPILU(k) is the fastest, TPILU(k) remains competitive in most cases, and it remains stable in all cases.

References

[1] I. Gustafsson, "A class of first order factorization methods," BIT Numerical Mathematics, vol. 18.
[2] J. W. Watts III, "A conjugate gradient-truncated direct method for the iterative solution of the reservoir simulation pressure equation," SPE Journal, vol. 21.
[3] Y. Saad, Iterative Methods for Sparse Linear Systems, 1st ed., PWS.
[4] D. Hysom and A. Pothen, "Level-based incomplete LU factorization: Graph model and algorithms," Lawrence Livermore National Labs, Tech. Rep. UCRL-JC.
[5] D. Hysom and A. Pothen, "Efficient parallel computation of ILU(k) preconditioners," in Supercomputing '99.
[6] D. Hysom and A. Pothen, "A scalable parallel algorithm for incomplete factor preconditioning," SIAM J. Sci. Comput., vol. 22.
[7] hypre: High Performance Preconditioners, User's Manual, version 2.6.0b. [Online].
[8] R. D. Falgout and U. M. Yang, "hypre: A Library of High Performance Preconditioners," in ICCS '02.
[9] E. Anderson, "Parallel implementation of preconditioned conjugate gradient methods for solving sparse systems of linear equations."


1 2 (3 + x 3) x 2 = 1 3 (3 + x 1 2x 3 ) 1. 3 ( 1 x 2) (3 + x(0) 3 ) = 1 2 (3 + 0) = 3. 2 (3 + x(0) 1 2x (0) ( ) = 1 ( 1 x(0) 2 ) = 1 3 ) = 1 3 6 Iterative Solvers Lab Objective: Many real-world problems of the form Ax = b have tens of thousands of parameters Solving such systems with Gaussian elimination or matrix factorizations could require

More information

Parallel Threshold-based ILU Factorization

Parallel Threshold-based ILU Factorization A short version of this paper appears in Supercomputing 997 Parallel Threshold-based ILU Factorization George Karypis and Vipin Kumar University of Minnesota, Department of Computer Science / Army HPC

More information

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11 Preface xvii Acknowledgments xix CHAPTER 1 Introduction to Parallel Computing 1 1.1 Motivating Parallelism 2 1.1.1 The Computational Power Argument from Transistors to FLOPS 2 1.1.2 The Memory/Disk Speed

More information

Efficient Multi-GPU CUDA Linear Solvers for OpenFOAM

Efficient Multi-GPU CUDA Linear Solvers for OpenFOAM Efficient Multi-GPU CUDA Linear Solvers for OpenFOAM Alexander Monakov, amonakov@ispras.ru Institute for System Programming of Russian Academy of Sciences March 20, 2013 1 / 17 Problem Statement In OpenFOAM,

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters 1 On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters N. P. Karunadasa & D. N. Ranasinghe University of Colombo School of Computing, Sri Lanka nishantha@opensource.lk, dnr@ucsc.cmb.ac.lk

More information

PARALLELIZATION OF POTENTIAL FLOW SOLVER USING PC CLUSTERS

PARALLELIZATION OF POTENTIAL FLOW SOLVER USING PC CLUSTERS Proceedings of FEDSM 2000: ASME Fluids Engineering Division Summer Meeting June 11-15,2000, Boston, MA FEDSM2000-11223 PARALLELIZATION OF POTENTIAL FLOW SOLVER USING PC CLUSTERS Prof. Blair.J.Perot Manjunatha.N.

More information

Figure 6.1: Truss topology optimization diagram.

Figure 6.1: Truss topology optimization diagram. 6 Implementation 6.1 Outline This chapter shows the implementation details to optimize the truss, obtained in the ground structure approach, according to the formulation presented in previous chapters.

More information

Performance of Implicit Solver Strategies on GPUs

Performance of Implicit Solver Strategies on GPUs 9. LS-DYNA Forum, Bamberg 2010 IT / Performance Performance of Implicit Solver Strategies on GPUs Prof. Dr. Uli Göhner DYNAmore GmbH Stuttgart, Germany Abstract: The increasing power of GPUs can be used

More information

Parallelization Strategy

Parallelization Strategy COSC 335 Software Design Parallel Design Patterns (II) Spring 2008 Parallelization Strategy Finding Concurrency Structure the problem to expose exploitable concurrency Algorithm Structure Supporting Structure

More information

Distributed NVAMG. Design and Implementation of a Scalable Algebraic Multigrid Framework for a Cluster of GPUs

Distributed NVAMG. Design and Implementation of a Scalable Algebraic Multigrid Framework for a Cluster of GPUs Distributed NVAMG Design and Implementation of a Scalable Algebraic Multigrid Framework for a Cluster of GPUs Istvan Reguly (istvan.reguly at oerc.ox.ac.uk) Oxford e-research Centre NVIDIA Summer Internship

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms Chapter 3 Dense Linear Systems Section 3.3 Triangular Linear Systems Michael T. Heath and Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign

More information

Parallel Sparse LU Factorization on Different Message Passing Platforms

Parallel Sparse LU Factorization on Different Message Passing Platforms Parallel Sparse LU Factorization on Different Message Passing Platforms Kai Shen Department of Computer Science, University of Rochester Rochester, NY 1467, USA Abstract Several message passing-based parallel

More information

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about

More information

Sparse Linear Systems

Sparse Linear Systems 1 Sparse Linear Systems Rob H. Bisseling Mathematical Institute, Utrecht University Course Introduction Scientific Computing February 22, 2018 2 Outline Iterative solution methods 3 A perfect bipartite

More information

A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE

A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE Abu Asaduzzaman, Deepthi Gummadi, and Chok M. Yip Department of Electrical Engineering and Computer Science Wichita State University Wichita, Kansas, USA

More information

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou Study and implementation of computational methods for Differential Equations in heterogeneous systems Asimina Vouronikoy - Eleni Zisiou Outline Introduction Review of related work Cyclic Reduction Algorithm

More information

Challenges of Scaling Algebraic Multigrid Across Modern Multicore Architectures. Allison H. Baker, Todd Gamblin, Martin Schulz, and Ulrike Meier Yang

Challenges of Scaling Algebraic Multigrid Across Modern Multicore Architectures. Allison H. Baker, Todd Gamblin, Martin Schulz, and Ulrike Meier Yang Challenges of Scaling Algebraic Multigrid Across Modern Multicore Architectures. Allison H. Baker, Todd Gamblin, Martin Schulz, and Ulrike Meier Yang Multigrid Solvers Method of solving linear equation

More information

Two-Phase flows on massively parallel multi-gpu clusters

Two-Phase flows on massively parallel multi-gpu clusters Two-Phase flows on massively parallel multi-gpu clusters Peter Zaspel Michael Griebel Institute for Numerical Simulation Rheinische Friedrich-Wilhelms-Universität Bonn Workshop Programming of Heterogeneous

More information

Lecture 11: Randomized Least-squares Approximation in Practice. 11 Randomized Least-squares Approximation in Practice

Lecture 11: Randomized Least-squares Approximation in Practice. 11 Randomized Least-squares Approximation in Practice Stat60/CS94: Randomized Algorithms for Matrices and Data Lecture 11-10/09/013 Lecture 11: Randomized Least-squares Approximation in Practice Lecturer: Michael Mahoney Scribe: Michael Mahoney Warning: these

More information

Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation

Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation 1 Cheng-Han Du* I-Hsin Chung** Weichung Wang* * I n s t i t u t e o f A p p l i e d M

More information

Evaluation of sparse LU factorization and triangular solution on multicore architectures. X. Sherry Li

Evaluation of sparse LU factorization and triangular solution on multicore architectures. X. Sherry Li Evaluation of sparse LU factorization and triangular solution on multicore architectures X. Sherry Li Lawrence Berkeley National Laboratory ParLab, April 29, 28 Acknowledgement: John Shalf, LBNL Rich Vuduc,

More information

Accelerating Double Precision FEM Simulations with GPUs

Accelerating Double Precision FEM Simulations with GPUs Accelerating Double Precision FEM Simulations with GPUs Dominik Göddeke 1 3 Robert Strzodka 2 Stefan Turek 1 dominik.goeddeke@math.uni-dortmund.de 1 Mathematics III: Applied Mathematics and Numerics, University

More information

Linear systems of equations

Linear systems of equations Linear systems of equations Michael Quinn Parallel Programming with MPI and OpenMP material do autor Terminology Back substitution Gaussian elimination Outline Problem System of linear equations Solve

More information

1.2 Numerical Solutions of Flow Problems

1.2 Numerical Solutions of Flow Problems 1.2 Numerical Solutions of Flow Problems DIFFERENTIAL EQUATIONS OF MOTION FOR A SIMPLIFIED FLOW PROBLEM Continuity equation for incompressible flow: 0 Momentum (Navier-Stokes) equations for a Newtonian

More information

OpenFOAM + GPGPU. İbrahim Özküçük

OpenFOAM + GPGPU. İbrahim Özküçük OpenFOAM + GPGPU İbrahim Özküçük Outline GPGPU vs CPU GPGPU plugins for OpenFOAM Overview of Discretization CUDA for FOAM Link (cufflink) Cusp & Thrust Libraries How Cufflink Works Performance data of

More information

Sparse Matrices Reordering using Evolutionary Algorithms: A Seeded Approach

Sparse Matrices Reordering using Evolutionary Algorithms: A Seeded Approach 1 Sparse Matrices Reordering using Evolutionary Algorithms: A Seeded Approach David Greiner, Gustavo Montero, Gabriel Winter Institute of Intelligent Systems and Numerical Applications in Engineering (IUSIANI)

More information

Issues In Implementing The Primal-Dual Method for SDP. Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM

Issues In Implementing The Primal-Dual Method for SDP. Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM Issues In Implementing The Primal-Dual Method for SDP Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM 87801 borchers@nmt.edu Outline 1. Cache and shared memory parallel computing concepts.

More information

Iterative Algorithms I: Elementary Iterative Methods and the Conjugate Gradient Algorithms

Iterative Algorithms I: Elementary Iterative Methods and the Conjugate Gradient Algorithms Iterative Algorithms I: Elementary Iterative Methods and the Conjugate Gradient Algorithms By:- Nitin Kamra Indian Institute of Technology, Delhi Advisor:- Prof. Ulrich Reude 1. Introduction to Linear

More information

Scalability of Heterogeneous Computing

Scalability of Heterogeneous Computing Scalability of Heterogeneous Computing Xian-He Sun, Yong Chen, Ming u Department of Computer Science Illinois Institute of Technology {sun, chenyon1, wuming}@iit.edu Abstract Scalability is a key factor

More information

Workshop on Efficient Solvers in Biomedical Applications, Graz, July 2-5, 2012

Workshop on Efficient Solvers in Biomedical Applications, Graz, July 2-5, 2012 Workshop on Efficient Solvers in Biomedical Applications, Graz, July 2-5, 2012 This work was performed under the auspices of the U.S. Department of Energy by under contract DE-AC52-07NA27344. Lawrence

More information

Parallel ILU Ordering and Convergence Relationships: Numerical Experiments

Parallel ILU Ordering and Convergence Relationships: Numerical Experiments NASA/CR-00-2119 ICASE Report No. 00-24 Parallel ILU Ordering and Convergence Relationships: Numerical Experiments David Hysom and Alex Pothen Old Dominion University, Norfolk, Virginia Institute for Computer

More information

Andrew V. Knyazev and Merico E. Argentati (speaker)

Andrew V. Knyazev and Merico E. Argentati (speaker) 1 Andrew V. Knyazev and Merico E. Argentati (speaker) Department of Mathematics and Center for Computational Mathematics University of Colorado at Denver 2 Acknowledgement Supported by Lawrence Livermore

More information

Dense Matrix Algorithms

Dense Matrix Algorithms Dense Matrix Algorithms Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text Introduction to Parallel Computing, Addison Wesley, 2003. Topic Overview Matrix-Vector Multiplication

More information

The Immersed Interface Method

The Immersed Interface Method The Immersed Interface Method Numerical Solutions of PDEs Involving Interfaces and Irregular Domains Zhiiin Li Kazufumi Ito North Carolina State University Raleigh, North Carolina Society for Industrial

More information

Achieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation

Achieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation Achieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation Michael Lange 1 Gerard Gorman 1 Michele Weiland 2 Lawrence Mitchell 2 Xiaohu Guo 3 James Southern 4 1 AMCG, Imperial College

More information

Sparse LU Factorization for Parallel Circuit Simulation on GPUs

Sparse LU Factorization for Parallel Circuit Simulation on GPUs Department of Electronic Engineering, Tsinghua University Sparse LU Factorization for Parallel Circuit Simulation on GPUs Ling Ren, Xiaoming Chen, Yu Wang, Chenxi Zhang, Huazhong Yang Nano-scale Integrated

More information

Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures

Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures Rolf Rabenseifner rabenseifner@hlrs.de Gerhard Wellein gerhard.wellein@rrze.uni-erlangen.de University of Stuttgart

More information

The 3D DSC in Fluid Simulation

The 3D DSC in Fluid Simulation The 3D DSC in Fluid Simulation Marek K. Misztal Informatics and Mathematical Modelling, Technical University of Denmark mkm@imm.dtu.dk DSC 2011 Workshop Kgs. Lyngby, 26th August 2011 Governing Equations

More information

Introduction to parallel Computing

Introduction to parallel Computing Introduction to parallel Computing VI-SEEM Training Paschalis Paschalis Korosoglou Korosoglou (pkoro@.gr) (pkoro@.gr) Outline Serial vs Parallel programming Hardware trends Why HPC matters HPC Concepts

More information

Chapter 4. Matrix and Vector Operations

Chapter 4. Matrix and Vector Operations 1 Scope of the Chapter Chapter 4 This chapter provides procedures for matrix and vector operations. This chapter (and Chapters 5 and 6) can handle general matrices, matrices with special structure and

More information

Performance Comparison between Blocking and Non-Blocking Communications for a Three-Dimensional Poisson Problem

Performance Comparison between Blocking and Non-Blocking Communications for a Three-Dimensional Poisson Problem Performance Comparison between Blocking and Non-Blocking Communications for a Three-Dimensional Poisson Problem Guan Wang and Matthias K. Gobbert Department of Mathematics and Statistics, University of

More information

IMPLEMENTATION OF THE. Alexander J. Yee University of Illinois Urbana-Champaign

IMPLEMENTATION OF THE. Alexander J. Yee University of Illinois Urbana-Champaign SINGLE-TRANSPOSE IMPLEMENTATION OF THE OUT-OF-ORDER 3D-FFT Alexander J. Yee University of Illinois Urbana-Champaign The Problem FFTs are extremely memory-intensive. Completely bound by memory access. Memory

More information

Matrix Multiplication

Matrix Multiplication Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2013 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2013 1 / 32 Outline 1 Matrix operations Importance Dense and sparse

More information

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures*

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Tharso Ferreira 1, Antonio Espinosa 1, Juan Carlos Moure 2 and Porfidio Hernández 2 Computer Architecture and Operating

More information

Lecture 13: March 25

Lecture 13: March 25 CISC 879 Software Support for Multicore Architectures Spring 2007 Lecture 13: March 25 Lecturer: John Cavazos Scribe: Ying Yu 13.1. Bryan Youse-Optimization of Sparse Matrix-Vector Multiplication on Emerging

More information

Application of GPU-Based Computing to Large Scale Finite Element Analysis of Three-Dimensional Structures

Application of GPU-Based Computing to Large Scale Finite Element Analysis of Three-Dimensional Structures Paper 6 Civil-Comp Press, 2012 Proceedings of the Eighth International Conference on Engineering Computational Technology, B.H.V. Topping, (Editor), Civil-Comp Press, Stirlingshire, Scotland Application

More information

A MATLAB Interface to the GPU

A MATLAB Interface to the GPU Introduction Results, conclusions and further work References Department of Informatics Faculty of Mathematics and Natural Sciences University of Oslo June 2007 Introduction Results, conclusions and further

More information

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli High performance 2D Discrete Fourier Transform on Heterogeneous Platforms Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli Motivation Fourier Transform widely used in Physics, Astronomy, Engineering

More information

CS6015 / LARP ACK : Linear Algebra and Its Applications - Gilbert Strang

CS6015 / LARP ACK : Linear Algebra and Its Applications - Gilbert Strang Solving and CS6015 / LARP 2018 ACK : Linear Algebra and Its Applications - Gilbert Strang Introduction Chapter 1 concentrated on square invertible matrices. There was one solution to Ax = b and it was

More information

Lecture 17: More Fun With Sparse Matrices

Lecture 17: More Fun With Sparse Matrices Lecture 17: More Fun With Sparse Matrices David Bindel 26 Oct 2011 Logistics Thanks for info on final project ideas. HW 2 due Monday! Life lessons from HW 2? Where an error occurs may not be where you

More information

Structure-Adaptive Parallel Solution of Sparse Triangular Linear Systems

Structure-Adaptive Parallel Solution of Sparse Triangular Linear Systems Structure-Adaptive Parallel Solution of Sparse Triangular Linear Systems Ehsan Totoni, Michael T. Heath, and Laxmikant V. Kale Department of Computer Science, University of Illinois at Urbana-Champaign

More information

First Experiences with Intel Cluster OpenMP

First Experiences with Intel Cluster OpenMP First Experiences with Intel Christian Terboven, Dieter an Mey, Dirk Schmidl, Marcus Wagner surname@rz.rwth aachen.de Center for Computing and Communication RWTH Aachen University, Germany IWOMP 2008 May

More information

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION 6.1 INTRODUCTION Fuzzy logic based computational techniques are becoming increasingly important in the medical image analysis arena. The significant

More information

Contents. I The Basic Framework for Stationary Problems 1

Contents. I The Basic Framework for Stationary Problems 1 page v Preface xiii I The Basic Framework for Stationary Problems 1 1 Some model PDEs 3 1.1 Laplace s equation; elliptic BVPs... 3 1.1.1 Physical experiments modeled by Laplace s equation... 5 1.2 Other

More information

CSCE 5160 Parallel Processing. CSCE 5160 Parallel Processing

CSCE 5160 Parallel Processing. CSCE 5160 Parallel Processing HW #9 10., 10.3, 10.7 Due April 17 { } Review Completing Graph Algorithms Maximal Independent Set Johnson s shortest path algorithm using adjacency lists Q= V; for all v in Q l[v] = infinity; l[s] = 0;

More information

MVAPICH2 vs. OpenMPI for a Clustering Algorithm

MVAPICH2 vs. OpenMPI for a Clustering Algorithm MVAPICH2 vs. OpenMPI for a Clustering Algorithm Robin V. Blasberg and Matthias K. Gobbert Naval Research Laboratory, Washington, D.C. Department of Mathematics and Statistics, University of Maryland, Baltimore

More information

MEMORY EFFICIENT WDR (WAVELET DIFFERENCE REDUCTION) using INVERSE OF ECHELON FORM by EQUATION SOLVING

MEMORY EFFICIENT WDR (WAVELET DIFFERENCE REDUCTION) using INVERSE OF ECHELON FORM by EQUATION SOLVING Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC Vol. 3 Issue. 7 July 2014 pg.512

More information

Chapter 14: Matrix Iterative Methods

Chapter 14: Matrix Iterative Methods Chapter 14: Matrix Iterative Methods 14.1INTRODUCTION AND OBJECTIVES This chapter discusses how to solve linear systems of equations using iterative methods and it may be skipped on a first reading of

More information

Techniques for Optimizing FEM/MoM Codes

Techniques for Optimizing FEM/MoM Codes Techniques for Optimizing FEM/MoM Codes Y. Ji, T. H. Hubing, and H. Wang Electromagnetic Compatibility Laboratory Department of Electrical & Computer Engineering University of Missouri-Rolla Rolla, MO

More information

Q. Wang National Key Laboratory of Antenna and Microwave Technology Xidian University No. 2 South Taiba Road, Xi an, Shaanxi , P. R.

Q. Wang National Key Laboratory of Antenna and Microwave Technology Xidian University No. 2 South Taiba Road, Xi an, Shaanxi , P. R. Progress In Electromagnetics Research Letters, Vol. 9, 29 38, 2009 AN IMPROVED ALGORITHM FOR MATRIX BANDWIDTH AND PROFILE REDUCTION IN FINITE ELEMENT ANALYSIS Q. Wang National Key Laboratory of Antenna

More information

Block Distributed Schur Complement Preconditioners for CFD Computations on Many-Core Systems

Block Distributed Schur Complement Preconditioners for CFD Computations on Many-Core Systems Block Distributed Schur Complement Preconditioners for CFD Computations on Many-Core Systems Dr.-Ing. Achim Basermann, Melven Zöllner** German Aerospace Center (DLR) Simulation- and Software Technology

More information

Adaptive Matrix Transpose Algorithms for Distributed Multicore Processors

Adaptive Matrix Transpose Algorithms for Distributed Multicore Processors Adaptive Matrix Transpose Algorithms for Distributed Multicore ors John C. Bowman and Malcolm Roberts Abstract An adaptive parallel matrix transpose algorithm optimized for distributed multicore architectures

More information

HPC Algorithms and Applications

HPC Algorithms and Applications HPC Algorithms and Applications Dwarf #5 Structured Grids Michael Bader Winter 2012/2013 Dwarf #5 Structured Grids, Winter 2012/2013 1 Dwarf #5 Structured Grids 1. dense linear algebra 2. sparse linear

More information

High Performance Computing: Tools and Applications

High Performance Computing: Tools and Applications High Performance Computing: Tools and Applications Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology Lecture 15 Numerically solve a 2D boundary value problem Example:

More information

Matrix Multiplication

Matrix Multiplication Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2018 1 / 32 Outline 1 Matrix operations Importance Dense and sparse

More information

(Sparse) Linear Solvers

(Sparse) Linear Solvers (Sparse) Linear Solvers Ax = B Why? Many geometry processing applications boil down to: solve one or more linear systems Parameterization Editing Reconstruction Fairing Morphing 2 Don t you just invert

More information

Computational issues in linear programming

Computational issues in linear programming Computational issues in linear programming Julian Hall School of Mathematics University of Edinburgh 15th May 2007 Computational issues in linear programming Overview Introduction to linear programming

More information

Lecture 12 (Last): Parallel Algorithms for Solving a System of Linear Equations. Reference: Introduction to Parallel Computing Chapter 8.

Lecture 12 (Last): Parallel Algorithms for Solving a System of Linear Equations. Reference: Introduction to Parallel Computing Chapter 8. CZ4102 High Performance Computing Lecture 12 (Last): Parallel Algorithms for Solving a System of Linear Equations - Dr Tay Seng Chuan Reference: Introduction to Parallel Computing Chapter 8. 1 Topic Overview

More information