Task-Oriented Parallel ILU(k) Preconditioning on Computer Cluster and Multi-core Machine

Xin Dong
College of Computer Science
Northeastern University
Boston, MA, USA
xindong@ccs.neu.edu

Gene Cooperman
College of Computer Science
Northeastern University
Boston, MA, USA
gene@ccs.neu.edu

arXiv v2 [cs.DC] 3 Oct 2010

(This work was partially supported by the National Science Foundation under Grant CCF.)

Abstract. ILU(k) is a preconditioner used for many stable iterative linear solvers. We present TPILU(k), the first efficiently parallelized ILU(k) preconditioner that maintains this stability. Iterative solvers using TPILU(k) are guaranteed to perform at least as well as the same solver using sequential ILU(k). Indeed, in experiments on a 16-core computer, TPILU(k)-based solvers performed up to 9 times faster. Normally, TPILU(k) returns exactly the same preconditioned matrix as the original sequential ILU(k) preconditioner would have returned. As an optimization, TPILU(k) attempts to use a new incomplete inverse method as a fast approximation for the original ILU(k) preconditioned matrix. This allows TPILU(k)-based solvers to compete with fast, unstable parallel solvers such as parallel Block Jacobi ILU(k). In the regime where parallel Block Jacobi fails, TPILU(k) may also fail, but it will fail fast and then revert to the slower standard parallelization that returns the same preconditioned matrix as the sequential ILU(k) algorithm. The incomplete inverse method also has benefits for sequential ILU(k)-based iterative solvers. Finally, the combined algorithm generalizes to an efficient, parallel distributed version for computer clusters.

Keywords: ILU(k), parallel computing, preconditioning, Gaussian elimination, task-oriented parallelism

I. Introduction

This work introduces a new preconditioner, TPILU(k), with good stability and performance across a range of sparse linear systems. For large sparse linear systems Ax = b, parallel iterative solvers based on ILU(k) [1], [2] often suffer from instability or performance degradation. The reason is that the domain decomposition preconditioners used there become slow or unstable with greater parallelism. This happens as they attempt to approximate a linear system by more and smaller subdomains in order to provide parallel work for an increasing number of threads. In contrast, TPILU(k) is stable and its performance increases with the number of threads.

TPILU(k) uses a novel parallel ILU(k) preconditioner as its base algorithm. However, it first tries a different, incomplete inverse submethod. The submethod either succeeds or fails fast. If it fails, the base parallel ILU(k) algorithm is used. TPILU(k) is particularly important for non-symmetric sparse matrices, whereas symmetric matrices are efficiently handled by other well-known algorithms.

Both the base algorithm and the submethod of TPILU(k) are sequentially consistent. This means that the parallel algorithm produces the same preconditioned matrix as would be produced by the sequential algorithm. This is key to preventing the slowdown and instability that exist in parallelization by domain decomposition methods.

A particular showcase for TPILU(k) is the parallelization of the triangular solve stage in each iteration. This stage consists of the forward/backward substitutions, which are a traditional bottleneck for ILU(k)-based iterative solvers.
This bottleneck is in part responsible for the turn to domain decomposition methods such as BJILU (Block Jacobi [3] ILU(k)) and PILU(k) [4], [5], [6] (denoted simply PILU below). The incomplete inverse submethod of TPILU(k) speeds up the triangular solver in cases where other methods fail to extract parallelism.

Incomplete Inverse Submethod of TPILU(k). The parallelization of triangular solvers has long been the key remaining problem in the parallelization of ILU-based iterative solvers. Solving sparse triangular systems at each iteration of the solver is a sequential bottleneck with little opportunity for parallelization. The incomplete inverse submethod is a novel algorithm for a highly parallel triangular solver. It produces approximate inverse matrices for general sparse triangular matrices, including the incomplete factors L̃ and Ũ of ILU(k) of interest here. While it does not always work, it is guaranteed to fail fast when it does not work. So, it costs little to try it in a preliminary step.

Further, since L̃ and Ũ are already approximations, also taking an approximate inverse is not an issue. The incomplete inverse submethod is inspired by ILU(k) itself. The classical idea of ILU(k) is to throw out some small (higher level) matrix entries. In a similar manner, this submethod throws out some higher level coefficients during the forward/backward substitutions in each iteration. We call this submethod TPIILU and the base algorithm TPILU.

Impact on sequential ILU(k). TPIILU also adds an important novelty even for the sequential case. Recall that ILU(k) provides sparse triangular approximations L̃ and Ũ to the matrices L and U. A problem of the form L̃Ũx = b must be solved at every iteration of the solver. Using fast approximate inverses L̃^{-1} and Ũ^{-1}, that linear equation reduces to the much faster matrix-vector multiplication below:

    x = Ũ^{-1} L̃^{-1} b
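To make the contrast concrete, the sketch below compares applying the factors by two sequential triangular solves with applying precomputed inverse factors by two sparse matrix-vector products. It is only an illustration under stated assumptions (SciPy is available; tiny factors and their exact inverses stand in for the incomplete factors and incomplete inverses); it is not the paper's implementation.

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    # Hypothetical sparse triangular factors, standing in for the ILU(k) output.
    L = sp.csr_matrix(np.array([[1., 0., 0.], [0.5, 1., 0.], [0., 0.25, 1.]]))
    U = sp.csr_matrix(np.array([[2., 1., 0.], [0., 4., 1.], [0., 0., 3.]]))
    b = np.array([1., 2., 3.])

    # Conventional preconditioner application: two inherently sequential triangular solves.
    y_solve = spla.spsolve_triangular(U, spla.spsolve_triangular(L, b, lower=True),
                                      lower=False)

    # With (incomplete) inverse factors, the same application becomes two sparse
    # matrix-vector products, which have no loop-carried dependence across rows.
    Linv = sp.csr_matrix(np.linalg.inv(L.toarray()))   # stand-in for the incomplete inverse
    Uinv = sp.csr_matrix(np.linalg.inv(U.toarray()))
    y_spmv = Uinv @ (Linv @ b)

    assert np.allclose(y_solve, y_spmv)

Because the matrix-vector products carry no dependence between rows, this form of the triangular solve stage parallelizes trivially, which is the point of the submethod.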
Balancing the time for the TPILU(k) preconditioner and the iterations of the linear solver. Next, we consider the case of difficult matrices, when domain decomposition methods fail. As before, the incomplete inverse method is tried first. If it and other methods fail, then the base TPILU(k) algorithm is used. This is guaranteed to be exactly as stable as the sequential ILU(k) algorithm, and it is also guaranteed to run faster as it uses more threads. In this regime, we are comparing with a sequential ILU(k) algorithm. The parallel TPILU(k) algorithm has a clear benefit when the preconditioning time dominates the time for the iterations of the solver. Even when the time for the TPILU(k) preconditioner does not dominate, there is an important side benefit to its parallelization. One can raise the parameter value k. The result is a higher quality preconditioned matrix, which usually results in fewer iterations for the linear solver. Of course, a higher k leads to two other problems: (a) the time to compute the preconditioned matrix rises quickly as k increases; and (b) the preconditioned matrix becomes denser with higher k, resulting in more computation in a single iteration of the solver. Often, problem (a) is the more severe one, while the density of the matrix remains manageable.

Review of Two Competitive Methods. We compare with two widely used preconditioners: BJILU and PILU. Under the framework of Euclid [7, Section 6.12], both preconditioners appear in Hypre [8], a popular linear solver package developed at Lawrence Livermore National Laboratory. BJILU and PILU are based on domain decomposition: approximate the linear system by multiple linear systems on subspaces, and then handle them in parallel. The problem with this approach is that one desires at least as many domains as the number of threads. As the number of threads increases, worse and worse approximations are made to the original linear system in order to produce ever more domains. Hence, the methods risk making an iterative solver increasingly slow or unstable.

At the opposite extreme, researchers considered sequentially consistent parallel ILU(k) preconditioners [9]. Such methods, whose parallelism comes from level/backward scheduling, were studied in the 1980s and achieved a speedup of about 4 or 5 on an Alliant FX-8 [3, page 351] and a speedup of 2 or 3 on a Cray Y-MP [10]. Although PILU provides a better speedup over sequential ILU(k), it is intended to be used only under the very strong condition that the matrix must be well-partitionable. This condition is violated by linear systems generating many fill-ins and by linear systems accommodating many threads. As noted in [7]: "if subdomains contain relatively few nodes (less than 1,000), or the problem is not well partitioned, Block Jacobi ILU may give faster solution time than PILU."

BJILU preconditioners have been successfully used for a class of linear systems, for example, those with diagonally dominant matrices. We define linear systems outside of this class as difficult problems. Solvers based on BJILU would fail to converge for difficult problems as the number of threads w grows. The reason lies in the domain decomposition method, by which a scalable parallelization must approximate the original matrix by at least w subdomains. As a result, the preconditioner is forced to ignore ever more off-diagonal matrix entries as it uses more blocks of smaller block dimension.

Features of TPILU(k). TPILU(k) is more efficient for applications that generate many difficult problems requiring a solution. We compare an iterative solver using either parallel TPILU(k) preconditioning or traditional sequential ILU(k) preconditioning. In our experiments, we observe a speedup of up to a factor of 9 over a sequential ILU(k)-based solver by using 16 threads on 16 cores (see Table XII). Depending on whether the time for the preconditioner dominates the time for the overall iterative solver, we may observe anywhere from a 30% overall speedup to the full factor of 9 (see Tables V, VI, XII).

TPILU(k) produces stable preconditioners across most classes of linear systems. The sequential ILU(k)-based solver has this property.

Further, because the time for the preconditioning often dominates, we observe a large speedup in our parallel solver, TPILU(k) (see Tables XII, XIII, XIV). In contrast, BJILU-based iterative solvers fail to converge (are unstable) on difficult problems. Even worse, BJILU preconditioning may succeed for a small number of threads, and then fail for more threads (see Tables I, III, V). While BJILU preconditioning works well sometimes, another method must be used in the cases where it fails. TPILU(k) has reliable performance as compared to PILU, whose partitioning cost varies widely. When solving the linear systems arising from 3-D 27-point stencil modelling of Laplace's equation, PILU(k)-based iterative solvers always perform worse than the sequential ILU(k)-based solvers in our experiments (see Tables VII, VIII).

TPILU(k) is implemented both for computer clusters and for multi-core machines. It takes advantage of the general parallelism existing in LU factorization for any matrix, while adhering to sequential consistency. Sequential consistency is crucial to TPILU(k) in squeezing out more speedup while maintaining stability. In addition, it helps manage potential instabilities due to round-off. Although mathematically (x+y)+z = x+(y+z), the differing round-off errors can influence the final result. For example, a sequential triangular solver may return different results depending on the order in which each unknown is reduced during the forward/backward substitutions: left-right or right-left. Debugging also becomes difficult when parallelization produces different results.

The initial implementation of TPILU(k) used dynamic scheduling of threads for load balancing. Because the fill-ins of symbolic factorization in ILU(k) occur dynamically, we cannot predict in advance how much temporary storage each thread will need to malloc. So, each thread will malloc a different region size at each step. Pre-allocating a large malloc region for each thread is not an option, since the total region allocated may no longer fit in L2 cache. When using many threads, dynamic allocation results in ping-pong access patterns between the caches of different cores. This phenomenon is not particular to numerical analysis. See, for example, a discussion of this phenomenon in a Monte Carlo simulation [11, Section 4], where the use of a per-thread allocator with thread-private malloc arenas produced a large improvement over four widely used memory allocators (ptmalloc2, ptmalloc3, hoard and tcmalloc). To eliminate this bottleneck, we were forced to switch to static round-robin scheduling of threads. We modified the standard glibc allocator to also use thread-private malloc arenas. Details are in Section IV-E, especially in Table XV. This is an important lesson for future manycore CPUs. We are not aware of other linear solvers currently using this technique.

The contributions of TPILU(k) are threefold: 1) an efficient, parallel, sequentially consistent ILU(k) preconditioner targeting difficult problems; 2) an incomplete inverse algorithm to overcome a bottleneck related to solving triangular matrices; and 3) improved cache efficiency through static round-robin thread scheduling and thread-private malloc arenas.

The rest of this paper is organized as follows. Section II briefly reviews some previous work. Section III reviews LU factorization and the sequential ILU(k) algorithm.
Section IV presents task-oriented parallel TPILU(k), including the base algorithm (Sections IV-A through IV-D) and the incomplete inverse method (Section IV-F). Section V analyzes the experimental results.

II. Related Work

ILU(k) [1] was formalized to solve the systems of linear equations arising from finite difference discretizations. In 1981, ILU(k) was extended to apply to more general problems [2]. Some early work on parallel ILU(k) preconditioners uses sequential consistency, as mentioned in Section I. There are also some sequentially consistent parallel IC preconditioners [12], [13] for symmetric linear systems.

Pivoting/reordering is the dominant technique today for parallel linear solvers. For example, pivoting is employed to reduce the number of fill-ins for the LU factorization in [14]. The work in [15] provides a pivoting algorithm to make the diagonal elements of a sparse matrix large. The methodology is to compute the optimal pivoting and preprocess the sparse matrix. Reordering is also employed to develop a parallel algorithm in [16], which targets distributed sparse matrices. PMILU [17] reorders a linear system to expose greater parallelism. The experimental results demonstrate the scalability of PMILU on four IBM Power 5 SMP nodes, each with 16 shared-memory processors.

The use of sequentially consistent parallel preconditioners is receding. For example, parallel ILUT [18] is no longer supported by Hypre, and Euclid is now recommended as a replacement. Euclid supports both low-level thread parallelism (for matrix add, matrix-vector multiply, etc.) and high-level distributed parallelism. This emphasis on purely distributed parallelism at the high level is motivated by Euclid's domain decomposition strategy.

TPILU(k) is inspired by TOP-C [19]. The task-oriented approach differs in spirit from domain decomposition, and from earlier work based on level scheduling [9], [10]. The resulting parallel algorithm is derived by identifying tasks to be computed in parallel from the original sequential algorithm. TOP-C employs a message passing interface (MPI) for distributed memory architectures. In [20], a definition of tasks suitable for dense Gaussian elimination is presented. That simple approach inspired the initial approach of this work on sparse incomplete LU factorization.

III. Review of the Sequential ILU(k) Algorithm

See [4], [21], [22] for a detailed review of ILU(k). A brief sketch is provided here. LU factorization completely decomposes a matrix A into the product of a lower triangular matrix L and an upper triangular matrix U. From L and U, one efficiently computes A^{-1} as U^{-1} L^{-1}. While the computation of L and U requires O(n^3) steps, once done, the computation of the inverse of the triangular matrices proceeds in only O(n^2) steps.

For sparse matrices, one contents oneself with solving for x in Ax = b for vectors x and b, since A^{-1}, L and U would all be hopelessly dense. Iterative solvers are often used for this purpose. An ILU(k) algorithm finds sparse approximations L̃ ≈ L and Ũ ≈ U. The preconditioned iterative solver then implicitly works with A Ũ^{-1} L̃^{-1}, which is close to the identity. For this purpose, triangular solve operations are integrated into each iteration to obtain a solution y such that L̃Ũy = p, where p varies across iterations. This yields faster convergence and better numerical stability. Here, the level limit k controls what kinds of elements are computed in the process of incomplete LU factorization.

Similarly to LU factorization, ILU(k) factorization can be implemented by the same procedure as Gaussian elimination. Moreover, it also records the elements of a lower triangular matrix L̃. Because the diagonal elements of L̃ are defined to be 1, we do not need to store them. Therefore, a single filled matrix F is sufficient to store both L̃ and Ũ.

A. Terminology for ILU(k)

For a huge sparse matrix, a standard dense format would be wasteful. Instead, we just store the position and the value of non-zero elements. Similarly, incomplete LU factorization does not insert all elements that are generated in the process of factorization. Instead, it employs some mechanisms to control how many elements are stored. ILU(k) [4] uses the level limit k as the parameter to implement a more flexible mechanism. Here we review some definitions.

Fig. 1. Fill-in f_{ij} with its causative entries f_{ih} and f_{hj}.

Definition 3.1: Filled Matrix: We call the matrix in memory the filled matrix, which is composed of all non-zero elements of F.

Definition 3.2: A fill entry, or entry for short, is an element stored in memory. (Elements that are not stored are called zero elements.)

Definition 3.3: Fill-in: Consider Figure 1. If there exists h such that i, j > h and both f_{ih} and f_{hj} are entries, then the originally zero element f_{ij} may become an entry, because the value of f_{ij} is non-zero after factorization. This element f_{ij} is called a fill-in, that is, a candidate entry. We say the fill-in f_{ij} is caused by the existence of the two entries f_{ih} and f_{hj}. The entries f_{ih} and f_{hj} are the causative entries of f_{ij}.

Definition 3.4: Level: The level associated with an entry f_{ij} is level(i,j), defined as

    level(i,j) = min_{1 ≤ h < min(i,j)} ( level(i,h) + level(h,j) + 1 ),

with the entries of the original matrix A assigned level zero.
The level limit k is used to control what kinds of fill-ins should be inserted into the filled matrix during the factorization. Only those fill-ins with a level smaller than or equal to k are inserted into the filled matrix F. Other fill-ins are ignored. This allows ILU(k) to maintain a sparse filled matrix for very small values of k.

B. ILU(k) and Its Parallelism

For LU factorization, the defining equation A = LU is expanded into

    a_{ij} = Σ_{h=1}^{min(i,j)} l_{ih} u_{hj}.

This yields the defining equations

    f_{ij} = ( a_{ij} - Σ_{h=1}^{j-1} l_{ih} u_{hj} ) / u_{jj},   j < i,

    f_{ij} = ( a_{ij} - Σ_{h=1}^{i-1} l_{ih} u_{hj} ) / l_{ii},   j ≥ i.

The computation for incomplete LU factorization follows a definition similar to the above equations, except that it skips zero elements.
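As a purely illustrative rendering of these defining equations (assuming NumPy; this is not the paper's code), the following dense, row-oriented sketch reduces each row i by every previously finished row h < i, which is exactly the update order that ILU(k) restricts to permitted entries:

    import numpy as np

    def lu_by_rows(A):
        """Row-oriented LU factorization following the defining equations.
        Returns a single filled matrix F whose strict lower part holds L
        (unit diagonal implied) and whose upper part holds U.  ILU(k)
        performs the same updates, but only on entries whose level does
        not exceed k."""
        F = A.astype(float).copy()
        n = F.shape[0]
        for i in range(n):                # reduce row i by all finished rows h < i
            for h in range(i):
                F[i, h] /= F[h, h]                      # l_ih = f_ih / u_hh
                F[i, h+1:] -= F[i, h] * F[h, h+1:]      # f_ij -= l_ih * u_hj,  j > h
        return F

    # sanity check against the defining equation A = L U
    A = np.array([[4., 3., 0.], [6., 3., 1.], [0., 2., 5.]])
    F = lu_by_rows(A)
    L, U = np.tril(F, -1) + np.eye(3), np.triu(F)
    assert np.allclose(L @ U, A)

The single array F plays the role of the filled matrix described above: the strict lower part is L̃ (its unit diagonal is implicit) and the upper part is Ũ.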

In our implementation, the matrix F is initialized to A and stored in row-major order prior to any computation. The computation can be reorganized to use the above equations in the forward direction. As each term l_{ih} u_{hj} for h < j is determined, it can immediately be subtracted from f_{ij}. Just as all the remaining rows can be reduced simultaneously by the first row in Gaussian elimination, a row-major order for ILU(k) factorization leads to a natural parallel algorithm.

Following the defining equations, the ILU(k) algorithm maintains in memory two rows: row h and row i, where h < i. Row h is used to partially reduce row i. For each possible j, the product l_{ih} u_{hj} is used to reduce the entry f_{ij}. Once we have accumulated all products l_{ih} u_{hj} for h < min(i,j), we are done.

ILU(k) is separated into two passes: symbolic factorization (Phase I), which computes the levels and inserts all fill-ins with level less than or equal to the level limit k into the filled matrix; and numeric factorization (Phase II), which computes the values of all entries in the filled matrix. Both passes follow the similar procedure described above.

Algorithm 1 illustrates the symbolic factorization phase. It determines, for each row j, the set of permitted entries, permitted(j). These are the entries for which the computed entry level, or weight, is less than or equal to k.

Algorithm 1 Symbolic factorization: Phase I of ILU(k) preconditioning
 1: // Calculate levels and permitted entry positions
 2: // Loop over rows
 3: for j = 1 to n do
 4:   // Initialization: admit entries in A, and assign them level zero
 5:   permitted(j) ← empty set  // permitted entries in row j
 6:   for t = 1 to n do  // nonzero entries in row j
 7:     if A_{j,t} ≠ 0 then
 8:       level(j,t) ← 0
 9:       insert t into permitted(j)
10:     end if
11:   end for
12: end for
13: // Row-merge update pass (applied to each row j in turn)
14: for each unprocessed i ∈ permitted(j) with i < j, in ascending order do
15:   for t ∈ permitted(i) with t > i do
16:     weight = level(j,i) + level(i,t) + 1
17:     if t ∈ permitted(j) then
18:       // already nonzero in F_{j,t}
19:       level(j,t) ← min{level(j,t), weight}
20:     else
21:       // zero in F_{j,t}
22:       if weight ≤ k then  // level control
23:         insert t into permitted(j)
24:         level(j,t) ← weight
25:       end if
26:     end if
27:   end for
28: end for
29: return permitted

Numeric factorization is simpler, but similar in spirit to the row-merge update pass of Algorithm 1. Lines 14 through 17 control the entries to be updated, and the update of line 19 is replaced by the update of the numeric value. The details are omitted. The computation from lines 15 to 27 in Algorithm 1 is referred to as one transformation. The corresponding part in Phase II is also called a transformation.
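For concreteness, here is a compact Python rendering of Algorithm 1 (an illustrative sketch, not the authors' implementation; the input format A_pattern and the 0-based indexing are assumptions of the sketch):

    def symbolic_ilu_k(A_pattern, k):
        """Sketch of Algorithm 1.  A_pattern[j] is the set of column indices
        of structural nonzeros in row j (0-based).  Returns, for each row j,
        a dict mapping each permitted column t to level(j, t)."""
        n = len(A_pattern)
        # Initialization: admit the entries of A and assign them level zero.
        level = [{t: 0 for t in A_pattern[j]} for j in range(n)]
        for j in range(n):                  # rows are finished top-down
            done = set()
            while True:
                # "each unprocessed i in permitted(j) with i < j, in ascending order"
                todo = [c for c in level[j] if c < j and c not in done]
                if not todo:
                    break
                i = min(todo)
                done.add(i)
                for t, lvl_it in list(level[i].items()):
                    if t <= i:
                        continue
                    weight = level[j][i] + lvl_it + 1
                    if t in level[j]:                 # already an entry in F
                        level[j][t] = min(level[j][t], weight)
                    elif weight <= k:                 # level control: admit the fill-in
                        level[j][t] = weight
        return level

With k = 0 no fill-ins are admitted and the factor pattern coincides with the pattern of A; larger k admits progressively more fill-ins and a denser filled matrix.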
IV. TPILU(k): Task-Oriented Parallel ILU(k) Algorithm

A. Parallel Tasks in ILU(k)

As mentioned in Section III-B, the parallelism comes from the fact that we can do row transformations in parallel, while the sequential algorithm does it top-down and row by row. We introduce the following definition to describe a general model for matrix row reductions, which is valid for Gaussian elimination as well as for ILU(k).

Definition 4.1: The frontier is the maximum number of rows that are currently reduced completely.

According to this definition, the frontier i is also the limit up to which the remaining rows can be partially reduced, except for the first unreduced row. The first unreduced row can be reduced completely. That increases the frontier by one.

In order to overlap communication and computation, the matrix is organized as bands to make the granularity of the computation adjustable. As in Figure 2, each band includes the same number of consecutive rows. To handle the case where the number of bands does not evenly divide the matrix dimension, one can pad some of the bands with one extra row each. The size of a band is the number of rows in this band. A task is associated with a band and is defined as the set of transformations applied in order to partially reduce the band up to the current frontier. For each band, the program must remember up to what column this band has been partially reduced. We call this column the current position, which is the start point of reduction for the next task attached to this band. In addition, it is important to use a variable to remember the first unreduced band. After the first unreduced band is completely reduced, it should be broadcast to all machines (or shared by all threads in the multi-core case), and the frontier should be increased by the size of the band for all machines.
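A small sketch of this band bookkeeping (illustrative Python; the names make_bands, owner_of, frontier and current_pos are not from the paper):

    def make_bands(n_rows, n_bands):
        """Split rows 0 .. n_rows-1 into n_bands bands of consecutive rows;
        the first (n_rows mod n_bands) bands are padded with one extra row."""
        base, extra = divmod(n_rows, n_bands)
        bands, start = [], 0
        for b in range(n_bands):
            size = base + (1 if b < extra else 0)
            bands.append(range(start, start + size))
            start += size
        return bands

    def owner_of(band_id, n_workers):
        """Static round-robin assignment (Section IV-B): band b is always
        reduced by worker b mod w, so only completely reduced bands ever
        need to be broadcast."""
        return band_id % n_workers

    # Per-worker bookkeeping assumed by the task model:
    #   frontier         -- number of rows reduced completely so far
    #   current_pos[b]   -- column up to which band b has been partially reduced;
    #                       the next task for band b reduces it from current_pos[b]
    #                       up to the current frontier
    #   first_unreduced  -- index of the first band that is not yet complete

    bands = make_bands(12, 4)               # four bands of three rows, as in Figure 2
    assert [len(b) for b in bands] == [3, 3, 3, 3]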

Fig. 2. View of a matrix as bands. There are four bands, each consisting of 3 rows. After the first band is reduced completely, all remaining bands can be partially reduced up to the third column in parallel.

B. Communication Overhead Reduction

Only the completely reduced bands are useful for the reduction of other bands; an intermediate result is not truly needed by other workers. If we assign a fixed group of bands to each worker, then it is unnecessary to broadcast an update message to other workers for any intermediate result. We call this strategy static load balancing. It decreases the communication overhead to a minimum by sending an update message only for the completely reduced bands.

Our algorithm assigns bands to workers in a round-robin fashion. The method avoids sending intermediate copies of partially updated bands. This produces a more regular communication pattern that fits well with the pipelined communication of the next section. A further virtue of this strategy is that it uses a fixed number of message buffers and a fixed buffer size. This avoids dynamically allocating memory for message handling. Under the strategy of static load balancing, the computations on all processors are coordinated so as to guarantee that no processor can send two update messages simultaneously. In other words, a processor must finish broadcasting an update message before it reduces another band completely.

C. Pipeline Communication for Efficient Local Network Usage

Although the bands can be reduced simultaneously, their completion follows a strict top-down order. When one band is completely reduced, it is best for the node that holds the first unreduced band to receive the result first. This strategy is implemented by the pipeline model. Following this model, all nodes are organized to form a directed ring. The message is transferred along the directed edges. Every node sends the message to its unique successor until each node has received a copy of the message. After this message is forwarded, each node uses it to update its data in memory.

Fig. 3. Pipeline model. The horizontal axis is the time step; the vertical axis is the sender id. The lines represent when the algorithm sends a message; the time step and the sender id of the source are indicated. The receiver is always the successor of the source. The message is marked by the corresponding band number. Only the first several messages are shown.

Figure 3 shows that this model achieves the aggregate bandwidth. In this figure, the horizontal axis is the time step while the vertical axis is the sender ID (the rank number of each node). Note that at most time steps several nodes participate, each of them either sending a message to its successor or receiving a message from its predecessor. Algorithm 2 illustrates the implementation that overlaps the computation with the communication based on the pipeline model. Both symbolic factorization and numeric factorization can use this model, except for the case k = 1, whose symbolic factorization reduces to trivial parallelism as discussed in the next section.

Algorithm 2 Parallel ILU(k) algorithm with the pipeline model
 1: receive from predecessor  // non-blocking receive
 2: // Loop until all bands are reduced completely
 3: while firstUnreducedBand < numberOfBands do
 4:   get new task (band ID) from the master to work on
 5:   if there was a band to work on then
 6:     doTask(band)  // reduce band using all previous bands
 7:     if band is not reduced completely then
 8:       // not reduced completely, then non-blocking test
 9:       try to receive a message for some band
10:       if a newly reduced band is received then
11:         send band to successor  // non-blocking send
12:         update our copy of the newly reduced band
13:         continue to receive and update until our band is completely reduced
14:       end if
15:     else
16:       send our reduced band to successor  // non-blocking send
17:     end if
18:   else
19:     wait until a new band is available, while in the background continuing to receive other reduced bands from our predecessor, updating our copy, and sending the reduced band to our successor
20:   end if
21: end while
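As a concrete illustration of the ring-forwarding step in Algorithm 2, the sketch below assumes mpi4py and uses blocking calls for brevity; it is not the authors' MPI implementation, which additionally overlaps this relay with reduction work through non-blocking sends and receives.

    from mpi4py import MPI          # assumption: mpi4py is available

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    succ, pred = (rank + 1) % size, (rank - 1) % size

    def relay_reduced_band(band_id, band_rows):
        """Forward one completely reduced band around the directed ring.
        The owner injects the band; every other node receives it from its
        predecessor, forwards it to its successor (unless the successor is
        the owner, which already holds it), and then uses it locally."""
        owner = band_id % size                  # static round-robin ownership
        if rank == owner:
            if size > 1:
                comm.send((band_id, band_rows), dest=succ, tag=band_id)
            return band_rows                    # the owner already has the data
        bid, rows = comm.recv(source=pred, tag=band_id)
        if succ != owner:                       # stop before the message returns home
            comm.send((bid, rows), dest=succ, tag=band_id)
        return rows

Successive bands can be in flight at different positions of the ring at the same time, which is the overlap that Figure 3 depicts.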
D. Optimization for Symbolic Factorization in the Case k = 1

One observation from Figure 1 is that level-1 entries no longer participate in symbolic factorization after they are generated. In Figure 1, if either f_{ih} or f_{hj} is an entry of level 1, the resulting fill-in f_{ij} must be an element of level 2 or level 3, so f_{ij} is not inserted into the filled matrix F. Based on this observation, we claim that for symbolic factorization, each row can be reduced independently, no matter whether the previous rows are reduced or not. This observation not only yields greater parallelism, it also allows the communication overhead of the first pass to be postponed to the second pass. We call this special algorithm TPILU(1). Considering that numeric factorization is floating-point arithmetic intensive, it is reasonable to shift all communication overhead toward the second pass. In the symbolic factorization phase, the TPILU(1) algorithm prefers to reduce each row against the original matrix, because only level-0 elements are needed.
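A sketch of this special case (illustrative Python only, not the production code): with k = 1, every admissible fill-in is caused by two level-0 entries, so each row's pattern can be computed from the original matrix alone, with no ordering constraint among rows.

    def symbolic_factorization_k1(A_pattern):
        """A_pattern[r] is the set of column indices of structural nonzeros
        in row r of A (0-based).  For k = 1, every permitted fill-in f[j][t]
        is caused by two original (level-0) entries f[j][i] and f[i][t], so
        row j depends only on the original pattern and rows may be processed
        in any order, or fully in parallel, with no communication."""
        permitted = []
        for j, cols in enumerate(A_pattern):
            fills = set(cols)                        # level-0 entries of row j
            for i in (c for c in cols if c < j):     # causative entry f[j][i]
                for t in A_pattern[i]:
                    if t > i and t not in cols:
                        fills.add(t)                 # level-1 fill-in: 0 + 0 + 1 <= k
            permitted.append(fills)
        return permitted

Because no row depends on another, the symbolic pass for k = 1 needs no inter-worker communication at all, which is why TPILU(1) can defer all communication to the numeric pass.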

In addition, the algorithm achieves better performance if it skips the newly inserted level-1 elements. This optimization can be implemented by letting all workers reduce the rows bottom-up. Meanwhile, no update is sent to any other processor. Under such circumstances, the band size for symbolic factorization does not influence the performance, because there is no synchronization among processors. However, we must use the same band size for both symbolic factorization and numeric factorization.

E. Other Optimizations

Optimization: clusters with multi-core nodes. A hybrid memory model using multiple computers, each with multiple cores, helps to further improve the performance of the TPILU(k) algorithm. On each node, we start several threads as workers and one particular thread as a communicator that handles messages between nodes. This design leads to better performance than having each thread worker communicate directly with remote threads. The reason is that the MPI_THREAD_MULTIPLE option of MPI can degrade performance.

As described near the end of Section I, the performance benefits from an extension of the malloc library beyond the POSIX standard, to give users access to thread-private malloc arenas. The TPMalloc (thread-private malloc) of [11] is just such an extension, and was used in this work. Note that this performance issue can easily be confused with a general cache issue. The literature contains ample descriptions of the cache benefits of stride access and of avoiding random access [23]. Even though the ping-pong access of a standard malloc library presents as poor cache performance, the cause lies instead in the strict adherence to a POSIX-standard malloc, whereas a thread-private one is needed [11]. Luckily, the thread-private alternative is compatible with our static load balancing strategy, in which each worker is assigned a fixed group of bands.

Optimization: efficient matrix storage. The compressed sparse row (CSR) format is used in the iteration phase due to its efficiency for arithmetic operations. However, the CSR format does not support enlarging the memory space for several rows simultaneously. Therefore, TPILU(k) initializes the matrix F in row-major format during the symbolic factorization phase. After the matrix shape is determined by symbolic factorization, TPILU(k) converts the output matrix F from row-major format to CSR format. The format transformations are implemented in the factorization phase.
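The conversion itself is straightforward. A minimal sketch (assuming NumPy/SciPy; the function name and the per-row input format are illustrative, not from the paper) that packs a grown per-row representation into the three CSR arrays:

    import numpy as np
    import scipy.sparse as sp

    def rows_to_csr(row_cols, row_vals, n_cols):
        """Convert a per-row representation (convenient while rows are still
        growing during factorization) into CSR form for the iteration phase."""
        indptr = np.zeros(len(row_cols) + 1, dtype=np.int64)
        for r, cols in enumerate(row_cols):
            indptr[r + 1] = indptr[r] + len(cols)
        indices = np.concatenate([np.asarray(c, dtype=np.int64) for c in row_cols])
        data = np.concatenate([np.asarray(v, dtype=float) for v in row_vals])
        return sp.csr_matrix((data, indices, indptr), shape=(len(row_cols), n_cols))

    # toy usage: row 0 has entries in columns 0 and 2, row 1 has an entry in column 1
    F = rows_to_csr([[0, 2], [1]], [[4.0, 1.0], [3.0]], n_cols=3)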
F. Incomplete Inverse Method

To parallelize iterative solvers, consider Equations (1), which define the forward substitution in each iteration:

    x_i = b_i - Σ_{h<i} α_{ih} x_h,   i = 1, 2, ..., n.    (1)

Equations (1) are used for solving equations of the form Lx = b. As one can see, Equations (1) must be applied in a strictly sequential order. To parallelize this triangular solver, we would like to transform Equations (1) into Equations (2):

    x_i = Σ_{t<i} (-β_{it}) b_t + b_i,   i = 1, 2, ..., n.    (2)

Combining Equations (1) with Equations (2), we have Equations (3):

    x_i = b_i - Σ_{h<i} α_{ih} x_h
        = b_i - Σ_{h<i} α_{ih} ( Σ_{t<h} (-β_{ht}) b_t + b_h )
        = Σ_{t<i} ( -( α_{it} - Σ_{t<h<i} α_{ih} β_{ht} ) ) b_t + b_i,   i = 1, 2, ..., n.    (3)

This yields the defining Equations (4):

    β_{it} = α_{it} - Σ_{t<h<i} α_{ih} β_{ht}.    (4)

Recall that LU factorization goes forward when reducing by α_{ih}:

    for j > h:  α_{ij} ← α_{ij} - α_{ih} α_{hh}^{-1} α_{hj}.    (5)

We can subtract α_{ih} α_{ht} from α_{it} for each t < h using one additional step as part of reducing by α_{ih}:

    for t < h:  α_{it} ← α_{it} - α_{ih} α_{ht}.    (6)

This additional step achieves a side effect: after row i is factored completely, each α_{it} is exactly β_{it}. The reason is that they satisfy the same defining Equations (4). As the last step, we flip the sign of each α_{it} after all rows have been factored completely:

    for t < i:  α_{it} ← -α_{it}.    (7)

This is exactly what is needed by the computation that Equations (2) define. Equations (2) also demonstrate that the lower triangular matrix (-β_{it})_{t<i} is part of L^{-1}, whose diagonal elements are still 1, just as for L.

For ILU(k) preconditioning, (L̃)^{-1} may become dense even if L̃ is sparse. To maintain sparsity, we compute incomplete inverse matrices L̃^{-1} following the steps described by Pseudocode (5), (6), (7). However, we throw out some elements by enforcing that the computation in the previously mentioned steps is executed only for elements α_{ij} satisfying level(i,j) ≤ k. Then, the factorization represented by Pseudocode (5) becomes ILU(k). Correspondingly, the computation represented by Pseudocode (6), (7) produces L̃^{-1}. The computation of L̃^{-1} can be combined with the original numeric factorization phase (by combining Pseudocode (5) and (6)). One more factorization phase is added to compute Ũ^{-1}, following a bottom-up and right-to-left order. L̃^{-1} and Ũ^{-1} reduce the triangular solve stage to matrix-vector multiplications. Then, iterative solvers are parallelized in a trivial way, since basic vector-vector and matrix-vector operations are embarrassingly parallel. We do not parallelize inner product operations, for two reasons: first, they are already fast even when done sequentially; second, parallelization of inner product operations violates sequential consistency.
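The recurrence of Equations (4)-(7) can be checked numerically in the complete (dense) case. The sketch below (an illustration, not the paper's code) computes β row by row for a unit lower triangular L and verifies that the identity minus the strict lower part of β equals L^{-1}; the incomplete variant would simply skip entries whose level exceeds k.

    import numpy as np

    def inverse_by_row_recurrence(L):
        """Compute B with B[i, t] = beta_it from Equation (4):
        beta_it = alpha_it - sum_{t<h<i} alpha_ih * beta_ht,
        for a unit lower triangular L = (alpha_ij)."""
        n = L.shape[0]
        B = np.zeros_like(L, dtype=float)
        for i in range(n):                      # rows are finished top-down
            for t in range(i):
                B[i, t] = L[i, t] - sum(L[i, h] * B[h, t] for h in range(t + 1, i))
        return B

    L = np.array([[1., 0., 0., 0.],
                  [2., 1., 0., 0.],
                  [0., 3., 1., 0.],
                  [4., 0., 5., 1.]])
    B = inverse_by_row_recurrence(L)
    Linv = np.eye(4) - np.tril(B, -1)           # off-diagonal entries of L^{-1} are -beta_it
    assert np.allclose(Linv @ L, np.eye(4))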
V. Experimental Results

Our first test platform is a cluster with an InfiniBand switch. This cluster includes two nodes, each with a single quad-core AMD Opteron 2378 processor, and one node with four quad-core Intel Xeon E5520 CPUs, for a total of 16 cores. The operating system is Linux and the compiler is gcc. The MPI library is Open MPI 1.4. We also installed Hypre (version 2.6.0b) to compare with our solver, TPILU(k). Following the default setup, we use the -O2 option to compile Hypre. We compile TPILU(k) with the same -O2 option. Three iterative solvers in Hypre use a common interface to work with Euclid: (i) the preconditioned conjugate gradient algorithm (Euclid-PCG); (ii) the preconditioned generalized minimal residual algorithm (Euclid-GMRES); and (iii) the preconditioned stabilized bi-conjugate gradient algorithm (Euclid-BICGSTAB). We use the third solver, BICGSTAB, with the default tolerance rtol = 10^{-8} to test both Euclid and TPILU(k).

A. TPILU(k) (base algorithm): Driven Cavity Problem

In this section, we consider some difficult problems from [24]. They arise from the modeling of the incompressible Navier-Stokes equations. We choose three representatives: (i) e20r3000, with a matrix of dimension 4,241 and 131,556 non-zeros; (ii) e30r3000, with a matrix of dimension 9,661 and 306,356 non-zeros; and (iii) e40r3000, with a matrix of dimension 17,281 and 553,956 non-zeros.

The experimental results on e20r3000 for both PILU (the domain decomposition ILU(k) preconditioner in Euclid) and BJILU (the Block Jacobi ILU(k) preconditioner in Euclid) are listed in Table I. In this table, the first column gives the name of the preconditioner, followed by the level k for the PILU method. We tested PILU using levels k from 1 to 6. Although the level k = 5 achieves the smallest number of iterations, the best solution time is achieved by k = 2, due to its relatively small cost for preconditioning and for a single iteration. As we increase the number of processes, the preconditioning either breaks down or its cost grows greatly. Similar behavior is observed for BJILU. In Table I, the best solution time is achieved by PILU(2) with only one worker, which is marked by an exclamation point in the first column.

For e20r3000, the base algorithm TPILU with level k = 4 leads to better performance than either PILU or BJILU. The results for TPILU(4) are collected in Table II, which consists of three parts corresponding to distributed-memory, hybrid-memory and shared-memory configurations. In the distributed-memory case, we speed up the preconditioning using four worker processes, either on the same node or on different nodes. The implementation for the hybrid-memory model leads to better performance than that for the distributed-memory model, even with the same number of workers. As we can see, this implementation allows us to speed up the preconditioning by using three worker threads per node. The implementation for the shared-memory model achieves almost linear speedup for preconditioning.

TABLE I. PILU and BJILU for e20r3000 on two quad-core nodes. Input matrix dimension 4,241; number of non-zeros 131,556. Time is in seconds. * means that the phase breaks down.

TABLE II. TPILU(4) for e20r3000 on two quad-core nodes. (Output matrix has 835,161 non-zero entries. FPAs for preconditioning: 71,297,168.)

The experimental results for PILU and BJILU on e30r3000 are listed in Table III. From this table, one can see that BJILU with 2 workers achieves the best solution time among the Euclid preconditioners. The results for TPILU(3) are collected in Table IV. This table shows that the implementation based on hybrid memory with 8 CPU cores on two nodes ties with the implementation based on shared memory with 4 CPU cores on a single node. Comparing Table III with Table IV, one sees that TPILU(k) wins again. TPILU(k) finishes in 1.04 s (Table IV), as compared to 1.47 s for BJILU and 1.64 s for PILU (Table III). As before, an exclamation point marks the best configuration for each method.

TABLE III. PILU and BJILU for e30r3000 on two quad-core nodes. Input matrix dimension 9,661; number of non-zero entries 306,356. * means that the phase breaks down.

TPILU(k) continues to win in the case of e40r3000. TPILU(k) with k = 3 finishes in 1.93 s (Table VI), as compared to 3.15 s for PILU and 3.52 s for BJILU (Table V). Surprisingly, PILU and BJILU register their best performance when executing sequentially, and not in parallel. From Table VI, we find that the implementation for the hybrid model with 8 CPU cores on two nodes is better than the implementation for the shared-memory model with 4 CPU cores on a single node. This demonstrates TPILU(k)'s strong potential for greater performance when given additional cores.

TABLE IV. TPILU(3) for e30r3000 on two quad-core nodes. (Output matrix has 1,674,369 non-zero entries. FPAs for preconditioning: 120,886,822.)

TABLE V. Euclid (PILU and BJILU) for e40r3000 on two quad-core nodes. Input matrix dimension 17,281; number of non-zeros 553,956. * means that the phase breaks down.

TABLE VI. TPILU(3) for e40r3000 on two quad-core nodes. (Output matrix has 3,070,709 non-zero entries. FPAs for preconditioning: 224,029,062.)

B. Incomplete Inverse: Matrices from 3D 27-point Central Differencing

We tested PILU (the domain decomposition ILU(k) preconditioner in Euclid) on a linear system generated by 3D 27-point central differencing for Poisson's equation. Using both k = 0 and k = 1, we measure the preconditioning time, the number of iterations and the total time for five configurations: 1 process; 2 processes on one node; 2 processes on two nodes; 4 processes on one node; and 4 processes on two nodes. The purpose is to find the best configuration on the two nodes with quad-core AMD CPUs. Experimental results are listed in Table VII. (Note that PILU uses processes, instead of threads, for each subdomain.)

TABLE VII. PILU for the 3D 27-point stencil using two quad-core nodes (k = 0).

From Table VII, we can see that the preconditioning time increases greatly when we increase the level from 0 to 1. In addition, it grows rapidly and soon dominates as the number of processes increases. The results in this table also demonstrate that the parallelism from matrix reordering influences the number of iterations, even though the level k is kept the same. Under such circumstances, the sequential algorithm wins over PILU in the contest for the best solution time. This is not an accident. We also test the linear systems coming from 3D 27-point stencil modelling on a range of larger grid sizes. The sequential algorithm still achieves the best solution time, as seen from the results listed in Table VIII. In Table VIII, we also list the number of floating point arithmetic (FPA) operations and the number of non-zeros in the output matrix.

FPA operations are measured by inserting a counter in our program. From this table, we can see that the number of FPAs is relatively small. This implies that FPAs do not dominate the computation. Hence, the cost of ILU(k) preconditioning lies primarily in the row handling, which consists of scanning each element in a row to merge the result from a previously completely factored row that corresponds to the current element. Therefore, the cost is determined by the number of rows and the total number of non-zero elements (non-zero entries) in the output matrix.

TABLE VIII. Optimal solution time with PILU for 3D 27-point central differencing on larger grid sizes using two quad-core nodes. (k = 0. Time is in seconds. Grid size n^3 means n x n x n.)

For all these test cases, TPIILU (the incomplete inverse submethod of TPILU(k)) decreases the preconditioning time and the time per iteration in the solution phase, while maintaining the same number of iterations. This leads to better performance, as seen in Table IX.

TABLE IX. TPIILU with k = 0 for 3D 27-point central differencing on larger grid sizes using one quad-core node. (The grid size for each row is the same as in Table VIII.)

C. Incomplete Inverse: Cage Model for DNA Electrophoresis

In this section, we study linear systems for the cage model of DNA electrophoresis [25]. The model describes the drift, induced by a constant electric field, of homogeneously charged polymers through a gel. We test on the two largest cases: (i) cage14, with a matrix of dimension 1,505,785 and 27,130,349 non-zeros; and (ii) cage15, with a matrix of dimension 5,154,859 and 99,199,551 non-zeros. To solve these systems, ILU(0) is sufficient, requiring only a few iterations to converge. For the first problem, we use one of the quad-core nodes and obtain a speedup of 2.17 using 4 threads, as seen in Table X. For the second problem, we use the 16-core node and obtain a speedup of 2.93 using 8 threads. The results are collected in Table XI.

TABLE X. TPILU(k) for cage14 using one quad-core node. (Matrix dimension is 1,505,785.)

TABLE XI. TPILU(k) for cage15 using one 16-core node. (Matrix dimension is 5,154,859.)

According to our statistics, the total number of floating point arithmetic operations is 128,136,984 for cage14 preconditioning and 476,940,712 for cage15 preconditioning. The ratio of the number of FPAs to the number of non-zero entries is less than 5 for both cases. This implies that ILU(k) preconditioning essentially just passes through these matrices, with few FPAs. We speculate that no ILU(k)-based iterative solver could compete with this incomplete inverse preconditioning approach.

D. Incomplete Inverse for ns3Da: 3D Navier-Stokes

The linear system ns3Da [26] is a computational fluid dynamics problem used as a test case in FEMLAB, which is developed by Comsol, Inc. Because there are zero diagonal elements in the matrix, we use TPILU(1) as the preconditioner for the iterative solver. For this case, the preconditioning is FPA intensive. As seen in Table XII, TPILU(1) achieves a speedup of 8.93 using 16 threads. We also test on a cluster consisting of 33 nodes.
Each node has 4 CPU cores (dual-processor, dual-core): 2.0 GHz Intel Xeon EM64T processors, with either 8 GB or 16 GB of memory per node. The nodes are connected by a Gigabit Ethernet network.

TABLE XII. TPILU(k) for ns3Da on one 16-core node. (Matrix dimension is 20,414.)

The nodes are configured with Linux 2.6.9, gcc, and MPICH2 1.0 as the MPI library. TPILU(k) obtains a speedup of 7.22 times (Table XIII) for distributed memory and 7.82 times (Table XIV) for hybrid memory.

TABLE XIII. TPILU(k) for ns3Da with 16 nodes on a cluster.

TABLE XIV. TPILU(k) for ns3Da with 8 nodes on a cluster, with 1 dedicated communication thread and 3 worker threads per process, and one process per node.

E. TPMalloc Performance

To verify whether TPMalloc (thread-private malloc) improves the performance of symbolic factorization, we compare the results from two cases: the glibc standard malloc and TPMalloc. The factorization time for e40r3000 on the 16-core machine is listed in Table XV. The first column of this table is the number of threads involved. We use 1 to 8 threads and measure the symbolic factorization time and the numeric factorization time. From Table XV, we can see that the time for numeric factorization is not greatly influenced by whether TPMalloc or the standard malloc is used. However, symbolic factorization scales much better with TPMalloc. This is critical, since the symbolic factorization time dominates.

TABLE XV. TPILU(3) for e40r3000 on one 16-core node: symbolic and numeric factorization time with the standard malloc and with thread-private malloc.

Fig. 4. Three-dimensional comparison of BJILU, PILU and TPILU(k). The x axis is difficulty, the y axis is speedup, and the third axis, z, is density. Dotted lines indicate that a line extends outside of the x-y plane. Legend: ABC is BJILU; ADE is PILU; FGIH is TPIILU (incomplete inverse); HIJ is TPILU (base algorithm).

F. Conclusion

The experimental results can be graphically summarized by Figure 4. While there are regions in which each of BJILU, PILU and TPILU(k) is the fastest, TPILU(k) remains competitive in most cases, and it remains stable in all cases.

References

[1] I. Gustafsson, "A class of first order factorization methods," BIT Numerical Mathematics, vol. 18.
[2] J. W. Watts III, "A conjugate gradient-truncated direct method for the iterative solution of the reservoir simulation pressure equation," SPE Journal, vol. 21.
[3] Y. Saad, Iterative Methods for Sparse Linear Systems, 1st ed., PWS.
[4] D. Hysom and A. Pothen, "Level-based incomplete LU factorization: Graph model and algorithms," Lawrence Livermore National Labs, Tech. Rep. UCRL-JC.
[5] D. Hysom and A. Pothen, "Efficient parallel computation of ILU(k) preconditioners," in Supercomputing '99.
[6] D. Hysom and A. Pothen, "A scalable parallel algorithm for incomplete factor preconditioning," SIAM J. Sci. Comput., vol. 22.
[7] hypre: High Performance Preconditioners, User's Manual, version 2.6.0b. [Online].
[8] R. D. Falgout and U. M. Yang, "hypre: A Library of High Performance Preconditioners," in ICCS '02.
[9] E. Anderson, "Parallel implementation of preconditioned conjugate gradient methods for solving sparse systems of linear equations."


1 2 (3 + x 3) x 2 = 1 3 (3 + x 1 2x 3 ) 1. 3 ( 1 x 2) (3 + x(0) 3 ) = 1 2 (3 + 0) = 3. 2 (3 + x(0) 1 2x (0) ( ) = 1 ( 1 x(0) 2 ) = 1 3 ) = 1 3 6 Iterative Solvers Lab Objective: Many real-world problems of the form Ax = b have tens of thousands of parameters Solving such systems with Gaussian elimination or matrix factorizations could require

More information

Parallel Threshold-based ILU Factorization

Parallel Threshold-based ILU Factorization A short version of this paper appears in Supercomputing 997 Parallel Threshold-based ILU Factorization George Karypis and Vipin Kumar University of Minnesota, Department of Computer Science / Army HPC

More information

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11 Preface xvii Acknowledgments xix CHAPTER 1 Introduction to Parallel Computing 1 1.1 Motivating Parallelism 2 1.1.1 The Computational Power Argument from Transistors to FLOPS 2 1.1.2 The Memory/Disk Speed

More information

Efficient Multi-GPU CUDA Linear Solvers for OpenFOAM

Efficient Multi-GPU CUDA Linear Solvers for OpenFOAM Efficient Multi-GPU CUDA Linear Solvers for OpenFOAM Alexander Monakov, amonakov@ispras.ru Institute for System Programming of Russian Academy of Sciences March 20, 2013 1 / 17 Problem Statement In OpenFOAM,

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters 1 On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters N. P. Karunadasa & D. N. Ranasinghe University of Colombo School of Computing, Sri Lanka nishantha@opensource.lk, dnr@ucsc.cmb.ac.lk

More information

PARALLELIZATION OF POTENTIAL FLOW SOLVER USING PC CLUSTERS

PARALLELIZATION OF POTENTIAL FLOW SOLVER USING PC CLUSTERS Proceedings of FEDSM 2000: ASME Fluids Engineering Division Summer Meeting June 11-15,2000, Boston, MA FEDSM2000-11223 PARALLELIZATION OF POTENTIAL FLOW SOLVER USING PC CLUSTERS Prof. Blair.J.Perot Manjunatha.N.

More information

Figure 6.1: Truss topology optimization diagram.

Figure 6.1: Truss topology optimization diagram. 6 Implementation 6.1 Outline This chapter shows the implementation details to optimize the truss, obtained in the ground structure approach, according to the formulation presented in previous chapters.

More information

Performance of Implicit Solver Strategies on GPUs

Performance of Implicit Solver Strategies on GPUs 9. LS-DYNA Forum, Bamberg 2010 IT / Performance Performance of Implicit Solver Strategies on GPUs Prof. Dr. Uli Göhner DYNAmore GmbH Stuttgart, Germany Abstract: The increasing power of GPUs can be used

More information

Parallelization Strategy

Parallelization Strategy COSC 335 Software Design Parallel Design Patterns (II) Spring 2008 Parallelization Strategy Finding Concurrency Structure the problem to expose exploitable concurrency Algorithm Structure Supporting Structure

More information

Distributed NVAMG. Design and Implementation of a Scalable Algebraic Multigrid Framework for a Cluster of GPUs

Distributed NVAMG. Design and Implementation of a Scalable Algebraic Multigrid Framework for a Cluster of GPUs Distributed NVAMG Design and Implementation of a Scalable Algebraic Multigrid Framework for a Cluster of GPUs Istvan Reguly (istvan.reguly at oerc.ox.ac.uk) Oxford e-research Centre NVIDIA Summer Internship

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms Chapter 3 Dense Linear Systems Section 3.3 Triangular Linear Systems Michael T. Heath and Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign

More information

Parallel Sparse LU Factorization on Different Message Passing Platforms

Parallel Sparse LU Factorization on Different Message Passing Platforms Parallel Sparse LU Factorization on Different Message Passing Platforms Kai Shen Department of Computer Science, University of Rochester Rochester, NY 1467, USA Abstract Several message passing-based parallel

More information

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about

More information

Sparse Linear Systems

Sparse Linear Systems 1 Sparse Linear Systems Rob H. Bisseling Mathematical Institute, Utrecht University Course Introduction Scientific Computing February 22, 2018 2 Outline Iterative solution methods 3 A perfect bipartite

More information

A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE

A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE Abu Asaduzzaman, Deepthi Gummadi, and Chok M. Yip Department of Electrical Engineering and Computer Science Wichita State University Wichita, Kansas, USA

More information

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou Study and implementation of computational methods for Differential Equations in heterogeneous systems Asimina Vouronikoy - Eleni Zisiou Outline Introduction Review of related work Cyclic Reduction Algorithm

More information

Challenges of Scaling Algebraic Multigrid Across Modern Multicore Architectures. Allison H. Baker, Todd Gamblin, Martin Schulz, and Ulrike Meier Yang

Challenges of Scaling Algebraic Multigrid Across Modern Multicore Architectures. Allison H. Baker, Todd Gamblin, Martin Schulz, and Ulrike Meier Yang Challenges of Scaling Algebraic Multigrid Across Modern Multicore Architectures. Allison H. Baker, Todd Gamblin, Martin Schulz, and Ulrike Meier Yang Multigrid Solvers Method of solving linear equation

More information

Two-Phase flows on massively parallel multi-gpu clusters

Two-Phase flows on massively parallel multi-gpu clusters Two-Phase flows on massively parallel multi-gpu clusters Peter Zaspel Michael Griebel Institute for Numerical Simulation Rheinische Friedrich-Wilhelms-Universität Bonn Workshop Programming of Heterogeneous

More information

Lecture 11: Randomized Least-squares Approximation in Practice. 11 Randomized Least-squares Approximation in Practice

Lecture 11: Randomized Least-squares Approximation in Practice. 11 Randomized Least-squares Approximation in Practice Stat60/CS94: Randomized Algorithms for Matrices and Data Lecture 11-10/09/013 Lecture 11: Randomized Least-squares Approximation in Practice Lecturer: Michael Mahoney Scribe: Michael Mahoney Warning: these

More information

Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation

Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation 1 Cheng-Han Du* I-Hsin Chung** Weichung Wang* * I n s t i t u t e o f A p p l i e d M

More information

Evaluation of sparse LU factorization and triangular solution on multicore architectures. X. Sherry Li

Evaluation of sparse LU factorization and triangular solution on multicore architectures. X. Sherry Li Evaluation of sparse LU factorization and triangular solution on multicore architectures X. Sherry Li Lawrence Berkeley National Laboratory ParLab, April 29, 28 Acknowledgement: John Shalf, LBNL Rich Vuduc,

More information

Accelerating Double Precision FEM Simulations with GPUs

Accelerating Double Precision FEM Simulations with GPUs Accelerating Double Precision FEM Simulations with GPUs Dominik Göddeke 1 3 Robert Strzodka 2 Stefan Turek 1 dominik.goeddeke@math.uni-dortmund.de 1 Mathematics III: Applied Mathematics and Numerics, University

More information

Linear systems of equations

Linear systems of equations Linear systems of equations Michael Quinn Parallel Programming with MPI and OpenMP material do autor Terminology Back substitution Gaussian elimination Outline Problem System of linear equations Solve

More information

1.2 Numerical Solutions of Flow Problems

1.2 Numerical Solutions of Flow Problems 1.2 Numerical Solutions of Flow Problems DIFFERENTIAL EQUATIONS OF MOTION FOR A SIMPLIFIED FLOW PROBLEM Continuity equation for incompressible flow: 0 Momentum (Navier-Stokes) equations for a Newtonian

More information

OpenFOAM + GPGPU. İbrahim Özküçük

OpenFOAM + GPGPU. İbrahim Özküçük OpenFOAM + GPGPU İbrahim Özküçük Outline GPGPU vs CPU GPGPU plugins for OpenFOAM Overview of Discretization CUDA for FOAM Link (cufflink) Cusp & Thrust Libraries How Cufflink Works Performance data of

More information

Sparse Matrices Reordering using Evolutionary Algorithms: A Seeded Approach

Sparse Matrices Reordering using Evolutionary Algorithms: A Seeded Approach 1 Sparse Matrices Reordering using Evolutionary Algorithms: A Seeded Approach David Greiner, Gustavo Montero, Gabriel Winter Institute of Intelligent Systems and Numerical Applications in Engineering (IUSIANI)

More information

Issues In Implementing The Primal-Dual Method for SDP. Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM

Issues In Implementing The Primal-Dual Method for SDP. Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM Issues In Implementing The Primal-Dual Method for SDP Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM 87801 borchers@nmt.edu Outline 1. Cache and shared memory parallel computing concepts.

More information

Iterative Algorithms I: Elementary Iterative Methods and the Conjugate Gradient Algorithms

Iterative Algorithms I: Elementary Iterative Methods and the Conjugate Gradient Algorithms Iterative Algorithms I: Elementary Iterative Methods and the Conjugate Gradient Algorithms By:- Nitin Kamra Indian Institute of Technology, Delhi Advisor:- Prof. Ulrich Reude 1. Introduction to Linear

More information

Scalability of Heterogeneous Computing

Scalability of Heterogeneous Computing Scalability of Heterogeneous Computing Xian-He Sun, Yong Chen, Ming u Department of Computer Science Illinois Institute of Technology {sun, chenyon1, wuming}@iit.edu Abstract Scalability is a key factor

More information

Workshop on Efficient Solvers in Biomedical Applications, Graz, July 2-5, 2012

Workshop on Efficient Solvers in Biomedical Applications, Graz, July 2-5, 2012 Workshop on Efficient Solvers in Biomedical Applications, Graz, July 2-5, 2012 This work was performed under the auspices of the U.S. Department of Energy by under contract DE-AC52-07NA27344. Lawrence

More information

Parallel ILU Ordering and Convergence Relationships: Numerical Experiments

Parallel ILU Ordering and Convergence Relationships: Numerical Experiments NASA/CR-00-2119 ICASE Report No. 00-24 Parallel ILU Ordering and Convergence Relationships: Numerical Experiments David Hysom and Alex Pothen Old Dominion University, Norfolk, Virginia Institute for Computer

More information

Andrew V. Knyazev and Merico E. Argentati (speaker)

Andrew V. Knyazev and Merico E. Argentati (speaker) 1 Andrew V. Knyazev and Merico E. Argentati (speaker) Department of Mathematics and Center for Computational Mathematics University of Colorado at Denver 2 Acknowledgement Supported by Lawrence Livermore

More information

Dense Matrix Algorithms

Dense Matrix Algorithms Dense Matrix Algorithms Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text Introduction to Parallel Computing, Addison Wesley, 2003. Topic Overview Matrix-Vector Multiplication

More information

The Immersed Interface Method

The Immersed Interface Method The Immersed Interface Method Numerical Solutions of PDEs Involving Interfaces and Irregular Domains Zhiiin Li Kazufumi Ito North Carolina State University Raleigh, North Carolina Society for Industrial

More information

Achieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation

Achieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation Achieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation Michael Lange 1 Gerard Gorman 1 Michele Weiland 2 Lawrence Mitchell 2 Xiaohu Guo 3 James Southern 4 1 AMCG, Imperial College

More information

Sparse LU Factorization for Parallel Circuit Simulation on GPUs

Sparse LU Factorization for Parallel Circuit Simulation on GPUs Department of Electronic Engineering, Tsinghua University Sparse LU Factorization for Parallel Circuit Simulation on GPUs Ling Ren, Xiaoming Chen, Yu Wang, Chenxi Zhang, Huazhong Yang Nano-scale Integrated

More information

Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures

Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures Rolf Rabenseifner rabenseifner@hlrs.de Gerhard Wellein gerhard.wellein@rrze.uni-erlangen.de University of Stuttgart

More information

The 3D DSC in Fluid Simulation

The 3D DSC in Fluid Simulation The 3D DSC in Fluid Simulation Marek K. Misztal Informatics and Mathematical Modelling, Technical University of Denmark mkm@imm.dtu.dk DSC 2011 Workshop Kgs. Lyngby, 26th August 2011 Governing Equations

More information

Introduction to parallel Computing

Introduction to parallel Computing Introduction to parallel Computing VI-SEEM Training Paschalis Paschalis Korosoglou Korosoglou (pkoro@.gr) (pkoro@.gr) Outline Serial vs Parallel programming Hardware trends Why HPC matters HPC Concepts

More information

Chapter 4. Matrix and Vector Operations

Chapter 4. Matrix and Vector Operations 1 Scope of the Chapter Chapter 4 This chapter provides procedures for matrix and vector operations. This chapter (and Chapters 5 and 6) can handle general matrices, matrices with special structure and

More information

Performance Comparison between Blocking and Non-Blocking Communications for a Three-Dimensional Poisson Problem

Performance Comparison between Blocking and Non-Blocking Communications for a Three-Dimensional Poisson Problem Performance Comparison between Blocking and Non-Blocking Communications for a Three-Dimensional Poisson Problem Guan Wang and Matthias K. Gobbert Department of Mathematics and Statistics, University of

More information

IMPLEMENTATION OF THE. Alexander J. Yee University of Illinois Urbana-Champaign

IMPLEMENTATION OF THE. Alexander J. Yee University of Illinois Urbana-Champaign SINGLE-TRANSPOSE IMPLEMENTATION OF THE OUT-OF-ORDER 3D-FFT Alexander J. Yee University of Illinois Urbana-Champaign The Problem FFTs are extremely memory-intensive. Completely bound by memory access. Memory

More information

Matrix Multiplication

Matrix Multiplication Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2013 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2013 1 / 32 Outline 1 Matrix operations Importance Dense and sparse

More information

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures*

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Tharso Ferreira 1, Antonio Espinosa 1, Juan Carlos Moure 2 and Porfidio Hernández 2 Computer Architecture and Operating

More information

Lecture 13: March 25

Lecture 13: March 25 CISC 879 Software Support for Multicore Architectures Spring 2007 Lecture 13: March 25 Lecturer: John Cavazos Scribe: Ying Yu 13.1. Bryan Youse-Optimization of Sparse Matrix-Vector Multiplication on Emerging

More information

Application of GPU-Based Computing to Large Scale Finite Element Analysis of Three-Dimensional Structures

Application of GPU-Based Computing to Large Scale Finite Element Analysis of Three-Dimensional Structures Paper 6 Civil-Comp Press, 2012 Proceedings of the Eighth International Conference on Engineering Computational Technology, B.H.V. Topping, (Editor), Civil-Comp Press, Stirlingshire, Scotland Application

More information

A MATLAB Interface to the GPU

A MATLAB Interface to the GPU Introduction Results, conclusions and further work References Department of Informatics Faculty of Mathematics and Natural Sciences University of Oslo June 2007 Introduction Results, conclusions and further

More information

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli High performance 2D Discrete Fourier Transform on Heterogeneous Platforms Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli Motivation Fourier Transform widely used in Physics, Astronomy, Engineering

More information

CS6015 / LARP ACK : Linear Algebra and Its Applications - Gilbert Strang

CS6015 / LARP ACK : Linear Algebra and Its Applications - Gilbert Strang Solving and CS6015 / LARP 2018 ACK : Linear Algebra and Its Applications - Gilbert Strang Introduction Chapter 1 concentrated on square invertible matrices. There was one solution to Ax = b and it was

More information

Lecture 17: More Fun With Sparse Matrices

Lecture 17: More Fun With Sparse Matrices Lecture 17: More Fun With Sparse Matrices David Bindel 26 Oct 2011 Logistics Thanks for info on final project ideas. HW 2 due Monday! Life lessons from HW 2? Where an error occurs may not be where you

More information

Structure-Adaptive Parallel Solution of Sparse Triangular Linear Systems

Structure-Adaptive Parallel Solution of Sparse Triangular Linear Systems Structure-Adaptive Parallel Solution of Sparse Triangular Linear Systems Ehsan Totoni, Michael T. Heath, and Laxmikant V. Kale Department of Computer Science, University of Illinois at Urbana-Champaign

More information

First Experiences with Intel Cluster OpenMP

First Experiences with Intel Cluster OpenMP First Experiences with Intel Christian Terboven, Dieter an Mey, Dirk Schmidl, Marcus Wagner surname@rz.rwth aachen.de Center for Computing and Communication RWTH Aachen University, Germany IWOMP 2008 May

More information

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION 6.1 INTRODUCTION Fuzzy logic based computational techniques are becoming increasingly important in the medical image analysis arena. The significant

More information

Contents. I The Basic Framework for Stationary Problems 1

Contents. I The Basic Framework for Stationary Problems 1 page v Preface xiii I The Basic Framework for Stationary Problems 1 1 Some model PDEs 3 1.1 Laplace s equation; elliptic BVPs... 3 1.1.1 Physical experiments modeled by Laplace s equation... 5 1.2 Other

More information

CSCE 5160 Parallel Processing. CSCE 5160 Parallel Processing

CSCE 5160 Parallel Processing. CSCE 5160 Parallel Processing HW #9 10., 10.3, 10.7 Due April 17 { } Review Completing Graph Algorithms Maximal Independent Set Johnson s shortest path algorithm using adjacency lists Q= V; for all v in Q l[v] = infinity; l[s] = 0;

More information

MVAPICH2 vs. OpenMPI for a Clustering Algorithm

MVAPICH2 vs. OpenMPI for a Clustering Algorithm MVAPICH2 vs. OpenMPI for a Clustering Algorithm Robin V. Blasberg and Matthias K. Gobbert Naval Research Laboratory, Washington, D.C. Department of Mathematics and Statistics, University of Maryland, Baltimore

More information

MEMORY EFFICIENT WDR (WAVELET DIFFERENCE REDUCTION) using INVERSE OF ECHELON FORM by EQUATION SOLVING

MEMORY EFFICIENT WDR (WAVELET DIFFERENCE REDUCTION) using INVERSE OF ECHELON FORM by EQUATION SOLVING Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC Vol. 3 Issue. 7 July 2014 pg.512

More information

Chapter 14: Matrix Iterative Methods

Chapter 14: Matrix Iterative Methods Chapter 14: Matrix Iterative Methods 14.1INTRODUCTION AND OBJECTIVES This chapter discusses how to solve linear systems of equations using iterative methods and it may be skipped on a first reading of

More information

Techniques for Optimizing FEM/MoM Codes

Techniques for Optimizing FEM/MoM Codes Techniques for Optimizing FEM/MoM Codes Y. Ji, T. H. Hubing, and H. Wang Electromagnetic Compatibility Laboratory Department of Electrical & Computer Engineering University of Missouri-Rolla Rolla, MO

More information

Q. Wang National Key Laboratory of Antenna and Microwave Technology Xidian University No. 2 South Taiba Road, Xi an, Shaanxi , P. R.

Q. Wang National Key Laboratory of Antenna and Microwave Technology Xidian University No. 2 South Taiba Road, Xi an, Shaanxi , P. R. Progress In Electromagnetics Research Letters, Vol. 9, 29 38, 2009 AN IMPROVED ALGORITHM FOR MATRIX BANDWIDTH AND PROFILE REDUCTION IN FINITE ELEMENT ANALYSIS Q. Wang National Key Laboratory of Antenna

More information

Block Distributed Schur Complement Preconditioners for CFD Computations on Many-Core Systems

Block Distributed Schur Complement Preconditioners for CFD Computations on Many-Core Systems Block Distributed Schur Complement Preconditioners for CFD Computations on Many-Core Systems Dr.-Ing. Achim Basermann, Melven Zöllner** German Aerospace Center (DLR) Simulation- and Software Technology

More information

Adaptive Matrix Transpose Algorithms for Distributed Multicore Processors

Adaptive Matrix Transpose Algorithms for Distributed Multicore Processors Adaptive Matrix Transpose Algorithms for Distributed Multicore ors John C. Bowman and Malcolm Roberts Abstract An adaptive parallel matrix transpose algorithm optimized for distributed multicore architectures

More information

HPC Algorithms and Applications

HPC Algorithms and Applications HPC Algorithms and Applications Dwarf #5 Structured Grids Michael Bader Winter 2012/2013 Dwarf #5 Structured Grids, Winter 2012/2013 1 Dwarf #5 Structured Grids 1. dense linear algebra 2. sparse linear

More information

High Performance Computing: Tools and Applications

High Performance Computing: Tools and Applications High Performance Computing: Tools and Applications Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology Lecture 15 Numerically solve a 2D boundary value problem Example:

More information

Matrix Multiplication

Matrix Multiplication Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2018 1 / 32 Outline 1 Matrix operations Importance Dense and sparse

More information

(Sparse) Linear Solvers

(Sparse) Linear Solvers (Sparse) Linear Solvers Ax = B Why? Many geometry processing applications boil down to: solve one or more linear systems Parameterization Editing Reconstruction Fairing Morphing 2 Don t you just invert

More information

Computational issues in linear programming

Computational issues in linear programming Computational issues in linear programming Julian Hall School of Mathematics University of Edinburgh 15th May 2007 Computational issues in linear programming Overview Introduction to linear programming

More information

Lecture 12 (Last): Parallel Algorithms for Solving a System of Linear Equations. Reference: Introduction to Parallel Computing Chapter 8.

Lecture 12 (Last): Parallel Algorithms for Solving a System of Linear Equations. Reference: Introduction to Parallel Computing Chapter 8. CZ4102 High Performance Computing Lecture 12 (Last): Parallel Algorithms for Solving a System of Linear Equations - Dr Tay Seng Chuan Reference: Introduction to Parallel Computing Chapter 8. 1 Topic Overview

More information