Communication-Avoiding Parallel Algorithms for Solving Triangular Matrix Equations


Research Collection Bachelor Thesis

Communication-Avoiding Parallel Algorithms for Solving Triangular Matrix Equations

Author(s): Wicky, Tobias
Publication Date: 2015
Permanent Link: https://doi.org/ /ethz-a
Rights / License: In Copyright - Non-Commercial Use Permitted

This page was generated automatically upon download from the ETH Zurich Research Collection. For more information please consult the Terms of use.

ETH Library

Communication-Avoiding Parallel Algorithms for Solving Triangular Matrix Equations

Bachelor Thesis
Tobias Wicky
Friday 30th October, 2015
Advisors: Prof. Dr. T. Hoefler, Dr. E. Solomonik
Department of Mathematics, ETH Zürich


Abstract

In this work an algorithm for solving triangular systems of equations for multiple right-hand sides is presented. Solving triangular systems for multiple right-hand sides, commonly referred to as the TRSM problem, is a very important problem in dense linear algebra, as it is a subroutine for most matrix decompositions such as LU or QR. To improve performance over the standard iterative algorithms for TRSM, a block-wise inversion paired with triangular matrix multiplications is used. To perform the inversion, the lower-triangular form of the matrix is exploited and a recursive scheme is applied to further decrease communication cost. With that, the latency of the algorithm decreases while the bandwidth and floating-point operation counts stay asymptotically the same. Concretely, a decrease in latency by a factor of p^(2/3)/log p was achieved for a significant range of relative matrix sizes when working with p processors. The proposed method is implemented and its performance is benchmarked against the widely used ScaLAPACK [1] library. The results show promising tendencies for the inversion, with a maximal speedup of 1.7 over ScaLAPACK for 4096 processors. Due to the inferior performance of the triangular matrix multiplication with respect to the triangular solve, no overall improvement is made yet.


Contents

1 Introduction
2 Previous Work
  2.1 Matrix Multiplication
  2.2 Triangular matrix solve for single right-hand sides
  2.3 Triangular matrix solve for multiple right-hand sides
3 Communication in TRSM
  3.1 Execution Time Model
  3.2 Recursive TRSM Algorithm
      Choice of base-case size
4 Triangular Matrix Inversion
5 TRSM with Inversion
  5.1 TRSM with full inversion
  5.2 Recursive TRSM with Block Inversion
  5.3 Summary
6 Implementation
7 Experimental Results
8 Further Work
9 Conclusion
Bibliography


Chapter 1

Introduction

The goal of this work is to find a communication-minimizing algorithm for solving triangular matrix equations with multiple right-hand sides.

Motivation: With the decreasing impact of the floating-point operation count on the total execution time of a program, communication costs will turn out to be a more and more important factor. The goal therefore is to find an algorithm that stays asymptotically optimal in terms of flop cost while decreasing the communication cost. The approach taken here is to specifically decrease the asymptotic latency cost while keeping the bandwidth constant. We take this approach because we assume that the asymptotic upper bounds on the bandwidth are very strong for the standard recursive approach as described in [2], since they are equal to the bandwidth costs required for matrix multiplication. TRSM is a crucial problem in many applications, as it is a subroutine for a lot of algorithms in dense linear algebra, for example the LU decomposition described in [3], or the QR decomposition. Another very important application of TRSM is the recursive Cholesky factorization, where it is the base-case algorithm as described in [4]. It is also the critical routine for solving general dense linear systems of equations.

Problem Definition: The linear system of equations L X = B will be solved for X ∈ R^(n×k), where L ∈ R^(n×n) denotes a lower-triangular matrix and B ∈ R^(n×k) a dense matrix. In this work we will only account for communication cost (bandwidth and latency) that arises due to communication between different compute units.

In the theoretical part of this work, three algorithms are presented to solve the problem recursively, where each algorithm takes a different approach: The algorithm referred to as Recursive TRSM splits the initial problem into smaller subproblems, until the problem

is small enough such that each compute unit can solve the given subproblem for some different right-hand sides. This first approach is based on the approach taken in [2]. With a detailed cost analysis we will describe optimal base-case sizes depending on the relative matrix sizes. This approach is explained in detail in Section 3.2. The algorithm referred to as TRSM with full inversion inverts the triangular matrix completely in a recursive fashion and then solves the system by a triangular multiplication of the inverted matrix L⁻¹ with the right-hand side B. This approach is discussed due to the low latency that is required to do inversion and matrix multiplication. For some relative matrix sizes this approach incurs an overall higher bandwidth cost and floating-point operation count at the price of the low latency. It is discussed in detail in Section 5.1. The algorithm referred to as Recursive-Inversion TRSM combines the two approaches, reducing the problem recursively up until a certain point and inverting the small problem to solve it. With this approach we aim at a lower latency due to the use of the inversion, as well as keeping the bandwidth and floating-point operation count low, as we can choose the base-case size as desired. With a cost analysis of this approach, we were able to find a cost-optimal base-case size that decreases the latency and keeps bandwidth as well as flop cost constant compared to the recursive TRSM. With p processors working on the problem, the decrease of latency was obtained for a large range of relative matrix sizes k/n, where a gain of a factor of p^(2/3)/log p was achieved. It is discussed in Section 5.2.

Results: To see how the algorithm performs in practice we show scaling plots of the Recursive-Inversion TRSM compared to the method that ScaLAPACK [1] provides. We were able to see that our approach of inverting the lower-triangular matrix is faster than what ScaLAPACK provides.
Even though this is an important part of the algorithm, we observed worse scaling for the total time to solution for TRSM. This can be explained by the fact that one very slow part of the algorithm was the triangular matrix multiplication, which uses ScaLAPACK's implementation. The plots suggest a good scaling behavior, but they show that the problem size used was very small and that with increasing problem sizes, better results are expected.

Chapter 2

Previous Work

In this chapter, the previous work relevant to the topic is introduced: A cost analysis for matrix multiplication is presented and a standard method for solving the TRSM problem is shown.

2.1 Matrix Multiplication

In this section the relevant results from the CARMA algorithm, including the cost analysis for matrix multiplication presented in [5], are summarized. One of the key parts of recursive algorithms for solving the TRSM problem are matrix-matrix multiplications. In [5], Demmel et al. present communication-optimal algorithms for matrix multiplication with the respective costs. The algorithm we present is based on the matrix multiplication presented in their work, where a threefold case distinction occurs. The fact that these three cases all have different bounds on communication costs imposes the constraint that the algorithm presented here also has to make the same case distinction.

Initial conditions: We consider the matrix multiplication of a dense matrix A ∈ R^(n×n) with another dense matrix B ∈ R^(n×k), executed on p processors that are aligned on a processor grid Π.

Bandwidth: The CARMA algorithm [5], which we refer to as C = MM(A, B, n, k, Π, p), achieves communication bandwidth cost W_MM(n, k, p), which can be subdivided into three cases:

    W_MM(n, k, p) = O(nk/√p)          if n > k√p          (two large dimensions)
                    O((n²k/p)^(2/3))  if k/p ≤ n ≤ k√p    (three large dimensions)
                    O(n²)             if n < k/p          (one large dimension)

In the case of one large dimension, where the right-hand side B is larger than the triangular matrix A, the best way to do a matrix multiplication is to use a one-dimensional layout for the processor grid. In the case of two large dimensions, where the matrix A is much larger than the right-hand side B, the best way of performing a matrix multiplication is to use a two-dimensional layout for the processor grid. And for the case of three large dimensions, where the matrices A and B are approximately of the same size, it is proposed to use a three-dimensional grid layout.

Latency: The latency cost of matrix multiplication given unlimited memory is

    S_MM(n, k, p) = O(log p)

Flop Cost: Matrix multiplication takes O(n²k) flops, which can be divided over p processors, and therefore we have

    F_MM(n, k, p) = O(n²k/p)

Previous Analysis: For the case where k = n, the bandwidth analysis of a general matrix multiplication goes back to what is presented in [6]. Aggarwal et al. presented a cost analysis in the LPRAM model. In this work, the authors showed that the same cost can also be achieved for the transitive closure problem, which can be extended to the problem of computing an LU decomposition. The fact that these bandwidth costs can be obtained for the LU decomposition was later demonstrated by Tiskin in [7]. He used the bulk synchronous parallel (BSP) execution time model. Since the dependencies in LU are more complicated than they are for TRSM, we also expected TRSM to be able to attain the same asymptotic costs as a general matrix multiplication.

2.2 Triangular matrix solve for single right-hand sides

Algorithms for the problem of a triangular solve for a single right-hand side (when X and B are vectors) have been well studied. A communication-efficient parallel algorithm was given by Heath and Romine [8]. This parallel algorithm was later shown to be an optimal schedule in latency and bandwidth costs via lower bounds [9].
However, when X and B are matrices (k > 1), it is possible to achieve significantly lower communication costs relative to the amount of computation required.
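As a point of reference, the substitution idea for a single right-hand side extends directly to k columns. The following is a dependency-free Python sketch of that sequential baseline (function and variable names are ours, not from the thesis' implementation):

```python
# Reference (sequential) TRSM: solve L X = B by forward substitution,
# where L is an n x n lower-triangular matrix and B is n x k dense.
# Matrices are plain lists of lists to keep the sketch self-contained.

def trsm_forward(L, B):
    n = len(L)
    k = len(B[0])
    X = [[0.0] * k for _ in range(n)]
    for j in range(k):            # each right-hand side independently
        for i in range(n):
            s = B[i][j]
            for m in range(i):    # subtract already-solved entries
                s -= L[i][m] * X[m][j]
            X[i][j] = s / L[i][i]
    return X
```

This costs O(n²k) flops in total, which is the operation count the parallel algorithms below distribute over p processors.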

2.3 Triangular matrix solve for multiple right-hand sides

The idea of recursively solving triangular matrix systems with many right-hand sides already showed up a long time ago in the work of Elmroth et al. [2]. There are two main ways to split the problem into smaller subproblems:

Case 1: Splitting the right-hand side into two (independent) subproblems, A X = B becoming

    A [X₁ X₂] = [B₁ B₂]

where the subproblems

    A X₁ = B₁
    A X₂ = B₂

are solved independently.

Case 2: Splitting the triangular matrix into two dependent subtasks in the fashion of

    [A₁₁  0 ] [X₁]   [B₁]
    [A₂₁ A₂₂] [X₂] = [B₂]

where the subproblems

    A₁₁ X₁ = B₁
    A₂₂ X₂ = B₂ − A₂₁ X₁

are solved in sequence.

With a proper mixing of both cases and a new approach to calculating the base cases of these recursions, it was possible to achieve good bandwidth and latency costs. In [10], it is shown that a sequential execution of the TRSM algorithm can achieve the same bandwidth cost as a general matrix-matrix multiplication. This is done using the α-β model and accounting for different cache-line sizes. The cost analysis done in [11] shows that, bandwidth-wise, one can achieve costs for TRSM that are not worse than what a general matrix multiplication achieves. We do a similar analysis with a different model in Section 3.2.

The stability of inverting a triangular matrix has been studied in [12]. It has been stated that no general error bounds exist for blocked methods with respect to the error bound of the iterative methods. We leave the investigation of the stability of our approach as further work.
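The Case-2 splitting can be sketched sequentially; the update B₂ − A₂₁X₁ between the two sub-solves is the step that later becomes a parallel matrix multiplication (helper names are ours):

```python
# Case-2 recursion: split L into 2x2 blocks, solve the top subproblem,
# update the lower right-hand side, then solve the bottom subproblem.
# Base case: plain forward substitution.

def solve_base(L, B):
    n, k = len(L), len(B[0])
    X = [[0.0] * k for _ in range(n)]
    for j in range(k):
        for i in range(n):
            s = B[i][j] - sum(L[i][m] * X[m][j] for m in range(i))
            X[i][j] = s / L[i][i]
    return X

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def rec_trsm(L, B, n0=1):
    n = len(L)
    if n <= n0:
        return solve_base(L, B)
    h = n // 2
    X1 = rec_trsm([row[:h] for row in L[:h]], B[:h], n0)
    L21 = [row[:h] for row in L[h:]]
    U = matmul(L21, X1)                       # the update term A21 * X1
    B2 = [[B[h + i][j] - U[i][j] for j in range(len(B[0]))]
          for i in range(n - h)]
    X2 = rec_trsm([row[h:] for row in L[h:]], B2, n0)
    return X1 + X2
```

Case 1 (splitting B column-wise) needs no update step at all, which is why the right-hand sides can be distributed over processors independently.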


Chapter 3

Communication in TRSM

In this chapter we provide a model to calculate parallel execution time and we make a cost analysis of the recursive TRSM algorithm using the communication upper bounds that were presented in the previous chapter.

3.1 Execution Time Model

The model we use to calculate the parallel execution time of an algorithm along its critical path is the α-β-γ model. It describes the total execution time T of the algorithm in terms of the floating-point operation (flop) count F, the bandwidth W and the latency (synchronization cost) S along the critical path:

    T = γ·F + β·W + α·S

We do not place constraints on the local memory size. As it is assumed that, with time, computing elements will become faster and with that a decrease of γ is expected, the goal of this work is to find an algorithmic approach to solving triangular matrix equations with multiple right-hand sides that only increases the flop cost F by a constant, while decreasing the latency S. This is a reasonable approach, since the importance of α and β grows as γ gets lower.

3.2 Recursive TRSM Algorithm

In this section, the algorithmic approach to solving TRSM recursively is presented, together with a cost analysis of the recursive approach.

Algorithmic approach: The algorithm to solve many triangular systems of equations, commonly referred to as the TRSM problem, reads: Given a lower-triangular matrix L ∈ R^(n×n) and a dense matrix B ∈ R^(n×k), the goal is to

compute the matrix X ∈ R^(n×k) such that L X = B. The recursive algorithm that is presented by Elmroth et al. in [2] subdivides the L matrix into a 2×2 set of blocks at each step and performs two TRSM calls in sequence with all processors at each recursive level. It is shown as Algorithm 1.

Algorithm 1: X = Rec-TRSM(L, B, n, k, Π, p, n₀)
Require: L is a lower-triangular n×n matrix and B is a rectangular n×k matrix, both distributed over Π = p processors.
  If n ≤ n₀, allgather L onto all processors, subdivide B = [B₁, ..., B_p] and X = [X₁, ..., X_p] and compute Xᵢ = L⁻¹ Bᵢ with the i-th processor.
  Subdivide L into n/2 × n/2 blocks, L = [L₁₁ 0; L₂₁ L₂₂]
  Subdivide B and X into n/2 × k blocks, B = [B₁; B₂], X = [X₁; X₂]
  Compute X₁ = Rec-TRSM(L₁₁, B₁, n/2, k, Π, p, n₀).
  Compute B₂ = B₂ − MM(L₂₁, X₁, n/2, k, Π, p).
  Compute X₂ = Rec-TRSM(L₂₂, B₂, n/2, k, Π, p, n₀).
Ensure: L X = B.

Rec-TRSM(L, B, n, k, Π, p, n₀) requires two recursive calls and a matrix multiplication at each recursive level until n ≤ n₀.

Bandwidth Cost: This approach yields the communication cost recurrence

    W_Rec-TRSM(n, k, p, n₀) = W_MM(n/2, k, p) + 2·W_Rec-TRSM(n/2, k, p, n₀)

which decreases geometrically at each level as long as n < k√p. At the base case, the allgather of L requires a communication cost of

    W_Rec-TRSM(n₀, k, p, n₀) = O(n₀²)

There are n/n₀ base cases that are executed in sequence using all processors, for a total cost of

    W_base-cases(n, k, p, n₀) = (n/n₀)·W_Rec-TRSM(n₀, k, p, n₀) = O(n·n₀)

Choice of base-case size: We desire that W_base-cases(n, k, p, n₀) ≤ W_Rec-TRSM(n, k, p, n₀), which implies that we need a different choice of n₀ depending on the initial size of our matrix:

One large dimension: In this case, because the matrix A is small, it makes no sense to split it, and therefore we pick n₀ = n; with that choice, no recursion occurs.

Two large dimensions: When the initial matrix multiplication costs O(nk/√p) and we want the base cases not to dominate, we choose

    n₀ = max(1, k/√p)

It is important to note that as we recurse, the matrix multiplications at each level cost the same amount of bandwidth, and therefore we pick up a logarithmic factor for the total bandwidth.

Three large dimensions: When the initial matrix multiplication costs O((n²k/p)^(2/3)), we select

    n₀ = max(1, (nk²/p²)^(1/3))

Latency: The latency cost is dominated by the n/n₀ base cases, since they may not be executed concurrently and therefore comprise an execution path within the algorithm of S_Rec-TRSM(n, k, p, n₀) = (n/n₀)·S_MM(n₀, k, p). Each base case requires an allgather, which implies S_MM(n₀, k, p) = O(log p) latency cost, yielding an overall latency cost of

    S_Rec-TRSM(n, k, p, n₀) = O((n/n₀)·log p)

This general cost leads to the following latency costs, when the choice of n₀ is made to minimize bandwidth cost with respect to the initial matrix multiplication, as done above.

One large dimension:

    S_Rec-TRSM(n, k, p, n) = O(log p)

Two large dimensions:

    S_Rec-TRSM(n, k, p, max(1, k/√p)) = O(min(n, n√p/k)·log p)

Three large dimensions:

    S_Rec-TRSM(n, k, p, max(1, (nk²/p²)^(1/3))) = O(min(n, (np/k)^(2/3))·log p)
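The base-case selection above can be collected into a small helper. This is a sketch under the regime boundaries of Section 2.1; the function name and the return type are ours:

```python
import math

# Base-case size n0 for Rec-TRSM, chosen so the sequential base cases
# do not dominate the bandwidth of the initial matrix multiplication.
def base_case_size(n, k, p):
    if n < k / p:                             # one large dimension
        return float(n)                       # n0 = n: no recursion
    if n > k * math.sqrt(p):                  # two large dimensions
        return max(1.0, k / math.sqrt(p))     # n0 = max(1, k/sqrt(p))
    # three large dimensions: n0 = max(1, (n*k^2/p^2)^(1/3))
    return max(1.0, (n * k * k / (p * p)) ** (1.0 / 3.0))
```

For example, with n = k = 100 and p = 8 (three large dimensions), this yields n₀ = (100·100²/64)^(1/3) = 25.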

Flop Cost: The flop cost of this algorithm is dominated by the top-level matrix multiplications and therefore amounts to

    F_Rec-TRSM(n, k, p) = O(n²k/p)
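As a toy illustration of the α-β-γ model from Section 3.1, the derived counts can be plugged in to predict execution time. The machine parameters in any such comparison are inputs, not part of the thesis; the function below is our sketch:

```python
# Predicted critical-path time under the alpha-beta-gamma model:
# T = gamma*F + beta*W + alpha*S, where F is the flop count, W the
# bandwidth cost and S the latency (synchronization) cost.
def exec_time(F, W, S, gamma, beta, alpha):
    return gamma * F + beta * W + alpha * S
```

Evaluating this for two algorithms with equal F and W but different S makes the chapter's point concrete: as γ shrinks relative to α, the algorithm with the shorter critical path (smaller S) wins.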

Chapter 4

Triangular Matrix Inversion

In this chapter, the algorithmic approach to inverting a lower-triangular system recursively is presented, together with a cost analysis of this approach.

Algorithmic approach: As the latency of the algorithm discussed previously grows with the number of processors involved, its scalability is limited. Therefore, this approach seems suboptimal, and the goal is to find an algorithm that requires less latency. Noting that matrix multiplication requires little latency, the cost of inverting a triangular matrix was investigated. Parallel triangular matrix inversion can be done with a shorter critical path than the TRSM approach discussed before. This approach is shown in Algorithm 2. Note that whenever a call to MM is made, we mean the matrix multiplication mentioned in [5]. The approach taken in Algorithm 2 is the following: Each problem is subdivided into two recursive matrix inversions, which are executed concurrently with half the processors each; then two matrix multiplications are performed to complete the inversion.

Bandwidth cost: Since we execute the two recursive calls in Rec-Tri-Inv(L, n, Π, p) simultaneously with two disjoint sets of processors, the communication cost recurrence is given by

    W_Rec-Tri-Inv(n, p) = 2·W_MM(n/2, n/2, p) + W_Rec-Tri-Inv(n/2, p/2)

If we assume n to be sufficiently large, p = 1 should be reached at the base case, which is when n₀ = n/2^(log p) = n/p; therefore there are a total of log p recursive levels, each of which requires matrix multiplications. The communication cost associated with each level decreases geometrically, so the total cost of the algorithm is dominated by the cost of the top-level matrix multiplication, which is

    W_MM(n/2, n/2, p) = O(n²/p^(2/3))

Algorithm 2: L⁻¹ = Rec-Tri-Inv(L, n, Π, p)
Require: L, a lower-triangular n×n matrix distributed over Π = p processors.
  if p = 1 then
    L⁻¹ = sequential_inversion(L)
  else
    Subdivide L into n/2 × n/2 blocks, L = [L₁₁ 0; L₂₁ L₂₂]
    Subdivide Π = [Π₁, Π₂], where Π₁ and Π₂ each contain p/2 processors
    L₁₁⁻¹ = Rec-Tri-Inv(L₁₁, n/2, Π₁, p/2)
    L₂₂⁻¹ = Rec-Tri-Inv(L₂₂, n/2, Π₂, p/2)
    (L⁻¹)₂₁ = MM(L₂₂⁻¹, L₂₁, n/2, n/2, Π, p)
    (L⁻¹)₂₁ = −MM((L⁻¹)₂₁, L₁₁⁻¹, n/2, n/2, Π, p)
    Assemble L⁻¹ from the n/2 × n/2 blocks, L⁻¹ = [L₁₁⁻¹ 0; (L⁻¹)₂₁ L₂₂⁻¹]
  end
Ensure: L L⁻¹ = I

Since the cost of the matrix multiplications decreases as n and p are both decreased by a factor of two, the initial matrix multiplication dominates the bandwidth cost,

    W_Rec-Tri-Inv(n, p) = O(W_MM(n/2, n/2, p)) = O(n²/p^(2/3)).

Latency cost: Since there are log p recursive levels and at each step we do a matrix-matrix multiplication, the latency cost is

    S_Rec-Tri-Inv(n, p) = O(log² p)

Flop cost: The total flop cost for the inversion is

    F_Rec-Tri-Inv(n, n₀, p) = F_Base-Cases(n, n₀, p) + F_MM(n/2, n/2, p)

The flop cost of a sequential inversion of a triangular matrix is, as stated by Hunger in [13],

    F_Seq-Inv(n) = (1/3)·n³

and the total base-case flop cost is

    F_Base-Cases(n, n₀, p) = (n/n₀)·F_Seq-Inv(n₀) = (1/3)·n·n₀² = (1/3)·n³/p²

This is dominated by the top-level matrix multiplication, on which every processor works and which therefore needs

    F_MM(n/2, n/2, p) = O(n³/p)

flops. Since the top-level matrix multiplication is the most expensive and the costs of the other levels decrease geometrically, this gives us the cost of

    F_Rec-Tri-Inv(n, n₀, p) = O(n³/p)
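A serial Python sketch of the block recursion behind Algorithm 2 (names are ours; in the parallel algorithm the two recursive inversions run concurrently on disjoint processor halves, here they simply run one after the other):

```python
# Block-recursive inversion of a lower-triangular matrix:
#   inv(L) = [[ inv(L11),                    0        ],
#             [ -inv(L22) * L21 * inv(L11),  inv(L22) ]]
# which follows from multiplying out L * inv(L) = I blockwise.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def tri_inv(L):
    n = len(L)
    if n == 1:                                # sequential base case
        return [[1.0 / L[0][0]]]
    h = n // 2
    inv11 = tri_inv([row[:h] for row in L[:h]])
    inv22 = tri_inv([row[h:] for row in L[h:]])
    L21 = [row[:h] for row in L[h:]]
    inv21 = [[-x for x in row] for row in matmul(matmul(inv22, L21), inv11)]
    return ([inv11[i] + [0.0] * (n - h) for i in range(h)] +
            [inv21[i] + inv22[i] for i in range(n - h)])
```

Note the two multiplications per level mirror the two MM calls in Algorithm 2, and the off-diagonal block carries the minus sign required for L·L⁻¹ = I.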


Chapter 5

TRSM with Inversion

In this chapter, we discuss approaches to solving the TRSM problem using the inversion derived in Chapter 4. The idea of using inversion for its low latency cost arises from what Tiskin did in [7]: He used inversion to decrease the latency in the LU factorization. We discuss optimal base-case choices depending on relative matrix sizes, both for TRSM with full inversion and for recursive TRSM with block inversion.

5.1 TRSM with full inversion

In this section, the algorithm for solving TRSM with a complete inversion of L is given. We provide a cost analysis for this method as well as an optimal base-case choice.

Algorithmic approach: If TRSM is done with full inversion of the matrix, the algorithm works as described in Algorithm 3.

Algorithm 3: X = Inv-TRSM(L, B, n, k, Π)
Require: L is a lower-triangular n×n matrix and B is a rectangular n×k matrix, both distributed over Π = p processors.
  L⁻¹ = Rec-Tri-Inv(L, n, Π, p)
  X = MM(L⁻¹, B, n, k, Π, p)
Ensure: L X = B.

Bandwidth cost: This approach leads to a total bandwidth cost of

    W_Inv-TRSM(n, k, p, n₀) = W_Rec-Tri-Inv(n, p, n₀) + W_MM(n, k, p)
      = O(n²/p^(2/3))                    n > k√p        (two large dimensions)
        O((n²k/p)^(2/3) + n²/p^(2/3))    k/p ≤ n ≤ k√p  (three large dimensions)
        O(n²)                            n < k/p        (one large dimension)

To not be dominated by the matrix inversion, we therefore need W_MM(n, k, p) > W_Rec-Tri-Inv(n, p) = O(n²/p^(2/3)). This is only the case when n < k, which makes sense, since otherwise the full inversion would obviously be the dominating part, as the matrix is larger than the right-hand side.

Latency cost: This approach leads to a total latency cost of

    S_Inv-TRSM(n, k, p, n₀) = S_Rec-Tri-Inv(n, p, n₀) + S_MM(n, k, p) = O(log² p),

for all three cases.

Flop cost: The flop cost of this algorithm is

    F_Inv-TRSM(n, k, p, n₀) = F_Rec-Tri-Inv(n, p, n₀) + F_MM(n, k, p) = O(n³/p) + O(n²k/p).

Therefore, the algorithm requires substantially more computation than Algorithm 1 when n > k.

5.2 Recursive TRSM with Block Inversion

In this section, we discuss an algorithm for solving TRSM recursively with a complete inversion as the base case. We provide a cost analysis for this method as well as optimal base-case choices depending on the relative matrix sizes.

Algorithmic approach: The goal is to keep the latency as low as possible without increasing the asymptotic flop or bandwidth cost. We want to achieve that the multiplication with the right-hand side is the bandwidth-wise dominant part, and not the matrix inversion (which has cost O(n²/p^(2/3))), since the former is bandwidth we necessarily have to pay. To achieve this, it is necessary that n is not much larger than k. Therefore, recursive steps are taken as in Algorithm 1 in Section 3.2. Once a recursive level is reached where n

is sufficiently small relative to k, a switch to Algorithm 3 is performed, as the triangular matrix inversion should no longer be a bottleneck. The resulting algorithm, referred to as Rec-Inv-TRSM, is the same as Algorithm 1, except that the base case is replaced by a call to Algorithm 3. It is shown as Algorithm 4.

Algorithm 4: X = Rec-Inv-TRSM(L, B, n, k, Π, n₀)
Require: L is a lower-triangular n×n matrix and B is a rectangular n×k matrix, both distributed over Π = p processors.
  if n ≤ n₀ then
    X = Inv-TRSM(L, B, n, k, Π)
  else
    Subdivide L into n/2 × n/2 blocks, L = [L₁₁ 0; L₂₁ L₂₂]
    Subdivide B and X into n/2 × k blocks, B = [B₁; B₂], X = [X₁; X₂]
    Compute X₁ = Rec-Inv-TRSM(L₁₁, B₁, n/2, k, Π, n₀).
    Compute B₂ = B₂ − MM(L₂₁, X₁, n/2, k, Π, p).
    Compute X₂ = Rec-Inv-TRSM(L₂₂, B₂, n/2, k, Π, n₀).
    Assemble X from the n/2 × k blocks, X = [X₁; X₂].
  end
Ensure: L X = B.

Case 1: If n ≤ k we do not need any steps of the first (recursive) TRSM approach and can directly invert the matrix. This gives the results shown in Section 5.1.

Case 2: The more interesting case is n > k, where we can use steps of both approaches.

Bandwidth cost: Here we show that, using inversion, we do not get a higher bandwidth cost than with the initial TRSM approach. The base case costs

    W_Rec-Inv-TRSM(n₀, k, p, n₀) = O(n₀²/p^(2/3))

bandwidth, and since we have n/n₀ base cases that are all executed sequentially, this leads to a total cost for all the base cases of

    W_base-cases(n, k, p, n₀) = O(n·n₀/p^(2/3))

With the cost for the initial matrix multiplication as stated in Section 2.1, this leads to a total cost of

    W_Rec-Inv-TRSM(n, k, p, n₀) = W_base-cases(n, k, p, n₀) + W_MM(n, k, p)
                                = O(n·n₀/p^(2/3)) + W_MM(n, k, p).

Since the goal is to be dominated by the initial MM, i.e.

    W_Rec-Inv-TRSM(n, k, p, n₀) = O(W_MM(n, k, p)),

n₀ is chosen accordingly:

One large dimension: This case never occurs for n > k; we would be in Case 1 and do the direct inversion.

Two large dimensions: n₀ = k·p^(1/6)

Three large dimensions: n₀ = n^(1/3)·k^(2/3)

This leads to the following total bandwidth cost:

    W_Rec-Inv-TRSM(n, k, p, n₀) =
      O((1 + log(n/(k√p)))·nk/√p)    n > k√p        (two large dimensions)
      O((n²k/p)^(2/3))               k/p ≤ n ≤ k√p  (three large dimensions)
      O(n²)                          n < k/p        (one large dimension)

Latency: The total latency is given by the number of base cases of the first part of the algorithm times the base-case latency (that is, S_Rec-Tri-Inv). Therefore we have (in the α-β-γ model)

    S_Rec-Inv-TRSM(n, k, p, n₀) = (n/n₀)·S_Rec-Tri-Inv(n₀, p) =
      O((n/(k·p^(1/6)))·log² p)      n > k√p        (two large dimensions)
      O((n/k)^(2/3)·log² p)          k/p ≤ n ≤ k√p  (three large dimensions)
      O(log² p)                      n < k/p        (one large dimension)

This is an improvement over the analysis given in Section 3.2.

Flop costs: With the given choice of n₀, it can be shown that the flop cost is not asymptotically higher than that of a standard implementation of TRSM. The flop cost of Rec-TRSM is

    F_Rec-TRSM(n, k, p) = O(n²k/p),

whereas the cost of Algorithm 4 is

    F_Rec-Inv-TRSM = O(n²k/p + (n/n₀)·n₀³/p) = O(n²k/p + n·n₀²/p).

Therefore it is desired that n·n₀² < n²k, which implies the criterion n₀ < √(nk). We demonstrate that the choice of n₀ made to obtain the desired bandwidth cost also satisfies this criterion and therefore does not require additional computational work asymptotically.

One large dimension: The choice of the base-case size is n₀ = n, and since here n < k, we have

    n₀² = n² < nk,  hence  n₀ < √(nk).

Two large dimensions: The choice of the base-case size is n₀ = k·p^(1/6). Since two large dimensions also impose the constraint n > k√p,

    √(nk) > √(k²·√p) = k·p^(1/4) > k·p^(1/6) = n₀.

Thus, this choice of n₀ guarantees that the computation cost involved in the triangular matrix inversion is always of low order when there are two large matrix dimensions at the beginning of the recursion.

Three large dimensions: The choice of the base-case size is n₀ = n^(1/3)·k^(2/3). Since three large dimensions also impose the constraint n > k, we directly see that

    n₀ = n^(1/3)·k^(2/3) ≤ n^(1/2)·k^(1/2) = √(nk),

with equality when n = k, which implies that the computation cost of the inversion may become of leading order when n ≈ k. Therefore, in practice, when n ≈ k, it may make sense to take a few steps of recursion in Algorithm 4 before performing the inversion.

5.3 Summary

In this part, all the results derived in this chapter are summarized and a table with the total costs of all the algorithms considered is given.

1 Large Dimension (n < k/p):

  Algorithm     W                           S                            F
  MM            n²                          log p                        n²k/p
  TRSM Rec      n²                          log p                        n²k/p
  TRSM Inv      n²                          log² p                       n²k/p
  TRSM RecInv   n²                          log² p                       n²k/p

2 Large Dimensions (n > k√p):

  Algorithm     W                           S                            F
  MM            nk/√p                       log p                        n²k/p
  TRSM Rec      (1 + log(n/(k√p)))·nk/√p    min(n, n√p/k)·log p          n²k/p
  TRSM Inv      nk/√p + n²/p^(2/3)          log² p                       (n³ + n²k)/p
  TRSM RecInv   (1 + log(n/(k√p)))·nk/√p    (n/(k·p^(1/6)))·log² p       n²k/p

3 Large Dimensions (k/p ≤ n ≤ k√p):

  Algorithm     W                           S                            F
  MM            (n²k/p)^(2/3)               log p                        n²k/p
  TRSM Rec      (n²k/p)^(2/3)               min(n, (np/k)^(2/3))·log p   n²k/p
  TRSM Inv      (n²k/p)^(2/3) + n²/p^(2/3)  log² p                       (n³ + n²k)/p
  TRSM RecInv   (n²k/p)^(2/3)               (n/k)^(2/3)·log² p           n²k/p

Table 5.1: Asymptotic upper bounds on bandwidth (W), latency (S) and flop costs (F) for all the algorithms mentioned.

In Table 5.1, the costs for all the algorithms presented in Chapters 3 and 5 are shown as asymptotic upper bounds for a triangular matrix L ∈ R^(n×n) and a right-hand side B ∈ R^(n×k), with p processors working on the task.
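The combined scheme of Algorithm 4 can be sketched serially. The local helpers below stand in for the distributed MM and the distributed inversion; all names are ours, not the thesis' implementation:

```python
# Serial sketch of Rec-Inv-TRSM: recurse on 2x2 blocks of L as in
# Algorithm 1, but solve each base case by explicit inversion
# (Algorithm 3) instead of substitution.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def tri_inv(L):                               # block-recursive inversion
    n = len(L)
    if n == 1:
        return [[1.0 / L[0][0]]]
    h = n // 2
    inv11 = tri_inv([row[:h] for row in L[:h]])
    inv22 = tri_inv([row[h:] for row in L[h:]])
    L21 = [row[:h] for row in L[h:]]
    inv21 = [[-x for x in row] for row in matmul(matmul(inv22, L21), inv11)]
    return ([inv11[i] + [0.0] * (n - h) for i in range(h)] +
            [inv21[i] + inv22[i] for i in range(n - h)])

def rec_inv_trsm(L, B, n0):
    n = len(L)
    if n <= n0:                               # base case: X = inv(L) * B
        return matmul(tri_inv(L), B)
    h = n // 2
    X1 = rec_inv_trsm([row[:h] for row in L[:h]], B[:h], n0)
    U = matmul([row[:h] for row in L[h:]], X1)   # update term L21 * X1
    B2 = [[B[h + i][j] - U[i][j] for j in range(len(B[0]))]
          for i in range(n - h)]
    return X1 + rec_inv_trsm([row[h:] for row in L[h:]], B2, n0)
```

The trade-off of Table 5.1 is visible in the structure: larger n0 means fewer sequential base cases (less latency) but more inversion work and bandwidth per base case.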

Chapter 6

Implementation

In this chapter, some implementation details are given. For the implementation of the given algorithms, ScaLAPACK [1] was used as a base-level library to perform the base-case calculations as well as the matrix multiplications. It is important to note that with that choice, the optimal communication costs assumed throughout the theoretical part are not achieved in the one and three large dimensions cases, since ScaLAPACK only implements a two-dimensional matrix multiplication. For an efficient work distribution, while staying with a relatively simple distribution model, a block-cyclic distribution of the matrix was chosen, as shown in Figure 6.1. Each processor owns the parts of the matrix that are colored in its color, and with that, as soon as we multiply the four-by-four blocks, all processors work on the multiplication, not only on the top-level matrix multiplication. With this choice we ensure that after some blocked steps, all processors are working on the matrix multiplications. This is especially important, as we showed that the top-level matrix multiplications are always the most expensive. For efficiency reasons, all algorithms were implemented in an iterative scheme.

Figure 6.1: Representation of a block-cyclic distribution of the matrix

Inversion: For the inversion of the base cases, the algorithm proposed in Section 5.1 was implemented. To ensure that the base case of the inversion is done sequentially, the block size for the distribution was chosen to be the base-case size of the inversion. At each level the number of matrix multiplications decreases by a factor of two while the size of the matrices doubles, until we reach the top level, as can be seen in Figure 6.2: In the first level (denoted by 1), the four green squares are multiplied. In the second level, the two bigger, red matrices are multiplied, and in the final step, the blue square matrix is multiplied, leading to a complete inversion.

Triangular Solver: The triangular solve itself uses this inversion within each base case and then performs the required updates with ScaLAPACK's triangular matrix multiplication and dense matrix multiplication, respectively, as can be seen in Algorithm 5. The updates on the right-hand side done in update_B are always done up to the part that was calculated, as proposed in the recursive part in Section 5.2. These updates are denoted by the consecutive numbers in Figure 6.2: After the first base case has been used to calculate X₁, the small (green) update 1 is done on B₂. After solving for X₂ in the second step, the update of B affects B₃ and B₄, as the bigger (red) block 2 is updated. This goes on until the last base case is handled.

Figure 6.2: Recursion in the triangular matrices: inversion steps (left), triangular solve (right)

Algorithm 5: Rec-Inv-TRSM(L, B)
  for i in number of base cases do
    Rec-Inv(L_{i,i})
  for i in number of base cases do
    X_i = ScaLAPACK_mm(L_{i,i}⁻¹, B_i)
    update_B(X_i, L)
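The iterative scheme of Algorithm 5 can be sketched serially with block size 1, so "inverting a diagonal block" reduces to a reciprocal; the loop structure (invert first, then solve and update row by row) is the point, and the helper names are ours:

```python
# Serial sketch of Algorithm 5's iterative scheme: one pass inverting
# the diagonal "blocks", then a sweep that solves with each inverted
# block and immediately updates the right-hand-side rows below it.

def iter_inv_trsm(L, B):
    n, k = len(L), len(B[0])
    diag_inv = [1.0 / L[i][i] for i in range(n)]     # the "Rec-Inv" pass
    X = [row[:] for row in B]
    for i in range(n):                               # the solve pass
        X[i] = [diag_inv[i] * x for x in X[i]]       # X_i = L_ii^{-1} B_i
        for r in range(i + 1, n):                    # update_B below row i
            X[r] = [X[r][j] - L[r][i] * X[i][j] for j in range(k)]
    return X
```

With real block sizes, the scalar reciprocal becomes the recursive block inversion and the row updates become the ScaLAPACK multiplications of Algorithm 5.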

Chapter 7

Experimental Results

In this section, the results of the performed experiments are provided. After a description of the experimental setup, including information about the machine used, the performance of the proposed algorithms is evaluated.

Experimental setup: The experiments were performed on a Cray XC40 with 1256 nodes. Each node contains two Intel Xeon E v3 CPUs with 12 cores each and 64 GB of memory. The nodes are connected by the Aries proprietary interconnect from Cray, with a dragonfly network topology.

Compilers and libraries: LAPACK and ScaLAPACK [1] were used from the provided precompiled cray-libsci library. For the matrix storage and correctness checks, the Eigen [14] library was used. All programs were compiled with gcc with the optimization flag -O3.

Initial conditions: The matrices were created locally using the drand48 random number generator within the range [1, 101), where L is lower triangular and B is dense.

Restrictions: To focus on the algorithmic improvements, the timing was started after the MPI communicators were initialized, the Cblas grid was created and the local matrices were initialized. Each simulation was run six to eleven times and the first run was neglected. To summarize the data, as suggested in [15, 16], the harmonic mean

    x̄_h = n / (Σ_{i=1}^{n} 1/x_i)

was used instead of the arithmetic mean. Also, for each processor count, the two-sided 95% confidence interval of each set of experimental results is shown. These intervals were calculated as mentioned in [16], based on Student's t distribution:

    CI: [x̄ − t(n−1, α/2)·s/√n,  x̄ + t(n−1, α/2)·s/√n]

where s denotes the sample standard deviation

s = sqrt( sum_{i=1}^{n} (x_i - x_bar)^2 / (n - 1) )

In each of the following graphs, the mean value is plotted with a larger icon and lines, the single runs are plotted with smaller icons, and the 95% confidence interval is plotted in black. Base-Case Size: We expected that in real runs, the theoretically optimal base-case size may not be the best choice for some problems. This is due to the constants hidden in the asymptotic complexities as well as the two-dimensional implementation of the matrix-matrix multiplication that we used. Therefore, a range around the optimal base-case size n_opt was tried for each experiment (n_opt/4, n_opt/2, n_opt, 2 n_opt, 4 n_opt), and only the best-performing base-case size was reported. Strong scaling: The strong-scaling plots show the average performance over the runs. The total flop count (n^2 k for TRSM and n^3/3 for the inversion) was divided by the execution time, which gives the flop rate:

G_TRSM(t_exec(p), n, k) = n^2 k / t_exec(p)
G_Inv(t_exec(p), n) = n^3 / (3 t_exec(p))

For a perfectly parallelizable algorithm with no parallelization overhead, this quantity should scale like p, since the execution time in this case would scale like 1/p. Weak scaling: For the weak-scaling plots, the quantity G was reused and divided by the machine peak performance for the set of nodes used. The plotted quantity is therefore:

P(t_exec(p), n, k, p) = G(t_exec(p), n, k) / (p * peak per processor)

For the peak performance, 41.6 GF per core has been assumed; each process runs on one core only. For a perfectly parallelizable algorithm with no overhead cost, this quantity should be constant at 1, since every processor would run at peak performance for the whole time. Inversion: One of the benchmarks compared our implementation of the inversion against the inversion ScaLAPACK provides. The

Figure 7.1: Strong scaling of the inversion with N=2048
Figure 7.2: Strong scaling of the inversion with N=8192

results can be seen in Figures 7.1, 7.2 and 7.3. One can see that this part of the TRSM solver was able to beat ScaLAPACK in strong as well as in weak scaling. It is interesting to observe that for smaller problems (up to N = 8192), the problems appear to be too small to get the full benefit of 4096 ranks working on them. The important point is that for problems large enough, the proposed algorithm is a real improvement over ScaLAPACK, as one gets more than a factor of 1.6 in gigaflops per second for N = 32768. The weak scaling was started at N = 1024 and p = 4, and N was increased by a factor of two as p increased by a factor of four, to keep memory usage per processor constant. In the weak-scaling plot, visible in Figure 7.4, it can be

Figure 7.3: Strong scaling of the inversion with N=32768
Figure 7.4: Weak scaling starting with N=1024 for p=4

seen that the proposed method is strictly better in terms of percentage of peak performance. Three Large Dimensions: Benchmarks for the algorithm were performed for the case of three large dimensions. The strong-scaling plots for different matrix sizes, where L is in R^{N x N} and B is in R^{N x K}, are provided for the presented implementation as well as ScaLAPACK's algorithm in Figures 7.5, 7.6 and 7.7. It is clearly visible that the scaling is very poor for small matrix sizes, but as the sizes increase, the scaling becomes very good. Unfortunately, the overall performance is still worse than what ScaLAPACK offers.
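The performance quantities plotted here, the flop rate G and the peak fraction P defined earlier in this chapter, can be computed as follows (a minimal sketch; the 41.6 GF/core default is the per-core peak assumed in this thesis):

```python
def gflops_trsm(t_exec, n, k):
    """G_TRSM: n^2 * k flops over the measured execution time, in GF/s."""
    return n * n * k / t_exec / 1e9

def gflops_inv(t_exec, n):
    """G_Inv: n^3 / 3 flops over the measured execution time, in GF/s."""
    return n ** 3 / (3.0 * t_exec) / 1e9

def peak_fraction(gflops, p, peak_per_core=41.6):
    """Weak-scaling efficiency: achieved rate over p times per-core peak."""
    return gflops / (p * peak_per_core)
```

A run that reaches exactly p times the per-core peak yields a fraction of 1.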

Figure 7.5: Strong scaling of TRSM in three large dimensions
Figure 7.6: Strong scaling for N=K=8192

After profiling the runs, it was easy to see that the biggest problem lies not in the inversion part but in the very slow triangular-times-dense matrix multiplication (TRMM). It is even slower than the complete triangular solve (TRSM), even though the maximal amount of parallelism possible for TRSM is only nk, whereas TRMM admits far more parallelism. The weak-scaling plot can be seen in Figure 7.8; there, the scaling is not as promising as the strong-scaling results suggest. The results were created with the following set of parameters: starting with p = 4 and N = K = 1024, N and K were both increased by a factor of two as p increased by a factor of four.

Figure 7.7: Strong scaling for N=K=32768
Figure 7.8: Weak scaling starting with N=1024 for p=4

Two Large Dimensions: Since the TRMM was such a dominating factor in the previous case, one could hope for better performance as the size of the right-hand side decreases. The strong-scaling plots were created for several sizes n of the matrix L in R^{n x n}, keeping the width k of the right-hand side B in R^{n x k} constant. The observed results can be seen in Figures 7.9, 7.10 and 7.11. One can see that for smaller numbers of processors, the discussed approach works very well. Unfortunately, the performance relative to ScaLAPACK decreases at the very end due to a spike in ScaLAPACK's performance, which is unexplained for now. The interesting part is that as less time is spent in the TRMM, the more important the newly proposed part of the algorithm becomes, and this part proves to be efficient. The weak scaling was started at N = 1024 and p = 4, and N was increased by a factor of two as p increased by a factor of four, to again keep memory usage per processor constant. The results are shown in Figure 7.12.

Figure 7.9: Strong scaling for N=4096 with K=512
Figure 7.10: Strong scaling for N=16384 with K=512

One Large Dimension: Since we already found that the inversion is fast and the TRMM is slow, we decided not to run benchmarks for the one-large-dimension case to save computing time, as nothing interesting could have been observed there.

Figure 7.11: Strong scaling for N=32768 with K=512
Figure 7.12: Weak scaling starting with N=1024 for p=4

Summary of the results: The proposed algorithm does not bring the expected benefits, due to the slow implementation of the triangular matrix multiplication. With less time spent in the triangular matrix multiplication, better performance for the proposed algorithm can be obtained. Nevertheless, the main part of the algorithm, the faster inversion of triangular blocks, shows a good performance increase over the routine implemented in ScaLAPACK [1].

Chapter 8 Further Work

In this chapter, further work relevant to this topic is presented. Stability Analysis: As mentioned in Section 2, a stability analysis of the proposed method for blocked inversion could be done; such an approach is partially described in [2]. For the problems considered in this work, every lower triangular matrix was well-conditioned, and therefore no correctness problems arose compared to the results a regular, iterative scheme provided. Optimizing Triangular Matrix Multiplications: The results show very clearly that, in order to make the proposed method pay off, a much faster triangular matrix multiplication is needed. Due to its limited area of application compared to a dense matrix multiplication or even a triangular solve, the level of optimization in ScaLAPACK is presumably rather low. Therefore, most likely, new code has to be developed at this point. Another limitation of the current state is that we always use a two-dimensional processor grid. To reach the asymptotically optimal cost, this should be extended to cover at least three-dimensional as well as one-dimensional grid layouts. Adapting the structure for arbitrary matrix sizes: So far, only problems with matrix sizes chosen as powers of two have been considered. This made the recursive algorithms simpler to implement, as there are no leftover blocks. The code could be improved so that general sizes become usable. The most common approach would be to extend the matrix to the next power of two with a block of the identity matrix at the bottom and to add zero rows to the right-hand side. Further optimization: The results show that the code has a performance gap, as the subroutines used show a significant lack of scaling. One can also see that the performance never ends up higher than 30% of peak performance for a large number of processors, while for very well optimized code, one can expect to gain a factor of about two more in peak performance.
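The power-of-two padding sketched in the paragraph on arbitrary matrix sizes can be written down directly (a hedged NumPy sketch; the helper name is mine, not from the thesis code):

```python
import numpy as np

def pad_to_pow2(L, B):
    """Embed L into the next power-of-two size with a trailing identity
    block and pad B with zero rows; the padded system L_p X_p = B_p has
    the original solution in its first n rows and zeros below."""
    n = L.shape[0]
    m = 1 << (n - 1).bit_length()  # next power of two >= n
    Lp = np.eye(m)
    Lp[:n, :n] = L
    Bp = np.zeros((m, B.shape[1]))
    Bp[:n] = B
    return Lp, Bp
```

Since the identity block is already triangular with unit diagonal, the padding changes neither the conditioning of the system nor the solution in the original rows.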

Chapter 9 Conclusion

This work presented a new, communication-avoiding approach to solving a triangular system for multiple right-hand sides (TRSM). TRSM serves as a subroutine for many algorithms in dense linear algebra, such as the LU decomposition, and it is a critical component for solving linear systems of equations. It has been shown that the algorithm we propose asymptotically uses the same bandwidth and flop costs as the standard iterative scheme for solving the TRSM problem, but decreases the latency by a factor of p^(2/3)/log p for the case of three large dimensions as well as for the case of two large dimensions. To achieve the decreased latency, one has to carefully pick the base-case size so as not to be dominated by the bandwidth or the flop costs. For all three cases of relative matrix sizes, we presented base-case sizes that are optimal. For the case of one large dimension, the algorithm does not perform any better than the existing one, because single-processor work on the left side is always preferred and therefore no gain was achieved. The summarized costs are shown in Table 9.1. Since only asymptotic upper bounds are available for the matrix multiplications, we were only able to give asymptotic upper bounds for the performance of our algorithm. With this decrease in latency, our algorithm is very promising for solving linear systems faster on machines with a large number of processors. This is especially true because, for larger systems, the communication cost is more of a bottleneck than it is on small machines. This work also opens up the question of whether the new triangular inversion should be reconsidered as a subroutine for other applications, as was done by Tiskin for the LU factorization in [7]. Experiments with a not heavily optimized version of this algorithm were performed, using ScaLAPACK's two-dimensional matrix multiplication for the calculations of our algorithm. The results showed that the new approach to the inversion brings a notable speedup, whereas, due to the lack of a well-optimized triangular matrix multiplication, the time to solution of the presented triangular solver still turns out to be higher than the reference.

Table 9.1: Summarized upper bounds for bandwidth (W), latency (S) and flop costs (F) for all the algorithms mentioned

We were able to see that the problem sizes for which we did our experiments were rather small, and with that, the percentage of peak performance was lower than expected. But the trends that the graphs show are promising.

Bibliography

[1] L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley, ScaLAPACK Users' Guide, Society for Industrial and Applied Mathematics, Philadelphia, PA.

[2] Erik Elmroth, Fred Gustavson, Isak Jonsson, and Bo Kågström, "Recursive blocked algorithms and hybrid data structures for dense matrix library software," SIAM Review, vol. 46, no. 1, pp. 3-45.

[3] Edgar Solomonik and James Demmel, "Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms," in Euro-Par 2011 Parallel Processing, Springer.

[4] Fred G. Gustavson, "Recursion leads to automatic variable blocking for dense linear-algebra algorithms," IBM Journal of Research and Development, vol. 41, no. 6.

[5] J. Demmel, D. Eliahu, A. Fox, S. Kamil, B. Lipshitz, O. Schwartz, and O. Spillinger, "Communication-optimal parallel recursive rectangular matrix multiplication," in Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on, May 2013.

[6] Alok Aggarwal, Ashok K. Chandra, and Marc Snir, "Communication complexity of PRAMs," Theoretical Computer Science, vol. 71, no. 1, pp. 3-28.

[7] Alexander Tiskin, "Bulk-synchronous parallel Gaussian elimination," Journal of Mathematical Sciences, vol. 108, no. 6.

[8] Michael T. Heath and Charles H. Romine, "Parallel solution of triangular systems on distributed-memory multiprocessors," SIAM Journal on Scientific and Statistical Computing, vol. 9, no. 3.

[9] Edgar Solomonik, Erin Carson, Nicholas Knight, and James Demmel, "Tradeoffs between synchronization, communication, and computation in parallel linear algebra computations," in Proceedings of the 26th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '14), New York, NY, USA, 2014, ACM.

[10] Grey Ballard, James Demmel, Benjamin Lipshitz, Oded Schwartz, and Sivan Toledo, "Communication efficient Gaussian elimination with partial pivoting using a shape morphing data layout," in Proceedings of the Twenty-Fifth Annual ACM Symposium on Parallelism in Algorithms and Architectures, ACM, 2013.

[11] Benjamin Lipshitz, "Communication-avoiding parallel recursive algorithms for matrix multiplication," Tech. Rep., EECS Department, University of California, Berkeley.

[12] Jeremy J. Du Croz and Nicholas J. Higham, "Stability of methods for matrix inversion," IMA Journal of Numerical Analysis, vol. 12, no. 1, pp. 1-19.

[13] Raphael Hunger, "Floating point operations in matrix-vector calculus," Munich University of Technology, Inst. for Circuit Theory and Signal Processing, Munich.

[14] Gaël Guennebaud, Benoît Jacob, et al., "Eigen v3," http://eigen.tuxfamily.org.

[15] Philip J. Fleming and John J. Wallace, "How not to lie with statistics: the correct way to summarize benchmark results," Communications of the ACM, vol. 29, no. 3.

[16] T. Hoefler and R. Belli, "Scientific Benchmarking of Parallel Computing Systems," Nov. 2015, accepted at IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC15).


More information

A Metaheuristic Scheduler for Time Division Multiplexed Network-on-Chip

A Metaheuristic Scheduler for Time Division Multiplexed Network-on-Chip Downloaded from orbit.dtu.dk on: Jan 25, 2019 A Metaheuristic Scheduler for Time Division Multilexed Network-on-Chi Sørensen, Rasmus Bo; Sarsø, Jens; Pedersen, Mark Ruvald; Højgaard, Jasur Publication

More information

Patterned Wafer Segmentation

Patterned Wafer Segmentation atterned Wafer Segmentation ierrick Bourgeat ab, Fabrice Meriaudeau b, Kenneth W. Tobin a, atrick Gorria b a Oak Ridge National Laboratory,.O.Box 2008, Oak Ridge, TN 37831-6011, USA b Le2i Laboratory Univ.of

More information

An improved algorithm for Hausdorff Voronoi diagram for non-crossing sets

An improved algorithm for Hausdorff Voronoi diagram for non-crossing sets An imroved algorithm for Hausdorff Voronoi diagram for non-crossing sets Frank Dehne, Anil Maheshwari and Ryan Taylor May 26, 2006 Abstract We resent an imroved algorithm for building a Hausdorff Voronoi

More information

Multicast in Wormhole-Switched Torus Networks using Edge-Disjoint Spanning Trees 1

Multicast in Wormhole-Switched Torus Networks using Edge-Disjoint Spanning Trees 1 Multicast in Wormhole-Switched Torus Networks using Edge-Disjoint Sanning Trees 1 Honge Wang y and Douglas M. Blough z y Myricom Inc., 325 N. Santa Anita Ave., Arcadia, CA 916, z School of Electrical and

More information

Skip List Based Authenticated Data Structure in DAS Paradigm

Skip List Based Authenticated Data Structure in DAS Paradigm 009 Eighth International Conference on Grid and Cooerative Comuting Ski List Based Authenticated Data Structure in DAS Paradigm Jieing Wang,, Xiaoyong Du,. Key Laboratory of Data Engineering and Knowledge

More information

Improved heuristics for the single machine scheduling problem with linear early and quadratic tardy penalties

Improved heuristics for the single machine scheduling problem with linear early and quadratic tardy penalties Imroved heuristics for the single machine scheduling roblem with linear early and quadratic tardy enalties Jorge M. S. Valente* LIAAD INESC Porto LA, Faculdade de Economia, Universidade do Porto Postal

More information

Submission. Verifying Properties Using Sequential ATPG

Submission. Verifying Properties Using Sequential ATPG Verifying Proerties Using Sequential ATPG Jacob A. Abraham and Vivekananda M. Vedula Comuter Engineering Research Center The University of Texas at Austin Austin, TX 78712 jaa, vivek @cerc.utexas.edu Daniel

More information

Using Rational Numbers and Parallel Computing to Efficiently Avoid Round-off Errors on Map Simplification

Using Rational Numbers and Parallel Computing to Efficiently Avoid Round-off Errors on Map Simplification Using Rational Numbers and Parallel Comuting to Efficiently Avoid Round-off Errors on Ma Simlification Maurício G. Grui 1, Salles V. G. de Magalhães 1,2, Marcus V. A. Andrade 1, W. Randolh Franklin 2,

More information

A Study of Protocols for Low-Latency Video Transport over the Internet

A Study of Protocols for Low-Latency Video Transport over the Internet A Study of Protocols for Low-Latency Video Transort over the Internet Ciro A. Noronha, Ph.D. Cobalt Digital Santa Clara, CA ciro.noronha@cobaltdigital.com Juliana W. Noronha University of California, Davis

More information

The R-LRPD Test: Speculative Parallelization of Partially Parallel Loops

The R-LRPD Test: Speculative Parallelization of Partially Parallel Loops The R-LRPD Test: Seculative Parallelization of Partially Parallel Loos Francis Dang, Hao Yu, Lawrence Rauchwerger Det. of Comuter Science, Texas A&M University College Station, TX 778- {fhd,hy89,rwerger}@cs.tamu.edu

More information

IMS Network Deployment Cost Optimization Based on Flow-Based Traffic Model

IMS Network Deployment Cost Optimization Based on Flow-Based Traffic Model IMS Network Deloyment Cost Otimization Based on Flow-Based Traffic Model Jie Xiao, Changcheng Huang and James Yan Deartment of Systems and Comuter Engineering, Carleton University, Ottawa, Canada {jiexiao,

More information

CS 470 Spring Mike Lam, Professor. Performance Analysis

CS 470 Spring Mike Lam, Professor. Performance Analysis CS 470 Sring 2018 Mike Lam, Professor Performance Analysis Performance analysis Why do we arallelize our rograms? Performance analysis Why do we arallelize our rograms? So that they run faster! Performance

More information

OMNI: An Efficient Overlay Multicast. Infrastructure for Real-time Applications

OMNI: An Efficient Overlay Multicast. Infrastructure for Real-time Applications OMNI: An Efficient Overlay Multicast Infrastructure for Real-time Alications Suman Banerjee, Christoher Kommareddy, Koushik Kar, Bobby Bhattacharjee, Samir Khuller Abstract We consider an overlay architecture

More information

Learning Robust Locality Preserving Projection via p-order Minimization

Learning Robust Locality Preserving Projection via p-order Minimization Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence Learning Robust Locality Preserving Projection via -Order Minimization Hua Wang, Feiing Nie, Heng Huang Deartment of Electrical

More information

Power Savings in Embedded Processors through Decode Filter Cache

Power Savings in Embedded Processors through Decode Filter Cache Power Savings in Embedded Processors through Decode Filter Cache Weiyu Tang Rajesh Guta Alexandru Nicolau Deartment of Information and Comuter Science University of California, Irvine Irvine, CA 92697-3425

More information

Lecture 2: Fixed-Radius Near Neighbors and Geometric Basics

Lecture 2: Fixed-Radius Near Neighbors and Geometric Basics structure arises in many alications of geometry. The dual structure, called a Delaunay triangulation also has many interesting roerties. Figure 3: Voronoi diagram and Delaunay triangulation. Search: Geometric

More information

AUTOMATIC EXTRACTION OF BUILDING OUTLINE FROM HIGH RESOLUTION AERIAL IMAGERY

AUTOMATIC EXTRACTION OF BUILDING OUTLINE FROM HIGH RESOLUTION AERIAL IMAGERY AUTOMATIC EXTRACTION OF BUILDING OUTLINE FROM HIGH RESOLUTION AERIAL IMAGERY Yandong Wang EagleView Technology Cor. 5 Methodist Hill Dr., Rochester, NY 1463, the United States yandong.wang@ictometry.com

More information

Brigham Young University Oregon State University. Abstract. In this paper we present a new parallel sorting algorithm which maximizes the overlap

Brigham Young University Oregon State University. Abstract. In this paper we present a new parallel sorting algorithm which maximizes the overlap Aeared in \Journal of Parallel and Distributed Comuting, July 1995 " Overlaing Comutations, Communications and I/O in Parallel Sorting y Mark J. Clement Michael J. Quinn Comuter Science Deartment Deartment

More information

[9] J. J. Dongarra, R. Hempel, A. J. G. Hey, and D. W. Walker, \A Proposal for a User-Level,

[9] J. J. Dongarra, R. Hempel, A. J. G. Hey, and D. W. Walker, \A Proposal for a User-Level, [9] J. J. Dongarra, R. Hemel, A. J. G. Hey, and D. W. Walker, \A Proosal for a User-Level, Message Passing Interface in a Distributed-Memory Environment," Tech. Re. TM-3, Oak Ridge National Laboratory,

More information

Parallel Construction of Multidimensional Binary Search Trees. Ibraheem Al-furaih, Srinivas Aluru, Sanjay Goil Sanjay Ranka

Parallel Construction of Multidimensional Binary Search Trees. Ibraheem Al-furaih, Srinivas Aluru, Sanjay Goil Sanjay Ranka Parallel Construction of Multidimensional Binary Search Trees Ibraheem Al-furaih, Srinivas Aluru, Sanjay Goil Sanjay Ranka School of CIS and School of CISE Northeast Parallel Architectures Center Syracuse

More information

Matlab Virtual Reality Simulations for optimizations and rapid prototyping of flexible lines systems

Matlab Virtual Reality Simulations for optimizations and rapid prototyping of flexible lines systems Matlab Virtual Reality Simulations for otimizations and raid rototying of flexible lines systems VAMVU PETRE, BARBU CAMELIA, POP MARIA Deartment of Automation, Comuters, Electrical Engineering and Energetics

More information

Applying the fuzzy preference relation to the software selection

Applying the fuzzy preference relation to the software selection Proceedings of the 007 WSEAS International Conference on Comuter Engineering and Alications, Gold Coast, Australia, January 17-19, 007 83 Alying the fuzzy reference relation to the software selection TIEN-CHIN

More information

Using Permuted States and Validated Simulation to Analyze Conflict Rates in Optimistic Replication

Using Permuted States and Validated Simulation to Analyze Conflict Rates in Optimistic Replication Using Permuted States and Validated Simulation to Analyze Conflict Rates in Otimistic Relication An-I A. Wang Comuter Science Deartment Florida State University Geoff H. Kuenning Comuter Science Deartment

More information

To appear in IEEE TKDE Title: Efficient Skyline and Top-k Retrieval in Subspaces Keywords: Skyline, Top-k, Subspace, B-tree

To appear in IEEE TKDE Title: Efficient Skyline and Top-k Retrieval in Subspaces Keywords: Skyline, Top-k, Subspace, B-tree To aear in IEEE TKDE Title: Efficient Skyline and To-k Retrieval in Subsaces Keywords: Skyline, To-k, Subsace, B-tree Contact Author: Yufei Tao (taoyf@cse.cuhk.edu.hk) Deartment of Comuter Science and

More information

Fast Distributed Process Creation with the XMOS XS1 Architecture

Fast Distributed Process Creation with the XMOS XS1 Architecture Communicating Process Architectures 20 P.H. Welch et al. (Eds.) IOS Press, 20 c 20 The authors and IOS Press. All rights reserved. Fast Distributed Process Creation with the XMOS XS Architecture James

More information

Parallel Mesh Generation

Parallel Mesh Generation Parallel Mesh Generation Nikos Chrisochoides Comuter Science Deartment College of William and Mary Williamsburg, VA 23185 and Division of Alied Mathematics Brown University 182 George Street Providence,

More information

Autonomic Physical Database Design - From Indexing to Multidimensional Clustering

Autonomic Physical Database Design - From Indexing to Multidimensional Clustering Autonomic Physical Database Design - From Indexing to Multidimensional Clustering Stehan Baumann, Kai-Uwe Sattler Databases and Information Systems Grou Technische Universität Ilmenau, Ilmenau, Germany

More information

Improving Trust Estimates in Planning Domains with Rare Failure Events

Improving Trust Estimates in Planning Domains with Rare Failure Events Imroving Trust Estimates in Planning Domains with Rare Failure Events Colin M. Potts and Kurt D. Krebsbach Det. of Mathematics and Comuter Science Lawrence University Aleton, Wisconsin 54911 USA {colin.m.otts,

More information

Avoiding Communication in Sparse Matrix Computations

Avoiding Communication in Sparse Matrix Computations Avoiding Communication in Sarse Matrix Comutations James Demmel, Mark Hoemmen, Marghoob Mohiyuddin, and Katherine Yelick Deartment of Electrical Engineering and Comuter Science University of California

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms Chapter 3 Dense Linear Systems Section 3.3 Triangular Linear Systems Michael T. Heath and Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign

More information

Non-Strict Independence-Based Program Parallelization Using Sharing and Freeness Information

Non-Strict Independence-Based Program Parallelization Using Sharing and Freeness Information Non-Strict Indeendence-Based Program Parallelization Using Sharing and Freeness Information Daniel Cabeza Gras 1 and Manuel V. Hermenegildo 1,2 Abstract The current ubiuity of multi-core rocessors has

More information

SPARSE SIGNAL REPRESENTATION FOR COMPLEX-VALUED IMAGING Sadegh Samadi 1, M üjdat Çetin 2, Mohammad Ali Masnadi-Shirazi 1

SPARSE SIGNAL REPRESENTATION FOR COMPLEX-VALUED IMAGING Sadegh Samadi 1, M üjdat Çetin 2, Mohammad Ali Masnadi-Shirazi 1 SPARSE SIGNAL REPRESENTATION FOR COMPLEX-VALUED IMAGING Sadegh Samadi 1, M üjdat Çetin, Mohammad Ali Masnadi-Shirazi 1 1. Shiraz University, Shiraz, Iran,. Sabanci University, Istanbul, Turkey ssamadi@shirazu.ac.ir,

More information

A Reconfigurable Architecture for Quad MAC VLIW DSP

A Reconfigurable Architecture for Quad MAC VLIW DSP A Reconfigurable Architecture for Quad MAC VLIW DSP Sangwook Kim, Sungchul Yoon, Jaeseuk Oh, Sungho Kang Det. of Electrical & Electronic Engineering, Yonsei University 132 Shinchon-Dong, Seodaemoon-Gu,

More information

Low Power Implementations for Adaptive Filters

Low Power Implementations for Adaptive Filters Low Power Imlementations for Adative ilters Marius Vollmer Stehan Klauke Jürgen Götze University of ortmund Information Processing Lab htt://www-dt.e-technik.uni-dortmund.de marius.vollmer stehan.klauke

More information

A Petri net-based Approach to QoS-aware Configuration for Web Services

A Petri net-based Approach to QoS-aware Configuration for Web Services A Petri net-based Aroach to QoS-aware Configuration for Web s PengCheng Xiong, YuShun Fan and MengChu Zhou, Fellow, IEEE Abstract With the develoment of enterrise-wide and cross-enterrise alication integration

More information

EE678 Application Presentation Content Based Image Retrieval Using Wavelets

EE678 Application Presentation Content Based Image Retrieval Using Wavelets EE678 Alication Presentation Content Based Image Retrieval Using Wavelets Grou Members: Megha Pandey megha@ee. iitb.ac.in 02d07006 Gaurav Boob gb@ee.iitb.ac.in 02d07008 Abstract: We focus here on an effective

More information

A Symmetric FHE Scheme Based on Linear Algebra

A Symmetric FHE Scheme Based on Linear Algebra A Symmetric FHE Scheme Based on Linear Algebra Iti Sharma University College of Engineering, Comuter Science Deartment. itisharma.uce@gmail.com Abstract FHE is considered to be Holy Grail of cloud comuting.

More information

Process and Measurement System Capability Analysis

Process and Measurement System Capability Analysis Process and Measurement System aability Analysis Process caability is the uniformity of the rocess. Variability is a measure of the uniformity of outut. Assume that a rocess involves a quality characteristic

More information

Summary. A simple model for point-to-point messages. Small message broadcasts in the α-β model. Messaging in the LogP model.

Summary. A simple model for point-to-point messages. Small message broadcasts in the α-β model. Messaging in the LogP model. Summary Design of Parallel and High-Performance Computing: Distributed-Memory Models and lgorithms Edgar Solomonik ETH Zürich December 9, 2014 Lecture overview Review: α-β communication cost model LogP

More information

Lecture 8: Orthogonal Range Searching

Lecture 8: Orthogonal Range Searching CPS234 Comutational Geometry Setember 22nd, 2005 Lecture 8: Orthogonal Range Searching Lecturer: Pankaj K. Agarwal Scribe: Mason F. Matthews 8.1 Range Searching The general roblem of range searching is

More information

Accurate, Efficient and Scalable Graph Embedding

Accurate, Efficient and Scalable Graph Embedding Accurate, Efficient and Scalable Grah Embedding Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgoal Kannan, Viktor Prasanna University of Southern California Los Angeles, USA {zengh, hongkuaz, ajiteshs,

More information

Models for Advancing PRAM and Other Algorithms into Parallel Programs for a PRAM-On-Chip Platform

Models for Advancing PRAM and Other Algorithms into Parallel Programs for a PRAM-On-Chip Platform Models for Advancing PRAM and Other Algorithms into Parallel Programs for a PRAM-On-Chi Platform Uzi Vishkin George C. Caragea Bryant Lee Aril 2006 University of Maryland, College Park, MD 20740 UMIACS-TR

More information

Multigrain Parallel Delaunay Mesh Generation: Challenges and Opportunities for Multithreaded Architectures

Multigrain Parallel Delaunay Mesh Generation: Challenges and Opportunities for Multithreaded Architectures Multigrain Parallel Delaunay Mesh Generation: Challenges and Oortunities for Multithreaded Architectures Christos D. Antonooulos, Xiaoning Ding, Andrey Chernikov, Fili Blagojevic, Dimitrios S. Nikolooulos,

More information

Privacy Preserving Moving KNN Queries

Privacy Preserving Moving KNN Queries Privacy Preserving Moving KNN Queries arxiv:4.76v [cs.db] 4 Ar Tanzima Hashem Lars Kulik Rui Zhang National ICT Australia, Deartment of Comuter Science and Software Engineering University of Melbourne,

More information

Leak Detection Modeling and Simulation for Oil Pipeline with Artificial Intelligence Method

Leak Detection Modeling and Simulation for Oil Pipeline with Artificial Intelligence Method ITB J. Eng. Sci. Vol. 39 B, No. 1, 007, 1-19 1 Leak Detection Modeling and Simulation for Oil Pieline with Artificial Intelligence Method Pudjo Sukarno 1, Kuntjoro Adji Sidarto, Amoranto Trisnobudi 3,

More information

An Efficient Coding Method for Coding Region-of-Interest Locations in AVS2

An Efficient Coding Method for Coding Region-of-Interest Locations in AVS2 An Efficient Coding Method for Coding Region-of-Interest Locations in AVS2 Mingliang Chen 1, Weiyao Lin 1*, Xiaozhen Zheng 2 1 Deartment of Electronic Engineering, Shanghai Jiao Tong University, China

More information

A Scalable Parallel Approach for Peptide Identification from Large-scale Mass Spectrometry Data

A Scalable Parallel Approach for Peptide Identification from Large-scale Mass Spectrometry Data 2009 International Conference on Parallel Processing Workshos A Scalable Parallel Aroach for Petide Identification from Large-scale Mass Sectrometry Data Gaurav Kulkarni, Ananth Kalyanaraman School of

More information

A GPU Heterogeneous Cluster Scheduling Model for Preventing Temperature Heat Island

A GPU Heterogeneous Cluster Scheduling Model for Preventing Temperature Heat Island A GPU Heterogeneous Cluster Scheduling Model for Preventing Temerature Heat Island Yun-Peng CAO 1,2,a and Hai-Feng WANG 1,2 1 School of Information Science and Engineering, Linyi University, Linyi Shandong,

More information

S16-02, URL:

S16-02, URL: Self Introduction A/Prof ay Seng Chuan el: Email: scitaysc@nus.edu.sg Office: S-0, Dean s s Office at Level URL: htt://www.hysics.nus.edu.sg/~hytaysc I was a rogrammer from to. I have been working in NUS

More information