Communication-Avoiding Parallel Algorithms for Solving Triangular Matrix Equations


Research Collection Bachelor Thesis

Communication-Avoiding Parallel Algorithms for Solving Triangular Matrix Equations

Author(s): Wicky, Tobias
Publication Date: 2015
Permanent Link: https://doi.org/ /ethz-a
Rights / License: In Copyright - Non-Commercial Use Permitted

This page was generated automatically upon download from the ETH Zurich Research Collection. For more information please consult the Terms of use.

ETH Library

Communication-Avoiding Parallel Algorithms for Solving Triangular Matrix Equations

Bachelor Thesis
Tobias Wicky
Friday 30th October, 2015
Advisors: Prof. Dr. T. Hoefler, Dr. E. Solomonik
Department of Mathematics, ETH Zürich


Abstract

In this work an algorithm for solving triangular systems of equations for multiple right-hand sides is presented. Solving triangular systems for multiple right-hand sides, commonly referred to as the TRSM problem, is a very important problem in dense linear algebra, as it is a subroutine for most matrix decompositions such as LU or QR. To improve performance over the standard iterative algorithms for TRSM, a block-wise inversion paired with triangular matrix multiplications is used. To perform the inversion, the lower-triangular form of the matrix is exploited and a recursive scheme is applied to further decrease communication cost. With that, the latency of the algorithm decreases while the bandwidth and floating-point operation counts stay asymptotically the same. Concretely, a decrease in latency by a factor of p^(2/3)/log p was achieved for a significant range of relative matrix sizes when working with p processors. The proposed method is implemented and its performance is benchmarked against the widely used ScaLAPACK [1] library. The results show promising tendencies for the inversion, with a maximal speedup of 1.7 over ScaLAPACK for 4096 processors. Due to the inferior performance of the triangular matrix multiplication with respect to the triangular solve, no overall improvement is made yet.


Contents

1 Introduction
2 Previous Work
  2.1 Matrix Multiplication
  2.2 Triangular matrix solve for single right-hand sides
  2.3 Triangular matrix solve for multiple right-hand sides
3 Communication in TRSM
  3.1 Execution Time Model
  3.2 Recursive TRSM Algorithm
      Choice of base-case size
4 Triangular Matrix Inversion
5 TRSM with Inversion
  5.1 TRSM with full inversion
  5.2 Recursive TRSM with Block Inversion
  5.3 Summary
6 Implementation
7 Experimental Results
8 Further Work
9 Conclusion
Bibliography


Chapter 1

Introduction

The goal of this work is to find a communication-minimizing algorithm for solving triangular matrix equations with multiple right-hand sides.

Motivation: With the decreasing impact of the floating-point operation count on the total execution time of a program, communication costs will turn out to be a more and more important factor. The goal therefore is to find an algorithm that stays asymptotically optimal in terms of flop cost while decreasing the communication cost. The approach taken here is to specifically decrease the asymptotic latency cost while keeping the bandwidth constant. We take this approach because we assume that the asymptotic upper bounds on the bandwidth are very strong for the standard recursive approach as described in [2], since they are equal to the bandwidth costs required for matrix multiplication. TRSM is a crucial problem in many applications, as it is a subroutine for a lot of algorithms in dense linear algebra, for example the LU decomposition described in [3], or the QR decomposition. Another very important application of TRSM is the recursive Cholesky factorization, where it is the base-case algorithm as described in [4]. It is also the critical routine for solving general dense linear systems of equations.

Problem Definition: The linear system of equations L X = B will be solved for X ∈ R^(n×k), where L ∈ R^(n×n) denotes a lower-triangular matrix and B ∈ R^(n×k) a dense matrix. In this work we will only account for communication cost (bandwidth and latency) that arises due to communication between different compute units.

In the theoretical part of this work, three algorithms are presented to solve the problem recursively, where each algorithm takes a different approach: The algorithm referred to as Recursive TRSM splits the initial problem into smaller subproblems, until the problem

is small enough such that each compute unit can solve the given subproblem for some different right-hand sides. This first approach is based on the approach taken in [2]. With a detailed cost analysis we will describe optimal base-case sizes depending on the relative matrix sizes. This approach is explained in detail in Section 3.2. The algorithm referred to as TRSM with full inversion inverts the triangular matrix completely in a recursive fashion and then solves the system by a triangular multiplication of the inverted matrix L⁻¹ with the right-hand side B. This approach is discussed due to the low latency that is required to do inversion and matrix multiplication. For some relative matrix sizes this approach incurs an overall higher bandwidth cost and floating-point operation count at the price of the low latency. It is discussed in detail in Section 5.1. The algorithm referred to as Recursive-Inversion TRSM combines the two approaches, reducing the problem recursively up until a certain point and inverting the small problem to solve it. With this approach we aim at a lower latency due to the use of the inversion, as well as keeping the bandwidth and floating-point operation count low, as we can choose the base-case size as desired. With a cost analysis of this approach, we were able to find a cost-optimal base-case size that decreases the latency and keeps bandwidth as well as flop cost constant compared to the recursive TRSM. With p processors working on the problem, the decrease of latency was obtained for a large range of relative matrix sizes k/n, where a gain of a factor of p^(2/3)/log p was achieved. It is discussed in Section 5.2.

Results: To see how the algorithm performs in practice we show scaling plots of the Recursive-Inversion TRSM compared to the method that ScaLAPACK [1] provides. We were able to see that our approach of inverting the lower-triangular matrix is faster than what ScaLAPACK provides.
Even though this is an important part of the algorithm, we observed worse scaling for the total time to solution for TRSM. This can be explained by the fact that one very slow part of the algorithm was the triangular matrix multiplication, which uses ScaLAPACK's implementation. The plots suggest a good scaling behavior, but they show that the problem size used was very small and that with increasing problem sizes, better results are expected.

Chapter 2

Previous Work

In this chapter, the previous work relevant to the topic is introduced: A cost analysis for matrix multiplication is presented and a standard method for solving the TRSM problem is shown.

2.1 Matrix Multiplication

In this section the relevant results from the CARMA algorithm, including the cost analysis for matrix multiplication presented in [5], are summarized. One of the key parts of recursive algorithms for solving the TRSM problem are matrix-matrix multiplications. In [5], Demmel et al. present communication-optimal algorithms for matrix multiplication with the respective costs. The algorithm we present is based on the matrix multiplication presented in their work, where a threefold case distinction occurs. The fact that these three cases all have different bounds on communication costs imposes the constraint that the algorithm presented here also has to make the same case distinction.

Initial conditions: We consider the matrix multiplication of a dense matrix A ∈ R^(n×n) with another dense matrix B ∈ R^(n×k), executed on p processors that are aligned on a processor grid Π.

Bandwidth: The CARMA algorithm [5], which we refer to as C = MM(A, B, n, k, Π, p), achieves communication bandwidth cost W_MM(n, k, p), which can be subdivided into three cases:

    W_MM(n, k, p) = O(nk/√p)          if n > k√p          (two large dimensions)
                    O((n²k/p)^(2/3))  if k/p ≤ n ≤ k√p    (three large dimensions)
                    O(n²)             if n < k/p          (one large dimension)

In the case of one large dimension, where the right-hand side B is larger than the triangular matrix A, the best way to do a matrix multiplication is to use a one-dimensional layout for the processor grid. In the case of two large dimensions, where the matrix A is much larger than the right-hand side B, the best way of performing a matrix multiplication is to use a two-dimensional layout for the processor grid. And for the case of three large dimensions, where the matrices A and B are approximately of the same size, it is proposed to use a three-dimensional grid layout.

Latency: The latency cost of matrix multiplication given unlimited memory is

    S_MM(n, k, p) = O(log p)

Flop Cost: Matrix multiplication takes O(n²k) flops, which can be divided over p processors, and therefore we have

    F_MM(n, k, p) = O(n²k/p)

Previous Analysis: For the case where k = n, the bandwidth analysis of a general matrix multiplication goes back to what is presented in [6]. Aggarwal et al. presented a cost analysis in the LPRAM model. In this work, the authors showed that the same cost can also be achieved for the transitive closure problem, which can be extended to the problem of computing an LU decomposition. The fact that these bandwidth costs can be obtained for the LU decomposition was later demonstrated by Tiskin in [7]. He used the bulk synchronous parallel (BSP) execution time model. Since the dependencies in LU are more complicated than they are for TRSM, we also expected TRSM to be able to attain the same asymptotic costs as a general matrix multiplication.

2.2 Triangular matrix solve for single right-hand sides

Algorithms for the problem of a triangular solve for a single right-hand side (when X and B are vectors) have been well studied. A communication-efficient parallel algorithm was given by Heath and Romine [8]. This parallel algorithm was later shown to be an optimal schedule in latency and bandwidth costs via lower bounds [9].
However, when X and B are matrices (k > 1), it is possible to achieve significantly lower communication costs relative to the amount of computation required.
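As a point of reference, the substitution idea for a single right-hand side extends directly to k columns. The following is a dependency-free Python sketch of that sequential baseline (function and variable names are ours, not from the thesis' implementation):

```python
# Reference (sequential) TRSM: solve L X = B by forward substitution,
# where L is an n x n lower-triangular matrix and B is n x k dense.
# Matrices are plain lists of lists to keep the sketch self-contained.

def trsm_forward(L, B):
    n = len(L)
    k = len(B[0])
    X = [[0.0] * k for _ in range(n)]
    for j in range(k):            # each right-hand side independently
        for i in range(n):
            s = B[i][j]
            for m in range(i):    # subtract already-solved entries
                s -= L[i][m] * X[m][j]
            X[i][j] = s / L[i][i]
    return X
```

This costs O(n²k) flops in total, which is the operation count the parallel algorithms below distribute over p processors.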

2.3 Triangular matrix solve for multiple right-hand sides

The idea of recursively solving triangular matrix systems with many right-hand sides already showed up a long time ago in the work of Elmroth et al. [2]. There are two main ways to split the problem into smaller subproblems:

Case 1: Splitting the right-hand side into two (independent) subproblems, A X = B becoming

    A [X₁ X₂] = [B₁ B₂]

where the subproblems

    A X₁ = B₁
    A X₂ = B₂

are solved independently.

Case 2: Splitting the triangular matrix into two dependent subtasks in the fashion of

    [A₁₁  0 ] [X₁]   [B₁]
    [A₂₁ A₂₂] [X₂] = [B₂]

where the subproblems

    A₁₁ X₁ = B₁
    A₂₂ X₂ = B₂ − A₂₁ X₁

are solved in sequence.

With a proper mixing of both cases and a new approach to calculating the base cases of these recursions, it was possible to achieve good bandwidth and latency costs. In [10], it is shown that a sequential execution of the TRSM algorithm can achieve the same bandwidth cost as a general matrix-matrix multiplication. This is done using the α-β model and accounting for different cache-line sizes. The cost analysis done in [11] shows that, bandwidth-wise, one can achieve costs for TRSM that are not worse than what a general matrix multiplication achieves. We do a similar analysis with a different model in Section 3.2.

The stability of inverting a triangular matrix has been studied in [12]. It has been stated that no general error bounds exist for blocked methods with respect to the error bound of the iterative methods. We leave the investigation of the stability of our approach as further work.
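The Case-2 splitting can be sketched sequentially; the update B₂ − A₂₁X₁ between the two sub-solves is the step that later becomes a parallel matrix multiplication (helper names are ours):

```python
# Case-2 recursion: split L into 2x2 blocks, solve the top subproblem,
# update the lower right-hand side, then solve the bottom subproblem.
# Base case: plain forward substitution.

def solve_base(L, B):
    n, k = len(L), len(B[0])
    X = [[0.0] * k for _ in range(n)]
    for j in range(k):
        for i in range(n):
            s = B[i][j] - sum(L[i][m] * X[m][j] for m in range(i))
            X[i][j] = s / L[i][i]
    return X

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def rec_trsm(L, B, n0=1):
    n = len(L)
    if n <= n0:
        return solve_base(L, B)
    h = n // 2
    X1 = rec_trsm([row[:h] for row in L[:h]], B[:h], n0)
    L21 = [row[:h] for row in L[h:]]
    U = matmul(L21, X1)                       # the update term A21 * X1
    B2 = [[B[h + i][j] - U[i][j] for j in range(len(B[0]))]
          for i in range(n - h)]
    X2 = rec_trsm([row[h:] for row in L[h:]], B2, n0)
    return X1 + X2
```

Case 1 (splitting B column-wise) needs no update step at all, which is why the right-hand sides can be distributed over processors independently.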


Chapter 3

Communication in TRSM

In this chapter we provide a model to calculate parallel execution time and we make a cost analysis of the recursive TRSM algorithm using the communication upper bounds that were presented in the previous chapter.

3.1 Execution Time Model

The model we use to calculate the parallel execution time of an algorithm along its critical path is the α-β-γ model. It describes the total execution time T of the algorithm in terms of the floating-point operation (flop) count F, the bandwidth W and the latency (synchronization cost) S along the critical path:

    T = γ·F + β·W + α·S

We do not place constraints on the local memory size. As it is assumed that, with time, computing elements will become faster and with that a decrease of γ is expected, the goal of this work is to find an algorithmic approach to solving triangular matrix equations with multiple right-hand sides that only increases the flop cost F by a constant, while decreasing the latency S. This is a reasonable approach, since the importance of α and β grows as γ gets lower.

3.2 Recursive TRSM Algorithm

In this section, the algorithmic approach to solving TRSM recursively is presented, together with a cost analysis of the recursive approach.

Algorithmic approach: The algorithm to solve many triangular systems of equations, commonly referred to as the TRSM problem, reads: Given a lower-triangular matrix L ∈ R^(n×n) and a dense matrix B ∈ R^(n×k), the goal is to

compute the matrix X ∈ R^(n×k) such that L X = B. The recursive algorithm that is presented by Elmroth et al. in [2] subdivides the L matrix into a 2×2 set of blocks at each step and performs two TRSM calls in sequence with all processors at each recursive level. It is shown as Algorithm 1.

Algorithm 1: X = Rec-TRSM(L, B, n, k, Π, p, n₀)
Require: L is a lower-triangular n×n matrix and B is a rectangular n×k matrix, both distributed over Π = p processors.
  If n ≤ n₀, allgather L onto all processors, subdivide B = [B₁, ..., B_p] and X = [X₁, ..., X_p] and compute Xᵢ = L⁻¹ Bᵢ with the i-th processor.
  Subdivide L into n/2 × n/2 blocks, L = [L₁₁ 0; L₂₁ L₂₂]
  Subdivide B and X into n/2 × k blocks, B = [B₁; B₂], X = [X₁; X₂]
  Compute X₁ = Rec-TRSM(L₁₁, B₁, n/2, k, Π, p, n₀).
  Compute B₂ = B₂ − MM(L₂₁, X₁, n/2, k, Π, p).
  Compute X₂ = Rec-TRSM(L₂₂, B₂, n/2, k, Π, p, n₀).
Ensure: L X = B.

Rec-TRSM(L, B, n, k, Π, p, n₀) requires two recursive calls and a matrix multiplication at each recursive level until n ≤ n₀.

Bandwidth Cost: This approach yields the communication cost recurrence

    W_Rec-TRSM(n, k, p, n₀) = W_MM(n/2, k, p) + 2·W_Rec-TRSM(n/2, k, p, n₀)

which decreases geometrically at each level as long as n < k√p. At the base case, the allgather of L requires a communication cost of

    W_Rec-TRSM(n₀, k, p, n₀) = O(n₀²)

There are n/n₀ base cases that are executed in sequence using all processors, for a total cost of

    W_base-cases(n, k, p, n₀) = (n/n₀)·W_Rec-TRSM(n₀, k, p, n₀) = O(n·n₀)

Choice of base-case size: We desire that W_base-cases(n, k, p, n₀) ≤ W_Rec-TRSM(n, k, p, n₀), which implies that we need a different choice of n₀ depending on the initial size of our matrix:

One large dimension: In this case, because the matrix A is small, it makes no sense to split it, and therefore we pick n₀ = n; with that choice, no recursion occurs.

Two large dimensions: When the initial matrix multiplication costs O(nk/√p) and we want the base cases not to dominate, we choose

    n₀ = max(1, k/√p)

It is important to note that as we recurse, the matrix multiplications at each level cost the same amount of bandwidth, and therefore we pick up a logarithmic factor for the total bandwidth.

Three large dimensions: When the initial matrix multiplication costs O((n²k/p)^(2/3)), we select

    n₀ = max(1, (nk²/p²)^(1/3))

Latency: The latency cost is dominated by the n/n₀ base cases, since they may not be executed concurrently and therefore comprise an execution path within the algorithm of S_Rec-TRSM(n, k, p, n₀) = (n/n₀)·S_MM(n₀, k, p). Each base case requires an allgather, which implies S_MM(n₀, k, p) = O(log p) latency cost, yielding an overall latency cost of

    S_Rec-TRSM(n, k, p, n₀) = O((n/n₀)·log p)

This general cost leads to the following latency costs, when the choice of n₀ is made to minimize bandwidth cost with respect to the initial matrix multiplication, as done above.

One large dimension:

    S_Rec-TRSM(n, k, p, n) = O(log p)

Two large dimensions:

    S_Rec-TRSM(n, k, p, max(1, k/√p)) = O(min(n, n√p/k)·log p)

Three large dimensions:

    S_Rec-TRSM(n, k, p, max(1, (nk²/p²)^(1/3))) = O(min(n, (np/k)^(2/3))·log p)
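The base-case selection above can be collected into a small helper. This is a sketch under the regime boundaries of Section 2.1; the function name and the return type are ours:

```python
import math

# Base-case size n0 for Rec-TRSM, chosen so the sequential base cases
# do not dominate the bandwidth of the initial matrix multiplication.
def base_case_size(n, k, p):
    if n < k / p:                             # one large dimension
        return float(n)                       # n0 = n: no recursion
    if n > k * math.sqrt(p):                  # two large dimensions
        return max(1.0, k / math.sqrt(p))     # n0 = max(1, k/sqrt(p))
    # three large dimensions: n0 = max(1, (n*k^2/p^2)^(1/3))
    return max(1.0, (n * k * k / (p * p)) ** (1.0 / 3.0))
```

For example, with n = k = 100 and p = 8 (three large dimensions), this yields n₀ = (100·100²/64)^(1/3) = 25.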

Flop Cost: The flop cost of this algorithm is dominated by the top-level matrix multiplications and therefore amounts to

    F_Rec-TRSM(n, k, p) = O(n²k/p)
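As a toy illustration of the α-β-γ model from Section 3.1, the derived counts can be plugged in to predict execution time. The machine parameters in any such comparison are inputs, not part of the thesis; the function below is our sketch:

```python
# Predicted critical-path time under the alpha-beta-gamma model:
# T = gamma*F + beta*W + alpha*S, where F is the flop count, W the
# bandwidth cost and S the latency (synchronization) cost.
def exec_time(F, W, S, gamma, beta, alpha):
    return gamma * F + beta * W + alpha * S
```

Evaluating this for two algorithms with equal F and W but different S makes the chapter's point concrete: as γ shrinks relative to α, the algorithm with the shorter critical path (smaller S) wins.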

Chapter 4

Triangular Matrix Inversion

In this chapter, the algorithmic approach to inverting a lower-triangular system recursively is presented, together with a cost analysis of this approach.

Algorithmic approach: As the latency of the algorithm discussed previously grows with the number of processors involved, its scalability is limited. Therefore, this approach seems suboptimal, and the goal is to find an algorithm that requires less latency. Noting that matrix multiplication requires little latency, the cost of inverting a triangular matrix was investigated. Parallel triangular matrix inversion can be done with a shorter critical path than the TRSM approach discussed before. This approach is shown in Algorithm 2. Note that whenever a call to MM is made, we mean the matrix multiplication mentioned in [5]. The approach taken in Algorithm 2 is the following: Each problem is subdivided into two recursive matrix inversions, which are executed concurrently with half the processors each; then two matrix multiplications are performed to complete the inversion.

Bandwidth cost: Since we execute the two recursive calls in Rec-Tri-Inv(L, n, Π, p) simultaneously with two disjoint sets of processors, the communication cost recurrence is given by

    W_Rec-Tri-Inv(n, p) = 2·W_MM(n/2, n/2, p) + W_Rec-Tri-Inv(n/2, p/2)

If we assume n to be sufficiently large, p = 1 should be reached at the base case, which is when n₀ = n/2^(log p) = n/p; therefore there are a total of log p recursive levels, each of which requires matrix multiplications. The communication cost associated with each level decreases geometrically, so the total cost of the algorithm is dominated by the cost of the top-level matrix multiplication, which is

    W_MM(n/2, n/2, p) = O(n²/p^(2/3))

Algorithm 2: L⁻¹ = Rec-Tri-Inv(L, n, Π, p)
Require: L, a lower-triangular n×n matrix distributed over Π = p processors.
  if p = 1 then
    L⁻¹ = sequential_inversion(L)
  else
    Subdivide L into n/2 × n/2 blocks, L = [L₁₁ 0; L₂₁ L₂₂]
    Subdivide Π = [Π₁, Π₂], where Π₁ and Π₂ each contain p/2 processors
    L₁₁⁻¹ = Rec-Tri-Inv(L₁₁, n/2, Π₁, p/2)
    L₂₂⁻¹ = Rec-Tri-Inv(L₂₂, n/2, Π₂, p/2)
    (L⁻¹)₂₁ = MM(L₂₂⁻¹, L₂₁, n/2, n/2, Π, p)
    (L⁻¹)₂₁ = −MM((L⁻¹)₂₁, L₁₁⁻¹, n/2, n/2, Π, p)
    Assemble L⁻¹ from the n/2 × n/2 blocks, L⁻¹ = [L₁₁⁻¹ 0; (L⁻¹)₂₁ L₂₂⁻¹]
  end
Ensure: L L⁻¹ = I

Since the cost of the matrix multiplications decreases as n and p are both decreased by a factor of two, the initial matrix multiplication dominates the bandwidth cost,

    W_Rec-Tri-Inv(n, p) = O(W_MM(n/2, n/2, p)) = O(n²/p^(2/3)).

Latency cost: Since there are log p recursive levels and at each step we do a matrix-matrix multiplication, the latency cost is

    S_Rec-Tri-Inv(n, p) = O(log² p)

Flop cost: The total flop cost for the inversion is

    F_Rec-Tri-Inv(n, n₀, p) = F_Base-Cases(n, n₀, p) + F_MM(n/2, n/2, p)

The flop cost of a sequential inversion of a triangular matrix is, as stated by Hunger in [13],

    F_Seq-Inv(n) = (1/3)·n³

and the total base-case flop cost is

    F_Base-Cases(n, n₀, p) = (n/n₀)·F_Seq-Inv(n₀) = (1/3)·n·n₀² = (1/3)·n³/p²

This is dominated by the top-level matrix multiplication, on which every processor works and which therefore needs

    F_MM(n/2, n/2, p) = O(n³/p)

flops. Since the top-level matrix multiplication is the most expensive and the costs of the other levels decrease geometrically, this gives us the cost of

    F_Rec-Tri-Inv(n, n₀, p) = O(n³/p)
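A serial Python sketch of the block recursion behind Algorithm 2 (names are ours; in the parallel algorithm the two recursive inversions run concurrently on disjoint processor halves, here they simply run one after the other):

```python
# Block-recursive inversion of a lower-triangular matrix:
#   inv(L) = [[ inv(L11),                    0        ],
#             [ -inv(L22) * L21 * inv(L11),  inv(L22) ]]
# which follows from multiplying out L * inv(L) = I blockwise.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def tri_inv(L):
    n = len(L)
    if n == 1:                                # sequential base case
        return [[1.0 / L[0][0]]]
    h = n // 2
    inv11 = tri_inv([row[:h] for row in L[:h]])
    inv22 = tri_inv([row[h:] for row in L[h:]])
    L21 = [row[:h] for row in L[h:]]
    inv21 = [[-x for x in row] for row in matmul(matmul(inv22, L21), inv11)]
    return ([inv11[i] + [0.0] * (n - h) for i in range(h)] +
            [inv21[i] + inv22[i] for i in range(n - h)])
```

Note the two multiplications per level mirror the two MM calls in Algorithm 2, and the off-diagonal block carries the minus sign required for L·L⁻¹ = I.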


Chapter 5

TRSM with Inversion

In this chapter, we discuss approaches to solving the TRSM problem using the inversion derived in Chapter 4. The idea of using inversion for its low latency cost arises from what Tiskin did in [7]: He used inversion to decrease the latency in the LU factorization. We discuss optimal base-case choices depending on relative matrix sizes, both for TRSM with full inversion and for recursive TRSM with block inversion.

5.1 TRSM with full inversion

In this section, the algorithm for solving TRSM with a complete inversion of L is given. We provide a cost analysis for this method as well as an optimal base-case choice.

Algorithmic approach: If TRSM is done with full inversion of the matrix, the algorithm works as described in Algorithm 3.

Algorithm 3: X = Inv-TRSM(L, B, n, k, Π)
Require: L is a lower-triangular n×n matrix and B is a rectangular n×k matrix, both distributed over Π = p processors.
  L⁻¹ = Rec-Tri-Inv(L, n, Π, p)
  X = MM(L⁻¹, B, n, k, Π, p)
Ensure: L X = B.

Bandwidth cost: This approach leads to a total bandwidth cost of

    W_Inv-TRSM(n, k, p, n₀) = W_Rec-Tri-Inv(n, p, n₀) + W_MM(n, k, p)
      = O(n²/p^(2/3))                    n > k√p        (two large dimensions)
        O((n²k/p)^(2/3) + n²/p^(2/3))    k/p ≤ n ≤ k√p  (three large dimensions)
        O(n²)                            n < k/p        (one large dimension)

To not be dominated by the matrix inversion, we therefore need W_MM(n, k, p) > W_Rec-Tri-Inv(n, p) = O(n²/p^(2/3)). This is only the case when n < k, which makes sense, since otherwise the full inversion would obviously be the dominating part, as the matrix is larger than the right-hand side.

Latency cost: This approach leads to a total latency cost of

    S_Inv-TRSM(n, k, p, n₀) = S_Rec-Tri-Inv(n, p, n₀) + S_MM(n, k, p) = O(log² p),

for all three cases.

Flop cost: The flop cost of this algorithm is

    F_Inv-TRSM(n, k, p, n₀) = F_Rec-Tri-Inv(n, p, n₀) + F_MM(n, k, p) = O(n³/p) + O(n²k/p).

Therefore, the algorithm requires substantially more computation than Algorithm 1 when n > k.

5.2 Recursive TRSM with Block Inversion

In this section, we discuss an algorithm for solving TRSM recursively with a complete inversion as the base case. We provide a cost analysis for this method as well as optimal base-case choices depending on the relative matrix sizes.

Algorithmic approach: The goal is to keep the latency as low as possible without increasing the asymptotic flop or bandwidth cost. We want to achieve that the multiplication with the right-hand side is the bandwidth-wise dominant part, and not the matrix inversion (which has cost O(n²/p^(2/3))), since the former is bandwidth we necessarily have to pay. To achieve this, it is necessary that n is not much larger than k. Therefore, recursive steps are taken as in Algorithm 1 in Section 3.2. Once a recursive level is reached where n

is sufficiently small relative to k, a switch to Algorithm 3 is performed, as the triangular matrix inversion should no longer be a bottleneck. The resulting algorithm, referred to as Rec-Inv-TRSM, is the same as Algorithm 1, except that the base case is replaced by a call to Algorithm 3. It is shown as Algorithm 4.

Algorithm 4: X = Rec-Inv-TRSM(L, B, n, k, Π, n₀)
Require: L is a lower-triangular n×n matrix and B is a rectangular n×k matrix, both distributed over Π = p processors.
  if n ≤ n₀ then
    X = Inv-TRSM(L, B, n, k, Π)
  else
    Subdivide L into n/2 × n/2 blocks, L = [L₁₁ 0; L₂₁ L₂₂]
    Subdivide B and X into n/2 × k blocks, B = [B₁; B₂], X = [X₁; X₂]
    Compute X₁ = Rec-Inv-TRSM(L₁₁, B₁, n/2, k, Π, n₀).
    Compute B₂ = B₂ − MM(L₂₁, X₁, n/2, k, Π, p).
    Compute X₂ = Rec-Inv-TRSM(L₂₂, B₂, n/2, k, Π, n₀).
    Assemble X from the n/2 × k blocks, X = [X₁; X₂].
  end
Ensure: L X = B.

Case 1: If n ≤ k we do not need any steps of the first (recursive) TRSM approach and can directly invert the matrix. This gives the results shown in Section 5.1.

Case 2: The more interesting case is n > k, where we can use steps of both approaches.

Bandwidth cost: Here we show that, using inversion, we do not get a higher bandwidth cost than with the initial TRSM approach. The base case costs

    W_Rec-Inv-TRSM(n₀, k, p, n₀) = O(n₀²/p^(2/3))

bandwidth, and since we have n/n₀ base cases that are all executed sequentially, this leads to a total cost for all the base cases of

    W_base-cases(n, k, p, n₀) = O(n·n₀/p^(2/3))

With the cost for the initial matrix multiplication as stated in Section 2.1, this leads to a total cost of

    W_Rec-Inv-TRSM(n, k, p, n₀) = W_base-cases(n, k, p, n₀) + W_MM(n, k, p)
                                = O(n·n₀/p^(2/3)) + W_MM(n, k, p).

Since the goal is to be dominated by the initial MM, i.e.

    W_Rec-Inv-TRSM(n, k, p, n₀) = O(W_MM(n, k, p)),

n₀ is chosen accordingly:

One large dimension: This case never occurs for n > k; we would be in Case 1 and do the direct inversion.

Two large dimensions: n₀ = k·p^(1/6)

Three large dimensions: n₀ = n^(1/3)·k^(2/3)

This leads to the following total bandwidth cost:

    W_Rec-Inv-TRSM(n, k, p, n₀) =
      O((1 + log(n/(k√p)))·nk/√p)    n > k√p        (two large dimensions)
      O((n²k/p)^(2/3))               k/p ≤ n ≤ k√p  (three large dimensions)
      O(n²)                          n < k/p        (one large dimension)

Latency: The total latency is given by the number of base cases of the first part of the algorithm times the base-case latency (that is, S_Rec-Tri-Inv). Therefore we have (in the α-β-γ model)

    S_Rec-Inv-TRSM(n, k, p, n₀) = (n/n₀)·S_Rec-Tri-Inv(n₀, p) =
      O((n/(k·p^(1/6)))·log² p)      n > k√p        (two large dimensions)
      O((n/k)^(2/3)·log² p)          k/p ≤ n ≤ k√p  (three large dimensions)
      O(log² p)                      n < k/p        (one large dimension)

This is an improvement over the analysis given in Section 3.2.

Flop costs: With the given choice of n₀, it can be shown that the flop cost is not asymptotically higher than that of a standard implementation of TRSM. The flop cost of Rec-TRSM is

    F_Rec-TRSM(n, k, p) = O(n²k/p),

whereas the cost of Algorithm 4 is

    F_Rec-Inv-TRSM = O(n²k/p + (n/n₀)·n₀³/p) = O(n²k/p + n·n₀²/p).

Therefore it is desired that n·n₀² < n²k, which implies the criterion n₀ < √(nk). We demonstrate that the choice of n₀ made to obtain the desired bandwidth cost also satisfies this criterion and therefore does not require additional computational work asymptotically.

One large dimension: The choice of the base-case size is n₀ = n, and since here n < k, we have

    n₀² = n² < nk,  hence  n₀ < √(nk).

Two large dimensions: The choice of the base-case size is n₀ = k·p^(1/6). Since two large dimensions also impose the constraint n > k√p,

    √(nk) > √(k²·√p) = k·p^(1/4) > k·p^(1/6) = n₀.

Thus, this choice of n₀ guarantees that the computation cost involved in the triangular matrix inversion is always of low order when there are two large matrix dimensions at the beginning of the recursion.

Three large dimensions: The choice of the base-case size is n₀ = n^(1/3)·k^(2/3). Since three large dimensions also impose the constraint n > k, we directly see that

    n₀ = n^(1/3)·k^(2/3) ≤ n^(1/2)·k^(1/2) = √(nk),

with equality when n = k, which implies that the computation cost of the inversion may become of leading order when n ≈ k. Therefore, in practice, when n ≈ k, it may make sense to take a few steps of recursion in Algorithm 4 before performing the inversion.

5.3 Summary

In this part, all the results derived in this chapter are summarized and a table with the total costs of all the algorithms considered is given.

1 Large Dimension (n < k/p):

  Algorithm     W                           S                            F
  MM            n²                          log p                        n²k/p
  TRSM Rec      n²                          log p                        n²k/p
  TRSM Inv      n²                          log² p                       n²k/p
  TRSM RecInv   n²                          log² p                       n²k/p

2 Large Dimensions (n > k√p):

  Algorithm     W                           S                            F
  MM            nk/√p                       log p                        n²k/p
  TRSM Rec      (1 + log(n/(k√p)))·nk/√p    min(n, n√p/k)·log p          n²k/p
  TRSM Inv      nk/√p + n²/p^(2/3)          log² p                       (n³ + n²k)/p
  TRSM RecInv   (1 + log(n/(k√p)))·nk/√p    (n/(k·p^(1/6)))·log² p       n²k/p

3 Large Dimensions (k/p ≤ n ≤ k√p):

  Algorithm     W                           S                            F
  MM            (n²k/p)^(2/3)               log p                        n²k/p
  TRSM Rec      (n²k/p)^(2/3)               min(n, (np/k)^(2/3))·log p   n²k/p
  TRSM Inv      (n²k/p)^(2/3) + n²/p^(2/3)  log² p                       (n³ + n²k)/p
  TRSM RecInv   (n²k/p)^(2/3)               (n/k)^(2/3)·log² p           n²k/p

Table 5.1: Asymptotic upper bounds on bandwidth (W), latency (S) and flop costs (F) for all the algorithms mentioned.

In Table 5.1, the costs for all the algorithms presented in Chapters 3 and 5 are shown as asymptotic upper bounds for a triangular matrix L ∈ R^(n×n) and a right-hand side B ∈ R^(n×k), with p processors working on the task.
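The combined scheme of Algorithm 4 can be sketched serially. The local helpers below stand in for the distributed MM and the distributed inversion; all names are ours, not the thesis' implementation:

```python
# Serial sketch of Rec-Inv-TRSM: recurse on 2x2 blocks of L as in
# Algorithm 1, but solve each base case by explicit inversion
# (Algorithm 3) instead of substitution.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def tri_inv(L):                               # block-recursive inversion
    n = len(L)
    if n == 1:
        return [[1.0 / L[0][0]]]
    h = n // 2
    inv11 = tri_inv([row[:h] for row in L[:h]])
    inv22 = tri_inv([row[h:] for row in L[h:]])
    L21 = [row[:h] for row in L[h:]]
    inv21 = [[-x for x in row] for row in matmul(matmul(inv22, L21), inv11)]
    return ([inv11[i] + [0.0] * (n - h) for i in range(h)] +
            [inv21[i] + inv22[i] for i in range(n - h)])

def rec_inv_trsm(L, B, n0):
    n = len(L)
    if n <= n0:                               # base case: X = inv(L) * B
        return matmul(tri_inv(L), B)
    h = n // 2
    X1 = rec_inv_trsm([row[:h] for row in L[:h]], B[:h], n0)
    U = matmul([row[:h] for row in L[h:]], X1)   # update term L21 * X1
    B2 = [[B[h + i][j] - U[i][j] for j in range(len(B[0]))]
          for i in range(n - h)]
    return X1 + rec_inv_trsm([row[h:] for row in L[h:]], B2, n0)
```

The trade-off of Table 5.1 is visible in the structure: larger n0 means fewer sequential base cases (less latency) but more inversion work and bandwidth per base case.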

Chapter 6

Implementation

In this chapter, some implementation details are given. For the implementation of the given algorithms, ScaLAPACK [1] was used as a base-level library to perform the base-case calculations as well as the matrix multiplications. It is important to note that with that choice, the optimal communication costs assumed throughout the theoretical part are not achieved in the one and three large dimensions cases, since ScaLAPACK only implements a two-dimensional matrix multiplication. For an efficient work distribution, while staying with a relatively simple distribution model, a block-cyclic distribution of the matrix was chosen, as shown in Figure 6.1. Each processor owns the parts of the matrix that are colored in its color, and with that, as soon as we multiply the four-by-four blocks, all processors work on the multiplication, not only on the top-level matrix multiplication. With this choice we ensure that after some blocked steps, all processors are working on the matrix multiplications. This is especially important, as we showed that the top-level matrix multiplications are always the most expensive. For efficiency reasons, all algorithms were implemented in an iterative scheme.

Figure 6.1: Representation of a block-cyclic distribution of the matrix

Inversion: For the inversion of the base cases, the algorithm proposed in Section 5.1 was implemented. To ensure that the base case of the inversion is done sequentially, the block size for the distribution was chosen to be the base-case size of the inversion. At each level the number of matrix multiplications decreases by a factor of two while the size of the matrices doubles, until we reach the top level, as can be seen in Figure 6.2: In the first level (denoted by 1), the four green squares are multiplied. In the second level, the two bigger, red matrices are multiplied, and in the final step, the blue square matrix is multiplied, leading to a complete inversion.

Triangular Solver: The triangular solve itself uses this inversion within each base case and then performs the required updates with ScaLAPACK's triangular matrix multiplication and dense matrix multiplication, respectively, as can be seen in Algorithm 5. The updates on the right-hand side done in update_B are always done up to the part that was calculated, as proposed in the recursive part in Section 5.2. These updates are denoted by the consecutive numbers in Figure 6.2: After the first base case has been used to calculate X₁, the small (green) update 1 is done on B₂. After solving for X₂ in the second step, the update of B affects B₃ and B₄, as the bigger (red) block 2 is updated. This goes on until the last base case is handled.

Figure 6.2: Recursion in the triangular matrices: inversion steps (left), triangular solve (right)

Algorithm 5: Rec-Inv-TRSM(L, B)
  for i in number of base cases do
    Rec-Inv(L_{i,i})
  for i in number of base cases do
    X_i = ScaLAPACK_mm(L_{i,i}⁻¹, B_i)
    update_B(X_i, L)
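The iterative scheme of Algorithm 5 can be sketched serially with block size 1, so "inverting a diagonal block" reduces to a reciprocal; the loop structure (invert first, then solve and update row by row) is the point, and the helper names are ours:

```python
# Serial sketch of Algorithm 5's iterative scheme: one pass inverting
# the diagonal "blocks", then a sweep that solves with each inverted
# block and immediately updates the right-hand-side rows below it.

def iter_inv_trsm(L, B):
    n, k = len(L), len(B[0])
    diag_inv = [1.0 / L[i][i] for i in range(n)]     # the "Rec-Inv" pass
    X = [row[:] for row in B]
    for i in range(n):                               # the solve pass
        X[i] = [diag_inv[i] * x for x in X[i]]       # X_i = L_ii^{-1} B_i
        for r in range(i + 1, n):                    # update_B below row i
            X[r] = [X[r][j] - L[r][i] * X[i][j] for j in range(k)]
    return X
```

With real block sizes, the scalar reciprocal becomes the recursive block inversion and the row updates become the ScaLAPACK multiplications of Algorithm 5.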

Chapter 7

Experimental Results

In this section, the results of the performed experiments are provided. After a description of the experimental setup, including information about the machine used, the performance of the proposed algorithms is evaluated.

Experimental setup: The experiments were performed on a Cray XC40 with 1256 nodes. Each node contains two Intel Xeon E v3 CPUs with 12 cores each and 64 GB of memory. The nodes are connected by the Aries proprietary interconnect from Cray, with a dragonfly network topology.

Compilers and libraries: LAPACK and ScaLAPACK [1] were used from the provided precompiled cray-libsci library. For the matrix storage and correctness checks, the Eigen [14] library was used. All programs were compiled with gcc with the optimization flag -O3.

Initial conditions: The matrices were created locally using the drand48 random number generator within the range [1, 101), where L is lower triangular and B is dense.

Restrictions: To focus on the algorithmic improvements, the timing was started after the MPI communicators were initialized, the Cblas grid was created and the local matrices were initialized. Each simulation was run six to eleven times and the first run was neglected. To summarize the data, as suggested in [15, 16], the harmonic mean

    x̄_h = n / (Σ_{i=1}^{n} 1/x_i)

was used instead of the arithmetic mean. Also, for each processor count, the two-sided 95% confidence interval of each set of experimental results is shown. These intervals were calculated as mentioned in [16], based on Student's t distribution:

    CI: [x̄ − t(n−1, α/2)·s/√n,  x̄ + t(n−1, α/2)·s/√n]

where s denotes the sample standard deviation

s = sqrt( sum_{i=1}^{n} (x_i - x_bar)^2 / (n - 1) )

In each of the following graphs, the mean value is plotted with a larger icon and lines, the single runs are plotted with smaller icons, and the 95% confidence interval is plotted in black. Base-Case Size: We expected that in real runs, the theoretically optimal base-case size may not be the best choice for some problems. This is due to the constants hidden in the asymptotic complexities as well as the two-dimensional implementation of the matrix-matrix multiplication that we used. Therefore, a range around the optimal base-case size n_opt was tried for each experiment (n_opt/4, n_opt/2, n_opt, 2 n_opt, 4 n_opt), and only the best-performing base-case size was reported. Strong scaling: The strong-scaling plots show the average performance over the runs. The total flop count (n^2 k for TRSM and n^3/3 for the inversion) was divided by the execution time, which gives the flop rate:

G_TRSM(t_exec(p), n, k) = n^2 k / t_exec(p)
G_Inv(t_exec(p), n) = n^3 / (3 t_exec(p))

For a perfectly parallelizable algorithm with no parallelization overhead, this quantity should scale like p, since the execution time in this case would scale like 1/p. Weak scaling: For the weak-scaling plots, the quantity G was reused and divided by the machine peak performance for the set of nodes used. The plotted quantity is therefore:

P(t_exec(p), n, k, p) = G(t_exec(p), n, k) / (p * peak per processor)

For the peak performance, 41.6 GF per core has been assumed; each process runs on one core only. For a perfectly parallelizable algorithm with no overhead cost, this quantity should be constant at 1, since every processor would run at peak performance for the whole time. Inversion: One of the benchmarks compared our implementation of the inversion against the inversion ScaLAPACK provides. The

Figure 7.1: Strong scaling of the inversion with N=2048
Figure 7.2: Strong scaling of the inversion with N=8192

results can be seen in Figures 7.1, 7.2 and 7.3. One can see that this part of the TRSM solver was able to beat ScaLAPACK in strong as well as in weak scaling. It is interesting to observe that for smaller problems (up to N = 8192), the problems appear to be too small to get the full benefit of 4096 ranks working on them. The important point is that for problems large enough, the proposed algorithm is a real improvement over ScaLAPACK, as one gets more than a factor of 1.6 in gigaflops per second for N = 32768. The weak scaling was started at N = 1024 and p = 4, and N was increased by a factor of two as p increased by a factor of four, to keep memory usage per processor constant. In the weak-scaling plot, visible in Figure 7.4, it can be

Figure 7.3: Strong scaling of the inversion with N=32768
Figure 7.4: Weak scaling starting with N=1024 for p=4

seen that the proposed method is strictly better in terms of percentage of peak performance. Three Large Dimensions: Benchmarks for the algorithm were performed for the case of three large dimensions. The strong-scaling plots for different matrix sizes, where L is in R^{N x N} and B is in R^{N x K}, are provided for the presented implementation as well as ScaLAPACK's algorithm in Figures 7.5, 7.6 and 7.7. It is clearly visible that the scaling is very poor for small matrix sizes, but as the sizes increase, the scaling becomes very good. Unfortunately, the overall performance is still worse than what ScaLAPACK offers.
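The performance quantities plotted here, the flop rate G and the peak fraction P defined earlier in this chapter, can be computed as follows (a minimal sketch; the 41.6 GF/core default is the per-core peak assumed in this thesis):

```python
def gflops_trsm(t_exec, n, k):
    """G_TRSM: n^2 * k flops over the measured execution time, in GF/s."""
    return n * n * k / t_exec / 1e9

def gflops_inv(t_exec, n):
    """G_Inv: n^3 / 3 flops over the measured execution time, in GF/s."""
    return n ** 3 / (3.0 * t_exec) / 1e9

def peak_fraction(gflops, p, peak_per_core=41.6):
    """Weak-scaling efficiency: achieved rate over p times per-core peak."""
    return gflops / (p * peak_per_core)
```

A run that reaches exactly p times the per-core peak yields a fraction of 1.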

Figure 7.5: Strong scaling of TRSM in three large dimensions
Figure 7.6: Strong scaling for N=K=8192

After profiling the runs, it was easy to see that the biggest problem lies not in the inversion part but in the very slow triangular-times-dense matrix multiplication (TRMM). It is even slower than the complete triangular solve (TRSM), even though the maximal amount of parallelism possible for TRSM is only nk, whereas TRMM admits far more parallelism. The weak-scaling plot can be seen in Figure 7.8; there, the scaling is not as promising as the strong-scaling results suggest. The results were created with the following set of parameters: starting with p = 4 and N = K = 1024, N and K were both increased by a factor of two as p increased by a factor of four.

Figure 7.7: Strong scaling for N=K=32768
Figure 7.8: Weak scaling starting with N=1024 for p=4

Two Large Dimensions: Since the TRMM was such a dominating factor in the previous case, one could hope for better performance as the size of the right-hand side decreases. The strong-scaling plots were created for several sizes n of the matrix L in R^{n x n}, keeping the width k of the right-hand side B in R^{n x k} constant. The observed results can be seen in Figures 7.9, 7.10 and 7.11. One can see that for smaller numbers of processors, the discussed approach works very well. Unfortunately, the performance relative to ScaLAPACK decreases at the very end due to a spike in ScaLAPACK's performance, which is unexplained for now. The interesting part is that as less time is spent in the TRMM, the more important the newly proposed part of the algorithm becomes, and this part proves to be efficient. The weak scaling was started at N = 1024 and p = 4, and N was increased by a factor of two as p increased by a factor of four, to again keep memory usage per processor constant. The results are shown in Figure 7.12.

Figure 7.9: Strong scaling for N=4096 with K=512
Figure 7.10: Strong scaling for N=16384 with K=512

One Large Dimension: Since we already found that the inversion is fast and the TRMM is slow, we decided not to run benchmarks for the one-large-dimension case to save computing time, as nothing interesting could have been observed there.

Figure 7.11: Strong scaling for N=32768 with K=512
Figure 7.12: Weak scaling starting with N=1024 for p=4

Summary of the results: The proposed algorithm does not bring the expected benefits, due to the slow implementation of the triangular matrix multiplication. With less time spent in the triangular matrix multiplication, better performance for the proposed algorithm can be obtained. Nevertheless, the main part of the algorithm, the faster inversion of triangular blocks, shows a good performance increase over the routine implemented in ScaLAPACK [1].

Chapter 8 Further Work

In this chapter, further work relevant to this topic is presented. Stability Analysis: As mentioned in Section 2, a stability analysis of the proposed method for blocked inversion could be done; such an approach is partially described in [2]. For the problems considered in this work, every lower triangular matrix was well-conditioned, and therefore no correctness problems arose compared to the results a regular, iterative scheme provided. Optimizing Triangular Matrix Multiplications: The results show very clearly that, in order to make the proposed method pay off, a much faster triangular matrix multiplication is needed. Due to its limited area of application compared to a dense matrix multiplication or even a triangular solve, the level of optimization in ScaLAPACK is presumably rather low. Therefore, most likely, new code has to be developed at this point. Another limitation of the current state is that we always use a two-dimensional processor grid. To reach the asymptotically optimal cost, this should be extended to cover at least three-dimensional as well as one-dimensional grid layouts. Adapting the structure for arbitrary matrix sizes: So far, only problems with matrix sizes chosen as powers of two have been considered. This made the recursive algorithms simpler to implement, as there are no leftover blocks. The code could be improved so that general sizes become usable. The most common approach would be to extend the matrix to the next power of two with a block of the identity matrix at the bottom and to add zero rows to the right-hand side. Further optimization: The results show that the code has a performance gap, as the subroutines used show a significant lack of scaling. One can also see that the performance never ends up higher than 30% of peak performance for a large number of processors, while for very well optimized code, one can expect to gain a factor of about two more in peak performance.
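The power-of-two padding sketched in the paragraph on arbitrary matrix sizes can be written down directly (a hedged NumPy sketch; the helper name is mine, not from the thesis code):

```python
import numpy as np

def pad_to_pow2(L, B):
    """Embed L into the next power-of-two size with a trailing identity
    block and pad B with zero rows; the padded system L_p X_p = B_p has
    the original solution in its first n rows and zeros below."""
    n = L.shape[0]
    m = 1 << (n - 1).bit_length()  # next power of two >= n
    Lp = np.eye(m)
    Lp[:n, :n] = L
    Bp = np.zeros((m, B.shape[1]))
    Bp[:n] = B
    return Lp, Bp
```

Since the identity block is already triangular with unit diagonal, the padding changes neither the conditioning of the system nor the solution in the original rows.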

Chapter 9 Conclusion

This work presented a new, communication-avoiding approach to solving a triangular system for multiple right-hand sides (TRSM). TRSM serves as a subroutine for many algorithms in dense linear algebra, such as the LU decomposition, and it is a critical component for solving linear systems of equations. It has been shown that the algorithm we propose asymptotically uses the same bandwidth and flop costs as the standard iterative scheme for solving the TRSM problem, but decreases the latency by a factor of p^(2/3)/log p for the case of three large dimensions as well as for the case of two large dimensions. To achieve the decreased latency, one has to carefully pick the base-case size so as not to be dominated by the bandwidth or the flop costs. For all three cases of relative matrix sizes, we presented base-case sizes that are optimal. For the case of one large dimension, the algorithm does not perform any better than the existing one, because single-processor work on the left side is always preferred and therefore no gain was achieved. The summarized costs are shown in Table 9.1. Since only asymptotic upper bounds are available for the matrix multiplications, we were only able to give asymptotic upper bounds for the performance of our algorithm. With this decrease in latency, our algorithm is very promising for solving linear systems faster on machines with a large number of processors. This is especially true because, for larger systems, the communication cost is more of a bottleneck than it is on small machines. This work also opens up the question of whether the new triangular inversion should be reconsidered as a subroutine for other applications, as was done by Tiskin for the LU factorization in [7]. Experiments with a not heavily optimized version of this algorithm were performed, using ScaLAPACK's two-dimensional matrix multiplication for the calculations of our algorithm. The results showed that the new approach to the inversion brings a notable speedup, whereas, due to the lack of a well-optimized triangular matrix multiplication, the time to solution of the presented triangular solver still turns out to be higher than the reference.

Table 9.1: Summarized upper bounds for bandwidth (W), latency (S) and flop costs (F) for all the algorithms mentioned

We were able to see that the problem sizes for which we did our experiments were rather small, and with that, the percentage of peak performance was lower than expected. But the trends that the graphs show are promising.

Bibliography

[1] L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley, ScaLAPACK Users' Guide, Society for Industrial and Applied Mathematics, Philadelphia, PA.

[2] Erik Elmroth, Fred Gustavson, Isak Jonsson, and Bo Kågström, "Recursive blocked algorithms and hybrid data structures for dense matrix library software," SIAM Review, vol. 46, no. 1, pp. 3-45.

[3] Edgar Solomonik and James Demmel, "Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms," in Euro-Par 2011 Parallel Processing, Springer.

[4] Fred G. Gustavson, "Recursion leads to automatic variable blocking for dense linear-algebra algorithms," IBM Journal of Research and Development, vol. 41, no. 6.

[5] J. Demmel, D. Eliahu, A. Fox, S. Kamil, B. Lipshitz, O. Schwartz, and O. Spillinger, "Communication-optimal parallel recursive rectangular matrix multiplication," in Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on, May 2013.

[6] Alok Aggarwal, Ashok K. Chandra, and Marc Snir, "Communication complexity of PRAMs," Theoretical Computer Science, vol. 71, no. 1, pp. 3-28.

[7] Alexander Tiskin, "Bulk-synchronous parallel Gaussian elimination," Journal of Mathematical Sciences, vol. 108, no. 6.

[8] Michael T. Heath and Charles H. Romine, "Parallel solution of triangular systems on distributed-memory multiprocessors," SIAM Journal on Scientific and Statistical Computing, vol. 9, no. 3.

[9] Edgar Solomonik, Erin Carson, Nicholas Knight, and James Demmel, "Tradeoffs between synchronization, communication, and computation in parallel linear algebra computations," in Proceedings of the 26th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '14), New York, NY, USA, 2014, ACM.

[10] Grey Ballard, James Demmel, Benjamin Lipshitz, Oded Schwartz, and Sivan Toledo, "Communication efficient Gaussian elimination with partial pivoting using a shape morphing data layout," in Proceedings of the Twenty-Fifth Annual ACM Symposium on Parallelism in Algorithms and Architectures, ACM, 2013.

[11] Benjamin Lipshitz, "Communication-avoiding parallel recursive algorithms for matrix multiplication," Tech. Rep., EECS Department, University of California, Berkeley.

[12] Jeremy J. Du Croz and Nicholas J. Higham, "Stability of methods for matrix inversion," IMA Journal of Numerical Analysis, vol. 12, no. 1, pp. 1-19.

[13] Raphael Hunger, "Floating point operations in matrix-vector calculus," Munich University of Technology, Inst. for Circuit Theory and Signal Processing, Munich.

[14] Gaël Guennebaud, Benoît Jacob, et al., "Eigen v3," http://eigen.tuxfamily.org.

[15] Philip J. Fleming and John J. Wallace, "How not to lie with statistics: the correct way to summarize benchmark results," Communications of the ACM, vol. 29, no. 3.

[16] T. Hoefler and R. Belli, "Scientific Benchmarking of Parallel Computing Systems," Nov. 2015, accepted at IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC15).


More information

A Metaheuristic Scheduler for Time Division Multiplexed Network-on-Chip

A Metaheuristic Scheduler for Time Division Multiplexed Network-on-Chip Downloaded from orbit.dtu.dk on: Jan 25, 2019 A Metaheuristic Scheduler for Time Division Multilexed Network-on-Chi Sørensen, Rasmus Bo; Sarsø, Jens; Pedersen, Mark Ruvald; Højgaard, Jasur Publication

More information

Patterned Wafer Segmentation

Patterned Wafer Segmentation atterned Wafer Segmentation ierrick Bourgeat ab, Fabrice Meriaudeau b, Kenneth W. Tobin a, atrick Gorria b a Oak Ridge National Laboratory,.O.Box 2008, Oak Ridge, TN 37831-6011, USA b Le2i Laboratory Univ.of

More information

An improved algorithm for Hausdorff Voronoi diagram for non-crossing sets

An improved algorithm for Hausdorff Voronoi diagram for non-crossing sets An imroved algorithm for Hausdorff Voronoi diagram for non-crossing sets Frank Dehne, Anil Maheshwari and Ryan Taylor May 26, 2006 Abstract We resent an imroved algorithm for building a Hausdorff Voronoi

More information

Multicast in Wormhole-Switched Torus Networks using Edge-Disjoint Spanning Trees 1

Multicast in Wormhole-Switched Torus Networks using Edge-Disjoint Spanning Trees 1 Multicast in Wormhole-Switched Torus Networks using Edge-Disjoint Sanning Trees 1 Honge Wang y and Douglas M. Blough z y Myricom Inc., 325 N. Santa Anita Ave., Arcadia, CA 916, z School of Electrical and

More information

Skip List Based Authenticated Data Structure in DAS Paradigm

Skip List Based Authenticated Data Structure in DAS Paradigm 009 Eighth International Conference on Grid and Cooerative Comuting Ski List Based Authenticated Data Structure in DAS Paradigm Jieing Wang,, Xiaoyong Du,. Key Laboratory of Data Engineering and Knowledge

More information

Improved heuristics for the single machine scheduling problem with linear early and quadratic tardy penalties

Improved heuristics for the single machine scheduling problem with linear early and quadratic tardy penalties Imroved heuristics for the single machine scheduling roblem with linear early and quadratic tardy enalties Jorge M. S. Valente* LIAAD INESC Porto LA, Faculdade de Economia, Universidade do Porto Postal

More information

Submission. Verifying Properties Using Sequential ATPG

Submission. Verifying Properties Using Sequential ATPG Verifying Proerties Using Sequential ATPG Jacob A. Abraham and Vivekananda M. Vedula Comuter Engineering Research Center The University of Texas at Austin Austin, TX 78712 jaa, vivek @cerc.utexas.edu Daniel

More information

Using Rational Numbers and Parallel Computing to Efficiently Avoid Round-off Errors on Map Simplification

Using Rational Numbers and Parallel Computing to Efficiently Avoid Round-off Errors on Map Simplification Using Rational Numbers and Parallel Comuting to Efficiently Avoid Round-off Errors on Ma Simlification Maurício G. Grui 1, Salles V. G. de Magalhães 1,2, Marcus V. A. Andrade 1, W. Randolh Franklin 2,

More information

A Study of Protocols for Low-Latency Video Transport over the Internet

A Study of Protocols for Low-Latency Video Transport over the Internet A Study of Protocols for Low-Latency Video Transort over the Internet Ciro A. Noronha, Ph.D. Cobalt Digital Santa Clara, CA ciro.noronha@cobaltdigital.com Juliana W. Noronha University of California, Davis

More information

The R-LRPD Test: Speculative Parallelization of Partially Parallel Loops

The R-LRPD Test: Speculative Parallelization of Partially Parallel Loops The R-LRPD Test: Seculative Parallelization of Partially Parallel Loos Francis Dang, Hao Yu, Lawrence Rauchwerger Det. of Comuter Science, Texas A&M University College Station, TX 778- {fhd,hy89,rwerger}@cs.tamu.edu

More information

IMS Network Deployment Cost Optimization Based on Flow-Based Traffic Model

IMS Network Deployment Cost Optimization Based on Flow-Based Traffic Model IMS Network Deloyment Cost Otimization Based on Flow-Based Traffic Model Jie Xiao, Changcheng Huang and James Yan Deartment of Systems and Comuter Engineering, Carleton University, Ottawa, Canada {jiexiao,

More information

CS 470 Spring Mike Lam, Professor. Performance Analysis

CS 470 Spring Mike Lam, Professor. Performance Analysis CS 470 Sring 2018 Mike Lam, Professor Performance Analysis Performance analysis Why do we arallelize our rograms? Performance analysis Why do we arallelize our rograms? So that they run faster! Performance

More information

OMNI: An Efficient Overlay Multicast. Infrastructure for Real-time Applications

OMNI: An Efficient Overlay Multicast. Infrastructure for Real-time Applications OMNI: An Efficient Overlay Multicast Infrastructure for Real-time Alications Suman Banerjee, Christoher Kommareddy, Koushik Kar, Bobby Bhattacharjee, Samir Khuller Abstract We consider an overlay architecture

More information

Learning Robust Locality Preserving Projection via p-order Minimization

Learning Robust Locality Preserving Projection via p-order Minimization Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence Learning Robust Locality Preserving Projection via -Order Minimization Hua Wang, Feiing Nie, Heng Huang Deartment of Electrical

More information

Power Savings in Embedded Processors through Decode Filter Cache

Power Savings in Embedded Processors through Decode Filter Cache Power Savings in Embedded Processors through Decode Filter Cache Weiyu Tang Rajesh Guta Alexandru Nicolau Deartment of Information and Comuter Science University of California, Irvine Irvine, CA 92697-3425

More information

Lecture 2: Fixed-Radius Near Neighbors and Geometric Basics

Lecture 2: Fixed-Radius Near Neighbors and Geometric Basics structure arises in many alications of geometry. The dual structure, called a Delaunay triangulation also has many interesting roerties. Figure 3: Voronoi diagram and Delaunay triangulation. Search: Geometric

More information

AUTOMATIC EXTRACTION OF BUILDING OUTLINE FROM HIGH RESOLUTION AERIAL IMAGERY

AUTOMATIC EXTRACTION OF BUILDING OUTLINE FROM HIGH RESOLUTION AERIAL IMAGERY AUTOMATIC EXTRACTION OF BUILDING OUTLINE FROM HIGH RESOLUTION AERIAL IMAGERY Yandong Wang EagleView Technology Cor. 5 Methodist Hill Dr., Rochester, NY 1463, the United States yandong.wang@ictometry.com

More information

Brigham Young University Oregon State University. Abstract. In this paper we present a new parallel sorting algorithm which maximizes the overlap

Brigham Young University Oregon State University. Abstract. In this paper we present a new parallel sorting algorithm which maximizes the overlap Aeared in \Journal of Parallel and Distributed Comuting, July 1995 " Overlaing Comutations, Communications and I/O in Parallel Sorting y Mark J. Clement Michael J. Quinn Comuter Science Deartment Deartment

More information

[9] J. J. Dongarra, R. Hempel, A. J. G. Hey, and D. W. Walker, \A Proposal for a User-Level,

[9] J. J. Dongarra, R. Hempel, A. J. G. Hey, and D. W. Walker, \A Proposal for a User-Level, [9] J. J. Dongarra, R. Hemel, A. J. G. Hey, and D. W. Walker, \A Proosal for a User-Level, Message Passing Interface in a Distributed-Memory Environment," Tech. Re. TM-3, Oak Ridge National Laboratory,

More information

Parallel Construction of Multidimensional Binary Search Trees. Ibraheem Al-furaih, Srinivas Aluru, Sanjay Goil Sanjay Ranka

Parallel Construction of Multidimensional Binary Search Trees. Ibraheem Al-furaih, Srinivas Aluru, Sanjay Goil Sanjay Ranka Parallel Construction of Multidimensional Binary Search Trees Ibraheem Al-furaih, Srinivas Aluru, Sanjay Goil Sanjay Ranka School of CIS and School of CISE Northeast Parallel Architectures Center Syracuse

More information

Matlab Virtual Reality Simulations for optimizations and rapid prototyping of flexible lines systems

Matlab Virtual Reality Simulations for optimizations and rapid prototyping of flexible lines systems Matlab Virtual Reality Simulations for otimizations and raid rototying of flexible lines systems VAMVU PETRE, BARBU CAMELIA, POP MARIA Deartment of Automation, Comuters, Electrical Engineering and Energetics

More information

Applying the fuzzy preference relation to the software selection

Applying the fuzzy preference relation to the software selection Proceedings of the 007 WSEAS International Conference on Comuter Engineering and Alications, Gold Coast, Australia, January 17-19, 007 83 Alying the fuzzy reference relation to the software selection TIEN-CHIN

More information

Using Permuted States and Validated Simulation to Analyze Conflict Rates in Optimistic Replication

Using Permuted States and Validated Simulation to Analyze Conflict Rates in Optimistic Replication Using Permuted States and Validated Simulation to Analyze Conflict Rates in Otimistic Relication An-I A. Wang Comuter Science Deartment Florida State University Geoff H. Kuenning Comuter Science Deartment

More information

To appear in IEEE TKDE Title: Efficient Skyline and Top-k Retrieval in Subspaces Keywords: Skyline, Top-k, Subspace, B-tree

To appear in IEEE TKDE Title: Efficient Skyline and Top-k Retrieval in Subspaces Keywords: Skyline, Top-k, Subspace, B-tree To aear in IEEE TKDE Title: Efficient Skyline and To-k Retrieval in Subsaces Keywords: Skyline, To-k, Subsace, B-tree Contact Author: Yufei Tao (taoyf@cse.cuhk.edu.hk) Deartment of Comuter Science and

More information

Fast Distributed Process Creation with the XMOS XS1 Architecture

Fast Distributed Process Creation with the XMOS XS1 Architecture Communicating Process Architectures 20 P.H. Welch et al. (Eds.) IOS Press, 20 c 20 The authors and IOS Press. All rights reserved. Fast Distributed Process Creation with the XMOS XS Architecture James

More information

Parallel Mesh Generation

Parallel Mesh Generation Parallel Mesh Generation Nikos Chrisochoides Comuter Science Deartment College of William and Mary Williamsburg, VA 23185 and Division of Alied Mathematics Brown University 182 George Street Providence,

More information

Autonomic Physical Database Design - From Indexing to Multidimensional Clustering

Autonomic Physical Database Design - From Indexing to Multidimensional Clustering Autonomic Physical Database Design - From Indexing to Multidimensional Clustering Stehan Baumann, Kai-Uwe Sattler Databases and Information Systems Grou Technische Universität Ilmenau, Ilmenau, Germany

More information

Improving Trust Estimates in Planning Domains with Rare Failure Events

Improving Trust Estimates in Planning Domains with Rare Failure Events Imroving Trust Estimates in Planning Domains with Rare Failure Events Colin M. Potts and Kurt D. Krebsbach Det. of Mathematics and Comuter Science Lawrence University Aleton, Wisconsin 54911 USA {colin.m.otts,

More information

Avoiding Communication in Sparse Matrix Computations

Avoiding Communication in Sparse Matrix Computations Avoiding Communication in Sarse Matrix Comutations James Demmel, Mark Hoemmen, Marghoob Mohiyuddin, and Katherine Yelick Deartment of Electrical Engineering and Comuter Science University of California

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms Chapter 3 Dense Linear Systems Section 3.3 Triangular Linear Systems Michael T. Heath and Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign

More information

Non-Strict Independence-Based Program Parallelization Using Sharing and Freeness Information

Non-Strict Independence-Based Program Parallelization Using Sharing and Freeness Information Non-Strict Indeendence-Based Program Parallelization Using Sharing and Freeness Information Daniel Cabeza Gras 1 and Manuel V. Hermenegildo 1,2 Abstract The current ubiuity of multi-core rocessors has

More information

SPARSE SIGNAL REPRESENTATION FOR COMPLEX-VALUED IMAGING Sadegh Samadi 1, M üjdat Çetin 2, Mohammad Ali Masnadi-Shirazi 1

SPARSE SIGNAL REPRESENTATION FOR COMPLEX-VALUED IMAGING Sadegh Samadi 1, M üjdat Çetin 2, Mohammad Ali Masnadi-Shirazi 1 SPARSE SIGNAL REPRESENTATION FOR COMPLEX-VALUED IMAGING Sadegh Samadi 1, M üjdat Çetin, Mohammad Ali Masnadi-Shirazi 1 1. Shiraz University, Shiraz, Iran,. Sabanci University, Istanbul, Turkey ssamadi@shirazu.ac.ir,

More information

A Reconfigurable Architecture for Quad MAC VLIW DSP

A Reconfigurable Architecture for Quad MAC VLIW DSP A Reconfigurable Architecture for Quad MAC VLIW DSP Sangwook Kim, Sungchul Yoon, Jaeseuk Oh, Sungho Kang Det. of Electrical & Electronic Engineering, Yonsei University 132 Shinchon-Dong, Seodaemoon-Gu,

More information

Low Power Implementations for Adaptive Filters

Low Power Implementations for Adaptive Filters Low Power Imlementations for Adative ilters Marius Vollmer Stehan Klauke Jürgen Götze University of ortmund Information Processing Lab htt://www-dt.e-technik.uni-dortmund.de marius.vollmer stehan.klauke

More information

A Petri net-based Approach to QoS-aware Configuration for Web Services

A Petri net-based Approach to QoS-aware Configuration for Web Services A Petri net-based Aroach to QoS-aware Configuration for Web s PengCheng Xiong, YuShun Fan and MengChu Zhou, Fellow, IEEE Abstract With the develoment of enterrise-wide and cross-enterrise alication integration

More information

EE678 Application Presentation Content Based Image Retrieval Using Wavelets

EE678 Application Presentation Content Based Image Retrieval Using Wavelets EE678 Alication Presentation Content Based Image Retrieval Using Wavelets Grou Members: Megha Pandey megha@ee. iitb.ac.in 02d07006 Gaurav Boob gb@ee.iitb.ac.in 02d07008 Abstract: We focus here on an effective

More information

A Symmetric FHE Scheme Based on Linear Algebra

A Symmetric FHE Scheme Based on Linear Algebra A Symmetric FHE Scheme Based on Linear Algebra Iti Sharma University College of Engineering, Comuter Science Deartment. itisharma.uce@gmail.com Abstract FHE is considered to be Holy Grail of cloud comuting.

More information

Process and Measurement System Capability Analysis

Process and Measurement System Capability Analysis Process and Measurement System aability Analysis Process caability is the uniformity of the rocess. Variability is a measure of the uniformity of outut. Assume that a rocess involves a quality characteristic

More information

Summary. A simple model for point-to-point messages. Small message broadcasts in the α-β model. Messaging in the LogP model.

Summary. A simple model for point-to-point messages. Small message broadcasts in the α-β model. Messaging in the LogP model. Summary Design of Parallel and High-Performance Computing: Distributed-Memory Models and lgorithms Edgar Solomonik ETH Zürich December 9, 2014 Lecture overview Review: α-β communication cost model LogP

More information

Lecture 8: Orthogonal Range Searching

Lecture 8: Orthogonal Range Searching CPS234 Comutational Geometry Setember 22nd, 2005 Lecture 8: Orthogonal Range Searching Lecturer: Pankaj K. Agarwal Scribe: Mason F. Matthews 8.1 Range Searching The general roblem of range searching is

More information

Accurate, Efficient and Scalable Graph Embedding

Accurate, Efficient and Scalable Graph Embedding Accurate, Efficient and Scalable Grah Embedding Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgoal Kannan, Viktor Prasanna University of Southern California Los Angeles, USA {zengh, hongkuaz, ajiteshs,

More information

Models for Advancing PRAM and Other Algorithms into Parallel Programs for a PRAM-On-Chip Platform

Models for Advancing PRAM and Other Algorithms into Parallel Programs for a PRAM-On-Chip Platform Models for Advancing PRAM and Other Algorithms into Parallel Programs for a PRAM-On-Chi Platform Uzi Vishkin George C. Caragea Bryant Lee Aril 2006 University of Maryland, College Park, MD 20740 UMIACS-TR

More information

Multigrain Parallel Delaunay Mesh Generation: Challenges and Opportunities for Multithreaded Architectures

Multigrain Parallel Delaunay Mesh Generation: Challenges and Opportunities for Multithreaded Architectures Multigrain Parallel Delaunay Mesh Generation: Challenges and Oortunities for Multithreaded Architectures Christos D. Antonooulos, Xiaoning Ding, Andrey Chernikov, Fili Blagojevic, Dimitrios S. Nikolooulos,

More information

Privacy Preserving Moving KNN Queries

Privacy Preserving Moving KNN Queries Privacy Preserving Moving KNN Queries arxiv:4.76v [cs.db] 4 Ar Tanzima Hashem Lars Kulik Rui Zhang National ICT Australia, Deartment of Comuter Science and Software Engineering University of Melbourne,

More information

Leak Detection Modeling and Simulation for Oil Pipeline with Artificial Intelligence Method

Leak Detection Modeling and Simulation for Oil Pipeline with Artificial Intelligence Method ITB J. Eng. Sci. Vol. 39 B, No. 1, 007, 1-19 1 Leak Detection Modeling and Simulation for Oil Pieline with Artificial Intelligence Method Pudjo Sukarno 1, Kuntjoro Adji Sidarto, Amoranto Trisnobudi 3,

More information

An Efficient Coding Method for Coding Region-of-Interest Locations in AVS2

An Efficient Coding Method for Coding Region-of-Interest Locations in AVS2 An Efficient Coding Method for Coding Region-of-Interest Locations in AVS2 Mingliang Chen 1, Weiyao Lin 1*, Xiaozhen Zheng 2 1 Deartment of Electronic Engineering, Shanghai Jiao Tong University, China

More information

A Scalable Parallel Approach for Peptide Identification from Large-scale Mass Spectrometry Data

A Scalable Parallel Approach for Peptide Identification from Large-scale Mass Spectrometry Data 2009 International Conference on Parallel Processing Workshos A Scalable Parallel Aroach for Petide Identification from Large-scale Mass Sectrometry Data Gaurav Kulkarni, Ananth Kalyanaraman School of

More information

A GPU Heterogeneous Cluster Scheduling Model for Preventing Temperature Heat Island

A GPU Heterogeneous Cluster Scheduling Model for Preventing Temperature Heat Island A GPU Heterogeneous Cluster Scheduling Model for Preventing Temerature Heat Island Yun-Peng CAO 1,2,a and Hai-Feng WANG 1,2 1 School of Information Science and Engineering, Linyi University, Linyi Shandong,

More information

S16-02, URL:

S16-02, URL: Self Introduction A/Prof ay Seng Chuan el: Email: scitaysc@nus.edu.sg Office: S-0, Dean s s Office at Level URL: htt://www.hysics.nus.edu.sg/~hytaysc I was a rogrammer from to. I have been working in NUS

More information