A Study on the Performance of Cholesky Factorization using MPI

Han S. Kim    Scott B. Baden
Department of Computer Science and Engineering
University of California, San Diego
{hskim, baden}@cs.ucsd.edu

Abstract

Cholesky factorization is a well-known method for solving a linear system of equations. In this paper, various parallel algorithms for Cholesky factorization using MPI are designed, analyzed, and implemented. The parallel algorithms depend on the layout of the workload. The column strip cyclic layout and the column block strip cyclic layout are the two main layouts discussed in this paper. We show that the performance of the parallel algorithm scales relatively well, and that communication overhead accounts for most of the performance lost relative to ideal speedup. Also, to find the optimal block size for the column block strip cyclic layout, we decomposed the execution time into its components. The result is that the optimal block size is around 3 to 6; the exact size varies because the running time is the sum of the communication overhead distribution and the post-computation cost distribution.

1. Introduction

Cholesky factorization [1] is a well-known method for solving a linear system of equations. A given matrix with a special property (a positive definite matrix) can be decomposed into two matrices, and with these two matrices it becomes much easier to solve the linear system of equations. In this project (the CSE 260 Parallel Computation research project, Fall 2005, at the University of California, San Diego), we present parallel algorithms implemented with MPI [3]. To implement a real scientific application component, careful analysis was carried out from the very early stages. Choosing the better algorithm among various candidates is based on cost analysis, and during implementation and experiments we further investigated the bottlenecks of the implementation. We organize this paper as follows: in Section 2, we describe the background of the problem dealt with in this paper. In Section 3, as a naïve approach, we present the sequential algorithm.
In Section 4, we present two parallel algorithms and their cost analysis. In Section 5, we show the performance of the parallel algorithms and analytical results. We conclude the paper in Sections 6 and 7.

2. Background

2.1. Cholesky Factorization

Cholesky factorization is used in solving linear systems of equations. A linear system of equations in n unknowns x_1, x_2, ..., x_n is defined as follows [1]: a set of equations E_1, E_2, ..., E_n of the form

  E_1 : a_{11}x_1 + a_{12}x_2 + ... + a_{1n}x_n = b_1
  E_2 : a_{21}x_1 + a_{22}x_2 + ... + a_{2n}x_n = b_2
  ...
  E_n : a_{n1}x_1 + a_{n2}x_2 + ... + a_{nn}x_n = b_n

The system of equations above can be expressed in matrix form:
  Ax = b,  where

  A = [ a_{11} a_{12} ... a_{1n}
        a_{21} a_{22} ... a_{2n}
         ...
        a_{n1} a_{n2} ... a_{nn} ],   x = (x_1, ..., x_n)^T,   b = (b_1, ..., b_n)^T.

The Cholesky factorization of a given symmetric, positive definite square matrix A is of the form A = U^T U, where U is upper triangular and U^T is the transpose of U.

2.2. Positive Definite Matrix

A matrix A is positive definite if and only if A = A^T and x^T A x > 0 for all x != 0.

2.2.1. Positive Definite Matrix Generation

Positive definite matrices of arbitrary size are required as input to the Cholesky factorization program. The approach to generating a positive definite matrix is quite simple. First generate a real-valued upper triangular matrix, then transpose it. Now we have U and U^T. Simply multiplying the two matrices generates a positive definite matrix. The following theorem guarantees that the result is a proper positive definite matrix:

Theorem 1. (Ayers 1962) A real symmetric matrix A is positive definite iff there exists a real nonsingular matrix M such that A = M M^T, where M^T is the transpose of M.

3. Initial Approach

3.1. Serial Algorithm

The serial algorithm helps in understanding how the algorithm works and how we can improve it through parallelism. The serial algorithm from the textbook [2] is used. The running time of this algorithm is O(n^3): the outermost loop iterates n times, and the last two loops iterate (n-k+1)(n-k)/2 times for each k. Summing over k from 0 to n-1 yields an n^3 term, which means the running time is O(n^3).

Algorithm 1. SERIAL_CHOLESKY

  procedure SERIAL_CHOLESKY(A)
    for k := 0 to n-1 do
      A[k,k] := sqrt(A[k,k]);
      for j := k+1 to n-1 do
        A[k,j] := A[k,j] / A[k,k];
      endfor
      for i := k+1 to n-1 do
        for j := i to n-1 do
          A[i,j] := A[i,j] - A[k,i] * A[k,j];
        endfor
      endfor
    endfor
  end SERIAL_CHOLESKY

4. Parallel Algorithm

4.1. Partition Layout

To decide which partitioning layout is adequate for this problem, we first need to analyze how Cholesky factorization works.

Figure 1. To get the value located at the star mark, (a) multiply the row (bold horizontal line) of the leftmost matrix by the column (bold vertical line above the star mark) of the second leftmost matrix. However, by using the symmetry property of Cholesky factorization, this multiplication equals (b) multiplying two columns (bold vertical lines) of the rightmost matrix.

As the figure above shows, to get the value at the star mark, two partial columns must have been previously calculated. This information, namely A[0:i, i] and A[0:i, j], should be broadcast to the processors that wish to compute A[i,j]. Then we have three choices: (1) the column strip partition layout, (2) the column strip cyclic partition layout, and (3) the block cyclic partition layout. It is trivial to argue that the column strip cyclic partition performs better than the column strip partition, because cyclic placement equalizes the computational load and also creates a cyclic dependency of computation. The real issue is the column strip cyclic layout versus the block cyclic layout. We argue that the column strip cyclic partition layout performs better than the block cyclic partition layout. Before diving into this argument, we need some assumptions and definitions.

Assumption 1. The communication cost is defined by the alpha-beta model, namely

  (cost) = (number of communications) * (alpha + beta * (length of data per communication)).

Assumption 2. The communication channel is full-duplex, so we need not worry about simultaneous transfers of data between two processors.

The formal definitions of the two partitioning techniques are:
Definition 1. Column Strip Cyclic Layout
The column strip cyclic layout for a two-dimensional problem set is defined as follows: for n given processors and an N by N matrix, the ith column is allocated to the (i % n)th processor. For example, for a 4 by 4 matrix and two processors, the columns are assigned to processors

  0 1 0 1

Figure 2. Column strip cyclic layout with two processors for a 4 by 4 matrix.

Definition 2. Block Cyclic Layout
The block cyclic layout for a two-dimensional problem is defined as follows: for a given N by N matrix A and n^2 given processors, the element A[i,j] is allocated to the ((i % n) * n + (j % n))th processor, where i <= j. For example, for a 4 by 4 matrix with four processors, the layout of the upper triangle is

  0 1 0 1
    3 2 3
      0 1
        3

Figure 3. Block cyclic layout with four processors for a 4 by 4 matrix.

4.2. Cost Analysis

To decide which algorithm is better in terms of communication overhead, some analysis had to be done before implementation.

Theorem 2. The column strip cyclic partitioning performs better than the block cyclic partitioning for Cholesky factorization.

To prove this theorem, two lemmas are introduced.

Lemma 1. The communication overhead of the column strip cyclic partitioning is O(N^2), where N denotes the
number of columns in the matrix.

Proof) For processor X to obtain the value of A[i,j], the processor needs two columns of information. However, because of the column strip cyclic partitioning, the only information that resides outside the processor is A[0:i, j]. Therefore, for each update, the communication cost of sending one column (A[0:i, j]) of doubles according to the alpha-beta model is

  alpha + 8*beta*i.

If we assume that the topology of the network is fully connected and we can use a recursive algorithm for broadcasting, the upper bound on the broadcast cost of sending the ith column is

  (alpha + 8*beta*i) * log n,

where n denotes the number of processors. This kind of broadcast takes place N times. Therefore, the total communication cost is

  sum_{i=0}^{N-1} (alpha + 8*beta*i) * log n = alpha*N*log n + 4*beta*N*(N-1)*log n = O(N^2).   (1)

Q.E.D.

Lemma 2. The communication overhead of the block cyclic partitioning is also O(N^2), where N denotes the number of columns in the matrix.

Proof) For processor X to obtain the value of A[i,j], the processor needs two columns of information. Because the layout is block cyclic, n processors share one column if n^2 denotes the number of processors available in the parallel computation. Therefore, each processor should broadcast its information to n processors. For one processor, the amount of information to be sent is

  (i/n) * 8

and the broadcast cost for each processor is

  (alpha + 8*beta*i/n) * log n.

However, for one update, 2n processors must participate in communication. Therefore, the total cost for one update is

  2n * (alpha + 8*beta*i/n) * log n = (2n*alpha + 16*beta*i) * log n.

Summing i from 0 to N-1 yields

  sum_{i=0}^{N-1} (2n*alpha + 16*beta*i) * log n = 2*alpha*n*N*log n + 8*beta*N*(N-1)*log n = O(N^2).   (2)

Q.E.D.

Proof of Theorem 2) The claim follows from Lemmas 1 and 2 by comparing the cost equations (1) and (2): the startup term of the column strip layout is alpha*N*log n, whereas for the block cyclic layout it is 2*alpha*n*N*log n, larger by a factor of 2n (and the bandwidth term is twice as large as well). Q.E.D.

4.3. Parallel Algorithm

Now that we have shown the column strip cyclic layout to be more efficient, the parallel algorithm can be implemented as follows:

Algorithm 2. PARALLEL_CHOLESKY

  procedure PARALLEL_CHOLESKY(A)
    for k := 0 to N-1 do
      if (k % pes == rank)
        // do this calculation prior to any other processors
        for i := 0 to k-1 do
          A[k,k] := A[k,k] - A[i,k]*A[i,k];
        endfor
        A[k,k] := sqrt(A[k,k]);
      endif
      // other processors need to wait until this loop ends
      Broadcast the result to the other processors;
      Update the matrix with the data received from the broadcast;
      for i := start to N-1 by pes do
        for j := 0 to k-1 do
          A[k,i] := A[k,i] - A[j,k]*A[j,i];
        endfor
        A[k,i] := A[k,i] / A[k,k];
      endfor
    endfor
  end PARALLEL_CHOLESKY

4.4. Column Block Strip Layout

Rather than allocating one column to one processor, a couple of columns can be allocated to one
processor. For example, the 0th and 1st columns are allocated to the 0th processor, the 2nd and 3rd columns to the 1st processor, and so on. The optimal width of the block is to be found. One extreme is a width of one, which is the original algorithm. The other extreme is to divide the matrix evenly among the processors, which means N/n columns are allocated to each processor; this layout has disadvantages in load balancing and execution scheduling. The advantages of this approach are mainly two: (1) the cache effect and (2) reduced communication overhead. By calculating a block of data at a time, we can exploit the cache. Because the memory layout is row-major, the original algorithm cannot exploit the cache well. However, if a block of data can be read and calculated together, we can reduce the number of memory reads. The second advantage is that the total number of communications can be reduced. Rather than broadcasting one column at a time to every processor, collecting some information in one processor and broadcasting it all together reduces the number of communications by a factor of the width of the block.

4.4.1. Block Strip Parallel Algorithm

The basic structure of the block strip parallel algorithm is very similar to the column strip algorithm. The major difference is that the other processors need to wait until a block of columns has been computed. For example, when the width of the strip is b, one processor should first compute b(b+1)/2 entries. After this initial computation, the processor designated to compute those entries broadcasts the values to the other processors. The other processors update their matrices and execute their own computation with the newly computed values. In this stage, each processor also computes a blocked strip of width b in one step. The algorithm terminates because each computation is executed in lock-step and the number of steps is bounded by the size of the matrix, N.

Algorithm 3.
BLOCK_PARALLEL_CHOLESKY

  procedure BLOCK_PARALLEL_CHOLESKY(A)
    for k := 0 to N-1 do
      if ((k/b) % pes == rank)
        Compute the left-most column block prior to broadcasting;
      endif
      // other processors need to wait until this loop ends
      Broadcast the column block to the other processors;
      Update the matrix with the data received from the broadcast;
      while needMoreComputation == true
        The other processors start computation with the updated information;
        for each processor, (N/b/pes)-many column blocks should be computed
      endwhile
    endfor
  end BLOCK_PARALLEL_CHOLESKY
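Both parallel variants distribute the kernel of Algorithm 1. As an illustration (our own Python sketch, not the authors' MPI code; the function names are ours), here is the serial factorization together with the positive definite matrix generation of Section 2.2.1:

```python
import math
import random

def serial_cholesky(A):
    """In-place upper-triangular Cholesky (Algorithm 1): on return the upper
    triangle of A holds U such that the original A equals U^T U."""
    n = len(A)
    for k in range(n):
        A[k][k] = math.sqrt(A[k][k])
        for j in range(k + 1, n):
            A[k][j] /= A[k][k]
        for i in range(k + 1, n):          # trailing-submatrix update
            for j in range(i, n):
                A[i][j] -= A[k][i] * A[k][j]
    return A

def random_positive_definite(n, seed=0):
    """Section 2.2.1: build a random nonsingular upper-triangular U and
    return U^T U, which Theorem 1 guarantees to be positive definite."""
    rnd = random.Random(seed)
    U = [[rnd.uniform(1.0, 2.0) if j >= i else 0.0 for j in range(n)]
         for i in range(n)]
    return [[sum(U[k][i] * U[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]
```

Factoring the output of `random_positive_definite` and recomputing U^T U from the upper triangle recovers the original matrix, which is a convenient correctness check for the parallel versions as well.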
4.4.2. Communication Cost

The communication cost of this algorithm follows easily. Because we aggregate b updates into one broadcast, but the overall amount of data transferred remains the same, the communication cost is

  (alpha*N/b) * log n + 4*beta*N*(N-1) * log n.

5. Experiment

5.1. Scalability Study

5.1.1. Experiment Plan

We measured the execution time of the column strip cyclic algorithm as the number of processors increases, with a fixed workload (a fixed matrix size of 2160 by 2160), on the FWGrid [5] machine. Only 24 processors in one rack were used, because cross-rack communication overhead is far larger than in-rack communication overhead.

                  alpha (us)   1/beta (10^-9)
  Cross-rack      44.9         9.748
  In-rack         38.1         9.668
  Difference (%)  17.848       0.829

Table 1. Comparison between cross-rack and in-rack (alpha-beta model constants)

As the table shows, the startup overhead of cross-rack communication is about 17% higher than in-rack. Moreover, when the program runs across racks, the performance gets much worse. An experiment was performed with 10 processors: one run used processors only in rack 5, and the other used processors spread across the FWGrid machine. The column strip cyclic algorithm was run on both configurations.

  10 processors     in rack    cross rack
  running time      14.93      143.46

Table 2. Comparison between cross-rack and in-rack (running time)

These figures clearly show why we used only 24 processors in one rack: the difference is a factor of 10.

5.1.2. Result

  # procs           1      2      3      5      9      12     16     18     20     24
  Running time (s)  83.62  59.65  40.23  26.42  16.67  13.12  10.27  9.9    9.3    8.36
  Speedup           1      1.40   2.07   3.16   5.01   6.37   8.14   8.43   8.98   10.0
  Efficiency        1      .70    .69    .63    .55    .53    .50    .46    .44    .41

Table 3. Scalability of the column strip cyclic layout
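The speedup and efficiency rows of Table 3 follow from the running times in the usual way (speedup = T_1/T_p, efficiency = speedup/p). A small Python helper, shown with illustrative numbers rather than the measured ones:

```python
def scalability(times):
    """times maps processor count p to running time T_p in seconds.
    Returns {p: (speedup, efficiency)} with speedup = T_1 / T_p and
    efficiency = speedup / p, as in Table 3."""
    t1 = times[1]
    return {p: (t1 / t, t1 / (t * p)) for p, t in sorted(times.items())}

# Illustrative (not measured) values:
print(scalability({1: 80.0, 2: 40.0, 4: 25.0}))
# → {1: (1.0, 1.0), 2: (2.0, 1.0), 4: (3.2, 0.8)}
```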
Figure 4. Running time vs. number of processors    Figure 5. Efficiency vs. number of processors

5.1.3. Discussion

The running time decreases smoothly and monotonically. However, the efficiency is between .4 and .6, which is neither good nor bad. To find out what causes the low efficiency, another experiment had to be performed. The easiest way is to suppress communication. Without communication the efficiency improves substantially: most of the efficiency values are above .90. So we can conclude that communication overhead lowers the overall efficiency.

  # procs       1      2      3      5      9      12     16     18     20     24
  Efficiency    1      .94    .97    .93    .90    .87    .87    .90    .90    .87
  Running time  83.62  44.42  28.55  17.96  10.31  7.95   5.96   5.10   4.62   3.99

Table 4. Scalability when communication is suppressed

Figure 6. Running time without communication    Figure 7. Efficiency without communication

5.2. Optimal Block Size

5.2.1. Experiment Plan

For the column block strip cyclic algorithm, we need to find the optimal block width. The experiment was performed with gradually increasing block sizes. First, the running time was measured with a 2160 by 2160 matrix and 18 processors. However, because of the unsatisfactory result described below, extra experiments were added with 12 processors, 9 processors, and 4 processors.
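The layouts being compared are pinned down by their column-to-rank mapping. This sketch (our own notation, not taken from the paper's code) shows how the block width b generalizes Definition 1:

```python
def owner(j, b, p):
    """Rank owning column j in a column block strip cyclic layout with
    block width b over p processors (Section 4.4). b = 1 reduces to the
    plain column strip cyclic layout of Definition 1: owner = j % p."""
    return (j // b) % p

# Width 2, two processors: columns 0,1 -> rank 0; 2,3 -> rank 1; 4,5 -> rank 0; ...
assignment = [owner(j, 2, 2) for j in range(8)]   # → [0, 0, 1, 1, 0, 0, 1, 1]
```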
5.2.2. Result

Figure 8. Running time with 18 processors and a 2160 by 2160 matrix

  blk size      1     2     3     4     5     6     8     10    12    15    20    24
  running time  9.81  9.22  8.87  9.62  9.71  9.42  9.28  9.41  9.45  9.50  9.96  10.23
  stdev         .01   .03   .06   .24   .23   .20   .17   .14   .17   .19   .21   .28

Table 5. Running time with 18 processors and a 2160 by 2160 matrix

Figure 9. Running time with 12 processors and a 2160 by 2160 matrix

  blk size      1      2      3      4      5      6      8      10     12     15     20     24
  running time  12.57  12.12  11.91  11.96  11.73  11.35  12.10  12.10  11.97  12.48  13.29  14.16
  stdev         .17    .06    .07    .04    .10    .17    .11    .15    .22    .21    .17    .13

Table 6. Running time with 12 processors and a 2160 by 2160 matrix
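One way to see why an intermediate block size can win is to evaluate the cost model of Section 4.4.2: aggregation divides only the startup term by b, while the bandwidth term is unchanged (and in practice the post-computation cost grows with b). A sketch with illustrative constants, not the measured FWGrid values; the log base 2 reflects the recursive-broadcast assumption:

```python
import math

def blocked_comm_cost(N, n, alpha, beta, b):
    """Communication cost of the column block strip cyclic algorithm under
    the alpha-beta model (Section 4.4.2). b = 1 gives Lemma 1's cost."""
    return (alpha * N / b) * math.log2(n) + 4 * beta * N * (N - 1) * math.log2(n)

# Doubling b halves the startup term only (beta = 0 isolates startup):
c1 = blocked_comm_cost(2160, 18, 40e-6, 0.0, 1)
c2 = blocked_comm_cost(2160, 18, 40e-6, 0.0, 2)
```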
Figure 10. Running time with 9 processors and a 2160 by 2160 matrix

  blk size      1      2      3      4      5      6      8      10     12     15     20     24
  running time  16.44  16.39  16.13  17.18  17.59  17.80  17.92  17.68  17.81  17.98  18.07  18.62
  stdev         .05    .18    .19    .29    .27    .20    .14    .21    .21    .15    .19    .24

Table 7. Running time with 9 processors and a 2160 by 2160 matrix

5.2.3. Discussion

Unfortunately, this set of experiments did not identify a single optimal value. First of all, the shape of the curve changes with the number of processors, and the fluctuation shows too much irregularity. The one common pattern is the increase once the block size exceeds 15. Even though we cannot pick one optimal size, each experiment has a minimum. With 18 processors, at block size 3 the running time is 8.87 and the standard deviation is only .06. If we assume the measurements follow a Gaussian distribution, then even compared with the closest running time (block size 2) the difference is statistically significant. The same argument applies to block size 6 with 12 processors. However, for block size 3 with 9 processors the separation is too small, so we cannot argue that 3 performs better there.

5.3. Bottleneck Analysis

5.3.1. Experiment Plan

Although communication overhead accounts for most of the performance degradation in Section 5.1, further investigation is needed to analyze the behavior of the program more accurately, especially for Section 5.2. The execution of the Cholesky factorization algorithms consists of three parts: (1) initial computation, which computes one entry (or block) to broadcast while the other processors are waiting; (2) communication; and (3) update and computation, where parallelism plays its role. So our method for finding the bottleneck is to suppress each part in turn and measure the execution time.
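The three-way split described above can be instrumented with per-phase timers around each part of the loop body. In the MPI code this would use MPI_Wtime; the sketch below (our own structure, not the authors' code) uses Python's perf_counter:

```python
import time
from contextlib import contextmanager

timers = {"initial computation": 0.0, "communication": 0.0, "post computation": 0.0}

@contextmanager
def phase(name):
    """Accumulate wall-clock time spent in one of the three phases of
    Section 5.3.1 (initial computation, communication, post computation)."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        timers[name] += time.perf_counter() - t0

# Inside the factorization loop one would write, e.g. (mpi4py-style):
#   with phase("communication"):
#       comm.Bcast(block, root=owner)
with phase("initial computation"):
    sum(range(1000))        # stand-in for the leading-block factorization
```

Reporting each accumulator as a percentage of their sum gives exactly the decomposition plotted in Figure 11.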
5.3.2. Result

Figure 11. Decomposition of execution time (initial computation, communication, post computation) as a percentage, by block size

  block size               1     2     3     4     5     6     8     10    12    15    20    24
  initial computation (s)  .17   .23   .21   .37   .31   .26   .12   .53   .66   .47   .97   .80
  communication (s)        4.51  4.21  3.73  3.08  3.16  3.82  4.06  4.09  3.77  3.23  2.55  2.31
  post computation (s)     5.13  4.78  4.93  6.17  6.24  5.34  5.10  4.79  5.02  5.80  6.45  7.12

Table 8. The execution time of each component

5.3.3. Analysis

The minimum in this setting occurs at block size 3. However, we cannot find any single pattern in the graph showing why 3 is the minimum. At block size 3 the post computation is small but not the smallest (size 2 is the smallest), and the communication overhead is also not the smallest (size 4 is smaller). Block size 3 yields the smallest total because it has both a small communication overhead and a small post computation cost, even though neither is individually the smallest. The only thing we can infer from this graph is that the communication cost has its own distribution as the block size increases, and the post computation cost has another distribution of its own; the sum of the two distributions determines the overall performance. That is probably why the graphs in Section 5.2 look irregular.

6. Future Work

The bottleneck analysis is not complete at this time, but the basic idea of decomposing execution time is illustrated in this paper. A further step would be per-process measurement: because there is some imbalance of execution time between processors, more information about the bottleneck can be found with per-processor measurements. We also did not take the cache effect into account. Caching can drastically change the overall performance of a program, but it has not been discussed much in this paper. Further investigation should examine how well the cache behaves in the parallel algorithms.
7. Conclusion

In this project, we learned how a parallel algorithm can be analyzed, designed, and optimized. Cost analysis and proof, algorithm description, and decomposition of a program into components all helped us understand how a parallel computation works. Through this project we also had the chance to use the FWGrid machine, which has many processors, up to almost 100, and using this machine we were able to check the scalability of a program.

8. Bibliography

[1] Erwin Kreyszig, Advanced Engineering Mathematics, 8th Ed., John Wiley & Sons, Inc., 1999.
[2] A. Grama, A. Gupta, G. Karypis, and V. Kumar, Introduction to Parallel Computing, 2nd Ed., Addison-Wesley, 2003.
[3] MPI (Message Passing Interface) Home Page: http://www.mcs.anl.gov/mpi
[4] All ideas related to layout are taken from lecture notes by Prof. Scott Baden: http://www.cse.ucsd.edu/classes/fa05/cse260/lectures/lec14/lec14.pdf
[5] FWGrid Home Page: http://fwgrid.ucsd.edu/