An improved Thomas Algorithm for finite element matrix parallel computing


Qingfeng Du 1a), Zongli Li 1b), Hongmei Zhang 2), Xilin Lu 2), Liu Zhang 1)

1) School of Software Engineering, Tongji University, Shanghai, China
2) Research Institute of Structural Engineering and Disaster Reduction, Tongji University, Shanghai, China
zhanghongmei@tongji.edu.cn

ABSTRACT

As the scale of data in linear finite element analysis expands, computational efficiency has become a main technical bottleneck. To overcome this bottleneck, parallel computation is playing an increasingly prominent role. At present, research on linear finite element parallel computation is mainly concentrated on the preprocessing phase, and its goal is to reduce the communication overhead and improve the homogeneous degree. Research on the computation phase, which processes the linear finite element analysis data, is rare, although most of the computation cost lies in this phase. In this paper, we study and analyze stiffness matrix decomposition and, after comparing different matrix decomposition algorithms, propose an improved algorithm for parallel computation based on the Thomas algorithm. Verification on a large amount of data shows that the improved algorithm greatly enhances the parallel performance of linear finite element computation.

1. INTRODUCTION

Nowadays, the linear finite element method is widely used in the analysis of large, complex combined structures. With the growth of data scale, linear finite element parallel computation (Ananth 2003) is playing an increasingly important role in engineering, especially in concrete structure simulation (Liu 2009, Lv 2011). At the moment, research on linear finite element parallel computation is mainly concentrated on the pre-processing phase, and its goal is to reduce the communication overhead and improve the homogeneous degree (Maurer 2011, Paz 2005). Little research has been conducted on the computation phase that processes the finite element analysis data, yet most of the computation cost lies in this phase, mainly in solving large systems of linear equations.
The structural stiffness matrix, which is the coefficient matrix of the equations, is singular, symmetrical and sparse, with non-zero elements spread over a stripe region, and is usually a tridiagonal matrix (Turmo 2012). The most mature decomposition algorithm is the Gaussian elimination algorithm, which is also called the LU algorithm. For tridiagonal matrices, Thomas proposed the chasing algorithm (Thomas algorithm) based on the LU algorithm (Turmo 2012). This algorithm is very effective for solving tridiagonal linear equations, but it is not suitable for parallel computation (Paz 2005). To solve the problem of parallel computation of tridiagonal linear equations, the original Thomas algorithm on a single processor is analyzed here, and then an improved algorithm suitable for parallel computation is proposed for complex structural matrix analysis by describing its idea and logic. To verify the efficiency of the improved algorithm, MPI (Message Passing Interface) (Pacheco 1997) was employed to test it on data of various scales. The test results indicate that the improved algorithm significantly enhances the parallel performance of linear finite element computation.

* Corresponding author, Associate Professor, PhD, zhanghongmei@tongji.edu.cn; du_cloud@tongji.edu.cn
a Professor, PhD
b Master Student
Note: Copied from the manuscript submitted to Computers and Concrete, An International Journal, for presentation at the ASEM13 Congress.

1. BACKGROUND AND RELEVANT KNOWLEDGE

1.1 The decomposition of the coefficient matrices of linear equations and the Thomas algorithm

Assume that $Ax = b$ is a large system of linear equations and $A$ is the coefficient matrix. The most mature decomposition algorithm is the Gaussian elimination algorithm, which is divided into two phases. First, algebraic operations reduce $Ax = b$ to upper triangular equations, so that it can be written as $Ux = g$ ($U$ is a unit upper triangular coefficient matrix). Second, backward substitution is used to solve the equations. An improvement of this method is the $LU$ algorithm, which decomposes the matrix $A$ into an upper triangular matrix $U$ and a lower triangular matrix $L$, that is, $A = LU$. $L$ is stored in the lower triangle of $A$ and $U$ is stored in the upper triangle of $A$ (the unit diagonal elements are not stored; their default value is 1). The Gaussian decomposition method is applicable to general dense matrices.
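As an illustration of the storage convention just described, the following is a minimal sketch of an in-place decomposition with a unit upper triangular factor (the Crout variant), assuming no pivoting is needed. The function names `crout_lu_in_place` and `lu_solve` are hypothetical and are not taken from the paper.

```python
def crout_lu_in_place(a):
    """Overwrite the square matrix a (list of lists) with its Crout factors.

    After the call, the lower triangle (including the diagonal) holds L and
    the strict upper triangle holds U; the unit diagonal of U is implied.
    Assumption: no pivoting is required (no zero pivot appears).
    """
    n = len(a)
    for k in range(n):
        # Column k of L: l_ik = a_ik - sum_{s<k} l_is * u_sk, for i >= k
        for i in range(k, n):
            a[i][k] -= sum(a[i][s] * a[s][k] for s in range(k))
        if a[k][k] == 0.0:
            raise ZeroDivisionError("zero pivot; pivoting would be needed")
        # Row k of U: u_kj = (a_kj - sum_{s<k} l_ks * u_sj) / l_kk, for j > k
        for j in range(k + 1, n):
            a[k][j] = (a[k][j] - sum(a[k][s] * a[s][j] for s in range(k))) / a[k][k]
    return a


def lu_solve(lu, b):
    """Solve A x = b given the packed factors produced by crout_lu_in_place."""
    n = len(lu)
    y = [0.0] * n
    for i in range(n):  # forward substitution with L
        y[i] = (b[i] - sum(lu[i][s] * y[s] for s in range(i))) / lu[i][i]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # backward substitution with unit upper U
        x[i] = y[i] - sum(lu[i][j] * x[j] for j in range(i + 1, n))
    return x


factors = crout_lu_in_place([[4.0, 3.0], [6.0, 3.0]])
solution = lu_solve(factors, [10.0, 12.0])
```

Because both factors share the storage of the original matrix, no extra $n \times n$ array is needed, which is the point of the storage remark in the text.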
The Cholesky decomposition algorithm is widely used for positive definite matrices. It can be seen as a special case of the $LU$ algorithm that is better suited to symmetric positive definite matrices. In this algorithm, the matrix $A$ is decomposed into the product of a triangular matrix and its transpose, i.e., $A = U^T U$ ($U$ is an upper triangular matrix). The computational cost of this decomposition is about $n^3/6$ multiplications, which is only half that of the Gaussian decomposition method. For tridiagonal matrices, Thomas proposed the chasing algorithm (Thomas algorithm) based on the $LU$ algorithm. The algorithm is very simple, and its computational cost is only $5n-4$ multiplication and division operations. It is numerically stable and is a classical algorithm for solving tridiagonal linear equations.

1.2 Structural stiffness matrix

An element $k_{ij}$ of the structural stiffness matrix $K$ means: how much force should be exerted on node $i$ when the displacement of node $j$ is one unit while the displacements of the other nodes are zero. The difference between the element stiffness matrix and the structural stiffness matrix is that a structure is a collection of elements, and every element affects the structure. As an element stiffness matrix is symmetrical and singular, the structural stiffness matrix assembled from the elements is also symmetrical and singular. That is, displacement constraint conditions have to be given in order to remove the singularity of $K$, so that the displacements of the elements can be obtained.

The structural stiffness matrix is the collection of the element stiffness matrices. Although the total number of elements is large and the order of the structural stiffness matrix is high, most of its entries are zero. So if the node numbering is reasonable, the non-zero entries spread over a stripe region centered on the principal diagonal. In short, a structural stiffness matrix is singular, symmetrical and sparse, with non-zero entries spread over a stripe region. With reasonable numbering, and once the displacement constraints are applied, the matrix is a positive definite tridiagonal matrix.

1.3 Existing problems

Finite element analysis is a very important numerical analysis method that has been widely used in engineering and scientific computing. However, for large or very large complex structures, the amount of computation grows very rapidly, and the usual strategy is to use a supercomputer. In recent years, finite element parallel computing has drawn researchers' attention. One of the concerns is to improve finite element parallel algorithms so as to raise the efficiency of large-scale complex structural analysis in a common distributed parallel computing environment and make it applicable to common users.

Finite element distributed parallel computing can be divided into three stages: pre-processing, computation and post-processing. In the pre-processing stage, the finite element model is built and the mesh is divided into elements. During post-processing, the results are analyzed to help users extract information and understand the calculated results. The computing cost lies mainly in the computation stage, and most of it is spent solving large-scale linear equations. The computation process involves the decomposition of the coefficient matrix of the linear equations, the Thomas algorithm and the structural stiffness matrix.
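To make the properties listed in Section 1.2 concrete, here is a small sketch that assembles the structural stiffness matrix of a chain of identical two-node bar elements. The chain model, the stiffness value `k` and the function names are assumptions chosen for illustration; they do not come from the paper.

```python
def assemble_bar_chain(num_elements, k=1.0):
    """Assemble the stiffness matrix of a 1D chain of two-node bar elements.

    Each element contributes the singular, symmetric element matrix
    k * [[1, -1], [-1, 1]] to the rows and columns of its two nodes.
    """
    n = num_elements + 1  # number of nodes
    K = [[0.0] * n for _ in range(n)]
    ke = [[k, -k], [-k, k]]
    for e in range(num_elements):  # element e connects nodes e and e + 1
        for r in range(2):
            for s in range(2):
                K[e + r][e + s] += ke[r][s]
    return K


def fix_first_node(K):
    """Apply a displacement constraint by removing the first row and column."""
    return [row[1:] for row in K[1:]]


K = assemble_bar_chain(3)
# Every row of K sums to zero, so K times the all-ones vector is zero:
# the unconstrained matrix is singular (a rigid-body translation costs no force).
K_constrained = fix_first_node(K)
```

With consecutive node numbering the assembled matrix is tridiagonal, and after the first node is fixed the remaining matrix is symmetric positive definite, so the Thomas algorithm of Section 2.1 applies to it.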
The key point of our research is the stiffness matrix decomposition. This paper analyzes the features of the coefficient matrix of large-scale linear equations and of the Thomas algorithm. We then put forward an effective matrix decomposition strategy and, based on it, an improved Thomas algorithm that is applicable to large complex structure analysis and suitable for parallel computation.

2. THE IMPROVEMENT OF THOMAS ALGORITHM

2.1 The existing Thomas algorithm on a single process

2.1.1 Algorithm introduction

First, the single-process Thomas algorithm is given. Assume the coefficient matrix $A$ is a positive definite tridiagonal matrix, that is:

$$A = \begin{pmatrix} b_1 & c_1 & & & \\ a_2 & b_2 & c_2 & & \\ & \ddots & \ddots & \ddots & \\ & & a_{n-1} & b_{n-1} & c_{n-1} \\ & & & a_n & b_n \end{pmatrix} \qquad (1)$$

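Decompose $A$ according to Crout, that is, $A = LU$ with

$$L = \begin{pmatrix} \alpha_1 & & & \\ \gamma_2 & \alpha_2 & & \\ & \ddots & \ddots & \\ & & \gamma_n & \alpha_n \end{pmatrix} \qquad (2)$$

$$U = \begin{pmatrix} 1 & \beta_1 & & \\ & 1 & \ddots & \\ & & \ddots & \beta_{n-1} \\ & & & 1 \end{pmatrix} \qquad (3)$$

where $\alpha_i$, $\beta_i$ and $\gamma_i$ are undetermined coefficients. By matrix multiplication, we get:

$$\begin{aligned} b_1 &= \alpha_1, \quad c_1 = \alpha_1 \beta_1, \\ a_i &= \gamma_i, \quad b_i = \gamma_i \beta_{i-1} + \alpha_i, \quad i = 2, 3, \dots, n, \\ c_i &= \alpha_i \beta_i, \quad i = 2, 3, \dots, n-1, \end{aligned} \qquad (4)$$

and therefore

$$\begin{aligned} \alpha_1 &= b_1, \quad \beta_1 = c_1 / b_1, \\ \alpha_i &= b_i - a_i \beta_{i-1}, \quad i = 2, 3, \dots, n, \\ \beta_i &= c_i / \alpha_i, \quad i = 2, 3, \dots, n-1. \end{aligned} \qquad (5)$$

Therefore, the tridiagonal equations $Ax = f$ are equivalent to the following two systems of equations: $Ly = f$ and $Ux = y$.

2.1.2 The logic of the algorithm

Step 1: Input the data $a_i$, $b_i$, $c_i$, $f_i$.

Step 2: Compute

$$\begin{aligned} \alpha_1 &= b_1, \quad \beta_1 = c_1 / b_1, \\ \alpha_i &= b_i - a_i \beta_{i-1}, \quad i = 2, 3, \dots, n, \\ \beta_i &= c_i / \alpha_i, \quad i = 2, 3, \dots, n-1. \end{aligned} \qquad (6)$$

Step 3: Solve the equations $Ly = f$:

$$y_1 = f_1 / b_1, \qquad y_i = (f_i - a_i y_{i-1}) / \alpha_i, \quad i = 2, 3, \dots, n. \qquad (7)$$

Step 4: Solve the equations $Ux = y$:

$$x_n = y_n, \qquad x_i = y_i - \beta_i x_{i+1}, \quad i = n-1, n-2, \dots, 1. \qquad (8)$$

Step 5: Output the solution $x$ of the equations.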
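Steps 1 to 5 can be sketched as a short serial routine. This is a minimal illustration rather than the authors' implementation; the function name `thomas_solve` and the array convention (all four lists have length $n$, with `a[0]` and `c[n-1]` unused) are assumptions.

```python
def thomas_solve(a, b, c, f):
    """Solve a tridiagonal system with the Thomas (chasing) algorithm.

    a: sub-diagonal (a[0] is unused), b: main diagonal,
    c: super-diagonal (c[n-1] is unused), f: right-hand side.
    Follows Eqs. (6)-(8): forward sweep for alpha, beta, y, then back substitution.
    """
    n = len(b)
    alpha = [0.0] * n
    beta = [0.0] * n  # beta[n-1] is never used
    y = [0.0] * n

    # Forward sweep, Eqs. (6) and (7)
    alpha[0] = b[0]
    if n > 1:
        beta[0] = c[0] / alpha[0]
    y[0] = f[0] / alpha[0]
    for i in range(1, n):
        alpha[i] = b[i] - a[i] * beta[i - 1]
        if i < n - 1:
            beta[i] = c[i] / alpha[i]
        y[i] = (f[i] - a[i] * y[i - 1]) / alpha[i]

    # Backward substitution, Eq. (8)
    x = [0.0] * n
    x[n - 1] = y[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = y[i] - beta[i] * x[i + 1]
    return x


# 3x3 example: diagonal 2, off-diagonals -1, exact solution (1, 2, 3)
x = thomas_solve([0.0, -1.0, -1.0], [2.0, 2.0, 2.0], [-1.0, -1.0, 0.0], [0.0, 0.0, 4.0])
```

Each unknown depends on its predecessor in the forward sweep and on its successor in the back substitution, which is why the serial form does not parallelize directly.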

The process of calculating $\alpha_i$, $\beta_i$ and $y_i$ is the forward sweep. The process of calculating $x_i$ is the backward substitution. As the formulas of the Thomas algorithm are very simple, its computational cost is only $5n-4$ multiplications and divisions. The Thomas algorithm is numerically stable, so it is widely used for the serial solution of tridiagonal equations.

2.2 Improvement of Thomas algorithm for parallel computing

2.2.1 The existing storage strategies for large tridiagonal matrices

Block Storage Strategy: Assume $A$ is an $n$-order square matrix and $p$ is the number of node machines. For large finite element analysis, $n \gg p$ under normal circumstances. Assume $m = n/p$. In the block storage scheme, the first node machine stores the first $m$ lines, the second node machine stores the second $m$ lines, and so on, and the last node machine stores the last $m$ lines. If $n$ is not an integer multiple of $p$, the remaining rows are stored on the first node. As the front lines of the matrix complete the decomposition first, another machine becomes idle each time $m$ lines have been decomposed, so this storage strategy suffers from serious load imbalance.

Single-line Shutter Storage Strategy: In single-line shutter storage, the lines are dealt out to the node machines cyclically, one line at a time. For example, the first node stores line 1, line $p+1$, line $2p+1$, and so on. Only when the last line assigned to it has been decomposed does the first processor stop operating. This strategy minimizes the load imbalance. However, the communication overhead increases significantly.

Multi-line Shutter Storage Strategy: The multi-line shutter storage strategy combines the advantages of the two strategies above. The matrix is decomposed into blocks by rows, each block contains multiple lines, and the blocks are dealt out to the node machines cyclically. Only when the last block assigned to it has been decomposed does the first processor stop operating. Since each block contains multiple lines, the communication overhead is lower than with single-line shutter storage.

2.2.2 The existing decomposition strategy and mapping technology: Cholesky decomposition

The idea of the Cholesky decomposition:

$$\begin{pmatrix} b_1 & c_1 & & & \\ a_2 & b_2 & c_2 & & \\ & \ddots & \ddots & \ddots & \\ & & a_{n-1} & b_{n-1} & c_{n-1} \\ & & & a_n & b_n \end{pmatrix} \;\Rightarrow\; \begin{pmatrix} B_1 & C_1 & & & \\ C_1^T & B_2 & C_2 & & \\ & \ddots & \ddots & \ddots & \\ & & C_{q-2}^T & B_{q-1} & C_{q-1} \\ & & & C_{q-1}^T & B_q \end{pmatrix}$$

Fig. 1 The left is the matrix before decomposition, the right is the matrix after decomposition

Fig. 1 shows the matrix before and after decomposition. $C_i$ is an $m$-order square matrix in which only the lower left corner element is non-zero. $B_i$ is an $m$-order positive definite tridiagonal matrix, and it is decomposed by the Cholesky decomposition method: $B_i = L_i D_i L_i^T$, where $D_i$ is a diagonal matrix and $L_i$ is a unit lower triangular matrix. The matrix transformations are as follows.

$$\begin{pmatrix} D_1 & C_1 & & \\ C_1^T & D_2 & C_2 & \\ & \ddots & \ddots & \ddots \\ & & C_{p-1}^T & D_p \end{pmatrix}$$

Fig. 2 The left is the matrix before decomposition, the right is the matrix after decomposition

In Fig. 2, all elements of each transformed off-diagonal block, except for those in its last line, are 0. As $D_i$ is a diagonal matrix, the elements of the last line of each transformed block can be eliminated to zero, except for the last element. Thus, the last lines of the blocks form a new, small tridiagonal system of equations. This small tridiagonal system can be solved on a single processor, and the solution is then sent to the other processors, after which the original problem can be solved.

Algorithm description:
Step 1: Decompose each $B_i$ as $B_i = L_i D_i L_i^T$.
Step 2: Transform the matrix as in Fig. 2.
Step 3: Solve the small tridiagonal system of equations on a single processor.
Step 4: Solve the original problem.

This algorithm has good parallelism, but its computational complexity is double that of the serial algorithm. The gain in parallelism is therefore offset by the extra computation, which reduces the efficiency of the algorithm. We would therefore like to improve on the idea of this algorithm to reduce the computational complexity.

2.2.3 The improvement of storage strategy (Improved Multi-line Wrapped Interleaved Row Storage, IMWIRS)

A large tridiagonal matrix is a large sparse matrix: all elements, except for those next to the diagonal, are zero. Storing those zero elements wastes memory. The new approach is to transform the matrix and store only the non-zero elements. This storage strategy is an improvement of the multi-line shutter storage method that is specifically applicable to tridiagonal matrices.

Assume a system of linear equations $Ax = b$, where $A$ is a tridiagonal matrix of order $n$. The original storage method stores $A$ in a two-dimensional $n \times n$ array. The improved method only needs $3n$ memory locations, which is only $3/n$ of the original. As $n$ increases, the advantage becomes more obvious.
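The compact layout can be sketched as follows, assuming each row of the $n \times 3$ array holds the sub-diagonal, diagonal and super-diagonal entry of the corresponding matrix row, with zeros padding the two corners. The function names `to_compact` and `storage_ratio` are hypothetical.

```python
def to_compact(dense):
    """Convert a dense n x n tridiagonal matrix into an n x 3 array.

    Row i of the result is [a_i, b_i, c_i]: the sub-diagonal, diagonal and
    super-diagonal entries of row i. The first sub-diagonal slot and the last
    super-diagonal slot have no matrix entry and are padded with 0.
    """
    n = len(dense)
    compact = []
    for i in range(n):
        lower = dense[i][i - 1] if i > 0 else 0.0
        upper = dense[i][i + 1] if i < n - 1 else 0.0
        compact.append([lower, dense[i][i], upper])
    return compact


def storage_ratio(n):
    """Memory of the compact layout (3n entries) relative to dense storage (n*n)."""
    return (3 * n) / (n * n)


dense = [[2.0, -1.0, 0.0],
         [-1.0, 2.0, -1.0],
         [0.0, -1.0, 2.0]]
compact = to_compact(dense)
ratio_for_1000 = storage_ratio(1000)  # 3/n with n = 1000
```

For $n = 1000$ the compact array needs 3000 entries instead of $10^6$, that is, 0.3% of the dense storage.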

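$$A = \begin{pmatrix} b_1 & c_1 & & & \\ a_2 & b_2 & c_2 & & \\ & \ddots & \ddots & \ddots & \\ & & a_{n-1} & b_{n-1} & c_{n-1} \\ & & & a_n & b_n \end{pmatrix}, \qquad A' = \begin{pmatrix} 0 & b_1 & c_1 \\ a_2 & b_2 & c_2 \\ \vdots & \vdots & \vdots \\ a_{n-1} & b_{n-1} & c_{n-1} \\ a_n & b_n & 0 \end{pmatrix}$$

Fig. 3 $A$ is the original storage method, which needs an $n \times n$ array, and $A'$ is the transformed one, which needs only $3n$ memory locations

In Fig. 3, for clarity, corresponding elements carry the same label. In actual storage, we assume that $A$ and $A'$ represent the arrays before and after the transformation, respectively. $a_{ij}$ is an element of $A$, and $a'_{ij}$ is an element of $A'$, with $a'_{i,1} = a_{i,i-1}$, $a'_{i,2} = a_{i,i}$ and $a'_{i,3} = a_{i,i+1}$.

2.2.4 Decomposition strategy and mapping technology based on the improved storage strategy

Step 1: decompose the large tridiagonal matrix into blocks. Assume that the matrix is an $n$-order matrix and decompose it into $m$-order small square matrices. Assume $q = n/m$ and that there are $p$ processors. According to the multi-line shutter storage strategy, the first processor stores the first $m$ rows, the second processor stores the second $m$ rows, and so on, the $i$-th processor storing the $i$-th group of $m$ rows, until all $n$ rows are stored.

$$\begin{pmatrix} b_1 & c_1 & & & \\ a_2 & b_2 & c_2 & & \\ & \ddots & \ddots & \ddots & \\ & & a_{n-1} & b_{n-1} & c_{n-1} \\ & & & a_n & b_n \end{pmatrix} \;\Rightarrow\; \begin{pmatrix} B_1 & C_1 & & & \\ A_2 & B_2 & C_2 & & \\ & \ddots & \ddots & \ddots & \\ & & A_{q-1} & B_{q-1} & C_{q-1} \\ & & & A_q & B_q \end{pmatrix}$$

Fig. 4 The left one is the original tridiagonal matrix, the right one is the matrix after decomposition

In Fig. 4, $B_i$ is a tridiagonal matrix; $A_i$ is a square matrix in which only the top right corner element is non-zero; $C_i$ is a square matrix in which only the lower left corner element is non-zero. $A_i$, $B_i$ and $C_i$ are all $m$-order square matrices.

Step 2: store $A_i$, $B_i$ and $C_i$ on the $i$-th node according to the storage method introduced above.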

In Fig. 5, $a_{(i-1)m+1}$ is the only non-zero element of $A_i$; $c_{im}$ is the only non-zero element of $C_i$; the remaining elements constitute the tridiagonal matrix $B_i$.

$$\begin{pmatrix} a_{(i-1)m+1} & b_{(i-1)m+1} & c_{(i-1)m+1} & & & & \\ & a_{(i-1)m+2} & b_{(i-1)m+2} & c_{(i-1)m+2} & & & \\ & & \ddots & \ddots & \ddots & & \\ & & & a_{im-1} & b_{im-1} & c_{im-1} & \\ & & & & a_{im} & b_{im} & c_{im} \end{pmatrix}$$

Fig. 5 The elements and blocks ($A_i$, $B_i$, $C_i$) of the $i$-th node

$$\begin{pmatrix} a_{(i-1)m+1} & b_{(i-1)m+1} & c_{(i-1)m+1} \\ a_{(i-1)m+2} & b_{(i-1)m+2} & c_{(i-1)m+2} \\ \vdots & \vdots & \vdots \\ a_{im-1} & b_{im-1} & c_{im-1} \\ a_{im} & b_{im} & c_{im} \end{pmatrix}$$

Fig. 6 How the data is stored on the $i$-th node

Step 3: on each node, solve the small tridiagonal matrix $B_i$ and the matrices $A_i$ and $C_i$, which have only one non-zero element each, using the Thomas algorithm. The first node does not have to deal with $A_1$, and the last node does not have to solve $C_q$.

2.2.5 Description of the improved algorithm

Based on the analysis above, we give a concrete description of the algorithm. The algorithm stores only the elements next to the diagonal and therefore saves memory. In terms of decomposition strategy and mapping technology, we decompose the large-scale tridiagonal matrix into small square matrices; each processor needs to solve one small tridiagonal matrix and two square matrices that contain only one non-zero element each. Each processor then sends its results to the master processor, which calculates the results of the original problem. The algorithm is described below.

Assume the network topology of the parallel computers is a master-slave architecture and the linear equations are described as $Ax = b$.

Step 1: Input the order $n$ of the large stiffness matrix into the master computer. Assume the program decomposes the large matrix by lines into $p$ blocks (one per slave node). Then, according to the multi-line shutter storage strategy, the master computer calculates how many lines each block contains, that is, $m = \lfloor n/p \rfloor$.

Step 2: Input the elements of the large matrix $A$ and the vector $b$. While the elements are being input, the master computer assigns line 1 to line $m$, together with the corresponding part of the vector $b$, to the first node, line $m+1$ to line $2m$ and the corresponding part of $b$ to the second node, and so on; the remaining lines are assigned to the first node. It is worth noting that, when assigning elements, the master node only processes the elements on and next to the principal diagonal.

Step 3: On each slave node, allocate an $m \times 3$ array and store the elements assigned to the node in this array according to the IMWIRS storage strategy.

Step 4: On each slave node, logically decompose the elements of the array into three $m$-order square matrices: the first element of the array is regarded as the top right element of $A_i$, the last element is regarded as the lower left element of $C_i$, and the remaining elements are regarded as the elements of $B_i$; for the mapping, refer to Fig. 4.

Step 5: On each slave node, $B_i$ is calculated according to the single-process Thomas algorithm.

Step 6: On each node, calculate $A_i$ and $C_i$, which contain only one non-zero element each (the first node does not need to calculate $A_1$, and the last node ($p$) does not have to calculate $C_p$).

Step 7: Each node returns its computation results to the master computer; the master computer calculates the final result and outputs the result vector $x$.

2.2.6 Algorithm implementation using pseudo-code

1. Master
2. procedure distribution()
3. Input n: integer; // order of the coefficient matrix
4. q: integer; // number of blocks (sometimes equal to the number of slave computers)
5. resultvector: array; // result vector of Ax = b
6. Begin
7. m = n/q; // the number of lines in each block
8. Pcurrent = 0; // marks the block currently being processed
9. while (Pcurrent < q) { // not all blocks have been distributed yet
10. for (j = 0; j < m; ++j)
11. for (i = 0; i < n; ++i)
12. {
13. if (|i-j| <= 1) // only the tridiagonal elements are sent to the slave computers
14. {
15. input a[i,j]; // a[i,j] is an element of the coefficient matrix
16. distribute a[i,j] to node computer Pcurrent % P; // P is the total number of slave computers
17. }
18. }
19. distribute resultarray to node Pcurrent % P;
20. Pcurrent++;
21. }
22. End
23. Slave p
24. procedure store()
25. Input a[i,j] // the coefficient elements that the master computer distributes to this node
26. new resultvector // result vector of Ax = b
27. Begin
28. new coefficientarray // m*3 array to store the elements that the master computer distributes to this node

29. for (i = 1; i < m; ++i)
30. {
31. coefficientarray[i,0] = a[i,i-1];
32. }
33. for (i = 0; i < m; ++i)
34. {
35. coefficientarray[i,1] = a[i,i];
36. }
37. for (i = 0; i < m-1; ++i)
38. {
39. coefficientarray[i,2] = a[i,i+1];
40. }
41. coefficientarray[0,0] = 0; // the first element of the array
42. coefficientarray[m-1,2] = 0; // the last element of the array
43. new resultarray: array // m*1 array to store the result
44. Copy resultvector to resultarray;
45. End
46. Slave computer p
47. procedure calculate() // Thomas algorithm
48. Begin
49. α_1 = coefficientarray[0,1]
50. β_1 = coefficientarray[0,2] / coefficientarray[0,1]
51. α_i = coefficientarray[i,1] - coefficientarray[i,0] * β_(i-1)
52. β_i = coefficientarray[i,2] / α_i
53.-56. forward substitution and back substitution on resultarray, following Eqs. (7) and (8)
57. send to master
58. End
59. Master
60. print

2.2.7 Time and space complexity analysis of the improved algorithm

The original algorithm has time complexity $O(n)$ and space complexity $O(n^2)$. Here, based on the pseudo-code above, we analyze the time and space complexity of the improved algorithm.

Lines 10 to 21 of the pseudo-code (matrix partition) form a doubly nested for loop. When there are several loops, the time complexity of an algorithm is determined by the frequency $f(n)$ of the innermost statement in the deepest loop nest. In the pseudo-code, the most frequently executed statements are lines 15 and 16, with $n$ iterations of the inner loop and $m$ iterations of the outer loop. According to line 7, $m = n/q$ ($q$ is the number of computers), so $m$ and $n$ are linearly related, and the time complexity of this part of the pseudo-code is $T(n) = O(nm) = O(n^2)$. From line 29 to line 40 (matrix assignment), there are three single-layer for loops, each of scale at most $m$, so the time complexity of each is $O(m)$, that is, $O(n)$. In short, the time complexity is determined by the maximum number of statement executions within the whole code; therefore the final time complexity of the improved algorithm is $T(n) = O(n^2 + 3n) = O(n^2)$.

As to space complexity, the main space cost is the storage of the matrix elements. According to the algorithm description, there are $q$ slave node computers and each of them needs to create an array of size $m \times 3$; therefore the total space needed is $q \times m \times 3 = 3n$, that is, the space complexity is $S(n) = O(3n) = O(n)$.

We can see that the time complexity of the improved algorithm does not decrease compared with the original algorithm (the matrix partition increases the complexity). However, the advantages of the improved algorithm are mainly its high performance in parallel computing and its large reduction of the space cost, especially for large stiffness matrices, for example when the matrix data file is larger than 500 MB. The verification below shows that the improved algorithm can greatly increase the computing speed for large stiffness matrix equations.

3. IMPROVED ALGORITHM VERIFICATION

Here, we use the improved algorithm to solve the tridiagonal linear equations $AX = b$ to verify the efficiency of the algorithm. We employ MPI. MPI (a standard for message passing interfaces, with a variety of implementations such as MPICH) is a tool that connects multiple hosts through a network for parallel computing. It can also be used for multi-core or multi-CPU parallel computing on a single machine, but the efficiency is poor. It can coordinate several hosts for parallel computing, and therefore it scales well. However, communication among processes can also lead to large memory overhead, low parallel efficiency and complexity in programming.

In addition, we also considered implementing the parallel computing with OpenMP. OpenMP is designed for parallel computing on a single host with multiple CPUs or multiple cores.
In other words, OpenMP is more suitable for parallel computing on a single machine with shared memory; since the threads share memory, it offers high efficiency and low memory overhead. Yet OpenMP is only available for parallel computing on a single host rather than on a cluster. In order to verify the efficiency of the improved algorithm and use as many resources as possible during the verification, we chose MPI, that is, multiple hosts cooperating for parallel computing.

For the tridiagonal linear equations $AX = Y$, we choose different orders of the coefficient matrix according to the data scale: $1 \times 10^7$, $5 \times 10^7$, $1 \times 10^8$, $5 \times 10^8$ and $1 \times 10^9$. For each order, we run the verification with 1, 2, 4, 8 and 16 processors. One processor corresponds to the original serial algorithm. In addition, $T$ (in seconds) represents the computation time, $S$ the speedup ratio and $E$ the parallel efficiency. The verification results are shown in Table 1 to Table 5. In these tables, $S = T_1 / T_m$, where $T_1$ is the original serial computing time ($m = 1$) and, for $m > 1$, $T_m$ is the parallel computing time using $m$ processors. The parallel efficiency is $E = T_1 / (m \cdot T_m)$.
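The two metrics can be computed as in the short sketch below. The timings used in the example are hypothetical values chosen for illustration and are not measurements from the paper; the function name `speedup_and_efficiency` is likewise an assumption.

```python
def speedup_and_efficiency(t1, tm, m):
    """Return (S, E) for serial time t1 and parallel time tm on m processors.

    S = T_1 / T_m is the speedup ratio and E = T_1 / (m * T_m) = S / m is
    the parallel efficiency.
    """
    speedup = t1 / tm
    efficiency = speedup / m
    return speedup, efficiency


# Hypothetical timings (seconds), NOT taken from the paper's tables
s, e = speedup_and_efficiency(t1=100.0, tm=40.0, m=4)
# s is 2.5: four processors finish 2.5 times faster than one
# e is 0.625: each processor is used at 62.5% of the ideal
```

An efficiency of 1 would mean perfectly linear speedup; communication overhead pushes $E$ below 1 as $m$ grows.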

Table 1 Verification result of a $1 \times 10^7$ order matrix
m (number of processors) | $T_m$ (s) | $S = T_1/T_m$ | $E = T_1/(m \cdot T_m)$

Table 2 Verification result of a $5 \times 10^7$ order matrix
m (number of processors) | $T_m$ (s) | $S = T_1/T_m$ | $E = T_1/(m \cdot T_m)$

Table 3 Verification result of a $1 \times 10^8$ order matrix
m (number of processors) | $T_m$ (s) | $S = T_1/T_m$ | $E = T_1/(m \cdot T_m)$

Table 4 Verification result of a $5 \times 10^8$ order matrix
m (number of processors) | $T_m$ (s) | $S = T_1/T_m$ | $E = T_1/(m \cdot T_m)$

Table 5 Verification result of a $1 \times 10^9$ order matrix
m (number of processors) | $T_m$ (s) | $S = T_1/T_m$ | $E = T_1/(m \cdot T_m)$

Based on the verification results shown in Table 1 to Table 5, we can see that, for a given order, the computing time decreases significantly as the number of processors increases; however, the more processors are added, the smaller the further reduction in computing time becomes. The verification shows that the computing efficiency of the improved algorithm improves greatly as the order of the stiffness matrix increases.

As the order of the matrix and the number of processors change, Fig. 7 to Fig. 9 show the relationship among the order of the matrix, the number of processors $m$, the computing time $T$, the speedup ratio $S$ and the parallel efficiency $E$.

In Fig. 7, when the number of processors increases from one to two, the algorithm changes from serial to parallel computing, and the computing time drops sharply, especially for large-scale matrices. The larger the matrix, the more time is saved by parallel computing. On the other hand, as the number of processors increases, the computing times for the different orders become closer, because when the number of processors is large enough, such as $m = 16$, the processing performance is high enough to deal with matrices of different orders in a short time.

In Fig. 8, as the number of processors increases, the speedup $S$ for matrices of different scales shows a trend of linear increase. The speedup rises sharply when the number of processors increases from 1 to 2; when $m$ is greater than 2, the speedup ratio increases gently. In addition, for the same number of processors, a larger matrix has a higher speedup than a smaller one. This is because, when the communication cost is the same, the advantage in processing performance is more obvious for a large matrix.

Fig. 7 The trends of T (s) with the increasing of m

Fig. 8 The trends of S with the increasing of m

In Fig. 9, as the number of processors increases, the parallel efficiency $E$ for matrices of different scales shows a trend of linear decrease. This is because the more processors there are, the greater the communication cost is, which leads to the decline of parallel efficiency. In addition, for the same number of processors, a larger matrix has a higher parallel efficiency than a smaller one. The reason is that the advantage of parallel performance outweighs the communication cost when the scale is large enough. This also reflects the efficiency of the improved algorithm.

Fig. 9 The trends of E with the increasing of m

4. CONCLUSIONS

In this paper, we first analyze the original Thomas algorithm on a single processor and discuss the existing storage strategies for large tridiagonal matrices, as well as the existing decomposition strategy and mapping technology. We then propose an improved Thomas algorithm based on this analysis. Finally, we present pseudo-code for the improved algorithm and verify its efficiency with MPI on data of various scales. The verification results show that the improved algorithm performs well for linear finite element parallel computing, in four respects:

(1) Saving storage space. The space complexity of the improved algorithm is $S(n) = O(3n) = O(n)$, and the space complexity of the original algorithm is $S(n) = O(n^2)$. That is to say, the space cost of the improved algorithm is only $3/n$ times that of the original algorithm (Thomas algorithm).

(2) Lower interaction overhead and computation complexity. The interaction overhead is small when the number of processors is within a specific range, and the computation complexity is far less than that of the Cholesky algorithm.

(3) Parallel efficiency decreases with too many processors. As the number of processors increases, the speedup ratio increases. But when the number exceeds 16, the growth of the speedup ratio slows down and the parallel efficiency starts to decline because the communication overhead increases.

(4) The efficiency of parallel computation increases with the size of the matrix. The performance of the improved algorithm is very high, especially for large matrices, because the communication overhead among different computer nodes is negligible when the matrix is large enough.

In future, we will carry out more experiments using structural stiffness matrices from real scenarios to verify or further improve our algorithm, and we will also apply our research results in engineering, especially in concrete structure simulation.

ACKNOWLEDGMENTS

The authors gratefully acknowledge the financial support provided by the Kwang-Hua Fund for College of Civil Engineering in Tongji University, the National Natural Science Foundation of China, the Hong Kong, Macao and Taiwan Science & Technology Cooperation Program of China (2012DFH70130), the Fundamental Research Funds of the Central Universities (2011QNA4016) and the Zhejiang Provincial Natural Science Foundation of China (LR13E080001). The authors also thank Xuefei Zhou and Xiaowei Zhou for their hard work in this research program.

REFERENCES

Ananth Grama, Anshul Gupta, George Karypis, et al. (2003), Introduction to Parallel Computing, Beijing, China, June.
Xavier, C., Iyengar, S. S. (2004), Introduction to Parallel Algorithms, Beijing, China, June.
Efendiev, Y., Hou, T. Y. (2009), Multiscale finite element methods, Applied and Computational Mathematics, 217, 50.
Ferziger, J. H., Perić, M. (1996), Computational methods for fluid dynamics (Vol. 3), Berlin: Springer.
Hendrickson, B., Kolda, T. G. (2000), Graph partitioning models for parallel computing, Parallel Computing, 26(12).
Kim, H.S., Wu, S.Z., Chang, L.W. (2011), A Scalable Tridiagonal Solver for GPUs, 2011 International Conference on Parallel Processing (ICPP), 2011(9).
Law, K. H. (1986), A parallel finite element solution method, Computers & Structures, 23(6).
Liu, W.J., Wang, R.Q. (2009), Parallel Computing Based Finite Element Analysis, 1st International Conference on Information Science and Engineering (ICISE 2009), 2009.
Lv, H., Di, R.H., Gong, H., Li, C.X. (2011), A MPI/OpenMP hybrid parallel nonlinear equation solver used in finite element analysis, Sixth China Grid Annual Conference (ChinaGrid), 2011.
Maurer, D., Wieners, C. (2011), A parallel block LU decomposition method for distributed finite element matrices, Parallel Computing, 37(12).
Paz, C. N. M., Alves, J. L. D., Ebecken, N. F. F. (2005), Assessment of computational performance for a vector parallel implementation: 3D probabilistic model discrete cracking in concrete, Computers & Concrete, 2(5).
Pacheco, P. S. (1997), Parallel programming with MPI, Morgan Kaufmann Pub.
Sun, W.Y., Du, Q.K., Chen, J.R. (2007), Calculation Method, Beijing, China, May.
Takizawa, K., Tezduyar, T. E. (2012), Computational methods for parachute fluid-structure interactions, Archives of Computational Methods in Engineering, 19(1).
Turmo, J., Ramos, G., Aparicio, A. C. (2012), Towards a model of dry shear keyed joints: modelling of panel tests, Computers & Concrete, 10(5).
Wang, X.B., Zhong, Z.H. (2004), Generalized Thomas algorithm for solving cyclic tridiagonal equations, Journal of Computer Mechanics, 104(2).
