A Parallel Gauss-Seidel Algorithm for Sparse Power System Matrices

D. P. Koester, S. Ranka, and G. C. Fox


School of Computer and Information Science and The Northeast Parallel Architectures Center (NPAC)
Syracuse University
Syracuse, NY 13244-4100

Abstract

We describe the implementation and performance of an efficient parallel Gauss-Seidel algorithm that has been developed for irregular, sparse matrices from electrical power systems applications. Although Gauss-Seidel algorithms are inherently sequential, by performing specialized orderings on sparse matrices, it is possible to eliminate much of the data dependencies caused by precedence in the calculations. A two-part matrix ordering technique has been developed: first to partition the matrix into block-diagonal-bordered form using diakoptic techniques, and then to multi-color the data in the last diagonal block using graph coloring techniques. The ordered matrices often have extensive parallelism, while maintaining the strict precedence relationships in the Gauss-Seidel algorithm. We present timing results for a parallel Gauss-Seidel solver implemented on the Thinking Machines CM-5 distributed memory multi-processor. The algorithm presented here requires active message remote procedure calls in order to minimize communications overhead and obtain good relative speedup.

1 Introduction

Power system distribution networks are generally hierarchical, with limited numbers of high-voltage lines transmitting electricity to highly interconnected local networks that eventually distribute power to customers. Electrical power grids have graph representations which in turn can be expressed as matrices: electrical buses are graph nodes and matrix diagonal elements, while electrical transmission lines are graph edges which can be represented as non-zero off-diagonal matrix elements. We show that it is possible to identify the hierarchical structure within a power system matrix using only the knowledge of the interconnection pattern by tearing the matrix into partitions and coupling equations that yield a block-diagonal-bordered matrix.
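The correspondence between a network and its matrix can be made concrete with a small sketch. The function name `network_matrix`, the `shunt` parameter, and the numeric values are assumptions introduced here for illustration; the entries form a weighted graph Laplacian with a positive diagonal shift, not a real admittance model.

```python
from collections import defaultdict


def network_matrix(num_buses, lines, shunt=1.0):
    """Build a sparse symmetric matrix from (bus_a, bus_b, weight) line records.

    rows[i] maps column j -> a_ij. The values are illustrative only: a
    weighted graph Laplacian plus a hypothetical positive shunt term on the
    diagonal, not a real admittance model.
    """
    rows = [defaultdict(float) for _ in range(num_buses)]
    for i in range(num_buses):
        rows[i][i] = shunt              # every bus gets a diagonal element
    for a, b, w in lines:
        rows[a][b] -= w                 # each line gives a symmetric pair
        rows[b][a] -= w                 # of non-zero off-diagonal elements
        rows[a][a] += w
        rows[b][b] += w
    return [dict(r) for r in rows]


# Four buses joined in a chain by three lines.
lines = [(0, 1, 2.0), (1, 2, 1.0), (2, 3, 4.0)]
A = network_matrix(4, lines)
for i, row in enumerate(A):
    print(i, sorted(row.items()))
```

Only buses joined by a line produce off-diagonal entries, so the sparsity pattern of the matrix is exactly the adjacency structure of the network.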
Node-tearing-based partitioning identifies the basic network structure that provides parallelism for the majority of calculations within a Gauss-Seidel iteration. Graph multi-coloring has been used to order the last diagonal matrix block and subsequently identify available parallelism. We implemented explicit load balancing as part of each of the aforementioned ordering steps to maximize parallel algorithm efficiency.

We implemented the parallel Gauss-Seidel algorithm on the Thinking Machines CM-5 distributed memory multi-processor exclusively using explicit message passing based on Connection Machine active message layer (CMAML) remote procedure calls (RPCs). The communications paradigm we use throughout this algorithm employs CMAML RPCs to send individual values to destination processors as soon as values have been calculated. This paradigm greatly simplified the development and implementation of this parallel sparse Gauss-Seidel algorithm.

Parallel implementations of Gauss-Seidel have generally been developed for regular problems such as the solution of Laplace's equations by finite differences [4, 5], where red-black coloring schemes are used to provide independence in the calculations and some parallelism. This scheme has been extended to multi-coloring for additional parallelism in more complicated regular problems [5]; however, we are interested in the solution of irregular linear systems. There has been some research into applying parallel Gauss-Seidel to circuit simulation problems [11], although this work showed poor parallel speedup potential in a theoretical study. This reference also extended traditional Gauss-Seidel and Gauss-Jacobi techniques to waveform relaxation methods that trade overhead and convergence rate for parallelism. Other research with parallel Gauss-Seidel methods for power systems applications is presented in [7], although our research differs substantially from that work: our research utilizes a different matrix ordering paradigm, a different load balancing paradigm, and a different parallel implementation paradigm. Our work utilizes diakoptic-based matrix partitioning techniques developed initially for a parallel block-diagonal-bordered direct sparse linear solver [9, 10]. In reference [9] we examined load balancing issues associated with partitioning power systems matrices for parallel Cholesky factorization.

The paper is organized as follows. In section 2, we introduce the electrical power systems application that is the basis for this work. In section 3, we briefly review the Gauss-Seidel iterative method, then present a theoretical derivation of the available parallelism with Gauss-Seidel for a block-diagonal-bordered form sparse matrix. We discuss the preprocessing phase that orders the sparse matrices in section 5, and we describe our parallel Gauss-Seidel algorithm implementation in section 6. Analysis of parallel algorithm performance for actual power system load flow matrices is presented in section 7. We present our conclusions in section 8.

2 Power System Applications

The underlying motivation for our research is to improve the performance of electrical power system applications to provide real-time power system control and real-time support for proactive decision making. This research has focused on matrices from load-flow applications [14]. Load-flow analysis examines steady-state equations based on the symmetric positive definite network admittance matrix that represents the power system distribution network. Load-flow analysis entails the solution of non-linear systems of simultaneous equations, which is performed by repeatedly solving sparse linear equations. Sparse linear solvers account for the majority of floating point operations encountered in load-flow analysis.
3 The Gauss-Seidel Method

We are considering an iterative solution to the linear system

$$ Ax = b, \tag{1} $$

where $A$ is an $(n \times n)$ sparse matrix, $x$ and $b$ are vectors of length $n$, and we are solving for $x$. Iterative solvers are an alternative to direct methods that attempt to calculate an exact solution to the system of equations. Iterative methods attempt to find a solution to the system of linear equations by repeatedly solving the linear system using approximations to the $x$ vector. Iterations continue until the solution is within a predetermined acceptable bound on the error.

The Gauss-Seidel method can be written as:

$$ x_i^{(k+1)} = \frac{1}{a_{ii}} \left( b_i - \sum_{j<i} a_{ij} x_j^{(k+1)} - \sum_{j>i} a_{ij} x_j^{(k)} \right), \tag{2} $$

where:
  $x_i^{(k)}$ is the $i$th unknown in $x$ during the $k$th iteration, $i = 1, \ldots, n$ and $k = 0, 1, \ldots$,
  $x_i^{(0)}$ is the initial guess for the $i$th unknown in $x$,
  $a_{ij}$ is the coefficient of $A$ in the $i$th row and $j$th column,
  $b_i$ is the $i$th value in $b$,

or

$$ x^{(k+1)} = (D + L)^{-1} \left[ b - U x^{(k)} \right], \tag{3} $$

where:
  $x^{(k)}$ is the $k$th solution to $x$, $k = 0, 1, \ldots$,
  $x^{(0)}$ is the initial guess at $x$,
  $D$ is the diagonal of $A$,
  $L$ is the strictly lower triangular portion of $A$,
  $U$ is the strictly upper triangular portion of $A$,
  $b$ is the right-hand-side vector.

The representation in equation 2 is used in the development of the parallel algorithm, while the equivalent matrix-based representation in equation 3 is used below in discussions of available parallelism.

We present a general sequential sparse Gauss-Seidel algorithm in figure 1. This algorithm calculates a constant number of iterations before checking convergence. It is very difficult to determine if one-step iterative methods, like the Gauss-Seidel method, converge for general matrices. Nevertheless, it is possible to prove that the Gauss-Seidel method does converge and yields the unique solution $x$ for $Ax = b$ with any initial starting vector $x^{(0)}$ for both strictly diagonally dominant and symmetric positive definite matrices [5]. These theorems prove that the Gauss-Seidel method converges for these matrix types; however, they say nothing about the rate of convergence.

    while $\epsilon$ > converge
        for $k$ = 1 to $n_{iter}$
            for $i$ = 1 to $n$
                $\tilde{x}_i \leftarrow x_i$
                $x_i \leftarrow b_i$
                for each $j \neq i$ such that $a_{ij} \neq 0$
                    $x_i \leftarrow x_i - (a_{ij} \times x_j)$
                $x_i \leftarrow x_i / a_{ii}$
        $\epsilon \leftarrow 0$
        for $i$ = 1 to $n$
            $\epsilon \leftarrow \epsilon + \mathrm{abs}(\tilde{x}_i - x_i)$
    endwhile

Figure 1: Sparse Gauss-Seidel Algorithm

Symmetric sparse matrices can be represented by graphs with elements in equations corresponding to undirected edges in the graph [6]. Ordering a symmetric sparse matrix is actually little more than changing the labels associated with nodes in an undirected graph. Modifying the ordering of a sparse matrix is simple to perform using a permutation matrix $P$ that simply generates elementary row and column exchanges. Applying a permutation matrix to the original linear system in equation 1 yields

$$ (PAP^T)(Px) = (Pb). \tag{4} $$

While ordering the matrix can greatly simplify accessing parallelism inherent within the matrix structure, ordering can have an effect on convergence [5]. In section 7, we present empirical data to show that in spite of the ordering to yield parallelism, convergence appears to be rapid for positive definite power systems load-flow matrices.

4 Available Parallelism

While Gauss-Seidel algorithms for dense matrices are inherently sequential, it is possible to identify sparse matrix partitions that do not have mutual data dependencies, so calculations can proceed in parallel while maintaining the strict precedence rules in the Gauss-Seidel technique. Entire sparse matrix partitions can be calculated in parallel without requiring communications. All parallelism in the Gauss-Seidel algorithm is derived from within the actual interconnection relationships between elements in the matrix. While much of the parallelism in this algorithm comes from the block-diagonal-bordered ordering of the sparse matrix, further ordering of the last diagonal block is required to provide parallelism in what would otherwise be a purely sequential portion of the algorithm. The last diagonal block represents the interconnection structure within the equations that couple the partitions in the block-diagonal portion of the matrix. These equations are rather sparse, so it is simple to color the graph representing this portion of the matrix.
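For reference, the sequential sweep whose precedence these orderings must preserve can be sketched in Python. This is a minimal sketch, not the authors' implementation: the row-wise dictionary storage, the function name `gauss_seidel`, the tolerance, and the `max_sweeps` guard are assumptions made for illustration. Updating `x` in place is what gives rows $j < i$ their new values and rows $j > i$ their old values, as in equation 2.

```python
def gauss_seidel(rows, b, x0, n_iter=4, tol=1e-10, max_sweeps=10_000):
    """Sparse Gauss-Seidel; rows[i] maps column j -> a_ij (diagonal included).

    Runs n_iter sweeps between convergence checks and returns the solution
    estimate together with the number of sweeps performed.
    """
    x = list(x0)
    sweeps = 0
    while sweeps < max_sweeps:          # guard added for this sketch
        for _ in range(n_iter):
            x_old = list(x)
            for i, row in enumerate(rows):
                s = b[i]
                for j, a_ij in row.items():
                    if j != i:
                        s -= a_ij * x[j]    # x[j] is already new for j < i
                x[i] = s / row[i]
            sweeps += 1
        # Total change over the most recent sweep.
        err = sum(abs(old - new) for old, new in zip(x_old, x))
        if err <= tol:
            break
    return x, sweeps


# Diagonally dominant test system whose exact solution is [1, 1, 1].
rows = [{0: 4.0, 1: -1.0}, {0: -1.0, 1: 4.0, 2: -1.0}, {1: -1.0, 2: 4.0}]
b = [3.0, 2.0, 3.0]
x, sweeps = gauss_seidel(rows, b, [0.0, 0.0, 0.0])
print(x, sweeps)
```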
Separate graph colors represent rows where $x^{(k+1)}$ can be calculated in parallel, because within a color, no two nodes have any adjacent edges. To clearly identify the available parallelism in the block-diagonal-bordered Gauss-Seidel method, we define a block diagonal matrix partition, apply that partition to formula 3, and equate terms to identify available parallelism. We must also define a sub-partitioning of the last diagonal block to identify parallelism after multi-coloring.

4.1 Block-Diagonal-Bordered Matrices

We define a partitioning of the system of linear equations defined in equation 1, where the permutation matrix $P$ orders the matrix into block-diagonal-bordered form. We define

$$ PAP^T = \begin{pmatrix} A_{1,1} & & & A_{1,m} \\ & \ddots & & \vdots \\ & & A_{m-1,m-1} & A_{m-1,m} \\ A_{m,1} & \cdots & A_{m,m-1} & A_{m,m} \end{pmatrix} \tag{5} $$

and $Px$ and $Pb$ are partitioned with similar dimensions. Equation 3 divides the $PAP^T$ matrix into a diagonal component $D$, a strictly lower triangular matrix $L$, and a strictly upper triangular matrix $U$ such that:

$$ PAP^T = D + L + U. \tag{6} $$

Derivation of the block-diagonal-bordered form of the $D$, $L$, and $U$ matrices is straightforward. Equation 3 requires the calculation of $(D + L)^{-1}$, which also is simple to determine explicitly, because this matrix has block-diagonal-lower-bordered form. Given these partitioned matrices, it is relatively straightforward to identify available parallelism. For $i$ $(i = 1, \ldots, m-1)$, we obtain:

$$ x_i^{(k+1)} = (D_{i,i} + L_{i,i})^{-1} \left[ b_i - U_{i,i} x_i^{(k)} - U_{i,m} x_m^{(k)} \right], \tag{7} $$

and for the lower border and last diagonal block we obtain:

$$ x_m^{(k+1)} = (D_{m,m} + L_{m,m})^{-1} \left[ b_m - \sum_{i=1}^{m-1} \left( L_{m,i} x_i^{(k+1)} \right) - U_{m,m} x_m^{(k)} \right]. \tag{8} $$

We can identify the parallelism in the block-diagonal-bordered portion of the matrix by examining equations 7 and 8. If the block-diagonal-bordered matrix partitions $A_{i,i}$, $A_{m,i}$, and $A_{i,m}$ $(1 \le i \le m-1)$ are assigned to the same processor, then there are no communications until $x_m^{(k+1)}$ is calculated. Note that the vector $x_m^{(k)}$ is required for the calculations in each partition; however, there is no violation of the strict precedence rules in the Gauss-Seidel method, because the updated values $x_m^{(k+1)}$ are not calculated until the last step. After calculating $x_i^{(k+1)}$ in the first $(m-1)$ partitions, the values of $x_m^{(k+1)}$ must be calculated using the lower border and last block. If we assign

$$ \hat{b} = b_m - \sum_{i=1}^{m-1} L_{m,i} x_i^{(k+1)}, \tag{9} $$

then the formulation of $x_m^{(k+1)} = \hat{x}^{(k+1)}$ looks similar to equation 3:

$$ \hat{x}^{(k+1)} = (D_{m,m} + L_{m,m})^{-1} \left[ \hat{b} - U_{m,m} \hat{x}^{(k)} \right]. \tag{10} $$

Figure 2 describes the calculation steps in the parallel Gauss-Seidel for a block-diagonal-bordered sparse matrix. This figure depicts four diagonal blocks, and data/processor assignments are listed for the data blocks.

Figure 2: Block-Bordered-Diagonal Form Gauss-Seidel Method (m = 5). Steps shown: (1) solve for x in diagonal blocks; (2) calculate (matrix × vector) product and send; (3) solve for x in last diagonal block.

4.2 Multi-Colored Matrices

The ordering imposed by the permutation matrix $P$ includes multi-coloring-based ordering of the last diagonal block that produces sub-partitions with parallelism. We define the sub-partitioning as:

$$ A_{m,m} = \begin{pmatrix} \hat{D}_{1,1} & \hat{A}_{1,2} & \cdots & \hat{A}_{1,c} \\ \hat{A}_{2,1} & \hat{D}_{2,2} & \cdots & \hat{A}_{2,c} \\ \vdots & \vdots & \ddots & \vdots \\ \hat{A}_{c,1} & \hat{A}_{c,2} & \cdots & \hat{D}_{c,c} \end{pmatrix}, \tag{11} $$

where $\hat{D}_{i,i}$ are diagonal blocks and $c$ is the number of colors. After forming $L_{m,m}$ and $U_{m,m}$, it is straightforward to prove that:

$$ \hat{x}_i^{(k+1)} = \hat{D}_{i,i}^{-1} \left[ \hat{b}_i - \sum_{j<i} \hat{A}_{i,j} \hat{x}_j^{(k+1)} - \sum_{j>i} \hat{A}_{i,j} \hat{x}_j^{(k)} \right]. \tag{12} $$

Calculating $\hat{x}_i^{(k+1)}$ in each sub-partition (color) of the last diagonal block does not require values of $\hat{x}^{(k+1)}$ within the sub-partition, so we can calculate the individual values within a color in any order and distribute these calculations to separate processors without concern for precedence. In order to maintain the strict precedence in the Gauss-Seidel algorithm, the values of $\hat{x}_i^{(k+1)}$ calculated in each step must be broadcast to all processors, and processing cannot proceed for any processor until it receives the new values of $\hat{x}_i^{(k+1)}$ from all other processors. Figure 3 illustrates the data/processor assignments in the last diagonal block.

Figure 3: Multi-Colored Gauss-Seidel Method for the Last Diagonal Block (c = 3). Steps shown: (1) solve for x within a color; (2) broadcast new x values.

5 The Preprocessing Phase

In the previous section, we developed the theory for parallel Gauss-Seidel methods; however, before such techniques can be implemented on real power systems matrices, we must be able to generate the permutation matrices, $P$, to produce block-diagonal-bordered/multi-colored sparse matrices. All available parallelism for our Gauss-Seidel algorithm is identified from the interconnection structure of elements in the sparse matrix during this preprocessing phase. Inherent in both preprocessing steps is explicit load-balancing to determine processor/data mappings for efficient implementation of the Gauss-Seidel algorithm. This preprocessing phase incurs significantly more overhead than solving a single instance of the sparse matrix; consequently, the use of this technique will be limited to problems that have static matrix structures that can reuse the ordered matrix multiple times in order to amortize the cost of the preprocessing phase over numerous matrix solutions.

5.1 Ordering the Matrix into Block-Bordered-Diagonal Form

We require a technique that orders irregular matrices into block-diagonal-bordered form while limiting the number of coupling equations. Minimizing the number of coupling equations minimizes the size of the last diagonal block, and minimizes the amount of broadcast communications required when calculating values of $\hat{x}^{(k+1)}$. Minimizing the size of the last diagonal block has some drawbacks. We have found an inverse relationship between last block size and load imbalance between processors. This can affect potential parallelism if the resulting workload in the diagonal blocks cannot be distributed uniformly throughout a multi-processor [9]. When determining the optimal ordering for a sparse matrix, the size of the last diagonal block and the subsequent additional communications may be traded for an ordering that yields good load balance in the highly parallel portion of the calculations, especially for larger numbers of processors.

We have chosen node-tearing [, ], which is a specialized form of diakoptics, to order sparse power systems matrices into block-diagonal-bordered form. We have selected node-tearing nodal analysis because this algorithm determines the natural structure in the matrix while providing the means to minimize the number of coupling equations [12]. Tearing here refers to breaking the original problem into smaller subproblems whose partial solutions can be combined to give the solution of the original problem. The node-tearing-based ordering algorithm has a user-selectable input parameter, $max_{DB}$, the maximum size of the diagonal blocks. Varying this input parameter permits the user to vary characteristics in the ordered diagonal blocks.
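What a block-diagonal-bordered ordering must achieve can be illustrated with a short sketch. It does not implement node-tearing; it assumes some tearing step has already labelled every node with either an independent block number or the coupling label, then builds the ordering, applies the symmetric permutation of equation 4, and checks that no non-zero joins two different independent blocks. The names `bordered_order`, `permute`, `is_block_diagonal_bordered`, and the `labels` list are hypothetical.

```python
def bordered_order(labels, num_blocks):
    """Order nodes block by block, with coupling nodes (label == num_blocks) last."""
    assert all(0 <= lab <= num_blocks for lab in labels)
    return sorted(range(len(labels)), key=lambda v: (labels[v], v))


def permute(rows, order):
    """Apply the symmetric permutation P A P^T to a row-wise sparse matrix."""
    new_index = {old: new for new, old in enumerate(order)}
    return [{new_index[j]: a for j, a in rows[old].items()} for old in order]


def is_block_diagonal_bordered(rows, labels, num_blocks):
    """True if no non-zero couples two different independent diagonal blocks."""
    for i, row in enumerate(rows):
        for j in row:
            li, lj = labels[i], labels[j]
            if li != lj and num_blocks not in (li, lj):
                return False
    return True


# Chain 0-1-2-3-4: tearing out node 2 leaves two independent blocks.
rows = [{0: 2.0, 1: -1.0}, {0: -1.0, 1: 2.0, 2: -1.0},
        {1: -1.0, 2: 2.0, 3: -1.0}, {2: -1.0, 3: 2.0, 4: -1.0},
        {3: -1.0, 4: 2.0}]
labels = [0, 0, 2, 1, 1]            # hypothetical output of a tearing step
order = bordered_order(labels, 2)
ordered = permute(rows, order)
print(order, is_block_diagonal_bordered(rows, labels, 2))
```

In the permuted matrix the torn node becomes the single row and column of the border, and the two remaining 2 × 2 blocks share no entries, which is the independence that equation 7 relies on.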
Empirical data is presented later in section 7 to illustrate parallel linear solver algorithm performance as a function of this parameter.

Load balancing for node-tearing-based ordering is performed with a simple pigeon-hole type algorithm that uses a metric based on the number of floating point multiply/add operations in a partition, instead of simply using the number of rows per partition. Load balancing examines the number of operations when calculating $x^{(k+1)}$ in the matrix partitions and the number of operations when calculating the sparse matrix vector products in preparation to solve for $\hat{x}^{(k+1)}$. This algorithm finds an optimal distribution for workload to processors; however, actual disparity in processor workload is dependent on the actual irregular sparse matrix structure.

5.2 Ordering the Last Diagonal Block

The multi-coloring algorithm we selected for this work is based on the saturation degree ordering algorithm [1]. We also require load balancing, a feature not commonly implemented within graph multi-coloring. The saturation degree ordering algorithm selects a node in the graph that has the largest number of differently colored neighbors. We have added the capability to the saturation degree ordering algorithm to select the color for a node in a manner that equalizes the number of nodes with a particular color. The graphs encountered for coloring in this work were very sparse, and often required three or fewer colors. Details of this graph multi-coloring algorithm are presented in [8].

6 Parallel Implementation

We have implemented a parallel version of a block-diagonal-bordered sparse Gauss-Seidel algorithm in the C programming language for the Thinking Machines CM-5 multi-computer using CMAML RPCs as the exclusive basis for interprocessor communications [13]. Underlying the whole concept of active messages is the paradigm that the user takes the responsibility for handling messages as they arrive at a destination.
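Returning briefly to the coloring step of section 5.2, the idea of a saturation degree ordering with a balanced color choice can be sketched as follows. This is one plausible reading of the description above rather than the algorithm of [8]: the tie-breaking rules, the rule that a new color is opened only when no existing color is admissible, and the name `balanced_saturation_coloring` are assumptions.

```python
def balanced_saturation_coloring(adj):
    """Color a graph given as {node: set_of_neighbours}; returns {node: color}."""
    color = {}
    counts = []                       # counts[c] = nodes assigned color c so far
    uncolored = sorted(adj)

    def saturation(v):
        # Number of distinct colors already present among v's neighbours.
        return len({color[u] for u in adj[v] if u in color})

    while uncolored:
        # Pick the node seeing the most distinct neighbour colors (ties: degree).
        v = max(uncolored, key=lambda u: (saturation(u), len(adj[u])))
        used = {color[u] for u in adj[v] if u in color}
        free = [c for c in range(len(counts)) if c not in used]
        if free:
            c = min(free, key=lambda k: counts[k])   # least-used admissible color
        else:
            c = len(counts)                          # open a new color
            counts.append(0)
        color[v] = c
        counts[c] += 1
        uncolored.remove(v)
    return color


# A triangle forces three colors; three isolated nodes are then spread evenly.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: set(), 4: set(), 5: set()}
color = balanced_saturation_coloring(adj)
print(color)
```

Choosing the least-used admissible color is what keeps the number of rows per color, and hence the work per broadcast step in the last diagonal block, roughly equal.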

The user writes a handler function that takes the data from a register and uses it in a calculation or assigns the data to memory. By assigning message handling responsibilities to the user, communications overhead can be significantly reduced. Significant improvements in the performance of the algorithm were observed for active messages, when compared to more traditional communications paradigms that use the standard blocking CMMD send and CMMD receive functions in conjunction with packing data into communications buffers. A significant portion of communications require each processor to send short data buffers to every other processor. For traditional message passing paradigms, the cost for communications increases drastically as the number of processors increases, because each message incurs the same latency regardless of the amount of data sent. As a result, performance for buffered communications quickly becomes unacceptable as the number of processors increases.

To significantly reduce communications overhead, we implemented each portion of the algorithm using CMAML remote procedure calls (CMAML_rpc). The communications paradigm we use throughout this algorithm is to send a double precision data value to the destination processor as soon as the value is calculated. Communications in the algorithm occur at distinct time phases, so polling for the active message handler function is efficient. An active message on the CM-5 has a four word payload, which is more than adequate to send a double precision floating point value and an integer vector position indicator. The use of active messages greatly simplified the development and implementation of this parallel sparse Gauss-Seidel algorithm, because there was no requirement to maintain and pack communications buffers.

This implementation uses implicit data structures based on vectors of C programming language structures to store and retrieve data efficiently within the sparse matrix. These data structures provide good cache coherence, because non-zero data values and column location indicators are stored in adjacent physical memory locations.
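The send-on-compute paradigm described in this section can be mimicked in a few lines. The sketch below is only an analogy in Python and uses no CMAML calls: it assumes 32-bit words, so that a four word payload is 16 bytes, and the names `pack_update` and `make_handler` are hypothetical. It shows that one double precision value plus an integer position fits in a single payload, and that the receiving side needs nothing more than a handler that writes the value into place.

```python
import struct

WORD_BYTES = 4                      # assumption: 32-bit words
PAYLOAD_BYTES = 4 * WORD_BYTES      # four word payload per active message
RECORD = struct.Struct("<id")       # int position + double value, 12 bytes


def pack_update(position, value):
    """Pack one solution value and its vector position into a payload."""
    return RECORD.pack(position, value).ljust(PAYLOAD_BYTES, b"\0")


def make_handler(x):
    """Return a handler that stores an arriving value directly into x."""
    def handler(payload):
        position, value = RECORD.unpack_from(payload)
        x[position] = value
    return handler


# Simulated exchange: one processor computes x[2] and sends it to two others.
local_copies = [[0.0] * 4 for _ in range(2)]
handlers = [make_handler(x) for x in local_copies]
message = pack_update(2, 3.5)
for deliver in handlers:
    deliver(message)
print(local_copies)
```

Because each message carries its own destination index, no communications buffer has to be assembled or unpacked on either side.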
Data is stored as sparse vectors with implicit referencing, so only the SPARC processors on each node were used for calculations.

Our parallel Gauss-Seidel algorithm has the following distinct sections:

1. solve for $x^{(k+1)}$ in the diagonal blocks
2. calculate $\hat{b} = b_m - \sum_{i=1}^{m-1} L_{m,i} x_i^{(k+1)}$ by forming the (matrix × vector) products in parallel
3. solve for $\hat{x}^{(k+1)}$ in the last diagonal block
4. check convergence

A pseudo-code representation of the parallel Gauss-Seidel solver is presented in figure 4.

7 Empirical Results

Overall performance of our parallel Gauss-Seidel linear solver is dependent on both the performance of the matrix ordering in the preprocessing phase and the performance of the parallel Gauss-Seidel implementation. Because these two components of the parallel Gauss-Seidel implementation are inextricably related, the best way to assess the potential of this technique is to measure the speedup performance using real power system load-flow matrices. We first present speedup results for three separate power systems matrices:

  BCSPWR09: 1,723 nodes and 2,394 edges [2]
  BCSPWR10: 5,300 nodes and 8,271 edges [2]
  EPRI-6K: ,8 nodes and 5,6 edges [3]

Matrices BCSPWR09 and BCSPWR10 are from the Boeing Harwell series and the EPRI-6K matrix is distributed with the Extended Transient-Midterm Stability Program (ETMSP) from EPRI. These matrices have been preprocessed using a sequential program that orders the matrix, load balances each ordering step, and produces the implicit data structures for the parallel Gauss-Seidel linear solver. The preprocessing was repeated for multiple values of $max_{DB}$, the input value to the node-tearing algorithm. Due to the static nature of the power system grid, such orderings could be reused for many hours or even days of calculations in real electrical power utility operations load-flow applications.

Empirical performance data was collected for each of the aforementioned power systems matrices using up to 32 processors on the Thinking Machines CM-5 at the Northeast Parallel Architectures Center at Syracuse University.
The NPAC CM-5 is configured with all 32 nodes in a single partition, so user software was required to define the number of processors used to actually solve a linear system. We present empirical speedup data collected on the parallel Gauss-Seidel algorithm for the three power systems matrices, and we also present a detailed performance analysis using actual run times for the individual subsections of the parallel Gauss-Seidel linear solver to illustrate the efficacy of the load balancing step in the preprocessing phase and to illustrate performance bottlenecks. All timing samples are for a combination of four iterations and a single convergence check.

    Node Program

    while $\epsilon$ > converge
        for $k$ = 1 to $n_{iter}$
            /* solve for $x^{(k+1)}$ in the diagonal blocks */
            for all rows $i$ on this processor
                $\tilde{x}_i \leftarrow x_i$
                $x_i \leftarrow b_i$
                for each $j \in [1, n]$, $j \neq i$, such that $a_{ij} \neq 0$
                    $x_i \leftarrow x_i - (a_{ij} \times x_j)$
                $x_i \leftarrow x_i / a_{ii}$
            /* calculate $L_{m,i} x^{(k+1)}$ */
            for all rows $i$ on this processor
                $\tilde{x}_i \leftarrow x_i$
                $\hat{b}_i \leftarrow b_i$
            for all lower border non-zero rows $i$
                for each $j$ such that $a_{ij} \neq 0$
                    $\tau \leftarrow (a_{ij} \times x_j)$
                    using active message rpc on processor($i$)
                        $\hat{b}_i \leftarrow \hat{b}_i - \tau$
            /* solve for $\hat{x}^{(k+1)}$ in the last diagonal block */
            for all colors $c$
                for all rows $i$ in color $c$ on this processor
                    $x_i \leftarrow \hat{b}_i$
                    for each $j \neq i$ such that $a_{ij} \neq 0$
                        $x_i \leftarrow x_i - (a_{ij} \times x_j)$
                    $x_i \leftarrow x_i / a_{ii}$
                    using active message rpc broadcast $x_i$
                wait until all values of $x$ have arrived
        /* check convergence */
        $\epsilon \leftarrow 0$
        for all rows $i$ on this processor
            $\epsilon \leftarrow \epsilon + \mathrm{abs}(\tilde{x}_i - x_i)$
        for all other processors $p$
            using active message rpc on processor $p$
                add this processor's $\epsilon$ to the total on $p$
    endwhile

Figure 4: Parallel Sparse Gauss-Seidel Algorithm

Figure 5: Relative Speedup, 2, 4, 8, 16, and 32 processors. Curves: BCSPWR09, BCSPWR10, and EPRI-6K; vertical axis: relative speedup.

7.1 Performance Analysis

As an introduction to the performance of the parallel Gauss-Seidel algorithm, we present a graph that plots relative speedup versus the number of processors. Figure 5 plots the best speedup measured for each of the power systems matrices for 2, 4, 8, 16, and 32 processors. These graphs show that performance for the EPRI-6K data set is the best of the three data sets examined. Speedup reaches a maximum of .6 for 32 processors and speedups of greater than . were measured for 16 processors. Relative speedups for the BCSPWR09 and BCSPWR10 matrices are less than for the EPRI-6K matrix, but each has speedup in excess of 7. for 16 processors. For both the BCSPWR09 and BCSPWR10 matrices, the last diagonal block requires approximately 5% of the total calculations while the last block of the EPRI-6K matrix can be ordered so that only % of all calculations occur there.
The likely cause for limited speedup with the Boeing-Harwell matrices is that communications overhead becomes a significant part of the overall processing time because $\hat{x}^{(k+1)}$ values must be broadcast to other processors before processing can proceed to the next color. There are insufficient parallel operations when solving for $x^{(k+1)}$ in the diagonal blocks for these matrices to offset the effect of the communications overhead encountered in the last block.

A detailed examination of relative speedup is presented in figure 6 for the EPRI-6K data. This figure contains a graph with four curves plotting relative speedup for each of four maximum matrix partition sizes, 8, 9, 56, and 3 nodes, used in the node-tearing algorithm. The speedup curves for the various matrix orderings clearly illustrate the effects of load imbalance for some matrix orderings. For all four matrix orderings, speedup is nearly equal for up to 16 processors. However, the values for relative speedup diverge for 32 processors.

Figure 6: Relative Speedup for EPRI-6K Data, 2, 4, 8, 16, and 32 processors. One curve for each of the four maximum partition sizes; vertical axis: relative speedup.

We can look further into the cause of the disparity in the relative speedup values in the EPRI-6K data by examining the performance of each of the four distinct sections of the parallel algorithm. Figure 7 contains four graphs that each have four curves that plot processing time in milliseconds versus the number of processors for each of four values of $max_{DB}$. These graphs are log-log scaled, so for perfect speedup, processing times should fall on a straight line with negative slope for repeated doubling of the number of processors. One or more curves on each of the performance graphs for the diagonal blocks and upper border, for updating the last diagonal block, and for convergence checks illustrate nearly perfect speedup with as many as 32 processors. Unfortunately, calculating values in the last block does not show the same parallel performance.

The performance graph for the diagonal blocks and upper border clearly illustrates the causes for the load imbalance observed in the relative speedup graph in figure 6. For some matrix orderings, load balancing is not able to divide the work evenly for larger numbers of processors. This occurs for larger values of $max_{DB}$. Selecting small values of $max_{DB}$ will provide better speedups for sixteen or more processors.
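As a short worked note on reading the log-log timing graphs: the symbols $T(p)$, $S(p)$, and $E(p)$ below are introduced here for illustration and follow the usual definitions of run time on $p$ processors, relative speedup, and relative efficiency. Under perfect speedup,

$$ T(p) = \frac{T(1)}{p} \quad\Longrightarrow\quad \log_2 T(p) = \log_2 T(1) - \log_2 p, $$

so each doubling of $p$ lowers $\log_2 T$ by exactly one unit and the points lie on a straight line of slope $-1$. With

$$ S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p}, $$

a curve that flattens, or bends upward away from that line, corresponds to $S(p) < p$ and therefore $E(p) < 1$.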
Updating the last block requires both calculations of sparse (matrix × vector) products and irregular communications, but yields good performance even for 32 processors. Update times are correlated to the size of the last diagonal block, which is inversely related to the magnitude of $max_{DB}$. The performance graph for checking convergence illustrates that the load balancing step does not assign equal numbers of rows to all processors. The number of rows on a processor varies as a function of the load balancing. While the curves on this graph are somewhat erratic, performance is improving with near perfect parallelism even for 32 processors.

Figure 7: Timings for Algorithm Components, EPRI-6K Data, 2, 4, 8, 16, and 32 processors. Panels: Diagonal Blocks and Upper Border; Update Last Block; Last Block; Check Convergence. Each panel plots run time in milliseconds for the four values of $max_{DB}$.

We must reiterate that all available parallelism in this work is a result of ordering the matrix and identifying relationships in the connectivity pattern within the structure of the matrix. Power systems load flow matrices are some of the most sparse irregular matrices encountered. For the EPRI-6K data, the most frequently encountered number of edges at a node is only two, and 8.% of the nodes have three or fewer edges. For the BCSPWR10 matrix, 7% of the nodes have three or fewer edges. Consequently, power systems matrices pose significant challenges to produce efficient parallel sparse matrix algorithms.

In figure 8, we present a representative ordering of the EPRI-6K data with $max_{DB}$ equal to 56 nodes. This matrix represents the adjacency structure of the network graph, and clearly illustrates sparsity. Non-zero entries in the matrix are represented as dots, and the matrix is delimited by a bounding box. This figure contains two matrices: the ordered sparse matrix and an enlargement of the last block after multi-coloring. This partitioned matrix has been load-balanced for eight processors. The number of nodes in the last diagonal block is , the number of edges is only , and this matrix partition is bipartite, requiring only two colors.

To obtain the full benefits of parallel processing speedup throughout a load flow application, all data redistribution must be eliminated. Jacobian calculations when solving the systems of non-linear equations must consider the processor/data assignments from the sparse linear solver.
Otherwise, data redistribution overhead would nullify any speedup obtainable in the parallel linear solver.

Figure 8: Ordered EPRI-6K Matrix, $max_{DB}$ = 56. The enlargement shows the last diagonal block.

7.2 Convergence

Convergence for a given data set is critical to the performance of an iterative linear solver. We have applied our solver to sample positive definite matrices that have actual power networks as the basis for the sparsity pattern, and random values for the entries. A sample of measured convergence data is presented in table 1. This table presents the total error and the maximum value for an iteration. All initial values, $x^{(0)}$, have been defined to equal :. Convergence is rather rapid, and after four iterations, total error equals ?. We hypothesize that this good convergence rate is in part due to having good estimates of the initial starting vector. For actual solutions of power systems load flows, this solver would be used within an iterative non-linear solver, so good estimates of starting points for each solution would be readily available.

Table 1: Convergence for EPRI-6K Data. Columns: Iteration; Total Error, $\sum_i \mathrm{abs}(x_i^{(k+1)} - x_i^{(k)})$; $\max_i x_i^{(k+1)}$.

8 Conclusions

We have developed a parallel sparse Gauss-Seidel solver with the potential for good relative speedup for the very sparse, irregular matrices encountered in electrical power system applications. Block-diagonal-bordered matrix structure offers promise for simplified implementation and also offers a simple decomposition of the problem into clearly identifiable subproblems. The node-tearing ordering heuristic has proven to be successful in identifying the hierarchical structure in the power systems matrices, and reducing the number of coupling equations so that the graph multi-coloring algorithm can usually color the last block with only two or three colors.

All available parallelism in our Gauss-Seidel algorithm is derived from within the actual interconnection relationships between elements in the matrix, and identified in the sparse matrix orderings. Consequently, available parallelism is not unlimited. Relative speedup tends to increase nicely until either load-balance overhead or communications overhead cause speedup to level off. We have shown that, depending on the matrix, relative efficiency declines rapidly after 8 or 16 processors, limiting the utility of applying large numbers of processors to a single parallel linear solver. Nevertheless, other dimensions exist in electrical power system applications that can be exploited to use large numbers of processors efficiently. While a moderate number of processors can be efficiently applied to a single power system simulation, multiple events can be simulated simultaneously.

Acknowledgments

We thank Alvin Leung, Nancy McCracken, Paul Coddington, and Tony Skjellum for their assistance in this research. This work has been supported in part by Niagara Mohawk Power Corporation, the New York State Science and Technology Foundation, the NSF under co-operative agreement No. CCR-98, and ARPA under contract #DABT63-9-K-5.

References

[1] D. Brelaz. New Methods to Color the Vertices of a Graph. Comm. ACM, 22:251, 1979.
[2] I. S. Duff, R. G. Grimes, and J. G. Lewis. Users' Guide for the Harwell-Boeing Sparse Matrix Collection. Technical report, Boeing Computer Services, 1992.
[3] Electrical Power Research Institute, Palo Alto, California. Extended Transient-Midterm Stability Program: Version 3. - Volume : Programmers Manual, Part , April 1993.
[4] G. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker. Solving Problems on Concurrent Processors. Prentice Hall, 1988.
[5] G. Golub and J. M. Ortega. Scientific Computing with an Introduction to Parallel Computing. Academic Press, Boston, MA, 1993.
[6] M. T. Heath, E. Ng, and B. W. Peyton. Parallel Algorithms for Sparse Linear Systems. In Parallel Algorithms for Matrix Computations, pages 83–124. SIAM, Philadelphia, 1990.
[7] G. Huang and W. Ongsakul. Managing the Bottlenecks in Parallel Gauss-Seidel Type Algorithms for Power Flow Analysis. Proceedings of the 18th Power Industry Computer Applications (PICA) Conference, pages 7{8, May 1993.
[8] D. P. Koester, S. Ranka, and G. C. Fox. A Parallel Gauss-Seidel Algorithm for Sparse Power Systems Matrices. Technical Report SCCS-63, NPAC, April 1994.
[9] D. P. Koester, S. Ranka, and G. C. Fox. Parallel Block-Diagonal-Bordered Sparse Linear Solvers for Electrical Power System Applications. In Proceedings of the Scalable Parallel Libraries Conference. IEEE Press, 1994.
[10] D. P. Koester, S. Ranka, and G. C. Fox. Parallel Cholesky Factorization of Block-Diagonal-Bordered Sparse Matrices. Technical Report SCCS-6, NPAC, January 1994.
[11] R. A. Saleh, K. A. Gallivan, M. Chang, I. N. Hajj, D. Smart, and T. N. Trick. Parallel Circuit Simulation on Supercomputers. Proceedings of the IEEE, 77(12):1915–1931, December 1989.
[12] A. Sangiovanni-Vincentelli, L. K. Chen, and L. O. Chua. Node-Tearing Nodal Analysis. Technical Report ERL-M58, Electronics Research Laboratory, College of Engineering, University of California, Berkeley, October 1976.
[13] T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser. Active Messages: a Mechanism for Integrated Communication and Computation. In Nineteenth International Symposium on Computer Architecture, New York, 1992. ACM Press.
[14] Y. Wallach. Calculations and Programs for Power System Networks. Prentice-Hall, 1986.
