Preconditioning Parallel Sparse Iterative Solvers for Circuit Simulation

A. Basermann, U. Jaekel, and K. Hachiya

A. Basermann and U. Jaekel: C & C Research Laboratories, NEC Europe Ltd., D-53757 Sankt Augustin, Germany, {basermann, jaekel}@ccrl-nece.de
K. Hachiya: High-End Design Technology Development Group, Technology Foundation Development Division, NEC Electronics Corp., Japan, k-hachiya@ax.jp.nec.com

1 Introduction

One important mathematical problem in the simulation of large electrical circuits is the solution of high-dimensional linear equation systems. The corresponding matrices are real, non-symmetric, very ill-conditioned, have an irregular sparsity pattern, and include a few dense rows and columns. When the systems become large, iterative solvers are very likely to outperform direct methods. For convergence acceleration of iterative solvers, parallelization and appropriate preconditioning are suitable techniques to reduce the execution time. We present a parallel iterative algorithm with distributed Schur complement (DSC) preconditioning [5] which achieves an accuracy of the solution similar to a direct solver but usually is distinctly faster for large problems. The parallel efficiency of the method is increased by transforming the equation system into a problem without dense rows and columns, which results in a system reduction, as well as by the exploitation of parallel graph partitioning methods. The costs of the local, incomplete LU decompositions are decreased by fill-in reducing reordering methods for the matrix and by a threshold strategy for the factorization.

2 Problem of Dense Rows and Columns

Matrices from circuit simulation problems are usually very sparse but include a few (nearly) dense rows and columns. In the parallel case, dense rows and columns are difficult to handle for partitioning methods since they result in couplings between all equations. In addition, good load balance is hard to achieve if the matrix is distributed row-wise. Fill-in reducing ordering methods may become very costly due to a few dense rows and columns, and the matrices may get very ill-conditioned.

Fortunately, dense rows and columns are usually easy to remove from circuit simulation matrices since the corresponding columns or rows as a rule have only one non-zero entry, on the diagonal. Such equations normally include voltage sources (constraints). A dense column whose corresponding row has only one diagonal entry can be removed since the corresponding unknown can be determined from the row equation and substituted in all other equations. On the other hand, a dense row (equation) whose corresponding column has merely one diagonal entry is alone responsible for the corresponding unknown; all other equations can be solved independently of it. Fig. 1 illustrates the principle of the matrix reduction in both cases.

Figure 1. Removal of (nearly) dense rows and columns.

If the corresponding columns or rows of dense rows and columns do not have one diagonal entry only, such rows and columns can be handled by using the Woodbury formula [3]. This case is rare for circuit simulation problems and does not occur for the matrices investigated here.
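The first reduction case can be sketched in a few lines; the following is a minimal illustration, assuming a scipy.sparse matrix, and not the implementation used in the experiments below (the function name and data layout are our own):

    import numpy as np
    import scipy.sparse as sp

    def reduce_trivial_rows(A, b):
        """Drop unknowns whose matrix row holds only the diagonal entry.

        If row i contains only a_ii, then x_i = b_i / a_ii is known and can
        be substituted into all other equations, which removes column i even
        if that column is dense.  (The mirror case, a dense row whose column
        holds only the diagonal, is simply solved last, after the reduced
        system.)
        """
        A = A.tocsr()
        nnz_per_row = np.diff(A.indptr)
        trivial = np.array([i for i in np.where(nnz_per_row == 1)[0]
                            if A.indices[A.indptr[i]] == i], dtype=int)
        keep = np.setdiff1d(np.arange(A.shape[0]), trivial)
        x_known = b[trivial] / A.diagonal()[trivial]
        b_red = b[keep] - A[keep][:, trivial] @ x_known   # substitution step
        A_red = A[keep][:, keep]
        return A_red, b_red, keep, trivial, x_known

Reductions of this kind produce the reduced systems reported in Table 1 below.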
3 Distributed Schur Complement Techniques

In the following, techniques for the iterative solution of circuit simulation equation systems with DSC preconditioning are sketched. More details, in particular on the theory, can be found in [5].

3.1 Definitions

Fig. 2 schematically displays the row-wise distribution of a matrix A to two processors. Each processor owns its local row block. The square matrices A_i are the local diagonal blocks of A. We assume that the local rows are arranged in such a way that the rows without couplings to the other processor(s) come first and then the rows with couplings. The former are called internal rows; they have entries only in the A_i part of the local rows and are not coupled with rows of other processors. The latter additionally have entries outside the A_i part or are coupled with rows of other processors. These local rows are named local interface rows. The part outside A_i, which represents couplings between the processors, is called the local interface matrix. From the view of processor 2 in Fig. 2, the local interface rows of processor 1 with entries at column positions in the area of processor 2 are external interface rows.

Figure 2. DSC definitions: Matrix distributed to two processors.

Since the sparsity pattern of circuit simulation matrices usually is non-symmetric, local interface rows of processor i may have entries in A_i only but be uni-directionally coupled with rows of other processors. These rows are external interface rows from the view of the other processors. This cannot be determined locally on processor i; communication is necessary. Since each row of the matrix corresponds to a specific unknown of the equation system (row 1 to solution vector component 1 and to right hand side component 1, for example), internal unknowns, local interface unknowns, and external interface unknowns can be defined correspondingly.
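In a serial emulation these definitions translate directly into a classification routine; the sketch below is our illustration (names are ours), and its second loop stands in for the communication step just mentioned:

    import numpy as np
    import scipy.sparse as sp

    def classify_rows(A, lo, hi):
        """Split rows lo..hi-1 of A into internal and local interface rows.

        Serial emulation of the distributed setting: the scan over foreign
        rows in the second loop uses information that, in a truly
        distributed code, must be obtained by communication.
        """
        A = A.tocsr()
        interface = set()
        for i in range(lo, hi):
            cols = A.indices[A.indptr[i]:A.indptr[i + 1]]
            if ((cols < lo) | (cols >= hi)).any():
                interface.add(i)      # entries outside the A_i part
        Ac = A.tocsc()
        for j in range(lo, hi):
            rows = Ac.indices[Ac.indptr[j]:Ac.indptr[j + 1]]
            if ((rows < lo) | (rows >= hi)).any():
                interface.add(j)      # a foreign row references unknown j
        internal = [i for i in range(lo, hi) if i not in interface]
        return internal, sorted(interface)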
3.2 Algorithm

Fig. 3 gives a schematic survey of the DSC algorithm per processor. On each processor, an outer Bi-CGSTAB iteration [6] is performed for all local rows (unknowns). As the basic iterative method, a flexible variant of GMRES, FGMRES [4, 5], is also well suited for the DSC algorithm but is not considered here because of its higher storage requirements. The outer iteration contains a partial matrix-vector multiplication which requires communication since each processor owns only its local segment of the vector. It is necessary to exchange the components of non-local vector segments which correspond to external interface unknowns (rows). Within the outer Bi-CGSTAB iteration, an inner Bi-CGSTAB iteration for the local interface rows (unknowns) only is performed. This includes a partial matrix-vector multiplication for the interface system, but the communication scheme is the same as for the outer matrix-vector multiplication and thus has to be implemented only once.

Figure 3. Schematic view of the DSC algorithm on each processor: the outer Bi-CGSTAB iteration for all local unknowns contains an inner Bi-CGSTAB iteration for the local interface unknowns; both perform matrix-vector multiplications with communication of external interface unknowns.
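The partial matrix-vector product can be emulated serially as follows: each "processor" applies its local diagonal block to its own vector segment and its local interface matrix to the fetched external components, which are the only data that would travel over the network. The splitting below is our illustration, not the paper's code:

    import numpy as np
    import scipy.sparse as sp

    rng = np.random.default_rng(0)
    A = (sp.random(8, 8, density=0.4, random_state=rng) + sp.eye(8)).tocsr()
    x = rng.standard_normal(8)

    y_parts = []
    for lo, hi in [(0, 4), (4, 8)]:          # one loop body per "processor"
        rows = A[lo:hi]
        A_diag = rows[:, lo:hi]              # local diagonal block A_i
        ext = np.setdiff1d(np.arange(8), np.arange(lo, hi))
        A_ifc = rows[:, ext]                 # local interface matrix
        # in a real code, only the external interface components of x are
        # received from the owning processors; here we just pick them out
        x_ext = x[ext]
        y_parts.append(A_diag @ x[lo:hi] + A_ifc @ x_ext)

    assert np.allclose(np.concatenate(y_parts), A @ x)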
From the mathematical point of view, each processor i solves the following equation:

    A_i x_i + y_{i,ext} = b_i,    x_i = \begin{pmatrix} u_i \\ y_i \end{pmatrix},    b_i = \begin{pmatrix} f_i \\ g_i \end{pmatrix}.    (1)

x_i are the local vector components, y_{i,ext} the external interface vector components, and b_i is the local segment of the right hand side vector. x_i is split into the internal vector components u_i and the local interface vector components y_i, and b_i accordingly. A_i is then split (see [5] for details), and (1) is reformulated:

    A_i = \begin{pmatrix} B_i & F_i \\ E_i & C_i \end{pmatrix},    \begin{pmatrix} B_i & F_i \\ E_i & C_i \end{pmatrix} \begin{pmatrix} u_i \\ y_i \end{pmatrix} + \begin{pmatrix} 0 \\ \sum_{j \in Neighbours} E_{ij} y_j \end{pmatrix} = \begin{pmatrix} f_i \\ g_i \end{pmatrix}.    (2)

The result of the sum over all neighbouring processors j with couplings to processor i in (2) is the same as that of y_{i,ext} in (1). E_{ij} y_j is the part of y_{i,ext} which reflects the contribution to the local equation from the neighbouring processor j. The matrix equation (2) represents two equations. From the first, we derive an expression for u_i, substitute u_i in the second equation, and get

    u_i = B_i^{-1} (f_i - F_i y_i),    S_i y_i + \sum_{j \in Neighbours} E_{ij} y_j = g_i - E_i B_i^{-1} f_i.    (3)

S_i = C_i - E_i B_i^{-1} F_i is the local Schur complement. Note that (3) is an equation for the interface vector components only. (3) can be rewritten as a block-Jacobi preconditioned Schur complement system [5]:

    y_i + S_i^{-1} \sum_{j \in Neighbours} E_{ij} y_j = S_i^{-1} (g_i - E_i B_i^{-1} f_i).    (4)
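For a single processor without neighbours, so that the sum in (3) is empty, the reduction to the Schur complement system can be verified numerically in a few lines; the dense test matrices below are our own arbitrary example:

    import numpy as np

    rng = np.random.default_rng(1)
    n_int, n_ifc = 6, 3                       # internal / interface unknowns
    B = rng.standard_normal((n_int, n_int)) + 6 * np.eye(n_int)
    F = rng.standard_normal((n_int, n_ifc))
    E = rng.standard_normal((n_ifc, n_int))
    C = rng.standard_normal((n_ifc, n_ifc)) + 6 * np.eye(n_ifc)
    f = rng.standard_normal(n_int)
    g = rng.standard_normal(n_ifc)

    S = C - E @ np.linalg.solve(B, F)         # local Schur complement
    # interface unknowns from the reduced system (3), neighbour sum empty
    y = np.linalg.solve(S, g - E @ np.linalg.solve(B, f))
    u = np.linalg.solve(B, f - F @ y)         # back-substitution, eq. (3)

    # agreement with a direct solve of the unreduced block system
    A = np.block([[B, F], [E, C]])
    assert np.allclose(np.concatenate([u, y]),
                       np.linalg.solve(A, np.concatenate([f, g])))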
3.3 Preconditioning

Fig. 4 illustrates the principle of preconditioning within the DSC algorithm per processor. The outer iteration from Fig. 3 is preconditioned per processor by a block incomplete LU decomposition with threshold (ILUT) [4] of the local diagonal block (L_1 U_1 in Fig. 4). For preconditioning the inner iteration, a block ILUT for the local interface rows only is exploited. This factorization need not be computed separately but can be taken from the lower right part of the decomposition for the outer iteration (L_{1,S} U_{1,S} in Fig. 4).

Figure 4. Principle of preconditioning within the DSC algorithm: the block incomplete LU factors L_1, U_1 of the local rows contain, in their lower right parts, the factors L_{1,S}, U_{1,S} for the local interface rows.

Mathematically speaking, we perform a block factorization of A_i on processor i using the splitting from (2):

    A_i = \begin{pmatrix} B_i & F_i \\ E_i & C_i \end{pmatrix} = \begin{pmatrix} B_i & 0 \\ E_i & S_i \end{pmatrix} \begin{pmatrix} I & B_i^{-1} F_i \\ 0 & I \end{pmatrix}.    (5)

We then assume that we have the LU decomposition S_i = L_{i,S} U_{i,S} of the local Schur complement. With this, we formulate the LU factorization

    L_i U_i = \begin{pmatrix} L_{i,B} & 0 \\ E_i U_{i,B}^{-1} & L_{i,S} \end{pmatrix} \begin{pmatrix} U_{i,B} & L_{i,B}^{-1} F_i \\ 0 & U_{i,S} \end{pmatrix}    (6)

with B_i = L_{i,B} U_{i,B} the LU decomposition of B_i. By transforming the right hand side of (6) into

    \left[ \begin{pmatrix} L_{i,B} & 0 \\ E_i U_{i,B}^{-1} & L_{i,S} \end{pmatrix} \begin{pmatrix} U_{i,B} & 0 \\ 0 & U_{i,S} \end{pmatrix} \right] \begin{pmatrix} I & U_{i,B}^{-1} L_{i,B}^{-1} F_i \\ 0 & I \end{pmatrix} = \begin{pmatrix} B_i & 0 \\ E_i & S_i \end{pmatrix} \begin{pmatrix} I & B_i^{-1} F_i \\ 0 & I \end{pmatrix},

we find after comparison with (5) that L_i U_i is an LU factorization of A_i. The other way round, we also see from (6) the practical advantage that the LU factorization S_i = L_{i,S} U_{i,S} of the local Schur complement does not have to be computed explicitly if we already have an LU factorization of the local diagonal block A_i. If we perform incomplete decompositions, we get an approximate, preconditioned Schur complement system with the approximation \tilde{S}_i of the local Schur complement S_i (compare with (4)):

    y_i + \tilde{S}_i^{-1} \sum_{j \in Neighbours} E_{ij} y_j = \tilde{S}_i^{-1} (g_i - E_i B_i^{-1} f_i).    (7)
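As a single-block illustration of this preconditioning (without the inner interface iteration), SciPy's SuperLU-based incomplete LU with threshold dropping can stand in for the block ILUT of the paper; drop_tol and fill_factor below are our stand-ins for the ILUT parameters, and the test matrix is our own:

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    rng = np.random.default_rng(2)
    n = 200
    A_i = sp.random(n, n, density=0.02, random_state=rng, format="csc") \
          + 10.0 * sp.eye(n, format="csc")
    b_i = rng.standard_normal(n)

    # incomplete LU with threshold dropping of the local diagonal block
    ilu = spla.spilu(A_i, drop_tol=1e-4, fill_factor=10)
    M = spla.LinearOperator((n, n), matvec=ilu.solve)

    # preconditioned Bi-CGSTAB iteration for the local block
    x, info = spla.bicgstab(A_i, b_i, M=M)
    print(info, np.linalg.norm(A_i @ x - b_i))   # info == 0: converged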
3.4 Repartitioning and Reordering

The distributed sparsity pattern of the matrix can be represented as a distributed graph with nodes and edges. Graph repartitioning can then be used to reduce the number of couplings between the distributed matrix row blocks. In graph theory formulation, the reduction is done by a minimization of the number of edges cut in the graph. This goal of graph partitioning corresponds to a minimization of the number of interface unknowns in the DSC algorithm, and thus the problem (7) is made very small. For graph partitioning, we use the ParMETIS software from the University of Minnesota [2]. Since ParMETIS requires an undirected graph as input, the non-symmetric pattern of the matrix has to be symmetrized for the matrix graph construction. For the local, incomplete decompositions, we use METIS nested dissection reordering to reduce fill-in in the factors [2]. Nested dissection reordering usually generates a similar sparsity pattern for the local diagonal blocks A_i on each processor. This results in similar fill-in for each ILUT and thus supports load balancing.
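The symmetrization step can be sketched as follows; the routine and its CSR-style output arrays are our illustration of building an undirected adjacency structure of the kind METIS/ParMETIS consume, not the actual ParMETIS interface code:

    import numpy as np
    import scipy.sparse as sp

    def symmetrized_graph(A):
        """Undirected graph of the (possibly non-symmetric) pattern of A.

        The pattern is symmetrized as pattern(A) | pattern(A^T); self-loops
        (diagonal entries) are dropped.  Returns CSR-style adjacency arrays
        (xadj, adjncy in METIS terms).
        """
        P = (A != 0).astype(np.int8)
        G = ((P + P.T) != 0).tocsr()
        G.setdiag(0)
        G.eliminate_zeros()
        return G.indptr, G.indices

    A = sp.csr_matrix(np.array([[4., 1., 0.],
                                [0., 3., 2.],
                                [0., 0., 5.]]))
    xadj, adjncy = symmetrized_graph(A)
    print(xadj, adjncy)   # edges 0-1 and 1-2 now appear in both directions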
4 Results

The following experiments were performed on NEC's PC cluster GRISU (32 2-way SMP nodes with AMD Athlon MP 1900+ CPUs, 1.6 GHz, 1 GB main memory per node, Myrinet2000 interconnection network between the nodes). The equation systems ccp and circ2a stem from simulations of NEC circuits, the systems circuit_2 and circuit_3 from Philips circuits [1], and the systems hcircuit and scircuit from Motorola circuits. The latter four systems are available from the Matrix Market (http://www.cise.ufl.edu/~davis/sparse/{Bomhof, Hamm}).

Table 1. Original and reduced systems: Partitioning into eight sub-domains.

                     Original matrix                  Reduced matrix
    Matrix       Order   Non-zeros  #If vars      Order   Non-zeros  #If vars
    ccp          89556      760630     36077      89378      603753      2010
    circ2a      482969     3912413    422900     482963     2750390       402
    circuit_2     4510       21199      1265       1262        9702       629
    circuit_3    12127       48137      2688       7607       34024       186
    hcircuit    105676      513072      2490     105592      506970       975
    scircuit    170998      958936      1874     170850      958370      1700

Table 1 presents the orders and numbers of non-zeros of the original and the reduced matrices as well as the number of interface variables (#If vars) for partitioning into eight sub-domains; repartitioning is applied. The results in Table 1 show a distinctly smaller number of interface variables in the case of the reduced systems. Repartitioning is effective in this case and results in very low costs for the inner iteration from Fig. 3.

Table 2. DSC times on 8 GRISU processors: Effect of ordering and partitioning.

                 Partitioning + ordering   Partitioning only   No permutation
    Matrix        #If vars     Time/s           Time/s         #If vars   Time/s
    ccp               2010       0.51             0.97            65856    27.48
    circ2a             402       0.57             2.02             8752    13.51
    circuit_2          629       0.04             0.05             1245     0.07
    circuit_3          186       0.25             0.25             1152     0.65
    hcircuit           975       0.23           117.99            92654     9.27
    scircuit          1700       2.16            29.69           112444   807.30

In Table 2, times of the DSC method on eight GRISU processors with both repartitioning and reordering, with repartitioning only, and without repartitioning and reordering are displayed for the reduced systems. The number of interface variables is given for the first and last scenario; for the second, the number is the same as for the first one. The shortest times by far in Table 2 are achieved for the DSC method with repartitioning and reordering. Repartitioning keeps the interface system small; reordering usually reduces fill-in and improves the load balance of the processors (see 3.4).

Table 3 shows execution times of the DSC method on 1 to 16 GRISU processors for the largest reduced test case, circ2a.

Table 3. DSC times on GRISU: Scalability.

                         Time/s on p processors
    Matrix    p = 1   p = 2   p = 4   p = 8   p = 12   p = 16
    circ2a     2.19    1.89    1.00    0.57     0.43     0.39

In Fig. 5, total sequential execution times on GRISU of NEC's circuit simulator MUSASI with the original direct solver (Schur complement method) and the developed iterative DSC algorithm are displayed for a medium-size circuit problem (transient analysis) which results in equation systems of order 24705. The left times do not include the time for the setup phase while the right times do contain this time. The most expensive operation in the setup phase is the symbolic factorization with Markowitz ordering, which is necessary and state-of-the-art for a direct solver but can be left out for the iterative DSC solver developed.
Figure 5. Sequential execution times (in seconds) of NEC's circuit simulator MUSASI with the direct and the iterative solver, without and with the setup phase.

The left times in Fig. 5 show a speedup of 2.3 of the DSC solver over the original direct solver. With the setup phase, the iterative solver outperforms the direct one even by a factor of 4.1 since the costly symbolic factorization with Markowitz ordering can be skipped. From first tests, a similar ratio of the simulation times with the direct and the iterative solver can be expected for the parallel case, but the integration of the iterative DSC solver into the parallel circuit simulator MUSASI is not complete yet.

5 Conclusions

For equation systems from real problems, we demonstrated that the DSC algorithm combined with repartitioning and reordering is a well-suited iterative solver for circuit simulation. For the performance of iterative solvers, a system reduction by the removal of trivial equations is crucial. Moreover, a distinct reduction of the simulation time can be expected if the commonly used direct methods in circuit simulation codes are replaced with the iterative DSC method presented.
Bibliography

[1] C. W. Bomhof and H. A. van der Vorst, A parallel linear system solver for circuit simulation problems, Numerical Linear Algebra with Applications, 7 (2000), pp. 649-665.

[2] G. Karypis and V. Kumar, ParMETIS: Parallel graph partitioning and sparse matrix ordering library, Tech. Rep. 97-060, University of Minnesota, 1997.

[3] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in C, 2nd ed., Cambridge University Press, 1994.

[4] Y. Saad, Iterative Methods for Sparse Linear Systems, PWS Publishing, Boston, 1996.

[5] Y. Saad and M. Sosonkina, Distributed Schur complement techniques for general sparse linear systems, SIAM Journal on Scientific Computing, 21 (1999), pp. 1337-1356.

[6] H. van der Vorst, Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems, SIAM Journal on Scientific and Statistical Computing, 13 (1992), pp. 631-644.