Preconditioning Parallel Sparse Iterative Solvers for Circuit Simulation


A. Basermann, U. Jaekel, and K. Hachiya

C & C Research Laboratories, NEC Europe Ltd., D-53757 Sankt Augustin, Germany, {basermann, jaekel}@ccrl-nece.de; High-End Design Technology Development Group, Technology Foundation Development Division, NEC Electronics Corp., Japan, k-hachiya@ax.jp.nec.com

1 Introduction

One important mathematical problem in the simulation of large electrical circuits is the solution of high-dimensional linear equation systems. The corresponding matrices are real, non-symmetric, very ill-conditioned, have an irregular sparsity pattern, and include a few dense rows and columns. When the systems become large, iterative solvers are very likely to outperform direct methods. For convergence acceleration of iterative solvers, parallelization and appropriate preconditioning are suited techniques to reduce the execution time. We present a parallel iterative algorithm with distributed Schur complement (DSC) preconditioning [5] which achieves an accuracy of the solution similar to a direct solver but usually is distinctly faster for large problems. The parallel efficiency of the method is increased by transforming the equation system into a problem without dense rows and columns, which results in a system reduction, as well as by exploitation of parallel graph partitioning methods. The costs of the local, incomplete LU decompositions are decreased by fill-in reducing reordering methods for the matrix and a threshold strategy for the factorization.

2 Problem of Dense Rows and Columns

Matrices from circuit simulation problems are usually very sparse but include a few (nearly) dense rows and columns. In the parallel case, dense rows and columns are difficult to handle for partitioning methods since they result in couplings between

all equations. In addition, good load balance is hard to achieve if the matrix is distributed row-wise. Fill-in reducing ordering methods may become very costly due to a few dense rows and columns, and the matrices may get very ill-conditioned. Fortunately, dense rows and columns are usually easy to remove from circuit simulation matrices since the corresponding columns or rows as a rule have only one non-zero entry, on the diagonal. Such equations normally include voltage sources (constraints). A dense column whose corresponding row has only one diagonal entry can be removed since the corresponding unknown can be determined from the row equation and substituted in all other equations. On the other hand, a dense row (equation) whose corresponding column has merely one diagonal entry is only responsible for the corresponding unknown. All other equations can be solved independently. Fig. 1 illustrates the principle of the matrix reduction in both cases.

Figure 1. Removal of (nearly) dense rows and columns.

If the corresponding columns or rows of dense rows and columns do not have only one diagonal entry, such rows and columns can be handled by using the Woodbury formula [3]. This case is rare for circuit simulation problems and does not occur for the matrices investigated here.

3 Distributed Schur Complement Techniques

In the following, techniques for the iterative solution with DSC preconditioning of circuit simulation equation systems are sketched. More details, in particular on the theory, can be found in [5].

3.1 Definitions

Fig. 2 schematically displays the row-wise distribution of a matrix A to two processors. Each processor owns its local row block. The square matrices A_i are the local diagonal blocks of A. We assume that the local rows are arranged in such a way that the rows without couplings to the other processor(s) come first and then the rows with couplings. The former are called internal rows; they have only entries in the A_i part of the local rows and are not coupled with rows of other processors. The latter additionally have entries outside the A_i part or are coupled with rows of other processors. These local rows are named local interface rows.
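The locally visible part of this row classification can be read off a processor's row block directly. The following sketch (function name and toy data are our own illustration, not the paper's implementation) classifies the rows of a CSR row block into internal and local interface rows; rows that are only uni-directionally coupled from other processors cannot be detected this way and require communication, as discussed next.

```python
# Illustrative sketch (not the paper's code): classify the rows of one
# processor's row block into internal rows and local interface rows.
# A row is internal if all its nonzeros lie inside the local diagonal
# block A_i; rows with entries outside A_i are local interface rows.
# Uni-directional couplings from other processors are NOT detected here;
# finding those requires communication.
import numpy as np
import scipy.sparse as sp

def classify_rows(row_block, first_col, last_col):
    """row_block: CSR matrix holding this processor's rows of A;
    [first_col, last_col) is the column range of the diagonal block A_i."""
    internal, interface = [], []
    for i in range(row_block.shape[0]):
        cols = row_block.indices[row_block.indptr[i]:row_block.indptr[i + 1]]
        if np.all((cols >= first_col) & (cols < last_col)):
            internal.append(i)
        else:
            interface.append(i)
    return internal, interface

# Toy row block of processor 1, owning columns 0..3 of a 6-column matrix:
block = sp.csr_matrix(np.array([
    [2., 1., 0., 0., 0., 0.],   # internal row
    [0., 3., 0., 0., 1., 0.],   # interface row (entry in column 4)
    [0., 0., 4., 0., 0., 0.],   # internal row
    [1., 0., 0., 5., 0., 2.],   # interface row (entry in column 5)
]))
print(classify_rows(block, 0, 4))  # -> ([0, 2], [1, 3])
```

With the interface rows ordered last, as assumed above, the local diagonal block takes the 2x2 block form used in Section 3.2.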
The part outside A_i which represents the couplings between the processors is called the local interface matrix. From the view of processor 2 in Fig. 2, the local interface rows of processor 1 with entries at column positions in the area of A_2 are external interface rows. Since

the sparsity pattern of circuit simulation matrices usually is non-symmetric, local interface rows of processor i may have entries in A_i only but be uni-directionally coupled with rows of other processors. These rows are external interface rows from the view of the other processors. This cannot be determined locally on processor i; communication is necessary. Since each row of the matrix corresponds to a specific unknown of the equation system (row 1 to solution vector component 1 and to right-hand side component 1, e.g.), internal unknowns, local interface unknowns, and external interface unknowns can be defined correspondingly.

Figure 2. DSC definitions: matrix distributed to two processors (local diagonal blocks A_1, A_2; internal rows; local and external interface rows; solution segments x_1, x_2).

3.2 Algorithm

Fig. 3 gives a schematic survey of the DSC algorithm per processor. On each processor, an outer Bi-CGSTAB iteration [6] is performed for all local rows (unknowns). As basic iterative method, a flexible variant of GMRES, FGMRES [4, 5], is also well suited for the DSC algorithm but is not considered here because of its higher storage requirements. The outer iteration contains a partial matrix-vector multiplication which requires communication since each processor only owns its local segment of the vector. It is necessary to exchange the components of non-local vector segments which correspond to external interface unknowns. Within the outer Bi-CGSTAB iteration, an inner Bi-CGSTAB iteration for the local interface rows (unknowns) only is performed. This includes a partial matrix-vector multiplication of the interface system, but the communication scheme is the same as for the outer matrix-vector multiplication and thus has to be implemented only once. From the mathematical point of view, each processor i solves the following

equation:

\[
A_i x_i + y_{i,\text{ext}} = b_i, \qquad
x_i = \begin{pmatrix} u_i \\ y_i \end{pmatrix}, \quad
b_i = \begin{pmatrix} f_i \\ g_i \end{pmatrix}. \tag{1}
\]

Figure 3. Schematic view of the DSC algorithm on each processor: an outer Bi-CGSTAB iteration for all local unknowns contains an inner Bi-CGSTAB iteration for the local interface unknowns; the matrix-vector multiplications of both iterations communicate external interface unknowns.

Here x_i are the local vector components, y_{i,ext} the external interface vector components, and b_i is the local segment of the right-hand side vector. x_i is split into the internal vector components u_i and the local interface vector components y_i, and b_i accordingly. A_i is then split (see [5] for details), and (1) is reformulated:

\[
A_i = \begin{pmatrix} B_i & F_i \\ E_i & C_i \end{pmatrix}, \qquad
\begin{pmatrix} B_i & F_i \\ E_i & C_i \end{pmatrix}
\begin{pmatrix} u_i \\ y_i \end{pmatrix}
+ \begin{pmatrix} 0 \\ \sum_{\text{neighbours } j} E_{ij}\, y_j \end{pmatrix}
= \begin{pmatrix} f_i \\ g_i \end{pmatrix}. \tag{2}
\]

The result of the sum over all neighbouring processors j with couplings to processor i in (2) is the same as that of y_{i,ext} in (1); E_{ij} y_j is the part of y_{i,ext} which reflects the contribution to the local equation from the neighbouring processor j. The matrix equation (2) represents two equations. From the first, we derive an expression for u_i, substitute u_i in the second equation, and get

\[
u_i = B_i^{-1}\left(f_i - F_i\, y_i\right), \qquad
S_i\, y_i + \sum_{\text{neighbours } j} E_{ij}\, y_j = g_i - E_i B_i^{-1} f_i. \tag{3}
\]

S_i = C_i - E_i B_i^{-1} F_i is the local Schur complement. Note that (3) is an equation for the interface vector components only. (3) can be rewritten as a block-Jacobi preconditioned Schur complement system [5]:

\[
y_i + S_i^{-1} \sum_{\text{neighbours } j} E_{ij}\, y_j
= S_i^{-1}\left(g_i - E_i B_i^{-1} f_i\right). \tag{4}
\]

Figure 4. Principle of preconditioning within the DSC algorithm: a block incomplete LU decomposition L_1 U_1 of the local rows; its lower right part L_{1,S} U_{1,S} serves as block incomplete LU factorization for the local interface rows.

3.3 Preconditioning

Fig. 4 illustrates the principle of preconditioning within the DSC algorithm per processor. The outer iteration from Fig. 3 is preconditioned per processor by a block incomplete LU decomposition with threshold (ILUT) [4] of the local diagonal block (L_1 U_1 in Fig. 4). For preconditioning the inner iteration, a block ILUT for the local interface rows only is exploited. This factorization need not be computed but can be taken from the lower right part of the decomposition for the outer iteration (L_{1,S} U_{1,S} in Fig. 4). Mathematically speaking, we perform a block factorization of A_i on processor i using the splitting from (2):

\[
A_i = \begin{pmatrix} B_i & F_i \\ E_i & C_i \end{pmatrix}
= \begin{pmatrix} B_i & 0 \\ E_i & S_i \end{pmatrix}
\begin{pmatrix} I & B_i^{-1} F_i \\ 0 & I \end{pmatrix}. \tag{5}
\]

We then assume that we have the LU decomposition S_i = L_{i,S} U_{i,S} of the local Schur complement. With this, we formulate the LU factorization

\[
L_i U_i = \begin{pmatrix} L_{i,B} & 0 \\ E_i U_{i,B}^{-1} & L_{i,S} \end{pmatrix}
\begin{pmatrix} U_{i,B} & L_{i,B}^{-1} F_i \\ 0 & U_{i,S} \end{pmatrix} \tag{6}
\]

with B_i = L_{i,B} U_{i,B} the LU decomposition of B_i. By transforming the right-hand

side of (6) into

\[
\left[\begin{pmatrix} L_{i,B} & 0 \\ E_i U_{i,B}^{-1} & L_{i,S} \end{pmatrix}
\begin{pmatrix} U_{i,B} & 0 \\ 0 & U_{i,S} \end{pmatrix}\right]
\begin{pmatrix} I & U_{i,B}^{-1} L_{i,B}^{-1} F_i \\ 0 & I \end{pmatrix}
= \begin{pmatrix} B_i & 0 \\ E_i & S_i \end{pmatrix}
\begin{pmatrix} I & B_i^{-1} F_i \\ 0 & I \end{pmatrix},
\]

we find after comparison with (5) that L_i U_i is an LU factorization of A_i. The other way round, we also see from (6) the practical advantage that the LU factorization S_i = L_{i,S} U_{i,S} of the local Schur complement does not have to be computed explicitly if we already have an LU factorization of the local diagonal block A_i. If we perform incomplete decompositions, we get an approximate, preconditioned Schur complement system with the approximation S̃_i of the local Schur complement S_i (compare with (4)):

\[
y_i + \tilde{S}_i^{-1} \sum_{\text{neighbours } j} E_{ij}\, y_j
= \tilde{S}_i^{-1}\left(g_i - E_i B_i^{-1} f_i\right). \tag{7}
\]

3.4 Repartitioning and Reordering

The distributed sparsity pattern of the matrix can be represented as a distributed graph with nodes and edges. Graph repartitioning can then be used to reduce the number of couplings between the distributed matrix row blocks. In graph theory formulation, the reduction is done by a minimization of the number of edges cut in the graph. This goal of graph partitioning corresponds to a minimization of the number of interface unknowns in the DSC algorithm, and thus problem (7) is made very small. For graph partitioning, we use the ParMETIS software from the University of Minnesota [2]. Since ParMETIS requires an undirected graph as input, the non-symmetric pattern of the matrix has to be symmetrized for the construction of the matrix graph. For the local, incomplete decompositions, we use METIS nested dissection reordering to reduce fill-in in the factors [2]. Nested dissection reordering usually generates a similar sparsity pattern for the local diagonal blocks A_i on each processor. This results in similar fill-in for each ILUT and thus supports load balancing.

4 Results

The following experiments were performed on NEC's PC cluster GRISU (32 2-way SMP nodes with AMD Athlon MP 1900+ CPUs, 1.6 GHz, 1 GB main memory per node, Myrinet2000 interconnection network between the nodes).
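Before turning to the circuit systems, the key identity behind (5)-(7), namely that the trailing blocks of the LU factors of A_i form an LU factorization of the local Schur complement S_i, can be checked numerically. The following small dense sketch is our own illustration (arbitrary sizes, and a diagonally dominant random matrix so that partial pivoting does not permute any rows):

```python
# Sanity check of the block factorization (5)-(6): for an unpivoted LU
# of A = [[B, F], [E, C]] with the internal unknowns ordered first,
# the trailing blocks of L and U satisfy L_S @ U_S = S, where
# S = C - E B^{-1} F is the Schur complement.
import numpy as np
import scipy.linalg as la

rng = np.random.default_rng(0)
n, m = 6, 3                                  # internal / interface unknowns
# Strict column diagonal dominance => partial pivoting keeps the diagonal:
A = rng.random((n + m, n + m)) + (n + m) * np.eye(n + m)
B, F = A[:n, :n], A[:n, n:]
E, C = A[n:, :n], A[n:, n:]
S = C - E @ la.solve(B, F)                   # Schur complement

P, L, U = la.lu(A)                           # A = P @ L @ U
assert np.allclose(P, np.eye(n + m))         # no row permutation occurred
L_S, U_S = L[n:, n:], U[n:, n:]              # trailing blocks of the factors
print(np.allclose(L_S @ U_S, S))             # -> True
```

This is exactly the property the DSC preconditioner exploits: one ILUT of the local diagonal block delivers the (approximate) interface factorization for free.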
For the following experiments, the equation systems ccp and circ2a, which stem from simulations of NEC circuits, the systems circuit_2 and circuit_3 from Philips circuits [1], as well as the systems hcircuit and scircuit from Motorola circuits are used. The latter four systems are available from the sparse matrix collection (http://www.cise.ufl.edu/~davis/sparse/, sets Bomhof and Hamm).
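The system reduction of Section 2 is applied to these matrices before partitioning. As a minimal dense sketch of the first removal case (our own illustration, not the solver's implementation): a row j holding only its diagonal entry fixes x_j = b_j / a_jj, so column j, which may be dense, can be eliminated from all remaining equations.

```python
# Minimal sketch of the trivial-equation removal from Section 2: if row j
# contains only its diagonal entry, the unknown x_j = b_j / a_jj is fixed
# by that equation alone, and column j (possibly dense) can be eliminated
# from all other equations.  Dense NumPy for clarity only.
import numpy as np

def remove_trivial_rows(A, b):
    n = A.shape[0]
    trivial = [j for j in range(n)
               if np.count_nonzero(A[j]) == 1 and A[j, j] != 0]
    keep = [j for j in range(n) if j not in trivial]
    x_known = {j: b[j] / A[j, j] for j in trivial}
    b_red = b[keep].astype(float)
    for j, xj in x_known.items():               # substitute known unknowns
        b_red -= A[np.ix_(keep, [j])].ravel() * xj
    return A[np.ix_(keep, keep)], b_red, keep, x_known

A = np.array([[2., 3., 1.],
              [0., 5., 0.],     # trivial row: fixes x_1 = 10 / 5 = 2
              [1., 2., 3.]])
b = np.array([4., 10., 7.])
A_red, b_red, keep, x_known = remove_trivial_rows(A, b)
# The reduced solve matches the full solve on the kept unknowns:
print(np.allclose(np.linalg.solve(A_red, b_red),
                  np.linalg.solve(A, b)[keep]))   # -> True
```

In the actual circuit matrices this reduction removes voltage-source constraints and with them the troublesome (nearly) dense rows and columns, as Table 1 shows.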

Table 1. Original and reduced systems: partitioning into eight sub-domains.

                Original matrix                Reduced matrix
Matrix       Order    Non-zeros  #If vars    Order    Non-zeros  #If vars
ccp          89556      760630     36077     89378      603753      2010
circ2a      482969     3912413    422900    482963     2750390       402
circuit_2     4510       21199      1265      1262        9702       629
circuit_3    12127       48137      2688      7607       34024       186
hcircuit    105676      513072      2490    105592      506970       975
scircuit    170998      958936      1874    170850      958370      1700

Table 2. DSC times on 8 GRISU processors: effect of ordering and partitioning.

            Partitioning + ordering   Partitioning only   No permutation
Matrix       #If vars    Time/s          Time/s           #If vars    Time/s
ccp              2010      0.51            0.97              65856     27.48
circ2a            402      0.57            2.02               8752     13.51
circuit_2         629      0.04            0.05               1245      0.07
circuit_3         186      0.25            0.25               1152      0.65
hcircuit          975      0.23          117.99              92654      9.27
scircuit         1700      2.16           29.69             112444    807.30

Table 1 presents the orders and numbers of non-zeros for the original and the reduced matrices as well as the number of interface variables (#If vars) for partitioning into eight sub-domains; repartitioning is applied. The results in Table 1 show a distinctly smaller number of interface variables in the case of the reduced systems. Repartitioning is effective in this case and results in very low costs for the inner iteration from Fig. 3.

In Table 2, times of the DSC method on eight GRISU processors with both repartitioning and reordering, with repartitioning only, and without repartitioning and reordering are displayed for the reduced systems. The number of interface variables is given for the first and last scenario; for the second, the number is the same as for the first one. The shortest times by far in Table 2 are achieved for the DSC method with repartitioning and reordering. Repartitioning keeps the interface system small; reordering usually reduces fill-in and improves the load balance of the processors (see 3.4).

Table 3 shows execution times of the DSC method on 1 to 16 GRISU processors for the largest reduced test case, circ2a.

Table 3. DSC times on GRISU: scalability.

          Time/s on p processors
Matrix    p=1    p=2    p=4    p=8    p=12   p=16
circ2a    2.19   1.89   1.00   0.57   0.43   0.39

In Fig. 5, total sequential execution times on GRISU of NEC's circuit simulator MUSASI with the original direct solver (Schur complement method) and the developed iterative DSC algorithm are displayed for a medium-size circuit problem (transient analysis) which results in equation systems of order 24705. The left times do not include the time for the setup phase while the right times do contain this time. The most expensive operation in the setup phase is the symbolic factorization with Markowitz ordering, which is necessary and state-of-the-art for a direct solver but can be left out for the iterative DSC solver developed.

Figure 5. Sequential execution times of NEC's circuit simulator MUSASI, in seconds: total time without and with setup, for the direct and the iterative solver.

The left times in Fig. 5 show a speedup of 2.3 of the DSC solver over the original direct solver. With setup phase, the iterative solver outperforms the direct one by even a factor of 4.1 since the costly symbolic factorization with Markowitz ordering can be skipped. From first tests, a similar ratio of the simulation times with direct and iterative solver can be expected for the parallel case, but the integration of the iterative DSC solver into the parallel circuit simulator MUSASI is not complete yet.

5 Conclusions

For equation systems from real problems, we demonstrated that the DSC algorithm combined with repartitioning and reordering is a well suited iterative solver for circuit simulation. For the performance of iterative solvers, a system reduction by the removal of trivial equations is crucial. Moreover, a distinct reduction of simulation time can be expected if the commonly used direct methods in circuit simulation codes are replaced with the iterative DSC method presented.

Bibliography

[1] C. W. Bomhof and H. A. van der Vorst, A parallel linear system solver for circuit simulation problems, Numerical Linear Algebra with Applications, 7 (2000), pp. 649-665.

[2] G. Karypis and V. Kumar, ParMETIS: Parallel graph partitioning and sparse matrix ordering library, Tech. Rep. 97-060, University of Minnesota, 1997.

[3] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in C, 2nd ed., Cambridge University Press, 1994.

[4] Y. Saad, Iterative Methods for Sparse Linear Systems, PWS Publishing, Boston, 1996.

[5] Y. Saad and M. Sosonkina, Distributed Schur complement techniques for general sparse linear systems, SIAM J. Sci. Comput., 21 (1999), pp. 1337-1356.

[6] H. van der Vorst, Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems, SIAM J. Sci. Statist. Comput., 13 (1992), pp. 631-644.