PARALLEL AND DISTRIBUTED MULTI-ALGORITHM CIRCUIT SIMULATION. A Thesis RUICHENG DAI


PARALLEL AND DISTRIBUTED MULTI-ALGORITHM CIRCUIT SIMULATION

A Thesis by RUICHENG DAI

Submitted to the Office of Graduate Studies of Texas A&M University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE

August 2012

Major Subject: Computer Engineering

Approved by:
Chair of Committee: Peng Li
Committee Members: Nancy Amato, Jiang Hu
Head of Department: Costas N. Georghiades

ABSTRACT

Parallel and Distributed Multi-Algorithm Circuit Simulation. (August 2012)
Ruicheng Dai, B.S., Zhejiang University
Chair of Advisory Committee: Dr. Peng Li

With the proliferation of parallel computing, parallel computer-aided design (CAD) has received significant research interest. Transient transistor-level circuit simulation plays an important role in digital/analog circuit design and verification. Increased VLSI design complexity has made circuit simulation an ever-growing bottleneck, making parallel processing an appealing solution for addressing this challenge. In this thesis, we propose and develop a parallel and distributed multi-algorithm approach to leverage the power of multi-core computer clusters for speeding up transistor-level circuit simulation. The targeted multi-algorithm approach provides a natural paradigm for exploiting parallelism in circuit simulation. Parallel circuit simulation is facilitated through the exploration of algorithm diversity, where multiple simulation algorithms work collaboratively on a single simulation task. To utilize computer clusters composed of multi-core processors, each algorithm is executed on a separate node with sufficient system resources such as processing power, memory and I/O bandwidth. We propose two communication schemes, namely master-slave and peer-to-peer, to allow for inter-algorithm communication. Compared with the shared-memory based multi-algorithm implementation, the proposed simulation approach alleviates the cache/memory contention that results from multi-algorithm execution and provides further runtime speedups.

DEDICATION

To my parents

ACKNOWLEDGEMENTS

First and foremost, I would like to thank my advisor, Dr. Peng Li. Dr. Li has supervised, advised and guided me from the very beginning of this work, and gave me extraordinary experience throughout the research. His dedication to excellence, encouragement of students, and enthusiasm for research will leave a lasting imprint on me.

I would like to thank the other professors as well, who are always willing to discuss with me and offer new ideas. Particular thanks go to Dr. Amato and Dr. Hu for their constructive comments on this thesis.

Thanks also to my colleagues and the department faculty and staff for making my time at Texas A&M University a great experience.

Finally, I am grateful to my family and friends. Thanks to my mother and father for their encouragement and love.

TABLE OF CONTENTS

ABSTRACT
DEDICATION
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
1. INTRODUCTION
    Motivation
    Previous Work and Limitations
    Overview and Organization
2. BACKGROUND
    Transistor-level Circuit Simulation
    Parallel Computing
3. MULTI-ALGORITHM PARALLELISM
    Multi-Algorithm Parallelism
    Simulation Algorithms
        Diversity in Nonlinear Iterative Methods
        Diversity in Numerical Methods
        Algorithm Selection
4. HIERARCHY OF PARALLEL AND DISTRIBUTED CIRCUIT SIMULATION
    Multi-Algorithm Communication Structure
        Master-slave Structure
        Peer-to-Peer Structure
    Multiple Threads in A Single Algorithm
        Parallel Device Evaluation
        Parallel Matrix Solver
5. RESULT AND ANALYSIS
    Supercomputer
    Result
    MPI vs. Sequential Algorithm
    MPI vs. HMAPS
    Comparison Between MPI Methods
    Accuracy
6. CONCLUSIONS
REFERENCES
VITA

LIST OF FIGURES

Figure 1. Transistor-level circuit simulation in digital/ASIC design flow
Figure 2. A simple circuit
Figure 3. Work flow of the transient circuit simulation
Figure 4. A sample circuit simulation result
Figure 5. Illustration of the multi-algorithm parallelism
Figure 6. Newton-Raphson method
Figure 7. Successive Chord method
Figure 8. Stability region of numerical integration methods
Figure 9. Stability region of Absolute Stability
Figure 10. A computer cluster
Figure 11. Global synchronizer node in Master-Slave structure
Figure 12. Details of an algorithm node in the Master-Slave structure
Figure 13. Flow chart of the algorithm node and global synchronizer in Master-Slave scheme
Figure 14. Peer-to-Peer communication scheme
Figure 15. Flow chart of the algorithm node in peer-to-peer scheme
Figure 16. A snapshot of the supercomputer Hydra
Figure 17. Comparison between master-slave and peer-to-peer communication structure
Figure 18. Simulation results on Node
Figure 19. Simulation results on Node

LIST OF TABLES

Table 1. Comparison between sequential algorithm and MPI methods
Table 2. Resource allocation between HMAPS and MPI methods
Table 3. Comparison between HMAPS and MPI methods

1. INTRODUCTION

1.1 Motivation

As a fundamental technology in computer-aided design, circuit simulation provides insights into electronic circuits by leveraging mathematical models to replicate the behavior of an actual electronic device or circuit [1]. In transistor-level time-domain circuit simulation, DC analysis is used to obtain the quiescent operating point and transient analysis is employed to compute the time-domain response of the circuit. Accurate, fast and robust transistor-level circuit simulation plays a critical part in the design and verification of digital/analog circuits.

In 1965, Gordon E. Moore, the co-founder of Intel, put forward that the number of transistors on integrated circuits would double every two years. This prediction, known as Moore's law, guided the development of integrated circuit technology for the following decades. A typical Very Large Scale Integrated (VLSI) circuit may integrate millions of transistors and other components in a few square millimeters on a chip. The size of large IC designs, together with inherently high accuracy requirements, places a heavy burden on circuit simulation. For instance, circuit designers may have to spend several days or even weeks on expensive circuit simulation, which greatly reduces design efficiency.

However, with the industry's recent shift to multi- and many-core processor technology, parallel computing is ubiquitous and is changing the landscape of computing and data processing. This change has profound implications for the development of compute-intensive applications. Leveraging the available parallel compute hardware brings new opportunities and challenges to large-scale circuit simulation.

This thesis follows the style of IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

1.2 Previous Work and Limitations

Parallel circuit simulation is not a new topic. The two key challenges of applying parallelism to the CAD area are parallel algorithm development and parallel program implementation. Prior work attempted to realize more parallelism from several different perspectives.

Parallel device evaluation and matrix solve [2][3] are the most direct methods. Device evaluation and matrix solve are the most time-consuming parts of simulation and dominate the total simulation time. It is straightforward to apply more threads/CPUs to these two parts to gain parallelism. However, the speedup is not linear, owing to the characteristics of the circuit and of multi-core computers. Thread creation, termination and synchronization also add overhead to the system.

There have also been attempts to realize parallel capabilities in a single simulation algorithm. The waveform pipelining approach [4] simultaneously computes circuit solutions at multiple adjacent time points in a way resembling hardware pipelining. Circuit decomposition can divide a large circuit into several small subcircuits which can be solved in parallel. However, decomposition-based circuit simulation algorithms like the multilevel Newton algorithm [5] and the waveform relaxation algorithm [6] have issues in terms of convergence. In addition, these two methods exploit fine-grained parallelism and hence require large programming effort.

The multi-algorithm parallel approach [7] exploits inter-algorithm parallelism by running several simulation algorithms on a shared-memory multi-core machine simultaneously.

However, most of these works are carried out on multi-core shared-memory machines. While the methods gain benefits from these platforms, like low on-chip communication overhead, they also pay a price for the drawbacks. For instance, the memory on a multi-core machine is shared by all processes/threads, and the number of CPUs on one computer is limited by the manufacturing process and power consumption. Hence, memory contention is inevitable, as is severe thread contention when the number of threads is greater than the number of CPUs, and system performance suffers noticeable degradation. Computer clusters offer a promising computing solution to address ever more complex, computationally intensive simulation problems with sufficient computing resources and high memory bandwidth.

1.3 Overview and Organization

In this thesis, we propose a distributed and parallel multi-algorithm circuit simulation where multiple simulation algorithms are mapped onto separate nodes in a supercomputer and work on the same simulation task, with effective communication schemes to realize on-the-fly synchronization and exploration of algorithm diversity. With sufficient computing resources utilized for parallel device evaluation and parallel matrix solvers in each algorithm, simulation runtime is further reduced. As a coarse-grained parallel approach, the proposed distributed circuit simulation requires less programming effort and is applicable to an increasing number of simulation algorithms.

This thesis is organized as follows. In Chapter 2, we introduce the background for time-domain circuit simulation and parallel computing. The principle of multi-algorithm circuit simulation, as well as the diversity of numerical integration methods and nonlinear iterative methods, is discussed in Chapter 3. In Chapter 4, we present the details of the MPI-based parallel and distributed circuit simulation. In Chapter 5, the platform on which the experiments are carried out and the experimental results are given. Finally, conclusions are drawn in Chapter 6.

2. BACKGROUND

2.1 Transistor-level Circuit Simulation

Transistor-level time-domain circuit simulation, a computer-aided design tool, greatly improves design efficiency and reduces the labor intensity of digital/ASIC VLSI circuit design. Figure 1 is a flow chart of digital/ASIC circuit design. First, system specifications and requirements need to be completed. A graph editor or text editor is used to describe the circuit's structure and behavior. After the behavioral description, synthesis realizes the automatic conversion from a high-level abstraction to a low-level description, where RTL code is translated to a gate-level circuit. Physical design, including floorplanning, placement and routing, is then carried out to generate the layout of the design. Finally, the manufacturing process fabricates designs onto silicon dies which are packaged into ICs [1].

Transistor-level circuit simulation can be performed at the circuit design level based on the pre-layout schematic. It may also be performed after the post-layout circuit netlists are extracted. It is not surprising that simulation plays a vital part in predicting circuit performance and rejecting failing designs. Transistor-level circuit simulation also plays an important role in the design of analog and RF circuits.

Figure 1. Transistor-level circuit simulation in digital/ASIC design flow.

In transistor-level circuit simulation, the circuit analysis problem is formulated according to circuit structure, device parameters and analysis requirements. Kirchhoff's voltage law (KVL) and Kirchhoff's current law (KCL) are the two basic principles in simulation. Hence, an electronic circuit can be described as a differential-algebraic equation,

$$\frac{d}{dt} q(x) + f(x) = u(t) \tag{2.1}$$

Here, $u(t)$ is the input vector and $x(t)$ is the vector of nodal voltages and branch currents. $q(x)$ and $f(x)$, corresponding to dynamic elements and static elements respectively, are nonlinear functions.

Regarding equation (2.1), the nonlinear functions $q(x)$ and $f(x)$ arise because the transistors in CMOS technology are nonlinear elements with complex nonlinear characteristics. The differential operation represents the behavior of energy storage components like capacitors and inductors, which lag in following changes of the input sources. For instance, the simple circuit in Figure 2 can be described by the nodal equation (2.2), a two-by-two linear system whose unknowns are the node voltages $V_1$ and $V_2$, whose coefficients are formed from the resistances $R_1$ to $R_4$, and whose right-hand side is driven by the source $E$.
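As an illustration of how KCL turns a resistive circuit into a small linear system of this kind, the following is a minimal sketch. The topology, the function name `solve_two_node` and the component values are assumptions chosen for illustration; they are not taken from Figure 2.

```python
def solve_two_node(E, R1, R2, R3, R4):
    """Nodal analysis of a hypothetical two-node resistive circuit.

    Assumed topology (illustrative only): source E feeds node 1 through R1,
    R4 ties node 1 to ground, R2 joins node 1 and node 2, R3 ties node 2
    to ground.
    """
    # KCL at each node gives G * [V1, V2]^T = [E / R1, 0]^T.
    g11 = 1.0 / R1 + 1.0 / R2 + 1.0 / R4
    g12 = -1.0 / R2                  # G is symmetric, so g21 == g12
    g22 = 1.0 / R2 + 1.0 / R3
    b1, b2 = E / R1, 0.0

    # Cramer's rule is enough for a 2x2 system.
    det = g11 * g22 - g12 * g12
    v1 = (b1 * g22 - g12 * b2) / det
    v2 = (g11 * b2 - g12 * b1) / det
    return v1, v2


v1, v2 = solve_two_node(E=10.0, R1=1e3, R2=1e3, R3=1e3, R4=1e3)
print(v1, v2)  # about 4 V at node 1 and 2 V at node 2
```

For a purely resistive circuit the term $q(x)$ in equation (2.1) vanishes and $f(x)$ is linear, which is why a single linear solve suffices here.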

Figure 2. A simple circuit.

To solve equation (2.1), DC analysis is used to obtain an initial operating point. In DC analysis, all the dynamic circuit elements are removed and a nonlinear iterative method is applied until the solution converges, typically in several iterations. Then a numerical integration method is applied to calculate the transient solutions. At each time point, transient analysis similarly needs the nonlinear iterative method to obtain a converged solution. In other words, by adopting a numerical integration formula, the time-domain transient response of the circuit is obtained by solving a sequence of equivalent nonlinear DC problems sequentially at all time points [8]. The flow chart of the circuit simulation is shown in Figure 3.

In transistor-level circuit simulation, device evaluation and matrix solve are the two most time-consuming parts. At each iteration in a single time point, device evaluation is performed to obtain equivalent mathematical models of circuit components. The evaluation requires numerous computations, especially for nonlinear components such as diodes, transistors, nonlinear resistances and nonlinear capacitances, which have a large number of device model derivatives. For instance, a diode's voltage and current can be represented as

$$I_D = I_S \left( e^{V_D / V_T} - 1 \right) \tag{2.3}$$

Here, $I_S$ is the reverse-bias saturation current and $V_T$ is the thermal voltage. The device model has an important position in the whole procedure of circuit analysis because the accuracy of simulation results depends significantly on the precision of the model.

Matrix solve is then applied to obtain the solution for that specific iteration. We LU-decompose the matrix to solve the equations. When the coefficient matrix is sparse, the time complexity of solving the equations grows with the number of nodes in the circuit at the approximate rate given in [9].
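The two steps described above, device evaluation and matrix solve, can be sketched as follows. This is a minimal dense-matrix illustration under stated assumptions: the helper names `diode_eval` and `lu_solve` and the default values of `i_s` and `v_t` are hypothetical, and a production simulator would use a sparse LU solver rather than the dense elimination shown here.

```python
import math


def diode_eval(v_d, i_s=1e-14, v_t=0.02585):
    """Evaluate the diode model of equation (2.3).

    Returns the current I_D and its derivative dI_D/dV_D, which is the
    kind of model derivative a Newton-type iteration needs.
    """
    e = math.exp(v_d / v_t)
    return i_s * (e - 1.0), i_s * e / v_t


def lu_solve(a, b):
    """Solve a x = b by LU-style elimination with partial pivoting (dense)."""
    n = len(a)
    lu = [row[:] for row in a]   # work on copies, leave the inputs untouched
    x = b[:]
    for k in range(n):
        # Pick the largest pivot in column k to limit round-off growth.
        p = max(range(k, n), key=lambda r: abs(lu[r][k]))
        if lu[p][k] == 0.0:
            raise ValueError("singular matrix")
        lu[k], lu[p] = lu[p], lu[k]
        x[k], x[p] = x[p], x[k]
        for r in range(k + 1, n):
            lu[r][k] /= lu[k][k]             # multiplier, an entry of L
            for c in range(k + 1, n):
                lu[r][c] -= lu[r][k] * lu[k][c]
            x[r] -= lu[r][k] * x[k]          # forward substitution on the fly
    for k in range(n - 1, -1, -1):           # back substitution with U
        s = x[k] - sum(lu[k][c] * x[c] for c in range(k + 1, n))
        x[k] = s / lu[k][k]
    return x


current, conductance = diode_eval(0.6)
solution = lu_solve([[3.0, -1.0], [-1.0, 2.0]], [10.0, 0.0])
print(current, conductance, solution)
```

In a simulator the derivative returned by the device evaluation is stamped into the coefficient matrix before the solve, so both steps repeat at every nonlinear iteration.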

Figure 3. Work flow of the transient circuit simulation.

2.2 Parallel Computing

From the perspective of computer architecture, a symmetric multiprocessor (SMP) machine is a system with two or more homogeneous processors on one chip, sharing the memory subsystem and bus structure. Although multiple CPUs run at the same time, they perform as a single machine. The system distributes the tasks in a queue symmetrically over the CPUs, thus greatly improving the data processing ability of the whole system.

Computer clusters emerged as a result of the development of low-cost microprocessors and high-speed networks. Many independent computer nodes are connected to each other in the cluster through fast local area networks. One computer node can be a single-processor or a multiple-processor system, which has its own memory, I/O devices and operating system. The system can provide a fast and reliable service solution, which can hardly be obtained even through a very expensive shared-memory system.

For these parallel platforms, Pthreads and MPI are the two most popular parallel programming APIs. POSIX threads [10], commonly known as Pthreads, specifies a set of interfaces (functions, header files) for threaded programming, where a single process can create multiple threads. Every thread can be assigned a different kind of work and run independently. These threads share the data and heap segments, but each thread has its own stack to store automatic variables.

MPI (Message Passing Interface), released in May 1994, is a standard for message-passing function libraries [11]. It absorbs the benefits of many existing message-passing libraries and has become one of the most popular parallel programming environments, especially for distributed-memory computers and network-based workstations. MPI has several advantages that provide the necessary conditions for the development of the parallel software industry: it is portable and flexible; it provides complete asynchronous communication functions; and it has a formal, detailed and precise definition.

In the MPI-based programming model, a fixed set of processes is created at the initialization of the program. Processes receive and send messages by calling library functions. These processes can execute the same or different code paths, correspondingly called single program multiple data (SPMD) or multiple program multiple data (MPMD). Communication between the processes can be point-to-point or collective.
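The SPMD model described above can be sketched without an MPI installation. In the following illustration, threads and queues from the Python standard library stand in for MPI processes and messages; this is a swapped-in technique, not MPI itself, and the names `Comm`, `program` and `run_spmd` are hypothetical. Every rank runs the same function and branches on its rank, uses point-to-point messages to gather partial sums on rank 0, and then receives the total through an emulated broadcast.

```python
import queue
import threading


class Comm:
    """Toy stand-in for a communicator: one inbox queue per rank."""

    def __init__(self, size):
        self.size = size
        self._inbox = [queue.Queue() for _ in range(size)]

    def send(self, dest, msg):
        self._inbox[dest].put(msg)

    def recv(self, rank):
        return self._inbox[rank].get()   # blocks until a message arrives


def program(rank, comm, results):
    """Same code on every rank (SPMD); behavior depends on the rank."""
    partial = sum(range(rank * 10, (rank + 1) * 10))
    if rank == 0:
        total = partial
        for _ in range(comm.size - 1):
            total += comm.recv(0)        # point-to-point receives
        for dest in range(1, comm.size):
            comm.send(dest, total)       # emulated collective broadcast
    else:
        comm.send(0, partial)
        total = comm.recv(rank)
    results[rank] = total


def run_spmd(size):
    comm = Comm(size)
    results = [None] * size
    threads = [
        threading.Thread(target=program, args=(rank, comm, results))
        for rank in range(size)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


results = run_spmd(4)
print(results)  # every rank ends with the sum of 0..39
```

With a real MPI library the per-rank inboxes disappear: the fixed set of processes is created by the launcher, and the gather and broadcast steps map to library calls.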

3. MULTI-ALGORITHM PARALLELISM

3.1 Multi-Algorithm Parallelism

From the foregoing discussion, the transient circuit simulation problem can be formulated as equation (3.1).

$$\frac{d}{dt} q(x(t)) + f(x(t)) = u(t) \tag{3.1}$$

In a circuit simulation algorithm, one nonlinear iterative method is used to linearize the nonlinear functions and one numerical integration method replaces the differential operation with a difference operation. Newton-Raphson and Successive Chord are typical nonlinear iterative methods, while Backward Euler, Gear2 and DASSL are classic numerical integration methods. A variety of simulation algorithms is then generated by combining methods of these two kinds. SPICE (Simulation Program with Integrated Circuit Emphasis) [12], a general-purpose, open source electronic circuit simulator for integrated circuit and board-level design, takes Newton-Raphson with Backward Euler as its basic circuit simulation algorithm. Compared to the Newton-Raphson with Backward Euler algorithm, Successive Chord is a higher-speed simulation algorithm.

While the algorithm pool provides great diversity, it also brings complexity in choosing an optimal algorithm for a specific circuit, because the algorithms behave quite differently for different kinds of circuits, and even in different stages of the same circuit over the whole simulation time.


Figure 4. A sample circuit simulation result.

Figure 4 shows simulation results obtained by using the SC algorithm and the Newton + BE algorithm for an inverter chain circuit. During the simulation, we find that the SC algorithm produces results much faster in parts A and C but more slowly in part B. From the figure, we can see that the waveform remains stable during parts A and C. There the SC algorithm's advantage shows: it converges very quickly and the cost of each iteration is very small because it uses a constant Jacobian matrix. In part B, the waveform changes significantly and the SC algorithm needs a large number of iterations to converge to the final solution at every time step. Although the cost of each iteration is still small, the time spent on one time step increases significantly. When the waveform gets steeper, SC will probably diverge.

Inspired by this observation, we expect the best overall result if the benefit of the SC algorithm in parts A and C is exploited together with the benefit of the Newton + BE algorithm in part B. Consequently, we build on the multi-algorithm approach in [7] and propose a new approach on a distributed-memory platform that runs multiple simulation algorithms on multiple computer nodes in parallel to exploit the diversity of these algorithms.

To illustrate, we assume two algorithms are started on the same circuit simulation. In Figure 5, part A corresponds to the first time period and part B to the second. In the first period, algorithm SC is the fastest for the reason discussed, and it can pass its results to algorithm BE + Newton at the end of the first period. With this faster solution, algorithm BE + Newton can skip its slow part and begin its calculation for the next period. In part B, algorithm BE + Newton turns out to be faster and it shares the solution with algorithm SC. In this way, when we adopt more algorithms, we are picking out the best-performing algorithm for every small period along the whole simulation, so that the benefits of all algorithms are exploited and simulation speed approaches the optimum.

Figure 5. Illustration of the multi-algorithm parallelism.
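The effect illustrated in Figure 5 can be captured in a small cost model. The sketch below is an idealized illustration, not the thesis implementation: the per-period runtimes are made-up numbers, the function name `multi_algorithm_runtime` is hypothetical, and communication overhead is ignored. At each synchronization period the fastest algorithm determines the elapsed time, because its solution is shared with the others.

```python
def multi_algorithm_runtime(costs):
    """Idealized multi-algorithm runtime.

    costs maps an algorithm name to its runtime in each synchronization
    period. Returns the total runtime when the fastest algorithm wins
    every period, together with the list of per-period winners.
    """
    names = list(costs)
    periods = len(costs[names[0]])
    total = 0.0
    winners = []
    for i in range(periods):
        best = min(names, key=lambda name: costs[name][i])
        winners.append(best)
        total += costs[best][i]
    return total, winners


# Hypothetical runtimes for parts A, B and C of Figure 4.
costs = {
    "SC": [1.0, 8.0, 1.0],          # fast on flat waveforms, slow on part B
    "BE+Newton": [3.0, 4.0, 3.0],   # steadier cost per period
}
total, winners = multi_algorithm_runtime(costs)
print(total, winners)
print({name: sum(c) for name, c in costs.items()})
```

With these numbers each algorithm alone needs 10 time units, while the combined run needs 6. If the whole simulation were treated as one period, the model would simply return the runtime of the faster single algorithm.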

Concerning the communication granularity, if we set the interval to the whole simulation time, the system simply picks out the fastest simulation algorithm for the simulation task, and the diversity is not fully exploited. However, if we choose a small interval, communication becomes frequent and slows the calculation as mutual memory access conflicts increase. Hence, there is a tradeoff between efficiency and communication frequency. In the implementation, we need to choose a reasonable granularity and make the information sharing among all the algorithms efficient. This will be discussed in Chapter 4.

3.2 Simulation Algorithms

In this section, we discuss the advantages and disadvantages of different nonlinear iterative methods and numerical integration methods as well as their roles in simulation algorithm selection.

3.2.1 Diversity in Nonlinear Iterative Methods

At a single time point, equation (3.1) can be represented as equation (3.2).

$$F(x) = 0 \tag{3.2}$$

A. Newton-Raphson

Newton-Raphson is an effective method for solving nonlinear equations [12]. The solution at iteration $k+1$ is determined by equation (3.3),

$$J(x_k)(x_{k+1} - x_k) = -F(x_k) \tag{3.3}$$

where $J(x_k)$ is called the Jacobian matrix.

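$$J(x_k) = \left. \begin{bmatrix} \dfrac{\partial F_1}{\partial x_1} & \dfrac{\partial F_1}{\partial x_2} & \cdots & \dfrac{\partial F_1}{\partial x_n} \\ \dfrac{\partial F_2}{\partial x_1} & \dfrac{\partial F_2}{\partial x_2} & \cdots & \dfrac{\partial F_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial F_n}{\partial x_1} & \dfrac{\partial F_n}{\partial x_2} & \cdots & \dfrac{\partial F_n}{\partial x_n} \end{bmatrix} \right|_{x = x_k} \tag{3.4}$$

Assuming the solution of the $k$-th iteration is known, the Jacobian matrix and $F(x_k)$ can be calculated by device evaluation, and the $(k+1)$-th solution is obtained by solving equation (3.3). If the difference between the solutions at iterations $k+1$ and $k$ is smaller than a given threshold, it is accepted as the converged solution. If not, we proceed to the next iteration.

For instance, $r_1$ is the root of the equation $f(x) = 0$ in Figure 6. The initial solution is assumed at point $P_0(x_0, y_0)$, and $x_1$ is obtained by using tangent line 1, which corresponds to equation (3.3). However, $y_1$ is larger than expected. The next solution $x_2$ is calculated from point $P_1$ similarly.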
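The iteration and its stopping test can be sketched for a single unknown, where the Jacobian matrix reduces to an ordinary derivative. The example circuit (a voltage source E in series with a resistor R and a diode), the parameter values and the function names are assumptions made for illustration only.

```python
import math

E, R = 5.0, 1e3              # hypothetical source voltage and resistance
I_S, V_T = 1e-14, 0.02585    # hypothetical diode parameters


def f(v):
    """KCL residual at the diode node: resistor current plus diode current."""
    return (v - E) / R + I_S * (math.exp(v / V_T) - 1.0)


def dfdv(v):
    """Derivative of the residual, the scalar counterpart of J(x_k)."""
    return 1.0 / R + I_S * math.exp(v / V_T) / V_T


def newton_raphson(func, deriv, x0, tol=1e-12, max_iter=50):
    """Iterate J(x_k) (x_{k+1} - x_k) = -F(x_k) until the update is small."""
    x = x0
    for k in range(1, max_iter + 1):
        dx = -func(x) / deriv(x)   # solve the linearized equation
        x += dx
        if abs(dx) < tol:          # difference between iterations k+1 and k
            return x, k
    raise RuntimeError("Newton-Raphson did not converge")


v_diode, iterations = newton_raphson(f, dfdv, x0=0.7)
print(v_diode, iterations, f(v_diode))
```

The starting guess of 0.7 V lies slightly above the root, so for this convex residual the iterates approach the solution from one side. Each iteration re-evaluates both the residual and its derivative, which is the per-iteration cost that the text attributes to Newton's method.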

Figure 6. Newton-Raphson method.

When $x_k$ is close to the exact solution, it can be proved that [12]

$$\Delta x_{k+1} \le C (\Delta x_k)^2 \tag{3.5}$$

Here $C$ is a constant and $\Delta x_k$ denotes the error at iteration $k$. Hence, Newton's method has a quadratic convergence rate.

When Newton's method is applied in circuit simulation, its Jacobian matrix needs to be recalculated by evaluating all the devices and decomposed in each iteration, which involves a large number of expensive derivative computations. Although Newton's method is robust with a quadratic convergence rate, the cost of each iteration is high and the simulation time for one step is large.

B. Successive Chord method

Another nonlinear iterative method is the Successive Chord (SC) method [13]. It can be represented as

$$J_{sc}(x_{k+1} - x_k) = -F(x_k) \tag{3.6}$$

where the Jacobian matrix $J_{sc}$ is constant. In Figure 7, we get $x_1$ by using tangent line 1, which corresponds to equation (3.6). The final solution $x_2$ is obtained in the next iteration from point $P_1$. The obvious difference from Newton-Raphson is that the tangent lines are parallel.

Figure 7. Successive Chord method.

Compared to Newton-Raphson, the advantage of the SC method is that it uses a constant Jacobian matrix $J_{sc}$ in simulation. The Jacobian matrix is constructed and decomposed at the beginning, and the lower-upper (LU) triangular factors are stored for efficient reuse. So the method does not need to calculate the derivatives of device equations during the whole simulation. Consequently, the cost of each iteration in the SC method is very small. However, the convergence rate of the SC method is linear, which means that for every

time step the method probably needs more iterations. The strict convergence criterion for the SC method is

$$\left\| I - J_{sc}^{-1} J_F(v) \right\| < 1 \tag{3.7}$$

Here, $I$ is the identity matrix, $J_{sc}$ is the chord value and $J_F(v)$ is the exact Jacobian matrix. Consequently, the $J_{sc}$ matrix should be selected wisely; otherwise the method will probably diverge. According to our research, the SC method has difficulty converging for analog circuits, which have greater changes than combinational circuits.

3.2.2 Diversity in Numerical Methods

In transient analysis, equation (3.1) may be represented as a first-order differential equation:

$$\dot{x} = f(x, t), \quad t_0 \le t \le T \tag{3.8}$$

with initial condition $x(t_0) = x_0$. Here, $\dot{x}$ is the derivative of $x$ and $t$ is the time variable. The initial solution $x(t_0) = x_0$ is solved by DC analysis.

In order to solve the differential-algebraic equations, we first discretize $[t_0, T]$ into several distinct time points $(t_0, t_1, t_2, \ldots, t_m = T)$. Then we use a difference equation to replace the differential equation to get the approximate values $(x_0, x_1, x_2, \ldots, x_m)$ at these points. For the solution at $t_{n+1}$, the number of the previous

solutions $(x_n, x_{n-1}, \ldots)$ used is determined by the numerical method. Numerical methods can be classified into one-step and multi-step methods.

A. One-step methods

Backward Euler is a one-step method [12] with

$$x_{n+1} = x_n + h \dot{x}_{n+1} \tag{3.9}$$

The local truncation error (LTE) is

$$LTE_{BE} = \frac{h^2}{2} \ddot{x}(\xi) \tag{3.10}$$

where $h = t_{n+1} - t_n$. In circuit simulation, a fixed step-size method is adopted if $h$ is fixed at a reasonable value. There also exists a variable step-size method for Backward Euler. After an acceptable value $\varepsilon$ is chosen as the bound for the local truncation error, the variable $h$ is calculated as

$$h = \sqrt{\frac{2\varepsilon}{|\ddot{x}(\xi)|}} \tag{3.11}$$

Here, $\ddot{x}(\xi)$ is the second-order derivative. $x_{n+1}$ is calculated by equation (3.9). If the local truncation error at $t_{n+1}$ is smaller than $\varepsilon$, the solution is accepted. Otherwise, it is abandoned and the solution is recomputed with a smaller $h$ until it satisfies the error tolerance. The variable step-size method enhances Backward Euler by allowing a larger time step.

Forward Euler is also a one-step method with

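$$x_{n+1} = x_n + h \dot{x}_n \tag{3.12}$$

It does not include $\dot{x}_{n+1}$, so the calculation is explicit and simple. The solution at any time can be obtained from the previous solution alone, which contributes to its fast speed as well as its low robustness.

Another one-step method is Trapezoidal [14]. The formula is

$$x_{n+1} = x_n + \frac{h}{2} \left( \dot{x}_n + \dot{x}_{n+1} \right) \tag{3.13}$$

with local truncation error

$$LTE_{TR} = \frac{h^3}{12} \dddot{x}(\xi) \tag{3.14}$$

It has a smaller local truncation error and allows a larger step size.

B. Multi-step methods

Multi-step methods employ the solutions $(x_n, x_{n-1}, \ldots, x_{n-p+1})$ at points $(t_n, t_{n-1}, \ldots, t_{n-p+1})$ in numerical integration:

$$x_{n+1} = \sum_{i=0}^{p-1} \alpha_i x_{n-i} + h \sum_{i=-1}^{p-1} \beta_i \dot{x}_{n-i} \tag{3.15}$$

$p$ is the order of the integration method. The Gear2 [15] method uses the following formula to get the solution at $t_{n+1}$:

$$x_{n+1} = \frac{(h_{n+1} + h_n)^2}{h_n (2h_{n+1} + h_n)} x_n - \frac{h_{n+1}^2}{h_n (2h_{n+1} + h_n)} x_{n-1} + \frac{h_{n+1} (h_{n+1} + h_n)}{2h_{n+1} + h_n} \dot{x}_{n+1} \tag{3.16}$$

Here $h_{n+1} = t_{n+1} - t_n$ and $h_n = t_n - t_{n-1}$, and the local truncation error is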

$$LTE_{Gear2} = \frac{h_{n+1}^2 (h_{n+1} + h_n)^2}{6 (2h_{n+1} + h_n)} \dddot{x}(\xi) \tag{3.17}$$

Here $t_n \le \xi \le t_{n+1}$. Compared to Backward Euler, Gear2 has a more complicated integration formula and is much faster, with smaller LTE and larger time step size.

DASSL [16], a variable-order variable-stepsize method, uses a predictor and a corrector to solve the differential equation. The predictor for a $k$-th order formula is generated by interpolating the last $k+1$ solutions,

$$\omega^P_{n+1}(t_{n-i}) = x_{n-i}, \quad i = 0, 1, \ldots, k. \tag{3.18}$$

Hence, the solution at time $t_{n+1}$ can be predicted by using the predictor function,

$$x^{(0)}_{n+1} = \omega^P_{n+1}(t_{n+1}), \quad \dot{x}^{(0)}_{n+1} = \dot{\omega}^P_{n+1}(t_{n+1}) \tag{3.19}$$

The corrector polynomial $\omega^C_{n+1}$ is an interpolation of the predictor at the last $k$ time points and can be solved by equation (3.20),

$$\alpha_s \left( \omega^C_{n+1}(t_{n+1}) - x^{(0)}_{n+1} \right) + h_{n+1} \left( \dot{\omega}^C_{n+1}(t_{n+1}) - \dot{x}^{(0)}_{n+1} \right) = 0 \tag{3.20}$$

where $\alpha_s = -\sum_{j=1}^{k} \frac{1}{j}$ and $h_{n+1}$ is the predicted step size for $t_{n+1}$.

After the corrector $\omega^C_{n+1}$ at $t_{n+1}$ is obtained, the circuit solution is solved by equation (3.21), with the LTE applied to determine whether $x_{n+1}$ is accepted or not.

$$F\left( t_{n+1}, \omega^C_{n+1}(t_{n+1}), \dot{\omega}^C_{n+1}(t_{n+1}) \right) = 0 \tag{3.21}$$

DASSL uses the LTE to control the step size and the integration order dynamically. Before calculating $x_{n+1}$, DASSL utilizes the existing step size and the order

$k$ to estimate the LTE at $t_{n+1}$. With the estimated LTE, DASSL determines the order $k$ for the next time step. After $x_{n+1}$ is solved with the above equations, $k$ is used either to solve the next time point or to recompute $x_{n+1}$, depending on whether $x_{n+1}$ is accepted. DASSL has a very complex control scheme to maintain stability and can achieve significant speedup.

3.2.3 Algorithm Selection

For the nonlinear iterative methods, we use the Newton-Raphson and Successive Chord methods.

In the numerical methods, the values $(x_0, x_1, x_2, \ldots, x_m)$ obtained at $(t_0, t_1, t_2, \ldots, t_m = T)$ are approximations to the exact values. The errors are introduced in two ways. First, local truncation error is introduced because at time $t_{n+1}$ we discard the high-order differential terms. Second, we get the solution at time $t_{n+1}$ from the previous solutions $(x_n, x_{n-1}, \ldots)$, which we assume are exact values. However, these solutions are approximations because of the LTE. Hence, the errors may accumulate. If the influence of previous errors on later time points does not increase with time, the method is stable. If the errors accumulate and exceed the error limit, the method is not stable. In order to clarify this, we introduce a test equation,

$$\dot{x} = \lambda x \tag{3.22}$$

If we apply Forward Euler to the test equation, we get

$$x_{n+1} = x_n + \lambda x_n h = x_n (1 + \lambda h) = x_0 (1 + \lambda h)^{n+1} \tag{3.23}$$

When the error at the initial solution is assumed to be $\delta_0$, the error at time $t_{n+1}$ is

$$\delta_{n+1} = \delta_0 (1 + \lambda h)^{n+1} \tag{3.24}$$

where $\lambda < 0$ and real. Consequently, when $|1 + \lambda h| < 1$, or $0 < h < -2/\lambda$, $\delta_{n+1}$ is bounded and the method is stable. If we represent $|1 + \lambda h| < 1$ in the complex plane of $\lambda h$, it looks like Figure 8(a). The shaded part is called the stability region.

A stability concept, called Absolute Stability, specifies that a method is absolutely stable if its region of absolute stability covers the entire left half-plane, as in Figure 9. According to this concept, Forward Euler is not absolutely stable, while Backward Euler, the Trapezoidal method and the fixed step-size Gear2 method in Figure 8(b)(c)(d) are unconditionally stable.

Stability and local truncation error are the two major considerations in selecting numerical integration methods. BE is robust and easy to implement, with large local truncation error and small time step size. Fixed step-size Gear2 has much smaller local truncation error and larger time step size. However, Gear2 is much more complex to implement and brings a large computation cost at every time point. The stability of the DASSL method is more difficult to analyze. According to the experiments, DASSL is stable in most cases, as in Figure 9, and potentially leads to the largest time step size.

In practice, the performance of a particular algorithm is determined by the circuit type and input signal. It is difficult to tell which one is optimal before executing it once. In the system, we choose the Newton-Raphson method (Newton) as a

solid base for the system and the Successive Chord method (SC), Gear2 + Newton and DASSL + Newton as aggressive algorithms to speed up the whole system.

Figure 8. Stability region of numerical integration methods.

Figure 9. Stability region of Absolute Stability.
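The stability condition for Forward Euler on the test equation can be checked numerically. The following sketch assumes the test equation $\dot{x} = \lambda x$ with a real $\lambda < 0$; the value $\lambda = -1000$, the two step sizes and the function names are arbitrary choices for illustration. It applies Forward Euler and Backward Euler with one step size inside the bound $h < -2/\lambda$ and one outside it.

```python
def forward_euler(lam, h, steps, x0=1.0):
    """Explicit update x_{n+1} = x_n + h * lam * x_n."""
    x = x0
    for _ in range(steps):
        x = x + h * lam * x
    return x


def backward_euler(lam, h, steps, x0=1.0):
    """Implicit update x_{n+1} = x_n + h * lam * x_{n+1}, solved for x_{n+1}."""
    x = x0
    for _ in range(steps):
        x = x / (1.0 - h * lam)
    return x


lam = -1000.0        # exact solution decays, so a stable method must decay too
bound = -2.0 / lam   # Forward Euler is stable only for h below this value
for h in (0.0015, 0.0025):
    fe = forward_euler(lam, h, steps=50)
    be = backward_euler(lam, h, steps=50)
    print(f"h={h} (bound {bound}): FE={fe:.3e}, BE={be:.3e}")
```

With the smaller step the Forward Euler amplification factor has magnitude 0.5 and the solution decays; with the larger step the magnitude is 1.5 and the solution grows without bound, while Backward Euler decays for both step sizes.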


4. HIERARCHY OF PARALLEL AND DISTRIBUTED CIRCUIT SIMULATION

The hierarchy of parallel and distributed circuit simulation, built on the computer cluster in Figure 10, adopts two levels of parallelism: inter-algorithm parallelism and intra-algorithm parallelism. At the higher level, multiple simulation algorithms run in parallel on separate computer nodes, with MPI methods transferring data between them to exploit the algorithm diversity. The cloud in Figure 10 represents the communication structure between nodes. Two MPI communication structures are proposed, namely the master-slave structure and the peer-to-peer structure, whose different characteristics suit different types and sizes of circuit. At the lower level, each algorithm has full control of all the resources of its node, like CPUs, memory bandwidth and I/O, which allows it to reach high intra-algorithm parallelism.

Figure 10. A computer cluster.


4.1 Multi-Algorithm Communication Structure

4.1.1 Master-slave Structure

In the master-slave structure, a flexible global synchronizer is utilized. Each algorithm communicates with the global synchronizer rather than talking to the other algorithms directly during the simulation. The synchronizer broadcasts the new solutions to inform all the algorithms. The communication between the synchronizer and the algorithm nodes is shown in Figure 11.

Figure 11. Global synchronizer node in the Master-Slave structure.

To give a clear view of the hierarchy, we discuss the main roles played by the algorithm node side and the global synchronizer side. An algorithm node is required to send the information of all circuit nodes, including voltages or currents, to the other algorithms to bring them to the time point it has reached. In addition, some algorithms, such as Gear2 and DASSL, need not only the information at the most recent time point but also the solutions of several previous time steps to calculate the new result. Hence, every algorithm sends the results of k time steps to the global synchronizer. Here,

k is determined by the highest order among the numerical integration methods in the system. From the foregoing discussion, Newton-Raphson needs the solution of one previous time step, Gear2 needs the solutions of two previous time steps, while DASSL needs the solutions of five previous time steps. We set k to 6 after taking the new solution into consideration.

In addition, an algorithm node fully controls the granularity of the communication with the global synchronizer. In this implementation, we choose a granularity of one time step for all the algorithms. Hence, the algorithm node signals a communication thread to transfer the solution after it finishes the computation of one time step. The reason for creating a separate thread to take over the interaction task is to decouple communication from computation. Although the algorithm node can use the nonblocking MPI send method to transfer its own solutions, the MPI broadcast method used to receive the most recent solutions back is blocking. Figure 12 shows a computer node with 4 cores on which BE + Newton is mapped.
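The decoupling of computation and communication described above can be sketched as follows. This is a minimal Python illustration rather than the thesis code: the `send` callback is a hypothetical stand-in for the MPI exchange with the synchronizer, and the "solution" is a placeholder dictionary instead of real node voltages.

```python
import queue
import threading


def run_algorithm_node(num_steps, send, step_size=1e-9):
    """Compute `num_steps` time steps while a second thread ships the results.

    `send` is a hypothetical callback standing in for the MPI transfer to the
    global synchronizer; `step_size` is an assumed fixed time step.
    """
    outbox = queue.Queue()
    stop = object()  # sentinel telling the communication thread to finish

    def communicate():
        while True:
            item = outbox.get()
            if item is stop:
                break
            send(item)  # may block without stalling the computation loop

    comm_thread = threading.Thread(target=communicate)
    comm_thread.start()

    t = 0.0
    for step in range(num_steps):
        t += step_size
        solution = {"step": step, "time": t}  # placeholder for node voltages
        outbox.put(solution)  # signal the communication thread, keep computing

    outbox.put(stop)
    comm_thread.join()


if __name__ == "__main__":
    sent = []
    run_algorithm_node(5, sent.append)
    print([s["step"] for s in sent])
```

The queue preserves the order of the time steps, so the receiver sees solutions in the order they were computed even if an individual transfer is slow.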

Figure 12. Details of an algorithm node in the Master-Slave structure.

Because the communication load on the global synchronizer is very large, the synchronizer is mapped to its own node to avoid memory contention. During simulation, it monitors all algorithm nodes. As soon as one algorithm node sends a new solution, the synchronizer makes the connection and receives the solution. The synchronizer maintains a data structure holding the most recent solutions the system has during the simulation. The data structure contains the solutions of k time steps. After the synchronizer receives a new message, the message is merge-sorted with the stored data, the first k solutions are kept and the data structure is updated. If the new solution provided by an algorithm is ahead of the existing solutions, after the merge sort the data structure is updated by inserting the new solution into the structure.

However, if the new solution is stale and lags behind the existing solutions, it is abandoned and the solution structure stays unchanged. After the global synchronizer processes one message and gets updated, it broadcasts the new solutions to all algorithm nodes. Hence, all algorithms are updated with the latest solutions and begin their next step calculation. In this way, the global synchronizer always keeps the most recent solutions, and the algorithm nodes interact with each other indirectly. The detailed work flow of the system is shown in Figure 13.

In the master-slave structure, all algorithms are synchronized continuously. Slow execution of any one of these algorithms is sidestepped by the others, and their advantages are fully exploited. However, the global synchronizer needs to process and transfer a large amount of data, since several nodes are continuously sending messages to it. Consequently, the synchronizer may easily become the bottleneck of the system and reduce system efficiency.
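The update rule of the global synchronizer can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the thesis implementation: a solution is modelled as a `(time, values)` tuple, both lists are sorted by ascending time, "the first k solutions" is read as the k most recent time points, and the name `merge_solutions` is hypothetical.

```python
import heapq


def merge_solutions(stored, incoming, k=6):
    """Merge an incoming message into the stored history.

    Both arguments are lists of (time, values) sorted by ascending time.
    Returns (history, updated); a stale message leaves the history unchanged.
    """
    if not incoming:
        return stored, False
    if stored and incoming[-1][0] <= stored[-1][0]:
        return stored, False  # stale: the message lags behind what is stored

    merged = []
    for point in heapq.merge(stored, incoming, key=lambda p: p[0]):
        if merged and merged[-1][0] == point[0]:
            continue  # time point already present; keep the first copy
        merged.append(point)
    return merged[-k:], True  # keep only the k most recent time points


if __name__ == "__main__":
    history = [(1.0, "a"), (2.0, "b"), (3.0, "c")]
    history, updated = merge_solutions(history, [(3.0, "y"), (4.5, "z")], k=3)
    print(updated, history)
```

Only when `updated` is true would the synchronizer need to broadcast, which matches the description that stale messages are abandoned.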

Figure 13. Flow chart of the algorithm node and global synchronizer in the Master-Slave scheme.


4.1.2 Peer-to-Peer Structure

To avoid the bottleneck at the synchronizer, we propose a peer-to-peer scheme. In this structure, each algorithm node similarly creates two threads, for computation and communication respectively. The communication thread receives messages from its preceding node, processes the received message together with its own solutions, then sends the updated solution to the next node. The four algorithms form a loop, and the most recent solutions keep circulating in the loop to synchronize all algorithms and exploit their diversity. The communication structure is shown in Figure 14.

Figure 14. Peer-to-Peer communication scheme.

This structure saves resources by abandoning the global synchronizer and distributes the large data processing burden of the global synchronizer to the algorithm nodes. It eliminates the bottleneck and also decreases the network load, because in the master-slave structure the communication is collective and an algorithm node may not be aware of the status of the global synchronizer

and may send a stale solution, which occupies network bandwidth and hampers effective communication. The main disadvantage is that, in the peer-to-peer structure, all algorithms are updated only when a one-loop data transfer is completed, whereas in the master-slave structure all other algorithms are informed as soon as any one algorithm gets a new effective solution.

In this loop structure, deadlock and the start and exit of the program need additional attention. For instance, a deadlock happens when a successor node waits on a blocking MPI message from a predecessor node which has reached the end of the simulation and exited. In our implementation, the BE + Newton algorithm, which is the most stable and has a low computational cost for the initial time steps, is used to trigger the transfer of data around the loop. At the end of the simulation, a flag is used to track how many nodes have finished. Every node increments the flag before it exits. The flag is stored in the MPI message. Hence, when a node receives a message with a flag value equal to the number of all other algorithms, it knows all previous nodes have finished, so it skips sending the message to the next node and exits. This way, the system can exit correctly. Figure 15 shows the detailed work flow in this structure.
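The exit protocol can be illustrated with a small sequential simulation. This Python sketch makes several simplifying assumptions that are not in the thesis: the ring is simulated in one process instead of with MPI, each node advances the shared solution only while it holds the message, and the names `simulate_ring`, `step_sizes` and `t_end` are hypothetical.

```python
def simulate_ring(step_sizes, t_end):
    """Simulate the loop transfer and the flag-based exit.

    step_sizes[i] is the assumed time advance node i contributes per visit.
    Returns (exit_order, hops): the order in which nodes exit and the number
    of messages forwarded around the loop.
    """
    n = len(step_sizes)
    if n < 2:
        raise ValueError("the ring needs at least two algorithm nodes")

    message = {"time": 0.0, "finished": 0}
    node = 0  # node 0 plays the role of BE + Newton, which starts the loop
    exit_order = []
    hops = 0

    while True:
        if message["finished"] == n - 1:
            # All other nodes are done: exit without forwarding the message.
            exit_order.append(node)
            break
        if message["time"] < t_end:
            message["time"] = min(t_end, message["time"] + step_sizes[node])
        if message["time"] >= t_end:
            message["finished"] += 1  # increment the flag before exiting
            exit_order.append(node)
        node = (node + 1) % n  # forward the message to the successor
        hops += 1

    return exit_order, hops


if __name__ == "__main__":
    order, hops = simulate_ring([1.0, 2.0, 4.0, 3.0], t_end=10.0)
    print("exit order:", order, "messages forwarded:", hops)
```

In this model the last node to receive the message sees a flag equal to the number of other nodes and never posts a send, so no node is left waiting on a message from a peer that has already exited.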

Figure 15. Flow chart of the algorithm node in the peer-to-peer scheme.

4.2 Multiple Threads in a Single Algorithm

Transient analysis may be conducted over a large number of time steps. At every time step, several iterations are needed to reach convergence. Hence, the total number of iterations can be very high. The device evaluation and matrix solve carried out at every iteration are very time consuming and take nearly the whole simulation time. As discussed previously, there is a tradeoff between the number of iterations per time step and the cost of each iteration for different nonlinear iterative methods. Here we further make use of the power of multi-core processors to expedite the device evaluation and matrix solve in a single algorithm node. A distributed platform makes it possible to fully realize intra-algorithm parallelism, as an algorithm mapped onto one node can exclusively access all the compute and memory resources.

4.2.1 Parallel Device Evaluation

In the device evaluation, the Jacobian matrix $J(x_k)$ has a large number of partial derivative terms. In the parallelization, nonlinear elements are divided into several groups, and each group is handled by one thread. The speedup can reach linear scaling when there are sufficient nonlinear elements. However, because of the cost of spawning, executing and terminating threads, the benefits of parallelization may be reduced, especially when the nonlinear elements in the circuit are few.
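The grouping of nonlinear elements can be sketched as below. This Python example is only an illustration of the partitioning idea: the ideal diode model `evaluate_diode` and its parameters are assumptions standing in for real transistor models, and Python threads will not show the speedup of the native Pthreads implementation because of the interpreter lock.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def evaluate_diode(v, i_sat=1e-14, v_t=0.02585):
    """Return (current, conductance) of an ideal diode at voltage v."""
    e = math.exp(v / v_t)
    return i_sat * (e - 1.0), i_sat * e / v_t


def split_into_groups(items, num_groups):
    """Split items into num_groups contiguous groups of near-equal size."""
    size, extra = divmod(len(items), num_groups)
    groups, start = [], 0
    for g in range(num_groups):
        end = start + size + (1 if g < extra else 0)
        groups.append(items[start:end])
        start = end
    return groups


def evaluate_devices(voltages, num_threads=2):
    """Evaluate every device, one group of devices per thread."""
    groups = split_into_groups(voltages, num_threads)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() returns the per-group results in the order of the groups.
        results = pool.map(lambda g: [evaluate_diode(v) for v in g], groups)
        return [r for group in results for r in group]


if __name__ == "__main__":
    stamps = evaluate_devices([0.1 * i for i in range(7)], num_threads=2)
    print(len(stamps), "devices evaluated")
```

With only a handful of devices per group, the fixed cost of starting and joining threads dominates, which is the overhead effect described in the paragraph above.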

4.2.2 Parallel Matrix Solver

In our platform, SuperLU [17] is used as the parallel matrix solver. SuperLU is a general purpose library providing direct solution of large, sparse, non-symmetric systems of linear equations on high performance machines. The library routines perform LU decomposition with partial pivoting and triangular system solves through forward and backward substitution. It exploits two sources of parallelism in the sparse LU factorization. The coarse level parallelism comes from the sparsity of the matrix and is exposed by the column elimination tree of the matrix. The second level of parallelism comes from pipelining the computations of dependent columns.

The performance of the matrix solve stops improving after the number of threads reaches a certain value, which depends on the characteristics of the circuit and of the computer node. First, when using more threads in SuperLU, accesses to critical sections via locks increase and degrade the parallel performance. The more processors there are, the larger the communication loss. Second, the solver needs to divide the matrix into several parts and pipeline the operations on every part. Hence, a dense and small matrix generated by device evaluation has more dependences and is hard to divide into several independent parts, making the parallel performance worse. On the contrary, the speedup is large for sparse and large matrices.

The computer node on our platform is a symmetric multi-processor system with 8 dual core processors. The communication between the two cores in one packaged processor chip is twice as fast as the communication between cores in different processor chips. Hence, the performance of the parallel matrix solve degrades

when the number of cores reaches an odd number, since the newly added core needs to transfer data to cores in other chips. We therefore use an even number of threads for parallel device evaluation and matrix solve, which achieves better speedups.
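For readers unfamiliar with the operations named in this section, the following Python sketch shows LU decomposition with partial pivoting followed by forward and backward substitution. It is a dense, sequential toy written for illustration; it is not SuperLU's sparse, supernodal, multithreaded algorithm, and the function name `lu_solve` is hypothetical.

```python
def lu_solve(a, b):
    """Solve a x = b with LU decomposition and partial pivoting.

    `a` is a square matrix given as a list of rows and `b` a list of floats.
    """
    n = len(a)
    lu = [row[:] for row in a]  # L (below diagonal) and U share this storage
    perm = list(range(n))       # row i of lu corresponds to row perm[i] of a

    for col in range(n):
        # Partial pivoting: pick the largest entry in the current column.
        pivot = max(range(col, n), key=lambda r: abs(lu[r][col]))
        if lu[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        lu[col], lu[pivot] = lu[pivot], lu[col]
        perm[col], perm[pivot] = perm[pivot], perm[col]
        for row in range(col + 1, n):
            lu[row][col] /= lu[col][col]  # multiplier stored in L
            for k in range(col + 1, n):
                lu[row][k] -= lu[row][col] * lu[col][k]

    # Forward substitution: L y = P b (L has a unit diagonal).
    y = [0.0] * n
    for i in range(n):
        y[i] = b[perm[i]] - sum(lu[i][j] * y[j] for j in range(i))

    # Backward substitution: U x = y.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(lu[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - tail) / lu[i][i]
    return x


if __name__ == "__main__":
    a = [[0.0, 2.0, 1.0], [1.0, 1.0, 0.0], [2.0, 1.0, 3.0]]
    print(lu_solve(a, [7.0, 3.0, 13.0]))
```

The column-by-column dependence visible in the elimination loop is what a sparse solver tries to break up with the elimination tree and pipelining; for a small dense matrix there is little independent work to hand to additional threads.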


5. RESULT AND ANALYSIS

5.1 Supercomputer

Hydra (see Figure 16) is a 52-node, 832-processor IBM cluster. The 52 nodes are further organized and housed in five physical frames [18]. The cluster uses an IBM high-performance communication switch (HPS) for parallel processing and other communication between the nodes. Each node connects to the HPS network using two adapters. HPS routes a message packet to another node [18].

Figure 16. A snapshot of the supercomputer Hydra.

On Hydra, when running a Pthreads program, the number of threads during execution can be set by the environment variable OMP_SET_NUM_THREADS. An MPI program is executed under the Parallel Operating Environment (POE). When the

program is being executed, the number of tasks can be set by the environment variable PROCS. Typically, tasks are mapped 1-to-1 onto processors. In the batch file, we can specify how tasks are to be assigned. We assign the MPI tasks to 5 nodes with the variable node. Every node can use 4 CPUs and 1.5 GB of memory by setting ConsumableCpus to 4 and ConsumableMemory to 1500mb, where 1500mb is the aggregate amount of memory taken up by the 4 threads.

5.2 Result

5.2.1 MPI vs. Sequential Algorithm

First, we compare the runtime results of the MPI master-slave (MPI-MS) structure with those of the four single sequential algorithms, Newton + BE, SC, Newton + Gear2 and Newton + DASSL, for several circuits in Table 1. The runtime results are in seconds. MPI-MS 1 core means that we use one core for each algorithm in the system. Speedup1 is the speedup of MPI-MS 1 core over Newton + BE, which is the basic SPICE setup. MPI-MS 2 cores means that we assign 2 cores to every algorithm. Speedup2 is its speedup over MPI-MS with 1 core. N/A in the table means the algorithm is not stable or diverges in the simulation.
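The two speedup columns can be written out explicitly. The block below assumes the usual definition of speedup as a ratio of runtimes; the symbols $T_{(\cdot)}$ are introduced here only for illustration and denote the runtime in seconds of the named configuration.

```latex
\[
\text{speedup1} = \frac{T_{\text{Newton+BE}}}{T_{\text{MPI-MS, 1 core}}},
\qquad
\text{speedup2} = \frac{T_{\text{MPI-MS, 1 core}}}{T_{\text{MPI-MS, 2 cores}}}
\]
```

Under this reading, speedup1 measures the gain from inter-algorithm parallelism alone, while speedup2 isolates the additional gain from intra-algorithm threading, with an ideal value of 2.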

Table 1. Comparison between sequential algorithms and MPI methods.

Columns: circuit | size/MB | No. of lin. ele. | No. of FETs | No. of nodes | Newton + BE/s | SC/s | Newton + Gear2/s | Newton + DASSL/s | MPI-MS 1 core/s | speedup1 | MPI-MS 2 cores/s | speedup2

Rows (surviving entries only):
mesh
mesh N/A
mesh18k k 50 10k N/A
mesh28k k 50 15k N/A
inv_chain
inv_chain
grid20k k
grid30k k 0 12k
b_adder
lna_mixer N/A
mixer N/A

For mesh circuits [19], which have many linear elements and few nonlinear transistors, the SC method is the fastest algorithm because it avoids repeatedly evaluating devices and factorizing a large matrix. It converges quickly at every time point. Compared to the SC method, the other algorithms cannot save this large amount of time and need longer to finish the simulation. This is more obvious for larger mesh circuits such as mesh18k and mesh28k, which take the BE + Newton algorithm several hours to complete. The MPI master-slave structure takes advantage of the SC method and reaches a significantly large speedup over Newton + BE.

The inverter-chain circuits have more nonlinear elements. The SC algorithm demands many iterations to converge due to the more complicated circuit operating conditions and its worse convergence rate. In this case, the number of iterations dominates the cost of each time step even though the cost of one iteration is still small. The multi-step integration methods perform better on these circuits, especially when the circuits are small. The MPI master-slave structure, which exploits the diversity of the algorithms and the advantages of different algorithms in different stages, reaches the smallest simulation time.

Mixer circuits are a kind of analog circuit with small size, high accuracy requirements and complex changes in transistor operating conditions. The SC algorithm may not converge over the whole simulation time. The Newton + Gear2 algorithm produces results fast. The MPI master-slave method runs a little faster than Newton + Gear2 thanks to the contributions of the other algorithms.

After applying more threads to a single algorithm in the distributed system, we find that speedup2 almost reaches the optimum for the inverter chain circuits. This may be due to the fact that the inverter chain circuits consist of a large number of transistors, which can be divided equally into two groups and handled efficiently by two threads. In addition, the size of the matrix obtained by device evaluation is suitable for the parallel matrix solver. The speedup for the other circuits is not as good as for the inverter chains. Even worse, the analog circuits show a performance drop when two threads are applied to a single algorithm. Analog circuits are either small or have a small number of nonlinear elements, and incur a large overhead in parallel device evaluation and matrix solve. Creating and terminating threads introduces a relatively larger cost for these small circuits, so the benefits of multiple threads are smaller than the overhead. These results demonstrate the benefits brought by MPI based multi-algorithm circuit simulation and by multiple threads in a single algorithm for certain classes of circuits.

5.2.2 MPI vs. HMAPS

In this section, the results of HMAPS [20] and the MPI based distributed simulation are compared. HMAPS runs on one node with 8 threads and 2 gigabytes of memory, while the MPI methods use two threads for each algorithm on several nodes. The resource allocation and results are in Table 2 and Table 3. The size/MB column shows the memory size of one copy of the circuit data. The columns HMAPS, MPI-MS and MPI-P2P show the runtimes in seconds. The MPI-MS speedup is the MPI master-slave structure's

speedup over HMAPS, while the MPI-P2P speedup is the MPI peer-to-peer structure's speedup over HMAPS.

Table 2. Resource allocation between HMAPS and MPI methods.

Columns: method | Threads/algorithm | Nodes | Threads/node | Memory (GB)

Rows (surviving entries only):
HMAPS
MPI master-slave
MPI Peer-to-Peer

Table 3. Comparison between HMAPS and MPI methods.

Columns: Circuit | size/MB | HMAPS/s | MPI-MS/s | MPI-MS speedup | MPI-P2P/s | MPI-P2P speedup

Rows (surviving entries only):
mesh
mesh
mesh
mesh
inv_chain
inv_chain
grid20k
grid30k
b_adder
lna_mixer
mixer

In HMAPS [20], multiple algorithms are mapped to a single shared-memory system and the algorithms share the computing resources. In the results above, the four algorithms each run with their own copy of the circuit data, for a total of four copies on one computer node. This requires 3 gigabytes of memory for the mesh18 circuit and 2.5 gigabytes for grid30k. On the 2-gigabyte shared-memory system, the memory contention is large and the simulation takes longer to finish. The MPI based distributed system runs the algorithms on separate nodes. The memory used on one node is 800 megabytes for the mesh18 circuit and 600 megabytes for grid30k. Hence, the memory contention is smaller and the speedup can reach as high as [value missing].

The MPI based methods are about 15 percent faster for the mesh4, mesh6, grid20k and inverter chain circuits. These circuits normally need only several hundred megabytes of memory, but the MPI structures have more communication overhead than HMAPS, where threads access shared local memory quickly and the communication between the algorithms can be frequent. In the distributed system, communication speed is limited by the network bandwidth and the size of the messages. The communication cost and delay can be large when simulating large circuits. However, the MPI based platform is capable of incorporating more algorithms to further exploit inter-algorithm parallelism, which is more difficult for HMAPS.
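The memory figures quoted above are consistent with a simple back-of-the-envelope check. The block below assumes that the per-node footprint of the distributed runs is roughly the size of one circuit data copy, and that HMAPS holds four such copies on its single node; the symbols $M$ are introduced here only for illustration.

```latex
\[
M_{\text{HMAPS}}(\text{mesh18}) \approx 4 \times 800\ \text{MB} = 3200\ \text{MB} \approx 3\ \text{GB} > 2\ \text{GB}
\]
\[
M_{\text{HMAPS}}(\text{grid30k}) \approx 4 \times 600\ \text{MB} = 2400\ \text{MB} \approx 2.5\ \text{GB} > 2\ \text{GB}
\]
```

In both cases the four copies exceed the 2 gigabytes available to HMAPS, while a single copy fits within the 1.5 GB assigned to each MPI node.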

5.2.3 Comparison Between MPI Methods

The comparison between the MPI master-slave structure and the peer-to-peer structure is shown in Figure 17. The speedups are those of the two MPI based methods over HMAPS.

Figure 17. Comparison between master-slave and peer-to-peer communication structure.

For small circuits such as mesh4, mesh6, mesh8 and grid20k, each algorithm updates the global synchronizer quickly after getting its own solution in the MPI master-slave structure. The synchronizer also immediately broadcasts the most recent solution to every algorithm. There is little bottleneck because the circuit size is small and the data processing is quick. However, the MPI peer-to-peer structure has a delay in updating all algorithms, because the algorithms receive the latest solution only after the solution completes one loop transfer. This is demonstrated in the figure above,

which shows that the MPI master-slave structure is faster than the MPI peer-to-peer structure. For large circuits such as mesh18 and grid30k, the speedup of the MPI master-slave scheme is much smaller than that of the MPI peer-to-peer scheme. In these circuits, the messages are very large, the synchronizer needs to receive a large amount of data from the algorithm nodes during the simulation, and the data processing time in the synchronizer increases. These factors put a heavy work load on the global synchronizer and cause a bottleneck. Moreover, the algorithms may send stale solutions to the synchronizer because they are not aware of its status. In this case, network bandwidth is occupied and wasted by this useless communication. In the peer-to-peer scheme, the processing and network load is distributed among all the algorithm nodes and the bottleneck effect is alleviated. In addition, the node otherwise occupied by the synchronizer is saved.

5.2.4 Accuracy

We compare the results of BE + Newton and the distributed circuit simulation at two nodes of the mesh4 circuit in Figure 18 and Figure 19. BE + Newton is the basic SPICE setup and is accurate. We compare the voltages from the two methods at the same time points, and the standard deviation is smaller than [value missing] volt. Hence, the simulation results are acceptable.
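A comparison of this kind can be sketched in Python as follows. The thesis does not state exactly how the deviation was computed, so this example assumes it is the population standard deviation of the pointwise voltage difference between the two waveforms sampled at the same time points; the function name `deviation_stats` and the sample values are hypothetical.

```python
import statistics


def deviation_stats(reference, candidate):
    """Return (mean, standard deviation) of candidate - reference.

    Both waveforms must be sampled at the same time points.
    """
    if not reference or len(reference) != len(candidate):
        raise ValueError("waveforms must be non-empty and of equal length")
    diffs = [c - r for r, c in zip(reference, candidate)]
    return statistics.fmean(diffs), statistics.pstdev(diffs)


if __name__ == "__main__":
    # Hypothetical node voltages in volts at four common time points.
    be_newton = [0.00, 0.45, 0.81, 1.00]
    distributed = [0.00, 0.46, 0.80, 1.00]
    mean, std = deviation_stats(be_newton, distributed)
    print(f"mean difference = {mean:.4f} V, standard deviation = {std:.4f} V")
```

If the two simulators do not share time points, the waveforms would first need to be interpolated onto a common grid, which this sketch does not attempt.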
