Asynchronous Power Flow on Graphic Processing Units

Size: px

Start display at page:

Download "Asynchronous Power Flow on Graphic Processing Units"

Adele Andrews
5 years ago
Views:

1 1 Asyncronous Power Flow on Grapic Processing Units Manuel Marin, Student Member, IEEE, David Defour, and Federico Milano, Senior Member, IEEE Abstract Asyncronous iterations can be used to implement fixed-point metods suc as Jacobi and Gauss-Seidel on parallel computers wit ig syncronization costs. However, tey are rarely considered in practice due to te low convergence rate. Tis paper describes an implementation on GPUs of a novel Power Flow analysis model using asyncronous iterations. We present our model for te solution of te Power Flow analysis problem, prove its convergence and evaluate its performance for a GPU execution. Index Terms Asyncronous iterations, fixed-point metod, power flow analysis, grapic processing units. I. INTRODUCTION Power Flow (PF) analysis refers to te steady-state analysis of AC power networks, widely used in several applications for power system operation and planning [1]. Traditional PF analysis tools rely on matrix factorization using igly optimized serial direct solvers [2], [3], [4]. More recently, a parallel-friendly PF analysis tool as also been introduced, based on an analogical representation of te system [5]. Te real power system is mapped onto a CMOS integrated circuit, wic allows one to pysically simulate te system at a muc iger frequency. Te efficiency of tis approac is linked wit te intrinsic parallelism of te pysical system, were every element of te network is acting simultaneously. Te main drawback is te lack of flexibility since for every system tat one wants to simulate, a new analog model as to be built. In tis paper we present a software implementation of te same idea presented in [5], so tat any arbitrary system can be analyzed witout extra effort. Tis allows us to propose an intrinsically parallel PF analysis model. Our approac is based on fixed-point iterations performed individually for every element in te system. Results of succesive updates are excanged until an equilibrium point is reaced, as in te concept of team algoritms [6], [7]. Te approac is well-suited for parallel implementation using asyncronous iterations, wic ave also been applied to te network flow problem [8], [9]. Te main reason wy asyncronous iterations are not usually preferred over syncronous ones is tat (i) tey require stronger assumptions in order to converge, and (ii) convergence rate is smaller. However, massively parallel computing environments can benefit from te asyncronous approac M. Marin is wit Université de Liège, Institut Montefiore, Liège, Belgium ( mmarin@ulg.ac.be). D. Defour is wit Université de Perpignan Via Domitia, DALI and Université Montpellier 2, LIRMM, France ( david.defour@univ-perp.fr). F. Milano is wit te Scool of Electrical, Electronic and Communications Engineering of te University College Dublin, Dublin, Ireland ( federico.milano@ucd.ie). wenever te cost of syncronization becomes too ig. For instance, in te Grapic Processing Unit (GPU) explicit syncronization between treads from different Streaming Multiprocessors is only acievable troug te ost processor, wic drastically reduces performance. In order to target GPU arcitectures, we study convergence properties of our model and provide a proof of convergence. Te remainder of te paper is organized as follows. Section II introduces fixed-point iterations and presents several alternatives for parallel implementation, including syncronous, asyncronous and partially asyncronous frameworks. In Section III, te proposed intrinsically parallel metod for PF analysis is presented and its convergence properties are studied. Section IV describes te implementation and evaluation of te proposed metod on a set of bencmarks as well as comparisons wit existing approaces. Finally, Section V draws conclusions and outlines future work. II. FIXED-POINT ITERATIONS ON A PARALLEL COMPUTER Let us first recall te concept of fixed-point iterations defined as follows. Definition 1 (Fixed-point iteration). Let f: R n R n and x R n. Ten, x is a fixed-point of f( ) if and only if x = f(x). Furtermore, let x (0) R n. Ten, te iteration given by: x (k+1) = f(x (k) ), k = 0,1,..., (1) is called a fixed-point iteration on f( ). Examples of fixed-point iterations are te Jacobi and Gauss- Seidel metods [1]. Fixed-point iterations can be implemented in parallel by splitting te set of input element among treads. Eac tread updates its respective elements and communicates its results to oters wit or witout syncronization. A. Syncronous iterations Wit syncronous iterations, all te elements are updated by block before te process moves as a wole into te next step. Te above requires to place a syncronization point at te end of eac iteration. As a consequence, treads tat finis teir updates earlier are forced to wait for te oters to catc up. Te reason for treads aving different processing times are various. Some of tem may be intrinsic to te algoritm, e.g., irregular memory access patterns, load imbalance. Some oters may be related to te arcitecture, e.g., differences in processor frequency, communication network constraints [10].

2 2 B. Asyncronous iterations An alternative to te syncronous approac is te so-called asyncronous iteration, were individual treads do not wait for oters and simply go aead wit teir work. Tis means tat at every asyncronous iteration only a subset of te elements are updated. No syncronization point is explicitly added to te program. As a consequence, some of te updates may be computed wit data from several iterations back. An asyncronous iteration is defined in terms of an update function and a delay function. Te update function, notedu( ), receives te iteration counter k N, and returns a subset of {1,...,n} indicating te list of treads tat update teir elements at iteration k. Te delay function, noted d( ), receives te indices i,j {1,...,n} and te iteration counter k N, and returns te delay of tread j wit respect to tread i at iteration k. Tis delay represents te number of iterations performed since te value of x j was last made visible to i. Definition 2 (Asyncronous iteration). Let x R n and f: R n R n. For k N,i,j {1,...,n}, let U(k) {1,...,n} and d(i,j,k) N 0, suc tat x (k+1) i = d(i,j,k) 0, i,j,k, (2a) lim d(i,j,k) <, k i,j, (2b) {k: i U(k)} =, i. (2c) In addition, let x (0) R n. Ten, te iteration defined by: { f i (x (k d(i,1,k)) x (k) i 1,...,x (k d(i,n,k)) n ) if i U(k), if i U(k), (3) is called an asyncronous iteration on f( ), wit update function U( ) and delay function d( ). Assumptions in (2a), (2b) and (2c) ensure tat te process is well defined. Specifically, te assumption in (2a) states tat only values computed in previous iterations are used in any update. Te one in (2b) states tat newer values of te elements are eventually used. Finally, te assumption in (2c) states tat no element ceases to be updated during te course of te iteration. In practical terms, if te updates are made visible as soon as tey are computed, ten te tird assumption implies te second. Tis is te case, for example, of an application tat uses GPU global memory to store te data. Note tat a syncronous iteration is a special case of asyncronous iteration wit U(k) = {1,...,n} and d(i,j,k) = 0, for all i, j, k. Due to te peculiarities introduced above, asyncronous iterations converge differently from syncronous ones. Te following result considers te case of a linear application. Teorem 1 (Necessary and sufficient condition for convergence of linear asyncronous iterations [11]). Let f: R n R n, given by, f(x) = Lx+b, (4) were L R n n, and b R n. Let L denote te matrix of absolute values of te entries of L, and ρ( L ) its spectral radius. If ρ( L ) < 1, (5) ten any asyncronous iteration on f( ) converges to x, te unique fixed-point of f( ), regardless of te selection of U( ), d( ) and x (0). Furtermore, if ρ( L ) 1, ten tere exists U( ), d( ) and x (0) suc tat te associated asyncronous iteration does not converge to x. Proof. See [11]. C. Partially asyncronous iterations In some circumstances, convergence of asyncronous iterations can be facilitated by making treads syncronize every once in a wile. Definition 3 (Partially asyncronous iteration). Consider Definition 2 of an asyncronous iteration. Replace assumptions in (2b) and (2c) by te following: d N: d(i,j,k) d, i,j,k, (6a) s N: i s s=1 U(k +s), i,k. (6b) d(i,i,k) = 0, i,k, (6c) Ten, te iteration given by equation (3) is now termed a partially asyncronous iteration. Te assumption in (6a) establises tat not only newer values of te vector elements are eventually used, but eac of tese values is used before d iterations ave passed from teir calculation. Te assumption in (6b), in turn, states tat eac element is updated at least once in every s consecutive iterations. Finally, te assumption in (6c) establises tat every element is updated using its own last calculated value. In practical terms, te first two assumptions can be met by performing a syncronization step every l iterations, were l is te minimum of d and s. Te tird assumption can be met by aving eac element assigned to only one tread. As mentioned above, partially asyncronous iterations converge more easily tan totally asyncronous ones. Te following result establises a convergence criteria in te linear case, wit softer assumptions tan te ones in Teorem 1. Teorem 2 (Sufficient condition for convergence of linear partially asyncronous iterations [12]). Let f: R n R n, given by, f(x) = Lx+b, (7) were L R n n, and b R n. Let g: R n R n, given by: were α (0,1). If L = (l ij ) is irreducible and g(x) = (1 α)x+αf(x), (8) n l ij 1, i, (9) j=1 ten any partially asyncronous iteration on g( ) converges to x, fixed-point of f( ). Proof. See [12].

3 3 In [13], te autors presented a special case in wic convergence occurs for α = 1, at a linear (geometric) rate. Tey also determined a lower bound for te average convergence rate (per iteration) in te long-term, τ, as follows: wit τ (1 σ r ) 1/r, (10) r = 1+ d+(n 1)( s+ d), (11a) ( ) + x i σ = min l ij i,j x, (11b) j were min + ( ) refers to te minimum of te positive elements. Tis means tat te error is scaled at most by a factor of (1 σ r ) 1/r in any average iteration. As σ is always lower tan 1, te convergence rate approaces 1 for increasing values of n, d and s. Tat is to say, te convergence becomes sublinear. For illustration purposes, consider tat d = s = 0 (syncronous iteration). In tis case, r = 1 and τ = 1 σ. In oter words, te convergence rate is unaffected by te problem size n and only depends on σ. Now consider tat d = s = 1. Tis corresponds to a partially asyncronous iteration wit a syncronization step every oter iteration. In tis case, r = 2n, wic means tat convergence slows down as te problem grows. Figure 1 sows te lower bound on te convergence rate as a function of te problem size n. Te different curves represent different values of σ. Observe tat, unless σ is very ig, convergence becomes sub-linear very quickly. Rate of convergence, τ Problem size, n σ = 0.1 σ = 0.5 σ = 0.9 Fig. 1. Bound on te convergence rate as a function of te problem size, for different values of te parameter σ. Tis migt be a strong reason to prefer te syncronous approac over te partially asyncronous. However, te partially asyncronous approac can be a valid alternative in some cases, e.g., if te costs of syncronization are relatively ig, or wen te number of iterations required to converge is relatively low. III. PROPOSED METHOD FOR AC CIRCUIT ANALYSIS Tis section presents ow to perform PF analysis based on partially asyncronous iterations. Te metod is designed to be implemented on Single Instruction Multiple Tread (SIMT) arcitectures suc as GPUs, by exploiting regularity. Te metod is presented troug a top-down approac. Te network model is described first, specifying ow te different parts communicate and interact. Next, te models of te different elements tat compose te power system are specified in teir atomic beaviour. Finally, te convergence properties of te proposed metod are studied. A. Network model Te proposed metodology distinguises two fundamental units in te power system: components and nodes. Te components are te electrical devices, and te nodes are te points of interconnection. For sake of simplicity and also to facilitate illustration of te metod, te following tree assumptions are made: 1) Only dipole components, i.e., components connected to two nodes, are present. 2) Tere is no more tan one component connected between any two nodes. 3) Every node is connected to at least two components. Te circuit is well defined if tere are components connecting all te nodes in a closed pat. Node in te circuit is noted n. Te component connected between n and n m is noted c m. Te set of all nodes in te circuit is noted N, and te set of components, C. In addition, a set of influencers, I N, is defined for eac non-ground node n N. Tis set contains all te nodes separated from n by exactly one component. Tese are also known as te fanin nodes [14]. For illustration purposes, consider te circuit in Fig. 2. In tis case, N = {n 0,n 1,n 2,n 3 } and C = {c 01,c 12,c 02,c 23,c 03 }. Te circuit is well defined, since c 01, c 12, c 23 and c 03 connect all te nodes in a closed pat. In addition, I 1 = {n 0,n 2 }, I 2 = {n 0,n 1,n 3 }, and I 3 = {n 0,n 2 }. n 1 n 2 n 3 c 01 c 12 Fig. 2. Example 4-node circuit. n 0 c 02 In te proposed metodology, Ú denotes te voltage at node n, and Ùm denotes te voltage at node n according to component c m. Te two are different during te course of te iteration, and become equal at convergence. In addition, m denotes te power injected to n by component c m, and denotes te power mismatc at n. c 23 c 03 At every iteration k, a component c m obtains Ú (k) from node n, and returns Ù (k) m and (k) m and to it. A

4 4 similar excange occurs wit node n m at te oter end of te component. Next, at te same iteration k, node n receives Ù (k) j and j from eac component c j connected to it, and returns Ú (k+1) and (k+1) to all of tem. Ten, te process moves into a next iteration. Figure 3 illustrates te above for component c 12 and node n 2 in te example 4-node circuit. n 0 n 1 n 3 Ú1, 1 Ù12 Ù21 n 1 c 12 n c 02 c 12 c 23 (a) (b) Ù20 20 Ù21 21 Ù23 23 Ú2, 2 Fig. 3. Network model: (a) Component; and (b) Node. B. Component and node models n 2 Ú2 2 1) Component model: For te purposes of te proposed metod, only tree types of component are required: te generator, te branc and te sunt. Te use of a universal model improves instruction regularity of te GPU implementation, as discussed in Section IV. Te impedance of component c m is noted z m. a) Generator: It is connected between te ground and one node, say n. Te generator is tere to ensure tat te reactive power mismatc at te node is equal to zero. In tis sense, te model assumes tat generators are able to provide as muc reactive power as needed in order to maintain te voltage. Te slack generator also does te same for real power. Accordingly, te iteration is te following: for nonslack generators, And for te slack generator, Im( ) = 0. (12) = 0. (13) b) Branc: It is connected between two non-ground nodes, say n and n m. It consists of a constant impedance, z m. Te branc iteration is defined as follows: Ù (k) m = Ú(k) +z m m = Ú(k) ( Ú (k) ( Ú (k) m Ú (k) z m ), (14a) ). (14b) c) Sunt: It is connected between te ground and one node, say n. It represents an impedance to ground, noted z 0 (te subscript 0 indicates te ground). Te sunt iteration is similar to te branc one. Te main difference is tat Úm in equation (14b) is replaced by zero (te ground voltage). Te iteration is te following: Ù (k) 0 = Ú(k) +z 0 (k) 2 ( ), (15a) Ú (k) 0 = Ú (z m ), (15b) were indicates te absolute value. Tese equations are obtained by applying te Om Law on component c m, and te Kircoff Current Law on node n. Note tat if te power mismatc at n is equal to zero (i.e., = 0), ten Ù (k) m = Ú(k). Tat is to say, te voltage according to component c m is te same as te one according to node n itself. Tis corresponds to a condition of convergence at component level. 2) Node model: Te node is simply modelled as a point were different components excange data and consolidate teir results. Te node iteration is defined as follows: Ú (k+1) (k+1) = m I w m Ù (k) m, = m I m, (16a) (16b) were w m is defined as te weigt of component c m witin node n. Tese weigts satisfy m I w m = 1. (17) In oter words, te voltage at eac node is updated wit te weigted sum of te voltages computed by eac component connected to it. Note tat if convergence is attained at component level (i.e., Ù (k) m = Ú(k), m I ), ten Ú (k+1) = Ú (k). Tis appens wen all te components agree on te value of te node voltage, and corresponds to a condition of convergence at node level. A second node iteration is added to allow PQ loads to become constant impedances, if te voltage at te node is outside specific limits. Tis is a tecnique used in traditional PF analysis [1]. In te proposed metod, te power injected to n by loads is noted L,. If te voltage is outside te limits, tis variable is scaled by a factor depending on te current voltage and te violated limit. Ten, te power mismatc at te node is adjusted to reflect te cange in te load. In conclusion, te iteration is composed of two parts. Te first updates te power injected by loads, as follows: ( ) 2 v (k) v min L, if v (k) < v min, (k+1) ( ) 2 L, = v (k) v max L, if v (k) > v max (18), oterwise, L,

5 5 were v min and v max are te voltage limits at n. Ten, te second part updates te power mismatc, as follows: (k+1) = (k) L, + (k+1) L,. (19) Te use of te above iteration reduces instruction regularity, as it needs to be performed only by nodes. However, it may be required for convergence of te metod and involves a limited number of operations wic can be acceptable in a SIMT execution context. C. Convergence study Tis section investigates te convergence properties of te proposed iterative metod for two different syncronization scemes. Let n 0 be te ground-node, and N te set of all nodes in te circuit minus te ground. Tat is to say, N = N \{n 0 }. Also, let Ú C n 1 be te vector of voltages at all non-ground nodes. Te iteration function, : C n 1 C n 1, is given by: (Ú) = Ú + m I j I w m z m Új Ú z j, N. (20) Tis expression is obtained by combining equations (14a), (14b), (16a), (16b) and (17). Equation (20) is linear, tus, it can be written as follows: (Ú) = LÚ, (21) were L is given by: l = 1 w mz m, z j m I j I w mz m z j if j I, l j = m I 0 oterwise. (22) However, one can notice tat a totally asyncronous iteration on f( ), as given by (21), is not guaranteed to converge. Tis is linked wit te fact tat te sum of te elements in any row of L, were L is given by (22), is equal to 1, ten 1 is an eigenvalue of L and ρ(l) 1. Since for every matrix A, ρ( A ) ρ(a), ten ρ( L ) 1. Tis implies tat to ensure convergence of te proposed iteration, furter assumptions are needed. Te following teorem establises a sufficient condition for convergence based on partially asyncronous iterations. Teorem 3. Let ( ) defined by (Ú) = (1 α)ú+α (Ú), were 0 < α < 1 and ( ) given by (20). Let te weigts, w m,m I, satisfy te following: 1 0,. (23) z j m I j I w mz m Ten, any partially asyncronous iteration over ( ) converges to Ú, fixed-point of ( ). Proof. See [15]. IV. CASE STUDY Te proposed metod for PF analysis as been tested on te Nvidia Kepler GPU arcitecture [16]. It as been evaluated on a series of bencmarks, wit te aim of exploring convergence, performance and scalability issues. A. Implementation Regularity is one of te most powerful leverages for GPU performance [17]. Algoritm 1 presents a regular implementation of te proposed metod. In order to ensure asyncronous evaluation of eac component, we set te number of blocks to be smaller tat te number of blocks tat can be andled in parallel by te ardware. Terefore all treads are continuously running until convergence of te system. Treads are andling te set of components in a round-robin manner, wic leads to a similar amount of work per tread as components are always connected to eiter one or two nodes. If treads were assigned to nodes instead, ten some treads would perform muc more work tan oters, as nodes can be connected to any number of components. Instructions 5 and 6 calculate te voltage at its component s terminals and updates te appropriate coordinates on vector Ú. Instructions 7 and 8 calculate te power injections at eac terminal and updates te appropriate coordinates on vector. Te process is repeated after syncronizing all treads, until convergence is reaced. Algoritm 1 Intrinsically parallel algoritm for PF analysis. Input: Vector of initial power mismatces, ; component model parameters; tolerance for te error, ǫ, maximum number of asyncronous iterations, K. Output: Vector of node voltages, Ú. 1: Ú = 1 0 # Initial guess 2: for all,m: c m C in parallel do 3: k = 0 4: repeat 5: calculate Ùm and Ùm 6: update Ú and Úm 7: calculate m and m 8: update and m 9: k k +1 10: if k > K ten 11: syncronize treads 12: k = 0 13: until, m < ǫ As discussed in [18], coalesced global memory accesses can drastically improve GPU performance by reducing te number of memory transactions. In te proposed implementation, te above is acieved by using a structure of arrays approac. In addition, performance can benefit of sared memory for recurrent memory operations. However, sared memory is only accessible to treads witin one tread-block, wereas global memory is accessible to all treads. Accordingly, using sared memory in te proposed metod increases te delay between treads, wic affects convergence as discussed in Section III-C. Te point were sared memory usage brings

6 6 performance improvement is igly dependent of te problem parameters and setting tis point could benefit from previous researc conducted on flexible communication [19]. B. Evaluation Te implementation is evaluated using randomly generated bencmarks from 1,024 to 8,192 nodes, wic correspond to real size networks. Te bencmarks are obtained using te random radial network generator provided by Dome [3]. We used a Xeon E5645 CPU, wit 12 cores at 2.4Gz (1 used), a Tesla K40c, wit cores at 0.7 Gz and a GeForce GTX 680 wit cores at 1.06Gz. All source code is compiled using G and CUDA ) Convergence: As discussed in Section II-C, partially asyncronous fixed-point iterations typically exibit a sublinear rate of convergence. Tis means tat reducing te convergence tolerance below a small tresold may require a large number of iterations. Figure 4 illustrates te relation between te convergence error and te number of iterations for te proposed metod using te K40c GPU. It sows te mismatc (on a logaritmic scale) as a function of te maximum number of iterations performed by any component. Te different curves represent different problem sizes. Note tat te beaviour is very similar on all 6 bencmarks, suggesting tat convergence is independent from te problem size. However, te convergence rate for tis experiment depends on te targeted accuracy. For example, reducing te error from 10 3 to 10 4 requires about 555 iterations in average. Tis corresponds to a convergence rate of Te convergence rate as an asymptotic limit of 1 wen te targeted accuracy is increasing wic is coerent wit te teoretical study of Section II-C, Te farter te metod is from te exact solution, te faster it approaces tat solution. Tis is te opposite of te Newton-Rapson (NR) metod were te convergence rate is improving at eac iteration. From tis point of view, it may be interesting to combine bot approaces into a ybrid solution tat switces metods depending on te error. For example, te proposed metod can be used to obtain a first solution fast, and te NR metod can be used to improve tat solution afterwards. 2) Performance: We compared te performance of te proposed approac wit serial NR provided by Dome [3]. Te Dome s implementation uses KLU, a igly optimized library for sparse matrix factorization, particularly suited to deal wit matrices from circuit analysis [20]. Te proposed metod is tested on te Tesla K40c GPU, and te NR on te Xeon E5645 CPU. Figure 5 sows te execution time of bot metods (on a logaritmic scale) as a function of te problem size, n, for a targeted error of Te NR execution time grows exponentially wit te problem size, wic is consistent wit precedent results [1]. In te proposed metod, in turn, te execution time seems less affected by te size, and exibits a more flat profile. At about 3,000 nodes, tere is a turning point in wic te proposed metod becomes faster tan NR. Tis turning point is dependent of many parameters (arcitecture, Numerical error e-05 1e Number of iterations n = 3072 n = 4096 n = 5120 n = 6144 n = 7168 n = 8192 Fig. 4. Numerical error as a function of te number of iterations on te K40c GPU. Simulation time (s) Fig ) Problem size (n) KLU N-R Prop. metod Related performance of N-R and te proposed metod (accuracy: implementation, topology,...) and it is expected to evolve favorably for te proposed metod as te number of embeded cores per cip is increasing. 3) Scalability: As mentioned in Section I, performance of asyncronous metods sould increase wit te number of computing cores. Te above is confirmed by Fig. 6, wic sows te execution time of te proposed asyncronous metod on two Nvidia Kepler GPU arcitectures as a function of te problem size. Te Tesla K40c GPU, embedding 2,880 computing cores, allows for a performance gain of about 1.3 over te GTX 680, embedding 1,536 cores. Tis means tat te proposed metod almost acieve strong scalability. Fig. 7 represents te number of iterations needed to converge as a function of te number of blocks launced. We compared te syncronous version (tat syncronizes treads after eac iteration), wit te asyncronous one (tat does not perform any syncronization at all). We observe tat only te latter benefits from increasing number of tread-blocks. Moreover, we see a turning point in wic performance does not improve and rater decays as te number of blocks keeps growing.

7 7 Simulation time (s) Problem size (n) GTX 680 Tesla K40c Fig. 6. Related performance of te proposed metod on two GPU arcitectures. Number of iterations Number of blocks Sync. Async. Fig. 7. Related performance of syncronous and asyncronous metod. V. CONCLUSIONS Tis paper presents a partially asyncronous fixed-point metod for PF analysis. Te metod is designed to be implemented on SIMT arcitectures suc as te GPU. A teoretical result is provided wic sows tat te metod is guaranted to converge, altoug convergence is generally sub-linear. Simulation results sow tat: (i) te convergence rate of te proposed metod tends to be better at an early stage, wic makes it a good complement of te traditional NR metod; (ii) te proposed metod becomes faster tan te traditional NR approac for problems of considerable size; (iii) te proposed metod scales well wit te number of computing cores in te GPU arcitecture. Future works will consider te improvement of te proposed CUDA implementation. In particular, te use of sared memory will be investigated. Tis addressing convergence of te metod in te case were an increasing number of iterations are performed asyncronously, i.e., witout communicating te results to all treads. [2] P. Simulator, Version 10.0 scopf, PVQV, PowerWorld Corporation, Campaign, IL, vol , [3] F. Milano, A pyton-based software tool for power system analysis, in IEEE Power and Energy Society General Meeting, 2013, pp [4] M. Simulink and M. Natick, Te matworks, [5] I. Nagel, Analog microelectronic emulation for dynamic power system computation, P.D. dissertation, École Polytecnique Fédérale de Lausanne, [6] S. Talukdar, S. Pyo, and R. Merotra, Designing algoritms and assignments for distributed processing. Electric Power Researc Institute, [7] J. Tsitsiklis, Problems in decentralized decision making and computation. DTIC Document, Tec. Rep., [8] D. Bertsekas and D. El Baz, Distributed asyncronous relaxation metods for convex network flow problems, SIAM Journal on Control and Optimization, vol. 25, no. 1, pp , [9] D. El Baz, P. Spiteri, J.-C. Miellou, and D. Gazen, Asyncronous iterative algoritms wit flexible communication for nonlinear network flow problems, Journal of Parallel and Distributed Computing, vol. 38, pp. 1 15, [10] A. Frommer and D. Szyld, On asyncronous iterations, Journal of Computational and Applied Matematics, vol. 123, no. 1 2, pp , 2000, numerical Analysis Vol. III: Linear Algebra. [Online]. Available: ttp:// [11] D. Cazan and W. Miranker, Caotic relaxation, Linear algebra and its applications, vol. 2, no. 2, pp , [12] P. Tseng, D. Bertsekas, and J. Tsitsiklis, Partially asyncronous, parallel algoritms for network flow and oter problems, SIAM Journal on Control and Optimization, vol. 28, no. 3, pp , [13] B. Lubacevsky and D. Mitra, A caotic asyncronous algoritm for computing te fixed-point of a nonnegative matrix of unit spectral radius, Journal of te ACM, vol. 33, no. 1, pp , [14] A. Newton and A. Sangiovanni-Vincentelli, Relaxation-based electrical simulation, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 3, no. 4, pp , [15] M. Marin, Gpu-enanced power flow analysis, P.D. dissertation, Informatique Perpignan and University college Dublin, 2015, tèse de doctorat dirigée par Defour, David et Milano, Federico. [Online]. Available: ttp:// [16] Nvidia s next generation CUDA compute arcitecture: Kepler GK110, [17] S. Collange, Design callenges of GPGPU arcitectures: specialized aritmetic units and exploitation of regularity, P.D. dissertation, Univ. de Perpignan, Nov [Online]. Available: ttps://tel.arcivesouvertes.fr/tel [18] C. Cuda, Programming guide, Nvidia Corporation, July, [19] J. Miellou, D. El Baz, and P. Spiteri, A new class of asyncronous iterative algoritms wit order intervals, Matematics of Computation of te American Matematical Society, vol. 67, no. 221, pp , [20] T. Davis and E. Palamadai Natarajan, Algoritm 907: KLU, a direct sparse solver for circuit simulation problems, ACM Transactions on Matematical Software, vol. 37, no. 3, pp. 36:1 36:17, Sep [Online]. Available: ttp://doi.acm.org/ / REFERENCES [1] F. Milano, Power system modelling and scripting. London, UK: Springer, 2010.

Asynchronous Power Flow on Graphic Processing Units

Asynchronous Power Flow on Graphic Processing Units Asyncronous Power Flow on Grapic Processing Units Manuel Marin, David Defour, Federico Milano To cite tis version: Manuel Marin, David Defour, Federico Milano. Asyncronous Power Flow on Grapic Processing