Parallelizing Hines Matrix Solver in Neuron Simulations on GPU


by Dharma Teja, Kishore Kothapalli
in 24th IEEE International Conference on High Performance Computing, Data, and Analytics
Report No: IIIT/TR/2017/-1
Centre for Security, Theory and Algorithms, International Institute of Information Technology, Hyderabad, INDIA
December 2017

Dharma Teja Vooturi, Kishore Kothapalli
International Institute of Information Technology - Hyderabad, Hyderabad, India
dharmateja.vooturi@research.iiit.ac.in, kkishore@iiit.ac.in

Upinder S Bhalla
National Center for Biological Sciences, Tata Institute of Fundamental Research, Bangalore, India
bhalla@ncbs.res.in

Abstract

Hines matrices arise in the simulations of mathematical models describing initiation and propagation of action potentials in a neuron. In this work, we exploit the structural properties of Hines matrices and design a scalable, linear work, recursive parallel algorithm for solving a system of linear equations where the underlying matrix is a Hines matrix, using the Exact Domain Decomposition Method (EDD). We give a general form for representing a Hines matrix and use the general form to prove that the intermediate matrix obtained via the EDD has the same structural properties as that of a Hines matrix. Using the above observation, we propose a novel decomposition strategy called fine decomposition which is suitable for a GPU architecture. Our algorithmic approach R-FINE-TPT based on fine decomposition outperforms the previously known approach in all the cases and gives a speedup of 2.5x on average for a variety of input neuron morphologies. We further perform experiments to understand the behaviour of the R-FINE-TPT approach and show its robustness. We also employ a machine learning technique called linear regression to effectively guide recursion in our algorithm.

I. INTRODUCTION

Sparse matrices and computations on sparse matrices arise in many areas of science and engineering such as computational fluid dynamics, computational neuroscience and molecular dynamics []. Prominent among the computations on sparse matrices are matrix-vector multiplication, matrix-matrix multiplication, and solving a system of linear equations where the underlying matrix of coefficients is sparse. The importance of these computations can be gauged by the fact that they are included as dwarfs in the Berkeley report [].
It is therefore not surprising that most modern libraries in the parallel setting include optimized routines for the above computations on sparse matrices [3], [4]. Several researchers have focused on improving the performance of sparse matrix computations on a variety of modern many- and multi-core architectures. Prominent examples include [5], [6] and [7]. Wangdong et al. [5] and Matam et al. [6] provide efficient algorithms for sparse matrix-vector multiplication and sparse matrix-matrix multiplication respectively on hybrid (CPU+GPU) architectures. Agullo et al. [7] optimize direct matrix solvers for Intel KNL architectures.

In recent years, one approach that is being used to improve the efficiency of sparse matrix computations on modern parallel architectures is to understand the structure of sparsity of the matrix and its implications for parallel algorithm design and implementation. Examples of such instances are seen in the work of Ramamoorthy et al. [8] for multiplying two scale-free sparse matrices, Vooturi et al. [9] for multiplying two quasi-band sparse matrices, Buluc et al. [0] for multiplying two hyper-sparse matrices, and Wangdong et al. [] for multiplying a quasi-band sparse matrix with a dense vector.

In this paper, we investigate GPU algorithms for solving a system of linear equations where the underlying matrix is a sparse matrix. In particular, the sparse matrix we study is called a Hines matrix and has the following sparsity structure. A Hines matrix is a symmetric matrix where every row of the matrix has only one nonzero element with a column index bigger than the row index. Solving a system of linear equations with the underlying matrix being a Hines matrix is denoted as HinesSolver in the rest of the paper.

The rest of the paper is organized as follows. In Section I-A, we give the motivation for the problem, in Section I-B we discuss related work, and in Section I-C we list our key contributions. In Section II, we describe how a Hines matrix is generated from the mathematical model, its structural properties and a general form.
In Section III, we describe a linear O(N) algorithm for HinesSolver using an Exact Domain Decomposition method (EDD) and also discuss how to tailor our algorithm to a GPU. Results and experiments are discussed in Section IV. Finally, we conclude and outline future work in Section V.

A. Motivation

Neuron simulations proceed in time-steps and a HinesSolver is invoked in each time-step. Generally, researchers have to run these simulations for many time-steps to understand a particular behaviour. For example, a single neuron simulation running for a neuron time of one minute with a millisecond time-step involves 60,000 HinesSolver instances. In a given time-step, the computation apart from a HinesSolver is reasonably parallel and is suitable for a GPU. By having HinesSolver on a GPU, simulations can be made faster by making use of GPU hardware and avoiding costly memory transfers from GPU to CPU at each time-step.

In case of network simulations, which involve the study of the behaviour of interconnected neurons, multiple Hines matrix systems have to be solved at each time-step. It is possible to map the computation of solving multiple Hines matrix systems into a single big Hines matrix system, which when solved gives solutions to the individual systems. Further, it is not uncommon for researchers to run experiments which involve changing only a single parameter while keeping other parameters fixed. These types of experiments also result in solving multiple Hines matrices in each time-step. We discuss one such case in Section IV-D4, where for a set of experiments the matrix remains the same, but the right hand side vectors vary in a given time-step. Hence, parallelizing HinesSolver on a GPU can speed up neuron simulations and also enable researchers to perform rapid experimentation.

B. Related Work

HinesSolver is studied in the sequential setting by Hines []. Hines [] proposed a modified Gauss elimination algorithm whose runtime is O(N), where N is the number of rows in the Hines matrix. Hines also proposed parallel Gauss elimination algorithms for HinesSolver in [5] and [6]. To understand these algorithms, it is helpful to visualize a Hines matrix as a rooted tree with self-loops. These algorithms are based on the fact that suitably selected subtrees can be processed in parallel. However, this approach suffers from the following drawbacks. Firstly, there is a constraint on subtree division which limits the amount of parallelism available. Secondly, they are designed for optimizing network simulations on multi-core architectures, where multiple Hines matrices of different sizes have to be solved and the ability to break a tree allows for efficient load balancing.

In HinesSolver, triangularizing the segments at the same level of a tree can be done in parallel. This idea was exploited by Roy et al. in [7], which is also based on Gauss elimination. One of the drawbacks of this approach is that the parallel time of the algorithm is bounded by the depth of the tree. Another drawback is that the amount of parallelism and computation at each level is dependent on the input and hence can introduce significant load imbalance.
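As a point of reference for the approaches above, the sequential O(N) elimination on a Hines matrix can be sketched in a few lines of Python. This is a minimal illustration and not the code of any cited work: the function name `hines_solve`, the 0-based indexing, and the array layout (`parent`, `diag`, `off`) are assumptions made for the sketch. It relies only on the structural property stated earlier: each row has a single nonzero to the right of the diagonal, located in the column of the parent.

```python
def hines_solve(parent, diag, off, rhs):
    """Solve A x = rhs for a Hines matrix in O(N) (illustrative sketch).

    Assumed layout (hypothetical, 0-based):
      parent[i] > i is the column of the only nonzero right of A[i][i],
      parent[N-1] == -1 marks the root,
      diag[i] == A[i][i] and off[i] == A[i][parent[i]] == A[parent[i]][i].
    """
    n = len(diag)
    d, b = list(diag), list(rhs)  # work on copies, inputs stay untouched
    # Forward sweep: use row i to eliminate A[parent[i]][i].
    # Children are numbered before parents, so no fill-in is created.
    for i in range(n - 1):
        p = parent[i]
        f = off[i] / d[i]
        d[p] -= f * off[i]
        b[p] -= f * b[i]
    # Back substitution from the root towards the leaves.
    x = [0.0] * n
    x[n - 1] = b[n - 1] / d[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = (b[i] - off[i] * x[parent[i]]) / d[i]
    return x


# Small tree: nodes 0 and 1 hang off node 2; nodes 2 and 3 hang off root 4.
x = hines_solve([2, 2, 4, 4, -1],
                [4.0, 4.0, 5.0, 4.0, 6.0],
                [-1.0, -1.0, -1.0, -1.0, 0.0],
                [1.0, 2.0, 3.0, 4.0, 5.0])
```

Both sweeps touch each row once, which is where the O(N) bound comes from; the forward sweep is also inherently ordered from leaves to root, which is the dependency the parallel approaches try to break.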
Some of the above problems are solved by Mascagni [3] and Larriba-Pey [4], who introduced the Exact Domain Decomposition (EDD) technique to solve matrices corresponding to undirected graphs. EDD involves solving a domain matrix of size equal to the number of nodes with degree greater than two. In case of matrices corresponding to undirected graphs, the domain matrix does not exhibit any special properties and it was suggested to solve it using any direct solver, such as Gauss elimination.

C. Contributions

In this work, we first prove that the domain matrix obtained via the exact domain decomposition method on a Hines matrix has the same structural properties as that of a Hines matrix. This result has three immediate benefits as listed below.
1) It allows us to apply the exact domain decomposition technique recursively.
2) As the recursion bottoms out, the small size of the resulting domain matrix allows us to invoke a sequential HinesSolver [] in much less time.
3) It allows us to introduce a decomposition strategy called fine decomposition which can be efficiently mapped onto a GPU.

Using the above observations, we design an efficient parallel algorithm and its GPU implementation to solve a system of linear equations where the underlying matrix is a Hines matrix. Our experimental results on an Nvidia Tesla K40c GPU over a variety of inputs indicate that our algorithmic approach R-FINE-TPT based on fine decomposition is 2.5x faster than the previously known approach. We also conduct experiments to study the effect of parameters such as the amount of fineness in R-FINE-TPT, depth of recursion, compartment resolution and number of right hand sides in the matrix system to show the robustness of our approach. We also employ a machine learning technique called linear regression to find a threshold function which helps in deciding when to stop the recursion in our algorithm.

II. HINES MATRIX

A. The Hodgkin-Huxley Model

The Hodgkin-Huxley model is a mathematical model proposed by Hodgkin and Huxley [9] to explain ionic mechanisms and voltage behaviour involved in the initiation and propagation of action potentials in neurons. A non-linear differential equation models how the potential difference ($V_m$) changes with respect to ion channels, current and other properties of a neuron. To simulate the model, a neuron is discretized spatially into multiple compartments as shown in Figure 1(a). The relationships between various compartments in a compartmentalized neuron can then be represented as a rooted tree as shown in Figure 1(b), where each node in the tree corresponds to a compartment. A node $V_i$ in the tree has a unique parent compartment $V_p$ and child compartments as shown in Figure 1(c). The tree is then numbered using a DFS numbering scheme from leaves to root. DFS numbering ensures two things.
1) The number of a node is larger than all its children and smaller than its parent.
2) The compartments in each branch of the neuron have consecutive numbers.

The current balance equation of the $i$-th compartment at the $j$-th time step is then described according to Equation 1.

\[
C_i \frac{V_i^{j} - V_i^{j-1}}{\Delta t} = \frac{E_i - V_i^{j}}{Rm_i} + (V_p^{j} - V_i^{j})\, Ga_{i,p} + \sum_{k \in ionchannels(i)} G_{i,k}^{j}\,(E_{i,k} - V_i^{j}) + Iext_i + \sum_{c \in children(V_i)} (V_c^{j} - V_i^{j})\, Ga_{i,c} \tag{1}
\]

where $V_i^j$ and $G_{i,k}^j$ represent voltage and conductance of ion channel $k$ respectively for compartment $i$ at time step $j$.


The constants $Iext_i$, $C_i$, $E_i$, $Rm_i$, $E_{i,k}$ represent external current, membrane capacitance, membrane resting potential, membrane resistance and reversal potential of ion channel $k$ respectively for compartment $i$. $Ga_{t_1,t_2}$ represents radial conductance between compartments $t_1$ and $t_2$. $\Delta t$ is the time interval between two time steps. The only unknowns in Equation 1 are $V_i^j$, $V_p^j$ and $\{V_c^j \mid c \in children(V_i)\}$. By isolating them, Equation 1 can be written in a concise manner as shown in Equation 2. For more details refer to [0].

\[
A^j_{i,p} V_p^j + A^j_{i,i} V_i^j + \sum_{c \in children(V_i)} A^j_{i,c} V_c^j = b_i^j \tag{2}
\]

where

\[
A^j_{i,i} = \frac{C_i}{\Delta t} + \frac{1}{Rm_i} + \sum_{r \in neigh(i)} Ga_{i,r} + \sum_{k \in ionchannels(i)} G^j_{i,k}, \qquad A^j_{i,p} = -Ga_{i,p}, \qquad \forall c \in children(V_i):\ A^j_{i,c} = -Ga_{i,c},
\]

\[
b_i^j = Iext_i + \frac{E_i}{Rm_i} + V_i^{j-1}\,\frac{C_i}{\Delta t} + \sum_{k \in ionchannels(i)} G^j_{i,k} E_{i,k}.
\]

Here $V_i^j$, $V_p^j$ and $\{V_c^j \mid c \in children(V_i)\}$ are the voltages of the $i$-th compartment, the parent compartment of $V_i$ and the child compartments of $V_i$ respectively at the $j$-th time step, and $A^j_{i,i}$, $A^j_{i,p}$ and $\{A^j_{i,c} \mid c \in children(V_i)\}$ are the corresponding coefficients. The current balance equations of all compartments can then be represented in a matrix form $A x = b$, where $x = [V_1^j \ldots V_N^j]^T$ and $b = [b_1^j \ldots b_N^j]^T$. Solving this linear system gives the voltage values for compartments in each time step. Figure 2 corresponds to the structure of the matrix formed by the current balance equations of the compartmentalized neuron in Figure 1(b).

[Fig. 1: Multi compartment neuron modelling. (a) Compartmentalized neuron (b) Neuron tree (c) Parent and children of the $i$-th compartment.]

The matrices obtained from the voltage PDE simulations fall under a class of matrices called Hines matrices. A Hines matrix has the following structural properties:
1) The matrix is symmetric. [$A_{i,j} = A_{j,i}$]
2) In each row $i$, there exists only one nonzero element with column index $j$ such that $j > i$. [$\exists!\, j\ (A_{i,j} \neq 0 \text{ and } j > i)$]

From Equation 2, we can see that the off-diagonal elements of $A$ have contributions only from the radial conductance $Ga_{t_1,t_2}$. As the radial conductance between any two compartments $t_1$ and $t_2$ is the same irrespective of order, i.e., $Ga_{t_1,t_2} = Ga_{t_2,t_1}$, the matrix $A$ is symmetric. The only nonzero element in row $i$ after $A_{i,i}$ corresponds to the coefficient of the parent compartment of $V_i$.

B. A General Form

A Hines matrix $A$ can be represented in a general form along with some conditions. Let $R$ be the set of compartments with more than one child and let the junction set $J$ be a superset of $R$. Dividing the matrix at rows $J$ and columns $J$ results in a grid $G$ of dimensions $(2S+1) \times (2S+1)$, where $S = |J|$. Each main diagonal entry of $G$, other than the junction entries $A_{J_i,J_i}$, is a block diagonal matrix $Tr_i$ with $k_i$ blocks, with each block being a symmetric tridiagonal matrix. Every other entry of $G$ is a zero matrix $O$, a coupling vector $C_{i,j}$ or its transpose, or a scalar $A_{J_i,J_j}$. A Hines matrix $A$ can then be represented in the form of Equation 3. We however note that not all matrices which can be represented in the form of Equation 3 are Hines matrices. In Section II-C, we describe the conditions that the general form should satisfy for it to represent only Hines matrices. The notation used for describing the general form is given in Table I.

TABLE I: Notation used in general form, algorithms and proof.
  $A x = b$ : Matrix system
  $Tr_i$ $(Tr_i^1 \ldots Tr_i^{k_i})$ : Block diagonal matrix with $k_i$ blocks
  $Tr_i^j$ : Symmetric tridiagonal matrix
  $C_{i,j}$ $(C_{i,j}^1 \ldots C_{i,j}^{k_i})$ : Column vector $C_{i,j}$ split into $k_i$ vectors
  $J$, $J_i$ : Junction array $J$ and $i$-th junction
  $x_i$ (vector) : Vector $x[J_{i-1}+1 : J_i-1]$
  $b_i$ (vector) : Vector $b[J_{i-1}+1 : J_i-1]$
  $x_i$, $b_i$ (scalar) : $x[i]$, $b[i]$
  $A_{i,j}$ : $A[i][j]$
  $Parent[i]$ : Column index of the only non-zero entry after $A_{i,i}$
  $JunctionIndex[k]$ : Index of junction $k$ in junction array $J$

\[
A = \begin{bmatrix}
Tr_1 & C_{1,1} & O & \cdots & C_{1,S} & O \\
C_{1,1}^T & A_{J_1,J_1} & C_{2,1}^T & \cdots & A_{J_1,J_S} & C_{S+1,1}^T \\
O & C_{2,1} & Tr_2 & \cdots & C_{2,S} & O \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
C_{1,S}^T & A_{J_S,J_1} & C_{2,S}^T & \cdots & A_{J_S,J_S} & C_{S+1,S}^T \\
O & C_{S+1,1} & O & \cdots & C_{S+1,S} & Tr_{S+1}
\end{bmatrix} \tag{3}
\]

\[
Tr_i = \begin{bmatrix} Tr_i^1 & & \\ & \ddots & \\ & & Tr_i^{k_i} \end{bmatrix}, \qquad
C_{i,j} = \begin{bmatrix} C_{i,j}^1 \\ \vdots \\ C_{i,j}^{k_i} \end{bmatrix} \tag{4}
\]


[Fig. 2: The general form of the Hines matrix corresponding to the compartmentalized neuron in Figure 1(b) with J=[0,4].]

\[
x = \begin{bmatrix} x_1 \\ x_{J_1} \\ x_2 \\ \vdots \\ x_{J_S} \\ x_{S+1} \end{bmatrix}, \qquad
b = \begin{bmatrix} b_1 \\ b_{J_1} \\ b_2 \\ \vdots \\ b_{J_S} \\ b_{S+1} \end{bmatrix}, \qquad
[x_i, b_i] = \begin{bmatrix} x_i^1, b_i^1 \\ x_i^2, b_i^2 \\ \vdots \\ x_i^{k_i}, b_i^{k_i} \end{bmatrix} \tag{5}
\]

C. Necessary Conditions

Condition 1: For a given $Tr_i^r$, where $1 \le i \le S+1$ and $1 \le r \le k_i$, only one column vector among $\{C_{i,j}^r \mid 1 \le j \le S\}$ is non-zero, with only one non-zero element at the end.

Each $Tr_i^r$ is bounded by two rows, startrow and endrow, in matrix $A$. As $Tr_i^r$ is a tridiagonal matrix, the matrix element to the right of each main diagonal element in $Tr_i^r$ is in $Tr_i^r$ itself, except for endrow. From the general form, we know that $Parent[endrow]$ has to be in junction set $J$. As there is only one non-zero after each main diagonal element in a Hines matrix, the matrix element to the right of $A_{endrow,endrow}$ has to be in one of the column vectors $\{C_{i,j}^r \mid 1 \le j \le S\}$. As the non-zero column vector is also bounded by startrow and endrow, only the last element of that vector is non-zero. For the matrix in Figure 2, we can see that each tridiagonal matrix ($Tr_1^1$, $Tr_1^2$, $Tr_1^3$, $Tr_2^1$) has only one nonzero column vector to its right.

Condition 2: For a given junction $J_i$, $C_{i,i}^{k_i} \neq 0$.

Each tridiagonal matrix $Tr_i^r$ corresponds to an unbranched segment in the tree. For a junction node $J_i$, $Tr_i^{k_i}$ corresponds to the segment which is numbered just before numbering junction node $J_i$. From the DFS numbering scheme, we can say that $J_i = Parent[endrow(Tr_i^{k_i})]$. So $C_{i,i}^{k_i}$ is a non-zero vector. For the matrix in Figure 2, both the vectors corresponding to junctions, i.e., $C_{1,1}^3$ and $C_{2,2}^1$, are non-zero.

Condition 3: If $Parent[J_i] \in J$, then $(C_{i+1,i})^T = 0$; else $(C_{i+1,i})^T \neq 0$ and it has exactly one non-zero, at the first index.

If $Parent[J_i] \notin J$, then according to the DFS numbering system $Parent[J_i] = J_i + 1$. This means $A_{J_i,(J_i+1)} \neq 0$. As $A_{J_i,(J_i+1)}$ is the first element of $(C_{i+1,i})^T$, we have $(C_{i+1,i})^T[1] = A_{J_i,(J_i+1)}$. As there can be only one non-zero element to the right of $A_{J_i,J_i}$, all the elements of $(C_{i+1,i})^T$ are zero except for the first one.
If $Parent[J_i] \in J$, then the row and column of the only non-zero element to the right of $A_{J_i,J_i}$ belong to $J$. This means all the row vectors to the right of $A_{J_i,J_i}$ are zero vectors, which includes $(C_{i+1,i})^T$. For the matrix in Figure 2, both $Parent[J_1]$ and $Parent[J_2]$ are not in $J$. Hence both $(C_{2,1})^T$ and $(C_{3,2})^T$ are non-zero.

Condition 4: For a given junction row $J_i$, all row vectors after $(C_{i+1,i})^T$ are zero vectors.

Only one non-zero element exists after each main diagonal element in a Hines matrix. From Condition 3, we know that for a junction row $J_i$, it may only be part of the row vector $(C_{i+1,i})^T$. So the rest of the row vectors after $(C_{i+1,i})^T$ are zero vectors.

III. OUR APPROACH

A. EDD on an Undirected Graph

The Exact Domain Decomposition (EDD) method was first employed by Mascagni [3] to solve matrices corresponding to undirected graphs. The main idea is to create subdomains by breaking the graph at nodes with degree greater than two. In such a decomposition, each subdomain corresponds to a chain graph and the matrix of the subdomain corresponds to a tridiagonal matrix. These subdomains are solved independently and the solutions are fused together based on subdomain relationships to construct the final solution. Any undirected graph can be represented in the general form described in Equation 3. Thus the EDD algorithm for a matrix that can be represented in general form can be described as in Algorithm 1.

Algorithm 1: Domain decomposition method for a matrix in general form corresponding to an undirected graph.
  1: $\forall\, i = 1{:}S{+}1$: $R_i = Tr_i^{-1}\, b_i$
  2: $\forall\, i = 1{:}S{+}1,\ \forall\, j = 1{:}S$: $P_{i,j} = Tr_i^{-1}\, C_{i,j}$
  3: $\forall\, i = 1{:}S,\ \forall\, j = 1{:}S$: $M[i][j] = \left( \sum_{l=1}^{S+1} \sum_{r=1}^{k_l} (C_{l,i}^{r})^T P_{l,j}^{r} \right) - A_{J_i,J_j}$
  4: $\forall\, i = 1{:}S$: $M_{rhs}[i] = \left( \sum_{l=1}^{S+1} (C_{l,i})^T R_l \right) - b_{J_i}$
  5: $M_x = M^{-1} M_{rhs}$
  6: $\forall\, i = 1{:}S{+}1$: $x_i = R_i - \sum_{k=1}^{S} M_x[k]\, P_{i,k}$
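To make the structure of Algorithm 1 concrete, the smallest nontrivial instance can be worked through in Python: a chain with two tridiagonal segments joined by one junction (S = 1). This is a sketch under stated assumptions, not the implementation evaluated in this paper. The helper names `thomas`, `dot` and `edd_one_junction` and the 5-node test system are hypothetical; the sign convention for the domain matrix follows Algorithm 1 ($M = \sum C^T P - A_{J,J}$).

```python
def thomas(lower, diag, upper, rhs):
    """Solve a tridiagonal system; lower[0] and upper[-1] are ignored."""
    n = len(diag)
    d, r = list(diag), list(rhs)
    for i in range(1, n):
        f = lower[i] / d[i - 1]
        d[i] -= f * upper[i - 1]
        r[i] -= f * r[i - 1]
    x = [0.0] * n
    x[-1] = r[-1] / d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (r[i] - upper[i] * x[i + 1]) / d[i]
    return x


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def edd_one_junction(tr1, c1, a_jj, c2, tr2, b1, b_j, b2):
    """EDD for the general form with S = 1 (two segments, one junction)."""
    # Independent tridiagonal solves: R_l = Tr_l^-1 b_l, P_l = Tr_l^-1 C_l.
    r1, p1 = thomas(*tr1, b1), thomas(*tr1, c1)
    r2, p2 = thomas(*tr2, b2), thomas(*tr2, c2)
    # Domain system (here 1x1) built from the junction row.
    m = dot(c1, p1) + dot(c2, p2) - a_jj
    m_rhs = dot(c1, r1) + dot(c2, r2) - b_j
    x_j = m_rhs / m
    # Segment unknowns: x_l = R_l - x_j * P_l.
    x1 = [r - x_j * p for r, p in zip(r1, p1)]
    x2 = [r - x_j * p for r, p in zip(r2, p2)]
    return x1 + [x_j] + x2


# 5-node chain, diagonal 4, off-diagonal -1, junction at the middle node.
seg = ([0.0, -1.0], [4.0, 4.0], [-1.0, 0.0])
x = edd_one_junction(seg, [0.0, -1.0], 4.0, [-1.0, 0.0], seg,
                     [1.0, 2.0], 3.0, [4.0, 5.0])
```

The two segment solves are independent of each other, which is the source of parallelism in EDD; only the small domain system couples them.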

In Algorithm 1, we start by solving each $Tr_i$ with multiple right hand sides $\{C_{i,j} \mid 1 \le j \le S\}$ and $b_i$. Subsequently, the domain matrix $M$ and its right hand side $M_{rhs}$ are constructed using rows at junction indices of matrix $A$ and the tridiagonal solutions computed. Solving the system $(M, M_{rhs})$ gives solutions $M_x$ for compartments in the junction set. The solutions for non-junction nodes are then computed using $M_x$ and the tridiagonal solutions. The complexity of this algorithm is $O(NS + S^3)$, where $N$ is the size of the matrix and $S$ is the number of junctions.

B. Domain Matrix in the Exact Domain Decomposition Method

In this section, we prove that when EDD is applied on a Hines matrix, the domain matrix obtained has the same structural properties as that of a Hines matrix.

Theorem 1. The domain matrix $M$ has the structural properties of a Hines matrix.

Proof: We prove it by showing that the domain matrix $M$ satisfies the structural properties of a Hines matrix as described in Section II.

(1) The matrix is symmetric, $M_{i,j} = M_{j,i}$. From Algorithm 1, we know that any element $M_{i,j}$ of the domain matrix $M$ can be constructed as follows:

\[
M_{i,j} = \left( \sum_{l=1}^{S+1} C_{l,i}^T\, (Tr_l)^{-1}\, C_{l,j} \right) - A_{J_i,J_j}
\]

and

\[
M_{j,i} = \left( \sum_{l=1}^{S+1} C_{l,j}^T\, (Tr_l)^{-1}\, C_{l,i} \right) - A_{J_j,J_i}.
\]

If $R$ is a symmetric matrix of size $N \times N$ and $p$, $q$ are column vectors of size $N$, then $p^T R\, q = q^T R\, p$. Each $Tr_l$ is symmetric, and so is its inverse. So for valid $l$, $i$ and $j$, $C_{l,i}^T (Tr_l)^{-1} C_{l,j} = C_{l,j}^T (Tr_l)^{-1} C_{l,i}$. As a Hines matrix is a symmetric matrix, $A_{J_i,J_j} = A_{J_j,J_i}$. Hence the matrix $M$ is symmetric.

(2) In each row $i$, there exists only one nonzero element with column index $j$ such that $j > i$. [$\exists!\, j\ (M_{i,j} \neq 0 \text{ and } j > i)$]

For a given junction row $J_i$, the non-zero row vectors before and after $A_{J_i,J_i}$ can be divided into two sets $S_{left}$ and $S_{right}$ respectively. Because a Hines matrix is symmetric, $S_{left}$ can only contribute to the main diagonal element $M_{i,i}$ of the domain matrix $M$. From Conditions 3 and 4, we know that $S_{right} = \{(C_{i+1,i})^T\}$ or $\emptyset$. Each element in row $i$ of $M$ after the main diagonal element $M_{i,i}$, i.e., with $j > i$, can then be represented in the following cases.
Case 1: $S_{right} = \{(C_{i+1,i})^T\}$. By Condition 3 the only non-zero of $(C_{i+1,i})^T$ is at its first index, which falls in the first block of $Tr_{i+1}$. In this case, we can see that:

\[
M_{i,j} = (C_{i+1,i}^{1})^T\, (Tr_{i+1}^{1})^{-1}\, C_{i+1,j}^{1}
\]

From Condition 1, we know that for $Tr_{i+1}^1$ there exists only one non-zero column vector to its right, i.e., $\exists!\, j\ (C_{i+1,j}^1 \neq 0 \text{ and } j > i)$.

Case 2: $S_{right} = \emptyset$. We have that $M_{i,j} = -A_{J_i,J_j}$. From Conditions 3 and 4, we know that this can happen only if $Parent[J_i] \in J$. As there is only one non-zero element after $A_{J_i,J_i}$, only one of the elements from the set $\{A_{J_i,J_{i+1}}, A_{J_i,J_{i+2}}, \ldots, A_{J_i,J_S}\}$ is non-zero.

In both cases, for a given row $i$ in the domain matrix $M$, there exists only one $j$ such that $j > i$ and $M_{i,j} \neq 0$.

C. An O(N) Linear Algorithm for HinesSolver

In this section, we describe an O(N) algorithm for HinesSolver using EDD, where $N$ is the total number of compartments. If $R$ is the set of compartments with more than one child, then a junction set $J$ is a superset of $R$. Let $S$ be the number of compartments in a junction set $J$. Our algorithm comprises four stages.
1) Solve independent tridiagonal systems.
2) Generate the domain matrix.
3) Solve the domain matrix system $(M, M_{rhs})$.
4) Construct the final solution ($x$).

Analysis: In Steps 3, 6, 7, and 10 of Algorithm 2, independent tridiagonal systems are being solved. The cumulative size of the tridiagonal systems from Step 3 is bounded by $N$ and that of Steps 6, 7, and 10 is bounded by $2N$. So Stage-1 involves solving tridiagonal systems with cumulative size bounded by $3N$. As a tridiagonal system can be solved in linear time, the complexity of Stage-1 is O(N). The complexity for computing both the main diagonal of $M$ and the right hand side $M_{rhs}$ is $O(\sum_{i=1}^{S} n_i)$, where $n_i$ is the number of neighbours of compartment $J_i$. It can be seen from Step 16 that the complexity for computing a non-zero off-diagonal element of $M$ is O(1). As $\sum_{i=1}^{S} n_i < N$ and the domain matrix $M$ has only O(S) non-zero off-diagonal elements, the complexity of Stage-2 is O(N + S). As a Hines matrix can be solved in linear time [], the complexity for solving the domain matrix in Stage-3 is O(S). Stage-4 has the same loop structure as that of Stage-1. In places where a tridiagonal system is solved, vector scaling and addition are performed instead.
Hence the complexity of Stage-4 is O(N). The complexities of Stages 1-4 are O(N), O(N + S), O(S) and O(N) respectively. As the number of junctions $S$ cannot exceed $N$, the complexity of our algorithm is O(N).

Algorithm 2: O(N) algorithm for HinesSolver using EDD.
   1: # Stage-1: Solve tridiagonal systems.
   2: for $i = 1{:}S$ do
   3:     $Q_i = (Tr_{i+1}^{1})^{-1}\, C_{i+1,i}^{1}$
   4:     for $r = 1{:}k_i$ do
   5:         col = JunctionIndex[Parent[endRow($Tr_i^r$)]]
   6:         $P_i^r = (Tr_i^r)^{-1}\, C_{i,col}^{r}$
   7:         $R_i^r = (Tr_i^r)^{-1}\, b_i^r$
   8:     end for
   9: end for
  10: $R_{S+1} = (Tr_{S+1})^{-1}\, b_{S+1}$   // End case
  11:
  12: # Stage-2: Constructing the domain matrix.
  13: for $i = 1{:}S$ do
  14:     $M[i][i]$ += ($C_{i+1,i}^{1}[1] \cdot Q_i[1] - A_{J_i,J_i}$)
  15:     col = JunctionIndex[Parent[endRow($Tr_{i+1}^{1}$)]]
  16:     $M[i][col]$ = ($C_{i+1,i}^{1}[1] \cdot P_{i+1}^{1}[1] - A_{J_i,J_{col}}$)
  17:     $M_{rhs}[i]$ += ($C_{i+1,i}^{1}[1] \cdot R_{i+1}^{1}[1] - b_{J_i}$)
  18:     for $r = 1{:}k_i$ do
  19:         col = JunctionIndex[Parent[endRow($Tr_i^r$)]]
  20:         $M[col][col]$ += $C_{i,col}^{r}[\,|C_{i,col}^{r}|\,] \cdot P_i^r[\,|C_{i,col}^{r}|\,]$
  21:         $M_{rhs}[col]$ += $C_{i,col}^{r}[\,|C_{i,col}^{r}|\,] \cdot R_i^r[\,|C_{i,col}^{r}|\,]$
  22:     end for
  23: end for
  24:
  25: # Stage-3: Find solutions at junctions.
  26: $M_x = M^{-1} M_{rhs}$
  27:
  28: # Stage-4: Construct x.
  29: for $i = 1{:}S$ do
  30:     $x_{i+1}^{1}$ -= $M_x[i] \cdot Q_i$
  31:     for $r = 1{:}k_i$ do
  32:         col = JunctionIndex[Parent[endRow($Tr_i^r$)]]
  33:         $x_i^r$ -= $M_x[col] \cdot P_i^r$
  34:         $x_i^r$ += $R_i^r$
  35:     end for
  36: end for

D. EDD for HinesSolver on GPU

In this section, we show how HinesSolver can be efficiently mapped onto a GPU architecture using EDD. In Section III-C, we showed that the complexity of our algorithm is O(N), irrespective of the number of junction compartments. We use this fact and propose a decomposition strategy called fine decomposition.

Minimal Decomposition ($J_{md}$): Compartments with more than one child are chosen as junctions.

Fine Decomposition ($J_{fd}$): The goal of this decomposition is to have equal-sized tridiagonal systems to solve in Stage-1. In order to achieve that, we break each branch of the tree into K-sized chains and include the last compartment of each chain in junction set $J_{fd}$. Apart from these compartments, $J_{fd}$ includes all compartments which have more than one child, i.e., junction set $J_{md}$.

An example of minimal decomposition and fine decomposition with K = 4 for a given tree is shown in Figures 3(a) and 3(b) respectively. Hollow nodes in Figures 3(a) and 3(b) correspond to junction compartments. Algorithm 3 describes the recursive algorithm for HinesSolver using the Exact Domain Decomposition method.

[Fig. 3: Decomposition strategies. (a) Minimal Decomposition (b) Fine Decomposition with K=4.]
[Fig. 4: Recursive application of fine decomposition with K=3.]

1) Stage-1: In this stage, the computation involves solving many tridiagonal systems. This computation can be performed using two approaches, TRSV and TPT.

TRSV: Making all junction rows and junction columns of matrix $A$ zero, except for the main diagonal elements, results in a tridiagonal matrix $T$. Figure 5 shows the tridiagonal matrix $T$ obtained from the Hines matrix shown in Figure 2. From Stage-1 of Algorithm 2, we know that each tridiagonal $Tr_i^r$ has to be solved with at least two and at most three right hand sides. We position them accordingly as shown in Figure 5 and use an optimized tridiagonal solver from NVIDIA's CUDA library cuSPARSE [4] to solve the tridiagonal matrix $T$ with three right hand sides.

[Fig. 5: Mapping Stage-1 computation to tridiagonal solver (TRSV) with three right hand sides in MIN-TRSV.]

TPT: In this approach, we map each thread to solve an independent tridiagonal system.

In minimal decomposition, the independent tridiagonal systems that have to be solved in Stage-1 are big and vary widely in size. So using the TPT approach for solving tridiagonal systems suffers from less parallelism, higher load imbalance and more work per thread. The TRSV approach hides these to some extent and takes advantage of an optimized library function for the tridiagonal solver. This compatibility leads to the MIN-TRSV approach. In fine decomposition, each independent tridiagonal system in Stage-1 is of almost the same size and has equal compute. So using the TPT approach, one can take advantage of the SIMD architecture of a GPU. This compatibility leads to the R-FINE-TPT approach.

2) Stage-2: In this stage, we construct the domain matrix system $(M, M_{rhs})$. As there is no dependency among the non-zero elements of the domain matrix system, they can be constructed in parallel.

3) Stage-3: In this stage, we have a choice to run the algorithm recursively to solve the domain matrix system $(M, M_{rhs})$. In the R-FINE-TPT approach, we run the algorithm recursively and stop when using a GPU is no longer efficient. In case of MIN-TRSV, we do not run the algorithm recursively. The reason is that the tree corresponding to the domain matrix when minimal decomposition is applied has no vertices of degree two. Applying any decomposition recursively on that tree will not reduce the size of the domain matrix significantly. So in the MIN-TRSV approach, the runtime is lower when we do not recurse.

4) Stage-4: Stage-4 involves constructing the final solution $x$. As there is no dependency among elements of $x$, each element of $x$ can be constructed in parallel.

E. Implementation Details

A Hines matrix $A$ of size $N$ is stored using two arrays, a parent array $P$ of size $N$ and a data array $D$ of size $2N$. $P[i]$ stores the column index of the only nonzero after $A[i][i]$. As $A$ is symmetric, we only store the nonzero values of the upper triangular matrix in a row major fashion in $D$. By storing in the row major fashion, we get better memory coalescing while accessing a tridiagonal matrix $Tr_i^j$ in Stage-1. We used buffer arrays to avoid replication of tridiagonal matrices in the computations of Stage-1. We employed NVIDIA's CUDA Occupancy Calculator tool to configure thread block sizes in CUDA kernels.

Algorithm 3: Recursive algorithm for HinesSolver using the Exact Domain Decomposition method.
  R-EDD(A, b, decomposition, num-rhs) {
      if decomposition == MIN then
          Use minimal decomposition with TRSV approach to solve tridiagonal systems in Stage-1.
      else if decomposition == FINE then
          Use fine decomposition with TPT approach to solve tridiagonal systems in Stage-1.
      end if
      Stage-2: Construct domain matrix system $(M, M_{rhs})$
      if decomposition == FINE and Threshold(rows(M), num-rhs) == False then
          $M_x$ = R-EDD($M$, $M_{rhs}$, decomposition, num-rhs)
      else
          Transfer $(M, M_{rhs})$ to CPU.
          Solve the domain system $M \cdot M_x = M_{rhs}$ on CPU.
          Transfer $M_x$ to GPU.
      end if
      Stage-4: Construct x
      return x
  }

IV. RESULTS AND ANALYSIS

A. Platform

We use an Nvidia Tesla K40c GPU for all our experiments. It is mounted on an Intel K CPU with 3GB RAM. The K40c has a total of 2880 cores organized in 15 SMX, with each core clocked at 745 MHz. It provides a peak double precision floating point performance of 1.43 TFlops and single precision floating point performance of 4.29 TFlops. Each SM also has a 64KB configurable cache to exploit data locality.

B. Dataset

All input Hines matrices come from neuron morphologies taken from []. We choose our dataset in such a way that the morphologies come from different parts of the brain and vary in size and in the number of junctions. Some details of the chosen morphologies are shown in Table II. We group the dataset into three categories, small (7K-K), medium (9K-35K), and large (80K-0K) neurons, based on the size of the matrix.

[TABLE II: Details of neuron morphologies. Columns: Neuron, Compartments, Junctions, Branches. Rows: EC5, Rat-ngf, Alvarez-Control-Cell, skna-8-r, L395-LCN, MA349-dSAC, HICAP3, 4-traced, alphamn4, cell-36-trace.]

C. Results

We compare our R-FINE-TPT approach with the MIN-TRSV approach, which is based on the minimal decomposition strategy suggested by Mascagni in [3]. These approaches differ in the decomposition strategy used for finding junctions and the computation strategy used to solve tridiagonal systems in Stage-1. All operations are carried out in double precision. From the results in Figure 6, it can be observed that R-FINE-TPT is faster than MIN-TRSV for all classes of input. It has to be noted that we achieve a speedup of x on the EC5 neuron, which has the highest number of compartments of any neuron in the repository [].

[Fig. 6: Results on input dataset. Time (ms) of R-FINE-TPT and MIN-TRSV, and the resulting speedup, for each neuron.]

The reason for the good performance of R-FINE-TPT over MIN-TRSV is that in R-FINE-TPT there is more parallelism, less work per thread, and negligible load imbalance. In the case of MIN-TRSV, by contrast, threads have to coordinate among themselves to solve one big tridiagonal system with three right hand sides. This requirement for coordination results in poor performance when compared to R-FINE-TPT.

D. Further experiments

In this section, we perform two sets of experiments. One set of experiments studies the impact of parameters such as K in the fine decomposition, depth D in the recursion, and the number of right hand sides on the runtime of the algorithm. The second set of experiments is aimed at coming up with guidelines to choose appropriate values for K and D automatically based on empirical data. For some experiments we use a linear neuron as our model. In a linear neuron, there is only one branch and all the compartments have only one child.

1) Varying K in Fine Decomposition: In this experiment, we study how varying K in fine decomposition affects the overall runtime and the runtime of individual stages in R-FINE-TPT. To understand the behaviour, we take a linear neuron of size 00K as the input. For a linear neuron with N compartments, there are roughly N/K junctions and 3N/K tridiagonal systems of size (K-1) to be solved in Stage-1. As we increase K, the number of threads to be launched in Stage-1, i.e., 3N/K, decreases and the work per thread, i.e., solving a tridiagonal system of size (K-1), increases. Having few threads with more work is not good for a GPU, and this can be observed in the run time of Stage-1 in Figure 7. As K increases, the running time of Stage-1 increases. Stage-3 of EDD involves the time for the recursive call T(N/K). It decreases with increase in K, as can be observed in Figure 7. Threads launched in Stage-2 and Stage-4 are very light and changing K has little impact on their runtime. The overall runtime decreases up to a certain K and then increases. For our input linear neuron of size 00K the best performance is at K = 3.

[Fig. 7: Impact of varying K on R-FINE-TPT approach. Time (ms) of R-FINE-TPT and of Stages 1-4 versus K in fine decomposition.]

2) Varying the Depth of Recursion, D: In this experiment, we study the effect of recursion depth D on the R-FINE-TPT approach. From Figure 8, we can see that as we increase D, the speedup increases to a certain point and decreases from then on. This is due to the fact that at the inflection point it is better to solve the matrix on a CPU rather than running the algorithm recursively. For large neurons, the inflection point is at D = 3 and the average size of the domain matrix at D = 3 is 800. For such small matrices, it is faster to solve on a CPU despite the cost of memory transfers. As long as we have bigger matrices to solve at each level, it is beneficial to run the algorithm recursively.

3) Varying Resolution: In compartmental modelling, each branch of the neuron is divided into multiple compartments. More accurate simulations are possible by increasing the number of compartments into which a branch is divided. The morphology file contains a particular compartmentalization of a neuron. In this experiment, we obtain a P-resolution morphology by breaking each original compartment in the morphology into P compartments. If the input morphology has N compartments, the P-resolution morphology contains P·N compartments. From Figure 9, we can see that R-FINE-TPT performs better than MIN-TRSV for all resolutions. The primary reason for this is that using fine decomposition enables us to have the computation in Stage-1 divided into many threads with very little work. This, coupled with recursion, is the reason for better performance compared to MIN-TRSV.
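The P-resolution construction described above can be sketched on a parent-array representation of the neuron tree. This is an illustrative assumption about how such a refinement could be done, not the procedure used for the experiments: the function name `refine_morphology`, the 0-based indices, and the choice to attach a refined child chain to the first piece of its parent's chain are all hypothetical. The sketch keeps the property that every compartment is numbered below its parent.

```python
def refine_morphology(parent, p):
    """Split every compartment into a chain of p sub-compartments.

    Assumed input (hypothetical layout): parent[i] > i for every
    non-root compartment and parent[root] == -1.
    Sub-compartment t of compartment i receives index i * p + t.
    """
    refined = []
    for i, par in enumerate(parent):
        # Inside one original compartment, each piece feeds the next one.
        refined.extend(i * p + t + 1 for t in range(p - 1))
        # The last piece attaches to the first piece of the old parent.
        refined.append(par * p if par >= 0 else -1)
    return refined


# Two leaves (0 and 1) attached to a root (2), refined with P = 2.
fine = refine_morphology([2, 2, -1], 2)
```

For an input with N compartments the result has P·N entries, matching the compartment count stated in the text; unbranched runs stay consecutively numbered, so the refined tree can be fed to the same decomposition as the original.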

Fig. 8: Impact of varying recursion depth D on speedup. (Plot: speedup against recursion level (D) for Small, Medium, and Large neurons.)

Fig. 9: Impact of varying resolution on R-FINE-TPT and MIN-TRSV approaches. (Plot: Time (ms) against P-resolution for Small, Medium, and Large neurons under each approach.)

4) Varying Right Hand Sides: Voltage behaviour studies have many parameters to tune. For example, in Equation, using different values of the external compartment current (Iext) affects only the right hand side of the matrix system. It is therefore possible to run multiple simulations for different values of Iext at once. This is advantageous because it suffices to factorize the tridiagonal matrices in Stage- only once and to reuse the factorizations for all right hand sides. In this experiment, we see how R-FINE-TPT behaves as the number of right hand sides changes. From Figure 10, it can be seen that the speedup increases with the number of right hand sides for all classes of neurons.

Fig. 10: Impact of varying right hand sides on speedup. (Plot: speedup against number of right hand sides for Small, Medium, and Large neurons.)

5) Determining the Threshold Function: In this experiment, we find a boolean threshold function for deciding when to stop the recursion in Algorithm 3. The threshold function depends on two parameters: the size of the matrix, N, and the number of right hand sides, R. We have to break the recursion at the stage where using the CPU is better than using the GPU. So, we run an experiment to find, for each value of R, the largest matrix size at which the CPU is better than the GPU. From Figure 11, we can see that the data follows 1/x behaviour. Hence we modeled the function as (N = a_0/R + a_1) and used linear regression to find the constants a_0 and a_1 at which the error is minimum. The threshold function thus obtained is (N - (545/R) - 40 ≤ 0). The actual value in the case of R = is 3500, and it is recommended to use the threshold function (N ≤ 3500) when using R = .

Fig. 11: Threshold function for matrix with multiple right hand sides. (Plot: size of the matrix (N) against number of right hand sides (R), showing the measured threshold values and the fitted curve.)

6) Choosing K in Fine Decomposition: In this experiment, we provide insights for choosing the best value of K in fine decomposition when R = . The choice of K depends on the size of the matrix N and on the K values chosen for matrices smaller than N. For linear neurons of size N > 3500, we ran our algorithm for different values of K and chose the K which gave the best runtime. We find the best K values for all matrix sizes constructively, so that in the lower levels of recursion we use the best K values already computed. From Figure 12, we can see that the variance of K is high for small values of N. For larger values of N, where the recursion depth is greater than one, the best value of K remains constant at three. One interesting observation is that for neurons of around size 0000, it is possible to go down two levels in the recursion, but the best K value is the one that recurses only once. To get good performance, maintain a lookup table for smaller matrices and use K = 3 for larger matrices.
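The fit described under "Determining the Threshold Function" can be sketched as an ordinary least-squares fit that is linear in x = 1/R. This is a minimal sketch under stated assumptions: the function names fit_threshold and stop_recursion are hypothetical, and the data points below are synthetic values chosen for illustration, not the measurements reported in the paper.

```python
def fit_threshold(rs, ns):
    """Least-squares fit of N = a0 / R + a1, treated as a line in x = 1/R."""
    xs = [1.0 / r for r in rs]
    m = len(xs)
    mean_x = sum(xs) / m
    mean_n = sum(ns) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxn = sum((x - mean_x) * (n - mean_n) for x, n in zip(xs, ns))
    a0 = sxn / sxx                 # slope with respect to 1/R
    a1 = mean_n - a0 * mean_x      # intercept
    return a0, a1


def stop_recursion(n, r, a0, a1):
    """Boolean threshold: True when the matrix is small enough for the CPU."""
    return n - a0 / r - a1 <= 0


# Synthetic (R, largest CPU-friendly N) pairs that follow N = 1000/R + 50 exactly.
rs = [1, 2, 4, 8]
ns = [1000.0 / r + 50.0 for r in rs]
a0, a1 = fit_threshold(rs, ns)
print(a0, a1, stop_recursion(500, 2, a0, a1))
```

In a recursive solver, a predicate of this kind would be evaluated at the top of each recursive call, with the constants fitted once per machine from timing data.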

Fig. 12: Best value of K in Fine Decomposition. (Plot: best K against matrix size (N).)

V. CONCLUSIONS AND FUTURE WORK

In scientific simulations based on ordinary and partial differential equations, matrix solvers are almost always a bottleneck, and making them faster reduces simulation time considerably. In this paper, we have demonstrated that incorporating the semantics of matrices into parallel algorithm design helps in designing efficient parallel solvers. The general form given for Hines matrices in this paper provides a framework for proving more results on Hines matrices. On the software front, we will consolidate our algorithms into a CUDA library which can then be used by neuron simulator frameworks such as NEURON [22], MOOSE (Multiscale Object-Oriented Simulation Environment) [23], and others. In the voltage PDE simulation of a neuron, only the main diagonal of the matrix and the right hand side vector change in every timestep. It would be interesting to see whether any numerical method can take advantage of this and in turn lead to efficient parallel algorithms.

REFERENCES

[1] University of Florida (0). UF sparse matrix collection. Available at: ( /matrices/groups.html).
[2] Asanovic, Krste, et al. The landscape of parallel computing research: A view from Berkeley. Vol.. Technical Report UCB/EECS , EECS Department, University of California, Berkeley, 2006.
[3] Intel Math Kernel Library.
[4] Nvidia sparse matrix library (cuSPARSE), nvidia.com/cusparse.
[5] Wangdong Yang, Kenli Li, and Keqin Li. A hybrid computing method of SpMV on CPU-GPU heterogeneous computing systems, in Proc. of Journal of Parallel and Distributed Computing (2017), volume 04,
[6] Kiran Kumar Matam, Siva Rama Krishna Bharadwaj, and Kishore Kothapalli. Sparse Matrix Matrix Multiplication on Hybrid CPU+GPU Platforms, in Proc. of 9th Annual International Conference on High Performance Computing (HiPC), Pune, India, 0, -0.
[7] Emmanuel Agullo, Alfredo Buttari, Mikko Byckling, Abdou Guermouche, Ian Masliah. Achieving high-performance with a sparse direct solver on Intel KNL. [Research Report] RR-9035, Inria Bordeaux Sud-Ouest; CNRS-IRIT; Intel corporation; University of Bordeaux. 2017, pp. 5.
[8] Kiran Raj Ramamoorthy, Dip Sankar, Kannan Srinathan and Kishore Kothapalli. A Novel Heterogeneous Algorithm for Multiplying Scale-Free Sparse Matrices, in Proc. of IPDPS Workshops, 2016,
[9] Dharma Teja Vooturi and Kishore Kothapalli. Parallel Algorithm for Quasi-Band Matrix-Matrix Multiplication, in Proc. of Parallel Processing and Applied Mathematics 2015,
[10] Aydin Buluc and John Gilbert. Challenges and advances in parallel sparse matrix-matrix multiplication. In Proc. ICPP 2008,
[11] Wangdong Yang, Kenli Li, Yan Liu, Lin Shi and Lanjun Wan. Optimization of quasi-diagonal matrix-vector multiplication on GPU. International Journal of High Performance Computing Applications, Vol. 8(), 2014,
[12] Michael Hines. Efficient computation of branched nerve equations. International Journal of Bio-Medical Computing 5. (1984):
[13] Michael Mascagni. A parallelizing algorithm for computing solutions to arbitrarily branched cable neuron models. Journal of Neuroscience Methods 1990, vol 36,
[14] Josep-Lluis Larriba-Pey, Michael Mascagni, Angel Jorba, Juan J. Navarro. An Analysis of the Parallel Computation of Arbitrarily Branched Cable Neuron Models. In PPSC 1995, ( ).
[15] Michael L. Hines, Hubert Eichner and Felix Schurmann. Neuron splitting in compute-bound parallel network simulations enables runtime scaling with twice as many processors. Journal of Computational Neuroscience. 2008;5():03-0. doi:0.007/s
[16] Michael L. Hines, Henry Markram and Felix Schurmann. Fully implicit parallel simulation of single neurons. Journal of Computational Neuroscience 5.3 (2008):
[17] Ben-Shalom, Roy, Gilad Liberman, and Alon Korngreen. Accelerating compartmental modeling on a graphical processing unit. Frontiers in Neuroinformatics 7 (2013): 4.
[18] Harold S. Stone. An efficient parallel algorithm for the solution of a tridiagonal linear system of equations. JACM 1973, 0:7-38.
[19] Hodgkin AL, Huxley AF. A quantitative description of membrane current and its application to conduction and excitation in nerve. The Journal of Physiology. 1952;7(4):
[20] James M. Bower, David Beeman. Compartmental modeling, The Book of GENESIS: Exploring Realistic Neural Models with the GEneral NEural SImulation System. 1998, New York: Springer-Verlag, 7-6.
[21] Ascoli A. Giorgio, Duncan E. Donohue, and Maryam Halavi. NeuroMorpho.Org: A central resource for neuronal morphologies. Journal of Neuroscience 7.35 (2007):
[22] NEURON.
[23] MOOSE (Multiscale Object-Oriented Simulation Environment) neuron simulator.
