A Data-Parallel Implementation of O(N) Hierarchical N-body Methods


This article was downloaded from Harvard University's DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at nrs.harvard.edu/urn-3:hul.instrepos:dash.current.terms-ofuse#laa

Citation: Hu, Yu, and S. Lennart Johnsson. A Data-Parallel Implementation of O(N) Hierarchical N-body Methods. Harvard Computer Science Group TR

Yu Hu, S. Lennart Johnsson
TR, May 1996
Parallel Computing Research Group, Center for Research in Computing Technology, Harvard University, Cambridge, Massachusetts
To appear in Proceedings of Supercomputing '96.

Yu Hu† and S. Lennart Johnsson†*

† Division of Applied Sciences, Harvard University, Cambridge, Massachusetts
* Department of Computer Sciences, University of Houston, Houston, Texas
Email: hu@das.harvard.edu, johnsson@cs.uh.edu

Abstract

The O(N) hierarchical N-body algorithms and Massively Parallel Processors allow particle systems of 100 million particles or more to be simulated in acceptable time. We present a data-parallel implementation of Anderson's method and demonstrate both efficiency and scalability of the implementation on the Connection Machine CM-5/5E systems. The communication time for large particle systems amounts to about 10-25%, and the overall efficiency is about 35%. The evaluation of the potential field of a system of 100 million particles takes 3 minutes and 15 minutes on a 256-node CM-5E, giving expected four and seven digits of accuracy, respectively. The speed of the code scales linearly with the number of processors and the number of particles.

Keywords: N-body simulation, multipole algorithms, hierarchical N-body methods, data-parallel programming, massively parallel processors.

1 Introduction

The problem of computing the force (or the potential) exerted on one another by a system of electrical charges (or masses interacting gravitationally) has been widely studied and has applications in areas such as celestial mechanics, plasma physics, and molecular dynamics. Algorithms that compute the forces for a system of N particles in O(N) operations have been devised [13, 14, 12, 6, 36, 1]. The constant of proportionality is in the range 1,000-10,000. Earlier hierarchical algorithms, such as those proposed by Appel [2] and by Barnes and Hut [4], were believed to have an arithmetic complexity of O(N log N) and did not have a rigorous error bound, although Appel's version was later proved to be of O(N) [10]. The two methods were later extended to be of O(N) with analytical error bounds by combining them with the idea of multipole expansions [11, 35].
Parallel implementations of the O(N log N) or O(N) hierarchical N-body methods have been of great interest, as Massively Parallel Processors (MPPs) offer the primary storage and compute power for simulation of systems with several hundred million particles using these fast algorithms. Table 1 gives a summary of sequential and parallel implementations of hierarchical N-body methods. In comparing performance results from implementations of different N-body methods, often with different parameters, on different platforms running at different clock speeds, we propose the use of efficiency of floating-point operations and cycles per particle as the standard measures. Efficiency alone is insufficient in comparing different algorithms that require different numbers of operations. Cycles per particle incorporates machine size, clock rate, and the arithmetic complexities of different methods, but it does not distinguish nodal architecture; e.g., superscalar architectures can perform multiple operations per cycle.

| Author | Method & error | Prog. model | N | Eff. | Cycles/particle | P | System |
|---|---|---|---|---|---|---|---|
| Adaptive O(N log N) methods | | | | | | | |
| Salmon [27] | BH, quadrupole | MP | | | | | Ncube |
| Warren, Salmon [33] | BH, quadrupole | MP | 8.78M | 26% | 180K | 512 | Intel Delta |
| Warren, Salmon [34] | BH, error bound 10^-3 | MP | 8.78M | 28% | 266K | 512 | Intel Delta |
| Warren, Salmon [35] | BH, error bound 10^-2 | MP | 2M | | 111K | 256 | CM-5E |
| Liu, Bhatt [24] | BH, quadrupole | MP | 10M | 30% | 97K | 256 | CM-5 |
| Singh et al. [29] | BH | SM | | | | | DASH, KSR-1 |
| Nonadaptive O(N) methods | | | | | | | |
| Leathrum, Board [23] | GR, p=8 | | 100K | 65% | 250K | 1 | RS/ |
| | GR, p=8 | SM | 1M | 20% | | 32 | KSR-1 |
| Elliott, Board [9] | GR, FFT, p=8 | | 100K | 73% | 200K | 1 | RS/ |
| | GR, FFT, p=8 | SM | 1M | 14% | | 32 | KSR-1 |
| Schmidt, Lee [28] | GR, p=8 | | 40K | 39% | 312K | 1 | Cray YMP 8/864 |
| | GR, p=16 | | 40K | 22% | 1034K | 1 | Cray YMP 8/864 |
| Zhao, Johnsson [37] | Zhao, p=3 | DP | 16K | 12% | 560K | 8K | CM-2 |
| Hu, Johnsson (this work) | Anderson, D=5 | DP | 100M | 27% | 37K | 256 | CM-5E |
| | Anderson, D=14 | DP | 100M | 35% | 183K | 256 | CM-5E |
| Adaptive O(N) methods | | | | | | | |
| Singh et al. [30] | GR, 2-D, adap | SM | | | | | DASH, KSR-1 |
| Nyland et al. [26] | GR, 3-D, adap | DP | | | | | |

Table 1: A summary of sequential and parallel implementations of hierarchical N-body methods. All performance numbers are for uniform particle distributions. Methods used are for three dimensions, unless otherwise stated. The error bound quoted for BH is per partial acceleration relative to the mean acceleration of the system. Empty entries imply unavailable data. MP, SM, and DP are short for message passing, shared memory, and data parallel, respectively.
Barnes and Hut's O(N log N) method has been implemented using the message-passing programming paradigm by Salmon and Warren [27, 33, 34] on the Intel Touchstone Delta and by Liu and Bhatt [24] on the CM-5. Both groups used assembly language for time-critical kernels. Salmon and Warren achieved efficiencies in the range 24-28%, while Liu and Bhatt achieved 30% efficiency. Recently, Warren and Salmon [35] extended their code to incorporate multipole and local expansions and made it portable to a variety of parallel machines.

For nonadaptive O(N) methods, Greengard and Gropp [12] implemented the Greengard-Rokhlin method in 2-D on a shared-memory machine (the Encore Multimax 320), but the data are not sufficiently complete for inclusion in Table 1. Zhao and Johnsson [37] developed a data-parallel implementation of Zhao's method on the CM-2, and achieved an efficiency of 12% for expansions in Cartesian coordinates, which yield more costly multipole expansion calculations than polar coordinates. Leathrum and Board [23, 5] and Elliott and Board [9] achieved efficiencies in the range 14-20% in implementing the Fast Fourier Transform accelerated Greengard-Rokhlin method [15] on the KSR-1. Schmidt and Lee [28] vectorized this method for the Cray Y-MP and achieved an efficiency of 39% on a single processor. For comparison, we have also included the results reported in this paper.

Little progress has been made in the implementation of adaptive O(N) methods on distributed-memory machines. Singh et al. [29, 30] implemented both O(N log N) and O(N) methods on the Stanford DASH machine, but no measures of the achieved efficiency are available. Nyland et al. [26] discussed how to express the three-dimensional adaptive version of the Greengard-Rokhlin method [6] in a data-parallel subset of the Proteus language, which is still under implementation on parallel machines.

In this paper, we describe a data-parallel implementation of Anderson's method for N-body simulations. The implementation is written in Connection Machine Fortran (CMF) [31] because no High Performance Fortran (HPF) [16] compiler was available at the time of this project. All but one of the features of CMF that we use are also available in HPF. Data motion is managed through the use of data distribution directives and control of the storage-to-sequence association in mapping arrays to the MPP memory units. Additional performance gains are achieved through aggregation of computations, and by a careful trade-off between communication and redundant computation.
Our novel contributions to the implementation of O(N) hierarchical N-body methods on MPPs are:

- minimal data motion in parent-children interactions,
- low data motion in neighbor interactions for interactive-field computations,
- redundant computation/communication trade-offs,
- representing translation operations as matrix-vector multiplications (level-2 BLAS),
- aggregating multiple independent translation operations into multiple instances of matrix-matrix multiplications (level-3 BLAS),
- reducing the number of translation operations through the use of supernodes,
- expressing hierarchical operations on flattened data structures efficiently in a data-parallel language,
- efficient memory usage.

Most of our optimization techniques apply to any distributed-memory machine. However, the relative merit of the techniques depends upon machine metrics. We report on the performance trade-offs on the CM-5/5E. To our knowledge, this work represents the first implementation of Anderson's method on a parallel machine as well as the first implementation of an O(N) N-body algorithm in a data-parallel language. Moreover, the efficiency of our implementation for particle systems with uniform distributions is competitive with that of highly efficient parallel implementations of the Barnes-Hut algorithm that use low-level message passing and assembly language programming. Our implementation is also memory efficient. To our knowledge, this is the first long-range simulation of systems of 100 million particles.

Section 2 briefly describes the computational structure of O(N) N-body methods and the computational elements used in Anderson's method, and defines several of the terms used in the above summary of contributions. Our optimization techniques for programming hierarchical methods in CMF are presented in Section 3. Section 4 reports the performance results of our implementation and the measured accuracy. Section 5 concludes the paper.

[Figure 1: Recursive domain decompositions, the near field, and the interactive field in two dimensions.]

2 O(N) N-body Methods

The O(N) hierarchical N-body methods [14, 36, 1] share the same computational structure; they differ only in the computational elements used in approximating the aggregated potential or force due to a cluster of faraway particles. In this section we briefly describe the computational structure of the O(N) methods and the computational elements used in Anderson's method.

2.1 Domain Decomposition

The O(N) methods start by refining the computational domain into a hierarchy of smaller and smaller subdomains (see Figure 1). Mesh level 0 represents the entire domain (box). Mesh level $l + 1$ is obtained from level $l$ by subdividing each subdomain at level $l$ (parent box) into four (in two dimensions) or eight (in three dimensions) equally sized subdomains (child boxes). In an adaptive method, only subdomains with sufficiently many particles are further subdivided. Boxes that are not further subdivided are leaves. Hierarchical methods can be easily extended to rectangular domains in two dimensions and parallelepipedic domains in three dimensions [1].

With respect to each subdomain (box) in the hierarchy, the whole domain is partitioned into three regions. The definition of the three regions has a significant impact on the constant in the O(N) asymptotic arithmetic complexity, as well as on the accuracy of the method. In the original formulation of multipole-based methods [14], the near field in two dimensions is defined as those subdomains that share a boundary point with the considered subdomain. In three dimensions, it consists of those subdomains that share a boundary point with the considered subdomain, together with the second-nearest-neighbor subdomains that share a boundary point with the nearest-neighbor subdomains. We refer to these two kinds of near fields as one-separation and two-separation, respectively. In general, the $d$-separation near field in two or three dimensions contains $(2d + 1)^2$ and $(2d + 1)^3$ subdomains, respectively.
The far field of a subdomain is the entire domain excluding the subdomain and its near-field subdomains. The interactive field of a subdomain at level $l$ is the part of the far field that is contained in its parent's near field. In three dimensions, these definitions yield $7(2d + 1)^3$ interactive-field subdomains. In the rest of the paper, the two-separation near field is assumed unless otherwise stated.
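The count $7(2d + 1)^3$ can be checked by brute-force enumeration. The following Python sketch is illustrative only: the function names and the choice of an interior box on an unbounded grid are assumptions made for the example, not part of the implementation described in this paper.

```python
from itertools import product

def near_field(box, d):
    """Boxes within d boxes of `box` along every axis (the box itself included)."""
    return {tuple(b + o for b, o in zip(box, off))
            for off in product(range(-d, d + 1), repeat=len(box))}

def interactive_field(child, d):
    """Boxes at the child's level inside the parent's near field
    but outside the child's own near field."""
    parent = tuple(c // 2 for c in child)
    children_of_parent_near_field = {
        tuple(2 * p + o for p, o in zip(pbox, off))
        for pbox in near_field(parent, d)
        for off in product((0, 1), repeat=len(child))
    }
    return children_of_parent_near_field - near_field(child, d)

child = (16, 17, 16)  # hypothetical interior box, far from any domain boundary
counts = {d: len(interactive_field(child, d)) for d in (1, 2)}
print(counts)  # d = 2 gives 10^3 - 5^3 = 875 boxes
```

For two-separation the parent's near field covers $10^3$ boxes at the child's level and the child's near field covers $5^3$, which gives the 875 interactive-field boxes used later in the paper.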

2.2 Computational Structure

There are two key ideas in O(N) methods that lead to the linear arithmetic complexity. The first, also used in O(N log N) methods, is to represent a cluster of particles sufficiently far away from an evaluation point by a single computational element, called the far-field potential representation. The exact computational elements are represented by an infinite number of terms in the multipole-based methods or by sphere integrations in Anderson's method, and hence, in practice, are approximated by elements represented by a finite number of terms or by discretized integrations. The O(N) methods also introduce a local-field potential representation, a second kind of computational element. This element approximates the potential field in a local domain due to particles in the far domain. The second key idea is to hierarchically form and use as few computational elements as possible. The method is to hierarchically combine the children's far-field potentials to form the parent's, and to pass the parent's local-field potential to the children, as shown in the algorithm below.

O(N) methods can be abstracted in terms of three functions $G$, $\Phi$, $\Psi$, three translation operators $T_1$, $T_2$, and $T_3$, and a set of recursive equations. The physical meanings of $T_1$, $T_2$, and $T_3$ are: shifting a far-field potential, converting a far-field potential to a local-field potential, and shifting a local-field potential. $G$ is the potential function in an explicit Newtonian formulation. $\Phi_i^l$ is the contribution of subdomain $i$ at level $l$ to the potential field in domains in its far field. $\Psi_i^l$ represents the contribution to the potential field in subdomain $i$ at level $l$ due to particles in the far-field region of subdomain $i$, i.e., the local-field potential in subdomain $i$ at level $l$. The computational structure is described as follows [21].

Algorithm: (A generic hierarchical method)

1. Compute $\Phi_i^h$ for all boxes $i$ at the leaf level $h$.

2. Upward pass: for $l = h-1, h-2, \ldots, 2$, compute
$$\Phi_n^l = \sum_{i \in \{\mathrm{children}(n)\}} T_1(\Phi_i^{l+1}).$$

3. Downward pass: for $l = 2, 3, \ldots, h$, compute
$$\Psi_i^l = T_3\left(\Psi_{\mathrm{parent}(i)}^{l-1}\right) + \sum_{j \in \{\text{interactive-field}(i)\}} T_2(\Phi_j^l).$$

4. Far field: evaluate the local-field potential at particle $k$ inside every leaf-level subdomain,
$$\phi_{k,\text{far field}} = \Psi_{\mathrm{box}(k)}^h(k).$$

5. Near field: evaluate the potential field due to the particles in the near field of leaf-level subdomains, using a direct evaluation of the Newtonian interactions with nearby particles,
$$\phi_{k,\text{near field}} = \sum_{j \in \{\text{near-field}(\mathrm{box}(k))\}} G_j(k).$$

2.3 Optimal Hierarchy Depth

For $N$ uniformly distributed particles and a hierarchy of depth $h$ having $M = 8^h$ leaf-level boxes, the total number of operations required for the above generic hierarchical method is
$$T_{total}(N, M, p) = O(Np) + O(f_1(p)M) + O((N_{int} f_2(p) + f_3(p))M) + O(Np) + O\left(\frac{N^2}{M}\right),$$
where $p$ is the number of coefficients in the field representation for a computational element, $f_1(p)$, $f_2(p)$, and $f_3(p)$ are the operation counts for the three translation operators, respectively, and $N_{int}$ is the number of interactive-field boxes for interior nodes, i.e., $N_{int} = 875$ for a three-dimensional problem using the Greengard-Rokhlin neighborhood definition. The five terms correspond to the operation counts for the five steps of the method. The minimum value of $T_{total}$ is O(N) for $M = c \cdot N$, i.e., the number of leaf-level boxes for the optimal hierarchy depth is proportional to the number of particles. Since the terms linear in $M$ represent the operation counts in traversing the hierarchy, and the term $O(N^2/M)$ represents the operation count in the direct evaluation in the near field, the optimal hierarchy depth balances the time of the hierarchy traversal and the direct evaluation.

In three dimensions, converting the far-field potentials of interactive-field boxes to local-field potentials dominates the time in traversing the hierarchy. The use of supernodes with two-separation [36, 12] reduces the effective value of $N_{int}$ in three dimensions from 875 to 189, which brings about a dramatic improvement in the overall performance, at the cost of slightly decreased accuracy.

2.4 Anderson's Method

Anderson [1] uses Poisson's formula for representing solutions of Laplace's equation. One advantage of this formulation is that the component operations of the multipole method are very easy to formulate for approximations based on Poisson's formula (the translation operators in equations (2)-(4)). Another advantage is that the computations in two and three dimensions are very similar. Therefore, a code for three dimensions is easily obtained from a code for two dimensions, or vice versa.

Let $g(x, y, z)$ denote potential values on a sphere of radius $a$, and denote by $\Psi$ the harmonic function external to the sphere with these boundary values.
Given a sphere of radius $a$ and a point $\vec{x}$ with spherical coordinates $(r, \theta, \phi)$ outside the sphere, let $\vec{x}_p = (\cos\theta \sin\phi, \sin\theta \sin\phi, \cos\phi)$ be the point on the unit sphere along the vector from the origin to the point $\vec{x}$. The potential value at $\vec{x}$ is (equation (14) in [1])
$$\Psi(\vec{x}) = \frac{1}{4\pi} \int_{S^2} \left[ \sum_{n=0}^{\infty} (2n + 1) \left(\frac{a}{r}\right)^{n+1} P_n(\vec{s} \cdot \vec{x}_p) \right] g(a\vec{s}) \, ds, \qquad (1)$$
where the integration is carried out over $S^2$, the surface of the unit sphere, and $P_n$ is the $n$th Legendre function. Given a numerical formula for integrating functions on the surface of the sphere with $K$ integration points $\vec{s}_i$ and their corresponding weights $w_i$, the following formula (equation (15) in [1]) is used to approximate the potential at $\vec{x}$:
$$\Psi(\vec{x}) \approx \sum_{i=1}^{K} \left[ \sum_{n=0}^{M} (2n + 1) \left(\frac{a}{r}\right)^{n+1} P_n(\vec{s}_i \cdot \vec{x}_p) \right] g(a\vec{s}_i) \, w_i. \qquad (2)$$
This approximation is called an outer-sphere approximation. It makes two approximations relative to equation (1): the series is truncated, and the integral is discretized.

In approximating Poisson's formula, one first chooses an integration order $D$, which determines the error decay rate of the approximation. One then chooses, among the different integration formulas of that order, the one requiring the fewest integration points, which translates into the fewest arithmetic operations for the integration. The optimal choices of $K$, $M$, and $a$ in Table 2 are given by Anderson [1].

The approximation used to represent potentials inside a given region is (equation (16) in [1])
$$\Psi(\vec{x}) \approx \sum_{i=1}^{K} \left[ \sum_{n=0}^{M} (2n + 1) \left(\frac{r}{a}\right)^{n} P_n(\vec{s}_i \cdot \vec{x}_p) \right] g(a\vec{s}_i) \, w_i, \qquad (3)$$
and is called an inner-sphere approximation.

[Table 2: Parameter selections and expected error decay rate of outer/inner-sphere approximations in Anderson's method. Columns: order of integration $D$, $K$, $M$, $a_{outer}$, $a_{inner}$, expected error decay rate. $s$ is the side length of a box.]

[Figure 2: Translations as evaluations of the approximations. (a) Translations $T_1$ and $T_3$. (b) Translation $T_2$.]

The outer-sphere and inner-sphere approximations define the computational elements in Anderson's method. Outer-sphere approximations are constructed for clusters of particles in the leaf-level boxes. During the upward pass, outer-sphere approximations of child boxes are combined into a single outer-sphere approximation of their parent box ($T_1$) by simply evaluating the potential induced by the component outer-sphere approximations at the integration points of the parent outer sphere, as shown in Figure 2. The situation is similar for the other two translations used in the method: shifting a parent box's inner-sphere approximation to add to its children's inner-sphere approximations ($T_3$), and converting the outer-sphere approximations of a box's interactive-field boxes to add to the box's inner-sphere approximation ($T_2$).

3 A Data-Parallel Implementation

In this section, we present a data-parallel implementation of Anderson's method in CMF on the CM-5/5E. The optimizations mainly focus on minimizing data movement through careful management of data distribution and data references, and on improving arithmetic efficiency by aggregating field translation operations into high-level BLAS operations. Most optimizations make use of the array aliasing feature of CMF [32]. Since a single processing node of the CM-5/5E has four (virtual) Vector Units (VUs), each with its own ALU, register file, and memory, for clarity we will refer to VUs instead of processing nodes in the following discussion.
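As an illustration of the computational element that all three translations evaluate, the outer-sphere approximation of equation (2) can be sketched in Python. This is a minimal sketch under stated assumptions, not the CMF code of this paper: the integration points are taken to be the 12 vertices of a regular icosahedron with equal weights summing to one (so the factor $1/4\pi$ of equation (1) is absorbed into the weights), the truncation `M_TERMS = 2` is chosen for illustration, and all function names are hypothetical.

```python
import math

def icosahedron_points():
    """12 unit vectors at the vertices of a regular icosahedron (assumed K = 12 rule)."""
    phi = (1 + math.sqrt(5)) / 2
    norm = math.sqrt(1 + phi * phi)
    pts = []
    for a in (-1, 1):
        for b in (-phi, phi):
            pts += [(0, a / norm, b / norm), (a / norm, b / norm, 0), (b / norm, 0, a / norm)]
    return pts

def legendre(n_max, t):
    """P_0(t) .. P_{n_max}(t) by the three-term recurrence."""
    p = [1.0, t]
    for n in range(1, n_max):
        p.append(((2 * n + 1) * t * p[n] - n * p[n - 1]) / (n + 1))
    return p[:n_max + 1]

def outer_sphere_eval(x, a, g_vals, points, weights, m_terms):
    """Evaluate equation (2) at a point x outside the sphere of radius a."""
    r = math.sqrt(sum(c * c for c in x))
    xp = [c / r for c in x]
    total = 0.0
    for s, g, w in zip(points, g_vals, weights):
        t = sum(si * xi for si, xi in zip(s, xp))
        p = legendre(m_terms, t)
        kernel = sum((2 * n + 1) * (a / r) ** (n + 1) * p[n] for n in range(m_terms + 1))
        total += kernel * g * w
    return total

def sphere_values(charges, a, points):
    """Exact potential g(a * s_i) of point charges (q, position) at the integration points."""
    return [sum(q / math.dist([a * c for c in s], pos) for q, pos in charges) for s in points]

A, M_TERMS = 1.0, 2                      # illustrative radius and truncation
PTS = icosahedron_points()
WTS = [1.0 / len(PTS)] * len(PTS)        # equal weights, normalized to sum to one
X = (3.0, 0.0, 0.0)                      # evaluation point outside the sphere

centered = outer_sphere_eval(X, A, sphere_values([(1.0, (0.0, 0.0, 0.0))], A, PTS), PTS, WTS, M_TERMS)
offset = outer_sphere_eval(X, A, sphere_values([(1.0, (0.1, 0.0, 0.0))], A, PTS), PTS, WTS, M_TERMS)
```

A translation such as $T_1$ amounts to calling an evaluation of this kind once per integration point of the destination sphere, which is why each translation can be written as a $K \times K$ matrix acting on the vector of values $g(a\vec{s}_i)$.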

[Figure 3: Embedding of a hierarchy of grids in two 4-D arrays (leaf level and nonleaf levels).]

3.1 Data Structure and Distribution

Maximizing concurrency and minimizing communication among nodes are crucial in achieving high performance on distributed-memory machines, in addition to exploiting spatial and temporal locality in the local memory hierarchies. The fact that data distribution, or layout, usually is not known until run time further complicates memory management on distributed-memory architectures. Run-time data allocation is the norm when (parallel) codes can be executed on systems with different configurations without recompilation.

There are two main data structures in a hierarchical method: one for storing the potential field in the hierarchy and the other for storing particle information. Far-field potentials are stored for all levels of the hierarchy, since they are computed in the upward pass and used in the downward pass. We embed the hierarchy of far-field potentials in one five-dimensional (5-D) array as follows (see Figure 3): the leaf level is embedded in one 4-D layer of the array, i.e., FAR_POT(1, :, :, :, :), and level $(h - i)$ is embedded in FAR_POT(2, :, $2^{i-1} : L : 2^i$, $2^{i-1} : M : 2^i$, $2^{i-1} : N : 2^i$). Three of the axes represent the organization of the boxes in the three spatial dimensions, while the fourth axis is used to represent data local to a box. The embedding preserves locality between a box and its descendants in the hierarchy. If at some level there is at least one box per VU, then for each box, all its descendants will be allocated to the same VU as the box itself.

Given an array declaration with compiler directives that specify only whether an axis is distributed (parallel) or local to a VU (serial), the Connection Machine Run-Time System attempts to balance subgrid extents and minimize the surface-to-volume ratio. Since communication is minimized for nonadaptive hierarchical methods when the surface-to-volume ratio of the subgrids is minimized, the default layout is ideal. Let the extents of the three spatial axes of the 5-D potential arrays be $L$, $M$, and $N$, respectively.
The extents are equal to the number of leaf-level boxes along the three spatial dimensions, and hence are powers of 2 for a nonadaptive method. The global address space, denoted by $b_{p+n-1} b_{p+n-2} \ldots b_n b_{n-1} b_{n-2} \ldots b_0$, is mapped onto the underlying physical machine. With block allocation, the address field is broken into two parts: the high-order bits form the VU address and the low-order bits form the local memory address. For a multidimensional array with block allocation for each axis, the VU address field and the local memory address field are both broken into segments, one for each axis. Moreover, since on the Connection Machine systems the number of VUs along any axis is constrained to be a power of two, and the number of leaf-level boxes along any axis is a power of two as well, it suffices to consider address bits in studying the layout of boxes. The address fields for each of the two slices of 4-D subarrays in the 5-D potential array are shown in Figure 4.
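The locality property of the embedding can be illustrated along a single axis. The sketch below is in Python rather than CMF and rests on two assumptions: that the strided section for level $h - i$ reads $2^{i-1} : L : 2^i$ in 1-based indexing, and that block allocation assigns consecutive indices to the same VU. The function names and the values of `L` and `N_VU` are hypothetical.

```python
def level_indices(i, extent):
    """1-based indices along one axis that hold level h-i (i >= 1),
    i.e. the strided section 2**(i-1) : extent : 2**i."""
    return list(range(2 ** (i - 1), extent + 1, 2 ** i))

def child_indices(i, j):
    """Indices (in the same layer) of the two children, along this axis,
    of the level h-i box stored at index j; valid for i >= 2."""
    half = 2 ** (i - 2)
    return (j - half, j + half)

def vu_of(index, extent, n_vu):
    """VU coordinate of a 1-based index under block allocation (high-order bits)."""
    return (index - 1) // (extent // n_vu)

L = 16      # leaf-level boxes along one axis (power of two), hypothetical
N_VU = 4    # VUs along that axis (power of two), hypothetical

levels = {i: level_indices(i, L) for i in range(1, 5)}
used = sorted(j for idx in levels.values() for j in idx)

# Level h-2 has L/4 = 4 boxes along the axis, i.e. at least one box per VU,
# so every box shares a VU with both of its children.
local_at_2 = all(vu_of(c, L, N_VU) == vu_of(j, L, N_VU)
                 for j in levels[2] for c in child_indices(2, j))
# Level h-3 has only 2 boxes for 4 VUs, so some children live on another VU.
local_at_3 = all(vu_of(c, L, N_VU) == vu_of(j, L, N_VU)
                 for j in levels[3] for c in child_indices(3, j))
```

In this sketch the nonleaf levels occupy disjoint index sets, so a single layer holds all of them without collisions.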

| Axis | Extent | VU address ($b_{p+n-1} b_{p+n-2} \ldots b_n$) | Local memory address ($b_{n-1} b_{n-2} \ldots b_0$) |
|---|---|---|---|
| 0 | K | | b..b |
| 1 | L | b..b | b..b |
| 2 | M | b..b | b..b |
| 3 | N | b..b | b..b |

Figure 4: The allocation of the local potential arrays LOCAL_POT to processing nodes.

The input to the program consists of a bounding box and relevant particle data given in the form of a collection of 1-D arrays, one for each attribute. For particle-box interactions (steps 1 and 4 in the algorithm) and the direct evaluation in the near field (step 5), it is, however, both convenient and efficient to represent particle attributes as 4-D arrays, with three of the axes representing the domains of the leaf-level potential array boxes, and the fourth representing the particles in those boxes. The particle attributes in the 4-D arrays will be allocated to the same VU as the leaf-level box of the hierarchy to which the particles belong. In the next section we discuss how to accomplish this form of alignment of particle attributes with leaf-level boxes.

3.2 Particle-box Interactions at the Leaf Level

Particle-box interactions occur in forming the far-field potential for leaf-level boxes before traversing the hierarchy, and in evaluating the local-field potential of leaf-level boxes at the particles inside each box after traversing the hierarchy. For the leaf-level particle-box interactions before traversing the hierarchy, the contributions of all particles in a box to each integration point on the sphere in Anderson's method (or to each term in the multipole expansion for the box) must be accumulated. Different boxes have different numbers of particles, and therefore the number of terms added varies with the leaf-level boxes. Once the particles are sorted such that particles belonging to the same box are ordered together, a segmented +-scan is a convenient way of adding the contributions of all the particles within each of the boxes in parallel. A send communication is needed to move data between the 1-D sorted particle arrays used for the +-scan and the 4-D potential arrays for particle-box interactions.
Similarly, a scan and a send communication are required in evaluating the local-field potential at the particles inside the leaf-level boxes. Both scan and send communication may be quite time consuming on the CM-5/5E. However, if the particles are sorted in such a way that they are allocated to the same VU as the leaf-level boxes to which they belong, neither the scan nor the send requires communication. The segmented scan becomes a set of scans local to each VU and can be implemented very efficiently. The sends become local memory references (copies). Unfortunately, as long as an array assignment involves arrays of different shapes, e.g., 1-D particle arrays and 4-D potential arrays, the CMF compiler generates run-time system calls that handle the most general case, i.e., one that involves inter-node data movement. Such run-time system calls incur a high overhead even if inter-node data movement never occurs. Using the 4-D particle array representation avoids this overhead. The 4-D particle arrays have the same three parallel axes as the 4-D potential arrays. The scan operations and the send communication become indexing on the fourth, local axis, and no communication is required.

The problem now becomes how to make the 1-D to 4-D reshaping of particle arrays efficient. Since the input particles have to be sorted once to bring particles belonging to the same leaf-level box together, and the cost of sorting is relatively independent of the distributions of sources and destinations, we want the sort to maximize the locality in reshaping the sorted 1-D particle arrays to the 4-D particle arrays, in addition to bringing particles belonging to the same box together. The following coordinate sort (see Figure 5) sorts particles based on keys constructed from the particle locations, the leaf-level box coordinates, and their allocation, and accomplishes this task.

[Figure 5: Sorting particles for maximum locality in reshaping particle arrays. Box addresses: x1x0, y1y0; keys in sorting: y1x1|y0x0.]

Algorithm: (Coordinate sort)

1. Find the layout of the 4-D potential arrays using intrinsic mapping functions, e.g., the number of bits for the VU address and the local memory address for each axis;
2. For each particle, generate the coordinates of the box to which it belongs, denoted by xx..x, yy..y, and zz..z;
3. Split the box coordinates into VU address and local memory address according to the layout of the potential arrays, denoted as x..x|x..x, y..y|y..y, z..z|z..z;
4. Form keys for sorting by concatenating the VU addresses with the local memory addresses, denoted as z..zy..yx..x|z..zy..yx..x;
5. Sort.

Particles belonging to the same box are adjacent to each other after sorting. Furthermore, for a uniform particle distribution, if there is at least one leaf-level box per VU, then each particle in the sorted 1-D array will be allocated to the same VU as the leaf-level box to which it belongs in the local-field potential array. Therefore, no communication is needed in copying particle attributes from the sorted 1-D array to the 4-D array of particle attributes, which has the same layout as the 4-D potential array. For a near-uniform particle distribution, it is expected that the coordinate sort will leave most particles in the same VU memory as the leaf boxes to which they belong.
3.3 Box-box Interactions during Hierarchy Traversal

During the upward pass, combining the far-field potentials of child boxes to form the far-field potential of the parent box ($T_1$) requires parent-child (box-box) interactions. During the downward pass, converting the local-field potentials of parent boxes to those of child boxes ($T_3$) also requires parent-child (box-box) interactions. The downward pass also requires neighbor (box-box) interactions for the conversion of the far-field potentials of interactive-field boxes to local-field potentials ($T_2$).

In Anderson's variant of the fast multipole method, each of the three translation operators used in traversing the hierarchy can be aggregated into matrices, and their actions on the potential field can be further aggregated into multiple-instance matrix-matrix multiplications. Since there are no other computations in the hierarchy, the entire hierarchical part takes the form of a collection of matrix-matrix multiplications, which are implemented efficiently on most computers as part of the Basic Linear Algebra Subprograms (BLAS) [8, 7, 22]. The Connection Machine Scientific Software Library (CMSSL) [31] supports both single-instance and multiple-instance BLAS.

For the entire hierarchy traversal (steps 2 and 3 of the algorithm), our techniques for optimizing the computations result in an arithmetic efficiency of 40% for $K = 12$ and a hierarchy of depth eight, and an arithmetic efficiency of 69% for $K = 72$ and a hierarchy of depth seven. Table 3 summarizes the leaf-level arithmetic efficiencies for $K = 12$ and $K = 72$ on a 256-node CM-5E.

| Operation | K = 12, h = 8 | K = 72, h = 7 |
|---|---|---|
| T1, T3: arithmetic | 54% | 60% |
| T2: arithmetic | 74% | 85% |
| T2: arithmetic incl. copy | 60% | 79% |
| T2: arithmetic incl. copy and masking | 44% | 74% |

Table 3: Leaf-level arithmetic efficiencies on a 256-node CM-5E. The aggregation of $T_2$ translations involves copying and masking.

The peak arithmetic efficiency of about 74% for $K = 12$ and 85% for $K = 72$ at the leaf level is degraded by the cost of copying and masking, and by a relatively lower efficiency at the higher levels of the hierarchy. Since the matrix multiplication has complexity $O(K^2)$ and the cost of copying and masking is linear in $K$, the arithmetic efficiency when including copying and masking decreases more for $K = 12$ than for $K = 72$.
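The aggregation of translations into matrix-matrix multiplications can be sketched for the upward pass. The Python sketch below is illustrative and is not the CMSSL-based code of this paper: it assumes one $K \times K$ translation matrix per child octant (eight in three dimensions), uses random matrices in place of real $T_1$ operators, and uses plain nested lists in place of a BLAS library; all names are hypothetical.

```python
import random

def matvec(T, v):
    """Level-2 style product: one translation applied to one potential vector."""
    return [sum(t * x for t, x in zip(row, v)) for row in T]

def matmul(A, B):
    """Level-3 style product: one translation applied to many potential vectors at once."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def upward_per_box(T1, children):
    """children[parent][octant] is a K-vector; returns one K-vector per parent."""
    K = len(T1[0])
    out = []
    for kids in children:
        acc = [0.0] * K
        for octant, phi in enumerate(kids):
            acc = [a + b for a, b in zip(acc, matvec(T1[octant], phi))]
        out.append(acc)
    return out

def upward_aggregated(T1, children):
    """Same result, using one K x K by K x n_parents product per octant."""
    K, n = len(T1[0]), len(children)
    acc = [[0.0] * n for _ in range(K)]
    for octant in range(len(T1)):
        # Columns are the potential vectors of all children that sit in this octant.
        block = [[children[p][octant][k] for p in range(n)] for k in range(K)]
        prod = matmul(T1[octant], block)
        acc = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(acc, prod)]
    return [[acc[k][p] for k in range(K)] for p in range(n)]

random.seed(1)
K, N_PARENTS, N_OCTANTS = 4, 6, 8   # small illustrative sizes
T1 = [[[random.random() for _ in range(K)] for _ in range(K)] for _ in range(N_OCTANTS)]
children = [[[random.random() for _ in range(K)] for _ in range(N_OCTANTS)]
            for _ in range(N_PARENTS)]

per_box = upward_per_box(T1, children)
aggregated = upward_aggregated(T1, children)
```

Gathering the columns of `block` is the copying step whose cost grows linearly in $K$, while each product costs $O(K^2)$ per box, which matches the trend in Table 3.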
On the communication side, we use techniques for avoiding excess data movement in prefetching interactive-field boxes in neighbor (box-box) interactions and for extracting/embedding parent and child boxes from the embedded potential arrays in parent-child interactions. With these techniques, communication contributes only 12% of the total time for hierarchy traversal for $K = 12$ and a hierarchy of depth eight. For $K = 72$ and a hierarchy of depth seven, communication amounts to 25%.

Interactive-field Box-box Communication

The interactive-field computation dominates the hierarchical part of the code. The interactive field of a child box contains all boxes inside a $10 \times 10 \times 10$ subgrid centered at the center of the child box's parent box, but outside a $5 \times 5 \times 5$ subgrid centered at the child box. Depending upon which child box of a parent is the target, the interactive field extends two or three boxes beyond the near field at the level of the child box in the positive and negative direction along each axis. The near fields and interactive fields of siblings differ. Each box needs to fetch the potential vectors of its 875 interactive-field boxes during the interactive-field computation.

[Figure 6: Optimizing communication in neighbor interactions. (a) Individual CSHIFTs. (b) CSHIFTs with unit offset. (c) Excessive data movement for subgrids on VUs. (d) Stencil communication.]

The simplest way to fetch the potential vectors of neighbor boxes is to use individual CSHIFTs, one for each neighbor, as shown in Figure 6(a). In the Connection Machine Run-Time System, CSHIFTs along more than one axis are implemented as a sequence of independent shifts, one for each axis, resulting in excessive data motion. A better way to structure the CSHIFTs is to impose a linear ordering upon the interactive-field boxes, as shown in Figure 6(b). The potential vectors of neighbor boxes are shifted through each target box, using a CSHIFT with unit offset at every step.

The three axes use different bits in their VU addresses. The rightmost axis uses the lowest-order bits and the leftmost axis uses the highest-order bits in the default axes ordering. In some networks, like meshes and hypercubes, node addresses are often defined such that nodes that differ in their lowest-order bits are adjacent. In such networks, a linear ordering that uses CSHIFTs along the rightmost axis most often implies less data motion between nodes than other orderings. This shift order is also advantageous in our implementation due to the bandwidth limitations and the addressing of the fat-tree network of the CM-5/5E.

Unfortunately, the scheme just outlined results in excessive data motion, though less so than the direct use of CSHIFTs. Assume that in two dimensions every VU has an $S_1 \times S_2$ subgrid of boxes¹, and that the CSHIFTs are made most often along the y-axis. Every CSHIFT with unit offset involves a physical shift of boundary elements off-VU and a local copying of the remaining elements. After shifting six steps along the y-axis, the CSHIFT makes a turn and moves along the x-axis in the next step, followed by a sequence of steps along the y-axis in the opposite direction (see Figure 6(c)). All the elements in a VU, except the ones in the last row before the turn, are moved back through the same VUs during the steps after the turn. Thus, this seemingly efficient way of expressing neighbor communication in CMF involves excessive communication in addition to the local data movement. Nevertheless, on a CM-5E it is 7.4 times faster than direct use of CSHIFTs for a subgrid with axes extents 16 and $K = 12$.
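The difference between individual CSHIFTs and a linear ordering with unit-offset CSHIFTs can be sketched on a small two-dimensional periodic grid. The Python sketch below is illustrative only: `cshift` mimics the Fortran intrinsic on nested lists, the neighborhood is a full $(2r + 1)^2$ square rather than an actual interactive field, and the boustrophedon path and all names are assumptions made for the example.

```python
def cshift(grid, shift, axis):
    """Circular shift in the sense of Fortran CSHIFT:
    result[i] = grid[(i + shift) mod n] along `axis`."""
    n_rows, n_cols = len(grid), len(grid[0])
    if axis == 0:
        return [grid[(i + shift) % n_rows] for i in range(n_rows)]
    return [[row[(j + shift) % n_cols] for j in range(n_cols)] for row in grid]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def snake_offsets(r):
    """Offsets (dx, dy) in [-r, r]^2, ordered so consecutive entries differ by one unit step."""
    path = []
    for k, dx in enumerate(range(-r, r + 1)):
        dys = range(-r, r + 1) if k % 2 == 0 else range(r, -r - 1, -1)
        path.extend((dx, dy) for dy in dys)
    return path

def neighbor_sum_direct(grid, r):
    """One pair of arbitrary-offset shifts (one per axis) for every neighbor."""
    total = [[0] * len(grid[0]) for _ in grid]
    for dx in range(-r, r + 1):
        for dy in range(-r, r + 1):
            total = add(total, cshift(cshift(grid, dx, 0), dy, 1))
    return total

def neighbor_sum_snake(grid, r):
    """Visit the same neighbors along a linear ordering, one unit-offset shift per step."""
    cur = grid
    for _ in range(r):              # walk to the corner offset (-r, -r)
        cur = cshift(cur, -1, 0)
    for _ in range(r):
        cur = cshift(cur, -1, 1)
    shifts = 2 * r
    path = snake_offsets(r)
    total = [row[:] for row in cur]
    for (px, py), (qx, qy) in zip(path, path[1:]):
        axis = 0 if qx != px else 1
        cur = cshift(cur, (qx - px) + (qy - py), axis)   # exactly one term is +1 or -1
        shifts += 1
        total = add(total, cur)
    return total, shifts

grid = [[7 * i + j * j for j in range(7)] for i in range(6)]   # arbitrary test data
direct = neighbor_sum_direct(grid, 2)
snake, n_unit_shifts = neighbor_sum_snake(grid, 2)
```

Both routines deliver the same neighbor data to every box; the linear ordering replaces two multi-step shifts per neighbor with a single unit-offset shift per neighbor, at the price of the back-and-forth motion described above.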
In order to eliminate excess data movement, we identify nonlocal interactive-field boxes for all boxes in the local subgrid, and structure the communication to fetch only those boxes. For a child box on the boundary of the subgrid in a VU, the interactive-field box furthest away from it is at distance four along the axis normal to the boundary of the subgrid. Hence, the ghost region is four boxes deep on each face of the subgrid. Using the array aliasing feature of CMF, the ghost boxes can be easily addressed by creating an array alias that separates the VU address from the local memory address.

¹ We ignore the local axis in this section since communication only happens on parallel axes.

Table 4: Comparison of data motion needs for interactive-field evaluation on a 32-node CM-5E, with a local subgrid of extents 8 × 8 × 8 and ghost boxes stored in a 16 × 16 × 16 subgrid when using aliased arrays.
Columns: Method; Number of nonlocal boxes fetched; Number of local box moves; Number of CSHIFTs; Relative time (K = 12, K = 72)
Direct on unaliased arrays:
Linearized unaliased arrays: 85, ,608 1,
Direct on aliased arrays: 3,584 7,
Linearized aliased arrays: 4,352 6,

With the subgrids identified explicitly, fetching boxes in ghost regions in three dimensions requires fetching six surface regions, 12 edge regions, and eight corner regions. Fetching the potential vectors of the boxes in the ghost region directly through array sections and CSHIFTs in principle results in no excess data motion. However, due to the implementation of CSHIFTs on the CM-5/5E, excess data motion is incurred, as mentioned above. Hence, this seemingly optimal way of fetching ghost boxes is not the most efficient technique on the CM-5/5E.

Creating a linear ordering through all the VUs containing ghost boxes and using CSHIFTs to move whole subgrids, followed by array sectioning after subgrids are moved to the destination VU, reduces the communication time by a factor of about 1.5. Moving whole subgrids is necessary in order to keep the continuity of the linear ordering of the subgrids. Although some redundant data motion takes place, it is considerably reduced compared to using a linear ordering on an unaliased array. Table 4 summarizes the data motion requirements for the four methods for a subgrid of shape S1 × S2 × S3 with S1 = S2 = S3 = 8. Note that for subgrid extents of less than four along any axis, communication beyond nearest-neighbor VUs is required.

Parent-child Box-box Communication

Using the embedding described in Section 3.1, the far-field potentials of boxes at all levels of the hierarchy are embedded in two layers of a 4-D array.
During traversal of the hierarchy, temporary arrays of a size equal to the number of boxes at the current level of the hierarchy are used in the computation. We abstract two generic functions, Multigrid-embed and Multigrid-extract, for embedding/extracting a temporary array of potential vectors corresponding to some level of the hierarchy into/from the 4-D array. The reduction operator used in the upward pass is abstracted as a Multigrid-reduce operator, while the distribution operator used in the downward pass is abstracted as a Multigrid-distribute operator.

By first creating an array alias separating the local address from the physical address, then using array sectioning, the embedding/extraction operation is performed as a local copy operation if both source and destination locations are local to a VU. If source and destination addresses are on different VUs, which occurs close to the root of the hierarchy of grids, then a two-step procedure is used. First, a temporary array corresponding to the level of the hierarchy that has the least number of boxes larger than the number of VUs, i.e., at least one box on each VU, is allocated. Then, for Multigrid-embed, the source is first embedded in this temporary array, which is then embedded in the 4-D destination array. The second step is a local copy operation, as before, while the first requires communication. But this communication is much more efficient than the communication in embedding the source directly in the destination array, because the overhead in computing send addresses, which is about linear in the array size, is smaller. This overhead may dominate the actual communication, which is proportional to the number of elements selected. On the CM-5E, the performance of Multigrid-embed is improved by up to two orders of magnitude using the local copying or the two-step scheme, as shown in Figure 7.

Figure 7: Performance improvement of Multigrid-embed using array sectioning and aliasing. The two-step scheme was used for the first two cases and local copying was used for the remaining cases. (Plot of time in seconds against the number of boxes in the temporary array, comparing the use of send in CMF with local copying or the two-step scheme.)

Translations as BLAS Operations

In Anderson's method, the translation operators evaluate the approximations of the field on the source spheres at the integration points of the destination spheres (see Figure 2). A field approximation (equation (2) or (3)) can be rewritten as

$$\Phi(\vec{x}_j) \approx \sum_{i=1}^{K} f(\vec{s}_i, \vec{x}_j)\, g(a\vec{s}_i), \qquad j = 1, \ldots, K, \qquad (4)$$

where $f(\vec{s}_i, \vec{x}_j)$ represents the inner summation in the original approximation. $f(\vec{s}_i, \vec{x}_j)$ is a function of the unit vector $\vec{s}_i$ from the origin of the source sphere to its $i$th integration point and the unit vector $\vec{x}_j$ from the origin of the source sphere to the $j$th integration point on the destination sphere. The evaluation of the field at an integration point on the destination sphere due to the field values at all integration points on the source sphere is an inner product computation. Hence, the evaluation of the field at all integration points on the destination sphere due to the field values at all the integration points of the source sphere constitutes a matrix-vector multiplication, where the matrix is of shape K × K. We refer to this matrix as a translation matrix, since the net effect can be interpreted as a translation of the field from the integration points on the source sphere to the integration points on the destination sphere. The entries of the translation matrix depend only on the relative locations of the source and destination spheres.
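The matrix-vector view of a translation can be sketched as follows. This is an illustration in plain Python under stated assumptions: `toy_kernel` is a hypothetical stand-in for the inner summation f and is not Anderson's kernel, and the six axis directions stand in for a real set of K integration points.

```python
import math


def translation_matrix(f, src_dirs, dst_points):
    """Build T with T[j][i] = f(s_i, x_j): one row per destination point."""
    return [[f(s, x) for s in src_dirs] for x in dst_points]


def translate(T, g):
    """Matrix-vector product: the field at every destination point."""
    return [sum(t * gi for t, gi in zip(row, g)) for row in T]


def toy_kernel(s, x):
    # Hypothetical stand-in for f(s_i, x_j); NOT the kernel of the method.
    return 1.0 / (1.0 + math.dist(s, x))


# Assumed K = 6 integration directions: the six axis directions.
src_dirs = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
# Destination points: the same directions displaced by a fixed relative offset.
offset = (3.0, 0.0, 0.0)
dst_points = [tuple(c + o for c, o in zip(s, offset)) for s in src_dirs]
g = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # field values on the source sphere

T = translation_matrix(toy_kernel, src_dirs, dst_points)
phi = translate(T, g)
print(phi)
```

Because T is built only from the source directions and the relative displacement, the same matrix can be reused for every pair of spheres with the same relative location.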
Translation Matrices for T1 and T3

Since in three dimensions a parent has eight children, each of the translation operators T1 and T3 can be represented by eight matrices, one for each of the different parent-child translations. The same matrices can be used for all levels, and for the translations between any parent and its children irrespective of location. Combining the far field of eight child boxes to form the far field of their parent (T1) can be expressed as eight matrix-vector multiplications, followed by an addition of the resulting vectors.

Due to the symmetry of the distribution of the integration points on the spheres, the eight matrices required to represent T1 (T3) are permutations of each other. One matrix can be obtained through suitable row and column permutations of another. Based on this fact, the above combining translation can be expressed as a matrix-matrix multiplication of the translation matrix and a matrix containing eight permuted potential vectors of the children as columns, followed by permutations of the columns of the product matrix, which are then added to form the potential vector of the parent box. A similar approach can be used for T3. This approach saves on the computation and storage of translation matrices and may achieve better arithmetic efficiency through the aggregated matrix-matrix multiplication. However, on the CM-5E, the time for the permutations exceeds the gain in arithmetic efficiency. In our code we store all eight matrices for each translation operator.

Even though permutations are not used in applying the translation operators to the potential field, they could be used in the precomputation phase. Since the permutations depend on K (the number of integration points) in a nontrivial fashion, using permutations in the precomputation stage would require storage of the permutations for all different values of K. In order to conserve memory we explicitly compute all matrices at run time (when K is known). We discuss redundant computation-communication trade-offs in the section on precomputing translation matrices.

Translation Matrices for T2

As described before, each box, except boxes sufficiently close to the boundaries, has 875 boxes in its interactive field. Though each of the eight children of a parent requires 875 matrices, the siblings share many matrices. The interactive-field boxes of the eight siblings have offsets in the range $[-5+i, 4+i] \times [-5+j, 4+j] \times [-5+k, 4+k] \setminus [-2, 2] \times [-2, 2] \times [-2, 2]$, $i, j, k \in \{0, 1\}$, respectively. Each offset corresponds to a different translation matrix. The union of the interactive fields of the eight siblings has $11^3 - 5^3 = 1206$ boxes with 1206 offsets in the range $[-5, 5] \times [-5, 5] \times [-5, 5] \setminus [-2, 2] \times [-2, 2] \times [-2, 2]$.
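The box counts quoted above can be checked by enumerating the offsets directly. The following sketch in plain Python assumes only the offset ranges stated in the text; the function name `interactive_field_offsets` is hypothetical.

```python
from itertools import product


def interactive_field_offsets(i, j, k):
    """Offsets (dx, dy, dz) of the interactive-field boxes of the sibling
    indexed by (i, j, k), each index being 0 or 1."""
    def axis(c):
        return range(-5 + c, 5 + c)  # the closed interval [-5 + c, 4 + c]

    near = range(-2, 3)  # the near field spans [-2, 2] along each axis
    return {
        (dx, dy, dz)
        for dx, dy, dz in product(axis(i), axis(j), axis(k))
        if not (dx in near and dy in near and dz in near)
    }


fields = {ijk: interactive_field_offsets(*ijk) for ijk in product((0, 1), repeat=3)}
union = set().union(*fields.values())

print("boxes per sibling:", sorted({len(f) for f in fields.values()}))
print("boxes in the union:", len(union))
```

Each sibling has 10³ − 5³ = 875 offsets, and the union over the eight siblings covers 11³ − 5³ = 1206 distinct offsets, which is why the siblings can share translation matrices.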
For ease of indexing, we also generate the translation matrices for the 125 subdomains excluded from the interactive field, for a total of $11^3 = 1331$ matrices.

Aggregation of Translations

Aggregation of computations lowers the overheads in computations. In addition, the aggregation of computations may allow for additional optimizations by providing additional degrees of freedom in scheduling operations at a given time. In Anderson's method, aggregation of computations in parent-child interactions results in multiple-instance matrix-matrix multiplications. Since the same translation matrix is used for all interactions between a parent and its child in a given position, the matrix-vector multiplications can be aggregated into a matrix-matrix multiplication. With the data layout used for the 4-D potential arrays, aggregation can only be performed along one of the three space dimensions without a data reallocation in local memory. Thus, for a K × K matrix, aggregation without data reallocation results in a K × K by K × S matrix multiplication, where S is the length of the axis of aggregation. But Sm such matrix multiplications can be treated as one multiple-instance matrix-matrix multiplication, where Sm is the length of the axis chosen for the multiple-instance call to a CMSSL multiple-instance matrix multiplication routine.

On a CM-5E, this aggregation improved the performance of the T1 and T3 translations from 58 Mflops/s/PN to 87 Mflops/s/PN for K = 12. The matrices being multiplied are of shape 12 × 12 and 12 × 8, with 16 such instances handled in a single call. For K = 72 the performance improved from 95 Mflops/s/PN to 96 Mflops/s/PN. Matrix shapes are 72 × 72 and 72 × 4, and the number of instances is 8. The improvement is minor in this case because the matrices are so large that, even before aggregation, the matrix-vector multiplication already achieves good performance.

For the far-field to local-field conversions in the interactive-field interactions, the union of the interactive fields of all the boxes in a subgrid is four boxes deep on all faces of the subgrid, and is prefetched and stored in a subgrid of shape (S1 + 8) × (S2 + 8) × (S3 + 8). Similar to parent-child interactions, each interactive-field conversion for a single box pair is performed as a matrix-vector multiplication. Since all box pairs with the same relative location use the same translation matrix, conversions for all local boxes and their corresponding interactive-field boxes with the same relative location can be aggregated into a single matrix-matrix multiplication. For the interactive-field computations we rearrange the arrays in local memory via copying such that a single-instance matrix multiplication is performed on matrices of shape K × K and K × (S1/2 · S2/2 · S3/2).

For S1 = S2 = 32, S3 = 16 and K = 12, the execution rate of the 12 × 12 by 12 × 2048 matrix multiplication is 119 Mflops/s/PN. If there are no DRAM page faults, the copying requires 2K cycles for a potential vector, for which the matrix multiplication ideally takes K² cycles. Thus, the relative time for copying is 2/K. For K = 12 this amounts to about 17%. With the cost of copying included, the measured performance of the translation is 85 Mflops/s/PN. For S1 = 16, S2 = 16, S3 = 8 and K = 72, the execution rate of the 72 × 72 by 72 × 256 matrix multiplication is 136 Mflops/s/PN. Including the cost of copying, the measured performance is 124 Mflops/s/PN.

The copying cost can be reduced by copying a whole column block of (S1 + 8) × S2/2 × S3/2 boxes into two linear memory blocks: one for even slices of the column, and the other for odd slices. Each local column copy can be used on average 8.75 times. The cost of copying is therefore reduced to a fraction on the order of (S1 + 8)/(S1 K) of that of the matrix multiplication, assuming no page faults.
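The aggregation argument can be made concrete with a small sketch in plain Python. It does not use CMSSL, and the helper names `matvec`, `matmul`, `aggregated_columns`, and `relative_copy_cost` are hypothetical. It shows that multiplying the translation matrix by a matrix whose columns are the potential vectors of several boxes gives the same result as one matrix-vector multiplication per box, and it evaluates the 2/K copying estimate under the stated no-page-fault assumption.

```python
def matvec(T, v):
    """One translation: K x K matrix times one potential vector."""
    return [sum(t * x for t, x in zip(row, v)) for row in T]


def matmul(T, G):
    """Aggregated translation: K x K matrix times a K x n matrix whose
    columns are the potential vectors of n boxes."""
    K, n = len(G), len(G[0])
    return [[sum(T[r][i] * G[i][c] for i in range(K)) for c in range(n)]
            for r in range(len(T))]


def aggregated_columns(S1, S2, S3):
    """Number of columns after aggregating over S1/2 * S2/2 * S3/2 boxes."""
    return (S1 // 2) * (S2 // 2) * (S3 // 2)


def relative_copy_cost(K):
    """Copy cycles (2K) over ideal multiplication cycles (K^2) per vector."""
    return (2 * K) / (K * K)


# Small integer example (K = 3, four boxes) so the comparison is exact.
T = [[1, 2, 0], [0, 1, 3], [4, 0, 1]]
vectors = [[1, 0, 2], [3, 1, 1], [0, 5, 2], [2, 2, 2]]
G = [list(col) for col in zip(*vectors)]  # one column per box
P = matmul(T, G)

print(aggregated_columns(32, 32, 16), aggregated_columns(16, 16, 8))
print(round(relative_copy_cost(12), 3))
```

The single larger multiplication performs the same arithmetic as the separate products, so any gain comes from lower per-call overhead and better use of the memory hierarchy.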
Including the cost of copying, the performance of translations in neighbor interactions reaches 96 and 127 Mflops/s/PN for K = 12 and K = 72, respectively. Copying of sections of subgrids to allow for a K × K by K × (S1/2 · S2/2 · S3/2) matrix multiplication is also an alternative in parent-child interactions, but is not used due to its relatively higher cost. See [17] for details.

Precomputing Translation Matrices

All translation matrices are precomputed. Since the translation matrices are shared by all boxes at all levels, only one copy of each matrix is needed on each VU. Two extreme ways of computing these translation matrices are:
1. to compute all the translation matrices on every VU;
2. to compute each translation matrix only once, with different VUs computing different matrices, followed by a spread to all other VUs as a matrix is needed.

In the first method the computations are embarrassingly parallel and no communication is needed. However, redundant computations are performed. In the second method there is no redundant computation, but replication is required. If there are fewer matrices to be computed than there are VUs, then the VUs can be partitioned into groups with as many VUs in a group as there are matrices to be computed. Each group computes the entire collection of matrices, followed by spreads within that group when a matrix is needed. The replication may also be performed as an all-to-all broadcast [20]. The load balance with this amount of redundant computation is the same as with no redundancy, but the communication cost may be reduced.

On the CM-5E, for K varying from 12 to 72, replicating a K × K translation matrix to all nodes is about three to twelve times faster than computing it. Thus, computing the matrices in parallel followed by replication is always a winning choice. For T1 and T3, replication among eight VUs instead of all VUs is an option. Figure 8 shows the performance of the three methods: no replication, replication among groups of eight VUs, and replication to all VUs. The cost of computing the matrices in parallel followed by replication without grouping is 66% to 24% of that of computing all matrices on each VU, as K varies from 12 to 72. With grouping, the computation cost is the same as without grouping, but the cost of replication is reduced by a factor of 1.75 to 1.26 as K varies from 12 to 72. The difference decreases as K increases because for larger K the replication time is dominated by bandwidth, while for small K latency and overhead dominate.

Figure 8: Computation vs. replication in precomputing translation matrices for T1 (T3) on a 256-node CM-5E. (Time in seconds against the number of integration points on the sphere, for computing 8 matrices on each VU, computing and replicating without grouping, computing and replicating with grouping, and the replication portions without and with grouping.)

For T2, computing one copy of each of the 1331 translation matrices and replicating it across all the nodes is up to an order of magnitude faster than computing all of them on every VU, as shown in Figure 9(a) for a 256-node CM-5E. The time for computing 1331 matrices in parallel decreases on larger CM-5Es, as shown in Figure 9(b), while the replication time, which dominates the total time, increases by about 10-20% for large K as the number of nodes doubles. As a result, the total time for the method increases by at most 62% as the number of nodes changes from 32 to 256.

Figure 9: Computation vs. replication in precomputing translation matrices for T2 on the CM-5E. (a) Computing 1331 matrices on every VU compared with computing in parallel and replicating on 256 PNs. (b) Replication time and parallel computation time on 32, 64, and 256 PNs. Both panels plot time in seconds against the number of integration points on the sphere.
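The grouping scheme for the case of fewer matrices than VUs can be sketched as a simple index mapping. This is an assumption-laden illustration in plain Python: the function name `assign_matrices` is hypothetical, the VU count of 32 is chosen only for the example, and the sketch assumes the number of matrices divides the number of VUs evenly.

```python
def assign_matrices(num_vus, num_matrices):
    """Partition VUs into groups of num_matrices VUs each and return, for
    every VU, the pair (group index, index of the matrix it computes)."""
    if num_vus % num_matrices != 0:
        raise ValueError("sketch assumes num_matrices divides num_vus")
    return [(vu // num_matrices, vu % num_matrices) for vu in range(num_vus)]


# Example: the eight matrices of T1 (or T3) on an assumed 32 VUs.
assignment = assign_matrices(32, 8)

groups = {}
for vu, (group, matrix) in enumerate(assignment):
    groups.setdefault(group, []).append(matrix)

print(len(groups), "groups; each group computes matrices", sorted(groups[0]))
```

Each group holds one copy of every matrix, so a spread of a matrix only needs to reach the VUs of its own group rather than all VUs.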

Storing all 1331 translation matrices in double precision on each VU requires 1331 × 8 × K² bytes of memory, i.e., 1.53 Mbytes for K = 12 and 53.9 Mbytes for K = 72. Therefore, replication of a matrix is delayed until it is needed. The replication is made through one-to-all broadcast rather than all-to-all broadcast. The total number of replications is 1331(h − 1), where h is the depth of the hierarchy, since the T2 translations are used first at level two.

3.4 Particle-particle Interactions in the Near Field

Since the direct evaluation in the near field accounts for about half of the total arithmetic operations at optimal hierarchy depth, its efficiency is crucial to the overall performance of O(N) methods. It is both efficient and convenient to use 4-D particle arrays in the direct evaluation in the near field. The particle-particle interactions can then be viewed as neighbor box-box interactions: each box interacts with its 124 neighbor boxes in the near field. Each neighbor box-box interaction involves all-to-all interactions between particles in one box and particles in the other. Exploiting the symmetry of interaction (Newton's third law) results in 62 instead of 124 box-box interactions.

One way of exploiting symmetry is shown in a 2-D example in Figure 10. As box 0 traverses boxes 1 to 4, the interactions between box 0 and each of the four boxes are computed. The interactions from the four boxes to box 0 are accumulated and travel along with box 0. In data-parallel programming, while box 0 traverses boxes 1 to 4, boxes 5 to 8 traverse box 0, and the interactions between them and box 0 are computed. The interactions from these four boxes to box 0 are accumulated and stored in box 0. Finally, the two contributions to box 0 are combined with the interactions among particles in box 0. A similar idea for exploiting symmetry, applied directly to particles instead of boxes, was used by Applegate et al. [3].

Figure 10: Exploiting symmetry in the direct evaluation in the near field.
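The reduction from 124 to 62 box-box interactions can be illustrated by enumerating the near-field offsets and keeping one offset from each pair of opposite offsets. The sketch below is plain Python; the choice of the lexicographically positive offset as the representative is an assumption made for illustration and is not necessarily the traversal order used in the code described here.

```python
from itertools import product


def near_field_offsets():
    """All offsets of the 5 x 5 x 5 near field except the box itself."""
    return [o for o in product(range(-2, 3), repeat=3) if o != (0, 0, 0)]


def half_offsets():
    """One representative of each pair {o, -o}: the offset whose first
    nonzero component is positive (tuple comparison is lexicographic)."""
    return [o for o in near_field_offsets() if o > (0, 0, 0)]


def negate(o):
    return tuple(-c for c in o)


full = near_field_offsets()
half = set(half_offsets())

print(len(full), "neighbor boxes,", len(half), "box-box interactions computed")
```

Each computed interaction serves both boxes of a pair: the contribution to the target box is accumulated locally, and the opposite contribution is accumulated for the neighbor.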
In three dimensions, the boxes involved in box-box interactions of a target box can be ordered linearly and brought to the target box through 62 single-step CSHIFTs. Another way is to fetch nonlocal near-field boxes from other VUs using 4-D arrays aliased into local subgrids through array aliasing, much in the same way as in fetching nonlocal interactive-field boxes, as described in the section on interactive-field box-box communication. Due to an optimization trading memory requirements for arithmetic efficiency, described below, the memory requirements in the near-field interactions are high. For this reason we choose the first method, since it requires less temporary storage. Moreover, the CSHIFTs account for only about 10-15% of the time for the direct evaluation.

3.5 Load-balancing Issues in Nonadaptive O(N) Methods

Nonadaptive hierarchical methods use nonadaptive domain decomposition, and the hierarchy of recursively decomposed domains is balanced. Three sources of parallelism exist in traversing the hierarchy, namely, among all the boxes at the same level in parent-child and interactive-field interactions, among each box's interactive-field boxes in the far-field to local-field conversions, and among all boxes at all levels in the

Parallel matrix-vector multiplication

Parallel matrix-vector multiplication Appendx A Parallel matrx-vector multplcaton The reduced transton matrx of the three-dmensonal cage model for gel electrophoress, descrbed n secton 3.2, becomes excessvely large for polymer lengths more

More information

Chapter 1. Comparison of an O(N ) and an O(N log N ) N -body solver. Abstract

Chapter 1. Comparison of an O(N ) and an O(N log N ) N -body solver. Abstract Chapter 1 Comparson of an O(N ) and an O(N log N ) N -body solver Gavn J. Prngle Abstract In ths paper we compare the performance characterstcs of two 3-dmensonal herarchcal N-body solvers an O(N) and

More information

Parallelism for Nested Loops with Non-uniform and Flow Dependences

Parallelism for Nested Loops with Non-uniform and Flow Dependences Parallelsm for Nested Loops wth Non-unform and Flow Dependences Sam-Jn Jeong Dept. of Informaton & Communcaton Engneerng, Cheonan Unversty, 5, Anseo-dong, Cheonan, Chungnam, 330-80, Korea. seong@cheonan.ac.kr

More information

Programming in Fortran 90 : 2017/2018

Programming in Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Exercse 1 : Evaluaton of functon dependng on nput Wrte a program who evaluate the functon f (x,y) for any two user specfed values

More information

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour 6.854 Advanced Algorthms Petar Maymounkov Problem Set 11 (November 23, 2005) Wth: Benjamn Rossman, Oren Wemann, and Pouya Kheradpour Problem 1. We reduce vertex cover to MAX-SAT wth weghts, such that the

More information

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009.

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009. Farrukh Jabeen Algorthms 51 Assgnment #2 Due Date: June 15, 29. Assgnment # 2 Chapter 3 Dscrete Fourer Transforms Implement the FFT for the DFT. Descrbed n sectons 3.1 and 3.2. Delverables: 1. Concse descrpton

More information

Chapter 6 Programmng the fnte element method Inow turn to the man subject of ths book: The mplementaton of the fnte element algorthm n computer programs. In order to make my dscusson as straghtforward

More information

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique //00 :0 AM Outlne and Readng The Greedy Method The Greedy Method Technque (secton.) Fractonal Knapsack Problem (secton..) Task Schedulng (secton..) Mnmum Spannng Trees (secton.) Change Money Problem Greedy

More information

ELEC 377 Operating Systems. Week 6 Class 3

ELEC 377 Operating Systems. Week 6 Class 3 ELEC 377 Operatng Systems Week 6 Class 3 Last Class Memory Management Memory Pagng Pagng Structure ELEC 377 Operatng Systems Today Pagng Szes Vrtual Memory Concept Demand Pagng ELEC 377 Operatng Systems

More information

Load Balancing for Hex-Cell Interconnection Network

Load Balancing for Hex-Cell Interconnection Network Int. J. Communcatons, Network and System Scences,,, - Publshed Onlne Aprl n ScRes. http://www.scrp.org/journal/jcns http://dx.do.org/./jcns.. Load Balancng for Hex-Cell Interconnecton Network Saher Manaseer,

More information

The Codesign Challenge

The Codesign Challenge ECE 4530 Codesgn Challenge Fall 2007 Hardware/Software Codesgn The Codesgn Challenge Objectves In the codesgn challenge, your task s to accelerate a gven software reference mplementaton as fast as possble.

More information

An Optimal Algorithm for Prufer Codes *

An Optimal Algorithm for Prufer Codes * J. Software Engneerng & Applcatons, 2009, 2: 111-115 do:10.4236/jsea.2009.22016 Publshed Onlne July 2009 (www.scrp.org/journal/jsea) An Optmal Algorthm for Prufer Codes * Xaodong Wang 1, 2, Le Wang 3,

More information

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz Compler Desgn Sprng 2014 Regster Allocaton Sample Exercses and Solutons Prof. Pedro C. Dnz USC / Informaton Scences Insttute 4676 Admralty Way, Sute 1001 Marna del Rey, Calforna 90292 pedro@s.edu Regster

More information

Module Management Tool in Software Development Organizations

Module Management Tool in Software Development Organizations Journal of Computer Scence (5): 8-, 7 ISSN 59-66 7 Scence Publcatons Management Tool n Software Development Organzatons Ahmad A. Al-Rababah and Mohammad A. Al-Rababah Faculty of IT, Al-Ahlyyah Amman Unversty,

More information

Virtual Memory. Background. No. 10. Virtual Memory: concept. Logical Memory Space (review) Demand Paging(1) Virtual Memory

Virtual Memory. Background. No. 10. Virtual Memory: concept. Logical Memory Space (review) Demand Paging(1) Virtual Memory Background EECS. Operatng System Fundamentals No. Vrtual Memory Prof. Hu Jang Department of Electrcal Engneerng and Computer Scence, York Unversty Memory-management methods normally requres the entre process

More information

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr) Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute

More information

Cluster Analysis of Electrical Behavior

Cluster Analysis of Electrical Behavior Journal of Computer and Communcatons, 205, 3, 88-93 Publshed Onlne May 205 n ScRes. http://www.scrp.org/ournal/cc http://dx.do.org/0.4236/cc.205.350 Cluster Analyss of Electrcal Behavor Ln Lu Ln Lu, School

More information

Efficient Distributed File System (EDFS)

Efficient Distributed File System (EDFS) Effcent Dstrbuted Fle System (EDFS) (Sem-Centralzed) Debessay(Debsh) Fesehaye, Rahul Malk & Klara Naherstedt Unversty of Illnos-Urbana Champagn Contents Problem Statement, Related Work, EDFS Desgn Rate

More information

Hierarchical clustering for gene expression data analysis

Hierarchical clustering for gene expression data analysis Herarchcal clusterng for gene expresson data analyss Gorgo Valentn e-mal: valentn@ds.unm.t Clusterng of Mcroarray Data. Clusterng of gene expresson profles (rows) => dscovery of co-regulated and functonally

More information

S1 Note. Basis functions.

S1 Note. Basis functions. S1 Note. Bass functons. Contents Types of bass functons...1 The Fourer bass...2 B-splne bass...3 Power and type I error rates wth dfferent numbers of bass functons...4 Table S1. Smulaton results of type

More information

Lecture 5: Multilayer Perceptrons

Lecture 5: Multilayer Perceptrons Lecture 5: Multlayer Perceptrons Roger Grosse 1 Introducton So far, we ve only talked about lnear models: lnear regresson and lnear bnary classfers. We noted that there are functons that can t be represented

More information

Mathematics 256 a course in differential equations for engineering students

Mathematics 256 a course in differential equations for engineering students Mathematcs 56 a course n dfferental equatons for engneerng students Chapter 5. More effcent methods of numercal soluton Euler s method s qute neffcent. Because the error s essentally proportonal to the

More information

Edge Detection in Noisy Images Using the Support Vector Machines

Edge Detection in Noisy Images Using the Support Vector Machines Edge Detecton n Nosy Images Usng the Support Vector Machnes Hlaro Gómez-Moreno, Saturnno Maldonado-Bascón, Francsco López-Ferreras Sgnal Theory and Communcatons Department. Unversty of Alcalá Crta. Madrd-Barcelona

More information

Support Vector Machines

Support Vector Machines /9/207 MIST.6060 Busness Intellgence and Data Mnng What are Support Vector Machnes? Support Vector Machnes Support Vector Machnes (SVMs) are supervsed learnng technques that analyze data and recognze patterns.

More information

Machine Learning: Algorithms and Applications

Machine Learning: Algorithms and Applications 14/05/1 Machne Learnng: Algorthms and Applcatons Florano Zn Free Unversty of Bozen-Bolzano Faculty of Computer Scence Academc Year 011-01 Lecture 10: 14 May 01 Unsupervsed Learnng cont Sldes courtesy of

More information

Analysis of Continuous Beams in General

Analysis of Continuous Beams in General Analyss of Contnuous Beams n General Contnuous beams consdered here are prsmatc, rgdly connected to each beam segment and supported at varous ponts along the beam. onts are selected at ponts of support,

More information

R s s f. m y s. SPH3UW Unit 7.3 Spherical Concave Mirrors Page 1 of 12. Notes

R s s f. m y s. SPH3UW Unit 7.3 Spherical Concave Mirrors Page 1 of 12. Notes SPH3UW Unt 7.3 Sphercal Concave Mrrors Page 1 of 1 Notes Physcs Tool box Concave Mrror If the reflectng surface takes place on the nner surface of the sphercal shape so that the centre of the mrror bulges

More information

Analysis of 3D Cracks in an Arbitrary Geometry with Weld Residual Stress

Analysis of 3D Cracks in an Arbitrary Geometry with Weld Residual Stress Analyss of 3D Cracks n an Arbtrary Geometry wth Weld Resdual Stress Greg Thorwald, Ph.D. Ted L. Anderson, Ph.D. Structural Relablty Technology, Boulder, CO Abstract Materals contanng flaws lke nclusons

More information

Accounting for the Use of Different Length Scale Factors in x, y and z Directions

Accounting for the Use of Different Length Scale Factors in x, y and z Directions 1 Accountng for the Use of Dfferent Length Scale Factors n x, y and z Drectons Taha Soch (taha.soch@kcl.ac.uk) Imagng Scences & Bomedcal Engneerng, Kng s College London, The Rayne Insttute, St Thomas Hosptal,

More information

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points;

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points; Subspace clusterng Clusterng Fundamental to all clusterng technques s the choce of dstance measure between data ponts; D q ( ) ( ) 2 x x = x x, j k = 1 k jk Squared Eucldean dstance Assumpton: All features

More information

An Application of the Dulmage-Mendelsohn Decomposition to Sparse Null Space Bases of Full Row Rank Matrices

An Application of the Dulmage-Mendelsohn Decomposition to Sparse Null Space Bases of Full Row Rank Matrices Internatonal Mathematcal Forum, Vol 7, 2012, no 52, 2549-2554 An Applcaton of the Dulmage-Mendelsohn Decomposton to Sparse Null Space Bases of Full Row Rank Matrces Mostafa Khorramzadeh Department of Mathematcal

More information

Sorting Review. Sorting. Comparison Sorting. CSE 680 Prof. Roger Crawfis. Assumptions
