

Journal of Computational Physics xxx (2010) xxx–xxx

Fast evaluation of Helmholtz potential on graphics processing units (GPUs)

Shaojing Li *, Boris Livshitz, Vitaliy Lomakin

Department of Electrical and Computer Engineering, University of California, San Diego, United States

Article history: Received 21 August; received in revised form 19 June; accepted 21 July; available online xxxx

Keywords: Graphics processing units (GPUs); Computational electromagnetics; Fast methods; Non-uniform grid interpolation methods; Integral equations

Abstract

This paper presents a parallel algorithm implemented on graphics processing units (GPUs) for rapidly evaluating spatial convolutions between the Helmholtz potential and a large-scale source distribution. The algorithm implements a non-uniform grid interpolation method (NGIM), which uses amplitude and phase compensation and spatial interpolation from a sparse grid to compute the field outside a source domain. NGIM reduces the computational time cost of the direct field evaluation at $N$ observers due to $N$ co-located sources from $O(N^2)$ to $O(N)$ in the static and low-frequency regimes, to $O(N \log N)$ in the high-frequency regime, and between these costs in the mixed-frequency regime. Memory requirements scale as $O(N)$ in all frequency regimes. Achieving optimal performance requires several important differences between the CPU and GPU implementations of the NGIM. In particular, in the CPU implementations all operations, where possible, are pre-computed and stored in memory in a preprocessing stage. This reduces the computational time but significantly increases the memory consumption. In the GPU implementations, where memory handling is often a critical bottleneck, several approaches are used to accelerate the computations. The significant latency of GPU global memory access is hidden by implementing coalesced reading, which requires arranging many array elements in contiguous parts of memory. Contrary to the CPU version, most of the steps in the GPU implementations are executed on the fly and only the necessary arrays are kept in memory. This results in significantly reduced memory consumption, an increased problem size $N$ that can be handled, and reduced computational time on GPUs. The obtained GPU–CPU speed-up ratios range from 150 to 400 depending on the required accuracy and problem size. The presented method and its CPU and GPU implementations can find important applications in various fields of physics and engineering.

© 2010 Published by Elsevier Inc.

* Corresponding author. E-mail addresses: sl@ucsd.edu (S. Li), vtaly@ece.ucsd.edu (V. Lomakin).

1. Introduction

This paper is concerned with the evaluation of the discrete transform of the form

\[
u(\mathbf{r}_m) = \sum_{n=1,\, n \neq m}^{N} \frac{e^{-jk|\mathbf{r}_m - \mathbf{r}_n|}}{|\mathbf{r}_m - \mathbf{r}_n|}\, Q_n, \qquad m = 1, 2, \ldots, N \tag{1}
\]

on graphics processing units (GPUs) with computational complexity (i.e. the computational time and number of operations) of $O(N)$ in the low-frequency regime, $O(N \log N)$ in the high-frequency regime, and between these costs in the mixed-frequency regime. In Eq. (1), the potential $u(\mathbf{r}_m)$ at the observation locations $\mathbf{r}_m$ is evaluated by a discrete convolution of the Green's function $G(\mathbf{r}_m, \mathbf{r}_n) = \exp(-jk|\mathbf{r}_m - \mathbf{r}_n|)/|\mathbf{r}_m - \mathbf{r}_n|$ and sources $Q_n$ co-located with the observers. The sources $Q_n$ are distributed

in a domain of linear size $D$. The total number of sources and observers is $N$. The potential $u(\mathbf{r})$ and the Green's function $G(\mathbf{r}, \mathbf{r}')$ satisfy the Helmholtz equation with the wavenumber $k$ corresponding to the wavelength $\lambda = 2\pi/k$ and frequency $f = kc/(2\pi)$ with wave velocity $c$. In the low-frequency regime the computational domain is small in terms of the wavelength ($D \ll \lambda$, or $kD \ll 1$) and the source density is prescribed by a particular problem (e.g. by the geometrical features in the framework of integral equations [1]). In the high-frequency regime, the computational domain is large in terms of the wavelength ($D \gg \lambda$) and the source density is determined by the wavelength according to the Nyquist criterion. In the special case of $k = f = 0$ the potential satisfies the Poisson equation and the problem is considered to be static.

The task of evaluating the potential in Eq. (1) is important in many areas of computational physics, including electromagnetics, optics, magnetics, acoustics, and elastodynamics, among others. This computational task can appear in problems that require finding fields due to a given source distribution. It is also key to solving integral equations by iterative methods [1], in which the discrete transformation of (1) approximates continuous spatial convolutions. The computational complexity of the direct evaluation of (1) scales as $O(N^2)$. This quadratic dependence on $N$ severely limits the practical applicability of many solvers that involve spatial convolutions.

The task of reducing this high computational cost has been a subject of extensive investigation. Acceleration can be achieved by developing fast methods that reduce the asymptotic complexity of the evaluation of (1) from $O(N^2)$ to $O(N)$ or $O(N \log N)$ operations, as well as by developing parallelization techniques that allow these algorithms to utilize multi-processor systems.
Methods reducing the computational complexity include the fast Fourier transform (FFT) based methods [2–5], fast multipole methods (FMMs) [6–15], and interpolation-based methods [16–20]. The FFT-based methods exploit the space-invariant nature of the Green's function and reduce the computational cost to $O(N \log N)$ and $O(N^{3/2} \log N)$ for volumetric and general surface problems, respectively. FFT-based methods can handle low- and high-frequency regimes but become inefficient for non-uniform problems where a large fraction of a structure is free from sources/observers. The FMM exploits spectral representations of the fields in terms of multipole expansions or plane wave expansions with diagonal translation operators. The computational cost of FMMs scales as $O(N \log N)$ and $O(N)$ for high- and low-frequency problems, respectively. The interpolation-based methods, including the non-uniform grid interpolation method (NGIM), exploit the fact that the field potential far from a source distribution is a function with a known asymptotic behavior. The knowledge of this behavior allows smoothing the fast spatial variations of the potential, computing it on a sparse grid, and interpolating to the required observation points. Symmetric implementations of the NGIM have been shown to scale as $O(N)$ for static problems [18,21]. Asymmetric modifications of the NGIM have a cost of $O(N \log N)$ for high-, low-, and mixed-frequency problems [16,17,20]. Similar to FMMs, NGIMs can handle non-uniform geometries and can have the same asymptotic cost for volumetric and surface problems.

Even fast methods implemented on a single processor are limited in their performance. As the performance of a single core saturates, utilizing multi-core and multi-processor systems becomes crucial for advancing future high-performance computing. An important aspect of these efforts is the parallel implementation of fast convolutional methods to evaluate the transformation in Eq. (1). There exist multiple parallel implementations of the FFT-based methods and FMMs.
Most of these implementations utilize central processing unit (CPU) cluster systems [22–25].

Exciting opportunities arise when using new hardware architectures, including graphics processing units (GPUs). Such systems, originally designed for gaming and graphics processing, comprise hundreds and even thousands of stream processors in a single envelope at a low cost (for example, the NVIDIA GeForce GTX 480 GPU has 480 stream processors). GPU systems offer the power of a CPU-based cluster but at the cost of a simple desktop computer. Moreover, the recently introduced Compute Unified Device Architecture (CUDA) allows writing general purpose high-level codes without considering the specifics of a particular GPU. In recent years, there has been a significant effort to port various methods developed for serial and parallel CPU systems to GPUs. Several authors presented efficient GPU-based codes for N-body problems relying on direct summations (as in Eq. (1)) [26,27]. While the speed-up of these codes can be very impressive, the fact that the computational cost scales as $O(N^2)$ does not allow using these codes for large-scale practical problems. In a recent paper [28], the authors presented a CUDA implementation of the FMM for static problems ($k = 0$) and obtained good acceleration rates. CUDA implementations of interpolation-based methods for statics have also been presented [29]. However, to the best of our knowledge, no CUDA implementations of fast codes for the evaluation of the Helmholtz-type potential in Eq. (1) have been reported in the open literature.

This paper describes highly efficient CPU and GPU implementations of a modification of the NGIM for the fast evaluation of the potential $u(\mathbf{r}_m)$ in Eq. (1). The NGIM differs from the FMMs in that it relies on direct spatial interpolations. As a result, the same NGIM can be applied to static ($k = 0$) and dynamic ($k \neq 0$) problems, as well as to problems with other kernels.
In addition, NGIM has a simple structure and does not require any special function evaluations, which facilitates its implementation on GPU systems. The asymptotic computational time cost of the NGIM scales as $O(N)$ for static/low-frequency problems, as $O(N \log N)$ for high-frequency problems, and between these costs for mixed-frequency problems. The memory consumption is $O(N)$ for all problem types. We describe how the data structures of the CPU and GPU NGIM implementations are arranged to optimize their performance. Specifically, the memory handling is described in detail as it is a crucial component for speeding up the GPU implementations. It is shown how one can choose the parameters of NGIM to balance the computational costs of the different stages of its implementation, leading to optimal performance. The achieved single GPU–single CPU speed-up ratios are shown to be in the range 150–400 depending on the accuracy and problem size. Matching the achieved speeds on CPU systems would require a cluster of much higher cost, higher power consumption, and possibly dedicated space infrastructure.
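For reference, the direct summation of Eq. (1), which the fast method is meant to replace, can be sketched in a few lines. This is a minimal illustration written in Python for brevity, not the CPU or GPU code discussed in this paper; the function name `direct_potential` is hypothetical, and the kernel sign convention $e^{-jkR}/R$ is assumed from Eq. (1).

```python
import cmath
import math


def direct_potential(points, charges, k):
    """Evaluate Eq. (1) by direct summation; the cost is O(N^2)."""
    n = len(points)
    u = [0j] * n
    for m in range(n):
        acc = 0j
        for j in range(n):
            if j == m:
                continue  # skip the singular self-term (n != m in Eq. (1))
            dist = math.dist(points[m], points[j])
            # Assumed kernel: exp(-j k R) / R
            acc += cmath.exp(-1j * k * dist) / dist * charges[j]
        u[m] = acc
    return u
```

The doubly nested loop makes the quadratic growth of the cost with $N$ explicit, which is the reason direct evaluation becomes impractical for large-scale problems.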

The paper outline is as follows. Section 2 presents the foundations of the NGIM, starting with the description of the grid construction approaches in Section 2.1 and proceeding with the multi-level NGIM in Section 2.2. Section 3 details the implementation of the multi-level NGIM on GPUs for each of its steps. Section 4 shows computational results demonstrating the power of the NGIM and its implementations. Finally, Section 5 summarizes the paper and discusses the presented method and obtained results.

2. Foundations of the non-uniform grid interpolation method

2.1. Grid construction

The multi-level NGIM decomposes the geometry into a hierarchy of boxes containing sources and observers. The acceleration is based on the smoothness and interpolation properties of the amplitude- and phase-compensated fields due to sufficiently separated source and observer distributions. This section describes the properties of the field generated by a source distribution confined to a source box and observed in the space outside this box, as well as the field in an observation box generated by the sources outside this box. These properties are used to construct sparse grids that allow computing the field at any observation location by locally interpolating it from the samples at these sparse grids.

2.1.1. Outgoing spherical non-uniform grids from a source box $B_s$

Consider a source box $B_s$ of the largest diagonal size $2R_s$ centered at $\mathbf{r}_s$ that contains a subset of the entire source distribution. The potential $u(\mathbf{r}; B_s)$ due to the sources in $B_s$ at an observation point $\mathbf{r}$ sufficiently distant from the source, such that $r_s = |\mathbf{r} - \mathbf{r}_s| > \gamma R_s$ with $\gamma > 1$, can be evaluated by the amplitude and phase restoration [16–18,21]

\[
u(\mathbf{r}; B_s) = \frac{e^{-jk r_s}}{r_s}\, \tilde{u}(\mathbf{r}; B_s) \tag{2}
\]

from the amplitude- and phase-compensated potential

\[
\tilde{u}(\mathbf{r}; B_s) = r_s\, e^{jk r_s}\, u(\mathbf{r}; B_s). \tag{3}
\]

In Eq.
(3), the multiplication of the field by the distance $r_s$ and the exponential term $e^{jk r_s}$ compensates the field variations associated with the amplitude decay and phase shift common to all sources in the box $B_s$ for the distant observer $\mathbf{r}$. This compensation cancels the rapid field variations of the actual potential $u(\mathbf{r}; B_s)$ and renders the compensated potential $\tilde{u}(\mathbf{r}; B_s)$ a spatially slowly varying function. As a result, the potential $u(\mathbf{r}; B_s)$ at $\mathbf{r}$ can be computed by evaluating the amplitude- and phase-compensated potential over a highly sparse grid $\breve{\mathbf{r}}_i$ surrounding the observer, locally interpolating the compensated potential, and restoring the amplitude and phase at $\mathbf{r}$ via (2):

\[
u(\mathbf{r}; B_s) = \sum_i \frac{e^{-jk|\mathbf{r} - \mathbf{r}_s|}}{|\mathbf{r} - \mathbf{r}_s|}\, w(\mathbf{r}, \breve{\mathbf{r}}_i)\, \tilde{u}(\breve{\mathbf{r}}_i) = \sum_i \breve{w}(\mathbf{r}, \breve{\mathbf{r}}_i)\, \tilde{u}(\breve{\mathbf{r}}_i). \tag{4}
\]

In Eq. (4), the interpolation is local, i.e. the summations are over $O(1)$ samples at the grid points $\breve{\mathbf{r}}_i$ surrounding the observer at $\mathbf{r}$, with interpolation coefficients $w(\mathbf{r}, \breve{\mathbf{r}}_i)$. In the last equality in Eq. (4), the coefficients $\breve{w}(\mathbf{r}, \breve{\mathbf{r}}_i)$ include the interpolation and field restoration coefficients. Following Refs. [17,18,21], the sparse grid $\breve{\mathbf{r}}_i$ is chosen to be spherical, i.e. it is constructed based on two angular coordinates $\theta_s$, $\varphi_s$ and a radial coordinate $r_s$ defined with respect to the source box center $\mathbf{r}_s$. The minimal angular and radial sampling rates for constructing the non-uniform spherical grids have been derived in Ref. [17]. The angular $\theta$- and $\varphi$-sampling rate $f_a^s$ satisfies $f_a^s = \Omega_a (k R_s + 1)/\pi$, where $\Omega_a \geq 1$ is an angular oversampling ratio [17]. The required local radial sampling rate for the low-frequency case is derived following the introduction of a new reciprocal radial variable $\alpha = 1/r_s$. The uniform $\alpha$-sampling rate $f_\alpha^s$ should satisfy $f_\alpha^s = \Omega_r R_s/(4\pi)$, where $\Omega_r \geq 1$ is a radial oversampling ratio. More rigorous estimates of the sampling rates can be derived, but they do not change the practically obtained results in terms of the computational time and memory consumption.
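The effect of the compensation in Eqs. (2)–(4) can be illustrated along a single radial line. The sketch below is a simplified one-dimensional illustration under stated assumptions, not the implementation described in this paper: the source positions, charges, the sample count `n_samples`, and all function names are hypothetical, the angular interpolation is omitted, and two-point linear interpolation in $\alpha = 1/r_s$ stands in for the higher-order local interpolation. It returns the largest relative error obtained when interpolating the compensated potential (followed by restoration) and, for comparison, when interpolating the raw potential on the same grid.

```python
import cmath
import math


def potential(r_obs, sources, charges, k):
    """Direct field of the box sources at one observer (kernel of Eq. (1))."""
    total = 0j
    for r_src, q in zip(sources, charges):
        d = math.dist(r_obs, r_src)
        total += q * cmath.exp(-1j * k * d) / d
    return total


def lerp_in_alpha(alpha, alphas, values):
    """Two-point (linear) interpolation on a grid uniform in alpha = 1/r."""
    h = alphas[1] - alphas[0]
    i = min(max(int((alpha - alphas[0]) / h), 0), len(alphas) - 2)
    t = (alpha - alphas[i]) / h
    return (1.0 - t) * values[i] + t * values[i + 1]


def radial_ngim_demo(k=1.0, n_samples=9, r_min=1.5, r_max=10.0):
    # Hypothetical sources inside a box centred at the origin (r_s = 0).
    sources = [(0.2, 0.1, -0.1), (-0.25, 0.15, 0.2), (0.1, -0.3, 0.05)]
    charges = [1.0, 0.5, 0.8]
    # Grid uniform in alpha, hence non-uniform (sparser with distance) in r.
    a_lo, a_hi = 1.0 / r_max, 1.0 / r_min
    step = (a_hi - a_lo) / (n_samples - 1)
    alphas = [a_lo + i * step for i in range(n_samples)]
    raw = [potential((1.0 / a, 0.0, 0.0), sources, charges, k) for a in alphas]
    # Eq. (3): remove the common amplitude decay and phase shift.
    comp = [(1.0 / a) * cmath.exp(1j * k / a) * u for a, u in zip(alphas, raw)]
    err_comp = err_raw = 0.0
    for r in (2.0, 3.3, 7.0):
        exact = potential((r, 0.0, 0.0), sources, charges, k)
        # Eq. (2): restore amplitude and phase after interpolating u-tilde.
        u_comp = cmath.exp(-1j * k * r) / r * lerp_in_alpha(1.0 / r, alphas, comp)
        u_raw = lerp_in_alpha(1.0 / r, alphas, raw)
        err_comp = max(err_comp, abs(u_comp - exact) / abs(exact))
        err_raw = max(err_raw, abs(u_raw - exact) / abs(exact))
    return err_comp, err_raw
```

With the default (assumed) parameters, the compensated interpolation should be far more accurate than interpolating the raw potential on the same sparse grid, since the phase of the raw field advances by several radians between neighbouring samples at large distances.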
The required uniform sampling versus $\alpha$ translates into a non-uniform sampling versus $r_s$ with a local sampling rate of $f_\alpha^s = \Omega_r R_s (k R_s + 1)/(4\pi)$, thereby justifying the notion of a non-uniform grid (NG). Clearly, the radial sampling becomes sparser as the distance between the observation and the source points increases.

The local interpolation in Eq. (4) can be accomplished using various polynomial (e.g. Lagrange) interpolations. The error can be controlled by increasing the interpolation order or by increasing the oversampling ratios, i.e. increasing the density of the non-uniform grid. This non-uniform grid interpolation procedure allows for a significant reduction of the computational complexity because only a small number of samples is required at the sparse non-uniform grid to compute the field at a large number of observers and because the local interpolation is computationally inexpensive.

2.1.2. Local Cartesian uniform grids at a box $B_o$

Consider an observation box $B_o$ of the largest size $2R_o$ centered at $\mathbf{r}_o$ that contains a subset of all observation locations. The potential $u(\mathbf{r}; B_o)$ inside $B_o$ is produced by sources outside a sphere of radius $\gamma R_o$. The potential $u(\mathbf{r}; B_o)$ can be computed by evaluating it over a grid of points $\hat{\mathbf{r}}_i$ inside or around the box $B_o$ and locally interpolating it to the observation location $\mathbf{r}$

\[
u(\mathbf{r}; B_o) = \sum_i \hat{w}(\mathbf{r}, \hat{\mathbf{r}}_i)\, u(\hat{\mathbf{r}}_i). \tag{5}
\]

Similar to Eq. (4), the interpolation in Eq. (5) is local, i.e. it is executed from $O(1)$ samples at the grid points $\hat{\mathbf{r}}_i$ surrounding the observer at $\mathbf{r}$ with interpolation coefficients $\hat{w}(\mathbf{r}, \hat{\mathbf{r}}_i)$. Several approaches can be followed to construct the grid $\hat{\mathbf{r}}_i$. For example, in Refs. [18,21], spherical grids were used for the static problem ($k = 0$). Here, for convenience of interpolation, we use uniform Cartesian grids (CGs) aligned with the observation boxes. The sampling rate in the $x$, $y$, and $z$ coordinates satisfies $f_x^o = \Omega_x (k + 1/R_o)$, where $\Omega_x \geq 1$ is the oversampling ratio. As in the case of non-uniform grids, the local interpolation in Eq. (5) can be based on various polynomial (e.g. Lagrange) interpolation types.

In the low-frequency regime with $k R_o \ll 1$ the number of CG samples is (almost) frequency independent and it can be much smaller than the number of observers, which results in a significant reduction of the number of operations. In the high-frequency regime, the number of observers in the observer box is of the same order as the number of grid points. Introducing such grids would result in no computational gain and could reduce accuracy; hence CGs are not defined for boxes of size comparable to or larger than the wavelength.

Fig. 2.1. Illustration of different geometric relations between source boxes and a certain observation box. Legend: level $l$ box, level $l+1$ box, level $l+2$ box; observation box, interacting far-field box, non-interacting far-field box, near-field box.

2.2. Multi-level non-uniform grid interpolation method

2.2.1. Domain decomposition and data structure

The computational domain is decomposed into a multi-level hierarchy of boxes, similarly to FMMs. Starting from the largest box of side length $D$ enclosing the entire domain, the boxes are recursively subdivided.
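The subdivision just described can be sketched as follows. This is a minimal illustration under stated assumptions rather than the data structure used in this paper: the domain is taken to be the cube $[0, D)^3$, a box at level $l$ is taken to have side $D/2^l$, the boxes are found by binning the points level by level (which produces the same hierarchy as a literal recursion), and the names `build_box_hierarchy` and `parent` are hypothetical.

```python
from collections import defaultdict


def build_box_hierarchy(points, domain_size, num_levels):
    """Return a list indexed by level l = 0..L of {box index: [point indices]}.

    Only non-empty boxes are stored. A box is identified by its integer
    index (ix, iy, iz) along each axis at its level.
    """
    levels = []
    for level in range(num_levels + 1):
        per_side = 2 ** level
        side = domain_size / per_side
        boxes = defaultdict(list)
        for idx, p in enumerate(points):
            # Clamp so that points on the upper boundary stay in the last box.
            key = tuple(min(int(c / side), per_side - 1) for c in p)
            boxes[key].append(idx)
        levels.append(dict(boxes))
    return levels


def parent(key):
    """Index of the parent box one level up (a parent has up to 8 children)."""
    return tuple(c // 2 for c in key)
```

For example, `build_box_hierarchy(points, 1.0, 2)` yields three dictionaries (levels 0, 1, and 2), and `parent` maps a level-2 box index to the level-1 box that contains it.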
The subdivision hierarchy is stored in an octal tree, where each box $B_n^l$ of level $l$ ($0 < l < L$) and side length $D_l = 2^{-l} D$ is considered to be a parent to its (eight) child boxes $B_m^{l+1}$ of size $D_{l+1} = 2^{-(l+1)} D$ at level $(l+1)$. The size of the smallest box (at the finest level) is considered to be much smaller than the wavelength for low- and high-frequency problems. The Local List $I_L(B_m^L)$ stores pointers to the sources contained in the non-empty box $B_m^L$ at the finest level $L$. Only non-empty boxes are kept in the actual data structure.

Based on the geometric subdivision of the computational domain, pairs of near- and interacting far-field boxes are identified. A pair of boxes is considered near-field when they are close, such that the maximal distance between them is less than a preset factor $\gamma \sqrt{3}\, D_L$ (typically $2 < \gamma < 4$), and their parent boxes do not form a far-field pair; such boxes exist at level $L$ (green boxes in Fig. 2.1; for interpretation of color in Fig. 2.1, the reader is referred to the web version of this article). The construction of the near-field box pairs leads to the definition of near-field interaction lists (NIL) $I_N(B_n^L)$, whose members are all boxes paired with $B_n^L$ in the near-field. The number of levels $L$ is chosen such that for each smallest box there are $O(1)$ near-field pairs and the total number of near-field pairs is $O(N)$. A pair of same-level boxes is considered interacting far-field when the maximal distance between them is greater than $\gamma \sqrt{3}\, D_l$ and the distance between their parent boxes is smaller than $\gamma \sqrt{3}\, D_{l-1}$ (yellow boxes in Fig. 2.1). The construction of the far-field box pairs leads to the definition of far-field interaction lists (FIL) $I_F(B_n^l)$, which for each observer box $B_n^l$ (at all levels) contains all boxes paired with

$B_n^l$ in the far-field (at the same level). For any box at any level there are only $O(1)$ far-field pairs. For the finest level $L$, the number of far-field interaction pairs is $O(N)$ (assuming that the number of sources/observers per box is $O(1)$). For coarser levels the number of boxes and far-field pairs decreases so that the total number of far-field interaction pairs is $O(N)$.

Each non-empty box at all levels contains NG samples. In the low-frequency regime, the density of the NGs is constant at all levels and the number of samples per box is $O(1)$. In the high-frequency regime, the density of NGs for the boxes at coarser levels increases as $O(8^{L-l})$ (since it is related to the wavelength). CGs are defined for boxes at levels finer than or equal to a certain interface level, i.e. $l \geq l_{int}$ with $2 \leq l_{int} \leq L$. The interface level is a level at which the box size is sufficiently smaller than the wavelength. For levels $l \geq l_{int}$, referred to as low-frequency levels, CGs reduce the computational complexity as explained in Section 2.1. For levels $l < l_{int}$, referred to as high-frequency levels, CGs are omitted. The introduction of the interface level makes the algorithm efficient in the low-, high-, and mixed-frequency regimes. In the low-frequency regime, $l_{int}$ is chosen as the coarsest level $l_{int} = 2$ at which far-field interaction pairs exist, i.e. all levels are low-frequency levels. In the high-frequency regime, $l_{int}$ is chosen at the finest level $l_{int} = L$, i.e. all levels are high-frequency levels. In the mixed-frequency regime, $2 \leq l_{int} \leq L$ and its value is determined by whether the low- or high-frequency regime dominates.

Based on the above domain subdivision, the field $u(\mathbf{r}_i)$ at an observation location $\mathbf{r}_i$ is obtained by aggregating the fields due to sources in (source) boxes $B_{n'}^l$ that appear in the NIL and FIL:

\[
u(\mathbf{r}_i) = u^{NF}(\mathbf{r}_i) + u^{FF}(\mathbf{r}_i). \tag{6}
\]

The near-field contribution $u^{NF}(\mathbf{r}_i)$ is evaluated directly using Eq.
(1) via

\[
u^{NF}(\mathbf{r}_i) = \sum_{B_m^L \in I_N(B_n^L)} \; \sum_{\mathbf{r}_j \in I_L(B_m^L)} \frac{e^{-jk|\mathbf{r}_i - \mathbf{r}_j|}}{|\mathbf{r}_i - \mathbf{r}_j|}\, Q_j, \tag{7}
\]

where the summations are over the sources at $\mathbf{r}_j$ in the local lists of the boxes $B_m^L$ that belong to the NIL of the box $B_n^L$ containing the observer at $\mathbf{r}_i$. There are only $O(1)$ near-field interacting boxes for every observation box $B_n^L$, and the computational cost of evaluating the near-field contribution scales as $O(N)$.

The far-field contribution $u^{FF}(\mathbf{r}_i)$ is evaluated via the multi-level NGIM as described next.

2.2.2. Non-uniform grid interpolation method for the far-field evaluation

The evaluation of the far-field contribution is accomplished in four stages based on the behavior of the potential and the procedures described in Section 2.1. The procedures of each stage are illustrated in Fig. 2.2 (in a 2-D form for clarity).

Stage 1 (finest level $l = L$: sources → NG computations; Fig. 2.2(a)). At the finest level $L$, the potential is computed using Eq. (1) at the NGs for all non-empty boxes. The NGs $\{\breve{\mathbf{r}}_i^{L,m}\}$ in this stage are defined for all non-empty boxes $B_m^L$ at level $L$ as described in Section 2.1. Samples of the far-field potential $\{\breve{u}_m^L(\breve{\mathbf{r}}_i^{L,m})\}$ at the corresponding NGs due to charges in the Local List of box $B_m^L$ are computed as

\[
\breve{u}_m^L(\breve{\mathbf{r}}_i^{L,m}) = \sum_{\mathbf{r}_j \in I_L(B_m^L)} \frac{e^{-jk|\breve{\mathbf{r}}_i^{L,m} - \mathbf{r}_j|}}{|\breve{\mathbf{r}}_i^{L,m} - \mathbf{r}_j|}\, Q_j. \tag{8}
\]

Since the number of boxes at the finest level $L$ is $O(N)$ and there are $O(1)$ grid points per box, the computational cost of evaluating the fields at all NGs for all level $L$ boxes scales as $O(N)$ for the considered low-frequency or static regimes.

Stage 2 (upward pass: aggregation of NGs; Fig. 2.2(b)). The far-field potentials at their respective NGs for all boxes at all levels from $l = L-1$ to $l = 2$ are obtained recursively by interpolating and aggregating the contributions from the NGs of the corresponding child boxes.
Specifically, the far-field potential at the NGs of box $B_m^{l-1}$ is obtained by adding up contributions from its child boxes via

\[
\breve{u}_m^{l-1}(\breve{\mathbf{r}}_i^{l-1,m}) = \sum_{B_n^l \in B_m^{l-1}} \sum_j \breve{w}^{l,n}(\breve{\mathbf{r}}_i^{l-1,m}, \breve{\mathbf{r}}_j^{l,n})\, \breve{u}_n^l(\breve{\mathbf{r}}_j^{l,n}). \tag{9}
\]

Here, the potentials $\breve{u}_m^{l-1}$ of the parent box $B_m^{l-1}$ are interpolated from their respective child NGs $\{\breve{\mathbf{r}}_j^{l,n}\}$ to the new NGs $\{\breve{\mathbf{r}}_i^{l-1,m}\}$. The coefficients $\breve{w}^{l,n}(\breve{\mathbf{r}}_i^{l-1,m}, \breve{\mathbf{r}}_j^{l,n})$ are interpolation coefficients as defined in Eq. (4). Referring to the discussion in Section 2.2.1 on the number of non-empty boxes and the number of NG points per box, the computational cost of Stage 2 for low- and high-frequency problems is $O(N)$ and $O(N \log N)$, respectively.

Stage 3 (downward pass: NG → CG transitions; CG decomposition; Fig. 2.2(c)). In this stage, the field samples at CGs are calculated. The procedure starts by computing the fields at the CGs of the (observation) boxes $B_n^{l_{int}}$ at the interface level $l_{int}$. These fields are obtained by interpolating from the NG samples of the source boxes $B_m^l$ for levels $l = 2, \ldots, l_{int}$. The boxes $B_m^{l_{int}}$ belong to the FIL of the observation box $B_n^{l_{int}}$ (i.e. $B_m^{l_{int}} \in I_F(B_n^{l_{int}})$). The corresponding fields at the CG points of box $B_n^{l_{int}}$ from the interaction list boxes of level $l_{int}$, referred to as $\hat{u}_n^{l_{int},l_{int}}(\hat{\mathbf{r}}_i^{l_{int},n})$, are computed as

\[
\hat{u}_n^{l_{int},l_{int}}(\hat{\mathbf{r}}_i^{l_{int},n}) = \sum_{B_m^{l_{int}} \in I_F(B_n^{l_{int}})} \sum_j \breve{w}^{l_{int},m}(\hat{\mathbf{r}}_i^{l_{int},n}, \breve{\mathbf{r}}_j^{l_{int},m})\, \breve{u}_m^{l_{int}}(\breve{\mathbf{r}}_j^{l_{int},m}). \tag{10}
\]

Similarly, the boxes $B_m^l$ for levels $l = 2, \ldots, l_{int} - 1$ belong to the interaction lists of the parents of the box $B_n^{l_{int}}$ at the corresponding levels. The corresponding fields $\hat{u}_n^{l_{int},l}(\hat{\mathbf{r}}_i^{l_{int},n})$ ($l = 2, \ldots, l_{int} - 1$) are computed similarly to Eq. (9). All the fields from the interaction boxes at levels $l = 2, \ldots, l_{int}$ are superimposed as $\hat{u}_n^{l_{int}}(\hat{\mathbf{r}}_i^{l_{int},n}) = \sum_{l=2}^{l_{int}} \hat{u}_n^{l_{int},l}(\hat{\mathbf{r}}_i^{l_{int},n})$ to result in the fields at the CG points of the interface level boxes $B_n^{l_{int}}$.

Fig. 2.2. 2-D illustration of the tasks in each stage of the multi-level implementation of NGIM: (a) source-to-NG stage (stage 1); (b) NG-to-NG stage (stage 2); (c) NG-to-CG and CG-to-CG stage (stage 3), comprising the NG–CG transition (step 1) and the CG–CG transition (step 2); (d) CG-to-observer stage (stage 4).

Next, for all levels $l = l_{int} + 1, \ldots, L$, the fields at the CGs of the (observation) boxes $B_n^l$ are computed from two contributions: (i) the fields accumulated by summing up field contributions interpolated from the NGs of the boxes in the FIL at the same level (similar to Eq. (9)) and (ii) those inherited from their parent boxes by interpolating from the CGs:

\[
\hat{u}_n^l(\hat{\mathbf{r}}_i^{l,n}) = \sum_{B_m^l \in I_F(B_n^l)} \sum_j \breve{w}^{l,m}(\hat{\mathbf{r}}_i^{l,n}, \breve{\mathbf{r}}_j^{l,m})\, \breve{u}_m^l(\breve{\mathbf{r}}_j^{l,m}) + \sum_j \hat{w}^{l-1,k}(\hat{\mathbf{r}}_i^{l,n}, \hat{\mathbf{r}}_j^{l-1,k})\, \hat{u}_k^{l-1}(\hat{\mathbf{r}}_j^{l-1,k}). \tag{11}
\]

Here, the first term is similar to that in the right-hand side of Eq. (9), while in the second term the field samples at the CGs of the parent box $B_k^{l-1}$ are used to obtain the field samples at the CGs of its child boxes $B_n^l \in B_k^{l-1}$ via interpolation as in Eq. (5). This procedure is repeated recursively for all non-empty boxes from level $l_{int} + 1$ to level $L$.
The relation between these two steps is shown in Fig. 2.3.

As mentioned in Section 2.2.1, the number of far-field interaction pairs per box at a certain level is $O(1)$. The number of CG samples per box at levels $l = l_{int}, \ldots, L$ is $O(1)$. As a result, the computational cost of Stage 3 scales as $O(N)$ and $O(N \log N)$ for the low- and high-frequency regimes, respectively. The increased cost in the high-frequency regime arises because the fields at the CGs of boxes at the interface level are evaluated from NGs at all high-frequency levels, the number of which is assumed to be $O(\log N)$. In the

mixed-frequency regime, the cost scales between $O(N)$ and $O(N \log N)$ depending on the number of low- and high-frequency levels.

Fig. 2.3. Relation between the NG–CG transition stage and the CG decomposition stage in calculating the field values at the CG samples of boxes at each computational level (NG samples and CG samples at levels $l$, $l+1$, and $l+2$).

It should be mentioned that while Stages 2 and 3 are conceptually different (Stage 2 computes fields at NGs while Stage 3 computes fields at CGs), in a code implementation they can be combined. For example, for all levels $2 < l < l_{int}$ the NGs of boxes at level $l$ can be used to obtain the NGs of boxes at level $l - 1$ and to obtain the CGs of boxes at level $l_{int}$ as in Eq. (9). After these operations the NGs at level $l$ are no longer necessary and can be discarded. Such an approach results in the same computational time but reduces memory in the high-frequency regime.

Stage 4 (finest level: CG → observers; Fig. 2.2(d)). On the finest level $L$, the far-field contributions at the observation points $\mathbf{r}_i$ belonging to a level $L$ box $B_n^L$ are obtained by interpolation from the CGs of the same box via

\[
u^{FF}(\mathbf{r}_i) = \sum_j \hat{w}^{L,n}(\mathbf{r}_i, \hat{\mathbf{r}}_j^{L,n})\, \hat{u}_n^L(\hat{\mathbf{r}}_j^{L,n}). \tag{12}
\]

Since there are $O(1)$ CG samples in every box and $O(N)$ boxes at the finest level, the computational complexity of evaluating the far-field contribution at all observers is $O(N)$.

2.2.3. Computational complexity

First, we discuss the computational cost in the low-frequency regime. Assume for simplicity that the source distribution occupies most or all of the computational domain such that all the boxes are non-empty.
An estimate of the number of operations (computational time) of NGIM is given by

\[
T_{low}(N, L) = C_1 N n_{NG} + C_2 8^L n_{NG} + C_3 8^L n_{CG} + C_4 N + C_5 (N/8^L)^2. \tag{13}
\]

Here, $n_{NG} = O(1)$ and $n_{CG} = O(1)$ are the numbers of NG and CG samples per box, respectively, and $C_1, \ldots, C_5$ are constants of $O(1)$. Each of the first four terms in the right-hand side corresponds to the computational cost of one of the four far-field stages, and the fifth term is the cost of the near-field stage. As explained above, the recursive domain subdivision is executed until there is a finite number, $O(1)$, of sources in a box, which means that $L$ is chosen as $L = C \log_8 N$ (with $C = O(1)$). As a result, all five terms in $T_{low}$ are on the order of $O(N)$, thus resulting in an $O(N)$ overall computational time cost. The computational cost is the same for source distributions covering only a part of the computational space (in which case the number of non-empty boxes is smaller).

In the high-frequency regime, the computational cost is

\[
T_{high}(N, L) = C_1 N n_{NG} + C_2 L n_{NG} 8^L + C_3 L n_{CG} 8^L + C_4 N + C_5 (N/8^L)^2, \tag{14}
\]

where $n_{NG} = O(1)$ and $n_{CG} = O(1)$ are the numbers of NG and CG samples in the smallest boxes, respectively. For $L = C \log_8 N$, the second and third terms in the right-hand side of (14) are of $O(N \log N)$. The increase of the computational time is due to the increase of the grid densities. In the mixed-frequency regime, the computational time cost is between $O(N)$ and $O(N \log N)$.

The memory consumption of NGIM implemented on GPUs is estimated as

\[
M_{low} \leq C_1 N + C_2 (n_{CG} + n_{NG})\, 8^L + C_3 8^L,
\]

both in the low- and high-frequency regimes (assuming all boxes are active). Here, the first term in the right-hand side is the memory required for storing source and observer amplitudes and coordinates, the second term is for storing NG and CG field amplitudes, and the third term is for storing NILs and other box-related information. For $L = C \log_8 N$ the memory consumption is of $O(N)$ in any frequency regime.
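The scaling implied by the cost estimates can be checked numerically with a small model. The sketch below is an illustration under stated assumptions, not a measurement: the sample counts `n_ng` and `n_cg`, the unit constants, the choice of eight sources per finest-level box, and the function names are all hypothetical. In addition, the near-field term is taken here as the number of finest-level boxes times the square of the number of sources per box, an assumption made so that the term counts the interactions over all boxes.

```python
import math


def levels_for(n, per_box=8):
    """Number of levels L such that a finest box holds about per_box sources."""
    return max(2, round(math.log(n / per_box, 8)))


def ngim_cost(n, levels, high_frequency=False, n_ng=64, n_cg=27, c=(1.0,) * 5):
    """Operation-count model following the structure of Eqs. (13) and (14)."""
    boxes = 8 ** levels
    # In the high-frequency model the NG and CG terms carry an extra factor L.
    level_factor = levels if high_frequency else 1
    far = (c[0] * n * n_ng
           + c[1] * level_factor * n_ng * boxes
           + c[2] * level_factor * n_cg * boxes
           + c[3] * n)
    # Assumption: near-field cost summed over all finest-level boxes.
    near = c[4] * boxes * (n / boxes) ** 2
    return far + near
```

Under these assumptions, the modelled cost per source stays constant in the low-frequency case when $N$ grows by a factor of eight and $L$ by one, whereas in the high-frequency case it grows linearly with $L$, i.e. logarithmically with $N$.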
ð13þ Implementaton of the non-unform grd nterpolaton method on GPUs 315 GPUs were orgnally ntended for graphc processng such as vsualzaton, renderng, computer graphc anmatons, and 316 3D gamng. However, recent developments n the GPU technology have drawn attenton of scentfc computng commun- 317 tes due to GPU s extraordnary computatonal power. Recently, new programmng envronments/languages have been 318 ntroduced to facltate the use of GPU systems for hgh-performance scentfc computng. In partcular, nvda provdes (2010), do: /.cp

9 8 S. L et al. / Journal of Computatonal Physcs xxx (2010) xxx xxx 319 Compute Unfed Devce Archtecture (CUDA) for the development of general purpose applcatons on GPUs [30]. CUDA s a 320 functon-extended C programmng language that allows a C program to be compled and executed on stream processors of 321 GPUs. On logcal level, each runnng nstance launched on a GPU s called a CUDA thread and s dentfed by ts thread ID, 322 whch can be a one-, two-, or three-dmensonal ndex. These IDs can be used, for example, as addressng parameters to 323 let the threads access dfferent parts of memory or ndcate whch flow paths the thread should follow. However, the threads 324 are not spread drectly nto all stream processors. A certan number of threads are bundled together to form a thread block. A 325 thread block usually contans a group of threads that are expected to execute closely related (yet possbly dfferent) tasks. 326 Threads n a block are combned nto groups of 32, referred to as warps. Warps are atomc unts that always go together on 327 a mult-processor. Each mult-processor conssts of 16 stream processors (cores), and takes one or more warps at a tme. 328 Wthn each block, threads are granted a certan amount of shared memory that can be as fast as regsters. Ths shared 329 memory s smultaneously accessble by all threads n the same block but s nether recoverable after the block threads fnsh 330 ther ob nor accessble by threads from other blocks. The shared memory can be used effcently as an ntermedate storage 331 lke regsters to speed-up calculatons. The nput and output data s usually stored n global memory. Although the maxmal 332 speed of the global memory s hgh (e.g. 177 GBts/s bandwdth for nvda GeForce GTX 480), t has a notceable latency n 333 handlng every read or wrte nstructon (e.g., cycles for nvda GeForce GTX 480). 
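The roles of thread IDs, thread blocks, and per-block shared memory described above can be illustrated with a minimal sketch. It is not part of the NGIM code: the kernel `blockSum` and the host helper `sumConstant` are hypothetical names, and the block size is assumed to be a power of two. Each block stages its inputs in dynamically sized shared memory and reduces them to one partial sum.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Each block reduces blockDim.x consecutive inputs to one partial sum.
// Assumption: blockDim.x is a power of two.
__global__ void blockSum(const float* in, float* partial, int n)
{
    extern __shared__ float tile[];                     // per-block shared memory, sized at launch
    int gid = blockIdx.x * blockDim.x + threadIdx.x;    // thread ID used as a memory address
    tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;     // global memory -> shared memory
    __syncthreads();                                    // make all loads visible to the block

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)                       // thread ID selects the flow path
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = tile[0];
}

// Hypothetical host helper: sums n copies of `value` on the GPU.
float sumConstant(int n, float value, int threads = 128)
{
    assert(threads > 0 && (threads & (threads - 1)) == 0);   // power of two
    int blocks = (n + threads - 1) / threads;
    std::vector<float> h_in(n, value), h_partial(blocks);
    float *d_in = nullptr, *d_partial = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    size_t sharedBytes = threads * sizeof(float);            // dynamic shared memory per block
    blockSum<<<blocks, threads, sharedBytes>>>(d_in, d_partial, n);

    cudaMemcpy(h_partial.data(), d_partial, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);

    float total = 0.0f;
    for (float p : h_partial) total += p;                    // finish the reduction on the host
    return total;
}
```

For instance, `sumConstant(1000, 1.0f)` launches 8 blocks of 128 threads (4 warps each) and should return 1000. The shared array disappears when a block finishes, which is why each block writes its partial sum back to global memory.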
To overcome this latency and reduce the speed mismatch between the shared and global memories, CUDA offers a coalesced access scheme, in which several read or write instructions are combined into one transaction. This faster scheme is triggered when threads in the same warp access contiguous addresses of global memory.

In the implementation of the NGIM, the aforementioned concepts and mechanisms are critical for speeding up calculations. They have significant effects on the time and memory consumption of the NGIM on GPUs and result in a number of important modifications in the data structure of the code as compared to the CPU implementations. The implementation of the NGIM on GPUs follows the same stage-by-stage protocol as that on CPUs, yet extensive changes are made to parallelize the operations and utilize the tools provided by CUDA.

3.1. Preprocessing and initialization stages

In the preprocessing stage, all vectors, matrices, and other data structures used by the NGIM are initialized. The initialization includes memory allocation in the global memory of the GPU, copying the coordinates of sources and observers to the allocated matrices, as well as reshaping and copying auxiliary matrices, such as matrices storing indices of far-field interaction boxes. One task done in the preprocessing stage specifically for the GPU is rearranging the source information storage so that sources belonging to the same box, at all levels, are situated contiguously in memory. This is critical for the GPU to adopt coalesced access to accelerate memory handling, which will be described in detail below.

In the CPU version of preprocessing, constructing coefficient matrices for interpolation (described in later paragraphs) occupies a significant portion of the whole preprocessing procedure in terms of memory usage and computation time.
But in the GPU version, similar tasks are significantly shortened or eliminated, because the NG and CG sample coordinates as well as all interpolation coefficient matrices are constructed on the fly every time they are requested. These on-the-fly operations are repeated in each stage to reduce the memory consumption and the total memory access time. As a result, the preprocessing time of the GPU code is reduced, making the code more efficient and practical.

3.2. Near-field computation

In this stage, the fields at the observers are evaluated directly via Eq. (1) by adding up the field contributions from sources belonging to the level-L boxes in the NILs of the observer's box. Methods to parallelize this stage also apply to direct evaluations of the classical n-body problem [27], which is used for the speed-up comparison. In addition, the techniques used here are applicable to the acceleration of other (more complicated) steps of the NGIM.

Two approaches can be followed to compute the near-field interactions. In the first approach, the values of the Green's function for all source-observer pairs participating in the near-field interactions are tabulated as an interaction matrix. This is done only once in the preprocessing stage (e.g., of an iterative electromagnetic solver), and only simple matrix-vector multiplications are used in the iterations. However, even for a moderate N, the resulting memory consumption can be very large for GPU (and also for CPU) systems. The other approach is to evaluate all near-field interactions directly via Eq. (1) on the fly at every iteration. While this approach may lead to some speed reduction, it drastically reduces the memory requirements, so it is preferred in our GPU implementation of NGIM.

All results for the near-field stage are shown for the on-the-fly approach for both CPU and GPU implementations. The reason is twofold. First, the memory consumption for large problems can be too high even for CPU systems.
Second, for large problem sizes the pre-computation approach may not give a significant advantage due to the resulting large amount of memory handled. In our tests, we implemented both the pre-computation and on-the-fly approaches in the CPU version and found that the pre-computation approach was only … times faster for problems below N = 4 million (which can be handled by our CPU version on a computer with 32 GB of memory). For larger problems, the performance gap between the two versions of the code may be even smaller. Therefore, we believe that the benefit of the reduced memory consumption of the on-the-fly approach outweighs a potential extra speed-up of the pre-computation approach.
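To put the memory argument in perspective, a rough estimate is sketched below. It is not taken from the paper. It assumes a uniform distribution with $N_p$ sources and $N_p$ observers in each of the $8^L$ finest-level boxes, at most 27 boxes in each NIL (the box itself and its touching neighbours), single-precision complex storage (8 bytes per tabulated Green's function value, matching float2), and the illustrative values $N = 4 \times 10^6$ and $N_p = 32$.

```latex
% Assumptions (not from the source): |NIL| <= 27, N_p sources and observers per box,
% 8 bytes per tabulated Green's function value, 20 bytes per source on the fly
% (three float coordinates plus one float2 amplitude).
\begin{align*}
M_{\text{pre}} &\approx 27 \cdot 8^{L} \cdot N_p^{2} \cdot 8\ \text{bytes}
               = 216\, N N_p\ \text{bytes}
               && (8^{L} = N/N_p) \\
               &\approx 216 \cdot (4\times10^{6}) \cdot 32\ \text{bytes}
               \approx 2.8\times10^{10}\ \text{bytes} \approx 28\ \text{GB}, \\
M_{\text{fly}} &\approx 20\, N\ \text{bytes}
               = 8\times10^{7}\ \text{bytes} = 80\ \text{MB}.
\end{align*}
```

Under these assumptions the tabulated matrix grows with $N N_p$ and is far larger than the memory of a single GPU board, while the data needed for the on-the-fly evaluation grow only with $N$.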

Pseudocode 1. Near-field calculation.
Grid dimension = number of active (non-empty) boxes; block dimension >= number of observers/sources per box; sources in the same box are stored contiguously in global memory.

    // External shared arrays, dynamically allocated right before each launch of the kernel
    // (both names refer to the same dynamic shared memory buffer)
    extern __shared__ float dyn_sdata1[];
    extern __shared__ float2 dyn_sdata2[];
    // This array stores the coordinates of the sources in a box
    float* Qs = (float*) dyn_sdata1;
    // This array stores the complex amplitudes of the sources in a box;
    // the offset (blockDim.x << 1) + blockDim.x equals 3*blockDim.x
    float2* Qs_amp = (float2*) &dyn_sdata2[(blockDim.x << 1) + blockDim.x];
    float2 Pt = make_float2(0.0f, 0.0f);   // field value accumulated by this thread
    float Qr[3];                           // coordinates of the observer handled by this thread
    int IdxReceiverBox = index of the observation box processed by this block of threads;
    int NumNearBoxes = number of source boxes that have a near-field relation with the current observation box;
    int NumReceiverBox = number of observers in the box being processed;
    int tidx = index of the observer handled by the current thread;
    // load the observer's coordinates to Qr[] (box indices are 1-based)
    for (int i = 0; i < 3; i++)
    {
        Qr[i] = d_fChargeInfo[((IdxReceiverBox - 1)*3 + i)*blockDim.x + tidx];
    }
    __syncthreads();
    // traverse all near-field boxes
    for (int IdxNearBoxCurrent = 1; IdxNearBoxCurrent <= NumNearBoxes; IdxNearBoxCurrent++)
    {
        int boxIdx = index of the source box currently being processed;
        int chargeNum = number of charges inside this source box;
        // load the source coordinates and amplitudes to Qs[] and Qs_amp[]
        for (int i = 0; i < 3; i++)
        {
            Qs[tidx*3 + i] = Charge_coordinates[(boxIdx - 1)*3*blockDim.x + i*blockDim.x + tidx];
        }
        Qs_amp[tidx] = Charge_amplitude[(boxIdx - 1)*blockDim.x + tidx];
        __syncthreads();
        // accumulate the contributions of all sources in this source box
        for (int i = 0; i < chargeNum; i++)
        {
            float del[3];
            del[0] = Qs[3*i] - Qr[0];
            del[1] = Qs[3*i + 1] - Qr[1];
            del[2] = Qs[3*i + 2] - Qr[2];
            // distance r between the source and the observer
            float dist = sqrtf(del[0]*del[0] + del[1]*del[1] + del[2]*del[2]);
            if (dist > 1e-6f)
            {
                // use intermediate variables to save time on expensive operations
                // such as division and trigonometric functions
                float del_cos, del_sin, qs[2], rdist;
                del_cos = cosf(-k0*dist);
                del_sin = sinf(-k0*dist);
                qs[0] = Qs_amp[i].x;
                qs[1] = Qs_amp[i].y;
                rdist = 1.0f/dist;
                Pt.x += rdist*(qs[0]*del_cos - qs[1]*del_sin);
                Pt.y += rdist*(qs[0]*del_sin + qs[1]*del_cos);
            }
        }
        // all threads must finish with Qs[] and Qs_amp[] before the next box overwrites them
        __syncthreads();
    }
    __syncthreads();
    // output the value of Pt to the corresponding element of the output array
    Pt -> field value vector

Pseudocode 1 demonstrates the implementation of this approach on GPUs. The pseudocode is written in the CUDA-extended C language, but certain initializations, input/output parts, and error control parts are omitted for clarity. The technical details of implementing the direct calculation of field values based on Eq. (1) on the GPU are given next.

(1) We adopted the one-thread-per-observer type of parallelization, in which a thread is responsible for calculating the field value at an observer.
(2) Threads handling observers in the same box are bundled to form a thread block. One or several thread blocks may be launched to handle a box when the number of observers within the box exceeds the hardware limit on the number of threads a block can contain (e.g., 512 for Tesla C1060 and 1024 for GeForce GTX 480). The number of observers in a box should be greater than 32 (the warp size) to have all the stream processors occupied and all computational resources fully utilized.
(3) Near-fields at observers in a box are computed from the sources in boxes belonging to the NIL of the observer box. This fact is exploited by loading the coordinates and amplitudes of these sources to shared memory, such that several threads handling different observers use the same information without repeated memory loads. Furthermore, the sources and observers are arranged in the preprocessing stage so that coordinates and amplitudes of sources in a box are located at contiguous memory addresses. As a result, the memory loading operations by a block of threads are always coalesced.
(4) The amount of shared memory has to be determined at run time since the number of threads per block (i.e., the number of sources and observers per box) is not known at compilation time.
We use dynamic shared memory allocation, in which the amount of shared memory is calculated each time before launching the kernel and is allocated as a single array.
(5) One-, two-, or three-dimensional grids of blocks can be used to handle all non-empty boxes. In the current implementation, we use two-dimensional grids of blocks. This allows any practical number of blocks to be launched for each individual kernel.
(6) Some intrinsic mathematical functions are used to accelerate the computations. These functions include single-precision versions of the sin and cos functions, used to evaluate the complex exponential in the Green's function in Eq. (1). Other instruction-level techniques include replacing integer division and modulo operations with bitwise shift and AND operations when the divisor is a power of 2 [30].
(7) The data type float2 is used in the near-field and following stages since it can be directly mapped to the complex data type we use in our CPU code written in FORTRAN. In the CUDA compiler, the operator overloading mechanism is introduced so the operations on float2 can be defined exactly as those for complex numbers.

Table 3.1 shows the computational time of the near-field stage on CPU and GPU. The CPU timing results were obtained on a single core of an Intel Xeon 3.2 GHz CPU using Intel Fortran Compiler v10 with O3 optimization (there was an approximately 20-fold speed-up of the O3-optimized CPU code over a non-optimized one). On the GPU side, an NVIDIA GTX 480 at … MHz with 1.5 GB of memory was used. The GPU implementation was written and compiled using CUDA Toolkit v3.0 from NVIDIA. Both CPU and GPU versions of the code used the on-the-fly approach, and the source and observer distribution was random.

Table 3.1
The computational times and speed-up ratios of the near-field stage on CPU (Xeon X5248) and GPU (GeForce GTX 480). $N_p$ is the average number of sources per box on level L. The relation between $N_p$ and N is $N_p = N/8^L$.
Columns: $N_p$, L, CPU (a), GPU, Ratio.
(a) All timing results shown in this section are in seconds.

It is evident that the speed-up ratios of the GPU code compared to the CPU one are very high, ranging from 200 to above 600. The speed-ups are higher for larger $N_p$, i.e., for a larger number of sources and observers per box, when the massive parallelization is fully exploited. Taking into account that the number of GPU cores in the considered case is 480 and that they run at a clock rate around 4.5 times lower than that of the CPU, achieving acceleration rates above 600 is impressive. Such high rates are obtained not only due to the massive parallelization of floating-point operations and memory loading but also due to the coalesced memory access.

A comment should be made on the scaling of the computational complexity in Table 3.1. For a fixed number of levels L, the complexity scales approximately as $O(N^2)$ with increasing N ($N = N_p 8^L$ in Table 3.1), as the number of near-field evaluations is proportional to $N_p^2$. The complexity of $O(N)$ for the near-field stage of the NGIM is achieved due to the fact that the number of levels L increases with an increase of N. Indeed, the computational time behaves as $O(N)$ for the same number of sources/observers per smallest box (i.e., for the same $N_p$ in the example of Table 3.1). Clearly, the level increase is discrete, and the choice of the size N at which the code is switched to a higher hierarchy level depends on the relative computational time of the near- and far-field components.

The near-field computation stage is one of the two potentially most time-consuming stages in the algorithm (the other is stage 3 of the far-field evaluation).
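The launch-side details in items (4) to (7), which Pseudocode 1 omits, can be sketched in a small self-contained example. This is not the authors' kernel: the names `boxSelfField` and `runBoxSelfField` are hypothetical, the data use a structure-of-arrays layout chosen for illustration, and each box is assumed to interact only with itself and to hold exactly `perBox` points. The sketch shows the shared memory size computed before the launch, a two-dimensional grid mapped to a box index, a shift in place of an integer division, and complex arithmetic through overloaded float2 operators.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Complex arithmetic on float2 (x = real part, y = imaginary part).
__host__ __device__ inline float2 operator*(float2 a, float2 b)
{
    return make_float2(a.x * b.x - a.y * b.y, a.x * b.y + a.y * b.x);
}
__host__ __device__ inline float2& operator+=(float2& a, float2 b)
{
    a.x += b.x; a.y += b.y;
    return a;
}

// One block per box, one thread per point. Simplification: a box interacts only with itself.
__global__ void boxSelfField(const float* x, const float* y, const float* z,
                             const float2* amp, float2* field, int nBoxes, float k0)
{
    extern __shared__ float2 s_amp[];            // blockDim.x amplitudes ...
    float* s_x = (float*)&s_amp[blockDim.x];     // ... followed by 3 * blockDim.x coordinates
    float* s_y = s_x + blockDim.x;
    float* s_z = s_y + blockDim.x;

    int box = blockIdx.y * gridDim.x + blockIdx.x;   // 2D grid of blocks -> linear box index
    if (box >= nBoxes) return;                       // the whole block exits together

    int t = threadIdx.x;
    int g = box * blockDim.x + t;                    // points of a box are contiguous: coalesced loads
    s_amp[t] = amp[g];
    s_x[t] = x[g];  s_y[t] = y[g];  s_z[t] = z[g];
    __syncthreads();

    float2 sum = make_float2(0.0f, 0.0f);
    for (int j = 0; j < blockDim.x; ++j) {
        float dx = s_x[j] - s_x[t], dy = s_y[j] - s_y[t], dz = s_z[j] - s_z[t];
        float r = sqrtf(dx * dx + dy * dy + dz * dz);
        if (r > 1e-6f) {                             // skip the self term
            float s, c;
            sincosf(-k0 * r, &s, &c);                // single-precision sin and cos in one call
            sum += s_amp[j] * make_float2(c / r, s / r);   // amplitude * exp(-i k0 r) / r
        }
    }
    field[g] = sum;
}

// Hypothetical host wrapper. All arrays hold nBoxes * perBox entries, stored box after box.
std::vector<float2> runBoxSelfField(const std::vector<float>& x, const std::vector<float>& y,
                                    const std::vector<float>& z, const std::vector<float2>& amp,
                                    int nBoxes, int perBox, float k0)
{
    size_t n = (size_t)nBoxes * perBox;
    float *d_x, *d_y, *d_z;
    float2 *d_amp, *d_field;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMalloc(&d_z, n * sizeof(float));
    cudaMalloc(&d_amp, n * sizeof(float2));
    cudaMalloc(&d_field, n * sizeof(float2));
    cudaMemcpy(d_x, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_z, z.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_amp, amp.data(), n * sizeof(float2), cudaMemcpyHostToDevice);

    const int shift = 4;                                          // grid width 2^4 = 16 blocks
    dim3 grid(1 << shift, (nBoxes + (1 << shift) - 1) >> shift);  // shift replaces division by 16
    size_t sharedBytes = perBox * (sizeof(float2) + 3 * sizeof(float));  // computed before each launch
    boxSelfField<<<grid, perBox, sharedBytes>>>(d_x, d_y, d_z, d_amp, d_field, nBoxes, k0);

    std::vector<float2> field(n);
    cudaMemcpy(field.data(), d_field, n * sizeof(float2), cudaMemcpyDeviceToHost);
    cudaFree(d_x); cudaFree(d_y); cudaFree(d_z); cudaFree(d_amp); cudaFree(d_field);
    return field;
}
```

Placing the float2 array first in the shared buffer keeps it 8-byte aligned for any block size. With `k0 = 0` the kernel reduces to a sum of amplitudes divided by distance, which gives a simple way to check the indexing on a small input.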
Therefore, achieving such high acceleration rates in this stage is very important.

3.3. Outward computation from sources to NG samples (Stage 1)

The NG construction stage computes the field values at NGs, which is the first step of the upward pass of the algorithm. The core operations in this stage are (a) the construction of the NGs of each non-empty box at the finest level L and (b) the direct calculation of the field values at these NGs via Eq. (8).

The CPU version of the code consists of two nested loops to deal with all pairs of sources and NG samples for individual boxes and another loop to account for all boxes at level L (Eq. (8)). The evaluation of the field is executed by coefficient-amplitude multiplication, where the interaction coefficients are computed via Eq. (8) in the preprocessing stage. For the GPU, no coefficient matrices are used for the reasons mentioned in previous sections. Instead, the loops over the NG samples and boxes are parallelized and substituted by threads and blocks. Two approaches can be followed to parallelize this subroutine and to map the operations to all processors on GPUs. One approach is the one-thread-per-box parallelization used in similar stages of the FMM in Ref. [28], and the other is the one-thread-per-observer parallelization, which is similar to what was done in the near-field computation stage in Section 3.2.

In the first approach, each thread is responsible for calculating field values at all NG samples of a designated box. This can be done only serially, but with this approach the code may handle non-uniform source distributions efficiently. However, the overall efficiency of the algorithm in practical situations is limited by several other factors. Suppose there are $N_b$ sources in a box at the finest level. Then, one thread has to execute floating-point reads from the global memory $5N_b$ times in order to load the source coordinates and (complex) amplitudes to the shared memory or registers.
No memory coalescing scheme can be applied in this situation, and these reads will suffer from the global memory access latency. Moreover, for the on-the-fly version of the NGIM, the construction of NGs requires a relatively large number of operations, which makes the work load of a single thread heavy, undermining the effects of parallelization. Finally, in the one-thread-per-box parallelization approach, the GPU computational resources are utilized completely only when the problem size is large enough. From our test runs on NVIDIA Tesla C1060, L has to be greater than or equal to 5 to have all the stream processors fully utilized. The situation becomes even worse on the GTX 480 as it has more stream processors. In summary, we believe that the one-thread-per-box parallelization approach is suboptimal for the NGIM (and possibly for the FMM).

In the second (one-thread-per-observer parallelization) approach, one thread handles one observer, as in the approach used in Section 3.2, where an observer now means an NG sample. One or several blocks of threads are allocated for each observation box. In this approach the coalesced memory reading technique is applied to accelerate the loading of source coordinates and amplitudes. The task of the NG construction is also distributed over a group of threads. In the low-frequency regime, usually one block of threads is assigned to one box of NG samples, and the maximal efficiency is achieved. In the high- and mixed-frequency regimes, the boxes at high-frequency levels can have a large number of NG samples. This requires assigning multiple blocks to each box.

The computational times of stage 1 are presented in Table 3.2 (these results are frequency-regime independent for the same N, L, and number of NG samples per smallest box). It is evident that the speed-up ratio increases significantly with an increase of the number of sources per box.
The oversampling rates of NGs in the radius and angle, $\Omega_r$ and $\Omega_a$, are defined in Section 2.2 and chosen to be 2 here to render the average $L_1$ error of NGIM at the $10^{-3}$ level. For small problems (with a small number of sources and boxes), the work load distributed to each stream processor is insufficient to hide the expense of the on-the-fly NG construction, thread launching, and global memory reading even under the coalesced access scheme. Therefore, for the L = 3 case, the computational time is nearly constant (the speed-ups are still quite high). When there are more boxes to be processed with an increased L, the computation time increases with the problem size but slower than $O(N)$. Another factor that affects the speed difference between CPU and GPU is the oversampling rate. For high oversampling ratios, i.e., for


Today s Outline. Sorting: The Big Picture. Why Sort? Selection Sort: Idea. Insertion Sort: Idea. Sorting Chapter 7 in Weiss. Today s Outlne Sortng Chapter 7 n Wess CSE 26 Data Structures Ruth Anderson Announcements Wrtten Homework #6 due Frday 2/26 at the begnnng of lecture Proect Code due Mon March 1 by 11pm Today s Topcs:

More information

Chapter 1. Comparison of an O(N ) and an O(N log N ) N -body solver. Abstract

Chapter 1. Comparison of an O(N ) and an O(N log N ) N -body solver. Abstract Chapter 1 Comparson of an O(N ) and an O(N log N ) N -body solver Gavn J. Prngle Abstract In ths paper we compare the performance characterstcs of two 3-dmensonal herarchcal N-body solvers an O(N) and

More information

Image Representation & Visualization Basic Imaging Algorithms Shape Representation and Analysis. outline

Image Representation & Visualization Basic Imaging Algorithms Shape Representation and Analysis. outline mage Vsualzaton mage Vsualzaton mage Representaton & Vsualzaton Basc magng Algorthms Shape Representaton and Analyss outlne mage Representaton & Vsualzaton Basc magng Algorthms Shape Representaton and

More information

Private Information Retrieval (PIR)

Private Information Retrieval (PIR) 2 Levente Buttyán Problem formulaton Alce wants to obtan nformaton from a database, but she does not want the database to learn whch nformaton she wanted e.g., Alce s an nvestor queryng a stock-market

More information

3D vector computer graphics

3D vector computer graphics 3D vector computer graphcs Paolo Varagnolo: freelance engneer Padova Aprl 2016 Prvate Practce ----------------------------------- 1. Introducton Vector 3D model representaton n computer graphcs requres

More information

Load-Balanced Anycast Routing

Load-Balanced Anycast Routing Load-Balanced Anycast Routng Chng-Yu Ln, Jung-Hua Lo, and Sy-Yen Kuo Department of Electrcal Engneerng atonal Tawan Unversty, Tape, Tawan sykuo@cc.ee.ntu.edu.tw Abstract For fault-tolerance and load-balance

More information

Type-2 Fuzzy Non-uniform Rational B-spline Model with Type-2 Fuzzy Data

Type-2 Fuzzy Non-uniform Rational B-spline Model with Type-2 Fuzzy Data Malaysan Journal of Mathematcal Scences 11(S) Aprl : 35 46 (2017) Specal Issue: The 2nd Internatonal Conference and Workshop on Mathematcal Analyss (ICWOMA 2016) MALAYSIAN JOURNAL OF MATHEMATICAL SCIENCES

More information

Hermite Splines in Lie Groups as Products of Geodesics

Hermite Splines in Lie Groups as Products of Geodesics Hermte Splnes n Le Groups as Products of Geodescs Ethan Eade Updated May 28, 2017 1 Introducton 1.1 Goal Ths document defnes a curve n the Le group G parametrzed by tme and by structural parameters n the

More information

Evaluation of an Enhanced Scheme for High-level Nested Network Mobility

Evaluation of an Enhanced Scheme for High-level Nested Network Mobility IJCSNS Internatonal Journal of Computer Scence and Network Securty, VOL.15 No.10, October 2015 1 Evaluaton of an Enhanced Scheme for Hgh-level Nested Network Moblty Mohammed Babker Al Mohammed, Asha Hassan.

More information

Problem Set 3 Solutions

Problem Set 3 Solutions Introducton to Algorthms October 4, 2002 Massachusetts Insttute of Technology 6046J/18410J Professors Erk Demane and Shaf Goldwasser Handout 14 Problem Set 3 Solutons (Exercses were not to be turned n,

More information

CSCI 104 Sorting Algorithms. Mark Redekopp David Kempe

CSCI 104 Sorting Algorithms. Mark Redekopp David Kempe CSCI 104 Sortng Algorthms Mark Redekopp Davd Kempe Algorthm Effcency SORTING 2 Sortng If we have an unordered lst, sequental search becomes our only choce If we wll perform a lot of searches t may be benefcal

More information

Explicit Formulas and Efficient Algorithm for Moment Computation of Coupled RC Trees with Lumped and Distributed Elements

Explicit Formulas and Efficient Algorithm for Moment Computation of Coupled RC Trees with Lumped and Distributed Elements Explct Formulas and Effcent Algorthm for Moment Computaton of Coupled RC Trees wth Lumped and Dstrbuted Elements Qngan Yu and Ernest S.Kuh Electroncs Research Lab. Unv. of Calforna at Berkeley Berkeley

More information

Parallel Numerics. 1 Preconditioning & Iterative Solvers (From 2016)

Parallel Numerics. 1 Preconditioning & Iterative Solvers (From 2016) Technsche Unverstät München WSe 6/7 Insttut für Informatk Prof. Dr. Thomas Huckle Dpl.-Math. Benjamn Uekermann Parallel Numercs Exercse : Prevous Exam Questons Precondtonng & Iteratve Solvers (From 6)

More information

Efficient Distributed File System (EDFS)

Efficient Distributed File System (EDFS) Effcent Dstrbuted Fle System (EDFS) (Sem-Centralzed) Debessay(Debsh) Fesehaye, Rahul Malk & Klara Naherstedt Unversty of Illnos-Urbana Champagn Contents Problem Statement, Related Work, EDFS Desgn Rate

More information

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search Sequental search Buldng Java Programs Chapter 13 Searchng and Sortng sequental search: Locates a target value n an array/lst by examnng each element from start to fnsh. How many elements wll t need to

More information

Assembler. Shimon Schocken. Spring Elements of Computing Systems 1 Assembler (Ch. 6) Compiler. abstract interface.

Assembler. Shimon Schocken. Spring Elements of Computing Systems 1 Assembler (Ch. 6) Compiler. abstract interface. IDC Herzlya Shmon Schocken Assembler Shmon Schocken Sprng 2005 Elements of Computng Systems 1 Assembler (Ch. 6) Where we are at: Human Thought Abstract desgn Chapters 9, 12 abstract nterface H.L. Language

More information

Exercises (Part 4) Introduction to R UCLA/CCPR. John Fox, February 2005

Exercises (Part 4) Introduction to R UCLA/CCPR. John Fox, February 2005 Exercses (Part 4) Introducton to R UCLA/CCPR John Fox, February 2005 1. A challengng problem: Iterated weghted least squares (IWLS) s a standard method of fttng generalzed lnear models to data. As descrbed

More information

CHARUTAR VIDYA MANDAL S SEMCOM Vallabh Vidyanagar

CHARUTAR VIDYA MANDAL S SEMCOM Vallabh Vidyanagar CHARUTAR VIDYA MANDAL S SEMCOM Vallabh Vdyanagar Faculty Name: Am D. Trved Class: SYBCA Subject: US03CBCA03 (Advanced Data & Fle Structure) *UNIT 1 (ARRAYS AND TREES) **INTRODUCTION TO ARRAYS If we want

More information

NAG Fortran Library Chapter Introduction. G10 Smoothing in Statistics

NAG Fortran Library Chapter Introduction. G10 Smoothing in Statistics Introducton G10 NAG Fortran Lbrary Chapter Introducton G10 Smoothng n Statstcs Contents 1 Scope of the Chapter... 2 2 Background to the Problems... 2 2.1 Smoothng Methods... 2 2.2 Smoothng Splnes and Regresson

More information

THE PULL-PUSH ALGORITHM REVISITED

THE PULL-PUSH ALGORITHM REVISITED THE PULL-PUSH ALGORITHM REVISITED Improvements, Computaton of Pont Denstes, and GPU Implementaton Martn Kraus Computer Graphcs & Vsualzaton Group, Technsche Unverstät München, Boltzmannstraße 3, 85748

More information

Module Management Tool in Software Development Organizations

Module Management Tool in Software Development Organizations Journal of Computer Scence (5): 8-, 7 ISSN 59-66 7 Scence Publcatons Management Tool n Software Development Organzatons Ahmad A. Al-Rababah and Mohammad A. Al-Rababah Faculty of IT, Al-Ahlyyah Amman Unversty,

More information

Reducing Frame Rate for Object Tracking

Reducing Frame Rate for Object Tracking Reducng Frame Rate for Object Trackng Pavel Korshunov 1 and We Tsang Oo 2 1 Natonal Unversty of Sngapore, Sngapore 11977, pavelkor@comp.nus.edu.sg 2 Natonal Unversty of Sngapore, Sngapore 11977, oowt@comp.nus.edu.sg

More information

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 An Iteratve Soluton Approach to Process Plant Layout usng Mxed

More information

A New Approach For the Ranking of Fuzzy Sets With Different Heights

A New Approach For the Ranking of Fuzzy Sets With Different Heights New pproach For the ankng of Fuzzy Sets Wth Dfferent Heghts Pushpnder Sngh School of Mathematcs Computer pplcatons Thapar Unversty, Patala-7 00 Inda pushpndersnl@gmalcom STCT ankng of fuzzy sets plays

More information

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data A Fast Content-Based Multmeda Retreval Technque Usng Compressed Data Borko Furht and Pornvt Saksobhavvat NSF Multmeda Laboratory Florda Atlantc Unversty, Boca Raton, Florda 3343 ABSTRACT In ths paper,

More information

Determining the Optimal Bandwidth Based on Multi-criterion Fusion

Determining the Optimal Bandwidth Based on Multi-criterion Fusion Proceedngs of 01 4th Internatonal Conference on Machne Learnng and Computng IPCSIT vol. 5 (01) (01) IACSIT Press, Sngapore Determnng the Optmal Bandwdth Based on Mult-crteron Fuson Ha-L Lang 1+, Xan-Mn

More information

APPLICATION OF A COMPUTATIONALLY EFFICIENT GEOSTATISTICAL APPROACH TO CHARACTERIZING VARIABLY SPACED WATER-TABLE DATA

APPLICATION OF A COMPUTATIONALLY EFFICIENT GEOSTATISTICAL APPROACH TO CHARACTERIZING VARIABLY SPACED WATER-TABLE DATA RFr"W/FZD JAN 2 4 1995 OST control # 1385 John J Q U ~ M Argonne Natonal Laboratory Argonne, L 60439 Tel: 708-252-5357, Fax: 708-252-3 611 APPLCATON OF A COMPUTATONALLY EFFCENT GEOSTATSTCAL APPROACH TO

More information

User Authentication Based On Behavioral Mouse Dynamics Biometrics

User Authentication Based On Behavioral Mouse Dynamics Biometrics User Authentcaton Based On Behavoral Mouse Dynamcs Bometrcs Chee-Hyung Yoon Danel Donghyun Km Department of Computer Scence Department of Computer Scence Stanford Unversty Stanford Unversty Stanford, CA

More information

Machine Learning: Algorithms and Applications

Machine Learning: Algorithms and Applications 14/05/1 Machne Learnng: Algorthms and Applcatons Florano Zn Free Unversty of Bozen-Bolzano Faculty of Computer Scence Academc Year 011-01 Lecture 10: 14 May 01 Unsupervsed Learnng cont Sldes courtesy of

More information

Multiblock method for database generation in finite element programs

Multiblock method for database generation in finite element programs Proc. of the 9th WSEAS Int. Conf. on Mathematcal Methods and Computatonal Technques n Electrcal Engneerng, Arcachon, October 13-15, 2007 53 Multblock method for database generaton n fnte element programs

More information

Support Vector Machines

Support Vector Machines /9/207 MIST.6060 Busness Intellgence and Data Mnng What are Support Vector Machnes? Support Vector Machnes Support Vector Machnes (SVMs) are supervsed learnng technques that analyze data and recognze patterns.

More information

CE 221 Data Structures and Algorithms

CE 221 Data Structures and Algorithms CE 1 ata Structures and Algorthms Chapter 4: Trees BST Text: Read Wess, 4.3 Izmr Unversty of Economcs 1 The Search Tree AT Bnary Search Trees An mportant applcaton of bnary trees s n searchng. Let us assume

More information

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1)

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1) Secton 1.2 Subsets and the Boolean operatons on sets If every element of the set A s an element of the set B, we say that A s a subset of B, or that A s contaned n B, or that B contans A, and we wrte A

More information

Parallel Inverse Halftoning by Look-Up Table (LUT) Partitioning

Parallel Inverse Halftoning by Look-Up Table (LUT) Partitioning Parallel Inverse Halftonng by Look-Up Table (LUT) Parttonng Umar F. Sddq and Sadq M. Sat umar@ccse.kfupm.edu.sa, sadq@kfupm.edu.sa KFUPM Box: Department of Computer Engneerng, Kng Fahd Unversty of Petroleum

More information

Dependence of the Color Rendering Index on the Luminance of Light Sources and Munsell Samples

Dependence of the Color Rendering Index on the Luminance of Light Sources and Munsell Samples Australan Journal of Basc and Appled Scences, 4(10): 4609-4613, 2010 ISSN 1991-8178 Dependence of the Color Renderng Index on the Lumnance of Lght Sources and Munsell Samples 1 A. EL-Bally (Physcs Department),

More information

Motivation. EE 457 Unit 4. Throughput vs. Latency. Performance Depends on View Point?! Computer System Performance. An individual user wants to:

Motivation. EE 457 Unit 4. Throughput vs. Latency. Performance Depends on View Point?! Computer System Performance. An individual user wants to: 4.1 4.2 Motvaton EE 457 Unt 4 Computer System Performance An ndvdual user wants to: Mnmze sngle program executon tme A datacenter owner wants to: Maxmze number of Mnmze ( ) http://e-tellgentnternetmarketng.com/webste/frustrated-computer-user-2/

More information

TN348: Openlab Module - Colocalization

TN348: Openlab Module - Colocalization TN348: Openlab Module - Colocalzaton Topc The Colocalzaton module provdes the faclty to vsualze and quantfy colocalzaton between pars of mages. The Colocalzaton wndow contans a prevew of the two mages

More information

Classifying Acoustic Transient Signals Using Artificial Intelligence

Classifying Acoustic Transient Signals Using Artificial Intelligence Classfyng Acoustc Transent Sgnals Usng Artfcal Intellgence Steve Sutton, Unversty of North Carolna At Wlmngton (suttons@charter.net) Greg Huff, Unversty of North Carolna At Wlmngton (jgh7476@uncwl.edu)

More information

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers IOSR Journal of Electroncs and Communcaton Engneerng (IOSR-JECE) e-issn: 78-834,p- ISSN: 78-8735.Volume 9, Issue, Ver. IV (Mar - Apr. 04), PP 0-07 Content Based Image Retreval Usng -D Dscrete Wavelet wth

More information

Brave New World Pseudocode Reference

Brave New World Pseudocode Reference Brave New World Pseudocode Reference Pseudocode s a way to descrbe how to accomplsh tasks usng basc steps lke those a computer mght perform. In ths week s lab, you'll see how a form of pseudocode can be

More information

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points;

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points; Subspace clusterng Clusterng Fundamental to all clusterng technques s the choce of dstance measure between data ponts; D q ( ) ( ) 2 x x = x x, j k = 1 k jk Squared Eucldean dstance Assumpton: All features

More information

Harvard University CS 101 Fall 2005, Shimon Schocken. Assembler. Elements of Computing Systems 1 Assembler (Ch. 6)

Harvard University CS 101 Fall 2005, Shimon Schocken. Assembler. Elements of Computing Systems 1 Assembler (Ch. 6) Harvard Unversty CS 101 Fall 2005, Shmon Schocken Assembler Elements of Computng Systems 1 Assembler (Ch. 6) Why care about assemblers? Because Assemblers employ some nfty trcks Assemblers are the frst

More information

Preconditioning Parallel Sparse Iterative Solvers for Circuit Simulation

Preconditioning Parallel Sparse Iterative Solvers for Circuit Simulation Precondtonng Parallel Sparse Iteratve Solvers for Crcut Smulaton A. Basermann, U. Jaekel, and K. Hachya 1 Introducton One mportant mathematcal problem n smulaton of large electrcal crcuts s the soluton

More information

AMath 483/583 Lecture 21 May 13, Notes: Notes: Jacobi iteration. Notes: Jacobi with OpenMP coarse grain

AMath 483/583 Lecture 21 May 13, Notes: Notes: Jacobi iteration. Notes: Jacobi with OpenMP coarse grain AMath 483/583 Lecture 21 May 13, 2011 Today: OpenMP and MPI versons of Jacob teraton Gauss-Sedel and SOR teratve methods Next week: More MPI Debuggng and totalvew GPU computng Read: Class notes and references

More information

Sorting Review. Sorting. Comparison Sorting. CSE 680 Prof. Roger Crawfis. Assumptions

Sorting Review. Sorting. Comparison Sorting. CSE 680 Prof. Roger Crawfis. Assumptions Sortng Revew Introducton to Algorthms Qucksort CSE 680 Prof. Roger Crawfs Inserton Sort T(n) = Θ(n 2 ) In-place Merge Sort T(n) = Θ(n lg(n)) Not n-place Selecton Sort (from homework) T(n) = Θ(n 2 ) In-place

More information

Solving two-person zero-sum game by Matlab

Solving two-person zero-sum game by Matlab Appled Mechancs and Materals Onlne: 2011-02-02 ISSN: 1662-7482, Vols. 50-51, pp 262-265 do:10.4028/www.scentfc.net/amm.50-51.262 2011 Trans Tech Publcatons, Swtzerland Solvng two-person zero-sum game by

More information

Learning the Kernel Parameters in Kernel Minimum Distance Classifier

Learning the Kernel Parameters in Kernel Minimum Distance Classifier Learnng the Kernel Parameters n Kernel Mnmum Dstance Classfer Daoqang Zhang 1,, Songcan Chen and Zh-Hua Zhou 1* 1 Natonal Laboratory for Novel Software Technology Nanjng Unversty, Nanjng 193, Chna Department

More information

Load Balancing for Hex-Cell Interconnection Network

Load Balancing for Hex-Cell Interconnection Network Int. J. Communcatons, Network and System Scences,,, - Publshed Onlne Aprl n ScRes. http://www.scrp.org/journal/jcns http://dx.do.org/./jcns.. Load Balancng for Hex-Cell Interconnecton Network Saher Manaseer,

More information

Quality Improvement Algorithm for Tetrahedral Mesh Based on Optimal Delaunay Triangulation

Quality Improvement Algorithm for Tetrahedral Mesh Based on Optimal Delaunay Triangulation Intellgent Informaton Management, 013, 5, 191-195 Publshed Onlne November 013 (http://www.scrp.org/journal/m) http://dx.do.org/10.36/m.013.5601 Qualty Improvement Algorthm for Tetrahedral Mesh Based on

More information

Feature Reduction and Selection

Feature Reduction and Selection Feature Reducton and Selecton Dr. Shuang LIANG School of Software Engneerng TongJ Unversty Fall, 2012 Today s Topcs Introducton Problems of Dmensonalty Feature Reducton Statstc methods Prncpal Components

More information

High resolution 3D Tau-p transform by matching pursuit Weiping Cao* and Warren S. Ross, Shearwater GeoServices

High resolution 3D Tau-p transform by matching pursuit Weiping Cao* and Warren S. Ross, Shearwater GeoServices Hgh resoluton 3D Tau-p transform by matchng pursut Wepng Cao* and Warren S. Ross, Shearwater GeoServces Summary The 3D Tau-p transform s of vtal sgnfcance for processng sesmc data acqured wth modern wde

More information

Related-Mode Attacks on CTR Encryption Mode

Related-Mode Attacks on CTR Encryption Mode Internatonal Journal of Network Securty, Vol.4, No.3, PP.282 287, May 2007 282 Related-Mode Attacks on CTR Encrypton Mode Dayn Wang, Dongda Ln, and Wenlng Wu (Correspondng author: Dayn Wang) Key Laboratory

More information

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters Proper Choce of Data Used for the Estmaton of Datum Transformaton Parameters Hakan S. KUTOGLU, Turkey Key words: Coordnate systems; transformaton; estmaton, relablty. SUMMARY Advances n technologes and

More information

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr) Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute

More information

APPLICATION OF MULTIVARIATE LOSS FUNCTION FOR ASSESSMENT OF THE QUALITY OF TECHNOLOGICAL PROCESS MANAGEMENT

APPLICATION OF MULTIVARIATE LOSS FUNCTION FOR ASSESSMENT OF THE QUALITY OF TECHNOLOGICAL PROCESS MANAGEMENT 3. - 5. 5., Brno, Czech Republc, EU APPLICATION OF MULTIVARIATE LOSS FUNCTION FOR ASSESSMENT OF THE QUALITY OF TECHNOLOGICAL PROCESS MANAGEMENT Abstract Josef TOŠENOVSKÝ ) Lenka MONSPORTOVÁ ) Flp TOŠENOVSKÝ

More information

Repeater Insertion for Two-Terminal Nets in Three-Dimensional Integrated Circuits

Repeater Insertion for Two-Terminal Nets in Three-Dimensional Integrated Circuits Repeater Inserton for Two-Termnal Nets n Three-Dmensonal Integrated Crcuts Hu Xu, Vasls F. Pavlds, and Govann De Mchel LSI - EPFL, CH-5, Swtzerland, {hu.xu,vasleos.pavlds,govann.demchel}@epfl.ch Abstract.

More information

Dynamic wetting property investigation of AFM tips in micro/nanoscale

Dynamic wetting property investigation of AFM tips in micro/nanoscale Dynamc wettng property nvestgaton of AFM tps n mcro/nanoscale The wettng propertes of AFM probe tps are of concern n AFM tp related force measurement, fabrcaton, and manpulaton technques, such as dp-pen

More information

TPL-Aware Displacement-driven Detailed Placement Refinement with Coloring Constraints

TPL-Aware Displacement-driven Detailed Placement Refinement with Coloring Constraints TPL-ware Dsplacement-drven Detaled Placement Refnement wth Colorng Constrants Tao Ln Iowa State Unversty tln@astate.edu Chrs Chu Iowa State Unversty cnchu@astate.edu BSTRCT To mnmze the effect of process

More information

The Shortest Path of Touring Lines given in the Plane

The Shortest Path of Touring Lines given in the Plane Send Orders for Reprnts to reprnts@benthamscence.ae 262 The Open Cybernetcs & Systemcs Journal, 2015, 9, 262-267 The Shortest Path of Tourng Lnes gven n the Plane Open Access Ljuan Wang 1,2, Dandan He

More information