
Journal of Computational Physics xxx (2010) xxx–xxx (article YJCPH 3186)

Fast evaluation of Helmholtz potential on graphics processing units (GPUs)

Shaojing Li *, Boris Livshitz, Vitaly Lomakin

Department of Electrical and Computer Engineering, University of California, San Diego, United States

* Corresponding author. E-mail addresses: sl@ucsd.edu (S. Li), vtaly@ece.ucsd.edu (V. Lomakin).

Article info

Article history: Received 21 August 2009; Received in revised form 19 June 2010; Accepted 21 July 2010; Available online xxxx

Keywords: Graphics processing units (GPUs); Computational electromagnetics; Fast methods; Non-uniform grid interpolation methods; Integral equations

Abstract

This paper presents a parallel algorithm implemented on graphics processing units (GPUs) for rapidly evaluating spatial convolutions between the Helmholtz potential and a large-scale source distribution. The algorithm implements a non-uniform grid interpolation method (NGIM), which uses amplitude and phase compensation and spatial interpolation from a sparse grid to compute the field outside a source domain. NGIM reduces the computational time cost of the direct field evaluation at $N$ observers due to $N$ co-located sources from $O(N^2)$ to $O(N)$ in the static and low-frequency regimes, to $O(N \log N)$ in the high-frequency regime, and between these costs in the mixed-frequency regime. Memory requirements scale as $O(N)$ in all frequency regimes. Several important differences between CPU and GPU implementations of the NGIM are required to result in optimal performance. In particular, in the CPU implementations all operations, where possible, are pre-computed and stored in memory in a preprocessing stage. This reduces the computational time but significantly increases the memory consumption. In the GPU implementations, where handling memory often is a critical bottleneck, several approaches are used to accelerate the computations. A significant latency of the GPU global memory access is hidden by implementing coalesced reading, which requires arranging many array elements in contiguous parts of memory. Contrary to the CPU version, most of the steps in the GPU implementations are executed on the fly and only necessary arrays are kept in memory. This results in significantly reduced memory consumption, increased problem size $N$ that can be handled, and reduced computational time on GPUs. The obtained GPU–CPU speed-up ratios are from 150 to 400 depending on the required accuracy and problem size. The presented method and its CPU and GPU implementations can find important applications in various fields of physics and engineering.

© 2010 Published by Elsevier Inc.

1. Introduction

This paper is concerned with the evaluation of the discrete transform of the form

$$u(\mathbf{r}_m) = \sum_{n=1,\, n \neq m}^{N} \frac{e^{-jk|\mathbf{r}_m - \mathbf{r}_n|}}{|\mathbf{r}_m - \mathbf{r}_n|}\, Q_n, \qquad m = 1, 2, \ldots, N \tag{1}$$

on graphics processor units (GPUs) with computational complexity (i.e. the computational time and number of operations) of $O(N)$ in the low-frequency regime, $O(N \log N)$ in the high-frequency regime, and between these costs in the mixed-frequency regime. In Eq. (1), the potential $u(\mathbf{r}_m)$ at the observation locations $\mathbf{r}_m$ is evaluated by a discrete convolution of the Green's function $G(\mathbf{r}_m, \mathbf{r}_n) = \exp(-jk|\mathbf{r}_m - \mathbf{r}_n|)/|\mathbf{r}_m - \mathbf{r}_n|$ and sources $Q_n$ co-located with the observers. The sources $Q_n$ are distributed in a domain of the linear size $D$. The total number of sources and observers is $N$. The potential $u(\mathbf{r})$ and the Green's function $G(\mathbf{r}, \mathbf{r}')$ satisfy the Helmholtz equation with the wavenumber $k$ corresponding to the wavelength $\lambda = 2\pi/k$ and frequency $f = kc/2\pi$ with wave velocity $c$. In the low-frequency regime the computational domain is small in terms of the wavelength ($D \ll \lambda$, or $D \lesssim \lambda$) and the source density is prescribed by a particular problem (e.g. by the geometrical features in the framework of integral equations [1]). In the high-frequency regime, the computational domain is large in terms of the wavelength ($D \gg \lambda$) and the source density is determined by the wavelength according to the Nyquist criterion. In the special case of $k = f = 0$ the potential satisfies the Poisson equation and the problem is considered to be static.

The task of evaluating the potential in Eq. (1) is important in many areas of computational physics, including Electromagnetics, Optics, Magnetics, Acoustics, and Elastodynamics, among others. This computational task can appear in problems that require finding fields due to a given source distribution. It is also a key for solving integral equations by iterative methods [1], in which the discrete transformation of (1) approximates continuous spatial convolutions. The computational complexity of the direct evaluation of (1) scales as $O(N^2)$. This quadratic dependence on $N$ severely limits the practical applicability of many solvers that involve spatial convolutions.

The task of reducing this high computational cost has been a subject of extensive investigation. Acceleration can be achieved by developing fast methods that reduce the asymptotic complexity of the evaluation of (1) from $O(N^2)$ to $O(N)$ or $O(N \log N)$ operations as well as developing parallelization techniques to allow these algorithms to utilize multi-processor systems.
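For reference, the direct evaluation of Eq. (1), whose $O(N^2)$ cost motivates the fast methods, can be sketched in a few lines. This is a minimal serial illustration and not the authors' code; the function name `direct_potential` and the list-of-tuples data layout are assumptions made for the example.

```python
import cmath
import math


def direct_potential(points, charges, k):
    """Directly evaluate Eq. (1): u[m] = sum over n != m of exp(-1j*k*R)/R * Q[n].

    points  : list of (x, y, z) tuples, assumed pairwise distinct (hypothetical layout)
    charges : list of source amplitudes Q[n] (real or complex)
    k       : wavenumber (k = 0 gives the static 1/R kernel)
    """
    n_points = len(points)
    u = [0j] * n_points
    for m in range(n_points):          # loop over observers
        for n in range(n_points):      # loop over sources: N*(N-1) kernel evaluations
            if n == m:
                continue               # the self-term is excluded in Eq. (1)
            dist = math.dist(points[m], points[n])
            u[m] += cmath.exp(-1j * k * dist) / dist * charges[n]
    return u
```

The doubly nested loop makes the quadratic growth explicit: doubling the number of points roughly quadruples the number of kernel evaluations.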
Methods reducing the computational complexity include the Fast Fourier Transform (FFT) based methods [2–5], Fast Multipole Methods (FMMs) [6–15], and interpolation-based methods [16–20]. The FFT-based methods exploit the space invariant nature of the Green's function and reduce the computational cost to $O(N \log N)$ and $O(N^{3/2} \log N)$ for volumetric and general surface problems, respectively. FFT-based methods can handle low- and high-frequency regimes but become inefficient for non-uniform problems where a large fraction of a structure is free from sources/observers. The FMM exploits spectral representations of the fields in terms of multipole expansions or plane wave expansions with diagonal translation operators. The computational cost of FMMs scales as $O(N \log N)$ and $O(N)$ for high- and low-frequency problems, respectively. The interpolation-based methods, including the non-uniform grid interpolation method (NGIM), exploit the fact that the field potential far from a source distribution is a function with a known asymptotic behavior. The knowledge of this behavior allows smoothing the fast spatial variations of the potential, computing it on a sparse grid, and interpolating to the required observation points. Symmetric implementations of the NGIM have been shown to scale as $O(N)$ for static problems [18,21]. Asymmetric modifications of the NGIM have a cost of $O(N \log N)$ for high-, low-, and mixed-frequency problems [16,17,20]. Similar to FMMs, NGIMs can handle non-uniform geometries and can have the same asymptotic cost for volumetric and surface problems.

Even fast methods implemented on a single processor are limited in their performance. As the performance of a single core saturates, utilizing multi-core and multi-processor systems becomes crucial for advancing future high-performance computing. An important aspect of these efforts is the parallel implementation of fast convolutional methods to evaluate the transformation in Eq. (1). There exist multiple parallel implementations of the FFT-based methods and FMMs. Most of these implementations utilize central processing unit (CPU) cluster systems [22–25].

Exciting opportunities arise when using new hardware architectures including graphics processing units (GPUs). Such systems, originally designed for gaming and graphics processing, comprise hundreds and even thousands of stream processors in a single envelope at a low cost (for example, the NVIDIA GeForce GTX 480 GPU has 480 stream processors). GPU systems offer the power of a CPU-based cluster but at the cost of a simple desktop computer. Moreover, the recently introduced Compute Unified Device Architecture (CUDA) allows writing general purpose high-level codes without considering specifics of a particular GPU. In recent years, there has been a significant effort to port various methods developed for serial and parallel CPU systems to GPUs. Several authors presented efficient GPU-based codes for N-body problems relying on direct summations (as in Eq. (1)) [26,27]. While the speed-up of these codes can be very impressive, the fact that the computational cost scales as $O(N^2)$ does not allow using these codes for large-scale practical problems. In a recent paper [28], the authors presented a CUDA implementation of the FMM for static problems ($k = 0$) and obtained good acceleration rates. CUDA implementations of interpolation-based methods for statics have also been presented [29]. However, to the best of our knowledge, no CUDA implementations of fast codes for the evaluation of the Helmholtz-type potential in Eq. (1) have been reported in the open literature.

This paper describes highly efficient CPU and GPU implementations of a modification of the NGIM for the fast evaluation of the potential $u(\mathbf{r}_m)$ in Eq. (1). The NGIM differs from the FMMs in that it relies on direct spatial interpolations. As a result, the same NGIM can be applied to static ($k = 0$) and dynamic ($k \neq 0$) problems, as well as to problems with other kernels. In addition, NGIM has a simple structure and does not require any special function evaluations, which facilitates its implementation on GPU systems. The asymptotic computational time cost of the NGIM scales as $O(N)$ for static/low-frequency problems, $O(N \log N)$ for high-frequency problems, and between these costs for mixed-frequency problems. The memory consumption cost is of $O(N)$ for all problem types. We describe how the data structures of CPU and GPU NGIM implementations are arranged to optimize their performance. Specifically, the memory handling is described in detail as it is a crucial component for speeding up the GPU implementations. It is shown how one can choose the parameters of NGIM to balance the computational costs of different stages of its implementations, leading to optimal performance. The achieved single GPU–single CPU speed-up ratios are shown to be in the range 150–400 depending on the accuracy and problem size. Matching the achieved speeds on CPU systems would require a cluster of much higher cost, higher power consumption, and possibly dedicated space infrastructure.

The paper outline is as follows. Section 2 presents foundations of the NGIM starting with the description of the grid construction approaches in Section 2.1 and proceeding with the multi-level NGIM in Section 2.2. Section 3 details the implementation of the multi-level NGIM on GPUs for each of its steps. Section 4 shows computational results demonstrating the power of the NGIM and its implementations. Finally, Section 5 summarizes the paper and discusses the presented method and obtained results.

2. Foundations of the non-uniform grid interpolation method

2.1. Grid construction

The multi-level NGIM decomposes the geometry into a hierarchy of boxes containing sources and observers. The acceleration is based on the smoothness and interpolation properties of the amplitude- and phase-compensated fields due to sufficiently separated source and observer distributions. This section describes the properties of the field generated by a source distribution confined to a source box and observed in the space outside this box, as well as the field in an observation box generated by the sources outside this box. These properties are used to construct sparse grids that allow computing the field at any observation location by locally interpolating it from the samples at these sparse grids.

2.1.1. Outgoing spherical non-uniform grids from a source box $B_s$

Consider a source box $B_s$ of the largest diagonal size $2R_s$ centered at $\mathbf{r}_s$ that contains a subset of the entire source distribution. The potential $u(\mathbf{r}, B_s)$ due to the sources in $B_s$ at an observation point $\mathbf{r}$ sufficiently distant from the source, such that $r_s = |\mathbf{r} - \mathbf{r}_s| > \gamma R_s$ with $\gamma > 1$, can be evaluated by the amplitude and phase restoration [16–18,21]

$$u(\mathbf{r}, B_s) = \frac{e^{-jk r_s}}{r_s}\, \tilde{u}(\mathbf{r}, B_s) \tag{2}$$

from the amplitude- and phase-compensated potential

$$\tilde{u}(\mathbf{r}, B_s) = r_s\, e^{jk r_s}\, u(\mathbf{r}, B_s). \tag{3}$$

In Eq. (3), the multiplication of the field by the distance $r_s$ and the exponential term $e^{jk r_s}$ compensates the field variations associated with the amplitude decay and phase shift common to all sources in the box $B_s$ for the distant observer $\mathbf{r}$. This compensation cancels the rapid field variations of the actual potential $u(\mathbf{r}, B_s)$ and renders the compensated potential $\tilde{u}(\mathbf{r}, B_s)$ a spatially slowly varying function. As a result, the potential $u(\mathbf{r}, B_s)$ at $\mathbf{r}$ can be computed by evaluating the amplitude- and phase-compensated potential over a highly sparse grid $\breve{\mathbf{r}}$ surrounding the observer, locally interpolating the compensated potential, and restoring the amplitude and phase at $\mathbf{r}$ via (2):

$$u(\mathbf{r}, B_s) = \sum_{\breve{\mathbf{r}}} \frac{e^{-jk|\mathbf{r} - \mathbf{r}_s|}}{|\mathbf{r} - \mathbf{r}_s|}\, w(\mathbf{r}, \breve{\mathbf{r}})\, \tilde{u}(\breve{\mathbf{r}}) = \sum_{\breve{\mathbf{r}}} \breve{w}(\mathbf{r}, \breve{\mathbf{r}})\, \tilde{u}(\breve{\mathbf{r}}). \tag{4}$$

In Eq. (4), the interpolation is local, i.e. the summations are over $O(1)$ samples at the grid points $\breve{\mathbf{r}}$ surrounding the observer at $\mathbf{r}$, with interpolation coefficients $w(\mathbf{r}, \breve{\mathbf{r}})$. In the last equality in Eq. (4), the coefficients $\breve{w}(\mathbf{r}, \breve{\mathbf{r}})$ include the interpolation and field restoration coefficients. Following Refs. [17,18,21], the sparse grid $\breve{\mathbf{r}}$ is chosen to be spherical, i.e. it is constructed based on two angular coordinates $\theta_s$, $\varphi_s$ and a radial coordinate $r_s$ defined with respect to the center $\mathbf{r}_s$ of the source box. The minimal angular and radial sampling rates for constructing the non-uniform spherical grids have been derived in Ref. [17]. The angular $\theta$- and $\varphi$-sampling rate $f_a^s$ satisfies $f_a^s = \Omega_a (k R_s + 1)/\pi$, where $\Omega_a \geq 1$ is an angular oversampling ratio [17]. The required local radial sampling rate for the low-frequency case is derived following the introduction of a new reciprocal radial variable $\alpha = 1/r_s$. The uniform $\alpha$-sampling rate $f_\alpha^s$ should satisfy $f_\alpha^s = \Omega_r R_s/(4\pi)$, where $\Omega_r \geq 1$ is a radial oversampling ratio. More rigorous estimates on the sampling rates can be derived but they do not modify practically obtained results in terms of the computational time and memory consumption.
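To illustrate why the compensation of Eq. (3) permits sparse radial sampling, the following sketch compares two ways of interpolating the field of a small source cluster between two radial samples: linear interpolation of the compensated potential in the reciprocal variable $\alpha = 1/r$ followed by the restoration of Eq. (2), and linear interpolation of the raw potential. The source positions, amplitudes, wavenumber, and sample radii are hypothetical values chosen for the example, and two-point linear interpolation stands in for the higher-order local interpolation of Eq. (4).

```python
import cmath
import math

# Hypothetical test data: three sources inside a small box centered at the origin.
SOURCES = [((0.1, 0.0, 0.0), 1.0), ((-0.1, 0.05, 0.0), 0.5), ((0.0, -0.1, 0.1), -0.7)]
K = 2.0 * math.pi  # wavenumber for unit wavelength (assumed value)


def potential(r):
    """Potential u at the observer (r, 0, 0) due to SOURCES, by direct summation."""
    total = 0j
    for position, charge in SOURCES:
        dist = math.dist((r, 0.0, 0.0), position)
        total += cmath.exp(-1j * K * dist) / dist * charge
    return total


def compensated(r):
    """Amplitude- and phase-compensated potential of Eq. (3)."""
    return r * cmath.exp(1j * K * r) * potential(r)


def interpolation_errors(r_near=2.0, r_far=3.0):
    """Relative errors (compensated route, naive route) at the alpha-midpoint."""
    alpha_mid = 0.5 * (1.0 / r_near + 1.0 / r_far)
    r_mid = 1.0 / alpha_mid
    exact = potential(r_mid)
    # Compensated route: interpolate the smooth function, then restore via Eq. (2).
    comp_mid = 0.5 * (compensated(r_near) + compensated(r_far))
    restored = cmath.exp(-1j * K * r_mid) / r_mid * comp_mid
    # Naive route: interpolate the oscillatory potential itself.
    naive = 0.5 * (potential(r_near) + potential(r_far))
    return abs(restored - exact) / abs(exact), abs(naive - exact) / abs(exact)
```

With samples a full wavelength apart, the naive route cannot follow the phase of the field, while the compensated route remains accurate because the common $e^{-jkr}/r$ behavior has been factored out before interpolating.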
The required uniform sampling versus $\alpha$ translates into a non-uniform sampling versus $r_s$ with a local sampling rate of $f_\alpha^s = \Omega_r R_s (k R_s + 1)/(4\pi)$, thereby justifying the notion of a non-uniform grid (NG). Clearly, the radial sampling becomes sparser as the distance between the observation and the source points increases.

The local interpolation in Eq. (4) can be accomplished based on various polynomial (e.g. Lagrange) interpolations. The error can be controlled by increasing the interpolation order or by increasing the oversampling ratios, i.e. increasing the density of the non-uniform grid. This non-uniform grid interpolation procedure allows for a significant reduction of the computational complexity because only a small number of samples are required at the sparse non-uniform grid to compute the field at a large number of observers and because the local interpolation is computationally inexpensive.

2.1.2. Local Cartesian uniform grids at a box $B_o$

Consider an observation box $B_o$ of the largest size $2R_o$ centered at $\mathbf{r}_o$ that contains a subset of all the observation locations. The potential $u(\mathbf{r}, B_o)$ inside $B_o$ is produced by sources outside a sphere of radius $\gamma R_o$. The potential $u(\mathbf{r}, B_o)$ can be computed by evaluating it over a grid of points $\hat{\mathbf{r}}$ inside or around the box $B_o$ and locally interpolating it to the observation location $\mathbf{r}$:

$$u(\mathbf{r}, B_o) = \sum_{\hat{\mathbf{r}}} \hat{w}(\mathbf{r}, \hat{\mathbf{r}})\, u(\hat{\mathbf{r}}). \tag{5}$$

Similar to Eq. (4), the interpolation in Eq. (5) is local, i.e. it is executed from $O(1)$ samples at the grid points $\hat{\mathbf{r}}$ surrounding the observer at $\mathbf{r}$ with interpolation coefficients $\hat{w}(\mathbf{r}, \hat{\mathbf{r}})$. Several approaches can be followed to construct the grid $\hat{\mathbf{r}}$. For example, in Refs. [18,21], spherical grids were used for the static problem ($k = 0$). Here, for convenience of interpolations we use uniform Cartesian grids (CGs) aligned with the observation boxes. The sampling rate in the $x$, $y$, and $z$ coordinates satisfies $f_x^o = \Omega_x (k + 1/R_o)$, where $\Omega_x \geq 1$ is the oversampling ratio. As in the case of non-uniform grids, the local interpolation in Eq. (5) can be based on various polynomial (e.g. Lagrange) interpolation types.

In the low-frequency regime with $k R_o \ll 1$ the number of CG samples is (almost) frequency independent and it can be much smaller than the number of observers, which results in a significant reduction of the number of operations. In the high-frequency regime, the number of observers in the observer box is of the same order as the number of grid points. Introducing such grids would result in no computational gain and could reduce accuracy; hence CGs are not defined for boxes of size comparable to or larger than the wavelength.

[Fig. 2.1. The illustration of different geometric relations between source boxes and a certain observation box. Legend: level $l$ box, level $l+1$ box, level $l+2$ box; observation box; interacting far-field box; non-interacting far-field box; near-field box.]

2.2. Multi-level non-uniform grid interpolation method

2.2.1. Domain decomposition and data structure

The computational domain is decomposed into a multi-level hierarchy of boxes, similarly to FMMs. Starting from the largest box of side length $D$ enclosing the entire domain, the boxes are recursively subdivided.
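The recursive subdivision of the enclosing box can be sketched with integer box indices. This is an illustrative helper set under the assumption that the root box is an axis-aligned cube and that boxes are addressed by an integer triple per level; the names `box_index`, `parent_index`, and `child_indices` are hypothetical and do not come from the paper.

```python
def box_index(point, origin, side, level):
    """Integer (ix, iy, iz) index of the level-`level` box containing `point`.

    The root box (level 0) is the cube with corner `origin` and edge `side`;
    level l splits each axis into 2**l equal intervals of length side / 2**l.
    """
    boxes_per_axis = 2 ** level
    index = []
    for coordinate, corner in zip(point, origin):
        i = int((coordinate - corner) / side * boxes_per_axis)
        index.append(min(max(i, 0), boxes_per_axis - 1))  # clamp points on the boundary
    return tuple(index)


def parent_index(index):
    """Index of the parent box one level coarser."""
    return tuple(i // 2 for i in index)


def child_indices(index):
    """Indices of the eight child boxes one level finer."""
    ix, iy, iz = index
    return [(2 * ix + dx, 2 * iy + dy, 2 * iz + dz)
            for dx in (0, 1) for dy in (0, 1) for dz in (0, 1)]
```

Halving an index component moves one level up the tree, so the parent and child relations of the octal tree never need to be stored explicitly in this representation.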
The subdivision hierarchy is stored in an octal tree, where each box $B_n^l$ of level $l$ ($0 < l < L$) and side length $D_l = 2^{-l} D$ is considered to be a parent to its (eight) child boxes $B_m^{l+1}$ of size $D_{l+1} = 2^{-(l+1)} D$ at level $(l + 1)$. The size of the smallest box (at the finest level) is considered to be much smaller than the wavelength for low- and high-frequency problems. The Local List $I_L(B_m^l)$ stores pointers to the sources contained in non-empty boxes $B_m^l$ on level $l$. Only non-empty boxes are kept in the actual data structure.

Based on the geometric subdivision of the computational domain, pairs of near-field and interacting far-field boxes are identified. A pair of boxes is considered near-field when they are close such that the maximal distance between them is less than a preset factor $\gamma \sqrt{3} D_L$ (typically $2 < \gamma < 4$) and their parent boxes do not form a far-field pair; such boxes exist at level $L$ (green¹ boxes in Fig. 2.1). The construction of the near-field box pairs leads to the definition of near-field interaction lists (NIL) $I_N(B_n^L)$, whose members are all boxes paired with $B_n^L$ in the near-field. The number of levels $L$ is chosen such that for each smallest box there are $O(1)$ near-field pairs and the total number of near-field pairs is $O(N)$. A pair of same-level boxes is considered interacting far-field when the maximal distance between them is greater than $\gamma \sqrt{3} D_l$ and the distance between their parent boxes is smaller than $\gamma \sqrt{3} D_{l-1}$ (yellow boxes in Fig. 2.1). The construction of the far-field box pairs leads to the definition of far-field interaction lists (FIL) $I_F(B_n^l)$, which for each observer box $B_n^l$ (at all levels) contain all boxes paired with $B_n^l$ in the far-field (in the same level). For any box at any level there are only $O(1)$ far-field pairs. For the finest level $L$, the number of the far-field interaction pairs is $O(N)$ (assuming that the number of sources/observers per box is of $O(1)$). For coarser levels the number of boxes and far-field pairs decreases so that the total number of the far-field interaction pairs is $O(N)$.

¹ For interpretation of color in Fig. 2.1, the reader is referred to the web version of this article.

Each non-empty box at all levels contains NG samples. In the low-frequency regime, the density of the NGs is constant at all levels and the number of the samples per box is $O(1)$. In the high-frequency regime, the density of NGs for the boxes at coarser levels increases as $O(8^{L-l})$ (since it is related to the wavelength). CGs are defined for boxes at levels not coarser than a certain interface level, i.e. $l \geq l_{int}$ with $2 \leq l_{int} \leq L$. The interface level is a level at which the box size is sufficiently smaller than the wavelength. For levels $l \geq l_{int}$, referred to as low-frequency levels, CGs reduce the computational complexity as explained in Section 2.1. For levels $l < l_{int}$, referred to as high-frequency levels, CGs are omitted. The introduction of the interface level makes the algorithm most efficient in low-, high-, and mixed-frequency regimes. In the low-frequency regime, $l_{int}$ is chosen as the coarsest level $l_{int} = 2$ at which far-field interaction pairs exist, i.e. all levels are low-frequency levels. In the high-frequency regime, $l_{int}$ is chosen at the finest level $l_{int} = L$, i.e. all levels are high-frequency levels. In the mixed-frequency regime, $2 \leq l_{int} \leq L$ and its value is determined by whether the low- or high-frequency regime dominates.

Based on the above domain subdivision, the field $u(\mathbf{r}_i)$ at an observation location $\mathbf{r}_i$ is obtained by aggregating the fields due to sources in (source) boxes $B_{n'}^l$ that appear in the NIL and FIL:

$$u(\mathbf{r}_i) = u_{NF}(\mathbf{r}_i) + u_{FF}(\mathbf{r}_i). \tag{6}$$

The near-field contribution $u_{NF}(\mathbf{r}_i)$ is evaluated directly using Eq. (1) via

$$u_{NF}(\mathbf{r}_i) = \sum_{B_m^L \in I_N(B_n^L)} \; \sum_{\mathbf{r}_j \in I_L(B_m^L)} \frac{e^{-jk|\mathbf{r}_i - \mathbf{r}_j|}}{|\mathbf{r}_i - \mathbf{r}_j|}\, Q_j, \tag{7}$$

where the summations are over the sources at $\mathbf{r}_j$ in the local list of the boxes $B_m^L$ that belong to the NIL of the box $B_n^L$ containing the observer at $\mathbf{r}_i$. There are only $O(1)$ near-field interacting boxes for every observation box $B_n^L$ and the computational cost of evaluating the near-field contribution scales as $O(N)$.

The far-field contribution $u_{FF}(\mathbf{r}_i)$ is evaluated via the multi-level NGIM as described next.

2.2.2. Non-uniform grid interpolation method for the far-field evaluation

The evaluation of the far-field contribution is accomplished in four stages based on the behavior of the potential and the procedures described in Section 2.1. The procedures of each stage are illustrated in Fig. 2.2 (in a 2-D form for clarity).

Stage 1 (finest level $l = L$: sources → NG computations; Fig. 2.2(a)). At the finest level $L$, the potential is computed using Eq. (1) at the NGs for all non-empty boxes. The NGs $\{\breve{\mathbf{r}}^{L,m}\}$ in this stage are defined for all non-empty boxes $B_m^L$ at level $L$ as described in Section 2.1. Samples of the far-field potential $\breve{u}_m^L(\breve{\mathbf{r}}^{L,m})$ at the corresponding NGs due to charges in the Local List of box $B_m^L$ are computed as

$$\breve{u}_m^L(\breve{\mathbf{r}}^{L,m}) = \sum_{\mathbf{r}_j \in I_L(B_m^L)} \frac{e^{-jk|\breve{\mathbf{r}}^{L,m} - \mathbf{r}_j|}}{|\breve{\mathbf{r}}^{L,m} - \mathbf{r}_j|}\, Q_j. \tag{8}$$

Since the number of boxes at the finest level $L$ is $O(N)$ and there are $O(1)$ grid points per box, the computational cost of evaluating the fields at all NGs for all level $L$ boxes scales as $O(N)$ for the considered low-frequency or static regimes.

Stage 2 (upward pass: aggregation of NGs; Fig. 2.2(b)). The far-field potentials at their respective NGs for all boxes at all levels from $l = L - 1$ to $l = 2$ are obtained recursively by interpolating and aggregating the contributions from the NGs of the corresponding child boxes.
Specifically, the far-field potential at the NGs of box $B_m^{l-1}$ is obtained by adding up contributions from its child boxes via

$$\breve{u}_m^{l-1}(\breve{\mathbf{r}}^{l-1,m}) = \sum_{B_n^l \in B_m^{l-1}} \sum_{\breve{\mathbf{r}}^{l,n}} \breve{w}^{l,n}(\breve{\mathbf{r}}^{l-1,m}, \breve{\mathbf{r}}^{l,n})\, \breve{u}_n^l(\breve{\mathbf{r}}^{l,n}). \tag{9}$$

Here, the potentials $\breve{u}_m^{l-1}$ of the parent box $B_m^{l-1}$ are interpolated from their respective child NGs $\{\breve{\mathbf{r}}^{l,n}\}$ to the new NGs $\{\breve{\mathbf{r}}^{l-1,m}\}$. The coefficients $\breve{w}^{l,n}(\breve{\mathbf{r}}^{l-1,m}, \breve{\mathbf{r}}^{l,n})$ are interpolation coefficients as defined in Eq. (4). Referring to the discussion in Section 2.2.1 on the number of non-empty boxes and the number of NG points per box, the computational cost of Stage 2 for low- and high-frequency problems is $O(N)$ and $O(N \log N)$, respectively.

Stage 3 (downward pass: NG → CG transitions; CG decomposition; Fig. 2.2(c)). In this stage, the field samples at CGs are calculated. The procedure starts by computing the fields at the CGs of the (observation) boxes $B_n^{l_{int}}$ at the interface level $l_{int}$. These fields are obtained by interpolating from the NG samples of the source boxes $B_m^l$ for levels $l = 2, \ldots, l_{int}$. The boxes $B_m^{l_{int}}$ belong to the FIL of the observation box $B_n^{l_{int}}$ (i.e. $B_m^{l_{int}} \in I_F(B_n^{l_{int}})$). The corresponding fields at the CG points of box $B_n^{l_{int}}$ from the interaction list boxes of level $l_{int}$, referred to as $\hat{u}_n^{l_{int}, l_{int}}(\hat{\mathbf{r}}^{l_{int},n})$, are computed as

$$\hat{u}_n^{l_{int}, l_{int}}(\hat{\mathbf{r}}^{l_{int},n}) = \sum_{B_m^{l_{int}} \in I_F(B_n^{l_{int}})} \sum_{\breve{\mathbf{r}}^{l_{int},m}} \breve{w}^{l_{int},m}(\hat{\mathbf{r}}^{l_{int},n}, \breve{\mathbf{r}}^{l_{int},m})\, \breve{u}_m^{l_{int}}(\breve{\mathbf{r}}^{l_{int},m}). \tag{10}$$

Similarly, the boxes $B_m^l$ for levels $l = 2, \ldots, l_{int} - 1$ belong to the interaction list of the parents of the box $B_n^{l_{int}}$ at the corresponding levels. The corresponding fields $\hat{u}_n^{l_{int}, l}(\hat{\mathbf{r}}^{l_{int},n})$ ($l = 2, \ldots, l_{int} - 1$) are computed similarly to Eq. (9). All the fields from the interaction boxes at levels $l = 2, \ldots, l_{int}$ are superimposed as $\hat{u}_n^{l_{int}}(\hat{\mathbf{r}}^{l_{int},n}) = \sum_{l=2}^{l_{int}} \hat{u}_n^{l_{int}, l}(\hat{\mathbf{r}}^{l_{int},n})$ to result in the fields at the CG points of the interface level boxes $B_n^{l_{int}}$.

[Fig. 2.2. The 2-D illustration of tasks in each stage of the multi-level implementation of NGIM: (a) source-to-NG stage (stage 1); (b) NG-to-NG stage (stage 2); (c) NG-to-CG and CG-to-CG stage (stage 3), consisting of the NG–CG transition (step 1) and the CG–CG transition (step 2); (d) CG-to-observer stage (stage 4).]

Next, for all levels $l = l_{int} + 1, \ldots, L$, the fields at the CGs of the (observation) boxes $B_n^l$ are computed including two contributions: (i) the fields accumulated by summing up field contributions interpolated from the NGs of the boxes in the FIL at the same level (similar to Eq. (9)) and (ii) those inherited from their parent boxes by interpolating from the CGs:

$$\hat{u}_n^l(\hat{\mathbf{r}}^{l,n}) = \sum_{B_m^l \in I_F(B_n^l)} \sum_{\breve{\mathbf{r}}^{l,m}} \breve{w}^{l,m}(\hat{\mathbf{r}}^{l,n}, \breve{\mathbf{r}}^{l,m})\, \breve{u}_m^l(\breve{\mathbf{r}}^{l,m}) + \sum_{\hat{\mathbf{r}}^{l-1,k}} \hat{w}^{l-1,k}(\hat{\mathbf{r}}^{l,n}, \hat{\mathbf{r}}^{l-1,k})\, \hat{u}_k^{l-1}(\hat{\mathbf{r}}^{l-1,k}). \tag{11}$$

Here, the first term is similar to that in the right hand side of Eq. (9), while in the second term the field samples at the CGs of the parent box $B_k^{l-1}$ are used to obtain the field samples at the CGs of its child boxes $B_n^l \in B_k^{l-1}$ via interpolation as in Eq. (5). This procedure is repeated recursively for all non-empty boxes from level $l_{int} + 1$ to level $L$.
The relation between these two steps is shown in Fig. 2.3.

As mentioned in Section 2.2.1, the number of far-field interaction pairs per box at a certain level is $O(1)$. The number of CG samples per box at levels $l = l_{int}, \ldots, L$ is $O(1)$. As a result, the computational cost of Stage 3 scales as $O(N)$ and $O(N \log N)$ for the low- and high-frequency regimes, respectively. The increased cost in the high-frequency regime arises because the fields at the CGs of boxes at the interface level are evaluated from NGs at all high-frequency levels, whose number is assumed to be of $O(\log N)$. In the mixed-frequency regime, the cost scales between $O(N)$ and $O(N \log N)$ depending on the number of low- and high-frequency levels.

[Fig. 2.3. The relation of the NG–CG transition stage and the CG decomposition stage in calculating the field values on CG samples of boxes at each computational level (NG and CG samples shown at levels $l$, $l+1$, and $l+2$).]

It should be mentioned that while Stages 2 and 3 are conceptually different (Stage 2 computes fields at NGs while Stage 3 computes fields at CGs), in a code implementation they can be combined. For example, for all levels $2 < l < l_{int}$ the NGs of boxes at level $l$ can be used to obtain the NGs of boxes at level $l - 1$ and to obtain the CGs of boxes at level $l_{int}$ as in Eq. (9). After these operations the NGs at level $l$ are not necessary and can be discarded. Such an approach results in the same computational time but reduces memory in the high-frequency regime.

Stage 4 (finest level: CG → observers; Fig. 2.2(d)). On the finest level $L$, the far-field contributions at the observation points $\mathbf{r}_i$ belonging to a level $L$ box $B_n^L$ are obtained by interpolation from the CGs of the same box via

$$u_{FF}(\mathbf{r}_i) = \sum_{\hat{\mathbf{r}}^{L,n}} \hat{w}^{L,n}(\mathbf{r}_i, \hat{\mathbf{r}}^{L,n})\, \hat{u}_n^L(\hat{\mathbf{r}}^{L,n}). \tag{12}$$

Since there are $O(1)$ CG samples in every box and there are $O(N)$ boxes in the finest level, the computational complexity of evaluating the far-field contribution at all observers is $O(N)$.

2.2.3. Computational complexity

First, we discuss the computational cost in the low-frequency regime. Assume for simplicity that the source distribution occupies most or all of the computational domain such that all the boxes are non-empty.
An estimate on the number of operations (computational time) of NGIM is given by

$$T_{low}(N, L) = C_1 N n_{NG} + C_2 8^L n_{NG} + C_3 8^L n_{CG} + C_4 N + C_5 (N/8^L)^2. \tag{13}$$

Here, $n_{NG} = O(1)$ and $n_{CG} = O(1)$ are the numbers of NG and CG samples per box, respectively, and $C_1, \ldots, C_5$ are constants of $O(1)$. Each of the first four terms in the right hand side corresponds to the computational cost of one of the four far-field stages and the fifth term is the cost of the near-field stage. As explained above, the recursive domain subdivision is executed until there is a finite number of $O(1)$ sources in a box, which means that $L$ is chosen as $L = C \log_8 N$ (with $C = O(1)$). As a result, all five terms in $T_{low}$ are on the order of $O(N)$, thus resulting in an $O(N)$ overall computational time cost. The computational cost is the same for source distributions covering only a part of the computational space (in which case the number of non-empty boxes is smaller).

In the high-frequency regime, the computational cost is

$$T_{high}(N, L) = C_1 N n_{NG} + C_2 L n_{NG} 8^L + C_3 L n_{CG} 8^L + C_4 N + C_5 (N/8^L)^2, \tag{14}$$

where $n_{NG} = O(1)$ and $n_{CG} = O(1)$ are the numbers of NG and CG samples in the smallest boxes, respectively. For $L = C \log_8 N$, the second and third terms in the right hand side of (14) are of $O(N \log N)$. The increase of the computational time is due to the increase of the grid densities. In the mixed-frequency regime, the computational time cost is between $O(N)$ and $O(N \log N)$.

The memory consumption of NGIM implemented on GPUs is estimated as

$$M_{low} \leq C_1 N + C_2 (n_{CG} + n_{NG}) 8^L + C_3 8^L,$$

both in the low- and high-frequency regimes (assuming all boxes are active). Here, the first term in the right hand side is the memory required for storing source and observer amplitudes and coordinates, the second term is for storing NG and CG field amplitudes, and the third term is for storing NILs and other box-related information. For $L = C \log_8 N$ the memory consumption is of $O(N)$ in any frequency regime.
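The scaling implied by Eqs. (13) and (14) and by the memory estimate can be checked numerically. The sketch below transcribes the three expressions as printed, with all constants $C_i$ and the per-box sample counts set to 1; these unit values, the function names, and the choice $N = 8^L$ (one source per finest-level box) are assumptions made only for illustration.

```python
def t_low(n, levels, n_ng=1, n_cg=1, c=(1, 1, 1, 1, 1)):
    """Operation-count estimate of Eq. (13) (low-frequency regime), as printed."""
    boxes = 8 ** levels
    return (c[0] * n * n_ng + c[1] * boxes * n_ng + c[2] * boxes * n_cg
            + c[3] * n + c[4] * (n / boxes) ** 2)


def t_high(n, levels, n_ng=1, n_cg=1, c=(1, 1, 1, 1, 1)):
    """Operation-count estimate of Eq. (14) (high-frequency regime), as printed."""
    boxes = 8 ** levels
    return (c[0] * n * n_ng + c[1] * levels * n_ng * boxes
            + c[2] * levels * n_cg * boxes + c[3] * n + c[4] * (n / boxes) ** 2)


def memory_bound(n, levels, n_ng=1, n_cg=1, c=(1, 1, 1)):
    """Upper bound on memory from the estimate that follows Eq. (14)."""
    boxes = 8 ** levels
    return c[0] * n + c[1] * (n_cg + n_ng) * boxes + c[2] * boxes


def scaling_table(max_levels=6):
    """Rows (N, T_low / N, T_high / (N * L)) for N = 8**L."""
    rows = []
    for levels in range(2, max_levels + 1):
        n = 8 ** levels
        rows.append((n, t_low(n, levels) / n, t_high(n, levels) / (n * levels)))
    return rows
```

In the rows returned by `scaling_table`, the second column stays close to a constant and the third stays bounded, which is the numerical counterpart of the $O(N)$ and $O(N \log N)$ statements above when $L$ grows as $\log_8 N$.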
3. Implementation of the non-uniform grid interpolation method on GPUs

GPUs were originally intended for graphics processing such as visualization, rendering, computer graphics animations, and 3D gaming. However, recent developments in GPU technology have drawn the attention of scientific computing communities due to the GPU's extraordinary computational power. Recently, new programming environments/languages have been introduced to facilitate the use of GPU systems for high-performance scientific computing. In particular, NVIDIA provides the Compute Unified Device Architecture (CUDA) for the development of general purpose applications on GPUs [30]. CUDA is a function-extended C programming language that allows a C program to be compiled and executed on the stream processors of GPUs. On the logical level, each running instance launched on a GPU is called a CUDA thread and is identified by its thread ID, which can be a one-, two-, or three-dimensional index. These IDs can be used, for example, as addressing parameters to let the threads access different parts of memory or to indicate which flow paths a thread should follow. However, the threads are not spread directly over all stream processors. A certain number of threads are bundled together to form a thread block. A thread block usually contains a group of threads that are expected to execute closely related (yet possibly different) tasks. Threads in a block are combined into groups of 32, referred to as warps. Warps are atomic units that always go together on a multi-processor. Each multi-processor consists of 16 stream processors (cores), and takes one or more warps at a time.

Within each block, threads are granted a certain amount of shared memory that can be as fast as registers. This shared memory is simultaneously accessible by all threads in the same block but is neither recoverable after the block threads finish their job nor accessible by threads from other blocks. The shared memory can be used efficiently as intermediate storage, like registers, to speed up calculations. The input and output data are usually stored in global memory. Although the maximal speed of the global memory is high (e.g. 177 GBytes/s bandwidth for the NVIDIA GeForce GTX 480), it has a noticeable latency in handling every read or write instruction (e.g., [...] cycles for the NVIDIA GeForce GTX 480).
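The grouping of block threads into warps of 32 can be mimicked on the host in plain C. The sketch below is only an illustration of the index arithmetic; the struct `thread_pos` and both function names are hypothetical and are not part of CUDA or of the authors' code.

```c
#define WARP_SIZE  32u   /* threads per warp, as stated in the text */
#define WARP_SHIFT 5u    /* log2(WARP_SIZE) */

/* Position of one thread inside its block (hypothetical struct). */
typedef struct {
    unsigned warp;   /* index of the warp inside the block */
    unsigned lane;   /* index of the thread inside its warp */
} thread_pos;

/* Split a one-dimensional thread ID into warp and lane. Because WARP_SIZE
   is a power of two, the division and the modulo reduce to a shift and a
   bitwise AND. */
thread_pos locate_thread(unsigned thread_id)
{
    thread_pos p;
    p.warp = thread_id >> WARP_SHIFT;        /* thread_id / 32 */
    p.lane = thread_id & (WARP_SIZE - 1u);   /* thread_id % 32 */
    return p;
}

/* Number of warps occupied by a block of the given size (rounded up, since
   a partially filled warp still occupies a whole warp slot). */
unsigned warps_per_block(unsigned threads_per_block)
{
    return (threads_per_block + WARP_SIZE - 1u) / WARP_SIZE;
}
```

Thread 70 of a block, for instance, is lane 6 of warp 2, and a block of 33 threads already occupies two warps. This index arithmetic involves no global memory traffic, so it is unaffected by the access latency noted above.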
To overcome this latency and reduce the speed mismatch between the shared and global memories, CUDA offers a coalesced accessing scheme, where several reading or writing instructions are combined within one transaction. This faster scheme is triggered when threads in the same warp access contiguous addresses of global memory.

In the implementation of the NGIM, the aforementioned concepts and mechanisms are critical to speed up calculations. These concepts have significant effects on the time and memory consumption of the NGIM on GPUs and result in a number of important modifications in the data structure of the code as compared to the CPU implementations. The implementation of the NGIM on GPUs follows the same stage-by-stage protocol as that on CPUs, yet extensive changes are made to parallelize the operations and utilize tools provided by CUDA.

3.1. Preprocessing and initialization stages

In the preprocessing stage, all vectors, matrices and other data structures used by the NGIM are initialized. The initialization includes memory allocation in the global memory of the GPU, copying coordinates of sources and observers to the allocated matrices, as well as reshaping and copying auxiliary matrices, such as matrices storing indices of far-field interaction boxes. One task done in the preprocessing stage specifically for the GPU is rearranging the source information storage so that sources belonging to the same box, at all levels, are situated contiguously in memory. This is critical for the GPU to adopt coalesced accessing to accelerate the memory handling, which will be described in detail below.

In the CPU version of preprocessing, constructing coefficient matrices for interpolation (described in later paragraphs) occupies a significant portion of the whole preprocessing procedure in terms of memory usage and computation time.
But in the GPU version, similar tasks are significantly shortened or eliminated, because the NG and CG sample coordinates as well as all interpolation coefficient matrices are constructed on-the-fly every time they are requested. These on-the-fly operations are repeated in each stage to reduce the memory consumption and the total memory access time. As a result, the preprocessing time of the GPU code is reduced, making the code more efficient and practical.

3.2. Near-field computation

In this stage, the fields at the observers are evaluated directly via Eq. (1) by adding up the field contributions from sources belonging to the level L boxes in the NILs of the observer's box. Methods to parallelize this stage also apply to direct evaluations of the classical n-body problems [27], which are used for the speed-up comparison. In addition, the techniques used here are applicable to the acceleration of other (more complicated) steps of the NGIM.

Two approaches can be followed to compute the near-field interactions. In the first approach, the values of the Green's function for all source-observer pairs participating in the near-field interactions are tabulated as an interaction matrix. This is done only once in the preprocessing stage (e.g. of an iterative electromagnetic solver), and only simple matrix-vector multiplications are used for iterations. However, even for a moderate N, the resulting memory consumption can be very large for GPU (and also for CPU) systems. The other approach is evaluating all near-field interactions directly via Eq. (1) on-the-fly at every iteration. While this approach may lead to some speed reduction, it drastically reduces the memory requirements, so it is preferred in our GPU implementation of NGIM.

All results for the near-field stage are shown for the on-the-fly approach for both CPU and GPU implementations. The reason is twofold. First, the memory consumption for large problems can be too high even for CPU systems.
Second, for large problem sizes the pre-computation approach may not give a significant advantage due to the resulting large amount of memory handled. In our tests, we have implemented both pre-computation and on-the-fly approaches in the CPU version, and found that the pre-computation approach was only [...] times faster for problems below N = 4 million (which can be handled by our CPU version on a computer with 32 GB of memory). For larger problems, the performance gap between the two versions of the code may be even smaller. Therefore, we believe that the benefit of memory consumption reduction in the on-the-fly approach outweighs a potential extra speed-up of the pre-computation approach.
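The memory argument above can be made concrete with a rough count, sketched here in plain C. Both function names are hypothetical, and the model is a simplification: it assumes every observer interacts with the same number of near-field boxes, each holding the same number of sources, and it counts only the source data for the on-the-fly case.

```c
#include <stdint.h>

/* Bytes needed to tabulate the near-field Green's function values when each
   observer interacts with boxes_in_list boxes of sources_per_box sources.
   Each entry is one single-precision complex number (8 bytes, like float2). */
uint64_t precomputed_near_field_bytes(uint64_t n_observers,
                                      uint64_t boxes_in_list,
                                      uint64_t sources_per_box)
{
    const uint64_t bytes_per_entry = 8;   /* two 4-byte floats */
    return n_observers * boxes_in_list * sources_per_box * bytes_per_entry;
}

/* Bytes of source data needed by the on-the-fly approach: three coordinates
   and one complex amplitude (two floats) per source, 4 bytes each. */
uint64_t on_the_fly_source_bytes(uint64_t n_sources)
{
    return n_sources * 5 * 4;
}
```

As a usage example under assumed values (one million observers, 16 sources per box, and a list of 27 boxes, i.e. a box and its 26 neighbours, a list size that is not quoted in this section), the tabulated matrix needs about 3.5 GB, whereas the source data itself occupies 20 MB.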

Pseudocode 1. Near-field calculation.
Grid dimension = number of active (non-empty) boxes; block dimension >= number of observers/sources per box; sources in the same box are stored contiguously in global memory.

    // Declaration of the external shared arrays, which are dynamically
    // allocated right before each launch of the kernel
    extern __shared__ float dyn_sdata1[];
    extern __shared__ float2 dyn_sdata2[];
    // Coordinates of the sources in a box (3 floats per source)
    float* Qs = (float*)dyn_sdata1;
    // Complex amplitudes of the sources in a box, stored after the
    // 3*blockDim.x coordinate floats (1.5*blockDim.x float2 elements)
    float2* Qs_amp = (float2*)&dyn_sdata2[(blockDim.x >> 1) + blockDim.x];
    float2 Pt = make_float2(0.0f, 0.0f);   // field accumulated by this thread
    float Qr[3];                           // coordinates of this thread's observer
    int IdxReceiverBox = index of the observation box processed by this block of threads;
    int NumNearBoxes = number of source boxes that have a near-field relation with the current observation box;
    int NumReceiverBox = number of observers in the box being processed;
    int tdx = index number of the observer handled by the current thread;
    // load the observer's coordinates to Qr[]
    for (int i = 0; i < 3; i++)
    {
        Qr[i] = d_fChargeInfo[((IdxReceiverBox-1)*3 + i)*blockDim.x + tdx];
    }
    __syncthreads();
    // traverse all near-field boxes
    for (int IdxNearBoxCurrent = 1; IdxNearBoxCurrent <= NumNearBoxes; IdxNearBoxCurrent++)
    {
        int boxIdx = index number of the source box currently being processed;
        int chargeNum = number of charges inside this source box;
        // load the sources' coordinates and amplitudes to Qs[] and Qs_amp[]
        for (int i = 0; i < 3; i++)
        {
            Qs[tdx*3 + i] = Charge_coordinates[boxIdx*3*blockDim.x + i*blockDim.x + tdx];
        }
        Qs_amp[tdx] = Charge_amplitude[(boxIdx-1)*blockDim.x + tdx];
        __syncthreads();
        // accumulate the contributions of all sources in the same source box
        for (int j = 0; j < chargeNum; j++)
        {
            float del[3];
            del[0] = Qs[3*j] - Qr[0];
            del[1] = Qs[3*j + 1] - Qr[1];
            del[2] = Qs[3*j + 2] - Qr[2];
            // distance r between the source and the observer
            float dst = sqrtf(del[0]*del[0] + del[1]*del[1] + del[2]*del[2]);
            if (dst > 1e-6)
            {
                // use intermediate variables to save time on expensive
                // operations like division and trigonometric functions
                float del_cos, del_sin, qs[2], rdst;
                del_cos = cosf(-k0*dst);
                del_sin = sinf(-k0*dst);
                qs[0] = Qs_amp[j].x;
                qs[1] = Qs_amp[j].y;
                rdst = 1.0f/dst;   // 1/r
                Pt.x += rdst*(qs[0]*del_cos - qs[1]*del_sin);
                Pt.y += rdst*(qs[0]*del_sin + qs[1]*del_cos);
            }
        }
        // make sure all threads are done before the shared arrays are overwritten
        __syncthreads();
    }
    // finally, write the value of Pt to the corresponding entry of the output array
    entry of the field value vector for this observer = Pt;

Pseudocode 1 demonstrates the implementation of this approach on GPUs. The pseudocode is written in the CUDA-extended C language, but certain initializations, input/output parts, and error control parts are omitted for clarity. The technical details of implementing the direct calculation of field values based on Eq. (1) on GPU are given next.

(1) We adopted a one-thread-per-observer type of parallelization, in which a thread is responsible for calculating the field value at one observer.
(2) Threads handling observers in the same box are bundled to form a thread block. One or several thread blocks may be launched to handle a certain box when the number of observers within the box exceeds the hardware limit on the number of threads a block can contain (e.g. 512 for Tesla C1060 and 1024 for GeForce GTX 480). The number of observers in a box should be greater than 32 (the warp size) to have all the stream processors occupied and all computational resources fully utilized.
(3) Near-fields at observers in a box are computed from the sources in boxes belonging to the NIL of the observer box. This fact is exploited by loading and storing coordinates and amplitudes of these sources in shared memory, such that several threads handling different observers use the same information without repeated memory loadings. Furthermore, the sources and observers are arranged in the preprocessing stage so that coordinates and amplitudes of sources in a box are located at contiguous memory addresses. As a result, the memory loading operations by a block of threads are always coalesced.
(4) The amount of shared memory has to be determined at run time since the number of threads per block (i.e. the number of sources and observers per box) is not known at compilation time.
We use dynamic shared memory allocation, in which the amount of shared memory is calculated each time before launching the kernel and is allocated as a single array.
(5) One-, two-, or three-dimensional grids of blocks can be used to handle all non-empty boxes. In the current implementation, we use two-dimensional grids of blocks. This allows any practical number of blocks to be launched for each individual kernel.
(6) Some intrinsic mathematical functions are used to accelerate the computations. These functions include single-precision versions of the sin and cos functions, used to evaluate the complex exponential in the Green's function in Eq. (1). Other instruction-level techniques include replacing integer division and modulo operations with bitwise shift and AND operations when the divisor is a power of 2 [30].
(7) The data type float2 is used in the near-field and following stages since it can be directly mapped to the complex data type we use in our CPU code written in FORTRAN. In the CUDA compiler, the operator overloading mechanism is introduced so the operations on float2 can be defined exactly as those for complex numbers.

Table 3.1 shows the computational time of the near-field stage on CPU and GPU. The CPU timing results were obtained on a single core of an Intel Xeon 3.2 GHz CPU using Intel Fortran Compiler v10 with O3 optimization (there was around 20-fold speed-up of an O3-optimized CPU code over a non-optimized one). At the GPU end, an NVIDIA GTX 480 at [...] MHz with 1.5 GBytes of memory was used. The GPU implementation was written and compiled using CUDA Toolkit v3.0 from NVIDIA. Both CPU and GPU versions of the code used the on-the-fly approach and the source and observer distribution was random.

Table 3.1
The computational times and speed-up ratios of the near-field stage on CPU (Xeon X5248) and GPU (GeForce GTX 480). $N_p$ is the average number of sources per box on level $L$. The relation between $N_p$ and $N$ is $N_p = N/8^L$.

$N_p$    $L$    CPU (a)    GPU    Ratio

(a) All timing results shown in this section are in seconds.

It is evident that the speed-up ratios of the GPU code compared to the CPU one are very high, varying between 200 and [...]. The speed-ups are higher for larger $N_p$, i.e. for a larger number of sources and observers per box, when the massive parallelization is fully exploited. Taking into account the fact that the number of GPU cores in the considered case is 480 and that they run at a clock rate around 4.5 times lower than that of the CPU, achieving acceleration rates above 600 is impressive. Such high rates are obtained not only due to massive parallelization of floating point operations and memory loading but also due to the coalesced memory access.

A comment should be made on the scaling of the computational complexity in Table 3.1. For a fixed number of levels $L$, the complexity scales approximately as $O(N^2)$ with increasing $N$ ($N = N_p 8^L$ in Table 3.1), as the number of near-field evaluations is proportional to $N_p^2$. The complexity of $O(N)$ for the near-field stage of the NGIM is achieved due to the fact that the number of levels $L$ increases with an increase of $N$. Indeed, the computational time behaves as $O(N)$ for the same number of sources/observers per smallest box (i.e. for the same $N_p$ in the example of Table 3.1). Clearly, the level increase is discrete, and the choice of the size $N$ at which the code is switched to a higher hierarchy level depends on the relative computational time of the near- and far-field components.

The near-field computation stage is one of the two potentially most time-consuming stages in the algorithm (the other time-consuming stage is stage 3 of the far-field evaluation).
Therefore, achieving such high acceleration rates in this stage is very important.

3.3. Outward computation from sources to NG samples (Stage 1)

The NG construction stage computes the field values at NGs, which is the first step of the upward pass of the algorithm. The core operations in this stage are (a) the construction of NGs of each non-empty box at the finest level L and (b) the direct calculation of the field values at these NGs via Eq. (8).

The CPU version of the code consists of two nested loops to deal with all pairs of sources and NG samples for individual boxes and another loop to account for all boxes at level L (Eq. (8)). The evaluation of the field is executed by coefficient-amplitude multiplication, where the interaction coefficients are computed via Eq. (8) in the preprocessing stage. For GPU, no coefficient matrices are used for the reasons mentioned in previous sections. Instead, the loops over the NG samples and boxes are parallelized and substituted by threads and blocks. Two approaches can be followed to parallelize this subroutine and to map the operations to all processors on GPUs. One approach is the one-thread-per-box parallelization used in similar stages of FMM adopted in Ref. [28], and the other option is the one-thread-per-observer parallelization, which is similar to what was done in the near-field computation stage in Section 3.2.

In the first approach, each thread is responsible for calculating field values at all NG samples of a designated box. This can be done only serially, but with this approach the code may handle non-uniform source distributions efficiently. However, the overall efficiency of the algorithm in practical situations is limited by several other factors. Suppose there are $N_b$ sources in a box at the finest level. Then, one thread has to execute floating point reading from the global memory $5N_b$ times in order to load the source coordinates and (complex) amplitudes to the shared memory or registers.
No memory coalescing scheme can be applied in this situation and these readings will suffer from the global memory access latency. Moreover, for the on-the-fly version of the NGIM, the construction of NGs requires a relatively large number of operations, which makes the work load of a single thread heavy, undermining the effects of parallelization. Finally, in the one-thread-per-box parallelization approach, the GPU computational resources are utilized completely only when the problem size is large enough. From our test runs on NVIDIA Tesla C1060, L has to be greater than or equal to 5 to have all the stream processors fully utilized. The situation becomes even worse on GTX 480 as it has more stream processors. Summarizing, we believe that the one-thread-per-box parallelization approach is suboptimal for the NGIM (and possibly for FMM).

In the second (one-thread-per-observer parallelization) approach, one thread handles one observer, the same as in the approach used in Section 3.2, where an observer here means an NG sample. One or several blocks of threads are allocated for each observation box. In this approach the coalesced memory reading technique is applied to accelerate the loading of source coordinates and amplitudes. The task of the NG construction is also distributed to a group of threads. In the low-frequency regime, usually one block of threads is assigned to one box of NG samples, and the maximal efficiency is achieved. In the high- and mixed-frequency regimes, the boxes at high-frequency levels can have a large number of NG samples. This requires assigning multiple blocks for each box.

The computational times of stage 1 are presented in Table 3.2 (these results are frequency regime independent for the same N, L, and the number of NG samples per smallest box). It is evident that the speed-up ratio increases significantly with an increase of the number of sources per box.
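The assignment of one or several thread blocks to a box in the one-thread-per-observer approach can be sketched in plain C as follows. This is a host-side illustration under assumed names (`blocks_per_box`, `sample_index`), not the authors' kernel launch code; the per-block thread limit is passed in as a parameter rather than queried from the device.

```c
/* Number of thread blocks needed to cover one box whose n_samples NG samples
   are handled one per thread, given the limit on threads per block. */
unsigned blocks_per_box(unsigned n_samples, unsigned max_threads_per_block)
{
    return (n_samples + max_threads_per_block - 1u) / max_threads_per_block;
}

/* Map a (block within the box, thread within the block) pair to the NG
   sample it handles. Returns -1 for padding threads in the last block,
   which have no sample and should stay idle. */
long sample_index(unsigned block_in_box, unsigned thread_in_block,
                  unsigned threads_per_block, unsigned n_samples)
{
    unsigned long idx =
        (unsigned long)block_in_box * threads_per_block + thread_in_block;
    return idx < n_samples ? (long)idx : -1L;
}
```

With the 512-thread limit quoted earlier for the Tesla C1060, a box of 100 NG samples fits in one block, while a box of 1500 samples needs three blocks, the last of which contains 36 idle padding threads.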
The oversampling rates of NGs in the radius and angle, $X_r$ and $X_a$, are defined in Section 2.2 and chosen to be 2 here to render the average $L_1$ error of NGIM at the $10^{-3}$ level. For small problems (with a small number of sources and boxes), the work load distributed to each stream processor is insufficient to hide the expense of the on-the-fly NG grid construction, thread launching, and global memory reading, even under the coalesced accessing scheme. Therefore, for the L = 3 case, the computational time is nearly constant (the speed-ups are still quite high). When there are more boxes to be processed with an increased L, the computation time increases with the problem size, but slower than O(N). Another factor that affects the speed differences between CPU and GPU is the oversampling rate. For high oversampling ratios, i.e. for
