GPU Accelerated Blood Flow Computation using the Lattice Boltzmann Method

Cosmin Niţă, Lucian Mihai Itu, Constantin Suciu
Department of Automation, Transilvania University of Braşov, Braşov, Romania

Constantin Suciu
Corporate Technology, Siemens, Braşov, Romania

Abstract: We propose a numerical implementation based on a Graphics Processing Unit (GPU) for accelerating the execution of the Lattice Boltzmann Method (LBM). The study focuses on the application of the LBM to patient-specific blood flow computations; hence, to obtain higher accuracy, double precision computations are employed. The LBM-specific operations are grouped into two kernels, only one of which uses information from neighboring nodes. Since for blood flow computations typically only 1/5 or less of the nodes are fluid nodes, an indirect addressing scheme is used to reduce the memory requirements. Three GPU cards are evaluated with different 3D benchmark applications (Poiseuille flow, lid-driven cavity flow and flow in an elbow-shaped domain), and the best performing card is used to compute blood flow in a patient-specific aorta geometry with coarctation. The speed-up over a multi-threaded CPU code is 19.42x. The comparison with a basic GPU-based LBM implementation demonstrates the importance of the optimization activities.

Keywords: Lattice Boltzmann Method, parallel computing, GPU, CUDA, coarctation of the aorta

I. INTRODUCTION

In recent years, there has been considerable focus on computational approaches for modeling the flow of blood in the human cardiovascular system. When used in conjunction with patient-specific anatomical models extracted from medical images, such techniques provide important insights into the structure and function of the cardiovascular system [1]. The Lattice Boltzmann Method (LBM) was introduced in the 1980s and has developed into a powerful alternative numerical solver for the Navier-Stokes (NS) equations for modeling fluid flow. Specifically, the LBM has been used consistently in recent years in several blood flow applications (e.g. coronaries [2], aneurysms [3], abdominal aorta [4]).

The LBM is a mesoscopic particle-based method, which has its origin in the Lattice Gas Automata. It uses a simplified kinetic model of the essential physics of microscopic processes, such that the macroscopic properties of the system are governed by a certain set of equations. The equation of the LBM is hyperbolic, and can be solved explicitly and efficiently on parallel computers [5].

With the increasing computational power of Graphics Processing Units (GPUs), parallel computing has become available at a relatively small cost. With the advent of CUDA (Compute Unified Device Architecture), several researchers have identified the potential of GPUs to accelerate Computational Fluid Dynamics (CFD) applications to unprecedented levels [6].

Due to the high computational requirements, there has been a lot of interest in exploring high performance computing techniques for speeding up the LBM algorithms. Efficient CUDA-based implementations of the 3D LBM have been proposed previously in the literature [7-10], each optimized for a specific application. Tölke et al. [10] obtained a speed-up of around 100x over a sequential implementation on an Intel Xeon CPU for the flow around a moving sphere. Obrecht et al. [9] studied the flow in an urban environment and obtained, for a multi-GPU implementation, a speed-up of 28x compared to a multi-threaded CPU-based implementation. All of these studies focused on single precision computations. With the introduction of the Fermi and the Kepler architectures, the performance of double precision computations on NVIDIA GPU cards has increased substantially.

In this paper we introduce a parallel implementation of the LBM designed for blood flow computations. To meet the high accuracy requirements of blood flow applications, computations are performed with double precision. Three recently released GPUs have been considered and, to correctly evaluate the speed-up potential, results are compared against both single-core and multi-core CPU-based implementations.
The best performing GPU card is first determined using three popular benchmarking applications, and is then used for computing blood flow in a patient-specific aorta geometry with coarctation (CoA), containing the descending aorta and the supra-aortic branches. CoA is a congenital cardiac defect usually consisting of a discrete shelf-like narrowing of the aortic media into the lumen of the aorta, occurring in 5 to 8% of all patients with congenital heart disease [11]. The narrowing can lead to a significant pressure drop, which affects the health of the patient. Both the importance and the potential of CFD-based approaches for the non-invasive diagnosis of CoA patients have been recently emphasized in a challenge [12], where the LBM produced good results.

The paper is organized as follows. In section two we first briefly introduce the LBM used herein. Then we introduce the numerical implementation, focusing on its optimized parallelization on a GPU. Section three first presents detailed results for the speed-up obtained with different GPUs for the benchmarking applications, and then displays the results obtained with the best performing GPU card for the patient-specific CoA geometry. Finally, in section four, we draw the conclusions.

II. METHODS

A. The Lattice Boltzmann Method

For studying the parallel implementation of the LBM, we considered the single relaxation time version of the equation, based on the Bhatnagar-Gross-Krook (BGK) approximation, which assumes that the macroscopic quantities of the fluid are not influenced by most of the molecular collisions:

$$\frac{\partial f_i}{\partial t} + \mathbf{c}_i \cdot \nabla f_i = \frac{1}{\tau}\left(f_i^{eq} - f_i\right), \tag{1}$$

where $f_i$ represents the probability distribution function along an axis $\mathbf{c}_i$, $\tau$ is a relaxation factor related to the fluid viscosity, $\mathbf{x}$ represents the position and $t$ is the time. The discretization in space and time is performed with finite difference formulas. This is usually done in two steps:

$$f_i(\mathbf{x}, t + \Delta t) = f_i(\mathbf{x}, t) + \frac{\Delta t}{\tau}\left(f_i^{eq}(\mathbf{x}, t) - f_i(\mathbf{x}, t)\right), \tag{2a}$$

and

$$f_i(\mathbf{x} + \mathbf{c}_i \Delta t, t + \Delta t) = f_i(\mathbf{x}, t + \Delta t). \tag{2b}$$

The first equation is known as the collision step, while the second one represents the streaming step. $f_i^{eq}$ is called the equilibrium distribution and is given by the following formula:

$$f_i^{eq} = \omega_i\, \rho(\mathbf{x}, t) \left[ 1 + \frac{\mathbf{c}_i \cdot \mathbf{u}}{c_s^2} + \frac{1}{2}\left(\frac{\mathbf{c}_i \cdot \mathbf{u}}{c_s^2}\right)^2 - \frac{1}{2}\frac{\mathbf{u} \cdot \mathbf{u}}{c_s^2} \right], \tag{3}$$

where $\omega_i$ is a weighting scalar, $c_s$ is the lattice speed of sound, $\mathbf{c}_i$ is the direction vector, and $\mathbf{u}$ is the fluid velocity. $\rho(\mathbf{x}, t)$ is a scalar field, commonly called density, which is related to the macroscopic fluid pressure as follows:

$$p(\mathbf{x}, t) = \frac{\rho(\mathbf{x}, t)}{3}. \tag{4}$$

Once all $f_i$ have been computed, the macroscopic quantities (velocity and density) can be determined:

$$\mathbf{u}(\mathbf{x}, t) = \frac{1}{\rho(\mathbf{x}, t)} \sum_{i=0}^{n} \mathbf{c}_i f_i(\mathbf{x}, t), \tag{5}$$

$$\rho(\mathbf{x}, t) = \sum_{i=0}^{n} f_i(\mathbf{x}, t). \tag{6}$$

The computational domain is similar to a regular grid used for finite difference algorithms. For a more detailed description of the Boltzmann equation and the collision operator we refer the reader to [5]. The current study focuses on 3D flow domains: we used the D3Q15 lattice structure, displayed in fig. 1 for a single grid node. The weighting factors are: $\omega_i = 16/72$ for $i = 0$, $\omega_i = 8/72$ for $i = 1 \ldots 6$, and $\omega_i = 1/72$ for $i = 7 \ldots 14$.

The boundary conditions (inlet, outlet and wall) are crucial for any fluid flow computation. For the LBM, the macroscopic quantities (flow rate/pressure) cannot be directly imposed at the inlet and outlet.
Instead, the known values of the macroscopic quantities are used for computing the unknown distribution functions near the boundary. For the inlet of the domain we used Zou-He [13] boundary conditions with known velocity. For the outlet we used a homogeneous Neumann boundary condition. The arterial geometry has complex boundaries in patient-specific blood flow computations; hence, to improve the accuracy of the results, we used advanced bounce-back boundary conditions based on interpolations [14]. The solid walls are defined as an isosurface of a scalar field, commonly known as the level-set function.

B. GPU based parallel implementation of the Lattice Boltzmann Method

In the following we focus on the GPU-based parallelization of the LBM described above. The GPU is viewed as a compute device which is able to run a very high number of threads in parallel inside a kernel (a function written in C, which is executed on the GPU and launched by the CPU). The GPU contains several streaming multiprocessors, each of them containing several cores. The GPU contains a certain amount of global memory, which the CPU thread can write to and read from, and which is accessible by all multiprocessors. Furthermore, each multiprocessor also contains shared memory and registers, which are split between the thread blocks and the threads running on the multiprocessor, respectively.

The LBM is both computationally expensive and memory demanding [15], but its explicit nature and the data locality (the computations for a single grid node require only the values of the neighboring nodes) make it ideal for parallel implementations. Each node can be computed at each time step independently from other nodes.

A first important difference between the CPU and the GPU implementation of the LBM is the memory arrangement. Typically, on the CPU, a data structure containing all the required floating-point values for a grid node is defined, and then an array of this data structure is created (the Array Of Structures approach, AOS).
This approach is not a viable solution on the GPU because the global memory accesses would not be coalesced, which would drastically decrease the performance [16]. Instead of AOS, the Structure Of Arrays (SOA) approach has been considered [15]: a different array is allocated for each variable of a node, leading to a total of 35 arrays: 15 for the density functions, another 15 for swapping the new density functions with the old ones after the streaming step, three for the velocity, one for the density and one for the level-set function. The memory access patterns for the AOS and SOA approaches are displayed in fig. 2 for the three velocity components.

Fig. 1. The D3Q15 lattice structure. The first number in the notation is the space dimension, while the second one is the number of lattice links.

The workflow of the GPU-based LBM implementation is displayed in fig. 3. All computations are performed on the GPU. Therefore, host-device memory copy operations are only required when storing intermediate results (transient or unsteady flows) or final results (steady flows). Two different kernels have been defined and are called at each iteration. The operations in (2)-(6) have been assigned to the two kernels based on whether they need to access information from the neighboring nodes.

Kernel 1 first computes the macroscopic quantities (velocity and density), based on (5) and (6), by iterating through the 15 probability distribution functions. Then it applies the Zou-He boundary conditions at the inlet of the domain and performs the collision step: first the equilibrium distribution function is computed using (3), and then the new probability distribution functions are determined based on (2a).

The second kernel focuses on the streaming step, the interpolated bounce-back boundary condition and the outlet boundary condition. All these operations require information from the neighboring nodes. The operations of the second kernel are more complex, since the grid nodes located at the boundary require a different treatment than the other nodes. This leads to different code execution paths and therefore to reduced parallelism. However, since relatively few grid nodes reside next to the boundary, this aspect is not crucial for the overall performance. The workflow of the streaming step is displayed in fig. 4 (for simplicity, the treatment of the nodes of the outlet boundary is not displayed). One can see that, if a node is surrounded in opposite directions by solid nodes, the simple bounce-back rule is applied instead of the interpolated bounce-back rule, since the interpolated rule would lead to numerical divergence in this case. This case is encountered relatively often in geometries with complex boundaries, especially around sharp edges.
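The work done by kernel 1 for a single node can be sketched as follows. This is a minimal host-side sketch in plain C, not the authors' kernel: the function name `collide_node`, the ordering of the lattice directions and the flat layout `f[i * n_fluid + node]` are assumptions made for illustration, and the inlet boundary condition is omitted. It uses the weights given above, $c_s^2 = 1/3$, and (3), (5), (6) and (2a). In a CUDA version, the function body would run once per thread, with `node` derived from `blockIdx.x * blockDim.x + threadIdx.x`.

```c
#include <assert.h>
#include <stddef.h>

#define Q 15 /* D3Q15: 15 lattice links per node */

/* Lattice directions: rest, 6 axis-aligned, 8 diagonal.
   The ordering is an assumption; the paper only fixes the weights. */
static const int CX[Q] = {0, 1, -1, 0,  0, 0,  0, 1, -1,  1, -1,  1, -1,  1, -1};
static const int CY[Q] = {0, 0,  0, 1, -1, 0,  0, 1,  1, -1, -1,  1,  1, -1, -1};
static const int CZ[Q] = {0, 0,  0, 0,  0, 1, -1, 1,  1,  1,  1, -1, -1, -1, -1};

/* Weights from the text: 16/72 (i = 0), 8/72 (i = 1..6), 1/72 (i = 7..14). */
static const double W[Q] = {
    16.0 / 72,
    8.0 / 72, 8.0 / 72, 8.0 / 72, 8.0 / 72, 8.0 / 72, 8.0 / 72,
    1.0 / 72, 1.0 / 72, 1.0 / 72, 1.0 / 72,
    1.0 / 72, 1.0 / 72, 1.0 / 72, 1.0 / 72
};

/* Structure-of-Arrays layout: the 15 distribution arrays are stored back to
   back, so direction i of a node lives at f[i * n_fluid + node]. Neighboring
   threads then touch neighboring addresses for every direction. */
void collide_node(double *f, double *rho, double *ux, double *uy, double *uz,
                  size_t n_fluid, size_t node, double dt_over_tau)
{
    double r = 0.0, mx = 0.0, my = 0.0, mz = 0.0;

    /* Eq. (6) and the momentum sum of eq. (5). */
    for (int i = 0; i < Q; ++i) {
        const double fi = f[(size_t)i * n_fluid + node];
        r  += fi;
        mx += CX[i] * fi;
        my += CY[i] * fi;
        mz += CZ[i] * fi;
    }
    assert(r > 0.0); /* density must be positive before dividing */

    const double u = mx / r, v = my / r, w = mz / r;
    const double usq = u * u + v * v + w * w;

    /* Eq. (3) with c_s^2 = 1/3, followed by the BGK relaxation of eq. (2a). */
    for (int i = 0; i < Q; ++i) {
        const double cu  = CX[i] * u + CY[i] * v + CZ[i] * w;
        const double feq = W[i] * r * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * usq);
        double *fi = &f[(size_t)i * n_fluid + node];
        *fi += dt_over_tau * (feq - *fi);
    }

    rho[node] = r;
    ux[node]  = u;
    uy[node]  = v;
    uz[node]  = w;
}
```

Because the equilibrium distribution in (3) has the same density as the distribution it replaces, the relaxation leaves the node's mass unchanged, which gives a simple sanity check for any implementation of this step.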
For both kernels, one CUDA thread is mapped to one node and, since all arrays are one-dimensional, the execution configuration of the kernels is also one-dimensional, both at block and at grid level. Due to the high accuracy requirements of blood flow computations, and unlike previous studies, all computations were performed with double precision. Because the arrays and the execution configuration are one-dimensional, it is necessary to map the three-dimensional coordinates inside the grid to a global index used to access the data from the arrays:

$$g = i\, N_y N_z + j\, N_z + k, \tag{7}$$

and, conversely,

$$i = \left\lfloor \frac{g}{N_y N_z} \right\rfloor, \quad j = \left\lfloor \frac{g - i\, N_y N_z}{N_z} \right\rfloor, \quad k = g - i\, N_y N_z - j\, N_z, \tag{8}$$

where $i$, $j$ and $k$ are the node coordinates in the 3D LBM grid (the quotients in (8) are rounded down with the floor function), $N_x$, $N_y$ and $N_z$ are the grid sizes in each direction and $g$ is the global index of the node in the one-dimensional array. Equations (7) and (8) are used inside the second kernel for finding the global index of the neighboring nodes.

Fig. 2. Memory access patterns: Array of Structures (top), Structure of Arrays (bottom).

Fig. 3. LBM workflow.

The LBM is usually applied on a rectangular grid. For blood flow computations, the rectangular grid is chosen so as to include the arterial geometry of interest. In this case, though, the fluid nodes represent only 1/5 or less of the total number of nodes. Hence, if the nature of the nodes (fluid/solid) is not taken into account, around 80% of the allocated memory is not used and around 80% of the threads do not perform any computations. To avoid this problem, we used an indirect addressing scheme, displayed in fig. 5. Memory is only allocated for the fluid nodes, and an additional array (called the fluid index array) is introduced for mapping the global index determined with (7) to the fluid node arrays (negative values in the fluid index array correspond to solid nodes). The content of the fluid index array is determined in the preprocessing stage on the CPU and is required only during the streaming step. Since the operations performed inside the first kernel in fig. 3 require no information from the neighboring nodes, the execution configuration of the first kernel is set up so as to generate a number of threads equal to the number of fluid nodes. For the second kernel, on the other hand, the number of threads in the execution configuration is set equal to the total number of nodes, to avoid the need for a search operation in the fluid index array.

Fig. 4. The workflow of the second kernel in fig. 3.

Fig. 5. Indirect addressing.

III. RESULTS

To compare the performance of the CPU-based implementation of the LBM with the GPU-based implementation for double precision computations, we considered three different NVIDIA GPU cards: GeForce GTX 460, GeForce GTX 650 and GeForce GTX 680 (the first one is based on the Fermi architecture, while the other two are based on the Kepler architecture). The CPU-based implementation was run on an eight-core i7 processor using both single-threaded and multi-threaded code. Parallelization of the CPU code was performed using OpenMP.

Three different 3D benchmark applications were first considered for determining the best performing GPU card: Poiseuille flow, lid-driven cavity flow and flow in an elbow-shaped domain. Different grid resolutions were considered, and table I displays the execution times for all test cases, corresponding to one computation step. The performance improvements are significant and demonstrate that a GPU-based implementation of the LBM is superior to a multi-core CPU-based implementation. The best performance is obtained for the GTX 680 (see table I). The speed-up is computed based on the multi-threaded CPU code.
The speed-up compared to the single-threaded CPU code varies between 150x and 290x. Note that the performance of the GTX 650 card is on average around 2x lower than that of the GTX 460. This confirms the concerns raised about the first GPUs of the Kepler architecture, whose performance is in fact lower than that of the previously released cards of the 400 and 500 GeForce series (with the advantage of lower power consumption).

Once the GTX 680 was determined to be the best performing GPU card for double precision 3D computations, we used it to compute blood flow in a patient-specific aorta model with coarctation, which was recently used in a CFD challenge [12]. To obtain the correspondence between the lattice units and the physical units, we used the method described in [17]. The computations were initialized with the equilibrium distribution function, and for the current research activity we focused on steady-state computations, i.e. we imposed the average value of the flow rate profile specified in the challenge. The grid size was set to 92x156x428 nodes, of which less than 10% represented fluid nodes. The total number of computation steps needed to obtain convergence strongly depends on the grid resolution, i.e. on the time needed by the pressure wave to propagate from one end to the other, an aspect which is determined by the lattice speed of sound. Fig. 6 displays the computation results for the converged solution.

Following the idea in [18], namely that lower occupancy can lead to better performance, we tested different execution configurations. The execution times obtained for different thread block configurations, for the entire computation, are displayed in table II alongside the execution time for the multi-threaded CPU code. As has been reported previously [15], execution configurations with fewer threads per block lead to better performance. The best performing execution configuration uses 128 threads per block, and the speed-up compared to the execution time of the multi-threaded CPU implementation is 19.42x.
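The execution configurations compared above rest on the one-dimensional indexing and the fluid index array of Section II.B. The following plain C sketch shows how those pieces fit together on the host side. It is an illustration under stated assumptions rather than the authors' code: the names `global_index`, `node_coords`, `build_fluid_index` and `block_count`, and the use of `-1` as the solid marker, are hypothetical (the paper only states that negative entries denote solid nodes).

```c
#include <assert.h>
#include <stddef.h>

/* Eq. (7): map 3D node coordinates (i, j, k) to the global index g. */
size_t global_index(size_t i, size_t j, size_t k, size_t ny, size_t nz)
{
    return i * ny * nz + j * nz + k;
}

/* Eq. (8): recover (i, j, k) from g; integer division acts as the floor. */
void node_coords(size_t g, size_t ny, size_t nz,
                 size_t *i, size_t *j, size_t *k)
{
    assert(ny > 0 && nz > 0);
    *i = g / (ny * nz);
    *j = (g - *i * ny * nz) / nz;
    *k = g - *i * ny * nz - *j * nz;
}

/* Preprocessing on the CPU: give each fluid node its position in the compact
   fluid-only arrays and mark solid nodes with a negative value.
   Returns the number of fluid nodes, i.e. the length of the compact arrays. */
long build_fluid_index(const unsigned char *is_fluid, long *fluid_index,
                       size_t n_total)
{
    long n_fluid = 0;
    for (size_t g = 0; g < n_total; ++g) {
        fluid_index[g] = is_fluid[g] ? n_fluid++ : -1;
    }
    return n_fluid;
}

/* Number of thread blocks in a one-dimensional launch that covers
   n_threads threads with threads_per_block threads in each block. */
size_t block_count(size_t n_threads, size_t threads_per_block)
{
    assert(threads_per_block > 0);
    return (n_threads + threads_per_block - 1) / threads_per_block;
}
```

For the 92x156x428 grid, `block_count(92 * 156 * 428, 128)` gives 47990 blocks for the second kernel, which is launched over all nodes. The first kernel would instead be sized with the value returned by `build_fluid_index`, so its launch shrinks in proportion to the fluid fraction.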

TABLE I. EXECUTION TIMES OF BENCHMARKING APPLICATIONS FOR ONE COMPUTATION STEP FOR DIFFERENT GRID CONFIGURATIONS.

Benchmark case    Grid resolution  Single-threaded  Multi-threaded  GeForce GTX 680       GeForce GTX 650       GeForce GTX 460
                                   CPU code [ms]    CPU code [ms]   Time [ms]  Speed-Up   Time [ms]  Speed-Up   Time [ms]  Speed-Up
Poiseuille flow   100x100x400      3924.8           608.38          13.7       44.41      45.30      13.43      21.00      28.97
Poiseuille flow   50x50x200        484.3            81.39           1.9        42.84      6.00       13.57      3.00       27.13
Poiseuille flow   25x25x100        61.01            11.24           0.30       37.47      0.80       14.05      0.50       22.

Fig. 6. Computation result (streamlines) for the patient-specific coarctation geometry.

TABLE II. COMPARISON OF EXECUTION TIMES FOR DIFFERENT EXECUTION CONFIGURATIONS (execution time in seconds for five GPU thread block configurations, starting at 64 threads/block, and for the multi-threaded CPU code).

The implementation and optimization aspects described in the previous section were designed specifically for blood flow computations. To evaluate the impact of these activities we also performed the flow computations in the same model with a basic version of the LBM GPU implementation. The basic LBM GPU version did not use indirect addressing (memory was allocated for all nodes, including the solid nodes), used four kernels for the operations of the LBM at each iteration, and executed all kernels with a total number of threads equal to the total number of nodes. The results are displayed in fig. 7 for different thread block configurations and show that the optimization activities are crucial for the speed-up: with the basic LBM GPU version, the speed-up is only 4.41x compared to the multi-threaded CPU code. The speed-up of the optimized LBM GPU version compared to the basic LBM GPU version is 4.40x.

Fig. 7. Comparison of basic vs optimized LBM GPU implementation.

IV. DISCUSSION AND CONCLUSIONS

In this paper, we introduced a GPU-based parallel implementation of the Lattice Boltzmann Method, optimized for patient-specific blood flow computations. Double precision computations were employed for higher accuracy and three different NVIDIA GPU cards were considered.
Based on three 3D benchmarking applications, the GTX 680 card was determined to be the best performing GPU and was subsequently used to compute blood flow in an aorta geometry with coarctation.

To our knowledge, this is the first work to evaluate the potential of Kepler architecture GPU cards for accelerating the execution of the LBM. Moreover, it is the first paper to consider double precision computations for higher accuracy. A detailed comparison with previous implementations [7-10] is difficult to perform, since the implementations are generally optimized for specific activities and different GPUs have been used in different studies. However, the overall results obtained herein are remarkable: the speed-up over a single-threaded CPU implementation varies between 150x and 290x, whereas previously a speed-up of 100x was reported [10]. The speed-up of the CoA geometry blood flow computation was 19.42x compared to a multi-threaded CPU implementation, whereas previously a speed-up of 28x was reported, but for a multi-GPU and not a single-GPU implementation [9].

The optimization activities were designed for patient-specific blood flow computations in general (not for the coarctation geometry in particular), where the ratio of fluid nodes to the total number of nodes is usually around 1/5 or less. Hence we used an indirect addressing scheme and allocated memory only for the fluid nodes. Furthermore, the operations were grouped into two kernels: the first one performs operations for which information from neighboring nodes is not required, while the second one uses information from neighboring nodes. This way the number of kernels is reduced, and it was possible to use an execution configuration with a reduced number of threads for the operations that do not require information from the neighboring nodes.

As proposed in the CFD challenge [12], we only considered rigid wall computations. If elastic arterial walls are considered, then the fluid index array in fig. 5 has to be recomputed at each time step, since the classification of nodes into fluid and solid nodes changes over time.

All LBM-based results reported for [12] were obtained with CPU-based implementations. Although the LBM is faster than the classic CFD approach based on the Navier-Stokes equations, the acceleration of the execution time remains a crucial task for several reasons. First of all, when blood flow is modeled in patient-specific geometries in a clinical setting, results are required in a timely manner, not only to potentially treat the patient faster, but also to perform computations for more patients in a given amount of time. Furthermore, when performing patient-specific computations, it is necessary to match certain patient-specific characteristics, like pressure or flow rates. Hence, the parameters of the model need to be tuned, and the computation needs to be run repeatedly for the same geometry, thus increasing the total execution time for a single patient [19].

Several future work activities have been identified.
From a computational point of view, the global memory accesses of the second kernel can be further optimized, and a multi-GPU based implementation will be considered for further decreasing the execution time. From a modeling point of view, for more severe coarctations than the one displayed in fig. 6, the Reynolds number increases considerably and a Smagorinsky sub-grid model needs to be employed [9].

ACKNOWLEDGMENT

This work is supported by the program Partnerships in Priority Domains (PN II), financed by ANCS, CNDI - UEFISCDI, under the project nr. 130/2012.

REFERENCES

[1] C.A. Taylor and D.A. Steinman, Image-based modeling of blood flow and vessel wall dynamics: applications, methods and future directions, Annals of Biomedical Engineering, vol. 38.
[2] S. Melchionna, M. Bernaschi, S. Succi, E. Kaxiras, F.J. Rybicki, D. Mitsouras, et al., Hydrokinetic approach to large-scale cardiovascular blood flow, Computer Physics Communications, vol. 181, 2010.
[3] J. Bernsdorf and D. Wang, Non-Newtonian blood flow simulation in cerebral aneurysms, Computers & Mathematics with Applications, vol. 58.
[4] A.M. Artoli, A.G. Hoekstra, and P.M.A. Sloot, Mesoscopic simulations of systolic flow in the human abdominal aorta, Journal of Biomechanics, vol. 39.
[5] S. Succi, The Lattice Boltzmann Equation - For Fluid Dynamics and Beyond. New York: Oxford University Press.
[6] D. Kirk and W.M. Hwu, Programming Massively Parallel Processors: A Hands-on Approach. London: Elsevier.
[7] P. Bailey, J. Myre, S.D.C. Walsh, D.J. Lilja, and M.O. Saar, Accelerating lattice Boltzmann fluid flow simulations using graphics processors, IEEE International Conference on Parallel Processing, Vienna, Austria, Sept.
[8] M. Bernaschi, M. Fatica, S. Melchionna, S. Succi, and E. Kaxiras, A flexible high-performance lattice Boltzmann GPU code for the simulations of fluid flows in complex geometries, Concurrency Computation: Practice & Experience, vol. 22, pp. 1-14.
[9] C. Obrecht, F. Kuznik, B. Tourancheau, and J.-J. Roux, Towards urban-scale flow simulations using the Lattice Boltzmann Method, Building Simulation Conference, Sydney, Australia, Nov.
[10] J. Tölke and M. Krafczyk, TeraFLOP computing on a desktop PC with GPUs for 3D CFD, International Journal of Computational Fluid Dynamics, vol. 22.
[11] R.E. Ringel and K. Jenkins, Coarctation of the aorta stent trial (COAST), 2007.
[12] ***, CFD Challenge: Simulation of Hemodynamics in a Patient-Specific Aortic Coarctation Model.
[13] Q. Zou and X. He, On pressure and velocity boundary conditions for the Lattice Boltzmann BGK model, Physics of Fluids, vol. 9.
[14] M. Bouzidi, M. Firdaouss, and P. Lallemand, Momentum transfer of a Boltzmann-Lattice fluid with boundaries, Physics of Fluids, vol. 13.
[15] M. Astorino, J. Becerra Sagredo, and A. Quarteroni, A modular lattice Boltzmann solver for GPU computing processors, SeMA Journal, vol. 59.
[16] NVIDIA Corporation, CUDA, Compute Unified Device Architecture Best Practices Guide v5.0 (2013).
[17] J. Latt, Hydrodynamic limit of lattice Boltzmann equations, PhD Thesis, Université de Genève, Geneva, Switzerland.
[18] V. Volkov, Better performance at lower occupancy, GPU Technology Conference, San Jose, USA.
[19] D.R. Golbert, P.J. Blanco, A. Clausse, and R.A. Feijóo, Tuning a Lattice-Boltzmann model for applications in computational hemodynamics, Medical Engineering & Physics, vol. 34.
