GPU Accelerated Blood Flow Computation using the Lattice Boltzmann Method

Cosmin Niţă, Lucian Mihai Itu, Constantin Suciu
Department of Automation, Transilvania University of Braşov, Braşov, Romania

Constantin Suciu
Corporate Technology, Siemens, Braşov, Romania

Abstract: We propose a numerical implementation based on a Graphics Processing Unit (GPU) for the acceleration of the execution time of the Lattice Boltzmann Method (LBM). The study focuses on the application of the LBM for patient-specific blood flow computations, and hence, to obtain higher accuracy, double precision computations are employed. The LBM specific operations are grouped into two kernels, of which only one uses information from neighboring nodes. Since for blood flow computations regularly only 1/5 or less of the nodes represent fluid nodes, an indirect addressing scheme is used to reduce the memory requirements. Three GPU cards are evaluated with different 3D benchmark applications (Poiseuille flow, lid-driven cavity flow and flow in an elbow shaped domain) and the best performing card is used to compute blood flow in a patient-specific aorta geometry with coarctation. The speed-up over a multi-threaded CPU code is of 19.42x. The comparison with a basic GPU based LBM implementation demonstrates the importance of the optimization activities.

Keywords: Lattice Boltzmann Method, parallel computing, GPU, CUDA, coarctation of the aorta

I. INTRODUCTION

In recent years, there has been considerable focus on computational approaches for modeling the flow of blood in the human cardiovascular system. When used in conjunction with patient-specific anatomical models extracted from medical images, such techniques provide important insights into the structure and function of the cardiovascular system [1]. The Lattice Boltzmann Method (LBM) was introduced in the 1980s and has since developed into a powerful alternative numerical solver for the Navier-Stokes (NS) equations for modeling fluid flow. Specifically, the LBM has been used consistently in recent years in several blood flow applications (e.g. coronaries [2], aneurysms [3], abdominal aorta [4]).

The LBM is a mesoscopic particle based method, which has its origin in the Lattice Gas Automata. It uses a simplified kinetic model of the essential physics of microscopic processes, such that the macroscopic properties of the system are governed by a certain set of equations. The LBM equation is hyperbolic, and can be solved explicitly and efficiently on parallel computers [5].

With the increasing computational power of Graphics Processing Units (GPUs), parallel computing has become available at a relatively small cost. With the advent of CUDA (Compute Unified Device Architecture), several researchers have identified the potential of GPUs to accelerate Computational Fluid Dynamics (CFD) applications to unprecedented levels [6]. Due to the high computational requirements, there has been considerable interest in exploring high performance computing techniques for speeding up LBM algorithms. Efficient CUDA based implementations of the 3D LBM have been proposed previously in the literature [7-10], which were optimized for specific applications. Tölke et al. [10] obtained a speed-up of around 100x over a sequential implementation on an Intel Xeon CPU for the flow around a moving sphere. Obrecht et al. [9] studied the flow in an urban environment and obtained, for a multi-GPU implementation, a speed-up of 28x compared to a multi-threaded CPU based implementation. All of these studies focused on single precision computations. With the introduction of the Fermi and Kepler architectures, the performance of double precision computations on NVIDIA GPU cards has increased substantially.
In this paper we introduce a parallel implementation of the LBM designed for blood flow computations. To meet the high accuracy requirements of blood flow applications, computations are performed with double precision. Three recently released GPUs have been considered and, to correctly evaluate the speed-up potential, results are compared against both single-core and multi-core CPU based implementations. The best performing GPU card is first determined using three popular benchmarking applications, and then it is used for computing blood flow in a patient-specific aorta geometry with coarctation (CoA), containing the descending aorta and the supra-aortic branches. CoA is a congenital cardiac defect usually consisting of a discrete shelf-like narrowing of the aortic media into the lumen of the aorta, occurring in 5 to 8% of all patients with congenital heart disease [11]. The narrowing can lead to a significant pressure drop, which affects the health of the patient. Both the importance and the potential of CFD based approaches for the non-invasive diagnosis of CoA patients have recently been emphasized in a challenge [12], where the LBM produced good results.

The paper is organized as follows. In section two we first briefly introduce the LBM used herein. Then we introduce the numerical implementation, focusing on its optimized parallelization on a GPU. Section three first presents detailed results for the speed-up obtained with different GPUs for the benchmarking applications, and then it displays the results obtained with the best performing GPU card for the patient-specific CoA geometry. Finally, in section four, we draw the conclusions.

II. METHODS

A. The Lattice Boltzmann Method

For studying the parallel implementation of the LBM, we considered the single relaxation time version of the equation, based on the Bhatnagar-Gross-Krook (BGK) approximation, which assumes that the macroscopic quantities of the fluid are not influenced by most of the molecular collisions:

\frac{\partial f_i}{\partial t} + \mathbf{c}_i \cdot \nabla f_i = \frac{1}{\tau} \left( f_i^{eq} - f_i \right),    (1)

where f_i represents the probability distribution function along an axis c_i, τ is a relaxation factor related to the fluid viscosity, x represents the position and t is the time. The discretization in space and time is performed with finite difference formulas. This is usually done in two steps:

f_i^{*}(\mathbf{x}, t + \Delta t) = f_i(\mathbf{x}, t) + \frac{\Delta t}{\tau} \left( f_i^{eq}(\mathbf{x}, t) - f_i(\mathbf{x}, t) \right),    (2a)

and

f_i(\mathbf{x} + \mathbf{c}_i \Delta t, t + \Delta t) = f_i^{*}(\mathbf{x}, t + \Delta t).    (2b)

The first equation is known as the collision step, while the second one represents the streaming step. f_i^{eq} is called the equilibrium distribution and is given by the following formula:

f_i^{eq} = \omega_i \, \rho(\mathbf{x}, t) \left[ 1 + \frac{\mathbf{c}_i \cdot \mathbf{u}}{c_s^2} + \frac{(\mathbf{c}_i \cdot \mathbf{u})^2}{2 c_s^4} - \frac{\mathbf{u} \cdot \mathbf{u}}{2 c_s^2} \right],    (3)

where ω_i is a weighting scalar, c_s is the lattice speed of sound, c_i is the direction vector, and u is the fluid velocity. ρ(x, t) is a scalar field, commonly called density, which is related to the macroscopic fluid pressure as follows:

p(\mathbf{x}, t) = \frac{\rho(\mathbf{x}, t)}{3}.    (4)

Once all f_i have been computed, the macroscopic quantities (velocity and density) can be determined:

\mathbf{u}(\mathbf{x}, t) = \frac{1}{\rho} \sum_{i=0}^{n} \mathbf{c}_i f_i,    (5)

\rho(\mathbf{x}, t) = \sum_{i=0}^{n} f_i.    (6)

The computational domain is similar to a regular grid used for finite difference algorithms. For a more detailed description of the Boltzmann equation and the collision operator we refer the reader to [5]. The current study focuses on 3D flow domains: we used the D3Q15 lattice structure, displayed in Fig. 1 for a single grid node. The weighting factors are: ω_i = 16/72 for i = 0, ω_i = 8/72 for i = 1-6, and ω_i = 1/72 for i = 7-14.

The boundary conditions (inlet, outlet and wall) are crucial for any fluid flow computation. For the LBM, the macroscopic quantities (flow rate/pressure) cannot be directly imposed at the inlet and outlet. Instead, the known values of the macroscopic quantities are used for computing the unknown distribution functions near the boundary. For the inlet of the domain we used Zou-He [13] boundary conditions with known velocity, while for the outlet we used a homogeneous Neumann boundary condition. The arterial geometry has complex boundaries in patient-specific blood flow computations, and hence, for improving the accuracy of the results, we used advanced bounce-back boundary conditions based on interpolations [14]. The solid walls are defined as an isosurface of a scalar field, commonly known as the level-set function.

B. GPU based parallel implementation of the Lattice Boltzmann Method

In the following we focus on the GPU based parallelization of the above described LBM. The GPU is viewed as a compute device which is able to run a very high number of threads in parallel inside a kernel (a function, written in C language, which is executed on the GPU and launched by the CPU). The GPU contains several streaming multiprocessors, each of them containing several cores. The GPU contains a certain amount of global memory, to/from which the CPU thread can write/read, and which is accessible by all multiprocessors. Furthermore, each multiprocessor also contains shared memory and registers, which are split between the thread blocks and the threads running on the multiprocessor, respectively. The LBM is both computationally expensive and memory demanding [15], but its explicit nature and the data locality (the computations for a single grid node require only the values of the neighboring nodes) make it ideal for parallel implementations.
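To make the per-node update concrete, the following CUDA sketch applies the collision step (2a), the equilibrium distribution (3) and the macroscopic quantities (5)-(6) to the 15 distributions of one D3Q15 node. It is an illustration only, not the code of this paper: the function name, the direction ordering and the choice Δt = 1 (lattice units) are our assumptions.

```cuda
// Minimal sketch of the per-node D3Q15 BGK collision (eqs. (2a), (3), (5), (6)).
// Hypothetical helper, not the paper's implementation; it updates one node's
// 15 distributions in place, which is why every node can be processed by an
// independent thread.
#define Q 15

__constant__ double w[Q]  = {16.0/72, 8.0/72, 8.0/72, 8.0/72, 8.0/72, 8.0/72, 8.0/72,
                             1.0/72,  1.0/72, 1.0/72, 1.0/72, 1.0/72, 1.0/72, 1.0/72, 1.0/72};
__constant__ int cx[Q] = {0, 1,-1, 0, 0, 0, 0,  1,-1, 1,-1, 1,-1, 1,-1};
__constant__ int cy[Q] = {0, 0, 0, 1,-1, 0, 0,  1, 1,-1,-1, 1, 1,-1,-1};
__constant__ int cz[Q] = {0, 0, 0, 0, 0, 1,-1,  1, 1, 1, 1,-1,-1,-1,-1};

__device__ void collideNode(double f[Q], double tau)
{
    // Macroscopic quantities, eqs. (5) and (6).
    double rho = 0.0, ux = 0.0, uy = 0.0, uz = 0.0;
    for (int i = 0; i < Q; ++i) {
        rho += f[i];
        ux  += cx[i] * f[i];
        uy  += cy[i] * f[i];
        uz  += cz[i] * f[i];
    }
    ux /= rho;  uy /= rho;  uz /= rho;

    const double cs2 = 1.0 / 3.0;              // lattice speed of sound squared
    const double usq = ux*ux + uy*uy + uz*uz;
    for (int i = 0; i < Q; ++i) {
        double cu  = cx[i]*ux + cy[i]*uy + cz[i]*uz;
        // Equilibrium distribution, eq. (3).
        double feq = w[i] * rho * (1.0 + cu/cs2 + 0.5*cu*cu/(cs2*cs2) - 0.5*usq/cs2);
        // BGK relaxation, eq. (2a) with dt = 1 in lattice units.
        f[i] += (feq - f[i]) / tau;
    }
}
```

In the implementation described next, these per-node values are read from and written to separate global memory arrays rather than a local buffer.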
Each node can be computed at each time step independently from the other nodes. A first important difference between the CPU and the GPU implementation of the LBM is the memory arrangement. Regularly, on the CPU, a data structure containing all the required floating-point values for a grid node is defined, and then an array of this data structure is created (the Array Of Structures approach, AOS). This approach is not a viable solution on the GPU, because the global memory accesses would not be coalesced and would drastically decrease the performance [16]. Instead of AOS, the Structure Of Arrays (SOA) approach has been considered [15]: a different array is allocated for each variable of a node, leading to a total of 35 arrays: 15 for the density functions, another 15 for swapping the new density functions with the old ones after the streaming step, three for the velocity, one for the density and one for the level-set function.

Fig. 1. The D3Q15 lattice structure; the first number in the notation is the space dimension, while the second one is the number of lattice links.
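A minimal sketch of the SOA layout just described, with hypothetical type and field names: one device array is kept per variable (35 in total), so that consecutive threads access consecutive elements of the same array and the global memory accesses can be coalesced.

```cuda
// Sketch of the Structure-Of-Arrays (SOA) layout; type and field names are
// hypothetical. One separate device array is allocated per variable.
#include <cuda_runtime.h>

struct LbmSoA {
    double *f[15];        // probability distribution functions
    double *fNew[15];     // buffers for swapping after the streaming step
    double *ux, *uy, *uz; // velocity components
    double *rho;          // density
    double *levelSet;     // level-set function defining the solid walls
};

// Allocate the 35 arrays on the GPU for 'nNodes' stored nodes.
void allocateSoA(LbmSoA &g, size_t nNodes)
{
    size_t bytes = nNodes * sizeof(double);
    for (int i = 0; i < 15; ++i) {
        cudaMalloc((void**)&g.f[i],    bytes);
        cudaMalloc((void**)&g.fNew[i], bytes);
    }
    cudaMalloc((void**)&g.ux, bytes);
    cudaMalloc((void**)&g.uy, bytes);
    cudaMalloc((void**)&g.uz, bytes);
    cudaMalloc((void**)&g.rho, bytes);
    cudaMalloc((void**)&g.levelSet, bytes);
}
```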

The memory access patterns for the AOS and SOA approaches are displayed in Fig. 2 for the three velocity components.

Fig. 2. Memory access patterns: Array of Structures (top), Structure of Arrays (bottom).

The workflow of the GPU based LBM implementation is displayed in Fig. 3. All computations are performed on the GPU; therefore, host-device memory copy operations are only required when storing intermediate results (transient or unsteady flows) or final results (steady flows). Two different kernels have been defined and are called at each iteration. The operations in (2)-(6) have been associated to the two kernels based on the necessity of accessing information from the neighboring nodes.

Fig. 3. LBM workflow.

Kernel 1 first computes the macroscopic quantities (velocity and density), based on (5) and (6), by iterating through the 15 probability distribution functions. Then it applies the Zou-He boundary conditions at the inlet of the domain and it performs the collision step: first the equilibrium distribution function is computed using (3) and then the new probability distribution functions are determined based on (2). The second kernel focuses on the streaming step, the interpolated bounce-back boundary condition and the outlet boundary condition. All of these operations require information from the neighboring nodes. The operations of the second kernel are more complex, since the grid nodes located at the boundary require a different treatment than the other nodes. This leads to different code execution paths and therefore to reduced parallelism. However, since relatively few grid nodes reside next to the boundary, this aspect is not crucial for the overall performance. The workflow of the streaming step is displayed in Fig. 4 (for simplicity, the treatment of the nodes at the outlet boundary is not displayed). One can see that, if a node is surrounded in opposite directions by solid nodes, the simple bounce-back rule is applied instead of the interpolated bounce-back rule, which would otherwise lead to numerical divergence. This case is encountered relatively often in geometries with complex boundaries, especially around sharp edges.

For both kernels, one CUDA thread is mapped to one node and, since all arrays are one-dimensional, the execution configuration of the kernels is also one-dimensional, both at block and at grid level. Due to the high accuracy requirements of blood flow computations, and unlike previous research, all computations were performed with double precision. Because the arrays and the execution configuration are one-dimensional, it is necessary to map the three-dimensional coordinates inside the grid to a global index used to access the data in the arrays:

i_g = i + j \, N_x + k \, N_x N_y,    (7)

k = \left\lfloor \frac{i_g}{N_x N_y} \right\rfloor, \quad j = \left\lfloor \frac{i_g - k \, N_x N_y}{N_x} \right\rfloor, \quad i = i_g - j \, N_x - k \, N_x N_y,    (8)

where i, j and k are the node coordinates in the 3D LBM grid, N_x, N_y and N_z are the grid sizes in each direction, and i_g is the global index of the node in the one-dimensional array (the divisions in (8) are rounded down with the floor function). Equations (7) and (8) are used inside the second kernel for finding the global index of the neighboring nodes.

The LBM is usually applied on a rectangular grid. For blood flow computations, the rectangular grid is chosen so as to include the arterial geometry of interest. In this case though, the fluid nodes represent only 1/5 or less of the total number of nodes. Hence, if the nature of the nodes (fluid/solid) is not taken into account, around 80% of the allocated memory is not used and around 80% of the threads do not perform any computations. To avoid this problem, we used an indirect addressing scheme, displayed in Fig. 5: memory is only allocated for the fluid nodes, and an additional array (called the fluid index array) is introduced for mapping the global index determined with (7) to the fluid node arrays (negative values in the fluid index array correspond to solid nodes). The content of the fluid index array is determined in the preprocessing stage on the CPU and is required only during the streaming step.
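The sketch below illustrates how (7)-(8) and the fluid index array fit together in the streaming kernel. It is a simplified, hypothetical reconstruction rather than the actual implementation: it reuses the D3Q15 constants and the LbmSoA layout from the earlier sketches, assumes the row-major ordering written in (7), applies the simple bounce-back rule at every wall link in place of the interpolated bounce-back of [14], and omits the outlet treatment.

```cuda
// Index mapping of eqs. (7) and (8); hypothetical helpers.
__host__ __device__ inline long toGlobal(int i, int j, int k, int Nx, int Ny)
{
    return (long)i + (long)j * Nx + (long)k * Nx * Ny;            // eq. (7)
}

__host__ __device__ inline void toGrid(long ig, int Nx, int Ny, int *i, int *j, int *k)
{
    *k = (int)(ig / ((long)Nx * Ny));                             // eq. (8), floor division
    *j = (int)((ig - (long)(*k) * Nx * Ny) / Nx);
    *i = (int)(ig - (long)(*j) * Nx - (long)(*k) * Nx * Ny);
}

// Opposite direction for each of the 15 lattice links (used for bounce-back).
__constant__ int opp[Q] = {0, 2, 1, 4, 3, 6, 5, 14, 13, 12, 11, 10, 9, 8, 7};

// Streaming with indirect addressing: one thread per node of the full grid.
// 'fluidIndex' holds, for every node of the rectangular grid, its position in
// the compact fluid-only arrays, or a negative value for solid nodes.
__global__ void streamKernel(LbmSoA fOld, LbmSoA fNew, const int *fluidIndex,
                             int Nx, int Ny, int Nz)
{
    long ig = (long)blockIdx.x * blockDim.x + threadIdx.x;
    if (ig >= (long)Nx * Ny * Nz) return;

    int dst = fluidIndex[ig];
    if (dst < 0) return;                       // solid node: nothing to compute

    int i, j, k;
    toGrid(ig, Nx, Ny, &i, &j, &k);

    for (int q = 0; q < Q; ++q) {
        // Pull scheme: the value travelling along c_q comes from the upstream neighbor.
        int in = i - cx[q], jn = j - cy[q], kn = k - cz[q];
        int src = (in < 0 || in >= Nx || jn < 0 || jn >= Ny || kn < 0 || kn >= Nz)
                  ? -1 : fluidIndex[toGlobal(in, jn, kn, Nx, Ny)];
        if (src >= 0)
            fNew.f[q][dst] = fOld.f[q][src];       // regular streaming
        else
            fNew.f[q][dst] = fOld.f[opp[q]][dst];  // simple bounce-back at a wall link
    }
}
```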

Fig. 4. The workflow of the second kernel in Fig. 3.

Fig. 5. Indirect addressing.

Since for the operations performed inside the first kernel in Fig. 3 no information from the neighboring nodes is required, the execution configuration of the first kernel is set up so as to generate a number of threads equal to the number of fluid nodes. For the second kernel, on the other hand, the number of threads in the execution configuration is set equal to the total number of nodes, to avoid the need for a search operation in the fluid index array.

III. RESULTS

To compare the performance of the CPU based implementation of the LBM with the GPU based implementation for double precision computations, we considered three different NVIDIA GPU cards: GeForce GTX 460, GeForce GTX 650 and GeForce GTX 680 (the first one is based on the Fermi architecture, while the other two are based on the Kepler architecture). The CPU based implementation was run on an eight-core i7 processor using both single-threaded and multi-threaded code. Parallelization of the CPU code was performed using OpenMP.

Three different 3D benchmark applications were first considered for determining the best performing GPU card: Poiseuille flow, lid-driven cavity flow and flow in an elbow shaped domain. Different grid resolutions were considered and Table I displays the execution times for all test cases, corresponding to one computation step. The performance improvements are significant and demonstrate that a GPU based implementation of the LBM is superior to a multi-core CPU based implementation. The best performance is obtained for the GTX 680 (see Table I). The speed-up is computed based on the multi-threaded CPU code. The speed-up compared to the single-threaded CPU code varies between 150x and 290x. Note that the performance of the GTX 650 card is on average around 2x lower than that of the GTX 460. This confirms the concerns raised for the first GPUs of the Kepler architecture, whose performance is in fact lower than that of the previously released cards of the GeForce 400 and 500 series (with the advantage of lower power consumption).

Once the GTX 680 was determined to be the best performing GPU card for double precision 3D computations, we used it to compute blood flow in a patient-specific aorta model with coarctation, which was recently used in a CFD challenge [12]. To obtain the correspondence between the lattice units and the physical units, we used the method described in [17]. The computations were initialized with the equilibrium distribution function, and for the current research activity we focused on steady-state computations, i.e. we imposed the average value of the flow rate profile specified in the challenge. The grid size was set to 92x156x428 (6142656 nodes), of which only 518969 represented fluid nodes (less than 10%). The total number of computation steps required to obtain convergence strongly depends on the grid resolution, i.e. on the time needed by the pressure wave to propagate from one end of the domain to the other, an aspect which is given by the lattice speed of sound. Fig. 6 displays the computation results obtained after 10000 time steps (the converged solution). Following the idea in [18], namely that lower occupancy leads to better performance, we tested different execution configurations.
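A possible shape of the per-iteration host code implied by this setup is sketched below, with hypothetical names; in particular, collisionKernel is only assumed to exist with this signature. Kernel 1 is launched with one thread per fluid node, kernel 2 with one thread per node of the full rectangular grid, and the block size is left as a tunable parameter.

```cuda
// Sketch of the one-dimensional execution configuration (hypothetical names).
void runIteration(LbmSoA fOld, LbmSoA fNew, const int *fluidIndex,
                  int Nx, int Ny, int Nz, long nFluid, int threadsPerBlock)
{
    long nTotal  = (long)Nx * Ny * Nz;
    int  blocks1 = (int)((nFluid + threadsPerBlock - 1) / threadsPerBlock);
    int  blocks2 = (int)((nTotal + threadsPerBlock - 1) / threadsPerBlock);

    // Kernel 1: macroscopic quantities, inlet boundary condition and collision;
    // no neighbor information is needed, so one thread per fluid node suffices.
    collisionKernel<<<blocks1, threadsPerBlock>>>(fOld, fluidIndex, nFluid);

    // Kernel 2: streaming, bounce-back and outlet boundary condition;
    // one thread per grid node avoids a search in the fluid index array.
    streamKernel<<<blocks2, threadsPerBlock>>>(fOld, fNew, fluidIndex, Nx, Ny, Nz);
}
```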
The execution times obtained for the different thread block configurations, for the entire computation, are displayed in Table II, alongside the execution time of the multi-threaded CPU code. As has been reported previously [15], execution configurations with fewer threads per block lead to better performance. The best performing execution configuration uses 128 threads per block, and the speed-up compared to the execution time of the multi-threaded CPU implementation is of 19.42x.
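As an indication of how timings like those in Table II can be gathered, the sketch below sweeps over several block sizes and times the iteration loop with CUDA events; the sweep, the step count and all names are our assumptions, not the benchmarking harness used in the paper (buffer swapping between iterations is omitted for brevity).

```cuda
// Hypothetical timing sweep over thread-block sizes using CUDA events.
#include <cstdio>

void sweepBlockSizes(LbmSoA fOld, LbmSoA fNew, const int *fluidIndex,
                     int Nx, int Ny, int Nz, long nFluid, int nSteps)
{
    const int sizes[] = {64, 128, 256, 512, 1024};
    for (int s = 0; s < 5; ++s) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        for (int step = 0; step < nSteps; ++step)
            runIteration(fOld, fNew, fluidIndex, Nx, Ny, Nz, nFluid, sizes[s]);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%4d threads/block: %.3f s\n", sizes[s], ms / 1000.0f);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
}
```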

TABLE I. EXECUTION TIMES OF THE BENCHMARKING APPLICATIONS FOR ONE COMPUTATION STEP, FOR DIFFERENT GRID CONFIGURATIONS

Benchmark case         | Grid resolution | Single-threaded CPU [ms] | Multi-threaded CPU [ms] | GTX 680 [ms] | Speed-up | GTX 650 [ms] | Speed-up | GTX 460 [ms] | Speed-up
Poiseuille flow        | 100x100x400     | 3924.80 | 608.38 | 13.70 | 44.41x | 45.30 | 13.43x | 21.00 | 28.97x
Poiseuille flow        | 50x50x200       | 484.30  | 81.39  | 1.90  | 42.84x | 6.00  | 13.57x | 3.00  | 27.13x
Poiseuille flow        | 25x25x100       | 61.01   | 11.24  | 0.30  | 37.47x | 0.80  | 14.05x | 0.50  | 22.48x
Lid-driven cavity flow | 100x100x100     | 977.94  | 152.48 | 6.40  | 23.83x | 21.40 | 7.13x  | 9.20  | 16.57x
Lid-driven cavity flow | 50x50x50        | 120.81  | 20.34  | 0.80  | 25.43x | 2.70  | 7.53x  | 1.20  | 16.95x
Lid-driven cavity flow | 25x25x25        | 15.09   | 3.35   | 0.10  | 33.50x | 0.40  | 8.38x  | 0.30  | 11.17x
Elbow                  | 200x200x50      | 1956.12 | 91.02  | 2.50  | 36.41x | 8.60  | 10.58x | 4.40  | 20.69x
Elbow                  | 100x100x50      | 242.46  | 12.00  | 0.90  | 13.33x | 2.80  | 4.29x  | 0.70  | 17.14x

(GPU speed-ups are computed relative to the multi-threaded CPU code.)

Fig. 6. Computation result (streamlines) for the patient-specific coarctation geometry.

TABLE II. COMPARISON OF EXECUTION TIMES FOR DIFFERENT EXECUTION CONFIGURATIONS

Configuration            | Execution time [s]
GPU - 64 threads/block   | 37.160
GPU - 128 threads/block  | 34.654
GPU - 256 threads/block  | 35.743
GPU - 512 threads/block  | 35.825
GPU - 1024 threads/block | 39.989
CPU - multi-threaded     | 673.028

The implementation and optimization aspects described in the previous section were designed specifically for blood flow computations. To evaluate their impact, we also performed the flow computations in the same model with a basic version of the LBM GPU implementation. The basic LBM GPU version did not use indirect addressing (memory was allocated for all nodes, including the solid nodes), used four kernels for the operations of the LBM at each iteration, and executed all kernels with a total number of threads equal to the total number of nodes. The results are displayed in Fig. 7 for different thread block configurations and show that the optimization activities are crucial for the speed-up: with the basic LBM GPU version, the speed-up is of only 4.41x compared to the multi-threaded CPU code. The speed-up of the optimized LBM GPU version compared to the basic LBM GPU version is of 4.40x.

Fig. 7. Comparison of the basic vs. the optimized LBM GPU implementation.

IV. DISCUSSION AND CONCLUSIONS

In this paper, we introduced a GPU based parallel implementation of the Lattice Boltzmann Method, optimized for patient-specific blood flow computations. Double precision computations were employed for higher accuracy and three different NVIDIA GPU cards were considered. Based on three 3D benchmarking applications, the GTX 680 card was determined to be the best performing GPU and was subsequently used to compute blood flow in an aorta geometry with coarctation. To our knowledge, this is the first work to evaluate the potential of Kepler architecture GPU cards for accelerating the execution of the LBM. Moreover, it is the first paper to consider double precision computations for higher accuracy. A detailed comparison with previous implementations [7-10] is difficult to perform, since generally the implementations are optimized for specific activities and different GPUs have been used in different studies. However, the overall results obtained herein are remarkable: the speed-up over a single-threaded CPU implementation varies between 150x and 290x, whereas previously a speed-up of 100x was reported [10]. The speed-up of the CoA geometry blood flow computation was of 19.42x compared to a multi-threaded CPU implementation, whereas previously a speed-up of 28x was reported, but for a multi-GPU and not a single-GPU implementation [9].

The optimization activities were designed for patient-specific blood flow computations in general (not in particular for the coarctation geometry), where the ratio of fluid nodes to the total number of nodes is usually around 1/5 or less. Hence, we used an indirect addressing scheme and allocated memory only for the fluid nodes. Furthermore, the operations were grouped into two kernels: the first one performs the operations for which no information from neighboring nodes is required, while the second one uses information from the neighboring nodes. This way the number of kernels is reduced, and it was possible to use an execution configuration with a reduced number of threads for the operations for which no information from the neighboring nodes is required.

As proposed in the CFD challenge [12], we only considered rigid wall computations. If elastic arterial walls are considered, then the fluid index array in Fig. 5 has to be recomputed at each time step, since the classification of nodes into fluid and solid nodes changes over time. All LBM based results reported for [12] were obtained with CPU based implementations. Although the LBM is faster than the classic CFD approach based on the Navier-Stokes equations, the acceleration of the execution time remains a crucial task for several reasons. First of all, when blood flow is modeled in patient-specific geometries in a clinical setting, results are required in a timely manner, not only to potentially treat the patient faster, but also to perform computations for more patients in a certain amount of time. Furthermore, when performing patient-specific computations, it is necessary to match certain patient-specific characteristics, like pressure or flow rates. Hence, the parameters of the model need to be tuned, and the computation needs to be run repeatedly for the same geometry, thus increasing the total execution time for a single patient [19].

Several future work activities have been identified. From a computational point of view, the global memory accesses of the second kernel can be further optimized, and a multi-GPU based implementation will be considered for further decreasing the execution time. From a modeling point of view, for more severe coarctations than the one displayed in Fig. 6, the Reynolds number increases considerably and a Smagorinsky sub-grid model needs to be employed [9].

ACKNOWLEDGMENT

This work is supported by the program Partnerships in Priority Domains (PN II), financed by ANCS, CNDI - UEFISCDI, under project nr. 130/2012.
REFERENCES

[1] C.A. Taylor and D.A. Steinman, "Image-based modeling of blood flow and vessel wall dynamics: applications, methods and future directions," Annals of Biomedical Engineering, vol. 38, pp. 1188-1203, 2010.
[2] S. Melchionna, M. Bernaschi, S. Succi, E. Kaxiras, F.J. Rybicki, D. Mitsouras, et al., "Hydrokinetic approach to large-scale cardiovascular blood flow," Computer Physics Communications, vol. 181, pp. 462-472, 2010.
[3] J. Bernsdorf and D. Wang, "Non-Newtonian blood flow simulation in cerebral aneurysms," Computers & Mathematics with Applications, vol. 58, pp. 1024-1029, 2009.
[4] A.M. Artoli, A.G. Hoekstra, and P.M.A. Sloot, "Mesoscopic simulations of systolic flow in the human abdominal aorta," Journal of Biomechanics, vol. 39, pp. 873-874, 2006.
[5] S. Succi, The Lattice Boltzmann Equation: For Fluid Dynamics and Beyond. New York: Oxford University Press, 2001.
[6] D. Kirk and W.M. Hwu, Programming Massively Parallel Processors: A Hands-on Approach. London: Elsevier, 2010.
[7] P. Bailey, J. Myre, S.D.C. Walsh, D.J. Lilja, and M.O. Saar, "Accelerating lattice Boltzmann fluid flow simulations using graphics processors," IEEE International Conference on Parallel Processing, Vienna, Austria, pp. 550-557, Sept. 2009.
[8] M. Bernaschi, M. Fatica, S. Melchionna, S. Succi, and E. Kaxiras, "A flexible high-performance lattice Boltzmann GPU code for the simulations of fluid flows in complex geometries," Concurrency and Computation: Practice & Experience, vol. 22, pp. 1-14, 2010.
[9] C. Obrecht, F. Kuznik, B. Tourancheau, and J.-J. Roux, "Towards urban-scale flow simulations using the Lattice Boltzmann Method," Building Simulation Conference, Sydney, Australia, pp. 933-940, Nov. 2011.
[10] J. Tölke and M. Krafczyk, "TeraFLOP computing on a desktop PC with GPUs for 3D CFD," International Journal of Computational Fluid Dynamics, vol. 22, pp. 443-456, 2008.
[11] R.E. Ringel and K. Jenkins, "Coarctation of the aorta stent trial (COAST)," 2007, http://clinicaltrials.gov/ct2/show/nct00552812.
[12] CFD Challenge: Simulation of Hemodynamics in a Patient-Specific Aortic Coarctation Model, http://www.vascularmodel.org/miccai2012/.
[13] Q. Zou and X. He, "On pressure and velocity boundary conditions for the lattice Boltzmann BGK model," Physics of Fluids, vol. 9, pp. 1591-1598, 1997.
[14] M. Bouzidi, M. Firdaouss, and P. Lallemand, "Momentum transfer of a Boltzmann-lattice fluid with boundaries," Physics of Fluids, vol. 13, pp. 3452-3459, 2001.
[15] M. Astorino, J. Becerra Sagredo, and A. Quarteroni, "A modular lattice Boltzmann solver for GPU computing processors," SeMA Journal, vol. 59, pp. 53-78, 2012.
[16] NVIDIA Corporation, CUDA (Compute Unified Device Architecture) Best Practices Guide v5.0, 2013.
[17] J. Latt, "Hydrodynamic limit of lattice Boltzmann equations," PhD thesis, Université de Genève, Geneva, Switzerland, 2007.
[18] V. Volkov, "Better performance at lower occupancy," GPU Technology Conference, San Jose, USA, 2010.
[19] D.R. Golbert, P.J. Blanco, A. Clausse, and R.A. Feijóo, "Tuning a Lattice-Boltzmann model for applications in computational hemodynamics," Medical Engineering & Physics, vol. 34, pp. 339-349, 2012.