cuIBM: A GPU-Accelerated Immersed Boundary Method

S. K. Layton, A. Krishnan and L. A. Barba
Corresponding author: labarba@bu.edu
Department of Mechanical Engineering, Boston University, Boston, MA, 02215, USA.

Abstract: A projection-based immersed boundary method (IBM) is dominated by sparse linear algebra routines. Using the open-source CUSP library, we observe a speedup with respect to a single CPU core that reflects the constraints of a bandwidth-dominated problem on the GPU. Nevertheless, GPUs offer the capacity to solve large problems on commodity hardware. This work includes validation and a convergence study of the GPU-accelerated IBM, and various optimizations.

Keywords: Immersed Boundary, Computational Fluid Dynamics, GPU Computing.

1 Introduction

Conventional CFD techniques require the generation of a mesh that conforms to the geometry of any boundaries in the fluid domain. The immersed boundary method (IBM), in contrast, allows using a grid that does not conform to solid boundaries. In the IBM, the fluid is represented on an Eulerian grid (typically a Cartesian grid) and the solid boundary is represented by a collection of Lagrangian points. This has several advantages: mesh generation is trivial, and simulations involving moving solid bodies and boundaries are made simpler. The Navier-Stokes equations are solved on the entire grid (including points within the solid), and the effect of the solid body is modelled by adding a singular force distribution f along the solid boundary, which enforces the no-slip condition. The governing equations are:

\frac{\partial u}{\partial t} + u \cdot \nabla u = -\nabla p + \nu \nabla^2 u + \int_s f(\xi(s,t))\, \delta(\xi - x)\, ds, \quad (1a)

\nabla \cdot u = 0, \quad (1b)

u(\xi(s,t)) = \int u(x)\, \delta(x - \xi)\, dx = u_B(\xi(s,t)), \quad (1c)

where u_B is the velocity of the body at the boundary point locations. The different IBM formulations use different techniques to calculate the forcing term, f. The IBM was introduced in 1972 by Peskin [3] to model blood flow within the elastic membranes of the heart.
It experienced a revival in the 1990s thanks to increased computational capacity and growing interest in moving-boundary problems. The reader can find various IBM formulations described in the 2005 review by Mittal and Iaccarino [1]. In the present work, we implement the algorithm presented in [4] for the solution of two-dimensional incompressible viscous flows with immersed boundaries, explained in detail in Section 2. To our knowledge, the IBM has not previously been implemented on the GPU. The prospect this opens is the capacity to solve large three-dimensional moving-boundary problems on commodity hardware.
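In practice, the integrals in (1a) and (1c) are evaluated with a regularized (smoothed) discrete delta function of finite support. As a concrete illustration, using Peskin's cosine kernel, which this abstract does not specify, and helper names of our own, interpolating a grid field to one Lagrangian point might be sketched in NumPy as:

```python
import numpy as np

def delta_h(r, h):
    """Peskin's cosine kernel: one common regularized delta function.
    Support |r| < 2h; its weights sum to 1 on a grid of spacing h."""
    r = np.abs(r) / h
    return np.where(r < 2.0, (1.0 + np.cos(np.pi * r / 2.0)) / (4.0 * h), 0.0)

def interpolate(u, xg, yg, xi, eta, h):
    """Discrete version of eq. (1c): u_B = sum_ij u[i,j] d(x_i - xi) d(y_j - eta) h^2."""
    wx = delta_h(xg - xi, h)
    wy = delta_h(yg - eta, h)
    return (np.outer(wx, wy) * u).sum() * h * h

# Interpolating a constant field recovers that constant, because the
# kernel weights in each direction sum to 1 (a design property of the kernel).
h = 0.1
xg = np.arange(-1, 1, h)
yg = np.arange(-1, 1, h)
u = np.full((xg.size, yg.size), 3.0)
print(round(float(interpolate(u, xg, yg, 0.02, -0.03, h)), 6))  # prints 3.0
```

The regularization (spreading) step in (1a) uses the same kernel transposed, which is why Ĥ and Ê in the discrete formulation below are closely related matrices.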
2 Immersed Boundary Projection Method

The Navier-Stokes equations (1a)-(1c) are discretized on a staggered grid and we obtain the following set of algebraic equations:

\hat{A}u^{n+1} - \hat{r}^n = -\hat{G}\phi + \widehat{bc}_1 + \hat{H}f, \quad (2a)

\hat{D}u^{n+1} = bc_2, \quad (2b)

\hat{E}u^{n+1} = u_B^{n+1}. \quad (2c)

Here, φ and f are vectors containing the pressure and the values of the singular force at the boundary points of the immersed boundary, respectively. The velocity at the current time step, u^n, is known; \widehat{bc}_1 and bc_2 are obtained from the boundary conditions on the velocity; Ĥ and Ê are the regularization and interpolation matrices, respectively. These matrices are used to transfer values of the flow variables between the Eulerian and Lagrangian grids. The above system of equations can be solved to obtain the velocity field at time step n+1, the pressure (up to a constant) and the body forces. But the left-hand-side matrix is indefinite, and solving the system directly would be ill-advised. For time stepping, an explicit second-order Adams-Bashforth scheme is used for the convection terms and Crank-Nicolson is used for diffusion. All spatial derivatives are calculated using central differences. By performing appropriate transformations (see [4] for details), one can show that the above system is equivalent to:

\begin{pmatrix} A & Q \\ Q^T & 0 \end{pmatrix} \begin{pmatrix} q^{n+1} \\ \lambda \end{pmatrix} = \begin{pmatrix} r_1 \\ r_2 \end{pmatrix}, \quad (3)

where q^{n+1} is the momentum flux at each cell boundary and λ is a vector containing both the pressure and the body force values. Consider an Nth-order approximation of the inverse of matrix A, given by B^N. We can now perform the same factorisation as described in [2] to obtain the following set of equations, which can be solved to obtain the velocity distribution at time step n+1:

A q^* = r_1, \quad (4a)

Q^T B^N Q \lambda = Q^T q^* - r_2, \quad (4b)

q^{n+1} = q^* - B^N Q \lambda. \quad (4c)

Only the left-hand side of (3) is affected by the factorisation, and hence r_1 and r_2 remain the same.
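To make the role of the three steps concrete, the NumPy sketch below exercises (4a)-(4c) on a small synthetic saddle-point system; the matrices, sizes, and the use of the exact inverse for B^N are our own toy choices, under which the splitting reproduces the solution of (3) exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 3                      # toy sizes: n "velocity" unknowns, m constraints
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)      # symmetric positive definite stand-in for A
Q = rng.standard_normal((n, m))  # full-column-rank stand-in for Q
r1 = rng.standard_normal(n)
r2 = rng.standard_normal(m)

B = np.linalg.inv(A)             # stand-in for B^N; exact inverse => exact splitting

# Step (4a): A q* = r1
q_star = np.linalg.solve(A, r1)
# Step (4b): (Q^T B^N Q) lambda = Q^T q* - r2  (SPD because A is SPD)
lam = np.linalg.solve(Q.T @ B @ Q, Q.T @ q_star - r2)
# Step (4c): projection, q^{n+1} = q* - B^N Q lambda
q_next = q_star - B @ Q @ lam

# The pair (q_next, lam) satisfies the saddle-point system (3):
print(np.allclose(A @ q_next + Q @ lam, r1), np.allclose(Q.T @ q_next, r2))
```

With a truncated B^N the splitting is only approximate, which is the source of the temporal-order behaviour studied in the validation section.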
This factorisation is very advantageous, as the two linear systems (4a) and (4b) that we now need to solve can be made positive definite and can be solved efficiently using the conjugate gradient method. In the absence of an immersed boundary, this set of equations is the same as that solved in the traditional fractional-step or projection method [2]. The projection step (4c) simultaneously ensures a divergence-free velocity field and that the no-slip condition on the immersed boundary is satisfied at the next time step.

3 Implementation

The matrices A, Q and Q^T are sparse, the vectors q^{n+1} and λ are dense, and all operations require tools for sparse linear algebra. To take advantage of the GPU, we need some way of both representing and operating on these matrices and vectors on the device. Currently, there are two main choices for this: CUSPARSE, part of NVIDIA's CUDA, or the external library CUSP. The CUSP library is being developed by several NVIDIA employees with minimal software dependencies and is released freely under an open-
source license. We use the CUSP library for several reasons: it is actively developed and separate from the main CUDA distribution, allowing faster addition of new features (such as new preconditioners, solvers, etc.); and all objects and methods from the library are usable on both CPU and GPU. This gives us the flexibility to, for example, perform branching-heavy code on the CPU before trivially transferring to the device and running (for instance) a linear solve, where it will be significantly faster. It also allows us to maintain both a CPU and a GPU code.

Figure 1: (a) Timing breakdown for flow past a cylinder at Re = 40 using the GPU code; the legend includes the individual operations (AXPY, applying boundary conditions, force calculation, matrix-vector and matrix-matrix products, memory transfers, preconditioner setup, Solve 1, Solve 2, and matrix updates). (b) Comparison of time taken to solve a system of linear equations Ax = b on the CPU and GPU; A is chosen as the standard 5-point Poisson stencil.

Figure 1(a) shows a breakdown of the timings from an example run (flow past a cylinder at Re = 40). The mesh comprises 400 x 800 cells, resulting in systems of over 300,000 unknowns. Even in this relatively small test, the time is dominated by the solution of a linear system, denoted by Solve 2. Speeding up this linear solve is the major motivation for using the GPU. Figure 1(b) shows a timing comparison between the CPU and GPU using CUSP's conjugate gradient solver. The system being solved in this case is given by a traditional 5-point Poisson stencil which, while not directly used in the IBM code, gives a good measure of relative performance. The plot shows the wall-clock time required to solve to a relative accuracy of 10^{-5} for numbers of unknowns ranging from 25 to 4 x 10^6.
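The benchmark of Figure 1(b) is easy to reproduce in spirit on the CPU. The sketch below builds the standard 5-point Poisson matrix and solves it with a plain conjugate gradient iteration stopped on the relative residual, a NumPy stand-in for the cusp::krylov::cg call; the grid size, right-hand side and tolerance are our own choices:

```python
import numpy as np

def poisson_5pt(n):
    """Standard 5-point Laplacian on an n x n interior grid (SPD, sign-flipped)."""
    T = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    return np.kron(np.eye(n), T) + np.kron(T, np.eye(n))

def cg(A, b, rtol=1e-5, maxit=10_000):
    """Plain conjugate gradient with a relative-residual stopping criterion."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    b_norm = np.linalg.norm(b)
    for _ in range(maxit):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) <= rtol * b_norm:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

n = 32                       # 32 x 32 interior grid -> 1024 unknowns
A = poisson_5pt(n)
b = np.ones(n * n)
x = cg(A, b)
print(np.linalg.norm(b - A @ x) / np.linalg.norm(b) <= 1e-5)  # prints True
```

On the GPU the same iteration is bandwidth-bound: each step is dominated by one sparse matrix-vector product and a few vector operations, which is why the observed speedup tracks memory bandwidth rather than peak FLOPS.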
For large systems, the GPU solve is significantly faster, with a speedup of 8 for the largest system shown. Our choice of tools allows us to easily perform all sparse linear algebra operations on the GPU. On the other hand, there are parts of the algorithm that cannot easily be expressed using linear algebra, such as generating the convection term using a finite-difference stencil and applying boundary conditions to the velocities (which involves modifying selected values of the appropriate arrays). One possible way of performing these actions is to transfer data from the GPU, do the calculations on the CPU, and transfer the modified vector(s) back to the GPU every time step; this incurs a prohibitively high cost in memory transfers. The alternative is to use custom-written CUDA kernels, utilizing all appropriate techniques (including the use of shared memory) to perform these operations on the GPU. This requires access to the underlying data in the CUSP data structures, which can be obtained using the Thrust library, on which CUSP is built. The combination of accelerated linear algebra and custom kernels on the GPU has resulted in initial runs showing up to a 7 speedup over our equivalent CPU code, for the problem sizes we ran. This is almost as good as the 8 speedup seen for the 5-point Poisson solver in Figure 1(b).
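For illustration, the kind of pointwise stencil work handled by those custom kernels, here the convection term u du/dx + v du/dy evaluated with central differences, can be sketched in NumPy; the periodic, collocated grid is a simplification of our own (the actual code uses CUDA kernels on the staggered-grid arrays):

```python
import numpy as np

def convection(u, v, h):
    """Central-difference u du/dx + v du/dy on a periodic, collocated grid.
    (A simplified analogue of the staggered-grid CUDA kernel.)"""
    dudx = (np.roll(u, -1, axis=0) - np.roll(u, 1, axis=0)) / (2 * h)
    dudy = (np.roll(u, -1, axis=1) - np.roll(u, 1, axis=1)) / (2 * h)
    return u * dudx + v * dudy

# Check against a field with a known answer: for u = sin(x), v = 0,
# the convection term is sin(x) cos(x), and central differences are
# second-order accurate, so the error on a 64-point grid is small.
n = 64
h = 2 * np.pi / n
x = np.arange(n) * h
X, Y = np.meshgrid(x, x, indexing="ij")
u = np.sin(X)
v = np.zeros_like(u)
approx = convection(u, v, h)
exact = np.sin(X) * np.cos(X)
print(np.max(np.abs(approx - exact)) < 1e-2)  # prints True
```

On the GPU, each output point reads a small neighbourhood of input points, which is exactly the access pattern that benefits from staging tiles of the velocity arrays in shared memory.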
4 Validation

4.1 Couette flow between concentric cylinders

As a validation test, we calculate the flow between two concentric cylinders of radius r_i = 0.5 and r_o = 1 centered at the origin. The outer cylinder is held stationary while the inner cylinder is impulsively rotated from rest with an angular velocity Ω = 0.5. The cylinders are contained in a stationary square box centered at the origin. The fluid in the entire domain is initially at rest and the calculations were carried out with kinematic viscosity ν = 0.3. The steady-state analytical solution for this flow is known: the velocity distribution in the interior of the inner cylinder is that of solid-body rotation, and the azimuthal velocity between the two cylinders is given by:

u_\theta(r) = \frac{\Omega r_i \,(r_o/r - r/r_o)}{r_o/r_i - r_i/r_o}. \quad (5)

We compared this to the numerical solution for six different grid sizes ranging from 75 x 75 to 450 x 450. Figure 2(b) shows the L_2 and L_\infty norms of the relative errors, and that the scheme is first-order accurate in space, as expected for the IBM formulation we used.

Figure 2: (a) Comparison of the numerical solution on a 150 x 150 grid with the analytical solution and (b) convergence study, showing the L-2 and L-inf error norms against cell width, with a line indicating first-order convergence.

To verify the temporal order of convergence, we ran a simulation from t = 0 to t = 8 on a 150 x 150 grid, using different time steps (Δt = 0.01, 0.005 and 0.0025). Both first- and third-order accurate expansions of B^N were used, and the calculated orders of convergence (using the L_2 norms of the differences between the solutions) at various times are summarised in Table 1, and are as expected.

Time    Order of convergence (N = 1)    Order of convergence (N = 3)
1       0.97                            2.67
2       0.99                            2.85
4       0.93                            2.73
8       0.97                            2.83

Table 1: Calculated order of convergence at different times for the Couette-flow validation.
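Equation (5) is straightforward to evaluate. The sketch below, a helper of our own using the parameter values of this test (taking r_o = 1), checks the two no-slip conditions it encodes: u_theta = Omega r_i at the rotating inner wall and u_theta = 0 at the stationary outer wall:

```python
def u_theta(r, omega=0.5, r_i=0.5, r_o=1.0):
    """Steady azimuthal velocity between the cylinders, eq. (5)."""
    return omega * r_i * (r_o / r - r / r_o) / (r_o / r_i - r_i / r_o)

# Rotating inner wall: u_theta(r_i) = Omega * r_i = 0.25.
# Stationary outer wall: u_theta(r_o) = 0.
print(round(u_theta(0.5), 10), round(u_theta(1.0), 10))  # prints 0.25 0.0
```

This is the curve plotted against the numerical solution in Figure 2(a); its maximum value, 0.25 at r = r_i, matches the scale of that plot.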
Figure 3: Steady-state vorticity field (a) and time-varying drag coefficient (b) for external flow over a circular cylinder at Reynolds number 40. The contour lines in (a) are drawn from -3 to 3 in steps of 0.4.

4.2 External flow over a circular cylinder

We also carried out computations to simulate external flow over a circular cylinder at Reynolds number 40. The cylinder is of diameter d = 1, centered at the origin, and is placed in an external flow with freestream velocity u_\infty = 1. The simulation was carried out on a grid with uniform cell spacing in the entire domain, which was a square with opposite corners at (-15, -15) and (15, 15). The velocity at the inlet, top and bottom of the domain was fixed to the freestream velocity, and the outlet boundary condition used was \partial u/\partial t + u_\infty \,\partial u/\partial x = 0. The initial condition was a uniform velocity field in the entire domain. The vorticity field obtained for this case is shown in Figure 3(a) and the time-varying drag coefficient is plotted in Figure 3(b). The drag coefficient at steady state is found to be 1.6, which is in good agreement with the expected value [5].

5 Conclusions and Future Work

At this time, we have a validated GPU code for the projection IBM, and we have shown convergence at the expected rates. Using the free and open-source CUSP and Thrust libraries to provide sparse linear algebra functionality, a speedup of 7 over the equivalent CPU code was obtained for the largest problem tested. In the final paper we will provide a more extensive study of optimizations, timings and breakdowns, and demonstrate moving-boundary applications.

References

[1] R. Mittal and G. Iaccarino. Immersed boundary methods. Ann. Rev. Fluid Mech., 37(1):239-261, 2005.
[2] J. B. Perot. An analysis of the fractional step method. J. Comp. Phys., 108(1):51-58, 1993.
[3] C. S. Peskin. Flow patterns around heart valves: A numerical method. J. Comp. Phys., 10(2):252-271, 1972.
[4] K. Taira and T.
Colonius. The immersed boundary method: A projection approach. J. Comp. Phys., 225(2):2118-2137, 2007.
[5] D. J. Tritton. Experiments on the flow past a circular cylinder at low Reynolds numbers. J. Fluid Mech., 6(4):547-567, 1959.