ANNALES MATHÉMATIQUES BLAISE PASCAL


Georges-Henri Cottet, Jean-Matthieu Etancelin, Franck Perignon, Christophe Picard, Florian De Vuyst & Christophe Labourdette
Is GPU the future of Scientific Computing?
Volume 20, no. 1 (2013).

© Annales mathématiques Blaise Pascal, 2013, all rights reserved. Access to the articles of the journal "Annales mathématiques Blaise Pascal" implies agreement with its general conditions of use. Any commercial use or systematic printing constitutes a criminal offence. Any copy or print of this file must contain this copyright notice.

Published by the Laboratoire de mathématiques de l'université Blaise-Pascal, UMR 6620 du CNRS, Clermont-Ferrand, France. Article put online as part of cedram, Centre de diffusion des revues académiques de mathématiques.

Annales mathématiques Blaise Pascal 20 (2013)

Is GPU the future of Scientific Computing?

Georges-Henri Cottet, Jean-Matthieu Etancelin, Franck Perignon, Christophe Picard, Florian De Vuyst, Christophe Labourdette

Abstract. These past few years, new types of computational architectures based on graphics processors have emerged. These technologies provide important computational resources at low cost and low energy consumption. Many developments have been carried out around GPUs, and many tools and libraries are now available to implement software efficiently on these architectures. This article contains the two contributions of the mini-symposium about GPUs organized by Loïc Gouarin (Laboratoire de Mathématiques d'Orsay), Alexis Hérault (CNAM) and Violaine Louvet (Institut Camille Jordan). This mini-symposium was an opportunity to explore the upcoming role of hardware accelerators and how they will affect the way applications are designed and developed. As the main topic of the mini-symposium was graphics cards, this document contains two feedback reports on the behavior of different numerical methods on GPU: one on particle methods for transport equations, the other on Lattice Boltzmann methods for the Navier-Stokes equations, finite volume schemes for the Euler equations and particle methods for kinetic equations.

Keywords: GPU, particle method, PDE, fluid mechanics, interaction, visualization, instant computing, finite volumes, Lattice Boltzmann method, multicore programming.

Math. classification: 35L05, 65M08, 76M25, 76N15, 76P05, 97N40.

Le GPU est-il le futur du calcul scientifique ?

Résumé. In recent years, new types of architectures based on graphics processors have emerged. These technologies provide important computational resources at low cost and low energy consumption. The many developments carried out around GPUs have made it possible to design and implement software on this type of architecture. This article contains the two contributions of the GPU mini-symposium organized by Loïc Gouarin (Laboratoire de Mathématiques d'Orsay), Alexis Hérault (CNAM) and Violaine Louvet (Institut Camille Jordan). The first concerns particle methods for transport equations; the second concerns the solution of the Navier-Stokes and Euler equations.

Contents
1. Particle method on GPU: Previous work (Particle method, OpenCL computing); Particle method with OpenCL (Advection step, Remeshing step); Performances and results (Level set test case, Computational efficiency); Conclusions
2. Some examples of instant computations of fluid dynamics on GPU: GPU computing with instantaneous visualization and interaction in CFD, experience feedback (Instantaneous computing and interactive visualization, Impact on the design of numerical methods, Impact on the data structures, General experience feedback and concluding remarks); Videos of instant GPU computations on youtube
Acknowledgements
References

1. Particle method on GPU

by Georges-Henri Cottet, Jean-Matthieu Etancelin, Franck Perignon and Christophe Picard.

In this part we present a graphics processing unit (GPU) implementation of a particle method for transport equations. More precisely, the numerical method under consideration is a remeshed particle method. Not only does remeshing particles make simulations more accurate in flows with strong strain, it also leads to algorithms that are more regular in terms of data structures. In this work, we develop a Python library using GPUs through the OpenCL standard, which implements this remeshed particle method and already shows interesting performance.

Previous work

Particle method

In the present work, we focus on solving the advection equation (1.1) by means of a remeshed particle method:
$$\partial_t u + \operatorname{div}(a u) = 0. \tag{1.1}$$
Remeshed particle methods can be seen as forward semi-Lagrangian methods. Each time step consists of two sub-steps: an advection step, where particles carrying local masses of $u$ are advected with their local velocity, followed by a remeshing step, where particles are restarted on a regular grid. In the advection step, one solves the following system of differential equations:
$$\frac{d\tilde{x}_p}{dt} = a(\tilde{x}_p, t). \tag{1.2}$$
The remeshing step is performed with the following general formula:
$$u_g = \sum_p \tilde{u}_p \, W\!\left(\frac{x_g - \tilde{x}_p}{h}\right). \tag{1.3}$$
In the above formulas, $\tilde{x}_p$, $\tilde{u}_p$ represent the particle locations and weights after advection, and $x_g$, $u_g$ their locations and weights after remeshing on a regular grid. The algorithm to solve equation (1.1) for a time step $t = n\,\Delta t$ can be summarized as follows:

- Initialize particle locations and weights: $\tilde{x}_p^n \leftarrow x_g$, $\tilde{u}_p^n \leftarrow u_g^n$.
- Solve equation (1.2) with a 2nd (or 4th) order Runge-Kutta scheme:
  $$\tilde{x}_p^n \leftarrow \tilde{x}_p^n + \Delta t \, a^n\!\left(\tilde{x}_p^n + \tfrac{\Delta t}{2}\, a^n(\tilde{x}_p^n)\right).$$
- Remesh particles on the grid: $u_g = \sum_p \tilde{u}_p \, W\!\left(\frac{x_g - \tilde{x}_p}{h}\right)$.

Note that, depending on the problem at hand, particle velocities are either computed analytically or interpolated from grid values. In general this scheme is extended to any dimension using tensor-product formulas for the remeshing kernels. In [7], an alternating direction algorithm was proposed where particles are successively pushed and remeshed along the axis directions. This method reduces the cost of the remeshing step, making it possible to use high order interpolation kernels with large stencils. In this work we use the 6-point kernel $M'_6$ [1]:
$$M'_6(x) = \begin{cases}
\tfrac{1}{12}\,(1-|x|)\left(25|x|^4 - 38|x|^3 - 3|x|^2 + 12|x| + 12\right) & \text{if } 0 \le |x| < 1,\\
\tfrac{1}{24}\,(|x|-1)(|x|-2)\left(25|x|^3 - 114|x|^2 + 153|x| - 48\right) & \text{if } 1 \le |x| < 2,\\
\tfrac{1}{24}\,(3-|x|)^3(5|x|-8)(|x|-2) & \text{if } 2 \le |x| < 3,\\
0 & \text{if } |x| \ge 3.
\end{cases}$$
A distinctive feature of remeshed particle methods is that the time step does not depend on the number of particles. This makes it possible to perform accurate high-resolution simulations with large time steps.
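To make the two sub-steps concrete, here is a minimal one-dimensional NumPy sketch of a single advect-and-remesh step, with the RK2 push and the $M'_6$ kernel above. It is an illustration on a periodic grid, not the authors' OpenCL implementation; function and variable names are ours.

import numpy as np

def m6prime(x):
    """Remeshing kernel M'_6, evaluated at (x_g - x_p)/h."""
    y = np.abs(x)
    w = np.zeros_like(y)
    m1 = y < 1
    m2 = (y >= 1) & (y < 2)
    m3 = (y >= 2) & (y < 3)
    w[m1] = (1 - y[m1]) * (25*y[m1]**4 - 38*y[m1]**3 - 3*y[m1]**2 + 12*y[m1] + 12) / 12
    w[m2] = (y[m2] - 1) * (y[m2] - 2) * (25*y[m2]**3 - 114*y[m2]**2 + 153*y[m2] - 48) / 24
    w[m3] = (3 - y[m3])**3 * (5*y[m3] - 8) * (y[m3] - 2) / 24
    return w

def advect_remesh_step(u, a, h, dt):
    """One remeshed-particle step on a 1D periodic grid: particles start on the
    grid points, are pushed with the RK2 scheme above, then remeshed with the
    6-point kernel."""
    n = u.size
    xg = h * np.arange(n)
    # Advection (RK2): x_p = x_g + dt * a(x_g + dt/2 * a(x_g))
    xp = xg + dt * a(xg + 0.5 * dt * a(xg))
    # Remeshing (1.3): each particle deposits its weight on its 6 nearest grid points
    u_new = np.zeros_like(u)
    i0 = np.floor(xp / h).astype(int)
    for s in range(-2, 4):
        i = i0 + s                            # unwrapped grid index
        w = m6prime((i * h - xp) / h)         # kernel argument (x_g - x_p)/h
        np.add.at(u_new, i % n, u * w)        # periodic deposit, accumulating duplicates
    return u_new

With an analytically given velocity `a` (as in the level-set test further below), this routine is the one-dimensional building block that the splitting algorithm applies direction by direction.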

OpenCL computing

OpenCL is an open standard for parallel programming of heterogeneous systems [8]. It provides application programming interfaces for managing hybrid platforms containing many CPUs and GPUs, and a programming language based on C99 for writing instructions executed concurrently on the OpenCL devices. OpenCL applications must define an execution model by setting up a host program that executes on the host system and sends OpenCL kernels to the devices through a command queue. A kernel contains executable code that runs concurrently on the compute units of a device; each running instance of a kernel is called a work-item. A memory model needs to be explicitly defined to manage the data layout in the memory hierarchy. Details of these models, such as the number of work-items in a work-group or the memory access pattern, have a very strong impact on program efficiency.

Particle method with OpenCL

Particle methods have already been implemented on GPU. In [11], a two-dimensional solver for bluff body flows is developed using OpenGL and CUDA. The method deals with large particle distributions at a speed greater than 20 frames per second. The accuracy of GPU computations was also addressed by comparing GPU results with high-resolution double-precision benchmark calculations on CPU.

Our implementation of the method presented in the previous section uses several abstraction layers, by means of a Python class hierarchy, in order to have a well-defined program structure that is easy to use and to develop. Computations are not performed on the host side of the program but on the devices, in different kernels, transparently for the user. Following the method, the algorithm is split into two parts, namely an advection and a remeshing step. These two parts are repeated several times to perform a dimensional splitting for each simulation time step. We depict in figure 1.1 the algorithm for one splitting direction; a host-side sketch of this layout is given below. For simplicity we take the velocity as a constant. In the general case, the velocity field is computed once at every time step.

Figure 1.1. Execution layout on GPU for one splitting direction: the grid velocity and grid scalar are read by the advection and remeshing kernels, which produce the particle positions and the particle scalar. Memory objects are depicted in red and OpenCL kernels in green.
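The following NumPy sketch shows the host-side structure of figure 1.1 for one full time step: the scalar is swept direction by direction, each one-dimensional line being advected and remeshed independently. The routine `step_1d` stands for any 1D advect-and-remesh routine, for instance the sketch given earlier with velocities sampled along the line; the names and the periodic setting are our assumptions, not the authors' library.

import numpy as np

def splitting_step(u, velocity, h, dt, step_1d):
    """One time step on a 3D periodic grid: sweep x, y, then z.
    `velocity` is a sequence of three 3D arrays (one component per axis) and
    `step_1d(u_line, vel_line, h, dt)` is a 1D advect-and-remesh routine."""
    for axis in range(3):
        # Bring the current splitting direction to the last (contiguous) axis,
        # mirroring the transposition of the scalar data discussed in the paper.
        u_t = np.moveaxis(u, axis, -1).copy()
        a_t = np.moveaxis(velocity[axis], axis, -1)
        lines = u_t.reshape(-1, u_t.shape[-1])          # one work-item <-> one line
        vel_lines = a_t.reshape(-1, a_t.shape[-1])
        for k in range(lines.shape[0]):
            lines[k] = step_1d(lines[k], vel_lines[k], h, dt)
        u = np.moveaxis(u_t, -1, axis)
    return u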

The main constraints for implementations on GPU are to make a proper and optimized use of the memory size and bandwidth. In fact, in a three-dimensional case, each particle needs 6 floats in global memory to be completely defined; for example, the particle sets targeted here need 6GB of memory in single precision, while today's card memory ranges between 128MB and 8GB. This problem will be tackled in future work using several GPUs. Memory access is detailed in the following sections. In figure 1.1, memory objects are either OpenCL Buffers or Images; current work aims at determining which type of object is best suited to our algorithm.

In order to take advantage of the splitting algorithm, the different one-dimensional problems are distributed among work-items. In our implementation, one work-item is not in charge of one grid point but of one line of grid points in the splitting direction. The advantages of this distribution are detailed in the following parts. For example, the particles can be distributed over work-items, each one being in charge of 1024 particles.

Advection step

In our splitting algorithm, only the components of the velocity and of the particle position in the current splitting direction need to be considered. The other components are respectively unused or left unchanged; the particle position variable thus reduces to a scalar. In our method, particles are created on each grid point and initialized with the value on the grid. Grid point coordinates are re-computed each time from the global OpenCL index space thanks to built-in functions. Once particles are created and initialized, the evolution ODEs are solved for the particle positions using the grid velocity field, by means of a 2nd order Runge-Kutta scheme. The difficulty is to interpolate the grid velocity at the intermediate stages of the time-stepping scheme, because which data are needed depends on the velocity field itself; the memory access pattern might therefore not be linear. A simple improvement on this point is to bring the data closer to the work-items, by copying the needed grid data into private memory. Interpolations are then performed in private memory, so data are read with the fastest memory access available.
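As an illustration of the "one work-item per line" distribution, here is a heavily simplified PyOpenCL sketch of the advection step. The velocity is taken constant along each line so that no interpolation is needed (in which case the RK2 push reduces to a single Euler push); kernel, buffer and variable names are ours and not those of the authors' library.

import numpy as np
import pyopencl as cl

src = """
__kernel void advect_lines(__global const float *gvel,   // one velocity per line
                           __global float *pos,          // particle positions
                           const float x0, const float h,
                           const float dt, const int line_len)
{
    int line = get_global_id(0);          // one work-item per line
    float v = gvel[line];                 // copy once into private memory
    int base = line * line_len;
    for (int i = 0; i < line_len; ++i) {
        float xg = x0 + i * h;            // grid point carrying this particle
        // RK2 with a constant velocity reduces to a single Euler push
        pos[base + i] = xg + dt * v;
    }
}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
prg = cl.Program(ctx, src).build()

n_lines, line_len, h, dt = 64, 256, 1.0 / 256, 0.01
gvel = np.full(n_lines, 1.0, dtype=np.float32)
pos = np.empty(n_lines * line_len, dtype=np.float32)

mf = cl.mem_flags
d_vel = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=gvel)
d_pos = cl.Buffer(ctx, mf.WRITE_ONLY, pos.nbytes)

prg.advect_lines(queue, (n_lines,), None, d_vel, d_pos,
                 np.float32(0.0), np.float32(h), np.float32(dt),
                 np.int32(line_len))
cl.enqueue_copy(queue, pos, d_pos)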

A strong performance improvement was obtained by rearranging the data layout for the grid velocity: as the data are accessed line by line, we make them contiguous in a direction orthogonal to the splitting direction. Consequently, work-items can read contiguous data in global memory together and thus avoid strided accesses. For scalar data, a similar memory layout can be used by transposing the data from one splitting direction to the next. This implementation needs further improvements to cope with small private memory or with larger problems; in these cases, a new arrangement of the tasks is required.

Remeshing step

As for the advection, the memory access pattern is execution dependent, since a particle is remeshed on its nearest grid points. On top of that, two different particles can have exactly the same remeshing grid points, so synchronization between particles is needed to avoid concurrent memory write accesses. More precisely, remeshing points overlap for particles lying on the same one-dimensional line in the splitting direction under consideration; synchronization is therefore ensured by having a single work-item work on each line. A simple improvement is to write the results in a local buffer and, once all particles are remeshed, to copy this buffer into global memory. This minimizes global memory accesses for the remeshing.

Performances and results

Level set test case

Our method is tested on a classical and challenging problem for level set methods, namely a sphere subjected to an incompressible velocity field in the periodic cube $[0,1]^3$. This test consists of the advection of a passive scalar initialized with value $u = 1$ inside a sphere of radius 0.15 and $u = 0$ elsewhere. The advection velocity is given by the following formula (written out in code below):
$$a(x, y, z) = \begin{pmatrix} 2\sin^2(\pi x)\,\sin(2\pi y)\,\sin(2\pi z) \\ -\sin(2\pi x)\,\sin^2(\pi y)\,\sin(2\pi z) \\ -\sin(2\pi x)\,\sin(2\pi y)\,\sin^2(\pi z) \end{pmatrix}.$$
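For reference, here are the velocity field and the initial scalar of this test case written in NumPy; they are a direct transcription of the definitions above, except for the sphere centre, which the text does not specify and which we take at the middle of the cube as an assumption.

import numpy as np

def velocity(x, y, z):
    u = 2 * np.sin(np.pi * x)**2 * np.sin(2 * np.pi * y) * np.sin(2 * np.pi * z)
    v = -np.sin(2 * np.pi * x) * np.sin(np.pi * y)**2 * np.sin(2 * np.pi * z)
    w = -np.sin(2 * np.pi * x) * np.sin(2 * np.pi * y) * np.sin(np.pi * z)**2
    return u, v, w

def initial_scalar(x, y, z, center=(0.5, 0.5, 0.5), radius=0.15):
    """u = 1 inside a sphere of radius 0.15, u = 0 elsewhere (centre assumed)."""
    r2 = (x - center[0])**2 + (y - center[1])**2 + (z - center[2])**2
    return np.where(r2 < radius**2, 1.0, 0.0)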

One of the tests implemented in [7] and the references therein is presented in Figure 1.2. The simulation is performed with a time-step value which would correspond to a CFL number equal to 25 at this resolution; time $T = 1$ is reached in 20 iterations and takes 25 seconds.

Figure 1.2. Iso-surface of level 0.5 at times $T = 0$, 0.2, 0.4, 0.6, 0.8 and 1.0, for CFL = 25.

Computational efficiency

OpenCL kernels can be compute-bound or memory-bound. Our advection kernel is memory-bound, since its operational intensity equals 2.25 operations per byte of data accessed from memory, while the remeshing kernel is nearly compute-bound, with an operational intensity of 9.5.
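Operational intensity only tells whether a kernel is memory- or compute-bound once it is compared with the ratio of the device's peak arithmetic throughput to its memory bandwidth (the roofline argument). The small computation below illustrates this with hypothetical device numbers, which are not those of the card used here.

peak_gflops = 480.0          # hypothetical peak arithmetic throughput (Gflop/s)
bandwidth_gbs = 60.0         # hypothetical global-memory bandwidth (GB/s)
machine_balance = peak_gflops / bandwidth_gbs   # flop per byte at the ridge point

for name, intensity in [("advection", 2.25), ("remeshing", 9.5)]:
    bound = "memory-bound" if intensity < machine_balance else "compute-bound"
    attainable = min(peak_gflops, intensity * bandwidth_gbs)
    print(f"{name}: {intensity} flop/byte -> {bound} (attainable ~{attainable:.0f} Gflop/s)")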

In figure 1.3, we present profiling results and timings. For larger problems, almost all of the compute time is spent in the kernels, which can still be optimized. The initialization and host code are sequential and essentially independent of the problem size: the initialization part consists of reading the problem data and creating the Python object structure, while the host code sets up the OpenCL execution layout and launches the kernels.

Figure 1.3. Code profiling over different problem sizes on an ATI Radeon HD 6770M: time distribution between initialization, advection kernel, remeshing kernel and host code, together with the total time per iteration.

For this problem, the whole computation is performed in 25 seconds, which corresponds to 1.25 seconds per time step. As a comparison, a Fortran/MPI solver performs the same simulation on 4 Intel Core i7 cores running at 2.4 GHz in 62 seconds, which corresponds to 12.4 seconds per time step. This shows a speedup of nearly 10 against the parallel Fortran code.

To give another comparison, [11] reports computational times per time step for one million particles for a whole Navier-Stokes solver in single precision, which can be put side by side with our timings for problems of similar size. Note however that the problems are rather different, since the problems considered in [11] were two-dimensional and used lower order remeshing schemes, but involved non-local field evaluations (through FFT).

Conclusions

In this work we presented implementations of 3D particle methods on GPUs. A splitting algorithm together with a high order remeshing kernel were considered for a transport equation discretized on a single GPU with about 16 million particles. Ongoing work concerns further optimizations of the code and its Python implementation on several GPUs in order to tackle larger problems.

2. Some examples of instant computations of fluid dynamics on GPU

by Florian De Vuyst and Christophe Labourdette.

This section is a summary of our experience feedback on GPU and GPGPU computing for two-dimensional computational fluid dynamics using fine grids and for three-dimensional kinetic transport problems. The choice of the computational approach is clearly critical for both performance speedup and efficiency. In our numerical experiments, we used a Lattice Boltzmann approach (LBM) for the incompressible Navier-Stokes equations, a finite volume Flux Vector Splitting (FVS) method for the compressible Euler equations, and a Lagrangian particle approach for a linear kinetic problem.

GPU computing with instantaneous visualization and interaction in CFD: experience feedback

Instantaneous computing and interactive visualization

High performance computing (HPC) has grown considerably in recent years. Theoretical peak processing performance and storage capacity in supercomputers have gained three or four orders of magnitude in less than a decade. However, scientists and computational engineers would like more flexibility in terms of response delay and ease of use. Manufacturers develop cluster computers to exceed the petaflop (with the exaflop in sight), but the cost and planning of very large computations impose workflow constraints in batch mode. Recently, the design of manycore processors such as graphics processing units (GPU), general purpose graphics processing units (GPGPU) and many-integrated cores (MIC) has made it possible to get theoretical teraflop performance in a simple office workstation.

This potential flexibility of use, with only one user, lets us imagine new ways of computing and new use cases such as interactive simulation and instant computations. Applied mathematicians are often little concerned with very large calculations; they are more interested in the design of the methods and algorithms used to perform them. That is why we emphasize here ways of instant computing, in particular on GPU or GPGPU. We especially focus on fluid dynamics problems where the time scales of interest allow for interaction with the simulation, and where the models can be controlled by changing parameters and operating conditions with effects viewed instantaneously. The spin-off of such applications is the ease with which a user may play with the computational method and the underlying physics thanks to the visualization. We believe that the coupling between instantaneous computing and interactive visualization really brings an extra dimension to better understand and evaluate methods and schemes. It is a valuable new tool for the applied mathematician. GPU computing today seems to be an affordable and almost unavoidable way to build such applications. Of course there is a price to pay to take full advantage of GPU resources: we need to reconsider conventional methods and design new, innovative computational algorithms to really approach the theoretical peak performance.

Impact on the design of numerical methods

As a simple statement of fact, standard sophisticated discretization methods for partial differential problems or optimization algorithms are not really suited to high-performance parallelization on manycore GPU-like architectures. For example, implicit methods lead to a large (sparse) linear system which is solved either by a direct method, involving a sparse factorization and sequential forward and backward substitutions, or by an iterative algorithm which is also sequential by nature. Of course, one can find BLAS-like libraries for manycores (like cuBLAS on NVIDIA GPUs), but the speedups reported today are only partially satisfactory. For that reason, explicit methods are certainly much more suitable on GPU architectures. Another aspect is the memory access to neighboring degrees of freedom, which is a common issue for PDE-based problems. It is important to notice that there are strongly optimized data structures and methods for fixed-neighbor stencil access patterns. This makes cartesian structured grids very suitable candidates.

For complex geometries, one can imagine immersed boundary methods (IBM) on cartesian uniform meshes. Our belief at CMLA is that we have to reconsider both models and methods, in order to derive efficient single-program multiple-data (SPMD) algorithms with communication rates that do not affect the floating point performance too much. For most of the classical PDEs, there are promising tracks toward manycore-suited computational approaches. For example, Lattice Boltzmann (LB) methods are a particular class of cellular automata able to discretize classical equations like the heat equation, convection-diffusion equations and even the unsteady Navier-Stokes equations or more complicated coupled systems. Because of the explicit nature of the method and of the uniform local spatial stencil of the discretization, LB methods are excellent SPMD computational approaches. In the next section, we focus and give more details on LB methods for solving the incompressible Navier-Stokes equations.

Another track is the underlying microscopic dynamics behind macroscopic models. Generally, PDEs are nothing but a deterministic macroscopic representation of some microscopic or mesoscopic dynamics with uncertainty taken into account (stochastic effects like Brownian motion). The Laplace operator, for example, is the macroscopic diffusion operator of the microscopic Brownian random walk. Generally, at the microscopic scale, there is an underlying transport process and a collision/interaction phenomenon. On the other hand, numerical simulation at the microscopic scale requires a large number of individuals in order to derive statistics accurate enough to recover the macroscopic effects. Manycore architectures are excellent hardware candidates to run population-based computational approaches (like particle methods) on a very large number of individuals. In the sequel we give some illustrative examples.

Incompressible flows. Consider the two-dimensional incompressible Navier-Stokes equations defined on a bounded spatial domain $\Omega \subset \mathbb{R}^2$:
$$\nabla \cdot u = 0, \quad t > 0, \ x \in \Omega, \tag{2.1}$$
$$\rho\left(\partial_t u + (u \cdot \nabla) u\right) + \nabla p - \rho\nu\,\Delta u = 0, \quad t > 0, \ x \in \Omega, \tag{2.2}$$
where $\rho$ is the (constant) density, $u$ the fluid velocity, $p$ the pressure and $\nu > 0$ the kinematic viscosity of the fluid.

There is a lot of literature on how to discretize this system of equations. Because of the implicit incompressibility constraint $\nabla \cdot u = 0$, an implicit solver is generally used, leading to a large linear system to solve at each time step, which is not very directly suitable for GPUs. For two decades, a new family of discrete explicit methods, namely the Lattice Boltzmann methods (LBM) (see the book [12] for example, or the review paper [9]), has provided a kind of kinetic-based cellular automata. They are based on a discretization of the Boltzmann equation
$$\partial_t f + e \cdot \nabla_x f = (\partial_t f)_{\mathrm{coll}}$$
governing the distribution $f = f(x, e, t)$ of gas particles having speed $e$ at position $x$ and time $t$. The term $(\partial_t f)_{\mathrm{coll}}$ models all the possible pairwise particle collisions. The system is expected to fulfil the so-called H-theorem, stating that the entropy functional $H(f) = \int f \log f \, de$ decreases in time. The equilibrium steady state returns the well-known Maxwellian distribution (see for example [3] for a general theory). Lattice Boltzmann methods have the advantage of being very easy to implement, of dealing with complex geometries using an immersed boundary approach, and of being potentially very accurate.

Let us consider the 2D Navier-Stokes case: the basic LB method is the so-called Lattice BGK (LBGK) method, which uses a BGK collision operator
$$(\partial_t f)_{\mathrm{coll}} = \frac{f^{\mathrm{eq}} - f}{\tau}$$
for an equilibrium distribution $f^{\mathrm{eq}}$ and a characteristic collision time scale $\tau > 0$. The discretization process first deals with a finite set of discrete velocities $\{e_i\}_{i=1,\dots,N}$, $N > 1$. This leads to a coupled system of spatial transport equations
$$\partial_t f_i + e_i \cdot \nabla_x f_i = \frac{f_i^{\mathrm{eq}} - f_i}{\tau}, \quad i = 1, \dots, N,$$
with $f_i(x, t) \approx f(x, e_i, t)$. The standard D2Q9 lattice makes use of a uniform spatial grid with constant space step $\Delta x = 1$ per direction, and (only) $N = 9$ discrete microscopic velocities $\{e_i\}_{i=0,\dots,8}$ (with $e_0 = 0$), as shown in figure 2.1. We thus have nine discrete transport-collision equations to solve.

Using the characteristic method for integrating the transport term and a first-order explicit Euler discretization for the collision term, we get
$$f_i(x + e_i \Delta t, t + \Delta t) - f_i(x, t) = \frac{\Delta t}{\tau}\left[f_i^{\mathrm{eq}}(x, t) - f_i(x, t)\right], \quad i = 0, \dots, 8, \tag{2.3}$$
with $\tau > 0$ the characteristic collision time, for each lattice point $x$. The zeroth-order and first-order moments allow us to retrieve both density and momentum. Denoting $S = \{0, \dots, 8\}$, we have
$$\sum_{i \in S} (1, e_i)\, f_i(x, t) = (\rho, \rho u)(x, t). \tag{2.4}$$
From a formal Chapman-Enskog expansion of the discrete density probability functions $f_i$,
$$f_i = f_i^{(0)} + \varepsilon f_i^{(1)} + \varepsilon^2 f_i^{(2)} + \dots,$$
where $\varepsilon$ is a lattice Knudsen number, it is possible, for $\varepsilon \ll 1$, to retrieve the macroscopic fluid mechanics equations. In order to reproduce the Navier-Stokes equations, only the first two approximations $f_i^{(0)}$ and $f_i^{(1)}$ are required [9]. The zeroth-order term $f_i^{(0)}$ identifies with the equilibrium distribution $f_i^{\mathrm{eq}}$. Let us consider a dimensionless lattice size $\Delta x = 1$ and a lattice speed $c = 1$ ($\Delta t = 1$). By choosing the equilibrium density function
$$f_i^{\mathrm{eq}} = w_i\,\rho\left(1 + 3\,e_i \cdot u + \tfrac{9}{2}(e_i \cdot u)^2 - \tfrac{3}{2}|u|^2\right), \tag{2.5}$$
with weighting factors $w_0 = 4/9$, $w_k = 1/9$ for $k = 1, \dots, 4$ and $w_k = 1/36$ for $k = 5, \dots, 8$, then for $|u| \ll 1$ we get (up to a scaling) second order accurate approximations of the incompressible Navier-Stokes equations, with a kinematic viscosity $\nu$ equal to
$$\nu = \frac{1}{3}\left(\tau - \frac{1}{2}\right). \tag{2.6}$$
In practice, for a given fluid kinematic viscosity $\nu$, we compute $\tau > \frac{1}{2}$ such that (2.6) holds. It can be shown that the LBGK method is linearly stable as soon as $\tau > 1/2$. In practice, it becomes unstable for $\tau$ close to $\frac{1}{2}$ (high Reynolds number), but stabilization methods exist in this case (MRT, entropy fix, positivity preserving, etc., see [2]).

Figure 2.1. The two-dimensional D2Q9 lattice pattern.

LBM code porting on GPU. It is easy to check that the LBM can be rewritten as a two-step fractional step method, with i) a collisionless transport evolution, and ii) a pure collision process. On the GPU, the collision step can be done perfectly in parallel because it involves only pointwise operations. The transport step requires communications with the first neighboring lattice points, but because this communication pattern is uniform over the whole computational domain, there is an important potential for improving memory access and gaining performance. On NVIDIA GPU boards, using CUDA programming, one can use texture memory (both its structures and its access methods), which is optimized for fixed memory patterns. We implemented the D2Q9 LBGK scheme with the stabilization method proposed by Brownlee et al. in [2]. We used Pixel Buffer Objects (PBO) for OpenGL instant visualization and the binding between CUDA structures and PBOs. On a fine lattice grid, we observe speedup factors of about 100 compared to a single-thread sequential CPU computation, allowing for interactivity and the visual observation of the appearance and evolution of von Kármán vortex sheddings. Flow interaction is made possible by adding new obstacles with the mouse during the computation (see figure 2.2). This is easily handled in the GPU program using a solid mask array.

Figure 2.2. Instant Lattice Boltzmann GPU computation of the Navier-Stokes equations on a fine cartesian grid. Flow interaction is made possible by adding new obstacles with the mouse during the computation.
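A compact CPU sketch of this two-step structure (streaming, then pointwise BGK collision) can be written with NumPy on a periodic D2Q9 lattice. It only illustrates the algorithm that is ported to the GPU, not the authors' CUDA code, and the variable names are ours.

import numpy as np

# D2Q9 velocities e_i and weights w_i (e_0 = 0)
E = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])
W = np.array([4/9] + [1/9] * 4 + [1/36] * 4)

def equilibrium(rho, ux, uy):
    """Discrete equilibrium (2.5): w_i rho (1 + 3 e_i.u + 9/2 (e_i.u)^2 - 3/2 |u|^2)."""
    usq = ux**2 + uy**2
    feq = np.empty((9,) + rho.shape)
    for i, (ex, ey) in enumerate(E):
        eu = ex * ux + ey * uy
        feq[i] = W[i] * rho * (1 + 3 * eu + 4.5 * eu**2 - 1.5 * usq)
    return feq

def lbgk_step(f, tau):
    """One lattice time step: stream, compute moments, relax toward equilibrium."""
    # 1. Streaming: f_i(x + e_i, t + 1) = f_i(x, t), periodic boundaries
    for i, (ex, ey) in enumerate(E):
        f[i] = np.roll(np.roll(f[i], ex, axis=0), ey, axis=1)
    # 2. Moments (2.4): density and momentum
    rho = f.sum(axis=0)
    ux = (f * E[:, 0, None, None]).sum(axis=0) / rho
    uy = (f * E[:, 1, None, None]).sum(axis=0) / rho
    # 3. BGK collision (2.3): purely pointwise, hence embarrassingly parallel
    f += (equilibrium(rho, ux, uy) - f) / tau
    return f

The relaxation time is obtained from the target viscosity through (2.6), i.e. tau = 3*nu + 0.5.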

Compressible flows. Let us now consider a compressible fluid. The Euler equations govern the dynamics of a perfect fluid. The mass, momentum and total energy conservation equations read
$$\partial_t \rho + \nabla \cdot (\rho u) = 0, \tag{2.7}$$
$$\partial_t (\rho u) + \nabla \cdot (\rho u \otimes u) + \nabla p = 0, \tag{2.8}$$
$$\partial_t (\rho E) + \nabla \cdot \left((\rho E + p)\, u\right) = 0, \tag{2.9}$$
with density $\rho$, velocity vector $u$, pressure $p$ and specific total energy $E$. The energy $E$ is the sum of the kinetic energy $|u|^2/2$ and the internal energy $e$. For a perfect gas with constant specific heat ratio $\gamma$, $\gamma \in (1, 3]$, we have
$$E = e + \frac{|u|^2}{2}, \qquad e = \frac{1}{\gamma - 1}\,\frac{p}{\rho}. \tag{2.10}$$
The above system can be written in condensed vector form
$$\partial_t U + \nabla \cdot F(U) = 0, \qquad U = (\rho, \rho u, \rho E)^T. \tag{2.11}$$
This system is known to be hyperbolic on its admissible state space (see [4]).
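The perfect-gas relations (2.10) are conveniently packaged as conversions between the conserved vector $U$ of (2.11) and the primitive quantities. Here is a small NumPy helper, written for the two-dimensional case with an illustrative value of $\gamma$ (the text only requires $\gamma \in (1,3]$).

import numpy as np

GAMMA = 1.4   # illustrative specific heat ratio

def primitives(U):
    """Conserved (rho, rho u, rho v, rho E) -> primitive (rho, u, v, p)."""
    rho, rhou, rhov, rhoE = U
    u, v = rhou / rho, rhov / rho
    e = rhoE / rho - 0.5 * (u**2 + v**2)      # internal energy: e = E - |u|^2/2
    p = (GAMMA - 1.0) * rho * e               # perfect gas: e = p / ((gamma-1) rho)
    return rho, u, v, p

def conserved(rho, u, v, p):
    """Primitive (rho, u, v, p) -> conserved (rho, rho u, rho v, rho E)."""
    e = p / ((GAMMA - 1.0) * rho)
    E = e + 0.5 * (u**2 + v**2)
    return np.array([rho, rho * u, rho * v, rho * E])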

For the discretization, we consider a conservative finite volume scheme built on an unstructured finite volume mesh made of cells $K$. We denote by $A_{KL}$ the edge separating two volumes $K$ and $L$, and by $\nu_{KL}$ the unit exterior vector orthogonal to $A_{KL}$. A general explicit first-order finite volume scheme reads
$$U_K^{n+1} = U_K^n - \frac{\Delta t}{|K|} \sum_{L \in V(K)} |A_{KL}|\; \Phi(U_K^n, U_L^n, \nu_{KL}), \tag{2.12}$$
with a numerical flux $\Phi(U_K^n, U_L^n, \nu_{KL})$ having at least Lipschitz-continuous regularity and being consistent, i.e. $\Phi(U, U, \nu) = F(U)\,\nu$. For stability purposes, numerical fluxes are generally built to fulfil an upwinding property. In this context, two main families of upwind fluxes are identified (see [6]): the Flux Difference Splitting (FDS) one and the Flux Vector Splitting (FVS) one. FDS fluxes (including Godunov, Osher, Roe, HLLE, etc.) are written in the form
$$\Phi(U_K^n, U_L^n, \nu_{KL}) = \frac{1}{2}\left(F(U_K^n)\,\nu_{KL} + F(U_L^n)\,\nu_{KL}\right) - \frac{1}{2} \int_0^1 \left|A\!\left(\Gamma_{KL}^n(s), \nu_{KL}\right)\right|\, (\Gamma_{KL}^n)'(s)\, ds, \tag{2.13}$$
where $A(U, \nu)$ denotes the (diagonalizable) Jacobian matrix of the flux in the direction $\nu$, and $\Gamma_{KL}^n = \Gamma(U_K^n, U_L^n, s)$ is a Lipschitz continuous path linking the states $U_K^n$ and $U_L^n$ with a curvilinear parameter $s \in [0, 1]$. The second term of the right-hand side of (2.13) corresponds to the upwind artificial viscosity term. From the GPU computational point of view, FDS schemes require at each time step i) the computation of the FDS flux, with memory reads of the cell states, and ii) cell updates, with memory reads of the FDS fluxes (see figure 2.3). Memory transfer may be a limiting performance factor for GPUs if the DRAM bandwidth is saturated.

Figure 2.3. FV scheme with Flux Difference Splitting (FDS) fluxes. FDS schemes require two memory transfers: (a) computation of the numerical flux with memory reads of the states; (b) state update into the control volumes with memory reads of the numerical fluxes.

The Flux Vector Splitting (FVS) family [6] has numerical fluxes written in the form
$$\Phi(U_K^n, U_L^n, \nu_{KL}) = F^+(U_K^n, \nu_{KL}) + F^-(U_L^n, \nu_{KL}), \tag{2.14}$$
with $F^+$ representing the leftward part of the flux and $F^-$ the rightward part.

Consistency requirements involve the identity $F^+(U, \nu) + F^-(U, \nu) = F(U)\,\nu$. What is peculiar with FVS is that $F^+(U_K^n, \nu_{KL})$ can be computed without any knowledge of the neighboring state $U_L^n$, and conversely. The GPU computation of $F^+(U_K^n, \nu_{KL})$ may then be seen as a pointwise, cell-centered computation, perfectly parallel. For that reason, one only has to send $F^+$ and $F^-$ to the neighboring cells for the state updates, thus reducing memory reads and DRAM transfers (see figure 2.4).

Figure 2.4. FV scheme with Flux Vector Splitting (FVS) fluxes. FVS only requires one memory transfer: $F^+$ or $F^-$ is sent to the neighboring cells for the state update.

Moreover, FVS fluxes generally require neither an eigenstructure decomposition nor matrix-vector products, which improves the overall performance. For example, van Leer's FVS with the Hänel-Schwane energy-flux modification [5] leads to the following expressions (written here for $\nu = (1, 0)$):
$$F_\rho^+ = \frac{\rho c}{4}(M + 1)^2\, \mathbf{1}_{(|M| \le 1)} + \rho \max(u, 0)\, \mathbf{1}_{(|M| > 1)}, \tag{2.15}$$
$$p^+ = \frac{p}{4}(M + 1)^2 (2 - M)\, \mathbf{1}_{(|M| \le 1)} + p\, \mathbf{1}_{(M > 1)}, \tag{2.16}$$
$$F_{\rho u}^+ = u\, F_\rho^+ + p^+, \qquad F_{\rho v}^+ = v\, F_\rho^+, \tag{2.17}$$
$$F_{\rho E}^+ = \left(E + \frac{p}{\rho}\right) F_\rho^+, \tag{2.18}$$
where $c = \sqrt{\gamma p / \rho}$ is the speed of sound and $M = u/c$ is the normal Mach number. For these reasons, FVS schemes are clearly better candidates for GPU implementation and high-performance computation [13].
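As a concrete illustration, the split flux $F^+$ of (2.15)-(2.18) for a face with normal $\nu = (1, 0)$ can be written in a few NumPy lines; $F^-$ then follows from the consistency relation $F^+(U,\nu) + F^-(U,\nu) = F(U)\,\nu$. This is a sketch of the formulas above with an illustrative value of $\gamma$, not the authors' CUDA kernel.

import numpy as np

GAMMA = 1.4   # illustrative specific heat ratio

def fvs_plus(rho, u, v, p):
    """Leftward-going part F^+ of the flux through a face with normal (1, 0).
    Only cell-local quantities are needed, which is what makes FVS GPU-friendly."""
    c = np.sqrt(GAMMA * p / rho)                  # speed of sound
    M = u / c                                     # normal Mach number
    sub = np.abs(M) <= 1.0                        # subsonic faces use the split flux
    f_rho = np.where(sub, rho * c / 4 * (M + 1)**2, rho * np.maximum(u, 0.0))
    p_plus = np.where(sub, p / 4 * (M + 1)**2 * (2 - M), np.where(M > 1, p, 0.0))
    E = p / ((GAMMA - 1.0) * rho) + 0.5 * (u**2 + v**2)   # specific total energy
    f_rhou = u * f_rho + p_plus
    f_rhov = v * f_rho
    f_rhoE = (E + p / rho) * f_rho                # Hänel-Schwane: total enthalpy in the energy flux
    return np.array([f_rho, f_rhou, f_rhov, f_rhoE])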

On figure 2.5, we show an instant computation of the well-known 2D Mach 3 forward-facing step case on a very fine grid, using (2.15)-(2.18) as the FVS scheme. Again, mouse interactivity allows us to add obstacles on-the-fly and observe the fluid response. This is of great interest for training, education and the comprehension of fluid dynamics.

Figure 2.5. Instant GPU computation of the well-known 2D Mach 3 forward step channel problem. Hänel's FVS is used here. The flow can be perturbed by directly adding new wall obstacles with the mouse pointer.

Three-dimensional free transport kinetic equations. This case was designed to evaluate the GPU performance of particle methods. Let us consider the following homogeneous Vlasov equation in 3D: find $f = f(x, v, t)$, $x \in \Omega(t) \subset \mathbb{R}^3$, $v \in \mathbb{R}^3$, $t > 0$, solution of
$$\partial_t f + v \cdot \nabla_x f + a(x) \cdot \nabla_v f = 0, \quad x \in \Omega(t) \subset \mathbb{R}^3, \ v \in \mathbb{R}^3, \ t > 0, \tag{2.19}$$
with $f(\cdot, \cdot, 0) \in L^1(\Omega \times \mathbb{R}^3)$. Standard Eulerian discretization methods would involve a mesh in a space of dimension 6, which is still not realistic at the present time. An alternative approach is to reformulate the transport dynamics behind this equation. Let us consider the following system of motion equations defined on a large set of particles $\{x_k\}_k$:
$$\dot{x}_k(t) = v_k, \qquad \dot{v}_k(t) = a(x_k), \tag{2.20}$$
and the discrete measure-valued distribution
$$f = \sum_{k=1}^{N} \omega_k\, \delta(x - x_k(t))\, \delta(v - v_k(t)), \tag{2.21}$$
for some given weighting factors $\{\omega_k\}_k$, $\omega_k > 0$.

We want to evaluate the $L^1$ norm of $f$ on the time-dependent domain $\Omega(t)$ (a pulsating sphere for example, see [14]). For that, we have to compute at each time step
$$\|f(t)\|_{L^1(\Omega(t))} = \sum_{k=1}^{N} \omega_k\, \mathbf{1}_{(x_k \in \Omega(t))}. \tag{2.22}$$
This summation has to be optimized on a GPU architecture. For that, we used the sum-reduction algorithm proposed in the GPU code examples of the CUDA SDK. We have observed significant parallel speedup factors for particle sets of between 1 million and 8 million particles on a GPGPU Tesla C2070.

Figure 2.6. Validation of the GPU acceleration of free transport equations on a moving domain, with a parallel reduction at each time step for the $L^1$-norm computation: (a) schematic of the particles and the moving domain; (b) discrete $L^1(\Omega(t))$-norm of the distribution as a function of time.
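The following NumPy sketch shows the particle push (2.20) and the weighted indicator sum (2.22); on the GPU, the final sum is the part carried out with the CUDA SDK parallel reduction. The pulsating-sphere law used for $\Omega(t)$ is an illustrative assumption in the spirit of [14], not the exact domain used by the authors.

import numpy as np

def push(x, v, a, dt):
    """A simple explicit step for (2.20): dx/dt = v, dv/dt = a(x)."""
    v = v + dt * a(x)
    x = x + dt * v
    return x, v

def l1_norm_on_domain(x, weights, t, r0=1.0, eps=0.2, freq=1.0):
    """||f(t)||_{L^1(Omega(t))} = sum_k w_k 1_{x_k in Omega(t)} for a pulsating
    sphere of radius r(t) = r0 (1 + eps sin(2 pi freq t)) centred at the origin
    (illustrative choice of Omega(t))."""
    r = r0 * (1.0 + eps * np.sin(2 * np.pi * freq * t))
    inside = (x**2).sum(axis=1) <= r**2
    return np.sum(weights * inside)       # this sum is the reduction step on the GPU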

Impact on the data structures

In the above sections, we have seen how GPU computing may change the way of thinking about both physical models and computational methods. Beyond the purely computational aspects, there are of course the programming and code optimization dimensions. The speedup obtained and the work required by GPU-specific programming are both part of the search for performance. Parallel computing strongly alters the balance of power in the classical memory-computation duality. A first simple rule is to keep data on the parallel device as much as possible and to transfer data as little as possible, because of the limited bandwidth; it is often even cheaper to recompute a result than to transmit it. Communication performance is also strongly linked to the way complex data are handled. Data coalescence is a critical key point for optimal performance: to offset the large latency of global memory, a rational way is to read consecutive blocks of memory (coalescing). There are two kinds of non-coalescing memory accesses, described in figure 2.7: either the threads do not access neighboring fields in the right order, or there is an alignment problem, where the first thread of a warp must access a memory address that is a multiple of 32, 64 or 128 bytes (depending on the data), see [10].

Figure 2.7. Schematic of data coalescence, extracted from the CUDA C Best Practices Guide, NVIDIA 2012 [10].

When looking for optimized performance, it is difficult to work with very sophisticated data models. The notion of array fits this scheme perfectly, whereas more elaborate data types such as structures or classes tend to scatter the data. For example, it is much more efficient to manipulate Structures of Arrays (SoA) than Arrays of Structures (AoS), as illustrated below. To be efficient, it is necessary to stay close to the data and to always keep in mind the specific hardware architecture of the GPU and the constraints it imposes on the data. This is probably the main reason why we think that, today, CUDA is one of the most appropriate languages to obtain optimal performance on GPUs.
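A small CPU-side illustration of the layout argument (not GPU code): with an Array of Structures, the fields of one particle are interleaved in memory, so threads reading only positions perform strided accesses; with a Structure of Arrays, each field is one contiguous array and accesses coalesce.

import numpy as np

n = 1_000_000

# AoS: one record per particle (x, y, z, vx, vy, vz interleaved in memory)
particle_t = np.dtype([("x", np.float32), ("y", np.float32), ("z", np.float32),
                       ("vx", np.float32), ("vy", np.float32), ("vz", np.float32)])
aos = np.zeros(n, dtype=particle_t)
aos_x = aos["x"]                 # strided view: stride = 24 bytes, not 4

# SoA: one contiguous array per field
soa = {name: np.zeros(n, dtype=np.float32) for name in ("x", "y", "z", "vx", "vy", "vz")}
soa_x = soa["x"]                 # contiguous: stride = 4 bytes

print(aos_x.strides, soa_x.strides)   # (24,) versus (4,)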

General experience feedback and concluding remarks

Let us conclude with some humble advice.
- GPU scientific computing today is a question of tradeoff between performance and design and/or implementation effort. A reasonable (suboptimal) speedup, say 10 or 20, is easy to reach.
- GPU parallel programming is far easier for computational methods defined on cartesian grids or for meshfree particle methods.
- For stronger performance, the way is to find a good tradeoff between implementation effort, code readability and performance.
- GPU computing requires real reflection on the choice of data structures, especially for the sake of memory coalescence: arrays of structures (AoS) versus structures of arrays (SoA), byte alignment, etc.
- In the same spirit, the strategy and use of intermediate cache memory levels (per streaming multiprocessor, for example) is also of great importance for high performance.
- Texture memory is particularly suited for local partial differential operators involving uniform spatial stencils.

Our belief at CMLA is that GPU/manycore processors will deeply impact numerical solvers in the coming years. We have to think about paradigm shifts in both modeling and discretization in order to get much better GPU performance.

Videos of instant GPU computations on youtube

All the interactive computations can be found at the following URL:

Acknowledgements

This work has been partly supported by the FARMAN Institute of ENS Cachan and by an NVIDIA Equipment Grant.

References

[1] M. Bergdorf & P. Koumoutsakos, A Lagrangian particle-wavelet method, Multiscale Model. Simul. 5 (2006), no. 3.
[2] R. Brownlee, A. Gorban & J. Levesley, Stabilization of the lattice Boltzmann method using the Ehrenfests' coarse-graining data, Physical Review E 74 (2006).
[3] C. Cercignani, The Boltzmann Equation and Its Applications, vol. 67, Springer-Verlag.
[4] E. Godlewski & P.-A. Raviart, Numerical Approximation of Hyperbolic Conservation Laws, Applied Mathematical Sciences, vol. 118, Springer-Verlag, Boston.
[5] D. Hänel & R. Schwane, An implicit flux-vector splitting scheme for the computation of viscous hypersonic flow, AIAA Paper (1989).
[6] A. Harten, P. Lax & B. van Leer, On upstream differencing and Godunov-type schemes for hyperbolic conservation laws, SIAM Review 25 (1983).

[7] A. Magni & G. Cottet, Accurate, non-oscillatory, remeshing schemes for particle methods, J. Comput. Phys. 231 (2012), no. 1.
[8] A. Munshi et al., The OpenCL Specification, Khronos OpenCL Working Group (2011).
[9] R. Nourgaliev, T. Dinh, T. Theofanous & D. Joseph, The lattice Boltzmann equation method: theoretical interpretation, numerics and implications, Int. J. of Multiphase Flow 29 (2003).
[10] CUDA C Best Practices Guide 4.1, NVIDIA, 2012.
[11] D. Rossinelli, M. Bergdorf, G. Cottet & P. Koumoutsakos, GPU accelerated simulations of bluff body flows using vortex methods, J. Comput. Phys. 229 (2010), no. 9.
[12] S. Succi, The Lattice Boltzmann Equation for Fluid Dynamics and Beyond, Oxford University Press, 2001.
[13] F. De Vuyst, A flux vector splitting method that preserves stationary contact discontinuities, Acta Mathematicae Applicandae (2013), accepted, under revision.
[14] F. De Vuyst & F. Salvarani, GPU-accelerated numerical simulations of the Knudsen gas on time-dependent domains, Computer Physics Communications 184 (2013), no. 3.

Georges-Henri Cottet
Laboratoire Jean Kuntzmann, Université Joseph Fourier, Grenoble Cedex 9, France
Georges-Henri.Cottet@imag.fr

Jean-Matthieu Etancelin
Laboratoire Jean Kuntzmann, Université Joseph Fourier, Grenoble Cedex 9, France
Jean-Matthieu.Etancelin@imag.fr

Franck Perignon
Laboratoire Jean Kuntzmann, Université Joseph Fourier, Grenoble Cedex 9, France
franck.perignon@imag.fr

Florian De Vuyst
Centre de Mathématiques et de leurs Applications (CMLA), CNRS UMR, avenue du Président Wilson, Cachan CEDEX, France
devuyst@cmla.ens-cachan.fr

Christophe Picard
Laboratoire Jean Kuntzmann, Université Joseph Fourier, Grenoble Cedex 9, France
Christophe.Picard@imag.fr

Christophe Labourdette
Centre de Mathématiques et de leurs Applications (CMLA), CNRS UMR, avenue du Président Wilson, Cachan CEDEX, France
labour@cmla.ens-cachan.fr


More information

International Supercomputing Conference 2009

International Supercomputing Conference 2009 International Supercomputing Conference 2009 Implementation of a Lattice-Boltzmann-Method for Numerical Fluid Mechanics Using the nvidia CUDA Technology E. Riegel, T. Indinger, N.A. Adams Technische Universität

More information

ENERGY-224 Reservoir Simulation Project Report. Ala Alzayer

ENERGY-224 Reservoir Simulation Project Report. Ala Alzayer ENERGY-224 Reservoir Simulation Project Report Ala Alzayer Autumn Quarter December 3, 2014 Contents 1 Objective 2 2 Governing Equations 2 3 Methodolgy 3 3.1 BlockMesh.........................................

More information

Immersed Boundary Method and Chimera Method applied to Fluid-

Immersed Boundary Method and Chimera Method applied to Fluid- The numericsacademy Fixed Colloquium IBM on Moving Immersed IBM Boundary Applications Methods : Conclusion Current Status and Future Research Directions 15-17 June 2009, Academy Building, Amsterdam, the

More information

NIA CFD Futures Conference Hampton, VA; August 2012

NIA CFD Futures Conference Hampton, VA; August 2012 Petascale Computing and Similarity Scaling in Turbulence P. K. Yeung Schools of AE, CSE, ME Georgia Tech pk.yeung@ae.gatech.edu NIA CFD Futures Conference Hampton, VA; August 2012 10 2 10 1 10 4 10 5 Supported

More information

Mid-Year Report. Discontinuous Galerkin Euler Equation Solver. Friday, December 14, Andrey Andreyev. Advisor: Dr.

Mid-Year Report. Discontinuous Galerkin Euler Equation Solver. Friday, December 14, Andrey Andreyev. Advisor: Dr. Mid-Year Report Discontinuous Galerkin Euler Equation Solver Friday, December 14, 2012 Andrey Andreyev Advisor: Dr. James Baeder Abstract: The focus of this effort is to produce a two dimensional inviscid,

More information

Generic Refinement and Block Partitioning enabling efficient GPU CFD on Unstructured Grids

Generic Refinement and Block Partitioning enabling efficient GPU CFD on Unstructured Grids Generic Refinement and Block Partitioning enabling efficient GPU CFD on Unstructured Grids Matthieu Lefebvre 1, Jean-Marie Le Gouez 2 1 PhD at Onera, now post-doc at Princeton, department of Geosciences,

More information

Navier-Stokes & Flow Simulation

Navier-Stokes & Flow Simulation Last Time? Navier-Stokes & Flow Simulation Pop Worksheet! Teams of 2. Hand in to Jeramey after we discuss. Sketch the first few frames of a 2D explicit Euler mass-spring simulation for a 2x3 cloth network

More information

A MULTI-DOMAIN ALE ALGORITHM FOR SIMULATING FLOWS INSIDE FREE-PISTON DRIVEN HYPERSONIC TEST FACILITIES

A MULTI-DOMAIN ALE ALGORITHM FOR SIMULATING FLOWS INSIDE FREE-PISTON DRIVEN HYPERSONIC TEST FACILITIES A MULTI-DOMAIN ALE ALGORITHM FOR SIMULATING FLOWS INSIDE FREE-PISTON DRIVEN HYPERSONIC TEST FACILITIES Khalil Bensassi, and Herman Deconinck Von Karman Institute for Fluid Dynamics Aeronautics & Aerospace

More information

Unstructured Mesh Generation for Implicit Moving Geometries and Level Set Applications

Unstructured Mesh Generation for Implicit Moving Geometries and Level Set Applications Unstructured Mesh Generation for Implicit Moving Geometries and Level Set Applications Per-Olof Persson (persson@mit.edu) Department of Mathematics Massachusetts Institute of Technology http://www.mit.edu/

More information

Coping with the Ice Accumulation Problems on Power Transmission Lines

Coping with the Ice Accumulation Problems on Power Transmission Lines Coping with the Ice Accumulation Problems on Power Transmission Lines P.N. Shivakumar 1, J.F.Peters 2, R.Thulasiram 3, and S.H.Lui 1 1 Department of Mathematics 2 Department of Electrical & Computer Engineering

More information

LATTICE-BOLTZMANN AND COMPUTATIONAL FLUID DYNAMICS

LATTICE-BOLTZMANN AND COMPUTATIONAL FLUID DYNAMICS LATTICE-BOLTZMANN AND COMPUTATIONAL FLUID DYNAMICS NAVIER-STOKES EQUATIONS u t + u u + 1 ρ p = Ԧg + ν u u=0 WHAT IS COMPUTATIONAL FLUID DYNAMICS? Branch of Fluid Dynamics which uses computer power to approximate

More information

Numerical Methods for PDEs. SSC Workgroup Meetings Juan J. Alonso October 8, SSC Working Group Meetings, JJA 1

Numerical Methods for PDEs. SSC Workgroup Meetings Juan J. Alonso October 8, SSC Working Group Meetings, JJA 1 Numerical Methods for PDEs SSC Workgroup Meetings Juan J. Alonso October 8, 2001 SSC Working Group Meetings, JJA 1 Overview These notes are meant to be an overview of the various memory access patterns

More information

Why Use the GPU? How to Exploit? New Hardware Features. Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid. Semiconductor trends

Why Use the GPU? How to Exploit? New Hardware Features. Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid. Semiconductor trends Imagine stream processor; Bill Dally, Stanford Connection Machine CM; Thinking Machines Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid Jeffrey Bolz Eitan Grinspun Caltech Ian Farmer

More information

SENSEI / SENSEI-Lite / SENEI-LDC Updates

SENSEI / SENSEI-Lite / SENEI-LDC Updates SENSEI / SENSEI-Lite / SENEI-LDC Updates Chris Roy and Brent Pickering Aerospace and Ocean Engineering Dept. Virginia Tech July 23, 2014 Collaborations with Math Collaboration on the implicit SENSEI-LDC

More information

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,

More information

CGT 581 G Fluids. Overview. Some terms. Some terms

CGT 581 G Fluids. Overview. Some terms. Some terms CGT 581 G Fluids Bedřich Beneš, Ph.D. Purdue University Department of Computer Graphics Technology Overview Some terms Incompressible Navier-Stokes Boundary conditions Lagrange vs. Euler Eulerian approaches

More information

CUDA/OpenGL Fluid Simulation. Nolan Goodnight

CUDA/OpenGL Fluid Simulation. Nolan Goodnight CUDA/OpenGL Fluid Simulation Nolan Goodnight ngoodnight@nvidia.com Document Change History Version Date Responsible Reason for Change 0.1 2/22/07 Nolan Goodnight Initial draft 1.0 4/02/07 Nolan Goodnight

More information

Multigrid Solvers in CFD. David Emerson. Scientific Computing Department STFC Daresbury Laboratory Daresbury, Warrington, WA4 4AD, UK

Multigrid Solvers in CFD. David Emerson. Scientific Computing Department STFC Daresbury Laboratory Daresbury, Warrington, WA4 4AD, UK Multigrid Solvers in CFD David Emerson Scientific Computing Department STFC Daresbury Laboratory Daresbury, Warrington, WA4 4AD, UK david.emerson@stfc.ac.uk 1 Outline Multigrid: general comments Incompressible

More information

1 Past Research and Achievements

1 Past Research and Achievements Parallel Mesh Generation and Adaptation using MAdLib T. K. Sheel MEMA, Universite Catholique de Louvain Batiment Euler, Louvain-La-Neuve, BELGIUM Email: tarun.sheel@uclouvain.be 1 Past Research and Achievements

More information

The Development of a Navier-Stokes Flow Solver with Preconditioning Method on Unstructured Grids

The Development of a Navier-Stokes Flow Solver with Preconditioning Method on Unstructured Grids Proceedings of the International MultiConference of Engineers and Computer Scientists 213 Vol II, IMECS 213, March 13-15, 213, Hong Kong The Development of a Navier-Stokes Flow Solver with Preconditioning

More information

High Scalability of Lattice Boltzmann Simulations with Turbulence Models using Heterogeneous Clusters

High Scalability of Lattice Boltzmann Simulations with Turbulence Models using Heterogeneous Clusters SIAM PP 2014 High Scalability of Lattice Boltzmann Simulations with Turbulence Models using Heterogeneous Clusters C. Riesinger, A. Bakhtiari, M. Schreiber Technische Universität München February 20, 2014

More information

ALE METHODS FOR DETERMINING STATIONARY SOLUTIONS OF METAL FORMING PROCESSES

ALE METHODS FOR DETERMINING STATIONARY SOLUTIONS OF METAL FORMING PROCESSES European Congress on Computational Methods in Applied Sciences and Engineering ECCOMAS 2000 Barcelona, 11-14 September 2000 c ECCOMAS ALE METHODS FOR DETERMINING STATIONARY SOLUTIONS OF METAL FORMING PROCESSES

More information

Computational Fluid Dynamics (CFD) using Graphics Processing Units

Computational Fluid Dynamics (CFD) using Graphics Processing Units Computational Fluid Dynamics (CFD) using Graphics Processing Units Aaron F. Shinn Mechanical Science and Engineering Dept., UIUC Accelerators for Science and Engineering Applications: GPUs and Multicores

More information

The use of gas-kinetic schemes for the simulation of compressible flows become widespread in the two last

The use of gas-kinetic schemes for the simulation of compressible flows become widespread in the two last A Gas-Kinetic BGK Scheme for Parallel Solution of 3-D Viscous Flows on Unstructured Hybrid Grids Murat Ilgaz Defense Industries Research and Development Institute, Ankara, 626, Turkey and Ismail H. Tuncer

More information

Realistic Animation of Fluids

Realistic Animation of Fluids Realistic Animation of Fluids p. 1/2 Realistic Animation of Fluids Nick Foster and Dimitri Metaxas Realistic Animation of Fluids p. 2/2 Overview Problem Statement Previous Work Navier-Stokes Equations

More information

Introduction to the immersed boundary method

Introduction to the immersed boundary method Introduction to the immersed boundary method by Timm Krüger (info@timm-krueger.de), last updated on September 27, 20 Motivation. Hydrodynamics and boundary conditions The incompressible Navier-Stokes equations,

More information

Aeroacoustic computations with a new CFD solver based on the Lattice Boltzmann Method

Aeroacoustic computations with a new CFD solver based on the Lattice Boltzmann Method Aeroacoustic computations with a new CFD solver based on the Lattice Boltzmann Method D. Ricot 1, E. Foquet 2, H. Touil 3, E. Lévêque 3, H. Machrouki 4, F. Chevillotte 5, M. Meldi 6 1: Renault 2: CS 3:

More information

Interaction of Fluid Simulation Based on PhysX Physics Engine. Huibai Wang, Jianfei Wan, Fengquan Zhang

Interaction of Fluid Simulation Based on PhysX Physics Engine. Huibai Wang, Jianfei Wan, Fengquan Zhang 4th International Conference on Sensors, Measurement and Intelligent Materials (ICSMIM 2015) Interaction of Fluid Simulation Based on PhysX Physics Engine Huibai Wang, Jianfei Wan, Fengquan Zhang College

More information

CS GPU and GPGPU Programming Lecture 2: Introduction; GPU Architecture 1. Markus Hadwiger, KAUST

CS GPU and GPGPU Programming Lecture 2: Introduction; GPU Architecture 1. Markus Hadwiger, KAUST CS 380 - GPU and GPGPU Programming Lecture 2: Introduction; GPU Architecture 1 Markus Hadwiger, KAUST Reading Assignment #2 (until Feb. 17) Read (required): GLSL book, chapter 4 (The OpenGL Programmable

More information

SPH: Towards the simulation of wave-body interactions in extreme seas

SPH: Towards the simulation of wave-body interactions in extreme seas SPH: Towards the simulation of wave-body interactions in extreme seas Guillaume Oger, Mathieu Doring, Bertrand Alessandrini, and Pierre Ferrant Fluid Mechanics Laboratory (CNRS UMR6598) Ecole Centrale

More information

PHYSICALLY BASED ANIMATION

PHYSICALLY BASED ANIMATION PHYSICALLY BASED ANIMATION CS148 Introduction to Computer Graphics and Imaging David Hyde August 2 nd, 2016 WHAT IS PHYSICS? the study of everything? WHAT IS COMPUTATION? the study of everything? OUTLINE

More information

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka USE OF FOR Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague Mini workshop on advanced numerical methods

More information

Module 1: Introduction to Finite Difference Method and Fundamentals of CFD Lecture 13: The Lecture deals with:

Module 1: Introduction to Finite Difference Method and Fundamentals of CFD Lecture 13: The Lecture deals with: The Lecture deals with: Some more Suggestions for Improvement of Discretization Schemes Some Non-Trivial Problems with Discretized Equations file:///d /chitra/nptel_phase2/mechanical/cfd/lecture13/13_1.htm[6/20/2012

More information

Performance and Accuracy of Lattice-Boltzmann Kernels on Multi- and Manycore Architectures

Performance and Accuracy of Lattice-Boltzmann Kernels on Multi- and Manycore Architectures Performance and Accuracy of Lattice-Boltzmann Kernels on Multi- and Manycore Architectures Dirk Ribbrock, Markus Geveler, Dominik Göddeke, Stefan Turek Angewandte Mathematik, Technische Universität Dortmund

More information

Shallow Water Simulations on Graphics Hardware

Shallow Water Simulations on Graphics Hardware Shallow Water Simulations on Graphics Hardware Ph.D. Thesis Presentation 2014-06-27 Martin Lilleeng Sætra Outline Introduction Parallel Computing and the GPU Simulating Shallow Water Flow Topics of Thesis

More information

Continuum-Microscopic Models

Continuum-Microscopic Models Scientific Computing and Numerical Analysis Seminar October 1, 2010 Outline Heterogeneous Multiscale Method Adaptive Mesh ad Algorithm Refinement Equation-Free Method Incorporates two scales (length, time

More information

ANNALES DE L I. H. P., SECTION B

ANNALES DE L I. H. P., SECTION B ANNALES DE L I. H. P., SECTION B GARY CHARTRAND FRANK HARARY Planar Permutation Graphs Annales de l I. H. P., section B, tome 3, n o 4 (1967), p. 433438

More information

A Novel Approach to High Speed Collision

A Novel Approach to High Speed Collision A Novel Approach to High Speed Collision Avril Slone University of Greenwich Motivation High Speed Impact Currently a very active research area. Generic projectile- target collision 11 th September 2001.

More information

Level set methods Formulation of Interface Propagation Boundary Value PDE Initial Value PDE Motion in an externally generated velocity field

Level set methods Formulation of Interface Propagation Boundary Value PDE Initial Value PDE Motion in an externally generated velocity field Level Set Methods Overview Level set methods Formulation of Interface Propagation Boundary Value PDE Initial Value PDE Motion in an externally generated velocity field Convection Upwind ddifferencingi

More information

The WENO Method in the Context of Earlier Methods To approximate, in a physically correct way, [3] the solution to a conservation law of the form u t

The WENO Method in the Context of Earlier Methods To approximate, in a physically correct way, [3] the solution to a conservation law of the form u t An implicit WENO scheme for steady-state computation of scalar hyperbolic equations Sigal Gottlieb Mathematics Department University of Massachusetts at Dartmouth 85 Old Westport Road North Dartmouth,

More information

Numerical Methods for Hyperbolic and Kinetic Equations

Numerical Methods for Hyperbolic and Kinetic Equations Numerical Methods for Hyperbolic and Kinetic Equations Organizer: G. Puppo Phenomena characterized by conservation (or balance laws) of physical quantities are modelled by hyperbolic and kinetic equations.

More information

Lagrangian and Eulerian Representations of Fluid Flow: Kinematics and the Equations of Motion

Lagrangian and Eulerian Representations of Fluid Flow: Kinematics and the Equations of Motion Lagrangian and Eulerian Representations of Fluid Flow: Kinematics and the Equations of Motion James F. Price Woods Hole Oceanographic Institution Woods Hole, MA, 02543 July 31, 2006 Summary: This essay

More information

Using a Single Rotating Reference Frame

Using a Single Rotating Reference Frame Tutorial 9. Using a Single Rotating Reference Frame Introduction This tutorial considers the flow within a 2D, axisymmetric, co-rotating disk cavity system. Understanding the behavior of such flows is

More information

Porting a parallel rotor wake simulation to GPGPU accelerators using OpenACC

Porting a parallel rotor wake simulation to GPGPU accelerators using OpenACC DLR.de Chart 1 Porting a parallel rotor wake simulation to GPGPU accelerators using OpenACC Melven Röhrig-Zöllner DLR, Simulations- und Softwaretechnik DLR.de Chart 2 Outline Hardware-Architecture (CPU+GPU)

More information

Example 13 - Shock Tube

Example 13 - Shock Tube Example 13 - Shock Tube Summary This famous experiment is interesting for observing the shock-wave propagation. Moreover, this case uses the representation of perfect gas and compares the different formulations:

More information