Accelerating Fluids Simulation Using SPH and Implementation on GPU


IT Examensarbete 30 hp, December 2015
Accelerating Fluids Simulation Using SPH and Implementation on GPU
Aditya Hendra
Institutionen för informationsteknologi / Department of Information Technology


Abstract

Accelerating Fluids Simulation Using SPH and Implementation on GPU
Aditya Hendra
Teknisk-naturvetenskaplig fakultet, UTH-enheten

Fluid simulation is usually done with CFD methods, which offer high precision but need days, weeks or months to compute on desktop CPUs, which limits their practical use in industrial control systems. To reduce the computation time, the Smoothed Particle Hydrodynamics (SPH) method is used. SPH is commonly used to simulate fluids in the computer graphics field, especially in gaming. It offers faster computation at the cost of lower accuracy. The goal of this work is to determine the feasibility of using the SPH method with GPU parallel programming to provide a fluids simulation that is fast enough for real-time feedback control simulation. A previous master thesis on accelerating fluids simulation using the SPH method was done by Ann Johansson at ABB. Her Matlab implementation on an Intel i7 at 2.4 GHz needs 7089 seconds to compute a water-jet simulation. Our work utilizes GPU parallel programs implemented on top of Fluidsv3, an open-source fluid simulator used as the base code. With CUDA C/C++ and an Nvidia GTX 980, we need 18 seconds to compute a water-jet simulation. Currently, our work lacks a validation method to measure the accuracy of the fluid simulation, and more work needs to be done on this. However, it takes only 80 msec to compute one iteration, which opens an opportunity to use it together with a real-time system, such as a feedback control system, that has a period of 100 msec. This means it could model industrial processes that utilize water, such as the cooling process in a hot rolling mill. The next question, which is not addressed in this study, is how to satisfy application-dependent needs such as simulation accuracy, required parameters, simulation duration in real time, etc.

Handledare: Kateryna Mishchenko
Ämnesgranskare: Stefan Engblom
Examinator: Wang Yi
Tryckt av: Reprocentralen ITC


Acknowledgement

I would like to express my gratitude to ABB's project team, especially my thesis supervisor Kateryna Mishchenko, Markus Lindgren and Lokman Hosain, for their trust and guidance from the start to the end of the thesis work. Furthermore, I would like to thank Rama Hoetzlein for making his work on Fluidsv3 freely available on the web. I also want to thank my thesis reviewer Stefan Engblom from Uppsala University for his valuable help with the writing. My sincere thanks also go to my master programme coordinator at Uppsala University, Philipp Rümmer, who has helped me with many things during my education there. Last but not least, I am very grateful to the Swedish Institute for providing me a full scholarship for my master education in Sweden. Without it, this work would not have been possible.


Contents

Acknowledgement
Acronyms
1 Introduction
  1.1 Project Purpose and Goal
  1.2 Scope
2 Background
  2.1 Fluids Simulation with Smoothed-Particle Hydrodynamics
    2.1.1 Navier-Stokes Equation
    2.1.2 Smoothed Particle Hydrodynamics (SPH)
  2.2 GPU Parallel Programming
    2.2.1 General-Purpose computing on Graphics Processing Units (GPGPU)
    2.2.2 CUDA
3 SPH for Fluids Simulation
  3.1 Acceleration of MATLAB code using GPUs
  3.2 GPU Implementation of Fluids Simulation using the SPH Method
    3.2.1 SPH Implementation with Fluidsv3
    3.2.2 Alternative SPH Implementations
  3.3 Other GPGPU Platforms
4 GPGPU Fluids (Water-Jet) Simulation
  4.1 GPU Performance Benchmarking
  4.2 Water-Jet Model
  4.3 Simulation Modification to Fluidsv3
  4.4 Visual Correctness Modification
    4.4.1 Fluidsv3 Original Boundary Handling
    4.4.2 Ihmsen et al. Boundary Handling
  4.5 Parallel Programming Optimization
    4.5.1 Reduce-Malloc
    4.5.2 Graphics-Interoperability
    4.5.3 GPU Shared-Memory
  4.6 Predictive-Corrective Incompressible SPH (PCISPH)
  4.7 Known Limitations of the Water-Spray Simulation with the SPH Method
    4.7.1 Particles Scaling
    4.7.2 Incompressibility and Simulation Stability
    4.7.3 Initial Speed vs. Time Step
    4.7.4 Obtaining Parameter Values
    4.7.5 Simulation Correctness
5 Discussion
  5.1 Performance Benchmark
  5.2 Unused Optimization Technique
  5.3 Different Simulation Parameter to Simulation Speed
6 Conclusion
  6.1 Accelerated Fluids Simulation
  6.2 GPU Shared-Memory
  6.3 PCISPH
  6.4 Future Works
References
Appendix A: Fluidsv3's Algorithm
Appendix B: Water-Jet Spray's Algorithm

Acronyms

CFD    Computational Fluid Dynamics
CUDA   Compute Unified Device Architecture
GLEW   OpenGL Extension Wrangler Library
GPGPU  General-Purpose computing on Graphics Processing Units
GPU    Graphics Processor Unit
HPC    High Performance Computing
PCISPH Predictive-Corrective Incompressible SPH
SIMD   Single Instruction Multiple Data
SM     Streaming Multiprocessor
SPH    Smoothed Particle Hydrodynamics


1. Introduction

Fluids is a collective term for liquid and gaseous substances, which are common in many industrial applications. At ABB Ltd, optimization of process models related to fluids is frequently done, and hot rolling mill model optimization is one of them. Hot rolling is a metal forming process that produces a very thin sheet of metal while retaining specific metal properties. Figure 1.1 shows a typical hot rolling mill setting. A heated metal block is processed through several rolling mill stands to reach the appropriate thickness before being cooled down in a high-capacity cooler and coiled into a finished product.

Figure 1.1. Hot-Rolling Mill Illustration

One of many things that can affect the quality of hot rolling mill end products is the cooling process. Cooling is the last part of the whole rolling process and is done by the run-out table. A run-out table moves the hot metal sheets under water-spray jets to cool them down to a certain temperature. It is important to better understand how the cooling process affects the metal quality in order to obtain end products with the correct properties. One way to achieve this is a complete computer simulation of the cooling process with a feedback control system. However, a reasonably fast fluids simulation is needed to be able to incorporate it into a real-time control system. This means we have to select computational methods that are fast and provide good results. The first step towards such a computer simulation is an accelerated fluids simulation of a water-spray jet which still has good enough accuracy. Fluids simulation is typically done using Computational Fluid Dynamics (CFD) methods, which require a lot of computation time and therefore are not suitable for our aims. One method that is fast and provides good enough results is Smoothed Particle Hydrodynamics (SPH). SPH is commonly used for fluids simulation in the gaming industry, where computation speed is very important.

SPH is an interpolation or estimation method. It provides an approximation of the numerical equations of fluid dynamics by substituting the fluid with a set of particles. Using SPH, the fluid movement is represented by moving particles. All particle positions need to be computed in each iteration. However, there is no specific order in which the particle positions must be computed: particle A could be computed before particle B, or vice versa, as long as all particle positions have been computed before the next iteration starts. This suits parallel programming implementations quite well, such as General-Purpose computing on Graphics Processing Units (GPGPU), where the particle positions can be computed in parallel.

A GPU was originally used for graphics manipulation and image processing, especially for gaming visualisation. Technology improvements made it possible to exploit it for general computing. The computation power behind a GPU comes from hundreds or thousands of parallel computation cores. Individually, each of these cores is much slower and simpler than a CPU core: an Intel i7 core runs at 4 GHz while an Nvidia GTX 980 core runs at 1.1 GHz. Intel i7 cores are also equipped with many built-in hardware instruction sets that are not available in GPU cores. However, the sheer number of GPU cores makes up for their lower individual speed and can provide a higher throughput than the CPU's fast cores. A common analogy is to say that a CPU is like a Ferrari and a GPU is like a city bus. A bus is much slower than a Ferrari, but it can carry a lot of passengers. Of course, the Ferrari will arrive first, but if we wait until the bus arrives and use that amount of time for the Ferrari to go back and forth to carry more passengers, we see that in that time the bus carries more passengers; in terms of computing performance, the GPU has higher throughput. A Ferrari has more luxurious features than a bus, and likewise a CPU has more advanced instruction sets than a GPU, but if those instruction sets are not used in the simulation then they are not necessary. Importantly, SPH implementations using GPU parallel programming have been used to simulate many fluids applications in [1], [2], [3], [4], [5]. However, for our particular needs we want a specific implementation of a fluids simulation of water-jet sprays using SPH and GPU parallel programming for faster computations.

1.1 Project Purpose and Goal

This thesis work is a continuation of the work done by Ann Johansson at ABB for her master thesis at Uppsala University. In [6], Johansson looked at various methods which are generally suitable for video games and chose SPH as the method to simulate spray water jets in Matlab. The purpose of this master thesis is to serve as a milestone towards a comprehensive computer simulation.

The comprehensive simulation will simulate a complete run-out table in a rolling mill with multiple jets cooling the rolled material, using a feedback control running in real time. The goal of this master thesis is to ascertain the feasibility and drawbacks of using the SPH method with a GPU parallel programming implementation for accelerated fluids simulation that could be used in real-time feedback control simulation. The project can be divided into three problem statements:

1. Is it feasible to use GPU parallel programming to provide a water-spray jet simulation fast enough to be incorporated into a real-time feedback control simulation? If not, what is the best performance we could achieve? What should be done to get a real-time control simulation?
2. What is the tradeoff between simulation speed and accuracy? What are the physics parameters related to this?
3. What are the benefits/drawbacks of using GPUs for fluid simulations, compared to a CPU solution?

This report is divided into six sections. Section 2 discusses the basics of SPH and GPU parallel programming. Section 3 covers the SPH implementation in fluids simulation. Section 4 is devoted to the GPU implementation of the fluid simulation, its optimization results and limitations. Section 5 discusses the simulation results. Section 6 states the conclusions of the work and suggested future work.

1.2 Scope

The scope of this work assumes the following:

1. Any physical model development and improvement is not part of the project.
2. The physics model is provided by ABB and is not the primary contribution of this thesis.
3. We only look into a single-GPU implementation using an Nvidia GTX 980 gaming graphics card.
4. We only do the implementation on a stand-alone workstation using Visual Studio 2010 and 2013 as the IDE.
5. We use an open-source fluids simulation code, Fluidsv3 [7], as our base code. Fluidsv3 is an open-source fluid simulator for the CPU and GPU using the Smoothed Particle Hydrodynamics (SPH) method. Fluidsv3 is written and copyrighted by Rama Hoetzlein with an attribution-zlib license, which grants us the right to use the software for any purpose, including commercial applications, as long as the original author is acknowledged.

2. Background

This section describes the theory and methodology used as the foundation of the project. It describes fluids simulation with Smoothed-Particle Hydrodynamics (SPH) and GPGPU, i.e. Graphics Processor Unit (GPU) parallel programming.

2.1 Fluids Simulation with Smoothed-Particle Hydrodynamics

2.1.1 Navier-Stokes Equation

Commonly, fluids simulations are done using the Navier-Stokes equations, which describe the motion of fluids [8]. The Navier-Stokes equations for an incompressible fluid are:

$$\rho \left( \frac{\partial \mathbf{u}}{\partial t} + \mathbf{u} \cdot \nabla \mathbf{u} \right) = -\nabla p + \mu \nabla^2 \mathbf{u} + \mathbf{f}, \qquad (2.1)$$

$$\nabla \cdot \mathbf{u} = 0, \qquad (2.2)$$

where ρ is the density, u is the velocity, p is the pressure, µ is the viscosity, t is the time and f is the sum of external forces, e.g. gravity. ∇·u is the divergence of the velocity. There are different numerical methods to solve the Navier-Stokes equations, which can be divided into two main categories: grid-based Eulerian methods and mesh-free Lagrangian methods. The grid-based Eulerian methods, such as the finite volume, finite element, and finite difference methods, give high accuracy at the expense of more computation time. The mesh-free or Lagrangian methods, such as Smoothed Particle Hydrodynamics (SPH), need less computation time but give lower accuracy that is hopefully still acceptable.

2.1.2 Smoothed Particle Hydrodynamics (SPH)

In this project the mesh-free SPH method is chosen to compute the fluids simulation. It provides an approximation of the numerical equations of fluid dynamics by substituting the fluid with a set of particles. The method was developed by Gingold and Monaghan [9] in 1977 and independently by Lucy [10] in the same year. Originally, it was used to solve astrophysical problems.

Gingold and Monaghan improved the SPH algorithm to conserve linear and angular momentum, which attests to the similarities between SPH and molecular dynamics [11]. SPH is an interpolation or estimation method. According to it, a scalar physical quantity A of any particle is interpolated at location r by a weighted sum of contributions from all neighbouring particles within a radius h:

$$A_s(\mathbf{r}) = \sum_j m_j \frac{A_j}{\rho_j} W(\mathbf{r} - \mathbf{r}_j, h), \qquad (2.3)$$

where j is the index over all the neighbouring particles, m_j is the mass of particle j, r_j is its position, ρ_j is the density, A_j is the scalar physical quantity of fluid particle j, and W(r − r_j, h) is the smoothing kernel with radius h. The gradient of A is computed as follows:

$$\nabla A_s(\mathbf{r}) = \sum_j m_j \frac{A_j}{\rho_j} \nabla W(\mathbf{r} - \mathbf{r}_j, h), \qquad (2.4)$$

where ∇A_s(r) is the gradient of the scalar A at position r. The second derivative of the function A is:

$$\nabla^2 A_s(\mathbf{r}) = \sum_j m_j \frac{A_j}{\rho_j} \nabla^2 W(\mathbf{r} - \mathbf{r}_j, h), \qquad (2.5)$$

where ∇²A_s(r) is the Laplacian of the scalar A at position r. The derivation of equation (2.3) is given thoroughly in [6], and Kelager gives the derivation steps for equations (2.4) and (2.5) in [12]. We use these equations to evaluate several scalar quantities which affect fluid dynamics, such as density, pressure and viscosity, as they appear in equation (2.1). An elaborate explanation of the application of SPH to fluids simulation is given in [13]. The use of particles guarantees mass conservation, so equation (2.2) can be omitted; moreover, since the particles move with the fluid, the convective term u·∇u in equation (2.1) can be omitted too. In the end, the mesh-free or Lagrangian form of the Navier-Stokes equation for an incompressible fluid takes the form:

$$\rho \frac{\partial \mathbf{u}}{\partial t} = -\nabla p + \mu \nabla^2 \mathbf{u} + \mathbf{f}^{external}, \qquad (2.6)$$

where −∇p is the pressure term, µ∇²u is the viscosity term and f^external is the external force. The sum of these three force density fields determines the movement of the particles. For each particle i, we get:

$$\mathbf{F}_i = -\nabla p_i + \mu \nabla^2 \mathbf{u}_i + \mathbf{f}_i^{external}, \qquad \mathbf{F}_i = \rho \frac{\partial \mathbf{u}}{\partial t}, \qquad \mathbf{a}_i = \frac{\partial \mathbf{u}_i}{\partial t} = \frac{\mathbf{F}_i}{\rho_i}, \qquad (2.7)$$

where u_i is the velocity, F_i is the force density field, ρ_i is the density field, and a_i is the acceleration of particle i, respectively. The pressure term −∇p has the following representation:

$$\mathbf{f}_i^{pressure} = -\nabla p(\mathbf{r}_i) = -\sum_j m_j \frac{p_i + p_j}{2\rho_j} \nabla W(\mathbf{r}_i - \mathbf{r}_j, h), \qquad (2.8)$$

where f_i^pressure is the pressure force for particle i. In [14] it is suggested to compute p as:

$$p = k(\rho - \rho_0), \qquad (2.9)$$

where ρ_0 is the rest density, k is the gas stiffness constant and ρ is derived from equation (2.3) with ρ as the scalar quantity:

$$\rho(i) = \sum_j m_j \frac{\rho_j}{\rho_j} W(\mathbf{r}_i - \mathbf{r}_j, h) = \sum_j m_j W(\mathbf{r}_i - \mathbf{r}_j, h). \qquad (2.10)$$

The viscosity term µ∇²u is:

$$\mathbf{f}_i^{viscosity} = \mu \nabla^2 \mathbf{u}(\mathbf{r}_i) = \mu \sum_j m_j \frac{\mathbf{u}_j - \mathbf{u}_i}{\rho_j} \nabla^2 W(\mathbf{r}_i - \mathbf{r}_j, h), \qquad (2.11)$$

where f_i^viscosity is the viscosity force for particle i. The smoothing kernels W introduced in equations (2.8), (2.10) and (2.11) have the following representations, according to [13]. The smoothing kernel for the density equation (2.10) is

$$W_{poly6}(r, h) = \frac{315}{64\pi h^9} \begin{cases} (h^2 - r^2)^3 & 0 \le r \le h \\ 0 & \text{otherwise,} \end{cases} \qquad (2.12)$$

where h is the smoothing kernel radius and r is the scalar value of r. The smoothing kernel for the pressure equation (2.8) is

$$W_{spiky}(r, h) = \frac{15}{\pi h^6} \begin{cases} (h - r)^3 & 0 \le r \le h \\ 0 & \text{otherwise,} \end{cases}$$

$$\nabla W_{spiky}(r, h) = -\frac{45}{\pi h^6} \frac{\mathbf{r}}{|\mathbf{r}|} \begin{cases} (h - r)^2 & 0 \le r \le h \\ 0 & \text{otherwise,} \end{cases} \qquad (2.13)$$

where for the pressure we use the gradient of the W_spiky smoothing kernel. The smoothing kernel for the viscosity equation (2.11) is

$$W_{viscosity}(r, h) = \frac{15}{2\pi h^3} \begin{cases} -\frac{r^3}{2h^3} + \frac{r^2}{h^2} + \frac{h}{2r} - 1 & 0 \le r \le h \\ 0 & \text{otherwise,} \end{cases}$$

$$\nabla^2 W_{viscosity}(r, h) = \frac{45}{\pi h^6} \begin{cases} (h - r) & 0 \le r \le h \\ 0 & \text{otherwise,} \end{cases} \qquad (2.14)$$

where for the viscosity we use the Laplacian of the W_viscosity smoothing kernel.

Simulating the fluid movement using the SPH method is an iterative process. Each iteration computes an estimate of every particle's position at the next time step. First, the components of F_i, which are f_i^pressure (equation (2.8)), f_i^viscosity (equation (2.11)) and f_i^external (gravity), are computed. Second, the three components are summed to get F_i. Third, F_i is used to compute a_i using equation (2.7). Last, a_i is used to obtain the particle positions by solving the ODE using leap-frog integration. The full procedure of the Navier-Stokes computations using the SPH method for each particle i is stated in Algorithm 1.

Algorithm 1 Navier-Stokes SPH Algorithm

1: Compute the density
$$\rho(i) = \rho(\mathbf{r}_i) = \sum_j m_j W_{poly6}(\mathbf{r}_i - \mathbf{r}_j, h). \qquad (2.15)$$

2: Compute the pressure from the density
$$p_i = p(\mathbf{r}_i) = k(\rho(\mathbf{r}_i) - \rho_0). \qquad (2.16)$$

3: Compute the pressure force from the pressure interaction between neighbouring particles
$$\mathbf{f}_i^{pressure} = -\nabla p(\mathbf{r}_i) = -\sum_j m_j \frac{p_i + p_j}{2\rho_j} \nabla W_{spiky}(\mathbf{r}_i - \mathbf{r}_j, h). \qquad (2.17)$$

4: Compute the viscosity force between neighbouring particles
$$\mathbf{f}_i^{viscosity} = \mu \nabla^2 \mathbf{u}(\mathbf{r}_i) = \mu \sum_j m_j \frac{\mathbf{u}_j - \mathbf{u}_i}{\rho_j} \nabla^2 W_{viscosity}(\mathbf{r}_i - \mathbf{r}_j, h). \qquad (2.18)$$

5: Sum the pressure force, viscosity force and external force (e.g. gravity)
$$\mathbf{F}_i = \mathbf{f}_i^{pressure} + \mathbf{f}_i^{viscosity} + \mathbf{f}_i^{external}. \qquad (2.19)$$

6: Compute the acceleration
$$\mathbf{a}_i = \frac{\mathbf{F}_i}{\rho_i}. \qquad (2.20)$$

7: Solve the ODE using leap-frog integration to get u_t, the velocity at time t, for each particle by using the following steps:

(a) The velocity at time t + ½Δt is computed by
$$\mathbf{u}_{t+\frac{1}{2}\Delta t} = \mathbf{u}_{t-\frac{1}{2}\Delta t} + \Delta t\, \mathbf{a}_t, \qquad (2.21)$$

(b) then the position at time t + Δt is computed by
$$\mathbf{r}_{t+\Delta t} = \mathbf{r}_t + \Delta t\, \mathbf{u}_{t+\frac{1}{2}\Delta t}. \qquad (2.22)$$

(c) The scheme is started from the initial velocity u_0 by
$$\mathbf{u}_{-\frac{1}{2}\Delta t} = \mathbf{u}_0 - \tfrac{1}{2}\Delta t\, \mathbf{a}_0, \qquad (2.23)$$

(d) and u_t is approximated by the average
$$\mathbf{u}_t \approx \frac{\mathbf{u}_{t-\frac{1}{2}\Delta t} + \mathbf{u}_{t+\frac{1}{2}\Delta t}}{2}, \qquad (2.24)$$

where Δt is the integration time step.
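To make the per-particle structure of Algorithm 1 concrete, the following CUDA sketch implements steps 1 and 2 (density and pressure) with one thread per particle. It is a minimal illustration written for this report, not Fluidsv3's actual kernel: it loops over all particles instead of using a neighbour grid, and the type and parameter names (Particle, SimParams) are assumptions.

#include <cuda_runtime.h>

// Minimal sketch of steps 1-2 of Algorithm 1: one thread per particle,
// brute-force neighbour loop, poly6 kernel (2.12) for the density.
struct Particle { float3 pos; float density; float pressure; };
struct SimParams { float h; float mass; float restDensity; float stiffness; int numParticles; };

__global__ void computeDensityPressure(Particle* p, SimParams params)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;        // eq. (2.25)
    if (i >= params.numParticles) return;

    float h2 = params.h * params.h;
    float poly6 = 315.0f / (64.0f * 3.14159265f * powf(params.h, 9.0f));

    float density = 0.0f;
    for (int j = 0; j < params.numParticles; ++j) {       // brute force; Fluidsv3 uses a spatial grid
        float3 d = make_float3(p[i].pos.x - p[j].pos.x,
                               p[i].pos.y - p[j].pos.y,
                               p[i].pos.z - p[j].pos.z);
        float r2 = d.x * d.x + d.y * d.y + d.z * d.z;
        if (r2 < h2) {
            float diff = h2 - r2;
            density += params.mass * poly6 * diff * diff * diff;   // eq. (2.15)
        }
    }
    p[i].density  = density;
    p[i].pressure = params.stiffness * (density - params.restDensity);  // eq. (2.16)
}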

2.2 GPU Parallel Programming

This section discusses how the GPU became prevalent in parallel programming and the high-level technology behind it.

2.2.1 General-Purpose computing on Graphics Processing Units (GPGPU)

Originally, the GPU was intensively used in the gaming industry due to its high throughput. Each of the GPU's cores processes data for different pixels in parallel. GPUs keep evolving to have more and faster computation cores. This attracts applications in different industries which rely on repetitive and parallel computations. GPU utilization for non-graphical applications started in 2003 by adopting the already existing high-level shading languages such as DirectX, OpenGL and Cg [15]. However, this approach has several shortcomings. Firstly, the users need extensive knowledge about computer graphics APIs and the GPU architecture. Secondly, the computation problem has to be represented in terms of vertex coordinates, textures and shaders. Thirdly, some basic programming features such as random memory reads and writes were not available. Lastly, no GPU at that time could provide double precision floating point, which is important for some applications. To address these shortcomings, new programming frameworks such as Compute Unified Device Architecture (CUDA) from Nvidia and OpenCL from the Khronos Group were developed. Their main goal is to provide more suitable tools to harness the GPU's potential for general applications. These new frameworks make GPU programming a real general-purpose programming language.

2.2.2 CUDA

In 2007, Nvidia launched a new GPGPU framework designed for general-purpose programming, known as CUDA [16]. CUDA is a framework that gives programmers the possibility to run general-purpose parallel-programming applications written in C, C++, Fortran, or OpenCL on CUDA-capable Nvidia GPUs. The CUDA toolkit accommodates a comprehensive software development environment in the C or C++ languages, which makes it possible to adopt almost all C or C++ language capabilities. In CUDA, every parallel program is written in a special function called a kernel. The CUDA kernel is called from the CPU (host) and executed on the GPU (device). Kernel functions are executed by a set of threads in a Single Instruction Multiple Data (SIMD) fashion.

SIMD enables a single instruction to be executed by many processing threads at the same time. Each thread handles its own input and output for the computation. Each kernel execution creates a logical structure of threads in a hierarchical order. It starts with a thread grid consisting of many thread blocks, while each block consists of many threads. The number of blocks and threads needs to be defined before kernel execution. Threads, thread blocks and thread grids and their corresponding hierarchical memory structure are presented in Figure 2.1. Each thread has its own local memory, which is the GPU register file, and threads in a thread block can access their block's shared memory. Threads in each grid that execute the same kernel function have access to the application's global memory.

Figure 2.1. Hierarchy of threads, thread blocks and thread grids with the corresponding memory allocation in CUDA [15]

CUDA provides a unique ID for each thread within its block and a unique ID for each block within its grid. Additionally, a unique grid ID is provided when more than one grid is created at the same time. It is possible to access an arbitrary thread within each grid by combining the block ID and thread ID. For example:

index = (blockIdx.x * blockDim.x) + threadIdx.x.   (2.25)
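As an illustration of how the index in (2.25) is used, the generic kernel below (not taken from Fluidsv3) processes one array element per thread; the guard against index >= n is needed because the grid size is rounded up to whole blocks.

// Generic CUDA indexing example: each thread handles one array element.
__global__ void scaleArray(float* data, float factor, int n)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;  // eq. (2.25)
    if (index < n)                                      // guard: the grid may cover more than n elements
        data[index] *= factor;
}

// Host-side launch: choose a block size that is a multiple of the warp size (32)
// and round the number of blocks up so that all n elements are covered.
void launchScale(float* d_data, float factor, int n)
{
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleArray<<<blocks, threadsPerBlock>>>(d_data, factor, n);
}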

The variable index is used to access a unique thread index in each block within one grid.

In each GPU there are many computation cores; Nvidia calls them CUDA cores and groups them into sets called Streaming Multiprocessors (SM). In the SIMD model a single instruction is executed by many processing threads; this is done in a subset of an SM called a warp, which consists of 32 CUDA threads. This number comes from Nvidia's existing GPU architecture design. All threads in a block are placed into warps. Every thread in a warp has a sequential index, so the first 32 threads, with indices 0 to 31, will be in the same warp, the next 32 threads will be in the next warp, and so on. Warps are executed asynchronously with respect to each other. Each thread in a warp is executed in parallel at the same time, so if one thread delays its execution then the whole warp is delayed. For this reason it is good practice to have the threads in a warp access consecutive memory addresses, so that GPU global-memory accesses are reduced by utilizing the coalesced data loaded from it. It is common practice to use a multiple of 32 as the number of threads in a block. If we assign 16 threads to a block, then each block is still counted as one warp of 32 threads with 16 threads inactive, which reduces the overall throughput. Each Nvidia GPU architecture has its own hardware limitations that can affect the choice of the appropriate number of threads in a block. The limitations we usually look into are:

- the maximum number of threads in a block
- the maximum number of active blocks an SM can handle at the same time
- the allocated shared memory in an SM
- the allocated register size in an SM
- the number of CUDA cores in an SM.

Shared memory is a block of memory that has lower latency than the GPU's global memory. The amount is limited per SM and is usually used to store data which are repeatedly used by the threads in a block, so that the threads do not need to access global memory, which has higher latency. The amount of shared memory assigned to a block affects the number of active blocks in an SM. For example, if we allocate 128 threads per block and want to use 256 bytes of shared memory for each block, then the number of active blocks per SM is

active_blocks = total_shared_memory / 256,   (2.26)

because we have to divide the available shared memory among all active blocks. If the number of active blocks is small, this can result in a small number of active warps that an SM handles at the same time.

The warp scheduler will execute a warp that has all of its computation data available for every thread in that warp. If one thread in a warp is still waiting for data, then that warp will not be executed and will be scheduled again later. The GPU's high throughput comes from utilizing all computation cores at the same time. Therefore, we want a sufficient number of active blocks, or more precisely warps, in an SM, so that when one warp is not ready there are other warps ready to be executed. In short, to maximize performance we want to keep the CUDA cores in an SM as busy as possible instead of idle. CUDA cores are the computation engines. Each executing thread in a warp is processed by a core. Each SM has a limited number of cores; therefore there is a limited number of warps being processed at the same time within an SM. A GPU with more and faster cores gives higher throughput. In this project the Nvidia GTX 980 gaming graphics card is used. Compared to the most advanced Nvidia High Performance Computing (HPC) GPU card, the Nvidia Tesla K80, the GTX 980 costs just 10-15% of its price. The Nvidia GTX 980's peak processing power is 4612 GFLOPS (Giga Floating Point Operations Per Second) for single precision and 144 GFLOPS for double precision, while the Nvidia Tesla K80's peak processing power is 8740 GFLOPS for single precision and 2910 GFLOPS for double precision. Figure 2.2 shows the diagram of one Nvidia GTX 980 Streaming Multiprocessor (SM). The Nvidia GTX 980 has 16 SMs, each with 128 CUDA cores distributed over 4 warp schedulers, and 96 KB of shared memory.

Figure 2.2. The Nvidia GTX 980 is codenamed Maxwell, so its Streaming Multiprocessor (SM) is called SMM. Block diagram of an SMM [17]
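To relate the hardware limits above to equation (2.26) in practice, the CUDA runtime can report how many blocks of a given kernel fit on one SM for a chosen block size and shared-memory allocation. The sketch below is a generic illustration and assumes a trivial kernel named myKernel; it is not part of Fluidsv3.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float* data) { if (data) data[threadIdx.x] = 0.0f; }  // placeholder kernel

// Query how many blocks of myKernel can be resident on one SM given the block
// size and the dynamic shared memory requested per block (cf. eq. (2.26)).
void reportOccupancy()
{
    int activeBlocksPerSM = 0;
    int threadsPerBlock = 128;
    size_t dynamicSharedMem = 256;   // bytes of shared memory requested per block

    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&activeBlocksPerSM, myKernel,
                                                  threadsPerBlock, dynamicSharedMem);

    printf("Active blocks per SM: %d (%d resident warps)\n",
           activeBlocksPerSM, activeBlocksPerSM * threadsPerBlock / 32);
}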

3. SPH for Fluids Simulation

This section discusses one possible implementation of SPH for fluids simulation using Matlab. The motivation for using a different implementation platform for SPH-based fluids simulation is given as well.

3.1 Acceleration of MATLAB code using GPUs

One of ABB's in-house implementations of SPH is a modification of a master thesis work on SPH [6]. It is written in Matlab, which uses the CPU's computational power. For this acceleration effort we use an Intel i7 quad-core at 4 GHz with 32 GB of RAM and MATLAB R2015a. According to the Matlab guidelines [18], there are several methods by which GPU computational power can be used in Matlab in order to reduce the computation time. The first method is to allocate the variables with a high computation cost on the GPU using the gpuArray command and let Matlab manage the rest. This approach needs 14x more computation time than ABB's in-house implementation. This is due to the inefficient code structure and the frequent memory copying between GPU and CPU at each iteration. The second method is to collect all computationally intensive pieces of the code into a MATLAB function and run this function on the GPU. Unfortunately, there are some limitations on the structure and contents of the code. Firstly, the code should use only Matlab functions that are implemented on the GPU [18]. Secondly, the code should not use index operations and all computations should be done element-wise. With these restrictions, only a few lines of code can be run on the GPU, resulting in a result similar to the previous method. The third method is to call CUDA kernel files directly from Matlab. However, this method does not provide any benefit in terms of effort: in order to fully benefit from the effort of writing CUDA kernels, the whole implementation should be in a C/C++ environment. Therefore, we do not attempt this method. The fourth method is code refactoring. The code branches and loops are replaced by matrix operations where possible. Table 3.1 shows the result of code refactoring when computing 500 iterations of the fluids simulation. The speed improvement is significant, but it is not fast enough to be used in real-time feedback control systems.

There are three main Matlab functions in the implementation: DoubleDensityRelax, ViscousImpulse, and TemperatureCompute.

Table 3.1 Matlab Refactored Benchmark for 500 iterations

| Matlab Function    | Original Execution Time (s) | Refactored Execution Time (s) | Speed Improvement |
| DoubleDensityRelax |                             |                               | 14x               |
| TemperatureCompute |                             |                               | 14x               |
| ViscousImpulse     |                             |                               | no improvement    |

As shown in Table 3.1, the code refactoring improves the speed of computing DoubleDensityRelax and TemperatureCompute by 14x. There is no improvement for ViscousImpulse, because its loop involves a sequential process where each iteration uses the previous iteration's data. To animate the fluid, another piece of Matlab code dealing with visualization is used. This code adds another 6387 seconds, with no possibility of refactoring because it consists of one for-loop. Altogether this gives an average of 30.5 seconds per iteration over the 500 simulated iterations. This is obviously longer than the period of a real-time feedback control system, which is less than a second. Therefore, the Matlab implementation is not suitable and a different approach is required.

3.2 GPU Implementation of Fluids Simulation using the SPH Method

Section 2.2 describes the motivation for using the GPU's high throughput to solve computationally extensive problems, specifically problems with high arithmetic intensity that can be reformulated as parallel computation code. An arithmetic problem that is associative and commutative is one example of an ideal problem for a GPU such as Nvidia's, because the computation can be done in any order, which is not the case for subtraction or division operations. In SPH, the fluid is approximated using a number of particles. In each iteration every particle's position is computed based on its distance to its neighbouring particles. However, there is no order in which the particles must be computed. Particle A can be computed first or simultaneously with particle B without changing the result, as long as all particles have been computed before the next iteration starts. This means that a multi-core parallel programming framework can be used to compute the particles simultaneously. Some of the first implementations of SPH on the GPU, using OpenGL and Cg, were done in [19] and [20]. Yan et al. [1] gave an alternative SPH algorithm that uses a non-uniform particle model, which means the particles can split or merge. It incorporates an adaptive surface tension model for better stability.

After Nvidia launched CUDA, fluids simulation using the SPH method on GPUs became more common, with promising speed-up results compared to the CPU. Hérault et al. [2] divided their SPH implementation into three parts: neighbour-list construction, force computation, and Euler integration. The speed-up is different for each part: force computation had the largest speed-up of 207 times compared to the CPU, while neighbour-list construction was 15.1 times and Euler integration 23.8 times faster. Gao et al. [21] measured the simulation frame rate improvement in their work and managed to get a speed-up of nearly 140 times in their particle simulation.

3.2.1 SPH Implementation with Fluidsv3

Based on the literature research we conclude that a GPU implementation is more efficient than a CPU implementation for fluids simulation with SPH, especially in terms of simulation speed. To shorten the learning curve, we decided to use Fluidsv3, an open-source SPH fluids simulation. The Fluidsv3 [7] source code is provided as is and simulates ocean waves using the SPH method. The code is written in C/C++ with the CUDA Toolkit. We changed the implementation code for our needs and optimized it to reduce the simulation time, which is the main topic of Section 4.

3.2.2 Alternative SPH Implementations

During the implementation phase, we found several other SPH implementations using a GPGPU framework that are of interest and could be investigated in the future:

- The OpenWorm project [4], an open-source project which aims to create a virtual C. elegans nematode in a computer using OpenCL.
- The DualSPHysics research group [5], a research group of several universities in Europe. They provide a CPU and GPU (CUDA) implementation of SPH for research and applications in fluids simulations.
- Other computer graphics research groups at ETH Zürich [22] and the University of Freiburg [23] have done research and provided different solutions using SPH methods.

3.3 Other GPGPU Platforms

The two biggest GPU producers are Nvidia and AMD. Using the Fluidsv3 code means using the CUDA toolkit, and using the CUDA toolkit means we have to use Nvidia GPUs. Nevertheless, this is not the only way to implement GPU parallel programming. OpenCL supports more heterogeneous parallel programming hardware (CPU, GPU, FPGA, etc.) as long as it has a multi-core structure.

It is an open question whether to choose CUDA, OpenCL or another GPU parallel programming language; this is not discussed in this work.

4. GPGPU Fluids (Water-Jet) Simulation

This section describes the modification of the original Fluidsv3 implementation and our efforts to further improve its performance. An overview of the original Fluidsv3 algorithm is presented in Appendix A.

4.1 GPU Performance Benchmarking

To improve the simulation speed, the GPU's performance during simulation is analyzed. The results are used to look for parts of the implementation that can be improved to achieve faster simulations. There are two tools from Nvidia that can be used to analyse the performance of a CUDA program: Nvidia Nsight Performance Analysis and Nvidia Visual Profiler [24], [25]. Both tools offer similar performance analysis capabilities but have some differences. Nsight Performance Analysis is a plug-in for Microsoft Visual Studio or Eclipse and can only be executed from either of these two IDEs. Nvidia Visual Profiler is a stand-alone tool that can be used to analyse an executable file regardless of the IDE used. Another difference is that Nvidia Visual Profiler offers a Guided Analysis mode that automatically finds the CUDA kernels of the program to be optimized and suggests how to optimize them. Nsight Performance Analysis offers a more elaborate benchmark of the application. It analyses not only the CUDA kernels but also the program's interaction with system APIs and visualization APIs such as DirectX and OpenGL. We use Nsight Performance Analysis because of its elaborate benchmarks and because Nvidia Visual Profiler's Guided Analysis mode is too general for our problem. Nsight Performance Analysis has two main benchmark tools: Trace and Profiler-CUDA. Trace is used to measure the GPU utilization and OpenGL activity, while Profiler-CUDA is used to get the performance report of the CUDA kernels executed in the application. More information about both tools is found in [24] and [25]. The experiment under consideration is the computation of fluid particles which simulate a water-jet spray. The experiment starts with no particles and is terminated when the number of particles reaches 1 million. There are three measurements that we include in the report:

1. GPU Utilization: the percentage of the overall benchmark time during which the GPU was utilized. A higher value means the GPU is idling less during the simulation. This information is provided by Nsight Performance Analysis - Trace.
2. GFLOPS: a weighted sum of all executed single precision operations per second. A higher value means that the GPU spends more time in computation, which also means that the algorithm uses the GPU's computation power more efficiently. The Nvidia GTX 980 has a maximum computation power of 4612 GFLOPS for single precision operations. This information is provided by Nsight Performance Analysis - Profiler-CUDA.
3. Iteration Time: the time consumed to compute one simulation iteration, in milliseconds. A lower value means more computation iterations per second, which also means a faster simulation. A low iteration time is important for our optimization method. This information is provided by Fluidsv3's internal timing function.

4.2 Water-Jet Model

This work is about a specific case of accelerated fluids simulation: a water-jet spray. This model is chosen because of its use in simulating the coolant in hot rolling mills, which is one of ABB's areas of interest. In the real world water-jet sprays have many applications, such as fountains, shower sprays, cleaning sprays, fire extinguishers, etc. For hot rolling mill cooling, pure water with adjustable velocity and volume is sprayed from many inlets onto the hot metal surface in order to cool it down. To simulate this, several parameters are introduced: the initial speed to adjust the velocity, the inlet radius to adjust the volume, and the number of inlets to include in the simulation. In order to simulate the flow of the water-jet spray, a new layer of particles is added at the initial position in each time step. The shape and number of particles used as a new layer are constant and based on the simulation configuration. The layers gradually build up and simulate the flow of water computed by the SPH method. Although in the real world the water could be sprayed in the opposite direction, against gravity, this work only simulates water sprayed in the direction of gravity.

4.3 Simulation Modification to Fluidsv3

Figure 4.1 shows Fluidsv3's original application, a continuous ocean-wave simulation. The simulation model under consideration is the water-jet spray, and in order to obtain that visualization the following modifications are required:

Figure 4.1. Fluidsv3 continuous ocean-wave simulation

- Particle initialization: the number of particles and the particle positions. Fluidsv3 initializes all fluid particles at the beginning of the simulation, while we add particles layer by layer at every time step to simulate a sprayed fluid (a sketch of this per-time-step injection is given after this list). Figure 4.2 shows zoomed-in layers of particles.

Figure 4.2. Three layers of initialized particles

- Simulation boundary. Fluidsv3 simulates a continuous ocean wave in a box-shaped simulation space. The water-jet spray model under consideration has a single solid boundary at the bottom.
- Reusing deleted particles' memory. If particles bounce outside the simulation area they are deleted, and the memory space they used is cleared so that it can be used for new particles.
- Changing the particles' initial speed. Our fluids simulation is a water-jet spray which injects fluid from top to bottom along the Y-axis. The initialized particles are given an initial speed.

In summary, Fluidsv3's simulation parameters are modified according to Table 6.3 in the Appendix.
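The per-time-step injection mentioned in the first item can be sketched as follows. The emitter structure, the regular sampling of the inlet disc and the function names are illustrative assumptions made for this report and do not reproduce Fluidsv3's actual emitter code.

#include <cuda_runtime.h>

// Sketch of per-time-step particle injection: each time step, one new layer of
// particles is created at the inlet position with the configured initial speed.
struct Emitter {
    float3 inletCenter;   // position of the inlet
    float  inletRadius;   // controls the injected volume
    float  spacing;       // distance between particles within the layer
    float3 initialVel;    // e.g. (0, -initialSpeed, 0): sprayed along gravity
};

void emitLayer(const Emitter& e,
               void (*addParticle)(float3 pos, float3 vel))  // adds or reuses a particle slot
{
    // Sample the inlet disc on a regular grid; points outside the radius are skipped.
    for (float x = -e.inletRadius; x <= e.inletRadius; x += e.spacing) {
        for (float z = -e.inletRadius; z <= e.inletRadius; z += e.spacing) {
            if (x * x + z * z > e.inletRadius * e.inletRadius) continue;
            float3 pos = make_float3(e.inletCenter.x + x,
                                     e.inletCenter.y,
                                     e.inletCenter.z + z);
            addParticle(pos, e.initialVel);   // can reuse memory of deleted particles
        }
    }
}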

4.4 Visual Correctness Modification

Figure 4.3 shows the result of the modification: four inlets of water-spray jets hitting a solid surface simultaneously.

Figure 4.3. The upper view of the original boundary handling method

The first step is to verify the correctness of the model from the physical standpoint. Figure 4.4 shows the bottom side of the simulation model, where the fluid is penetrating the boundary. This is not physically correct, since particles are penetrating through the surface, which represents the steel strip (not shown).

Figure 4.4. Hidden boundary particle diameter set to 0.03 (simulation units)

4.4.1 Fluidsv3 Original Boundary Handling

Fluidsv3's original boundary handling uses hidden boundary particles. The hidden particles' diameter acts as a damping area that brakes the fluid movement. Fluid particles receive an opposing force whenever they enter this area. In other words, particles are repelled when they penetrate the hidden boundary particles: the deeper the penetration, the stronger the opposing force. Equation (2.22) shows that the particle positions at time step t + Δt depend on the velocity at time step t + ½Δt, which is highly affected by the initial velocity u_0 in equation (2.23). As the iterations go on, the value of u_0 accumulates in u_{t+½Δt} and hence affects how far the particles move in each iteration, which in the end decides whether the particles penetrate the solid surface boundary or not. An experiment using u_0 of 20 m/s and a hidden particle diameter of 0.03 x (simscale value) meter results in some particle positions being computed almost beyond the damping area; thus the opposing force is not strong enough and particles leak through the surface. Figure 4.4 shows the fluid passing the solid surface (not drawn). A simple fix for this case is to use a bigger hidden particle diameter. Using a hidden particle diameter of 0.06 in simulation units prevents the particles from penetrating the surface. Figure 4.5 shows that the particles do not penetrate the solid boundary when we use the bigger hidden particle diameter. The side penetration occurs because those fluid particles are beyond the boundary area. A bigger hidden particle diameter provides a bigger braking area, so the particles have more space to reduce their velocity and there is no boundary penetration.

Figure 4.5. Hidden boundary particle diameter set to 0.06 (simulation units)

4.4.2 Ihmsen et al. Boundary Handling

The solution above depends on many parameters, such as the initialization speed or the gas stiffness constant.

For example, if we want to use an initial speed of 60 m/s, which might be too high for a water jet, an even bigger hidden particle diameter is required to prevent particles from penetrating the boundary. Clearly, a better boundary handling mechanism can improve the simulation. After a literature review, we decided to try a new boundary handling method proposed by Ihmsen et al. [26]. This method was originally used with PCISPH [27] but can also be used with SPH. Figure 4.6 shows that this method gives a fluid splashing or repelling effect, which is usually present when a high-speed fluid hits a solid surface, whereas Figure 4.3 does not show any repelling effect. Figure 4.7 shows that the particles do not penetrate the boundary even though a hidden particle diameter of 0.03 (simulation units) and an initial speed of 60 m/s are used.

Figure 4.6. The upper view of the Ihmsen et al. boundary handling

Figure 4.7. The side view of the Ihmsen et al. boundary handling

Observing the visualization results and the ability to use a wider range of initial speeds or gas stiffness constants, it is concluded that the Ihmsen et al. boundary method is better than Fluidsv3's original method in terms of the physical correctness of the simulation. Thus, this method is used for the further code optimization.

4.5 Parallel Programming Optimization

We use the Nsight Performance Analysis tool mentioned in Section 4.1 to gauge the GPU's performance in our simulation and make the simulation run faster. The unoptimized water-jet simulation has a long iteration time and a GPU utilization of only 36%, which means that most of the GPU time is spent idling, waiting for its turn. In order to improve the GPU utilization and reduce the computation time per iteration, the following methods were implemented one by one.

Reduce-Malloc

Nsight Performance Analysis shows that most of the GPU time is spent waiting for cudaMalloc and cudaMemcpy to finish their operations, making these operations the bottleneck. The Nvidia GTX 980 has 4 GB of RAM which, based on Fluidsv3's data structure, is big enough to store the data of 22 million particles. By allocating GPU memory for 22 million particles at the beginning of the simulation, cudaMalloc is only called once. The cudaMemcpy operations are also called only once, to copy the initialization particle data from the host (CPU) to the device (GPU). This results in an improvement of the GPU utilization to 75.6% and a reduction of the iteration time.

Graphics-Interoperability

Fluidsv3 uses OpenGL to render the animation, and during this process it needs to copy the data from the computation memory to the OpenGL memory. In [28], graphics interoperability means that the same GPU memory buffer is used to store data for rendering and for computation results. This method removes the need to copy data from the computation memory to the rendering memory; a minimal sketch of this buffer sharing is given below. Using this approach, the GPU utilization is improved to 87.2% and the iteration time is reduced further.
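The sketch below shows the typical CUDA-OpenGL buffer-sharing pattern behind this optimization. The buffer and function names are illustrative and error checking is omitted; it shows the general mechanism rather than Fluidsv3's exact code.

#include <cuda_gl_interop.h>

// The OpenGL vertex buffer that is rendered is registered once with CUDA, then
// mapped each iteration so the kernels write particle positions directly into it.
cudaGraphicsResource* particleVboResource = nullptr;

void registerParticleBuffer(GLuint particleVbo)           // done once at start-up
{
    cudaGraphicsGLRegisterBuffer(&particleVboResource, particleVbo,
                                 cudaGraphicsMapFlagsWriteDiscard);
}

void runIteration()
{
    float3* d_positions = nullptr;
    size_t  numBytes = 0;

    cudaGraphicsMapResources(1, &particleVboResource, 0);
    cudaGraphicsResourceGetMappedPointer((void**)&d_positions, &numBytes,
                                         particleVboResource);

    // ... launch the SPH kernels, writing the updated positions to d_positions ...

    cudaGraphicsUnmapResources(1, &particleVboResource, 0);
    // OpenGL can now render the buffer without any extra device or host copy.
}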

GPU Shared-Memory

Nsight Performance Analysis also shows that there are two CUDA kernel functions that consume most of the GPU computation resources: computePressure (19%) and computeForce (64%). Improving the computation speed of these functions should improve the total simulation speed. For the improvement and benchmark try-out we chose the kernel function computePressure because it has a simpler code structure than computeForce. Figure 4.8 shows the warp issue efficiency for the computePressure kernel function, which is the GPU's ability to issue instructions during the kernel's execution cycles [24]. Instructions are executed in a warp, which is a set of threads. A higher percentage means that more instructions are executed, but for computePressure 45.97% of all computation cycles do not issue any instruction. Figure 4.9 shows that memory dependency, the condition where a memory store or load cannot be done because resources are either not available or fully utilized, is the main cause of the kernel's inability to issue instructions. The question is then: which operations are causing this memory dependency problem?

Figure 4.8. Warp issue efficiency for the computePressure CUDA kernel

Of the variety of data that the computePressure kernel needs to retrieve from the GPU's global memory, particle position data are requested most frequently. The kernel needs to compute every particle's distance to its neighbouring particles. The more neighbours a particle has, the more data accesses the kernel performs for that particle. All particles used in the SPH method are placed in a grid of equally sized cubes, as mentioned in Appendix A. Figure 4.10 restates Figure 6.2 in the Appendix, which depicts a 3x3x3 grid structure where each grid cell is represented by a coloured circle. Each grid cell contains a different number of particles.

Figure 4.9. Issue stall reasons for the computePressure CUDA kernel

Figure 4.10. Grid neighbours. The yellow circle represents the centre grid cell and the rest are the neighbouring cells. The dotted lines are for visualization purposes only and do not represent any information.

All particles in the centre cell, the yellow circle, have the same neighbouring particles located in the neighbouring cells. Therefore, all particles in the centre cell request the same neighbouring particle positions, which results in redundant data requests to the GPU's global memory. If we can store these often-requested data in a place closer than the GPU's global memory, then we can provide the data to the CUDA kernel faster. As a result, this improves the warp efficiency and allows more computations per cycle [29]. This closer memory is called GPU shared memory. All Nvidia GPUs that support CUDA have shared memory. It is a memory that resides on the GPU chip itself and has roughly 100 times lower latency than uncached GPU global memory, which resides outside the GPU chip, provided that there are no bank conflicts between the threads [30]. Goswami et al. [31] used shared memory for their SPH fluids simulation on an Nvidia GPU. Although that paper did not compare the results of using versus not using shared memory, it was decided to use shared memory in the SPH computations. The goal is to investigate whether it is possible to achieve any further computational improvements, since our implementation differs from Goswami's. In order to reduce the data traffic to global memory, all the neighbouring particles' data are placed in GPU shared memory and the centre cell's particles access the shared memory; a simplified sketch of this staging is given below. However, this can only be done by introducing extra lines of code to copy the required data into shared memory. Additionally, it is important to keep track of the memory offsets in such a way that each centre-cell particle accesses the correct shared-memory offset. Accessing the wrong offset results in an unstable or wrong simulation, or even run-time errors. Our project uses the Nvidia GTX 980, which has 96 KB of shared memory per streaming multiprocessor (SM), where each SM can handle 32 thread blocks simultaneously. A particle's position is a float3 vector, so each position requires 12 bytes of memory. With the limited memory space we need extra code to repeatedly re-stage the necessary particle data. The benchmark shows that the GPU utilization goes up to 95%, but the iteration time increases. The number of operations of the computePressure kernel function drops to 85 GFLOPS from 747 GFLOPS. This is caused by the extra code, which is not parallelized and introduces many sequential loops. The extra code makes the GPU busier, hence the increase of the GPU utilization to 95%. Nevertheless, since this code is not executed in parallel, the number of operations drops to 85 GFLOPS.
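The fragment below sketches the staging idea: a thread block cooperatively loads a tile of a neighbouring cell's particle positions into shared memory before all of its threads reuse them. The tile size, the names, and the assumption that one thread block handles one grid cell and one neighbouring cell at a time are simplifications made for this report and do not match our implementation's exact data layout.

#include <cuda_runtime.h>

#define TILE_SIZE 128

// Simplified sketch: stage a neighbouring cell's positions in shared memory,
// then let every thread of the block reuse the staged data for its density sum.
__global__ void densityWithSharedMemory(const float3* positions, int numParticles,
                                        const int* cellStart, const int* cellCount,
                                        float* density, int neighbourCell,
                                        float h2, float mass, float poly6Coeff)
{
    __shared__ float3 sharedPos[TILE_SIZE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    bool active = (i < numParticles);
    float3 pi = active ? positions[i] : make_float3(0.0f, 0.0f, 0.0f);
    float rho = 0.0f;

    int start = cellStart[neighbourCell];
    int count = cellCount[neighbourCell];

    // Process the neighbouring cell in tiles that fit in shared memory.
    for (int base = 0; base < count; base += TILE_SIZE) {
        int loadIdx = base + threadIdx.x;
        if (threadIdx.x < TILE_SIZE && loadIdx < count)
            sharedPos[threadIdx.x] = positions[start + loadIdx];   // one coalesced load per tile
        __syncthreads();

        int tileCount = min(TILE_SIZE, count - base);
        if (active) {
            for (int j = 0; j < tileCount; ++j) {                  // reuse data loaded by the block
                float dx = pi.x - sharedPos[j].x;
                float dy = pi.y - sharedPos[j].y;
                float dz = pi.z - sharedPos[j].z;
                float r2 = dx * dx + dy * dy + dz * dz;
                if (r2 < h2) {
                    float diff = h2 - r2;
                    rho += mass * poly6Coeff * diff * diff * diff;
                }
            }
        }
        __syncthreads();
    }
    if (active) density[i] += rho;
}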

4.6 Predictive-Corrective Incompressible SPH (PCISPH)

In [7], Fluidsv3's author suggests PCISPH, a different SPH method, to improve the fluids simulation speed. It is an SPH method that enforces fluid incompressibility [27]. The PCISPH method predicts density fluctuations, which are then corrected using pressure forces. These prediction and correction steps are repeated several times to get the lowest density fluctuation, in order to enforce fluid incompressibility. The algorithm is described in Algorithm 7 in the Appendix; an outline of the prediction-correction loop is also sketched at the end of this section. The PCISPH authors state that it is possible to use a time step 35x bigger than with SPH while retaining the simulation's visualization. This is an interesting feature, since the time step determines how much time the simulation advances in one iteration. A bigger time step means that in each computation iteration the simulation predicts a longer duration of how the fluid would behave in the real world. For example, 10 iterations with a time step of 0.001 second equal 0.01 second of simulated real-world time, but with a bigger time step of 0.01 second, 10 iterations equal 0.1 second of simulated real-world time. However, we did not succeed in implementing PCISPH with a bigger time step, and because of time limitations we did not investigate this further. Instead, we obtained a different fluid speed visualization when using PCISPH. Table 6.4 in the Appendix shows the simulation configuration with PCISPH, which is the same as SPH's configuration except for the initial speed value. Furthermore, instead of the gas stiffness constant, PCISPH uses the pcisphDelta parameter. Figure 4.11 shows the visualization effect of PCISPH with 0.5 million particles. The simulation shows a fluid that progresses at a faster speed compared to the original SPH algorithm.

Figure 4.11. PCISPH simulation when it reaches 0.5 million particles

Appendix B explains the water-jet spray algorithm using SPH and PCISPH as a whole.
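For reference, a host-side outline of one PCISPH time step, paraphrased from the description in [27] and Algorithm 7 in the Appendix, is given below. The function names are placeholders for the corresponding CUDA kernels and do not match Fluidsv3's code verbatim.

// Outline of one PCISPH time step (placeholder kernel wrappers, not Fluidsv3's API).
void findNeighbours();                    // neighbour search, done once per time step
void computeViscosityAndExternalForces(); // forces kept fixed during the correction loop
void predictVelocitiesAndPositions();     // integrate with the current pressure forces
void computePredictedDensityAndError(float* maxDensityError);
void updatePressureFromDensityError();    // p += pcisphDelta * (rho_predicted - rho0)
void computePressureForces();
void integrateTimeStep();                 // final leap-frog integration

void pcisphStep(float densityErrorTolerance, int minIterations, int maxIterations)
{
    findNeighbours();
    computeViscosityAndExternalForces();

    float maxDensityError = 0.0f;
    int iter = 0;
    // Prediction-correction loop: repeat until the density fluctuation is small enough.
    while ((maxDensityError > densityErrorTolerance || iter < minIterations)
           && iter < maxIterations) {
        predictVelocitiesAndPositions();
        computePredictedDensityAndError(&maxDensityError);
        updatePressureFromDensityError();
        computePressureForces();
        ++iter;
    }
    integrateTimeStep();
}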

Figure 4.12. SPH simulation when it reaches 0.5 million particles

4.7 Known Limitations of the Water-Spray Simulation with the SPH Method

There are several known limitations in the current water-spray jet implementation. This section describes them and how they affect the simulation.

4.7.1 Particles Scaling

Fluidsv3's original simulation uses the SimScale parameter to scale up the distance between particles for better visualization. The problem here is that changing the parameter does not rescale the visualized particle size. This can result in a wrong visualization, such as a fluid visualized by very dense particles even though it is not dense.

4.7.2 Incompressibility and Simulation Stability

Equation (2.9), which is used to compute p, is restated as

$$p = k(\rho - \rho_0). \qquad (4.1)$$

In equation (4.1), p measures the pressure for the particles and ρ_0 is the fluid's expected rest density during the simulation. The water is modeled as an artificially incompressible fluid with the Navier-Stokes equations, meaning that the fluid should have the same density in every part of it and at every time step. The particles move at each time step and change the fluid density. Therefore, the equation above is used in order to enforce the incompressible fluid. When ρ < ρ_0 the particles experience an attraction force, and if ρ > ρ_0 the particles experience a repelling force. The question here is how big k should be. A bigger k gives a stronger attraction-repulsion force and makes the fluid's ρ become equal to ρ_0 faster, so that particles are not congested in the same place for too many time steps.

Reduced congestion of particles in the same smoothing-radius area means fewer neighbouring particles and less computation per iteration, and vice versa. However, if a too large k is used, then a much smaller time step is required, otherwise the simulation becomes unstable. A bigger k also produces a bigger force, which results in a bigger particle movement in one time step. If the particles move further than the smoothing kernel radius in one time step, then the attraction force is not computed and is thus lost. If many particles experience this, then the fluid particles look as if they are blown apart.

4.7.3 Initial Speed vs. Time Step

Since we simulate a water-jet spray, the fluid's initial speed is at our discretion. In the simulation, the initial speed value is accumulated in the particle velocities, which determine how far a layer of fluid particles moves in one time step. This makes the initial speed very influential for the particle movement. In each time step a new layer of fluid particles is added at the same simulation coordinates. It is important to have an initial speed that fits the time step. Using an initial speed and a time step which are both too small results in a bloated clump of particles instead of a flowing fluid. Figure 4.13 shows such a case. On the other hand, using an initial speed and a time step which are too big results in big gaps between the layers of particles. This is also not correct, since there is then no attraction force between the layers of particles. Figure 4.14 shows such a case. Therefore, it is important to have an initial speed that fits the time step in order to produce a stable and working simulation. However, such a combination of initial speed and time step is obtained experimentally, which is cumbersome, not ideal, and affects the usefulness of the simulation.

4.7.4 Obtaining Parameter Values

The simulation uses many adjustable parameters which determine the simulation's result and stability. Parameters such as the smoothing kernel radius, the viscosity, the particle radius, and PCISPH's delta obtained their values by trial and error. The boundary handling method based on [26] prevents any particles from penetrating the solid surface for the current simulation setting. This does not mean by default that it will work for any arbitrary initial speed. The effectiveness of the boundary handling method also depends on the diameter of the solid boundary particles and the gas stiffness constant.

Figure 4.13. A bloated clump of particles

Figure 4.14. Too big gaps between layers of particles

4.7.5 Simulation Correctness

SPH has been used in many fluids simulations, and its accuracy has been tested in different test cases, for example in [39] for the case studies Creation of waves by landslides, Dam-break propagation over a wet bed, and Wave-structure interaction, or in [37] for a hydraulic jump test case. However, the measurement of the water-jet simulation accuracy is not implemented in this work, and no proper physical correctness validation is done. Instead, the correctness of our implementation is subjectively assessed by visual analysis.


More information

2.11 Particle Systems

2.11 Particle Systems 2.11 Particle Systems 320491: Advanced Graphics - Chapter 2 152 Particle Systems Lagrangian method not mesh-based set of particles to model time-dependent phenomena such as snow fire smoke 320491: Advanced

More information

Comparison between incompressible SPH solvers

Comparison between incompressible SPH solvers 2017 21st International Conference on Control Systems and Computer Science Comparison between incompressible SPH solvers Claudiu Baronea, Adrian Cojocaru, Mihai Francu, Anca Morar, Victor Asavei Computer

More information

GPU Programming Using NVIDIA CUDA

GPU Programming Using NVIDIA CUDA GPU Programming Using NVIDIA CUDA Siddhante Nangla 1, Professor Chetna Achar 2 1, 2 MET s Institute of Computer Science, Bandra Mumbai University Abstract: GPGPU or General-Purpose Computing on Graphics

More information

Navier-Stokes & Flow Simulation

Navier-Stokes & Flow Simulation Last Time? Navier-Stokes & Flow Simulation Optional Reading for Last Time: Spring-Mass Systems Numerical Integration (Euler, Midpoint, Runge-Kutta) Modeling string, hair, & cloth HW2: Cloth & Fluid Simulation

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

Introduction to GPU hardware and to CUDA

Introduction to GPU hardware and to CUDA Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 35 Course outline Introduction to GPU hardware

More information

Acknowledgements. Prof. Dan Negrut Prof. Darryl Thelen Prof. Michael Zinn. SBEL Colleagues: Hammad Mazar, Toby Heyn, Manoj Kumar

Acknowledgements. Prof. Dan Negrut Prof. Darryl Thelen Prof. Michael Zinn. SBEL Colleagues: Hammad Mazar, Toby Heyn, Manoj Kumar Philipp Hahn Acknowledgements Prof. Dan Negrut Prof. Darryl Thelen Prof. Michael Zinn SBEL Colleagues: Hammad Mazar, Toby Heyn, Manoj Kumar 2 Outline Motivation Lumped Mass Model Model properties Simulation

More information

Surface Tension in Smoothed Particle Hydrodynamics on the GPU

Surface Tension in Smoothed Particle Hydrodynamics on the GPU TDT4590 Complex Computer Systems, Specialization Project Fredrik Fossum Surface Tension in Smoothed Particle Hydrodynamics on the GPU Supervisor: Dr. Anne C. Elster Trondheim, Norway, December 21, 2010

More information

NVIDIA. Interacting with Particle Simulation in Maya using CUDA & Maximus. Wil Braithwaite NVIDIA Applied Engineering Digital Film

NVIDIA. Interacting with Particle Simulation in Maya using CUDA & Maximus. Wil Braithwaite NVIDIA Applied Engineering Digital Film NVIDIA Interacting with Particle Simulation in Maya using CUDA & Maximus Wil Braithwaite NVIDIA Applied Engineering Digital Film Some particle milestones FX Rendering Physics 1982 - First CG particle FX

More information

GPU-accelerated data expansion for the Marching Cubes algorithm

GPU-accelerated data expansion for the Marching Cubes algorithm GPU-accelerated data expansion for the Marching Cubes algorithm San Jose (CA) September 23rd, 2010 Christopher Dyken, SINTEF Norway Gernot Ziegler, NVIDIA UK Agenda Motivation & Background Data Compaction

More information

Double-Precision Matrix Multiply on CUDA

Double-Precision Matrix Multiply on CUDA Double-Precision Matrix Multiply on CUDA Parallel Computation (CSE 60), Assignment Andrew Conegliano (A5055) Matthias Springer (A995007) GID G--665 February, 0 Assumptions All matrices are square matrices

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

Two-Phase flows on massively parallel multi-gpu clusters

Two-Phase flows on massively parallel multi-gpu clusters Two-Phase flows on massively parallel multi-gpu clusters Peter Zaspel Michael Griebel Institute for Numerical Simulation Rheinische Friedrich-Wilhelms-Universität Bonn Workshop Programming of Heterogeneous

More information

high performance medical reconstruction using stream programming paradigms

high performance medical reconstruction using stream programming paradigms high performance medical reconstruction using stream programming paradigms This Paper describes the implementation and results of CT reconstruction using Filtered Back Projection on various stream programming

More information

Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation

Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Nikolai Sakharnykh - NVIDIA San Jose Convention Center, San Jose, CA September 21, 2010 Introduction Tridiagonal solvers very popular

More information

Microwell Mixing with Surface Tension

Microwell Mixing with Surface Tension Microwell Mixing with Surface Tension Nick Cox Supervised by Professor Bruce Finlayson University of Washington Department of Chemical Engineering June 6, 2007 Abstract For many applications in the pharmaceutical

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

3D Simulation of Dam-break effect on a Solid Wall using Smoothed Particle Hydrodynamic

3D Simulation of Dam-break effect on a Solid Wall using Smoothed Particle Hydrodynamic ISCS 2013 Selected Papers Dam-break effect on a Solid Wall 1 3D Simulation of Dam-break effect on a Solid Wall using Smoothed Particle Hydrodynamic Suprijadi a,b, F. Faizal b, C.F. Naa a and A.Trisnawan

More information

Performance potential for simulating spin models on GPU

Performance potential for simulating spin models on GPU Performance potential for simulating spin models on GPU Martin Weigel Institut für Physik, Johannes-Gutenberg-Universität Mainz, Germany 11th International NTZ-Workshop on New Developments in Computational

More information

GPGPU. Peter Laurens 1st-year PhD Student, NSC

GPGPU. Peter Laurens 1st-year PhD Student, NSC GPGPU Peter Laurens 1st-year PhD Student, NSC Presentation Overview 1. What is it? 2. What can it do for me? 3. How can I get it to do that? 4. What s the catch? 5. What s the future? What is it? Introducing

More information

Realistic Animation of Fluids

Realistic Animation of Fluids 1 Realistic Animation of Fluids Nick Foster and Dimitris Metaxas Presented by Alex Liberman April 19, 2005 2 Previous Work Used non physics-based methods (mostly in 2D) Hard to simulate effects that rely

More information

Solidification using Smoothed Particle Hydrodynamics

Solidification using Smoothed Particle Hydrodynamics Solidification using Smoothed Particle Hydrodynamics Master Thesis CA-3817512 Game and Media Technology Utrecht University, The Netherlands Supervisors: dr. ir. J.Egges dr. N.G. Pronost July, 2014 - 2

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

Optimization solutions for the segmented sum algorithmic function

Optimization solutions for the segmented sum algorithmic function Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

Realtime Water Simulation on GPU. Nuttapong Chentanez NVIDIA Research

Realtime Water Simulation on GPU. Nuttapong Chentanez NVIDIA Research 1 Realtime Water Simulation on GPU Nuttapong Chentanez NVIDIA Research 2 3 Overview Approaches to realtime water simulation Hybrid shallow water solver + particles Hybrid 3D tall cell water solver + particles

More information

Navier-Stokes & Flow Simulation

Navier-Stokes & Flow Simulation Last Time? Navier-Stokes & Flow Simulation Implicit Surfaces Marching Cubes/Tetras Collision Detection & Response Conservative Bounding Regions backtracking fixing Today Flow Simulations in Graphics Flow

More information

Using GPUs to compute the multilevel summation of electrostatic forces

Using GPUs to compute the multilevel summation of electrostatic forces Using GPUs to compute the multilevel summation of electrostatic forces David J. Hardy Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of

More information

Level Set Methods for Two-Phase Flows with FEM

Level Set Methods for Two-Phase Flows with FEM IT 14 071 Examensarbete 30 hp December 2014 Level Set Methods for Two-Phase Flows with FEM Deniz Kennedy Institutionen för informationsteknologi Department of Information Technology Abstract Level Set

More information

2.7 Cloth Animation. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter 2 123

2.7 Cloth Animation. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter 2 123 2.7 Cloth Animation 320491: Advanced Graphics - Chapter 2 123 Example: Cloth draping Image Michael Kass 320491: Advanced Graphics - Chapter 2 124 Cloth using mass-spring model Network of masses and springs

More information

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka USE OF FOR Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague Mini workshop on advanced numerical methods

More information

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture

More information

CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA

CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0 Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nsight VSE APOD

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

Adaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics

Adaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics Adaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics H. Y. Schive ( 薛熙于 ) Graduate Institute of Physics, National Taiwan University Leung Center for Cosmology and Particle Astrophysics

More information

Particle-based Fluid Simulation

Particle-based Fluid Simulation Simulation in Computer Graphics Particle-based Fluid Simulation Matthias Teschner Computer Science Department University of Freiburg Application (with Pixar) 10 million fluid + 4 million rigid particles,

More information

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,

More information

A High Quality, Eulerian 3D Fluid Solver in C++ A Senior Project. presented to. the Faculty of the Computer Science Department of

A High Quality, Eulerian 3D Fluid Solver in C++ A Senior Project. presented to. the Faculty of the Computer Science Department of A High Quality, Eulerian 3D Fluid Solver in C++ A Senior Project presented to the Faculty of the Computer Science Department of California Polytechnic State University, San Luis Obispo In Partial Fulfillment

More information

Particle-Based Fluid Simulation. CSE169: Computer Animation Steve Rotenberg UCSD, Spring 2016

Particle-Based Fluid Simulation. CSE169: Computer Animation Steve Rotenberg UCSD, Spring 2016 Particle-Based Fluid Simulation CSE169: Computer Animation Steve Rotenberg UCSD, Spring 2016 Del Operations Del: = x Gradient: s = s x y s y z s z Divergence: v = v x + v y + v z x y z Curl: v = v z v

More information

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand

More information

CUDA. Fluid simulation Lattice Boltzmann Models Cellular Automata

CUDA. Fluid simulation Lattice Boltzmann Models Cellular Automata CUDA Fluid simulation Lattice Boltzmann Models Cellular Automata Please excuse my layout of slides for the remaining part of the talk! Fluid Simulation Navier Stokes equations for incompressible fluids

More information

Tesla Architecture, CUDA and Optimization Strategies

Tesla Architecture, CUDA and Optimization Strategies Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization

More information

CS427 Multicore Architecture and Parallel Computing

CS427 Multicore Architecture and Parallel Computing CS427 Multicore Architecture and Parallel Computing Lecture 6 GPU Architecture Li Jiang 2014/10/9 1 GPU Scaling A quiet revolution and potential build-up Calculation: 936 GFLOPS vs. 102 GFLOPS Memory Bandwidth:

More information

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions.

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions. Schedule CUDA Digging further into the programming manual Application Programming Interface (API) text only part, sorry Image utilities (simple CUDA examples) Performace considerations Matrix multiplication

More information

GPGPU LAB. Case study: Finite-Difference Time- Domain Method on CUDA

GPGPU LAB. Case study: Finite-Difference Time- Domain Method on CUDA GPGPU LAB Case study: Finite-Difference Time- Domain Method on CUDA Ana Balevic IPVS 1 Finite-Difference Time-Domain Method Numerical computation of solutions to partial differential equations Explicit

More information

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten GPU Computing: Development and Analysis Part 1 Anton Wijs Muhammad Osama Marieke Huisman Sebastiaan Joosten NLeSC GPU Course Rob van Nieuwpoort & Ben van Werkhoven Who are we? Anton Wijs Assistant professor,

More information

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3

More information

Divergence-Free Smoothed Particle Hydrodynamics

Divergence-Free Smoothed Particle Hydrodynamics Copyright of figures and other materials in the paper belongs to original authors. Divergence-Free Smoothed Particle Hydrodynamics Bender et al. SCA 2015 Presented by MyungJin Choi 2016-11-26 1. Introduction

More information

FINITE POINTSET METHOD FOR 2D DAM-BREAK PROBLEM WITH GPU-ACCELERATION. M. Panchatcharam 1, S. Sundar 2

FINITE POINTSET METHOD FOR 2D DAM-BREAK PROBLEM WITH GPU-ACCELERATION. M. Panchatcharam 1, S. Sundar 2 International Journal of Applied Mathematics Volume 25 No. 4 2012, 547-557 FINITE POINTSET METHOD FOR 2D DAM-BREAK PROBLEM WITH GPU-ACCELERATION M. Panchatcharam 1, S. Sundar 2 1,2 Department of Mathematics

More information

Lagrangian methods and Smoothed Particle Hydrodynamics (SPH) Computation in Astrophysics Seminar (Spring 2006) L. J. Dursi

Lagrangian methods and Smoothed Particle Hydrodynamics (SPH) Computation in Astrophysics Seminar (Spring 2006) L. J. Dursi Lagrangian methods and Smoothed Particle Hydrodynamics (SPH) Eulerian Grid Methods The methods covered so far in this course use an Eulerian grid: Prescribed coordinates In `lab frame' Fluid elements flow

More information

Massively Parallel Architectures

Massively Parallel Architectures Massively Parallel Architectures A Take on Cell Processor and GPU programming Joel Falcou - LRI joel.falcou@lri.fr Bat. 490 - Bureau 104 20 janvier 2009 Motivation The CELL processor Harder,Better,Faster,Stronger

More information

Parallel GPU-Based Fluid Animation. Master s thesis in Interaction Design and Technologies JAKOB SVENSSON

Parallel GPU-Based Fluid Animation. Master s thesis in Interaction Design and Technologies JAKOB SVENSSON Parallel GPU-Based Fluid Animation Master s thesis in Interaction Design and Technologies JAKOB SVENSSON Department of Applied Information Technology CHALMERS UNIVERSITY OF TECHNOLOGY Gothenburg, Sweden

More information

Particleworks: Particle-based CAE Software fully ported to GPU

Particleworks: Particle-based CAE Software fully ported to GPU Particleworks: Particle-based CAE Software fully ported to GPU Introduction PrometechVideo_v3.2.3.wmv 3.5 min. Particleworks Why the particle method? Existing methods FEM, FVM, FLIP, Fluid calculation

More information

PowerVR Hardware. Architecture Overview for Developers

PowerVR Hardware. Architecture Overview for Developers Public Imagination Technologies PowerVR Hardware Public. This publication contains proprietary information which is subject to change without notice and is supplied 'as is' without warranty of any kind.

More information

Dense matching GPU implementation

Dense matching GPU implementation Dense matching GPU implementation Author: Hailong Fu. Supervisor: Prof. Dr.-Ing. Norbert Haala, Dipl. -Ing. Mathias Rothermel. Universität Stuttgart 1. Introduction Correspondence problem is an important

More information

PARALLEL SIMULATION OF A FLUID FLOW BY MEANS OF THE SPH METHOD: OPENMP VS. MPI COMPARISON. Pawe l Wróblewski, Krzysztof Boryczko

PARALLEL SIMULATION OF A FLUID FLOW BY MEANS OF THE SPH METHOD: OPENMP VS. MPI COMPARISON. Pawe l Wróblewski, Krzysztof Boryczko Computing and Informatics, Vol. 28, 2009, 139 150 PARALLEL SIMULATION OF A FLUID FLOW BY MEANS OF THE SPH METHOD: OPENMP VS. MPI COMPARISON Pawe l Wróblewski, Krzysztof Boryczko Department of Computer

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

Universiteit Leiden Opleiding Informatica

Universiteit Leiden Opleiding Informatica Universiteit Leiden Opleiding Informatica Comparison of the effectiveness of shared memory optimizations for stencil computations on NVIDIA GPU architectures Name: Geerten Verweij Date: 12/08/2016 1st

More information

Smoke Simulation using Smoothed Particle Hydrodynamics (SPH) Shruti Jain MSc Computer Animation and Visual Eects Bournemouth University

Smoke Simulation using Smoothed Particle Hydrodynamics (SPH) Shruti Jain MSc Computer Animation and Visual Eects Bournemouth University Smoke Simulation using Smoothed Particle Hydrodynamics (SPH) Shruti Jain MSc Computer Animation and Visual Eects Bournemouth University 21st November 2014 1 Abstract This report is based on the implementation

More information

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli High performance 2D Discrete Fourier Transform on Heterogeneous Platforms Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli Motivation Fourier Transform widely used in Physics, Astronomy, Engineering

More information

Numerical Simulation on the GPU

Numerical Simulation on the GPU Numerical Simulation on the GPU Roadmap Part 1: GPU architecture and programming concepts Part 2: An introduction to GPU programming using CUDA Part 3: Numerical simulation techniques (grid and particle

More information

N-Body Simulation using CUDA. CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo

N-Body Simulation using CUDA. CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo N-Body Simulation using CUDA CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo Project plan Develop a program to simulate gravitational

More information

Accelerating Correlation Power Analysis Using Graphics Processing Units (GPUs)

Accelerating Correlation Power Analysis Using Graphics Processing Units (GPUs) Accelerating Correlation Power Analysis Using Graphics Processing Units (GPUs) Hasindu Gamaarachchi, Roshan Ragel Department of Computer Engineering University of Peradeniya Peradeniya, Sri Lanka hasindu8@gmailcom,

More information

ECE 574 Cluster Computing Lecture 16

ECE 574 Cluster Computing Lecture 16 ECE 574 Cluster Computing Lecture 16 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 26 March 2019 Announcements HW#7 posted HW#6 and HW#5 returned Don t forget project topics

More information

General-purpose computing on graphics processing units (GPGPU)

General-purpose computing on graphics processing units (GPGPU) General-purpose computing on graphics processing units (GPGPU) Thomas Ægidiussen Jensen Henrik Anker Rasmussen François Rosé November 1, 2010 Table of Contents Introduction CUDA CUDA Programming Kernels

More information

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization Duksu Kim Assistant professor, KORATEHC Education Ph.D. Computer Science, KAIST Parallel Proximity Computation on Heterogeneous Computing Systems for Graphics Applications Professional Experience Senior

More information

Optimizing Data Locality for Iterative Matrix Solvers on CUDA

Optimizing Data Locality for Iterative Matrix Solvers on CUDA Optimizing Data Locality for Iterative Matrix Solvers on CUDA Raymond Flagg, Jason Monk, Yifeng Zhu PhD., Bruce Segee PhD. Department of Electrical and Computer Engineering, University of Maine, Orono,

More information

CUDA Programming Model

CUDA Programming Model CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming

More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu

More information

High-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs

High-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs High-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs Gordon Erlebacher Department of Scientific Computing Sept. 28, 2012 with Dimitri Komatitsch (Pau,France) David Michea

More information

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION WHAT YOU WILL LEARN An iterative method to optimize your GPU code Some common bottlenecks to look out for Performance diagnostics with NVIDIA Nsight

More information

3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA

3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA 3D ADI Method for Fluid Simulation on Multiple GPUs Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA Introduction Fluid simulation using direct numerical methods Gives the most accurate result Requires

More information

Portland State University ECE 588/688. Graphics Processors

Portland State University ECE 588/688. Graphics Processors Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly

More information

GPU Basics. Introduction to GPU. S. Sundar and M. Panchatcharam. GPU Basics. S. Sundar & M. Panchatcharam. Super Computing GPU.

GPU Basics. Introduction to GPU. S. Sundar and M. Panchatcharam. GPU Basics. S. Sundar & M. Panchatcharam. Super Computing GPU. Basics of s Basics Introduction to Why vs CPU S. Sundar and Computing architecture August 9, 2014 1 / 70 Outline Basics of s Why vs CPU Computing architecture 1 2 3 of s 4 5 Why 6 vs CPU 7 Computing 8

More information

University of Bielefeld

University of Bielefeld Geistes-, Natur-, Sozial- und Technikwissenschaften gemeinsam unter einem Dach Introduction to GPU Programming using CUDA Olaf Kaczmarek University of Bielefeld STRONGnet Summerschool 2011 ZIF Bielefeld

More information

GPU Programming for Mathematical and Scientific Computing

GPU Programming for Mathematical and Scientific Computing GPU Programming for Mathematical and Scientific Computing Ethan Kerzner and Timothy Urness Department of Mathematics and Computer Science Drake University Des Moines, IA 50311 ethan.kerzner@gmail.com timothy.urness@drake.edu

More information

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Dirk Ribbrock Faculty of Mathematics, TU dortmund 2016 Table of Contents Why parallel

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

GPU Simulations of Violent Flows with Smooth Particle Hydrodynamics (SPH) Method

GPU Simulations of Violent Flows with Smooth Particle Hydrodynamics (SPH) Method Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe GPU Simulations of Violent Flows with Smooth Particle Hydrodynamics (SPH) Method T. Arslan a*, M. Özbulut b a Norwegian

More information

Accelerating CFD with Graphics Hardware

Accelerating CFD with Graphics Hardware Accelerating CFD with Graphics Hardware Graham Pullan (Whittle Laboratory, Cambridge University) 16 March 2009 Today Motivation CPUs and GPUs Programming NVIDIA GPUs with CUDA Application to turbomachinery

More information

Accelerating image registration on GPUs

Accelerating image registration on GPUs Accelerating image registration on GPUs Harald Köstler, Sunil Ramgopal Tatavarty SIAM Conference on Imaging Science (IS10) 13.4.2010 Contents Motivation: Image registration with FAIR GPU Programming Combining

More information

CUDA OPTIMIZATION WITH NVIDIA NSIGHT VISUAL STUDIO EDITION

CUDA OPTIMIZATION WITH NVIDIA NSIGHT VISUAL STUDIO EDITION April 4-7, 2016 Silicon Valley CUDA OPTIMIZATION WITH NVIDIA NSIGHT VISUAL STUDIO EDITION CHRISTOPH ANGERER, NVIDIA JAKOB PROGSCH, NVIDIA 1 WHAT YOU WILL LEARN An iterative method to optimize your GPU

More information

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Joined Advanced Student School (JASS) 2009 March 29 - April 7, 2009 St. Petersburg, Russia G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Dmitry Puzyrev St. Petersburg State University Faculty

More information

High Performance Computing and GPU Programming

High Performance Computing and GPU Programming High Performance Computing and GPU Programming Lecture 1: Introduction Objectives C++/CPU Review GPU Intro Programming Model Objectives Objectives Before we begin a little motivation Intel Xeon 2.67GHz

More information

GPU Fundamentals Jeff Larkin November 14, 2016

GPU Fundamentals Jeff Larkin November 14, 2016 GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 20

ECE 571 Advanced Microprocessor-Based Design Lecture 20 ECE 571 Advanced Microprocessor-Based Design Lecture 20 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 12 April 2016 Project/HW Reminder Homework #9 was posted 1 Raspberry Pi

More information

Implementation of Adaptive Coarsening Algorithm on GPU using CUDA

Implementation of Adaptive Coarsening Algorithm on GPU using CUDA Implementation of Adaptive Coarsening Algorithm on GPU using CUDA 1. Introduction , In scientific computing today, the high-performance computers grow

More information