Accelerating Fluids Simulation Using SPH and Implementation on GPU


IT Examensarbete 30 hp, December 2015
Accelerating Fluids Simulation Using SPH and Implementation on GPU
Aditya Hendra
Institutionen för informationsteknologi / Department of Information Technology


Abstract

Accelerating Fluids Simulation Using SPH and Implementation on GPU
Aditya Hendra
Teknisk-naturvetenskaplig fakultet, UTH-enheten

Fluid simulation is usually done with CFD methods, which offer high precision but need days, weeks or months to compute on desktop CPUs, which limits their practical use in industrial control systems. To reduce the computation time, the Smoothed Particle Hydrodynamics (SPH) method is used. SPH is commonly used to simulate fluids in the computer graphics field, especially in gaming. It offers faster computation at the cost of lower accuracy. The goal of this work is to determine the feasibility of using the SPH method with GPU parallel programming to provide a fluids simulation that is fast enough for real-time feedback control simulation. A previous master thesis on accelerating fluids simulation using the SPH method was done by Ann Johansson at ABB. Her Matlab implementation on an Intel i7 at 2.4 GHz needs 7089 seconds to compute a water-jet simulation. Our work utilizes GPU parallel programs implemented on top of Fluidsv3, an open-source fluid simulator used as the base code. With CUDA C/C++ and an Nvidia GTX 980, we need 18 seconds to compute a water-jet simulation. Currently, our work lacks a validation method to measure the accuracy of the fluid simulation, and more work needs to be done on this. However, it takes only 80 msec to compute one iteration, which opens an opportunity to use it together with a real-time system, such as a feedback control system, that has a period of 100 msec. This means it could model industrial processes that utilize water, such as the cooling process in a hot rolling mill. The next question, which is not addressed in this study, is how to satisfy application-dependent needs such as simulation accuracy, required parameters, simulation duration in real time, etc.

Handledare: Kateryna Mishchenko
Ämnesgranskare: Stefan Engblom
Examinator: Wang Yi
Tryckt av: Reprocentralen ITC


Acknowledgement

I would like to express my gratitude to ABB's project team, especially my thesis supervisor Kateryna Mishchenko, Markus Lindgren and Lokman Hosain, for their trust and guidance from the start to the end of the thesis work. Furthermore, I would like to thank Rama Hoetzlein for making his work on Fluidsv3 freely available on the web. I also want to thank my thesis reviewer Stefan Engblom from Uppsala University for his valuable help with the writing. My sincere thanks also go to my master programme coordinator at Uppsala University, Philipp Rümmer, who has helped me with many things during my education there. Last but not least, I am very grateful to the Swedish Institute for providing me a full scholarship for my master education in Sweden. Without it, this work would not have been possible.


Contents

Acknowledgement
Acronyms
1 Introduction
  1.1 Project Purpose and Goal
  1.2 Scope
2 Background
  2.1 Fluids Simulation with Smoothed-Particle Hydrodynamics
    2.1.1 Navier-Stokes Equation
    2.1.2 Smoothed Particle Hydrodynamics (SPH)
  2.2 GPU Parallel Programming
    2.2.1 General-Purpose computing on Graphics Processing Units (GPGPU)
    2.2.2 CUDA
3 SPH for Fluids Simulation
  3.1 Acceleration of MATLAB code using GPUs
  3.2 GPU Implementation of Fluids Simulation using the SPH Method
    3.2.1 SPH Implementation with Fluidsv3
    3.2.2 Alternative SPH Implementations
  3.3 Other GPGPU Platforms
4 GPGPU Fluids (Water-Jet) Simulation
  4.1 GPU Performance Benchmarking
  4.2 Water-Jet Model
  4.3 Simulation Modification to Fluidsv3
  4.4 Visual Correctness Modification
    4.4.1 Fluidsv3 Original Boundary Handling
    4.4.2 Ihmsen et al. Boundary Handling
  4.5 Parallel Programming Optimization
    4.5.1 Reduce-Malloc
    4.5.2 Graphics-Interoperability
    4.5.3 GPU Shared-Memory
  4.6 Predictive-Corrective Incompressible SPH (PCISPH)
  4.7 Known Limitations of the Water-Spray Simulation with the SPH Method
    4.7.1 Particles Scaling
    4.7.2 Incompressibility and Simulation Stability
    4.7.3 Initial Speed vs. Time Step
    4.7.4 Obtaining Parameter Values
    4.7.5 Simulation Correctness
5 Discussion
  5.1 Performance Benchmark
  5.2 Unused Optimization Technique
  5.3 Different Simulation Parameter to Simulation Speed
6 Conclusion
  6.1 Accelerated Fluids Simulation
  6.2 GPU Shared-Memory
  6.3 PCISPH
  6.4 Future Works
References
Appendix A: Fluidsv3's Algorithm
Appendix B: Water-Jet Spray's Algorithm

Acronyms

CFD    Computational Fluid Dynamics
CUDA   Compute Unified Device Architecture
GLEW   OpenGL Extension Wrangler Library
GPGPU  General-Purpose computing on Graphics Processing Units
GPU    Graphics Processor Unit
HPC    High Performance Computing
PCISPH Predictive-Corrective Incompressible SPH
SIMD   Single Instruction Multiple Data
SM     Streaming Multiprocessor
SPH    Smoothed Particle Hydrodynamics


1. Introduction

Fluids is a collective term for liquid and gaseous substances, which are common in many industrial applications. At ABB Ltd, optimization of process models related to fluids is frequently done, and hot rolling mill model optimization is one of them. Hot rolling is a metal forming process that produces a very thin sheet of metal while retaining specific metal properties. Figure 1.1 shows a typical hot rolling mill setting. A heated metal block is processed through several rolling mill stands to reach the appropriate thickness before being cooled down in a high-capacity cooler and coiled into a finished product.

Figure 1.1. Hot-Rolling Mill Illustration

One of many things that can affect the quality of hot rolling mill end products is the cooling process. Cooling is the last part of the whole rolling process and is done by the run-out table. A run-out table moves the hot metal sheets under water-spray jets to cool them down to a certain temperature. It is important to better understand how the cooling process affects the metal quality in order to obtain end products with the correct properties. One way to achieve this is a complete computer simulation of the cooling process with a feedback control system. However, a reasonably fast fluids simulation is needed to be able to incorporate it into a real-time control system. This means we have to select computational methods that are fast and provide good results. The first step towards such a computer simulation is an accelerated fluids simulation of a water-spray jet which still has good enough accuracy. Fluids simulation is typically done using Computational Fluid Dynamics (CFD) methods, which require a lot of computation time and therefore are not suitable for our aims. One method that is fast and provides good enough results is Smoothed Particle Hydrodynamics (SPH). SPH is commonly used for fluids simulation in the gaming industry, where computation speed is very important.

SPH is an interpolation or estimation method. It provides an approximation of the numerical equations of fluid dynamics by substituting the fluid with a set of particles. Using SPH, the fluid movement is represented by moving particles. All particle positions need to be computed in each iteration. However, there is no specific order in which the particle positions must be computed: particle A could be computed before particle B, or vice versa, as long as all particle positions have been computed before the next iteration starts. This suits parallel programming implementations quite well, such as General-Purpose computing on Graphics Processing Units (GPGPU), where the particle positions can be computed in parallel.

A GPU was originally used for graphics manipulation and image processing, especially for gaming visualisation. Technology improvements made it possible to exploit it for general computing. The computation power behind a GPU comes from hundreds or thousands of parallel computation cores. Individually, each of these cores is much slower and simpler than a CPU core: an Intel i7 core runs at 4 GHz while an Nvidia GTX 980 core runs at 1.1 GHz. Intel i7 cores are also equipped with many built-in hardware instruction sets that are not available in GPU cores. However, the sheer number of GPU cores makes up for their lower individual speed and can provide a higher throughput than the CPU's fast cores. A common analogy is to say that a CPU is like a Ferrari and a GPU is like a city bus. A bus is much slower than a Ferrari, but it can carry a lot of passengers. Of course, the Ferrari will arrive first, but if we wait until the bus arrives and use that amount of time for the Ferrari to go back and forth to carry more passengers, we see that in that time the bus carries more passengers; in terms of computing performance, the GPU has higher throughput. A Ferrari has more luxurious features than a bus, and likewise a CPU has more advanced instruction sets than a GPU, but if those instruction sets are not used in the simulation then they are not necessary. Importantly, SPH implementations using GPU parallel programming have been used to simulate many fluids applications in [1], [2], [3], [4], [5]. However, for our particular needs we want a specific implementation of a fluids simulation of water-jet sprays using SPH and GPU parallel programming for faster computations.

1.1 Project Purpose and Goal

This thesis work is a continuation of the work done by Ann Johansson at ABB for her master thesis at Uppsala University. In [6], Johansson looked at various methods which are generally suitable for video games and chose SPH as the method to simulate spray water jets in Matlab. The purpose of this master thesis is to serve as a milestone towards a comprehensive computer simulation.

The comprehensive simulation will simulate a complete run-out table in a rolling mill with multiple jets cooling the rolled material, using a feedback control running in real time. The goal of this master thesis is to ascertain the feasibility and drawbacks of using the SPH method with a GPU parallel programming implementation for accelerated fluids simulation that could be used in real-time feedback control simulation. The project can be divided into three problem statements:

1. Is it feasible to use GPU parallel programming to provide a water-spray jet simulation fast enough to be incorporated into a real-time feedback control simulation? If not, what is the best performance we could achieve? What should be done to get a real-time control simulation?
2. What is the tradeoff between simulation speed and accuracy? What are the physics parameters related to this?
3. What are the benefits/drawbacks of using GPUs for fluid simulations, compared to a CPU solution?

This report is divided into six sections. Section 2 discusses the basics of SPH and GPU parallel programming. Section 3 covers the SPH implementation in fluids simulation. Section 4 is devoted to the GPU implementation of the fluid simulation, its optimization results and limitations. Section 5 discusses the simulation results. Section 6 states the conclusions of the work and suggested future work.

1.2 Scope

The scope of this work assumes the following:

1. Any physical model development and improvement is not part of the project.
2. The physics model is provided by ABB and is not the primary contribution of this thesis.
3. We only look into a single-GPU implementation using an Nvidia GTX 980 gaming graphics card.
4. We only do the implementation on a stand-alone workstation using Visual Studio 2010 and 2013 as the IDE.
5. We use an open-source fluids simulation code, Fluidsv3 [7], as our base code. Fluidsv3 is an open-source fluid simulator for the CPU and GPU using the Smoothed Particle Hydrodynamics (SPH) method. Fluidsv3 is written and copyrighted by Rama Hoetzlein with an attribution-zlib license, which grants us the right to use the software for any purpose, including commercial applications, as long as the original author is acknowledged.

2. Background

This section describes the theory and methodology used as the foundation of the project. It describes fluids simulation with Smoothed-Particle Hydrodynamics (SPH) and GPGPU, i.e. Graphics Processor Unit (GPU) parallel programming.

2.1 Fluids Simulation with Smoothed-Particle Hydrodynamics

2.1.1 Navier-Stokes Equation

Commonly, fluids simulations are done using the Navier-Stokes equations, which describe the motion of fluids [8]. The Navier-Stokes equations for an incompressible fluid are:

$$\rho \left( \frac{\partial \mathbf{u}}{\partial t} + \mathbf{u} \cdot \nabla \mathbf{u} \right) = -\nabla p + \mu \nabla^2 \mathbf{u} + \mathbf{f}, \qquad (2.1)$$

$$\nabla \cdot \mathbf{u} = 0, \qquad (2.2)$$

where ρ is the density, u is the velocity, p is the pressure, µ is the viscosity, t is the time and f is the sum of external forces, e.g. gravity. ∇·u is the divergence of the velocity. There are different numerical methods to solve the Navier-Stokes equations, which can be divided into two main categories: grid-based Eulerian methods and mesh-free Lagrangian methods. The grid-based Eulerian methods, such as the finite volume, finite element, and finite difference methods, give high accuracy at the expense of more computation time. The mesh-free or Lagrangian methods, such as Smoothed Particle Hydrodynamics (SPH), need less computation time but give lower accuracy that is hopefully still acceptable.

2.1.2 Smoothed Particle Hydrodynamics (SPH)

In this project the mesh-free SPH method is chosen to compute the fluids simulation. It provides an approximation of the numerical equations of fluid dynamics by substituting the fluid with a set of particles. The method was developed by Gingold and Monaghan [9] in 1977 and independently by Lucy [10] in the same year. Originally, it was used to solve astrophysical problems.

Gingold and Monaghan improved the SPH algorithm to conserve linear and angular momentum, which attests to the similarities between SPH and molecular dynamics [11]. SPH is an interpolation or estimation method. According to it, a scalar physical quantity A of any particle is interpolated at location r by a weighted sum of contributions from all neighbouring particles within a radius h:

$$A_s(\mathbf{r}) = \sum_j m_j \frac{A_j}{\rho_j} W(\mathbf{r} - \mathbf{r}_j, h), \qquad (2.3)$$

where j is the index over all the neighbouring particles, m_j is the mass of particle j, r_j is its position, ρ_j is the density, A_j is the scalar physical quantity of fluid particle j, and W(r − r_j, h) is the smoothing kernel with radius h. The gradient of A is computed as follows:

$$\nabla A_s(\mathbf{r}) = \sum_j m_j \frac{A_j}{\rho_j} \nabla W(\mathbf{r} - \mathbf{r}_j, h), \qquad (2.4)$$

where ∇A_s(r) is the gradient of the scalar A at position r. The second derivative of the function A is:

$$\nabla^2 A_s(\mathbf{r}) = \sum_j m_j \frac{A_j}{\rho_j} \nabla^2 W(\mathbf{r} - \mathbf{r}_j, h), \qquad (2.5)$$

where ∇²A_s(r) is the Laplacian of the scalar A at position r. The derivation of equation (2.3) is given thoroughly in [6], and Kelager gives the derivation steps for equations (2.4) and (2.5) in [12]. We use these equations to evaluate several scalar quantities which affect fluid dynamics, such as density, pressure and viscosity, as they appear in equation (2.1). An elaborate explanation of the application of SPH to fluids simulation is given in [13]. The use of particles guarantees mass conservation, so equation (2.2) can be omitted; moreover, since the particles move with the fluid, the convective term u·∇u in equation (2.1) can be omitted too. In the end, the mesh-free or Lagrangian form of the Navier-Stokes equation for an incompressible fluid takes the form:

$$\rho \frac{\partial \mathbf{u}}{\partial t} = -\nabla p + \mu \nabla^2 \mathbf{u} + \mathbf{f}^{external}, \qquad (2.6)$$

where −∇p is the pressure term, µ∇²u is the viscosity term and f^external is the external force. The sum of these three force density fields determines the movement of the particles. For each particle i, we get:

$$\mathbf{F}_i = -\nabla p_i + \mu \nabla^2 \mathbf{u}_i + \mathbf{f}_i^{external}, \qquad \mathbf{F}_i = \rho \frac{\partial \mathbf{u}}{\partial t}, \qquad \mathbf{a}_i = \frac{\partial \mathbf{u}_i}{\partial t} = \frac{\mathbf{F}_i}{\rho_i}, \qquad (2.7)$$

where u_i is the velocity, F_i is the force density field, ρ_i is the density field, and a_i is the acceleration of particle i, respectively. The pressure term −∇p has the following representation:

$$\mathbf{f}_i^{pressure} = -\nabla p(\mathbf{r}_i) = -\sum_j m_j \frac{p_i + p_j}{2\rho_j} \nabla W(\mathbf{r}_i - \mathbf{r}_j, h), \qquad (2.8)$$

where f_i^pressure is the pressure force for particle i. In [14] it is suggested to compute p as:

$$p = k(\rho - \rho_0), \qquad (2.9)$$

where ρ_0 is the rest density, k is the gas stiffness constant and ρ is derived from equation (2.3) with ρ as the scalar quantity:

$$\rho(i) = \sum_j m_j \frac{\rho_j}{\rho_j} W(\mathbf{r}_i - \mathbf{r}_j, h) = \sum_j m_j W(\mathbf{r}_i - \mathbf{r}_j, h). \qquad (2.10)$$

The viscosity term µ∇²u is:

$$\mathbf{f}_i^{viscosity} = \mu \nabla^2 \mathbf{u}(\mathbf{r}_i) = \mu \sum_j m_j \frac{\mathbf{u}_j - \mathbf{u}_i}{\rho_j} \nabla^2 W(\mathbf{r}_i - \mathbf{r}_j, h), \qquad (2.11)$$

where f_i^viscosity is the viscosity force for particle i. The smoothing kernels W introduced in equations (2.8), (2.10) and (2.11) have the following representations, according to [13]. The smoothing kernel for the density equation (2.10) is

$$W_{poly6}(r, h) = \frac{315}{64\pi h^9} \begin{cases} (h^2 - r^2)^3 & 0 \le r \le h \\ 0 & \text{otherwise,} \end{cases} \qquad (2.12)$$

where h is the smoothing kernel radius and r is the scalar value of r. The smoothing kernel for the pressure equation (2.8) is

$$W_{spiky}(r, h) = \frac{15}{\pi h^6} \begin{cases} (h - r)^3 & 0 \le r \le h \\ 0 & \text{otherwise,} \end{cases}$$

$$\nabla W_{spiky}(r, h) = -\frac{45}{\pi h^6} \frac{\mathbf{r}}{|\mathbf{r}|} \begin{cases} (h - r)^2 & 0 \le r \le h \\ 0 & \text{otherwise,} \end{cases} \qquad (2.13)$$

where for the pressure we use the gradient of the W_spiky smoothing kernel. The smoothing kernel for the viscosity equation (2.11) is

$$W_{viscosity}(r, h) = \frac{15}{2\pi h^3} \begin{cases} -\frac{r^3}{2h^3} + \frac{r^2}{h^2} + \frac{h}{2r} - 1 & 0 \le r \le h \\ 0 & \text{otherwise,} \end{cases}$$

$$\nabla^2 W_{viscosity}(r, h) = \frac{45}{\pi h^6} \begin{cases} (h - r) & 0 \le r \le h \\ 0 & \text{otherwise,} \end{cases} \qquad (2.14)$$

where for the viscosity we use the Laplacian of the W_viscosity smoothing kernel.

Simulating the fluid movement using the SPH method is an iterative process. Each iteration computes an estimate of every particle's position at the next time step. First, the components of F_i, which are f_i^pressure (equation (2.8)), f_i^viscosity (equation (2.11)) and f_i^external (gravity), are computed. Second, the three components are summed to get F_i. Third, F_i is used to compute a_i using equation (2.7). Last, a_i is used to obtain the particle positions by solving the ODE using leap-frog integration. The full procedure of the Navier-Stokes computations using the SPH method for each particle i is stated in Algorithm 1.

Algorithm 1 Navier-Stokes SPH Algorithm

1: Compute the density
$$\rho(i) = \rho(\mathbf{r}_i) = \sum_j m_j W_{poly6}(\mathbf{r}_i - \mathbf{r}_j, h). \qquad (2.15)$$

2: Compute the pressure from the density
$$p_i = p(\mathbf{r}_i) = k(\rho(\mathbf{r}_i) - \rho_0). \qquad (2.16)$$

3: Compute the pressure force from the pressure interaction between neighbouring particles
$$\mathbf{f}_i^{pressure} = -\nabla p(\mathbf{r}_i) = -\sum_j m_j \frac{p_i + p_j}{2\rho_j} \nabla W_{spiky}(\mathbf{r}_i - \mathbf{r}_j, h). \qquad (2.17)$$

4: Compute the viscosity force between neighbouring particles
$$\mathbf{f}_i^{viscosity} = \mu \nabla^2 \mathbf{u}(\mathbf{r}_i) = \mu \sum_j m_j \frac{\mathbf{u}_j - \mathbf{u}_i}{\rho_j} \nabla^2 W_{viscosity}(\mathbf{r}_i - \mathbf{r}_j, h). \qquad (2.18)$$

5: Sum the pressure force, viscosity force and external force (e.g. gravity)
$$\mathbf{F}_i = \mathbf{f}_i^{pressure} + \mathbf{f}_i^{viscosity} + \mathbf{f}_i^{external}. \qquad (2.19)$$

6: Compute the acceleration
$$\mathbf{a}_i = \frac{\mathbf{F}_i}{\rho_i}. \qquad (2.20)$$

7: Solve the ODE using leap-frog integration to get u_t, the velocity at time t, for each particle by using the following steps:

(a) The velocity at time t + ½Δt is computed by
$$\mathbf{u}_{t+\frac{1}{2}\Delta t} = \mathbf{u}_{t-\frac{1}{2}\Delta t} + \Delta t\, \mathbf{a}_t, \qquad (2.21)$$

(b) then the position at time t + Δt is computed by
$$\mathbf{r}_{t+\Delta t} = \mathbf{r}_t + \Delta t\, \mathbf{u}_{t+\frac{1}{2}\Delta t}. \qquad (2.22)$$

(c) The scheme is started from the initial velocity u_0 by
$$\mathbf{u}_{-\frac{1}{2}\Delta t} = \mathbf{u}_0 - \tfrac{1}{2}\Delta t\, \mathbf{a}_0, \qquad (2.23)$$

(d) and u_t is approximated by the average
$$\mathbf{u}_t \approx \frac{\mathbf{u}_{t-\frac{1}{2}\Delta t} + \mathbf{u}_{t+\frac{1}{2}\Delta t}}{2}, \qquad (2.24)$$

where Δt is the integration time step.
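To make the per-particle structure of Algorithm 1 concrete, the following CUDA sketch implements steps 1 and 2 (density and pressure) with one thread per particle. It is a minimal illustration written for this report, not Fluidsv3's actual kernel: it loops over all particles instead of using a neighbour grid, and the type and parameter names (Particle, SimParams) are assumptions.

#include <cuda_runtime.h>

// Minimal sketch of steps 1-2 of Algorithm 1: one thread per particle,
// brute-force neighbour loop, poly6 kernel (2.12) for the density.
struct Particle { float3 pos; float density; float pressure; };
struct SimParams { float h; float mass; float restDensity; float stiffness; int numParticles; };

__global__ void computeDensityPressure(Particle* p, SimParams params)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;        // eq. (2.25)
    if (i >= params.numParticles) return;

    float h2 = params.h * params.h;
    float poly6 = 315.0f / (64.0f * 3.14159265f * powf(params.h, 9.0f));

    float density = 0.0f;
    for (int j = 0; j < params.numParticles; ++j) {       // brute force; Fluidsv3 uses a spatial grid
        float3 d = make_float3(p[i].pos.x - p[j].pos.x,
                               p[i].pos.y - p[j].pos.y,
                               p[i].pos.z - p[j].pos.z);
        float r2 = d.x * d.x + d.y * d.y + d.z * d.z;
        if (r2 < h2) {
            float diff = h2 - r2;
            density += params.mass * poly6 * diff * diff * diff;   // eq. (2.15)
        }
    }
    p[i].density  = density;
    p[i].pressure = params.stiffness * (density - params.restDensity);  // eq. (2.16)
}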

2.2 GPU Parallel Programming

This section discusses how the GPU became prevalent in parallel programming and the high-level technology behind it.

2.2.1 General-Purpose computing on Graphics Processing Units (GPGPU)

Originally, the GPU was intensively used in the gaming industry due to its high throughput. Each of the GPU's cores processes data for different pixels in parallel. GPUs keep evolving to have more and faster computation cores. This attracts applications in different industries which rely on repetitive and parallel computations. GPU utilization for non-graphical applications started in 2003 by adopting the already existing high-level shading languages such as DirectX, OpenGL and Cg [15]. However, this approach has several shortcomings. Firstly, the users need extensive knowledge about computer graphics APIs and the GPU architecture. Secondly, the computation problem has to be represented in terms of vertex coordinates, textures and shaders. Thirdly, some basic programming features such as random memory reads and writes were not available. Lastly, no GPU at that time could provide double precision floating point, which is important for some applications. To address these shortcomings, new programming frameworks such as Compute Unified Device Architecture (CUDA) from Nvidia and OpenCL from the Khronos Group were developed. Their main goal is to provide more suitable tools to harness the GPU's potential for general applications. These new frameworks make GPU programming a real general-purpose programming language.

2.2.2 CUDA

In 2007, Nvidia launched a new GPGPU framework designed for general-purpose programming, known as CUDA [16]. CUDA is a framework that gives programmers the possibility to run general-purpose parallel-programming applications written in C, C++, Fortran, or OpenCL on CUDA-capable Nvidia GPUs. The CUDA toolkit accommodates a comprehensive software development environment in the C or C++ languages, which makes it possible to adopt almost all C or C++ language capabilities. In CUDA, every parallel program is written in a special function called a kernel. The CUDA kernel is called from the CPU (host) and executed on the GPU (device). Kernel functions are executed by a set of threads in a Single Instruction Multiple Data (SIMD) fashion.

SIMD enables a single instruction to be executed by many processing threads at the same time. Each thread handles its own input and output for the computation. Each kernel execution creates a logical structure of threads in a hierarchical order. It starts with a thread grid consisting of many thread blocks, while each block consists of many threads. The number of blocks and threads needs to be defined before kernel execution. Threads, thread blocks and thread grids and their corresponding hierarchical memory structure are presented in Figure 2.1. Each thread has its own local memory, which is the GPU register file, and threads in a thread block can access their block's shared memory. Threads in each grid that execute the same kernel function have access to the application's global memory.

Figure 2.1. Hierarchy of threads, thread blocks and thread grids with the corresponding memory allocation in CUDA [15]

CUDA provides a unique ID for each thread within its block and a unique ID for each block within its grid. Additionally, a unique grid ID is provided when more than one grid is created at the same time. It is possible to access an arbitrary thread within each grid by combining the block ID and thread ID. For example:

index = (blockIdx.x * blockDim.x) + threadIdx.x.   (2.25)
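As an illustration of how the index in (2.25) is used, the generic kernel below (not taken from Fluidsv3) processes one array element per thread; the guard against index >= n is needed because the grid size is rounded up to whole blocks.

// Generic CUDA indexing example: each thread handles one array element.
__global__ void scaleArray(float* data, float factor, int n)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;  // eq. (2.25)
    if (index < n)                                      // guard: the grid may cover more than n elements
        data[index] *= factor;
}

// Host-side launch: choose a block size that is a multiple of the warp size (32)
// and round the number of blocks up so that all n elements are covered.
void launchScale(float* d_data, float factor, int n)
{
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleArray<<<blocks, threadsPerBlock>>>(d_data, factor, n);
}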

The variable index is used to access a unique thread index in each block within one grid.

In each GPU there are many computation cores; Nvidia calls them CUDA cores and groups them into sets called Streaming Multiprocessors (SM). In the SIMD model a single instruction is executed by many processing threads; this is done in a subset of an SM called a warp, which consists of 32 CUDA threads. This number comes from Nvidia's existing GPU architecture design. All threads in a block are placed into warps. Every thread in a warp has a sequential index, so the first 32 threads, with indices 0 to 31, will be in the same warp, the next 32 threads will be in the next warp, and so on. Warps are executed asynchronously with respect to each other. Each thread in a warp is executed in parallel at the same time, so if one thread delays its execution then the whole warp is delayed. For this reason it is good practice to have the threads in a warp access consecutive memory addresses, so that GPU global-memory accesses are reduced by utilizing the coalesced data loaded from it. It is common practice to use a multiple of 32 as the number of threads in a block. If we assign 16 threads to a block, then each block is still counted as one warp of 32 threads with 16 threads inactive, which reduces the overall throughput. Each Nvidia GPU architecture has its own hardware limitations that can affect the choice of the appropriate number of threads in a block. The limitations we usually look into are:

- the maximum number of threads in a block
- the maximum number of active blocks an SM can handle at the same time
- the allocated shared memory in an SM
- the allocated register size in an SM
- the number of CUDA cores in an SM.

Shared memory is a block of memory that has lower latency than the GPU's global memory. The amount is limited per SM and is usually used to store data which are repeatedly used by the threads in a block, so that the threads do not need to access global memory, which has higher latency. The amount of shared memory assigned to a block affects the number of active blocks in an SM. For example, if we allocate 128 threads per block and want to use 256 bytes of shared memory for each block, then the number of active blocks per SM is

active_blocks = total_shared_memory / 256,   (2.26)

because we have to divide the available shared memory among all active blocks. If the number of active blocks is small, this can result in a small number of active warps that an SM handles at the same time.

The warp scheduler will execute a warp that has all of its computation data available for every thread in that warp. If one thread in a warp is still waiting for data, then that warp will not be executed and will be scheduled again later. The GPU's high throughput comes from utilizing all computation cores at the same time. Therefore, we want a sufficient number of active blocks, or more precisely warps, in an SM, so that when one warp is not ready there are other warps ready to be executed. In short, to maximize performance we want to keep the CUDA cores in an SM as busy as possible instead of idle. CUDA cores are the computation engines. Each executing thread in a warp is processed by a core. Each SM has a limited number of cores; therefore there is a limited number of warps being processed at the same time within an SM. A GPU with more and faster cores gives higher throughput. In this project the Nvidia GTX 980 gaming graphics card is used. Compared to the most advanced Nvidia High Performance Computing (HPC) GPU card, the Nvidia Tesla K80, the GTX 980 costs just 10-15% of its price. The Nvidia GTX 980's peak processing power is 4612 GFLOPS (Giga Floating Point Operations Per Second) for single precision and 144 GFLOPS for double precision, while the Nvidia Tesla K80's peak processing power is 8740 GFLOPS for single precision and 2910 GFLOPS for double precision. Figure 2.2 shows the diagram of one Nvidia GTX 980 Streaming Multiprocessor (SM). The Nvidia GTX 980 has 16 SMs, each with 128 CUDA cores distributed over 4 warp schedulers, and 96 KB of shared memory.

Figure 2.2. The Nvidia GTX 980 is codenamed Maxwell, so its Streaming Multiprocessor (SM) is called SMM. Block diagram of an SMM [17]
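To relate the hardware limits above to equation (2.26) in practice, the CUDA runtime can report how many blocks of a given kernel fit on one SM for a chosen block size and shared-memory allocation. The sketch below is a generic illustration and assumes a trivial kernel named myKernel; it is not part of Fluidsv3.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float* data) { if (data) data[threadIdx.x] = 0.0f; }  // placeholder kernel

// Query how many blocks of myKernel can be resident on one SM given the block
// size and the dynamic shared memory requested per block (cf. eq. (2.26)).
void reportOccupancy()
{
    int activeBlocksPerSM = 0;
    int threadsPerBlock = 128;
    size_t dynamicSharedMem = 256;   // bytes of shared memory requested per block

    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&activeBlocksPerSM, myKernel,
                                                  threadsPerBlock, dynamicSharedMem);

    printf("Active blocks per SM: %d (%d resident warps)\n",
           activeBlocksPerSM, activeBlocksPerSM * threadsPerBlock / 32);
}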

3. SPH for Fluids Simulation

This section discusses one possible implementation of SPH for fluids simulation using Matlab. The motivation for using a different implementation platform for SPH-based fluids simulation is given as well.

3.1 Acceleration of MATLAB code using GPUs

One of ABB's in-house implementations of SPH is a modification of a master thesis work on SPH [6]. It is written in Matlab, which uses the CPU's computational power. For this acceleration effort we use an Intel i7 quad-core at 4 GHz with 32 GB of RAM and MATLAB R2015a. According to the Matlab guidelines [18], there are several methods by which GPU computational power can be used in Matlab in order to reduce the computation time. The first method is to allocate the variables with a high computation cost on the GPU using the gpuArray command and let Matlab manage the rest. This approach needs 14x more computation time than ABB's in-house implementation. This is due to the inefficient code structure and the frequent memory copying between GPU and CPU at each iteration. The second method is to collect all computationally intensive pieces of the code into a MATLAB function and run this function on the GPU. Unfortunately, there are some limitations on the structure and contents of the code. Firstly, the code should use only Matlab functions that are implemented on the GPU [18]. Secondly, the code should not use index operations and all computations should be done element-wise. With these restrictions, only a few lines of code can be run on the GPU, resulting in a result similar to the previous method. The third method is to call CUDA kernel files directly from Matlab. However, this method does not provide any benefit in terms of effort: in order to fully benefit from the effort of writing CUDA kernels, the whole implementation should be in a C/C++ environment. Therefore, we do not attempt this method. The fourth method is code refactoring. The code branches and loops are replaced by matrix operations where possible. Table 3.1 shows the result of code refactoring when computing 500 iterations of the fluids simulation. The speed improvement is significant, but it is not fast enough to be used in real-time feedback control systems.

There are three main Matlab functions in the implementation: DoubleDensityRelax, ViscousImpulse, and TemperatureCompute.

Table 3.1 Matlab Refactored Benchmark for 500 iterations

| Matlab Function    | Original Execution Time (s) | Refactored Execution Time (s) | Speed Improvement |
| DoubleDensityRelax |                             |                               | 14x               |
| TemperatureCompute |                             |                               | 14x               |
| ViscousImpulse     |                             |                               | no improvement    |

As shown in Table 3.1, the code refactoring improves the speed of computing DoubleDensityRelax and TemperatureCompute by 14x. There is no improvement for ViscousImpulse, because its loop involves a sequential process where each iteration uses the previous iteration's data. To animate the fluid, another piece of Matlab code dealing with visualization is used. This code adds another 6387 seconds, with no possibility of refactoring because it consists of one for-loop. Altogether this gives an average of 30.5 seconds per iteration over the 500 simulated iterations. This is obviously longer than the period of a real-time feedback control system, which is less than a second. Therefore, the Matlab implementation is not suitable and a different approach is required.

3.2 GPU Implementation of Fluids Simulation using the SPH Method

Section 2.2 describes the motivation for using the GPU's high throughput to solve computationally extensive problems, specifically problems with high arithmetic intensity that can be reformulated as parallel computation code. An arithmetic problem that is associative and commutative is one example of an ideal problem for a GPU such as Nvidia's, because the computation can be done in any order, which is not the case for subtraction or division operations. In SPH, the fluid is approximated using a number of particles. In each iteration every particle's position is computed based on its distance to its neighbouring particles. However, there is no order in which the particles must be computed. Particle A can be computed first or simultaneously with particle B without changing the result, as long as all particles have been computed before the next iteration starts. This means that a multi-core parallel programming framework can be used to compute the particles simultaneously. Some of the first implementations of SPH on the GPU, using OpenGL and Cg, were done in [19] and [20]. Yan et al. [1] gave an alternative SPH algorithm that uses a non-uniform particle model, which means the particles can split or merge. It incorporates an adaptive surface tension model for better stability.

After Nvidia launched CUDA, fluids simulation using the SPH method on GPUs became more common, with promising speed-up results compared to the CPU. Hérault et al. [2] divided their SPH implementation into three parts: neighbour-list construction, force computation, and Euler integration. The speed-up is different for each part: force computation had the largest speed-up of 207 times compared to the CPU, while neighbour-list construction was 15.1 times and Euler integration 23.8 times faster. Gao et al. [21] measured the simulation frame rate improvement in their work and managed to get a speed-up of nearly 140 times in their particle simulation.

3.2.1 SPH Implementation with Fluidsv3

Based on the literature research we conclude that a GPU implementation is more efficient than a CPU implementation for fluids simulation with SPH, especially in terms of simulation speed. To shorten the learning curve, we decided to use Fluidsv3, an open-source SPH fluids simulation. The Fluidsv3 [7] source code is provided as is and simulates ocean waves using the SPH method. The code is written in C/C++ with the CUDA Toolkit. We changed the implementation code for our needs and optimized it to reduce the simulation time, which is the main topic of Section 4.

3.2.2 Alternative SPH Implementations

During the implementation phase, we found several other SPH implementations using a GPGPU framework that are of interest and could be investigated in the future:

- The OpenWorm project [4], an open-source project which aims to create a virtual C. elegans nematode in a computer using OpenCL.
- The DualSPHysics research group [5], a research group of several universities in Europe. They provide a CPU and GPU (CUDA) implementation of SPH for research and applications in fluids simulations.
- Other computer graphics research groups at ETH Zürich [22] and the University of Freiburg [23] have done research and provided different solutions using SPH methods.

3.3 Other GPGPU Platforms

The two biggest GPU producers are Nvidia and AMD. Using the Fluidsv3 code means using the CUDA toolkit, and using the CUDA toolkit means we have to use Nvidia GPUs. Nevertheless, this is not the only way to implement GPU parallel programming. OpenCL supports more heterogeneous parallel programming hardware (CPU, GPU, FPGA, etc.) as long as it has a multi-core structure.

It is an open question whether to choose CUDA, OpenCL or another GPU parallel programming language; this is not discussed in this work.

4. GPGPU Fluids (Water-Jet) Simulation

This section describes the modification of the original Fluidsv3 implementation and our efforts to further improve its performance. An overview of the original Fluidsv3 algorithm is presented in Appendix A.

4.1 GPU Performance Benchmarking

To improve the simulation speed, the GPU's performance during simulation is analyzed. The results are used to look for parts of the implementation that can be improved to achieve faster simulations. There are two tools from Nvidia that can be used to analyse the performance of a CUDA program: Nvidia Nsight Performance Analysis and Nvidia Visual Profiler [24], [25]. Both tools offer similar performance analysis capabilities but have some differences. Nsight Performance Analysis is a plug-in for Microsoft Visual Studio or Eclipse and can only be executed from either of these two IDEs. Nvidia Visual Profiler is a stand-alone tool that can be used to analyse an executable file regardless of the IDE used. Another difference is that Nvidia Visual Profiler offers a Guided Analysis mode that automatically finds the CUDA kernels of the program to be optimized and suggests how to optimize them. Nsight Performance Analysis offers a more elaborate benchmark of the application. It analyses not only the CUDA kernels but also the program's interaction with system APIs and visualization APIs such as DirectX and OpenGL. We use Nsight Performance Analysis because of its elaborate benchmarks and because Nvidia Visual Profiler's Guided Analysis mode is too general for our problem. Nsight Performance Analysis has two main benchmark tools: Trace and Profiler-CUDA. Trace is used to measure the GPU utilization and OpenGL activity, while Profiler-CUDA is used to get the performance report of the CUDA kernels executed in the application. More information about both tools is found in [24] and [25]. The experiment under consideration is the computation of fluid particles which simulate a water-jet spray. The experiment starts with no particles and is terminated when the number of particles reaches 1 million. There are three measurements that we include in the report:

1. GPU Utilization: the percentage of the overall benchmark time during which the GPU was utilized. A higher value means the GPU is idling less during the simulation. This information is provided by Nsight Performance Analysis - Trace.
2. GFLOPS: a weighted sum of all executed single precision operations per second. A higher value means that the GPU spends more time in computation, which also means that the algorithm uses the GPU's computation power more efficiently. The Nvidia GTX 980 has a maximum computation power of 4612 GFLOPS for single precision operations. This information is provided by Nsight Performance Analysis - Profiler-CUDA.
3. Iteration Time: the time consumed to compute one simulation iteration, in milliseconds. A lower value means more computation iterations per second, which also means a faster simulation. A low iteration time is important for our optimization method. This information is provided by Fluidsv3's internal timing function.

4.2 Water-Jet Model

This work is about a specific case of accelerated fluids simulation: a water-jet spray. This model is chosen because of its use in simulating the coolant in hot rolling mills, which is one of ABB's areas of interest. In the real world water-jet sprays have many applications, such as fountains, shower sprays, cleaning sprays, fire extinguishers, etc. For hot rolling mill cooling, pure water with adjustable velocity and volume is sprayed from many inlets onto the hot metal surface in order to cool it down. To simulate this, several parameters are introduced: the initial speed to adjust the velocity, the inlet radius to adjust the volume, and the number of inlets to include in the simulation. In order to simulate the flow of the water-jet spray, a new layer of particles is added at the initial position in each time step. The shape and number of particles used as a new layer are constant and based on the simulation configuration. The layers gradually build up and simulate the flow of water computed by the SPH method. Although in the real world the water could be sprayed in the opposite direction, against gravity, this work only simulates water sprayed in the direction of gravity.

4.3 Simulation Modification to Fluidsv3

Figure 4.1 shows Fluidsv3's original application, a continuous ocean-wave simulation. The simulation model under consideration is the water-jet spray, and in order to obtain that visualization the following modifications are required:

Figure 4.1. Fluidsv3 continuous ocean-wave simulation

- Particle initialization: the number of particles and the particle positions. Fluidsv3 initializes all fluid particles at the beginning of the simulation, while we add particles layer by layer at every time step to simulate a sprayed fluid (a sketch of this per-time-step injection is given after this list). Figure 4.2 shows zoomed-in layers of particles.

Figure 4.2. Three layers of initialized particles

- Simulation boundary. Fluidsv3 simulates a continuous ocean wave in a box-shaped simulation space. The water-jet spray model under consideration has a single solid boundary at the bottom.
- Reusing deleted particles' memory. If particles bounce outside the simulation area they are deleted, and the memory space they used is cleared so that it can be used for new particles.
- Changing the particles' initial speed. Our fluids simulation is a water-jet spray which injects fluid from top to bottom along the Y-axis. The initialized particles are given an initial speed.

In summary, Fluidsv3's simulation parameters are modified according to Table 6.3 in the Appendix.
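The per-time-step injection mentioned in the first item can be sketched as follows. The emitter structure, the regular sampling of the inlet disc and the function names are illustrative assumptions made for this report and do not reproduce Fluidsv3's actual emitter code.

#include <cuda_runtime.h>

// Sketch of per-time-step particle injection: each time step, one new layer of
// particles is created at the inlet position with the configured initial speed.
struct Emitter {
    float3 inletCenter;   // position of the inlet
    float  inletRadius;   // controls the injected volume
    float  spacing;       // distance between particles within the layer
    float3 initialVel;    // e.g. (0, -initialSpeed, 0): sprayed along gravity
};

void emitLayer(const Emitter& e,
               void (*addParticle)(float3 pos, float3 vel))  // adds or reuses a particle slot
{
    // Sample the inlet disc on a regular grid; points outside the radius are skipped.
    for (float x = -e.inletRadius; x <= e.inletRadius; x += e.spacing) {
        for (float z = -e.inletRadius; z <= e.inletRadius; z += e.spacing) {
            if (x * x + z * z > e.inletRadius * e.inletRadius) continue;
            float3 pos = make_float3(e.inletCenter.x + x,
                                     e.inletCenter.y,
                                     e.inletCenter.z + z);
            addParticle(pos, e.initialVel);   // can reuse memory of deleted particles
        }
    }
}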

4.4 Visual Correctness Modification

Figure 4.3 shows the result of the modification: four inlets of water-spray jets hitting a solid surface simultaneously.

Figure 4.3. The upper view of the original boundary handling method

The first step is to verify the correctness of the model from the physical standpoint. Figure 4.4 shows the bottom side of the simulation model, where the fluid is penetrating the boundary. This is not physically correct, since particles are penetrating through the surface, which represents the steel strip (not shown).

Figure 4.4. Hidden boundary particle diameter set to 0.03 (simulation units)

4.4.1 Fluidsv3 Original Boundary Handling

Fluidsv3's original boundary handling uses hidden boundary particles. The hidden particles' diameter acts as a damping area that brakes the fluid movement. Fluid particles receive an opposing force whenever they enter this area. In other words, particles are repelled when they penetrate the hidden boundary particles: the deeper the penetration, the stronger the opposing force. Equation (2.22) shows that the particle positions at time step t + Δt depend on the velocity at time step t + ½Δt, which is highly affected by the initial velocity u_0 in equation (2.23). As the iterations go on, the value of u_0 accumulates in u_{t+½Δt} and hence affects how far the particles move in each iteration, which in the end decides whether the particles penetrate the solid surface boundary or not. An experiment using u_0 of 20 m/s and a hidden particle diameter of 0.03 x (simscale value) meter results in some particle positions being computed almost beyond the damping area; thus the opposing force is not strong enough and particles leak through the surface. Figure 4.4 shows the fluid passing the solid surface (not drawn). A simple fix for this case is to use a bigger hidden particle diameter. Using a hidden particle diameter of 0.06 in simulation units prevents the particles from penetrating the surface. Figure 4.5 shows that the particles do not penetrate the solid boundary when we use the bigger hidden particle diameter. The side penetration occurs because those fluid particles are beyond the boundary area. A bigger hidden particle diameter provides a bigger braking area, so the particles have more space to reduce their velocity and there is no boundary penetration.

Figure 4.5. Hidden boundary particle diameter set to 0.06 (simulation units)

4.4.2 Ihmsen et al. Boundary Handling

The solution above depends on many parameters, such as the initialization speed or the gas stiffness constant.

For example, if we want to use an initial speed of 60 m/s, which might be too high for a water jet, an even bigger hidden particle diameter is required to prevent particles from penetrating the boundary. Clearly, a better boundary handling mechanism can improve the simulation. After a literature review, we decided to try a new boundary handling method proposed by Ihmsen et al. [26]. This method was originally used with PCISPH [27] but can also be used with SPH. Figure 4.6 shows that this method gives a fluid splashing or repelling effect, which is usually present when a high-speed fluid hits a solid surface, whereas Figure 4.3 does not show any repelling effect. Figure 4.7 shows that the particles do not penetrate the boundary even though a hidden particle diameter of 0.03 (simulation units) and an initial speed of 60 m/s are used.

Figure 4.6. The upper view of the Ihmsen et al. boundary handling

Figure 4.7. The side view of the Ihmsen et al. boundary handling

Observing the visualization results and the ability to use a wider range of initial speeds or gas stiffness constants, it is concluded that the Ihmsen et al. boundary method is better than Fluidsv3's original method in terms of the physical correctness of the simulation. Thus, this method is used for the further code optimization.

4.5 Parallel Programming Optimization

We use the Nsight Performance Analysis tool mentioned in Section 4.1 to gauge the GPU's performance in our simulation and make the simulation run faster. The unoptimized water-jet simulation has a long iteration time and a GPU utilization of only 36%, which means that most of the GPU time is spent idling, waiting for its turn. In order to improve the GPU utilization and reduce the computation time per iteration, the following methods were implemented one by one.

Reduce-Malloc

Nsight Performance Analysis shows that most of the GPU time is spent waiting for cudaMalloc and cudaMemcpy to finish their operations, making these operations the bottleneck. The Nvidia GTX 980 has 4 GB of RAM which, based on Fluidsv3's data structure, is big enough to store the data of 22 million particles. By allocating GPU memory for 22 million particles at the beginning of the simulation, cudaMalloc is only called once. The cudaMemcpy operations are also called only once, to copy the initialization particle data from the host (CPU) to the device (GPU). This results in an improvement of the GPU utilization to 75.6% and a reduction of the iteration time.

Graphics-Interoperability

Fluidsv3 uses OpenGL to render the animation, and during this process it needs to copy the data from the computation memory to the OpenGL memory. In [28], graphics interoperability means that the same GPU memory buffer is used to store data for rendering and for computation results. This method removes the need to copy data from the computation memory to the rendering memory; a minimal sketch of this buffer sharing is given below. Using this approach, the GPU utilization is improved to 87.2% and the iteration time is reduced further.
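The sketch below shows the typical CUDA-OpenGL buffer-sharing pattern behind this optimization. The buffer and function names are illustrative and error checking is omitted; it shows the general mechanism rather than Fluidsv3's exact code.

#include <cuda_gl_interop.h>

// The OpenGL vertex buffer that is rendered is registered once with CUDA, then
// mapped each iteration so the kernels write particle positions directly into it.
cudaGraphicsResource* particleVboResource = nullptr;

void registerParticleBuffer(GLuint particleVbo)           // done once at start-up
{
    cudaGraphicsGLRegisterBuffer(&particleVboResource, particleVbo,
                                 cudaGraphicsMapFlagsWriteDiscard);
}

void runIteration()
{
    float3* d_positions = nullptr;
    size_t  numBytes = 0;

    cudaGraphicsMapResources(1, &particleVboResource, 0);
    cudaGraphicsResourceGetMappedPointer((void**)&d_positions, &numBytes,
                                         particleVboResource);

    // ... launch the SPH kernels, writing the updated positions to d_positions ...

    cudaGraphicsUnmapResources(1, &particleVboResource, 0);
    // OpenGL can now render the buffer without any extra device or host copy.
}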

GPU Shared-Memory

Nsight Performance Analysis also shows that there are two CUDA kernel functions that consume most of the GPU computation resources: computePressure (19%) and computeForce (64%). Improving the computation speed of these functions should improve the total simulation speed. For the improvement and benchmark try-out we chose the kernel function computePressure because it has a simpler code structure than computeForce. Figure 4.8 shows the warp issue efficiency for the computePressure kernel function, which is the GPU's ability to issue instructions during the kernel's execution cycles [24]. Instructions are executed in a warp, which is a set of threads. A higher percentage means that more instructions are executed, but for computePressure 45.97% of all computation cycles do not issue any instruction. Figure 4.9 shows that memory dependency, the condition where a memory store or load cannot be done because resources are either not available or fully utilized, is the main cause of the kernel's inability to issue instructions. The question is then: which operations are causing this memory dependency problem?

Figure 4.8. Warp issue efficiency for the computePressure CUDA kernel

Of the variety of data that the computePressure kernel needs to retrieve from the GPU's global memory, particle position data are requested most frequently. The kernel needs to compute every particle's distance to its neighbouring particles. The more neighbours a particle has, the more data accesses the kernel performs for that particle. All particles used in the SPH method are placed in a grid of equally sized cubes, as mentioned in Appendix A. Figure 4.10 restates Figure 6.2 in the Appendix, which depicts a 3x3x3 grid structure where each grid cell is represented by a coloured circle. Each grid cell contains a different number of particles.

Figure 4.9. Issue stall reasons for the computePressure CUDA kernel

Figure 4.10. Grid neighbours. The yellow circle represents the centre grid cell and the rest are the neighbouring cells. The dotted lines are for visualization purposes only and do not represent any information.

All particles in the centre cell, the yellow circle, have the same neighbouring particles located in the neighbouring cells. Therefore, all particles in the centre cell request the same neighbouring particle positions, which results in redundant data requests to the GPU's global memory. If we can store these often-requested data in a place closer than the GPU's global memory, then we can provide the data to the CUDA kernel faster. As a result, this improves the warp efficiency and allows more computations per cycle [29]. This closer memory is called GPU shared memory. All Nvidia GPUs that support CUDA have shared memory. It is a memory that resides on the GPU chip itself and has roughly 100 times lower latency than uncached GPU global memory, which resides outside the GPU chip, provided that there are no bank conflicts between the threads [30]. Goswami et al. [31] used shared memory for their SPH fluids simulation on an Nvidia GPU. Although that paper did not compare the results of using versus not using shared memory, it was decided to use shared memory in the SPH computations. The goal is to investigate whether it is possible to achieve any further computational improvements, since our implementation differs from Goswami's. In order to reduce the data traffic to global memory, all the neighbouring particles' data are placed in GPU shared memory and the centre cell's particles access the shared memory; a simplified sketch of this staging is given below. However, this can only be done by introducing extra lines of code to copy the required data into shared memory. Additionally, it is important to keep track of the memory offsets in such a way that each centre-cell particle accesses the correct shared-memory offset. Accessing the wrong offset results in an unstable or wrong simulation, or even run-time errors. Our project uses the Nvidia GTX 980, which has 96 KB of shared memory per streaming multiprocessor (SM), where each SM can handle 32 thread blocks simultaneously. A particle's position is a float3 vector, so each position requires 12 bytes of memory. With the limited memory space we need extra code to repeatedly re-stage the necessary particle data. The benchmark shows that the GPU utilization goes up to 95%, but the iteration time increases. The number of operations of the computePressure kernel function drops to 85 GFLOPS from 747 GFLOPS. This is caused by the extra code, which is not parallelized and introduces many sequential loops. The extra code makes the GPU busier, hence the increase of the GPU utilization to 95%. Nevertheless, since this code is not executed in parallel, the number of operations drops to 85 GFLOPS.
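The fragment below sketches the staging idea: a thread block cooperatively loads a tile of a neighbouring cell's particle positions into shared memory before all of its threads reuse them. The tile size, the names, and the assumption that one thread block handles one grid cell and one neighbouring cell at a time are simplifications made for this report and do not match our implementation's exact data layout.

#include <cuda_runtime.h>

#define TILE_SIZE 128

// Simplified sketch: stage a neighbouring cell's positions in shared memory,
// then let every thread of the block reuse the staged data for its density sum.
__global__ void densityWithSharedMemory(const float3* positions, int numParticles,
                                        const int* cellStart, const int* cellCount,
                                        float* density, int neighbourCell,
                                        float h2, float mass, float poly6Coeff)
{
    __shared__ float3 sharedPos[TILE_SIZE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    bool active = (i < numParticles);
    float3 pi = active ? positions[i] : make_float3(0.0f, 0.0f, 0.0f);
    float rho = 0.0f;

    int start = cellStart[neighbourCell];
    int count = cellCount[neighbourCell];

    // Process the neighbouring cell in tiles that fit in shared memory.
    for (int base = 0; base < count; base += TILE_SIZE) {
        int loadIdx = base + threadIdx.x;
        if (threadIdx.x < TILE_SIZE && loadIdx < count)
            sharedPos[threadIdx.x] = positions[start + loadIdx];   // one coalesced load per tile
        __syncthreads();

        int tileCount = min(TILE_SIZE, count - base);
        if (active) {
            for (int j = 0; j < tileCount; ++j) {                  // reuse data loaded by the block
                float dx = pi.x - sharedPos[j].x;
                float dy = pi.y - sharedPos[j].y;
                float dz = pi.z - sharedPos[j].z;
                float r2 = dx * dx + dy * dy + dz * dz;
                if (r2 < h2) {
                    float diff = h2 - r2;
                    rho += mass * poly6Coeff * diff * diff * diff;
                }
            }
        }
        __syncthreads();
    }
    if (active) density[i] += rho;
}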

4.6 Predictive-Corrective Incompressible SPH (PCISPH)

In [7], Fluidsv3's author suggests PCISPH, a different SPH method, to improve the fluids simulation speed. It is an SPH method that enforces fluid incompressibility [27]. The PCISPH method predicts density fluctuations, which are then corrected using pressure forces. These prediction and correction steps are repeated several times to get the lowest density fluctuation, in order to enforce fluid incompressibility. The algorithm is described in Algorithm 7 in the Appendix; an outline of the prediction-correction loop is also sketched at the end of this section. The PCISPH authors state that it is possible to use a time step 35x bigger than with SPH while retaining the simulation's visualization. This is an interesting feature, since the time step determines how much time the simulation advances in one iteration. A bigger time step means that in each computation iteration the simulation predicts a longer duration of how the fluid would behave in the real world. For example, 10 iterations with a time step of 0.001 second equal 0.01 second of simulated real-world time, but with a bigger time step of 0.01 second, 10 iterations equal 0.1 second of simulated real-world time. However, we did not succeed in implementing PCISPH with a bigger time step, and because of time limitations we did not investigate this further. Instead, we obtained a different fluid speed visualization when using PCISPH. Table 6.4 in the Appendix shows the simulation configuration with PCISPH, which is the same as SPH's configuration except for the initial speed value. Furthermore, instead of the gas stiffness constant, PCISPH uses the pcisphDelta parameter. Figure 4.11 shows the visualization effect of PCISPH with 0.5 million particles. The simulation shows a fluid that progresses at a faster speed compared to the original SPH algorithm.

Figure 4.11. PCISPH simulation when it reaches 0.5 million particles

Appendix B explains the water-jet spray algorithm using SPH and PCISPH as a whole.
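For reference, a host-side outline of one PCISPH time step, paraphrased from the description in [27] and Algorithm 7 in the Appendix, is given below. The function names are placeholders for the corresponding CUDA kernels and do not match Fluidsv3's code verbatim.

// Outline of one PCISPH time step (placeholder kernel wrappers, not Fluidsv3's API).
void findNeighbours();                    // neighbour search, done once per time step
void computeViscosityAndExternalForces(); // forces kept fixed during the correction loop
void predictVelocitiesAndPositions();     // integrate with the current pressure forces
void computePredictedDensityAndError(float* maxDensityError);
void updatePressureFromDensityError();    // p += pcisphDelta * (rho_predicted - rho0)
void computePressureForces();
void integrateTimeStep();                 // final leap-frog integration

void pcisphStep(float densityErrorTolerance, int minIterations, int maxIterations)
{
    findNeighbours();
    computeViscosityAndExternalForces();

    float maxDensityError = 0.0f;
    int iter = 0;
    // Prediction-correction loop: repeat until the density fluctuation is small enough.
    while ((maxDensityError > densityErrorTolerance || iter < minIterations)
           && iter < maxIterations) {
        predictVelocitiesAndPositions();
        computePredictedDensityAndError(&maxDensityError);
        updatePressureFromDensityError();
        computePressureForces();
        ++iter;
    }
    integrateTimeStep();
}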

Figure 4.12. SPH simulation when it reaches 0.5 million particles

4.7 Known Limitations of the Water-Spray Simulation with the SPH Method

There are several known limitations in the current water-spray jet implementation. This section describes them and how they affect the simulation.

4.7.1 Particles Scaling

Fluidsv3's original simulation uses the SimScale parameter to scale up the distance between particles for better visualization. The problem here is that changing the parameter does not rescale the visualized particle size. This can result in a wrong visualization, such as a fluid visualized by very dense particles even though it is not dense.

4.7.2 Incompressibility and Simulation Stability

Equation (2.9), which is used to compute p, is restated as

$$p = k(\rho - \rho_0). \qquad (4.1)$$

In equation (4.1), p measures the pressure for the particles and ρ_0 is the fluid's expected rest density during the simulation. The water is modeled as an artificially incompressible fluid with the Navier-Stokes equations, meaning that the fluid should have the same density in every part of it and at every time step. The particles move at each time step and change the fluid density. Therefore, the equation above is used in order to enforce the incompressible fluid. When ρ < ρ_0 the particles experience an attraction force, and if ρ > ρ_0 the particles experience a repelling force. The question here is how big k should be. A bigger k gives a stronger attraction-repulsion force and makes the fluid's ρ become equal to ρ_0 faster, so that particles are not congested in the same place for too many time steps.

Reduced congestion of particles in the same smoothing-radius area means fewer neighbouring particles and less computation per iteration, and vice versa. However, if a too large k is used, then a much smaller time step is required, otherwise the simulation becomes unstable. A bigger k also produces a bigger force, which results in a bigger particle movement in one time step. If the particles move further than the smoothing kernel radius in one time step, then the attraction force is not computed and is thus lost. If many particles experience this, then the fluid particles look as if they are blown apart.

4.7.3 Initial Speed vs. Time Step

Since we simulate a water-jet spray, the fluid's initial speed is at our discretion. In the simulation, the initial speed value is accumulated in the particle velocities, which determine how far a layer of fluid particles moves in one time step. This makes the initial speed very influential for the particle movement. In each time step a new layer of fluid particles is added at the same simulation coordinates. It is important to have an initial speed that fits the time step. Using an initial speed and a time step which are both too small results in a bloated clump of particles instead of a flowing fluid. Figure 4.13 shows such a case. On the other hand, using an initial speed and a time step which are too big results in big gaps between the layers of particles. This is also not correct, since there is then no attraction force between the layers of particles. Figure 4.14 shows such a case. Therefore, it is important to have an initial speed that fits the time step in order to produce a stable and working simulation. However, such a combination of initial speed and time step is obtained experimentally, which is cumbersome, not ideal, and affects the usefulness of the simulation.

4.7.4 Obtaining Parameter Values

The simulation uses many adjustable parameters which determine the simulation's result and stability. Parameters such as the smoothing kernel radius, the viscosity, the particle radius, and PCISPH's delta obtained their values by trial and error. The boundary handling method based on [26] prevents any particles from penetrating the solid surface for the current simulation setting. This does not mean by default that it will work for any arbitrary initial speed. The effectiveness of the boundary handling method also depends on the diameter of the solid boundary particles and the gas stiffness constant.

Figure 4.13. A bloated clump of particles

Figure 4.14. Too big gaps between layers of particles

4.7.5 Simulation Correctness

SPH has been used in many fluids simulations, and its accuracy has been tested in different test cases, for example in [39] for the case studies Creation of waves by landslides, Dam-break propagation over a wet bed, and Wave-structure interaction, or in [37] for a hydraulic jump test case. However, the measurement of the water-jet simulation accuracy is not implemented in this work, and no proper physical correctness validation is done. Instead, the correctness of our implementation is subjectively assessed by visual analysis.


More information

2.11 Particle Systems

2.11 Particle Systems 2.11 Particle Systems 320491: Advanced Graphics - Chapter 2 152 Particle Systems Lagrangian method not mesh-based set of particles to model time-dependent phenomena such as snow fire smoke 320491: Advanced

More information

Comparison between incompressible SPH solvers

Comparison between incompressible SPH solvers 2017 21st International Conference on Control Systems and Computer Science Comparison between incompressible SPH solvers Claudiu Baronea, Adrian Cojocaru, Mihai Francu, Anca Morar, Victor Asavei Computer

More information

GPU Programming Using NVIDIA CUDA

GPU Programming Using NVIDIA CUDA GPU Programming Using NVIDIA CUDA Siddhante Nangla 1, Professor Chetna Achar 2 1, 2 MET s Institute of Computer Science, Bandra Mumbai University Abstract: GPGPU or General-Purpose Computing on Graphics

More information

Navier-Stokes & Flow Simulation

Navier-Stokes & Flow Simulation Last Time? Navier-Stokes & Flow Simulation Optional Reading for Last Time: Spring-Mass Systems Numerical Integration (Euler, Midpoint, Runge-Kutta) Modeling string, hair, & cloth HW2: Cloth & Fluid Simulation

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

Introduction to GPU hardware and to CUDA

Introduction to GPU hardware and to CUDA Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 35 Course outline Introduction to GPU hardware

More information

Acknowledgements. Prof. Dan Negrut Prof. Darryl Thelen Prof. Michael Zinn. SBEL Colleagues: Hammad Mazar, Toby Heyn, Manoj Kumar

Acknowledgements. Prof. Dan Negrut Prof. Darryl Thelen Prof. Michael Zinn. SBEL Colleagues: Hammad Mazar, Toby Heyn, Manoj Kumar Philipp Hahn Acknowledgements Prof. Dan Negrut Prof. Darryl Thelen Prof. Michael Zinn SBEL Colleagues: Hammad Mazar, Toby Heyn, Manoj Kumar 2 Outline Motivation Lumped Mass Model Model properties Simulation

More information

Surface Tension in Smoothed Particle Hydrodynamics on the GPU

Surface Tension in Smoothed Particle Hydrodynamics on the GPU TDT4590 Complex Computer Systems, Specialization Project Fredrik Fossum Surface Tension in Smoothed Particle Hydrodynamics on the GPU Supervisor: Dr. Anne C. Elster Trondheim, Norway, December 21, 2010

More information

NVIDIA. Interacting with Particle Simulation in Maya using CUDA & Maximus. Wil Braithwaite NVIDIA Applied Engineering Digital Film

NVIDIA. Interacting with Particle Simulation in Maya using CUDA & Maximus. Wil Braithwaite NVIDIA Applied Engineering Digital Film NVIDIA Interacting with Particle Simulation in Maya using CUDA & Maximus Wil Braithwaite NVIDIA Applied Engineering Digital Film Some particle milestones FX Rendering Physics 1982 - First CG particle FX

More information

GPU-accelerated data expansion for the Marching Cubes algorithm

GPU-accelerated data expansion for the Marching Cubes algorithm GPU-accelerated data expansion for the Marching Cubes algorithm San Jose (CA) September 23rd, 2010 Christopher Dyken, SINTEF Norway Gernot Ziegler, NVIDIA UK Agenda Motivation & Background Data Compaction

More information

Double-Precision Matrix Multiply on CUDA

Double-Precision Matrix Multiply on CUDA Double-Precision Matrix Multiply on CUDA Parallel Computation (CSE 60), Assignment Andrew Conegliano (A5055) Matthias Springer (A995007) GID G--665 February, 0 Assumptions All matrices are square matrices

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

Two-Phase flows on massively parallel multi-gpu clusters

Two-Phase flows on massively parallel multi-gpu clusters Two-Phase flows on massively parallel multi-gpu clusters Peter Zaspel Michael Griebel Institute for Numerical Simulation Rheinische Friedrich-Wilhelms-Universität Bonn Workshop Programming of Heterogeneous

More information

high performance medical reconstruction using stream programming paradigms

high performance medical reconstruction using stream programming paradigms high performance medical reconstruction using stream programming paradigms This Paper describes the implementation and results of CT reconstruction using Filtered Back Projection on various stream programming

More information

Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation

Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Nikolai Sakharnykh - NVIDIA San Jose Convention Center, San Jose, CA September 21, 2010 Introduction Tridiagonal solvers very popular

More information

Microwell Mixing with Surface Tension

Microwell Mixing with Surface Tension Microwell Mixing with Surface Tension Nick Cox Supervised by Professor Bruce Finlayson University of Washington Department of Chemical Engineering June 6, 2007 Abstract For many applications in the pharmaceutical

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

3D Simulation of Dam-break effect on a Solid Wall using Smoothed Particle Hydrodynamic

3D Simulation of Dam-break effect on a Solid Wall using Smoothed Particle Hydrodynamic ISCS 2013 Selected Papers Dam-break effect on a Solid Wall 1 3D Simulation of Dam-break effect on a Solid Wall using Smoothed Particle Hydrodynamic Suprijadi a,b, F. Faizal b, C.F. Naa a and A.Trisnawan

More information

Performance potential for simulating spin models on GPU

Performance potential for simulating spin models on GPU Performance potential for simulating spin models on GPU Martin Weigel Institut für Physik, Johannes-Gutenberg-Universität Mainz, Germany 11th International NTZ-Workshop on New Developments in Computational

More information

GPGPU. Peter Laurens 1st-year PhD Student, NSC

GPGPU. Peter Laurens 1st-year PhD Student, NSC GPGPU Peter Laurens 1st-year PhD Student, NSC Presentation Overview 1. What is it? 2. What can it do for me? 3. How can I get it to do that? 4. What s the catch? 5. What s the future? What is it? Introducing

More information

Realistic Animation of Fluids

Realistic Animation of Fluids 1 Realistic Animation of Fluids Nick Foster and Dimitris Metaxas Presented by Alex Liberman April 19, 2005 2 Previous Work Used non physics-based methods (mostly in 2D) Hard to simulate effects that rely

More information

Solidification using Smoothed Particle Hydrodynamics

Solidification using Smoothed Particle Hydrodynamics Solidification using Smoothed Particle Hydrodynamics Master Thesis CA-3817512 Game and Media Technology Utrecht University, The Netherlands Supervisors: dr. ir. J.Egges dr. N.G. Pronost July, 2014 - 2

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

Optimization solutions for the segmented sum algorithmic function

Optimization solutions for the segmented sum algorithmic function Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

Realtime Water Simulation on GPU. Nuttapong Chentanez NVIDIA Research

Realtime Water Simulation on GPU. Nuttapong Chentanez NVIDIA Research 1 Realtime Water Simulation on GPU Nuttapong Chentanez NVIDIA Research 2 3 Overview Approaches to realtime water simulation Hybrid shallow water solver + particles Hybrid 3D tall cell water solver + particles

More information

Navier-Stokes & Flow Simulation

Navier-Stokes & Flow Simulation Last Time? Navier-Stokes & Flow Simulation Implicit Surfaces Marching Cubes/Tetras Collision Detection & Response Conservative Bounding Regions backtracking fixing Today Flow Simulations in Graphics Flow

More information

Using GPUs to compute the multilevel summation of electrostatic forces

Using GPUs to compute the multilevel summation of electrostatic forces Using GPUs to compute the multilevel summation of electrostatic forces David J. Hardy Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of

More information

Level Set Methods for Two-Phase Flows with FEM

Level Set Methods for Two-Phase Flows with FEM IT 14 071 Examensarbete 30 hp December 2014 Level Set Methods for Two-Phase Flows with FEM Deniz Kennedy Institutionen för informationsteknologi Department of Information Technology Abstract Level Set

More information

2.7 Cloth Animation. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter 2 123

2.7 Cloth Animation. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter 2 123 2.7 Cloth Animation 320491: Advanced Graphics - Chapter 2 123 Example: Cloth draping Image Michael Kass 320491: Advanced Graphics - Chapter 2 124 Cloth using mass-spring model Network of masses and springs

More information

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka USE OF FOR Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague Mini workshop on advanced numerical methods

More information

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture

More information

CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA

CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0 Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nsight VSE APOD

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

Adaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics

Adaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics Adaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics H. Y. Schive ( 薛熙于 ) Graduate Institute of Physics, National Taiwan University Leung Center for Cosmology and Particle Astrophysics

More information

Particle-based Fluid Simulation

Particle-based Fluid Simulation Simulation in Computer Graphics Particle-based Fluid Simulation Matthias Teschner Computer Science Department University of Freiburg Application (with Pixar) 10 million fluid + 4 million rigid particles,

More information

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,

More information

A High Quality, Eulerian 3D Fluid Solver in C++ A Senior Project. presented to. the Faculty of the Computer Science Department of

A High Quality, Eulerian 3D Fluid Solver in C++ A Senior Project. presented to. the Faculty of the Computer Science Department of A High Quality, Eulerian 3D Fluid Solver in C++ A Senior Project presented to the Faculty of the Computer Science Department of California Polytechnic State University, San Luis Obispo In Partial Fulfillment

More information

Particle-Based Fluid Simulation. CSE169: Computer Animation Steve Rotenberg UCSD, Spring 2016

Particle-Based Fluid Simulation. CSE169: Computer Animation Steve Rotenberg UCSD, Spring 2016 Particle-Based Fluid Simulation CSE169: Computer Animation Steve Rotenberg UCSD, Spring 2016 Del Operations Del: = x Gradient: s = s x y s y z s z Divergence: v = v x + v y + v z x y z Curl: v = v z v

More information

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand

More information

CUDA. Fluid simulation Lattice Boltzmann Models Cellular Automata

CUDA. Fluid simulation Lattice Boltzmann Models Cellular Automata CUDA Fluid simulation Lattice Boltzmann Models Cellular Automata Please excuse my layout of slides for the remaining part of the talk! Fluid Simulation Navier Stokes equations for incompressible fluids

More information

Tesla Architecture, CUDA and Optimization Strategies

Tesla Architecture, CUDA and Optimization Strategies Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization

More information

CS427 Multicore Architecture and Parallel Computing

CS427 Multicore Architecture and Parallel Computing CS427 Multicore Architecture and Parallel Computing Lecture 6 GPU Architecture Li Jiang 2014/10/9 1 GPU Scaling A quiet revolution and potential build-up Calculation: 936 GFLOPS vs. 102 GFLOPS Memory Bandwidth:

More information

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions.

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions. Schedule CUDA Digging further into the programming manual Application Programming Interface (API) text only part, sorry Image utilities (simple CUDA examples) Performace considerations Matrix multiplication

More information

GPGPU LAB. Case study: Finite-Difference Time- Domain Method on CUDA

GPGPU LAB. Case study: Finite-Difference Time- Domain Method on CUDA GPGPU LAB Case study: Finite-Difference Time- Domain Method on CUDA Ana Balevic IPVS 1 Finite-Difference Time-Domain Method Numerical computation of solutions to partial differential equations Explicit

More information

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten GPU Computing: Development and Analysis Part 1 Anton Wijs Muhammad Osama Marieke Huisman Sebastiaan Joosten NLeSC GPU Course Rob van Nieuwpoort & Ben van Werkhoven Who are we? Anton Wijs Assistant professor,

More information

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3

More information

Divergence-Free Smoothed Particle Hydrodynamics

Divergence-Free Smoothed Particle Hydrodynamics Copyright of figures and other materials in the paper belongs to original authors. Divergence-Free Smoothed Particle Hydrodynamics Bender et al. SCA 2015 Presented by MyungJin Choi 2016-11-26 1. Introduction

More information

FINITE POINTSET METHOD FOR 2D DAM-BREAK PROBLEM WITH GPU-ACCELERATION. M. Panchatcharam 1, S. Sundar 2

FINITE POINTSET METHOD FOR 2D DAM-BREAK PROBLEM WITH GPU-ACCELERATION. M. Panchatcharam 1, S. Sundar 2 International Journal of Applied Mathematics Volume 25 No. 4 2012, 547-557 FINITE POINTSET METHOD FOR 2D DAM-BREAK PROBLEM WITH GPU-ACCELERATION M. Panchatcharam 1, S. Sundar 2 1,2 Department of Mathematics

More information

Lagrangian methods and Smoothed Particle Hydrodynamics (SPH) Computation in Astrophysics Seminar (Spring 2006) L. J. Dursi

Lagrangian methods and Smoothed Particle Hydrodynamics (SPH) Computation in Astrophysics Seminar (Spring 2006) L. J. Dursi Lagrangian methods and Smoothed Particle Hydrodynamics (SPH) Eulerian Grid Methods The methods covered so far in this course use an Eulerian grid: Prescribed coordinates In `lab frame' Fluid elements flow

More information

Massively Parallel Architectures

Massively Parallel Architectures Massively Parallel Architectures A Take on Cell Processor and GPU programming Joel Falcou - LRI joel.falcou@lri.fr Bat. 490 - Bureau 104 20 janvier 2009 Motivation The CELL processor Harder,Better,Faster,Stronger

More information

Parallel GPU-Based Fluid Animation. Master s thesis in Interaction Design and Technologies JAKOB SVENSSON

Parallel GPU-Based Fluid Animation. Master s thesis in Interaction Design and Technologies JAKOB SVENSSON Parallel GPU-Based Fluid Animation Master s thesis in Interaction Design and Technologies JAKOB SVENSSON Department of Applied Information Technology CHALMERS UNIVERSITY OF TECHNOLOGY Gothenburg, Sweden

More information

Particleworks: Particle-based CAE Software fully ported to GPU

Particleworks: Particle-based CAE Software fully ported to GPU Particleworks: Particle-based CAE Software fully ported to GPU Introduction PrometechVideo_v3.2.3.wmv 3.5 min. Particleworks Why the particle method? Existing methods FEM, FVM, FLIP, Fluid calculation

More information

PowerVR Hardware. Architecture Overview for Developers

PowerVR Hardware. Architecture Overview for Developers Public Imagination Technologies PowerVR Hardware Public. This publication contains proprietary information which is subject to change without notice and is supplied 'as is' without warranty of any kind.

More information

Dense matching GPU implementation

Dense matching GPU implementation Dense matching GPU implementation Author: Hailong Fu. Supervisor: Prof. Dr.-Ing. Norbert Haala, Dipl. -Ing. Mathias Rothermel. Universität Stuttgart 1. Introduction Correspondence problem is an important

More information

PARALLEL SIMULATION OF A FLUID FLOW BY MEANS OF THE SPH METHOD: OPENMP VS. MPI COMPARISON. Pawe l Wróblewski, Krzysztof Boryczko

PARALLEL SIMULATION OF A FLUID FLOW BY MEANS OF THE SPH METHOD: OPENMP VS. MPI COMPARISON. Pawe l Wróblewski, Krzysztof Boryczko Computing and Informatics, Vol. 28, 2009, 139 150 PARALLEL SIMULATION OF A FLUID FLOW BY MEANS OF THE SPH METHOD: OPENMP VS. MPI COMPARISON Pawe l Wróblewski, Krzysztof Boryczko Department of Computer

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

Universiteit Leiden Opleiding Informatica

Universiteit Leiden Opleiding Informatica Universiteit Leiden Opleiding Informatica Comparison of the effectiveness of shared memory optimizations for stencil computations on NVIDIA GPU architectures Name: Geerten Verweij Date: 12/08/2016 1st

More information

Smoke Simulation using Smoothed Particle Hydrodynamics (SPH) Shruti Jain MSc Computer Animation and Visual Eects Bournemouth University

Smoke Simulation using Smoothed Particle Hydrodynamics (SPH) Shruti Jain MSc Computer Animation and Visual Eects Bournemouth University Smoke Simulation using Smoothed Particle Hydrodynamics (SPH) Shruti Jain MSc Computer Animation and Visual Eects Bournemouth University 21st November 2014 1 Abstract This report is based on the implementation

More information

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli High performance 2D Discrete Fourier Transform on Heterogeneous Platforms Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli Motivation Fourier Transform widely used in Physics, Astronomy, Engineering

More information

Numerical Simulation on the GPU

Numerical Simulation on the GPU Numerical Simulation on the GPU Roadmap Part 1: GPU architecture and programming concepts Part 2: An introduction to GPU programming using CUDA Part 3: Numerical simulation techniques (grid and particle

More information

N-Body Simulation using CUDA. CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo

N-Body Simulation using CUDA. CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo N-Body Simulation using CUDA CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo Project plan Develop a program to simulate gravitational

More information

Accelerating Correlation Power Analysis Using Graphics Processing Units (GPUs)

Accelerating Correlation Power Analysis Using Graphics Processing Units (GPUs) Accelerating Correlation Power Analysis Using Graphics Processing Units (GPUs) Hasindu Gamaarachchi, Roshan Ragel Department of Computer Engineering University of Peradeniya Peradeniya, Sri Lanka hasindu8@gmailcom,

More information

ECE 574 Cluster Computing Lecture 16

ECE 574 Cluster Computing Lecture 16 ECE 574 Cluster Computing Lecture 16 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 26 March 2019 Announcements HW#7 posted HW#6 and HW#5 returned Don t forget project topics

More information

General-purpose computing on graphics processing units (GPGPU)

General-purpose computing on graphics processing units (GPGPU) General-purpose computing on graphics processing units (GPGPU) Thomas Ægidiussen Jensen Henrik Anker Rasmussen François Rosé November 1, 2010 Table of Contents Introduction CUDA CUDA Programming Kernels

More information

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization Duksu Kim Assistant professor, KORATEHC Education Ph.D. Computer Science, KAIST Parallel Proximity Computation on Heterogeneous Computing Systems for Graphics Applications Professional Experience Senior

More information

Optimizing Data Locality for Iterative Matrix Solvers on CUDA

Optimizing Data Locality for Iterative Matrix Solvers on CUDA Optimizing Data Locality for Iterative Matrix Solvers on CUDA Raymond Flagg, Jason Monk, Yifeng Zhu PhD., Bruce Segee PhD. Department of Electrical and Computer Engineering, University of Maine, Orono,

More information

CUDA Programming Model

CUDA Programming Model CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming

More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu

More information

High-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs

High-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs High-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs Gordon Erlebacher Department of Scientific Computing Sept. 28, 2012 with Dimitri Komatitsch (Pau,France) David Michea

More information

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION WHAT YOU WILL LEARN An iterative method to optimize your GPU code Some common bottlenecks to look out for Performance diagnostics with NVIDIA Nsight

More information

3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA

3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA 3D ADI Method for Fluid Simulation on Multiple GPUs Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA Introduction Fluid simulation using direct numerical methods Gives the most accurate result Requires

More information

Portland State University ECE 588/688. Graphics Processors

Portland State University ECE 588/688. Graphics Processors Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly

More information

GPU Basics. Introduction to GPU. S. Sundar and M. Panchatcharam. GPU Basics. S. Sundar & M. Panchatcharam. Super Computing GPU.

GPU Basics. Introduction to GPU. S. Sundar and M. Panchatcharam. GPU Basics. S. Sundar & M. Panchatcharam. Super Computing GPU. Basics of s Basics Introduction to Why vs CPU S. Sundar and Computing architecture August 9, 2014 1 / 70 Outline Basics of s Why vs CPU Computing architecture 1 2 3 of s 4 5 Why 6 vs CPU 7 Computing 8

More information

University of Bielefeld

University of Bielefeld Geistes-, Natur-, Sozial- und Technikwissenschaften gemeinsam unter einem Dach Introduction to GPU Programming using CUDA Olaf Kaczmarek University of Bielefeld STRONGnet Summerschool 2011 ZIF Bielefeld

More information

GPU Programming for Mathematical and Scientific Computing

GPU Programming for Mathematical and Scientific Computing GPU Programming for Mathematical and Scientific Computing Ethan Kerzner and Timothy Urness Department of Mathematics and Computer Science Drake University Des Moines, IA 50311 ethan.kerzner@gmail.com timothy.urness@drake.edu

More information

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Dirk Ribbrock Faculty of Mathematics, TU dortmund 2016 Table of Contents Why parallel

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

GPU Simulations of Violent Flows with Smooth Particle Hydrodynamics (SPH) Method

GPU Simulations of Violent Flows with Smooth Particle Hydrodynamics (SPH) Method Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe GPU Simulations of Violent Flows with Smooth Particle Hydrodynamics (SPH) Method T. Arslan a*, M. Özbulut b a Norwegian

More information

Accelerating CFD with Graphics Hardware

Accelerating CFD with Graphics Hardware Accelerating CFD with Graphics Hardware Graham Pullan (Whittle Laboratory, Cambridge University) 16 March 2009 Today Motivation CPUs and GPUs Programming NVIDIA GPUs with CUDA Application to turbomachinery

More information

Accelerating image registration on GPUs

Accelerating image registration on GPUs Accelerating image registration on GPUs Harald Köstler, Sunil Ramgopal Tatavarty SIAM Conference on Imaging Science (IS10) 13.4.2010 Contents Motivation: Image registration with FAIR GPU Programming Combining

More information

CUDA OPTIMIZATION WITH NVIDIA NSIGHT VISUAL STUDIO EDITION

CUDA OPTIMIZATION WITH NVIDIA NSIGHT VISUAL STUDIO EDITION April 4-7, 2016 Silicon Valley CUDA OPTIMIZATION WITH NVIDIA NSIGHT VISUAL STUDIO EDITION CHRISTOPH ANGERER, NVIDIA JAKOB PROGSCH, NVIDIA 1 WHAT YOU WILL LEARN An iterative method to optimize your GPU

More information

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Joined Advanced Student School (JASS) 2009 March 29 - April 7, 2009 St. Petersburg, Russia G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Dmitry Puzyrev St. Petersburg State University Faculty

More information

High Performance Computing and GPU Programming

High Performance Computing and GPU Programming High Performance Computing and GPU Programming Lecture 1: Introduction Objectives C++/CPU Review GPU Intro Programming Model Objectives Objectives Before we begin a little motivation Intel Xeon 2.67GHz

More information

GPU Fundamentals Jeff Larkin November 14, 2016

GPU Fundamentals Jeff Larkin November 14, 2016 GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 20

ECE 571 Advanced Microprocessor-Based Design Lecture 20 ECE 571 Advanced Microprocessor-Based Design Lecture 20 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 12 April 2016 Project/HW Reminder Homework #9 was posted 1 Raspberry Pi

More information

Implementation of Adaptive Coarsening Algorithm on GPU using CUDA

Implementation of Adaptive Coarsening Algorithm on GPU using CUDA Implementation of Adaptive Coarsening Algorithm on GPU using CUDA 1. Introduction , In scientific computing today, the high-performance computers grow

More information