Automatic Dynamic Task Distribution between CPU and GPU for Real-Time Systems

Marcelo Zamith, Mark Joselli, Esteban Clua, Anselmo Montenegro, Aura Conci, Regina Leal-Toledo
Instituto de Computação, Universidade Federal Fluminense

Marcos d'Ornellas, Cesar Pozzer
Laboratório de Computação Aplicada, Universidade Federal de Santa Maria
{ornellas,pozzer}@inf.ufsm.br

Luis Valente, Bruno Feijó
VisionLab/IGames, Departamento de Informática, PUC-Rio
{lvalente, bruno}@inf.puc-rio.br

Abstract

The increase in computational power of programmable GPUs (Graphics Processing Units) brings new concepts for using these devices for generic processing. Hence, with the use of both the CPU and the GPU for data processing come new ideas that deal with the distribution of tasks between CPU and GPU, such as automatic distribution. The importance of automatic task distribution between CPU and GPU lies in three facts. First, automatic task distribution enables applications to use the best of both processors. Second, the developer does not have to decide which processor will do the work, leaving the automatic task distribution system to choose the best option for the moment. And third, the application can sometimes be slowed down by other processes if the CPU or GPU is already overloaded. Based on these facts, this paper presents new schemes for efficient automatic task distribution between CPU and GPU. The paper also includes tests and results obtained by applying these schemes to a test case and to a real-time system.

Keywords: Parallel computing, task distribution, GPGPU, real-time loop models, real-time systems.

1. Introduction

Programmable GPUs (Graphics Processing Units) implement new paradigms for high-performance computing, such as CUDA (Compute Unified Device Architecture) on the new NVIDIA series (series 8000 and later). These technologies increase the range of general computations that can be made on GPUs, bringing new possibilities for the GPGPU (General-Purpose computing on GPU) area. Some examples of such works are quantum Monte Carlo [1], artificial intelligence [12], and ray casting [9] implementations on the GPU. Because of the GPU's SIMD (Single Instruction Multiple Data) parallel architecture, GPU processing is usually fast for high-intensity computations and slow for low-intensity ones, compared with CPU processing. This paper tries to avoid using only one processor at a time, taking the best of both the CPU and GPU by implementing five schemes to distribute tasks between them, four of which are automatic. With automatic task distribution the developer does not have to decide which processor will do the work, leaving the choice to the automatic distribution. This is important to emphasize, because many times the developer is unable to predict the hardware of the end user, and by using an automatic distribution he/she does not have to. Also, the processor load on the CPU or GPU may sometimes increase, penalizing running applications.

Therefore, an automatic task distribution scheme can detect this situation and distribute tasks to the other processor. The motivations for employing automatic task distribution schemes are as follows:

1. Take advantage of the best characteristics of both the CPU and the GPU;
2. Take the decision of which processor should run a given task away from the developer, leaving it to the automatic distribution scheme;
3. Redistribute tasks between the processors when one of them (CPU or GPU) is overloaded with work.

To test these schemes, this work presents an implementation of bounding sphere collision detection on both the CPU and the GPU. Bounding sphere collision detection is normally used as the broad phase of a collision detection system. One major application of GPGPU, and consequently of task distribution between CPU and GPU, is real-time systems, such as games, virtual reality and virtual simulations. For this reason, the schemes for distributing tasks between CPU and GPU were designed to be used with real-time systems. In order to prove the applicability of these schemes to a real-time system, this paper presents tests with a new multithreaded architecture.

The paper is organized as follows: Section 2 presents the concept of GPGPU and some related works in the area. Section 3 presents the test case that uses bounding sphere collision detection. Section 4 presents an architecture used to test the schemes with a real-time system. Section 5 describes the schemes for task distribution between CPU and GPU, and Section 6 presents the tests with these schemes. Finally, Section 7 points out the conclusions of this work.

2. GPGPU

GPUs are processors dedicated to graphics computation. The development of programmable GPUs has opened a new area of research, permitting the use of the graphics device for processing non-graphics data. However, there are many constraints on what kind of data the GPU can process. For example, scatter memory operations (indexed write array operations) are inefficient, and there are no integer data operands or bit-wise logical operations such as AND, OR, XOR, NOT and bit-shifts. On the other hand, the advantage is that the GPU is much faster than the CPU when all of its parallel processors are considered. An ATI X1900 XTX, for instance, can sustain a measured 240 GFLOPS against 25.6 GFLOPS for the SSE units of a dual-core 3.7 GHz Intel Pentium Extreme Edition 965 [10]. GPUs are very good for processing applications that require high arithmetic rates and data bandwidths. Because of the SIMD parallel architecture of the GPU (the NVIDIA G80, for example, has 128 unified shading processors), the development of this kind of application requires a different programming paradigm from the traditional CPU sequential programming model.

2.1. Related Works on GPGPU

Green [5] presents a commercial physics engine called Havok FX for rigid bodies and particle systems that has several methods implemented on the GPU, obtaining results eight times faster on an NVIDIA GeForce 8800 GTX GPU with an Intel Core 2 Duo Extreme 2.93 GHz than the CPU version. This justifies the implementation of some physics functionality, such as collision detection, on the GPU. Because of the high performance of the fragment processors, which allows high parallelization of the problems that can be mapped onto this structure, it is possible to have more bodies in the physics simulation.
Besides Havok FX, there are other works related to the implementation of physics simulation on the GPU, such as particle systems [8], deformable bodies [3], cloth simulation [18], and collision detection [4]. It is important to remark that none of the works available in the literature has approached the automatic distribution of tasks between CPU and GPU.

3. Test Case: Bounding Sphere Collision Detection

To test the task distribution schemes, a bounding sphere collision detection is implemented on both the CPU and the GPU. Collision detection is a complex operation: for n bodies in a system, a collision check must be performed between O(n^2) pairs of bodies. Normally, to reduce this computational cost, the task is performed in two steps: first the broad phase, and then the narrow phase. In the broad phase, the collision library detects which bodies have a chance of colliding with each other. In the narrow phase, a more refined collision test is performed between the pairs of bodies that passed the broad phase. This second phase is where the collision tests between the bodies are actually done: it calculates the point or points where the bodies intersect, the depth of those intersection points, and their normals. This phase involves high-intensity arithmetic computation, so it cannot be done in real time for every body if the simulation has a high number of bodies; the broad phase is needed to filter out the bodies that have no chance of colliding.
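To make the two-phase structure concrete, the following C++ sketch shows a CPU-only version of this pipeline: an O(n^2) pair loop whose broad phase uses the bounding sphere overlap criterion (two bodies may collide when the distance between their centers is smaller than the sum of their radii, as detailed in the next section), and whose narrow phase is left as a stub because the paper does not detail it. All names here are illustrative, not taken from the paper's code.

#include <cmath>
#include <cstdio>
#include <vector>

struct Body { float x, y, z, r; };   // bounding sphere: center and radius

// Broad phase predicate: the spheres overlap when the distance between the
// centers is smaller than the sum of the radii.
bool spheresMayCollide(const Body& a, const Body& b)
{
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    const float d2 = dx * dx + dy * dy + dz * dz;
    const float rsum = a.r + b.r;
    return d2 < rsum * rsum;          // compare squared distances, avoids sqrt
}

// Narrow phase stub: the exact contact computation is not detailed in the
// paper, so here it only counts the candidate pair.
void narrowPhase(std::size_t i, std::size_t j, std::size_t& contacts)
{
    (void)i; (void)j;
    ++contacts;
}

int main()
{
    std::vector<Body> bodies = { {0, 0, 0, 1}, {1.5f, 0, 0, 1}, {10, 0, 0, 1} };
    std::size_t contacts = 0;

    // O(n^2) pair loop: the broad phase discards pairs that cannot collide,
    // so the expensive narrow phase only runs on the surviving candidates.
    for (std::size_t i = 0; i < bodies.size(); ++i)
        for (std::size_t j = i + 1; j < bodies.size(); ++j)
            if (spheresMayCollide(bodies[i], bodies[j]))
                narrowPhase(i, j, contacts);

    std::printf("%zu candidate pair(s) reached the narrow phase\n", contacts);
    return 0;
}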

The broad phase can be carried out in different ways: using a grid algorithm, axis-aligned bounding boxes (AABBs) or bounding spheres, for example. This work uses bounding spheres in the broad phase. This approach has some advantages over the traditional axis-aligned bounding boxes: the bounding sphere only needs to be calculated once, unlike the axis-aligned bounding box, which needs to be recalculated every time the body rotates or moves. Also, the bounding sphere needs four scalars (x, y, z, r) of storage against six scalars for the AABB [2]. The algorithm to perform this test is very simple: if the distance d between two bodies is less than the sum of their radii (r1 + r2), then they may be in collision and are passed to the narrow phase.

3.1. Bounding Sphere Collision on the GPU

The broad phase has been implemented on the CPU and also on the GPU. To implement it on the GPU, it is necessary to map the problem onto textures. In order to do that, each texel (RGBA) is mapped to a body, holding its position (x, y, z) and radius; every four bodies form one row of the texture. Figure 1 depicts this structure.

Figure 1. Structure of the data in the texture.

This method is split into passes to make better use of the parallel structure of the GPU. The first pass applies the bounding sphere check of the first four bodies against all other bodies. The shader then returns a texture indicating which bodies can actually collide, and the narrow phase is applied to them. As an example, Figure 2 illustrates a case where the first body is tested against the last four, with the results written in a texel.

Figure 2. An example of a collision test on the GPU.

After this first pass, the shader performs another pass checking the next four bodies against the remaining ones (i.e. the bodies that were not checked yet) and returns a texture indicating which bodies have a chance of colliding, sending them to the narrow phase. This process repeats until there are no remaining bodies. For a system with n bodies, this method divides the collision detection problem into n/4 passes on the GPU.

Table 1 illustrates the tests performed with this method, both on the CPU and the GPU. All the tests in this paper were made on an Athlon64 with 2 GB of RAM and an NVIDIA 8400GS GPU card on a PCI-Express 16x socket. The time unit is milliseconds.

Table 1. Numerical results, in milliseconds, of the bounding sphere collision detection. Columns: # Bodies, CPU time, GPU time, speedup.

The results show that the GPU starts to become faster than the CPU only when more than 1024 bodies are present in the system. This is because the data transfers to the GPU (the bottleneck of this shader) can consume more than 50% of the GPU time.
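On the CPU side, the body-to-texel mapping of Section 3.1 amounts to a simple packing step. The fragment below is a minimal sketch of that layout, assuming an RGBA 32-bit float texture four texels wide; the function name and the suggested upload call are assumptions, not the paper's actual code.

#include <cstdio>
#include <vector>

// One rigid body: center position and bounding sphere radius.
struct Body { float x, y, z, r; };

// Packs the bodies into a flat RGBA float buffer laid out as the paper
// describes: one body per texel, four bodies (16 floats) per texture row.
// The buffer could then be uploaded with e.g. glTexImage2D as a 32-bit
// float RGBA texture of width 4 and height ceil(n / 4).
std::vector<float> packBodiesIntoTexture(const std::vector<Body>& bodies)
{
    const std::size_t texelsPerRow = 4;
    const std::size_t rows = (bodies.size() + texelsPerRow - 1) / texelsPerRow;
    std::vector<float> texture(rows * texelsPerRow * 4, 0.0f); // RGBA per texel

    for (std::size_t i = 0; i < bodies.size(); ++i) {
        float* texel = &texture[i * 4];   // body i -> texel i (row i/4, column i%4)
        texel[0] = bodies[i].x;           // R
        texel[1] = bodies[i].y;           // G
        texel[2] = bodies[i].z;           // B
        texel[3] = bodies[i].r;           // A
    }
    return texture;
}

int main()
{
    std::vector<Body> bodies = { {0, 0, 0, 1}, {3, 0, 0, 1}, {0, 5, 0, 2} };
    std::vector<float> tex = packBodiesIntoTexture(bodies);
    std::printf("texture holds %zu texels in %zu row(s)\n",
                tex.size() / 4, tex.size() / 16);
    return 0;
}

Each shader pass then reads four bodies from one row of this texture and tests them against the remaining rows, which is what splits the problem into n/4 passes.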

4. Real-Time Loop Models

A real-time loop model is needed in order to test the distribution schemes with a real-time system. One of the reasons to create real-time loop models is to simulate the apparent parallelism a real-time application conveys. In an ideal setting, computers would have infinite memory and processing power, so the solution would be simply to distribute all tasks over all resources. In practice this is not the case, because computers have a limited amount of memory and few processing cores. Hence, to overcome this limitation it is necessary to design real-time loop models in a way that simulates parallelism and provides interactivity.

The typical tasks a real-time loop runs can be grouped in three areas: input device query, the update stage and the presentation stage. The first area corresponds to capturing input from user devices, like keyboards, mice, joysticks and microphones, among others. The second area encompasses all tasks that affect the application state; for games, it includes game AI (Artificial Intelligence), game logic (decisions based on game rules and user data), physics simulation, animation, and others. The third area presents the results to the user, using audio (sound effects, background music) and video (scene rendering).

Real-time loop models can be categorized as coupled or uncoupled models [16]. The coupling relates to a dependence on the execution ordering of some tasks. In practice, this means that if a task takes too long to be processed, other tasks that depend on it will start later, degrading application performance (and possibly degrading interactivity). Hence, the main idea of uncoupled models is to separate the tasks that may interfere with each other. For example, consider the simple coupled model [16] depicted in Figure 3 and commonly found in the literature [13][11].

Figure 3. Simple coupled model.

This model is the simplest approach to run a real-time application, where the three stages run sequentially in a single loop. A delay in one of the stages is immediately noticed by the user, as one stage cannot start while the previous one is still running. An example of an uncoupled model is the single thread uncoupled model [16]. This model separates the update and rendering stages, so they can run independently, as Figure 4 illustrates.

Figure 4. Single thread uncoupled model.

The single thread uncoupled model with a GPGPU stage uncoupled from the main loop is an extension of this model, and presents a new and very efficient approach for using the GPU as a math and physics co-processor, as Figure 5 shows.

Figure 5. Single thread uncoupled model with a GPGPU stage uncoupled from the main loop.

In parallel programming models, like the presented architecture, data are processed simultaneously. When the data are independent, as with the render and GPGPU stages, there are no problems; however, several problems arise when data are shared or must be processed in a specific order, as with the update and GPGPU stages. In this case it is necessary to guarantee mutually exclusive access to shared data and to preserve task execution ordering, so synchronization is applied between the update stage and the GPGPU stage. To avoid problems with data shared among the stages, the model uses synchronization objects, as multithreaded applications commonly do. Synchronization objects are tools for handling task dependence and execution ordering. They must also be applied carefully in order to avoid thread starvation and deadlocks.
The current implementation of the presented model uses semaphores as the synchronization object, but other synchronization objects could be used.
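As a rough sketch of how the update and GPGPU stages can be synchronized, the following C++ example hands one frame's worth of work to a GPGPU thread and waits for the result before the next update touches the shared data. The paper only states that semaphores are used; this sketch substitutes a mutex and condition variable as the synchronization object, and every name in it is illustrative.

#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

// Shared state exchanged between the update stage and the GPGPU stage.
struct SharedFrame {
    std::mutex mtx;
    std::condition_variable cv;
    bool workReady = false;   // update stage has produced data for the GPGPU stage
    bool resultReady = false; // GPGPU stage has finished the frame's task
    bool quit = false;
    int frame = 0;
};

// GPGPU stage thread: waits for work, runs the task (a stand-in here),
// then signals the update stage that the result is ready.
void gpgpuStage(SharedFrame& s)
{
    for (;;) {
        std::unique_lock<std::mutex> lock(s.mtx);
        s.cv.wait(lock, [&] { return s.workReady || s.quit; });
        if (s.quit) return;
        s.workReady = false;
        // ... dispatch the collision shader and read back its results here ...
        s.resultReady = true;
        s.cv.notify_all();
    }
}

int main()
{
    SharedFrame shared;
    std::thread worker(gpgpuStage, std::ref(shared));

    for (int frame = 0; frame < 3; ++frame) {            // main loop (update stage)
        {   // hand this frame's task to the GPGPU stage
            std::lock_guard<std::mutex> lock(shared.mtx);
            shared.frame = frame;
            shared.workReady = true;
        }
        shared.cv.notify_all();

        // ... input query and rendering run here, overlapping the GPGPU stage ...

        {   // before the next update touches shared data, wait for the result
            std::unique_lock<std::mutex> lock(shared.mtx);
            shared.cv.wait(lock, [&] { return shared.resultReady; });
            shared.resultReady = false;
        }
        std::printf("frame %d processed by the GPGPU stage\n", frame);
    }

    { std::lock_guard<std::mutex> lock(shared.mtx); shared.quit = true; }
    shared.cv.notify_all();
    worker.join();
    return 0;
}

The point of the uncoupled model is that rendering keeps running while the GPGPU stage works, so only the update stage ever blocks on the shared data.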

5. Schemes for Task Distribution between CPU and GPU

This section presents the schemes for task distribution between CPU and GPU. It is important to mention that, for correct distribution between the processors (GPU and CPU), it is necessary to implement the algorithm for both of them (for example, the collision algorithm had to be written for both the CPU and the GPU). This ensures that, even though different programming paradigms lead to different implementations, the same task can be run similarly on both processors. This section is divided into two subsections, one for the manual decision and another for the automatic decisions.

5.1. Manual Decision to Distribute

The simplest decision on how to distribute tasks between the CPU and GPU is to let the user or developer decide. This decision can be made in the application via a C function or even a script language, such as Lua [6]. Former work [17] implements task scheduling distribution between CPU and GPU via a script file.

5.2. Automatic Decisions to Distribute

The heuristics for automatic distribution are implemented by deriving from a base class, Distributor. This class provides a time counter for its subclasses that is updated automatically. At each frame, the classes derived from Distributor are asked for a decision on how to process the next frame. Figure 6 illustrates an application loop based on this approach.

Figure 6. A loop with automatic distribution.

Each class that inherits from Distributor must implement the function decidemode, which is purely virtual, to make use of this strategy. This function is responsible for deciding where the application will execute the next frame of the task. The schemes for automatic task distribution presented in this paper are: starting distribution, cycle distribution, best time distribution and resource distribution.

5.2.1. Starting Distribution

The starting strategy for automatic distribution between CPU and GPU is very simple: it calculates 10 frames in GPU mode and 10 frames in CPU mode. Based on these measurements, it selects the fastest processor to process all the remaining frames of the application. Algorithm 1 lists the pseudo-code for this approach.

Algorithm 1. Starting distribution
    if framecount == 10 then
        calculateelapsedtimegpu()
    if framecount == 20 then
        calculateelapsedtimecpu()
        if GPUTime < CPUTime then
            mode <- GPU
        else
            mode <- CPU
    return mode

The reason this method calculates 10 frames on each processor is to avoid the wrong decision that could be made if the scheme calculated only 1 frame on each processor. All the subsequent schemes spend 5 or 10 frames in initial tests on each processor for the same reason. Under normal conditions, this method always selects the fastest processor without the user or developer having to select it. This is important for a developer who does not know the hardware where the application is going to run and wants to use the fastest processor available. Even though the starting automatic distribution selects the fastest processor, it does not prevent the application from being slowed down by other processes in the system if the CPU or GPU is already overloaded with work. Avoiding this requires a distribution strategy that keeps track of the performance every frame.

5.2.2. Cycle Distribution

The cycle distribution has the following strategy: every 100 frames, the engine calculates 5 frames in GPU mode and another 5 frames in CPU mode. With these times, the distribution chooses the fastest processor to simulate the next 90 frames, as Algorithm 2 lists.

Algorithm 2. Cycle distribution
    if framecount == 5 then
        calculateelapsedtimegpu()
    if framecount == 10 then
        calculateelapsedtimecpu()
        if GPUTime < CPUTime then
            mode <- GPU
        else
            mode <- CPU
    if framecount == 100 then
        framecount <- 0
    return mode

The reason this method uses a cycle of 100 frames is to have a cycle that is long enough that the slower processor does not slow down the overall application, and also short enough that it allows real-time redistribution between the processors. From the pseudo-code, it can be seen that with this scheme only 5% of the frames are spent in the slower scenario and 95% in the best one. This scheme is ideal for tasks where the performance difference between the processors is low.
If the difference between the processors is high, however, the 5% of the frames spent in the slowest mode will affect the overall performance of the application.
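To illustrate the class structure described in Section 5.2, the sketch below shows how a Distributor base class with a pure virtual decidemode and a subclass implementing the starting distribution might look. Apart from decidemode, all names and details are assumptions for illustration, and the frame-timing bookkeeping is simplified compared with the real framework.

#include <cstdio>

enum class Mode { CPU, GPU };

// Base class for all distribution heuristics. The framework updates the
// frame counter and the per-processor elapsed times; subclasses only decide
// which processor runs the next frame.
class Distributor {
public:
    virtual ~Distributor() = default;
    virtual Mode decidemode() = 0;                 // pure virtual decision hook

    void reportFrameTime(Mode mode, double ms) {   // called by the loop each frame
        ++framecount;
        if (mode == Mode::GPU) gpuTime += ms; else cpuTime += ms;
    }

protected:
    int framecount = 0;
    double gpuTime = 0.0;
    double cpuTime = 0.0;
};

// Starting distribution: measure 10 frames on the GPU, then 10 on the CPU,
// and stick with the faster processor afterwards.
class StartingDistributor : public Distributor {
public:
    Mode decidemode() override {
        if (framecount < 10) return Mode::GPU;     // first 10 frames on the GPU
        if (framecount < 20) return Mode::CPU;     // next 10 frames on the CPU
        return (gpuTime < cpuTime) ? Mode::GPU : Mode::CPU;
    }
};

int main() {
    StartingDistributor dist;
    for (int frame = 0; frame < 25; ++frame) {
        Mode mode = dist.decidemode();
        double ms = (mode == Mode::GPU) ? 2.0 : 3.5;  // stand-in frame times
        // ... run the frame's task on the chosen processor here ...
        dist.reportFrameTime(mode, ms);
        std::printf("frame %2d -> %s\n", frame, mode == Mode::GPU ? "GPU" : "CPU");
    }
    return 0;
}

A cycle, best time or resource distributor would derive from the same base class and change only the decision logic in decidemode.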

5.2.3. Best Time Distribution

The idea of this scheme is to create an approach that is able to redistribute tasks between CPU and GPU without the cost of running many tests on the slower processor, as the cycle distribution scheme does. The strategy is as follows: it starts by calculating 10 frames on the GPU and another 10 on the CPU. At the end of these 20 frames it determines the processor with the fastest time and uses it for the next 10 frames. If the time of this processor after these 10 frames is still lower than the best time of the other processor, it keeps using it; otherwise it calculates 10 frames on the other processor, and so forth. This scheme always saves the fastest time of each processor. Algorithm 3 implements this strategy. The idea of this distribution is to use the fastest processor in most cases and to use the slower one to take work away from the fastest processor when it is overloaded.

Algorithm 3. Best time distribution
    if mode == GPU then
        if GPUBestTime > GPUTime then
            GPUBestTime <- GPUTime
    if mode == CPU then
        if CPUBestTime > CPUTime then
            CPUBestTime <- CPUTime
    if framecount == 10 then
        calculateelapsedtimegpu()
    if framecount == 20 then
        calculateelapsedtimecpu()
        if GPUBestTime < CPUBestTime then
            mode <- GPU
        else
            mode <- CPU
    if framecount > 30 then
        if mode == GPU then
            if GPUTime > CPUBestTime then
                framecount <- 21
        if mode == CPU then
            if CPUTime > GPUBestTime then
                framecount <- 21
    return mode

5.2.4. Resource Distribution

The idea of the resource strategy is to use libraries to check processor usage (as a percentage). The Windows API was used for this verification on the CPU. Initial tests have shown that the CPU usage varies with the number of bodies and remains constant when this number does not vary, showing that this measurement can be used for the purpose of distribution. NVPerfKit [7] was used for checking this parameter on the GPU. The initial tests have shown that the GPU usage varies very little with the number of bodies and does not remain constant when this number is constant, showing that it is not possible to predict whether or not the GPU is overloaded with work by this method. For this reason, GPU usage is not suitable for the purpose of distribution.

This scheme therefore uses only the CPU usage, with the following strategy: 10 frames are calculated on the GPU and 10 frames on the CPU. Based on the obtained results, the fastest processor is selected, in the same manner as the starting automatic distribution. If the fastest processor is the CPU, a variable usecpu is activated, and after 10 frames the scheme starts to verify the percentage of CPU use; if this percentage rises above 80%, it sends the next 10 frames to the GPU. Algorithm 4 implements this approach.

Algorithm 4. Resource distribution
    if framecount == 10 then
        calculateelapsedtimegpu()
    if framecount == 20 then
        calculateelapsedtimecpu()
        if GPUTime < CPUTime then
            mode <- GPU
        else
            mode <- CPU
            usecpu <- true
    if usecpu AND framecount > 30 then
        perc <- getperccpu()
        if perc > 80 then
            framecount <- 21
    return mode

This automatic task distribution scheme is aimed at simulations where the CPU is faster than the GPU, and the CPU uses the GPU as an auxiliary processor when it is overloaded with work. When the GPU is faster than the CPU, this scheme is not able to redistribute tasks and behaves like the starting automatic distribution.
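The paper does not specify which Windows API call is used to read the CPU load. One common approach, shown in the hedged sketch below, is to sample GetSystemTimes twice and derive a usage percentage from the idle and total times; the 80% threshold mirrors the resource scheme, and everything else is illustrative.

#include <windows.h>
#include <cstdio>

// Converts a FILETIME (100-nanosecond ticks) to a 64-bit integer.
static unsigned long long toUint64(const FILETIME& ft)
{
    ULARGE_INTEGER v;
    v.LowPart  = ft.dwLowDateTime;
    v.HighPart = ft.dwHighDateTime;
    return v.QuadPart;
}

// Returns the system-wide CPU usage (0-100) over a short sampling window.
// Kernel time reported by GetSystemTimes already includes idle time, so
// usage = 1 - idle_delta / (kernel_delta + user_delta).
static double sampleCpuUsagePercent(DWORD windowMs)
{
    FILETIME idle0, kernel0, user0, idle1, kernel1, user1;
    GetSystemTimes(&idle0, &kernel0, &user0);
    Sleep(windowMs);
    GetSystemTimes(&idle1, &kernel1, &user1);

    const double idle  = double(toUint64(idle1)   - toUint64(idle0));
    const double total = double(toUint64(kernel1) - toUint64(kernel0)) +
                         double(toUint64(user1)   - toUint64(user0));
    return total > 0.0 ? 100.0 * (1.0 - idle / total) : 0.0;
}

int main()
{
    const double perc = sampleCpuUsagePercent(200);
    std::printf("CPU usage: %.1f%%\n", perc);
    if (perc > 80.0)
        std::printf("CPU overloaded: the resource scheme would send the next frames to the GPU\n");
    return 0;
}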

6. Results

Table 2 shows the numerical results of all the task distribution schemes, under normal conditions, with the test case.

Table 2. Numerical results, in milliseconds, of the schemes for task distribution over 100 frames of the application. Columns: # Bodies; Manual Distribution (CPU, GPU); Automatic Distribution (Starting, Cycle, Best Time, Resource).

These results demonstrate that the starting distribution, under normal conditions, always behaves like the best case. The cycle distribution behaves almost like the best case when the difference between the CPU implementation and the GPU implementation is low; in the other cases, the 5% of frames spent on the slower processor affects the overall performance of the application. The results also demonstrate that the best time distribution behaves almost like the best case, with the advantage that it can redistribute tasks between the processors, and that the resource distribution behaves like the best case when there are 1024 or more bodies, because in that case it does not have to check the CPU usage percentage.

To test the schemes in a real-time system, this work has implemented the collision shader (as the GPGPU stage) together with the public physics library Open Dynamics Engine (ODE) [14] (as the update stage) in the single thread uncoupled model with a GPGPU stage uncoupled from the main loop. For data input and rendering, an academic framework was used [15]. The numerical results, in frames per second, are shown in Table 3.

Table 3. Numerical results, in FPS, of the single thread uncoupled model with a GPGPU stage uncoupled from the main loop with the distribution schemes. Columns: # Bodies; Manual Distribution (CPU, GPU); Automatic Distribution (Starting, Cycle, Best Time, Resource).

These results show that the task distribution schemes between CPU and GPU behave well with a real-time system and can be applied in real-time applications, such as games.

7. Conclusion

This work has proposed different schemes to distribute tasks between CPU and GPU in order to take advantage of both processors. These task distribution schemes can be of two kinds: manual distribution or automatic distribution. One scheme for automatic task distribution, the starting distribution, is designed to be executed only at the beginning of the simulation. The importance of this mode is that it can select the fastest processor without the user or developer having to determine it. The other distribution strategies address the fact that other applications or the system itself can slow down the simulation; with an automatic distribution, tasks can be redistributed between the processors, performing most of the work on the fastest processor. The automatic distribution schemes performed well in the tests, with the best time distribution behaving closest to the best case while still being able to redistribute tasks.

This paper has also presented a bounding sphere collision detection implemented on both the CPU and the GPU to test the task distribution schemes, with good results. To test the schemes for task distribution between CPU and GPU with real-time systems, this work has presented a single thread uncoupled model with a GPGPU stage uncoupled from the main loop. These tests show that the schemes for task distribution can be used with real-time systems.
References

[1] A. Anderson, W. G. III, and P. Schröder. Quantum Monte Carlo on graphical processing units. Computer Physics Communications, 177(3).
[2] C. Ericson. Real-Time Collision Detection. Morgan Kaufmann.
[3] J. Georgii, F. Echtler, and R. Westermann. Interactive simulation of deformable bodies on GPUs. Proceedings of Simulation and Visualization 2005.
[4] N. K. Govindaraju, S. Redon, M. C. Lin, and D. Manocha. CULLIDE: interactive collision detection between complex models in large environments using graphics hardware. Graphics Hardware 2003, pages 25-32.
[5] S. Green. GPGPU physics. SIGGRAPH GPGPU Tutorial.

[6] R. Ierusalimschy, L. H. de Figueiredo, and W. Celes. Lua - an extensible extension language. Software: Practice & Experience, 26(6).
[7] J. Kiel and S. Dietrich. GPU performance tuning with NVIDIA performance tools. Game Developers Conference.
[8] P. Kipfer, M. Segal, and R. Westermann. UberFlow: a GPU-based particle engine. Graphics Hardware 2004.
[9] C. Muller, M. Strengert, and T. Ertl. Adaptive load balancing for raycasting of non-uniformly bricked volumes. Parallel Computing, 33(6).
[10] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. E. Lefohn, and T. J. Purcell. A survey of general-purpose computation on graphics hardware. Computer Graphics Forum, 26(1).
[11] A. Rollings and D. Morris. Game Architecture and Design: A New Edition. New Riders Publishing.
[12] T. Rudomýn, E. Millán, and B. Hernández. Fragment shaders for agent animation using finite state machines. Simulation Modelling Practice and Theory, 13(8).
[13] D. Sanchez and C. Dalmau. Core Techniques and Algorithms in Game Programming. New Riders Publishing.
[14] R. Smith. Open Dynamics Engine. Accessed 20/12/2007.
[15] L. Valente. Guff: um framework para desenvolvimento de jogos. Master's thesis, Universidade Federal Fluminense. In Portuguese.
[16] L. Valente, A. Conci, and B. Feijó. Real time game loop models for single-player computer games. In Proceedings of the IV Brazilian Symposium on Computer Games and Digital Entertainment, pages 89-99.
[17] M. Zamith, E. Clua, P. Pagliosa, A. Conci, A. Montenegro, and L. Valente. The GPU used as a math co-processor in real time applications. Proceedings of the VI Brazilian Symposium on Computer Games and Digital Entertainment, pages 37-43.
[18] C. Zeller. Cloth simulation on the GPU. ACM SIGGRAPH 2005 Sketches.
