Automatic Dynamic Task Distribution between CPU and GPU for Real-Time Systems

Marcelo Zamith, Mark Joselli, Esteban Clua, Anselmo Montenegro, Aura Conci, Regina Leal-Toledo
Instituto de Computação, Universidade Federal Fluminense

Marcos d'Ornellas, Cesar Pozzer
Laboratório de Computação Aplicada, Universidade Federal de Santa Maria
{ornellas,pozzer}@inf.ufsm.br

Luis Valente, Bruno Feijó
VisionLab/IGames, Departamento de Informática, PUC-Rio
{lvalente, bruno}@inf.puc-rio.br

Abstract

The increase in computational power of programmable GPUs (Graphics Processing Units) brings new concepts for using these devices for generic processing. Hence, with the use of both the CPU and the GPU for data processing come new ideas that deal with the distribution of tasks between CPU and GPU, such as automatic distribution. The importance of automatic task distribution between CPU and GPU lies in three facts. First, automatic task distribution enables applications to use the best of both processors. Second, the developer does not have to decide which processor will do the work, leaving the automatic task distribution system to choose the best option for the moment. And third, the application can sometimes be slowed down by other processes if the CPU or GPU is already overloaded. Based on these facts, this paper presents new schemes for efficient automatic task distribution between CPU and GPU. The paper also includes tests and results obtained by applying these schemes to a test case and to a real-time system.

Keywords: Parallel computing, task distribution, GPGPU, real-time loop models, real-time systems.

1. Introduction

Programmable GPUs (Graphics Processing Units) implement new paradigms for high-performance computing, such as CUDA (Compute Unified Device Architecture) on the new NVIDIA series (series 8000 and later). These technologies increase the range of general computations that can be made on GPUs, bringing new possibilities for the GPGPU (General-Purpose computing on GPU) area. Some examples of such works are quantum Monte Carlo [1], artificial intelligence [12], and ray casting [9] implementations on the GPU. Because of the GPU's SIMD (Single Instruction Multiple Data) parallel architecture, GPU processing is usually fast for high-intensity computations and slow for low-intensity ones, compared with CPU processing. This paper tries to avoid using only one processor at a time, taking the best of both the CPU and GPU by implementing five schemes to distribute tasks between them, four of which are automatic. With automatic task distribution the developer does not have to decide which processor will do the work, leaving the choice to the automatic distribution. This is important to emphasize, because many times the developer is unable to predict the hardware of the end user, and by using an automatic distribution he/she does not have to. Also, the processor load on the CPU or GPU may sometimes increase, penalizing running applications.

Therefore, an automatic task distribution scheme can detect this situation and distribute tasks to the other processor. The motivations for employing automatic task distribution schemes are as follows:

1. Take advantage of the best characteristics of both the CPU and the GPU;
2. Take the decision of which processor should run a given task away from the developer, leaving it to the automatic distribution scheme;
3. Redistribute tasks between the processors when one of them (CPU or GPU) is overloaded with work.

To test these schemes, this work presents an implementation of bounding sphere collision detection on both the CPU and the GPU. Bounding sphere collision detection is normally used as the broad phase of a collision detection system. One major application of GPGPU, and consequently of task distribution between CPU and GPU, is real-time systems, such as games, virtual reality and virtual simulations. For this reason, the schemes for distributing tasks between CPU and GPU were designed to be used with real-time systems. In order to prove the applicability of these schemes to a real-time system, this paper presents tests with a new multithreaded architecture.

The paper is organized as follows: Section 2 presents the concept of GPGPU and some related works in the area. Section 3 presents the test case that uses bounding sphere collision detection. Section 4 presents an architecture used to test the schemes with a real-time system. Section 5 describes the schemes for task distribution between CPU and GPU, and Section 6 presents the tests with these schemes. Finally, Section 7 points out the conclusions of this work.

2. GPGPU

GPUs are processors dedicated to graphics computation. The development of programmable GPUs has opened a new area of research, permitting the use of the graphics device for processing non-graphics data. However, there are many constraints on what kind of data the GPU can process. For example, scatter memory operations (indexed write array operations) are inefficient, and there are no integer data operands or bit-wise logical operations such as AND, OR, XOR, NOT and bit-shifts. On the other hand, the advantage is that the GPU is much faster than the CPU when all of its parallel processors are considered. An ATI X1900 XTX, for instance, can sustain a measured 240 GFLOPS against 25.6 GFLOPS for the SSE units of a dual-core 3.7 GHz Intel Pentium Extreme Edition 965 [10]. GPUs are very good for processing applications that require high arithmetic rates and data bandwidths. Because of the SIMD parallel architecture of the GPU (the NVIDIA G80, for example, has 128 unified shading processors), the development of this kind of application requires a different programming paradigm from the traditional CPU sequential programming model.

2.1. Related Works on GPGPU

Green [5] presents a commercial physics engine called Havok FX for rigid bodies and particle systems that has several methods implemented on the GPU, obtaining results eight times faster on an NVIDIA GeForce 8800 GTX GPU with an Intel Core 2 Duo Extreme 2.93 GHz than the CPU version. This justifies the implementation of some physics functionality, such as collision detection, on the GPU. Because of the high performance of the fragment processors, which allows high parallelization of the problems that can be mapped onto this structure, it is possible to have more bodies in the physics simulation.
Besides Havok FX, there are other works related to the implementation of physics simulation on the GPU, such as particle systems [8], deformable bodies [3], cloth simulation [18], and collision detection [4]. It is important to remark that none of the works available in the literature has approached the automatic distribution of tasks between CPU and GPU.

3. Test Case: Bounding Sphere Collision Detection

To test the task distribution schemes, a bounding sphere collision detection is implemented on both the CPU and the GPU. Collision detection is a complex operation: for n bodies in a system, a collision check must be performed between O(n^2) pairs of bodies. Normally, to reduce this computational cost, the task is performed in two steps: first the broad phase, and then the narrow phase. In the broad phase, the collision library detects which bodies have a chance of colliding with each other. In the narrow phase, a more refined collision test is performed between the pairs of bodies that passed the broad phase. This second phase is where the collision tests between the bodies are actually done: it calculates the point or points where the bodies intersect, the depth of those intersection points, and their normals. This phase involves high-intensity arithmetic computation, so it cannot be done in real time for every body if the simulation has a high number of bodies; the broad phase is needed to filter out the bodies that have no chance of colliding.
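To make the two-phase structure concrete, the following C++ sketch shows a CPU-only version of this pipeline: an O(n^2) pair loop whose broad phase uses the bounding sphere overlap criterion (two bodies may collide when the distance between their centers is smaller than the sum of their radii, as detailed in the next section), and whose narrow phase is left as a stub because the paper does not detail it. All names here are illustrative, not taken from the paper's code.

#include <cmath>
#include <cstdio>
#include <vector>

struct Body { float x, y, z, r; };   // bounding sphere: center and radius

// Broad phase predicate: the spheres overlap when the distance between the
// centers is smaller than the sum of the radii.
bool spheresMayCollide(const Body& a, const Body& b)
{
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    const float d2 = dx * dx + dy * dy + dz * dz;
    const float rsum = a.r + b.r;
    return d2 < rsum * rsum;          // compare squared distances, avoids sqrt
}

// Narrow phase stub: the exact contact computation is not detailed in the
// paper, so here it only counts the candidate pair.
void narrowPhase(std::size_t i, std::size_t j, std::size_t& contacts)
{
    (void)i; (void)j;
    ++contacts;
}

int main()
{
    std::vector<Body> bodies = { {0, 0, 0, 1}, {1.5f, 0, 0, 1}, {10, 0, 0, 1} };
    std::size_t contacts = 0;

    // O(n^2) pair loop: the broad phase discards pairs that cannot collide,
    // so the expensive narrow phase only runs on the surviving candidates.
    for (std::size_t i = 0; i < bodies.size(); ++i)
        for (std::size_t j = i + 1; j < bodies.size(); ++j)
            if (spheresMayCollide(bodies[i], bodies[j]))
                narrowPhase(i, j, contacts);

    std::printf("%zu candidate pair(s) reached the narrow phase\n", contacts);
    return 0;
}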

The broad phase can be carried out in different ways: using a grid algorithm, axis-aligned bounding boxes (AABBs) or bounding spheres, for example. This work uses bounding spheres in the broad phase. This approach has some advantages over the traditional axis-aligned bounding boxes: the bounding sphere only needs to be calculated once, unlike the axis-aligned bounding box, which needs to be recalculated every time the body rotates or moves. Also, the bounding sphere needs four scalars (x, y, z, r) of storage against six scalars for the AABB [2]. The algorithm to perform this test is very simple: if the distance d between two bodies is less than the sum of their radii (r1 + r2), then they may be in collision and are passed to the narrow phase.

3.1. Bounding Sphere Collision on the GPU

The broad phase has been implemented on the CPU and also on the GPU. To implement it on the GPU, it is necessary to map the problem onto textures. In order to do that, each texel (RGBA) is mapped to a body, holding its position (x, y, z) and radius; every four bodies form one row of the texture. Figure 1 depicts this structure.

Figure 1. Structure of the data in the texture.

This method is split into passes to make better use of the parallel structure of the GPU. The first pass applies the bounding sphere check of the first four bodies against all other bodies. The shader then returns a texture indicating which bodies can actually collide, and the narrow phase is applied to them. As an example, Figure 2 illustrates a case where the first body is tested against the last four, with the results written in a texel.

Figure 2. An example of a collision test on the GPU.

After this first pass, the shader performs another pass checking the next four bodies against the remaining ones (i.e. the bodies that were not checked yet) and returns a texture indicating which bodies have a chance of colliding, sending them to the narrow phase. This process repeats until there are no remaining bodies. For a system with n bodies, this method divides the collision detection problem into n/4 passes on the GPU.

Table 1 illustrates the tests performed with this method, both on the CPU and the GPU. All the tests in this paper were made on an Athlon64 with 2 GB of RAM and an NVIDIA 8400GS GPU card on a PCI-Express 16x socket. The time unit is milliseconds.

Table 1. Numerical results, in milliseconds, of the bounding sphere collision detection. Columns: # Bodies, CPU time, GPU time, speedup.

The results show that the GPU starts to become faster than the CPU only when more than 1024 bodies are present in the system. This is because the data transfers to the GPU (the bottleneck of this shader) can consume more than 50% of the GPU time.
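On the CPU side, the body-to-texel mapping of Section 3.1 amounts to a simple packing step. The fragment below is a minimal sketch of that layout, assuming an RGBA 32-bit float texture four texels wide; the function name and the suggested upload call are assumptions, not the paper's actual code.

#include <cstdio>
#include <vector>

// One rigid body: center position and bounding sphere radius.
struct Body { float x, y, z, r; };

// Packs the bodies into a flat RGBA float buffer laid out as the paper
// describes: one body per texel, four bodies (16 floats) per texture row.
// The buffer could then be uploaded with e.g. glTexImage2D as a 32-bit
// float RGBA texture of width 4 and height ceil(n / 4).
std::vector<float> packBodiesIntoTexture(const std::vector<Body>& bodies)
{
    const std::size_t texelsPerRow = 4;
    const std::size_t rows = (bodies.size() + texelsPerRow - 1) / texelsPerRow;
    std::vector<float> texture(rows * texelsPerRow * 4, 0.0f); // RGBA per texel

    for (std::size_t i = 0; i < bodies.size(); ++i) {
        float* texel = &texture[i * 4];   // body i -> texel i (row i/4, column i%4)
        texel[0] = bodies[i].x;           // R
        texel[1] = bodies[i].y;           // G
        texel[2] = bodies[i].z;           // B
        texel[3] = bodies[i].r;           // A
    }
    return texture;
}

int main()
{
    std::vector<Body> bodies = { {0, 0, 0, 1}, {3, 0, 0, 1}, {0, 5, 0, 2} };
    std::vector<float> tex = packBodiesIntoTexture(bodies);
    std::printf("texture holds %zu texels in %zu row(s)\n",
                tex.size() / 4, tex.size() / 16);
    return 0;
}

Each shader pass then reads four bodies from one row of this texture and tests them against the remaining rows, which is what splits the problem into n/4 passes.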

4. Real-Time Loop Models

A real-time loop model is needed in order to test the distribution schemes with a real-time system. One of the reasons to create real-time loop models is to simulate the apparent parallelism a real-time application conveys. In an ideal setting, computers would have infinite memory and processing power, so the solution would be simply to distribute all tasks over all resources. In practice this is not the case, because computers have a limited amount of memory and few processing cores. Hence, to overcome this limitation it is necessary to design real-time loop models in a way that simulates parallelism and provides interactivity.

The typical tasks a real-time loop runs can be grouped in three areas: input device query, the update stage and the presentation stage. The first area corresponds to capturing input from user devices, like keyboards, mice, joysticks and microphones, among others. The second area encompasses all tasks that affect the application state; for games, it includes game AI (Artificial Intelligence), game logic (decisions based on game rules and user data), physics simulation, animation, and others. The third area presents the results to the user, using audio (sound effects, background music) and video (scene rendering).

Real-time loop models can be categorized as coupled or uncoupled models [16]. The coupling relates to a dependence on the execution ordering of some tasks. In practice, this means that if a task takes too long to be processed, other tasks that depend on it will start later, degrading application performance (and possibly degrading interactivity). Hence, the main idea of uncoupled models is to separate the tasks that may interfere with each other. For example, consider the simple coupled model [16] depicted in Figure 3 and commonly found in the literature [13][11].

Figure 3. Simple coupled model.

This model is the simplest approach to run a real-time application, where the three stages run sequentially in a single loop. A delay in one of the stages is immediately noticed by the user, as one stage cannot start while the previous one is still running. An example of an uncoupled model is the single thread uncoupled model [16]. This model separates the update and rendering stages, so they can run independently, as Figure 4 illustrates.

Figure 4. Single thread uncoupled model.

The single thread uncoupled model with a GPGPU stage uncoupled from the main loop is an extension of this model, and presents a new and very efficient approach for using the GPU as a math and physics co-processor, as Figure 5 shows.

Figure 5. Single thread uncoupled model with a GPGPU stage uncoupled from the main loop.

In parallel programming models, like the presented architecture, data are processed simultaneously. When the data are independent, as with the render and GPGPU stages, there are no problems; however, several problems arise when data are shared or must be processed in a specific order, as with the update and GPGPU stages. In this case it is necessary to guarantee mutually exclusive access to shared data and to preserve task execution ordering, so synchronization is applied between the update stage and the GPGPU stage. To avoid problems with data shared among the stages, the model uses synchronization objects, as multithreaded applications commonly do. Synchronization objects are tools for handling task dependence and execution ordering. They must also be applied carefully in order to avoid thread starvation and deadlocks.
The current implementation of the presented model uses semaphores as the synchronization object, but other synchronization objects could be used.
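As a rough sketch of how the update and GPGPU stages can be synchronized, the following C++ example hands one frame's worth of work to a GPGPU thread and waits for the result before the next update touches the shared data. The paper only states that semaphores are used; this sketch substitutes a mutex and condition variable as the synchronization object, and every name in it is illustrative.

#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

// Shared state exchanged between the update stage and the GPGPU stage.
struct SharedFrame {
    std::mutex mtx;
    std::condition_variable cv;
    bool workReady = false;   // update stage has produced data for the GPGPU stage
    bool resultReady = false; // GPGPU stage has finished the frame's task
    bool quit = false;
    int frame = 0;
};

// GPGPU stage thread: waits for work, runs the task (a stand-in here),
// then signals the update stage that the result is ready.
void gpgpuStage(SharedFrame& s)
{
    for (;;) {
        std::unique_lock<std::mutex> lock(s.mtx);
        s.cv.wait(lock, [&] { return s.workReady || s.quit; });
        if (s.quit) return;
        s.workReady = false;
        // ... dispatch the collision shader and read back its results here ...
        s.resultReady = true;
        s.cv.notify_all();
    }
}

int main()
{
    SharedFrame shared;
    std::thread worker(gpgpuStage, std::ref(shared));

    for (int frame = 0; frame < 3; ++frame) {            // main loop (update stage)
        {   // hand this frame's task to the GPGPU stage
            std::lock_guard<std::mutex> lock(shared.mtx);
            shared.frame = frame;
            shared.workReady = true;
        }
        shared.cv.notify_all();

        // ... input query and rendering run here, overlapping the GPGPU stage ...

        {   // before the next update touches shared data, wait for the result
            std::unique_lock<std::mutex> lock(shared.mtx);
            shared.cv.wait(lock, [&] { return shared.resultReady; });
            shared.resultReady = false;
        }
        std::printf("frame %d processed by the GPGPU stage\n", frame);
    }

    { std::lock_guard<std::mutex> lock(shared.mtx); shared.quit = true; }
    shared.cv.notify_all();
    worker.join();
    return 0;
}

The point of the uncoupled model is that rendering keeps running while the GPGPU stage works, so only the update stage ever blocks on the shared data.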

5. Schemes for Task Distribution between CPU and GPU

This section presents the schemes for task distribution between CPU and GPU. It is important to mention that, for correct distribution between the processors (GPU and CPU), it is necessary to implement the algorithm for both of them (for example, the collision algorithm had to be written for both the CPU and the GPU). This ensures that, even though different programming paradigms lead to different implementations, the same task can be run similarly on both processors. This section is divided into two subsections, one for the manual decision and another for the automatic decisions.

5.1. Manual Decision to Distribute

The simplest decision on how to distribute tasks between the CPU and GPU is to let the user or developer decide. This decision can be made in the application via a C function or even a script language, such as Lua [6]. Former work [17] implements task scheduling distribution between CPU and GPU via a script file.

5.2. Automatic Decisions to Distribute

The heuristics for automatic distribution are implemented by deriving from a base class, Distributor. This class provides a time counter for its subclasses that is updated automatically. At each frame, the classes derived from Distributor are asked for a decision on how to process the next frame. Figure 6 illustrates an application loop based on this approach.

Figure 6. A loop with automatic distribution.

Each class that inherits from Distributor must implement the function decidemode, which is purely virtual, to make use of this strategy. This function is responsible for deciding where the application will execute the next frame of the task. The schemes for automatic task distribution presented in this paper are: starting distribution, cycle distribution, best time distribution and resource distribution.

5.2.1. Starting Distribution

The starting strategy for automatic distribution between CPU and GPU is very simple: it calculates 10 frames in GPU mode and 10 frames in CPU mode. Based on these measurements, it selects the fastest processor to process all the remaining frames of the application. Algorithm 1 lists the pseudo-code for this approach.

Algorithm 1. Starting distribution
    if framecount == 10 then
        calculateelapsedtimegpu()
    if framecount == 20 then
        calculateelapsedtimecpu()
        if GPUTime < CPUTime then
            mode <- GPU
        else
            mode <- CPU
    return mode

The reason this method calculates 10 frames on each processor is to avoid the wrong decision that could be made if the scheme calculated only 1 frame on each processor. All the subsequent schemes spend 5 or 10 frames in initial tests on each processor for the same reason. Under normal conditions, this method always selects the fastest processor without the user or developer having to select it. This is important for a developer who does not know the hardware where the application is going to run and wants to use the fastest processor available. Even though the starting automatic distribution selects the fastest processor, it does not prevent the application from being slowed down by other processes in the system if the CPU or GPU is already overloaded with work. Avoiding this requires a distribution strategy that keeps track of the performance every frame.

5.2.2. Cycle Distribution

The cycle distribution has the following strategy: every 100 frames, the engine calculates 5 frames in GPU mode and another 5 frames in CPU mode. With these times, the distribution chooses the fastest processor to simulate the next 90 frames, as Algorithm 2 lists.

Algorithm 2. Cycle distribution
    if framecount == 5 then
        calculateelapsedtimegpu()
    if framecount == 10 then
        calculateelapsedtimecpu()
        if GPUTime < CPUTime then
            mode <- GPU
        else
            mode <- CPU
    if framecount == 100 then
        framecount <- 0
    return mode

The reason this method uses a cycle of 100 frames is to have a cycle that is long enough that the slower processor does not slow down the overall application, and also short enough that it allows real-time redistribution between the processors. From the pseudo-code, it can be seen that with this scheme only 5% of the frames are spent in the slower scenario and 95% in the best one. This scheme is ideal for tasks where the performance difference between the processors is low.
If the difference between the processors is high, however, the 5% of the frames spent in the slowest mode will affect the overall performance of the application.
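To illustrate the class structure described in Section 5.2, the sketch below shows how a Distributor base class with a pure virtual decidemode and a subclass implementing the starting distribution might look. Apart from decidemode, all names and details are assumptions for illustration, and the frame-timing bookkeeping is simplified compared with the real framework.

#include <cstdio>

enum class Mode { CPU, GPU };

// Base class for all distribution heuristics. The framework updates the
// frame counter and the per-processor elapsed times; subclasses only decide
// which processor runs the next frame.
class Distributor {
public:
    virtual ~Distributor() = default;
    virtual Mode decidemode() = 0;                 // pure virtual decision hook

    void reportFrameTime(Mode mode, double ms) {   // called by the loop each frame
        ++framecount;
        if (mode == Mode::GPU) gpuTime += ms; else cpuTime += ms;
    }

protected:
    int framecount = 0;
    double gpuTime = 0.0;
    double cpuTime = 0.0;
};

// Starting distribution: measure 10 frames on the GPU, then 10 on the CPU,
// and stick with the faster processor afterwards.
class StartingDistributor : public Distributor {
public:
    Mode decidemode() override {
        if (framecount < 10) return Mode::GPU;     // first 10 frames on the GPU
        if (framecount < 20) return Mode::CPU;     // next 10 frames on the CPU
        return (gpuTime < cpuTime) ? Mode::GPU : Mode::CPU;
    }
};

int main() {
    StartingDistributor dist;
    for (int frame = 0; frame < 25; ++frame) {
        Mode mode = dist.decidemode();
        double ms = (mode == Mode::GPU) ? 2.0 : 3.5;  // stand-in frame times
        // ... run the frame's task on the chosen processor here ...
        dist.reportFrameTime(mode, ms);
        std::printf("frame %2d -> %s\n", frame, mode == Mode::GPU ? "GPU" : "CPU");
    }
    return 0;
}

A cycle, best time or resource distributor would derive from the same base class and change only the decision logic in decidemode.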

5.2.3. Best Time Distribution

The idea of this scheme is to create an approach that is able to redistribute tasks between CPU and GPU without the cost of running many tests on the slower processor, as the cycle distribution scheme does. The strategy is as follows: it starts by calculating 10 frames on the GPU and another 10 on the CPU. At the end of these 20 frames it determines the processor with the fastest time and uses it for the next 10 frames. If the time of this processor after these 10 frames is still lower than the best time of the other processor, it keeps using it; otherwise it calculates 10 frames on the other processor, and so forth. This scheme always saves the fastest time of each processor. Algorithm 3 implements this strategy. The idea of this distribution is to use the fastest processor in most cases and to use the slower one to take work away from the fastest processor when it is overloaded.

Algorithm 3. Best time distribution
    if mode == GPU then
        if GPUBestTime > GPUTime then
            GPUBestTime <- GPUTime
    if mode == CPU then
        if CPUBestTime > CPUTime then
            CPUBestTime <- CPUTime
    if framecount == 10 then
        calculateelapsedtimegpu()
    if framecount == 20 then
        calculateelapsedtimecpu()
        if GPUBestTime < CPUBestTime then
            mode <- GPU
        else
            mode <- CPU
    if framecount > 30 then
        if mode == GPU then
            if GPUTime > CPUBestTime then
                framecount <- 21
        if mode == CPU then
            if CPUTime > GPUBestTime then
                framecount <- 21
    return mode

5.2.4. Resource Distribution

The idea of the resource strategy is to use libraries to check processor usage (as a percentage). The Windows API was used for this verification on the CPU. Initial tests have shown that the CPU usage varies with the number of bodies and remains constant when this number does not vary, showing that this measurement can be used for the purpose of distribution. NVPerfKit [7] was used for checking this parameter on the GPU. The initial tests have shown that the GPU usage varies very little with the number of bodies and does not remain constant when this number is constant, showing that it is not possible to predict whether or not the GPU is overloaded with work by this method. For this reason, GPU usage is not suitable for the purpose of distribution.

This scheme therefore uses only the CPU usage, with the following strategy: 10 frames are calculated on the GPU and 10 frames on the CPU. Based on the obtained results, the fastest processor is selected, in the same manner as the starting automatic distribution. If the fastest processor is the CPU, a variable usecpu is activated, and after 10 frames the scheme starts to verify the percentage of CPU use; if this percentage rises above 80%, it sends the next 10 frames to the GPU. Algorithm 4 implements this approach.

Algorithm 4. Resource distribution
    if framecount == 10 then
        calculateelapsedtimegpu()
    if framecount == 20 then
        calculateelapsedtimecpu()
        if GPUTime < CPUTime then
            mode <- GPU
        else
            mode <- CPU
            usecpu <- true
    if usecpu AND framecount > 30 then
        perc <- getperccpu()
        if perc > 80 then
            framecount <- 21
    return mode

This automatic task distribution scheme is aimed at simulations where the CPU is faster than the GPU, and the CPU uses the GPU as an auxiliary processor when it is overloaded with work. When the GPU is faster than the CPU, this scheme is not able to redistribute tasks and behaves like the starting automatic distribution.
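The paper does not specify which Windows API call is used to read the CPU load. One common approach, shown in the hedged sketch below, is to sample GetSystemTimes twice and derive a usage percentage from the idle and total times; the 80% threshold mirrors the resource scheme, and everything else is illustrative.

#include <windows.h>
#include <cstdio>

// Converts a FILETIME (100-nanosecond ticks) to a 64-bit integer.
static unsigned long long toUint64(const FILETIME& ft)
{
    ULARGE_INTEGER v;
    v.LowPart  = ft.dwLowDateTime;
    v.HighPart = ft.dwHighDateTime;
    return v.QuadPart;
}

// Returns the system-wide CPU usage (0-100) over a short sampling window.
// Kernel time reported by GetSystemTimes already includes idle time, so
// usage = 1 - idle_delta / (kernel_delta + user_delta).
static double sampleCpuUsagePercent(DWORD windowMs)
{
    FILETIME idle0, kernel0, user0, idle1, kernel1, user1;
    GetSystemTimes(&idle0, &kernel0, &user0);
    Sleep(windowMs);
    GetSystemTimes(&idle1, &kernel1, &user1);

    const double idle  = double(toUint64(idle1)   - toUint64(idle0));
    const double total = double(toUint64(kernel1) - toUint64(kernel0)) +
                         double(toUint64(user1)   - toUint64(user0));
    return total > 0.0 ? 100.0 * (1.0 - idle / total) : 0.0;
}

int main()
{
    const double perc = sampleCpuUsagePercent(200);
    std::printf("CPU usage: %.1f%%\n", perc);
    if (perc > 80.0)
        std::printf("CPU overloaded: the resource scheme would send the next frames to the GPU\n");
    return 0;
}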

6. Results

Table 2 shows the numerical results of all the task distribution schemes, under normal conditions, with the test case.

Table 2. Numerical results, in milliseconds, of the schemes for task distribution over 100 frames of the application. Columns: # Bodies; Manual Distribution (CPU, GPU); Automatic Distribution (Starting, Cycle, Best Time, Resource).

These results demonstrate that the starting distribution, under normal conditions, always behaves like the best case. The cycle distribution behaves almost like the best case when the difference between the CPU implementation and the GPU implementation is low; in the other cases, the 5% of frames spent on the slower processor affects the overall performance of the application. The results also demonstrate that the best time distribution behaves almost like the best case, with the advantage that it can redistribute tasks between the processors, and that the resource distribution behaves like the best case when there are 1024 or more bodies, because in that case it does not have to check the CPU usage percentage.

To test the schemes in a real-time system, this work has implemented the collision shader (as the GPGPU stage) together with the public physics library Open Dynamics Engine (ODE) [14] (as the update stage) in the single thread uncoupled model with a GPGPU stage uncoupled from the main loop. For data input and rendering, an academic framework was used [15]. The numerical results, in frames per second, are shown in Table 3.

Table 3. Numerical results, in FPS, of the single thread uncoupled model with a GPGPU stage uncoupled from the main loop with the distribution schemes. Columns: # Bodies; Manual Distribution (CPU, GPU); Automatic Distribution (Starting, Cycle, Best Time, Resource).

These results show that the task distribution schemes between CPU and GPU behave well with a real-time system and can be applied in real-time applications, such as games.

7. Conclusion

This work has proposed different schemes to distribute tasks between CPU and GPU in order to take advantage of both processors. These task distribution schemes can be of two kinds: manual distribution or automatic distribution. One scheme for automatic task distribution, the starting distribution, is designed to be executed only at the beginning of the simulation. The importance of this mode is that it can select the fastest processor without the user or developer having to determine it. The other distribution strategies address the fact that other applications or the system itself can slow down the simulation; with an automatic distribution, tasks can be redistributed between the processors, performing most of the work on the fastest processor. The automatic distribution schemes performed well in the tests, with the best time distribution behaving closest to the best case while still being able to redistribute tasks.

This paper has also presented a bounding sphere collision detection implemented on both the CPU and the GPU to test the task distribution schemes, with good results. To test the schemes for task distribution between CPU and GPU with real-time systems, this work has presented a single thread uncoupled model with a GPGPU stage uncoupled from the main loop. These tests show that the schemes for task distribution can be used with real-time systems.
References

[1] A. Anderson, W. G. III, and P. Schröder. Quantum Monte Carlo on graphical processing units. Computer Physics Communications, 177(3).
[2] C. Ericson. Real-Time Collision Detection. Morgan Kaufmann.
[3] J. Georgii, F. Echtler, and R. Westermann. Interactive simulation of deformable bodies on GPUs. Proceedings of Simulation and Visualization 2005.
[4] N. K. Govindaraju, S. Redon, M. C. Lin, and D. Manocha. CULLIDE: interactive collision detection between complex models in large environments using graphics hardware. Graphics Hardware 2003, pages 25-32.
[5] S. Green. GPGPU physics. SIGGRAPH GPGPU Tutorial.

[6] R. Ierusalimschy, L. H. de Figueiredo, and W. Celes. Lua - an extensible extension language. Software: Practice & Experience, 26(6).
[7] J. Kiel and S. Dietrich. GPU performance tuning with NVIDIA performance tools. Game Developers Conference.
[8] P. Kipfer, M. Segal, and R. Westermann. UberFlow: a GPU-based particle engine. Graphics Hardware 2004.
[9] C. Muller, M. Strengert, and T. Ertl. Adaptive load balancing for raycasting of non-uniformly bricked volumes. Parallel Computing, 33(6).
[10] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. E. Lefohn, and T. J. Purcell. A survey of general-purpose computation on graphics hardware. Computer Graphics Forum, 26(1).
[11] A. Rollings and D. Morris. Game Architecture and Design: A New Edition. New Riders Publishing.
[12] T. Rudomýn, E. Millán, and B. Hernández. Fragment shaders for agent animation using finite state machines. Simulation Modelling Practice and Theory, 13(8).
[13] D. Sanchez and C. Dalmau. Core Techniques and Algorithms in Game Programming. New Riders Publishing.
[14] R. Smith. Open Dynamics Engine. Accessed 20/12/2007.
[15] L. Valente. Guff: um framework para desenvolvimento de jogos. Master's thesis, Universidade Federal Fluminense. In Portuguese.
[16] L. Valente, A. Conci, and B. Feijó. Real time game loop models for single-player computer games. In Proceedings of the IV Brazilian Symposium on Computer Games and Digital Entertainment, pages 89-99.
[17] M. Zamith, E. Clua, P. Pagliosa, A. Conci, A. Montenegro, and L. Valente. The GPU used as a math co-processor in real time applications. Proceedings of the VI Brazilian Symposium on Computer Games and Digital Entertainment, pages 37-43.
[18] C. Zeller. Cloth simulation on the GPU. ACM SIGGRAPH 2005 Sketches.
