Speed Up Your Codes Using GPU

Wu Di and Yeo Khoon Seng (Department of Mechanical Engineering)

The use of Graphics Processing Units (GPUs) for rendering is well known, but their power for general parallel computation has only recently been explored. Parallel algorithms running on GPUs can often achieve speed-ups of up to 10 times over comparable CPU algorithms, and the technology has been applied to many fields such as physics simulation, signal processing, financial modeling, neural networks and countless others. The model for GPU computing is to use a CPU and a GPU together in a heterogeneous co-processing model: the sequential part of the application runs on the CPU, and the computationally intensive part is accelerated by the GPU. From the user's perspective, the application simply runs faster, because it uses the high performance of the GPU to boost throughput.

GPUs differ from CPUs in that they are designed to run hundreds or even thousands of threads simultaneously (Fig. 1). In gaming, for example, a GPU may run separate threads to render individual pixels of an image. Programming a CPU, by contrast, restricts you to 1, 2 or 4 threads. The advantage of CPUs is that individual threads can run totally different programs, whereas a GPU is designed to run the same program across thousands of threads at once. GPUs in that sense truly process data in parallel, and the programmer should design GPU programs with this processing model in mind.

Figure 1: Comparison of structure between CPU and GPU
To run code on a GPU device, you need an environment in which you can develop with CUDA C. The following items are necessary:

(a) A CUDA-enabled graphics processor
(b) An Nvidia device driver
(c) A CUDA development toolkit
(d) A standard C compiler

The excellent HPC folks at Computer Centre have already set all of the above up for your convenience. In accordance with the laws governing written works of computer programming, below is a Hello, world! example that illustrates how to invoke multiple threads on GPU devices. In this code, CUDA C adds the __global__ qualifier to standard C; this mechanism alerts the compiler that a function should be compiled to run on the device instead of the host. There is nothing special about passing parameters to a kernel: a kernel call looks and acts much like any function call in standard C, and the runtime system takes care of any complexity introduced by parameters that need to get from the host to the device. In the main function, the two values inside the angle brackets indicate that the kernel launch consists of one block containing five threads. Please refer to the CUDA C book for details on block and thread definitions. After compiling the code with nvcc, the program produces the output shown below the listing.
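The original listing is not reproduced here, so the sketch below is reconstructed from the surrounding description: a kernel marked __global__, launched with one block of five threads, and receiving the constant 1.2345 as a parameter. The kernel name hello and the exact message format are assumptions (note that device-side printf() requires a GPU of compute capability 2.0 or later).

```c
#include <stdio.h>

/* __global__ marks a function (a "kernel") to be compiled for the device. */
__global__ void hello(float f)
{
    /* Every launched thread executes this printf() once;
       threadIdx.x is the thread's index within its block. */
    printf("Hello, world! I am thread %d, f=%.4f\n", threadIdx.x, f);
}

int main(void)
{
    /* <<<1, 5>>> launches one block containing five threads. */
    hello<<<1, 5>>>(1.2345f);

    /* Wait for the kernel to finish so its output is flushed. */
    cudaDeviceSynchronize();
    return 0;
}
```

On such a device this prints something like:

```
Hello, world! I am thread 0, f=1.2345
Hello, world! I am thread 1, f=1.2345
Hello, world! I am thread 2, f=1.2345
Hello, world! I am thread 3, f=1.2345
Hello, world! I am thread 4, f=1.2345
```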
As we can see, each thread that encounters the printf() call produces one line of output, so there are as many lines as there are threads launched in the grid. As expected, the global value 1.2345 is common to all threads, while the local value (threadIdx.x) is distinct for each thread. A for-loop fragment can be accelerated on the GPU just as easily. The following example illustrates how CUDA C can be used to sum two vectors.
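Again, the original listing is not shown; the sketch below follows the classic CUDA vector-addition example and is consistent with the walkthrough that follows (device arrays dev_a, dev_b and dev_c, a one-dimensional grid of N blocks, and a tid < N guard). The vector length N = 10 and the host-side fill values are assumptions.

```c
#include <stdio.h>

#define N 10

__global__ void add(int *a, int *b, int *c)
{
    int tid = blockIdx.x;              /* this block's index picks the element */
    if (tid < N)                       /* guard against out-of-range indices  */
        c[tid] = a[tid] + b[tid];
}

int main(void)
{
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    /* Allocate memory on the device. */
    cudaMalloc((void **)&dev_a, N * sizeof(int));
    cudaMalloc((void **)&dev_b, N * sizeof(int));
    cudaMalloc((void **)&dev_c, N * sizeof(int));

    /* Fill the input arrays on the host. */
    for (int i = 0; i < N; i++) {
        a[i] = i;
        b[i] = i * i;
    }

    /* Copy the inputs to the device. */
    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    /* Launch a one-dimensional grid of N blocks, one thread each. */
    add<<<N, 1>>>(dev_a, dev_b, dev_c);

    /* Copy the result back to the host. */
    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; i++)
        printf("%d + %d = %d\n", a[i], b[i], c[i]);

    /* Release the device memory. */
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);

    return 0;
}
```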
First, three arrays are allocated on the device using calls to cudaMalloc(): two arrays, dev_a and dev_b, to hold the inputs, and one array, dev_c, to hold the result. Calls to cudaMemcpy() copy the input data to the device with the parameter cudaMemcpyHostToDevice and copy the result back to the host with cudaMemcpyDeviceToHost. After the computation on the device, the allocated memory is released with cudaFree(). In this example we specified N as the number of parallel blocks; the collection of these parallel blocks is called a grid. This tells the runtime system that we want a one-dimensional grid of N blocks, whose threads will have varying values of blockIdx.x, the first taking the value 0 and the last taking the value N-1. Taking four blocks as an example, all of them run through the same copy of the device code but have different values for the variable blockIdx.x. This is what the code executed in each of the four parallel blocks looks like after the runtime substitutes the appropriate block index for blockIdx.x:
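A conceptual sketch of those four copies, assuming the add() kernel body above; this is what the runtime effectively executes, not code you would write yourself:

```c
/* Block 0 executes: */
int tid = 0;
if (tid < N) c[tid] = a[tid] + b[tid];

/* Block 1 executes: */
int tid = 1;
if (tid < N) c[tid] = a[tid] + b[tid];

/* Block 2 executes: */
int tid = 2;
if (tid < N) c[tid] = a[tid] + b[tid];

/* Block 3 executes: */
int tid = 3;
if (tid < N) c[tid] = a[tid] + b[tid];
```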
Why do we check whether tid is less than N? It should always be less than N, since we specifically launched the kernel so that this assumption holds. But if carelessness ever breaks this assumption, the resulting bugs cannot be caught at compile time. The presence of such errors will not prevent the application from continuing to execute, but they will most certainly cause all manner of unpredictable and unsavory side effects downstream. Thus it is necessary to check any operation that might fail, as this could save hours of pain debugging the code later; a sketch of one common checking idiom closes this article. Finally, the achievable speed-up varies with the application, the hardware device and the quality of the code. Understanding the parameters of the GPU devices you are using will help you improve the performance of your applications. Check out the Nvidia resources that explain the technical details of CUDA C (https://developer.nvidia.com/category/zone/cuda-zone).
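As promised, here is a sketch of one common checking idiom: a wrapper macro, here named CUDA_CHECK (our own name, not part of the CUDA API), that tests the cudaError_t returned by each runtime call and aborts with a readable message on failure.

```c
#include <stdio.h>
#include <stdlib.h>

/* CUDA_CHECK is a hypothetical helper, not part of the CUDA API: it
   reports where a runtime call failed and stops the program there. */
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main(void)
{
    int *dev_a;

    /* Any runtime call returning cudaError_t can be wrapped this way. */
    CUDA_CHECK(cudaMalloc((void **)&dev_a, 10 * sizeof(int)));
    CUDA_CHECK(cudaFree(dev_a));

    return 0;
}
```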