Solution of the Transport Equation Using Graphical Processing Units


Gil Gonçalves Brandão
October 2009

1 Introduction

Computational Fluid Dynamics (CFD) has always struggled for faster computing resources to solve its problems. Until a few years ago, it was possible to simply wait a few months and buy a new, faster machine. Unfortunately, current technology has hit a wall: the ever decreasing die size with increasing transistor density has made heat problems unbearable. The path for growth is now, more than ever, parallel computing. The CPU industry has already turned multicore, but, depending on the problem, the speedup of parallel computing can range from negligible to linear scaling with the number of processing units. Meanwhile, the graphics card industry was pursuing its own path, using parallel, dedicated hardware from the start and driven by the rich market demand of gamers. The Graphics Processing Unit (GPU) gained more capabilities and, in the last couple of years, toolkits and dedicated frameworks allowed the true start of the exploration of GPU power for general computing. This work focuses on dissecting and exploring the GPU for solving partial differential equation problems.

1.1 GPU Technologies

On the hardware side, there are mainly two device makers, NVIDIA and ATI, and their successive generations of devices. On the software side there are two kinds of technology to compute data on GPUs: mapping the algorithms to the graphics pipeline, or using the more recent, general-purpose languages. In the first approach there are currently two big families: NVIDIA Cg (http://developer.nvidia.com/page/cg_main.html) and the OpenGL Shading Language (http://www.opengl.org/documentation/glsl). In the general-language field there are the NVIDIA Compute Unified Device Architecture (CUDA, http://www.nvidia.com/object/cuda_home.html), OpenCL (http://www.khronos.org/opencl/) and the Brook family (http://graphics.stanford.edu/projects/brookgpu).

With the release of the CUDA framework, the devices were completely opened to programmers and, since then, all kinds of computational applications have been released: video encoding, weather prediction, molecular simulation, fluid dynamics, computer vision, cryptography, etc. In the CFD context, lattice Boltzmann methods have been studied [11, 7, 15]. Apart from this, the Navier-Stokes equations have been solved using finite volume [2] and finite element [6] codes. In the linear algebra domain, codes have been developed as well.

NVIDIA released a CUDA version of the standard BLAS library (routines that compute simple operations, such as addition and multiplication, at the vector and matrix level). At a higher level (linear system solvers), two main orientations currently seem to exist. On one side there is a strong interest in maintaining the interface of the old LAPACK library (for complete information, search the LAPACK Working Notes at http://www.netlib.org), using the GPU as a high-performance resource: the factorization algorithms (such as the LU, QR or Cholesky factorizations) are being implemented [13] and hybrid CPU-GPU approaches are being studied [12]. On the other side, a new generic approach to algebra algorithms is being developed [4] and GPUs are being used as a test framework for this new approach [1]. Besides these two approaches, there is also work on sparse matrix algebra [5].

1.2 Objectives

The main objectives of the present work are:

- To investigate the concepts behind the technology, their implementation and which key mechanisms can lead to the best performance.
- To investigate the performance of GPU-based computing in a class of CFD problems: solving the advection-diffusion transport equation using finite difference methods.

To achieve the stated objectives, the following was done:

- Port a state-of-the-art benchmark to the CUDA environment, to understand the programming paradigm and compare it with the CPU environment.
- Implement tests that quantify the relative performance of the different memory access methods.
- Implement equivalent programs that solve the one-dimensional advection-diffusion equation in both the CPU and GPU environments, using finite-difference compact schemes to compute the spatial derivatives and the Runge-Kutta method to perform the time integration. The resulting systems of algebraic equations were solved with two direct dense solvers, compared for each environment: an LU-based solver and an inverse-matrix-based solver.

2 CUDA Programming Overview

2.1 GPU Hardware Model

Each GPU is an aggregate of multicore processors (multiprocessors, MP) sharing a global memory; as a parallel system, the GPU is essentially a shared memory system. Each MP is composed of: a number of scalar cores which perform the computations (these scalar cores are specialized in arithmetic instructions); an instruction unit responsible for delivering instructions to the scalar cores; and on-chip shared memory that can be used for scalar core communication (this memory is not accessible by any other MP in the GPU).

The memory system of current NVIDIA GPUs is complex. There are two main groups of memory: on-chip memory (located inside each MP) and off-chip or global memory (located on the GPU board and accessible by all MPs).
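These hardware quantities can be queried at run time through the CUDA runtime API. The following minimal sketch (not part of the author's code) prints the multiprocessor count and the sizes of the memory spaces discussed in this section:

    /* Minimal sketch: query the multiprocessor and memory configuration of device 0. */
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);               /* device 0 */

        printf("device:             %s\n", prop.name);
        printf("multiprocessors:    %d\n", prop.multiProcessorCount);
        printf("global memory:      %lu MB\n", (unsigned long)(prop.totalGlobalMem >> 20));
        printf("shared mem / block: %lu KB\n", (unsigned long)(prop.sharedMemPerBlock >> 10));
        printf("constant memory:    %lu KB\n", (unsigned long)(prop.totalConstMem >> 10));
        printf("registers / block:  %d\n", prop.regsPerBlock);
        printf("warp size:          %d\n", prop.warpSize);
        return 0;
    }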

Global memory is organized into four types: linear memory, texture memory, constant memory and local memory. The main implication of using each type is how an MP accesses the memory: any access to linear memory means using the shared bus; texture and constant memories are cached, but read-only. Because the bus to global memory is shared and accesses are serialized, the GPU has the ability to coalesce some access patterns (an access is said to be coalesced if several requests are fulfilled with a single transaction). On-chip memory has two additional types: the shared memory, which is directly accessible by any scalar core inside each MP, and the local registers, which are private to each scalar core.

In terms of execution in a GPU environment, the minimal computing task is the thread. These threads are created, managed (scheduled) and destroyed by the GPU, i.e., the threads live in hardware space. This is one of the major differences from other common parallel environments. In GPUs the threads are grouped into sets of up to 32 threads, called warps, which are the scheduling unit.

Figure 1: The GPU anatomy

2.2 CUDA Programming Model

The CUDA software model is an extension of the C/C++ programming languages that reflects the underlying GPU hardware. The main extensions are [9, sec. 4.1]: function and variable qualifiers, to specify whether the function or variable refers to the host or to the device, and a directive to configure and start a kernel. A CUDA program is no more than a usual C/C++ program that makes calls to CUDA kernels. A function to be run on the device must have the __device__ or the __global__ qualifier: the former defines functions to be called by code running on the device, the latter defines kernels (additionally, kernel functions must be void typed, i.e., they cannot return any value). By default, functions with no qualifier are considered host functions. In terms of variables, the environment defines their scope, i.e., in device functions the variables belong to the device memory space and in host functions they belong to the host memory space; in the other cases, qualifiers are used (similar to the function ones). The device cannot directly access host variables, nor can the host directly access the device's; the only direct interface is the kernel call, where the kernel parameters are automatically copied to the device's memory. Memory management (allocation, deallocation and copies) is done by the host using dedicated functions, and the host holds the locations of the device's data in its own memory using traditional pointers.
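To make these mechanisms concrete, the sketch below (illustrative only, not the author's code; the kernel name scale and the chosen sizes are arbitrary) allocates device memory from the host, copies data to the device, launches a kernel with an explicit execution configuration and copies the result back:

    /* Illustrative sketch of the CUDA programming model: __global__ kernel,
       host-side memory management and the kernel launch directive. */
    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    __global__ void scale(float *v, float alpha, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   /* built-in index variables */
        if (i < n)
            v[i] *= alpha;
    }

    int main(void)
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        float *h_v = (float *)malloc(bytes);             /* host copy of the data */
        for (int i = 0; i < n; ++i) h_v[i] = 1.0f;

        float *d_v;                                      /* device pointer kept by the host */
        cudaMalloc((void **)&d_v, bytes);
        cudaMemcpy(d_v, h_v, bytes, cudaMemcpyHostToDevice);

        scale<<<(n + 255) / 256, 256>>>(d_v, 2.0f, n);   /* grid and block configuration */
        cudaDeviceSynchronize();                         /* host waits for the device threads */

        cudaMemcpy(h_v, d_v, bytes, cudaMemcpyDeviceToHost);
        printf("v[0] = %f\n", h_v[0]);

        cudaFree(d_v);
        free(h_v);
        return 0;
    }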

Regarding the execution configuration, the threads are organized in a matrix-like form called a block; each block is assigned to an MP. The blocks are in turn organized in a matrix-like form called the grid. Within an MP, each thread has built-in variables that can be used to assign tasks to particular threads. After the kernel is launched, the mapping of each thread to an MP and its scalar cores is done automatically by the hardware. The execution of the device's threads is asynchronous with respect to the host, i.e., the host can execute other, device-unrelated code while the device is processing. Figure 2 shows the thread organization as well as the asynchronous execution feature. Synchronization is done with a function on which the host program waits until all threads on the device have finished their work.

Figure 2: Asynchronous execution of device-unrelated code while the device is processing the data.

3 CUDA Environment Tests

Some tests were done to understand the capabilities of the system used. The tests are based on two NVIDIA benchmarks and, following [7], a port of the Stream benchmark (http://www.cs.virginia.edu/stream/) was implemented.

3.1 Peak Throughput

In order to know the peak throughput achievable with the GPU, an NVIDIA test is used. It consists of an unrolled loop with a total of 2048 FMAD instructions (an FMAD is a hardware instruction that computes a multiplication and an addition in one step, e.g., a*b + c). The maximum value achieved is 617 GFLOPS. This maximum is not steadily attained from the beginning for block sizes of 32, 64 and 128. Looking at the data, it is found that full performance is only achieved if the scheduler has 32 or more threads to take care of.

3.2 Bandwidth: Host-device transfers

A benchmark that copies blocks of data from the host to the device, using all the methods that the API provides, was implemented. The total time is evaluated as a function of the number of elements transferred. There is a high performance penalty when transferring small quantities of data. For large transfers (more than 10^6 elements) a globally stable value was achieved.
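A measurement of this kind can be sketched with CUDA events; the following is illustrative only (the transfer sizes are arbitrary and it times only plain cudaMemcpy, not all the API methods used in the actual benchmark):

    /* Illustrative sketch: effective host-to-device bandwidth as a function of size. */
    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        for (int n = 1 << 10; n <= 1 << 24; n <<= 2) {
            size_t bytes = (size_t)n * sizeof(float);

            float *h = (float *)malloc(bytes);
            memset(h, 0, bytes);
            float *d;
            cudaMalloc((void **)&d, bytes);

            cudaEvent_t start, stop;
            cudaEventCreate(&start);
            cudaEventCreate(&stop);

            cudaEventRecord(start, 0);
            cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
            cudaEventRecord(stop, 0);
            cudaEventSynchronize(stop);

            float ms = 0.0f;
            cudaEventElapsedTime(&ms, start, stop);       /* elapsed time in milliseconds */
            printf("N = %8d   %6.2f GB/s\n", n, bytes / (ms * 1.0e6));

            cudaEventDestroy(start);
            cudaEventDestroy(stop);
            cudaFree(d);
            free(h);
        }
        return 0;
    }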

3.3 Bandwidth: Device-device transfers

The performance of the main intra-device transfers is evaluated: different versions of a vector copy operation are implemented, using read operations from each of the available memory spaces and a write to global memory. Because the texture and constant memories are cached, a vector form of the sum reduction operation, a(i) = \sum_{j=1}^{M} b(j), i = 1, ..., N, is also implemented to take advantage of them. And because global memory is not cached, a software-based cache is implemented with shared memory (in fact, no true cache mechanism [8] is implemented; the code simply takes advantage of knowing in advance which data will be accessed most frequently).

The most important observation is that device memory performance degrades drastically for small vector dimensions: the theoretical bandwidth is 102 GB/s, but for small sizes (N < 16k) less than 7 GB/s was achieved. The test scales in performance and, for sufficiently large vector sizes (N > 2^20), significant bandwidth was achieved (B > 60 GB/s). In [14] higher performance (89% of peak) is reported, but the devices used are not the same. In terms of relative performance between the access types, direct access to global memory and access through the texture cache perform identically, while access to constant memory is slower. In the cached-access test the results are significantly different from the previous test: even with small sets, and for all access types, the performance exceeded the memory's bandwidth. The software cache implemented with shared memory presents by far the best results, for large sizes attaining a peak of 287 GB/s (280%). Access through texture memory also outperformed the global memory limit, by a factor of 178%.

4 Stream benchmark

The Stream benchmark consists of four vector operations: vector copy, product by a scalar, vector sum, and vector sum plus scalar product (triad). The performance is evaluated as a function of the number of blocks and the block configuration. The previous results showed completely different behaviour for small and large problem sizes, so two different vector sizes are evaluated. The phenomenon of poor performance for small sizes is maintained, and it has one parameter in common for all configurations: 256 threads per MP (which corresponds to the size of the warp). This result is identical to the FLOPS test, but now with memory transactions in play. For large vector dimensions the performance is approximately constant and equal to 75 GB/s. Finally, a comparison between the host CPU performance and the device performance (not considering host-device transfers) is done. The results are presented in table 1.

                  N = 2E3                           N = 2E6
    operation   CPU (MB/s)  GPU (MB/s)  speedup   CPU (MB/s)  GPU (MB/s)  speedup
    copy        2314        2003        0.87      3245        74482       22.9
    scale       2274        1973        0.87      3181        74814       23.5
    add         3247        3004        0.93      3188        77136       24.2
    triad       3247        3004        0.93      3310        76666       23.1

Table 1: Stream benchmark results
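As a reference for the kind of kernel involved, the triad operation a(i) = b(i) + alpha*c(i) can be written in a few lines; this is an illustrative sketch, not the author's port of the benchmark:

    /* Illustrative sketch of the Stream "triad" operation a = b + alpha*c on the GPU. */
    __global__ void triad(float *a, const float *b, const float *c, float alpha, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            a[i] = b[i] + alpha * c[i];   /* two coalesced reads and one write per element */
    }

    /* Host-side launch for N elements with 256 threads per block:
       triad<<<(N + 255) / 256, 256>>>(d_a, d_b, d_c, 2.0f, N);   */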

5 Solution of the 1D transport equation

The model used to test the computational speedup of GPGPU programming is a linearized version of the Burgers equation, the uni-directional transport equation

\frac{\partial u}{\partial t} + U_0 \frac{\partial u}{\partial x} = \nu \frac{\partial^2 u}{\partial x^2}    (1)

where U_0 and \nu are real constants and u is a continuous field. To solve this equation for u, equation 1 is first rewritten in explicit form (equation 2):

F(t, x) = \frac{\partial u}{\partial t} = \nu \frac{\partial^2 u}{\partial x^2} - U_0 \frac{\partial u}{\partial x}    (2)

As shown by equation 2, this is an initial value problem. To solve it, the classic 4th-order Runge-Kutta method [3] is used. The spatial derivatives (first and second) are calculated using 4th-order compact schemes [10]. These methods are a particular case of central difference schemes (eq. 3), and the derivatives are calculated by solving a linear equation system Ax = b, where A is an N x N matrix and x and b are N-element vectors:

\beta(u'_{i-2} + u'_{i+2}) + \alpha(u'_{i-1} + u'_{i+1}) + u'_i = \frac{1}{\Delta x}\left( a\,(u_{i+1} - u_{i-1}) + b\,(u_{i+2} - u_{i-2}) + c\,(u_{i+3} - u_{i-3}) \right)    (3)

In matrix form, the problem is now formulated as:

\frac{\partial u}{\partial t} = \nu A_2^{-1} B_2 u - U_0 A_1^{-1} B_1 u    (4)

The computational domain of the problem is defined by the following constraints: the domain of the spatial coordinate is normalized, x \in [0, 1]; x is a uniform mesh of N points (hence \Delta x = 1/(N-1)); the time step \Delta t is constant and is given by the Courant number; the simulation has NT time steps; U_0 is normalized (U_0 = 1); the viscosity coefficient \nu is calculated as a function of the Fourier number (even though calculating the physical constant is not natural practice in fluid mechanics problems, where the objective is to compute the flow for a given fluid, the focus of the current work is computational, so the problem's parameters are computed as a function of computationally significant quantities, such as the number of points and iterations, and of numerical stability).

As stated, the compact scheme methods used belong to a particular case of central differences, so special considerations have to be taken at both boundaries of the spatial domain (u(t, x = 0) and u(t, x = 1)). On the right side a Dirichlet condition is imposed, i.e., u(t, x = 1) = 0. On the opposite side, the model order reduction presented in [10] was implemented, and on this boundary the problem is represented by a forward 3rd-order scheme.

5.1 Implementation

The implementation consists of two versions of the code: a C-based serial (single-processor) version and a CUDA-based version. The code of each version is as identical as possible. The core of the simulation is a simple sequential loop in the time variable; in each iteration the velocity field u is updated by the Runge-Kutta integration. All data is logged into memory and dumped to a file at the end.
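The time loop can be sketched as follows (illustrative only; the helper apply_F, which stands for the evaluation of the right-hand side of equation 4, and the array names are hypothetical and not taken from the author's code):

    /* Illustrative sketch of one classic 4th-order Runge-Kutta step for equation 2,
       where apply_F is a hypothetical helper that evaluates the right-hand side of
       equation 4, F(u) = nu*A2^{-1}*B2*u - U0*A1^{-1}*B1*u. */
    void apply_F(const float *u, float *out, int n);     /* hypothetical placeholder */

    void rk4_step(float *u, float *k1, float *k2, float *k3, float *k4,
                  float *tmp, int n, float dt)
    {
        apply_F(u, k1, n);                                              /* k1 = F(u)           */
        for (int i = 0; i < n; ++i) tmp[i] = u[i] + 0.5f * dt * k1[i];
        apply_F(tmp, k2, n);                                            /* k2 = F(u + dt/2*k1) */
        for (int i = 0; i < n; ++i) tmp[i] = u[i] + 0.5f * dt * k2[i];
        apply_F(tmp, k3, n);                                            /* k3 = F(u + dt/2*k2) */
        for (int i = 0; i < n; ++i) tmp[i] = u[i] + dt * k3[i];
        apply_F(tmp, k4, n);                                            /* k4 = F(u + dt*k3)   */
        for (int i = 0; i < n; ++i)
            u[i] += dt / 6.0f * (k1[i] + 2.0f * k2[i] + 2.0f * k3[i] + k4[i]);
    }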

In the CUDA version, after the problem initialization, all the necessary data is copied to the GPU memory; only at the end is the data downloaded back into main memory. For the linear system solver, two direct methods are considered: an LU solver (with partial pivoting) and a matrix inversion solver. In both methods all constants are computed during initialization, using a serial CPU method: the LU solver computes the pivots, L and U, during initialization, and the inverse-based solver computes the A^{-1} matrices.

5.1.1 Solving an LU-factorized linear system on the GPU

The only routine inside the main loop that was not implemented by any CUDA-based package is the equivalent of the LAPACK sgetrs routine. This routine forms a pair with the sgetrf routine: sgetrf computes the LU factorization, with partial pivoting, of a general M x N matrix, and sgetrs solves a linear system AX = B with that previously factorized A matrix. The netlib version of the sgetrs function is used as a guide to port the function to the CUDA architecture. The routine makes calls to BLAS routines that are available in the CUBLAS package, so those were used. It also calls a LAPACK internal routine, slaswp, which had to be implemented. slaswp applies a given permutation (in the form of row interchanges) to a matrix: it receives an integer array with the indices of the rows that need to be permuted and the matrix to operate on. The algorithm, as implemented by LAPACK, is inherently sequential because the order of the row interchanges matters: in the LAPACK standard, the indices in the pivot array returned by sgetrf may be repeated, which leads to different results if the interchanges are applied in a different order. If the predetermined (sequential) order is not followed, the final output will differ and lead to an erroneous solution. However, the columns of the solution matrix are completely independent, so a decomposition may be done by mapping each task to a column.
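A minimal sketch of this column-wise decomposition is shown below (illustrative only, not the author's implementation; column-major storage with leading dimension ldb and 1-based LAPACK pivot indices are assumed):

    /* Illustrative sketch: apply the slaswp row interchanges to the right-hand-side
       matrix B, one thread per column, so the sequential order of the interchanges
       is preserved inside each (independent) column. */
    __global__ void slaswp_columns(float *B, int ldb, int n_cols, const int *ipiv, int k)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (col >= n_cols)
            return;

        float *b = B + (size_t)col * ldb;     /* column-major storage, as in LAPACK */
        for (int i = 0; i < k; ++i) {         /* interchanges applied in their original order */
            int ip = ipiv[i] - 1;             /* LAPACK pivot indices are 1-based */
            if (ip != i) {
                float tmp = b[i];
                b[i] = b[ip];
                b[ip] = tmp;
            }
        }
    }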

6 Metrics

The essential metric in this work is the time ratio between the serial version and the CUDA version, G = t_serial / t_cuda. Since the initialization is always done on the CPU, another important measure for the CUDA version is the ratio between the initialization time and the total time, R = t_init / t_total.

6.1 Simulation Results

All the simulations were made with C = 0.3, F = 0.1 and 500 time steps. Performance is evaluated as a function of the problem size.

Figure 3: Performance. (a) Absolute speedup; (b) initialization ratio.

Figure 3a shows the evolution of the speedup of both methods; the left axis carries the scale for the inverse method and the right axis the scale for the LU method. The results are significantly different for each method. The speedup obtained with the CUDA version of the inverse method is always greater than unity: the lowest value is 6.5 (N = 500) and the maximum is 15.2 (N = 2000). With the LU method a speedup of 1 was achieved in only one case; in all other cases the performance was poorer than the serial version. Another important measure is the comparison of the best method on each platform, i.e., the inverse method on the GPU against the LU method on the CPU (T_{C,LU} / T_{G,inv}), represented in figure 3a by the blue dashed line (left-axis scale): the computation is now 3.5 to 10.8 times faster.

Figure 3b shows the evolution of the ratio between the initialization part and the total simulation time (on the GPU); the left axis carries the scale for the inverse method and the right axis the scale for the LU method. As the number of points grows, the initial inversion becomes very significant: with N = 500 the ratio is about 30% of the total; with N = 2000 (the speedup peak) the initialization takes 57% of the total; and for N = 5000 it is 82%. With the LU method the evolution is completely different, as the initialization fraction is negligible (< 3%). The same figure also presents, on the left axis, the ratio between the initialization part of the inverse method done on the CPU and done on the GPU (T_{C,init-inv} / T_{G,init-inv}). This relation essentially represents the weight of the data transfer: the distance of the curve from 1 represents the difference between the initialization runtimes of the two versions (in the code, the only differences are the memory allocation and the transfer to the device). As seen in the figure, for N = 500 this accounts for around 25% of the time; for N >= 2000 it represents no more than 3%. This clearly explains the disruption in performance shown in figure 3a: as the initialization fraction becomes the predominant one, and because only a very small fraction of the time is spent on bus transfers, the inversion itself becomes the main computational problem.

It is known that the LU method should perform better on systems for which the ratio of rows to columns of the solution matrix B is 1 or less. Computing an inverse matrix is a particular linear system for which that ratio is exactly 1, so the implemented LU solver is now used to compute the inverse matrices. Figure 4 presents the resulting global speedup: the left axis carries the scale for the speedup obtained when compared with the case of computing the inverse matrix on the CPU (black continuous line).

Figure 4: Speedup with the inverse computed on the GPU.

There are two remarks to make (both a consequence of the initialization being a considerable fraction of the total): 1) the performance gain is always greater than 1, which means that, in the end, a performance boost was achieved for every problem size; 2) the boost is continuously increasing, even if at a slow rate, which means that the initialization burden was somewhat mitigated. The updated best-method comparison is presented in the same figure with the blue dashed line and the right-axis scale: the results were boosted, as the blue line had already suggested. The speedups are now between 4.3 and 18.8.

7 Conclusion

The main objective of this work was to investigate whether a class of problems in the computational fluid dynamics domain could benefit from the possibilities opened by GPU-based computing. This objective was accomplished, as speedups between 4 and 18 were obtained.

The study of the parallel computing paradigm and its influence on the device's model stems from the fact that the use of GPUs for scientific problem solving is a novelty. Because of the device's design, the best performance is generally obtained with massive problem sizes. This size factor means that communication between the host and the device (with the hardware used) becomes continually less significant when compared with the traditional sequential access pattern to the variables. With respect to the device's memory system, major differences from the host memory were observed: in current (multi-core) computers the system's memory bus is shared by 4 cores, while in the GPU the equivalent bus is shared by at least 8 scalar cores (as a concrete example, the GPU used in this work has 240 scalar cores). This implies that knowing how to exploit the device's memory system has a significant impact on the results obtained. When repeated access to a vector variable is needed, there are several approaches to obtain cached (and thus faster) access to it.

To achieve the main goal of the present work, a one-dimensional convection-diffusion transport equation solver was implemented. Two direct methods were compared: an LU method and the inverse method. The implemented LU method performs poorly on the device for this problem, as its solution is just one column. The inverse method outperforms the LU method, which clearly shows that an algorithm's suitability for sequential computation says little about its performance in parallel computing. The fact that the matrix inversion on the CPU is an expensive computation implies that the method's scalability is doomed for problems with sizes larger than 2000; this fact also completely hides the eventual problem of the latency added by data transfers to the device. Finally, the matrix inversion step was done on the GPU (using the LU solver developed), and this led to a performance increase of approximately a factor of 2 when compared with the previous strategy of computing the inverse matrix on the CPU.

7.1 Future Work

The use of GPUs in scientific computing will have dramatic consequences that may change the numerical methods used today. Future work should be developed in two directions: i) continue to evaluate the platform as a computational resource (by investigating CPU/GPU cooperation, comparing the platform with traditional cluster platforms and studying the clustering of GPUs) and ii) apply this knowledge to engineering CFD applications.

References

[1] S. Barrachina, M. Castillo, F. D. Igual, R. Mayo, E. S. Quintana-Ortí, and G. Quintana-Ortí. Exploiting the capabilities of modern GPUs for dense matrix computations. Technical report, Universidad Jaime I, 2008.

[2] J. Cohen and M. Garland. Solving computational problems with GPU computing. Computing in Science and Engineering, 11(5):58-63, 2009.

[3] J. H. Ferziger and M. Perić. Computational Methods for Fluid Dynamics. Springer, 2nd edition, 1997.

[4] F. G. Van Zee, E. Chan, R. van de Geijn, E. S. Quintana-Ortí, and G. Quintana-Ortí. Introducing: The libflame library for dense matrix computations. Computing in Science and Engineering, page 9.

[5] M. Garland. Sparse matrix computations on manycore GPUs. In DAC '08: Proceedings of the 45th Annual Design Automation Conference, pages 2-6, New York, NY, USA, 2008. ACM.

[6] D. Göddeke, R. Strzodka, J. Mohd-Yusof, P. McCormick, H. Wobker, C. Becker, and S. Turek. Using GPUs to improve multigrid solver performance on a cluster. Int. J. Comput. Sci. Eng., 4(1):36-55, 2008.

[7] J. Habich. Performance evaluation of numeric compute kernels on NVIDIA GPUs. Master's thesis, Friedrich-Alexander-Universität, 2008.

[8] L. Null and J. Lobur. Essentials of Computer Organization and Architecture. Jones and Bartlett Publishers, Inc., USA, 2003.

[9] NVIDIA. CUDA Programming Guide.

[10] S. K. Lele. Compact finite difference schemes with spectral-like resolution. Journal of Computational Physics, 103:16-42, 1992.

[11] J. Tölke and M. Krafczyk. TeraFLOP computing on a desktop PC with GPUs for 3D CFD. Int. J. Comput. Fluid Dyn., 22(7):443-456, 2008.

[12] S. Tomov, J. Dongarra, and M. Baboulin. Towards dense linear algebra for hybrid GPU accelerated manycore systems. Technical Report 210, LAPACK Working Note, Oct. 2008.

[13] V. Volkov and J. Demmel. LU, QR and Cholesky factorizations using vector capabilities of GPUs. Technical report, Electrical Engineering and Computer Sciences, University of California at Berkeley, 2008.

[14] V. Volkov and J. W. Demmel. Benchmarking GPUs to tune dense linear algebra. In SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pages 1-11, Piscataway, NJ, USA, 2008. IEEE Press.

[15] Y. Zhao. Lattice Boltzmann based PDE solver on the GPU. Vis. Comput., 24(5):323-333, 2008.