Parallel Alternating Direction Implicit Solver for the Two-Dimensional Heat Diffusion Problem on Graphics Processing Units
Khor Shu Heng
Engineering Science Programme
National University of Singapore

Abstract

This paper presents a parallel alternating direction implicit (ADI) solver for the two-dimensional heat diffusion problem on an NVIDIA Graphics Processing Unit (GPU). The first section of the work gives a brief introduction to the Compute Unified Device Architecture (CUDA), the programming interface for parallel programming on NVIDIA GPUs, whereas the second section describes the implementation details of the tridiagonal system solver and the setup of the corresponding right hand sides for the implicit solutions in the x and y directions. The tridiagonal solver used in this work is based on the parallel cyclic reduction algorithm implemented by Zhang et al. [1]. The original algorithm does not support system sizes that are not a power of two and uses 5n shared memory storage, where n is the tridiagonal system size. We noticed that the shared memory usage can be reduced to 3n for cases where the tridiagonal system is symmetric with uniform elements on the diagonals. A slight modification has been made to cater for system sizes that are not a power of two. We have also attempted to make the computation of the right hand sides as efficient as possible, especially for the solution in the y direction. Using the CUDA Visual Profiler, the performance of the GPU ADI solver was compared with a serial CPU implementation based on Gaussian elimination without pivoting. Reasonable acceleration was achieved for both float and double precision computation.

1. Introduction

1.1 Parallel computing using Graphics Processing Units

The Graphics Processing Unit is specially designed for computation tasks that exhibit fine-grained data parallelism with a high ratio of arithmetic operations to memory operations. In three-dimensional graphics rendering, large sets of pixel and vertex data are mapped onto parallel processing threads. Modern GPUs are highly parallel and multithreaded, with many more processing cores than a CPU. This makes the GPU a viable and cheaper alternative for parallel programming compared with multicore CPUs, vector computers and grid computing.
In November 2006, NVIDIA introduced C with CUDA extensions, a general purpose parallel computing architecture which enables the user to use the C language to leverage the parallelism of supported NVIDIA GPUs for data-parallel tasks. Since then, CUDA has gained popularity in the high performance computing community and has been applied in diverse fields: computational finance, computational fluid dynamics, image reconstruction for CT scans and molecular simulations.

In general, programming in CUDA involves memory transfer between the host memory (CPU) and the device memory (GPU). The host calls a kernel, which performs the parallel computing task on data in device memory. The computation result is then written back to the host memory.

Figure 1. GPU architecture

The GPU processes data in parallel using threads. Threads are grouped into blocks. Each thread has a private local memory, whereas each block has a shared memory bank accessible to all threads within the block. All threads have access to the same global memory bank. More information on the programming methodology and optimization techniques is available in [2].
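As a minimal sketch of this workflow (the kernel scale, the array size N and the launch configuration below are illustrative assumptions, not part of the solver), a host program allocates device memory, copies the data in, launches a kernel and copies the result back:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Illustrative kernel: scales every element of the array by a constant.
__global__ void scale(float *u, float factor, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        u[idx] *= factor;
}

int main(void)
{
    const int N = 1024;
    size_t bytes = N * sizeof(float);

    float *h_u = (float *)malloc(bytes);            // host copy
    for (int i = 0; i < N; ++i) h_u[i] = 1.0f;

    float *d_u;                                     // device copy
    cudaMalloc((void **)&d_u, bytes);
    cudaMemcpy(d_u, h_u, bytes, cudaMemcpyHostToDevice);

    scale<<<4, 256>>>(d_u, 0.5f, N);                // 4 blocks of 256 threads cover the N elements

    cudaMemcpy(h_u, d_u, bytes, cudaMemcpyDeviceToHost);
    printf("u[0] = %f\n", h_u[0]);

    cudaFree(d_u);
    free(h_u);
    return 0;
}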
1.2 Alternating Direction Method

The two-dimensional heat diffusion problem is governed by the partial differential equation

    \frac{\partial u}{\partial t} = \kappa \left( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} \right)

Diffusion problems usually suffer from numerical instability under explicit schemes and thus need to be solved using implicit schemes such as the Crank-Nicolson scheme. While the linear equation systems generated by the Crank-Nicolson scheme are tridiagonal in the one-dimensional case, this is no longer true in two dimensions, and more computational effort is required to solve the resulting linear systems. The alternating direction implicit method circumvents this problem by halving the time step and solving the governing partial differential equation implicitly in only one spatial dimension during each sub-step. For the sub-step implicit in x, each interior point (i, j) satisfies

    -\frac{\alpha}{2} u^{n+1/2}_{i-1,j} + (1+\alpha) u^{n+1/2}_{i,j} - \frac{\alpha}{2} u^{n+1/2}_{i+1,j} = \frac{\alpha}{2} u^{n}_{i,j-1} + (1-\alpha) u^{n}_{i,j} + \frac{\alpha}{2} u^{n}_{i,j+1}

where \alpha = \kappa \Delta t / \Delta x^2 is the grid diffusion number (taking \Delta x = \Delta y), and similarly for the sub-step implicit in y. For a grid of size n × n, n independent diagonally dominant tridiagonal systems of size n are generated for the implicit solution in each dimension.

2. Implementation of Alternating Direction Method in CUDA

2.1 Mapping of higher dimensional array

To work with a two-dimensional problem in CUDA, one can map the two-dimensional temperature distribution onto a one-dimensional array in the following manner:

    index(i, j) = i + j * pitch

Notice that the pitch is not necessarily equal to the row length. One should use the cudaMallocPitch function to allocate the array with a suitable pitch so that memory access in the x direction is coalesced. Non-coalesced memory access is slower and may impact overall performance.
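A minimal sketch of pitched allocation and indexing (the grid dimensions nx, ny and the fill kernel are illustrative assumptions):

#include <cuda_runtime.h>

// index(i, j) = i + j * pitch, with the pitch measured in elements
#define INDEX(i, j, pitch) ((i) + (j) * (pitch))

__global__ void fill(float *u, int nx, int ny, int pitchElems)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // x index: consecutive threads
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // y index
    if (i < nx && j < ny)
        u[INDEX(i, j, pitchElems)] = 0.0f;           // coalesced along x
}

int main(void)
{
    const int nx = 500, ny = 300;                    // illustrative grid dimensions
    float *d_u;
    size_t pitchBytes;

    // cudaMallocPitch pads each row so that every row starts at a well-aligned address.
    cudaMallocPitch((void **)&d_u, &pitchBytes, nx * sizeof(float), ny);
    int pitchElems = (int)(pitchBytes / sizeof(float));

    dim3 block(16, 16);
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
    fill<<<grid, block>>>(d_u, nx, ny, pitchElems);

    cudaFree(d_u);
    return 0;
}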
2.2 Implicit Solution in the x direction

Each independent tridiagonal system is mapped onto a block, whereas each equation of the tridiagonal system is mapped onto a thread within the block. We declare the array for the right hand side in shared memory. The computation of the right hand side corresponding to each tridiagonal system involves communication between each element in a row and the elements adjacent to it in the y direction.

Excerpt of code from the kernel for the right hand side setup in the x direction:

#define INDEX(i,j,pitch) (i + __mul24(j,pitch))

__global__ void rhssetup(float *rhs, float *u, int m, int pitch, int pitch2, float alpha)
{
    unsigned int thid = threadIdx.x;
    unsigned int blid = blockIdx.x;
    unsigned int center = INDEX(thid+1, blid+1, pitch);
    if (thid < m)
        rhs[INDEX(thid, blid, pitch2)] = (1-alpha)*u[center]
                                       + alpha/2*(u[center-pitch] + u[center+pitch]);
    __syncthreads();
    if (thid == 0)   // boundary contributions are added once per system
    {
        rhs[blid*pitch2]              += (alpha/2)*u[(blid+1)*pitch];
        rhs[INDEX(m-1, blid, pitch2)] += (alpha/2)*u[INDEX(m+2, blid+1, pitch)];
    }
}

The __mul24 intrinsic is used for efficient 24-bit integer multiplication. The pitch length is different for the right hand side array and the temperature distribution array, since the x and y dimensions are different in general.
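As a usage sketch (assuming device arrays d_rhs and d_u have already been allocated, and an interior grid of m by n points), the kernel above would be launched with one block per tridiagonal system and one thread per equation:

// One block per tridiagonal system (n systems for the x sweep),
// one thread per equation (m unknowns per system); an assumed configuration.
dim3 grid(n);
dim3 block(m);
rhssetup<<<grid, block>>>(d_rhs, d_u, m, pitch, pitch2, alpha);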
2.3 Parallel Cyclic Reduction

In this work, the tridiagonal solver used is based on the parallel cyclic reduction algorithm implemented by Zhang et al. [1]. Parallel cyclic reduction is a variant of the cyclic reduction algorithm first proposed by Hockney and Golub in 1965 [3]. Parallel cyclic reduction differs from cyclic reduction by having only the forward reduction phase. The algorithm solves a tridiagonal system of size n in log2(n) steps and 12 n log2(n) computations, whereas Gaussian elimination without pivoting solves the same problem in 2n steps.

The idea of parallel cyclic reduction is to reduce the original tridiagonal system to smaller systems of half the original size in a recursive manner. Consider an n by n tridiagonal system whose i-th equation is

    a_i x_{i-1} + b_i x_i + c_i x_{i+1} = d_i

For each row i, a_i and c_i are eliminated by means of row operations involving row i and the two rows a stride above and below i; the initial stride is 1. This process updates b_i and d_i and generates new a_i and c_i as fill-in. The odd indexed rows and the even indexed rows have now become two independent tridiagonal systems. Repeating the same process with a stride double that of the previous one produces smaller and smaller independent systems. Assume for the moment that the system size is a power of two. Iterating the forward reduction phase log2(n) - 1 times will yield n/2 independent tridiagonal systems of size 2. In Zhang et al.'s implementation, the updated values always overwrite the original ones, hence only 5n storage (the diagonals a, b, c, the right hand side d and the solution x) is needed in shared memory.

2.4 System size of non-power-of-two

For system sizes which are not a power of two, the forward reduction can be performed ceil(log2(n/2)) times. The end result is a collection of independent systems of size two together with a number of systems of size one.
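As an illustrative case (not worked out in the original), take n = 6 with rows indexed 0 to 5. Two forward reduction steps leave each row coupled only to rows four positions away, so the rows fall into the independent sets

    {0, 4}, {1, 5}  : two systems of size two
    {2}, {3}        : two systems of size one

which the modified back substitution below handles.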
The original code section for the back substitution can be changed from:

if (thid < delta)
{
    int addr1 = thid;
    int addr2 = thid + delta;
    float tmp3 = b[addr2]*b[addr1] - c[addr1]*a[addr2];
    x[addr1] = (b[addr2]*d[addr1] - c[addr1]*d[addr2]) / tmp3;
    x[addr2] = (d[addr2]*b[addr1] - d[addr1]*a[addr2]) / tmp3;
}

to:

if (thid < delta)
{
    int addr1 = thid;
    int addr2 = thid + delta;
    float tmp3 = b[addr2]*b[addr1] - c[addr1]*a[addr2];
    if (addr2 < n)
    {
        // solves a system of size two
        x[addr1] = (b[addr2]*d[addr1] - c[addr1]*d[addr2]) / tmp3;
        x[addr2] = (d[addr2]*b[addr1] - d[addr1]*a[addr2]) / tmp3;
    }
    else
    {
        // solves a system of size one
        x[addr1] = d[addr1] / b[addr1];
    }
}

The branch taken when addr2 < n solves a system of size two, whereas the else branch solves a system of size one.

2.5 Symmetric tridiagonal system with uniform elements on the diagonals

For Dirichlet boundary conditions, the tridiagonal system involved is symmetric with uniform elements on the diagonals. For such a system we observe the following:

1) The upper and lower diagonals of the new tridiagonal systems formed during the forward reduction phase are filled with identical elements.
2) Only the first and the last elements of the main diagonal have values different from the other elements on the main diagonal.
3) The computation of every subsequent value of the new a and c requires only knowledge of the initial value of b.
These observations allow us to reduce the shared memory storage and the number of memory read/write operations. Only b and d need to be stored in shared memory; a can be dropped entirely, and c is kept in a register.

Consider the original code section:

for (int j = 0; j < iteration; j++)
{
    int i = thid;
    if (i < delta)
    {
        float tmp2 = c[i] / b[i+delta];
        bnew = b[i] - a[i+delta] * tmp2;
        dnew = d[i] - d[i+delta] * tmp2;
        anew = 0;
        cnew = -c[i+delta] * tmp2;
    }
    else if ((systemsize-i-1) < delta)
    {
        float tmp = a[i] / b[i-delta];
        bnew = b[i] - c[i-delta] * tmp;
        dnew = d[i] - d[i-delta] * tmp;
        anew = -a[i-delta] * tmp;
        cnew = 0;
    }
    else
    {
        float tmp1 = a[i] / b[i-delta];
        float tmp2 = c[i] / b[i+delta];
        bnew = b[i] - c[i-delta] * tmp1 - a[i+delta] * tmp2;
        dnew = d[i] - d[i-delta] * tmp1 - d[i+delta] * tmp2;
        anew = -a[i-delta] * tmp1;
        cnew = -c[i+delta] * tmp2;
    }
    __syncthreads();
    b[i] = bnew;
    d[i] = dnew;
    a[i] = anew;
    c[i] = cnew;
    delta *= 2;
    __syncthreads();
}
This can be replaced by:

for (int j = 0; j < iteration; j++)
{
    float temp = c / B;   // c and B are the current uniform off-diagonal and interior diagonal values, kept in registers
    int i = thid;
    if (i < delta)
    {
        float tmp2 = c / b[i+delta];
        bnew = b[i] - c * tmp2;
        dnew = d[i] - d[i+delta] * tmp2;
    }
    else if ((systemsize-i-1) < delta)
    {
        float tmp = c / b[i-delta];
        bnew = b[i] - c * tmp;
        dnew = d[i] - d[i-delta] * tmp;
    }
    else
    {
        float tmp1 = c / b[i-delta];
        float tmp2 = c / b[i+delta];
        bnew = b[i] - c * (tmp1 + tmp2);
        dnew = d[i] - d[i-delta] * tmp1 - d[i+delta] * tmp2;
    }
    __syncthreads();
    b[i] = bnew;
    d[i] = dnew;
    delta *= 2;
    __syncthreads();
    B = B - 2*c*temp;
    c *= -temp;
}

where B is the original value of the interior main diagonal element. Reads and writes to registers are much faster than to shared memory. Profiling with the CUDA Visual Profiler shows that this replacement reduces the computation time by about one fifth.

2.6 Implicit Solution in the y direction

After the implicit solution in the x direction has been computed, the right hand side corresponding to the implicit solution in the y direction can be computed in a similar manner as in the x direction.
Code excerpt for the right hand side computation:

unsigned int thid = threadIdx.x;
unsigned int blid = blockIdx.x;
unsigned int center = INDEX(blid+1, thid+1, pitch);
if (thid < n)
    rhs[INDEX(thid, blid, pitch2)] = (1-alpha)*u[center]
                                   + alpha/2*(u[center-1] + u[center+1]);

However, the profiler shows that the right hand side computation for the implicit solution in the y direction is much less efficient than that for the x direction. This is due to its non-coalesced memory access pattern, which is much slower.
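The difference in access pattern can be made explicit (an illustration based on the two kernels above, not additional solver code):

// With INDEX(i,j,pitch) = i + j*pitch, consecutive threads of a warp read:
//
// x-direction RHS:  center = INDEX(thid+1, blid+1, pitch)
//   thread t accesses u[(t+1) + (blid+1)*pitch]   -> consecutive addresses: coalesced
//
// y-direction RHS:  center = INDEX(blid+1, thid+1, pitch)
//   thread t accesses u[(blid+1) + (t+1)*pitch]   -> addresses pitch elements apart: non-coalesced

Staging a tile of u in shared memory, or operating on a transposed copy of the field, would be one possible way to restore coalesced loads for the y sweep; this is the kind of improvement to the right hand side routine suggested in the conclusion.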
Figure: Profiling result using the CUDA Visual Profiler. ADISolve refers to the tridiagonal solver routine in the x and y directions, memcpydtoh refers to the memory transfer from device memory to host memory, initialize refers to the routine that sets up the initial condition, while rhssetup and rhssetup2 refer to the right hand side computations for the x direction and y direction respectively.

3. Result

The ADI solver was implemented on an NVIDIA GTX 285 GPU, which is capable of double precision computation. The serial version of the ADI solver was based on Gaussian elimination without pivoting and was implemented on an Intel Core 2 Duo E8400 CPU at 3.0 GHz with 4 GB of RAM. The heat diffusion problem tested has Dirichlet boundary conditions. Both codes were tested in float precision and double precision for three grid sizes (including boundary points) over 3000 time steps. The time taken for memory transfer from device to host was included. Below is a summary of the timing results:

System Size    GPU (float)    GPU (double)    CPU (float)    CPU (double)
                              1.75 s          2.26 s         2.33 s
                              3.06 s
                              9.89 s

Only NVIDIA graphics cards with compute capability 1.3 or above can run double precision computation. Current generation GPUs have considerably lower double precision throughput than single precision throughput, which renders them less suitable when high precision is necessary. Due to the shared memory size limitation we have not implemented the code for larger system sizes. It is possible to implement the same algorithm using global memory, but this would incur a performance penalty due to the lower bandwidth of global memory access.

4. Conclusion

Reasonable acceleration for the tridiagonal system solver has been achieved. The algorithm presented here can be further optimized by improving the right hand side computation routine. Future work will concentrate on extending the algorithm to larger system sizes.

References

[1] Zhang Y., Cohen J., Owens J.D. Fast Tridiagonal Solvers on the GPU. Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2010.
[2] NVIDIA CUDA Compute Unified Device Architecture Programming Guide, Version 2.0.
[3] Hockney R.W., Jesshope C.R. Parallel Computers. Adam Hilger, Bristol, 1981.