A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE

Abu Asaduzzaman, Deepthi Gummadi, and Chok M. Yip
Department of Electrical Engineering and Computer Science
Wichita State University, Wichita, Kansas, USA
Abu.Asaduzzaman@wichita.edu

Abstract

High performance computing (HPC) is essential for the fast, effective analysis of large systems. The NVIDIA Compute Unified Device Architecture (CUDA)-assisted central processing unit (CPU) / graphics processing unit (GPU) computing platform has proven its potential for HPC support. In CPU/GPU computing, original data and instructions are copied from CPU main memory to GPU global memory. Inside the GPU, it is beneficial to keep the data in shared memory (shared only by the threads of one block) rather than in global memory (shared by all threads). However, GPU shared memory is much smaller than GPU global memory (for the Fermi Tesla C2075, total shared memory per block is 48 KB while total global memory is 5.6 GB). In this paper, we introduce a CPU-main-memory to GPU-global-memory mapping technique that improves GPU and overall system performance by increasing the effectiveness of GPU shared memory. Experimental results from solving Laplace's equation for a 512x512 matrix using Fermi and Kepler cards show that the proposed CPU-to-GPU memory mapping technique helps decrease the overall execution time by more than 75%.

Index Terms: Cache memory organization; CUDA architecture; electric charge distribution; GPU memory; high performance computing.

I. INTRODUCTION

A modern CPU consists of a small number of cores optimized for sequential serial processing, while a GPU consists of hundreds of smaller, more efficient cores designed for handling multiple tasks simultaneously. GPUs serve the CPU as powerful, energy-efficient accelerators in many small and medium businesses around the world. GPU-accelerated computing is the use of a GPU together with a CPU to accelerate scientific, engineering, and enterprise applications.
The CUDA Version 5.5 toolkit helps developers obtain the best performance; its parallelism and optimization techniques simplify programming for the CUDA-capable GPU architecture. NVIDIA has also announced CUDA 6, the latest version of its GPU programming platform, which adds a Unified Memory capability as shown in Figure 1. Unified memory relieves programmers from the trials and tribulations of manually copying data back and forth between the separate CPU and GPU memory spaces [1].

Fig. 1. CUDA memory model: (a) unified and (b) actual.

Data in GPU global memory takes more time to process than data in GPU shared memory. In CUDA-assisted multithreaded programming, a thread usually processes data that are not in consecutive CPU-memory locations. Such CPU data may not automatically qualify for placement in GPU shared memory, and as a result overall system performance may decrease significantly.

From the introduction of dual-core netbook machines (in 2005) to today's 16-core workstation computers, parallel processing has become a reality: command-prompt machines are almost out, and multithreaded CPU/GPU computers are in [2]. To take advantage of multicore systems, software engineers are developing parallel applications that also meet the requirements of growing high-performance computation. NVIDIA CUDA/GPU technology provides multithreading without context switching [3]. However, without proper mapping from CPU memory to GPU memory, GPU shared memory may not be used effectively. Therefore, a smart memory mapping technique is needed to improve the GPU, as well as the overall system, performance. This work aims to develop a methodology that rearranges data while copying from CPU to GPU so that the data associated with the threads of a GPU block reside together, fit in GPU shared memory, and thereby improve performance.

The rest of the paper is organized as follows: Section II motivates the work by presenting related articles. Section III introduces the proposed CPU-to-GPU data mapping technique. Experimental details are described in Section IV, and experimental results are discussed in Section V. Finally, the work is concluded in Section VI.

II. BACKGROUND AND MOTIVATION

In this section, we briefly discuss the CPU cache memory hierarchy, GPU memory organization, data level parallelism, and traditional CPU-to-GPU memory/data mapping.

A. CPU Memory Organization

Most contemporary CPUs (from Intel, AMD, and IBM) have a multicore architecture in which each core has its own private level-1 cache (CL1).
The cache memory organization of such a multicore system also has private or shared level-2 cache (CL2) and main memory. CL1 is usually split into an instruction cache (I1) and a data cache (D1), but CL2 is usually unified. The cache memory organization of an Intel-like 4-core CPU system is illustrated in Figure 2 [4].

Fig. 2. Intel-like CPU cache memory organization.

B. GPU Memory Organization

In a multicore CPU and manycore GPU platform that supports CUDA applications, the user starts the application on the CPU. The initialization and serial parts are executed on the CPU; the data and code for the parallel parts are sent to the GPU card. Figure 3 illustrates a typical CPU-GPU organization. For each parallel part, multiple threads are generated, and the threads are executed concurrently on the GPU cores. In a GPU, different types of memories are available: global memory is the largest memory, available to all computational blocks and visible to each and every thread in the compute grid; shared memory belongs to a single computational block and is visible only to the threads running within that block. Shared memory is very fast to access but much smaller in capacity than global memory. GPU shared memory helps improve performance mainly because (i) it is dedicated to a CUDA block and (ii) it is closer to the processing cores (see Figure 3). The results from the GPU are sent back to the CPU.

Fig. 3. GPU memory organization [2].

C. Data Level Parallelism

Data parallelism is an important parallel processing technique because it can take advantage of the locality principle. In data parallelism, a program is decomposed into concurrent units that execute the same instructions on distinct data [5], [6]. Massachusetts Institute of Technology (MIT) researchers introduced two data parallelism strategies for concurrent execution: spatial data partitioning (SDP) and temporal data partitioning (TDP) [7]. In the SDP strategy, data is divided among processes by spatial index, as shown in Figure 4. This strategy is applicable when the spatial data dimensions are large and there are few dependencies; some additional instructions are needed to enable communication and synchronization. The latency of the parallelized application decreases, and load balancing tends to be easy as long as the application performs the same amount of work on all spatial indices.

Fig. 4. Two Parallelization Strategies [7].

In the TDP strategy, data is divided among processes by temporal index: each process performs its computation on all spatial indices associated with its assigned temporal index, as illustrated in Figure 4. In a parallel TDP implementation, communication is application dependent; typically, starting from its assigned temporal index, a process executes all instructions on the data. This strategy is applicable when the temporal data dimension is large with few dependencies. The throughput of the parallelized application increases while its latency remains the same, and load balancing remains easy even when the computation varies tremendously between inputs. Experimental results show that a pure TDP implementation achieves the best throughput, while a pure SDP implementation achieves the best latency, although with a loss of quality [7].

III. COPYING CPU-DATA TO GPU-MEMORY

First we discuss the traditional method of copying CPU data to GPU global memory. Then we present our proposed technique for moving CPU data to GPU global memory to increase performance.

A. Traditional CPU-to-GPU Memory Mapping

In traditional GPU computing, data (and instructions) from CPU memory are copied into GPU memory as shown in Figure 5. The data for a single block/thread is copied directly from CPU memory to GPU global memory. As a result, the data in GPU global memory may be scattered across different memory blocks, which makes it difficult (if not impossible) to store that data in GPU shared memory. Therefore, a new CPU-to-GPU memory mapping is needed to improve GPU shared memory performance.

Fig. 5. Traditional CPU to GPU global memory mapping.

B. Proposed CPU-to-GPU Memory Mapping

In this work, we propose a novel CPU-main-memory to GPU-global-memory mapping technique to increase system performance. As shown in Figure 6, CPU data should be regrouped in such a way that the data associated with the same thread is stored in consecutive memory locations. This data regrouping and mapping should be done on the CPU at run-time. According to this mapping strategy, data X1, X2, etc. from different CPU-memory locations are stored together in GPU global memory. Unlike the traditional method, this organization allows the data to be kept in GPU shared memory, thereby increasing performance.

Fig. 6. Proposed CPU memory to GPU shared memory mapping.

IV. EXPERIMENTAL DETAILS

In this section, the CPU/GPU system parameters, the 2D electric charge distribution problem, and the developed CUDA/C code for the GPU with/without shared memory are discussed.

A. CPU/GPU System Parameters

We use two popular GPU cards (Fermi and Kepler) with a multicore CPU. The system configuration parameters for the workstation are summarized in Table I. The dual-processor (quad-core per processor) workstation runs at 2.13 GHz. The Fermi card has 14 streaming multiprocessors (SMs); each SM has 32 CUDA cores. The Kepler card has 13 SMs; each SM has 192 CUDA cores. The operating system used is Debian 6.0.

TABLE I
SYSTEM PARAMETERS

Parameter               Description
CPU                     Intel Xeon
CPU Cores               8
CPU RAM                 6GB
Fermi GPU Card          NVIDIA Tesla C2075
Fermi GPU Cores         448
Fermi Clock Speed       1.15 GHz
Fermi Global Memory     5.4GB
Fermi Shared Memory     49KB/Block
Kepler GPU Card         NVIDIA Tesla K20m
Kepler GPU Cores        2496
Kepler Clock Speed      0.71 GHz
Kepler Global Memory    4.8GB
Kepler Shared Memory    49KB/Block
Operating System        Linux Debian

B. 2D Electric Charge Distribution

In many cases, when the charge distribution is not known, Poisson's equation can be used to solve electrostatic problems. For materials with electric potential φ and medium permittivity ɛ, based on finite-difference approximations, Laplace's equation (a specialized form of Poisson's equation) for a 2D problem can be presented as Equation 1.

ɛ_x(i,j) (φ_{i+1,j} − φ_{i,j})/dx + ɛ_y(i,j) (φ_{i,j+1} − φ_{i,j})/dy + ɛ_x(i−1,j) (φ_{i,j} − φ_{i−1,j})/dx + ɛ_y(i,j−1) (φ_{i,j} − φ_{i,j−1})/dy = 0   (1)

where dx and dy are the spatial grid sizes, φ_{i,j} is the electric potential defined at lattice point (i, j), and ɛ_x(i,j) and ɛ_y(i,j) are the effective x- and y-direction permittivities defined at the edges of element cell (i, j). For a very uniform material, the permittivity can be considered the same in all directions. Therefore, Equation 1 reduces to the 2D problem shown in Equation 2, which can be solved using a discrete approach.

(φ_{i+1,j} − φ_{i,j})/dx + (φ_{i,j+1} − φ_{i,j})/dy + (φ_{i,j} − φ_{i−1,j})/dx + (φ_{i,j} − φ_{i,j−1})/dy = 0   (2)

The multithreaded CUDA/C shared memory implementation of the 2D Laplace's equation for charge distribution is shown in Figure 7. Here, the right values of i (i.e., the current threadIdx.x) and j (i.e., threadIdx.y) for each thread, together with the shared variables As[i][j], are used for memory latency hiding optimization. Thread executions are synchronized to ensure correctness.

Fig. 7. Main loop in CUDA/C to solve Laplace's equation for charge distribution.

V. RESULTS AND DISCUSSION

We conduct the electric charge distribution experiment (as seen in Equation 2) using the code sample in Figure 7. We implement three versions of the program: (i) CPU-only, (ii) GPU without shared memory, and (iii) GPU with shared memory. While copying data from CPU main memory to GPU global memory, we apply the proposed technique so that shared memory can be used efficiently.

A. Validation of CUDA/C Programs

To validate the developed CUDA/C programs, we consider an 8x8 matrix. As shown in Figure 8, Node(4,4), Node(4,5), Node(5,4), and Node(5,5) are initially set to a high value of 10000, and all other nodes are set to a low value of 0 (zero). Nodes immediately outside the 8x8 matrix are also set to 0 (as a boundary condition).

Fig. 8. An 8x8 matrix with boundary condition.

Using the CPU/C and CUDA/C (without GPU shared memory) codes, we calculate the new values of all nodes of the matrix as stated in Equation 3, where 1 ≤ n ≤ 8 and 1 ≤ m ≤ 8.

N_{n,m} = (1/5)(N_{n,m−1} + N_{n,m+1} + N_{n,m} + N_{n−1,m} + N_{n+1,m})   (3)

The program stops when each and every node has a value less than 1. Figure 9 shows the values of Node(1,1), Node(3,4), Node(5,5), and Node(8,8) after iterations 1, 10, 50, and 100. As expected, both the CPU/C and CUDA/C versions produce exactly the same value for each node after any number of iterations.

Fig. 9. Validation of the developed CUDA/C code.

B. Impact of the Number of Threads

In the experiments, execution time decreases as the number of threads increases, as illustrated in Figure 10. Results show that for a small number of threads (fewer than 8), Kepler takes more time than Fermi, but for a large number of threads (more than 16), Fermi takes more time than Kepler. Kepler is slower below 16 threads because Fermi runs at a faster clock rate (Fermi at 1.15 GHz, Kepler at 0.71 GHz); Fermi is slower above 16 threads because Fermi has fewer load/store units than Kepler (16 versus 32).

Fig. 10. GPU Time Vs Number of Threads.

C. Impact of GPU Shared Memory

For 16x16 threads, both execution times decrease as the amount of GPU shared memory used increases (as shown in Figure 11). It should be noted that Fermi takes less time than Kepler, probably because Fermi runs at a faster clock speed and has a larger memory bus width than Kepler (Fermi bus width 384-bit, Kepler bus width 320-bit).
Fig. 11. GPU Time Vs Shared Memory Used.

D. Impact of CPU-to-GPU Memory Mapping

Finally, we evaluate the impact of the CPU-to-GPU memory mapping technique. Figure 12 shows the execution times on the Fermi card while solving Laplace's equation for electric charge distribution on a 512x512 thin surface. For more than 9x9 threads, GPU shared memory shows improvement. For more than 16x16 threads, execution time increases; this is probably due to the limitation of 16 load/store units. The experimental results indicate that the proposed CPU-to-GPU memory mapping with GPU shared memory provides the best performance.

Fig. 12. Impact of Data Regrouping.

VI. CONCLUSION

NVIDIA CUDA-accelerated GPU computing has the potential to provide fast and inexpensive solutions to massively large/complex problems. In CPU/GPU computing, CPU data is first copied into GPU global memory, and it is beneficial to keep that data in GPU shared memory rather than in GPU global memory. Since shared memory is much smaller than global memory, efficient CPU-memory to GPU-global-memory mapping algorithms are required to improve performance. In this paper, we present a CPU-to-GPU memory mapping technique that enhances GPU (as well as overall system) performance. We implement three solutions (CPU-only, CPU/GPU without shared memory, and CPU/GPU with shared memory) to solve Laplace's equation for electric charge distribution on a 2D thin surface using NVIDIA Fermi (448 cores) and Kepler (2496 cores) GPU cards. Experimental results clearly support the usefulness of GPU shared memory for both GPU cards. Results also show that proper regrouping of CPU data while copying into GPU global memory helps improve performance. Based on the experimental results, the proposed CPU-to-GPU memory mapping technique is capable of decreasing the overall execution time by more than 75%. In much research, including the computational analysis of composite materials, where the modeling and simulation of nanocomposites (which require a large number of computations) is the primary challenge, high performance computing is a must. We plan to extend this CPU-to-GPU memory mapping technique to study composite materials for aircraft applications in our next endeavor.

REFERENCES

[1] M. Harris, "Unified Memory in CUDA 6," http://devblogs.nvidia.com/parallelforall/unifiedmemory-in-cuda-6/.
[2] B. Chapman, G. Jost, and R. van der Pas, Using OpenMP: Portable Shared Memory Parallel Programming (Scientific and Engineering Computation). The MIT Press, Oct 2007.
[3] T. Edison, "GPU memory system," http://www.yuwangcg.com/project1.html.
[4] A. Asaduzzaman, "A power-aware multi-level cache organization effective for multi-core embedded systems," in JCP, 2012.
[5] K. S. McKinley, S. Carr, and C. Tseng, "Improving data locality with loop transformations," ACM Transactions on Programming Languages and Systems, vol. 18, no. 4, p. 424, 1996.
[6] P. J. Denning, "The locality principle," Communication Networks and Computer Systems, 2006.
[7] H. Hoffmann, A. Agarwal, and S. Devadas, "Partitioning strategies for concurrent programming," MIT CSAIL, 2009.