A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE


Abu Asaduzzaman, Deepthi Gummadi, and Chok M. Yip
Department of Electrical Engineering and Computer Science
Wichita State University, Wichita, Kansas, USA
Abu.Asaduzzaman@wichita.edu

Abstract: High performance computing (HPC) is essential for the fast and effective analysis of large systems. The NVIDIA Compute Unified Device Architecture (CUDA)-assisted central processing unit (CPU) / graphics processing unit (GPU) computing platform has proven its potential for HPC workloads. In CPU/GPU computing, the original data and instructions are copied from CPU main memory to GPU global memory. Inside the GPU, it is beneficial to keep data in shared memory (shared only by the threads of one block) rather than in global memory (shared by all threads). However, GPU shared memory is much smaller than GPU global memory (for the Fermi Tesla C2075, total shared memory per block is 48 KB and total global memory is 5.6 GB). In this paper, we introduce a CPU-main-memory to GPU-global-memory mapping technique that improves GPU and overall system performance by increasing the effectiveness of GPU shared memory. Experimental results from solving Laplace's equation for a 512x512 matrix using Fermi and Kepler cards show that the proposed CPU-to-GPU memory mapping technique helps decrease the overall execution time by more than 75%.

Index Terms: Cache memory organization; CUDA architecture; electric charge distribution; GPU memory; high performance computing.

I. INTRODUCTION

A modern CPU consists of a small number of cores optimized for sequential serial processing, while a GPU consists of hundreds of smaller, more efficient cores designed for handling many tasks simultaneously. GPUs complement CPUs as powerful, energy-efficient accelerators in data centers and in many small and medium businesses around the world. GPU-accelerated computing is the use of a GPU together with a CPU to accelerate scientific, engineering, and enterprise applications. The CUDA Version 5.5 toolkit helps developers obtain good performance, and its parallelism and optimization facilities simplify programming for CUDA-capable GPU architectures. NVIDIA has also announced CUDA 6, the latest version of its GPU programming platform, which adds a Unified Memory capability as shown in Figure 1. Unified memory relieves programmers from the trials and tribulations of having to manually copy data back and forth between separate CPU and GPU memory spaces [1].

Fig. 1. CUDA memory model: (a) unified and (b) actual.

Data in GPU global memory takes more time to process than data in GPU shared memory. In CUDA-assisted multithreaded programming, a thread usually processes data that are not in consecutive CPU-memory locations. Such CPU data may not automatically qualify to be placed in GPU shared memory, and as a result, overall system performance may decrease significantly.

From the introduction of dual-core netbook machines (in 2005) to today's 16-core workstation computers, parallel processing has become a reality. Today, command-prompt machines are almost out; multithreaded CPU/GPU computers are in [2]. To take advantage of multicore systems, software engineers are developing parallel applications that also meet the requirements of growing high-performance computation. NVIDIA CUDA/GPU technology provides multithreading without context switching [3]. However, without proper mapping from CPU memory to GPU memory, GPU shared memory may not be used effectively. Therefore, a smart memory mapping technique is needed to improve GPU, as well as overall system, performance.
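For readers unfamiliar with the two transfer styles mentioned above, the following minimal sketch contrasts explicit copies between separate CPU and GPU memory spaces with CUDA 6-style unified memory. The kernel, array name, and sizes are illustrative assumptions, not code from this paper.

// Sketch only: explicit CPU-to-GPU copies versus CUDA 6 unified memory.
// The array name `phi` and its size are illustrative, not from the paper.
#include <cuda_runtime.h>
#include <stdlib.h>

#define N (512 * 512)

__global__ void scale(float *data, float factor, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] *= factor;
}

void explicit_copy_style(void)
{
    float *h_phi = (float *)malloc(N * sizeof(float));   /* CPU main memory (initialization omitted) */
    float *d_phi = NULL;
    cudaMalloc((void **)&d_phi, N * sizeof(float));      /* GPU global memory */

    /* Manually stage data into GPU global memory, launch, and copy results back. */
    cudaMemcpy(d_phi, h_phi, N * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(N + 255) / 256, 256>>>(d_phi, 0.5f, N);
    cudaMemcpy(h_phi, d_phi, N * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_phi);
    free(h_phi);
}

void unified_memory_style(void)
{
    float *phi = NULL;
    cudaMallocManaged((void **)&phi, N * sizeof(float)); /* one pointer, visible to CPU and GPU */

    scale<<<(N + 255) / 256, 256>>>(phi, 0.5f, N);
    cudaDeviceSynchronize();                             /* wait before the CPU touches the data */

    cudaFree(phi);
}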

This work aims to develop a methodology for rearranging data while copying from CPU to GPU so that the data associated with the threads of a GPU block resides together, fits in GPU shared memory, and hence improves performance.

The rest of the paper is organized as follows: Section II motivates the work by presenting related articles. Section III introduces the proposed CPU-to-GPU data mapping technique. Experimental details are described in Section IV. Experimental results are discussed in Section V. Finally, the work is concluded in Section VI.

II. BACKGROUND AND MOTIVATION

In this section, we briefly discuss the CPU cache memory hierarchy, GPU memory organization, data-level parallelism, and traditional CPU-to-GPU memory/data mapping.

A. CPU Memory Organization

Most contemporary CPUs (from Intel, AMD, and IBM) have a multicore architecture in which each core has its own private level-1 cache (CL1). The cache memory organization of such a multicore system also has private or shared level-2 cache (CL2) and main memory. CL1 is usually split into an instruction cache (I1) and a data cache (D1), whereas CL2 is usually unified. The cache memory organization of an Intel-like 4-core CPU system is illustrated in Figure 2 [4].

Fig. 2. Intel-like CPU cache memory organization.

B. GPU Memory Organization

In a multicore CPU and manycore GPU platform that supports CUDA applications, the user starts the application on the CPU. The initialization and serial parts are executed on the CPU, while the data and code for the parallel parts are sent to the GPU card. Figure 3 illustrates a typical CPU-GPU organization. For each parallel part, multiple threads are generated, and the threads are executed concurrently on the GPU cores. A GPU offers different types of memory: global memory is the largest memory, available to all computational blocks and visible to every thread in the same compute grid; shared memory belongs to a single computational block and is visible only to the threads running within that block. Shared memory is very fast to access but much smaller in capacity than global memory. GPU shared memory helps improve performance mainly because (i) it is dedicated to a CUDA block and (ii) it is closer to the processing cores (see Figure 3). The results from the GPU are sent back to the CPU.

Fig. 3. GPU memory organization [2].
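To make the global/shared distinction concrete, the following minimal kernel sketch stages one block's slice of a global-memory array into shared memory before operating on it. The kernel name, tile size, and reduction operation are illustrative assumptions rather than code from this paper.

// Sketch only: staging a block's data from GPU global memory into shared memory.
// TILE, the kernel name, and the array layout are illustrative assumptions.
// Assumes the kernel is launched with blockDim.x == TILE (a power of two).
#define TILE 256

__global__ void block_sum(const float *g_in, float *g_out, int n)
{
    __shared__ float s_data[TILE];        /* per-block shared memory: fast, but small */

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int tid = threadIdx.x;

    /* Each thread copies one element from global memory into the block's tile. */
    s_data[tid] = (gid < n) ? g_in[gid] : 0.0f;
    __syncthreads();                      /* make the whole tile visible to the block */

    /* Simple tree reduction performed entirely in shared memory. */
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s_data[tid] += s_data[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        g_out[blockIdx.x] = s_data[0];    /* one result per block back to global memory */
}

Staging through shared memory pays off when the same values are reused by several threads of the block, which is exactly the situation the proposed mapping tries to create.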

C. Data-Level Parallelism

Data parallelism is an important parallel processing technique because it can take advantage of the locality principle. In data parallelism, a program is decomposed into concurrent units that execute the same instructions on distinct data [5], [6]. Researchers at the Massachusetts Institute of Technology (MIT) introduced two data-parallelism strategies for concurrent execution: spatial data partitioning (SDP) and temporal data partitioning (TDP) [7].

In the SDP strategy, data is divided among processes by spatial index, as shown in Figure 4. This strategy is applicable when the spatial data dimensions are large and have few dependencies, although some additional instructions are needed for communication and synchronization. The latency of the parallelized application decreases, and load balancing tends to be easy when the application performs the same amount of work on all spatial indices.

Fig. 4. Two Parallelization Strategies [7].

In the TDP strategy, data is divided among processes according to temporal index: each process performs computation on all spatial indices associated with its assigned temporal index, as illustrated in Figure 4. In a typical TDP implementation, a process executes all instructions on the data starting from its assigned temporal index, and communication in the parallel implementation is application dependent. This strategy is applicable when the temporal data dimension is large with few dependencies. The throughput of the parallelized application increases while its latency remains the same, and load balancing is easy even when the computation varies tremendously between inputs. Experimental results show that a pure TDP implementation achieves the best throughput, while a pure SDP implementation achieves the best latency, although with a loss of quality.

III. COPYING CPU-DATA TO GPU-MEMORY

First we discuss the traditional method of copying CPU data to GPU global memory. Then we present our proposed technique for moving CPU data to GPU global memory to increase performance.

A. Traditional CPU-to-GPU Memory Mapping

In traditional GPU computing, data (and instructions) from CPU memory are copied into GPU memory as shown in Figure 5. The data for a single block/thread is copied directly from CPU memory to GPU global memory. As a result, the data may end up scattered across different memory blocks in GPU global memory, which makes it difficult (if not impossible) to store that data in GPU shared memory. Therefore, a new CPU-to-GPU memory mapping is needed to improve GPU shared memory performance.

Fig. 5. Traditional CPU to GPU global memory mapping.

B. Proposed CPU-to-GPU Memory Mapping

In this work, we propose a novel CPU-main-memory to GPU-global-memory mapping technique to increase system performance. As shown in Figure 6, CPU data should be regrouped in such a way that data associated with the same thread is stored in consecutive memory locations. This data regrouping and mapping should be done on the CPU at run-time.

Fig. 6. Proposed CPU memory to GPU shared memory mapping.

According to this mapping strategy, data X1, X2, etc. from different CPU-memory locations are stored together in GPU global memory. Unlike the traditional method, this organization allows the data to be kept in GPU shared memory, which increases performance.
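The sketch below illustrates the kind of host-side regrouping described above, assuming a hypothetical 16x16 block tiling of a 512x512 row-major matrix. It is a sketch of the idea under those assumptions, not the authors' implementation.

// Sketch only: host-side regrouping so that each CUDA block's data is contiguous
// in GPU global memory. BX/BY (the tile shape) and the row-major source layout
// are illustrative assumptions, not the paper's exact mapping.
#include <cuda_runtime.h>
#include <stdlib.h>

#define N  512     /* matrix dimension                    */
#define BX 16      /* threads (and data) per block in x   */
#define BY 16      /* threads (and data) per block in y   */

void copy_regrouped_to_gpu(const float *host_src, float **dev_dst)
{
    float *packed = (float *)malloc((size_t)N * N * sizeof(float));
    size_t k = 0;

    /* Walk the matrix block by block; within a block, copy its BX x BY tile
     * into consecutive positions of the packed buffer. */
    for (int by = 0; by < N / BY; by++)
        for (int bx = 0; bx < N / BX; bx++)
            for (int j = 0; j < BY; j++)
                for (int i = 0; i < BX; i++)
                    packed[k++] = host_src[(by * BY + j) * N + (bx * BX + i)];

    /* One copy moves the regrouped data into GPU global memory; block b's tile
     * now occupies elements [b*BX*BY, (b+1)*BX*BY) and can be staged into that
     * block's shared memory with simple, consecutive accesses. */
    cudaMalloc((void **)dev_dst, (size_t)N * N * sizeof(float));
    cudaMemcpy(*dev_dst, packed, (size_t)N * N * sizeof(float),
               cudaMemcpyHostToDevice);
    free(packed);
}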

IV. EXPERIMENTAL DETAILS

In this section, the CPU/GPU system parameters, the 2D electric charge distribution problem, and the developed CUDA/C code for the GPU with and without shared memory are discussed.

A. CPU/GPU System Parameters

We use two popular GPU cards (Fermi and Kepler) with a multicore CPU. The system configuration parameters of the workstation are summarized in Table I. The dual-processor (quad-core per processor) workstation runs at 2.13 GHz. The Fermi card has 14 streaming multiprocessors (SMs) with 32 CUDA cores per SM; the Kepler card has 13 SMs with 192 CUDA cores per SM. The operating system is Debian 6.0.

TABLE I
SYSTEM PARAMETERS

Parameter               Description
CPU                     Intel Xeon
CPU Cores               8
CPU RAM                 6 GB
Fermi GPU Card          NVIDIA Tesla C2075
Fermi GPU Cores         448
Fermi Clock Speed       1.15 GHz
Fermi Global Memory     5.4 GB
Fermi Shared Memory     49 KB/Block
Kepler GPU Card         NVIDIA Tesla K20m
Kepler GPU Cores        2496
Kepler Clock Speed      0.71 GHz
Kepler Global Memory    4.8 GB
Kepler Shared Memory    49 KB/Block
Operating System        Linux Debian

B. 2D Electric Charge Distribution

In many cases, when the charge distribution is not known, Poisson's equation can be used to solve electrostatic problems. For a material with electric potential $\phi$ and medium permittivity $\epsilon$, and based on finite-difference approximations, Laplace's equation (a customized form of Poisson's equation) for a 2D problem can be written as Equation 1:

$\epsilon_{x(i,j)} \frac{\phi_{i+1,j} - \phi_{i,j}}{dx} + \epsilon_{y(i,j)} \frac{\phi_{i,j+1} - \phi_{i,j}}{dy} + \epsilon_{x(i-1,j)} \frac{\phi_{i,j} - \phi_{i-1,j}}{dx} + \epsilon_{y(i,j-1)} \frac{\phi_{i,j} - \phi_{i,j-1}}{dy} = 0 \quad (1)$

where $dx$ and $dy$ are the spatial grid sizes, $\phi_{i,j}$ is the electric potential defined at lattice point $(i, j)$, and $\epsilon_{x(i,j)}$ and $\epsilon_{y(i,j)}$ are the effective x- and y-direction permittivities defined at the edges of element cell $(i, j)$. For a uniform material, the permittivity can be considered the same in all directions, so Equation 1 reduces to the 2D form in Equation 2, which can be solved using a discrete approach:

$\frac{\phi_{i+1,j} - \phi_{i,j}}{dx} + \frac{\phi_{i,j+1} - \phi_{i,j}}{dy} + \frac{\phi_{i,j} - \phi_{i-1,j}}{dx} + \frac{\phi_{i,j} - \phi_{i,j-1}}{dy} = 0 \quad (2)$

The multithreaded CUDA/C shared-memory implementation of the 2D Laplace equation for charge distribution is shown in Figure 7. Here, the appropriate values of i (i.e., the current threadIdx.x) and j (i.e., threadIdx.y) for each thread, together with the shared variables As[i][j], are used to hide memory latency. Thread executions are synchronized to ensure correctness.

Fig. 7. Main loop in CUDA/C to solve Laplace's equation for charge distribution.
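Because the code listing of Figure 7 is not reproduced here, the following sketch shows one way such a shared-memory main loop can look: each block stages a tile (with a one-cell halo) of the potential array into shared memory, synchronizes, and then applies the discrete update. The tile size, halo handling, and array names are assumptions; this is not the paper's actual Figure 7 code.

// Sketch only: a possible shared-memory kernel for the 2D Laplace update.
// n is assumed to be a multiple of BLOCK (e.g., 512), so every launched
// thread maps to one grid cell; out-of-grid neighbors are treated as 0,
// matching the boundary condition used in the validation.
#define BLOCK 16

__global__ void laplace_step(const float *phi_old, float *phi_new, int n)
{
    /* Tile with a one-cell halo so each thread reads its four neighbors
     * from shared memory instead of global memory. */
    __shared__ float As[BLOCK + 2][BLOCK + 2];

    int gx = blockIdx.x * blockDim.x + threadIdx.x;   /* global column        */
    int gy = blockIdx.y * blockDim.y + threadIdx.y;   /* global row           */
    int i  = threadIdx.x + 1;                         /* local column (past halo) */
    int j  = threadIdx.y + 1;                         /* local row (past halo)    */

    /* Load this thread's cell; edge threads also load the halo cells. */
    As[j][i] = phi_old[gy * n + gx];
    if (threadIdx.x == 0)         As[j][0]         = (gx > 0)     ? phi_old[gy * n + gx - 1]   : 0.0f;
    if (threadIdx.x == BLOCK - 1) As[j][BLOCK + 1] = (gx < n - 1) ? phi_old[gy * n + gx + 1]   : 0.0f;
    if (threadIdx.y == 0)         As[0][i]         = (gy > 0)     ? phi_old[(gy - 1) * n + gx] : 0.0f;
    if (threadIdx.y == BLOCK - 1) As[BLOCK + 1][i] = (gy < n - 1) ? phi_old[(gy + 1) * n + gx] : 0.0f;
    __syncthreads();              /* make the full tile visible to the block */

    /* Four-neighbor Jacobi-style average for the discrete 2D Laplace equation
     * (the paper's validation, Equation 3, uses a closely related five-node average). */
    phi_new[gy * n + gx] = 0.25f * (As[j][i - 1] + As[j][i + 1] +
                                    As[j - 1][i] + As[j + 1][i]);
}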

V. RESULTS AND DISCUSSION

We conduct the electric charge distribution experiment (see Equation 2) using the code sample in Figure 7. We implement three versions of the program: (i) CPU-only, (ii) GPU without shared memory, and (iii) GPU with shared memory. While copying data from CPU main memory to GPU global memory, we apply the proposed technique so that shared memory can be used efficiently.

A. Validation of the CUDA/C Programs

To validate the developed CUDA/C programs, we consider an 8x8 matrix. As shown in Figure 8, Node(4,4), Node(4,5), Node(5,4), and Node(5,5) are initially set to a high value of 10000 and all other nodes are set to a low value of 0 (zero). Nodes just outside the 8x8 matrix are also set to 0 as a boundary condition.

Fig. 8. An 8x8 matrix with boundary condition.

Using the CPU/C and CUDA/C (without GPU shared memory) codes, we calculate the new value of every node of the matrix as stated in Equation 3, where $1 \le n \le 8$ and $1 \le m \le 8$:

$N_{n,m} = \frac{1}{5} \left( N_{n,m-1} + N_{n,m+1} + N_{n,m} + N_{n-1,m} + N_{n+1,m} \right) \quad (3)$

The program stops when every node has a value less than 1. Figure 9 shows the values of Node(1,1), Node(3,4), Node(5,5), and Node(8,8) after iterations 1, 10, 50, and 100. As expected, the CPU/C and CUDA/C versions produce exactly the same value for each node after any number of iterations.

Fig. 9. Validation of the developed CUDA/C code.
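As a plain-C illustration of this validation setup (not the authors' code), the sketch below seeds the four center nodes of a padded 8x8 grid, applies the Equation 3 averaging update, and iterates until every node drops below 1. The padded-array layout and function name are assumptions.

// Sketch only: CPU/C reference for the Section V-A validation.
#include <string.h>

#define M 8                      /* interior matrix is 8x8           */
#define P (M + 2)                /* pad with one boundary row/column */

/* Apply the Equation 3 averaging update until every interior node is
 * below 1.0; returns the number of iterations performed. */
int relax_until_settled(void)
{
    static float cur[P][P], nxt[P][P];   /* static storage: boundary stays 0 */
    int iters = 0, done = 0;

    memset(cur, 0, sizeof(cur));         /* all nodes and boundary start at 0 */
    cur[4][4] = cur[4][5] = cur[5][4] = cur[5][5] = 10000.0f;  /* high-value seed nodes */

    while (!done) {
        done = 1;
        for (int n = 1; n <= M; n++)
            for (int m = 1; m <= M; m++) {
                nxt[n][m] = (cur[n][m - 1] + cur[n][m + 1] + cur[n][m] +
                             cur[n - 1][m] + cur[n + 1][m]) / 5.0f;   /* Equation 3 */
                if (nxt[n][m] >= 1.0f)
                    done = 0;            /* keep iterating until every node < 1 */
            }
        memcpy(cur, nxt, sizeof(cur));
        iters++;
    }
    return iters;
}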

B. Impact of the Number of Threads

In the experiments, execution time decreases as the number of threads increases, as illustrated in Figure 10. The results show that for a small number of threads (fewer than 8), Kepler takes more time than Fermi, whereas for a large number of threads (more than 16), Fermi takes more time than Kepler. Kepler is slower than Fermi below 16 threads because Fermi runs at a faster clock rate (1.15 GHz versus 0.71 GHz for Kepler). Fermi is slower than Kepler above 16 threads because Fermi has fewer load/store units than Kepler (16 units versus 32).

Fig. 10. GPU Time vs. Number of Threads.

C. Impact of GPU Shared Memory

For 16x16 threads, the execution times of both cards decrease as the amount of GPU shared memory used increases (as shown in Figure 11). It should be noted that Fermi takes less time than Kepler, probably because Fermi runs at a faster clock speed and has a wider memory bus than Kepler (384-bit versus 320-bit).

Fig. 11. GPU Time vs. Shared Memory Used.

D. Impact of CPU-to-GPU Memory Mapping

Finally, we evaluate the impact of the proposed CPU-to-GPU memory mapping technique. Figure 12 shows the execution times on the Fermi card while solving Laplace's equation for electric charge distribution on a 512x512 thin surface. For more than 9x9 threads, GPU shared memory shows an improvement. For more than 16x16 threads, execution time increases, probably because of the limit of 16 load/store units. The experimental results indicate that the proposed CPU-to-GPU memory mapping combined with GPU shared memory provides the best performance.

Fig. 12. Impact of Data Regrouping.

VI. CONCLUSION

NVIDIA CUDA-accelerated GPU computing has the potential to provide fast and inexpensive solutions to massively large and complex problems. In CPU/GPU computing, CPU data is first copied into GPU global memory, and it is beneficial to keep that data in GPU shared memory rather than in GPU global memory. Because shared memory is much smaller than global memory, efficient CPU-memory to GPU-global-memory mapping algorithms are required to improve performance. In this paper, we present a CPU-to-GPU memory mapping technique that enhances GPU (as well as overall system) performance. We implement three solutions (CPU-only, CPU/GPU without shared memory, and CPU/GPU with shared memory) to solve Laplace's equation for electric charge distribution on a 2D thin surface using NVIDIA Fermi (448 cores) and Kepler (2496 cores) GPU cards. The experimental results clearly support the usefulness of GPU shared memory for both GPU cards. The results also show that properly regrouping CPU data while copying it into GPU global memory helps improve performance. Based on the experimental results, the proposed CPU-to-GPU memory mapping technique is capable of decreasing the overall execution time by more than 75%.

In many research areas, including the computational analysis of composite materials, where modeling and simulation of nanocomposites requires a large number of computations, high performance computing is a must. We plan to extend this CPU-to-GPU memory mapping technique to study composite materials for aircraft applications in our next endeavor.

REFERENCES

[1] M. Harris, "Unified Memory in CUDA 6," NVIDIA Parallel Forall blog, http://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/.
[2] B. Chapman, G. Jost, and R. van der Pas, Using OpenMP: Portable Shared Memory Parallel Programming (Scientific and Engineering Computation). The MIT Press, Oct. 2007.
[3] T. Edison, "GPU memory system," http://www.yuwangcg.com/project1.html.
[4] A. Asaduzzaman, "A power-aware multi-level cache organization effective for multi-core embedded systems," in JCP, 2012.
[5] K. S. McKinley, S. Carr, and C. Tseng, "Improving data locality with loop transformations," ACM Transactions on Programming Languages and Systems, vol. 18, no. 4, p. 424, 1996.
[6] P. J. Denning, "The locality principle," Communication Networks and Computer Systems, 2006.
[7] H. Hoffmann, A. Agarwal, and S. Devadas, "Partitioning strategies for concurrent programming," Massachusetts Institute of Technology (MIT), CSAIL, 2009.