Speed Up Your Codes Using GPU

Wu Di and Yeo Khoon Seng (Department of Mechanical Engineering)

The use of Graphics Processing Units (GPUs) for rendering is well known, but their power for general parallel computation has only recently been explored. Parallel algorithms running on GPUs can often achieve speedups of up to 10 times over comparable CPU algorithms, and the technology has been applied in many fields such as physics simulations, signal processing, financial modeling, neural networks and countless others.

The model for GPU computing is to use a CPU and a GPU together in a heterogeneous co-processing model: the sequential part of the application runs on the CPU, while the computationally intensive part is accelerated by the GPU. From the user's perspective, the application simply runs faster, because it exploits the high performance of the GPU.

GPUs differ from CPUs in that they are designed to run hundreds or even thousands of threads simultaneously (Fig. 1). In gaming, for example, a GPU may run separate threads to render the individual pixels of an image. Programming a CPU, on the other hand, restricts you to one, two or four CPU threads. The advantage of CPUs is that individual threads can run totally different programs, whereas a GPU is designed to run the same program multiple times across thousands of threads. In that sense GPUs truly process data in parallel, and the programmer should design GPU programs with this processing model in mind.

Figure 1: Comparison of the structure of a CPU and a GPU.

To run code on a GPU device, you need an environment in which you can develop using CUDA C. The following items are necessary: (a) a CUDA-enabled graphics processor; (b) an NVIDIA device driver; (c) a CUDA development toolkit; and (d) a standard C compiler. The excellent HPC folks at Computer Centre already have all of the above set up for your convenience.

In accordance with the laws governing written works of computer programming, we begin with a "Hello, world!" example that illustrates how to invoke multiple threads on a GPU device. CUDA C adds the __global__ qualifier to standard C; this qualifier alerts the compiler that a function should be compiled to run on the device instead of the host. There is nothing special about passing parameters to a kernel: a kernel call looks and acts much like any function call in standard C, and the runtime system takes care of any complexity introduced by parameters that need to get from the host to the device. In the main function, two values appear inside the angle brackets, indicating that the kernel will be launched with one block and five threads in that block (please refer to the CUDA C book for details on the block and thread definitions).
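A minimal sketch of such an example (the kernel name hello is illustrative; the value 1.2345 is the one discussed below):

    #include <stdio.h>

    /* __global__ marks a function that is compiled for, and runs on, the
     * GPU device. Each thread prints the same passed-in value together
     * with its own thread index, threadIdx.x. */
    __global__ void hello(float value)
    {
        printf("Hello, world! value = %f, threadIdx.x = %d\n",
               value, threadIdx.x);
    }

    int main(void)
    {
        /* <<<1, 5>>> launches one block containing five threads. */
        hello<<<1, 5>>>(1.2345f);
        cudaDeviceSynchronize();  /* wait for the kernel's output to flush */
        return 0;
    }

Compiled with nvcc and run, this prints something like:

    Hello, world! value = 1.234500, threadIdx.x = 0
    Hello, world! value = 1.234500, threadIdx.x = 1
    Hello, world! value = 1.234500, threadIdx.x = 2
    Hello, world! value = 1.234500, threadIdx.x = 3
    Hello, world! value = 1.234500, threadIdx.x = 4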

As we can see, each thread encounters the printf() command, so there are as many lines of output as there are threads launched in the grid. As expected, the global value 1.2345 is common to all threads, while the local value (threadIdx.x) is distinct for each thread. A for-loop fragment can be accelerated on the GPU just as easily; the following example illustrates how CUDA C can be used to sum two vectors.
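A minimal sketch of such a program (assuming, as in the discussion below, device arrays named dev_a, dev_b and dev_c, a kernel named add, and N parallel blocks; N = 4 here to match the four-block example):

    #include <stdio.h>

    #define N 4  /* number of vector elements and of parallel blocks */

    /* Each block computes one element of the sum; blockIdx.x selects
     * which element this block is responsible for. */
    __global__ void add(int *a, int *b, int *c)
    {
        int tid = blockIdx.x;          /* this block's index in the grid */
        if (tid < N)                   /* guard against stray indices */
            c[tid] = a[tid] + b[tid];
    }

    int main(void)
    {
        int a[N], b[N], c[N];
        int *dev_a, *dev_b, *dev_c;

        /* Allocate three arrays on the device: two inputs, one result. */
        cudaMalloc((void **)&dev_a, N * sizeof(int));
        cudaMalloc((void **)&dev_b, N * sizeof(int));
        cudaMalloc((void **)&dev_c, N * sizeof(int));

        for (int i = 0; i < N; i++) {  /* fill the inputs on the host */
            a[i] = i;
            b[i] = i * i;
        }

        /* Copy the inputs to the device. */
        cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

        add<<<N, 1>>>(dev_a, dev_b, dev_c);  /* N blocks, 1 thread each */

        /* Copy the result back to the host. */
        cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);

        for (int i = 0; i < N; i++)
            printf("%d + %d = %d\n", a[i], b[i], c[i]);

        /* Release the allocated device memory. */
        cudaFree(dev_a);
        cudaFree(dev_b);
        cudaFree(dev_c);
        return 0;
    }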

First, three arrays are allocated on the device using calls to cudaMalloc(): two arrays, dev_a and dev_b, to hold the inputs, and one array, dev_c, to hold the result. Invoking cudaMemcpy() with the parameter cudaMemcpyHostToDevice copies the input data to the device, and invoking it with cudaMemcpyDeviceToHost copies the result back to the host. After the computation on the device, the allocated memory is released with cudaFree(). In this example we specified N as the number of parallel blocks; the collection of these parallel blocks is called a grid. This tells the runtime system that we want a one-dimensional grid of N blocks. The blocks take varying values of blockIdx.x: the first has the value 0 and the last has the value N-1. Taking four blocks as an example, all of them run through the same copy of the device code but hold different values of the variable blockIdx.x. The fragments below show the actual code executed in each of the four parallel blocks after the runtime substitutes the appropriate block index for blockIdx.x.
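Conceptually, with the kernel sketched above, the four blocks execute the following (a sketch of the substitution, with N as before):

    /* Block 0 executes: */
    int tid = 0;
    if (tid < N) c[tid] = a[tid] + b[tid];

    /* Block 1 executes: */
    int tid = 1;
    if (tid < N) c[tid] = a[tid] + b[tid];

    /* Block 2 executes: */
    int tid = 2;
    if (tid < N) c[tid] = a[tid] + b[tid];

    /* Block 3 executes: */
    int tid = 3;
    if (tid < N) c[tid] = a[tid] + b[tid];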

Why do we check whether tid is less than N? It should always be less than N, since we specifically launched the kernel so that this assumption holds. But if that rule is ever broken through carelessness, the resulting bug cannot be caught at compile time. Errors of this kind will not prevent the application from continuing to run, but they will almost certainly cause all manner of unpredictable and unsavory side effects downstream. It is therefore necessary to check any operation that might fail; doing so can save hours of pain in debugging the code later. Finally, the achievable speed-up varies with the application, the hardware device and the quality of the code. Understanding the parameters of the GPU devices you are using will help you improve the performance of your applications. Check out the NVIDIA resources that explain the technical details of CUDA C (https://developer.nvidia.com/category/zone/cuda-zone).
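One common idiom for such checks (a sketch, not from the original article) is to wrap every CUDA runtime call in a macro that tests the returned cudaError_t:

    #include <stdio.h>
    #include <stdlib.h>

    /* Abort with a readable message if a CUDA runtime call fails. */
    #define HANDLE_ERROR(call)                                          \
        do {                                                            \
            cudaError_t err = (call);                                   \
            if (err != cudaSuccess) {                                   \
                printf("CUDA error: %s at %s:%d\n",                     \
                       cudaGetErrorString(err), __FILE__, __LINE__);    \
                exit(EXIT_FAILURE);                                     \
            }                                                           \
        } while (0)

    /* Usage, for example:
     * HANDLE_ERROR(cudaMalloc((void **)&dev_a, N * sizeof(int))); */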