Scientific discovery, analysis and prediction made possible through high performance computing.

An Introduction to GPGPU Programming. Bob Torgerson, Arctic Region Supercomputing Center, November 21st, 2013

Introduction

Contents: What is GPU computing?; Brief History of GPGPU; Introduction to CUDA; CUDA API Basics; Advanced CUDA Concepts

What is GPU computing? GPU computing is the use of the GPU together with the CPU to accelerate general-purpose applications, also known as GPGPU (General-Purpose Computing on GPUs). The most computationally intense work is offloaded to the GPU. Working as a tag team, the CPU and GPU complement each other well: the CPU is optimized for serial processing (SISD/MIMD), while the GPU is optimized for parallel processing (SIMD).

Why GPU computing? CPU vs GPU http://www.youtube.com/watch?v=-p28lkwtzri

AMD Opteron 2435 (CPU) specs: 6 processor cores, 12 virtual cores (hyperthreading), 904 million transistors, ~100 GFLOPs, 768 KB L1 cache, 3 MB L2 cache, 6 MB L3 cache

Nvidia Tesla X2090 (GPU) specs: 512 CUDA cores, 3 billion transistors, 1.33 TFLOPs (single-precision floating point), 665 GFLOPs (double-precision floating point), 6 GB on-board memory, 177 GB/s memory bandwidth

Brief History of GPGPU. On October 11th, 1999, Nvidia released what it marketed as the first ever GPU, offloading the task of transformation & lighting from the CPU. In the early 2000s, many started to notice the power of the GPU, and researchers began writing code in OpenGL and Cg, but accessibility was limited for general programmers and industry. Seeing a need, Nvidia made their GPUs fully programmable and offered the CUDA parallel programming model, which works in a variety of languages, most notably C, C++ and Fortran.

Introduction to CUDA. CUDA stands for Compute Unified Device Architecture. With CUDA, an Nvidia GPU can be used for general-purpose processing; only Nvidia GPUs can be used with CUDA. Different versions of CUDA make different API calls available. CUDA works on all Nvidia GPUs from the G8x series onwards, including the Nvidia Tesla GPUs available on the compute nodes of ARSC's supercomputer, Fish. CUDA works on all major operating systems: Microsoft Windows, Mac OS X, and many variants of Linux.

Introduction to CUDA (cont.) To run CUDA at home, visit https://developer.nvidia.com/cuda-downloads, download the CUDA release for your operating system, and follow the instructions in the provided Getting Started guide. Once you have installed the CUDA toolkit, you will have all of the necessary tools to compile and run CUDA on your system. An important tool is nvcc, which compiles your CUDA source code into a binary; CUDA source code is normally contained in a file with the suffix .cu.
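
For example (a hedged sketch, not from the talk; the file name hello.cu and the kernel are purely illustrative), a tiny CUDA program can be compiled and run with nvcc like this:

// hello.cu -- minimal illustrative example
// compile and run with:
//   nvcc -o hello hello.cu      (older toolkits may need: nvcc -arch=sm_20 -o hello hello.cu)
//   ./hello
#include <stdio.h>

__global__ void helloKernel(void)
{
    // each device thread prints its own index
    printf("Hello from GPU thread %d\n", threadIdx.x);
}

int main(void)
{
    helloKernel<<<1, 4>>>();      // launch one block of four threads
    cudaDeviceSynchronize();      // wait for the kernel (and its output) to finish
    return 0;
}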

CUDA API Basics. In the following section, I will be going through some of the basic API calls available in CUDA. For those familiar with C/C++, these will seem fairly natural to the language, with a few caveats such as the <<< >>> launch syntax. Each new API call comes with information about the call and a small piece of example code showing how it could be used. DISCLAIMER: While there are Fortran examples online for use with CUDA, I have neither tested nor tried any. All of the following works with C.

cudaMalloc. Similar to the malloc command for allocating memory on the host, cudaMalloc allocates a chunk of memory in the GPU's available memory. A pointer (float* or void*) indicates the start of the allocated memory. It is called with two arguments: a reference to a pointer variable (i.e. &var1) and the size of memory to allocate, which is easier to compute with sizeof().

cudaMalloc
const int N = 20;
size_t size = 30 * sizeof(float);
float *d_a, *d_b;   // note: each variable needs its own * to be a pointer
void *d_c;

cudaMalloc(&d_a, 10 * sizeof(float));
cudaMalloc(&d_b, N * sizeof(float));
cudaMalloc(&d_c, size);

cudaFree. cudaFree releases memory that has been allocated on the device, analogous to free() for C/C++ malloc(). cudaFree and cudaMalloc behave differently depending on where they are executed: cudaFree run on the device cannot free device memory that was allocated by the host, and cudaMalloc run on the device can only allocate space up to cudaLimitMallocHeapSize. Called with a single argument: a pointer to the memory location on the device.

cudaFree
float* d_a;
cudaMalloc(&d_a, 30 * sizeof(float));
...
cudaFree(d_a);

cudaMemcpy. This function copies data between the host system and the GPU device. The destination must have enough pre-allocated space for the data being copied. The function is used for copying data to the device, from the device, and within the device, using the direction flags cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, and cudaMemcpyDeviceToDevice. Called with 4 arguments: a pointer to the memory being copied to, a pointer to the memory being copied from, the size of the data being transferred from the second argument to the first, and the direction to copy the data (host to device, device to host, or device to device).

cudaMemcpy
cudaMemcpy(d_a, h_a, 30 * sizeof(float), cudaMemcpyHostToDevice);
...
cudaMemcpy(h_a, d_a, 30 * sizeof(float), cudaMemcpyDeviceToHost);

const int N = 50;
size_t size = N * sizeof(float);
cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);
...
cudaMemcpy(h_b, d_b, size, cudaMemcpyDeviceToHost);

Example Code What does this code do? What would you expect the result to be from this running on the GPU?
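
As a hedged sketch of this kind of example, using only the calls introduced so far (the names h_a, d_a, and N are illustrative, not the code from the slide):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int N = 30;
    size_t size = N * sizeof(float);

    // allocate and fill an array on the host
    float* h_a = (float*)malloc(size);
    for (int i = 0; i < N; i++) h_a[i] = (float)i;

    // allocate matching space on the device and copy the data over
    float* d_a;
    cudaMalloc(&d_a, size);
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);

    // ... kernel work would happen here ...

    // copy the (unchanged, in this sketch) data back and clean up
    cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);
    cudaFree(d_a);

    printf("h_a[10] = %f\n", h_a[10]);
    free(h_a);
    return 0;
}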

Kernels When I think of kernels, I think of two things

CUDA Kernels. A kernel is a function, callable from the host system, that runs on the CUDA-enabled device across many threads in parallel. This allows work to be performed on data that has been loaded into the memory of the GPU. CUDA expands the C language with a set of its own qualifiers for controlling the flow of execution. Functions are defined using one of three prefixes: __host__ runs only on the host and can only be called from the host; __device__ runs only on the device and can only be called from the device; __global__ (a kernel) runs on the device and can only be called from the host. A limitation of CUDA kernels is that they cannot be recursive (i.e. call themselves) and cannot have a variable number of arguments.

CUDA Kernel Examples
__device__ void increment_values(...) { ... }

__global__ void gpu_main(...) { ... }   // __global__ kernels must return void

__host__ int main(...) { ... }

CUDA Kernel Examples
__host__ void incrementOnHost(float *host_a, int N)
{
    for (int i = 0; i < N; i++) {
        host_a[i] = host_a[i] + 1.f;
    }
}

__global__ void incrementOnDevice(float *device_a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        device_a[idx] = device_a[idx] + 1.f;
    }
}

CUDA Thread Indexing. You may have noticed the unexplained syntax in the previous example, i.e. threadIdx.x, blockDim.x, blockIdx.x. CUDA provides these built-in variables for the blocks of threads that are run against a kernel. Rather than performing a loop, we use the parallel nature of the threads to perform the same work. For a better understanding of this concept, take a look at the picture in the following slide.

CUDA Thread Indexing. The first thing to understand is that a grid is created whenever a kernel is executed. A grid is a 2-D array of blocks, and a block is a 3-D array of threads. All of the threads within a block are able to communicate and synchronize, and threads within a block share memory. A thread is a single instance of a parallel process. To gain the true power of the GPU, hundreds of threads must be executing in parallel. Due to hardware restrictions, the maximum number of threads per block is 512.

CUDA Thread Indexing. Every CUDA thread has its own unique ID, which can be determined in a straightforward way from its blockDim, blockIdx & threadIdx variables.
For a 1D block: int idx = blockIdx.x * blockDim.x + threadIdx.x;
For a 2D block: int idx = blockIdx.x * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x;
For a 3D block: int idx = blockIdx.x * blockDim.x * blockDim.y * blockDim.z + threadIdx.z * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x;

CUDA Thread Indexing. Modeling the block after the problem can result in easier thread indexing. For example, a matrix can be indexed like this (see the sketch below):
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int idy = blockIdx.y * blockDim.y + threadIdx.y;
These built-in variables can be used in a variety of ways to index your data. The data is accessible by all threads; the programmer decides how best to access and manipulate it.
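
As a hedged sketch (not from the slides; matrixAdd and the row-major N x N layout are assumptions for illustration), a kernel adding two matrices could use these two indices directly:

// element-wise addition of two N x N row-major matrices
__global__ void matrixAdd(const float *A, const float *B, float *C, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int idy = blockIdx.y * blockDim.y + threadIdx.y;   // row

    if (idx < N && idy < N) {
        int element = idy * N + idx;    // flatten the 2-D coordinate
        C[element] = A[element] + B[element];
    }
}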

CUDA dim3 Variables. Another CUDA addition to the C language is the dim3 type. dim3 provides a way of defining the dimensions that a grid of blocks or a block of threads can have, using unsigned integers as the limits of those dimensions. As the name implies, dim3 variables can hold 3-D definitions to match the thread structure within a block. They are defined using parentheses to indicate the dimensions: dim3 <varname>(<dim1>,<dim2>,<dim3>);

CUDA dim3 Example
int M = 4;
int N = 8;
dim3 blocks_per_grid(M, M);
dim3 threads_per_block(N, N);

Running a CUDA Kernel. To run a CUDA kernel, an extension to the C language has been added: function_name<<<dimGrid, dimBlock>>>(args); CUDA kernels run asynchronously, so you can continue running sequential code on the CPU while the parallel work is being done on the GPU. Calls to cudaMemcpy() block the execution of the next lines of code; all threads running on the GPU are synchronized before cudaMemcpy returns.

Example Code What does this code do? What would you expect the result to be from this running on the GPU?
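
As a hedged sketch of a complete program along these lines (reusing the incrementOnDevice kernel from earlier; the sizes and launch configuration are illustrative):

#include <stdio.h>
#include <stdlib.h>

__global__ void incrementOnDevice(float *device_a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        device_a[idx] = device_a[idx] + 1.f;
    }
}

int main(void)
{
    const int N = 1024;
    size_t size = N * sizeof(float);

    float *h_a = (float*)malloc(size);
    for (int i = 0; i < N; i++) h_a[i] = 0.f;

    float *d_a;
    cudaMalloc(&d_a, size);
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);

    // 256 threads per block, enough blocks to cover all N elements
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    incrementOnDevice<<<blocksPerGrid, threadsPerBlock>>>(d_a, N);

    // cudaMemcpy blocks until the kernel has finished
    cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);
    printf("h_a[0] = %f (expect 1.0)\n", h_a[0]);

    cudaFree(d_a);
    free(h_a);
    return 0;
}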

The More You Know. Now you know everything you need to make a working CUDA program, or at least enough to be dangerous. The basics are easy, just like in every parallel programming extension, but doing things right takes practice: the necessary changes may not be obvious and require optimization. It also takes judgment to understand when code should be written for the GPU: lots of data to compute over, the same operations applied regardless of input, and limited branching (or branching in an expected way).

CUDA Warps. A warp may sound like something out of Star Trek, but it is actually a weaving term for the threads arranged lengthwise on a loom and crossed by the woof. In CUDA, the hardware is designed to execute threads in groups of 32, known as a warp; this is the smallest unit of threads that can be executed. Naturally, 32 threads of parallel work on data is hardly working the GPU to its fullest, so the GPU takes the blocks you provide and breaks them down into warps to be executed. Because the code abstracts away the SMs, the same code can run on old, new, or future Nvidia GPUs. Conditional branching done along warp boundaries can be much more efficient, and conditionals can have a profound effect on the runtime of kernels (see the sketch below).
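
To illustrate warp-aligned branching (a sketch that is not from the talk; the kernel names are made up), compare a condition that splits every warp with one that is uniform within each 32-thread warp:

// Divergent: even and odd threads within the same warp take different paths,
// so each warp executes both branches serially.
__global__ void divergentBranch(float *data)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)
        data[idx] = data[idx] * 2.0f;
    else
        data[idx] = data[idx] + 1.0f;
}

// Warp-aligned: all 32 threads of a warp evaluate the condition the same way,
// so each warp executes only one of the two branches.
__global__ void warpAlignedBranch(float *data)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0)
        data[idx] = data[idx] * 2.0f;
    else
        data[idx] = data[idx] + 1.0f;
}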

Nvidia GPU Architecture

CUDA Memory. Threads within the same block CAN communicate with each other during execution, thanks to a shared 16 KB memory block inside a streaming multiprocessor (SM). Threads in different blocks must write back their results to global memory before those results are accessible by other blocks, which makes memory management more complicated. All of the threads in a block are guaranteed to run on the same SM and use the same shared memory block. Similar to cache in CPUs, this shared memory is much faster than reading from the GPU's global memory, very nearly as fast as reading / writing a GPU register.

CUDA Limitations. An SM has 8,192 registers shared amongst all threads, and all blocks using that SM are limited by this value; the choice of SM is made by the hardware, not by the programmer. The number of active blocks on an SM cannot exceed 8, and the number of active warps on an SM cannot exceed 24, meaning a maximum of only 768 threads per SM, or 12,288 threads executing across all SMs at a time. Optimizing a CUDA program means finding a balance between the number of blocks and their size: a block of 512 threads would be very inefficient, since only one such block could be running on an SM (768 - 512 = 256 threads idle). Nvidia recommends running between 128 and 256 threads per block.

CUDA Shared Memory. The shared memory available to all threads in a block is managed by the programmer; the CUDA software does not make use of this memory unless requested. Efficient use of shared memory within blocks contributes to faster execution of code: reads / writes to global device memory can be 100-150 times slower than shared memory accesses. It takes about 4 clock cycles to read from shared memory versus about 400 clock cycles to read from global memory! In terms of speed: registers >(=) shared > global.

CUDA Shared Memory. For a kernel to use dynamically allocated shared memory, the launch must specify the amount of shared memory to allocate, via a third optional argument to the CUDA kernel execution: function_name<<<numBlocks, numDims, sharedMemSize>>>(args). To use the shared memory, it is easiest to let it be dynamically allocated: extern __shared__ float shared_data[]; This makes the full size of the allocated shared memory available through this variable. To have more than one array of data allocated in shared memory: extern __shared__ float shared[]; float* a = &shared[0]; float* b = &shared[count_a];

CUDA Shared Memory Example
__global__ void testFunc(int count_a)
{
    extern __shared__ float shared[];
    float* a = &shared[0];
    float* b = &shared[count_a];
    ...
}

int problemSize = 256 * 2048;
int numThreadsPerBlock = 256;
int numBlocks = problemSize / numThreadsPerBlock;
int sharedMemSize = numThreadsPerBlock * sizeof(float);
int count_a = 64;
testFunc<<<numBlocks, numThreadsPerBlock, sharedMemSize>>>(count_a);

Synchronize Threads. Before attempting to read data that other threads have written, you must synchronize the threads; otherwise you run the risk of pulling data from memory that has not been written yet (race conditions). __syncthreads(); is run from inside a kernel to block until all threads in the block reach this point. For blocking until all of the threads in a grid have finished, use cudaThreadSynchronize(); which must be run from the host.

Example Code What does this code do? What would you expect the result to be from this running on the GPU?
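
As a hedged sketch combining shared memory and __syncthreads() (the blockSum kernel and sizes are illustrative, and blockDim.x is assumed to be a power of two):

// each block sums its own chunk of the input using shared memory
__global__ void blockSum(const float *in, float *blockResults)
{
    extern __shared__ float sdata[];

    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // stage one element per thread into shared memory
    sdata[tid] = in[idx];
    __syncthreads();                    // make sure every element is written

    // tree reduction within the block
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            sdata[tid] += sdata[tid + stride];
        __syncthreads();                // wait before reading the updated values
    }

    if (tid == 0)
        blockResults[blockIdx.x] = sdata[0];
}

// launch, assuming N is a multiple of 256:
//   blockSum<<<N / 256, 256, 256 * sizeof(float)>>>(d_in, d_blockResults);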

CUDA Errors. Detecting and handling errors is essential to creating robust and usable software; no one wants to use code that fails with no way to determine why. CUDA provides error codes specific to the particular problems encountered, and these error codes can be converted into a string of characters to be displayed. CUDA error codes have their own type, cudaError_t, and const char* cudaGetErrorString(cudaError_t code); provides a human-readable description of the error code. A convenient call to get the most recent CUDA error is cudaGetLastError(); it is most useful after a blocking call, since it will then pick up, for example, the latest error from the end of a kernel execution.

CUDA Error Example
void checkCudaError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if (cudaSuccess != err)
    {
        fprintf(stderr, "CUDA Error: %s: %s.\n", msg, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

Example Code What does this code do? What would you expect the result to be from this running on the GPU?
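
A short usage sketch (assuming the checkCudaError helper above and the illustrative names from the earlier sketches); the check is most informative after a synchronization point:

incrementOnDevice<<<numBlocks, threadsPerBlock>>>(d_a, N);
cudaThreadSynchronize();                 // block until the kernel has finished
checkCudaError("incrementOnDevice");     // report any error raised by the kernel

cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);
checkCudaError("copy back to host");     // also catches errors from the copy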

Conclusion. CUDA is only one example of how to write code for the GPU; others include OpenCL, Microsoft's DirectCompute, and C++ AMP. CUDA attempts to make programming for the GPU easy by providing a familiar code structure, certainly easier than passing textures in OpenGL. Further work is being done to make running code on the GPU truly easy, such as OpenACC directives or compilation via LLVM. Give GPU programming a try and have fun!