
INTRODUCTION TO GPU COMPUTING WITH CUDA Topi Siro 19.10.2015

OUTLINE
PART I - Tue 20.10, 10-12: What is GPU computing? What is CUDA? Running GPU jobs on Triton
PART II - Thu 22.10, 10-12: Using libraries. Different memories in the GPU
PART III - Tue 27.10, 12-14: Writing fast kernels. Optimization strategies
PART IV - Fri 30.10, 10-12: Profiling

PART 1 THE VERY BASICS

INTRODUCTION
Modern graphics processors evolved from the 3D accelerators of the 80s and 90s that were needed to run computer games with increasingly demanding graphics. Even today, development is driven by the gaming industry, but general purpose applications are becoming more and more important. In recent years, GPU computing in various disciplines of science has seen rapid growth, and many modern supercomputers include GPUs. The two big players are ATI (AMD) and NVIDIA. So far, the majority of general purpose GPU computations have been implemented in NVIDIA's CUDA framework.

[Figure: the evolution of the theoretical performance of CPUs and GPUs]

CPU VS GPU
In a nutshell, the CPU typically runs one or a few very fast computational threads, while the GPU runs on the order of ten thousand simultaneous (not so fast) threads. The hardware in the GPU emphasizes fast data processing over cache and flow control. To benefit from the GPU, the computation should ideally be data-parallel and arithmetically intensive:
- data-parallel: the same instructions are executed in parallel on a large number of independent data elements
- arithmetic intensity: the ratio of arithmetic operations to memory operations

GPU PROGRAMMING
High level:
- Libraries: linear algebra, FFT, Thrust (an STL for CUDA), etc.
- OpenACC: like OpenMP for GPUs
- MATLAB
Low level:
- Write your own kernels: CUDA for NVIDIA, OpenCL for AMD

GPU PROGRAMMING
When to use GPUs? The best case scenario is a large data-parallel problem that can be parallelized into tens of thousands of simultaneous threads. Typical examples include manipulating large matrices and vectors. It might not be worth it if the data is used only once, because the speedup has to overcome the cost of data transfers to and from the GPU. With an abundance of libraries for standard algorithms, one can in many cases use GPUs in high-level code without having to know about the details (you don't have to write any kernels). However, optimal performance often requires code that is tailor-made for the specific problem. Keep in mind: the GPU is separated from the host system by the PCI-e bus and has its own memory!

CUDA
Compute Unified Device Architecture is the parallel programming model by NVIDIA for its GPUs. The GPUs are programmed with an extension of the C language. The major difference compared to ordinary C is special functions, called kernels, that are executed on the GPU by many threads in parallel. Each thread is assigned an ID number that allows different threads to operate on different data. The typical structure of a CUDA program:
1. Prepare data
2. Allocate memory on the GPU
3. Transfer data to the GPU
4. Perform the computation on the GPU
5. Transfer results back to the host side

CUDA
To illustrate the idea of CUDA, consider a function that copies a vector A to a vector B. In CUDA, we can do this in parallel on the GPU, as in the sketch below. The __global__ qualifier tells the compiler to compile the function for the GPU, and threadIdx.x is an intrinsic variable that contains the ID number of each thread.
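A minimal sketch of the two versions (assuming vectors of N elements and, on the GPU, a launch with a single block of N threads):

    // CPU: one thread loops over all the elements
    void vcopy(int *A, int *B, int N)
    {
        for (int i = 0; i < N; i++)
            B[i] = A[i];
    }

    // GPU: N threads each copy one element in parallel
    __global__ void Vcopy(int *A, int *B)
    {
        int id = threadIdx.x; // the ID of this thread
        B[id] = A[id];
    }

    // launched from the host as, e.g., Vcopy<<<1, N>>>(d_A, d_B);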

CUDA
The thread ID guides the different threads to operate on different data:
Thread 0: B[0] = A[0]
Thread 1: B[1] = A[1]
etc.

THREAD HIERARCHY
In CUDA, the threads are organized into blocks that form a grid. Threads within a block are executed on the same multiprocessor, and they can communicate via the shared memory. Threads in the same block can be synchronized, but different blocks are completely independent.
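As a preview of block-level cooperation (shared memory is covered in Part II), a hypothetical kernel where the threads of a block reverse their chunk of an array through shared memory, using __syncthreads() as the barrier:

    __global__ void reverseChunk(int *data)
    {
        __shared__ int tile[256];           // visible to all threads of the block
        int t = threadIdx.x;
        int base = blockIdx.x * blockDim.x; // start of this block's chunk

        tile[t] = data[base + t];
        __syncthreads();                    // wait until the whole block has written

        data[base + t] = tile[blockDim.x - 1 - t];
    }

    // assumes blockDim.x <= 256, e.g. reverseChunk<<<numBlocks, 256>>>(d_data);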

CUDA
The maximum number of threads per block in current GPUs is 1024. The thread ID in the variable threadIdx can be up to 3-dimensional for convenience:
dim3 threadsPerBlock(8, 8); // 2-dimensional block
The grid can be 1- or 2-dimensional, and the maximum number of blocks per dimension is 65535 (Fermi) or ~2 billion (Kepler). In addition to the thread ID, there are intrinsic variables for the grid dimensions, the block ID and the threads per block available in the kernel, called gridDim, blockIdx and blockDim. The number of threads per block should always be divisible by 32, because threads are executed in groups of 32, called warps.
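For example, a hypothetical kernel that zeroes an N x N matrix, launched with a 2-dimensional grid of 2-dimensional blocks (the 8x8 block size is an arbitrary choice, but note that 64 is divisible by 32):

    __global__ void zeroMatrix(float *M, int N)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x; // global column index
        int y = blockIdx.y * blockDim.y + threadIdx.y; // global row index
        if (x < N && y < N)
            M[y * N + x] = 0.0f;
    }

    // Host side: enough 8x8 blocks to cover the whole matrix
    dim3 threadsPerBlock(8, 8);
    dim3 numBlocks((N + 7) / 8, (N + 7) / 8);
    zeroMatrix<<<numBlocks, threadsPerBlock>>>(d_M, N);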

CUDA
A better version of the vector copy kernel, for vectors of arbitrary length. Each thread calculates its global ID number. Removing the if statement would result in an array overflow, unless N happens to be divisible by blockDim.x, so that we could launch exactly N threads:

    __global__ void Vcopy(int *A, int *B, int N)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x; // global thread ID
        if (id < N)
            B[id] = A[id];
    }

    int main()
    {
        // Host code
        ...
        Vcopy<<<numBlocks, numThreadsPerBlock>>>(A, B, N);
    }

ALLOCATION AND TRANSFER
The GPU memory is separate from the host memory that the CPU sees. Before any kernels can be run, the necessary data has to be transferred to the GPU memory via the PCIe bus. Memory on the GPU can be allocated with cudaMalloc:
cudaMalloc((void**)&d_A, Nbytes);
Memory transfers can be issued with cudaMemcpy:
cudaMemcpy(d_A, A, Nbytes, cudaMemcpyHostToDevice);
Be careful with the order of the arguments (to, from, ...). The last argument can be any of cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice or cudaMemcpyHostToHost.
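Putting the calls together for an N-element integer array (a sketch; d_A is the device pointer, and cudaFree releases the allocation when done):

    int *d_A;
    cudaMalloc((void**)&d_A, N * sizeof(int));                   // allocate on the GPU
    cudaMemcpy(d_A, A, N * sizeof(int), cudaMemcpyHostToDevice); // host -> device
    // ... run kernels on d_A ...
    cudaMemcpy(A, d_A, N * sizeof(int), cudaMemcpyDeviceToHost); // device -> host
    cudaFree(d_A);                                               // release the GPU memory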

The vector copy code, including the host part where we allocate the GPU memory and transfer the data to and from the GPU:

    #include <iostream>

    __global__ void Vcopy(int *A, int *B, int N)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id < N)
            B[id] = A[id];
    }

    int main()
    {
        int N = 1000;
        int numThreadsPerBlock = 256;
        int numBlocks = N / 256 + 1;

        int *A = new int[N];
        int *B = new int[N];
        for (int i = 0; i < N; i++)
            A[i] = i;

        int *d_A;
        int *d_B;
        cudaMalloc((void**)&d_A, N * sizeof(int));
        cudaMalloc((void**)&d_B, N * sizeof(int));

        cudaMemcpy(d_A, A, N * sizeof(int), cudaMemcpyHostToDevice);
        Vcopy<<<numBlocks, numThreadsPerBlock>>>(d_A, d_B, N);
        cudaMemcpy(B, d_B, N * sizeof(int), cudaMemcpyDeviceToHost);

        for (int i = 0; i < N; i++)
            if (A[i] != B[i])
                std::cout << "Error." << std::endl;
    }

COMPILING
CUDA programs are compiled and linked with the nvcc compiler (available on Triton with module load cuda). Files with CUDA code need to have the .cu extension. The syntax is identical to compiling with gcc. The only thing to remember is to compile for the appropriate GPU architecture, given with the -arch option (-arch sm_20 for Fermi GPUs). To compile and link a simple program in the file main.cu:
nvcc -arch sm_20 -o programname main.cu

SUMMARY OF PART I
- Modern GPUs can be used for general purpose calculations that benefit from large-scale parallelization.
- CUDA is the programming model by NVIDIA for its GPUs, and it enables writing code that runs on the GPU with an extension of the C programming language.
- CUDA programs consist of host code, run on the CPU, and kernels that are executed on the GPU in parallel by threads whose launch configuration is given in the kernel call.
- In the kernels, the threads can be identified with the intrinsic variables threadIdx and blockIdx.
- CUDA programs are compiled with nvcc.

HANDS-ON SESSION 1

GPU COMPUTING ON TRITON
https://wiki.aalto.fi/display/triton/gpu+computing
Triton has 19 GPU nodes with 2 GPUs per node. GPU jobs can be run in the Slurm partition gpu. Before anything else, load the CUDA module with module load cuda.

GPU COMPUTING ON TRITON
GPU jobs can be run straight from the command line or by submitting a job script. On the frontend, you can use salloc and srun, or submit with a script, as sketched below.
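A sketch of both approaches, assuming one GPU is requested from the gpu partition (the exact resource flags and time limits on Triton may differ):

    # Interactively on the frontend:
    salloc -p gpu --gres=gpu:1 -t 00:10:00
    srun ./programname

    # Or with a job script, submitted with sbatch job.sh:
    #!/bin/bash
    #SBATCH -p gpu
    #SBATCH --gres=gpu:1
    #SBATCH -t 00:10:00

    module load cuda
    srun ./programname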

GPU COMPUTING ON TRITON
You can check the queue status with squeue -p gpu. You can cancel jobs with scancel <JOBID>.

HANDS-ON SESSION 1
Log on to Triton: ssh -X <username>@triton.aalto.fi
Get the exercises from /triton/scip/gpu/GPUHandson1.tar: cp /triton/scip/gpu/GPUHandson1.tar .
Extract: tar xvf GPUHandson1.tar
Now the exercises can be found in the subdirectories of ./GPUHandson. Each exercise folder contains the .cu file that needs work, a Makefile for easy compiling, a job script for running the code in the queue system, and an info.txt that tells you what you need to do. There is also a Solutions subfolder with the right answer, but no peeking until you've tried your best!

HANDS-ON SESSION 1
Edit the .cu files with your favorite text editor. (I only know about emacs; turn on C++ highlighting with Esc-x c++-mode Enter.) Compile the code with make. Run the program with sbatch job.sh. The program will be submitted to the gpushort queue. The jobs should be very quick, but you might want to check the queue status with squeue -p gpushort. The output will be written to a <exercise_name>.out file.

HANDS-ON SESSION 1
The first exercise is in the folder 1_vecadd. Read info.txt for instructions. The code copies the contents of an array B into array A. Your job is to change the code so that it sums A and B into C.