CUDA Programming. Week 1. Basic Programming Concepts. Materials are copied from the reference list.

G80/G92 Device. SP: Streaming Processor (Thread Processor). SM: Streaming Multiprocessor; 128 SPs are grouped into 16 SMs. TPC: Texture Processing Cluster.

CUDA Programming Model. The GPU is a compute device: it serves as a coprocessor for the host CPU, has its own device memory on the card, and executes many threads in parallel. Parallel kernels run a single program in many threads. The GPU expects thousands of threads for full utilization.

CUDA Programming Kernels. Device = GPU; Host = CPU. Kernel = a function called from the host that runs on the device. One kernel is executed at a time; many threads execute each kernel.

CUDA Threads. A CUDA kernel is executed by an array of threads. All threads run the same code. Each thread has an ID, used to compute memory addresses and make control decisions. CUDA threads are extremely lightweight: very little creation overhead and instant switching.

Thread Batching. A kernel launches a grid of thread blocks. Threads within a block can share data through shared memory and synchronize their execution. Threads in different blocks cannot cooperate.

Thread ID. Each thread has access to: threadIdx.x (thread ID within the block), blockIdx.x (block ID within the grid), and blockDim.x (number of threads per block).
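
Combining these built-in variables gives each thread a unique global index. The sketch below is only an illustration (the kernel name fill and the array d_out are not from the slides): each thread writes its own global index into one element of the output array.

__global__ void fill(int* d_out)
{
    // global index = block offset + position within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    d_out[i] = i;
}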

Multidimensional IDs. Block ID: 1D or 2D. Thread ID: 1D, 2D, or 3D. This simplifies memory addressing when processing multidimensional data. We will talk about it later.

Kernel Memory Access. Registers: private to each thread. Global memory: kernel input and output data reside here; off-chip, large, uncached. Shared memory: shared among threads in a single block; on-chip, small, as fast as registers. The host can read and write global memory but not shared memory. [Figure: a grid of thread blocks; each thread has its own registers, each block has its own shared memory, and all blocks plus the host access global memory.]

Execution Model. Kernels are launched in grids; one kernel executes at a time. A block executes on one multiprocessor and does not migrate.

Programming Basics

Outline. New stuff; executing code on the GPU; memory management; shared memory; scheduling and synchronization.

NEW STUFF

C Extensions. New syntax and built-in variables. New restrictions: no recursion in device code, no function pointers in device code. API/Libraries: CUDA Runtime (host and device), device memory handling (cudaMalloc, ...), built-in math functions (sin, sqrt, mod, ...), atomic operations (for concurrency), data types (2D textures, dim2, dim3, ...).

New Syntax. <<< ... >>>; __host__, __global__, __device__; __constant__, __shared__, __device__; __syncthreads().

Built-in Variables. dim3 gridDim: dimensions of the grid in blocks (gridDim.z unused). dim3 blockDim: dimensions of the block in threads. dim3 blockIdx: block index within the grid. dim3 threadIdx: thread index within the block. dim3 (based on uint3): struct dim3 { int x, y, z; }, used to specify dimensions; default value (1,1,1).
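
A short sketch (the kernel name record_id is illustrative, not from the slides) of how these built-in variables might be combined inside a kernel launched with 2D thread blocks; unspecified dim3 components default to 1:

__global__ void record_id(int* d_out)
{
    // Flatten the 2D thread index, then add the block offset.
    int local = threadIdx.y * blockDim.x + threadIdx.x;
    int gid   = blockIdx.x * (blockDim.x * blockDim.y) + local;
    d_out[gid] = gid;
}

// Launch example: a 1D grid of 4 blocks, each an 8x8 block of threads
// (assumes d_out holds at least 4*64 ints in device memory):
// dim3 grid(4);  dim3 block(8, 8);
// record_id<<<grid, block>>>(d_out);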

Function Qualifiers. __global__: called from host (CPU) code and runs on the GPU; cannot be called from device (GPU) code; must return void. __device__: called from other GPU functions and runs on the GPU; cannot be called from host (CPU) code. __host__: called from the host and runs on the CPU. __host__ and __device__ combined (sample use: overloading operators): the compiler will generate both CPU and GPU code.
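
For instance, a function marked with both qualifiers (a hypothetical helper, not from the slides) can be called from either side:

__host__ __device__ float square(float x)
{
    // Compiled for both the CPU and the GPU; usable in host code and inside kernels.
    return x * x;
}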

Variable Qualifiers (GPU code). __device__: stored in global memory (not cached, high latency); accessible by all threads; lifetime: application. __constant__: stored in global memory (cached); read-only for threads, written by the host; lifetime: application. __shared__: stored in shared memory (like registers); accessible by all threads in the same thread block; lifetime: block. Unqualified variables are stored in local memory: scalars and built-in vector types are placed in registers, while arrays are stored in device memory.
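
A small sketch (hypothetical names, not from the slides) showing where each qualifier appears in code:

__device__ float d_scale = 2.0f;   // global memory; visible to all threads, lifetime = application
__constant__ float c_coeff[16];    // cached; read-only on the device, written by the host (e.g. cudaMemcpyToSymbol)

__global__ void scale(float* d_data)
{
    __shared__ float tile[256];    // shared memory; one copy per thread block
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // unqualified scalar: lives in a register
    tile[threadIdx.x] = d_data[i];
    __syncthreads();
    d_data[i] = tile[threadIdx.x] * d_scale;
}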

EXECUTING CODE ON THE GPU

__global__ examples:
__global__ void minimal(int* d_a) { *d_a = 13; }
__global__ void assign(int* d_a, int value)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    d_a[idx] = value;
}

Launching Kernels. Modified C function call syntax: kernel<<<dim3 grid, dim3 block>>>(...). Execution configuration (<<< >>>): grid dimensions in x and y; thread-block dimensions in x, y, and z.

Ex: VecAdd. Add two vectors, A and B, of dimension N, and put the result in vector C.
// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}
int main()
{
    ...
    // Kernel invocation
    VecAdd<<<1, N>>>(A, B, C);
}

Ex: MatAdd. Add two matrices, A and B, of dimension NxN, and put the result in matrix C.
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}
int main()
{
    ...
    // Kernel invocation
    dim3 dimBlock(N, N);
    MatAdd<<<1, dimBlock>>>(A, B, C);
}

Ex: MatAdd.
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}
int main()
{
    ...
    // Kernel invocation
    dim3 dimBlock(16, 16);
    dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x, (N + dimBlock.y - 1) / dimBlock.y);
    MatAdd<<<dimGrid, dimBlock>>>(A, B, C);
}

Executing Code on the GPU. Kernels are C functions with some restrictions: they can only access GPU memory, must have a void return type, cannot take a variable number of arguments (varargs), cannot be recursive, and cannot contain static variables. Function arguments are automatically copied from CPU to GPU memory.

Compiling a CUDA Program
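
As a minimal example (assuming the CUDA toolkit's nvcc compiler is on the path and the source file is named vecadd.cu, a hypothetical name), a single-file program can be compiled and run with:

$ nvcc -o vecadd vecadd.cu
$ ./vecadd

nvcc separates host and device code, compiles the device code for the GPU, and passes the host code to the regular C/C++ compiler.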

Compiled files

MEMORY MANAGEMENT

Managing Memory. Host (CPU) code manages device (GPU) memory; this applies to global device memory (DRAM). Tasks: allocate/free memory and copy data.

GPU Memory Allocation / Release. cudaMalloc(void** pointer, size_t nbytes); cudaMemset(void* pointer, int value, size_t count); cudaFree(void* pointer).
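
A minimal sketch of the allocate / initialize / release pattern (the size and variable names are illustrative):

int n = 1024;
int nbytes = n * sizeof(int);
int* d_a = 0;
cudaMalloc((void**)&d_a, nbytes);   // allocate device memory
cudaMemset(d_a, 0, nbytes);         // set every byte to 0
cudaFree(d_a);                      // release it when done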

Data Copies. cudaMemcpy(void* dst, void* src, size_t nbytes, enum cudaMemcpyKind direction); enum cudaMemcpyKind: cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice. Blocks the CPU thread: returns after the copy is complete. Doesn't start copying until previous CUDA calls complete.

Ex: VecAdd.
// Device code
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N) C[i] = A[i] + B[i];
}
// Host code
int main()
{
    int N = ...;
    size_t size = N * sizeof(float);
    // Allocate input vectors h_a and h_b (and the result h_c) in host memory
    float* h_a = malloc(size);
    float* h_b = malloc(size);
    float* h_c = malloc(size);
    // Allocate vectors in device memory
    float *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, size);
    cudaMalloc((void**)&d_b, size);
    cudaMalloc((void**)&d_c, size);

    // Copy vectors from host memory to device memory
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);
    // Invoke kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c);
    // Copy result from device memory to host memory
    // h_c contains the result in host memory
    cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);
    // Free device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
}

Shared Memory. __shared__: variable qualifier. Ex: parallel sum.
__global__ void reduce0(int *g_idata, int *g_odata)
{
    __shared__ int sdata[N];
    // each thread loads one element from global to shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    // do reduction in shared mem
    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

Dynamic Shared Memory. Used when the size of the shared memory array is determined at runtime.
__global__ void reduce0(int *g_idata, int *g_odata)
{
    extern __shared__ int sdata[];
    // each thread loads one element from global to shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    // do reduction in shared mem
    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

How to decide the shared memory size? When the CPU launches a kernel, the third argument of the execution configuration specifies the size (in bytes) of the dynamic shared memory: kernel<<<gridDim, blockDim, smSize>>>(...)
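
For example, launching the reduce0 kernel above with one int of dynamic shared memory per thread might look like this (block and grid sizes are illustrative; d_idata and d_odata are assumed to have been allocated with cudaMalloc):

int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
size_t smSize = threadsPerBlock * sizeof(int);   // one int per thread
reduce0<<<blocksPerGrid, threadsPerBlock, smSize>>>(d_idata, d_odata);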

SYNCHRONIZATION

Host Synchronization. All kernel launches are asynchronous: control returns to the CPU immediately, and the kernel executes after all previous CUDA calls have completed. cudaMemcpy() is synchronous: control returns to the CPU after the copy completes, and the copy starts after all previous CUDA calls have completed. cudaThreadSynchronize() blocks until all previous CUDA calls complete.
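
A small usage sketch (the kernel and variable names are carried over from the earlier example): an explicit synchronization is needed if the host wants to wait for an asynchronous kernel launch without an intervening cudaMemcpy().

VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c);  // returns immediately
cudaThreadSynchronize();                                    // block until the kernel finishes
// safe to measure elapsed time or start dependent host work here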

Device Runtime Synchronization. void __syncthreads(): synchronizes all threads in a block. Once all threads have reached this point, execution resumes normally. Used to avoid RAW / WAR / WAW hazards when accessing shared memory. Allowed in conditional code only if the conditional is uniform across the entire thread block.

Ex: Parallel summation

Ex: Parallel summation.
__global__ void reduce0(int *g_idata, int *g_odata)
{
    extern __shared__ int sdata[];
    // each thread loads one element from global to shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();
    // do reduction in shared mem
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2*s) == 0) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }
    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

Homework. Read programming guide chapters 1 and 2. Implement matrix-matrix multiplication: C = A*B, where A, B, C are NxN matrices and C[i][j] = sum_{k=1,...,N} A[i][k]*B[k][j]. Let each thread compute one C[i][j]. Try it (1) without shared memory and (2) with shared memory.
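
A possible starting point for version (1), without shared memory (a sketch only; it assumes row-major float arrays of size N*N already copied to the device, and uses 0-based indices k = 0, ..., N-1):

__global__ void MatMul(const float* A, const float* B, float* C, int N)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // row of C
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // column of C
    if (i < N && j < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[i * N + k] * B[k * N + j];
        C[i * N + j] = sum;
    }
}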