Information Coding / Computer Graphics, ISY, LiTH. Introduction to CUDA. Ingemar Ragnemalm Information Coding, ISY

Introduction to CUDA
Ingemar Ragnemalm, Information Coding, ISY

This lecture: the programming model and language, memory spaces and memory access, shared memory, examples.

Lecture questions:
1. Suggest two significant differences between CUDA and OpenCL.
2. How does matrix transposing benefit from using shared memory?
3. When do you typically need to synchronize threads?

CUDA = Compute Unified Device Architecture
Developed by NVIDIA. Only available on NVIDIA boards with a G80 or better GPU architecture. Designed to hide the graphics heritage and add control and flexibility.

Similar to shader-based solutions and OpenCL:
1. Upload data to the GPU
2. Execute the kernel
3. Download the result

Integrated source
The source of host and kernel code can be in the same source file, written as one and the same program! This is a major difference to shaders and OpenCL, where the kernel source is separate and explicitly compiled by the host. Kernel code is identified by special modifiers.

CUDA
An architecture and a C extension (and more!)
Spawn a large number of threads, to be run virtually in parallel. Just like in graphics! You can't expect all fragments/computations to be executed in parallel. Instead, they are executed a bunch at a time - a warp.
But unlike graphics it looks much more like an ordinary C program! No more data stored as pixels - they are just arrays!

Simple CUDA example
A working, compilable example:

#include <stdio.h>
#include <stdlib.h>

const int N = 16;
const int blocksize = 16;

// Kernel; threadIdx.x is the thread identifier
__global__ void simple(float *c)
{
    c[threadIdx.x] = threadIdx.x;
}

int main()
{
    int i;
    float *c = new float[N];
    float *cd;
    const int size = N*sizeof(float);

    cudaMalloc( (void**)&cd, size );                     // Allocate GPU memory
    dim3 dimBlock( blocksize, 1 );
    dim3 dimGrid( 1, 1 );                                // 1 block, 16 threads
    simple<<<dimGrid, dimBlock>>>(cd);                   // Call kernel
    cudaMemcpy( c, cd, size, cudaMemcpyDeviceToHost );   // Read back data
    cudaFree( cd );

    for (i = 0; i < N; i++)
        printf("%f ", c[i]);
    printf("\n");
    delete[] c;
    printf("done\n");
    return EXIT_SUCCESS;
}

Simple CUDA example
In the listing above, the annotations on the original slide mark the kernel and its thread identifier (threadIdx.x), the GPU memory allocation (cudaMalloc), the launch configuration (1 block, 16 threads), the kernel call, and the read-back of the data (cudaMemcpy).

Modifiers for code
Three modifiers are provided to specify how code should be used:
__global__ executes on the GPU, invoked from the CPU. This is the entry point of the kernel.
__device__ is local to the GPU.
__host__ is CPU code (superfluous).

CPU: __host__ myHostFunc()
GPU: __device__ myDeviceFunc(), __global__ myGlobalFunc()

Memory management
cudaMalloc(ptr, datasize)
cudaFree(ptr)
Similar to CPU memory management, but done by the CPU to allocate on the GPU.
cudaMemcpy(dest, src, datasize, arg)
arg = cudaMemcpyDeviceToHost or cudaMemcpyHostToDevice

Kernel execution
simple<<<gridDim, blockDim>>>( ... )
(Weird! Who came up with the syntax?)
The grid is a grid of thread blocks. Threads have numbers within their block.
Built-in variables for the kernel: threadIdx and blockIdx, blockDim and gridDim.
(Note that no prefix is used, unlike in GLSL.)

Compiling CUDA
nvcc is NVIDIA's tool, /usr/local/cuda/bin/nvcc
Source files are suffixed .cu
Command line for the simple example:
nvcc simple.cu -o simple
(Command-line options exist for libraries etc.)

Compiling CUDA for larger applications
nvcc and gcc in co-operation: nvcc for .cu files, gcc for .c/.cpp etc.
Mixing languages is possible. The final linking must include the C++ runtime libs.
Example: one C file, one CU file.

Example of multi-unit compilation
Source files: cudademokernel.cu and cudademo.c

nvcc cudademokernel.cu -o cudademokernel.o -c
gcc -c cudademo.c -o cudademo.o -I/usr/local/cuda/include
g++ cudademo.o cudademokernel.o -o cudademo -L/usr/local/cuda/lib -lcuda -lcudart -lm

Link with g++ to include the C++ runtime.

CUDA compilation behind the scenes:
C/CUDA program code (.cu) -> nvcc -> CPU binary + PTX code -> PTX to target -> target binary code
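For the kernel in the .cu file to be callable from the C file, it is typically wrapped in a launcher function with C linkage. A minimal sketch, assuming a hypothetical kernel addOne() and wrapper runKernel() (these names are not part of the original example):

/* cudademokernel.cu (sketch) */
__global__ void addOne(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

extern "C" void runKernel(float *d_data, int n)   /* C linkage so the C file can call it */
{
    addOne<<<(n + 255) / 256, 256>>>(d_data, n);
}

/* cudademo.c (sketch): declare and call the wrapper */
extern void runKernel(float *d_data, int n);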

Executing a CUDA program
You must set an environment variable so the CUDA runtime is found:
export DYLD_LIBRARY_PATH=/usr/local/cuda/lib:$DYLD_LIBRARY_PATH
Then run as usual:
./simple
A problem when executing without a shell! Launch with execve().

Computing with CUDA
Organization and access: blocks, threads...

Warps
A warp is the minimum number of data items/threads that will actually be processed in parallel by a CUDA-capable device. This number is set to 32.
We usually don't care about warps but rather discuss threads and blocks.

Processing organization
1 warp = 32 threads
1 kernel - 1 grid
1 grid - many blocks
1 block - 1 multiprocessor
1 block - many threads
Use many threads and many blocks! More than 200 blocks recommended. Thread count should be a multiple of 32.
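As a small illustration of these rules of thumb (not from the slides; myKernel and d_data are hypothetical names), the block size is chosen as a multiple of 32 and the number of blocks is rounded up to cover all N elements:

__global__ void myKernel(float *data, int N);   // hypothetical kernel

void launch(float *d_data, int N)
{
    const int threadsPerBlock = 256;                                     // a multiple of 32 (the warp size)
    const int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;   // round up to cover N
    myKernel<<<numBlocks, threadsPerBlock>>>(d_data, N);
}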

Distributing computing over threads and blocks
Hierarchical model: the grid consists of gridDim.x * gridDim.y blocks, and each block consists of blockDim.x * blockDim.y threads.
(Figure: a grid of blocks, each block containing a grid of threads.)

Indexing data with thread/block IDs
Calculate the index from blockIdx, blockDim and threadIdx.
Another simple example, calculate the square of every element, device part:

// Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = a[idx] * a[idx];
}

Host part of the square example
Set the block size and grid size:

// main routine that executes on the host
int main(int argc, char *argv[])
{
    float *a_h, *a_d;              // Pointers to host and device arrays
    const int N = 10;              // Number of elements in arrays
    size_t size = N * sizeof(float);
    a_h = (float *)malloc(size);           // Allocate array on host
    cudaMalloc((void **) &a_d, size);      // Allocate array on device

    // Initialize host array and copy it to CUDA device
    for (int i = 0; i < N; i++) a_h[i] = (float)i;
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);

    // Do calculation on device:
    int block_size = 4;
    int n_blocks = N/block_size + (N % block_size == 0 ? 0 : 1);
    square_array <<< n_blocks, block_size >>> (a_d, N);

    // Retrieve result from device and store it in host array
    cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);

    // Print results and cleanup
    for (int i = 0; i < N; i++) printf("%d %f\n", i, a_h[i]);
    free(a_h);
    cudaFree(a_d);
}

Memory access
Vital for performance!
Memory types, coalescing, and an example of using shared memory.

Memory types
Global
Shared
Constant (read only)
Texture cache (read only)
Local
Registers
Care about these when optimizing - not to begin with.

Global memory: 400-600 cycles latency!
Shared memory: fast temporary storage.

Coalesce memory access!
Continuous
Aligned on a power-of-2 boundary
Addressing follows thread numbering
Use shared memory for reorganizing data for coalescing!
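As an illustration of "addressing follows thread numbering" (not from the slides; the kernel names are made up), compare a coalesced copy with a strided one:

// Coalesced: consecutive threads access consecutive addresses.
__global__ void copy_coalesced(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// Strided: consecutive threads access addresses far apart - typically much slower.
__global__ void copy_strided(float *out, const float *in, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];
}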

Using shared memory to reduce the number of global memory accesses
Read blocks of data to shared memory
Process
Write back as needed
Shared memory as a manual cache.
Example: Matrix multiplication

Matrix multiplication
To multiply two N*N matrices, every item will have to be accessed N times!
Naive implementation: 2N³ global memory accesses!
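A minimal sketch of such a naive kernel (not from the slides; the name is illustrative): every thread computes one output element and reads a full row of A and a full column of B from global memory, 2N reads per element and 2N³ in total.

// Naive matrix multiplication: one thread per output element,
// all operands fetched directly from global memory.
__global__ void matmul_naive(float *C, const float *A, const float *B, int N)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < N && col < N)
    {
        float sum = 0.0f;
        for (int k = 0; k < N; k++)
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}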

Matrix multiplication
Let each block handle a part of the output. Load the parts of the matrices needed for the block into shared memory.
Modified computing model:
Upload data to global GPU memory
For a number of parts, do:
  Upload partial data to shared memory
  Process partial data
  Write partial data to global memory
Download result to host
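A sketch of a tiled kernel following this model (not from the slides; TILE, the kernel name and the assumption that N is a multiple of TILE are illustrative choices):

#define TILE 16

__global__ void matmul_tiled(float *C, const float *A, const float *B, int N)
{
    __shared__ float As[TILE][TILE];   // partial data staged in shared memory
    __shared__ float Bs[TILE][TILE];

    int col = blockIdx.x * TILE + threadIdx.x;
    int row = blockIdx.y * TILE + threadIdx.y;
    float sum = 0.0f;

    for (int t = 0; t < N / TILE; t++)            // loop over the "parts"
    {
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                          // wait until the whole tile is loaded

        for (int k = 0; k < TILE; k++)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                          // wait before overwriting the tiles
    }
    C[row * N + col] = sum;                       // one global write per output element
}

Each input element is now read from global memory only N/TILE times instead of N times.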

Synchronization
As soon as one part of a computation depends on a result from another thread, you must synchronize!
__syncthreads()
Typical implementation:
Read to shared memory
__syncthreads()
Process shared memory
__syncthreads()
Write result to global memory

Accelerating by coalescing
Pure memory transfers can be 10x faster by taking advantage of memory coalescing!
Example: Matrix transpose
No computations! Only memory accesses.

Matrix transpose
Naive implementation:

__global__ void transpose_naive(float *odata, float *idata, int width, int height)
{
    unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;
    if (xIndex < width && yIndex < height)
    {
        unsigned int index_in  = xIndex + width * yIndex;
        unsigned int index_out = yIndex + height * xIndex;
        odata[index_out] = idata[index_in];
    }
}

How can this be bad?

Matrix transpose: coalescing problems
The data is accessed row-by-row on one side and column-by-column on the other. The column accesses are non-coalesced!

Matrix transpose: coalescing solution
Read from global memory to shared memory: in order from global, any order to shared.
Write to global memory: in-order write to global, any order from shared.
Shared memory is used for temporary storage.

Better CUDA matrix transpose kernel:

__global__ void transpose(float *odata, float *idata, int width, int height)
{
    // BLOCK_DIM is the tile size (e.g. 16)
    __shared__ float block[BLOCK_DIM][BLOCK_DIM+1];

    // read the matrix tile into shared memory (read data to a temporary buffer)
    unsigned int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;
    unsigned int yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;
    if ((xIndex < width) && (yIndex < height))
    {
        unsigned int index_in = yIndex * width + xIndex;
        block[threadIdx.y][threadIdx.x] = idata[index_in];
    }

    __syncthreads();

    // write the transposed matrix tile to global memory
    xIndex = blockIdx.y * BLOCK_DIM + threadIdx.x;
    yIndex = blockIdx.x * BLOCK_DIM + threadIdx.y;
    if ((xIndex < height) && (yIndex < width))
    {
        unsigned int index_out = yIndex * height + xIndex;
        odata[index_out] = block[threadIdx.x][threadIdx.y];
    }
}

Coalescing rules of thumb
The data block should start on a multiple of 64.
It should be accessed in order (by thread number).
It is allowed to have threads skip their item.
Data should be in blocks of 4, 8 or 16 bytes.

Texture memory
Cached! Can be fast if the data access patterns are good.
Texture filtering, linear interpolation.
Especially good for handling 4 floats at a time (float4).
cudaBindTextureToArray() binds data to a texture unit.

Porting to CUDA
1. Parallel-friendly CPU algorithm.
2. Trivial (serial) CUDA implementation.
3. Split into blocks and threads.
4. Take advantage of shared memory.

CUDA emulation mode
CUDA programs can be compiled to CPU-only versions with --device-emulation.
Lets you run CUDA (slowly) on non-NVIDIA hardware.
Debugging is easier (e.g. printf).

That's all folks!
Next: lab sessions, hands-on experience of all three techniques!