Lecture 10: Introduction to CUDA

Lecture 10: Introduction to CUDA. Ingemar Ragnemalm, Information Coding, ISY. 1(50)

Labs: Some revisions may happen while making final adjustments for Linux Mint, so last-minute changes may occur. The lab questions are vital: answers must be written down before we can examine you. Thus, no lab reports are needed. 2(50)

Lecture material: Lecture material is available on the web. The latest version comes up before the lecture, but usually not more than an hour or two before. Last year's material is also available. We have a local course page: http://computer-graphics.se/multicore/ Lab material, lecture notes and any additional material are or will be on that page. 3(50)

Previous lecture: GPU development - why it became a general-purpose parallel architecture. GPU architecture. A quick look at GPU coding (Hello World). 4(50)

G80 processor hierarchy: 8 top-level groups of TPCs; an SM is a group of 8 SIMD cores. 5(50)

This lecture: Programming model and language. Introduction to memory spaces and memory access. Shared memory. Matrix multiplication example. 6(50)

Lecture questions: 1. What concept in CUDA corresponds to an SM (streaming multiprocessor) in the architecture? 2. How does matrix multiplication benefit from using shared memory? 3. When do you typically need to synchronize threads? 7(50)

CUDA = Compute Unified Device Architecture. Developed by NVIDIA. Only available on NVIDIA boards with a G80 or better GPU architecture. Designed to hide the graphics heritage and add control and flexibility. 8(50)

Computing model: 1. Upload data to the GPU. 2. Execute the kernel. 3. Download the result. Similar to shader-based solutions and OpenCL. 9(50)

Integrated source: Host and kernel code live in the same source file - a major difference from shaders and OpenCL. Kernel code is identified by special modifiers. 10(50)

CUDA architecture and C extension: Spawn a large number of threads, to be run virtually in parallel, just like in graphics. Fragments/computations are not quite executed in parallel - a bunch at a time, a warp. Looks much more like an ordinary C program. No more data stored as pixels - just arrays. 11(50)

Simple CUDA example. A working, compilable example:

#include <stdio.h>
#include <stdlib.h>   // for EXIT_SUCCESS

const int N = 16;
const int blocksize = 16;

__global__ void simple(float *c)
{
    c[threadIdx.x] = threadIdx.x;
}

int main()
{
    int i;
    float *c = new float[N];
    float *cd;
    const int size = N * sizeof(float);

    cudaMalloc((void **)&cd, size);
    dim3 dimBlock(blocksize, 1);
    dim3 dimGrid(1, 1);
    simple<<<dimGrid, dimBlock>>>(cd);
    cudaMemcpy(c, cd, size, cudaMemcpyDeviceToHost);
    cudaFree(cd);

    for (i = 0; i < N; i++)
        printf("%f ", c[i]);
    printf("\n");
    delete[] c;
    printf("done\n");
    return EXIT_SUCCESS;
}

12(50)

Simple CUDA example, annotated. The same code as on the previous slide, with the key parts called out: __global__ marks the kernel; threadIdx.x is the thread identifier; cudaMalloc allocates GPU memory; simple<<<dimGrid, dimBlock>>>(cd) calls the kernel with 1 block of 16 threads; cudaMemcpy reads back the data. 13(50)

Modifiers for code. Three modifiers are provided to specify how code should be used: __global__ executes on the GPU, invoked from the CPU - this is the entry point of the kernel; __device__ is local to the GPU; __host__ is CPU code (superfluous, as it is the default). CPU: __host__ myhostfunc(). GPU: __device__ mydevicefunc(), __global__ myglobalfunc(). 14(50)
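
A minimal sketch of the three qualifiers in use (the function names are made up for illustration):

// CPU-only helper; __host__ is the default and can be omitted
__host__ float scaleHost(float x) { return 2.0f * x; }

// GPU-only helper, callable from kernels and other __device__ functions
__device__ float scaleDevice(float x) { return 2.0f * x; }

// Kernel: runs on the GPU, launched from the CPU with <<<...>>>
__global__ void scaleKernel(float *data)
{
    data[threadIdx.x] = scaleDevice(data[threadIdx.x]);
}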

Memory management: cudaMalloc((void **)&ptr, datasize), cudaFree(ptr). Similar to CPU memory management, but done by the CPU to allocate on the GPU. cudaMemcpy(dest, src, datasize, arg), where arg = cudaMemcpyDeviceToHost or cudaMemcpyHostToDevice. 15(50)
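
As a small illustration (not from the slides), the usual allocate/copy/free pattern with the error codes checked, which the lecture examples omit for brevity; devPtr, hostPtr and size are placeholder names:

float *devPtr;
cudaError_t err = cudaMalloc((void **)&devPtr, size);
if (err != cudaSuccess)
    printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));

cudaMemcpy(devPtr, hostPtr, size, cudaMemcpyHostToDevice);  // host -> device
// ... launch kernel here ...
cudaMemcpy(hostPtr, devPtr, size, cudaMemcpyDeviceToHost);  // device -> host
cudaFree(devPtr);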

Kernel execution: simple<<<gridDim, blockDim>>>(...). The grid consists of blocks, a block consists of threads. Built-in variables in the kernel: threadIdx and blockIdx, blockDim and gridDim. (Note that no prefix is used, unlike GLSL.) 16(50)
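
A hypothetical launch configuration, to show how the built-in variables relate to the <<<...>>> arguments (myKernel and devData are made-up names):

dim3 threadsPerBlock(16, 16);     // seen as blockDim inside the kernel
dim3 numBlocks(64, 64);           // seen as gridDim inside the kernel
myKernel<<<numBlocks, threadsPerBlock>>>(devData);
// In the kernel, blockIdx selects the block within the grid
// and threadIdx the thread within the block.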

Compiling CUDA: nvcc. nvcc is NVIDIA's tool, /usr/local/cuda/bin/nvcc. Source files are suffixed .cu. Command line for the simple example: nvcc simple.cu -o simple (command-line options exist for libraries etc.). 17(50)

Compiling CUDA for larger applications: nvcc and gcc in cooperation - nvcc for .cu files, gcc for .c/.cpp etc. Mixing languages is possible. The final linking must include the C++ runtime libs. Example: one C file, one CU file. 18(50)

Example of multi-unit compilation. Source files: cudademokernel.cu and cudademo.c.

nvcc cudademokernel.cu -o cudademokernel.o -c
gcc -c cudademo.c -o cudademo.o -I/usr/local/cuda/include
g++ cudademo.o cudademokernel.o -o cudademo -L/usr/local/cuda/lib -lcuda -lcudart -lm

Link with g++ to include the C++ runtime. 19(50)

CUDA compilation behind the scenes (figure): nvcc splits the C/CUDA program (code.cu) into CPU code, compiled to a CPU binary, and GPU code, compiled to PTX; a PTX-to-target step then produces the target binary code. 20(50)

Executing a CUDA program. You must set an environment variable so the CUDA runtime is found: export DYLD_LIBRARY_PATH=/usr/local/cuda/lib:$DYLD_LIBRARY_PATH Then run as usual: ./simple. This is a problem when executing without a shell; launch with execve(). 21(50)

Computing with CUDA: organization and access. Blocks, threads... 22(50)

Warps. A warp is the minimum number of data items/threads that will actually be processed in parallel by a CUDA-capable device. This number can vary between GPUs. We usually don't care about warps but rather discuss threads and blocks. 23(50)
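
If you do need the warp size, it can be queried at runtime rather than hard-coded; a minimal sketch using the standard device-properties call:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);               // properties of device 0
printf("warp size: %d threads\n", prop.warpSize);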

Processing organization: 1 warp = 32 threads; 1 kernel - 1 grid; 1 grid - many blocks; 1 block - 1 SM; 1 block - many threads. Use many threads and many blocks: > 200 blocks recommended, thread count a multiple of 32. 24(50)

Distributing computing over threads and blocks: a hierarchical model. (Figure: a grid of blocks (0,0) to (n,n), each block holding a 4x4 arrangement of threads.) The grid contains gridDim.x * gridDim.y blocks, and each block contains blockDim.x * blockDim.y threads. 25(50)

Indexing data with thread/block IDs. Calculate the index from blockIdx, blockDim and threadIdx. Another simple example, calculating the square of every element; device part:

// Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = a[idx] * a[idx];
}

26(50)

Host part of the square example. Set the block size and grid size:

// main routine that executes on the host
int main(int argc, char *argv[])
{
    float *a_h, *a_d;               // Pointers to host and device arrays
    const int N = 10;               // Number of elements in arrays
    size_t size = N * sizeof(float);
    a_h = (float *)malloc(size);
    cudaMalloc((void **)&a_d, size);   // Allocate array on device

    // Initialize host array and copy it to CUDA device
    for (int i = 0; i < N; i++) a_h[i] = (float)i;
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);

    // Do calculation on device
    int block_size = 4;
    int n_blocks = N/block_size + (N % block_size == 0 ? 0 : 1);
    square_array<<<n_blocks, block_size>>>(a_d, N);

    // Retrieve result from device and store it in host array
    cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);

    // Print results and clean up
    for (int i = 0; i < N; i++) printf("%d %f\n", i, a_h[i]);
    free(a_h);
    cudaFree(a_d);
}

27(50)
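
The grid-size expression above is a ceiling division; an equivalent and common one-liner (my phrasing, not the slide's) is:

int n_blocks = (N + block_size - 1) / block_size;   // ceil(N / block_size) for positive integers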

Julia example. A bigger problem; the addressing calculation must be 2D. Simple OpenGL output (similar to the labs). 28(50)

Julia example. For this case the index is computed separately for x and y; the offset is the actual index, which implies the memory position. Every thread computes one single pixel.

__global__ void kernel(unsigned char *ptr, float r, float im)
{
    // map from blockIdx to pixel position
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int offset = x + y * DIM;

    // now calculate the value at that position
    int juliaValue = julia(x, y, r, im);
    // --- calculate colors ---
    ptr[offset*4 + 0] = red;
    ptr[offset*4 + 1] = green;
    ptr[offset*4 + 2] = blue;
    ptr[offset*4 + 3] = 255;
}

29(50)

Conclusion about indexing: every thread does its own calculation to index memory, based on blockIdx, blockDim and threadIdx, in 1, 2 or 3 dimensions - usually 2 dimensions. 30(50)
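
A minimal sketch of the usual 2D case, with a bounds check for when the data size is not a multiple of the block size (the kernel name and parameters are made up):

__global__ void process2D(float *img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)        // guard against partial blocks at the edges
        img[y * width + x] *= 2.0f;     // row-major addressing, as in the Julia example
}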

Memory access - vital for performance: memory types, coalescing, and an example of using shared memory. 31(50)

Memory types: global, shared, constant (read only), texture cache (read only), local, registers. Care about these when optimizing - not to begin with. 32(50)

Global memory has 400-600 cycles of latency; shared memory is fast temporary storage. Coalesce memory accesses: contiguous, aligned on a power-of-2 boundary, with addressing that follows the thread numbering. Use shared memory to reorganize data for coalescing. 33(50)
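
As an illustration of the access-pattern point (a simplified sketch, not from the lecture): the kernel body decides whether a warp's loads coalesce.

__global__ void copyCoalesced(float *dst, const float *src)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    dst[i] = src[i];     // consecutive threads touch consecutive addresses: coalesced
}

__global__ void copyStrided(float *dst, const float *src, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    dst[i] = src[i];     // stride > 1 scatters the warp's addresses: poor coalescing
}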

Using shared memory to reduce the number of global memory accesses: read blocks of data into shared memory, process, write back as needed. Shared memory acts as a manual cache. Example: matrix multiplication. 34(50)

Matrix multiplication. To multiply two N*N matrices, every item has to be accessed N times. Naive implementation: 2N³ global memory accesses. 35(50)
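
To make that figure concrete: each of the N² output elements needs N elements from a row of A and N from a column of B, i.e. 2N reads, giving 2N · N² = 2N³ global reads in total; for N = 1024 that is roughly 2·10⁹ accesses.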

Matrix multiplication: let each block handle a part of the output. Load the parts of the matrices needed by that block into shared memory. 36(50)

Matrix multiplication on the CPU: a simple triple for loop.

void MatrixMultCPU(float *a, float *b, float *c, int theSize)
{
    int i, j, k;
    float sum;

    // For every destination element
    for (i = 0; i < theSize; i++)
        for (j = 0; j < theSize; j++)
        {
            sum = 0;
            // Sum along a row in a and a column in b
            for (k = 0; k < theSize; k++)
                sum = sum + a[i*theSize + k] * b[k*theSize + j];
            c[i*theSize + j] = sum;
        }
}

37(50)

Naive GPU version: replace the outer loops by thread indices.

__global__ void MatrixMultNaive(float *a, float *b, float *c, int theSize)
{
    int k;
    float sum = 0;

    // One destination element per thread
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    // Sum along a row in a and a column in b
    for (k = 0; k < theSize; k++)
        sum = sum + a[i*theSize + k] * b[k*theSize + j];
    c[i*theSize + j] = sum;
}

38(50)

The naive GPU version is inefficient: every thread makes 2N global memory accesses. This can be significantly reduced using shared memory. 39(50)

Optimized GPU version. The data is split into blocks; every element takes part in all the blocks in the same row for A, and in the same column for B. For every such block, every thread reads one element into shared memory, and then loops over the appropriate row and column for the block. 40(50)

Example: 16 blocks (figure). 41(50)

(Figure: matrices A, B and C, with the destination element and destination patch for a thread marked in C.) Every patch corresponds to one block, computing the output for that patch. All patches in the same row of A, and all patches in the same column of B, are needed to produce the destination patch. For every patch, the thread reads the one element matching its destination element, and we then loop over the relevant part of one row and one column to perform that part of the computation. What one thread reads is used by everybody in the same row (A) or column (B). 43(50)

Optimized GPU version: loop over patches (1D); allocate shared memory; copy one element to shared memory; loop over the row/column in the patch, computing and accumulating the result for one element; write the result to global memory.

__global__ void MatrixMultOptimized(float *A, float *B, float *C, int theSize)
{
    int i, j, k, b, ii, jj;

    // Global index for the thread
    i = blockIdx.x * blockDim.x + threadIdx.x;
    j = blockIdx.y * blockDim.y + threadIdx.y;
    float sum = 0.0;

    // For all source patches
    for (b = 0; b < gridDim.x; b++)
    {
        __shared__ float As[BLOCKSIZE*BLOCKSIZE];
        __shared__ float Bs[BLOCKSIZE*BLOCKSIZE];

        // Index locked to the patch
        ii = b * blockDim.x + threadIdx.x;
        jj = b * blockDim.y + threadIdx.y;
        As[threadIdx.y*blockDim.x + threadIdx.x] = A[ii*theSize + j];
        Bs[threadIdx.y*blockDim.x + threadIdx.x] = B[i*theSize + jj];

        __syncthreads(); // Synchronize to make sure all data is loaded

        // Loop, performing the computation within the patch
        for (k = 0; k < blockDim.x; ++k)
            sum += As[threadIdx.y*blockDim.x + k] * Bs[k*blockDim.x + threadIdx.x];

        __syncthreads(); // Sync so nobody starts the next pass prematurely
    }
    C[i*theSize + j] = sum;
}

44(50)

25-30 times faster? So what did I do? A decent number of threads and blocks; shared memory for temporary storage; every thread reads ONE item but uses many; synchronization. (Compared with a single-threaded CPU :) 45(50)

Modified computing model: Upload data to global GPU memory. For a number of parts, do: upload partial data to shared memory, process the partial data, write partial data to global memory. Download the result to the host. 46(50)

Synchronization. As soon as one part of a computation depends on a result from another thread, you must synchronize: __syncthreads(). Typical implementation: read to shared memory, __syncthreads(), process shared memory, __syncthreads(), write the result to global memory. 47(50)
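
A minimal sketch of that pattern (illustrative only; BLOCKSIZE and the kernel name are made up, and the array length is assumed to be a multiple of the block size): each block reverses its own chunk of an array, staged in shared memory.

__global__ void reverseBlock(float *d)
{
    __shared__ float buf[BLOCKSIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    buf[threadIdx.x] = d[i];                     // read to shared memory
    __syncthreads();                             // wait until the whole block has loaded

    d[i] = buf[blockDim.x - 1 - threadIdx.x];    // process and write result to global memory
}

The second __syncthreads() of the pattern matters when the shared buffer is reused in a loop, as in the matrix multiplication kernel above.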

Synchronization is really wonderfully simple - everybody is doing the same thing anyway. It simply means "wait until everybody is done with this part". Deadlocks can still occur. 48(50)

Summary: Use enough threads and blocks to keep the hardware occupied. Access data depending on the thread/block number. Memory accesses are expensive; shared memory is fast. Make threads within a block cooperate. Synchronize. 49(50)

That's all folks! Next: more about memory management and optimization. 50(50)