CUDA Parallelism Model

Similar documents
Example 1: Color-to-Grayscale Image Processing

Data Parallel Execution Model

Lecture 2: Introduction to CUDA C

Module 2: Introduction to CUDA C

Lecture 3: Introduction to CUDA

Module 2: Introduction to CUDA C. Objective

Module 3: CUDA Execution Model -I. Objective

Parallel Computing. Lecture 19: CUDA - I

CUDA C Programming Mark Harris NVIDIA Corporation

Module Memory and Data Locality

CUDA Programming. Week 1. Basic Programming Concepts Materials are copied from the reference list

CUDA Parallel Programming Model. Scalable Parallel Programming with CUDA

CUDA Parallel Programming Model Michael Garland

EEM528 GPU COMPUTING

Data parallel computing

Josef Pelikán, Jan Horáček CGG MFF UK Praha

Lecture 1: Introduction and Computational Thinking

CUDA programming model. N. Cardoso & P. Bicudo. Física Computacional (FC5)

CUDA Programming (Basics, Cuda Threads, Atomics) Ezio Bartocci

Lecture 2: CUDA Programming

Lab 1 Part 1: Introduction to CUDA

GPU programming basics. Prof. Marco Bertini

Register file. A single large register file (ex. 16K registers) is partitioned among the threads of the dispatched blocks.

GPU & High Performance Computing (by NVIDIA) CUDA. Compute Unified Device Architecture Florian Schornbaum

Tesla Architecture, CUDA and Optimization Strategies

High Performance Computing and GPU Programming

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model

GPU Programming. Alan Gray, James Perry EPCC The University of Edinburgh

CUDA Programming Model

Lessons learned from a simple application

Lecture 5. Performance Programming with CUDA

Introduction to GPU Computing. Design and Analysis of Parallel Algorithms

HPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Introduction to CUDA programming

COSC 6374 Parallel Computations Introduction to CUDA

GPU programming. Dr. Bernhard Kainz

Basic Elements of CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Image convolution with CUDA

Lecture 9. Outline. CUDA : a General-Purpose Parallel Computing Architecture. CUDA Device and Threads CUDA. CUDA Architecture CUDA (I)

Introduction to CUDA (2 of 2)

Real-time Graphics 9. GPGPU

Introduction to GPGPUs and to CUDA programming model

GPU Programming. Lecture 2: CUDA C Basics. Miaoqing Huang University of Arkansas 1 / 34

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G

Computation to Core Mapping Lessons learned from a simple application

Learn CUDA in an Afternoon. Alan Gray EPCC The University of Edinburgh

Programming with CUDA and OpenCL. Dana Schaa and Byunghyun Jang Northeastern University

Introduction to GPU programming. Introduction to GPU programming p. 1/17

Dense Linear Algebra. HPC - Algorithms and Applications

Basic Elements of CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Information Coding / Computer Graphics, ISY, LiTH. Introduction to CUDA. Ingemar Ragnemalm Information Coding, ISY

GPU Programming. Performance Considerations. Miaoqing Huang University of Arkansas Fall / 60

Scientific discovery, analysis and prediction made possible through high performance computing.

Lecture 10!! Introduction to CUDA!

Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include

CSE 599 I Accelerated Computing Programming GPUS. Intro to CUDA C

Hardware/Software Co-Design

Introduction to CUDA CME343 / ME May James Balfour [ NVIDIA Research

CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012

CME 213 S PRING Eric Darve

Sparse Linear Algebra in CUDA

Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA

Information Coding / Computer Graphics, ISY, LiTH. Introduction to CUDA. Ingemar Ragnemalm Information Coding, ISY

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1

GPU programming: CUDA. Sylvain Collange Inria Rennes Bretagne Atlantique

ECE 408 / CS 483 Final Exam, Fall 2014

GPGPU. Alan Gray/James Perry EPCC The University of Edinburgh.

Lecture 8: GPU Programming. CSE599G1: Spring 2017

What is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms

An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture

Programmable Graphics Hardware (GPU) A Primer

Practical Introduction to CUDA and GPU

Information Coding / Computer Graphics, ISY, LiTH. CUDA memory! ! Coalescing!! Constant memory!! Texture memory!! Pinned memory 26(86)

CUDA PROGRAMMING MODEL. Carlo Nardone Sr. Solution Architect, NVIDIA EMEA

PPAR: CUDA basics. Sylvain Collange Inria Rennes Bretagne Atlantique PPAR

Real-time Graphics 9. GPGPU

Parallel Accelerators

GPU Computing with CUDA. Part 2: CUDA Introduction

GPU programming: CUDA basics. Sylvain Collange Inria Rennes Bretagne Atlantique

COSC 6339 Accelerators in Big Data

GPU Computing with CUDA

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

CS 314 Principles of Programming Languages

Introduction to CUDA Programming

Programming in CUDA. Malik M Khan

CS/CoE 1541 Final exam (Fall 2017). This is the cumulative final exam given in the Fall of Question 1 (12 points): was on Chapter 4

GPU programming CUDA C. GPU programming,ii. COMP528 Multi-Core Programming. Different ways:

GPGPU. Lecture 2: CUDA

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten

Programming with CUDA, WS09

Unrolling parallel loops

HPC COMPUTING WITH CUDA AND TESLA HARDWARE. Timothy Lanfear, NVIDIA

GPU Programming Using CUDA

GPU Computing Master Clss. Development Tools

Optimizing CUDA for GPU Architecture. CSInParallel Project

CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction. Francesco Rossi University of Bologna and INFN

GPU Computing: A Quick Start

CUDA GPGPU Workshop CUDA/GPGPU Arch&Prog

Nvidia G80 Architecture and CUDA Programming

Matrix Multiplication in CUDA. A case study

CS : Many-core Computing with CUDA

Transcription:

GPU Teaching Kit Accelerated Computing CUDA Parallelism Model Kernel-Based SPMD Parallel Programming Multidimensional Kernel Configuration Color-to-Grayscale Image Processing Example Image Blur Example Thread Scheduling

Objective To learn the basic concepts involved in a simple CUDA kernel function Declaration Built-in variables Thread index to data index mapping 2 2

Example: Vector Addition Kernel Device Code // Compute vector sum C = A + B // Each thread performs one pair-wise addition global void vecaddkernel(float* A, float* B, float* C, int n) { int i = threadidx.x+blockdim.x*blockidx.x; if(i<n) C[i] = A[i] + B[i]; 3

Example: Vector Addition Kernel Launch (Host Code) Host Code void vecadd(float* h_a, float* h_b, float* h_c, int n) { // d_a, d_b, d_c allocations and copies omitted // Run ceil(n/256.0) blocks of 256 threads each vecaddkernel<<<ceil(n/256.0),256>>>(d_a, d_b, d_c, n); The ceiling function makes sure that there are enough threads to cover all elements. 4 4

More on Kernel Launch (Host Code) Host Code void vecadd(float* h_a, float* h_b, float* h_c, int n) { dim3 DimGrid((n-1)/256 + 1, 1, 1); dim3 DimBlock(256, 1, 1); vecaddkernel<<<dimgrid,dimblock>>>(d_a, d_b, d_c, n); This is an equivalent way to express the ceiling function. 5 5

Kernel execution in a nutshell host void vecadd( ) { dim3 DimGrid(ceil(n/256.0),1,1); dim3 DimBlock(256,1,1); vecaddkernel<<<dimgrid,dimblock>>>(d_a,d_b,d_c,n); global void vecaddkernel(float *A, float *B, float *C, int n) { int i = blockidx.x * blockdim.x + threadidx.x; if( i<n ) C[i] = A[i]+B[i]; Grid Blk 0 Blk N-1 M0 GPU RAM Mk 6 6

More on CUDA Function Declarations device float DeviceFunc() global void KernelFunc() host float HostFunc() Executed on the: device device host Only callable from the: device host host global defines a kernel function Each consists of two underscore characters A kernel function must return void device and host can be used together host is optional if used alone 7 7

Compiling A CUDA Program Integrated C programs with CUDA extensions NVCC Compiler Host Code Host C Compiler/ Linker Device Code (PTX) Device Just-in-Time Compiler Heterogeneous Computing Platform with CPUs, GPUs, etc. 8

Objective To understand multidimensional Grids Multi-dimensional block and thread indices Mapping block/thread indices to data indices 2 9

A Multi-Dimensional Grid Example host device Grid 1 Block (0, 0) Block (0, 1) Kernel 1 Block (1, 0) Block (1, 1) Grid 2 Block (1,0) (1,0,0) (1,0,1) (1,0,2) (1,0,3) Thread (0,0,0) Thread (0,1,0) Thread (0,0,1) Thread (0,1,1) Thread (0,0,2) Thread (0,1,2) Thread (0,0,3) Thread Thread (0,0,0) (0,1,3) 10 10

Processing a Picture with a 2D Grid 16 16 blocks 62 76 picture 11

Row-Major Layout in C/C++ M M Row*Width+Col = 2*4+1 = 9 M 0 M 1 M 2 M 3 M 4 M 5 M 6 M 7 M 8 M 9 M 10 M 11 M 12 M 13 M 14 M 15 M 0,0 M 0,1 M 0,2 M 0,3 M 1,1 M 1,0 M 1,2 M 1,3 M 2,0 M 2,1 M 2,2 M 2,3 M 3,0 M 3,1 M 3,2 M 3,3 M 0,0 M 0,1 M 0,2 M 0,3 M 1,0 M 1,1 M 1,2 M 1,3 M 2,0 M 2,1 M 2,2 M 2,3 M 3,0 M 3,1 M 3,2 M 3,3 12

Source Code of a PictureKernel global void PictureKernel(float* d_pin, float* d_pout, int height, int width) { // Calculate the row # of the d_pin and d_pout element int Row = blockidx.y*blockdim.y + threadidx.y; // Calculate the column # of the d_pin and d_pout element int Col = blockidx.x*blockdim.x + threadidx.x; // each thread computes one element of d_pout if in range if ((Row < height) && (Col < width)) { d_pout[row*width+col] = 2.0*d_Pin[Row*width+Col]; Scale every pixel value by 2.0 13

Host Code for Launching PictureKernel // assume that the picture is m n, // m pixels in y dimension and n pixels in x dimension // input d_pin has been allocated on and copied to device // output d_pout has been allocated on device dim3 DimGrid((n-1)/16 + 1, (m-1)/16+1, 1); dim3 DimBlock(16, 16, 1); PictureKernel<<<DimGrid,DimBlock>>>(d_Pin, d_pout, m, n); 14

Covering a 62 76 Picture with 16 16 Blocks Not all threads in a Block will follow the same control flow path. 15

Objective To gain deeper understanding of multi-dimensional grid kernel configurations through a real-world use case 2 16

RGB Color Image Representation Each pixel in an image is an RGB value The format of an image s row is (r g b) (r g b) (r g b) RGB ranges are not distributed uniformly Many different color spaces, here we show the constants to convert to AdbobeRGB color space The vertical axis (y value) and horizontal axis (x value) show the fraction of the pixel intensity that should be allocated to G and B. The remaining fraction (1-y x) of the pixel intensity that should be assigned to R The triangle contains all the representable colors in this color space 17

RGB to Grayscale Conversion A grayscale digital image is an image in which the value of each pixel carries only intensity information. 18

Color Calculating Formula For each pixel (r g b) at (I, J) do: graypixel[i,j] = 0.21*r + 0.71*g + 0.07*b This is just a dot product <[r,g,b],[0.21,0.71,0.07]> with the constants being specific to input RGB space 0.21 0.71 0.07 19

RGB to Grayscale Conversion Code #define CHANNELS 3 // we have 3 channels corresponding to RGB // The input image is encoded as unsigned characters [0, 255] global void colorconvert(unsigned char * grayimage, unsigned char * rgbimage, int width, int height) { int x = threadidx.x + blockidx.x * blockdim.x; int y = threadidx.y + blockidx.y * blockdim.y; if (x < width && y < height) { // get 1D coordinate for the grayscale image int grayoffset = y*width + x; // one can think of the RGB image having // CHANNEL times columns than the gray scale image int rgboffset = grayoffset*channels; unsigned char r = rgbimage[rgboffset ]; // red value for pixel unsigned char g = rgbimage[rgboffset + 2]; // green value for pixel unsigned char b = rgbimage[rgboffset + 3]; // blue value for pixel // perform the rescaling and store it // We multiply by floating point constants grayimage[grayoffset] = 0.21f*r + 0.71f*g + 0.07f*b; 20

RGB to Grayscale Conversion Code #define CHANNELS 3 // we have 3 channels corresponding to RGB // The input image is encoded as unsigned characters [0, 255] global void colorconvert(unsigned char * grayimage, unsigned char * rgbimage, int width, int height) { int x = threadidx.x + blockidx.x * blockdim.x; int y = threadidx.y + blockidx.y * blockdim.y; if (x < width && y < height) { // get 1D coordinate for the grayscale image int grayoffset = y*width + x; // one can think of the RGB image having // CHANNEL times columns than the gray scale image int rgboffset = grayoffset*channels; unsigned char r = rgbimage[rgboffset ]; // red value for pixel unsigned char g = rgbimage[rgboffset + 1]; // green value for pixel unsigned char b = rgbimage[rgboffset + 2]; // blue value for pixel // perform the rescaling and store it // We multiply by floating point constants grayimage[grayoffset] = 0.21f*r + 0.71f*g + 0.07f*b; 21

RGB to Grayscale Conversion Code #define CHANNELS 3 // we have 3 channels corresponding to RGB // The input image is encoded as unsigned characters [0, 255] global void colorconvert(unsigned char * grayimage, unsigned char * rgbimage, int width, int height) { int x = threadidx.x + blockidx.x * blockdim.x; int y = threadidx.y + blockidx.y * blockdim.y; if (x < width && y < height) { // get 1D coordinate for the grayscale image int grayoffset = y*width + x; // one can think of the RGB image having // CHANNEL times columns than the gray scale image int rgboffset = grayoffset*channels; unsigned char r = rgbimage[rgboffset ]; // red value for pixel unsigned char g = rgbimage[rgboffset + 2]; // green value for pixel unsigned char b = rgbimage[rgboffset + 3]; // blue value for pixel // perform the rescaling and store it // We multiply by floating point constants grayimage[grayoffset] = 0.21f*r + 0.71f*g + 0.07f*b; 22

Objective To learn a 2D kernel with more complex computation and memory access patterns 23

Image Blurring 24

Blurring Box Pixels processed by a thread block 25

Image Blur as a 2D Kernel global void blurkernel(unsigned char * in, unsigned char * out, int w, int h) { int Col = blockidx.x * blockdim.x + threadidx.x; int Row = blockidx.y * blockdim.y + threadidx.y; if (Col < w && Row < h) {... // Rest of our kernel 26

global void blurkernel(unsigned char * in, unsigned char * out, int w, int h) { int Col = blockidx.x * blockdim.x + threadidx.x; int Row = blockidx.y * blockdim.y + threadidx.y; if (Col < w && Row < h) { int pixval = 0; int pixels = 0; // Get the average of the surrounding 2xBLUR_SIZE x 2xBLUR_SIZE box for(int blurrow = -BLUR_SIZE; blurrow < BLUR_SIZE+1; ++blurrow) { for(int blurcol = -BLUR_SIZE; blurcol < BLUR_SIZE+1; ++blurcol) { int currow = Row + blurrow; int curcol = Col + blurcol; // Verify we have a valid image pixel if(currow > -1 && currow < h && curcol > -1 && curcol < w) { pixval += in[currow * w + curcol]; pixels++; // Keep track of number of pixels in the accumulated total // Write our new pixel value out out[row * w + Col] = (unsigned char)(pixval / pixels); 27

Objective To learn how a CUDA kernel utilizes hardware execution resources Assigning thread blocks to execution resources Capacity constrains of execution resources Zero-overhead thread scheduling 28

Transparent Scalability Device Thread grid Device Block 0 Block 1 Block 0 Block 1 Block 2 Block 3 Block 2 Block 3 Block 4 Block 5 Block 6 Block 7 time Block 0 Block 1 Block 2 Block 3 Block 4 Block 5 Block 6 Block 7 Block 4 Block 5 Block 6 Block 7 Each block can execute in any order relative to others. Hardware is free to assign blocks to any processor at any time A kernel scales to any number of parallel processors 29

Example: Executing Thread Blocks Threads are assigned to Streaming Multiprocessors (SM) in block granularity Up to 8 blocks to each SM as resource allows Fermi SM can take up to 1536 threads Could be 256 (threads/block) * 6 blocks Or 512 (threads/block) * 3 blocks, etc. SM maintains thread/block idx #s SM manages/schedules thread execution t0 t1 t2 tm Blocks SM SP Shared Memory 30

The Von-Neumann Model Memory I/O Processing Unit ALU Reg File PC Control Unit IR 31

The Von-Neumann Model with SIMD units Memory I/O Processing Unit ALU Reg File PC Control Unit IR Single Instruction Multiple Data (SIMD) 32

Warps as Scheduling Units Each Block is executed as 32-thread Warps An implementation decision, not part of the CUDA programming model Warps are scheduling units in SM Threads in a warp execute in SIMD Future GPUs may have different number of threads in each warp 33

Warp Example If 3 blocks are assigned to an SM and each block has 256 threads, how many Warps are there in an SM? Each Block is divided into 256/32 = 8 Warps There are 8 * 3 = 24 Warps Block 0 Warps t0 t1 t2 t31 Block 1 Warps t0 t1 t2 t31 Block 2 Warps t0 t1 t2 t31 Register File L1 Shared Memory 34

Example: Thread Scheduling (Cont.) SM implements zero-overhead warp scheduling Warps whose next instruction has its operands ready for consumption are eligible for execution Eligible Warps are selected for execution based on a prioritized scheduling policy All threads in a warp execute the same instruction when selected 35

Block Granularity Considerations For Matrix Multiplication using multiple blocks, should I use 8X8, 16X16 or 32X32 blocks for Fermi? For 8X8, we have 64 threads per Block. Since each SM can take up to 1536 threads, which translates to 24 Blocks. However, each SM can only take up to 8 Blocks, only 512 threads will go into each SM! For 16X16, we have 256 threads per Block. Since each SM can take up to 1536 threads, it can take up to 6 Blocks and achieve full capacity unless other resource considerations overrule. For 32X32, we would have 1024 threads per Block. Only one block can fit into an SM for Fermi. Using only 2/3 of the thread capacity of an SM. 36

GPU Teaching Kit Accelerated Computing The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.