Basic Elements of CUDA
Algoritmi e Calcolo Parallelo
Daniele Loiacono


References
- This set of slides is mainly based on:
  - CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory
  - Slides of Applied Parallel Programming (ECE498@UIUC), http://courses.engr.illinois.edu/ece498/al/
- Useful references:
  - Programming Massively Parallel Processors: A Hands-on Approach, David B. Kirk and Wen-mei W. Hwu
  - http://www.gpgpu.it/ (CUDA Tutorial)
  - CUDA Programming Guide, http://developer.nvidia.com/object/gpucomputing.html

Compiling the Code
- nvcc <filename>.cu [-o <executable>]
  - Builds in release mode
- nvcc -g <filename>.cu
  - Builds in debug mode
  - Can debug host code but not device code (which runs on the GPU)
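As a minimal sketch (not part of the slides), a trivial hello.cu that the commands above could compile; the file name and kernel are hypothetical, and cudaDeviceSynchronize is used here as the modern equivalent of the cudaThreadSynchronize call shown later in this deck:

    #include <cstdio>

    // Prints one line per thread; device-side printf requires a reasonably recent GPU/toolkit
    __global__ void hello() {
        printf("Hello from thread %d\n", threadIdx.x);
    }

    int main() {
        hello<<<1, 4>>>();            // 1 block of 4 threads
        cudaDeviceSynchronize();      // wait for the kernel (and its output) to complete
        return 0;
    }

Compile with "nvcc hello.cu -o hello" for release mode, or add -g for host-side debugging.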

GPU's Memory Management

Managing Memory
- CPU and GPU have separate memory spaces
- Host (CPU) code manages device (GPU) memory:
  - Allocate / free
  - Copy data to and from the device
  - Applies to global device memory (DRAM)

GPU Memory Allocation / Release
- cudaMalloc(void** pointer, size_t nbytes)
- cudaMemset(void* pointer, int value, size_t count)
- cudaFree(void* pointer)

    int n = 1024;
    int nbytes = 1024*sizeof(int);
    int *d_a = 0;
    cudaMalloc((void**)&d_a, nbytes);
    cudaMemset(d_a, 0, nbytes);
    cudaFree(d_a);

Data Copies
- cudaMemcpy(void* dst, const void* src, size_t nbytes, enum cudaMemcpyKind direction);
  - src is the pointer to the data to be copied and dst is the pointer to the destination
  - Blocks the CPU thread (returns only after the copy is complete)
  - Doesn't start copying until previous CUDA calls complete
  - direction specifies the locations (host or device) of src and dst:
    - cudaMemcpyHostToDevice
    - cudaMemcpyDeviceToHost
    - cudaMemcpyDeviceToDevice
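A brief round-trip sketch of the calls above (not from the slides); the size n and the buffer names are arbitrary:

    #include <cstdlib>
    #include <cuda_runtime.h>

    int main() {
        const int n = 256;
        const size_t nbytes = n * sizeof(float);

        float* h_a = (float*)malloc(nbytes);        // host buffer
        for (int i = 0; i < n; ++i) h_a[i] = (float)i;

        float* d_a = 0;
        cudaMalloc((void**)&d_a, nbytes);           // device buffer

        cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice);   // host -> device
        // ... kernels working on d_a would be launched here ...
        cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost);   // device -> host

        cudaFree(d_a);
        free(h_a);
        return 0;
    }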

2D memory management
- cudaMallocPitch(void** devPtr, size_t* pitch, size_t widthInBytes, size_t height)
  - As an example, given

      float *d_a;
      size_t pitch;
      cudaMallocPitch((void**)&d_a, &pitch, 16*sizeof(float), 4);

  - to access an element on the device use d_a[row*pitch/sizeof(float)+col], not d_a[row*16+col]
- cudaMemcpy2D(void* dst, size_t dpitch, const void* src, size_t spitch, size_t width, size_t height, enum cudaMemcpyKind direction)
- The pitch is in bytes!
  - Remember to divide by sizeof(typeOfData)
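A short sketch (not in the slides) of cudaMemcpy2D copying a tightly packed host matrix into the pitched allocation; the matrix size and names are illustrative:

    #include <cuda_runtime.h>

    int main() {
        const int width = 16, height = 4;           // elements per row, number of rows
        float h_a[height][width] = {};              // host matrix, row-major, tightly packed

        float* d_a = 0;
        size_t pitch = 0;                           // row stride in bytes, chosen by the runtime
        cudaMallocPitch((void**)&d_a, &pitch, width * sizeof(float), height);

        // Host rows are width*sizeof(float) bytes apart; device rows are 'pitch' bytes apart
        cudaMemcpy2D(d_a, pitch,
                     h_a, width * sizeof(float),
                     width * sizeof(float), height,
                     cudaMemcpyHostToDevice);

        cudaFree(d_a);
        return 0;
    }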

Kernels

Executing Code on the GPU
- Kernels are C functions with some restrictions:
  - Can only access GPU memory
  - Must have void return type
  - No variable number of arguments ("varargs")
  - Not recursive
  - No static variables
- Function arguments are automatically copied from CPU to GPU memory

Function Qualifiers

    Qualifier                          Executed on   Callable from
    __device__ type DeviceFunc()       Device        Device
    __global__ void KernelFunc()       Device        Host
    __host__   type HostFunc()         Host          Host

- __global__ functions must return void
- __host__ and __device__ qualifiers can be combined (see the sketch below)
  - The compiler will generate both CPU and GPU code
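A minimal sketch (not from the slides) of a combined __host__ __device__ function; clamp01 and clampAll are hypothetical names:

    #include <cuda_runtime.h>

    // Compiled for both CPU and GPU: callable from host code and from kernels
    __host__ __device__ float clamp01(float x) {
        if (x < 0.0f) return 0.0f;
        if (x > 1.0f) return 1.0f;
        return x;
    }

    // __global__ kernel: executes on the device, launched from the host, returns void
    __global__ void clampAll(float* a, int N) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N) a[idx] = clamp01(a[idx]);     // device-side call
    }

    int main() {
        const int N = 1024;
        float* d_a = 0;
        cudaMalloc((void**)&d_a, N * sizeof(float));
        clampAll<<<N / 256, 256>>>(d_a, N);        // host-side launch
        float h = clamp01(1.5f);                   // host-side call to the same function
        (void)h;
        cudaFree(d_a);
        return 0;
    }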

Launching Kernels
- Modified C function call syntax:
  kernel<<<dim3 grid, dim3 block>>>(...)
- Execution configuration ("<<< >>>"):
  - grid dimensions: x and y (2D)
  - thread-block dimensions: x, y, and z (3D)
- When the grid or block has only an x dimension, its size can be given as a plain integer
- Examples:

    dim3 grid(16, 16);
    dim3 block(16, 16);
    my_kernel_fun<<<grid, block>>>(...);
    my_kernel_fun2<<<32, 512>>>(...);

Block IDs and Thread IDs
- Each thread uses IDs to decide what data to work on
  - Block ID: 1D or 2D
  - Thread ID: 1D, 2D, or 3D
- Simplifies memory addressing when processing multidimensional data
  - Image processing
  - Solving PDEs on volumes
  - ...
[Figure: the host launches Kernel 1 on Grid 1, a 2D arrangement of blocks Block (0,0)..(1,1), and Kernel 2 on Grid 2; inside Block (1,1), threads are arranged in 3D, e.g. Thread (0,0,0)..(3,1,0) and Thread (0,0,1)..(3,0,1)]

CUDA Built-in Device Variables
- All __global__ and __device__ functions have access to these automatically defined variables:
  - dim3 gridDim: dimensions of the grid in blocks (at most 2D)
  - dim3 blockDim: dimensions of the block in threads
  - dim3 blockIdx: block index within the grid
  - dim3 threadIdx: thread index within the block

A simple kernel

    __global__ void assign(float* d_a, float value) {
        int idx = blockDim.x * blockIdx.x + threadIdx.x;
        d_a[idx] = value;
    }

    int N = k*512;  // assume N is an integer multiple of 512
    float* h_a = new float[N];
    float* d_a;
    cudaMalloc((void**)&d_a, N*sizeof(float));
    assign<<<N/512, 512>>>(d_a, 3);
    cudaMemcpy(h_a, d_a, N*sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_a);

Example: Increment Array Elements
- Increment the N-element vector a by the scalar b
- Let's assume N=16 and blockDim=4, so there are 4 blocks:

    blockIdx.x=0  blockDim.x=4  threadIdx.x=0,1,2,3  idx=0,1,2,3
    blockIdx.x=1  blockDim.x=4  threadIdx.x=0,1,2,3  idx=4,5,6,7
    blockIdx.x=2  blockDim.x=4  threadIdx.x=0,1,2,3  idx=8,9,10,11
    blockIdx.x=3  blockDim.x=4  threadIdx.x=0,1,2,3  idx=12,13,14,15

- int idx = blockDim.x * blockIdx.x + threadIdx.x;
  - maps from the local index threadIdx to a global index
- NB: blockDim should be >= 32 in real code; this is just an example

Example: Increment Array Elements
- CPU implementation

    void increment_cpu(float *a, float b, int N) {
        for (int idx = 0; idx < N; idx++)
            a[idx] = a[idx] + b;
    }

    void main() {
        ...
        increment_cpu(a, b, N);
    }

Example: Increment Array Elements
- CUDA implementation

    __global__ void increment_gpu(float *a, float b, int N) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)
            a[idx] = a[idx] + b;
    }

    void main() {
        ...
        dim3 dimBlock(blockSize);
        dim3 dimGrid(ceil(N / (float)blockSize));
        increment_gpu<<<dimGrid, dimBlock>>>(a, b, N);
    }

Example of kernel for 2D data

    // Host code
    int width = 64, height = 64;
    float* devPtr;
    size_t pitch;
    cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height);
    MyKernel<<<grid, block>>>(devPtr, pitch, width, height);

    // Device code
    __global__ void MyKernel(float* devPtr, size_t pitch, int width, int height) {
        for (int r = 0; r < height; ++r) {
            for (int c = 0; c < width; ++c) {
                float element = devPtr[r * pitch/sizeof(float) + c];
            }
        }
    }

Synchronization

Host Synchronization
- All kernel launches are asynchronous
  - Control returns to the CPU immediately
  - The kernel executes after all previous CUDA calls have completed
- cudaMemcpy() is synchronous
  - Control returns to the CPU after the copy completes
  - The copy starts after all previous CUDA calls have completed
- cudaThreadSynchronize()
  - Blocks until all previous CUDA calls complete
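A brief sketch of how this plays out (not part of the slides), reusing increment_gpu and the buffer names from the surrounding examples; do_other_cpu_work is a hypothetical placeholder:

    // Asynchronous launch: control returns to the CPU right away
    increment_gpu<<<N/blockSize, blockSize>>>(d_a, b, N);

    do_other_cpu_work();          // host work here overlaps with the running kernel

    cudaThreadSynchronize();      // explicit wait: blocks until all previous CUDA calls complete

    // A blocking copy also waits for the kernel implicitly before copying the results back
    cudaMemcpy(h_a, d_a, N*sizeof(float), cudaMemcpyDeviceToHost);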

Example: Host Code

    // allocate host memory
    int numBytes = N * sizeof(float);
    float* h_a = new float[N];

    // allocate device memory
    float* d_a = 0;
    cudaMalloc((void**)&d_a, numBytes);

    // copy data from host to device
    cudaMemcpy(d_a, h_a, numBytes, cudaMemcpyHostToDevice);

    // execute the kernel
    increment_gpu<<<N/blockSize, blockSize>>>(d_a, b, N);

    // copy data from device back to host
    // --> implicit synchronization
    cudaMemcpy(h_a, d_a, numBytes, cudaMemcpyDeviceToHost);

    // free device memory
    cudaFree(d_a);

GPU Thread Synchronization
- void __syncthreads();
- Synchronizes all threads in a block
  - Generates a barrier synchronization instruction
  - No thread can pass this barrier until all threads in the block reach it
  - Used to synchronize when accessing shared memory
- Allowed in conditional code only if the conditional is uniform across the entire thread block
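A minimal sketch (not from the slides) of why the barrier is needed: each block reverses its portion of the array through shared memory, and every thread must finish writing the tile before any thread reads another thread's element. The kernel name and the fixed tile size are assumptions:

    __global__ void reverse_in_block(float* d_a) {
        __shared__ float tile[256];                   // assumes blockDim.x <= 256
        int t = threadIdx.x;
        int base = blockIdx.x * blockDim.x;

        tile[t] = d_a[base + t];                      // each thread writes one element of the tile
        __syncthreads();                              // barrier: the whole tile must be written...
        d_a[base + t] = tile[blockDim.x - 1 - t];     // ...before threads read other threads' elements
    }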

Additional features

Variable Qualifiers (GPU code)
- __device__
  - Stored in device memory (large, high latency, not cached)
  - Allocated with cudaMalloc (the __device__ qualifier is implied)
  - Accessible by all threads
  - Lifetime: application
- __shared__
  - Stored in on-chip shared memory (very low latency)
  - Allocated by the execution configuration or at compile time
  - Accessible by all threads in the same thread block
  - Lifetime: kernel execution
- Unqualified variables:
  - Scalars and built-in vector types are stored in registers
  - Arrays of more than 4 elements are stored in device memory

Using shared memory
- Size known at compile time:

    __global__ void kernel(...) {
        __shared__ float sdata[256];
        ...
    }

    int main(void) {
        ...
        kernel<<<nblocks, blocksize>>>(...);
    }

- Size known at kernel launch:

    __global__ void kernel(...) {
        extern __shared__ float sdata[];
        ...
    }

    int main(void) {
        ...
        int smBytes = blocksize*sizeof(float);
        kernel<<<nblocks, blocksize, smBytes>>>(...);
    }

Built-in Vector Types
- Can be used in GPU and CPU code
  - [u]short[1..4], [u]int[1..4], [u]long[1..4], float[1..4]
  - Structures accessed with the x, y, z, w fields:

      uint4 param;
      int y = param.y;

- dim3
  - Based on uint3
  - Used to specify dimensions
  - Default value (1,1,1)

CUDA Error Reporting to CPU
- All CUDA calls return an error code:
  - Except for kernel launches
  - cudaError_t type
- cudaError_t cudaGetLastError(void)
  - Returns the code for the last error ("no error" has a code too)
  - Can be used to get the error from a kernel execution
- const char* cudaGetErrorString(cudaError_t code)
  - Returns a null-terminated character string describing the error
- printf("%s\n", cudaGetErrorString(cudaGetLastError()));
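A short error-checking sketch (not in the slides) combining the two calls; my_kernel, grid, block, and the buffers are hypothetical names:

    // Kernel launches do not return an error code, so query the last error right after the launch
    my_kernel<<<grid, block>>>(d_a, N);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("kernel launch failed: %s\n", cudaGetErrorString(err));

    // Other runtime calls return a cudaError_t directly
    err = cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess)
        printf("memcpy failed: %s\n", cudaGetErrorString(err));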

Examples

Sum of vectors
- Given two real vectors A and B of size N
- Compute the vector C = A + B

Solution: Host Code (1)

    int main() {
        int N = ...;
        size_t size = N * sizeof(float);

        // Allocate input vectors h_a and h_b in host memory
        float* h_a = (float*)malloc(size);
        float* h_b = (float*)malloc(size);

        // Initialize input vectors
        ...

        // Allocate vectors in device memory
        float* d_a; cudaMalloc((void**)&d_a, size);
        float* d_b; cudaMalloc((void**)&d_b, size);
        float* d_c; cudaMalloc((void**)&d_c, size);

Solution: Host Code (2)

        // Copy vectors from host memory to device memory
        cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

        // Invoke kernel
        int threadsPerBlock = 256;
        int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
        VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, N);

        // Copy result from device memory to host memory
        // h_c contains the result in host memory
        cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);

        // Free device memory
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);

        // Free host memory
        ...
    }

Solution: Device Code

    __global__ void VecAdd(float* A, float* B, float* C, int N) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < N)
            C[i] = A[i] + B[i];
    }