CUDA Workshop: High Performance GPU Computing. EXEBIT 2014. Karthikeyan


CPU vs GPU. CPU: very fast at serial work, optimized for low latency. GPU: slower per thread, but massively parallel and optimized for high throughput.

CUDA (Compute Unified Device Architecture). Exposes the GPU for general-purpose computing. Flexible and scalable architecture. Based on industry-standard C/C++, with a small set of extensions to enable heterogeneous programming and straightforward APIs to manage devices, memory, etc. For NVIDIA GPUs only.

Concepts to be covered: heterogeneous computing; blocks and threads; indexing; shared memory; __syncthreads(); warps and divergence; asynchronous operation; handling errors; managing devices.

Heterogeneous Computing. The CPU is the host and CPU RAM is host memory; the GPU is the device and GPU RAM is device memory. (www.nvidia.com)

Hello World! GPU code is a kernel: __global__ indicates that it runs on the device, triple angle brackets mark a call from host code to device code (a kernel launch), and a kernel returns void.

    __global__ void mykernel(void) {
        cuPrintf("Hello World!\n");
    }

    int main(void) {
        mykernel<<<1,1>>>();
        printf("CPU Hello World!\n");
        return 0;
    }

Hello World! Compile and run:

    $ nvcc helloworld.cu
    $ ./a.out

Working with the codes. Open a terminal:

    ssh -X user#@10.21.1.166    (users 1-25)
    ssh -X guest@10.6.5.254     (users 26-50)
    ssh -X guest@192.168.1.211  (users 26-50)
    cd codes/helloworld/
    make
    ./helloworld
    gedit &

Hello World! Parallel. Change the launch to mykernel<<<N,1>>>(); this launches N blocks. The CPU calls the kernel and then continues its own work.

    __global__ void mykernel(void) {
        cuPrintf("Hello World!\n");
    }

    int main(void) {
        int N = 100;
        mykernel<<<N,1>>>();
        printf("CPU Hello World!\n");
        return 0;
    }

Compile:
    $ cd helloworld_blocks
    $ make
    $ ./helloworld_blocks

Processing Flow (1). Copy input data from host memory (CPU) to device memory (GPU) over the PCI bus.

Processing Flow (2). The CPU launches the kernel over the PCI bus; the kernel accesses device memory at a much higher rate and utilizes on-chip cache memory.

Processing Flow (3). Copy results back from device memory (GPU) to host memory (CPU) over the PCI bus.

Device Memory Management.

    cudaError_t cudaMalloc(void **devPtr, size_t size_bytes)
    cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind)

cudaMemcpyKind values: cudaMemcpyHostToHost (host -> host), cudaMemcpyHostToDevice (host -> device), cudaMemcpyDeviceToHost (device -> host), cudaMemcpyDeviceToDevice (device -> device).

Example:

    int a[100], *dev_a;
    cudaMalloc(&dev_a, sizeof(int)*100);
    cudaMemcpy(dev_a, a, sizeof(int)*100, cudaMemcpyHostToDevice);
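Putting these calls together, a minimal sketch of a full allocate/copy/free round trip (h_a and d_a are illustrative names; error checking is omitted here and covered on a later slide):

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void) {
        int h_a[100];                                   // host array
        for (int i = 0; i < 100; i++) h_a[i] = i;

        int *d_a = NULL;                                // device pointer
        cudaMalloc(&d_a, sizeof(int) * 100);            // allocate on the GPU
        cudaMemcpy(d_a, h_a, sizeof(int) * 100, cudaMemcpyHostToDevice);

        // ... launch kernels that use d_a here ...

        cudaMemcpy(h_a, d_a, sizeof(int) * 100, cudaMemcpyDeviceToHost);
        cudaFree(d_a);                                  // release device memory
        return 0;
    }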

Vector Addition. How does a block identify which element it works on? Each block takes care of one element, identified by blockIdx.x. The grid of blocks can be 3-dimensional (blockIdx.x, blockIdx.y, blockIdx.z); its size is given as a dim3, e.g. dim3(65535, 65535, 1024), and queried through gridDim.x.

Serial version:

    void vectoradd(int *a, int *b, int *c) {
        for (int i = 0; i < 100; i++)
            c[i] = a[i] + b[i];
    }

CUDA version, one block per element:

    __global__ void vectoradd(int *a, int *b, int *c) {
        int i = blockIdx.x;
        c[i] = a[i] + b[i];
    }

Vector Addition (full program).

    __global__ void vectoradd(int *a, int *b, int *c) {
        int i = blockIdx.x;
        c[i] = a[i] + b[i];
    }

    int main(void) {
        int host_a[100], host_b[100], host_c[100];
        int *dev_a, *dev_b, *dev_c;

        // Memory allocation
        cudaMalloc(&dev_a, sizeof(int)*100);
        cudaMalloc(&dev_b, sizeof(int)*100);
        cudaMalloc(&dev_c, sizeof(int)*100);

        // Memory copy: host -> device
        cudaMemcpy(dev_a, host_a, sizeof(int)*100, cudaMemcpyHostToDevice);
        cudaMemcpy(dev_b, host_b, sizeof(int)*100, cudaMemcpyHostToDevice);

        // Kernel launch: 100 blocks, 1 thread each
        vectoradd<<<100,1>>>(dev_a, dev_b, dev_c);

        // Copy results back: device -> host
        cudaMemcpy(host_c, dev_c, sizeof(int)*100, cudaMemcpyDeviceToHost);
        return 0;
    }

Compile:
    $ cd vectoradd/
    $ make
    $ ./vectoradd

Threads. A block can have many threads. For vector addition with one block of N threads, the kernel launch would be vectoradd<<<1,N>>>(da, db, dc); The maximum thread dimensions (3-dimensional) are (1024, 1024, 64). Inside the kernel, the thread index is threadIdx.x and the block size is blockDim.x.

    __global__ void vectoradd(int *a, int *b, int *c) {
        int i = threadIdx.x;
        c[i] = a[i] + b[i];
    }

Compile:
    $ cd vectoradd_threads/
    $ make
    $ ./vectoradd_threads

Threads. A kernel launch is a 3D mesh of blocks, each of which is a 3D mesh of threads. Why threads? Threads within a block can communicate (through shared memory) and synchronize; blocks can't.

Built-in Variables. threadIdx.x, threadIdx.y, threadIdx.z; blockIdx.x, blockIdx.y, blockIdx.z; blockDim.x, .y, .z (up to 1024, 1024, 64) - the number of threads per block; gridDim.x, .y, .z (up to 65535, 65535, 1024) - the number of blocks in a kernel call (called a grid of blocks).

Index Calculation. Using blocks and threads simultaneously, the global index is

    i = threadIdx.x + blockDim.x * blockIdx.x;

For example, with 4 blocks of 8 threads each, threadIdx.x runs 0..7 within every block, blockIdx.x runs 0..3, blockDim.x = 8 (number of threads in a block) and gridDim.x = 4 (number of blocks in the kernel launch). The launch then looks like

    add<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(...)

Boundary Conditions. blockDim.x is usually a multiple of 32, so the total thread count may exceed the data size; always guard with a boundary condition on the data size.

    __global__ void vectoradd(int *a, int *b, int *c, int N) {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        if (i < N)
            c[i] = a[i] + b[i];
    }

Compile:
    $ cd vectoradd_full/
    $ make
    $ ./vectoradd_full
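On the host side, the block count is usually rounded up so that every element is covered; a minimal sketch, assuming the THREADS_PER_BLOCK and N names used on these slides:

    #define THREADS_PER_BLOCK 256

    // Round up: enough blocks to cover N elements even when N is not a multiple
    // of the block size; the if (i < N) guard in the kernel handles the excess threads.
    int blocks = (N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    vectoradd<<<blocks, THREADS_PER_BLOCK>>>(dev_a, dev_b, dev_c, N);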

For Very Large N. For very large N (N > 10^6), use a grid-stride loop so that a fixed number of threads covers all elements.

    __global__ void vectoradd(int *a, int *b, int *c, long N) {
        long i = threadIdx.x + blockDim.x * blockIdx.x;
        for (; i < N; i += gridDim.x * blockDim.x)
            c[i] = a[i] + b[i];
    }

Compile:
    $ cd vectoradd_large/
    $ make
    $ ./vectoradd_large
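With the grid-stride loop the launch configuration no longer has to match N; a minimal sketch of a fixed-size launch, assuming the same dev_a, dev_b, dev_c pointers as before:

    // Any grid size works; each thread strides through the array until N is covered.
    vectoradd<<<1024, 256>>>(dev_a, dev_b, dev_c, N);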

Block Scheduling. Streaming Multiprocessors (SMs) are the executing units; different GPUs have different numbers of SMs. There is communication among threads within a block, but no communication among blocks, and no specific order in block scheduling.

Block Scheduling. All threads of a block execute on a single SM. There is no guarantee on the order of execution: the hardware schedules blocks onto whichever SMs are available (e.g., BLOCK 1 to BLOCK 4 distributed over 3 available SMs as they free up).

1-D Stencil. Compute b(i) = a(i) + a(i+1) + a(i+2): each thread reads a window of three neighbouring elements (thread 0 reads a[0..2], thread 1 reads a[1..3], and so on).

    __global__ void stencil(int *a, int *b) {
        int i = threadIdx.x;
        b[i] = a[i] + a[i+1] + a[i+2];
    }

Compile:
    $ cd 1dstencil/
    $ make
    $ ./1dstencil

Global Memory. Until now we have been using global memory for our computations. It is very slow to access and is allocated using cudaMalloc(...).

1-D Stencil Revisited. The same kernel costs 3 global reads + 1 global write per thread, yet neighbouring threads read overlapping data, so the data could be shared among threads.

    __global__ void stencil(int *a, int *b) {
        int i = threadIdx.x;
        b[i] = a[i] + a[i+1] + a[i+2];   // 3 global reads + 1 global write
    }

Shared Memory. Memory shared among the threads of a block; it cannot be accessed from another block. Declared inside kernel code as __shared__ int a[100]; On-chip and very fast.

1-D Stencil with Shared Memory. Each thread copies its element into shared memory, the result is computed from shared memory, and then written back to global memory. Shared memory is visible to a block only; it cannot be accessed by other blocks or by the CPU.

    __global__ void stencil(int *a, int *b) {
        int i = threadIdx.x;
        __shared__ int sa[100];
        sa[i] = a[i];                        // copy to shared memory
        // note: a __syncthreads() is needed here before reading neighbours
        // written by other threads (see the __syncthreads() slide)
        b[i] = sa[i] + sa[i+1] + sa[i+2];    // write the result to global memory
    }

Compile:
    $ cd 1dstencil_shared
    $ make
    $ ./1dstencil_shared

Access Times. Registers: 1-2 cycles. Shared memory: ~10 cycles. Global memory: 100s of cycles. Local memory: 100s of cycles.

Run-time Comparison. Global-memory version: 3 global reads + 1 global write per thread = 3*100 + 100 = 400 cycles. Shared-memory version: 1 global read + 3 shared reads + 1 global write per thread = 1*100 + 3*10 + 1*100 = 230 cycles. Use nvprof ./file_name to see the runtime of the programs.

Memory Hierarchy. Registers: per thread, on chip; data lifetime = thread lifetime. Local memory: per thread, off-chip (DRAM); data lifetime = thread lifetime. Shared memory: per thread block, on-chip; data lifetime = block lifetime. Global (device) memory: accessible by all threads and by the host (CPU); data lifetime = entire program, from allocation to de-allocation. Host (CPU) memory: not directly accessible by CUDA threads.

__syncthreads(). Synchronizes all threads within a block: every thread waits until all threads have reached the __syncthreads(); call. Used to prevent RAW (read after write), WAR (write after read) and WAW (write after write) hazards: synchronize to commit all memory writes, reads and computation before continuing. A corrected shared-memory stencil is sketched below.
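A minimal sketch of the shared-memory stencil with the barrier in place, assuming as before one block of 100 threads and the illustrative array name sa:

    __global__ void stencil(int *a, int *b) {
        int i = threadIdx.x;
        __shared__ int sa[100];
        sa[i] = a[i];              // each thread writes one element (W)
        __syncthreads();           // barrier: all writes committed before any reads
        if (i < 98)                // the last two threads have no right neighbours
            b[i] = sa[i] + sa[i+1] + sa[i+2];   // reads elements written by other threads (R)
    }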

Reduction. Addition of N numbers; other reduction operations include +, *, AND, OR, XOR, maximum, minimum, etc. Serial version:

    void reduce(int *a, int *result) {
        *result = 0;
        for (int i = 0; i < 100; i++)
            *result = *result + a[i];
    }

How to parallelize?

Reduction: how to parallelize? Use the associative property: a + b + c + d = (a + b) + (c + d).

Reduction. N numbers take log2(N) steps to compute: the results of the 1st step are shared with other threads in the 2nd step, and so on (e.g., 8 elements reduce pairwise to 4 partial sums, then 2, then 1). Some algorithms are not straightforward to implement in parallel.

Reduction Kernel. Read into shared memory, operate and write back to shared memory, then write the final result to global memory.

    __global__ void reduce(int *a, int *result) {
        int i = threadIdx.x;
        __shared__ int s_a[N];
        s_a[i] = a[i];
        __syncthreads();
        for (int stride = 1; stride < N; stride *= 2) {
            if (i % (2*stride) == 0)             // only some threads are active at each step
                s_a[i] = s_a[i] + s_a[i + stride];
            __syncthreads();
        }
        if (i == 0)
            *result = s_a[0];
    }

Compile:
    $ cd reduction/
    $ make
    $ ./reduction

CUDA programming model

CUDA programming model. Blocks are mapped onto SMs for execution.

Warps. Inside an SM, threads are split into groups of 32 threads called warps. All threads in a single warp execute in parallel. If the executing warp needs to wait (e.g., on memory or a barrier), it is put on hold and another warp is dispatched for execution; this is handled by the warp scheduler. All threads in a warp execute the SAME instruction.
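For intuition, a hedged sketch (not from the slides) of how a thread in a 1-D block can work out its own warp and lane from threadIdx.x, using the built-in warpSize and device-side printf (compute capability 2.0 and above); warp_info is an illustrative name:

    __global__ void warp_info(void) {
        int lane = threadIdx.x % warpSize;   // position within the warp (0..31)
        int warp = threadIdx.x / warpSize;   // warp index within the block
        if (lane == 0)                       // one line per warp
            printf("block %d, warp %d starts at thread %d\n",
                   blockIdx.x, warp, threadIdx.x);
    }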

Warp. No guarantee on the order in which warps are dispatched. Across GPU architectures (Tesla, Fermi, Kepler) the warp size is 32. Fermi has 2 warp schedulers and 2 instruction dispatch units per SM.

Divergence. Alternate threads in a warp take different branches, so the warp executes both paths one after the other and takes 2 time steps.

    if (threadIdx.x % 2 == 0)
        a[threadIdx.x] += 1;      // even threads 0, 2, 4, 6, 8 ... take the if path
    else
        a[threadIdx.x] += 2;      // odd threads 1, 3, 5, 7, 9 ... take the else path

Divergence (avoided). If the branch condition is uniform across a warp, all threads in the warp execute the same instruction and each warp takes 1 time step.

    if (threadIdx.x < 32)
        a[threadIdx.x] += 1;      // warp 1 takes the if path
    else
        a[threadIdx.x] += 2;      // warp 2 takes the else path

Reduction Revisited. The reduction kernel diverges at all strides: threads within a warp execute different instructions because the active threads are scattered across every warp. Solution: modify the condition so that the active threads are contiguous.

    for (int stride = 1; stride < N; stride *= 2) {
        if (i % (2*stride) == 0)             // active threads are spread across every warp
            s_a[i] = s_a[i] + s_a[i + stride];
        __syncthreads();
    }
    *result = s_a[0];

Reduction (No Divergence). Add elements that are stride apart, halving the stride each step (e.g., 8 elements: 0+4, 1+5, 2+6, 3+7, then 0+2, 1+3, then 0+1). The active threads are contiguous, so there is no divergence while stride >= 32.

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (i < stride)
            s_a[i] = s_a[i] + s_a[i + stride];
        __syncthreads();
    }
    *result = s_a[0];

Compile:
    $ cd reduction_nodiv/
    $ make
    $ ./reduction_nodiv

Resource Allocation. Split your program into small kernels. Why? Each SM has limited registers and shared memory, and the amounts depend on the compute capability of the GPU (1.0, 1.1, 1.2, 1.3, 2.x, 3.0, 3.5, 5.0 - the Tesla, Fermi, Kepler generations). Global memory, by contrast, is large (>512 MB). To see a kernel's register and shared-memory usage: nvcc -Xptxas=-v filename.cu

Resource Limits. The number of thread blocks resident on an SM is limited by register usage, shared memory usage, the maximum number of blocks per SM, and the maximum number of threads per SM. These limits determine occupancy.

    Limit              1.3     2.x     3.x     5.0
    Registers/SM       16K     32K     64K     64K
    Shared memory/SM   16KB    48KB    48KB    64KB
    Blocks/SM          8       8       16      32
    Threads/SM         1024    1536    2048    2048
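For intuition, consider a hypothetical kernel (not from the slides) on a 3.x device using 256 threads per block, 40 registers per thread, and 12 KB of shared memory per block. The limits allow 2048/256 = 8 resident blocks by thread count, 64K/(256*40) ≈ 6 blocks by registers, 48KB/12KB = 4 blocks by shared memory, and 16 blocks by the block limit. The tightest limit wins: 4 resident blocks = 1024 threads = 1024/2048 = 50% occupancy.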

Asynchronous Operation. Kernel launches are asynchronous; cudaMemcpy and cudaMalloc are synchronous. cudaMemcpyAsync() is asynchronous and does not block the CPU. cudaDeviceSynchronize() blocks the CPU until all preceding CUDA calls have completed. Asynchronous calls let the CPU do useful work while the GPU is busy.
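A minimal sketch of overlapping GPU work with CPU work using a stream; mykernel, d_data, h_data, nbytes and do_cpu_work are illustrative names, and h_data should be allocated with cudaMallocHost (pinned memory) for the copy to be truly asynchronous:

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // The launch returns immediately; the copy is queued on the same stream.
    mykernel<<<blocks, threads, 0, stream>>>(d_data);
    cudaMemcpyAsync(h_data, d_data, nbytes, cudaMemcpyDeviceToHost, stream);

    do_cpu_work();               // the CPU keeps working while the GPU is busy

    cudaDeviceSynchronize();     // block the CPU until all preceding CUDA calls finish
    cudaStreamDestroy(stream);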

Handling Errors. All CUDA API calls return an error code (cudaError_t), which reports either an error in the API call itself or an error in an earlier asynchronous operation (e.g., a kernel launch). Get the error code for the last error: cudaError_t cudaGetLastError(void). Get a string describing the error: const char *cudaGetErrorString(cudaError_t).

    printf("%s\n", cudaGetErrorString(cudaGetLastError()));
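A common pattern (not from the slides) is to wrap each API call in a checking macro; a minimal sketch with a hypothetical CUDA_CHECK helper:

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    // Print the error string and abort if a CUDA call fails.
    #define CUDA_CHECK(call)                                              \
        do {                                                              \
            cudaError_t err = (call);                                     \
            if (err != cudaSuccess) {                                     \
                printf("CUDA error %s at %s:%d\n",                        \
                       cudaGetErrorString(err), __FILE__, __LINE__);      \
                exit(1);                                                  \
            }                                                             \
        } while (0)

    // Usage:
    //   CUDA_CHECK(cudaMalloc(&dev_a, sizeof(int) * 100));
    //   mykernel<<<1,1>>>();
    //   CUDA_CHECK(cudaGetLastError());   // catch kernel-launch errors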

Device Management. An application can query and select GPUs: cudaGetDeviceCount(int *count), cudaSetDevice(int device), cudaGetDevice(int *device), cudaGetDeviceProperties(cudaDeviceProp *prop, int device). Multiple host threads can share a device, and a single host thread can manage multiple devices: cudaSetDevice(i) selects the current device, and cudaMemcpy(...) can perform peer-to-peer copies.
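A minimal sketch of querying the available GPUs with these calls (name, major and minor are fields of cudaDeviceProp):

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int d = 0; d < count; d++) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);
            printf("Device %d: %s, compute capability %d.%d\n",
                   d, prop.name, prop.major, prop.minor);
        }
        cudaSetDevice(0);    // select device 0 as the current device
        return 0;
    }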

Summary. Write and launch CUDA C/C++ kernels: __global__, <<<>>>, blockIdx, threadIdx, blockDim. Manage GPU memory: cudaMalloc(), cudaMemcpy(), cudaFree(). Manage communication and synchronization: __shared__, __syncthreads(), cudaMemcpy() vs cudaMemcpyAsync(), cudaDeviceSynchronize(). Mind the resource limits: registers, shared memory, blocks/SM, threads/SM.

Advanced concepts (not covered): memory coalescing, constant memory, streams, atomics, shared memory bank conflicts, texture memory.

Tools: nvcc - NVIDIA compiler; nvprof - command-line profiler; nvvp - visual profiler; cuda-memcheck - finds memory bugs; Nsight - Visual Studio and Eclipse integration; Allinea DDT - debugger.

Libraries: CUBLAS - CUDA-accelerated basic linear algebra; CUFFT - fast Fourier transforms (1D, 2D, 3D); Thrust - C++ template library (similar to the C++ STL); CULA - dense and sparse linear algebra; OpenCV - computer vision and image processing; AccelerEyes ArrayFire. Bindings and applications: MATLAB, LabVIEW, Mathematica, Python; ABAQUS, AMBER, ANSYS, GROMACS, LAMMPS, NAMD, ...
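As a taste of how compact library code can be, a hedged sketch using Thrust (which ships with the CUDA toolkit) to perform the same kind of reduction as the earlier kernels:

    #include <thrust/device_vector.h>
    #include <thrust/reduce.h>
    #include <cstdio>

    int main(void) {
        thrust::device_vector<int> d(100, 1);          // 100 ones stored on the GPU
        int sum = thrust::reduce(d.begin(), d.end());  // GPU reduction, default operator +
        std::printf("sum = %d\n", sum);
        return 0;
    }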

Online Resources: http://developer.nvidia.com/cuda-training; Coursera - Heterogeneous Computing; Udacity - CS344 Intro to Parallel Programming; GPU Computing Webinars; CUDA Documentation. Books: CUDA by Example; Programming Massively Parallel Processors: A Hands-on Approach; GPU Gems.

Questions?