
Introduction to GPU Computing Using CUDA
Spring 2014 Westgrid Seminar Series
Scott Northrup, SciNet
www.scinethpc.ca
(Slides: http://support.scinet.utoronto.ca/~northrup/westgrid_CUDA.pdf)
March 12, 2014

Outline
1 Heterogeneous Computing
2 GPGPU - Overview: Hardware, Software
3 Basic CUDA: Example: Addition; Example: Vector Addition
4 More CUDA Syntax & Features
5 Summary


Heterogeneous Computing

What is it?
Using different compute devices concurrently in the same computation. Example: leverage CPUs for general-purpose components and GPUs for data-parallel, FLOP-intensive components.

Pros: faster and cheaper ($/FLOP/Watt) computation
Cons: more complicated to program

Heterogeneous Computing: Terminology

GPGPU: General Purpose Graphics Processing Unit
HOST: CPU and its memory
DEVICE: Accelerator (GPU) and its memory

Outline
1 Heterogeneous Computing
2 GPGPU - Overview: Hardware, Software
3 Basic CUDA: Example: Addition; Example: Vector Addition
4 More CUDA Syntax & Features
5 Summary

GPU vs. CPU

CPU:
  general purpose
  task parallelism (diverse tasks)
  maximize serial performance
  large cache
  multi-threaded (4-16)
  some SIMD (SSE, AVX)

GPU:
  data parallelism (single task)
  maximize throughput
  small cache
  super-threaded (500-2000+)
  almost all SIMD


Speedup

What kind of speedup can I expect?
  1 TFLOPS per GPU vs. 100 GFLOPS for a multi-core CPU
  0x - 50x reported

Speedup depends on:
  problem structure: need many identical, independent calculations, preferably with sequential memory access
  single vs. double precision (K20: 3.52 TF SP vs. 1.17 TF DP)
  level of intimacy with hardware
  time investment

GPU Programming: GPGPU Languages

OpenGL, DirectX (graphics only)
OpenCL (1.0, 1.1, 2.0) - open standard
CUDA (NVIDIA proprietary)
OpenACC
OpenMP 4.0

CUDA

What is it?
  Compute Unified Device Architecture
  A parallel computing platform and programming model created by NVIDIA

Language Bindings:
  C/C++: nvcc compiler (works with gcc/intel)
  Fortran (PGI compiler)
  Others (PyCUDA, jCUDA, etc.)

CUDA Versions: 1.0 - 6.0
Hardware Compute Capability (1.0 - 3.5), queryable at runtime (see the sketch below):
  Tesla M20*0 (Fermi) has CC 2.0
  Tesla K20 (Kepler) has CC 3.5
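To check the compute capability of the card you actually land on, here is a minimal sketch (our own illustration, not from the slides) using the standard runtime API calls cudaGetDeviceCount and cudaGetDeviceProperties:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int i = 0; i < count; i++) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            // Prints e.g. "Device 0: Tesla K20m (CC 3.5)"
            printf("Device %d: %s (CC %d.%d)\n", i, prop.name, prop.major, prop.minor);
        }
        return 0;
    }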

Compute Canada Resources: GPU Systems

Westgrid: Parallel - 60 nodes (3x NVIDIA M2070)
SharcNet: Monk - 54 nodes (2x NVIDIA M2070)
SciNet: Gravity - 49 nodes (2x NVIDIA M2090); ARC - 8 nodes (2x NVIDIA M2070) + 1 node (1x NVIDIA K20)
Calcul Québec: Guillimin - 50 nodes (2x NVIDIA K20)

Outline
1 Heterogeneous Computing
2 GPGPU - Overview: Hardware, Software
3 Basic CUDA: Example: Addition; Example: Vector Addition
4 More CUDA Syntax & Features
5 Summary

CUDA Example: Addition

Device Kernel code:

    __global__ void add(float *a, float *b, float *c) {
        *c = *a + *b;
    }

CUDA Syntax: Qualifiers
__global__ indicates a function that runs on the DEVICE and is called from the HOST.
Used by the compiler, nvcc, to separate and sort HOST and DEVICE components.

CUDA Example: Addition

Host Code: Components
  Allocate Host/Device memory
  Initialize Host Data
  Copy Host Data to Device
  Execute Kernel on Device
  Copy Device Data back to Host
  Output
  Clean-up

CUDA Example: Addition

Host code: Memory Allocation

    int main(void) {
        float a, b, c;          // host copies of a, b, c
        float *d_a, *d_b, *d_c; // device copies of a, b, c
        int size = sizeof(float);

        // Allocate space for device copies of a, b, c
        cudaMalloc((void **)&d_a, size);
        cudaMalloc((void **)&d_b, size);
        cudaMalloc((void **)&d_c, size);
        ...

CUDA Syntax: Memory Allocation
cudaMalloc allocates memory on the DEVICE
cudaFree deallocates memory on the DEVICE

CUDA Example: Addition

Host code: Data Movement

        ...
        // Setup input values
        a = 2.0;
        b = 7.0;

        // Copy inputs to device
        cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);
        ...

CUDA Syntax: Memory Transfers
cudaMemcpy copies memory from HOST to DEVICE or from DEVICE to HOST.

CUDA Example: Addition

Host code: Kernel Execution

        ...
        // Launch add() kernel on GPU
        add<<<1,1>>>(d_a, d_b, d_c);
        ...

CUDA Syntax: Kernel Execution
Triple brackets <<<N,M>>> denote a call from HOST to DEVICE.
We come back to the N, M values later.

CUDA Example: Addition

Host code: Get Data, Output, Cleanup

        ...
        // Copy result back to host
        cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);
        printf("%2.0f + %2.0f = %2.0f\n", a, b, c);

        // Cleanup
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }

Compile and Run:
    $ nvcc -arch=sm_20 hello.cu -o hello
    $ ./hello
     2 +  7 =  9
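The example above ignores the status codes returned by the CUDA API. As a minimal sketch of error checking (the CUDA_CHECK macro name is our own convention, not from the slides; cudaError_t and cudaGetErrorString are the real runtime API):

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    // Hypothetical helper: abort with file/line if a CUDA call fails.
    #define CUDA_CHECK(call)                                        \
        do {                                                        \
            cudaError_t err = (call);                               \
            if (err != cudaSuccess) {                               \
                fprintf(stderr, "CUDA error %s at %s:%d\n",         \
                        cudaGetErrorString(err), __FILE__, __LINE__); \
                exit(1);                                            \
            }                                                       \
        } while (0)

    // Usage:
    //   CUDA_CHECK(cudaMalloc((void **)&d_a, size));
    //   CUDA_CHECK(cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice));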

CUDA Basics: Review

Device Kernel code: __global__ function qualifier
Host Code:
  Allocate Host/Device memory: cudaMalloc(...)
  Initialize Host Data
  Copy Host Data to Device: cudaMemcpy(...)
  Execute Kernel on Device: fn<<<N,M>>>(...)
  Copy Device Data to Host: cudaMemcpy(...)
  Output
  Clean-up: cudaFree(...)

CUDA Parallelism: Threads, Blocks, Grids

Blocks & Threads
  Threads: execution threads
  Blocks: groups of threads
  Grids: sets of blocks

Built-in variables define a thread's position:
  threadIdx
  blockIdx
  blockDim


CUDA Parallelism: Threads, Blocks, Grids

Blocks & Threads
  Threads operate in a SIMD(ish) manner, each executing the same instructions in lockstep.
  Blocks are assigned to a multiprocessor on the GPU, which executes one warp (usually 32 threads) at a time.

Kernel execution:
    fn<<< blocksPerGrid, threadsPerBlock >>>(...)

CUDA Example: Vector Addition

Host code: Allocate Memory

    int main(void) {
        int N = 1024;           // size of vector
        float *a, *b, *c;       // host copies of a, b, c
        float *d_a, *d_b, *d_c; // device copies of a, b, c
        int size = N * sizeof(float);

        // Allocate space for host copies of a, b, c
        a = (float *)malloc(size);
        b = (float *)malloc(size);
        c = (float *)malloc(size);

        // Allocate space for device copies of a, b, c
        cudaMalloc((void **)&d_a, size);
        cudaMalloc((void **)&d_b, size);
        cudaMalloc((void **)&d_c, size);
        ...

CUDA Example: Vector Addition

Host code: Initialize and Copy

        ...
        // Setup input values
        for (int i = 0; i < N; i++) {
            a[i] = (float)i;
            b[i] = 2.0f * (float)i;
        }

        // Copy inputs to device
        cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

        // Launch add() kernel on GPU
        add<<<1,N>>>(d_a, d_b, d_c);
        ...

CUDA Example: Vector Addition

Host code: Get Data, Output, Cleanup

        ...
        // Copy result back to host
        cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
        printf("Hello World! I can add on a GPU\n");
        for (int i = 0; i < N; i++) {
            printf("%d: %2.0f + %2.0f = %2.0f\n", i, a[i], b[i], c[i]);
        }

        // Cleanup
        free(a); free(b); free(c);
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }

CUDA Threads

Kernel using just threads:

    __global__ void add(float *a, float *b, float *c) {
        c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
    }

Host code: kernel execution

        ...
        // Launch add() kernel on GPU
        // with 1 block and N threads
        add<<<1,N>>>(d_a, d_b, d_c);
        ...

CUDA Blocks

Kernel using just blocks:

    __global__ void add(float *a, float *b, float *c) {
        c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
    }

Host code: kernel execution

        ...
        // Launch add() kernel on GPU
        // with N blocks, 1 thread each
        add<<<N,1>>>(d_a, d_b, d_c);
        ...

CUDA Blocks & Threads

Indexing with Blocks and Threads
Use the built-in variables to define a unique position:
  threadIdx : thread ID (within a block)
  blockIdx  : block ID
  blockDim  : threads per block

CUDA Blocks & Threads

Kernel using blocks and threads:

    __global__ void add(float *a, float *b, float *c, int n) {
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        if (index < n)
            c[index] = a[index] + b[index];
    }

Host code: kernel execution

        ...
        int TB = 128; // threads per block
        // Launch add() kernel on GPU
        add<<<N/TB, TB>>>(d_a, d_b, d_c, N);
        ...
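One caveat (our own note, not from the slides): N/TB uses integer division, so the launch above covers every element only when N is a multiple of TB. The usual sketch rounds the block count up and lets the in-kernel bounds check discard the excess threads:

        int TB = 128;                   // threads per block
        int blocks = (N + TB - 1) / TB; // round up so all N elements are covered
        add<<<blocks, TB>>>(d_a, d_b, d_c, N);
        // Threads with index >= N do nothing, thanks to the if (index < n) guard.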

Outline
1 Heterogeneous Computing
2 GPGPU - Overview: Hardware, Software
3 Basic CUDA: Example: Addition; Example: Vector Addition
4 More CUDA Syntax & Features
5 Summary


More CUDA Syntax

Qualifiers
  Functions:
    __global__ : device kernels, called from host
    __host__   : host only (default)
    __device__ : device only, called from device
  Data:
    __shared__   : memory shared within a block
    __constant__ : special memory for constants (cached)
  Control:
    __syncthreads() : thread barrier within a block (illustrated below)

More details
  Grids and Blocks can be 1D, 2D, or 3D (type dim3)
  Error Handling: cudaError_t cudaGetLastError(void)
  Device Management: cudaGetDeviceProperties(...)
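To make __shared__ and __syncthreads() concrete, here is a small sketch (our own illustration, not from the slides) of a block-level sum reduction; each block cooperatively reduces its TB input elements to a single partial sum:

    #define TB 128  // threads per block (assumed a power of two)

    __global__ void block_sum(float *in, float *out, int n) {
        __shared__ float buf[TB];       // one slot per thread in the block
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                // all loads finished before reducing

        // Tree reduction in shared memory
        for (int s = blockDim.x / 2; s > 0; s /= 2) {
            if (threadIdx.x < s)
                buf[threadIdx.x] += buf[threadIdx.x + s];
            __syncthreads();            // every step sees the previous one
        }
        if (threadIdx.x == 0)
            out[blockIdx.x] = buf[0];   // one partial sum per block
    }

    // Launched as, e.g.: block_sum<<<(n + TB - 1) / TB, TB>>>(d_in, d_out, n);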


CUDA Kernels

Kernel Limitations
  No recursion (allowed in device code for CC >= 2.0)
  No variable argument lists
  No dynamic memory allocation
  No function pointers
  No static variables inside kernels

Performance Tips
  Exploit parallelism
  Avoid branches in device code
  Avoid memory transfers between Device and Host
  GPU memory is high bandwidth / high latency:
    can hide latency with lots of threads
    access patterns matter (coalesced; see the sketch below)
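As a sketch of what "coalesced" means in practice (our own illustration, not from the slides): adjacent threads in a warp should touch adjacent memory locations. The first copy kernel below is coalesced; the strided variant scatters each warp's accesses across many memory segments and typically runs far slower on the same hardware:

    // Coalesced: thread k reads a[k], so a warp touches one contiguous chunk.
    __global__ void copy_coalesced(float *a, float *b, int n) {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        if (i < n) b[i] = a[i];
    }

    // Strided: adjacent threads read elements 32 apart; each warp's loads
    // scatter across memory (only illustrative - it also touches fewer elements).
    __global__ void copy_strided(float *a, float *b, int n) {
        int i = (threadIdx.x + blockIdx.x * blockDim.x) * 32;
        if (i < n) b[i] = a[i];
    }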


CUDA Libraries & Applications

Libraries
  cuBLAS, CULA, cuSPARSE, cuFFT
  https://developer.nvidia.com/gpu-accelerated-libraries

Applications
  GROMACS, NAMD, LAMMPS, AMBER, CHARMM
  WRF, GEOS-5
  Fluent 15.0, ANSYS, Abaqus
  Matlab, Mathematica
  http://www.nvidia.com/object/gpu-applications.html

Outline
1 Heterogeneous Computing
2 GPGPU - Overview: Hardware, Software
3 Basic CUDA: Example: Addition; Example: Vector Addition
4 More CUDA Syntax & Features
5 Summary

Intro to CUDA: Summary

Heterogeneous Computing
CUDA Basics:
  __global__, cudaMemcpy(...), fn<<<blocks, threads_per_block>>>(...)
  blocks and threads indexing (threadIdx.x, blockIdx.x, blockDim.x)
Limitations
Performance
CUDA Libraries and Applications
https://developer.nvidia.com/cuda