Introduction to GPU Computing Using CUDA. Spring 2014 WestGrid Seminar Series


Introduction to GPU Computing Using CUDA. Spring 2014 WestGrid Seminar Series. Scott Northrup, SciNet, www.scinethpc.ca. March 13, 2014

Outline
1 Heterogeneous Computing
2 GPGPU - Overview: Hardware, Software
3 Basic CUDA: Example: Addition; Example: Vector Addition
4 More CUDA Syntax & Features
5 Summary


Heterogeneous Computing: What is it?
Use different compute devices concurrently in the same computation.
Example: leverage CPUs for the general computing components and GPUs for the data-parallel / FLOP-intensive components.
Pros: faster and cheaper ($/FLOP/Watt) computation
Cons: more complicated to program

Heterogeneous Computing: Terminology
GPGPU: General-Purpose Graphics Processing Unit
HOST: CPU and its memory
DEVICE: accelerator (GPU) and its memory


GPU vs. CPU
CPU: general purpose; task parallelism (diverse tasks); maximizes serial performance; large cache; multi-threaded (4-16); some SIMD (SSE, AVX)
GPU: data parallelism (single task); maximizes throughput; small cache; super-threaded (500-2000+); almost all SIMD

Speedup: What kind of speedup can I expect?
~1 TFLOPS per GPU vs. ~100 GFLOPS for a multi-core CPU; 0x - 50x reported.
Speedup depends on:
problem structure: need many identical, independent calculations, preferably with sequential memory access
single vs. double precision (K20: 3.52 TFLOPS SP vs. 1.17 TFLOPS DP)
level of intimacy with the hardware
time investment

GPU Programming: GPGPU Languages
OpenGL, DirectX (graphics only)
OpenCL (1.0, 1.1, 2.0): open standard
CUDA: NVIDIA proprietary
OpenACC
OpenMP 4.0

CUDA: What is it?
Compute Unified Device Architecture: a parallel computing platform and programming model created by NVIDIA.
Language bindings:
C/C++: nvcc compiler (works with gcc/Intel)
Fortran (PGI compiler)
Others (PyCUDA, JCuda, etc.)
CUDA versions (V1.0-6.0); hardware Compute Capability (1.0-3.5):
Tesla M20*0 (Fermi) has CC 2.0
Tesla K20 (Kepler) has CC 3.5

Compute Canada Resources: GPU Systems
WestGrid: Parallel, 60 nodes (3x NVIDIA M2070)
SharcNet: Monk, 54 nodes (2x NVIDIA M2070)
SciNet: Gravity, 49 nodes (2x NVIDIA M2090); ARC, 8 nodes (2x NVIDIA M2070) plus 1 node (1x NVIDIA K20)
Calcul Québec: Guillimin, 50 nodes (2x NVIDIA K20)


CUDA Example: Addition
Device kernel code:

    __global__ void add(float *a, float *b, float *c) {
        *c = *a + *b;
    }

CUDA syntax: qualifiers
__global__ indicates a function that runs on the DEVICE and is called from the HOST.
It is used by the compiler, nvcc, to separate the HOST and DEVICE components.

CUDA Example: Addition
Host code: components
Allocate host/device memory
Initialize host data
Copy host data to device
Execute kernel on device
Copy device data back to host
Output
Clean up

CUDA Example: Addition
Host code: memory allocation

    int main(void) {
        float a, b, c;        // host copies of a, b, c
        float *da, *db, *dc;  // device copies of a, b, c
        int size = sizeof(float);

        // Allocate space for device copies of a, b, c
        cudaMalloc((void **)&da, size);
        cudaMalloc((void **)&db, size);
        cudaMalloc((void **)&dc, size);
        ...

CUDA syntax: memory allocation
cudaMalloc allocates memory on the DEVICE
cudaFree deallocates memory on the DEVICE

CUDA Example: Addition
Host code: data movement

        ...
        // Setup input values
        a = 2.0;
        b = 7.0;

        // Copy inputs to device
        cudaMemcpy(da, &a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(db, &b, size, cudaMemcpyHostToDevice);
        ...

CUDA syntax: memory transfers
cudaMemcpy copies memory from HOST to DEVICE or from DEVICE to HOST.

CUDA Example: Addition
Host code: kernel execution

        ...
        // Launch add() kernel on GPU
        add<<<1,1>>>(da, db, dc);
        ...

CUDA syntax: kernel execution
The triple brackets <<<N,M>>> denote a call from the HOST to the DEVICE; we come back to the N, M values later.

CUDA Example: Addition
Host code: get data, output, clean up

        ...
        // Copy result back to host
        cudaMemcpy(&c, dc, size, cudaMemcpyDeviceToHost);
        printf("%2.0f + %2.0f = %2.0f\n", a, b, c);

        // Cleanup
        cudaFree(da); cudaFree(db); cudaFree(dc);
        return 0;
    }

Compile and run:

    $ nvcc -arch=sm_20 hello.cu -o hello
    $ ./hello
     2 +  7 =  9

CUDA Basics: Review
Device kernel code: __global__ function qualifier
Host code:
Allocate host/device memory: cudaMalloc(...)
Initialize host data
Copy host data to device: cudaMemcpy(...)
Execute kernel on device: fn<<<N, M>>>(...)
Copy device data to host: cudaMemcpy(...)
Output
Clean up: cudaFree(...)
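
Putting the review together, here is a minimal sketch of the complete program assembled from the snippets above (the file name hello.cu is an assumption):

    #include <stdio.h>

    // Device kernel: runs on the GPU, called from the host
    __global__ void add(float *a, float *b, float *c) {
        *c = *a + *b;
    }

    int main(void) {
        float a = 2.0f, b = 7.0f, c = 0.0f;  // host copies
        float *da, *db, *dc;                 // device copies
        int size = sizeof(float);

        // Allocate device memory
        cudaMalloc((void **)&da, size);
        cudaMalloc((void **)&db, size);
        cudaMalloc((void **)&dc, size);

        // Copy inputs to device, launch kernel, copy result back
        cudaMemcpy(da, &a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(db, &b, size, cudaMemcpyHostToDevice);
        add<<<1,1>>>(da, db, dc);
        cudaMemcpy(&c, dc, size, cudaMemcpyDeviceToHost);

        printf("%2.0f + %2.0f = %2.0f\n", a, b, c);

        // Clean up
        cudaFree(da); cudaFree(db); cudaFree(dc);
        return 0;
    }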

CUDA Parallelism: Threads, Blocks, Grids
Blocks & threads
Thread: a single execution thread
Block: a group of threads
Grid: a set of blocks
Built-in variables define a thread's position: threadIdx, blockIdx, blockDim

CUDA Parallelism: Threads, Blocks, Grids
Blocks & threads
Threads operate in a SIMD(ish) manner, each executing the same instructions in lockstep.
Blocks are assigned to a multiprocessor on the GPU, which executes one warp at a time (usually 32 threads).
Kernel execution: fn<<<blocksPerGrid, threadsPerBlock>>>(...)
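
A small sketch (not from the slides) that makes the launch geometry visible: each thread prints its position using the built-in variables (gridDim, the number of blocks, is a further built-in not listed above). Device-side printf requires CC 2.0+, so compile with nvcc -arch=sm_20:

    #include <stdio.h>

    // Each thread reports its coordinates within the launch
    __global__ void whoami(void) {
        printf("block %d of %d, thread %d of %d\n",
               blockIdx.x, gridDim.x, threadIdx.x, blockDim.x);
    }

    int main(void) {
        whoami<<<2,4>>>();        // 2 blocks, 4 threads each
        cudaDeviceSynchronize();  // wait for the GPU so output is flushed
        return 0;
    }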

CUDA Example: Vector Addition
Host code: allocate memory

    int main(void) {
        int N = 1024;         // size of vector
        float *a, *b, *c;     // host copies of a, b, c
        float *da, *db, *dc;  // device copies of a, b, c
        int size = N*sizeof(float);

        // Allocate space for host copies of a, b, c
        a = (float *)malloc(size);
        b = (float *)malloc(size);
        c = (float *)malloc(size);

        // Allocate space for device copies of a, b, c
        cudaMalloc((void **)&da, size);
        cudaMalloc((void **)&db, size);
        cudaMalloc((void **)&dc, size);
        ...

CUDA Example: Vector Addition
Host code: initialize and copy

        ...
        // Setup input values
        for (int i = 0; i < N; i++) {
            a[i] = (float)i;
            b[i] = 2.0*(float)i;
        }

        // Copy inputs to device
        cudaMemcpy(da, a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(db, b, size, cudaMemcpyHostToDevice);

        // Launch add() kernel on GPU
        add<<<1,N>>>(da, db, dc);
        ...

CUDA Example: Vector Addition
Host code: get data, output, clean up

        ...
        // Copy result back to host
        cudaMemcpy(c, dc, size, cudaMemcpyDeviceToHost);
        printf("Hello World! I can add on a GPU\n");
        for (int i = 0; i < N; i++) {
            printf("%d %2.0f + %2.0f = %2.0f\n", i, a[i], b[i], c[i]);
        }

        // Cleanup
        free(a); free(b); free(c);
        cudaFree(da); cudaFree(db); cudaFree(dc);
        return 0;
    }

CUDA Threads
Kernel using just threads:

    __global__ void add(float *a, float *b, float *c) {
        c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
    }

Host code: kernel execution

        ...
        // Launch add() kernel on GPU with 1 block and N threads
        add<<<1,N>>>(da, db, dc);
        ...

CUDA Blocks
Kernel using just blocks:

    __global__ void add(float *a, float *b, float *c) {
        c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
    }

Host code: kernel execution

        ...
        // Launch add() kernel on GPU with N blocks, 1 thread each
        add<<<N,1>>>(da, db, dc);
        ...

CUDA Blocks & Threads
Indexing with blocks and threads: use the built-in variables to compute a unique position.
threadIdx: thread ID (within a block)
blockIdx: block ID
blockDim: threads per block

CUDA Blocks & Threads
Kernel using blocks and threads:

    __global__ void add(float *a, float *b, float *c, int n) {
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        if (index < n)
            c[index] = a[index] + b[index];
    }

Host code: kernel execution

        ...
        int TB = 128;  // threads per block
        // Launch add() kernel on GPU
        add<<<N/TB, TB>>>(da, db, dc, N);
        ...
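
One caveat worth adding (not on the slide): N/TB is integer division, so the launch above silently skips the tail elements whenever N is not a multiple of TB. A common sketch of the fix rounds the block count up and relies on the kernel's index < n guard:

        // Round up so every element gets a thread; the kernel's
        // `if (index < n)` guard disables the surplus threads
        int blocks = (N + TB - 1) / TB;
        add<<<blocks, TB>>>(da, db, dc, N);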


More CUDA Syntax
Qualifiers
Functions:
__global__: device kernel, called from the host
__host__: host only (default)
__device__: device only, called from the device
Data:
__shared__: memory shared within a block
__constant__: special memory for constants (cached)
Control:
__syncthreads(): thread barrier within a block
More details:
Grids and blocks can be 1D, 2D, or 3D (type dim3)
Error handling: cudaError_t cudaGetLastError(void)
Device management: cudaGetDeviceProperties(...)
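
To make __shared__ and __syncthreads() concrete, a minimal sketch (an illustration, not from the slides): each block stages its chunk of the array in shared memory and writes it back reversed, with the barrier ensuring all writes finish before any thread reads a neighbour's slot. BLOCK is a hypothetical compile-time block size:

    #define BLOCK 64

    // Reverse each BLOCK-sized chunk of d in place using shared memory;
    // launch as reverse_chunk<<<N/BLOCK, BLOCK>>>(d) with N divisible by BLOCK
    __global__ void reverse_chunk(float *d) {
        __shared__ float s[BLOCK];       // visible to all threads in the block
        int t    = threadIdx.x;
        int base = blockIdx.x * blockDim.x;

        s[t] = d[base + t];              // stage the chunk in shared memory
        __syncthreads();                 // barrier: all writes before any read
        d[base + t] = s[BLOCK - 1 - t];  // write back reversed
    }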

CUDA Kernels
Kernel limitations:
No recursion (allowed in device code for CC >= 2.0)
No variable argument lists
No dynamic memory allocation
No function pointers
No static variables inside kernels
Performance tips:
Exploit parallelism
Avoid branches in device code
Avoid memory transfers between DEVICE and HOST
GPU memory is high bandwidth / high latency: latency can be hidden with lots of threads, and access patterns matter (coalesced)
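
To illustrate that last point, a sketch (not from the slides) contrasting a coalesced access pattern with a strided one. The first kernel is typically much faster: consecutive threads touch consecutive addresses, so each warp's loads combine into a few wide memory transactions:

    // Coalesced: thread k touches element k, so neighbouring threads
    // hit neighbouring addresses and loads combine into wide transactions
    __global__ void copy_coalesced(float *out, const float *in, int n) {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: neighbouring threads hit addresses `stride` apart,
    // scattering each warp's accesses across many transactions
    __global__ void copy_strided(float *out, const float *in, int n, int stride) {
        int i = (threadIdx.x + blockIdx.x * blockDim.x) * stride;
        if (i < n) out[i] = in[i];
    }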

CUDA Libraries & Applications
Libraries: CUBLAS, CULA, CUSPARSE, CUFFT
https://developer.nvidia.com/gpu-accelerated-libraries
Applications:
GROMACS, NAMD, LAMMPS, AMBER, CHARMM
WRF, GEOS-5
Fluent 15.0, ANSYS, Abaqus
Matlab, Mathematica
http://www.nvidia.com/object/gpu-applications.html
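
As a taste of the library route, a minimal CUBLAS sketch (an illustration, not from the slides) computing y = alpha*x + y on the device with cublasSaxpy instead of a hand-written kernel; error checking is omitted for brevity, and the helper name saxpy_on_gpu is hypothetical. Link with -lcublas when compiling with nvcc:

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // y = alpha*x + y on the device, via CUBLAS
    void saxpy_on_gpu(int n, float alpha, const float *x, float *y) {
        float *dx, *dy;
        cudaMalloc((void **)&dx, n*sizeof(float));
        cudaMalloc((void **)&dy, n*sizeof(float));
        cudaMemcpy(dx, x, n*sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dy, y, n*sizeof(float), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);
        cublasSaxpy(handle, n, &alpha, dx, 1, dy, 1);  // runs on the GPU
        cublasDestroy(handle);

        cudaMemcpy(y, dy, n*sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dx); cudaFree(dy);
    }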


Intro to CUDA: Summary
Heterogeneous computing
CUDA basics: __global__, cudaMemcpy(...), fn<<<blocks, threads_per_block>>>(...)
Blocks and threads indexing: threadIdx.x, blockIdx.x, blockDim.x
Limitations and performance
CUDA libraries and applications
https://developer.nvidia.com/cuda