Introduction to CUDA Programming


Steve Lantz, Cornell University Center for Advanced Computing
October 30, 2013
Based on materials developed by CAC and TACC

Outline
- Motivation for GPUs and CUDA
- Overview of Heterogeneous Parallel Computing
- TACC Facts: the NVIDIA Tesla K20 GPUs on Stampede
- Structure of CUDA Programs
- Threading and the Thread Hierarchy
- Memory Model
- Advanced Performance Tips

Motivation: Why Use GPUs?
- Parallel and multithreaded hardware design
- Floating-point computations: graphics rendering, and general-purpose computing as well
- Energy efficiency: more FLOP/s per watt than CPUs
- MIC vs. GPU: comparable performance, different programming models

Motivation: Peak Performance Comparison (figure)

Motivation: What is CUDA?
- Compute Unified Device Architecture
- A many-core, shared-memory, multithreaded programming model
- An Application Programming Interface (API)
- General-purpose computing on GPUs (GPGPU)
Multi-core vs. many-core:
- Multi-core: small number of sophisticated cores (= CPUs)
- Many-core: large number of weaker cores (= GPUs)

Motivation: Why CUDA?
Advantages:
- High-level C/C++/Fortran language extensions
- Scalability
- Thread-level abstractions
- Runtime library
- Thrust parallel algorithm library
Limitations:
- Not vendor-neutral: NVIDIA CUDA-enabled GPUs only
- Alternative: OpenCL
This presentation will be in C.

Overview: Heterogeneous Parallel Computing
CPU: fast serial processing
- Large on-chip caches
- Minimal read/write latency
- Sophisticated logic control
GPU: high parallel throughput
- Large number of cores
- High memory bandwidth

Overview: Different Designs, Different Purposes

| | Intel Sandy Bridge E5-2680 | NVIDIA Tesla K20 |
| Processing Units | 8 cores | 13 SMs, 192 cores each, 2496 cores total |
| Clock Speed (GHz) | 2.7 | 0.706 |
| Maximum Hardware Threads | 8 cores, 1 thread each (not 2: hyperthreading is off) = 8 threads, with SIMD units | 13 SMs, 192 cores each, all with 32-way SIMT = 79,872 threads |
| Memory Bandwidth | 51.6 GB/s | 205 GB/s |
| L1 Cache Size | 64 KB/core | 64 KB/SM |
| L2 Cache Size | 256 KB/core | 768 KB, shared |
| L3 Cache Size | 20 MB | N/A |

SM = Stream Multiprocessor

Overview: Alphabet Soup
- GPU: Graphics Processing Unit
- GPGPU: General-Purpose computing on GPUs
- CUDA: Compute Unified Device Architecture (NVIDIA)
- Multi-core: a processor chip with 2 or more CPUs
- Many-core: a processor chip with 10s to 100s of CPUs
- SM: Stream Multiprocessor
- SIMD: Single Instruction Multiple Data
- SIMT: Single Instruction Multiple Threads = SIMD-style multithreading on the GPU

Overview: SIMD
- SISD: Single Instruction Single Data
- SIMD: Single Instruction Multiple Data
- Example: a vector instruction performs the same operation on multiple data simultaneously
- Intel and AMD extended their instruction sets to provide operations on vector registers
Intel's SIMD extensions:
- MMX: Multimedia Extensions
- SSE: Streaming SIMD Extensions
- AVX: Advanced Vector Extensions
SIMD matters in CPUs; it also matters in GPUs.

TACC Facts: GPUs on Stampede
- 6400+ compute nodes, each with:
  - 2 Intel Sandy Bridge processors (E5-2680)
  - 1 Intel Xeon Phi coprocessor (MIC)
- 128 GPU nodes, each augmented with 1 NVIDIA Tesla K20 GPU
- Login nodes do not have GPU cards installed!

TACC Facts: CUDA on Stampede
To run your CUDA application on one or more Stampede GPUs:
1. Load the CUDA software using the module utility
2. Compile your code using the NVIDIA nvcc compiler, which acts like a wrapper, hiding the intrinsic compilation details for GPU code
3. Submit your job to a GPU queue

TACC Facts: Lab 1: Querying the Device
1. Extract the lab files to your home directory:
    $ cd $HOME
    $ tar xvf ~tg459572/labs/intro_cuda.tar
2. Load the CUDA software:
    $ module load cuda

TACC Facts: Lab 1: Querying the Device
3. Go to the lab 1 directory, devicequery:
    $ cd Intro_CUDA/devicequery
   There are 2 files:
   - Source code: devicequery.cu
   - Batch script: batch.sh
4. Use the NVIDIA nvcc compiler to compile the source code:
    $ nvcc -arch=sm_30 devicequery.cu -o devicequery

TACC Facts: Lab 1: Querying the Device
5. Job submission:
   - Running 1 task on 1 node: #SBATCH -n 1
   - GPU development queue: #SBATCH -p gpudev
    $ sbatch batch.sh
    $ more gpu_query.o[job ID]

| Queue Name | Time Limit | Max Nodes | Description |
| gpu | 24 hrs | 32 | GPU main queue |
| gpudev | 4 hrs | 4 | GPU development nodes |
| vis | 8 hrs | 32 | GPU nodes + VNC service |
| visdev | 4 hrs | 4 | GPU + VNC development |

TACC Facts: Lab 1: Querying the Device
Sample output:
    CUDA Device Query...
    There are 1 CUDA devices.
    CUDA Device #0
    Major revision number:          3
    Minor revision number:          5
    Name:                           Tesla K20m
    Total global memory:            5032706048
    Total shared memory per block:  49152
    Total registers per block:      65536
    Warp size:                      32
    Maximum memory pitch:           2147483647
    Maximum threads per block:      1024
    Maximum dimension 0 of block:   1024
    Maximum dimension 1 of block:   1024
    Maximum dimension 2 of block:   64
    Maximum dimension 0 of grid:    2147483647
    Maximum dimension 1 of grid:    65535
    Maximum dimension 2 of grid:    65535
    Clock rate:                     705500
    Total constant memory:          65536
    Texture alignment:              512
    Concurrent copy and execution:  Yes
    Number of multiprocessors:      13
    Kernel execution timeout:       No
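
The lab's devicequery.cu is not reproduced here, but output of this kind comes from the CUDA runtime call cudaGetDeviceProperties(). A minimal sketch (not the lab's source) that prints a subset of the fields above:

    // Minimal device-query sketch using the CUDA runtime API.
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        int count = 0;
        cudaGetDeviceCount(&count);                       // how many CUDA devices are visible
        printf("There are %d CUDA devices.\n", count);
        for (int dev = 0; dev < count; dev++) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);          // fill a struct with the device's properties
            printf("CUDA Device #%d\n", dev);
            printf("  Name:                    %s\n", prop.name);
            printf("  Compute capability:      %d.%d\n", prop.major, prop.minor);
            printf("  Total global memory:     %zu\n", prop.totalGlobalMem);
            printf("  Shared memory per block: %zu\n", prop.sharedMemPerBlock);
            printf("  Warp size:               %d\n", prop.warpSize);
            printf("  Max threads per block:   %d\n", prop.maxThreadsPerBlock);
            printf("  Multiprocessors:         %d\n", prop.multiProcessorCount);
        }
        return 0;
    }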

Structure: Grids and Blocks in CUDA Programs (figure)

Structure: Host and Kernel Codes
Host code: your CPU code. Takes care of device memory and kernel invocation.
    int main() {
        //CPU code
        [Invoke GPU functions]
    }
Kernel code: your GPU code. Executed on the device; marked with the __global__ qualifier; must have return type void.
    __global__ void gpufunc(arg1, arg2, ...) {
        //GPU code
    }

Structure: Type Qualifiers
Function type qualifiers in CUDA:
- __global__ : callable from the host only; executed on the device; void return type
- __device__ : executed on the device only; callable from the device only
- __host__ : executed on the host only; callable from the host only; equivalent to declaring a function without any qualifier
There are variable type qualifiers available as well; refer to the NVIDIA documentation for details.
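
To make the three qualifiers concrete, here is an illustrative sketch; the function names and the array d_data are made up for this example:

    __device__ float square(float x) {       // runs on the device, callable only from device code
        return x * x;
    }

    __host__ float square_on_cpu(float x) {  // runs on the host; same as an unqualified function
        return x * x;
    }

    __global__ void square_all(float *d_data, int n) {   // kernel: launched from the host, runs on the device
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n) d_data[i] = square(d_data[i]);         // device function called from the kernel
    }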

Structure: Invoking a Kernel
A kernel is invoked from the host:
    int main() {
        //Kernel invocation
        gpufunc<<<gridConfig, blkConfig>>>(arguments...);
    }
Calling a kernel uses familiar syntax (a function/subroutine call) augmented by the chevron syntax (<<< >>>), which configures the kernel:
- First argument: how many blocks in the grid
- Second argument: how many threads in each block
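
A hedged sketch of such a launch, reusing the hypothetical square_all kernel from the previous sketch; d_data is assumed to be device memory that has already been allocated and filled:

    void launch_square_all(float *d_data, int n) {
        int threadsPerBlock = 256;
        int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;   // round up so every element is covered
        square_all<<<blocksPerGrid, threadsPerBlock>>>(d_data, n);
    }

    // For 2-D problems, dim3 describes the block and grid shapes:
    // dim3 block(16, 16);                                // 16 x 16 = 256 threads per block
    // dim3 grid((width + 15) / 16, (height + 15) / 16);
    // gpufunc<<<grid, block>>>(arguments...);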

Threading: Thread Hierarchy
- Thread: the basic execution unit
- Block: a group of threads assigned to an SM*; blocks are independent, but threads within a block can synchronize, share data, and communicate
- Grid: all the blocks invoked by a kernel
*Max 1024 threads per block (K20)

Threading: Index Keywords
Threads and blocks have unique IDs:
- Thread: threadIdx
- Block: blockIdx
threadIdx can have a maximum of 3 dimensions: threadIdx.x, threadIdx.y, and threadIdx.z
blockIdx can have a maximum of 2 dimensions, blockIdx.x and blockIdx.y (on devices of compute capability 2.0 and later, such as the K20, blockIdx.z is also available)
Why multiple dimensions? Programmer's convenience: it helps to think about working with a 2D array.
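
A sketch of how the index keywords map a 2-D thread layout onto a 2-D array stored in row-major order; the kernel name, array, and dimensions are assumed for illustration:

    __global__ void scale2d(float *d_A, int width, int height, float factor) {
        int col = blockDim.x * blockIdx.x + threadIdx.x;   // x varies fastest
        int row = blockDim.y * blockIdx.y + threadIdx.y;
        if (row < height && col < width) {
            d_A[row * width + col] *= factor;              // flatten (row, col) into a 1-D index
        }
    }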

Threading: Parallelism
Types of parallelism:
- Thread-level task parallelism: every thread, or group of threads, executes a different instruction. Not ideal because of thread divergence.
- Block-level task parallelism: different blocks perform different tasks; multiple kernels are invoked to start the tasks.
- Data parallelism: memory is shared across threads and blocks.

Threading: Warp
- Threads in a block are bundled into small groups called warps
- 1 warp = 32 threads with consecutive threadIdx values: [0..31] form the first warp, [32..63] form the second warp, etc.
- A full warp is mapped to one SIMD unit (Single Instruction Multiple Threads, SIMT)
- Therefore, threads in a warp cannot truly diverge: when they take different branches, execution is serialized. For example, in an if-then-else construct:
  1. All threads step through the then branch; it affects only threads where the condition is true
  2. All threads step through the else branch; it affects only threads where the condition is false
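
A small sketch (names are illustrative) of a branch that diverges within a warp, so the two branches are executed one after the other with inactive threads masked off each time:

    __global__ void divergent(float *d_x) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (threadIdx.x % 2 == 0) {      // even and odd lanes of the same warp disagree
            d_x[i] = d_x[i] * 2.0f;      // pass 1: only even lanes are active
        } else {
            d_x[i] = d_x[i] + 1.0f;      // pass 2: only odd lanes are active
        }
        // Branching on a per-warp quantity, e.g. (threadIdx.x / 32) % 2, avoids this serialization.
    }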

Memory: Memory Model
- Kernel: per-device global memory
- Block: per-block shared memory
- Thread: per-thread local memory, per-thread registers
- CPU and GPU do not share memory
- Two-way arrows (in the figure) indicate read/write capability

Memory: Memory Hierarchy
- Per-thread local memory: private storage for local variables; fastest; lifetime: thread
- Per-block shared memory: shared within a block; 48 kB, fast; lifetime: kernel; __shared__ qualifier
- Per-device global memory: shared; 5 GB, slowest; lifetime: application
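
A sketch of the __shared__ qualifier in use (the kernel and array names are illustrative): each block stages a tile of global memory in fast on-chip storage, synchronizes, and then reads from the tile.

    #define BLOCK_SIZE 256

    __global__ void reverse_within_blocks(float *d_in, float *d_out) {
        __shared__ float tile[BLOCK_SIZE];               // one copy per block, lives for the kernel
        int i = blockIdx.x * BLOCK_SIZE + threadIdx.x;
        tile[threadIdx.x] = d_in[i];                     // load from slow global memory
        __syncthreads();                                 // wait until the whole tile is loaded
        d_out[i] = tile[BLOCK_SIZE - 1 - threadIdx.x];   // read a different element from fast shared memory
    }

    // Launch with blockDim.x == BLOCK_SIZE, e.g.
    // reverse_within_blocks<<<n / BLOCK_SIZE, BLOCK_SIZE>>>(d_in, d_out);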

Memory: Memory Transfer
- Allocate device memory: cudaMalloc()
- Transfer memory between host and device: cudaMemcpy()
- Deallocate device memory: cudaFree()

Memory: Lab 2: Vector Add
Naming convention: h_varname is host memory, d_varname is device memory.
    int main() {
        //Allocate host memory (just use normal malloc)
        h_A = (float *)malloc(size);
        h_B = (float *)malloc(size);
        h_C = (float *)malloc(size);

        //Allocate device memory
        cudaMalloc((void **)&d_A, size);
        cudaMalloc((void **)&d_B, size);
        cudaMalloc((void **)&d_C, size);

        //Move data from host to device
        cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_C, h_C, size, cudaMemcpyHostToDevice);

        //Invoke the kernel
        vec_add<<<N/512, 512>>>(d_A, d_B, d_C);

        //Move data from device to host
        cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

        //Deallocate the memory
        cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
        free(h_A); free(h_B); free(h_C);
    }

Memory: Lab 2: Vector Add
    //Vector size
    #define N 5120000

    //Kernel function
    __global__ void vec_add(float *d_A, float *d_B, float *d_C)
    {
        //Define index
        int i = blockDim.x * blockIdx.x + threadIdx.x;

        //Vector add
        d_C[i] = d_A[i] + d_B[i];
    }

    int main() {
        ...
        vec_add<<<N/512, 512>>>(d_A, d_B, d_C);
        ...
    }
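
In the lab, N (5,120,000) is an exact multiple of the block size (512), so the launch above covers every element exactly. A sketch of a variant (not part of the lab code) that handles an arbitrary vector length by rounding the grid size up and guarding the access:

    __global__ void vec_add_any_n(float *d_A, float *d_B, float *d_C, int n) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n)                       // extra threads in the last block do nothing
            d_C[i] = d_A[i] + d_B[i];
    }

    // vec_add_any_n<<<(n + 511) / 512, 512>>>(d_A, d_B, d_C, n);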

Memory: Lab 2: Vector Add
    $ cd $HOME/Intro_CUDA/vectoradd
    $ nvcc -arch=sm_30 vectoradd.cu -o vectoradd
    $ sbatch batch.sh
Things to try on your own (after the talk):
- Time the performance using a different vector length
- Time the performance using a different block size
Timing tool: /usr/bin/time -p <executable>
CUDA also provides a better timing tool; see the NVIDIA documentation.
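
One such tool is the CUDA event API, which times work on the device itself. A minimal sketch (not part of the lab files), reusing the lab's kernel and assuming d_A, d_B, d_C are already allocated and filled:

    #include <stdio.h>

    void time_vec_add(float *d_A, float *d_B, float *d_C) {
        cudaEvent_t start, stop;
        float elapsed_ms = 0.0f;

        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);                       // enqueue a start marker on the default stream
        vec_add<<<N/512, 512>>>(d_A, d_B, d_C);          // the kernel being timed
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);                      // wait for the kernel and the stop event to complete

        cudaEventElapsedTime(&elapsed_ms, start, stop);  // elapsed time in milliseconds
        printf("vec_add took %f ms\n", elapsed_ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }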

Advanced: Performance Tips
- Minimize execution divergence: thread divergence serializes the execution
- Maximize use of on-chip memory (per-block shared and per-thread): global memory is slow (~200 GB/s)
- Optimize memory access: coalesced memory access

Advanced: Coalesced Memory Access
What is coalesced memory access? The memory accesses of all the threads in a warp are combined into a single transaction. On the NVIDIA Tesla K20: 32 threads * 4-byte word = 128 bytes.
What are the requirements?
- Memory alignment
- Sequential memory access
- Dense memory access

Advanced: Memory Alignment
- 1 transaction: sequential, in-order, aligned
- 1 transaction: sequential, reordered, aligned
- 2 transactions: sequential, in-order, misaligned

Advanced: Performance Topics
Consider the following code. Is the memory access aligned? Is the memory access sequential?
    //The variable, offset, is a constant
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int j = blockDim.x * blockIdx.x + threadIdx.x + offset;
    d_b2[i] = d_a2[j];
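
As a companion to this quiz, a sketch contrasting a fully coalesced copy with a strided one; the kernel names and arrays are illustrative, and d_a and d_b are assumed to be device float arrays:

    __global__ void copy_coalesced(float *d_a, float *d_b) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        d_b[i] = d_a[i];             // consecutive threads touch consecutive words: one transaction per warp
    }

    __global__ void copy_strided(float *d_a, float *d_b, int stride) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        d_b[i] = d_a[i * stride];    // consecutive threads skip over memory: many transactions per warp
    }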

Summary
- The GPU is very good at massively parallel jobs; the CPU is very good at moderately parallel jobs and serial processing
- GPU threads and memory are linked in a hierarchy: a block of threads shares local memory (on the SM); a grid of blocks shares global memory (on the device)
- CUDA provides high-level syntax for assigning work: the kernel is the function to be executed on the GPU; thread count and distribution are specified when a kernel is invoked; cudaMemcpy commands move data between host and device
- Programming rules must be followed to get the best performance:
  - Move data between host and device as little as possible
  - Avoid thread divergence within a warp of threads (32)
  - Preferentially use on-chip (per-block shared) memory
  - Try to perform coalesced memory accesses with warps of threads

Final: Lab 3: Matrix Multiplication
    $ cd $HOME/Intro_CUDA/matrix_mul
    $ nvcc -arch=sm_30 matrix_mul.cu -o matmul
    $ sbatch batch.sh
Things to try on your own (after the talk):
- Compare the performance to the CUDA BLAS matrix multiplication routine
- Can you improve its performance? Hints: use on-chip memory; use page-locked memory, see cudaMallocHost()
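
To illustrate the "use on-chip memory" hint, here is a sketch of a shared-memory tiled matrix multiply for square matrices whose size n is a multiple of TILE. This is an assumed example, not the lab's matrix_mul.cu or its solution:

    #define TILE 16

    __global__ void matmul_tiled(const float *A, const float *B, float *C, int n) {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float sum = 0.0f;

        for (int t = 0; t < n / TILE; t++) {
            As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];   // stage one tile of A
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col]; // and one tile of B
            __syncthreads();                                                      // tiles fully loaded
            for (int k = 0; k < TILE; k++)
                sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];                   // multiply from shared memory
            __syncthreads();                                                      // done with this tile
        }
        C[row * n + col] = sum;
    }

    // Launch: dim3 block(TILE, TILE); dim3 grid(n / TILE, n / TILE);
    // matmul_tiled<<<grid, block>>>(d_A, d_B, d_C, n);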

References
Recommended reading:
- CUDA Documentation
- Hwu, Wen-mei; Kirk, David (2010). Programming Massively Parallel Processors: A Hands-on Approach.