CUDA C Programming
Mark Harris, NVIDIA Corporation


Agenda
- GPU Computing: What is GPU Computing? Introduction to Tesla
- CUDA: CUDA Architecture, Programming & Memory Models, Programming Environment
- Fermi: Next Generation Architecture
- Getting Started: Resources

CUDA PROGRAMMING MODEL REVIEW

CUDA Kernels
- Parallel portion of application: executed as a kernel
- The entire GPU executes the kernel, using many threads
- CUDA threads: lightweight, fast switching, 1000s execute simultaneously
- CPU = Host: executes functions
- GPU = Device: executes kernels

CUDA Kernels: Parallel Threads
- A kernel is a function executed on the GPU as an array of threads, in parallel
- All threads execute the same code, but can take different paths

    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;

- Each thread has an ID, used to select input/output data and to make control decisions

CUDA Programming Model - Summary
- A kernel executes as a grid of thread blocks
- A block is a batch of threads that communicate through shared memory
- Each block has a block ID; each thread has a thread ID
[Figure: the host launches Kernel 1 over a 1D grid of blocks (0..3) and Kernel 2 over a 2D grid of blocks (0,0..1,3)]

Blocks must be independent
- Any possible interleaving of blocks should be valid:
  - blocks are presumed to run to completion without pre-emption
  - they can run in any order
  - they can run concurrently OR sequentially
- Blocks may coordinate but not synchronize:
  - shared queue pointer: OK (see the sketch below)
  - shared lock: BAD, can easily deadlock
- The independence requirement gives scalability
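
A sketch of block-level coordination through a shared queue pointer (the names and the use of atomicAdd are illustrative, not from the slides; the point is that no block ever waits on another):

    // Global queue tail shared by all blocks; each thread claims a unique slot.
    __device__ unsigned int queueTail;

    __global__ void enqueue(int* queue)
    {
        int value = threadIdx.x + blockIdx.x * blockDim.x;
        unsigned int slot = atomicAdd(&queueTail, 1u);  // coordinate, never wait
        queue[slot] = value;
    }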

Communication Within a Block
- Threads may need to cooperate: memory accesses, sharing results
- They cooperate using shared memory, accessible by all threads within a block
- Restricting communication to within a block permits scalability: fast communication between N threads is not feasible when N is large

CUDA C

Example: Vector Addition Kernel

Device code: the kernel, executed by every thread on the GPU.

    // Compute vector sum C = A + B
    // Each thread performs one pair-wise addition
    __global__ void vecAdd(float* A, float* B, float* C)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        C[i] = A[i] + B[i];
    }

Host code: launches the kernel.

    int main()
    {
        // Run N/256 blocks of 256 threads each
        vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
    }
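
A refinement worth noting (a sketch, not on the slide): N/256 assumes N is a multiple of the block size. When it is not, round the block count up and guard the access:

    __global__ void vecAdd(float* A, float* B, float* C, int n)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        if (i < n)                         // guard: the last block may be partial
            C[i] = A[i] + B[i];
    }

    // host: (N + 255)/256 blocks cover all N elements
    vecAdd<<<(N + 255)/256, 256>>>(d_A, d_B, d_C, N);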

CUDA: Memory Management
- Explicit memory allocation returns pointers to GPU memory: cudaMalloc(), cudaFree()
- Explicit memory copy for host ↔ device and device ↔ device: cudaMemcpy()


Example: Host code for vecAdd

    // allocate and initialize host (CPU) memory
    float *h_A = ..., *h_B = ...;

    // allocate device (GPU) memory
    float *d_A, *d_B, *d_C;
    cudaMalloc((void**) &d_A, N * sizeof(float));
    cudaMalloc((void**) &d_B, N * sizeof(float));
    cudaMalloc((void**) &d_C, N * sizeof(float));

    // copy host memory to device
    cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice);

    // execute the kernel on N/256 blocks of 256 threads each
    vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
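
The slide stops at the kernel launch. A minimal completion, assuming the result belongs in a host array h_C (not declared on the slide), would copy it back and release device memory:

    // copy the result back to the host
    cudaMemcpy(h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);

    // free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);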

CUDA: Minimal extensions to C/C++
- Declaration specifiers to indicate where things live:

    __global__ void KernelFunc(...);   // kernel callable from host
    __device__ void DeviceFunc(...);   // function callable on device
    __device__ int GlobalVar;          // variable in device memory
    __shared__ int SharedVar;          // in per-block shared memory

- Extended function invocation syntax for parallel kernel launch:

    KernelFunc<<<500, 128>>>(...);     // 500 blocks, 128 threads each

- Special variables for thread identification in kernels:

    dim3 threadIdx; dim3 blockIdx; dim3 blockDim; dim3 gridDim;

- Intrinsics that expose specific operations in kernel code:

    __syncthreads();                   // barrier synchronization
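
Since the launch configuration and the thread-identification variables are all dim3, 2D launches follow the same pattern. A hedged sketch (blur2D, WIDTH, HEIGHT, and d_img are illustrative names, not from the slides):

    #define WIDTH  1024
    #define HEIGHT 768

    // Each thread computes its global (x, y) coordinate from block and thread IDs
    __global__ void blur2D(float* img, int width)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // global column
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // global row
        img[y * width + x] *= 0.5f;                      // example operation
    }

    // host side: 16x16 = 256 threads per block; the grid covers the whole image
    dim3 threads(16, 16);
    dim3 blocks(WIDTH / 16, HEIGHT / 16);  // assumes dimensions divisible by 16
    blur2D<<<blocks, threads>>>(d_img, WIDTH);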

Synchronization of blocks
- Threads within a block may synchronize with barriers:

    // ... Step 1 ...
    __syncthreads();
    // ... Step 2 ...

- Blocks coordinate via atomic memory operations
  - e.g., increment a shared queue pointer with atomicInc()
- There is an implicit barrier between dependent kernels:

    vec_minus<<<nblocks, blksize>>>(a, b, c);
    vec_dot<<<nblocks, blksize>>>(c, c);

Using per-block shared memory
- Variables shared across a block:

    __shared__ int *begin, *end;

- Scratchpad memory:

    __shared__ int scratch[BLOCKSIZE];
    scratch[threadIdx.x] = begin[threadIdx.x];
    // ... compute on scratch values ...
    begin[threadIdx.x] = scratch[threadIdx.x];

- Communicating values between threads:

    scratch[threadIdx.x] = begin[threadIdx.x];
    __syncthreads();
    int left = scratch[threadIdx.x - 1];
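
Putting the pieces together, a minimal runnable sketch (adjDiff and BLOCKSIZE are illustrative names; note the guard for threadIdx.x == 0, which the fragment above glosses over):

    #define BLOCKSIZE 256

    // Stage inputs in shared memory, then each thread reads its left
    // neighbor's staged value to form an adjacent difference.
    __global__ void adjDiff(int* out, const int* in)
    {
        __shared__ int scratch[BLOCKSIZE];
        int i = threadIdx.x + blockIdx.x * blockDim.x;

        scratch[threadIdx.x] = in[i];    // one global load per thread
        __syncthreads();                 // staged values now visible block-wide

        if (threadIdx.x > 0)
            out[i] = scratch[threadIdx.x] - scratch[threadIdx.x - 1];
        else if (i > 0)
            out[i] = in[i] - in[i - 1];  // left neighbor is in the previous block
        // thread 0 of block 0 has no left neighbor and writes nothing
    }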

Summing Up
- CUDA C = C + a few simple extensions; makes it easy to start writing parallel programs
- Three key abstractions:
  1. a hierarchy of parallel threads
  2. corresponding levels of synchronization
  3. corresponding memory spaces
- Supports the massive parallelism of many-core GPUs

CUDA C Programming
Questions?