Lecture 2: Introduction to CUDA C


CS/EE 217 GPU Architecture and Programming
Lecture 2: Introduction to CUDA C
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013

CUDA/OpenCL Execution Model

An integrated host+device application is a single C program:
- Serial or modestly parallel parts run in host C code
- Highly parallel parts run in device SPMD kernel C code

Serial Code (host)
Parallel Kernel (device): KernelA<<< nBlk, nTid >>>(args); ...
Serial Code (host)
Parallel Kernel (device): KernelB<<< nBlk, nTid >>>(args); ...
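A minimal sketch of this pattern (hypothetical kernel name and sizes, not from the original slides): serial host code sets up data, then launches a highly parallel kernel on the device.

#include <cuda_runtime.h>

// Hypothetical kernel standing in for KernelA
__global__ void kernelA(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    int n = 1024;
    float* d_data;
    cudaMalloc((void**)&d_data, n * sizeof(float));  // serial host code

    kernelA<<<(n + 255) / 256, 256>>>(d_data, n);    // parallel device code
    cudaDeviceSynchronize();                         // back to serial host code

    cudaFree(d_data);
    return 0;
}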

From Natural Language to Electrons

Natural Language (e.g., English)
→ Algorithm
→ High-Level Language (C/C++ ...)
→ [Compiler]
→ Instruction Set Architecture
→ Microarchitecture
→ Circuits
→ Electrons

Yale Patt and Sanjay Patel, From bits and bytes to gates and beyond

The ISA

An Instruction Set Architecture (ISA) is a contract between the hardware and the software. As the name suggests, it is the set of instructions that the architecture (hardware) can execute.

A Program at the ISA Level

A program is a set of instructions stored in memory that can be read, interpreted, and executed by the hardware. Program instructions operate on data stored in memory or provided by Input/Output (I/O) devices.

The Von Neumann Model

[Diagram: Memory connected to a Processing Unit (ALU, Register File) and a Control Unit (PC, IR), with I/O.]

Arrays of Parallel Threads

A CUDA kernel is executed by a grid (array) of threads (SPMD):
- All threads in a grid run the same kernel code
- Each thread has an index that it uses to compute memory addresses and make control decisions

[Diagram: threads 0, 1, 2, ..., 254, 255, each executing]
i = blockIdx.x * blockDim.x + threadIdx.x;
C_d[i] = A_d[i] + B_d[i];

Thread Blocks: Scalable Cooperation

Divide the thread array into multiple blocks:
- Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization (see the sketch after this slide)
- Threads in different blocks cannot cooperate

[Diagram: Thread Block 0, Thread Block 1, ..., Thread Block N-1, each with threads 0 ... 255, each executing]
i = blockIdx.x * blockDim.x + threadIdx.x;
C_d[i] = A_d[i] + B_d[i];
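A minimal sketch of block-level cooperation (illustrative names; assumes 256-thread blocks): each thread stages one element in shared memory, and the barrier guarantees all writes are visible before any thread reads a neighbor's element.

__global__ void reverseInBlock(float* data)
{
    __shared__ float tile[256];                 // per-block shared memory (assumes blockDim.x == 256)
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    tile[t] = data[base + t];
    __syncthreads();                            // barrier: all loads/stores above have completed
    data[base + t] = tile[blockDim.x - 1 - t];  // safe cross-thread read within the block
}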

blockIdx and threadIdx

Each thread uses indices to decide what data to work on:
- blockIdx: 1D, 2D, or 3D (3D since CUDA 4.0)
- threadIdx: 1D, 2D, or 3D

This simplifies memory addressing when processing multidimensional data, e.g.:
- Image processing
- Solving PDEs on volumes

[Diagram: the host launches Kernel 1 on Grid 1 (2x2 blocks) and Kernel 2 on Grid 2; Block (1,1) is expanded into a 4x2x2 array of threads. Courtesy: NVIDIA]
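For example, a 2D-indexed kernel over image data might look like this (a sketch with hypothetical names; assumes a row-major width x height image):

__global__ void brighten(float* img, int width, int height, float delta)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height)
        img[row * width + col] += delta;   // one thread per pixel
}

// Launched with a 2D grid of 2D blocks:
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// brighten<<<grid, block>>>(d_img, width, height, 0.1f);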

Vector Addition: Conceptual View

[Diagram: vector A (A[0] ... A[N-1]) and vector B (B[0] ... B[N-1]) are added element-wise to produce vector C (C[0] ... C[N-1]).]

Vector Addition: Traditional C Code

// Compute vector sum C = A + B
void vecAdd(float* A, float* B, float* C, int n)
{
    for (int i = 0; i < n; i++)
        C[i] = A[i] + B[i];
}

int main()
{
    // Memory allocation for A_h, B_h, and C_h
    // I/O to read A_h and B_h, N elements
    vecAdd(A_h, B_h, C_h, N);
}

Heterogeneous Computing: vecAdd Host Code

#include <cuda.h>

void vecAdd(float* A, float* B, float* C, int n)
{
    int size = n * sizeof(float);
    float *A_d, *B_d, *C_d;

    // 1. Allocate device memory for A, B, and C
    //    Copy A and B to device memory

    // 2. Kernel launch code: have the device
    //    perform the actual vector addition

    // 3. Copy C from the device memory
    //    Free device vectors
}

[Diagram: Part 1 copies data from Host Memory (CPU) to Device Memory, Part 2 runs on the GPU, Part 3 copies results back to the host.]

Partial Overview of CUDA Memories

Device code can:
- R/W per-thread registers
- R/W per-grid global memory

Host code can:
- Transfer data to/from per-grid global memory

[Diagram: a grid of blocks; each thread has its own registers, and all threads share Global Memory, which is accessible from the host.]

We will cover more later.

CUDA Device Memory Management API Functions

cudaMalloc()
- Allocates an object in the device global memory
- Two parameters:
  - Address of a pointer to the allocated object
  - Size of the allocated object in bytes

cudaFree()
- Frees an object from device global memory
- One parameter: pointer to the freed object

A minimal usage sketch follows.
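A minimal usage sketch (the wrapper function is hypothetical, not from the slides):

void allocExample(int n)
{
    float* A_d;
    cudaMalloc((void**)&A_d, n * sizeof(float));  // first parameter: address of the pointer
    // ... launch kernels that read/write A_d ...
    cudaFree(A_d);                                // release the device global memory
}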

Host-Device Data Transfer API Functions

cudaMemcpy()
- Memory data transfer
- Requires four parameters:
  - Pointer to destination
  - Pointer to source
  - Number of bytes copied
  - Type/direction of transfer
- Transfer to the device is asynchronous

[Diagram: data moves between Host memory and the device's Global Memory.]

void vecAdd(float* A, float* B, float* C, int n)
{
    int size = n * sizeof(float);
    float *A_d, *B_d, *C_d;

    // 1. Transfer A and B to device memory
    cudaMalloc((void **) &A_d, size);
    cudaMemcpy(A_d, A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &B_d, size);
    cudaMemcpy(B_d, B, size, cudaMemcpyHostToDevice);

    // Allocate device memory for C
    cudaMalloc((void **) &C_d, size);

    // 2. Kernel invocation code to be shown later

    // 3. Transfer C from device to host
    cudaMemcpy(C, C_d, size, cudaMemcpyDeviceToHost);

    // Free device memory for A, B, C
    cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);
}

Check for API Errors in Host Code

cudaError_t err = cudaMalloc((void **) &d_A, size);
if (err != cudaSuccess) {
    printf("%s in %s at line %d\n",
           cudaGetErrorString(err), __FILE__, __LINE__);
    exit(EXIT_FAILURE);
}
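A common convenience pattern (not part of the CUDA API; the macro name here is an assumption) wraps this check so every call can be tested in one line:

#include <stdio.h>
#include <stdlib.h>

#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err = (call);                                \
        if (err != cudaSuccess) {                                \
            printf("%s in %s at line %d\n",                      \
                   cudaGetErrorString(err), __FILE__, __LINE__); \
            exit(EXIT_FAILURE);                                  \
        }                                                        \
    } while (0)

// Usage: CHECK(cudaMalloc((void**)&d_A, size));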

Example: Vector Addition Kernel (Device Code)

// Compute vector sum C = A + B
// Each thread performs one pair-wise addition
__global__ void vecAddKernel(float* A_d, float* B_d, float* C_d, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C_d[i] = A_d[i] + B_d[i];
}

int vecAdd(float* A, float* B, float* C, int n)
{
    // A_d, B_d, C_d allocations and copies omitted
    // Run ceil(n/256) blocks of 256 threads each
    vecAddKernel<<<ceil(n/256.0), 256>>>(A_d, B_d, C_d, n);
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011, ECE408/CS483, University of Illinois, Urbana-Champaign

Example: Vector Addition Kernel (Host Code)

// Compute vector sum C = A + B
// Each thread performs one pair-wise addition
__global__ void vecAddKernel(float* A_d, float* B_d, float* C_d, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C_d[i] = A_d[i] + B_d[i];
}

int vecAdd(float* A, float* B, float* C, int n)
{
    // A_d, B_d, C_d allocations and copies omitted
    // Run ceil(n/256) blocks of 256 threads each
    vecAddKernel<<<ceil(n/256.0), 256>>>(A_d, B_d, C_d, n);
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011, ECE408/CS483, University of Illinois, Urbana-Champaign

More on Kernel Launch (Host Code)

int vecAdd(float* A, float* B, float* C, int n)
{
    // A_d, B_d, C_d allocations and copies omitted
    // Run ceil(n/256) blocks of 256 threads each

    // Two equivalent ways to round the block count up:
    dim3 DimGrid(n/256, 1, 1);
    if (n % 256) DimGrid.x++;
    // or: dim3 DimGrid((n-1)/256 + 1, 1, 1);

    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
}

Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking (see the sketch below).

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011, ECE408/CS483, University of Illinois, Urbana-Champaign
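Continuing the launch code above, a minimal sketch of explicit blocking, using the standard runtime call cudaDeviceSynchronize:

vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
cudaDeviceSynchronize();  // block the host until all prior device work finishes
// It is now safe for the host to depend on the kernel's results.

Note that cudaMemcpy on the default stream also synchronizes implicitly, which is why the earlier vecAdd code can copy C back without an explicit call.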

Kernel Execution in a Nutshell

__host__ void vecAdd(float* A, float* B, float* C, int n)
{
    dim3 DimGrid(ceil(n/256.0), 1, 1);
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
}

__global__ void vecAddKernel(float* A_d, float* B_d, float* C_d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C_d[i] = A_d[i] + B_d[i];
}

[Diagram: the kernel's blocks (Blk 0 ... Blk N-1) are scheduled onto the GPU's multiprocessors (M0 ... Mk), which share GPU RAM.]

More on CUDA Function Declarations

                                    Executed on the:   Only callable from the:
__device__ float DeviceFunc()       device             device
__global__ void  KernelFunc()       device             host
__host__   float HostFunc()         host               host

- __global__ defines a kernel function
- Each "__" consists of two underscore characters
- A kernel function must return void
- __device__ and __host__ can be used together (see the sketch below)
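A short sketch of the combined qualifier (illustrative names): __host__ __device__ makes the compiler generate both a CPU and a GPU version of the same function.

__host__ __device__ float square(float x)
{
    return x * x;   // compiled for both host and device
}

__global__ void squareAll(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = square(data[i]);   // device version used here
}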

Compiling a CUDA Program

Integrated C programs with CUDA extensions
→ NVCC Compiler, which splits the source into:
  - Host Code → Host C Compiler/Linker
  - Device Code (PTX) → Device Just-in-Time Compiler
→ Heterogeneous Computing Platform with CPUs, GPUs
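In practice this is a single command; for example (hypothetical file name):

nvcc vecAdd.cu -o vecAdd

nvcc separates host and device code internally, compiles the host part with the system C/C++ compiler, and embeds the device code (as PTX and/or binary) in the resulting executable.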

QUESTIONS?