Module 2: Introduction to CUDA C


ECE 8823A GPU Architectures

Objective

- Understand the major elements of a CUDA program
- Introduce the basic constructs of the programming model
- Illustrate the preceding with a simple but complete CUDA program

Reading Assignment

- Kirk and Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Chapter 3
- CUDA Programming Guide: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#abstract

CUDA/OpenCL Execution Model

- Reflects a multicore processor + GPGPU execution model
- Adds device functions and declarations to stock C programs
- Compilation is separated into host and GPU paths

Compiling a CUDA Program

Integrated C programs with CUDA extensions are fed to the NVCC compiler, which splits the source: host code goes to the host C compiler/linker, while device code (PTX) goes to the device just-in-time compiler. Together they target a heterogeneous computing platform with CPUs and GPUs.

CUDA/OpenCL Execution Model

An integrated host+device application is a single C program:

- Serial or modestly parallel parts run in host C code
- Highly parallel parts run in device SPMD kernel C code

(host)   Serial Code
(device) Parallel Kernel   KernelA<<< nBlk, nTid >>>(args);
(host)   Serial Code
(device) Parallel Kernel   KernelB<<< nBlk, nTid >>>(args);

CUDA/OpenCL Programming Model

The software stack layers the user application above the operating system, which drives both the CPU (host tasks) and the GPU (GPU tasks). Each kernel is launched over an N-dimensional range: a grid of threads, each operating over a partition of the data. Note: each thread executes the same kernel code!

Vector Addition: Conceptual View

Each element of vector C is the pair-wise sum of the corresponding elements of vectors A and B:

C[i] = A[i] + B[i]   for i = 0, 1, ..., N-1

A Simple Thread Model

Threads are structured as an array partitioned into thread blocks: within a block, threadIdx runs from 0 to blockDim-1, and the blocks themselves are numbered blockIdx = 0 to gridDim-1. A unique id can be computed for each thread:

id = blockIdx * blockDim + threadIdx;

This id is used for workload partitioning.

Using Arrays of Parallel Threads

- A CUDA kernel is executed by a grid (array) of threads (SPMD)
- All threads in a grid run the same kernel code
- Each thread has an index that it uses to compute memory addresses and make control decisions:

i = blockIdx.x * blockDim.x + threadIdx.x;
C_d[i] = A_d[i] + B_d[i];

- Thread blocks can be multidimensional arrays of threads
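To make the id computation concrete, here is a minimal sketch (an assumption, not from the slides) in which every thread prints the global id derived from its block and thread indices; the kernel name whoAmI and the launch shape are illustrative:

#include <stdio.h>

__global__ void whoAmI()
{
    // Same formula as above, using the built-in index variables
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    printf("block %d, thread %d -> global id %d\n",
           blockIdx.x, threadIdx.x, id);
}

int main()
{
    whoAmI<<<2, 4>>>();        // 2 blocks of 4 threads: ids 0..7
    cudaDeviceSynchronize();   // wait for the device printf output
    return 0;
}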

Thread Blocks: Scalable Cooperation

- Divide the thread array into multiple blocks
- Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization
- Threads in different blocks cannot cooperate directly (only through global memory)

Every block of 256 threads executes the same two statements:

i = blockIdx.x * blockDim.x + threadIdx.x;
C_d[i] = A_d[i] + B_d[i];

(c) David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011, ECE408/CS483, University of Illinois, Urbana-Champaign

blockIdx and threadIdx

- Each thread uses its indices to decide what data to work on
  - blockIdx: 1D, 2D, or 3D (3D since CUDA 4.0)
  - threadIdx: 1D, 2D, or 3D
- Multidimensional indices simplify memory addressing when processing multidimensional data, for example:
  - Image processing
  - Solving PDEs on volumes

(Figure: the host launches Kernel 1 on a 2D grid of blocks and Kernel 2 on a grid whose block (1,1) is a 4x2x2 array of threads. Courtesy: NVIDIA)
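As a sketch of such multidimensional indexing (an illustration, not from the slides; the kernel name and parameters are hypothetical), a kernel that scales every pixel of a 2D image might compute its row and column like this:

__global__ void scaleImage(float *img, float s, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // x indexes columns
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // y indexes rows
    if (col < width && row < height)                   // guard the edges
        img[row * width + col] *= s;                   // row-major addressing
}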

Data Partitioning

Partition multidimensional data across a multidimensional grid of threads. (Figure: the same grid/block hierarchy as above. Courtesy: NVIDIA)

Vector Addition: Traditional C Code

// Compute vector sum C = A + B
void vecAdd(float* A, float* B, float* C, int n)
{
    for (int i = 0; i < n; i++)
        C[i] = A[i] + B[i];
}

int main()
{
    // Memory allocation for A_h, B_h, and C_h
    // I/O to read A_h and B_h, N elements
    vecAdd(A_h, B_h, C_h, N);
}

Heterogeneous Computing: vecAdd Host Code

#include <cuda.h>

void vecAdd(float* A, float* B, float* C, int n)
{
    int size = n * sizeof(float);
    float *A_d, *B_d, *C_d;

    // 1. Allocate device memory for A, B, and C;
    //    copy A and B to device memory

    // 2. Kernel launch code to have the device
    //    perform the actual vector addition

    // 3. Copy C from the device memory;
    //    free the device vectors
}

Part 1 copies inputs from host memory (CPU) to device memory, Part 2 runs on the GPU, and Part 3 copies the result back from device memory to host memory.

Partial Overview of CUDA Memories

- Device code can:
  - R/W per-thread registers
  - R/W per-grid global memory
- Host code can:
  - Transfer data to/from per-grid global memory

We will cover more later.

CUDA Device Memory Management API Functions

- cudaMalloc()
  - Allocates an object in the device global memory
  - Two parameters:
    - Address of a pointer to the allocated object
    - Size of the allocated object in bytes
- cudaFree()
  - Frees an object from the device global memory
  - One parameter: pointer to the freed object

Host-Device Data Transfer API Functions

- cudaMemcpy()
  - Memory data transfer
  - Requires four parameters:
    - Pointer to destination
    - Pointer to source
    - Number of bytes copied
    - Type/direction of transfer
  - Transfer to the device is asynchronous
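A minimal sketch of these three calls used together (an assumption, not from the slides; A_h and n are illustrative names), including the basic error check the API supports:

int size = n * sizeof(float);
float *A_d;

cudaError_t err = cudaMalloc((void **) &A_d, size);   // allocate on device
if (err != cudaSuccess)
    printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));

cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice);   // host -> device
// ... launch kernel(s) that use A_d ...
cudaMemcpy(A_h, A_d, size, cudaMemcpyDeviceToHost);   // device -> host
cudaFree(A_d);                                        // release device memory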

vecAdd Host Code

void vecAdd(float* A, float* B, float* C, int n)
{
    int size = n * sizeof(float);
    float *A_d, *B_d, *C_d;

    // 1. Transfer A and B to device memory
    cudaMalloc((void **) &A_d, size);
    cudaMemcpy(A_d, A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &B_d, size);
    cudaMemcpy(B_d, B, size, cudaMemcpyHostToDevice);

    // Allocate device memory for C
    cudaMalloc((void **) &C_d, size);

    // 2. Kernel invocation code (to be shown later)

    // 3. Transfer C from device to host
    cudaMemcpy(C, C_d, size, cudaMemcpyDeviceToHost);

    // Free device memory for A, B, C
    cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);
}

Example: Vector Addition Kernel (Device Code)

// Compute vector sum C = A + B
// Each thread performs one pair-wise addition
__global__
void vecAddKernel(float* A_d, float* B_d, float* C_d, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C_d[i] = A_d[i] + B_d[i];
}

void vecAdd(float* A, float* B, float* C, int n)
{
    // A_d, B_d, C_d allocations and copies omitted
    // Run ceil(n/256) blocks of 256 threads each
    vecAddKernel<<<ceil(n/256.0), 256>>>(A_d, B_d, C_d, n);
}
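For completeness, here is a minimal sketch of a host program that exercises vecAdd (an assumption, not part of the slides; N and the initialization values are arbitrary):

#include <stdio.h>
#include <stdlib.h>

#define N 1024

int main()
{
    // Allocate and initialize host vectors
    float *A_h = (float *) malloc(N * sizeof(float));
    float *B_h = (float *) malloc(N * sizeof(float));
    float *C_h = (float *) malloc(N * sizeof(float));
    for (int i = 0; i < N; i++) { A_h[i] = i; B_h[i] = 2.0f * i; }

    vecAdd(A_h, B_h, C_h, N);            // host wrapper defined above

    printf("C_h[10] = %f\n", C_h[10]);   // expect 30.0

    free(A_h); free(B_h); free(C_h);
    return 0;
}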

Example: Vector Addition Kernel (Host Code)

The host side of the same example; the <<< >>> brackets carry the execution configuration:

void vecAdd(float* A, float* B, float* C, int n)
{
    // A_d, B_d, C_d allocations and copies omitted
    // Run ceil(n/256) blocks of 256 threads each
    vecAddKernel<<<ceil(n/256.0), 256>>>(A_d, B_d, C_d, n);
}

More on Kernel Launch (Host Code)

void vecAdd(float* A, float* B, float* C, int n)
{
    // A_d, B_d, C_d allocations and copies omitted
    // Run ceil(n/256) blocks of 256 threads each
    dim3 DimGrid(n/256, 1, 1);
    if (n % 256) DimGrid.x++;
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
}

Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking.
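A minimal sketch of such explicit synchronization (an assumption, not from the slides):

vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
cudaDeviceSynchronize();   // block the host until the kernel has finished
// Note: a cudaMemcpy on the default stream also waits for prior kernels,
// which is why the simple vecAdd host code above is still correct.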

Kernel Execution in a Nutshell

__host__
void vecAdd()
{
    dim3 DimGrid(ceil(n/256.0), 1, 1);
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
}

__global__
void vecAddKernel(float *A_d, float *B_d, float *C_d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C_d[i] = A_d[i] + B_d[i];
}

The launch produces a kernel made up of blocks Blk 0 through Blk N-1, which are scheduled onto the GPU's multiprocessors M0 ... Mk, all sharing the GPU RAM.

More on CUDA Function Declarations

                                      Executed on:   Only callable from:
  __device__ float DeviceFunc()       device         device
  __global__ void  KernelFunc()       device         host
  __host__   float HostFunc()         host           host

- __global__ defines a kernel function
  - Each "__" consists of two underscore characters
  - A kernel function must return void
- __device__ and __host__ can be used together
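A minimal sketch of combining the qualifiers (an assumption, not from the slides; square and squareKernel are hypothetical names). A function declared __host__ __device__ is compiled for both the CPU and the GPU, so the same helper can be called from main() and from a kernel:

__host__ __device__ float square(float x)
{
    return x * x;   // identical code runs on host or device
}

__global__ void squareKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = square(data[i]);   // device-side call
}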

Compiling a CUDA Program (Recap)

Integrated C programs with CUDA extensions are fed to the NVCC compiler: host code goes to the host C compiler/linker, and device code (PTX) goes to the device just-in-time compiler, together targeting a heterogeneous computing platform with CPUs and GPUs.

The ISA

An Instruction Set Architecture (ISA) is a contract between the hardware and the software. As the name suggests, it is the set of instructions that the architecture (hardware) can execute.
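To see the device half of this split, nvcc can emit the PTX directly; a minimal sketch (an assumption, not from the slides; the file names are illustrative):

nvcc -ptx vecAdd.cu -o vecAdd.ptx   # emit PTX for the device code only
nvcc vecAdd.cu -o vecAdd            # full build: host code + embedded device code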

PTX ISA Major Features

- Set of predefined, read-only variables
  - E.g., %tid, %ntid, %ctaid, etc.
  - Some of these are multidimensional, e.g., %tid.x, %tid.y, %tid.z
- Includes architecture variables, e.g., %laneid, %warpid
  - Can be used for auto-tuning of code
- Includes undefined performance counters

PTX ISA Major Features (2)

- Multiple address spaces
  - Register, parameter, constant, global, etc.
  - More later
- Predicated instruction execution
- More in PTX ISA v4.2
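As an illustrative sketch (simplified by hand, not actual compiler output), the index computation and guarded store in vecAddKernel might lower to PTX roughly as follows, using the predefined variables above and a predicated store; the register names are arbitrary:

mov.u32     %r1, %ctaid.x;        // blockIdx.x
mov.u32     %r2, %ntid.x;         // blockDim.x
mov.u32     %r3, %tid.x;          // threadIdx.x
mad.lo.s32  %r4, %r1, %r2, %r3;   // i = blockIdx.x * blockDim.x + threadIdx.x
setp.lt.s32 %p1, %r4, %r5;        // %p1 = (i < n), with n in %r5
@%p1 st.global.f32 [%rd1], %f1;   // store C_d[i] only when %p1 is true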

QUESTIONS?