GPU Programming. Lecture 2: CUDA C Basics. Miaoqing Huang University of Arkansas 1 / 34



2 / 34 Outline: Evolution of NVIDIA GPUs; CUDA Basics; Detailed Steps: Device Memories and Data Transfer, Kernel Functions and Threading

3 / 34 Outline: Evolution of NVIDIA GPUs; CUDA Basics; Detailed Steps: Device Memories and Data Transfer, Kernel Functions and Threading

4 / 34 Architecture of G80 GPU: 128 streaming processors in 16 streaming multiprocessors. [Figure: block diagram of the G80 showing its streaming multiprocessors and special function units (SFUs)]

5 / 34 Architecture of GT200 GPU (e.g., GeForce GTX 295): 240 streaming processors in 30 streaming multiprocessors. [Figure: block diagram of the GT200 showing its streaming multiprocessors and special function units (SFUs)]

6 / 34 Architecture of Fermi GPU (e.g., GeForce GTX 480, Tesla C2075): 512 Fermi streaming processors in 16 streaming multiprocessors. [Figure: a 16-SM Fermi GPU with six DRAM interfaces, a shared L2 cache, the GigaThread engine, and the host interface. Each Fermi streaming multiprocessor (SM) contains an instruction cache, two warp schedulers each with a dispatch unit, a register file of 32,768 32-bit registers, 32 CUDA cores (each with a dispatch port, operand collector, FP unit, INT unit, and result queue), 16 load/store (LD/ST) units, 4 special function units, an interconnect network, and 64 KB of shared memory / L1 cache]

7 / 34 Architecture of Kepler GPU (e.g., Tesla K20) 2,880 streaming processors in 15 streaming multiprocessors

8 / 34 Architecture of Kepler Streaming Multiprocessor Each streaming multiprocessor contains 192 single-precision cores and 64 double-precision cores

9 / 34 Comparison among Three Architectures

                               G80            GT200          Fermi
Transistors                    681 million    1.4 billion    3.0 billion
CUDA Cores                     128            240            512
Double Precision Capability    None           30 FMAs/clk    256 FMAs/clk
Single Precision Capability    128 MADs/clk   240 MADs/clk   512 FMAs/clk
Special Function Units / SM    2              2              4
Shared Memory / SM             16 KB          16 KB          48 KB / 16 KB
L1 Cache / SM                  None           None           16 KB / 48 KB
L2 Cache                       None           None           768 KB
ECC Support                    No             No             Yes

Multiply-Add (MAD): A x B = Product (extra digits truncated), then Product + C = Result.
Fused Multiply-Add (FMA): A x B = Product (all digits retained), then Product + C = Result.
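
To make the MAD/FMA distinction concrete, here is a small host-side C sketch (my own illustration, not from the slides) contrasting a separately rounded multiply-add with the C standard library's fused fmaf(); the constants are chosen only to expose the extra rounding step.

#include <math.h>
#include <stdio.h>

int main(void) {
    /* a * b = 1 + 2^-11 + 2^-24 exactly; the trailing 2^-24 does not fit in a
       float product, so a separate multiply rounds it away before C is added. */
    float a = 1.0f + 0x1p-12f;
    float b = 1.0f + 0x1p-12f;
    float c = -(1.0f + 0x1p-11f);

    float prod  = a * b;           /* rounded product (MAD-style) */
    float mad   = prod + c;        /* prints 0: the low product bits were lost */
    float fused = fmaf(a, b, c);   /* prints ~5.96e-08 (= 2^-24): one rounding */

    printf("separate multiply-add: %g\n", mad);
    printf("fused multiply-add:    %g\n", fused);
    return 0;
}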

10 / 34 Fermi vs. Kepler: Compute Capability of Fermi and Kepler GPUs

                                        FERMI       FERMI       KEPLER         KEPLER
                                        GF100       GF104       GK104          GK110
Compute Capability                      2.0         2.1         3.0            3.5
Threads / Warp                          32          32          32             32
Max Warps / Multiprocessor              48          48          64             64
Max Threads / Multiprocessor            1536        1536        2048           2048
Max Thread Blocks / Multiprocessor      8           8           16             16
32-bit Registers / Multiprocessor       32768       32768       65536          65536
Max Registers / Thread                  63          63          63             255
Max Threads / Thread Block              1024        1024        1024           1024
Shared Memory Configurations (bytes)    16K, 48K    16K, 48K    16K, 32K, 48K  16K, 32K, 48K
Max X Grid Dimension                    2^16 - 1    2^16 - 1    2^32 - 1       2^32 - 1
Hyper-Q                                 No          No          No             Yes
Dynamic Parallelism                     No          No          No             Yes

11 / 34 Compute Capability. The compute capability of a device is defined by a major revision number and a minor revision number. Devices with the same major revision number are of the same core architecture: Kepler architecture: 3.x; Fermi architecture: 2.x; prior devices: 1.x. In the NVIDIA CUDA C Programming Guide (v7.5), Appendix A lists all CUDA-enabled devices along with their compute capability, and Appendix G gives the technical specifications of each compute capability.
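
As a practical aside (not shown in the slides), the compute capability of the installed devices can be queried at run time with the CUDA runtime API; a minimal sketch:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);   // fills in name, major, minor, etc.
        printf("Device %d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}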

12 / 34 Outline: Evolution of NVIDIA GPUs; CUDA Basics; Detailed Steps: Device Memories and Data Transfer, Kernel Functions and Threading

13 / 34 Data Parallelism: Square Matrix Multiplication Example. P = M x N, where M, N, and P are square matrices of size width x width. [Figure: the matrices M, N, and P]

14 / 34 Data Parallelism: Square Matrix Multiplication Example. P = M x N, where M, N, and P are square matrices of size width x width. [Figure: the matrices M, N, and P] The approach in C:

for (i = 0; i < width; i++) {
    for (j = 0; j < width; j++) {
        ...
        P[i][j] = ...;
        ...
    }
}

15 / 34 Data Parallelism: Square Matrix Multiplication Example. P = M x N, where M, N, and P are square matrices of size width x width. [Figure: the matrices M, N, and P] The approach in C:

for (i = 0; i < width; i++) {
    for (j = 0; j < width; j++) {
        ...
        P[i][j] = ...;
        ...
    }
}

The approach in CUDA (the straightforward way): one thread calculates one element of P; issue width x width threads simultaneously.

16 / 34 The Heterogeneous Platform. A typical GPGPU-capable platform consists of: one or more microprocessors (CPUs), the host; one or more GPUs, the devices. GPU devices are connected to CPUs through the PCI Express (PCIe) bus. A CUDA program consists of: the code running on the CPU (the host side) and the code running on the GPU (the device side). [Figure: two CPUs interconnected by QPI (25.6 GB/s); each GPU attached over a 16x PCIe 2.0 link (5.8 GB/s unidirectional); GPU memory bandwidths of 102 GB/s and 144 GB/s]

17 / 34 CUDA Program Structure. An integrated host + device application is a single C program: the sequential or modestly parallel parts run as host C code, and the highly parallel parts run as device SPMD kernel C code. [Figure: execution alternates between serial host code and parallel device kernels:
Serial code (host)
Parallel kernel (device), Grid 0: KernelA<<< nBlk, nTid >>>(args); ...
Serial code (host)
Parallel kernel (device), Grid 1: KernelB<<< nBlk, nTid >>>(args); ...]
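
A minimal compilable sketch of this alternating structure (my own illustration with hypothetical kernels; the slide only shows the launch syntax):

#include <cuda_runtime.h>

// Hypothetical kernels standing in for the "highly parallel parts"
__global__ void KernelA(float *data) { data[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f; }
__global__ void KernelB(float *data) { data[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f; }

int main(void) {
    const int nBlk = 4, nTid = 256;
    float *d_data;
    cudaMalloc((void**)&d_data, nBlk * nTid * sizeof(float));
    cudaMemset(d_data, 0, nBlk * nTid * sizeof(float));

    // ... serial host code ...
    KernelA<<<nBlk, nTid>>>(d_data);     // grid 0 of parallel threads
    cudaDeviceSynchronize();
    // ... serial host code ...
    KernelB<<<nBlk, nTid>>>(d_data);     // grid 1 of parallel threads
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}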

18 / 34 CUDA Thread Organization. A kernel is implemented as a grid of threads; the threads in a grid are further organized into blocks; a grid can be up to three-dimensional. A thread block is a batch of threads that can cooperate with each other (continued on the next slide). [Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid contains blocks such as Block (0,0), (0,1), (1,0), (1,1); Block (1,1) is expanded into its threads, Thread (0,0,0) through Thread (3,1,0). Courtesy: NVIDIA]

19 / 34 CUDA Thread Organization. A kernel is implemented as a grid of threads; the threads in a grid are further organized into blocks; a grid can be up to three-dimensional. A thread block is a batch of threads that can cooperate with each other by: synchronizing their execution; efficiently sharing data through shared memory. A block can be one-, two-, or three-dimensional. Each block and each thread has its own ID. [Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid contains blocks such as Block (0,0), (0,1), (1,0), (1,1); Block (1,1) is expanded into its threads, Thread (0,0,0) through Thread (3,1,0). Courtesy: NVIDIA]
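
A small kernel sketch (my addition, not in the slides) showing how the block and thread IDs introduced here are typically combined into one global index in the one-dimensional case:

__global__ void scale(float *a, int n) {
    // Combine the block ID and the thread ID into a unique global index
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)          // guard: the grid may contain more threads than elements
        a[i] *= 2.0f;
}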

20 / 34 Matrix Multiplication: the main() function. [Figure: the matrices M, N, and P]

int main(void) {
    // 1. Allocate and initialize the matrices M, N, P
    //    I/O to read the input matrices M and N
    ...

    // 2. M*N on the processor
    MatrixMultiplication(M, N, P, width);

    // 3. I/O to write the output matrix P
    //    Free matrices M, N, P
    ...

    return 0;
}

21 / 34 Matrix Multiplication: A Simple Host Version in C

void MatrixMultiplication(float* M, float* N, float* P, int width)
{
    for (int i = 0; i < width; i++) {
        for (int j = 0; j < width; j++) {
            float sum = 0;
            for (int k = 0; k < width; k++) {
                float a = M[i*width + k];
                float b = N[k*width + j];
                sum += a * b;
            }
            P[i*width + j] = sum;
        }
    }
}

[Figure: a 4 x 4 matrix M with elements M_0,0 through M_3,3 stored in row-major order; element M_i,j (row i, column j) is accessed as M[i*width + j]]

22 / 34 Matrix Multiplication: Move Computation to the Device

void MatrixMultiplication(float* M, float* N, float* P, int width)
{
    int size = width*width*sizeof(float);
    float *Md, *Nd, *Pd;
    ...

    // 1. Allocate device memory for M, N, and P
    //    Copy M and N to the allocated device memory locations

    // 2. Kernel invocation code -- to have the device perform
    //    the actual matrix multiplication

    // 3. Copy P from the device memory
    //    Free device matrices
}

23 / 34 Outline: Evolution of NVIDIA GPUs; CUDA Basics; Detailed Steps: Device Memories and Data Transfer, Kernel Functions and Threading

24 / 34 Memory Hierarchy on Device
Global memory: the main means of communicating data between host and device; long latency access.
Shared memory and registers: short latency.
Registers: per-thread local variables.
Data access: device code can read/write per-thread registers, per-block shared memory, and per-grid global memory; host code can transfer data to/from per-grid global memory.
[Figure: a grid of blocks; each block has its own shared memory, and each thread within a block has its own registers; all blocks, as well as the host, access the global memory]
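
A short kernel sketch (my own illustration, not from the slides) showing where variables of each kind live; it assumes each block is launched with exactly 64 threads and reverses its own 64-element chunk of the array:

__global__ void reverse64(float *gdata)          // gdata: per-grid global memory
{
    __shared__ float tile[64];                   // per-block shared memory
    int base = blockIdx.x * 64;                  // each block handles its own chunk
    float tmp = gdata[base + threadIdx.x];       // tmp lives in a per-thread register
    tile[threadIdx.x] = tmp;
    __syncthreads();                             // synchronize the threads of the block
    gdata[base + threadIdx.x] = tile[63 - threadIdx.x];
}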

25 / 34 CUDA Device Memory Allocation
cudaMalloc(): allocates an object in the device global memory. Requires two parameters: (1) the address of a pointer to the allocated object, (2) the size of the allocated object.
cudaFree(): frees an object from the device global memory; takes the pointer to the freed object.

float* Md;   // "d" indicates device data
int size = width*width*sizeof(float);
cudaMalloc((void**)&Md, size);
cudaFree(Md);

[Figure: the device memory hierarchy, as on slide 24]

26 / 34 CUDA Host-Device Data Transfer
cudaMemcpy(): memory data transfer. Requires four parameters: (1) pointer to destination, (2) pointer to source, (3) number of bytes copied, (4) type of transfer.
Types of transfer: cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice.

cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);

[Figure: the device memory hierarchy, as on slide 24]

27 / 34 Matrix Multiplication: After Integrating the Data Transfer

void MatrixMultiplication(float* M, float* N, float* P, int width)
{
    int size = width*width*sizeof(float);
    float *Md, *Nd, *Pd;

    // 1. Allocate and load M, N to device memory
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
    // Allocate P on the device
    cudaMalloc((void**)&Pd, size);

    // 2. Kernel invocation code -- to be shown later

    // 3. Read P from the device
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
    // Free device matrices
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}

28 / 34 CUDA Thread Organization. A kernel is implemented as a grid of threads; the threads in a grid are further organized into blocks; a grid can be up to three-dimensional. A thread block is a batch of threads that can cooperate with each other by: synchronizing their execution; efficiently sharing data through shared memory. A block can be one-, two-, or three-dimensional. Each block and each thread has its own ID. [Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid contains blocks such as Block (0,0), (0,1), (1,0), (1,1); Block (1,1) is expanded into its threads, Thread (0,0,0) through Thread (3,1,0). Courtesy: NVIDIA]

29 / 34 Index (i.e., Coordinates) of Block and Thread. Each block and each thread is assigned an index, i.e., blockIdx and threadIdx: blockIdx.x, blockIdx.y, blockIdx.z; threadIdx.x, threadIdx.y, threadIdx.z. [Figure: a block of threads laid out along the x, y, z axes, Thread (0,0,0) through Thread (3,1,0) with per-thread coordinates (t_x, t_y, t_z), next to a 4 x 4 matrix M with elements M_0,0 through M_3,3]

30 / 34 Index (i.e., Coordinates) of Block and Thread. Each block and each thread is assigned an index, i.e., blockIdx and threadIdx: blockIdx.x, blockIdx.y, blockIdx.z; threadIdx.x, threadIdx.y, threadIdx.z. threadIdx.y selects the row, threadIdx.x selects the column. [Figure: a block of threads laid out along the x, y, z axes, Thread (0,0,0) through Thread (3,1,0) with per-thread coordinates (t_x, t_y, t_z), next to a 4 x 4 matrix M with elements M_0,0 through M_3,3]
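
The kernel on slide 32 uses a single block, so threadIdx alone identifies the element; as a sketch of how the same row/column mapping generalizes to a grid of several blocks (my addition, with a hypothetical kernel name):

__global__ void MatrixMulKernelMultiBlock(float* Pd, int width)
{
    // Combine block and thread indices into matrix coordinates
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // threadIdx.y -> row
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // threadIdx.x -> column
    if (row < width && col < width)
        Pd[row * width + col] = 0.0f;                  // placeholder for the per-thread result
}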

31 / 34 Define the Dimension of Grid and Block. Use the pre-defined type dim3 to define the dimensions of the grid and the block. Use gridDim and blockDim to read the dimensions of the grid and the block inside a kernel.

dim3 dimGrid(width_g, height_g, depth_g);
dim3 dimBlock(width_b, height_b, depth_b);

Total number of threads issued = width_g x height_g x depth_g x width_b x height_b x depth_b

Launch the device computation threads:
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);
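
For the single-block version of the example used in these slides, the launch configuration could be instantiated as in the sketch below (my own filling-in of step 2 from slide 27; it assumes width <= 32 so that width*width stays within the 1024-threads-per-block limit):

// 2. Kernel invocation code
dim3 dimGrid(1, 1, 1);             // a single block covers the whole width x width matrix
dim3 dimBlock(width, width, 1);    // one thread per element of P (requires width <= 32)
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);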

32 / 34 Kernel Function Specification

// Matrix multiplication kernel -- per-thread code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int width)
{
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Pvalue is used to store the element of the matrix
    // that is computed by the thread
    float Pvalue = 0;
    for (int k = 0; k < width; ++k) {
        float Melement = Md[ty*width + k];
        float Nelement = Nd[k*width + tx];
        Pvalue += Melement * Nelement;
    }
    Pd[ty*width + tx] = Pvalue;
}

33 / 34 CUDA Function Declarations

                                   Executed on   Only callable from
__device__ float DeviceFunc()      device        device
__global__ void  KernelFunc()      device        host
__host__   float HostFunc()        host          host

__global__ defines a kernel function; it must be a void function.
__device__ and __host__ can be used together: the compiler generates two versions of the same function, one for the host and one for the device.
For functions executed on the device:
  __global__ functions do not support recursion; __device__ functions support recursion only in device code compiled for devices of compute capability 2.x and higher.
  No static variable declarations inside the function.
  No variable number of arguments.
  No indirect function calls through function pointers.
All functions are __host__ functions by default.

34 / 34 Check the Return Code of CUDA APIs

cudaError_t cuda_ret;
cuda_ret = cudaMalloc((void**)&A_d, sizeof(float)*n);
if (cuda_ret != cudaSuccess) {
    printf("Some error message");
    exit(-1);
}
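
In practice this check is often wrapped in a small macro so that every CUDA call can be checked with one line; a common idiom (my own sketch, not from the slides):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Wrap any CUDA runtime call and abort with a readable message on failure
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err = (call);                                      \
        if (err != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);      \
            exit(-1);                                                  \
        }                                                              \
    } while (0)

// Example use:
// CUDA_CHECK(cudaMalloc((void**)&A_d, sizeof(float)*n));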