
COSC 462
CUDA Basics: Blocks, Grids, and Threads
Piotr Luszczek
November 1, 2017

1/10

Minimal CUDA Code Example

    #include <stdio.h>

    __global__ void sum(double x, double y, double *z) {
        *z = x + y;
    }

    int main(void) {
        double *device_z, host_z;
        cudaMalloc( &device_z, sizeof(double) );
        sum<<<1,1>>>(2.0, 3.0, device_z);
        cudaMemcpy( &host_z, device_z, sizeof(double), cudaMemcpyDeviceToHost );
        printf( "%g\n", host_z );
        cudaFree(device_z);
        return 0;
    }

    $ nvcc sum.cu -o sum
    $ ./sum
    5

2/10
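
A practical note not on the original slide: every CUDA runtime call returns a cudaError_t, and failures are otherwise silent. A minimal error-checking sketch, using a hypothetical CHECK macro (the macro is ours; cudaGetErrorString is the real runtime function):

    #include <stdio.h>
    #include <stdlib.h>

    // Hypothetical convenience macro (not part of the CUDA API):
    // abort with a readable message if a runtime call fails.
    #define CHECK(call) do {                                  \
        cudaError_t err_ = (call);                            \
        if (err_ != cudaSuccess) {                            \
            fprintf(stderr, "CUDA error: %s\n",               \
                    cudaGetErrorString(err_));                \
            exit(EXIT_FAILURE);                               \
        }                                                     \
    } while (0)

    // Usage with the example above:
    //   CHECK( cudaMalloc( &device_z, sizeof(double) ) );
    //   CHECK( cudaMemcpy( &host_z, device_z, sizeof(double),
    //                      cudaMemcpyDeviceToHost ) );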

Structure of CUDA Code

    #include <stdio.h>

    // parallel function (GPU)
    __global__ void sum(double x, double y, double *z) {
        *z = x + y;
    }

    // sequential function (CPU)
    void sum_cpu(double x, double y, double *z) {
        *z = x + y;
    }

    // sequential function (CPU)
    int main(void) {
        double *dev_z, hst_z;
        cudaMalloc( &dev_z, sizeof(double) );
        // launch parallel code (CPU -> GPU)
        sum<<<1,1>>>(2.0, 3.0, dev_z);
        cudaMemcpy( &hst_z, dev_z, sizeof(double), cudaMemcpyDeviceToHost );
        printf( "%g\n", hst_z );
        cudaFree(dev_z);
        return 0;
    }

3/10

Introducing Parallelism to CUDA Code

Two points where parallelism enters the code:
- Kernel invocation
      sum<<< 1,1>>>( a, b, c )
      sum<<<10,1>>>( a, b, c )
- Kernel execution
      __global__ void sum(double *a, double *b, double *c)
          c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x]

CUDA makes the connection between the invocation sum<<<10,1>>> and the execution through its index blockIdx.x (the sketch below makes this visible).

Recall GPU massive parallelism:
- Many CUDA cores
- Many CUDA threads
- Many GPU SM (or SMX) units

4/10
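
One way to watch the invocation-to-index connection at work (a sketch added here, not on the slides) is to print blockIdx.x from the device; the CUDA runtime supports printf inside kernels on all modern GPUs:

    #include <stdio.h>

    __global__ void who_am_i(void) {
        printf("hello from block %d\n", blockIdx.x); // device-side printf
    }

    int main(void) {
        who_am_i<<<10,1>>>();     // ten blocks, one thread each
        cudaDeviceSynchronize();  // wait for the kernel and flush its output
        return 0;
    }

The ten lines may appear in any order, which previews the next slide: blocks execute in parallel with no ordering guarantees.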

CUDA Parallelism with Blocks

    #include <stdio.h>
    #include <stdlib.h>

    int N = 100, SN = N * sizeof(double);

    __global__ void sum(double *a, double *b, double *c) {
        c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x]; // no loop!
    }

    int main(void) {
        double *dev_a, *dev_b, *dev_c, *hst_a, *hst_b, *hst_c;
        cudaMalloc( &dev_a, SN ); hst_a = (double *)calloc(N, sizeof(double));
        cudaMalloc( &dev_b, SN ); hst_b = (double *)calloc(N, sizeof(double));
        cudaMalloc( &dev_c, SN ); hst_c = (double *)malloc(SN);
        cudaMemcpy( dev_a, hst_a, SN, cudaMemcpyHostToDevice );
        cudaMemcpy( dev_b, hst_b, SN, cudaMemcpyHostToDevice );
        sum<<<10,1>>>(dev_a, dev_b, dev_c); // only 10 elements will be used
        cudaMemcpy( hst_c, dev_c, SN, cudaMemcpyDeviceToHost );
        for (int i=0; i<10; ++i)
            printf( "%g\n", hst_c[i] );
        cudaFree(dev_a); free(hst_a);
        cudaFree(dev_b); free(hst_b);
        cudaFree(dev_c); free(hst_c);
        return 0;
    }

5/10

Details on Execution of Blocks on GPU

    // BLOCK 0           // BLOCK 1
    c[0]=a[0]+b[0];      c[1]=a[1]+b[1];
    // BLOCK 2           // BLOCK 3
    c[2]=a[2]+b[2];      c[3]=a[3]+b[3];
    // BLOCK 4           // BLOCK 5
    c[4]=a[4]+b[4];      c[5]=a[5]+b[5];
    // BLOCK 6           // BLOCK 7
    c[6]=a[6]+b[6];      c[7]=a[7]+b[7];
    // BLOCK 8           // BLOCK 9
    c[8]=a[8]+b[8];      c[9]=a[9]+b[9];

- Blocks are one level of parallelism; there are other levels
- Blocks execute in parallel
- Synchronization is:
  - Explicit (special function calls, etc.)
  - Implicit (memory access, etc.)
  - Mixed (atomics, etc.)
- The total number of available blocks is hardware specific
  - CUDA offers inquiry functions to get the maximum block count (see the sketch below)

6/10
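
The inquiry functions mentioned above come from the CUDA runtime; a minimal sketch that queries the block and grid limits of device 0 via cudaGetDeviceProperties:

    #include <stdio.h>

    int main(void) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0); // properties of device 0
        printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("max grid size: %d x %d x %d\n",
               prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
        return 0;
    }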

Adding Thread Parallelism to CUDA Code

Kernel invocation:
    sum<<<10, 1>>>( x, y, z ) // block-parallel
    sum<<< 1,10>>>( x, y, z ) // thread-parallel
Kernel execution:
    z[threadIdx.x] = x[threadIdx.x] + y[threadIdx.x]

Consistency of syntax:
- Minimum changes to switch from blocks to threads
- Similar naming for blocks and threads

7/10

CUDA Parallelism with Threads

    #include <stdio.h>
    #include <stdlib.h>

    int N = 100, SN = N * sizeof(double);

    __global__ void sum(double *a, double *b, double *c) {
        c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x]; // no loop!
    }

    int main(void) { // sequential function (CPU)
        double *dev_a, *dev_b, *dev_c, *hst_a, *hst_b, *hst_c;
        cudaMalloc( &dev_a, SN ); hst_a = (double *)calloc(N, sizeof(double));
        cudaMalloc( &dev_b, SN ); hst_b = (double *)calloc(N, sizeof(double));
        cudaMalloc( &dev_c, SN ); hst_c = (double *)malloc(SN);
        cudaMemcpy( dev_a, hst_a, SN, cudaMemcpyHostToDevice );
        cudaMemcpy( dev_b, hst_b, SN, cudaMemcpyHostToDevice );
        sum<<<1,10>>>(dev_a, dev_b, dev_c); // one block of 10 threads
        cudaMemcpy( hst_c, dev_c, SN, cudaMemcpyDeviceToHost );
        for (int i=0; i<10; ++i)
            printf( "%g\n", hst_c[i] );
        cudaFree(dev_a); free(hst_a);
        cudaFree(dev_b); free(hst_b);
        cudaFree(dev_c); free(hst_c);
        return 0;
    }

8/10

More on Block and Thread Parallelism

When to use blocks and when to use threads?
- Synchronization between threads is cheaper
- Blocks have higher scheduling overhead
- Block and thread parallelism can be combined (see the sketch below)
- Often it is hard to get a good balance between both
- The exact combination depends on the GPU generation:
  - Tesla, Fermi, Kepler, Maxwell, Pascal, Volta, ...
  - SM/SMX configuration
  - Memory size

9/10
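
A minimal sketch of the combined pattern (an addition, not from the slides): launch many blocks of many threads, derive one global index per element, and guard against the tail when N is not a multiple of the block size:

    __global__ void sum(double *a, double *b, double *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x; // global element index
        if (i < n)                                     // guard the last, partial block
            c[i] = a[i] + b[i];
    }

    // Launch enough 256-thread blocks to cover all N elements:
    //   sum<<<(N + 255) / 256, 256>>>(dev_a, dev_b, dev_c, N);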

Thread Identification Across APIs

POSIX threads:
    pthread_t tid = pthread_self();
MPI:
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
OpenMP:
    int tid = omp_get_thread_num();
    int all = omp_get_num_threads();
CUDA:
    int blkid = blockIdx.x + (blockIdx.y + blockIdx.z * gridDim.y) * gridDim.x;
    int inside_blk_tid = threadIdx.x + (threadIdx.y + threadIdx.z * blockDim.y) * blockDim.x;

10/10
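
For completeness (an addition that follows directly from the two formulas above), the block ID and the in-block thread ID combine into one global thread ID:

    int threads_per_block = blockDim.x * blockDim.y * blockDim.z;
    int global_tid = blkid * threads_per_block + inside_blk_tid;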