CUDA: More on threads, shared memory, synchronization; cuPrintf


cuPrintf

cuPrintf is a library for CUDA developers. Copy the files from /opt/cuprintf into your source code folder.

    #include "cuPrintf.cu"

    __global__ void testKernel(int val)
    {
        cuPrintf("Value is: %d\n", val);
    }

    int main()
    {
        cudaPrintfInit();
        testKernel<<< 2, 3 >>>(10);
        cudaPrintfDisplay(stdout, true);
        cudaPrintfEnd();
        return 0;
    }
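As a usage note (not on the original slide): the <<< 2, 3 >>> launch creates six threads in total, so the buffered output flushed by cudaPrintfDisplay() should contain six copies of the line below, with no guaranteed ordering across threads:

    Value is: 10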

Handling Arbitrarily Long Vectors

The limit is 512 threads per block, so a one-thread-per-element launch fails once the vector size N requires more than 65535 blocks, i.e. N/512 > 65535, or N > 65535 * 512 = 33,553,920 elements. That is pretty big, but a card with up to 4 GB of memory can hold far larger vectors.

Solution: assign a range of data values to each thread instead of having each thread operate on only one value. The next slide shows an easy-to-code kernel for this.

Approach:
- Use a fixed number of blocks and threads per block, ideally sized to keep the GPU's warps full, e.g. 128 or 256 threads per block.
- Each thread processes elements with a stride equal to the total number of threads.

Example with 2 blocks, 3 threads per block, and a 10-element vector:

    Vector index:  0     1     2     3     4     5     6     7     8     9
    Thread:        B0T0  B0T1  B0T2  B1T0  B1T1  B1T2  B0T0  B0T1  B0T2  B1T0

A thread starts work at index (blockIdx.x * blockDim.x) + threadIdx.x. For example, B1T0 starts working at (1*3)+0 = index 3; its next item is at index 3 + TotalThreads = 3 + 6 = 9, where TotalThreads = blockDim.x * gridDim.x.
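To make the striding concrete, here is a small host-side sketch (plain C, not from the original slides) that replays the 2-block, 3-thread example and prints which elements each thread would handle:

    #include <stdio.h>

    /* Illustrative only: simulate the strided assignment for
     * gridDim.x = 2, blockDim.x = 3 and a 10-element vector. */
    int main(void)
    {
        const int gridDim_x = 2, blockDim_x = 3, N = 10;
        for (int block = 0; block < gridDim_x; block++) {
            for (int thread = 0; thread < blockDim_x; thread++) {
                printf("B%dT%d handles:", block, thread);
                for (int tid = block * blockDim_x + thread; tid < N;
                     tid += blockDim_x * gridDim_x)   /* stride = total threads */
                    printf(" %d", tid);
                printf("\n");
            }
        }
        return 0;
    }

Its output reproduces the table above: B0T0 handles 0 and 6, B0T1 handles 1 and 7, ..., B1T0 handles 3 and 9.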

Vector Add Kernel for Arbitrarily Long Vectors

    #define N (100 * 1024)   // Length of vector

    __global__ void add(int *a, int *b, int *c)
    {
        int tid = threadIdx.x + (blockIdx.x * blockDim.x);
        while (tid < N)
        {
            c[tid] = a[tid] + b[tid];
            tid += blockDim.x * gridDim.x;
        }
    }

In main, pick some number of blocks less than N and enough threads per block to fill up warps:

    add<<<128, 128>>>(dev_a, dev_b, dev_c);   // 16384 total threads

G80 Implementation of CUDA Memories

Each thread can:
- Read/write per-thread registers
- Read/write per-thread local memory
- Read/write per-block shared memory
- Read/write per-grid global memory
- Read (only) per-grid constant memory

Shared memory:
- Is shared only among threads in the same block
- Is on chip, not DRAM, so it is fast to access
- Is useful as a software-managed cache or scratchpad
- Must be synchronized when the same values are shared among threads

[Figure: memory-hierarchy diagram showing the host, global and constant memory, and per-block shared memory and per-thread registers for threads in blocks (0,0) and (1,0).]
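As a rough sketch of how these memory spaces appear in code (the names coeff, tile, and memoryDemo are illustrative, not from the slides):

    __constant__ float coeff[16];   // per-grid constant memory, read-only in kernels;
                                    // filled from the host with cudaMemcpyToSymbol()

    __global__ void memoryDemo(float *out)   // out points into per-grid global memory
    {
        int idx = threadIdx.x + blockIdx.x * blockDim.x;   // idx lives in per-thread registers
        __shared__ float tile[128];                        // per-block shared memory
                                                           // (assumes 128 threads per block)
        tile[threadIdx.x] = coeff[threadIdx.x % 16];
        __syncthreads();                                   // synchronize before other threads
                                                           // read what this thread wrote
        out[idx] = tile[threadIdx.x];
    }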

Shared Memory Example: Dot Product

(The book does a more complex version of this in matrix multiply.)

    (x1, x2, x3, x4) . (y1, y2, y3, y4) = x1*y1 + x2*y2 + x3*y3 + x4*y4

When we did this with matrix multiply, we had one thread perform the entire computation for a row and column. The obvious parallelism idea: have thread 0 compute x1*y1, thread 1 compute x2*y2, and so on. We then have to store the individual products somewhere and add up all of the intermediate sums. Use shared memory.

Shared Memory Dot Product

With 2 blocks of 4 threads each and 10-element vectors A and B, elements are assigned to threads with the same striding as before:

    Index:   0     1     2     3     4     5     6     7     8     9
    Thread:  B0T0  B0T1  B0T2  B0T3  B1T0  B1T1  B1T2  B1T3  B0T0  B0T1

So B0T0 computes A[0]*B[0] + A[8]*B[8], B0T1 computes A[1]*B[1] + A[9]*B[9], etc. (This will be easier later with threadsPerBlock a power of 2.)

Each thread stores its result in a per-block shared memory array (called cache here):

    __shared__ float cache[threadsPerBlock];

    Block 0 cache:  [0] B0T0 sum   [1] B0T1 sum   [2] B0T2 sum   [3] B0T3 sum
    Block 1 cache:  [0] B1T0 sum   [1] B1T1 sum   [2] B1T2 sum   [3] B1T3 sum

#include "stdio.h" Kernel Test Code #define N 10 const int THREADS_PER_BLOCK = 4; // Have to be int, not #define; power of 2 const int NUM_BLOCKS = 2; // Have to be int, not #define global void dot(float *a, float *b, float *c) shared float [THREADS_PER_BLOCK]; int tid = threadidx.x + (blockidx.x * blockdim.x); int Index = threadidx.x; float temp = 0; while (tid < N) temp += a[tid] * b[tid]; tid += blockdim.x * griddim.x; // THREADS_PER_BLOCK * NUM_BLOCKS [Index] = temp; if ((blockidx.x == 0) && (threadidx.x == 0)) *c = [Index]; // For a test, only send back result of one thread Main Test Code int main() float a[n], b[n], c[num_blocks]; float *dev_a, *dev_b, *dev_c; // We ll see why c[num_blocks] shortly cudamalloc((void **) &dev_a, N*sizeof(float)); cudamalloc((void **) &dev_b, N*sizeof(float)); cudamalloc((void **) &dev_c, NUM_BLOCKS*sizeof(float)); // Fill arrays for (int i = 0; i < N; i++) a[i] = (float) i; b[i] = (float) i; // Copy data from host to device cudamemcpy(dev_a, a, N*sizeof(float), cudamemcpyhosttodevice); cudamemcpy(dev_b, b, N*sizeof(float), cudamemcpyhosttodevice); dot<<<num_blocks,threads_per_block>>>(dev_a,dev_b,dev_c); // Copy data from device to host cudamemcpy(c, dev_c, NUM_BLOCKS*sizeof(float), cudamemcpydevicetohost); // Output results printf("%f\n", c[0]); < cudafree, return 0 would go here> 5

Accumulating Sums

At this point we have partial products in cache[] in each block that we have to sum together. The easy solution is to copy them all back to the host and let the host add them up, an O(n) operation. If n is small this is the fastest way to go, but we can do pairwise adds in logarithmic time. This is a common parallel algorithm called reduction.

Summation Reduction

With 8 shared entries (each labeled by the index of the value it started with):

    Start:              0        1        2     3     4  5  6  7
    After i = 8/2 = 4:  0+4      1+5      2+6   3+7   4  5  6  7
    After i = 4/2 = 2:  0+4+2+6  1+5+3+7  2+6   3+7   4  5  6  7
    After i = 2/2 = 1:  0+4+2+6+1+5+3+7   1+5+3+7  2+6  3+7  4  5  6  7
    i = 1/2 = 0:        loop ends; entry 0 holds the total

We have to wait for all working threads to finish adding before starting the next iteration.
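As a concrete numeric example (not on the slides): if the eight entries start as 1 2 3 4 5 6 7 8, the first pass (i = 4) leaves 6 8 10 12 in entries 0-3, the second pass (i = 2) leaves 16 20 in entries 0-1, and the third pass (i = 1) leaves the total, 36, in entry 0.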

Summation Reduction Code

At the end of the kernel, after storing temp into cache[Index]:

    int i = blockDim.x / 2;
    while (i > 0)
    {
        if (Index < i)
            cache[Index] += cache[Index + i];
        __syncthreads();
        i /= 2;
    }

We still need to sum the values computed by each block. Since there are (most likely) not too many of these, we just return each block's value to the host and let the host sequentially add them up:

    int i = blockDim.x / 2;
    while (i > 0)
    {
        if (Index < i)
            cache[Index] += cache[Index + i];
        __syncthreads();
        i /= 2;
    }

    if (Index == 0)                     // We're thread 0 in this block
        c[blockIdx.x] = cache[Index];   // Save the block's sum in the array of block results
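Putting the fragments together, the complete dot kernel would look roughly like the sketch below (the shared-array name cache is assumed, and a __syncthreads() is added after the partial sums are stored so every thread's value is visible before the reduction starts):

    __global__ void dot(float *a, float *b, float *c)
    {
        __shared__ float cache[THREADS_PER_BLOCK];
        int tid = threadIdx.x + (blockIdx.x * blockDim.x);
        int Index = threadIdx.x;

        // Strided accumulation of this thread's partial sum
        float temp = 0;
        while (tid < N)
        {
            temp += a[tid] * b[tid];
            tid += blockDim.x * gridDim.x;
        }
        cache[Index] = temp;
        __syncthreads();   // all partial sums must be stored before reducing

        // Pairwise reduction within the block
        int i = blockDim.x / 2;
        while (i > 0)
        {
            if (Index < i)
                cache[Index] += cache[Index + i];
            __syncthreads();
            i /= 2;
        }

        if (Index == 0)
            c[blockIdx.x] = cache[0];   // one result per block, summed on the host
    }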

Main

    dot<<<NUM_BLOCKS,THREADS_PER_BLOCK>>>(dev_a, dev_b, dev_c);

    // Copy data from device to host
    cudaMemcpy(c, dev_c, NUM_BLOCKS*sizeof(float), cudaMemcpyDeviceToHost);

    // Sum and output result
    float sum = 0;
    for (int i = 0; i < NUM_BLOCKS; i++)
        sum += c[i];
    printf("The dot product is %f\n", sum);

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return 0;

Thread Divergence

When control flow differs among threads, this is called thread divergence. Under normal circumstances, divergent branches simply result in some threads remaining idle while others execute the instructions in the branch:

    int i = blockDim.x / 2;
    while (i > 0)
    {
        if (Index < i)                        // only threads with Index < i do the add
            cache[Index] += cache[Index + i];
        __syncthreads();
        i /= 2;
    }
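A minimal illustration of divergence (a hypothetical kernel, not from the slides): even- and odd-numbered threads take different branches, so within a warp the two paths execute one after the other while the threads on the other path sit idle:

    __global__ void divergentKernel(int *out)
    {
        int idx = threadIdx.x + blockIdx.x * blockDim.x;
        if (threadIdx.x % 2 == 0)
            out[idx] = idx * 2;   // taken by even-numbered threads
        else
            out[idx] = idx + 1;   // taken by odd-numbered threads
    }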

Optimization Attempt

In the reduction, only some of the threads (never more than half) are updating entries in shared memory. What if we only wait for the threads actually writing to shared memory?

    int i = blockDim.x / 2;
    while (i > 0)
    {
        if (Index < i)
        {
            cache[Index] += cache[Index + i];
            __syncthreads();   // Won't work: waits until ALL threads in the block reach this point
        }
        i /= 2;
    }

Summary

- There are some arithmetic details to map a block's threads to the elements they should compute.
- Shared memory is fast but only accessible by threads in the same block.
- __syncthreads() is necessary when multiple threads access the same shared memory, and it must be used with care.