CUDA. Sathish Vadhiyar High Performance Computing


Hierarchical Parallelism
- Parallel computations arranged as grids
- One grid executes after another
- Grid consists of blocks
- Blocks assigned to SMs: a single block is assigned to a single SM; multiple blocks can be assigned to an SM
- Max thread blocks executed concurrently per SM = 16

Hierarchical Parallelism
- Block consists of elements
- Elements computed by threads
- Max threads per thread block = 1024
- A thread executes on a GPU core

Hierarchical Parallelism

Thread Blocks
- Thread block: an array of concurrent threads that execute the same program and can cooperate to compute the result
- Has shape and dimensions (1D, 2D or 3D) for threads
- A thread ID has corresponding 1D, 2D or 3D indices
- Threads of a thread block share memory
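For example, a 2D thread block maps naturally onto a 2D array. A minimal sketch (kernel name, array layout and launch dimensions are illustrative assumptions):

    // Each thread uses its 2D block and thread indices to pick one (row, col) element.
    __global__ void touch2D(float *a, int width, int height)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (row < height && col < width)
            a[row * width + col] = (float)(row + col);   // some per-element work
    }

    // Possible launch: 16x16 threads per block, enough blocks to cover the array
    //   dim3 block(16, 16);
    //   dim3 grid((width + 15) / 16, (height + 15) / 16);
    //   touch2D<<<grid, block>>>(d_a, width, height);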

CUDA Programming Language
- Programming language for threaded parallelism for GPUs
- Minimal extension of C
- A serial program that calls parallel kernels
- Serial code executes on the CPU
- Parallel kernels executed across a set of parallel threads on the GPU
- Programmer organizes threads into a hierarchy of thread blocks and grids

CUDA C
- Built-in variables:
  - threadIdx.{x,y,z}: thread ID within a block
  - blockIdx.{x,y,z}: block ID within a grid
  - blockDim.{x,y,z}: number of threads within a block
  - gridDim.{x,y,z}: number of blocks within a grid
- kernel<<<nblocks, nthreads>>>(args): invokes a parallel kernel function on a grid of nblocks blocks, where each block instantiates nthreads concurrent threads

Example: Summing Up (kernel function; grid of kernels)
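A minimal sketch of such a summing kernel and its launch as a grid of blocks, using the built-in variables above (function and variable names are illustrative assumptions):

    // Each thread sums one pair of elements: c[i] = a[i] + b[i].
    __global__ void vecAdd(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n)
            c[i] = a[i] + b[i];
    }

    // Host side: launch a grid of nblocks blocks, nthreads threads per block
    //   int nthreads = 256;
    //   int nblocks  = (n + nthreads - 1) / nthreads;
    //   vecAdd<<<nblocks, nthreads>>>(d_a, d_b, d_c, n);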

General CUDA Steps
1. Copy data from CPU to GPU
2. Compute on GPU
3. Copy data back from GPU to CPU
By default, execution on the host doesn't wait for the kernel to finish.
General rules:
- Minimize data transfer between CPU and GPU
- Maximize the number of threads on the GPU

CUDA Elements
- cudaMalloc: allocates memory on the device
- cudaMemcpy: copies data to allocated memory from host to device, and from device to host
- cudaFree: frees allocated memory
- void __syncthreads(): synchronizes all threads in a block, like a barrier
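A skeletal host program combining these elements with the copy-compute-copy steps above (the kernel name myKernel, host buffers h_in/h_out, and the launch configuration are illustrative assumptions):

    int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *d_in, *d_out;

    cudaMalloc((void **)&d_in,  bytes);                        // allocate device memory
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);     // copy CPU -> GPU

    myKernel<<<nblocks, nthreads>>>(d_in, d_out, n);           // compute on GPU

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);   // copy GPU -> CPU
    cudaFree(d_in);                                            // free device memory
    cudaFree(d_out);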

EXAMPLE 1: MATRIX VECTOR MULTIPLICATION

Kernel
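A minimal sketch of a matrix-vector multiplication kernel y = A*x, assuming a row-major m x n matrix and one thread per output row (all names are illustrative assumptions):

    // One thread computes one element of y = A * x (A is m x n, row-major).
    __global__ void matVec(const float *A, const float *x, float *y, int m, int n)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < m) {
            float sum = 0.0f;
            for (int j = 0; j < n; j++)
                sum += A[row * n + j] * x[j];   // every thread re-reads x from global memory
            y[row] = sum;
        }
    }

    // Possible launch: matVec<<<(m + 255) / 256, 256>>>(d_A, d_x, d_y, m, n);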

Host Program

Host Program
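A sketch of the corresponding host program (allocations, copies and launch configuration are illustrative assumptions):

    float *d_A, *d_x, *d_y;
    cudaMalloc((void **)&d_A, m * n * sizeof(float));
    cudaMalloc((void **)&d_x, n * sizeof(float));
    cudaMalloc((void **)&d_y, m * sizeof(float));

    cudaMemcpy(d_A, h_A, m * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);

    int nthreads = 256;
    int nblocks  = (m + nthreads - 1) / nthreads;    // one thread per row of A
    matVec<<<nblocks, nthreads>>>(d_A, d_x, d_y, m, n);

    cudaMemcpy(h_y, d_y, m * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_x); cudaFree(d_y);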

EXAMPLE 1, VERSION 2: ACCESS FROM SHARED MEMORY
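A sketch of the shared-memory variant: each block stages a tile of the vector x in shared memory so that its threads read x from fast on-chip memory instead of repeatedly reading global memory (tile size and names are illustrative; assumes the kernel is launched with TILE threads per block):

    #define TILE 256

    __global__ void matVecShared(const float *A, const float *x, float *y, int m, int n)
    {
        __shared__ float xs[TILE];                    // tile of x shared by the whole block
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        float sum = 0.0f;

        for (int t = 0; t < n; t += TILE) {
            if (t + threadIdx.x < n)                  // cooperative load of one tile of x
                xs[threadIdx.x] = x[t + threadIdx.x];
            __syncthreads();

            int len = min(TILE, n - t);
            if (row < m)
                for (int j = 0; j < len; j++)
                    sum += A[row * n + t + j] * xs[j];
            __syncthreads();                          // wait before the tile is overwritten
        }
        if (row < m)
            y[row] = sum;
    }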

EXAMPLE 2: MATRIX MULTIPLICATION

Example 2: Matrix Multiplication
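A minimal sketch of a matrix multiplication kernel, one thread per element of P = M * N, with square matrices of size width x width stored row-major (names are illustrative assumptions):

    __global__ void matMul(const float *M, const float *N, float *P, int width)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < width && col < width) {
            float sum = 0.0f;
            for (int k = 0; k < width; k++)
                sum += M[row * width + k] * N[k * width + col];
            P[row * width + col] = sum;
        }
    }

    // Possible launch with 2D blocks:
    //   dim3 block(16, 16);
    //   dim3 grid((width + 15) / 16, (width + 15) / 16);
    //   matMul<<<grid, block>>>(d_M, d_N, d_P, width);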


EXAMPLE 3: REDUCTION

Example: Reduction
- Tree-based approach used within each thread block
- In this case, partial results need to be communicated across thread blocks
- Hence, global synchronization is needed across thread blocks

Reduction
- But CUDA does not have global synchronization: it would be expensive to build in hardware for a large number of GPU cores
- Solution: decompose into multiple kernels; a kernel launch serves as a global synchronization point
- For example, reducing 2^20 elements with 512-thread blocks takes three launches: the first produces 2048 per-block partial sums, the second reduces those to 4, and a final single-block launch produces the result

Illustration

Host Code
int main() {
  int *h_idata, *h_odata;    /* host data */
  int *d_idata, *d_odata;    /* device data */
  /* cudaMalloc d_idata and d_odata; initialize h_idata */

  int numThreadsPerBlock = (n < maxThreadsPerBlock) ? n : maxThreadsPerBlock;
  int numBlocks = n / numThreadsPerBlock;

  /* copying inputs to device memory */
  cudaMemcpy(d_idata, h_idata, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d_odata, h_idata, numBlocks * sizeof(int), cudaMemcpyHostToDevice);

  dim3 dimBlock(numThreadsPerBlock, 1, 1);
  dim3 dimGrid(numBlocks, 1, 1);
  int smemSize = numThreadsPerBlock * sizeof(int);   /* dynamic shared memory for sdata[] */
  reduce<<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata);

Host Code
  int s = numBlocks;
  while (s > 1) {
    numThreadsPerBlock = (s < maxThreadsPerBlock) ? s : maxThreadsPerBlock;
    numBlocks = s / numThreadsPerBlock;
    dim3 dimBlock(numThreadsPerBlock, 1, 1);
    dim3 dimGrid(numBlocks, 1, 1);
    reduce<<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata);
    s = s / numThreadsPerBlock;
  }
}

Device Code
__global__ void reduce(int *g_idata, int *g_odata)
{
  extern __shared__ int sdata[];

  // load shared mem
  unsigned int tid = threadIdx.x;
  unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
  sdata[tid] = g_idata[i];
  __syncthreads();

  // do reduction in shared mem
  for (unsigned int s = 1; s < blockDim.x; s *= 2)
  {
    if ((tid % (2*s)) == 0)
      sdata[tid] += sdata[tid + s];
    __syncthreads();
  }

  // write result for this block to global mem
  if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

For more information
CUDA SDK code samples, NVIDIA: http://www.nvidia.com/object/cuda_get_samples.html

BACKUP

EXAMPLE 4: SCAN

Example: Scan, or Parallel Prefix Sum
- Using a binary tree
- An upward reduction phase (reduce phase or up-sweep phase): traversing the tree from leaves to root, forming partial sums at internal nodes
- Down-sweep phase: traversing from root to leaves, using the partial sums computed in the reduction phase

Up Sweep

Down Sweep
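A small worked example of the two phases on eight elements (input values are illustrative):

    Input:      [3, 1, 7, 0, 4, 1, 6, 3]
    Up-sweep (partial sums built in place, pairing at stride 1, 2, 4):
                [3, 4, 7, 7, 4, 5, 6, 9]
                [3, 4, 7, 11, 4, 5, 6, 14]
                [3, 4, 7, 11, 4, 5, 6, 25]      (root holds the total)
    Down-sweep (clear the last element, then swap-and-add back down the tree):
                [3, 4, 7, 11, 4, 5, 6, 0]
                [3, 4, 7, 0, 4, 5, 6, 11]
                [3, 0, 7, 4, 4, 11, 6, 16]
                [0, 3, 4, 11, 11, 15, 16, 22]   (exclusive prefix sum of the input)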

Host Code
int main() {
  const unsigned int num_threads = num_elements / 2;
  /* cudaMalloc d_idata and d_odata */
  cudaMemcpy(d_idata, h_data, mem_size, cudaMemcpyHostToDevice);

  dim3 grid(256, 1, 1);
  dim3 threads(num_threads, 1, 1);
  int smem_size = sizeof(float) * num_elements;   /* dynamic shared memory for temp[] */
  scan_workefficient<<< grid, threads, smem_size >>>(d_odata, d_idata, num_elements);

  cudaMemcpy(h_data, d_odata, sizeof(float) * num_elements, cudaMemcpyDeviceToHost);
  /* cudaFree d_idata and d_odata */
}

Device Code
__global__ void scan_workefficient(float *g_odata, float *g_idata, int n)
{
  // Dynamically allocated shared memory for scan kernels
  extern __shared__ float temp[];

  int thid = threadIdx.x;
  int offset = 1;

  // Cache the computational window in shared memory
  temp[2*thid]   = g_idata[2*thid];
  temp[2*thid+1] = g_idata[2*thid+1];

  // build the sum in place up the tree
  for (int d = n >> 1; d > 0; d >>= 1)
  {
    __syncthreads();
    if (thid < d)
    {
      int ai = offset*(2*thid+1)-1;
      int bi = offset*(2*thid+2)-1;
      temp[bi] += temp[ai];
    }
    offset *= 2;
  }

Device Code
  // scan back down the tree
  // clear the last element
  if (thid == 0)
    temp[n - 1] = 0;

  // traverse down the tree building the scan in place
  for (int d = 1; d < n; d *= 2)
  {
    offset >>= 1;
    __syncthreads();
    if (thid < d)
    {
      int ai = offset*(2*thid+1)-1;
      int bi = offset*(2*thid+2)-1;
      float t  = temp[ai];
      temp[ai] = temp[bi];
      temp[bi] += t;
    }
  }
  __syncthreads();

  // write results to global memory
  g_odata[2*thid]   = temp[2*thid];
  g_odata[2*thid+1] = temp[2*thid+1];
}