Vector Addition on the Device: main()


Vector Addition on the Device: main()

    #define N 512
    int main(void) {
        int *a, *b, *c;              // host copies of a, b, c
        int *d_a, *d_b, *d_c;        // device copies of a, b, c
        int size = N * sizeof(int);

        // Alloc space for device copies of a, b, c
        cudaMalloc((void **)&d_a, size);
        cudaMalloc((void **)&d_b, size);
        cudaMalloc((void **)&d_c, size);

        // Alloc space for host copies of a, b, c and setup input values
        a = (int *)malloc(size); random_ints(a, N);
        b = (int *)malloc(size); random_ints(b, N);
        c = (int *)malloc(size);
50

Vector Addition on the Device: main()

        // Copy inputs to device
        cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

        // Launch add() kernel on GPU with N blocks
        add<<<N,1>>>(d_a, d_b, d_c);

        // Copy result back to host
        cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

        // Cleanup
        free(a); free(b); free(c);
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }
51

Common Pattern 52

Block index 53

Remark Why, you may then ask, is it not just blockIdx? Why blockIdx.x? As it turns out, CUDA C allows you to define a group of blocks in two dimensions. For problems with two-dimensional domains, such as matrix math or image processing, it is often convenient to use two-dimensional indexing to avoid annoying translations from linear to rectangular indices. Don't worry if you aren't familiar with these problem types; just know that using two-dimensional indexing can sometimes be more convenient than one-dimensional indexing. But you never have to use it. We won't be offended. 54

Remark We call the collection of parallel blocks a grid. The copies of the kernel (one per block) will see varying values of blockIdx.x, the first taking value 0 and the last taking value N-1. Later, we will also use blockIdx.y. 55

56

Remark Why do we check whether tid is less than N? It should always be less than N, but we check anyway because we are being paranoid. If you would like to see how easy it is to generate a massively parallel application, try changing the 10 in the line #define N 10 to 10000 or 50000 to launch tens of thousands of parallel blocks. Be warned, though: no dimension of your launch of blocks may exceed 65,535. Go check enum_gpu.cu again. 57
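For reference, a minimal sketch of the block-indexed add() kernel with the tid < N guard described here (N is the vector-length #define from the example; the exact slide code may differ):

    // Each block computes one element; blockIdx.x selects which one.
    __global__ void add(int *a, int *b, int *c) {
        int tid = blockIdx.x;          // this block's index
        if (tid < N)                   // paranoia: never touch memory past the end
            c[tid] = a[tid] + b[tid];
    }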

Review (1 of 2)
Difference between host and device: Host = CPU, Device = GPU
Using __global__ to declare a function as device code: executes on the device, called from the host
Passing parameters from host code to a device function
58

Review (2 of 2)
Basic device memory management: cudaMalloc(), cudaMemcpy(), cudaFree()
Launching parallel kernels: launch N copies of add() with add<<<N,1>>>( ); use blockIdx.x to access the block index
59

Julia example 60

A FUN EXAMPLE The following example will demonstrate code to draw slices of the Julia Set. For the uninitiated, the Julia Set is the boundary of a certain class of functions over complex numbers. This boundary forms a fractal, one of the most interesting and beautiful curiosities of mathematics. 61

Julia Set The calculations involved in generating such a set are quite simple. At its heart, the Julia Set evaluates a simple iterative equation for points in the complex plane. A point is not in the set if the process of iterating the equation diverges for that point. That is, if the sequence of values produced by iterating the equation grows toward infinity, a point is considered outside the set. Conversely, if the values taken by the equation remain bounded, the point is in the set. 62

The iterative equation Computing an iteration of the equation therefore involves squaring the current value and adding a constant to get the next value: Z_{n+1} = (Z_n)^2 + C. 63

CPU Julia Set Our main routine is remarkably simple. It creates an appropriately sized bitmap image using a provided utility library. Next, it passes a pointer to the bitmap data to the kernel function. 64

CPU Julia Set The computation kernel does nothing more than iterate through all the points we care to render, calling julia() on each to determine membership in the Julia Set. The function julia() will return 1 if the point is in the set and 0 if it is not. We set the point's color to be red if julia() returns 1 and black if it returns 0. These colors are arbitrary, and you should feel free to choose a different color scheme. We use the RGBA color model. 65

CPU Julia Set 66

CPU Julia Set 67
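A rough sketch of the CPU kernel loop just described (DIM, julia(), and the bitmap pointer ptr are assumed names; the actual slide code may differ):

    // Visit every pixel, ask julia() for membership, and write
    // red (in the set) or black (not in the set) into the RGBA bitmap.
    void kernel(unsigned char *ptr) {
        for (int y = 0; y < DIM; y++) {
            for (int x = 0; x < DIM; x++) {
                int offset = x + y * DIM;                 // linear pixel index
                int juliaValue = julia(x, y);             // 1 = in the set, 0 = not
                ptr[offset*4 + 0] = 255 * juliaValue;     // R
                ptr[offset*4 + 1] = 0;                    // G
                ptr[offset*4 + 2] = 0;                    // B
                ptr[offset*4 + 3] = 255;                  // A (opaque)
            }
        }
    }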

CPU Julia Set We begin by translating our pixel coordinate to a coordinate in complex space. To center the complex plane at the image center, we shift by DIM/2. Then, to ensure that the image spans the range of -1.0 to 1.0, we scale the image coordinate by DIM/2. Thus, given an image point at (x,y), we get a point in complex space at ( (DIM/2 - x)/(DIM/2), (DIM/2 - y)/(DIM/2) ). 68

CPU Julia Set Then, to allow zooming in or out, we introduce a scale factor; currently, the scale is hard-coded to 1.5. Next we need to determine whether the point is in or out of the Julia Set. We do this by computing values of the iterative equation: we compute 200 iterations of the function and, after each iteration, check whether the magnitude of the result exceeds some threshold (1,000 for our purposes). 69

CPU Julia Set Since all the computations are performed on complex numbers, we define a generic structure to store complex numbers. 70
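As a sketch of what such a structure and the membership test might look like (the names cuComplex, magnitude2, and julia, the constant C, and the DIM define are assumptions based on the description above; the slide code may differ):

    // A small complex-number type with just what the Julia test needs.
    struct cuComplex {
        float r, i;                                       // real and imaginary parts
        cuComplex(float a, float b) : r(a), i(b) {}
        float magnitude2(void) { return r * r + i * i; }  // squared magnitude
        cuComplex operator*(const cuComplex &a) {
            return cuComplex(r * a.r - i * a.i, i * a.r + r * a.i);
        }
        cuComplex operator+(const cuComplex &a) {
            return cuComplex(r + a.r, i + a.i);
        }
    };

    // Returns 1 if pixel (x,y) is in the Julia Set, 0 otherwise.
    int julia(int x, int y) {
        const float scale = 1.5f;                         // hard-coded zoom factor
        float jx = scale * (float)(DIM/2 - x) / (DIM/2);
        float jy = scale * (float)(DIM/2 - y) / (DIM/2);

        cuComplex c(-0.8f, 0.156f);                       // some fixed constant C
        cuComplex a(jx, jy);

        for (int i = 0; i < 200; i++) {                   // 200 iterations of Z = Z*Z + C
            a = a * a + c;
            if (a.magnitude2() > 1000)                    // diverged: not in the set
                return 0;
        }
        return 1;                                         // stayed bounded: in the set
    }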

GPU Julia Set 71

GPU Julia Set The GPU version of main() looks much more complicated than the CPU version, but the flow is actually identical. Because we will be doing the computation on a GPU, we also declare a pointer called dev_bitmap to hold a copy of the data on the device. To hold this data, we need to allocate memory using cudaMalloc(). 72

GPU Julia Set As with the CPU example, we pass kernel() the pointer we allocated in the previous line to store the results. The only difference is that the memory now resides on the GPU, not on the host system. Because each point can be computed independently of every other point, we simply specify one copy of the function for each point we want to compute. We use two-dimensional indexing. 73

dim3 grid(DIM, DIM) So, we specify a two-dimensional grid of blocks in this line: dim3 grid(DIM,DIM). The type dim3 represents a three-dimensional tuple. (Q) Why do we use a three-dimensional value for a two-dimensional grid? (A) We do this because a three-dimensional dim3 value is what the CUDA runtime expects. 74

Remark Although a three-dimensional launch grid is not currently supported, the CUDA runtime still expects a dim3 variable where the last component equals 1. When we initialize it with only two values, as we do in the statement dim3 grid(DIM,DIM), the CUDA runtime automatically fills the third dimension with the value 1, so everything here will work as expected. 75

Kernel function We then pass our dim3 variable grid to the CUDA runtime in this line: kernel<<<grid,1>>>( dev_bitmap ); Finally, we have to copy the results back to the host. 76
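Putting the steps just described together, a minimal sketch of the GPU version of main() (the CPUBitmap utility class and its methods are assumptions about the provided utility library; error checking omitted):

    int main(void) {
        CPUBitmap bitmap(DIM, DIM);              // DIM x DIM RGBA bitmap from the utility library
        unsigned char *dev_bitmap;

        // Allocate space on the device for the bitmap data
        cudaMalloc((void **)&dev_bitmap, bitmap.image_size());

        // One block per pixel: a DIM x DIM grid of blocks
        dim3 grid(DIM, DIM);
        kernel<<<grid, 1>>>(dev_bitmap);

        // Copy the rendered image back to the host, then clean up
        cudaMemcpy(bitmap.get_ptr(), dev_bitmap, bitmap.image_size(),
                   cudaMemcpyDeviceToHost);
        cudaFree(dev_bitmap);

        bitmap.display_and_exit();               // show the result
        return 0;
    }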

Kernel function 77

Block numbers As with the vector addition example, the CUDA runtime generates these indices for us in the variable blockIdx. This works because we declared our grid of blocks to have the same dimensions as our image, so we get one block for each pair of integers (x,y) between (0,0) and (DIM-1, DIM-1). 78

gridDim Instead of two for loops, we use another built-in variable, gridDim. This variable is constant across all blocks and simply holds the dimensions of the grid that was launched. In this example, it will always be the value (DIM, DIM). So, multiplying the row index by the grid width and adding the column index gives us a unique index into ptr that ranges from 0 to (DIM*DIM-1): int offset = x + y * gridDim.x; 79
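A minimal sketch of the GPU kernel, assuming the same names as the CPU version (the slide code may differ in detail):

    // One block per pixel; blockIdx.{x,y} replace the two CPU for loops.
    __global__ void kernel(unsigned char *ptr) {
        int x = blockIdx.x;                      // pixel column
        int y = blockIdx.y;                      // pixel row
        int offset = x + y * gridDim.x;          // linear index into the bitmap

        int juliaValue = julia(x, y);            // 1 = in the set, 0 = not
        ptr[offset*4 + 0] = 255 * juliaValue;    // R
        ptr[offset*4 + 1] = 0;                   // G
        ptr[offset*4 + 2] = 0;                   // B
        ptr[offset*4 + 3] = 255;                 // A (opaque)
    }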

Device function 80

Device qualifier 81
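Because julia() and the cuComplex methods now run on the GPU and are called only from device code, they carry the __device__ qualifier. A sketch of the change, building on the hypothetical CPU sketch above:

    // __device__ marks code that runs on the device and can be called
    // only from other device code or from kernels.
    struct cuComplex {
        float r, i;
        __device__ cuComplex(float a, float b) : r(a), i(b) {}
        __device__ float magnitude2(void) { return r * r + i * i; }
        __device__ cuComplex operator*(const cuComplex &a) {
            return cuComplex(r * a.r - i * a.i, i * a.r + r * a.i);
        }
        __device__ cuComplex operator+(const cuComplex &a) {
            return cuComplex(r + a.r, i + a.i);
        }
    };

    __device__ int julia(int x, int y) {
        // ... same body as the CPU sketch shown earlier ...
    }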

Splitting Parallel Blocks The CUDA runtime allows blocks to be split into threads. Inside the angle brackets, the second parameter actually represents the number of threads per block: N blocks x 1 thread/block = N parallel threads. 82

Vector sum This time we will use threads instead of blocks.
N blocks of one thread apiece:
    add<<<N,1>>>( dev_a, dev_b, dev_c );
    int tid = blockIdx.x;    // block index
N threads all within one block:
    add<<<1,N>>>( dev_a, dev_b, dev_c );
    int tid = threadIdx.x;   // thread index
83
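For completeness, a minimal sketch of the thread-based kernel (same assumed names as before):

    // One block of N threads: threadIdx.x selects the element.
    __global__ void add(int *a, int *b, int *c) {
        int tid = threadIdx.x;         // this thread's index within the block
        if (tid < N)
            c[tid] = a[tid] + b[tid];
    }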

GPU SUMS OF A LONGER VECTOR Hardware limits the number of blocks in a single launch to 65,535. Hardware also limits the number of threads per block; this limit can be read from the maxThreadsPerBlock field of the device properties. In general, this limit is 512 threads per block. For longer vectors, we use multiple blocks and threads. 84

Index int tid = threadIdx.x + blockIdx.x * blockDim.x; blockDim is a constant for all blocks and stores the number of threads along each dimension of the block. blockDim gives the number of threads in a block, in the particular direction; gridDim gives the number of blocks in a grid, in the particular direction. 85
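As a quick worked example of this formula (numbers chosen purely for illustration): with blockDim.x = 128, the thread with threadIdx.x = 5 in block blockIdx.x = 2 computes tid = 5 + 2 * 128 = 261, and in this way every thread in the launch receives a unique global index.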

Thread numbers We still need N parallel threads to launch, but we want them to launch across multiple blocks so we do not hit the 512-thread limitation. One solution is to arbitrarily set the block size to some fixed number of threads; for this example, let's use 128 threads per block. Then we can just launch N/128 blocks to get our total of N threads running. Note that N/128 is an integer division, so if N were 127, N/128 would be zero. 86

Thread numbers We actually want this division to round up: computing (N+127)/128 instead of N/128 gives the ceiling of N/128 using integer arithmetic, without a ceil() call. With 128 threads per block, we use the following launch: add<<<(N+127)/128, 128>>>( dev_a, dev_b, dev_c ); 87
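Putting the combined index and the rounded-up block count together, a minimal sketch (names follow the earlier examples; the value of N is purely illustrative):

    #define N (33 * 1024)    // illustrative vector length

    __global__ void add(int *a, int *b, int *c) {
        int tid = threadIdx.x + blockIdx.x * blockDim.x;   // global thread index
        if (tid < N)          // threads added by rounding up simply do nothing
            c[tid] = a[tid] + b[tid];
    }

    // Launch enough 128-thread blocks to cover all N elements:
    // add<<<(N+127)/128, 128>>>(dev_a, dev_b, dev_c);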

GPU SUMS OF ARBITRARILY LONG VECTORS If we launch N/128 blocks to add our vectors, we will hit launch failures when our vectors exceed 65,535 * 128 = 8,388,480 elements. This seems like a large number, but with current memory capacities between 1GB and 4GB, the high-end graphics processors can hold orders of magnitude more data than vectors with 8 million elements. 88

Solution is extremely simple (the slide compares the GPU version with the CPU version). 89
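A common way to do this, and a reasonable reading of the slide's fix, is a loop inside the kernel that strides by the total number of launched threads, so a fixed-size launch can cover a vector of any length. A minimal sketch, assuming the names from the earlier examples:

    // GPU version: each thread starts at its global index and then
    // strides by the total number of threads in the launch.
    __global__ void add(int *a, int *b, int *c) {
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        while (tid < N) {
            c[tid] = a[tid] + b[tid];
            tid += blockDim.x * gridDim.x;    // stride = threads per block * number of blocks
        }
    }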

Kernel function We are almost there! The only remaining piece is to fix the launch itself: add<<<(N+127)/128,128>>>( dev_a, dev_b, dev_c ) will fail when (N+127)/128 is greater than 65,535. Just fix the number of blocks to some reasonably small value. Since we like copying and pasting so much, we will use 128 blocks, each with 128 threads. Show add_loop_long.cu 90

CUDA Time 91

CUDA device info See add_loop_long_time.cu 92
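As a sketch of how the device limits mentioned earlier (such as maxThreadsPerBlock and the maximum grid dimensions) can be queried with the standard CUDA runtime API (generic API usage, not necessarily the exact contents of the .cu files named above):

    #include <cstdio>

    int main(void) {
        int count = 0;
        cudaGetDeviceCount(&count);                       // number of CUDA devices

        for (int i = 0; i < count; i++) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);            // fill in the property struct
            printf("Device %d: %s\n", i, prop.name);
            printf("  Max threads per block: %d\n", prop.maxThreadsPerBlock);
            printf("  Max grid size: (%d, %d, %d)\n",
                   prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
        }
        return 0;
    }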