Introduction to CUDA C

What will you learn today?
- Start from "Hello, World!"
- Write and launch CUDA C kernels
- Manage GPU memory
- Run parallel kernels in CUDA C
- Parallel communication and synchronization
- Race conditions and atomic operations

The Basics: Terminology
- Host: the CPU and its memory (host memory)
- Device: the GPU and its memory (device memory)

Hello, World!

    int main( void ) {
        printf( "Hello, World!\n" );
        return 0;
    }

- This program is just standard C that runs on the host
- NVIDIA's compiler (nvcc) will not complain about CUDA programs with no device code
- At its simplest, CUDA C is just C!

Hello, World! with Device Code

    __global__ void kernel( void ) {
    }

    int main( void ) {
        kernel<<< 1, 1 >>>();
        printf( "Hello, World!\n" );
        return 0;
    }

Two notable additions to the original "Hello, World!"

Hello, World! with Device Code

    __global__ void kernel( void ) {
    }

- The CUDA C keyword __global__ indicates that a function:
  - Runs on the device
  - Is called from host code
- nvcc splits the source file into host and device components:
  - NVIDIA's compiler handles device functions like kernel()
  - The standard host compiler (e.g., gcc) handles host functions like main()
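For reference, here is the example gathered into a single compilable file; the file name and build command are assumptions (standard nvcc usage), not taken from the slides.

    // hello.cu -- the example above in one file (hypothetical file name)
    // Build and run (assumed standard usage):  nvcc hello.cu -o hello  &&  ./hello
    #include <stdio.h>

    __global__ void kernel( void ) {
        // empty device function; it does nothing for now
    }

    int main( void ) {
        kernel<<< 1, 1 >>>();            // launch the kernel on the device
        printf( "Hello, World!\n" );     // printed by the host
        return 0;
    }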

Hello, World! with Device Code

    int main( void ) {
        kernel<<< 1, 1 >>>();
        printf( "Hello, World!\n" );
        return 0;
    }

- Triple angle brackets mark a call from host code to device code
  - Sometimes called a kernel launch
  - We'll discuss the parameters inside the angle brackets later
- This is all that's required to execute a function on the GPU!
- The function kernel() does nothing for now

A More Complex Example

A simple kernel to add two integers:

    __global__ void add( int *a, int *b, int *c ) {
        *c = *a + *b;
    }

- As before, __global__ is a CUDA C keyword meaning:
  - add() will execute on the device
  - add() will be called from the host

A More Complex Example

Notice that we use pointers for our variables:

    __global__ void add( int *a, int *b, int *c ) {
        *c = *a + *b;
    }

- add() runs on the device, so a, b, and c must point to device memory
- Question: how do we allocate memory on the GPU?

Memory Management
- Host and device memory are distinct entities
  - Device pointers point to GPU memory
    - May be passed to and from host code
    - But may not be dereferenced from host code
  - Host pointers point to CPU memory
    - May be passed to and from device code
    - But may not be dereferenced from device code
- CUDA API for dealing with device memory:
  - cudaMalloc(), cudaFree(), cudaMemcpy()
  - Similar to their C equivalents: malloc(), free(), memcpy()
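The slides leave out error handling, but each of these calls returns a cudaError_t. A minimal checking pattern, shown here as a sketch and not part of the original example, looks like this:

    #include <stdio.h>

    int main( void ) {
        int *dev_a = NULL;

        // cudaMalloc reports failure through its return value
        cudaError_t err = cudaMalloc( (void**)&dev_a, sizeof( int ) );
        if( err != cudaSuccess ) {
            fprintf( stderr, "cudaMalloc failed: %s\n", cudaGetErrorString( err ) );
            return 1;
        }

        cudaFree( dev_a );
        return 0;
    }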

A More Complex Example (Cont.)

Using our add() kernel:

    __global__ void add( int *a, int *b, int *c ) {
        *c = *a + *b;
    }

Let's take a look at main()

A More Complex Example: main()

    int main( void ) {
        int a, b, c;                   // host copies of a, b, c
        int *dev_a, *dev_b, *dev_c;    // device copies of a, b, c
        int size = sizeof( int );      // we need space for an integer

        // allocate device copies of a, b, c
        cudaMalloc( (void**)&dev_a, size );
        cudaMalloc( (void**)&dev_b, size );
        cudaMalloc( (void**)&dev_c, size );

        a = 2;
        b = 7;

A More Complex Example: main() (cont.)

        // copy inputs to device
        cudaMemcpy( dev_a, &a, size, cudaMemcpyHostToDevice );
        cudaMemcpy( dev_b, &b, size, cudaMemcpyHostToDevice );

        // launch add() kernel on GPU, passing parameters
        add<<< 1, 1 >>>( dev_a, dev_b, dev_c );

        // copy device result back to host copy of c
        cudaMemcpy( &c, dev_c, size, cudaMemcpyDeviceToHost );

        cudaFree( dev_a );
        cudaFree( dev_b );
        cudaFree( dev_c );
        return 0;
    }

Parallel Programming in CUDA C
- But GPU computing is about massive parallelism
- So how do we run code in parallel on the device?
- The solution lies in the parameters between the triple angle brackets:

    add<<< 1, 1 >>>( dev_a, dev_b, dev_c );
    add<<< N, 1 >>>( dev_a, dev_b, dev_c );

- Instead of executing add() once, add() is executed N times in parallel
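Put together, the fragments above form the complete single-integer program; this consolidated listing is a sketch, and the final printf is an addition used only to check the result.

    #include <stdio.h>

    __global__ void add( int *a, int *b, int *c ) {
        *c = *a + *b;
    }

    int main( void ) {
        int a = 2, b = 7, c;           // host copies of a, b, c
        int *dev_a, *dev_b, *dev_c;    // device copies of a, b, c
        int size = sizeof( int );

        cudaMalloc( (void**)&dev_a, size );
        cudaMalloc( (void**)&dev_b, size );
        cudaMalloc( (void**)&dev_c, size );

        cudaMemcpy( dev_a, &a, size, cudaMemcpyHostToDevice );
        cudaMemcpy( dev_b, &b, size, cudaMemcpyHostToDevice );

        add<<< 1, 1 >>>( dev_a, dev_b, dev_c );

        cudaMemcpy( &c, dev_c, size, cudaMemcpyDeviceToHost );
        printf( "2 + 7 = %d\n", c );   // added check, not in the slides

        cudaFree( dev_a ); cudaFree( dev_b ); cudaFree( dev_c );
        return 0;
    }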

Parallel Programming in CUDA C
- With add() running in parallel, we can do vector addition
- Each parallel invocation of add() is referred to as a block
  - For now, assume each block has one thread
  - A kernel can refer to its block's index with the variable blockIdx.x
- Each block adds a value from a[] and b[], storing the result in c[]:

    __global__ void add( int *a, int *b, int *c ) {
        c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
    }

- By using blockIdx.x to index the arrays, each block handles different indices

Parallel Programming in CUDA C

Write this code:

    __global__ void add( int *a, int *b, int *c ) {
        c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
    }

This is what runs in parallel on the device:

    Block 0:  c[0] = a[0] + b[0];
    Block 1:  c[1] = a[1] + b[1];
    Block 2:  c[2] = a[2] + b[2];
    Block 3:  c[3] = a[3] + b[3];

Parallel Addition: add()

Using our newly parallelized add() kernel:

    __global__ void add( int *a, int *b, int *c ) {
        c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
    }

Let's take a look at main()

Parallel Addition: main()

    #define N 512

    int main( void ) {
        int *a, *b, *c;                // host copies of a, b, c
        int *dev_a, *dev_b, *dev_c;    // device copies of a, b, c
        int size = N * sizeof( int );  // we need space for 512 integers

        // allocate device copies of a, b, c
        cudaMalloc( (void**)&dev_a, size );
        cudaMalloc( (void**)&dev_b, size );
        cudaMalloc( (void**)&dev_c, size );

        a = (int*)malloc( size );
        b = (int*)malloc( size );
        c = (int*)malloc( size );

        random_ints( a, N );
        random_ints( b, N );

Parallel Addition: main() (cont.)

        // copy inputs to device
        cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
        cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );

        // launch add() kernel with N parallel blocks (also the vector size)
        add<<< N, 1 >>>( dev_a, dev_b, dev_c );

        // copy device result back to host copy of c
        cudaMemcpy( c, dev_c, size, cudaMemcpyDeviceToHost );

        free( a ); free( b ); free( c );
        cudaFree( dev_a ); cudaFree( dev_b ); cudaFree( dev_c );
        return 0;
    }

Review
- Difference between host and device
  - Host = CPU
  - Device = GPU
- Using __global__ to declare a function as device code
  - Runs on the device
  - Called from the host
- Passing parameters from host code to a device function
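The listings call a helper random_ints() that the slides never define; a possible definition, given here only as an assumption so the examples compile, is:

    #include <stdlib.h>

    // fills an array with arbitrary values; any range would do
    void random_ints( int *p, int n ) {
        for( int i = 0; i < n; i++ )
            p[i] = rand() % 100;
    }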

Review (cont.)
- Basic device memory management
  - cudaMalloc()
  - cudaFree()
  - cudaMemcpy()
- Launching parallel kernels
  - Launch N copies of add() with: add<<< N, 1 >>>();
  - Use blockIdx.x to access a block's index

Threads
- Terminology: a block can be split into parallel threads
- Let's change vector addition to use parallel threads instead of parallel blocks
  (now assuming 1 block only):

    __global__ void add( int *a, int *b, int *c ) {
        c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
    }

- We use threadIdx.x instead of blockIdx.x
- main() will require one more change

Parallel Addition (Threads): main()

    #define N 512

    int main( void ) {
        int *a, *b, *c;                // host copies of a, b, c
        int *dev_a, *dev_b, *dev_c;    // device copies of a, b, c
        int size = N * sizeof( int );  // we need space for 512 integers

        // allocate device copies of a, b, c
        cudaMalloc( (void**)&dev_a, size );
        cudaMalloc( (void**)&dev_b, size );
        cudaMalloc( (void**)&dev_c, size );

        a = (int*)malloc( size );
        b = (int*)malloc( size );
        c = (int*)malloc( size );

        random_ints( a, N );
        random_ints( b, N );

Parallel Addition (Threads): main() (cont.)

        // copy inputs to device
        cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
        cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );

        // launch add() kernel with N threads
        add<<< 1, N >>>( dev_a, dev_b, dev_c );

        // copy device result back to host copy of c
        cudaMemcpy( c, dev_c, size, cudaMemcpyDeviceToHost );

        free( a ); free( b ); free( c );
        cudaFree( dev_a ); cudaFree( dev_b ); cudaFree( dev_c );
        return 0;
    }

Using Both Threads and Blocks
- We've seen parallel vector addition using:
  - Many blocks with 1 thread per block
  - 1 block with many threads
- Now let's use lots of both blocks and threads
- First, let's discuss data indexing

Indexing Arrays
- Example: N = 32 elements, 4 blocks, M = 8 threads/block
- For the thread with blockIdx.x = 2 and threadIdx.x = 5:

    int index = threadIdx.x + blockIdx.x * M;
              = 5 + 2 * 8;
              = 21;

- (Figure: the 32 entries are numbered 0 through 31; the highlighted entry has index 21)
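A tiny sketch of this mapping, using a hypothetical kernel write_index() that is not in the slides, has every thread store its own global index:

    __global__ void write_index( int *out, int M ) {
        int index = threadIdx.x + blockIdx.x * M;   // same formula as above
        out[index] = index;
    }

    // launched as: write_index<<< 4, 8 >>>( dev_out, 8 );
    // dev_out ends up holding 0, 1, ..., 31; block 2, thread 5 wrote entry 21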

Addition with Threads and Blocks
- blockDim.x is a built-in variable for the number of threads per block:

    int index = threadIdx.x + blockIdx.x * blockDim.x;

- A combined version of our vector addition kernel using blocks and threads:

    __global__ void add( int *a, int *b, int *c ) {
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        c[index] = a[index] + b[index];
    }

- So what changes in main() when we use both blocks and threads?

Parallel Addition (Blocks/Threads): main()

    #define N (2048*2048)
    #define THREADS_PER_BLOCK 512      // thread block size

    int main( void ) {
        int *a, *b, *c;                // host copies of a, b, c
        int *dev_a, *dev_b, *dev_c;    // device copies of a, b, c
        int size = N * sizeof( int );  // we need space for N integers

        // allocate device copies of a, b, c
        cudaMalloc( (void**)&dev_a, size );
        cudaMalloc( (void**)&dev_b, size );
        cudaMalloc( (void**)&dev_c, size );

        a = (int*)malloc( size );
        b = (int*)malloc( size );
        c = (int*)malloc( size );

        random_ints( a, N );
        random_ints( b, N );

Parallel Addition (Blocks/Threads): main() (cont.)

        // copy inputs to device
        cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
        cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );

        // launch add() kernel with blocks and threads
        add<<< N/THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>( dev_a, dev_b, dev_c );

        // copy device result back to host copy of c
        cudaMemcpy( c, dev_c, size, cudaMemcpyDeviceToHost );

        free( a ); free( b ); free( c );
        cudaFree( dev_a ); cudaFree( dev_b ); cudaFree( dev_c );
        return 0;
    }
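The launch above divides evenly because N (2048*2048) is a multiple of THREADS_PER_BLOCK (512). When the vector length is not an exact multiple of the block size, a common pattern, offered here as an extension beyond the slides, is to round the block count up and guard the access inside the kernel:

    __global__ void add( int *a, int *b, int *c, int n ) {
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        if( index < n )                // skip threads past the end of the vectors
            c[index] = a[index] + b[index];
    }

    // round the number of blocks up so every element is covered:
    // add<<< (N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>( dev_a, dev_b, dev_c, N );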

Why Bother With Threads?
- Threads seem unnecessary
  - They add a level of abstraction and complexity
  - What did we gain?
- Because parallel threads have mechanisms to:
  - Communicate
  - Synchronize
- Let's see how to communicate and synchronize

Dot Product
- Unlike vector addition, the dot product is a reduction from two vectors to a scalar
- (Figure: the pairwise products a0*b0, a1*b1, a2*b2, a3*b3 are summed into c)

    c = a . b
      = (a0, a1, a2, a3) . (b0, b1, b2, b3)
      = a0*b0 + a1*b1 + a2*b2 + a3*b3
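For comparison, the same computation on the host is a single loop; this plain C reference, a hypothetical helper not shown in the slides, is handy for checking the GPU result later.

    // serial reference implementation of the dot product
    int dot_host( const int *a, const int *b, int n ) {
        int sum = 0;
        for( int i = 0; i < n; i++ )
            sum += a[i] * b[i];
        return sum;
    }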

Dot Product
- Parallel threads have no problem computing the pairwise products
- But how do we compute their sum?
- So we can start a dot product by doing just that:

    __global__ void dot( int *a, int *b, int *c ) {
        // Each thread computes a pairwise product
        int temp = a[threadIdx.x] * b[threadIdx.x];
    }

Dot Product
- But we need to share data between threads to compute the final sum:

    __global__ void dot( int *a, int *b, int *c ) {
        // Each thread computes a pairwise product
        int temp = a[threadIdx.x] * b[threadIdx.x];

        // Can't compute the final sum here:
        // each thread's local copy of temp is private
    }

Sharing Data Between Threads
- A thread block shares a memory called shared memory
  - Extremely fast, on-chip memory (user-managed cache)
  - Declared with the __shared__ CUDA keyword
  - Not visible to threads in other blocks running in parallel
- (Figure: Block 0, Block 1, and Block 2 each have their own threads and their own shared memory)

Parallel Dot Product: dot()
- We perform a parallel multiply, then a serial addition (assuming 1 thread block only):

    #define N 512

    __global__ void dot( int *a, int *b, int *c ) {
        // Shared memory for the results of the multiplication
        __shared__ int temp[N];
        temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];

        // Thread 0 sums the pairwise products
        if( 0 == threadIdx.x ) {
            int sum = 0;
            for( int i = 0; i < N; i++ )
                sum += temp[i];
            *c = sum;
        }
    }

Parallel Dot Product Recap
- We perform parallel, pairwise multiplications
- Shared memory stores each thread's result
- We sum these pairwise products from a single thread
- Sounds good... but we've made a mistake

Synchronization
- We need threads to wait between the sections of dot():

    __global__ void dot( int *a, int *b, int *c ) {
        __shared__ int temp[N];
        temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];

        // *** NEED THREADS TO SYNCHRONIZE HERE ***
        // No thread can advance until all threads
        // have reached this point in the code

        // Thread 0 sums the pairwise products
        if( 0 == threadIdx.x ) {
            int sum = 0;
            for( int i = 0; i < N; i++ )
                sum += temp[i];
            *c = sum;
        }
    }

__syncthreads()
- We can synchronize threads with the function __syncthreads()
- Threads in the block wait until all threads have hit the __syncthreads()
- (Figure: Threads 0 through 4 each reach __syncthreads() and wait there until every thread has arrived)
- Threads are only synchronized within a block

Parallel Dot Product: dot()

    __global__ void dot( int *a, int *b, int *c ) {
        __shared__ int temp[N];
        temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];

        __syncthreads();

        if( 0 == threadIdx.x ) {
            int sum = 0;
            for( int i = 0; i < N; i++ )
                sum += temp[i];
            *c = sum;
        }
    }

With a properly synchronized dot() routine, let's look at main()

Parallel Dot Product: main()

    #define N 512

    int main( void ) {
        int *a, *b, *c;                // host copies of a, b, c
        int *dev_a, *dev_b, *dev_c;    // device copies of a, b, c
        int size = N * sizeof( int );  // we need space for 512 integers

        // allocate device copies of a, b, c
        cudaMalloc( (void**)&dev_a, size );
        cudaMalloc( (void**)&dev_b, size );
        cudaMalloc( (void**)&dev_c, sizeof( int ) );

        a = (int *)malloc( size );
        b = (int *)malloc( size );
        c = (int *)malloc( sizeof( int ) );

        random_ints( a, N );
        random_ints( b, N );

Parallel Dot Product: main() (cont.)

        // copy inputs to device
        cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
        cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );

        // launch dot() kernel with 1 block and N threads
        dot<<< 1, N >>>( dev_a, dev_b, dev_c );

        // copy device result back to host copy of c
        cudaMemcpy( c, dev_c, sizeof( int ), cudaMemcpyDeviceToHost );

        free( a ); free( b ); free( c );
        cudaFree( dev_a ); cudaFree( dev_b ); cudaFree( dev_c );
        return 0;
    }
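The kernel above has thread 0 do the final sum serially. A common refinement, not covered by the slides and assuming N is a power of two (it is 512 here), is a tree reduction in shared memory that finishes the sum in log2(N) steps:

    __global__ void dot( int *a, int *b, int *c ) {
        __shared__ int temp[N];
        temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];
        __syncthreads();

        // halve the number of active threads each step
        for( int stride = N / 2; stride > 0; stride /= 2 ) {
            if( threadIdx.x < stride )
                temp[threadIdx.x] += temp[threadIdx.x + stride];
            __syncthreads();
        }

        if( 0 == threadIdx.x )
            *c = temp[0];
    }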

Review
- Launching kernels with parallel threads
  - Launch add() with N threads: add<<< 1, N >>>();
  - Used threadIdx.x to access a thread's index
- Using both blocks and threads
  - Use (threadIdx.x + blockIdx.x * blockDim.x) to index input/output
  - N/THREADS_PER_BLOCK blocks and THREADS_PER_BLOCK threads gave us N threads in total

Review (cont.)
- Using __shared__ to declare memory as shared memory
  - Data is shared among threads in a block
  - Not visible to threads in other blocks
- Using __syncthreads() as a barrier
  - No thread executes instructions after __syncthreads() until all threads have reached the __syncthreads()
  - Needs to be used to prevent data hazards

Multiblock Dot Product
- Recall our dot product launch:

    // launch the kernel with 1 block and N threads
    dot<<< 1, N >>>( dev_a, dev_b, dev_c );

- But launching with one block will not utilize much of the GPU
- Let's write a multiblock version of the dot product

Multiblock Dot Product: Algorithm
- Each block computes a sum of its pairwise products, like before
- (Figure: Block 0 sums the products a0*b0, a1*b1, a2*b2, a3*b3, ...;
  Block 1 does the same for a512*b512, a513*b513, a514*b514, a515*b515, ...)

Multiblock Dot Product: Algorithm
- ...and then contributes its sum to the final result c
- (Figure: each block's partial sum is added into c)

Multiblock Dot Product: dot()

    #define N (2048*2048)
    #define THREADS_PER_BLOCK 512

    __global__ void dot( int *a, int *b, int *c ) {
        __shared__ int temp[THREADS_PER_BLOCK];
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        temp[threadIdx.x] = a[index] * b[index];

        __syncthreads();

        if( 0 == threadIdx.x ) {
            int sum = 0;
            for( int i = 0; i < THREADS_PER_BLOCK; i++ )
                sum += temp[i];
            *c += sum;
        }
    }

- But we have a race condition
- We can fix it with one of CUDA's atomic operations

Atomic Operations
- Many atomic operations on memory are available with CUDA C:
  - atomicAdd(), atomicSub(), atomicMin(), atomicMax()
  - atomicInc(), atomicDec(), atomicExch(), atomicCAS()
- They give a predictable result when simultaneous access to memory is required
- We need to atomically add sum to c in our multiblock dot product

Multiblock Dot Product: dot()

    __global__ void dot( int *a, int *b, int *c ) {
        __shared__ int temp[THREADS_PER_BLOCK];
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        temp[threadIdx.x] = a[index] * b[index];

        __syncthreads();

        if( 0 == threadIdx.x ) {
            int sum = 0;
            for( int i = 0; i < THREADS_PER_BLOCK; i++ )
                sum += temp[i];
            atomicAdd( c, sum );
        }
    }

Now let's fix up main() to handle a multiblock dot product

Parallel Dot Product: main()

    #define N (2048*2048)
    #define THREADS_PER_BLOCK 512

    int main( void ) {
        int *a, *b, *c;                // host copies of a, b, c
        int *dev_a, *dev_b, *dev_c;    // device copies of a, b, c
        int size = N * sizeof( int );  // we need space for N ints

        // allocate device copies of a, b, c
        cudaMalloc( (void**)&dev_a, size );
        cudaMalloc( (void**)&dev_b, size );
        cudaMalloc( (void**)&dev_c, sizeof( int ) );

        a = (int *)malloc( size );
        b = (int *)malloc( size );
        c = (int *)malloc( sizeof( int ) );

        random_ints( a, N );
        random_ints( b, N );

Parallel Dot Product: main() (cont.)

        // copy inputs to device
        cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
        cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );

        // launch dot() kernel
        dot<<< N/THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>( dev_a, dev_b, dev_c );

        // copy device result back to host copy of c
        cudaMemcpy( c, dev_c, sizeof( int ), cudaMemcpyDeviceToHost );

        free( a ); free( b ); free( c );
        cudaFree( dev_a ); cudaFree( dev_b ); cudaFree( dev_c );
        return 0;
    }
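One detail the slides leave implicit: because every block accumulates into *c with atomicAdd(), the result location should start at zero. A simple way to guarantee that, added here as a sketch rather than part of the original listing, is to clear dev_c before the launch:

    // clear the accumulator before the kernel runs (addition, not shown in the slides)
    cudaMemset( dev_c, 0, sizeof( int ) );

    dot<<< N/THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>( dev_a, dev_b, dev_c );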

Review
- Race conditions
  - Behavior depends upon the relative timing of multiple event sequences
  - Can occur when an implied read-modify-write is interruptible
- Atomic operations
  - CUDA provides read-modify-write operations guaranteed to be atomic
  - Atomics ensure correct results when multiple threads modify memory

To Learn More CUDA C
- Check out CUDA by Example
  - Parallel Programming in CUDA C
  - Thread Cooperation
  - Constant Memory and Events
  - Texture Memory
  - Graphics Interoperability
  - Atomics
  - Streams
  - CUDA C on Multiple GPUs
  - Other CUDA Resources
- https://developer.nvidia.com/cuda-example