Shared Memory and Synchronizations

Bedrich Benes, Ph.D.
Purdue University, Department of Computer Graphics Technology

Shared memory (SM) can be accessed by all threads within a block (but not across blocks).
Threads within a block can therefore communicate.
Once we have communication, we also need synchronization.

Motivation:
Let's have two parallel processes. Process A writes its result into SM, and a second process reads it.
The shared memory here holds a single float variable; e.g., A assigns 1 to it and the second process reads that value.
(The calculation of the second process depends on A.)

A writes before the other process reads: the data in the shared variable is valid.
A writes after the other process reads: the data in the shared variable is invalid.

CUDA Synchronization
Various mechanisms, here shown in decreasing strength.

__syncthreads()
Ensures that all threads in the block have completed the instructions prior to this one before any of them continues with the next instruction.
It is a barrier command.
__syncthreads_count(cond)
__syncthreads_or(cond)
__syncthreads_and(cond)

__threadfence_system()
Waits until all global memory (GM) and SM accesses made by the calling thread prior to __threadfence_system() are visible to:
    threads in the thread block for shared memory accesses,
    all threads in the device for global memory accesses,
    host threads for page-locked host memory accesses.
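The barrier behaviour of __syncthreads() described above can be sketched as follows. This is a minimal illustration, not code from the slides: the kernel name shiftWithBarrier, the helper runBarrierCheck, and the block size BLOCK are hypothetical, and __syncthreads_count() is assumed to be available (compute capability 2.0 or later). Each thread writes one element into SM, waits at the barrier, and only then reads the element written by its neighbour.

```cuda
#include <cuda_runtime.h>

#define BLOCK 64  // hypothetical block size chosen for this sketch

__global__ void shiftWithBarrier(const int *d_in, int *d_out, int *d_numOdd) {
    __shared__ int sm[BLOCK];
    int t = threadIdx.x;

    sm[t] = d_in[t];  // write phase: one element per thread

    // Barrier: no thread continues until every thread of the block has
    // arrived, so all writes to sm are complete. The _count variant also
    // returns the number of threads whose predicate was non-zero.
    int numOdd = __syncthreads_count(d_in[t] % 2 != 0);

    d_out[t] = sm[(t + 1) % BLOCK];  // read phase: the neighbour's element
    if (t == 0) *d_numOdd = numOdd;
}

// Host helper (hypothetical): runs the kernel on 0..BLOCK-1 and checks it.
bool runBarrierCheck() {
    int h_in[BLOCK], h_out[BLOCK], h_numOdd = -1;
    for (int i = 0; i < BLOCK; ++i) h_in[i] = i;

    int *d_in = nullptr, *d_out = nullptr, *d_numOdd = nullptr;
    bool ok = cudaMalloc(&d_in, sizeof(h_in)) == cudaSuccess &&
              cudaMalloc(&d_out, sizeof(h_out)) == cudaSuccess &&
              cudaMalloc(&d_numOdd, sizeof(int)) == cudaSuccess;
    if (ok) {
        cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
        shiftWithBarrier<<<1, BLOCK>>>(d_in, d_out, d_numOdd);
        // The copies back also report any error raised by the kernel.
        ok = cudaMemcpy(h_out, d_out, sizeof(h_out),
                        cudaMemcpyDeviceToHost) == cudaSuccess &&
             cudaMemcpy(&h_numOdd, d_numOdd, sizeof(int),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_numOdd);
    if (!ok) return false;

    for (int i = 0; i < BLOCK; ++i)
        if (h_out[i] != (i + 1) % BLOCK) return false;
    return h_numOdd == BLOCK / 2;  // half of 0..BLOCK-1 is odd
}
```

Without the barrier, a thread could read sm[(t + 1) % BLOCK] before its neighbour has written it, which is the "writes after reads" case from the motivation.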

__threadfence()
Waits until all GM and SM accesses made by the calling thread prior to the call are visible to:
    threads in the thread block for shared memory accesses,
    threads in the device for global memory accesses.
Prevents the compiler from optimizing by caching shared memory writes in registers.
It does not synchronize the threads, and it is not necessary for all threads to actually reach this instruction.

__threadfence_block()
Waits until all GM and SM accesses made by the calling thread prior to __threadfence_block() are visible to all threads in the thread block.

Atomic Functions
Atomic = guaranteed to be performed without interference from other threads.
No other thread can access this address until the operation is complete.
Each is a read-modify-write atomic operation on one 32-bit or 64-bit word in GM or SM.

Atomic Functions
atomicAdd() reads a word, adds a number to it, and writes the result back to the same address.
Only atomicExch() and atomicAdd() can operate on 32-bit floating-point values.

atomicAdd(), atomicSub()
atomicExch()
atomicMin(), atomicMax()
atomicInc(), atomicDec()
atomicCAS(): stores (old == compare ? val : old)
atomicAnd(), atomicOr(), atomicXor()

SM Access Speed
The fastest memory on the GPU.
Slower if there are bank conflicts or reading from the same space.
Accessible by any thread within the block.
Dies with the block.

SM vs global/local memory:
    GPU access mem command: 4 clock cycles
    local/global memory access: cycles
    GPU SM access: 4 clock cycles
SM access is approx. x faster!!!
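The read-modify-write behaviour of the atomic functions listed above can be sketched as follows. This is a minimal illustration under stated assumptions: the kernel name countAndMax, the helper runAtomicCheck, and the input pattern are hypothetical and do not come from the slides. Many threads update the same two words in GM; the atomics ensure no update is lost.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void countAndMax(const int *d_in, int n, int *d_count, int *d_max) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        // All threads target the same addresses; each atomic call performs
        // its read-modify-write without interference from other threads.
        if (d_in[i] % 2 == 0) atomicAdd(d_count, 1);
        atomicMax(d_max, d_in[i]);
    }
}

// Host helper (hypothetical): compares the GPU result with a CPU reference.
bool runAtomicCheck() {
    const int n = 1000;
    std::vector<int> h_in(n);
    int refCount = 0, refMax = 0;
    for (int i = 0; i < n; ++i) {
        h_in[i] = (i * 37) % 101;  // arbitrary non-negative test values
        if (h_in[i] % 2 == 0) ++refCount;
        if (h_in[i] > refMax) refMax = h_in[i];
    }

    int *d_in = nullptr, *d_count = nullptr, *d_max = nullptr;
    int h_count = -1, h_max = -1;
    bool ok = cudaMalloc(&d_in, n * sizeof(int)) == cudaSuccess &&
              cudaMalloc(&d_count, sizeof(int)) == cudaSuccess &&
              cudaMalloc(&d_max, sizeof(int)) == cudaSuccess;
    if (ok) {
        cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemset(d_count, 0, sizeof(int));  // counter starts at 0
        cudaMemset(d_max, 0, sizeof(int));    // inputs are non-negative
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        countAndMax<<<blocks, threads>>>(d_in, n, d_count, d_max);
        ok = cudaMemcpy(&h_count, d_count, sizeof(int),
                        cudaMemcpyDeviceToHost) == cudaSuccess &&
             cudaMemcpy(&h_max, d_max, sizeof(int),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);
    cudaFree(d_count);
    cudaFree(d_max);
    return ok && h_count == refCount && h_max == refMax;
}
```

Replacing atomicAdd(d_count, 1) with a plain (*d_count)++ would let concurrent threads overwrite each other's increments, so the count would generally come out too small.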

Copy between the global memory and SM

    // SM size specified explicitly in the kernel
    __shared__ float shared[32];   // inside the kernel
    __device__ float device[32];   // global memory, declared at file scope
    shared[threadIdx.x] = device[threadIdx.x];

    // SM size specified implicitly by the kernel call
    extern __shared__ float shared[];
    int smSize = numThreadsPerBlock * sizeof(float);
    KernelCall<<<dimGrid, dimBlock, smSize>>>();

Array Reverse using GM
Example: create a reverse of an array using GM

    __global__ void reverseArrayBlockGM(int *d_out, int *d_in) {
        int inOff  = blockDim.x * blockIdx.x;                    // start of this block's input chunk
        int outOff = blockDim.x * (gridDim.x - 1 - blockIdx.x);  // mirrored block offset
        int in  = inOff + threadIdx.x;
        int out = outOff + (blockDim.x - 1 - threadIdx.x);       // reversed position within the block
        d_out[out] = d_in[in];
    }

Example: create a reverse of an array using SM
It is a three-phase process:
1) Create local reversed chunks in the SM
2) Sync all threads
3) Write them out into the GM

The size of the SM (per block) is specified by the kernel launch:

    int numThreadsPerBlock = 256;
    int numBlocks = dimA / numThreadsPerBlock;
    int smSize = numThreadsPerBlock * sizeof(int);
    dim3 dimGrid(numBlocks);
    dim3 dimBlock(numThreadsPerBlock);
    reverseArrayBlockSM<<<dimGrid, dimBlock, smSize>>>(d_b, d_a);

(Image courtesy of NVIDIA)

    __global__ void reverseArrayBlockSM(int *d_out, int *d_in) {
        // SM size is specified elsewhere (third launch parameter)
        extern __shared__ int sm[];
        int inOff = blockDim.x * blockIdx.x;  // global offset of this block
        int in = inOff + threadIdx.x;
        // one element per thread is stored in reversed order in SM
        sm[blockDim.x - 1 - threadIdx.x] = d_in[in];
        // wait till all threads have finished
        __syncthreads();
        // write back: store the data from SM in forward order,
        // but to the reversed block offset as before
        int outOff = blockDim.x * (gridDim.x - 1 - blockIdx.x);
        int out = outOff + threadIdx.x;
        d_out[out] = sm[threadIdx.x];
    }
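The three-phase SM reversal above can be exercised end to end with a small host-side check. This is a sketch under stated assumptions rather than the author's test program: the helper runReverseCheck and the array length dimA = 1024 are hypothetical, and dimA is assumed to be a multiple of the block size, as the launch code on the slide implies.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Kernel as on the slide: reverse each chunk in SM, then write the chunk
// to the mirrored block position in GM.
__global__ void reverseArrayBlockSM(int *d_out, int *d_in) {
    extern __shared__ int sm[];  // size comes from the launch
    int in = blockDim.x * blockIdx.x + threadIdx.x;
    sm[blockDim.x - 1 - threadIdx.x] = d_in[in];  // phase 1: reversed chunk
    __syncthreads();                              // phase 2: sync the block
    int outOff = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    d_out[outOff + threadIdx.x] = sm[threadIdx.x];  // phase 3: write to GM
}

// Host helper (hypothetical): returns true if d_b is the reverse of d_a.
bool runReverseCheck() {
    const int dimA = 1024;  // assumed multiple of numThreadsPerBlock
    const int numThreadsPerBlock = 256;
    const int numBlocks = dimA / numThreadsPerBlock;
    const size_t memSize = dimA * sizeof(int);
    const size_t smSize = numThreadsPerBlock * sizeof(int);

    std::vector<int> h_a(dimA), h_b(dimA, -1);
    for (int i = 0; i < dimA; ++i) h_a[i] = i;

    int *d_a = nullptr, *d_b = nullptr;
    bool ok = cudaMalloc(&d_a, memSize) == cudaSuccess &&
              cudaMalloc(&d_b, memSize) == cudaSuccess;
    if (ok) {
        cudaMemcpy(d_a, h_a.data(), memSize, cudaMemcpyHostToDevice);
        dim3 dimGrid(numBlocks);
        dim3 dimBlock(numThreadsPerBlock);
        reverseArrayBlockSM<<<dimGrid, dimBlock, smSize>>>(d_b, d_a);
        // The copy back also reports any error raised by the kernel.
        ok = cudaMemcpy(h_b.data(), d_b, memSize,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_a);
    cudaFree(d_b);
    if (!ok) return false;

    for (int i = 0; i < dimA; ++i)
        if (h_b[i] != h_a[dimA - 1 - i]) return false;
    return true;
}
```

The check relies on the index arithmetic of the kernel: thread t of block b writes output element (gridDim.x - 1 - b) * blockDim.x + t from input element b * blockDim.x + (blockDim.x - 1 - t), and these two indices always sum to dimA - 1.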

Speedup?
100 runs of the kernel on Quadro FX 770M

Reading
NVIDIA CUDA Programming Guide
Kirk, D.B., Hwu, W.W., Programming Massively Parallel Processors, NVIDIA, Morgan Kaufmann, 2010
Sanders, J., and Kandrot, E., CUDA by Example: An Introduction to General-Purpose GPU Programming, NVIDIA, Addison-Wesley

Programming with CUDA

Jens K. Mueller, jkm@informatik.uni-jena.de
Department of Mathematics and Computer Science, Friedrich-Schiller-University Jena
Tuesday 19th April, 2011
Today's lecture: Synchronization