CUDA Threads. Origins.


James Gain, Michelle Kuttel, Sebastian Wyngaard, Simon Perkins and Jason Brownbridge
{jgain, mkuttel, sperkins, jbrownbr}@cs.uct.ac.za, swyngaard@csir.co.za
3-6 May 2011 (slides dated 5/4/11)

- The CPU processing core

- But graphics operations don't need this
- Highly parallel/synchronous execution
- Many operations have spatial locality
- Heavy use of SIMD
- So we throw away the cache and other unnecessary overhead
- Use multiple cores with many lightweight threads running concurrently on different cores

- Still very inefficient: one processor, one thread
- Yet, threads differ only in input and output
- The execution path is the same
- Solution? SIMD

- Branching can still hurt performance.
- Watch out for __syncthreads()
- Aren't long wait times a concern without a data cache? No. Just interleave the processing: while some threads wait on memory, others run.
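Why branching hurts can be sketched with a small host-side model in Python. It assumes that a group of threads sharing one instruction stream (what later slides call a warp) must run each taken side of an if/else one after the other, with the threads on the other side idle. The function name `passes_for_branch` is made up for this illustration; it is not a CUDA API.

```python
def passes_for_branch(takes_branch):
    """Count the serial passes a SIMD group needs for one if/else.

    takes_branch holds one boolean per thread in the group: True if that
    thread takes the 'if' side, False if it takes the 'else' side.
    Assumption of this model: each side that at least one thread takes
    costs one full pass for the whole group.
    """
    return len(set(takes_branch))


# Every thread agrees: the group runs only one side.
uniform = [True] * 32
# Even and odd threads disagree: the group runs both sides in turn.
divergent = [i % 2 == 0 for i in range(32)]

print(passes_for_branch(uniform))    # 1
print(passes_for_branch(divergent))  # 2
```

In this model a branch is free when the whole group agrees and costs double when it does not, which is the sense in which the execution path should stay the same across threads.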

Some (G80) hardware info

[Figure: the Streaming Processor Array is built from Texture Processor Clusters (TPCs). Each TPC contains Streaming Multiprocessors (SMs) and a texture unit (TEX). Each SM is shown with Instruction L1, Data L1, Instruction Fetch/Dispatch, Shared Memory and SFUs.]

CUDA threads
- IDs help determine the data to work on
- Block ID: 1D or 2D
- Thread ID: 1D, 2D, or 3D
- Simplifies memory addressing
- Yesterday's prac

CUDA thread blocks
- So multiple threads produce good parallelism
- But why do we need thread blocks?
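How block and thread IDs pick out the data to work on can be modelled on the host in plain Python. The sketch assumes a 2D grid of 2D blocks over a row-major array; the tuples stand in for what a kernel would read from `blockIdx`, `blockDim` and `threadIdx`, and the function names are invented for this example.

```python
def global_index_2d(block_idx, block_dim, thread_idx):
    """Map a (block ID, thread ID) pair to a global (x, y) coordinate.

    Mirrors the usual kernel arithmetic:
        x = blockIdx.x * blockDim.x + threadIdx.x
        y = blockIdx.y * blockDim.y + threadIdx.y
    """
    bx, by = block_idx
    dx, dy = block_dim
    tx, ty = thread_idx
    return (bx * dx + tx, by * dy + ty)


def flat_offset(x, y, width):
    """Row-major offset of element (x, y) in an array `width` elements wide."""
    return y * width + x


# Thread (3, 5) of block (2, 1), with 16x16 blocks over a 64-wide array.
x, y = global_index_2d(block_idx=(2, 1), block_dim=(16, 16), thread_idx=(3, 5))
print(x, y)                    # 35 21
print(flat_offset(x, y, 64))   # 1379
```

Each thread ends up with a distinct element, which is the sense in which the IDs simplify memory addressing.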

CUDA thread blocks
- All threads in the same block execute the same code
- Threads in a block share data
- They are synchronised in doing their share of the work
- But threads in different blocks cannot communicate
- Transparent scalability: lack of communication is a boon!
- Since
  - Blocks can execute in any order
  - Threads of a block execute together on a single streaming multiprocessor (SM)
- Thus, an increase in processor count produces a proportional increase in parallelism

CUDA thread blocks
- G80 limitations (for example)
  - Grid dimension limit is 64K in x or y
  - Block dimension limit is
    - At most 512 threads, but also
    - dim x <= 512
    - dim y <= 512
    - dim z <= 64
  - And you can only have 768 threads per SM
  - OR whichever of these 5 limitations you hit first!
- Question: Which of these is the most efficient block dimension to use given this architectural description: 4x4, 8x8, 16x8, 16x16, 32x32?
- HINT: SM occupancy

CUDA thread blocks
- So, out of the two possibilities below, which would you choose?
  - 16 x 8 = 128 threads; 768 / 128 = 6 blocks < 8 block maximum, so all 768 thread slots on a single SM are occupied
  - 16 x 16 = 256 threads; 768 / 256 = 3 blocks < 8 block maximum, so again we have full occupancy
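The occupancy arithmetic can be checked for all five candidate block sizes with a short Python sketch. It uses only the G80 figures quoted on the slides (512 threads per block, 768 threads per SM, 8 blocks per SM) and deliberately ignores register and shared-memory pressure, which also limit occupancy on real hardware. The name `resident_threads` is made up for this illustration.

```python
MAX_THREADS_PER_BLOCK = 512   # G80 figure from the slides
MAX_THREADS_PER_SM = 768      # G80 figure from the slides
MAX_BLOCKS_PER_SM = 8         # G80 figure from the slides


def resident_threads(dim_x, dim_y):
    """Threads resident on one SM for a dim_x by dim_y block.

    Returns 0 when the block exceeds the per-block thread limit.
    Simplification: registers and shared memory are not modelled.
    """
    threads = dim_x * dim_y
    if threads > MAX_THREADS_PER_BLOCK:
        return 0
    blocks = min(MAX_THREADS_PER_SM // threads, MAX_BLOCKS_PER_SM)
    return blocks * threads


for dims in [(4, 4), (8, 8), (16, 8), (16, 16), (32, 32)]:
    n = resident_threads(*dims)
    print(dims, n, "of", MAX_THREADS_PER_SM)
```

Under these assumptions 4x4 and 8x8 are capped by the 8-block limit (128 and 512 resident threads), 16x8 and 16x16 both reach 768, and 32x32 cannot launch at all because it needs 1024 threads per block.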

CUDA thread blocks
- So, out of the two possibilities, which would you choose?
- Answer: 16 x 8 = 128 threads; 768 / 128 = 6 blocks < 8 block maximum, so all 768 thread slots on a single SM are occupied
- Reason: With a larger block count, this option offers more opportunity to swap out stalled blocks. Plus, it uses more streaming processors at once.
- Actually there is another good reason for choosing a block size with an x dimension that is a multiple of 16

CUDA thread blocks: Warps
- SM divides each thread block into warps
  - 32 threads: [0,1,...,31], [32,33,...,63], etc.
- SM exchanges stalled warps for waiting warps
  - A warp stalls if any thread in the warp stalls
  - If it stalls, we swap in some other warp
  - With zero-overhead thread scheduling
  - i.e. a reason for throwing away the data cache and branch logic!
- Can sync all threads in a block (warp)
  - SM waits for all threads to reach the sync point
  - Avoids read-after-write, write-after-write, ... errors
  - A sync point inside a conditional is allowed, but the condition must be uniform across the entire thread block.
- OK, but you mentioned 16 was a good dimension
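The split of a block into warps of 32 consecutive threads can be modelled as follows. The sketch assumes the usual linearisation of a thread ID, x + y * Dx + z * Dx * Dy, and a warp size of 32 as stated on the slide. The helper names are hypothetical.

```python
WARP_SIZE = 32  # as stated on the slide


def linear_thread_id(tx, ty, tz, dim_x, dim_y):
    """Flatten a 3D thread ID within a block: x varies fastest, then y, then z."""
    return tx + ty * dim_x + tz * dim_x * dim_y


def warp_of(tx, ty, tz, dim_x, dim_y):
    """Index of the warp that holds thread (tx, ty, tz), under this model."""
    return linear_thread_id(tx, ty, tz, dim_x, dim_y) // WARP_SIZE


# In a 16x16 block, rows y=0 and y=1 fill warp 0; row y=2 starts warp 1.
print(warp_of(15, 1, 0, 16, 16))  # 0
print(warp_of(0, 2, 0, 16, 16))   # 1
```

With an x dimension of 16, every half-warp of 16 threads is exactly one row of the block, which is one way to read the remark that 16 is a good dimension.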

CUDA thread blocks: Warps
- Warps are not part of the CUDA specification
  - This means NVIDIA can do pretty much anything they want here
  - If you want scalable, long-lasting code, then perhaps warps aren't for you
- The only programmatic significance of a warp arises when you divide it into two half-warps (16 threads)
  - When a half-warp accesses global memory, the possibility for coalesced access arises
  - Much like pre-fetching, where you get spatially local accesses for free
  - However, there are some steep requirements

CUDA thread blocks: Warps
- To get coalesced memory access you need to have
  - Array elements of 4, 8 or 16 bytes, aligned on matching boundaries
  - Half-warp threads serially accessing consecutive memory addresses
  - A thread usage pattern where only the x-direction is significant
- Generally, this means accessing things in the correct order
  - This is actually one good use for shared memory
  - An illustration is in order
- The base address of these memory accesses must be aligned to a multiple of 16 times the element size (of the array data being accessed), e.g. 64 bytes for 4-byte elements
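The coalescing requirements can be turned into a small checker. This is a simplified Python model of the rules for early (G80-class) hardware: all 16 threads of the half-warp take part, elements are 4, 8 or 16 bytes, thread k reads the k-th consecutive element, and the first address is aligned to 16 times the element size. Real hardware has further cases that are not modelled, and `is_coalesced` is an invented name.

```python
HALF_WARP = 16


def is_coalesced(addresses, elem_size):
    """Check one half-warp's byte addresses against the simplified rules.

    addresses[k] is the byte address read by thread k of the half-warp.
    """
    if elem_size not in (4, 8, 16) or len(addresses) != HALF_WARP:
        return False
    base = addresses[0]
    # Base address must sit on a boundary of 16 * element size.
    if base % (HALF_WARP * elem_size) != 0:
        return False
    # Thread k must access the k-th consecutive element.
    return all(addr == base + k * elem_size for k, addr in enumerate(addresses))


row = [1024 + 4 * k for k in range(16)]       # consecutive floats, aligned
shifted = [1028 + 4 * k for k in range(16)]   # consecutive but misaligned
strided = [1024 + 8 * k for k in range(16)]   # every second float

print(is_coalesced(row, 4))      # True
print(is_coalesced(shifted, 4))  # False
print(is_coalesced(strided, 4))  # False
```

A column-wise walk through a row-major array looks like the strided case, only with a larger stride, which is why only the x-direction should drive the address.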

CUDA thread blocks: Warps
- Question: Suppose you decide to use 16x16 thread blocks. How many warps are there per SM?
- Answer:
  - 768 / (16x16) = 3 blocks of 256 threads
  - 256 / 32 = 8 warps
  - 8 warps per block means 8 * 3 = 24 warps on an SM
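The same calculation generalises to other block sizes. The sketch below reuses the G80 figures from the slides as default parameters and only accepts block sizes that are whole multiples of the warp size, to keep the model simple. `warps_per_sm` is a hypothetical helper name.

```python
def warps_per_sm(block_threads, max_threads_per_sm=768,
                 max_blocks_per_sm=8, warp_size=32):
    """Warps resident on one SM for blocks of `block_threads` threads.

    Simplification: block sizes that are not a multiple of the warp size
    are rejected rather than modelled.
    """
    if block_threads % warp_size != 0:
        raise ValueError("block size must be a multiple of the warp size")
    blocks = min(max_threads_per_sm // block_threads, max_blocks_per_sm)
    return blocks * (block_threads // warp_size)


print(warps_per_sm(16 * 16))  # 24
print(warps_per_sm(16 * 8))   # 24
```

Both full-occupancy block sizes give 24 warps per SM; they differ only in how those warps are grouped into blocks.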

CUDA thread blocks: Warps
- Quick summary of warps
  - Essentially scheduling units of the SM
  - One and only one warp executes at a time on an SM
  - I'll repeat that: At any point in time, only one warp is executed by an SM
  - Warps are scheduled by some unknown hardware algorithm according to some unspecified priority metric. NVIDIA has the answers but isn't sharing.
  - They're an implementation decision by NVIDIA, so everything I just told you could be a lie.

Some final points
- Bordering blocks do not in general run on the same streaming multiprocessor
- Blocks cannot synchronise during kernel execution. The best you can do is wait for the kernel to finish.
- You will live a long and productive life if you forget about warps. IMHO, algorithmic considerations and texture memory (that is, cached) accesses will likely bring you much joy. This is the substance of the next talk.

CUDA Memories

James Gain, Michelle Kuttel, Sebastian Wyngaard, Simon Perkins and Jason Brownbridge
{jgain, mkuttel, sperkins, jbrownbr}@cs.uct.ac.za, swyngaard@csir.co.za
3-6 May 2011 (slides dated 5/4/11)

Introduction