CUDA Architecture & Programming Model

Course on Multi-core Architectures & Programming
Oliver Taubmann
May 9, 2012

Outline
- Introduction
- Architecture
  - Generation Fermi
  - A Brief Look Back At Tesla
  - What's New With Kepler?
- Programming
  - Programming Model
  - Software Framework
  - Example Code

Motivation: GPU vs. CPU

The Rise Of GPGPU
- Early 2000s: programmable shaders enable general-purpose computing on GPUs
- But: intimate knowledge of the graphics pipeline and graphics APIs was required; GPUs were powerful yet inflexible
- A unified processor architecture was needed for both graphics and computing (→ G80)
- Since 2007: CUDA (Compute Unified Device Architecture)
- Program GPUs intuitively with (extended) C

Fermi Architecture Overview
- 16 streaming multiprocessors (SMs), 512 cores in total

Fermi's Streaming Multiprocessor
- SIMT (single instruction, multiple threads)
- Hardware threading: no context-switch overhead!
- Groups of 32 threads (warps) are scheduled together
- Special Function Units (SFUs) for e.g. sin/cos, 1/x, √x
- Scalable: just add more SMs!

Fermi's Memory Hierarchy
- 64 KB of on-chip memory at block level (per SM, shared memory and L1 cache)
- 768 KB L2 cache

What Tesla Couldn't Do: Fused Multiply-Add
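To see why a fused multiply-add matters, the following minimal sketch (not part of the original slides) evaluates a*b + c twice on the device: once fused, with a single rounding at the end, and once with the multiply and the add rounded separately. The names `fmaDemo` and `runFmaDemo` are hypothetical; `__fmaf_rn`, `__fmul_rn` and `__fadd_rn` are standard CUDA device intrinsics. Error handling is kept to a minimum.

```cuda
#include <cuda_runtime.h>

// Computes a*b + c in two ways on the device.
// out[0]: fused multiply-add, rounded once at the end.
// out[1]: multiply and add rounded separately (these intrinsics are
//         never contracted into an FMA by the compiler).
__global__ void fmaDemo(float a, float b, float c, float *out)
{
    out[0] = __fmaf_rn(a, b, c);
    out[1] = __fadd_rn(__fmul_rn(a, b), c);
}

// Hypothetical host wrapper: launches a single thread and copies both
// results back. Returns true if the CUDA calls succeeded.
bool runFmaDemo(float a, float b, float c, float result[2])
{
    float *out_gpu = nullptr;
    if (cudaMalloc(&out_gpu, 2 * sizeof(float)) != cudaSuccess) return false;

    fmaDemo<<<1, 1>>>(a, b, c, out_gpu);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t err = cudaMemcpy(result, out_gpu, 2 * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(out_gpu);
    return err == cudaSuccess;
}
```

Calling `runFmaDemo(1.000244140625f, 1.000244140625f, -1.00048828125f, r)`, i.e. a = b = 1 + 2^-12 and c = -(1 + 2^-11), should leave the exact answer 2^-24 in `r[0]`, while `r[1]` should be 0 because the low bit of the product is lost when it is rounded before the addition.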

What Else Was Improved Over Tesla
- Introduction of L1 and L2 caches
- Better double precision performance
- Atomic operations up to 20 times faster
- Concurrent kernel execution possible

Kepler Architecture Overview
- 1536 cores in total (though running at a lower shader clock rate than Fermi)

Main Focus: Power Efficiency

Three CUDA Generations At A Glance

Grids, Blocks, Threads
- Threads map to cores
- Blocks map to SMs
- SMs schedule warps
- Grids and blocks have up to 3 dimensions
- Threads in a block
  - communicate through shared memory
  - synchronize at a barrier (__syncthreads())
- Blocks in a grid
  - communicate through global memory
  - synchronize only at the end of the kernel
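The two levels of communication can be sketched with a block-wise sum. This is an illustrative example, not code from the slides: the names `blockSum`, `sumOnGpu` and `BLOCK_SIZE` are made up, and error checking is omitted for brevity. Threads of one block cooperate through shared memory and meet at `__syncthreads()`; blocks only hand their partial results over via global memory, and the host combines them once the kernel has ended.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Threads per block; the reduction loop below assumes a power of two.
constexpr unsigned int BLOCK_SIZE = 256;

// Each block sums up to BLOCK_SIZE input values in shared memory and
// writes one partial sum to global memory.
__global__ void blockSum(const float *in, float *partial, unsigned int n)
{
    __shared__ float cache[BLOCK_SIZE];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    cache[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads(); // all loads must be visible before reducing

    // Tree reduction: halve the number of active threads in each step.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) cache[tid] += cache[tid + stride];
        __syncthreads(); // barrier reached by every thread of the block
    }

    if (tid == 0) partial[blockIdx.x] = cache[0];
}

// Host wrapper: the partial sums are combined only after the kernel ended.
float sumOnGpu(const std::vector<float> &data)
{
    unsigned int n = static_cast<unsigned int>(data.size());
    if (n == 0) return 0.0f;
    unsigned int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;

    float *in_gpu = nullptr, *partial_gpu = nullptr;
    cudaMalloc(&in_gpu, n * sizeof(float));
    cudaMalloc(&partial_gpu, blocks * sizeof(float));
    cudaMemcpy(in_gpu, data.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, BLOCK_SIZE>>>(in_gpu, partial_gpu, n);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), partial_gpu, blocks * sizeof(float),
               cudaMemcpyDeviceToHost); // implicitly waits for the kernel

    cudaFree(in_gpu);
    cudaFree(partial_gpu);

    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

For example, `sumOnGpu(std::vector<float>(1000, 1.0f))` launches four blocks and should return 1000.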

Automatic Scalability

Software Stack
- Libraries include CUBLAS, CUFFT, Thrust (an STL-like template library), ...
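As a rough impression of what working with one of these libraries looks like, here is a small Thrust sketch. It is an illustration only; the function name `sortAndSum` is hypothetical, while the Thrust calls used (`thrust::device_vector`, `thrust::sort`, `thrust::reduce`, `thrust::copy`) are part of the library's public interface.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>
#include <thrust/sort.h>
#include <vector>

// Sorts 'values' on the GPU (in place, from the caller's point of view)
// and returns the sum of all elements.
int sortAndSum(std::vector<int> &values)
{
    // Constructing a device_vector from host iterators copies to the GPU.
    thrust::device_vector<int> d(values.begin(), values.end());

    thrust::sort(d.begin(), d.end());
    int total = thrust::reduce(d.begin(), d.end(), 0, thrust::plus<int>());

    // Copy the sorted data back into the host vector.
    thrust::copy(d.begin(), d.end(), values.begin());
    return total;
}
```

No kernel is written by hand here; Thrust generates and launches the device code behind its STL-style algorithms.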

Runtime Library And Built-ins
Types / Functions
- Vector types: int2, dim3 (uint3), float4, ...
- Math functions: sinf, powf, min, ...
- Atomic functions: atomicAdd(), atomicMax(), ...
- Memory management: cudaMalloc(), cudaMemcpy(), ...
- __syncthreads()
Variables
- uint3 threadIdx: position within block
- uint3 blockIdx: position within grid
- dim3 blockDim: block size in threads
- dim3 gridDim: grid size in blocks
- int warpSize: number of threads per warp
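A short sketch of how these built-ins typically appear together. It is not taken from the slides; the names `countNegative` and `countNegativeOnGpu` and the launch configuration of 4 blocks with 64 threads are arbitrary choices for illustration, and error checking is left out. Every thread derives its global index from `blockIdx`, `blockDim` and `threadIdx`, strides through the data using `gridDim`, and updates a shared counter with `atomicAdd()`.

```cuda
#include <cuda_runtime.h>

// Counts the negative elements of 'data'. Each thread visits the array in
// steps of the total number of launched threads (a grid-stride loop).
__global__ void countNegative(const int *data, unsigned int n,
                              unsigned int *count)
{
    unsigned int first  = blockIdx.x * blockDim.x + threadIdx.x; // global index
    unsigned int stride = gridDim.x * blockDim.x;                // threads in grid

    for (unsigned int i = first; i < n; i += stride)
        if (data[i] < 0)
            atomicAdd(count, 1u); // concurrent increments do not get lost
}

// Host wrapper: copies the input to the device, runs the kernel and
// returns the counter.
unsigned int countNegativeOnGpu(const int *data, unsigned int n)
{
    int *data_gpu = nullptr;
    unsigned int *count_gpu = nullptr;
    unsigned int count = 0;

    cudaMalloc(&data_gpu, n * sizeof(int));
    cudaMalloc(&count_gpu, sizeof(unsigned int));
    cudaMemcpy(data_gpu, data, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(count_gpu, 0, sizeof(unsigned int));

    countNegative<<<4, 64>>>(data_gpu, n, count_gpu);

    cudaMemcpy(&count, count_gpu, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(data_gpu);
    cudaFree(count_gpu);
    return count;
}
```

Because of the grid-stride loop, the same launch configuration works for any input length, including inputs larger than the number of threads.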

Qualifiers
Function types:
- __global__: kernel, called from host, executed on device
- __device__: function called from device, executed on device
- __host__: function called from host, executed on host (optional)
Variable types:
- __device__: global, accessible by device and host (optional)
- __constant__: constant, accessible by device (read only) and host
- __shared__: shared, life span and access tied to block
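The qualifiers can be seen side by side in a small polynomial evaluation. This is a sketch under assumptions, not code from the slides: `coeff`, `horner`, `evalPoly` and `evalPolyOnGpu` are hypothetical names and error checking is omitted. The host fills the `__constant__` array with `cudaMemcpyToSymbol()`, the `__global__` kernel reads it, and the `__device__` helper can only be called from device code.

```cuda
#include <cuda_runtime.h>

// Constant memory: read-only for kernels, written by the host.
__constant__ float coeff[3];

// Device-only helper: Horner's scheme for c0 + c1*x + c2*x^2.
__device__ float horner(float x)
{
    return coeff[0] + x * (coeff[1] + x * coeff[2]);
}

// Kernel: one thread per input value.
__global__ void evalPoly(const float *x, float *y, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = horner(x[i]);
}

// Host function (the __host__ qualifier is optional and shown for clarity).
__host__ void evalPolyOnGpu(const float c[3], const float *x, float *y,
                            unsigned int n)
{
    float *x_gpu = nullptr, *y_gpu = nullptr;

    cudaMemcpyToSymbol(coeff, c, 3 * sizeof(float)); // host -> constant memory
    cudaMalloc(&x_gpu, n * sizeof(float));
    cudaMalloc(&y_gpu, n * sizeof(float));
    cudaMemcpy(x_gpu, x, n * sizeof(float), cudaMemcpyHostToDevice);

    unsigned int threads = 128;
    unsigned int blocks = (n + threads - 1) / threads;
    evalPoly<<<blocks, threads>>>(x_gpu, y_gpu, n);

    cudaMemcpy(y, y_gpu, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(x_gpu);
    cudaFree(y_gpu);
}
```

With coefficients {1, 2, 3} and inputs {0, 1, 2}, the output should be {1, 6, 17}.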

Compilation
- NVCC separates the serial (host) and parallel (device) parts of the code
- Device code is compiled to the pseudo-assembly PTX (Parallel Thread Execution)
- Finally, everything is linked into one executable

minibrot.cu

Host Code

    size_t n = 30;     // side length of canvas
    size_t block = 5;  // side length of a block

    dim3 blockDim(block, block);
    // round up so the grid covers the canvas even if n is not a multiple of block
    dim3 gridDim((n + block - 1) / block, (n + block - 1) / block);

    char *arr_gpu;
    cudaMalloc(&arr_gpu, n * n * sizeof(char));

    mandelbrotKernel<<<gridDim, blockDim>>>(arr_gpu, n);

    cudaDeviceSynchronize();  // wait for the kernel to finish

    char *arr = (char *) malloc(n * n * sizeof(char));

    cudaMemcpy(arr, arr_gpu, n * n * sizeof(char), cudaMemcpyDeviceToHost);

    printMatrix(arr, n);  // helper that prints the canvas (not shown on the slides)

    free(arr);
    cudaFree(arr_gpu);

Device Code

    __global__ void mandelbrotKernel(char *arr, size_t n)
    {
        uint2 idx;  // position on canvas
        idx.x = blockIdx.x * blockDim.x + threadIdx.x;
        idx.y = blockIdx.y * blockDim.y + threadIdx.y;

        // threads outside the canvas have nothing to do
        if (!(idx.x < n && idx.y < n)) return;

        float2 z = make_float2(0.0f, 0.0f);
        // map the canvas position to the square [-1, 1) x [-1, 1)
        float2 c = make_float2(-1.0f + 2.0f * (float(idx.x) / n),
                               -1.0f + 2.0f * (float(idx.y) / n));

        int iter = 0;
        int maxiter = 100;

        // iterate z = z^2 + c while |z|^2 < 4, i.e. |z| < 2
        for (; iter < maxiter && (z.x * z.x + z.y * z.y) < 4.0f; ++iter)
            z = make_float2(z.x * z.x - z.y * z.y + c.x, 2.0f * z.x * z.y + c.y);

        // points that never escaped are drawn as '#'
        arr[idx.x * n + idx.y] = (iter == maxiter) ? '#' : ' ';
    }
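The host code calls a `printMatrix` helper that the slides do not show. A possible version, given purely as an assumption about what it might look like, writes the canvas row by row; `formatMatrix` is an additional hypothetical helper introduced here so that the text can also be inspected without printing.

```cuda
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: renders the n x n canvas as text, one row per line.
// Row r occupies arr[r * n] .. arr[r * n + n - 1], matching the kernel's
// indexing arr[idx.x * n + idx.y].
std::string formatMatrix(const char *arr, std::size_t n)
{
    std::string out;
    out.reserve(n * (n + 1));
    for (std::size_t row = 0; row < n; ++row) {
        out.append(arr + row * n, n);
        out.push_back('\n');
    }
    return out;
}

// Assumed shape of the helper used in the host code: print to stdout.
void printMatrix(const char *arr, std::size_t n)
{
    std::fputs(formatMatrix(arr, n).c_str(), stdout);
}
```

Since the array is not null-terminated, the rows are appended by explicit length rather than treated as C strings.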


Thanks for your attention! Questions?