CUDA Architecture & Programming Model
Course on Multi-core Architectures & Programming
Oliver Taubmann
May 9, 2012
Outline
- Introduction
- Architecture
  - Generation Fermi
  - A Brief Look Back At Tesla
  - What's New With Kepler?
- Programming
  - Programming Model
  - Software Framework
  - Example Code
Motivation: GPU vs. CPU
The Rise Of GPGPU
- Early 2000s: Programmable shaders enable general-purpose computing on GPUs
- But: Intimate knowledge of the graphics pipeline and graphics APIs was required; GPUs were powerful yet inflexible
- A unified processor architecture was needed for both graphics and computing (G80)
- Since 2007: CUDA (Compute Unified Device Architecture)
- Program GPUs intuitively with (extended) C
Fermi Architecture Overview
- 16 streaming multiprocessors (SMs), 512 cores in total
Fermi's Streaming Multiprocessor
- SIMT (single instruction, multiple threads)
- Hardware threading: threads are scheduled and switched in hardware, with no overhead
- Groups of 32 threads (warps) are scheduled together
- Special Function Units (SFUs) for e.g. sin/cos, 1/x, √x
- Scalable: just add more SMs!
Fermi's Memory Hierarchy
- 64 KB of on-chip memory per SM at block level, split between shared memory and L1 cache
- 768 KB L2 cache shared by all SMs
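The figures quoted for Fermi can be compared against whatever GPU is installed by asking the runtime for its device properties. The following is a minimal sketch, assuming that device 0 exists; `describeDevice` is a hypothetical helper name, not part of the original slides.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints a few hardware limits of the given device.
// Returns 0 on success, 1 if the runtime reports an error.
int describeDevice(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return 1;

    printf("%s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor);
    printf("  streaming multiprocessors: %d\n", prop.multiProcessorCount);
    printf("  warp size:                 %d threads\n", prop.warpSize);
    printf("  shared memory per block:   %zu KB\n", prop.sharedMemPerBlock / 1024);
    printf("  L2 cache:                  %d KB\n", prop.l2CacheSize / 1024);
    return 0;
}

int main()
{
    return describeDevice(0); // assumption: at least one CUDA device is present
}
```

The reported values depend on the installed hardware, so no particular output is implied here.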
What Tesla Couldn't Do: Fused Multiply-Add
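The effect of fusing can be made visible with a single value. The sketch below is an illustration, not code from the talk: it computes a*a + c once with separately rounded multiply and add (`__fmul_rn`, `__fadd_rn`) and once with `fmaf`, which rounds only at the end. The inputs are chosen so that the exact product lies halfway between two floats; `compareFma` and `fmaKernel` are hypothetical names.

```cuda
#include <cassert>
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// out[0]: multiply and add rounded separately; out[1]: fused multiply-add.
__global__ void fmaKernel(float a, float c, float* out)
{
    out[0] = __fadd_rn(__fmul_rn(a, a), c); // two roundings, never contracted
    out[1] = fmaf(a, a, c);                 // a single rounding at the end
}

// Hypothetical helper: runs the kernel once and copies both results back.
// Returns 0 on success, 1 on any runtime error.
int compareFma(float* separate, float* fused)
{
    const float a = 1.0f + ldexpf(1.0f, -12);    // 1 + 2^-12
    const float c = -(1.0f + ldexpf(1.0f, -11)); // -(1 + 2^-11)

    float* d_out = nullptr;
    if (cudaMalloc(&d_out, 2 * sizeof(float)) != cudaSuccess) return 1;

    fmaKernel<<<1, 1>>>(a, c, d_out);

    float h_out[2];
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return 1;

    *separate = h_out[0];
    *fused = h_out[1];
    return 0;
}

int main()
{
    float separate, fused;
    if (compareFma(&separate, &fused) != 0) return 1;

    // Exact value: a*a + c = 2^-24. The separately rounded product loses it.
    assert(separate == 0.0f);
    assert(fused == ldexpf(1.0f, -24));
    printf("separate = %g, fused = %g\n", separate, fused);
    return 0;
}
```

The exact product a*a is 1 + 2^-11 + 2^-24, which is a tie between two neighbouring floats and rounds to 1 + 2^-11, so the separate version cancels to zero while the fused version keeps the 2^-24 term.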
What Else Was Improved Over Tesla
- Introduction of L1 and L2 caches
- Better double precision performance
- Atomic operations up to 20 times faster
- Concurrent kernel execution possible
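To make the atomic operations mentioned above concrete, here is a minimal sketch in which every thread of a grid increments one shared counter with `atomicAdd`. Without the atomic, concurrent read-modify-write updates could be lost. The names `countKernel` and `countThreads` are hypothetical, and error checking is omitted for brevity.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Every thread adds 1 to the same counter in global memory.
__global__ void countKernel(unsigned int* counter)
{
    atomicAdd(counter, 1u); // indivisible read-modify-write
}

// Hypothetical helper: launches blocks * threadsPerBlock threads and
// returns the final counter value.
unsigned int countThreads(unsigned int blocks, unsigned int threadsPerBlock)
{
    unsigned int* d_counter = nullptr;
    cudaMalloc(&d_counter, sizeof(unsigned int));
    cudaMemset(d_counter, 0, sizeof(unsigned int));

    countKernel<<<blocks, threadsPerBlock>>>(d_counter);

    unsigned int result = 0;
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(&result, d_counter, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_counter);
    return result;
}

int main()
{
    unsigned int total = countThreads(64, 256);
    assert(total == 64u * 256u);
    printf("counted %u threads\n", total);
    return 0;
}
```

A single contended counter is the worst case for atomics; it is used here only because it makes the correctness guarantee easy to check.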
Kepler Architecture Overview
- 1536 cores in total (though running at a lower shader clock rate than Fermi)
Main Focus: Power Efficiency
Three CUDA Generations At A Glance
Grids, Blocks, Threads
- Threads map to cores
- Blocks map to SMs
- SMs schedule warps
- Grids and blocks have up to 3 dimensions
- Threads in a block
  - communicate through shared memory
  - synchronize at a barrier (__syncthreads())
- Blocks in a grid
  - communicate through global memory
  - synchronize only at the end of the kernel
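Both communication levels can be seen in a block-wise sum. In the sketch below, the threads of a block cooperate through shared memory and `__syncthreads()`, each block writes one partial result to global memory, and the partial results are only combined after the kernel has ended. This is an illustration under stated assumptions: the block size is a power of two, the input is non-empty, error checking is omitted, and `blockSumKernel` and `sumOnGpu` are hypothetical names.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

const int THREADS = 256; // threads per block; assumed to be a power of two

__global__ void blockSumKernel(const int* in, int* partial, int n)
{
    __shared__ int cache[THREADS]; // one slot per thread of this block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    cache[threadIdx.x] = (i < n) ? in[i] : 0;
    __syncthreads(); // all loads are visible before any thread reads them

    // Tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads(); // outside the branch, so every thread reaches it
    }

    // Blocks communicate through global memory: one partial sum each.
    if (threadIdx.x == 0) partial[blockIdx.x] = cache[0];
}

// Hypothetical helper: sums a non-empty vector on the GPU.
int sumOnGpu(const std::vector<int>& data)
{
    int n = (int)data.size();
    int blocks = (n + THREADS - 1) / THREADS;

    int* d_in = nullptr;
    int* d_partial = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_partial, blocks * sizeof(int));
    cudaMemcpy(d_in, data.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    blockSumKernel<<<blocks, THREADS>>>(d_in, d_partial, n);

    // The kernel has ended once this copy returns, so all blocks are done.
    std::vector<int> partial(blocks);
    cudaMemcpy(partial.data(), d_partial, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);

    int total = 0;
    for (int b = 0; b < blocks; ++b) total += partial[b];
    return total;
}

int main()
{
    std::vector<int> ones(1000, 1);
    assert(sumOnGpu(ones) == 1000);
    printf("sum = %d\n", sumOnGpu(ones));
    return 0;
}
```

Combining the partial sums on the host keeps the example short; a second kernel launch would serve the same purpose, since a kernel boundary is the only grid-wide synchronization point mentioned above.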
Automatic Scalability
Software Stack
- Libraries include CUBLAS, CUFFT, Thrust (STL-like), ...
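As an impression of what programming against one of these libraries looks like, the following sketch uses Thrust to fill a vector on the device and sum it without writing a kernel by hand. It is an illustration only; `sumFirstN` is a hypothetical helper name.

```cuda
#include <cassert>
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/sequence.h>

// Hypothetical helper: computes 1 + 2 + ... + n on the GPU.
int sumFirstN(int n)
{
    thrust::device_vector<int> v(n);              // storage in device memory
    thrust::sequence(v.begin(), v.end(), 1);      // fills with 1, 2, ..., n
    return thrust::reduce(v.begin(), v.end(), 0); // parallel sum, initial value 0
}

int main()
{
    assert(sumFirstN(100) == 5050);
    printf("sum = %d\n", sumFirstN(100));
    return 0;
}
```

The interface mirrors STL containers and algorithms, which is why memory management and kernel launches do not appear in the code.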
Runtime Library And Built-ins
Types / Functions
- Vector types: int2, dim3 (based on uint3), float4, ...
- Math functions: sinf, powf, min, ...
- Atomic functions: atomicAdd(), atomicMax(), ...
- Memory management: cudaMalloc(), cudaMemcpy(), ...
- __syncthreads()
Variables
    uint3 threadIdx    Position within block
    uint3 blockIdx     Position within grid
    dim3  blockDim     Block size in threads
    dim3  gridDim      Grid size in blocks
    int   warpSize     Number of threads in a warp
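The built-in variables are typically combined to give each thread a unique position. The sketch below is an illustration rather than code from the talk: each thread computes its global index from `blockIdx`, `blockDim`, and `threadIdx`, and stores which warp of its block it belongs to using `warpSize`. The names `warpIndexKernel` and `warpIndices` are hypothetical, and error checking is omitted.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void warpIndexKernel(int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x; // global position of this thread
    if (i < n)
        out[i] = threadIdx.x / warpSize;           // warp index within the block
}

// Hypothetical helper: returns the warp index recorded by each of n threads.
std::vector<int> warpIndices(int n, int threadsPerBlock)
{
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock; // round up

    int* d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(int));

    warpIndexKernel<<<blocks, threadsPerBlock>>>(d_out, n);

    std::vector<int> out(n);
    cudaMemcpy(out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return out;
}

int main()
{
    // 128 threads in blocks of 64, i.e. two warps of 32 threads per block.
    std::vector<int> w = warpIndices(128, 64);
    assert(w[0] == 0 && w[31] == 0);  // first warp of block 0
    assert(w[32] == 1 && w[63] == 1); // second warp of block 0
    assert(w[64] == 0);               // first warp of block 1
    printf("thread 40 is in warp %d of its block\n", w[40]);
    return 0;
}
```

The assertions rely on a warp size of 32 threads, as stated for Fermi earlier in the talk.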
Qualifiers
Function types:
    __global__      kernel: called from host, executed on device
    __device__      function called from device, executed on device
    __host__        function called from host, executed on host (optional)
Variable types:
    __device__      global, accessible by device and host (optional)
    __constant__    constant, accessible by device (read only) and host
    __shared__      shared, life span and access tied to block
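A small program can show several of these qualifiers working together. In the sketch below, a `__constant__` variable is written by the host and read by the device, a `__device__` function is called from a `__global__` kernel, and the host side is an ordinary function. All names (`scale`, `scaled`, `scaleKernel`, `scaleOne`) are hypothetical, `__shared__` is left out to keep the example short, and error checking is omitted.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

__constant__ float scale; // constant memory: read-only on the device, set by the host

// Device function: callable only from device code.
__device__ float scaled(float x)
{
    return scale * x;
}

// Kernel: called from the host, executed on the device.
__global__ void scaleKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = scaled(data[i]);
}

// Host function (the __host__ qualifier is optional and omitted here).
float scaleOne(float x, float factor)
{
    cudaMemcpyToSymbol(scale, &factor, sizeof(float)); // host writes the constant

    float* d_x = nullptr;
    cudaMalloc(&d_x, sizeof(float));
    cudaMemcpy(d_x, &x, sizeof(float), cudaMemcpyHostToDevice);

    scaleKernel<<<1, 32>>>(d_x, 1);

    float result = 0.0f;
    cudaMemcpy(&result, d_x, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    return result;
}

int main()
{
    assert(scaleOne(3.0f, 2.0f) == 6.0f);
    printf("3 * 2 = %g\n", scaleOne(3.0f, 2.0f));
    return 0;
}
```

The host never assigns to `scale` directly; it goes through `cudaMemcpyToSymbol`, which is how host access to such variables works in the runtime API.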
Compilation
- NVCC separates serial (host) and parallel (device) parts
- Device code is compiled to the pseudo-assembly PTX (Parallel Thread Execution)
- Finally, everything is linked into one executable
minibrot.cu
Host Code

    size_t n = 30;     // side length of canvas
    size_t block = 5;  // side length of a block

    dim3 blockDim(block, block);
    // round up so that the grid covers the whole canvas
    dim3 gridDim((n + block - 1) / block, (n + block - 1) / block);

    char* arr_gpu;
    cudaMalloc(&arr_gpu, n * n * sizeof(char));

    mandelbrotKernel<<<gridDim, blockDim>>>(arr_gpu, n);

    cudaDeviceSynchronize(); // wait for the kernel to finish

    char* arr = (char*) malloc(n * n * sizeof(char));

    cudaMemcpy(arr, arr_gpu, n * n * sizeof(char), cudaMemcpyDeviceToHost);

    printMatrix(arr, n);

    free(arr);
    cudaFree(arr_gpu);
Device Code

    __global__ void mandelbrotKernel(char* arr, size_t n)
    {
        uint2 idx; // position on canvas
        idx.x = blockIdx.x * blockDim.x + threadIdx.x;
        idx.y = blockIdx.y * blockDim.y + threadIdx.y;

        // threads outside the canvas have nothing to do
        if (!(idx.x < n && idx.y < n)) return;

        // map the canvas position to c in [-1, 1) x [-1, 1)
        float2 z = make_float2(0.0f, 0.0f);
        float2 c = make_float2(-1.0f + 2.0f * (float(idx.x) / n),
                               -1.0f + 2.0f * (float(idx.y) / n));

        int iter = 0;
        int maxiter = 100;

        // iterate z = z^2 + c while |z| < 2, i.e. |z|^2 < 4
        for (; iter < maxiter && (z.x * z.x + z.y * z.y) < 4.0f; ++iter)
            z = make_float2(z.x * z.x - z.y * z.y + c.x, 2.0f * z.x * z.y + c.y);

        // points that never escaped are drawn as part of the set
        arr[idx.x * n + idx.y] = (iter == maxiter) ? '#' : ' ';
    }
Sources
- Erik Lindholm, John Nickolls, Stuart Oberman, and John Montrym. NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro, 28(2):39–55, March 2008.
- Nvidia Corporation. NVIDIA's Next Generation CUDA Compute Architecture: Fermi. 2009.
- Nvidia Corporation. NVIDIA GeForce GTX 680: The fastest, most efficient GPU ever built. 2012.
- Nvidia Corporation. NVIDIA CUDA C Programming Guide v. 4.2. 2012.
(Plus slides from talks given in this course in previous years.)
Thanks for your attention!
Questions?
Image from: gizmodo.com.au/2009/05/giz_explains_gpgpu_computing_and_why_itll_melt_your_face_off-2