CUDA Architecture & Programming Model


1 CUDA Architecture & Programming Model
Course on Multi-core Architectures & Programming
Oliver Taubmann, May 9, 2012

2 Outline
- Introduction
- Architecture
  - Generation Fermi
  - A Brief Look Back At Tesla
  - What's New With Kepler?
- Programming
  - Programming Model
  - Software Framework
  - Example Code


4 Motivation: GPU vs. CPU


6 The Rise Of GPGPU
- Early 2000s: Programmable shaders enable general-purpose computing on GPUs
- But: Intimate knowledge of the graphics pipeline and APIs was required; GPUs were powerful yet inflexible
- A unified processor architecture was needed for both graphics and computing (G80)
- Since 2007: CUDA (Compute Unified Device Architecture)
- Program GPUs intuitively with (extended) C


9 Fermi Architecture Overview
16 streaming multiprocessors (SMs), 512 cores in total


11 Fermi's Streaming Multiprocessor
- SIMT (single instruction, multiple threads)
- Hardware threading: no overhead!
- Groups of 32 threads (warps) are scheduled together
- Special Function Units (SFUs) for e.g. sin/cos, 1/x, √x
- Scalable: just add more SMs!

12 Fermi's Memory Hierarchy
64 KB of on-chip memory per SM at block level (shared memory / L1 cache), 768 KB L2 cache


14 What Tesla Couldn't Do: Fused Multiply-Add
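A fused multiply-add computes a*b + c with a single rounding step instead of two. The following is a minimal sketch of that difference, not code from the slides: `fmaDemoKernel` and `runFmaDemo` are hypothetical names, and the comparison relies on the CUDA intrinsics `__fmul_rn` and `__fadd_rn`, which round after each operation and are not merged into an FMA by the compiler.

```cuda
#include <cuda_runtime.h>

// Hypothetical demo kernel: computes a*b + c twice, once fused and once
// with two separately rounded operations.
__global__ void fmaDemoKernel(float a, float b, float c, float *out)
{
    // Fused multiply-add: the product is kept exact, one rounding at the end.
    out[0] = fmaf(a, b, c);
    // Separate multiply and add: each intrinsic rounds to nearest on its own.
    out[1] = __fadd_rn(__fmul_rn(a, b), c);
}

// Hypothetical host helper: returns true if both results were copied back.
bool runFmaDemo(float a, float b, float c, float *fused, float *unfused)
{
    float *dOut = nullptr;
    if (cudaMalloc(&dOut, 2 * sizeof(float)) != cudaSuccess) return false;

    fmaDemoKernel<<<1, 1>>>(a, b, c, dOut);

    float hOut[2] = {0.0f, 0.0f};
    // cudaMemcpy waits for the kernel on the default stream to finish.
    cudaError_t err = cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);
    cudaFree(dOut);

    *fused = hOut[0];
    *unfused = hOut[1];
    return err == cudaSuccess;
}
```

With a = b = 1 + 2^-12 and c = -(1 + 2^-11), the exact product is 1 + 2^-11 + 2^-24. The separately rounded version drops the last term and returns 0, while the fused version returns 2^-24.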


15 What Else Was Improved Over Tesla
- Introduction of L1 and L2 caches
- Better double precision performance
- Atomic operations up to 20 times faster
- Concurrent kernel execution possible
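To make the atomic operations mentioned on this slide concrete, here is a small sketch in which many threads increment one counter in global memory. The names `countEvenKernel` and `countEven` are hypothetical, the block size of 128 is an arbitrary choice, and the helper assumes n > 0.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: every thread whose element is even bumps one counter
// in global memory. atomicAdd makes the read-modify-write indivisible, so
// concurrent increments are not lost.
__global__ void countEvenKernel(const int *data, int n, int *counter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] % 2 == 0)
        atomicAdd(counter, 1);
}

// Hypothetical host helper: returns the number of even values,
// or -1 if a CUDA call fails. Assumes n > 0.
int countEven(const int *hostData, int n)
{
    int *dData = nullptr;
    int *dCounter = nullptr;
    int result = -1;

    if (cudaMalloc(&dData, n * sizeof(int)) == cudaSuccess &&
        cudaMalloc(&dCounter, sizeof(int)) == cudaSuccess) {
        cudaMemcpy(dData, hostData, n * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemset(dCounter, 0, sizeof(int));

        int threads = 128;                          // arbitrary block size
        int blocks = (n + threads - 1) / threads;   // enough blocks to cover n
        countEvenKernel<<<blocks, threads>>>(dData, n, dCounter);

        if (cudaMemcpy(&result, dCounter, sizeof(int),
                       cudaMemcpyDeviceToHost) != cudaSuccess)
            result = -1;
    }
    cudaFree(dData);     // cudaFree(nullptr) is a no-op
    cudaFree(dCounter);
    return result;
}
```

Without `atomicAdd`, two threads could read the same old counter value and one increment would be lost.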


17 Kepler Architecture Overview
1536 cores in total (though running at a lower shader clock rate than Fermi)

18 Main Focus: Power Efficiency

19 Three CUDA Generations At A Glance



23 Grids, Blocks, Threads
- Threads map to cores
- Blocks map to SMs
- SMs schedule warps
- Grids & blocks have up to 3 dimensions
- Threads in a block communicate through shared memory and synchronize at a barrier (__syncthreads())
- Blocks in a grid communicate through global memory and synchronize only at the end of a kernel
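The two communication levels on this slide can be sketched with a block-wise sum. Threads of one block cooperate through shared memory with `__syncthreads()`, while the per-block results travel through global memory and are combined only after the kernel has ended. The names `blockSumKernel`, `sumOnGpu` and `kThreads` are hypothetical. The sketch assumes a non-empty input and omits error checking for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 256;  // threads per block; assumed to be a power of two

// Hypothetical kernel: each block sums its slice of `in` in shared memory
// and writes one partial sum to global memory.
__global__ void blockSumKernel(const float *in, int n, float *partial)
{
    __shared__ float buf[kThreads];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();  // all loads are visible before the reduction starts

    // Tree reduction. The barrier sits outside the `if`, so every thread
    // of the block reaches it in every iteration.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            buf[threadIdx.x] += buf[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        partial[blockIdx.x] = buf[0];
}

// Hypothetical host helper: blocks cannot synchronize with each other inside
// the kernel, so the partial sums are added up after it has finished.
float sumOnGpu(const std::vector<float> &host)
{
    int n = static_cast<int>(host.size());
    int blocks = (n + kThreads - 1) / kThreads;

    float *dIn = nullptr;
    float *dPartial = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dPartial, blocks * sizeof(float));
    cudaMemcpy(dIn, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSumKernel<<<blocks, kThreads>>>(dIn, n, dPartial);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dPartial, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);  // also waits for the kernel
    cudaFree(dIn);
    cudaFree(dPartial);

    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

Finishing the sum on the host is one simple option. A second kernel launch over the partial sums would follow the same "synchronize at kernel end" pattern.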

24 Automatic Scalability


26 Software Stack
(Libraries include CUBLAS, CUFFT, Thrust (STL-like), ...)


28 Runtime Library And Built-ins
Types / Functions
- Vector types: int2, dim3 (uint3), float4, ...
- Math functions: sinf, powf, min, ...
- Atomic functions: atomicAdd(), atomicMax(), ...
- Memory management: cudaMalloc(), cudaMemcpy(), ...
- __syncthreads()
Variables
- uint3 threadIdx: position within block
- uint3 blockIdx: position within grid
- dim3 blockDim: block size in threads
- dim3 gridDim: grid size in blocks
- int warpSize: number of threads per warp
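As an illustration of how the built-in variables fit together, the sketch below uses all five in one kernel. Each thread walks the array in steps of the total thread count (a grid-stride loop, a common pattern that the slides do not mention) and records its lane, i.e. its position within its warp. `laneKernel` and `lanesOnGpu` are hypothetical names, and the launch shape of 2 blocks with 64 threads each is arbitrary.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: stores, for each element, the lane index of the
// thread that handled it.
__global__ void laneKernel(int *lane, int n)
{
    int first = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    int step = gridDim.x * blockDim.x;                  // total number of threads
    for (int i = first; i < n; i += step)
        lane[i] = threadIdx.x % warpSize;               // position within the warp
}

// Hypothetical host helper: returns one lane index per element, or a vector
// filled with -1 if the device allocation fails. Assumes n > 0.
std::vector<int> lanesOnGpu(int n)
{
    std::vector<int> host(n, -1);
    int *dLane = nullptr;
    if (cudaMalloc(&dLane, n * sizeof(int)) != cudaSuccess) return host;

    // Deliberately fewer threads than elements, so the loop in the kernel
    // runs more than once per thread when n > 128.
    laneKernel<<<2, 64>>>(dLane, n);

    cudaMemcpy(host.data(), dLane, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dLane);
    return host;
}
```

Because the step is a multiple of the block size, element i is always handled by the thread with threadIdx.x equal to i % 64, so with 32-thread warps the stored lane is i % 32.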

29 Qualifiers
Function types:
- __global__: kernel, called from host, executed on device
- __device__: function called from device, executed on device
- __host__: function called from host, executed on host (optional)
Variable types:
- __device__: global, accessible by device and host (optional)
- __constant__: constant, accessible by device (read only) and host
- __shared__: shared, life span and access tied to block
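The qualifiers listed here could appear together as in the following sketch. All names (`kScale`, `square`, `scaled`, `qualifierKernel`, `scaleSquares`) are hypothetical. Combining `__host__` and `__device__` on one function is valid CUDA but goes beyond what the slide lists. The block size of 64 is arbitrary and the helper assumes n > 0.

```cuda
#include <cuda_runtime.h>

// Constant memory: read-only in device code, written by the host.
__constant__ float kScale;

// Compiled for both sides: callable from host and from device code.
__host__ __device__ float square(float x) { return x * x; }

// Device-only helper: called from device, executed on device.
__device__ float scaled(float x) { return kScale * x; }

// Kernel: called from host, executed on device.
__global__ void qualifierKernel(float *data, int n)
{
    // Shared memory: one copy per block, gone when the block finishes.
    __shared__ float tile[64];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? data[i] : 0.0f;
    __syncthreads();

    if (i < n)
        data[i] = scaled(square(tile[threadIdx.x]));
}

// Hypothetical host helper: replaces each data[i] by scale * data[i]^2.
// Returns false if a CUDA call fails. Assumes n > 0.
bool scaleSquares(float *data, int n, float scale)
{
    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return false;

    cudaMemcpyToSymbol(kScale, &scale, sizeof(float));  // host writes the constant
    cudaMemcpy(d, data, n * sizeof(float), cudaMemcpyHostToDevice);

    qualifierKernel<<<(n + 63) / 64, 64>>>(d, n);

    cudaError_t err = cudaMemcpy(data, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return err == cudaSuccess;
}
```

The shared tile is not needed for this computation. It is included only to show where a `__shared__` declaration lives and that a barrier separates writing it from reading it.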

30 Compilation
- NVCC separates serial and parallel parts
- Device code is compiled to the pseudo-assembly PTX (Parallel Thread Execution)
- Finally, everything is linked into one executable


32 minibrot.cu

33 Host Code

    size_t n = 30;     // side length of canvas
    size_t block = 5;  // side length of a block

    dim3 blockDim(block, block);
    dim3 gridDim((n / block) + 1, (n / block) + 1);

    char *arr_gpu;
    cudaMalloc(&arr_gpu, n * n * sizeof(char));

    mandelbrotKernel<<<gridDim, blockDim>>>(arr_gpu, n);

    cudaDeviceSynchronize();  // wait for the kernel to finish

    char *arr = (char *) malloc(n * n * sizeof(char));
    cudaMemcpy(arr, arr_gpu, n * n * sizeof(char), cudaMemcpyDeviceToHost);

    printMatrix(arr, n);

    free(arr);
    cudaFree(arr_gpu);
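The host code sizes the grid as (n / block) + 1 blocks per dimension, which gives 7 for n = 30 and block = 5 although 6 blocks already cover the canvas. The surplus blocks are harmless because the kernel checks its bounds, but a ceiling division avoids them. A minimal sketch, with `blocksFor` and `gridFor` as hypothetical helper names:

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: smallest number of blocks of side length `block`
// that covers `n` elements (ceiling division). Assumes block > 0.
inline unsigned int blocksFor(size_t n, size_t block)
{
    return static_cast<unsigned int>((n + block - 1) / block);
}

// Hypothetical helper: 2D grid that covers an n x n canvas.
inline dim3 gridFor(size_t n, size_t block)
{
    return dim3(blocksFor(n, block), blocksFor(n, block));
}
```

When n is not a multiple of the block side, the last block in each dimension still contains threads outside the canvas, so the bounds check in the kernel remains necessary.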

34 Device Code

    __global__ void mandelbrotKernel(char *arr, size_t n)
    {
        uint2 idx;  // position on canvas
        idx.x = blockIdx.x * blockDim.x + threadIdx.x;
        idx.y = blockIdx.y * blockDim.y + threadIdx.y;

        // threads outside the canvas have nothing to do
        if (!(idx.x < n && idx.y < n)) return;

        float2 z = make_float2(0.0f, 0.0f);
        // NOTE: constants and operators in the next expression were lost in
        // extraction and are left as gaps ("[...]").
        float2 c = make_float2(1.0f [...] (float(idx.x) / n),
                               [...] (float(idx.y) / n));

        int iter = 0;
        int maxIter = 100;

        // iterate z = z^2 + c; the conventional escape test is |z|^2 < 4.0f
        for (; iter < maxIter && (z.x * z.x + z.y * z.y) < 2.0f; ++iter)
            z = make_float2(z.x * z.x - z.y * z.y + c.x, 2.0f * z.x * z.y + c.y);

        arr[idx.x * n + idx.y] = (iter == maxIter) ? '#' : ' ';
    }
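Since the coordinate constants of the kernel on this slide did not survive, the following is a self-contained sketch of the same idea under stated assumptions, not the author's original code. The mapping of the canvas to [-2, 1) x [-1.5, 1.5), the escape bound of 4.0f, and the names `mandelbrotSketch` and `renderMandelbrot` are all assumptions.

```cuda
#include <cstddef>
#include <cuda_runtime.h>
#include <vector>

// Sketch of a Mandelbrot kernel: one thread per canvas cell.
__global__ void mandelbrotSketch(char *arr, size_t n)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= n || y >= n) return;  // surplus threads at the grid edge

    // ASSUMPTION: map the canvas to [-2, 1) x [-1.5, 1.5).
    float cx = -2.0f + 3.0f * (float(x) / n);
    float cy = -1.5f + 3.0f * (float(y) / n);

    float zx = 0.0f, zy = 0.0f;
    const int maxIter = 100;
    int iter = 0;
    // ASSUMPTION: conventional escape test |z|^2 < 4.
    for (; iter < maxIter && (zx * zx + zy * zy) < 4.0f; ++iter) {
        float t = zx * zx - zy * zy + cx;  // real part of z^2 + c
        zy = 2.0f * zx * zy + cy;          // imaginary part of z^2 + c
        zx = t;
    }
    // '#' marks points that did not escape within maxIter iterations.
    arr[x * n + y] = (iter == maxIter) ? '#' : ' ';
}

// Hypothetical host helper: renders an n x n canvas with square blocks of
// side `block`. Returns a canvas of '?' if the device allocation fails.
std::vector<char> renderMandelbrot(size_t n, unsigned int block)
{
    std::vector<char> host(n * n, '?');
    char *dArr = nullptr;
    if (cudaMalloc(&dArr, n * n) != cudaSuccess) return host;

    unsigned int blocks = static_cast<unsigned int>((n + block - 1) / block);
    mandelbrotSketch<<<dim3(blocks, blocks), dim3(block, block)>>>(dArr, n);

    cudaMemcpy(host.data(), dArr, n * n, cudaMemcpyDeviceToHost);
    cudaFree(dArr);
    return host;
}
```

Calling `renderMandelbrot(30, 5)` corresponds to the canvas and block size used in the host code. The result is indexed as x * n + y, as in the kernel on the slide, and can be printed n characters at a time.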


36 Thanks for your attention! Questions?
(Image from: gizmodo.com.au/2009/05/giz_explains_gpgpu_computing_and_why_itll_melt_your_face_off-2)
