Tesla Architecture, CUDA and Optimization Strategies

1 Tesla Architecture, CUDA and Optimization Strategies
Lan Shi, Li Yi & Liyuan Zhang
Hauptseminar: Multicore Architectures and Programming

2 Outline
- Tesla Architecture & CUDA
- CUDA Programming
- Optimization Strategies
- Summary



4 Revolutionary NVIDIA Tesla
- Multi-threaded architecture with a 128-processor computing core
- C-language development environment for the GPU
- C870 GPU Computing Processor
  - One GPU (128 thread processors)
  - GB dedicated memory
  - Fits in one full-length, dual slot with one open PCI Express x16 slot
- D870 Deskside GPU Computing System: desktop (2 x C870)
- S870 GPU Computing System: 1U rack-mount chassis (4 x C870)


5 GPU architecture
- Massively multithreaded parallel computing platform
- 8 Texture Processor Clusters (TPCs)
- 1 TPC = 2 Streaming Multiprocessors (SMs) + texture unit
- 1 SM = 8 streaming processors (SPs)
- 128 thread processors in total
- 1.35 GHz processor clock, 518 GFLOPS peak
- Parallel Data Cache accelerates processing
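The peak figures on this slide and the next can be reproduced with a few lines of C. This is a sketch under one assumption that the slides do not state: each SP is counted as issuing 3 floating-point operations per clock (a multiply-add plus a multiply). The names `peak_gflops` and `print_peaks` are illustrative only.

```c
#include <stdio.h>

/* Peak GFLOPS = number of SPs * clock in GHz * flops per SP per clock.
   flops_per_clock = 3 is an assumption (multiply-add + multiply). */
double peak_gflops(int tpcs, int sms_per_tpc, int sps_per_sm,
                   double clock_ghz, int flops_per_clock)
{
    int sps = tpcs * sms_per_tpc * sps_per_sm;   /* 8 * 2 * 8 = 128 on the slide */
    return sps * clock_ghz * flops_per_clock;
}

/* Prints the chip-wide and the per-SM peak for the slide's numbers. */
void print_peaks(void)
{
    printf("chip: %.1f GFLOPS\n", peak_gflops(8, 2, 8, 1.35, 3));
    printf("SM:   %.1f GFLOPS\n", peak_gflops(1, 1, 8, 1.35, 3));
}
```

With these inputs the result is about 518.4 GFLOPS for the whole chip and about 32.4 GFLOPS per SM, which matches the rounded 518 and 32 GFLOPS quoted on the slides.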

6 SM Multithreaded Multiprocessor
- 8 SP Thread Processors
  - 32 GFLOPS peak at 1.35 GHz
  - IEEE 754 32-bit floating point
  - 32-bit integer
- 2 SFU Special Function Units
- Multithreaded Instruction Unit
  - 768 threads, hardware multithreaded
  - 24 SIMD warps of 32 threads
  - Independent MIMD thread execution
  - Hardware thread scheduling
- 16 KB Shared Memory
  - Concurrent threads share data
  - Low latency load/store


7 SIMT Multithreaded Execution
- Warp: the set of 32 parallel threads that execute a SIMD instruction
- SM hardware implements zero-overhead warp and thread scheduling
- 768 concurrent threads = 24 warps x 32 threads
- Threads can execute independently
- Best efficiency and performance: threads of a warp execute together
- Single-Instruction Multiple-Thread (SIMT) across threads (not just SIMD across data) gives easy single-thread scalar programming with SIMD efficiency

8 NVIDIA CUDA
- CUDA (Compute Unified Device Architecture) enables efficient use of the massive parallelism of NVIDIA GPUs
  - Direct execution of data-parallel programs
  - Without the overhead of a graphics API
- Using CUDA on Tesla GPUs can provide large speedups on data-parallel computations straight out of the box!
- Heterogeneous: mixed serial-parallel programming
- Scalable: hierarchical thread execution model
- Accessible: minimal but expressive changes to C


9 CUDA Programming Model: Grids, Blocks, and Threads
- Execute a sequence of kernels on the GPU computing device
- A kernel executes as a grid of thread blocks
- A thread block, or CTA (Cooperative Thread Array), is an array of threads that can cooperate
  - Size: 1 to 512 concurrent threads
  - Shape: 1D, 2D, or 3D
- Threads within the same block synchronize and share data in shared memory

10 CUDA integrated CPU + GPU application C program
- Serial C code executes on the CPU
- Parallel kernel C code executes on the GPU in thread blocks


11 Memory Spaces
- Each thread can:
  - Read/write per-thread 32-bit registers
  - Read/write per-thread local memory
  - Read/write per-block shared memory (on chip)
  - Read/write per-grid global memory
  - Read only per-grid constant memory
  - Read only per-grid texture memory
- The host can read/write global memory, constant memory, and texture memory (stored in DRAM)

12 Outline
- Tesla Architecture & CUDA
- CUDA Programming
- Optimization Strategies
- Summary

13 CUDA Programming
- A minimal set of extensions to the C language
- A runtime library

14 Learning by example: matrix addition

Serial C version (runs on the CPU):

    void addmatrix( float *a, float *b, float *c, int N )
    {
        int i, j, index;
        for( i = 0; i < N; i++ ) {
            for( j = 0; j < N; j++ ) {
                index = i + j * N;
                c[index] = a[index] + b[index];
            }
        }
    }

    int main()
    {
        /* a, b, c and N are set up elsewhere */
        addmatrix( a, b, c, N );
    }

CUDA version (one thread per matrix element):

    __global__ void addmatrix( float *a, float *b, float *c, int N )
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        int index = i + j * N;
        if( i < N && j < N )    /* skip threads outside the matrix */
            c[index] = a[index] + b[index];
    }

    int main()
    {
        /* a, b and c must point to device memory; this grid size
           covers the whole matrix only if N is a multiple of blocksize */
        dim3 dimBlk( blocksize, blocksize );
        dim3 dimGrd( N / dimBlk.x, N / dimBlk.y );
        addmatrix<<<dimGrd, dimBlk>>>( a, b, c, N );
    }
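To see how the kernel's index arithmetic covers the matrix, the launch can be emulated on the CPU in plain C. This is only a sketch: the loops stand in for the hardware's block and thread scheduling, the integer arguments stand in for CUDA's `blockIdx`, `blockDim` and `threadIdx`, and the names `add_one` and `add_matrix_emulated` are made up for illustration. Unlike the slide, the grid size is rounded up so that sizes that are not a multiple of the block size are also covered.

```c
/* Kernel body for one (block, thread) pair of a 2D launch. */
static void add_one(const float *a, const float *b, float *c, int n,
                    int bx, int by, int bdim, int tx, int ty)
{
    int i = bx * bdim + tx;            /* blockIdx.x * blockDim.x + threadIdx.x */
    int j = by * bdim + ty;            /* blockIdx.y * blockDim.y + threadIdx.y */
    if (i < n && j < n) {              /* guard threads past the matrix edge */
        int index = i + j * n;
        c[index] = a[index] + b[index];
    }
}

/* Serial stand-in for addmatrix<<<grid, block>>>( a, b, c, n ):
   visits every thread of every block, one after the other. */
void add_matrix_emulated(const float *a, const float *b, float *c,
                         int n, int bdim)
{
    int blocks = (n + bdim - 1) / bdim;    /* round up so all n rows/columns are covered */
    for (int by = 0; by < blocks; by++)
        for (int bx = 0; bx < blocks; bx++)
            for (int ty = 0; ty < bdim; ty++)
                for (int tx = 0; tx < bdim; tx++)
                    add_one(a, b, c, n, bx, by, bdim, tx, ty);
}
```

For n = 5 and bdim = 4 the emulation runs 2 x 2 blocks of 4 x 4 threads, so 64 "threads" for 25 elements; the bounds check in `add_one` is what keeps the extra 39 from writing outside the arrays.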

15 Language Extensions
- Function type qualifiers
  - __global__ : kernel callable from host
  - __device__ : function callable on device
  - __host__ : function callable on host (default)
  - Example: __device__ void trigger( );
- Variable type qualifiers
  - __device__ : variable in device memory
  - __constant__ : variable in constant memory
  - __shared__ : variable in shared memory

16 Language Extensions
- Execution configuration
  - Defines the grid and blocks executed on the device
  - Given in the call to a __global__ function: <<< Dg, Db, Ns, S >>> (Dg: grid dimensions, Db: block dimensions, Ns: optional bytes of dynamically allocated shared memory per block, S: optional stream)
  - Example:
      __global__ void Function( float* parameter );
      dim3 dimGrid( 100, 50 );
      dim3 dimBlock( 4, 8, 8 );
      Function<<< dimGrid, dimBlock >>>( parameter );
- Built-in variables
  - dim3 gridDim
  - uint3 blockIdx
  - dim3 blockDim
  - uint3 threadIdx
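The size of the launch in the example follows directly from the two `dim3` values. The sketch below does the arithmetic in plain C; `struct dim3_model` is a hypothetical stand-in for CUDA's `dim3` (where omitted components default to 1), and the function names are illustrative.

```c
/* Minimal stand-in for CUDA's dim3. CUDA defaults omitted components
   to 1; callers of this model pass the 1 explicitly. */
struct dim3_model { unsigned x, y, z; };

/* Threads in one block: product of the block dimensions. */
unsigned long threads_per_block(struct dim3_model block)
{
    return (unsigned long)block.x * block.y * block.z;
}

/* Blocks in the grid: product of the grid dimensions. */
unsigned long blocks_per_grid(struct dim3_model grid)
{
    return (unsigned long)grid.x * grid.y * grid.z;
}

/* Total threads created by one kernel launch. */
unsigned long total_threads(struct dim3_model grid, struct dim3_model block)
{
    return blocks_per_grid(grid) * threads_per_block(block);
}
```

For a grid of (100, 50) and a block of (4, 8, 8) this gives 5000 blocks of 256 threads each, or 1,280,000 threads in total; 256 threads per block is within the limit of 512 named earlier in the deck.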

17 Compilation with NVCC
- NVCC
- C++ syntax rules
- PTX

18 Software Stack
- Device driver
- Application programming interface
- Mathematical libraries: CUFFT and CUBLAS

19 Runtime library
- Common component
- Device component
- Host component

20 Common Component
- Built-in vector types
  - char1, uchar1, int3, long2, etc.
  - Structures accessed with x, y, z, w fields: uint4 para; int y = para.y;
  - dim3 based on uint3
- Texture type
  - Texture references
  - texture<Type, Dim, ReadMode> texRef;
- Mathematical functions
  - sinf, powf, log, min, etc.
- Time function
  - clock_t clock();

21 Device Component
- Mathematical functions
  - pow
  - sin, cos, tan
  - etc.
- Synchronization functions
  - void __syncthreads();
- Texture functions
  - Type tex1Dfetch( texture<Type, 1, cudaReadModeElementType> texRef, int x );
- Atomic functions
  - atomicAdd();

22 Host Component
- Device management
- Context management
- Memory management
- Code module management
- Execution control
- Texture reference management
- Interoperability with OpenGL and Direct3D

23 Host Component
- Memory management
    float data[ 256 ];
    int size = sizeof( data );
    float* devPtr;
    cudaMalloc( ( void** ) &devPtr, size );
    cudaMemcpy( devPtr, data, size, cudaMemcpyHostToDevice );

24 Outline
- Tesla Architecture & CUDA
- CUDA Programming
- Optimization Strategies
- Summary

25 Optimization Strategies
- Maximize parallel execution
- Optimize memory usage
- Optimize instruction usage

26 Maximize Parallel Execution
- More blocks per SM, more threads per block
- Limiting factors:
  - Number of registers per kernel (8192 per SM)
  - Amount of shared memory per SM (16 KB per SM)
  - So in most cases at most 8 blocks per SM
- Threads per block: a multiple of the warp size, preferably a multiple of 64
- Best: 192 or 256 threads per block (at most 768 threads per SM)
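How these limits interact can be worked through with a small calculator. This is a simplified sketch, not the hardware's actual allocator: it uses only the per-SM limits quoted on the slides and ignores allocation granularity for registers and shared memory. The function names are illustrative.

```c
#include <assert.h>

/* Per-SM limits as quoted on the slides. */
enum {
    REGS_PER_SM        = 8192,
    SMEM_PER_SM        = 16 * 1024,
    MAX_THREADS_PER_SM = 768,
    MAX_BLOCKS_PER_SM  = 8
};

static int min_int(int a, int b) { return a < b ? a : b; }

/* Blocks that can be resident on one SM at the same time: the smallest
   of the block, thread, register and shared-memory limits. */
int resident_blocks(int threads_per_block, int regs_per_thread,
                    int smem_per_block)
{
    assert(threads_per_block > 0);
    int limit = MAX_BLOCKS_PER_SM;
    limit = min_int(limit, MAX_THREADS_PER_SM / threads_per_block);
    if (regs_per_thread > 0)
        limit = min_int(limit, REGS_PER_SM / (regs_per_thread * threads_per_block));
    if (smem_per_block > 0)
        limit = min_int(limit, SMEM_PER_SM / smem_per_block);
    return limit;
}

/* Fraction of the SM's 768 thread slots that are in use. */
double occupancy(int threads_per_block, int regs_per_thread,
                 int smem_per_block)
{
    int blocks = resident_blocks(threads_per_block, regs_per_thread,
                                 smem_per_block);
    return (double)(blocks * threads_per_block) / MAX_THREADS_PER_SM;
}
```

Under this model, blocks of 64 threads hit the 8-block limit first and fill only 512 of the 768 thread slots, while blocks of 192 or 256 threads can fill all 768 as long as each thread uses no more than 10 registers (8192 / 768 rounded down). Raising the register count to 16 per thread with 256-thread blocks drops the SM to 2 resident blocks.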

27 Optimize Memory Usage
- Data transfers
  - Transfers between device and host have the lowest bandwidth:
    - Minimize data transfers between host and device memory
    - Group transfers
  - Shared memory is hundreds of times faster than global memory (400 to 600 cycles of latency):
    - Minimize global memory traffic by using on-chip shared memory
    - Typical programming pattern: stage data coming from global memory into shared memory
  - Best: avoid the data transfer altogether, where possible, by recomputing the data

28 Optimize Memory Usage
- Memory accessing
  - Global memory
    - Read 4-byte, 8-byte, or 16-byte words in a single instruction
    - Most efficient: memory accesses by the threads of a half-warp can be coalesced into a single memory transaction (32, 64, or 128 bytes)
    - [Figure: coalesced and non-coalesced access patterns]
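Whether a half-warp's accesses coalesce can be checked on paper from the addresses alone. The sketch below models a simplified version of the rule for 4-byte words on first-generation hardware, assuming that thread k must access word k of a 64-byte-aligned segment and that all 16 threads take part; the exact conditions depend on the device's compute capability, so treat this as an illustration rather than a specification. Both function names are hypothetical.

```c
#include <stdint.h>

#define HALF_WARP 16

/* Returns 1 if the 16 byte addresses form one coalesced 64-byte
   transaction under the simplified rule: 4-byte words, thread k reads
   word k, and the first address is 64-byte aligned. Returns 0 otherwise. */
int is_coalesced_4byte(const uint32_t addr[HALF_WARP])
{
    if (addr[0] % (HALF_WARP * 4) != 0)
        return 0;                          /* misaligned start address */
    for (int k = 1; k < HALF_WARP; k++)
        if (addr[k] != addr[0] + 4u * k)
            return 0;                      /* out of sequence or strided */
    return 1;
}

/* Fills addr with the byte address thread k touches when it reads
   element (first + k * stride) of a float array located at base. */
void fill_addresses(uint32_t addr[HALF_WARP], uint32_t base,
                    uint32_t first, uint32_t stride)
{
    for (int k = 0; k < HALF_WARP; k++)
        addr[k] = base + 4u * (first + (uint32_t)k * stride);
}
```

In this model, consecutive floats starting at an aligned element coalesce, while the same pattern shifted by one element or read with a stride of 2 does not.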

29 Optimize Memory Usage
- Matrix transpose

30 Optimize Memory Usage
- Shared memory: 16 banks
- Bank conflicts cannot occur between the two halves of a warp
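The effect of the 16 banks can be worked out for a strided access pattern. The sketch below assumes that successive 32-bit words map to successive banks and that thread k of a half-warp accesses word k * stride with stride >= 1; the function name `conflict_degree` is illustrative.

```c
#define NUM_BANKS 16
#define HALF_WARP 16

/* Degree of the bank conflict when thread k of a half-warp accesses the
   32-bit word at index k * stride in shared memory (stride >= 1).
   1 means conflict-free; n means n threads hit the same bank and the
   access is serialized into n steps. */
int conflict_degree(int stride)
{
    int hits[NUM_BANKS] = {0};
    int worst = 0;
    for (int k = 0; k < HALF_WARP; k++) {
        int bank = (k * stride) % NUM_BANKS;   /* word index modulo bank count */
        hits[bank]++;
        if (hits[bank] > worst)
            worst = hits[bank];
    }
    return worst;
}
```

Odd strides such as 1, 3 or 17 are conflict-free in this model, a stride of 2 gives a 2-way conflict, and a stride of 16 sends all 16 threads to the same bank.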

31 Optimize Instruction Usage
- Memory instructions: 4 clock cycles to issue
  - Global memory latency (400 to 600 cycles) can be hidden by the thread scheduler
- Arithmetic instructions:
  - Minimize the use of arithmetic instructions with low throughput
  - Trade precision for speed
- Control flow instructions
  - Can lead to divergent branching
  - Key: make the condition depend on threadIdx / WSIZE (WSIZE = warp size); then no warp diverges
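The difference between a condition on the thread index and a condition on threadIdx / WSIZE can be checked on the CPU. The sketch below counts the warps of a block in which a branch condition is not uniform; it assumes a 1D block and a warp size of 32, and all names (`predicate`, `divergent_warps`, `by_thread`, `by_warp`) are made up for illustration.

```c
#define WSIZE 32

typedef int (*predicate)(int tid);

/* Counts the warps in which pred is not the same for all threads,
   i.e. the warps that would diverge at `if (pred(threadIdx.x))`. */
int divergent_warps(predicate pred, int threads_per_block)
{
    int count = 0;
    for (int start = 0; start < threads_per_block; start += WSIZE) {
        int first = pred(start) != 0;
        for (int t = start + 1;
             t < start + WSIZE && t < threads_per_block; t++) {
            if ((pred(t) != 0) != first) {   /* two threads of one warp disagree */
                count++;
                break;
            }
        }
    }
    return count;
}

/* Condition on the thread index: even and odd threads take different paths. */
int by_thread(int tid) { return tid % 2 == 0; }

/* Condition on the warp index: all 32 threads of a warp agree. */
int by_warp(int tid) { return (tid / WSIZE) % 2 == 0; }
```

For a block of 256 threads, `by_thread` makes all 8 warps diverge, while `by_warp` still sends half of the threads down each path without splitting any single warp.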

32 Summary


34 Thank you for your attention!
