GPU & High Performance Computing (by NVIDIA) CUDA. Compute Unified Device Architecture Florian Schornbaum

2 GPU Computing Performance
In the last few years the GPU has evolved into an absolute computing workhorse:
- programmable processor
- very high memory bandwidth
- multiple cores (high parallelism)

3 CPU vs. GPU
- The GPU was originally specialized for math-intensive, highly parallel computation (exactly what graphics rendering is about).
- On the GPU (in contrast to the CPU), more transistors are devoted to data processing (ALUs) rather than to data caching and flow control.

5 Problem: GPGPU (general-purpose computation on GPUs)
GPGPU so far: program the GPU through a graphics API and trick it into general-purpose computing by casting problems as graphics:
- turn data into images (texture maps)
- turn algorithms into image synthesis (rendering passes)
Promising results, but:
- tough learning curve, particularly for non-graphics experts
- potentially high overhead of an inadequate graphics API
- highly constrained memory layout and access model

7 Solution: GPU Computing with CUDA
Co-designed hardware and software for direct GPU computing:
- new hardware and software architecture for issuing and managing computations on the GPU as a data-parallel computing device, without the need to map them to a graphics API (available for the GeForce 8 series, the Tesla platform and some Quadro solutions)
- two higher-level mathematical libraries of common usage: CUFFT and CUBLAS
- hardware designed to support lightweight driver and runtime layers, for high performance

9 Hardware Implementation
- The device is implemented as a set of multiprocessors.
- Each multiprocessor has a single instruction, multiple data (SIMD) architecture.
- At any given clock cycle, each processor of a multiprocessor executes the same instruction, but operates on different data.

13 Programming Model: A Highly Multi-threaded Coprocessor
- The GPU is viewed as a compute device capable of executing a very high number of threads in parallel.
- The GPU operates as a coprocessor to the main CPU (host): data-parallel, compute-intensive portions of applications running on the host are off-loaded onto the device.
- A portion of an application that is executed many times, but on different data, can be isolated into a function that is executed on the device as many different threads.
- The GPU needs thousands of threads for full efficiency (GPU threads are extremely lightweight and have very little creation overhead).
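
The off-loading idea on this slide can be illustrated with a minimal sketch. It is not taken from the slides: the kernel name scale_kernel, the host helper run_scale_demo, the array size and the block size of 256 threads are all assumptions chosen for illustration. The body of what would be a CPU loop becomes a device function, and the launch creates one lightweight thread per array element.

```cuda
#include <assert.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Device code: the former loop body, executed once per thread.
__global__ void scale_kernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one element per thread
    if (i < n)                                      // guard the last, partial block
        data[i] = factor * data[i];
}

// Hypothetical host driver (not from the slides): returns the number of
// wrong elements, or -1 if a CUDA call fails (e.g. no device present).
int run_scale_demo(void)
{
    const int n = 1 << 16;                          // 65536 elements (assumed size)
    const size_t bytes = n * sizeof(float);
    float *h_data = (float *)malloc(bytes);
    float *d_data = NULL;
    if (h_data == NULL)
        return -1;
    for (int i = 0; i < n; i++)
        h_data[i] = (float)i;

    if (cudaMalloc((void **)&d_data, bytes) != cudaSuccess) {
        free(h_data);
        return -1;
    }
    cudaError_t err = cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        const int threads = 256;                        // threads per block (assumed)
        const int blocks = (n + threads - 1) / threads; // enough blocks to cover n
        scale_kernel<<<blocks, threads>>>(d_data, 2.0f, n);
        err = cudaGetLastError();                       // catches launch errors
    }
    if (err == cudaSuccess)                             // blocking copy back to the host
        err = cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);

    int errors = -1;
    if (err == cudaSuccess) {
        errors = 0;
        for (int i = 0; i < n; i++)
            if (h_data[i] != 2.0f * (float)i)
                errors++;
    }
    free(h_data);
    return errors;
}
```

Compiled with nvcc, calling run_scale_demo() from main() should return 0 on a machine with a working CUDA device. The launch here uses 65536 threads, in line with the slide's point that the GPU wants thousands of threads.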

16 Application Programming Interface
- an extension to the C programming language
- function type qualifiers to specify whether a function executes on the host or on the device (__global__, __device__, __host__)
- explicit GPU memory allocation that returns pointers to GPU memory (cudaMalloc(), cudaFree())
- memory can be copied from host to device, device to host and device to device (cudaMemcpy())
- four built-in variables: grid and block dimension, block and thread index (gridDim, blockDim, blockIdx, threadIdx)
- programming model: grid of thread blocks
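
As a rough illustration of the API elements listed on this slide, the following sketch combines the four built-in variables with explicit memory management. The names record_ids and run_ids_demo, and the 3 x 2 grid of 4 x 2 blocks, are hypothetical choices made for this example and do not appear in the slides. Each thread computes a unique linear index from gridDim, blockDim, blockIdx and threadIdx; the result is then copied device to device and device to host.

```cuda
#include <assert.h>
#include <cuda_runtime.h>

// Each thread derives a unique linear id from the four built-in variables.
__global__ void record_ids(int *ids)
{
    int block = blockIdx.y * gridDim.x + blockIdx.x;      // block index within the grid
    int local = threadIdx.y * blockDim.x + threadIdx.x;   // thread index within the block
    int id = block * (blockDim.x * blockDim.y) + local;   // global linear thread id
    ids[id] = id;
}

// Hypothetical host driver (not from the slides): returns the number of
// wrong entries, or -1 if a CUDA call fails.
int run_ids_demo(void)
{
    const dim3 grid(3, 2);                 // 3 x 2 blocks (assumed sizes)
    const dim3 block(4, 2);                // 4 x 2 threads per block
    const int total = 3 * 2 * 4 * 2;       // 48 threads overall
    const size_t bytes = total * sizeof(int);
    int h_ids[total];
    int *d_first = NULL;
    int *d_second = NULL;

    // Explicit device allocation: cudaMalloc hands back device pointers.
    cudaError_t err = cudaMalloc((void **)&d_first, bytes);
    if (err == cudaSuccess)
        err = cudaMalloc((void **)&d_second, bytes);

    if (err == cudaSuccess) {
        record_ids<<<grid, block>>>(d_first);
        err = cudaGetLastError();
    }
    // Device-to-device copy, then device-to-host copy.
    if (err == cudaSuccess)
        err = cudaMemcpy(d_second, d_first, bytes, cudaMemcpyDeviceToDevice);
    if (err == cudaSuccess)
        err = cudaMemcpy(h_ids, d_second, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_first);                     // cudaFree on a null pointer does nothing
    cudaFree(d_second);
    if (err != cudaSuccess)
        return -1;

    int errors = 0;
    for (int i = 0; i < total; i++)
        if (h_ids[i] != i)
            errors++;
    return errors;
}
```

If every thread received a distinct id, the host array holds 0 to 47 in order and run_ids_demo() returns 0.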

17 CPU C Program
    /* Element-wise sum of two N x N matrices stored as flat arrays. */
    void add_matrix_cpu(float *a, float *b, float *c, int N)
    {
        int i, j, index;
        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++) {
                index = i + j * N;
                c[index] = a[index] + b[index];
            }
        }
    }

    int main(void)
    {
        ...
        add_matrix_cpu(a, b, c, N);
        ...
    }

18 CUDA C Program
    /* Kernel: each thread adds one element (i, j) of the N x N matrices. */
    __global__ void add_matrix_gpu(float *a, float *b, float *c, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        int index = i + j * N;
        if (i < N && j < N)
            c[index] = a[index] + b[index];
    }

    int main(void)
    {
        ...
        dim3 dimBlock(blocksize, blocksize);
        /* Round up so the grid covers N even when N is not a multiple of blocksize. */
        dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x, (N + dimBlock.y - 1) / dimBlock.y);
        add_matrix_gpu<<<dimGrid, dimBlock>>>(a, b, c, N);
        ...
    }
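
The slide elides the host-side setup around the kernel launch. One possible way to fill it in is sketched below. This is an assumption, not the author's code: the helper name run_add_matrix_demo, the sizes N = 100 and blocksize = 16, the device pointer names ad, bd, cd and the rounded-up grid size are illustrative choices. N is deliberately not a multiple of the block size, so the bounds check in the kernel is actually exercised.

```cuda
#include <assert.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Kernel as on the slide: one thread per matrix element.
__global__ void add_matrix_gpu(float *a, float *b, float *c, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int index = i + j * N;
    if (i < N && j < N)                    // threads outside the matrix do nothing
        c[index] = a[index] + b[index];
}

// Hypothetical host driver (not from the slides): returns the number of
// wrong elements, or -1 if an allocation or CUDA call fails.
int run_add_matrix_demo(void)
{
    const int N = 100;                     // assumed size, not a multiple of 16
    const int blocksize = 16;              // assumed block edge length
    const size_t bytes = (size_t)N * N * sizeof(float);
    float *a = (float *)malloc(bytes);
    float *b = (float *)malloc(bytes);
    float *c = (float *)malloc(bytes);
    float *ad = NULL, *bd = NULL, *cd = NULL;   // device copies of a, b, c
    if (a == NULL || b == NULL || c == NULL) {
        free(a); free(b); free(c);
        return -1;
    }
    for (int k = 0; k < N * N; k++) {
        a[k] = (float)k;
        b[k] = 2.0f * (float)k;
    }

    cudaError_t err = cudaMalloc((void **)&ad, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&bd, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&cd, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(ad, a, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(bd, b, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        dim3 dimBlock(blocksize, blocksize);
        // Round up: 7 x 7 blocks of 16 x 16 threads cover the 100 x 100 matrix.
        dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x,
                     (N + dimBlock.y - 1) / dimBlock.y);
        add_matrix_gpu<<<dimGrid, dimBlock>>>(ad, bd, cd, N);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) err = cudaMemcpy(c, cd, bytes, cudaMemcpyDeviceToHost);
    cudaFree(ad); cudaFree(bd); cudaFree(cd);   // null pointers are ignored

    int errors = -1;
    if (err == cudaSuccess) {
        errors = 0;
        for (int k = 0; k < N * N; k++)
            if (c[k] != a[k] + b[k])
                errors++;
    }
    free(a); free(b); free(c);
    return errors;
}
```

Note that the kernel receives the device pointers ad, bd and cd, not the host arrays. With a plain N / blocksize grid, the last 4 rows and columns of this 100 x 100 matrix would never be computed, which is why the sketch rounds the grid size up.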

19 CUDA Software Development Kit

20 CUDA vs. Standard CPU Performance (2007)

21 GPU Computing with CUDA
THE END
References:
- SUPERCOMPUTING 2007 Tutorial: High Performance Computing with CUDA
- CUDA Programming Guide 1.1
