Parallelization on GPU Using CUDA: An Introduction


1 Parallelization on GPU Using CUDA: An Introduction
Ehsan Nedaaee Oskoee, Department of Physics, IASBS
IPM Grid and HPC Workshop IV, 2011

2 Outline
1. Introduction to GPU
2. Introduction to CUDA

4 Graphics Processing Unit (GPU): CPU vs GPU
[Figure: CPU and GPU diagrams]

5 Graphics Processing Unit (GPU)
- 32 cores per multiprocessor
- registers (32 bit)
- 64 KB L1 cache
- 32-bit internal operations

6 Graphics Processing Unit (GPU): Host and Device Bandwidth
- Bandwidth of PCIe gen2 (host-device link) = 6 GB/s
- Bandwidth of GPU (device memory) = 150 GB/s
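
The host-device figure quoted above can be checked on a given machine with a small timing sketch such as the one below. It is only an illustration: the helper name measure_h2d_gbps and the 64 MiB buffer size are assumptions, not part of the original slides, and ordinary pageable host memory is used, so the measured rate may fall below the peak PCIe figure.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper: times one host-to-device copy of `bytes` bytes and
// returns the observed rate in GB/s, or -1.0f if anything fails.
float measure_h2d_gbps(size_t bytes) {
    float gbps = -1.0f;
    float ms = 0.0f;
    void *dev = NULL;
    cudaEvent_t start, stop;
    void *host = calloc(bytes, 1);          // zero-filled pageable host buffer
    if (host == NULL) return gbps;

    if (cudaMalloc(&dev, bytes) == cudaSuccess) {
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start, 0);
        cudaError_t err = cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);          // wait until the stop event is reached
        cudaEventElapsedTime(&ms, start, stop);
        if (err == cudaSuccess && ms > 0.0f)
            gbps = (bytes / 1.0e9f) / (ms / 1.0e3f);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(dev);
    }
    free(host);
    return gbps;
}

int main(void) {
    size_t bytes = (size_t)64 << 20;         // 64 MiB, an arbitrary test size
    float gbps = measure_h2d_gbps(bytes);
    if (gbps < 0.0f) { printf("measurement failed\n"); return 1; }
    printf("host -> device: %.2f GB/s\n", gbps);
    return 0;
}
```

The gap between the two numbers on the slide is the reason to keep data on the device for as long as possible once it has been copied there.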

7 Graphics Processing Unit (GPU): Nvidia Tesla C1060
Source: http://en.wikipedia.

Core clock (MHz): 602
Shader thread processors (total): 240
Shader clock (MHz): 1300
Memory bandwidth, max (GB/s):
Memory bus type: GDDR3
Memory total size (MiB): 4096
Memory clock (MHz): 1600
Processing power, single precision, peak (GFLOPs):
Processing power, double precision, peak (GFLOPs): 77.76

8 CUDA
CUDA = Compute Unified Device Architecture
CUDA C:
- Based on standard C
- A handful of language extensions to allow heterogeneous programs
- Straightforward APIs to manage devices, memory, etc.

13 CUDA: CUDA Programming Model
- In CUDA, the GPU is considered a compute device that has its own memory (DRAM).
- Such a device is able to run many threads in parallel.
- The parallel portions of the program run on the device as kernels.
- Unlike CPU threads, GPU threads have very low creation overhead.
- To achieve full efficiency, a GPU needs to run thousands of threads, whereas a CPU needs only a few.

16 CUDA: What changes need to be made to a sequential code
- Computation partitioning (define the portion of the code that should run on the device)
- Declaration of functions with the qualifiers __host__, __device__, __global__
- Mapping the thread programs to the device: function_name<<<gs, bs>>>(<args>), where gs is the grid size and bs is the block size
- Transferring data between host and device using cudaMemcpy
- Concurrency management (e.g. __syncthreads())
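
As a rough illustration of the steps listed above, the sketch below moves a sequential "square every element" loop to the device. All names (square, square_all, square_on_device) and the sizes are assumptions made for this example and do not come from the slides. __syncthreads() is not needed here because the threads do not exchange data.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define N 8

// __device__: runs on the device and is callable only from device code.
__device__ float square(float x) { return x * x; }

// __global__: a kernel; runs on the device and is launched from the host.
__global__ void square_all(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) data[i] = square(data[i]);
}

// __host__: an ordinary CPU function (the default when no qualifier is given).
// Returns 0 on success, -1 on a CUDA error.
__host__ int square_on_device(float *h_data, int n) {
    float *d_data = NULL;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc((void**)&d_data, bytes) != cudaSuccess) return -1;
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // host -> device
    int bs = 4;                       // block size: threads per block
    int gs = (n + bs - 1) / bs;       // grid size: enough blocks to cover n
    square_all<<<gs, bs>>>(d_data, n);
    // This copy waits for the kernel to finish before reading the result.
    cudaError_t err = cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return err == cudaSuccess ? 0 : -1;
}

int main(void) {
    float h[N];
    for (int i = 0; i < N; i++) h[i] = (float)i;
    if (square_on_device(h, N) != 0) { printf("CUDA error\n"); return 1; }
    for (int i = 0; i < N; i++) printf("%g ", h[i]);
    printf("\n");
    return 0;
}
```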

18 CUDA: A Simple Program

    #include "../common/book.h"

    int main( void ) {
        printf( "Hello, World!\n" );
        return 0;
    }

Running this simple example:
- Compiling: nvcc -o hello.exe hello_world.cu
- Running the program: ./hello.exe (as usual!)
- This program runs on the host; the device does nothing in this case.
- The NVIDIA compiler (nvcc) will not complain about CUDA programs with no device code.

20 CUDA: Simple Program on Device

    #include <stdio.h>

    // Empty kernel: compiled for and executed on the device.
    __global__ void kernel( void ) {
    }

    int main( void ) {
        kernel<<<1,1>>>();   // launch 1 block of 1 thread
        printf( "Hello, World!\n" );
        return 0;
    }

- The __global__ keyword indicates that this function runs on the device and is called from the host.
- The nvcc compiler divides your code into host and device components:
  - NVIDIA's compiler handles device functions like kernel().
  - A standard host compiler (gcc, icc) handles the rest.

22 CUDA: A More Complex Example

    #include <stdio.h>

    // Kernel: adds a and b on the device and stores the sum in device memory.
    __global__ void add( int a, int b, int *c ) {
        *c = a + b;
    }

    int main( void ) {
        int c;
        int *dev_c;
        cudaMalloc( (void**)&dev_c, sizeof(int) );   // allocate device memory
        add<<<1,1>>>( 2, 7, dev_c );
        // copy the result from device memory back to the host
        cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );
        printf( "2 + 7 = %d\n", c );
        cudaFree( dev_c );
        return 0;
    }

Memory in CUDA: the basic CUDA API for dealing with device memory
- cudaMalloc()
- cudaFree()
- cudaMemcpy()
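
Each of the three memory calls above returns a cudaError_t, which the slide example ignores. A minimal sketch of checking those return values is given below; the CHECK macro and the round_trip function are hypothetical helpers written for this illustration and are not part of CUDA or of the original slides.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical macro (not part of CUDA): print the error and exit on failure.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "%s at %s:%d\n",                           \
                    cudaGetErrorString(err_), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Copies n ints to the device and back again.
// Returns 1 if the data survives the round trip unchanged, 0 otherwise.
int round_trip(const int *src, int *dst, int n) {
    int *dev = NULL;
    size_t bytes = n * sizeof(int);
    CHECK(cudaMalloc((void**)&dev, bytes));                       // allocate
    CHECK(cudaMemcpy(dev, src, bytes, cudaMemcpyHostToDevice));   // host -> device
    CHECK(cudaMemcpy(dst, dev, bytes, cudaMemcpyDeviceToHost));   // device -> host
    CHECK(cudaFree(dev));                                         // release
    for (int i = 0; i < n; i++)
        if (src[i] != dst[i]) return 0;
    return 1;
}

int main(void) {
    int in[4] = {2, 7, 1, 8};
    int out[4] = {0, 0, 0, 0};
    printf("round trip %s\n", round_trip(in, out, 4) ? "ok" : "failed");
    return 0;
}
```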

23 CUDA: Doing in Parallel

24 CUDA: Doing in Parallel (continued)
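
The content of these two slides did not survive in the text. As a stand-in, here is a minimal sketch of one common way to run a computation in parallel across blocks: a vector addition launched with N blocks of one thread each, where blockIdx.x selects the element. The names add_blocks and vector_add_blocks and the size N = 16 are assumptions for this example, and it is not claimed to be the code shown in the talk.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define N 16

// One block per element: blockIdx.x is the index of this block in the grid.
__global__ void add_blocks(const int *a, const int *b, int *c) {
    int i = blockIdx.x;
    if (i < N) c[i] = a[i] + b[i];
}

// Hypothetical host wrapper: returns 0 on success, -1 on a CUDA error.
int vector_add_blocks(const int *a, const int *b, int *c) {
    int *da = NULL, *db = NULL, *dc = NULL;
    size_t bytes = N * sizeof(int);
    cudaMalloc((void**)&da, bytes);
    cudaMalloc((void**)&db, bytes);
    cudaMalloc((void**)&dc, bytes);
    cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);
    add_blocks<<<N, 1>>>(da, db, dc);        // N blocks, 1 thread per block
    cudaError_t err = cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);
    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);
    return err == cudaSuccess ? 0 : -1;
}

int main(void) {
    int a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = i * i; }
    if (vector_add_blocks(a, b, c) != 0) { printf("CUDA error\n"); return 1; }
    for (int i = 0; i < N; i++) printf("%d + %d = %d\n", a[i], b[i], c[i]);
    return 0;
}
```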

25 CUDA: Doing in Parallel with Threads
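
The slide body is missing from the text, so the following is only a sketch of what parallelism with threads can look like: the same kind of vector addition, launched as a single block of N threads, with threadIdx.x selecting the element. The names add_threads and vector_add_threads and the size N = 16 are assumptions for this illustration.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define N 16

// One thread per element: threadIdx.x is the index of this thread in its block.
__global__ void add_threads(const int *a, const int *b, int *c) {
    int i = threadIdx.x;
    if (i < N) c[i] = a[i] + b[i];
}

// Hypothetical host wrapper: returns 0 on success, -1 on a CUDA error.
int vector_add_threads(const int *a, const int *b, int *c) {
    int *da = NULL, *db = NULL, *dc = NULL;
    size_t bytes = N * sizeof(int);
    cudaMalloc((void**)&da, bytes);
    cudaMalloc((void**)&db, bytes);
    cudaMalloc((void**)&dc, bytes);
    cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);
    add_threads<<<1, N>>>(da, db, dc);       // 1 block, N threads in that block
    cudaError_t err = cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);
    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);
    return err == cudaSuccess ? 0 : -1;
}

int main(void) {
    int a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = -i; b[i] = 3 * i; }
    if (vector_add_threads(a, b, c) != 0) { printf("CUDA error\n"); return 1; }
    for (int i = 0; i < N; i++) printf("%d + %d = %d\n", a[i], b[i], c[i]);
    return 0;
}
```

Compared with a launch of N one-thread blocks, only the launch configuration and the index variable change.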


29 CUDA: Both Blocks and Threads
Doing in parallel with both blocks and threads
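
A minimal sketch of combining blocks and threads, under the assumption that the slide showed something along these lines: each thread computes a global index from blockIdx.x, blockDim.x, and threadIdx.x, and the grid is sized so that all n elements are covered. The names add_both and vector_add_both, the block size of 128, and the array length of 1000 are choices made for this example only.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 128

// Each thread handles one element, identified by its global index.
__global__ void add_both(const int *a, const int *b, int *c, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) c[i] = a[i] + b[i];   // guard: the last block may be partly idle
}

// Hypothetical host wrapper: returns 0 on success, -1 on a CUDA error.
int vector_add_both(const int *a, const int *b, int *c, int n) {
    int *da = NULL, *db = NULL, *dc = NULL;
    size_t bytes = n * sizeof(int);
    cudaMalloc((void**)&da, bytes);
    cudaMalloc((void**)&db, bytes);
    cudaMalloc((void**)&dc, bytes);
    cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);
    // Round up so that blocks * THREADS_PER_BLOCK >= n.
    int blocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    add_both<<<blocks, THREADS_PER_BLOCK>>>(da, db, dc, n);
    cudaError_t err = cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);
    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);
    return err == cudaSuccess ? 0 : -1;
}

int main(void) {
    enum { LEN = 1000 };
    static int a[LEN], b[LEN], c[LEN];
    for (int i = 0; i < LEN; i++) { a[i] = i; b[i] = 2 * i; }
    if (vector_add_both(a, b, c, LEN) != 0) { printf("CUDA error\n"); return 1; }
    printf("c[0] = %d, c[%d] = %d\n", c[0], LEN - 1, c[LEN - 1]);
    return 0;
}
```

With a fixed block size, the array length is no longer tied to the limit on threads per block, which is the usual reason for using both levels together.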

30 CUDA: Why Threads?
Unlike parallel blocks, parallel threads have mechanisms to:
- communicate
- synchronize
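
Both mechanisms can be seen in a small sketch: threads of one block communicate through __shared__ memory and use __syncthreads() as a barrier between steps. The example below sums 64 integers inside a single block. The kernel name block_sum, the wrapper sum_on_device, and the choice of a tree-style reduction are assumptions for this illustration and are not taken from the slides.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define THREADS 64   // must be a power of two for this reduction

// Sums THREADS ints using a single block of THREADS threads.
__global__ void block_sum(const int *in, int *out) {
    __shared__ int cache[THREADS];   // communication: shared by the whole block
    int t = threadIdx.x;
    cache[t] = in[t];
    __syncthreads();                 // synchronization: all loads are done

    // Halve the number of active threads each round.
    for (int stride = THREADS / 2; stride > 0; stride /= 2) {
        if (t < stride) cache[t] += cache[t + stride];
        __syncthreads();             // finish this round before the next starts
    }
    if (t == 0) *out = cache[0];
}

// Hypothetical host wrapper: returns 0 on success, -1 on a CUDA error.
int sum_on_device(const int *h_in, int *h_out) {
    int *d_in = NULL, *d_out = NULL;
    cudaMalloc((void**)&d_in, THREADS * sizeof(int));
    cudaMalloc((void**)&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, THREADS * sizeof(int), cudaMemcpyHostToDevice);
    block_sum<<<1, THREADS>>>(d_in, d_out);
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess ? 0 : -1;
}

int main(void) {
    int in[THREADS], total = 0;
    for (int i = 0; i < THREADS; i++) in[i] = i + 1;   // 1, 2, ..., 64
    if (sum_on_device(in, &total) != 0) { printf("CUDA error\n"); return 1; }
    printf("sum = %d\n", total);
    return 0;
}
```

The barrier is placed outside the if statement on purpose: every thread of the block has to reach __syncthreads(), including the threads that no longer add anything.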

31 For Further Reading
[1] David B. Kirk and Wen-mei W. Hwu. Programming Massively Parallel Processors. Morgan Kaufmann Publishers, 2010.
[2] Massimo Bernaschi. A Crash Course on C-CUDA Programming. Advanced School in High Performance and GRID Computing, 2011, ICTP, Italy.
[3] Piero Altoe. GPU HW and Performance. Advanced School in High Performance and GRID Computing, 2011, ICTP, Italy.

32 The End
That's All Folks
