PART IV. CUDA programming: code, libraries and tools. Dr. Christian Napoli, M.Sc., Dpt. Mathematics and Informatics, University of Catania


1 Postgraduate course on Electronics and Informatics Engineering (M.Sc.), Training Course on Circuits Theory (prof. G. Capizzi): Workshop on High Performance Computing and GPGPU Computing
Postgraduate course on Computer Sciences (M.Sc.), Training Course on Distributed Systems (prof. G. Pappalardo): Workshop on High Performance Computing and GPGPU Computing
PART IV: CUDA programming. Code, libraries and tools
Dr. Christian Napoli, M.Sc., Dpt. Mathematics and Informatics, University of Catania
Christian Napoli. Licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

2 In the last episode

3 In the last episode

    #include <iostream>
    #include <algorithm>
    #include <cstdlib>
    using namespace std;

    #define N 1024
    #define RADIUS 3
    #define BLOCK_SIZE 16

    // ---- parallel fn ----
    __global__ void stencil_1d(int *in, int *out) {
        __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
        int gindex = threadIdx.x + blockIdx.x * blockDim.x;
        int lindex = threadIdx.x + RADIUS;

        // Read input elements into shared memory
        temp[lindex] = in[gindex];
        if (threadIdx.x < RADIUS) {
            temp[lindex - RADIUS] = in[gindex - RADIUS];
            temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
        }

        // Synchronize (ensure all the data is available)
        __syncthreads();

        // Apply the stencil
        int result = 0;
        for (int offset = -RADIUS ; offset <= RADIUS ; offset++)
            result += temp[lindex + offset];

        // Store the result
        out[gindex] = result;
    }

    void fill_ints(int *x, int n) {
        fill_n(x, n, 1);
    }

    int main(void) {
        // ---- serial code ----
        int *in, *out;       // host copies of in, out
        int *d_in, *d_out;   // device copies of in, out
        int size = (N + 2 * RADIUS) * sizeof(int);

        // Alloc space for host copies and setup values
        in  = (int *)malloc(size); fill_ints(in,  N + 2 * RADIUS);
        out = (int *)malloc(size); fill_ints(out, N + 2 * RADIUS);

        // Alloc space for device copies
        cudaMalloc((void **)&d_in,  size);
        cudaMalloc((void **)&d_out, size);

        // Copy to device
        cudaMemcpy(d_in,  in,  size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);

        // ---- parallel code ----
        // Launch stencil_1d() kernel on GPU
        stencil_1d<<<N/BLOCK_SIZE,BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);

        // ---- serial code ----
        // Copy result back to host
        cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);

        // Cleanup
        free(in); free(out);
        cudaFree(d_in); cudaFree(d_out);
        return 0;
    }

4 Processing flow
1. Copy input data from CPU memory to GPU memory (over the PCI bus)

5 Processing flow
1. Copy input data from CPU memory to GPU memory (over the PCI bus)
2. Load the GPU program and execute it, caching data on chip

6 Processing flow
1. Copy input data from CPU memory to GPU memory (over the PCI bus)
2. Load the GPU program and execute it, caching data on chip
3. Copy results from GPU memory to CPU memory (over the PCI bus)

7 The first instruments for the job

CODE :: add.cu

    [...]
    __global__ void add( int *a, int *b, int *c );

    int main ( void ) {
        int a[N], b[N], c[N];
        int *dev_a, *dev_b, *dev_c;
        cudaMalloc( (void**)&dev_a, N * sizeof(int) );
        [...]
        cudaMemcpy( dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice );
        [...]
        add<<<1,N>>>( dev_a, dev_b, dev_c );
        [...]
        cudaFree( dev_a );
        [...]
    }

Highlighted on the slide: KEYWORDS, LIBRARIES, TOKENS, CONSTRUCTS

8 SYNTAX: Global functions or kernels

    __global__ void add (int*, int*, int*);
    mykernel<<<Nbk,Nth>>>();

__global__ - Global kernel function :: it is a keyword
Nbk - Number of blocks to execute :: it is an integer
Nth - Number of threads to execute in each block :: it is an integer

In CUDA C/C++ the keyword __global__ indicates a function that runs on the device and is called from host code. Such a function is called a kernel, and it runs in multiple instances spread over several blocks and threads. The number of blocks and the number of threads per block are set between the <<< and >>> symbols.
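As a minimal sketch of the launch syntax, the kernel below lets every instance record which block and thread it ran in, and the host then checks the result. The names `who_am_i` and `launch_demo` are hypothetical and do not come from the slides.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Each kernel instance records its own block and thread index.
__global__ void who_am_i(int *block_of, int *thread_of)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    block_of[i]  = blockIdx.x;
    thread_of[i] = threadIdx.x;
}

// Hypothetical host helper: launches nbk blocks of nth threads and returns
// true if every instance reported the expected indices.
bool launch_demo(int nbk, int nth)
{
    int n = nbk * nth;
    size_t bytes = n * sizeof(int);
    int *d_block = nullptr, *d_thread = nullptr;
    if (cudaMalloc((void **)&d_block, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void **)&d_thread, bytes) != cudaSuccess) {
        cudaFree(d_block);
        return false;
    }

    // <<<number of blocks, threads per block>>>
    who_am_i<<<nbk, nth>>>(d_block, d_thread);

    // A blocking cudaMemcpy waits for the kernel before copying.
    std::vector<int> h_block(n), h_thread(n);
    bool ok = cudaMemcpy(h_block.data(), d_block, bytes, cudaMemcpyDeviceToHost) == cudaSuccess
           && cudaMemcpy(h_thread.data(), d_thread, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_block);
    cudaFree(d_thread);

    for (int i = 0; ok && i < n; ++i)
        ok = (h_block[i] == i / nth) && (h_thread[i] == i % nth);
    return ok;
}
```

Calling `launch_demo(4, 8)` from a host `main()` would run 32 instances of the kernel, 8 in each of 4 blocks.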

9 SYNTAX: Memory allocation

    typedef enum cudaError cudaError_t;
    cudaError_t cudaMalloc (void** devPtr, size_t size);
    cudaError_t cudaFree (void* devPtr);

devPtr - Pointer to allocated device memory :: it is an address
size - Requested allocation size in bytes :: it is an integer

[Figure: host memory holding typed data (int, char, float) and device memory addressed through void pointers; data moves from HOST to DEVICE with cudaMemcpyHostToDevice]

10 SYNTAX: Memory copy

    cudaError_t cudaMemcpy( void* dst, const void* src, size_t count, enum cudaMemcpyKind kind );

dst - Destination memory address :: it is an address
src - Source memory address :: it is an address
count - Size in bytes to copy :: it is an integer
kind - Type of transfer :: it is a specific token:
    cudaMemcpyHostToHost
    cudaMemcpyHostToDevice
    cudaMemcpyDeviceToHost
    cudaMemcpyDeviceToDevice
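To make the transfer kinds concrete, here is a small sketch that pushes an array host to device, device to device, and back to the host. The helper name `memcpy_round_trip` is hypothetical, and error handling is reduced to a single success flag.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: sends n ints host -> device -> device -> host and
// returns true if the data comes back unchanged.
bool memcpy_round_trip(int n)
{
    std::vector<int> src(n), dst(n, 0);
    for (int i = 0; i < n; ++i) src[i] = 3 * i + 1;

    size_t bytes = n * sizeof(int);
    int *d_a = nullptr, *d_b = nullptr;
    bool ok = cudaMalloc((void **)&d_a, bytes) == cudaSuccess
           && cudaMalloc((void **)&d_b, bytes) == cudaSuccess;

    // The last argument must match where dst and src actually live.
    ok = ok && cudaMemcpy(d_a, src.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    ok = ok && cudaMemcpy(d_b, d_a, bytes, cudaMemcpyDeviceToDevice) == cudaSuccess;
    ok = ok && cudaMemcpy(dst.data(), d_b, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(d_a);  // freeing a null pointer is a no-op
    cudaFree(d_b);
    return ok && dst == src;
}
```

Note that `count` is always a size in bytes, hence the `n * sizeof(int)`.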

11 SYNTAX: CUDA errors

    cudaError_t cudaGetLastError(void);
    const char* cudaGetErrorString(cudaError_t);

    cudaError_t myerror = cudaGetLastError();
    printf("CUDA ERROR STRING: %s\n", cudaGetErrorString(myerror));

[Figure: an error raised on the DEVICE (__global__ and __device__ code) travels as a hardware signal through the hardware driver to the CUDA library on the HOST (__host__ code), where printf reports it on the terminal]

TERMINAL
    $ ./ErrExample.x
    $ CudaSuccess
    $ _
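The sketch below shows one way these two calls might be used around kernel launches. The helper names `report_last_error` and `error_demo` are hypothetical; the invalid launch relies on the assumption, stated in the code, that no current GPU accepts a million threads in one block.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void empty_kernel() {}

// Hypothetical helper: prints the last CUDA error and returns it.
// cudaGetLastError() also resets the stored error to cudaSuccess.
cudaError_t report_last_error(const char *where)
{
    cudaError_t err = cudaGetLastError();
    printf("%s: %s\n", where, cudaGetErrorString(err));
    return err;
}

// Launches one valid and one invalid configuration.
// Returns true if only the invalid one is reported as an error.
bool error_demo()
{
    empty_kernel<<<1, 32>>>();
    cudaDeviceSynchronize();
    bool good_ok = (report_last_error("valid launch") == cudaSuccess);

    // Assumption: 1,000,000 threads per block exceeds the limit of any
    // current device, so the launch is rejected before anything runs.
    empty_kernel<<<1, 1000000>>>();
    bool bad_caught = (report_last_error("invalid launch") != cudaSuccess);

    return good_ok && bad_caught;
}
```

Kernel launches return no value, so checking `cudaGetLastError()` right after a launch is the usual way to notice a bad configuration.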

12 CUDA errors

13 Add example (reloaded)

CODE :: add.cu

    #include <stdio.h>
    #include <stdlib.h>
    #define N 10

    __global__ void add( int *a, int *b, int *c ) {
        int tid = threadIdx.x;
        c[tid] = a[tid] + b[tid];
    }

    int main ( void ) {
        int a[N], b[N], c[N];
        int *dev_a, *dev_b, *dev_c;
        cudaMalloc( (void**)&dev_a, N * sizeof(int) );
        cudaMalloc( (void**)&dev_b, N * sizeof(int) );
        cudaMalloc( (void**)&dev_c, N * sizeof(int) );
        for (int i=0; i<N; i++) {
            a[i] = -i;
            b[i] = i * i;
        }
        [...]

14 Add example (reloaded)

CODE :: add.cu [split view]

    [...]
    __global__ void add( int *a, int *b, int *c ) {
        int tid = threadIdx.x;
        c[tid] = a[tid] + b[tid];
    }
    [...]
        int *dev_a, *dev_b, *dev_c;
        cudaMalloc( (void**)&dev_a, N * sizeof(int) );
        cudaMalloc( (void**)&dev_b, N * sizeof(int) );
        cudaMalloc( (void**)&dev_c, N * sizeof(int) );
    [...]
        cudaMemcpy( dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice );
        cudaMemcpy( dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice );
        add<<<1,N>>>( dev_a, dev_b, dev_c );
        cudaMemcpy( c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost );
        for (int i=0; i<N; i++) {
            printf( "%d + %d = %d\n", a[i], b[i], c[i] );
        }
        cudaFree( dev_a );
        cudaFree( dev_b );
        cudaFree( dev_c );
        return 0;
    }

15 Add example (blocks+threads)

CODE :: add.cu [split view]

    [...]
    __global__ void add( int *a, int *b, int *c ) {
        int tid = (blockIdx.x*N)+threadIdx.x;
        c[tid] = a[tid] + b[tid];
    }
    [...]
        // M blocks of N threads touch M*N elements,
        // so a, b, c and the device arrays must each hold M*N ints
        int *dev_a, *dev_b, *dev_c;
        cudaMalloc( (void**)&dev_a, M * N * sizeof(int) );
        cudaMalloc( (void**)&dev_b, M * N * sizeof(int) );
        cudaMalloc( (void**)&dev_c, M * N * sizeof(int) );
    [...]
        cudaMemcpy( dev_a, a, M * N * sizeof(int), cudaMemcpyHostToDevice );
        cudaMemcpy( dev_b, b, M * N * sizeof(int), cudaMemcpyHostToDevice );
        add<<<M,N>>>( dev_a, dev_b, dev_c );
        cudaMemcpy( c, dev_c, M * N * sizeof(int), cudaMemcpyDeviceToHost );
        for (int i=0; i<M*N; i++) {
            printf( "%d + %d = %d\n", a[i], b[i], c[i] );
        }
        cudaFree( dev_a );
        cudaFree( dev_b );
        cudaFree( dev_c );
        return 0;
    }

16 Blocks, threads and built-in structures

[Figure: four blocks with blockIdx.x = 0, 1, 2, 3, each holding 8 threads numbered by threadIdx.x]

    int tid = (blockIdx.x*N)+threadIdx.x;            // e.g. (2*8) + 6 = 22
    int tid = (blockIdx.x*blockDim.x)+threadIdx.x;   // same index, using the built-in block size
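The global index can run past the end of the data whenever the array length is not a multiple of the block size. A common remedy, sketched here under that assumption, is to pass the length to the kernel and guard the access. The names `add_guarded` and `add_demo` are hypothetical and are not the slides' own code.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Vector add with a bounds check on the global thread index.
__global__ void add_guarded(const int *a, const int *b, int *c, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)  // threads past the end of the arrays do nothing
        c[tid] = a[tid] + b[tid];
}

// Hypothetical host helper: adds two n-element vectors on the GPU and
// returns true if the result matches the sum computed on the host.
bool add_demo(int n, int threads_per_block)
{
    std::vector<int> a(n), b(n), c(n, 0);
    for (int i = 0; i < n; ++i) { a[i] = -i; b[i] = i * i; }

    size_t bytes = n * sizeof(int);
    int *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    bool ok = cudaMalloc((void **)&d_a, bytes) == cudaSuccess
           && cudaMalloc((void **)&d_b, bytes) == cudaSuccess
           && cudaMalloc((void **)&d_c, bytes) == cudaSuccess;
    ok = ok && cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    ok = ok && cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        // Round the block count up so that every element gets a thread.
        int blocks = (n + threads_per_block - 1) / threads_per_block;
        add_guarded<<<blocks, threads_per_block>>>(d_a, d_b, d_c, n);
        ok = cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    for (int i = 0; ok && i < n; ++i) ok = (c[i] == a[i] + b[i]);
    return ok;
}
```

With `add_demo(22, 8)` the launch uses 3 blocks, that is 24 threads, and the last two threads fail the `tid < n` test and return without touching memory.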

17 Add example (built-in)

CODE :: add.cu [split view]

    [...]
    __global__ void add( int *a, int *b, int *c ) {
        int tid = (blockIdx.x*blockDim.x)+threadIdx.x;
        c[tid] = a[tid] + b[tid];
    }
    [...]
        // M blocks of N threads touch M*N elements,
        // so a, b, c and the device arrays must each hold M*N ints
        int *dev_a, *dev_b, *dev_c;
        cudaMalloc( (void**)&dev_a, M * N * sizeof(int) );
        cudaMalloc( (void**)&dev_b, M * N * sizeof(int) );
        cudaMalloc( (void**)&dev_c, M * N * sizeof(int) );
    [...]
        cudaMemcpy( dev_a, a, M * N * sizeof(int), cudaMemcpyHostToDevice );
        cudaMemcpy( dev_b, b, M * N * sizeof(int), cudaMemcpyHostToDevice );
        add<<<M,N>>>( dev_a, dev_b, dev_c );
        cudaMemcpy( c, dev_c, M * N * sizeof(int), cudaMemcpyDeviceToHost );
        for (int i=0; i<M*N; i++) {
            printf( "%d + %d = %d\n", a[i], b[i], c[i] );
        }
        cudaFree( dev_a );
        cudaFree( dev_b );
        cudaFree( dev_c );
        return 0;
    }

18 NVIDIA CUDA Libraries

The CUDA Toolkit includes several libraries:
- CUFFT: Fourier transforms
- CUBLAS: Dense Linear Algebra
- CUSPARSE: Sparse Linear Algebra
- LIBM: Standard C Math library
- CURAND: Pseudo-random and Quasi-random numbers
- NPP: Image and Signal Processing
- Thrust: Template Library

Several open source and commercial (*) libraries:
- MAGMA: Linear Algebra
- CULA Tools*: Linear Algebra
- CUSP: Sparse Linear Solvers
- OpenVidia: Computer Vision
- OpenCurrent: CFD
- NAG*: Computational Finance

[Figure: software stack. Applications sit on top of 3rd-party libraries (such as CUSP) and NVIDIA libraries (CUFFT, CUBLAS, CUSPARSE, Libm (math.h), CURAND, NPP, Thrust), which sit on top of CUDA C/Fortran]

19 CUDA meets Fast Fourier Transform: cufft

- Algorithms based on Cooley-Tukey (n = 2^a 3^b 5^c 7^d) and Bluestein
- Simple interface similar to FFTW
- 1D, 2D and 3D transforms of complex and real data
- Row-major order (C-order) for 2D and 3D data
- Single precision (SP) and Double precision (DP) transforms
- In-place and out-of-place transforms
- 1D transform sizes up to 128 million elements
- Batch execution for doing multiple transforms
- Streamed asynchronous execution
- Non-normalized output: IFFT(FFT(A)) = len(A)*A
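The non-normalized output is the property that most often surprises newcomers, so here is a hedged sketch that checks it on a 1D complex transform. The helper name `fft_round_trip_error` and the test signal are hypothetical; the program has to be linked against the cuFFT library (for example with `-lcufft`).

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical helper: runs a forward and then an inverse 1D C2C transform
// in place, divides the result by n, and returns the largest deviation from
// the original signal. Returns -1 on any CUDA or cuFFT failure.
float fft_round_trip_error(int n)
{
    std::vector<cufftComplex> in(n), out(n);
    for (int i = 0; i < n; ++i) { in[i].x = (float)(i % 7); in[i].y = 0.0f; }

    size_t bytes = n * sizeof(cufftComplex);
    cufftComplex *d_data = nullptr;
    if (cudaMalloc((void **)&d_data, bytes) != cudaSuccess) return -1.0f;

    cufftHandle plan;
    if (cufftPlan1d(&plan, n, CUFFT_C2C, 1) != CUFFT_SUCCESS) {
        cudaFree(d_data);
        return -1.0f;
    }

    bool ok = cudaMemcpy(d_data, in.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD) == CUFFT_SUCCESS
           && cufftExecC2C(plan, d_data, d_data, CUFFT_INVERSE) == CUFFT_SUCCESS
           && cudaMemcpy(out.data(), d_data, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cufftDestroy(plan);
    cudaFree(d_data);
    if (!ok) return -1.0f;

    float worst = 0.0f;
    for (int i = 0; i < n; ++i) {
        // IFFT(FFT(A)) = n * A, so dividing by n recovers the input.
        worst = std::fmax(worst, std::fabs(out[i].x / n - in[i].x));
        worst = std::fmax(worst, std::fabs(out[i].y / n - in[i].y));
    }
    return worst;
}
```

For a size such as `n = 256` the returned error should be at the level of single-precision rounding.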

20 cufft example

CODE :: simplefft.cu

    #define NX 256
    #define NY 128
    [...]
    cufftHandle plan;
    cufftComplex *idata, *odata;
    cudaMalloc((void**)&idata, sizeof(cufftComplex)*NX*NY);
    cudaMalloc((void**)&odata, sizeof(cufftComplex)*NX*NY);
    [...]
    /* Create a 2D FFT plan. */
    cufftPlan2d(&plan, NX, NY, CUFFT_C2C);
    /* Use the CUFFT plan to transform the signal */
    cufftExecC2C(plan, idata, odata, CUFFT_FORWARD);
    /* Inverse transform */
    cufftExecC2C(plan, odata, odata, CUFFT_INVERSE);
    [...]
    /* Destroy the CUFFT plan. */
    cufftDestroy(plan);
    cudaFree(idata);
    cudaFree(odata);
    [...]

Step 1: Allocate space on GPU memory.
Step 2: Create a plan specifying the transform configuration, such as the size and type (real, complex, 1D, 2D and so on).
Step 3: Execute the plan as many times as required, providing the pointer to the GPU data created in Step 1.
Step 4: Destroy the plan, free GPU memory.


22 POISSON EQUATION SOLVER

$$\nabla^2 \phi = r \quad \xrightarrow{\;\mathrm{FFT}\;} \quad -\left(k_x^2 + k_y^2\right)\hat{\phi} = \hat{r} \quad \Longrightarrow \quad \hat{\phi} = -\frac{\hat{r}}{k_x^2 + k_y^2}$$
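The step from the differential equation to the algebraic one can be spelled out as follows. This is a short derivation sketch that assumes a periodic domain, so that both fields expand in Fourier modes:

```latex
\begin{aligned}
\phi(x,y) &= \sum_{k_x,k_y} \hat{\phi}(k_x,k_y)\, e^{i(k_x x + k_y y)},
\qquad
r(x,y) = \sum_{k_x,k_y} \hat{r}(k_x,k_y)\, e^{i(k_x x + k_y y)} \\[4pt]
\nabla^2 \phi &= \sum_{k_x,k_y} -\left(k_x^2 + k_y^2\right) \hat{\phi}(k_x,k_y)\, e^{i(k_x x + k_y y)} \\[4pt]
\nabla^2 \phi = r
\;&\Longrightarrow\;
-\left(k_x^2 + k_y^2\right) \hat{\phi} = \hat{r}
\;\Longrightarrow\;
\hat{\phi} = -\frac{\hat{r}}{k_x^2 + k_y^2},
\qquad (k_x,k_y) \neq (0,0)
\end{aligned}
```

The mode $(k_x,k_y) = (0,0)$ is excluded because the solution is only defined up to an additive constant, which is why the code treats that wavenumber separately.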

23 POISSON EQUATION SOLVER - matlab

CODE :: pois_fft.m [reduced]

    % No. of Fourier modes
    N = 64;
    % Domain size (assumed square)
    L = 1;
    % Characteristic width of f (make << 1)
    sig = 0.1;
    % Vector of wavenumbers
    k = (2*pi/L)*[0:(N/2-1) (-N/2):(-1)];
    % Matrix of (x,y) wavenumbers corresponding
    % to Fourier mode (m,n)
    [KX KY] = meshgrid(k,k);
    % Laplacian matrix acting on the wavenumbers
    delsq = -(KX.^2 + KY.^2);
    % Kludge to avoid division by zero for
    % wavenumber (0,0).
    delsq(1,1) = 1;
    % Grid spacing
    h = L/N;
    x = (0:(N-1))*h;
    y = (0:(N-1))*h;
    [X Y] = meshgrid(x,y);
    % Construct RHS f(x,y) at the Fourier gridpts
    rsq = (X-0.5*L).^2 + (Y-0.5*L).^2;
    sigsq = sig^2;
    f = exp(-rsq/(2*sigsq)).*...
        (rsq - 2*sigsq)/(sigsq^2);
    % Spectral inversion of Laplacian
    fhat = fft2(f);
    u = real(ifft2(fhat./delsq));
    % Specify arbitrary constant by forcing corner
    % u = 0.
    u = u - u(1,1);
    % Compute L2 and Linf norm of error
    uex = exp(-rsq/(2*sigsq));
    errmax = norm(u(:)-uex(:),inf);
    errmax2 = norm(u(:)-uex(:),2)/(N*N);
    % Print L2 and Linf norm of error
    fprintf('N=%d\n',N);
    fprintf('Solution at (%d,%d): ',N/2,N/2);
    fprintf('computed=%10.6f ... reference = %10.6f\n',u(N/2,N/2),uex(N/2,N/2));
    fprintf('Linf err=%10.6e L2 norm err = %10.6e\n',errmax, errmax2);

24 POISSON EQUATION SOLVER in CUDA

The following steps need to be performed:
1. Allocate memory on host: r (NxN), u (NxN), kx (N) and ky (N)
2. Allocate memory on device: r_d, u_d, kx_d, ky_d
3. Transfer r, kx and ky content from host memory to the device memory
4. Initialize plan for FFT
5. Compute execution configuration
6. Transform real input to complex input
7. 2D forward FFT
8. Solve Poisson equation in Fourier space
9. 2D inverse FFT
10. Transform complex output to real output and apply scaling
11. Transfer results from the GPU back to the host

We are not taking advantage of the symmetries (a C2C transform is used for real data) to keep the code simple.

25 POISSON EQUATION SOLVER - steps 1:5

CODE :: pois_fft.cu

    /* Allocate arrays on the host */
    float *kx, *ky, *r;
    kx = (float *) malloc(sizeof(float)*N);
    ky = (float *) malloc(sizeof(float)*N);
    r  = (float *) malloc(sizeof(float)*N*N);

    /* Allocate arrays on the GPU */
    float *kx_d, *ky_d, *r_d;
    cudaMalloc( (void **) &kx_d, sizeof(float)*N);
    cudaMalloc( (void **) &ky_d, sizeof(float)*N);
    cudaMalloc( (void **) &r_d,  sizeof(float)*N*N);
    cufftComplex *r_complex_d;
    cudaMalloc( (void **) &r_complex_d, sizeof(cufftComplex)*N*N);

    /* Initialize r, kx and ky on the host */
    ...

    /* Transfer data from host to device */
    cudaMemcpy (kx_d, kx, sizeof(float)*N,   cudaMemcpyHostToDevice);
    cudaMemcpy (ky_d, ky, sizeof(float)*N,   cudaMemcpyHostToDevice);
    cudaMemcpy (r_d,  r,  sizeof(float)*N*N, cudaMemcpyHostToDevice);

    /* Create plan for CUDA FFT */
    cufftHandle plan;
    cufftPlan2d( &plan, N, N, CUFFT_C2C);

    /* Compute the execution configuration */
    dim3 dimBlock(BLOCK_SIZE_X, BLOCK_SIZE_Y);
    dim3 dimGrid (N/dimBlock.x, N/dimBlock.y);
    if (N % BLOCK_SIZE_X != 0 ) dimGrid.x += 1;
    if (N % BLOCK_SIZE_Y != 0 ) dimGrid.y += 1;
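The execution configuration in step 5 rounds the grid up whenever N is not a multiple of the block size. The same result can be obtained with one integer ceiling division per dimension, as in this small sketch; the helper name `grid_for` is hypothetical.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: smallest grid of `block`-sized blocks that covers an
// n x n domain (n is assumed to be positive).
dim3 grid_for(int n, dim3 block)
{
    // (n + b - 1) / b is n / b rounded up, so an extra block is added only
    // when n is not an exact multiple of the block size.
    unsigned int gx = (n + block.x - 1) / block.x;
    unsigned int gy = (n + block.y - 1) / block.y;
    return dim3(gx, gy);
}
```

For N = 64 and a 32 x 16 block this gives a 2 x 4 grid; for N = 100 it gives 4 x 7, and the kernels then rely on their `idx < N && idy < N` test to ignore the surplus threads.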

26 POISSON EQUATION SOLVER - steps 6:11

CODE :: pois_fft.cu

    /* Transform real input to complex input */
    real2complex<<<dimGrid, dimBlock>>> (r_d, r_complex_d, N);

    /* Compute forward FFT */
    cufftExecC2C (plan, r_complex_d, r_complex_d, CUFFT_FORWARD);

    /* Solve Poisson equation in Fourier space */
    solve_poisson<<<dimGrid, dimBlock>>> (r_complex_d, kx_d, ky_d, N);

    /* Compute inverse FFT */
    cufftExecC2C (plan, r_complex_d, r_complex_d, CUFFT_INVERSE);

    /* Copy the solution back to a real array and apply scaling */
    float scale = 1.f / ( (float) N * (float) N );
    /* Argument order follows the kernel signature: complex input first, real output second */
    complex2real_scaled<<<dimGrid, dimBlock>>> (r_complex_d, r_d, N, scale);

    /* Transfer data from device to host */
    cudaMemcpy (r, r_d, sizeof(float)*N*N, cudaMemcpyDeviceToHost);

    /* Destroy plan and clean up memory on device */
    cufftDestroy( plan);
    cudaFree(r_complex_d);
    cudaFree(kx_d);
    cudaFree(ky_d);
    cudaFree(r_d);

27 POISSON EQUATION SOLVER - functions

CODE :: pois_fft.cu

    __global__ void real2complex (float *a, cufftComplex *c, int N)
    {
        int idx = blockIdx.x*blockDim.x+threadIdx.x;
        int idy = blockIdx.y*blockDim.y+threadIdx.y;
        if ( idx < N && idy < N)
        {
            int index = idx + idy*N;
            c[index].x = a[index];
            c[index].y = 0.f;
        }
    }

    __global__ void solve_poisson (cufftComplex *c, float *kx, float *ky, int N)
    {
        int idx = blockIdx.x*blockDim.x+threadIdx.x;
        int idy = blockIdx.y*blockDim.y+threadIdx.y;
        if ( idx < N && idy < N)
        {
            int index = idx + idy*N;
            float scale = - ( kx[idx]*kx[idx] + ky[idy]*ky[idy] );
            /* avoid division by zero for wavenumber (0,0) */
            if ( idx == 0 && idy == 0 ) scale = 1.f;
            scale = 1.f / scale;
            c[index].x *= scale;
            c[index].y *= scale;
        }
    }

28 POISSON EQUATION SOLVER - execution

CODE :: pois_fft.cu

    __global__ void complex2real_scaled (cufftComplex *c, float *a, int N, float scale)
    {
        int idx = blockIdx.x*blockDim.x+threadIdx.x;
        int idy = blockIdx.y*blockDim.y+threadIdx.y;
        if ( idx < N && idy < N)
        {
            int index = idx + idy*N;
            a[index] = scale*c[index].x ;
        }
    }

TERMINAL

    $ nvcc -O3 -o poisson poisson.cu -I/usr/local/cuda/include -L/usr/local/cuda/lib -lcufft -I/usr/local/NVIDIA_CUDA_SDK/common/inc -L/usr/local/NVIDIA_CUDA_SDK/lib -lcutil
    $ ./poisson -N64
    Poisson solver on a domain 64 x 64
    dimBlock 32 16 (512 threads)
    dimGrid 2 4
    L2 error e-08:
    Time :
    Time I/O ( ):
    Solution at (32,32) computed= reference=
    $ _

From MATLAB

    N=64
    Solution at (32,32): computed= reference=
    Linf err= e-05 L2 norm err = e-08

29 Unified Distributed Architectures: the last frontier

Multi-GPU code
- Domain decomposition
- Need to add communication of halos

GPUs within a single network node
- Exchange halos via PCIe communication
- With CUDA 4.0 can use Peer-to-Peer communication

GPUs in different network nodes
- Require network communication
- Currently require transferring GPU data to/from host CPU
- Efforts underway to make MPI aware of GPUs (OpenMPI, MVAPICH)

30 Unified Distributed Architectures: the last frontier

The goal is to hide communication cost.

So, every time-step each GPU will:
- Compute halos to be sent to neighbors
- Compute the internal region
- Exchange halos with neighbors

Linear scaling as long as internal computation takes longer than halo exchange.
- Well, separate halo computation adds some overhead

Example: Two Subdomains
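One way to hide the communication cost on a single GPU is to use two CUDA streams, so that the halo copy runs while the interior is still being computed. The sketch below illustrates only that overlap: the kernel is a placeholder that doubles each cell, the names `update_range` and `overlapped_step` are hypothetical, and the actual send to a neighbouring GPU (over PCIe or the network) is left out.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Placeholder update for cells [first, last): doubles each value.
__global__ void update_range(const float *in, float *out, int first, int last)
{
    int i = first + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < last) out[i] = 2.0f * in[i];
}

// Hypothetical single-GPU sketch of one time step on n >= 3 cells.
// Returns true if halos and interior both hold the expected values.
bool overlapped_step(int n)
{
    const int threads = 128;
    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr, *h_halo = nullptr;
    cudaStream_t s_halo, s_inner;
    // Pinned host memory is needed for the copies to be truly asynchronous.
    if (cudaMalloc((void **)&d_in, bytes) != cudaSuccess ||
        cudaMalloc((void **)&d_out, bytes) != cudaSuccess ||
        cudaMallocHost((void **)&h_halo, 2 * sizeof(float)) != cudaSuccess ||
        cudaStreamCreate(&s_halo) != cudaSuccess ||
        cudaStreamCreate(&s_inner) != cudaSuccess)
        return false;  // sketch only: partial allocations are not released
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    // 1. Halo stream: update the two boundary cells, then start copying them out.
    update_range<<<1, 1, 0, s_halo>>>(d_in, d_out, 0, 1);
    update_range<<<1, 1, 0, s_halo>>>(d_in, d_out, n - 1, n);
    cudaMemcpyAsync(&h_halo[0], d_out, sizeof(float), cudaMemcpyDeviceToHost, s_halo);
    cudaMemcpyAsync(&h_halo[1], d_out + n - 1, sizeof(float), cudaMemcpyDeviceToHost, s_halo);

    // 2. Interior stream: may run while the halo copies are in flight.
    int inner = n - 2;
    update_range<<<(inner + threads - 1) / threads, threads, 0, s_inner>>>(d_in, d_out, 1, n - 1);

    // 3. Once the halo stream drains, the halos could be sent to the neighbour.
    cudaStreamSynchronize(s_halo);
    bool ok = (h_halo[0] == 2.0f * h_in[0]) && (h_halo[1] == 2.0f * h_in[n - 1]);

    cudaStreamSynchronize(s_inner);
    ok = ok && cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    for (int i = 0; ok && i < n; ++i) ok = (h_out[i] == 2.0f * h_in[i]);

    cudaStreamDestroy(s_halo);
    cudaStreamDestroy(s_inner);
    cudaFreeHost(h_halo);
    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```

Computing the halos in their own small launches is the overhead the slide mentions; it pays off only when the interior kernel runs longer than the halo transfer.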

31 Unified Distributed Architectures: the last frontier

[Figure: timeline for two subdomains (GPU-0: green subdomain, GPU-1: grey subdomain). Phase 1: both GPUs compute. Phase 2: both GPUs compute and send. Source: NVIDIA]

32 Unified Distributed Architectures: the last frontier

MVAPICH design

From: "MVAPICH-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters", H. Wang, S. Potluri, M. Luo, A.K. Singh, S. Sur, D.K. Panda

33 Unified Distributed Architectures: the last frontier

MVAPICH performances

[Figure: two performance plots, titled "Ping-pong" and "One side". Source: NVIDIA]

34 Unified Distributed Architectures: the last frontier

LET'S SEE A SIMULATION

35 QUESTION TIME

36 Thank You

You will find the PDF edition in the didactic section of the author's website. To contact the author, send an e-mail. If you want to share this presentation, be sure to read and follow the CC-BY-NC-ND-4.0 license.


Information Coding / Computer Graphics, ISY, LiTH. Introduction to CUDA. Ingemar Ragnemalm Information Coding, ISY Introduction to CUDA Ingemar Ragnemalm Information Coding, ISY This lecture: Programming model and language Memory spaces and memory access Shared memory Examples Lecture questions: 1. Suggest two significant

More information

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series Introduction to GPU Computing Using CUDA Spring 2014 Westgid Seminar Series Scott Northrup SciNet www.scinethpc.ca (Slides http://support.scinet.utoronto.ca/ northrup/westgrid CUDA.pdf) March 12, 2014

More information

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series Introduction to GPU Computing Using CUDA Spring 2014 Westgid Seminar Series Scott Northrup SciNet www.scinethpc.ca March 13, 2014 Outline 1 Heterogeneous Computing 2 GPGPU - Overview Hardware Software

More information

Information Coding / Computer Graphics, ISY, LiTH. Introduction to CUDA. Ingemar Ragnemalm Information Coding, ISY

Information Coding / Computer Graphics, ISY, LiTH. Introduction to CUDA. Ingemar Ragnemalm Information Coding, ISY Introduction to CUDA Ingemar Ragnemalm Information Coding, ISY This lecture: Programming model and language Introduction to memory spaces and memory access Shared memory Matrix multiplication example Lecture

More information

CUDA GPGPU Workshop CUDA/GPGPU Arch&Prog

CUDA GPGPU Workshop CUDA/GPGPU Arch&Prog CUDA GPGPU Workshop 2012 CUDA/GPGPU Arch&Prog Yip Wichita State University 7/11/2012 GPU-Hardware perspective GPU as PCI device Original PCI PCIe Inside GPU architecture GPU as PCI device Traditional PC

More information

Lecture 2: Introduction to CUDA C

Lecture 2: Introduction to CUDA C CS/EE 217 GPU Architecture and Programming Lecture 2: Introduction to CUDA C David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013 1 CUDA /OpenCL Execution Model Integrated host+device app C program Serial or

More information

CUDA Toolkit 4.2 CUFFT Library

CUDA Toolkit 4.2 CUFFT Library CUDA Toolkit 4.2 CUFFT Library PG-05327-040_v01 March 2012 Programming Guide Contents 1 Introduction 2 2 Using the CUFFT API 3 2.1 Data Layout..................................... 4 2.1.1 FFTW Compatibility

More information

CUDA Toolkit 5.0 CUFFT Library DRAFT. PG _v01 April Programming Guide

CUDA Toolkit 5.0 CUFFT Library DRAFT. PG _v01 April Programming Guide CUDA Toolkit 5.0 CUFFT Library PG-05327-050_v01 April 2012 Programming Guide Contents 1 Introduction 2 2 Using the CUFFT API 3 2.1 Data Layout..................................... 4 2.1.1 FFTW Compatibility

More information

CUDA Programming. Week 1. Basic Programming Concepts Materials are copied from the reference list

CUDA Programming. Week 1. Basic Programming Concepts Materials are copied from the reference list CUDA Programming Week 1. Basic Programming Concepts Materials are copied from the reference list G80/G92 Device SP: Streaming Processor (Thread Processors) SM: Streaming Multiprocessor 128 SP grouped into

More information

Introduction to GPGPUs and to CUDA programming model: CUDA Libraries

Introduction to GPGPUs and to CUDA programming model: CUDA Libraries Introduction to GPGPUs and to CUDA programming model: CUDA Libraries www.cineca.it Marzia Rivi m.rivi@cineca.it NVIDIA CUDA Libraries http://developer.nvidia.com/technologies/libraries CUDA Toolkit includes

More information

GPU Computing Workshop CSU Getting Started. Garland Durham Quantos Analytics

GPU Computing Workshop CSU Getting Started. Garland Durham Quantos Analytics 1 GPU Computing Workshop CSU 2013 Getting Started Garland Durham Quantos Analytics nvidia-smi 2 At command line, run command nvidia-smi to get/set GPU properties. nvidia-smi Options: -q query -L list attached

More information

Advanced Topics: Streams, Multi-GPU, Tools, Libraries, etc.

Advanced Topics: Streams, Multi-GPU, Tools, Libraries, etc. CSC 391/691: GPU Programming Fall 2011 Advanced Topics: Streams, Multi-GPU, Tools, Libraries, etc. Copyright 2011 Samuel S. Cho Streams Until now, we have largely focused on massively data-parallel execution

More information

Getting Started with CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator

Getting Started with CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator Getting Started with CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator Heterogeneous Computing CPU GPU Once upon a time Past Massively Parallel Supercomputers Goodyear MPP Thinking Machine MasPar Cray 2 1.31

More information

Accelerating MATLAB with CUDA

Accelerating MATLAB with CUDA Accelerating MATLAB with CUDA Massimiliano Fatica NVIDIA mfatica@nvidia.com Won-Ki Jeong University of Utah wkjeong@cs.utah.edu Overview MATLAB can be easily extended via MEX files to take advantage of

More information

Pinned-Memory. Table of Contents. Streams Learning CUDA to Solve Scientific Problems. Objectives. Technical Issues Stream. Pinned-memory.

Pinned-Memory. Table of Contents. Streams Learning CUDA to Solve Scientific Problems. Objectives. Technical Issues Stream. Pinned-memory. Table of Contents Streams Learning CUDA to Solve Scientific Problems. 1 Objectives Miguel Cárdenas Montes Centro de Investigaciones Energéticas Medioambientales y Tecnológicas, Madrid, Spain miguel.cardenas@ciemat.es

More information

Basic Elements of CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Basic Elements of CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Basic Elements of CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

Parallel Programming and Debugging with CUDA C. Geoff Gerfin Sr. System Software Engineer

Parallel Programming and Debugging with CUDA C. Geoff Gerfin Sr. System Software Engineer Parallel Programming and Debugging with CUDA C Geoff Gerfin Sr. System Software Engineer CUDA - NVIDIA s Architecture for GPU Computing Broad Adoption Over 250M installed CUDA-enabled GPUs GPU Computing

More information

Scientific discovery, analysis and prediction made possible through high performance computing.

Scientific discovery, analysis and prediction made possible through high performance computing. Scientific discovery, analysis and prediction made possible through high performance computing. An Introduction to GPGPU Programming Bob Torgerson Arctic Region Supercomputing Center November 21 st, 2013

More information

Parallel Computing. Lecture 19: CUDA - I

Parallel Computing. Lecture 19: CUDA - I CSCI-UA.0480-003 Parallel Computing Lecture 19: CUDA - I Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com GPU w/ local DRAM (device) Behind CUDA CPU (host) Source: http://hothardware.com/reviews/intel-core-i5-and-i7-processors-and-p55-chipset/?page=4

More information

Module 2: Introduction to CUDA C. Objective

Module 2: Introduction to CUDA C. Objective ECE 8823A GPU Architectures Module 2: Introduction to CUDA C 1 Objective To understand the major elements of a CUDA program Introduce the basic constructs of the programming model Illustrate the preceding

More information

Zero Copy Memory and Multiple GPUs

Zero Copy Memory and Multiple GPUs Zero Copy Memory and Multiple GPUs Goals Zero Copy Memory Pinned and mapped memory on the host can be read and written to from the GPU program (if the device permits this) This may result in performance

More information

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Dirk Ribbrock Faculty of Mathematics, TU dortmund 2016 Table of Contents Why parallel

More information

Lecture 10!! Introduction to CUDA!

Lecture 10!! Introduction to CUDA! 1(50) Lecture 10 Introduction to CUDA Ingemar Ragnemalm Information Coding, ISY 1(50) Laborations Some revisions may happen while making final adjustments for Linux Mint. Last minute changes may occur.

More information

Register file. A single large register file (ex. 16K registers) is partitioned among the threads of the dispatched blocks.

Register file. A single large register file (ex. 16K registers) is partitioned among the threads of the dispatched blocks. Sharing the resources of an SM Warp 0 Warp 1 Warp 47 Register file A single large register file (ex. 16K registers) is partitioned among the threads of the dispatched blocks Shared A single SRAM (ex. 16KB)

More information

Lecture 3: Introduction to CUDA

Lecture 3: Introduction to CUDA CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 3: Introduction to CUDA Some slides here are adopted from: NVIDIA teaching kit Mohamed Zahran (aka Z) mzahran@cs.nyu.edu

More information

GPU Programming. Lecture 2: CUDA C Basics. Miaoqing Huang University of Arkansas 1 / 34

GPU Programming. Lecture 2: CUDA C Basics. Miaoqing Huang University of Arkansas 1 / 34 1 / 34 GPU Programming Lecture 2: CUDA C Basics Miaoqing Huang University of Arkansas 2 / 34 Outline Evolvements of NVIDIA GPU CUDA Basic Detailed Steps Device Memories and Data Transfer Kernel Functions

More information

Programming with CUDA, WS09

Programming with CUDA, WS09 Programming with CUDA and Parallel Algorithms Waqar Saleem Jens Müller Lecture 3 Thursday, 29 Nov, 2009 Recap Motivational videos Example kernel Thread IDs Memory overhead CUDA hardware and programming

More information

GPU Programming Using CUDA. Samuli Laine NVIDIA Research

GPU Programming Using CUDA. Samuli Laine NVIDIA Research GPU Programming Using CUDA Samuli Laine NVIDIA Research Today GPU vs CPU Different architecture, different workloads Basics of CUDA Executing code on GPU Managing memory between CPU and GPU CUDA API Quick

More information

Zero-copy. Table of Contents. Multi-GPU Learning CUDA to Solve Scientific Problems. Objectives. Technical Issues Zero-copy. Multigpu.

Zero-copy. Table of Contents. Multi-GPU Learning CUDA to Solve Scientific Problems. Objectives. Technical Issues Zero-copy. Multigpu. Table of Contents Multi-GPU Learning CUDA to Solve Scientific Problems. 1 Objectives Miguel Cárdenas Montes 2 Zero-copy Centro de Investigaciones Energéticas Medioambientales y Tecnológicas, Madrid, Spain

More information

GPU Architecture and Programming. Andrei Doncescu inspired by NVIDIA

GPU Architecture and Programming. Andrei Doncescu inspired by NVIDIA GPU Architecture and Programming Andrei Doncescu inspired by NVIDIA Traditional Computing Von Neumann architecture: instructions are sent from memory to the CPU Serial execution: Instructions are executed

More information

Introduction to GPU programming. Introduction to GPU programming p. 1/17

Introduction to GPU programming. Introduction to GPU programming p. 1/17 Introduction to GPU programming Introduction to GPU programming p. 1/17 Introduction to GPU programming p. 2/17 Overview GPUs & computing Principles of CUDA programming One good reference: David B. Kirk

More information

Lecture 9. Outline. CUDA : a General-Purpose Parallel Computing Architecture. CUDA Device and Threads CUDA. CUDA Architecture CUDA (I)

Lecture 9. Outline. CUDA : a General-Purpose Parallel Computing Architecture. CUDA Device and Threads CUDA. CUDA Architecture CUDA (I) Lecture 9 CUDA CUDA (I) Compute Unified Device Architecture 1 2 Outline CUDA Architecture CUDA Architecture CUDA programming model CUDA-C 3 4 CUDA : a General-Purpose Parallel Computing Architecture CUDA

More information

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1 Lecture 15: Introduction to GPU programming Lecture 15: Introduction to GPU programming p. 1 Overview Hardware features of GPGPU Principles of GPU programming A good reference: David B. Kirk and Wen-mei

More information

Introduction to GPGPUs and to CUDA programming model

Introduction to GPGPUs and to CUDA programming model Introduction to GPGPUs and to CUDA programming model www.cineca.it Marzia Rivi m.rivi@cineca.it GPGPU architecture CUDA programming model CUDA efficient programming Debugging & profiling tools CUDA libraries

More information

GPU Programming Using CUDA. Samuli Laine NVIDIA Research

GPU Programming Using CUDA. Samuli Laine NVIDIA Research GPU Programming Using CUDA Samuli Laine NVIDIA Research Today GPU vs CPU Different architecture, different workloads Basics of CUDA Executing code on GPU Managing memory between CPU and GPU CUDA API Quick

More information

ECE 574 Cluster Computing Lecture 15

ECE 574 Cluster Computing Lecture 15 ECE 574 Cluster Computing Lecture 15 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 30 March 2017 HW#7 (MPI) posted. Project topics due. Update on the PAPI paper Announcements

More information

CUDA Basics. July 6, 2016

CUDA Basics. July 6, 2016 Mitglied der Helmholtz-Gemeinschaft CUDA Basics July 6, 2016 CUDA Kernels Parallel portion of application: execute as a kernel Entire GPU executes kernel, many threads CUDA threads: Lightweight Fast switching

More information

Basic CUDA workshop. Outlines. Setting Up Your Machine Architecture Getting Started Programming CUDA. Fine-Tuning. Penporn Koanantakool

Basic CUDA workshop. Outlines. Setting Up Your Machine Architecture Getting Started Programming CUDA. Fine-Tuning. Penporn Koanantakool Basic CUDA workshop Penporn Koanantakool twitter: @kaewgb e-mail: kaewgb@gmail.com Outlines Setting Up Your Machine Architecture Getting Started Programming CUDA Debugging Fine-Tuning Setting Up Your Machine

More information

COSC 6374 Parallel Computations Introduction to CUDA

COSC 6374 Parallel Computations Introduction to CUDA COSC 6374 Parallel Computations Introduction to CUDA Edgar Gabriel Fall 2014 Disclaimer Material for this lecture has been adopted based on various sources Matt Heavener, CS, State Univ. of NY at Buffalo

More information

Tesla Architecture, CUDA and Optimization Strategies

Tesla Architecture, CUDA and Optimization Strategies Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization

More information

RAPID MULTI-GPU PROGRAMMING WITH CUDA LIBRARIES. Nikolay Markovskiy

RAPID MULTI-GPU PROGRAMMING WITH CUDA LIBRARIES. Nikolay Markovskiy RAPID MULTI-GPU PROGRAMMING WITH CUDA LIBRARIES Nikolay Markovskiy CUDA 6 cublas cufft 2. cublas-xt 3. cufft-xt 1. NVBLAS WHAT IS NVBLAS? Drop-in replacement of BLAS Built on top of cublas-xt BLAS Level

More information

CUFFT LIBRARY USER'S GUIDE. DU _v7.5 September 2015

CUFFT LIBRARY USER'S GUIDE. DU _v7.5 September 2015 CUFFT LIBRARY USER'S GUIDE DU-06707-001_v7.5 September 2015 TABLE OF CONTENTS Chapter 1. Introduction...1 Chapter 2. Using the cufft API... 3 2.1. Accessing cufft...4 2.2. Fourier Transform Setup...5 2.3.

More information

CSE 599 I Accelerated Computing Programming GPUS. Intro to CUDA C

CSE 599 I Accelerated Computing Programming GPUS. Intro to CUDA C CSE 599 I Accelerated Computing Programming GPUS Intro to CUDA C GPU Teaching Kit Accelerated Computing Lecture 2.1 - Introduction to CUDA C CUDA C vs. Thrust vs. CUDA Libraries Objective To learn the

More information

CUDA 6.5 Performance Report

CUDA 6.5 Performance Report CUDA 6.5 Performance Report 1 CUDA 6.5 Performance Report CUDART CUDA Runtime Library cufft Fast Fourier Transforms Library cublas Complete BLAS Library cusparse Sparse Matrix Library curand Random Number

More information

HPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Introduction to CUDA programming

HPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Introduction to CUDA programming KFUPM HPC Workshop April 29-30 2015 Mohamed Mekias HPC Solutions Consultant Introduction to CUDA programming 1 Agenda GPU Architecture Overview Tools of the Trade Introduction to CUDA C Patterns of Parallel

More information

Introduction to CUDA C QCon 2011 Cyril Zeller, NVIDIA Corporation

Introduction to CUDA C QCon 2011 Cyril Zeller, NVIDIA Corporation Introduction to CUDA C QCon 2011 Cyril Zeller, NVIDIA Corporation Welcome Goal: an introduction to GPU programming CUDA = NVIDIA s architecture for GPU computing What will you learn in this session? Start

More information

Scientific GPU computing with Go A novel approach to highly reliable CUDA HPC 1 February 2014

Scientific GPU computing with Go A novel approach to highly reliable CUDA HPC 1 February 2014 Scientific GPU computing with Go A novel approach to highly reliable CUDA HPC 1 February 2014 Arne Vansteenkiste Ghent University Real-world example (micromagnetism) DyNaMat LAB @ UGent: Microscale Magnetic

More information

Introduction to CUDA CME343 / ME May James Balfour [ NVIDIA Research

Introduction to CUDA CME343 / ME May James Balfour [ NVIDIA Research Introduction to CUDA CME343 / ME339 18 May 2011 James Balfour [ jbalfour@nvidia.com] NVIDIA Research CUDA Programing system for machines with GPUs Programming Language Compilers Runtime Environments Drivers

More information

Introduc)on to CUDA. T.V. Singh Ins)tute for Digital Research and Educa)on UCLA

Introduc)on to CUDA. T.V. Singh Ins)tute for Digital Research and Educa)on UCLA Introduc)on to CUDA T.V. Singh Ins)tute for Digital Research and Educa)on UCLA tvsingh@ucla.edu GPU and GPGPU Graphics Processing Unit A specialized device on computer to accelerate building of images

More information

CUDA Programming. Aiichiro Nakano

CUDA Programming. Aiichiro Nakano CUDA Programming Aiichiro Nakano Collaboratory for Advanced Computing & Simulations Department of Computer Science Department of Physics & Astronomy Department of Chemical Engineering & Materials Science

More information

ECE 574 Cluster Computing Lecture 17

ECE 574 Cluster Computing Lecture 17 ECE 574 Cluster Computing Lecture 17 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 March 2019 HW#8 (CUDA) posted. Project topics due. Announcements 1 CUDA installing On Linux

More information

Introduc)on to GPU Programming

Introduc)on to GPU Programming Introduc)on to GPU Programming Mubashir Adnan Qureshi h3p://www.ncsa.illinois.edu/people/kindr/projects/hpca/files/singapore_p1.pdf h3p://developer.download.nvidia.com/cuda/training/nvidia_gpu_compu)ng_webinars_cuda_memory_op)miza)on.pdf

More information

Module 2: Introduction to CUDA C

Module 2: Introduction to CUDA C ECE 8823A GPU Architectures Module 2: Introduction to CUDA C 1 Objective To understand the major elements of a CUDA program Introduce the basic constructs of the programming model Illustrate the preceding

More information

GPU & High Performance Computing (by NVIDIA) CUDA. Compute Unified Device Architecture Florian Schornbaum

GPU & High Performance Computing (by NVIDIA) CUDA. Compute Unified Device Architecture Florian Schornbaum GPU & High Performance Computing (by NVIDIA) CUDA Compute Unified Device Architecture 29.02.2008 Florian Schornbaum GPU Computing Performance In the last few years the GPU has evolved into an absolute

More information

GPU programming. Dr. Bernhard Kainz

GPU programming. Dr. Bernhard Kainz GPU programming Dr. Bernhard Kainz Overview About myself Motivation GPU hardware and system architecture GPU programming languages GPU programming paradigms Pitfalls and best practice Reduction and tiling

More information

Using a GPU in InSAR processing to improve performance

Using a GPU in InSAR processing to improve performance Using a GPU in InSAR processing to improve performance Rob Mellors, ALOS PI 152 San Diego State University David Sandwell University of California, San Diego What is a GPU? (Graphic Processor Unit) A graphics

More information

COMP 322: Fundamentals of Parallel Programming. Flynn s Taxonomy for Parallel Computers

COMP 322: Fundamentals of Parallel Programming. Flynn s Taxonomy for Parallel Computers COMP 322: Fundamentals of Parallel Programming Lecture 37: General-Purpose GPU (GPGPU) Computing Max Grossman, Vivek Sarkar Department of Computer Science, Rice University max.grossman@rice.edu, vsarkar@rice.edu

More information

GPU Computing: Introduction to CUDA. Dr Paul Richmond

GPU Computing: Introduction to CUDA. Dr Paul Richmond GPU Computing: Introduction to CUDA Dr Paul Richmond http://paulrichmond.shef.ac.uk This lecture CUDA Programming Model CUDA Device Code CUDA Host Code and Memory Management CUDA Compilation Programming

More information

GPU CUDA Programming

GPU CUDA Programming GPU CUDA Programming 이정근 (Jeong-Gun Lee) 한림대학교컴퓨터공학과, 임베디드 SoC 연구실 www.onchip.net Email: Jeonggun.Lee@hallym.ac.kr ALTERA JOINT LAB Introduction 차례 Multicore/Manycore and GPU GPU on Medical Applications

More information

CS377P Programming for Performance GPU Programming - I

CS377P Programming for Performance GPU Programming - I CS377P Programming for Performance GPU Programming - I Sreepathi Pai UTCS November 9, 2015 Outline 1 Introduction to CUDA 2 Basic Performance 3 Memory Performance Outline 1 Introduction to CUDA 2 Basic

More information

CUDA Programming. Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna.

CUDA Programming. Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna. Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna moreno.marzolla@unibo.it Copyright 2014, 2017, 2018 Moreno Marzolla, Università di Bologna, Italy http://www.moreno.marzolla.name/teaching/hpc/

More information

Practical Introduction to CUDA and GPU

Practical Introduction to CUDA and GPU Practical Introduction to CUDA and GPU Charlie Tang Centre for Theoretical Neuroscience October 9, 2009 Overview CUDA - stands for Compute Unified Device Architecture Introduced Nov. 2006, a parallel computing

More information

University of Bielefeld

University of Bielefeld Geistes-, Natur-, Sozial- und Technikwissenschaften gemeinsam unter einem Dach Introduction to GPU Programming using CUDA Olaf Kaczmarek University of Bielefeld STRONGnet Summerschool 2011 ZIF Bielefeld

More information

Writing and compiling a CUDA code

Writing and compiling a CUDA code Writing and compiling a CUDA code Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) Writing CUDA code 1 / 65 The CUDA language If we want fast code, we (unfortunately)

More information