PART IV. CUDA programming: code, libraries and tools. Dr. Christian Napoli, M.Sc., Dpt. Mathematics and Informatics, University of Catania


Postgraduate course on Electronics and Informatics Engineering (M.Sc.)
Training Course on Circuits Theory (prof. G. Capizzi)
Workshop on High performance computing and GPGPU computing

Postgraduate course on Computer Sciences (M.Sc.)
Training Course on Distributed Systems (prof. G. Pappalardo)
Workshop on High performance computing and GPGPU computing

PART IV
CUDA programming: code, libraries and tools
Dr. Christian Napoli, M.Sc.
Dpt. Mathematics and Informatics, University of Catania

Christian Napoli. Licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.


In the last episode

#include <iostream>
#include <algorithm>
#include <cstdlib>
using namespace std;

#define N 1024
#define RADIUS 3
#define BLOCK_SIZE 16

/* parallel fn */
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Synchronize (ensure all the data is available)
    __syncthreads();

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS ; offset <= RADIUS ; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}

/* serial code */
void fill_ints(int *x, int n) {
    fill_n(x, n, 1);
}

int main(void) {
    int *in, *out;       // host copies of in, out
    int *d_in, *d_out;   // device copies of in, out
    int size = (N + 2 * RADIUS) * sizeof(int);

    // Alloc space for host copies and setup values
    in = (int *)malloc(size);  fill_ints(in, N + 2 * RADIUS);
    out = (int *)malloc(size); fill_ints(out, N + 2 * RADIUS);

    // Alloc space for device copies
    cudaMalloc((void **)&d_in, size);
    cudaMalloc((void **)&d_out, size);

    // Copy to device
    cudaMemcpy(d_in, in, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);

    /* parallel code */
    // Launch stencil_1d() kernel on GPU
    stencil_1d<<<N/BLOCK_SIZE,BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);

    /* serial code */
    // Copy result back to host
    cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(in); free(out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}

Processing flow
1. Copy input data from CPU memory to GPU memory (over the PCI bus)
2. Load GPU program and execute, caching data on chip
3. Copy results from GPU memory to CPU memory (over the PCI bus)

The first instruments for the job
CODE :: add.cu
[...]
__global__ void add( int *a, int *b, int *c );

int main ( void ) {
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;
    cudaMalloc( (void**)&dev_a, N * sizeof(int) );
    [...]
    cudaMemcpy( dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice );
    [...]
    add<<<1,N>>>( dev_a, dev_b, dev_c );
    [...]
    cudaFree( dev_a );
    [...]
}
The listing mixes four kinds of CUDA elements: KEYWORDS (__global__), LIBRARIES (cudaMalloc, cudaMemcpy, cudaFree), TOKENS (cudaMemcpyHostToDevice) and CONSTRUCTS (<<<1,N>>>).

SYNTAX: Global functions or kernels
__global__ void add (int*, int*, int*);
mykernel<<<Nbk,Nth>>>();
__global__ - Global kernel function :: it is a keyword
Nbk - Number of blocks in which to execute :: it is an integer
Nth - Number of threads per block in which to execute :: it is an integer
In CUDA C/C++ the keyword __global__ marks a function that runs on the device and is called from host code. Such a function is called a kernel, and it runs as multiple instances spread over several blocks and threads. The number of blocks and of threads per block is set at the call site with the <<< , >>> symbols.

SYNTAX: Memory allocation
typedef enum cudaError cudaError_t;
cudaError_t cudaMalloc (void** devPtr, size_t size);
cudaError_t cudaFree (void* devPtr);
devPtr - Pointer to allocated device memory :: it is an address
size - Requested allocation size in bytes :: it is an integer
[Figure: typed data in host memory (int, char, float) is transferred with cudaMemcpyHostToDevice to device memory, which is addressed through untyped void pointers]

SYNTAX: Memory copy
cudaError_t cudaMemcpy( void* dst, const void* src, size_t count, enum cudaMemcpyKind kind );
dst - Destination memory address :: it is an address
src - Source memory address :: it is an address
count - Size in bytes to copy :: it is an integer
kind - Type of transfer :: it is a specific token:
- cudaMemcpyHostToHost
- cudaMemcpyHostToDevice
- cudaMemcpyDeviceToHost
- cudaMemcpyDeviceToDevice

SYNTAX: CUDA errors
cudaError_t cudaGetLastError(void);
const char* cudaGetErrorString(cudaError_t);

cudaError_t myerror = cudaGetLastError();
printf( "CUDA ERROR STRING: %s\n", cudaGetErrorString(myerror) );

[Figure: an error raised on the DEVICE (__global__ and __device__ code) travels as a hardware signal through the hardware driver and the CUDA library to the HOST (__host__ code), which prints it with printf]
TERMINAL
$ ./ErrExample.x
CudaSuccess
$ _
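In practice these two calls are often wrapped in a small checking macro, so that every runtime call and every kernel launch is tested. The sketch below is one possible arrangement and is not taken from the slides: the macro name CUDA_CHECK and the kernel name touch are hypothetical, while cudaGetLastError, cudaGetErrorString and cudaDeviceSynchronize are standard runtime calls.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not part of the CUDA API): stop with a
// readable message when a runtime call does not return cudaSuccess.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA ERROR STRING: %s (%s:%d)\n",        \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Trivial kernel used only to have something to launch.
__global__ void touch(int *p)
{
    p[threadIdx.x] = threadIdx.x;
}

int main(void)
{
    int *dev = NULL;
    CUDA_CHECK(cudaMalloc((void **)&dev, 32 * sizeof(int)));

    touch<<<1, 32>>>(dev);
    // A kernel launch returns no value, so launch errors are fetched explicitly.
    CUDA_CHECK(cudaGetLastError());
    // Errors raised while the kernel runs surface at the next synchronization.
    CUDA_CHECK(cudaDeviceSynchronize());

    CUDA_CHECK(cudaFree(dev));
    printf("all CUDA calls succeeded\n");
    return 0;
}
```

Checking both right after the launch and after a synchronization separates configuration errors (for example an invalid block size) from faults that occur during execution.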


Add example (reloaded)
CODE :: add.cu
#include <stdio.h>
#include <stdlib.h>
#define N 10

__global__ void add( int *a, int *b, int *c ) {
    int tid = threadIdx.x;
    c[tid] = a[tid] + b[tid];
}

int main ( void ) {
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;
    cudaMalloc( (void**)&dev_a, N * sizeof(int) );
    cudaMalloc( (void**)&dev_b, N * sizeof(int) );
    cudaMalloc( (void**)&dev_c, N * sizeof(int) );
    for (int i=0; i<N; i++) {
        a[i] = -i;
        b[i] = i * i;
    }
    [...]

Add example (reloaded)
CODE :: add.cu [split view]
[...]
__global__ void add( int *a, int *b, int *c ) {
    int tid = threadIdx.x;
    c[tid] = a[tid] + b[tid];
}
[...]
    int *dev_a, *dev_b, *dev_c;
    cudaMalloc( (void**)&dev_a, N * sizeof(int) );
    cudaMalloc( (void**)&dev_b, N * sizeof(int) );
    cudaMalloc( (void**)&dev_c, N * sizeof(int) );
[...]
    cudaMemcpy( dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice );
    add<<<1,N>>>( dev_a, dev_b, dev_c );
    cudaMemcpy( c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost );
    for (int i=0; i<N; i++) {
        printf( "%d + %d = %d\n", a[i], b[i], c[i] );
    }
    cudaFree( dev_a );
    cudaFree( dev_b );
    cudaFree( dev_c );
    return 0;
}

Add example (blocks+threads)
CODE :: add.cu [split view]
[...]
__global__ void add( int *a, int *b, int *c ) {
    int tid = (blockIdx.x*N)+threadIdx.x;
    c[tid] = a[tid] + b[tid];
}
[...]
    int *dev_a, *dev_b, *dev_c;
    cudaMalloc( (void**)&dev_a, N * sizeof(int) );
    cudaMalloc( (void**)&dev_b, N * sizeof(int) );
    cudaMalloc( (void**)&dev_c, N * sizeof(int) );
[...]
    cudaMemcpy( dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice );
    add<<<M,N>>>( dev_a, dev_b, dev_c );
    cudaMemcpy( c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost );
    for (int i=0; i<N; i++) {
        printf( "%d + %d = %d\n", a[i], b[i], c[i] );
    }
    cudaFree( dev_a );
    cudaFree( dev_b );
    cudaFree( dev_c );
    return 0;
}
Note: launched as add<<<M,N>>>, the kernel indexes M*N elements, so the arrays, allocations and copies must be sized for M*N integers rather than for N as in the single-block version.

Blocks, threads and built-in structures
[Figure: four blocks, blockIdx.x = 0, 1, 2, 3, each containing threads numbered by threadIdx.x]
With N threads per block, the global index of a thread is
    int tid = (blockIdx.x*N)+threadIdx.x;
For example, with 8 threads per block, thread 6 of block 2 gets (2*8) + 6 = 22.
The built-in variable blockDim.x holds the number of threads per block, so the same index can be written without the constant N:
    int tid = (blockIdx.x*blockDim.x)+threadIdx.x;
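The index arithmetic above can be tried out with a small program. This is a minimal sketch, not the author's code: the helper global_index, the extra parameter n and the bounds guard are additions made for the illustration, and the sizes (30 elements, 4 blocks of 8 threads) are assumed values chosen to match the worked example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not part of the CUDA API): the global index formula,
// callable from both host and device so it can be checked on the CPU.
__host__ __device__ inline int global_index(int block_idx, int block_dim, int thread_idx)
{
    return block_idx * block_dim + thread_idx;
}

__global__ void add(const int *a, const int *b, int *c, int n)
{
    int tid = global_index(blockIdx.x, blockDim.x, threadIdx.x);
    if (tid < n)            // guard: the last block has two spare threads here
        c[tid] = a[tid] + b[tid];
}

int main(void)
{
    const int n = 30;       // assumed size, deliberately not a multiple of 8
    const int threads = 8;  // threads per block (blockDim.x)
    const int blocks = 4;   // 4 blocks x 8 threads = 32 threads >= n

    int a[n], b[n], c[n];
    for (int i = 0; i < n; i++) { a[i] = -i; b[i] = i * i; }

    int *dev_a, *dev_b, *dev_c;
    cudaMalloc((void **)&dev_a, n * sizeof(int));
    cudaMalloc((void **)&dev_b, n * sizeof(int));
    cudaMalloc((void **)&dev_c, n * sizeof(int));
    cudaMemcpy(dev_a, a, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, n * sizeof(int), cudaMemcpyHostToDevice);

    add<<<blocks, threads>>>(dev_a, dev_b, dev_c, n);
    cudaMemcpy(c, dev_c, n * sizeof(int), cudaMemcpyDeviceToHost);

    // Thread 6 of block 2 handles element 22; the sum there should be -22 + 484.
    printf("global_index(2, 8, 6) = %d\n", global_index(2, 8, 6));
    printf("%d + %d = %d\n", a[22], b[22], c[22]);

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return 0;
}
```

The guard on tid matters whenever the number of launched threads exceeds the number of elements; without it the two spare threads would read and write past the end of the arrays.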

Add example (built-in)
CODE :: add.cu [split view]
[...]
__global__ void add( int *a, int *b, int *c ) {
    int tid = (blockIdx.x*blockDim.x)+threadIdx.x;
    c[tid] = a[tid] + b[tid];
}
[...]
    int *dev_a, *dev_b, *dev_c;
    cudaMalloc( (void**)&dev_a, N * sizeof(int) );
    cudaMalloc( (void**)&dev_b, N * sizeof(int) );
    cudaMalloc( (void**)&dev_c, N * sizeof(int) );
[...]
    cudaMemcpy( dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice );
    add<<<M,N>>>( dev_a, dev_b, dev_c );
    cudaMemcpy( c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost );
    for (int i=0; i<N; i++) {
        printf( "%d + %d = %d\n", a[i], b[i], c[i] );
    }
    cudaFree( dev_a );
    cudaFree( dev_b );
    cudaFree( dev_c );
    return 0;
}
With blockDim.x in place of the constant N, the kernel no longer depends on the number of threads per block chosen at the launch.

NVIDIA CUDA Libraries
CUDA Toolkit includes several libraries:
- CUFFT: Fourier transforms
- CUBLAS: Dense Linear Algebra
- CUSPARSE: Sparse Linear Algebra
- LIBM: Standard C Math library
- CURAND: Pseudo-random and Quasi-random numbers
- NPP: Image and Signal Processing
- Thrust: Template Library
[Figure: software stack. Applications sit on top of 3rd Party Libraries and NVIDIA Libraries (CUFFT, CUBLAS, CUSPARSE, Libm (math.h), CURAND, NPP, Thrust, CUSP), which are built on CUDA C/Fortran]
Several open source and commercial (*) libraries:
- MAGMA: Linear Algebra
- OpenVidia: Computer Vision
- CULA Tools (*): Linear Algebra
- OpenCurrent: CFD
- CUSP: Sparse Linear Solvers
- NAG (*): Computational Finance

CUDA meets Fast Fourier Transform: cuFFT
- Algorithms based on Cooley-Tukey (n = 2^a · 3^b · 5^c · 7^d) and Bluestein
- Simple interface similar to FFTW
- 1D, 2D and 3D transforms of complex and real data
- Row-major order (C-order) for 2D and 3D data
- Single precision (SP) and Double precision (DP) transforms
- In-place and out-of-place transforms
- 1D transform sizes up to 128 million elements
- Batch execution for doing multiple transforms
- Streamed asynchronous execution
- Non-normalized output: IFFT(FFT(A)) = len(A)*A
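The last point is easy to overlook, so here is a small round-trip check. It is a sketch under assumptions rather than material from the slides: the size NX = 8 and the input values are arbitrary, and the program has to be linked against cuFFT (for example with -lcufft).

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

#define NX 8   // small 1D size, chosen only for the illustration

int main(void)
{
    cufftComplex host_in[NX], host_out[NX];
    for (int i = 0; i < NX; i++) {
        host_in[i].x = (float)(i + 1);   // real part
        host_in[i].y = 0.0f;             // imaginary part
    }

    cufftComplex *data = NULL;
    cudaMalloc((void **)&data, sizeof(cufftComplex) * NX);
    cudaMemcpy(data, host_in, sizeof(cufftComplex) * NX, cudaMemcpyHostToDevice);

    // One 1D complex-to-complex transform of length NX (batch = 1).
    cufftHandle plan;
    if (cufftPlan1d(&plan, NX, CUFFT_C2C, 1) != CUFFT_SUCCESS) {
        fprintf(stderr, "cufftPlan1d failed\n");
        cudaFree(data);
        return 1;
    }

    // In-place forward transform followed by the inverse transform.
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);
    cufftExecC2C(plan, data, data, CUFFT_INVERSE);

    cudaMemcpy(host_out, data, sizeof(cufftComplex) * NX, cudaMemcpyDeviceToHost);

    // The round trip returns NX times the input, so dividing by NX recovers it.
    for (int i = 0; i < NX; i++)
        printf("in = %f  roundtrip = %f  roundtrip/NX = %f\n",
               host_in[i].x, host_out[i].x, host_out[i].x / NX);

    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}
```

Any code that needs the original amplitudes back must apply the 1/len(A) factor itself, typically in the kernel that post-processes the inverse transform.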

cuFFT example
CODE :: simpleFFT.cu
#define NX 256
#define NY 128
[...]
cufftHandle plan;
cufftComplex *idata, *odata;
cudaMalloc((void**)&idata, sizeof(cufftComplex)*NX*NY);
cudaMalloc((void**)&odata, sizeof(cufftComplex)*NX*NY);
[...]
/* Create a 2D FFT plan. */
cufftPlan2d(&plan, NX, NY, CUFFT_C2C);
/* Use the CUFFT plan to transform the signal */
cufftExecC2C(plan, idata, odata, CUFFT_FORWARD);
/* Inverse transform */
cufftExecC2C(plan, odata, odata, CUFFT_INVERSE);
[...]
/* Destroy the CUFFT plan. */
cufftDestroy(plan);
cudaFree(idata);
cudaFree(odata);
[...]
Step 1: Allocate space in GPU memory.
Step 2: Create a plan specifying the transform configuration, such as the size and type (real, complex, 1D, 2D and so on).
Step 3: Execute the plan as many times as required, providing the pointer to the GPU data created in Step 1.
Step 4: Destroy the plan and free the GPU memory.


POISSON EQUATION SOLVER
$\nabla^2 \phi = r$
After a Fourier transform (FFT) the equation becomes algebraic:
$-(k_x^2 + k_y^2)\,\hat{\phi} = \hat{r}$
so that
$\hat{\phi} = -\dfrac{\hat{r}}{k_x^2 + k_y^2}$
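For readers who want the intermediate step, the spectral form follows from expanding both sides in Fourier modes. The derivation below assumes periodic boundary conditions on the square domain; the slides do not state this explicitly, but an FFT-based solver implies it.

```latex
\begin{align}
\phi(x,y) &= \sum_{k_x,k_y} \hat{\phi}(k_x,k_y)\, e^{i(k_x x + k_y y)},
\qquad
r(x,y) = \sum_{k_x,k_y} \hat{r}(k_x,k_y)\, e^{i(k_x x + k_y y)}, \\
\nabla^2 \phi &= \sum_{k_x,k_y} \left[(i k_x)^2 + (i k_y)^2\right] \hat{\phi}\, e^{i(k_x x + k_y y)}
 = \sum_{k_x,k_y} -(k_x^2 + k_y^2)\, \hat{\phi}\, e^{i(k_x x + k_y y)}, \\
\nabla^2 \phi = r \;&\Longrightarrow\; -(k_x^2 + k_y^2)\,\hat{\phi} = \hat{r}
\;\Longrightarrow\; \hat{\phi} = -\frac{\hat{r}}{k_x^2 + k_y^2},
\qquad (k_x,k_y) \neq (0,0).
\end{align}
```

The mode (k_x, k_y) = (0,0) is excluded because the solution is only defined up to an additive constant, which is why the listings replace the zero denominator with 1 for that mode.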

POISSON EQUATION SOLVER - matlab
CODE :: pois_fft.m [reduced]
% No. of Fourier modes
N = 64;
% Domain size (assumed square)
L = 1;
% Characteristic width of f (make << 1)
sig = 0.1;
% Vector of wavenumbers
k = (2*pi/L)*[0:(N/2-1) (-N/2):(-1)];
% Matrix of (x,y) wavenumbers corresponding
% to Fourier mode (m,n)
[KX KY] = meshgrid(k,k);
% Laplacian matrix acting on the wavenumbers
delsq = -(KX.^2 + KY.^2);
% Kludge to avoid division by zero for
% wavenumber (0,0).
delsq(1,1) = 1;
% Grid spacing
h = L/N;
x = (0:(N-1))*h;
y = (0:(N-1))*h;
[X Y] = meshgrid(x,y);
% Construct RHS f(x,y) at the Fourier gridpts
rsq = (X-0.5*L).^2 + (Y-0.5*L).^2;
sigsq = sig^2;
f = exp(-rsq/(2*sigsq)).*...
    (rsq - 2*sigsq)/(sigsq^2);
% Spectral inversion of Laplacian
fhat = fft2(f);
u = real(ifft2(fhat./delsq));
% Specify arbitrary constant by forcing corner
% u = 0.
u = u - u(1,1);
% Compute L2 and Linf norm of error
uex = exp(-rsq/(2*sigsq));
errmax = norm(u(:)-uex(:),inf);
errmax2 = norm(u(:)-uex(:),2)/(N*N);
% Print L2 and Linf norm of error
fprintf('N=%d\n',N);
fprintf('Solution at (%d,%d): ',N/2,N/2);
fprintf('computed=%10.6f ... reference = %10.6f\n',u(N/2,N/2),uex(N/2,N/2));
fprintf('Linf err=%10.6e L2 norm err = %10.6e\n',errmax,errmax2);

POISSON EQUATION SOLVER in CUDA
The following steps need to be performed:
1. Allocate memory on host: r (NxN), u (NxN), kx (N) and ky (N)
2. Allocate memory on device: r_d, u_d, kx_d, ky_d
3. Transfer r, kx and ky content from host memory to the device memory
4. Initialize plan for FFT
5. Compute execution configuration
6. Transform real input to complex input
7. 2D forward FFT
8. Solve Poisson equation in Fourier space
9. 2D inverse FFT
10. Transform complex output to real output and apply scaling
11. Transfer results from the GPU back to the host
To keep the code simple, we do not take advantage of the symmetries of real data: a C2C transform is used even though the input is real.
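The host arrays kx and ky of step 1 hold the same FFT-ordered wavenumbers as the vector k in the MATLAB listing. The CUDA listing elides their initialization, so the following host-side sketch is an assumption about how it could be done: the function name fill_wavenumbers is hypothetical and N = 8 is chosen only to keep the printout short.

```cuda
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: fill k[0..n-1] with (2*pi/L) * [0, 1, ..., n/2-1, -n/2, ..., -1],
// the ordering used by the MATLAB vector of wavenumbers. n is assumed to be even.
void fill_wavenumbers(float *k, int n, float L)
{
    const float two_pi = 6.283185307179586f;
    for (int i = 0; i < n; i++) {
        int mode = (i < n / 2) ? i : i - n;   // non-negative modes first, then negative ones
        k[i] = (two_pi / L) * (float)mode;
    }
}

int main(void)
{
    const int N = 8;        // assumed small size for the illustration
    const float L = 1.0f;   // domain size, as in the MATLAB listing

    float *kx = (float *)malloc(sizeof(float) * N);
    float *ky = (float *)malloc(sizeof(float) * N);
    if (kx == NULL || ky == NULL)
        return 1;

    fill_wavenumbers(kx, N, L);
    fill_wavenumbers(ky, N, L);

    for (int i = 0; i < N; i++)
        printf("kx[%d] = %9.4f  ky[%d] = %9.4f\n", i, kx[i], i, ky[i]);

    free(kx);
    free(ky);
    return 0;
}
```

Since the domain is square, kx and ky contain the same values; they are kept as separate arrays only to mirror the argument list of the solver kernel.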

POISSON EQUATION SOLVER - steps 1:5
CODE :: pois_fft.cu
/* Allocate arrays on the host */
float *kx, *ky, *r;
kx = (float *) malloc(sizeof(float)*N);
ky = (float *) malloc(sizeof(float)*N);
r = (float *) malloc(sizeof(float)*N*N);

/* Allocate arrays on the GPU */
float *kx_d, *ky_d, *r_d;
cudaMalloc( (void **) &kx_d, sizeof(cufftComplex)*N);
cudaMalloc( (void **) &ky_d, sizeof(cufftComplex)*N);
cudaMalloc( (void **) &r_d, sizeof(cufftComplex)*N*N);
cufftComplex *r_complex_d;
cudaMalloc( (void **) &r_complex_d, sizeof(cufftComplex)*N*N);

/* Initialize r, kx and ky on the host */
...

/* Transfer data from host to device */
cudaMemcpy (kx_d, kx, sizeof(float)*N, cudaMemcpyHostToDevice);
cudaMemcpy (ky_d, ky, sizeof(float)*N, cudaMemcpyHostToDevice);
cudaMemcpy (r_d, r, sizeof(float)*N*N, cudaMemcpyHostToDevice);

/* Create plan for CUDA FFT */
cufftHandle plan;
cufftPlan2d( &plan, N, N, CUFFT_C2C);

/* Compute the execution configuration */
dim3 dimBlock(BLOCK_SIZE_X, BLOCK_SIZE_Y);
dim3 dimGrid (N/dimBlock.x, N/dimBlock.y);
if (N % BLOCK_SIZE_X != 0 ) dimGrid.x += 1;
if (N % BLOCK_SIZE_Y != 0 ) dimGrid.y += 1;
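The execution configuration computed at the end of the listing rounds the grid up so that every point of the N x N domain is covered, even when N is not a multiple of the block size. The sketch below illustrates the same idea with a ceiling division; the helper blocks_for, the kernel mark, the 32 x 16 block shape and N = 70 are all assumptions made for the example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define BLOCK_SIZE_X 32   // assumed block shape
#define BLOCK_SIZE_Y 16

// Hypothetical helper: number of blocks of size `block` needed to cover n
// elements (integer ceiling division). Equivalent to n/block, plus one if
// there is a remainder.
__host__ __device__ inline unsigned int blocks_for(unsigned int n, unsigned int block)
{
    return (n + block - 1) / block;
}

// Each thread marks the cell it owns; threads outside the N x N domain do nothing.
__global__ void mark(int *cells, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    if (idx < N && idy < N)
        cells[idx + idy * N] = 1;
}

int main(void)
{
    const int N = 70;   // deliberately not a multiple of 32 or 16
    dim3 dimBlock(BLOCK_SIZE_X, BLOCK_SIZE_Y);
    dim3 dimGrid(blocks_for(N, dimBlock.x), blocks_for(N, dimBlock.y));

    int *cells_d = NULL;
    cudaMalloc((void **)&cells_d, sizeof(int) * N * N);
    cudaMemset(cells_d, 0, sizeof(int) * N * N);

    mark<<<dimGrid, dimBlock>>>(cells_d, N);

    int *cells = (int *)malloc(sizeof(int) * N * N);
    cudaMemcpy(cells, cells_d, sizeof(int) * N * N, cudaMemcpyDeviceToHost);

    // Count how many cells were reached by some thread.
    int covered = 0;
    for (int i = 0; i < N * N; i++)
        covered += cells[i];
    printf("grid %u x %u, covered %d of %d cells\n",
           dimGrid.x, dimGrid.y, covered, N * N);

    free(cells);
    cudaFree(cells_d);
    return 0;
}
```

Rounding up launches more threads than there are cells, which is why every kernel in the solver tests idx < N && idy < N before touching memory.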

POISSON EQUATION SOLVER - steps 6:11
CODE :: pois_fft.cu
/* Transform real input to complex input */
realcomplex<<<dimGrid, dimBlock>>> (r_d, r_complex_d, N);

/* Compute forward FFT */
cufftExecC2C (plan, r_complex_d, r_complex_d, CUFFT_FORWARD);

/* Solve Poisson equation in Fourier space */
solve_poisson<<<dimGrid, dimBlock>>> (r_complex_d, kx_d, ky_d, N);

/* Compute inverse FFT */
cufftExecC2C (plan, r_complex_d, r_complex_d, CUFFT_INVERSE);

/* Copy the solution back to a real array and apply scaling */
/* (arguments follow the kernel signature: complex input first, real output second) */
float scale = 1.f / ( (float) N * (float) N );
complexreal_scaled<<<dimGrid, dimBlock>>> (r_complex_d, r_d, N, scale);

/* Transfer data from device to host */
cudaMemcpy (r, r_d, sizeof(float)*N*N, cudaMemcpyDeviceToHost);

/* Destroy plan and clean up memory on device */
cufftDestroy( plan);
cudaFree(r_complex_d);
cudaFree(kx_d);

POISSON EQUATION SOLVER - functions
CODE :: pois_fft.cu
__global__ void realcomplex (float *a, cufftComplex *c, int N)
{
    int idx = blockIdx.x*blockDim.x+threadIdx.x;
    int idy = blockIdx.y*blockDim.y+threadIdx.y;
    if ( idx < N && idy < N)
    {
        int index = idx + idy*N;
        c[index].x = a[index];
        c[index].y = 0.f;
    }
}

__global__ void solve_poisson (cufftComplex *c, float *kx, float *ky, int N)
{
    int idx = blockIdx.x*blockDim.x+threadIdx.x;
    int idy = blockIdx.y*blockDim.y+threadIdx.y;
    if ( idx < N && idy < N)
    {
        int index = idx + idy*N;
        float scale = - ( kx[idx]*kx[idx] + ky[idy]*ky[idy] );
        /* avoid division by zero for wavenumber (0,0) */
        if ( idx == 0 && idy == 0 ) scale = 1.f;
        scale = 1.f / scale;
        c[index].x *= scale;
        c[index].y *= scale;
    }
}

POISSON EQUATION SOLVER - execution
CODE :: pois_fft.cu
__global__ void complexreal_scaled (cufftComplex *c, float *a, int N, float scale)
{
    int idx = blockIdx.x*blockDim.x+threadIdx.x;
    int idy = blockIdx.y*blockDim.y+threadIdx.y;
    if ( idx < N && idy < N)
    {
        int index = idx + idy*N;
        a[index] = scale*c[index].x;
    }
}

TERMINAL
$ nvcc -O3 -o poisson poisson.cu -I/usr/local/cuda/include -L/usr/local/cuda/lib -lcufft -L/usr/local/NVDIA_CUDA_SDK/common/inc -L/usr/local/NVDIA_CUDA_SDK/lib -lcutil
$ ./poisson -N64
Poisson solver on a domain 64 x 64
dimBlock 32 16 (512 threads)
dimGrid 2 4
L2 error [...]e-08
Time: [...]
Time I/O ([...]): [...]
Solution at (32,32) computed=[...] reference=[...]
$ _

From MATLAB
N=64
Solution at (32,32): computed=[...] reference=[...]
Linf err=[...]e-05 L2 norm err = [...]e-08

Unified Distributed Architectures: the last frontier
Multi-GPU code
- Domain decomposition
- Need to add communication of halos
GPUs within a single network node
- Exchange halos via PCIe communication
- With CUDA 4.0 can use Peer-to-Peer communication
GPUs in different network nodes
- Require network communication
- Currently require transferring GPU data to/from host CPU
- Efforts underway to make MPI aware of GPUs (OpenMPI, MVAPICH)

Unified Distributed Architectures: the last frontier
The goal is to hide communication cost.
So, every time-step each GPU will:
- Compute halos to be sent to neighbors
- Compute the internal region
- Exchange halos with neighbors
Linear scaling as long as internal computation takes longer than halo exchange.
- Well, separate halo computation adds some overhead
Example: Two Subdomains
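On a single GPU the overlap described above can be sketched with two CUDA streams: the halo cells are computed and copied out in one stream while the internal region is computed in another. The example is a simplified illustration under several assumptions: the subdomain is 1D, the update rule (multiplying by 0.5) is a placeholder rather than a real stencil, the kernel names are hypothetical, and the "send" is only a copy to pinned host memory, with the actual exchange with a neighbor left out.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 1024        // cells in this subdomain (assumed)
#define THREADS 128   // threads per block for the interior kernel (assumed)

// Placeholder update of the two boundary cells, whose values the neighbors need.
// Launched with exactly 2 threads.
__global__ void update_halo(const float *in, float *out, int n)
{
    int i = (threadIdx.x == 0) ? 0 : n - 1;
    out[i] = 0.5f * in[i];
}

// Placeholder update of the internal region, cells 1 .. n-2.
__global__ void update_interior(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;
    if (i < n - 1)
        out[i] = 0.5f * in[i];
}

int main(void)
{
    float init[N];
    for (int i = 0; i < N; i++) init[i] = 1.0f;

    float *in_d, *out_d, *halo_h;
    cudaMalloc((void **)&in_d, N * sizeof(float));
    cudaMalloc((void **)&out_d, N * sizeof(float));
    cudaMallocHost((void **)&halo_h, 2 * sizeof(float));   // pinned, needed for async copies
    cudaMemcpy(in_d, init, N * sizeof(float), cudaMemcpyHostToDevice);

    cudaStream_t s_halo, s_inner;
    cudaStreamCreate(&s_halo);
    cudaStreamCreate(&s_inner);

    // Compute the halos first, in their own stream ...
    update_halo<<<1, 2, 0, s_halo>>>(in_d, out_d, N);
    // ... then start sending them (here: copying them to the host) ...
    cudaMemcpyAsync(&halo_h[0], &out_d[0], sizeof(float), cudaMemcpyDeviceToHost, s_halo);
    cudaMemcpyAsync(&halo_h[1], &out_d[N - 1], sizeof(float), cudaMemcpyDeviceToHost, s_halo);
    // ... while the internal region is computed in a second stream.
    update_interior<<<(N - 2 + THREADS - 1) / THREADS, THREADS, 0, s_inner>>>(in_d, out_d, N);

    cudaStreamSynchronize(s_halo);    // halos are on the host, ready to be exchanged
    cudaStreamSynchronize(s_inner);   // internal region is done
    printf("halo values: %f %f\n", halo_h[0], halo_h[1]);

    cudaStreamDestroy(s_halo);
    cudaStreamDestroy(s_inner);
    cudaFreeHost(halo_h);
    cudaFree(in_d);
    cudaFree(out_d);
    return 0;
}
```

The two kernels write disjoint cells, so the streams never race. Whether the copy and the interior kernel really run at the same time depends on the hardware and on the interior work being large enough to cover the transfer.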

Unified Distributed Architectures: the last frontier
[Figure: timeline for two subdomains. Phase 1: compute | compute. Phase 2: compute, send | send, compute. GPU-0: green subdomain. GPU-1: grey subdomain. Source: NVIDIA]

Unified Distributed Architectures: the last frontier
MVAPICH design
[Figure: MVAPICH design. From: "MVAPICH-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters", H. Wang, S. Potluri, M. Luo, A.K. Singh, S. Sur, D.K. Panda]

Unified Distributed Architectures: the last frontier
MVAPICH performances
[Figure: two performance plots, "Ping-pong" and "One side". Source: NVIDIA]

Unified Distributed Architectures: the last frontier
LET'S SEE A SIMULATION

QUESTION TIME

Thank You
You will find the PDF edition in the didactic section of the author's website. If you want to share this presentation, be sure to read and follow the CC-BY-NC-ND-4.0 license.
