PART IV. CUDA programming: code, libraries and tools. Dr. Christian Napoli, M.Sc., Dpt. Mathematics and Informatics, University of Catania.
Workshop on High Performance Computing and GPGPU Computing
Postgraduate course on Electronics and Informatics Engineering (M.Sc.), Training Course on Circuits Theory (prof. G. Capizzi)
Postgraduate course on Computer Sciences (M.Sc.), Training Course on Distributed Systems (prof. G. Pappalardo)
Christian Napoli. Licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

In the last episode

    #include <iostream>
    #include <algorithm>
    #include <cstdlib>
    using namespace std;

    #define N 1024
    #define RADIUS 3
    #define BLOCK_SIZE 16

    // Parallel function
    __global__ void stencil_1d(int *in, int *out) {
        __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
        int gindex = threadIdx.x + blockIdx.x * blockDim.x;
        int lindex = threadIdx.x + RADIUS;

        // Read input elements into shared memory
        temp[lindex] = in[gindex];
        if (threadIdx.x < RADIUS) {
            temp[lindex - RADIUS] = in[gindex - RADIUS];
            temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
        }

        // Synchronize (ensure all the data is available)
        __syncthreads();

        // Apply the stencil
        int result = 0;
        for (int offset = -RADIUS; offset <= RADIUS; offset++)
            result += temp[lindex + offset];

        // Store the result
        out[gindex] = result;
    }

    void fill_ints(int *x, int n) { fill_n(x, n, 1); }

    int main(void) {
        // Serial code
        int *in, *out;       // host copies of in, out
        int *d_in, *d_out;   // device copies of in, out
        int size = (N + 2 * RADIUS) * sizeof(int);

        // Alloc space for host copies and setup values
        in  = (int *)malloc(size); fill_ints(in,  N + 2 * RADIUS);
        out = (int *)malloc(size); fill_ints(out, N + 2 * RADIUS);

        // Alloc space for device copies
        cudaMalloc((void **)&d_in,  size);
        cudaMalloc((void **)&d_out, size);

        // Copy to device
        cudaMemcpy(d_in,  in,  size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);

        // Parallel code: launch stencil_1d() kernel on GPU
        stencil_1d<<<N / BLOCK_SIZE, BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);

        // Serial code: copy result back to host
        cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);

        // Cleanup
        free(in); free(out);
        cudaFree(d_in); cudaFree(d_out);
        return 0;
    }

Processing flow (CPU and GPU memory are connected through the PCI bus)
1. Copy input data from CPU memory to GPU memory.
2. Load the GPU program and execute it, caching data on chip.
3. Copy results from GPU memory to CPU memory.

The first instruments for the job

CODE :: add.cu

    [...]
    __global__ void add( int *a, int *b, int *c );

    int main ( void ) {
        int a[N], b[N], c[N];
        int *dev_a, *dev_b, *dev_c;
        cudaMalloc( (void**)&dev_a, N * sizeof(int) );
        [...]
        cudaMemcpy( dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice );
        [...]
        add<<<1,N>>>( dev_a, dev_b, dev_c );
        [...]
        cudaFree( dev_a );
        [...]
    }

Elements highlighted on the slide: KEYWORDS, LIBRARIES, TOKENS, CONSTRUCTS.

S Y N T A X: Global functions or kernels

    __global__ void add (int*, int*, int*);
    mykernel<<<Nbk,Nth>>>();

__global__ - Global kernel function :: it is a keyword
Nbk - Number of blocks on which to execute :: it is an integer
Nth - Number of threads per block on which to execute :: it is an integer

In CUDA C/C++ the keyword __global__ indicates a function that runs on the device and is called from host code. Such a function is called a kernel, and it runs in multiple instances on several blocks and threads. The number of those blocks and threads is set with the <<<,>>> symbols.

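The launch syntax above can be sketched as follows. This is a minimal illustration, not code from the slides: the kernel `fill_index` and the host helper `launch_fill` are hypothetical names, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: every thread writes its own global index.
__global__ void fill_index(int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i;
}

// Hypothetical host helper: launches nbk blocks of nth threads each and
// returns the nbk * nth values written by the kernel.
// Assumes nbk > 0 and 0 < nth <= the device's maximum threads per block.
std::vector<int> launch_fill(int nbk, int nth)
{
    const int n = nbk * nth;
    std::vector<int> host(n, -1);
    int *dev = nullptr;

    cudaMalloc((void **)&dev, n * sizeof(int));
    fill_index<<<nbk, nth>>>(dev);   // <<<number of blocks, threads per block>>>
    // cudaMemcpy runs after the kernel has finished (same default stream).
    cudaMemcpy(host.data(), dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

Calling `launch_fill(4, 8)` starts 32 kernel instances in total, one per element of the returned vector. The file can be compiled with `nvcc`.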
S Y N T A X: Memory allocation

    typedef enum cudaError cudaError_t;
    cudaError_t cudaMalloc (void** devPtr, size_t size);
    cudaError_t cudaFree (void* devPtr);

devPtr - Pointer to allocated device memory :: it is an address
size - Requested allocation size in bytes :: it is an integer

[Figure: HOST memory holds typed buffers (int, char, float); DEVICE memory is addressed through void pointers; data moves from host to device with cudaMemcpyHostToDevice.]

S Y N T A X: Memory copy

    cudaError_t cudaMemcpy( void* dst, const void* src, size_t count, enum cudaMemcpyKind kind );

dst - Destination memory address :: it is an address
src - Source memory address :: it is an address
count - Size in bytes to copy :: it is an integer
kind - Type of transfer :: it is a specific token:
- cudaMemcpyHostToHost
- cudaMemcpyHostToDevice
- cudaMemcpyDeviceToHost
- cudaMemcpyDeviceToDevice

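To make the allocation and copy calls concrete, here is a small sketch that sends an array to the device, copies it between two device buffers and brings it back. The helper name `round_trip` is hypothetical, and return codes are ignored to keep the example short.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: host -> device -> device -> host round trip.
std::vector<float> round_trip(const std::vector<float> &src)
{
    const size_t bytes = src.size() * sizeof(float);   // count is in bytes
    std::vector<float> dst(src.size(), 0.0f);
    float *dev_a = nullptr, *dev_b = nullptr;

    // cudaMalloc receives the address of the device pointer.
    cudaMalloc((void **)&dev_a, bytes);
    cudaMalloc((void **)&dev_b, bytes);

    // The last argument (kind) selects the direction of each transfer.
    cudaMemcpy(dev_a, src.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, dev_a,      bytes, cudaMemcpyDeviceToDevice);
    cudaMemcpy(dst.data(), dev_b, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dev_a);
    cudaFree(dev_b);
    return dst;
}
```

The returned vector should equal the input, since no kernel modifies the data in between.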
S Y N T A X: CUDA errors

    cudaError_t cudaGetLastError(void);
    const char* cudaGetErrorString(cudaError_t);

    cudaError_t myerror = cudaGetLastError();
    printf("CUDA ERROR STRING: %s\n", cudaGetErrorString(myerror));

[Figure: an error raised on the DEVICE (__global__ and __device__ code) travels as a hardware signal through the hardware driver to the HOST (__host__ code), where the CUDA library exposes it and printf reports it.]

TERMINAL

    $ ./ErrExample.x
    CudaSuccess
    $ _

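A possible way to use these two calls around a kernel launch is sketched below. The names `cuda_ok`, `noop` and `launch_status` are hypothetical; the only assumption about the hardware is that a block of 2^20 threads exceeds the per-block limit, which holds for current CUDA devices.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper: prints the error string and returns false on failure.
bool cuda_ok(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Empty kernel, used only to exercise the launch.
__global__ void noop() {}

// Launches noop with `threads` threads per block and returns the launch
// status. A kernel launch returns no value, so the status is read
// afterwards with cudaGetLastError().
cudaError_t launch_status(int threads)
{
    noop<<<1, threads>>>();
    cudaError_t err = cudaGetLastError();   // also resets the stored error
    cudaDeviceSynchronize();                // wait for the kernel, if it started
    return err;
}
```

For example, `cuda_ok(launch_status(1 << 20), "noop launch")` is expected to print a message and return false, because the requested block size is not a valid configuration.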
Add example (reloaded)

CODE :: add.cu

    #include <stdio.h>
    #include <stdlib.h>
    #define N 10

    __global__ void add( int *a, int *b, int *c ) {
        int tid = threadIdx.x;
        c[tid] = a[tid] + b[tid];
    }

    int main ( void ) {
        int a[N], b[N], c[N];
        int *dev_a, *dev_b, *dev_c;
        cudaMalloc( (void**)&dev_a, N * sizeof(int) );
        cudaMalloc( (void**)&dev_b, N * sizeof(int) );
        cudaMalloc( (void**)&dev_c, N * sizeof(int) );
        for (int i=0; i<N; i++) {
            a[i] = -i;
            b[i] = i * i;
        }
        [...]

Add example (reloaded)

CODE :: add.cu [split view]

    [...]
    __global__ void add( int *a, int *b, int *c ) {
        int tid = threadIdx.x;
        c[tid] = a[tid] + b[tid];
    }
    [...]
        int *dev_a, *dev_b, *dev_c;
        cudaMalloc( (void**)&dev_a, N * sizeof(int) );
        cudaMalloc( (void**)&dev_b, N * sizeof(int) );
        cudaMalloc( (void**)&dev_c, N * sizeof(int) );
    [...]
        cudaMemcpy( dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice );
        cudaMemcpy( dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice );
        add<<<1,N>>>( dev_a, dev_b, dev_c );
        cudaMemcpy( c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost );
        for (int i=0; i<N; i++) {
            printf( "%d + %d = %d\n", a[i], b[i], c[i] );
        }
        cudaFree( dev_a );
        cudaFree( dev_b );
        cudaFree( dev_c );
        return 0;
    }

Add example (blocks+threads)

CODE :: add.cu [split view]

    [...]
    __global__ void add( int *a, int *b, int *c ) {
        int tid = (blockIdx.x*N)+threadIdx.x;
        c[tid] = a[tid] + b[tid];
    }
    [...]
        // Note: with M blocks of N threads the arrays must hold M*N elements.
        int *dev_a, *dev_b, *dev_c;
        cudaMalloc( (void**)&dev_a, N * sizeof(int) );
        cudaMalloc( (void**)&dev_b, N * sizeof(int) );
        cudaMalloc( (void**)&dev_c, N * sizeof(int) );
    [...]
        cudaMemcpy( dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice );
        cudaMemcpy( dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice );
        add<<<M,N>>>( dev_a, dev_b, dev_c );
        cudaMemcpy( c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost );
        for (int i=0; i<N; i++) {
            printf( "%d + %d = %d\n", a[i], b[i], c[i] );
        }
        cudaFree( dev_a );
        cudaFree( dev_b );
        cudaFree( dev_c );
        return 0;
    }

Blocks, threads and built-in structures

[Figure: four blocks, blockIdx.x = 0, 1, 2, 3, each with its threads numbered by threadIdx.x.]

    int tid = (blockIdx.x*N)+threadIdx.x;            // e.g. (2*8) + 6 = 22
    int tid = (blockIdx.x*blockDim.x)+threadIdx.x;   // same index, using the built-in blockDim

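The index formula can be tried on an array whose length is not a multiple of the block size. The sketch below is an illustration under stated assumptions, not the slides' code: the `tid < n` guard, the rounded-up block count and the names `add_guarded` and `add_vectors` are additions, and both inputs are assumed to be non-empty and of equal length.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread handles one element; threads past the end do nothing.
__global__ void add_guarded(const int *a, const int *b, int *c, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        c[tid] = a[tid] + b[tid];
}

// Hypothetical host helper: element-wise sum of a and b on the GPU.
std::vector<int> add_vectors(const std::vector<int> &a,
                             const std::vector<int> &b,
                             int threads_per_block)
{
    const int n = (int)a.size();
    const size_t bytes = n * sizeof(int);
    // Round up so that blocks * threads_per_block >= n.
    const int blocks = (n + threads_per_block - 1) / threads_per_block;

    int *dev_a = nullptr, *dev_b = nullptr, *dev_c = nullptr;
    cudaMalloc((void **)&dev_a, bytes);
    cudaMalloc((void **)&dev_b, bytes);
    cudaMalloc((void **)&dev_c, bytes);
    cudaMemcpy(dev_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b.data(), bytes, cudaMemcpyHostToDevice);

    add_guarded<<<blocks, threads_per_block>>>(dev_a, dev_b, dev_c, n);

    std::vector<int> c(n);
    cudaMemcpy(c.data(), dev_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return c;
}
```

With 22 elements and 8 threads per block, three blocks are launched; the last two threads of the third block fall outside the array and are masked by the guard.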
Add example (built-in)

CODE :: add.cu [split view]

    [...]
    __global__ void add( int *a, int *b, int *c ) {
        int tid = (blockIdx.x*blockDim.x)+threadIdx.x;
        c[tid] = a[tid] + b[tid];
    }
    [...]
        int *dev_a, *dev_b, *dev_c;
        cudaMalloc( (void**)&dev_a, N * sizeof(int) );
        cudaMalloc( (void**)&dev_b, N * sizeof(int) );
        cudaMalloc( (void**)&dev_c, N * sizeof(int) );
    [...]
        cudaMemcpy( dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice );
        cudaMemcpy( dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice );
        add<<<M,N>>>( dev_a, dev_b, dev_c );
        cudaMemcpy( c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost );
        for (int i=0; i<N; i++) {
            printf( "%d + %d = %d\n", a[i], b[i], c[i] );
        }
        cudaFree( dev_a );
        cudaFree( dev_b );
        cudaFree( dev_c );
        return 0;
    }

NVIDIA CUDA Libraries

The CUDA Toolkit includes several libraries:
- CUFFT: Fourier transforms
- CUBLAS: Dense Linear Algebra
- CUSPARSE: Sparse Linear Algebra
- LIBM: Standard C Math library
- CURAND: Pseudo-random and Quasi-random numbers
- NPP: Image and Signal Processing
- Thrust: Template Library

Several open source and commercial (*) libraries:
- MAGMA: Linear Algebra
- CULA Tools*: Linear Algebra
- CUSP: Sparse Linear Solvers
- OpenVidia: Computer Vision
- OpenCurrent: CFD
- NAG*: Computational Finance

[Figure: software stack. Applications sit on top of 3rd party libraries and NVIDIA libraries (CUFFT, CUBLAS, CUSPARSE, Libm (math.h), CURAND, NPP, Thrust, CUSP), which sit on top of CUDA C/Fortran.]

CUDA meets Fast Fourier Transform: cuFFT

- Algorithms based on Cooley-Tukey (n = 2^a · 3^b · 5^c · 7^d) and Bluestein
- Simple interface similar to FFTW
- 1D, 2D and 3D transforms of complex and real data
- Row-major order (C-order) for 2D and 3D data
- Single precision (SP) and double precision (DP) transforms
- In-place and out-of-place transforms
- 1D transform sizes up to 128 million elements
- Batch execution for doing multiple transforms
- Streamed asynchronous execution
- Non-normalized output: IFFT(FFT(A)) = len(A)*A

cuFFT example

CODE :: simplefft.cu

    #define NX 256
    #define NY 128
    [...]
    cufftHandle plan;
    cufftComplex *idata, *odata;
    cudaMalloc((void**)&idata, sizeof(cufftComplex)*NX*NY);
    cudaMalloc((void**)&odata, sizeof(cufftComplex)*NX*NY);
    [...]
    /* Create a 2D FFT plan. */
    cufftPlan2d(&plan, NX, NY, CUFFT_C2C);
    /* Use the CUFFT plan to transform the signal */
    cufftExecC2C(plan, idata, odata, CUFFT_FORWARD);
    /* Inverse transform */
    cufftExecC2C(plan, odata, odata, CUFFT_INVERSE);
    [...]
    /* Destroy the CUFFT plan. */
    cufftDestroy(plan);
    cudaFree(idata);
    cudaFree(odata);
    [...]

Step 1: Allocate space on GPU memory.
Step 2: Create a plan specifying the transform configuration, such as the size and type (real, complex, 1D, 2D and so on).
Step 3: Execute the plan as many times as required, providing the pointer to the GPU data created in Step 1.
Step 4: Destroy the plan and free GPU memory.

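The non-normalized output of cuFFT can be checked with a small 1D round trip. This is a sketch rather than the slides' example: the helper name `fft_round_trip` is hypothetical, it uses `cufftPlan1d` instead of the 2D plan shown above, and it skips return-code checks. It must be linked against cuFFT (for instance `nvcc file.cu -lcufft`).

```cuda
#include <cuda_runtime.h>
#include <cufft.h>
#include <vector>

// Hypothetical helper: forward then inverse 1D C2C transform, with no
// scaling applied. Assumes `in` is non-empty.
std::vector<cufftComplex> fft_round_trip(const std::vector<cufftComplex> &in)
{
    const int n = (int)in.size();
    const size_t bytes = n * sizeof(cufftComplex);
    std::vector<cufftComplex> out(n);

    // Step 1: allocate space on the GPU and upload the signal.
    cufftComplex *data = nullptr;
    cudaMalloc((void **)&data, bytes);
    cudaMemcpy(data, in.data(), bytes, cudaMemcpyHostToDevice);

    // Step 2: create a plan for one 1D complex-to-complex transform.
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);

    // Step 3: execute the plan, in place, in both directions.
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);
    cufftExecC2C(plan, data, data, CUFFT_INVERSE);
    cudaMemcpy(out.data(), data, bytes, cudaMemcpyDeviceToHost);

    // Step 4: destroy the plan and free GPU memory.
    cufftDestroy(plan);
    cudaFree(data);
    return out;
}
```

Each returned element should be close to the input element multiplied by the transform length, so a caller that wants the original signal back divides by `in.size()`.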
POISSON EQUATION SOLVER

$$\nabla^2 \phi = r \;\xrightarrow{\text{FFT}}\; -(k_x^2 + k_y^2)\,\hat{\phi} = \hat{r} \;\Longrightarrow\; \hat{\phi} = -\frac{\hat{r}}{k_x^2 + k_y^2}$$

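Why the Laplacian turns into a multiplication can be sketched for a single Fourier mode. The derivation assumes a periodic domain and the expansion convention written in the first line, neither of which is stated on the slide.

```latex
\phi(x,y) = \sum_{k_x,k_y} \hat{\phi}(k_x,k_y)\, e^{i(k_x x + k_y y)}
\quad\Longrightarrow\quad
\nabla^2 \phi = \sum_{k_x,k_y} \bigl[-(k_x^2 + k_y^2)\bigr]\,
                \hat{\phi}(k_x,k_y)\, e^{i(k_x x + k_y y)}

% Matching coefficients with r = \sum \hat{r}\, e^{i(k_x x + k_y y)}:
-(k_x^2 + k_y^2)\,\hat{\phi}(k_x,k_y) = \hat{r}(k_x,k_y)
\quad\Longrightarrow\quad
\hat{\phi}(k_x,k_y) = -\frac{\hat{r}(k_x,k_y)}{k_x^2 + k_y^2},
\qquad (k_x,k_y) \neq (0,0)
```

The mode (0,0) is left undetermined because the solution is only defined up to an additive constant. This is why both the MATLAB and the CUDA versions replace the zero divisor for that mode with 1.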
POISSON EQUATION SOLVER - matlab

CODE :: pois_fft.m [reduced]

    % No. of Fourier modes
    N = 64;
    % Domain size (assumed square)
    L = 1;
    % Characteristic width of f (make << 1)
    sig = 0.1;
    % Vector of wavenumbers
    k = (2*pi/L)*[0:(N/2-1) (-N/2):(-1)];
    % Matrix of (x,y) wavenumbers corresponding
    % to Fourier mode (m,n)
    [KX KY] = meshgrid(k,k);
    % Laplacian matrix acting on the wavenumbers
    delsq = -(KX.^2 + KY.^2);
    % Kludge to avoid division by zero for
    % wavenumber (0,0).
    delsq(1,1) = 1;
    % Grid spacing
    h = L/N;
    x = (0:(N-1))*h;
    y = (0:(N-1))*h;
    [X Y] = meshgrid(x,y);
    % Construct RHS f(x,y) at the Fourier gridpts
    rsq = (X-0.5*L).^2 + (Y-0.5*L).^2;
    sigsq = sig^2;
    f = exp(-rsq/(2*sigsq)).*...
        (rsq - 2*sigsq)/(sigsq^2);
    % Spectral inversion of Laplacian
    fhat = fft2(f);
    u = real(ifft2(fhat./delsq));
    % Specify arbitrary constant by forcing corner
    % u = 0.
    u = u - u(1,1);
    % Compute L2 and Linf norm of error
    uex = exp(-rsq/(2*sigsq));
    errmax = norm(u(:)-uex(:),inf);
    errmax2 = norm(u(:)-uex(:),2)/(N*N);
    % Print L2 and Linf norm of error
    fprintf('N=%d\n',N);
    fprintf('Solution at (%d,%d): ',N/2,N/2);
    fprintf('computed=%10.6f... reference = %10.6f\n',u(N/2,N/2),uex(N/2,N/2));
    fprintf('Linf err=%10.6e L2 norm err = %10.6e\n',errmax, errmax2);

POISSON EQUATION SOLVER - CUDA

The following steps need to be performed:
1. Allocate memory on host: r (NxN), u (NxN), kx (N) and ky (N).
2. Allocate memory on device: r_d, u_d, kx_d, ky_d.
3. Transfer r, kx and ky content from host memory to the device memory.
4. Initialize plan for FFT.
5. Compute execution configuration.
6. Transform real input to complex input.
7. 2D forward FFT.
8. Solve Poisson equation in Fourier space.
9. 2D inverse FFT.
10. Transform complex output to real output and apply scaling.
11. Transfer results from the GPU back to the host.

We are not taking advantage of the symmetries (a C2C transform is used for real data) to keep the code simple.

POISSON EQUATION SOLVER - steps 1:5

CODE :: pois_fft.cu

    /* Allocate arrays on the host */
    float *kx, *ky, *r;
    kx = (float *) malloc(sizeof(float)*N);
    ky = (float *) malloc(sizeof(float)*N);
    r  = (float *) malloc(sizeof(float)*N*N);

    /* Allocate arrays on the GPU */
    float *kx_d, *ky_d, *r_d;
    cudaMalloc( (void **) &kx_d, sizeof(float)*N);
    cudaMalloc( (void **) &ky_d, sizeof(float)*N);
    cudaMalloc( (void **) &r_d,  sizeof(float)*N*N);
    cufftComplex *r_complex_d;
    cudaMalloc( (void **) &r_complex_d, sizeof(cufftComplex)*N*N);

    /* Initialize r, kx and ky on the host */
    ...

    /* Transfer data from host to device */
    cudaMemcpy (kx_d, kx, sizeof(float)*N,   cudaMemcpyHostToDevice);
    cudaMemcpy (ky_d, ky, sizeof(float)*N,   cudaMemcpyHostToDevice);
    cudaMemcpy (r_d,  r,  sizeof(float)*N*N, cudaMemcpyHostToDevice);

    /* Create plan for CUDA FFT */
    cufftHandle plan;
    cufftPlan2d( &plan, N, N, CUFFT_C2C);

    /* Compute the execution configuration */
    dim3 dimBlock(BLOCK_SIZE_X, BLOCK_SIZE_Y);
    dim3 dimGrid (N/dimBlock.x, N/dimBlock.y);
    if (N % BLOCK_SIZE_X != 0) dimGrid.x += 1;
    if (N % BLOCK_SIZE_Y != 0) dimGrid.y += 1;

POISSON EQUATION SOLVER - steps 6:11

CODE :: pois_fft.cu

    /* Transform real input to complex input */
    real2complex<<<dimGrid, dimBlock>>> (r_d, r_complex_d, N);

    /* Compute forward FFT */
    cufftExecC2C (plan, r_complex_d, r_complex_d, CUFFT_FORWARD);

    /* Solve Poisson equation in Fourier space */
    solve_poisson<<<dimGrid, dimBlock>>> (r_complex_d, kx_d, ky_d, N);

    /* Compute inverse FFT */
    cufftExecC2C (plan, r_complex_d, r_complex_d, CUFFT_INVERSE);

    /* Copy the solution back to a real array and apply scaling */
    float scale = 1.f / ( (float) N * (float) N );
    complex2real_scaled<<<dimGrid, dimBlock>>> (r_complex_d, r_d, N, scale);

    /* Transfer data from device to host */
    cudaMemcpy (r, r_d, sizeof(float)*N*N, cudaMemcpyDeviceToHost);

    /* Destroy plan and clean up memory on device */
    cufftDestroy(plan);
    cudaFree(r_complex_d);
    cudaFree(r_d);
    cudaFree(kx_d);
    cudaFree(ky_d);

POISSON EQUATION SOLVER - functions

CODE :: pois_fft.cu

    /* Copy a real array into the real part of a complex array */
    __global__ void real2complex (float *a, cufftComplex *c, int N)
    {
        int idx = blockIdx.x*blockDim.x+threadIdx.x;
        int idy = blockIdx.y*blockDim.y+threadIdx.y;
        if ( idx < N && idy < N ) {
            int index = idx + idy*N;
            c[index].x = a[index];
            c[index].y = 0.f;
        }
    }

    /* Divide every Fourier mode by -(kx^2 + ky^2) */
    __global__ void solve_poisson (cufftComplex *c, float *kx, float *ky, int N)
    {
        int idx = blockIdx.x*blockDim.x+threadIdx.x;
        int idy = blockIdx.y*blockDim.y+threadIdx.y;
        if ( idx < N && idy < N ) {
            int index = idx + idy*N;
            float scale = - ( kx[idx]*kx[idx] + ky[idy]*ky[idy] );
            if ( idx == 0 && idy == 0 ) scale = 1.f;   /* avoid division by zero */
            scale = 1.f / scale;
            c[index].x *= scale;
            c[index].y *= scale;
        }
    }

POISSON EQUATION SOLVER - execution

CODE :: pois_fft.cu

    /* Copy the real part of a complex array into a real array, scaled */
    __global__ void complex2real_scaled (cufftComplex *c, float *a, int N, float scale)
    {
        int idx = blockIdx.x*blockDim.x+threadIdx.x;
        int idy = blockIdx.y*blockDim.y+threadIdx.y;
        if ( idx < N && idy < N ) {
            int index = idx + idy*N;
            a[index] = scale*c[index].x;
        }
    }

TERMINAL

    $ nvcc -O3 -o poisson poisson.cu -I/usr/local/cuda/include -L/usr/local/cuda/lib -lcufft -I/usr/local/NVIDIA_CUDA_SDK/common/inc -L/usr/local/NVIDIA_CUDA_SDK/lib -lcutil
    $ ./poisson -N64
    Poisson solver on a domain 64 x 64
    dimBlock 32 16 (512 threads)
    dimGrid 2 4
    L2 error [...]e-08:
    Time [...]:
    Time I/O [...] ([...]):
    Solution at (32,32)
    computed=[...] reference=[...]
    $ _

From MATLAB

    N=64
    Solution at (32,32): computed=[...] reference=[...]
    Linf err=[...]e-05 L2 norm err = [...]e-08

Unified Distributed Architectures: the last frontier

Multi-GPU code
- Domain decomposition
- Need to add communication of halos

GPUs within a single network node
- Exchange halos via PCIe communication
- With CUDA 4.0 can use Peer-to-Peer communication

GPUs in different network nodes
- Require network communication
- Currently require transferring GPU data to/from host CPU
- Efforts underway to make MPI aware of GPUs (OpenMPI, MVAPICH)

Unified Distributed Architectures: the last frontier

The goal is to hide communication cost. So, every time-step each GPU will:
- Compute halos to be sent to neighbors
- Compute the internal region
- Exchange halos with neighbors

Linear scaling as long as internal computation takes longer than halo exchange. Note that the separate halo computation adds some overhead.

Example: two subdomains

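The overlap of halo exchange and internal computation can be sketched on a single GPU with two streams. Everything here is an illustrative assumption: the update rule (doubling every cell), the names `update` and `step_and_send_halo`, the 1D layout, and the use of a device-to-host copy into pinned memory as a stand-in for sending the halo to a neighbor. It assumes 0 < halo < n.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical update rule: every cell in [first, last) is doubled.
__global__ void update(float *u, int first, int last)
{
    int i = first + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < last)
        u[i] *= 2.0f;
}

// One time step on a 1D subdomain of n cells stored at u_dev. The last
// `halo` cells are the ones a neighbor needs; the function returns them
// as the neighbor would receive them.
std::vector<float> step_and_send_halo(float *u_dev, int n, int halo)
{
    const int threads = 128;
    cudaStream_t s_halo, s_inner;
    cudaStreamCreate(&s_halo);
    cudaStreamCreate(&s_inner);

    // Pinned host memory, so that the copy can run asynchronously.
    float *send_buf = nullptr;
    cudaMallocHost((void **)&send_buf, halo * sizeof(float));

    // Phase 1: compute the halo cells first.
    update<<<(halo + threads - 1) / threads, threads, 0, s_halo>>>(
        u_dev, n - halo, n);

    // Phase 2: send the halo (same stream, so it waits for phase 1) ...
    cudaMemcpyAsync(send_buf, u_dev + (n - halo), halo * sizeof(float),
                    cudaMemcpyDeviceToHost, s_halo);
    // ... while the internal region is computed in another stream.
    update<<<(n - halo + threads - 1) / threads, threads, 0, s_inner>>>(
        u_dev, 0, n - halo);

    cudaStreamSynchronize(s_halo);
    cudaStreamSynchronize(s_inner);

    std::vector<float> received(send_buf, send_buf + halo);
    cudaFreeHost(send_buf);
    cudaStreamDestroy(s_halo);
    cudaStreamDestroy(s_inner);
    return received;
}
```

Whether the copy and the interior kernel actually run concurrently depends on the device. The structure only makes the overlap possible, and it pays off when the interior computation takes longer than the transfer.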
Unified Distributed Architectures: the last frontier

[Figure: two subdomains, GPU-0 (green) and GPU-1 (grey). Phase 1: both GPUs compute. Phase 2: both GPUs compute and send. Source: NVIDIA.]

Unified Distributed Architectures: the last frontier

MVAPICH design

[Figure from: MVAPICH-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters. H. Wang, S. Potluri, M. Luo, A.K. Singh, S. Sur, D.K. Panda.]

Unified Distributed Architectures: the last frontier

MVAPICH performance

[Figure: performance plots for ping-pong and one-sided communication. Source: NVIDIA.]

Unified Distributed Architectures: the last frontier

LET'S SEE A SIMULATION

QUESTION TIME

Thank You

You will find the PDF edition in the didactic section of the author's website. If you want to share this presentation, be sure to read and follow the CC-BY-NC-ND-4.0 license.
More informationGPU Computing Workshop CSU Getting Started. Garland Durham Quantos Analytics
1 GPU Computing Workshop CSU 2013 Getting Started Garland Durham Quantos Analytics nvidia-smi 2 At command line, run command nvidia-smi to get/set GPU properties. nvidia-smi Options: -q query -L list attached
More informationAdvanced Topics: Streams, Multi-GPU, Tools, Libraries, etc.
CSC 391/691: GPU Programming Fall 2011 Advanced Topics: Streams, Multi-GPU, Tools, Libraries, etc. Copyright 2011 Samuel S. Cho Streams Until now, we have largely focused on massively data-parallel execution
More informationGetting Started with CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator
Getting Started with CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator Heterogeneous Computing CPU GPU Once upon a time Past Massively Parallel Supercomputers Goodyear MPP Thinking Machine MasPar Cray 2 1.31
More informationAccelerating MATLAB with CUDA
Accelerating MATLAB with CUDA Massimiliano Fatica NVIDIA mfatica@nvidia.com Won-Ki Jeong University of Utah wkjeong@cs.utah.edu Overview MATLAB can be easily extended via MEX files to take advantage of
More informationPinned-Memory. Table of Contents. Streams Learning CUDA to Solve Scientific Problems. Objectives. Technical Issues Stream. Pinned-memory.
Table of Contents Streams Learning CUDA to Solve Scientific Problems. 1 Objectives Miguel Cárdenas Montes Centro de Investigaciones Energéticas Medioambientales y Tecnológicas, Madrid, Spain miguel.cardenas@ciemat.es
More informationBasic Elements of CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Basic Elements of CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied
More informationParallel Programming and Debugging with CUDA C. Geoff Gerfin Sr. System Software Engineer
Parallel Programming and Debugging with CUDA C Geoff Gerfin Sr. System Software Engineer CUDA - NVIDIA s Architecture for GPU Computing Broad Adoption Over 250M installed CUDA-enabled GPUs GPU Computing
More informationScientific discovery, analysis and prediction made possible through high performance computing.
Scientific discovery, analysis and prediction made possible through high performance computing. An Introduction to GPGPU Programming Bob Torgerson Arctic Region Supercomputing Center November 21 st, 2013
More informationParallel Computing. Lecture 19: CUDA - I
CSCI-UA.0480-003 Parallel Computing Lecture 19: CUDA - I Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com GPU w/ local DRAM (device) Behind CUDA CPU (host) Source: http://hothardware.com/reviews/intel-core-i5-and-i7-processors-and-p55-chipset/?page=4
More informationModule 2: Introduction to CUDA C. Objective
ECE 8823A GPU Architectures Module 2: Introduction to CUDA C 1 Objective To understand the major elements of a CUDA program Introduce the basic constructs of the programming model Illustrate the preceding
More informationZero Copy Memory and Multiple GPUs
Zero Copy Memory and Multiple GPUs Goals Zero Copy Memory Pinned and mapped memory on the host can be read and written to from the GPU program (if the device permits this) This may result in performance
More informationIntroduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model
Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Dirk Ribbrock Faculty of Mathematics, TU dortmund 2016 Table of Contents Why parallel
More informationLecture 10!! Introduction to CUDA!
1(50) Lecture 10 Introduction to CUDA Ingemar Ragnemalm Information Coding, ISY 1(50) Laborations Some revisions may happen while making final adjustments for Linux Mint. Last minute changes may occur.
More informationRegister file. A single large register file (ex. 16K registers) is partitioned among the threads of the dispatched blocks.
Sharing the resources of an SM Warp 0 Warp 1 Warp 47 Register file A single large register file (ex. 16K registers) is partitioned among the threads of the dispatched blocks Shared A single SRAM (ex. 16KB)
More informationLecture 3: Introduction to CUDA
CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 3: Introduction to CUDA Some slides here are adopted from: NVIDIA teaching kit Mohamed Zahran (aka Z) mzahran@cs.nyu.edu
More informationGPU Programming. Lecture 2: CUDA C Basics. Miaoqing Huang University of Arkansas 1 / 34
1 / 34 GPU Programming Lecture 2: CUDA C Basics Miaoqing Huang University of Arkansas 2 / 34 Outline Evolvements of NVIDIA GPU CUDA Basic Detailed Steps Device Memories and Data Transfer Kernel Functions
More informationProgramming with CUDA, WS09
Programming with CUDA and Parallel Algorithms Waqar Saleem Jens Müller Lecture 3 Thursday, 29 Nov, 2009 Recap Motivational videos Example kernel Thread IDs Memory overhead CUDA hardware and programming
More informationGPU Programming Using CUDA. Samuli Laine NVIDIA Research
GPU Programming Using CUDA Samuli Laine NVIDIA Research Today GPU vs CPU Different architecture, different workloads Basics of CUDA Executing code on GPU Managing memory between CPU and GPU CUDA API Quick
More informationZero-copy. Table of Contents. Multi-GPU Learning CUDA to Solve Scientific Problems. Objectives. Technical Issues Zero-copy. Multigpu.
Table of Contents Multi-GPU Learning CUDA to Solve Scientific Problems. 1 Objectives Miguel Cárdenas Montes 2 Zero-copy Centro de Investigaciones Energéticas Medioambientales y Tecnológicas, Madrid, Spain
More informationGPU Architecture and Programming. Andrei Doncescu inspired by NVIDIA
GPU Architecture and Programming Andrei Doncescu inspired by NVIDIA Traditional Computing Von Neumann architecture: instructions are sent from memory to the CPU Serial execution: Instructions are executed
More informationIntroduction to GPU programming. Introduction to GPU programming p. 1/17
Introduction to GPU programming Introduction to GPU programming p. 1/17 Introduction to GPU programming p. 2/17 Overview GPUs & computing Principles of CUDA programming One good reference: David B. Kirk
More informationLecture 9. Outline. CUDA : a General-Purpose Parallel Computing Architecture. CUDA Device and Threads CUDA. CUDA Architecture CUDA (I)
Lecture 9 CUDA CUDA (I) Compute Unified Device Architecture 1 2 Outline CUDA Architecture CUDA Architecture CUDA programming model CUDA-C 3 4 CUDA : a General-Purpose Parallel Computing Architecture CUDA
More informationLecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1
Lecture 15: Introduction to GPU programming Lecture 15: Introduction to GPU programming p. 1 Overview Hardware features of GPGPU Principles of GPU programming A good reference: David B. Kirk and Wen-mei
More informationIntroduction to GPGPUs and to CUDA programming model
Introduction to GPGPUs and to CUDA programming model www.cineca.it Marzia Rivi m.rivi@cineca.it GPGPU architecture CUDA programming model CUDA efficient programming Debugging & profiling tools CUDA libraries
More informationGPU Programming Using CUDA. Samuli Laine NVIDIA Research
GPU Programming Using CUDA Samuli Laine NVIDIA Research Today GPU vs CPU Different architecture, different workloads Basics of CUDA Executing code on GPU Managing memory between CPU and GPU CUDA API Quick
More informationECE 574 Cluster Computing Lecture 15
ECE 574 Cluster Computing Lecture 15 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 30 March 2017 HW#7 (MPI) posted. Project topics due. Update on the PAPI paper Announcements
More informationCUDA Basics. July 6, 2016
Mitglied der Helmholtz-Gemeinschaft CUDA Basics July 6, 2016 CUDA Kernels Parallel portion of application: execute as a kernel Entire GPU executes kernel, many threads CUDA threads: Lightweight Fast switching
More informationBasic CUDA workshop. Outlines. Setting Up Your Machine Architecture Getting Started Programming CUDA. Fine-Tuning. Penporn Koanantakool
Basic CUDA workshop Penporn Koanantakool twitter: @kaewgb e-mail: kaewgb@gmail.com Outlines Setting Up Your Machine Architecture Getting Started Programming CUDA Debugging Fine-Tuning Setting Up Your Machine
More informationCOSC 6374 Parallel Computations Introduction to CUDA
COSC 6374 Parallel Computations Introduction to CUDA Edgar Gabriel Fall 2014 Disclaimer Material for this lecture has been adopted based on various sources Matt Heavener, CS, State Univ. of NY at Buffalo
More informationTesla Architecture, CUDA and Optimization Strategies
Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization
More informationRAPID MULTI-GPU PROGRAMMING WITH CUDA LIBRARIES. Nikolay Markovskiy
RAPID MULTI-GPU PROGRAMMING WITH CUDA LIBRARIES Nikolay Markovskiy CUDA 6 cublas cufft 2. cublas-xt 3. cufft-xt 1. NVBLAS WHAT IS NVBLAS? Drop-in replacement of BLAS Built on top of cublas-xt BLAS Level
More informationCUFFT LIBRARY USER'S GUIDE. DU _v7.5 September 2015
CUFFT LIBRARY USER'S GUIDE DU-06707-001_v7.5 September 2015 TABLE OF CONTENTS Chapter 1. Introduction...1 Chapter 2. Using the cufft API... 3 2.1. Accessing cufft...4 2.2. Fourier Transform Setup...5 2.3.
More informationCSE 599 I Accelerated Computing Programming GPUS. Intro to CUDA C
CSE 599 I Accelerated Computing Programming GPUS Intro to CUDA C GPU Teaching Kit Accelerated Computing Lecture 2.1 - Introduction to CUDA C CUDA C vs. Thrust vs. CUDA Libraries Objective To learn the
More informationCUDA 6.5 Performance Report
CUDA 6.5 Performance Report 1 CUDA 6.5 Performance Report CUDART CUDA Runtime Library cufft Fast Fourier Transforms Library cublas Complete BLAS Library cusparse Sparse Matrix Library curand Random Number
More informationHPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Introduction to CUDA programming
KFUPM HPC Workshop April 29-30 2015 Mohamed Mekias HPC Solutions Consultant Introduction to CUDA programming 1 Agenda GPU Architecture Overview Tools of the Trade Introduction to CUDA C Patterns of Parallel
More informationIntroduction to CUDA C QCon 2011 Cyril Zeller, NVIDIA Corporation
Introduction to CUDA C QCon 2011 Cyril Zeller, NVIDIA Corporation Welcome Goal: an introduction to GPU programming CUDA = NVIDIA s architecture for GPU computing What will you learn in this session? Start
More informationScientific GPU computing with Go A novel approach to highly reliable CUDA HPC 1 February 2014
Scientific GPU computing with Go A novel approach to highly reliable CUDA HPC 1 February 2014 Arne Vansteenkiste Ghent University Real-world example (micromagnetism) DyNaMat LAB @ UGent: Microscale Magnetic
More informationIntroduction to CUDA CME343 / ME May James Balfour [ NVIDIA Research
Introduction to CUDA CME343 / ME339 18 May 2011 James Balfour [ jbalfour@nvidia.com] NVIDIA Research CUDA Programing system for machines with GPUs Programming Language Compilers Runtime Environments Drivers
More informationIntroduc)on to CUDA. T.V. Singh Ins)tute for Digital Research and Educa)on UCLA
Introduc)on to CUDA T.V. Singh Ins)tute for Digital Research and Educa)on UCLA tvsingh@ucla.edu GPU and GPGPU Graphics Processing Unit A specialized device on computer to accelerate building of images
More informationCUDA Programming. Aiichiro Nakano
CUDA Programming Aiichiro Nakano Collaboratory for Advanced Computing & Simulations Department of Computer Science Department of Physics & Astronomy Department of Chemical Engineering & Materials Science
More informationECE 574 Cluster Computing Lecture 17
ECE 574 Cluster Computing Lecture 17 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 March 2019 HW#8 (CUDA) posted. Project topics due. Announcements 1 CUDA installing On Linux
More informationIntroduc)on to GPU Programming
Introduc)on to GPU Programming Mubashir Adnan Qureshi h3p://www.ncsa.illinois.edu/people/kindr/projects/hpca/files/singapore_p1.pdf h3p://developer.download.nvidia.com/cuda/training/nvidia_gpu_compu)ng_webinars_cuda_memory_op)miza)on.pdf
More informationModule 2: Introduction to CUDA C
ECE 8823A GPU Architectures Module 2: Introduction to CUDA C 1 Objective To understand the major elements of a CUDA program Introduce the basic constructs of the programming model Illustrate the preceding
More informationGPU & High Performance Computing (by NVIDIA) CUDA. Compute Unified Device Architecture Florian Schornbaum
GPU & High Performance Computing (by NVIDIA) CUDA Compute Unified Device Architecture 29.02.2008 Florian Schornbaum GPU Computing Performance In the last few years the GPU has evolved into an absolute
More informationGPU programming. Dr. Bernhard Kainz
GPU programming Dr. Bernhard Kainz Overview About myself Motivation GPU hardware and system architecture GPU programming languages GPU programming paradigms Pitfalls and best practice Reduction and tiling
More informationUsing a GPU in InSAR processing to improve performance
Using a GPU in InSAR processing to improve performance Rob Mellors, ALOS PI 152 San Diego State University David Sandwell University of California, San Diego What is a GPU? (Graphic Processor Unit) A graphics
More informationCOMP 322: Fundamentals of Parallel Programming. Flynn s Taxonomy for Parallel Computers
COMP 322: Fundamentals of Parallel Programming Lecture 37: General-Purpose GPU (GPGPU) Computing Max Grossman, Vivek Sarkar Department of Computer Science, Rice University max.grossman@rice.edu, vsarkar@rice.edu
More informationGPU Computing: Introduction to CUDA. Dr Paul Richmond
GPU Computing: Introduction to CUDA Dr Paul Richmond http://paulrichmond.shef.ac.uk This lecture CUDA Programming Model CUDA Device Code CUDA Host Code and Memory Management CUDA Compilation Programming
More informationGPU CUDA Programming
GPU CUDA Programming 이정근 (Jeong-Gun Lee) 한림대학교컴퓨터공학과, 임베디드 SoC 연구실 www.onchip.net Email: Jeonggun.Lee@hallym.ac.kr ALTERA JOINT LAB Introduction 차례 Multicore/Manycore and GPU GPU on Medical Applications
More informationCS377P Programming for Performance GPU Programming - I
CS377P Programming for Performance GPU Programming - I Sreepathi Pai UTCS November 9, 2015 Outline 1 Introduction to CUDA 2 Basic Performance 3 Memory Performance Outline 1 Introduction to CUDA 2 Basic
More informationCUDA Programming. Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna.
Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna moreno.marzolla@unibo.it Copyright 2014, 2017, 2018 Moreno Marzolla, Università di Bologna, Italy http://www.moreno.marzolla.name/teaching/hpc/
More informationPractical Introduction to CUDA and GPU
Practical Introduction to CUDA and GPU Charlie Tang Centre for Theoretical Neuroscience October 9, 2009 Overview CUDA - stands for Compute Unified Device Architecture Introduced Nov. 2006, a parallel computing
More informationUniversity of Bielefeld
Geistes-, Natur-, Sozial- und Technikwissenschaften gemeinsam unter einem Dach Introduction to GPU Programming using CUDA Olaf Kaczmarek University of Bielefeld STRONGnet Summerschool 2011 ZIF Bielefeld
More informationWriting and compiling a CUDA code
Writing and compiling a CUDA code Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) Writing CUDA code 1 / 65 The CUDA language If we want fast code, we (unfortunately)
More information