APPENDIX. Source code. Part 1. Part 2. Part 3.


Source Code Part 1. arrayfun, pagefun, bsxfun

Source Code Part 1. arrayfun()

% foo.m
function y = foo(x)
y = 1 + x.*(1 + x.*(1 + x.*(1 + x.*(1 + x.*(1 + x.*(1 + x.*(1 + x.*(1 + x./9)./8)./7)./6)./5)./4)./3)./2);

%% arrayfun
clear; clc;

% display(' GPU Performance [ CPU vs GPU vs GPU with arrayfun ] ');
display(' GPU Performance [ CPU vs GPU with arrayfun ] ');
type foo.m;

cpuX = rand(1e4, 1e3);
gpuX = gpuArray(cpuX);

% CPU COMPUTING
tic;
cpuY = foo(cpuX);
% cpuY = arrayfun(@foo, cpuX);
tcpu = toc;

% GPU ONLY COMPUTING
tic;
gpuY1 = foo(gpuX);
tgpu1 = toc;

% GPU WITH ARRAYFUN COMPUTING
tic;
gpuY2 = arrayfun(@foo, gpuX);
tgpu2 = toc;

% MAXIMUM ABSOLUTE ERROR
err1 = max(abs(cpuY(:) - gpuY1(:)));
err2 = max(abs(cpuY(:) - gpuY2(:)));

% DISPLAY
display(['execution time on CPU : ' num2str(tcpu, '%2.6f') ' sec']);
% display(['execution time on GPU only : ' num2str(tgpu1, '%2.6f') ' sec']);
display(['execution time on GPU with arrayfun : ' num2str(tgpu2, '%2.6f') ' sec']);
% display(['maximum absolute error for CPU / GPU only : ' num2str(err1, '%2.4e')]);
display(['maximum absolute error for CPU / GPU with arrayfun : ' num2str(err2, '%2.4e')]);
% display(['acceleration ratio for CPU / GPU only : x ' num2str(tcpu/tgpu1, '%2.4f')]);
display(['acceleration ratio for CPU / GPU with arrayfun : x ' num2str(tcpu/tgpu2, '%2.4f')]);
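The nested expression in foo.m is the degree-9 Taylor polynomial of exp(x), written in Horner form. As a rough illustration only, the same evaluation can be sketched as a loop in C. The name `foo_horner` and its `terms` parameter are hypothetical and do not appear in the slides.

```c
#include <assert.h>

/* Hypothetical helper (not from the slides): evaluates
 * 1 + x/1*(1 + x/2*(... (1 + x/terms))) from the innermost term outward.
 * With terms = 9 this follows the nesting written out in foo.m. */
double foo_horner(double x, int terms)
{
    assert(terms >= 1);
    double acc = 1.0;
    for (int k = terms; k >= 1; --k) {
        /* Same operation order as the MATLAB source: (x .* inner) ./ k */
        acc = 1.0 + x * acc / (double)k;
    }
    return acc;
}
```

Calling `foo_horner(x, 9)` for a scalar `x` should agree with `foo(x)` up to rounding. Because each element is computed independently with a fixed chain of scalar operations, foo is the kind of function that arrayfun can apply element by element on a gpuArray.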

Source Code Part 1. pagefun()

%% pagefun
clear; % clc;

% display(' GPU Performance [ CPU vs GPU vs GPU with pagefun ] ');
display(' GPU Performance [ CPU vs GPU with pagefun ] ');

cpuX = rand(1e2, 1e2, 1e1, 1e1);
gpuX = gpuArray(cpuX);

% CPU COMPUTING
tic;
cpuY = zeros(size(cpuX));
for i = 1:size(cpuX, 3)
    for j = 1:size(cpuX, 4)
        cpuY(:,:,i,j) = transpose(cpuX(:,:,i,j));
    end
end
tcpu = toc;

% GPU ONLY COMPUTING
tic;
gpuY1 = zeros(size(gpuX), 'gpuArray');
for i = 1:size(cpuX, 3)
    for j = 1:size(cpuX, 4)
        gpuY1(:,:,i,j) = transpose(gpuX(:,:,i,j));
    end
end
tgpu1 = toc;

% GPU WITH PAGEFUN COMPUTING
tic;
gpuY2 = pagefun(@transpose, gpuX);
tgpu2 = toc;

% MAXIMUM ABSOLUTE ERROR
err1 = max(abs(cpuY(:) - gpuY1(:)));
err2 = max(abs(cpuY(:) - gpuY2(:)));

% DISPLAY
display(['execution time on CPU : ' num2str(tcpu, '%2.6f') ' sec']);
% display(['execution time on GPU only : ' num2str(tgpu1, '%2.6f') ' sec']);
display(['execution time on GPU with pagefun : ' num2str(tgpu2, '%2.6f') ' sec']);
% display(['maximum absolute error for CPU / GPU only : ' num2str(err1, '%2.4e')]);
display(['maximum absolute error for CPU / GPU with pagefun : ' num2str(err2, '%2.4e')]);
% display(['acceleration ratio for CPU / GPU only : x ' num2str(tcpu/tgpu1, '%2.4f')]);
display(['acceleration ratio for CPU / GPU with pagefun : x ' num2str(tcpu/tgpu2, '%2.4f')]);

Source Code Part 1. bsxfun()

%% bsxfun
clear; % clc;

% display(' GPU Performance [ CPU vs GPU vs GPU with bsxfun ] ');
display(' GPU Performance [ CPU vs GPU with bsxfun ] ');

cpuX = rand(1e4, 1e3);
cpuY = mean(cpuX);
gpuX = gpuArray(cpuX);
gpuY = gpuArray(cpuY);

% CPU
tic;
cpuZ = zeros(size(cpuX));
for j = 1:size(cpuX, 2)
    cpuZ(:, j) = minus(cpuX(:, j), cpuY(j));
end
% cpuZ = cpuX - repmat(cpuY, [size(cpuX, 1), 1]);
tcpu = toc;

% GPU ONLY
tic;
% gpuZ1 = zeros(size(gpuX), 'gpuArray');
% for j = 1:size(cpuX, 2)
%     gpuZ1(:, j) = minus(gpuX(:, j), gpuY(j));
% end
gpuZ1 = gpuX - repmat(gpuY, [size(gpuX, 1), 1]);
tgpu1 = toc;

% GPU WITH BSXFUN COMPUTING
tic;
gpuZ2 = bsxfun(@minus, gpuX, gpuY);
tgpu2 = toc;

% MAXIMUM ABSOLUTE ERROR
err1 = max(abs(cpuZ(:) - gpuZ1(:)));
err2 = max(abs(cpuZ(:) - gpuZ2(:)));

% DISPLAY
display(['execution time on CPU : ' num2str(tcpu, '%2.6f') ' sec']);
% display(['execution time on GPU only : ' num2str(tgpu1, '%2.6f') ' sec']);
display(['execution time on GPU with bsxfun : ' num2str(tgpu2, '%2.6f') ' sec']);
% display(['maximum absolute error for CPU / GPU only : ' num2str(err1, '%2.4e')]);
display(['maximum absolute error for CPU / GPU with bsxfun : ' num2str(err2, '%2.4e')]);
% display(['acceleration ratio for CPU / GPU only : x ' num2str(tcpu/tgpu1, '%2.4f')]);
display(['acceleration ratio for CPU / GPU with bsxfun : x ' num2str(tcpu/tgpu2, '%2.4f')]);

Source Code Part 2. mrics_gpu.m, test_mrics.m

Source Code Part 2. mrics_gpu()

function u = mrics_gpu(R, f, mu, lambda, gamma, nInner, nBreg)

[rows,cols] = size(f);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% GPUARRAY
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
f = gpuArray(f);
R = gpuArray(R);

% Reserve memory for the auxiliary variables
f0 = f;
u = zeros(rows,cols, 'gpuArray');
x = zeros(rows,cols, 'gpuArray');
y = zeros(rows,cols, 'gpuArray');
bx = zeros(rows,cols, 'gpuArray');
by = zeros(rows,cols, 'gpuArray');

% Build Kernels
scale = sqrt(rows*cols);
murf = ifft2(mu*(conj(R).*f))*scale;

uker = zeros(rows,cols, 'gpuArray');
uker(1,1) = 4; uker(1,2) = -1; uker(2,1) = -1; uker(rows,1) = -1; uker(1,cols) = -1;
uker = mu*(conj(R).*R)+lambda*fft2(uker)+gamma;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% Do the reconstruction
for outer = 1:nBreg
    for inner = 1:nInner
        % update u
        rhs = murf+lambda*Dxt(x-bx)+lambda*Dyt(y-by)+gamma*u;
        u = ifft2(fft2(rhs)./uker);

        % update x and y
        dx = Dx(u);
        dy = Dy(u);
        [x,y] = shrink2(dx+bx, dy+by, 1/lambda);

        % update bregman parameters
        bx = bx+dx-x;
        by = by+dy-y;
    end

    f = f+f0-R.*fft2(u)/scale;
    murf = ifft2(mu*R.*f)*scale;
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% GATHER
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
u = gather(u);

return;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function d = Dx(u)
[rows,cols] = size(u);
d = zeros(rows,cols, 'gpuArray');
d(:,2:cols) = u(:,2:cols)-u(:,1:cols-1);
d(:,1) = u(:,1)-u(:,cols);
return

function d = Dxt(u)
[rows,cols] = size(u);
d = zeros(rows,cols, 'gpuArray');
d(:,1:cols-1) = u(:,1:cols-1)-u(:,2:cols);
d(:,cols) = u(:,cols)-u(:,1);
return

function d = Dy(u)
[rows,cols] = size(u);
d = zeros(rows,cols, 'gpuArray');
d(2:rows,:) = u(2:rows,:)-u(1:rows-1,:);
d(1,:) = u(1,:)-u(rows,:);
return

function d = Dyt(u)
[rows,cols] = size(u);
d = zeros(rows,cols, 'gpuArray');
d(1:rows-1,:) = u(1:rows-1,:)-u(2:rows,:);
d(rows,:) = u(rows,:)-u(1,:);
return

function [xs,ys] = shrink2(x,y,lambda)
s = sqrt(x.*conj(x)+y.*conj(y));
ss = s-lambda;
ss = ss.*(ss>0);
s = s+(s<lambda);
ss = ss./s;
xs = ss.*x;
ys = ss.*y;
return;

Source Code Part 2. test_mrics()

N = 512; % The image will be NxN
sparsity = .25; % use only 25% on the K-Space data for CS
mu = .1;
lambda = .1;
gamma = mu/1000;

% build an image of a square
% image = zeros(N,N);
% image(N/4:3*N/4,N/4:3*N/4)=255;
image = phantom(N)*255;

% build the sampling matrix, R
R = rand(N,N);
R = double(R<sparsity);
R(1, 1) = 1; % DC POINT

% Form the CS data
F = R.*fft2(image)/N;

% Recover the image
tic;
recovered = mrics(R,F, mu, lambda, gamma,10, 5);
toc;

tic;
recovered2 = mrics_gpu(R,F, mu, lambda, gamma,10, 5);
toc;

wnd = [0, 255];

% build a figure to display results
figure;
subplot(2,2,1);
imagesc(abs(image), wnd); colormap('gray');
title('Original');

subplot(2,2,2);
imagesc(abs(R)); colormap('gray');
title('R');

subplot(2,2,3);
% imagesc(abs(ifft2(F))); colormap('gray');
imagesc(abs(recovered), wnd); colormap('gray');
title('Set unknown to 0');

subplot(2,2,4);
imagesc(abs(recovered2), wnd); colormap('gray');
title('Split Bregman Recovery');

figure;
imagesc(abs(recovered - recovered2)); colormap('gray'); colorbar;
title('CPU_{recovery} - GPU_{recovery}');

Source Code Part 3. iradon_gpu.m, iradonmexcu.cu, demo_iradon.m

Source Code Part 3. iradon_gpu()

function [img,H] = iradon_gpu(varargin)

narginchk(2,6);

[p,theta,filter,d,interp,N] = parse_inputs(varargin{:});

[p,H] = filterProjections(p, filter, d);

% Define the x & y axes for the reconstructed image so that the origin
% (center) is in the spot which RADON would choose.
center = floor((N + 1)/2);
xleft = -center + 1;
x = (1:N) - 1 + xleft;
x = repmat(x, N, 1);

ytop = center - 1;
y = (N:-1:1).' - N + ytop;
y = repmat(y, 1, N);

len = size(p,1);
ctrIdx = ceil(len/2); % index of the center of the projections

% Zero pad the projections to size 1+2*ceil(N/sqrt(2)) if this
% quantity is greater than the length of the projections
imgDiag = 2*ceil(N/sqrt(2))+1; % largest distance through image.
if size(p,1) < imgDiag
    rz = imgDiag - size(p,1); % how many rows of zeros
    p = [zeros(ceil(rz/2),size(p,2)); p; zeros(floor(rz/2),size(p,2))];
    ctrIdx = ctrIdx+ceil(rz/2);
end

img = iradonmexcu(N, single(theta), single(x), single(y), single(p));

img = img*pi/(2*length(theta));

return;

Source Code Part 3. iradonmexcu.cu

#include <string.h>
#include "mex.h"

/*
 * Declare a prototype of a kernel function.
 */
__global__ void iradon(float *img, int N, int len, int view, float *theta, float *x, float *y, float *p);

/*
 * Declare a main function.
 */
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    /*
     * Connect from the MATLAB ARRAY POINTER
     * to the MEX ARRAY POINTER.
     */
    int N = (int) mxGetScalar(prhs[0]);
    float *theta = (float *) mxGetData(prhs[1]);
    float *x = (float *) mxGetData(prhs[2]);
    float *y = (float *) mxGetData(prhs[3]);
    float *p = (float *) mxGetData(prhs[4]);

    int len = (int) mxGetM(prhs[4]);
    int view = (int) mxGetN(prhs[4]);

    /*
     * Create an OUT MATRIX.
     */
    mwSize DIM = 2;
    mwSize DIMS[2] = {(mwSize) N, (mwSize) N};

    plhs[0] = mxCreateNumericArray(DIM, (const mwSize *)DIMS, mxSINGLE_CLASS, mxREAL);
    float *img = (float *) mxGetData(plhs[0]);

    /*
     * Create a GPU ARRAY.
     * Copy a MEMORY from MEX ARRAY
     * to GPU ARRAY.
     */
    float *gtheta = 0;
    float *gx = 0;
    float *gy = 0;
    float *gp = 0;
    float *gimg = 0;

    cudaMalloc(&gtheta, sizeof(float)*view);
    cudaMemset(gtheta, 0, sizeof(float)*view);
    cudaMemcpy(gtheta, theta, sizeof(float)*view, cudaMemcpyHostToDevice);

    cudaMalloc(&gx, sizeof(float)*N*N);
    cudaMemset(gx, 0, sizeof(float)*N*N);    /* was gtheta: wrong buffer, overran its allocation */
    cudaMemcpy(gx, x, sizeof(float)*N*N, cudaMemcpyHostToDevice);

    cudaMalloc(&gy, sizeof(float)*N*N);
    cudaMemset(gy, 0, sizeof(float)*N*N);    /* was gtheta: wrong buffer, overran its allocation */
    cudaMemcpy(gy, y, sizeof(float)*N*N, cudaMemcpyHostToDevice);

    cudaMalloc(&gp, sizeof(float)*len*view);
    cudaMemset(gp, 0, sizeof(float)*len*view);
    cudaMemcpy(gp, p, sizeof(float)*len*view, cudaMemcpyHostToDevice);

    cudaMalloc(&gimg, sizeof(float)*N*N);
    cudaMemset(gimg, 0, sizeof(float)*N*N);

    /*
     * Create a 3-d GRID.
     * 1st GRID : X axis of OBJECT
     * 2nd GRID : Y axis of OBJECT
     * 3rd GRID : view axis of PROJECTION
     */
    int threadNum = 8;
    dim3 blk;
    dim3 grd;

    blk.x = threadNum;
    blk.y = threadNum;
    blk.z = threadNum;

    grd.x = ceil(float(N)/threadNum);
    grd.y = ceil(float(N)/threadNum);
    grd.z = ceil(float(view)/threadNum);

    /*
     * Call the kernel using a CUDA runtime API.
     */
    iradon<<<grd, blk>>>(gimg, N, len, view, gtheta, gx, gy, gp);

    /*
     * Copy a MEMORY from GPU ARRAY
     * to MEX ARRAY.
     */
    cudaMemcpy(img, gimg, sizeof(float)*N*N, cudaMemcpyDeviceToHost);

    /*
     * The GPU ARRAYS MUST BE destroyed.
     */
    cudaFree(gtheta); gtheta = 0;
    cudaFree(gx); gx = 0;
    cudaFree(gy); gy = 0;
    cudaFree(gp); gp = 0;
    cudaFree(gimg); gimg = 0;

    return ;
}

/*
 * Define the kernel function.
 */
__global__ void iradon(float *img, int N, int len, int view, float *theta, float *x, float *y, float *p)
{
    /*
     * Calculate a global linear index, assuming a 3-d GRID.
     * 1st GRID : X axis of OBJECT
     * 2nd GRID : Y axis of OBJECT
     * 3rd GRID : view axis of PROJECTION
     *
     * Skip the thread if its index exceeds the boundary.
     */
    int xIdx = blockDim.x*blockIdx.x + threadIdx.x;
    int yIdx = blockDim.y*blockIdx.y + threadIdx.y;
    int viewIdx = blockDim.z*blockIdx.z + threadIdx.z;

    if (xIdx >= N) return ;
    if (yIdx >= N) return ;
    if (viewIdx >= view) return ;

    int xyIdx = N*xIdx + yIdx;

    /*
     * Calculate the detector position (t) that matches the xy position of the object.
     */
    int ctrIdx = int(ceil((len - 1.0f)/2.0f) + 1) - 1;
    float t = x[xyIdx]*cosf(theta[viewIdx]) + y[xyIdx]*sinf(theta[viewIdx]) + ctrIdx;

    /*
     * Fetch the projection data using 1d interpolation.
     */
    int t_b = floor(t);
    int t_u = ceil(t);

    int pIdx_b = len*viewIdx + t_b;
    int pIdx_u = len*viewIdx + t_u;

    float wgt_b = t_u - t;
    float wgt_u = 1 - wgt_b;

    float projContrib = wgt_b*p[pIdx_b] + wgt_u*p[pIdx_u];

    /*
     * Accumulate the projection data on the object matrix.
     */
    atomicAdd(&img[xyIdx], projContrib);

    return ;
}

Source Code Part 3. demo_iradon()

clear; clc;

% mex iradonmexcu.cu;

%%
N = 512;
VIEW = 720;
THETA = linspace(0, 360, VIEW + 1); THETA(end) = [];

% OBJECT
OBJ = phantom(N);

% RADON
PROJ = radon(OBJ, THETA);

% IRADON ON MATLAB
tic;
RECON_MATLAB = iradon(PROJ, THETA, N);
tmat = toc;

% IRADON ON GPU
tic;
RECON_GPU = iradon_gpu(PROJ, THETA, N);
tgpu = toc;

% MAX ABSOLUTE ERROR
ERR = max(abs(RECON_MATLAB(:) - RECON_GPU(:)));

% MEAN SQUARED ERROR
% ERR1 = mse(OBJ(:), RECON_MATLAB(:));
% ERR2 = mse(OBJ(:), RECON_GPU(:));

%% FIGURE
figure(1); colormap gray;
subplot(231); imagesc(OBJ, [0, 1]); title('object'); axis off image;
subplot(2,3,[2,3]); imagesc(PROJ); title('projection'); axis off;
subplot(234); imagesc(RECON_MATLAB, [0, 1]); title('recon_{matlab}'); axis off image;
subplot(235); imagesc(RECON_GPU, [0, 1]); title('recon_{gpu}'); axis off image;
subplot(236); imagesc(RECON_MATLAB - RECON_GPU); title('difference_{matlab - GPU}'); axis off image;

% DISPLAY
display(['execution time on MATLAB : ' num2str(tmat, '%2.6f') ' sec']);
display(['execution time on GPU : ' num2str(tgpu, '%2.6f') ' sec']);
display(['acceleration ratio for MATLAB / GPU : x ' num2str(tmat/tgpu, '%2.4f')]);

Thank you

Bio Imaging & Signal Processing Lab. (BISPL)
Dept. of Bio & Brain Engineering
Korea Advanced Institute of Science & Technology (KAIST)

MATRIX INVERSION SPEED UP WITH CUDA JORGE SORIANO PINEDO ELECTRICAL AND COMPUTER ENGINEERING

MATRIX INVERSION SPEED UP WITH CUDA JORGE SORIANO PINEDO ELECTRICAL AND COMPUTER ENGINEERING MATRIX INVERSION SPEED UP WITH CUDA BY JORGE SORIANO PINEDO ELECTRICAL AND COMPUTER ENGINEERING Submitted in partial fulfillment of the requirements for the degree of Electrical engineering in ECE in the

More information

[ NOTICE ] YOU HAVE TO INSTALL ALL FILES PREVIOUSLY, BECAUSE A INSTALLATION TIME IS TOO LONG.

[ NOTICE ] YOU HAVE TO INSTALL ALL FILES PREVIOUSLY, BECAUSE A INSTALLATION TIME IS TOO LONG. [ NOTICE ] YOU HAVE TO INSTALL ALL FILES PREVIOUSLY, BECAUSE A INSTALLATION TIME IS TOO LONG. Setting up Development Environment 1. O/S Platform is Windows. 2. Install the latest NVIDIA Driver. 11) 3.

More information

Face Recognition. Programming Project. Haofu Liao, BSEE. Department of Electrical and Computer Engineering. Northeastern University.

Face Recognition. Programming Project. Haofu Liao, BSEE. Department of Electrical and Computer Engineering. Northeastern University. Face Recognition Programming Project Haofu Liao, BSEE June 23, 2013 Department of Electrical and Computer Engineering Northeastern University 1. How to build the PCA Mex Funtion 1.1 Basic Information The

More information

USING LAPACK SOLVERS FOR STRUCTURED MATRICES WITHIN MATLAB

USING LAPACK SOLVERS FOR STRUCTURED MATRICES WITHIN MATLAB USING LAPACK SOLVERS FOR STRUCTURED MATRICES WITHIN MATLAB Radek Frízel*, Martin Hromčík**, Zdeněk Hurák***, Michael Šebek*** *Department of Control Engineering, Faculty of Electrical Engineering, Czech

More information

Purpose: How to train an MLP neural network in MATLAB environment!

Purpose: How to train an MLP neural network in MATLAB environment! Purpose: How to train an MLP neural network in MATLAB environment! that is For good computations, we need good formulae for good algorithms; and good visualization for good illustration and proper testing

More information

ECE 408 / CS 483 Final Exam, Fall 2014

ECE 408 / CS 483 Final Exam, Fall 2014 ECE 408 / CS 483 Final Exam, Fall 2014 Thursday 18 December 2014 8:00 to 11:00 Central Standard Time You may use any notes, books, papers, or other reference materials. In the interest of fair access across

More information

Implementation of Parma Polyhedron Library -functions in MATLAB

Implementation of Parma Polyhedron Library -functions in MATLAB Implementation of Parma Polyhedron Library -functions in MATLAB Leonhard Asselborn Electrical and Computer Engineering Carnegie Mellon University Group meeting Oct. 21 st 2010 Overview Introduction Motivation

More information

Introduction to Matlab/Octave

Introduction to Matlab/Octave Introduction to Matlab/Octave February 28, 2014 This document is designed as a quick introduction for those of you who have never used the Matlab/Octave language, as well as those of you who have used

More information

MATLAB: The challenges involved in providing a high-level language on a GPU

MATLAB: The challenges involved in providing a high-level language on a GPU MATLAB: The challenges involved in providing a high-level language on a GPU Jos Martin jos.martin@mathworks.co.uk 2013 The MathWorks, Inc. 1 Agenda Why did we introduce GPU support? What did we do? What

More information

PyCUDA. An Introduction

PyCUDA. An Introduction PyCUDA An Introduction Scripting GPUs with PyCUDA Why do Scripting for GPUs? GPUs are everything that scripting languages are not: Highly parallel Very architecture-sensitive Built for maximum FP/memory

More information

GPU Programming. Alan Gray, James Perry EPCC The University of Edinburgh

GPU Programming. Alan Gray, James Perry EPCC The University of Edinburgh GPU Programming EPCC The University of Edinburgh Contents NVIDIA CUDA C Proprietary interface to NVIDIA architecture CUDA Fortran Provided by PGI OpenCL Cross platform API 2 NVIDIA CUDA CUDA allows NVIDIA

More information

GPU Programming. Lecture 2: CUDA C Basics. Miaoqing Huang University of Arkansas 1 / 34

GPU Programming. Lecture 2: CUDA C Basics. Miaoqing Huang University of Arkansas 1 / 34 1 / 34 GPU Programming Lecture 2: CUDA C Basics Miaoqing Huang University of Arkansas 2 / 34 Outline Evolvements of NVIDIA GPU CUDA Basic Detailed Steps Device Memories and Data Transfer Kernel Functions

More information

A System for Interfacing MATLAB with External Software Geared Toward Automatic Differentiation

A System for Interfacing MATLAB with External Software Geared Toward Automatic Differentiation A System for Interfacing MATLAB with External Software Geared Toward Automatic Differentiation 02. Sept. 2006 - ICMS 2006 - Castro-Urdiales H. Martin Bücker, RWTH Aachen University, Institute for Scientific

More information

Introduction to GPU programming. Introduction to GPU programming p. 1/17

Introduction to GPU programming. Introduction to GPU programming p. 1/17 Introduction to GPU programming Introduction to GPU programming p. 1/17 Introduction to GPU programming p. 2/17 Overview GPUs & computing Principles of CUDA programming One good reference: David B. Kirk

More information

Computation to Core Mapping Lessons learned from a simple application

Computation to Core Mapping Lessons learned from a simple application Lessons learned from a simple application Matrix Multiplication Used as an example throughout the course Goal for today: Show the concept of Computation-to-Core Mapping Block schedule, Occupancy, and thread

More information

Lessons learned from a simple application

Lessons learned from a simple application Computation to Core Mapping Lessons learned from a simple application A Simple Application Matrix Multiplication Used as an example throughout the course Goal for today: Show the concept of Computation-to-Core

More information

GPU programming CUDA C. GPU programming,ii. COMP528 Multi-Core Programming. Different ways:

GPU programming CUDA C. GPU programming,ii. COMP528 Multi-Core Programming. Different ways: COMP528 Multi-Core Programming GPU programming,ii www.csc.liv.ac.uk/~alexei/comp528 Alexei Lisitsa Dept of computer science University of Liverpool a.lisitsa@.liverpool.ac.uk Different ways: GPU programming

More information

This is a draft chapter from an upcoming CUDA textbook by David Kirk from NVIDIA and Prof. Wen-mei Hwu from UIUC.

This is a draft chapter from an upcoming CUDA textbook by David Kirk from NVIDIA and Prof. Wen-mei Hwu from UIUC. David Kirk/NVIDIA and Wen-mei Hwu, 2006-2008 This is a draft chapter from an upcoming CUDA textbook by David Kirk from NVIDIA and Prof. Wen-mei Hwu from UIUC. Please send any comment to dkirk@nvidia.com

More information

Master Thesis Accelerating Image Registration on GPUs

Master Thesis Accelerating Image Registration on GPUs Master Thesis Accelerating Image Registration on GPUs A proof of concept migration of FAIR to CUDA Sunil Ramgopal Tatavarty Prof. Dr. Ulrich Rüde Dr.-Ing.Harald Köstler Lehrstuhl für Systemsimulation Universität

More information

Lecture 9. Outline. CUDA : a General-Purpose Parallel Computing Architecture. CUDA Device and Threads CUDA. CUDA Architecture CUDA (I)

Lecture 9. Outline. CUDA : a General-Purpose Parallel Computing Architecture. CUDA Device and Threads CUDA. CUDA Architecture CUDA (I) Lecture 9 CUDA CUDA (I) Compute Unified Device Architecture 1 2 Outline CUDA Architecture CUDA Architecture CUDA programming model CUDA-C 3 4 CUDA : a General-Purpose Parallel Computing Architecture CUDA

More information

ECE 662 Spring 2008 Homework #2

ECE 662 Spring 2008 Homework #2 PROBLEM 1 The number of experiments was 1000. The MATLAB codes are attached. To simplify things the same set of data were used for training and testing purposes. From the graph it is obvious that substituting

More information

Parallel Computing with MATLAB

Parallel Computing with MATLAB Parallel Computing with MATLAB Jos Martin Principal Architect, Parallel Computing Tools jos.martin@mathworks.co.uk 1 2013 The MathWorks, Inc. www.matlabexpo.com Code used in this presentation can be found

More information

Parallel Computing with MATLAB

Parallel Computing with MATLAB Parallel Computing with MATLAB Jos Martin Principal Architect, Parallel Computing Tools jos.martin@mathworks.co.uk 2015 The MathWorks, Inc. 1 Overview Scene setting Task Parallel (par*) Why doesn t it

More information

GPU Computing with CUDA

GPU Computing with CUDA GPU Computing with CUDA Hands-on: Shared Memory Use (Dot Product, Matrix Multiplication) Dan Melanz & Dan Negrut Simulation-Based Engineering Lab Wisconsin Applied Computing Center Department of Mechanical

More information

From Image to Video: Real-time Medical Imaging with MRI

From Image to Video: Real-time Medical Imaging with MRI From Image to Video: Real-time Medical Imaging with MRI Sebastian Schaetz, Martin Uecker BiomedNMR Forschungs GmbH at the MPI for biophysical Chemistry, Goettingen, Germany Electrical Engineering and Computer

More information

Outline 2011/10/8. Memory Management. Kernels. Matrix multiplication. CIS 565 Fall 2011 Qing Sun

Outline 2011/10/8. Memory Management. Kernels. Matrix multiplication. CIS 565 Fall 2011 Qing Sun Outline Memory Management CIS 565 Fall 2011 Qing Sun sunqing@seas.upenn.edu Kernels Matrix multiplication Managing Memory CPU and GPU have separate memory spaces Host (CPU) code manages device (GPU) memory

More information

Programming with CUDA, WS09

Programming with CUDA, WS09 Programming with CUDA and Parallel Algorithms Waqar Saleem Jens Müller Lecture 3 Thursday, 29 Nov, 2009 Recap Motivational videos Example kernel Thread IDs Memory overhead CUDA hardware and programming

More information

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1 Lecture 15: Introduction to GPU programming Lecture 15: Introduction to GPU programming p. 1 Overview Hardware features of GPGPU Principles of GPU programming A good reference: David B. Kirk and Wen-mei

More information

Mit MATLAB auf der Überholspur Methoden zur Beschleunigung von MATLAB Anwendungen

Mit MATLAB auf der Überholspur Methoden zur Beschleunigung von MATLAB Anwendungen Mit MATLAB auf der Überholspur Methoden zur Beschleunigung von MATLAB Anwendungen Frank Graeber Application Engineering MathWorks Germany 2013 The MathWorks, Inc. 1 Speed up the serial code within core

More information

Module 3: CUDA Execution Model -I. Objective

Module 3: CUDA Execution Model -I. Objective ECE 8823A GPU Architectures odule 3: CUDA Execution odel -I 1 Objective A more detailed look at kernel execution Data to thread assignment To understand the organization and scheduling of threads Resource

More information

Term Project report for EE5302

Term Project report for EE5302 Term Project report for EE5302 Submitted by: Vidhya N.S Murthy Student ID: 100060564 Project statement To study the statistical properties of a video signal and remove spatial redundancy using different

More information

Fast Bilateral Filter GPU implementation

Fast Bilateral Filter GPU implementation Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design, University of Erlangen-Nuremberg July 21, 2016 Overview

More information

Matrix Multiplication in CUDA. A case study

Matrix Multiplication in CUDA. A case study Matrix Multiplication in CUDA A case study 1 Matrix Multiplication: A Case Study Matrix multiplication illustrates many of the basic features of memory and thread management in CUDA Usage of thread/block

More information

How to get Real Time Data into Matlab

How to get Real Time Data into Matlab How to get Real Time Data into Matlab First make sure you have Visual Studio 6.0 installed. You re going to have to build a mex file in visual studio. A mex file is just C code that has been compiled to

More information

Implementation of the finite-difference method for solving Maxwell`s equations in MATLAB language on a GPU

Implementation of the finite-difference method for solving Maxwell`s equations in MATLAB language on a GPU Implementation of the finite-difference method for solving Maxwell`s equations in MATLAB language on a GPU 1 1 Samara National Research University, Moskovskoe Shosse 34, Samara, Russia, 443086 Abstract.

More information

Supporting Data Parallelism in Matcloud: Final Report

Supporting Data Parallelism in Matcloud: Final Report Supporting Data Parallelism in Matcloud: Final Report Yongpeng Zhang, Xing Wu 1 Overview Matcloud is an on-line service to run Matlab-like script on client s web browser. Internally it is accelerated by

More information

CS 179: GPU Computing. Lecture 2: The Basics

CS 179: GPU Computing. Lecture 2: The Basics CS 179: GPU Computing Lecture 2: The Basics Recap Can use GPU to solve highly parallelizable problems Performance benefits vs. CPU Straightforward extension to C language Disclaimer Goal for Week 1: Fast-paced

More information

Vector Addition on the Device: main()

Vector Addition on the Device: main() Vector Addition on the Device: main() #define N 512 int main(void) { int *a, *b, *c; // host copies of a, b, c int *d_a, *d_b, *d_c; // device copies of a, b, c int size = N * sizeof(int); // Alloc space

More information

GPU & High Performance Computing (by NVIDIA) CUDA. Compute Unified Device Architecture Florian Schornbaum

GPU & High Performance Computing (by NVIDIA) CUDA. Compute Unified Device Architecture Florian Schornbaum GPU & High Performance Computing (by NVIDIA) CUDA Compute Unified Device Architecture 29.02.2008 Florian Schornbaum GPU Computing Performance In the last few years the GPU has evolved into an absolute

More information

Memory concept. Grid concept, Synchronization. GPU Programming. Szénási Sándor.

Memory concept. Grid concept, Synchronization. GPU Programming.   Szénási Sándor. Memory concept Grid concept, Synchronization GPU Programming http://cuda.nik.uni-obuda.hu Szénási Sándor szenasi.sandor@nik.uni-obuda.hu GPU Education Center of Óbuda University MEMORY CONCEPT Off-chip

More information

CUDA PROGRAMMING MODEL. Carlo Nardone Sr. Solution Architect, NVIDIA EMEA

CUDA PROGRAMMING MODEL. Carlo Nardone Sr. Solution Architect, NVIDIA EMEA CUDA PROGRAMMING MODEL Carlo Nardone Sr. Solution Architect, NVIDIA EMEA CUDA: COMMON UNIFIED DEVICE ARCHITECTURE Parallel computing architecture and programming model GPU Computing Application Includes

More information

CSE 599 I Accelerated Computing - Programming GPUS. Parallel Pattern: Sparse Matrices

CSE 599 I Accelerated Computing - Programming GPUS. Parallel Pattern: Sparse Matrices CSE 599 I Accelerated Computing - Programming GPUS Parallel Pattern: Sparse Matrices Objective Learn about various sparse matrix representations Consider how input data affects run-time performance of

More information

Accelerating image registration on GPUs

Accelerating image registration on GPUs Accelerating image registration on GPUs Harald Köstler, Sunil Ramgopal Tatavarty SIAM Conference on Imaging Science (IS10) 13.4.2010 Contents Motivation: Image registration with FAIR GPU Programming Combining

More information

Solving the heat equation with CUDA

Solving the heat equation with CUDA Solving the heat equation with CUDA Oliver Meister January 09 th 2013 Last Tutorial CSR kernel - scalar One row per thread No coalesced memory access Non-uniform matrices CSR kernel - vectorized One row

More information

Tiled Matrix Multiplication

Tiled Matrix Multiplication Tiled Matrix Multiplication Basic Matrix Multiplication Kernel global void MatrixMulKernel(int m, m, int n, n, int k, k, float* A, A, float* B, B, float* C) C) { int Row = blockidx.y*blockdim.y+threadidx.y;

More information

Parallel Computing with MATLAB on Discovery Cluster

Parallel Computing with MATLAB on Discovery Cluster Parallel Computing with MATLAB on Discovery Cluster Northeastern University Research Computing: Nilay K Roy, MS Computer Science, Ph.D Computational Physics Lets look at the Discovery Cluster Matlab environment

More information

Lecture 3: Introduction to CUDA

Lecture 3: Introduction to CUDA CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 3: Introduction to CUDA Some slides here are adopted from: NVIDIA teaching kit Mohamed Zahran (aka Z) mzahran@cs.nyu.edu

More information

CUDA Parallelism Model

CUDA Parallelism Model GPU Teaching Kit Accelerated Computing CUDA Parallelism Model Kernel-Based SPMD Parallel Programming Multidimensional Kernel Configuration Color-to-Grayscale Image Processing Example Image Blur Example

More information

1/25/12. Administrative

1/25/12. Administrative Administrative L3: Memory Hierarchy Optimization I, Locality and Data Placement Next assignment due Friday, 5 PM Use handin program on CADE machines handin CS6235 lab1 TA: Preethi Kotari - Email:

More information

Basics of CADA Programming - CUDA 4.0 and newer

Basics of CADA Programming - CUDA 4.0 and newer Basics of CADA Programming - CUDA 4.0 and newer Feb 19, 2013 Outline CUDA basics Extension of C Single GPU programming Single node multi-gpus programing A brief introduction on the tools Jacket CUDA FORTRAN

More information

Scientific discovery, analysis and prediction made possible through high performance computing.

Scientific discovery, analysis and prediction made possible through high performance computing. Scientific discovery, analysis and prediction made possible through high performance computing. An Introduction to GPGPU Programming Bob Torgerson Arctic Region Supercomputing Center November 21 st, 2013

More information

GPU Computing Workshop CSU Getting Started. Garland Durham Quantos Analytics

GPU Computing Workshop CSU Getting Started. Garland Durham Quantos Analytics 1 GPU Computing Workshop CSU 2013 Getting Started Garland Durham Quantos Analytics nvidia-smi 2 At command line, run command nvidia-smi to get/set GPU properties. nvidia-smi Options: -q query -L list attached

More information

Lecture 8: GPU Programming. CSE599G1: Spring 2017

Lecture 8: GPU Programming. CSE599G1: Spring 2017 Lecture 8: GPU Programming CSE599G1: Spring 2017 Announcements Project proposal due on Thursday (4/28) 5pm. Assignment 2 will be out today, due in two weeks. Implement GPU kernels and use cublas library

An Introduction to GPGPU Programming - CUDA Architecture

An Introduction to GPGPU Programming - CUDA Architecture Rafia Inam Mälardalen Real-Time Research Centre Mälardalen University, Västerås, Sweden http://www.mrtc.mdh.se rafia.inam@mdh.se CONTENTS

Using a GPU in InSAR processing to improve performance

Using a GPU in InSAR processing to improve performance Using a GPU in InSAR processing to improve performance Rob Mellors, ALOS PI 152 San Diego State University David Sandwell University of California, San Diego What is a GPU? (Graphic Processor Unit) A graphics

High Performance Linear Algebra on Data Parallel Co-Processors I

GPU Computing Master Class. Development Tools

GPU Computing Master Class. Development Tools GPU Computing Master Class Development Tools Generic CUDA debugger goals Support all standard debuggers across all OS Linux GDB, TotalView and DDD Windows Visual Studio Mac - Xcode Support CUDA runtime

Matlab for Engineers

Matlab for Engineers Matlab for Engineers Alistair Johnson 31st May 2012 Centre for Doctoral Training in Healthcare Innovation Institute of Biomedical Engineering Department of Engineering Science University of Oxford Supported

Register file. A single large register file (ex. 16K registers) is partitioned among the threads of the dispatched blocks.

Register file. A single large register file (ex. 16K registers) is partitioned among the threads of the dispatched blocks. Sharing the resources of an SM Warp 0 Warp 1 Warp 47 Register file A single large register file (ex. 16K registers) is partitioned among the threads of the dispatched blocks Shared A single SRAM (ex. 16KB)

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions.

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions. Schedule CUDA Digging further into the programming manual Application Programming Interface (API) text only part, sorry Image utilities (simple CUDA examples) Performance considerations Matrix multiplication

LAB 2: Resampling. Maria Magnusson, 2012 (last update August 2016) with contributions from Katarina Flood, Qingfen Lin and Henrik Turbell

LAB 2: Resampling. Maria Magnusson, 2012 (last update August 2016) with contributions from Katarina Flood, Qingfen Lin and Henrik Turbell Computer Vision Laboratory, Dept. of Electrical Engineering, Linköping

High Performance Computing and GPU Programming

High Performance Computing and GPU Programming High Performance Computing and GPU Programming Lecture 1: Introduction Objectives C++/CPU Review GPU Intro Programming Model Objectives Objectives Before we begin a little motivation Intel Xeon 2.67GHz

Information Coding / Computer Graphics, ISY, LiTH. Introduction to CUDA. Ingemar Ragnemalm Information Coding, ISY

Information Coding / Computer Graphics, ISY, LiTH. Introduction to CUDA. Ingemar Ragnemalm Information Coding, ISY Introduction to CUDA Ingemar Ragnemalm Information Coding, ISY This lecture: Programming model and language Introduction to memory spaces and memory access Shared memory Matrix multiplication example Lecture

matlab_intro.html Page 1 of 5 Date: Tuesday, September 6, 2005

matlab_intro.html Page 1 of 5 Date: Tuesday, September 6, 2005 matlab_intro.html Page 1 of 5 % Introducing Matlab % adapted from Eero Simoncelli (http://www.cns.nyu.edu/~eero) % and Hany Farid (http://www.cs.dartmouth.edu/~farid) % and Serge Belongie (http://www-cse.ucsd.edu/~sjb)

GPU acceleration of 3D forward and backward projection using separable footprints for X-ray CT image reconstruction

GPU acceleration of 3D forward and backward projection using separable footprints for X-ray CT image reconstruction GPU acceleration of 3D forward and backward projection using separable footprints for X-ray CT image reconstruction Meng Wu and Jeffrey A. Fessler EECS Department University of Michigan Fully 3D Image

Uppsala University. CUDA Exercises. Karl Ljungkvist. 25 February 2016

Uppsala University. CUDA Exercises. Karl Ljungkvist. 25 February 2016 CUDA Exercises Karl Ljungkvist 25 February 2016 Karl Ljungkvist karl.ljungkvist@it.uu.se 2016-02-25 2/21 Example: PDE solver Heat equation: Discretization: u n+1 i,j k u n i,j Time stepping: u n+1 i,j

CSE 591: GPU Programming. Memories. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591: GPU Programming. Memories. Klaus Mueller. Computer Science Department Stony Brook University CSE 591: GPU Programming Memories Klaus Mueller Computer Science Department Stony Brook University Importance of Memory Access Efficiency Every loop iteration has two global memory accesses two floating

Page 1 of 7 E7 Spring 2009 Midterm I SID: UNIVERSITY OF CALIFORNIA, BERKELEY Department of Civil and Environmental Engineering. Practice Midterm 01

Page 1 of 7 E7 Spring 2009 Midterm I SID: UNIVERSITY OF CALIFORNIA, BERKELEY Department of Civil and Environmental Engineering. Practice Midterm 01 Page 1 of E Spring Midterm I SID: UNIVERSITY OF CALIFORNIA, BERKELEY Practice Midterm 1 minutes pts Question Points Grade 1 4 3 6 4 16 6 1 Total Notes (a) Write your name and your SID on the top right

GPU Memory. GPU Memory. Memory issue for CUDA programming. Copyright 2013 Yong Cao, Referencing UIUC ECE498AL Course Notes

GPU Memory. GPU Memory. Memory issue for CUDA programming. Copyright 2013 Yong Cao, Referencing UIUC ECE498AL Course Notes Memory issue for CUDA programming CUDA Variable Type Qualifiers Variable declaration Memory Scope Lifetime device local int LocalVar; local thread thread device shared int SharedVar; shared block block

Hands-on CUDA Optimization. CUDA Workshop

Hands-on CUDA Optimization. CUDA Workshop Hands-on CUDA Optimization CUDA Workshop Exercise Today we have a progressive exercise The exercise is broken into 5 steps If you get lost you can always catch up by grabbing the corresponding directory

GPU Memory Memory issue for CUDA programming

GPU Memory Memory issue for CUDA programming Memory issue for CUDA programming Variable declaration Memory Scope Lifetime device local int LocalVar; local thread thread device shared int SharedVar; shared block block device int GlobalVar; global

Introduction to CUDA CME343 / ME339 18 May 2011 James Balfour [jbalfour@nvidia.com] NVIDIA Research

Introduction to CUDA CME343 / ME339 18 May 2011 James Balfour [jbalfour@nvidia.com] NVIDIA Research CUDA Programming system for machines with GPUs Programming Language Compilers Runtime Environments Drivers

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-Research Centre

Compiling and Executing CUDA Programs in Emulation Mode. High Performance Scientific Computing II ICSI-541 Spring 2010

Compiling and Executing CUDA Programs in Emulation Mode. High Performance Scientific Computing II ICSI-541 Spring 2010 Compiling and Executing CUDA Programs in Emulation Mode High Performance Scientific Computing II ICSI-541 Spring 2010 Topic Overview Overview of compiling and executing CUDA programs in emulation mode

Lecture 2: Introduction to CUDA C

Lecture 2: Introduction to CUDA C CS/EE 217 GPU Architecture and Programming Lecture 2: Introduction to CUDA C David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013 1 CUDA /OpenCL Execution Model Integrated host+device app C program Serial or

Lecture 10: Introduction to CUDA

Lecture 10: Introduction to CUDA Ingemar Ragnemalm Information Coding, ISY Labs Some revisions may happen while making final adjustments for Linux Mint. Last minute changes may occur.

Image convolution with CUDA

Image convolution with CUDA Image convolution with CUDA Lecture Alexey Abramov abramov _at_ physik3.gwdg.de Georg-August University, Bernstein Center for Computational Neuroscience, III Physikalisches Institut, Göttingen, Germany

Review. Lecture 10. Today's Outline. Review. 03b.cu. 03?.cu CUDA (II) Matrix addition CUDA-C API

Review. Lecture 10. Today's Outline. Review. 03b.cu. 03?.cu CUDA (II) Matrix addition CUDA-C API Review Lecture 10 CUDA (II) host device CUDA many core processor threads thread blocks grid # threads >> # of cores to be efficient Threads within blocks can cooperate Threads between thread blocks cannot

High Performance and GPU Computing in MATLAB

High Performance and GPU Computing in MATLAB High Performance and GPU Computing in MATLAB Jan Houška houska@humusoft.cz http://www.humusoft.cz 1 About HUMUSOFT Company: Humusoft s.r.o. Founded: 1990 Number of employees: 18 Location: Praha 8, Pobřežní

COMP 322: Fundamentals of Parallel Programming. Flynn's Taxonomy for Parallel Computers

COMP 322: Fundamentals of Parallel Programming. Flynn's Taxonomy for Parallel Computers COMP 322: Fundamentals of Parallel Programming Lecture 37: General-Purpose GPU (GPGPU) Computing Max Grossman, Vivek Sarkar Department of Computer Science, Rice University max.grossman@rice.edu, vsarkar@rice.edu

CS 376b Computer Vision

CS 376b Computer Vision CS 376b Computer Vision 09 / 25 / 2014 Instructor: Michael Eckmann Today's Topics Questions? / Comments? Enhancing images / masks Cross correlation Convolution C++ Cross-correlation Cross-correlation involves

Wir schaffen Wissen heute für morgen

Wir schaffen Wissen heute für morgen (We create knowledge today for tomorrow) The MEXperience, Getting to Grips with MATLAB Executable Files Jan Chrin Paul Scherrer Institut Contents Motivation Context of SwissFEL Injector Test Facility (2010-2014)

GPU CUDA Programming

GPU CUDA Programming GPU CUDA Programming Jeong-Gun Lee, Department of Computer Engineering, Hallym University, Embedded SoC Laboratory www.onchip.net Email: Jeonggun.Lee@hallym.ac.kr ALTERA JOINT LAB Introduction Contents Multicore/Manycore and GPU GPU on Medical Applications

CUDA GPGPU Workshop CUDA/GPGPU Arch&Prog

CUDA GPGPU Workshop CUDA/GPGPU Arch&Prog CUDA GPGPU Workshop 2012 CUDA/GPGPU Arch&Prog Yip Wichita State University 7/11/2012 GPU-Hardware perspective GPU as PCI device Original PCI PCIe Inside GPU architecture GPU as PCI device Traditional PC

Example 1: Color-to-Grayscale Image Processing

Example 1: Color-to-Grayscale Image Processing GPU Teaching Kit Accelerated Computing Lecture 16: CUDA Parallelism Model Examples Example 1: Color-to-Grayscale Image Processing RGB Color Image Representation Each pixel in an image is an RGB value The

AIRWC: Accelerated Image Registration With CUDA. Richard Ansorge 1st August 2008 BSS Group, Cavendish Laboratory, University of Cambridge UK.

AIRWC: Accelerated Image Registration With CUDA. Richard Ansorge 1st August 2008 BSS Group, Cavendish Laboratory, University of Cambridge UK. We report some initial results using an NVIDIA 9600 GX card

Data Parallel Execution Model

Data Parallel Execution Model CS/EE 217 GPU Architecture and Parallel Programming Lecture 3: Kernel-Based Data Parallel Execution Model David Kirk/NVIDIA and Wen-mei Hwu, 2007-2013 Objective To understand the organization and scheduling

Parallel Computing. Lecture 19: CUDA - I

Parallel Computing. Lecture 19: CUDA - I CSCI-UA.0480-003 Parallel Computing Lecture 19: CUDA - I Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com GPU w/ local DRAM (device) Behind CUDA CPU (host) Source: http://hothardware.com/reviews/intel-core-i5-and-i7-processors-and-p55-chipset/?page=4

CUDA Basics. July 6, 2016

CUDA Basics. July 6, 2016 Member of the Helmholtz Association CUDA Basics July 6, 2016 CUDA Kernels Parallel portion of application: execute as a kernel Entire GPU executes kernel, many threads CUDA threads: Lightweight Fast switching

Image Processing CS 6640 : An Introduction to MATLAB Basics Bo Wang and Avantika Vardhan

Image Processing CS 6640 : An Introduction to MATLAB Basics Bo Wang and Avantika Vardhan Image Processing CS 6640 : An Introduction to MATLAB Basics Bo Wang and Avantika Vardhan August 29, 2014 1 Getting Started with MATLAB 1.1 Resources 1) CADE Lab: Matlab is installed on all the CADE lab

Introduction to Parallel Programming

Introduction to Parallel Programming Introduction to Parallel Programming Pablo Brubeck Department of Physics Tecnologico de Monterrey October 14, 2016 Student Chapter Tecnológico de Monterrey Tecnológico de Monterrey Student Chapter Outline

CUDA C Programming Mark Harris NVIDIA Corporation

CUDA C Programming Mark Harris NVIDIA Corporation CUDA C Programming Mark Harris NVIDIA Corporation Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory Models Programming Environment

Double-Precision Matrix Multiply on CUDA

Double-Precision Matrix Multiply on CUDA Double-Precision Matrix Multiply on CUDA Parallel Computation (CSE 60), Assignment Andrew Conegliano (A5055) Matthias Springer (A995007) GID G--665 February, 0 Assumptions All matrices are square matrices

Introduction to GPGPUs and to CUDA programming model

Introduction to GPGPUs and to CUDA programming model Introduction to GPGPUs and to CUDA programming model www.cineca.it Marzia Rivi m.rivi@cineca.it GPGPU architecture CUDA programming model CUDA efficient programming Debugging & profiling tools CUDA libraries

Introduction to GPGPU and GPU-architectures

Introduction to GPGPU and GPU-architectures Introduction to GPGPU and GPU-architectures Henk Corporaal Gert-Jan van den Braak http://www.es.ele.tue.nl/ Contents 1. What is a GPU 2. Programming a GPU 3. GPU thread scheduling 4. GPU performance bottlenecks

Introduction to CUDA Programming

Introduction to CUDA Programming Introduction to CUDA Programming Steve Lantz Cornell University Center for Advanced Computing October 30, 2013 Based on materials developed by CAC and TACC Outline Motivation for GPUs and CUDA Overview

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka USE OF FOR Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague Mini workshop on advanced numerical methods

CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012

CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012 CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012 Overview 1. Memory Access Efficiency 2. CUDA Memory Types 3. Reducing Global Memory Traffic 4. Example: Matrix-Matrix

High Performance Computing and GPU Programming

High Performance Computing and GPU Programming High Performance Computing and GPU Programming Lecture 3: GPU Application GPU Intro Review Simple Example Memory Effects GPU Intro Review GPU Intro Review Shared Multiprocessors Global parallelism Assign

CS 470 Spring Other Architectures. Mike Lam, Professor. (with an aside on linear algebra)

CS 470 Spring Other Architectures. Mike Lam, Professor. (with an aside on linear algebra) CS 470 Spring 2016 Mike Lam, Professor Other Architectures (with an aside on linear algebra) Parallel Systems Shared memory (uniform global address space) Primary story: make faster computers Programming
