APPENDIX. Source code. Part 1. Part 2. Part 3.
- Allen Adams
- 5 years ago
Transcription
APPENDIX. Source code: Part 1, Part 2, Part 3

Source Code Part 1: arrayfun, pagefun, bsxfun
Source Code Part 1. arrayfun()

% foo.m (separate function file)
function y = foo(x)
y = 1 + x.*(1 + x.*(1 + x.*(1 + x.*(1 + x.*(1 + x.*(1 + x.*(1 + x.*(1 + x./9)./8)./7)./6)./5)./4)./3)./2);

%% arrayfun
clear; clc;

% display(' GPU Performance [ CPU vs GPU vs GPU with arrayfun ] ');
display(' GPU Performance [ CPU vs GPU with arrayfun ] ');
type foo.m;

cpuX = rand(1e4, 1e3);
gpuX = gpuArray(cpuX);

% CPU COMPUTING
tic;
cpuY = foo(cpuX);
% cpuY = arrayfun(@foo, cpuX);
tcpu = toc;

% GPU ONLY COMPUTING
tic;
gpuY1 = foo(gpuX);
tgpu1 = toc;

% GPU WITH ARRAYFUN COMPUTING
tic;
gpuY2 = arrayfun(@foo, gpuX);
tgpu2 = toc;

% MAXIMUM ABSOLUTE ERROR
err1 = max(abs(cpuY(:) - gpuY1(:)));
err2 = max(abs(cpuY(:) - gpuY2(:)));

% DISPLAY
display(['execution time on CPU : ' num2str(tcpu, '%2.6f') ' sec']);
% display(['execution time on GPU only : ' num2str(tgpu1, '%2.6f') ' sec']);
display(['execution time on GPU with arrayfun : ' num2str(tgpu2, '%2.6f') ' sec']);
% display(['maximum absolute error for CPU / GPU only : ' num2str(err1, '%2.4e')]);
display(['maximum absolute error for CPU / GPU with arrayfun : ' num2str(err2, '%2.4e')]);
% display(['acceleration ratio for CPU / GPU only : x ' num2str(tcpu/tgpu1, '%2.4f')]);
display(['acceleration ratio for CPU / GPU with arrayfun : x ' num2str(tcpu/tgpu2, '%2.4f')]);
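The nested expression in foo.m is a Horner-style evaluation of the degree-9 Taylor polynomial of exp(x), which gives a quick way to sanity-check results independently of MATLAB. The sketch below is a host-side C++ reference written for this note; the name `foo_reference` is hypothetical and not part of the original slides.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical reference for foo.m: the same nested expression written as a loop.
// It evaluates sum_{k=0}^{9} x^k / k!, the degree-9 Taylor polynomial of exp(x).
double foo_reference(double x) {
    assert(std::isfinite(x));
    double y = 1.0;
    // Work from the innermost term (1 + x/9) outwards to (1 + x*(...)/1).
    for (int k = 9; k >= 1; --k) {
        y = 1.0 + x * y / k;
    }
    return y;
}
```

For inputs in [0, 1], as produced by rand, the value stays close to std::exp(x), since the truncated tail of the series is small on that interval.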
Source Code Part 1. pagefun()

%% pagefun
clear;
% clc;

% display(' GPU Performance [ CPU vs GPU vs GPU with pagefun ] ');
display(' GPU Performance [ CPU vs GPU with pagefun ] ');

cpuX = rand(1e2, 1e2, 1e1, 1e1);
gpuX = gpuArray(cpuX);

% CPU COMPUTING
tic;
cpuY = zeros(size(cpuX));
for i = 1:size(cpuX, 3)
    for j = 1:size(cpuX, 4)
        cpuY(:,:,i,j) = transpose(cpuX(:,:,i,j));
    end
end
tcpu = toc;

% GPU ONLY COMPUTING
tic;
gpuY1 = zeros(size(gpuX), 'gpuArray');
for i = 1:size(cpuX, 3)
    for j = 1:size(cpuX, 4)
        gpuY1(:,:,i,j) = transpose(gpuX(:,:,i,j));
    end
end
tgpu1 = toc;

% GPU WITH PAGEFUN COMPUTING
tic;
gpuY2 = pagefun(@transpose, gpuX);
tgpu2 = toc;

% MAXIMUM ABSOLUTE ERROR
err1 = max(abs(cpuY(:) - gpuY1(:)));
err2 = max(abs(cpuY(:) - gpuY2(:)));

% DISPLAY
display(['execution time on CPU : ' num2str(tcpu, '%2.6f') ' sec']);
% display(['execution time on GPU only : ' num2str(tgpu1, '%2.6f') ' sec']);
display(['execution time on GPU with pagefun : ' num2str(tgpu2, '%2.6f') ' sec']);
% display(['maximum absolute error for CPU / GPU only : ' num2str(err1, '%2.4e')]);
display(['maximum absolute error for CPU / GPU with pagefun : ' num2str(err2, '%2.4e')]);
% display(['acceleration ratio for CPU / GPU only : x ' num2str(tcpu/tgpu1, '%2.4f')]);
display(['acceleration ratio for CPU / GPU with pagefun : x ' num2str(tcpu/tgpu2, '%2.4f')]);
Source Code Part 1. bsxfun()

%% bsxfun
clear;
% clc;

% display(' GPU Performance [ CPU vs GPU vs GPU with bsxfun ] ');
display(' GPU Performance [ CPU vs GPU with bsxfun ] ');

cpuX = rand(1e4, 1e3);
cpuY = mean(cpuX);
gpuX = gpuArray(cpuX);
gpuY = gpuArray(cpuY);

% CPU
tic;
cpuZ = zeros(size(cpuX));
for j = 1:size(cpuX, 2)
    cpuZ(:, j) = minus(cpuX(:, j), cpuY(j));
end
% cpuZ = cpuX - repmat(cpuY, [size(cpuX, 1), 1]);
tcpu = toc;

% GPU ONLY
tic;
% gpuZ1 = zeros(size(gpuX), 'gpuArray');
% for j = 1:size(cpuX, 2)
%     gpuZ1(:, j) = minus(gpuX(:, j), gpuY(j));
% end
gpuZ1 = gpuX - repmat(gpuY, [size(gpuX, 1), 1]);
tgpu1 = toc;

% GPU WITH BSXFUN COMPUTING
tic;
gpuZ2 = bsxfun(@minus, gpuX, gpuY);
tgpu2 = toc;

% MAXIMUM ABSOLUTE ERROR
err1 = max(abs(cpuZ(:) - gpuZ1(:)));
err2 = max(abs(cpuZ(:) - gpuZ2(:)));

% DISPLAY
display(['execution time on CPU : ' num2str(tcpu, '%2.6f') ' sec']);
% display(['execution time on GPU only : ' num2str(tgpu1, '%2.6f') ' sec']);
display(['execution time on GPU with bsxfun : ' num2str(tgpu2, '%2.6f') ' sec']);
% display(['maximum absolute error for CPU / GPU only : ' num2str(err1, '%2.4e')]);
display(['maximum absolute error for CPU / GPU with bsxfun : ' num2str(err2, '%2.4e')]);
% display(['acceleration ratio for CPU / GPU only : x ' num2str(tcpu/tgpu1, '%2.4f')]);
display(['acceleration ratio for CPU / GPU with bsxfun : x ' num2str(tcpu/tgpu2, '%2.4f')]);
Source Code Part 2: mrics_gpu.m, test_mrics.m
Source Code Part 2. mrics_gpu()

function u = mrics_gpu(R, f, mu, lambda, gamma, nInner, nBreg)

[rows,cols] = size(f);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% GPUARRAY
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
f = gpuArray(f);
R = gpuArray(R);

% Reserve memory for the auxiliary variables
f0 = f;
u  = zeros(rows,cols, 'gpuArray');
x  = zeros(rows,cols, 'gpuArray');
y  = zeros(rows,cols, 'gpuArray');
bx = zeros(rows,cols, 'gpuArray');
by = zeros(rows,cols, 'gpuArray');

% Build Kernels
scale = sqrt(rows*cols);
murf = ifft2(mu*(conj(R).*f))*scale;

uker = zeros(rows,cols, 'gpuArray');
uker(1,1) = 4; uker(1,2) = -1; uker(2,1) = -1; uker(rows,1) = -1; uker(1,cols) = -1;
uker = mu*(conj(R).*R) + lambda*fft2(uker) + gamma;

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Do the reconstruction
for outer = 1:nBreg
    for inner = 1:nInner
        % update u
        rhs = murf + lambda*Dxt(x-bx) + lambda*Dyt(y-by) + gamma*u;
        u = ifft2(fft2(rhs)./uker);

        % update x and y
        dx = Dx(u);
        dy = Dy(u);
        [x,y] = shrink2(dx+bx, dy+by, 1/lambda);

        % update bregman parameters
        bx = bx + dx - x;
        by = by + dy - y;
    end

    f = f + f0 - R.*fft2(u)/scale;
    murf = ifft2(mu*R.*f)*scale;
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% GATHER
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
u = gather(u);

return;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function d = Dx(u)
[rows,cols] = size(u);
d = zeros(rows,cols, 'gpuArray');
d(:,2:cols) = u(:,2:cols) - u(:,1:cols-1);
d(:,1) = u(:,1) - u(:,cols);
return

function d = Dxt(u)
[rows,cols] = size(u);
d = zeros(rows,cols, 'gpuArray');
d(:,1:cols-1) = u(:,1:cols-1) - u(:,2:cols);
d(:,cols) = u(:,cols) - u(:,1);
return

function d = Dy(u)
[rows,cols] = size(u);
d = zeros(rows,cols, 'gpuArray');
d(2:rows,:) = u(2:rows,:) - u(1:rows-1,:);
d(1,:) = u(1,:) - u(rows,:);
return

function d = Dyt(u)
[rows,cols] = size(u);
d = zeros(rows,cols, 'gpuArray');
d(1:rows-1,:) = u(1:rows-1,:) - u(2:rows,:);
d(rows,:) = u(rows,:) - u(1,:);
return

function [xs,ys] = shrink2(x,y,lambda)
s = sqrt(x.*conj(x) + y.*conj(y));
ss = s - lambda;
ss = ss.*(ss>0);
s = s + (s<lambda);   % avoid division by zero where s < lambda
ss = ss./s;
xs = ss.*x;
ys = ss.*y;
return;
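The shrink2 helper applies a joint (isotropic) soft-threshold to the pair (x, y): the magnitude is reduced by lambda and clamped at zero, while the direction is kept. The following scalar, real-valued C++ sketch illustrates that rule for a single element; the name `shrink2_scalar` is hypothetical, and the early return is a restructuring of the MATLAB masking trick rather than a literal translation.

```cpp
#include <cassert>
#include <cmath>
#include <utility>

// Hypothetical scalar, real-valued counterpart of shrink2 for one (x, y) pair.
std::pair<double, double> shrink2_scalar(double x, double y, double lambda) {
    assert(lambda >= 0.0);
    const double s = std::sqrt(x * x + y * y);  // joint magnitude
    const double ss = s - lambda;               // magnitude after shrinking
    if (ss <= 0.0) {
        // Magnitude at or below the threshold: the result is zero.
        return {0.0, 0.0};
    }
    // Here s > lambda >= 0, so dividing by s is safe.
    return {x * ss / s, y * ss / s};
}
```

For example, the pair (3, 4) has magnitude 5; with lambda = 1 it is scaled by 4/5, while any pair with magnitude below lambda maps to (0, 0).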
Source Code Part 2. test_mrics()

N = 512;          % The image will be NxN
sparsity = .25;   % use only 25% of the K-space data for CS

mu = .1;
lambda = .1;
gamma = mu/1000;

% build an image of a square
% image = zeros(N,N);
% image(N/4:3*N/4,N/4:3*N/4) = 255;
image = phantom(N)*255;

% build the sampling matrix, R
R = rand(N,N);
R = double(R<sparsity);
R(1, 1) = 1;      % DC POINT

% Form the CS data
F = R.*fft2(image)/N;

% Recover the image
tic;
recovered = mrics(R, F, mu, lambda, gamma, 10, 5);
toc;

tic;
recovered2 = mrics_gpu(R, F, mu, lambda, gamma, 10, 5);
toc;

wnd = [0, 255];

% build a figure to display results
figure;
subplot(2,2,1);
imagesc(abs(image), wnd); colormap('gray');
title('Original');

subplot(2,2,2);
imagesc(abs(R)); colormap('gray');
title('R');

subplot(2,2,3);
% imagesc(abs(ifft2(F))); colormap('gray');
imagesc(abs(recovered), wnd); colormap('gray');
title('Split Bregman recovery (CPU)');

subplot(2,2,4);
imagesc(abs(recovered2), wnd); colormap('gray');
title('Split Bregman recovery (GPU)');

figure;
imagesc(abs(recovered - recovered2)); colormap('gray'); colorbar;
title('CPU_{recovery} - GPU_{recovery}');
Source Code Part 3: iradon_gpu.m, iradonmexcu.cu, demo_iradon.m
Source Code Part 3. iradon_gpu()

function [img,H] = iradon_gpu(varargin)
% parse_inputs and filterProjections are helper functions as in MATLAB's
% iradon.m; they are not listed on this slide.

narginchk(2,6);

[p,theta,filter,d,interp,N] = parse_inputs(varargin{:});

[p,H] = filterProjections(p, filter, d);

% Define the x & y axes for the reconstructed image so that the origin
% (center) is in the spot which RADON would choose.
center = floor((N + 1)/2);
xleft = -center + 1;
x = (1:N) - 1 + xleft;
x = repmat(x, N, 1);

ytop = center - 1;
y = (N:-1:1).' - N + ytop;
y = repmat(y, 1, N);

len = size(p,1);
ctrIdx = ceil(len/2);   % index of the center of the projections

% Zero pad the projections to size 1+2*ceil(N/sqrt(2)) if this
% quantity is greater than the length of the projections
imgDiag = 2*ceil(N/sqrt(2))+1;   % largest distance through image.
if size(p,1) < imgDiag
    rz = imgDiag - size(p,1);    % how many rows of zeros
    p = [zeros(ceil(rz/2),size(p,2)); p; zeros(floor(rz/2),size(p,2))];
    ctrIdx = ctrIdx + ceil(rz/2);
end

img = iradonmexcu(N, single(theta), single(x), single(y), single(p));
img = img*pi/(2*length(theta));

return;
Source Code Part 3. iradonmexcu.cu

#include <string.h>
#include "mex.h"

/* Declare a prototype of the kernel function. */
__global__ void iradon(float *img, int N, int len, int view,
                       float *theta, float *x, float *y, float *p);

/* Declare the MEX entry point. */
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    /* Connect from the MATLAB ARRAY POINTER
     * to the MEX ARRAY POINTER. */
    int N        = (int) mxGetScalar(prhs[0]);
    float *theta = (float *) mxGetData(prhs[1]);
    float *x     = (float *) mxGetData(prhs[2]);
    float *y     = (float *) mxGetData(prhs[3]);
    float *p     = (float *) mxGetData(prhs[4]);

    int len  = (int) mxGetM(prhs[4]);
    int view = (int) mxGetN(prhs[4]);

    /* Create the OUT MATRIX. */
    mwSize DIM = 2;
    mwSize DIMS[2] = {(mwSize) N, (mwSize) N};

    plhs[0] = mxCreateNumericArray(DIM, (const mwSize *) DIMS, mxSINGLE_CLASS, mxREAL);
    float *img = (float *) mxGetData(plhs[0]);
    /* Create the GPU ARRAYS.
     * Copy MEMORY from the MEX ARRAYS
     * to the GPU ARRAYS. */
    float *gtheta = 0;
    float *gx     = 0;
    float *gy     = 0;
    float *gp     = 0;
    float *gimg   = 0;

    cudaMalloc(&gtheta, sizeof(float)*view);
    cudaMemset(gtheta, 0, sizeof(float)*view);
    cudaMemcpy(gtheta, theta, sizeof(float)*view, cudaMemcpyHostToDevice);

    /* The slide memsets gtheta here and for gy; gx and gy are the intended targets. */
    cudaMalloc(&gx, sizeof(float)*N*N);
    cudaMemset(gx, 0, sizeof(float)*N*N);
    cudaMemcpy(gx, x, sizeof(float)*N*N, cudaMemcpyHostToDevice);

    cudaMalloc(&gy, sizeof(float)*N*N);
    cudaMemset(gy, 0, sizeof(float)*N*N);
    cudaMemcpy(gy, y, sizeof(float)*N*N, cudaMemcpyHostToDevice);

    cudaMalloc(&gp, sizeof(float)*len*view);
    cudaMemset(gp, 0, sizeof(float)*len*view);
    cudaMemcpy(gp, p, sizeof(float)*len*view, cudaMemcpyHostToDevice);

    cudaMalloc(&gimg, sizeof(float)*N*N);
    cudaMemset(gimg, 0, sizeof(float)*N*N);
    /* Create a 3-D GRID.
     * 1st GRID : X axis of OBJECT
     * 2nd GRID : Y axis of OBJECT
     * 3rd GRID : view axis of PROJECTION */
    int threadnum = 8;

    dim3 blk;
    dim3 grd;

    blk.x = threadnum;
    blk.y = threadnum;
    blk.z = threadnum;

    grd.x = ceil(float(N)/threadnum);
    grd.y = ceil(float(N)/threadnum);
    grd.z = ceil(float(view)/threadnum);

    /* Call the kernel using the CUDA runtime API. */
    iradon<<<grd, blk>>>(gimg, N, len, view, gtheta, gx, gy, gp);

    /* Copy MEMORY from the GPU ARRAY
     * to the MEX ARRAY. */
    cudaMemcpy(img, gimg, sizeof(float)*N*N, cudaMemcpyDeviceToHost);

    /* The GPU ARRAYS MUST BE destroyed. */
    cudaFree(gtheta); gtheta = 0;
    cudaFree(gx);     gx = 0;
    cudaFree(gy);     gy = 0;
    cudaFree(gp);     gp = 0;
    cudaFree(gimg);   gimg = 0;

    return;
}
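The launch configuration above rounds each grid dimension up so that every pixel and every view is covered by at least one thread. A minimal host-side sketch of that rounding, using integer arithmetic instead of the float ceil, is shown below; the helper name `blocks_for` is hypothetical and not part of the original source.

```cpp
#include <cassert>

// Hypothetical helper: number of blocks of `threads` threads needed to cover `n` items.
// Integer ceiling division; for positive inputs it matches ceil(float(n)/threads).
int blocks_for(int n, int threads) {
    assert(n >= 0 && threads > 0);
    return (n + threads - 1) / threads;
}
```

With the values used in demo_iradon (N = 512, 720 views) and 8 threads per axis, this gives a 64 x 64 x 90 grid of 8 x 8 x 8 blocks, that is, 512 threads per block. When a dimension is not a multiple of 8, the last block contains surplus threads, which is why the kernel checks its indices against N and view before doing any work.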
/* Define the kernel function. */
__global__ void iradon(float *img, int N, int len, int view,
                       float *theta, float *x, float *y, float *p)
{
    /* Calculate a global linear index, assuming a 3-D GRID.
     * 1st GRID : X axis of OBJECT
     * 2nd GRID : Y axis of OBJECT
     * 3rd GRID : view axis of PROJECTION
     *
     * Skip the thread if its index exceeds the boundary. */
    int xidx    = blockDim.x*blockIdx.x + threadIdx.x;
    int yidx    = blockDim.y*blockIdx.y + threadIdx.y;
    int viewidx = blockDim.z*blockIdx.z + threadIdx.z;

    if (xidx >= N) return;
    if (yidx >= N) return;
    if (viewidx >= view) return;

    int xyidx = N*xidx + yidx;

    /* Calculate the detector position (t) matching the xy position of the object. */
    int ctridx = int(ceil((len - 1.0f)/2.0f) + 1) - 1;
    float t = x[xyidx]*cosf(theta[viewidx]) + y[xyidx]*sinf(theta[viewidx]) + ctridx;

    /* Fetch the projection data using 1-D linear interpolation. */
    int t_b = floor(t);
    int t_u = ceil(t);

    int pidx_b = len*viewidx + t_b;
    int pidx_u = len*viewidx + t_u;

    float wgt_b = t_u - t;
    float wgt_u = 1 - wgt_b;

    float projcontrib = wgt_b*p[pidx_b] + wgt_u*p[pidx_u];

    /* Accumulate the projection data on the object matrix. */
    atomicAdd(&img[xyidx], projcontrib);

    return;
}
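One way to check the kernel is to compare a few pixels against a serial host-side computation of the same sum over views. The sketch below follows the kernel's indexing and linear interpolation for a single pixel; the function name `backproject_pixel`, the use of `std::vector`, and the bounds guard are assumptions added here and are not in the original source.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for one image pixel at coordinates (x, y).
// p is stored column-major as len x view, matching the MATLAB layout used by the kernel.
float backproject_pixel(float x, float y,
                        const std::vector<float>& theta,
                        const std::vector<float>& p, int len) {
    assert(len > 0 && p.size() == theta.size() * static_cast<std::size_t>(len));
    // Zero-based index of the projection center, as computed in the kernel.
    const int ctridx = static_cast<int>(std::ceil((len - 1.0f) / 2.0f) + 1) - 1;
    float acc = 0.0f;
    for (std::size_t v = 0; v < theta.size(); ++v) {
        // Detector position hit by this pixel for view v.
        const float t = x * std::cos(theta[v]) + y * std::sin(theta[v]) + ctridx;
        const int t_b = static_cast<int>(std::floor(t));
        const int t_u = static_cast<int>(std::ceil(t));
        // Bounds guard: an addition for safety, not present in the kernel.
        if (t_b < 0 || t_u >= len) continue;
        const float wgt_b = t_u - t;       // weight of the lower sample
        const float wgt_u = 1.0f - wgt_b;  // weight of the upper sample
        acc += wgt_b * p[v * len + t_b] + wgt_u * p[v * len + t_u];
    }
    return acc;
}
```

The kernel spreads this loop over the third grid axis and combines the per-view contributions with atomicAdd, so the order of summation differs and the GPU result can deviate from this reference by floating-point rounding.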
Source Code Part 3. demo_iradon()

clear; clc;
% mex iradonmexcu.cu;

%%
N = 512;
VIEW = 720;
THETA = linspace(0, 360, VIEW + 1);
THETA(end) = [];

% OBJECT
OBJ = phantom(N);

% RADON
PROJ = radon(OBJ, THETA);

% IRADON ON MATLAB
tic;
RECON_MATLAB = iradon(PROJ, THETA, N);
tmat = toc;

% IRADON ON GPU
tic;
RECON_GPU = iradon_gpu(PROJ, THETA, N);
tgpu = toc;

% MAX ABSOLUTE ERROR
ERR = max(abs(RECON_MATLAB(:) - RECON_GPU(:)));

% MEAN SQUARED ERROR
% ERR1 = mse(OBJ(:), RECON_MATLAB(:));
% ERR2 = mse(OBJ(:), RECON_GPU(:));

%% FIGURE
figure(1); colormap gray;
subplot(231); imagesc(OBJ, [0, 1]); title('object'); axis off image;
subplot(2,3,[2,3]); imagesc(PROJ); title('projection'); axis off;
subplot(234); imagesc(RECON_MATLAB, [0, 1]); title('recon_{matlab}'); axis off image;
subplot(235); imagesc(RECON_GPU, [0, 1]); title('recon_{gpu}'); axis off image;
subplot(236); imagesc(RECON_MATLAB - RECON_GPU); title('difference_{matlab - GPU}'); axis off image;

% DISPLAY
display(['execution time on MATLAB : ' num2str(tmat, '%2.6f') ' sec']);
display(['execution time on GPU : ' num2str(tgpu, '%2.6f') ' sec']);
display(['acceleration ratio for MATLAB / GPU : x ' num2str(tmat/tgpu, '%2.4f')]);
Thank you

Bio Imaging & Signal Processing Lab. (BISPL)
Dept. of Bio & Brain Engineering
Korea Advanced Institute of Science & Technology (KAIST)
More informationOverview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming
Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.
More informationCompiling and Executing CUDA Programs in Emulation Mode. High Performance Scientific Computing II ICSI-541 Spring 2010
Compiling and Executing CUDA Programs in Emulation Mode High Performance Scientific Computing II ICSI-541 Spring 2010 Topic Overview Overview of compiling and executing CUDA programs in emulation mode
More informationLecture 2: Introduction to CUDA C
CS/EE 217 GPU Architecture and Programming Lecture 2: Introduction to CUDA C David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013 1 CUDA /OpenCL Execution Model Integrated host+device app C program Serial or
More informationLecture 10!! Introduction to CUDA!
1(50) Lecture 10 Introduction to CUDA Ingemar Ragnemalm Information Coding, ISY 1(50) Laborations Some revisions may happen while making final adjustments for Linux Mint. Last minute changes may occur.
More informationImage convolution with CUDA
Image convolution with CUDA Lecture Alexey Abramov abramov _at_ physik3.gwdg.de Georg-August University, Bernstein Center for Computational Neuroscience, III Physikalisches Institut, Göttingen, Germany
More informationReview. Lecture 10. Today s Outline. Review. 03b.cu. 03?.cu CUDA (II) Matrix addition CUDA-C API
Review Lecture 10 CUDA (II) host device CUDA many core processor threads thread blocks grid # threads >> # of cores to be efficient Threads within blocks can cooperate Threads between thread blocks cannot
More informationHigh Performance and GPU Computing in MATLAB
High Performance and GPU Computing in MATLAB Jan Houška houska@humusoft.cz http://www.humusoft.cz 1 About HUMUSOFT Company: Humusoft s.r.o. Founded: 1990 Number of employees: 18 Location: Praha 8, Pobřežní
More informationCOMP 322: Fundamentals of Parallel Programming. Flynn s Taxonomy for Parallel Computers
COMP 322: Fundamentals of Parallel Programming Lecture 37: General-Purpose GPU (GPGPU) Computing Max Grossman, Vivek Sarkar Department of Computer Science, Rice University max.grossman@rice.edu, vsarkar@rice.edu
More informationCS 376b Computer Vision
CS 376b Computer Vision 09 / 25 / 2014 Instructor: Michael Eckmann Today s Topics Questions? / Comments? Enhancing images / masks Cross correlation Convolution C++ Cross-correlation Cross-correlation involves
More informationWir schaffen Wissen heute für morgen
Wir schaffen Wissen heute für morgen The MEXperience, Getting to Grips with MATLAB Executable Files Jan Chrin Paul Scherrer Institut Contents Motivation Context of SwissFEL Injector Test Facility (2010-2014)
More informationGPU CUDA Programming
GPU CUDA Programming 이정근 (Jeong-Gun Lee) 한림대학교컴퓨터공학과, 임베디드 SoC 연구실 www.onchip.net Email: Jeonggun.Lee@hallym.ac.kr ALTERA JOINT LAB Introduction 차례 Multicore/Manycore and GPU GPU on Medical Applications
More informationCUDA GPGPU Workshop CUDA/GPGPU Arch&Prog
CUDA GPGPU Workshop 2012 CUDA/GPGPU Arch&Prog Yip Wichita State University 7/11/2012 GPU-Hardware perspective GPU as PCI device Original PCI PCIe Inside GPU architecture GPU as PCI device Traditional PC
More informationExample 1: Color-to-Grayscale Image Processing
GPU Teaching Kit Accelerated Computing Lecture 16: CUDA Parallelism Model Examples Example 1: Color-to-Grayscale Image Processing RGB Color Image Representation Each pixel in an image is an RGB value The
More informationAIRWC : Accelerated Image Registration With CUDA. Richard Ansorge 1 st August BSS Group, Cavendish Laboratory, University of Cambridge UK.
AIRWC : Accelerated Image Registration With CUDA Richard Ansorge 1 st August 2008 BSS Group, Cavendish Laboratory, University of Cambridge UK. We report some initial results using an NVIDA 9600 GX card
More informationData Parallel Execution Model
CS/EE 217 GPU Architecture and Parallel Programming Lecture 3: Kernel-Based Data Parallel Execution Model David Kirk/NVIDIA and Wen-mei Hwu, 2007-2013 Objective To understand the organization and scheduling
More informationParallel Computing. Lecture 19: CUDA - I
CSCI-UA.0480-003 Parallel Computing Lecture 19: CUDA - I Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com GPU w/ local DRAM (device) Behind CUDA CPU (host) Source: http://hothardware.com/reviews/intel-core-i5-and-i7-processors-and-p55-chipset/?page=4
More informationCUDA Basics. July 6, 2016
Mitglied der Helmholtz-Gemeinschaft CUDA Basics July 6, 2016 CUDA Kernels Parallel portion of application: execute as a kernel Entire GPU executes kernel, many threads CUDA threads: Lightweight Fast switching
More informationImage Processing CS 6640 : An Introduction to MATLAB Basics Bo Wang and Avantika Vardhan
Image Processing CS 6640 : An Introduction to MATLAB Basics Bo Wang and Avantika Vardhan August 29, 2014 1 Getting Started with MATLAB 1.1 Resources 1) CADE Lab: Matlab is installed on all the CADE lab
More informationIntroduction to Parallel Programming
Introduction to Parallel Programming Pablo Brubeck Department of Physics Tecnologico de Monterrey October 14, 2016 Student Chapter Tecnológico de Monterrey Tecnológico de Monterrey Student Chapter Outline
More informationCUDA C Programming Mark Harris NVIDIA Corporation
CUDA C Programming Mark Harris NVIDIA Corporation Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory Models Programming Environment
More informationDouble-Precision Matrix Multiply on CUDA
Double-Precision Matrix Multiply on CUDA Parallel Computation (CSE 60), Assignment Andrew Conegliano (A5055) Matthias Springer (A995007) GID G--665 February, 0 Assumptions All matrices are square matrices
More informationIntroduction to GPGPUs and to CUDA programming model
Introduction to GPGPUs and to CUDA programming model www.cineca.it Marzia Rivi m.rivi@cineca.it GPGPU architecture CUDA programming model CUDA efficient programming Debugging & profiling tools CUDA libraries
More informationIntroduction to GPGPU and GPU-architectures
Introduction to GPGPU and GPU-architectures Henk Corporaal Gert-Jan van den Braak http://www.es.ele.tue.nl/ Contents 1. What is a GPU 2. Programming a GPU 3. GPU thread scheduling 4. GPU performance bottlenecks
More informationIntroduction to CUDA Programming
Introduction to CUDA Programming Steve Lantz Cornell University Center for Advanced Computing October 30, 2013 Based on materials developed by CAC and TACC Outline Motivation for GPUs and CUDA Overview
More informationDIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka
USE OF FOR Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague Mini workshop on advanced numerical methods
More informationCUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012
CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012 Overview 1. Memory Access Efficiency 2. CUDA Memory Types 3. Reducing Global Memory Traffic 4. Example: Matrix-Matrix
More informationHigh Performance Computing and GPU Programming
High Performance Computing and GPU Programming Lecture 3: GPU Application GPU Intro Review Simple Example Memory Effects GPU Intro Review GPU Intro Review Shared Multiprocessors Global parallelism Assign
More informationCS 470 Spring Other Architectures. Mike Lam, Professor. (with an aside on linear algebra)
CS 470 Spring 2016 Mike Lam, Professor Other Architectures (with an aside on linear algebra) Parallel Systems Shared memory (uniform global address space) Primary story: make faster computers Programming
More information