MATLAB AND PARALLEL COMPUTING


Image Processing & Communication, vol. 17, no. 4

MAGDALENA SZYMCZYK, PIOTR SZYMCZYK
AGH University of Science and Technology, Kraków, Poland

Abstract. MATLAB is a technical computing language used in a variety of fields, such as control systems, image and signal processing, visualization and financial process simulations, in an easy-to-use environment. MATLAB offers "toolboxes", which are specialized libraries for a variety of scientific domains, and a simplified interface to high-performance libraries (LAPACK, BLAS and FFTW, among others). MATLAB is now enriched by the possibility of parallel computing with the Parallel Computing Toolbox™ and MATLAB Distributed Computing Server™. In this article we present some of the key features of parallel MATLAB applications, focused on using GPU processors for image processing.

1 Introduction

The article provides an overview of the capabilities for increasing the speed of calculations in MATLAB using parallel programming on the graphics processing unit (GPU). An example algorithm implemented in CUDA and used with parallel MATLAB was elaborated and presented, and tests of the algorithm's speedup were performed and described.

Microprocessors based on a single central processing unit (CPU) drove rapid performance growth and cost reductions in computer applications for more than two decades. This performance improvement allowed application software to provide more functionality, better user interfaces and more useful results, faster. The users, in turn, demanded more and more improvements, so a positive cycle for the computer industry was created. These fast changes in software relied on advances in hardware and the introduction of new generations of processors. This drive slowed around 2003 due to power-consumption issues, which limit the amount of productive activity that can be performed in each clock period within a single CPU.
Nowadays all microprocessor producers have switched to multi-core solutions to increase processing power. This change has a great impact on the development of software. Until now, software applications have mostly been written as sequential programs, but with each new generation of microprocessors it will be necessary to write parallel programs, in which multiple threads of execution cooperate to achieve the required functionality faster. This is nothing new: the high-performance computing community has been developing parallel programs for large-scale, expensive computers for decades, but these solutions were rather unpopular in ordinary applications. Nowadays all new microprocessors are parallel computers, so the number of applications that need to be developed as parallel programs has increased dramatically, and there is now a great need for software developers to learn about parallel programming. Using several machines simultaneously in an application is harder than writing the sequential version of it, so computing languages which make this process easier are desirable.

2 Parallel MATLAB

Parallel MATLAB is an extension of MATLAB that takes advantage of multicore desktop machines and clusters. MATLAB supports parallel computing in several ways: some features only need to be enabled in the settings, while in other situations programs may need some adaptation or the purchase of a special toolbox. MATLAB supports three types of parallel computing: multithreaded (implicit parallelism), distributed computing and explicit parallelism.

In multithreaded parallelism, one instance of MATLAB automatically generates multiple simultaneous instruction streams. Multiple processors or cores, sharing the memory of a single computer, execute these streams. Elementwise computations on big matrices benefit most from this solution. In distributed computing, multiple instances of MATLAB run multiple independent computations on separate computers, each with its own memory; in most cases, a single program is run many times with different parameters. In explicit parallelism, several instances of MATLAB run on several processors or computers, often with separate memories, and simultaneously execute a single MATLAB command or M-function. New programming constructions, including parallel loops and distributed arrays, express this parallelism.
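Implicit multithreading requires no code changes at all; a minimal sketch (the matrix size is illustrative):

```matlab
% Implicit (multithreaded) parallelism: elementwise operations on a
% large matrix are automatically split across the cores of one machine.
A = rand(4000);            % 4000x4000 random matrix
B = exp(A) .* sin(A);      % elementwise work; MATLAB threads it internally
```

No toolbox is needed for this form of parallelism; the number of computational threads is a MATLAB preference.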
These different kinds of parallelism can be combined in different ways. For example, for multithreaded parallelism the number of threads can be set in the MATLAB Preferences panel; these threads are used by the multithreaded versions of the BLAS (Basic Linear Algebra Subroutines). A distributed computing job might invoke multithreaded functions on each machine and then use a distributed array to collect the final results. For vector arguments, the MATLAB elementary function library is multithreaded too [5]. It is not an easy task to choose the most appropriate form of parallelism for a given type of application.

3 MATLAB Parallel Computing Toolbox (PCT) and Distributed Computing Server (MDCS)

The Parallel Computing Toolbox runs on a desktop and can take advantage of up to 8 cores there. Parallel programs can be run interactively or in batch. The MATLAB Distributed Computing Server (MDCS) controls parallel execution of MATLAB on a cluster with tens or hundreds of cores. Fig. 1 presents the typical architecture of such a parallel system.

Fig. 1: Typical architecture of a parallel system

The three ways to write a parallel MATLAB program which were mentioned earlier can be expressed by: the parfor statement, the simplest path to parallelism, which indicates that a given for loop can be executed in parallel; the spmd statement, which creates cooperating, synchronized processing; and the task feature, which creates multiple independent programs. The parfor approach is limited but a simple way to get started; the spmd statement is powerful but requires rethinking the program and data; the task approach is simple but suitable only for computations that need almost no communication. A simplified version of the MPI programming model can also be used in MATLAB. There is a single program, but it is divided into client and worker (lab) sections which can cooperate. Each worker process has its own memory and a separate ID, and ideally runs on a separate core. When the parfor statement is included in client code, it indicates that a given for loop can be executed in parallel, so iterations of the loop are automatically divided up among the workers and the results gathered back onto the client. Using parfor requires that the iterations are completely independent; there are also some restrictions on data access.

Spmd programming includes distributed arrays. A distributed array is logically one array, and a large set of MATLAB commands can treat it that way; however, portions of the array are scattered across multiple processors, which means such an array can be really large. The local part of a distributed array can be operated on by its processor very quickly. A distributed array can be operated on by explicit commands to the spmd workers that own pieces of the array, or implicitly by commands at the global (client) level. The client and each worker have separate workspaces, but it is possible for them to communicate and trade information. Instead of having an array created on the client and distributed to the workers, it is possible to construct a distributed array by having each worker build its own piece; the result is still a distributed array. Codistributing the creation of an array has several advantages: the array is built faster, in parallel; you skip the communication cost of distributing it; and the array might be too large to build entirely on one core (or processor).

Parallel MATLAB jobs can be run directly, that is, interactively. The matlabpool command is used to reserve a given number of workers on the local (or perhaps a remote) machine.
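The codistributed construction described above can be sketched as follows (a minimal example using the matlabpool-era PCT syntax; the matrix size is illustrative):

```matlab
matlabpool open 4                  % reserve four local workers
spmd
    % each worker builds only its own piece of one logically global array
    D = codistributed.rand(1000);  % distributed 1000x1000 random matrix
    L = getLocalPart(D);           % purely local, and therefore fast, access
end
t = gather(sum(sum(D)));           % implicit global operation, result on the client
matlabpool close
```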
Once these workers are available, the user can type commands, run scripts or evaluate functions which contain parfor statements, and the workers will cooperate in producing results. Parallel MATLAB jobs can also be run indirectly: the batch command is used to specify MATLAB code to be executed, and it starts the computation in the background.
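A minimal batch sketch (myscript is a hypothetical script name; destroy was the cleanup call in the PCT releases of that period):

```matlab
j = batch('myscript');   % run myscript.m in the background on a worker
wait(j);                 % block until the job has finished
load(j);                 % load the script's workspace variables locally
destroy(j);              % release the job's resources
```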

4 Examples of using the Parallel Toolbox

4.1 Using PARFOR

By simply changing a for loop into a parfor loop (the simplest form of parallelization), the user explicitly states that the contents of the loop may be executed in any order on the available resources. With this kind of loop, if additional computational resources are available (through matlabpool), faster results are achieved; in the absence of these resources, on a single-processor system, parfor behaves like a traditional for loop. The parfor loop requires iterations to be completely independent of each other. The syntax of parfor is as follows [2]:

    parfor (itr = m:n, [NumWorkers])
        % loop body
    end

NumWorkers is an optional argument that indicates an upper bound on the number of MATLAB workers the user wants to use for executing the loop body. Once a pool has been set up, programs can use parfor, which is like for except that the iterations may be farmed out to different CPUs. In the following code, for example, one CPU could handle i=1:1000 while another could deal with i=1001:2000, etc.

    parfor i = 1:10000
        x(i) = x(i)*2;
    end

The single program multiple data (spmd) construct makes it possible to define a block of code that runs in parallel on all the labs (workers) in the MATLAB pool. The second example is more complex, but the idea is just the same [2]:

    parfor k = 1:60
        a(k) = max(abs(eig(rand())));
    end

Depending on the availability of workers, the iteration range may be divided differently.

4.2 Using distributed arrays

In MATLAB there is no syntactic difference between the way users access elements of distributed arrays spread among different workers and the way they access regular MATLAB arrays; MATLAB takes the responsibility for shipping data appropriately as necessary.
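The independence requirement can be seen in a loop that parfor must reject; a hypothetical counterexample:

```matlab
% NOT a valid parfor loop: iteration i reads the result of iteration i-1,
% so the iterations cannot be executed in an arbitrary order.
x = zeros(1, 10000);
x(1) = 1;
for i = 2:10000          % replacing this "for" with "parfor" fails
    x(i) = 2 * x(i-1);   % loop-carried dependency
end
```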
Distributed arrays may be used with almost all of the nearly 150 core built-in MATLAB functions, including reduction operations, indexing and linear algebra operations such as LU factorization. For dense linear algebra operations the ScaLAPACK library is used whenever possible; other algorithms, such as those for sparse matrices, are implemented in the MATLAB language.

FOR-DRANGE

The for-drange construct lets users iterate over a distributed range: each worker executes on the piece of the range that it owns. The for-drange construct requires loop iterations to be independent of each other, and no communication may occur between labs when executing the loop.

5 Speeding up MATLAB applications by GPU (Graphics Processor Unit) [1]

A GPU can be used to speed MATLAB up; the options include MATLAB plug-ins written in the CUDA language (CUDA is NVIDIA's parallel computing platform for its graphics boards). The latest generation of GPUs offers considerable computing power using their 100 to 512 on-card processors. CUDA applications can be called through MATLAB mex functions: with a properly developed mex function, the user-friendly MATLAB interface can be used to perform behind-the-scenes parallel computations on the GPU. The GPU becomes a co-processor for the personal computer. "Tesla" devices are available as compute-only devices that eliminate all graphics capability and include additional memory. While such devices may be slightly more reliable than desktop graphics

cards and allow larger computational problems, they do not usually offer faster computation, since both use the same processor design. NVIDIA's next-generation CUDA architecture, called "Fermi", consists of 512 CUDA cores and ECC memory, and offers double-precision capabilities 8 times faster than existing devices; the Fermi-based Tesla S2050 and S2070 will also offer display output. The Message Passing Interface (MPI) implements parallel computing on a cluster of PCs and is used for complicated computations, but it is often limited by inter-computer communication. CUDA implements parallel computing on the massive number of processors of a GPU for rather simple floating-point calculations with very fast communication between processors. Both approaches have their strengths and weaknesses; one does not replace the other. The matrix-intensive computations in MATLAB are ideally suited to GPU computation: CUDA applications are now several times faster than their equivalent CPU calculations for large matrices, and this sort of computation is used more and more often.

6 Applying MATLAB and GPU technology to image recognition

The methods used in this area of science are based on complicated matrix operations, so they can easily be tuned for parallel computation in MATLAB and CUDA. MATLAB with CUDA technology can be used on each level of image processing. The first stage of any vision system is image acquisition; discretization then makes it possible to talk about images. After the image has been obtained, various methods of processing can be applied to it (in the form of a pixel array) to perform the many different vision tasks required today. Pre-processing adapts the image to the specific application. Low-level image processing concerns image enhancement, restoration and transformation.
However, if the image has not been acquired satisfactorily, the intended tasks may not be achievable, even with the aid of some form of image enhancement. This part of low-level image pre-processing is supported by MATLAB's Image Acquisition Toolbox™, which enables the user to acquire images and videos from cameras and frame grabbers directly into MATLAB and Simulink; the toolbox can detect hardware automatically and configure hardware properties.

Image processing (image restoration and enhancement) is in many cases concerned with taking one array of pixels as input and producing another array of pixels as output which in some way represents an improvement on the original array. For example, this processing may remove noise, improve the contrast of the image, or remove blurring caused by movement of the camera during image acquisition; it may also correct geometrical distortions caused by the lens. Image processing methods may be broadly divided into: real-space methods, which work by directly processing the input pixel array; and Fourier-space methods, which work by first deriving a new representation of the input data through a Fourier transform, processing that representation, and finally performing an inverse Fourier transform on the resulting data to give the final output image.

In medium-level image processing (image understanding), object representation and description are created. This phase includes the processes of image segmentation and detection of contours and edges. Edges are very important to any vision system (biological or machine): they are fairly cheap to compute, and they provide strong visual clues that can help the recognition process. An edge may be regarded as a boundary between two dissimilar regions in an image. In principle, an edge is easy to find, because differences in pixel values between regions are relatively easy to calculate by considering gradients.
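The gradient idea can be sketched in a few lines (a minimal example; cameraman.tif ships with the Image Processing Toolbox, and the threshold value is illustrative):

```matlab
I = im2double(imread('cameraman.tif'));  % sample grayscale image
[Gx, Gy] = gradient(I);                  % horizontal and vertical gradients
Gmag = sqrt(Gx.^2 + Gy.^2);              % gradient magnitude per pixel
E = Gmag > 0.1;                          % threshold -> binary edge map
imshow(E);
```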
The process of extracting and representing information from an image by grouping pixels together into regions of similarity is commonly called segmentation. In 2D segmentation we group pixels together according to the rate of change of their intensity over a region; in 3D segmentation we group pixels together according to the rate of change of depth in the image, corresponding to pixels lying on the same surface such as a plane, cylinder, sphere, etc. Medium-level and part of low-level computer vision applications can be supported by the Image Processing Toolbox™, which provides a comprehensive set of reference-standard algorithms and graphical tools for image analysis, visualization and algorithm development. It can perform image registration, image enhancement, image deblurring, noise reduction, image segmentation, feature detection and geometric transformations. Many toolbox functions are multithreaded to take advantage of multicore and multiprocessor computers.

Representation and description make it possible to talk about the properties of objects, and this is the task of high-level image processing. On this level of an application, object recognition is performed and objects are identified on the basis of object models. The interpretation of the properties of objects (identity, size, material, 2D/3D position, orientation) and of the relationships among objects (relative position, occlusions) is made using artificial intelligence methods.

7 CUDA functions called from MATLAB

Computation on a GPU is basically a three-step process: copy the data to GPU memory, execute code (the "kernel") to process the data, and copy the results back from GPU memory. In general the code should be designed to minimize the first and last steps, which, when performed frequently, limit the overall speed of the calculations.

8 An example implementation of a parallel convolution algorithm in CUDA executed with the PCT (MATLAB Parallel Computing Toolbox)

A parallel implementation of the convolution algorithm is presented below, together with tests of its speedup.
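The three-step copy-execute-copy pattern of Section 7 can be sketched with gpuArray (a minimal sketch; the matrix size is illustrative):

```matlab
A = rand(2000);          % data prepared on the CPU
G = gpuArray(A);         % step 1: copy the data to GPU memory
H = sqrt(G) + sin(G);    % step 2: execute elementwise code on the GPU
B = gather(H);           % step 3: copy the results back to main memory
```

Keeping intermediate results on the GPU between kernel calls avoids repeating steps 1 and 3, which are exactly the transfers that limit overall speed.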
Convolutions are used in many engineering and mathematics applications, especially in low-level image processing with blur filters or edge detectors. Mathematically, a convolution measures the amount of overlap between two functions [1]. In the context of image processing, a convolution filter is just the scalar product of the filter weights with the input pixels within a window surrounding each output pixel. In our paper we use algorithms implemented by Dirk-Jan Kroon [5]. The scalar-product form of convolution is a parallel operation and is therefore well suited to computation on highly parallel hardware such as the GPU. In the test, a rotationally symmetric Gaussian low-pass filter with standard deviation sigma equal to 3 is applied to the test image. First, the main MATLAB file is prepared, which calls the convolution on the CPU and on the GPU.

    % Load an image
    I = im2double(imread('scenag.tif'));
    % Create a Gaussian filtering kernel
    H = fspecial('gaussian', [], 3);
    % Perform the convolution on the CPU
    tStart = tic;
    J = conv2(I, H);
    tElapsed = toc(tStart);
    % Perform the convolution on the GPU
    t2Start = tic;
    [Jcuda, t1Elapsed] = gpuconv2(I, H);
    t2Elapsed = toc(t2Start);
    % Show the results
    figure, imshow(J, []);
    title(['CPU filtering time: ', num2str(tElapsed), ' s']);
    figure, imshow(Jcuda, []);
    title(['GPU filtering time: ', num2str(t1Elapsed), ' s, ', num2str(t2Elapsed), ' s']);
    % END

The function gpuconv2 is responsible for creating the CUDA kernel for MATLAB, transferring the data to the GPU, setting the proper parameters for the CUDA program, preparing GPU memory for the output matrix, performing the convolution on the GPU and, finally, gathering the output data back to the main (CPU) memory.

Tab. 1: GPU-enabled MATLAB functions

    function [C, t2Elapsed] = gpuconv2(A, B, SHAPE)
    %GPUCONV2 Two-dimensional convolution on the GPU using CUDA.
    %
    %   C = GPUCONV2(A, B) performs the 2-D convolution of matrices A and B.
    %   If [ma,na] = size(A), [mb,nb] = size(B), and [mc,nc] = size(C), then
    %   mc = max([ma+mb-1, ma, mb]) and nc = max([na+nb-1, na, nb]).
    %
    %   C = GPUCONV2(A, B, SHAPE) returns a subsection of the 2-D convolution
    %   with size specified by SHAPE:
    %     'full'  - (default) returns the full 2-D convolution,
    %     'same'  - returns the central part of the convolution that is the
    %               same size as A,
    %     'valid' - returns only those parts of the convolution that are
    %               computed without the zero-padded edges,
    %               size(C) = max([ma-max(0,mb-1), na-max(0,nb-1)], 0).
    %
    %   Function is written by D. Kroon, University of Twente (December 2010)

    % Create the CUDA kernel
    Conv2Kernel = parallel.gpu.CUDAKernel('conv2_double.ptx', 'conv2_double.cu');
    % Matrices A, B to GPU
    A_Cuda = gpuArray(double(A));
    B_Cuda = gpuArray(double(B));
    if (any((size(A) - size(B)) < 0))
        error('gpuconv2:inputs', 'Size Matrix A must be larger than Size Matrix B');
    end
    % Sizes of the matrices to CUDA
    Asize_Cuda = gpuArray(int32(size(A)));
    Bsize_Cuda = gpuArray(int32(size(B)));
    % Output size C
    if (nargin < 3), SHAPE = 'full'; end
    switch (lower(SHAPE))
        case 'full'
            Csize = size(A) + size(B) - 1;
        case 'same'
            Csize = size(A);
        case 'valid'
            Csize = size(A) - size(B) + 1;
        otherwise
            error('gpuconv2:inputs', 'Unknown Shape');
    end
    % Size to GPU memory
    Csize_Cuda = gpuArray(int32(Csize));
    % Calculate max block size
    s = floor(sqrt(Conv2Kernel.MaxThreadsPerBlock));
    % Create blocks of s x s pixels
    Conv2Kernel.ThreadBlockSize = [s s 1];
    % Make a grid to process the whole output image in blocks
    Conv2Kernel.GridSize = ceil(Csize / s);
    % Initialize memory for the output image
    C_Cuda = parallel.gpu.GPUArray.zeros(Csize(1), Csize(2), 'double');
    % Perform the convolution on the GPU
    t2Start = tic;
    C_Cuda = feval(Conv2Kernel, C_Cuda, Csize_Cuda, A_Cuda, Asize_Cuda, B_Cuda, Bsize_Cuda);
    t2Elapsed = toc(t2Start);
    % Get the data back to the main memory
    C = gather(C_Cuda);
    % END

The function conv2 is the CUDA kernel responsible for calculating the output convolution matrix.

    __global__ void conv2(double *J, int *Jsize, double *I, int *Isize,
                          double *K, int *Ksize)
    {
        // Location in a 2D matrix
        int idx = threadIdx.x + blockIdx.x * blockDim.x;
        // If outside the output image, stop the thread
        if (idx > Jsize[0] - 1) return;
        int idy = threadIdx.y + blockIdx.y * blockDim.y;
        // If outside the output image, stop the thread
        if (idy > Jsize[1] - 1) return;
        // The location in the J 2D matrix, defined by a 1D value
        int id = idx + idy * Jsize[0];
        // Kernel offset (to deal with conv2 full, same and valid mode)
        int ox = idx - (Ksize[0] + 1 + Jsize[0] - Isize[0]) / 2;
        int oy = idy - (Ksize[1] + 1 + Jsize[1] - Isize[1]) / 2;
        // Filtering result value
        double result = 0;
        // Loop value
        int i = 0;
        // Image boundary
        int ibx = Isize[0] - 1;
        int iby = Isize[1] - 1;
        // Check if filtering near an image boundary
        if (ox < 0 || oy < 0 || (ox + Ksize[0]) > ibx || (oy + Ksize[1]) > iby)
        {
            idy = oy;
            // Loop through the kernel
            for (int itt = 0; itt < Ksize[0] * Ksize[1]; ++itt)
            {
                // Boundary check
                if (!(idy < 0 || idy > iby))
                {
                    // Get the location in the image
                    idx = i + ox;
                    // Boundary check
                    if (!(idx < 0 || idx > ibx))
                    {
                        // Add the pixel filtered with the kernel
                        result += I[idx + idy * Isize[0]] * K[itt];
                    }
                }
                i++;
                if (i == Ksize[0]) { i = 0; idy++; }
            }
            J[id] = result;
        }
        else
        {
            int o = oy * Isize[0] + ox;
            // Loop through the kernel
            for (int itt = 0; itt < Ksize[0] * Ksize[1]; ++itt)
            {
                // Add the pixel filtered with the kernel
                result += I[i + o] * K[itt];
                i++;
                if (i == Ksize[0]) { i = 0; o += Isize[0]; }
            }
            J[id] = result;
        }
    }
    // END

The results of our experiments are very interesting: the 2-D convolution kernel executed on the GPU (GeForce GTX 280) is over four times faster than the same application executed on the CPU (Intel E8400, 3 GHz) when data loading is excluded (T_CPU = 0.18357 s, T_GPU1 = 0.044952 s, S_P1 = 4.0837), and three times faster when data loading is included (T_CPU = 0.18357 s, T_GPU2 = 0.060001 s, S_P2 = 3.0595). It is thus possible to achieve significant acceleration using parallel computing algorithms written in CUDA, running on the GPU and called from MATLAB. This is a new opportunity for parallel algorithms that can be run on commonly available computers.

Fig. 2: Input image [3, 6]

Fig. 3: Output images after performing the convolution operation on the CPU (left) and the GPU (right); both solutions give the same result

9 Conclusions

MATLAB (MATrix LABoratory) is a high-performance language for technical computing. Typical uses include math and computation, algorithm development, modeling, simulation, data analysis and visualization, and scientific and engineering graphics. It is a good choice for vision program development because it makes very rapid prototyping easy, contains a good library of image processing functions and has excellent display capabilities. MATLAB code is optimized to be relatively quick when performing matrix operations. To bring applications closer to real-time performance, the possibility of parallel execution should be used. Today's microprocessors have two or four computational cores (we can expect even more in the future), and modern computers have sophisticated, hierarchical memory structures. MATLAB users now have access to clusters and networks of machines (and already to GPUs), and will soon have personal parallel computers, so parallel MATLAB [4] is a result of all these changes. The MathWorks toolset provides users with a set of constructs that can be applied to exploit various types of parallelism with minimal effort: the way to exploit task parallelism is the parfor construction, while distributed arrays and parallel functions target data-parallel problems. Users can choose an adequate level of parallelism in their applications, from low-level message-passing functions to high-level distributed arrays and parallel loops. The toolset was also designed to be portable across multiple platforms and architectures (Linux, Macintosh, Windows).
The parallel convolution algorithm implemented in CUDA and used in MATLAB shows a fourfold acceleration, so the advantages of using it are significant.

Acknowledgements

Project MNiSW no. 0128/R/t00/2010/12

References

[1] B. Dushaw, Matlab and CUDA, APL, University of Washington, Seattle, 2010
[2] G. Sharma, J. Martin, MATLAB: A Language for Parallel Computing, International Journal of Parallel Programming, Vol. 37, 2009
[3] Imagery Library for Intelligent Detection Systems,
[4] C. Moler, Parallel MATLAB: Multiple Processors and Multiple Cores, com/company/newsletters/articles/parallel-matlab-multiple-processors-and-multiple-cores.html
[5] fileexchange/29648-gpuconv2
[6]


More information

Modeling a 4G LTE System in MATLAB

Modeling a 4G LTE System in MATLAB Modeling a 4G LTE System in MATLAB Part 2: Simulation acceleration Houman Zarrinkoub PhD. Signal Processing Product Manager MathWorks houmanz@mathworks.com 2011 The MathWorks, Inc. 1 Why simulation acceleration?

More information

Parallel Computing with MATLAB

Parallel Computing with MATLAB Parallel Computing with MATLAB Jos Martin Principal Architect, Parallel Computing Tools jos.martin@mathworks.co.uk 2015 The MathWorks, Inc. 1 Overview Scene setting Task Parallel (par*) Why doesn t it

More information

Efficient Multi-GPU CUDA Linear Solvers for OpenFOAM

Efficient Multi-GPU CUDA Linear Solvers for OpenFOAM Efficient Multi-GPU CUDA Linear Solvers for OpenFOAM Alexander Monakov, amonakov@ispras.ru Institute for System Programming of Russian Academy of Sciences March 20, 2013 1 / 17 Problem Statement In OpenFOAM,

More information

Georgia Institute of Technology Center for Signal and Image Processing Steve Conover February 2009

Georgia Institute of Technology Center for Signal and Image Processing Steve Conover February 2009 Georgia Institute of Technology Center for Signal and Image Processing Steve Conover February 2009 Introduction CUDA is a tool to turn your graphics card into a small computing cluster. It s not always

More information

Using GPUs to compute the multilevel summation of electrostatic forces

Using GPUs to compute the multilevel summation of electrostatic forces Using GPUs to compute the multilevel summation of electrostatic forces David J. Hardy Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of

More information

Implementation of the finite-difference method for solving Maxwell`s equations in MATLAB language on a GPU

Implementation of the finite-difference method for solving Maxwell`s equations in MATLAB language on a GPU Implementation of the finite-difference method for solving Maxwell`s equations in MATLAB language on a GPU 1 1 Samara National Research University, Moskovskoe Shosse 34, Samara, Russia, 443086 Abstract.

More information

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters 1 On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters N. P. Karunadasa & D. N. Ranasinghe University of Colombo School of Computing, Sri Lanka nishantha@opensource.lk, dnr@ucsc.cmb.ac.lk

More information

Mit MATLAB auf der Überholspur Methoden zur Beschleunigung von MATLAB Anwendungen

Mit MATLAB auf der Überholspur Methoden zur Beschleunigung von MATLAB Anwendungen Mit MATLAB auf der Überholspur Methoden zur Beschleunigung von MATLAB Anwendungen Michael Glaßer Application Engineering MathWorks Germany 2014 The MathWorks, Inc. 1 Key Takeaways 1. Speed up your serial

More information

CUDA (Compute Unified Device Architecture)

CUDA (Compute Unified Device Architecture) CUDA (Compute Unified Device Architecture) Mike Bailey History of GPU Performance vs. CPU Performance GFLOPS Source: NVIDIA G80 = GeForce 8800 GTX G71 = GeForce 7900 GTX G70 = GeForce 7800 GTX NV40 = GeForce

More information

Parallel Computing with MATLAB

Parallel Computing with MATLAB Parallel Computing with MATLAB Jos Martin Principal Architect, Parallel Computing Tools jos.martin@mathworks.co.uk 1 2013 The MathWorks, Inc. www.matlabexpo.com Code used in this presentation can be found

More information

Using Parallel Computing Toolbox to accelerate the Video and Image Processing Speed. Develop parallel code interactively

Using Parallel Computing Toolbox to accelerate the Video and Image Processing Speed. Develop parallel code interactively Using Parallel Computing Toolbox to accelerate the Video and Image Processing Speed Presenter: Claire Chuang TeraSoft Inc. Agenda Develop parallel code interactively parallel applications for faster processing

More information

Accelerating Implicit LS-DYNA with GPU

Accelerating Implicit LS-DYNA with GPU Accelerating Implicit LS-DYNA with GPU Yih-Yih Lin Hewlett-Packard Company Abstract A major hindrance to the widespread use of Implicit LS-DYNA is its high compute cost. This paper will show modern GPU,

More information

MAGMA. Matrix Algebra on GPU and Multicore Architectures

MAGMA. Matrix Algebra on GPU and Multicore Architectures MAGMA Matrix Algebra on GPU and Multicore Architectures Innovative Computing Laboratory Electrical Engineering and Computer Science University of Tennessee Piotr Luszczek (presenter) web.eecs.utk.edu/~luszczek/conf/

More information

Multicore Computer, GPU 및 Cluster 환경에서의 MATLAB Parallel Computing 기능

Multicore Computer, GPU 및 Cluster 환경에서의 MATLAB Parallel Computing 기능 Multicore Computer, GPU 및 Cluster 환경에서의 MATLAB Parallel Computing 기능 성호현 MathWorks Korea 2012 The MathWorks, Inc. 1 A Question to Consider Do you want to speed up your algorithms? If so Do you have a multi-core

More information

HPC with Multicore and GPUs

HPC with Multicore and GPUs HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville COSC 594 Lecture Notes March 22, 2017 1/20 Outline Introduction - Hardware

More information

Supporting Data Parallelism in Matcloud: Final Report

Supporting Data Parallelism in Matcloud: Final Report Supporting Data Parallelism in Matcloud: Final Report Yongpeng Zhang, Xing Wu 1 Overview Matcloud is an on-line service to run Matlab-like script on client s web browser. Internally it is accelerated by

More information

Introduction to Matlab GPU Acceleration for. Computational Finance. Chuan- Hsiang Han 1. Section 1: Introduction

Introduction to Matlab GPU Acceleration for. Computational Finance. Chuan- Hsiang Han 1. Section 1: Introduction Introduction to Matlab GPU Acceleration for Computational Finance Chuan- Hsiang Han 1 Abstract: This note aims to introduce the concept of GPU computing in Matlab and demonstrates several numerical examples

More information

Bindel, Spring 2015 Numerical Analysis (CS 4220) Figure 1: A blurred mystery photo taken at the Ithaca SPCA. Proj 2: Where Are My Glasses?

Bindel, Spring 2015 Numerical Analysis (CS 4220) Figure 1: A blurred mystery photo taken at the Ithaca SPCA. Proj 2: Where Are My Glasses? Figure 1: A blurred mystery photo taken at the Ithaca SPCA. Proj 2: Where Are My Glasses? 1 Introduction The image in Figure 1 is a blurred version of a picture that I took at the local SPCA. Your mission

More information

Optimization solutions for the segmented sum algorithmic function

Optimization solutions for the segmented sum algorithmic function Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code

More information

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367

More information

Evaluating the MATLAB Parallel Computing Toolbox Kashif Hussain Computer Science 2012/2013

Evaluating the MATLAB Parallel Computing Toolbox Kashif Hussain Computer Science 2012/2013 Evaluating the MATLAB Parallel Computing Toolbox Kashif Hussain Computer Science 2012/2013 The candidate confirms that the work submitted is their own and the appropriate credit has been given where reference

More information

Lecture 1: Introduction and Computational Thinking

Lecture 1: Introduction and Computational Thinking PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 1: Introduction and Computational Thinking 1 Course Objective To master the most commonly used algorithm techniques and computational

More information

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in

More information

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand

More information

CUDA Experiences: Over-Optimization and Future HPC

CUDA Experiences: Over-Optimization and Future HPC CUDA Experiences: Over-Optimization and Future HPC Carl Pearson 1, Simon Garcia De Gonzalo 2 Ph.D. candidates, Electrical and Computer Engineering 1 / Computer Science 2, University of Illinois Urbana-Champaign

More information

Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors

Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors Michael Boyer, David Tarjan, Scott T. Acton, and Kevin Skadron University of Virginia IPDPS 2009 Outline Leukocyte

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

Module 4. Computer-Aided Design (CAD) systems

Module 4. Computer-Aided Design (CAD) systems Module 4. Computer-Aided Design (CAD) systems Nowadays the design of complex systems is unconceivable without computers. The fast computers, the sophisticated developing environments and the well elaborated

More information

Technical Computing with MATLAB

Technical Computing with MATLAB Technical Computing with MATLAB University Of Bath Seminar th 19 th November 2010 Adrienne James (Application Engineering) 1 Agenda Introduction to MATLAB Importing, visualising and analysing data from

More information

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS Computer Science 14 (4) 2013 http://dx.doi.org/10.7494/csci.2013.14.4.679 Dominik Żurek Marcin Pietroń Maciej Wielgosz Kazimierz Wiatr THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT

More information

Technology for a better society. hetcomp.com

Technology for a better society. hetcomp.com Technology for a better society hetcomp.com 1 J. Seland, C. Dyken, T. R. Hagen, A. R. Brodtkorb, J. Hjelmervik,E Bjønnes GPU Computing USIT Course Week 16th November 2011 hetcomp.com 2 9:30 10:15 Introduction

More information

MATLAB on BioHPC. portal.biohpc.swmed.edu Updated for

MATLAB on BioHPC. portal.biohpc.swmed.edu Updated for MATLAB on BioHPC [web] [email] portal.biohpc.swmed.edu biohpc-help@utsouthwestern.edu 1 Updated for 2015-06-17 What is MATLAB High level language and development environment for: - Algorithm and application

More information

Accelerating CFD with Graphics Hardware

Accelerating CFD with Graphics Hardware Accelerating CFD with Graphics Hardware Graham Pullan (Whittle Laboratory, Cambridge University) 16 March 2009 Today Motivation CPUs and GPUs Programming NVIDIA GPUs with CUDA Application to turbomachinery

More information

Measurement of real time information using GPU

Measurement of real time information using GPU Measurement of real time information using GPU Pooja Sharma M. Tech Scholar, Department of Electronics and Communication E-mail: poojachaturvedi1985@gmail.com Rajni Billa M. Tech Scholar, Department of

More information

Parallelism paradigms

Parallelism paradigms Parallelism paradigms Intro part of course in Parallel Image Analysis Elias Rudberg elias.rudberg@it.uu.se March 23, 2011 Outline 1 Parallelization strategies 2 Shared memory 3 Distributed memory 4 Parallelization

More information

Speed up a Machine-Learning-based Image Super-Resolution Algorithm on GPGPU

Speed up a Machine-Learning-based Image Super-Resolution Algorithm on GPGPU Speed up a Machine-Learning-based Image Super-Resolution Algorithm on GPGPU Ke Ma 1, and Yao Song 2 1 Department of Computer Sciences 2 Department of Electrical and Computer Engineering University of Wisconsin-Madison

More information

MatCL - OpenCL MATLAB Interface

MatCL - OpenCL MATLAB Interface MatCL - OpenCL MATLAB Interface MatCL - OpenCL MATLAB Interface Slide 1 MatCL - OpenCL MATLAB Interface OpenCL toolkit for Mathworks MATLAB/SIMULINK Compile & Run OpenCL Kernels Handles OpenCL memory management

More information

Introduction to GPU hardware and to CUDA

Introduction to GPU hardware and to CUDA Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 35 Course outline Introduction to GPU hardware

More information

Offloading Java to Graphics Processors

Offloading Java to Graphics Processors Offloading Java to Graphics Processors Peter Calvert (prc33@cam.ac.uk) University of Cambridge, Computer Laboratory Abstract Massively-parallel graphics processors have the potential to offer high performance

More information

Computer Vision on Tegra K1. Chen Sagiv SagivTech Ltd.

Computer Vision on Tegra K1. Chen Sagiv SagivTech Ltd. Computer Vision on Tegra K1 Chen Sagiv SagivTech Ltd. Established in 2009 and headquartered in Israel Core domain expertise: GPU Computing and Computer Vision What we do: - Technology - Solutions - Projects

More information

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can

More information

Parallel Computing with Matlab and R

Parallel Computing with Matlab and R Parallel Computing with Matlab and R scsc@duke.edu https://wiki.duke.edu/display/scsc Tom Milledge tm103@duke.edu Overview Running Matlab and R interactively and in batch mode Introduction to Parallel

More information

Technical Report TR

Technical Report TR Technical Report TR-2012-03 Parallel Ray Tracing Simulations with MATLAB for Dynamic Lens Systems Nicolai Wengert and Dan Negrut August 10, 2012 Abstract Ray tracing simulations are required for investigating

More information

An Introduction to OpenAcc

An Introduction to OpenAcc An Introduction to OpenAcc ECS 158 Final Project Robert Gonzales Matthew Martin Nile Mittow Ryan Rasmuss Spring 2016 1 Introduction: What is OpenAcc? OpenAcc stands for Open Accelerators. Developed by

More information

Parallelism. Parallel Hardware. Introduction to Computer Systems

Parallelism. Parallel Hardware. Introduction to Computer Systems Parallelism We have been discussing the abstractions and implementations that make up an individual computer system in considerable detail up to this point. Our model has been a largely sequential one,

More information

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou Study and implementation of computational methods for Differential Equations in heterogeneous systems Asimina Vouronikoy - Eleni Zisiou Outline Introduction Review of related work Cyclic Reduction Algorithm

More information

Using the MATLAB Parallel Computing Toolbox on the UB CCR cluster

Using the MATLAB Parallel Computing Toolbox on the UB CCR cluster Using the MATLAB Parallel Computing Toolbox on the UB CCR cluster N. Barlow, C. Cornelius, S. Matott Center for Computational Research University at Buffalo State University of New York October, 1, 2013

More information

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620 Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved

More information

Applications of Berkeley s Dwarfs on Nvidia GPUs

Applications of Berkeley s Dwarfs on Nvidia GPUs Applications of Berkeley s Dwarfs on Nvidia GPUs Seminar: Topics in High-Performance and Scientific Computing Team N2: Yang Zhang, Haiqing Wang 05.02.2015 Overview CUDA The Dwarfs Dynamic Programming Sparse

More information

GPU Implementation of Implicit Runge-Kutta Methods

GPU Implementation of Implicit Runge-Kutta Methods GPU Implementation of Implicit Runge-Kutta Methods Navchetan Awasthi, Abhijith J Supercomputer Education and Research Centre Indian Institute of Science, Bangalore, India navchetanawasthi@gmail.com, abhijith31792@gmail.com

More information

This is a draft chapter from an upcoming CUDA textbook by David Kirk from NVIDIA and Prof. Wen-mei Hwu from UIUC.

This is a draft chapter from an upcoming CUDA textbook by David Kirk from NVIDIA and Prof. Wen-mei Hwu from UIUC. David Kirk/NVIDIA and Wen-mei Hwu, 2006-2008 This is a draft chapter from an upcoming CUDA textbook by David Kirk from NVIDIA and Prof. Wen-mei Hwu from UIUC. Please send any comment to dkirk@nvidia.com

More information

MATLAB is a multi-paradigm numerical computing environment fourth-generation programming language. A proprietary programming language developed by

MATLAB is a multi-paradigm numerical computing environment fourth-generation programming language. A proprietary programming language developed by 1 MATLAB is a multi-paradigm numerical computing environment fourth-generation programming language. A proprietary programming language developed by MathWorks In 2004, MATLAB had around one million users

More information

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy

More information

Embarquez votre Intelligence Artificielle (IA) sur CPU, GPU et FPGA

Embarquez votre Intelligence Artificielle (IA) sur CPU, GPU et FPGA Embarquez votre Intelligence Artificielle (IA) sur CPU, GPU et FPGA Pierre Nowodzienski Engineer pierre.nowodzienski@mathworks.fr 2018 The MathWorks, Inc. 1 From Data to Business value Make decisions Get

More information

CUDA. GPU Computing. K. Cooper 1. 1 Department of Mathematics. Washington State University

CUDA. GPU Computing. K. Cooper 1. 1 Department of Mathematics. Washington State University GPU Computing K. Cooper 1 1 Department of Mathematics Washington State University 2014 Review of Parallel Paradigms MIMD Computing Multiple Instruction Multiple Data Several separate program streams, each

More information

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization

More information

SIFT Descriptor Extraction on the GPU for Large-Scale Video Analysis. Hannes Fassold, Jakub Rosner

SIFT Descriptor Extraction on the GPU for Large-Scale Video Analysis. Hannes Fassold, Jakub Rosner SIFT Descriptor Extraction on the GPU for Large-Scale Video Analysis Hannes Fassold, Jakub Rosner 2014-03-26 2 Overview GPU-activities @ AVM research group SIFT descriptor extraction Algorithm GPU implementation

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

Two-Phase flows on massively parallel multi-gpu clusters

Two-Phase flows on massively parallel multi-gpu clusters Two-Phase flows on massively parallel multi-gpu clusters Peter Zaspel Michael Griebel Institute for Numerical Simulation Rheinische Friedrich-Wilhelms-Universität Bonn Workshop Programming of Heterogeneous

More information

COMP 322: Fundamentals of Parallel Programming. Flynn s Taxonomy for Parallel Computers

COMP 322: Fundamentals of Parallel Programming. Flynn s Taxonomy for Parallel Computers COMP 322: Fundamentals of Parallel Programming Lecture 37: General-Purpose GPU (GPGPU) Computing Max Grossman, Vivek Sarkar Department of Computer Science, Rice University max.grossman@rice.edu, vsarkar@rice.edu

More information

Parallel Systems I The GPU architecture. Jan Lemeire

Parallel Systems I The GPU architecture. Jan Lemeire Parallel Systems I The GPU architecture Jan Lemeire 2012-2013 Sequential program CPU pipeline Sequential pipelined execution Instruction-level parallelism (ILP): superscalar pipeline out-of-order execution

More information

XIV International PhD Workshop OWD 2012, October Optimal structure of face detection algorithm using GPU architecture

XIV International PhD Workshop OWD 2012, October Optimal structure of face detection algorithm using GPU architecture XIV International PhD Workshop OWD 2012, 20 23 October 2012 Optimal structure of face detection algorithm using GPU architecture Dmitry Pertsau, Belarusian State University of Informatics and Radioelectronics

More information

Grand Central Dispatch

Grand Central Dispatch A better way to do multicore. (GCD) is a revolutionary approach to multicore computing. Woven throughout the fabric of Mac OS X version 10.6 Snow Leopard, GCD combines an easy-to-use programming model

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory

More information

Stream Processing with CUDA TM A Case Study Using Gamebryo's Floodgate Technology

Stream Processing with CUDA TM A Case Study Using Gamebryo's Floodgate Technology Stream Processing with CUDA TM A Case Study Using Gamebryo's Floodgate Technology Dan Amerson, Technical Director, Emergent Game Technologies Purpose Why am I giving this talk? To answer this question:

More information

Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop

Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop http://cscads.rice.edu/ Discussion and Feedback CScADS Autotuning 07 Top Priority Questions for Discussion

More information

Large Data in MATLAB: A Seismic Data Processing Case Study U. M. Sundar Senior Application Engineer

Large Data in MATLAB: A Seismic Data Processing Case Study U. M. Sundar Senior Application Engineer Large Data in MATLAB: A Seismic Data Processing Case Study U. M. Sundar Senior Application Engineer 2013 MathWorks, Inc. 1 Problem Statement: Scaling Up Seismic Analysis Challenge: Developing a seismic

More information

ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS

ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS Ferdinando Alessi Annalisa Massini Roberto Basili INGV Introduction The simulation of wave propagation

More information

Speedup Altair RADIOSS Solvers Using NVIDIA GPU

Speedup Altair RADIOSS Solvers Using NVIDIA GPU Innovation Intelligence Speedup Altair RADIOSS Solvers Using NVIDIA GPU Eric LEQUINIOU, HPC Director Hongwei Zhou, Senior Software Developer May 16, 2012 Innovation Intelligence ALTAIR OVERVIEW Altair

More information

General Purpose GPU Computing in Partial Wave Analysis

General Purpose GPU Computing in Partial Wave Analysis JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data

More information

Parallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs

Parallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs Parallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs C.-C. Su a, C.-W. Hsieh b, M. R. Smith b, M. C. Jermy c and J.-S. Wu a a Department of Mechanical Engineering, National Chiao Tung

More information

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran

More information

GeoImaging Accelerator Pansharpen Test Results. Executive Summary

GeoImaging Accelerator Pansharpen Test Results. Executive Summary Executive Summary After demonstrating the exceptional performance improvement in the orthorectification module (approximately fourteen-fold see GXL Ortho Performance Whitepaper), the same approach has

More information

arxiv: v1 [physics.ins-det] 11 Jul 2015

arxiv: v1 [physics.ins-det] 11 Jul 2015 GPGPU for track finding in High Energy Physics arxiv:7.374v [physics.ins-det] Jul 5 L Rinaldi, M Belgiovine, R Di Sipio, A Gabrielli, M Negrini, F Semeria, A Sidoti, S A Tupputi 3, M Villa Bologna University

More information

MATLAB Based Optimization Techniques and Parallel Computing

MATLAB Based Optimization Techniques and Parallel Computing MATLAB Based Optimization Techniques and Parallel Computing Bratislava June 4, 2009 2009 The MathWorks, Inc. Jörg-M. Sautter Application Engineer The MathWorks Agenda Introduction Local and Smooth Optimization

More information

Computing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany

Computing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany Computing on GPUs Prof. Dr. Uli Göhner DYNAmore GmbH Stuttgart, Germany Summary: The increasing power of GPUs has led to the intent to transfer computing load from CPUs to GPUs. A first example has been

More information

(Refer Slide Time 00:17) Welcome to the course on Digital Image Processing. (Refer Slide Time 00:22)

(Refer Slide Time 00:17) Welcome to the course on Digital Image Processing. (Refer Slide Time 00:22) Digital Image Processing Prof. P. K. Biswas Department of Electronics and Electrical Communications Engineering Indian Institute of Technology, Kharagpur Module Number 01 Lecture Number 02 Application

More information

CME 213 S PRING Eric Darve

CME 213 S PRING Eric Darve CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and

More information

Chapter 3 Parallel Software

Chapter 3 Parallel Software Chapter 3 Parallel Software Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers

More information

High-Performance Data Loading and Augmentation for Deep Neural Network Training

High-Performance Data Loading and Augmentation for Deep Neural Network Training High-Performance Data Loading and Augmentation for Deep Neural Network Training Trevor Gale tgale@ece.neu.edu Steven Eliuk steven.eliuk@gmail.com Cameron Upright c.upright@samsung.com Roadmap 1. The General-Purpose

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information