MATLAB AND PARALLEL COMPUTING


Image Processing & Communication, vol. 17, no. 4

MAGDALENA SZYMCZYK, PIOTR SZYMCZYK
AGH University of Science and Technology, Kraków, Poland

Abstract. MATLAB is a technical computing language used in a variety of fields, such as control systems, image and signal processing, visualization and financial process simulations, in an easy-to-use environment. MATLAB offers "toolboxes", which are specialized libraries for a variety of scientific domains, and a simplified interface to high-performance libraries (LAPACK, BLAS and FFTW, among others). MATLAB is now enriched by the possibility of parallel computing with the Parallel Computing Toolbox™ and MATLAB Distributed Computing Server™. In this article we present some of the key features of parallel MATLAB applications, focused on using GPU processors for image processing.

1 Introduction

The article provides an overview of the capabilities for increasing the speed of calculations in MATLAB using parallel programming on the graphics processing unit (GPU). An example algorithm implemented in CUDA and used with parallel MATLAB was elaborated and presented, and tests of the algorithm's speedup were performed and described.

Microprocessors based on a single central processing unit (CPU) drove rapid performance growth and cost reductions in computer applications for more than two decades. This performance improvement allowed application software to provide more functionality, better user interfaces and more useful results, faster. The users, in turn, demanded more and more improvements, so a positive cycle for the computer industry was created. These fast changes in software relied on advances in hardware and the introduction of new generations of processors. This drive slowed around 2003 due to power-consumption issues, which limit the amount of productive activity that can be performed in each clock period within a single CPU.
Nowadays all microprocessor producers have switched to multi-core solutions to increase processing power. This change has a great impact on the development of software. Until now, software applications have mostly been written as sequential programs, but with each new generation of microprocessors it will be necessary to write parallel programs, in which multiple threads of execution cooperate to achieve the required functionality faster. This is nothing new: the high-performance computing community has been developing parallel programs for large-scale, expensive computers for decades, but these solutions were rather unpopular in ordinary applications. Nowadays all new microprocessors are parallel computers, so the number of applications that need to be developed as parallel programs has increased dramatically, and there is now a great need for software developers to learn about parallel programming. Using several machines simultaneously in an application is harder than writing the sequential version of it, so computing languages which make this process easier are desirable.

2 Parallel MATLAB

Parallel MATLAB is an extension of MATLAB that takes advantage of multicore desktop machines and clusters. MATLAB supports parallel computing in several ways: some features only need to be enabled in the settings, while in other situations programs may need some adaptation or the purchase of a special toolbox. MATLAB supports three types of parallel computing: multithreaded (implicit parallelism), distributed computing and explicit parallelism.

In multithreaded parallelism, one instance of MATLAB automatically generates multiple simultaneous instruction streams. Multiple processors or cores, sharing the memory of a single computer, execute these streams. Elementwise computations on big matrices benefit most from this solution. In distributed computing, multiple instances of MATLAB run multiple independent computations on separate computers, each with its own memory; in most cases, a single program is run many times with different parameters. In explicit parallelism, several instances of MATLAB run on several processors or computers, often with separate memories, and simultaneously execute a single MATLAB command or M-function. New programming constructions, including parallel loops and distributed arrays, express this parallelism.
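Implicit multithreading requires no code changes at all; a minimal sketch (the matrix size is illustrative):

```matlab
% Implicit (multithreaded) parallelism: elementwise operations on a
% large matrix are automatically split across the cores of one machine.
A = rand(4000);            % 4000x4000 random matrix
B = exp(A) .* sin(A);      % elementwise work; MATLAB threads it internally
```

No toolbox is needed for this form of parallelism; the number of computational threads is a MATLAB preference.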
These different kinds of parallelism can be combined in different ways. For example, for multithreaded parallelism the number of threads can be set in the MATLAB Preferences panel; these threads are used by the multithreaded versions of the BLAS (Basic Linear Algebra Subroutines). A distributed computing job might invoke multithreaded functions on each machine and then use a distributed array to collect the final results. For vector arguments, the MATLAB elementary function library is multithreaded too [5]. It is not an easy task to choose the most appropriate form of parallelism for a given type of application.

3 MATLAB Parallel Computing Toolbox (PCT) and Distributed Computing Server (MDCS)

The Parallel Computing Toolbox runs on a desktop and can take advantage of up to 8 cores there. Parallel programs can be run interactively or in batch. The MATLAB Distributed Computing Server (MDCS) controls parallel execution of MATLAB on a cluster with tens or hundreds of cores. Fig. 1 presents the typical architecture of such a parallel system.

Fig. 1: Typical architecture of a parallel system

The three ways to write a parallel MATLAB program which were mentioned earlier can be expressed by: the parfor statement, the simplest path to parallelism, which indicates that a given for loop can be executed in parallel; the spmd statement, which creates cooperating, synchronized processing; and the task feature, which creates multiple independent programs. The parfor approach is limited but a simple way to get started; the spmd statement is powerful but requires rethinking the program and data; the task approach is simple but suitable only for computations that need almost no communication. A simplified version of the MPI programming model can also be used in MATLAB. There is a single program, but it is divided into client and worker (lab) sections which can cooperate. Each worker process has its own memory and a separate ID, and ideally runs on a separate core. When the parfor statement is included in client code, it indicates that a given for loop can be executed in parallel, so iterations of the loop are automatically divided up among the workers and the results gathered back onto the client. Using parfor requires that the iterations are completely independent; there are also some restrictions on data access.

Spmd programming includes distributed arrays. A distributed array is logically one array, and a large set of MATLAB commands can treat it that way; however, portions of the array are scattered across multiple processors, which means such an array can be really large. The local part of a distributed array can be operated on by its processor very quickly. A distributed array can be operated on by explicit commands to the spmd workers that own pieces of the array, or implicitly by commands at the global (client) level. The client and each worker have separate workspaces, but it is possible for them to communicate and trade information. Instead of having an array created on the client and distributed to the workers, it is possible to construct a distributed array by having each worker build its own piece; the result is still a distributed array. Codistributing the creation of an array has several advantages: the array is built faster, in parallel; you skip the communication cost of distributing it; and the array might be too large to build entirely on one core (or processor).

Parallel MATLAB jobs can be run directly, that is, interactively. The matlabpool command is used to reserve a given number of workers on the local (or perhaps a remote) machine.
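The codistributed construction described above can be sketched as follows (a minimal example using the matlabpool-era PCT syntax; the matrix size is illustrative):

```matlab
matlabpool open 4                  % reserve four local workers
spmd
    % each worker builds only its own piece of one logically global array
    D = codistributed.rand(1000);  % distributed 1000x1000 random matrix
    L = getLocalPart(D);           % purely local, and therefore fast, access
end
t = gather(sum(sum(D)));           % implicit global operation, result on the client
matlabpool close
```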
Once these workers are available, the user can type commands, run scripts or evaluate functions which contain parfor statements, and the workers will cooperate in producing results. Parallel MATLAB jobs can also be run indirectly: the batch command is used to specify MATLAB code to be executed, and it starts the computation in the background.
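A minimal batch sketch (myscript is a hypothetical script name; destroy was the cleanup call in the PCT releases of that period):

```matlab
j = batch('myscript');   % run myscript.m in the background on a worker
wait(j);                 % block until the job has finished
load(j);                 % load the script's workspace variables locally
destroy(j);              % release the job's resources
```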

4 Examples of using the Parallel Toolbox

4.1 Using PARFOR

By simply changing a for loop into a parfor loop (the simplest form of parallelization), the user explicitly states that the contents of the loop may be executed in any order on the available resources. With this kind of loop, if additional computational resources are available (through matlabpool), faster results are achieved; in the absence of these resources, on a single-processor system, parfor behaves like a traditional for loop. The parfor loop requires iterations to be completely independent of each other. The syntax of parfor is as follows [2]:

    parfor (itr = m:n, [NumWorkers])
        % loop body
    end

NumWorkers is an optional argument that indicates an upper bound on the number of MATLAB workers the user wants to use for executing the loop body. Once a pool has been set up, programs can use parfor, which is like for except that the iterations may be farmed out to different CPUs. In the following code, for example, one CPU could handle i=1:1000 while another could deal with i=1001:2000, etc.

    parfor i = 1:10000
        x(i) = x(i)*2;
    end

The single program multiple data (spmd) construct makes it possible to define a block of code that runs in parallel on all the labs (workers) in the MATLAB pool. The second example is more complex, but the idea is just the same [2]:

    parfor k = 1:60
        a(k) = max(abs(eig(rand())));
    end

Depending on the availability of workers, the iteration range may be divided differently.

4.2 Using distributed arrays

In MATLAB there is no syntactic difference between the way users access elements of distributed arrays spread among different workers and the way they access regular MATLAB arrays; MATLAB takes the responsibility for shipping data appropriately as necessary.
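The independence requirement can be seen in a loop that parfor must reject; a hypothetical counterexample:

```matlab
% NOT a valid parfor loop: iteration i reads the result of iteration i-1,
% so the iterations cannot be executed in an arbitrary order.
x = zeros(1, 10000);
x(1) = 1;
for i = 2:10000          % replacing this "for" with "parfor" fails
    x(i) = 2 * x(i-1);   % loop-carried dependency
end
```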
Distributed arrays may be used with almost all of the nearly 150 core built-in MATLAB functions, including reduction operations, indexing and linear algebra operations such as LU factorization. For dense linear algebra operations the ScaLAPACK library is used whenever possible; other algorithms, such as those for sparse matrices, are implemented in the MATLAB language.

FOR-DRANGE

The for-drange construct lets users iterate over a distributed range: each worker executes on the piece of the range that it owns. The for-drange construct requires loop iterations to be independent of each other, and no communication may occur between labs when executing the loop.

5 Speeding up MATLAB applications by GPU (Graphics Processor Unit) [1]

A GPU can be used to speed MATLAB up; the options include MATLAB plug-ins written in the CUDA language (CUDA is NVIDIA's parallel computing platform for its graphics boards). The latest generation of GPUs offers considerable computing power using their 100 to 512 on-card processors. CUDA applications can be called through MATLAB mex functions: with a properly developed mex function, the user-friendly MATLAB interface can be used to perform behind-the-scenes parallel computations on the GPU. The GPU becomes a co-processor for the personal computer. "Tesla" devices are available as compute-only devices that eliminate all graphics capability and include additional memory. While such devices may be slightly more reliable than desktop graphics

cards and allow larger computational problems, they do not usually offer faster computation, since both use the same processor design. NVIDIA's next-generation CUDA architecture, called "Fermi", consists of 512 CUDA cores and ECC memory, and offers double-precision capabilities 8 times faster than existing devices; the Fermi-based Tesla S2050 and S2070 will also offer display output. The Message Passing Interface (MPI) implements parallel computing on a cluster of PCs and is used for complicated computations, but it is often limited by inter-computer communication. CUDA implements parallel computing on the massive number of processors of a GPU for rather simple floating-point calculations with very fast communication between processors. Both approaches have their strengths and weaknesses; one does not replace the other. The matrix-intensive computations in MATLAB are ideally suited to GPU computation: CUDA applications are now several times faster than their equivalent CPU calculations for large matrices, and this sort of computation is used more and more often.

6 Applying MATLAB and GPU technology to image recognition

The methods used in this area of science are based on complicated matrix operations, so they can easily be tuned for parallel computation in MATLAB and CUDA. MATLAB with CUDA technology can be used on each level of image processing. The first stage of any vision system is image acquisition; discretization then makes it possible to talk about images. After the image has been obtained, various methods of processing can be applied to it (in the form of a pixel array) to perform the many different vision tasks required today. Pre-processing adapts the image to the specific application. Low-level image processing concerns image enhancement, restoration and transformation.
However, if the image has not been acquired satisfactorily, the intended tasks may not be achievable, even with the aid of some form of image enhancement. This part of low-level image pre-processing is supported by MATLAB's Image Acquisition Toolbox™, which enables the user to acquire images and videos from cameras and frame grabbers directly into MATLAB and Simulink; the toolbox can detect hardware automatically and configure hardware properties.

Image processing (image restoration and enhancement) is in many cases concerned with taking one array of pixels as input and producing another array of pixels as output which in some way represents an improvement on the original array. For example, this processing may remove noise, improve the contrast of the image, or remove blurring caused by movement of the camera during image acquisition; it may also correct geometrical distortions caused by the lens. Image processing methods may be broadly divided into: real-space methods, which work by directly processing the input pixel array; and Fourier-space methods, which work by first deriving a new representation of the input data through a Fourier transform, processing that representation, and finally performing an inverse Fourier transform on the resulting data to give the final output image.

In medium-level image processing (image understanding), object representation and description are created. This phase includes the processes of image segmentation and detection of contours and edges. Edges are very important to any vision system (biological or machine): they are fairly cheap to compute, and they provide strong visual clues that can help the recognition process. An edge may be regarded as a boundary between two dissimilar regions in an image. In principle, an edge is easy to find, because differences in pixel values between regions are relatively easy to calculate by considering gradients.
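The gradient idea can be sketched in a few lines (a minimal example; cameraman.tif ships with the Image Processing Toolbox, and the threshold value is illustrative):

```matlab
I = im2double(imread('cameraman.tif'));  % sample grayscale image
[Gx, Gy] = gradient(I);                  % horizontal and vertical gradients
Gmag = sqrt(Gx.^2 + Gy.^2);              % gradient magnitude per pixel
E = Gmag > 0.1;                          % threshold -> binary edge map
imshow(E);
```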
The process of extracting and representing information from an image by grouping pixels together into regions of similarity is commonly called segmentation. In 2D segmentation we group pixels together according to the rate of change of their intensity over a region; in 3D segmentation we group pixels together according to the rate of change of depth in the image, corresponding to pixels lying on the same surface such as a plane, cylinder, sphere, etc. Medium-level and part of low-level computer vision applications can be supported by the Image Processing Toolbox™, which provides a comprehensive set of reference-standard algorithms and graphical tools for image analysis, visualization and algorithm development. It can perform image registration, image enhancement, image deblurring, noise reduction, image segmentation, feature detection and geometric transformations. Many toolbox functions are multithreaded to take advantage of multicore and multiprocessor computers.

Representation and description make it possible to talk about the properties of objects, and this is the task of high-level image processing. On this level of an application, object recognition is performed and objects are identified on the basis of object models. The interpretation of the properties of objects (identity, size, material, 2D/3D position, orientation) and of the relationships among objects (relative position, occlusions) is made using artificial intelligence methods.

7 CUDA functions called from MATLAB

Computation on a GPU is basically a three-step process: copy the data to GPU memory, execute code (the "kernel") to process the data, and copy the results back from GPU memory. In general the code should be designed to minimize the first and last steps, which, when performed frequently, limit the overall speed of the calculations.

8 An example implementation of a parallel convolution algorithm in CUDA executed with the PCT (MATLAB Parallel Computing Toolbox)

A parallel implementation of the convolution algorithm is presented below, together with tests of its speedup.
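The three-step copy-execute-copy pattern of Section 7 can be sketched with gpuArray (a minimal sketch; the matrix size is illustrative):

```matlab
A = rand(2000);          % data prepared on the CPU
G = gpuArray(A);         % step 1: copy the data to GPU memory
H = sqrt(G) + sin(G);    % step 2: execute elementwise code on the GPU
B = gather(H);           % step 3: copy the results back to main memory
```

Keeping intermediate results on the GPU between kernel calls avoids repeating steps 1 and 3, which are exactly the transfers that limit overall speed.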
Convolutions are used in many engineering and mathematics applications, especially in low-level image processing with blur filters or edge detectors. Mathematically, a convolution measures the amount of overlap between two functions [1]. In the context of image processing, a convolution filter is just the scalar product of the filter weights with the input pixels within a window surrounding each output pixel. In our paper we use algorithms implemented by Dirk-Jan Kroon [5]. The scalar-product form of convolution is a parallel operation and is therefore well suited to computation on highly parallel hardware such as the GPU. In the test, a rotationally symmetric Gaussian low-pass filter with standard deviation sigma equal to 3 is applied to the test image. First, the main MATLAB file is prepared, which calls the convolution on the CPU and on the GPU.

    % Load an image
    I = im2double(imread('scenag.tif'));
    % Create a Gaussian filtering kernel
    H = fspecial('gaussian', [], 3);
    % Perform the convolution on the CPU
    tStart = tic;
    J = conv2(I, H);
    tElapsed = toc(tStart);
    % Perform the convolution on the GPU
    t2Start = tic;
    [Jcuda, t1Elapsed] = gpuconv2(I, H);
    t2Elapsed = toc(t2Start);
    % Show the results
    figure, imshow(J, []);
    title(['CPU filtering time: ', num2str(tElapsed), ' s']);
    figure, imshow(Jcuda, []);
    title(['GPU filtering time: ', num2str(t1Elapsed), ' s, ', num2str(t2Elapsed), ' s']);
    % END

The function gpuconv2 is responsible for creating the CUDA kernel for MATLAB, transferring the data to the GPU, setting the proper parameters for the CUDA program, preparing GPU memory for the output matrix, performing the convolution on the GPU and, finally, gathering the output data back to the main (CPU) memory.

Tab. 1: GPU-enabled MATLAB functions

    function [C, t2Elapsed] = gpuconv2(A, B, SHAPE)
    %GPUCONV2 Two-dimensional convolution on the GPU using CUDA.
    %
    %   C = GPUCONV2(A, B) performs the 2-D convolution of matrices A and B.
    %   If [ma,na] = size(A), [mb,nb] = size(B), and [mc,nc] = size(C), then
    %   mc = max([ma+mb-1, ma, mb]) and nc = max([na+nb-1, na, nb]).
    %
    %   C = GPUCONV2(A, B, SHAPE) returns a subsection of the 2-D convolution
    %   with size specified by SHAPE:
    %     'full'  - (default) returns the full 2-D convolution,
    %     'same'  - returns the central part of the convolution that is the
    %               same size as A,
    %     'valid' - returns only those parts of the convolution that are
    %               computed without the zero-padded edges,
    %               size(C) = max([ma-max(0,mb-1), na-max(0,nb-1)], 0).
    %
    %   Function is written by D. Kroon, University of Twente (December 2010)

    % Create the CUDA kernel
    Conv2Kernel = parallel.gpu.CUDAKernel('conv2_double.ptx', 'conv2_double.cu');
    % Matrices A, B to GPU
    A_Cuda = gpuArray(double(A));
    B_Cuda = gpuArray(double(B));
    if (any((size(A) - size(B)) < 0))
        error('gpuconv2:inputs', 'Size Matrix A must be larger than Size Matrix B');
    end
    % Sizes of the matrices to CUDA
    Asize_Cuda = gpuArray(int32(size(A)));
    Bsize_Cuda = gpuArray(int32(size(B)));
    % Output size C
    if (nargin < 3), SHAPE = 'full'; end
    switch (lower(SHAPE))
        case 'full'
            Csize = size(A) + size(B) - 1;
        case 'same'
            Csize = size(A);
        case 'valid'
            Csize = size(A) - size(B) + 1;
        otherwise
            error('gpuconv2:inputs', 'Unknown Shape');
    end
    % Size to GPU memory
    Csize_Cuda = gpuArray(int32(Csize));
    % Calculate max block size
    s = floor(sqrt(Conv2Kernel.MaxThreadsPerBlock));
    % Create blocks of s x s pixels
    Conv2Kernel.ThreadBlockSize = [s s 1];
    % Make a grid to process the whole output image in blocks
    Conv2Kernel.GridSize = ceil(Csize / s);
    % Initialize memory for the output image
    C_Cuda = parallel.gpu.GPUArray.zeros(Csize(1), Csize(2), 'double');
    % Perform the convolution on the GPU
    t2Start = tic;
    C_Cuda = feval(Conv2Kernel, C_Cuda, Csize_Cuda, A_Cuda, Asize_Cuda, B_Cuda, Bsize_Cuda);
    t2Elapsed = toc(t2Start);
    % Get the data back to the main memory
    C = gather(C_Cuda);
    % END

The function conv2 is the CUDA kernel responsible for calculating the output convolution matrix.

    __global__ void conv2(double *J, int *Jsize, double *I, int *Isize,
                          double *K, int *Ksize)
    {
        // Location in a 2D matrix
        int idx = threadIdx.x + blockIdx.x * blockDim.x;
        // If outside the output image, stop the thread
        if (idx > Jsize[0] - 1) return;
        int idy = threadIdx.y + blockIdx.y * blockDim.y;
        // If outside the output image, stop the thread
        if (idy > Jsize[1] - 1) return;
        // The location in the J 2D matrix, defined by a 1D value
        int id = idx + idy * Jsize[0];
        // Kernel offset (to deal with conv2 full, same and valid mode)
        int ox = idx - (Ksize[0] + 1 + Jsize[0] - Isize[0]) / 2;
        int oy = idy - (Ksize[1] + 1 + Jsize[1] - Isize[1]) / 2;
        // Filtering result value
        double result = 0;
        // Loop value
        int i = 0;
        // Image boundary
        int ibx = Isize[0] - 1;
        int iby = Isize[1] - 1;
        // Check if filtering near an image boundary
        if (ox < 0 || oy < 0 || (ox + Ksize[0]) > ibx || (oy + Ksize[1]) > iby)
        {
            idy = oy;
            // Loop through the kernel
            for (int itt = 0; itt < Ksize[0] * Ksize[1]; ++itt)
            {
                // Boundary check
                if (!(idy < 0 || idy > iby))
                {
                    // Get the location in the image
                    idx = i + ox;
                    // Boundary check
                    if (!(idx < 0 || idx > ibx))
                    {
                        // Add the pixel filtered with the kernel
                        result += I[idx + idy * Isize[0]] * K[itt];
                    }
                }
                i++;
                if (i == Ksize[0]) { i = 0; idy++; }
            }
            J[id] = result;
        }
        else
        {
            int o = oy * Isize[0] + ox;
            // Loop through the kernel
            for (int itt = 0; itt < Ksize[0] * Ksize[1]; ++itt)
            {
                // Add the pixel filtered with the kernel
                result += I[i + o] * K[itt];
                i++;
                if (i == Ksize[0]) { i = 0; o += Isize[0]; }
            }
            J[id] = result;
        }
    }
    // END

The results of our experiments are very interesting: the 2-D convolution kernel executed on the GPU (GeForce GTX 280) is over four times faster than the same application executed on the CPU (Intel E8400, 3 GHz) when data loading is excluded (T_CPU = 0.18357 s, T_GPU1 = 0.044952 s, S_P1 = 4.0837), and three times faster when data loading is included (T_CPU = 0.18357 s, T_GPU2 = 0.060001 s, S_P2 = 3.0595). It is thus possible to achieve significant acceleration using parallel computing algorithms written in CUDA, running on the GPU and called from MATLAB. This is a new opportunity for parallel algorithms that can be run on commonly available computers.

Fig. 2: Input image [3, 6]

Fig. 3: Output images after performing the convolution operation on the CPU (left) and the GPU (right); both solutions give the same result

9 Conclusions

MATLAB (MATrix LABoratory) is a high-performance language for technical computing. Typical uses include math and computation, algorithm development, modeling, simulation, data analysis and visualization, and scientific and engineering graphics. It is a good choice for vision program development because it makes very rapid prototyping easy, contains a good library of image processing functions and has excellent display capabilities. MATLAB code is optimized to be relatively quick when performing matrix operations. To bring applications closer to real-time performance, the possibility of parallel execution should be used. Today's microprocessors have two or four computational cores (we can expect even more in the future), and modern computers have sophisticated, hierarchical memory structures. MATLAB users now have access to clusters and networks of machines (and already to GPUs), and will soon have personal parallel computers, so parallel MATLAB [4] is a result of all these changes. The MathWorks toolset provides users with a set of constructs that can be applied to exploit various types of parallelism with minimal effort: the way to exploit task parallelism is the parfor construction, while distributed arrays and parallel functions target data-parallel problems. Users can choose an adequate level of parallelism in their applications, from low-level message-passing functions to high-level distributed arrays and parallel loops. The toolset was also designed to be portable across multiple platforms and architectures (Linux, Macintosh, Windows).
The parallel convolution algorithm implemented in CUDA and used in MATLAB shows a fourfold acceleration, so the advantages of using it are significant.

Acknowledgements

Project MNiSW no. 0128/R/t00/2010/12

References

[1] B. Dushaw, Matlab and CUDA, APL, University of Washington, Seattle, 2010
[2] G. Sharma, J. Martin, MATLAB: A Language for Parallel Computing, International Journal of Parallel Programming, Vol. 37, 2009
[3] Imagery Library for Intelligent Detection Systems,
[4] C. Moler, Parallel MATLAB: Multiple Processors and Multiple Cores, com/company/newsletters/articles/parallel-matlab-multiple-processors-and-multiple-cores.html
[5] fileexchange/29648-gpuconv2
[6]


More information

Modeling a 4G LTE System in MATLAB

Modeling a 4G LTE System in MATLAB Modeling a 4G LTE System in MATLAB Part 2: Simulation acceleration Houman Zarrinkoub PhD. Signal Processing Product Manager MathWorks houmanz@mathworks.com 2011 The MathWorks, Inc. 1 Why simulation acceleration?

More information

Parallel Computing with MATLAB

Parallel Computing with MATLAB Parallel Computing with MATLAB Jos Martin Principal Architect, Parallel Computing Tools jos.martin@mathworks.co.uk 2015 The MathWorks, Inc. 1 Overview Scene setting Task Parallel (par*) Why doesn t it

More information

Efficient Multi-GPU CUDA Linear Solvers for OpenFOAM

Efficient Multi-GPU CUDA Linear Solvers for OpenFOAM Efficient Multi-GPU CUDA Linear Solvers for OpenFOAM Alexander Monakov, amonakov@ispras.ru Institute for System Programming of Russian Academy of Sciences March 20, 2013 1 / 17 Problem Statement In OpenFOAM,

More information

Georgia Institute of Technology Center for Signal and Image Processing Steve Conover February 2009

Georgia Institute of Technology Center for Signal and Image Processing Steve Conover February 2009 Georgia Institute of Technology Center for Signal and Image Processing Steve Conover February 2009 Introduction CUDA is a tool to turn your graphics card into a small computing cluster. It s not always

More information

Using GPUs to compute the multilevel summation of electrostatic forces

Using GPUs to compute the multilevel summation of electrostatic forces Using GPUs to compute the multilevel summation of electrostatic forces David J. Hardy Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of

More information

Implementation of the finite-difference method for solving Maxwell`s equations in MATLAB language on a GPU

Implementation of the finite-difference method for solving Maxwell`s equations in MATLAB language on a GPU Implementation of the finite-difference method for solving Maxwell`s equations in MATLAB language on a GPU 1 1 Samara National Research University, Moskovskoe Shosse 34, Samara, Russia, 443086 Abstract.

More information

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters 1 On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters N. P. Karunadasa & D. N. Ranasinghe University of Colombo School of Computing, Sri Lanka nishantha@opensource.lk, dnr@ucsc.cmb.ac.lk

More information

Mit MATLAB auf der Überholspur Methoden zur Beschleunigung von MATLAB Anwendungen

Mit MATLAB auf der Überholspur Methoden zur Beschleunigung von MATLAB Anwendungen Mit MATLAB auf der Überholspur Methoden zur Beschleunigung von MATLAB Anwendungen Michael Glaßer Application Engineering MathWorks Germany 2014 The MathWorks, Inc. 1 Key Takeaways 1. Speed up your serial

More information

CUDA (Compute Unified Device Architecture)

CUDA (Compute Unified Device Architecture) CUDA (Compute Unified Device Architecture) Mike Bailey History of GPU Performance vs. CPU Performance GFLOPS Source: NVIDIA G80 = GeForce 8800 GTX G71 = GeForce 7900 GTX G70 = GeForce 7800 GTX NV40 = GeForce

More information

Parallel Computing with MATLAB

Parallel Computing with MATLAB Parallel Computing with MATLAB Jos Martin Principal Architect, Parallel Computing Tools jos.martin@mathworks.co.uk 1 2013 The MathWorks, Inc. www.matlabexpo.com Code used in this presentation can be found

More information

Using Parallel Computing Toolbox to accelerate the Video and Image Processing Speed. Develop parallel code interactively

Using Parallel Computing Toolbox to accelerate the Video and Image Processing Speed. Develop parallel code interactively Using Parallel Computing Toolbox to accelerate the Video and Image Processing Speed Presenter: Claire Chuang TeraSoft Inc. Agenda Develop parallel code interactively parallel applications for faster processing

More information

Accelerating Implicit LS-DYNA with GPU

Accelerating Implicit LS-DYNA with GPU Accelerating Implicit LS-DYNA with GPU Yih-Yih Lin Hewlett-Packard Company Abstract A major hindrance to the widespread use of Implicit LS-DYNA is its high compute cost. This paper will show modern GPU,

More information

MAGMA. Matrix Algebra on GPU and Multicore Architectures

MAGMA. Matrix Algebra on GPU and Multicore Architectures MAGMA Matrix Algebra on GPU and Multicore Architectures Innovative Computing Laboratory Electrical Engineering and Computer Science University of Tennessee Piotr Luszczek (presenter) web.eecs.utk.edu/~luszczek/conf/

More information

Multicore Computer, GPU 및 Cluster 환경에서의 MATLAB Parallel Computing 기능

Multicore Computer, GPU 및 Cluster 환경에서의 MATLAB Parallel Computing 기능 Multicore Computer, GPU 및 Cluster 환경에서의 MATLAB Parallel Computing 기능 성호현 MathWorks Korea 2012 The MathWorks, Inc. 1 A Question to Consider Do you want to speed up your algorithms? If so Do you have a multi-core

More information

HPC with Multicore and GPUs

HPC with Multicore and GPUs HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville COSC 594 Lecture Notes March 22, 2017 1/20 Outline Introduction - Hardware

More information

Supporting Data Parallelism in Matcloud: Final Report

Supporting Data Parallelism in Matcloud: Final Report Supporting Data Parallelism in Matcloud: Final Report Yongpeng Zhang, Xing Wu 1 Overview Matcloud is an on-line service to run Matlab-like script on client s web browser. Internally it is accelerated by

More information

Introduction to Matlab GPU Acceleration for. Computational Finance. Chuan- Hsiang Han 1. Section 1: Introduction

Introduction to Matlab GPU Acceleration for. Computational Finance. Chuan- Hsiang Han 1. Section 1: Introduction Introduction to Matlab GPU Acceleration for Computational Finance Chuan- Hsiang Han 1 Abstract: This note aims to introduce the concept of GPU computing in Matlab and demonstrates several numerical examples

More information

Bindel, Spring 2015 Numerical Analysis (CS 4220) Figure 1: A blurred mystery photo taken at the Ithaca SPCA. Proj 2: Where Are My Glasses?

Bindel, Spring 2015 Numerical Analysis (CS 4220) Figure 1: A blurred mystery photo taken at the Ithaca SPCA. Proj 2: Where Are My Glasses? Figure 1: A blurred mystery photo taken at the Ithaca SPCA. Proj 2: Where Are My Glasses? 1 Introduction The image in Figure 1 is a blurred version of a picture that I took at the local SPCA. Your mission

More information

Optimization solutions for the segmented sum algorithmic function

Optimization solutions for the segmented sum algorithmic function Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code

More information

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367

More information

Evaluating the MATLAB Parallel Computing Toolbox Kashif Hussain Computer Science 2012/2013

Evaluating the MATLAB Parallel Computing Toolbox Kashif Hussain Computer Science 2012/2013 Evaluating the MATLAB Parallel Computing Toolbox Kashif Hussain Computer Science 2012/2013 The candidate confirms that the work submitted is their own and the appropriate credit has been given where reference

More information

Lecture 1: Introduction and Computational Thinking

Lecture 1: Introduction and Computational Thinking PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 1: Introduction and Computational Thinking 1 Course Objective To master the most commonly used algorithm techniques and computational

More information

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in

More information

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand

More information

CUDA Experiences: Over-Optimization and Future HPC

CUDA Experiences: Over-Optimization and Future HPC CUDA Experiences: Over-Optimization and Future HPC Carl Pearson 1, Simon Garcia De Gonzalo 2 Ph.D. candidates, Electrical and Computer Engineering 1 / Computer Science 2, University of Illinois Urbana-Champaign

More information

Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors

Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors Michael Boyer, David Tarjan, Scott T. Acton, and Kevin Skadron University of Virginia IPDPS 2009 Outline Leukocyte

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

Module 4. Computer-Aided Design (CAD) systems

Module 4. Computer-Aided Design (CAD) systems Module 4. Computer-Aided Design (CAD) systems Nowadays the design of complex systems is unconceivable without computers. The fast computers, the sophisticated developing environments and the well elaborated

More information

Technical Computing with MATLAB

Technical Computing with MATLAB Technical Computing with MATLAB University Of Bath Seminar th 19 th November 2010 Adrienne James (Application Engineering) 1 Agenda Introduction to MATLAB Importing, visualising and analysing data from

More information

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS Computer Science 14 (4) 2013 http://dx.doi.org/10.7494/csci.2013.14.4.679 Dominik Żurek Marcin Pietroń Maciej Wielgosz Kazimierz Wiatr THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT

More information

Technology for a better society. hetcomp.com

Technology for a better society. hetcomp.com Technology for a better society hetcomp.com 1 J. Seland, C. Dyken, T. R. Hagen, A. R. Brodtkorb, J. Hjelmervik,E Bjønnes GPU Computing USIT Course Week 16th November 2011 hetcomp.com 2 9:30 10:15 Introduction

More information

MATLAB on BioHPC. portal.biohpc.swmed.edu Updated for

MATLAB on BioHPC. portal.biohpc.swmed.edu Updated for MATLAB on BioHPC [web] [email] portal.biohpc.swmed.edu biohpc-help@utsouthwestern.edu 1 Updated for 2015-06-17 What is MATLAB High level language and development environment for: - Algorithm and application

More information

Accelerating CFD with Graphics Hardware

Accelerating CFD with Graphics Hardware Accelerating CFD with Graphics Hardware Graham Pullan (Whittle Laboratory, Cambridge University) 16 March 2009 Today Motivation CPUs and GPUs Programming NVIDIA GPUs with CUDA Application to turbomachinery

More information

Measurement of real time information using GPU

Measurement of real time information using GPU Measurement of real time information using GPU Pooja Sharma M. Tech Scholar, Department of Electronics and Communication E-mail: poojachaturvedi1985@gmail.com Rajni Billa M. Tech Scholar, Department of

More information

Parallelism paradigms

Parallelism paradigms Parallelism paradigms Intro part of course in Parallel Image Analysis Elias Rudberg elias.rudberg@it.uu.se March 23, 2011 Outline 1 Parallelization strategies 2 Shared memory 3 Distributed memory 4 Parallelization

More information

Speed up a Machine-Learning-based Image Super-Resolution Algorithm on GPGPU

Speed up a Machine-Learning-based Image Super-Resolution Algorithm on GPGPU Speed up a Machine-Learning-based Image Super-Resolution Algorithm on GPGPU Ke Ma 1, and Yao Song 2 1 Department of Computer Sciences 2 Department of Electrical and Computer Engineering University of Wisconsin-Madison

More information

MatCL - OpenCL MATLAB Interface

MatCL - OpenCL MATLAB Interface MatCL - OpenCL MATLAB Interface MatCL - OpenCL MATLAB Interface Slide 1 MatCL - OpenCL MATLAB Interface OpenCL toolkit for Mathworks MATLAB/SIMULINK Compile & Run OpenCL Kernels Handles OpenCL memory management

More information

Introduction to GPU hardware and to CUDA

Introduction to GPU hardware and to CUDA Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 35 Course outline Introduction to GPU hardware

More information

Offloading Java to Graphics Processors

Offloading Java to Graphics Processors Offloading Java to Graphics Processors Peter Calvert (prc33@cam.ac.uk) University of Cambridge, Computer Laboratory Abstract Massively-parallel graphics processors have the potential to offer high performance

More information

Computer Vision on Tegra K1. Chen Sagiv SagivTech Ltd.

Computer Vision on Tegra K1. Chen Sagiv SagivTech Ltd. Computer Vision on Tegra K1 Chen Sagiv SagivTech Ltd. Established in 2009 and headquartered in Israel Core domain expertise: GPU Computing and Computer Vision What we do: - Technology - Solutions - Projects

More information

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can

More information

Parallel Computing with Matlab and R

Parallel Computing with Matlab and R Parallel Computing with Matlab and R scsc@duke.edu https://wiki.duke.edu/display/scsc Tom Milledge tm103@duke.edu Overview Running Matlab and R interactively and in batch mode Introduction to Parallel

More information

Technical Report TR

Technical Report TR Technical Report TR-2012-03 Parallel Ray Tracing Simulations with MATLAB for Dynamic Lens Systems Nicolai Wengert and Dan Negrut August 10, 2012 Abstract Ray tracing simulations are required for investigating

More information

An Introduction to OpenAcc

An Introduction to OpenAcc An Introduction to OpenAcc ECS 158 Final Project Robert Gonzales Matthew Martin Nile Mittow Ryan Rasmuss Spring 2016 1 Introduction: What is OpenAcc? OpenAcc stands for Open Accelerators. Developed by

More information

Parallelism. Parallel Hardware. Introduction to Computer Systems

Parallelism. Parallel Hardware. Introduction to Computer Systems Parallelism We have been discussing the abstractions and implementations that make up an individual computer system in considerable detail up to this point. Our model has been a largely sequential one,

More information

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou Study and implementation of computational methods for Differential Equations in heterogeneous systems Asimina Vouronikoy - Eleni Zisiou Outline Introduction Review of related work Cyclic Reduction Algorithm

More information

Using the MATLAB Parallel Computing Toolbox on the UB CCR cluster

Using the MATLAB Parallel Computing Toolbox on the UB CCR cluster Using the MATLAB Parallel Computing Toolbox on the UB CCR cluster N. Barlow, C. Cornelius, S. Matott Center for Computational Research University at Buffalo State University of New York October, 1, 2013

More information

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620 Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved

More information

Applications of Berkeley s Dwarfs on Nvidia GPUs

Applications of Berkeley s Dwarfs on Nvidia GPUs Applications of Berkeley s Dwarfs on Nvidia GPUs Seminar: Topics in High-Performance and Scientific Computing Team N2: Yang Zhang, Haiqing Wang 05.02.2015 Overview CUDA The Dwarfs Dynamic Programming Sparse

More information

GPU Implementation of Implicit Runge-Kutta Methods

GPU Implementation of Implicit Runge-Kutta Methods GPU Implementation of Implicit Runge-Kutta Methods Navchetan Awasthi, Abhijith J Supercomputer Education and Research Centre Indian Institute of Science, Bangalore, India navchetanawasthi@gmail.com, abhijith31792@gmail.com

More information

This is a draft chapter from an upcoming CUDA textbook by David Kirk from NVIDIA and Prof. Wen-mei Hwu from UIUC.

This is a draft chapter from an upcoming CUDA textbook by David Kirk from NVIDIA and Prof. Wen-mei Hwu from UIUC. David Kirk/NVIDIA and Wen-mei Hwu, 2006-2008 This is a draft chapter from an upcoming CUDA textbook by David Kirk from NVIDIA and Prof. Wen-mei Hwu from UIUC. Please send any comment to dkirk@nvidia.com

More information

MATLAB is a multi-paradigm numerical computing environment fourth-generation programming language. A proprietary programming language developed by

MATLAB is a multi-paradigm numerical computing environment fourth-generation programming language. A proprietary programming language developed by 1 MATLAB is a multi-paradigm numerical computing environment fourth-generation programming language. A proprietary programming language developed by MathWorks In 2004, MATLAB had around one million users

More information

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy

More information

Embarquez votre Intelligence Artificielle (IA) sur CPU, GPU et FPGA

Embarquez votre Intelligence Artificielle (IA) sur CPU, GPU et FPGA Embarquez votre Intelligence Artificielle (IA) sur CPU, GPU et FPGA Pierre Nowodzienski Engineer pierre.nowodzienski@mathworks.fr 2018 The MathWorks, Inc. 1 From Data to Business value Make decisions Get

More information

CUDA. GPU Computing. K. Cooper 1. 1 Department of Mathematics. Washington State University

CUDA. GPU Computing. K. Cooper 1. 1 Department of Mathematics. Washington State University GPU Computing K. Cooper 1 1 Department of Mathematics Washington State University 2014 Review of Parallel Paradigms MIMD Computing Multiple Instruction Multiple Data Several separate program streams, each

More information

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization

More information

SIFT Descriptor Extraction on the GPU for Large-Scale Video Analysis. Hannes Fassold, Jakub Rosner

SIFT Descriptor Extraction on the GPU for Large-Scale Video Analysis. Hannes Fassold, Jakub Rosner SIFT Descriptor Extraction on the GPU for Large-Scale Video Analysis Hannes Fassold, Jakub Rosner 2014-03-26 2 Overview GPU-activities @ AVM research group SIFT descriptor extraction Algorithm GPU implementation

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

Two-Phase flows on massively parallel multi-gpu clusters

Two-Phase flows on massively parallel multi-gpu clusters Two-Phase flows on massively parallel multi-gpu clusters Peter Zaspel Michael Griebel Institute for Numerical Simulation Rheinische Friedrich-Wilhelms-Universität Bonn Workshop Programming of Heterogeneous

More information

COMP 322: Fundamentals of Parallel Programming. Flynn s Taxonomy for Parallel Computers

COMP 322: Fundamentals of Parallel Programming. Flynn s Taxonomy for Parallel Computers COMP 322: Fundamentals of Parallel Programming Lecture 37: General-Purpose GPU (GPGPU) Computing Max Grossman, Vivek Sarkar Department of Computer Science, Rice University max.grossman@rice.edu, vsarkar@rice.edu

More information

Parallel Systems I The GPU architecture. Jan Lemeire

Parallel Systems I The GPU architecture. Jan Lemeire Parallel Systems I The GPU architecture Jan Lemeire 2012-2013 Sequential program CPU pipeline Sequential pipelined execution Instruction-level parallelism (ILP): superscalar pipeline out-of-order execution

More information

XIV International PhD Workshop OWD 2012, October Optimal structure of face detection algorithm using GPU architecture

XIV International PhD Workshop OWD 2012, October Optimal structure of face detection algorithm using GPU architecture XIV International PhD Workshop OWD 2012, 20 23 October 2012 Optimal structure of face detection algorithm using GPU architecture Dmitry Pertsau, Belarusian State University of Informatics and Radioelectronics

More information

Grand Central Dispatch

Grand Central Dispatch A better way to do multicore. (GCD) is a revolutionary approach to multicore computing. Woven throughout the fabric of Mac OS X version 10.6 Snow Leopard, GCD combines an easy-to-use programming model

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory

More information

Stream Processing with CUDA TM A Case Study Using Gamebryo's Floodgate Technology

Stream Processing with CUDA TM A Case Study Using Gamebryo's Floodgate Technology Stream Processing with CUDA TM A Case Study Using Gamebryo's Floodgate Technology Dan Amerson, Technical Director, Emergent Game Technologies Purpose Why am I giving this talk? To answer this question:

More information

Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop

Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop http://cscads.rice.edu/ Discussion and Feedback CScADS Autotuning 07 Top Priority Questions for Discussion

More information

Large Data in MATLAB: A Seismic Data Processing Case Study U. M. Sundar Senior Application Engineer

Large Data in MATLAB: A Seismic Data Processing Case Study U. M. Sundar Senior Application Engineer Large Data in MATLAB: A Seismic Data Processing Case Study U. M. Sundar Senior Application Engineer 2013 MathWorks, Inc. 1 Problem Statement: Scaling Up Seismic Analysis Challenge: Developing a seismic

More information

ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS

ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS Ferdinando Alessi Annalisa Massini Roberto Basili INGV Introduction The simulation of wave propagation

More information

Speedup Altair RADIOSS Solvers Using NVIDIA GPU

Speedup Altair RADIOSS Solvers Using NVIDIA GPU Innovation Intelligence Speedup Altair RADIOSS Solvers Using NVIDIA GPU Eric LEQUINIOU, HPC Director Hongwei Zhou, Senior Software Developer May 16, 2012 Innovation Intelligence ALTAIR OVERVIEW Altair

More information

General Purpose GPU Computing in Partial Wave Analysis

General Purpose GPU Computing in Partial Wave Analysis JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data

More information

Parallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs

Parallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs Parallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs C.-C. Su a, C.-W. Hsieh b, M. R. Smith b, M. C. Jermy c and J.-S. Wu a a Department of Mechanical Engineering, National Chiao Tung

More information

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran

More information

GeoImaging Accelerator Pansharpen Test Results. Executive Summary

GeoImaging Accelerator Pansharpen Test Results. Executive Summary Executive Summary After demonstrating the exceptional performance improvement in the orthorectification module (approximately fourteen-fold see GXL Ortho Performance Whitepaper), the same approach has

More information

arxiv: v1 [physics.ins-det] 11 Jul 2015

arxiv: v1 [physics.ins-det] 11 Jul 2015 GPGPU for track finding in High Energy Physics arxiv:7.374v [physics.ins-det] Jul 5 L Rinaldi, M Belgiovine, R Di Sipio, A Gabrielli, M Negrini, F Semeria, A Sidoti, S A Tupputi 3, M Villa Bologna University

More information

MATLAB Based Optimization Techniques and Parallel Computing

MATLAB Based Optimization Techniques and Parallel Computing MATLAB Based Optimization Techniques and Parallel Computing Bratislava June 4, 2009 2009 The MathWorks, Inc. Jörg-M. Sautter Application Engineer The MathWorks Agenda Introduction Local and Smooth Optimization

More information

Computing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany

Computing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany Computing on GPUs Prof. Dr. Uli Göhner DYNAmore GmbH Stuttgart, Germany Summary: The increasing power of GPUs has led to the intent to transfer computing load from CPUs to GPUs. A first example has been

More information

(Refer Slide Time 00:17) Welcome to the course on Digital Image Processing. (Refer Slide Time 00:22)

(Refer Slide Time 00:17) Welcome to the course on Digital Image Processing. (Refer Slide Time 00:22) Digital Image Processing Prof. P. K. Biswas Department of Electronics and Electrical Communications Engineering Indian Institute of Technology, Kharagpur Module Number 01 Lecture Number 02 Application

More information

CME 213 S PRING Eric Darve

CME 213 S PRING Eric Darve CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and

More information

Chapter 3 Parallel Software

Chapter 3 Parallel Software Chapter 3 Parallel Software Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers

More information

High-Performance Data Loading and Augmentation for Deep Neural Network Training

High-Performance Data Loading and Augmentation for Deep Neural Network Training High-Performance Data Loading and Augmentation for Deep Neural Network Training Trevor Gale tgale@ece.neu.edu Steven Eliuk steven.eliuk@gmail.com Cameron Upright c.upright@samsung.com Roadmap 1. The General-Purpose

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information