A Chunking Method for Euclidean Distance Matrix Calculation on Large Dataset Using Multi-GPU


A Chunking Method for Euclidean Distance Matrix Calculation on Large Dataset Using Multi-GPU

Qi Li, Vojislav Kecman, Raied Salman
Department of Computer Science, School of Engineering, Virginia Commonwealth University
Richmond, Virginia 23284, USA

Abstract: Calculating a Euclidean distance matrix is a data intensive operation and becomes computationally prohibitive for large datasets. Recent development of Graphics Processing Units (GPUs) has produced superb performance on scientific computing problems using massively parallel processing cores. However, due to the limited size of device memory, many GPU based algorithms have low capability in solving problems with large datasets. In this paper, a chunking method is proposed to calculate the Euclidean distance matrix on large datasets. It is designed not only for scalability in a multi-GPU environment but also to maximize the computational capability of each individual GPU device. We first implement a fast GPU algorithm that is suitable for calculating submatrices of the Euclidean distance matrix. Then we utilize a Map-Reduce like framework to split the final distance matrix calculation into many small independent jobs of calculating partial distance matrices, which can be efficiently solved by our GPU algorithm. The framework also dynamically allocates GPU resources to those independent jobs for maximum performance. The experimental results show a speedup of 15x on datasets which contain more than half a million data points.

Keywords: Multi-GPU; Euclidean Distance Matrix; Chunking

I. INTRODUCTION

Since the semiconductor industry revealed that high performance processors can no longer be built by simply increasing the clock frequency, the scientific computing market has been offered alternative products with multi-core, multi-processor, or many-core designs, which all shift to parallel architectures. One of these successful products is the Graphics Processing Unit (GPU) based computational device. GPUs used to be integrated only on video cards specialized for 2D and 3D graphics rendering. These applications normally require a higher capability in floating point operations than in logic control and memory fetch operations. Thus GPUs are designed with many built-in floating point units, which can perform computations in parallel, and the GPU itself acts as an assisting processor to the CPU. Because of this nature, GPUs have become more and more popular in many applications which require intensive computation.

General purpose programming on the parallel architecture of a GPU is difficult, not only because most libraries offered are solely for graphics related programming, but also because finding the parallelism is critical in many well known problems. This situation changed when NVIDIA released the Compute Unified Device Architecture (CUDA) [1]. CUDA offers a simplified programming interface, an extension of the C language, for general purpose programming on GPUs. Meanwhile NVIDIA also released their GPU based computational device called Tesla, which is now in its second generation. Open Computing Language (OpenCL), which uses task-based and data-based parallelism, is another framework designed for GPU programming supported by many hardware vendors, e.g. Apple, NVIDIA and ATI. Many CUDA based algorithms can now be ported to OpenCL, but OpenCL is still less popular than CUDA.
Many machine learning algorithms require calculation of certain types of distances, e.g. Euclidean distance, Manhattan distance and Cosine distance, because distance is a good measure of the difference between data points. Thus distance calculation is fundamental to classification and clustering tasks. Distance matrices are widely used in algorithms such as Support Vector Machines [2] [3], K-Nearest Neighbors [4] and K-Means [5]. The popular Radial Basis Function (RBF) kernel matrix used in nonlinear Support Vector Machines can be derived from the Euclidean distance matrix. The K-Nearest Neighbors problem can be solved by simply sorting the distance matrix. K-Means is also easily accelerated by improving the distance calculation. These problems with small or medium datasets can now be efficiently solved by using GPU devices. However, for problems with large datasets, the memory requirements for distance calculation cannot always be met.

Considering a dataset which has n instances, each with m attributes, the time complexity for calculating the complete Euclidean distance matrix is O(n^2 m). The standard CPU based method for calculating the Euclidean distance matrix goes through three nested loops. Obviously, it is only necessary to calculate either the upper triangular or the lower triangular part of the distance matrix because of its symmetry. If the feature space (i.e., the number of attributes for each data point) is small, the time complexity is reduced to O(n^2).

In order to clearly categorize different datasets based on the number of instances and the dimensionality of the attribute space, we define ultra-large datasets (e.g. URL Reputation [6], which has millions of instances and attributes) as datasets which have a large number of instances and a large number of attributes at the same time. Datasets which have a large number of instances but a limited number of attributes (e.g. Mnist [7], Covertype [6] and Poker Hand [6], whose feature spaces are smaller than a thousand) are defined as large datasets. Most real world datasets fall into this category. Certain ultra-large datasets can be converted to large datasets by using feature reduction techniques such as Principal Component Analysis [8]. Our focus is on how to process large datasets. The memory storage for the distance matrix is restricted only by n; however, the memory for distance computations is restricted by both n and m. For instance, storing a single-precision distance matrix for half a million data points already requires n^2 * 4 bytes, i.e. about 10^12 bytes (roughly 1 TB). Thus complete distance matrices of both ultra-large and large datasets cannot be calculated at one time.

Chang et al. proposed a fast implementation for calculating the Euclidean distance matrix using a GPU in [9]. They also have a similar implementation in [10] for calculating the Manhattan distance matrix. Their results show a speedup ranging from 20 times to approximately two orders of magnitude compared to a slow CPU implementation. However, their implementation only suits special cases where the input dataset has both the number of instances and the number of attributes being a multiple of 16. Also, it is specifically designed to calculate the normal symmetric distance matrix only. This significantly limits the contribution of their work for solving real world problems. Besides, the largest dataset tested in their work has only around 12,000 data points with a feature space of 64, which barely reaches the lower bound of large datasets. Common large datasets containing more than 100,000 instances cannot be solved by their method because of the limited memory space on a single GPU device. Currently, there is very little work done on large distance matrix calculation using multi-GPU environments, e.g. workstations with multiple GPU devices or hybrid CPU-GPU clusters. In this paper, we continue the research direction initiated by Chang et al. and propose a chunking method targeted at solving Euclidean distance matrix calculation on large datasets by using a multi-GPU environment.

The organization of this paper is as follows. Section II briefly introduces how to calculate the Euclidean distance matrix using an efficient CPU algorithm instead of the naive one. In this section, a CUDA programming model based implementation for calculating the generalized Euclidean distance matrix is also presented. Section III describes a Map-Reduce like framework for large distance matrix calculation, including how to split the calculation and how to maximally utilize the available hardware resources in a multi-GPU environment. The performance results of this chunking method on real world datasets are shown at the end of this section. Section IV summarizes the merits of this work and discusses possible improvements and extensions in future work.

II. CUDA IMPLEMENTATION FOR CALCULATING GENERALIZED EUCLIDEAN DISTANCE MATRIX

Euclidean distance is one of the most common distances used in many applications. Let us consider an n by m dataset matrix S, where n is the number of data points and m is the size of the feature space. The Euclidean distance between any data points s_i and s_j can be calculated by

d_ij = sqrt( sum_{k=1}^{m} (s_ik - s_jk)^2 ).    (1)

The computational complexity for each pairwise distance is O(m).
Since the feature space has a limited size in the case of large datasets, the time complexity for calculating one pairwise Euclidean distance can be considered constant.

A. Naive CPU Algorithm and Efficient CPU Algorithm

A normal distance matrix is defined as a 2-D array containing the distances of a set of data points taken pairwise. It is a symmetric square matrix with zero entries on the diagonal. In a broader sense, the generalized distance matrix stores distances between any two data points from two datasets. Thus, the generalized distance matrix is actually a submatrix of a normal distance matrix. This submatrix can be an asymmetric square matrix or even a rectangular matrix. The generalized distance matrix is used more frequently than the normal distance matrix in solving real world problems. It is also the core part of calculating a large distance matrix.

Algorithm 1 shows the standard procedure for calculating a generalized Euclidean distance matrix. It is considered a naive method because of its low performance. Two input datasets are loaded into system memory and stored in two 2-D arrays A and B, which have n_A and n_B data points, respectively. Data points in both A and B are m-dimensional. The memory space for storing the distance matrix D is preallocated and matrix D has a dimension of n_A by n_B. Each pairwise distance is calculated inside two nested loops.

In order to achieve the maximum performance of a multi-core CPU, a matrix operation based method is used to calculate the distance matrix instead of using nested loops. The Euclidean distance matrix, unlike other distance matrices (e.g., Manhattan and Cosine distance matrices), can be decomposed into matrix level operations. Algorithm 2 shows how to use matrix operations to calculate the generalized Euclidean distance matrix. The algorithm contains element-wise matrix multiplication, matrix-vector multiplication and matrix-matrix multiplication. In the testing program, MKL BLAS routines from Intel are used to accomplish these matrix operations and the code is compiled with multi-thread support enabled in MKL. This is well known to be one of the fastest implementations for calculating the Euclidean distance matrix on a CPU.
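To make the decomposition concrete, a minimal CPU-side sketch of this matrix-operation approach is given below (the corresponding pseudocode appears below as Algorithm 2). It assumes row-major, single-precision storage and uses the generic CBLAS interface for the matrix-matrix product; the function name and layout are our own illustration, not the authors' code.

#include <math.h>
#include <stdlib.h>
#include <cblas.h>   /* CBLAS interface, e.g. as shipped with MKL or OpenBLAS */

/* A is nA x m, B is nB x m, D is nA x nB, all row-major single precision. */
void euclidean_distance_matrix(const float *A, const float *B, float *D,
                               int nA, int nB, int m)
{
    float *sqA = malloc(nA * sizeof(float));   /* squared norms of the rows of A */
    float *sqB = malloc(nB * sizeof(float));   /* squared norms of the rows of B */

    for (int i = 0; i < nA; ++i) {
        float s = 0.0f;
        for (int k = 0; k < m; ++k) s += A[i * m + k] * A[i * m + k];
        sqA[i] = s;
    }
    for (int j = 0; j < nB; ++j) {
        float s = 0.0f;
        for (int k = 0; k < m; ++k) s += B[j * m + k] * B[j * m + k];
        sqB[j] = s;
    }

    /* D <- -2 * A * B^T (the cross term, computed with one GEMM call) */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                nA, nB, m, -2.0f, A, m, B, m, 0.0f, D, nB);

    /* D_ij <- sqrt(||a_i||^2 + ||b_j||^2 - 2 a_i . b_j) */
    for (int i = 0; i < nA; ++i)
        for (int j = 0; j < nB; ++j) {
            float v = sqA[i] + sqB[j] + D[i * nB + j];
            D[i * nB + j] = sqrtf(v > 0.0f ? v : 0.0f);   /* clamp tiny negative round-off */
        }

    free(sqA);
    free(sqB);
}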

Algorithm 1 Standard procedure for generalized Euclidean distance matrix calculation
(1) begin
(2)   Load in A, B and allocate memory for D
(3)   for i := 1 to n_A do
(4)     for j := 1 to n_B do
(5)       d_ij := computeDistance(a_i, b_j);
(6)     od
(7)   od
(8)   return D
(9) funct computeDistance(a_i, b_j)
(10)   d_ij := 0;
(11)   for k := 1 to m do
(12)     d_ij := d_ij + (a_ik - b_jk)^2;
(13)   od
(14)   d_ij := sqrt(d_ij);
(15)   return d_ij.
(16) end

Algorithm 2 Matrix operation based method for generalized Euclidean distance matrix calculation
(1) begin
(2)   Load in A, B and allocate memory for D
(3)   comment: (.*) is element-wise multiplication
(4)   v_1 = (A .* A) [1, ..., 1]^T
(5)   v_2 = (B .* B) [1, ..., 1]^T
(6)   P_1 = [v_1 v_1 ... v_1]      Dimension: n_A by n_B
(7)   P_2 = [v_2 v_2 ... v_2]^T    Dimension: n_A by n_B
(8)   P_3 = A B^T                  Dimension: n_A by n_B
(9)   comment: (.)^(1/2) is element-wise square root
(10)  D = (P_1 + P_2 - 2 P_3)^(1/2)
(11)  return D
(12) end

B. CUDA Based Algorithm

CUDA is an extension of the C programming language and the latest version, CUDA 3.0, offers strong support for many C++ features. The programming model is based on a logical representation of three layers, which are grids, blocks and threads, from high to low. It is the user's responsibility to adapt algorithms to a 2-D grid structure. These grid executions, also known as kernel functions, are launched by the CPU. Grids are composed of blocks, which are groups of threads that share local memory and can be synchronized using barriers. Kernel functions are executed simultaneously by multiple threads from the logical view. However, the number of physical threads running on the GPU concurrently is determined by the hardware specification. CUDA provides a hierarchy of memories that differ in their accessibility, size and speed. Registers are the fastest but smallest in size. Shared memory can be as fast as registers but it is also very limited in size. Texture memory is larger but read-only, and device memory is the slowest one, with the largest size compared to the others.

The CUDA2 implementation shown in [9] is one good example of using shared memory. It takes one dataset as input and calculates its normal distance matrix. However, as mentioned in Section I, the way the code is written makes it suitable only for special input datasets. In the case of a large dataset, it is impossible to load all data points into the GPU device memory, thus the distance matrix must be calculated by combining its submatrices together. This part will be discussed in detail in the next section. Here, we rework this method to make it suitable for calculating the generalized distance matrix between two datasets with any number of data points and any size of feature space. Continuing with the previous example, the distance matrix from A to B is an n_A by n_B rectangular matrix. Similarly, the distance matrix from B to A is the transpose of the above matrix. In order to correctly map the distance matrix calculation to the CUDA grid representation, we use the following code to create a 16 by 16 2-D block containing 256 threads.

#define BLOCK_DIM 16
dim3 dimBlock(BLOCK_DIM, BLOCK_DIM, 1);
dim3 dimGrid((nA + BLOCK_DIM - 1) / BLOCK_DIM, (nB + BLOCK_DIM - 1) / BLOCK_DIM, 1);

The dimensions of the grid are dynamically calculated to ensure that the number of rows multiplied by 16 is no less than n_A and the number of columns multiplied by 16 is no less than n_B. In this way, enough threads are generated to cover the whole distance matrix calculation.
The following code is used to launch the kernel function execution on the GPU.

distKernel<<<dimGrid, dimBlock>>>(/* I/O arguments list */);

Fig. 1 illustrates the idea of how to implement the kernel function. Notice that both input matrices A and B are organized in transposed format in the GPU device memory, so that their columns correspond to the related data points. As shown in the figure, the blocks may exceed the border of the input matrices, so several checking conditions must be added to ensure zeros are used when a block runs out of the boundary, in order to acquire correct results. This solves the issue brought by datasets whose number of data points and size of feature space are not multiples of 16. Each block handles no more than 256 pairwise distance calculations. Each thread within the block goes along the vertical direction and picks up 16 features at a time. All threads are synchronized to the same stage when the data have been loaded into shared memory from device memory. Then feature differences are computed between any two threads and accumulated to the related result. All threads are synchronized again before moving to the next iteration of the loop. When all features have been scanned through by every thread, the final distance is calculated and stored to the corresponding position in the distance matrix D, based on the current block index and thread index.
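For illustration, a condensed CUDA sketch of the tiled kernel just described is given below, reusing the BLOCK_DIM constant defined above. It assumes single precision and the transposed device-memory layout described earlier (feature k of point i of A stored at AT[k * nA + i], and similarly for B); the kernel body and names are our own sketch of the described scheme, not the paper's exact code.

__global__ void distKernel(const float *AT, const float *BT, float *D,
                           int nA, int nB, int m)
{
    __shared__ float tileA[BLOCK_DIM][BLOCK_DIM];   /* 16 features x 16 points of A */
    __shared__ float tileB[BLOCK_DIM][BLOCK_DIM];   /* 16 features x 16 points of B */

    int i = blockIdx.x * BLOCK_DIM + threadIdx.x;   /* point index in A (row of D)    */
    int j = blockIdx.y * BLOCK_DIM + threadIdx.y;   /* point index in B (column of D) */
    float acc = 0.0f;

    for (int k0 = 0; k0 < m; k0 += BLOCK_DIM) {
        /* Cooperative load of a 16-feature slice; zeros are used past the borders. */
        int ka = k0 + threadIdx.y;
        int kb = k0 + threadIdx.x;
        tileA[threadIdx.y][threadIdx.x] = (ka < m && i < nA) ? AT[ka * nA + i] : 0.0f;
        tileB[threadIdx.x][threadIdx.y] = (kb < m && j < nB) ? BT[kb * nB + j] : 0.0f;
        __syncthreads();

        for (int k = 0; k < BLOCK_DIM; ++k) {       /* accumulate squared differences */
            float diff = tileA[k][threadIdx.x] - tileB[k][threadIdx.y];
            acc += diff * diff;
        }
        __syncthreads();
    }

    if (i < nA && j < nB)
        D[i * nB + j] = sqrtf(acc);                 /* final Euclidean distance */
}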

Figure 1. CUDA blocks mapping for generalized distance matrix calculation

The distance matrix is thus complete at the end of the kernel execution. At this point, it is still stored in the GPU device memory and must be copied back to the system memory for any further processing by the CPU. By using shared memory, the threads avoid frequently accessing device memory to fetch data, so that better performance is achieved.

Another way of using the GPU to calculate the Euclidean distance matrix is to use the same idea as in Algorithm 2, but instead of Intel's MKL, CUBLAS [11] and MAGMA [12] are used for the GPU accelerated matrix operations. CUBLAS is the official version of the BLAS routines for the GPU made by NVIDIA and it has proved to be extremely efficient compared to the same implementations on the CPU, e.g. using MKL. Dongarra et al. also published MAGMA, a CUDA based GPU library containing a subset of BLAS routines plus some LAPACK factorization and linear system solver routines. It even outperforms the official CUBLAS version 2.3 in many matrix operations such as matrix-matrix multiplication, which is the time consuming part in distance matrix calculation.

C. Performance Results

The testing computer is equipped with an Intel Xeon quad-core CPU and 16GB RAM. There are three Tesla C1060 GPU devices connected to the system through PCI Express interfaces. Each of these cards has 4GB of device memory. The latest CUDA 3.0 toolkit is used with the 64-bit Linux driver. The operating system is Fedora Core 10 Linux. The benchmarks of the GPU algorithms include the data transfer time in both directions between system memory and device memory as well as the computational time on the GPU.

Table I shows the comparison of normal Euclidean distance matrix calculation among the naive C implementation, the MKL based C implementation, Chang et al.'s CUDA implementation, and the proposed generalized CUDA implementation. It is easy to observe that using MKL with multi-thread support on the CPU can boost the performance 5 to 6 times, thus a comparison with the naive C implementation does not truly reflect the performance gain obtained by using the GPU. Our implementation is slightly slower in these special cases (both n and m are multiples of 16) compared to Chang et al.'s implementation, because the kernel function has been modified to suit general datasets, which cannot be handled by Chang et al.'s method. It takes two datasets as input, thus the same dataset is copied twice from the system memory to the device memory in these special cases. In general, the GPU implementation still has a speedup of approximately 5 times compared to MKL, which is in the reasonable range based on the performance comparison of matrix-matrix multiplication between MKL and CUBLAS shown in [13].

Table II shows the performance comparison of calculating the generalized distance matrix between any two input datasets. Chang et al.'s method is not listed because it is not applicable. The CUBLAS 3.0 based implementation comes out on top and the proposed generalized CUDA implementation is very close to the GPU matrix operation based method. Other distance matrices, e.g. Manhattan distance and Cosine distance, cannot be efficiently transformed to matrix level operations. However, they can still be easily implemented by modifying the proposed method.
III. MAP-REDUCE LIKE MODEL FOR HANDLING LARGE DATASETS

For problems with large datasets, neither can all data points be loaded into the system memory at one time, nor is there enough space for storing the complete distance matrix. Thus it is necessary to break the complete distance matrix down into many submatrices and calculate them individually in a parallel manner. Fig. 2 shows how to split the input dataset into chunks and calculate the generalized distance matrices between any two chunks. Each chunk is assigned an index from 1 to k. The final distance matrix contains k by k small distance matrices. Due to the symmetric property of the complete distance matrix, only k(k + 1)/2 small distance matrices out of the total of k^2 are required to be calculated.

Table I. PERFORMANCE COMPARISON OF SYMMETRIC EUCLIDEAN DISTANCE MATRIX CALCULATION. TIME UNIT IS SECONDS. SPEED UP IS RELATIVE TO THE NAIVE C IMPLEMENTATION. n IS THE NUMBER OF DATA POINTS.

n | Naive C | Efficient C (MKL) | Chang et al.'s CUDA | Generalized CUDA
  |         | (4.96x)           | 0.36 (33.06x)       | 0.47 (25.32x)
  |         | (5.70x)           | 1.42 (34.08x)       | 1.79 (27.04x)
  |         | (5.96x)           | 3.16 (34.40x)       | 3.82 (28.48x)

Table II. PERFORMANCE COMPARISON OF GENERALIZED EUCLIDEAN DISTANCE MATRIX CALCULATION. TIME UNIT IS SECONDS. SPEED UP IS RELATIVE TO THE EFFICIENT C IMPLEMENTATION. n IS THE NUMBER OF DATA POINTS.

n, n | Efficient C (MKL) | Generalized CUDA | CUBLAS 3.0 CUDA | MAGMA 0.2 CUDA
     |                   | (4.60x)          | 0.21 (4.38x)    | 0.22 (4.12x)
     |                   | (4.79x)          | 0.37 (4.92x)    | 0.42 (4.33x)
     |                   | (4.56x)          | 1.56 (5.12x)    | 1.72 (4.64x)
     |                   | (4.51x)          | 2.96 (5.36x)    | 3.33 (4.76x)

Figure 2. Mapping between data chunks and the related distance submatrices

The remaining submatrices can be acquired by simply transposing the calculated ones, e.g. D(1, 2) is the transpose of D(2, 1). These small distance matrices can be calculated using the GPU accelerated method introduced in the previous section. The number of physical GPU devices determines how many grids can be launched simultaneously.

The Map-Reduce [14] pattern has been proposed to handle large data processing problems in a cluster environment. We adopt the merits of this programming pattern and model it to perform the large distance matrix calculation. As shown in Fig. 3, the input reader first reads multiple chunks into the system memory. Then the mapper generates a list of key/value pairs which correspond to the active chunks currently loaded in the system memory. For each key/value pair, both the key and the value store the indices of chunks. The reducers iteratively load pairs of chunks with the same key and search for any available GPU device to launch the distance kernel function. A list stores the IDs of the available GPUs. Any GPU device which is taken by a reducer is removed from the list and appended back after it is released by that reducer. Each reducer only calculates the small distance matrices whose keys are smaller than or equal to their values. The final distance matrix is stored in the form of its upper triangular part. After all small distance matrices with the same key are calculated by a reducer, the results are grouped together and passed to the output writer. The output writer concatenates the results from every reducer and writes them to a distributed file system if available.

A drawback of this approach is that different reducers may have different workloads. This can be solved by fixing the number of small distance matrix calculation jobs assigned to each reducer. For example, the first reducer calculates D(1, 1) to D(1, 5), the second calculates D(1, 6) to D(1, 10) and so on. However, the key value must still be kept the same within each reducer. In this way, only certain reducers may have fewer jobs, but overall the distance calculations are distributed equally among the reducers. When a reducer finishes its job, it notifies the mapper to update the key/value pair list and refresh the system memory by loading in new chunks and deleting used ones.

The complete model requires a GPU cluster environment and extra communication support, e.g. MPI, between different nodes, as well as a proper distributed file system. Our test is done on a workstation with three Tesla C1060 GPU devices, which is much simpler than the GPU cluster environment. We use multi-threading to implement the different functions of the input reader, mapper, reducer and output writer. Since all reducers compete for the GPU device resources on the same computer, whether they have an equal amount of jobs does not matter anymore.
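As a rough illustration of how the jobs are split and scheduled on a single multi-GPU workstation, the sketch below enumerates the k(k+1)/2 required submatrices and lets one host thread per GPU consume jobs from a shared queue. This is our own simplified sketch of the described behavior, not the authors' code; computeSubmatrix is a hypothetical helper that loads the two chunks, launches the distance kernel and copies the result back.

#include <stdlib.h>
#include <omp.h>
#include <cuda_runtime.h>

typedef struct { int row, col; } Job;

/* Assumed helper: loads chunks 'row' and 'col', runs distKernel on them and
   copies the resulting submatrix D(row, col) back to the host. */
void computeSubmatrix(int row, int col);

void chunkedDistanceMatrix(int k, int numGpus)
{
    /* Only the upper triangular part D(i, j) with i <= j needs to be computed. */
    int numJobs = k * (k + 1) / 2;
    Job *jobs = (Job *)malloc(numJobs * sizeof(Job));
    int t = 0;
    for (int i = 1; i <= k; ++i)
        for (int j = i; j <= k; ++j)
            jobs[t++] = (Job){ i, j };

    int next = 0;                              /* index of the next unassigned job */

    #pragma omp parallel num_threads(numGpus)
    {
        cudaSetDevice(omp_get_thread_num());   /* one host thread drives one GPU */
        while (1) {
            int my;
            #pragma omp critical               /* take the next job atomically */
            { my = (next < numJobs) ? next++ : -1; }
            if (my < 0) break;
            computeSubmatrix(jobs[my].row, jobs[my].col);
        }
    }
    free(jobs);
}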
Because all three cards are used for distance matrix calculations all the time, a performance increase of approximately 3x is achieved compared to using one card to do the same job sequentially.

Table III shows the performance of the finalized chunking method tested on real world datasets. File I/O time is excluded because both the CPU and GPU implementations share the same procedure; the time cost is for calculating the submatrices only. The data transfer time for the GPU is shortened because in certain cases a chunk can be reused. For example, if the same GPU is assigned the jobs of calculating D(1, 1) and D(1, 2), only chunk 2 needs to be loaded into the device memory for the second distance matrix calculation. The speedup is close to 15 times when utilizing three GPU devices together on a dataset containing more than half a million data points.

Table III. PERFORMANCE RESULT OF THE CHUNKING METHOD ON REAL WORLD LARGE DATASETS. n IS THE NUMBER OF DATA POINTS, m IS THE SIZE OF THE FEATURE SPACE, c IS THE NUMBER OF CHUNKS. TIME UNITS ARE SECONDS (s) AND MINUTES (m). SPEED UP IS RELATIVE TO THE CPU IMPLEMENTATION.

Datasets  | n       | m | c | Quad-core CPU | 1 GPU          | 3 GPUs
Mnist     | 60,000  |   | 4 |               | 19.39s (4.21x) | (10.51x)
Covertype | 581,012 |   |   |               | 10.84m (5.00x) | 3.62m (14.98x)
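A sketch of the chunk-reuse optimization mentioned above is given below: within one GPU worker, the row chunk stays resident in device memory across consecutive jobs that share the same row index, so only the column chunk is re-copied. It reuses the hypothetical Job type from the previous sketch and leaves the kernel launch as a placeholder; the names and structure are our own assumptions.

#include <cuda_runtime.h>

/* Process the jobs assigned to one GPU, reloading the row chunk only when it changes.
   hostChunks[1..k] hold the data chunks in system memory. */
void runJobsOnOneGpu(const Job *jobs, int numJobs, float **hostChunks, size_t chunkBytes,
                     float *d_rowChunk, float *d_colChunk)
{
    int residentRow = -1;   /* index of the row chunk currently resident on the device */
    for (int t = 0; t < numJobs; ++t) {
        if (jobs[t].row != residentRow) {
            cudaMemcpy(d_rowChunk, hostChunks[jobs[t].row], chunkBytes, cudaMemcpyHostToDevice);
            residentRow = jobs[t].row;
        }
        /* The column chunk changes with every job and is always refreshed. */
        cudaMemcpy(d_colChunk, hostChunks[jobs[t].col], chunkBytes, cudaMemcpyHostToDevice);
        /* ... launch distKernel on (d_rowChunk, d_colChunk) and copy D(row, col) back ... */
    }
}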

Figure 3. Map-Reduce pattern for large distance matrix calculation

IV. CONCLUSIONS AND FUTURE WORK

In this paper, we have presented a novel idea of how to break down the data intensive calculation of a large Euclidean distance matrix and utilize a multi-GPU environment. The proposed chunking method is easy to implement using multi-threading techniques on a standalone workstation. However, there will be many other technical issues in implementing it in a GPU cluster environment. The next stage of our work will be implementing the model for a small GPU cluster and seeking possible optimization methods to maximize the capability of the hardware. Other distances, e.g. Manhattan and Cosine distances, will be included and a simplified interface will be offered to those machine learning algorithms which require distance calculation.

REFERENCES

[1] NVIDIA, CUDA Compute Unified Device Architecture Programming Guide.
[2] T.-M. Huang, V. Kecman, and I. Kopriva, Kernel Based Algorithms for Mining Huge Data Sets, Supervised, Semi-supervised, and Unsupervised Learning. Springer.
[3] B. Catanzaro, N. Sundaram, and K. Keutzer, Fast support vector machine training and classification on graphics processors, in ICML 08: Proceedings of the 25th International Conference on Machine Learning. New York, NY, USA: ACM, 2008.
[4] V. Garcia, E. Debreuve, and M. Barlaud, Fast k nearest neighbor search using GPU, in Computer Vision and Pattern Recognition Workshops, CVPRW 08, IEEE Computer Society Conference on, 2008.
[5] S. A. Shalom, M. Dash, and M. Tue, Efficient k-means clustering using accelerated graphics processors, in DaWaK 08: Proceedings of the 10th International Conference on Data Warehousing and Knowledge Discovery. Berlin, Heidelberg: Springer-Verlag, 2008.
[6] A. Frank and A. Asuncion, UCI machine learning repository. [Online]. Available: http://archive.ics.uci.edu/ml
[7] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, vol. 86, no. 11, Nov. 1998.
[8] M. Andrecut, Parallel GPU implementation of iterative PCA algorithms, Journal of Computational Biology, vol. 16, no. 11, 2009.
[9] D.-J. Chang, N. A. Jones, D. Li, M. Ouyang, and R. K. Ragade, Compute pairwise Euclidean distances of data points with GPUs, in Computational Biology and Bioinformatics, CBB 08, IASTED International Symposium, November 2008.
[10] D.-J. Chang, A. Desoky, M. Ouyang, and E. Rouchka, Compute pairwise Manhattan distance and Pearson correlation coefficient of data points with GPU, in Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing, SNPD, ACIS International Conference on, 2009.
[11] NVIDIA, CUDA CUBLAS Library.
[12] J. Dongarra, S. Tomov et al., Matrix algebra on GPU and multicore architectures. [Online].
[13] J. Cohen, CUDA libraries and tools, 2009. [Online]. Available: SC09 CUDA Tools Cohen.pdf
[14] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan, Interpreting the data: Parallel analysis with Sawzall, Sci. Program., vol. 13, no. 4, 2005.
