A Chunking Method for Euclidean Distance Matrix Calculation on Large Dataset Using Multi-GPU
Qi Li, Vojislav Kecman, Raied Salman
Department of Computer Science, School of Engineering, Virginia Commonwealth University, Richmond, Virginia 23284, USA

Abstract

Calculating a Euclidean distance matrix is a data-intensive operation that becomes computationally prohibitive for large datasets. Recent Graphics Processing Units (GPUs) have delivered strong performance on scientific computing problems by using massively parallel processing cores. However, because device memory is limited, many GPU-based algorithms cope poorly with large datasets. In this paper, a chunking method is proposed to calculate the Euclidean distance matrix on large datasets. It is designed not only for scalability in a multi-GPU environment but also to maximize the computational capability of each individual GPU device. We first implement a fast GPU algorithm suitable for calculating submatrices of the Euclidean distance matrix. Then we use a Map-Reduce-like framework to split the final distance matrix calculation into many small independent jobs of calculating partial distance matrices, which our GPU algorithm can solve efficiently. The framework also dynamically allocates GPU resources to those independent jobs for maximum performance. The experimental results show a speedup of 15x on datasets that contain more than half a million data points.

Keywords: Multi-GPU; Euclidean Distance Matrix; Chunking

I. INTRODUCTION

Since the semiconductor industry found that high-performance processors can no longer be built simply by increasing the clock frequency, the scientific computing market has been offered alternative multicore, multi-processor and many-core products, all of which shift to a parallel architecture. One of these successful products is the Graphics Processing Unit (GPU) based computational device.
GPUs used to be integrated only on video cards specialized for 2D and 3D graphics rendering. These applications normally demand more floating-point capability than logic control or memory fetch capability. A GPU is therefore designed with many built-in floating-point units that compute in parallel, and the GPU itself acts as an assisting processor to the CPU. Because of this design, GPUs have become increasingly popular in applications that require intensive computation.

General-purpose programming on the parallel architecture of a GPU used to be difficult, not only because most available libraries were solely for graphics programming but also because finding the parallelism is critical in many well-known problems. This situation changed when NVIDIA released Compute Unified Device Architecture (CUDA) [1] in [...]. CUDA offers a simplified programming interface, an extension of the C language, for general-purpose programming on GPUs. Meanwhile NVIDIA also released its GPU-based computational device called Tesla, which is now in its second generation. Open Computing Language (OpenCL), which uses task-based and data-based parallelism, is another framework designed for GPU programming and is supported by many hardware vendors, e.g. Apple, NVIDIA and ATI. Many CUDA-based algorithms can now be ported to OpenCL, but OpenCL is still less popular than CUDA.

Many machine learning algorithms require the calculation of some type of distance, e.g. Euclidean distance, Manhattan distance or cosine distance, because distance is a good measure of the difference between data points. Distance calculation is thus fundamental to classification and clustering tasks. Distance matrices are widely used in algorithms such as Support Vector Machine [2] [3], K-Nearest Neighbors [4] and K-Means [5]. The popular Radial Basis Function (RBF) kernel matrix used in nonlinear Support Vector Machines can be derived from the Euclidean distance matrix.
The K-Nearest Neighbors problem can be solved by simply sorting the distance matrix. K-Means is also easily accelerated by improving the distance calculation. For small or medium datasets these problems can now be solved efficiently on GPU devices. For large datasets, however, the memory requirements of the distance calculation cannot always be met.

Consider a dataset with $n$ instances, each having $m$ attributes. The time complexity of calculating the complete Euclidean distance matrix is $O(n^2 m)$. The standard CPU-based method for calculating the Euclidean distance matrix goes through three nested loops. Because the matrix is symmetric, it is only necessary to calculate either its upper or its lower triangular part. If the feature space (i.e., the number of attributes of each data point) is small, the time complexity reduces to $O(n^2)$. In order to clearly categorize different datasets based on the number of instances and the dimensionality of the attribute space,
we define ultra-large datasets (e.g. URL Reputation [6], which has millions of instances and attributes) as datasets that have a large number of instances and a large number of attributes at the same time. Datasets that have a large number of instances but a limited number of attributes (e.g. Mnist [7], Covertype [6] and Poker Hand [6], whose feature spaces are smaller than a thousand) are defined as large datasets. Most real-world datasets fall into this category. Certain ultra-large datasets can be converted to large datasets by feature reduction techniques such as Principal Component Analysis [8]. Our focus is on how to process large datasets. The memory needed to store the distance matrix is determined only by $n$; however, the memory needed for the distance computation is determined by both $n$ and $m$. Thus the complete distance matrices of both ultra-large and large datasets cannot be calculated in one pass.

Chang et al. proposed a fast GPU implementation for calculating the Euclidean distance matrix in [9]. They have a similar implementation in [10] for calculating the Manhattan distance matrix. Their results show a speedup ranging from 20 times to approximately two orders of magnitude compared to a slow CPU implementation. However, their implementation only suits special cases in which both the number of instances and the number of attributes of the input dataset are multiples of 16. It is also specifically designed to calculate the normal symmetric distance matrix only. This significantly limits the usefulness of their work for real-world problems. Besides, the largest dataset tested in their work has only around 12,000 data points with a feature space of 64, which barely reaches the lower bound of large datasets. Common large datasets containing more than 100,000 instances cannot be handled by their method because of the limited memory space on a single GPU device.
Currently, very little work has been done on large distance matrix calculation in multi-GPU environments, e.g. a workstation with multiple GPU devices or a hybrid CPU-GPU cluster. In this paper, we continue the research direction initiated by Chang et al. and propose a chunking method for calculating Euclidean distance matrices of large datasets in a multi-GPU environment.

The organization of this paper is as follows. Section II briefly introduces how to calculate the Euclidean distance matrix with an efficient CPU algorithm instead of the naive one. It also presents an implementation, based on the CUDA programming model, for calculating the generalized Euclidean distance matrix. Section III describes a Map-Reduce-like framework for large distance matrix calculation, including how to split the calculation and how to make maximal use of the available hardware in a multi-GPU environment. The performance results of this chunking method on real-world datasets are shown at the end of that section. Section IV summarizes the merits of this work and discusses possible improvements and extensions in future work.

II. CUDA IMPLEMENTATION FOR CALCULATING GENERALIZED EUCLIDEAN DISTANCE MATRIX

Euclidean distance is one of the most common distances and has been used in many applications. Consider an $n$ by $m$ dataset matrix $S$, where $n$ is the number of data points and $m$ is the size of the feature space. The Euclidean distance between any two data points $s_i$ and $s_j$ can be calculated by

$$d_{ij} = \sqrt{\sum_{k=1}^{m} (s_{ik} - s_{jk})^2}. \quad (1)$$

The computational complexity of each pairwise distance is $O(m)$. Since the feature space has a limited size in the case of large datasets, the time complexity of calculating one pairwise Euclidean distance is considered constant.

A. Naive CPU Algorithm and Efficient CPU Algorithm

A normal distance matrix is defined as a 2-D array containing distances of a set of data points taken pairwise.
It is a symmetric square matrix with zero entries on the diagonal. In a broader sense, the generalized distance matrix stores the distances between any two data points taken from two datasets. The generalized distance matrix is thus a submatrix of a normal distance matrix. This submatrix can be an asymmetric square matrix or even a rectangular matrix. The generalized distance matrix is used more often than the normal distance matrix in real-world problems. It is also the core of calculating a large distance matrix.

Algorithm 1 shows the standard procedure for calculating a generalized Euclidean distance matrix. It is considered a naive method because of its low performance. Two input datasets are loaded into system memory and stored in two 2-D arrays $A$ and $B$, which have $n_A$ and $n_B$ data points respectively. Data points in both $A$ and $B$ are $m$-dimensional. The memory for the distance matrix $D$ is preallocated, and $D$ has a dimension of $n_A$ by $n_B$. Each pairwise distance is calculated inside two nested loops.

To get the maximum performance out of a multicore CPU, a method based on matrix operations is used to calculate the distance matrix instead of nested loops. The Euclidean distance matrix, unlike other distance matrices (e.g., Manhattan and cosine distance matrices), can be decomposed into matrix-level operations. Algorithm 2 shows how to use matrix operations to calculate the generalized Euclidean distance matrix. The algorithm contains element-wise matrix multiplication, matrix-vector multiplication and matrix-matrix multiplication. In the testing program, MKL BLAS routines from Intel are used for these matrix operations, and the code is compiled with multi-thread support enabled in MKL. This is well known to be one of the fastest implementations for calculating the Euclidean distance matrix on a CPU.
Algorithm 1. Standard procedure for generalized Euclidean distance matrix calculation

    begin
      Load in A, B and allocate memory for D
      for i := 1 to n_A do
        for j := 1 to n_B do
          d_ij := computedistance(a_i, b_j);
        od
      od
      return D

      funct computedistance(a_i, b_j)
        d_ij := 0;
        for k := 1 to m do
          d_ij := d_ij + (a_ik - b_jk)^2;
        od
        d_ij := sqrt(d_ij);
        return d_ij.
    end

Algorithm 2. Matrix operation based method for generalized Euclidean distance matrix calculation

    begin
      Load in A, B and allocate memory for D
      comment: (∘) is element-wise multiplication
      v_1 = (A ∘ A) [1, 1, ..., 1]^T
      v_2 = (B ∘ B) [1, 1, ..., 1]^T
      P_1 = [v_1 v_1 ... v_1]        Dimension: n_A by n_B
      P_2 = [v_2 v_2 ... v_2]^T      Dimension: n_A by n_B
      P_3 = A B^T                    Dimension: n_A by n_B
      comment: ( )^(1/2) is element-wise square root
      D = (P_1 + P_2 - 2 P_3)^(1/2)
      return D
    end

B. CUDA Based Algorithm

CUDA is an extension of the C programming language, and the latest version, CUDA 3.0, offers strong support for many C++ features. The programming model is based on a logical representation of three layers, which from high to low are grids, blocks and threads. It is the user's responsibility to adapt algorithms to a 2-D grid structure. These grid executions, also known as kernel functions, are launched by the CPU. Grids are composed of blocks, which are groups of threads that have access to a common shared memory and can be synchronized using barriers. From the logical point of view, kernel functions are executed simultaneously by many threads. However, the number of physical threads running concurrently on the GPU is determined by the hardware specification. CUDA provides a hierarchy of memories that differ in accessibility, size and speed. Registers are the fastest but the smallest. Shared memory can be as fast as registers but is also very limited in size.
Texture memory is larger but read-only, and device memory is the slowest but has the largest size.

The CUDA implementation shown in [9] is a good example of using shared memory. It takes one dataset as input and calculates its normal distance matrix. However, as mentioned in Section I, the way the code is written makes it suitable only for special input datasets. For a large dataset it is impossible to load all data points into GPU device memory, so the distance matrix must be calculated by combining its submatrices. This part is discussed in detail in the next section. Here, we rework this method to make it suitable for calculating the generalized distance matrix between two datasets with any number of data points and any size of feature space.

Continuing with the previous example, the distance matrix from $A$ to $B$ is an $n_A$ by $n_B$ rectangular matrix. Similarly, the distance matrix from $B$ to $A$ is the transpose of that matrix. To map the distance matrix calculation correctly onto the CUDA grid representation, we use the following code to create a 16 by 16 2-D block containing 256 threads.

    #define BLOCK_DIM 16
    dim3 dimblock(BLOCK_DIM, BLOCK_DIM, 1);
    dim3 dimgrid((na + BLOCK_DIM - 1) / BLOCK_DIM, (nb + BLOCK_DIM - 1) / BLOCK_DIM, 1);

The dimension of the grid is calculated dynamically to ensure that the number of block rows multiplied by 16 is no less than $n_A$ and the number of block columns multiplied by 16 is no less than $n_B$. In this way, enough threads are generated to cover the whole distance matrix calculation. The following code launches the kernel function on the GPU.

    distkernel<<<dimgrid, dimblock>>>(/* I/O argument list */);

Fig. 1 illustrates how the kernel function is implemented. Notice that both input matrices $A$ and $B$ are stored in transposed form in GPU device memory, so that their columns correspond to the data points.
As Fig. 1 shows, blocks may extend past the border of the input matrices, so several boundary checks are needed to ensure that zeros are used wherever a block runs outside the matrices and the results stay correct. This solves the problem of datasets whose number of data points or size of feature space is not a multiple of 16. Each block handles no more than 256 pairwise distance calculations. Each thread within the block moves along the vertical direction and picks up 16 features at a time. All threads are synchronized once the data have been loaded from device memory into shared memory. Then the feature differences are computed between any two threads and accumulated into the corresponding result. All threads are synchronized again before moving to the next iteration of the loop. When every thread has scanned through all features, the final distance is calculated and stored at the corresponding position in the
matrix $D$ based on the current block index and thread index. The distance matrix is then returned at the end of the kernel function. At this point the distance matrix is still stored in GPU device memory, and it must be copied back to system memory for any further processing by the CPU. By using shared memory, the threads avoid frequent accesses to device memory, so better performance is achieved.

Figure 1. CUDA blocks mapping for generalized distance matrix calculation

Another way of using the GPU to calculate the Euclidean distance matrix follows the same idea as Algorithm 2, but uses CUBLAS [11] and MAGMA [12] instead of Intel's MKL for GPU-accelerated matrix operations. CUBLAS is the official BLAS implementation for GPUs made by NVIDIA, and it has proved to be extremely efficient compared to the same routines on a CPU, e.g. using MKL. Dongarra et al. also published MAGMA, a CUDA-based GPU library containing a subset of BLAS routines plus some LAPACK factorization and linear system solver routines. It even outperforms the official CUBLAS version 2.3 in many matrix operations such as matrix-matrix multiplication, which is the time-consuming part of distance matrix calculation.

C. Performance Results

The testing computer is equipped with an Intel Xeon E[...] [...] GHz quad-core CPU and 16 GB of RAM. Three Tesla C1060 GPU devices are connected to the system through PCI Express interfaces. Each of these cards has 4 GB of device memory. The latest CUDA 3.0 toolkit is used and the driver version is [...] for 64-bit Linux. The operating system is Fedora Core 10 Linux. The benchmarks of the GPU algorithms include the data transfer time in both directions between system memory and device memory as well as the computational time on the GPU.

Table I compares normal Euclidean distance matrix calculation among the naive C implementation, the MKL-based C implementation, Chang et al.
's CUDA implementation, and the proposed generalized CUDA implementation. Using MKL with multi-thread support on the CPU boosts performance 5 to 6 times, so a comparison with the naive C implementation does not truly reflect the performance gained by using the GPU. In these special cases (both $n$ and $m$ are multiples of 16), our implementation is slightly slower than Chang et al.'s implementation because the kernel function has been modified to suit general datasets, which Chang et al.'s method cannot handle. It takes two datasets as input, so in these special cases the same dataset is copied twice from system memory to device memory. In general, the GPU implementation still has a speedup of approximately 5 times over MKL, which is in the reasonable range given the performance comparison of matrix-matrix multiplication between MKL and CUBLAS shown in [13].

Table II compares the performance of calculating the generalized distance matrix between any two input datasets. Chang et al.'s method is not listed because it is not applicable. The CUBLAS 3.0 based implementation comes out on top, and the proposed generalized CUDA implementation is very close to the method based on GPU matrix operations. Other distance matrices, e.g. Manhattan distance and cosine distance, cannot be efficiently transformed into matrix-level operations. However, they can still be implemented easily by modifying the proposed method.

III. MAP-REDUCE LIKE MODEL FOR HANDLING LARGE DATASETS

For problems with large datasets, neither can all data points be loaded into system memory at one time, nor is there enough space to store the complete distance matrix. It is therefore necessary to break the complete distance matrix down into many submatrices and calculate them individually in parallel. Fig. 2 shows how to split the input dataset into chunks and calculate the generalized distance matrices between any two chunks.
Each chunk is assigned an index from 1 to $k$. The final distance matrix consists of $k$ by $k$ small distance matrices. Because the complete distance matrix is symmetric, only $k(k+1)/2$ of the $k^2$ small distance matrices need to be calculated. The rest can be obtained by simply transposing calculated ones, e.g. $D(1, 2)$ is the transpose of $D(2, 1)$. These small distance matrices can be calculated with the GPU-accelerated method introduced in the previous section. The number of physical GPU devices determines how many grids can be launched simultaneously.

Table I. Performance comparison of symmetric Euclidean distance matrix calculation. Time unit is second and size of feature space is [...]. Speedup is relative to the naive C implementation. n is the number of data points.

    n       Naive C   Efficient C (MKL)   Chang et al.'s CUDA   Generalized CUDA
    [...]   [...]     [...] (4.96x)       0.36 (33.06x)         0.47 (25.32x)
    [...]   [...]     [...] (5.70x)       1.42 (34.08x)         1.79 (27.04x)
    [...]   [...]     [...] (5.96x)       3.16 (34.40x)         3.82 (28.48x)

Table II. Performance comparison of generalized Euclidean distance matrix calculation. Time unit is second and size of feature space is [...]. Speedup is relative to the efficient C implementation. n is the number of data points.

    Input matrices    Efficient   Generalized     CUBLAS 3.0     MAGMA 0.2
    n       n         C (MKL)     CUDA            CUDA           CUDA
    [...]   [...]     [...]       [...] (4.60x)   0.21 (4.38x)   0.22 (4.12x)
    [...]   [...]     [...]       [...] (4.79x)   0.37 (4.92x)   0.42 (4.33x)
    [...]   [...]     [...]       [...] (4.56x)   1.56 (5.12x)   1.72 (4.64x)
    [...]   [...]     [...]       [...] (4.51x)   2.96 (5.36x)   3.33 (4.76x)

The Map-Reduce [14] pattern was proposed to handle large data processing problems in a cluster environment. We adopt the merits of this programming pattern and model it to do the large distance matrix calculation job. As shown in Fig. 3, the input reader first reads multiple chunks into system memory. Then the mapper generates a list of key/value pairs that correspond to the active chunks currently loaded in system memory. In each key/value pair, both the key and the value store chunk indices. The reducers iteratively load pairs of chunks with the same key and search for an available GPU device on which to launch the distance kernel function. A list stores the IDs of the available GPUs. Any GPU device taken by a reducer is removed from the list and appended back after that reducer releases it. Each reducer only calculates the small distance matrices whose keys are smaller than or equal to their values, so the final distance matrix is in upper triangular form. After all small distance matrices with the same key have been calculated by the reducer, the results are grouped together and passed to the output writer.
The output writer concatenates the results from every reducer and writes them to a distributed file system if one is available.

A drawback of this approach is that different reducers may have different workloads. This can be solved by assigning a fixed number of small distance matrix calculations to each reducer. For example, the first reducer calculates $D(1, 1)$ to $D(1, 5)$, the second calculates $D(1, 6)$ to $D(1, 10)$, and so on. However, the key must still be kept the same within each reducer. In this way only certain reducers may have fewer jobs, but overall the distance calculations are distributed equally among the reducers. When a reducer finishes its job, it notifies the mapper to update the key/value pairs list and to refresh system memory by loading new chunks and deleting used ones.

Figure 2. Mapping between data chunks and the related distance submatrices

The complete model requires a GPU cluster environment and extra communication support, e.g. MPI, between different nodes, as well as a proper distributed file system. Our test is done on a workstation with three Tesla C1060 GPU devices, which is much simpler than a GPU cluster environment. We use multi-threading to implement the different functions of the input reader, mapper, reducer and output writer. Since all reducers compete for the GPU devices on the same computer, whether they have an equal number of jobs no longer matters. Because all three cards are used for distance matrix calculations all the time, a performance increase of approximately 3x is achieved compared to using one card to do the same job sequentially.

Table III shows the performance of the finalized chunking method tested on real-world datasets. File I/O time is excluded because the CPU and GPU implementations share the same procedure. The time cost is for calculating submatrices only. The data transfer time for the GPU is shortened because in certain cases a dataset can be reused.
For example, if the same GPU is assigned the jobs of calculating $D(1, 1)$ and $D(1, 2)$, only chunk 2 needs to be loaded into device memory for the second distance matrix calculation. The speedup is close to 15 times when three GPU devices are used together on a dataset containing more than half a million data points.

Table III. Performance result of chunking method on real world large datasets. n is the number of data points. m is the size of feature space. c is the number of chunks. Time unit is second (s) and minute (m). Speedup is relative to the CPU implementation.

    Dataset     n           m       Quad-core CPU   c       1 GPU             3 GPUs
    Mnist       60,[...]    [...]   [...]s          4       [...]s (4.21x)    19.39s (10.51x)
    Covertype   581,[...]   [...]   [...]m          [...]   10.84m (5.00x)    3.62m (14.98x)
Figure 3. Map-Reduce pattern for large distance matrix calculation

IV. CONCLUSIONS AND FUTURE WORK

In this paper, we have presented a novel idea on how to break down a data-intensive large Euclidean distance matrix calculation and make use of a multi-GPU environment. The proposed chunking method is easy to implement with multi-threading on a standalone workstation. However, many other technical issues arise when implementing it in a GPU cluster environment. The next stage of our work will be to implement the model on a small GPU cluster and to look for optimizations that maximize the capability of the hardware. Other distances, e.g. Manhattan and cosine distances, will be included, and a simplified interface will be offered to machine learning algorithms that require distance calculation.

REFERENCES

[1] NVIDIA, CUDA Compute Unified Device Architecture Programming Guide, June [...].
[2] T.-M. Huang, V. Kecman, and I. Kopriva, Kernel Based Algorithms for Mining Huge Data Sets, Supervised, Semisupervised, and Unsupervised Learning. Springer, [...].
[3] B. Catanzaro, N. Sundaram, and K. Keutzer, "Fast support vector machine training and classification on graphics processors," in ICML '08: Proceedings of the 25th International Conference on Machine Learning. New York, NY, USA: ACM, 2008, pp. [...].
[4] V. Garcia, E. Debreuve, and M. Barlaud, "Fast k nearest neighbor search using GPU," in Computer Vision and Pattern Recognition Workshops, CVPRW '08. IEEE Computer Society Conference on, [...], pp. [...].
[5] S. A. Shalom, M. Dash, and M. Tue, "Efficient k-means clustering using accelerated graphics processors," in DaWaK '08: Proceedings of the 10th International Conference on Data Warehousing and Knowledge Discovery. Berlin, Heidelberg: Springer-Verlag, 2008, pp. [...].
[6] A. Frank and A. Asuncion, "UCI machine learning repository," [Online]. Available: edu/ml
[7] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. [...], Nov. [...].
[8] M. Andrecut, "Parallel GPU implementation of iterative PCA algorithms," Journal of Computational Biology, vol. 16, no. 11, pp. [...], [...].
[9] D.-J. Chang, N. A. Jones, D. Li, M. Ouyang, and R. K. Ragade, "Compute pairwise Euclidean distances of data points with GPUs," in Computational Biology and Bioinformatics, CBB '08. IASTED International Symposium, November [...].
[10] D.-J. Chang, A. Desoky, M. Ouyang, and E. Rouchka, "Compute pairwise Manhattan distance and Pearson correlation coefficient of data points with GPU," in Software Engineering, Artificial Intelligences, Networking and Parallel/Distributed Computing, SNPD [...]. [...]th ACIS International Conference on, [...], pp. [...].
[11] NVIDIA, CUDA CUBLAS Library, June [...].
[12] J. Dongarra, S. Tomov et al., "Matrix algebra on GPU and multicore architectures," [Online]. Available: [...]
[13] J. Cohen, "CUDA libraries and tools," [Online]. Available: SC09 CUDA Tools Cohen.pdf
[14] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan, "Interpreting the data: Parallel analysis with Sawzall," Sci. Program., vol. 13, no. 4, pp. [...], [...].
More informationParallel Approach for Implementing Data Mining Algorithms
TITLE OF THE THESIS Parallel Approach for Implementing Data Mining Algorithms A RESEARCH PROPOSAL SUBMITTED TO THE SHRI RAMDEOBABA COLLEGE OF ENGINEERING AND MANAGEMENT, FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
More informationPerformance of Multicore LUP Decomposition
Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations
More informationMAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures
MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Stan Tomov Innovative Computing Laboratory University of Tennessee, Knoxville OLCF Seminar Series, ORNL June 16, 2010
More informationSimultaneous Solving of Linear Programming Problems in GPU
Simultaneous Solving of Linear Programming Problems in GPU Amit Gurung* amitgurung@nitm.ac.in Binayak Das* binayak89cse@gmail.com Rajarshi Ray* raj.ray84@gmail.com * National Institute of Technology Meghalaya
More informationSpeedup Altair RADIOSS Solvers Using NVIDIA GPU
Innovation Intelligence Speedup Altair RADIOSS Solvers Using NVIDIA GPU Eric LEQUINIOU, HPC Director Hongwei Zhou, Senior Software Developer May 16, 2012 Innovation Intelligence ALTAIR OVERVIEW Altair
More informationSubset Sum Problem Parallel Solution
Subset Sum Problem Parallel Solution Project Report Harshit Shah hrs8207@rit.edu Rochester Institute of Technology, NY, USA 1. Overview Subset sum problem is NP-complete problem which can be solved in
More informationcalled Hadoop Distribution file System (HDFS). HDFS is designed to run on clusters of commodity hardware and is capable of handling large files. A fil
Parallel Genome-Wide Analysis With Central And Graphic Processing Units Muhamad Fitra Kacamarga mkacamarga@binus.edu James W. Baurley baurley@binus.edu Bens Pardamean bpardamean@binus.edu Abstract The
More informationAccelerated Load Balancing of Unstructured Meshes
Accelerated Load Balancing of Unstructured Meshes Gerrett Diamond, Lucas Davis, and Cameron W. Smith Abstract Unstructured mesh applications running on large, parallel, distributed memory systems require
More informationBig Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid Architectures
Procedia Computer Science Volume 51, 2015, Pages 2774 2778 ICCS 2015 International Conference On Computational Science Big Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid
More informationAccelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin
Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin Department of Electrical Engineering, National Taiwan Normal University, Taipei, Taiwan Abstract String matching is the most
More informationAperTO - Archivio Istituzionale Open Access dell'università di Torino
AperTO - Archivio Istituzionale Open Access dell'università di Torino An hybrid linear algebra framework for engineering This is the author's manuscript Original Citation: An hybrid linear algebra framework
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More informationAccelerating Implicit LS-DYNA with GPU
Accelerating Implicit LS-DYNA with GPU Yih-Yih Lin Hewlett-Packard Company Abstract A major hindrance to the widespread use of Implicit LS-DYNA is its high compute cost. This paper will show modern GPU,
More informationGPU-Accelerated Bulk Computation of the Eigenvalue Problem for Many Small Real Non-symmetric Matrices
GPU-Accelerated Bulk Computation of the Eigenvalue Problem for Many Small Real Non-symmetric Matrices Hiroki Tokura, Takumi Honda, Yasuaki Ito, and Koji Nakano Department of Information Engineering, Hiroshima
More informationQR Decomposition on GPUs
QR Decomposition QR Algorithms Block Householder QR Andrew Kerr* 1 Dan Campbell 1 Mark Richards 2 1 Georgia Tech Research Institute 2 School of Electrical and Computer Engineering Georgia Institute of
More informationREDUCING BEAMFORMING CALCULATION TIME WITH GPU ACCELERATED ALGORITHMS
BeBeC-2014-08 REDUCING BEAMFORMING CALCULATION TIME WITH GPU ACCELERATED ALGORITHMS Steffen Schmidt GFaI ev Volmerstraße 3, 12489, Berlin, Germany ABSTRACT Beamforming algorithms make high demands on the
More informationParallel Architecture & Programing Models for Face Recognition
Parallel Architecture & Programing Models for Face Recognition Submitted by Sagar Kukreja Computer Engineering Department Rochester Institute of Technology Agenda Introduction to face recognition Feature
More informationAccelerated Machine Learning Algorithms in Python
Accelerated Machine Learning Algorithms in Python Patrick Reilly, Leiming Yu, David Kaeli reilly.pa@husky.neu.edu Northeastern University Computer Architecture Research Lab Outline Motivation and Goals
More informationGPU Programming Using NVIDIA CUDA
GPU Programming Using NVIDIA CUDA Siddhante Nangla 1, Professor Chetna Achar 2 1, 2 MET s Institute of Computer Science, Bandra Mumbai University Abstract: GPGPU or General-Purpose Computing on Graphics
More informationLecture 1: Introduction and Computational Thinking
PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 1: Introduction and Computational Thinking 1 Course Objective To master the most commonly used algorithm techniques and computational
More informationACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU
Computer Science 14 (2) 2013 http://dx.doi.org/10.7494/csci.2013.14.2.243 Marcin Pietroń Pawe l Russek Kazimierz Wiatr ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Abstract This paper presents
More informationOptimizing Memory-Bound Numerical Kernels on GPU Hardware Accelerators
Optimizing Memory-Bound Numerical Kernels on GPU Hardware Accelerators Ahmad Abdelfattah 1, Jack Dongarra 2, David Keyes 1 and Hatem Ltaief 3 1 KAUST Division of Mathematical and Computer Sciences and
More informationStudy and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou
Study and implementation of computational methods for Differential Equations in heterogeneous systems Asimina Vouronikoy - Eleni Zisiou Outline Introduction Review of related work Cyclic Reduction Algorithm
More informationA Mixed Hierarchical Algorithm for Nearest Neighbor Search
A Mixed Hierarchical Algorithm for Nearest Neighbor Search Carlo del Mundo Virginia Tech 222 Kraft Dr. Knowledge Works II Building Blacksburg, VA cdel@vt.edu ABSTRACT The k nearest neighbor (knn) search
More informationA Fast GPU-Based Approach to Branchless Distance-Driven Projection and Back-Projection in Cone Beam CT
A Fast GPU-Based Approach to Branchless Distance-Driven Projection and Back-Projection in Cone Beam CT Daniel Schlifske ab and Henry Medeiros a a Marquette University, 1250 W Wisconsin Ave, Milwaukee,
More informationSolving Large Regression Problems using an Ensemble of GPU-accelerated ELMs
Solving Large Regression Problems using an Ensemble of GPU-accelerated ELMs Mark van Heeswijk 1 and Yoan Miche 2 and Erkki Oja 1 and Amaury Lendasse 1 1 Helsinki University of Technology - Dept. of Information
More informationHigh performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli
High performance 2D Discrete Fourier Transform on Heterogeneous Platforms Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli Motivation Fourier Transform widely used in Physics, Astronomy, Engineering
More informationGPU Programming for Mathematical and Scientific Computing
GPU Programming for Mathematical and Scientific Computing Ethan Kerzner and Timothy Urness Department of Mathematics and Computer Science Drake University Des Moines, IA 50311 ethan.kerzner@gmail.com timothy.urness@drake.edu
More informationBatch Linear Algebra for GPU-Accelerated High Performance Computing Environments
Batch Linear Algebra for GPU-Accelerated High Performance Computing Environments Ahmad Abdelfattah, Azzam Haidar, Stanimire Tomov, and Jack Dongarra SIAM Conference on Computational Science and Engineering
More informationCompute Distance Matrices with GPU
Compute Distance Matrices with GPU Seongho Kim Bioinformatics and Biostatistics Department University of Louisville Louisville, Kentucky 40292 USA s0kim023@louisville.edu Ming Ouyang Computer Engineering
More informationASYNCHRONOUS MATRIX FRAMEWORK WITH PRIORITY-BASED PROCESSING. A Thesis. Presented to. the Faculty of
ASYNCHRONOUS MATRIX FRAMEWORK WITH PRIORITY-BASED PROCESSING A Thesis Presented to the Faculty of California Polytechnic State University, San Luis Obispo In Partial Fulfillment of the Requirements for
More informationTo Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs
To Use or Not to Use: CPUs Optimization Techniques on GPGPUs D.R.V.L.B. Thambawita Department of Computer Science and Technology Uva Wellassa University Badulla, Sri Lanka Email: vlbthambawita@gmail.com
More informationAccelerating Correlation Power Analysis Using Graphics Processing Units (GPUs)
Accelerating Correlation Power Analysis Using Graphics Processing Units (GPUs) Hasindu Gamaarachchi, Roshan Ragel Department of Computer Engineering University of Peradeniya Peradeniya, Sri Lanka hasindu8@gmailcom,
More informationSpeeding up MATLAB Applications Sean de Wolski Application Engineer
Speeding up MATLAB Applications Sean de Wolski Application Engineer 2014 The MathWorks, Inc. 1 Non-rigid Displacement Vector Fields 2 Agenda Leveraging the power of vector and matrix operations Addressing
More informationSolving Dense Linear Systems on Graphics Processors
Solving Dense Linear Systems on Graphics Processors Sergio Barrachina Maribel Castillo Francisco Igual Rafael Mayo Enrique S. Quintana-Ortí High Performance Computing & Architectures Group Universidad
More informationPerformance Modeling of Pipelined Linear Algebra Architectures on FPGAs
Performance Modeling of Pipelined Linear Algebra Architectures on FPGAs Sam Skalicky, Sonia López, Marcin Łukowiak, James Letendre, and Matthew Ryan Rochester Institute of Technology, Rochester NY 14623,
More informationTHE MNIST DATABASE of handwritten digits Yann LeCun, Courant Institute, NYU Corinna Cortes, Google Labs, New York
THE MNIST DATABASE of handwritten digits Yann LeCun, Courant Institute, NYU Corinna Cortes, Google Labs, New York The MNIST database of handwritten digits, available from this page, has a training set
More informationParallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU
Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU Lifan Xu Wei Wang Marco A. Alvarez John Cavazos Dongping Zhang Department of Computer and Information Science University of Delaware
More informationDAG-Scheduled Linear Algebra Using Template-Based Building Blocks
DAG-Scheduled Linear Algebra Using Template-Based Building Blocks Jonathan Hogg STFC Rutherford Appleton Laboratory 1 / 20 19 March 2015 GPU Technology Conference San Jose, California * Thanks also to
More informationCATEGORIZATION OF THE DOCUMENTS BY USING MACHINE LEARNING
CATEGORIZATION OF THE DOCUMENTS BY USING MACHINE LEARNING Amol Jagtap ME Computer Engineering, AISSMS COE Pune, India Email: 1 amol.jagtap55@gmail.com Abstract Machine learning is a scientific discipline
More informationImage-Space-Parallel Direct Volume Rendering on a Cluster of PCs
Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs B. Barla Cambazoglu and Cevdet Aykanat Bilkent University, Department of Computer Engineering, 06800, Ankara, Turkey {berkant,aykanat}@cs.bilkent.edu.tr
More informationGPUML: Graphical processors for speeding up kernel machines
GPUML: Graphical processors for speeding up kernel machines http://www.umiacs.umd.edu/~balajiv/gpuml.htm Balaji Vasan Srinivasan, Qi Hu, Ramani Duraiswami Department of Computer Science, University of
More informationAccelerating the Fast Fourier Transform using Mixed Precision on Tensor Core Hardware
NSF REU - 2018: Project Report Accelerating the Fast Fourier Transform using Mixed Precision on Tensor Core Hardware Anumeena Sorna Electronics and Communciation Engineering National Institute of Technology,
More informationXIV International PhD Workshop OWD 2012, October Optimal structure of face detection algorithm using GPU architecture
XIV International PhD Workshop OWD 2012, 20 23 October 2012 Optimal structure of face detection algorithm using GPU architecture Dmitry Pertsau, Belarusian State University of Informatics and Radioelectronics
More informationMAGMA Library. version 0.1. S. Tomov J. Dongarra V. Volkov J. Demmel
MAGMA Library version 0.1 S. Tomov J. Dongarra V. Volkov J. Demmel 2 -- MAGMA (version 0.1) -- Univ. of Tennessee, Knoxville Univ. of California, Berkeley Univ. of Colorado, Denver June 2009 MAGMA project
More informationData Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions
Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing
More information1 Motivation for Improving Matrix Multiplication
CS170 Spring 2007 Lecture 7 Feb 6 1 Motivation for Improving Matrix Multiplication Now we will just consider the best way to implement the usual algorithm for matrix multiplication, the one that take 2n
More informationOutlier Detection and Removal Algorithm in K-Means and Hierarchical Clustering
World Journal of Computer Application and Technology 5(2): 24-29, 2017 DOI: 10.13189/wjcat.2017.050202 http://www.hrpub.org Outlier Detection and Removal Algorithm in K-Means and Hierarchical Clustering
More informationGPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)
GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization
More informationInternational Conference on Computational Science (ICCS 2017)
International Conference on Computational Science (ICCS 2017) Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations G. Bernabé, J. C. Cano, J. Cuenca, A.
More informationSemi-Supervised Clustering with Partial Background Information
Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject
More informationPARALLEL TRAINING OF NEURAL NETWORKS FOR SPEECH RECOGNITION
PARALLEL TRAINING OF NEURAL NETWORKS FOR SPEECH RECOGNITION Stanislav Kontár Speech@FIT, Dept. of Computer Graphics and Multimedia, FIT, BUT, Brno, Czech Republic E-mail: xkonta00@stud.fit.vutbr.cz In
More informationInternational Journal of Computer Science and Network (IJCSN) Volume 1, Issue 4, August ISSN
Accelerating MATLAB Applications on Parallel Hardware 1 Kavita Chauhan, 2 Javed Ashraf 1 NGFCET, M.D.University Palwal,Haryana,India Page 80 2 AFSET, M.D.University Dhauj,Haryana,India Abstract MATLAB
More informationParallel Systems. Project topics
Parallel Systems Project topics 2016-2017 1. Scheduling Scheduling is a common problem which however is NP-complete, so that we are never sure about the optimality of the solution. Parallelisation is a
More informationHigh-Performance and Parallel Computing
9 High-Performance and Parallel Computing 9.1 Code optimization To use resources efficiently, the time saved through optimizing code has to be weighed against the human resources required to implement
More informationAn Introduction to OpenAcc
An Introduction to OpenAcc ECS 158 Final Project Robert Gonzales Matthew Martin Nile Mittow Ryan Rasmuss Spring 2016 1 Introduction: What is OpenAcc? OpenAcc stands for Open Accelerators. Developed by
More informationarxiv: v1 [cs.lg] 5 Mar 2013
GURLS: a Least Squares Library for Supervised Learning Andrea Tacchetti, Pavan K. Mallapragada, Matteo Santoro, Lorenzo Rosasco Center for Biological and Computational Learning, Massachusetts Institute
More informationOptimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures*
Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Tharso Ferreira 1, Antonio Espinosa 1, Juan Carlos Moure 2 and Porfidio Hernández 2 Computer Architecture and Operating
More informationFeature Selection Using Modified-MCA Based Scoring Metric for Classification
2011 International Conference on Information Communication and Management IPCSIT vol.16 (2011) (2011) IACSIT Press, Singapore Feature Selection Using Modified-MCA Based Scoring Metric for Classification
More informationParsing in Parallel on Multiple Cores and GPUs
1/28 Parsing in Parallel on Multiple Cores and GPUs Mark Johnson Centre for Language Sciences and Department of Computing Macquarie University ALTA workshop December 2011 Why parse in parallel? 2/28 The
More informationFace Detection using GPU-based Convolutional Neural Networks
Face Detection using GPU-based Convolutional Neural Networks Fabian Nasse 1, Christian Thurau 2 and Gernot A. Fink 1 1 TU Dortmund University, Department of Computer Science, Dortmund, Germany 2 Fraunhofer
More informationTowards Breast Anatomy Simulation Using GPUs
Towards Breast Anatomy Simulation Using GPUs Joseph H. Chui 1, David D. Pokrajac 2, Andrew D.A. Maidment 3, and Predrag R. Bakic 4 1 Department of Radiology, University of Pennsylvania, Philadelphia PA
More informationChapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved.
Chapter 6 Parallel Processors from Client to Cloud FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. 2 FIGURE
More informationAnalysis of Matrix Multiplication Computational Methods
European Journal of Scientific Research ISSN 1450-216X / 1450-202X Vol.121 No.3, 2014, pp.258-266 http://www.europeanjournalofscientificresearch.com Analysis of Matrix Multiplication Computational Methods
More informationLocality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives
Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives José M. Andión, Manuel Arenaz, François Bodin, Gabriel Rodríguez and Juan Touriño 7th International Symposium on High-Level Parallel
More informationPractical Introduction to CUDA and GPU
Practical Introduction to CUDA and GPU Charlie Tang Centre for Theoretical Neuroscience October 9, 2009 Overview CUDA - stands for Compute Unified Device Architecture Introduced Nov. 2006, a parallel computing
More informationParallel FFT Program Optimizations on Heterogeneous Computers
Parallel FFT Program Optimizations on Heterogeneous Computers Shuo Chen, Xiaoming Li Department of Electrical and Computer Engineering University of Delaware, Newark, DE 19716 Outline Part I: A Hybrid
More informationGPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC
GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of
More informationUsing Graphics Chips for General Purpose Computation
White Paper Using Graphics Chips for General Purpose Computation Document Version 0.1 May 12, 2010 442 Northlake Blvd. Altamonte Springs, FL 32701 (407) 262-7100 TABLE OF CONTENTS 1. INTRODUCTION....1
More informationPerformance Diagnosis for Hybrid CPU/GPU Environments
Performance Diagnosis for Hybrid CPU/GPU Environments Michael M. Smith and Karen L. Karavanic Computer Science Department Portland State University Performance Diagnosis for Hybrid CPU/GPU Environments
More informationOptimizing CUDA for GPU Architecture. CSInParallel Project
Optimizing CUDA for GPU Architecture CSInParallel Project August 13, 2014 CONTENTS 1 CUDA Architecture 2 1.1 Physical Architecture........................................... 2 1.2 Virtual Architecture...........................................
More informationhigh performance medical reconstruction using stream programming paradigms
high performance medical reconstruction using stream programming paradigms This Paper describes the implementation and results of CT reconstruction using Filtered Back Projection on various stream programming
More informationModern GPUs (Graphics Processing Units)
Modern GPUs (Graphics Processing Units) Powerful data parallel computation platform. High computation density, high memory bandwidth. Relatively low cost. NVIDIA GTX 580 512 cores 1.6 Tera FLOPs 1.5 GB
More informationParallel Algorithms on Clusters of Multicores: Comparing Message Passing vs Hybrid Programming
Parallel Algorithms on Clusters of Multicores: Comparing Message Passing vs Hybrid Programming Fabiana Leibovich, Laura De Giusti, and Marcelo Naiouf Instituto de Investigación en Informática LIDI (III-LIDI),
More informationBest First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis
Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis CHAPTER 3 BEST FIRST AND GREEDY SEARCH BASED CFS AND NAÏVE BAYES ALGORITHMS FOR HEPATITIS DIAGNOSIS 3.1 Introduction
More informationEfficient Multi-GPU CUDA Linear Solvers for OpenFOAM
Efficient Multi-GPU CUDA Linear Solvers for OpenFOAM Alexander Monakov, amonakov@ispras.ru Institute for System Programming of Russian Academy of Sciences March 20, 2013 1 / 17 Problem Statement In OpenFOAM,
More informationUsing GPUs to compute the multilevel summation of electrostatic forces
Using GPUs to compute the multilevel summation of electrostatic forces David J. Hardy Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of
More informationSchool of Computer and Information Science
School of Computer and Information Science CIS Research Placement Report Multiple threads in floating-point sort operations Name: Quang Do Date: 8/6/2012 Supervisor: Grant Wigley Abstract Despite the vast
More information