A Chunking Method for Euclidean Distance Matrix Calculation on Large Dataset Using Multi-GPU


A Chunking Method for Euclidean Distance Matrix Calculation on Large Dataset Using Multi-GPU

Qi Li, Vojislav Kecman, Raied Salman
Department of Computer Science, School of Engineering, Virginia Commonwealth University
Richmond, Virginia 23284, USA

Abstract: Calculating a Euclidean distance matrix is a data intensive operation and becomes computationally prohibitive for large datasets. Recent development of Graphics Processing Units (GPUs) has produced superb performance on scientific computing problems using massively parallel processing cores. However, due to the limited size of device memory, many GPU based algorithms have low capability in solving problems with large datasets. In this paper, a chunking method is proposed to calculate the Euclidean distance matrix on large datasets. It is designed not only for scalability in a multi-GPU environment but also to maximize the computational capability of each individual GPU device. We first implement a fast GPU algorithm that is suitable for calculating submatrices of the Euclidean distance matrix. Then we utilize a Map-Reduce like framework to split the final distance matrix calculation into many small independent jobs of calculating partial distance matrices, which can be efficiently solved by our GPU algorithm. The framework also dynamically allocates GPU resources to those independent jobs for maximum performance. The experimental results show a speedup of 15x on datasets which contain more than half a million data points.

Keywords: Multi-GPU; Euclidean Distance Matrix; Chunking

I. INTRODUCTION

Since the semiconductor industry revealed that high performance processors can no longer be built by simply increasing the clock frequency, the scientific computing market has been offered alternative products with multi-core, multi-processor, or many-core designs, which all shift to parallel architectures. One of these successful products is the Graphics Processing Unit (GPU) based computational device. GPUs used to be integrated only on video cards specialized for 2D and 3D graphics rendering. These applications normally require a higher capability in floating point operations than in logic control and memory fetch operations. Thus GPUs are designed with many built-in floating point units, which can perform computations in parallel, and the GPU itself acts as an assisting processor to the CPU. Because of this nature, GPUs have become more and more popular in many applications which require intensive computation.

General purpose programming on the parallel architecture of a GPU is difficult, not only because most libraries offered are solely for graphics related programming, but also because finding the parallelism is critical in many well known problems. This situation changed when NVIDIA released the Compute Unified Device Architecture (CUDA) [1]. CUDA offers a simplified programming interface, an extension of the C language, for general purpose programming on GPUs. Meanwhile NVIDIA also released their GPU based computational device called Tesla, which is now in its second generation. Open Computing Language (OpenCL), which uses task-based and data-based parallelism, is another framework designed for GPU programming supported by many hardware vendors, e.g. Apple, NVIDIA and ATI. Many CUDA based algorithms can now be ported to OpenCL, but OpenCL is still less popular than CUDA.
Many machine learning algorithms require calculation of certain types of distances, e.g. Euclidean distance, Manhattan distance and Cosine distance, because distance is a good measure of the difference between data points. Thus distance calculation is fundamental to classification and clustering tasks. Distance matrices are widely used in algorithms such as Support Vector Machines [2] [3], K-Nearest Neighbors [4] and K-Means [5]. The popular Radial Basis Function (RBF) kernel matrix used in nonlinear Support Vector Machines can be derived from the Euclidean distance matrix. The K-Nearest Neighbors problem can be solved by simply sorting the distance matrix. K-Means is also easily accelerated by improving the distance calculation. These problems with small or medium datasets can now be efficiently solved by using GPU devices. However, for problems with large datasets, the memory requirements for distance calculation cannot always be met.

Considering a dataset which has n instances, each with m attributes, the time complexity for calculating the complete Euclidean distance matrix is O(n^2 m). The standard CPU based method for calculating the Euclidean distance matrix goes through three nested loops. Obviously, it is only necessary to calculate either the upper triangular or the lower triangular part of the distance matrix because of its symmetry. If the feature space (i.e., the number of attributes for each data point) is small, the time complexity is reduced to O(n^2).

In order to clearly categorize different datasets based on the number of instances and the dimensionality of the attribute space, we define ultra-large datasets (e.g. URL Reputation [6], which has millions of instances and attributes) as datasets which have a large number of instances and a large number of attributes at the same time. Datasets which have a large number of instances but a limited number of attributes (e.g. Mnist [7], Covertype [6] and Poker Hand [6], whose feature spaces are smaller than a thousand) are defined as large datasets. Most real world datasets fall into this category. Certain ultra-large datasets can be converted to large datasets by using feature reduction techniques such as Principal Component Analysis [8]. Our focus is on how to process large datasets. The memory storage for the distance matrix is restricted only by n; however, the memory for distance computations is restricted by both n and m. For instance, storing a single-precision distance matrix for half a million data points already requires n^2 * 4 bytes, i.e. about 10^12 bytes (roughly 1 TB). Thus complete distance matrices of both ultra-large and large datasets cannot be calculated at one time.

Chang et al. proposed a fast implementation for calculating the Euclidean distance matrix using a GPU in [9]. They also have a similar implementation in [10] for calculating the Manhattan distance matrix. Their results show a speedup ranging from 20 times to approximately two orders of magnitude compared to a slow CPU implementation. However, their implementation only suits special cases where the input dataset has both the number of instances and the number of attributes being a multiple of 16. Also, it is specifically designed to calculate the normal symmetric distance matrix only. This significantly limits the contribution of their work for solving real world problems. Besides, the largest dataset tested in their work has only around 12,000 data points with a feature space of 64, which barely reaches the lower bound of large datasets. Common large datasets containing more than 100,000 instances cannot be solved by their method because of the limited memory space on a single GPU device. Currently, there is very little work done on large distance matrix calculation using multi-GPU environments, e.g. workstations with multiple GPU devices or hybrid CPU-GPU clusters. In this paper, we continue the research direction initiated by Chang et al. and propose a chunking method targeted at solving Euclidean distance matrix calculation on large datasets by using a multi-GPU environment.

The organization of this paper is as follows. Section II briefly introduces how to calculate the Euclidean distance matrix using an efficient CPU algorithm instead of the naive one. In this section, a CUDA programming model based implementation for calculating the generalized Euclidean distance matrix is also presented. Section III describes a Map-Reduce like framework for large distance matrix calculation, including how to split the calculation and how to maximally utilize the available hardware resources in a multi-GPU environment. The performance results of this chunking method on real world datasets are shown at the end of this section. Section IV summarizes the merits of this work and discusses possible improvements and extensions in future work.

II. CUDA IMPLEMENTATION FOR CALCULATING GENERALIZED EUCLIDEAN DISTANCE MATRIX

Euclidean distance is one of the most common distances used in many applications. Let us consider an n by m dataset matrix S, where n is the number of data points and m is the size of the feature space. The Euclidean distance between any data points s_i and s_j can be calculated by

d_ij = sqrt( sum_{k=1}^{m} (s_ik - s_jk)^2 ).    (1)

The computational complexity for each pairwise distance is O(m).
Since the feature space has a limited size in the case of large datasets, the time complexity for calculating one pairwise Euclidean distance can be considered constant.

A. Naive CPU Algorithm and Efficient CPU Algorithm

A normal distance matrix is defined as a 2-D array containing the distances of a set of data points taken pairwise. It is a symmetric square matrix with zero entries on the diagonal. In a broader sense, the generalized distance matrix stores distances between any two data points from two datasets. Thus, the generalized distance matrix is actually a submatrix of a normal distance matrix. This submatrix can be an asymmetric square matrix or even a rectangular matrix. The generalized distance matrix is used more frequently than the normal distance matrix in solving real world problems. It is also the core part of calculating a large distance matrix.

Algorithm 1 shows the standard procedure for calculating a generalized Euclidean distance matrix. It is considered a naive method because of its low performance. Two input datasets are loaded into system memory and stored in two 2-D arrays A and B, which have n_A and n_B data points, respectively. Data points in both A and B are m-dimensional. The memory space for storing the distance matrix D is preallocated and matrix D has a dimension of n_A by n_B. Each pairwise distance is calculated inside two nested loops.

In order to achieve the maximum performance of a multi-core CPU, a matrix operation based method is used to calculate the distance matrix instead of using nested loops. The Euclidean distance matrix, unlike other distance matrices (e.g., Manhattan and Cosine distance matrices), can be decomposed into matrix level operations. Algorithm 2 shows how to use matrix operations to calculate the generalized Euclidean distance matrix. The algorithm contains element-wise matrix multiplication, matrix-vector multiplication and matrix-matrix multiplication. In the testing program, MKL BLAS routines from Intel are used to accomplish these matrix operations and the code is compiled with multi-thread support enabled in MKL. This is well known to be one of the fastest implementations for calculating the Euclidean distance matrix on a CPU.
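To make the decomposition concrete, a minimal CPU-side sketch of this matrix-operation approach is given below (the corresponding pseudocode appears below as Algorithm 2). It assumes row-major, single-precision storage and uses the generic CBLAS interface for the matrix-matrix product; the function name and layout are our own illustration, not the authors' code.

#include <math.h>
#include <stdlib.h>
#include <cblas.h>   /* CBLAS interface, e.g. as shipped with MKL or OpenBLAS */

/* A is nA x m, B is nB x m, D is nA x nB, all row-major single precision. */
void euclidean_distance_matrix(const float *A, const float *B, float *D,
                               int nA, int nB, int m)
{
    float *sqA = malloc(nA * sizeof(float));   /* squared norms of the rows of A */
    float *sqB = malloc(nB * sizeof(float));   /* squared norms of the rows of B */

    for (int i = 0; i < nA; ++i) {
        float s = 0.0f;
        for (int k = 0; k < m; ++k) s += A[i * m + k] * A[i * m + k];
        sqA[i] = s;
    }
    for (int j = 0; j < nB; ++j) {
        float s = 0.0f;
        for (int k = 0; k < m; ++k) s += B[j * m + k] * B[j * m + k];
        sqB[j] = s;
    }

    /* D <- -2 * A * B^T (the cross term, computed with one GEMM call) */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                nA, nB, m, -2.0f, A, m, B, m, 0.0f, D, nB);

    /* D_ij <- sqrt(||a_i||^2 + ||b_j||^2 - 2 a_i . b_j) */
    for (int i = 0; i < nA; ++i)
        for (int j = 0; j < nB; ++j) {
            float v = sqA[i] + sqB[j] + D[i * nB + j];
            D[i * nB + j] = sqrtf(v > 0.0f ? v : 0.0f);   /* clamp tiny negative round-off */
        }

    free(sqA);
    free(sqB);
}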

Algorithm 1 Standard procedure for generalized Euclidean distance matrix calculation
(1) begin
(2)   Load in A, B and allocate memory for D
(3)   for i := 1 to n_A do
(4)     for j := 1 to n_B do
(5)       d_ij := computeDistance(a_i, b_j);
(6)     od
(7)   od
(8)   return D
(9) funct computeDistance(a_i, b_j)
(10)   d_ij := 0;
(11)   for k := 1 to m do
(12)     d_ij := d_ij + (a_ik - b_jk)^2;
(13)   od
(14)   d_ij := sqrt(d_ij);
(15)   return d_ij.
(16) end

Algorithm 2 Matrix operation based method for generalized Euclidean distance matrix calculation
(1) begin
(2)   Load in A, B and allocate memory for D
(3)   comment: (.*) is element-wise multiplication
(4)   v_1 = (A .* A) [1, ..., 1]^T
(5)   v_2 = (B .* B) [1, ..., 1]^T
(6)   P_1 = [v_1 v_1 ... v_1]      Dimension: n_A by n_B
(7)   P_2 = [v_2 v_2 ... v_2]^T    Dimension: n_A by n_B
(8)   P_3 = A B^T                  Dimension: n_A by n_B
(9)   comment: (.)^(1/2) is element-wise square root
(10)  D = (P_1 + P_2 - 2 P_3)^(1/2)
(11)  return D
(12) end

B. CUDA Based Algorithm

CUDA is an extension of the C programming language and the latest version, CUDA 3.0, offers strong support for many C++ features. The programming model is based on a logical representation of three layers, which are grids, blocks and threads, from high to low. It is the user's responsibility to adapt algorithms to a 2-D grid structure. These grid executions, also known as kernel functions, are launched by the CPU. Grids are composed of blocks, which are groups of threads that share local memory and can be synchronized using barriers. Kernel functions are executed simultaneously by multiple threads from the logical view. However, the number of physical threads running on the GPU concurrently is determined by the hardware specification. CUDA provides a hierarchy of memories that differ in their accessibility, size and speed. Registers are the fastest but smallest in size. Shared memory can be as fast as registers but it is also very limited in size. Texture memory is larger but read-only, and device memory is the slowest one, with the largest size compared to the others.

The CUDA2 implementation shown in [9] is one good example of using shared memory. It takes one dataset as input and calculates its normal distance matrix. However, as mentioned in Section I, the way the code is written makes it suitable only for special input datasets. In the case of a large dataset, it is impossible to load all data points into the GPU device memory, thus the distance matrix must be calculated by combining its submatrices together. This part will be discussed in detail in the next section. Here, we rework this method to make it suitable for calculating the generalized distance matrix between two datasets with any number of data points and any size of feature space. Continuing with the previous example, the distance matrix from A to B is an n_A by n_B rectangular matrix. Similarly, the distance matrix from B to A is the transpose of the above matrix. In order to correctly map the distance matrix calculation to the CUDA grid representation, we use the following code to create a 16 by 16 2-D block containing 256 threads.

#define BLOCK_DIM 16
dim3 dimBlock(BLOCK_DIM, BLOCK_DIM, 1);
dim3 dimGrid((nA + BLOCK_DIM - 1) / BLOCK_DIM, (nB + BLOCK_DIM - 1) / BLOCK_DIM, 1);

The dimensions of the grid are dynamically calculated to ensure that the number of rows multiplied by 16 is no less than n_A and the number of columns multiplied by 16 is no less than n_B. In this way, enough threads are generated to cover the whole distance matrix calculation.
The following code is used to launch the kernel function execution on the GPU.

distKernel<<<dimGrid, dimBlock>>>(/* I/O arguments list */);

Fig. 1 illustrates the idea of how to implement the kernel function. Notice that both input matrices A and B are organized in transposed format in the GPU device memory, so that their columns correspond to the related data points. As shown in the figure, the blocks may exceed the border of the input matrices, so several checking conditions must be added to ensure zeros are used when a block runs out of the boundary, in order to acquire correct results. This solves the issue brought by datasets whose number of data points and size of feature space are not multiples of 16. Each block handles no more than 256 pairwise distance calculations. Each thread within the block goes along the vertical direction and picks up 16 features at a time. All threads are synchronized to the same stage when the data have been loaded into shared memory from device memory. Then feature differences are computed between any two threads and accumulated to the related result. All threads are synchronized again before moving to the next iteration of the loop. When all features have been scanned through by every thread, the final distance is calculated and stored to the corresponding position in the distance matrix D, based on the current block index and thread index.
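For illustration, a condensed CUDA sketch of the tiled kernel just described is given below, reusing the BLOCK_DIM constant defined above. It assumes single precision and the transposed device-memory layout described earlier (feature k of point i of A stored at AT[k * nA + i], and similarly for B); the kernel body and names are our own sketch of the described scheme, not the paper's exact code.

__global__ void distKernel(const float *AT, const float *BT, float *D,
                           int nA, int nB, int m)
{
    __shared__ float tileA[BLOCK_DIM][BLOCK_DIM];   /* 16 features x 16 points of A */
    __shared__ float tileB[BLOCK_DIM][BLOCK_DIM];   /* 16 features x 16 points of B */

    int i = blockIdx.x * BLOCK_DIM + threadIdx.x;   /* point index in A (row of D)    */
    int j = blockIdx.y * BLOCK_DIM + threadIdx.y;   /* point index in B (column of D) */
    float acc = 0.0f;

    for (int k0 = 0; k0 < m; k0 += BLOCK_DIM) {
        /* Cooperative load of a 16-feature slice; zeros are used past the borders. */
        int ka = k0 + threadIdx.y;
        int kb = k0 + threadIdx.x;
        tileA[threadIdx.y][threadIdx.x] = (ka < m && i < nA) ? AT[ka * nA + i] : 0.0f;
        tileB[threadIdx.x][threadIdx.y] = (kb < m && j < nB) ? BT[kb * nB + j] : 0.0f;
        __syncthreads();

        for (int k = 0; k < BLOCK_DIM; ++k) {       /* accumulate squared differences */
            float diff = tileA[k][threadIdx.x] - tileB[k][threadIdx.y];
            acc += diff * diff;
        }
        __syncthreads();
    }

    if (i < nA && j < nB)
        D[i * nB + j] = sqrtf(acc);                 /* final Euclidean distance */
}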

Figure 1. CUDA blocks mapping for generalized distance matrix calculation

The distance matrix is thus complete at the end of the kernel execution. At this point, it is still stored in the GPU device memory and must be copied back to the system memory for any further processing by the CPU. By using shared memory, the threads avoid frequently accessing device memory to fetch data, so that better performance is achieved.

Another way of using the GPU to calculate the Euclidean distance matrix is to use the same idea as in Algorithm 2, but instead of Intel's MKL, CUBLAS [11] and MAGMA [12] are used for the GPU accelerated matrix operations. CUBLAS is the official version of the BLAS routines for the GPU made by NVIDIA and it has proved to be extremely efficient compared to the same implementations on the CPU, e.g. using MKL. Dongarra et al. also published MAGMA, a CUDA based GPU library containing a subset of BLAS routines plus some LAPACK factorization and linear system solver routines. It even outperforms the official CUBLAS version 2.3 in many matrix operations such as matrix-matrix multiplication, which is the time consuming part in distance matrix calculation.

C. Performance Results

The testing computer is equipped with an Intel Xeon quad-core CPU and 16GB RAM. There are three Tesla C1060 GPU devices connected to the system through PCI Express interfaces. Each of these cards has 4GB of device memory. The latest CUDA 3.0 toolkit is used with the 64-bit Linux driver. The operating system is Fedora Core 10 Linux. The benchmarks of the GPU algorithms include the data transfer time in both directions between system memory and device memory as well as the computational time on the GPU.

Table I shows the comparison of normal Euclidean distance matrix calculation among the naive C implementation, the MKL based C implementation, Chang et al.'s CUDA implementation, and the proposed generalized CUDA implementation. It is easy to observe that using MKL with multi-thread support on the CPU can boost the performance 5 to 6 times, thus a comparison with the naive C implementation does not truly reflect the performance gain obtained by using the GPU. Our implementation is slightly slower in these special cases (both n and m are multiples of 16) compared to Chang et al.'s implementation, because the kernel function has been modified to suit general datasets, which cannot be handled by Chang et al.'s method. It takes two datasets as input, thus the same dataset is copied twice from the system memory to the device memory in these special cases. In general, the GPU implementation still has a speedup of approximately 5 times compared to MKL, which is in the reasonable range based on the performance comparison of matrix-matrix multiplication between MKL and CUBLAS shown in [13].

Table II shows the performance comparison of calculating the generalized distance matrix between any two input datasets. Chang et al.'s method is not listed because it is not applicable. The CUBLAS 3.0 based implementation comes out on top and the proposed generalized CUDA implementation is very close to the GPU matrix operation based method. Other distance matrices, e.g. Manhattan distance and Cosine distance, cannot be efficiently transformed to matrix level operations. However, they can still be easily implemented by modifying the proposed method.
III. MAP-REDUCE LIKE MODEL FOR HANDLING LARGE DATASETS

For problems with large datasets, neither can all data points be loaded into the system memory at one time, nor is there enough space for storing the complete distance matrix. Thus it is necessary to break the complete distance matrix down into many submatrices and calculate them individually in a parallel manner. Fig. 2 shows how to split the input dataset into chunks and calculate the generalized distance matrices between any two chunks. Each chunk is assigned an index from 1 to k. The final distance matrix contains k by k small distance matrices. Due to the symmetric property of the complete distance matrix, only k(k + 1)/2 small distance matrices out of the total of k^2 are required to be calculated.

Table I. PERFORMANCE COMPARISON OF SYMMETRIC EUCLIDEAN DISTANCE MATRIX CALCULATION. TIME UNIT IS SECONDS. SPEED UP IS RELATIVE TO THE NAIVE C IMPLEMENTATION. n IS THE NUMBER OF DATA POINTS.

n | Naive C | Efficient C (MKL) | Chang et al.'s CUDA | Generalized CUDA
  |         | (4.96x)           | 0.36 (33.06x)       | 0.47 (25.32x)
  |         | (5.70x)           | 1.42 (34.08x)       | 1.79 (27.04x)
  |         | (5.96x)           | 3.16 (34.40x)       | 3.82 (28.48x)

Table II. PERFORMANCE COMPARISON OF GENERALIZED EUCLIDEAN DISTANCE MATRIX CALCULATION. TIME UNIT IS SECONDS. SPEED UP IS RELATIVE TO THE EFFICIENT C IMPLEMENTATION. n IS THE NUMBER OF DATA POINTS.

n, n | Efficient C (MKL) | Generalized CUDA | CUBLAS 3.0 CUDA | MAGMA 0.2 CUDA
     |                   | (4.60x)          | 0.21 (4.38x)    | 0.22 (4.12x)
     |                   | (4.79x)          | 0.37 (4.92x)    | 0.42 (4.33x)
     |                   | (4.56x)          | 1.56 (5.12x)    | 1.72 (4.64x)
     |                   | (4.51x)          | 2.96 (5.36x)    | 3.33 (4.76x)

Figure 2. Mapping between data chunks and the related distance submatrices

The remaining submatrices can be acquired by simply transposing the calculated ones, e.g. D(1, 2) is the transpose of D(2, 1). These small distance matrices can be calculated using the GPU accelerated method introduced in the previous section. The number of physical GPU devices determines how many grids can be launched simultaneously.

The Map-Reduce [14] pattern has been proposed to handle large data processing problems in a cluster environment. We adopt the merits of this programming pattern and model it to perform the large distance matrix calculation. As shown in Fig. 3, the input reader first reads multiple chunks into the system memory. Then the mapper generates a list of key/value pairs which correspond to the active chunks currently loaded in the system memory. For each key/value pair, both the key and the value store the indices of chunks. The reducers iteratively load pairs of chunks with the same key and search for any available GPU device to launch the distance kernel function. A list stores the IDs of the available GPUs. Any GPU device which is taken by a reducer is removed from the list and appended back after it is released by that reducer. Each reducer only calculates the small distance matrices whose keys are smaller than or equal to their values. The final distance matrix is stored in the form of its upper triangular part. After all small distance matrices with the same key are calculated by a reducer, the results are grouped together and passed to the output writer. The output writer concatenates the results from every reducer and writes them to a distributed file system if available.

A drawback of this approach is that different reducers may have different workloads. This can be solved by fixing the number of small distance matrix calculation jobs assigned to each reducer. For example, the first reducer calculates D(1, 1) to D(1, 5), the second calculates D(1, 6) to D(1, 10) and so on. However, the key value must still be kept the same within each reducer. In this way, only certain reducers may have fewer jobs, but overall the distance calculations are distributed equally among the reducers. When a reducer finishes its job, it notifies the mapper to update the key/value pair list and refresh the system memory by loading in new chunks and deleting used ones.

The complete model requires a GPU cluster environment and extra communication support, e.g. MPI, between different nodes, as well as a proper distributed file system. Our test is done on a workstation with three Tesla C1060 GPU devices, which is much simpler than the GPU cluster environment. We use multi-threading to implement the different functions of the input reader, mapper, reducer and output writer. Since all reducers compete for the GPU device resources on the same computer, whether they have an equal amount of jobs does not matter anymore.
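As a rough illustration of how the jobs are split and scheduled on a single multi-GPU workstation, the sketch below enumerates the k(k+1)/2 required submatrices and lets one host thread per GPU consume jobs from a shared queue. This is our own simplified sketch of the described behavior, not the authors' code; computeSubmatrix is a hypothetical helper that loads the two chunks, launches the distance kernel and copies the result back.

#include <stdlib.h>
#include <omp.h>
#include <cuda_runtime.h>

typedef struct { int row, col; } Job;

/* Assumed helper: loads chunks 'row' and 'col', runs distKernel on them and
   copies the resulting submatrix D(row, col) back to the host. */
void computeSubmatrix(int row, int col);

void chunkedDistanceMatrix(int k, int numGpus)
{
    /* Only the upper triangular part D(i, j) with i <= j needs to be computed. */
    int numJobs = k * (k + 1) / 2;
    Job *jobs = (Job *)malloc(numJobs * sizeof(Job));
    int t = 0;
    for (int i = 1; i <= k; ++i)
        for (int j = i; j <= k; ++j)
            jobs[t++] = (Job){ i, j };

    int next = 0;                              /* index of the next unassigned job */

    #pragma omp parallel num_threads(numGpus)
    {
        cudaSetDevice(omp_get_thread_num());   /* one host thread drives one GPU */
        while (1) {
            int my;
            #pragma omp critical               /* take the next job atomically */
            { my = (next < numJobs) ? next++ : -1; }
            if (my < 0) break;
            computeSubmatrix(jobs[my].row, jobs[my].col);
        }
    }
    free(jobs);
}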
Because all three cards are used for distance matrix calculations all the time, a performance increase of approximately 3x is achieved compared to using one card to do the same job sequentially.

Table III shows the performance of the finalized chunking method tested on real world datasets. File I/O time is excluded because both the CPU and GPU implementations share the same procedure; the time cost is for calculating the submatrices only. The data transfer time for the GPU is shortened because in certain cases a chunk can be reused. For example, if the same GPU is assigned the jobs of calculating D(1, 1) and D(1, 2), only chunk 2 needs to be loaded into the device memory for the second distance matrix calculation. The speedup is close to 15 times when utilizing three GPU devices together on a dataset containing more than half a million data points.

Table III. PERFORMANCE RESULT OF THE CHUNKING METHOD ON REAL WORLD LARGE DATASETS. n IS THE NUMBER OF DATA POINTS, m IS THE SIZE OF THE FEATURE SPACE, c IS THE NUMBER OF CHUNKS. TIME UNITS ARE SECONDS (s) AND MINUTES (m). SPEED UP IS RELATIVE TO THE CPU IMPLEMENTATION.

Datasets  | n       | m | c | Quad-core CPU | 1 GPU          | 3 GPUs
Mnist     | 60,000  |   | 4 |               | 19.39s (4.21x) | (10.51x)
Covertype | 581,012 |   |   |               | 10.84m (5.00x) | 3.62m (14.98x)
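A sketch of the chunk-reuse optimization mentioned above is given below: within one GPU worker, the row chunk stays resident in device memory across consecutive jobs that share the same row index, so only the column chunk is re-copied. It reuses the hypothetical Job type from the previous sketch and leaves the kernel launch as a placeholder; the names and structure are our own assumptions.

#include <cuda_runtime.h>

/* Process the jobs assigned to one GPU, reloading the row chunk only when it changes.
   hostChunks[1..k] hold the data chunks in system memory. */
void runJobsOnOneGpu(const Job *jobs, int numJobs, float **hostChunks, size_t chunkBytes,
                     float *d_rowChunk, float *d_colChunk)
{
    int residentRow = -1;   /* index of the row chunk currently resident on the device */
    for (int t = 0; t < numJobs; ++t) {
        if (jobs[t].row != residentRow) {
            cudaMemcpy(d_rowChunk, hostChunks[jobs[t].row], chunkBytes, cudaMemcpyHostToDevice);
            residentRow = jobs[t].row;
        }
        /* The column chunk changes with every job and is always refreshed. */
        cudaMemcpy(d_colChunk, hostChunks[jobs[t].col], chunkBytes, cudaMemcpyHostToDevice);
        /* ... launch distKernel on (d_rowChunk, d_colChunk) and copy D(row, col) back ... */
    }
}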

Figure 3. Map-Reduce pattern for large distance matrix calculation

IV. CONCLUSIONS AND FUTURE WORK

In this paper, we have presented a novel idea of how to break down the data intensive calculation of a large Euclidean distance matrix and utilize a multi-GPU environment. The proposed chunking method is easy to implement using multi-threading techniques on a standalone workstation. However, there will be many other technical issues in implementing it in a GPU cluster environment. The next stage of our work will be implementing the model for a small GPU cluster and seeking possible optimization methods to maximize the capability of the hardware. Other distances, e.g. Manhattan and Cosine distances, will be included and a simplified interface will be offered to those machine learning algorithms which require distance calculation.

REFERENCES

[1] NVIDIA, CUDA Compute Unified Device Architecture Programming Guide.
[2] T.-M. Huang, V. Kecman, and I. Kopriva, Kernel Based Algorithms for Mining Huge Data Sets, Supervised, Semi-supervised, and Unsupervised Learning. Springer.
[3] B. Catanzaro, N. Sundaram, and K. Keutzer, Fast support vector machine training and classification on graphics processors, in ICML 08: Proceedings of the 25th International Conference on Machine Learning. New York, NY, USA: ACM, 2008.
[4] V. Garcia, E. Debreuve, and M. Barlaud, Fast k nearest neighbor search using GPU, in Computer Vision and Pattern Recognition Workshops, CVPRW 08, IEEE Computer Society Conference on, 2008.
[5] S. A. Shalom, M. Dash, and M. Tue, Efficient k-means clustering using accelerated graphics processors, in DaWaK 08: Proceedings of the 10th International Conference on Data Warehousing and Knowledge Discovery. Berlin, Heidelberg: Springer-Verlag, 2008.
[6] A. Frank and A. Asuncion, UCI machine learning repository. [Online]. Available: http://archive.ics.uci.edu/ml
[7] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, vol. 86, no. 11, Nov. 1998.
[8] M. Andrecut, Parallel GPU implementation of iterative PCA algorithms, Journal of Computational Biology, vol. 16, no. 11, 2009.
[9] D.-J. Chang, N. A. Jones, D. Li, M. Ouyang, and R. K. Ragade, Compute pairwise Euclidean distances of data points with GPUs, in Computational Biology and Bioinformatics, CBB 08, IASTED International Symposium, November 2008.
[10] D.-J. Chang, A. Desoky, M. Ouyang, and E. Rouchka, Compute pairwise Manhattan distance and Pearson correlation coefficient of data points with GPU, in Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing, SNPD, ACIS International Conference on, 2009.
[11] NVIDIA, CUDA CUBLAS Library.
[12] J. Dongarra, S. Tomov et al., Matrix algebra on GPU and multicore architectures. [Online].
[13] J. Cohen, CUDA libraries and tools, 2009. [Online]. Available: SC09 CUDA Tools Cohen.pdf
[14] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan, Interpreting the data: Parallel analysis with Sawzall, Sci. Program., vol. 13, no. 4, 2005.
