Exploiting GPUs to Accelerate Clustering Algorithms


Mahmoud Al-Ayyoub, Qussai Yaseen, Mohammed A. Shehab, Yaser Jararweh, Firas Albalas and Elhadj Benkhelifa
Jordan University of Science and Technology, Irbid, Jordan
s: {maalshbool, mohammed {yijararweh,
Mobile Fusion Applied Research Centre, Staffordshire University, Stafford, UK

Abstract

Big data poses a major challenge for data mining methods. Fortunately, the rapid advances in affordable high performance computing platforms such as the Graphics Processing Unit (GPU) have helped researchers reduce the execution time of many algorithms, including data mining algorithms. This paper discusses how the parallelism capabilities of the GPU can be used to improve the performance of two common clustering algorithms: K-Means (KM) and Fuzzy C-Means (FCM). Two main parallelism approaches are presented: pure and hybrid. These versions are tested under different settings, including two different GPU-equipped machines (a laptop and a server). The results show excellent improvement gains of the hybrid implementations compared with the pure parallel and sequential ones. On the laptop, the best gains of the hybrid implementations compared with the sequential ones are 11.3X for KM and 10.9X for FCM. As for the server, the best gains are 13.5X for KM and 16.3X for FCM. Moreover, the paper explores the use of a recent memory management technique for GPUs called Unified Memory (UM). The results show that UM decreases the performance gain of the hybrid implementations by 44% for KM and 61% for FCM. On the other hand, UM does introduce a small advantage for the pure parallel implementations.

I. INTRODUCTION

Big data has become a main challenge for many information technology fields due to the large processing time it needs.
The clustering of big data, where data is separated into groups with similar features, is an example of those challenges in the data mining and machine learning fields. The applications of clustering are numerous and diverse, from text analysis to robotics. Image segmentation is another popular clustering application, where objects are segmented from natural images, or regions of interest are extracted from medical images to diagnose diseases such as brain tumors and breast cancer [1]-[3]. K-Means (KM) and Fuzzy C-Means (FCM), two very common methods for clustering data, face serious issues when they deal with big data [4]-[6]. The execution times of these techniques increase as the data size increases, which makes big data clustering a major issue. Furthermore, a larger number of dimensions may slow down the clustering operation [7].

To increase the efficiency of clustering algorithms on big data, parallel programming is used. To this end, Graphics Processing Units (GPUs) are gaining popularity for compute-intensive work compared with Central Processing Units (CPUs). The reason for this is very simple. While modern CPUs can run up to 32 threads at the same time, modern GPUs can run around 4999 threads [8], [9]. Obviously, GPUs can run far more threads than CPUs. Therefore, many researchers utilize this advantage to improve the performance of many algorithms [10]-[13]. This paper leverages the capabilities of GPUs and parallel techniques in big data clustering. This reduces the effect of increasing data size and number of dimensions, and increases the scalability of the KM and FCM clustering algorithms.

Both CPUs and GPUs are the focus of many researchers in academia as well as in industry. Thus, both are being rapidly improved and optimized in terms of speed, parallelism capability, memory management, etc.
Unfortunately, many previous researchers fell into the pitfall of performing an unfair performance comparison between CPUs and GPUs from different settings (e.g., comparing the performance of a GPU built for heavily loaded servers with a CPU built for lightly loaded laptops) or from different time periods (e.g., comparing the performance of a GPU with a CPU that is five or ten years older). This paper aims to perform fair comparisons using modern CPUs with modern GPUs in both laptop and server settings.

The contributions of this paper are as follows.
1) It uses different parallel programming approaches (pure parallel and hybrid parallel) to test the KM and FCM clustering algorithms on big data.
2) It leverages the capabilities of GPUs to implement the aforementioned methods under variable dimensions and scalable data size.
3) It tests the Unified Memory (UM) technology against the pure parallel and hybrid parallel implementation approaches, and shows that it has a negative effect on the performance gain.

The paper is organized as follows. The next section discusses some related work. Section III discusses the proposed methodology. Section IV demonstrates and analyzes the experiments and results. Section V concludes the work and presents the future work.

II. RELATED WORK

This section discusses some of the existing work related to the problem at hand. Specifically, we try to cover prior efforts to improve the performance of the KM or FCM algorithms using parallelism (especially if it is based on GPUs). A limitation of clustering algorithms lies in the processing time needed for clustering and labeling data, especially in the case of big data. However, many researchers have proposed new algorithms that can handle clustering of big data.

Zechner et al. [14] accelerated the KM algorithm by utilizing GPU capability. They used an Intel Core 2 CPU with 4GB main memory and an NVIDIA GeForce 9600 GT with 512MB of GPU RAM. The operating system used was Windows XP and the algorithm was implemented using C and CUDA. The parallel implementation was 14X faster than the sequential version.

Another work on using GPUs to accelerate the KM algorithm is that of Farivar et al. [15]. The authors used an Intel Pentium D CPU and compared it with two GPUs. The first GPU was an NVIDIA GeForce 8600 GT and the second was an NVIDIA 8800 Ultra GTX. They used a dataset of 1 million elements with one dimension, and the number of clusters was 4. The first GPU (NVIDIA 8600 GT) was around 13X faster than the sequential implementation, while the second GPU (NVIDIA 8800 Ultra GTX) was around 68X faster than the sequential version.

In [16], Soroushnia et al. discussed a parallel implementation of the KM algorithm using CUDA. In their experiments, the authors varied the input data from 1K elements to 1M elements. In each test case, the number of clusters ranged from 8 to 124. They compared an Intel Pentium D with 1GB main memory to an NVIDIA GeForce 8800 GTX GPU with 782MB RAM. The improvement achieved was around 6X with 124 clusters.

Similarly, Shalom et al. [17] implemented a parallel version of the KM clustering algorithm. However, they compared the performance of a Pentium 4 CPU with an NVIDIA GeForce FX 5900 XT and an NVIDIA GeForce 8500 GT. The improvement they achieved in performance was about 5X against the sequential version.
Fuzzy C-Means (FCM) is a clustering algorithm used to segment data into a number of clusters [5]. Compared with the KM algorithm, this algorithm is more complex and takes more time to separate the data into clusters [7]. Many researchers have dealt with the problem of enhancing FCM's performance using parallelism [18]-[25].

In [18], Shalom et al. implemented a parallel version of the FCM algorithm. They compared the performance of the parallel implementation on two GPUs with the performance of a sequential implementation on a Pentium 4 CPU. The GPUs they used were the GeForce 8500 GT and 8800 GTX. The two GPU models achieved around 73X and 14X improvements, respectively. The data set used in their models contained 1 million data points with 4 dimensions. Furthermore, they varied the number of clusters from 3 to 64.

Similarly, Li et al. [19] implemented a parallel version of the FCM algorithm for image segmentation. The data set consisted of images of natural scenes from which the authors tried to extract objects. In their work, all FCM functions were located on the GPU side. They used an Intel Core 2 Duo CPU with an NVIDIA GTX 260 GPU. The improvement achieved in their model was 1X faster than the sequential version.

More work on FCM was performed by Zhuge et al. [20], who used the algorithm to improve the segmentation of medical images. The authors divided the data set into three image types (small, medium, large), and used a quad-core Intel Xeon CPU and a Tesla C1060 GPU in their experiments. They achieved improvements of about 24X, 18X and 1X for the small, medium and large images, respectively.

Onchis et al. [21] implemented the FCM algorithm with CUDA for image segmentation. The algorithm was used to segment/extract a Region of Interest (ROI) from images. For this purpose, the authors compared the performance of an Intel Core i7 CPU with the performance of a Tesla M2070Q GPU. They achieved an improvement of about 3X.
Most of the papers discussed so far showed higher improvement gains than what we obtain. One justification is the unfairness point mentioned previously, where the sequential version is run on a simple or old CPU whereas the parallel version is run on a more capable or newer GPU. For our work, we try to be as fair as possible in our choices of the CPU and GPU hardware for the two settings under consideration. Specifically, for the laptop setting, we use an Intel Core i7 CPU and an NVIDIA GeForce GT 740M. As for the server setting, we use a two-socket Xeon Haswell 2.6GHz server with 16 cores in total and 2X NVIDIA K80 GPU cards (4 GPUs in total). Another advantage of this work compared with existing work is the inclusion of two different clustering techniques and the consideration of UM. To the best of our knowledge, no prior work provides such a rich set of experiments.

III. METHODOLOGY

This section discusses the proposed methodology. Two common clustering algorithms, namely K-Means (KM) and Fuzzy C-Means (FCM), are implemented using parallel programming on the GPU. Moreover, a new NVIDIA technology called Unified Memory (UM) is considered. In UM, a virtual memory is created between the CPU and the GPU to reduce the effect of data transfer. The performance of the proposed parallel implementations is compared with the performance of the sequential version. The next subsections discuss the sequential versions followed by the parallel versions.

A. Sequential Implementation

1) K-Means (KM) Clustering Algorithm: KM is one of the simplest and most common clustering techniques. It is used to cluster data into K groups. The algorithm has three main functions: update centroids, calculate memberships and calculate objective function [26]. The algorithm starts by initializing the cluster centroids, where the number of clusters is set by the user. Next, the algorithm creates random values of centroids as initial centers.
Then, it calculates the membership of each point from the input data using Equation (1) [4]:

$$ t_{i,j} = (X_i - C_j)^2 \tag{1} $$

where $X_i$ is data point $i$ and $C_j$ is the centroid of cluster $j$.
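The membership step of Equation (1) can be sketched in C as follows. This is a minimal illustration and not the authors' code: the function names `squared_distance` and `km_calculate_memberships`, the row-major array layout and the integer `labels` output are assumptions made for this example, and the squared distance is summed over all dimensions of a point.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Squared Euclidean distance between point x and centroid c,
 * both stored as arrays of `dims` values (Equation (1), summed
 * over the dimensions). */
double squared_distance(const double *x, const double *c, size_t dims)
{
    double sum = 0.0;
    for (size_t d = 0; d < dims; ++d) {
        double diff = x[d] - c[d];
        sum += diff * diff;
    }
    return sum;
}

/* Assign each of the n points to its nearest centroid.
 * points:    n * dims values, row-major (hypothetical layout)
 * centroids: k * dims values, row-major
 * labels:    output, n entries, each in [0, k) */
void km_calculate_memberships(const double *points, size_t n,
                              const double *centroids, size_t k,
                              size_t dims, int *labels)
{
    assert(k > 0 && dims > 0);
    for (size_t i = 0; i < n; ++i) {
        double best = DBL_MAX;
        int best_j = 0;
        for (size_t j = 0; j < k; ++j) {
            double t = squared_distance(&points[i * dims],
                                        &centroids[j * dims], dims);
            if (t < best) {      /* keep the closest centroid so far */
                best = t;
                best_j = (int)j;
            }
        }
        labels[i] = best_j;
    }
}
```

Each iteration of the outer loop reads only its own point and the shared centroids and writes only its own entry of `labels`, so the per-point work does not depend on any other point.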

The centroids are updated at each iteration using Equation (2):

$$ C_j = \frac{1}{K} \sum_{i=1}^{N} \mu_{i,j} X_i \tag{2} $$

where $K$ is the number of points related to the class and $\mu_{i,j}$ selects the points that belong to cluster $j$. The process continues until the difference between the previous and current objective function becomes less than or equal to a certain threshold value. The objective function is calculated using Equation (3), in which the inner sum runs over the points assigned to cluster $j$:

$$ \theta = \sum_{j}^{C} \sum_{i}^{N} (X_i - C_j)^2 \tag{3} $$

2) Fuzzy C-Means (FCM) Clustering Algorithm: FCM is one of the popular soft clustering techniques. It combines two ideas: fuzzy sets and the C-Means algorithm [5], [27]. Therefore, the FCM algorithm is more complex than the KM algorithm. The algorithm is based on three main functions: calculate memberships, update cluster centroids and calculate objective function. First, the algorithm takes the number of clusters and the degree of fuzziness as inputs and initializes random centroids at the first iteration. Then, in each iteration, it calculates the membership values using Equation (4) and updates the centroids using Equation (5):

$$ u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{\frac{2}{m-1}}} \tag{4} $$

$$ c_j = \frac{\sum_{i=1}^{n} u_{ij}^{m} x_i}{\sum_{i=1}^{n} u_{ij}^{m}} \tag{5} $$

where $C$ is the number of clusters, $x_i$ is object point $i$, $m$ is the fuzziness factor, $n$ is the number of points and $c_j$ is the center of cluster $j$. Then, the objective function is calculated using Equation (6):

$$ J_m = \sum_{i=1}^{n} \sum_{j=1}^{C} u_{ij}^{m} \lVert x_i - c_j \rVert^2, \quad 1 \le m < \infty \tag{6} $$

Finally, it calculates the difference between the objective functions of the previous and the current steps. If the difference is less than or equal to a certain preset threshold value, the algorithm stops.

B. Parallel Implementations

Parallel programming separates the implementation code into sub-blocks of code that can effectively be run at the same time. This technique is used to speed up applications in which some parts of the code can be run individually and without any dependencies between them. The CPU is capable of running only a few parts of the code in parallel.
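As an illustration of code whose parts can run without dependencies between them, the FCM membership calculation of Equation (4) can be sketched in C as follows. This is a hedged sketch rather than the paper's implementation. The function names, the row-major layout and the treatment of a point that coincides exactly with a centroid are assumptions of this example. To keep it short, the fuzziness factor is fixed at m = 2, so the exponent 2/(m-1) equals 2 and the ratio of squared distances can be used directly.

```c
#include <assert.h>
#include <stddef.h>

/* Squared Euclidean distance between point x and center c. */
double fcm_squared_distance(const double *x, const double *c, size_t dims)
{
    double sum = 0.0;
    for (size_t d = 0; d < dims; ++d) {
        double diff = x[d] - c[d];
        sum += diff * diff;
    }
    return sum;
}

/* Fill u (n * c values, row-major) with the memberships of
 * Equation (4), assuming a fuzziness factor of m = 2. */
void fcm_memberships_m2(const double *points, size_t n,
                        const double *centroids, size_t c,
                        size_t dims, double *u)
{
    assert(c > 0 && dims > 0);
    for (size_t i = 0; i < n; ++i) {   /* rows are independent of each other */
        const double *x = &points[i * dims];
        double *row = &u[i * c];

        /* Assumption of this sketch: a point lying exactly on a center
         * gets membership 1 there and 0 elsewhere, which avoids a
         * division by zero. `hit == c` means no center coincides. */
        size_t hit = c;
        for (size_t j = 0; j < c && hit == c; ++j) {
            if (fcm_squared_distance(x, &centroids[j * dims], dims) == 0.0)
                hit = j;
        }
        if (hit != c) {
            for (size_t j = 0; j < c; ++j)
                row[j] = (j == hit) ? 1.0 : 0.0;
            continue;
        }

        for (size_t j = 0; j < c; ++j) {
            double dj = fcm_squared_distance(x, &centroids[j * dims], dims);
            double sum = 0.0;
            for (size_t k = 0; k < c; ++k)
                sum += dj / fcm_squared_distance(x, &centroids[k * dims], dims);
            row[j] = 1.0 / sum;
        }
    }
}
```

For a one-dimensional point at 1 with centers at 0 and 4, the squared distances are 1 and 9, so the memberships are 1/(1 + 1/9) = 0.9 and 1/(9 + 1) = 0.1, which sum to 1 as expected. Every row of `u` is computed from one point and the shared centers only, which is the kind of independence that makes a function a candidate for parallel execution.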
By contrast with the CPU, customized hardware such as GPUs offers more parallelization capabilities. However, this process is controlled by the CPU, which dispatches the work to the GPU.

Intuitively, one can think of two types of parallel implementations on GPUs. The first type is called the Hybrid Implementation, which distributes the execution of code blocks among the CPU and the GPU. As a matter of fact, the GPU executes some functions more efficiently than a parallel implementation on the CPU. The reason for this is very simple. While modern CPUs can run up to 32 threads at the same time, modern GPUs can run around 4999 threads [8], [9]. The second type of parallel implementation is called the Pure Parallel Implementation. In this type, the code is run on the GPU, while the CPU just sends the job to the GPU and receives back the results.

It is not immediately clear why anyone would consider the hybrid implementation if the pure parallel implementation allows for more parallelism and avoids any interaction between the CPU and the GPU during the execution of the code. The pure parallel implementation would indeed be preferable if the code to be parallelized had limited dependencies between its sub-blocks. Obviously, this is not always the case, as it is common to come across code that has high dependencies, which means that it can run faster on the CPU than it does on the GPU. As we show later in this paper, the type of dependencies might allow a hybrid implementation to outperform both pure implementations (pure CPU and pure GPU). The following subsections discuss the implementation of the KM and FCM algorithms using both techniques.

1) KM Hybrid and Pure Parallel Implementations: We present two main implementations for the KM algorithm, which are the hybrid implementation and the pure parallel implementation. The paper tests each version with and without using the UM technology. The purpose of these tests is to investigate how UM can help in reducing the effect of data transfer between the GPU and the CPU.
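As a reference point for deciding where each KM function should run, the two KM functions other than the membership calculation (Equations (2) and (3)) can be sketched in plain C as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the row-major layout, the integer `labels` input and the choice to leave an empty cluster's centroid unchanged are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Equation (2): move each centroid to the mean of the points labelled
 * with it. Returns 0 on success and -1 if memory allocation fails.
 * Assumption of this sketch: an empty cluster keeps its old centroid. */
int km_update_centroids(const double *points, const int *labels, size_t n,
                        double *centroids, size_t k, size_t dims)
{
    double *sums = calloc(k * dims, sizeof *sums);
    size_t *counts = calloc(k, sizeof *counts);
    if (sums == NULL || counts == NULL) {
        free(sums);
        free(counts);
        return -1;
    }
    for (size_t i = 0; i < n; ++i) {
        size_t j = (size_t)labels[i];
        assert(j < k);
        counts[j]++;                      /* shared per-cluster counter */
        for (size_t d = 0; d < dims; ++d)
            sums[j * dims + d] += points[i * dims + d];  /* shared sum */
    }
    for (size_t j = 0; j < k; ++j) {
        if (counts[j] == 0)
            continue;
        for (size_t d = 0; d < dims; ++d)
            centroids[j * dims + d] = sums[j * dims + d] / (double)counts[j];
    }
    free(sums);
    free(counts);
    return 0;
}

/* Equation (3): sum of squared distances between every point and the
 * centroid of the cluster it is assigned to. */
double km_objective(const double *points, const int *labels, size_t n,
                    const double *centroids, size_t dims)
{
    double theta = 0.0;                   /* single shared accumulator */
    for (size_t i = 0; i < n; ++i) {
        const double *c = &centroids[(size_t)labels[i] * dims];
        for (size_t d = 0; d < dims; ++d) {
            double diff = points[i * dims + d] - c[d];
            theta += diff * diff;
        }
    }
    return theta;
}
```

Unlike the membership step, both functions accumulate many points into a small number of shared locations (`sums`, `counts` and `theta`), so a parallel version would need some form of synchronization or reduction on those locations.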
To implement the algorithm in a parallel setting, we should choose which functions would be run on the CPU side and which ones would be run on the GPU side. As discussed earlier, the KM algorithm has three main functions, which are calculate memberships, calculate centroids and calculate objective. To determine the best distribution of functions among the GPU and the CPU, the Microsoft Visual Studio 2013 profiling tool is used with the sequential version. The profiling tool shows that the calculate memberships function is the heaviest function to run on the CPU side. Therefore, we decide to run this function on the GPU side. This technique improves the performance by about 10X (on average) compared with the sequential version.

Similarly, we transfer the other functions in order to test whether the pure parallel implementation is better than the hybrid implementation or not. However, after transferring them to the GPU side, we notice that the execution takes longer than that of the hybrid implementation (it gets only around 3X improvement compared with the sequential implementation). To detect the source of this delay, we use the CUDA profiling tool. Using this tool, we discover that the calculate centroids and calculate objective functions are heavy on the GPU side. Therefore, the best implementation is to use the CPU to calculate the centroids and the objective function, and the GPU to calculate the membership values. This is because the summation operation is not suitable to run in parallel mode. As shown in Equation (3), the calculate objective and calculate centroids functions involve a summation operation. This type of operation creates a dependency that prevents parallel programming from being used efficiently, since every thread needs to write to the same memory location. That is, it requires synchronization between the worker threads.

UM is used for both the hybrid and pure parallel implementations to measure how useful it is in improving the performance of the clustering algorithms under consideration. To do so, the memory allocation functions on the CPU and the GPU sides are replaced with UM functions. On average, the performance was around 5X faster than the sequential implementation for the hybrid implementation and 3.5X faster for the pure parallel implementation.

2) FCM Hybrid and Pure Parallel Versions: The same experiments discussed in the previous subsection for the KM algorithm are performed on the FCM algorithm. The sequential implementation of the algorithm is analyzed using the Microsoft Visual Studio 2013 profiling tool to detect the heaviest functions. The profiler report shows that the membership calculation function is the heaviest on the CPU side. Therefore, this function is selected to be run on the GPU side in the hybrid implementation, while the update centroids and objective functions are selected to be run on the CPU side. This implementation achieves an improvement of about 10X on average. In the pure parallel version, all FCM functions are run on the GPU side. However, the objective value is transferred to the CPU side after it is calculated on the GPU side. This version achieves an improvement of about 5X on average. When the UM technology is used, the improvement increases to 7X for the pure parallel version and decreases to 5X for the hybrid version.

IV. EXPERIMENTS AND ANALYSIS

This section presents and analyzes the results of this work.
The following subsection describes the specifications of the hardware and software used in this paper, while Subsections IV-B and IV-C present the experimental setup and the experimental results and analysis, respectively.

A. Hardware Specifications

Two types of equipment are used in this work: the simple laptop equipment and the more powerful server equipment. The specifications of each type are listed below.

Simple Equipment: 2.2 GHz fourth-generation Intel Core i7 CPU with 6GB RAM. The GPU is an NVIDIA GT 740M with 2GB memory. 64-bit Windows 10 operating system, CUDA 7.5 toolkit, CUDA drivers and Microsoft Visual Studio 2013.

Powerful Equipment: Two-socket Xeon Haswell 2.6GHz server with 16 cores and 128 GB RAM, equipped with 2X NVIDIA K80 GPU cards (4 GPUs in total). 64-bit Linux kernel OS, distribution Red Hat-compatible 6.6 (Scientific Linux), NVIDIA driver , CUDA SDK 7. and OpenMPI

TABLE I
VERSIONS OF THE DATASET

Dataset name        Dataset size (records)
Transactions100k    120,428
Transactions300k    284,284
Transactions500k    475,649
Transactions700k    665,471
Transactions800k    855,367
Total               2,401,199

B. Experimental setup

This part discusses the experimental setup. The C programming language is used to implement the sequential version, while the GPU side is implemented using CUDA. The dataset from [28] is used to test the two clustering algorithms, KM and FCM. The dataset is divided into several subsets in order to test the scalability of the proposed implementations. Table I shows the subsets of the dataset used in the experiments. As can be seen from the table, the datasets are rather large, containing hundreds of thousands of data points. The number of dimensions for each data point is three (denoted as X, Y and Z). Each algorithm is run using the five groups of data. Each group is tested using all dimensions (X, Y and Z), where dimensions are added one by one in each iteration.
That is, testing of the KM algorithm starts with loading all elements from the first dataset (Transactions100k), which consists of 120,428 data points. One dimension, X, is used in this experiment. After the algorithm finishes segmenting the data points, the second dataset (Transactions300k), which consists of 284,284 data points, is loaded using one dimension only, and so on. When all groups have been tested using one dimension, the same process is repeated using two dimensions (X and Y), etc.

C. Results

As mentioned earlier, the experiments are conducted using two types of equipment: the simple equipment and the powerful equipment. Five versions of each of the KM and FCM algorithms are tested: a sequential implementation as well as four different parallel implementations (depending on whether the parallel implementation is pure or hybrid, and whether it uses UM or not). The following two subsections show the results using the simple equipment and the powerful equipment, respectively.

1) Effect of Scaling-Up the Dataset: Figure 1 shows the effects of increasing the dataset size and the number of dimensions on the simple laptop equipment. Clearly, such increases lead to an increase in the execution time on the CPU. However, the GPU implementations do not exhibit the same trend. Furthermore, the hybrid version is better than the pure parallel version. The improvement of the hybrid version reached 11X without UM and 6X with it. On the other hand, the pure parallel version achieved an improvement of about 3X without UM and 3.6X with it.

Fig. 1. The effect of increasing the dataset size and the number of dimensions on the simple equipment. Panels: (a) KM, (b) FCM. Axes: dataset size (in hundreds of thousands) against time (in seconds). Series: CPU, GPU Hyb-UM, GPU Pur-UM, GPU Hyb+UM and GPU Pur+UM, each for 1D, 2D and 3D.

Obviously, UM decreased the improvement gain for the hybrid version because of hardware synchronization. UM reduces the transfer time between the CPU and the GPU; however, the cost of the synchronization operation is greater than the transfer time. With UM, the data is locked for the CPU until the GPU completes its processing. During this time, the CPU runs another process, and it needs time to switch back when it gets a signal from the GPU that the data has been released. The same operation is performed on the GPU side too. In the hybrid version, the CPU and the GPU are ready to run the code because the transfer time is slower than the execution time. Furthermore, each side has its own data that can be accessed at any time. Therefore, when UM is used, the data that is locked is data that is not needed at execution time by either the CPU or the GPU.

The experiments on the FCM algorithm are similar to those on the KM algorithm. The execution time of the FCM algorithm is also affected by increasing the size of the input data and the number of dimensions. Furthermore, the hybrid implementation has better results than the pure parallel version. However, using UM creates the same delay as in the KM algorithm. Running the FCM algorithm in the hybrid version without UM was 10.8X faster than the sequential version. However, with UM, the improvement decreased to 4.7X. Similarly, implementing the algorithm in the pure parallel model was about 5.6X faster than the sequential version, while using UM with the pure parallel model increased the performance to 7X faster than the sequential implementation.
The best improvement we obtained is for the hybrid version without UM. After using UM, the improvement was reduced by 44% for the hybrid version of KM and by 61% for FCM, whereas UM increased the performance of the pure parallel version by 14% for KM and by 18% for FCM. This effect is due to the hardware synchronization between the CPU and the GPU. In the hybrid version, the synchronization rate is higher than in the pure parallel version, because in the hybrid version the CPU and the GPU have to collaborate to achieve the main goal, which is to divide the data into clusters. Meanwhile, in the pure parallel version, the CPU launches the GPU kernel code, and then the GPU runs all of the algorithm's steps. In this case, the probability of synchronization is lower than in the hybrid version.

We deduce from those results that UM does not reduce the effect of transferring data between the CPU and the GPU; rather, it helps the developer to manage data transfer. If the implementation has different data that need to be transferred between the GPU and the CPU, the developer can use UM. This helps the developer to focus on coding the functions without missing any data transfers.

2) Effect of Hardware: The experiments of the previous section were all conducted on the simple hardware equipment. The purpose of this experiment is to study and compare the performance gains obtained by the two types of hardware equipment under consideration. Figure 2 shows the results of testing the hybrid version of the KM and FCM algorithms without UM. Figure 2(a) shows that the improvements on the KM algorithm for 1D, 2D and 3D are about 9X, 13X and 17X, respectively. As for the FCM algorithm, Figure 2(b) shows that the improvements are about 12X, 13X and 15X for 1D, 2D and 3D, respectively.

The performance gain that the clustering algorithms get from utilizing the GPU capabilities is shown in Figure 3. The Tesla K80 GPU has 4999 cores, while the GT 740M GPU has 366 cores.
It can be clearly seen from the figure that the Tesla K80 is 3X faster than the GT 740M for the KM algorithm, whereas for the FCM algorithm the Tesla K80 is 6X faster than the GT 740M. The complexity of FCM is higher than that of KM [7]. However, the FCM algorithm gets a better improvement gain with the Tesla K80 than with the GT 740M. This means that the more powerful GPU is less affected by the heavier algorithm.

Fig. 2. Comparison of the performance gains obtained by the two hardware equipments under consideration. Panels: (a) KM, (b) FCM. Bars: performance gain of the Tesla K80 and the GT 740M for 1D, 2D and 3D.

Fig. 3. Comparison of the performance gains obtained on the clustering algorithms by the two hardware equipments under consideration. Bars: performance gain of the Tesla K80 and the GT 740M for KM and FCM.

V. CONCLUSION

This work has presented our effort to extensively study the performance improvements gained by utilizing GPUs to speed up the performance of the two common clustering algorithms, KM and FCM. The paper has provided four different parallel implementations for each algorithm (depending on whether the parallel implementation is pure or hybrid, and whether it uses UM or not). Moreover, extensive experiments have been conducted using two types of equipment: a simple laptop setting and a more powerful server setting. The experiments aimed to study the effect of increasing the data size and the number of dimensions on the performance gain of the different parallel implementations compared with the sequential ones. The paper has shown that the best improvement gain is obtained by the hybrid version, which has achieved about 11.3X for KM and about 10.9X for FCM. With UM, the performance gain of the hybrid version was about 6X for KM and about 5X for FCM. Moreover, the pure parallel implementation achieved about 3X improvement for KM and about 6X for FCM. However, the paper has shown that with UM, the performance gain of the pure parallel version was about 4X for KM and about 7X for FCM.

ACKNOWLEDGMENT

This research was supported by the Deanship of Research at the Jordan University of Science and Technology (Grant no 21674). We are thankful to NVIDIA for giving us the chance to try the new Tesla K80 GPU in a server environment. We also thank Mr. Carlo Nardone, the technical developer from NVIDIA, for his cooperation.

REFERENCES

[1] M. Shehab et al., "Improving FCM and T2FCM algorithms performance using GPUs for medical images segmentation," in ICICS. IEEE, 2015.
[2] M. Shehab et al., "Accelerating compute-intensive image segmentation algorithms using GPUs," The Journal of Supercomputing, 2016, to appear.
[3] H. Cheng et al., "Automated breast cancer detection and classification using ultrasound images: A survey," Pattern Recognition, vol. 43, no. 1, 2010.
[4] A. Likas, N. Vlassis, and J. J. Verbeek, "The global k-means clustering algorithm," Pattern Recognition, vol. 36, no. 2, 2003.
[5] L. A. Zadeh, "Fuzzy sets as a basis for a theory of possibility," Fuzzy Sets and Systems, vol. 1, no. 1, pp. 3-28.
[6] A. K. Jain, "Data clustering: 50 years beyond k-means," Pattern Recognition, vol. 31, 2010.
[7] S. Ghosh and S. K. Dubey, "Comparative analysis of k-means and fuzzy c-means algorithms," IJACSA, vol. 4, 2013.
[8] S. Cook, CUDA Programming: A Developer's Guide to Parallel Computing with GPUs. Newnes, 2012.
[9] NVIDIA Corporation, "Tesla K80 accelerator features and benefits," Online, Dec 2016, [Accessed Jan-2016].
[10] M. Fakirah et al., "Accelerating Needleman-Wunsch global alignment algorithm with GPUs," in AICCSA. IEEE, 2015.
[11] L. Wang, B. Yang, Y. Chen, Z. Chen, and H. Sun, "Accelerating FCM neural network classifier using graphics processing units with CUDA," Applied Intelligence, vol. 4, no. 1, 2014.
[12] M. Alandoli et al., "Using dynamic parallelism to speed-up clustering-based community detection in social networks," in FiCloud. IEEE, 2016.
[13] M. Alandoli et al., "Using GPUs to speed-up FCM-based community detection in social networks," in CSIT. IEEE, 2016.
[14] M. Zechner and M. Granitzer, "Accelerating k-means on the graphics processor via CUDA," in INTENSIVE '09, 2009.
[15] R. Farivar, D. Rebolledo, E. Chan, and R. H. Campbell, "A parallel implementation of k-means clustering on GPUs," in PDPTA, vol. 13, no. 2, 2008.
[16] S. Soroushnia et al., "Parallel implementation of fuzzified pattern matching algorithm on GPU," in PDP. IEEE, 2015.
[17] S. Shalom et al., "Efficient k-means clustering using accelerated graphics processors," in DaWaK. Springer, 2008.
[18] S. Shalom et al., "Graphics hardware based efficient and scalable fuzzy c-means clustering," in AusDM, 2008.
[19] H. Li et al., "An improved image segmentation algorithm based on GPU parallel computing," Journal of Software, vol. 9, no. 8, 2014.
[20] Y. Zhuge et al., "Parallel fuzzy connected image segmentation on GPU," Medical Physics, vol. 38, no. 7, 2011.
[21] D. M. Onchis et al., "Multi-phase identification in microstructures images using a GPU accelerated fuzzy c-means segmentation," in SYNASC. IEEE, 2014.
[22] M. Al-Ayyoub et al., "A GPU-based implementations of the fuzzy c-means algorithms for medical image segmentation," The Journal of Supercomputing, vol. 71, no. 8, 2015.
[23] M. Al-Ayyoub et al., "A GPU-based breast cancer detection system using fuzzy c-means clustering algorithm," in ICMCS. IEEE, 2016.
[24] S. AlZu'bi et al., "Parallel implementation of FCM-based volume segmentation of 3D images," in AICCSA. IEEE, 2016.
[25] M. Alsmirat et al., "Accelerating compute intensive medical imaging segmentation algorithms using GPUs," Multimedia Tools and Applications (MTAP), 2016, to appear.
[26] T. Kanungo et al., "An efficient k-means clustering algorithm: Analysis and implementation," TPAMI, vol. 24, no. 7, 2002.
[27] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Computing Surveys (CSUR), vol. 31, no. 3.
[28] C.-H. Chen et al., "Genetic-fuzzy mining with taxonomy," IJUFKS, vol. 2, no. supp2, 2012.
