Parallel K-Means Clustering with Triangle Inequality
Rachel Krohn and Christer Karlsson
Mathematics and Computer Science Department, South Dakota School of Mines and Technology, Rapid City, SD, USA

Abstract

Clustering divides data objects into groups to minimize the variation within each group. This technique is widely used in data mining and other areas of computer science. K-means is a partitional clustering algorithm that produces a fixed number of clusters through an iterative process. The relative simplicity and obvious data parallelism of the K-means algorithm make it an excellent candidate for distributed-memory optimization, particularly as datasets grow beyond the size of a single machine. The triangle inequality, when applied to the K-means algorithm, allows unnecessary distance calculations between data objects and cluster centroids to be avoided. Various parallel implementations of the K-means algorithm exist, but no example comparing a standard parallel implementation to one utilizing the triangle inequality could be located. This paper seeks to fill this gap by presenting experimental results demonstrating the performance of both standard and improved parallel K-means implementations compared to a sequential implementation.

Keywords: K-means, clustering, data mining, parallel, triangle inequality

1 Introduction

The field of data mining seeks to extract useful information from large datasets drawn from real-world situations. As big data becomes more popular and data storage capabilities increase, demand for data mining analysis increases while datasets grow in size. One common data mining task is clustering, which partitions the data into discrete categories. The K-means algorithm clusters data objects into k separate clusters using some distance measure, with each data object belonging to the cluster with the nearest cluster centroid [6].

Although the K-means algorithm is simple to implement, growing datasets present unique challenges. Larger datasets require more time to analyze, and if the entire dataset does not fit into the memory of a single machine, costly disk I/O operations further slow the analysis. One solution is to use parallel processing to speed up the analysis and, in the case of distributed-memory solutions, to reduce disk I/O. This paper introduces the basic K-means algorithm and summarizes current parallel implementations before presenting a new, flexible distributed-memory solution that uses the triangle inequality for improved speedup.

2 Basic K-Means Algorithm

K-means is a simple prototype-based, partitional clustering algorithm. Each cluster is represented by a prototype, or centroid, which is typically defined as the mean of all data points in that cluster. Because K-means is partitional, each data point belongs to exactly one cluster, and clusters are not nested. The number of clusters, k, is specified by the user before the algorithm begins processing. Initial centroids can be selected in many different ways; the exact method does not affect the main algorithm. K-means then relies on an iterative procedure to refine the centroids, as described in Algorithm 1 [6].

Algorithm 1 Basic K-Means Algorithm
1: Select k points as initial centroids
2: repeat
3: Form k clusters by assigning each point to its closest centroid
4: Recompute the centroid of each cluster
5: until Centroids do not change

Initial centroid selection varies depending on the specific K-means implementation being used. Some common methods include randomly selecting k data points, or selecting k points with maximum separation distance.
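For reference, a minimal sequential sketch of Algorithm 1 in Python with NumPy might look like the following; the random initialization, function names, and empty-cluster handling are illustrative choices, not details taken from the implementation evaluated in this paper.

```python
import numpy as np

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal sketch of Algorithm 1: assign each point to its closest
    centroid, then recompute each centroid as the mean of its members."""
    rng = np.random.default_rng(seed)
    # Select k data points as the initial centroids (one common choice).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    assignment = np.full(len(points), -1)
    for _ in range(max_iters):
        # Distance from every point to every centroid; pick the nearest.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([
            points[new_assignment == c].mean(axis=0)
            if np.any(new_assignment == c) else centroids[c]  # keep empty clusters in place
            for c in range(k)
        ])
        if np.array_equal(new_assignment, assignment):
            break  # no point changed clusters, so the centroids no longer move
        assignment, centroids = new_assignment, new_centroids
    return centroids, assignment
```

In practice the loop would stop once the number of points changing clusters falls below a threshold, as discussed below.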
Initial centroids can impact the final clustering results, as poor initial centroids may produce a sub-optimal clustering.

At each K-means iteration, all points are reassigned to the nearest centroid based on some distance metric. Note that finding the nearest centroid requires calculating the distance between the current data object and all cluster centroids. Euclidean distance is often used, since it is simple to implement and produces good results if the data is normalized. Based on the new cluster assignments, the centroids for each cluster are recalculated to reflect the new cluster membership. Often a cluster centroid is defined as the average of all data objects belonging to that cluster. The data object assignment and centroid adjustment process is repeated until the cluster centroids do not move, indicating a stable final clustering. In practice, particularly for large datasets, the number of data points that switch clusters is tracked at each iteration, and the algorithm terminates once this count falls below a certain threshold [6]. Since most centroid movement occurs in the first few iterations, stopping early does not greatly undermine the final results. There are also cases where a small number of points repeatedly oscillate between two clusters; the threshold termination condition prevents this.

One of the greatest advantages of the K-means algorithm is its simplicity and efficiency. Despite these advantages, K-means is not suitable for all datasets. K-means does not handle non-globular clusters, or clusters of differing sizes and densities; other clustering algorithms exist to handle these situations. The algorithm is also not robust to outliers; detecting and removing outliers before running K-means is a viable solution.

3 Parallel K-Means Algorithm

Examining the K-means algorithm reveals an obvious data parallelism [7]. To reassign each point to the nearest cluster centroid, the distance from that point to all the cluster centroids must be calculated; these distances are then compared to determine the nearest centroid. Because this process must be repeated for every data object in the set, and the result for one data object does not affect any other data object, it is appealing to parallelize the K-means algorithm to reduce overall execution time.

3.1 Existing Work

Following the theoretical development of a parallel K-means algorithm [7], various implementations have been created. Some implementations rely on shared memory [4] for simplicity, but these programs cannot handle extremely large datasets. Other researchers, desiring a distributed-memory solution, use MPI [1][8] or Apache Hadoop [9] for parallelization. More recent work focusing on GPU-based solutions [2] also exists.

Various algorithm modifications have been researched and implemented to improve the running time of the basic K-means algorithm. Improvements seek to reduce either the number of algorithm iterations or the number of distance calculations at each iteration. Both of these strategies reduce the overall complexity of the K-means algorithm, thereby achieving speedup. Implementing the triangle inequality to reduce the number of Euclidean distance calculations greatly improves the scalability of the K-means algorithm [5]. Applying the triangle inequality to an object's previous cluster assignment and a table of inter-centroid distances eliminates unnecessary distance calculations. Further improvement is achieved by sorting the centroids in order of distance from the previous cluster centroid. The main advantage of this improvement strategy is its simplicity, as the triangle inequality does not alter the foundation of the basic K-means algorithm.
Another improvement strategy is to avoid looping over the entire dataset at each centroid adjustment iteration by removing data points close to the centroid from consideration [3]. Using statistical analysis, the dataset can be reduced to a subset of boundary points, which are points near the edges of a cluster. If centroids do not move significantly, only the boundary points must be reassigned, decreasing the cost of each algorithm iteration. This strategy produces speedup by reducing the number of passes over the entire dataset, which reduces the total number of distance calculations. Although the boundary point approach offers significant speedups, it is much more difficult to implement than the triangle inequality.

3.2 Implementation Details

OpenMPI is a library implementation of the Message Passing Interface for distributed-memory computing. Typically, OpenMPI provides greater control to the programmer than MapReduce, allowing for more complex and frequent communication between processes. For this reason, OpenMPI was selected as the platform for this implementation. Unfortunately, increased control generally causes larger communication overhead, so careful algorithm design is required to minimize communication between processes.

Once the desired number of MPI processes are initialized, the root process retrieves the data object attributes from the user-specified input file. The data is partitioned between the different processes; each process is then responsible for clustering only part of the entire dataset. A series of k data objects are randomly selected to serve as the initial cluster centroids, and the number of desired clusters and the initial centroids are disseminated to all processes before the main clustering procedure begins.
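To make this setup phase concrete, here is a hedged sketch in Python using mpi4py and NumPy. The implementation described in this paper is built against OpenMPI directly; the input format, the use of the pickled-object scatter, and all names below are assumptions made for the example.

```python
from mpi4py import MPI
import numpy as np

def distribute_dataset(path, k, seed=0):
    """Root reads the data objects, partitions them across processes, and
    broadcasts k and the randomly selected initial centroids (sketch)."""
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    chunks, centroids = None, None
    if rank == 0:
        data = np.loadtxt(path, delimiter=",")   # one data object per row (assumed format)
        rng = np.random.default_rng(seed)
        centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
        chunks = np.array_split(data, size)      # partition the dataset between processes

    # Each process receives only its share of the data objects ...
    local_data = comm.scatter(chunks, root=0)
    # ... but every process knows k and the location of all initial centroids.
    k = comm.bcast(k, root=0)
    centroids = comm.bcast(centroids, root=0)
    return local_data, centroids
```

A large dataset would more likely be distributed with the buffer-based Scatterv call rather than pickled objects, but the object interface keeps the sketch short.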
The K-means algorithm was converted to an MPI procedure to exploit its data parallelism. Each process is responsible for a subset of the data objects, but knows the location of all cluster centroids. At each iteration, a process assigns each of its own data objects to the nearest cluster centroid. The attributes of all data objects assigned to each cluster are summed as they are assigned, and the size of each cluster is tracked. After all data objects are assigned, the processes exchange the local cluster sizes and attribute sums. Each process then computes the updated centroid locations independently before repeating the data object assignment procedure. Using this strategy, our implementation produces the same results as the sequential K-means algorithm while limiting communication and reducing overall computation time.

Because a few data objects may oscillate between cluster centroids, preventing the algorithm from terminating, our implementation uses a stopping threshold: once fewer than 1% of data objects change clusters, the program stops. Two results files are created, one giving the exact location of each cluster centroid and the other indicating the membership of each cluster.

4 The Triangle Inequality

The triangle inequality can be applied to the K-means algorithm to reduce the number of distance calculations required. By comparing the distance between a data object X_i and its current cluster centroid C_curr to the distance between C_curr and any potential cluster centroid C_pot, unnecessary distance calculations are eliminated. Taking d(a, b) as the Euclidean distance between points a and b, the triangle inequality states that d(C_curr, C_pot) <= d(X_i, C_curr) + d(X_i, C_pot), which means that d(X_i, C_pot) >= d(C_curr, C_pot) - d(X_i, C_curr). Whenever d(C_curr, C_pot) >= 2 d(X_i, C_curr), it follows that d(X_i, C_pot) >= d(X_i, C_curr). This indicates that object X_i will not be assigned to cluster C_pot, so d(X_i, C_pot) does not need to be calculated.

A prerequisite for these comparisons is a table of distances between all pairs of centroids, which must be recalculated after each centroid adjustment. Each process computes this table independently after receiving the new centroid locations. Because the number of cluster centroids is much smaller than the number of data objects, the cost of computing this table is negligible relative to the number of distance calculations it saves. In our implementation the centroids are not sorted, because the extra speedup did not outweigh the additional complexity, particularly when k is small.
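The per-iteration work described in Section 3.2, combined with the triangle-inequality test above, might look like the following sketch (again mpi4py and NumPy, with illustrative names; the authors' code is OpenMPI-based and the details here are assumptions). Each process assigns only its own data objects, accumulates per-cluster attribute sums and sizes, and then exchanges them so every process can update the centroids independently.

```python
from mpi4py import MPI
import numpy as np

def kmeans_iteration(local_data, centroids, assignment):
    """One parallel K-means iteration with triangle-inequality pruning (sketch)."""
    comm = MPI.COMM_WORLD
    k, dim = centroids.shape

    # Table of distances between all pairs of centroids, recomputed by every
    # process after each centroid adjustment.
    cc_dist = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)

    local_sums = np.zeros((k, dim))
    local_counts = np.zeros(k)
    changed = 0
    for i, x in enumerate(local_data):
        curr = assignment[i]
        curr_dist = np.linalg.norm(x - centroids[curr])
        best, best_dist = curr, curr_dist
        for pot in range(k):
            if pot == curr:
                continue
            # If d(C_curr, C_pot) >= 2 d(x, C_curr), the triangle inequality
            # guarantees d(x, C_pot) >= d(x, C_curr), so skip this centroid.
            if cc_dist[curr, pot] >= 2.0 * curr_dist:
                continue
            d = np.linalg.norm(x - centroids[pot])
            if d < best_dist:
                best, best_dist = pot, d
        if best != curr:
            changed += 1
            assignment[i] = best
        # Attribute sums and cluster sizes are accumulated as objects are assigned.
        local_sums[best] += x
        local_counts[best] += 1

    # Exchange local cluster sizes and attribute sums; each process then
    # computes the updated centroid locations independently.
    global_sums = np.zeros_like(local_sums)
    global_counts = np.zeros_like(local_counts)
    comm.Allreduce(local_sums, global_sums, op=MPI.SUM)
    comm.Allreduce(local_counts, global_counts, op=MPI.SUM)
    changed = comm.allreduce(changed, op=MPI.SUM)

    nonempty = global_counts > 0
    centroids[nonempty] = global_sums[nonempty] / global_counts[nonempty, None]
    return centroids, assignment, changed
```

A surrounding driver would repeat this call until the global count of changed objects drops below 1% of the dataset, mirroring the stopping threshold described in Section 3.2.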
5 Experimental Results

To evaluate the value of parallelization, we conducted extensive experiments on the sequential, basic parallel, and optimized parallel K-means implementations. All experiments were performed on a cluster of computers; each machine has 16 GB of memory and eight cores running at 3.6 GHz. All timing measurements rely on the MPI_Wtime call. Since the goal of this paper is to compare the performance of the clustering algorithms, I/O time is ignored in these measurements. Sequential testing used a single process on a single machine, while the parallel programs used eight processes on two nodes. This configuration was selected to include inter-process communication both across and within network nodes. Each graph data point represents an average of several test runs.
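As a sketch of how such a measurement might be taken, the clustering loop alone can be bracketed with MPI.Wtime so that file I/O is excluded. This reuses the illustrative distribute_dataset and kmeans_iteration helpers from the earlier sketches; the file name and the value of k are placeholders, not the defaults used in our experiments.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
local_data, centroids = distribute_dataset("objects.csv", k=8)  # I/O happens before timing
assignment = np.zeros(len(local_data), dtype=int)               # start every object in cluster 0
total_objects = comm.allreduce(len(local_data), op=MPI.SUM)

comm.Barrier()                       # line the processes up before starting the clock
start = MPI.Wtime()
while True:
    centroids, assignment, changed = kmeans_iteration(local_data, centroids, assignment)
    if changed < 0.01 * total_objects:   # fewer than 1% of objects changed clusters
        break
elapsed = MPI.Wtime() - start

if comm.Get_rank() == 0:
    print(f"clustering time: {elapsed:.3f} s")
```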
The datasets used for all test runs were synthetically generated. Each dataset contains k equal-sized, well-separated, globular clusters, with the theoretical cluster centers following a normal distribution. Within each cluster, individual data items also follow a normal distribution. Attributes are floating-point numbers ranging from zero to one with five digits of precision. Unless otherwise noted, the default dataset for each run consists of a fixed number of items, each with 3 attributes, clustered with a fixed value of k. All individual test runs use the same initial centroids for consistency.

By comparing the outputs of the programs, we confirmed that all three versions of the K-means algorithm produced the same final clustering. Centroid locations and object cluster assignments were identical across all test runs, regardless of which program performed the clustering.

Figure 1 shows the runtimes of all three program versions for datasets with varying numbers of objects, ranging up to 2 million objects. As expected, the sequential program takes much longer to produce the same results as the parallelized versions. The parallel K-means version with the triangle inequality optimization performs better than the standard parallel algorithm. The runtime of both parallel programs grows more slowly than the runtime of the sequential version, indicating that the parallel algorithm is scalable.

Figure 1: Effect of dataset size on runtime.

The effect of k on runtime was also considered; Figure 2 shows these results. As with the number of data objects, the runtime of the sequential K-means algorithm grows quickly as k increases, while the runtime of the parallel programs increases more slowly. The speedup from the triangle inequality optimization is more apparent here, with the triangle inequality consistently reducing clustering runtime by half.

Figure 2: Effect of number of clusters on runtime.

Finally, the number of attributes of the data objects was adjusted and the impact on runtime examined. Figure 3 shows that once again, the parallel implementations significantly outperform the sequential K-means algorithm. The triangle inequality optimization appears to offer some benefit in these cases, although the improvement is not as significant as for larger numbers of clusters.

Figure 3: Effect of number of attributes on runtime.

Overall, the experiments show that a parallel implementation of the K-means algorithm greatly reduces the execution time compared to a sequential implementation, particularly as the problem size grows. The triangle inequality optimization does improve upon the standard parallel implementation, with the greatest benefit occurring for large numbers of clusters.

6 Future Work

Because the size of the test datasets in this study was limited by the need to time sequential execution, the benefit of the triangle inequality optimization does not appear significant when compared with the improvement achieved by a standard parallel implementation. Further testing with larger datasets and more attributes is required to better assess the effectiveness of the triangle inequality. Experiments using real-world data instead of synthetically generated objects are also necessary.

The standard K-means algorithm does not specify a method for selecting the initial centroids, so many implementations rely on randomly selected initial centroids. Because poor initial centroids can result in sub-optimal clustering results [6], clustering is often repeated so that the best result can be used. Additional research into initial centroid selection methods is required to address this problem. Other K-means algorithm improvements, such as boundary point tracking [3], may allow for even greater speedups.
7 Conclusion

In this paper, a survey of the basic K-means algorithm and current related work is presented as an introduction to clustering. Then, a distributed-memory implementation that exploits the data parallelism present in the K-means algorithm is discussed. The triangle inequality is also introduced as an optimization of the basic algorithm, including implementation details. Finally, experimental results are presented for all three program versions.

Our OpenMPI implementation of K-means clustering partitions the dataset between multiple processes to speed up program execution. The algorithm is designed to limit communication between nodes and maximize data parallelism. A second variation of this program utilizes the triangle inequality to prevent unnecessary Euclidean distance calculations. We evaluated our programs' performance through detailed experimentation. The results show that parallelization of the K-means algorithm does improve running time, and that the triangle inequality optimization offers further benefits. More testing with larger datasets is required to more clearly demonstrate the additional speedup gained through the triangle inequality optimization. To the best of our knowledge, a performance comparison between basic parallel K-means and a version implementing the triangle inequality does not exist in the literature.

The K-means algorithm is a relatively simple and flexible clustering mechanism for many datasets. Exploiting the data parallelism present in this algorithm greatly reduces overall runtime, and the triangle inequality provides an opportunity for further optimization, particularly for large datasets.

References

[1] Inderjit S. Dhillon and Dharmendra S. Modha. A data-clustering algorithm on distributed memory multiprocessors. In Revised Papers from Large-Scale Parallel Data Mining, Workshop on Large-Scale Parallel KDD Systems, SIGKDD. Springer-Verlag.

[2] Reza Farivar, Daniel Rebolledo, Ellick Chan, and Roy Campbell. A parallel implementation of k-means clustering on GPUs. In Parallel and Distributed Processing Techniques and Applications. CSREA Press, 2008.

[3] Ruoming Jin, Anjan Goswami, and Gagan Agrawal. Fast and exact out-of-core and distributed k-means clustering. Knowledge and Information Systems, (1), 2006.

[4] Tayfan Kucukyilmaz. Parallel k-means algorithm for shared memory multiprocessors. Journal of Computer and Communications, 2(11):15-23, 2014.

[5] Jitendra Kumar, Richard T. Mills, Forrest M. Hoffman, and William W. Hargrove. Parallel k-means clustering for quantitative ecoregion delineation using large data sets. In Procedia Computer Science. Elsevier, 2011.

[6] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. Pearson Addison Wesley, 1st edition, 2005.

[7] Jinlan Tian, Lin Zhu, Suqin Zhang, and Lu Liu. Improvement and parallelism of k-means clustering algorithm. Tsinghua Science and Technology, (3), 2005.

[8] Jing Zhang, Gongqing Wu, Xuegang Hu, Shiying Li, and Shuilong Hao. A parallel clustering algorithm with MPI - MKmeans. Journal of Computers, 8(1), 2013.

[9] Weizhong Zhao, Huifang Ma, and Qing He. Parallel k-means clustering based on MapReduce. In Cloud Computing. Springer, 2009.