A Study of Hierarchical and Partitioning Algorithms in Clustering Methods


T. NITHYA, Ph.D. Research Scholar, Dept. of Computer Science and Engg., Alagappa University, Karaikudi-3. th.nithya@gmail.com
Dr. E. RAMARAJ, Professor, Dept. of Computer Science and Engg., Alagappa University, Karaikudi-3. eramaraj@rediffmail.com

Abstract

In the current research environment, clustering plays a vital role in data mining techniques. This paper focuses on two different kinds of clustering algorithms, hierarchical and partitioning, and compares one representative of each: the K-means partitioning algorithm and the agglomerative hierarchical algorithm. The aim of this paper is to examine their clustering functionalities, characteristics and classifications, and to compare them.

Keywords: Clustering, partitioning method, hierarchical method, k-means, agglomerative algorithm.

1. Introduction

Data mining is the process of analyzing data from different perspectives and summarizing it into useful information. It is also the process of finding correlations among various fields in large relational databases. The key properties of data mining are:

- Automatic pattern discovery
- Prediction of outcomes
- Creation of actionable information
- Focus on large data sets and databases

Clustering is the division of data into groups of similar objects; each group is called a cluster. A cluster contains objects that are similar to the other objects of its own group and dissimilar to the objects of other groups. Clustering is the subject of active research in several fields such as statistics, pattern recognition and machine learning. This survey focuses on clustering algorithms in data mining. Data mining adds to clustering the complication of very large datasets with very many attributes of different types, which imposes unique computational requirements on clustering algorithms.

Historically, the data modeling techniques underlying clustering come from mathematics, statistics and numerical analysis methods. The search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective, clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval, text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology and many others.

1.1 Characteristics of Clustering Methods [2]

Clustering is characterized by large datasets with many attributes of different types. In data mining, clustering underlies intense developments in information retrieval and text mining. It maintains a particular level of quality of service, and it is a fully time-sensitive process.

1.2 Classification of Clustering Methods [2]

Classification methods are meant to statistically distinguish between two or more groups:

1. Partitioning clustering method.
2. Hierarchical clustering method.

2. Partitioning Clustering Method [2]

In data partitioning algorithms, the data are divided into several subsets. Since checking all possible subset systems is computationally infeasible, greedy heuristics are used in the form of iterative optimization. Specifically, this means different relocation schemes that iteratively reassign points between the k clusters. Unlike the traditional hierarchical methods, in which clusters are not revisited after being constructed, a relocation algorithm gradually improves the clusters as points are reassigned, which results in high-quality clusters. Among the partitioning algorithms, the following two are the most important:

- K-MEANS
- K-MEDOIDS

2.1 K-means [4]

K-means is an unsupervised learning algorithm that solves the well-known clustering problem [1]; it is shown in Fig. 1. The procedure classifies a given data set into a certain number of clusters (assume k clusters) fixed a priori. The algorithm aims at minimizing an objective function, the following squared-error function.
J(V) = Σ_{i=1}^{c} Σ_{j=1}^{c_i} ( ||x_j − v_i|| )²

where ||x_j − v_i|| is the Euclidean distance between the data point x_j and the cluster center v_i, c_i is the number of data points in the i-th cluster, and c is the number of cluster centers.
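As an illustration, this squared-error objective can be computed directly (a minimal sketch; the function name squared_error and the toy data are my own choices, not from the paper):

```python
import math

def squared_error(clusters, centers):
    """Sum of squared Euclidean distances from each point to its cluster center."""
    total = 0.0
    for points, center in zip(clusters, centers):
        for p in points:
            total += sum((a - b) ** 2 for a, b in zip(p, center))
    return total

# Two toy clusters, each paired with its mean vector:
clusters = [[(1.0, 1.0), (1.5, 2.0)], [(5.0, 7.0), (3.5, 5.0)]]
centers = [(1.25, 1.5), (4.25, 6.0)]
print(squared_error(clusters, centers))  # → 3.75
```

K-means can be seen as searching for the assignment and the centers that make this quantity as small as possible.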

[Fig. 1: K-means clustering]

2.2 Algorithmic Steps for K-means Clustering

Let X = {x_1, x_2, x_3, ..., x_n} be the set of data points and V = {v_1, v_2, ..., v_c} be the set of centers.

1) Randomly select c cluster centers.
2) Calculate the distance between each data point and the cluster centers.
3) Assign each data point to the cluster center whose distance from it is the minimum over all the cluster centers.
4) Recalculate the new cluster centers using

   v_i = (1/c_i) Σ_{j=1}^{c_i} x_j,

   where c_i represents the number of data points in the i-th cluster.
5) Recalculate the distance between each data point and the newly obtained cluster centers.
6) If no data point was reassigned, then stop; otherwise repeat from step 3).

2.3 Example [5]

The following data set consists of the scores of two variables on each of seven individuals:

Table 1
Subject    A      B
1          1.0    1.0
2          1.5    2.0
3          3.0    4.0
4          5.0    7.0
5          3.5    5.0
6          4.5    5.0
7          3.5    4.5

The data set is to be grouped into two clusters, starting from the initial partition shown in Table 2.

Table 2
           Individual    Mean Vector
Group 1    1             (1.0, 1.0)
Group 2    4             (5.0, 7.0)
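The six steps above can be sketched in Python (a minimal sketch, not the authors' code; the helper names are mine, the initial centers are passed in explicitly rather than chosen at random so the run is reproducible, and it assumes no cluster ever becomes empty):

```python
import math

def kmeans(points, centers):
    """Iterate steps 2)-6): assign points to nearest centers, recompute means,
    and stop once nothing changes."""
    centers = [tuple(c) for c in centers]
    while True:
        # Steps 2)-3): assign each point to its nearest cluster center.
        labels = [min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
                  for p in points]
        # Step 4): recompute each center as the mean of its assigned points
        # (assumes every cluster keeps at least one point).
        new_centers = []
        for i in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == i]
            new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
        # Steps 5)-6): stop when the centers (and hence assignments) are stable.
        if new_centers == centers:
            return labels, centers
        centers = new_centers

# Table 1 data, with individuals 1 and 4 as the initial centers (as in Table 2):
data = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
        (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
labels, centers = kmeans(data, [(1.0, 1.0), (5.0, 7.0)])
print(labels)   # → [0, 0, 1, 1, 1, 1, 1]
print(centers)  # → [(1.25, 1.5), (3.9, 5.1)]
```

On this data the batch variant reaches the same final partition as the incremental procedure worked through below: individuals {1, 2} versus {3, 4, 5, 6, 7}.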

The remaining individuals are then examined in sequence and allocated to the cluster whose mean they are closest to, as shown in Table 3. The mean vector is recalculated each time a new member is added.

Table 3
        Cluster 1                  Cluster 2
Step    Individuals  Mean Vector   Individuals  Mean Vector
1       1            (1.0, 1.0)    4            (5.0, 7.0)
2       1, 2         (1.2, 1.5)    4            (5.0, 7.0)
3       1, 2, 3      (1.8, 2.3)    4            (5.0, 7.0)
4       1, 2, 3      (1.8, 2.3)    4, 5         (4.2, 6.0)
5       1, 2, 3      (1.8, 2.3)    4, 5, 6      (4.3, 5.7)
6       1, 2, 3      (1.8, 2.3)    4, 5, 6, 7   (4.1, 5.4)

The initial partition has changed; at this stage the two clusters have the characteristics shown in Table 4.

Table 4
           Individuals   Mean Vector
Cluster 1  1, 2, 3       (1.8, 2.3)
Cluster 2  4, 5, 6, 7    (4.1, 5.4)

Each individual's distance to its own cluster mean should be smaller than its distance to the other cluster's mean. Comparing each individual's distance to its own cluster mean with its distance to the opposite cluster mean, as shown in Table 5, only individual 3 is nearer to the mean of the opposite cluster (Cluster 2) than to its own (Cluster 1); it is therefore relocated, giving Table 6.

Table 5
Individual  Distance to mean of Cluster 1  Distance to mean of Cluster 2
1           1.5                            5.4
2           0.4                            4.3
3           2.1                            1.8
4           5.7                            1.8
5           3.2                            0.7
6           3.8                            0.6
7           2.8                            1.1

Table 6
           Individuals      Mean Vector
Cluster 1  1, 2             (1.3, 1.5)
Cluster 2  3, 4, 5, 6, 7    (3.9, 5.1)

The iterative relocation would now continue from this new partition until no more relocations occur. In this example, however, each individual is now nearer its own cluster mean than that of the other cluster, so the iteration stops and the latest partitioning is taken as the final cluster solution. Note that the k-means algorithm is not guaranteed to find the globally optimal solution; it may converge to a local optimum.
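The relocation check of Table 5 can be reproduced in a few lines (a sketch; the variable names are mine):

```python
import math

data = {1: (1.0, 1.0), 2: (1.5, 2.0), 3: (3.0, 4.0), 4: (5.0, 7.0),
        5: (3.5, 5.0), 6: (4.5, 5.0), 7: (3.5, 4.5)}
cluster1, cluster2 = [1, 2, 3], [4, 5, 6, 7]   # the partition of Table 4
mean1, mean2 = (1.8, 2.3), (4.1, 5.4)          # the mean vectors of Table 4

# An individual should move if it is closer to the opposite cluster's mean.
misplaced = [i for i in cluster1 if math.dist(data[i], mean2) < math.dist(data[i], mean1)]
misplaced += [i for i in cluster2 if math.dist(data[i], mean1) < math.dist(data[i], mean2)]
print(misplaced)  # → [3]
```

Only individual 3 is flagged, matching Table 5, so it moves to Cluster 2 as in Table 6.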

3. Hierarchical Clustering [17]

Hierarchical clustering builds a cluster hierarchy, i.e. a tree of clusters, also known as a dendrogram. Every cluster node contains child clusters; sibling clusters partition the points covered by their common parent. This approach allows exploring data at different levels of granularity. Hierarchical clustering methods are categorized into agglomerative (bottom-up) and divisive (top-down). An agglomerative clustering starts with one-point (singleton) clusters and recursively merges the two or more most appropriate clusters. A divisive clustering starts with one cluster of all data points and recursively splits the most appropriate cluster. The process continues until a stopping criterion (frequently, the requested number k of clusters) is achieved.

3.1 Advantages of Hierarchical Clustering

- Embedded flexibility regarding the level of granularity.
- Ease of handling any form of similarity or distance.

3.2 Disadvantages of Hierarchical Clustering

- Vagueness of the termination criteria.
- Most hierarchical algorithms do not revisit clusters once they are constructed.

3.3 Agglomerative [6]

In agglomerative clustering, existing groups are combined at each step, creating a hierarchical structure that reflects the order in which groups are merged. The agglomerative method builds the hierarchy by merging: the objects initially belong to a list of singleton sets S1, S2, ..., Sn. A cost function is used to find the pair of sets {Si, Sj} in the list to merge; once merged, Si and Sj are removed from the list of sets and replaced with Si ∪ Sj. Different variants of the agglomerative hierarchical clustering algorithm may use different cost functions: the complete linkage, average linkage and single linkage methods use the maximum, average and minimum distance between the members of the two clusters, respectively.

Algorithm:

1. Compute the proximity matrix, which contains the distance between each pair of patterns.
2. Treat every pattern as a cluster.
3. Find the most similar pair of clusters using the proximity matrix and merge these clusters into one.
4. If all the patterns are in one cluster, stop; otherwise update the proximity matrix and go to step 3.

4. Combining Clusters in the Agglomerative Approach [15]

In the agglomerative hierarchical approach, each data point starts as a cluster, and existing clusters are combined at each step. Four different methods of measuring the distance between clusters are described here.

1. Single linkage: in the single linkage method, the distance between two clusters is the minimum distance between any single data point in the first cluster and any single data point in the second cluster.
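The four algorithm steps above can be sketched as a naive agglomerative procedure (a sketch, not the authors' code; it uses single linkage as the cost function and stops once k clusters remain):

```python
import math

def single_linkage(c1, c2):
    """Minimum distance between any point of c1 and any point of c2."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerative(points, k=1):
    # Step 2: every pattern starts as its own (singleton) cluster.
    clusters = [[p] for p in points]
    # Step 4: loop until the stopping criterion (k clusters) is met.
    while len(clusters) > k:
        # Steps 1 and 3: (re)compute pairwise proximities and pick the
        # most similar pair of clusters.
        pairs = [(i, j) for i in range(len(clusters))
                        for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]   # merge Si and Sj into Si ∪ Sj
        del clusters[j]
    return clusters

data = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
        (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
groups = agglomerative(data, k=2)
print(sorted(sorted(g) for g in groups))
```

On the Table 1 data this again separates individuals {1, 2} from {3, 4, 5, 6, 7}, the same partition the k-means example arrives at.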

On the basis of this definition of the distance between clusters, at each stage of the process the two clusters that have the smallest single linkage distance are combined.

2. Complete linkage: in the complete linkage method, the distance between two clusters is the maximum distance between any single data point in the first cluster and any single data point in the second cluster. On the basis of this definition, at each stage of the process the two clusters that have the smallest complete linkage distance are combined.

3. Average linkage: in the average linkage method, the distance between two clusters is the average distance between the data points in the first cluster and the data points in the second cluster.

4. Centroid method: in the centroid method, the distance between two clusters is the distance between the two mean vectors (centroids) of the clusters. At each stage of the process, the two clusters with the smallest centroid distance are combined.

5. Ward's method: Ward's method measures the distance between two clusters with an ANOVA-based approach, combining at each stage the pair of clusters whose merger gives the smallest increase in the total within-cluster sum of squares.

5. Comparison between K-means and the Agglomerative Algorithm [21]

The two algorithms are compared by the size of the dataset, the number of clusters, the type of dataset and the type of software. Each algorithm was compared on datasets of two sizes and two types. The comparison is shown in Table 8.

Table 8
Algorithm      Size of dataset  No. of clusters   Type of dataset            Type of software
K-means        Huge and small   Large and small   Ideal and random dataset   LNKnet and cluster front view
Agglomerative  Huge and small   Large and small   Ideal and random dataset   LNKnet and cluster front view

The K-means algorithm has lower quality (accuracy) than the other algorithm, although its quality is very good when the dataset is larger. The hierarchical clustering technique produces better results when the dataset is small, and when a random dataset is used the hierarchical method is better than the others. K-means clustering is disturbed by noise in the dataset, which affects the result. Running an algorithm in different software packages leads to almost the same result, because all the software follows the same steps. Clustering the data is the main concept of this research, and it is carried out by different types of algorithms.
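The cluster-distance criteria described above (single, complete and average linkage and the centroid method) can be sketched as interchangeable distance functions (a sketch; function names are mine, and Ward's method is omitted since it depends on the merge's effect on within-cluster variance rather than on a plain pairwise distance):

```python
import math

def single_link(c1, c2):
    # Minimum distance over all cross-cluster pairs.
    return min(math.dist(p, q) for p in c1 for q in c2)

def complete_link(c1, c2):
    # Maximum distance over all cross-cluster pairs.
    return max(math.dist(p, q) for p in c1 for q in c2)

def average_link(c1, c2):
    # Average distance over all cross-cluster pairs.
    return sum(math.dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

def centroid_dist(c1, c2):
    # Distance between the two cluster mean vectors.
    mean = lambda c: tuple(sum(x) / len(c) for x in zip(*c))
    return math.dist(mean(c1), mean(c2))

a = [(1.0, 1.0), (1.5, 2.0)]
b = [(5.0, 7.0), (3.5, 5.0)]
# For any pair of clusters: single_link <= average_link <= complete_link.
print(single_link(a, b), complete_link(a, b))
```

Swapping one of these functions for another in an agglomerative procedure changes the shape of the clusters it favors, which is the point of distinguishing the four methods.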

The agglomerative algorithm produces better results on larger datasets.

Conclusion

The hierarchical and partitioning algorithms have been explained with respect to data set accuracy. Normally, clustering algorithms are used to reduce space as well as time complexity. The partitioning method was clearly explained through the k-means algorithm, which dealt with a small number of data sets. The performance of the k-means algorithm is better than that of the hierarchical clustering methods.

References:

1. P. Berkhin, "A survey of clustering data mining techniques," Grouping Multidimensional Data, Springer, 2006.
2. Michael J. Berry, Gordon Linoff, "Data Mining Techniques: For Marketing, Sales and Customer Support," John Wiley & Sons, Inc., New York, NY, USA, 1997. ISBN: 0471179809.
3. B. Vinodhini, "Survey on clustering algorithms," International Journal of Engineering Science and Innovative Technology (IJESIT), volume 2, issue 6, November 2013.
4. K. Krishna, M. N. Murty, "Genetic K-means algorithm," IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics, ieeexplore.org.
5. Z. Huang, "Extensions to the k-means algorithm for clustering large data sets with categorical values," Data Mining and Knowledge Discovery, Springer, 1998.
6. W. H. E. Day, H. Edelsbrunner, "Efficient algorithms for agglomerative hierarchical clustering methods," Journal of Classification, Springer, 1984.
7. R. Mac Nally, "Hierarchical partitioning as an interpretative tool in multivariate inference," Australian Journal of Ecology, 1996.
8. H. Frigui, R. Krishnapuram, Pattern Recognition, Elsevier, 1997.
9. J. Vesanto, E. Alhoniemi, "Clustering of the self-organizing map," IEEE Transactions on Neural Networks, 2000.
10. T. W. Liao, "Clustering of time series data - a survey," Pattern Recognition, Elsevier, 2005.
11. R. Mac Nally, C. J. Walsh, "Hierarchical partitioning public-domain software," Biodiversity and Conservation, Springer, 2004.
12. C. Faloutsos, K. Lin, "A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets," 1995, dl.acm.org.
13. A. K. Jain, "Data clustering: 50 years beyond k-means," Pattern Recognition Letters, Elsevier, 2010.
14. L. Jing, M. K. Ng, J. Z. Huang, "An entropy weighting k-means algorithm for subspace clustering of high-dimensional sparse data," IEEE Transactions on Knowledge and Data Engineering, 2007, ieeexplore.ieee.org.
15. D. Beeferman, A. Berger, "Agglomerative clustering of a search engine query log," Proceedings of the Sixth ACM SIGKDD, 2000, dl.acm.org.
16. G. Karypis, E. H. Han, V. Kumar, "Chameleon: hierarchical clustering using dynamic modeling," Computer, 1999, ieeexplore.ieee.org.
17. A. P. Reynolds, G. Richards, B. de la Iglesia, "Clustering rules: a comparison of partitioning and hierarchical clustering algorithms," Journal of Mathematical Modelling and Algorithms, Springer, 2006.
18. T. Hastie, R. Tibshirani, J. Friedman, "The Elements of Statistical Learning: Data Mining, Inference and Prediction," Springer, 2005.
19. K. Sathiyakumari, G. Manimekalai, "A survey of various approaches in document clustering," IJCTA, 2011.
20. R. S. Bhadoria, R. Banjal, H. Alexander, "Analysis of frequent item set mining on variant datasets," International Journal of Computer Applications, 2011.
21. Osama Abu Abbas, "Comparisons between clustering algorithms," International Arab Journal of Information Technology, 2008.