A Study of Hierarchical and Partitioning Algorithms in Clustering Methods


T. NITHYA, Ph.D. Research Scholar, Dept. of Computer Science and Engg., Alagappa University, Karaikudi-3.
Dr. E. RAMARAJ, Professor, Dept. of Computer Science and Engg., Alagappa University, Karaikudi-3.

Abstract

Clustering plays a vital role in data mining. This paper focuses on two kinds of clustering algorithms: hierarchical and partitioning. It compares a partitioning algorithm, k-means, with a hierarchical algorithm, the agglomerative algorithm. The aim of the paper is to describe the functionality, characteristics and classification of clustering methods and to compare the two algorithms.

Keywords: Clustering, partitioning method, hierarchical method, k-means and agglomerative algorithm.

1. Introduction

Data mining is the process of analyzing data from different perspectives and summarizing it into useful information. It is also the process of finding correlations among the fields of large relational databases. The key properties of data mining are:

- Automatic pattern discovery
- Prediction of outcomes
- Creation of actionable information
- Focus on large data sets and databases

Clustering is the division of data into groups of similar objects. Each group is called a cluster; its objects are similar to one another and dissimilar to the objects of other groups. Clustering is a subject of active research in several fields such as statistics, pattern recognition and machine learning. This survey focuses on clustering algorithms in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types, which imposes unique computational requirements on the relevant clustering algorithms. Historically, the data modeling techniques behind clustering are rooted in mathematics, statistics and numerical analysis. The search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective, clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval, text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology and many others.

Characteristics of Clustering methods [2]

- It is characterized by large datasets with many attributes of different types.
- In data mining, clustering has seen intense development in information retrieval and text mining.
- It maintains a particular level of quality of service.
- It is a time-sensitive process.

Classification of Clustering methods [2]

Classification methods are meant to statistically distinguish between two or more groups. Clustering methods are classified as:

1. Partitioning clustering method.
2. Hierarchical clustering method.

2. Partitioning Clustering Method [2]

In data partitioning algorithms, the data is divided into several subsets. Checking all possible subset systems is computationally infeasible, so greedy heuristics are used in the form of iterative optimization. Specifically, this means different relocation schemes that iteratively reassign points between the k clusters. Unlike traditional hierarchical methods, in which clusters are not revisited after being constructed, relocation algorithms gradually improve the clusters; with appropriate data, this results in high-quality clusters. Among partitioning algorithms, the following two are the most important:

- K-MEANS
- K-MEDOIDS

2.1 K-MEANS [4]

K-means is an unsupervised learning algorithm that solves the well-known clustering problem [1]. It is illustrated in Fig. 1. The procedure classifies a given data set into a certain number of clusters (assume k clusters) fixed a priori. The algorithm aims at minimizing an objective function, the following squared error function:

$$J(V) = \sum_{i=1}^{c} \sum_{j=1}^{c_i} \left( \lVert x_i - v_j \rVert \right)^2$$

where
$\lVert x_i - v_j \rVert$ is the Euclidean distance between $x_i$ and $v_j$,
$c_i$ is the number of data points in the $i$-th cluster,
$c$ is the number of cluster centers.

In words, J is the sum, over all clusters, of the squared distance between each data point and the center of the cluster it is assigned to.
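The squared error objective described above can be evaluated directly once a partition and its centers are known. The following is a minimal sketch, not the authors' implementation; the function name `squared_error` and the toy points are assumptions made for illustration.

```python
import math

def squared_error(clusters, centers):
    """Sum of squared Euclidean distances from each point to its cluster center.

    clusters: list of clusters, each a list of points (tuples of floats)
    centers:  list of center points, one per cluster, in the same order
    """
    total = 0.0
    for members, center in zip(clusters, centers):
        for point in members:
            total += math.dist(point, center) ** 2
    return total

if __name__ == "__main__":
    # Hypothetical toy data: two clusters with their centers.
    clusters = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0)]]
    centers = [(1.0, 0.0), (5.0, 5.0)]
    print(squared_error(clusters, centers))  # prints 2.0
```

A lower value means the points sit closer to their assigned centers, which is the quantity that k-means tries to reduce at every relocation step.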

Fig. 1 K-MEANS CLUSTERING

2.2 Algorithmic steps for k-means clustering

Let $X = \{x_1, x_2, x_3, \ldots, x_n\}$ be the set of data points and $V = \{v_1, v_2, \ldots, v_c\}$ be the set of centers.

1) Randomly select c cluster centers.
2) Calculate the distance between each data point and the cluster centers.
3) Assign each data point to the cluster center whose distance from the data point is the minimum over all the cluster centers.
4) Recalculate the new cluster centers using:

$$v_i = \frac{1}{c_i} \sum_{j=1}^{c_i} x_j$$

where the sum runs over the data points $x_j$ assigned to the $i$-th cluster and $c_i$ represents the number of data points in the $i$-th cluster.
5) Recalculate the distance between each data point and the newly obtained cluster centers.
6) If no data point was reassigned then stop, otherwise repeat from step 3).

2.3 Example [5]

The following data set consists of the scores of two variables on each of seven individuals:

Table 1 (columns: Subject, A, B)

The data set is to be grouped into two clusters, with the initial groups shown in Table 2.

Table 2
           Individual   Mean Vector
Group 1    1            (1.0, 1.0)
Group 2    4            (5.0, 7.0)

The remaining individuals are now examined in sequence and allocated to the cluster to which they are closest, as shown in Table 3. The mean vector is recalculated each time a new member is added.

Table 3
        Cluster 1                    Cluster 2
Step    Individual   Mean Vector     Individual    Mean Vector
1       1            (1.0, 1.0)      4             (5.0, 7.0)
2       1, 2         (1.2, 1.5)      4             (5.0, 7.0)
3       1, 2, 3      (1.8, 2.3)      4             (5.0, 7.0)
4       1, 2, 3      (1.8, 2.3)      4, 5          (4.2, 6.0)
5       1, 2, 3      (1.8, 2.3)      4, 5, 6       (4.3, 5.7)
6       1, 2, 3      (1.8, 2.3)      4, 5, 6, 7    (4.1, 5.4)

The initial partition has now changed, and the two clusters at this stage have the characteristics shown in Table 4.

Table 4
             Individual    Mean Vector
Cluster 1    1, 2, 3       (1.8, 2.3)
Cluster 2    4, 5, 6, 7    (4.1, 5.4)

Each individual's distance to its own cluster mean should be smaller than its distance to the other cluster's mean. Table 5 therefore compares each individual's distance to its own cluster mean with its distance to the mean of the opposite cluster.

Table 5 (columns: Individual, Distance to mean of Cluster 1, Distance to mean of Cluster 2)

Only individual 3 is nearer to the mean of the opposite cluster (Cluster 2) than to that of its own (Cluster 1), so individual 3 is relocated to Cluster 2, giving the partition in Table 6.

Table 6
             Individual       Mean Vector
Cluster 1    1, 2             (1.3, 1.5)
Cluster 2    3, 4, 5, 6, 7    (3.9, 5.1)

The iterative relocation would now continue from this new partition until no more relocations occur. However, in this example each individual is now nearer to its own cluster mean than to that of the other cluster, so the iteration stops, and the latest partitioning is chosen as the final cluster solution. It is also possible that the k-means algorithm won't find a final solution.
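The steps of Section 2.2 can be sketched in a few lines of Python and run on the example. This is an illustrative sketch under stated assumptions, not the authors' code: the seven score pairs below are assumed values, chosen to be consistent with the mean vectors reported in Tables 2 to 6, and the function name `kmeans` is hypothetical. The sketch takes its initial centers from Table 2 instead of choosing them randomly, and it uses the batch update of Section 2.2 (reassign all points, then recompute the means) rather than the one-at-a-time allocation of Table 3.

```python
import math

def kmeans(points, initial_centers, max_iter=100):
    """Batch k-means: assign points to the nearest center, update centers, repeat."""
    centers = [tuple(c) for c in initial_centers]
    labels = None
    for _ in range(max_iter):
        # Steps 2-3: give each point the index of its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Step 6: stop once no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each center to the mean of its current members.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centers

# ASSUMED scores (A, B) for individuals 1..7; picked to agree with the
# mean vectors in the tables, not taken from Table 1 itself.
scores = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]

if __name__ == "__main__":
    # Initial centers: individuals 1 and 4, as in Table 2.
    labels, centers = kmeans(scores, [scores[0], scores[3]])
    print(labels)   # cluster index (0 or 1) for each individual
    print(centers)  # final mean vectors
```

With these assumed scores the sketch ends with individuals 1 and 2 in one cluster and individuals 3 to 7 in the other, with mean vectors of about (1.25, 1.5) and (3.9, 5.1), which agrees with Table 6. The `max_iter` cap is a design choice that guards against the case where the relocation does not settle.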

3 Hierarchical Clustering [17]

Hierarchical clustering builds a cluster hierarchy, or a tree of clusters, also known as a dendrogram. Every cluster node contains child clusters; sibling clusters partition the points covered by their common parent. Such an approach allows exploring data at different levels of granularity. Hierarchical clustering methods are categorized into agglomerative (bottom-up) and divisive (top-down). An agglomerative clustering starts with one-point (singleton) clusters and recursively merges the two or more most appropriate clusters. A divisive clustering starts with one cluster of all data points and recursively splits the most appropriate cluster. The process continues until a stopping criterion (frequently, the requested number k of clusters) is achieved.

3.1 Advantages of hierarchical clustering

- Embedded flexibility regarding the level of granularity.
- Ease of handling any form of similarity or distance.

3.2 Disadvantages of hierarchical clustering

- Vagueness of termination criteria.
- Most hierarchical algorithms do not revisit clusters once they are constructed.

3.3 Agglomerative [6]

Existing groups are combined or divided, which creates a hierarchical structure that reflects the order in which groups are merged or divided. The agglomerative method builds the hierarchy by merging: the objects initially belong to a list of singleton sets S1, S2, ..., Sn. A cost function is used to find the pair of sets {Si, Sj} from the list that is cheapest to merge. Once merged, Si and Sj are removed from the list of sets and replaced with Si U Sj. Different variants of the agglomerative hierarchical clustering algorithm may use different cost functions. The complete linkage, average linkage and single linkage methods use the maximum, average and minimum distance, respectively, between the members of the two clusters.

Algorithm

1. Compute the proximity matrix, which contains the distance between each pair of patterns.
2. Treat every pattern as a cluster.
3. Find the most similar pair of clusters using the proximity matrix and merge these clusters into one.
4. If all the patterns are in one cluster, then stop; otherwise go to step 3.

Combining Clusters in the Agglomerative Approach [15]

In the agglomerative hierarchical approach, each data point starts as its own cluster, and existing clusters are combined at every step. Five different methods for doing this are described here:

1. Single Linkage: In the single linkage method, the distance between two clusters is defined as the minimum distance between any single data point in the first cluster and any single data point in the second cluster.

On the basis of this definition of distance between clusters, at each stage of the process the two clusters that have the smallest single linkage distance are combined.

2. Complete Linkage: In the complete linkage method, the distance between two clusters is defined as the maximum distance between any single data point in the first cluster and any single data point in the second cluster. On the basis of this definition of distance between clusters, at each stage of the process the two clusters that have the smallest complete linkage distance are combined.

3. Average Linkage: In the average linkage method, the distance between two clusters is defined as the average distance between the data points in the first cluster and the data points in the second cluster.

4. Centroid Method: In the centroid method, the distance between two clusters is the distance between the mean vectors of the clusters. At each stage of the process the two clusters that have the smallest centroid distance are combined.

5. Ward's Method: Ward's method is an ANOVA-based approach. At each stage it combines the two clusters whose merger gives the smallest increase in the within-cluster sum of squares.

5. COMPARISON BETWEEN K-MEANS AND AGGLOMERATIVE ALGORITHM [21]

The two algorithms are compared by size of dataset, number of clusters, type of dataset and type of software. Each algorithm is run on the dataset twice, varying its size and type. The comparison is shown in Table 8.

TABLE 8
Algorithm      Size of dataset                  Number of clusters                    Type of dataset            Type of software
K-means        Huge dataset and small dataset   Large and small number of clusters    Ideal and random dataset   LNKnet cluster front view
Agglomerative  Huge dataset and small dataset   Large and small number of clusters    Ideal and random dataset   LNKnet cluster front view

The k-means algorithm has lower quality (accuracy) than the other algorithm, although its quality is very good when the dataset is large. The hierarchical clustering technique produces better results when the dataset is small. With a random dataset, the hierarchical method is better than the others. K-means clustering is disturbed by noise in the dataset, which affects the result. Running an algorithm in different software leads to almost the same result, because all software follows the same steps. Clustering the data is the main concept of this research; it is done by different types of algorithm.
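For readers who want to repeat such a comparison on their own data, the agglomerative procedure of Section 3.3 and three of the linkage rules can be sketched as follows. This is a naive illustration under stated assumptions, not the software used in the comparison: the function names are hypothetical, the stopping criterion is a requested number k of clusters, and linkage distances are recomputed from the points at every step instead of being read from a stored proximity matrix.

```python
import math
from itertools import combinations

def single_link(a, b):
    """Minimum distance between any point of a and any point of b."""
    return min(math.dist(p, q) for p in a for q in b)

def complete_link(a, b):
    """Maximum distance between any point of a and any point of b."""
    return max(math.dist(p, q) for p in a for q in b)

def average_link(a, b):
    """Average distance over all pairs (one point from a, one from b)."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def agglomerate(points, k, linkage=single_link):
    """Merge the closest pair of clusters until only k clusters remain."""
    clusters = [[p] for p in points]  # every pattern starts as its own cluster
    while len(clusters) > k:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        # Replace Si and Sj by their union (i < j, so deleting j is safe).
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

if __name__ == "__main__":
    # Hypothetical toy points: two tight pairs and one distant point.
    pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0), (20.0, 0.0)]
    print(agglomerate(pts, 3))
    print(agglomerate(pts, 2, linkage=complete_link))
```

Passing the linkage rule as a function keeps the merge loop identical for all variants, which mirrors the remark in Section 3.3 that the variants differ only in their cost function. Calling `agglomerate(points, 1)` runs the merging until all patterns are in one cluster.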

The agglomerative algorithm produces better results on the larger dataset.

CONCLUSION

This paper has described hierarchical and partitioning algorithms with respect to their accuracy on the dataset. Normally, a clustering algorithm is used to reduce space as well as time complexity. The partitioning method was explained through the k-means algorithm, applied to a small data set. The performance of the k-means algorithm is better than that of the hierarchical clustering methods.

References:

1. P. Berkhin, "Survey of clustering data mining techniques", Grouping Multidimensional Data, Springer, 2006.
2. Michael J. Berry, Gordon Linoff, "Data Mining Techniques: for marketing, sales and customer support", John Wiley and Sons, Inc., New York, NY, USA, 1997.
3. B. Vinodhini, "Survey on clustering algorithm", International Journal of Engineering Science and Innovative Technology (IJESIT), volume 2, issue 6, November.
4. K. Krishna, M. N. Murty, "Genetic K-Means algorithm", Systems, Man and Cybernetics, Part B: Cybernetics, IEEE Transactions, ieeexplore.org.
5. Z. Huang, "Extensions to the K-Means algorithm for clustering large datasets with categorical values", Data Mining and Knowledge Discovery, Springer, 1998.
6. W. H. E. Day, H. Edelsbrunner, "Efficient algorithm for agglomerative hierarchical clustering methods", Journal of Classification, Springer, 1984.
7. R. Mac Nally, "Hierarchical partitioning as an interpretative tool in multivariate inference", Australian Journal of Ecology.
8. H. Frigui, R. Krishnapuram, Pattern Recognition, Elsevier, 1997.
9. J. Vesanto, E. Alhoniemi, "Clustering of the self-organizing map", Neural Networks, IEEE Transactions.
10. T. W. Liao, "Clustering of time series data: a survey", Pattern Recognition, Elsevier, 2005.
11. R. Mac Nally, C. J. Walsh, "Hierarchical partitioning public domain software", Biodiversity and Conservation, Springer, 2004.
12. Faloutsos, K. L. Lin, "A fast algorithm for indexing, data mining and visualization of traditional and multimedia datasets", 1995, dl.acm.org.
13. A. K. Jain, "Data clustering: 50 years beyond k-means", Pattern Recognition Letters, Elsevier, 2010.
14. L. Jing, M. K. Ng, J. Z. Huang, "An entropy weighting k-means algorithm for subspace clustering of high-dimensional sparse data", Knowledge and Data Engineering journal, 2007, ieeexplore.ieee.org.
15. D. Beeferman, A. Berger, "Agglomerative clustering of a search engine query log", Proceedings of the sixth ACM SIGKDD, 2000, dl.acm.org.
16. G. Karypis, E. H. Han, V. Kumar, "Chameleon: Hierarchical clustering using dynamic modeling", Computer, 1999, ieeexplore.ieee.org.
17. A. P. Reynolds, G. Richards, B. de la Iglesia, "Clustering rules: a comparison of partitioning and hierarchical clustering algorithms", Journal of Mathematical Modelling and Algorithms, Springer, 2006.
18. T. Hastie, R. Tibshirani, J. Friedman, "The elements of statistical learning: data mining, inference and prediction", the Department of Mathematical and Statistical Science, Springer, 2005.
19. K. Sathiyakumari, G. Manimekalai, "A survey on various approaches in document clustering", IJCTA.
20. R. S. Bhadoria, R. Banjal, H. Alexander, "Analysis of frequent item set mining on variant datasets", International Journal of Computer Applications.
21. Osama Abu Abbas, "Comparisons between clustering algorithm", International Arab Journal of Information Technology.
