An Enhanced K-Medoid Clustering Algorithm


An Enhanced Clustering Algorithm

Archna Kumari, Science & Engineering, kumara.archana14@gmail.com
Pramod S. Nair, Science & Engineering, pramodsnair@yahoo.com
Sheetal Kumrawat, Science & Engineering, sheetal2692@gmail.com

Abstract

Data mining is the non-trivial process of identifying valid and useful patterns in raw data. Among the major data mining techniques used for analysis are association, classification and clustering. Clustering is used to group homogeneous data, but it differs from classification: in classification, data is grouped according to predefined domains or subjects, whereas a basic clustering technique computes distances to determine how well each data item fits into a group. Clustering is useful for discovering interesting patterns and structures in a large data set. Many clustering algorithms have been proposed; they can be divided into partitional, grid-based, density-based, model-based and hierarchical methods. This paper proposes a new, enhanced k-medoid clustering algorithm that eliminates deficiencies of the existing k-medoid algorithm. It first computes the k initial medoids as per the user's needs and then produces comparatively better clusters. It follows an organized way to generate the initial medoids and applies an effective approach for allocating data points to clusters. It reduces the mean square error without sacrificing execution time or memory use compared to the existing k-medoid algorithm.

Keywords: Data Mining, Clustering, Partitional Clustering, Enhanced Algorithm

I. INTRODUCTION

Data mining is the extraction of information from large amounts of data to reveal hidden knowledge and to facilitate its use in real-time applications. It is a significant process in which useful and valid patterns are identified in data.
Data mining works directly on the data, and techniques are developed to reach reliable conclusions and decisions from massive amounts of it. Many techniques are used in the data mining process for data analysis, such as clustering, association and classification [3, 11]. Clustering is among the most effective techniques for data analysis, and most applications use cluster analysis methods to categorize data. Clustering groups similar kinds of data, but it differs from classification: in classification, data is grouped according to predefined domains or subjects. Clustering plays an important role in analysis within the field of data mining. Clustering is the process of partitioning a collection of data into meaningful sub-categories known as clusters; it helps users grasp the natural grouping in a data set. It is an unsupervised classification, meaning it has no predefined categories. Data is sorted into clusters in such a way that items in the same cluster are similar and items in other clusters are dissimilar; the aim is to maximize intra-cluster similarity while minimizing inter-cluster similarity. Clustering is useful for discovering interesting patterns and structures in a large data set. Many clustering algorithms have been proposed; they can be divided into partitional, grid-based, density-based, model-based and hierarchical methods [9].

A partition-based clustering algorithm first takes a value k, denoting the number of partitions to be created, and then applies an iterative relocation procedure that tries to improve the partitioning [7]. Among the important partitional clustering algorithms are k-means [5, 6] and k-medoids [8]. In the k-means algorithm, the centroid is defined as the mean of the cluster points.
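The distinction matters for what follows: a k-means centroid is a coordinate-wise mean and need not coincide with any actual data point, whereas a medoid always is a real object. A minimal illustrative sketch (Python, not from the paper) of the centroid-as-mean definition:

```python
def centroid(points):
    """Centroid of a cluster in k-means: the coordinate-wise mean of its
    points. Note it need not be an actual data point, which is exactly
    what k-medoids avoids by using real objects as cluster centers."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dims))
```

For example, the centroid of (0, 0), (2, 0) and (1, 3) is (1, 1), which is not one of the three input points.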
In the k-medoid clustering algorithm, by contrast, actual object points are used as the representative points that form cluster centers. A disadvantage of k-means is that it does not produce the same result on every run, because the resulting clusters depend on the initial random assignments. As an attempt to solve this problem, the k-medoid algorithm, an unsupervised learning algorithm, is used. The proposed work is therefore dedicated to an enhanced k-medoid clustering algorithm that eliminates deficiencies of the existing k-medoid algorithm. It first computes the k initial medoids as per the user's needs and then produces comparatively better clusters.

II. LITERATURE SURVEY

A detailed study has been carried out to identify the drawbacks of existing systems and possible solutions to their limitations. The performance of iterative clustering algorithms depends very much on the choice of cluster centers at each step. This section gives a brief outlook on various algorithms, the issues related to them, and the approaches that authors have used to resolve those problems.

[13] The proposed algorithm calculates the distance matrix once and uses it for locating new medoids at every iterative step. The experimental results are compared with similar existing algorithms; the output shows that the new algorithm takes appreciably less computation time with performance equivalent to the PAM clustering algorithm. The advantage lies in using the mean to finalize the center points in each iteration of clustering; the proposed algorithm finds the center points in less time than the k-means algorithm.

[12] The paper describes a k-medoid algorithm in which object points are chosen as the initial medoids, which makes the clustering process better than k-means. The k-means and k-medoids algorithms are computationally expensive when execution time is considered. The new approach in the proposed algorithm reduces the shortcomings of the existing k-means algorithm: it first determines the initial centroids for k as per the user's needs, and so offers better, more effective and stable clusters. It also takes minimal execution time because it eliminates redundant distance computations by reusing results from the previous iteration.

[11] The paper explains that clustering is the process of organizing items into groups, or classes, so that data within a cluster has more resemblance to other data in the same cluster but is very different from the values in other clusters. It examines two clustering algorithms, k-means and k-medoid. The standard k-medoid algorithm suffers from several shortcomings. First, the number of clusters must be defined before the k-medoid process starts. Second, if the k representative objects are not chosen properly in the initial stage, the whole clustering process moves in the wrong direction and finally leads to degraded clustering accuracy. Third, it is also sensitive to the order of the input data points. The main drawback was eliminated by exploiting a cluster validity index.

[6] The paper explains that the mean square error of clusters can be reduced by changing how the initial centroids are created. If the initial centroids are chosen systematically and properly, far better clusters are created. In this algorithm, the distance between every pair of data points is first determined; using that calculation, initial centroids are created by taking points in the same set that have minimum distance to one another.

III. PROPOSED ALGORITHM

Distance Calculation: Calculating the path length between two clusters involves some or all elements of the clusters. To obtain the similarity or relativity between the components of a population, a common metric of distance between two points is needed. The Euclidean metric is the most common distance measure; it gives the distance between two points p = (p1, p2, ...) and q = (q1, q2, ...) as

    d(p, q) = ( Σ_i (p_i − q_i)² )^(1/2)        Eq. (1)

K-medoids algorithm: The basic strategy of the k-medoids clustering algorithm is to find k clusters in n objects by first arbitrarily choosing a representative object (the medoid), the most centrally located object in a cluster, for each cluster. Each remaining object is clustered with the medoid to which it is most similar. The k-medoids method uses representative objects as reference points instead of taking the mean value of the objects in each cluster. The algorithm takes the input parameter k, the number of clusters into which the set of n objects is to be partitioned. A typical k-medoids algorithm for partitioning based on medoids (central objects) is as follows [12]:

Input:
  k: the number of clusters
  D: a data set containing n objects
Output: a set of k clusters that minimizes the sum of the dissimilarities of all objects to their nearest medoid.
Method: arbitrarily choose k objects in D as the initial representative objects, then repeat:
1. Assign each remaining object to the cluster with the nearest medoid.
2. Randomly select a non-medoid object O_random.
3. Compute the total cost S of swapping medoid O_j with O_random.
4. If S < 0, swap O_j with O_random to form the new set of k medoids.
5. Select the configuration with the lowest cost.
6. Repeat steps 2 to 5 until there is no change in the medoids.

Proposed algorithm: The proposed approach is primarily based on the classical partitional clustering rule. The process of declaring k, the number of clusters, remains intact in the proposed method as well.
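The classical k-medoids loop described above (assign, try a swap, keep it if the cost drops) can be sketched as follows. This is an illustrative Python sketch, not the paper's implementation; for brevity it tries swaps exhaustively rather than sampling O_random, which converges to the same swap-optimal result:

```python
import random

def euclidean(p, q):
    """Euclidean distance between two points, as in Eq. (1)."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q)) ** 0.5

def total_cost(data, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(euclidean(a, m) for m in medoids) for a in data)

def k_medoids(data, k, seed=0):
    """Classical k-medoids: arbitrary initial medoids, then keep swapping
    a medoid with a non-medoid object whenever the swap lowers the total
    cost (S < 0 in the paper's notation), until no swap helps."""
    rng = random.Random(seed)
    medoids = rng.sample(data, k)
    improved = True
    while improved:
        improved = False
        cost = total_cost(data, medoids)
        for i in range(k):
            for o in data:
                if o in medoids:
                    continue  # only non-medoid objects are swap candidates
                candidate = medoids[:i] + [o] + medoids[i + 1:]
                new_cost = total_cost(data, candidate)
                if new_cost < cost:  # keep the swap only if it lowers cost
                    medoids, cost, improved = candidate, new_cost, True
    # final assignment: each object joins the cluster of its nearest medoid
    labels = [min(range(k), key=lambda j: euclidean(a, medoids[j]))
              for a in data]
    return medoids, labels
```

On two well-separated groups of points, the loop settles with one medoid inside each group, and every medoid is an actual data point.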
The main difference in the proposed approach is the way it chooses the starting k object points. The enhanced k-medoids algorithm initializes the cluster medoids by selecting the initial medoid points systematically. The implementation results show that it performs better in terms of the time taken to cluster all the objects, and its memory usage is also comparatively lower. Initially, the distance between the origin and each object point is calculated. The object points are then sorted in ascending order of that distance, and the ordered objects are divided into k clusters. The medoids of the k clusters are then calculated using the distance measures.

Algorithm:
Input: A = {a1, a2, ..., an} // set of n items
       k: the number of desired clusters
Output: a set of k clusters
Steps:
1. Let O be the origin point, with all attribute values 0.
2. Calculate the distance between each data point and the origin.
3. Sort the data points in ascending order of the value obtained in step 2.
4. Partition the sorted data points into k equal sets.
5. Assign the first element of each partition to be an initial medoid.
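Steps 1-5, the proposed initialization, can be sketched as below. This is an illustrative Python sketch under the assumption that the data are numeric tuples and n >= k; when n is not divisible by k, the last partition simply absorbs the remainder:

```python
def initial_medoids(data, k):
    """Proposed initialization (steps 1-5): sort the points by their
    distance from the origin, split the sorted list into k equal parts,
    and take the first point of each part as an initial medoid."""
    dist_from_origin = lambda p: sum(x * x for x in p) ** 0.5
    ordered = sorted(data, key=dist_from_origin)
    part = len(ordered) // k  # size of each of the k equal sets
    return [ordered[j * part] for j in range(k)]
```

Unlike the arbitrary initialization of the classical algorithm, this choice is deterministic, which is what makes the enhanced algorithm's results repeatable across runs.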

International Journal on Recent and Innovation Trends in Computing and Communication, ISSN: 2321-8169

6. Compute the distance of each data point a_i (1 <= i <= n) to every medoid m_j (1 <= j <= k) as d(a_i, m_j).
7. Repeat:
8.   Find the closest medoid m_j for each data point a_i and assign a_i to cluster j.
9.   Set ClusterId[i] = j. // j: id of the closest cluster
10.  Set NearestDist[i] = d(a_i, m_j).
11.  For each cluster j (1 <= j <= k), recalculate the medoid.
12.  For each a_i:
     12.1 Calculate the distance to the current nearest medoid.
     12.2 The data point stays in the same cluster if the current nearest distance is less than or equal; else:
          12.2.1 For every medoid m_j (1 <= j <= k), compute the distance d(a_i, m_j) and reassign a_i to the nearest medoid.
   Until the convergence criteria are met.

IV. RESULTS ANALYSIS

This section presents the performance analysis of the implemented k-medoid algorithms, whose performance is evaluated and compared below.

A) Mean Square Error

Figure 1 and Table 1 show the mean square error comparison, carried out on the same Iris dataset. The mean square error is calculated from the differences between the instances of each cluster and their cluster center; smaller values indicate clusters of higher quality. The values plotted in the graph are listed in Table 1, where the mean square error of the proposed algorithm is given in the last column and the first data column contains the existing k-medoid algorithm. In Figure 1, red shows the proposed algorithm's performance and green shows the existing k-medoid algorithm; the Y axis shows the mean square error and the X axis the number of clusters.
TABLE 1: MEAN SQUARE ERROR PERFORMANCE

Clusters | K-Medoid Algorithm (MSE) | Proposed Algorithm (MSE)
5        | 77.80                    | 54.98
10       | 36.62                    | 34.52
15       | 27.75                    | 25.14
20       | 23.99                    | 20.96
25       | 23.96                    | 18.15

Figure 1: Mean Square Error Comparison Chart (Y axis: mean square error; X axis: number of clusters)

B) Execution Time

The comparative time utilization of the proposed and existing k-medoid algorithms is given in Figure 2 and Table 2. In the graph, the horizontal axis shows the number of clusters and the vertical axis the execution time in milliseconds; red shows the output of the existing k-medoid clustering algorithm and green the output of the proposed k-medoid algorithm. According to the comparative analysis, the proposed technique consumes less time than the original k-medoid algorithm; the new algorithm gives better performance without much increase in execution time. The comparison is done on the Iris dataset.

TABLE 2: COMPARISON BETWEEN ALGORITHMS BY NUMBER OF CLUSTERS AND EXECUTION TIME ON THE IRIS DATASET

Clusters | K-Medoid Algorithm (time in ms) | Proposed K-Medoid Algorithm (time in ms)
5        | 33.9879                         | 16.8096
10       | 41.3256                         | 33.2097
15       | 106.4943                        | 52.2604
20       | 164.0374                        | 69.0788
25       | 209.2598                        | 72.3591

Figure 2: Execution Time Comparison for the Iris Dataset (Y axis: time in milliseconds; X axis: number of clusters)
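One plausible reading of the mean square error reported above, for medoid-based clusters, is the average squared distance of each point to its cluster's medoid. An illustrative Python sketch (the paper does not give its exact formula, so this is an assumption):

```python
def mean_square_error(data, medoids, labels):
    """Mean square error of a clustering: average squared distance of
    each point to the medoid of its assigned cluster. Smaller values
    indicate tighter, higher-quality clusters."""
    sq_dist = lambda p, q: sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    return sum(sq_dist(a, medoids[j]) for a, j in zip(data, labels)) / len(data)
```

With this measure, the lower proposed-algorithm column in Table 1 corresponds to points sitting, on average, closer to their cluster centers.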

C) Memory Used

Memory use of the system, also termed the space complexity of an algorithm, can be calculated using the following formula:

    Memory consumption = Total memory − Free memory

The amount of memory consumed depends on the amount of data residing in main memory, which in turn affects the computational cost of executing an algorithm. The comparison between the existing k-medoid and the proposed k-medoid algorithm is done on the basis of the memory required to execute each algorithm. The experimental results of the existing and proposed clustering algorithms are shown in Figure 3, which plots the memory use of the proposed k-medoid algorithm against that of the existing k-medoid algorithm; the Y axis shows the memory consumed during the experiments and the X axis the number of clusters. According to the results, the proposed algorithm demonstrates similar behaviour even as the number of clusters increases.

TABLE 3: COMPARISON BETWEEN PROPOSED AND EXISTING SYSTEM BASED ON NUMBER OF CLUSTERS AND MEMORY REQUIREMENT

Clusters | K-Medoid (memory in bytes) | Proposed K-Medoid (memory in bytes)
5        | 204596                     | 204652
10       | 204544                     | 201960
15       | 204576                     | 212792
20       | 212768                     | 210132
25       | 212788                     | 220984

Figure 3: Memory Requirement Comparison by Number of Clusters for the Iris Dataset

The comparison between the existing k-medoid and the proposed k-medoid algorithm is done on the Iris data set, which contains 150 data points with five attributes. Table 4 summarizes the performance of both algorithms. According to the obtained results, the proposed algorithm clusters the data points effectively; the proposed partitional clustering algorithm is therefore adoptable and efficient.

TABLE 4: PERFORMANCE PARAMETERS

S.No. | Parameter         | K-Medoid Algorithm   | Proposed Algorithm
1     | Mean Square Error | Comparatively higher | Comparatively lower
2     | Execution Time    | Comparatively higher | Comparatively lower
3     | Memory Use        | Comparable           | Comparable

V. CONCLUSION AND FUTURE WORKS

The enhanced k-medoid algorithm, a new approach to the classical partition-based clustering algorithm, improves the execution time of the k-medoid algorithm. The results conclude that the proposed implementation performs better than the existing k-medoid algorithm; from the experiments it is observed that the proposed algorithm gives better performance on all parameters. The proposed approach is primarily based on the classical partitional clustering rule: the process of declaring k, the number of clusters, remains intact in the proposed method, and the main difference lies in the way the starting k object points are chosen. The proposed system's performance was evaluated, and it was concluded that calculating the distance of the points from the origin and arranging them in ascending order is the better option. Future work may enhance the system's performance with the use of other types of attributes in the data set.

REFERENCES

[1] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, 1990.
[2] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, Vol. 31, No. 3, pp. 264-323, 1999.
[3] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, August 2000.
[4] Rui Xu and Donald Wunsch, "Survey of Clustering Algorithms", IEEE Transactions on Neural Networks, Vol. 16, No. 3, May 2005.
[5] Sanjay Garg and Ramesh Chandra Jain, "Variation of K-Mean Algorithm: A Study for Dimensional Large Data Sets", Information Technology Journal, Vol. 5, No. 6, pp. 1132-1135, 2006.
[6] K. A. Abdul Nazeer and M. P. Sebastian, "Improving the Accuracy and Efficiency of the K-Means Clustering Algorithm", Proceedings of the World Congress on Engineering, Vol. 1, pp. 1-3, July 2009.
[7] T. Velmurugan and T. Santhanam, "A Survey of Partition Based Clustering Algorithms in Data Mining: An Experimental Approach", Journal, Vol. 10, No. 3, pp. 478-484, 2011.
[8] Shalini S. Singh and N. C. Chauhan, "K-means v/s K-medoids: A Comparative Study", National Conference on Recent Trends in Engineering & Technology, 2011.

[9] Rui Xu and Donald Wunsch, "Survey of Clustering Algorithms", IEEE Transactions on Neural Networks, Vol. 16, No. 3, May 2005.
[10] M. S. Chen, J. Han and P. S. Yu, "Data Mining: An Overview from a Database Perspective", IEEE Transactions on Knowledge and Data Engineering, Vol. 8, pp. 866-883, 1996.
[11] Bharat Pardeshi and Durga Toshniwal, "Improved K-Medoid Clustering Based on Cluster Validity Index and Object Density", IEEE, 2010.
[12] Abhishek Patel and Purnima Singh, "New Approach for K-mean and K-medoids Algorithm", International Journal of Computer Applications Technology and Research, 2013.
[13] H. S. Park and C. H. Jun, "A Simple and Fast Algorithm for K-medoids Clustering", Department of Industrial and Management Engineering, POSTECH, 2009.
[14] R. Fisher, "Iris Data Set", UCI Machine Learning Repository, 1936. https://archive.ics.uci.edu/ml/datasets/iris