A Survey Of Issues And Challenges Associated With Clustering Algorithms

International Journal for Science and Emerging Technologies with Latest Trends 10(1): 7-11 (2013)
ISSN No. (Online): 2250-3641, ISSN No. (Print): 2277-8136

Ms. Asmita Yadav
Assistant Professor, Institute of Professional Excellence & Management, Ghaziabad
(Received 20 June 2013, Accepted 15 July 2013)

Abstract - Data mining is the process of extracting hidden predictive information from large databases. It is a powerful technology that helps companies focus on the important information in their data warehouses. Data mining comprises several tasks: anomaly detection, association rule learning, clustering, classification, regression, and summarization. This paper is mainly concerned with clustering, the procedure of organizing objects into groups whose members are similar in some way. The present review attempts to identify the major issues and challenges associated with different clustering algorithms.

Keywords: k-means, Clustering Algorithms, Data Mining

1. DATA MINING

Data mining is the process of extracting hidden predictive information from large databases, which helps companies focus on the important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make practical, knowledge-driven decisions. They can answer business questions that were traditionally too time-consuming to resolve. Data mining searches databases for hidden patterns and predictive information that experts may miss because it lies outside their expectations [9]. Data mining algorithms embody techniques that have been implemented as established, consistent, understandable tools that consistently outperform older statistical methods.
Before data mining takes place, the data is pre-processed. The data is first selected and assembled into a large target data set; pre-processing is necessary in order to analyze multivariate data sets. The target data set is then cleaned by removing noise and dealing with missing data. Data mining involves the six classes of tasks given below:

1. Anomaly detection: the identification of unusual records or data errors.
2. Association rule learning: the search for relationships between variables. This is sometimes referred to as market basket analysis.
3. Clustering: the process of finding groups and structures in the data that are in some way or another "similar", without using known structures in the data.
4. Classification: the task of generalizing known structure to apply it to new data.
5. Regression: the search for a function that models the data with the least error.
6. Summarization: a more compact representation of the data set, including visualization and report generation [10].
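As a small illustration of task 5 (regression), the sketch below fits a straight line to data by ordinary least squares, one common way to "model the data with the least error". The function name `fit_line` and the sample values are assumptions made for this example; the paper itself does not prescribe a particular regression method.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares.

    Illustrative helper (not from the paper): it minimises the sum of
    squared vertical errors between the line and the data points.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and co-variation of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical sample data lying exactly on y = 2x + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)
```

Unlike clustering, regression requires a known target value for every record; clustering works on unlabeled data.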

Of the six data mining tasks discussed above, clustering is the most important for this paper, so it is discussed in detail in the following section.

2. CLUSTERING

Cluster: A cluster is a group of objects that have some common characteristics. In other words, a cluster is a collection of objects that are alike and that differ from the objects belonging to other clusters. The core objective of clustering is to find the inherent grouping in a set of unlabeled data [2]. There is no standard for choosing a best clustering algorithm independently of the dataset. It is the user who must supply the criterion, in such a way that the result of clustering suits their needs. Clustering algorithms can be applied in many fields: in marketing, to find groups of customers with similar behaviour and buying habits; in biology, for the classification of plants and animals; or in libraries, for ordering books.

The major requirements for a clustering algorithm are that it should be scalable, deal with different types of attributes, discover clusters of arbitrary shape, need minimal domain knowledge to determine its input parameters, be able to deal with noise and outliers, and be insensitive to the order of the input records.

Distance between Two Clusters: The distance between two clusters involves some or all elements of the two clusters. The clustering method determines how the distance should be computed.

Similarity: A similarity measure SIMILAR(Di, Dj) can be used to represent the similarity between two documents. A typical similarity measure yields a value of 0 for documents exhibiting no agreement among the assigned index terms, and 1 when perfect agreement is detected. Intermediate values are obtained for cases of partial agreement.

Average Similarity: If the similarity measure is computed for all pairs of documents (Di, Dj) except when i = j, an average value AVERAGE SIMILARITY is obtainable.
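The two measures just defined can be sketched in a few lines. The paper does not say which similarity function or which normalising constant it has in mind, so the choices here are assumptions: SIMILAR is taken to be the Jaccard coefficient over each document's set of index terms (0 for no shared terms, 1 for identical term sets), and the constant is taken to be 1/(n(n-1)), so that the result is the mean over all ordered pairs with i different from j. The function names and the sample documents are hypothetical.

```python
from itertools import permutations


def similar(terms_i, terms_j):
    """Jaccard coefficient between two sets of index terms (an assumed
    choice of SIMILAR): 0.0 for no agreement, 1.0 for perfect agreement."""
    a, b = set(terms_i), set(terms_j)
    if not a and not b:
        return 0.0  # two empty term sets: treated as no agreement
    return len(a & b) / len(a | b)


def average_similarity(documents):
    """Mean of similar(Di, Dj) over all ordered pairs with i != j.

    The normalising constant 1 / (n * (n - 1)) is an assumption.
    """
    n = len(documents)
    if n < 2:
        raise ValueError("need at least two documents")
    total = sum(
        similar(documents[i], documents[j])
        for i, j in permutations(range(n), 2)  # every (i, j) with i != j
    )
    return total / (n * (n - 1))


# Hypothetical documents described by their index terms.
docs = [
    {"cluster", "kmeans"},
    {"cluster", "kmeans"},
    {"regression"},
]
print(similar(docs[0], docs[1]))   # identical term sets
print(similar(docs[0], docs[2]))   # no shared terms
print(average_similarity(docs))
```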
The average similarity is written as

$$\text{AVERAGE SIMILARITY} = \text{CONSTANT} \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} \text{SIMILAR}(D_i, D_j)$$

where i = 1, 2, ..., n and j = 1, 2, ..., n, with i ≠ j.

Problems in clustering: Clustering techniques have several problems. They do not address all the requirements effectively (and simultaneously). Time complexity becomes a problem with a large number of dimensions and a large number of data items. The effectiveness of a distance-based clustering method depends on the distance function used, and defining a new distance function when one is required is not always easy, especially in multidimensional spaces. Finally, the result of a clustering algorithm can be interpreted in different ways [11].

3. REVIEW OF LITERATURE

Research on clustering techniques has been active since the early 1990s. Nowadays many clustering algorithms are available that are useful in different areas, and the best-suited algorithm can be selected according to the requirements at hand.

K. A. Abdul Nazeer and M. P. Sebastian [1] presented an enhanced k-means algorithm that combines a systematic method for finding initial centroids with an efficient way of assigning data points to clusters. This method completes the entire clustering process in O(n^2) time without sacrificing the accuracy of the clusters, whereas previous improvements of the k-means algorithm compromise on either accuracy or efficiency. A limitation of the proposed algorithm is that the value of k, the number of desired clusters, must still be given as an input, regardless of the distribution of the data points. Developing statistical methods that compute the value of k from the data distribution is suggested for future research. Methods for refining the computation of initial centroids are also worth investigating.

Malay K. Pakhira [2] proposed a modified algorithm that maintains all the important characteristic features of the basic k-means and at the same time eliminates the possibility of generating empty clusters. It has been shown that this algorithm is semantically equivalent to the serial k-means algorithm. The proposed clustering scheme was able to solve the empty cluster problem to a great extent without any significant performance degradation.

Neha Aggarwal and Kirti Aggarwal [3] presented a mid-point based k-means algorithm for finding the initial centres, so that the k-means algorithm produces the same result every time it is run on the same dataset. It also removes the limitation of k-means that the final clusters depend heavily on the selection of the initial centroids, which causes the algorithm to converge to a local optimum.

Ahamed Shafeeq and Hareesha [4] proposed dynamic clustering of data with a modified k-means algorithm, which overcomes the problem of having to specify the number of clusters in advance by finding the optimal number of clusters on the run. The main drawback of the proposed approach is that it takes more computational time than k-means for larger data sets.

Pankaj Jadwal and Ruchi Dave [5] proposed an improved and customised I-K Means for avoiding the similar distance problem, by which the same-distance problem can be solved and a better result obtained. It depends on an equal distribution of data in each cluster and on a quality factor.
Rajeev Kumar and Rajeshwar Puran [6] proposed an enhanced k-means clustering algorithm using a red-black tree and a min-heap. It saves the distances between data objects and clusters and changes them dynamically when required. However, saving the distances requires much space. Thus, although their algorithm is superior to the traditional k-means algorithm in terms of time complexity, it lags behind in terms of space complexity. Another stated drawback of this algorithm is that its orientation has to be sorted.

K-means Algorithm

In simple terms, k-means clustering classifies objects into K groups based on their attributes (features), where K is a positive integer. K-means is a prototype-based (center-based) clustering technique and one of the algorithms that solve the well-known clustering problem. It creates a one-level partitioning of the data objects. K-means (KM) defines a prototype in terms of a centroid, which is the mean of a group of points, and is applied to objects in an n-dimensional continuous space. Another technique as prominent as K-means is K-medoid, which defines a prototype as the most representative point of a group and can be applied to a wide range of data, since it needs only a proximity measure for a pair of objects. The difference from a centroid is that a medoid corresponds to an actual data point [8].

As for the advantages of K-means [10], one referenced paper states: "The process, which is called k-means, appears to give partitions which are reasonably efficient in the sense of within-class variance, corroborated to some extent by mathematical analysis and practical experience. Also, the k-means procedure is easily programmed and is computationally economical, so that it is feasible to process very large samples on a digital computer." Likewise, another author summarizes the benefits of K-means in the introduction of his work: the K-means algorithm is one of the first that a data analyst will use to investigate a new data set, because it is algorithmically simple, relatively robust and gives good enough answers over a wide variety of data sets.

Overall, the optimality of K-means can be analyzed in terms of two components [29]. The first is optimal membership: each point is a member of the cluster whose representative point is closest to it. The second is optimal content for a given membership: each cluster's representative point is the centroid of its member points; more generally, similarity is defined relative to the representative point that is selected. For optimal cluster membership the idea is simple: any point, even one outside the original dataset, can be considered a member of a cluster.

4. ISSUES AND CHALLENGES

Since various clustering algorithms are available to automate or semi-automate the clustering procedure, it is very difficult to choose a suitable algorithm for a particular dataset. Different algorithms applied to the same dataset may produce different kinds of results with different clusters. Each algorithm has its own run time, complexity, error frequency, resource usage and so on. Another issue is that the outcome of a clustering algorithm depends mainly on the type of dataset used. As the size and dimensionality of datasets increase day by day, they become difficult for a particular clustering algorithm to handle. The complexity of the data also increases: audio, video, pictures and other multimedia data form very heavy databases, which in turn increases the running time of a clustering algorithm.
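The dependence of the outcome on both the algorithm and its inputs can be made concrete with k-means itself. The following is a minimal sketch, not the implementation of any cited author: it alternates the two optimality conditions described in the K-means Algorithm section (assign each point to its nearest centroid, then move each centroid to the mean of its members). The function names, the four-point toy data set and the decision to leave the centroid of an empty cluster unchanged are assumptions made for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_centroids, max_iter=100):
    """Basic k-means; k is the number of initial centroids supplied."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Membership step: each point joins the cluster of its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the procedure has converged
        labels = new_labels
        # Centroid step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[k] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return labels, centroids


def within_cluster_sse(points, labels, centroids):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(squared_distance(p, centroids[lab])
               for p, lab in zip(points, labels))


# Toy data: the corners of a 4 x 1 rectangle.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Same data, same algorithm, two different sets of initial centroids.
for init in ([(0, 0), (4, 0)], [(0, 0), (0, 1)]):
    labels, centroids = kmeans(points, init)
    print(init, labels, within_cluster_sse(points, labels, centroids))
```

On this toy data the first initialization converges to the left/right split with a within-cluster sum of squared errors of 1.0, while the second converges to the top/bottom split with 16.0. Both runs satisfy the two optimality conditions, yet only the first is the better partition, which is the sensitivity to initial centroids noted in the literature review.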
Furthermore, clustering algorithms do not address all of the requirements simultaneously and effectively, which makes the result uncertain. Most clustering algorithms depend on the distance function they use. If the given distance function does not perform well, a new distance function may be required, which is difficult to formulate, especially for multi-dimensional data, and this makes the work more tedious. The output of a clustering algorithm can also be interpreted in different ways, which may confuse users trying to understand the result. Considerable care is therefore needed when choosing a clustering algorithm for a dataset. The selection may be based on the type of dataset, the time requirement, the efficiency needed, the accuracy required, the error tolerance and so on. The main challenge is thus to choose, among the many known clustering algorithms, the one that fits the dataset and the user's requirements, so that the user obtains the desired results, which in turn helps further research in the data mining process.

5. CONCLUSION

This paper studies different kinds of clustering algorithms. It first defines the data mining process, which is the method of finding predictive information in large databases. It then defines the clustering process, which is the procedure of assembling objects into groups whose members resemble one another in some way. After that, clustering algorithms are studied in detail and compared from different perspectives. The paper highlights the relevant issues and challenges, which may help upcoming researchers to carry on their work.

REFERENCES

[1] K. A. Abdul Nazeer and M. P. Sebastian, "Improving the Accuracy and Efficiency of the k-means Clustering Algorithm", World Congress on Engineering 2009, Vol. 1, 2009.
[2] Malay K. Pakhira, "A Modified K-means Algorithm to Avoid Empty Clusters", International Journal of Recent Trends in Engineering, Vol. 1, No. 1, May 2009.
[3] Neha Aggarwal and Kirti Aggarwal, "A Mid-Point based K-means Clustering Algorithm for Data Mining", International Journal on Computer Science and Engineering (IJCSE), Vol. 4, No. 06, June 2012.
[4] Ahamed Shafeeq B M and Hareesha K S, "Dynamic Clustering of Data with Modified K-means Algorithm", 2012 International Conference on Information and Computer Networks (ICICN 2012).
[5] Pankaj Jadwal and Ruchi Dave, "An Improved and Customised I-K Means for Avoiding Similar Distance Problem", International Journal of Engineering Research & Application, Vol. 2, 2012.
[6] Rajeev Kumar and Rajeshwar Puran, "Enhanced K-Means Clustering Algorithm Using Red Black Tree and Min-Heap", International Journal of Innovation, Management and Technology, Vol. 2, 2011.
[7] Ren Jingbiao and Yin Shaohong, "Research and Improvement of Clustering Algorithm in Data Mining", 2010 2nd International Conference on Signal Processing Systems (ICSPS).
[8] M. Srinivas and C. Krishna Mohan, "Efficient Clustering Approach using Incremental and Hierarchical Clustering Methods", 978-1-4244-8126-2/10/$26.00, 2010 IEEE.
[9] Hemanta Kumar Kalita, Dhruba Kumar Bhattacharyya and Avijit Kar, "A New Algorithm for Ordering of Points to Identify Clustering Structure Based On Perimeter of Triangle: OPTICS (BOPT)", 15th International Conference on Advanced Computing and Communications.
[10] H. G. Wilson, B. Boots and A. A. Millward, "A Comparison of Hierarchical and Partitional Clustering Techniques for Multispectral Image Classification", 0-7803-7536-X (C) 2002 IEEE.