A Survey Of Issues And Challenges Associated With Clustering Algorithms

International Journal for Science and Emerging Technologies with Latest Trends 10(1): 7-11 (2013)
ISSN No. (Online): 2250-3641, ISSN No. (Print): 2277-8136

Ms. Asmita Yadav
Assistant Professor, Institute of Professional Excellence & Management, Ghaziabad
(Received 20 June 2013, Accepted 15 July 2013)

Abstract: Data mining is the process of extracting hidden predictive information from large databases. It is an influential technology that helps companies focus on the important information in their data warehouses. Data mining comprises several tasks, such as anomaly detection, association rule learning, clustering, classification, regression, and summarization. This paper is mainly concerned with clustering, the procedure of organizing objects into groups whose members are similar in some way. This review attempts to identify the major issues and challenges associated with different clustering algorithms.

Keywords: k-means Clustering Algorithms, Data Mining

1. DATA MINING

Data mining is the process of extracting hidden predictive information from large databases, which helps companies focus on the important information in their data warehouses. Data mining tools predict future trends and behaviors, which allows businesses to make practical, knowledge-driven decisions. Data mining tools can answer business questions that were traditionally too time-consuming to resolve. They search databases for hidden patterns and predictive information that experts may miss because it lies outside their expectations [9]. Data mining algorithms embody techniques that have been implemented as established, consistent, understandable tools that often outperform older statistical methods.
Before data mining is applied, the data are pre-processed: the relevant data are first selected and assembled into a large target data set. Pre-processing is necessary for analyzing multivariate data sets. The target data set is then cleaned by removing noise and records with missing data. Data mining commonly involves the following tasks:

1. Anomaly detection: the identification of unusual records or data errors.
2. Association rule learning: the search for relationships between variables. This is sometimes referred to as market basket analysis.
3. Clustering: the process of finding groups and structures in the data that are in some way "similar", without using known structures in the data.
4. Classification: the task of generalizing known structure so that it can be applied to new data.
5. Regression: the search for a function that models the data with the least error.
6. Summarization: the production of a more compact representation of the data set, including visualization and report generation [10].

Of the six tasks discussed above, clustering is the most important for this paper, so it is discussed in detail in the following section.

2. CLUSTERING

Cluster: A cluster is a group of objects that share some common characteristics. In other words, a cluster is a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters. The core objective of clustering is to find the inherent grouping in a set of unlabeled data [2]. There is no standard for choosing a best clustering algorithm independently of the data set. The user must supply the criterion so that the result of clustering suits their needs. Clustering algorithms can be applied in many fields: in marketing, to find groups of customers with similar behaviour and buying habits; in biology, to classify plants and animals; or in libraries, to order books.

The major requirements for a clustering algorithm are: it should be scalable; it should deal with different types of attributes; it should discover clusters of arbitrary shape; it should need minimal domain knowledge to determine input parameters; it should be able to deal with noise and outliers; and it should be insensitive to the order of input records.

Distance between Two Clusters: The distance between two clusters involves some or all elements of the two clusters. The clustering method determines how the distance is computed.

Similarity: A similarity measure SIMILAR(Di, Dj) can be used to represent the similarity between two documents Di and Dj. A typical similarity measure yields 0 for documents with no agreement among their assigned index terms and 1 when perfect agreement is detected. Intermediate values are obtained for cases of partial agreement.

Average Similarity: If the similarity measure is computed for all pairs of documents (Di, Dj) except when i = j, an average value AVERAGE SIMILARITY is obtainable.
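The two measures just described can be illustrated with a short sketch. It is only an example under stated assumptions: the paper does not fix a particular similarity function or normalizing constant, so the Jaccard coefficient on sets of index terms and the constant 1/(n(n-1)) are assumed here, and the function names `similar` and `average_similarity` are hypothetical.

```python
from itertools import combinations


def similar(terms_i, terms_j):
    """Assumed similarity: Jaccard coefficient of two sets of index terms.

    Returns 0.0 for no shared terms and 1.0 for identical term sets.
    """
    if not terms_i and not terms_j:
        return 0.0
    return len(terms_i & terms_j) / len(terms_i | terms_j)


def average_similarity(documents):
    """Average of SIMILAR(Di, Dj) over all ordered pairs with i != j."""
    n = len(documents)
    if n < 2:
        raise ValueError("need at least two documents")
    # The assumed measure is symmetric, so each unordered pair counts twice.
    total = 2 * sum(similar(a, b) for a, b in combinations(documents, 2))
    constant = 1 / (n * (n - 1))  # assumed normalizing constant
    return constant * total


# Toy documents represented by their index terms (made-up data).
docs = [
    {"cluster", "data", "mining"},
    {"cluster", "data", "mining"},
    {"network", "routing"},
]
print(similar(docs[0], docs[1]), similar(docs[0], docs[2]))
print(average_similarity(docs))
```

With this choice of constant the result is the plain mean over all ordered pairs; a different constant only rescales it.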
Specifically,

$$\text{AVERAGE SIMILARITY} = \text{CONSTANT} \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} \text{SIMILAR}(D_i, D_j)$$

where i = 1, 2, ..., n and j = 1, 2, ..., n, with i ≠ j.

Problems in clustering: Clustering techniques have several problems. They do not address all the requirements effectively and simultaneously. Time complexity becomes a problem with a large number of dimensions and a large number of data items. The effectiveness of a distance-based clustering method depends on the distance function used, and defining a new distance function when one is required is not always easy, especially in multidimensional spaces. Finally, the result of a clustering algorithm can be interpreted in different ways [11].

3. REVIEW OF LITERATURE

Research on the various clustering techniques started in the early 1990s. Nowadays there are many clustering algorithms that are useful in different areas, and the best-suited algorithm can be selected according to the requirements at hand.

K. A. Abdul Nazeer and M. P. Sebastian [1] presented an enhanced k-means algorithm that combines a systematic method for finding initial centroids with an efficient way of assigning data points to clusters. This method completes the entire clustering process in O(n^2) time without sacrificing the accuracy of the clusters. The previous improvements of the k-means algorithm

compromise on either accuracy or efficiency. A limitation of the proposed algorithm is that the value of k, the number of desired clusters, must still be given as an input, regardless of the distribution of the data points. Developing statistical methods to compute the value of k from the data distribution is suggested for future research. Methods for refining the computation of the initial centroids are also worth investigating.

Malay K. Pakhira [2] proposed a modified algorithm that maintains all the important characteristic features of basic k-means while eliminating the possibility of generating empty clusters. The algorithm was shown to be semantically equivalent to the serial k-means algorithm. The proposed clustering scheme was able to solve the empty cluster problem, to a great extent, without any significant performance degradation.

Neha Aggarwal and Kirti Aggarwal [3] presented a mid-point based k-means algorithm for finding the initial centres, so that k-means produces the same result every time it is run on the same data set. This removes the limitation of k-means that the final clustering depends heavily on the selection of the initial centroids, which can cause it to converge to a local optimum.

Ahamed Shafeeq and Hareesha [4] proposed dynamic clustering of data with a modified k-means algorithm, which overcomes the need to fix the number of clusters in advance by finding the optimal number of clusters on the run. The main drawback of the proposed approach is that it takes more computational time than k-means for larger data sets.

Pankaj Jadwal and Ruchi Dave [5] proposed an improved and customised I-K Means for avoiding the similar distance problem, with which the same-distance problem can be solved and a better result obtained. It depends on an equal distribution of data in each cluster and on a quality factor.
Rajeev Kumar and Rajeshwar Puran [6] proposed an enhanced k-means clustering algorithm using a red-black tree and a min-heap, which stores the distances between data objects and clusters and changes them dynamically when required. However, storing the distances requires much space. Thus, although the algorithm is superior to the traditional k-means algorithm in terms of time complexity, it lags behind in terms of space complexity. Another drawback of this algorithm is to sort its orientation.

K-means Algorithm

Put simply, k-means clustering classifies objects into K groups based on their attributes/features, where K is a positive integer. K-means is a prototype-based (center-based) clustering technique and one of the algorithms that solve the well-known clustering problem. It creates a one-level partitioning of the data objects. K-means (KM) defines a prototype in terms of a centroid, which is the mean of a group of points, and is applied to objects in an n-dimensional continuous space. Another technique as prominent as K-means is K-medoid, which defines a prototype as the most representative point of a group and can be applied to a wide range of data, since it needs only a proximity measure for a pair of objects. Unlike a centroid, a medoid corresponds to an actual data point [8].

As for the advantages of k-means [10], one referenced paper states that the process, which is called k-means, appears to give partitions which are reasonably efficient in the sense of within-class variance, corroborated to some extent

by mathematical analysis and practical experience. Also, the k-means procedure is easily programmed and is computationally economical, so that it is feasible to process very large samples on a digital computer. Another work likewise summarizes the benefits of k-means in its introduction: the k-means algorithm is one of the first that a data analyst will use to investigate a new data set because it is algorithmically simple, relatively robust, and gives good enough answers over a wide variety of data sets.

Overall, the optimality of k-means can be analyzed in terms of two components [29]:

Membership optimality: each point is a member of the cluster whose representative point is closest to it.

Representative optimality (for given cluster contents): each cluster's representative point is the centroid of its member points; similarity is defined relative to the selected representative point.

In membership optimality, the concept of optimization is simple, in that any point, even one outside the original data set, can be considered a member of a cluster.

4. ISSUES AND CHALLENGES

Since various clustering algorithms are available to automate or semi-automate the clustering procedure, it is very difficult to choose a suitable algorithm for a particular data set. Different algorithms applied to the same data set may produce different results with different clusters. Each algorithm has its own run time, complexity, error frequency, resource usage, etc. Another issue is that the outcome of a clustering algorithm depends mainly on the type of data set used. As the size and dimensionality of data sets increase day by day, they become difficult for a particular clustering algorithm to handle. The complexity of data sets also increases: they include audio, video, pictures, and other multimedia data, which form very heavy databases, and this in turn increases the running time of a clustering algorithm.
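To make these issues concrete, the following is a minimal sketch of the standard iterative k-means procedure, not of any variant reviewed in Section 3. Its two steps mirror the two optimality components described there: `assign` enforces membership optimality and `update` recomputes each centroid from its members. The function names, the rule that an empty cluster keeps its old centroid, and the four-point toy data set are assumptions made for illustration. The example runs the same algorithm on the same data from two different sets of initial centroids and reaches two different stable partitions.

```python
import math


def assign(points, centroids):
    """Membership step: index of the closest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]


def update(points, labels, centroids):
    """Centroid step: each centroid becomes the mean of its member points.

    Assumption: an empty cluster simply keeps its previous centroid.
    """
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == c]
        if members:
            new_centroids.append(
                tuple(sum(coord) / len(members) for coord in zip(*members))
            )
        else:
            new_centroids.append(old)
    return new_centroids


def kmeans(points, init_centroids, max_iter=100):
    """Alternate the two steps until the memberships stop changing."""
    centroids = [tuple(c) for c in init_centroids]
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


def sse(points, labels, centroids):
    """Sum of squared distances of points to their cluster centroid."""
    return sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))


# Four corners of a 4 x 1 rectangle (made-up data), k = 2.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

labels_a, cents_a = kmeans(points, [(0, 0.5), (4, 0.5)])  # left/right start
labels_b, cents_b = kmeans(points, [(2, 0), (2, 1)])      # bottom/top start

print(labels_a, sse(points, labels_a, cents_a))
print(labels_b, sse(points, labels_b, cents_b))
```

Both runs satisfy the two optimality conditions, yet the second partition has a much larger within-cluster error. This is the local-optimum behaviour that the initial-centroid methods in the literature review aim to avoid.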
Furthermore, clustering algorithms do not address all of the requirements simultaneously and effectively, which makes the result uncertain. Most clustering algorithms depend on the distance function used. If the given distance function does not perform well, a new distance function may be required, which is difficult to formulate, especially for multi-dimensional data, and this makes the work more tedious. Also, the output of a clustering algorithm can be interpreted in different ways, which may confuse users trying to understand the result.

Great care is therefore needed when choosing a clustering algorithm for a data set. The selection may be based on the type of data set, the time requirement, the efficiency needed, the accuracy required, the error tolerance, etc. The main challenge is thus to choose, from the many known clustering algorithms, the one that suits the data set and the user's requirements, so that the user gets the desired results, which in turn helps further research in the data mining process.

5. CONCLUSION

This paper studies different kinds of clustering algorithms. It first defines the data mining process, which is the method of finding predictive information in large databases. It then defines the clustering process, which is the procedure of assembling objects into groups whose members resemble one another in some way. After that, clustering algorithms are studied and compared from different perspectives. This paper highlights the relevant issues and

challenges, which may help future researchers carry on their work.

REFERENCES

[1] K. A. Abdul Nazeer and M. P. Sebastian, "Improving the Accuracy and Efficiency of the k-Means Clustering Algorithm", World Congress on Engineering 2009, vol. 1, 2009.
[2] Malay K. Pakhira, "A Modified K-means Algorithm to Avoid Empty Clusters", International Journal of Recent Trends in Engineering, vol. 1, no. 1, May 2009.
[3] Neha Aggarwal and Kirti Aggarwal, "A Mid-Point based K-means Clustering Algorithm for Data Mining", International Journal on Computer Science and Engineering (IJCSE), vol. 4, no. 06, June 2012.
[4] Ahamed Shafeeq B M and Hareesha K S, "Dynamic Clustering of Data with Modified K-means Algorithm", 2012 International Conference on Information and Computer Networks (ICICN 2012).
[5] Pankaj Jadwal and Ruchi Dave, "An Improved and Customised I-K Means for Avoiding Similar Distance Problem", International Journal of Engineering Research & Application, vol. 2, 2012.
[6] Rajeev Kumar and Rajeshwar Puran, "Enhanced K-Means Clustering Algorithm Using Red Black Tree and Min-Heap", International Journal of Innovation, Management and Technology, vol. 2, 2011.
[7] Ren Jingbiao and Yin Shaohong, "Research and Improvement of Clustering Algorithm in Data Mining", 2010 2nd International Conference on Signal Processing Systems (ICSPS).
[8] M. Srinivas and C. Krishna Mohan, "Efficient Clustering Approach using Incremental and Hierarchical Clustering Methods", 978-1-4244-8126-2/10, IEEE, 2010.
[9] Hemanta Kumar Kalita, Dhruba Kumar Bhattacharyya and Avijit Kar, "A New Algorithm for Ordering of Points to Identify Clustering Structure Based On Perimeter of Triangle: OPTICS (BOPT)", 15th International Conference on Advanced Computing and Communications.
[10] H. G. Wilson, B. Boots, and A. A. Millward, "A Comparison of Hierarchical and Partitional Clustering Techniques for Multispectral Image Classification", 0-7803-7536-X, IEEE, 2002.