Research and Improvement on K-means Algorithm Based on Large Data Set


International Journal of Engineering and Computer Science, Volume 6, Issue 7, July 2017. DOI: /ijecs/v6i7.40

Dr. Gurpreet Singh, Professor & Head, CSE, St. Soldier Inst. of Engg. & Tech., Jalandhar (Punjab)
Er. Vanshita Sharma, M.Tech Scholar, St. Soldier Inst. of Engg. & Tech., Jalandhar (Punjab)

Abstract
Highway safety is being compromised, and there are not sufficient safety measures by which traffic crashes can be examined before they occur. A technique is proposed to pre-process the accident-related factors. To handle this pre-processing, a clustering technique is used: the existing k-means algorithm is improved, and the improved k-means algorithm is applied to a traffic dataset. The dataset was compiled from the National Highway Authority. To collect the data, several assessments and surveys were conducted among the public and the staff of the National Highway Authority. The basic aim of the proposed work is to improve highway safety.

Keywords: K-means, Weka, data mining, centroid.

I. Introduction
Data mining is a multidisciplinary field, drawing work from areas including database technology, machine learning, statistics, pattern recognition, information retrieval, neural networks, knowledge-based systems, artificial intelligence, high-performance computing, and data visualization. Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner. The relationships and summaries derived through a data mining exercise are often referred to as models or patterns. Examples include linear equations, rules, clusters, graphs, tree structures, and recurrent patterns in time series. Data mining refers to extracting or mining knowledge from large amounts of data.
The term is actually a misnomer.

Clustering
Clustering is an unsupervised learning technique that separates data items into a number of groups, such that items in the same cluster are more similar to each other and items in different clusters tend to be dissimilar, according to some measure of similarity or proximity. Unlike supervised learning, where each training example carries a class label that expresses its membership of a class, clustering assumes no information about the distribution of the objects. Its task is both to discover the classes present in the data set and to assign objects to those classes in the best way.

Dr. Gurpreet Singh, IJECS Volume 6 Issue 7 July 2017 Page No Page 22145

Figure 1: Clustering

Cluster analysis groups objects (observations, events) based on the information found in the data describing the objects or their relationships. Clustering places similar objects in one group and dissimilar objects in other groups. The greater the similarity (or homogeneity) within a group and the greater the difference (or heterogeneity) between groups, the more distinct the clusters.

Clustering is a tool for data analysis that solves classification problems. Its objective is to distribute cases (people, objects, events, etc.) into groups so that the degree of association is strong between members of the same cluster and weak between members of different clusters. In this way each cluster describes, in terms of the data collected, the class to which its members belong.

Clustering is a discovery tool. It may reveal associations and structure in data which, though not previously evident, are nevertheless sensible and useful once found. The results of cluster analysis may contribute to the definition of a formal classification scheme, such as a taxonomy for related animals, insects, or plants; suggest statistical models with which to describe populations; indicate rules for assigning new cases to classes for identification and diagnostic purposes; provide measures of definition, size, and change in what previously were only broad concepts; or find exemplars to represent classes. The goal of clustering is that the objects in a group are similar to one another and dissimilar to the objects in other groups.

II. Types of clustering algorithms

A. Hierarchical approach
A hierarchical algorithm yields a dendrogram representing the nested grouping of patterns and the similarity levels at which the groupings change. It seeks to build a hierarchy of clusters.
It falls into two categories:
Agglomerative: This is a "bottom-up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
Divisive: This is a "top-down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

B. Density-based approach
A cluster is grown as long as the density in its neighborhood exceeds some threshold; that is, for each data point in a cluster, the neighborhood of a given radius has to contain at least a minimum number of points.

C. Model-based approach
In this approach, the data are viewed as generated by a mixture of probability distributions, in which each component represents a different cluster.

D. K-means clustering
The k-means algorithm (Lloyd, 1982) belongs to a family of algorithms known as optimization clustering algorithms. In this family, clusters are formed such that some criterion of cluster goodness is optimized; that is, the examples are partitioned into clusters that are optimal according to some measure. The name comes from the fact that k clusters are formed, where the centre of each cluster is the arithmetic mean of all vectors within that cluster.

III. The k-means algorithm
The k-means algorithm is as follows:
1. Select k seed examples as initial centers (randomly generated vectors can also be used).

2. Calculate the distance from each cluster centre to each example.
3. Assign each example to the nearest cluster.
4. Calculate new cluster centers, where each new centre is the mean of all vectors in that cluster.
5. Repeat steps 2-4 until a stopping condition is reached.

In the experiments reported here, the initial centers were vectors randomly selected from the dataset, and the stopping criterion was based on the movement of the cluster centers: when vectors no longer changed clusters between iterations (the clusters had stabilized), the algorithm terminated. The number of clusters was set equal to the number of self-organizing map (SOM) output map neurons that were evaluated. The disadvantage of k-means compared to SOM is that it does not perform vector quantization, that is, it does not naturally result in a form that can be easily visualized. The advantage of k-means over SOM is that it is more computationally efficient and can thus run much faster.

IV. Related Work
Data-intensive Peer-to-Peer (P2P) networks are finding an increasing number of applications, and data mining in such P2P environments is a natural extension. Extraction of meaningful information from large experimental data sets is also a key element in bioinformatics research. One of the challenges is to identify genomic markers in Hepatitis B Virus (HBV) that are associated with the development of HCC (liver cancer) by comparing the complete genomic sequences of HBV among patients with HCC and those without HCC. In that study, a data mining framework is introduced that includes molecular evolution analysis, clustering, feature selection, classifier learning, and classification. The research group collected HBV DNA sequences, either genotype B or C, from over 200 patients specifically for the project. In the molecular evolution analysis and clustering, three subgroups were identified in genotype C, and a clustering method was developed to separate the subgroups. [2]

V. Proposed Algorithm
The proposed algorithm is a partitioning clustering algorithm. It partitions the given data into k clusters, where the number of clusters is fixed. Let the set of data points (or instances) be $D = \{x_1, x_2, \dots, x_n\}$, where each $x_i = (x_{i1}, x_{i2}, \dots, x_{ir})$ is a vector in a real-valued space $X \subseteq \mathbb{R}^r$ and r is the number of attributes (dimensions) in the dataset. The algorithm aims to minimize the sum of squared Euclidean distances between the objects and their cluster centroids.

A. Proposed Algorithm Setup
1. Draw multiple divisions {D1, D2, ..., Dj} from the original dataset.
2. Repeat step 3 for n = 1 to j.
3. Apply the combined approach to the multiple divisions of the dataset.
4. Compute the centroid.
5. Choose the minimum of the minimum-distance-from-cluster-centre criteria.
6. Apply the new calculation again on dataset D for k1 clusters.
7. Combine the two nearest clusters into one cluster and recalculate the cluster centre of the combined cluster, until the number of clusters reduces to k.

B. Implementation steps
1. Select k points as the initial centroids.
2. Assign all objects to the closest centroid.
3. Recalculate the centroid of each cluster.
4. Repeat steps 2 and 3 until a termination criterion is met.
5. Pass the solution to the next stage.

The attributes of the algorithm in Weka are: number of clusters, max iterations, number of trials, distance normalization (such as variance), average computation (such as Forgy, MacQueen), and seed random generator (such as random, standard).

C. Proposed flowchart
The proposed flowchart is shown in Figure 2.
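The baseline loop of Section III and the merge step (step 7 of the proposed setup in Section V.A) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `kmeans`, `merge_to_k`, and `sse` are hypothetical, the seeds are supplied by the caller, an empty cluster simply keeps its old centre, and steps 1 to 5 of the setup (drawing divisions and the "combined approach") are left out because the text does not specify them.

```python
import itertools
import math


def mean(points):
    """Arithmetic mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(points) for column in zip(*points)]


def kmeans(data, seeds, max_iter=100):
    """Steps 1-5 of Section III; returns (centres, labels, iterations)."""
    centres = [list(s) for s in seeds]  # step 1: caller-supplied seed examples
    labels = None
    for iteration in range(1, max_iter + 1):
        # Steps 2-3: assign each example to its nearest centre.
        new_labels = [
            min(range(len(centres)), key=lambda c: math.dist(p, centres[c]))
            for p in data
        ]
        if new_labels == labels:  # stop: no example changed cluster
            return centres, labels, iteration
        labels = new_labels
        # Step 4: each new centre is the mean of the vectors in its cluster.
        for c in range(len(centres)):
            members = [p for p, lab in zip(data, labels) if lab == c]
            if members:  # assumption: an empty cluster keeps its old centre
                centres[c] = mean(members)
    return centres, labels, max_iter


def merge_to_k(data, labels, k):
    """Merge the two clusters with the nearest centres until k clusters remain."""
    groups = {}
    for p, lab in zip(data, labels):
        groups.setdefault(lab, []).append(p)
    centres = {lab: mean(members) for lab, members in groups.items()}
    while len(groups) > k:
        a, b = min(
            itertools.combinations(sorted(groups), 2),
            key=lambda pair: math.dist(centres[pair[0]], centres[pair[1]]),
        )
        groups[a].extend(groups.pop(b))  # combine cluster b into cluster a
        del centres[b]
        centres[a] = mean(groups[a])     # recalculate the combined centre
    order = sorted(groups)
    return [centres[g] for g in order], [groups[g] for g in order]


def sse(centres, groups):
    """Sum of squared Euclidean distances of points to their cluster centre."""
    return sum(math.dist(p, c) ** 2 for c, g in zip(centres, groups) for p in g)
```

Under these assumptions, one would run `kmeans` with k1 seeds (for example, drawn with `random.sample(data, k1)`), pass the resulting labels to `merge_to_k` with the target k, and compare `sse` and the iteration count against a plain k-means run with k seeds.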


Figure 2: Proposed flowchart

VI. Results and Discussions
This section presents the results obtained from the experiments carried out on the proposed algorithm. To evaluate its performance, we tested it on a traffic safety dataset, which consists of 13 attributes and 4999 instances.

Figure 3: All instances according to the class attribute Accident Location
Figure 4: Graphical representation of all attributes
Figure 5: Proposed algorithm in NetBeans

Results of K-Means and the proposed algorithm for number of iterations
Figure 6: Graphical representation of the comparison of K-Means and the proposed algorithm results

Results of K-Means and the proposed algorithm for number of clusters: K-Means 4, proposed algorithm 4.

Figure 7: Graphical representation of the number of clusters in K-Means and the proposed algorithm
Figure 8: Graphical representation comparing K-Means and the proposed algorithm
Figure 9: Graphical representation of the comparison of all attributes together for K-Means and the proposed algorithm

Comparison of K-Means and the proposed algorithm, number of iterations: K-Means 16, proposed algorithm 4.

VII. Conclusion
In this paper, partitioning and hierarchical clustering algorithms are studied. The features of the K-Means clustering algorithm are enhanced and a new algorithm (Enhanced K-Means Clustering Algorithm) is proposed. The proposed algorithm is compared with the existing K-Means algorithm on a traffic safety dataset using the WEKA data mining tool. The results obtained by varying the number of clusters indicate that the proposed method performs better than K-Means clustering: it reduces the sum of squared errors and the number of iterations, which signifies that the Enhanced K-Means Clustering Algorithm has high intra-cluster similarity and is more accurate. The proposed algorithm can also handle large datasets more effectively.

VIII. Future scope
Further enhancements would be to implement the proposed algorithm in another data mining tool with an increased number of clusters and order value, and with other distance measures; to combine the features of the BIRCH clustering algorithm with other partitioning clustering algorithms; and to use other tree algorithms such as AVL Tree, B+ Tree, AD Tree, etc.

REFERENCES
[1] Shi Na, Liu Xumin, Guan Yong, Research on k-means Clustering Algorithm: An Improved k-means Clustering Algorithm, 2010 IEEE Third International Symposium on Intelligent Information Technology and Security Informatics.
[2] Shuhua Ren, Alin Fan, K-means Clustering Algorithm Based on Coefficient of Variation, 2011 IEEE 4th International Congress on Image and Signal Processing.
[3] Saurabh Shah, Manmohan Singh, Comparison of A Time Efficient Modified K-mean Algorithm with K-Mean and K-Medoid Algorithm, 2012 IEEE International Conference on Communication Systems and Network Technologies.
[4] Shalove Agarwal, Shashank Yadav, Kanchan Singh, K-means versus K-means++ Clustering Technique, 2012 IEEE Second International Workshop on Education Technology and Computer Science.
[5] Y. Ramamohan, K. Vasantharao, C. Kalyana Chakravarti, A.S.K. Ratnam, A Study of Data Mining Tools in Knowledge Discovery Process, International Journal of Soft Computing and Engineering (IJSCE).
[6] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Beijing: China Machine Press, Third Edition (2012).
[7] Ji Dan, Qiu Jianlin, Gu Xiang, Chen Li, He Peng, A Synthesized Data Mining Algorithm Based on Clustering and Decision Tree, 2010 IEEE International Conference on Computer and Information Technology.
[8] R. Joshi, A. Patidar, S. Mishra, Scaling k-medoid Algorithm for Clustering Large Categorical Dataset and Its Performance Analysis, 2011 IEEE Electronics Computer Technology.
[9] V.S. Jagadeeswaran, P. Uma, Hierarchical Birch Algorithm for Large Datasets, 2013 International Journal of Advanced Research in Computer and Communication Engineering.
[10] Suwimon Vongsingthong, Nawaporn Wisitpongphan, Classification of University Students Behaviors in Sharing Information on Facebook, 2014 IEEE International Joint Conference on Computer Science and Software Engineering.
[11] Nidal Ismael, Mahmoud Alzaalan, Wesam Ashour, Improved Multi Threshold Birch Clustering Algorithm, 2014 International Journal of Artificial Intelligence and Applications for Smart Devices.
[12] K. Kameshwaran, K. Malarvizhi, Survey on Clustering Techniques in Data Mining, 2014 International Journal of Computer Science and Information Technologies.
[13] T. Zhang, R. Ramakrishnan, M. Livny, BIRCH: An Efficient Data Clustering Method for Very Large Databases, (1996) ACM SIGMOD International Conference on Management of Data.

A Review of K-mean Algorithm

A Review of K-mean Algorithm Jyoti Yadav #1, Monika Sharma *2 1 PG Student, CSE Department, M.D.U Rohtak, Haryana, India 2 Assistant Professor, IT Department, M.D.U Rohtak, Haryana, India Abstract Cluster