Research and Improvement on K-means Algorithm Based on Large Data Set


www.ijecs.in International Journal Of Engineering And Computer Science ISSN: 2319-7242
Volume 6, Issue 7, July 2017, Page No. 22145-22150
Index Copernicus value (2015): 58.10  DOI: 10.18535/ijecs/v6i7.40

Dr. Gurpreet Singh (Professor & Head, CSE) and Er. Vanshita Sharma (M.Tech Scholar)
St. Soldier Inst. of Engg. & Tech., Jalandhar (Punjab)

Abstract
Highway safety is being compromised, and there are not sufficient safety measures by which traffic crashes can be examined before they occur. This paper proposes a technique for pre-processing accident factors. To address these pre-processing issues, a clustering technique is used: the existing k-means algorithm is improved, and the improved k-means algorithm is applied to a traffic dataset. The dataset was collected from the National Highway Authority through several assessments and surveys of the public and of National Highway Authority staff. The basic aim of the proposed work is to improve highway safety.

Keywords: K-means, Weka, data mining, centroid.

I. Introduction
Data mining is a multidisciplinary field, drawing on database technology, machine learning, statistics, pattern recognition, information retrieval, neural networks, knowledge-based systems, artificial intelligence, high-performance computing, and data visualization. Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner. The relationships and summaries derived through a data mining exercise are often referred to as models or patterns; examples include linear equations, rules, clusters, graphs, tree structures, and recurrent patterns in time series.
The term "data mining" refers to extracting or mining knowledge from large amounts of data and is, in that sense, something of a misnomer.

Clustering
Clustering is an unsupervised learning technique that separates data items into a number of groups such that items in the same cluster are more similar to each other, while items in different clusters tend to be dissimilar, according to some measure of similarity or proximity. Unlike supervised learning, where each training example carries a class label expressing its membership in a class, clustering assumes no prior information about the distribution of the objects; its task is both to discover the classes present in the data set and to assign objects to those classes in the best way.
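To make "similar within a cluster, dissimilar across clusters" concrete, here is a minimal illustration of one common proximity measure, Euclidean distance, on hypothetical 2-D points (the paper's traffic dataset is not reproduced here):

```python
import math

def euclidean(a, b):
    """Euclidean distance, a standard measure of proximity between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two hypothetical groups of 2-D points: items within a group sit close
# together, items across groups sit far apart.
group1 = [(1.0, 1.0), (1.2, 0.9)]
group2 = [(8.0, 8.0), (7.9, 8.3)]

within = euclidean(group1[0], group1[1])   # small: same cluster
between = euclidean(group1[0], group2[0])  # large: different clusters
assert within < between
```

A clustering algorithm's job is to find groupings for which the within-group distances are small relative to the between-group distances.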

II. Types of Clustering Algorithms

(Figure 1: Clustering)

Cluster analysis groups objects (observations, events) based on the information found in the data describing the objects or their relationships: similar objects belong to one group and dissimilar objects to other groups. The greater the similarity (or homogeneity) within a group, and the greater the difference (or heterogeneity) between groups, the better the clustering. Clustering is a tool for data analysis that solves classification problems. Its object is to distribute cases (people, objects, events, etc.) into groups so that the degree of association is strong between members of the same cluster and weak between members of different clusters; in this way each cluster describes, in terms of the data collected, the class to which its members belong. Clustering is also a discovery tool: it may reveal associations and structure in data that, though not previously evident, are sensible and useful once found. The results of cluster analysis may contribute to the definition of a formal classification scheme, such as a taxonomy of related animals, insects, or plants; suggest statistical models with which to describe populations; indicate rules for assigning new cases to classes for identification and diagnostic purposes; provide measures of definition, size, and change in what were previously only broad concepts; or find exemplars to represent classes.

A. Hierarchical approach
A hierarchical algorithm yields a dendrogram representing the nested grouping of patterns and the similarity levels at which groupings change. It seeks to build a hierarchy of clusters.
It falls into two categories:
Agglomerative: a "bottom up" approach in which each observation starts in its own cluster and pairs of clusters are merged as one moves up the hierarchy.
Divisive: a "top down" approach in which all observations start in one cluster and splits are performed recursively as one moves down the hierarchy.

B. Density-based approach
A cluster is grown as long as the density in its neighborhood exceeds some threshold, i.e. for each data point in the cluster, the neighborhood of a given radius has to contain a minimum number of points.

C. Model-based approach
In this approach, the data are viewed as coming from a mixture of probability distributions, each component of which represents a different cluster.

D. K-means clustering
The k-means algorithm (Lloyd, 1982) belongs to a family of algorithms known as optimization clustering algorithms, in which clusters are formed so that some criterion of cluster goodness is optimized; that is, the examples are partitioned into clusters that are optimal according to some measure. The name comes from the fact that k clusters are formed, where the centre of each cluster is the arithmetic mean of all vectors within that cluster.

III. The K-means Algorithm
The k-means algorithm is as follows:
1. Select k seed examples as initial centers (randomly generated vectors can also be used).

2. Calculate the distance from each cluster centre to each example.
3. Assign each example to the nearest cluster.
4. Calculate new cluster centres, where each new centre is the mean of all vectors in that cluster.
5. Repeat steps 2-4 until a stopping condition is reached.

In the experiments reported here, the initial centres were vectors randomly selected from the dataset, and the stopping criterion was based on the movement of the cluster centres: when vectors no longer changed clusters between iterations (i.e. the clusters had stabilized), the algorithm terminated. The number of clusters was set equal to the number of SOM output map neurons being evaluated. The disadvantage of k-means compared to SOM is that it does not perform vector quantization; that is, it does not naturally produce a form that can be easily visualized. The advantage of k-means over SOM is that it is more computationally efficient and can thus run much faster.

IV. Related Work
Data-intensive Peer-to-Peer (P2P) networks are finding an increasing number of applications, and data mining in such P2P environments is a natural extension. Extraction of meaningful information from large experimental data sets is a key element of bioinformatics research. One challenge is to identify genomic markers in Hepatitis B Virus (HBV) that are associated with the development of HCC (liver cancer) by comparing complete HBV genomic sequences among patients with and without HCC. In one such study, a data mining framework comprising molecular evolution analysis, clustering, feature selection, classifier learning, and classification was introduced: the research group collected HBV DNA sequences, of either genotype B or C, from over 200 patients specifically for the project, and in the molecular evolution analysis and clustering, three subgroups were identified in genotype C and a clustering method was developed to separate the subgroups [2].
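Returning to the algorithm of Section III, steps 1-5 can be sketched as a minimal Python illustration. This assumes Euclidean distance and toy 2-D points; it is not the Weka implementation used in the paper's experiments:

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(vectors):
    """Arithmetic mean of a set of vectors, used as the cluster centre."""
    return tuple(sum(col) / len(col) for col in zip(*vectors))

def kmeans(data, k, seed=0):
    """Steps 1-5 of Section III: seed centres, assign, re-centre, repeat
    until no example changes cluster (the stopping rule used in the paper)."""
    rng = random.Random(seed)
    centres = rng.sample(data, k)              # step 1: k seed examples
    assignment = None
    while True:
        # steps 2-3: assign each example to the nearest cluster centre
        new_assignment = [min(range(k), key=lambda c: dist(x, centres[c]))
                          for x in data]
        if new_assignment == assignment:       # clusters stabilized: terminate
            return centres, assignment
        assignment = new_assignment
        # step 4: each new centre is the mean of all vectors in that cluster
        for c in range(k):
            members = [x for x, a in zip(data, assignment) if a == c]
            if members:                        # keep the old centre if a cluster empties
                centres[c] = mean(members)

data = [(1, 1), (1.5, 2), (8, 8), (9, 9)]
centres, labels = kmeans(data, 2)
```

On this toy data the loop converges in a few iterations, placing the two low-valued points in one cluster and the two high-valued points in the other regardless of which seeds are drawn.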
V. Proposed Algorithm
It is a partitioning clustering algorithm: it partitions the given data into k clusters, where the number of clusters is fixed. Let the set of data points (or instances) be D = {x1, x2, ..., xn}, where each xi is a vector in a real-valued space X ⊆ R^r, and r is the number of attributes (dimensions) in the dataset. The algorithm minimizes the sum of squared Euclidean distances between objects and their cluster centroids.

A. Proposed Algorithm Setup
1. Draw multiple divisions {D1, D2, ..., Dj} from the original dataset.
2. Repeat step 3 for n = 1 to i.
3. Apply the combined approach to the multiple divisions of the dataset.
4. Compute the centroid.
5. Choose the minimum according to the minimum-distance-from-cluster-centre criterion.
6. Apply the new calculation again on dataset D for k1 clusters.
7. Combine the two nearest clusters into one cluster and recalculate the new cluster centre for the combined cluster, until the number of clusters is reduced to k.

B. Implementation steps
1. Select k points as the initial centroids.
2. Assign all objects to the closest centroid.
3. Recalculate the centroid of each cluster.
4. Repeat steps 2 and 3 until a termination criterion is met.
5. Pass the solution to the next stage.

The attributes in Weka are: number of clusters; max iterations; number of trials; distance normalization (variance); average computation (Forgy, McQueen); and seed random generator (Random, Standard).

C. Proposed flowchart (Figure 2)
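The distinctive merging stage of the proposed setup (steps 6-7 above) can be sketched as follows. This is a hedged illustration, not the authors' Weka/NetBeans implementation: it starts from a hypothetical intermediate partition with k1 > k clusters and repeatedly combines the two clusters with the nearest centres, recomputing the combined centre, until k clusters remain:

```python
import math

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def merge_to_k(clusters, k):
    """Step 7 of the proposed setup: repeatedly combine the two clusters
    whose centres are nearest, recomputing the combined centre, until only
    k clusters remain. `clusters` is a list of lists of points."""
    def centre(c):
        return tuple(sum(col) / len(col) for col in zip(*c))
    while len(clusters) > k:
        # find the pair of clusters with the nearest centres
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(centre(clusters[ij[0]]), centre(clusters[ij[1]])),
        )
        clusters[i] = clusters[i] + clusters[j]  # combine the two nearest clusters
        del clusters[j]                          # their centre is recomputed on demand
    return clusters

# Hypothetical intermediate result of clustering with k1 = 4 clusters:
intermediate = [[(0, 0), (1, 0)], [(2, 0)], [(10, 10)], [(11, 11)]]
final = merge_to_k(intermediate, 2)  # reduce to the target k = 2 clusters
```

Here the two right-hand singletons merge first (their centres are closest), then the two left-hand clusters, leaving the intended two groups.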

(Figure 2: Proposed Flowchart)

VI. Results and Discussion
This section presents the results obtained after carrying out experiments with the developed algorithm. To evaluate its performance, the algorithm was tested on a traffic safety dataset consisting of 13 attributes and 4999 instances.

(Figure 3: All instances according to the class attribute Accident Location)
(Figure 4: Graphical representation of all attributes)
(Figure 5: The proposed algorithm in NetBeans)

Results of K-Means and the proposed algorithm for number of iterations: K-Means required 16 iterations, the proposed algorithm 4.

(Figure 6: Graphical comparison of the iteration counts of K-Means and the proposed algorithm)

Results of K-Means and the proposed algorithm for number of clusters: 4 for K-Means and 4 for the proposed algorithm.
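The error criterion underlying these comparisons is the sum of squared errors (the Conclusion reports it being reduced). A minimal sketch of how SSE is computed for a clustering, on hypothetical points rather than the traffic dataset:

```python
def sse(clusters_with_centres):
    """Sum of squared Euclidean distances from each point to its cluster centre."""
    total = 0.0
    for centre, points in clusters_with_centres:
        for p in points:
            total += sum((x - c) ** 2 for x, c in zip(p, centre))
    return total

# Hypothetical clustering of 2-D points (not the traffic dataset):
clustering = [
    ((1.0, 1.0), [(1, 1), (1, 2)]),
    ((8.0, 8.0), [(8, 8), (9, 8)]),
]
print(sse(clustering))  # 2.0: one unit of squared error from each off-centre point
```

A lower SSE indicates tighter, more homogeneous clusters, which is the sense in which the proposed algorithm's 2600 improves on K-Means' 4444.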

(Figure 7: Graphical comparison of the number of clusters in K-Means and the proposed algorithm)

Results of K-Means and the proposed algorithm for sum of squared error: 4444 for K-Means and 2600 for the proposed algorithm.

(Figure 8: Graphical comparison of the sum of squared error in K-Means and the proposed algorithm)
(Figure 9: Graphical comparison of all attributes together for K-Means and the proposed algorithm)

Comparison of K-Means and the proposed algorithm:
No. of iterations: K-Means 16, proposed algorithm 4
No. of clusters: K-Means 4, proposed algorithm 4
Sum of squared error: K-Means 4444, proposed algorithm 2600

VII. Conclusion
In this paper, partitioning clustering algorithms and hierarchical clustering algorithms were studied. The features of the K-Means clustering algorithm were enhanced and a new algorithm, the Enhanced K-Means Clustering Algorithm, was proposed. The proposed algorithm was compared with the existing K-Means algorithm on the traffic safety dataset using the WEKA data mining tool. The results obtained by varying the number of clusters show that the proposed method performs better than K-Means, reducing the sum-of-squared-error rate and the number of iterations; this indicates that the Enhanced K-Means Clustering Algorithm achieves higher intra-cluster similarity and is more accurate. The proposed algorithm can also handle large datasets more effectively.

VIII. Future Scope
Some further enhancements would be to implement the proposed algorithm in another data mining tool, with an increased number of clusters and order value and with other distance measures; to combine the features of the BIRCH clustering algorithm with those of other partitioning clustering algorithms; and to use other tree algorithms such as AVL Tree, B+ Tree, AD Tree, etc.

REFERENCES
[1] Shi Na, Liu Xumin, Guan Yong, "Research on k-means Clustering Algorithm: An Improved k-means Clustering Algorithm," 2010 IEEE Third International Symposium on Intelligent Information Technology and Security Informatics.
[2] Shuhua Ren, Alin Fan, "K-means Clustering Algorithm Based on Coefficient of Variation," 2011 IEEE 4th International Congress on Image and Signal Processing.
[3] Saurabh Shah, Manmohan Singh, "Comparison of a Time-Efficient Modified K-mean Algorithm with K-Mean and K-Medoid Algorithms," 2012 IEEE International Conference on Communication Systems and Network Technologies.
[4] Shalove Agarwal, Shashank Yadav, Kanchan Singh, "K-means versus K-means++ Clustering Technique," 2012 IEEE Second International Workshop on Education Technology and Computer Science.
[5] Y. Ramamohan, K. Vasantharao, C. Kalyana Chakravarti, A.S.K. Ratnam, "A Study of Data Mining Tools in Knowledge Discovery Process," International Journal of Soft Computing and Engineering (IJSCE).
[6] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Beijing: China Machine Press, Third Edition (2012).
[7] Ji Dan, Qiu Jianlin, Gu Xiang, Chen Li, He Peng, "A Synthesized Data Mining Algorithm Based on Clustering and Decision Tree," 2010 IEEE International Conference on Computer and Information Technology.
[8] R. Joshi, A. Patidar, S. Mishra, "Scaling k-medoid Algorithm for Clustering Large Categorical Dataset and Its Performance Analysis," 2011 IEEE Electronics Computer Technology.
[9] V.S. Jagadeeswaran, P. Uma, "Hierarchical Birch Algorithm for Large Datasets," 2013 International Journal of Advanced Research in Computer and Communication Engineering.
[10] Suwimon Vongsingthong, Nawaporn Wisitpongphan, "Classification of University Students' Behaviors in Sharing Information on Facebook," 2014 IEEE 11th International Joint Conference on Computer Science and Software Engineering.
[11] Nidal Ismael, Mahmoud Alzaalan, Wesam Ashour, "Improved Multi Threshold Birch Clustering Algorithm," 2014 International Journal of Artificial Intelligence and Applications for Smart Devices.
[12] K. Kameshwaran, K. Malarvizhi, "Survey on Clustering Techniques in Data Mining," 2014 International Journal of Computer Science and Information Technologies.
[13] T. Zhang, R. Ramakrishnan, M. Livny, "BIRCH: An Efficient Data Clustering Method for Very Large Databases," ACM SIGMOD International Conference on Management of Data (1996).