Normalization based K means Clustering Algorithm

Deepali Virmani 1, Shweta Taneja 2, Geetika Malhotra 3
1 Department of Computer Science, Bhagwan Parshuram Institute of Technology, New Delhi. Email: deepalivirmani@gmail.com
2 Department of Computer Science, Bhagwan Parshuram Institute of Technology, New Delhi. Email: shweta_taneja08@yahoo.co.in
3 Department of Computer Science, Bhagwan Parshuram Institute of Technology, New Delhi. Email: geets002@gmail.com

Abstract- K-means is an effective clustering technique used to separate similar data into groups based on the initial centroids of the clusters. In this paper, a Normalization based K-means clustering algorithm (N-K means) is proposed. The proposed N-K means algorithm applies normalization to the available data prior to clustering and calculates the initial centroids based on weights. Experimental results show that the proposed N-K means clustering algorithm improves on the existing K-means clustering algorithm in terms of complexity and overall performance.

Keywords- Clustering, Data mining, K-means, Normalization, Weighted Average

I. INTRODUCTION

Data mining [7][11], or knowledge discovery, is the process of analysing large amounts of data and extracting useful information. It is an important technology used by industries as a novel approach to mine data. Data mining tools and techniques generate effective results that were earlier difficult and time consuming to obtain. Data mining is widely used in areas such as financial data analysis, the retail and telecommunication industries, biological data analysis, fraud detection, spatial data analysis and other scientific applications.

Clustering is a data mining technique in which similar objects are grouped into clusters. Clustering techniques are widely used in domains such as information retrieval and image processing [1][2]. There are two types of clustering approaches: hierarchical and partitioning. In hierarchical clustering, clusters are combined based on their proximity, and the merging stops when further combination would lead to undesirable clusters. In partitioning clustering, the dataset is separated into a definite number of small sets in a single iteration [10]. The accuracy and quality of the clustering results depend on how the algorithms are implemented and on their ability to find hidden knowledge. Various clustering algorithms exist, differing in the nature of the generated clusters and the techniques used; examples include BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), CURE (Clustering Using Representatives), K-means, genetic K-means, CLARA, DBSCAN and CLARANS [6].

The most widely used clustering algorithm is K-means, which appears in many practical applications. It works by selecting the initial number of clusters and the initial centroids [7][13]. We have chosen the K-means algorithm over other clustering algorithms because it is very efficient in processing large data sets. It often terminates at a local optimum and generates tighter clusters than hierarchical clustering, especially when the clusters are globular. It is popular because of its speed and simplicity. However, K-means has a major disadvantage: it does not work well with clusters of different sizes and densities. Moreover, the initial centroids are chosen randomly, so the clusters produced vary from one run to another. There also exist data sets on which K-means takes superpolynomial time [8][5].
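For concreteness, the baseline K-means behaviour described above (random seeding, run-to-run variation) can be reproduced with a few lines of scikit-learn. This sketch is illustrative only; the library, the Iris data (also used in Section IV) and the parameter choices are ours, not part of the original paper.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Load the Iris data set, the same benchmark used later in Section IV.
X = load_iris().data

# Plain K-means with random seeding: the initial centroids are drawn at random,
# so the resulting clusters can vary from one run to the next.
km = KMeans(n_clusters=3, init="random", n_init=1).fit(X)
print(km.cluster_centers_)
```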
Different researchers have put forward various methods to improve the efficiency and running time of the K-means algorithm. K-means uses Euclidean distance to calculate the centroids of the clusters. This method is less effective when new data sets are added, as they have no effect on the measured distance between the various data objects. The computational complexity of the k-means algorithm is also very high [1][9]. In addition, K-means is unable to handle noisy data and missing values. Data preprocessing techniques are therefore often applied to datasets to make them cleaner, more consistent and noise free. Normalization eliminates redundant data and ensures that good quality clusters are generated, which can improve the efficiency of the clustering algorithm; it is an essential step before clustering, since Euclidean distance is very sensitive to differences in scale [3].

This paper is organized as follows. Section II presents a literature survey covering the work done by various authors to improve the K-means clustering algorithm. Section III describes our proposed N-K means algorithm step by step. Section IV reports experimental results, comparing the traditional K-means clustering algorithm with the N-K means clustering algorithm. Finally, conclusions are drawn in Section V.

II. LITERATURE SURVEY

Many methods and techniques have been proposed over the past few years to improve the accuracy of the K-means algorithm, and there is a need to optimize it to obtain good results. This section discusses the approaches proposed by researchers to find better initial centroids for the k-means algorithm.

The authors of [3] proposed data preprocessing techniques such as cleaning and normalization to produce optimum quality clusters. In normalization, the data to be analyzed is scaled to a specific range. A modified k-means algorithm is proposed that provides a solution for automatic initialization of centroids, and performance is enhanced using normalization. This technique overcomes many drawbacks of the naive k-means algorithm.

Some authors have proposed methods to identify the initial centroids. In [1], the authors proposed a novel method to find better initial centroids as well as more accurate clusters with less computational time. The method computes a weighted average score of the dataset by averaging the attribute values of each data point to generate the initial centroids.

In another work, the authors of [8] proposed a new k-means clustering method with an improved initial centre. In this method, the initial cluster centres are selected and then used as input to k-means; the user is not required to give the number of clusters as input.

In the research of [4], a data clustering approach is proposed that partitions the space into segments and calculates the frequency of data points in each segment; the segment with the maximum frequency of data points has the highest chance of containing the centroid of a cluster. The authors also introduced the concept of a threshold distance for each cluster's centroid, used when comparing the distance between a data point and the cluster's centroid, which minimizes the effort needed to calculate these distances. This algorithm effectively decreases the complexity and makes the calculations easier.

III. PROPOSED N-K MEANS ALGORITHM

The K-means algorithm can generate better results after the database has been modified.
We apply the modified algorithm with initial centroids calculated from the weighted average score of the dataset, after preprocessing and normalizing the dataset. The proposed method works in three stages. During the first stage, data preprocessing transforms the raw data into an understandable format. During the second stage, normalization standardizes the data objects into a specific range. During the third stage, the N-K means algorithm is applied to generate the clusters.

3.1 DATA PRE-PROCESSING

This is a very important step and should be adopted before clustering, as it uses measures such as a constant, the average, the minimum, the maximum or the standard deviation to fill in missing values in the tuples [7]. These missing values must be handled to obtain accurate results. Preprocessing involves steps such as data cleaning, data integration, data transformation, data reduction and data discretization.

3.2 NORMALIZATION

Data mining can generate effective results if normalization is applied to the dataset. Normalization is a process used to standardize all the attributes of the dataset and give them equal weight, so that redundant or noisy objects are eliminated and the remaining data is valid and reliable, which enhances the accuracy of the result. The K-means algorithm uses Euclidean distance, which is highly sensitive to differences in the scale of the various features [3]. There are several data normalization methods, such as Min-Max, Z-Score and Decimal Scaling; the best method depends on the data to be normalized. Here we use Min-Max normalization because our dataset is limited and does not have much variability between its minimum and maximum values. Min-Max normalization performs a linear transformation on the data, fitting it into a predefined boundary or interval.

3.3 INITIAL CENTROIDS CALCULATION

We use a uniform method to compute a score for each data point by taking the weighted average of its attributes, which generates initial centroids that follow the data distribution of the given set. A sorting algorithm is applied to the scores, and the sorted data points are divided into k subsets, where k is the number of clusters. Finally, for each subset, the data point nearest to the subset mean is taken as an initial centroid. In this method we introduce a weight for each attribute, which is advantageous because any feature of the dataset can be emphasized by increasing the weight associated with that attribute. The algorithm is given in Figure 1.

ALGORITHM 1: Steps of the N-K means algorithm
INPUT: A dataset with d dimensions
OUTPUT: Clusters
1. Load the initial data set.
2. Find the maximum and minimum values of each feature in the dataset.
3. Normalize the real scalar values of the dataset using the equation

   v' = (v - min(E)) / (max(E) - min(E))   (1)

   where min(E) and max(E) are the minimum and maximum values of attribute E.
4. Pass the number of clusters and generate the initial centroids using Algorithm 2.
5. Generate the clusters.

Figure 1: Steps of the N-K means algorithm

ALGORITHM 2: Initialization of centroids
1. Calculate the average score of each data point:
   1) di = (x1, x2, x3, ..., xm)
   2) di(avg) = (w1*x1 + w2*x2 + w3*x3 + ... + wm*xm) / m
   where x is an attribute value, m is the number of attributes, and w is a weight applied to ensure a fair distribution of the clusters.
2. Sort the data points based on their average scores.
3. Divide the sorted data into k subsets.
4. Calculate the mean value of each subset.
5. Take the data point nearest to the mean as the initial centroid of each subset.

Figure 2: Algorithm to calculate the initial centroids
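As an illustration only, the following Python/NumPy sketch implements Equation (1) and Algorithm 2 as described above. The function names are ours, the attribute weights default to wi = 1, and constant attributes are guarded against division by zero; none of these details are specified in the paper.

```python
import numpy as np

def min_max_normalize(X):
    """Min-Max normalization (Equation 1): scale each attribute E to [0, 1]."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    # Guard against constant attributes where max(E) == min(E) (our assumption).
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    return (X - col_min) / span

def initial_centroids(X, k, weights=None):
    """Algorithm 2: pick k initial centroids from weighted average scores (assumes k <= n)."""
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    w = np.ones(m) if weights is None else np.asarray(weights, dtype=float)
    # Step 1: weighted average score of each data point, di(avg).
    scores = (X * w).sum(axis=1) / m
    # Step 2: sort the data points by score.
    order = np.argsort(scores)
    # Step 3: divide the sorted points into k subsets.
    subsets = np.array_split(order, k)
    centroids = []
    for subset in subsets:
        # Steps 4-5: take the point whose score is nearest to the subset mean.
        subset_scores = scores[subset]
        nearest = subset[np.argmin(np.abs(subset_scores - subset_scores.mean()))]
        centroids.append(X[nearest])
    return np.vstack(centroids)
```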

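One possible way to complete step 5 of Algorithm 1 (generating the clusters from the computed seeds) is to pass the centroids from the sketch above to an off-the-shelf K-means implementation. The scikit-learn call below, reusing the min_max_normalize and initial_centroids helpers, is our assumption and not the authors' implementation.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data                     # the Iris data set used in Section IV
Xn = min_max_normalize(X)                # stages 1-2: preprocess and normalize
seeds = initial_centroids(Xn, k=3)       # stage 3: weighted-average seeding (Algorithm 2)
# Step 5 of Algorithm 1: generate the clusters, seeding K-means with the computed centroids.
labels = KMeans(n_clusters=3, init=seeds, n_init=1).fit_predict(Xn)
print(labels[:10])
```

Using n_init=1 keeps the run deterministic with respect to the supplied seeds, which matches the idea that fixed initial centroids remove the run-to-run variation of randomly seeded K-means.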
IV. EXPERIMENTAL RESULTS

Our experiment was conducted on the Iris data set [12] from the UCI Machine Learning Repository to evaluate the performance of the N-K means clustering algorithm. In this section we present a comparative analysis of the traditional K-means clustering algorithm and the N-K means algorithm. Both algorithms were run for different values of k. The comparison shows that the N-K means algorithm outperforms the traditional K-means algorithm in terms of execution time and speed: it runs faster because it executes in fewer iterations and its complexity is reduced. The results are shown in Table 1.

Table 1: Performance comparison of the N-K means algorithm with the existing K-means algorithm

Value of k   Algorithm    Time taken   Speed (ms)
1            K-means      0.078        5.1
1            N-K means    0.065        3.5
3            K-means      0.094        6.2
3            N-K means    0.081        4.7
5            K-means      0.125        6.6
5            N-K means    0.103        5.0
7            K-means      0.134        7.2
7            N-K means    0.117        5.7

Figure 3: Comparison between K-means and N-K means on the basis of time (time vs. number of clusters for k = 1, 3, 5, 7)

Figure 4: Comparison between K-means and N-K means on the basis of speed (speed vs. number of clusters for k = 1, 3, 5, 7)

V. CONCLUSION

The K-means clustering algorithm is widely used for clustering huge data sets, but the traditional k-means algorithm does not always generate good quality results, as the random initialization of the centroids affects the final clusters. This paper presents an efficient algorithm in which the dataset is first preprocessed using a normalization technique and effective clusters are then generated. This is done by assigning a weight to each attribute value to achieve standardization. Our algorithm has proved to be better than the traditional K-means algorithm in terms of execution time and speed.

REFERENCES

[1] Md Sohrab Mahmud, Md. Mostafizer Rahman, Md. Nasim Akhtar, "Improvement of K-means Clustering Algorithm with Better Initial Centroids Based on Weighted Average", 7th International Conference on Electrical and Computer Engineering, 2012, pp. 647-650.
[2] Madhu Yedla, Srinivasa Rao Pathakota, T. M. Srinivasa, "Enhancing K-means Clustering Algorithm with Improved Initial Center", International Journal of Computer Science and Information Technologies (IJCSIT), Vol. 1(2), 2010, pp. 121-125.

[3] Vaishali Rajeev Patel, Rupa G. Mehta, "Performance Analysis of MK-means Clustering Algorithm with Normalization Approach", World Congress on Information and Communication Technologies, 2011, pp. 974-979.
[4] Ran Vijay Singh, M. P. S. Bhatia, "Data Clustering With Modified K-means Algorithm", IEEE International Conference on Recent Trends in Information Technology (ICRTIT), 2011, pp. 717-721.
[5] David Arthur, Sergei Vassilvitskii, "How Slow is the k-means Method?", Proceedings of the 22nd Symposium on Computational Geometry (SoCG), 2006, pp. 144-153.
[6] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, San Diego, 2001.
[7] Vaishali R. Patel, Rupa G. Mehta, "Impact of Outlier Removal and Normalization Approach in Modified k-means Clustering Algorithm", IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 5, No. 2, 2011, pp. 331-336.
[8] Zhang Chen, Xia Shixiong, "K-means Clustering Algorithm with Improved Initial Center", Second International Workshop on Knowledge Discovery and Data Mining, 2009, pp. 790-792.
[9] K. A. Abdul Nazeer, M. P. Sebastian, "Improving the Accuracy and Efficiency of the k-means Clustering Algorithm", Proceedings of the World Congress on Engineering 2009, Vol. I, WCE 2009, pp. 308-312.
[10] Margaret H. Dunham, Data Mining: Introductory and Advanced Concepts, Pearson Education, 2006.
[11] R. Agrawal, T. Imielinski, A. Swami, "Mining association rules between sets of items in large databases", Proceedings of the ACM SIGMOD Conference, Washington DC, USA, 1993, pp. 207-216.
[12] UCI Repository of Machine Learning Databases. Available: archive.ics.uci.edu/ml/
[13] J. MacQueen, "Some methods for classification and analysis of multivariate observations", Proc. 5th Berkeley Symp. on Math. Statist. and Prob., Vol. 1, 1967, pp. 281-297.