International Journal of Computer Engineering and Applications, Volume VIII, Issue III, Part I, December 14

DESIGN OF AN EFFICIENT DATA ANALYSIS CLUSTERING ALGORITHM

Dr. Dilbag Singh 1, Ms. Priyanka 2
1 Associate Professor, Dept. of Computer Science & Applications, Chaudhary Devi Lal University, Sirsa
2 Research Scholar, Dept. of Computer Science & Applications, Chaudhary Devi Lal University, Sirsa

ABSTRACT: Clustering is the process of partitioning a set of data into meaningful sub-classes. It helps in understanding the natural grouping or structure in a dataset. The main problem is the time a clustering algorithm takes to form clusters; the major problems with the K-Mean algorithm are its time complexity and its sensitivity to outliers. The present study designs a new algorithm intended to remove the outlier problem and reduce the time complexity of the K-Mean algorithm. A data set from an insurance company is clustered with both the K-Mean algorithm and the proposed algorithm. The total computation times of both algorithms on this data are measured, and the K-Mean clustering algorithm is compared with the proposed algorithm on that basis.

Keywords: Clustering, Data Mining, K-Mean clustering, Outliers, Time Complexity.

[1] INTRODUCTION

Data mining technology facilitates the extraction of meaningful patterns from large databases. Extracting meaningful patterns from large volumes of textual information is a challenging job. Data mining has created new opportunities for exploiting the knowledge held in databases [7]. It is generally used by companies with a strong consumer focus, such as retail, financial, communication, and marketing organizations. It enables companies to determine relationships among internal factors.
It provides data access to business analysts and information technology professionals. It is used to analyze the data processed by application software [3]. Data mining is a part of knowledge discovery in databases. It forms a hierarchy of mining tasks that includes text mining as well as web mining [9]. Data is gathered, reviewed, and analyzed to form a finding or conclusion. Data analysis has multiple facets and approaches; data mining is a particular data analysis technique that focuses on modeling and knowledge discovery for predictive rather than purely descriptive purposes [10]. Cluster analysis is used in data recovery, text and web mining, pattern recognition, image segmentation and software reverse engineering. It helps users to understand the natural grouping or structure in a dataset. A good clustering method produces high-quality clusters in which the intra-class similarity is high and the inter-class similarity is low. The quality of clustering results depends on both the similarity measure used by the method and its implementation. The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns [5]. K-Mean clustering is a very popular algorithm for finding the clusters in a dataset through iterative calculations. It has the advantage of simple implementation and of finding at least a locally optimal clustering.

[2] RELATED WORK

Many approaches have been proposed in the field of clustering algorithms; some of them are discussed below.

Agathe (2004) proposed a clustering approach for students to help in the evaluation of the learning process. The work shows how clustering techniques can be applied to student answers generated from a web-based tutoring tool. In particular, it extracts clusters of students based on the mistakes they made using the tool, with the aim of obtaining pedagogically relevant information and providing this feedback to the teacher [8].

Ester and Kriegel (1998) described how clustering is done in large spatial databases. Clustering has been recognized as a primary data mining method for knowledge discovery in spatial databases. The well-known clustering algorithms, however, have some drawbacks when applied to large spatial databases. First, they assume that all objects to be clustered reside in main memory. Second, these methods are too inefficient when applied to large databases. To overcome these limitations, new algorithms have been developed, which the authors survey. These algorithms make use of efficient query processing techniques provided by spatial database systems [10].

Nagwani et al. (2010) explained a clustering-based URL normalization technique for web mining. URL normalization is an important activity in web mining, and it also saves many calculations in web mining activities. The proposed technique is based on content, structure and semantic similarity, and on web page redirection and forwarding similarity, of the given set of URLs. Web page redirection and forward graphs can be used to measure the similarities between URLs and to form URL clusters, which can in turn be used for URL normalization. A data structure is also suggested to store the forward and redirect URL information [9].

Singh and Kaur (2013) proposed a modified k-means algorithm that reduces the value of the objective function for categorical data clustering. If the user observes the stability of the algorithm in terms of the minimum and converged values of the objective function, these values are equal or almost equal. Results show a significant reduction in the objective function value from its maximum to the local minimum or converged value for each algorithm, with values decreasing in sequence across Hard C-Mean (HCM), Fuzzy C-Mean (FCM), Rough C-Mean (RCM) and Rough Fuzzy Possibilistic C-Mean (RFPCM). In that work, RFPCM for categorical data performs better than the other c-mean variants and gives improved results over the other variations of the k-means algorithm [5].

[3] PROPOSED ALGORITHM FOR DATA CLUSTERING

The proposed algorithm is given below:
1. Initialize the data points (n) and the number of clusters (K).
2. Check the cluster value (K).
3. If K = 1, then exit; else continue.
4. Calculate Min(Data_Points) and Max(Data_Points).
5. Calculate the group area range A_G = (Max(Data_Points) - Min(Data_Points)) / K.
6. Divide Data_Points into K groups of width A_G.
7. Calculate the frequency of Data_Points in each partition.
8. Select the group with the highest frequency of Data_Points.
9. Calculate the mean of the Data_Points in the group.
10. Initialize V = 1.
11. Find the closest pair of Data_Points in the collection, form the set S_V (1 <= V <= K) from them, and merge.
12. Find the Data_Point closest to the set S_V, add it to S_V, and merge.
13. Repeat step 12 until the number of Data_Points in S_V, L, satisfies 0.6 * (n/K) < L < 0.9 * (n/K).
14. If V < K, then increment V and search for another closest pair of Data_Points.
15. Form the Data_Points set S_V, merge, and go to step 12.
16. For each data point dist_i (1 <= i <= n), calculate the distance d(dist_i, C_j) to every centroid C_j (1 <= j <= K).
17. Find the closest centroid C_j and assign the data point to that cluster.
18. Set ClusterNum[i] = j (the nearest cluster number) and store d(dist_i, C_j) as the nearest distance.
19. Recalculate the centroid of each cluster j. Repeat steps 20 to 23:
20. For each data point dist_i:
21. Compute its distance from the centroid of its present closest cluster.
22. If this distance <= nearest distance, the data point stays in its cluster and does not move.
23. Else, compute the distance d(dist_i, C_j) to every centroid C_j. End loop.
24. Assign the data point dist_i to the cluster with the nearest centroid C_j.
25. Set ClusterNum[i] = j and store d(dist_i, C_j) as the nearest distance. End loop.
26. Recalculate the centroids and repeat until convergence.

Initially, the input values are read from a .csv file and the number of clusters is entered. If there is only one cluster, all data points belong to it, no calculation is needed, and execution exits immediately; otherwise the minimum and maximum of the data points are calculated to define the boundaries. The input data is partitioned according to the requested number of clusters, and the area range is calculated by the formula in step 5. The frequency in each group is calculated and the maximum-frequency group is selected so as to cover the largest share of the data points. The mean is calculated, and the internal variables are defined for the looping and condition-check processes. The closest pair is identified in the collection of data points and small internal groups are formed without any complex calculation, so as to avoid interference with other data points and to reduce processing time. After that, the centroids are used for assigning data points to clusters: the distance is calculated and each point is assigned to the particular group. A number is assigned to each cluster to identify the cluster and its number of data points for further prediction and analysis. The remaining points are covered by the same steps, and through iteration all points converge, so that the outlier problem does not occur.

[4] RESULTS

The proposed algorithm takes the number of clusters as input to start the process. After the computation, the clustering results are shown in a window that also shows the total computation time.

No. of clusters is 2.
The total number of clusters to be generated is two, as shown in Figure 2. The total number of records used in the clustering process is 20403. Cluster 0 contains 10192 records and cluster 1 contains 10211 records. With two clusters, the total computation time is 0.12 milliseconds. The output of the experiment also shows the records in a separate column for each cluster to which they belong.

Figure 2: Output of the proposed algorithm (number of clusters is two)

No. of clusters is 3.
The total number of clusters to be generated is three, as shown in Figure 3.
The total number of records used in the clustering process is 20403. Cluster 0 contains 6824 records, cluster 1 contains 6737 records, and cluster 2 contains 6824 records. With three clusters, the computation time is 0.27 milliseconds. The output window also shows the records in a separate column for each cluster to which they belong.

Figure 3: Output of the proposed algorithm (number of clusters is 3)

No. of clusters is 4.
The total number of clusters to be generated is four, as shown in Figure 4. The total number of records used in the clustering process is 20403. Cluster 0 contains 5107 records, cluster 1 contains 5122 records, cluster 2 contains 5087 records, and cluster 3 contains 5087 records. With four clusters, the total computation time is 0.49 milliseconds. The records are shown in a separate column for each cluster to which they belong.

Figure 4: Output of the proposed algorithm (number of clusters is 4)

No. of clusters is 5.
The total number of clusters to be generated is five, as shown in Figure 5. The total number of records used in the clustering process is 20403. Cluster 0 contains 4065 records, cluster 1 contains 4138 records, cluster 2 contains 4022 records, cluster 3 contains 4175 records, and cluster 4 contains 4003 records. With five clusters, the computation time is 0.59 milliseconds. The output of the algorithm shows the records in a separate column for each cluster to which they belong.

Figure 5: Output of the proposed algorithm (number of clusters is 5)

[5] COMPARISON TABLE OF K-MEAN CLUSTERING ALGORITHM AND PROPOSED ALGORITHM

No. of clusters    Computation time (ms) of K-Mean    Computation time (ms) of Proposed Algorithm
2                  0.67                               0.12
3                  0.97                               0.27
4                  0.98                               0.49
5                  2.74                               0.59

Table-1: Time complexity comparison of K-Mean and our proposed algorithm

Table 1 shows the computation time of the K-Mean algorithm and the proposed algorithm as the number of clusters varies. With 2 clusters, the total computation time is 0.67 milliseconds for K-Mean and 0.12 milliseconds for the proposed algorithm. With 3 clusters, it is 0.97 milliseconds for K-Mean and 0.27 milliseconds for the proposed algorithm. With 4 clusters, it is 0.98 milliseconds for K-Mean and 0.49 milliseconds for the proposed algorithm. With 5 clusters, it is 2.74 milliseconds for K-Mean and 0.59 milliseconds for the proposed algorithm. As Table 1 shows, the K-Mean algorithm takes more time than the proposed algorithm for every number of clusters tested.
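As a small worked illustration of Table 1, the ratio of K-Mean time to proposed-algorithm time can be computed for each number of clusters. The sketch below is not part of the original study; it only re-uses the reported timings, and the names kmean_ms, proposed_ms and speedups are hypothetical, chosen for illustration.

```python
# Computation times in milliseconds, copied from Table 1.
kmean_ms = {2: 0.67, 3: 0.97, 4: 0.98, 5: 2.74}
proposed_ms = {2: 0.12, 3: 0.27, 4: 0.49, 5: 0.59}


def speedups(baseline, candidate):
    """Return {k: baseline[k] / candidate[k]} for every k present in both."""
    return {k: baseline[k] / candidate[k] for k in sorted(baseline) if k in candidate}


# Ratio > 1 means the proposed algorithm was faster for that number of clusters.
ratios = speedups(kmean_ms, proposed_ms)
for k, r in ratios.items():
    print(f"K={k}: K-Mean {kmean_ms[k]:.2f} ms, "
          f"proposed {proposed_ms[k]:.2f} ms, ratio {r:.2f}x")
```

On the reported figures the ratio ranges from 2.0 (four clusters) to about 5.6 (two clusters). Because the source reports one sub-millisecond timing per setting, these ratios are indicative only.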

[6] CONCLUSION

The K-Mean clustering algorithm suffers from the problems of outliers and time complexity; the proposed algorithm reduces the time taken. The main parameter used to compare the K-Mean clustering algorithm and the proposed algorithm is the computation time as the number of clusters varies. With two clusters, the K-Mean algorithm takes 0.67 milliseconds and the proposed algorithm 0.12 milliseconds. With three clusters, the K-Mean algorithm takes 0.97 milliseconds and the proposed algorithm 0.27 milliseconds. With four clusters, the K-Mean algorithm takes 0.98 milliseconds and the proposed algorithm 0.49 milliseconds. With five clusters, the K-Mean algorithm takes 2.74 milliseconds and the proposed algorithm 0.59 milliseconds. The total time taken in the clustering process by the K-Mean algorithm is therefore more than that of the proposed algorithm. The results show that the proposed algorithm reduces the computation time in comparison with the K-Mean algorithm and is therefore more efficient on this data set. It provides the clustering result more quickly.

[7] ACKNOWLEDGMENTS

Work of this magnitude is only possible with a helping hand from many people. First and foremost, I wish to express my deepest gratitude to my supervisor Dr Dilbag Singh (Associate Professor, Dept. of Computer Science and Applications, Chaudhary Devi Lal University, Sirsa) for his untiring guidance. I want to express my deepest gratitude to my parents and friends, who have always supported me in whatever decision I have made. Finally, I am thankful to GOD for blessing me much more than I deserve.

REFERENCES

[1] An Introduction to Cluster Analysis for Data Mining, 2000. [Online]. Available: http://www.cs.umn.edu/~han/dmclass/cluster_survey_10_02_00.
[2] C.R. Kothari, Research Methodology Research Methods & Techniques, 2nd Edition.
[3] Clifton, C. and R. Steinheiser, "Data Mining on Text", Proceedings of the 22nd Annual IEEE International Computer Software and Applications Conference, COMPSAC98, pp. 630-635, 1998.
[4] Frigui H. and Krishnapuram R., Competitive Fuzzy Clustering, IEEE, 1996, pp. 225-228.
[5] G. Singh and N. Kaur, Hybrid Clustering Algorithm with Modified Enhanced K-Mean and Hierarchical Clustering, International Journal of Advanced Research in Computer Science and Software Engineering, 2013.
[6] Jaideep Srivastava, Robert Cooley, Mukund Deshpande, Pang-Ning Tan, Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data, SIGKDD
[7] Jiawei Han and Micheline Kamber, Data Mining Concepts and Techniques, 2nd ed., Morgan Kaufmann Publishers, San Francisco, 2006.
[8] Mercer Agathe introduced an idea for Clustering Students to Help Evaluate Learning.
[9] Naresh Kumar Nagwani, "Clustering Based URL Normalization Technique for Web Mining," ace, pp. 349-351, 2010 International Conference on Advances in Computer Engineering, 2010.
[10] Ester Martin, Kriegel Hans-Peter introduced the Idea of Clustering for Mining in Large Spatial Database.
[11] Swasti Singhal and Monika Jena, A Study on WEKA Tool for Data Preprocessing, Classification and Clustering, International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN: 2278-3075, Volume 2, Issue 6, May 2013.