International Journal of Computer Engineering and Applications, Volume VIII, Issue III, Part I, December 14

DESIGN OF AN EFFICIENT DATA ANALYSIS CLUSTERING ALGORITHM

Dr. Dilbag Singh 1, Ms. Priyanka 2
1 Associate Professor, Dept. of Computer Science & Applications, Chaudhary Devi Lal University, Sirsa
2 Research Scholar, Dept. of Computer Science & Applications, Chaudhary Devi Lal University, Sirsa

ABSTRACT: Clustering is the process of partitioning a set of data into a set of meaningful sub-classes. It helps in understanding the natural grouping or structure in a dataset. The main problem is the time taken by a clustering algorithm to form clusters. The major problems with the K-Mean algorithm are time complexity and outliers. The present study designs a new algorithm in which the outlier problem and the time complexity of the K-Mean algorithm are overcome. The data set of an insurance company is used in this study and is run through both the K-Mean algorithm and the proposed algorithm. The total computation time of each algorithm on this data is recorded, and the two algorithms are compared on that basis.

Keywords: Clustering, Data Mining, K-Mean clustering, Outliers, Time Complexity.

[1] INTRODUCTION

Data mining technology facilitates the extraction of meaningful patterns from large databases. Extracting meaningful patterns from large volumes of textual information is quite a challenging job. Data mining has opened a whole new opportunity for exploiting the knowledge held in databases [7]. It is generally used by companies with a strong customer focus, such as retail, financial, communication, and marketing organizations. It enables companies to determine relationships among internal factors.
It provides data access to business analysts and information technology professionals. It is used to analyze the data processed by application software [3]. Data mining is a part of knowledge discovery in databases. It forms a hierarchy that includes text mining as well as web mining [9]. Data is gathered, reviewed, and analyzed to form a finding or conclusion. Data analysis has multiple facets and approaches; data mining is a particular data analysis technique that focuses on modeling and knowledge discovery for predictive rather than purely descriptive purposes [10]. Cluster analysis is used in data recovery, text and web mining, pattern recognition, image segmentation and software reverse engineering. It

helps users to understand the natural grouping or structure in a dataset. A good clustering method produces high-quality clusters in which the intra-class similarity is high and the inter-class similarity is low. The quality of clustering results depends on both the similarity measure used by the method and its implementation. The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns [5]. K-Mean clustering is a very popular algorithm for finding the clusters in a dataset through iterative calculations. It has the advantage of being simple to implement and of finding at least a locally optimal clustering.

[2] RELATED WORK

Many approaches have been proposed in the field of clustering algorithms; some of them are discussed below.

Agathe (2004) proposed a clustering approach for students to help in the evaluation of the learning process. The work shows how clustering techniques can be applied to student answers generated from a web-based tutoring tool. In particular, it extracts clusters of students based on the mistakes they made using the tool, with the aim of obtaining pedagogically relevant information and providing this feedback to the teacher [8].

Martin and Peter (1998) described how clustering is done in large spatial databases. Clustering has been recognized as a primary data mining method for knowledge discovery in spatial databases. The well-known clustering algorithms, however, have some drawbacks when applied to large spatial databases. First, they assume that all objects to be clustered reside in main memory. Second, these methods are too inefficient when applied to large databases. To overcome these limitations, new algorithms have been developed, which the authors survey. These algorithms make use of efficient query processing techniques provided by spatial database systems [10].

Nagwani et al.
(2010) explained the concept of a clustering-based URL normalization technique for web mining. URL normalization is an important activity in web mining, and it also saves a lot of computation in web mining activities. The proposed technique is based on the content, structure and semantic similarity, and on the web page redirection and forwarding similarity, of the given set of URLs. Web page redirection and forward graphs can be used to measure the similarities between the URLs and can also be used to form URL clusters. The URL clusters can then be used for URL normalization. A data structure is also suggested to store the forward and redirect URL information [9].

Singh and Kaur (2013) proposed a modified k-means algorithm that reduces the value of the objective function for categorical data clustering. If the user observes the stability of each algorithm in terms of the minimum and converged values of the objective function, these values are equal or almost equal. The results show a significant reduction in the objective function value from its maximum to the local minimum or converged value for each algorithm, with the values decreasing in sequence from Hard C-Mean (HCM), Fuzzy C-Mean (FCM) and Rough C-Mean (RCM) to Rough Fuzzy Possibilistic C-Mean (RFPCM). Here in the proposed work

RFPCM for categorical data performs better than the other c-mean variants. Among these algorithms, RFPCM gives improved results over the other variations of the k-means algorithm [5].

[3] PROPOSED ALGORITHM FOR DATA CLUSTERING

The proposed algorithm is given below:

1. Initialize the data points (n) and the number of clusters (K).
2. Check the cluster value (K).
3. If K = 1, then exit; else continue.
4. Calculate Min(Data_Points) and Max(Data_Points).
5. Calculate the group area range A_G = (Max(Data_Points) - Min(Data_Points)) / K.
6. Divide the Data_Points into K groups of width A_G.
7. Calculate the frequency of Data_Points in each partition.
8. Select the highest-frequency Data_Points group among the K groups.
9. Calculate the mean of the Data_Points in the group.
10. Initialize V = 1.
11. Find the closest pair of Data_Points in the collection of points, form the Data_Points set S_V (1 <= V <= K) from them, and merge.
12. Find the Data_Point closest to the set S_V, add it to S_V, and merge.
13. Repeat step 12 until the Data_Points in S_V are in the range 0.6 < L < 0.9 * (n/K).
14. If V < K, then V++ and search for another closest pair of Data_Points.
15. Form the Data_Points set S_V, merge, and go to step 12.
16. Calculate the distance d(dist_i, C_j) of each Data_Point dist_i, 1 <= i <= n, from each centroid C_j, 1 <= j <= K.
17. Find the closest centroid C_j and assign the Data_Point to its cluster based on the distance.
18. Set ClusterNum[i] = j and store d(dist_i, C_j) as the nearest distance; // nearest cluster number
19. Recalculate the centroid of each cluster j. Repeat steps 20 to 23.
20. For each Data_Point dist_i:
21. Compute its distance from the centroid of its present closest cluster.
22. If dist <= nearest distance, the Data_Point stays in its cluster and does not move.
23. Else, for each centroid C_j, compute the distance d(dist_i, C_j). End Loop.
24. Assign Data_Point dist_i to the cluster with the nearest centroid C_j.
25. Set ClusterNum[i] = j and store d(dist_i, C_j) as the nearest distance. End Loop.
26. Repeat until convergence, recalculating the centroids.

Initially, the input values are read from the .csv file and the number of clusters is given as input. If there is only one cluster, all data points belong to that cluster, no calculation is needed, and execution exits immediately; otherwise the minimum and maximum of the data points are calculated to define the boundaries. The input data is partitioned according to the

input number of clusters, and the area range is calculated by the given function. The frequency in each group is calculated, and the group with the maximum frequency is considered so as to cover the largest share of the data points. The mean is calculated, and the internal variables are defined for the looping and condition-check processes. The closest pair is identified in the collection of data points, and small internal groups are formed without any complex calculation, so as to avoid interference with other data points and to reduce the processing time. After that, the centroids are considered for assigning the data points to clusters: the distance is calculated and the points are then assigned to the particular group. A number is assigned to each cluster to identify the cluster and its number of data points for further prediction and analysis. The remaining points are covered by these same steps, and through the iteration process all points converge, so that the outlier problem does not occur.

[4] RESULTS

The proposed algorithm takes the number of clusters as input to start the process. After the computation, the clustering results are shown in a window that also shows the total computation time.

No. of clusters is 2.

Figure 2 Output of our proposed algorithm (number of clusters is two)

The total number of clusters to be generated is two, as shown in Figure 2. The total number of records used in the clustering process is 20403. The total number of records in cluster 0 is 10192, and in cluster 1 it is 10211. When the number of clusters is two, the total computation time is 0.12 milliseconds. The output of the experiment also shows the records in a separate column for each cluster to which they belong.

No. of clusters is 3.

The total number of clusters to be generated is three, as shown in Figure 3.
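The range-based initialization in steps 4 to 9 of the proposed algorithm in Section [3] can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes one-dimensional numeric data, it reads step 9 as taking the mean of every non-empty group, and the function name `range_partition_init` and the sample points are hypothetical. The closest-pair merging (steps 10 to 15) and the iterative reassignment (steps 16 to 26) are left out.

```python
from statistics import mean


def range_partition_init(points, k):
    """Split 1-D points into k equal-width groups.

    Returns (counts, means): the number of points in each group and the
    mean of each group (None for an empty group).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    lo, hi = min(points), max(points)          # step 4
    if k == 1 or lo == hi:
        # Step 3 (and the degenerate case of identical points):
        # everything belongs to a single group.
        return [len(points)], [mean(points)]
    width = (hi - lo) / k                      # step 5: group area range A_G
    groups = [[] for _ in range(k)]
    for p in points:
        # Step 6: index of the group of width A_G that holds p.
        # The maximum value is clamped into the last group.
        idx = min(int((p - lo) / width), k - 1)
        groups[idx].append(p)
    counts = [len(g) for g in groups]          # step 7: frequency per group
    means = [mean(g) if g else None for g in groups]   # step 9 (assumed reading)
    return counts, means


# Hypothetical sample data, not taken from the insurance data set.
points = [2, 3, 4, 5, 14, 15, 26, 27, 28, 29, 32]
counts, means = range_partition_init(points, 3)
densest = counts.index(max(counts))            # step 8: highest-frequency group
print(counts, means, densest)
```

Clamping the index is a design choice of this sketch: the maximum value sits exactly on the upper boundary of the last group, so without the clamp it would be assigned to a group that does not exist.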
The total number of records used in the clustering process is 20403. The total number of records in

cluster 0 is 6824. The total number of records in cluster 1 is 6737, and in cluster 2 it is 6824. For three clusters, the computation time is 0.27 milliseconds. The output window also shows the records in a separate column for each cluster to which they belong.

Figure 3 Output of proposed algorithm (number of clusters is 3)

No. of clusters is 4.

The total number of clusters to be generated is four, as shown in Figure 4. The total number of records used in the clustering process is 20403. The total number of records in cluster 0 is 5107, in cluster 1 it is 5122, in cluster 2 it is 5087, and in cluster 3 it is 5087. When the number of clusters is four, the total computation time is 0.49 milliseconds. The records are shown in a separate column for each cluster to which they belong.

Figure 4 Output of proposed algorithm (number of clusters is 4)

No. of clusters is 5.

The total number of clusters to be generated is five, as shown in Figure 5. The total number of records used in the clustering process is 20403. The total number of records in cluster 0 is 4065, and in cluster 1 it is 4138. The total number of records in

cluster 2 is 4022. The total number of records in cluster 3 is 4175, and in cluster 4 it is 4003. For five clusters, the computation time is 0.59 milliseconds. The records are shown in a separate column for each cluster to which they belong in the output of the algorithm.

Figure 5 Output of proposed algorithm (number of clusters is 5)

[5] COMPARISON TABLE OF K-MEAN CLUSTERING ALGORITHM AND PROPOSED ALGORITHM

No. of clusters                                 2      3      4      5
Computation time (ms) of K-Mean                 0.67   0.97   0.98   2.74
Computation time (ms) of Proposed Algorithm     0.12   0.27   0.49   0.59

Table-1: Time complexity comparison of K-Mean and our proposed algorithm

Table 1 shows the computation time of the K-Mean algorithm and of the proposed algorithm as the number of clusters varies. With 2 clusters, the total computation time is 0.67 milliseconds for the K-Mean algorithm and 0.12 milliseconds for the proposed algorithm. With 3 clusters, it is 0.97 milliseconds for K-Mean and 0.27 milliseconds for the proposed algorithm. With 4 clusters, it is 0.98 milliseconds for K-Mean and 0.49 milliseconds for the proposed algorithm. With 5 clusters, it is 2.74 milliseconds for K-Mean and 0.59 milliseconds for the proposed algorithm. As Table 1 shows, the K-Mean algorithm takes more time than the proposed algorithm in every case.
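The size of the gap in Table 1 can be expressed as a ratio of the two computation times for each number of clusters. The short sketch below does only that arithmetic on the reported values; the variable and function names are illustrative, and it makes no claim about how the timings themselves were measured.

```python
# Computation times in milliseconds, copied from Table 1.
k_values = [2, 3, 4, 5]
kmean_ms = [0.67, 0.97, 0.98, 2.74]
proposed_ms = [0.12, 0.27, 0.49, 0.59]


def time_ratios(baseline, candidate):
    """Return baseline/candidate for each pair of timings."""
    return [b / c for b, c in zip(baseline, candidate)]


ratios = time_ratios(kmean_ms, proposed_ms)
for k, r in zip(k_values, ratios):
    print(f"K={k}: K-Mean time is {r:.2f} times the proposed algorithm's time")
```

On the reported figures the ratio is smallest at four clusters (about 2) and largest at two clusters (about 5.6), so the advantage does not grow steadily with the number of clusters.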

[6] CONCLUSION

The K-Mean clustering algorithm suffers from the problems of outliers and time complexity; the proposed algorithm reduces the time complexity. The main parameter used to compare the K-Mean clustering algorithm and the proposed algorithm is the time taken as the number of clusters varies. With two clusters, the K-Mean algorithm takes 0.67 milliseconds and the proposed algorithm takes 0.12 milliseconds. With three clusters, the K-Mean algorithm takes 0.97 milliseconds and the proposed algorithm takes 0.27 milliseconds. With four clusters, the K-Mean algorithm takes 0.98 milliseconds and the proposed algorithm takes 0.49 milliseconds. With five clusters, the K-Mean algorithm takes 2.74 milliseconds and the proposed algorithm takes 0.59 milliseconds. The total time taken by the K-Mean algorithm in the clustering process is more than that taken by the proposed algorithm. The results show that the proposed algorithm reduces the time complexity in comparison with the K-Mean algorithm and is therefore more efficient. It will provide a better clustering result in less time.

[7] ACKNOWLEDGMENTS

Work of this magnitude is only possible with a helping hand from many people. First and foremost, I wish to express my deepest gratitude to my supervisor, Dr. Dilbag Singh (Associate Professor, Dept. of Computer Science and Applications, Chaudhary Devi Lal University, Sirsa), for his untiring guidance. I want to express my deepest gratitude to my parents and friends, who have always supported me in whatever decision I have made. Finally, I am thankful to GOD for blessing me much more than I deserve.

REFERENCES

[1] An Introduction to Cluster Analysis for Data Mining, 2000. [Online]. Available: http://www.cs.umn.edu/~han/dmclass/cluster_survey_10_02_00.
[2] C.R. Kothari, Research Methodology: Research Methods & Techniques, 2nd Edition.
[3] Clifton, C. and R. Steinheiser, "Data Mining on Text", Proceedings of the 22nd Annual IEEE International Computer Software and Applications Conference, COMPSAC98, pp. 630-635, 1998.
[4] Frigui H. and Krishnapuram R., Competitive Fuzzy Clustering, IEEE, 1996, pp. 225-228.
[5] G. Singh and N. Kaur, Hybrid Clustering Algorithm with Modified Enhanced K-Mean and Hierarchical Clustering, International Journal of Advanced Research in Computer Science and Software Engineering, 2013.
[6] Jaideep Srivastava, Robert Cooley, Mukund Deshpande, Pang-Ning Tan, Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data, SIGKDD.
[7] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann Publishers, San Francisco, 2006.
[8] Mercer Agathe introduced an idea for Clustering Students to Help Evaluate Learning.
[9] Naresh Kumar Nagwani, "Clustering Based URL Normalization Technique for Web Mining," ace, pp. 349-351, 2010 International Conference on Advances in Computer Engineering, 2010.
[10] Ester Martin, Kriegel Hans-Peter introduced the Idea of Clustering for Mining in Large Spatial Database.
[11] Swasti Singhal and Monika Jena, A Study on WEKA Tool for Data Preprocessing, Classification and Clustering, International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN: 2278-3075, Volume 2, Issue 6, May 2013.