
International Journal of Electrical & Computer Sciences IJECS-IJENS Vol: 17 No: 02

A Framework for Outlier Detection Using Improved Bisecting k-means Clustering Algorithm

K. Swapna 1, Prof. M.S. Prasad Babu 2

1 Research Scholar, Department of Computer Science and Systems Engineering, AU College of Engineering (A), Visakhapatnam, INDIA. pentaswapna@yahoo.co.in
2 Professor, Department of Computer Science and Systems Engineering, AU College of Engineering (A), Andhra University, Visakhapatnam, INDIA. profmspbabu@gmail.com

Abstract The aim of this paper is to design an automatic liver diagnosis system that detects liver diseases early and accurately, to help reduce the increasing number of deaths caused by liver diseases. With such a system, diagnosis can be made early and treatment can begin immediately. Automatic diagnosis tools may also reduce the burden on doctors. The physical data considered for this study was collected from various pathological laboratories in southern India and annotated by expert gastroenterologists. One of the datasets considered is the Indian Liver Patient Dataset (ILPD), available in the UCI Machine Learning Repository, which has 583 records; the physically collected data has 500 records, giving a total of 1083 records. Since common attributes were found in both datasets, this paper evaluates the selected outlier algorithms and clustering algorithms using the proposed framework for clustering liver patient datasets without outliers. These algorithms are evaluated on four criteria: Accuracy, F-measure, Entropy and Purity. Our interest is to analyze these datasets to better understand the system and to help develop an automatic liver diagnosis system.

Index Term-- Liver datasets, Outlier detection, Cluster-based Bisecting k-means, cluster validation.

I.
INTRODUCTION

Data mining techniques are very popular and can be applied in diverse areas, including information retrieval and medicine. Detecting outliers has many important applications in data preprocessing as well as in mining abnormal points among the data points. There are various outlier detection methods in data mining, classified as model based, density based, connectedness based, distance based, cluster based, k-nearest neighbor, etc. Of these, the cluster-based and distance-based methods are familiar to users, simple, easy to implement and efficient. To produce better results with less computational time, the cluster-based algorithm and the distance-based algorithm are merged [12].

Clustering is one of the most important techniques used in data mining to find interesting patterns and structures in the hidden information of datasets. Clustering methods include hierarchical clustering, partitional clustering, density-based clustering, etc. Among these, hierarchical clustering is one of the best methods for generating clusters properly, and it follows the dendrogram technique: all objects are arranged in a tree structure, and split or merge operations produce the required clusters. This method may use either a top-down or a bottom-up approach and measures the proximity between clusters using single link (SLA), complete link (CLA) or average link (AvgLink) methods. A common linkage measure of cluster quality is the group average method, either the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) or the Weighted Pair Group Method with Arithmetic Mean (WPGMA) [1].

Partitional clustering is one of the simplest iterative methods and includes algorithms such as k-means and k-medoids. The k-means algorithm has an advantage over hierarchical clustering, which is often regarded as the better-quality clustering approach but is limited by its quadratic time complexity.
The standard k-means algorithm takes data points as initial centroids and recomputes the mean of every cluster in each iteration until the desired clusters are obtained. The k-means algorithm runs with lower time complexity than hierarchical clustering algorithms [3]. The bisecting k-means algorithm merges hierarchical and partitional techniques: it follows the top-down (divisive) hierarchical approach and uses the iterative k-means method to obtain better clusters with less computational time. k-means and its variants, such as bisecting k-means, have a time complexity that is linear in the number of items, but are thought to produce inferior clusters [10].

In this paper we designed a hybrid clustering approach that combines two or more widely used clustering algorithms, such as k-means, bisecting k-means and hierarchical clustering methods, so that it generates better-quality clusters. The experimental results demonstrate that the proposed improved bisecting k-means method outperforms the standard k-means and bisecting k-means clustering methods. This paper also proposes a framework that uses the proposed improved bisecting k-means clustering algorithm together with a cluster-based distance outlier detection algorithm. The framework first generates a number of clusters and then eliminates the outliers by applying a threshold value to every cluster.

II. RELATED WORK

Outlier detection is important for finding noise in a dataset. When the distance-based outlier detection algorithm was applied to datasets such as the ILPD and BUPA liver datasets, it was less efficient and took more computational time than the combined cluster-based and distance-based outlier detection algorithm [12]. That outlier algorithm used the k-means clustering algorithm. Compared with partitional clustering, hierarchical clustering algorithms have higher space and time complexity [4][7]. It is proved that


clustering algorithms, namely k-means, hierarchical and density-based algorithms, applied to the ILPD liver dataset gave better performance. Although the hierarchical SLA/CLA and k-means algorithms showed equal performance, k-means took less computational time than the other algorithms [3]. Various researchers have suggested combining hierarchical and partitional clustering algorithms to achieve better clustering. One hybrid hierarchical agglomerative algorithm uses SLA or CLA to merge the output of a partitional clustering technique and was proposed to improve cluster quality [11]. The bisect k-means algorithm is a combination of the divisive hierarchical algorithm and the k-means partitional algorithm, and gives better results than the standard k-means algorithm. The Hybrid Bisect k-means clustering algorithm uses bisect k-means for divisive clustering and UPGMA for agglomerative clustering; applied to document clustering, it generates better clusters with less computational time than the standard k-means and bisecting k-means algorithms [10].

In this paper we use hybrid bisecting k-means with a different distance calculation method, WPGMA, and propose a new improved bisecting k-means clustering algorithm for better cluster quality with less computation time. We also propose a framework that generates clusters and eliminates outliers at the same instant in the liver dataset. In the framework, the cluster-based and distance-based outlier detection algorithm uses the improved bisecting k-means clustering algorithm in place of k-means for better results.

The common attributes in these data sets are Age, Gender, TB, DB, TP, ALB, SGPT, SGOT, A/G ratio and Alkphos.
Out of these attributes, TB (Total Bilirubin), DB (Direct Bilirubin), TP (Total Proteins), ALB (Albumin), A/G ratio, SGPT, SGOT and Alkphos are related to liver function tests (LFT), which measure the levels of enzymes, proteins and bilirubin and help in the diagnosis of liver disease. The ILPD dataset attributes and their normal values are described in Table I.

TABLE I
ATTRIBUTES IN LIVER DATASET

Attributes         Information (Normal Value)
Age                Age of the patient
Gender             Gender of the patient
TB (LFT)           Total Bilirubin (mg/dl)
DB (LFT)           Direct Bilirubin (mg/dl)
Alkphos (LFT)      Alkaline Phosphatase (U/L)
SGPT (LFT)         Alanine Aminotransferase (5-45 U/L)
SGOT (LFT)         Aspartate Aminotransferase (5-40 U/L)
TP (LFT)           Total Proteins (5.5-8 gm/dl)
ALB (LFT)          Albumin (3.5-5 gm/dl)
A/G Ratio (LFT)    Albumin and Globulin Ratio (>=1)

III. PROBLEM DEFINITION

The objective of the proposed framework is to obtain clean data by preprocessing and thereby to increase the accuracy of cluster analysis. In this study attention is placed on preprocessing: outliers and missing values are removed so that the dataset becomes clean, which improves the grouping of the data and consequently the clustering results. The proposed framework is shown in Fig. 1. It contains a cluster-based distance outlier algorithm that uses the improved bisecting k-means clustering algorithm. The framework first groups the data into a number of clusters and then applies a threshold value to remove outliers, so that this phase generates the clusters and eliminates the outliers at the same instant. Finally, it compares the clustering result with the class labels in the dataset to obtain the accuracy, and validates the clusters using Purity, Entropy and F-measure. The framework generates better clusters and finds the outliers at lower computational cost than other outlier and clustering algorithms.

IV.
EXPERIMENTAL DATASET

This study uses two data sets totaling 1083 records, of which 651 are liver patients and 432 are non-liver patients. The first is ILPD, from the UCI Machine Learning Repository [13], comprising 583 liver patient records with 10 attributes (obtained from eight blood tests). The second data set (physically collected) of 500 records was collected from various pathological labs in south India, with 13 attributes (obtained from ten blood tests).

Fig. 1. Proposed Framework CADBOD

Classification algorithms were previously evaluated in terms of Accuracy, Precision, Sensitivity and Specificity in classifying the ILPD liver patient dataset [18]. Classification requires a class label for every record, whereas clustering groups the data without class labels. Because labeling the data every time is very challenging, an attempt is made to develop a framework that automates this process based on cluster analysis. The proposed method is tested on the Indian Liver Patient Dataset (ILPD), a real-world dataset available in the UCI Machine Learning Repository. However, it has only 583 records with 10 dimensions, which is not sufficient for the clustering experiment. Formann (1984) recommends a minimum sample size of 2^m, where m equals the number of clustering dimensions [17]. Hence our sample size should be at least 2^10, i.e., 1024 samples, so we physically collected 500 more records on liver patients, similar to the ILPD dataset. The data set then becomes 1083 records with 10 attributes, which exceeds the 1024 records required for 10 clustering dimensions.
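The sample-size argument above is a short piece of arithmetic, which the following minimal sketch spells out. The function and constant names are hypothetical and chosen only for illustration; the record counts and the 2^m rule are the ones quoted in the text.

```python
def min_sample_size(num_dimensions: int) -> int:
    """Rule of thumb quoted in the text: at least 2**m records
    for m clustering dimensions."""
    return 2 ** num_dimensions


ILPD_RECORDS = 583        # records in the UCI ILPD data set
COLLECTED_RECORDS = 500   # physically collected records
DIMENSIONS = 10           # attributes used for clustering

required = min_sample_size(DIMENSIONS)
total = ILPD_RECORDS + COLLECTED_RECORDS

print(f"required={required}, ILPD alone={ILPD_RECORDS}, combined={total}")
print("ILPD alone is sufficient:", ILPD_RECORDS >= required)
print("combined data is sufficient:", total >= required)
```

Under this rule the ILPD records alone fall short of the minimum, while the combined data set clears it by a small margin.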

V. METHODOLOGY

A. Proposed Improved bisecting k-means clustering Algorithm

The proposed improved bisecting k-means algorithm combines divisive and agglomerative clustering. Our method uses the bisecting k-means algorithm for divisive clustering and WPGMA for agglomerative clustering. WPGMA is a good choice when there is a reason to eliminate size differences between the resulting groups [15]. Because the proposed algorithm is a combination of two or more algorithms, its accuracy is expected to be better than that of the individual algorithms. The data elements are first clustered using the bisecting k-means clustering algorithm, and the cluster centroids are obtained. WPGMA is then applied to those cluster centroids. If two centroids end up in the same WPGMA cluster, their data points are said to belong to the same cluster. In the proposed algorithm WPGMA is used for the distance calculation between centroids; the method is not as complex as UPGMA, and it also works with inconsistency values. The procedures of the two methods differ, but their final results are the same. WPGMA is therefore used in the proposed clustering algorithm for easy implementation and less computational time.

Algorithm:
Input: a cluster with n data items and k (the number of clusters)
Output: the n individual data items in k clusters
Steps:
1. Start with all data points in one single cluster.
2. Find 2 sub-clusters using the basic k-means algorithm.
3. Find the distance between those 2 sub-clusters.
4. If (sub_cluster_1 > sub_cluster_2), divide sub_cluster_1 into 2 clusters; else divide sub_cluster_2 into 2 clusters.
5. Repeat steps 2, 3 and 4 (the bisecting step) for the given number of iterations and take the split that produces the clustering with the highest overall similarity.
6. Apply WPGMA to the cluster centroids to obtain k centroid clusters.
7. Finally, refine the centroids of the clusters until k clusters are obtained.

Fig. 2. Proposed IBKM clustering algorithm

B. Outlier Detection

The data typically consists of patient records, which may have several different types of features such as patient age, blood group and weight, and which have temporal as well as spatial aspects. The data can contain outliers for several reasons, such as a patient's abnormal condition, instrumentation errors or recording errors. An outlier is a pattern that is dissimilar to the rest of the patterns in the dataset. This study uses the cluster-based distance outlier detection algorithm [12], which merges the distance-based and cluster-based outlier detection methods.

Fig. 3. CBDODA Framework

Cluster-Based Approach: Clustering is a popular technique used to group similar data points or objects into groups or clusters, and it is an important tool for outlier analysis. The cluster-based approach primarily groups data having similar characteristics and calculates the centroid of each group.

Distance-Based Approach: The distance-based approach calculates the maximum distance value for the whole data set, and so gives only one value as the most likely outlier. To find the distance between a point and its neighbors, different dissimilarity measures are used, such as Euclidean distance, cosine distance, city block distance, etc. Unlike statistical methods, this approach does not require any a priori knowledge of the data distribution, but it does require a threshold parameter to be defined.

Framework based on cluster analysis for distance based outlier detection (CADBOD)

This outlier detection algorithm uses a hybrid approach combining the two techniques.
The method applies the proposed improved bisecting k-means clustering algorithm, in place of the existing k-means algorithm for better efficiency, to partition the dataset into a number of clusters, and then finds the outliers in each cluster using a threshold value [12].

Algorithm:
Input: the set of n points, the number of clusters k
Output: O, the clustering with the outlier result set
Steps:
1. Generate clusters using the IBKM clustering algorithm.
2. Calculate the threshold % for each cluster.
   i. Find the minimum and maximum for each cluster.
   ii. Find the maximum distance (D) from the centroid.
   iii. Take the threshold percentage from the user.
   iv. Calculate the threshold value (T) for each cluster.
3. If D > T, then the point is declared an outlier.
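The two algorithms of this section can be sketched together in plain Python. This is an illustrative reading of the steps, not the authors' Java implementation, and it rests on several assumptions that the text does not settle: the 2-means seeds are chosen deterministically instead of by the repeated trials of step 5, bisecting k-means over-splits into `oversplit * k` clusters before WPGMA merges them back to k, the refinement of step 7 is omitted, and the per-cluster threshold is taken as `T = d_min + fraction * (d_max - d_min)`. All function and parameter names are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]


def two_means(points, iters=20):
    """Basic k-means with k = 2. Seeds are deterministic (assumption):
    the point farthest from the mean, then the point farthest from it."""
    mean = centroid(points)
    c0 = max(points, key=lambda p: math.dist(p, mean))
    c1 = max(points, key=lambda p: math.dist(p, c0))
    parts = None
    for _ in range(iters):
        a = [p for p in points if math.dist(p, c0) <= math.dist(p, c1)]
        b = [p for p in points if math.dist(p, c0) > math.dist(p, c1)]
        if not a or not b or (a, b) == parts:
            break
        parts = (a, b)
        c0, c1 = centroid(a), centroid(b)
    return parts  # None when the points cannot be split


def bisecting_kmeans(points, n_clusters):
    """Keep bisecting the largest cluster until n_clusters exist."""
    clusters = [list(points)]
    while len(clusters) < n_clusters:
        largest = max(clusters, key=len)
        split = two_means(largest)
        if split is None:
            break
        clusters.remove(largest)
        clusters.extend(split)
    return clusters


def wpgma_merge(clusters, k):
    """Merge clusters by WPGMA on their centroids until k remain (k >= 1)."""
    groups = dict(enumerate(clusters))
    cents = {i: centroid(c) for i, c in groups.items()}
    dist = {(i, j): math.dist(cents[i], cents[j])
            for i in groups for j in groups if i < j}
    next_id = len(groups)
    while len(groups) > k:
        i, j = min(dist, key=dist.get)          # closest pair of groups
        merged = groups.pop(i) + groups.pop(j)
        for m in groups:
            # WPGMA update: plain average of the two old distances
            d_im = dist.pop((min(i, m), max(i, m)))
            d_jm = dist.pop((min(j, m), max(j, m)))
            dist[(m, next_id)] = (d_im + d_jm) / 2
        del dist[(i, j)]
        groups[next_id] = merged
        next_id += 1
    return list(groups.values())


def ibkm(points, k, oversplit=2):
    """Over-split with bisecting k-means, then merge back to k with WPGMA.
    The oversplit factor is an assumption, not a value from the paper."""
    return wpgma_merge(bisecting_kmeans(points, oversplit * k), k)


def cadbod(points, k, fraction=0.8):
    """Cluster with ibkm, then flag points beyond a per-cluster threshold.
    Assumed rule: T = d_min + fraction * (d_max - d_min)."""
    kept, outliers = [], []
    for cluster in ibkm(points, k):
        center = centroid(cluster)
        dists = [math.dist(p, center) for p in cluster]
        t = min(dists) + fraction * (max(dists) - min(dists))
        kept.append([p for p, d in zip(cluster, dists) if d <= t])
        outliers.extend(p for p, d in zip(cluster, dists) if d > t)
    return kept, outliers
```

A call such as `kept, outliers = cadbod(points, k=2)` on a list of coordinate tuples returns the cleaned clusters and the flagged points. Note that under the assumed threshold rule, any `fraction` below 1 flags the farthest point of every cluster whose points are not all equidistant from the centroid, so the fraction needs tuning against the data.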


Cluster validation

Fig. 4. Flow Chart of CADBOD

Entropy, F-measure and Purity are the most frequently used external quality measures, in addition to the interpretability of the result.

Entropy: Entropy provides a measure of randomness. It indicates whether data of a particular class consistently falls into the same cluster or not. The Entropy of a clustering is

$H(\Omega) = \sum_{w \in \Omega} H(w) \, \frac{N_w}{N}$

where $\Omega = \{w_1, w_2, \ldots, w_k\}$ is the set of clusters, $H(w)$ is the entropy of a single cluster, $N_w$ is the number of points in cluster $w$, and $N$ is the total number of points.

F-measure: F-measure provides a measure of accuracy. It is based on the recall and precision measures used in the evaluation of information retrieval systems.

$F\text{-}Measure = \frac{2 \cdot (precision \cdot recall)}{precision + recall}, \quad precision = \frac{TP}{TP + FP}, \quad recall = \frac{TP}{TP + FN}$

Purity: Purity measures the quality of the clusters. [Purity formula garbled in the source; the surviving terms are TN, FP and FN.]

TP = True positive, TN = True negative, FP = False positive, FN = False negative

VI. EXPERIMENTAL RESULT

The proposed clustering algorithm and the outlier detection framework are implemented in Java, and these algorithms are applied to the selected experimental dataset. The results are interpreted and validated based on the indices Accuracy, Entropy, F-measure and Purity.

TABLE II
PERFORMANCE OF CLUSTERING ALGORITHMS

Algorithms      Accuracy   Entropy   F-Measure   Purity
k-means
Bisecting KM
Improved BKM

Fig. 5. Performance Evaluation of Clustering Algorithms

From the results, it is found that the proposed clustering algorithm IBKM produces higher-quality clusters in terms of accuracy, entropy, F-measure and purity than k-means and bisect k-means. The IBKM algorithm also takes less computational time to process the given dataset and generate good clusters.

The proposed outlier detection framework has been implemented and tested on the experimental dataset. The results are compared with those of the distance-based outlier algorithm (DBOA) and the proposed IBKM.
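The external validation measures defined under "Cluster validation" can be computed from class labels grouped by cluster, as in the minimal sketch below. Several points are assumptions and not taken from the paper: the logarithm base (2), the use of the standard majority-class definition of purity (the paper's own purity formula is not recoverable from the text), and all function names. Each cluster is represented simply as a list of the true class labels of its members.

```python
import math
from collections import Counter


def cluster_entropy(labels):
    """H(w): entropy of the class labels inside one cluster (log base 2)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def clustering_entropy(clusters):
    """H(Omega) = sum over clusters of H(w) * N_w / N."""
    total = sum(len(c) for c in clusters)
    return sum(cluster_entropy(c) * len(c) / total for c in clusters)


def purity(clusters):
    """Standard purity: fraction of points in their cluster's majority class.
    Assumption: the paper's own formula may differ."""
    total = sum(len(c) for c in clusters)
    return sum(max(Counter(c).values()) for c in clusters) / total


def f_measure(tp, fp, fn):
    """F-measure from precision = TP/(TP+FP) and recall = TP/(TP+FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: two clusters described by their members' true labels.
example = [["liver"] * 3 + ["healthy"], ["healthy"] * 4]
print("entropy:", round(clustering_entropy(example), 4))
print("purity:", purity(example))
print("F-measure:", round(f_measure(tp=3, fp=1, fn=0), 4))
```

Lower entropy and higher purity both indicate clusters that line up more closely with the class labels; a clustering whose clusters each contain a single class has entropy 0 and purity 1.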
TABLE III
ELAPSED TIME COMPARISON OF OUTLIER AND CLUSTERING ALGORITHMS

Algorithm                                   Elapsed Time (s)
Distance based Outlier algorithm (DBOA)
Improved Bisecting k-means (IBKM)
CADBOD Framework

Fig. 6. Elapsed Time Comparison of Outlier and Clustering Algorithms

Comparing the above three algorithms, the IBKM algorithm took less computation time than the DBOA and CADBOD algorithms. However, CADBOD clusters the data and eliminates outliers at the same instant, as one algorithm, whereas DBOA operates on the whole data set but cannot cluster it, so its computational time increases. The total computational time of DBOA and IBKM together is greater than that of CADBOD. Running the outlier algorithm and the clustering algorithm individually therefore takes more time than CADBOD. We can use the hybrid approach CADBOD in

a single instant, which gives the complete result in comparatively less time than the individual algorithms.

VII. CONCLUSIONS

In this paper an improved version of the bisecting k-means algorithm, known as the Improved Bisecting k-means Algorithm (IBKM), is proposed. The proposed algorithm generates better clusters than can be achieved by running its constituent algorithms individually. Clusters generated by the IBKM algorithm are compared with the clusters generated by the k-means and bisecting k-means algorithms with respect to Accuracy and three evaluation metrics: Entropy, F-measure and Purity. It is found that the proposed IBKM algorithm outperforms both the k-means algorithm and the bisecting k-means algorithm and produces better clusters. In addition to IBKM, a framework based on cluster analysis for distance based outlier detection (CADBOD) is proposed. In the proposed framework, IBKM is used for clustering the dataset. The result of this phase is a set of efficient clusters without outliers, obtained in a single instant with less computational time. The proposed framework is very helpful for developing a software-based automatic liver diagnosis system.

ACKNOWLEDGMENTS

We sincerely thank the expert gastroenterologists Dr. Srinivas Rao and Dr. Srinivas Baba for their highly valuable contribution and cooperation.

REFERENCES

[1] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Prentice Hall,
[2] M. Steinbach, G. Karypis, and V. Kumar, A comparison of document clustering techniques, in KDD workshop on text mining, vol. 400, Department of Computer Science and Engineering, University of Minnesota. Citeseer, 2000, pp
[3] K. Swapna, Prof. M.S. Prasad Babu and B. Jogeswara Rao, Clustering of ILPD Dataset with k-means, hierarchical and DBSCAN Algorithms, paper presented at the 102nd Indian Science Congress Association, Mumbai
[4] B. Larsen and C. Aone, Fast and effective text mining using linear-time document clustering, in Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining KDD 99, vol. 5. ACM Press, 1999, pp
[5] J. Han, M. Kamber and A. Tung, Spatial clustering methods in data mining: A survey. In H. Miller and J. Han, eds., Geographic Data Mining and Knowledge Discovery. Taylor & Francis
[6] J. A. Hartigan and M. A. Wong, 1979, A k-means Clustering Algorithm, Applied Statistics, Vol. 28, No. 1, pp
[7] B. S. Everitt, S. Landau, and M. Leese, Cluster Analysis, ser. Social Science Research Council Reviews of Current Research. Arnold, 2001, vol. 33, no. 1.
[8] Y. Zhao, G. Karypis, and U. Fayyad, Hierarchical clustering algorithms for document datasets, Data Mining and Knowledge Discovery, vol. 10, no. 2, pp , Mar
[9] R. Chitta and M. Narasimha Murty, Two-level k-means clustering algorithm for k-τ relationship establishment and linear-time classification, Pattern Recognition, vol. 43, no. 3, pp , Mar
[10] Keerthiram Murugesan and Jun Zhang, Hybrid Bisect k-means Clustering Algorithm, 2011 International conference on business computing and Global information.
[11] P. Vijaya, M. Narasimha Murty, and D. Subramanian, An efficient hybrid hierarchical agglomerative clustering (HHAC) Technique for partitioning large data sets, in PReMI, ser. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer Berlin Heidelberg, 2005, pp
[12] Ms. S. D. Pachgade, Ms. S. S. Dhande, Outlier Detection over Data Set Using Cluster-Based and Distance-Based Approach, Volume 2, Issue 6, June 2012, IJARCSSE, ISSN: X.
[13] The Indian liver patient dataset (ILPD) is from the UCI machine repository in the area of life science. The ILPD data set is available at the following hyperlink: http://archive.ics.uci.edu/datasets/ilpd+(indian+liver+patient+Dataset)
[14] Prof. M.S. Prasad Babu, Prof. M. Ramjee, Somesh Katta, K. Swapna, Implementation of Partitional Clustering on ILPD Dataset to Predict Liver Disorders, paper presented at the IEEE 7th international conference on software engineering and service science, Beijing, China.
[15] Fionn Murtagh, School of Computer Science, The Queen's University of Belfast, Belfast BT7 1NN, Northern Ireland, f.murtagh@qub.ac.uk, Clustering in Massive Data Sets, July 10, 2000
[16] M. J. Dallwitz, A flexible clustering method based on UPGMA and ISS
[17] M. Sarstedt and E. Mooi, A Concise Guide to Market Research, Springer Texts in Business and Economics, DOI / _9, Springer-Verlag Berlin Heidelberg
[18] Bendi Venkata Ramana, Prof. M. Surendra Prasad Babu, Prof. N. B. Venkateswarlu, A Critical Study of Selected Classification Algorithms for Liver Disease Diagnosis, International Journal of Database Management Systems (IJDMS), Vol.3, No.2, May
