AN EVALUATION OF CLUSTER BASED OUTLIER DETECTION STRATEGY BY FEATURE SELECTION TECHNIQUE IN DIABETES DATA SET

Volume 119 No. 16 2018, 411-420 ISSN: 1314-3395 (on-line version) url: http://www.acadpubl.

S. ANITHA¹, Dr. MARY METILDA²
¹Research Scholar, Bharathiar University, Coimbatore, Tamil Nadu, India. anitasenthil@gmail.com
²Asst. Prof., Queen Mary's College, Chennai, Tamil Nadu, India. metilda_dgvc@yahoo.co.in

Abstract: Cluster-based outlier detection is an important task in data mining research. In the proposed work, a feature selection method is used to reduce irrelevant data points and eliminate redundancy among data instances before clustering. After the data elements are clustered, outliers are identified and discarded based on a threshold value. A genetic algorithm (GA) is used to reduce the large data set to its relevant attributes and to find the optimal set of parameters for the clustering process. Features selected from biomedical data can be essential in disease diagnostics, and there are a number of features that can be tested. The research work uses both Euclidean and Mahalanobis distances for identifying outliers. Outlier rejection during the clustering process is necessary for avoiding loss of life and improving the efficiency of diagnostic work. The Pima Indian diabetes data set is taken from the UCI machine learning repository. Various experiments are conducted and compared for the proposed method, covering relevant subset selection, clustering and outlier removal with less computational time.

Keywords: Classification, Clustering, Feature selection, Genetic Algorithm, Mahalanobis Distance, Outlier Detection.

1. INTRODUCTION
Outlier detection is used in statistical and scientific domains for making informed decisions and predictions, and it is essential for calculating accurate results [6,7,8,9,11]. This research work carries out a cluster- and distance-based outlier detection method that includes feature selection.
Feature selection is one of the vital notions in pattern recognition and data mining [1]. Various feature selection methods are used to select relevant data, such as filter, wrapper and embedded methods. Of these three, the filter method is preferred in the proposed system. It comprises the correlation coefficient (CFS), which represents the linear relationship [2] between two variables [10], [12], [13], [19]. The genetic [20] search (GA) method is used to select the relevant features of the PIMA diabetes data. The selected significant attributes are then grouped by the k-means clustering technique using Euclidean distance. A distance-based outlier detection algorithm has been implemented for discovering outliers, following [3], [4].

This paper is organized as follows: Section 2 presents the review of literature. Section 3 discusses the research methodology, including the need for a feature selection technique before clustering the dataset. Section 4 presents empirical results using various classification approaches for evaluating accuracy. Section 5 concludes the proposed method and outlines the scope for future work.

2. REVIEW OF LITERATURE
In the field of data mining, much research has been done on discovering outliers using clustering techniques, but only a few works identify outlying points during cluster analysis with feature selection. Some research works on clustering with feature selection are reviewed here.

T. Sridevi et al. proposed a two-level feature selection technique: features are first selected based on rough sets with data reduction, and features are then selected from the reduced set based on Correlation Feature Selection (CFS). Experiments showed that the proposed method, compared in terms of the number of selected features, achieved the highest classification accuracy [2].

Anirudha R. C. et al. carried out research on a Genetic Algorithm based Wrapper feature selection Hybrid Prediction Model (GWHPM). The model initially used the k-means clustering technique to remove outliers from the dataset. Next, an optimal set of features was obtained by Genetic Algorithm based feature selection [10].
Dash M. et al. proposed an approach in which features are first ranked according to their importance for clustering and then a subset of important features is selected. For large data, [12] used a scalable method based on sampling.

Daniela M. Witten et al. proposed a framework for feature selection in clustering and used it to develop methods for sparse K-means and sparse hierarchical clustering. The framework governs both the selection of the features and the resulting clusters. The method was demonstrated on simulated data and on genomic data sets [18].

Yi Hong et al. described a novel feature selection algorithm for unsupervised clustering that combines the clustering ensembles method and the population based incremental learning algorithm. The method searches for a subset of all features such that the clustering algorithm trained on that feature subset achieves the clustering solution most similar to a reference solution. In particular, a clustering solution is first obtained by a clustering ensembles method, and the population based incremental learning algorithm is then adopted to find the feature subset that best fits the obtained clustering solution. An advantage of this unsupervised feature selection algorithm is that it is dimensionality-unbiased [19].

3. RESEARCH METHODOLOGY
This research work consists of the following five phases:
1. Pre-processing the PIMA Indian diabetes data set, which contains 768 instances with 9 attributes (including the class variable)
2. Feature selection using genetic algorithm (GAFS)
3. Clustering using the k-means clustering method (CLOPD)
4. Identifying outliers (IMO)
5. Evaluating results using classifiers

Data pre-processing is the process of cleaning the dataset, including replacing missing values. The next part of the proposed method is data transformation, in which the cleaned data is normalized by the Z-score normalization method. The second phase is feature selection using the correlation coefficient (CFS) and the genetic search algorithm (GA). With these methods, features are selected by removing irrelevant subsets and eliminating redundancy from the whole dataset. In the CLOPD method, k-means clustering partitions the dataset and removes the outliers [3]. The proposed method uses two clusters (k = 2) for partitioning the dataset, ignoring the class label. The k = 2 cluster centers are initialized by randomly selecting them from the given data points. Outliers are identified in the last stage, and the results are visualized and compared using data mining classification algorithms.

Table-1. Data Description of Pima Diabetes Data Sets
Data set (UCI) | Number of instances | Number of attributes | Reduced attributes by CFS & GA
PIMA Indian Diabetes Data | 768 | 9 (including class variable) | 2,6,7,8 (4 attributes)
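The normalization and clustering phases described above can be illustrated with a minimal Python sketch. It is not the authors' MATLAB implementation: the function names `zscore_columns` and `kmeans`, the fixed random seed, the use of the population standard deviation, and the toy rows are all assumptions made for illustration. It standardizes each column by Z-score and then runs plain k-means with k = 2, Euclidean distance, and initial centers drawn at random from the data points.

```python
import math
import random
import statistics


def zscore_columns(rows):
    """Rescale every column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # A constant column has standard deviation 0; fall back to 1.0 to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def kmeans(rows, k=2, max_iter=100, seed=0):
    """Lloyd's k-means with Euclidean distance; returns (labels, centres)."""
    rng = random.Random(seed)
    centres = [list(row) for row in rng.sample(rows, k)]  # random data points as initial centres
    labels = None
    for _ in range(max_iter):
        # Assignment step: each row joins its nearest centre.
        new_labels = [min(range(k), key=lambda j: math.dist(row, centres[j])) for row in rows]
        if new_labels == labels:  # assignments are stable, so the centres are too
            break
        labels = new_labels
        # Update step: move each centre to the mean of its members.
        for j in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == j]
            if members:
                centres[j] = [statistics.fmean(col) for col in zip(*members)]
    return labels, centres


# Hypothetical toy rows with two features; these are not PIMA records.
toy = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.1], [8.0, 9.0], [8.2, 9.1], [7.9, 8.8]]
scaled = zscore_columns(toy)
labels, centres = kmeans(scaled, k=2)
print(labels)
```

The class label is deliberately left out of the rows, matching the statement that clustering ignores it. Which cluster receives label 0 depends on the random initial centers, so only the grouping is meaningful.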

The main framework of the proposed method and the outlier detection algorithm are presented in Fig-1. The original training dataset (PIMA) is taken from the UCI machine learning repository. After pre-processing, relevant subsets of the dataset are selected and clustered. Simultaneously, outliers are identified and rejected using outlier detection methods.

Fig-1. The Framework of Proposed Method.

Algorithm-1. Identification of multiple outliers (IMO)
Input: X: training dataset with class variables; N: size of X; α: limit of significance; p: number of variables; S: covariance matrix; In: inliers; Nout: outliers; d: dimensionality.
Output: Data set without outliers.
1. Initialize N = size(X).
2. For each attribute in dataset X:
3. Compute the median and covariance for all observations.
4. Calculate the Mahalanobis distance for the n observations using the values of the median and covariance based on the p variables:
   $$MD_i = T_i^2 = (x_i - \tilde{x})^{\top} S^{-1} (x_i - \tilde{x})$$
   where x_i is the i-th observation, \tilde{x} is the vector of medians and S is the covariance matrix.
5. If the value for an observation is less than the cut-off value α (limit of significance), assign it the value 0 as an inlier (In); otherwise assign it the value 1 as an outlier (Nout).
6. Repeat steps 3 and 4 for the rest of the instances.
7. After the discordancy test, reject the points that are considered outliers (Nout).

4. EMPIRICAL RESULTS
The proposed methods have been implemented using MATLAB R2013a. Multivariate analysis (MVA) is based on the analysis of two or more variables simultaneously. After pre-processing the data, only four relevant features {2,6,7,8} are chosen by the genetic algorithm for clustering the dataset, and n outliers are identified using the Mahalanobis distance. The calculation of the Mahalanobis distance of data points in clusters is based on the median and covariance of the data points. Data descriptions are given in Table-1. Evaluating clusters and feature selection is an essential task because irrelevant subsets produce incorrect results and lead to wrong decisions in the data analysis. Fig-2 shows the detection accuracy rate, that is, the number of correctly identified outliers with and without the feature selection technique. Five classifier algorithms are used for evaluation in the proposed system: Naive Bayes, Multilayer Perceptron, Support Vector Machine, Radial Basis Function Network and Random Forest. The output of the IMO algorithm has been evaluated with these five data mining classifiers and compared. The performance of the classifiers is analysed, their accuracy results are presented, and the error rates are represented graphically in Fig-3. In this research, feature selection with the Genetic Algorithm is used to identify relevant features. The main advantage of the genetic algorithm is that it works conveniently with a set of possible solutions, each of which is evaluated by a fitness function.
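As an illustration of the outlier test in Algorithm-1 (IMO), the following Python sketch computes the squared Mahalanobis distance of each observation about the column medians and flags observations at or above a cut-off. It is a sketch under stated assumptions, not the authors' MATLAB code: the helper names, the use of the sample covariance about the column means, the toy points and the cut-off value 5.0 are all hypothetical choices, since the paper does not state how the covariance or the cut-off is computed.

```python
import statistics


def covariance_matrix(rows):
    """Sample covariance matrix S, taken about the column means (an assumption)."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    centred = [[v - m for v, m in zip(row, means)] for row in rows]
    p = len(columns)
    return [[sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(p)] for a in range(p)]


def solve(matrix, vector):
    """Solve matrix * x = vector by Gauss-Jordan elimination with partial pivoting."""
    size = len(vector)
    aug = [list(row) + [v] for row, v in zip(matrix, vector)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(size):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][size] / aug[i][i] for i in range(size)]


def flag_outliers(rows, cutoff):
    """Return (distances, flags): squared Mahalanobis distance about the medians,
    and a flag per row that is 0 for an inlier (distance < cutoff) or 1 for an outlier."""
    medians = [statistics.median(col) for col in zip(*rows)]
    cov = covariance_matrix(rows)
    distances = []
    for row in rows:
        diff = [v - m for v, m in zip(row, medians)]
        weighted = solve(cov, diff)  # S^{-1} (x_i - median), without forming the inverse
        distances.append(sum(d * w for d, w in zip(diff, weighted)))
    flags = [0 if d < cutoff else 1 for d in distances]
    return distances, flags


# Hypothetical two-feature points; the last one is placed far from the rest.
toy = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0],
       [0.0, -1.0], [1.0, 1.0], [-1.0, -1.0], [10.0, 10.0]]
distances, flags = flag_outliers(toy, cutoff=5.0)  # cut-off chosen only for this toy data
inliers = [row for row, flag in zip(toy, flags) if flag == 0]
print(flags)
```

Solving the linear system instead of inverting S keeps the sketch short and avoids an explicit inverse. In practice the cut-off would be derived from the chosen significance limit α rather than fixed by hand.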
The main objective of the proposed system is to select the best subset of features that yields the maximum classification accuracy for the diagnosis process on the PIMA diabetes dataset. This is shown in Table-2.

Table-2. Accuracy of PIMA data set
Columns: Classifier | Accuracy Before Outlier Detection (%) | Accuracy After Outlier Detection Without Feature Selection (%) | Accuracy After Outlier Detection With Feature Selection (%)
Rows: Naive Bayes, MLP, SMO, RBF Network, Random Forest

Fig-2. Accuracy Comparison after Detecting Outliers (x-axis: classifiers Naive Bayes, MLP, SMO, RBF, Random Forest; y-axis: accuracy (%); series: accuracy without outlier detection, accuracy of outlier detection without feature selection, accuracy of outlier detection with feature selection)

The results are presented as a comparison of classification accuracies with feature selection methods. The accuracies are obtained with the Naive Bayes, Multilayer Perceptron, Support Vector Machine, Radial Basis Function Network and Random Forest classifiers. Among the five classifiers, MLP, SMO and Random Forest give higher accuracy after eliminating outliers with significant features. The Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) rates are visualized in Fig-3.
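For reference, the two error measures named above can be computed as in the short Python sketch below. The paper does not state how its RMSE and MAE values were obtained, so the function names, the choice of comparing 0/1 class labels with predicted probabilities, and the sample values are assumptions for illustration only.

```python
import math


def mean_absolute_error(actual, predicted):
    """MAE: average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """RMSE: square root of the average squared difference."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical 0/1 class labels and predicted class probabilities.
actual = [1, 0, 1, 1, 0]
predicted = [0.9, 0.2, 0.6, 0.4, 0.1]
mae = mean_absolute_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
print(mae, rmse)
```

Because RMSE squares each difference before averaging, it is never smaller than MAE on the same data and reacts more strongly to a few large errors.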

Fig-3. Error Rate of PIMA Data Set (classifiers: Naive Bayes, MLP, SMO, RBF, Random Forest; series: RMSE, MAE)

5. CONCLUSION
The proposed method focuses on the evaluation of an outlier detection algorithm with the feature selection concept. The PIMA diabetes data set is considered for outlier analysis during the clustering process. With feature selection, the accuracies on the data set vary from one classifier to another. Multilayer Perceptron and Random Forest give higher accuracy than the other classifiers used in this work. Hence, reducing features provides a higher rate of accuracy without irrelevant attributes. For detecting outliers, feature selection is an essential pre-processing technique for increasing clustering speed and improving classification accuracy. The execution time is reduced due to the reduction in the size of the dataset, so the evaluation approach takes less computation time to find outliers. From the results of the experiments, it is concluded that the distance- and cluster-based algorithm gives comparatively higher accuracy on the PIMA dataset than other outlier detection methods. Experimental results show that the enhanced algorithm generates better results than the distance-based approach, considering the merits and demerits of the proposed system.

REFERENCES
[1] Aggarwal, Charu C., and Philip S. Yu. "Outlier detection for high dimensional data." In ACM SIGMOD Record, vol. 30, no. 2. ACM.
[2] Sridevi, T., and A. Murugan. "A novel feature selection method for effective breast cancer diagnosis and prognosis." International Journal of Computer Applications 88 (2014).
[3] Anitha, S., and M. Mary Metilda. "A heuristic approach for observing outlying points in diabetes data set." In Smart Technologies and Management for Computing, Communication, Controls, Energy and Materials (ICSTM), 2017 IEEE International Conference on. IEEE.
[4] Anitha, S., and M. Mary Metilda. "A Survey on Cluster Based Outlier Detection Techniques in Data Stream." International Journal of Data Mining Techniques and Applications (IJDMTA), vol. 5(1).
[5]
[6] Mukhopadhyay, A., U. Maulik, and S. Bandyopadhyay. "An interactive approach to multiobjective clustering of gene expression patterns." IEEE Transactions on Biomedical Engineering 60, no. 1 (2013).
[7] Chandola, Varun, Arindam Banerjee, and Vipin Kumar. "Anomaly detection: A survey." ACM Computing Surveys 41, no. 3.
[8] Raja, P. Vishnu, and V. Murali Bhaskaran. "An effective genetic algorithm for outlier detection." International Journal of Computer Applications 38, no. 6 (2012).
[9] Hawkins, Douglas M. Identification of outliers. Vol. 11. London: Chapman and Hall.
[10] Anirudha, R. C., R. Kannan, and N. Patil. "Genetic algorithm based wrapper feature selection on hybrid prediction model for analysis of high dimensional data." In IEEE 9th International Conference on Industrial and Information Systems (ICIIS), pp. 1-6, 2014.
[11] Hadi, A. S. "Identifying multiple outliers in multivariate data." Journal of the Royal Statistical Society, Series B (Methodological) 54, no. 3 (1992).
[12] Dash, Manoranjan, and Huan Liu. "Feature selection for clustering." In Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, Berlin, Heidelberg.
[13] Law, Martin H. C., Mario A. T. Figueiredo, and Anil K. Jain. "Simultaneous feature selection and clustering using mixture models." IEEE Transactions on Pattern Analysis and Machine Intelligence 26, no. 9 (2004).
[14] Chandrashekar, Girish, and Ferat Sahin. "A survey on feature selection methods." Computers & Electrical Engineering 40, no. 1 (2014).
[15] Verónica Bolón, Alonso-Betanzos, Maroño Amparo, and Canedo Noelia Sánchez. Artificial Intelligence: Foundations, Theory, and Algorithms. Feature Selection for High-Dimensional Data. Springer.
[16] Abualigah, Laith Mohammad, Ahamad Tajudin Khader, Mohammed Azmi Al-Betar, and Osama Ahmad Alomari. "Text feature selection with a robust weight scheme and dynamic dimension reduction to text document clustering." Expert Systems with Applications 84 (2017).
[17] Zhang, Shanwen, Harry Wang, and Wenzhun Huang. "Two-stage plant species recognition by local mean clustering and Weighted sparse representation classification." Cluster Computing 20, no. 2 (2017).
[18] Witten, Daniela M., and Robert Tibshirani. "A framework for feature selection in clustering." Journal of the American Statistical Association 105, no. 490 (2010).
[19] Hong, Yi, Sam Kwong, Yuchou Chang, and Qingsheng Ren. "Unsupervised feature selection using clustering ensembles and population based incremental learning algorithm." Pattern Recognition 41, no. 9 (2008).
[20] Karpagam, R., and S. Suganya. "Applications of Data Mining and Algorithms in Education: A Survey." International Journal of Innovations in Scientific and Engineering Research (IJISER), vol. 3, no. 4, pp. 38-46.
[21] Chauhan, Prashant, and Madhu Shukla. "A Review on Outlier Detection Techniques on Data Stream by Using Different Approaches of KMeans Algorithm." 2015 International Conference on Advances in Computer Engineering and Applications (ICACEA), IMS Engineering College, Ghaziabad, 2015. IEEE.

