Mutual Information with PSO for Feature Selection

S. Sivakumar #1, Dr. C. Chandrasekar *2
#* Department of Computer Science, Periyar University, Salem-11, Tamilnadu, India
1 ssivakkumarr@yahoo.com
2 ccsekar@gmail.com

ABSTRACT

Feature selection reduces the feature space, which is crucial for reducing training time and improving prediction accuracy. This is achieved by removing irrelevant, redundant, and noisy features. In this paper, PSO is used to implement feature selection in a filter-based method: mutual information serves as the fitness function of PSO, and k-NN is used to evaluate the accuracy of the selected features. The proposed feature selection method is applied to features extracted from lung CT scan images. Experimental results show that the proposed method simplifies the feature set effectively and obtains higher classification accuracy than classification on the unreduced dataset.

INTRODUCTION

Feature selection plays a central role in the data analysis process, since irrelevant features often degrade the performance of algorithms devoted to data characterization, rule extraction, and construction of predictive models, both in speed and in predictive accuracy. Irrelevant and redundant features interfere with useful ones, so that most supervised learning algorithms fail to properly identify the features that are necessary to describe the target concept. Effective feature selection, by enabling learning algorithms to focus on the best subset of useful features, substantially increases the likelihood of obtaining simpler, more understandable, and more predictive models of the data. Optimal feature selection is achieved by maximizing or minimizing a criterion function; such an approach is referred to as the filter feature selection model.
Conversely, the effectiveness of the performance-dependent (wrapper, or feedback) feature selection model is directly related to the performance of the learning algorithm, usually in terms of its predictive accuracy. Many feature selection algorithms involve heuristic or random search strategies in order to reduce computing time. For a large number of features, heuristic search is often used to find the best subset of features; however, the classification accuracy achieved by the resulting feature subset is often lower. More recently, nature-inspired algorithms have been applied to feature selection: population-based optimization algorithms such as the genetic algorithm (GA), ant colony optimization (ACO), and particle swarm optimization (PSO) [10] have been proposed. These methods are stochastic optimization techniques that attempt to reach better solutions by exploiting feedback and heuristic information.

PARTICLE SWARM OPTIMIZATION

PSO is a population-based metaheuristic optimization algorithm that optimizes a problem by maintaining a population of candidate solutions called particles. Each particle has a position represented by a position vector x_i; the feature vector is multiplied with the position vector. The particles move around the search space with a velocity vector v_i, searching the objective function that determines the fitness of a solution [8] [9]. In each iteration, particles are updated using two best values, called pbest and gbest. Each particle keeps track of its own best position, associated with the best fitness it has achieved so far, called pbest. When a particle takes the whole population as its topological neighbors, the best position obtained by the population is called gbest. In PSO for feature selection, a particle is represented as an n-bit string, where n is the total number of features in the dataset. The position value in the d-th dimension (i.e., x_id) lies in [0, 1] and gives the probability of the d-th feature being selected.
A threshold θ is used to determine whether a feature is selected or not. If x_id > θ, the d-th feature is
selected; otherwise, the d-th feature is not selected [9]. During movement, the current position of particle i is represented by a vector x_i = (x_i1, x_i2, ..., x_iD), where D is the dimensionality of the search space. The velocity of particle i is represented as v_i = (v_i1, v_i2, ..., v_iD), which is limited by a predefined maximum velocity v_max, so that v_id^t ∈ [-v_max, v_max]. The best previous position of a particle is recorded as the personal best pbest, and the best position obtained by the population thus far is called gbest. Based on pbest and gbest, PSO searches for the optimal solution by updating the velocity and position of each particle according to the following equations:

x_id^(t+1) = x_id^t + v_id^(t+1) (1)

v_id^(t+1) = w v_id^t + c1 r1i (p_id - x_id^t) + c2 r2i (p_gd - x_id^t) (2)

where t denotes the t-th iteration, d denotes the d-th dimension of the search space, w is the inertia weight, c1 and c2 are acceleration constants, r1i and r2i are random values uniformly distributed in [0, 1], and p_id and p_gd represent the elements of pbest and gbest in the d-th dimension [3].

Figure 1: Processing model of PSO with mutual information for feature selection (initialize the position and velocity of each particle; collect the features selected by each particle; evaluate the mutual information of the selected features as the fitness; update pbest and gbest; update positions and velocities; on termination, return the best selected features).

Figure 1 shows the steps involved in PSO for feature selection with mutual information acting as the fitness function.
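Equations (1) and (2) translate directly into a few lines of NumPy. The sketch below is illustrative rather than the authors' implementation; the function names and the default values for w, c1, c2, v_max, and the selection threshold are assumptions.

```python
import numpy as np

def pso_step(x, v, pbest, gbest, w=0.6, c1=2.0, c2=2.0, v_max=4.0):
    """One PSO iteration: Eq. (2), then Eq. (1), for the whole swarm.

    x, v, pbest: arrays of shape (n_particles, n_features); gbest: (n_features,).
    """
    r1 = np.random.rand(*x.shape)
    r2 = np.random.rand(*x.shape)
    # Eq. (2): inertia term plus attraction toward pbest and gbest
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    v = np.clip(v, -v_max, v_max)       # enforce v_id in [-v_max, v_max]
    # Eq. (1): position update, kept in [0, 1] so x_id reads as a probability
    x = np.clip(x + v, 0.0, 1.0)
    return x, v

def decode(x_i, threshold=0.6):
    """Feature d is selected when x_id exceeds the threshold."""
    return np.flatnonzero(x_i > threshold)
```

Each row of x is one particle's position over the n features; decode turns a position into the index set of selected features.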
MUTUAL INFORMATION

Mutual information is defined as the information shared between two random variables; it can be used to measure the relevance between a feature X and the class labels C [1]. Given a variable X, the amount of information one can gain about a variable Y is the mutual information I(X; Y):

I(X; Y) = H(X) - H(X|Y) (3)

According to Equation (3), the mutual information I(X; Y) will be large if the two variables X and Y are closely related; otherwise, I(X; Y) = 0 if X and Y are totally unrelated. Information theory, mainly mutual information, has been applied in filter feature selection to measure the relationship between the selected features and the class labels. A classical use of information theory is found in several feature ranking measures, which compute statistics from the data that score each feature F_i according to its relation with the classes. One of the most relevant contributions of information theory to feature selection research is the use of mutual information for feature evaluation. In the following formulation, F refers to a set of features and C to the class labels [2]:

I(F; C) = ∫∫ p(f, c) log [ p(f, c) / (p(f) p(c)) ] df dc (4)

Some approaches evaluate the mutual information between a single feature and the class label; this is straightforward. The difficulties arise when evaluating entire feature sets. The necessity of evaluating entire feature sets in a multivariate way is due to possible interactions among features [1] [6]: while two single features might not individually provide enough information about the class, their combination could, in some cases, provide significant information.
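For discrete (or discretized) data, Equation (3) can be estimated directly from empirical frequencies. A minimal sketch in Python, assuming features have already been binned to discrete values (the paper does not specify its estimator):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy H(X) from empirical frequencies, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mutual_information(x, c):
    """Estimate I(X;C) = H(X) - H(X|C), Eq. (3), for a discretized
    feature x and class labels c (both 1-D arrays of equal length)."""
    h_x = entropy(x)
    h_x_given_c = 0.0
    for cls in np.unique(c):
        mask = (c == cls)
        # weight each class-conditional entropy by the class frequency
        h_x_given_c += mask.mean() * entropy(x[mask])
    return h_x - h_x_given_c
```

A perfectly class-predictive binary feature yields I(X; C) = 1 bit, while a feature independent of the class yields approximately zero.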
For the mutual information between N variables X1, X2, ..., XN and the variable Y, the chain rule is [4]:

I(X1, X2, ..., XN; Y) = Σ_{i=1}^{N} I(Xi; Y | Xi-1, Xi-2, ..., X1) (5)

The usual approach to calculating mutual information is to estimate the entropies and substitute them into the mutual information formula. Mutual information is considered a suitable criterion for feature selection, since it measures the reduction of uncertainty about the class labels due to knowledge of the features of a data set [5]. The fitness function, which is to be maximized, is the mutual information value:

fitness(x_i) = I(x_i; C) (6)

PSO with Mutual Information Based Feature Selection Algorithm

Input: Data set
Output: Selected feature subset

1  Begin
2    randomly initialize the particles' positions and velocities;
3    while the maximum number of iterations is not reached do
4      evaluate the fitness (mutual information) of each particle on the data set;
5      for i = 1 to PopulationSize do
         // Fitness(xi) measures the mutual information between the feature vector of xi and the class
6        if Fitness(xi) > Fitness(pbest) then
7          pbest = xi;  // update the pbest of particle i
8        else if Fitness(xi) = Fitness(pbest) and xi < pbest then
9          pbest = xi;  // update the pbest of particle i
10       if any Fitness(pbest) > Fitness(gbest) then
11         gbest = pbest;  // update gbest
12       else if any Fitness(pbest) = Fitness(gbest) and pbest < gbest then
13         gbest = pbest;  // update gbest
14     for i = 1 to PopulationSize do
15       update the velocity and position of particle i;
16   calculate the classification accuracy of the selected feature subset on the data set;
17   return the position of gbest (the selected feature subset);

EXPERIMENTAL RESULTS AND DISCUSSION

Image Database: The Lung Image Database Consortium image collection (LIDC-IDRI) consists of diagnostic and lung cancer screening thoracic CT scans with marked-up annotated lesions.
It is a web-accessible international resource for the development, training, and evaluation of computer-assisted diagnostic (CAD) methods for lung cancer detection and diagnosis. Each study in the dataset consists of a collection of slices, each of size 512 x 512 in DICOM format. The lung image data, nodule size list,
and annotated XML documentation can be downloaded from the National Cancer Institute website [7]. For the experiment, we took 180 non-cancer lung CT scan images and 320 cancer lung CT images from the LIDC dataset. The images are first filtered for noise removal with a Wiener filter. After noise removal, morphology-based operations are applied to extract the lung region. From the extracted lung region, features are extracted, namely first order statistical features and second order statistical features. These features are taken as the input to the mutual information based PSO feature selection, with the following parameters:

Parameter              Value
Number of iterations   200
Population size        50, 200, 100
Number of particles    5, 13, 7
C1                     2
C2                     2
θ                      0.6

From Table 1, three different types of features extracted from the lung CT scan images are used in the experiment, namely first order statistical features, GLCM based Haralick features, and GLRLM based features. The three feature sets have different numbers of features (5, 13, and 7), with two classes, and serve as representative samples of the problems the proposed algorithm can address. In the experiments, the instances in each dataset are randomly divided into two sets: 75% as the training set and 25% as the test set. From Table 1, PSO with MI feature selection yields better accuracy with a minimal set of features compared with the unreduced dataset. Table 2 shows the feature subsets selected by PSO with MI in different runs.

Table 1: Performance analysis of PSO with MI for feature selection

Feature Set   No. of Features  Classification   No. of Features         Classification
              (Unreduced)      Accuracy (%)     (Selected by MI w/PSO)  Accuracy (%)
First order   5                68.56            3                       81.43
GLCM          13               63.07            8                       80.56
GLRLM         7                72.34            4                       83.47

Table 2: Features selected by the MI based PSO algorithm

Run #  First Order  GLCM                  GLRLM
1      [1,2,4]      [1,2,4,5,6,8,10]      [1,2,4,5,7]
2      [1,2,5]      [1,2,4,6,9,10]        [1,2,5,7]
3      [1,4]        [1,2,3,5,7,11]        [1,2,4,7]
4      [1,2]        [1,2,4,5,10,11,13]    [1,2,4,5]
5      [1,2,5]      [1,3,6,7,9,10,11,12]  [1,2,4,6,7]
6      [1,5]        [1,3,6,10,12,13]      [1,3,4,5]
7      [1,3,4]      [1,4,5,6,8,10,11,12]  [2,3,5,7]
8      [1,3,5]      [1,2,7,8,11,12,13]    [1,2,5,7]
9      [1,3]        [1,2,5,7,8,9,10,12]   [1,2,4,5,6]
10     [1,4]        [1,3,4,5,7,8,11,13]   [1,2,6,7]

CONCLUSION

In this work, a mutual information based PSO algorithm is used for feature subset selection for classification. From the results, the classification accuracy of the k-NN classifier with mutual information based PSO feature selection is significantly superior to that of the k-NN classifier without feature selection. Reducing the number of features by selecting only the significant ones improved the classification accuracy.

ACKNOWLEDGMENT

The first author extends his gratitude to UGC, as this research work was supported by the Basic Scientific Research (BSR) Non-SAP Scheme, under grant reference number F-41/2006(BSR)/11-142/2010(BSR), UGC XI Plan.

REFERENCES

[1] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 27, no. 8, pp. 1226-1238, 2005.
[2] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003.
[3] A. Unler and A. Murat, "A discrete particle swarm optimization method for feature selection in binary classification problems," European Journal of Operational Research, vol. 206, no. 3, pp. 528-539, 2010.
[4] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, pp. 1226-1238, 2005.
[5] N. Vasconcelos and M. Vasconcelos, "Scalable discriminant feature selection for image retrieval and recognition," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR '04), pp. 770-775, 2004.
[6] L. F. Kozachenko and N. N. Leonenko, "Sample estimate of the entropy of a random vector," Problems of Information Transmission, vol. 23, no. 1, pp. 95-101, 1987.
[7] S. Sivakumar and C. Chandrasekar, "Lung nodule detection using fuzzy clustering and support vector machines," International Journal of Engineering and Technology, vol. 5, no. 1, pp. 179-185, 2013.
[8] L. Ke, Z. Feng, Z. Xu, K. Shang, and Y. Wang, "A multiobjective ACO algorithm for rough feature selection," in Second Pacific-Asia Conference on Circuits, Communications and System (PACCS), vol. 1, pp. 207-210, 2010.
[9] S. Yang, L. Y. Chuang, C. H. Ke, and C. H. Yang, "Boolean binary particle swarm optimization for feature selection," in IEEE Congress on Evolutionary Computation (CEC '08), pp. 2093-2098, 2008.
[10] K. Waqas, R. Baig, and S. Ali, "Feature subset selection using multi-objective genetic algorithms," in IEEE 13th International Multitopic Conference (INMIC '09), pp. 1-6, 2009.