Wrapper Feature Selection using Discrete Cuckoo Optimization Algorithm
S.J. Mousavirad and H. Ebrahimpour-Komleh*
Department of Computer and Electrical Engineering, University of Kashan, Kashan, Iran
*Corresponding Author's E-mail: ebrahimpour@gmail.com
Abstract
Feature subset selection plays an important role in data mining. The aim of feature selection is to remove redundant and irrelevant features without reducing the accuracy. The cuckoo optimization algorithm (COA) is a new population based algorithm inspired by the lifestyle of a bird species called the cuckoo. In this paper, we introduce a new approach based on COA for feature subset selection. To verify the efficiency of our algorithm, experiments were carried out on several datasets. The results demonstrate that the proposed algorithm can provide an optimal solution for the feature subset selection problem.
Keywords: Feature Selection, Cuckoo Optimization Algorithm, Population based Algorithms, Data Mining
1. Introduction
Feature subset selection, or feature selection, is one of the main steps in the data mining process. It is the process of selecting a subset of relevant features without reducing the accuracy. Finding the relevant features for a given problem with N features requires evaluating 2^N possible subsets. Such an exhaustive search can be very demanding and time consuming. There are other approaches, based on heuristic or random search, that attempt to reduce the computational complexity. Algorithms for feature selection are divided into three broad categories: wrapper methods, which use learning algorithms for evaluating features[1]; filter methods, which evaluate features according to the statistical information of the features; and embedded methods, which perform feature selection as part of the training process. Wrapper based approaches use a learning algorithm as the fitness function and search for the best subset of features in the space of all feature subsets.
Moreover, the selected features can be compared with the previously selected candidates and replaced if
found to be better[1]. Among the many methods proposed for wrapper feature selection, population based optimization algorithms such as the genetic algorithm[2-4], particle swarm optimization[5, 6], ant colony optimization[7, 8], and the imperialist competitive algorithm[9] have attracted a lot of attention. These methods attempt to find a better solution through an iterative process. The cuckoo optimization algorithm (COA) is a novel population based algorithm inspired by the lifestyle of a bird species called the cuckoo[10]. The algorithm is based on the anomalous egg laying and breeding of cuckoos. The current paper is the first attempt to apply the cuckoo optimization algorithm to feature selection.
The rest of this paper is organized as follows. First, a brief description of COA is given. Then the proposed feature selection algorithm is presented. In the next section, the experiments and results are reported. Finally, conclusions are drawn.
2. Brief Description of the Cuckoo Optimization Algorithm
COA is a novel population based algorithm inspired by the life of a bird species called the cuckoo. The algorithm is based on the anomalous egg laying and breeding of cuckoos. In the algorithm, cuckoos appear in two forms: mature cuckoos and eggs[10]. Similar to other population based algorithms, COA starts with an initial population of cuckoos. This initial population of mature cuckoos lays eggs in the nests of some host birds. The eggs that are more similar to the eggs of the host birds have the opportunity to grow up and become mature cuckoos. Eggs with less similarity are detected by the host birds and destroyed. The more eggs survive in an area, the more profit is gathered in that area, so cuckoos search for the best area in which to lay eggs. After the intact eggs grow and become mature cuckoos, they form societies. Cuckoos in other societies immigrate toward the most appropriate society.
They will inhabit somewhere near the best habitat in the most appropriate society. According to the number of eggs each cuckoo has and its distance to the best habitat, an egg laying radius is assigned to it. Each cuckoo then starts to lay eggs in random nests inside its egg laying radius. This process continues iteratively until the best position with the maximum profit value is obtained and most of the cuckoo population has gathered around that position[10]. Figure 1 shows the flowchart of COA.
Figure 1: Flowchart of the Cuckoo Optimization Algorithm
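The loop of Figure 1 can be sketched as a compact toy implementation. This is only an illustrative sketch under assumptions, not the authors' code: the bit-counting `profit` function stands in for classifier accuracy, and the egg counts, flip radius, and immigration probability are placeholder choices.

```python
import random

def coa_toy(n_bits=12, n_pop=10, n_max=20, iters=30, seed=1):
    """Toy discrete COA loop: maximize the number of 1-bits (stand-in profit)."""
    rng = random.Random(seed)
    profit = lambda h: sum(h)              # stand-in for classifier accuracy
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(n_pop)]
    for _ in range(iters):
        eggs = []
        for cuckoo in pop:
            for _ in range(rng.randint(2, 4)):          # each cuckoo lays 2-4 eggs
                egg = cuckoo[:]
                for i in rng.sample(range(n_bits), rng.randint(1, 3)):
                    egg[i] ^= 1                         # flip a few bits (small radius)
                eggs.append(egg)
        # eliminate cuckoos in worst habitats: keep the n_max most profitable
        pop = sorted(pop + eggs, key=profit, reverse=True)[:n_max]
        goal = pop[0]                                   # best habitat = goal point
        for h in pop[1:]:                               # immigration toward the goal
            for i in range(n_bits):
                if rng.random() < 0.3:
                    h[i] = goal[i]
    return max(pop, key=profit)

best = coa_toy()
```

On this trivial objective the population converges to (nearly) all-ones strings within a few dozen iterations, mirroring the convergence behaviour described above.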
3. The proposed approach
In this section, the proposed algorithm for feature selection is presented. The steps of the proposed approach are described in detail in the following subsections.
3.1. Generating initial cuckoo habitat
In the genetic algorithm and particle swarm optimization, each candidate solution is called a chromosome and a particle position, respectively. In COA, it is called a habitat. In an N-dimensional problem, a habitat is a 1×N array representing the current living position of a cuckoo[10]:

habitat = (x1, x2, ..., xN)

In the proposed approach for feature selection, each habitat is a string of binary numbers. When the value of a variable is 1, the corresponding feature is selected; when it is 0, the feature is not selected. Figure 2 shows an example of the feature representation as a habitat in the proposed approach. The profit of a habitat is defined as the classifier accuracy. Many classifiers can be used to calculate the profit; for example, K-nearest neighbor (KNN), neural networks (NN), and support vector machines (SVM) are three popular classifiers. SVM and NN are powerful classifiers, but they take a long time to build, and NN is also sensitive to weight initialization. KNN, which is simpler and quicker than the other classifiers, was therefore chosen to compute the profit value.

Habitat:         F1  F2  F3  ...  Fn-1  Fn
                 0   1   0   ...  1     0
Feature subset:  {F2, Fn-1}

Figure 2: Example of feature representation in the proposed approach
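The binary encoding and the KNN-based profit described above can be sketched as follows. This is a hedged illustration, not the paper's code: `knn_accuracy` is a minimal hand-rolled KNN, and the data layout (rows as `(features, label)` pairs) is an assumption made for the example.

```python
def knn_accuracy(train, test, k=3):
    """Accuracy of a plain KNN classifier; rows are (feature list, label) pairs."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))  # squared Euclidean
    correct = 0
    for x, y in test:
        neighbours = sorted(train, key=lambda r: dist(r[0], x))[:k]
        votes = [label for _, label in neighbours]
        if max(set(votes), key=votes.count) == y:       # majority vote
            correct += 1
    return correct / len(test)

def profit(habitat, train, test):
    """Profit of a habitat = KNN accuracy on the selected feature subset."""
    idx = [i for i, bit in enumerate(habitat) if bit == 1]
    if not idx:
        return 0.0                                      # empty subset: no information
    mask = lambda rows: [([f[i] for i in idx], y) for f, y in rows]
    return knn_accuracy(mask(train), mask(test))
```

For example, on a toy dataset where only the first feature separates the classes, the habitat `[1, 0]` scores higher than `[0, 1]`, which is exactly the signal the search exploits.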
The algorithm starts with N_pop initial habitats generated randomly. A habit of real cuckoos is that they lay eggs within a maximum distance from their habitat[10]. This maximum range is called the Egg Laying Radius (ELR) and is defined as:

ELR = alpha × (number of current cuckoo's eggs / total number of eggs) × (var_hi - var_lo)

where alpha is an integer, and var_hi and var_lo are the upper and lower bounds of the variables, respectively. According to the above equation, ELR is proportional to the number of the current cuckoo's eggs, the total number of eggs, and the variable limits.
3.2. Cuckoo's egg laying
Each cuckoo starts laying eggs randomly in some other host birds' nests within the range of its ELR. Figure 3 gives a clear view of this concept.
Figure 3: Random egg laying in ELR; the central red star is the initial habitat of a cuckoo with 5 eggs; the pink stars are the new nests of the eggs[10].
After the egg laying process, eggs with lower profit values are detected and destroyed. The other eggs grow in the host nests, hatch, and are fed by the host birds. Interestingly, only one egg in each nest has the chance to grow, because the cuckoo chick eats most of the food the host bird brings to the nest[10].
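The ELR formula and a discrete egg-laying step can be sketched as below. This is an assumed adaptation for illustration: for a binary habitat the variable bounds are taken as 0 and the string length, so the ELR becomes the maximum number of bits an egg may differ from its parent.

```python
import math
import random

def egg_laying_radius(n_own_eggs, n_total_eggs, var_hi, var_lo, alpha=1):
    """ELR = alpha * (own eggs / total eggs) * (var_hi - var_lo)."""
    return alpha * (n_own_eggs / n_total_eggs) * (var_hi - var_lo)

def lay_eggs(habitat, n_own_eggs, n_total_eggs, rng, alpha=1):
    """Lay eggs as bit-flipped copies of a binary habitat within a discrete ELR.
    Assumption: bounds 0..len(habitat) for a binary string."""
    n = len(habitat)
    radius = max(1, math.floor(egg_laying_radius(n_own_eggs, n_total_eggs, n, 0, alpha)))
    eggs = []
    for _ in range(n_own_eggs):
        egg = list(habitat)
        for i in rng.sample(range(n), rng.randint(1, min(radius, n))):
            egg[i] ^= 1                    # flip up to `radius` distinct bits
        eggs.append(egg)
    return eggs
```

A cuckoo with 3 of 12 total eggs on an 8-bit habitat gets ELR = 1 × (3/12) × 8 = 2, so each of its eggs differs from the parent in one or two positions.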
3.3. Immigration of cuckoos
When cuckoos grow and become mature, they live in their own societies; but when egg laying time comes, they immigrate to new and better societies where the eggs have more similarity to those of the host birds. After cuckoo groups have formed in different areas, the society with the best profit value is selected as the goal point toward which the other cuckoos immigrate[10]. Because it is difficult to distinguish which cuckoo belongs to which group, clustering is applied; after the cuckoos are grouped, the group with the maximum mean profit determines the goal group. As previously mentioned, cuckoos improve their habitats for egg laying by moving toward the goal point. The original version of COA operates on continuous problems. Since feature selection is a discrete problem, a new immigration method suitable for discrete problems is presented in the current research. This operator is given in Figure 4.

For each habitat do
    Calculate the city block distance (D) between the habitat and the goal point
    Create a binary string (S) of length N with an initial value of zero
    Assign 1 to a number of cells of S proportional to D
    Copy the cells of the goal point corresponding to the locations of the 1s in S to the same positions in the habitat
End

Figure 4: Proposed method for immigrating cuckoos toward the goal point in a discrete problem
3.4. Eliminating cuckoos in worst habitats
Due to the population equilibrium in birds, a new parameter N_max is defined that limits the maximum number of live cuckoos in the society. To model this limitation, only the N_max cuckoos with the best profit values survive; the other cuckoos die.
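The discrete immigration operator of Figure 4 can be sketched as follows. The proportionality constant `rate` (how many positions to copy per unit of distance) is an assumed parameter; the paper does not fix it, so 0.5 below is a placeholder.

```python
import random

def immigrate(habitat, goal, rng, rate=0.5):
    """Move a binary habitat toward the goal point (discrete analogue of Fig. 4)."""
    n = len(habitat)
    # city block (Manhattan) distance between the two binary strings
    d = sum(abs(a - b) for a, b in zip(habitat, goal))
    n_copy = round(rate * d)              # positions to copy, proportional to D
    positions = rng.sample(range(n), min(n_copy, n)) if n_copy else []
    new = list(habitat)
    for i in positions:
        new[i] = goal[i]                  # copy goal bits at the selected positions
    return new
```

Note that a habitat identical to the goal (D = 0) is left unchanged, and `rate=1.0` copies every differing position, landing the cuckoo exactly on the goal point.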
3.5. Convergence
After some iterations, the whole cuckoo population moves toward the best habitat, where the eggs have maximum similarity to those of the host birds.
4. Results and Discussions
To test the proposed algorithm, the K nearest neighbor classifier[11] is used. This classifier classifies instances based on their similarity to instances in the training data. To evaluate the proposed method, the following datasets from the UCI repository[12] were chosen:
Iris: each class refers to a type of iris plant.
Wine: data from a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars[12].
Pima: Pima Indians diabetes data, with instances belonging to the classes healthy and diabetic.
Glass identification.
Breast cancer: features in this dataset are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image[12].
Table 1 shows the characteristics of the used datasets. Some of the datasets contain missing values. To replace missing data with substituted values, an approach based on the nearest neighbor algorithm was used[13]. In this approach, a missing value is replaced with the corresponding feature value from the nearest neighbor instance, i.e. the closest instance in Euclidean distance.
To test the efficiency of the proposed method, a K-fold cross validation procedure was used. In this procedure, the dataset is randomly divided into K disjoint parts of approximately equal size. The classifier is trained on K-1 parts and then tested on the remaining part. This process is repeated K times (K folds), with each of the K parts used exactly once as the test data. The average of the K results from the folds is taken as a single estimate.
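The nearest neighbour imputation step described above can be sketched in a few lines. This is an illustrative reimplementation under assumptions, not the code of [13]: rows are lists with `None` marking missing entries, and distance is computed over the features both rows actually have.

```python
import math

def impute_nearest_neighbour(rows):
    """Replace None entries with the value from the nearest complete row
    (Euclidean distance over the features shared by both rows)."""
    def dist(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        return math.sqrt(sum((x - y) ** 2 for x, y in shared)) if shared else math.inf
    complete = [r for r in rows if None not in r]
    filled = []
    for r in rows:
        if None not in r:
            filled.append(list(r))        # complete rows pass through unchanged
            continue
        nearest = min(complete, key=lambda c: dist(r, c))
        filled.append([v if v is not None else nearest[i] for i, v in enumerate(r)])
    return filled
```

For instance, a row `[1.1, None]` sitting next to the complete row `[1.0, 2.0]` has its missing entry filled with 2.0.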
An example of the process of the cuckoo feature selection searching for the optimal solution is given in Figures 5-7, where it can be seen that the average classification error decreases, indicating the convergence of the proposed algorithm. In
Figure 5, the average classification error over all habitats at each iteration t is shown. Figure 6 presents the minimum classification error of all habitats at each iteration t. Figure 7 shows the evolution of the search for the best number of features. The classification accuracy was calculated for each dataset before and after feature selection. Table 2 shows the results of feature selection on the above mentioned datasets. The reported results are the averages of the 10 folds of the K-fold cross validation procedure. The results of the proposed approach are compared with four feature selection approaches: forward feature selection (FFS)[14], backward feature selection (BFS)[14], genetic algorithm based feature selection (GA-FS), and particle swarm based feature selection (PSO-FS). According to the results, classification with feature selection improves classification performance. In addition, the proposed approach showed improvement on the majority of datasets compared to the other methods.

Table 1: Description of the used datasets

Dataset               Number of features   Number of classes   Number of instances   Missing values
Wine                  13                   3                   178                   No
Iris                  3                    3                   150                   No
Glass identification  10                   7                   214                   No
Pima                  8                    2                   768                   Yes
Breast cancer         10                   2                   699                   Yes
Figure 5: Average classification error for each iteration in A. Iris, B. Wine, C. Pima, D. Glass identification, and E. Breast cancer datasets
Figure 6: Minimum classification error for each iteration in A. Iris, B. Wine, C. Pima, D. Glass identification, and E. Breast cancer datasets
Figure 7: Best number of features for each iteration in A. Iris, B. Wine, C. Pima, D. Glass identification, and E. Breast cancer datasets
Table 2: Classification results using the proposed approach

Dataset               No. of original  KNN without  KNN with  KNN with  KNN with  KNN with  KNN with
                      features         FS*          FFS       BFS       GA-FS     PSO-FS    COA-FS
Wine                  13               0.8611       0.8993    0.8800    0.9670    0.9664    0.9778
Iris                  4                0.9231       0.9332    0.9433    0.9507    0.9732    0.9873
Glass identification  10               0.7091       0.7432    0.7565    0.8095    0.8245    0.8318
Pima                  8                0.6414       0.6818    0.6532    0.7368    0.7392    0.7260
Breast cancer         8                0.9243       0.9516    0.9432    0.9653    0.9669    0.9729

* FS: feature selection

Conclusion
In this paper, a new approach based on the Cuckoo Optimization Algorithm (COA) for feature subset selection was presented. In the proposed approach, feature subsets are encoded as binary strings. The COA based method was evaluated on five well-known classification problems. The experimental results showed that the proposed approach performs well in searching for a reduced set of features. In the future, COA can be combined with other intelligent classifiers such as support vector machines.
References
1. Emmert-Streib, F. and M. Dehmer, Information Theory and Statistical Learning. 2008: Springer-Verlag New York.
2. Yang, J. and V. Honavar, Feature subset selection using a genetic algorithm, in Feature Extraction, Construction and Selection. 1998, Springer. p. 117-136.
3. Leardi, R., Application of a genetic algorithm to feature selection under full validation conditions and to outlier detection. Journal of Chemometrics, 1994. 8(1): p. 65-79.
4. Uğuz, H., A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm. Knowledge-Based Systems, 2011. 24(7): p. 1024-1032.
5. Wang, X., et al., Feature selection based on rough sets and particle swarm optimization. Pattern Recognition Letters, 2007. 28(4): p. 459-471.
6. Unler, A. and A. Murat, A discrete particle swarm optimization method for feature selection in binary classification problems. European Journal of Operational Research, 2010. 206(3): p. 528-539.
7. Aghdam, M.H., N. Ghasem-Aghaee, and M.E. Basiri, Text feature selection using ant colony optimization. Expert Systems with Applications, 2009. 36(3): p. 6843-6853.
8. Ahmed, A.-A., Feature subset selection using ant colony optimization. 2005.
9. MousaviRad, S., F.A. Tab, and K. Mollazade, Application of Imperialist Competitive Algorithm for Feature Selection: A Case Study on Bulk Rice Classification. International Journal of Computer Applications, 2012. 40(16).
10. Rajabioun, R., Cuckoo optimization algorithm. Applied Soft Computing, 2011. 11(8): p. 5508-5518.
11. Bishop, C.M., Pattern Recognition and Machine Learning. Vol. 1. 2006: Springer New York.
12. Asuncion, A. and D.J. Newman, UCI machine learning repository. 2007.
13. Hastie, T., et al., Imputing missing data for gene expression arrays. 1999, Stanford University Statistics Department Technical Report.
14. Kittler, J., Feature selection and extraction. Handbook of Pattern Recognition and Image Processing, 1986: p. 59-83.