Wrapper Feature Selection using Discrete Cuckoo Optimization Algorithm

S.J. Mousavirad and H. Ebrahimpour-Komleh*
Department of Computer and Electrical Engineering, University of Kashan, Kashan, Iran
*Corresponding Author's E-mail: ebrahimpour@gmail.com

Abstract

Feature subset selection plays an important role in data mining. The aim of feature selection is to remove redundant and irrelevant features without reducing classification accuracy. The cuckoo optimization algorithm (COA) is a new population-based algorithm inspired by the lifestyle of a bird species called the cuckoo. In this paper, we introduce a new approach based on COA for feature subset selection. To verify the efficiency of our algorithm, experiments were carried out on several datasets. The results demonstrate that the proposed algorithm can provide an optimal solution for the feature subset selection problem.

Keywords: Feature Selection, Cuckoo Optimization Algorithm, Population-based Algorithms, Data Mining

1. Introduction

Feature subset selection, or feature selection, is one of the main steps in the data mining process. It is the process of selecting a subset of relevant features without reducing classification accuracy. Finding the relevant features for a problem with N features requires evaluating 2^N possible subsets. This exhaustive approach can be very demanding and time consuming. There are other approaches, based on heuristic or random search, that attempt to reduce the computational complexity. Algorithms for feature selection fall into three broad categories: wrapper methods, which use a learning algorithm to evaluate features[1]; filter methods, which evaluate features according to their statistical properties; and embedded methods, which perform feature selection during training. Wrapper-based approaches use the learning algorithm as a fitness function and search for the best subset of features in the space of all feature subsets.
Moreover, the selected features can be compared with previously selected candidates and replace them if found to be better[1]. Among the many methods proposed for wrapper feature selection, population-based optimization algorithms such as the genetic algorithm[2-4], particle swarm optimization[5, 6], ant colony optimization[7, 8] and the imperialist competitive algorithm[9] have attracted a lot of attention. These methods attempt to find a better solution through an iterative process. The cuckoo optimization algorithm (COA) is a novel population-based algorithm inspired by the lifestyle of a bird species called the cuckoo[10]. The algorithm is based on the anomalous egg laying and breeding of cuckoos. The current paper is the first attempt to apply the cuckoo optimization algorithm to feature selection.

The rest of this paper is organized as follows. First, a brief description of COA is given. Then the proposed algorithm for feature selection using COA is presented. In the next section, the experiments and results are reported. Finally, several conclusions are drawn.

2. Brief Description of the Cuckoo Optimization Algorithm

COA is a novel population-based algorithm inspired by the life of a bird species called the cuckoo. The algorithm is based on the anomalous egg laying and breeding of cuckoos. In the algorithm, cuckoos appear in two forms: mature cuckoos and eggs[10]. Like other population-based algorithms, COA starts with an initial population of cuckoos. This initial population of mature cuckoos lays eggs in the nests of some host birds. Eggs that are more similar to the host bird's own eggs have the opportunity to grow up and become mature cuckoos; eggs with less similarity are detected by the host birds and destroyed. The more eggs survive in an area, the more profit is gathered in that area, so the cuckoos search for the best area in which to lay eggs. After the intact eggs grow into mature cuckoos, they form societies, and cuckoos in other societies immigrate toward the most appropriate one.

They then inhabit somewhere near the best habitat of the most appropriate society. According to the number of eggs each cuckoo has and its distance to the best habitat, an egg laying radius is assigned to it. Each cuckoo then lays eggs in random nests inside her egg laying radius. This process continues iteratively until the best position with the maximum profit value is obtained and most of the cuckoo population has gathered around that position[10]. Figure 1 shows the flowchart of COA (start; initialize cuckoos with eggs; determine the egg laying radius for each cuckoo; lay eggs in different nests; move all cuckoos toward the best habitat; repeat until convergence).
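The COA cycle summarized above can be sketched as a loop over binary habitats. The following is a deliberately simplified toy sketch, not the paper's exact procedure: the egg counts, the single-bit "laying" step and the toy profit function are our own illustrative assumptions, and the clustering/immigration step is omitted here.

```python
import random

def coa_feature_selection(n_features, profit, n_pop=5, n_max=10, max_iter=20):
    """Simplified COA loop: initialize habitats, lay eggs near each habitat,
    keep the n_max most profitable cuckoos, and track the best habitat."""
    # Initialize cuckoos with random binary habitats
    population = [[random.randint(0, 1) for _ in range(n_features)]
                  for _ in range(n_pop)]
    best = max(population, key=profit)
    for _ in range(max_iter):
        # Lay eggs: each cuckoo creates slightly mutated copies of its habitat
        eggs = []
        for habitat in population:
            for _ in range(random.randint(2, 4)):   # eggs per cuckoo (assumed)
                egg = habitat[:]
                i = random.randrange(n_features)    # egg laid "near" the nest
                egg[i] = 1 - egg[i]
                eggs.append(egg)
        # Survival: at most n_max cuckoos with the best profit values remain
        population = sorted(population + eggs, key=profit, reverse=True)[:n_max]
        if profit(population[0]) > profit(best):
            best = population[0]
    return best

# Toy profit: reward habitats close to a known-good feature subset
target = [1, 0, 1, 0, 0]
profit = lambda h: -sum(a != b for a, b in zip(h, target))
random.seed(0)
best = coa_feature_selection(5, profit)
```

In the full algorithm the profit of a habitat is the accuracy of a classifier trained on the selected features, as described in Section 3.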

Figure 1: Flowchart of the Cuckoo Optimization Algorithm

3. The proposed approach

In this section, the proposed algorithm for feature selection is presented. The steps of the proposed approach are considered in detail in the following subsections.

3.1. Generating the initial cuckoo habitat

In the genetic algorithm and particle swarm optimization, each solution is called a chromosome and a particle position, respectively; in COA, it is called a habitat. In an N-dimensional problem, a habitat is a 1×N array representing the current living position of a cuckoo[10]:

habitat = (x1, x2, ..., xN)

In the proposed approach for feature selection, each habitat is a string of binary numbers. When the value of a variable is 1, the corresponding feature is selected; when it is 0, the feature is not selected. Figure 2 shows the feature representation as a habitat in the proposed approach. The profit of a habitat is defined as the classifier accuracy. Many classifiers can be used to calculate the profit; for example, K-nearest neighbor (KNN), neural networks (NN) and support vector machines (SVM) are three popular classifiers. SVM and NN are powerful classifiers, but they take too long to build, and NN is also sensitive to weight initialization. KNN was therefore chosen to compute the profit value, since it is simpler and quicker than the other classifiers.

          F1   F2   F3   ...   Fn-1   Fn
Habitat:   0    1    0   ...    1      0

Feature subset: {F2, ..., Fn-1}

Figure 2: Example of the feature representation in the proposed approach
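The habitat encoding and KNN-based profit described above can be sketched as follows. The toy dataset and the leave-one-out evaluation are illustrative assumptions; the paper's experiments use UCI datasets and cross-validation.

```python
def decode(habitat):
    """Indices of the selected features (bits set to 1)."""
    return [i for i, bit in enumerate(habitat) if bit == 1]

def knn_profit(habitat, X, y, k=1):
    """Profit of a habitat = leave-one-out k-NN accuracy computed on the
    selected features only (toy evaluation for illustration)."""
    feats = decode(habitat)
    if not feats:
        return 0.0
    correct = 0
    for i in range(len(X)):
        # squared Euclidean distance to every other instance, selected features only
        dists = sorted(
            (sum((X[i][f] - X[j][f]) ** 2 for f in feats), y[j])
            for j in range(len(X)) if j != i)
        neighbours = [label for _, label in dists[:k]]
        pred = max(set(neighbours), key=neighbours.count)  # majority vote
        correct += (pred == y[i])
    return correct / len(X)

# Tiny illustration: feature 0 separates the classes, feature 1 is noise
X = [(0.0, 5.0), (0.1, 1.0), (1.0, 5.2), (1.1, 0.9)]
y = ['a', 'a', 'b', 'b']
acc = knn_profit([1, 0], X, y)   # habitat selecting only the informative feature
```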

The algorithm starts with a population of N_pop habitats initialized randomly. A habit of real cuckoos is that they lay eggs within a maximum distance from their habitat[10]. This maximum range is called the egg laying radius (ELR) and is defined as:

ELR = α × (Number of current cuckoo's eggs / Total number of eggs) × (var_hi − var_low)

where α is an integer, and var_hi and var_low are the upper and lower bounds of the variables, respectively. According to this equation, the ELR is proportional to the current cuckoo's share of the total number of eggs and to the variable limits.

3.2. Cuckoo's egg laying

Each cuckoo starts laying eggs randomly in some other host birds' nests within the range of her ELR. Figure 3 gives a clear view of this concept.

Figure 3: Random egg laying in the ELR; the central red star is the initial habitat of a cuckoo with 5 eggs, and the pink stars are the eggs' new nests[10].

After the egg laying process, eggs with lower profit values are detected and destroyed. The other eggs grow in the host nests, hatch and are fed by the host birds. Interestingly, only one egg in each nest has the chance to grow, because the cuckoo's chick eats most of the food the host bird brings to the nest[10].
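The ELR equation above translates directly into code; the parameter values used here are purely illustrative.

```python
def egg_laying_radius(own_eggs, total_eggs, var_hi, var_low, alpha=1):
    """ELR = alpha * (cuckoo's eggs / total eggs) * (var_hi - var_low).
    alpha is an integer scaling factor; var_hi/var_low bound the variables."""
    return alpha * (own_eggs / total_eggs) * (var_hi - var_low)

# A cuckoo holding 5 of 20 eggs, variables bounded in [0, 1]
r = egg_laying_radius(own_eggs=5, total_eggs=20, var_hi=1.0, var_low=0.0)
```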

3.3. Immigration of cuckoos

When cuckoos grow and become mature, they live in their own society; but at egg laying time they immigrate to a new and better society, where the eggs have more similarity to those of the host birds. After the cuckoo groups have formed in the different areas, the society with the best profit value is selected as the goal point toward which the other cuckoos immigrate[10]. Because it is difficult to distinguish which cuckoo belongs to which group, clustering is applied; after grouping the cuckoos, the group with the maximum mean profit determines the goal group. As previously mentioned, the cuckoos improve their habitats for egg laying by all moving toward the goal point.

The original version of COA operates on continuous problems. Since feature selection is a discrete problem, the current research presents a new immigration method suitable for discrete problems. This operator is given in Figure 4.

For each habitat do
    Calculate the city block distance (D) between the habitat and the goal point
    Create a binary string (S) of length N with initial value of zero
    Assign 1 to a number of array cells proportional to D
    Copy the cells of the goal point corresponding to the locations of the 1s in S to the same positions in the habitat
End

Figure 4: Proposed method for immigrating cuckoos toward the goal point in a discrete problem

3.4. Eliminating cuckoos in worst habitats

Due to the population equilibrium among birds, a new parameter N_max is defined that limits the maximum number of live cuckoos in the society. To model this limitation, the N_max cuckoos with the best profit values survive, and the other cuckoos die.
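The discrete immigration operator of Figure 4 can be sketched as follows. The paper only says the number of copied positions is "proportional to D"; using D // 2 as that number is our own assumption.

```python
import random

def immigrate(habitat, goal):
    """Move a binary habitat toward the goal point (Figure 4):
    mark a number of random positions proportional to the city block
    distance D, and copy the goal's bits at the marked positions."""
    n = len(habitat)
    # city block (Hamming) distance between habitat and goal
    d = sum(h != g for h, g in zip(habitat, goal))
    # number of positions to mark, assumed proportional to D (here D // 2)
    n_copy = max(1, d // 2) if d > 0 else 0
    positions = set(random.sample(range(n), n_copy))
    # copy the goal's bits at the marked positions into the habitat
    return [goal[i] if i in positions else habitat[i] for i in range(n)]

random.seed(1)
moved = immigrate([0, 0, 0, 0], [1, 1, 1, 1])
```

After one application the habitat is strictly closer (in Hamming distance) to the goal whenever it differs from it, which is what drives the convergence described in Section 3.5.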

3.5. Convergence

After some iterations, the whole cuckoo population moves toward the best habitat, with maximum similarity of the eggs to those of the host birds.

4. Results and Discussion

To test the proposed algorithm, the K-nearest neighbor classifier[11] is used. This classifier classifies instances based on their similarity to the instances in the training data. To evaluate the proposed method, the following datasets from the UCI repository[12] were chosen:

Iris: each class refers to a type of iris plant.
Wine: data from a chemical analysis of wines grown in the same region of Italy but derived from three different cultivars[12].
Pima: Pima Indians diabetes data with two classes, healthy and diabetic.
Glass identification.
Breast cancer: the features in this dataset are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass and describe characteristics of the cell nuclei present in the image[12].

Table 1 shows the characteristics of the used datasets. Some of them contain missing values. To replace missing data with substituted values, an approach based on the nearest neighbor algorithm was used[13]: a missing value is replaced with the corresponding feature value of the nearest neighbor instance, where the nearest neighbor is the closest instance in Euclidean distance.

To test the efficiency of the proposed method, a K-fold cross-validation procedure was used. In this procedure, the dataset is randomly divided into K disjoint parts of approximately equal size. The classifier is trained with K−1 parts and then tested on the remaining part. This process is repeated K times (K folds), with each of the K parts used exactly once as the test data. The average of the K results from the folds is then taken as a single estimate.

An example of the cuckoo feature selection process searching for the optimal solution is given in Figures 5-7, where it can be seen that the average classification error decreases, indicating the convergence of the proposed algorithm. In Figure 5, the average classification error over all habitats at each iteration t is shown. Figure 6 presents the minimum classification error of all habitats at each iteration t. Figure 7 shows the evolution of the search for the best number of features.

The classification accuracy was calculated for each dataset before and after feature selection. Table 2 shows the results of feature selection on the datasets mentioned above; the reported results are the average over the 10 folds of the K-fold cross-validation procedure. The results of the proposed approach are compared with four other feature selection approaches: forward feature selection (FFS)[14], backward feature selection (BFS)[14], genetic-algorithm-based feature selection (GA-FS), and particle-swarm-based feature selection (PSO-FS). According to the results, classification with feature selection improves classification performance. In addition, the proposed approach showed an improvement on the majority of datasets compared with the other methods.

Table 1: Description of the used datasets

Dataset               Features  Classes  Instances  Missing values
Wine                  13        3        178        No
Iris                  4         3        150        No
Glass identification  10        7        214        No
Pima                  8         2        768        Yes
Breast cancer         10        2        699        Yes
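The K-fold cross-validation procedure used for the experiments can be sketched as follows. For simplicity this sketch splits indices deterministically rather than randomly, and uses a dummy scorer in place of a real train-and-test step.

```python
def k_fold_splits(n, k):
    """Divide indices 0..n-1 into K disjoint parts of approximately equal size
    (a deterministic round-robin split; the paper shuffles randomly first)."""
    return [list(range(i, n, k)) for i in range(k)]

def cross_validate(n, k, train_and_score):
    """Each part serves as the test set exactly once; the classifier is
    trained on the remaining K-1 parts, and the K accuracies are averaged."""
    folds = k_fold_splits(n, k)
    scores = []
    for fold in folds:
        test_idx = set(fold)
        train_idx = [i for i in range(n) if i not in test_idx]
        scores.append(train_and_score(train_idx, fold))
    return sum(scores) / k

# Dummy scorer that ignores the index lists and always reports 0.9 accuracy
estimate = cross_validate(100, 10, lambda train, test: 0.9)
```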

Figure 5: Average classification error at each iteration for the A. Iris, B. Wine, C. Pima, D. Glass identification, and E. Breast cancer datasets

Figure 6: Minimum classification error at each iteration for the A. Iris, B. Wine, C. Pima, D. Glass identification, and E. Breast cancer datasets

Figure 7: Best number of features at each iteration for the A. Iris, B. Wine, C. Pima, D. Glass identification, and E. Breast cancer datasets

Table 2: Classification results (accuracy) using the proposed approach

Dataset               Original features  KNN without FS*  KNN+FFS  KNN+BFS  KNN+GA-FS  KNN+PSO-FS  KNN+COA-FS
Wine                  13                 0.8611           0.8993   0.8800   0.9670     0.9664      0.9778
Iris                  4                  0.9231           0.9332   0.9433   0.9507     0.9732      0.9873
Glass identification  10                 0.7091           0.7432   0.7565   0.8095     0.8245      0.8318
Pima                  8                  0.6414           0.6818   0.6532   0.7368     0.7392      0.7260
Breast cancer         8                  0.9243           0.9516   0.9432   0.9653     0.9669      0.9729

* FS: feature selection

Conclusion

In this paper, a new approach based on the cuckoo optimization algorithm (COA) for feature subset selection was presented. In the proposed approach, features are encoded as binary strings. The COA-based method was evaluated on five well-known classification problems. The experimental results showed that the proposed approach performs well in searching for a reduced set of features. In the future, COA could be combined with other intelligent classifiers such as support vector machines.

References

1. Emmert-Streib, F. and M. Dehmer, Information Theory and Statistical Learning. 2008: Springer-Verlag New York.
2. Yang, J. and V. Honavar, Feature subset selection using a genetic algorithm, in Feature Extraction, Construction and Selection. 1998, Springer. p. 117-136.
3. Leardi, R., Application of a genetic algorithm to feature selection under full validation conditions and to outlier detection. Journal of Chemometrics, 1994. 8(1): p. 65-79.
4. Uğuz, H., A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm. Knowledge-Based Systems, 2011. 24(7): p. 1024-1032.

5. Wang, X., et al., Feature selection based on rough sets and particle swarm optimization. Pattern Recognition Letters, 2007. 28(4): p. 459-471.
6. Unler, A. and A. Murat, A discrete particle swarm optimization method for feature selection in binary classification problems. European Journal of Operational Research, 2010. 206(3): p. 528-539.
7. Aghdam, M.H., N. Ghasem-Aghaee, and M.E. Basiri, Text feature selection using ant colony optimization. Expert Systems with Applications, 2009. 36(3): p. 6843-6853.
8. Ahmed, A.-A., Feature subset selection using ant colony optimization. 2005.
9. MousaviRad, S., F.A. Tab, and K. Mollazade, Application of imperialist competitive algorithm for feature selection: a case study on bulk rice classification. International Journal of Computer Applications, 2012. 40(16).
10. Rajabioun, R., Cuckoo optimization algorithm. Applied Soft Computing, 2011. 11(8): p. 5508-5518.
11. Bishop, C.M., Pattern Recognition and Machine Learning. Vol. 1. 2006: Springer New York.
12. Asuncion, A. and D.J. Newman, UCI Machine Learning Repository. 2007.
13. Hastie, T., et al., Imputing missing data for gene expression arrays. 1999, Stanford University Statistics Department technical report.
14. Kittler, J., Feature selection and extraction. Handbook of Pattern Recognition and Image Processing, 1986: p. 59-83.