CHAPTER 4 FEATURE SELECTION USING GENETIC ALGORITHM

In this research work, a Genetic Algorithm is used for feature selection. The following sections explain how a Genetic Algorithm works and how it is applied to feature selection.

4.1 Genetic Algorithm

A genetic algorithm (GA) is a search heuristic that mimics the process of natural evolution. This heuristic is routinely used to generate useful solutions to optimization and search problems. Genetic algorithms belong to the larger class of evolutionary algorithms (EA), which generate solutions to optimization problems using techniques inspired by natural evolution, such as inheritance, mutation, selection, and crossover [56-57].

4.1.1 Methodology

In a genetic algorithm, a population of strings (called chromosomes or the genotype of the genome), which encode candidate solutions (called individuals, creatures, or phenotypes) to an optimization problem, evolves toward better solutions. Traditionally, solutions are represented in binary as strings of 0s and 1s, but other encodings are also possible. The evolution usually starts from a population of randomly generated individuals and proceeds in generations. In each generation, the fitness of every individual in the population is evaluated, multiple individuals are stochastically selected from the current population (based on their fitness), and modified (recombined and possibly randomly mutated) to form a new population. The new population is then used in the next iteration of the algorithm. Commonly, the algorithm terminates when either a maximum number of generations has been produced, or a satisfactory fitness level has been reached for the population. If the algorithm has terminated due to a maximum number of generations, a satisfactory solution may or may not have been reached.

Genetic algorithms find application in bioinformatics, computational science, engineering, economics, chemistry, manufacturing, mathematics, physics and other fields.

A typical genetic algorithm requires:
- a genetic representation of the solution domain, and
- a fitness function to evaluate the solution domain.

A standard representation of the solution is an array of bits. Arrays of other types and structures can be used in essentially the same way. The main property that makes these genetic representations convenient is that their parts are easily aligned due to their fixed size, which facilitates simple crossover operations. Variable-length representations may also be used, but crossover implementation is more complex in this case. Tree-like representations are explored in genetic programming and graph-form representations are explored in evolutionary programming.

The fitness function is defined over the genetic representation and measures the quality of the represented solution. The fitness function is always problem dependent. For instance, in the knapsack problem one wants to maximize the total value of objects that can be put in a knapsack of some fixed capacity. A representation of a solution might be an array of bits, where each bit represents a different object, and the value of the bit (0 or 1) represents whether or not the object is in the knapsack. Not every such representation is valid, as the total size of the selected objects may exceed the capacity of the knapsack. The fitness of the solution is the sum of values of all objects in the knapsack if the representation is valid, or 0 otherwise. In some problems, it is hard or even impossible to define the fitness expression; in these cases, interactive genetic algorithms are used.

Once the genetic representation and the fitness function are defined, the GA initializes a population of solutions randomly and then improves it through repeated application of the mutation, crossover, inversion and selection operators.

4.1.2 Initialization

Initially, many individual solutions are randomly generated to form an initial population. The population size depends on the nature of the problem, but typically contains several hundreds or thousands of possible solutions. Traditionally, the population is generated randomly, covering the entire range of possible solutions (the search space). Occasionally, the solutions may be "seeded" in areas where optimal solutions are likely to be found.

4.1.3 Selection

During each successive generation, a proportion of the existing population is selected to breed a new generation. Individual solutions are selected through a fitness-based process, where fitter solutions (as measured by a fitness function) are typically more likely to be selected. Certain selection methods rate the fitness of each solution and preferentially select the best solutions. Other methods rate only a random sample of the population, as rating every solution may be very time-consuming.
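The knapsack fitness, the random initialization and the fitness-based selection described above can be sketched as follows. This is a minimal illustration in Python, not the implementation used in this work; the function names, the roulette-wheel selection rule and the item values are assumptions chosen for the example.

```python
import random

def knapsack_fitness(bits, values, weights, capacity):
    """Sum of the selected item values, or 0 if the selection exceeds capacity."""
    total_weight = sum(w for b, w in zip(bits, weights) if b)
    if total_weight > capacity:
        return 0
    return sum(v for b, v in zip(bits, values) if b)

def random_population(size, n_bits, rng):
    """Initial population of uniformly random bit strings."""
    return [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(size)]

def roulette_select(population, fitnesses, n, rng):
    """Pick n individuals with probability proportional to fitness (assumed rule)."""
    if sum(fitnesses) == 0:
        # No individual is valid yet, so fall back to a uniform choice.
        return [rng.choice(population) for _ in range(n)]
    return rng.choices(population, weights=fitnesses, k=n)

if __name__ == "__main__":
    rng = random.Random(0)
    values, weights, capacity = [10, 20, 30], [1, 2, 3], 4  # hypothetical items
    pop = random_population(6, 3, rng)
    fits = [knapsack_fitness(ind, values, weights, capacity) for ind in pop]
    print(roulette_select(pop, fits, 4, rng))
```

Fitness-proportionate selection is only one of several fitness-based schemes; any rule that favours fitter individuals could be substituted for `roulette_select`.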

4.1.4 Reproduction

The next step is to generate a second-generation population of solutions from those selected, through the genetic operators crossover (also called recombination) and/or mutation. For each new solution to be produced, a pair of "parent" solutions is selected for breeding from the pool selected previously. By producing a "child" solution using crossover and mutation, a new solution is created which typically shares many of the characteristics of its "parents". New parents are selected for each new child, and the process continues until a new population of solutions of appropriate size is generated. Although reproduction methods based on two parents are more "biology inspired", some research suggests that using more than two "parents" can produce chromosomes of better quality.

These processes ultimately result in a next-generation population of chromosomes that is different from the initial generation. Generally, the average fitness of the population will have increased through this procedure, since mainly the best organisms from the first generation are selected for breeding, along with a small proportion of less fit solutions that help to maintain genetic diversity.

Although crossover and mutation are known as the main genetic operators, it is possible to use other operators such as regrouping, colonization-extinction, or migration in genetic algorithms.
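As an illustration of the reproduction step, the sketch below breeds a new population from a selected pool using single-point crossover followed by bit-flip mutation. The choice of these two operators, the function names and the parameters are assumptions made for the example; the operators actually used in this work are described in Sections 4.2 and 4.3.

```python
import random

def single_point_crossover(parent_a, parent_b, rng, point=None):
    """Child takes parent_a's genes before the cut point and parent_b's after it."""
    if point is None:
        point = rng.randint(1, len(parent_a) - 1)
    return parent_a[:point] + parent_b[point:]

def bit_flip_mutation(chromosome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in chromosome]

def breed(pool, size, rate, rng):
    """Build a new population, drawing a fresh pair of parents for every child."""
    children = []
    while len(children) < size:
        a, b = rng.sample(pool, 2)
        child = single_point_crossover(a, b, rng)
        children.append(bit_flip_mutation(child, rate, rng))
    return children
```

A small mutation rate keeps most of the inherited structure intact, which is why a child typically resembles its parents.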

4.1.5 Termination

This generational process is repeated until a termination condition has been reached. Common terminating conditions are:
- A solution is found that satisfies minimum criteria
- Fixed number of generations reached
- Allocated budget (computation time/money) reached
- The highest ranking solution's fitness is reaching or has reached a plateau such that successive iterations no longer produce better results
- Manual inspection
- Combinations of the above

A simple generational genetic algorithm procedure is given below.
1. Choose the initial population of individuals
2. Evaluate the fitness of each individual in that population
3. Repeat on this generation until termination (time limit, sufficient fitness achieved, etc.):
   a. Select the best-fit individuals for reproduction
   b. Breed new individuals through crossover and mutation operations to give birth to offspring
   c. Evaluate the individual fitness of new individuals
   d. Replace least-fit population with new individuals
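The procedure above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the implementation used in this work: it maximises an arbitrary fitness over bit strings, keeps the fitter half of the population as parents, and stops after a fixed number of generations. The name `run_ga` and all parameter values are hypothetical.

```python
import random

def run_ga(fitness, n_bits, pop_size=20, generations=50, mutation_rate=0.05, seed=0):
    """Simple generational GA that maximises `fitness` over bit strings."""
    rng = random.Random(seed)
    # 1. Choose the initial population.
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # 2. Evaluate fitness and rank the individuals, best first.
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        # 3a. Select the best-fit half as parents.
        parents = population[: pop_size // 2]
        # 3b. Breed offspring by single-point crossover and bit-flip mutation.
        offspring = []
        while len(offspring) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, n_bits - 1)
            child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            offspring.append(child)
        # 3c-3d. Offspring replace the least-fit half; they are scored at the next sort.
        population = parents + offspring
    best = max(population, key=fitness)
    return best, history + [fitness(best)]

if __name__ == "__main__":
    # Toy objective: maximise the number of 1s in the string.
    best, history = run_ga(sum, n_bits=12)
    print(best, history[-1])
```

Because the parents are carried over unchanged, the best fitness recorded in `history` never decreases from one generation to the next.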

4.1.6 Variants of Genetic Algorithm

The simplest algorithm represents each chromosome as a bit string. Typically, numeric parameters can be represented by integers, though it is possible to use floating point representations. The floating point representation is natural to evolution strategies and evolutionary programming. The basic algorithm performs crossover and mutation at the bit level. Other variants treat the chromosome as a list of numbers which are indexes into an instruction table, nodes in a linked list, hashes, objects, or any other imaginable data structure. Crossover and mutation are performed so as to respect data element boundaries. For most data types, specific variation operators can be designed. Different chromosomal data types seem to work better or worse for different specific problem domains.

A very successful variant of the general process of constructing a new population is to allow some of the better organisms from the current generation to carry over to the next, unaltered. This strategy is known as elitist selection.

Parallel implementations of genetic algorithms come in two flavours. Coarse-grained parallel genetic algorithms assume a population on each of the computer nodes and migration of individuals among the nodes. Fine-grained parallel genetic algorithms assume an individual on each processor node which acts with neighboring individuals for selection and reproduction. Other variants, like genetic algorithms for online optimization problems, introduce time-dependence or noise in the fitness function.

Genetic algorithms with adaptive parameters (adaptive genetic algorithms, AGAs) are another significant and promising variant. The probabilities of crossover (pc) and mutation (pm) greatly determine the degree of solution accuracy and the convergence speed that genetic algorithms can obtain. Instead of using fixed values of pc and pm, AGAs utilize the population information in each generation and adaptively adjust pc and pm in order to maintain the population diversity as well as to sustain the convergence capacity. In AGA, the adjustment of pc and pm depends on the fitness values of the solutions. In CAGA (clustering-based adaptive genetic algorithm), clustering analysis is used to judge the optimization states of the population, and the adjustment of pc and pm depends on these optimization states.

It can be quite effective to combine GA with other optimization methods. GA tends to be quite good at finding generally good global solutions, but quite inefficient at finding the last few mutations needed to reach the absolute optimum.

4.2 Using Genetic Algorithm for feature selection

This heuristic approach has been chosen because the number of features to consider is large. The objective is first to isolate the most relevant associations of features, and then to classify individuals that have the considered similarities according to these associations.

4.2.1 Introduction

The first phase of this algorithm deals with isolating the very few relevant features from the large set. This is not exactly the classical feature selection problem known in data mining: here, the expectation is that less than 5% of the features have to be selected. The problem is nevertheless close to the classical feature selection problem, and a genetic algorithm is used because genetic algorithms are well adapted to problems with a large number of features.

The genetic algorithm considered here has different phases. It proceeds for a fixed number of generations. A chromosome, here, is a string of bits whose size corresponds to the number of features. A 0 or 1 at position i indicates whether feature i is selected (1) or not (0).

The Genetic Operators

These operators allow GAs to explore the search space. However, operators typically have destructive as well as constructive effects, so they must be adapted to the problem. We use a Subset Size-Oriented Common Feature Crossover Operator (SSOCF), which keeps useful informative blocks and produces offspring that have the same distribution as the parents. Offspring are kept only if they are fitter than the least fit individual of the population. Features shared by the two parents are kept by the offspring, and the non-shared features are inherited by the offspring corresponding to the i-th parent with probability (n_i - n_c)/n_u, where n_i is the number of selected features of the i-th parent, n_c is the number of commonly selected features across both mating partners and n_u is the number of non-shared selected features.

Figure 4.1 The SSOCF Crossover Operator

Mutation is an operator which provides diversity. During the mutation stage, a chromosome has a probability pmut of mutating. If a chromosome is selected to mutate, a number n of bits to be flipped is chosen at random; then n bits are chosen at random and flipped.
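The SSOCF crossover and the mutation operator described above can be sketched as follows. This is one reading of the description, written as an illustrative Python sketch and not the code used in this work; the function names are hypothetical, and producing one child per call (oriented toward the first parent) is an assumption.

```python
import random

def ssocf_child(parent_i, parent_j, rng):
    """One offspring oriented toward parent_i's subset size.

    Bits on which the parents agree are copied; each disagreeing bit is taken
    from parent_i with probability (n_i - n_c) / n_u, otherwise from parent_j.
    """
    n_i = sum(parent_i)
    n_c = sum(1 for a, b in zip(parent_i, parent_j) if a == 1 and b == 1)
    n_u = sum(1 for a, b in zip(parent_i, parent_j) if a != b)
    p_i = (n_i - n_c) / n_u if n_u else 0.0
    child = []
    for a, b in zip(parent_i, parent_j):
        if a == b:
            child.append(a)  # shared feature decision is preserved
        else:
            child.append(a if rng.random() < p_i else b)
    return child

def mutate(chromosome, p_mut, rng):
    """With probability p_mut, flip n distinct random bits, n itself being random."""
    if rng.random() >= p_mut:
        return list(chromosome)
    n = rng.randint(1, len(chromosome))
    mutated = list(chromosome)
    for pos in rng.sample(range(len(chromosome)), n):
        mutated[pos] = 1 - mutated[pos]
    return mutated
```

Calling `ssocf_child(p1, p2, rng)` and `ssocf_child(p2, p1, rng)` gives the two offspring of a mating pair. The two inheritance probabilities sum to one, since (n_1 - n_c) + (n_2 - n_c) = n_u.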

Selection

A probabilistic binary tournament selection is used. Tournament selection holds n tournaments to choose n individuals. Each tournament consists of sampling 2 elements of the population and choosing the better one with a probability p in [0.5, 1].

The Chromosomal Distance

We define a specific distance, a kind of bit-to-bit distance in which not a single bit i but the whole window (i - σ, i + σ) of the two individuals is compared. If one and only one individual has a selected feature in this window, the distance is increased by one.

Sharing

To avoid premature convergence and to discover different good solutions (different relevant associations of features), we use a niching mechanism. Both crowding and sharing give good results, and we choose to implement fitness sharing. The objective is to boost the selection chance of individuals that lie in less crowded areas of the search space. We use a niche count that measures how crowded the neighborhood of a solution is. The fitness of individuals situated in highly concentrated regions of the search space is degraded, and the new fitness value is used for selection in place of the initial fitness value.

Random Immigrant

Random Immigrant is a method that helps to maintain diversity in the population. It should also help to avoid premature convergence. Random immigrant is used as follows: if the best individual remains the same during N generations, each individual of the population whose fitness is below the mean is replaced by a new randomly generated individual.
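The window-based distance, fitness sharing and probabilistic binary tournament described above can be sketched as follows. This Python sketch is illustrative only. The triangular sharing function, the division of raw fitness by the niche count, the maximisation of non-negative fitness and the parameter names `sigma` and `sigma_share` are assumptions, since the text does not give the exact formulas.

```python
import random

def window_distance(a, b, sigma):
    """Count positions whose window holds a selected feature in exactly one individual."""
    dist = 0
    for i in range(len(a)):
        lo, hi = max(0, i - sigma), min(len(a), i + sigma + 1)
        if any(a[lo:hi]) != any(b[lo:hi]):
            dist += 1
    return dist

def shared_fitness(population, fitnesses, sigma, sigma_share):
    """Divide each fitness by a niche count (assumed triangular sharing function)."""
    shared = []
    for ind, fit in zip(population, fitnesses):
        niche = 0.0
        for other in population:
            d = window_distance(ind, other, sigma)
            if d < sigma_share:
                niche += 1.0 - d / sigma_share
        # niche >= 1 because every individual is at distance 0 from itself.
        shared.append(fit / niche)
    return shared

def binary_tournament(population, fitnesses, p, rng):
    """Sample two individuals and return the fitter one with probability p."""
    i, j = rng.sample(range(len(population)), 2)
    better, worse = (i, j) if fitnesses[i] >= fitnesses[j] else (j, i)
    return population[better if rng.random() < p else worse]
```

Passing the output of `shared_fitness` to `binary_tournament` makes individuals in crowded niches less likely to win, which is the intended effect of sharing.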

4.2.2 Filter Approach

The filter approach uses metrics such as Information Gain, Similarity and Relief to assign a fitness value to the individual being evaluated. This approach gives a weight to each of the selected features individually, and the overall fitness value is obtained by suitably combining the individual weights [58-60]. The following two filter-based approaches have been implemented for feature selection using MATLAB.

4.2.3 Relief Algorithm based feature selection

The key point of the Relief algorithm is to evaluate features according to their ability to distinguish between samples that are close to each other. Relief's core concept is that a good feature should keep samples of the same category close together and samples of different categories apart.

In the Relief algorithm, a sample R is first selected at random. Then its nearest neighbor H in the same category (the NearestHit) and its nearest neighbor M in a different category (the NearestMiss) are found. For a given feature x, if the distance between R and H is shorter than the distance between R and M, that is, Diff(x, R, M) > Diff(x, R, H), the feature x is good for differentiation, so its weight is increased. On the contrary, if Diff(x, R, M) < Diff(x, R, H), the weight of the feature is reduced. This procedure is repeated m times to obtain the average weight of each feature. The larger the weight, the better the feature.
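The procedure described above can be sketched in Python as follows. This is an illustrative sketch, not the MATLAB implementation used in this work. It assumes numeric features, a Diff function equal to the absolute difference normalised by the feature's range, and at least two samples in every class; the function name `relief` is hypothetical.

```python
import random

def relief(samples, labels, m, seed=0):
    """Return one Relief weight per feature after m random draws."""
    rng = random.Random(seed)
    n_features = len(samples[0])
    # Feature ranges normalise diff() to [0, 1]; a constant feature gets diff 0.
    spans = [max(s[a] for s in samples) - min(s[a] for s in samples)
             for a in range(n_features)]

    def diff(a, x, y):
        return abs(x[a] - y[a]) / spans[a] if spans[a] else 0.0

    def distance(x, y):
        return sum(diff(a, x, y) for a in range(n_features))

    weights = [0.0] * n_features
    for _ in range(m):
        r = rng.randrange(len(samples))
        hits = [i for i in range(len(samples)) if i != r and labels[i] == labels[r]]
        misses = [i for i in range(len(samples)) if labels[i] != labels[r]]
        hit = min(hits, key=lambda i: distance(samples[r], samples[i]))
        miss = min(misses, key=lambda i: distance(samples[r], samples[i]))
        for a in range(n_features):
            # Reward separation from the nearest miss, penalise distance to the nearest hit.
            weights[a] += (diff(a, samples[r], samples[miss])
                           - diff(a, samples[r], samples[hit])) / m
    return weights
```

On a toy dataset in which the first feature separates the two classes and the second is constant, the first weight is positive and the second is zero, matching the intuition that larger weights mark more discriminative features.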

The pseudo-code of Relief is given below:

Input: training set D, iterations m
Output: the weight value vector W[A]
Set all the weight values W[A] = 0
for i = 1 to m do
begin
    Select sample R randomly;
    Find NearestHit H and NearestMiss M;
    for A = 1 to N do
        W[A] = W[A] - diff(A, R, H)/m + diff(A, R, M)/m;
end;

The advantages of the Relief family of algorithms are high efficiency, no restriction on the data type, and insensitivity to relationships between features. Their drawback is that they cannot remove redundant features: features that are highly correlated with the class receive high weights regardless of whether they are redundant with respect to the remaining features.

4.2.4 Information Gain and Similarity

In this method, fitness is evaluated based on the Information Gain and Similarity of an attribute. A good subset should contain attributes with high information gain; the similarity of each individual attribute with the class should be high, and the similarity of the attributes with one another should be low. The Information Gain of an attribute x with respect to class c is given by

$IG(c, x) = H(c) - H(c \mid x)$    (4.1)

where H(c) is the entropy of c and H(c | x) is the conditional entropy of c when the value of feature x is known.
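Equation (4.1) can be illustrated with a short Python sketch. It assumes discrete-valued features and classes and uses base-2 logarithms, so a continuous attribute would need to be discretised first; the function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy H(values) in bits."""
    n = len(values)
    return -sum((k / n) * log2(k / n) for k in Counter(values).values())

def conditional_entropy(c, x):
    """H(c | x): entropy of c within each group of equal x, weighted by group size."""
    n = len(x)
    total = 0.0
    for value, count in Counter(x).items():
        subset = [ci for ci, xi in zip(c, x) if xi == value]
        total += (count / n) * entropy(subset)
    return total

def information_gain(c, x):
    """IG(c, x) = H(c) - H(c | x), as in Equation (4.1)."""
    return entropy(c) - conditional_entropy(c, x)
```

For a balanced binary class, a feature that determines the class exactly has an information gain of 1 bit, while a feature that is independent of the class has an information gain of 0.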

The similarity between features x and y is computed, and the value of Sim(x, y) lies in the range [0, 1]. Sim(x, y) = 0 means that x and y are completely irrelevant to each other. Sim(x, y) = 1 means that x and y are completely relevant. When Sim(x, y) is greater than a threshold, the features x and y are redundant.

(4.2)

The overall benefit of a feature subset X' = {x'_1, ..., x'_k} is given by the equation:

$E(X') = \sum_{i=1}^{k} IG(c, x'_i) + \frac{\sum_{i=1}^{k} Sim(c, x'_i)}{\sum_{i \neq j} Sim(x'_i, x'_j) / pairsnum}$    (4.3)

where pairsnum is the number of feature pairs in X'.

4.3 Implementation of Genetic Algorithm for feature selection

The feature selection algorithm has been implemented using MATLAB.

Fitness function is the objective function we want to minimize. We can specify the function as a function handle of the form @distance_fitness_function, where distance_fitness_function.m is an M-file that returns a scalar. The implementation of the Relief algorithm is contained in the distance_fitness_function.m file.

The distance_fitness_function computes the fitness of a set of attributes based on the ReliefF algorithm. At the beginning of the function, a training set from the clinical dataset is read. The total number of attributes and the total number of instances are stored in variables. The position of the class, i.e. the total number of attributes plus one, is also stored, and the attribute details are loaded. Then we specify the number of random samples that are to be chosen. This is the number of iterations that the fitness function will perform for a particular set of attributes. The weight variable is initially set to zero.

The MATLAB function rand() generates a random number between 0 and 1. Hence we multiply its result by ten to the power of the number of digits of the total number of instances to give a random number in the appropriate range. We then round this number to give an integer value.

We then define variables for nearest hit, nearest miss, hit value and miss value and initialize them to 0, 0, infinity and infinity respectively. We start a loop in which an index variable varies from one to the number of instances in the dataset. For every index that is not equal to the generated random number, the distance between the instance at that index in the training set and the instance at the random number in the training set is computed. Here, the distance function performs the Exclusive OR operation between the selected attributes of the two instances, and the total number of 1s in the result is returned as the distance. Then we check whether the class value (the element at the class position) of the instance given by the random number is equal to the class value of the instance given by the index number. If equal, the distance is stored as the hit value and the index number is stored as the nearest hit. If not equal, the distance is stored as the miss value and the index number is stored as the nearest miss.

Then the input attribute set is loaded and, for each attribute A in the set, the corresponding weight is updated as

$weight = weight - |x_R[A] - x_H[A]| / m + |x_R[A] - x_M[A]| / m$

where x_R[A], x_H[A] and x_M[A] are the values of attribute A in the training set for the randomly chosen instance, its nearest hit and its nearest miss respectively, and m is the number of samples to be chosen. Finally, the return value of the fitness function is calculated as the negative of the weight value divided by the number of ones in the input set.

Number of variables is the number of independent variables for the fitness function. Here the number of variables is based on the number of attributes in the experimental dataset.

Plot Functions

Plot functions enable us to plot various aspects of the genetic algorithm as it is executing. Each one draws in a separate axis on the display window. We can use the Stop button on the window to interrupt a running process. Best individual is chosen as the plot function in this experiment. Best individual plots the vector entries of the individual with the best fitness function value in each generation.

4.3.1 Population Options

Population options specify options for the population of the genetic algorithm.

Population type specifies the type of the input to the fitness function. Bit string has been chosen as Population type in this experiment.

Population size specifies how many individuals there are in each generation. Population size is set to be a vector of length 20, so the algorithm creates multiple subpopulations. Each entry of the vector specifies the size of a subpopulation.

Creation function specifies the function that creates the initial population. The default creation function, Uniform, is used in our experiment; it creates a random initial population with a uniform distribution.

Initial population enables us to specify an initial population for the genetic algorithm. Since an initial population is not specified, the algorithm creates one using the Creation function.

Initial scores enable us to specify scores for the initial population. Since initial scores are not specified, the algorithm computes the scores using the fitness function.

Initial range specifies lower and upper bounds for the entries of the vectors in the initial population. We have specified Initial range as a matrix with 2 rows and Initial length columns. The first row contains lower bounds for the entries of the vectors in the initial population, while the second row contains upper bounds.

4.3.2 Fitness Scaling Options

The scaling function converts raw fitness scores returned by the fitness function to values in a range that is suitable for the selection function.

Scaling function specifies the function that performs the scaling. Rank scaling is chosen as the scaling function. Rank scales the raw scores based on the rank of each individual, rather than its score. The rank of an individual is its position in the sorted scores. The rank of the fittest individual is 1, the next fittest is 2 and so on. Rank fitness scaling removes the effect of the spread of the raw scores.

4.3.3 Selection Options

The selection function chooses parents for the next generation based on their scaled values from the fitness scaling function. The Stochastic uniform function performs the selection. Stochastic uniform lays out a line in which each parent corresponds to a section of the line of length proportional to its expectation. The algorithm moves along the line in steps of equal size, one step for each parent. At each step, the algorithm allocates a parent from the section it lands on. The first step is a uniform random number less than the step size.

4.3.4 Reproduction Options

Reproduction options determine how the genetic algorithm creates children at each new generation.

Elite count specifies the number of individuals that are guaranteed to survive to the next generation. Elite count is set to 2, which is less than or equal to Population size.

Crossover fraction specifies the fraction of the next generation, other than elite individuals, that is produced by crossover. The remaining individuals in the next generation, other than elite individuals, are produced by mutation. Crossover fraction is set to 0.8.
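Rank scaling, stochastic uniform selection and the split of the next generation into elite, crossover and mutation children can be sketched in Python as follows. The sketch is illustrative and is not the MATLAB toolbox code. In particular, the expectation proportional to 1/sqrt(rank), the rounding rule for the number of crossover children and the population of 20 individuals in the usage example are assumptions.

```python
import math
import random

def rank_scaling(scores, n_parents):
    """Expectation proportional to 1/sqrt(rank); the lowest raw score gets rank 1."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    raw = [0.0] * len(scores)
    for rank, i in enumerate(order, start=1):
        raw[i] = 1.0 / math.sqrt(rank)
    total = sum(raw)
    return [n_parents * r / total for r in raw]

def stochastic_uniform(expectations, n_parents, rng):
    """Walk along a line of expectation-sized sections in equal steps."""
    step = sum(expectations) / n_parents
    pointer = rng.random() * step  # first step: uniform random number below the step size
    chosen, idx, cumulative = [], 0, expectations[0]
    for _ in range(n_parents):
        while pointer >= cumulative and idx < len(expectations) - 1:
            idx += 1
            cumulative += expectations[idx]
        chosen.append(idx)
        pointer += step
    return chosen

def children_counts(pop_size, elite_count, crossover_fraction):
    """Number of elite, crossover and mutation children in the next generation."""
    n_crossover = round(crossover_fraction * (pop_size - elite_count))
    n_mutation = pop_size - elite_count - n_crossover
    return elite_count, n_crossover, n_mutation

if __name__ == "__main__":
    print(children_counts(20, 2, 0.8))  # (2, 14, 4)
```

With an elite count of 2 and a crossover fraction of 0.8, a hypothetical population of 20 yields 2 elite, 14 crossover and 4 mutation children, since 0.8 x 18 = 14.4 rounds to 14.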

4.3.5 Mutation Options

Mutation functions make small random changes in the individuals in the population, which provide genetic diversity and enable the GA to search a broader space. The Gaussian function performs the mutation. Gaussian adds a random number to each vector entry of an individual. This random number is taken from a Gaussian distribution centered on zero. The variance of this distribution can be controlled with two parameters. The Scale parameter determines the variance at the first generation. The Shrink parameter controls how the variance shrinks as generations go by. The Shrink parameter is set to 1, so the variance shrinks to 0 linearly as the last generation is reached.

4.3.6 Crossover Options

Crossover combines two individuals, or parents, to form a new individual, or child, for the next generation. The Scattered function performs the crossover. Scattered creates a random binary vector. It then selects the genes where the vector is a 1 from the first parent, and the genes where the vector is a 0 from the second parent, and combines the genes to form the child. For example,

p1 = [a b c d e f g h]
p2 = [1 2 3 4 5 6 7 8]
random crossover vector = [1 1 0 0 1 0 0 0]
child = [a b 3 4 e 6 7 8]
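The scattered crossover example above can be reproduced with a few lines of Python. This is an illustrative sketch rather than the toolbox implementation; the function name and the optional `mask` argument, which fixes the random binary vector so that the example is repeatable, are assumptions.

```python
import random

def scattered_crossover(parent1, parent2, rng, mask=None):
    """Take genes from parent1 where the mask is 1 and from parent2 where it is 0."""
    if mask is None:
        mask = [rng.randint(0, 1) for _ in parent1]
    return [g1 if m == 1 else g2 for g1, g2, m in zip(parent1, parent2, mask)]

p1 = ["a", "b", "c", "d", "e", "f", "g", "h"]
p2 = [1, 2, 3, 4, 5, 6, 7, 8]
child = scattered_crossover(p1, p2, random.Random(0), mask=[1, 1, 0, 0, 1, 0, 0, 0])
print(child)  # ['a', 'b', 3, 4, 'e', 6, 7, 8]
```

When `mask` is omitted, a fresh random binary vector is drawn for every child, which is the behaviour the text describes.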

4.3.7 Migration Options

Migration is the movement of individuals between subpopulations, which the algorithm creates if we set Population size to be a vector of length greater than 1. Every so often, the best individuals from one subpopulation replace the worst individuals in another subpopulation. We can control how migration occurs through the following three parameters.

Direction - Migration can take place in one direction or two. Direction is set to Forward, so migration takes place toward the last subpopulation. That is, the n-th subpopulation migrates into the (n+1)-th subpopulation.

Fraction controls how many individuals move between subpopulations. Fraction is the fraction of the smaller of the two subpopulations that moves. Fraction is set to 0.2 in our experiment. Individuals that migrate from one subpopulation to another are copied. They are not removed from the source subpopulation.

Interval controls how many generations pass between migrations. We have set Interval to 20, so migration between subpopulations takes place every 20 generations.

4.3.8 Hybrid Function Options

Hybrid Function enables us to specify another minimization function that runs after the genetic algorithm terminates. In our experiment the Hybrid function option is set to none.

4.3.9 Stopping Criteria Options

Stopping criteria determine what causes the algorithm to terminate.

Generations specifies the maximum number of iterations the genetic algorithm performs. In this experiment Generations is set to 100.

Time limit specifies the maximum time in seconds the genetic algorithm runs before stopping. In this experiment Time limit is set to Infinity.

Fitness limit - If the best fitness value is less than or equal to the value of Fitness limit, the algorithm stops. In this experiment Fitness limit is set to Infinity.

Stall generations - If there is no improvement in the best fitness value for the number of generations specified by Stall generations, the algorithm stops. In this experiment Stall generations is set to 50.

Stall time limit - If there is no improvement in the best fitness value for an interval of time in seconds specified by Stall time limit, the algorithm stops. In this experiment Stall time limit is set to 50.

4.3.10 Display to Command Window Options

Level of display specifies the amount of information displayed in the MATLAB command window when we run the genetic algorithm. We have chosen the option off, so only the final answer is displayed.

Vectorize Option

The Vectorize option specifies whether the computation of the fitness function is vectorized. The option is set to off, so the fitness function is called on one individual at a time and returns a scalar.
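The way these stopping criteria interact can be sketched as a single check that is evaluated once per generation. The Python sketch below is illustrative and is not the toolbox code. The function name, the argument names and the order in which the criteria are tested are assumptions, and the default `fitness_limit` of negative infinity is an assumed value that simply disables that check.

```python
import math

def stop_reason(generation, best_fitness, elapsed, stalled_generations, stalled_time,
                max_generations=100, time_limit=math.inf, fitness_limit=-math.inf,
                stall_generations=50, stall_time_limit=50.0):
    """Return the name of the first stopping criterion that is met, or None.

    elapsed and stalled_time are in seconds; stalled_generations counts the
    generations since the best fitness last improved (minimisation).
    """
    if generation >= max_generations:
        return "generations"
    if elapsed >= time_limit:
        return "time limit"
    if best_fitness <= fitness_limit:
        return "fitness limit"
    if stalled_generations >= stall_generations:
        return "stall generations"
    if stalled_time >= stall_time_limit:
        return "stall time limit"
    return None
```

In a GA loop this function would be called after each generation, and the run would end as soon as it returns something other than `None`.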

4.4 Experimental datasets

Five standard clinical datasets of varying sizes and characteristics obtained from the UCI Machine Learning Repository and one dataset from BHEL Hospital are used in this experiment. The details of the datasets are as follows.

We have two datasets for appendicitis. The first, the standard appendicitis dataset [61] from the UCI Machine Learning Repository, is used to discriminate healthy people from those with appendicitis, according to the class attribute, which is set to 0 for healthy and 1 for appendicitis. This dataset contains 9 numeric valued attributes, 1 binary valued class variable and 106 records. The second dataset is used to diagnose the severity of appendicitis in patients presenting with right iliac fossa (RIF) pain. It is based on statistics about the presence of appendicitis drawn from around 2230 patient records collected from BHEL Hospital, Tiruchirappalli, India. The second dataset is used to assign patients to different classes of appendicitis, namely mild, moderate and severe appendicitis.

The Parkinson's Dataset [62] is composed of a range of biomedical voice measurements from 31 people, 23 with Parkinson's disease. The main aim of the data is to discriminate healthy people from those with Parkinson's disease, according to the class attribute, which is set to 0 for healthy and 1 for Parkinson's disease.

The task of ARCENE [63] is to distinguish cancer versus normal patterns from mass-spectrometric data. This is a two-class classification problem with continuous input variables. ARCENE was obtained by merging three mass-spectrometry datasets to obtain enough training and test data for a benchmark.

The SPECT Heart Dataset [64] describes the diagnosis of cardiac Single Photon Emission Computed Tomography (SPECT) images. Each patient is classified into one of two categories: normal and abnormal.

The Cardiotocography Dataset [63] contains the processed information of 2126 fetal cardiotocograms (CTGs) and the respective diagnostic features measured. The CTGs were also classified by three expert obstetricians and a consensus classification label was assigned to each of them. They classified the fetal state as Normal and Abnormal.

4.5 Experimental Results

For the appendicitis dataset, the classification accuracy of the Genetic Algorithm with the Decision Tree classifier, Naïve Bayesian classifier and k-nearest Neighbor classifier is 88.68%, 88.68% and 85.85% respectively. The classification accuracy of Information Gain with the same three classifiers is 83.02%, 83.96% and 81.13% respectively. The classification accuracy of the Chi-Square algorithm is 83.02%, 83.96% and 81.13% respectively. The classification accuracy of the BLogReg algorithm is 85.85%, 82.08% and 80.19% respectively. The classification accuracy of the FCBF algorithm is 85.85%, 83.02% and 83.02% respectively. The classification accuracy of the Genetic Algorithm and the other feature selection techniques on the remaining clinical datasets is given in detail in the chapter Experimental Results.

Table 4.1 Classification accuracy of different feature selection techniques on Appendicitis dataset

Feature selection    Attributes in    Attributes    Decision Tree    Naïve Bayesian    k-nearest Neighbor
algorithm            the dataset      selected      accuracy         accuracy          accuracy
Genetic Algorithm    8                4             88.68%           88.68%            85.85%
Information Gain     8                4             83.02%           83.96%            81.13%
Chi square           8                4             83.02%           83.96%            81.13%
BLogReg              8                1             85.85%           82.08%            80.19%
FCBF                 8                2             85.85%           83.02%            83.02%

4.6 Chapter Conclusions

It is observed that the proposed Relief Algorithm based feature selection implemented within a Genetic Algorithm performs better than the other feature selection algorithms across the different classification techniques. The Genetic Algorithm is the best feature selection algorithm for the Appendicitis, Parkinson's and ARCENE datasets, in which all attributes are real valued. For high-dimensional datasets, the Genetic Algorithm in combination with a decision tree is the best feature selection strategy.