Distributed Optimization of Feature Mining Using Evolutionary Techniques


Karthik Ganesan Pillai
University of Dayton, Computer Science
300 College Park, Dayton, OH 45469-2160

Dale Emery Courte
University of Dayton, Computer Science
300 College Park, Dayton, OH 45469-2160

Abstract

Pattern recognition is a resource-intensive task that includes feature extraction, feature selection and classification. Optimizing any of these steps can significantly improve performance. Evolutionary computation methods are used to address optimization problems that explore a huge, nonlinear, multidimensional search space. In this paper a new distributed framework is introduced that greatly reduces the computation time of such systems, concentrating on feature extraction and feature selection components that operate in parallel. This software architecture incorporates Mother Nature's most powerful tools, evolution and parallelism, in its design, while maintaining robustness.

Keywords: evolutionary computation, genetic algorithms, feature extraction, classification, distributed architecture.

1. Introduction

In this paper the problem of extracting features from data, and selecting the best subset of those features for classification of the objects represented by the data, is approached using a distributed architecture. This is an extensive and computationally intensive task in which optimization plays an important role. Training machines to learn the patterns present in data consumes an enormous amount of time when a single machine is used, and optimization of pattern recognition problems is considered very difficult and highly multidimensional in nature. This paper addresses the problem by introducing a distributed framework that greatly reduces the computation at each unit while maintaining robustness of design. Evolutionary computation methods are techniques developed to provide optimal solutions to such problems, and this architecture uses them to optimize the solutions obtained by such systems.

Extraction and selection of meaningful features play an important role in the accuracy and efficiency of classifiers. Feature extraction is the process of synthesizing meaningful features from the original data; such features help the classifier to be more accurate and efficient. Feature selection is the process of selecting the best subset of features from a given set. This is a search problem, where the performance of the classifier on a given feature subset may be expressed by a fitness function used to assess the quality of that subset. Here, a supervised evolutionary learning method is used to extract useful and meaningful features from the data.

2. Approach

We introduce a distributed evolutionary computation method to evolve both feature extractors and a feature selector. Based on a feature set created by distributed feature agents, we use a k-nearest-neighbor method to classify the objects in the training data. This drives the fitness function for parallel genetic algorithms (GAs) that select the best subset of features from the given set. These parallel GAs increase the efficiency and enhance the search capabilities of the system proposed in [9]. Distributed feature agents operate independently in a distributed environment and deposit extracted features in a central feature mine. Parallel distributed GAs then determine feature selections from the sets of features stored in the mine. Feedback based on competition between the distributed feature selectors guides the distributed agents to refine their search: upon receiving the feedback, feature agents adjust to search for features either locally or in some new location. This process continues until a user-defined threshold is reached. The best evolved feature set is then used to test the final system configuration against hold-out test data.
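
As a rough illustration of how k-nearest-neighbor classification can drive the fitness of a candidate feature subset, the sketch below computes the accuracy of a simple knn classifier restricted to the selected feature columns. It is a minimal, self-contained Java example with illustrative names only; the actual system is implemented in MATLAB with PRTools, as described in Section 8.

    import java.util.*;

    /** Minimal sketch: fitness of a candidate feature subset = knn accuracy
     *  on held-out samples, using only the selected feature columns. */
    public class KnnFitness {

        /** Euclidean distance restricted to the selected feature columns. */
        static double distance(double[] a, double[] b, boolean[] mask) {
            double sum = 0.0;
            for (int j = 0; j < mask.length; j++) {
                if (mask[j]) {
                    double d = a[j] - b[j];
                    sum += d * d;
                }
            }
            return Math.sqrt(sum);
        }

        /** Classify one sample by majority vote among its k nearest training samples. */
        static int knnClassify(double[][] train, int[] trainLabels, double[] query,
                               boolean[] mask, int k) {
            Integer[] order = new Integer[train.length];
            for (int i = 0; i < order.length; i++) order[i] = i;
            Arrays.sort(order, Comparator.comparingDouble(
                    (Integer i) -> distance(train[i], query, mask)));
            Map<Integer, Integer> votes = new HashMap<>();
            for (int n = 0; n < k && n < order.length; n++) {
                votes.merge(trainLabels[order[n]], 1, Integer::sum);
            }
            return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
        }

        /** Fitness = fraction of evaluation samples classified correctly. */
        static double fitness(double[][] train, int[] trainLabels,
                              double[][] eval, int[] evalLabels,
                              boolean[] mask, int k) {
            int correct = 0;
            for (int i = 0; i < eval.length; i++) {
                if (knnClassify(train, trainLabels, eval[i], mask, k) == evalLabels[i]) correct++;
            }
            return (double) correct / eval.length;
        }
    }
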
3. Pattern Recognition

Pattern recognition can be defined as a process that acts on raw data in order to identify what the data represents. It is a multi-step process whose steps include collecting the data to be classified; building models on the data that are used to train the system; synthesizing features from the data; and finally classifying the data into categories using these new, extracted features. To classify a pattern in data we need to identify features in the data that help us recognize that pattern. Feature extraction is the process of synthesizing useful features from the data that are used by the classifiers to recognize the patterns in the data. Reducing the dimensionality of the features can greatly aid the classifier in terms of both efficiency and accuracy.

4. Evolutionary Computation

Effective optimization of computational systems has been accomplished through forms of evolutionary computation [9]. Evolutionary computation is a family of computational approaches that mimic the biological process of evolution in computer programs. The system is initialized with a population of candidate solutions, and reproduction is then performed through two important operators, mutation and crossover. A fitness function evaluates the performance of each candidate solution, and selection is performed based on those fitness values. Several types of selection methods exist, among them roulette-wheel selection, tournament selection and truncation selection. The term fitness landscape is frequently used in these methods: it envisions the fitness of all candidate solutions as a mountain range, in which the maximum of the landscape corresponds to the best candidate solution found.
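
A generic evolutionary loop of the kind described above might look like the following minimal Java sketch, which uses mutation as the only variation operator and truncation selection; the fitness and mutation functions are left abstract and all names are illustrative.

    import java.util.*;
    import java.util.function.Function;
    import java.util.function.UnaryOperator;

    /** Minimal sketch of a generational evolutionary loop with mutation only
     *  and simple truncation selection. */
    public class SimpleEvolutionLoop<T> {

        /** Returns the best individual found after the given number of generations. */
        public T evolve(List<T> initialPopulation,
                        Function<T, Double> fitness,   // higher is better
                        UnaryOperator<T> mutate,       // variation operator
                        int generations) {
            int popSize = initialPopulation.size();
            List<T> current = new ArrayList<>(initialPopulation);
            for (int g = 0; g < generations; g++) {
                // Variation: every parent produces one mutated offspring.
                List<T> offspring = new ArrayList<>();
                for (T parent : current) offspring.add(mutate.apply(parent));
                current.addAll(offspring);
                // Evaluation and truncation selection: keep the fittest individuals.
                current.sort((a, b) -> Double.compare(fitness.apply(b), fitness.apply(a)));
                current = new ArrayList<>(current.subList(0, popSize));
            }
            return current.get(0);   // population is sorted best-first
        }
    }
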

5. EC Techniques in Pattern Recognition

Evolutionary computation approaches, primarily genetic algorithms and genetic programming, have been used for the extraction and selection of features from data; feature extraction for rule-based machine learning architectures has been explored for image recognition and texture classification. In classification, if the training data set is limited, then even useful features can cause a loss of accuracy. This is known as the curse of dimensionality: the amount of computation required for pattern recognition, and the amount of data required to train the system, grow exponentially with the dimensionality of the feature space [1].

Genetic algorithms were applied to feature selection by Siedlecki and Sklansky. In their method a genetic algorithm searches for a vector of ones and zeros, used by a k-nearest-neighbor (knn) classifier, that yields an error rate below some threshold while containing a minimum number of ones, since each one in the vector corresponds to a feature that is used [2].

Punch et al. [3] and Kelly and Davis [4] later extended this work by applying a genetic algorithm to feature extraction. Instead of a vector of ones and zeros they used a weight vector of real values, multiplied element-wise with the features to produce a new feature vector that is linearly related to the original one; the genetic algorithm searches for a weight vector that minimizes the error rate of the knn classifier [3, 4]. Raymer et al. extended this work further, using a masking genetic algorithm for simultaneous selection and extraction of features. In this approach a masking vector of ones and zeros is used along with the weight vector, and both are multiplied with each feature; in this way the weight value can be preserved while a feature is disengaged by placing a zero at its bit position in the masking vector [5].
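
The combined use of a weight vector and a masking vector described above can be illustrated with a small sketch; this is an illustrative reconstruction of the idea in [5], not code from that work, and the names are assumptions.

    /** Sketch of the weighted-and-masked feature transform: weights scale each
     *  feature, while a zero mask bit disengages it without losing its weight. */
    public class WeightMaskTransform {

        static double[] apply(double[] features, double[] weights, int[] mask) {
            double[] out = new double[features.length];
            for (int j = 0; j < features.length; j++) {
                out[j] = (mask[j] == 1) ? features[j] * weights[j] : 0.0;
            }
            return out;
        }
    }
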

6. Distributed Approaches in Evolutionary Computation

Since evolution in biological systems is parallel in nature, some level of parallelism is naturally incorporated into genetic algorithms. Moreover, parallelism greatly enhances the available computation power, since genetic algorithms [7] are computationally intensive. Distributed computation of genetic algorithms also helps maintain variety among chromosomes and so avoid premature convergence. Parallel and distributed execution of genetic algorithms therefore offers the possibility of both high-speed computation and high-quality solutions. Generally in these approaches the population is divided into sub-populations, each run by its own GA, and chromosomes are occasionally transferred between sub-populations; this is known as the island model of genetic algorithms. The exchange of genetic material across the different subsystems helps prevent any sub-population from converging to a local optimum. Co-evolution of competing solutions is also sometimes used in parallel evolutionary systems; in this method different species compete with each other to evolve an optimized solution.
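
A compact sketch of the island-model idea: sub-populations evolve independently and periodically exchange a few chromosomes. The ring topology and the policy of copying the best chromosomes over the worst ones on the next island are common choices assumed here for illustration, not details taken from the paper.

    import java.util.List;

    /** Minimal island-model sketch: sub-populations exchange a few chromosomes
     *  around a ring, copying good individuals over the worst on the next island. */
    public class IslandMigration {

        /** Each island is a list of chromosomes sorted best-first by its own GA. */
        static void migrateRing(List<List<int[]>> islands, int migrants) {
            int n = islands.size();
            for (int i = 0; i < n; i++) {
                List<int[]> source = islands.get(i);
                List<int[]> target = islands.get((i + 1) % n);
                for (int m = 0; m < migrants && m < source.size() && m < target.size(); m++) {
                    // Copy one of the best chromosomes from the source island and
                    // overwrite one of the worst chromosomes on the target island.
                    target.set(target.size() - 1 - m, source.get(m).clone());
                }
            }
        }
    }
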
7. System Design

Our system is divided into three principal components: feature agents, a feature mine, and parallel distributed GAs. Figure 1 shows the system architecture with all the components. The numbers of distributed agents and of parallel GAs are independent of each other. Each parallel GA searches a population of features extracted from the different search spaces of the individual distributed agents. This allows the parallel GAs to cover a wide range of the search space at the same time and to select the best features from that space. Feedback from the feature mine allows the distributed feature agents to explore the search space more effectively, which drives the entire system toward more optimal solutions.

Feature agents are independent distributed components that extract features from the given training data and deposit them into the feature mine; Figure 2 gives an overview of the algorithm used in the feature agents. The feature mine maintains the features collected from each feature agent separately in a table data structure; Figure 4 gives an overview of the algorithm used in the feature mine. The parallel distributed GAs retrieve features from this table and select a subset of them. After selecting a subset of features using a genetic algorithm, each parallel distributed GA deposits a bit string representing the best feature subset it found back into the feature mine. The feature mine selects the best feature subset among those deposited and stores it. The feature agents, waiting for the report from the feature mine, query it for the best-found subset of features. Upon receiving the best feature subset, each feature agent checks whether the features it deposited were selected. If its features were selected, the agent performs micro mutations in its search but leaves the selected feature in place; through these micro mutations, the agent continues to extract features from the dataset and deposit them into the feature mine. An agent that finds none of its extracted features were selected performs a macro mutation that directs its search to new areas of the search space. Figure 3 gives the algorithm used in the parallel GAs. This cycle continues until a user-defined threshold is reached, and the best evolved feature subset is used to test the final system configuration against hold-out test data.

8. System Implementation

Our first distributed feature mining system has been developed and tested using the Java 2 SDK, MATLAB 7.0, and CORBA. The feature agents and parallel GAs are implemented in MATLAB using PRTools for pattern recognition [11]. The feature mine is a CORBA server implemented in Java, and all components are connected through CORBA's distributed application technology. CORBA is used to implement the feature mine server because it provides facilities for components written in several languages to interoperate. The system has been developed and tested on both Windows PCs and Sun workstations.
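
The paper does not list the remote interface of the feature mine, so the following is only a hypothetical Java view of the operations implied by the design: registration, feature deposit, reporting of the best subset, and the query the agents use to choose between micro and macro mutation. All names and signatures are assumptions, not the actual CORBA IDL.

    /** Hypothetical sketch of the feature mine's remote operations as implied by
     *  the system design; names and signatures are illustrative assumptions. */
    public interface FeatureMine {

        /** Feature agents and parallel GAs register before taking part in an epoch. */
        int registerAgent(String agentId);
        int registerSelector(String selectorId);

        /** An agent deposits the feature columns it extracted in the current epoch. */
        void depositFeatures(int agentHandle, double[][] featureColumns);

        /** A parallel GA reports the best feature subset it found, as a bit string. */
        void reportBestSubset(int selectorHandle, boolean[] selectedFeatures);

        /** Agents query the winning subset of the previous epoch to decide between
         *  a micro mutation (their feature was selected) and a macro mutation. */
        boolean[] bestSubsetOfLastEpoch();
    }
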

Figure 1: System architecture.

1. Initialize the system, load the training data, and register with the feature mine.
2. Compute the dimensions of a patch at random and extract features from that area.
3. Deposit all the extracted features into the feature mine.
4. Wait for the report from the feature mine.
5. Check whether any of the agent's feature columns was selected by the feature mine in the previous epoch; if so, perform a minor mutation on all the columns except the one that was selected. Otherwise make a random change in the dimensions of the patch and then extract features from the new patches.
6. Deposit the extracted features into the feature mine.
7. Repeat steps 4 to 6 until the maximum number of epochs is reached.
8. On the final epoch, use the patch dimensions from the last epoch to extract features from the test data and deposit them in the feature mine.
9. The agents are now in their final system configuration, and these patches will be used against actual data.

Figure 2: Feature agent algorithm.

1. Initialize the GAs with the maximum population size, mutation rate, maximum number of generations, and maximum number of epochs, and register with the feature mine.
2. Wait for the trigger from the feature mine.
3. Apply the mutation operator to produce offspring.
4. Divide the training data into two parts, one for training and one for testing.
5. For each chromosome in the population, use the k-nearest-neighbor method to classify the objects in the data, and compute the fitness of the chromosome on the testing part.
6. Inform the feature mine of the best selected features, sort the chromosomes in the population by fitness, and truncate the population back to its initial size.
7. Repeat steps 2 to 6 until the maximum number of epochs is reached.
8. At the final epoch, load the test data and classify its objects using the best chromosome.

Figure 3: Parallel GA algorithm.
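
To make the agent behaviour listed in Figure 2 concrete, the sketch below shows one possible Java representation of a patch genotype (position, width, height) with average-pixel feature extraction and the micro/macro mutations, following the description given later in Section 9. The class, its fields, and the mutation details (for example, changing one parameter at a time) are illustrative assumptions rather than the paper's MATLAB implementation.

    import java.util.Random;

    /** Illustrative sketch of a feature agent's patch genotype and its
     *  micro/macro mutations (see Figure 2 and Section 9). */
    public class PatchAgent {
        static final int IMAGE_SIZE = 128;           // MSTAR chips are 128 x 128 pixels
        int x, y, width, height;                     // rectangular patch inside the image
        private final Random rng = new Random();

        /** Extracted feature: the average pixel value inside the patch. */
        double extractFeature(double[][] image) {
            double sum = 0.0;
            for (int r = y; r < y + height; r++)
                for (int c = x; c < x + width; c++)
                    sum += image[r][c];
            return sum / (width * height);
        }

        /** Micro mutation: nudge one patch parameter by a small random step. */
        void microMutate(int maxMove) {
            int delta = rng.nextInt(2 * maxMove + 1) - maxMove;   // in [-maxMove, maxMove]
            switch (rng.nextInt(4)) {
                case 0:  x += delta; break;
                case 1:  y += delta; break;
                case 2:  width += delta; break;
                default: height += delta; break;
            }
            clamp();
        }

        /** Macro mutation: jump to a completely new random patch. */
        void macroMutate(int minSize, int maxSize) {
            width = minSize + rng.nextInt(maxSize - minSize + 1);
            height = minSize + rng.nextInt(maxSize - minSize + 1);
            x = rng.nextInt(Math.max(1, IMAGE_SIZE - width));
            y = rng.nextInt(Math.max(1, IMAGE_SIZE - height));
        }

        private void clamp() {
            width = Math.max(1, Math.min(width, IMAGE_SIZE));
            height = Math.max(1, Math.min(height, IMAGE_SIZE));
            x = Math.max(0, Math.min(x, IMAGE_SIZE - width));
            y = Math.max(0, Math.min(y, IMAGE_SIZE - height));
        }
    }
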

1. Initialize the feature mine.
2. Wait for the distributed agents and parallel GAs to register with it.
3. Wait for the distributed feature agents to deposit all their features in the table.
4. Trigger the parallel GAs to select the best feature subsets from the table.
5. Wait for the parallel GAs to report the best feature subsets they found.
6. Report the best-found feature subset to the distributed agents that are waiting for it.
7. Repeat steps 3 to 6 until the maximum epoch is reached.
8. Once the final epoch is reached, store the dimensions of the patches that will be used for classifying the actual data.

Figure 4: Feature mine algorithm.

9. Experiments

In early experiments, the public MSTAR targets dataset was used [8]. It contains collections of X-band SAR imagery of three target types: the T-72 tank, the BMP2 infantry fighting vehicle, and the BTR-70 armored personnel carrier. The images are taken at angles of 15 and 17 degrees with full aspect coverage, and each image is 128 by 128 pixels in size. Data from this dataset is divided into training data and test data, with the training data chosen so that it contains all target types in different poses. Once the system has reached its final configuration, it is tested against the test data to check the accuracy of the system. Two sets of experiments were conducted with this dataset, using two sampling methods. In method one, training and test images were taken in sequential order from the dataset, with alternating image files assigned to the training and test sets, so that each target is covered in all poses in both sets. In sampling method two, image files of all targets were picked at random from the dataset. In both sampling methods one hundred image files of each target were collected, hence three hundred images each were used for training and for testing. The parameters used in these experiments are given in Tables 1 and 2.

Feature agents use a genotype that represents a coordinate position in the image and the width and height of a patch, selected at random in the initial population. For these initial experiments, the feature associated with a patch is simply its average pixel value. Mutation is the only variation operator applied to patches: a random chance of 50% is used to determine which of the coordinates, width, or height is changed, and the values may either increase or decrease. In the parallel GAs, mutation is likewise the only operator applied to the population, and each chromosome is represented by a bit string of ones and zeros. Each bit corresponds to a feature: a one indicates that the feature is selected and a zero that it is not. This differs from a classical GA, in which a crossover operator is applied to the bit strings. For mutation, two positions within the length of the chromosome are selected and all bits in that range are inverted. The population size is doubled in each generation; after the fitness is calculated, the population is sorted by fitness and reduced back to its initial size using simple truncation selection. The fitness function computes the error rate using the leave-one-out method. Test accuracy is determined by classifying each test image based on distances to training images only, using knn with k = 3. Tables 3 and 4 summarize the results obtained with the two sampling methods. The consistency of the results across the two sampling methods demonstrates that the system is not dependent on a specific composition of training images.
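
The variation and selection scheme described above for the parallel GAs (bit-string chromosomes, mutation by inverting all bits between two randomly chosen positions, doubling the population, and truncating back to the original size) might be coded as in the following sketch; the class and method names are illustrative and the fitness function is supplied by the caller.

    import java.util.*;
    import java.util.function.ToDoubleFunction;

    /** Sketch of the parallel GA's variation and selection step described above. */
    public class SubsetGaStep {
        private static final Random rng = new Random();

        /** Mutation: pick two positions and invert every bit in that range. */
        static boolean[] rangeInversionMutation(boolean[] chromosome) {
            boolean[] child = chromosome.clone();
            int a = rng.nextInt(child.length);
            int b = rng.nextInt(child.length);
            int lo = Math.min(a, b), hi = Math.max(a, b);
            for (int i = lo; i <= hi; i++) child[i] = !child[i];
            return child;
        }

        /** One generation: double the population by mutation, then truncate back
         *  to the original size, keeping the fittest chromosomes. */
        static List<boolean[]> generation(List<boolean[]> population,
                                          ToDoubleFunction<boolean[]> fitness) {
            int size = population.size();
            List<boolean[]> doubled = new ArrayList<>(population);
            for (boolean[] parent : population) doubled.add(rangeInversionMutation(parent));
            doubled.sort((a, b) -> Double.compare(
                    fitness.applyAsDouble(b), fitness.applyAsDouble(a)));   // best first
            return new ArrayList<>(doubled.subList(0, size));
        }
    }
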
Table 1: Feature agent configuration.

  Maximum epochs     20
  Maximum features   20
  Maximum movement   10
  Minimum width      10
  Maximum width      25
  Minimum height     10
  Maximum height     25

Table 2: Parallel GA configuration.

  Maximum epochs        20
  Maximum generations   10
  Population size       20
  Mutation rate          1
  Maximum features      20

Table 3: Sampling method one results.

  Test number   Training accuracy   Test accuracy   Number of features
  1             92.59               87              15
  2             92.9                88.33           19
  3             92.26               88              13
  4             93.91               90              18
  5             92.59               91              15
  6             94.59               89.67           16
  7             93.59               88.33           15
  8             93.57               90              18
  9             93.6                84.67           12
  10            92.91               87.67           18
  Average       93.25               88.46           15.9
  S.D.           0.68                1.73            2.04

Figure 5: Sample feature patches.

Table 4: Sampling method two results.

  Test number   Training accuracy   Test accuracy   Number of features
  1             94.9                86.33           19
  2             93.92               89.34           15
  3             95.59               88              15
  4             96.25               89.67           15
  5             93.92               85.67           16
  6             92.92               87              15
  7             96.93               89.67           13
  8             94.6                89.67           14
  9             92.54               86.67           18
  10            94.6                89.67           14
  Average       94.61               88.16           15.4
  S.D.           1.31                1.53            1.74

Figure 6: Sample target images (BTR-70, BMP2, T-72).

10. Conclusion and Future Work

The results obtained from these experiments are comparable to those of similar experiments using the CHAMP system [6], which evolves cooperating voting classifiers. Exploring the search space with a distributed population of candidate solutions proves to be a viable alternative to sequential evolutionary computation techniques, which require more computation time. Future work includes the possible replacement of the feature mine with more direct communication methods, which would prevent the mine from becoming a network-traffic bottleneck when large volumes of data are handled.

11. Acknowledgements

Partial funding for this work was provided by the University of Dayton Research Council and the Ohio Board of Regents.

Thanks also to the University of Dayton Computer Science Department, the Dayton Area Graduate Studies Institute (DAGSI), and the Wright State University College of Engineering and Computer Science for equipment and technical support.

12. References

[1] Xuechuan Wang, Feature Extraction and Dimensionality Reduction in Pattern Recognition and Their Application in Speech Recognition, Ph.D. dissertation, Griffith University, 2002.
[2] W. Siedlecki and J. Sklansky, A Note on Genetic Algorithms for Large-Scale Feature Selection, Pattern Recognition Letters, vol. 10, pp. 335-347, 1989.
[3] W. F. Punch, E. D. Goodman, M. Pei, L. Chia-Shun, P. Hovland, and R. Enbody, Further Research on Feature Selection and Classification Using Genetic Algorithms, Proc. International Conference on Genetic Algorithms (ICGA '93), pp. 557-564, 1993.
[4] J. D. Kelly and L. Davis, Hybridizing the Genetic Algorithm and the K Nearest Neighbors Classification Algorithm, Proc. Fourth International Conference on Genetic Algorithms and Their Applications (ICGA), pp. 377-383, 1991.
[5] M. L. Raymer, W. F. Punch, E. D. Goodman, P. C. Sanschagrin, and L. A. Kuhn, Simultaneous Feature Extraction and Selection Using a Masking Genetic Algorithm.
[6] D. E. Courte, Evolutionary Optimization of Voting Classifiers, Ph.D. dissertation, Wright State University, Dayton, OH, 2002.
[7] J. H. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, 1975.
[8] MSTAR public dataset. https://www.sdms.afrl.af.mil/.
[9] T. Back, D. B. Fogel, and Z. Michalewicz (eds.), Evolutionary Computation: Basic Algorithms and Operators.
[10] D. E. Courte, Evolutionary Feature Mining, Proceedings of the Sixteenth Midwest Artificial Intelligence and Cognitive Science Conference, 2005.
[11] PRTools. http://www.prtools.org/.