Constructive Induction Using Non-algebraic Feature Representation


Leila S. Shafti, Eduardo Pérez
EPS, Universidad Autónoma de Madrid, Ctra. de Colmenar Viejo, Km 15, E-28049, Madrid, Spain
{Leila.Shafti, Eduardo.perez}@ii.uam.es

Abstract

Learning hard concepts in spite of complex interaction among attributes makes constructive induction necessary. Most constructive induction methods apply a greedy search for constructing new features. The search space of hard concepts with complex interaction among attributes has high variation; therefore, a greedy constructive induction method falls into local optima when searching this space. A global search such as genetic algorithms is more suitable for hard concepts than a greedy local search. Existing constructive induction methods based on genetic algorithms still suffer from some deficiencies because of their overly restricted representation language, which, in turn, defines the search space. In this paper we explain how the search space can be decomposed into two spaces, and we present a new genetic-algorithm-based constructive induction method that uses a non-algebraic representation of features. Experiments show that our method outperforms existing constructive induction methods.

Key Words

Machine Learning, Constructive Induction, Feature Selection and Construction, Genetic Algorithms

1. Introduction

Hard concepts with complex interaction among attributes are difficult to learn for machine learning algorithms based on similarity. These algorithms assume that cases belonging to the same class are located close to each other; but for hard concepts with complex interaction, each class is scattered through the space due to the low-level representation [1], [2]. Interaction means that the relation between one attribute and the target concept depends on another attribute. Such a problem has been observed in real-world domains such as protein secondary structure prediction [3].

Constructive induction (CI) methods have been introduced to ease the attribute interaction problem. Their goal is to automatically transform the original representation space of hard concepts into a new one where the regularity is more apparent [4], [5]. This goal is achieved by constructing features from the given set of attributes in order to abstract the relation (interaction) among several attributes into a single new attribute.

Most existing CI methods are based on greedy search. These methods have addressed some problems; however, they still have an important limitation: they suffer from the local optima problem. When the concept is complex because of the interaction among attributes, the search space for constructing new features has more variation. Because of this high variation, a CI method needs a more global search strategy, such as genetic algorithms [6], to be able to escape local optima and find globally optimal solutions. Genetic algorithms (GA) are a kind of multidirectional parallel search and are more likely to be successful in searching an intractable and complicated search space [7].

There are only a few CI methods that use a genetic search strategy for constructing new features. Among them are GCI [8], GPCI [9], Gabret [10], and the hybrid method of Ritthoff et al. [11]. Their partial success in constructing useful features indicates the effectiveness of genetic-based search for CI. However, these methods still have some limitations and deficiencies, as will be seen later. The most important ones concern how they treat the search space and the language they apply for representing constructed features. This paper presents DCI, a new CI method based on GA.
We begin by reviewing some genetic-based CI methods. Then we analyze the search space, the search method, and the representation language, which are crucial for converging to an optimal solution. Considering these, we design a CI method based on GA and present and discuss the results of our initial experiments.

2. Exploring the Search Space

A CI method aims to construct features that highlight the interactions. To achieve this aim, it has to look for subsets of interacting attributes and for functions over these subsets that capture the interactions. The existing genetic-based CI methods GCI and GPCI search the space (of size 2^(2^N) for N input attributes in a Boolean domain) of all functions that can be defined over all subsets of attributes. This enormous space has many local optima and is difficult to explore.

To ease the searching task, we propose to decompose the search space into two spaces: S, the space of all subsets of attributes, and F_Si, the space of all functions defined over a subset S_i (Figure 1).

Figure 1. The decomposed search space: S, the space of subsets of attributes, and, for each subset S_i = {x1, ..., xn}, the space F_Si of functions defined over S_i that represent the interactions.

The decomposition of the search space into two spaces allows a specific method to be adjusted for each space. This strategy divides the main goal of CI into two easier sub-goals: finding the subset of interacting attributes (S-Search) and looking for a function that represents the interaction among the attributes in a given subset (F_Si-Search).

Few methods consider the decomposition of the search space. Among them are MRP [12], the hybrid method of Ritthoff et al. [11], and Gabret [10]. These methods apply different search strategies to each space. MRP uses a greedy hill-climbing strategy for selecting attributes, starting from the set of all attributes and eliminating one attribute at a time. For constructing features, MRP projects the data onto the selected attributes and extracts relations. The space of attribute subsets grows exponentially with the number of original attributes and has many local optima; therefore, the greedy search of MRP causes this method to fall into local optima when high variation exists in the space of subsets. The hybrid method of Ritthoff et al. applies GA for selecting attributes to reduce the local optima problem. Feature construction is performed by arbitrarily constructing a feature from the current set of attributes and a set of predefined constructive operators. Though features are constructed during the GA search for good attributes, the method does not apply any genetic operator or search strategy at the feature construction level; therefore, it explores the space of features without a strategy that guides it toward the best feature. Gabret applies GA to both spaces. It consists of two GA modules: feature selection and feature construction. However, each module is performed separately and independently from the other. When high interaction exists among attributes, the attribute subset selection module cannot see the interaction among the primitive attributes; therefore, this module often excludes from the search space some relevant attributes that should be used later for constructing new features.

The two spaces of attribute subsets and of functions over subsets are related, and each one has a direct effect on the other. In order to find a good subset of attributes it is necessary to see the relation among them. An existing interaction among the attributes in a subset S_i is not apparent unless a function f_Si that outlines this interaction is defined over the proper subset of attributes. Similarly, a function can highlight the interaction among attributes only when it is defined over the relevant attributes. If the chosen subset of attributes is not good enough, the best function found may not be helpful. Therefore, the two tasks, S-Search and F_Si-Search, should be linked so as to transfer the effects of each search space onto the other.
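The organization of these two linked searches can be sketched as follows (a minimal Python sketch under our own naming; candidate_subsets, inner_search and goodness are hypothetical placeholders, not part of DCI's implementation): the outer loop explores attribute subsets, and the fitness of each subset is the quality of the best function found over it by the inner search.

    # Schematic of the two linked searches: S-Search explores subsets of
    # attributes, and the fitness of each subset S_i is the quality of the
    # best function that F_Si-Search can find over that subset.
    def evaluate_subset(subset, training_data, inner_search, goodness):
        """Fitness of an attribute subset = goodness of the best function over it."""
        best_function = inner_search(subset, training_data)      # F_Si-Search
        return goodness(best_function, training_data), best_function

    def s_search(candidate_subsets, training_data, inner_search, goodness):
        """Outer search over attribute subsets (S-Search); a plain loop here,
        a genetic algorithm over bit-strings in DCI (Section 4)."""
        best = None
        for subset in candidate_subsets:
            fitness, function = evaluate_subset(subset, training_data,
                                                inner_search, goodness)
            if best is None or fitness > best[0]:
                best = (fitness, subset, function)
        return best  # (fitness, subset, constructed function)

DCI replaces the outer loop with a genetic algorithm over bit-strings and the inner search with the non-algebraic feature construction described in Section 4.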
3. The Representation Language

The language used to represent features, and thus to express the relations among attributes, has an impact on how the search space is explored. The representation language should be able to represent all complex functions or features of interest, and it plays an important role in convergence to an optimal solution. There are two alternative groups of languages for representing features: algebraic and non-algebraic. By algebraic form, we mean that features are expressed by means of algebraic operators such as arithmetic or Boolean operators. Most CI methods, like GCI [8], GPCI [9], Gabret [10], and the hybrid method of Ritthoff et al. [11], apply this form of representation.

When using this form of representation, the algebraic operators must be defined properly. GPCI uses a fixed set of simple and general operators, AND and NOT, which are sufficient to represent all Boolean concepts. The use of simple operators makes the method applicable to a wide range of problems; however, a complex feature is required to capture and encapsulate an interaction using only simple operators. Conversely, GCI, Gabret, and the hybrid method of Ritthoff et al. apply domain-specific operators to reduce complexity, though specifying these operators properly cannot be done without prior information about the target concept. In addition to the problem of defining operators, an algebraic form of representation produces an unlimited search space, since any feature can appear in infinitely many forms. As an example, consider conceptually equivalent functions that are represented differently, such as X1 XOR X2 and (X1 AND NOT X2) OR (NOT X1 AND X2).
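This redundancy can be made explicit by comparing truth tables: the two expressions above, and unboundedly many longer variants padded with vacuous terms, denote the same Boolean function. A small self-contained check (our own illustration, not taken from the paper):

    from itertools import product

    # Three syntactically different algebraic features denoting the same
    # function (the exclusive OR of X1 and X2); the padded third variant shows
    # how the algebraic space contains unboundedly many forms of every feature.
    f1 = lambda x1, x2: x1 != x2
    f2 = lambda x1, x2: (x1 and not x2) or (not x1 and x2)
    f3 = lambda x1, x2: (x1 and not x2) or (not x1 and x2) or (x1 and not x1)

    # Identical truth tables => the same feature in three algebraic forms.
    for x1, x2 in product([False, True], repeat=2):
        assert f1(x1, x2) == f2(x1, x2) == f3(x1, x2)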

Therefore, a CI method needs a restriction to limit the search space produced by this form of representation.

Features can also be represented in a non-algebraic form, which means that no operator is used for representing the feature. For example, over an attribute set such as {X1, X2}, an algebraic feature like (X1 AND NOT X2) OR (NOT X1 AND X2) can be represented by the non-algebraic feature <0110>, where the j-th element of <0110> represents the outcome of the function for the j-th combination of X1 and X2 according to the following table:

  X1  X2  F
  0   0   0
  0   1   1
  1   0   1
  1   1   0

A non-algebraic form is simpler to apply in a CI method, since there is no need to specify any operator. Besides, all features have an equal degree of complexity when they are represented in a non-algebraic form. For example, two algebraic features such as f1(X1, X2) = (X1 AND X2) OR (NOT X1 AND NOT X2) and f2(X1, X2) = X1 are represented in non-algebraic form as R1(X1, X2) = <1001> and R2(X1, X2) = <0011>, which are equivalent in terms of complexity. Since for problems with high interaction CI needs to construct features more like the first feature in this example, which are complex in algebraic form, a non-algebraic representation is preferable for these methods. Such a representation is used in MRP [12], a learning method that represents features by sets of tuples obtained by relational projection. However, MRP applies a greedy method for constructing features; as the search space for complex problems has local optima, MRP fails in some cases.

4. DCI

DCI, Decomposed Constructive Induction, consists of the two tasks S-Search and F_Si-Search described in Section 2. Since the space S of attribute subsets grows exponentially with the number of attributes and has many local optima, our method applies GA for S-Search to explore this space. The GA generates a population of attribute subsets and, by means of genetic operators, aims to converge the population to an optimal subset of attributes.

Table 1. Modified parameters of PGAPack
  GA Parameter                    S-Search
  Population Size                 50
  Max Iteration                   100
  Max No Change Iter.             25
  Max Similarity Value            80%
  No. of Strings to be Replaced   46
  Selection Type                  Proportional
  Crossover Type                  Uniform
  Mutation and/or Crossover       And

Figure 2. The framework of DCI: given the set of training data, the GA (S-Search) passes each subset S_i to F_Si-Search, which returns the best function and its fitness; the data are then updated using the function corresponding to the best subset.

For implementing the GA, we use the PGAPack library [13] with default parameters, except those indicated in Table 1. We represent individuals (i.e., attribute subsets) by bit-strings, each bit representing the presence or absence of an attribute in the subset. We apply all three stopping rules of PGAPack (maximum iteration, no change in the best solution in a given number of iterations, and population too similar) as termination criteria.

Recalling the need to keep the two tasks integrated, we propose to run F_Si-Search inside S-Search. That is, for each subset S_i in S-Search, in order to measure its fitness, the method finds the best function defined over this subset using F_Si-Search (Figure 2). The goodness of this function then determines the fitness of the subset S_i in S-Search. With this strategy we maintain the relation between the two tasks and their effects on each other, while improving each of them. F_Si-Search has the main role in guiding the method toward an optimal solution. Bearing in mind that the aim of F_Si-Search is to construct a function over S_i, the current subset of attributes, that represents the interactions, we apply a non-algebraic form of representation as described in Section 3.
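Under this representation, a feature over k Boolean attributes is simply a string of 2^k outcomes indexed by the value combination of those attributes. A minimal sketch of how such a feature can be stored and evaluated (Python; the helper names are ours, not DCI's):

    def combination_index(values):
        """Map a tuple of Boolean attribute values to its row index in the
        truth-table ordering used in the text (e.g. (0, 1) over {X1, X2} -> 1)."""
        index = 0
        for v in values:
            index = (index << 1) | int(v)
        return index

    def evaluate_feature(feature_string, values):
        """Evaluate a non-algebraic feature such as "0110" on one example."""
        return int(feature_string[combination_index(values)])

    # "0110" is the feature of the table in Section 3; "1001" is f1, "0011" is f2.
    assert [evaluate_feature("0110", (x1, x2))
            for x1 in (0, 1) for x2 in (0, 1)] == [0, 1, 1, 0]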
In addition to the advantages of the non-algebraic representation mentioned in Section 3, this form of representation provides the ability to extract the relations directly from the data. Without loss of generality, suppose that the domain is Boolean. If |S_i| = k, DCI generates a new non-algebraic function as a string of length 2^k. The j-th element of this string represents the outcome of the function for the j-th combination of attribute values. The outcome is determined by the majority label in the set of training examples that match the j-th combination of values. If there is a gap, that is, if there is no such majority or no example in the data matches this combination of attribute values, DCI generalizes the label from the whole training data: for each gap a label is selected stochastically according to its probability ratio in the training data.
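A sketch of this extraction step (Python; examples are assumed to be tuples of 0/1 values and subset a list of attribute indices, which are our own conventions rather than DCI's code):

    import random
    from collections import Counter, defaultdict

    def construct_feature(subset, training_data, labels):
        """Non-algebraic feature extraction for a Boolean domain: the outcome for
        each value combination of the chosen attributes is the majority class of
        the matching training examples; combinations with no clear majority
        (gaps) are filled stochastically according to the class distribution of
        the whole training set."""
        k = len(subset)
        votes = defaultdict(Counter)
        for example, label in zip(training_data, labels):
            key = tuple(example[a] for a in subset)
            votes[key][label] += 1
        class_pool = list(labels)                 # for stochastic gap filling
        feature = []
        for j in range(2 ** k):
            key = tuple((j >> (k - 1 - i)) & 1 for i in range(k))
            top = votes[key].most_common(2)
            if not top or (len(top) == 2 and top[0][1] == top[1][1]):
                feature.append(random.choice(class_pool))   # gap
            else:
                feature.append(top[0][0])                   # majority label
        return feature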

The fitness of each individual S_i in S-Search depends on the goodness of the constructed feature. If the fitness of the feature F constructed by F_Si-Search is Fitness_Si(F), then the fitness of S_i is calculated as:

  Fitness(S_i) = Fitness_Si(F) + ε |S_i| / |S|    (1)

where |S_i| is the size of the subset S_i and |S| is the size of the original set of attributes. The first term of the formula has the major influence on the fitness value of the subset S_i. The last term roughly estimates the complexity of the new feature by measuring the number of attributes participating in it; it is weighted by a small constant ε so that it has an effect only when two subsets S_j and S_k perform equally well in F_Si-Search, that is, when Fitness_Sj(F) = Fitness_Sk(F). Therefore, between two equally good subsets, this formula assigns a better fitness value to the smaller one.

The fitness of the constructed function in F_Si-Search can be determined by a hypothesis-driven or a data-driven evaluation. A hypothesis-driven evaluation can be done by adding the new function to the list of attributes and updating the data, to then apply a learner like C4.5 [14] and measure its accuracy. GCI, Gabret, and the hybrid method of Ritthoff et al. apply this kind of evaluation. A hypothesis-driven evaluation relies on the performance of the learner; therefore, if the learner inaccurately assigns a low or high fitness value to the feature, it will guide the algorithm to a local solution. Another possibility is to apply a data-driven evaluation formula such as conditional entropy [15] to measure the amount of information needed to identify the class of a sample in the data when the value of the new feature is known. A data-driven approach depends only on the data and, therefore, is more reliable than a hypothesis-driven approach for guiding the GA. Moreover, in a GA the computation time of the fitness function is very important, since this function is called for every individual during each GA generation. As the computation time of a data-driven fitness function is usually less than that of a hypothesis-driven fitness function, the former is preferable. The current version of DCI applies a data-driven evaluation function. The fitness of the new feature F, for the training data T, is measured by the following formula:

  Fitness(F) = Entropy(T | F)    (2)

where Entropy(T | F) is the conditional entropy [15]. When S-Search is finished, the constructed function associated with the best individual (subset of attributes) in the GA is added to the original set of attributes, and the data are updated using the new set of attributes. The updated data are then given to a standard learner for learning.
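A sketch of the data-driven fitness of Equation (2) (Python; a hypothetical helper that estimates the conditional entropy of the class given the constructed feature's value on the training examples, where lower values mean a better feature):

    from collections import Counter
    from math import log2

    def conditional_entropy(feature_values, labels):
        """Entropy(T | F): expected entropy of the class given the value of the
        constructed feature F, estimated from the training data."""
        n = len(labels)
        value_counts = Counter(feature_values)
        entropy = 0.0
        for value, count in value_counts.items():
            class_counts = Counter(l for f, l in zip(feature_values, labels)
                                   if f == value)
            h = -sum((c / count) * log2(c / count)
                     for c in class_counts.values())
            entropy += (count / n) * h
        return entropy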
5. Experiments

We evaluate our method by conducting experiments on artificial problems with high interaction; similar results are expected for real-world problems where high interaction occurs. We also compare DCI with other CI methods based on greedy search.

5.1 Empirical Evaluation

To evaluate our method, we measured the accuracy of a standard learner after constructing new features with DCI. We defined concepts over 12 Boolean attributes a1, ..., a12 as shown in Table 2. A high interaction exists among the relevant attributes of these concepts. An important characteristic of these concepts is that some attributes participate in more than one interaction, which makes the concept more difficult to learn. Each experiment was run 10 times independently over 10 sets of data. For each experiment, we used 5% of the shuffled data for training and kept the rest (95% of the data) unseen for evaluating the final result. When feature construction finished, to evaluate its performance, the accuracy of C4.5 [14] on the modified data, after adding the constructed feature, was measured using the 95% unseen data as test data.

Table 2. Concepts definition
  Concept          Definition
  P4&P4-of-6       Parity(a1, ..., a4) and Parity(a3, ..., a6)
  P6&P6-of-8       Parity(a1, ..., a6) and Parity(a3, ..., a8)
  P3&P3&P3-of-6    Parity(a1, ..., a3) and Parity(a3, ..., a5) and Parity(a4, ..., a6)
  P4&P4&P4-of-6    Parity(a1, ..., a4) and Parity(a2, ..., a5) and Parity(a3, ..., a6)
  P4&P4&P4-of-8    Parity(a1, ..., a4) and Parity(a3, ..., a6) and Parity(a5, ..., a8)
  P6&P6&P6-of-8    Parity(a1, ..., a6) and Parity(a2, ..., a7) and Parity(a3, ..., a8)

Table 3 compares DCI with two other methods. One is C4.5, a standard learner based on similarities, run on the original set of 12 attributes (column 3). To see the need for feature construction in addition to attribute selection, we also applied C4.5 to these concepts forcing it to ignore the irrelevant attributes (column 4). The second column shows the number of relevant attributes.

Table 3. The average accuracy results compared to a standard learner (standard deviations in parentheses)
  Concept          Relevant Atts   C4.5, original atts   C4.5, relevant atts   DCI + C4.5
  P4&P4-of-6       6               (3.6)                 87.9 (5.5)            97.7 (2.2)
  P6&P6-of-8       8               (3.3)                 73.2 (2.4)            81.7 (1.3)
  P3&P3&P3-of-6    6               (1.5)                 91.7 (3.6)            99.2 (1.4)
  P4&P4&P4-of-6    6               (0.1)                 90.6 (2.5)            99.4 (0.8)
  P4&P4&P4-of-8    8               (0.1)                 87.5 (0.1)            89.2 (1.2)
  P6&P6&P6-of-8    8               (2.2)                 87.0 (1.6)            89.0 (1.1)

Table 3 shows that our method considerably facilitates learning these concepts by constructing features.
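For reference, the labeled data for such concepts can be generated directly from their definitions; a sketch for P4&P4-of-6 (Python, assuming the usual odd-parity convention, i.e. Parity = 1 iff an odd number of its arguments are 1):

    from itertools import product

    def parity(bits):
        """Odd parity: 1 iff an odd number of the given attribute values are 1."""
        return sum(bits) % 2

    def p4p4_of_6(a):
        """P4&P4-of-6 over a1..a12 (0-indexed here): Parity(a1..a4) and
        Parity(a3..a6); a7..a12 are irrelevant, and a3, a4 take part in both
        interactions."""
        return int(parity(a[0:4]) and parity(a[2:6]))

    # 12 Boolean attributes; the class depends only on the first six.
    data = [x for x in product((0, 1), repeat=12)]
    labels = [p4p4_of_6(x) for x in data]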

The results of C4.5 on the relevant attributes show that guiding the learner to consider only the interacting attributes does not help it much to learn the concept. This implies that a preprocessing feature selection method by itself cannot help a learner when a high interaction exists among attributes; feature selection and feature construction should be combined to highlight the interactions.

5.2 Empirical Comparison

In order to compare DCI with other CI methods, we used the concepts applied in [12]. The concepts are defined over 12 Boolean attributes a1, ..., a12 as follows:

  Parity-n: parity of a1, ..., an
  cp5-8: AND(Parity(a5, a6), Parity(a7, a8))
  mx6: multiplexor of a1, a2, a3, a6, a9, a12
  mx6c-n-m: AND(an, ..., am) if a1 = 0 and a2 = 0; OR(an, ..., am) if a1 = 0 and a2 = 1; Parity(an, ..., am) if a1 = 1 and a2 = 0; NOT Parity(an, ..., am) if a1 = 1 and a2 = 1
  gw5-8: the number of ones in {a5, a6} is greater than the number of ones in {a7, a8}
  sw5-8: the number of ones in {a5, a6} is equal to the number of ones in {a7, a8}
  Maj-n-m: among the attributes in {an, ..., am}, at least half are one
  nm-4-5: among the attributes in {a4, ..., a10}, 4 or 5 of them are one
  n-of-m: exactly n of m attributes are one

Each experiment was run 20 times, using 5% of the data for training and the rest for final evaluation, and the result was evaluated by the accuracy of C4.5 on the modified data. Table 4 presents a summary of DCI's accuracy and its comparison to the other systems' accuracy. The second column shows the number of relevant attributes. The last column indicates the average accuracy after testing the result of our method on the unseen data. The results are compared with C4.5 and C4.5-Rules [14], which are similarity-based learners; Fringe, Grove and Greedy3 [16] and LFC [17], which are greedy CI methods that apply an algebraic form for representing features; and MRP [12], which is a greedy CI method with a non-algebraic form of representation that uses relational projection for constructing features. Among these CI methods, Grove, Greedy3, LFC and MRP consider the decomposition of the search space explained in Section 2. In the third column of the table, we show the best result among C4.5, C4.5-Rules, Fringe, Grove, Greedy3 and LFC, as reported in [12]. Numbers in parentheses indicate standard deviations. Boldface indicates an accuracy that is better than the others at a significance level of 0.05.

Table 4. The average accuracy results on 5% training data compared with other CI methods
  Concept     Relevant Atts   Best result   MRP       DCI + C4.5
  gw5-8                       (0)           (0)       100 (0)
  sw5-8                       (9)           (0)       100 (0)
  cp5-8                       (7.5)         (0)       100 (0)
  mx6c                        (5.8)         (0)       100 (0)
  parity                      (9.3)         (0)       100 (0)
  mj                          (2.8)         (0.7)     99.7 (1.5)
  mx6                         (3.1)         (7.3)     97.8 (2.1)
  mx6c                        (2.3)         (3.7)     97.8 (1.8)
  parity                      (1.2)         (1.6)     98.0 (1.8)
  mj                          (1.4)         (5.1)     89.1 (2.7)
  nm-4-5                      (2.8)         (7.6)     89.9 (1.6)
  5-of                        (1.6)         (4.7)     95.1 (1.7)
  6-of-7                      (0.7)         (2.3)     98.3 (1.3)
  parity                      (0.2)         (1.4)     76.7 (2.7)

It can be seen from Table 4 that when few attributes are involved in the interaction, DCI, like most CI methods, helps the learner learn the concept perfectly. When the number of interacting attributes is high, the accuracy of most CI methods drops. MRP gives slightly better accuracy than the CI methods summarized in the third column on concepts with a high number of interacting attributes, because of its non-algebraic representation of features. The global search of DCI reduces the local optima problem; therefore, when the number of interacting attributes is high and the search space has high variation, the accuracy of DCI is considerably better than that of MRP and the other methods. The concepts with a small number of interacting attributes were easy for almost all the CI methods of Table 4.
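As an illustration, several of these benchmark concepts translate directly into labeling functions (a Python sketch using 0-indexed attribute tuples; the encoding is ours, following the definitions above):

    def parity(bits):
        """Odd parity: 1 iff an odd number of the given attribute values are 1."""
        return sum(bits) % 2

    def cp5_8(a):
        """cp5-8: AND(Parity(a5, a6), Parity(a7, a8)); a[4] is a5."""
        return int(parity(a[4:6]) and parity(a[6:8]))

    def gw5_8(a):
        """gw5-8: the number of ones in {a5, a6} is greater than in {a7, a8}."""
        return int(a[4] + a[5] > a[6] + a[7])

    def sw5_8(a):
        """sw5-8: the number of ones in {a5, a6} equals that in {a7, a8}."""
        return int(a[4] + a[5] == a[6] + a[7])

    def nm_4_5(a):
        """nm-4-5: exactly 4 or 5 of the attributes a4..a10 are one."""
        return int(sum(a[3:10]) in (4, 5))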
We reduced the number of training examples to make these concepts more difficult to learn and repeated the experiments to compare the accuracies of the methods on them. We used 1% of the data for training and kept the remaining 99% unseen for final evaluation. This is closer to the situation of real-world problems. Table 5 shows the results of these experiments. The advantage of our method over the other CI methods is now clearer. The experiments show that when a highly complex interaction exists among attributes and only a small number of training examples is available, most CI methods with an algebraic form of representation fail to construct a useful feature. MRP, with its non-algebraic representation, can overcome this problem if there is enough training data; however, the small number of training examples produces more variation in the search space and, therefore, the local search of MRP fails to find an optimal solution.

Table 5. The average accuracy results on 1% training data compared with other CI methods
  Concept     Relevant Atts   Best result   MRP        DCI + C4.5
  gw5-8                       (6.1)         (11.1)     93.2 (7.7)
  sw5-8                       (7.2)         (18.3)     95.6 (4.1)
  cp5-8                       (7.3)         (12.1)     95.6 (4.1)
  mx6c                        (5.5)         (17.4)     96.5 (3.8)
  parity-4                    (4)           (16)       94.6 (11.1)

The size of the training data is less problematic for our method and, therefore, DCI outperforms MRP as well as the other CI methods in these experiments. It is also important to note that in all experiments DCI successfully found the subset of interacting attributes, except in one of the 20 experiments on concept 6-of-7 of Table 4 and in one of the 20 experiments on concepts gw5-8 and parity-4 of Table 5.

6. Conclusion

This paper has presented DCI, a new CI method based on a non-algebraic representation, whose goal is to ease learning problems with complex attribute interaction. The design of the method decomposes the search space into two spaces: subsets of attributes and features. The integration of the two search tasks maintains the effects of the two tasks on each other while improving each of them. Other CI methods that consider these two spaces as a whole often fall into local optima by searching over a huge space with many local optima, which is more difficult to explore. Moreover, the genetic approach of DCI makes our method more promising than other methods in constructing adequate features when the search space is intractable and the required features are too complex to be found by greedy approaches.

The non-algebraic representation of features provides a finite search space of features. This form of representation assigns an equal degree of complexity to features regardless of the size or complexity of their symbolic representation, which reduces the difficulty of constructing complex features. Besides, this form of representation provides the ability to extract features directly from the training data. Empirical results showed that DCI performs well on concepts with complex interaction and outperforms other CI methods in terms of accuracy.

Acknowledgement

This work has been partially supported by the Spanish Interdepartmental Commission for Science and Technology (CICYT), under grant numbers TIC C02-02 and TIC.

References

[1] L.A. Rendell & R. Seshu, Learning hard concepts through constructive induction: framework and rationale, Computational Intelligence, 6, 1990.
[2] A. Freitas, Understanding the crucial role of attribute interaction in data mining, Artificial Intelligence Review, 16(3), 2001.
[3] N. Qian & T.J. Sejnowski, Predicting the secondary structure of globular proteins using neural network models, Journal of Molecular Biology, 202, 1988.
[4] T.G. Dietterich & R.S. Michalski, Inductive learning of structural descriptions: evaluation criteria and comparative review of selected methods, Artificial Intelligence, 16(3), 1981.
[5] D.W. Aha, Incremental constructive induction: an instance-based approach, Proc. 8th International Workshop on Machine Learning, IL, 1991.
[6] J.H. Holland, Adaptation in natural and artificial systems (Ann Arbor, MI: The University of Michigan Press, 1975).
[7] Z. Michalewicz, Genetic algorithms + data structures = evolution programs (Berlin, Heidelberg, NY: Springer-Verlag, 1999).
[8] H. Bensusan & I. Kuscu, Constructive induction using genetic programming, Proc. International Conference on Machine Learning, Workshop on Evolutionary Computing and Machine Learning, Bari, Italy.
[9] Y. Hu, A genetic programming approach to constructive induction, Proc. 3rd Annual Genetic Programming Conference, Madison, WI, 1998, Morgan Kaufmann.
[10] H. Vafaie & K. De Jong, Feature space transformation using genetic algorithms, IEEE Intelligent Systems & Their Applications, 13(2), 1998.
[11] O. Ritthoff, R. Klinkenberg, S. Fischer & I. Mierswa, A hybrid approach to feature selection and generation using an evolutionary algorithm, Proc. UK Workshop on Computational Intelligence, Birmingham, UK.
[12] E. Pérez & L.A. Rendell, Using multidimensional projection to find relations, Proc. 12th International Conference on Machine Learning, Tahoe City, CA, 1995.
[13] D. Levine, Users guide to the PGAPack parallel genetic algorithm library, Technical Report ANL-95/18, Argonne National Laboratory.
[14] J.R. Quinlan, C4.5: Programs for machine learning (San Mateo, CA: Morgan Kaufmann, 1993).
[15] J.R. Quinlan, Induction of decision trees, Machine Learning, 1(1), 1986.
[16] G. Pagallo & D. Haussler, Boolean feature discovery in empirical learning, Machine Learning, 5, 1990.
[17] H. Ragavan & L.A. Rendell, Lookahead feature construction for learning hard concepts, Proc. 10th International Conference on Machine Learning, Amherst, MA, 1993.
