An adaptive heuristic for feature selection based on complementarity


Appeared in: Machine Learning

An adaptive heuristic for feature selection based on complementarity

Sumanta Singha · Prakash P. Shenoy

Received: date / Accepted: date

Abstract Feature selection is a dimensionality reduction technique that helps to improve data visualization, simplify learning, and enhance the efficiency of learning algorithms. The existing redundancy-based approach, which relies on relevance and redundancy criteria, does not account for feature complementarity. Complementarity implies information synergy, in which additional class information becomes available due to feature interaction. We propose a novel filter-based approach to feature selection that explicitly characterizes and uses feature complementarity in the search process. Using theories from multi-objective optimization, the proposed heuristic penalizes redundancy and rewards complementarity, thus improving over the redundancy-based approach that penalizes all feature dependencies. Our proposed heuristic uses an adaptive cost function that uses the redundancy-complementarity ratio to automatically update the trade-off rule between relevance, redundancy, and complementarity. We show that this adaptive approach outperforms many existing feature selection methods using benchmark datasets.

Keywords Dimensionality reduction · Feature selection · Classification · Feature complementarity · Adaptive heuristic

Introduction

Learning from data is one of the central goals of machine learning research. The statistical and data-mining communities have long focused on building simpler and more interpretable models for prediction and understanding of data. However, high-dimensional data present unique computational challenges such as model over-fitting, computational intractability, and poor prediction. Feature selection is a dimensionality reduction technique that helps to simplify learning, reduce cost, and improve data interpretability.

Sumanta Singha, sumanta.singha@ku.edu · Prakash P. Shenoy, pshenoy@ku.edu · School of Business, University of Kansas

Existing approaches

Over the years, feature selection methods have evolved from the simplest univariate ranking algorithms, to more sophisticated relevance-redundancy trade-offs, to interaction-based approaches in recent times. Univariate feature ranking [] is a feature selection approach that ranks features based on relevance and ignores redundancy. As a result, when features are interdependent, the ranking approach leads to sub-optimal results [0]. The redundancy-based approach improves over the ranking approach by considering both relevance and redundancy in the feature selection process. A wide variety of feature selection methods are based on the relevance-redundancy trade-off [,,,,, ]. Their goal is to find an optimal subset of features that produces maximum relevance and minimum redundancy.

Complementarity-based feature selection methods emerged as an alternative approach to account for feature complementarity in the selection process. Complementarity can be described as a phenomenon in which two features together provide more information about the target variable than the sum of their individual information (information synergy). Several complementarity-based methods are proposed in the literature [,,,,,, ]. [] and later [] propose an interactive sequential feature selection method, known as joint mutual information (JMI), which selects a candidate feature that maximizes relevance and complementarity simultaneously. They conclude that the JMI approach provides the best trade-off in terms of accuracy, stability, and flexibility with small data samples.

Limitations of the existing methods

Clearly, the redundancy-based approach is less efficient than the complementarity-based approach, as it does not account for feature complementarity. However, the main criticism of the redundancy-based approach concerns how redundancy is formalized and measured. Correlation is the most common way to measure redundancy between features, which implicitly assumes that all correlated features are redundant. However, [0,, ] show that this is an incorrect assumption; correlation implies neither redundancy nor the absence of complementarity. This is evident in Figures 1 and 2, which present a 2-class classification problem (denoted by star and circle) with two continuous features X_1 and X_2. The projections on the axes denote the relevance of each respective feature. In Figure 1, X_1 and X_2 are perfectly correlated, and indeed redundant. Having both X_1 and X_2 leads to no significant improvement in class separation compared to having either X_1 or X_2 alone. However, in Figure 2, a perfect separation is achieved by X_1 and X_2 together, although they are (negatively) correlated within each class, and have identical relevance as in Figure 1. This shows that while generic dependency is undesirable, dependency that conveys class information is useful. Whether two interacting features are redundant or complementary depends on the relative magnitude of their class-conditional dependency in comparison to their unconditional dependency. The redundancy-based approach, however, focuses only on unconditional dependency and tries to minimize it without exploring whether such dependencies lead to information gain or loss.

Although the complementarity-based approach overcomes some of the drawbacks of the redundancy-based approach, it has limitations, too. Most of the existing complementarity-based feature selection methods either adopt a sequential feature selection approach or evaluate subsets of a given size.
Sequential feature selection methods raise a few important issues. First, [] show that sequential selection methods suffer from initial selection bias, i.e., the features selected in the earlier steps govern the acceptance or rejection of the subsequent features at each iteration, and not vice versa. Second, they require a priori knowledge of the desired number of features to be selected, or some stopping criterion, which is usually determined by expert information or technical considerations such as scalability or computation time. In many practical situations, it is difficult to obtain such prior knowledge. Moreover, in many cases we are interested in finding an optimal feature subset that gives maximum predictive accuracy for a given task, and we are not really concerned about the ranking of the features or the size of the optimal subset.

Fig. 1: Perfectly correlated and redundant features

Fig. 2: Features are negatively correlated within each class, yet provide perfect separation

Third, most of the existing methods combine redundancy and complementarity and consider only the net or aggregate effect in the search process (note that complementarity has the opposite sign of redundancy). [] refer to this aggregate approach as implicit consideration of complementarity.

Our approach

In this paper, we propose a filter-based feature subset selection method based on relevance, redundancy, and complementarity. Unlike most of the existing methods, which focus on feature ranking or compare subsets of a given size, our goal is to select an optimal subset of features that predicts well. This is useful in many situations where no prior expert knowledge is available regarding the size of an optimal subset, or where the goal is simply to find an optimal subset. Using a multi-objective optimization (MOO) technique and an adaptive cost function, the proposed method aims to (i) maximize relevance, (ii) minimize redundancy, and (iii) maximize complementarity, while keeping the subset size as small as possible. The term adaptive implies that our proposed method adaptively determines the trade-off between relevance, redundancy, and complementarity based on subset properties. An adaptive approach helps to overcome the limitations of a fixed policy that fails to model the trade-off between competing objectives appropriately in a MOO problem. Such an adaptive approach is new to feature selection and essentially mimics a feedback mechanism in which the trade-off rule is a function of the objective values. The proposed cost function is also flexible in that it does not assume any particular functional form or rely on a concavity assumption, and it uses implicit utility maximization principles [, ]. Unlike some of the complementarity-based methods, which consider the net (aggregate) effect of redundancy and complementarity, we consider redundancy minimization and complementarity maximization as two separate objectives in the optimization process. This allows us the flexibility to apply different weights (preferences) to redundancy and complementarity and to control their relative importance adaptively during the search process. Using best-first search as the search strategy, the proposed heuristic offers a best compromise solution (more likely to avoid a local optimum due to the interactively determined gradient), if not the best solution (in the sense of an optimum) [0], which is sufficiently good in most practical scenarios.

Using benchmark datasets, we show empirically that our adaptive heuristic not only outperforms many redundancy-based methods, but is also competitive amongst the existing complementarity-based methods.

Structure of the paper

The rest of the paper is organized as follows. The next section presents the information-theoretic definitions and the concepts of relevance, redundancy, and complementarity. We then review the existing feature selection methods and discuss their strengths and limitations. Following that, we describe the proposed heuristic and its theoretical motivation; in the same section, we also discuss the limitations of our heuristic and carry out a sensitivity analysis. We then present the algorithm for our proposed heuristic and evaluate its time complexity, assess the performance of the heuristic on two synthetic datasets, and validate the heuristic using real datasets, presenting the experimental results. Finally, we summarize and conclude.

Information theory - definitions and concepts

First, we provide the necessary definitions from information theory [] and then discuss the existing notions of relevance, redundancy, and complementarity.

Definitions

Suppose X and Y are discrete random variables with finite state spaces X and Y, respectively. Let p_{X,Y} denote the joint probability mass function (PMF) of X and Y, with marginal PMFs p_X and p_Y.

Definition (Entropy) The entropy of X, denoted by H(X), is defined as H(X) = −Σ_{x∈X} p_X(x) log(p_X(x)). Entropy is a measure of uncertainty in the PMF p_X of X.

Definition (Joint entropy) The joint entropy of X and Y, denoted by H(X,Y), is defined as H(X,Y) = −Σ_{x∈X} Σ_{y∈Y} p_{X,Y}(x,y) log(p_{X,Y}(x,y)). Joint entropy is a measure of uncertainty in the joint PMF p_{X,Y} of X and Y.

Definition (Conditional entropy) The conditional entropy of X given Y, denoted by H(X|Y), is defined as H(X|Y) = −Σ_{x∈X} Σ_{y∈Y} p_{X,Y}(x,y) log(p_{X|y}(x)), where p_{X|y}(x) is the conditional PMF of X given Y = y. Conditional entropy H(X|Y) measures the remaining uncertainty in X given the knowledge of Y.

Definition (Mutual information (MI)) The mutual information between X and Y, denoted by I(X;Y), is defined as I(X;Y) = H(X) − H(X|Y) = H(Y) − H(Y|X). Mutual information measures the amount of dependence between X and Y. It is non-negative, symmetric, and equal to zero iff X and Y are independent.

Definition (Conditional mutual information) The conditional mutual information between X and Y given another discrete random variable Z, denoted by I(X;Y|Z), is defined as I(X;Y|Z) = H(X|Z) − H(X|Y,Z) = H(Y|Z) − H(Y|X,Z). It measures the conditional dependence between X and Y given Z.
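To make these definitions concrete, the following minimal sketch estimates the quantities above from empirical frequencies of discrete samples. The function names and the toy data are illustrative choices, not part of the paper.

```python
# Plug-in (empirical frequency) estimators of the quantities defined above.
from collections import Counter
from math import log2

def entropy(xs):
    """H(X) = -sum_x p(x) log2 p(x), estimated from a list of symbols."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def joint_entropy(xs, ys):
    """H(X, Y), estimated from paired samples."""
    return entropy(list(zip(xs, ys)))

def conditional_entropy(xs, ys):
    """H(X | Y) = H(X, Y) - H(Y)."""
    return joint_entropy(xs, ys) - entropy(ys)

def mutual_information(xs, ys):
    """I(X; Y) = H(X) - H(X | Y)."""
    return entropy(xs) - conditional_entropy(xs, ys)

def conditional_mutual_information(xs, ys, zs):
    """I(X; Y | Z) = H(X | Z) - H(X | Y, Z)."""
    yz = list(zip(ys, zs))
    return conditional_entropy(xs, zs) - conditional_entropy(xs, yz)

# Toy usage: X and Y mostly agree, so the estimated I(X;Y) is positive.
X = [0, 0, 1, 1, 0, 1, 0, 1]
Y = [0, 0, 1, 1, 0, 1, 1, 0]
print(round(mutual_information(X, Y), 3))
```

On small samples these plug-in estimates are biased, which is one practical motivation for the normalized symmetric-uncertainty measures defined in the next set of definitions.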

Definition (Interaction information) The interaction information [,, ] among X, Y, and Z, denoted by I(X;Y;Z), is defined as I(X;Y;Z) = I(X;Y) − I(X;Y|Z). Interaction information measures the change in the degree of association between two random variables when one of the interacting variables is held constant. It can be positive, negative, or zero depending on the relative order of magnitude of I(X;Y) and I(X;Y|Z). Interaction information is symmetric (order independent). More generally, the interaction information among a set of n random variables X = {X_1, X_2, ..., X_n} is given by I(X_1; X_2; ...; X_n) = −Σ_{S⊆X} (−1)^{|S|} H(S), where the sum runs over all subsets S of X []. Should it be zero, we say that the features do not interact altogether [].

Definition (Multivariate mutual information) The multivariate mutual information [, ] between a set of n features X = {X_1, X_2, ..., X_n} and Y, denoted by I(X;Y), is defined by I(X;Y) = Σ_i I(X_i;Y) − Σ_{i<j} I(X_i;X_j;Y) + ... + (−1)^{n+1} I(X_1;...;X_n;Y). This is the Möbius representation of multivariate mutual information based on set theory. Multivariate mutual information measures the information that X contains about Y and can be seen as a series of alternating inclusions and exclusions of higher-order terms that represent the simultaneous interaction of several variables.

Definition (Symmetric uncertainty) The symmetric uncertainty [] between X and Y, denoted by SU(X,Y), is defined as SU(X,Y) = 2 I(X;Y) / (H(X) + H(Y)). Symmetric uncertainty is a normalized version of MI in the range [0,1]. Symmetric uncertainty can compensate for MI's bias towards features with more values.

Definition (Conditional symmetric uncertainty) The conditional symmetric uncertainty between X and Y given Z, denoted by SU(X,Y|Z), is defined as SU(X,Y|Z) = 2 I(X;Y|Z) / (H(X|Z) + H(Y|Z)). SU(X,Y|Z) is a normalized version of the conditional mutual information I(X;Y|Z).

From the definitions above, the symmetric uncertainty equivalent of interaction information can be expressed as SU(X,Y,Z) = SU(X,Y) − SU(X,Y|Z).

Using the above notations, we can formulate the feature selection problem as follows: given a set of n features F = {F_i}, i ∈ {1,...,n}, the goal of feature selection is to select a subset of features F_S = {F_i : i ∈ S}, S ⊆ {1,...,n}, such that F_S = argmax_S I(F_S;Y), where I(F_S;Y) denotes the mutual information between F_S and the class variable Y. For tractability reasons, and unless there is strong evidence for the existence of higher-order interaction, the correction terms beyond 3-way interaction are generally ignored in the estimation of multivariate mutual information. In this paper, we will use

I(F_S;Y) ≈ Σ_{i∈S} I(F_i;Y) − Σ_{i,j∈S, i<j} I(F_i;F_j;Y)   (1)

where I(F_i;F_j;Y) is the 3-way interaction term between F_i, F_j, and Y. The proof of Equation (1) for the two-feature case can be shown easily using the Venn diagram in Figure 3; the n-feature case follows by recursive application of the two-feature case. In this paper, we use I(X;Y;Z) = I(X;Y) − I(X;Y|Z) as the interaction information, which follows the sign convention consistent with measure theory and is used by several authors [,, ]. [] defines I(X;Y;Z) = I(X;Y|Z) − I(X;Y) as interaction information, which has the opposite sign for an odd number of random variables. Either formulation measures the same aspect of feature interaction [].
The sign convention used in this paper corresponds to the common area of overlap in the information diagram and does not impact the heuristic as we deal with absolute value of the interaction information.
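A small sketch of how the 3-way quantities above can be estimated empirically is given below. The helper names, the symmetric-uncertainty form, and the pairwise approximation of Eq. (1) are written as assumptions for illustration, using plug-in entropy estimates rather than the paper's implementation.

```python
# Empirical interaction information, symmetric uncertainty, and the pairwise
# approximation of I(F_S;Y) from Eq. (1).
from collections import Counter
from math import log2
from itertools import combinations

def H(*cols):
    """Joint entropy of one or more aligned columns of discrete values."""
    n = len(cols[0])
    return -sum((c / n) * log2(c / n) for c in Counter(zip(*cols)).values())

def I(x, y):                       # I(X;Y)
    return H(x) + H(y) - H(x, y)

def I_cond(x, y, z):               # I(X;Y|Z)
    return H(x, z) + H(y, z) - H(z) - H(x, y, z)

def interaction(x, y, z):          # I(X;Y;Z) = I(X;Y) - I(X;Y|Z), paper's sign convention
    return I(x, y) - I_cond(x, y, z)

def SU(x, y):                      # symmetric uncertainty, normalized to [0, 1]
    hx, hy = H(x), H(y)
    return 2 * I(x, y) / (hx + hy) if hx + hy > 0 else 0.0

def mi_subset_approx(features, y):
    """Pairwise approximation of I(F_S;Y) as in Eq. (1)."""
    relevance = sum(I(f, y) for f in features)
    interactions = sum(interaction(fi, fj, y)
                       for fi, fj in combinations(features, 2))
    return relevance - interactions
```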

Relevance

Fig. 3: Venn diagram showing the interaction between features F_1, F_2, and the class Y: I(F_1,F_2;Y) = I(F_1;Y) + I(F_2;Y|F_1) = I(F_1;Y) + I(F_2;Y) − I(F_1;F_2;Y)

The relevance of a feature signifies its explanatory power, and is a measure of feature worthiness. A feature can be relevant individually, or together with other variables, if it carries information about the class Y. It is also possible that an individually irrelevant feature becomes relevant, or a relevant feature becomes irrelevant, when other features are present. This can be shown using Figures 4 and 5, which present a 2-class classification problem (denoted by star and circle) with two continuous features X_1 and X_2. The projections of the classes on each axis denote each feature's individual relevance. In Figure 4, X_2, which is individually irrelevant (uninformative), becomes relevant in the presence of X_1, and together they improve the class separation that is otherwise achievable by X_1 alone. In Figure 5, both X_1 and X_2 are individually irrelevant, yet they provide a perfect separation when present together (the chessboard problem, analogous to the XOR problem). Thus the relevance of a feature is context dependent [].

Using the information-theoretic framework, a feature F_i is said to be unconditionally relevant to the class variable Y if I(F_i;Y) > 0, and irrelevant if I(F_i;Y) = 0. When evaluated in the context of other features, we call F_i conditionally relevant if I(F_i;Y|F_{S−i}) > 0, where F_{S−i} = F_S \ F_i. There are several other probabilistic definitions of relevance available in the literature. Most notably, [] formalize relevance in terms of an optimal Bayes classifier and propose two degrees of relevance: strong and weak. Strongly relevant features are those that bring unique information about the class variable and cannot be replaced by other features. Weakly relevant features are relevant but not unique, in the sense that they can be replaced by other features. An irrelevant feature is one that is neither strongly nor weakly relevant.

Fig. 4: X_2 is individually irrelevant but improves the separability of X_1

Fig. 5: Both individually irrelevant features become relevant together

Redundancy

The concept of redundancy is associated with the degree of dependency between two or more features. Two variables are said to be redundant if they share common information about each other. This is the general dependency measured by I(F_i;F_j). [, ] formalize this notion of redundancy as multi-information or total correlation. The multi-information between a set of n features {F_1,...,F_n} is given by R(F_1,...,F_n) = Σ_{i=1}^{n} H(F_i) − H(F_1,...,F_n). For n = 2, R(F_1,F_2) = H(F_1) + H(F_2) − H(F_1,F_2) = I(F_1;F_2). This measure of redundancy is non-linear, non-negative, and non-decreasing with the number of features. In the context of feature selection, it is often of more interest to know whether two features are redundant with respect to the class variable than whether they are mutually redundant. Two features F_i and F_j are said to be redundant with respect to the class variable Y if I(F_i,F_j;Y) < I(F_i;Y) + I(F_j;Y). From Equation (1), it follows that I(F_i;F_j;Y) > 0, or equivalently I(F_i;F_j) > I(F_i;F_j|Y). Thus two features are redundant with respect to the class variable if their unconditional dependency exceeds their class-conditional dependency.

Complementarity

Complementarity, also known as information synergy, is the beneficial effect of feature interaction whereby two features together provide more information than the sum of their individual information. Two features F_i and F_j are said to be complementary with respect to the class variable Y if I(F_i,F_j;Y) > I(F_i;Y) + I(F_j;Y), or equivalently, I(F_i;F_j) < I(F_i;F_j|Y). Complementarity is negative interaction information. While generic dependency is undesirable, dependency that conveys class information is good. Different researchers have explained complementarity from different perspectives. [] define complementarity in terms of the degree of interaction between an individual feature F_i and the selected feature subset F_S given the class Y, i.e., I(F_i;F_S|Y).
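The two conditions above suggest a simple empirical test for whether a feature pair is redundant or complementary with respect to the class. The sketch below is illustrative (plug-in estimators and hypothetical names) and uses an XOR-like pair as an example of complementarity.

```python
# Classify a feature pair as redundant or complementary with respect to Y,
# using I(Fi;Fj) > I(Fi;Fj|Y) for redundancy and the reverse for complementarity.
from collections import Counter
from math import log2

def H(*cols):
    n = len(cols[0])
    return -sum((c / n) * log2(c / n) for c in Counter(zip(*cols)).values())

def pair_type(fi, fj, y, tol=1e-12):
    unconditional = H(fi) + H(fj) - H(fi, fj)                  # I(Fi;Fj)
    conditional = H(fi, y) + H(fj, y) - H(y) - H(fi, fj, y)    # I(Fi;Fj|Y)
    if unconditional > conditional + tol:
        return "redundant"        # positive interaction information
    if conditional > unconditional + tol:
        return "complementary"    # negative interaction information
    return "no 3-way interaction"

# XOR-like pair: each feature alone says nothing about Y, together they decide it.
F1, F2, Y = [0, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 0]
print(pair_type(F1, F2, Y))       # -> complementary
```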

[0] provide a similar definition of complementarity but call it conditional redundancy. They come to a similar conclusion as []: the inclusion of correlated features can be useful, provided the correlation within the class is stronger than the overall correlation.

Related literature

In this section, we review filter-based feature selection methods, which use information gain as a measure of dependence. In terms of evaluation strategy, filter-based methods can be broadly classified into (i) the redundancy-based approach and (ii) the complementarity-based approach, depending on whether or not they account for feature complementarity in the selection process. [0], however, show that both these approaches can be subsumed in a more general, unifying theoretical framework known as conditional likelihood maximization.

Redundancy-based methods

Most feature selection algorithms in the 1990s and early 2000s focus on relevance and redundancy to obtain the optimal subset. Notable amongst them are (i) mutual information based feature selection (MIFS) [], (ii) correlation based feature selection (CFS) [], (iii) minimum redundancy maximum relevance (mRMR) [], (iv) fast correlation based filter (FCBF) [], (v) ReliefF [], and (vi) conditional mutual information maximization (CMIM) [, ]. With some variation, their main goal is to maximize relevance and minimize redundancy. Of these methods, MIFS, FCBF, ReliefF, and CMIM are essentially feature ranking algorithms. They rank the features based on a certain information maximization criterion [] and select the top k features, where k is decided a priori based on expert knowledge or technical considerations.

MIFS is a sequential feature selection algorithm in which a candidate feature F_i is selected so as to maximize the conditional mutual information I(F_i;Y|F_S). [] approximates this MI by I(F_i;Y|F_S) = I(F_i;Y) − β Σ_{F_j∈F_S} I(F_i;F_j), where F_S is the already selected feature subset, and β ∈ [0,1] is a user-defined parameter that controls the redundancy penalty. For β = 0, it reduces to a ranking algorithm. [] finds that β ∈ [0.5, 1] is appropriate for many classification tasks. [] show that when β = 1, the MIFS method penalizes redundancy too strongly, and for this reason does not work well for non-linear dependence. CMIM implements an idea similar to MIFS, but differs in the way in which I(F_i;Y|F_S) is estimated. CMIM selects the candidate feature F_i that maximizes min_{F_j∈F_S} I(F_i;Y|F_j). Both MIFS and CMIM are incremental forward search methods, and they suffer from initial selection bias []. For example, if {F_1,F_2} is the selected subset and {F_3,F_4} is the candidate set, CMIM selects F_3 if I(F_3;Y|{F_1,F_2}) > I(F_4;Y|{F_1,F_2}), and the new optimal subset becomes {F_1,F_2,F_3}. The incremental search only evaluates the redundancy between the candidate feature F_3 and {F_1,F_2}, i.e., I(F_3;Y|{F_1,F_2}), and never considers the redundancy between F_1 and {F_2,F_3}, i.e., I(F_1;Y|{F_2,F_3}).
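The MIFS and CMIM criteria described above can be sketched as greedy steps over precomputed information tables. The table names (rel, red, cmi), the default beta, and the greedy loop below are assumptions made for illustration, not the original authors' code.

```python
# Greedy selection steps for MIFS-style and CMIM-style criteria.
# rel[i] = I(Fi;Y), red[i][j] = I(Fi;Fj), cmi[i][j] = I(Fi;Y|Fj) are assumed
# to be precomputed lookup tables (e.g. nested dicts or 2-D arrays).
def mifs_step(candidates, selected, rel, red, beta=0.5):
    """Pick the candidate maximizing I(Fi;Y) - beta * sum_j I(Fi;Fj)."""
    return max(candidates,
               key=lambda i: rel[i] - beta * sum(red[i][j] for j in selected))

def cmim_step(candidates, selected, rel, cmi):
    """Pick the candidate maximizing min_j I(Fi;Y|Fj); plain relevance if S is empty."""
    score = lambda i: min((cmi[i][j] for j in selected), default=rel[i])
    return max(candidates, key=score)

def greedy_select(n_features, k, step, **tables):
    """Incremental forward search: add k features one at a time."""
    selected, candidates = [], set(range(n_features))
    for _ in range(k):
        best = step(candidates, selected, **tables)
        selected.append(best)
        candidates.remove(best)
    return selected

# Example call: greedy_select(n, 5, cmim_step, rel=rel, cmi=cmi)
```

Note how the loop only ever conditions on the already selected features, which is exactly the initial selection bias discussed above.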

CFS and mRMR are both subset selection algorithms, which evaluate a subset of features using an implicit cost function that simultaneously maximizes relevance and minimizes redundancy. CFS evaluates a subset of features based on pairwise correlation measures, in which correlation is used as a generic measure of dependence. CFS uses the following heuristic to evaluate a subset of features: merit(S) = k r̄_cf / √(k + k(k−1) r̄_ff), where k denotes the subset size, r̄_cf denotes the average feature-class correlation, and r̄_ff denotes the average feature-feature correlation of the features in the subset. The feature-feature correlation is used as a measure of redundancy, and the feature-class correlation is used as a measure of relevance. The goal of CFS is to find a subset of independent features that are uncorrelated and predictive of the class. CFS ignores feature complementarity, and cannot identify strongly interacting features such as those in the parity problem [].

mRMR is very similar to CFS in principle; however, instead of correlation measures, mRMR uses the mutual information I(F_i;Y) as a measure of relevance, and I(F_i;F_j) as a measure of redundancy. mRMR uses the following heuristic to evaluate a subset of features: score(S) = (1/k) Σ_{i∈S} I(F_i;Y) − (1/k²) Σ_{i,j∈S} I(F_i;F_j). The mRMR method suffers from limitations similar to CFS. [] show that the approximations made by the information-theoretic methods, such as mRMR and CMIM, are based on unrealistic assumptions; they introduce a novel set of assumptions based on variational distributions and derive novel algorithms with competitive performance.
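For concreteness, the CFS merit and mRMR score quoted above can be written directly from their definitions. The input names below (average correlations for CFS, MI tables for mRMR) are illustrative, and summing the mRMR redundancy term over ordered pairs with i ≠ j is a choice made here.

```python
# Subset-evaluation heuristics of CFS and mRMR, as quoted above.
from math import sqrt

def cfs_merit(k, rcf_bar, rff_bar):
    """merit(S) = k * r_cf / sqrt(k + k*(k-1)*r_ff)."""
    return k * rcf_bar / sqrt(k + k * (k - 1) * rff_bar)

def mrmr_score(subset, rel, red):
    """score(S) = (1/k) * sum_i I(Fi;Y) - (1/k^2) * sum_{i != j} I(Fi;Fj)."""
    k = len(subset)
    relevance = sum(rel[i] for i in subset) / k
    redundancy = sum(red[i][j] for i in subset for j in subset if i != j) / k ** 2
    return relevance - redundancy
```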
FCBF follows a two-step process. In step 1, it ranks all features based on the symmetric uncertainty between each feature and the class variable, i.e., SU(F_i,Y), and selects the relevant features that exceed a given threshold value δ. In step 2, it finds the optimal subset by eliminating redundant features from the relevant features selected in step 1, using an approximate Markov blanket criterion. In essence, it decouples the relevance and redundancy analysis, and circumvents the concurrent subset search and subset evaluation process. Unlike CFS and mRMR, FCBF is computationally fast, simple, and fairly easy to implement due to the sequential two-step process. However, this method fails to capture situations where feature dependencies appear only conditionally on the class variable []. [] state that FCBF suffers from instability, as its naive heuristics may be unsuitable in many situations. One of the drawbacks of FCBF is that it rules out the possibility of an irrelevant feature becoming relevant due to interaction with other features []. CMIM, which simultaneously evaluates relevance and redundancy at every iteration, overcomes this limitation.

Relief [], and its multi-class version ReliefF [], are instance-based feature ranking algorithms that rank each feature based on its similarity with k nearest neighbors from the same and opposite classes, selected randomly from the dataset. The underlying principle is that a useful feature should have the same value for instances from the same class and different values for instances from a different class. In this method, m instances are randomly selected from the training data, and for each of these m instances, n nearest neighbors are chosen from the same and the opposite class. The feature values of the nearest neighbors are compared with those of the sampled instance, and the score of each feature is updated. A feature receives a higher weight if it has the same value as instances from the same class and different values from instances of other classes. In Relief, the score or weight of each feature is measured by the Euclidean distance between the sampled instance and the nearest neighbor, which reflects the feature's ability to discriminate between different classes.

The consistency-based method [,, ] is another approach, which uses a consistency measure as the performance metric. A feature subset is inconsistent if there exist at least two instances with the same feature values but different class labels. The inconsistency rate of a dataset is the number of inconsistent instances divided by the total number of instances in it. This approach aims to find a subset whose size is minimal and whose inconsistency rate is equal to that of the original dataset. [] propose the following heuristic: Consistency(S) = 1 − [Σ_{i=0}^{m} (D_i − M_i)] / N, where m is the number of distinct combinations of feature values for subset S, D_i is the number of instances of the i-th feature-value combination, M_i is the cardinality of the majority class for the i-th feature-value combination, and N is the total number of instances in the dataset.
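A minimal sketch of the consistency measure above: group the instances by their value combination on S and compare each group's size with its majority-class count. Names are illustrative, and the leading "1 −" follows the reconstruction of the formula given above.

```python
# Consistency(S) = 1 - sum_i (|D_i| - |M_i|) / N over value combinations of S.
from collections import Counter, defaultdict

def consistency(rows, subset, labels):
    groups = defaultdict(list)
    for row, y in zip(rows, labels):
        groups[tuple(row[j] for j in subset)].append(y)   # D_i: instances per combination
    inconsistent = sum(len(ys) - max(Counter(ys).values())  # |D_i| - |M_i|
                       for ys in groups.values())
    return 1 - inconsistent / len(rows)
```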

The Markov Blanket (MB) filter [] provides another useful technique for variable selection. The MB filter works on the principle of conditional independence and excludes a feature only if the MB of the feature is within the set of remaining features. Though the MB framework based on information theory is theoretically optimal in eliminating irrelevant and redundant features, it is computationally intractable. Incremental association Markov blanket (IAMB) [] and Fast-IAMB [0] are two MB-based algorithms that use conditional mutual information as the metric for the conditional independence test. They address the drawback of CMIM by performing redundancy checks during both the growing (forward) pass and the shrinkage (backward) pass.

Complementarity-based methods

The literature on complementarity-based feature selection methods that simultaneously optimize redundancy and complementarity is relatively sparse, despite the earliest research on feature interaction dating back to [] and subsequently being advanced by [,,, ]. The feature selection methods that consider feature complementarity include double input symmetrical relevance (DISR) [], redundancy complementariness dispersion based feature selection (RCDFS) [], INTERACT [], interaction weight based feature selection (IWFS) [], maximum relevance maximum complementary (MRMC) [], joint mutual information (JMI) [, ], and min-Interaction Max-Relevancy (mIMR) [].

The goal of DISR is to find the best subset of a given size d, where d is assumed to be known a priori. It considers complementarity implicitly [], which means that it considers only the net effect of redundancy and complementarity in the search process. As a result, DISR does not distinguish between two subsets that have the same net information, even if one achieves it with a large information gain offset by a large information loss and the other with a smaller gain and no loss; in other words, information loss and information gain are treated equally. DISR works on the principle of the k-average sub-subset information criterion, which is shown to be a good approximation of the information of a set of features: the mutual information between a subset F_S of d features and the class variable Y is lower bounded by the average information of its size-k sub-subsets, i.e., I(F_S;Y) ≥ (1/C(d,k)) Σ_{V⊆S, |V|=k} I(F_V;Y), where k is chosen such that there are no complementarities of order greater than k. Using k = 2, DISR recursively decomposes each larger subset (d > 2) into sub-subsets containing feature pairs F_i and F_j (i ≠ j), and chooses a subset F_S that maximizes the average pairwise information, F_S = argmax_S Σ_{i,j∈S, i<j} I(F_i,F_j;Y) / C(|S|,2). An implementation of this heuristic, known as MASSIVE, has also been proposed.

The mIMR method presents another variation of DISR in that (i) mIMR first removes all features that have zero mutual information with the class, and (ii) it decomposes the multivariate term in DISR into a linear combination of relevance and interaction terms. mIMR considers causal discovery in the selection process; it restricts the selection to variables that have both positive relevance and negative interaction. Both DISR and mIMR belong to a framework, known as joint mutual information (JMI), initially proposed by [].
JMI provides a sequential feature selection method in which the JMI score of the incoming feature F_k is given by J_jmi(F_k) = Σ_{F_i∈F_S} I(F_k,F_i;Y), where F_S is the already selected subset.

This is the information between the target and the joint random variable (F_k,F_i) defined by pairing the candidate F_k with each feature previously selected.

In RCDFS, [] suggest that ignoring higher-order feature dependence may lead to false positives (FPs), i.e., actually redundant features misidentified as relevant due to the pairwise approximation, being selected into the optimal subset, which may impair the selection of subsequent features. The degree of interference depends on the number of FPs present in the already selected subset and their degree of correlation with the incoming candidate feature. Only when the true positives (TPs) and FPs have opposing influences on the candidate feature is the selection misguided. For instance, if the candidate feature is redundant to the FPs but complementary to the TPs, then the new feature will be discouraged from selection, while it should ideally be selected, and vice versa. They estimate the interaction information (complementarity or redundancy) of the candidate feature with each of the already selected features. They propose to measure this noise by the standard deviation (dispersion) of these interaction effects and to minimize it. The smaller the dispersion, the less influential is the interference effect of the false positives. One limitation of RCDFS is that it assumes that all TPs in the already selected subset exhibit a similar type of association, i.e., either all are complementary to, or all are redundant with, the candidate feature (see the corresponding figure in []). This is a strong assumption and need not necessarily be true. In fact, it is more likely that different dispersion patterns will be observed. In such cases, the proposed method will fail to differentiate between the good influence (due to TPs) and the bad influence (due to FPs), and will therefore be ineffective in mitigating the interference effect of FPs in the feature selection process.

[] propose a complementarity-based ranking algorithm, IWFS. Their method is based on interaction weight factors, which reflect whether a feature is redundant or complementary. The interaction weight of a feature is updated at each iteration, and a feature is selected if its interaction weight exceeds a given threshold. Another complementarity-based method, INTERACT, uses a feature sorting metric based on data consistency. The c-contribution of a feature is estimated based on its inconsistency rate. A feature is removed if its c-contribution is less than a given threshold, and is otherwise retained. This method is computationally intensive and has worst-case time complexity O(NM), where N is the number of instances and M is the number of features. The MRMC method presents a neural network based feature selection that uses relevance and complementary scores. Relevance and complementary scores are estimated based on how a feature influences or complements the networks.

[0] propose a space of potential criteria that encompasses several redundancy- and complementarity-based methods. They propose that the worth of a candidate feature F_k, given an already selected subset F_S, can be represented as J(F_k) = I(F_k;Y) − β Σ_{F_i∈F_S} I(F_k;F_i) + γ Σ_{F_i∈F_S} I(F_k;F_i|Y). Different values of β and γ lead to different feature selection methods. For example, γ = 0 leads to MIFS, β = γ = 1/|S| leads to JMI, and γ = 0 with β = 1/|S| leads to mRMR.
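The unifying criterion above is easy to express against precomputed tables. The sketch below, with hypothetical table names, shows how particular choices of β and γ recover JMI-like and mRMR-like scores; it is an illustration of the framework, not the cited authors' implementation.

```python
# J(Fk) = I(Fk;Y) - beta * sum_i I(Fk;Fi) + gamma * sum_i I(Fk;Fi|Y),
# with rel[k] = I(Fk;Y), red[k][i] = I(Fk;Fi), cmi[k][i] = I(Fk;Fi|Y).
def unified_score(k, selected, rel, red, cmi, beta, gamma):
    red_term = sum(red[k][i] for i in selected)
    comp_term = sum(cmi[k][i] for i in selected)
    return rel[k] - beta * red_term + gamma * comp_term

def jmi_like(k, selected, rel, red, cmi):
    """beta = gamma = 1/|S| recovers the JMI criterion."""
    w = 1 / max(len(selected), 1)
    return unified_score(k, selected, rel, red, cmi, beta=w, gamma=w)

def mrmr_like(k, selected, rel, red, cmi):
    """gamma = 0 and beta = 1/|S| recovers mRMR."""
    w = 1 / max(len(selected), 1)
    return unified_score(k, selected, rel, red, cmi, beta=w, gamma=0.0)
```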
Motivation and the proposed heuristic

In this section, we first outline the motivation behind using redundancy and complementarity explicitly in the search process and the use of an implicit utility function approach.

Table 1: Golf dataset

Outlook (F_1)   Temperature (F_2)   Humidity (F_3)   Windy (F_4)   Play golf (Y)
Rainy           Hot                 High             False         No
Rainy           Hot                 High             True          No
Overcast        Hot                 High             False         Yes
Sunny           Mild                High             False         Yes
Sunny           Cool                Normal           False         Yes
Sunny           Cool                Normal           True          No
Overcast        Cool                Normal           True          Yes
Rainy           Mild                High             False         No
Rainy           Cool                Normal           False         Yes
Sunny           Mild                Normal           False         Yes
Rainy           Mild                Normal           True          Yes
Overcast        Mild                High             True          Yes
Overcast        Hot                 Normal           False         Yes
Sunny           Mild                High             True          No

Then, we propose a heuristic, called self-adaptive feature evaluation (SAFE). SAFE is motivated by the implicit utility function approach in multi-objective optimization. The implicit utility function approach belongs to the interactive methods of optimization [, ], which combine the search process with the decision maker's relative preference over multiple objectives. In interactive methods, decision making and optimization occur simultaneously.

Implicit vs. explicit measurement of complementarity

Combining negative (complementarity) and positive (redundancy) interaction information may produce inconsistent results when the goal is to find an optimal feature subset. In this section, we demonstrate this using the Golf dataset presented in Table 1. The dataset has four features {F_1,...,F_4}. The information content I(F_S;Y) of each possible subset F_S is estimated (i) first using Eq. (1), and (ii) then using the aggregate approach. In the aggregate approach, to compute I(F_S;Y), we take the average of the mutual information of each two-feature sub-subset of F_S with the class variable, i.e., of the values I(F_i,F_j;Y). For example, I(F_1,F_2,F_3;Y) is approximated by the average of I(F_1,F_2;Y), I(F_1,F_3;Y), and I(F_2,F_3;Y). Table 2 presents the results. Mutual information is computed using the infotheo package in R with the empirical entropy estimator. Our goal is to find the optimal subset regardless of subset size. Clearly, in this example, {F_1,F_2,F_3,F_4} is the optimal subset that has maximum information about the class variable. However, using the aggregate approach, a three-feature subset appears to be the best subset. Moreover, in the aggregate approach, one would rank one three-feature subset above another even though the latter has higher information content than the former.
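The comparison in this subsection can be reproduced in outline as follows. The sketch uses plug-in entropy estimates on the Golf data of Table 1; exact values depend on the estimator used (the paper uses the empirical estimator of the infotheo package in R), so the output should be read as illustrative rather than a reproduction of Table 2.

```python
# Compare I(F_S;Y) for the full Golf feature set: Eq. (1) versus the aggregate
# (pairwise-average) approach.
from collections import Counter
from math import log2
from itertools import combinations

def H(*cols):
    n = len(cols[0])
    return -sum((c / n) * log2(c / n) for c in Counter(zip(*cols)).values())

def I(*cols_and_y):                 # I(Fi1,...,Fik ; Y), grouping the features jointly
    *fs, y = cols_and_y
    joint = list(zip(*fs))
    return H(joint) + H(y) - H(joint, y)

def interaction(fi, fj, y):         # I(Fi;Fj;Y) = I(Fi;Fj) - I(Fi;Fj|Y)
    return (H(fi) + H(fj) - H(fi, fj)) - (H(fi, y) + H(fj, y) - H(y) - H(fi, fj, y))

def eq1_estimate(features, y):      # Eq. (1): relevance sum minus pairwise interactions
    return (sum(I(f, y) for f in features)
            - sum(interaction(a, b, y) for a, b in combinations(features, 2)))

def aggregate_estimate(features, y):  # average of I(Fi,Fj;Y) over all pairs
    pairs = list(combinations(features, 2))
    return sum(I(a, b, y) for a, b in pairs) / len(pairs)

rows = [  # Table 1: Outlook, Temperature, Humidity, Windy, Play
    ("Rainy", "Hot", "High", "False", "No"),   ("Rainy", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"), ("Sunny", "Mild", "High", "False", "Yes"),
    ("Sunny", "Cool", "Normal", "False", "Yes"), ("Sunny", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"), ("Rainy", "Mild", "High", "False", "No"),
    ("Rainy", "Cool", "Normal", "False", "Yes"), ("Sunny", "Mild", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "Normal", "True", "Yes"), ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"), ("Sunny", "Mild", "High", "True", "No"),
]
outlook, temp, humidity, windy, play = map(list, zip(*rows))
F = [outlook, temp, humidity, windy]
print(round(eq1_estimate(F, play), 3), round(aggregate_estimate(F, play), 3))
```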

Table 2: Mutual information I(F_S;Y) between each subset F_S and the class Y, estimated using Eq. (1) and using the aggregate approach, for all fifteen non-empty subsets of {F_1, F_2, F_3, F_4}.

A new adaptive heuristic

We first introduce the following notations for our heuristic and then define the adaptive cost function.

Subset relevance. Given a subset S, the subset relevance, denoted by A_S, is defined as the sum of the pairwise mutual information between each feature and the class variable, i.e., A_S = Σ_{i∈S} I(F_i;Y). A_S measures the predictive ability of each individual feature acting alone.

Subset redundancy. Given a subset S, the subset redundancy, denoted by R_S, is defined as the sum of all positive 3-way interactions in the subset, i.e., R_S = Σ_{i,j∈S, i<j} (I(F_i;F_j) − I(F_i;F_j|Y)) over all (i,j) such that I(F_i;F_j) > I(F_i;F_j|Y). R_S measures the information loss due to feature redundancy.

Subset complementarity. Given a subset S, the subset complementarity, denoted by C_S, is defined as the absolute value of the sum of all negative 3-way interactions in the subset, i.e., C_S = Σ_{i,j∈S, i<j} (I(F_i;F_j|Y) − I(F_i;F_j)) over all (i,j) such that I(F_i;F_j) < I(F_i;F_j|Y). C_S measures the information gain due to feature complementarity.

Subset dependence. Given a subset S, the subset dependence, denoted by D_S, is defined as the sum of the mutual information between each pair of features, i.e., D_S = Σ_{i,j∈S, i<j} I(F_i;F_j). We use D_S as a measure of unconditional feature redundancy in our heuristic. This is the same as the unconditional mutual information between features described as redundancy in the literature [,, ]. We call this subset dependence to distinguish it from the information loss due to redundancy (R_S), which is measured by the difference between conditional and unconditional mutual information.

Below, we present the proposed heuristic:

Score(S) = (A_S + γ C_S^{β/|S|}) / √(|S| + β D_S)   (2)

where A_S, C_S, and D_S are the subset relevance, subset complementarity, and subset dependence, respectively, and |S| denotes the subset size. β and γ are hyper-parameters defined as follows: α = R_S / (R_S + C_S), β = (1 + α), ξ = C_S / (C_S + A_S), and γ = (1 − ξ). R_S measures subset redundancy as defined above. As mentioned above, the heuristic characterizes an adaptive objective function, which evolves depending on the level of feature interactions.
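A hedged sketch of the score in Eq. (2), as reconstructed above, computed from symmetric-uncertainty tables: the function and table names are illustrative, and the exponent and denominator follow this reconstruction rather than a verified reference implementation.

```python
# Score(S) = (A_S + gamma * C_S**(beta/|S|)) / sqrt(|S| + beta * D_S),
# with su[i][j] = SU(Fi,Fj), su_y[i][j] = SU(Fi,Fj|Y), rel[i] = SU(Fi,Y).
from math import sqrt
from itertools import combinations

def safe_score(S, rel, su, su_y):
    pairs = list(combinations(S, 2))
    A = sum(rel[i] for i in S)                                  # subset relevance A_S
    D = sum(su[i][j] for i, j in pairs)                         # subset dependence D_S
    R = sum(max(su[i][j] - su_y[i][j], 0) for i, j in pairs)    # information loss R_S
    C = sum(max(su_y[i][j] - su[i][j], 0) for i, j in pairs)    # information gain C_S
    alpha = R / (R + C) if R + C > 0 else 0.0                   # degree of redundancy
    beta = 1 + alpha                                            # in [1, 2]
    gamma = 1 - C / (C + A) if C + A > 0 else 0.0               # in [0, 1]
    return (A + gamma * C ** (beta / len(S))) / sqrt(len(S) + beta * D)
```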

We model this adaptation using two hyper-parameters, β and γ. The values of these parameters are computed by the heuristic during the search process based on the relative values of relevance, redundancy, and complementarity. The ratio α ∈ [0,1] measures the percentage of redundancy in the subset, which determines whether the subset is predominantly redundant or complementary. If α = 1, all features are pairwise redundant, and we call the subset predominantly redundant. At the other extreme, if α = 0, all features are pairwise complementary, and we call the subset predominantly complementary. We take α = 0/0 = 0 for a fully independent subset of features, which is, however, rarely the case.

The hyper-parameter β controls the trade-off between redundancy and complementarity based on the value of α, which is a function of subset characteristics. We consider β to be a linear function of α such that the penalty for unconditional dependency increases linearly to twice its value when the subset is fully redundant. This resembles the heuristic of CFS [] when α = 1. The |S| in the denominator allows the heuristic to favor smaller subsets, while the square root in the denominator allows the penalty term to vary exponentially with increasing subset size and feature dependency.

The proposed heuristic adaptively modifies the trade-off rule as the search process for the optimal subset continues. We explain how such an adaptive criterion works. As α increases, the subset becomes more redundant and the value of subset complementarity (C_S) decreases (C_S = 0 when α = 1), leaving us with little opportunity to use complementarity very effectively in the feature selection process. In other words, the value of subset complementarity C_S is not sufficiently large to be able to differentiate between two subsets. At best, we expect to extract a set of features that are less redundant or nearly independent. As a result, β ∈ [1,2] increasingly penalizes the subset dependence term D_S in the denominator and rewards subset complementarity C_S in the numerator as α increases from 0 to 1. In contrast, as α decreases, the subset becomes predominantly complementary, leading to an increase in the value of C_S. As the magnitude of C_S gets sufficiently large, complementarity plays the key role in differentiating between two subsets, compared to the subset dependence D_S in the denominator. This, however, causes a bias towards larger subsets, as the complementarity gain increases monotonically with the size of the subset. We observe that C_S increases exponentially with the logarithm of the subset size as α decreases. In Figure 6, we demonstrate this for three different real datasets having different degrees of redundancy (α). In order to control this bias, we raise C_S to the exponent β/|S|. Moreover, given the way we formalize α, it is possible to have two different subsets, both having α = 0 but with different degrees of information gain C_S. This is because the information gain C_S of a subset is affected both by the number of complementary pairs of features in the subset and by how complementary each pair of features is, which is an intrinsic property of each feature. Hence it is possible that a larger subset with weak complementarity produces the same amount of information gain C_S as another smaller subset with highly complementary features.
This is evident from Figure 6, which shows that subset complementarity grows faster for the dataset Lung Cancer than for the dataset Promoter, despite the fact that the latter has a lower α, and both have an identical number of features. The exponent β/|S| also takes care of such issues. Next, we examine scenarios when C_S > A_S. This is the case when a subset learns more information from interaction with other features than from their individual predictive power. Notice that the source of class information, whether C_S or A_S, is indistinguishable to the heuristic, as the two are linearly additive in the numerator. As a result, this produces an undue bias towards larger subset sizes when C_S > A_S. To control this bias, we introduce the hyper-parameter γ ∈ [0,1], which maintains the balance between relevance and the complementarity information gain. It controls the subset size by reducing the contribution from C_S when C_S > A_S.
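The adaptation of β and γ can be illustrated with two made-up subsets, one predominantly redundant and one predominantly complementary. The numbers below are invented purely for illustration.

```python
# How the trade-off parameters respond to subset composition.
def adaptive_params(A_S, R_S, C_S):
    alpha = R_S / (R_S + C_S) if R_S + C_S > 0 else 0.0          # share of redundancy
    beta = 1 + alpha                                             # penalty on dependence, in [1, 2]
    gamma = 1 - C_S / (C_S + A_S) if C_S + A_S > 0 else 0.0      # damping on complementarity, in [0, 1]
    return alpha, beta, gamma

print(adaptive_params(A_S=1.0, R_S=0.4, C_S=0.1))  # mostly redundant: alpha high, beta near 2
print(adaptive_params(A_S=0.2, R_S=0.1, C_S=0.6))  # mostly complementary: alpha low, gamma well below 1
```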

Fig. 6: Variation of subset complementarity with subset size

Fig. 7: Heuristic score variation with the degree of redundancy α and subset size, given fixed values of A_S, D_S, and R_S + C_S, with A_S large relative to R_S + C_S

Fig. 8: Heuristic score variation with the degree of redundancy α and subset size, given fixed values of A_S, D_S, and R_S + C_S, with A_S small relative to R_S + C_S

Fig. 9: Heuristic score variation with subset dependence and subset relevance, for a given α, a given R_S + C_S, and a smaller subset size

Fig. 10: Heuristic score variation with subset dependence and subset relevance, for the same α and R_S + C_S, and a larger subset size

Sensitivity analysis

In Figures 7 and 8, we show how the proposed heuristic score varies with the degree of redundancy α and the subset size, under different relative magnitudes of the interaction information (R_S + C_S) and the subset relevance A_S. Figure 7 depicts a situation in which the features are individually highly relevant but less interactive (C_S < A_S), whereas in Figure 8 the features are individually less relevant, but as a subset become extremely relevant due to a high degree of feature interaction (C_S > A_S). In either scenario, the score decreases with increasing subset size and with increasing degree of redundancy. For a given subset size, the score is generally lower when C_S > A_S than when C_S < A_S, showing that the heuristic is effective in controlling the subset size when the subset is predominantly complementary. We also observe that the heuristic is very sensitive to redundancy when the features are highly relevant. In other words, redundancy hurts much more when the features are highly relevant. This is evident from the fact that the score falls at a much faster rate with increasing redundancy when A_S is very high compared to C_S, as in Figure 7.

In Figures 9 and 10, we show how the proposed heuristic score varies with subset relevance and subset dependence for two different subset sizes. The score increases linearly with the subset relevance, and decreases non-linearly with the subset dependence, as we would expect from Equation (2). The score is higher for a smaller subset than for a bigger subset under nearly identical conditions. However, for a given subset relevance, the score decreases at a much faster rate with increasing subset dependence when there are fewer features in the subset (as in Figure 9). This phenomenon can be explained with the help of Figures 11 and 12. For a given subset dependence D_S, fewer features mean a higher degree of association (overlap) between the features. In other words, the features share more common information, and are therefore more redundant than when there is a larger number of features in the subset.

Fig. 11: Fewer features in the subset for a given subset dependence D_S; here D_S = I(X;Y)

Fig. 12: More features in the subset for the same subset dependence D_S; here D_S = I(X;Y) + I(Y;Z) + I(X;Z)

Hence, our heuristic not only encourages parsimony, but is also sensitive to the change in feature dependence as the subset size changes. The above discussion demonstrates the adaptive nature of the heuristic under different conditions of relevance and redundancy, which is the motivation behind our heuristic.

Limitations

One limitation of this heuristic, as evident from Figures 9 and 10, is that it assigns a zero score to a subset when the subset relevance is zero, i.e., A_S = 0. Since feature relevance is a non-negative measure, this implies a situation in which every feature in the subset is individually irrelevant to the target concept. Thus, our heuristic does not select a subset when none of the features in the subset carries any useful class information by itself. This, however, ignores the possibility that such features could become relevant due to interaction. However, we have not encountered datasets where this is the case.

Another limitation of our heuristic is that it considers up to 3-way feature interaction (interaction between a pair of features and the class variable) and ignores the higher-order corrective terms. This pairwise approximation is necessary for computational tractability and is adopted in many MI-based feature selection methods []. Despite this limitation, [] shows that Equation (1) produces reasonably good estimates of mutual information for all practical purposes. The higher-order corrective terms become significant only when there exists a very high degree of dependence amongst a large number of features. It may be noted that our definitions of subset complementarity and subset redundancy, as given earlier, can be extended to include higher-order interaction terms without any difficulty. As more precise estimates of mutual information become available, further work will address the merit of using higher-order correction terms in our proposed approach.

Algorithm

In this section, we present the algorithm for our proposed heuristic, SAFE.

Step 1. Assume we start with a training sample D(F,Y) with the full feature set F = {F_1,...,F_n} and class variable Y. Using a search strategy, we choose a candidate subset of features S ⊆ F.

Step 2. Using the training data, we compute the mutual information between each pair of features, I(F_i;F_j), and between each feature and the class variable, I(F_i;Y). We eliminate from S all constant-valued features, for which I(F_i;Y) = 0.

Step 3. For each pair of features in S, we compute the conditional mutual information given the class variable, I(F_i;F_j|Y).

Step 4. We transform all I(F_i;F_j) and I(F_i;F_j|Y) to their symmetric uncertainty form to maintain a scale within [0,1].

Step 5. For each pair of features, we compute the interaction gain or loss, i.e., I(F_i;F_j;Y) = I(F_i;F_j) − I(F_i;F_j|Y), using the symmetric uncertainty measures.

Step 6. We compute A_S, D_S, C_S, R_S, β, and γ using the information from Steps 4 and 5.

Step 7. Using the information from Step 6, the heuristic determines a score Score(S) for the subset S. The search continues, and a subset S_opt is chosen that maximizes this score.

The pseudo-code of the proposed heuristic is presented in Algorithm 1.

Algorithm 1: Self-Adapting Feature Evaluation (SAFE)
Input: A training sample D(F,Y) with full feature set F = {F_1,...,F_n} and class variable Y
Output: The selected feature subset S_opt
1:  Initialize Score_opt ← 0, S_opt ← ∅;
2:  repeat
3:    Generate a subset S ⊆ F using a search strategy;
4:    for i, j = 1 to n; j > i do
5:      Calculate SU(F_i,Y), SU(F_i,F_j), and SU(F_i,F_j,Y);
6:      Remove all features F_i from S if SU(F_i,Y) = 0;
7:    end for
8:    Calculate A_S, D_S, C_S, β, γ;
9:    Calculate Score(S);
10:   if Score(S) > Score_opt then Score_opt ← Score(S), S_opt ← S;
11:   else no change in Score_opt, S_opt;
12:   end if
13: until Score_opt does not improve based on the stopping rule
14: Return S_opt;
15: End
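The overall procedure of Algorithm 1 can be sketched as follows, substituting a simple greedy forward search for the best-first search used in the paper, for brevity. It assumes the safe_score function and the precomputed SU tables sketched earlier, and the stopping rule shown (stop when the score no longer improves) is an assumption.

```python
# Greedy forward-search driver around safe_score (a simplification of Algorithm 1).
def safe_forward_search(n_features, rel, su, su_y):
    remaining = {i for i in range(n_features) if rel[i] > 0}   # drop features with SU(Fi,Y) = 0
    selected, best_score = [], float("-inf")
    while remaining:
        # score every one-feature extension of the current subset and keep the best
        cand, cand_score = max(
            ((f, safe_score(selected + [f], rel, su, su_y)) for f in remaining),
            key=lambda pair: pair[1])
        if cand_score <= best_score:     # assumed stopping rule: score stops improving
            break
        selected.append(cand)
        remaining.remove(cand)
        best_score = cand_score
    return selected, best_score
```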


More information

Feature Selection for Image Retrieval and Object Recognition

Feature Selection for Image Retrieval and Object Recognition Feature Selection for Image Retrieval and Object Recognition Nuno Vasconcelos et al. Statistical Visual Computing Lab ECE, UCSD Presented by Dashan Gao Scalable Discriminant Feature Selection for Image

More information

Information Driven Healthcare:

Information Driven Healthcare: Information Driven Healthcare: Machine Learning course Lecture: Feature selection I --- Concepts Centre for Doctoral Training in Healthcare Innovation Dr. Athanasios Tsanas ( Thanasis ), Wellcome Trust

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet.

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. CS 189 Spring 2015 Introduction to Machine Learning Final You have 2 hours 50 minutes for the exam. The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. No calculators or

More information

Numerical Robustness. The implementation of adaptive filtering algorithms on a digital computer, which inevitably operates using finite word-lengths,

Numerical Robustness. The implementation of adaptive filtering algorithms on a digital computer, which inevitably operates using finite word-lengths, 1. Introduction Adaptive filtering techniques are used in a wide range of applications, including echo cancellation, adaptive equalization, adaptive noise cancellation, and adaptive beamforming. These

More information

Using Machine Learning to Optimize Storage Systems

Using Machine Learning to Optimize Storage Systems Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation

More information

Ranking Clustered Data with Pairwise Comparisons

Ranking Clustered Data with Pairwise Comparisons Ranking Clustered Data with Pairwise Comparisons Alisa Maas ajmaas@cs.wisc.edu 1. INTRODUCTION 1.1 Background Machine learning often relies heavily on being able to rank the relative fitness of instances

More information

Data Mining and Machine Learning: Techniques and Algorithms

Data Mining and Machine Learning: Techniques and Algorithms Instance based classification Data Mining and Machine Learning: Techniques and Algorithms Eneldo Loza Mencía eneldo@ke.tu-darmstadt.de Knowledge Engineering Group, TU Darmstadt International Week 2019,

More information

Slides for Data Mining by I. H. Witten and E. Frank

Slides for Data Mining by I. H. Witten and E. Frank Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-

More information

Mixture Models and the EM Algorithm

Mixture Models and the EM Algorithm Mixture Models and the EM Algorithm Padhraic Smyth, Department of Computer Science University of California, Irvine c 2017 1 Finite Mixture Models Say we have a data set D = {x 1,..., x N } where x i is

More information

Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES

Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES DESIGN AND ANALYSIS OF ALGORITHMS Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES http://milanvachhani.blogspot.in USE OF LOOPS As we break down algorithm into sub-algorithms, sooner or later we shall

More information

Feature Selection. Department Biosysteme Karsten Borgwardt Data Mining Course Basel Fall Semester / 262

Feature Selection. Department Biosysteme Karsten Borgwardt Data Mining Course Basel Fall Semester / 262 Feature Selection Department Biosysteme Karsten Borgwardt Data Mining Course Basel Fall Semester 2016 239 / 262 What is Feature Selection? Department Biosysteme Karsten Borgwardt Data Mining Course Basel

More information

FEATURE SELECTION TECHNIQUES

FEATURE SELECTION TECHNIQUES CHAPTER-2 FEATURE SELECTION TECHNIQUES 2.1. INTRODUCTION Dimensionality reduction through the choice of an appropriate feature subset selection, results in multiple uses including performance upgrading,

More information

An Empirical Comparison of Spectral Learning Methods for Classification

An Empirical Comparison of Spectral Learning Methods for Classification An Empirical Comparison of Spectral Learning Methods for Classification Adam Drake and Dan Ventura Computer Science Department Brigham Young University, Provo, UT 84602 USA Email: adam drake1@yahoo.com,

More information

3 Feature Selection & Feature Extraction

3 Feature Selection & Feature Extraction 3 Feature Selection & Feature Extraction Overview: 3.1 Introduction 3.2 Feature Extraction 3.3 Feature Selection 3.3.1 Max-Dependency, Max-Relevance, Min-Redundancy 3.3.2 Relevance Filter 3.3.3 Redundancy

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Sources Hastie, Tibshirani, Friedman: The Elements of Statistical Learning James, Witten, Hastie, Tibshirani: An Introduction to Statistical Learning Andrew Ng:

More information

Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga.

Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga. Americo Pereira, Jan Otto Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga. ABSTRACT In this paper we want to explain what feature selection is and

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

Instance-Based Representations. k-nearest Neighbor. k-nearest Neighbor. k-nearest Neighbor. exemplars + distance measure. Challenges.

Instance-Based Representations. k-nearest Neighbor. k-nearest Neighbor. k-nearest Neighbor. exemplars + distance measure. Challenges. Instance-Based Representations exemplars + distance measure Challenges. algorithm: IB1 classify based on majority class of k nearest neighbors learned structure is not explicitly represented choosing k

More information

Segmentation of Images

Segmentation of Images Segmentation of Images SEGMENTATION If an image has been preprocessed appropriately to remove noise and artifacts, segmentation is often the key step in interpreting the image. Image segmentation is a

More information

I How does the formulation (5) serve the purpose of the composite parameterization

I How does the formulation (5) serve the purpose of the composite parameterization Supplemental Material to Identifying Alzheimer s Disease-Related Brain Regions from Multi-Modality Neuroimaging Data using Sparse Composite Linear Discrimination Analysis I How does the formulation (5)

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

Lecture 19: Generative Adversarial Networks

Lecture 19: Generative Adversarial Networks Lecture 19: Generative Adversarial Networks Roger Grosse 1 Introduction Generative modeling is a type of machine learning where the aim is to model the distribution that a given set of data (e.g. images,

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

Decision Trees. Query Selection

Decision Trees. Query Selection CART Query Selection Key Question: Given a partial tree down to node N, what feature s should we choose for the property test T? The obvious heuristic is to choose the feature that yields as big a decrease

More information

2. A Bernoulli distribution has the following likelihood function for a data set D: N 1 N 1 + N 0

2. A Bernoulli distribution has the following likelihood function for a data set D: N 1 N 1 + N 0 Machine Learning Fall 2015 Homework 1 Homework must be submitted electronically following the instructions on the course homepage. Make sure to explain you reasoning or show your derivations. Except for

More information

Data Mining. Lecture 03: Nearest Neighbor Learning

Data Mining. Lecture 03: Nearest Neighbor Learning Data Mining Lecture 03: Nearest Neighbor Learning Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Prof. R. Mooney (UT Austin) Prof E. Keogh (UCR), Prof. F. Provost

More information

Comparison of different preprocessing techniques and feature selection algorithms in cancer datasets

Comparison of different preprocessing techniques and feature selection algorithms in cancer datasets Comparison of different preprocessing techniques and feature selection algorithms in cancer datasets Konstantinos Sechidis School of Computer Science University of Manchester sechidik@cs.man.ac.uk Abstract

More information

International Journal of Scientific & Engineering Research, Volume 5, Issue 7, July ISSN

International Journal of Scientific & Engineering Research, Volume 5, Issue 7, July ISSN International Journal of Scientific & Engineering Research, Volume 5, Issue 7, July-2014 445 Clusters using High Dimensional Data for Feature Subset Algorithm V. Abhilash M. Tech Student Department of

More information

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu FMA901F: Machine Learning Lecture 3: Linear Models for Regression Cristian Sminchisescu Machine Learning: Frequentist vs. Bayesian In the frequentist setting, we seek a fixed parameter (vector), with value(s)

More information

D-Separation. b) the arrows meet head-to-head at the node, and neither the node, nor any of its descendants, are in the set C.

D-Separation. b) the arrows meet head-to-head at the node, and neither the node, nor any of its descendants, are in the set C. D-Separation Say: A, B, and C are non-intersecting subsets of nodes in a directed graph. A path from A to B is blocked by C if it contains a node such that either a) the arrows on the path meet either

More information

Regularization and model selection

Regularization and model selection CS229 Lecture notes Andrew Ng Part VI Regularization and model selection Suppose we are trying select among several different models for a learning problem. For instance, we might be using a polynomial

More information

CSCE 478/878 Lecture 6: Bayesian Learning and Graphical Models. Stephen Scott. Introduction. Outline. Bayes Theorem. Formulas

CSCE 478/878 Lecture 6: Bayesian Learning and Graphical Models. Stephen Scott. Introduction. Outline. Bayes Theorem. Formulas ian ian ian Might have reasons (domain information) to favor some hypotheses/predictions over others a priori ian methods work with probabilities, and have two main roles: Optimal Naïve Nets (Adapted from

More information

Employment of Multiple Algorithms for Optimal Path-based Test Selection Strategy. Miroslav Bures and Bestoun S. Ahmed

Employment of Multiple Algorithms for Optimal Path-based Test Selection Strategy. Miroslav Bures and Bestoun S. Ahmed 1 Employment of Multiple Algorithms for Optimal Path-based Test Selection Strategy Miroslav Bures and Bestoun S. Ahmed arxiv:1802.08005v1 [cs.se] 22 Feb 2018 Abstract Executing various sequences of system

More information

Empirical Evaluation of Feature Subset Selection based on a Real-World Data Set

Empirical Evaluation of Feature Subset Selection based on a Real-World Data Set P. Perner and C. Apte, Empirical Evaluation of Feature Subset Selection Based on a Real World Data Set, In: D.A. Zighed, J. Komorowski, and J. Zytkow, Principles of Data Mining and Knowledge Discovery,

More information

Machine Learning Techniques for Data Mining

Machine Learning Techniques for Data Mining Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART VII Moving on: Engineering the input and output 10/25/2000 2 Applying a learner is not all Already

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

Applying Supervised Learning

Applying Supervised Learning Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains

More information

Core Membership Computation for Succinct Representations of Coalitional Games

Core Membership Computation for Succinct Representations of Coalitional Games Core Membership Computation for Succinct Representations of Coalitional Games Xi Alice Gao May 11, 2009 Abstract In this paper, I compare and contrast two formal results on the computational complexity

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

Multicollinearity and Validation CIVL 7012/8012

Multicollinearity and Validation CIVL 7012/8012 Multicollinearity and Validation CIVL 7012/8012 2 In Today s Class Recap Multicollinearity Model Validation MULTICOLLINEARITY 1. Perfect Multicollinearity 2. Consequences of Perfect Multicollinearity 3.

More information

THE preceding chapters were all devoted to the analysis of images and signals which

THE preceding chapters were all devoted to the analysis of images and signals which Chapter 5 Segmentation of Color, Texture, and Orientation Images THE preceding chapters were all devoted to the analysis of images and signals which take values in IR. It is often necessary, however, to

More information

Collective classification in network data

Collective classification in network data 1 / 50 Collective classification in network data Seminar on graphs, UCSB 2009 Outline 2 / 50 1 Problem 2 Methods Local methods Global methods 3 Experiments Outline 3 / 50 1 Problem 2 Methods Local methods

More information

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset. Glossary of data mining terms: Accuracy Accuracy is an important factor in assessing the success of data mining. When applied to data, accuracy refers to the rate of correct values in the data. When applied

More information

Kernel-based online machine learning and support vector reduction

Kernel-based online machine learning and support vector reduction Kernel-based online machine learning and support vector reduction Sumeet Agarwal 1, V. Vijaya Saradhi 2 andharishkarnick 2 1- IBM India Research Lab, New Delhi, India. 2- Department of Computer Science

More information

CSE 573: Artificial Intelligence Autumn 2010

CSE 573: Artificial Intelligence Autumn 2010 CSE 573: Artificial Intelligence Autumn 2010 Lecture 16: Machine Learning Topics 12/7/2010 Luke Zettlemoyer Most slides over the course adapted from Dan Klein. 1 Announcements Syllabus revised Machine

More information

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München Evaluation Measures Sebastian Pölsterl Computer Aided Medical Procedures Technische Universität München April 28, 2015 Outline 1 Classification 1. Confusion Matrix 2. Receiver operating characteristics

More information

Lecture 7: Decision Trees

Lecture 7: Decision Trees Lecture 7: Decision Trees Instructor: Outline 1 Geometric Perspective of Classification 2 Decision Trees Geometric Perspective of Classification Perspective of Classification Algorithmic Geometric Probabilistic...

More information

A CSP Search Algorithm with Reduced Branching Factor

A CSP Search Algorithm with Reduced Branching Factor A CSP Search Algorithm with Reduced Branching Factor Igor Razgon and Amnon Meisels Department of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva, 84-105, Israel {irazgon,am}@cs.bgu.ac.il

More information

Supervised Learning: K-Nearest Neighbors and Decision Trees

Supervised Learning: K-Nearest Neighbors and Decision Trees Supervised Learning: K-Nearest Neighbors and Decision Trees Piyush Rai CS5350/6350: Machine Learning August 25, 2011 (CS5350/6350) K-NN and DT August 25, 2011 1 / 20 Supervised Learning Given training

More information

Forward Feature Selection Using Residual Mutual Information

Forward Feature Selection Using Residual Mutual Information Forward Feature Selection Using Residual Mutual Information Erik Schaffernicht, Christoph Möller, Klaus Debes and Horst-Michael Gross Ilmenau University of Technology - Neuroinformatics and Cognitive Robotics

More information

Markscheme May 2017 Mathematical studies Standard level Paper 1

Markscheme May 2017 Mathematical studies Standard level Paper 1 M17/5/MATSD/SP1/ENG/TZ/XX/M Markscheme May 017 Mathematical studies Standard level Paper 1 3 pages M17/5/MATSD/SP1/ENG/TZ/XX/M This markscheme is the property of the International Baccalaureate and must

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Efficient Voting Prediction for Pairwise Multilabel Classification

Efficient Voting Prediction for Pairwise Multilabel Classification Efficient Voting Prediction for Pairwise Multilabel Classification Eneldo Loza Mencía, Sang-Hyeun Park and Johannes Fürnkranz TU-Darmstadt - Knowledge Engineering Group Hochschulstr. 10 - Darmstadt - Germany

More information

Measures of Clustering Quality: A Working Set of Axioms for Clustering

Measures of Clustering Quality: A Working Set of Axioms for Clustering Measures of Clustering Quality: A Working Set of Axioms for Clustering Margareta Ackerman and Shai Ben-David School of Computer Science University of Waterloo, Canada Abstract Aiming towards the development

More information

Evaluating Classifiers

Evaluating Classifiers Evaluating Classifiers Charles Elkan elkan@cs.ucsd.edu January 18, 2011 In a real-world application of supervised learning, we have a training set of examples with labels, and a test set of examples with

More information

A FAST CLUSTERING-BASED FEATURE SUBSET SELECTION ALGORITHM

A FAST CLUSTERING-BASED FEATURE SUBSET SELECTION ALGORITHM A FAST CLUSTERING-BASED FEATURE SUBSET SELECTION ALGORITHM Akshay S. Agrawal 1, Prof. Sachin Bojewar 2 1 P.G. Scholar, Department of Computer Engg., ARMIET, Sapgaon, (India) 2 Associate Professor, VIT,

More information

Final Exam. Introduction to Artificial Intelligence. CS 188 Spring 2010 INSTRUCTIONS. You have 3 hours.

Final Exam. Introduction to Artificial Intelligence. CS 188 Spring 2010 INSTRUCTIONS. You have 3 hours. CS 188 Spring 2010 Introduction to Artificial Intelligence Final Exam INSTRUCTIONS You have 3 hours. The exam is closed book, closed notes except a two-page crib sheet. Please use non-programmable calculators

More information

Classification: Linear Discriminant Functions

Classification: Linear Discriminant Functions Classification: Linear Discriminant Functions CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Discriminant functions Linear Discriminant functions

More information

Performance Estimation and Regularization. Kasthuri Kannan, PhD. Machine Learning, Spring 2018

Performance Estimation and Regularization. Kasthuri Kannan, PhD. Machine Learning, Spring 2018 Performance Estimation and Regularization Kasthuri Kannan, PhD. Machine Learning, Spring 2018 Bias- Variance Tradeoff Fundamental to machine learning approaches Bias- Variance Tradeoff Error due to Bias:

More information

Based on Raymond J. Mooney s slides

Based on Raymond J. Mooney s slides Instance Based Learning Based on Raymond J. Mooney s slides University of Texas at Austin 1 Example 2 Instance-Based Learning Unlike other learning algorithms, does not involve construction of an explicit

More information

CISC 4631 Data Mining

CISC 4631 Data Mining CISC 4631 Data Mining Lecture 03: Nearest Neighbor Learning Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Prof. R. Mooney (UT Austin) Prof E. Keogh (UCR), Prof. F.

More information

3 No-Wait Job Shops with Variable Processing Times

3 No-Wait Job Shops with Variable Processing Times 3 No-Wait Job Shops with Variable Processing Times In this chapter we assume that, on top of the classical no-wait job shop setting, we are given a set of processing times for each operation. We may select

More information

Naïve Bayes Classification. Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others

Naïve Bayes Classification. Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others Naïve Bayes Classification Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others Things We d Like to Do Spam Classification Given an email, predict

More information

INF 4300 Classification III Anne Solberg The agenda today:

INF 4300 Classification III Anne Solberg The agenda today: INF 4300 Classification III Anne Solberg 28.10.15 The agenda today: More on estimating classifier accuracy Curse of dimensionality and simple feature selection knn-classification K-means clustering 28.10.15

More information

Classification with Decision Tree Induction

Classification with Decision Tree Induction Classification with Decision Tree Induction This algorithm makes Classification Decision for a test sample with the help of tree like structure (Similar to Binary Tree OR k-ary tree) Nodes in the tree

More information

FUTURE communication networks are expected to support

FUTURE communication networks are expected to support 1146 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL 13, NO 5, OCTOBER 2005 A Scalable Approach to the Partition of QoS Requirements in Unicast and Multicast Ariel Orda, Senior Member, IEEE, and Alexander Sprintson,

More information

Application of Support Vector Machine Algorithm in Spam Filtering

Application of Support Vector Machine Algorithm in  Spam Filtering Application of Support Vector Machine Algorithm in E-Mail Spam Filtering Julia Bluszcz, Daria Fitisova, Alexander Hamann, Alexey Trifonov, Advisor: Patrick Jähnichen Abstract The problem of spam classification

More information

Estimating Noise and Dimensionality in BCI Data Sets: Towards Illiteracy Comprehension

Estimating Noise and Dimensionality in BCI Data Sets: Towards Illiteracy Comprehension Estimating Noise and Dimensionality in BCI Data Sets: Towards Illiteracy Comprehension Claudia Sannelli, Mikio Braun, Michael Tangermann, Klaus-Robert Müller, Machine Learning Laboratory, Dept. Computer

More information

Stability of Feature Selection Algorithms

Stability of Feature Selection Algorithms Stability of Feature Selection Algorithms Alexandros Kalousis, Jullien Prados, Phong Nguyen Melanie Hilario Artificial Intelligence Group Department of Computer Science University of Geneva Stability of

More information

An Empirical Study of Lazy Multilabel Classification Algorithms

An Empirical Study of Lazy Multilabel Classification Algorithms An Empirical Study of Lazy Multilabel Classification Algorithms E. Spyromitros and G. Tsoumakas and I. Vlahavas Department of Informatics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece

More information