An adaptive heuristic for feature selection based on complementarity


Appeared in: Machine Learning

An adaptive heuristic for feature selection based on complementarity

Sumanta Singha · Prakash P. Shenoy

Received: date / Accepted: date

Abstract Feature selection is a dimensionality reduction technique that helps to improve data visualization, simplify learning, and enhance the efficiency of learning algorithms. The existing redundancy-based approach, which relies on relevance and redundancy criteria, does not account for feature complementarity. Complementarity implies information synergy, in which additional class information becomes available due to feature interaction. We propose a novel filter-based approach to feature selection that explicitly characterizes and uses feature complementarity in the search process. Using theories from multi-objective optimization, the proposed heuristic penalizes redundancy and rewards complementarity, thus improving over the redundancy-based approach that penalizes all feature dependencies. Our proposed heuristic uses an adaptive cost function that uses the redundancy-complementarity ratio to automatically update the trade-off rule between relevance, redundancy, and complementarity. We show that this adaptive approach outperforms many existing feature selection methods using benchmark datasets.

Keywords Dimensionality reduction · Feature selection · Classification · Feature complementarity · Adaptive heuristic

Introduction

Learning from data is one of the central goals of machine learning research. The statistical and data-mining communities have long focused on building simpler and more interpretable models for prediction and understanding of data. However, high-dimensional data present unique computational challenges such as model over-fitting, computational intractability, and poor prediction. Feature selection is a dimensionality reduction technique that helps to simplify learning, reduce cost, and improve data interpretability.

Sumanta Singha, sumanta.singha@ku.edu · Prakash P. Shenoy, pshenoy@ku.edu · School of Business, University of Kansas

Existing approaches

Over the years, feature selection methods have evolved from the simplest univariate ranking algorithms, to more sophisticated relevance-redundancy trade-offs, to interaction-based approaches in recent times. Univariate feature ranking [] is a feature selection approach that ranks features based on relevance and ignores redundancy. As a result, when features are interdependent, the ranking approach leads to sub-optimal results [0]. The redundancy-based approach improves over the ranking approach by considering both relevance and redundancy in the feature selection process. A wide variety of feature selection methods are based on the relevance-redundancy trade-off [,,,,, ]. Their goal is to find an optimal subset of features that produces maximum relevance and minimum redundancy.

Complementarity-based feature selection methods emerged as an alternative approach to account for feature complementarity in the selection process. Complementarity can be described as a phenomenon in which two features together provide more information about the target variable than the sum of their individual information (information synergy). Several complementarity-based methods are proposed in the literature [,,,,,, ]. [] and later [] propose an interactive sequential feature selection method, known as joint mutual information (JMI), which selects a candidate feature that maximizes relevance and complementarity simultaneously. They conclude that the JMI approach provides the best trade-off in terms of accuracy, stability, and flexibility with small data samples.

Limitations of the existing methods

Clearly, the redundancy-based approach is less efficient than the complementarity-based approach, as it does not account for feature complementarity. However, the main criticism of the redundancy-based approach concerns how redundancy is formalized and measured. Correlation is the most common way to measure redundancy between features, which implicitly assumes that all correlated features are redundant. However, [0,, ] show that this is an incorrect assumption; correlation implies neither redundancy nor the absence of complementarity. This is evident in Figures 1 and 2, which present a 2-class classification problem (denoted by star and circle) with two continuous features X_1 and X_2. The projections on the axes denote the relevance of each respective feature. In Figure 1, X_1 and X_2 are perfectly correlated, and indeed redundant. Having both X_1 and X_2 leads to no significant improvement in class separation compared to having either X_1 or X_2 alone. However, in Figure 2, a perfect separation is achieved by X_1 and X_2 together, although they are (negatively) correlated within each class, and have identical relevance as in Figure 1. This shows that while generic dependency is undesirable, dependency that conveys class information is useful. Whether two interacting features are redundant or complementary depends on the relative magnitude of their class-conditional dependency in comparison to their unconditional dependency. The redundancy-based approach, however, focuses only on unconditional dependency and tries to minimize it without exploring whether such dependencies lead to information gain or loss.

Although the complementarity-based approach overcomes some of the drawbacks of the redundancy-based approach, it has limitations, too. Most of the existing complementarity-based feature selection methods either adopt a sequential feature selection approach or evaluate subsets of a given size.
Sequential feature selection methods raise a few important issues. First, [] show that sequential selection methods suffer from initial selection bias, i.e., the features selected in the earlier steps govern the acceptance or rejection of the subsequent features at each iteration, and not vice versa. Second, they require a priori knowledge of the desired number of features to be selected, or some stopping criterion, which is usually determined by expert information or technical considerations such as scalability or computation time. In many practical situations, it is difficult to obtain such prior knowledge. Moreover, in many cases we are interested in finding an optimal feature subset that gives maximum predictive accuracy for a given task, and we are not really concerned about the ranking of the features or the size of the optimal subset.

Fig. 1: Perfectly correlated and redundant features

Fig. 2: Features are negatively correlated within each class, yet provide perfect separation

Third, most of the existing methods combine redundancy and complementarity and consider only the net or aggregate effect in the search process (note that complementarity has the opposite sign of redundancy). [] refer to this aggregate approach as implicit consideration of complementarity.

Our approach

In this paper, we propose a filter-based feature subset selection method based on relevance, redundancy, and complementarity. Unlike most of the existing methods, which focus on feature ranking or compare subsets of a given size, our goal is to select an optimal subset of features that predicts well. This is useful in many situations where no prior expert knowledge is available regarding the size of an optimal subset, or where the goal is simply to find an optimal subset. Using a multi-objective optimization (MOO) technique and an adaptive cost function, the proposed method aims to (i) maximize relevance, (ii) minimize redundancy, and (iii) maximize complementarity, while keeping the subset size as small as possible. The term adaptive implies that our proposed method adaptively determines the trade-off between relevance, redundancy, and complementarity based on subset properties. An adaptive approach helps to overcome the limitations of a fixed policy that fails to model the trade-off between competing objectives appropriately in a MOO problem. Such an adaptive approach is new to feature selection and essentially mimics a feedback mechanism in which the trade-off rule is a function of the objective values. The proposed cost function is also flexible in that it does not assume any particular functional form or rely on a concavity assumption, and it uses implicit utility maximization principles [, ]. Unlike some of the complementarity-based methods, which consider the net (aggregate) effect of redundancy and complementarity, we consider redundancy minimization and complementarity maximization as two separate objectives in the optimization process. This allows us the flexibility to apply different weights (preferences) to redundancy and complementarity and to control their relative importance adaptively during the search process. Using best-first search as the search strategy, the proposed heuristic offers a best compromise solution (more likely to avoid a local optimum due to the interactively determined gradient), if not the best solution (in the sense of an optimum) [0], which is sufficiently good in most practical scenarios.

Using benchmark datasets, we show empirically that our adaptive heuristic not only outperforms many redundancy-based methods, but is also competitive amongst the existing complementarity-based methods.

Structure of the paper

The rest of the paper is organized as follows. The next section presents the information-theoretic definitions and the concepts of relevance, redundancy, and complementarity. We then review the existing feature selection methods and discuss their strengths and limitations. Following that, we describe the proposed heuristic and its theoretical motivation; in the same section, we also discuss the limitations of our heuristic and carry out a sensitivity analysis. We then present the algorithm for our proposed heuristic and evaluate its time complexity, assess the performance of the heuristic on two synthetic datasets, and validate the heuristic using real datasets, presenting the experimental results. Finally, we summarize and conclude.

Information theory - definitions and concepts

First, we provide the necessary definitions from information theory [] and then discuss the existing notions of relevance, redundancy, and complementarity.

Definitions

Suppose X and Y are discrete random variables with finite state spaces X and Y, respectively. Let p_{X,Y} denote the joint probability mass function (PMF) of X and Y, with marginal PMFs p_X and p_Y.

Definition (Entropy) The entropy of X, denoted by H(X), is defined as H(X) = −Σ_{x∈X} p_X(x) log(p_X(x)). Entropy is a measure of uncertainty in the PMF p_X of X.

Definition (Joint entropy) The joint entropy of X and Y, denoted by H(X,Y), is defined as H(X,Y) = −Σ_{x∈X} Σ_{y∈Y} p_{X,Y}(x,y) log(p_{X,Y}(x,y)). Joint entropy is a measure of uncertainty in the joint PMF p_{X,Y} of X and Y.

Definition (Conditional entropy) The conditional entropy of X given Y, denoted by H(X|Y), is defined as H(X|Y) = −Σ_{x∈X} Σ_{y∈Y} p_{X,Y}(x,y) log(p_{X|y}(x)), where p_{X|y}(x) is the conditional PMF of X given Y = y. Conditional entropy H(X|Y) measures the remaining uncertainty in X given the knowledge of Y.

Definition (Mutual information (MI)) The mutual information between X and Y, denoted by I(X;Y), is defined as I(X;Y) = H(X) − H(X|Y) = H(Y) − H(Y|X). Mutual information measures the amount of dependence between X and Y. It is non-negative, symmetric, and equal to zero iff X and Y are independent.

Definition (Conditional mutual information) The conditional mutual information between X and Y given another discrete random variable Z, denoted by I(X;Y|Z), is defined as I(X;Y|Z) = H(X|Z) − H(X|Y,Z) = H(Y|Z) − H(Y|X,Z). It measures the conditional dependence between X and Y given Z.
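To make these definitions concrete, the following minimal sketch estimates the quantities above from empirical frequencies of discrete samples. The function names and the toy data are illustrative choices, not part of the paper.

```python
# Plug-in (empirical frequency) estimators of the quantities defined above.
from collections import Counter
from math import log2

def entropy(xs):
    """H(X) = -sum_x p(x) log2 p(x), estimated from a list of symbols."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def joint_entropy(xs, ys):
    """H(X, Y), estimated from paired samples."""
    return entropy(list(zip(xs, ys)))

def conditional_entropy(xs, ys):
    """H(X | Y) = H(X, Y) - H(Y)."""
    return joint_entropy(xs, ys) - entropy(ys)

def mutual_information(xs, ys):
    """I(X; Y) = H(X) - H(X | Y)."""
    return entropy(xs) - conditional_entropy(xs, ys)

def conditional_mutual_information(xs, ys, zs):
    """I(X; Y | Z) = H(X | Z) - H(X | Y, Z)."""
    yz = list(zip(ys, zs))
    return conditional_entropy(xs, zs) - conditional_entropy(xs, yz)

# Toy usage: X and Y mostly agree, so the estimated I(X;Y) is positive.
X = [0, 0, 1, 1, 0, 1, 0, 1]
Y = [0, 0, 1, 1, 0, 1, 1, 0]
print(round(mutual_information(X, Y), 3))
```

On small samples these plug-in estimates are biased, which is one practical motivation for the normalized symmetric-uncertainty measures defined in the next set of definitions.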

Definition (Interaction information) The interaction information [,, ] among X, Y, and Z, denoted by I(X;Y;Z), is defined as I(X;Y;Z) = I(X;Y) − I(X;Y|Z). Interaction information measures the change in the degree of association between two random variables when one of the interacting variables is held constant. It can be positive, negative, or zero depending on the relative order of magnitude of I(X;Y) and I(X;Y|Z). Interaction information is symmetric (order independent). More generally, the interaction information among a set of n random variables X = {X_1, X_2, ..., X_n} is given by I(X_1; X_2; ...; X_n) = −Σ_{S⊆X} (−1)^{|S|} H(S), where the sum runs over all subsets S of X []. Should it be zero, we say that the features do not interact altogether [].

Definition (Multivariate mutual information) The multivariate mutual information [, ] between a set of n features X = {X_1, X_2, ..., X_n} and Y, denoted by I(X;Y), is defined by I(X;Y) = Σ_i I(X_i;Y) − Σ_{i<j} I(X_i;X_j;Y) + ... + (−1)^{n+1} I(X_1;...;X_n;Y). This is the Möbius representation of multivariate mutual information based on set theory. Multivariate mutual information measures the information that X contains about Y and can be seen as a series of alternating inclusions and exclusions of higher-order terms that represent the simultaneous interaction of several variables.

Definition (Symmetric uncertainty) The symmetric uncertainty [] between X and Y, denoted by SU(X,Y), is defined as SU(X,Y) = 2 I(X;Y) / (H(X) + H(Y)). Symmetric uncertainty is a normalized version of MI in the range [0,1]. Symmetric uncertainty can compensate for MI's bias towards features with more values.

Definition (Conditional symmetric uncertainty) The conditional symmetric uncertainty between X and Y given Z, denoted by SU(X,Y|Z), is defined as SU(X,Y|Z) = 2 I(X;Y|Z) / (H(X|Z) + H(Y|Z)). SU(X,Y|Z) is a normalized version of the conditional mutual information I(X;Y|Z).

From the definitions above, the symmetric uncertainty equivalent of interaction information can be expressed as SU(X,Y,Z) = SU(X,Y) − SU(X,Y|Z).

Using the above notations, we can formulate the feature selection problem as follows: given a set of n features F = {F_i}, i ∈ {1,...,n}, the goal of feature selection is to select a subset of features F_S = {F_i : i ∈ S}, S ⊆ {1,...,n}, such that F_S = argmax_S I(F_S;Y), where I(F_S;Y) denotes the mutual information between F_S and the class variable Y. For tractability reasons, and unless there is strong evidence for the existence of higher-order interaction, the correction terms beyond 3-way interaction are generally ignored in the estimation of multivariate mutual information. In this paper, we will use

I(F_S;Y) ≈ Σ_{i∈S} I(F_i;Y) − Σ_{i,j∈S, i<j} I(F_i;F_j;Y)   (1)

where I(F_i;F_j;Y) is the 3-way interaction term between F_i, F_j, and Y. The proof of Equation (1) for the two-feature case can be shown easily using the Venn diagram in Figure 3; the n-feature case follows by recursive application of the two-feature case. In this paper, we use I(X;Y;Z) = I(X;Y) − I(X;Y|Z) as the interaction information, which follows the sign convention consistent with measure theory and is used by several authors [,, ]. [] defines I(X;Y;Z) = I(X;Y|Z) − I(X;Y) as interaction information, which has the opposite sign for an odd number of random variables. Either formulation measures the same aspect of feature interaction [].
The sign convention used in this paper corresponds to the common area of overlap in the information diagram and does not impact the heuristic as we deal with absolute value of the interaction information.
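A small sketch of how the 3-way quantities above can be estimated empirically is given below. The helper names, the symmetric-uncertainty form, and the pairwise approximation of Eq. (1) are written as assumptions for illustration, using plug-in entropy estimates rather than the paper's implementation.

```python
# Empirical interaction information, symmetric uncertainty, and the pairwise
# approximation of I(F_S;Y) from Eq. (1).
from collections import Counter
from math import log2
from itertools import combinations

def H(*cols):
    """Joint entropy of one or more aligned columns of discrete values."""
    n = len(cols[0])
    return -sum((c / n) * log2(c / n) for c in Counter(zip(*cols)).values())

def I(x, y):                       # I(X;Y)
    return H(x) + H(y) - H(x, y)

def I_cond(x, y, z):               # I(X;Y|Z)
    return H(x, z) + H(y, z) - H(z) - H(x, y, z)

def interaction(x, y, z):          # I(X;Y;Z) = I(X;Y) - I(X;Y|Z), paper's sign convention
    return I(x, y) - I_cond(x, y, z)

def SU(x, y):                      # symmetric uncertainty, normalized to [0, 1]
    hx, hy = H(x), H(y)
    return 2 * I(x, y) / (hx + hy) if hx + hy > 0 else 0.0

def mi_subset_approx(features, y):
    """Pairwise approximation of I(F_S;Y) as in Eq. (1)."""
    relevance = sum(I(f, y) for f in features)
    interactions = sum(interaction(fi, fj, y)
                       for fi, fj in combinations(features, 2))
    return relevance - interactions
```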

Relevance

Fig. 3: Venn diagram showing the interaction between features F_1, F_2, and the class Y: I(F_1,F_2;Y) = I(F_1;Y) + I(F_2;Y|F_1) = I(F_1;Y) + I(F_2;Y) − I(F_1;F_2;Y)

The relevance of a feature signifies its explanatory power, and is a measure of feature worthiness. A feature can be relevant individually, or together with other variables, if it carries information about the class Y. It is also possible that an individually irrelevant feature becomes relevant, or a relevant feature becomes irrelevant, when other features are present. This can be shown using Figures 4 and 5, which present a 2-class classification problem (denoted by star and circle) with two continuous features X_1 and X_2. The projections of the classes on each axis denote each feature's individual relevance. In Figure 4, X_2, which is individually irrelevant (uninformative), becomes relevant in the presence of X_1, and together they improve the class separation that is otherwise achievable by X_1 alone. In Figure 5, both X_1 and X_2 are individually irrelevant, yet they provide a perfect separation when present together (the chessboard problem, analogous to the XOR problem). Thus the relevance of a feature is context dependent [].

Using the information-theoretic framework, a feature F_i is said to be unconditionally relevant to the class variable Y if I(F_i;Y) > 0, and irrelevant if I(F_i;Y) = 0. When evaluated in the context of other features, we call F_i conditionally relevant if I(F_i;Y|F_{S−i}) > 0, where F_{S−i} = F_S \ F_i. There are several other probabilistic definitions of relevance available in the literature. Most notably, [] formalize relevance in terms of an optimal Bayes classifier and propose two degrees of relevance: strong and weak. Strongly relevant features are those that bring unique information about the class variable and cannot be replaced by other features. Weakly relevant features are relevant but not unique, in the sense that they can be replaced by other features. An irrelevant feature is one that is neither strongly nor weakly relevant.

Fig. 4: X_2 is individually irrelevant but improves the separability of X_1

Fig. 5: Both individually irrelevant features become relevant together

Redundancy

The concept of redundancy is associated with the degree of dependency between two or more features. Two variables are said to be redundant if they share common information about each other. This is the general dependency measured by I(F_i;F_j). [, ] formalize this notion of redundancy as multi-information or total correlation. The multi-information between a set of n features {F_1,...,F_n} is given by R(F_1,...,F_n) = Σ_{i=1}^{n} H(F_i) − H(F_1,...,F_n). For n = 2, R(F_1,F_2) = H(F_1) + H(F_2) − H(F_1,F_2) = I(F_1;F_2). This measure of redundancy is non-linear, non-negative, and non-decreasing with the number of features. In the context of feature selection, it is often of more interest to know whether two features are redundant with respect to the class variable than whether they are mutually redundant. Two features F_i and F_j are said to be redundant with respect to the class variable Y if I(F_i,F_j;Y) < I(F_i;Y) + I(F_j;Y). From Equation (1), it follows that I(F_i;F_j;Y) > 0, or equivalently I(F_i;F_j) > I(F_i;F_j|Y). Thus two features are redundant with respect to the class variable if their unconditional dependency exceeds their class-conditional dependency.

Complementarity

Complementarity, also known as information synergy, is the beneficial effect of feature interaction whereby two features together provide more information than the sum of their individual information. Two features F_i and F_j are said to be complementary with respect to the class variable Y if I(F_i,F_j;Y) > I(F_i;Y) + I(F_j;Y), or equivalently, I(F_i;F_j) < I(F_i;F_j|Y). Complementarity is negative interaction information. While generic dependency is undesirable, dependency that conveys class information is good. Different researchers have explained complementarity from different perspectives. [] define complementarity in terms of the degree of interaction between an individual feature F_i and the selected feature subset F_S given the class Y, i.e., I(F_i;F_S|Y).
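The two conditions above suggest a simple empirical test for whether a feature pair is redundant or complementary with respect to the class. The sketch below is illustrative (plug-in estimators and hypothetical names) and uses an XOR-like pair as an example of complementarity.

```python
# Classify a feature pair as redundant or complementary with respect to Y,
# using I(Fi;Fj) > I(Fi;Fj|Y) for redundancy and the reverse for complementarity.
from collections import Counter
from math import log2

def H(*cols):
    n = len(cols[0])
    return -sum((c / n) * log2(c / n) for c in Counter(zip(*cols)).values())

def pair_type(fi, fj, y, tol=1e-12):
    unconditional = H(fi) + H(fj) - H(fi, fj)                  # I(Fi;Fj)
    conditional = H(fi, y) + H(fj, y) - H(y) - H(fi, fj, y)    # I(Fi;Fj|Y)
    if unconditional > conditional + tol:
        return "redundant"        # positive interaction information
    if conditional > unconditional + tol:
        return "complementary"    # negative interaction information
    return "no 3-way interaction"

# XOR-like pair: each feature alone says nothing about Y, together they decide it.
F1, F2, Y = [0, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 0]
print(pair_type(F1, F2, Y))       # -> complementary
```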

[0] provide a similar definition of complementarity but call it conditional redundancy. They come to a similar conclusion as []: the inclusion of correlated features can be useful, provided the correlation within the class is stronger than the overall correlation.

Related literature

In this section, we review filter-based feature selection methods, which use information gain as a measure of dependence. In terms of evaluation strategy, filter-based methods can be broadly classified into (i) the redundancy-based approach and (ii) the complementarity-based approach, depending on whether or not they account for feature complementarity in the selection process. [0], however, show that both these approaches can be subsumed in a more general, unifying theoretical framework known as conditional likelihood maximization.

Redundancy-based methods

Most feature selection algorithms in the 1990s and early 2000s focus on relevance and redundancy to obtain the optimal subset. Notable amongst them are (i) mutual information based feature selection (MIFS) [], (ii) correlation based feature selection (CFS) [], (iii) minimum redundancy maximum relevance (mRMR) [], (iv) fast correlation based filter (FCBF) [], (v) ReliefF [], and (vi) conditional mutual information maximization (CMIM) [, ]. With some variation, their main goal is to maximize relevance and minimize redundancy. Of these methods, MIFS, FCBF, ReliefF, and CMIM are essentially feature ranking algorithms. They rank the features based on a certain information maximization criterion [] and select the top k features, where k is decided a priori based on expert knowledge or technical considerations.

MIFS is a sequential feature selection algorithm in which a candidate feature F_i is selected so as to maximize the conditional mutual information I(F_i;Y|F_S). [] approximates this MI by I(F_i;Y|F_S) = I(F_i;Y) − β Σ_{F_j∈F_S} I(F_i;F_j), where F_S is the already selected feature subset, and β ∈ [0,1] is a user-defined parameter that controls the redundancy penalty. For β = 0, it reduces to a ranking algorithm. [] finds that β ∈ [0.5, 1] is appropriate for many classification tasks. [] show that when β = 1, the MIFS method penalizes redundancy too strongly, and for this reason does not work well for non-linear dependence. CMIM implements an idea similar to MIFS, but differs in the way in which I(F_i;Y|F_S) is estimated. CMIM selects the candidate feature F_i that maximizes min_{F_j∈F_S} I(F_i;Y|F_j). Both MIFS and CMIM are incremental forward search methods, and they suffer from initial selection bias []. For example, if {F_1,F_2} is the selected subset and {F_3,F_4} is the candidate set, CMIM selects F_3 if I(F_3;Y|{F_1,F_2}) > I(F_4;Y|{F_1,F_2}), and the new optimal subset becomes {F_1,F_2,F_3}. The incremental search only evaluates the redundancy between the candidate feature F_3 and {F_1,F_2}, i.e., I(F_3;Y|{F_1,F_2}), and never considers the redundancy between F_1 and {F_2,F_3}, i.e., I(F_1;Y|{F_2,F_3}).
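The MIFS and CMIM criteria described above can be sketched as greedy steps over precomputed information tables. The table names (rel, red, cmi), the default beta, and the greedy loop below are assumptions made for illustration, not the original authors' code.

```python
# Greedy selection steps for MIFS-style and CMIM-style criteria.
# rel[i] = I(Fi;Y), red[i][j] = I(Fi;Fj), cmi[i][j] = I(Fi;Y|Fj) are assumed
# to be precomputed lookup tables (e.g. nested dicts or 2-D arrays).
def mifs_step(candidates, selected, rel, red, beta=0.5):
    """Pick the candidate maximizing I(Fi;Y) - beta * sum_j I(Fi;Fj)."""
    return max(candidates,
               key=lambda i: rel[i] - beta * sum(red[i][j] for j in selected))

def cmim_step(candidates, selected, rel, cmi):
    """Pick the candidate maximizing min_j I(Fi;Y|Fj); plain relevance if S is empty."""
    score = lambda i: min((cmi[i][j] for j in selected), default=rel[i])
    return max(candidates, key=score)

def greedy_select(n_features, k, step, **tables):
    """Incremental forward search: add k features one at a time."""
    selected, candidates = [], set(range(n_features))
    for _ in range(k):
        best = step(candidates, selected, **tables)
        selected.append(best)
        candidates.remove(best)
    return selected

# Example call: greedy_select(n, 5, cmim_step, rel=rel, cmi=cmi)
```

Note how the loop only ever conditions on the already selected features, which is exactly the initial selection bias discussed above.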

CFS and mRMR are both subset selection algorithms, which evaluate a subset of features using an implicit cost function that simultaneously maximizes relevance and minimizes redundancy. CFS evaluates a subset of features based on pairwise correlation measures, in which correlation is used as a generic measure of dependence. CFS uses the following heuristic to evaluate a subset of features: merit(S) = k r̄_cf / √(k + k(k−1) r̄_ff), where k denotes the subset size, r̄_cf denotes the average feature-class correlation, and r̄_ff denotes the average feature-feature correlation of the features in the subset. The feature-feature correlation is used as a measure of redundancy, and the feature-class correlation is used as a measure of relevance. The goal of CFS is to find a subset of independent features that are uncorrelated and predictive of the class. CFS ignores feature complementarity, and cannot identify strongly interacting features such as those in the parity problem [].

mRMR is very similar to CFS in principle; however, instead of correlation measures, mRMR uses the mutual information I(F_i;Y) as a measure of relevance, and I(F_i;F_j) as a measure of redundancy. mRMR uses the following heuristic to evaluate a subset of features: score(S) = (1/k) Σ_{i∈S} I(F_i;Y) − (1/k²) Σ_{i,j∈S} I(F_i;F_j). The mRMR method suffers from limitations similar to CFS. [] show that the approximations made by the information-theoretic methods, such as mRMR and CMIM, are based on unrealistic assumptions; they introduce a novel set of assumptions based on variational distributions and derive novel algorithms with competitive performance.
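For concreteness, the CFS merit and mRMR score quoted above can be written directly from their definitions. The input names below (average correlations for CFS, MI tables for mRMR) are illustrative, and summing the mRMR redundancy term over ordered pairs with i ≠ j is a choice made here.

```python
# Subset-evaluation heuristics of CFS and mRMR, as quoted above.
from math import sqrt

def cfs_merit(k, rcf_bar, rff_bar):
    """merit(S) = k * r_cf / sqrt(k + k*(k-1)*r_ff)."""
    return k * rcf_bar / sqrt(k + k * (k - 1) * rff_bar)

def mrmr_score(subset, rel, red):
    """score(S) = (1/k) * sum_i I(Fi;Y) - (1/k^2) * sum_{i != j} I(Fi;Fj)."""
    k = len(subset)
    relevance = sum(rel[i] for i in subset) / k
    redundancy = sum(red[i][j] for i in subset for j in subset if i != j) / k ** 2
    return relevance - redundancy
```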
FCBF follows a two-step process. In step 1, it ranks all features based on the symmetric uncertainty between each feature and the class variable, i.e., SU(F_i,Y), and selects the relevant features that exceed a given threshold value δ. In step 2, it finds the optimal subset by eliminating redundant features from the relevant features selected in step 1, using an approximate Markov blanket criterion. In essence, it decouples the relevance and redundancy analysis, and circumvents the concurrent subset search and subset evaluation process. Unlike CFS and mRMR, FCBF is computationally fast, simple, and fairly easy to implement due to the sequential two-step process. However, this method fails to capture situations where feature dependencies appear only conditionally on the class variable []. [] state that FCBF suffers from instability, as its naive heuristics may be unsuitable in many situations. One of the drawbacks of FCBF is that it rules out the possibility of an irrelevant feature becoming relevant due to interaction with other features []. CMIM, which simultaneously evaluates relevance and redundancy at every iteration, overcomes this limitation.

Relief [], and its multi-class version ReliefF [], are instance-based feature ranking algorithms that rank each feature based on its similarity with k nearest neighbors from the same and opposite classes, selected randomly from the dataset. The underlying principle is that a useful feature should have the same value for instances from the same class and different values for instances from a different class. In this method, m instances are randomly selected from the training data, and for each of these m instances, n nearest neighbors are chosen from the same and the opposite class. The feature values of the nearest neighbors are compared with those of the sampled instance, and the score of each feature is updated. A feature receives a higher weight if it has the same value as instances from the same class and different values from instances of other classes. In Relief, the score or weight of each feature is measured by the Euclidean distance between the sampled instance and the nearest neighbor, which reflects the feature's ability to discriminate between different classes.

The consistency-based method [,, ] is another approach, which uses a consistency measure as the performance metric. A feature subset is inconsistent if there exist at least two instances with the same feature values but different class labels. The inconsistency rate of a dataset is the number of inconsistent instances divided by the total number of instances in it. This approach aims to find a subset whose size is minimal and whose inconsistency rate is equal to that of the original dataset. [] propose the following heuristic: Consistency(S) = 1 − [Σ_{i=0}^{m} (D_i − M_i)] / N, where m is the number of distinct combinations of feature values for subset S, D_i is the number of instances of the i-th feature-value combination, M_i is the cardinality of the majority class for the i-th feature-value combination, and N is the total number of instances in the dataset.
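A minimal sketch of the consistency measure above: group the instances by their value combination on S and compare each group's size with its majority-class count. Names are illustrative, and the leading "1 −" follows the reconstruction of the formula given above.

```python
# Consistency(S) = 1 - sum_i (|D_i| - |M_i|) / N over value combinations of S.
from collections import Counter, defaultdict

def consistency(rows, subset, labels):
    groups = defaultdict(list)
    for row, y in zip(rows, labels):
        groups[tuple(row[j] for j in subset)].append(y)   # D_i: instances per combination
    inconsistent = sum(len(ys) - max(Counter(ys).values())  # |D_i| - |M_i|
                       for ys in groups.values())
    return 1 - inconsistent / len(rows)
```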

The Markov Blanket (MB) filter [] provides another useful technique for variable selection. The MB filter works on the principle of conditional independence and excludes a feature only if the MB of the feature is within the set of remaining features. Though the MB framework based on information theory is theoretically optimal in eliminating irrelevant and redundant features, it is computationally intractable. Incremental association Markov blanket (IAMB) [] and Fast-IAMB [0] are two MB-based algorithms that use conditional mutual information as the metric for the conditional independence test. They address the drawback of CMIM by performing redundancy checks during both the growing (forward) pass and the shrinkage (backward) pass.

Complementarity-based methods

The literature on complementarity-based feature selection methods that simultaneously optimize redundancy and complementarity is relatively sparse, despite the earliest research on feature interaction dating back to [] and subsequently being advanced by [,,, ]. The feature selection methods that consider feature complementarity include double input symmetrical relevance (DISR) [], redundancy complementariness dispersion based feature selection (RCDFS) [], INTERACT [], interaction weight based feature selection (IWFS) [], maximum relevance maximum complementary (MRMC) [], joint mutual information (JMI) [, ], and min-Interaction Max-Relevancy (mIMR) [].

The goal of DISR is to find the best subset of a given size d, where d is assumed to be known a priori. It considers complementarity implicitly [], which means that it considers only the net effect of redundancy and complementarity in the search process. As a result, DISR does not distinguish between two subsets that have the same net information, even if one achieves it with a large information gain offset by a large information loss and the other with a smaller gain and no loss; in other words, information loss and information gain are treated equally. DISR works on the principle of the k-average sub-subset information criterion, which is shown to be a good approximation of the information of a set of features: the mutual information between a subset F_S of d features and the class variable Y is lower bounded by the average information of its size-k sub-subsets, i.e., I(F_S;Y) ≥ (1/C(d,k)) Σ_{V⊆S, |V|=k} I(F_V;Y), where k is chosen such that there are no complementarities of order greater than k. Using k = 2, DISR recursively decomposes each larger subset (d > 2) into sub-subsets containing feature pairs F_i and F_j (i ≠ j), and chooses a subset F_S that maximizes the average pairwise information, F_S = argmax_S Σ_{i,j∈S, i<j} I(F_i,F_j;Y) / C(|S|,2). An implementation of this heuristic, known as MASSIVE, has also been proposed.

The mIMR method presents another variation of DISR in that (i) mIMR first removes all features that have zero mutual information with the class, and (ii) it decomposes the multivariate term in DISR into a linear combination of relevance and interaction terms. mIMR considers causal discovery in the selection process; it restricts the selection to variables that have both positive relevance and negative interaction. Both DISR and mIMR belong to a framework, known as joint mutual information (JMI), initially proposed by [].
JMI provides a sequential feature selection method in which the JMI score of the incoming feature F_k is given by J_jmi(F_k) = Σ_{F_i∈F_S} I(F_k,F_i;Y), where F_S is the already selected subset.

This is the information between the target and the joint random variable (F_k,F_i) defined by pairing the candidate F_k with each feature previously selected.

In RCDFS, [] suggest that ignoring higher-order feature dependence may lead to false positives (FPs), i.e., actually redundant features misidentified as relevant due to the pairwise approximation, being selected into the optimal subset, which may impair the selection of subsequent features. The degree of interference depends on the number of FPs present in the already selected subset and their degree of correlation with the incoming candidate feature. Only when the true positives (TPs) and FPs have opposing influences on the candidate feature is the selection misguided. For instance, if the candidate feature is redundant to the FPs but complementary to the TPs, then the new feature will be discouraged from selection, while it should ideally be selected, and vice versa. They estimate the interaction information (complementarity or redundancy) of the candidate feature with each of the already selected features. They propose to measure this noise by the standard deviation (dispersion) of these interaction effects and to minimize it. The smaller the dispersion, the less influential is the interference effect of the false positives. One limitation of RCDFS is that it assumes that all TPs in the already selected subset exhibit a similar type of association, i.e., either all are complementary to, or all are redundant with, the candidate feature (see the corresponding figure in []). This is a strong assumption and need not necessarily be true. In fact, it is more likely that different dispersion patterns will be observed. In such cases, the proposed method will fail to differentiate between the good influence (due to TPs) and the bad influence (due to FPs), and will therefore be ineffective in mitigating the interference effect of FPs in the feature selection process.

[] propose a complementarity-based ranking algorithm, IWFS. Their method is based on interaction weight factors, which reflect whether a feature is redundant or complementary. The interaction weight of a feature is updated at each iteration, and a feature is selected if its interaction weight exceeds a given threshold. Another complementarity-based method, INTERACT, uses a feature sorting metric based on data consistency. The c-contribution of a feature is estimated based on its inconsistency rate. A feature is removed if its c-contribution is less than a given threshold, and is otherwise retained. This method is computationally intensive and has worst-case time complexity O(NM), where N is the number of instances and M is the number of features. The MRMC method presents a neural network based feature selection that uses relevance and complementary scores. Relevance and complementary scores are estimated based on how a feature influences or complements the networks.

[0] propose a space of potential criteria that encompasses several redundancy- and complementarity-based methods. They propose that the worth of a candidate feature F_k, given an already selected subset F_S, can be represented as J(F_k) = I(F_k;Y) − β Σ_{F_i∈F_S} I(F_k;F_i) + γ Σ_{F_i∈F_S} I(F_k;F_i|Y). Different values of β and γ lead to different feature selection methods. For example, γ = 0 leads to MIFS, β = γ = 1/|S| leads to JMI, and γ = 0 with β = 1/|S| leads to mRMR.
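The unifying criterion above is easy to express against precomputed tables. The sketch below, with hypothetical table names, shows how particular choices of β and γ recover JMI-like and mRMR-like scores; it is an illustration of the framework, not the cited authors' implementation.

```python
# J(Fk) = I(Fk;Y) - beta * sum_i I(Fk;Fi) + gamma * sum_i I(Fk;Fi|Y),
# with rel[k] = I(Fk;Y), red[k][i] = I(Fk;Fi), cmi[k][i] = I(Fk;Fi|Y).
def unified_score(k, selected, rel, red, cmi, beta, gamma):
    red_term = sum(red[k][i] for i in selected)
    comp_term = sum(cmi[k][i] for i in selected)
    return rel[k] - beta * red_term + gamma * comp_term

def jmi_like(k, selected, rel, red, cmi):
    """beta = gamma = 1/|S| recovers the JMI criterion."""
    w = 1 / max(len(selected), 1)
    return unified_score(k, selected, rel, red, cmi, beta=w, gamma=w)

def mrmr_like(k, selected, rel, red, cmi):
    """gamma = 0 and beta = 1/|S| recovers mRMR."""
    w = 1 / max(len(selected), 1)
    return unified_score(k, selected, rel, red, cmi, beta=w, gamma=0.0)
```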
Motivation and the proposed heuristic

In this section, we first outline the motivation behind using redundancy and complementarity explicitly in the search process and the use of an implicit utility function approach.

Table 1: Golf dataset

Outlook (F_1)   Temperature (F_2)   Humidity (F_3)   Windy (F_4)   Play golf (Y)
Rainy           Hot                 High             False         No
Rainy           Hot                 High             True          No
Overcast        Hot                 High             False         Yes
Sunny           Mild                High             False         Yes
Sunny           Cool                Normal           False         Yes
Sunny           Cool                Normal           True          No
Overcast        Cool                Normal           True          Yes
Rainy           Mild                High             False         No
Rainy           Cool                Normal           False         Yes
Sunny           Mild                Normal           False         Yes
Rainy           Mild                Normal           True          Yes
Overcast        Mild                High             True          Yes
Overcast        Hot                 Normal           False         Yes
Sunny           Mild                High             True          No

Then, we propose a heuristic, called self-adaptive feature evaluation (SAFE). SAFE is motivated by the implicit utility function approach in multi-objective optimization. The implicit utility function approach belongs to the interactive methods of optimization [, ], which combine the search process with the decision maker's relative preference over multiple objectives. In interactive methods, decision making and optimization occur simultaneously.

Implicit vs. explicit measurement of complementarity

Combining negative (complementarity) and positive (redundancy) interaction information may produce inconsistent results when the goal is to find an optimal feature subset. In this section, we demonstrate this using the Golf dataset presented in Table 1. The dataset has four features {F_1,...,F_4}. The information content I(F_S;Y) of each possible subset F_S is estimated (i) first using Eq. (1), and (ii) then using the aggregate approach. In the aggregate approach, to compute I(F_S;Y), we take the average of the mutual information of each two-feature sub-subset of F_S with the class variable, i.e., of the values I(F_i,F_j;Y). For example, I(F_1,F_2,F_3;Y) is approximated by the average of I(F_1,F_2;Y), I(F_1,F_3;Y), and I(F_2,F_3;Y). Table 2 presents the results. Mutual information is computed using the infotheo package in R with the empirical entropy estimator. Our goal is to find the optimal subset regardless of subset size. Clearly, in this example, {F_1,F_2,F_3,F_4} is the optimal subset that has maximum information about the class variable. However, using the aggregate approach, a three-feature subset appears to be the best subset. Moreover, in the aggregate approach, one would rank one three-feature subset above another even though the latter has higher information content than the former.
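The comparison in this subsection can be reproduced in outline as follows. The sketch uses plug-in entropy estimates on the Golf data of Table 1; exact values depend on the estimator used (the paper uses the empirical estimator of the infotheo package in R), so the output should be read as illustrative rather than a reproduction of Table 2.

```python
# Compare I(F_S;Y) for the full Golf feature set: Eq. (1) versus the aggregate
# (pairwise-average) approach.
from collections import Counter
from math import log2
from itertools import combinations

def H(*cols):
    n = len(cols[0])
    return -sum((c / n) * log2(c / n) for c in Counter(zip(*cols)).values())

def I(*cols_and_y):                 # I(Fi1,...,Fik ; Y), grouping the features jointly
    *fs, y = cols_and_y
    joint = list(zip(*fs))
    return H(joint) + H(y) - H(joint, y)

def interaction(fi, fj, y):         # I(Fi;Fj;Y) = I(Fi;Fj) - I(Fi;Fj|Y)
    return (H(fi) + H(fj) - H(fi, fj)) - (H(fi, y) + H(fj, y) - H(y) - H(fi, fj, y))

def eq1_estimate(features, y):      # Eq. (1): relevance sum minus pairwise interactions
    return (sum(I(f, y) for f in features)
            - sum(interaction(a, b, y) for a, b in combinations(features, 2)))

def aggregate_estimate(features, y):  # average of I(Fi,Fj;Y) over all pairs
    pairs = list(combinations(features, 2))
    return sum(I(a, b, y) for a, b in pairs) / len(pairs)

rows = [  # Table 1: Outlook, Temperature, Humidity, Windy, Play
    ("Rainy", "Hot", "High", "False", "No"),   ("Rainy", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"), ("Sunny", "Mild", "High", "False", "Yes"),
    ("Sunny", "Cool", "Normal", "False", "Yes"), ("Sunny", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"), ("Rainy", "Mild", "High", "False", "No"),
    ("Rainy", "Cool", "Normal", "False", "Yes"), ("Sunny", "Mild", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "Normal", "True", "Yes"), ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"), ("Sunny", "Mild", "High", "True", "No"),
]
outlook, temp, humidity, windy, play = map(list, zip(*rows))
F = [outlook, temp, humidity, windy]
print(round(eq1_estimate(F, play), 3), round(aggregate_estimate(F, play), 3))
```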

Table 2: Mutual information I(F_S;Y) between each subset F_S and the class Y, estimated using Eq. (1) and using the aggregate approach, for all fifteen non-empty subsets of {F_1, F_2, F_3, F_4}.

A new adaptive heuristic

We first introduce the following notations for our heuristic and then define the adaptive cost function.

Subset relevance. Given a subset S, the subset relevance, denoted by A_S, is defined as the sum of the pairwise mutual information between each feature and the class variable, i.e., A_S = Σ_{i∈S} I(F_i;Y). A_S measures the predictive ability of each individual feature acting alone.

Subset redundancy. Given a subset S, the subset redundancy, denoted by R_S, is defined as the sum of all positive 3-way interactions in the subset, i.e., R_S = Σ_{i,j∈S, i<j} (I(F_i;F_j) − I(F_i;F_j|Y)) over all (i,j) such that I(F_i;F_j) > I(F_i;F_j|Y). R_S measures the information loss due to feature redundancy.

Subset complementarity. Given a subset S, the subset complementarity, denoted by C_S, is defined as the absolute value of the sum of all negative 3-way interactions in the subset, i.e., C_S = Σ_{i,j∈S, i<j} (I(F_i;F_j|Y) − I(F_i;F_j)) over all (i,j) such that I(F_i;F_j) < I(F_i;F_j|Y). C_S measures the information gain due to feature complementarity.

Subset dependence. Given a subset S, the subset dependence, denoted by D_S, is defined as the sum of the mutual information between each pair of features, i.e., D_S = Σ_{i,j∈S, i<j} I(F_i;F_j). We use D_S as a measure of unconditional feature redundancy in our heuristic. This is the same as the unconditional mutual information between features described as redundancy in the literature [,, ]. We call this subset dependence to distinguish it from the information loss due to redundancy (R_S), which is measured by the difference between conditional and unconditional mutual information.

Below, we present the proposed heuristic:

Score(S) = (A_S + γ C_S^{β/|S|}) / √(|S| + β D_S)   (2)

where A_S, C_S, and D_S are the subset relevance, subset complementarity, and subset dependence, respectively, and |S| denotes the subset size. β and γ are hyper-parameters defined as follows: α = R_S / (R_S + C_S), β = (1 + α), ξ = C_S / (C_S + A_S), and γ = (1 − ξ). R_S measures subset redundancy as defined above. As mentioned above, the heuristic characterizes an adaptive objective function, which evolves depending on the level of feature interactions.
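A hedged sketch of the score in Eq. (2), as reconstructed above, computed from symmetric-uncertainty tables: the function and table names are illustrative, and the exponent and denominator follow this reconstruction rather than a verified reference implementation.

```python
# Score(S) = (A_S + gamma * C_S**(beta/|S|)) / sqrt(|S| + beta * D_S),
# with su[i][j] = SU(Fi,Fj), su_y[i][j] = SU(Fi,Fj|Y), rel[i] = SU(Fi,Y).
from math import sqrt
from itertools import combinations

def safe_score(S, rel, su, su_y):
    pairs = list(combinations(S, 2))
    A = sum(rel[i] for i in S)                                  # subset relevance A_S
    D = sum(su[i][j] for i, j in pairs)                         # subset dependence D_S
    R = sum(max(su[i][j] - su_y[i][j], 0) for i, j in pairs)    # information loss R_S
    C = sum(max(su_y[i][j] - su[i][j], 0) for i, j in pairs)    # information gain C_S
    alpha = R / (R + C) if R + C > 0 else 0.0                   # degree of redundancy
    beta = 1 + alpha                                            # in [1, 2]
    gamma = 1 - C / (C + A) if C + A > 0 else 0.0               # in [0, 1]
    return (A + gamma * C ** (beta / len(S))) / sqrt(len(S) + beta * D)
```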

We model this adaptation using two hyper-parameters, β and γ. The values of these parameters are computed by the heuristic during the search process based on the relative values of relevance, redundancy, and complementarity. The ratio α ∈ [0,1] measures the percentage of redundancy in the subset, which determines whether the subset is predominantly redundant or complementary. If α = 1, all features are pairwise redundant, and we call the subset predominantly redundant. At the other extreme, if α = 0, all features are pairwise complementary, and we call the subset predominantly complementary. We take α = 0/0 = 0 for a fully independent subset of features, which is, however, rarely the case.

The hyper-parameter β controls the trade-off between redundancy and complementarity based on the value of α, which is a function of subset characteristics. We consider β to be a linear function of α such that the penalty for unconditional dependency increases linearly to twice its value when the subset is fully redundant. This resembles the heuristic of CFS [] when α = 1. The |S| in the denominator allows the heuristic to favor smaller subsets, while the square root in the denominator allows the penalty term to vary exponentially with increasing subset size and feature dependency.

The proposed heuristic adaptively modifies the trade-off rule as the search process for the optimal subset continues. We explain how such an adaptive criterion works. As α increases, the subset becomes more redundant and the value of subset complementarity (C_S) decreases (C_S = 0 when α = 1), leaving us with little opportunity to use complementarity very effectively in the feature selection process. In other words, the value of subset complementarity C_S is not sufficiently large to be able to differentiate between two subsets. At best, we expect to extract a set of features that are less redundant or nearly independent. As a result, β ∈ [1,2] increasingly penalizes the subset dependence term D_S in the denominator and rewards subset complementarity C_S in the numerator as α increases from 0 to 1. In contrast, as α decreases, the subset becomes predominantly complementary, leading to an increase in the value of C_S. As the magnitude of C_S gets sufficiently large, complementarity plays the key role in differentiating between two subsets, compared to the subset dependence D_S in the denominator. This, however, causes a bias towards larger subsets, as the complementarity gain increases monotonically with the size of the subset. We observe that C_S increases exponentially with the logarithm of the subset size as α decreases. In Figure 6, we demonstrate this for three different real datasets having different degrees of redundancy (α). In order to control this bias, we raise C_S to the exponent β/|S|. Moreover, given the way we formalize α, it is possible to have two different subsets, both having α = 0 but with different degrees of information gain C_S. This is because the information gain C_S of a subset is affected both by the number of complementary pairs of features in the subset and by how complementary each pair of features is, which is an intrinsic property of each feature. Hence it is possible that a larger subset with weak complementarity produces the same amount of information gain C_S as another smaller subset with highly complementary features.
This is evident from Figure 6, which shows that subset complementarity grows faster for the dataset Lung Cancer than for the dataset Promoter, despite the fact that the latter has a lower α, and both have an identical number of features. The exponent β/|S| also takes care of such issues. Next, we examine scenarios when C_S > A_S. This is the case when a subset learns more information from interaction with other features than from their individual predictive power. Notice that the source of class information, whether C_S or A_S, is indistinguishable to the heuristic, as the two are linearly additive in the numerator. As a result, this produces an undue bias towards larger subset sizes when C_S > A_S. To control this bias, we introduce the hyper-parameter γ ∈ [0,1], which maintains the balance between relevance and the complementarity information gain. It controls the subset size by reducing the contribution from C_S when C_S > A_S.
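The adaptation of β and γ can be illustrated with two made-up subsets, one predominantly redundant and one predominantly complementary. The numbers below are invented purely for illustration.

```python
# How the trade-off parameters respond to subset composition.
def adaptive_params(A_S, R_S, C_S):
    alpha = R_S / (R_S + C_S) if R_S + C_S > 0 else 0.0          # share of redundancy
    beta = 1 + alpha                                             # penalty on dependence, in [1, 2]
    gamma = 1 - C_S / (C_S + A_S) if C_S + A_S > 0 else 0.0      # damping on complementarity, in [0, 1]
    return alpha, beta, gamma

print(adaptive_params(A_S=1.0, R_S=0.4, C_S=0.1))  # mostly redundant: alpha high, beta near 2
print(adaptive_params(A_S=0.2, R_S=0.1, C_S=0.6))  # mostly complementary: alpha low, gamma well below 1
```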

Fig. 6: Variation of subset complementarity with subset size

Fig. 7: Heuristic score variation with the degree of redundancy α and subset size, given fixed values of A_S, D_S, and R_S + C_S, with A_S large relative to R_S + C_S

Fig. 8: Heuristic score variation with the degree of redundancy α and subset size, given fixed values of A_S, D_S, and R_S + C_S, with A_S small relative to R_S + C_S

Fig. 9: Heuristic score variation with subset dependence and subset relevance, for a given α, a given R_S + C_S, and a smaller subset size

Fig. 10: Heuristic score variation with subset dependence and subset relevance, for the same α and R_S + C_S, and a larger subset size

Sensitivity analysis

In Figures 7 and 8, we show how the proposed heuristic score varies with the degree of redundancy α and the subset size, under different relative magnitudes of the interaction information (R_S + C_S) and the subset relevance A_S. Figure 7 depicts a situation in which the features are individually highly relevant but less interactive (C_S < A_S), whereas in Figure 8 the features are individually less relevant, but as a subset become extremely relevant due to a high degree of feature interaction (C_S > A_S). In either scenario, the score decreases with increasing subset size and with increasing degree of redundancy. For a given subset size, the score is generally lower when C_S > A_S than when C_S < A_S, showing that the heuristic is effective in controlling the subset size when the subset is predominantly complementary. We also observe that the heuristic is very sensitive to redundancy when the features are highly relevant. In other words, redundancy hurts much more when the features are highly relevant. This is evident from the fact that the score falls at a much faster rate with increasing redundancy when A_S is very high compared to C_S, as in Figure 7.

In Figures 9 and 10, we show how the proposed heuristic score varies with subset relevance and subset dependence for two different subset sizes. The score increases linearly with the subset relevance, and decreases non-linearly with the subset dependence, as we would expect from Equation (2). The score is higher for a smaller subset than for a bigger subset under nearly identical conditions. However, for a given subset relevance, the score decreases at a much faster rate with increasing subset dependence when there are fewer features in the subset (as in Figure 9). This phenomenon can be explained with the help of Figures 11 and 12. For a given subset dependence D_S, fewer features mean a higher degree of association (overlap) between the features. In other words, the features share more common information, and are therefore more redundant than when there is a larger number of features in the subset.

Fig. 11: Fewer features in the subset for a given subset dependence D_S; here D_S = I(X;Y)

Fig. 12: More features in the subset for the same subset dependence D_S; here D_S = I(X;Y) + I(Y;Z) + I(X;Z)

Hence, our heuristic not only encourages parsimony, but is also sensitive to the change in feature dependence as the subset size changes. The above discussion demonstrates the adaptive nature of the heuristic under different conditions of relevance and redundancy, which is the motivation behind our heuristic.

Limitations

One limitation of this heuristic, as evident from Figures 9 and 10, is that it assigns a zero score to a subset when the subset relevance is zero, i.e., A_S = 0. Since feature relevance is a non-negative measure, this implies a situation in which every feature in the subset is individually irrelevant to the target concept. Thus, our heuristic does not select a subset when none of the features in the subset carries any useful class information by itself. This, however, ignores the possibility that such features could become relevant due to interaction. However, we have not encountered datasets where this is the case.

Another limitation of our heuristic is that it considers up to 3-way feature interaction (interaction between a pair of features and the class variable) and ignores the higher-order corrective terms. This pairwise approximation is necessary for computational tractability and is adopted in many MI-based feature selection methods []. Despite this limitation, [] shows that Equation (1) produces reasonably good estimates of mutual information for all practical purposes. The higher-order corrective terms become significant only when there exists a very high degree of dependence amongst a large number of features. It may be noted that our definitions of subset complementarity and subset redundancy, as given earlier, can be extended to include higher-order interaction terms without any difficulty. As more precise estimates of mutual information become available, further work will address the merit of using higher-order correction terms in our proposed approach.

Algorithm

In this section, we present the algorithm for our proposed heuristic, SAFE.

Step 1. Assume we start with a training sample D(F,Y) with the full feature set F = {F_1,...,F_n} and class variable Y. Using a search strategy, we choose a candidate subset of features S ⊆ F.

Step 2. Using the training data, we compute the mutual information between each pair of features, I(F_i;F_j), and between each feature and the class variable, I(F_i;Y). We eliminate from S all constant-valued features, for which I(F_i;Y) = 0.

Step 3. For each pair of features in S, we compute the conditional mutual information given the class variable, I(F_i;F_j|Y).

Step 4. We transform all I(F_i;F_j) and I(F_i;F_j|Y) to their symmetric uncertainty form to maintain a scale within [0,1].

Step 5. For each pair of features, we compute the interaction gain or loss, i.e., I(F_i;F_j;Y) = I(F_i;F_j) − I(F_i;F_j|Y), using the symmetric uncertainty measures.

Step 6. We compute A_S, D_S, C_S, R_S, β, and γ using the information from Steps 4 and 5.

Step 7. Using the information from Step 6, the heuristic determines a score Score(S) for the subset S. The search continues, and a subset S_opt is chosen that maximizes this score.

The pseudo-code of the proposed heuristic is presented in Algorithm 1.

Algorithm 1: Self-Adapting Feature Evaluation (SAFE)
Input: A training sample D(F,Y) with full feature set F = {F_1,...,F_n} and class variable Y
Output: The selected feature subset S_opt
1:  Initialize Score_opt ← 0, S_opt ← ∅;
2:  repeat
3:    Generate a subset S ⊆ F using a search strategy;
4:    for i, j = 1 to n; j > i do
5:      Calculate SU(F_i,Y), SU(F_i,F_j), and SU(F_i,F_j,Y);
6:      Remove all features F_i from S if SU(F_i,Y) = 0;
7:    end for
8:    Calculate A_S, D_S, C_S, β, γ;
9:    Calculate Score(S);
10:   if Score(S) > Score_opt then Score_opt ← Score(S), S_opt ← S;
11:   else no change in Score_opt, S_opt;
12:   end if
13: until Score_opt does not improve based on the stopping rule
14: Return S_opt;
15: End
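The overall procedure of Algorithm 1 can be sketched as follows, substituting a simple greedy forward search for the best-first search used in the paper, for brevity. It assumes the safe_score function and the precomputed SU tables sketched earlier, and the stopping rule shown (stop when the score no longer improves) is an assumption.

```python
# Greedy forward-search driver around safe_score (a simplification of Algorithm 1).
def safe_forward_search(n_features, rel, su, su_y):
    remaining = {i for i in range(n_features) if rel[i] > 0}   # drop features with SU(Fi,Y) = 0
    selected, best_score = [], float("-inf")
    while remaining:
        # score every one-feature extension of the current subset and keep the best
        cand, cand_score = max(
            ((f, safe_score(selected + [f], rel, su, su_y)) for f in remaining),
            key=lambda pair: pair[1])
        if cand_score <= best_score:     # assumed stopping rule: score stops improving
            break
        selected.append(cand)
        remaining.remove(cand)
        best_score = cand_score
    return selected, best_score
```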


More information

Feature Selection for Image Retrieval and Object Recognition

Feature Selection for Image Retrieval and Object Recognition Feature Selection for Image Retrieval and Object Recognition Nuno Vasconcelos et al. Statistical Visual Computing Lab ECE, UCSD Presented by Dashan Gao Scalable Discriminant Feature Selection for Image

More information

Information Driven Healthcare:

Information Driven Healthcare: Information Driven Healthcare: Machine Learning course Lecture: Feature selection I --- Concepts Centre for Doctoral Training in Healthcare Innovation Dr. Athanasios Tsanas ( Thanasis ), Wellcome Trust

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet.

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. CS 189 Spring 2015 Introduction to Machine Learning Final You have 2 hours 50 minutes for the exam. The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. No calculators or

More information

Numerical Robustness. The implementation of adaptive filtering algorithms on a digital computer, which inevitably operates using finite word-lengths,

Numerical Robustness. The implementation of adaptive filtering algorithms on a digital computer, which inevitably operates using finite word-lengths, 1. Introduction Adaptive filtering techniques are used in a wide range of applications, including echo cancellation, adaptive equalization, adaptive noise cancellation, and adaptive beamforming. These

More information

Using Machine Learning to Optimize Storage Systems

Using Machine Learning to Optimize Storage Systems Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation

More information

Ranking Clustered Data with Pairwise Comparisons

Ranking Clustered Data with Pairwise Comparisons Ranking Clustered Data with Pairwise Comparisons Alisa Maas ajmaas@cs.wisc.edu 1. INTRODUCTION 1.1 Background Machine learning often relies heavily on being able to rank the relative fitness of instances

More information

Data Mining and Machine Learning: Techniques and Algorithms

Data Mining and Machine Learning: Techniques and Algorithms Instance based classification Data Mining and Machine Learning: Techniques and Algorithms Eneldo Loza Mencía eneldo@ke.tu-darmstadt.de Knowledge Engineering Group, TU Darmstadt International Week 2019,

More information

Slides for Data Mining by I. H. Witten and E. Frank

Slides for Data Mining by I. H. Witten and E. Frank Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-

More information

Mixture Models and the EM Algorithm

Mixture Models and the EM Algorithm Mixture Models and the EM Algorithm Padhraic Smyth, Department of Computer Science University of California, Irvine c 2017 1 Finite Mixture Models Say we have a data set D = {x 1,..., x N } where x i is

More information

Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES

Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES DESIGN AND ANALYSIS OF ALGORITHMS Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES http://milanvachhani.blogspot.in USE OF LOOPS As we break down algorithm into sub-algorithms, sooner or later we shall

More information

Feature Selection. Department Biosysteme Karsten Borgwardt Data Mining Course Basel Fall Semester / 262

Feature Selection. Department Biosysteme Karsten Borgwardt Data Mining Course Basel Fall Semester / 262 Feature Selection Department Biosysteme Karsten Borgwardt Data Mining Course Basel Fall Semester 2016 239 / 262 What is Feature Selection? Department Biosysteme Karsten Borgwardt Data Mining Course Basel

More information

FEATURE SELECTION TECHNIQUES

FEATURE SELECTION TECHNIQUES CHAPTER-2 FEATURE SELECTION TECHNIQUES 2.1. INTRODUCTION Dimensionality reduction through the choice of an appropriate feature subset selection, results in multiple uses including performance upgrading,

More information

An Empirical Comparison of Spectral Learning Methods for Classification

An Empirical Comparison of Spectral Learning Methods for Classification An Empirical Comparison of Spectral Learning Methods for Classification Adam Drake and Dan Ventura Computer Science Department Brigham Young University, Provo, UT 84602 USA Email: adam drake1@yahoo.com,

More information

3 Feature Selection & Feature Extraction

3 Feature Selection & Feature Extraction 3 Feature Selection & Feature Extraction Overview: 3.1 Introduction 3.2 Feature Extraction 3.3 Feature Selection 3.3.1 Max-Dependency, Max-Relevance, Min-Redundancy 3.3.2 Relevance Filter 3.3.3 Redundancy

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Sources Hastie, Tibshirani, Friedman: The Elements of Statistical Learning James, Witten, Hastie, Tibshirani: An Introduction to Statistical Learning Andrew Ng:

More information

Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga.

Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga. Americo Pereira, Jan Otto Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga. ABSTRACT In this paper we want to explain what feature selection is and

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

Instance-Based Representations. k-nearest Neighbor. k-nearest Neighbor. k-nearest Neighbor. exemplars + distance measure. Challenges.

Instance-Based Representations. k-nearest Neighbor. k-nearest Neighbor. k-nearest Neighbor. exemplars + distance measure. Challenges. Instance-Based Representations exemplars + distance measure Challenges. algorithm: IB1 classify based on majority class of k nearest neighbors learned structure is not explicitly represented choosing k

More information

Segmentation of Images

Segmentation of Images Segmentation of Images SEGMENTATION If an image has been preprocessed appropriately to remove noise and artifacts, segmentation is often the key step in interpreting the image. Image segmentation is a

More information

I How does the formulation (5) serve the purpose of the composite parameterization

I How does the formulation (5) serve the purpose of the composite parameterization Supplemental Material to Identifying Alzheimer s Disease-Related Brain Regions from Multi-Modality Neuroimaging Data using Sparse Composite Linear Discrimination Analysis I How does the formulation (5)

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

Lecture 19: Generative Adversarial Networks

Lecture 19: Generative Adversarial Networks Lecture 19: Generative Adversarial Networks Roger Grosse 1 Introduction Generative modeling is a type of machine learning where the aim is to model the distribution that a given set of data (e.g. images,

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

Decision Trees. Query Selection

Decision Trees. Query Selection CART Query Selection Key Question: Given a partial tree down to node N, what feature s should we choose for the property test T? The obvious heuristic is to choose the feature that yields as big a decrease

More information

2. A Bernoulli distribution has the following likelihood function for a data set D: N 1 N 1 + N 0

2. A Bernoulli distribution has the following likelihood function for a data set D: N 1 N 1 + N 0 Machine Learning Fall 2015 Homework 1 Homework must be submitted electronically following the instructions on the course homepage. Make sure to explain you reasoning or show your derivations. Except for

More information

Data Mining. Lecture 03: Nearest Neighbor Learning

Data Mining. Lecture 03: Nearest Neighbor Learning Data Mining Lecture 03: Nearest Neighbor Learning Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Prof. R. Mooney (UT Austin) Prof E. Keogh (UCR), Prof. F. Provost

More information

Comparison of different preprocessing techniques and feature selection algorithms in cancer datasets

Comparison of different preprocessing techniques and feature selection algorithms in cancer datasets Comparison of different preprocessing techniques and feature selection algorithms in cancer datasets Konstantinos Sechidis School of Computer Science University of Manchester sechidik@cs.man.ac.uk Abstract

More information

International Journal of Scientific & Engineering Research, Volume 5, Issue 7, July ISSN

International Journal of Scientific & Engineering Research, Volume 5, Issue 7, July ISSN International Journal of Scientific & Engineering Research, Volume 5, Issue 7, July-2014 445 Clusters using High Dimensional Data for Feature Subset Algorithm V. Abhilash M. Tech Student Department of

More information

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu FMA901F: Machine Learning Lecture 3: Linear Models for Regression Cristian Sminchisescu Machine Learning: Frequentist vs. Bayesian In the frequentist setting, we seek a fixed parameter (vector), with value(s)

More information

D-Separation. b) the arrows meet head-to-head at the node, and neither the node, nor any of its descendants, are in the set C.

D-Separation. b) the arrows meet head-to-head at the node, and neither the node, nor any of its descendants, are in the set C. D-Separation Say: A, B, and C are non-intersecting subsets of nodes in a directed graph. A path from A to B is blocked by C if it contains a node such that either a) the arrows on the path meet either

More information

Regularization and model selection

Regularization and model selection CS229 Lecture notes Andrew Ng Part VI Regularization and model selection Suppose we are trying select among several different models for a learning problem. For instance, we might be using a polynomial

More information

CSCE 478/878 Lecture 6: Bayesian Learning and Graphical Models. Stephen Scott. Introduction. Outline. Bayes Theorem. Formulas

CSCE 478/878 Lecture 6: Bayesian Learning and Graphical Models. Stephen Scott. Introduction. Outline. Bayes Theorem. Formulas ian ian ian Might have reasons (domain information) to favor some hypotheses/predictions over others a priori ian methods work with probabilities, and have two main roles: Optimal Naïve Nets (Adapted from

More information

Employment of Multiple Algorithms for Optimal Path-based Test Selection Strategy. Miroslav Bures and Bestoun S. Ahmed

Employment of Multiple Algorithms for Optimal Path-based Test Selection Strategy. Miroslav Bures and Bestoun S. Ahmed 1 Employment of Multiple Algorithms for Optimal Path-based Test Selection Strategy Miroslav Bures and Bestoun S. Ahmed arxiv:1802.08005v1 [cs.se] 22 Feb 2018 Abstract Executing various sequences of system

More information

Empirical Evaluation of Feature Subset Selection based on a Real-World Data Set

Empirical Evaluation of Feature Subset Selection based on a Real-World Data Set P. Perner and C. Apte, Empirical Evaluation of Feature Subset Selection Based on a Real World Data Set, In: D.A. Zighed, J. Komorowski, and J. Zytkow, Principles of Data Mining and Knowledge Discovery,

More information

Machine Learning Techniques for Data Mining

Machine Learning Techniques for Data Mining Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART VII Moving on: Engineering the input and output 10/25/2000 2 Applying a learner is not all Already

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

Applying Supervised Learning

Applying Supervised Learning Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains

More information

Core Membership Computation for Succinct Representations of Coalitional Games

Core Membership Computation for Succinct Representations of Coalitional Games Core Membership Computation for Succinct Representations of Coalitional Games Xi Alice Gao May 11, 2009 Abstract In this paper, I compare and contrast two formal results on the computational complexity

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

Multicollinearity and Validation CIVL 7012/8012

Multicollinearity and Validation CIVL 7012/8012 Multicollinearity and Validation CIVL 7012/8012 2 In Today s Class Recap Multicollinearity Model Validation MULTICOLLINEARITY 1. Perfect Multicollinearity 2. Consequences of Perfect Multicollinearity 3.

More information

THE preceding chapters were all devoted to the analysis of images and signals which

THE preceding chapters were all devoted to the analysis of images and signals which Chapter 5 Segmentation of Color, Texture, and Orientation Images THE preceding chapters were all devoted to the analysis of images and signals which take values in IR. It is often necessary, however, to

More information

Collective classification in network data

Collective classification in network data 1 / 50 Collective classification in network data Seminar on graphs, UCSB 2009 Outline 2 / 50 1 Problem 2 Methods Local methods Global methods 3 Experiments Outline 3 / 50 1 Problem 2 Methods Local methods

More information

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset. Glossary of data mining terms: Accuracy Accuracy is an important factor in assessing the success of data mining. When applied to data, accuracy refers to the rate of correct values in the data. When applied

More information

Kernel-based online machine learning and support vector reduction

Kernel-based online machine learning and support vector reduction Kernel-based online machine learning and support vector reduction Sumeet Agarwal 1, V. Vijaya Saradhi 2 andharishkarnick 2 1- IBM India Research Lab, New Delhi, India. 2- Department of Computer Science

More information

CSE 573: Artificial Intelligence Autumn 2010

CSE 573: Artificial Intelligence Autumn 2010 CSE 573: Artificial Intelligence Autumn 2010 Lecture 16: Machine Learning Topics 12/7/2010 Luke Zettlemoyer Most slides over the course adapted from Dan Klein. 1 Announcements Syllabus revised Machine

More information

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München Evaluation Measures Sebastian Pölsterl Computer Aided Medical Procedures Technische Universität München April 28, 2015 Outline 1 Classification 1. Confusion Matrix 2. Receiver operating characteristics

More information

Lecture 7: Decision Trees

Lecture 7: Decision Trees Lecture 7: Decision Trees Instructor: Outline 1 Geometric Perspective of Classification 2 Decision Trees Geometric Perspective of Classification Perspective of Classification Algorithmic Geometric Probabilistic...

More information

A CSP Search Algorithm with Reduced Branching Factor

A CSP Search Algorithm with Reduced Branching Factor A CSP Search Algorithm with Reduced Branching Factor Igor Razgon and Amnon Meisels Department of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva, 84-105, Israel {irazgon,am}@cs.bgu.ac.il

More information

Supervised Learning: K-Nearest Neighbors and Decision Trees

Supervised Learning: K-Nearest Neighbors and Decision Trees Supervised Learning: K-Nearest Neighbors and Decision Trees Piyush Rai CS5350/6350: Machine Learning August 25, 2011 (CS5350/6350) K-NN and DT August 25, 2011 1 / 20 Supervised Learning Given training

More information

Forward Feature Selection Using Residual Mutual Information

Forward Feature Selection Using Residual Mutual Information Forward Feature Selection Using Residual Mutual Information Erik Schaffernicht, Christoph Möller, Klaus Debes and Horst-Michael Gross Ilmenau University of Technology - Neuroinformatics and Cognitive Robotics

More information

Markscheme May 2017 Mathematical studies Standard level Paper 1

Markscheme May 2017 Mathematical studies Standard level Paper 1 M17/5/MATSD/SP1/ENG/TZ/XX/M Markscheme May 017 Mathematical studies Standard level Paper 1 3 pages M17/5/MATSD/SP1/ENG/TZ/XX/M This markscheme is the property of the International Baccalaureate and must

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Efficient Voting Prediction for Pairwise Multilabel Classification

Efficient Voting Prediction for Pairwise Multilabel Classification Efficient Voting Prediction for Pairwise Multilabel Classification Eneldo Loza Mencía, Sang-Hyeun Park and Johannes Fürnkranz TU-Darmstadt - Knowledge Engineering Group Hochschulstr. 10 - Darmstadt - Germany

More information

Measures of Clustering Quality: A Working Set of Axioms for Clustering

Measures of Clustering Quality: A Working Set of Axioms for Clustering Measures of Clustering Quality: A Working Set of Axioms for Clustering Margareta Ackerman and Shai Ben-David School of Computer Science University of Waterloo, Canada Abstract Aiming towards the development

More information

Evaluating Classifiers

Evaluating Classifiers Evaluating Classifiers Charles Elkan elkan@cs.ucsd.edu January 18, 2011 In a real-world application of supervised learning, we have a training set of examples with labels, and a test set of examples with

More information

A FAST CLUSTERING-BASED FEATURE SUBSET SELECTION ALGORITHM

A FAST CLUSTERING-BASED FEATURE SUBSET SELECTION ALGORITHM A FAST CLUSTERING-BASED FEATURE SUBSET SELECTION ALGORITHM Akshay S. Agrawal 1, Prof. Sachin Bojewar 2 1 P.G. Scholar, Department of Computer Engg., ARMIET, Sapgaon, (India) 2 Associate Professor, VIT,

More information

Final Exam. Introduction to Artificial Intelligence. CS 188 Spring 2010 INSTRUCTIONS. You have 3 hours.

Final Exam. Introduction to Artificial Intelligence. CS 188 Spring 2010 INSTRUCTIONS. You have 3 hours. CS 188 Spring 2010 Introduction to Artificial Intelligence Final Exam INSTRUCTIONS You have 3 hours. The exam is closed book, closed notes except a two-page crib sheet. Please use non-programmable calculators

More information

Classification: Linear Discriminant Functions

Classification: Linear Discriminant Functions Classification: Linear Discriminant Functions CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Discriminant functions Linear Discriminant functions

More information

Performance Estimation and Regularization. Kasthuri Kannan, PhD. Machine Learning, Spring 2018

Performance Estimation and Regularization. Kasthuri Kannan, PhD. Machine Learning, Spring 2018 Performance Estimation and Regularization Kasthuri Kannan, PhD. Machine Learning, Spring 2018 Bias- Variance Tradeoff Fundamental to machine learning approaches Bias- Variance Tradeoff Error due to Bias:

More information

Based on Raymond J. Mooney s slides

Based on Raymond J. Mooney s slides Instance Based Learning Based on Raymond J. Mooney s slides University of Texas at Austin 1 Example 2 Instance-Based Learning Unlike other learning algorithms, does not involve construction of an explicit

More information

CISC 4631 Data Mining

CISC 4631 Data Mining CISC 4631 Data Mining Lecture 03: Nearest Neighbor Learning Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Prof. R. Mooney (UT Austin) Prof E. Keogh (UCR), Prof. F.

More information

3 No-Wait Job Shops with Variable Processing Times

3 No-Wait Job Shops with Variable Processing Times 3 No-Wait Job Shops with Variable Processing Times In this chapter we assume that, on top of the classical no-wait job shop setting, we are given a set of processing times for each operation. We may select

More information

Naïve Bayes Classification. Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others

Naïve Bayes Classification. Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others Naïve Bayes Classification Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others Things We d Like to Do Spam Classification Given an email, predict

More information

INF 4300 Classification III Anne Solberg The agenda today:

INF 4300 Classification III Anne Solberg The agenda today: INF 4300 Classification III Anne Solberg 28.10.15 The agenda today: More on estimating classifier accuracy Curse of dimensionality and simple feature selection knn-classification K-means clustering 28.10.15

More information

Classification with Decision Tree Induction

Classification with Decision Tree Induction Classification with Decision Tree Induction This algorithm makes Classification Decision for a test sample with the help of tree like structure (Similar to Binary Tree OR k-ary tree) Nodes in the tree

More information

FUTURE communication networks are expected to support

FUTURE communication networks are expected to support 1146 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL 13, NO 5, OCTOBER 2005 A Scalable Approach to the Partition of QoS Requirements in Unicast and Multicast Ariel Orda, Senior Member, IEEE, and Alexander Sprintson,

More information

Application of Support Vector Machine Algorithm in Spam Filtering

Application of Support Vector Machine Algorithm in  Spam Filtering Application of Support Vector Machine Algorithm in E-Mail Spam Filtering Julia Bluszcz, Daria Fitisova, Alexander Hamann, Alexey Trifonov, Advisor: Patrick Jähnichen Abstract The problem of spam classification

More information

Estimating Noise and Dimensionality in BCI Data Sets: Towards Illiteracy Comprehension

Estimating Noise and Dimensionality in BCI Data Sets: Towards Illiteracy Comprehension Estimating Noise and Dimensionality in BCI Data Sets: Towards Illiteracy Comprehension Claudia Sannelli, Mikio Braun, Michael Tangermann, Klaus-Robert Müller, Machine Learning Laboratory, Dept. Computer

More information

Stability of Feature Selection Algorithms

Stability of Feature Selection Algorithms Stability of Feature Selection Algorithms Alexandros Kalousis, Jullien Prados, Phong Nguyen Melanie Hilario Artificial Intelligence Group Department of Computer Science University of Geneva Stability of

More information

An Empirical Study of Lazy Multilabel Classification Algorithms

An Empirical Study of Lazy Multilabel Classification Algorithms An Empirical Study of Lazy Multilabel Classification Algorithms E. Spyromitros and G. Tsoumakas and I. Vlahavas Department of Informatics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece

More information