Feature Selection. Department Biosysteme Karsten Borgwardt Data Mining Course Basel Fall Semester / 262


1 Feature Selection

2 What is Feature Selection?

3 What is Feature Selection?

Abundance of features
Usually, our output variable Y does not depend on all of our input features X. Why is this? X usually includes all features that could determine Y according to our prior knowledge, but we do not know for sure. In fact, we perform supervised learning to determine this dependence between input variables and output variables.

(Supervised) feature selection means selecting the relevant subset of features for a particular learning task. Note the difference from feature ranking, which orders features according to their relevance, and feature transformation, which translates the original features into a new feature representation.

4 Why Feature Selection?

Reasons for feature selection
- to detect causal features
- to remove noisy features
- to reduce the set of features that has to be observed, to reduce cost and to improve speed and data understanding

Two modes of feature selection
- Filter approaches: select interesting features a priori, based on a quality function (information criterion)
- Wrapper approaches: select special features that are interesting for one particular classifier

5 Optimisation Problem

Combinatorial problem
Given a set of features D and a quality function q, we try to find the subset S of D of cardinality n that maximises q:

$$\operatorname*{arg\,max}_{S \subseteq D,\ |S| = n} q(S) \tag{1}$$

Exponential runtime effort
The computational effort for enumerating all possibilities is exponential in n, and hence intractable for large D and n. In practice, we have to find a workaround!
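
To make the cost of exhaustive enumeration concrete, here is a minimal Python sketch. The helper `best_subset`, the per-feature `scores` and the quality function `toy_q` are illustrative assumptions and do not come from the lecture.

```python
from itertools import combinations
from math import comb


def best_subset(features, q, n):
    """Exhaustively evaluate q on every subset of size n and return the best one."""
    return max(combinations(features, n), key=q)


# Hypothetical per-feature scores; the toy quality function simply sums them.
scores = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.7}


def toy_q(subset):
    return sum(scores[f] for f in subset)


best = best_subset(sorted(scores), toy_q, n=2)

# Number of candidate subsets that exhaustive search has to evaluate.
n_subsets_small = comb(4, 2)
n_subsets_large = comb(1000, 10)
```

With 4 features and n = 2 there are only 6 subsets to check, but `comb(1000, 10)` is already on the order of 10^23, which is why a workaround is needed.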

6 Greedy Selection

Take the currently best one
- Greedy selection is an alternative to exhaustive enumeration.
- The idea is to iteratively add the currently most informative feature to the selected set, or to remove the currently most uninformative feature from the solution set.
- These two variants of greedy feature selection are referred to as forward feature selection and backward elimination.

7 Greedy Selection

Forward Feature Selection
  S ← ∅, R ← D          (S: selected features, R: remaining candidates)
  repeat
    j ← arg max_{j ∈ R} q(S ∪ {j})
    S ← S ∪ {j}
    R ← R \ {j}
  until |S| = n
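
The forward selection loop can be sketched in Python as follows. This is only an illustration under assumptions: the function name `forward_selection`, the toy `covers` data and the coverage-style quality function are hypothetical, and any quality function q that accepts a list of features could be plugged in instead.

```python
def forward_selection(features, q, n):
    """Greedily add the feature that maximises q(S ∪ {j}) until |S| = n."""
    selected = []                 # S: features chosen so far, in order of selection
    remaining = list(features)    # R: candidates not yet chosen
    while len(selected) < n and remaining:
        best = max(remaining, key=lambda j: q(selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: each feature "covers" a set of items.
covers = {"f1": {1, 2, 3}, "f2": {4}, "f3": {1, 2}, "f4": {5, 6, 7, 8}}


def coverage(subset):
    """Toy quality function: number of distinct items covered by the subset."""
    return len(set().union(*(covers[f] for f in subset)))


chosen = forward_selection(covers, coverage, n=2)
```

Each round evaluates q once per remaining candidate, so selecting n features from D costs at most n·|D| evaluations of q instead of one per subset.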

8 Greedy Selection

Backward Elimination
  S ← D, R ← ∅          (S: retained features, R: removed features)
  repeat
    j ← arg max_{j ∈ S} q(S \ {j})
    S ← S \ {j}
    R ← R ∪ {j}
  until |S| = n
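
A corresponding Python sketch of backward elimination is given below. As before, the function name `backward_elimination`, the toy `covers` data and the coverage-style quality function are assumptions made for illustration only.

```python
def backward_elimination(features, q, n):
    """Greedily drop the feature whose removal leaves the highest q until |S| = n."""
    retained = list(features)   # S: features still in the solution
    removed = []                # R: features dropped so far, in order of removal
    while len(retained) > n:
        worst = max(retained, key=lambda j: q([f for f in retained if f != j]))
        retained.remove(worst)
        removed.append(worst)
    return retained, removed


# Hypothetical toy data: each feature "covers" a set of items.
covers = {"f1": {1, 2, 3}, "f2": {4}, "f3": {1, 2}, "f4": {5, 6, 7, 8}}


def coverage(subset):
    """Toy quality function: number of distinct items covered by the subset."""
    return len(set().union(*(covers[f] for f in subset)))


retained, removed = backward_elimination(covers, coverage, n=2)
```

In this toy example the feature `f3` is removed first, because everything it covers is also covered by `f1`, so dropping it does not reduce q at all.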

9 Greedy Selection

Optimality of greedy selection
- Guaranteed to be optimal if q decomposes over the elements of S:

$$q(S) = \sum_{x \in S} q(x) \tag{2}$$

- Near-optimal if q is submodular (more details later).
- Otherwise there is no guarantee of optimality.

10 Correlation Coefficient

Definition
The correlation coefficient $\rho_{X,Y}$ between two random variables X and Y with expected values $\mu_X$ and $\mu_Y$ and standard deviations $\sigma_X$ and $\sigma_Y$ is defined as

\begin{align}
\rho_{X,Y} &= \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} \tag{3} \\
&= \frac{E\big((X - \mu_X)(Y - \mu_Y)\big)}{\sigma_X \sigma_Y}, \tag{4}
\end{align}

where E is the expected value operator and cov means covariance.
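
As a filter-style illustration, the sample version of this coefficient can be computed and used to rank features. The sketch below uses only the Python standard library. The function `pearson` and the toy data are assumptions for illustration, and the expectations are replaced by sample averages (the 1/n factors cancel in the ratio).

```python
from math import sqrt


def pearson(x, y):
    """Sample estimate of rho_{X,Y} = cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: three features and one output variable.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
features = {
    "linear": [2.0, 4.0, 6.0, 8.0, 10.0],     # y scaled by 2
    "inverse": [5.0, 4.0, 3.0, 2.0, 1.0],     # decreases as y increases
    "symmetric": [4.0, 1.0, 0.0, 1.0, 4.0],   # (y - 3)^2, no linear trend
}

# Filter-style ranking by absolute correlation with the output.
ranking = sorted(features, key=lambda f: abs(pearson(features[f], y)), reverse=True)
```

The `symmetric` feature is a deterministic function of y, yet its correlation with y is zero, because the correlation coefficient only captures linear dependence.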

11 Mutual Information

Definition
Given two random variables X and Y, we define the mutual information $I(\cdot, \cdot)$ as

$$I(X, Y) = \sum_{y \in Y} \sum_{x \in X} p(X = x, Y = y) \log\left(\frac{p(X = x, Y = y)}{p(X = x)\, p(Y = y)}\right), \tag{5}$$

where X is the input variable, Y is the output variable, $p(X = x, Y = y)$ is the (joint) probability of observing x and y, and $p(X = x)$ and $p(Y = y)$ are the marginal probabilities of observing x and y, respectively. log is usually the logarithm with base 2.
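
For discrete variables, the mutual information can be estimated by plugging empirical frequencies into the definition. The following Python sketch does this with base-2 logarithms. The function name `mutual_information` and the toy samples are assumptions for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plug-in estimate of I(X, Y) in bits from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # Only observed (x, y) pairs contribute; unobserved pairs have probability 0.
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


# Hypothetical toy samples of a binary output and two binary features.
y = [0, 0, 1, 1, 0, 0, 1, 1]
copy_of_y = [0, 0, 1, 1, 0, 0, 1, 1]   # identical to y
unrelated = [0, 1, 0, 1, 0, 1, 0, 1]   # independent of y in this sample

mi_copy = mutual_information(copy_of_y, y)
mi_unrelated = mutual_information(unrelated, y)
```

A feature that copies a balanced binary output carries one full bit of information about it, while the feature that is empirically independent of y carries none.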

12 Hilbert-Schmidt Independence Criterion

Definition (Gretton et al., 2005)
The Hilbert-Schmidt Independence Criterion (HSIC) measures the dependence of two random variables X and Y. An empirical estimate of the HSIC is proportional to

$$\operatorname{trace}(KHLH), \tag{6}$$

where K is a kernel matrix on X, L is a kernel matrix on Y, and H is a centering matrix with $H_{ij} = \delta(i, j) - \frac{1}{n}$.

- HSIC(X, Y) = 0 iff X and Y are independent.
- The larger HSIC(X, Y), the larger the dependence between X and Y.
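
The trace expression can be evaluated directly for small samples. The sketch below is a plain-Python illustration under several assumptions: scalar samples, a Gaussian kernel with bandwidth `sigma` for both variables, and hypothetical function names. It returns the unnormalised quantity trace(KHLH), to which the empirical HSIC estimate is proportional.

```python
from math import exp


def gaussian_kernel_matrix(values, sigma=1.0):
    """Kernel matrix with entries exp(-(a - b)^2 / (2 sigma^2)) for scalar samples."""
    return [[exp(-((a - b) ** 2) / (2 * sigma**2)) for b in values] for a in values]


def center(K):
    """Return H K H for the centering matrix H_ij = delta(i, j) - 1/n."""
    n = len(K)
    row_means = [sum(row) / n for row in K]
    col_means = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_means) / n
    return [
        [K[i][j] - row_means[i] - col_means[j] + total_mean for j in range(n)]
        for i in range(n)
    ]


def hsic_trace(xs, ys, sigma=1.0):
    """Compute trace(K H L H) for Gaussian kernels on scalar samples xs and ys."""
    K = gaussian_kernel_matrix(xs, sigma)
    L = gaussian_kernel_matrix(ys, sigma)
    Kc = center(K)
    n = len(xs)
    # trace(K H L H) = trace((H K H) L) = sum_ij (H K H)_ij * L_ij, since L is symmetric.
    return sum(Kc[i][j] * L[i][j] for i in range(n) for j in range(n))


# Hypothetical toy samples.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
dependent = hsic_trace(xs, xs)              # Y = X
constant = hsic_trace(xs, [1.0] * len(xs))  # Y carries no information about X
```

For a constant Y the kernel matrix L consists of ones only, so centering removes it entirely and the trace is zero up to rounding error.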

13 Submodular Functions

Definition
A function q on the subsets of a set D is said to be submodular if

$$q(S \cup X) - q(S) \geq q(T \cup X) - q(T), \tag{7}$$

where $X \subseteq D$, $S \subseteq D$, $T \subseteq D$ and $S \subseteq T$.

This is referred to as the property of "diminishing returns": if S is a subset of T, then S benefits at least as much from adding X as T does.

14 Submodular Functions

Near-optimality (Nemhauser, Wolsey, and Fisher, 1978)
If q is a submodular, nondecreasing set function and $q(\emptyset) = 0$, then the greedy algorithm is guaranteed to find a set S such that

$$q(S) \geq \left(1 - \frac{1}{e}\right) \max_{|T| = |S|} q(T). \tag{8}$$

This means that the solution of greedy selection reaches at least 63% ($1 - 1/e \approx 0.632$) of the quality of the optimal solution.

15 Submodular Functions

Example: Sensor Placement
- Imagine our features form a graph G = (D, E).
- Imagine the features are possible locations for a sensor.
- Each sensor may cover a node v and its neighbourhood N(v), that is

$$q(S) = \Big|\bigcup_{v \in S} \big(N(v) \cup \{v\}\big)\Big|.$$

- Now we want to pick locations in the graph such that our sensors cover as large an area of the graph as possible.
- q fulfills the following properties:
  - $q(\emptyset) = 0$
  - q is non-decreasing
  - q is submodular
- Hence greedy selection will lead to near-optimal sensor placement!
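
The sensor placement example can be tried out on a small graph. The Python sketch below compares greedy placement with the brute-force optimum and with the (1 − 1/e) bound. The graph, the number of sensors `k` and all function names are hypothetical choices made for illustration.

```python
from itertools import combinations
from math import e

# Hypothetical undirected graph given as adjacency sets.
graph = {
    1: {2, 3},
    2: {1},
    3: {1, 4},
    4: {3, 5, 6},
    5: {4},
    6: {4, 7},
    7: {6},
}


def coverage(sensors):
    """q(S): number of nodes that are a sensor location or adjacent to one."""
    covered = set()
    for v in sensors:
        covered |= graph[v] | {v}
    return len(covered)


def greedy_placement(k):
    """Greedy forward selection of k sensor locations."""
    chosen = []
    for _ in range(k):
        candidates = [v for v in graph if v not in chosen]
        chosen.append(max(candidates, key=lambda v: coverage(chosen + [v])))
    return chosen


k = 2
greedy = greedy_placement(k)
optimum = max(combinations(graph, k), key=coverage)  # brute force, small graphs only
bound = (1 - 1 / e) * coverage(optimum)
```

On this toy graph the greedy solution covers as many nodes as the brute-force optimum. In general only the lower bound `coverage(greedy) >= bound` is guaranteed.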

16 Wrapper Methods

Two flavours
- not embedded (wrapper): the learning algorithm is employed as a quality measure
- embedded: the selection process is integrated into the learning algorithm

Wrappers
- Simple wrapper: do prediction using 1 feature only. Use classification accuracy as the measure of quality.
- Extend this to groups of features by heuristic search strategies (e.g. greedy, Monte Carlo).

Embedded
- Typical examples: decision trees and the $\ell_0$-norm SVM

17 $\ell_0$-norm SVM (Weston et al., 2003)

Key Idea
1. Train a regular linear SVM (using $\ell_1$-norm or $\ell_2$-norm regularization).
2. Re-scale the input variables x by multiplying them by the absolute values of the components of the weight vector w obtained, that is $x_{\text{new}} = |w| \odot x$ (componentwise product).
3. Iterate the first 2 steps until convergence.

18 Number of Features

What if we do not know a reasonable choice of n?

Pick all features that are significantly correlated with the output variable.
- Problem 1: Multiple testing problem: even some randomly generated features may show a significant correlation.
- Problem 2: Ignores interactions and correlations between features.

Use the probe method (Bi et al., 2003; Stoppiglia et al., 2003; Tusher et al., 2003):
- Insert fake features (= probes) into the set of features.
- Fake features can be drawn randomly from a Gaussian distribution, or they can be created in a nonparametric manner by randomly shuffling existing features.
- Stop feature selection when you select the first fake feature or when the proportion of fake features exceeds a certain threshold.
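
A minimal sketch of the probe method with shuffled probes and the "stop at the first fake feature" rule might look as follows. It assumes, purely for illustration, that features are ranked by absolute correlation with the output. The function names, the seed and the toy data are hypothetical as well.

```python
import random
from math import sqrt


def abs_correlation(x, y):
    """Absolute sample correlation between two equally long numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return abs(cov) / sqrt(var_x * var_y)


def select_until_first_probe(features, y, seed=0):
    """Rank real and shuffled (probe) features; keep real ones ranked above the first probe."""
    rng = random.Random(seed)
    scored = [(abs_correlation(col, y), name, False) for name, col in features.items()]
    for name, col in features.items():
        probe = col[:]          # nonparametric probe: a shuffled copy of a real feature
        rng.shuffle(probe)
        scored.append((abs_correlation(probe, y), "probe_" + name, True))
    scored.sort(key=lambda t: t[0], reverse=True)
    selected = []
    for _, name, is_probe in scored:
        if is_probe:
            break               # stop at the first fake feature
        selected.append(name)
    return selected


# Hypothetical toy data: one informative feature and one pure-noise feature.
data_rng = random.Random(1)
y = [data_rng.gauss(0, 1) for _ in range(50)]
features = {
    "signal": [2 * v for v in y],
    "noise": [data_rng.gauss(0, 1) for _ in range(50)],
}
selected = select_until_first_probe(features, y)
```

Shuffling keeps each probe's marginal distribution identical to that of a real feature while destroying its relation to y, which is what makes the probes a fair reference for "irrelevant".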

19 Unsupervised Feature Selection

Problem setting
Even without a target variable Y, we can select features that are informative according to a criterion of interest.

Criteria (Guyon and Elisseeff, 2003)
- Saliency: A feature is salient if it has a high variance or range.
- Entropy: A feature has high entropy if the distribution of examples is uniform.
- Smoothness: A feature in a time series is smooth if its average local curvature is moderate.
- Density: A feature is in a high-density region if it is connected with many other variables.
- Reliability: A feature is reliable if the measurement error bars are smaller than the variability.

20 Feature Selection in Practice

Catalog of 10 questions by Guyon and Elisseeff
1. Do you have domain knowledge? If yes, construct a better set of ad-hoc features.
2. Are your features commensurate? If no, consider normalizing them.
3. Do you suspect interdependence of features? If yes, expand your feature set by constructing conjunctive features or products of features, as much as your computer resources allow you.
4. Do you need to prune the input variables (e.g. for cost, speed or data understanding reasons)? If no, construct disjunctive features or weighted sums of features.

21 Feature Selection in Practice

Catalog of 10 questions by Guyon and Elisseeff
5. Do you need to assess features individually (e.g. to understand their influence on the system or because their number is so large that you need to do a first filtering)? If yes, use a variable ranking method; else, do it anyway to get baseline results.
6. Do you need a predictor? If no, stop.
7. Do you suspect your data is dirty (has a few meaningless input patterns and/or noisy outputs or wrong class labels)? If yes, detect the outlier examples using the top ranking variables obtained in step 5 as representation; check and/or discard them.

22 Feature Selection in Practice

Catalog of 10 questions by Guyon and Elisseeff
8. Do you know what to try first? If no, use a linear predictor. Use a forward selection method with the probe method as a stopping criterion or use the $\ell_0$-norm embedded method. For comparison, following the ranking of step 5, construct a sequence of predictors of the same nature using increasing subsets of features. Can you match or improve performance with a smaller subset? If yes, try a non-linear predictor with that subset.
9. Do you have new ideas, time, computational resources, and enough examples? If yes, compare several feature selection methods, including your new idea, correlation coefficients, backward selection and embedded methods. Use linear and non-linear predictors. Select the best approach with model selection.

23 Feature Selection in Practice

Catalog of 10 questions by Guyon and Elisseeff
10. Do you want a stable solution (to improve performance and/or understanding)? If yes, subsample your data and redo your analysis for several bootstraps.

24 Revealing Examples

Can presumably redundant variables help each other?
Noise reduction and consequently better class separation may be obtained by adding variables that are presumably redundant.

How does correlation impact variable redundancy?
Perfectly correlated variables are truly redundant in the sense that no additional information is gained by adding them. Very high variable (anti-)correlation does not mean absence of variable complementarity.

Can a variable that is useless by itself be useful with others?
Two variables that are useless by themselves can be useful together.
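
A classic way to see that two individually useless variables can be useful together is the XOR problem. The Python sketch below is an illustrative assumption rather than the example shown in the lecture. It computes the best achievable accuracy when a classifier sees only the first variable, only the second, or both.

```python
from itertools import product

# All four combinations of two binary variables, with the class given by XOR.
samples = [(x1, x2, x1 ^ x2) for x1, x2 in product([0, 1], repeat=2)]


def best_accuracy(columns):
    """Accuracy of the best possible classifier that sees only the given columns."""
    groups = {}
    for row in samples:
        key = tuple(row[c] for c in columns)
        groups.setdefault(key, []).append(row[2])
    # Within each group of identical inputs, predict the majority class.
    correct = sum(max(labels.count(0), labels.count(1)) for labels in groups.values())
    return correct / len(samples)


acc_x1 = best_accuracy([0])        # first variable only
acc_x2 = best_accuracy([1])        # second variable only
acc_both = best_accuracy([0, 1])   # both variables together
```

Each variable on its own does no better than chance (accuracy 0.5), whereas both together determine the class perfectly (accuracy 1.0). A filter that scores variables one at a time would discard both.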
