SGN-41006 (4 cr) Chapter 10


1 SGN-41006 (4 cr) Chapter 10: Feature Selection and Extraction
Jussi Tohka & Jari Niemi, Department of Signal Processing, Tampere University of Technology, February 18, 2014

2 Contents of This Lecture
1 Feature Selection
2 Linear Feature Extraction
3 Multidimensional Scaling

3 Material
Chapter 10 in WebCop.
Extra material about LASSO: Section in HasTibFri:2009.

4 What Should You Already Know?
Curse of dimensionality.
A few carefully chosen (selected or extracted) features may lead to a sufficiently small overlap between the class-conditional pdfs, and hence to a good classifier which is easy to train and use.

5 Feature Selection and Extraction 1/2
Reducing the number of variables, either through feature selection or feature extraction, can lead to improved classifier performance or greater understanding of the data.
Especially important when the number of variables is large with respect to the number of training samples (when n ≈ p or n < p).

6 Feature Selection and Extraction 2/2
Feature selection: select a subset of d features out of the p available measurements (d has to be determined as well). Also termed subset selection or variable selection, although some make a distinction between variable and feature selection.
Feature extraction: find a transformation from the p measurements to a d-dimensional space (usually d < p).
Feature selection and (linear) feature extraction may both be regarded as transformation methods in which a set of weights is applied to the original measurements.
Both approaches require the optimization of some criterion J.

7 Feature Selection: Reasons for Feature Selection
Improving the classifier.
Removing irrelevant data.
Reducing storage and computational requirements.
Reducing the cost of future data collection.
Reducing the classifier complexity for better understanding of the data.
Feature selection is essential in applications where n ≪ p, arising often in bioinformatics and neuroimaging.
Feature selection is not necessary for classifiers that employ regularization (e.g. SVMs).

8 Feature Selection: Example Applications
Brain-imaging-based prediction of conversion to Alzheimer's disease (Moradi et al., submitted 2014).
Task: try to predict whether a mild cognitive impairment (MCI) patient develops Alzheimer's disease within 3 years.
Data: 184 MRIs of the MCI patients, each with a large number of voxels (measurements).
Feature selection clearly improves the performance also of SVMs, by preventing overfitting.
Many bioinformatics examples in Chapter 18 of HasTibFri:2009.

9 Feature Selection: Theory
Four kinds of features (a feature denoted by X):
1 Irrelevant: X is not necessary at all.
2 Redundant: defined via Markov blankets (see the Markov blanket slide below).
3 Weakly relevant: the feature is not always necessary, but may be for certain subsets (of features); X may be necessary.
4 Strongly relevant: the distribution of the class predictor depends on X; X cannot be removed.
A good feature set will include none of the irrelevant features, some of the weakly relevant features, and all strongly relevant features. The subset of weakly relevant features is selected based on redundancy.

10 Feature Selection: Computational Complexity
(From Huttunen, Manninen, Tohka, FCSE 2012.)
Evaluation of all possible feature subsets is not possible even for small p, since the number of subsets is 2^p. For example, for p = 408 the number of subsets is 2^408 (about 6.6 × 10^122).
Assume further that the annual increase in the earth's computational power is 57%. Even if a subset could be evaluated in a single clock cycle, evaluating all the subsets using all the computational power on the earth would take 517 years.

11 Feature Selection: Categorization of Feature Selection Methods
Filters: computationally efficient.
Wrappers: often better performance than with filters.
Embedded: feature selection is built into the classifier design; decision trees and LASSO are examples. Embedded methods may also be considered as wrappers.
(Figures illustrating the filter and wrapper architectures, from Kohavi and John, Artificial Intelligence, 1997.)

12 Feature Selection: Evaluation Measures for Filters
Independent of the final classifier:
Feature ranking (next slide).
Interclass distance (Tables 10.2 and 10.3).
Probabilistic distance (Kullback-Leibler, Table 10.4).
Probabilistic dependence.

13 Feature Selection: Filters - Simple Feature Ranking
Two-class problem, p features.
Data x^1, ..., x^{n_1} from ω_1; y^1, ..., y^{n_2} from ω_2.
For each feature i, compute
J(i) = | (1/n_1) Σ_{j=1}^{n_1} x_i^j − (1/n_2) Σ_{j=1}^{n_2} y_i^j | / σ(i),
where σ(i) is the standard deviation of the ith feature.
The higher the J(i), the more important the feature i.
Select the d best features, or select the features with J(i) greater than some threshold.
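As an illustration (not from the slides), a minimal NumPy sketch of this ranking score; it assumes σ(i) is the pooled standard deviation of feature i over both classes, and that the class data are given as arrays X1 and X2:

```python
import numpy as np

def feature_ranking_scores(X1, X2):
    """Two-class feature ranking score J(i) for each feature i.

    X1: (n1, p) samples from class omega_1, X2: (n2, p) samples from
    class omega_2. Returns a length-p array of scores.
    """
    mean_diff = np.abs(X1.mean(axis=0) - X2.mean(axis=0))
    # Pooled per-feature standard deviation over both classes (an assumption)
    sigma = np.concatenate([X1, X2], axis=0).std(axis=0)
    return mean_diff / sigma

# Toy usage: rank features and keep the d best
rng = np.random.default_rng(0)
X1 = rng.normal(0.0, 1.0, size=(50, 10))
X2 = rng.normal(0.5, 1.0, size=(60, 10))
J = feature_ranking_scores(X1, X2)
d = 3
best = np.argsort(J)[::-1][:d]   # indices of the d highest-scoring features
print(best, J[best])
```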

14 Feature Selection: Filters - Relief
Some features may become relevant in the context of others.
For each pattern x^i, compute its K nearest hits (patterns within the same class, {x^{H_k(i)}}) and K nearest misses (patterns within a different class, {x^{M_k(i)}}).
Relief criterion:
R(j) = Σ_{i=1}^{n} Σ_{k=1}^{K} | x_j^i − x_j^{M_k(i)} | − Σ_{i=1}^{n} Σ_{k=1}^{K} | x_j^i − x_j^{H_k(i)} |
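A rough NumPy sketch of the Relief criterion above (illustrative only; the Euclidean neighbour search and the handling of ties are assumptions, not part of the slides):

```python
import numpy as np

def relief_scores(X, y, K=5):
    """Relief-style relevance score R(j) for each feature j (a sketch).

    For every sample, find its K nearest hits (same class) and K nearest
    misses (other class) using Euclidean distance over all features, then
    accumulate |x_j - miss_j| - |x_j - hit_j| per feature.
    """
    n, p = X.shape
    R = np.zeros(p)
    # Pairwise Euclidean distances between all samples
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    for i in range(n):
        same = np.where(y == y[i])[0]
        same = same[same != i]                  # exclude the sample itself
        other = np.where(y != y[i])[0]
        hits = same[np.argsort(D[i, same])[:K]]
        misses = other[np.argsort(D[i, other])[:K]]
        R += np.abs(X[i] - X[misses]).sum(axis=0)   # reward distance to misses
        R -= np.abs(X[i] - X[hits]).sum(axis=0)     # penalize distance to hits
    return R
```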

15 Feature Selection: Filters - Markov Blanket
Warning: this slide is mathematically involved.
Let M_X be a subset of the feature set X that does not contain the feature X. Then M_X is a Markov blanket for X if X is conditionally independent of X − (M_X ∪ {X}) given M_X.
Let G be a set of features. A feature in G is redundant and can be removed if and only if it is weakly relevant and has a Markov blanket in G.
Incremental Association Markov Blanket (IAMB) algorithm, 2 phases:
1 Growing: identify features that have a strong dependency on the class variable.
2 Shrinking: remove redundant variables.
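A very simplified sketch of the grow/shrink idea behind IAMB. The conditional-association measure used here (correlation after regressing the current blanket out of both variables) is a crude stand-in chosen for illustration; a real implementation would use a proper conditional independence test:

```python
import numpy as np

def partial_assoc(x, y, Z):
    """Crude conditional-association measure (a stand-in for a CI test):
    absolute correlation between x and y after linearly regressing Z out."""
    if Z.shape[1] > 0:
        A = np.column_stack([Z, np.ones(len(x))])
        x = x - A @ np.linalg.lstsq(A, x, rcond=None)[0]
        y = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    c = np.corrcoef(x, y)[0, 1]
    return 0.0 if np.isnan(c) else abs(c)

def iamb(X, y, threshold=0.05):
    """Sketch of the IAMB grow/shrink phases for the class variable y."""
    n, p = X.shape
    mb = []
    # Growing: greedily add the feature most associated with y given the
    # current blanket, while the association is non-negligible.
    changed = True
    while changed:
        changed = False
        rest = [j for j in range(p) if j not in mb]
        if not rest:
            break
        scores = [partial_assoc(X[:, j], y, X[:, mb]) for j in rest]
        best = int(np.argmax(scores))
        if scores[best] > threshold:
            mb.append(rest[best])
            changed = True
    # Shrinking: remove features that become conditionally independent of y
    # given the rest of the blanket.
    for j in list(mb):
        others = [k for k in mb if k != j]
        if partial_assoc(X[:, j], y, X[:, others]) <= threshold:
            mb.remove(j)
    return mb
```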

16 Feature Selection: Evaluation Criteria for Wrappers
The error rate is the most widely used evaluation criterion with wrappers.
The evaluation criterion should resemble (or rather, be the same as) the performance criterion of the classifier itself.
If the error rate is a poor performance criterion, then a more suitable criterion should be selected.

17 Feature Selection: Search Algorithms for Wrappers
Complete search - branch and bound; often computationally complex and requires a specific monotonicity property of the evaluation criterion.
Sequential search - features are added (forward search) or removed (backward search) sequentially; modification: floating versions.
Random search - inject randomness into the search.

18 Feature Selection: Sequential Forward Selection
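A minimal wrapper-style sequential forward selection sketch (illustrative; the scikit-learn classifier, the CV scoring, and the stopping rule are assumptions, not the course's reference implementation):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def sequential_forward_selection(X, y, d, clf=None, cv=5):
    """Starting from the empty set, repeatedly add the single feature whose
    inclusion gives the best cross-validated accuracy, until d features
    have been selected."""
    if clf is None:
        clf = KNeighborsClassifier(n_neighbors=3)
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < d and remaining:
        scores = []
        for j in remaining:
            cols = selected + [j]
            scores.append(cross_val_score(clf, X[:, cols], y, cv=cv).mean())
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```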

19 Feature Selection: Random Search - Simulated Annealing Example

20 Feature Selection: Embedded Methods - LASSO
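As a hedged example of an embedded method, an L1-penalized (LASSO-type) logistic regression with scikit-learn; the toy data, the pipeline, and the value of C are assumptions made for illustration. Features whose coefficients are driven exactly to zero are effectively deselected during training itself:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Toy data: only the first two features carry information about the class.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = (X[:, 0] - 2 * X[:, 1] + 0.5 * rng.normal(size=200) > 0).astype(int)

# The L1 penalty pushes many coefficients exactly to zero, so feature
# selection is embedded in the classifier design.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
)
model.fit(X, y)
coef = model.named_steps["logisticregression"].coef_.ravel()
selected = np.flatnonzero(coef != 0)
print("selected features:", selected)
```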

21 Feature Selection: Nested CV
Note that if a performance assessment is required for an embedded or a wrapper method, it is imperative to use nested cross-validation for the performance estimate, i.e., to use nested CV loops: the inner loop for feature selection and the outer loop for the performance evaluation.
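A sketch of nested CV with scikit-learn: the inner GridSearchCV tunes the feature selection (and the classifier) on training data only, while the outer cross_val_score estimates the performance of the whole procedure. The specific selector, parameter grid, and toy data are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=200, n_informative=5,
                           random_state=0)

# Inner loop: choose the number of features (and C) by CV on training data only.
pipe = Pipeline([("select", SelectKBest(f_classif)), ("svm", SVC())])
param_grid = {"select__k": [5, 10, 20, 50], "svm__C": [0.1, 1, 10]}
inner = GridSearchCV(pipe, param_grid, cv=5)

# Outer loop: an unbiased performance estimate of the whole procedure,
# feature selection included.
outer_scores = cross_val_score(inner, X, y, cv=5)
print("nested CV accuracy: %.3f +/- %.3f" % (outer_scores.mean(), outer_scores.std()))
```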

22 Feature Selection: Stability of Feature Selection
Different feature selection algorithms do not necessarily produce the same ranking of the features given the same dataset.
Even with the same algorithm, considerable variation between the different CV folds is possible, even probable.
High reproducibility of the selected features can be as important as a high classification rate.
There are stability selection algorithms utilizing these ideas.
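One simple way (an assumption here, not from the slides) to quantify stability is to select features separately in each CV fold and compute the average pairwise Jaccard overlap of the selected subsets:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=100, n_features=200, n_informative=5,
                           random_state=0)
k = 10
subsets = []
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Select k features using the training part of this fold only
    sel = SelectKBest(f_classif, k=k).fit(X[train_idx], y[train_idx])
    subsets.append(set(np.flatnonzero(sel.get_support())))

# Average pairwise Jaccard overlap of the selected subsets across folds
pairs = [(a, b) for i, a in enumerate(subsets) for b in subsets[i + 1:]]
jaccard = [len(a & b) / len(a | b) for a, b in pairs]
print("mean pairwise Jaccard stability: %.2f" % np.mean(jaccard))
```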

23 Linear Feature Extraction: Motivation for Feature Extraction
Feature extraction: find a transformation from the p measurements to a d-dimensional space (usually d < p), in order to:
reduce the bandwidth of the original data;
improve the classifier;
reduce the redundancy;
recover new meaningful underlying variables or features to describe the data;
produce graphical representations of the data.

24 Linear Feature Extraction: Principal Component Analysis (PCA) 1/2
Derive new variables (in decreasing order of importance) that are linear combinations of the original variables and are uncorrelated.
Geometrically, PCA is a rotation of the axes of the original coordinate system to a new set of orthogonal axes.
The axes are ordered in terms of the amount of variation of the original data they account for (importance in the PCA sense).

25 Linear Feature Extraction: Principal Component Analysis (PCA) 2/2
Let x = [x_1, ..., x_p]^T be the set of original (random) variables and let y = [y_1, ..., y_p]^T be linear combinations of these variables, y = A^T x. The task is to find the matrix A (or some columns of it).
Three distinct widely used derivations of PCA:
1 (Hotelling) Find an orthogonal A yielding y_i that have stationary values of their variance.
2 Find an orthogonal A yielding uncorrelated y_i.
3 (Pearson) Find (iteratively) columns that give the minimum squared error between the data and their (linear) projections.

26 Linear Feature Extraction: PCA Example
Wine data from the UCI repository: 13 chemical measurements for three types of wine.
Class labels are not used in PCA but are shown in the figure.

27 Linear Feature Extraction: PCA Derivation (Hotelling, Sketch)
Consider the first variable y_1 = Σ_j a_{1j} x_j. Choose a_1 = [a_{11}, ..., a_{1p}]^T to maximize the variance of y_1, subject to ||a_1|| = 1. The variance is a_1^T Σ a_1, where Σ is the covariance matrix of x.
It can be shown that a_1 must be the eigenvector of Σ with the maximal eigenvalue; a_2 is the eigenvector of Σ with the 2nd largest eigenvalue, a_3 the eigenvector with the 3rd largest eigenvalue, and so on.

28 Linear Feature Extraction: PCA - Notes
The sum of the variances of the k first PCs is Σ_{i=1}^{k} λ_i, where the λ_i are the (ordered) eigenvalues of the covariance matrix Σ. Thus, the percentage of variance accounted for by the k first PCs is (Σ_{i=1}^{k} λ_i) / (Σ_{i=1}^{p} λ_i).
To reduce the dimensionality, the data x^1, ..., x^n can be transformed as y^i = A_k^T x^i, where A_k = [a_1 ... a_k] is a p × k matrix.
Selection of a good k is not trivial.
The x^i are often de-meaned.
PCA ignores the class information.
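A compact NumPy sketch of PCA as described above: eigendecompose the sample covariance, keep the k leading eigenvectors as A_k, and report the fraction of variance they account for (illustrative; de-meaning the data and using np.cov are implementation choices):

```python
import numpy as np

def pca(X, k):
    """PCA by eigendecomposition of the sample covariance matrix (sketch).

    X: (n, p) data matrix. Returns the k leading principal directions A_k
    (p x k), the projected data Y = (X - mean) A_k, and the fraction of
    variance accounted for by the k first PCs.
    """
    Xc = X - X.mean(axis=0)                    # de-mean the data
    Sigma = np.cov(Xc, rowvar=False)           # p x p covariance matrix
    eigval, eigvec = np.linalg.eigh(Sigma)     # ascending eigenvalues
    order = np.argsort(eigval)[::-1]           # sort descending
    eigval, eigvec = eigval[order], eigvec[:, order]
    A_k = eigvec[:, :k]
    Y = Xc @ A_k
    explained = eigval[:k].sum() / eigval.sum()
    return A_k, Y, explained
```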

29 Linear Feature Extraction: Karhunen-Loeve Transform
In its most basic form, the Karhunen-Loeve transform is identical to PCA.
Interestingly, Kari Karhunen was a Finnish mathematician, and Emanuel Parzen and Leo Breiman were doctoral students of Michel Loeve.
In pattern recognition, the umbrella term Karhunen-Loeve expansion has a meaning similar to generalized PCA.
A summary of the various transforms under this umbrella term is given in Table 10.7.

30 Linear Feature Extraction: Kernel PCA
There are various other generalizations, such as kernel PCA.

31 Linear Feature Extraction: Independent Component Analysis (ICA)
ICA separates a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and that they are all statistically independent from each other.
Let x be the (random vector describing the) observations, let s be the vector of (unobservable) sources, and A the mixing matrix, so that x = As. ICA tries to find A (and thereby s) by maximizing the independence of the components of s.
Note the differences (and similarities) between ICA and PCA: in ICA, s is non-Gaussian, and ICA does not (directly) provide a ranking of the sources.

32 Linear Feature Extraction: ICA - Cocktail Party Problem
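A toy cocktail-party example using scikit-learn's FastICA (an assumed choice of ICA algorithm, not specified on the slide): two non-Gaussian sources are mixed with an unknown matrix A, and ICA recovers them up to scaling, sign, and ordering:

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two non-Gaussian sources mixed by an unknown matrix A.
rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
S = np.column_stack([np.sign(np.sin(3 * t)),   # square-ish wave
                     np.sin(5 * t) ** 3])      # non-Gaussian sine
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])                     # mixing matrix
X = S @ A.T                                    # observed mixtures x = A s

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)   # estimated sources (up to scale/order/sign)
A_hat = ica.mixing_            # estimated mixing matrix
```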

33 Multidimensional Scaling: Classical Scaling
Analysis of a (Euclidean) proximity matrix to produce a representation of the data points in a reduced-dimension space (the representation space).
Classical scaling: for n p-dimensional points x^1, ..., x^n, it is easy to compute the n × n matrix of Euclidean distances. Classical scaling solves the converse problem: given a distance matrix, how to determine the coordinates of the points (in dimension e)?
Based on the n × n dissimilarity matrix D, construct T such that T = X X^T, where X is the n × p matrix of the wanted coordinates. (This is easy.)
Solve for X: T is symmetric, so T = U Λ U^T, where U contains the eigenvectors of T and Λ is the diagonal eigenvalue matrix. Take X = U Λ^{1/2}.
For a reduced dimension, only part of the eigenvalues is used, in a similar way as in PCA.
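A NumPy sketch of classical scaling as described above: double-centre the squared distances to obtain T = X X^T, eigendecompose, and keep the e leading components (illustrative; clamping small negative eigenvalues to zero is an added safeguard for non-Euclidean inputs):

```python
import numpy as np

def classical_mds(D, e=2):
    """Classical (Torgerson) scaling from an n x n distance matrix D (sketch).

    Double-centre the squared distances to get T = X X^T, eigendecompose
    T = U Lambda U^T, and take X = U_e Lambda_e^(1/2) using the e largest
    eigenvalues.
    """
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centring matrix
    T = -0.5 * J @ (D ** 2) @ J                # double-centred Gram matrix
    eigval, eigvec = np.linalg.eigh(T)
    order = np.argsort(eigval)[::-1]
    eigval, eigvec = eigval[order][:e], eigvec[:, order][:, :e]
    return eigvec * np.sqrt(np.maximum(eigval, 0))   # n x e coordinates
```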

34 Multidimensional Scaling: Metric MDS and Ordinal Scaling
Metric MDS: optimize an objective function measuring the discrepancy between the given dissimilarities δ_ij and the derived distances d_ij in a lower, e-dimensional space. The d_ij depend on the coordinates of the samples we wish to find.
Ordinal scaling: find a configuration of points for which the rank ordering of the interpoint distances is close to the rank ordering of the given dissimilarities.
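Metric and ordinal (non-metric) MDS can be run, for example, with scikit-learn's MDS estimator on a precomputed dissimilarity matrix; the toy data here are an assumption:

```python
import numpy as np
from sklearn.manifold import MDS

# Dissimilarity matrix (here: Euclidean distances of some toy data).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

# Metric MDS minimizes the discrepancy (stress) between the given
# dissimilarities and the distances in the low-dimensional configuration;
# metric=False gives the ordinal variant based on rank order.
metric_mds = MDS(n_components=2, dissimilarity="precomputed", metric=True,
                 random_state=0)
Y_metric = metric_mds.fit_transform(D)

ordinal_mds = MDS(n_components=2, dissimilarity="precomputed", metric=False,
                  random_state=0)
Y_ordinal = ordinal_mds.fit_transform(D)
```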

35 Multidimensional Scaling: Example
From Snow, Kyllonen, Marshalek, 1984, adapted by Gray & Thompson, Nature Reviews Neuroscience, 2004.
The correlation of the specific tasks with the g-factor of intelligence: tasks correlating strongly with g lie in the centre, weakly correlating tasks at the periphery.

36 Multidimensional Scaling: Example 2

37 Multidimensional Scaling: Example 2 Continued
From statpics fi/2010_07_01_archive.html

38-41 Summary
1 Feature selection and extraction can be used to reduce the data dimensionality.
2 Feature selection selects a subset of the features. Embedded or wrapper methods are recommended for improved performance.
3 Implement the feature selection within the CV loop (or nested CV loops).
4 Linear feature extraction may be useful. PCA can be used to gain insights into the data.
