Information Driven Healthcare:


1 Information Driven Healthcare: Machine Learning course. Lecture: Feature selection I --- Concepts. Centre for Doctoral Training in Healthcare Innovation. Dr. Athanasios Tsanas (Thanasis): Wellcome Trust post-doctoral fellow, Institute of Biomedical Engineering (IBME); Affiliate researcher, Centre for Mathematical Biology, Mathematical Institute; Junior Research Fellow, Kellogg College; University of Oxford

2 The big picture: signal processing and statistical machine learning. Signal processing course: tools for extracting information (patterns) from the data (feature generation). This course: using the extracted information from the data. Statistical mapping: associating features with another measured quantity (the response), i.e. supervised learning. Goal: maximize the information available in the data to predict the response. Stages: feature generation, feature selection/transformation, statistical mapping.

3 Supervised learning setting. Design matrix X: N rows (samples, i.e. subjects) by M columns (features or characteristics: feature 1, feature 2, ..., feature M). Outcome y: one value per subject (e.g. type 1, type 2, type 1, type 5). y = f(X), where f: mechanism, X: feature set, y: outcome. Feature selection: which of the features 1 ... M in the design matrix X should we keep? Feature transformation: project the features onto a new lower-dimensional feature space.

4 Introduction to the problem. Many features M: curse of dimensionality. Too many features obstruct interpretability and are detrimental to the learning process.
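
One common way to illustrate the curse of dimensionality (a sketch added here, not taken from the slides): if data are spread uniformly in a unit hypercube with M features, a sub-cube that captures a given fraction of the data needs an edge of length fraction^(1/M). The function name `edge_length` and the 10% fraction are illustrative assumptions.

```python
def edge_length(fraction: float, n_features: int) -> float:
    """Edge of the sub-cube enclosing `fraction` of data spread
    uniformly in a unit hypercube with `n_features` dimensions."""
    return fraction ** (1.0 / n_features)


# As M grows, a "local" neighbourhood holding 10% of the data
# has to span almost the full range of every feature.
for m in (1, 2, 10, 100):
    print(f"M={m:>3}: edge needed to capture 10% of the data = {edge_length(0.1, m):.3f}")
```

With M = 10 the edge is already about 0.79 of each feature's range, so neighbourhoods stop being local and distance-based learners need far more samples.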

5 Solution to the problem. Reduce the initial feature space of M features to m features, with m < M (or m << M), by feature selection or feature transformation.

6 Main concepts: principle of parsimony; information content; statistical associations; computational constraints. We want to determine the most parsimonious feature subset with maximum joint information content.

7 Feature transformation. Construct a lower-dimensional space in which the new data points retain the distances between the data points in the original feature space. Different algorithms result depending on how we define the distance.
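
One way to write the distance-preserving idea down is the raw stress objective used in metric multidimensional scaling. This is a sketch under assumptions: the slides do not name a specific algorithm, and the symbols x_i (original points), z_i (projected points) and d (the chosen distance) are introduced here for illustration.

```latex
\min_{z_1,\dots,z_N \in \mathbb{R}^{m}} \;
\sum_{i<j} \Bigl( d(x_i, x_j) - \lVert z_i - z_j \rVert_2 \Bigr)^{2},
\qquad x_i \in \mathbb{R}^{M},\; m < M
```

Changing the definition of d(x_i, x_j) (Euclidean, geodesic, and so on) changes which structure of the original space is preserved.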

8 Feature transformation problems. Results are not easily interpretable. Does not save resources on data collection or data processing. Reliable transformation in high-dimensional spaces is problematic.

9 Feature selection. Discard features that do not contribute towards predicting the response.

10 Feature selection advantages. Interpretable. Retains domain expertise. Is often the only useful approach in practice (e.g. in micro-array data). Saves resources on data collection or data processing.

11 Feature selection approaches. Two approaches: wrappers (involve a learner) and filters (rely on the information content of the feature subset, e.g. using statistical tests).

12 Wrappers. Computationally intensive. Rely on incorporating a learner. Feature exportability problems (the selected subset may not transfer to a different learner). Wrappers may produce models with better predictive performance compared to filters.

13 Filters. Rely on basic concepts from statistics and information theory. Computationally fast. Learner performance comes later, and hence filters may generalize better than wrappers.

14 Filter concept: relevance. Maximum relevance between the features (F) and the response (y). (Figure: features F1, F2, F3 and response y.) Which features would you choose? In which order?
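
A minimal sketch of relevance ranking, assuming the absolute Pearson correlation as the relevance metric (one of several possible metrics). The toy values for F1, F2, F3 and y are invented for illustration and are not the data behind the slide's figure.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length numeric lists.
    Assumes neither list is constant (non-zero variance)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def rank_by_relevance(features, y):
    """Feature names ordered from most to least relevant to y."""
    scores = {name: abs(pearson(col, y)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Invented toy data (hypothetical values).
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
features = {
    "F1": [1.1, 1.9, 3.2, 3.9, 5.1, 6.0],  # tracks y closely
    "F2": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],  # tracks y loosely
    "F3": [3.0, 1.0, 2.0, 2.0, 1.0, 3.0],  # uncorrelated with y
}
print(rank_by_relevance(features, y))
```

Ranking by relevance alone ignores how the features relate to each other, which is where redundancy comes in.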

15 Filter concept: redundancy. Minimum redundancy amongst the features in the subset. (Figure: features F1, F2, F3, F4.) Which features would you choose? In which order?
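
A minimal sketch of redundancy removal, assuming absolute Pearson correlation between features as the redundancy metric. The 0.9 threshold, the function name `drop_redundant` and the toy values are hypothetical choices for illustration.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length numeric lists.
    Assumes neither list is constant (non-zero variance)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def drop_redundant(features, threshold=0.9):
    """Keep a feature only if it is not strongly correlated
    with any feature that has already been kept."""
    kept = []
    for name, col in features.items():
        if all(abs(pearson(col, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Invented toy data (hypothetical values).
features = {
    "F1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "F2": [2.0, 4.0, 6.0, 8.0, 10.0, 12.5],  # almost a rescaled copy of F1
    "F3": [3.0, 1.0, 2.0, 2.0, 1.0, 3.0],
    "F4": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
}
print(drop_redundant(features))
```

The result depends on the order in which features are visited; visiting them in decreasing order of relevance is one reasonable choice.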

16 Filter concept: complementarity Conditional relevance (feature interaction or complementarity)
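
A classic illustration of complementarity is the XOR problem, sketched here with a plug-in (histogram) estimate of mutual information for discrete variables. The data and helper name are illustrative assumptions: each feature alone carries no information about y, but the pair determines it completely.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Plug-in estimate of I(x; y) in bits for two discrete sequences."""
    n = len(x)
    joint = Counter(zip(x, y))
    count_x, count_y = Counter(x), Counter(y)
    return sum(
        (c / n) * log2(c * n / (count_x[a] * count_y[b]))
        for (a, b), c in joint.items()
    )


f1 = [0, 0, 1, 1] * 2
f2 = [0, 1, 0, 1] * 2
y = [a ^ b for a, b in zip(f1, f2)]  # y = f1 XOR f2

print(mutual_information(f1, y))                 # 0.0 bits: f1 alone looks irrelevant
print(mutual_information(f2, y))                 # 0.0 bits: so does f2
print(mutual_information(list(zip(f1, f2)), y))  # 1.0 bit: jointly they determine y
```

A filter that scores features one at a time would discard both f1 and f2 here, which is why conditional relevance matters.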

17 Formalizing these concepts. How to express relevance and redundancy (i.e. which are the appropriate metrics)? Metrics include: correlation coefficients, mutual information, statistical tests, p-values, information gain. How to compromise between relevance and redundancy? Process? (forward selection vs. backward elimination)

18 To be continued In the following lecture we will look at specific algorithms!

19 Usual steps in forward selection
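
The steps themselves did not survive in this transcript. As a stand-in, here is a generic greedy forward selection sketch, not necessarily the procedure shown on the slide: start from the empty subset, add the feature that most improves a score, and stop when nothing improves. The score used here, leave-one-out accuracy of a 1-nearest-neighbour classifier, makes this a wrapper; the function names and toy data are assumptions.

```python
def loo_1nn_accuracy(X, y, subset):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier
    that only looks at the columns listed in `subset`."""
    correct = 0
    for i, xi in enumerate(X):
        best_j, best_d = None, float("inf")
        for j, xj in enumerate(X):
            if j == i:
                continue
            d = sum((xi[k] - xj[k]) ** 2 for k in subset)
            if d < best_d:
                best_j, best_d = j, d
        correct += y[best_j] == y[i]
    return correct / len(X)


def forward_selection(X, y, score, max_features):
    """Greedily add the feature that gives the largest score;
    stop when no candidate strictly improves on the current best."""
    selected, best_score = [], float("-inf")
    remaining = list(range(len(X[0])))
    while remaining and len(selected) < max_features:
        scores = {k: score(X, y, selected + [k]) for k in remaining}
        best_k = max(scores, key=scores.get)
        if scores[best_k] <= best_score:
            break
        selected.append(best_k)
        remaining.remove(best_k)
        best_score = scores[best_k]
    return selected, best_score


# Invented toy data: column 0 separates the classes, columns 1 and 2 are noise.
X = [
    [0.0, 5.0, 1.0],
    [0.1, 1.0, 9.0],
    [0.2, 8.0, 4.0],
    [1.0, 2.0, 8.0],
    [1.1, 7.0, 2.0],
    [1.2, 4.0, 5.0],
]
y = [0, 0, 0, 1, 1, 1]
selected, best = forward_selection(X, y, loo_1nn_accuracy, max_features=3)
print(selected, best)
```

Passing a filter criterion (e.g. a relevance-minus-redundancy score) as `score` instead of a learner's accuracy turns the same loop into a filter.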

20 LASSO. Start with classical ordinary least squares regression. Add an L1 penalty: sparsity promoting, some coefficients become 0.
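
In the usual textbook notation (a sketch; the coefficient vector β and the penalty weight λ are symbols introduced here, and the intercept is omitted for brevity), ordinary least squares and LASSO differ only in the L1 term:

```latex
\hat{\beta}^{\mathrm{OLS}}
  = \arg\min_{\beta} \sum_{i=1}^{N} \Bigl( y_i - \sum_{j=1}^{M} X_{ij}\,\beta_j \Bigr)^{2}

\hat{\beta}^{\mathrm{LASSO}}
  = \arg\min_{\beta} \sum_{i=1}^{N} \Bigl( y_i - \sum_{j=1}^{M} X_{ij}\,\beta_j \Bigr)^{2}
    + \lambda \sum_{j=1}^{M} \lvert \beta_j \rvert ,
  \qquad \lambda \ge 0
```

Features whose coefficient is driven exactly to 0 are effectively discarded; larger λ yields sparser solutions, and λ = 0 recovers ordinary least squares.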

21 RELIEF. Feature weighting algorithm. Concept: work with nearest neighbours, namely the nearest hit (NH) and the nearest miss (NM). Great for datasets with interactions, but does not account for information redundancy.
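
A simplified sketch of the Relief idea for two classes and numeric features. It is not the Matlab implementation referenced in the course: for determinism it visits every sample instead of drawing random ones, and it uses range-scaled absolute differences with a Manhattan distance. A feature gains weight when it differs at the nearest miss and loses weight when it differs at the nearest hit.

```python
def relief_weights(X, y):
    """Relief-style feature weights (simplified variant, see lead-in)."""
    n, m = len(X), len(X[0])
    # Per-feature range, used to scale differences to [0, 1].
    spans = [(max(r[k] for r in X) - min(r[k] for r in X)) or 1.0 for k in range(m)]

    def diff(k, a, b):
        return abs(a[k] - b[k]) / spans[k]

    def distance(a, b):
        return sum(diff(k, a, b) for k in range(m))

    weights = [0.0] * m
    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        nh = min(hits, key=lambda j: distance(X[i], X[j]))    # nearest hit
        nm = min(misses, key=lambda j: distance(X[i], X[j]))  # nearest miss
        for k in range(m):
            weights[k] += (diff(k, X[i], X[nm]) - diff(k, X[i], X[nh])) / n
    return weights


# Invented toy data: feature 0 separates the classes, feature 1 does not.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1], [0.9, 0.2], [1.0, 0.8], [0.8, 0.4]]
y = [0, 0, 0, 1, 1, 1]
w = relief_weights(X, y)
print(w)
```

Because each feature is weighted on its own contribution around the neighbours, two copies of the same informative feature would both receive high weights, which is the redundancy blind spot noted on the slide.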

22 mRMR: minimum Redundancy Maximum Relevance. Trades off relevance and redundancy. Does not account for interactions and non-pairwise redundancy. Generally works very well.
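
A minimal sketch of the greedy mRMR idea for discrete features, assuming the difference form of the criterion: at each step pick the feature with the largest mutual information with y minus its mean mutual information with the features already selected. Function names and toy data are illustrative assumptions, and mutual information is a plug-in histogram estimate.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Plug-in estimate of I(x; y) in bits for two discrete sequences."""
    n = len(x)
    joint = Counter(zip(x, y))
    count_x, count_y = Counter(x), Counter(y)
    return sum(
        (c / n) * log2(c * n / (count_x[a] * count_y[b]))
        for (a, b), c in joint.items()
    )


def mrmr(features, y, n_select):
    """Greedy selection by relevance minus mean pairwise redundancy."""
    relevance = {name: mutual_information(col, y) for name, col in features.items()}
    selected = []

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while len(selected) < min(n_select, len(features)):
        candidates = [name for name in features if name not in selected]
        selected.append(max(candidates, key=score))
    return selected


# Invented toy data (hypothetical values).
y = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "F1": [0, 0, 0, 0, 1, 1, 1, 0],  # strongly relevant
    "F2": [0, 0, 0, 1, 1, 1, 1, 0],  # relevant, but largely a copy of F1
    "F3": [0, 1, 0, 1, 0, 1, 1, 1],  # weakly relevant, nearly independent of F1
}
print(mrmr(features, y, n_select=2))
```

Ranking by relevance alone would pick F1 then F2; penalizing redundancy prefers F3 as the second feature. Only pairwise terms are used, so joint effects such as XOR-type interactions are not captured.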

23 Comparing feature selection algorithms. (1) Selecting the true feature subset (i.e. discarding features which are known to be noise): possible only for artificial datasets. (2) Maximizing the out-of-sample prediction performance: a proxy for assessing feature selection algorithms; adds an additional layer, the learner; beware of feature exportability (different learners may give different results); BUT in practice this is really what is of most interest!

24 Matlab code LASSO: RELIEF: mRMR: Be careful: the latter implementation relies on discrete features and computes densities using histograms. For continuous features you would need another density estimator (e.g. kernel density estimation). UCI ML repository

25 Conclusions. Multi-faceted problem, fertile field for research. No free lunch theorem (no universally best algorithm). Trade-offs: algorithmic (relevance, redundancy, complementarity); computational (wrappers are costly but often give better results); comprehensive search of the feature space, e.g. genetic algorithms (very costly). Reducing the number of features may improve prediction performance and always improves interpretability.
