Information theory methods for feature selection

Information theory methods for feature selection
Zuzana Reitermanová
Department of Computer Science, Faculty of Mathematics and Physics, Charles University in Prague, Czech Republic
Diploma and Doctoral Seminar I, 11 November 2010

Outline
1 Introduction
  Feature extraction
2 Basic approaches
  Filter methods
  Wrapper methods
  Embedded methods
  Ensemble learning
  NIPS 2003 Challenge results
3 Conclusion
  References

Introduction: Feature extraction
An integral part of the data mining process. Two steps: feature construction and feature selection.
Feature construction:
- Preprocessing techniques: standardization, normalization, discretization, ...
- Part of the model (ANN), ...
- Extraction of local features, signal enhancement, ...
- Space-embedding methods: PCA, MDS (multidimensional scaling), ...
- Non-linear expansions
- ...

Why employ feature selection techniques?
- To select relevant and informative features.
- To select features that are useful for building a good predictor.
Moreover:
- General data reduction: decrease storage requirements and increase algorithm speed.
- Feature set reduction: save resources in the next round of data collection or during utilization.
- Performance improvement: increase predictive accuracy.
- Better data understanding.
- ...
Advantage: the selected features retain their original meanings.

Current challenges in feature selection:
- Unlabeled data
- Knowledge-oriented sparse learning
- Detection of feature dependencies / interactions
- Data sets with a huge number of features (100 to 1,000,000) but relatively few instances (around 1,000): microarrays, transaction logs, Web data, ...
NIPS 2003 challenge:

Basic approaches to feature selection
- Filter models: select features without optimizing the performance of a predictor. Feature ranking methods provide a complete order of the features using a relevance index.
- Wrapper models: use a predictor as a black box to score feature subsets.
- Embedded models: feature selection is a part of the model training.
- Hybrid approaches

Filter methods: Feature ranking
Feature ranking methods provide a complete order of the features using a relevance index. Each feature is treated separately. Many different relevance indices exist:
- Correlation coefficients (linear dependencies), e.g. Pearson: $R(i) = \frac{\mathrm{cov}(X_i, Y)}{\sqrt{\mathrm{var}(X_i)\,\mathrm{var}(Y)}}$, estimated as $R(i) = \frac{\sum_k (x_{k,i} - \bar{x}_i)(y_k - \bar{y})}{\sqrt{\sum_k (x_{k,i} - \bar{x}_i)^2 \sum_k (y_k - \bar{y})^2}}$ (a short sketch follows below)
- Classical test statistics: t-test, F-test, $\chi^2$-test, ...
- Single-variable predictors (for example decision trees); risk of overfitting
- Information-theoretic ranking criteria: non-linear dependencies
- ...
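
To make the correlation-based ranking concrete, here is a minimal sketch (Python with NumPy; not part of the original slides) that ranks features by the absolute value of the Pearson estimate above. The data and names are purely illustrative.

```python
# Minimal sketch: rank features by |Pearson correlation| between each column
# X[:, i] and the target y (data and names are illustrative).
import numpy as np

def correlation_ranking(X, y):
    """Return feature indices sorted by |R(i)| (descending) and the scores."""
    Xc = X - X.mean(axis=0)                  # centre each feature
    yc = y - y.mean()                        # centre the target
    num = Xc.T @ yc                          # sum_k (x_ki - mean_i)(y_k - mean_y)
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    r = num / den                            # Pearson estimate R(i) per feature
    return np.argsort(-np.abs(r)), r

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 2.0 * X[:, 3] + rng.normal(scale=0.1, size=100)
order, scores = correlation_ranking(X, y)
print(order[0])    # expected: 3, the only informative feature
```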

Filter methods: Relevance measures based on information theory
Mutual information (Shannon):
- Entropy: $H(X) = -\int_x p(x)\log_2 p(x)\,dx$
- Conditional entropy: $H(Y|X) = -\int_x p(x)\left(\sum_y p(y|x)\log_2 p(y|x)\right)dx$
- Mutual information: $MI(Y,X) = H(Y) - H(Y|X) = \int_x\int_y p(x,y)\log_2\frac{p(x,y)}{p(x)p(y)}\,dx\,dy$
Is MI a Bayes-optimal criterion for classification? The conditional entropy bounds the Bayes error:
$\frac{H(Y|X) - 1}{\log_2 K} \le e_{\mathrm{bayes}}(X) \le \frac{1}{2}\,H(Y|X)$
Kullback-Leibler divergence: $MI(X,Y) = D_{KL}\big(p(x,y)\,\|\,p(x)p(y)\big)$, where $D_{KL}(p_1\|p_2) = \int_x p_1(x)\log_2\frac{p_1(x)}{p_2(x)}\,dx$

Filter methods: Relevance measures based on information theory
Mutual information: $MI(Y,X) = H(Y) - H(Y|X) = \int_x\int_y p(x,y)\log_2\frac{p(x,y)}{p(x)p(y)}\,dx\,dy$
Problem: $p(x)$, $p(y)$, $p(x,y)$ are unknown and hard to estimate from the data.
Classification with nominal or discrete features: the simplest case.
- We can estimate the probabilities from the frequency counts.
- This introduces a negative bias.
- The estimate becomes harder with larger numbers of classes and feature values.
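
A minimal sketch of the plug-in estimate just described: MI from frequency counts for one nominal feature and a class label. The toy data is illustrative, and the estimator inherits the finite-sample bias discussed on the slide.

```python
# Minimal sketch: plug-in (frequency-count) estimate of MI(Y, X) in bits
# for a nominal feature x and class labels y.
import numpy as np
from collections import Counter

def mutual_information(x, y):
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (xv, yv), c in pxy.items():
        # (c/n) * log2( (c/n) / ((px/n)*(py/n)) ), simplified to counts
        mi += (c / n) * np.log2(c * n / (px[xv] * py[yv]))
    return mi

x = ['a', 'a', 'b', 'b', 'a', 'b']
y = [ 0,   0,   1,   1,   0,   1 ]
print(mutual_information(x, y))   # 1.0 bit: x determines y exactly here
```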

Filter methods: Relevance measures based on information theory
Mutual information: $MI(Y,X) = H(Y) - H(Y|X) = \int_x\int_y p(x,y)\log_2\frac{p(x,y)}{p(x)p(y)}\,dx\,dy$
Problem: $p(x)$, $p(y)$, $p(x,y)$ are unknown and hard to estimate from the data.
Classification with nominal or discrete features:
- MI corresponds to the Information Gain (IG) used for decision trees.
- Many modifications of IG avoid the bias towards multi-valued features: Information Gain Ratio $IGR(Y,X) = \frac{MI(Y,X)}{H(X)}$, Gini index, J-measure, ...
- Relaxed entropy measures are more straightforward to estimate: Rényi entropy $H_\alpha(X) = \frac{1}{1-\alpha}\log_2\left(\int_x p(x)^\alpha\,dx\right)$; Parzen-window approach.

Filter methods: Relevance measures based on information theory
Mutual information: $MI(Y,X) = H(Y) - H(Y|X) = \int_x\int_y p(x,y)\log_2\frac{p(x,y)}{p(x)p(y)}\,dx\,dy$
Problem: $p(x)$, $p(y)$, $p(x,y)$ are unknown and hard to estimate from the data.
Regression with continuous features: the hardest case. Possible solutions:
- Histogram-based discretization: MI is overestimated depending on the quantization level; the overestimation should be roughly the same for all features.
- Approximation of the densities (Parzen window, ...)
- Normal distribution assumption: MI reduces to the correlation coefficient.
- Computational complexity is an issue.
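
One illustrative way to apply the histogram-based workaround from this slide: discretize with the same number of equal-width bins for every feature, so the overestimation stays roughly comparable across features and the ranking remains meaningful. This is a sketch, not the author's code.

```python
# Minimal sketch: histogram-based MI estimate for continuous x and y.
import numpy as np

def mi_histogram(x, y, bins=10):
    """Plug-in MI (bits) after equal-width binning of x and y."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)      # marginal of x
    py = pxy.sum(axis=0, keepdims=True)      # marginal of y
    nz = pxy > 0                             # skip empty cells (log 0)
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(1)
x = rng.normal(size=2000)
print(mi_histogram(x, 0.8 * x + rng.normal(scale=0.5, size=2000)))  # clearly > 0
print(mi_histogram(x, rng.normal(size=2000)))                       # close to 0
```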

Filter methods: Feature ranking
Advantages:
- Simple and cheap methods with good empirical results.
- Fast and effective even when the number of samples is smaller than the number of features.
- Can be used as preprocessing for more sophisticated methods.

Filter methods: Feature ranking
Limitations:
- Which relevance index is the best?
- May select a redundant subset of features.
- A variable that is individually relevant may not be useful because of redundancies.
- A variable that is useless by itself can be useful together with others.

Filter methods: Mutual information for multivariate feature selection
How to exclude both irrelevant and redundant features? Greedy selection of variables may not work well when there are dependencies among relevant variables. The multivariate filter $MI(Y, \{X_1, \ldots, X_n\})$ is hard to approximate and compute; the approximative MIFS algorithm and its variants address this (a sketch follows below).
MIFS algorithm
1. $X^* = \arg\max_{X \in A} MI(X, Y)$; $F \leftarrow \{X^*\}$; $A \leftarrow A \setminus \{X^*\}$
2. Repeat until $F$ has the desired size: $X^* = \arg\max_{X \in A}\left[ MI(X, Y) - \beta \sum_{X' \in F} MI(X, X') \right]$; $F \leftarrow F \cup \{X^*\}$; $A \leftarrow A \setminus \{X^*\}$
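
The sketch below implements the MIFS loop above for discrete features, using scikit-learn's mutual_info_score (natural-log units) as the MI estimator; beta and the toy data are illustrative choices, not values from the slides.

```python
# Minimal sketch of the MIFS greedy loop (relevance minus beta * redundancy).
import numpy as np
from sklearn.metrics import mutual_info_score

def mifs(X, y, k, beta=1.0):
    """Select k columns of the discrete feature matrix X via MIFS."""
    remaining, selected = list(range(X.shape[1])), []
    while len(selected) < k and remaining:
        def score(j):
            relevance = mutual_info_score(X[:, j], y)
            redundancy = sum(mutual_info_score(X[:, j], X[:, s]) for s in selected)
            return relevance - beta * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(2)
X = rng.integers(0, 3, size=(300, 6))
X[:, 1] = X[:, 0]          # feature 1 duplicates feature 0
y = X[:, 0]                # the class equals feature 0
print(mifs(X, y, k=3))     # 0 is picked first; the penalty cancels feature 1's relevance
```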

Filter methods: Multivariate relevance criteria
Relief algorithms
- Based on the k-nearest-neighbor algorithm.
- Relevance of features is assessed in the context of the others.
- Example of the ranking index (for multi-class classification), with neighbors found in the original feature space: $R(X) = \frac{\sum_i \sum_{k=1}^{K} |x_i^X - x_{M_k(i)}^X|}{\sum_i \sum_{k=1}^{K} |x_i^X - x_{H_k(i)}^X|}$, where the superscript $X$ denotes the value of feature $X$, $x_{M_k(i)}$, $k = 1, \ldots, K$ are the $K$ closest examples of a different class (nearest misses), and $x_{H_k(i)}$, $k = 1, \ldots, K$ are the $K$ closest examples of the same class (nearest hits).
- Popular algorithm with low bias (NIPS 2003).
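
A simplified, illustrative reading of the index above: neighbors are found once in the full feature space, and per-feature absolute differences to the K nearest misses and hits are accumulated.

```python
# Minimal sketch of a Relief-style ranking index (larger = more relevant).
import numpy as np

def relief_scores(X, y, K=3):
    n, d = X.shape
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)           # never pick the point itself
    miss_sum, hit_sum = np.zeros(d), np.zeros(d)
    for i in range(n):
        same = np.where(y == y[i])[0]
        diff = np.where(y != y[i])[0]
        hits = same[np.argsort(dist[i, same])[:K]]    # nearest same-class examples
        misses = diff[np.argsort(dist[i, diff])[:K]]  # nearest other-class examples
        miss_sum += np.abs(X[i] - X[misses]).sum(axis=0)
        hit_sum += np.abs(X[i] - X[hits]).sum(axis=0)
    return miss_sum / hit_sum

rng = np.random.default_rng(3)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 4))
X[:, 2] += 3 * y                        # only feature 2 carries class information
print(np.argmax(relief_scores(X, y)))   # expected: 2
```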

Wrapper methods
Multivariate feature selection: maximize the relevance $R(Y, X)$ of a subset of features $X$.
- Use a predictor to measure the relevance (i.e. its accuracy).
- A validation scheme must be used to obtain a useful estimate: k-fold cross-validation, ..., or an accuracy estimate on a separate test set.
- Employ a search strategy: exhaustive search, sequential search (growing/pruning), stochastic search (simulated annealing, GA, ...). A sketch of greedy forward selection follows below.
Limitations:
- Slower than the filter methods.
- Tendency to overfit: discrepancy between the evaluation score and the ultimate performance.
- No convincingly good empirical results (NIPS 2003).
- High variance of the results.
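
A minimal wrapper sketch, as referenced above: greedy forward selection scored by cross-validated accuracy of a black-box predictor (an SVM here, one illustrative choice among many).

```python
# Minimal sketch of a wrapper: forward selection with cross-validated scoring.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def forward_selection(X, y, k, cv=5):
    remaining, selected = list(range(X.shape[1])), []
    while len(selected) < k and remaining:
        scores = {j: cross_val_score(SVC(), X[:, selected + [j]], y, cv=cv).mean()
                  for j in remaining}
        best = max(scores, key=scores.get)   # feature whose addition scores best
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(4)
X = rng.normal(size=(150, 8))
y = (X[:, 1] + X[:, 5] > 0).astype(int)
print(forward_selection(X, y, k=2))   # expected: features 1 and 5, in some order
```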

Embedded methods
Feature selection depends on the predictive model (SVM, ANN, DT, ...) and is a part of the model training.
- Forward selection methods
- Backward elimination methods
- Nested methods
- Optimization of scaling factors over the compact interval $[0,1]^n$
- Regularization techniques (a sketch follows below)
Advantages and limitations:
- Slower than the filter methods.
- Tendency to overfit if not enough data is available.
- Outperform filter methods if enough data is available.
- High variance of the results.
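
As one concrete regularization example (a common embedded method, not specific to these slides): L1-regularized logistic regression performs selection inside training, because the penalty drives some coefficients exactly to zero. The data and the value of C are illustrative.

```python
# Minimal sketch: embedded selection via an L1 penalty.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 10))
y = (X[:, 0] - 2 * X[:, 7] > 0).astype(int)

model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
model.fit(X, y)
selected = np.flatnonzero(model.coef_[0])   # features with non-zero weights
print(selected)                             # expected to include 0 and 7
```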

Ensemble learning
Ensemble learning helps the model-based (wrapper and embedded) methods that rely on fast, greedy and unstable base learners (decision trees, neural networks, ...).
Robust variable selection:
- Improves feature set stability.
- Improved stability also improves generalization.
Parallel ensembles: variance reduction; bagging, random forests, ...
Serial ensembles: reduction of both bias and variance; boosting, gradient tree boosting, ...

Ensemble learning: Random forests for variable selection
Random forest (RF):
- Select a number $n \le N$, where $N$ is the number of variables.
- Each decision tree is trained on a bootstrap sample (about two thirds of the training set).
- Each decision tree has maximal depth and is not pruned.
- At each node, $n$ variables are randomly chosen and the best split over these variables is used (CART algorithm).
- Grow trees until there is no more generalization improvement.

Ensemble learning: Random forests for variable selection
Variable importance measure for RF:
- Compute an importance index for each variable and each tree: $M(X_i, T_j) = \sum_{t \in T_j} \Delta I_G(x_i, t)$, where $\Delta I_G(x_i, t)$ is the decrease of impurity due to an actual (or potential) split on variable $x_i$: $\Delta I_G(x_i, t) = I(t) - p_L I(t_L) - p_R I(t_R)$.
  - Impurity for regression: $I(t) = \frac{1}{N(t)} \sum_{s \in t} (y_s - \bar{y})^2$
  - Impurity for classification: $I(t) = \mathrm{Gini}(t) = \sum_{y_i \ne y_j} p_i^t p_j^t$
- Compute the average importance of each variable over all trees: $M(x_i) = \frac{1}{N_T} \sum_{j=1}^{N_T} M(x_i, T_j)$
- The optimal number of features is selected by trying cut-off points. A sketch using scikit-learn follows below.
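
A sketch of the impurity-based importance just described, using scikit-learn's random forest, which already averages the per-tree mean decrease in impurity; choosing a cut-off on the ranking is left to the user. Data and parameters are illustrative.

```python
# Minimal sketch: mean-decrease-in-impurity variable importance from a random forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(6)
X = rng.normal(size=(400, 12))
y = (X[:, 3] + 0.5 * X[:, 9] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
forest.fit(X, y)
ranking = np.argsort(-forest.feature_importances_)   # most important first
print(ranking[:2])                                   # expected: features 3 and 9
```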

Ensemble learning: Random forests for variable selection
Advantages:
- Avoid over-fitting when there are more features than examples.
- More stable results.

NIPS 2003 Challenge results
- Top-ranking challengers used a combination of filters and embedded methods.
- Very good results for methods using only filters, even simple correlation coefficients.
- Search strategies were generally unsophisticated.
- The winner was a combination of Bayesian neural networks and Dirichlet diffusion trees.
- Ensemble methods (random trees) were in second and third place.

NIPS 2003 Challenge results
Other (surprising) results:
- Some of the top-ranking challengers used almost all of the probe features.
- Very good results for methods using only filters, even simple correlation coefficients.
- Non-linear classifiers outperformed the linear classifiers; they didn't overfit.
- The hyper-parameters are important: several groups using the same classifier (e.g. SVM) reported significantly different results.

Conclusion
- Many different approaches to feature selection.
- Best results are obtained by hybrid methods.
- Advancing research: knowledge-based feature extraction, unsupervised feature extraction, ...

References
- Guyon, I. M., Gunn, S. R., Nikravesh, M., and Zadeh, L., eds., Feature Extraction: Foundations and Applications. Springer, 2006.
- Huan Liu, Hiroshi Motoda, Rudy Setiono, Zheng Zhao, Feature Selection: An Ever Evolving Frontier in Data Mining, JMLR: Workshop and Conference Proceedings, volume 10, pages 4-13, 2010.
- Isabelle Guyon, André Elisseeff, An Introduction to Variable and Feature Selection, Journal of Machine Learning Research, volume 3, pages 1157-1182, 2003.
- Journal of Machine Learning Research, http://jmlr.csail.mit.edu/

References (continued)
- R. Battiti, Using mutual information for selecting features in supervised neural net learning, IEEE Transactions on Neural Networks, volume 5(4), pages 537-550, July 1994.
- Kari Torkkola, Feature extraction by non-parametric mutual information maximization, Journal of Machine Learning Research, volume 3, pages 1415-1438, 2003.
- François Fleuret, Fast Binary Feature Selection with Conditional Mutual Information, Journal of Machine Learning Research, volume 5, pages 1531-1555, 2004.
- Alexander Kraskov, Harald Stögbauer, Peter Grassberger, Estimating mutual information, Physical Review E, volume 69, issue 6, 2004.