Feature Selection in Knowledge Discovery

Susana Vieira, Technical University of Lisbon, Instituto Superior Técnico, Department of Mechanical Engineering, Center of Intelligent Systems, IDMEC-LAETA, Av. Rovisco Pais 1, 1049-001 Lisboa, Portugal. E-mail: susana.vieira@ist.utl.pt. November 2010.

Knowledge discovery process: data acquisition -> data -> target data -> preprocessing -> preprocessed data -> feature selection -> reduced data -> modeling -> patterns -> interpretation -> knowledge. Based on U. Fayyad, G. Piatetsky-Shapiro and P. Smyth, "From data mining to knowledge discovery in databases", AI Magazine, 17(3):37-54, 1996.

Outline: motivation (why feature selection), basic definitions, ranking methods, feature subset selection, optimization methods (tree search, ant feature selection).

Why feature selection? Why even think about feature selection, when the information about the target class is inherent in the variables? The naive theoretical view is that more features mean more information and therefore more discrimination power. In practice there are many reasons why this is not the case. Also: optimization is (usually) good, so why not try to optimize the input coding?

Practical problems. Many explored domains have hundreds to tens of thousands of variables/features, many of them irrelevant and redundant. In domains with many features the underlying probability distribution can be very complex and very hard to estimate (e.g. dependencies between variables). Irrelevant and redundant features can confuse learners. Limited training data, limited computational resources and the curse of dimensionality make things worse.

The required number of samples (to achieve the same accuracy) grows exponentially with the number of variables, while in practice the number of training examples is fixed, so the classifier's performance usually degrades for a large number of features. In many cases the information that is lost by discarding variables is made up for by a more accurate mapping/sampling in the lower-dimensional space.

Real-world example: gene selection from microarray data. Variables: gene expression coefficients corresponding to the amount of mRNA in a patient's sample (e.g. tissue biopsy). Task: separate healthy patients from cancer patients. Usually there are only about 100 examples (patients) available for training and testing, while the number of variables in the raw data is 6,000 to 60,000. Does this work? See [8].
[8] C. Ambroise, G. J. McLachlan: Selection bias in gene extraction on the basis of microarray gene-expression data. PNAS 99(10):6562-6566, 2002.

What is feature selection? Remove features X(i) to improve (or at least not degrade) the prediction of Y. Advantages: feature selection identifies the most relevant features; fewer features and less data need to be collected and processed; less complex models run faster; models are easier to understand, verify and explain.

Feature selection: definition. Given a set of features $F = \{f_1, \dots, f_i, \dots, f_n\}$, the feature selection problem is to find a subset $F' \subseteq F$ that maximizes the learner's ability to classify patterns. Formally, $F'$ should maximize some scoring function $\Phi : \Gamma \to \mathbb{R}$ (where $\Gamma$ is the space of all possible feature subsets of $F$), i.e.
$F' = \arg\max_{G \in \Gamma} \Phi(G)$.

Feature extraction: definition. Given a set of features $F = \{f_1, \dots, f_i, \dots, f_n\}$, the feature extraction (construction) problem is to map $F$ to some feature set $F''$ that maximizes the learner's ability to classify patterns, again $F'' = \arg\max_{G \in \Gamma^*} \Phi(G)$, where $\Gamma^*$ is the set of all possible feature sets. This general definition subsumes feature selection: a feature selection algorithm also performs a mapping, but can only map to subsets of the input variables.
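
The argmax definition can be made concrete with a brute-force search, which also shows why it does not scale: the loop below visits every subset of a deliberately small feature pool. A minimal sketch, assuming scikit-learn is available and using cross-validated k-NN accuracy as an illustrative choice of the scoring function $\Phi$; the helper name score_subset is made up for this example.

```python
# Exhaustive subset search, F' = argmax_G Phi(G), over a small feature pool
# (2^8 = 256 subsets); with all n features this enumeration is intractable.
from itertools import combinations

import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
X = X[:, :8]                      # keep only 8 features so enumeration stays cheap
n = X.shape[1]

def score_subset(cols):
    """Phi(G): mean cross-validated accuracy of a k-NN classifier on columns G."""
    clf = KNeighborsClassifier(n_neighbors=5)
    return cross_val_score(clf, X[:, cols], y, cv=5).mean()

best_subset, best_score = None, -np.inf
for k in range(1, n + 1):
    for cols in combinations(range(n), k):
        s = score_subset(list(cols))
        if s > best_score:
            best_subset, best_score = list(cols), s

print("best subset:", best_subset, "score:", round(best_score, 3))
```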

Feature selection:
$F = \{f_1, \dots, f_i, \dots, f_n\} \xrightarrow{\text{f. selection}} F' = \{f_{i_1}, \dots, f_{i_j}, \dots, f_{i_m}\}$,
with $i_j \in \{1, \dots, n\}$, $j = 1, \dots, m$, and $i_a \neq i_b$ for $a, b \in \{1, \dots, m\}$, $a \neq b$.

Feature extraction/creation:
$F = \{f_1, \dots, f_i, \dots, f_n\} \xrightarrow{\text{f. extraction}} F'' = \{g_1(f_1, \dots, f_n), \dots, g_j(f_1, \dots, f_n), \dots, g_m(f_1, \dots, f_n)\}$.

Feature selection optimality. In theory the goal is to find an optimal feature subset (one that maximizes the scoring function). In real-world applications this is usually not possible: for most problems it is computationally intractable to search the whole space of possible feature subsets, so one usually has to settle for approximations of the optimal subset. Most of the research in this area is devoted to finding efficient search heuristics.
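
To make the distinction concrete, the short sketch below (assuming scikit-learn; the column indices are arbitrary) contrasts the two mappings: selection keeps original columns $f_i$, extraction builds new features $g_j(f_1, \dots, f_n)$, here linear combinations produced by PCA.

```python
# Feature selection keeps a subset of the original columns; feature extraction
# builds new features g_j(f_1, ..., f_n). A PCA projection is one common g.
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA

X, _ = load_wine(return_X_y=True)

# Selection: F' is literally a set of original columns (indices are arbitrary).
selected_idx = [0, 6, 9]
X_selected = X[:, selected_idx]         # shape (n_samples, 3)

# Extraction: each new feature is a linear combination of all 13 inputs.
pca = PCA(n_components=3)
X_extracted = pca.fit_transform(X)      # shape (n_samples, 3)

print(X_selected.shape, X_extracted.shape)
```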

Relevance of features: relevance vs optimality of a feature set. Classifiers induced from training data are likely to be suboptimal (there is no access to the real distribution of the data). Relevance does not imply that the feature is in the optimal feature subset, and even irrelevant features can improve a classifier's performance. Defining relevance in terms of a given classifier (and therefore a hypothesis space) would be better.

Feature selection approaches:
Filters: based on general characteristics of the data to be evaluated; no model is involved.
Wrappers: use model performance to evaluate feature subsets; train one model for each feature subset.
Embedded methods: do not retrain the model at every step; search the feature selection space and the model parameter space simultaneously.
Hybrid methods: combine filter and wrapper strategies.

Filter methods. (Scheme: data in $\mathbb{R}^p$ -> feature selection -> $\mathbb{R}^s$, with $s \ll p$ -> classifier design.) Features are scored independently and the top $s$ are used by the classifier. Scores: correlation, mutual information, t-statistic, F-statistic, p-value, etc. Easy to interpret and usually fast. (Adapted from J. Fridlyand.)

Feature ranking. Given a set of features $F$, variable ranking is the process of ordering the features by the value of some scoring function $S : F \to \mathbb{R}$ (which usually measures feature relevance). The result is a permutation of $F$:
$F' = \{f_{i_1}, \dots, f_{i_j}, \dots, f_{i_n}\}$ with $S(f_{i_j}) \geq S(f_{i_{j+1}})$, $j = 1, \dots, n-1$.
The score $S(f_i)$ is computed from the training data, measuring some criterion of feature $f_i$. By convention, a high score indicates a valuable (relevant) feature.
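
A minimal filter-ranking sketch, assuming scikit-learn and using the ANOVA F-statistic as the scoring function $S$; any of the scores listed above could be substituted.

```python
# Filter-style feature ranking: score each feature independently (here with
# the ANOVA F-statistic) and keep the k highest-ranked ones.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import f_classif

X, y = load_breast_cancer(return_X_y=True)

scores, _ = f_classif(X, y)          # S(f_i) for every feature, higher = better
ranking = np.argsort(scores)[::-1]   # permutation of F by decreasing score

k = 5
top_k = ranking[:k]
print("top-k feature indices:", top_k)
X_reduced = X[:, top_k]              # data passed on to the classifier
```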

Feature ranking as feature selection. A simple method for feature selection using variable ranking is to select the k highest-ranked features according to S. This is usually not optimal, but it is often preferable to other, more complicated methods, and it is computationally efficient: only the calculation and sorting of n scores is required.

Ranking criteria, questions: Can variables with a small score be automatically discarded? NO. Can a useless variable (i.e. one with a small score) be useful together with others? YES. Can two variables that are useless by themselves be useful together? YES.

Ranking criteria. Correlation between variables and the target is not enough to assess relevance; the correlation/covariance between pairs of variables has to be considered too (potentially difficult). With diverse features, which one should be chosen?

Problems with filter methods. Redundancy in the selected features: features are considered independently and are not assessed on whether they contribute new information. Interactions among features generally cannot be explicitly incorporated (some filter methods are smarter than others). The classifier has no say in what features should be used: some scores may be more appropriate in conjunction with some classifiers than others. Filters are sometimes used as a pre-processing step for other methods. (Adapted from J. Fridlyand.)

Dimension reduction. A variant on filter methods: rather than retain a subset of s features, perform dimension reduction by projecting the features onto the s principal components of variation (e.g. PCA). The problem is that we are no longer dealing with one feature at a time but rather with a linear, or possibly more complicated, combination of all features. It may be good enough for a black box, but how does one build a diagnostic chip on a "supergene" (even though we don't want to confuse the tasks)? These methods tend not to work better than simple filter methods. (Adapted from J. Fridlyand.)

Wrapper methods. (Same scheme: $\mathbb{R}^p$ -> feature selection -> $\mathbb{R}^s$, $s \ll p$ -> classifier design.) Iterative approach: many feature subsets are scored based on classification performance and the best one is used. Selection of subsets: forward selection, backward selection, forward-backward selection, ant colony optimization, genetic algorithms, particle swarm optimization, etc. By using the learner as a black box, wrappers are universal and simple. (Adapted from J. Fridlyand.)
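
A sketch of the simplest wrapper, greedy forward selection, assuming scikit-learn; the classifier (a scaled logistic regression) and the stopping rule (stop when no candidate improves the cross-validated accuracy) are illustrative choices, not the method used on the slides.

```python
# Wrapper feature selection with greedy forward search: at every step add the
# feature whose inclusion gives the best cross-validated accuracy.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
n_features = X.shape[1]

def cv_score(cols):
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    return cross_val_score(clf, X[:, cols], y, cv=5).mean()

selected, remaining = [], list(range(n_features))
best_overall = -np.inf
while remaining:
    # Score every candidate extension of the current subset.
    scores = {j: cv_score(selected + [j]) for j in remaining}
    j_best, s_best = max(scores.items(), key=lambda kv: kv[1])
    if s_best <= best_overall:      # stop when no candidate improves the score
        break
    selected.append(j_best)
    remaining.remove(j_best)
    best_overall = s_best

print("selected features:", selected, "CV accuracy:", round(best_overall, 3))
```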

Problems with wrapper methods. Computationally expensive: for each feature subset to be considered, a classifier must be built and evaluated. No exhaustive search is possible ($2^p$ subsets to consider): generally greedy algorithms only. Easy to overfit. (Adapted from J. Fridlyand.)

Validation. Two cross-validation set-ups with N samples: (1) the feature selector and the classifier are both trained and tested inside each cross-validation/leave-one-out fold, and the errors are counted; (2) feature selection is done once on all N samples and only the classifier is trained and tested inside the folds. Set-up 2 (cross-validation after feature selection) can yield an optimistic estimate of the true classification error.
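
The optimistic bias of set-up 2 is easy to reproduce: with pure noise features and random labels the honest accuracy is about 0.5, but selecting the features on all samples before cross-validation makes the same classifier look far better. A sketch assuming scikit-learn; the data sizes and the choice of 20 selected features are arbitrary.

```python
# Selection bias demo: feature selection on all data before CV (set-up 2)
# gives an optimistic error estimate; selection inside each fold (set-up 1)
# does not. Pure noise features, random labels -> true accuracy is ~0.5.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5000))          # 50 samples, 5000 noise features
y = rng.integers(0, 2, size=50)          # random binary labels

clf = KNeighborsClassifier(n_neighbors=3)

# Set-up 2: select the 20 "best" features once, on all samples, then run CV.
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
optimistic = cross_val_score(clf, X_sel, y, cv=5).mean()

# Set-up 1: selection is part of the pipeline, refitted inside every fold.
pipe = make_pipeline(SelectKBest(f_classif, k=20), clf)
honest = cross_val_score(pipe, X, y, cv=5).mean()

print(f"FS outside CV: {optimistic:.2f}  |  FS inside CV: {honest:.2f}")
```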

Taxonomy of feature selection. Saeys Y, Inza I, Larrañaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007 Oct 1;23(19):2507-17.

Tree search methods: bottom-up.

Tree search methods: top-down.

Tree search methods. Advantages: easy to use; reduce the number of iterations; bottom-up achieves a smaller number of features. Disadvantages: converge to local minima; computationally very heavy for more than about 50 features. Metaheuristic methods: global search.

Artificial ants. Artificial ants move in graphs (nodes/arcs); the environment is discrete. Like real ants, they choose paths based on pheromone concentration and deposit pheromones on paths, and the environment updates the pheromones. Extra abilities of artificial ants: prior knowledge (a heuristic $\eta$) and memory (a feasible neighbourhood $N^k$).

Proposed algorithm (multicriteria): features 1 to N are ranked; an ant colony for the cardinality of the features and an ant colony for the selection of the features, each with its own pheromone update, form the ant system; the resulting subsets are used for modeling and tested on (X test, Y test); the cost combines two objectives, minimizing the number of features and minimizing the classification error; the procedure runs for N cycles.

Ant Feature Selection (AFS). An ant k chooses the next node (feature) j from node i with probability
$p_{ij}^k = \dfrac{[\tau_{ij}]^{\alpha}\,[\eta_{ij}]^{\beta}}{\sum_{l \in N_i^k} [\tau_{il}]^{\alpha}\,[\eta_{il}]^{\beta}}$ if $j \in N_i^k$, and $p_{ij}^k = 0$ otherwise.
Pheromone update: $\tau_{ij}(l+1) = (1-\rho)\,\tau_{ij}(l) + \Delta\tau_{ij}^k(l)$.
(The slide illustrates this on a graph of features $x_1, \dots, x_n$; the example tour yields the subset $\{x_3, x_6, x_7, x_1, x_4\}$.)

Heuristics in AFS. Heuristic for the feature cardinality: Fisher's score of the features,
$F(i) = \dfrac{\big(\mu_{c_1}^{(i)} - \mu_{c_2}^{(i)}\big)^2}{\big(\sigma_{c_1}^{(i)}\big)^2 + \big(\sigma_{c_2}^{(i)}\big)^2}$,
where $\mu_c^{(i)}$ and $\big(\sigma_c^{(i)}\big)^2$ are the mean and variance of feature i for the samples in classes $c_1$ and $c_2$. Heuristic for the selection of features: the classification error $e(i)$ of the individual features, $\eta_f^{(i)} = 1 - e(i)$.
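
A simplified single-colony sketch of these ingredients, assuming scikit-learn: the transition rule $p_j \propto \tau_j^{\alpha} \eta_j^{\beta}$ with the Fisher score as the heuristic $\eta$, and the evaporate-and-deposit pheromone update. It fixes the subset size and uses cross-validated k-NN accuracy as the subset quality, so it illustrates the building blocks rather than the two-colony AFS algorithm above; all parameter values are assumptions.

```python
# Simplified ant-colony feature selection: per-feature pheromones tau, Fisher
# score heuristic eta, probabilistic construction of subsets and an
# evaporation-plus-deposit pheromone update. Not the authors' exact algorithm.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
n = X.shape[1]

# Heuristic eta: Fisher score of each feature (two-class case).
mu1, mu2 = X[y == 0].mean(0), X[y == 1].mean(0)
v1, v2 = X[y == 0].var(0), X[y == 1].var(0)
eta = (mu1 - mu2) ** 2 / (v1 + v2 + 1e-12)

alpha, beta, rho = 1.0, 1.0, 0.2                 # assumed ACO parameters
n_ants, n_cycles, subset_size = 10, 20, 5        # assumed colony settings
tau = np.ones(n)                                 # pheromone per feature
rng = np.random.default_rng(0)

def quality(cols):
    clf = KNeighborsClassifier(n_neighbors=5)
    return cross_val_score(clf, X[:, cols], y, cv=3).mean()

best_cols, best_q = None, -np.inf
for _ in range(n_cycles):
    for _ in range(n_ants):
        cols = []
        for _ in range(subset_size):
            # p_j proportional to tau_j^alpha * eta_j^beta over feasible features
            feasible = [j for j in range(n) if j not in cols]
            w = tau[feasible] ** alpha * eta[feasible] ** beta
            cols.append(rng.choice(feasible, p=w / w.sum()))
        q = quality(cols)
        if q > best_q:
            best_cols, best_q = cols, q
    # Pheromone update: evaporation plus deposit on the best subset so far.
    tau *= (1 - rho)
    tau[best_cols] += best_q

print("best subset:", sorted(best_cols), "CV accuracy:", round(best_q, 3))
```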

Results: fuzzy models. Classification rates with 10-fold cross-validation (No FS = no feature selection, AFS = ant feature selection):

Data set  | Accuracy No FS | Accuracy AFS | Std. dev. No FS | Std. dev. AFS | Features No FS | Features AFS
1 WBCO    | 84.5  | 97.7  | 1.75 | 1.1  | 9   | 2-5
2 Wine    | 82.6  | 99.5  | 3.40 | 1.66 | 13  | 2-4
3 Vote    | 80.0  | 99.7  | 4.18 | 1.0  | 16  | 2-5
4 WDBC    | 77.2  | 99.5  | 3.05 | 0.84 | 32  | 2-3
5 WPBC    | 78.9  | 85.6  | 1.50 | 2.47 | 33  |
6 Sonar   | 60.2  | 86.6  | 5.73 | 2.83 | 60  | 2-3
7 Musk    | 77.7  | 78.3  | 4.14 | 4.39 | 166 | 2-20
Average   | 77.3  | 92.4  | -    | -    | -   | -
W/T/L     | 0/0/7 | 0/1/6 | -    | -    | -   | -

Comparison with state-of-the-art methods: GAAR, a genetic algorithm-based approach; PSORSFS, a particle swarm optimization-based approach; GBML, multi-objective fuzzy genetics-based machine learning; MIFS, a classical filter method based on mutual information; HGA, a hybrid genetic algorithm wrapper approach based on mutual information.

Real-world example: the MEDAN database. Web: http://141..16.103/datenbank/download_database.htm. The MEDAN database contains the data of 382 patients, copied from intensive care unit records in the years 1998-2002 by medical documentation staff. All patients have septic shock of abdominal cause. Task: predict patient survival. There are several problems in the database, described next.

Sepsis patients database (MEDAN): the patient-by-variable matrix contains 387 patients and 59 variables.

MEDAN problems: different time samples.

MEDAN problems: missing data.

MEDAN problems: variables that stopped being measured.

MEDAN problems (further examples).

Test example. Problem definition:
$x_1 = r\cos(t)$, $x_2 = r\sin(t)$, with $r \in [0.99, 1.01]$; $y = 1$ if $r \geq 1$ and $y = 0$ otherwise.
Features: $F = \{x_1, x_2, x_1^2, x_2^2\}$. Output: $y \in \{0, 1\}$.

Correlation matrix (rows/columns: $x_1$, $x_2$, $x_1^2$, $x_2^2$, $y$):
 1.0000  -0.1163  -0.1784   0.1790  -0.1090
-0.1163   1.0000   0.00    -0.085   -0.116
-0.1784   0.00     1.0000  -0.9995   0.1050
 0.1790  -0.085   -0.9995   1.0000  -0.077
-0.1090  -0.116    0.1050  -0.077    1.0000

No single feature is strongly correlated with the output, yet $x_1^2$ and $x_2^2$ are almost perfectly anti-correlated and together determine $y$ exactly through $x_1^2 + x_2^2 = r^2$.
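
The example is easy to reproduce; the sketch below (NumPy only, sample size chosen arbitrarily) generates the circle data, prints the correlation matrix and checks that the jointly informative pair $x_1^2$, $x_2^2$ classifies perfectly even though each feature is almost uncorrelated with y.

```python
# Reproduce the test example: points on a noisy unit circle where the class is
# r >= 1. Each feature is nearly uncorrelated with y, but x1^2 + x2^2 = r^2
# recovers the label exactly.
import numpy as np

rng = np.random.default_rng(0)
n = 2000                                   # sample size is an arbitrary choice
t = rng.uniform(0, 2 * np.pi, n)
r = rng.uniform(0.99, 1.01, n)

x1, x2 = r * np.cos(t), r * np.sin(t)
y = (r >= 1).astype(float)

features = np.column_stack([x1, x2, x1**2, x2**2, y])
corr = np.corrcoef(features, rowvar=False)
print(np.round(corr, 3))                   # note corr(x1^2, x2^2) close to -1

# The pair (x1^2, x2^2) is jointly perfect even though each is weak alone:
print("accuracy of rule x1^2 + x2^2 >= 1:",
      np.mean((x1**2 + x2**2 >= 1) == (y == 1)))
```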

Test example: scatter plots of the feature pairs ($x_3$ and $x_4$ against $x_1$ and $x_2$, and $x_4$ against $x_3$).

Test example: plot of fuzzy model performance (correct classification, %) against subset cardinality, comparing the exhaustive test of all feature combinations using fuzzy models with the Pareto solutions.

Test example: pheromone concentration evolution over iterations and features, for ant feature selection using fuzzy models (5 ants, 20 iterations).

Fuzzy objective function. Classic objective function: minimize a weighted sum of the classification error and the number of features, $f = w_1\, e + w_2\, n_f$. Fuzzy objective function: fuzzy decision $D(x) = C_1(x) \wedge \dots \wedge C_n(x)$; the optimal decision maximizes $D(x)$, here $D(x) = C_{N_e}(x) \wedge C_{N_f}(x)$, where $C_{N_e}$ is the criterion on the classification error and $C_{N_f}$ the criterion on the feature cardinality.
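
A small sketch of the fuzzy aggregation idea, using min as the conjunction and made-up piecewise-linear memberships for the two criteria; the actual $C_{N_e}$ and $C_{N_f}$ of the slides are not reproduced here.

```python
# Fuzzy aggregation of the two criteria: map classification error and subset
# cardinality to [0, 1] memberships and combine them with a t-norm (min here).
# Membership shapes and bounds are illustrative assumptions.
import numpy as np

def mu_error(e, e_max=0.5):
    """C_Ne: satisfaction 1 at zero error, 0 at e_max or worse."""
    return np.clip(1.0 - e / e_max, 0.0, 1.0)

def mu_cardinality(n_feat, n_total):
    """C_Nf: satisfaction 1 for a single feature, 0 for using all features."""
    return np.clip(1.0 - (n_feat - 1) / (n_total - 1), 0.0, 1.0)

def fuzzy_decision(e, n_feat, n_total):
    """D = C_Ne AND C_Nf, with min as the conjunction."""
    return min(mu_error(e), mu_cardinality(n_feat, n_total))

# Candidate subsets described by (error, cardinality); pick the max-D one.
candidates = {"A": (0.05, 12), "B": (0.08, 3), "C": (0.20, 2)}
best = max(candidates, key=lambda k: fuzzy_decision(*candidates[k], n_total=30))
print("chosen subset:", best)
```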

Fuzzy criteria: classification error criterion $C_{N_e}$ (membership function plot).

Fuzzy criteria: feature cardinality criterion $C_{N_f}$ (membership function plot).

Results: fuzzy models. Classification rates with 10-fold cross-validation, now comparing the number of features selected with the classic objective (AFS) and with the fuzzy objective function (FOF):

Data set  | Accuracy No FS | Accuracy AFS | Std. dev. No FS | Std. dev. AFS | Features AFS | Features FOF
1 WBCO    | 84.5  | 97.7  | 1.75 | 1.1  | 2-5  | 3
2 Wine    | 82.6  | 99.5  | 3.40 | 1.66 | 2-4  | 4
3 Vote    | 80.0  | 99.7  | 4.18 | 1.0  | 2-5  | 2-3
4 WDBC    | 77.2  | 99.5  | 3.05 | 0.84 | 2-3  | 3
5 WPBC    | 78.9  | 85.6  | 1.50 | 2.47 |      | 3-4
6 Sonar   | 60.2  | 86.6  | 5.73 | 2.83 | 2-3  |
7 Musk    | 77.7  | 78.3  | 4.14 | 4.39 | 2-20 | 6-
Average   | 77.3  | 92.4  | -    | -    | -    | -
W/T/L     | 0/0/7 | 0/1/6 | -    | -    | -    | -