Feature Selection in Knowledge Discovery

Susana Vieira, Technical University of Lisbon, Instituto Superior Técnico, Department of Mechanical Engineering, Center of Intelligent Systems, IDMEC-LAETA, Av. Rovisco Pais 1, 1049-001 Lisboa, Portugal. E-mail: susana.vieira@ist.utl.pt. November 2010.

Knowledge discovery process: data acquisition -> data -> target data -> preprocessing -> preprocessed data -> feature selection -> reduced data -> modeling -> patterns -> interpretation -> knowledge. Based on U. Fayyad, G. Piatetsky-Shapiro and P. Smyth, "From data mining to knowledge discovery in databases", AI Magazine, 17(3):37-54, 1996.

Outline: motivation (why feature selection), basic definitions, ranking methods, feature subset selection, optimization methods (tree search, ant feature selection).

Why feature selection? Why even think about feature selection, when the information about the target class is inherent in the variables? The naive theoretical view is that more features mean more information and therefore more discrimination power. In practice there are many reasons why this is not the case. Also: optimization is (usually) good, so why not try to optimize the input coding?

Practical problems. Many explored domains have hundreds to tens of thousands of variables/features, many of them irrelevant and redundant. In domains with many features the underlying probability distribution can be very complex and very hard to estimate (e.g. dependencies between variables). Irrelevant and redundant features can confuse learners. Limited training data, limited computational resources and the curse of dimensionality make things worse.

The required number of samples (to achieve the same accuracy) grows exponentially with the number of variables, while in practice the number of training examples is fixed, so the classifier's performance usually degrades for a large number of features. In many cases the information that is lost by discarding variables is made up for by a more accurate mapping/sampling in the lower-dimensional space.

Real-world example: gene selection from microarray data. Variables: gene expression coefficients corresponding to the amount of mRNA in a patient's sample (e.g. tissue biopsy). Task: separate healthy patients from cancer patients. Usually there are only about 100 examples (patients) available for training and testing, while the number of variables in the raw data is 6,000 to 60,000. Does this work? See [8].
[8] C. Ambroise, G. J. McLachlan: Selection bias in gene extraction on the basis of microarray gene-expression data. PNAS 99(10):6562-6566, 2002.

What is feature selection? Remove features X(i) to improve (or at least not degrade) the prediction of Y. Advantages: feature selection identifies the most relevant features; fewer features and less data need to be collected and processed; less complex models run faster; models are easier to understand, verify and explain.

Feature selection: definition. Given a set of features $F = \{f_1, \dots, f_i, \dots, f_n\}$, the feature selection problem is to find a subset $F' \subseteq F$ that maximizes the learner's ability to classify patterns. Formally, $F'$ should maximize some scoring function $\Phi : \Gamma \to \mathbb{R}$ (where $\Gamma$ is the space of all possible feature subsets of $F$), i.e.
$F' = \arg\max_{G \in \Gamma} \Phi(G)$.

Feature extraction: definition. Given a set of features $F = \{f_1, \dots, f_i, \dots, f_n\}$, the feature extraction (construction) problem is to map $F$ to some feature set $F''$ that maximizes the learner's ability to classify patterns, again $F'' = \arg\max_{G \in \Gamma^*} \Phi(G)$, where $\Gamma^*$ is the set of all possible feature sets. This general definition subsumes feature selection: a feature selection algorithm also performs a mapping, but can only map to subsets of the input variables.
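
The argmax definition can be made concrete with a brute-force search, which also shows why it does not scale: the loop below visits every subset of a deliberately small feature pool. A minimal sketch, assuming scikit-learn is available and using cross-validated k-NN accuracy as an illustrative choice of the scoring function $\Phi$; the helper name score_subset is made up for this example.

```python
# Exhaustive subset search, F' = argmax_G Phi(G), over a small feature pool
# (2^8 = 256 subsets); with all n features this enumeration is intractable.
from itertools import combinations

import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
X = X[:, :8]                      # keep only 8 features so enumeration stays cheap
n = X.shape[1]

def score_subset(cols):
    """Phi(G): mean cross-validated accuracy of a k-NN classifier on columns G."""
    clf = KNeighborsClassifier(n_neighbors=5)
    return cross_val_score(clf, X[:, cols], y, cv=5).mean()

best_subset, best_score = None, -np.inf
for k in range(1, n + 1):
    for cols in combinations(range(n), k):
        s = score_subset(list(cols))
        if s > best_score:
            best_subset, best_score = list(cols), s

print("best subset:", best_subset, "score:", round(best_score, 3))
```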

Feature selection:
$F = \{f_1, \dots, f_i, \dots, f_n\} \xrightarrow{\text{f. selection}} F' = \{f_{i_1}, \dots, f_{i_j}, \dots, f_{i_m}\}$,
with $i_j \in \{1, \dots, n\}$, $j = 1, \dots, m$, and $i_a \neq i_b$ for $a, b \in \{1, \dots, m\}$, $a \neq b$.

Feature extraction/creation:
$F = \{f_1, \dots, f_i, \dots, f_n\} \xrightarrow{\text{f. extraction}} F'' = \{g_1(f_1, \dots, f_n), \dots, g_j(f_1, \dots, f_n), \dots, g_m(f_1, \dots, f_n)\}$.

Feature selection optimality. In theory the goal is to find an optimal feature subset (one that maximizes the scoring function). In real-world applications this is usually not possible: for most problems it is computationally intractable to search the whole space of possible feature subsets, so one usually has to settle for approximations of the optimal subset. Most of the research in this area is devoted to finding efficient search heuristics.
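
To make the distinction concrete, the short sketch below (assuming scikit-learn; the column indices are arbitrary) contrasts the two mappings: selection keeps original columns $f_i$, extraction builds new features $g_j(f_1, \dots, f_n)$, here linear combinations produced by PCA.

```python
# Feature selection keeps a subset of the original columns; feature extraction
# builds new features g_j(f_1, ..., f_n). A PCA projection is one common g.
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA

X, _ = load_wine(return_X_y=True)

# Selection: F' is literally a set of original columns (indices are arbitrary).
selected_idx = [0, 6, 9]
X_selected = X[:, selected_idx]         # shape (n_samples, 3)

# Extraction: each new feature is a linear combination of all 13 inputs.
pca = PCA(n_components=3)
X_extracted = pca.fit_transform(X)      # shape (n_samples, 3)

print(X_selected.shape, X_extracted.shape)
```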

Relevance of features: relevance vs optimality of a feature set. Classifiers induced from training data are likely to be suboptimal (there is no access to the real distribution of the data). Relevance does not imply that the feature is in the optimal feature subset, and even irrelevant features can improve a classifier's performance. Defining relevance in terms of a given classifier (and therefore a hypothesis space) would be better.

Feature selection approaches:
Filters: based on general characteristics of the data to be evaluated; no model is involved.
Wrappers: use model performance to evaluate feature subsets; train one model for each feature subset.
Embedded methods: do not retrain the model at every step; search the feature selection space and the model parameter space simultaneously.
Hybrid methods: combine filter and wrapper strategies.

Filter methods. (Scheme: data in $\mathbb{R}^p$ -> feature selection -> $\mathbb{R}^s$, with $s \ll p$ -> classifier design.) Features are scored independently and the top $s$ are used by the classifier. Scores: correlation, mutual information, t-statistic, F-statistic, p-value, etc. Easy to interpret and usually fast. (Adapted from J. Fridlyand.)

Feature ranking. Given a set of features $F$, variable ranking is the process of ordering the features by the value of some scoring function $S : F \to \mathbb{R}$ (which usually measures feature relevance). The result is a permutation of $F$:
$F' = \{f_{i_1}, \dots, f_{i_j}, \dots, f_{i_n}\}$ with $S(f_{i_j}) \geq S(f_{i_{j+1}})$, $j = 1, \dots, n-1$.
The score $S(f_i)$ is computed from the training data, measuring some criterion of feature $f_i$. By convention, a high score indicates a valuable (relevant) feature.
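
A minimal filter-ranking sketch, assuming scikit-learn and using the ANOVA F-statistic as the scoring function $S$; any of the scores listed above could be substituted.

```python
# Filter-style feature ranking: score each feature independently (here with
# the ANOVA F-statistic) and keep the k highest-ranked ones.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import f_classif

X, y = load_breast_cancer(return_X_y=True)

scores, _ = f_classif(X, y)          # S(f_i) for every feature, higher = better
ranking = np.argsort(scores)[::-1]   # permutation of F by decreasing score

k = 5
top_k = ranking[:k]
print("top-k feature indices:", top_k)
X_reduced = X[:, top_k]              # data passed on to the classifier
```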

Feature ranking as feature selection. A simple method for feature selection using variable ranking is to select the k highest-ranked features according to S. This is usually not optimal, but it is often preferable to other, more complicated methods, and it is computationally efficient: only the calculation and sorting of n scores is required.

Ranking criteria, questions: Can variables with a small score be automatically discarded? NO. Can a useless variable (i.e. one with a small score) be useful together with others? YES. Can two variables that are useless by themselves be useful together? YES.

Ranking criteria. Correlation between variables and the target is not enough to assess relevance; the correlation/covariance between pairs of variables has to be considered too (potentially difficult). With diverse features, which one should be chosen?

Problems with filter methods. Redundancy in the selected features: features are considered independently and are not assessed on whether they contribute new information. Interactions among features generally cannot be explicitly incorporated (some filter methods are smarter than others). The classifier has no say in what features should be used: some scores may be more appropriate in conjunction with some classifiers than others. Filters are sometimes used as a pre-processing step for other methods. (Adapted from J. Fridlyand.)

Dimension reduction. A variant on filter methods: rather than retain a subset of s features, perform dimension reduction by projecting the features onto the s principal components of variation (e.g. PCA). The problem is that we are no longer dealing with one feature at a time but rather with a linear, or possibly more complicated, combination of all features. It may be good enough for a black box, but how does one build a diagnostic chip on a "supergene" (even though we don't want to confuse the tasks)? These methods tend not to work better than simple filter methods. (Adapted from J. Fridlyand.)

Wrapper methods. (Same scheme: $\mathbb{R}^p$ -> feature selection -> $\mathbb{R}^s$, $s \ll p$ -> classifier design.) Iterative approach: many feature subsets are scored based on classification performance and the best one is used. Selection of subsets: forward selection, backward selection, forward-backward selection, ant colony optimization, genetic algorithms, particle swarm optimization, etc. By using the learner as a black box, wrappers are universal and simple. (Adapted from J. Fridlyand.)
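
A sketch of the simplest wrapper, greedy forward selection, assuming scikit-learn; the classifier (a scaled logistic regression) and the stopping rule (stop when no candidate improves the cross-validated accuracy) are illustrative choices, not the method used on the slides.

```python
# Wrapper feature selection with greedy forward search: at every step add the
# feature whose inclusion gives the best cross-validated accuracy.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
n_features = X.shape[1]

def cv_score(cols):
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    return cross_val_score(clf, X[:, cols], y, cv=5).mean()

selected, remaining = [], list(range(n_features))
best_overall = -np.inf
while remaining:
    # Score every candidate extension of the current subset.
    scores = {j: cv_score(selected + [j]) for j in remaining}
    j_best, s_best = max(scores.items(), key=lambda kv: kv[1])
    if s_best <= best_overall:      # stop when no candidate improves the score
        break
    selected.append(j_best)
    remaining.remove(j_best)
    best_overall = s_best

print("selected features:", selected, "CV accuracy:", round(best_overall, 3))
```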

Problems with wrapper methods. Computationally expensive: for each feature subset to be considered, a classifier must be built and evaluated. No exhaustive search is possible ($2^p$ subsets to consider): generally greedy algorithms only. Easy to overfit. (Adapted from J. Fridlyand.)

Validation. Two cross-validation set-ups with N samples: (1) the feature selector and the classifier are both trained and tested inside each cross-validation/leave-one-out fold, and the errors are counted; (2) feature selection is done once on all N samples and only the classifier is trained and tested inside the folds. Set-up 2 (cross-validation after feature selection) can yield an optimistic estimate of the true classification error.
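
The optimistic bias of set-up 2 is easy to reproduce: with pure noise features and random labels the honest accuracy is about 0.5, but selecting the features on all samples before cross-validation makes the same classifier look far better. A sketch assuming scikit-learn; the data sizes and the choice of 20 selected features are arbitrary.

```python
# Selection bias demo: feature selection on all data before CV (set-up 2)
# gives an optimistic error estimate; selection inside each fold (set-up 1)
# does not. Pure noise features, random labels -> true accuracy is ~0.5.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5000))          # 50 samples, 5000 noise features
y = rng.integers(0, 2, size=50)          # random binary labels

clf = KNeighborsClassifier(n_neighbors=3)

# Set-up 2: select the 20 "best" features once, on all samples, then run CV.
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
optimistic = cross_val_score(clf, X_sel, y, cv=5).mean()

# Set-up 1: selection is part of the pipeline, refitted inside every fold.
pipe = make_pipeline(SelectKBest(f_classif, k=20), clf)
honest = cross_val_score(pipe, X, y, cv=5).mean()

print(f"FS outside CV: {optimistic:.2f}  |  FS inside CV: {honest:.2f}")
```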

Taxonomy of feature selection. Saeys Y, Inza I, Larrañaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007 Oct 1;23(19):2507-17.

Tree search methods: bottom-up.

Tree search methods: top-down.

Tree search methods. Advantages: easy to use; reduce the number of iterations; bottom-up achieves a smaller number of features. Disadvantages: converge to local minima; computationally very heavy for more than about 50 features. Metaheuristic methods: global search.

Artificial ants. Artificial ants move in graphs (nodes/arcs); the environment is discrete. Like real ants, they choose paths based on pheromone concentration and deposit pheromones on paths, and the environment updates the pheromones. Extra abilities of artificial ants: prior knowledge (a heuristic $\eta$) and memory (a feasible neighbourhood $N^k$).

Proposed algorithm (multicriteria): features 1 to N are ranked; an ant colony for the cardinality of the features and an ant colony for the selection of the features, each with its own pheromone update, form the ant system; the resulting subsets are used for modeling and tested on (X test, Y test); the cost combines two objectives, minimizing the number of features and minimizing the classification error; the procedure runs for N cycles.

Ant Feature Selection (AFS). An ant k chooses the next node (feature) j from node i with probability
$p_{ij}^k = \dfrac{[\tau_{ij}]^{\alpha}\,[\eta_{ij}]^{\beta}}{\sum_{l \in N_i^k} [\tau_{il}]^{\alpha}\,[\eta_{il}]^{\beta}}$ if $j \in N_i^k$, and $p_{ij}^k = 0$ otherwise.
Pheromone update: $\tau_{ij}(l+1) = (1-\rho)\,\tau_{ij}(l) + \Delta\tau_{ij}^k(l)$.
(The slide illustrates this on a graph of features $x_1, \dots, x_n$; the example tour yields the subset $\{x_3, x_6, x_7, x_1, x_4\}$.)

Heuristics in AFS. Heuristic for the feature cardinality: Fisher's score of the features,
$F(i) = \dfrac{\big(\mu_{c_1}^{(i)} - \mu_{c_2}^{(i)}\big)^2}{\big(\sigma_{c_1}^{(i)}\big)^2 + \big(\sigma_{c_2}^{(i)}\big)^2}$,
where $\mu_c^{(i)}$ and $\big(\sigma_c^{(i)}\big)^2$ are the mean and variance of feature i for the samples in classes $c_1$ and $c_2$. Heuristic for the selection of features: the classification error $e(i)$ of the individual features, $\eta_f^{(i)} = 1 - e(i)$.
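
A simplified single-colony sketch of these ingredients, assuming scikit-learn: the transition rule $p_j \propto \tau_j^{\alpha} \eta_j^{\beta}$ with the Fisher score as the heuristic $\eta$, and the evaporate-and-deposit pheromone update. It fixes the subset size and uses cross-validated k-NN accuracy as the subset quality, so it illustrates the building blocks rather than the two-colony AFS algorithm above; all parameter values are assumptions.

```python
# Simplified ant-colony feature selection: per-feature pheromones tau, Fisher
# score heuristic eta, probabilistic construction of subsets and an
# evaporation-plus-deposit pheromone update. Not the authors' exact algorithm.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
n = X.shape[1]

# Heuristic eta: Fisher score of each feature (two-class case).
mu1, mu2 = X[y == 0].mean(0), X[y == 1].mean(0)
v1, v2 = X[y == 0].var(0), X[y == 1].var(0)
eta = (mu1 - mu2) ** 2 / (v1 + v2 + 1e-12)

alpha, beta, rho = 1.0, 1.0, 0.2                 # assumed ACO parameters
n_ants, n_cycles, subset_size = 10, 20, 5        # assumed colony settings
tau = np.ones(n)                                 # pheromone per feature
rng = np.random.default_rng(0)

def quality(cols):
    clf = KNeighborsClassifier(n_neighbors=5)
    return cross_val_score(clf, X[:, cols], y, cv=3).mean()

best_cols, best_q = None, -np.inf
for _ in range(n_cycles):
    for _ in range(n_ants):
        cols = []
        for _ in range(subset_size):
            # p_j proportional to tau_j^alpha * eta_j^beta over feasible features
            feasible = [j for j in range(n) if j not in cols]
            w = tau[feasible] ** alpha * eta[feasible] ** beta
            cols.append(rng.choice(feasible, p=w / w.sum()))
        q = quality(cols)
        if q > best_q:
            best_cols, best_q = cols, q
    # Pheromone update: evaporation plus deposit on the best subset so far.
    tau *= (1 - rho)
    tau[best_cols] += best_q

print("best subset:", sorted(best_cols), "CV accuracy:", round(best_q, 3))
```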

Results: fuzzy models. Classification rates with 10-fold cross-validation (No FS = no feature selection, AFS = ant feature selection):

Data set  | Accuracy No FS | Accuracy AFS | Std. dev. No FS | Std. dev. AFS | Features No FS | Features AFS
1 WBCO    | 84.5  | 97.7  | 1.75 | 1.1  | 9   | 2-5
2 Wine    | 82.6  | 99.5  | 3.40 | 1.66 | 13  | 2-4
3 Vote    | 80.0  | 99.7  | 4.18 | 1.0  | 16  | 2-5
4 WDBC    | 77.2  | 99.5  | 3.05 | 0.84 | 32  | 2-3
5 WPBC    | 78.9  | 85.6  | 1.50 | 2.47 | 33  |
6 Sonar   | 60.2  | 86.6  | 5.73 | 2.83 | 60  | 2-3
7 Musk    | 77.7  | 78.3  | 4.14 | 4.39 | 166 | 2-20
Average   | 77.3  | 92.4  | -    | -    | -   | -
W/T/L     | 0/0/7 | 0/1/6 | -    | -    | -   | -

Comparison with state-of-the-art methods: GAAR, a genetic algorithm-based approach; PSORSFS, a particle swarm optimization-based approach; GBML, multi-objective fuzzy genetics-based machine learning; MIFS, a classical filter method based on mutual information; HGA, a hybrid genetic algorithm wrapper approach based on mutual information.

Real-world example: the MEDAN database. Web: http://141..16.103/datenbank/download_database.htm. The MEDAN database contains the data of 382 patients, copied from intensive care unit records in the years 1998-2002 by medical documentation staff. All patients have septic shock of abdominal cause. Task: predict patient survival. There are several problems in the database, described next.

Sepsis patients database (MEDAN): the patient-by-variable matrix contains 387 patients and 59 variables.

MEDAN problems: different time samples.

MEDAN problems: missing data.

MEDAN problems: variables that stopped being measured.

MEDAN problems (further examples).

Test example. Problem definition:
$x_1 = r\cos(t)$, $x_2 = r\sin(t)$, with $r \in [0.99, 1.01]$; $y = 1$ if $r \geq 1$ and $y = 0$ otherwise.
Features: $F = \{x_1, x_2, x_1^2, x_2^2\}$. Output: $y \in \{0, 1\}$.

Correlation matrix (rows/columns: $x_1$, $x_2$, $x_1^2$, $x_2^2$, $y$):
 1.0000  -0.1163  -0.1784   0.1790  -0.1090
-0.1163   1.0000   0.00    -0.085   -0.116
-0.1784   0.00     1.0000  -0.9995   0.1050
 0.1790  -0.085   -0.9995   1.0000  -0.077
-0.1090  -0.116    0.1050  -0.077    1.0000

No single feature is strongly correlated with the output, yet $x_1^2$ and $x_2^2$ are almost perfectly anti-correlated and together determine $y$ exactly through $x_1^2 + x_2^2 = r^2$.
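
The example is easy to reproduce; the sketch below (NumPy only, sample size chosen arbitrarily) generates the circle data, prints the correlation matrix and checks that the jointly informative pair $x_1^2$, $x_2^2$ classifies perfectly even though each feature is almost uncorrelated with y.

```python
# Reproduce the test example: points on a noisy unit circle where the class is
# r >= 1. Each feature is nearly uncorrelated with y, but x1^2 + x2^2 = r^2
# recovers the label exactly.
import numpy as np

rng = np.random.default_rng(0)
n = 2000                                   # sample size is an arbitrary choice
t = rng.uniform(0, 2 * np.pi, n)
r = rng.uniform(0.99, 1.01, n)

x1, x2 = r * np.cos(t), r * np.sin(t)
y = (r >= 1).astype(float)

features = np.column_stack([x1, x2, x1**2, x2**2, y])
corr = np.corrcoef(features, rowvar=False)
print(np.round(corr, 3))                   # note corr(x1^2, x2^2) close to -1

# The pair (x1^2, x2^2) is jointly perfect even though each is weak alone:
print("accuracy of rule x1^2 + x2^2 >= 1:",
      np.mean((x1**2 + x2**2 >= 1) == (y == 1)))
```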

Test example: scatter plots of the feature pairs ($x_3$ and $x_4$ against $x_1$ and $x_2$, and $x_4$ against $x_3$).

Test example: plot of fuzzy model performance (correct classification, %) against subset cardinality, comparing the exhaustive test of all feature combinations using fuzzy models with the Pareto solutions.

Test example: pheromone concentration evolution over iterations and features, for ant feature selection using fuzzy models (5 ants, 20 iterations).

Fuzzy objective function. Classic objective function: minimize a weighted sum of the classification error and the number of features, $f = w_1\, e + w_2\, n_f$. Fuzzy objective function: fuzzy decision $D(x) = C_1(x) \wedge \dots \wedge C_n(x)$; the optimal decision maximizes $D(x)$, here $D(x) = C_{N_e}(x) \wedge C_{N_f}(x)$, where $C_{N_e}$ is the criterion on the classification error and $C_{N_f}$ the criterion on the feature cardinality.
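
A small sketch of the fuzzy aggregation idea, using min as the conjunction and made-up piecewise-linear memberships for the two criteria; the actual $C_{N_e}$ and $C_{N_f}$ of the slides are not reproduced here.

```python
# Fuzzy aggregation of the two criteria: map classification error and subset
# cardinality to [0, 1] memberships and combine them with a t-norm (min here).
# Membership shapes and bounds are illustrative assumptions.
import numpy as np

def mu_error(e, e_max=0.5):
    """C_Ne: satisfaction 1 at zero error, 0 at e_max or worse."""
    return np.clip(1.0 - e / e_max, 0.0, 1.0)

def mu_cardinality(n_feat, n_total):
    """C_Nf: satisfaction 1 for a single feature, 0 for using all features."""
    return np.clip(1.0 - (n_feat - 1) / (n_total - 1), 0.0, 1.0)

def fuzzy_decision(e, n_feat, n_total):
    """D = C_Ne AND C_Nf, with min as the conjunction."""
    return min(mu_error(e), mu_cardinality(n_feat, n_total))

# Candidate subsets described by (error, cardinality); pick the max-D one.
candidates = {"A": (0.05, 12), "B": (0.08, 3), "C": (0.20, 2)}
best = max(candidates, key=lambda k: fuzzy_decision(*candidates[k], n_total=30))
print("chosen subset:", best)
```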

Fuzzy criteria: classification error criterion $C_{N_e}$ (membership function plot).

Fuzzy criteria: feature cardinality criterion $C_{N_f}$ (membership function plot).

Results: fuzzy models. Classification rates with 10-fold cross-validation, now comparing the number of features selected with the classic objective (AFS) and with the fuzzy objective function (FOF):

Data set  | Accuracy No FS | Accuracy AFS | Std. dev. No FS | Std. dev. AFS | Features AFS | Features FOF
1 WBCO    | 84.5  | 97.7  | 1.75 | 1.1  | 2-5  | 3
2 Wine    | 82.6  | 99.5  | 3.40 | 1.66 | 2-4  | 4
3 Vote    | 80.0  | 99.7  | 4.18 | 1.0  | 2-5  | 2-3
4 WDBC    | 77.2  | 99.5  | 3.05 | 0.84 | 2-3  | 3
5 WPBC    | 78.9  | 85.6  | 1.50 | 2.47 |      | 3-4
6 Sonar   | 60.2  | 86.6  | 5.73 | 2.83 | 2-3  |
7 Musk    | 77.7  | 78.3  | 4.14 | 4.39 | 2-20 | 6-
Average   | 77.3  | 92.4  | -    | -    | -    | -
W/T/L     | 0/0/7 | 0/1/6 | -    | -    | -    | -