Minimum Redundancy and Maximum Relevance Feature Selection. Hang Xiao



1 Minimum Redundancy and Maximum Relevance Feature Selection Hang Xiao


2 Background
Feature: an individual measurable heuristic property of a phenomenon being observed.
- In character recognition: horizontal and vertical profiles, number of internal holes, stroke detection
- In speech recognition: noise ratios, length of sounds, relative power, filter matches
- In microarrays: gene expression


3 Background
Relevance between features can be measured by:
- Correlation
- F-statistic
- Mutual information: $I(x, y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \, dx \, dy$
  - $p(x, y)$: joint distribution function of X and Y
  - $p(x)$, $p(y)$: marginal probability distribution functions
  - Independent: $p(x, y) = p(x)\,p(y) \Rightarrow I(x, y) = 0$
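For discrete variables the double integral in the mutual information becomes a double sum over observed value pairs. The following is a minimal sketch of the plug-in (empirical frequency) estimate; the function name `mutual_information` and the toy sequences are illustrative assumptions, not part of the slides.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(x, y), in nats, for two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))  # empirical joint counts
    px = Counter(xs)              # empirical marginal counts of x
    py = Counter(ys)              # empirical marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log(p(x,y) / (p(x) p(y))), written with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi

# Hypothetical toy data
x = [0, 0, 1, 1]
y = [0, 1, 0, 1]   # empirically independent of x
print(mutual_information(x, y))  # independent pair, so the estimate is zero
print(mutual_information(x, x))  # a variable with itself gives its entropy, log 2 here
```

The first call illustrates the independence case from the slide: when the empirical joint distribution factorizes, every log term vanishes.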

4 Feature Selection Problem
Maximal relevance
- Selects the features with the highest relevance to the target class c, based on mutual information, F-test, etc.
- Does not consider relationships among features
Minimal redundancy addresses the resulting problems:
- Selected features are correlated
- Selected features cover narrow regions in space

5 mRMR: Discrete Variables
- Maximize relevance: $\max V_I$, $V_I = \frac{1}{|S|} \sum_{i \in S} I(h, i)$
- Minimize redundancy: $\min W_I$, $W_I = \frac{1}{|S|^2} \sum_{i, j \in S} I(i, j)$
- $S$ is the set of features; $I(i, j)$ is the mutual information between features $i$ and $j$
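To make the two discrete-variable criteria concrete, here is a small sketch that evaluates the relevance V_I and the redundancy W_I of a candidate feature set. The helper names (`mi`, `relevance`, `redundancy`), the feature names, and the toy data are hypothetical; the mutual information is a simple plug-in estimate.

```python
import math
from collections import Counter

def mi(xs, ys):
    """Plug-in mutual information (nats) between two discrete sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(c / n * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())

def relevance(features, S, h):
    """V_I = (1/|S|) * sum over i in S of I(h, i)."""
    return sum(mi(h, features[i]) for i in S) / len(S)

def redundancy(features, S):
    """W_I = (1/|S|^2) * sum over all pairs i, j in S of I(i, j), i == j included."""
    return sum(mi(features[i], features[j]) for i in S for j in S) / len(S) ** 2

# Hypothetical toy data: g2 is an exact copy of g1, g3 is only weakly related to g1
features = {
    "g1": [0, 0, 1, 1, 0, 1],
    "g2": [0, 0, 1, 1, 0, 1],
    "g3": [0, 1, 0, 1, 1, 0],
}
h = [0, 0, 1, 1, 0, 1]  # target class

print(relevance(features, ["g1", "g2"], h), redundancy(features, ["g1", "g2"]))
print(relevance(features, ["g1", "g3"], h), redundancy(features, ["g1", "g3"]))
```

The set containing the duplicated feature scores a higher W_I than the set with the weakly related feature, which is the effect the redundancy term is meant to penalize.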

6 mRMR: Continuous Variables
- Maximum relevance, using the F-statistic $F(i, h)$: $\max V_F$, $V_F = \frac{1}{|S|} \sum_{i \in S} F(i, h)$
- Minimum redundancy, using the correlation $\mathrm{cor}(i, j)$: $\min W_c$, $W_c = \frac{1}{|S|^2} \sum_{i, j} \mathrm{cor}(i, j)$
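The two statistics used for continuous variables can be sketched in plain Python. This assumes F(i, h) is the one-way ANOVA F-statistic of feature i across the classes in h, and cor(i, j) is the Pearson correlation; both readings, the function names, and the toy data are assumptions made for illustration.

```python
import math
from collections import defaultdict

def f_statistic(values, labels):
    """One-way ANOVA F: between-class variance over within-class variance."""
    groups = defaultdict(list)
    for v, c in zip(values, labels):
        groups[c].append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                  for g in groups.values())
    within = sum((v - sum(g) / len(g)) ** 2
                 for g in groups.values() for v in g)
    return (between / (k - 1)) / (within / (n - k))

def pearson(xs, ys):
    """Pearson correlation coefficient (undefined for constant inputs)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in xs) *
                           sum((b - my) ** 2 for b in ys))

# Hypothetical toy data: two classes, one informative and one noisy feature
labels = [0, 0, 0, 1, 1, 1]
informative = [1.0, 1.2, 0.9, 3.0, 3.1, 2.8]
noisy = [1.0, 3.0, 2.0, 1.1, 2.9, 2.1]

print(f_statistic(informative, labels))  # large: class means are well separated
print(f_statistic(noisy, labels))        # small: class means nearly coincide
print(pearson(informative, noisy))
```

When summing correlations into W_c, an implementation may want to use the absolute value so that strongly negative correlation also counts as redundancy; the slide text as extracted does not say either way.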

7 Combine Relevance and Redundancy
- Additive combination: $\max(V - W)$
- Multiplicative combination: $\max(V / W)$
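The two combination schemes can rank the same candidates differently. A tiny sketch, with a hypothetical `combined_score` helper and made-up (V, W) values chosen only to show the effect:

```python
def combined_score(v, w, scheme):
    """Combine relevance v and redundancy w into one score to maximize."""
    if scheme == "additive":
        return v - w          # max(V - W)
    if scheme == "multiplicative":
        return v / w          # max(V / W); requires w > 0
    raise ValueError(f"unknown scheme: {scheme}")

# Hypothetical (relevance, redundancy) pairs for two candidate feature sets
candidates = {"A": (0.5, 0.25), "B": (1.0, 0.625)}

best_additive = max(candidates,
                    key=lambda name: combined_score(*candidates[name], "additive"))
best_multiplicative = max(candidates,
                          key=lambda name: combined_score(*candidates[name], "multiplicative"))
print(best_additive, best_multiplicative)
```

With these numbers the additive form prefers B (0.375 against 0.25) while the multiplicative form prefers A (2.0 against 1.6), so the choice of scheme is a real design decision rather than a cosmetic one.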

8 Most Related Methods
- Most used feature selection methods: top-ranking features, without considering relationships among features.
- Yu & Liu, 2003/2004: information gain, essentially a similar approach.
- Wrapper: not a filter approach; the classifier is involved, and thus features do not generalize well.
- PCA and ICA: features are orthogonal or independent, but not in the original feature space.

9 Class Prediction Methods
Naive Bayes (NB) classifier
- $p(h_k \mid S) \propto \prod_{i \in S} p(g_i \mid h_k)$
- $\{g_1, g_2, \ldots, g_m\}$: gene expression levels
- $p(g_i \mid h_k)$ is the conditional table (density)
Support Vector Machine (SVM)
- Draws an optimal hyperplane in the feature vector space
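The Naive Bayes rule above can be sketched for discretized features as follows. This is an illustration under stated assumptions, not the classifier used in the slides: it adds Laplace smoothing (`alpha`) and multiplies by the class prior p(h_k), which the slide's proportionality omits; the function names and the data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(X, y, alpha=1.0):
    """Count class frequencies and per-class feature-value frequencies."""
    classes = Counter(y)
    n_features = len(X[0])
    # distinct values seen for each feature, needed for smoothing
    levels = [sorted({row[i] for row in X}) for i in range(n_features)]
    counts = defaultdict(int)  # (class, feature index, value) -> count
    for row, k in zip(X, y):
        for i, g in enumerate(row):
            counts[(k, i, g)] += 1
    return classes, levels, counts, alpha

def predict_nb(model, row):
    """Return the class maximizing log p(h_k) + sum_i log p(g_i | h_k)."""
    classes, levels, counts, alpha = model
    n = sum(classes.values())
    best, best_lp = None, -math.inf
    for k, nk in classes.items():
        lp = math.log(nk / n)
        for i, g in enumerate(row):
            # smoothed estimate of the conditional table entry p(g_i | h_k)
            lp += math.log((counts[(k, i, g)] + alpha) /
                           (nk + alpha * len(levels[i])))
        if lp > best_lp:
            best, best_lp = k, lp
    return best

# Hypothetical toy data: feature 0 determines the class, feature 1 is noise
X = [[0, 1], [0, 0], [1, 1], [1, 0]]
y = ["A", "A", "B", "B"]
model = train_nb(X, y)
print(predict_nb(model, [0, 1]), predict_nb(model, [1, 0]))
```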

10 Class Prediction Methods
Linear Discriminant Analysis (LDA)
- Finds a linear combination of features
- ANOVA, regression analysis
Logistic Regression (LR)
- A linear combination of the feature variables, transformed into probabilities by a logistic function

11 Microarray Gene Expression Data Sets for Cancer Classification

12 LOOCV: Leave-One-Out Cross Validation
Baseline features: selected solely on maximum relevance
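Leave-one-out cross validation holds out each sample in turn, trains on the rest, and counts the misclassified held-out samples. A minimal sketch follows; the generic `fit`/`predict` interface and the 1-nearest-neighbour classifier are assumptions chosen to keep the example short (the slides use NB, SVM, LDA and LR), and the data is made up.

```python
def loocv_errors(X, y, fit, predict):
    """Number of held-out samples misclassified under leave-one-out CV."""
    errors = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]   # every sample except the i-th
        y_train = y[:i] + y[i + 1:]
        model = fit(X_train, y_train)
        if predict(model, X[i]) != y[i]:
            errors += 1
    return errors

# Illustrative classifier: 1-nearest neighbour on squared Euclidean distance
def fit_1nn(X_train, y_train):
    return list(zip(X_train, y_train))

def predict_1nn(model, row):
    nearest_row, nearest_label = min(
        model, key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], row)))
    return nearest_label

# Hypothetical one-feature data with two well separated classes
X = [[0.0], [0.1], [0.25], [1.0], [1.1], [1.25]]
y_clean = [0, 0, 0, 1, 1, 1]
y_scrambled = [0, 1, 0, 1, 0, 1]

print(loocv_errors(X, y_clean, fit_1nn, predict_1nn))
print(loocv_errors(X, y_scrambled, fit_1nn, predict_1nn))
```

With the clean labels every held-out point has a same-class nearest neighbour, so the error count is 0; with the alternating labels every nearest neighbour carries the wrong label, giving 6 errors.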

13 The Role of Redundancy Reduction
(a) Relevance $V_I$ and (b) redundancy for mRMR features on the discretized NCI dataset. (c) The respective LOOCV errors obtained using the Naive Bayes classifier.

14 Do mRMR Features Generalize Well on Unseen Data?
Testing errors on the Child Leukemia data (7 classes, 215 training samples, 112 testing samples). M is the number of features used in classification.

15 What is the Relationship of mRMR Features and Various Data Discretization Schemes?
LOOCV testing results, shown as classifier (#errors), for binarized NCI and Lymphoma data using the SVM classifier.

16 Comparison with other work

17 Theoretical Basis of mRMR
Maximum Dependency Criterion
- Statistical association
- Definition: mutual information $I(S_m, h)$
Mutual information
- For two variables $x$ and $y$: $I(x, y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \, dx \, dy$
- For the multivariate variable $S_m$ and the target $h$: $I(S_m; h) = \iint p(S_m, h) \log \frac{p(S_m, h)}{p(S_m)\,p(h)} \, dS_m \, dh$

18 High-Dimensional Mutual Information
For the multivariate variable $S_m$ and the target $h$:
$$
\begin{aligned}
I(S_m; h) &= \iint p(S_m, h) \log \frac{p(S_m, h)}{p(S_m)\,p(h)} \, dS_m \, dh \\
&= \iint p(S_{m-1}, x_m, h) \log \frac{p(S_{m-1}, x_m, h)}{p(S_{m-1}, x_m)\,p(h)} \, dS_m \, dh \\
&= \int \cdots \int p(x_1, \ldots, x_m, h) \log \frac{p(x_1, \ldots, x_m, h)}{p(x_1, \ldots, x_m)\,p(h)} \, dx_1 \cdots dx_m \, dh
\end{aligned}
$$
Estimating the high-dimensional $I(S_m, h)$ is difficult:
- Finding the inverse of a large covariance matrix is an ill-posed problem
- Insufficient number of samples
- Combinatorial time complexity $O(C(|\Omega|, |S|))$
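The combinatorial term is easy to underestimate. The sketch below compares the number of subsets an exhaustive search would have to score, C(|Ω|, |S|), with the |S| · |Ω| order of work quoted for the greedy search later in the deck. The pool and subset sizes are hypothetical, as are the function names.

```python
import math

def exhaustive_subsets(pool_size, subset_size):
    """Number of candidate subsets of the given size: C(|Omega|, |S|)."""
    return math.comb(pool_size, subset_size)

def greedy_evaluations(pool_size, subset_size):
    """Order-of-magnitude work for incremental greedy search: |S| * |Omega|."""
    return subset_size * pool_size

pool_size, subset_size = 1000, 10  # hypothetical sizes
print(exhaustive_subsets(pool_size, subset_size))
print(greedy_evaluations(pool_size, subset_size))
```

Even for this modest pool the exhaustive count runs to more than twenty digits, while the greedy count is ten thousand.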

19 Factorize the Mutual Information
Mutual information for the multivariate variable $S_m$ and the target $h$:
$I(S_m; h) = \iint p(S_m, h) \log \frac{p(S_m, h)}{p(S_m)\,p(h)} \, dS_m \, dh$
Define:
$J(x_1, x_2, \ldots, x_m) = \int \cdots \int p(x_1, \ldots, x_m) \log \frac{p(x_1, x_2, \ldots, x_m)}{p(x_1) \cdots p(x_m)} \, dx_1 \cdots dx_m$
It can be proved:
$I(S_m, h) = J(h, S_m) - J(S_m)$
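The identity follows directly from the definition of J. A short derivation, using only the marginalization $p(S_m) = \int p(S_m, h)\,dh$ and writing $dS_m$ for $dx_1 \cdots dx_m$, could run as follows:

```latex
\begin{aligned}
J(h, S_m) - J(S_m)
&= \int\!\cdots\!\int p(S_m, h)
   \log \frac{p(S_m, h)}{p(x_1)\cdots p(x_m)\,p(h)} \, dS_m \, dh
 - \int\!\cdots\!\int p(S_m)
   \log \frac{p(S_m)}{p(x_1)\cdots p(x_m)} \, dS_m \\
&= \int\!\cdots\!\int p(S_m, h) \left[
   \log \frac{p(S_m, h)}{p(x_1)\cdots p(x_m)\,p(h)}
 - \log \frac{p(S_m)}{p(x_1)\cdots p(x_m)} \right] dS_m \, dh \\
&= \iint p(S_m, h) \log \frac{p(S_m, h)}{p(S_m)\,p(h)} \, dS_m \, dh
 = I(S_m; h)
\end{aligned}
```

The second line rewrites the $J(S_m)$ integral over the joint density, which is valid because integrating $p(S_m, h)$ over $h$ returns $p(S_m)$; the product of marginals $p(x_1)\cdots p(x_m)$ then cancels inside the logarithm.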

20 Factorize $I(S_m, h)$
$I(S_m, h) = J(h, S_m) - J(S_m)$
- Relevance of $S = \{x_1, x_2, \ldots\}$ and $h$, or $RL(S, h)$: $RL = \frac{1}{|S|} \sum_{x_i \in S} I(x_i, h)$
- Redundancy among variables $\{x_1, x_2, \ldots\}$, or $RD(S)$: $RD = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i, x_j)$
For incremental search, $\max I(S, h)$ is equivalent to $\max [RL(S, h) - RD(S)]$, the so-called min-Redundancy-Max-Relevance (mRMR).

21 Advantages of mRMR
- Both relevance and redundancy estimation are low-dimensional problems (i.e. involving only 2 variables). This is much easier than directly estimating multivariate density or mutual information in the high-dimensional space!
  - Fast speed
  - More reliable estimation
- mRMR is an optimal first-order approximation of I(.) maximization
- Relevance-only ranking only maximizes J(.)!

22 Search Algorithm of mRMR
Greedy search algorithm
- In the pool $\Omega$, find the variable $x_1$ that has the largest $I(x_1, h)$. Exclude $x_1$ from $\Omega$.
- Search $x_2$ so that it maximizes $I(x_2, h) - I(\cdot, x_2) / |\Omega|$
- Iterate this process until an expected number of variables have been obtained, or other constraints are satisfied
Complexity: $O(|S| \cdot |\Omega|)$
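The greedy search can be sketched for discrete features as below. This is an illustrative reading of the steps above, not the authors' program: it uses the additive (difference) form, a plug-in mutual information estimate, and it averages the redundancy term over the features already selected, which is an assumption about the normalization. The names `mi` and `mrmr` and the toy data are hypothetical.

```python
import math
from collections import Counter

def mi(xs, ys):
    """Plug-in mutual information (nats) between two discrete sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(c / n * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())

def mrmr(features, h, k):
    """Greedily pick up to k feature names by relevance minus mean redundancy."""
    pool = dict(features)
    relevance = {name: mi(col, h) for name, col in pool.items()}
    # Step 1: the single most relevant feature, removed from the pool
    first = max(pool, key=relevance.get)
    selected = [first]
    del pool[first]
    # Step 2 onward: trade relevance against redundancy with the selected set
    while pool and len(selected) < k:
        def score(name):
            red = sum(mi(features[name], features[s]) for s in selected)
            return relevance[name] - red / len(selected)
        best = max(pool, key=score)
        selected.append(best)
        del pool[best]
    return selected

# Hypothetical toy data: g2 is a coarsened copy of g1, g3 adds new information
h = [0, 0, 1, 1, 2, 2, 3, 3]
features = {
    "g1": [0, 0, 0, 0, 1, 1, 2, 2],
    "g2": [0, 0, 0, 0, 1, 1, 1, 1],
    "g3": [0, 0, 1, 1, 0, 0, 1, 1],
}
print(mrmr(features, h, 2))  # ['g1', 'g3']
```

Here g2 and g3 are equally relevant to h, but g2 is fully determined by g1, so its redundancy cancels its relevance and g3 is chosen second.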

23 Comparing Max-Dep and mRMR: Complexity of Feature Selection

24 Comparing Max-Dep and mRMR: Accuracy of Features Selected in Classification
Leave-one-out cross validation classification accuracies of features selected by mRMR and MaxDep

25 Use Wrappers to Refine Features
mRMR is a filter approach
- Fast
- Features might be redundant
- Independent of the classifier
Wrappers seek to minimize the number of errors directly
- Slow
- Features are less robust
- Dependent on the classifier
- Better prediction accuracy
Use mRMR first to generate a short feature pool, then use wrappers to get a least redundant feature set with better accuracy

26 Use Wrappers to Refine Features
- Forward wrappers (incremental selection)
- Backward wrappers (decremental selection)
- NCI data
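A forward wrapper adds one feature at a time, keeping the addition only while the classifier's error drops. The sketch below is one possible reading, not the procedure from the slides: it scores candidate sets by LOOCV error of a 1-nearest-neighbour classifier and stops at the first step that brings no improvement. The classifier choice, the stopping rule, the function names, and the data are all assumptions.

```python
def loocv_error(columns, y):
    """LOOCV error count of a 1-nearest-neighbour classifier on feature columns."""
    n = len(y)
    errors = 0
    for i in range(n):
        # closest other sample by squared Euclidean distance over the columns
        nearest = min((j for j in range(n) if j != i),
                      key=lambda j: sum((col[i] - col[j]) ** 2 for col in columns))
        errors += y[nearest] != y[i]
    return errors

def forward_wrapper(pool, y):
    """Incrementally add the feature that lowers LOOCV error most; stop when none does."""
    selected, best_err = [], len(y) + 1
    remaining = list(pool)
    while remaining:
        # ties in error are broken by feature name, keeping the result deterministic
        err, name = min((loocv_error([pool[s] for s in selected + [f]], y), f)
                        for f in remaining)
        if err >= best_err:
            break
        selected.append(name)
        remaining.remove(name)
        best_err = err
    return selected, best_err

# Hypothetical short pool, e.g. as produced by an mRMR pre-selection step
y = [0, 0, 0, 1, 1, 1]
pool = {
    "f1": [0.0, 0.1, 0.2, 1.0, 1.1, 1.2],   # separates the classes
    "f2": [0.5, 0.9, 0.1, 0.6, 0.2, 0.8],   # noise
}
print(forward_wrapper(pool, y))  # (['f1'], 0)
```

A backward wrapper would start from the full pool and remove features under the same error criterion.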

27 Conclusions
- The Max-Dependency feature selection can be efficiently implemented as the mRMR algorithm
- mRMR significantly outperforms the widely used max-relevance selection method: mRMR features cover a broader feature space with fewer features
- mRMR is very efficient and useful for gene selection and many other applications. The programs are ready!
