Towards robust feature selection for high-dimensional, small sample settings

1 Towards robust feature selection for high-dimensional, small sample settings
Yvan Saeys, Bioinformatics and Evolutionary Genomics, Ghent University, Belgium
Marseille, January 14th, 2010

2 Background: biomarker discovery
- A common task in computational biology: find the entities that best explain phenotypic differences.
- Challenges: many possible biomarkers (high dimensionality); only very few biomarkers are important for the specific phenotypic difference; very few samples.
- Examples: microarray data, mass spectrometry data, SNP data.

3-4 Dimensionality reduction techniques
- Feature selection techniques: subset selection, feature ranking, feature weighting.
- Feature transformation techniques: projection (PCA, LDA), compression (Fourier transform, wavelet transform).
- Unlike transformation, feature selection preserves the original semantics of the features!

5 Casting the problem as a feature selection task
- Feature selection is a way to avoid the curse of dimensionality.
- Improve model performance: classification (maximize accuracy, AUC), clustering (improve cluster detection: AIC, BIC, sum of squares, various indices), regression (improve fit: sum of squared errors).
- Faster and more cost-effective models.
- Improve generalization performance (avoid overfitting).
- Gain deeper insight into the processes that generated the data (especially important in bioinformatics).

6-9 The need for robust marker selection algorithms
Running the same marker selection algorithm on slightly different data can yield very different results:
- Ranked gene list 1: gene A, gene B, gene C, gene D, gene E
- Ranked gene list 2: gene X, gene A, gene W, gene Y, gene C

10 The need for robust marker selection algorithms
Motivation:
- Highly variable marker ranking algorithms decrease the confidence of a domain expert.
- We need to quantify the stability of a ranking algorithm and use it as an additional criterion next to predictive power.
- More robust rankings yield a higher chance of representing biologically relevant markers.
- Focus: quantifying and increasing marker stability within one data source.

11 Formalizing feature selection robustness
Definition: Consider a dataset $D = \{x_1, \ldots, x_M\}$, $x_i = (x_i^1, \ldots, x_i^N)$, with $M$ instances and $N$ features. A feature selection algorithm can then be defined as a mapping $F : D \mapsto f$ from $D$ to an $N$-dimensional vector $f = (f_1, \ldots, f_N)$, where
1. weighting: $f_i = w_i$ denotes the weight of feature $i$;
2. ranking: $f_i \in \{1, 2, \ldots, N\}$ denotes the rank of feature $i$;
3. subset selection: $f_i \in \{0, 1\}$ denotes the exclusion/inclusion of feature $i$ in the selected subset.
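As a small illustration (the values below are invented for the example, not taken from the talk), the three output types can be encoded as plain vectors:

```python
# Three possible outputs of a feature selector for N = 5 features.
weights = [0.9, 0.1, 0.4, 0.7, 0.2]   # weighting: f_i = w_i
ranking = [1, 5, 3, 2, 4]             # ranking: f_i in {1, ..., N}
subset  = [1, 0, 0, 1, 0]             # subset selection: f_i in {0, 1}
```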

12 Formalizing feature selection robustness
Research questions:
1. How stable are current feature selection techniques in high-dimensional, small-sample settings? Analyze the sensitivity of robustness to signature size and model parameters.
2. Can we increase the robustness of feature selection in this setting?
Definition: A feature selection algorithm is stable if small variations in the input (training data) result in small variations in the output (selected features): $F$ is stable iff for every $D'$ obtained by slightly perturbing $D$, the outputs remain similar, i.e. $1 - S(f, f') < \epsilon$.
Methodological requirements:
1. A framework to generate small changes in the training data.
2. Similarity measures for feature weightings/rankings/subsets.

13 Generating training set variations
A subsampling approach: draw $k$ subsamples of size $xM$ ($0 < x < 1$) randomly without replacement from $D$; the parameters $k$ and $x$ can be varied. In our experiments: $k = 500$, $x = 0.9$.
Algorithm:
1. Generate $k$ subsamples of size $xM$: $\{D_1, \ldots, D_k\}$.
2. Run the basic feature selector $F$ on each of the $k$ subsamples: $\forall k : F(D_k) = f_k$.
3. Perform all $k(k-1)/2$ pairwise comparisons and average over them:
$$\mathrm{Stab}(F) = \frac{2 \sum_{i=1}^{k} \sum_{j=i+1}^{k} S(f_i, f_j)}{k(k-1)}$$
where $S(\cdot, \cdot)$ denotes an appropriate similarity function between weightings/rankings/subsets (see the sketch below).
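A minimal sketch of this stability estimate, assuming a generic `selector` that maps a training set to an output vector and a `similarity` function as defined on the next slide (both names are illustrative, not from the talk):

```python
import numpy as np

def stability(X, y, selector, similarity, k=500, x=0.9, seed=0):
    """Average pairwise similarity S(f_i, f_j) over k subsamples of size x*M."""
    rng = np.random.default_rng(seed)
    M = X.shape[0]
    outputs = []
    for _ in range(k):
        # Draw a subsample of size x*M without replacement, as on the slide.
        idx = rng.choice(M, size=int(x * M), replace=False)
        outputs.append(selector(X[idx], y[idx]))
    # Average over all k(k-1)/2 pairwise comparisons.
    sims = [similarity(outputs[i], outputs[j])
            for i in range(k) for j in range(i + 1, k)]
    return 2.0 * sum(sims) / (k * (k - 1))
```

Note that $k = 500$ yields roughly 125,000 pairwise comparisons, so the similarity function should be cheap to evaluate.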

14 Similarity measures for feature selection outputs
1. Weighting (Pearson correlation coefficient):
$$S(f_i, f_j) = \frac{\sum_l (f_i^l - \mu_{f_i})(f_j^l - \mu_{f_j})}{\sqrt{\sum_l (f_i^l - \mu_{f_i})^2 \; \sum_l (f_j^l - \mu_{f_j})^2}}$$
2. Ranking (Spearman rank correlation coefficient):
$$S(f_i, f_j) = 1 - \frac{6 \sum_l (f_i^l - f_j^l)^2}{N(N^2 - 1)}$$
3. Subset selection (Jaccard index):
$$S(f_i, f_j) = \frac{|f_i \cap f_j|}{|f_i \cup f_j|} = \frac{\sum_l I(f_i^l = f_j^l = 1)}{\sum_l I(f_i^l + f_j^l > 0)}$$
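The three measures translate directly to code; a sketch (the Spearman form below assumes complete rankings without ties, as in the formula above):

```python
import numpy as np

def pearson_similarity(w1, w2):
    """Pearson correlation between two feature weight vectors."""
    return float(np.corrcoef(w1, w2)[0, 1])

def spearman_similarity(r1, r2):
    """Spearman rank correlation between two full rankings of N features."""
    r1, r2 = np.asarray(r1, float), np.asarray(r2, float)
    N = len(r1)
    return 1.0 - 6.0 * np.sum((r1 - r2) ** 2) / (N * (N ** 2 - 1))

def jaccard_similarity(m1, m2):
    """Jaccard index between two 0/1 feature inclusion masks."""
    m1, m2 = np.asarray(m1, bool), np.asarray(m2, bool)
    return float(np.sum(m1 & m2) / np.sum(m1 | m2))
```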

15 Kuncheva's index for comparing feature subsets
Definition: Let $A$ and $B$ be subsets of features, both of the same cardinality $s$, and let $r = |A \cap B|$.
Requirements for a desirable stability index for feature subsets:
1. Monotonicity: for a fixed subset size $s$ and number of features $N$, the larger the intersection between the subsets, the higher the value of the consistency index.
2. Limits: the index should be bounded by constants that do not depend on $N$ or $s$; the maximum should be attained when the subsets are identical ($r = s$).
3. Correction for chance: the index should have a constant expected value for independently drawn subsets of the same cardinality $s$.

16 Kuncheva's index for comparing feature subsets
General form of the index:
$$\frac{\text{Observed } r - \text{Expected } r}{\text{Maximum } r - \text{Expected } r}$$
For randomly drawn $A$ and $B$, the number of elements of $A$ that are also selected in $B$ is a random variable $Y$ with a hypergeometric distribution:
$$P(Y = r) = \frac{\binom{s}{r}\binom{N-s}{s-r}}{\binom{N}{s}}$$
The expected value of $Y$ for given $s$ and $N$ is $s^2/N$. Thus define
$$KI(A, B) = \frac{r - \frac{s^2}{N}}{s - \frac{s^2}{N}} = \frac{rN - s^2}{s(N - s)}$$
$KI$ is bounded by $-1 \leq KI \leq 1$ [Kuncheva (2007)].
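A direct implementation of the closed form, with a small worked example in the comments:

```python
def kuncheva_index(A, B, N):
    """Kuncheva's consistency index for two feature subsets of equal size s,
    drawn from N features: KI = (r*N - s^2) / (s * (N - s))."""
    A, B = set(A), set(B)
    s = len(A)
    assert len(B) == s and 0 < s < N, "subsets must have equal size 0 < s < N"
    r = len(A & B)
    return (r * N - s ** 2) / (s * (N - s))

# Example: s = 3, r = 2, N = 10 gives (2*10 - 9) / (3*7) = 11/21 ~ 0.52,
# while two disjoint subsets of size 3 give (0 - 9) / 21 ~ -0.43.
```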

17 Improving feature selection robustness
The methodology is based on ensemble methods for classification: can we transfer this idea to feature selection?
- Previous work: use feature selection to construct a classifier ensemble (works of Cherkauer, Opitz, Tsymbal and Cunningham).
- This work: use ensemble methods to perform feature selection (ensemble feature selection).
Research questions:
- Can we improve feature selection robustness/stability using ensembles of feature selectors?
- Are the statistical, computational and representational aspects of ensemble learning transferable to feature selection?
- How does it affect classification performance?

18-22 Components of ensemble feature selection
Pipeline: the training set is fed to T feature selection algorithms (1, 2, ..., T), each producing its own ranked list (1, 2, ..., T); an aggregation operator then combines these lists into a single consensus ranked list.

23-24 Components of ensemble feature selection
Variation in the feature selectors:
- Choosing different feature selection techniques.
- Dataset perturbation: instance-level perturbation, feature-level perturbation.
- Stochasticity in the feature selector.
- Bayesian model averaging.
- Combinations of these techniques.
Aggregation of the results into a single output:
- Rank aggregation.
- Weighted rank aggregation.
- Score aggregation.
- Counting the most frequently selected features.

25 Overview: two case studies
1. Bagging-based ensemble feature selection: microarray data sets; feature ranking approach; rank aggregation method.
2. Ensemble feature selection using model stochasticity: mass spectrometry data sets; feature selection approach; subset aggregation approach.

26 Case study 1: Bagging-based ensemble feature selection
Generate feature selection diversity by instance perturbation (bootstrapping), as sketched below:
- Generate $t$ datasets by sampling the training set with replacement.
- For each dataset, apply a feature selection algorithm (e.g. a ranker): $EFS = \{F_1, F_2, \ldots, F_t\}$.
- Each feature selector $F_i$ results in a ranking $f_i = (f_i^1, \ldots, f_i^N)$, where $f_i^j$ denotes the rank of feature $j$ in bootstrap $i$.
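A minimal sketch of the bootstrap step, assuming a base `ranker` that maps a training set to a length-N rank vector (the name and the default for `t` are illustrative):

```python
import numpy as np

def bootstrap_rankings(X, y, ranker, t=40, seed=0):
    """Apply a base ranker to t bootstrap resamples of the training set;
    returns a (t, N) array with one ranking per bootstrap."""
    rng = np.random.default_rng(seed)
    M = X.shape[0]
    rankings = []
    for _ in range(t):
        idx = rng.choice(M, size=M, replace=True)   # sample with replacement
        rankings.append(ranker(X[idx], y[idx]))
    return np.asarray(rankings)
```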

27 Aggregation methods
Rank aggregation:
$$f = \left(\sum_{i=1}^{t} w_i f_i^1, \; \ldots, \; \sum_{i=1}^{t} w_i f_i^N\right)$$
- Complete linear aggregation (CLA): $w_i = 1$.
- Complete weighted aggregation (CWA): $w_i = \text{OO-AUC}_i$.
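A sketch of this aggregation step; uniform weights give CLA, while per-bootstrap weights (assuming OO-AUC denotes each bootstrap's out-of-bag AUC) give CWA:

```python
import numpy as np

def aggregate_ranks(rankings, weights=None):
    """Combine t rankings (a (t, N) array of ranks) by weighted rank
    summation into a single consensus ranking (1 = best)."""
    R = np.asarray(rankings, dtype=float)
    w = np.ones(len(R)) if weights is None else np.asarray(weights, float)
    scores = (w[:, None] * R).sum(axis=0)   # lower summed rank = better
    return scores.argsort().argsort() + 1   # convert scores to 1-based ranks

# Usage with the bootstrap step above (hypothetical ranker and AUCs):
# R = bootstrap_rankings(X, y, ranker, t=40)
# consensus = aggregate_ranks(R)            # CLA
# consensus = aggregate_ranks(R, oob_aucs)  # CWA, given out-of-bag AUCs
```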

28-31 Overview methodology
- The full data set (100% of the samples) is subsampled into K subsamples of 90% each; a marker selection algorithm applied to each subsample yields ranked lists 1, ..., K, over which stability is measured.
- Baseline: each 90% subsample is processed by a single run of the marker selection algorithm.
- Ensemble: each 90% subsample is further bootstrapped into T bootstraps; the marker selection algorithm is run on every bootstrap (ranked lists A, B, ..., T), and a consensus marker selection step aggregates these into one consensus ranked list per subsample.

32 Experiments
Microarray datasets: Colon (Alon et al., 1999), Leukemia (Golub et al., 1999), Lymphoma (Alizadeh et al., 2000), Prostate (Singh et al., 2002).
Baseline classifier / feature selection algorithm: linear SVM with SVM Recursive Feature Elimination (RFE; Guyon et al., 2002), sketched in code below:
1. Train a linear SVM on the full feature set.
2. Rank the features based on $|w|$.
3. Eliminate the 50% worst features.
4. Retrain the SVM on the remaining features.
5. Go to step 2.
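A minimal SVM-RFE sketch using scikit-learn (assumed available); the talk specifies only the 50% elimination step, so the other hyperparameters below are illustrative:

```python
import numpy as np
from sklearn.svm import LinearSVC

def svm_rfe(X, y, n_keep=16, elim_frac=0.5):
    """Repeatedly train a linear SVM, rank features by |w|, and drop the
    worst elim_frac of the surviving features until n_keep remain."""
    remaining = np.arange(X.shape[1])
    while len(remaining) > n_keep:
        svm = LinearSVC(C=1.0).fit(X[:, remaining], y)
        order = np.argsort(np.abs(svm.coef_).ravel())   # worst features first
        n_drop = min(max(1, int(elim_frac * len(remaining))),
                     len(remaining) - n_keep)
        remaining = remaining[np.sort(order[n_drop:])]  # keep the survivors
    return remaining   # indices of the selected features
```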

33 Results: stability distributions
[Figure: histograms of pairwise stability values for the colon, leukemia, lymphoma and prostate datasets, comparing ensemble and baseline.]

34 Results: stability
[Figure: Kuncheva index versus percentage of selected features for CLA, CWA and the baseline on Colon, Leukemia, Lymphoma and Prostate.]

35 Results: classification performance
[Figure: AUC versus percentage of selected features for CLA, CWA and the baseline on Colon, Leukemia, Lymphoma and Prostate.]

36 Bagging-based EFS: first conclusions
- Ensemble feature selection (EFS) improves model performance: more stable biomarker selection and increased predictive performance.
- EFS is easy to parallelize.
- As signature sizes get smaller, EFS progressively improves upon the baseline; robust, small signatures are interesting candidates for prognostic tests.
- The linear aggregation method is preferred.

37 Sensitivity analysis: number of bootstraps (effect on stability)
[Figure: Kuncheva index versus percentage of selected features for different numbers of bootstraps (20, 60, ...) and the baseline, on the four datasets.]

38 Sensitivity analysis: number of bootstraps (effect on classification performance)
[Figure: AUC versus percentage of selected features for different numbers of bootstraps (20, 60, ...) and the baseline, on the four datasets.]

39 Sensitivity analysis: RFE elimination percentage (effect on stability)
[Figure: Kuncheva index versus percentage of selected features for CLA and the baseline with elimination percentages E = 20%, 50% and 100%, on the four datasets.]

40 Sensitivity analysis: RFE elimination percentage (effect on classification performance)
[Figure: AUC versus percentage of selected features for CLA and the baseline with elimination percentages E = 20%, 50% and 100%, on the four datasets.]

41 Bagging-based EFS: final conclusions
- Ensemble feature selection (EFS) improves model performance: more stable biomarker selection and increased predictive performance.
- The number of bootstraps only affects stability.
- The RFE elimination percentage does not affect EFS, but has a strong impact on the baseline: the single-run SVM performs best in terms of stability, with a smaller impact on classification performance.

42-43 Case study 2: Ensemble FS using model stochasticity
Traditional approach:
- Run a stochastic FS method many times (e.g. MCMC, genetic algorithms, stochastic iterative sampling).
- Compare all feature subsets found.
- Make a final selection: the intersection of the results, or the most frequently selected features.
Computationally more efficient approach:
- Don't use only the single best results of the sampling procedure; average over the whole distribution.

44 Estimation of distribution algorithms (EDAs)
Instead of working on one solution, work on a set of solutions (a distribution). Use stochastic iterative sampling, combined with probabilistic graphical models, to model good solutions:
1. Generate an initial solution set S0.
2. Select a number of samples.
3. Estimate the probability distribution.
4. Generate new samples by sampling the estimated distribution.
5. Create a new solution set; if the termination criteria are not met, go to step 2, otherwise stop.

45 Estimating the probability distribution
Different EDAs use probabilistic models of increasing complexity over the feature indicator variables $X_1, \ldots, X_8$ (example with eight features):
- UMDA: univariate model, all variables independent: $p(x) = \prod_i p(x_i)$.
- BMDA: bivariate model with pairwise dependencies, e.g. $p(x_4 \mid x_1)$, $p(x_5 \mid x_3)$, $p(x_6 \mid x_4)$, $p(x_7 \mid x_3)$, $p(x_8 \mid x_5)$.
- BOA, EBNA: multivariate model (Bayesian network) in which a variable may depend on several parents, e.g. $p(x_5 \mid x_3, x_4)$.
A UMDA-style loop is sketched below.
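A toy UMDA loop for feature subset selection, following the scheme on the previous slide; the `fitness` function (e.g. cross-validated accuracy of a classifier on the encoded subset) is an assumed input:

```python
import numpy as np

def umda(fitness, n_features, pop_size=100, n_select=50, n_gens=30, seed=0):
    """Univariate marginal distribution algorithm over 0/1 feature masks."""
    rng = np.random.default_rng(seed)
    p = np.full(n_features, 0.5)                      # univariate inclusion probs
    for _ in range(n_gens):
        pop = rng.random((pop_size, n_features)) < p  # sample 0/1 solutions
        scores = np.array([fitness(ind) for ind in pop])
        elite = pop[np.argsort(scores)[-n_select:]]   # keep the best solutions
        p = elite.mean(axis=0).clip(0.05, 0.95)       # re-estimate each p(x_i)
    # Marginal selection probabilities: averaging over the whole distribution,
    # usable e.g. for peak frequency plots.
    return p
```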

46 Experiments
Mass spectrometry datasets:
- Ovarian cancer profiling (SDR 0.0044): Petricoin et al. (2002)
- Detection of drug-induced toxicity (SDR 0.00137): Petricoin et al. (2004)
- Hepatocellular carcinoma (SDR 0.0041): Ressom et al. (2006)
Estimation of distribution algorithms: UMDA, BMDA. Classifiers: naive Bayes, k-NN, SVM. All EDA results are averaged over 500 multistarts.

47 Results [preliminary]
Usage for knowledge discovery: peak frequency plots.

48-50 Future challenges
- Better dealing with correlated features: first cluster correlated features, then choose representatives from each cluster and build a model with the representatives; adapt the similarity measures to deal with correlated features.
- Increasing stability by transfer learning: assume two related datasets $D_1$ and $D_2$, and use feature selection on $D_1$ as a prior for feature selection on $D_2$. Preliminary research shows that this transfer of feature selection information increases the stability of feature selection on $D_2$ [Helleputte and Dupont (2009)].
- A comparative evaluation of different ensemble FS techniques.

51 Acknowledgements
Thomas Abeel (Ghent University), Yves Van de Peer (Ghent University), Thibault Helleputte (UC Louvain), Pierre Dupont (UC Louvain), Ruben Armañanzas (Universidad Politécnica de Madrid), Iñaki Inza (University of the Basque Country), Pedro Larrañaga (Universidad Politécnica de Madrid).

52 References
Alizadeh, A., Eisen, M., Davis, R., Ma, C., Lossos, I., Rosenwald, A., Boldrick, J., Sabet, H., et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403.
Alon, U., Barkai, N., Notterman, D., Gish, K., Ybarra, S., Mack, D., & Levine, A. (1999). Broad patterns of gene expression revealed by clustering of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA, 96.
Golub, T., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., et al. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286.
Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46.
Helleputte, T., & Dupont, P. (2009). Feature selection by transfer learning with linear regularized models. Lecture Notes in Artificial Intelligence, 5781.
Kuncheva, L. (2007). A stability index for feature selection. Proceedings of the 25th International Multi-Conference on Artificial Intelligence and Applications.
Petricoin, E. F., Ardekani, A. M., Hitt, B. A., Levine, P. J., Fusaro, V. A., Steinberg, S. M., Mills, G. B., Simone, C., Fishman, D. A., Kohn, E. C., & Liotta, L. A. (2002). Use of proteomic patterns in serum to identify ovarian cancer. Lancet, 359.
Petricoin, E. F., Rajapaske, V., Herman, E. H., Arekani, A. M., Ross, S., Johann, D., Knapton, A., Zhang, J., Hitt, B. A., Conrads, T. P., Veenstra, T. D., Liotta, L. A., & Sistare, F. D. (2004). Toxicoproteomics: Serum proteomic pattern diagnostics for early detection of drug induced cardiac toxicities and cardioprotection. Toxicologic Pathology, 32.
Ressom, H. W., Varghese, R. S., Orvisky, E., Drake, S. K., Hortin, G. L., Abdel-Hamid, M., Loffredo, C. A., & Goldman, R. (2006). Ant colony optimization for biomarker identification from MALDI-TOF mass spectra. Proceedings of the 28th International Conference of the IEEE Engineering in Medicine and Biology Society.
Singh, D., Febbo, P., Ross, K., Jackson, D., Manola, J., Ladd, C., Tamayo, P., Renshaw, A., D'Amico, A., Richie, J., Lander, E., Loda, M., Kantoff, P., Golub, T. R., & Sellers, W. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1.
