Towards robust feature selection for high-dimensional, small sample settings

1 Towards robust feature selection for high-dimensional, small sample settings
Yvan Saeys, Bioinformatics and Evolutionary Genomics, Ghent University, Belgium
Marseille, January 14th, 2010

2 Background: biomarker discovery
- A common task in computational biology: find the entities that best explain phenotypic differences.
- Challenges: many possible biomarkers (high dimensionality); only very few biomarkers are important for the specific phenotypic difference; very few samples.
- Examples: microarray data, mass spectrometry data, SNP data.

3-4 Dimensionality reduction techniques
- Feature selection techniques: subset selection, feature ranking, feature weighting.
- Feature transformation techniques: projection (PCA, LDA), compression (Fourier transform, wavelet transform).
- Unlike transformation, feature selection preserves the original semantics of the features!

5 Casting the problem as a feature selection task
- Feature selection is a way to avoid the curse of dimensionality.
- Improve model performance: classification (maximize accuracy, AUC), clustering (improve cluster detection: AIC, BIC, sum of squares, various indices), regression (improve fit: sum of squared errors).
- Faster and more cost-effective models.
- Improve generalization performance (avoid overfitting).
- Gain deeper insight into the processes that generated the data (especially important in bioinformatics).

6-9 The need for robust marker selection algorithms
Running the same marker selection algorithm on slightly different data can yield very different results:
- Ranked gene list 1: gene A, gene B, gene C, gene D, gene E
- Ranked gene list 2: gene X, gene A, gene W, gene Y, gene C

10 The need for robust marker selection algorithms
Motivation:
- Highly variable marker ranking algorithms decrease the confidence of a domain expert.
- We need to quantify the stability of a ranking algorithm and use it as an additional criterion next to predictive power.
- More robust rankings yield a higher chance of representing biologically relevant markers.
- Focus: quantifying and increasing marker stability within one data source.

11 Formalizing feature selection robustness
Definition: Consider a dataset $D = \{x_1, \ldots, x_M\}$, $x_i = (x_i^1, \ldots, x_i^N)$, with $M$ instances and $N$ features. A feature selection algorithm can then be defined as a mapping $F : D \mapsto f$ from $D$ to an $N$-dimensional vector $f = (f_1, \ldots, f_N)$, where
1. weighting: $f_i = w_i$ denotes the weight of feature $i$;
2. ranking: $f_i \in \{1, 2, \ldots, N\}$ denotes the rank of feature $i$;
3. subset selection: $f_i \in \{0, 1\}$ denotes the exclusion/inclusion of feature $i$ in the selected subset.
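As a small illustration (the values below are invented for the example, not taken from the talk), the three output types can be encoded as plain vectors:

```python
# Three possible outputs of a feature selector for N = 5 features.
weights = [0.9, 0.1, 0.4, 0.7, 0.2]   # weighting: f_i = w_i
ranking = [1, 5, 3, 2, 4]             # ranking: f_i in {1, ..., N}
subset  = [1, 0, 0, 1, 0]             # subset selection: f_i in {0, 1}
```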

12 Formalizing feature selection robustness
Research questions:
1. How stable are current feature selection techniques in high-dimensional, small-sample settings? Analyze the sensitivity of robustness to signature size and model parameters.
2. Can we increase the robustness of feature selection in this setting?
Definition: A feature selection algorithm is stable if small variations in the input (training data) result in small variations in the output (selected features): $F$ is stable iff for every $D'$ obtained by slightly perturbing $D$, the outputs remain similar, i.e. $1 - S(f, f') < \epsilon$.
Methodological requirements:
1. A framework to generate small changes in the training data.
2. Similarity measures for feature weightings/rankings/subsets.

13 Generating training set variations
A subsampling approach: draw $k$ subsamples of size $xM$ ($0 < x < 1$) randomly without replacement from $D$; the parameters $k$ and $x$ can be varied. In our experiments: $k = 500$, $x = 0.9$.
Algorithm:
1. Generate $k$ subsamples of size $xM$: $\{D_1, \ldots, D_k\}$.
2. Run the basic feature selector $F$ on each of the $k$ subsamples: $\forall k : F(D_k) = f_k$.
3. Perform all $k(k-1)/2$ pairwise comparisons and average over them:
$$\mathrm{Stab}(F) = \frac{2 \sum_{i=1}^{k} \sum_{j=i+1}^{k} S(f_i, f_j)}{k(k-1)}$$
where $S(\cdot, \cdot)$ denotes an appropriate similarity function between weightings/rankings/subsets (see the sketch below).
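A minimal sketch of this stability estimate, assuming a generic `selector` that maps a training set to an output vector and a `similarity` function as defined on the next slide (both names are illustrative, not from the talk):

```python
import numpy as np

def stability(X, y, selector, similarity, k=500, x=0.9, seed=0):
    """Average pairwise similarity S(f_i, f_j) over k subsamples of size x*M."""
    rng = np.random.default_rng(seed)
    M = X.shape[0]
    outputs = []
    for _ in range(k):
        # Draw a subsample of size x*M without replacement, as on the slide.
        idx = rng.choice(M, size=int(x * M), replace=False)
        outputs.append(selector(X[idx], y[idx]))
    # Average over all k(k-1)/2 pairwise comparisons.
    sims = [similarity(outputs[i], outputs[j])
            for i in range(k) for j in range(i + 1, k)]
    return 2.0 * sum(sims) / (k * (k - 1))
```

Note that $k = 500$ yields roughly 125,000 pairwise comparisons, so the similarity function should be cheap to evaluate.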

14 Similarity measures for feature selection outputs
1. Weighting (Pearson correlation coefficient):
$$S(f_i, f_j) = \frac{\sum_l (f_i^l - \mu_{f_i})(f_j^l - \mu_{f_j})}{\sqrt{\sum_l (f_i^l - \mu_{f_i})^2 \; \sum_l (f_j^l - \mu_{f_j})^2}}$$
2. Ranking (Spearman rank correlation coefficient):
$$S(f_i, f_j) = 1 - \frac{6 \sum_l (f_i^l - f_j^l)^2}{N(N^2 - 1)}$$
3. Subset selection (Jaccard index):
$$S(f_i, f_j) = \frac{|f_i \cap f_j|}{|f_i \cup f_j|} = \frac{\sum_l I(f_i^l = f_j^l = 1)}{\sum_l I(f_i^l + f_j^l > 0)}$$
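The three measures translate directly to code; a sketch (the Spearman form below assumes complete rankings without ties, as in the formula above):

```python
import numpy as np

def pearson_similarity(w1, w2):
    """Pearson correlation between two feature weight vectors."""
    return float(np.corrcoef(w1, w2)[0, 1])

def spearman_similarity(r1, r2):
    """Spearman rank correlation between two full rankings of N features."""
    r1, r2 = np.asarray(r1, float), np.asarray(r2, float)
    N = len(r1)
    return 1.0 - 6.0 * np.sum((r1 - r2) ** 2) / (N * (N ** 2 - 1))

def jaccard_similarity(m1, m2):
    """Jaccard index between two 0/1 feature inclusion masks."""
    m1, m2 = np.asarray(m1, bool), np.asarray(m2, bool)
    return float(np.sum(m1 & m2) / np.sum(m1 | m2))
```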

15 Kuncheva's index for comparing feature subsets
Definition: Let $A$ and $B$ be subsets of features, both of the same cardinality $s$, and let $r = |A \cap B|$.
Requirements for a desirable stability index for feature subsets:
1. Monotonicity: for a fixed subset size $s$ and number of features $N$, the larger the intersection between the subsets, the higher the value of the consistency index.
2. Limits: the index should be bounded by constants that do not depend on $N$ or $s$; the maximum should be attained when the subsets are identical ($r = s$).
3. Correction for chance: the index should have a constant expected value for independently drawn subsets of the same cardinality $s$.

16 Kuncheva's index for comparing feature subsets
General form of the index:
$$\frac{\text{Observed } r - \text{Expected } r}{\text{Maximum } r - \text{Expected } r}$$
For randomly drawn $A$ and $B$, the number of elements of $A$ that are also selected in $B$ is a random variable $Y$ with a hypergeometric distribution:
$$P(Y = r) = \frac{\binom{s}{r}\binom{N-s}{s-r}}{\binom{N}{s}}$$
The expected value of $Y$ for given $s$ and $N$ is $s^2/N$. Thus define
$$KI(A, B) = \frac{r - \frac{s^2}{N}}{s - \frac{s^2}{N}} = \frac{rN - s^2}{s(N - s)}$$
$KI$ is bounded by $-1 \leq KI \leq 1$ [Kuncheva (2007)].
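A direct implementation of the closed form, with a small worked example in the comments:

```python
def kuncheva_index(A, B, N):
    """Kuncheva's consistency index for two feature subsets of equal size s,
    drawn from N features: KI = (r*N - s^2) / (s * (N - s))."""
    A, B = set(A), set(B)
    s = len(A)
    assert len(B) == s and 0 < s < N, "subsets must have equal size 0 < s < N"
    r = len(A & B)
    return (r * N - s ** 2) / (s * (N - s))

# Example: s = 3, r = 2, N = 10 gives (2*10 - 9) / (3*7) = 11/21 ~ 0.52,
# while two disjoint subsets of size 3 give (0 - 9) / 21 ~ -0.43.
```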

17 Improving feature selection robustness
The methodology is based on ensemble methods for classification: can we transfer this idea to feature selection?
- Previous work: use feature selection to construct a classifier ensemble (works of Cherkauer, Opitz, Tsymbal and Cunningham).
- This work: use ensemble methods to perform feature selection (ensemble feature selection).
Research questions:
- Can we improve feature selection robustness/stability using ensembles of feature selectors?
- Are the statistical, computational and representational aspects of ensemble learning transferable to feature selection?
- How does it affect classification performance?

18-22 Components of ensemble feature selection
Pipeline: the training set is fed to T feature selection algorithms (1, 2, ..., T), each producing its own ranked list (1, 2, ..., T); an aggregation operator then combines these lists into a single consensus ranked list.

23-24 Components of ensemble feature selection
Variation in the feature selectors:
- Choosing different feature selection techniques.
- Dataset perturbation: instance-level perturbation, feature-level perturbation.
- Stochasticity in the feature selector.
- Bayesian model averaging.
- Combinations of these techniques.
Aggregation of the results into a single output:
- Rank aggregation.
- Weighted rank aggregation.
- Score aggregation.
- Counting the most frequently selected features.

25 Overview: two case studies
1. Bagging-based ensemble feature selection: microarray data sets; feature ranking approach; rank aggregation method.
2. Ensemble feature selection using model stochasticity: mass spectrometry data sets; feature selection approach; subset aggregation approach.

26 Case study 1: Bagging-based ensemble feature selection
Generate feature selection diversity by instance perturbation (bootstrapping), as sketched below:
- Generate $t$ datasets by sampling the training set with replacement.
- For each dataset, apply a feature selection algorithm (e.g. a ranker): $EFS = \{F_1, F_2, \ldots, F_t\}$.
- Each feature selector $F_i$ results in a ranking $f_i = (f_i^1, \ldots, f_i^N)$, where $f_i^j$ denotes the rank of feature $j$ in bootstrap $i$.
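A minimal sketch of the bootstrap step, assuming a base `ranker` that maps a training set to a length-N rank vector (the name and the default for `t` are illustrative):

```python
import numpy as np

def bootstrap_rankings(X, y, ranker, t=40, seed=0):
    """Apply a base ranker to t bootstrap resamples of the training set;
    returns a (t, N) array with one ranking per bootstrap."""
    rng = np.random.default_rng(seed)
    M = X.shape[0]
    rankings = []
    for _ in range(t):
        idx = rng.choice(M, size=M, replace=True)   # sample with replacement
        rankings.append(ranker(X[idx], y[idx]))
    return np.asarray(rankings)
```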

27 Aggregation methods
Rank aggregation:
$$f = \left(\sum_{i=1}^{t} w_i f_i^1, \; \ldots, \; \sum_{i=1}^{t} w_i f_i^N\right)$$
- Complete linear aggregation (CLA): $w_i = 1$.
- Complete weighted aggregation (CWA): $w_i = \text{OO-AUC}_i$.
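A sketch of this aggregation step; uniform weights give CLA, while per-bootstrap weights (assuming OO-AUC denotes each bootstrap's out-of-bag AUC) give CWA:

```python
import numpy as np

def aggregate_ranks(rankings, weights=None):
    """Combine t rankings (a (t, N) array of ranks) by weighted rank
    summation into a single consensus ranking (1 = best)."""
    R = np.asarray(rankings, dtype=float)
    w = np.ones(len(R)) if weights is None else np.asarray(weights, float)
    scores = (w[:, None] * R).sum(axis=0)   # lower summed rank = better
    return scores.argsort().argsort() + 1   # convert scores to 1-based ranks

# Usage with the bootstrap step above (hypothetical ranker and AUCs):
# R = bootstrap_rankings(X, y, ranker, t=40)
# consensus = aggregate_ranks(R)            # CLA
# consensus = aggregate_ranks(R, oob_aucs)  # CWA, given out-of-bag AUCs
```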

28-31 Overview methodology
- The full data set (100% of the samples) is subsampled into K subsamples of 90% each; a marker selection algorithm applied to each subsample yields ranked lists 1, ..., K, over which stability is measured.
- Baseline: each 90% subsample is processed by a single run of the marker selection algorithm.
- Ensemble: each 90% subsample is further bootstrapped into T bootstraps; the marker selection algorithm is run on every bootstrap (ranked lists A, B, ..., T), and a consensus marker selection step aggregates these into one consensus ranked list per subsample.

32 Experiments
Microarray datasets: Colon (Alon et al., 1999), Leukemia (Golub et al., 1999), Lymphoma (Alizadeh et al., 2000), Prostate (Singh et al., 2002).
Baseline classifier / feature selection algorithm: linear SVM with SVM Recursive Feature Elimination (RFE; Guyon et al., 2002), sketched in code below:
1. Train a linear SVM on the full feature set.
2. Rank the features based on $|w|$.
3. Eliminate the 50% worst features.
4. Retrain the SVM on the remaining features.
5. Go to step 2.
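A minimal SVM-RFE sketch using scikit-learn (assumed available); the talk specifies only the 50% elimination step, so the other hyperparameters below are illustrative:

```python
import numpy as np
from sklearn.svm import LinearSVC

def svm_rfe(X, y, n_keep=16, elim_frac=0.5):
    """Repeatedly train a linear SVM, rank features by |w|, and drop the
    worst elim_frac of the surviving features until n_keep remain."""
    remaining = np.arange(X.shape[1])
    while len(remaining) > n_keep:
        svm = LinearSVC(C=1.0).fit(X[:, remaining], y)
        order = np.argsort(np.abs(svm.coef_).ravel())   # worst features first
        n_drop = min(max(1, int(elim_frac * len(remaining))),
                     len(remaining) - n_keep)
        remaining = remaining[np.sort(order[n_drop:])]  # keep the survivors
    return remaining   # indices of the selected features
```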

33 Results: stability distributions
[Figure: histograms of pairwise stability values for the colon, leukemia, lymphoma and prostate datasets, comparing ensemble and baseline.]

34 Results: stability
[Figure: Kuncheva index versus percentage of selected features for CLA, CWA and the baseline on Colon, Leukemia, Lymphoma and Prostate.]

35 Results: classification performance
[Figure: AUC versus percentage of selected features for CLA, CWA and the baseline on Colon, Leukemia, Lymphoma and Prostate.]

36 Bagging-based EFS: first conclusions
- Ensemble feature selection (EFS) improves model performance: more stable biomarker selection and increased predictive performance.
- EFS is easy to parallelize.
- As signature sizes get smaller, EFS progressively improves upon the baseline; robust, small signatures are interesting candidates for prognostic tests.
- The linear aggregation method is preferred.

37 Sensitivity analysis: number of bootstraps (effect on stability)
[Figure: Kuncheva index versus percentage of selected features for different numbers of bootstraps (20, 60, ...) and the baseline, on the four datasets.]

38 Sensitivity analysis: number of bootstraps (effect on classification performance)
[Figure: AUC versus percentage of selected features for different numbers of bootstraps (20, 60, ...) and the baseline, on the four datasets.]

39 Sensitivity analysis: RFE elimination percentage (effect on stability)
[Figure: Kuncheva index versus percentage of selected features for CLA and the baseline with elimination percentages E = 20%, 50% and 100%, on the four datasets.]

40 Sensitivity analysis: RFE elimination percentage (effect on classification performance)
[Figure: AUC versus percentage of selected features for CLA and the baseline with elimination percentages E = 20%, 50% and 100%, on the four datasets.]

41 Bagging-based EFS: final conclusions
- Ensemble feature selection (EFS) improves model performance: more stable biomarker selection and increased predictive performance.
- The number of bootstraps only affects stability.
- The RFE elimination percentage does not affect EFS, but has a strong impact on the baseline: the single-run SVM performs best in terms of stability, with a smaller impact on classification performance.

42-43 Case study 2: Ensemble FS using model stochasticity
Traditional approach:
- Run a stochastic FS method many times (e.g. MCMC, genetic algorithms, stochastic iterative sampling).
- Compare all feature subsets found.
- Make a final selection: the intersection of the results, or the most frequently selected features.
Computationally more efficient approach:
- Don't use only the single best results of the sampling procedure; average over the whole distribution.

44 Estimation of distribution algorithms (EDAs)
Instead of working on one solution, work on a set of solutions (a distribution). Use stochastic iterative sampling, combined with probabilistic graphical models, to model good solutions:
1. Generate an initial solution set S0.
2. Select a number of samples.
3. Estimate the probability distribution.
4. Generate new samples by sampling the estimated distribution.
5. Create a new solution set; if the termination criteria are not met, go to step 2, otherwise stop.

45 Estimating the probability distribution
Different EDAs use probabilistic models of increasing complexity over the feature indicator variables $X_1, \ldots, X_8$ (example with eight features):
- UMDA: univariate model, all variables independent: $p(x) = \prod_i p(x_i)$.
- BMDA: bivariate model with pairwise dependencies, e.g. $p(x_4 \mid x_1)$, $p(x_5 \mid x_3)$, $p(x_6 \mid x_4)$, $p(x_7 \mid x_3)$, $p(x_8 \mid x_5)$.
- BOA, EBNA: multivariate model (Bayesian network) in which a variable may depend on several parents, e.g. $p(x_5 \mid x_3, x_4)$.
A UMDA-style loop is sketched below.
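A toy UMDA loop for feature subset selection, following the scheme on the previous slide; the `fitness` function (e.g. cross-validated accuracy of a classifier on the encoded subset) is an assumed input:

```python
import numpy as np

def umda(fitness, n_features, pop_size=100, n_select=50, n_gens=30, seed=0):
    """Univariate marginal distribution algorithm over 0/1 feature masks."""
    rng = np.random.default_rng(seed)
    p = np.full(n_features, 0.5)                      # univariate inclusion probs
    for _ in range(n_gens):
        pop = rng.random((pop_size, n_features)) < p  # sample 0/1 solutions
        scores = np.array([fitness(ind) for ind in pop])
        elite = pop[np.argsort(scores)[-n_select:]]   # keep the best solutions
        p = elite.mean(axis=0).clip(0.05, 0.95)       # re-estimate each p(x_i)
    # Marginal selection probabilities: averaging over the whole distribution,
    # usable e.g. for peak frequency plots.
    return p
```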

46 Experiments
Mass spectrometry datasets:
- Ovarian cancer profiling (SDR 0.0044): Petricoin et al. (2002)
- Detection of drug-induced toxicity (SDR 0.00137): Petricoin et al. (2004)
- Hepatocellular carcinoma (SDR 0.0041): Ressom et al. (2006)
Estimation of distribution algorithms: UMDA, BMDA. Classifiers: naive Bayes, k-NN, SVM. All EDA results are averaged over 500 multistarts.

47 Results [preliminary]
Usage for knowledge discovery: peak frequency plots.

48-50 Future challenges
- Better dealing with correlated features: first cluster correlated features, then choose representatives from each cluster and build a model with the representatives; adapt the similarity measures to deal with correlated features.
- Increasing stability by transfer learning: assume two related datasets $D_1$ and $D_2$, and use feature selection on $D_1$ as a prior for feature selection on $D_2$. Preliminary research shows that this transfer of feature selection information increases the stability of feature selection on $D_2$ [Helleputte and Dupont (2009)].
- A comparative evaluation of different ensemble FS techniques.

51 Acknowledgements
Thomas Abeel (Ghent University), Yves Van de Peer (Ghent University), Thibault Helleputte (UC Louvain), Pierre Dupont (UC Louvain), Ruben Armañanzas (Universidad Politécnica de Madrid), Iñaki Inza (University of the Basque Country), Pedro Larrañaga (Universidad Politécnica de Madrid).

52 References
Alizadeh, A., Eisen, M., Davis, R., Ma, C., Lossos, I., Rosenwald, A., Boldrick, J., Sabet, H., et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403.
Alon, U., Barkai, N., Notterman, D., Gish, K., Ybarra, S., Mack, D., & Levine, A. (1999). Broad patterns of gene expression revealed by clustering of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA, 96.
Golub, T., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., et al. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286.
Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46.
Helleputte, T., & Dupont, P. (2009). Feature selection by transfer learning with linear regularized models. Lecture Notes in Artificial Intelligence, 5781.
Kuncheva, L. (2007). A stability index for feature selection. Proceedings of the 25th International Multi-Conference on Artificial Intelligence and Applications.
Petricoin, E. F., Ardekani, A. M., Hitt, B. A., Levine, P. J., Fusaro, V. A., Steinberg, S. M., Mills, G. B., Simone, C., Fishman, D. A., Kohn, E. C., & Liotta, L. A. (2002). Use of proteomic patterns in serum to identify ovarian cancer. Lancet, 359.
Petricoin, E. F., Rajapaske, V., Herman, E. H., Arekani, A. M., Ross, S., Johann, D., Knapton, A., Zhang, J., Hitt, B. A., Conrads, T. P., Veenstra, T. D., Liotta, L. A., & Sistare, F. D. (2004). Toxicoproteomics: Serum proteomic pattern diagnostics for early detection of drug induced cardiac toxicities and cardioprotection. Toxicologic Pathology, 32.
Ressom, H. W., Varghese, R. S., Orvisky, E., Drake, S. K., Hortin, G. L., Abdel-Hamid, M., Loffredo, C. A., & Goldman, R. (2006). Ant colony optimization for biomarker identification from MALDI-TOF mass spectra. Proceedings of the 28th International Conference of the IEEE Engineering in Medicine and Biology Society.
Singh, D., Febbo, P., Ross, K., Jackson, D., Manola, J., Ladd, C., Tamayo, P., Renshaw, A., D'Amico, A., Richie, J., Lander, E., Loda, M., Kantoff, P., Golub, T. R., & Sellers, W. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1.
