Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga
Americo Pereira, Jan Otto

ABSTRACT
In this paper we explain what feature selection is and what it is used for. We discuss methods for obtaining such features and present them in the context of classification, that is, using only supervised learning. Our goal is to present applications of these techniques in bioinformatics and to match each application with the appropriate type of method. We also discuss how to deal with small sample sizes and the problems related to them. Finally, we point out some areas of bioinformatics that can be improved using feature selection methods.

1 - INTRODUCTION
The need for feature selection (FS) techniques has grown in recent years because of the size of the data that has to be analysed in areas related to bioinformatics. FS can be classified as one of many dimensionality reduction techniques. What distinguishes it from the others is that it picks a subset of variables from the original data without transforming them.

2 - FEATURE SELECTION TECHNIQUES
The main objectives of using FS are:
- avoiding overfitting
- improving model performance
- gaining a deeper insight into the underlying processes that generated the data
Unfortunately, the optimal parameters of a model built from the full feature set are not necessarily the same as the optimal parameters of a model built from the feature subset selected by FS. FS must therefore be used carefully, since there is a danger of losing information, and we need to find the optimal model settings for the new set of features. Feature selection techniques can be divided into three categories, depending on how they combine the feature selection search with the construction of the classification model: filter methods, wrapper methods and embedded methods.

2.1 - FILTER TECHNIQUES
These techniques assess the importance of features by looking only at the intrinsic properties of the data. In most cases a feature relevance score is calculated and low-scoring features are removed. The remaining subset of features is then presented as input to the classification algorithm. Advantages of filter techniques are that they scale easily to very high-dimensional datasets, they are computationally simple and fast, and they are independent of the classification algorithm. Because simple filters ignore feature dependencies, multivariate filter techniques were introduced; these are not perfect either, as they only capture dependencies up to a certain degree.
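To make the filter idea concrete, here is a minimal sketch, assuming NumPy/SciPy and a two-class problem; the absolute t-statistic as relevance score and the cut-off k are illustrative choices, not prescribed by the review.

```python
# Minimal univariate filter sketch (illustrative): score each feature
# independently with an absolute two-sample t-statistic and keep the
# k highest-scoring ones.
import numpy as np
from scipy import stats

def filter_select(X, y, k=50):
    """X: (n_samples, n_features) array, y: binary labels (0/1)."""
    scores = np.array([
        abs(stats.ttest_ind(X[y == 0, j], X[y == 1, j], equal_var=False).statistic)
        for j in range(X.shape[1])
    ])
    keep = np.argsort(scores)[::-1][:k]   # indices of the top-k features
    return keep, scores

# The selected columns X[:, keep] are then passed to any classifier.
```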

2.2 - WRAPPER TECHNIQUES
Unlike filter techniques, wrapper methods embed the model hypothesis search within the feature subset search. In this setup a search procedure generates various candidate subsets of features, and each candidate subset is evaluated by training and testing a specific classification model. Picking the best feature subset by exhaustive search would take exponential time, so heuristics are used instead; the search can rely on deterministic or randomized search algorithms. A common disadvantage of these techniques is that they have a higher risk of overfitting than filter techniques and are computationally intensive, especially if building the classifier has a high computational cost.
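A minimal sketch of the wrapper idea, assuming scikit-learn is available: greedy forward selection is just one possible deterministic search, and the 1-nearest-neighbour classifier, the fold count and the stopping size are assumptions made for illustration.

```python
# Minimal wrapper sketch (illustrative): greedy forward selection, where each
# candidate subset is scored by the cross-validated accuracy of the classifier
# that will eventually be used.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def forward_select(X, y, max_features=10):
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_features:
        # try adding each remaining feature and keep the one that helps most
        scores = [(cross_val_score(KNeighborsClassifier(1),
                                   X[:, selected + [j]], y, cv=5).mean(), j)
                  for j in remaining]
        best_score, best_j = max(scores)
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```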

2.3 - EMBEDDED TECHNIQUES
In embedded techniques the search for an optimal feature subset is built into the construction of the classifier: the search space is a combination of the feature space and the hypothesis space, and the classifier itself provides the feature selection. A great advantage is that these methods are much less computationally intensive than wrapper methods.

3 - APPLICATIONS IN BIOINFORMATICS

3.1 - SEQUENCE ANALYSIS
Sequence analysis is one of the most traditional areas of bioinformatics. The problems encountered here can be divided into two types, differing in the scope we are interested in. If we want to focus on general characteristics and reason from statistical features of the whole sequence, we are interested in content analysis. If, on the other hand, we want to detect the presence of a particular motif or some specified aberration in a sequence, we are only interested in analysing some small part(s) of it; in that case we perform signal analysis. In both cases (content or signal) the sequence contains a lot of data that is either irrelevant or redundant, and that is exactly where FS can be exploited.

3.1.1 FS dedicated to content analysis (filter, multivariate)
Because the features are derived from an ordered sequence, keeping them ordered is often beneficial, as it preserves dependencies between adjacent features. That is why Markov models were used in the first approach mentioned in the Saeys review [1]; improved variants are still in use, but the basic idea stays the same. Genetic algorithms and SVMs are also used for scoring feature subsets.

3.1.2 FS dedicated to signal analysis (wrapper)
Signal analysis is usually performed to recognize binding sites or other parts of a sequence with a special function. For feature selection it is best to interpret the code and relate motifs to the gene expression level. Motifs can then be pruned so that the preserved motifs fit their regression models best. Which motifs are discarded depends on the threshold number of misclassifications (TNoM) with respect to the regression models, and the importance of motifs can be ranked by the P-value derived directly from the TNoM score.

3.2 - MICROARRAY ANALYSIS
The main characteristics of the datasets produced by microarrays are large dimensionality combined with small sample sizes. What makes the analysis even more challenging is that it has to cope with noise and variability.

3.2.1 Univariate (filter only)
This approach is the most widely used, for several reasons:
- the output is understandable
- it is faster than multivariate approaches
- it is easier to validate the selection with biological lab methods
- the experts usually do not feel the need to consider gene interactions
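Because the TNoM score from Section 3.1.2 reappears in the microarray heuristics of the next paragraph, here is a minimal sketch of one way to compute it for a single feature. This is an illustrative reading of the score (the smallest number of errors made by any single-threshold rule), not the exact published procedure.

```python
# Minimal sketch of the TNoM score (threshold number of misclassifications):
# for one feature, the smallest number of errors achievable by any rule that
# splits the samples at a single threshold.
import numpy as np

def tnom(x, y):
    """x: feature values (n,), y: binary labels (0/1). Lower TNoM = more informative."""
    order = np.argsort(x)
    y_sorted = y[order]
    n, n_pos = len(y), y.sum()
    best = min(n_pos, n - n_pos)              # trivial rule: predict one class for all
    ones_left = np.cumsum(y_sorted)           # positives at or below each cut point
    for i in range(1, n):                     # cut between position i-1 and i
        left_pos = ones_left[i - 1]
        left_neg = i - left_pos
        right_pos, right_neg = n_pos - left_pos, (n - i) - left_neg
        # errors if the left side is called negative and the right positive, or vice versa
        errors = min(left_pos + right_neg, left_neg + right_pos)
        best = min(best, errors)
    return int(best)
```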

The simplest heuristic techniques set a threshold on the differences in gene expression between the states and then detect the threshold point in each gene that minimizes TNoM. Univariate methods have also developed in two directions:

3.2.1.1 Parametric methods
Parametric methods assume a given distribution from which the samples have been generated, so before using them the analyst should justify the choice of distribution. Unfortunately, the samples are so small that it is very hard to even validate such a choice. The most common choice is a Gaussian distribution.

3.2.1.2 Model-free methods
These work like parametric methods but estimate the null distribution from random permutations of the data, which enhances robustness against outliers. There is also another group of non-parametric methods which, instead of trying to identify differentially expressed genes at the whole population level, are able to capture genes that are significantly dysregulated in only a subset of samples. These methods can select genes containing specific patterns that are missed by the previously mentioned metrics.

3.2.2 Multivariate

3.2.2.1 Filter methods
The application of multivariate filter methods ranges from simple bivariate interactions to more advanced solutions exploring higher-order interactions, such as correlation-based feature selection (CFS) and several variants of the Markov blanket filter method.

3.2.2.2 Wrapper methods
In the context of microarray analysis, most wrapper methods use population-based, randomized search heuristics, although a few examples use sequential search techniques. An interesting hybrid filter-wrapper approach crosses a univariately pre-ordered gene ranking with an incrementally augmenting wrapper method. Another characteristic of any wrapper procedure is the scoring function used to evaluate each gene subset found. The vast majority of papers use the 0-1 accuracy measure, as it allows comparison with previous work.

3.2.2.3 Embedded methods
The embedded capacity of several classifiers to discard input features, and thus propose a subset of discriminative genes, has been exploited by several authors. Examples include the use of random forests in an embedded way to calculate the importance of each gene. Another line of embedded FS techniques uses the weights of each feature in linear classifiers such as SVMs and logistic regression; these weights reflect the relevance of each gene in a multivariate way and allow the removal of genes with very small weights. Partially due to the higher computational complexity of wrapper, and to a lesser degree embedded, approaches, these techniques have not received as much interest as filter proposals. An advisable practice, however, is to pre-reduce the search space using a univariate filter method and only then apply wrapper or embedded methods, thereby fitting the computation time to the available resources; a minimal sketch of the weight-based embedded idea follows below.
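As a concrete illustration of the weight-based embedded idea combined with the recommended univariate pre-reduction, here is a minimal sketch assuming scikit-learn and a binary problem; the pre-filter size, the linear SVM, and the halving schedule are illustrative assumptions rather than the procedures used in the cited studies.

```python
# Minimal sketch of weight-based embedded selection after a univariate pre-filter
# (illustrative; classifier, pre-filter size and elimination schedule are assumptions).
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import LinearSVC

def prefilter_then_svm_weights(X, y, prefilter_k=1000, final_k=50):
    # Step 1: univariate pre-reduction of the search space (ANOVA F-score).
    pre = SelectKBest(f_classif, k=min(prefilter_k, X.shape[1])).fit(X, y)
    kept = np.where(pre.get_support())[0]
    # Step 2: repeatedly drop the half of the genes with the smallest |weight|
    # in a linear SVM until only final_k remain.
    while len(kept) > final_k:
        svm = LinearSVC(dual=False).fit(X[:, kept], y)
        weights = np.abs(svm.coef_).ravel()          # one weight per remaining gene
        keep_n = max(final_k, len(kept) // 2)
        kept = kept[np.argsort(weights)[::-1][:keep_n]]
    return kept
```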
3.3 - MASS SPECTRA ANALYSIS
Mass spectrometry technology is emerging as a new and attractive framework for disease diagnosis and protein-based biomarker profiling. A mass spectrum sample is characterized by thousands of different mass/charge (m/z) ratios on the x-axis, each with a corresponding signal intensity value on the y-axis. A low-resolution profile can contain up to 15,500 data points in the spectrum between 500 and 20,000 m/z. The data analysis step is therefore severely constrained by both high-dimensional input spaces and their inherent sparseness, just as is the case with gene expression datasets.
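Section 3.3.1 below notes that many studies aggressively reduce such spectra to a few hundred variables before any selection takes place. A minimal binning sketch of that kind of reduction is given here; the m/z range and the number of bins are arbitrary assumptions for illustration.

```python
# Minimal sketch of reducing a raw spectrum to binned intensity features
# (illustrative; the m/z range and number of bins are arbitrary assumptions).
import numpy as np

def bin_spectrum(mz, intensity, mz_min=500.0, mz_max=20000.0, n_bins=500):
    """mz, intensity: 1-D arrays of equal length describing one spectrum.
    Returns one summed-intensity feature per m/z bin."""
    edges = np.linspace(mz_min, mz_max, n_bins + 1)
    which_bin = np.clip(np.digitize(mz, edges) - 1, 0, n_bins - 1)
    features = np.zeros(n_bins)
    np.add.at(features, which_bin, intensity)   # sum intensities falling in each bin
    return features
```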

3.3.1 Filter methods
Starting from the raw data, and after an initial step to reduce noise and normalize the spectra from different samples, we need to extract the variables that will constitute the initial pool of candidate discriminative features. Some studies employ the simplest approach of considering every measured value as a feature (15,000 to 100,000 variables!), while a great deal of current studies perform aggressive feature extraction procedures that limit the number of variables to as few as 500. FS therefore has to be of low cost in this case. As in the microarray domain, univariate filter techniques are the most common, although the use of embedded techniques is certainly emerging as an alternative; multivariate filter techniques, on the other hand, are still somewhat underrepresented.

3.3.2 Wrapper methods
In the wrapper approaches, different types of population-based randomized heuristics are used as search engines in most of these papers: genetic algorithms, particle swarm optimization and ant colony procedures. It is worth noting that these methods tend to be improved by reducing the initial number of variables. Variations of the popular method originally proposed for gene expression domains, which uses the weights of the variables in the SVM formulation to discard features with small weights, have been broadly and successfully applied in the mass spectrometry domain. Neural network classifiers (using the weights of the input masses to rank feature importance) and different types of decision tree-based algorithms (including random forests) are alternatives to this strategy.

4 - DEALING WITH SMALL SAMPLE DOMAINS
When models are built from small samples, problems start to appear: the risks of overfitting and of imprecise models grow as the amount of training data shrinks. This poses a great challenge for many modelling problems in bioinformatics. Two ways to counter these problems in combination with feature selection are the use of adequate evaluation criteria and the use of stable and robust feature selection models.

4.1 - ADEQUATE EVALUATION CRITERIA
In some studies a discriminative subset of features is selected from the data and this subset is then used to test the final model. Since the model is also built using this subset, the same samples are used for both training and testing, which biases the estimate. To get a better estimation of the accuracy, the feature selection process must be kept external to the model evaluation, that is, repeated inside each stage of testing the model; a minimal sketch of such an external selection loop follows below. Bolstered error estimation [2] is an example of a method that gives a good estimate of predictive accuracy and can deal with small sample domains.
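To illustrate keeping selection external to evaluation, here is a minimal sketch assuming scikit-learn: the filter is refit inside every cross-validation fold, so the test fold never influences which features are chosen. The score function, the number of kept features and the classifier are illustrative assumptions.

```python
# Minimal sketch of an external feature selection loop (illustrative):
# the filter is refit on the training fold only, so the held-out fold
# never leaks into the choice of features.
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def unbiased_accuracy(X, y, k=50):
    model = make_pipeline(SelectKBest(f_classif, k=k), LinearSVC(dual=False))
    # cross_val_score refits the whole pipeline, selection included, in each fold
    return cross_val_score(model, X, y, cv=5).mean()
```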
4.2 - ENSEMBLE FEATURE SELECTION APPROACHES
The idea behind ensembles is that, instead of applying a single feature selection method and accepting its outcome, we can combine different feature selection methods to obtain better results. This is useful because it is not certain that one specific optimal feature subset is the only optimal one. Although ensembles are computationally more complex and require more resources, they give decent results even for small sample domains, so spending the extra computation is rewarded with better results. The random forest method is a particular example of an ensemble, based on a collection of decision trees; it can be used to estimate the relevance of each feature, which helps in selecting interesting features.
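A minimal sketch of this kind of ensemble-based relevance estimation with a random forest, assuming scikit-learn; the forest size and the number of retained features are illustrative assumptions.

```python
# Minimal sketch of random-forest-based feature relevance (illustrative):
# the forest's impurity-based importances are used to rank and keep features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def forest_select(X, y, k=50, n_trees=500, seed=0):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed).fit(X, y)
    ranking = np.argsort(forest.feature_importances_)[::-1]   # most relevant first
    return ranking[:k]
```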

5 - FEATURE SELECTION IN UPCOMING DOMAINS

5.1 - SINGLE NUCLEOTIDE POLYMORPHISM ANALYSIS
Single nucleotide polymorphisms (SNPs) are mutations at a single nucleotide position that occurred during evolution and were passed on through heredity; they account for most of the genetic variation among different individuals. They are used in many disease-gene association studies, and their number in the human genome is estimated at 7 million. Because of this, there is a need to select a portion of SNPs that is sufficiently informative yet small enough to reduce the genotyping overhead, which is an important step towards disease-gene association.

5.2 - TEXT AND LITERATURE MINING
Text and literature mining extracts information from texts, which can then be used to build classification models. A common representation of texts and documents is the bag-of-words, in which each word in the text is a variable (feature) whose value is its frequency. As a consequence, for some texts the dimensionality of the data becomes very large and the data very sparse, which is why feature selection must be used to choose the fundamental features. Although feature selection is common in text classification, in the context of bioinformatics it is still not much developed. For the clustering and classification of biomedical documents, it is expected that most of the methods developed by the text mining community can be reused.

6 - CONCLUSION
Feature selection is becoming very important, because the amount of data that has to be processed in bioinformatics and other areas is constantly growing in both size and dimensionality. Nevertheless, not everyone is aware of the various types of feature selection methods; people usually pick a univariate filter without considering other options. Prudent use of feature selection methods helps us deal with the main issues of bioinformatics data: large input dimensionality and small sample sizes. We hope the reader feels convinced to make the effort to find the proper FS method for each problem that needs dimensionality reduction, before actually tackling it.

REFERENCES
[1] Y. Saeys, I. Inza and P. Larrañaga, "A review of feature selection techniques in bioinformatics", 2007.
[2] C. Sima, "High-dimensional bolstered error estimation", 2011.