Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga
Americo Pereira, Jan Otto

ABSTRACT
In this paper we explain what feature selection is and what it is used for. We discuss methods for obtaining such features and present them in the context of classification, that is, using only supervised learning. Our goal is to present applications of these techniques in bioinformatics and to match each application with the appropriate type of method. We also discuss how to deal with small sample sizes and the problems related to them. Finally, we point out some areas of bioinformatics that can be improved using feature selection methods.

1 - INTRODUCTION
The need for feature selection (FS) techniques has grown in recent years because of the size of the data that has to be analysed in areas related to bioinformatics. FS can be classified as one of many dimensionality reduction techniques. What distinguishes it from the others is that it picks a subset of variables from the original data without transforming them.

2 - FEATURE SELECTION TECHNIQUES
The main objectives of using FS are:
- avoiding overfitting
- improving model performance
- gaining a deeper insight into the underlying processes that generated the data
Unfortunately, the optimal parameters of a model built from the full feature set are not necessarily the same as the optimal parameters of a model built from the feature subset selected by FS. FS must therefore be used carefully, since there is a danger of losing information, and we need to find the optimal model settings for the new set of features. Feature selection techniques can be divided into three categories, depending on how they combine the feature selection search with the construction of the classification model: filter methods, wrapper methods and embedded methods.

2.1 - FILTER TECHNIQUES
These techniques assess the importance of features by looking only at the intrinsic properties of the data. In most cases a feature relevance score is calculated and low-scoring features are removed. The remaining subset of features is then presented as input to the classification algorithm. Advantages of filter techniques are that they scale easily to very high-dimensional datasets, they are computationally simple and fast, and they are independent of the classification algorithm. Because simple filters ignore feature dependencies, multivariate filter techniques were introduced; these are not perfect either, as they only capture dependencies up to a certain degree.
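To make the filter idea concrete, here is a minimal sketch, assuming NumPy/SciPy and a two-class problem; the absolute t-statistic as relevance score and the cut-off k are illustrative choices, not prescribed by the review.

```python
# Minimal univariate filter sketch (illustrative): score each feature
# independently with an absolute two-sample t-statistic and keep the
# k highest-scoring ones.
import numpy as np
from scipy import stats

def filter_select(X, y, k=50):
    """X: (n_samples, n_features) array, y: binary labels (0/1)."""
    scores = np.array([
        abs(stats.ttest_ind(X[y == 0, j], X[y == 1, j], equal_var=False).statistic)
        for j in range(X.shape[1])
    ])
    keep = np.argsort(scores)[::-1][:k]   # indices of the top-k features
    return keep, scores

# The selected columns X[:, keep] are then passed to any classifier.
```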

2.2 - WRAPPER TECHNIQUES
Unlike filter techniques, wrapper methods embed the model hypothesis search within the feature subset search. In this setup a search procedure generates various candidate subsets of features, and each candidate subset is evaluated by training and testing a specific classification model. Picking the best feature subset by exhaustive search would take exponential time, so heuristics are used instead; the search can rely on deterministic or randomized search algorithms. A common disadvantage of these techniques is that they have a higher risk of overfitting than filter techniques and are computationally intensive, especially if building the classifier has a high computational cost.
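A minimal sketch of the wrapper idea, assuming scikit-learn is available: greedy forward selection is just one possible deterministic search, and the 1-nearest-neighbour classifier, the fold count and the stopping size are assumptions made for illustration.

```python
# Minimal wrapper sketch (illustrative): greedy forward selection, where each
# candidate subset is scored by the cross-validated accuracy of the classifier
# that will eventually be used.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def forward_select(X, y, max_features=10):
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_features:
        # try adding each remaining feature and keep the one that helps most
        scores = [(cross_val_score(KNeighborsClassifier(1),
                                   X[:, selected + [j]], y, cv=5).mean(), j)
                  for j in remaining]
        best_score, best_j = max(scores)
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```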

2.3 - EMBEDDED TECHNIQUES
In embedded techniques the search for an optimal feature subset is built into the construction of the classifier: the search space is a combination of the feature space and the hypothesis space, and the classifier itself provides the feature selection. A great advantage is that these methods are much less computationally intensive than wrapper methods.

3 - APPLICATIONS IN BIOINFORMATICS

3.1 - SEQUENCE ANALYSIS
Sequence analysis is one of the most traditional areas of bioinformatics. The problems encountered here can be divided into two types, differing in the scope we are interested in. If we want to focus on general characteristics and reason from statistical features of the whole sequence, we are interested in content analysis. If, on the other hand, we want to detect the presence of a particular motif or some specified aberration in a sequence, we are only interested in analysing some small part(s) of it; in that case we perform signal analysis. In both cases (content or signal) the sequence contains a lot of data that is either irrelevant or redundant, and that is exactly where FS can be exploited.

3.1.1 FS dedicated to content analysis (filter, multivariate)
Because the features are derived from an ordered sequence, keeping them ordered is often beneficial, as it preserves dependencies between adjacent features. That is why Markov models were used in the first approach mentioned in the Saeys review [1]; improved variants are still in use, but the basic idea stays the same. Genetic algorithms and SVMs are also used for scoring feature subsets.

3.1.2 FS dedicated to signal analysis (wrapper)
Signal analysis is usually performed to recognize binding sites or other parts of a sequence with a special function. For feature selection it is best to interpret the code and relate motifs to the gene expression level. Motifs can then be pruned so that the preserved motifs fit their regression models best. Which motifs are discarded depends on the threshold number of misclassifications (TNoM) with respect to the regression models, and the importance of motifs can be ranked by the P-value derived directly from the TNoM score.

3.2 - MICROARRAY ANALYSIS
The main characteristics of the datasets produced by microarrays are large dimensionality combined with small sample sizes. What makes the analysis even more challenging is that it has to cope with noise and variability.

3.2.1 Univariate (filter only)
This approach is the most widely used, for several reasons:
- the output is understandable
- it is faster than multivariate approaches
- it is easier to validate the selection with biological lab methods
- the experts usually do not feel the need to consider gene interactions
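Because the TNoM score from Section 3.1.2 reappears in the microarray heuristics of the next paragraph, here is a minimal sketch of one way to compute it for a single feature. This is an illustrative reading of the score (the smallest number of errors made by any single-threshold rule), not the exact published procedure.

```python
# Minimal sketch of the TNoM score (threshold number of misclassifications):
# for one feature, the smallest number of errors achievable by any rule that
# splits the samples at a single threshold.
import numpy as np

def tnom(x, y):
    """x: feature values (n,), y: binary labels (0/1). Lower TNoM = more informative."""
    order = np.argsort(x)
    y_sorted = y[order]
    n, n_pos = len(y), y.sum()
    best = min(n_pos, n - n_pos)              # trivial rule: predict one class for all
    ones_left = np.cumsum(y_sorted)           # positives at or below each cut point
    for i in range(1, n):                     # cut between position i-1 and i
        left_pos = ones_left[i - 1]
        left_neg = i - left_pos
        right_pos, right_neg = n_pos - left_pos, (n - i) - left_neg
        # errors if the left side is called negative and the right positive, or vice versa
        errors = min(left_pos + right_neg, left_neg + right_pos)
        best = min(best, errors)
    return int(best)
```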

The simplest heuristic techniques set a threshold on the differences in gene expression between the states and then detect the threshold point in each gene that minimizes TNoM. Univariate methods have also developed in two directions:

3.2.1.1 Parametric methods
Parametric methods assume a given distribution from which the samples have been generated, so before using them the analyst should justify the choice of distribution. Unfortunately, the samples are so small that it is very hard to even validate such a choice. The most common choice is a Gaussian distribution.

3.2.1.2 Model-free methods
These work like parametric methods but estimate the null distribution from random permutations of the data, which enhances robustness against outliers. There is also another group of non-parametric methods which, instead of trying to identify differentially expressed genes at the whole population level, are able to capture genes that are significantly dysregulated in only a subset of samples. These methods can select genes containing specific patterns that are missed by the previously mentioned metrics.

3.2.2 Multivariate

3.2.2.1 Filter methods
The application of multivariate filter methods ranges from simple bivariate interactions to more advanced solutions exploring higher-order interactions, such as correlation-based feature selection (CFS) and several variants of the Markov blanket filter method.

3.2.2.2 Wrapper methods
In the context of microarray analysis, most wrapper methods use population-based, randomized search heuristics, although a few examples use sequential search techniques. An interesting hybrid filter-wrapper approach crosses a univariately pre-ordered gene ranking with an incrementally augmenting wrapper method. Another characteristic of any wrapper procedure is the scoring function used to evaluate each gene subset found. The vast majority of papers use the 0-1 accuracy measure, as it allows comparison with previous work.

3.2.2.3 Embedded methods
The embedded capacity of several classifiers to discard input features, and thus propose a subset of discriminative genes, has been exploited by several authors. Examples include the use of random forests in an embedded way to calculate the importance of each gene. Another line of embedded FS techniques uses the weights of each feature in linear classifiers such as SVMs and logistic regression; these weights reflect the relevance of each gene in a multivariate way and allow the removal of genes with very small weights. Partially due to the higher computational complexity of wrapper, and to a lesser degree embedded, approaches, these techniques have not received as much interest as filter proposals. An advisable practice, however, is to pre-reduce the search space using a univariate filter method and only then apply wrapper or embedded methods, thereby fitting the computation time to the available resources; a minimal sketch of the weight-based embedded idea follows below.
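As a concrete illustration of the weight-based embedded idea combined with the recommended univariate pre-reduction, here is a minimal sketch assuming scikit-learn and a binary problem; the pre-filter size, the linear SVM, and the halving schedule are illustrative assumptions rather than the procedures used in the cited studies.

```python
# Minimal sketch of weight-based embedded selection after a univariate pre-filter
# (illustrative; classifier, pre-filter size and elimination schedule are assumptions).
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import LinearSVC

def prefilter_then_svm_weights(X, y, prefilter_k=1000, final_k=50):
    # Step 1: univariate pre-reduction of the search space (ANOVA F-score).
    pre = SelectKBest(f_classif, k=min(prefilter_k, X.shape[1])).fit(X, y)
    kept = np.where(pre.get_support())[0]
    # Step 2: repeatedly drop the half of the genes with the smallest |weight|
    # in a linear SVM until only final_k remain.
    while len(kept) > final_k:
        svm = LinearSVC(dual=False).fit(X[:, kept], y)
        weights = np.abs(svm.coef_).ravel()          # one weight per remaining gene
        keep_n = max(final_k, len(kept) // 2)
        kept = kept[np.argsort(weights)[::-1][:keep_n]]
    return kept
```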
3.3 - MASS SPECTRA ANALYSIS
Mass spectrometry technology is emerging as a new and attractive framework for disease diagnosis and protein-based biomarker profiling. A mass spectrum sample is characterized by thousands of different mass/charge (m/z) ratios on the x-axis, each with a corresponding signal intensity value on the y-axis. A low-resolution profile can contain up to 15,500 data points in the spectrum between 500 and 20,000 m/z. The data analysis step is therefore severely constrained by both high-dimensional input spaces and their inherent sparseness, just as is the case with gene expression datasets.
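Section 3.3.1 below notes that many studies aggressively reduce such spectra to a few hundred variables before any selection takes place. A minimal binning sketch of that kind of reduction is given here; the m/z range and the number of bins are arbitrary assumptions for illustration.

```python
# Minimal sketch of reducing a raw spectrum to binned intensity features
# (illustrative; the m/z range and number of bins are arbitrary assumptions).
import numpy as np

def bin_spectrum(mz, intensity, mz_min=500.0, mz_max=20000.0, n_bins=500):
    """mz, intensity: 1-D arrays of equal length describing one spectrum.
    Returns one summed-intensity feature per m/z bin."""
    edges = np.linspace(mz_min, mz_max, n_bins + 1)
    which_bin = np.clip(np.digitize(mz, edges) - 1, 0, n_bins - 1)
    features = np.zeros(n_bins)
    np.add.at(features, which_bin, intensity)   # sum intensities falling in each bin
    return features
```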

3.3.1 Filter methods
Starting from the raw data, and after an initial step to reduce noise and normalize the spectra from different samples, we need to extract the variables that will constitute the initial pool of candidate discriminative features. Some studies employ the simplest approach of considering every measured value as a feature (15,000 to 100,000 variables!), while a great deal of current studies perform aggressive feature extraction procedures that limit the number of variables to as few as 500. FS therefore has to be of low cost in this case. As in the microarray domain, univariate filter techniques are the most common, although the use of embedded techniques is certainly emerging as an alternative; multivariate filter techniques, on the other hand, are still somewhat underrepresented.

3.3.2 Wrapper methods
In the wrapper approaches, different types of population-based randomized heuristics are used as search engines in most of these papers: genetic algorithms, particle swarm optimization and ant colony procedures. It is worth noting that these methods tend to be improved by reducing the initial number of variables. Variations of the popular method originally proposed for gene expression domains, which uses the weights of the variables in the SVM formulation to discard features with small weights, have been broadly and successfully applied in the mass spectrometry domain. Neural network classifiers (using the weights of the input masses to rank feature importance) and different types of decision tree-based algorithms (including random forests) are alternatives to this strategy.

4 - DEALING WITH SMALL SAMPLE DOMAINS
When models are built from small samples, problems start to appear: the risks of overfitting and of imprecise models grow as the amount of training data shrinks. This poses a great challenge for many modelling problems in bioinformatics. Two ways to counter these problems in combination with feature selection are the use of adequate evaluation criteria and the use of stable and robust feature selection models.

4.1 - ADEQUATE EVALUATION CRITERIA
In some studies a discriminative subset of features is selected from the data and this subset is then used to test the final model. Since the model is also built using this subset, the same samples are used for both training and testing, which biases the estimate. To get a better estimation of the accuracy, the feature selection process must be kept external to the model evaluation, that is, repeated inside each stage of testing the model; a minimal sketch of such an external selection loop follows below. Bolstered error estimation [2] is an example of a method that gives a good estimate of predictive accuracy and can deal with small sample domains.
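To illustrate keeping selection external to evaluation, here is a minimal sketch assuming scikit-learn: the filter is refit inside every cross-validation fold, so the test fold never influences which features are chosen. The score function, the number of kept features and the classifier are illustrative assumptions.

```python
# Minimal sketch of an external feature selection loop (illustrative):
# the filter is refit on the training fold only, so the held-out fold
# never leaks into the choice of features.
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def unbiased_accuracy(X, y, k=50):
    model = make_pipeline(SelectKBest(f_classif, k=k), LinearSVC(dual=False))
    # cross_val_score refits the whole pipeline, selection included, in each fold
    return cross_val_score(model, X, y, cv=5).mean()
```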
4.2 - ENSEMBLE FEATURE SELECTION APPROACHES
The idea behind ensembles is that, instead of applying a single feature selection method and accepting its outcome, we can combine different feature selection methods to obtain better results. This is useful because it is not certain that one specific optimal feature subset is the only optimal one. Although ensembles are computationally more complex and require more resources, they give decent results even for small sample domains, so spending the extra computation is rewarded with better results. The random forest method is a particular example of an ensemble, based on a collection of decision trees; it can be used to estimate the relevance of each feature, which helps in selecting interesting features.
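A minimal sketch of this kind of ensemble-based relevance estimation with a random forest, assuming scikit-learn; the forest size and the number of retained features are illustrative assumptions.

```python
# Minimal sketch of random-forest-based feature relevance (illustrative):
# the forest's impurity-based importances are used to rank and keep features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def forest_select(X, y, k=50, n_trees=500, seed=0):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed).fit(X, y)
    ranking = np.argsort(forest.feature_importances_)[::-1]   # most relevant first
    return ranking[:k]
```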

5 - FEATURE SELECTION IN UPCOMING DOMAINS

5.1 - SINGLE NUCLEOTIDE POLYMORPHISM ANALYSIS
Single nucleotide polymorphisms (SNPs) are mutations at a single nucleotide position that occurred during evolution and were passed on through heredity; they account for most of the genetic variation among different individuals. They are used in many disease-gene association studies, and their number in the human genome is estimated at 7 million. Because of this, there is a need to select a portion of SNPs that is sufficiently informative yet small enough to reduce the genotyping overhead, which is an important step towards disease-gene association.

5.2 - TEXT AND LITERATURE MINING
Text and literature mining extracts information from texts, which can then be used to build classification models. A common representation of texts and documents is the bag-of-words, in which each word in the text is a variable (feature) whose value is its frequency. As a consequence, for some texts the dimensionality of the data becomes very large and the data very sparse, which is why feature selection must be used to choose the fundamental features. Although feature selection is common in text classification, in the context of bioinformatics it is still not much developed. For the clustering and classification of biomedical documents, it is expected that most of the methods developed by the text mining community can be reused.

6 - CONCLUSION
Feature selection is becoming very important, because the amount of data that has to be processed in bioinformatics and other areas is constantly growing in both size and dimensionality. Nevertheless, not everyone is aware of the various types of feature selection methods; people usually pick a univariate filter without considering other options. Prudent use of feature selection methods helps us deal with the main issues of bioinformatics data: large input dimensionality and small sample sizes. We hope the reader feels convinced to make the effort to find the proper FS method for each problem that needs dimensionality reduction, before actually tackling it.

REFERENCES
[1] Y. Saeys, I. Inza and P. Larrañaga, "A review of feature selection techniques in bioinformatics", 2007.
[2] C. Sima, "High-dimensional bolstered error estimation", 2011.