Automated Microarray Classification Based on P-SVM Gene Selection


Johannes Mohr 1,2, Sambu Seo 1, and Klaus Obermayer 1

1 Berlin Institute of Technology, Department of Electrical Engineering and Computer Science, Franklinstr. 28/29, Berlin, Germany
2 Charité Universitätsmedizin Berlin, CCM, Department for Psychiatry and Psychotherapy, Charitéplatz 1, Berlin, Germany
johann,sontag,oby@cs.tu-berlin.de

The first two authors have contributed equally.

Abstract

The analysis of microarray data is a challenging task for statistical and machine learning methods, since the datasets usually contain a very large number of features (genes) and only a small number of examples (subjects). In this work, we describe a technique for gene selection and classification of microarray data based on the recently proposed potential support vector machine (P-SVM) for feature selection and a ν-SVM for classification. The P-SVM expands the decision function in terms of a sparse set of support features. Based on this novel technique for feature selection, we suggest a fully automated method for gene selection, hyper-parameter optimization and microarray classification. Benchmark results are given for the two datasets provided by the ICMLA'08 Automated Micro-Array Classification Challenge.

1. Introduction

The recent advance of high-throughput technologies in molecular biology and medicine has led to a demand for new statistical techniques for data analysis. An important example are gene expression microarrays, which can be used for diagnostic or research purposes and allow the expression levels of thousands of genes to be assessed in parallel. Microarray datasets are therefore characterized by an extremely large number of features and a small sample size. As such settings are in general prone to over-fitting, the classification of microarray data requires a suitable form of regularization and poses a challenging problem for machine learning research.

A prediction algorithm for microarray data should meet several requirements. First, it should provide good generalization performance on yet unseen data. The dimensionality of the feature space for microarray data is very high, but only a low-dimensional subspace will be relevant for prediction. Since the presence of irrelevant features can degrade the generalization performance of most learning machines, techniques for dimensionality reduction can be helpful. Second, one is interested in a sparse solution, which depends only on a small number of genes. On the one hand, this helps in the identification of the genes which are actually involved in the observed effect, which allows a better understanding of the molecular-biological mechanisms underlying a certain disease and the discovery of possible cures. On the other hand, reducing the number of genes necessary for diagnosis helps to save costs. Therefore, feature selection techniques, which yield a sparse subset of the original features, have advantages over feature construction techniques (like principal component analysis), which also reduce the dimensionality of the problem but work with projections that still require the original number of features. For these reasons, the performance of a predictor for microarray analysis should be assessed both by its predictive classification performance (e.g. by evaluating the mean balanced error, i.e. the average misclassification error per class) and by its sparsity (the number of genes involved in the prediction function).
It is desirable to have an algorithm with low classification error which also requires only a small number of genes. In this paper we suggest a technique for microarray classification which is based on the recently proposed [6] potential support vector machine (P-SVM) for feature selection and a ν-support vector machine (ν-SVM [11]) for classification. In contrast to a conventional SVM, which expands the prediction function into a sparse set of data points (the support vectors), the P-SVM expands the prediction function into a sparse set of features (the support features). The dual objective function of the P-SVM can be efficiently solved by an SMO technique [9]. The gene selection protocol used in this work is partly based on the protocol described in [5]; however, it uses a different ranking scheme and includes mechanisms to automatically adjust all hyper-parameters for a given dataset. This paper is structured as follows: first, the P-SVM for feature selection is briefly reviewed; then the gene selection and classification protocol is described in detail; finally, some benchmark results are given.

2. Review of the P-SVM

Most techniques for solving classification and regression problems have been focusing on vectorial data. However, for many datasets a vector-based description is suboptimal, and other representations like dyadic data ([8],[10],[7]), which are based on relationships between objects, are more appropriate. The P-SVM is a recently introduced ([5],[6],[9]) machine learning method for classification, regression and feature selection. It can be used in two modes: either on vectorial data, where a kernel function is applied to each pair of feature vectors to yield a Gram matrix, or on dyadic data. Dyadic data ([8],[10],[7]) describe the relation between a set of row objects and a set of column objects, e.g. similarity or dissimilarity matrices. In the context of the P-SVM, this relation is represented by a kernel between row objects and column objects. Note that a measured data matrix can be interpreted as such a kernel matrix, where the row objects correspond to features and the column objects to examples. The data matrix can then be interpreted as the result of a measurement kernel. In the context of microarray analysis, this kernel would measure the expression of a certain gene for a certain person. The P-SVM can directly work on dyadic data, since in contrast to standard support vector machine approaches ([11],[12]), its kernel matrix has to be neither positive definite nor square. If employed in vectorial mode, the P-SVM expands the prediction function into a sparse set of data points, the support vectors, like a conventional support vector machine. However, if the dyadic mode of the P-SVM is applied to a data matrix where the rows are the features and the columns the examples, the P-SVM expands the prediction function into a sparse set of features, extracting a small number of informative features from the set of all features. Once a set of support features is determined, it can be used as input to an arbitrary predictor. In this mode the P-SVM works as a feature selection method.

In the following, we briefly outline the mathematical formulation of P-SVM feature selection (for further details, see [5]). We consider a two-class classification task, where the m (d-dimensional) input vectors and class labels are summarized in the matrix X = (x_1, ..., x_m) and the vector y. The learning task is to select a classifier f with minimal risk R(f) from the set of classifiers f(x) = sgn(w·x + b), which are parameterized by the weight vector w and the offset b. Standardization (mean subtraction and dividing by the standard deviation) of the data leads to X1 = 0. The primal P-SVM optimization problem for feature selection can then be formulated as

    min_{w,b}  (1/2) ||X^T w||^2                                    (1)
    s.t.  X (X^T w + b1 − y) + ε1 ≥ 0
          X (X^T w + b1 − y) − ε1 ≤ 0

The corresponding dual problem can be derived as

    min_{α⁺,α⁻}  (1/2) (α⁺ − α⁻)^T X X^T (α⁺ − α⁻)
                 − y^T X^T (α⁺ − α⁻) + ε 1^T (α⁺ + α⁻)              (2)
    s.t.  0 ≤ α⁺,  0 ≤ α⁻,

where α = α⁺ − α⁻ denotes the Lagrange multipliers for the constraints and ε is a regularization parameter. The first term in Eq. (2) depends on the empirical covariance matrix of the features, while the second term captures the correlation between features and target. Therefore, a set of features will be selected which has low mutual correlation but high correlation to the target. The third term enforces the sparseness of the solution, which is controlled via ε. The non-zero components of α mark the support features. If XX^T is singular and w is not uniquely determined, ε enforces a unique solution. The value of ε implicitly controls the size of the set of support features: increasing ε increases the sparsity of the solution, which means that fewer features will be selected. Eq. (2) can be solved using a new sequential minimal optimization (SMO) technique [9]. Using α = α⁺ − α⁻, the weight vector w and the offset b are given by

    w = α,    b = (1/m) Σ_{i=1}^{m} y_i                             (3)

The resulting classifier is then given by

    f(x) = sgn(w·x + b) = sgn( Σ_{j=1}^{n} α_j (x·e_j) + b ).       (4)
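In the paper, the dual (2) is solved with the SMO technique of [9]. For illustration only, the sketch below (ours, not the authors' implementation) attacks the same dual with plain coordinate descent: in terms of β = α⁺ − α⁻, Eq. (2) is an ℓ1-regularized quadratic problem, (1/2) βᵀXXᵀβ − (Xy)ᵀβ + ε||β||₁, so each coordinate update is a soft-thresholding step. The function name psvm_select and the toy data are hypothetical.

import numpy as np

def psvm_select(X, y, eps, n_sweeps=200):
    # Toy coordinate-descent solver for the P-SVM dual, Eq. (2).
    # X: (d, m) matrix of standardized genes (rows) x samples (columns);
    # y: (m,) labels in {-1, +1}; eps: sparsity parameter epsilon.
    # Returns w = alpha+ - alpha- (one entry per gene; non-zeros mark
    # the support features).
    K = X @ X.T                 # feature-feature matrix X X^T (first term of Eq. 2)
    c = X @ y                   # feature-target correlations (second term of Eq. 2)
    d = np.clip(np.diag(K), 1e-12, None)
    beta = np.zeros(X.shape[0])
    for _ in range(n_sweeps):
        for j in range(beta.size):
            # partial residual with coordinate j taken out of the fit
            r = c[j] - K[j] @ beta + K[j, j] * beta[j]
            # soft-thresholding induced by the eps * 1^T(alpha+ + alpha-) term
            beta[j] = np.sign(r) * max(abs(r) - eps, 0.0) / d[j]
    return beta

# usage on random toy data: larger eps selects fewer genes
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))   # 100 genes, 20 samples
X = (X - X.mean(1, keepdims=True)) / X.std(1, keepdims=True)
y = np.sign(X[0] - X[1] + 0.1 * rng.standard_normal(20))
print("support features:", np.flatnonzero(psvm_select(X, y, eps=6.0)))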

3. Classification Based on P-SVM Gene Selection

In this work, a fully automated method for microarray data classification is suggested, whose outline is shown in Algorithm 1.

Algorithm 1 Microarray Classification Based on P-SVM Gene Selection
BEGIN PROCEDURE
  Standardize each gene to zero mean and unit variance
  Determine the ε-values ε(j), j = 1, ..., 4
  Set F_max = min(max(m/2, 15), 40)
  Initialize R(·) = 0 and C(·) = 0
  for all leave-one-out CV folds k do
    Training set Train(k), test point Test(k)
    Initialize empty list L
    for i = 1:4 do
      ε = ε(i)
      P-SVM feature selection on Train(k) using ε
      Find genes with non-zero α
      Find set of genes G not in L
      Rank genes in G: R(G) = R(G) + 4 − i + 1
      Sort G according to descending absolute value of α and append sorted G to L
      C(L) = C(L) + 1
    end for
    for F = 1:F_max + 3 do
      for ν ∈ {0.2, 0.3, 0.4, 0.5} do
        Train ν-SVM on Train(k) using ν and the first F features from L
        Compute prediction error error_k(ν, F) on Test(k)
        error(ν, F) = error(ν, F) + error_k(ν, F)
      end for
    end for
  end for
  Replace error(ν, F) by the average (1/7) Σ_{i=−3}^{3} error(ν, F + i) for F = 4, ..., F_max
  Select optimal values F_opt and ν_opt with minimum error
  Calculate the final ranking R_f of the genes using C and R
  Select the first F_opt genes from R_f
  Train ν-SVM on all samples using the F_opt selected genes and the hyper-parameter ν_opt
END PROCEDURE

As a preprocessing step, the values for each gene are standardized to zero mean and unit variance across all training samples. The predictor consists of P-SVM feature selection as a filter method followed by a ν-SVM with linear kernel, which uses only the genes selected by the P-SVM. The predictor itself is embedded into a robust cross-validation (CV) framework used to determine optimal values for two hyper-parameters, the number of genes F and the parameter ν of the ν-SVM. These are selected by a grid search procedure, where estimates of the generalization error are computed via leave-one-out cross-validation for several candidate values lying on a 2D grid. The number of genes and the value of ν which yield minimal generalization error are selected and used for learning the final predictor on the whole training set.

The parameter ε of the P-SVM is not optimized; instead, the results for different values of ε within a pre-determined range are used in the ranking of the genes. The reason for this is that different values of ε yield different sets of genes: the genes obtained at a large value of ε can be interpreted as more informative than the genes which are (additionally) obtained at a small value of ε. Four different values of ε are determined automatically for each individual dataset using the following criteria: the smallest value of ε is set such that the number of obtained genes corresponds approximately to the number of samples in the dataset; the largest value of ε is chosen such that fewer than five genes are obtained; the two remaining ε-values are set to lie equally spaced between these two extrema (a sketch of this construction is given below).

On each training set Train(k) of the k-th leave-one-out cross-validation loop, a ranking of the genes is obtained in the following way: the highest rank (4) is assigned to genes obtained at the largest value of ε, reflecting the idea that these genes are most important for the prediction. The next highest rank (3) is given to genes additionally obtained at the next highest ε, and so on. Within each of these ranks the genes are sorted according to the absolute value of α. It was shown in [6] that the absolute value of the Lagrange multiplier α directly reflects the importance of a specific feature for the prediction, as it is proportional to the increase in empirical error if the feature is left out.
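A possible realization of the ε-grid criteria above, reusing the toy solver psvm_select from Section 2 (our construction, not the paper's; it assumes the number of selected genes is non-increasing in ε, and the search bounds are hypothetical):

import numpy as np

def choose_epsilons(X, y, lo=1e-3, hi=1e3, n_steps=30):
    # Sketch of the automatic epsilon grid of Algorithm 1 (our construction).
    m = X.shape[1]

    def smallest_eps_with_at_most(target):
        a, b = lo, hi                    # assumes gene count is non-increasing in eps
        for _ in range(n_steps):
            mid = 0.5 * (a + b)
            n_genes = np.count_nonzero(psvm_select(X, y, mid))
            if n_genes > target:
                a = mid                  # still too many genes: raise eps
            else:
                b = mid
        return b

    eps_small = smallest_eps_with_at_most(m)   # about one gene per sample
    eps_large = smallest_eps_with_at_most(4)   # fewer than five genes
    # ordered from largest to smallest, so that eps(1) yields the most
    # informative (smallest) gene set -- our reading of the ranking loop
    return np.linspace(eps_large, eps_small, 4)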
The above procedure yields an ordered list of genes for each training set of the leave-one-out loop, where the highest ranking genes (the ones with high ε and high |α|) appear at the top of the list. Using the top F genes from this sorted list, the linear ν-SVM is trained as classifier on the training set Train(k) of the leave-one-out loop. This is done for different numbers of genes F = 1, ..., F_max + 3 and for different values of ν ∈ {0.2, 0.3, 0.4, 0.5}. The predictions on the left-out samples Test(k) yield the leave-one-out error for each hyper-parameter combination. Since the leave-one-out error as a function of the number F of selected genes is noisy, the leave-one-out error for F is replaced by the average of the leave-one-out errors for F − 3, ..., F + 3. This average is evaluated only between 4 and F_max; therefore, the minimum possible number of genes F is 4. The value ν_opt and the number of genes F_opt yielding the lowest error are selected for the final classifier.
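The smoothing and selection step just described can be written compactly; the array layout below (summed leave-one-out errors indexed as error[v, F]) is our own, not the paper's:

import numpy as np

def select_hyperparameters(error, F_max, nus=(0.2, 0.3, 0.4, 0.5)):
    # error[v, F]: summed leave-one-out error for nu = nus[v] and the
    # top-F genes, F = 1, ..., F_max + 3 (column 0 unused),
    # so error has shape (4, F_max + 4).
    smoothed = np.full(error.shape, np.inf)
    for F in range(4, F_max + 1):
        # replace error(nu, F) by the average over F - 3, ..., F + 3;
        # evaluated only between 4 and F_max, hence F_opt >= 4
        smoothed[:, F] = error[:, F - 3:F + 4].mean(axis=1)
    v, F_opt = np.unravel_index(np.argmin(smoothed), smoothed.shape)
    return nus[v], F_opt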

After the leave-one-out loop, a final ranking of the genes is conducted, based on how often a gene was selected in the leave-one-out loop (this is denoted by the function C(·) in the algorithm). If two or more genes are selected equally often, these genes are ranked according to the values of ε. Concretely, for each leave-one-out cross-validation fold k, all genes which are obtained at the highest value of ε get assigned 4 points, those which are additionally obtained at the next highest value get assigned 3 points, and so on. The final score summed over all cross-validation folds (denoted by the function R(·) in the algorithm) determines the sub-ranking of those sets of genes which were selected the same number of times. These two rules yield the final ranking R_f. For the final predictor, the top F_opt genes are selected from the final ranking R_f, and the ν-SVM is trained using only the selected genes on all samples with the parameter ν_opt.

Implementation details

The method was implemented in Matlab. For the P-SVM, the implementation by Knebel et al. [9] was used, which utilizes an efficient SMO to solve the dual optimization problem of the P-SVM. For the ν-SVM, the LIBSVM implementation [3] was employed. Since the performance evaluation of the ICMLA'08 Automated Micro-Array Classification Challenge required probabilistic (or at least continuous) output, LIBSVM is used in a mode which provides probability estimates for the classes.

4. Experimental Results

In this section, the proposed algorithm (P-SVM Gene Selection, short P-SVM-GS) is compared to the example algorithm BLogReg (sparse logistic regression using Bayesian regularization [2]) provided by the organizers of the ICMLA'08 Automated Micro-Array Classification Challenge (gcc/projects/amcc/) on the two pre-processed benchmark datasets which were provided for development and testing purposes. The dataset Alon contains gene expression values of 40 tumor and 22 normal colon tissue samples [1], while the dataset Golub consists of data from patients with acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) [4]. The two datasets were divided into a training and a test set by the challenge organizers. However, the prediction performance might be influenced by the given training-test split, so we combined training and test sets and used leave-one-out cross-validation on each of these unified datasets to get a more robust measure of performance. The unified dataset Alon consists of 62 samples with 2000 genes, and the unified dataset Golub contains 72 samples with 7129 genes. Note that for P-SVM-GS the whole procedure shown in Algorithm 1 is carried out on each individual training fold of the leave-one-out cross-validation used to assess the generalization performance. An unbiased estimate of the generalization error is obtained using leave-one-out cross-validation. The experimental results are shown in Table 1.

Table 1. Comparison between BLogReg and the proposed method (P-SVM-GS)

              Alon                      Golub
              BLogReg    P-SVM-GS       BLogReg    P-SVM-GS
    TER
    BER
    µ_F
    σ_F

The first row shows the total error rate (TER) computed via leave-one-out cross-validation. The second row contains the balanced error rate (BER) under leave-one-out cross-validation, which is calculated as

    BER = (1/2) (FN/PC + FP/NC),

where FN (FP) denotes the number of false negative (false positive) classified samples and PC (NC) denotes the number of samples in the positive (negative) class. Usually, in classification problems the balanced error rate is preferred as error measure, since it corresponds to the average error rate per class and thus requires both a high sensitivity and a high specificity. If, in contrast, the total error rate is used and the classes are unbalanced, a classifier which assigns all examples to the larger class will achieve a good classification performance; however, then the specificity will be high and the sensitivity low, or vice versa. This is generally not desirable. µ_F and σ_F denote the mean and standard deviation of the selected number of genes.
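For concreteness, the two error measures can be computed from the pooled leave-one-out predictions as follows (a small helper of our own, with labels in {−1, +1}):

import numpy as np

def error_rates(y_true, y_pred):
    # Total error rate (TER) and balanced error rate (BER) as defined above.
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    pos, neg = y_true == 1, y_true == -1
    fn = np.count_nonzero(pos & (y_pred == -1))   # false negatives
    fp = np.count_nonzero(neg & (y_pred == 1))    # false positives
    ter = np.mean(y_true != y_pred)
    ber = 0.5 * (fn / pos.sum() + fp / neg.sum())
    return ter, ber

print(error_rates([1, 1, 1, -1], [1, -1, 1, 1]))  # (0.5, 0.666...)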
On the dataset Alon, P-SVM-GS outperformed BLogReg with respect to both total and balanced prediction error; however, BLogReg used on average fewer features than P-SVM-GS. On the dataset Golub, the prediction error of P-SVM-GS was only slightly better than that of BLogReg, and again BLogReg had the sparser prediction function in terms of the number of selected genes.

5. Summary

In this work we proposed P-SVM-GS, an algorithm for fully automated microarray classification and gene selection. The method makes use of the P-SVM for feature selection to select a sparse subset of genes and of a ν-SVM with probabilistic outputs as classifier. A leave-one-out cross-validation scheme is used to optimize the hyper-parameters of the classifier and to obtain a ranking of the genes based on several different criteria. The experiments conducted on two microarray datasets provide evidence that the method is able to achieve good prediction performance using a sparse set of selected genes.

Acknowledgments

This work was funded by the Bernstein Center for Computational Neuroscience Berlin (BMBF grant 01GQ0411).

References

[1] U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences of the United States of America, 96(12):6745–6750, 1999.
[2] G. C. Cawley and N. L. C. Talbot. Gene selection in cancer classification using sparse logistic regression with Bayesian regularization. Bioinformatics, 22(19):2348–2355, 2006.
[3] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at cjlin/libsvm.
[4] T. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov, H. Coller, M. L. Loh, J. Downing, M. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531–537, 1999.
[5] S. Hochreiter and K. Obermayer. Kernel Methods in Computational Biology, chapter Gene Selection for Microarray Data. MIT Press, Cambridge, Massachusetts, 2004.
[6] S. Hochreiter and K. Obermayer. Support vector machines for dyadic data. Neural Computation, 18(6):1472–1510, 2006.
[7] P. D. Hoff. Bilinear mixed-effects models for dyadic data. Journal of the American Statistical Association, 100(469):286–295, 2005.
[8] T. Hofmann, J. Puzicha, and M. Jordan. Learning from dyadic data. In M. Kearns, S. Solla, and D. Cohn, editors, Advances in Neural Information Processing Systems 11. The MIT Press, 1999.
[9] T. Knebel, S. Hochreiter, and K. Obermayer. An SMO algorithm for the potential support vector machine. Neural Computation, 20(1):271–287, 2008.
[10] H. Li and E. Loken. A unified theory of statistical analysis and inference for variance component models for dyadic data. Statistica Sinica, 12:519–535, 2002.
[11] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, 2002.
[12] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995.
