The Minimum Redundancy Maximum Relevance Approach to Building Sparse Support Vector Machines

Xiaoxing Yang, Ke Tang, and Xin Yao

Nature Inspired Computation and Applications Laboratory (NICAL), School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China
apricot@mail.ustc.edu.cn, ketang@ustc.edu.cn

The Centre of Excellence for Research in Computational Intelligence and Applications (CERCIA), School of Computer Science, The University of Birmingham, Birmingham B15 2TT, U.K.
x.yao@cs.bham.ac.uk

Abstract. Recently, building sparse SVMs has become an active research topic due to its potential applications in large-scale data mining tasks. One of the most popular approaches to building sparse SVMs is to select a small subset of training samples and employ them as the support vectors. In this paper, we explain that selecting the support vectors is equivalent to selecting a number of columns from the kernel matrix, and is equivalent to selecting a subset of features in the feature selection domain. Hence, we propose to use an effective feature selection algorithm, namely the Minimum Redundancy Maximum Relevance (MRMR) algorithm, to solve the support vector selection problem. The MRMR algorithm was then compared to two existing methods, namely the back-fitting (BF) and pre-fitting (PF) algorithms. Preliminary results showed that MRMR generally outperformed the BF algorithm while it was inferior to the PF algorithm in terms of generalization performance. However, the MRMR approach was extremely efficient and significantly faster than the two compared algorithms.

Keywords: Relevance, Redundancy, Sparse design, SVMs, Machine learning.

1 Introduction

As a relatively new class of learning algorithms, kernel methods have been extensively studied recently [1]. The underlying concept of kernel methods can be interpreted as solving learning tasks in a reproducing kernel Hilbert space (RKHS) induced by a kernel function. A kernel method can then usually be obtained by extending some traditional learning algorithm to the RKHS.
Typical kernel methods include the well-known support vector machines (SVM) [2], least squares support vector machines (LS-SVM) [3], proximal support vector machines (PSVM) [4], kernel Fisher discriminant analysis (KFDA) [5], etc. Among the existing kernel methods, SVM was the first to be invented, and is probably the most well-known as well. Suppose we have n training samples, {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where x_i is a d-dimensional sample vector and y_i ∈ {±1} is the

E. Corchado and H. Yin (Eds.): IDEAL 2009, LNCS 5788, pp. 184–190, 2009. © Springer-Verlag Berlin Heidelberg 2009
corresponding class label. The training algorithm of SVM seeks a decision hyperplane in the RKHS, which is expected to separate the samples of different classes. Given a sample x, the SVM adopts the decision function in Eq. (1) to classify it into either the +1 or the −1 class:

f(x) = \sum_{i=1}^{N_{sv}} \alpha_i y_i k(x, x_i) + b    (1)

where k(x, x_i) is a predefined function, usually known as the kernel function, that defines the inner product of x and x_i in the RKHS; α_i and b are parameters of the classifier that are computed during the training episode; and N_sv is the number of support vectors. In the SVM literature, a support vector is a training sample that corresponds to a non-zero α_i (the α_i's always take non-negative values). Although SVM has been proven to be an effective learning approach for real-world tasks, its computational cost is relatively high, and hence its application to many data mining tasks is often prohibitive. The reasons are two-fold. First, the training episode of SVM involves solving a quadratic programming (QP) problem. Briefly speaking, this procedure is carried out on a kernel matrix K:

K = \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{bmatrix}    (2)

where K_ij can be calculated by k(x_i, x_j). That means K is an n-by-n matrix, and thereby the computational complexity of solving the QP problem increases quadratically with the number of samples. Second, Eq. (1) demonstrates that the computational complexity of classifying a sample increases linearly with the number of support vectors. Since the number of support vectors typically increases linearly with the number of samples, the time required for classifying a new sample (i.e., the testing phase) also increases linearly with the number of samples. Due to the above drawbacks of SVM, a lot of work has been conducted to reduce its computational cost. In particular, most approaches aim at reducing the number of support vectors of the final SVM. Such an SVM classifier is referred to as a sparse SVM in the literature.
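As a concrete illustration of Eq. (1), the following sketch (ours, not code from the paper; the function names and toy parameters are hypothetical) evaluates the decision function with an RBF kernel. The loop makes explicit why the testing cost grows linearly with the number of support vectors:

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # k(x, z) = exp(-gamma * ||x - z||^2), the RBF kernel used later in Eq. (8)
    return np.exp(-gamma * np.sum((x - z) ** 2))

def svm_decision(x, support_vectors, alphas, labels, b, gamma=1.0):
    # Eq. (1): f(x) = sum_i alpha_i * y_i * k(x, x_i) + b
    # One kernel evaluation per support vector -> cost is O(N_sv).
    return sum(a * y * rbf_kernel(x, sv, gamma)
               for a, y, sv in zip(alphas, labels, support_vectors)) + b

# Toy example with two support vectors in 2-D (values chosen for illustration)
svs = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
alphas = [0.5, 0.5]
labels = [+1, -1]
f = svm_decision(np.array([0.0, 0.0]), svs, alphas, labels, b=0.0)
```

A query point sitting on the positive support vector yields f > 0, while one on the negative support vector yields f < 0, as expected.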
The existing approaches for building sparse SVMs can be categorized into three groups. The first group of methods solves the SVM first to get a decision function in the form of Eq. (1). Then, they try to use fewer support vectors to approximate this decision function. For example, the reduced set (RS) method proposed by Burges [6][7] utilizes a gradient descent algorithm to seek a set of virtual samples for the approximation. This type of method can only make the testing phase of SVM more efficient, while it requires a higher computational cost for training. The second type of method aims at directly finding a small set of virtual support vectors that provides good generalization performance. A representative of this group is the sparse kernel learning algorithm (SKLA) proposed by Wu et al. [8]. Given a predefined number of support vectors, the SKLA directly uses a gradient descent algorithm to search for the optimal vectors that minimize the objective function of SVM. As in the first type of methods, the final support vectors are not necessarily training samples, but can be any virtual data points. The difference is that SKLA does not require solving the QP problem beforehand, and thus reduces the training cost. Although
finding virtual support vectors has been shown to be effective, it costs too much time to get a good solution. Hence, quite a few researchers have tried to accelerate SVM without seeking virtual support vectors [9][10][11][12][13][14]. All the third-type methods follow such a principle. Typically, these methods aim at seeking a small set of training samples, the size of which is smaller than the actual number of support vectors obtained by solving the whole QP problem, but which is still sufficient for achieving good generalization performance. With this purpose in mind, the general methodology is to select a subset of training samples without directly solving the QP problem. After the selection process, a new QP problem of much smaller size (typically determined by the size of the sample subset) can be formulated, and the optimal α_i's are then computed by solving the new QP problem. Since this type of method is usually easier to implement and may not suffer from as many numerical problems, it seems to be more popular than the gradient descent-based methods. For example, Keerthi et al. [13] proposed two algorithms named the back-fitting (BF) and pre-fitting (PF) approaches, respectively. In this paper, we propose an approach to selecting a training sample subset as the support vectors of SVM. Specifically, we suggest that the selection of support vectors for SVM is equivalent to the selection of features in the feature selection domain. Since feature selection problems have been intensively investigated in the pattern recognition literature, we believe that a good feature selection algorithm can be readily applied to select support vectors. Following this idea, we employed the well-known Minimum Redundancy Maximum Relevance (MRMR) feature selection algorithm [15] to address the support vector selection problem, and compared it with two state-of-the-art support vector selection algorithms, the BF and PF algorithms.
Preliminary results showed that MRMR generally outperformed the BF algorithm while it was inferior to the PF algorithm in terms of generalization performance. However, the MRMR approach was significantly faster than the two compared algorithms. The rest of this paper is organized as follows. In Section 2, we elaborate the equivalence between support vector selection and feature selection, and introduce the MRMR algorithm in detail. Section 3 presents our preliminary experimental study. Section 4 concludes the paper and discusses potential directions for future work.

2 Select Support Vectors Using MRMR Algorithm

From Eq. (1), we may find that the decision function of SVM can be written in the form

f(x) = \beta^T \tilde{K} + b    (3)

where β = [α_1 y_1, α_2 y_2, ..., α_{N_sv} y_{N_sv}]^T and \tilde{K} = [k(x, x_1), ..., k(x, x_{N_sv})]^T. In the feature selection domain, we choose a subset of features so as to better describe the relationship between the features and the label. For the linear case, we want to get the following expression:

g(x) = w^T x + b    (4)
where x = [x_1, ..., x_sub] with x_i as the selected feature values, w = [w_1, ..., w_sub] with w_i as the corresponding weights, and b is the bias term. Clearly, if this expression describes the data well, then we can say that the features are good representatives. We can see the similarity between Eq. (3) and Eq. (4). That means the kernel matrix K can be viewed as a new data matrix, in which each training sample x_i corresponds to a row of K. If we view the i-th row of K as the mapping of sample x_i into a new space, then its j-th feature is obtained by calculating the value of k(x_i, x_j). From this viewpoint, when we try to reduce the number of support vectors, we only need to reduce the number of columns of the kernel matrix K, while keeping the number of rows unchanged. After selecting N_sv appropriate columns of K, calculating the corresponding α_i's is equivalent to seeking the optimal linear decision function with respect to the data lying in the N_sv-dimensional space. Next, we introduce the MRMR approach for selecting the columns of K. Let K_i denote the i-th column of the n-by-n kernel matrix K. MRMR starts by calculating the relevance score of each K_i, which is

F_i = \left[ \sum_c n_c (\bar{K}_{i,c} - \bar{K}_i)^2 \right] / \sigma^2    (5)

where n_c is the number of training samples of the c-th class, \bar{K}_i is the mean value of the column K_i, and \bar{K}_{i,c} is the mean value of K_i within the c-th class. σ² is the pooled variance, which can be calculated by σ² = [\sum_c (n_c - 1)\sigma_c^2] / (n - C), where σ_c² is the variance of K_i within the c-th class and C is the number of classes. F_i is the F-statistic between the i-th column and the labels, and it is equivalent to the t-statistic in the case of binary classification. Based upon Eq. (5), the relevance score of a subset of columns (say G) can be calculated by

R_F = \frac{1}{|G|} \sum_{i \in G} F_i    (6)

Accordingly, MRMR measures the redundancy of G by

R_{off} = \frac{1}{|G|^2} \sum_{i,j \in G} off(i, j)    (7)

where off(i, j) is the Pearson correlation coefficient between the i-th and the j-th columns of K.
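The equivalence between Eq. (3) and Eq. (4) can be made concrete: once a subset S of kernel-matrix columns has been chosen, fitting the coefficients β is an ordinary linear problem over the reduced data matrix K[:, S]. The sketch below is ours, not the paper's code; a ridge-regularized least-squares fit stands in for the Newton solver the authors actually use, and all names and constants (e.g. C_reg) are hypothetical:

```python
import numpy as np

# Toy linearly-separable data (40 points in 2-D)
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)

# Full n-by-n RBF kernel matrix: each row is a sample mapped into the new space
gamma = 1.0
K = np.exp(-gamma * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))

# Keep all rows, but only the columns indexed by S (the chosen "support vectors")
S = list(range(0, 40, 5))
Phi = K[:, S]                     # n-by-|S| reduced data matrix

# Solving for beta in Eq. (3) is now a linear fit in the |S|-dimensional space
# (ridge least squares here; the paper uses a Newton method instead)
C_reg = 1e-3
beta = np.linalg.solve(Phi.T @ Phi + C_reg * np.eye(len(S)), Phi.T @ y)

pred = np.sign(Phi @ beta)        # f(x_i) = beta^T K~_i  (bias omitted for brevity)
accuracy = (pred == y).mean()
```

The point of the sketch is structural: reducing support vectors only shrinks the number of columns of K, after which the classifier is a plain linear model as in Eq. (4).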
With these definitions, MRMR aims at finding the subset of columns with the maximum value of R_F and the minimum value of R_off. In the original literature [15], this is typically done by seeking the subset with the largest R_F / R_off or the largest R_F − R_off. In this paper, we introduce another parameter into MRMR: we seek the subset with the largest R_F − λR_off, where λ is a predefined parameter which controls the trade-off between relevance and redundancy. MRMR employs a sequential forward selection scheme to search for the optimal column subset.
First, the column corresponding to the largest F_i is selected according to Eq. (5). After that, the remaining columns are selected one by one. At each iteration, the previously unselected columns are evaluated according to Eqs. (6) and (7), and the one yielding the largest R_F − λR_off is selected.

3 Experimental Study

To evaluate the efficacy of the MRMR algorithm for building sparse SVMs, we carried out experimental studies on seven data sets from the UCI Repository [16], namely Australian, Monks-, Heart, Mammographic, Wdbc, Hill-valley, and Promoters. Since MRMR is by nature a selection-based algorithm, we compared it to two state-of-the-art selection-based algorithms, namely the BF and PF algorithms. The radial basis function was used as the kernel function:

k(x_i, x_j) = \exp(-\gamma \| x_i - x_j \|^2)    (8)

To compare the three methods fairly, we implemented all three methods ourselves. The Newton method used in [13] was employed in our implementation to obtain the solution (i.e., the coefficients β) of the SVM. All parameters were tuned using 5-fold cross-validation. We tuned the parameter γ and the regularization parameter C with the selection rate equal to 1 (i.e., using all the data), and then tuned the parameter λ with the selection rate equal to 10% using our selection method. For each data set, we conducted 5-fold cross-validation 10 times. In each run, the three algorithms were applied separately to select 4% of the training samples to build sparse SVMs. Then, the classification accuracy of the sparse SVMs built from the three subsets was evaluated. The average accuracies and the standard deviations are presented in Table 1. Furthermore, the Wilcoxon signed-rank test with significance level 0.05 was conducted, and its results are also presented in Table 1. From Table 1, we may find that MRMR outperformed the BF algorithm on 2 data sets, while no significant difference was observed on the other 5 data sets.
In comparison with the PF algorithm, MRMR was inferior on 3 data sets, while it achieved comparable performance on the other 4 data sets. In summary, MRMR generally outperformed the BF algorithm in terms of classification accuracy, but was generally inferior to the PF algorithm. Since the major significance of constructing sparse SVMs is to extend their application to large data sets, the computational cost required to obtain the sparse SVM is also of great importance. Table 2 summarizes the average CPU time required by the three compared algorithms to select support vectors on the 7 data sets. It can be observed that MRMR is the most efficient of the three: the runtimes of the BF and PF algorithms are at least 10 times and 100 times that of MRMR, respectively. To summarize, the experimental studies demonstrated that MRMR is clearly better than the BF algorithm. In comparison with the PF algorithm, MRMR may lead to sparse SVMs with inferior generalization performance; however, it provides a significant computational advantage. Hence, MRMR can be viewed as a potential alternative to the PF algorithm in the case of mining large-scale data sets.
Table 1. Average classification accuracy (%) of 10 runs of 5-fold cross-validation procedures. The standard deviations are given in parentheses. The last two columns present the results of the Wilcoxon signed-rank test: 1 means MRMR outperformed the compared algorithm, 0 indicates that there was no significant difference between the two compared algorithms, and -1 means MRMR was outperformed by the compared algorithm.

Datasets       MRMR         PF           BF           MRMR vs. PF   MRMR vs. BF
Australian     86.5(.57)    86.6(.5)     86.63(.50)    0             0
Monks-         76.0(7.76)   9.09(5.4)    77.08(4.77)  -1             0
Heart          83.37(4.4)   8.83(4.53)   8.6(3.80)     0             1
Mammographic   80.64(3.75)  80.93(3.78)  80.83(3.94)   0             0
Wdbc           97.06(.59)   96.94(.44)   97.07(.44)    0             0
Hill-valley    57.89(5.68)  65.47(4.7)   5.3(4.)      -1             1
Promoters      6.88(4.9)    67.88(0.8)   66.3(0.8)    -1             0

Table 2. Runtime (seconds) of the three compared algorithms on the seven data sets

Datasets       MRMR     PF        BF
Australian     0.45     00.83     .865
Monks-         0.0548   .644      3.8000
Heart          0.0387   .8770     0.54
Mammographic   0.0978   57.046    6.6690
Wdbc           0.480    35.90     5.6759
Hill-valley    0.0093   58.63     5.76
Promoters      0.00     0.395     0.0464

4 Conclusion and Discussion

Sparse SVMs are usually obtained by selecting support vectors from the training samples. In this paper, we explained that support vector selection is equivalent to selecting a number of columns of the kernel matrix, and proposed to employ the MRMR algorithm to solve this problem. Experimental results indicated that the computational cost of MRMR is extremely low in comparison to two existing approaches, i.e., the back-fitting (BF) and pre-fitting (PF) algorithms. Furthermore, MRMR also outperformed the BF algorithm in terms of classification accuracy. Our current work can be extended in the future along two main directions. First, since MRMR is very efficient and the PF algorithm can lead to better generalization performance, it would be interesting to investigate the possibility of combining MRMR with the PF algorithm.
Such a combination might lead to novel algorithms with both good generalization performance and satisfactory computational efficiency. Second, the relevance and redundancy defined in the original MRMR algorithm might not suit the specific scenario of building sparse SVMs. Hence, it is necessary to seek alternative definitions to enhance the performance of MRMR in the case of building sparse SVMs.
Acknowledgement

This work was partially supported by the Fund for International Joint Research Program of Anhui Science and Technology Department (No. 0808070306) and a National Natural Science Foundation of China grant (No. 6080036).

References

1. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge (2004)
2. Burges, C.J.C.: A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2, 121–167 (1998)
3. Suykens, J.A.K., Vandewalle, J.: Least squares support vector machine classifiers. Neural Processing Letters 9, 293–300 (1999)
4. Fung, G., Mangasarian, O.L.: Proximal support vector machine classifiers. In: Proceedings of Knowledge Discovery and Data Mining, San Francisco, CA, pp. 77–86 (2001)
5. Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Smola, A.J., Mueller, K.-R.: Constructing descriptive and discriminative non-linear features: Rayleigh coefficients in kernel feature spaces. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(5), 623–628 (2003)
6. Burges, C.J.C.: Simplified support vector decision rules. In: Proceedings of the 13th International Conference on Machine Learning, Bari, Italy, pp. 71–77 (1996)
7. Burges, C.J.C., Schölkopf, B.: Improving the speed and accuracy of support vector learning machines. In: Advances in Neural Information Processing Systems, vol. 9, pp. 375–381. MIT Press, Cambridge (1997)
8. Wu, M., Schölkopf, B., Bakir, G.: A direct method for building sparse kernel learning algorithms. Journal of Machine Learning Research 7, 603–624 (2006)
9. Lee, Y., Mangasarian, O.L.: RSVM: reduced support vector machines. In: CD Proceedings of the First SIAM International Conference on Data Mining, Chicago (2001)
10. Lee, Y., Mangasarian, O.L.: SSVM: A smooth support vector machine. Computational Optimization and Applications 20, 5–22 (2001)
11. Lin, K., Lin, C.: A study on reduced support vector machines. IEEE Transactions on Neural Networks 14, 1449–1459 (2003)
12. Downs, T., Gates, K.E., Masters, A.: Exact simplification of support vector solutions. Journal of Machine Learning Research 2, 293–297 (2001)
13. Keerthi, S.S., Chapelle, O., DeCoste, D.: Building support vector machines with reduced classifier complexity. Journal of Machine Learning Research 7, 1493–1515 (2006)
14. Sun, P., Yao, X.: Greedy forward selection algorithms to sparse Gaussian process regression. In: Proceedings of the 2006 International Joint Conference on Neural Networks (IJCNN 2006), Vancouver, Canada, pp. 59–65 (2006)
15. Ding, C., Peng, H.: Minimum redundancy feature selection from microarray gene expression data. In: Proceedings of the Computational Systems Bioinformatics Conference, pp. 523–528 (2003)
16. UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html