Trading Convexity for Scalability


Ronan Collobert, NEC Labs America, Princeton NJ, USA
Jason Weston, NEC Labs America, Princeton NJ, USA
Léon Bottou, NEC Labs America, Princeton NJ, USA

This is an expanded version of the original document. The new appendix discusses previous works whose existence was not known to us. It also points out differences that we believe are significant enough to justify this report.

Abstract

Convex learning algorithms, such as Support Vector Machines (SVMs), are often seen as highly desirable because they offer strong practical properties, such as ease of use, and because they are amenable to theoretical analysis. In this work, however, we show how non-convexity can still provide advantages over convexity, especially in terms of scalability. We show how the Concave-Convex Procedure can be applied to produce (i) SVMs where training errors are no longer support vectors, and (ii) much faster Transductive SVMs.

1 Introduction

Machine learning algorithms based on convex optimization procedures have become increasingly popular in the last decade, partly because of the success of Support Vector Machines (SVMs) [21]. SVMs are popular because they span two worlds: the world of applications (they have good empirical performance, ease of use and computational efficiency) and the world of theory (they can be analysed and bounds can be produced). Convexity is largely responsible for these qualities. Convex algorithms are also better understood and easier to tune than famous non-convex algorithms such as Multi-Layer Perceptrons.

While it is clear that convex algorithms have many advantages, in this contribution we argue that non-convex algorithms should perhaps not be buried too quickly. Although their theoretical foundations are more challenging, they can achieve very interesting results in practice. Convex algorithms have a single solution, but this solution might not be the best one. To expound on this point, we leverage a smart non-convex optimization technique to show that non-convexity has some benefits. The Concave-Convex Procedure (CCCP) [23, 20] solves a sequence of convex problems and has no difficult parameters to tune. We apply it in two different ways: (i) SVMs where training errors are no longer support vectors; and (ii) large scale Transductive SVMs. The former brings increased sparsity, which can lead to better scaling properties for SVMs.

The latter provides what we believe is the best known method for implementing Transductive SVMs, with a scaling of (L + U)^2, where L + U is the total number of labeled and unlabeled training examples. This is in stark contrast to convex versions of transduction, which scale at least like (L + U)^4, and to Joachims' heuristic approach in SVMLight, which can only deal with a few thousand examples.

2 The Concave-Convex Procedure

Minimizing a non-convex cost function is usually difficult. Gradient descent techniques, such as conjugate gradient descent or stochastic gradient descent, often involve delicate hyper-parameters [9]. In contrast, convex optimization seems much more straightforward. For instance, the SMO algorithm [14] locates the SVM solution efficiently and reliably.

We propose instead to optimize non-convex problems using the Concave-Convex Procedure (CCCP) [23]. The CCCP procedure is closely related to the Difference of Convex (DC) methods that have been developed by the optimization community during the last two decades [20]. Such techniques have already been applied for dealing with missing values in SVMs [17] and for improving boosting algorithms [8].

Assume that a cost function J(θ) can be rewritten as the sum of a convex part J_vex(θ) and a concave part J_cav(θ). Each iteration of the CCCP procedure (Algorithm 1) approximates the concave part by its tangent and minimizes the resulting convex function.

Algorithm 1: The Concave-Convex Procedure (CCCP)
    Initialize θ^0 with a best guess.
    repeat
        θ^{t+1} = argmin_θ [ J_vex(θ) + J'_cav(θ^t) · θ ]    (1)
    until convergence of θ^t

One can easily see that the cost J(θ^t) decreases after each iteration by summing two inequalities resulting from (1) and from the concavity of J_cav(θ):

    J_vex(θ^{t+1}) + J'_cav(θ^t) · θ^{t+1}  ≤  J_vex(θ^t) + J'_cav(θ^t) · θ^t    (2)
    J_cav(θ^{t+1})  ≤  J_cav(θ^t) + J'_cav(θ^t) · (θ^{t+1} − θ^t)    (3)

The convergence of CCCP has been shown [23] by refining this argument. No additional hyper-parameters are needed by CCCP. Furthermore, each update (1) is a convex minimization problem and can be solved using classical and efficient convex algorithms.
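As a small illustration (our own, not from the paper), Algorithm 1 can be sketched in a few lines of Python; the helper names minimize_vex and grad_cav are placeholders for a solver of the convex sub-problem (1) and a (super-)gradient of the concave part.

# Minimal CCCP sketch (Algorithm 1), assuming the caller supplies:
#   minimize_vex(linear): returns argmin_theta J_vex(theta) + linear . theta
#   grad_cav(theta):      returns a (super-)gradient of J_cav at theta
import numpy as np

def cccp(theta0, minimize_vex, grad_cav, tol=1e-6, max_iter=100):
    """Iterate theta_{t+1} = argmin_theta J_vex(theta) + J'_cav(theta_t) . theta."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        linear = grad_cav(theta)           # tangent of the concave part at theta_t
        theta_next = minimize_vex(linear)  # convex sub-problem, update (1)
        if np.linalg.norm(theta_next - theta) < tol:
            return theta_next
        theta = theta_next
    return theta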

3 Non-Convexity for Classification with SVMs

This section describes the curse of dual variables: in classical SVMs, the number of support vectors increases linearly with the number of training examples. The curse can be exorcised by replacing the classical Hinge Loss with a non-convex loss function, the Ramp Loss. The resulting non-convex problem can be solved using CCCP.

Notation  In the rest of this paper we consider two-class classification problems. We are given a training set (x_l, y_l), l = 1, ..., L, with (x_l, y_l) ∈ R^n × {−1, 1}. SVMs have a decision function f_θ(·) of the form f_θ(x) = w · Φ(x) + b, where θ = (w, b) are the parameters of the model and Φ(·) is the chosen feature map, often implemented implicitly using the kernel trick [21]. The standard SVM algorithm involves the minimization of

    min_θ  (1/2) ||w||^2 + C Σ_{l=1}^{L} H_1(y_l f_θ(x_l)),    (4)

where the function H_s(z) = max(0, s − z) is the Hinge Loss (Figure 1, center).

Figure 1: The Ramp Loss function (left) can be decomposed into the sum of the convex Hinge Loss (middle) and a concave loss (right).

Number of SVs  The SVM solution w is a sparse linear combination of the training examples Φ(x_l), called support vectors (SVs). Recent results [18] prove that the number of SVs k scales linearly with the number of examples. More specifically,

    k / L → 2 B_Φ    (5)

where B_Φ is the best possible error achievable with an SVM using the Φ mapping. Since SVM training and recognition times grow quickly with the number of SVs, it appears obvious that SVMs cannot deal with very large datasets. In the following we show how changing the cost function in SVMs cures the problem and leads to a non-convex problem solvable by CCCP.

Ramp Loss  The SV scaling property (5) is not surprising because all training errors become SVs. This is in fact a property of the Hinge Loss function. Assume for simplicity that the Hinge Loss is made differentiable with a smooth approximation on a small interval [1 − ε, 1 + ε] near the hinge point. Differentiating (4) shows that the minimum w must satisfy

    w = −C Σ_{l=1}^{L} y_l H'_1(y_l f_θ(x_l)) Φ(x_l).    (6)

Writing z = y_l f_θ(x_l), we have H'_1(z) = −1 for z < 1 − ε, so all examples located inside the SVM margin (z < 1 − ε) or simply misclassified (z < 0) become SVs. Examples lying on the flat area (z > 1 + ε) do not become SVs because the derivative H'_1(z) is then zero.

We propose to avoid converting some misclassified examples (or margin examples) into SVs by making the loss function flat for scores z smaller than a predefined value s < 1. We thus introduce the Ramp Loss (Figure 1, left):

    R_s(z) = H_1(z) − H_s(z).    (7)

Replacing H_1 by R_s in (6) guarantees that examples with score z < s do not become SVs. This intuitive argument assumes the loss function is differentiable. Rigorous proofs can be written by expressing (4) as a constrained minimization problem with slack variables and considering the Karush-Kuhn-Tucker (KKT) conditions. This is well known for SVMs using the Hinge Loss [21]. It still works for the Ramp Loss: although it leads to a non-convex minimization problem, the KKT conditions remain necessary (but not sufficient) optimality conditions [5]. Due to lack of space we omit this easy but tedious derivation.
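As a quick illustration of our own, the Hinge Loss of (4) and the Ramp Loss of (7) can be written down directly; the function names below are ours, not the paper's.

# Hinge Loss of (4) and Ramp Loss of (7); the score is z = y * f(x).
import numpy as np

def hinge(z, s=1.0):
    """H_s(z) = max(0, s - z)."""
    return np.maximum(0.0, s - z)

def ramp(z, s=-1.0):
    """R_s(z) = H_1(z) - H_s(z): a clipped hinge, flat for z < s."""
    return hinge(z, 1.0) - hinge(z, s)

z = np.linspace(-3.0, 3.0, 7)
print(ramp(z, s=-1.0))   # [2. 2. 2. 1. 0. 0. 0.]: constant below s, zero above 1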

Optimization  Decomposition (7) makes the Ramp Loss amenable to CCCP optimization. The new cost J^s(θ) can be decomposed as follows:

    J^s(θ) = (1/2) ||w||^2 + C Σ_{l=1}^{L} R_s(y_l f_θ(x_l))
           = (1/2) ||w||^2 + C Σ_{l=1}^{L} H_1(y_l f_θ(x_l))  −  C Σ_{l=1}^{L} H_s(y_l f_θ(x_l)),    (8)

where the first two terms form the convex part J^s_vex(θ) and the last term is the concave part J^s_cav(θ). For simplification purposes, we introduce the notation

    β_l = y_l ∂J^s_cav(θ)/∂f_θ(x_l) = C  if y_l f_θ(x_l) < s,  and 0 otherwise.    (9)

The convex optimization problem (1) that constitutes the core of the CCCP algorithm is easily reformulated in dual variables using the standard SVM technique. This yields the following algorithm.¹

Algorithm 2: CCCP for SVMs using the Ramp Loss function
    Initialize w^0 = 0, b^0 = 0, β^0 = 0.
    repeat
        Solve the following convex problem (with K_lm = y_l y_m Φ(x_l) · Φ(x_m)):
            max_α  Σ_l α_l − (1/2) (α − β^t)^T K (α − β^t)
            subject to  y · (α − β^t) = 0  and  0 ≤ α_l ≤ C.
        Compute  w^{t+1} = Σ_{l=1}^{L} y_l (α_l − β^t_l) Φ(x_l).
        Compute  b^{t+1} using  0 < α_l < C  ⇒  y_l (w^{t+1} · Φ(x_l) + b^{t+1}) = 1.
        Compute  β^{t+1}_l = C  if y_l f_{θ^{t+1}}(x_l) < s,  and 0 otherwise.
    until β^{t+1} = β^t

Convergence in a finite number of iterations t* is guaranteed because the variable β can only take a finite number of distinct values, because J(θ^t) is decreasing, and because inequality (3) is strict unless β remains unchanged. It is also easy to see that examples x_l whose final score z = y_l f_{θ^{t*+1}}(x_l) is smaller than s cannot be SVs: the algorithm termination condition indeed implies β^{t*}_l = β^{t*+1}_l = C, which leads to α^{t*+1}_l = C using the KKT conditions of the convex optimization step, and hence to a vanishing coefficient α^{t*+1}_l − β^{t*+1}_l in w^{t*+1}.

¹ Note that R_s(z) is non-differentiable at z = s. It can be shown that CCCP remains valid when using any super-derivative of the concave function. Alternatively, the function R_s(z) could be made smooth in a small interval [s − ε, s + ε], as in our argument about the reduced number of SVs.
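The outer loop of Algorithm 2 could be organized as in the following sketch (ours, not the paper's implementation). The convex step is delegated to a hypothetical solve_convex_svm helper, and only the β update (9) and the stopping test are shown; gram denotes the plain kernel Gram matrix, not the label-weighted matrix K of Algorithm 2.

# Sketch of the CCCP outer loop for the Ramp Loss SVM (Algorithm 2).
# solve_convex_svm(gram, y, beta, C) is assumed to solve the convex sub-problem
# and return the dual variables alpha and the bias b.
import numpy as np

def cccp_ramp_svm(gram, y, C, s, solve_convex_svm, max_iter=50):
    L = len(y)
    beta = np.zeros(L)
    alpha, b = np.zeros(L), 0.0
    for _ in range(max_iter):
        alpha, b = solve_convex_svm(gram, y, beta, C)   # convex SVM step
        # f(x_l) = sum_m y_m (alpha_m - beta_m) k(x_m, x_l) + b
        f = gram @ (y * (alpha - beta)) + b
        beta_next = np.where(y * f < s, C, 0.0)         # update (9)
        if np.array_equal(beta_next, beta):             # stop when beta is stable
            break
        beta = beta_next
    return alpha, beta, b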

Experiments  Table 1 presents an experimental comparison of the Hinge Loss and the Ramp Loss using the RBF kernel Φ(x_l) · Φ(x_m) = exp(−γ ||x_l − x_m||^2). We chose two remarkable values for the Ramp Loss parameter s:

    s = −1   Symmetric loss: outliers (z < −1) cannot be SVs.
    s = 0    Asymmetric loss: misclassified examples (z < 0) cannot be SVs.

Dataset     Notes
Waveform¹   Artificial data, 21 dims.
Banana¹     Artificial data, 2 dims.
USPS+N²     Class 0 vs. rest, with 10% training label noise.
Adult²      As in [14].
¹ raetsch/data/index.html
² ftp://ftp.ics.uci.edu/pub/machine-learning-databases

Table 1: Comparison of SVMs using the Hinge Loss (H_1) and two instances of the Ramp Loss (R_{−1} and R_0): test error rates (Error) and number of SVs. Results are averaged over 10 random choices of training-testing sets for Waveform and Banana, and 10 random choices for USPS+N (USPS corrupted with 10% label noise) and Adult. All hyper-parameters were chosen for each model using a cross-validation technique.

Figure 2: Results on USPS+N (first row) and Adult (second row). We show the number of SVs (left) and the test error (middle) as a function of the number of training examples, using the Hinge Loss H_1 and the Ramp Losses R_{−1} and R_0. The right plots show the test error as a function of the number of SVs obtained with various values of the Ramp Loss parameter s, printed along the curve. All results are averaged over 10 random training-test splits. The Ramp Loss achieves similar or better generalization performance with far fewer SVs.

Figure 2 reports the evolution of the number of SVs as a function of the number of training examples. Whereas the number of SVs in classical SVMs increases linearly, the number of SVs in Ramp Loss SVMs increases like the square root of the number of examples. The right plots in Figure 2 report the generalization performance as a function of the Ramp Loss parameter s. Clearly s should be considered as a hyper-parameter and selected for both speed and accuracy. The best value of s leads to slightly improved accuracies.

Discussion  The excessive number of SVs in SVMs has long been recognized as one of the main flaws of this otherwise elegant algorithm.

Figure 3: The objective function with respect to the number of iterations of the CCCP outer loop, on USPS+N (left) and Adult (right), averaged over 10 random choices of training-test sets.

Many methods to reduce the number of SVs (e.g. [15, 18]) are post-training methods that only improve efficiency during the test phase. A non-convex sigmoid loss was also proposed in [13], but using a comparatively inefficient reweighted least-squares training method. Our method can be used to find sparse solutions that give both training and testing time speed-ups.

The experiments we have detailed are not faster to train than a normal SVM because the first iteration (β = 0) corresponds to initializing CCCP with the classical SVM solution. Few subsequent iterations are necessary (Figure 3), and they run much faster because of the smaller number of SVs. The resulting training time was less than twice the SVM training time for both datasets. The full SVM initialization is actually not necessary: we can initialize CCCP with an SVM trained on a subset of the training examples (e.g. 1/10th), update β on all examples, and then proceed as before. Using this scheme we obtain similar accuracies (we could find a different local minimum) with a three-fold to four-fold speedup over SVMs. More research should be done in this direction.

4 Fast Non-Convex Transductive SVMs

The same non-convex optimization techniques can be used to perform large scale semi-supervised learning. Transductive SVMs [21] seek large margins for both the labeled and unlabeled examples, in the hope that the true decision boundary lies in a region of low density (implementing the so-called cluster assumption, see e.g. [4]). This was first implemented in [1], using an integer programming method that is intractable for large problems. Joachims [7] then proposed a combinatorial approach, known as SVMLight TSVM, that is practical for a few thousand examples. Chapelle [4] proposed a primal method, which improves generalization performance but scales as (L + U)^3. Other methods [2, 22] transform the non-convex transductive problem into a convex semi-definite program that scales as (L + U)^4 or worse.

CCCP for transduction  We propose to solve the transductive SVM problem using CCCP. Given an unlabeled example x and its prediction z = f_θ(x), the SVMLight TSVM assigns a Hinge Loss to the labeled examples (Figure 1, center) and a Symmetric Hinge Loss to the unlabeled examples (Figure 4, left). Chapelle's method handles unlabeled examples with a smooth version of this loss (Figure 4, center). While we also use the Hinge Loss for labeled examples, we use the following loss for unlabeled examples:

    R_s(z) + R_s(−z),    (10)

where R_s is defined as in (7). When s = 0, this is equivalent to the Symmetric Hinge. When s < 0, we obtain a non-peaked loss function (Figure 4, right) which can be viewed as a simplification of Chapelle's loss function.
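As a quick check of our own, the unlabeled-example loss (10) can be written directly from the definition of H_s; the function names are ours, and the s = 0 case reproduces the Symmetric Hinge up to an additive constant.

# Loss (10) for an unlabeled example with score z = f(x):
# R_s(z) + R_s(-z), with R_s(z) = H_1(z) - H_s(z) and H_s(z) = max(0, s - z).
import numpy as np

def hinge(z, s):
    return np.maximum(0.0, s - z)

def unlabeled_loss(z, s=-0.3):
    """Symmetric Hinge (plus a constant) when s = 0; non-peaked when s < 0."""
    ramp = lambda u: hinge(u, 1.0) - hinge(u, s)
    return ramp(z) + ramp(-z)

print(unlabeled_loss(np.array([-2.0, 0.0, 2.0]), s=0.0))   # [1. 2. 1.]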

Figure 4: Three loss functions for unlabeled examples.

Figure 5: Test error of CCCP TSVMs with respect to the parameter s of the Ramp Loss, averaged over 10 trials (left); training time (middle) and error rates (right) compared with SVMLight TSVMs, using s = −0.3. The last two plots use a single trial.

Using the loss function (10) is equivalent to training a Ramp Loss SVM where each unlabeled example appears as two examples labeled with both possible classes. We could then use Algorithm 2 without changes. Unfortunately, transductive SVMs perform badly without a balancing constraint on the unlabeled examples. For comparison purposes, we chose to use the constraint proposed by [4]:

    (1/U) Σ_{l ∈ U} (w · Φ(x_l) + b) = (1/L) Σ_{l ∈ L} y_l,    (11)

where L and U are respectively the labeled and unlabeled examples. After some algebra, we end up using Algorithm 2 with a training set composed of the labeled examples and of two copies of the unlabeled examples labeled with both possible classes, and with an extra row and column added to the matrix K such that

    K_{0l} = K_{l0} = (1/U) Σ_{m ∈ U} Φ(x_m) · Φ(x_l)   for all l.    (12)

The Lagrangian variable α_0 corresponding to this column is unconstrained (it can be positive or negative) and β_0 = 0. Adding this special column can be achieved very efficiently by computing it only once, or by approximating the sum (12) using an appropriate sampling method.
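A rough sketch (ours, under the recipe above) of assembling the augmented training set and the extra balancing column (12): the function names and the RBF kernel are illustrative, the corner entry K_00, which the text does not specify, is simply left at zero here, and the label weighting of K and the unconstrained α_0 are left to the solver.

# Assemble the CCCP TSVM training set: labeled points, two oppositely-labeled
# copies of each unlabeled point, and the extra balancing row/column of (12).
import numpy as np

def rbf_gram(X, Z, gamma):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def build_tsvm_problem(X_lab, y_lab, X_unl, gamma):
    X = np.vstack([X_lab, X_unl, X_unl])          # each unlabeled point appears twice
    y = np.concatenate([y_lab,
                        np.ones(len(X_unl)),      # copy labeled +1
                        -np.ones(len(X_unl))])    # copy labeled -1
    gram = rbf_gram(X, X, gamma)
    # Column (12): mean kernel value against the unlabeled set, for every example.
    extra = rbf_gram(X, X_unl, gamma).mean(axis=1)
    gram = np.block([[gram, extra[:, None]],
                     [extra[None, :], np.zeros((1, 1))]])  # corner entry: set to 0 (assumption)
    return gram, y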

Experiments  We performed experiments using Chapelle's setup [4], with 2000 examples from the real-world mac versus mswindows task of the Newsgroup20 dataset. With an RBF kernel, Chapelle reports a best test error of 5.71% using their transduction method. We obtained exactly the same result with the CCCP algorithm. This result should be compared to the 7.44% obtained with SVMLight TSVMs [7]. The CCCP algorithm was implemented by adapting the SVMTorch program. Without any particular optimization, CCCP TSVMs run orders of magnitude faster than SVMLight TSVMs (Figure 5, center). The complexity of the CCCP TSVM is similar to the complexity of an SVM trained on L + 2U examples, which typically scales like (L + 2U)^2 [14].

Figure 5 also shows the importance of the parameter s of the loss function (10). The peaked cost function (s = 0), used by SVMLight TSVMs, leads to poor generalization performance. We obtained better performance using a non-peaked loss function (related to [4]). The peaked loss forces early decisions for the β variables and might lead to a poor local optimum. This effect disappears as soon as we clip the loss.

5 Conclusion

We described two non-convex algorithms using CCCP that bring marked scalability improvements over the existing classical approaches, namely SVMs and TSVMs. More generally, we argue that the current popularity of convex approaches should not dissuade researchers from exploring alternative techniques, as they sometimes give clear advantages.

The following appendix was inserted in September 2005.

A Appendix

This appendix gives due credit and discusses previous works whose existence was unknown to us. It also points out differences that we believe are significant enough to justify this report.

Ψ-learning and related algorithms  An anonymous reviewer pointed out that the algorithm described in Section 3 had already been proposed from a very different perspective. Ψ-learning [16] advocates the minimization of the Ramp Loss function in order to more closely approximate the classification error, and suggests the concave-convex procedure to carry out the optimization. Algorithms and experiments are proposed in [10, 11] without mention of the scalability benefits. On the contrary, both papers conclude with a warning about the potentially high computational cost of the algorithm. The authors apparently did not expect that optimizing a non-convex function could be faster and easier. They did not test their algorithm on a sizeable noisy dataset, and did not speed up the first iteration by using only a subset of the examples, as described in the last paragraph of Section 3. Following the logic of Ψ-learning, they simply sought large accuracy improvements over standard SVMs. They obtained modest accuracy improvements, comparable to those reported in this contribution.

An ample literature seeks accuracy improvements by using loss functions that approximate the classification error more closely (see for instance [3, 12, 13]). The Ψ-learning paper [16] proposes such loss functions for SVMs and derives sophisticated bounds with fast convergence rates. These bounds depend on a specific assumption about the possible probability distribution of the examples (see their Section 3.1, Assumption B). However, under similar assumptions, it has recently been shown that standard SVMs achieve comparable convergence rates with the Hinge Loss [19]. This work restricts our hopes of obtaining significant accuracy improvements by replacing the loss function with a better approximation of the classification error. It does not, however, address the computation time. Instead it reasonably assumes that minimizing a convex loss function is always easier. The purpose of our contribution is to show that this is not the case.

Concave semi-supervised SVMs  In Section 4 we described previous work in the field of semi-supervised SVMs. However, we should also have acknowledged [6]. Following the formulation of transductive SVMs found in [1], the authors consider transductive linear SVMs with a 1-norm regularizer, which allows them to decompose the corresponding loss function as a sum of a linear function and a concave function. They then iteratively approximate the concave part by a linear function, leading to a series of linear programming problems.

This can be viewed as a simplified subcase of CCCP (a linear function being convex) applied to a special kind of SVM. Note also that the algorithm presented in their paper did not implement a balancing constraint for the labeling of the unlabeled examples as in (11). Transductive SVMs without this constraint are known to perform poorly (see [24]). Our transduction algorithm is nonlinear, and the use of kernels, solving the optimization in the dual, allows for large scale training with high dimensionality and large numbers of examples.

Acknowledgements

We thank Hans Peter Graf, Eric Cosatto, and Vladimir Vapnik for their advice and support. We also thank the anonymous reviewers who pointed out the references discussed in the appendix. Part of this work was funded by NSF grant CCR.

References

[1] K. Bennett and A. Demiriz. Semi-supervised support vector machines. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 12. MIT Press, Cambridge, MA.
[2] T. De Bie and N. Cristianini. Convex methods for transduction. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.
[3] L. Bottou. Une Approche théorique de l'Apprentissage Connexionniste: Applications à la Reconnaissance de la Parole. PhD thesis, Université de Paris XI, Orsay, France. Section: Approximation continues du risque moyen.
[4] O. Chapelle and A. Zien. Semi-supervised classification by low density separation. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, 2005.
[5] P. G. Ciarlet. Introduction à l'analyse numérique matricielle et à l'optimisation. Masson, 1990.
[6] G. Fung and O. Mangasarian. Semi-supervised support vector machines for unlabeled data classification. In Optimisation Methods and Software. Kluwer Academic Publishers, Boston, 2001.
[7] T. Joachims. Transductive inference for text classification using support vector machines. In International Conference on Machine Learning, ICML.
[8] N. Krause and Y. Singer. Leveraging the margin more carefully. In International Conference on Machine Learning, ICML, 2004.
[9] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In G. B. Orr and K.-R. Müller, editors, Neural Networks: Tricks of the Trade, pages 9–50. Springer.
[10] Y. Liu. Multicategory ψ-learning and support vector machine. PhD thesis, Ohio State University, 2004.
[11] Y. Liu, X. Shen, and H. Doss. Multicategory ψ-learning and support vector machine: Computational tools. Journal of Computational & Graphical Statistics, 14(1), 2005.
[12] L. Mason, P. L. Bartlett, and J. Baxter. Improved generalization through explicit optimization of margins. Machine Learning, 38(3), 2000.
[13] F. Pérez-Cruz, A. Navia-Vázquez, A. R. Figueiras-Vidal, and A. Artés-Rodríguez. Empirical risk minimization for support vector classifiers. IEEE Transactions on Neural Networks, 14(3):296–303, 2002.
[14] J. C. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods. MIT Press.
[15] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
[16] X. Shen, G. C. Tseng, X. Zhang, and W. H. Wong. On ψ-learning. Journal of the American Statistical Association, 98(463), 2003.
[17] A. J. Smola, S. V. N. Vishwanathan, and T. Hofmann. Kernel methods for missing variables. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, 2005.

[18] I. Steinwart. Sparseness of support vector machines. Journal of Machine Learning Research, 4, 2003.
[19] I. Steinwart and C. Scovel. Fast rates to Bayes for kernel machines. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17. MIT Press, Cambridge, MA, 2005.
[20] H. A. Le Thi. Analyse numérique des algorithmes de l'optimisation D.C. Approches locales et globale. Codes et simulations numériques en grande dimension. Applications. PhD thesis, INSA, Rouen.
[21] V. Vapnik. The Nature of Statistical Learning Theory. Springer, second edition.
[22] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum margin clustering. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17. MIT Press, Cambridge, MA, 2005.
[23] A. L. Yuille and A. Rangarajan. The concave-convex procedure (CCCP). In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14. MIT Press, Cambridge, MA, 2002.
[24] T. Zhang and F. J. Oles. A probability analysis on the value of unlabeled data for classification problems. In International Conference on Machine Learning, ICML, 2000.
