Solving MultiClass Support Vector Machines with LaRank


Antoine Bordes, Léon Bottou, Patrick Gallinari, Jason Weston

NEC Laboratories America, Inc., 4 Independence Way, Princeton, NJ 08540, USA.
LIP6, Université de Paris 6, 104 Avenue du Président Kennedy, Paris, France.

Abstract

Optimization algorithms for large margin multiclass recognizers are often too costly to handle ambitious problems with structured outputs and exponential numbers of classes. Optimization algorithms that rely on the full gradient are not effective because, unlike the solution, the gradient is not sparse and is very large. The LaRank algorithm sidesteps this difficulty by relying on a randomized exploration inspired by the perceptron algorithm. We show that this approach is competitive with gradient-based optimizers on simple multiclass problems. Furthermore, a single LaRank pass over the training examples delivers test error rates that are nearly as good as those of the final solution.

1. Introduction

Much has been written about the recognition of multiple classes using large margin kernel machines such as support vector machines (SVMs). The most widely used approaches combine multiple binary classifiers separately trained using either the one-versus-all or one-versus-one scheme (e.g. Hsu & Lin, 2002). Alternative proposals (Weston & Watkins, 1998; Crammer & Singer, 2001) reformulate the large margin problem to directly address the multiclass problem. These algorithms are more expensive because they must simultaneously handle all the support vectors associated with the different inter-class boundaries. Rigorous experiments (Hsu & Lin, 2002; Rifkin & Klautau, 2004) suggest that this higher cost does not translate into higher generalization performance.

The picture changes when one considers learning systems that predict structured outputs (e.g. Bakır et al., 2007). Instead of predicting a class label y for each pattern x, structured output systems produce complex discrete outputs such as sequences, trees, or graphs. Since these potential outputs can be enumerated (in theory), these systems can be viewed as multiclass problems with a number of classes growing exponentially with the characteristic size of the output. Dealing with so many classes in a large margin classifier would be infeasible without smart factorizations that leverage the specific structure of the outputs (Taskar et al., 2005; Tsochantaridis et al., 2005). This is best achieved using a direct multiclass formulation, because the factorization of the output space implies that all the classes are handled simultaneously. It is therefore important to reduce the computational cost of multiclass SVMs with a potentially large number of classes.

MCSVM. Crammer and Singer (2001) propose a multiclass formulation that we call partial ranking. The dual cost is a function of an n × k matrix of Lagrange coefficients, where n is the number of examples and k the number of classes. Each iteration of the MCSVM algorithm maximizes the restriction of the dual cost to a single row of the coefficient matrix. Successive rows are selected using the gradient of the cost function. Unlike the coefficient matrix, the gradient is not sparse. This approach is not feasible when the number of classes k grows exponentially, because the gradient becomes too large.

SVMstruct. Tsochantaridis et al. (2005) essentially use the same partial ranking formulation for the SVMstruct system. Their cutting plane algorithm ensures convergence while requiring the storage and computation of only a small part of the gradient. This crucial difference makes SVMstruct suitable for structured output problems with a large number of classes.

Appearing in Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, 2007. Copyright 2007 by the author(s)/owner(s).

Kernel Perceptrons. Online algorithms inspired by the perceptron (Collins, 2002; Crammer & Singer, 2003) can be interpreted as the successive solution of optimization subproblems restricted to the coefficients associated with the current training example. There is no need to represent the gradient. The random ordering of the training examples drives the successive optimizations. Perceptrons provide surprisingly strong theoretical guarantees (Graepel et al., 2000). They run very quickly but provide inferior generalization performances in practice.

LaRank. This paper proposes LaRank, a stochastic learning algorithm that combines partial gradient information with the randomization arising from the sequence of training examples. LaRank uses gradients as sparingly as SVMstruct and yet runs considerably faster; in fact, it reaches an equivalent accuracy faster than algorithms that use the full gradient information. LaRank also generalizes better than perceptron-based algorithms, matching the performance of SVMstruct or MCSVM because it solves the same optimization problem. LaRank achieves nearly optimal test error rates after a single pass over the randomly reordered training set. Therefore, LaRank offers the practicality of an online algorithm.

This paper first reviews and discusses the multiclass formulation of Crammer and Singer. Then it presents the LaRank algorithm, discusses its convergence, and reports experimental results on well known multiclass problems.

2. Multiclass Support Vector Machines

This section describes the partial ranking formulation of multiclass SVMs (Crammer & Singer, 2001). The presentation first follows Tsochantaridis et al. (2005), then introduces a new parametrization of the dual program.

2.1. Partial Ranking

We want to learn a function f that maps patterns x to discrete class labels y ∈ Y. We introduce a discriminant function S(x, y) ∈ R that measures the correctness of the association between pattern x and class label y. The optimal class label is then

    f(x) = arg max_{y ∈ Y} S(x, y).    (1)

We assume that the discriminant function has the form S(x, y) = ⟨w, Φ(x, y)⟩, where Φ(x, y) maps the pair (x, y) into a suitable feature space endowed with the dot product ⟨·,·⟩. As usual with kernel machines, the feature mapping function Φ is implicitly defined by the specification of a joint kernel function

    K(x, y, x̄, ȳ) = ⟨Φ(x, y), Φ(x̄, ȳ)⟩.    (2)

Consider training patterns x_1, …, x_n and their class labels y_1, …, y_n ∈ Y. For each pattern x_i, we want to make sure that the score S(x_i, y_i) of the correct association is greater than the scores S(x_i, y), y ≠ y_i, of the incorrect associations. This amounts to enforcing a partial order relationship on the elements of Y. This partial ranking can be expressed by the constraints

    ∀i = 1…n, ∀y ≠ y_i:    ⟨w, δΦ_i(y_i, y)⟩ ≥ 1,

where δΦ_i(y, ȳ) stands for Φ(x_i, y) − Φ(x_i, ȳ). Following the standard SVM derivation, we introduce slack variables ξ_i to account for the potential violation of the constraints, and optimize a combination of the norm of w and of the size of the slack variables:

    min_w  (1/2) ⟨w, w⟩ + C Σ_{i=1}^n ξ_i    (3)
    subject to  ∀i, ∀y ≠ y_i:  ⟨w, δΦ_i(y_i, y)⟩ ≥ 1 − ξ_i,  ξ_i ≥ 0.

2.2. Dual Programs

The usual derivation leads to solving the following equivalent dual problem (Crammer & Singer, 2001; Tsochantaridis et al., 2005):

    max_α  Σ_{i, y≠y_i} α_i^y − (1/2) Σ_{i, y≠y_i} Σ_{j, ȳ≠y_j} α_i^y α_j^ȳ ⟨δΦ_i(y_i, y), δΦ_j(y_j, ȳ)⟩    (4)
    subject to  ∀i, ∀y ≠ y_i:  α_i^y ≥ 0  and  Σ_{y≠y_i} α_i^y ≤ C.

This problem has n(k − 1) variables α_i^y, y ≠ y_i, corresponding to the constraints of (3). Once we have the solution, the discriminant function is

    S(x, y) = Σ_{i, ȳ≠y_i} α_i^ȳ ⟨δΦ_i(y_i, ȳ), Φ(x, y)⟩.

This dual problem can be considerably simplified by reparametrizing it with nk variables β_i^y defined as

    β_i^y = −α_i^y  if y ≠ y_i,    β_i^{y_i} = Σ_{ȳ≠y_i} α_i^ȳ  otherwise.    (5)

Note that only the β_i^{y_i} can be positive.
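To make the reparametrization concrete, here is a minimal Python sketch (ours, not the paper's; the function name and the toy numbers are illustrative). It maps a dense n × k array of α coefficients, with α_i^{y_i} held at zero by convention, to the β coefficients and checks the relation Σ_y β_i^y = 0 used in the next section:

```python
# Illustrative sketch of equation (5): beta_i^y = -alpha_i^y for y != y_i,
# and beta_i^{y_i} = sum over y != y_i of alpha_i^y.
import numpy as np

def alpha_to_beta(alpha, labels):
    # alpha: (n, k) array with alpha[i, labels[i]] == 0 by convention
    beta = -alpha
    beta[np.arange(len(labels)), labels] = alpha.sum(axis=1)
    return beta

alpha = np.array([[0.0, 0.3, 0.1],    # example 0, y_0 = 0
                  [0.2, 0.0, 0.0]])   # example 1, y_1 = 1
beta = alpha_to_beta(alpha, np.array([0, 1]))
assert np.allclose(beta.sum(axis=1), 0.0)     # each row sums to zero
assert (beta[0] == [0.4, -0.3, -0.1]).all()   # only beta[i, y_i] > 0
```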

Substituting (5) into (4), and taking into account the relation Σ_y β_i^y = 0, leads to a much simpler expression for the dual problem (the δΦ_i(·,·) have disappeared):

    max_β  Σ_i β_i^{y_i} − (1/2) Σ_{i,j,y,ȳ} β_i^y β_j^ȳ ⟨Φ(x_i, y), Φ(x_j, ȳ)⟩    (6)
    subject to  ∀i, ∀y ≠ y_i:  β_i^y ≤ 0,    ∀i:  β_i^{y_i} ≤ C,    ∀i:  Σ_y β_i^y = 0.

The discriminant function then becomes

    S(x, y) = Σ_{i,ȳ} β_i^ȳ ⟨Φ(x_i, ȳ), Φ(x, y)⟩.

2.3. Kernel Factorization

In practice, smart factorizations of the joint kernel (2) are crucial to reduce the memory required to store or cache the kernel values. This paper focuses on simple multiclass problems whose kernel function (2) takes the form

    ⟨Φ(x, y), Φ(x̄, ȳ)⟩ = k(x, x̄) δ(y, ȳ),

where k(x, x̄) is a kernel defined on the patterns, and where δ(y, ȳ) is 1 if y = ȳ and 0 otherwise. The dual problem (6) then becomes

    max_β  Σ_i β_i^{y_i} − (1/2) Σ_{i,j,y} β_i^y β_j^y k(x_i, x_j)    (7)
    subject to  ∀i, ∀y:  β_i^y ≤ C δ(y, y_i)  and  Σ_y β_i^y = 0,

and the discriminant function becomes

    S(x, y) = Σ_i β_i^y k(x_i, x).

When there are only two classes, this reduces to the standard SVM solution (without the equality constraint).

Structured output learning systems (Tsochantaridis et al., 2005) call for much more sophisticated factorizations of the joint kernel. For the sake of simplicity, we describe the LaRank algorithm in the context of the multiclass problem (7), which is the focus of this paper. Dealing with the general problem (6), or handling the variable margins suggested by Tsochantaridis et al. (2005), only requires minor changes.

3. Optimization Algorithm

During the execution of the optimization algorithm, we call support vectors all pairs (x_i, y) whose associated coefficient β_i^y is nonzero; we call support patterns all patterns x_i that appear in a support vector. The LaRank algorithm stores the following data:

- The set S of the current support vectors.
- The coefficients β_i^y associated with the support vectors (x_i, y) ∈ S. This describes the solution, since all the other β coefficients are zero.
- The derivatives g_i(y) of the dual objective function with respect to the coefficients β_i^y associated with the support vectors (x_i, y) ∈ S:

    g_i(y) = δ(y, y_i) − Σ_j β_j^y k(x_i, x_j) = δ(y, y_i) − S(x_i, y).    (8)

Note that we do not store or even compute the remaining coefficients of the gradient. In general, these missing derivatives are not zero because the gradient is not sparse.

A naive implementation could simply precompute all the kernel values k(x_i, x_j). This would be a waste of processing time because the location of the optimum depends only on the fraction of the kernel matrix that involves support patterns. Our code computes kernel values on demand and caches them in sets of the form E(y, j) = { k(x_i, x_j) such that (x_i, y) ∈ S }. Although this cache stores several copies of the same kernel values, caching individual kernel values would incur a higher overhead.

3.1. Elementary Step

Problem (7) lends itself to a simple iterative algorithm whose elementary steps are inspired by the well known sequential minimal optimization (SMO) algorithm (Platt, 1999). Each iteration starts with the selection of one pattern x_i and two classes y_+ and y_−. The elementary step modifies the coefficients β_i^{y_+} and β_i^{y_−} by opposite amounts,

    β_i^{y_+} ← β_i^{y_+} + λ,    β_i^{y_−} ← β_i^{y_−} − λ,    (9)

where λ ≥ 0 maximizes the dual objective function (7) subject to the constraints. This optimal value is easily computed by first calculating the unconstrained optimum

    λ_u = ( g_i(y_+) − g_i(y_−) ) / ( 2 k(x_i, x_i) )    (10)

and then enforcing the constraints,

    λ = max{ 0, min( λ_u, C δ(y_+, y_i) − β_i^{y_+} ) }.    (11)

Finally, the stored derivatives g_j(y) are updated to reflect the coefficient update. This is summarized in Algorithm 1.

Algorithm 1 SmoStep(i, y_+, y_−):
1: Retrieve or compute g_i(y_+).
2: Retrieve or compute g_i(y_−).
3: Let λ_u = ( g_i(y_+) − g_i(y_−) ) / ( 2 k(x_i, x_i) ).
4: Let λ = max{ 0, min( λ_u, C δ(y_+, y_i) − β_i^{y_+} ) }.
5: Update β_i^{y_+} ← β_i^{y_+} + λ and β_i^{y_−} ← β_i^{y_−} − λ.
6: Update S according to whether β_i^{y_+} and β_i^{y_−} are zero.
7: Update the stored gradients:
   ∀j s.t. (x_j, y_+) ∈ S:  g_j(y_+) ← g_j(y_+) − λ k(x_i, x_j),
   ∀j s.t. (x_j, y_−) ∈ S:  g_j(y_−) ← g_j(y_−) + λ k(x_i, x_j).
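To make the elementary step concrete, the following Python sketch (ours; all names are illustrative) implements Algorithm 1 for the factorized dual (7). For clarity it stores β and all derivatives g_i(y) densely and recomputes kernel values on the fly, whereas the paper's implementation keeps them sparse and uses the E(y, j) cache:

```python
# Minimal dense sketch of SmoStep (Algorithm 1) for the dual (7).
import numpy as np

class MulticlassDual:
    def __init__(self, X, y, k_classes, C, kernel):
        self.X, self.y, self.C, self.kernel = X, y, C, kernel
        n = len(y)
        self.beta = np.zeros((n, k_classes))
        # g[i, y] = delta(y, y_i) - S(x_i, y), equation (8);
        # with beta = 0 we have S = 0, so g starts at the deltas.
        self.g = np.zeros((n, k_classes))
        self.g[np.arange(n), y] = 1.0

    def score(self, x):
        # S(x, y) = sum_i beta_i^y k(x_i, x), one entry per class
        k_col = np.array([self.kernel(xi, x) for xi in self.X])
        return self.beta.T @ k_col

    def smo_step(self, i, y_pos, y_neg):
        # unconstrained optimum along direction (9), equation (10)
        lam_u = (self.g[i, y_pos] - self.g[i, y_neg]) \
                / (2.0 * self.kernel(self.X[i], self.X[i]))
        # clip against beta_i^{y+} <= C delta(y+, y_i), equation (11)
        box = self.C * float(y_pos == self.y[i]) - self.beta[i, y_pos]
        lam = max(0.0, min(lam_u, box))
        self.beta[i, y_pos] += lam
        self.beta[i, y_neg] -= lam
        # maintain the derivatives: S(., y+) grew, S(., y-) shrank
        col = np.array([self.kernel(xj, self.X[i]) for xj in self.X])
        self.g[:, y_pos] -= lam * col
        self.g[:, y_neg] += lam * col
        return lam
```

With a linear kernel such as `lambda a, b: float(a @ b)`, a few random steps suffice to check that the dual never decreases and that every row of β keeps summing to zero.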

3.2. Step Selection Strategies

Popular SVM solvers based on SMO select successive steps by choosing the pair of coefficients that defines the feasible search direction with the highest gradient. We cannot use this strategy because we have chosen to store only a small fraction of the gradient.

Stochastic algorithms inspired by the perceptron perform quite well by successively updating coefficients determined by randomly picking training patterns. For instance, in a multiclass context, Taskar (2004, section 6.1) iterates over the randomly ordered patterns: for each pattern x_i, he computes the scores S(x_i, y) for all classes and runs SmoStep on the two most violating classes, that is, the classes that define the feasible search direction with the highest gradient.

In the context of binary classification, Bordes and Bottou (2005) observe that such perceptron-inspired updates lead to a slow optimization of the dual because the coefficients corresponding to the few support vectors are not updated often enough. They suggest alternately updating the coefficient corresponding to a fresh random example and the coefficient corresponding to an example randomly chosen among the current support vectors. The related LaSVM algorithm (Bordes et al., 2005) alternates steps exploiting a fresh random training example and steps exploiting current support vectors selected using the gradient.

We now extend this idea to the multiclass formulation. Since the multiclass problem has both support vectors and support patterns, we define three ways to select a triple (i, y_+, y_−) for the elementary SmoStep.

Algorithm 2 ProcessNew(x_i):
1: If x_i is a support pattern then exit.
2: y_+ ← y_i.
3: y_− ← arg min_{y ∈ Y} g_i(y).
4: Perform SmoStep(i, y_+, y_−).

Algorithm 3 ProcessOld:
1: Randomly pick a support pattern x_i.
2: y_+ ← arg max_{y ∈ Y} g_i(y) subject to β_i^y < C δ(y, y_i).
3: y_− ← arg min_{y ∈ Y} g_i(y).
4: Perform SmoStep(i, y_+, y_−).

Algorithm 4 Optimize:
1: Randomly pick a support pattern x_i.
2: Let Y_i = { y ∈ Y such that (x_i, y) ∈ S }.
3: y_+ ← arg max_{y ∈ Y_i} g_i(y) subject to β_i^y < C δ(y, y_i).
4: y_− ← arg min_{y ∈ Y_i} g_i(y).
5: Perform SmoStep(i, y_+, y_−).

ProcessNew (Algorithm 2) operates on a pattern x_i that is not a support pattern. It chooses the classes y_+ and y_− that define the feasible direction with the highest gradient. Since all the β_i^y are zero, y_+ is always y_i. Choosing y_− consists of finding arg max_y S(x_i, y), since equation (8) holds.

ProcessOld (Algorithm 3) randomly picks a support pattern x_i. It chooses the classes y_+ and y_− that define the feasible direction with the highest gradient. The determination of y_+ mostly involves labels y such that β_i^y < 0, for which the corresponding derivatives g_i(y) are known. The determination of y_− again consists of computing arg max_y S(x_i, y).

Optimize (Algorithm 4) resembles ProcessOld but picks the classes y_+ and y_− among those that correspond to existing support vectors (x_i, y_+) and (x_i, y_−). Using the gradient here is fast because the relevant derivatives are already known and their number is moderate.

The ProcessNew operation is closely related to the perceptron algorithm. It can be interpreted as a stochastic gradient update for the minimization of the generalized margin loss (LeCun et al., 2007, section 2.2.3), with a step size adjusted according to the curvature of the dual (Hildreth, 1957). Crammer and Singer (2003) use a very similar approach in the MIRA algorithm.
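The three operations translate directly into code. The sketch below (ours, illustrative) builds on the MulticlassDual class from the previous listing; recomputing the support-pattern list on every call is for brevity only, and the feasibility mask encodes the condition β_i^y < C δ(y, y_i):

```python
# Illustrative versions of ProcessNew, ProcessOld and Optimize
# (Algorithms 2-4), written against the MulticlassDual sketch above.
import random
import numpy as np

def _support_patterns(m):
    return np.flatnonzero(np.any(m.beta != 0.0, axis=1))

def _feasible_up(m, i):
    # classes y with beta_i^y < C delta(y, y_i): room to move upward
    caps = m.C * (np.arange(m.beta.shape[1]) == m.y[i])
    return m.beta[i] < caps

def process_new(m, i):                  # Algorithm 2
    if np.any(m.beta[i] != 0.0):
        return                          # already a support pattern
    y_neg = int(np.argmin(m.g[i]))      # i.e. argmax_y S(x_i, y)
    m.smo_step(i, int(m.y[i]), y_neg)   # y_+ is necessarily y_i

def process_old(m):                     # Algorithm 3
    sp = _support_patterns(m)
    if sp.size == 0:
        return
    i = int(random.choice(sp))
    g_up = np.where(_feasible_up(m, i), m.g[i], -np.inf)
    m.smo_step(i, int(np.argmax(g_up)), int(np.argmin(m.g[i])))

def optimize(m):                        # Algorithm 4
    sp = _support_patterns(m)
    if sp.size == 0:
        return
    i = int(random.choice(sp))
    in_S = m.beta[i] != 0.0             # restrict to existing vectors
    g_up = np.where(in_S & _feasible_up(m, i), m.g[i], -np.inf)
    g_dn = np.where(in_S, m.g[i], np.inf)
    m.smo_step(i, int(np.argmax(g_up)), int(np.argmin(g_dn)))
```

If the masked maximizer turns out to be infeasible, the clipping inside `smo_step` makes the step a harmless no-op, so no extra guard is needed.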
3.3. Adaptive Schedule

Previous works (Bordes & Bottou, 2005; Bordes et al., 2005) simply alternate two step selection strategies according to a fixed schedule. They also report results suggesting that the optimal schedule is in fact data-dependent.

We would like to select at each iteration an operation that causes a large increase of the dual in a small amount of time. For each operation type, LaRank maintains a running estimate of the average ratio of the dual increase over the duration. Running times are measured; dual increases are derived from the value of λ computed during the elementary step.
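A sketch of that bookkeeping follows (ours; the discount factor, the floor on the selection odds, and the guard on near-zero durations are illustrative assumptions, not values from the paper). Each operation keeps a running dual-increase-per-second estimate r_s, and the next operation is drawn with odds proportional to those estimates, as in Algorithm 5 below:

```python
# Illustrative running-ratio schedule from Section 3.3. Each entry of
# `ops` is a callable returning the dual increase it achieved; `rates`
# maps an operation name to its estimate r_s (initialized to 1).
import random
import time

def pick_and_run(ops, rates, mu=0.05, eta=0.05):
    names = list(ops)
    # keep every operation selectable with non-negligible probability
    floor = eta * sum(rates.values())
    weights = [max(rates[s], floor) for s in names]
    s = random.choices(names, weights=weights)[0]
    t0 = time.perf_counter()
    dual_increase = ops[s]()            # e.g. one ProcessOld step
    duration = max(time.perf_counter() - t0, 1e-9)
    rates[s] = mu * (dual_increase / duration) + (1.0 - mu) * rates[s]
    return s
```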

Each iteration of the LaRank algorithm (Algorithm 5) randomly selects which operation to perform, with a probability proportional to these estimates. Our implementation uses a small fixed µ. In order to facilitate timing, we treat sequences of ten Optimize steps as a single atomic operation.

Algorithm 5 LaRank:
1: S ← ∅.
2: r_Optimize, r_ProcessOld, r_ProcessNew ← 1.
3: loop
4:   Randomly reorder the training examples.
5:   k ← 1.
6:   while k ≤ n do
7:     Pick operation s with odds proportional to r_s.
8:     if s = Optimize then
9:       Perform Optimize.
10:    else if s = ProcessOld then
11:      Perform ProcessOld.
12:    else
13:      Perform ProcessNew(x_k).
14:      k ← k + 1.
15:    end if
16:    r_s ← µ (dual increase / duration) + (1 − µ) r_s.
17:  end while
18: end loop

3.4. Correctness and Complexity

Let ν² = max_i k(x_i, x_i) and let κ, τ, η be small positive tolerances. We assume that the algorithm implementation enforces the following properties:

- SmoStep exits when g_i(y_+) − g_i(y_−) ≤ τ.
- Optimize and ProcessOld choose y_+ among the y that satisfy β_i^y ≤ C δ(y, y_i) − κ.
- LaRank makes sure that every operation has probability greater than η of being selected at each iteration (see Algorithm 5).

We refer to this as the (κ, τ, η)-algorithm.

Theorem. With probability 1, the (κ, τ, η)-algorithm reaches a κτ-approximate solution of problem (7) with no more than max{ 2ν²nC/τ², 2nC/(κτ) } successful SmoSteps.

Proof sketch. The convergence is a consequence of theorem 18 of (Bordes et al., 2005). To apply this theorem, we must prove that the directions defined by (9) form a witness family for the polytope defined by the constraints of problem (7). This is the case because this polytope is a product of n polytopes, to each of which we can apply proposition 7 of (Bordes et al., 2005). The number of iterations is then bounded using a technique similar to that of (Tsochantaridis et al., 2005). The complete proof will be given in an extended version of this paper.

The bound on the number of iterations is also a bound on the number of support vectors. It is linear in the number of examples and does not depend on the possibly large number of classes.

3.5. Stopping

Algorithm 5 does not specify a criterion for stopping its outer loop. Excellent results are obtained by performing just one or two outer loop iterations (epochs). We use the name LaRank×1 to indicate that we perform a single epoch, that is to say, a single pass over the randomly ordered training examples. Other stopping criteria include exploiting the duality gap (Schölkopf & Smola, 2002) and monitoring the performance measured on a validation set. We use the name LaRankGap to indicate that we iterate Algorithm 5 until the difference between the primal cost (3) and the dual cost (7) becomes smaller than C. However, computing the duality gap can become quite expensive.

4. Experiments

This section reports experiments carried out on various multiclass pattern recognition problems. Although our approach is partly motivated by structured output problems, this work focuses on well understood multiclass tasks in order to best characterize the algorithm's behavior.

4.1. Experimental Setup

Experiments were carried out on the four datasets briefly described in Table 1. The LETTER and USPS datasets are available from the UCI repository. The MNIST dataset is a well known handwritten digit recognition benchmark. The INEX dataset contains scientific articles from 18 journals and proceedings of the IEEE; we use a flat TF/IDF feature space (see Denoyer & Gallinari, 2006 for further details).

Table 1 also lists our choices for the parameter C and for the kernels k(x, x̄). These choices were made on the basis of past experience. We use the same parameters for all algorithms because we mostly compare algorithms that optimize the same criterion.
The kernel cache size was 500MB for all experiments.

4.2. Comparing Optimizers

Table 2 (top half) compares three optimization algorithms for the same dual cost (7).

MCSVM (Crammer & Singer, 2001) uses the full gradient and therefore cannot be easily extended to handle structured output problems. We have used the MCSVM implementation distributed by the authors.

Table 1. Datasets used for the experiments, listing for each dataset the numbers of training examples, test examples, classes, and features, the parameter C, and the kernel k(x, x̄). LETTER, USPS, and MNIST use RBF kernels of the form e^{−γ‖x−x̄‖²}; INEX uses the linear kernel x·x̄. [numeric entries not recoverable]

Table 2. Compared test error rates and training times on LETTER, USPS, MNIST, and INEX. For each of MCSVM (stores the full gradient), SVMstruct (stores a partial gradient), LaRankGap (stores a partial gradient), and LaRank×1 (online), the table reports the test error (%), the dual objective, the training time (sec.), and the number of kernel calculations (×10⁶). The kernel count is not reported for SVMstruct on INEX because SVMstruct bypasses the cache when using linear kernels. [numeric entries not recoverable]

Figure 1. Evolution of the test error as a function of the number of kernel calculations, with one panel per dataset (LETTER, USPS, MNIST, INEX).

SVMstruct (Tsochantaridis et al., 2005) targets structured output problems and therefore uses only a small fraction of the gradient. We have used the implementation distributed by the authors. The authors warn that this implementation has not been thoroughly optimized.

LaRankGap iterates Algorithm 5 until the duality gap becomes smaller than the parameter C. This algorithm only stores a small fraction of the gradient, comparable to that used by SVMstruct. We have implemented LaRank using an interpreted scripting language with a specialized C function for Algorithm 1 (SmoStep).

Both SVMstruct and LaRankGap use small subsets of the gradient coefficients. Although these subsets have similar sizes, LaRankGap avoids the training time penalty experienced by SVMstruct.

Both SVMstruct and LaRank make heavy use of kernel values involving two support patterns. In contrast, MCSVM updates the complete gradient vector after each step and therefore uses the full kernel matrix rows corresponding to support patterns. On our relatively small problems, this stronger memory requirement is more than compensated by the lower overhead of MCSVM's simpler cache structure.

4.3. Comparing Online Learning Algorithms

Table 2 (bottom half) also reports the results obtained with a single LaRank epoch (LaRank×1). This single pass over the training examples is sufficient to nearly reach the optimal performance. This result is understandable because (i) online perceptrons offer strong theoretical guarantees after a single pass over the training examples, and (ii) LaRank drives the optimization process by replicating the randomization that happens in the perceptron.

For each dataset, Figure 1 shows the evolution of the test error with respect to the number of kernel calculations. The point marked LaRank×1 corresponds to running a single LaRank epoch. The point marked LaRankGap corresponds to using the duality gap stopping criterion explained in Section 4.2. Figure 1 also reports results obtained with two popular online algorithms.

The points marked AvgPerceptron×1 and AvgPerceptron×10 respectively correspond to performing one and ten epochs of the averaged perceptron algorithm (Freund & Schapire, 1998; Collins, 2002). Multiple epochs of the averaged perceptron are very effective when the necessary kernel values fit in the cache (first row), but training time increases considerably when this is not the case (second row).

The point marked MIRA corresponds to the online multiclass algorithm proposed by Crammer and Singer (2003). We have used the implementation provided by the authors as part of the MCSVM package. This algorithm computes more kernel values than AvgPerceptron×1 because its solution contains more support patterns. Its performance seems sensitive to the choice of kernel: Crammer and Singer (2003) report substantially better results using the same code but different kernels.

These results indicate that performing a single LaRank epoch is an attractive online learning algorithm. Although LaRank×1 usually runs slower than AvgPerceptron×1 or MIRA, it provides better and more predictable generalization performance.

4.4. Comparing Optimization Strategies

Figure 2 (impact of the LaRank operations on the USPS dataset) shows the error rates and numbers of kernel calculations achieved when one restricts the set of operations chosen by Algorithm 5. These results were obtained after a single pass over the USPS dataset.

As expected, using only the ProcessNew operation performs like MIRA. The averaged perceptron requires significantly fewer kernel calculations because its solution is much more sparse; however, it loses this initial sparsity when one performs several epochs (see Figure 1). Enabling ProcessOld and Optimize significantly reduces the test error. The best test error is achieved when all operations are enabled. The number of kernel calculations is also reduced, because ProcessOld and Optimize often eliminate support patterns.

4.5. Comparing ArgMax Calculations

The previous experiments measure the computational cost using the training time and the number of kernel calculations.

Certain structured output problems use costly algorithms to find the class with the best score (1). The cost of this arg max calculation is partly related to the required number of new kernel values. The averaged perceptron (and MIRA) performs one such arg max calculation for each example it processes. In contrast, LaRank performs one arg max calculation when processing a new example with ProcessNew, and also each time it runs ProcessOld.

Table 3. Numbers of arg max calculations (in thousands) for AvgPerceptron×1, AvgPerceptron×10, LaRank×1, LaRankGap, and SVMstruct on LETTER, USPS, MNIST, and INEX. [numeric entries not recoverable]

Table 3 compares the number of arg max calculations for various algorithms and datasets. (The LETTER results in Table 3 are outliers because the LETTER kernel runs as fast as the kernel cache; since LaRank depends on timings, it often runs ProcessOld when a simple Optimize would have been sufficient.) The SVMstruct optimizer performs very well on this metric. The AvgPerceptron and LaRank are very competitive on a single epoch and become more costly when performing many epochs. One epoch is sufficient to reach good performance with LaRank; this is not the case for the AvgPerceptron.

5. Conclusion

We have presented a large margin multiclass algorithm that uses gradients as sparingly as SVMstruct without experiencing the same training time penalty. LaRank can be considered an online algorithm because it nearly reaches its optimal performance in a single pass over the training examples. Under these conditions, LaRank achieves test error rates that are competitive with those of the full optimization, and significantly better than those achieved by perceptrons.

Acknowledgments

Nicolas Usunier helped prove the theorem bound. Part of this work was funded by an NSF CCR grant. Antoine Bordes was also supported by the DGA and by the Network of Excellence IST PASCAL.

References

Bakır, G., Hofmann, T., Schölkopf, B., Smola, A. J., Taskar, B., & Vishwanathan, S. V. N. (Eds.). (2007). Predicting structured outputs. MIT Press. In press.

Bordes, A., & Bottou, L. (2005). The Huller: a simple and efficient online SVM. Machine Learning: ECML 2005. Springer Verlag, Lecture Notes in Artificial Intelligence.

Bordes, A., Ertekin, S., Weston, J., & Bottou, L. (2005). Fast kernel classifiers with online and active learning. Journal of Machine Learning Research, 6, 1579-1619.

Collins, M. (2002). Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. EMNLP '02: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing (pp. 1-8). Morristown, NJ: Association for Computational Linguistics.

Crammer, K., & Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2, 265-292.

Crammer, K., & Singer, Y. (2003). Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3, 951-991.

Denoyer, L., & Gallinari, P. (2006). The XML document mining challenge. Advances in XML Information Retrieval and Evaluation, 5th International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX 2006, Schloss Dagstuhl, Germany.

Freund, Y., & Schapire, R. E. (1998). Large margin classification using the perceptron algorithm. Machine Learning: Proceedings of the Fifteenth International Conference. San Francisco, CA: Morgan Kaufmann.

Graepel, T., Herbrich, R., & Williamson, R. C. (2000). From margin to sparsity. Advances in Neural Information Processing Systems, 13. MIT Press.

Hildreth, C. (1957). A quadratic programming procedure. Naval Research Logistics Quarterly, 4, 79-85. Erratum, ibid., p. 361.

Hsu, C.-W., & Lin, C.-J. (2002). A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 13, 415-425.

LeCun, Y., Chopra, S., Hadsell, R., HuangFu, J., & Ranzato, M. (2007). A tutorial on energy-based learning. In (Bakır et al., 2007). In press.

Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. Advances in Kernel Methods: Support Vector Learning (pp. 185-208). MIT Press.

Rifkin, R. M., & Klautau, A. (2004). In defense of one-vs-all classification. Journal of Machine Learning Research, 5, 101-141.

Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. MIT Press.

Taskar, B. (2004). Learning structured prediction models: A large margin approach. Doctoral dissertation, Stanford University.

Taskar, B., Chatalbashev, V., Koller, D., & Guestrin, C. (2005). Learning structured prediction models: a large margin approach. International Conference on Machine Learning (ICML).

Tsochantaridis, I., Joachims, T., Hofmann, T., & Altun, Y. (2005). Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6, 1453-1484.

Weston, J., & Watkins, C. (1998). Multi-class support vector machines (Technical Report CSD-TR-98-04). Department of Computer Science, Royal Holloway, University of London, Egham, UK.


More information

Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations

Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations Fxng Max-Product: Convergent Message Passng Algorthms for MAP LP-Relaxatons Amr Globerson Tomm Jaakkola Computer Scence and Artfcal Intellgence Laboratory Massachusetts Insttute of Technology Cambrdge,

More information

Exercises (Part 4) Introduction to R UCLA/CCPR. John Fox, February 2005

Exercises (Part 4) Introduction to R UCLA/CCPR. John Fox, February 2005 Exercses (Part 4) Introducton to R UCLA/CCPR John Fox, February 2005 1. A challengng problem: Iterated weghted least squares (IWLS) s a standard method of fttng generalzed lnear models to data. As descrbed

More information

APPLICATION OF MULTIVARIATE LOSS FUNCTION FOR ASSESSMENT OF THE QUALITY OF TECHNOLOGICAL PROCESS MANAGEMENT

APPLICATION OF MULTIVARIATE LOSS FUNCTION FOR ASSESSMENT OF THE QUALITY OF TECHNOLOGICAL PROCESS MANAGEMENT 3. - 5. 5., Brno, Czech Republc, EU APPLICATION OF MULTIVARIATE LOSS FUNCTION FOR ASSESSMENT OF THE QUALITY OF TECHNOLOGICAL PROCESS MANAGEMENT Abstract Josef TOŠENOVSKÝ ) Lenka MONSPORTOVÁ ) Flp TOŠENOVSKÝ

More information

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems A Unfed Framework for Semantcs and Feature Based Relevance Feedback n Image Retreval Systems Ye Lu *, Chunhu Hu 2, Xngquan Zhu 3*, HongJang Zhang 2, Qang Yang * School of Computng Scence Smon Fraser Unversty

More information

Reducing Frame Rate for Object Tracking

Reducing Frame Rate for Object Tracking Reducng Frame Rate for Object Trackng Pavel Korshunov 1 and We Tsang Oo 2 1 Natonal Unversty of Sngapore, Sngapore 11977, pavelkor@comp.nus.edu.sg 2 Natonal Unversty of Sngapore, Sngapore 11977, oowt@comp.nus.edu.sg

More information

Multi-objective Optimization Using Adaptive Explicit Non-Dominated Region Sampling

Multi-objective Optimization Using Adaptive Explicit Non-Dominated Region Sampling 11 th World Congress on Structural and Multdscplnary Optmsaton 07 th -12 th, June 2015, Sydney Australa Mult-objectve Optmzaton Usng Adaptve Explct Non-Domnated Regon Samplng Anrban Basudhar Lvermore Software

More information

A Robust LS-SVM Regression

A Robust LS-SVM Regression PROCEEDIGS OF WORLD ACADEMY OF SCIECE, EGIEERIG AD ECHOLOGY VOLUME 7 AUGUS 5 ISS 37- A Robust LS-SVM Regresson József Valyon, and Gábor Horváth Abstract In comparson to the orgnal SVM, whch nvolves a quadratc

More information

Human Face Recognition Using Generalized. Kernel Fisher Discriminant

Human Face Recognition Using Generalized. Kernel Fisher Discriminant Human Face Recognton Usng Generalzed Kernel Fsher Dscrmnant ng-yu Sun,2 De-Shuang Huang Ln Guo. Insttute of Intellgent Machnes, Chnese Academy of Scences, P.O.ox 30, Hefe, Anhu, Chna. 2. Department of

More information

EVALUATION OF THE PERFORMANCES OF ARTIFICIAL BEE COLONY AND INVASIVE WEED OPTIMIZATION ALGORITHMS ON THE MODIFIED BENCHMARK FUNCTIONS

EVALUATION OF THE PERFORMANCES OF ARTIFICIAL BEE COLONY AND INVASIVE WEED OPTIMIZATION ALGORITHMS ON THE MODIFIED BENCHMARK FUNCTIONS Academc Research Internatonal ISS-L: 3-9553, ISS: 3-9944 Vol., o. 3, May 0 EVALUATIO OF THE PERFORMACES OF ARTIFICIAL BEE COLOY AD IVASIVE WEED OPTIMIZATIO ALGORITHMS O THE MODIFIED BECHMARK FUCTIOS Dlay

More information

Tighter Perceptron with Improved Dual Use of Cached Data for Model Representation and Validation

Tighter Perceptron with Improved Dual Use of Cached Data for Model Representation and Validation Proceedngs of Internatonal Jont Conference on Neural Networks, Atlanta, Georga, USA, June 49, 29 Tghter Perceptron wth Improved Dual Use of Cached Data for Model Representaton and Valdaton Zhuang Wang

More information

Unsupervised Learning

Unsupervised Learning Pattern Recognton Lecture 8 Outlne Introducton Unsupervsed Learnng Parametrc VS Non-Parametrc Approach Mxture of Denstes Maxmum-Lkelhood Estmates Clusterng Prof. Danel Yeung School of Computer Scence and

More information

Learning a Class-Specific Dictionary for Facial Expression Recognition

Learning a Class-Specific Dictionary for Facial Expression Recognition BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 16, No 4 Sofa 016 Prnt ISSN: 1311-970; Onlne ISSN: 1314-4081 DOI: 10.1515/cat-016-0067 Learnng a Class-Specfc Dctonary for

More information