Taxonomy of Large Margin Principle Algorithms for Ordinal Regression Problems


Amnon Shashua
Computer Science Department, Stanford University, Stanford, CA 94305
email: shashua@cs.stanford.edu

Anat Levin
School of Computer Science and Engineering, Hebrew University of Jerusalem, Jerusalem 91904, Israel
email: alevin@cs.huji.ac.il

This manuscript should be referenced as Technical Report 2002-39, Leibniz Center for Research, School of Computer Science and Eng., the Hebrew University of Jerusalem.

Abstract

We discuss the problem of ranking instances where an instance is associated with an integer from 1 to k. In other words, this is the specialization of the general multi-class learning problem to the case where there exists an ordering among the instances, a problem known as "ordinal regression" or "ranking learning". This problem arises in various settings, both in visual recognition and in other information retrieval tasks. In the context of applying a large margin principle to this learning problem, we introduce two main approaches for implementing the large margin optimization criteria for k-1 margins. The first is the "fixed margin" policy, in which the margin of the closest neighboring classes is maximized; it turns out to be a direct generalization of SVM to ranking learning. The second approach allows for k-1 different margins, where the sum of margins is maximized, thus effectively biasing the solution towards the pairs of neighboring classes which are farthest apart from each other. This approach is shown to reduce to nu-SVM when the number of classes is k = 2. Both approaches are optimal in the size of the dual functional, which is on the order of 2l where l is the total number of training examples. Experiments performed on visual classification and "collaborative filtering" show that both approaches outperform existing ordinal regression algorithms applied to ranking and multi-class SVM applied to general multi-class classification.

1 Introduction

In this paper we investigate the problem of inductive learning from the point of view of predicting variables of ordinal scale [3, 7, 5], a setting referred to as ranking learning or ordinal regression. We consider the problem of applying the large margin principle used in Support Vector methods [11, 2] to the ordinal regression problem while maintaining an (optimal) problem size linear in the number of training examples.

Ordinal regression may be viewed as a problem bridging the two standard machine learning tasks of classification and (metric) regression. Let x_i in R^n, i = 1,...,l, be the input vectors (the information upon which prediction takes place) drawn from some unknown probability distribution D(x); let y_i in Y be the output of the prediction process according to an unknown conditional distribution function D(y|x). The training set, on which the selection of the best predictor is made, consists of (x_i, y_i) independent and identically distributed observations drawn from the joint distribution D(x, y) = D(x)D(y|x). The learning task is to select a prediction function f(x) from a family of possible functions F that minimizes the expected loss weighted by the joint distribution D(x, y) (also known as the risk functional). The loss function c : Y x Y -> R represents the discrepancy between f(x) and y. Since the joint distribution is unknown, the risk functional is replaced by the so-called empirical risk functional [11], which is simply the average of the loss function over the training set: (1/l) \sum_i c(f(x_i), y_i).

In a standard classification problem the input vectors are associated with one of k classes, thus y in Y = {1,...,k} belongs to an unordered set of labels denoting the class membership. Since Y is unordered and since the metric distance between the prediction f(x) and the correct output y is of no particular value, the loss function relevant for classification is the non-metric 0-1 indicator function: c(f(x), y) = 0 if f(x) = y and c(f(x), y) = 1 if f(x) != y. In a standard regression problem y ranges over the reals, therefore the loss function can take the full metric structure into account, for example c(f(x), y) = (f(x) - y)^2. In ordinal regression, Y is a finite set (as in classification) but there is an ordering among the elements of Y (as in regression, but unlike classification). On the other hand, the ordering of the labels does not justify a metric loss function, thus casting the ranking learning problem as an ordinary regression (by treating the continuous variable with a coarse scale) may not be realistic [1].
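To make the three loss functions concrete, here is a minimal Python sketch (added for illustration; it is not part of the original text and the function names are invented) that evaluates the 0-1, squared and ordinal losses and the empirical risk functional on a toy set of predictions.

    import numpy as np

    def zero_one_loss(f_x, y):
        """0/1 loss used for standard classification."""
        return float(f_x != y)

    def squared_loss(f_x, y):
        """Metric loss used for standard (metric) regression."""
        return (f_x - y) ** 2

    def ordinal_loss(f_x, y):
        """|f(x) - y|: the number of ranks the prediction is off by."""
        return abs(f_x - y)

    def empirical_risk(loss, predictions, labels):
        """(1/l) * sum_i loss(f(x_i), y_i), the empirical risk functional."""
        return np.mean([loss(p, t) for p, t in zip(predictions, labels)])

    # toy usage with k = 5 ranks
    preds, labels = [1, 3, 5, 2], [1, 4, 2, 2]
    print(empirical_risk(zero_one_loss, preds, labels))  # 0.5
    print(empirical_risk(ordinal_loss, preds, labels))   # (0 + 1 + 3 + 0) / 4 = 1.0

The ordinal loss grows with the number of ranks the prediction is off by, while the 0-1 loss treats every mistake identically; this is the distinction the rest of the paper builds on.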
Settings in which it is natural to rank or rate instances arise in many fields, such as information retrieval, visual recognition, collaborative filtering, econometric models and classical statistics. We will later use some applications from collaborative filtering and visual recognition as our running examples in this paper. In collaborative filtering, for example, the goal is to predict a person's rating on new items such as movies given the person's past ratings on similar items and the ratings of other people on all the items (including the new item). The ratings are ordered, such as "highly recommended", "good", ..., "very bad", thus collaborative filtering falls naturally under the domain of ordinal regression.

In this paper we approach the ordinal regression problem within a classification framework, and in order to take advantage of the non-metric nature of the loss function we wish to embed the problem within the large margin principle used in Support Vector methods [11]. The Support Vector method (SVM) was introduced originally in the context of 2-class classification. The SVM paradigm has a nice geometric interpretation of discriminating one class from the other by a separating plane with maximum margin. The large-margin principle gives rise to the representation of the decision boundary by a small subset of the training examples called Support Vectors. The SVM approach is advantageous for representing the ordinal regression problem for two reasons. First, the computational machinery for finding the optimal classifier f(x) is based on the non-metric 0-1 loss function; therefore, by adopting the large-margin principle for ordinal regression we would be implementing an appropriate non-metric loss function as well. Second, the SVM approach is not limited to linear classifiers: through the mechanism of kernel inner-products one can draw upon a rich family of learning functions applicable to non-linear decision boundaries.

To tackle the problem of using an SVM framework for ranking learning, one may take the approach proposed in [7], which is to reduce the total order into a set of preferences over pairs, which in effect increases the training set from l to l^2. Another approach, inherited from the one-versus-many classifiers used for extending binary SVM to multi-class SVM, is to solve k-1 binary classification problems. The disadvantage of this approach is that it ignores the total ordering of the class labels (and also the effective size of the training set is kl, whereas we will show that ranking learning can be performed with an effective training set of size 2l). Likewise, the multi-class SVMs proposed in [4, 11, 12, 8] would also ignore the ordering of the class labels and use a training set of size kl. In this paper we adopt the notion of maintaining a totally ordered set via projections, in the sense of projecting the instances x onto the reals, f(x) = w.x [7, 5], and show how this can be implemented within a large margin principle with an effective training size of 2l. In fact, we show that there is more than one way to implement the large margin principle, as there are k-1 possible margins. Essentially, we show, there are two strategies in general: a "fixed margin" strategy, where the large margin principle is applied to the closest neighboring pair of classes, and a "multi-margin" strategy, where the sum of the k-1 margins is maximized.

2 The Ordinal Regression Problem

Let x_i^j be the set of training examples, where j = 1,...,k denotes the class number and i = 1,...,l_j is the index within each class. Let l = \sum_j l_j be the total number of training examples. A straightforward generalization of the 2-class separating hyperplane problem, where a single hyperplane determines the classification rule, is to define k-1 separating hyperplanes which would separate the training data into k ordered classes by modeling the ranks as intervals on the real line, an idea whose origins are with the classical "cumulative model" [9], see also [7, 5]. The geometric interpretation of this approach is to look for k-1 parallel hyperplanes, represented by the vector w in R^n (the dimension of the input vectors) and the scalars b_1,...,b_{k-1} defining the hyperplanes (w, b_1),...,(w, b_{k-1}), such that the data are separated by dividing the space into equally ranked regions via the decision rule

    f(x) = \min_{r \in \{1,\dots,k\}} \{\, r : w \cdot x - b_r < 0 \,\}.    (1)

In other words, all input vectors x satisfying b_{r-1} < w.x < b_r are assigned the rank r (using the convention that b_k = infinity). For instance, recently [5] proposed an on-line algorithm (with similar principles to the classic "perceptron" used for 2-class separation) for finding the set of parallel hyperplanes which would comply with the separation rule above.
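As a small illustration (added here, not part of the original text), the decision rule (1) for a given direction w and thresholds b_1 <= ... <= b_{k-1} can be written in a few lines of Python; the toy numbers below are made up.

    import numpy as np

    def rank_of(x, w, b):
        """Decision rule (1): the smallest r with w.x - b_r < 0.
        b holds the k-1 thresholds in increasing order; b_k = +inf by convention."""
        return int(np.searchsorted(b, w @ x, side="right")) + 1

    # toy usage: k = 4 classes on the real line via w = [1], thresholds at 0, 1, 2
    print(rank_of(np.array([1.4]), np.array([1.0]), np.array([0.0, 1.0, 2.0])))  # -> 3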

To continue the analogy to 2-class learning, in addition to the separability constraints on the variables theta = {w, b_1,...,b_{k-1}} one would like to control the tradeoff between lowering the empirical risk R_emp(theta) (the error measure on the training set) and lowering the confidence interval Psi(theta, h) controlled by the VC-dimension h of the set of loss functions. The structural risk minimization (SRM) principle [11] controls the actual risk R(theta) (the error measured on the test data) by keeping R_emp(theta) fixed (in the ideal separable case it would be zero) while minimizing the confidence interval. The geometric interpretation for 2-class learning is to maximize the margin between the boundaries of the two sets [11, 2]. In our setting of ranking learning there are k-1 margins to consider, thus there are two possible approaches to applying the large margin principle to ranking learning:

fixed margin strategy: the margin to be maximized is the one defined by the closest (neighboring) pair of classes. Formally, let (w, b_q) be the hyperplane separating the two classes which are the closest among all the neighboring pairs of classes. Let (w, b_q) be scaled such that the distance of the boundary points from the hyperplane is 1, i.e., the margin between the classes q, q+1 is 2/|w| (see Fig. 1). Thus, the fixed margin policy for ranking learning is to find the direction w and the scalars b_1,...,b_{k-1} such that w.w is minimized (i.e., the margin between classes q, q+1 is maximized) subject to the separability constraints (modulo margin errors in the non-separable case).

Figure 1: Fixed-margin policy for ranking learning. The margin to be maximized is associated with the two closest neighboring classes. As in conventional SVM, the margin is pre-scaled to be equal to 2/|w|, thus maximizing the margin is achieved by minimizing w.w. The support vectors lie on the boundaries between the two closest classes.

sum of margins strategy: the sum of all k-1 margins is to be maximized. In this case the margins are not necessarily equal (see Fig. 2). Formally, the ranking rule employs a vector w, |w| = 1, and a set of 2(k-1) thresholds a_1 <= b_1 <= a_2 <= b_2 <= ... <= a_{k-1} <= b_{k-1}, such that w.x_i^j <= a_j and w.x_i^{j+1} >= b_j for j = 1,...,k-1. In other words, all the examples of class j, 1 <= j <= k, are sandwiched between two parallel hyperplanes (w, a_j) and (w, b_{j-1}), where b_0 = -infinity and a_k = infinity. The k-1 margins are therefore (b_j - a_j), and the large margin principle is to maximize \sum_j (b_j - a_j) subject to the separability constraints above.

Figure 2: Sum-of-margins policy for ranking learning. The objective is to maximize the sum of k-1 margins. Each class is sandwiched between two hyperplanes; the norm of w is set to unity as a constraint in the optimization problem, and as a result the objective is to maximize \sum_j (b_j - a_j). In this case the support vectors lie on the boundaries among all neighboring classes (unlike the fixed-margin policy). When the number of classes is k = 2, the dual functional is equivalent to nu-SVM.

It is also fairly straightforward to apply the SRM principle and derive bounds on the actual risk functional by following [11] and making substitutions where necessary. Let the empirical risk be defined as

    R_{emp}(\theta) = \frac{1}{l} \sum_{j=1}^{k} \sum_{i=1}^{l_j} \left| f(x_i^j) - y_i^j \right| = \frac{m}{l},

where f(x_i^j) is the decision rule (1), l_j is the number of training examples of class j, and l is the total number of training examples. The empirical risk is the average of the number of mistakes, where the magnitude of a mistake is related to the total ordering; i.e., the loss function Q(z, \theta) = |f(x) - y|, where z = (x, y), is an integer between 0 and k-1 (unlike the 0/1 loss function associated with classification learning). Since the loss function is totally bounded, the VC-dimension of the class of loss functions 0 <= Q(z, \theta) <= k-1 is equal to the VC-dimension h of the class of indicator (0/1) functions

    I(z, \theta, \beta) = \begin{cases} 0 & Q(z,\theta) - \beta < 0 \\ 1 & Q(z,\theta) - \beta \ge 0 \end{cases}

where \beta \in (0, k-1). Let Delta-margin k-separating hyperplanes be defined by |w| = 1 and

    y = \begin{cases} 1 & w \cdot x \le a_1 \\ j & b_{j-1} \le w \cdot x \le a_j \\ k & b_{k-1} \le w \cdot x \end{cases}

where b_j - a_j = \Delta (fixed margin policy) and \Delta is the margin between the closest pair of classes. From the arguments above, the VC-dimension of the set of Delta-margin k-separating hyperplanes is bounded by the inequality (following [11]):

    h \le \min\left( \frac{R^2}{\Delta^2},\; n \right) + 1,

where R is the radius of the sphere containing all the examples. Thus we arrive at a bound on the probability that a test example will not be separated correctly (following [11], pp. 77, 133): with probability 1 - \eta one can assert that the probability that a test example will not be separated correctly by the Delta-margin k-separating hyperplanes has the bound

    P_{error} \le \frac{m}{l(k-1)} + \frac{\epsilon}{2}\left( 1 + \sqrt{1 + \frac{4m}{\epsilon\, l (k-1)}} \right),
    \qquad \epsilon = 4\, \frac{h\left(\ln\frac{2l}{h} + 1\right) - \ln\frac{\eta}{4}}{l}.

Therefore, the larger the fixed margin is, the better the bounds we obtain on the generalization performance of the ranking learning problem with the fixed-margin policy. Likewise, we obtain the same bound under the sum-of-margins principle, where \Delta is defined by the sum of the k-1 margins.

In the remainder of this paper we introduce the algorithmic implications of these two strategies for implementing the large margin principle for ranking learning. The fixed-margin principle will turn out to be a direct generalization of the Support Vector Machine (SVM) algorithm, in the sense that substituting k = 2 into our proposed algorithm produces the dual functional underlying conventional SVM. It is interesting to note that the sum-of-margins principle reduces to nu-SVM (introduced by [10]) when k = 2.

3 Fixed Margin Strategy

Recall that in the fixed margin policy (w, b_q) is a "canonical" hyperplane normalized such that the margin between the closest classes q, q+1 is 2/|w|. The index q is of course unknown. The unknown variables w, b_1,...,b_{k-1} (and the index q) can be solved for in a two-stage optimization problem: a Quadratic Linear Programming (QLP) formulation followed by a Linear Programming (LP) formulation.

The (primal) QLP formulation of the ("soft margin") fixed-margin policy for ranking learning takes the form:

    \min_{w,\, b_j,\, \xi_i^j,\, \xi_i^{*j+1}} \;\; \frac{1}{2} w \cdot w + C \sum_{j,i} \left( \xi_i^j + \xi_i^{*j+1} \right)    (2)

    subject to
    w \cdot x_i^j - b_j \le -1 + \xi_i^j    (3)
    w \cdot x_i^{j+1} - b_j \ge 1 - \xi_i^{*j+1}    (4)
    \xi_i^j \ge 0, \qquad \xi_i^{*j+1} \ge 0    (5)

where j = 1,...,k-1 and i = 1,...,l_j (respectively l_{j+1} in (4)), and C is some predefined constant. The scalars \xi_i^j and \xi_i^{*j+1} ("slack" variables) are positive for data points which are inside the margins or placed on the wrong side of the respective hyperplane; if the training data are linearly separable on all the k (ordered) classes, those variables are not needed. The primal functional implements the fixed-margin principle even though we do not know the index q in advance: in the case of hard margins (the primal functional above with \xi_i^j, \xi_i^{*j+1} set to zero) the margin is maximized while maintaining separability, thus the margin will be governed by the closest pair of classes, because otherwise the separability conditions would cease to hold. The situation may be slightly different and would depend on the choice of C in the soft margin implementation, but qualitatively the same type of behavior holds.
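As an aside, the following Python/numpy sketch illustrates the soft-margin objective (2)-(5) by plain subgradient descent on its hinge-loss form. It is only a rough illustration under invented names, with a heuristic step for keeping the thresholds ordered; the paper itself poses the problem as a QLP, not as gradient descent.

    import numpy as np

    def fixed_margin_subgradient(X, y, k, C=1.0, lr=0.01, epochs=200, seed=0):
        """Illustrative (sub)gradient descent on a soft version of the fixed-margin
        primal (2)-(5): 0.5*||w||^2 + C * hinge slacks of each example against the
        thresholds to its left and right. X is (l, n), y holds ranks in {1,...,k}."""
        rng = np.random.default_rng(seed)
        n = X.shape[1]
        w = rng.normal(scale=0.01, size=n)
        b = np.linspace(-1.0, 1.0, k - 1)          # thresholds b_1 <= ... <= b_{k-1}
        for _ in range(epochs):
            gw, gb = w.copy(), np.zeros(k - 1)     # gradient of 0.5*||w||^2 is w
            for x, r in zip(X, y):
                s = w @ x
                if r <= k - 1 and s - b[r - 1] > -1:   # violates w.x - b_r <= -1
                    gw += C * x; gb[r - 1] -= C
                if r >= 2 and s - b[r - 2] < 1:        # violates w.x - b_{r-1} >= 1
                    gw -= C * x; gb[r - 2] += C
            w -= lr * gw
            b = np.sort(b - lr * gb)               # keep thresholds ordered (heuristic)
        return w, b

The two hinge terms per example mirror constraints (3) and (4): every example is pushed below the threshold on its right and above the threshold on its left.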

The solution to the QLP (2)-(5) is given by the saddle point of the Lagrange functional (Lagrangian):

    L(\cdot) = \frac{1}{2} w \cdot w + C \sum_{j,i} \left( \xi_i^j + \xi_i^{*j+1} \right)
             + \sum_{j,i} \lambda_i^j \left( w \cdot x_i^j - b_j + 1 - \xi_i^j \right)
             + \sum_{j,i} \lambda_i^{*j} \left( 1 - \xi_i^{*j+1} + b_j - w \cdot x_i^{j+1} \right)
             - \sum_{j,i} \zeta_i^j \xi_i^j - \sum_{j,i} \zeta_i^{*j+1} \xi_i^{*j+1},

where j = 1,...,k-1, i = 1,...,l_j (respectively l_{j+1}), and \lambda_i^j, \lambda_i^{*j}, \zeta_i^j, \zeta_i^{*j+1} are all non-negative Lagrange multipliers. Since the primal problem is convex, there exists strong duality between the primal and dual optimization functions. By first minimizing the Lagrangian with respect to w, b_j, \xi_i^j, \xi_i^{*j+1} we obtain the dual optimization function, which then must be maximized with respect to the Lagrange multipliers. From the minimization of the Lagrangian with respect to w we obtain

    w = -\sum_{j,i} \lambda_i^j x_i^j + \sum_{j,i} \lambda_i^{*j} x_i^{j+1}.    (6)

That is, the direction w of the parallel hyperplanes is described by a linear combination of the support vectors x associated with the non-vanishing Lagrange multipliers. From the Kuhn-Tucker theorem the support vectors are those vectors for which equality is achieved in the inequalities (3,4). These vectors lie on the two boundaries between the adjacent classes q, q+1 (and other adjacent classes which have the same margin). From the minimization of the Lagrangian with respect to b_j we obtain the constraint

    \sum_i \lambda_i^j = \sum_i \lambda_i^{*j}, \qquad j = 1,\dots,k-1,    (7)

and the minimization with respect to \xi_i^j and \xi_i^{*j+1} yields the constraints

    C - \lambda_i^j - \zeta_i^j = 0,    (8)
    C - \lambda_i^{*j} - \zeta_i^{*j+1} = 0,    (9)

which in turn give rise to the constraints 0 <= \lambda_i^j <= C, where \lambda_i^j = C if the corresponding data point is a margin error (\zeta_i^j = 0, thus from the Kuhn-Tucker theorem \xi_i^j > 0), and likewise 0 <= \lambda_i^{*j} <= C, where the equality \lambda_i^{*j} = C holds when the data point is a margin error. Note that a data point can count twice as a margin error: once with respect to the class on its left and once with respect to the class on its right.

For the sake of presenting the dual functional in a compact form, we introduce some new notation. Let X^j be the n x l_j matrix whose columns are the data points x_i^j, i = 1,...,l_j:

    X^j = \left[ x_1^j, \dots, x_{l_j}^j \right]_{n \times l_j}.

Let \lambda^j = (\lambda_1^j,...,\lambda_{l_j}^j)^T be the vector whose components are the Lagrange multipliers \lambda_i^j corresponding to class j. Likewise, let \lambda^{*j} = (\lambda_1^{*j},...,\lambda_{l_{j+1}}^{*j})^T be the Lagrange multipliers \lambda_i^{*j} corresponding to class j+1. Let \alpha = (\lambda^1,...,\lambda^{k-1}, \lambda^{*1},...,\lambda^{*k-1})^T be the vector holding all the \lambda_i^j and \lambda_i^{*j} Lagrange multipliers, and let \alpha^1 = (\alpha^1_1,...,\alpha^1_{k-1})^T = (\lambda^1,...,\lambda^{k-1})^T and \alpha^2 = (\alpha^2_1,...,\alpha^2_{k-1})^T = (\lambda^{*1},...,\lambda^{*k-1})^T be the first and second halves of \alpha. Note that \alpha^1_j = \lambda^j is a vector, and likewise so is \alpha^2_j = \lambda^{*j}. Let 1 be the vector of 1's, and finally let Q be the matrix holding two copies of the training data:

    Q = \left[ -X^1, \dots, -X^{k-1},\; X^2, \dots, X^k \right]_{n \times N},    (10)

where N = 2l - l_1 - l_k. For example, (6) becomes in the new notation w = Q\alpha. By substituting the expression w = Q\alpha back into the Lagrangian and taking into account the constraints (7,8,9), one obtains the dual functional, which should be maximized with respect to the Lagrange multipliers \alpha_i:

    \max_{\alpha} \;\; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \alpha^\top (Q^\top Q)\, \alpha    (11)

    subject to
    0 \le \alpha_i \le C, \quad i = 1,\dots,N,    (12)
    1 \cdot \alpha^1_j = 1 \cdot \alpha^2_j, \quad j = 1,\dots,k-1.    (13)

There are several points worth noting at this stage. First, when k = 2, i.e., we have only two classes and the ranking learning problem is equivalent to the 2-class classification problem, the dual functional reduces to, and becomes equivalent to, the dual form of conventional SVM. In that case (Q^T Q)_{ij} = y_i y_j x_i.x_j, where y_i, y_j = +-1 denote the class membership. Second, the dual problem is a function of the Lagrange multipliers \lambda_i^j and \lambda_i^{*j} alone; that is, all the remaining Lagrange multipliers have dropped out. Therefore the size of the dual QLP problem (the number of unknown variables) is proportional to twice the number of training examples, precisely N = 2l - l_1 - l_k, where l is the number of training examples. This compares favorably to the O(l^2) required by the recent SVM approach to ordinal regression introduced in [7], or the kl required by the general multi-class approach to SVM [4]. In fact, the problem size N = 2l - l_1 - l_k is the smallest possible for the ordinal regression problem, since each training example is flanked by a class on each side (except examples of the first and last class); therefore the minimal number of constraints for describing an ordinal regression problem using separating hyperplanes is N. Third, the criterion function involves only inner-products of the training examples, thereby making it possible to work with kernel-based inner-products. In other words, the entries of Q^T Q are the inner-products of the training examples, which can be represented by a kernel inner-product evaluated in the input space rather than by inner-products in the feature space. The decision rule in this case, given a new instance vector x, is the rank r corresponding to the first (smallest) threshold b_r for which

    \sum_{\text{support vectors}} \lambda_i^{*j} K(x_i^{j+1}, x) \;-\; \sum_{\text{support vectors}} \lambda_i^{j} K(x_i^{j}, x) \;<\; b_r,

where K(x, y) = \phi(x) \cdot \phi(y) replaces the inner-products in the higher-dimensional "feature" space \phi(x). Finally, from the dual form one can solve for the Lagrange multipliers \alpha_i and in turn obtain w = Q\alpha, the direction of the parallel hyperplanes. The scalar b_q (separating the adjacent classes q, q+1 which are the closest apart) can be obtained from the support vectors, but the remaining scalars b_j cannot. Therefore an additional stage is required, which amounts to a Linear Programming problem on the original primal functional (2), but this time with w already known (thus making this a linear problem instead of a quadratic one).
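The kernelized decision rule above is straightforward to express in code. The sketch below is an illustration only: the helper names are invented, the dual variables and thresholds are assumed to have been obtained by some QLP solver, and the homogeneous second-order polynomial kernel is just one plausible reading of the "second order kernel inner-products" used later in the experiments.

    import numpy as np

    def poly2_kernel(a, b):
        """Homogeneous second-order kernel (one possible choice, not prescribed by the text)."""
        return (a @ b) ** 2

    def predict_rank(x, sv_left, lam, sv_right, lam_star, b, kernel=poly2_kernel):
        """Project x with the kernel expansion of w (eq. (6)) and return the first
        rank r whose threshold b_r exceeds the projection, as in rule (1).

        sv_left / lam       : support vectors x_i^j and multipliers lambda_i^j
        sv_right / lam_star : support vectors x_i^{j+1} and multipliers lambda*_i^j
        b                   : thresholds b_1 <= ... <= b_{k-1}
        """
        proj = sum(ls * kernel(xs, x) for ls, xs in zip(lam_star, sv_right)) \
             - sum(l * kernel(xs, x) for l, xs in zip(lam, sv_left))
        for r, thr in enumerate(b, start=1):
            if proj - thr < 0:          # f(x) = min { r : w.x - b_r < 0 }
                return r
        return len(b) + 1               # convention b_k = +infinity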

4 Sum-of-Margins Strategy

In the fixed margin policy for ranking learning the direction w of the k-1 parallel hyperplanes was determined so as to maximize the margin of the closest adjacent pair of classes. In other words, viewed as an extension of conventional SVM, the criterion function remained essentially a 2-class representation (maximizing the margin between two classes), while the linear constraints represented the admissibility constraints necessary for making sure that all classes are properly separable (modulo margin errors). In this section we propose an alternative large-margin policy which allows for k-1 margins, where the criterion function maximizes the sum of the k-1 margins.

The challenge in formulating the appropriate optimization functional is that one cannot adopt the pre-scaling of w, which is at the center of the conventional SVM formulation and of the fixed-margin policy for ranking learning described in the previous section. The approach we take is to represent the primal functional using 2(k-1) parallel hyperplanes instead of k-1. Each class is sandwiched between two hyperplanes (except the first and last classes). This may appear superfluous, but in fact all the extra variables (having 2(k-1) thresholds instead of k-1) drop out in the dual functional; therefore this approach has no detrimental effect in terms of computational efficiency.

Formally, we seek a ranking rule which employs a vector w and a set of 2(k-1) thresholds a_1 <= b_1 <= a_2 <= b_2 <= ... <= a_{k-1} <= b_{k-1}, such that w.x_i^j <= a_j and w.x_i^{j+1} >= b_j for j = 1,...,k-1. In other words, all the examples of class j, 1 <= j <= k, are sandwiched between the two parallel hyperplanes (w, a_j) and (w, b_{j-1}), where b_0 = -infinity and a_k = infinity. The margin between the two hyperplanes separating classes j and j+1 is

    \frac{b_j - a_j}{\sqrt{w \cdot w}}.

Thus, by setting the magnitude of w to unit length (as a constraint in the optimization problem), the margin we would like to maximize is \sum_j (b_j - a_j) for j = 1,...,k-1, which we can formulate in the following primal Quadratic Linear Programming (QLP) problem (see also Fig. 2):

    \min_{w,\, a_j,\, b_j,\, \xi} \;\; \sum_{j=1}^{k-1} (a_j - b_j) + C \sum_{j,i} \left( \xi_i^j + \xi_i^{*j+1} \right)    (14)

    subject to
    a_j \le b_j    (15)
    b_j \le a_{j+1}, \quad j = 1,\dots,k-2    (16)
    w \cdot x_i^j \le a_j + \xi_i^j    (17)
    b_j - \xi_i^{*j+1} \le w \cdot x_i^{j+1}    (18)
    w \cdot w \le 1    (19)
    \xi_i^j \ge 0, \qquad \xi_i^{*j+1} \ge 0    (20)

where j = 1,...,k-1 (unless otherwise specified), i = 1,...,l_j, and C is some predefined constant (whose physical role is explained later). There are several points to note about the primal problem. First, the constraints a_j <= b_j and b_j <= a_{j+1} are necessary and sufficient to enforce the ordering constraint a_1 <= b_1 <= a_2 <= b_2 <= ... <= a_{k-1} <= b_{k-1}. Second, the (non-convex) constraint w.w = 1 is replaced by the convex constraint w.w <= 1, since the optimal solution w has unit magnitude in order to optimize the objective function. To see why this is so, consider first the case k = 2, where we have a single (hard) margin:

    \min_{w,\, a,\, b} \;\; a - b
    \quad \text{s.t.} \quad a \le b, \quad w \cdot x_i \le a \;(i = 1,\dots,l_1), \quad b \le w \cdot x_i \;(i = l_1+1,\dots,N), \quad w \cdot w \le 1.

We would like to show that for the optimal solution (given that the data are linearly separable) w must be of unit norm. Let w, a, b be the optimal solution, with |w| = \sigma <= 1. Let x^- and x^+ be points (support vectors) on the left and right boundary planes, i.e., w.x^- = a and w.x^+ = b. Let \hat{w} = (1/\sigma) w (thus |\hat{w}| = 1). We have therefore

    \hat{w} \cdot x^- = \frac{1}{\sigma}\, a, \qquad \hat{w} \cdot x^+ = \frac{1}{\sigma}\, b.

Therefore the new solution \hat{w}, (1/\sigma)a, (1/\sigma)b has a lower energy value (a larger margin) of (1/\sigma)(a - b) when \sigma < 1. As a result \sigma = 1, since the original solution was assumed to be optimal. This line of reasoning readily extends to multiple margins, as the factor 1/\sigma would apply to all the margins uniformly; thus the sum \sum_j (a_j - b_j) would decrease (i.e., the sum of margins would grow) by a factor of 1/\sigma, hence \sigma = 1. The introduction of the soft margin component (the second term in (14)) does not affect this line of reasoning as long as the constant C is consistent with the existence of a solution with negative energy; otherwise there would be a duality gap between the primal and dual functionals. This consistency is related to the number of margin errors, which we discuss in more detail later in this section and in the following section.

We now proceed to derive the dual functional. The Lagrangian takes the following form:

    L(\cdot) = \sum_j (a_j - b_j) + C \sum_{j,i} \left( \xi_i^j + \xi_i^{*j+1} \right)
             + \sum_j \delta_j (a_j - b_j) + \sum_{j=1}^{k-2} \eta_j (b_j - a_{j+1})
             + \sum_{j,i} \lambda_i^j \left( w \cdot x_i^j - a_j - \xi_i^j \right)
             + \sum_{j,i} \lambda_i^{*j} \left( b_j - \xi_i^{*j+1} - w \cdot x_i^{j+1} \right)
             + \mu \left( w \cdot w - 1 \right)
             - \sum_{j,i} \zeta_i^j \xi_i^j - \sum_{j,i} \zeta_i^{*j+1} \xi_i^{*j+1},

where j = 1,...,k-1 (unless otherwise specified), i = 1,...,l_j, and \delta_j, \eta_j, \lambda_i^j, \lambda_i^{*j}, \mu, \zeta_i^j, \zeta_i^{*j+1} are all non-negative Lagrange multipliers. From the minimization of the Lagrangian with respect to w we obtain

    w = \frac{1}{2\mu}\, Q \alpha,

where the matrix Q was defined in (10) and the vector \alpha holds the Lagrange multipliers \lambda_i^j and \lambda_i^{*j} as defined in the previous section. From the minimization with respect to b_j for j = 1,...,k-2 we obtain

    \frac{\partial L}{\partial b_j} = -1 - \delta_j + \eta_j + \sum_i \lambda_i^{*j} = 0.

For j = k-1 we obtain

    \frac{\partial L}{\partial b_{k-1}} = -1 - \delta_{k-1} + \sum_i \lambda_i^{*k-1} = 0,

from which it follows that

    \sum_i \lambda_i^{*k-1} \ge 1.    (21)

Likewise, the minimization with respect to a_1 yields \sum_i \lambda_i^1 = 1 + \delta_1, from which it follows (since \delta_1 >= 0) that

    \sum_i \lambda_i^{1} \ge 1,    (22)

and with respect to a_j, j = 2,...,k-1, we get the expression

    \frac{\partial L}{\partial a_j} = 1 + \delta_j - \eta_{j-1} - \sum_i \lambda_i^{j} = 0.

Summing up the Lagrange multipliers gives rise to another constraint (beyond (21) and (22)), as follows:

    \sum_{j=1}^{k-1} \sum_i \lambda_i^{j} = (k-1) + \sum_{j=1}^{k-1} \delta_j - \sum_{j=1}^{k-2} \eta_j
    \qquad \text{and} \qquad
    \sum_{j=1}^{k-1} \sum_i \lambda_i^{*j} = (k-1) + \sum_{j=1}^{k-1} \delta_j - \sum_{j=1}^{k-2} \eta_j.

Therefore, as a result we obtain the constraint

    \sum_{j,i} \lambda_i^{j} = \sum_{j,i} \lambda_i^{*j}.    (23)

Finally, the minimization with respect to \xi_i^j and \xi_i^{*j+1} yields the expressions (8) and (9), from which we obtain the constraints

    0 \le \lambda_i^{j} \le C,    (24)
    0 \le \lambda_i^{*j} \le C,    (25)

where \lambda_i^j = C and/or \lambda_i^{*j} = C if the corresponding data point x_i^j is a margin error (as mentioned before, a data point can count twice as a margin error: once with respect to the class on its left and once with respect to the class on its right).

After substituting the expression for w back into the Lagrangian and considering the constraints borne out of the partial derivatives with respect to a_j, b_j, we obtain the dual functional as a function of \lambda_i^j, \lambda_i^{*j} only (all the remaining variables drop out):

    \max_{\alpha,\, \mu} \;\; L'(\alpha, \mu) = -\mu - \frac{1}{4\mu}\, \alpha^\top (Q^\top Q)\, \alpha,

subject to the constraints (21, 22, 24, 25) and \mu >= 0. Note that \mu = 0 cannot occur if there is an optimal solution with negative energy in the primal functional (otherwise we have a duality gap, see later), since we have shown above that |w| = 1 in the optimal solution, thus from the Kuhn-Tucker theorem \mu != 0. We can eliminate \mu as follows:

    \frac{\partial L'}{\partial \mu} = -1 + \frac{1}{4\mu^2}\, \alpha^\top (Q^\top Q)\, \alpha = 0.

Substituting the resulting expression \mu = (1/2)\sqrt{\alpha^\top (Q^\top Q)\, \alpha} back into L'(\alpha, \mu) provides a new dual functional L''(\alpha) = -\sqrt{\alpha^\top (Q^\top Q)\, \alpha}, and maximization of L''(\alpha) is equivalent to maximization of the expression -\alpha^\top (Q^\top Q)\, \alpha, since Q^\top Q is positive definite. To conclude, the dual functional takes the following form:

    \max_{\alpha} \;\; -\alpha^\top (Q^\top Q)\, \alpha    (26)

    subject to
    0 \le \alpha_i \le C, \quad i = 1,\dots,N,    (27)
    1 \cdot \alpha^1_1 \ge 1,    (28)
    1 \cdot \alpha^2_{k-1} \ge 1,    (29)
    1 \cdot \alpha^1 = 1 \cdot \alpha^2,    (30)

where Q and \alpha are defined in the previous section. The direction w is represented by the linear combination of the support vectors:

    w = \frac{Q\alpha}{|Q\alpha|},
where, following the Kuhn-Tucker theorem, \alpha_i > 0 for all vectors on the boundaries between the adjacent pairs of classes and for margin errors. In other words, the vectors x associated with non-vanishing \alpha_i are those which lie on the hyperplanes, i.e., satisfy a_j = w.x_i^j or b_j = w.x_i^{j+1}, or vectors tagged as margin errors (\xi_i^j > 0 or \xi_i^{*j+1} > 0). Therefore all the thresholds a_j, b_j can be recovered from the support vectors, unlike in the fixed-margin scheme, which required another LP pass.

The dual functional (26) is similar to the dual functional (11) but with some crucial differences: (i) the quadratic criterion functional is homogeneous, and (ii) constraints (28, 29) lead to the constraint \sum_i \alpha_i >= 2. From the Kuhn-Tucker theorem, \delta_j = 0 when a_j < b_j and \eta_j = 0 when b_j < a_{j+1}; thus when the data are linearly separable the optimal solution has \sum_i \alpha_i = 2(k-1). Since a margin error implies that the corresponding Lagrange multiplier satisfies \alpha_i = C, the number of margin errors is bounded because \sum_i \alpha_i is bounded. These two differences are also what distinguishes conventional SVM from nu-SVM for 2-class learning, proposed recently by [10]. Indeed, if we set k = 2 in the dual functional (26) we can conclude that the two dual functionals are identical. The primal and dual functionals of nu-SVM and of the sum-of-margins policy for ranking learning with k = 2 classes are summarized below:

    nu-SVM, primal:
    \min_{w,\, b,\, \rho,\, \xi} \;\; \frac{1}{2} w \cdot w - \nu\rho + \frac{1}{N} \sum_{i=1}^{N} \xi_i
    \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge \rho - \xi_i, \quad \xi_i \ge 0, \quad \rho \ge 0.

    nu-SVM, dual:
    \max_{\alpha} \;\; -\frac{1}{2} \alpha^\top M \alpha
    \quad \text{s.t.} \quad 0 \le \alpha_i \le \frac{1}{N}, \quad \sum_i \alpha_i \ge \nu, \quad \sum_i y_i \alpha_i = 0.

    k = 2 sum-of-margins, primal:
    \min_{w,\, a,\, b,\, \xi} \;\; (a - b) + C \sum_{i=1}^{N} \xi_i
    \quad \text{s.t.} \quad a \le b, \quad w \cdot x_i \le a + \xi_i \;(i = 1,\dots,l_1), \quad b - \xi_i \le w \cdot x_i \;(i = l_1+1,\dots,N), \quad w \cdot w \le 1, \quad \xi_i \ge 0.

    k = 2 sum-of-margins, dual:
    \max_{\alpha} \;\; -\alpha^\top M \alpha
    \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \quad \sum_i \alpha_i \ge 2, \quad \sum_i y_i \alpha_i = 0.

where M = Q^T Q with M_{ij} = y_i y_j x_i.x_j and y_i = +-1 depending on the class membership. Although the primal functionals appear different, the dual functionals are similar and in fact can be made equivalent by a change of variables: scale the Lagrange multipliers associated with nu-SVM such that \alpha_i -> (2/\nu)\alpha_i; then, with C = 2/(\nu N), equivalence between the two dual forms is established.
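For completeness, here is a short worked check of that change of variables (my own verification, not part of the original text), with \beta denoting the rescaled nu-SVM multipliers, \beta_i = (2/\nu)\alpha_i:

    \[
    0 \le \alpha_i \le \tfrac{1}{N},\;\; \sum_i \alpha_i \ge \nu,\;\; \sum_i y_i \alpha_i = 0
    \;\Longleftrightarrow\;
    0 \le \beta_i \le \tfrac{2}{\nu N} = C,\;\; \sum_i \beta_i \ge 2,\;\; \sum_i y_i \beta_i = 0,
    \]
    \[
    -\tfrac{1}{2}\, \alpha^\top M \alpha \;=\; -\tfrac{\nu^2}{8}\, \beta^\top M \beta \;\propto\; -\beta^\top M \beta .
    \]

The feasible sets coincide once C is set to 2/(\nu N), and the two objectives differ only by a positive constant factor, hence they share the same maximizer.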

Appendix A provides a more detailed analysis of the role of C in the case k = 2. In the general case of k > 2 classes (in the context of ranking learning) the role of the constant C carries the same meaning: C >= 2(k-1)/#m.e., where #m.e. stands for the total number of margin errors, thus

    \frac{2(k-1)}{N} \le C \le 2(k-1).

Recall that in the worst case a data point can count twice as a margin error, being both a margin error in the context of its class and the class on its left, and in the context of its class and the class on its right. Therefore the total number of margin errors in the worst case is N = 2l - l_1 - l_k, where l is the total number of data points.

Figure 3: Synthetic data experiments for k = 3 classes with 2D data points using second-order kernel inner-products. The solid lines correspond to a_1, a_2 and the dashed lines to b_1, b_2 (from left to right). Support vectors are marked as squares in the display. The left column illustrates the fixed-margin policy (dual functional (35)) and the right column illustrates the sum-of-margins policy (dual functional (26)). When the value of C is small (top row) the number of margin errors (and support vectors) is large in order to enable large margins, i.e., the b_j - a_j are large. In the case of sum-of-margins (top right display) a small value of C makes b_1 = a_2 in order to maximize the margins. When the value of C is large (bottom row) the number of margin errors (and support vectors) is small and as a result the margins are tight.

The last point of interest is that, unlike with the fixed margin policy, all the thresholds a_j, b_j are determined from the support vectors; the second Linear Programming optimization stage is not necessary in this case. In other words, there must be support vectors on each hyperplane (w, a_j) and (w, b_j), otherwise a better solution would exist with larger margins. To conclude, the multiple-margin policy maximizes the sum of the k-1 margins, allowing the margins to differ in size, thus effectively rewarding larger margins between neighboring classes which are spaced far apart from each other. This is opposite to the fixed margin policy, in which the direction of the hyperplanes is dominated by the closest neighboring classes. We saw that the fixed margin policy reduces to conventional SVM when the number of classes is k = 2 and the multiple-margin policy reduces to nu-SVM. Another difference between the two policies is that the multiple-margin policy requires a single optimization sweep for recovering both the direction w and the thresholds a_j, b_j, whereas the fixed margin policy requires two sweeps: a QLP for recovering w and a Linear Programming problem for recovering the k-1 thresholds b_j.
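As with the fixed-margin case, a rough illustration of the sum-of-margins objective (14)-(20) can be given in numpy by projected subgradient descent. The sketch below uses invented names and heuristic projections for the norm and ordering constraints; it is only meant to make the objective concrete and is not the QLP formulation the paper actually solves.

    import numpy as np

    def sum_of_margins_subgradient(X, y, k, C=1.0, lr=0.01, epochs=500, seed=0):
        """Illustrative projected subgradient treatment of the sum-of-margins primal
        (14)-(20): minimize sum_j (a_j - b_j) + C * hinge slacks, keeping ||w|| <= 1
        by projection and the thresholds ordered by a heuristic re-sorting step."""
        rng = np.random.default_rng(seed)
        n = X.shape[1]
        w = rng.normal(size=n); w /= np.linalg.norm(w)
        a = np.linspace(-1.0, 1.0, k - 1)        # class j must satisfy w.x <= a_j
        b = a + 0.1                              # class j+1 must satisfy w.x >= b_j
        for _ in range(epochs):
            gw = np.zeros(n)
            ga, gb = np.ones(k - 1), -np.ones(k - 1)   # gradient of sum_j (a_j - b_j)
            for x, r in zip(X, y):               # r in {1, ..., k}
                s = w @ x
                if r <= k - 1 and s > a[r - 1]:  # slack on constraint (17) is active
                    gw += C * x; ga[r - 1] -= C
                if r >= 2 and s < b[r - 2]:      # slack on constraint (18) is active
                    gw -= C * x; gb[r - 2] += C
            w -= lr * gw
            norm = np.linalg.norm(w)
            if norm > 1.0:                       # project back onto the unit ball (19)
                w /= norm
            a -= lr * ga; b -= lr * gb
            order = np.sort(np.ravel(np.column_stack([a, b])))   # re-impose a_1<=b_1<=a_2<=...
            a, b = order[0::2], order[1::2]
        return w, a, b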

5 Fixed Margin Policy Revisited: Generalization of nu-SVM

We have seen that the sum-of-margins policy reduces to nu-SVM when the number of classes is k = 2. However, one cannot make the assertion in the other direction, namely that the dual functional (26) is a generalization of nu-SVM. In fact, the fixed margin policy applied to nu-SVM for ranking learning would have the following form:

    \min_{w,\, b_j,\, \rho,\, \xi} \;\; \frac{1}{2} w \cdot w - \nu\rho + \frac{1}{l} \sum_{j,i} \left( \xi_i^j + \xi_i^{*j+1} \right)    (31)

    subject to
    w \cdot x_i^j - b_j \le -\rho + \xi_i^j,
    \quad w \cdot x_i^{j+1} - b_j \ge \rho - \xi_i^{*j+1},
    \quad \rho \ge 0, \quad \xi_i^j \ge 0, \quad \xi_i^{*j+1} \ge 0,

and the resulting dual functional would have the form:

    \max_{\alpha} \;\; -\frac{1}{2}\, \alpha^\top (Q^\top Q)\, \alpha    (32)

    subject to
    0 \le \alpha_i \le \frac{1}{l}, \quad i = 1,\dots,N,    (33)
    \sum_i \alpha_i \ge \nu, \qquad 1 \cdot \alpha^1_j = 1 \cdot \alpha^2_j, \quad j = 1,\dots,k-1,    (34)

which is not equivalent to the dual functional (26) of the multiple-margin policy (nor to the dual functional (11) of the fixed-margin policy).

Figure 4: The EachMovie dataset, used for predicting a person's rating on a new movie given the person's past ratings on similar movies and the ratings of other people on all the movies (http://www.research.compaq.com/src/eachmovie/). The dataset holds the ratings of 72,916 users over 1628 movies; the rating matrix is sparse (about 5% full), with a total of 2,811,983 ratings y in {0,...,6}. The training set for a target user consists of pairs (x_i, y_i), where x_i is the i-th column of the array and y_i is the target user's rating, and the task is to predict the rating f(x) of the target user on a new movie. See text for details.

We saw that nu-SVM could be rederived using the principle of two parallel hyperplanes (primal functional (14) in the case k = 2). We show next that the generalization of nu-SVM to ranking learning (dual functional (32) above) can be derived using the 2(k-1) parallel hyperplanes approach. The primal functional takes the following form:

    \min_{w,\, a_j,\, b_j,\, t,\, \xi} \;\; t + C \sum_{j,i} \left( \xi_i^j + \xi_i^{*j+1} \right)

    subject to
    a_j - b_j = t,
    \quad w \cdot x_i^j \le a_j + \xi_i^j,
    \quad b_j - \xi_i^{*j+1} \le w \cdot x_i^{j+1},
    \quad w \cdot w \le 1,
    \quad \xi_i^j \ge 0, \quad \xi_i^{*j+1} \ge 0.

Note that the objective "min t", together with the constraint a_j - b_j = t, captures the fixed margin policy. The resulting dual functional takes the following form:

    \max_{\alpha} \;\; -\alpha^\top (Q^\top Q)\, \alpha    (35)

    subject to
    0 \le \alpha_i \le C, \quad i = 1,\dots,N,    (36)
    \sum_i \alpha_i = 2,    (37)
    1 \cdot \alpha^1_j = 1 \cdot \alpha^2_j, \quad j = 1,\dots,k-1,    (38)

which is equivalent (via a change of variables) to the dual functional (32). Thus, to conclude, there are two fixed-margin implementations for ranking learning: one is a direct generalization of conventional SVM (dual functional (11)), and the second is a direct generalization of nu-SVM (dual functional (35)).

6 Experiments

We have conducted experiments on synthetic data in order to visualize the behavior of the new ranking algorithms, experiments on collaborative filtering problems, and experiments on ranking visual data of vehicles.

Fig. 3 shows the performance of the two types of algorithms on synthetic 2D data of a three-class (k = 3) ordinal regression problem using second-order kernel inner-products (thus the separating surfaces are conics). The value of the constant C changes the sensitivity to the number of margin errors and the number of support vectors and, as a result, the margins themselves (more margin errors allow larger margins). The left column illustrates the fixed-margin policy (dual functional (35)) and the right column illustrates the sum-of-margins policy (dual functional (26)). When the value of C is small (top row) the number of margin errors (and support vectors) is large in order to enable large margins, i.e., the b_j - a_j are large. In the case of sum-of-margins (top right display) a small value of C makes b_1 = a_2 in order to maximize the margins; as a result the center class completely vanishes (the decision rule will never make a classification in favor of the center class). When the value of C is large (bottom row) the number of margin errors (and support vectors) is small and as a result the margins are tight.

Fig. 4 shows the data structure of the EachMovie dataset [6], which is used for collaborative filtering tasks. In general, the goal in collaborative filtering is to predict a person's rating on new items such as movies given the person's past ratings on similar items and the ratings of other people on all the items (including the new item). The ratings are ordered, such as "highly recommended", "good", ..., "very bad", thus collaborative filtering falls naturally under the domain of ordinal regression (rather than general multi-class learning).

Figure 5: The results of the fixed-margin principle plotted against the results obtained by using the on-line algorithm of [5] (Crammer & Singer 2001), which does not use a large-margin principle. The average error between the predicted rating and the correct rating is much lower for the fixed-margin algorithm.

The EachMovie dataset contains 1628 movies rated by 72,916 people, arranged as a 2D array whose columns represent the movies and whose rows represent the users; about 5% of the entries of this array are filled in with ratings between 0,...,6, totaling 2,811,983 ratings. Given a new user, the ratings of that user on the 1628 movies (not all movies would be rated) form the y_i, and the i-th column of the array forms the corresponding x_i, which together form the training data (for that particular user). Given a new movie represented by the vector x of the ratings of all the other 72,916 users (not all the users rated the new movie), the learning task is to predict the rating f(x) of the new user. Since the array contains empty entries, the ratings were shifted by -3.5 so that the possible ratings become {-2.5, -1.5, -0.5, 0.5, 1.5, 2.5}, which allows assigning the value of zero to the empty entries of the array (movies which were not rated).

For the training phase we chose users who ranked about 450 movies, selected a subset of {50, 100, ..., 300} of those movies for training, and tested the prediction on the remaining movies. We compared our results (collected over 100 runs), measured as the average distance between the correct rating and the predicted rating, to the best on-line algorithm of [5], called "PRank" (which makes no use of a large margin principle). In their work PRank was compared to other known on-line approaches and was found to be superior, thus we limited our comparison to PRank alone. Attempts to compare our algorithms to other known ranking algorithms which use a large-margin principle ([7], for example) were not successful, since those square the training set size, which made the experiment with the EachMovie dataset computationally intractable. The graph in Fig. 5 shows that the large margin principle (dual functional (35)) makes a significant difference in the results compared to PRank. The results we obtained with PRank are consistent with the reported results of [5] (a best average error of about 1.25), whereas our fixed-margin algorithm provided an average error of about 0.7.

Figure 6: Classification of vehicle type: Small, Medium and Large. On the left are typical examples of correct classifications and on the right are typical examples of incorrect classifications.

We also applied the ranking learning algorithms to a visual classification problem in which we consider images of vehicles taken from the rear, where the task is to classify each picture into one of three classes: "small" (passenger cars), "medium" (SUVs, minivans) and "large" (buses, trucks). There is a natural order Small, Medium, Large, since making a mistake between Small and Large is worse than confusing Small and Medium, for example. The ordering Small, Medium, Large makes it natural to apply ranking learning (rather than general multi-class learning). The problem of classifying vehicle types is relevant for applications in the area of Intelligent Traffic Transportation (ITS), where on-board sensors such as visual and radar would be responsible for a wide variety of driving assistance applications, including active safety related to airbag deployment, in which vehicle classification data is one important piece of information.

The training data included 1500 examples from each class, where the input vector was simply the raw pixel values down-sampled to 20x20 pixels per image. The testing phase included 8081 pictures of Small vehicles, 3453 pictures of Medium vehicles and 2395 pictures of Large vehicles. The classification error (counting the number of misclassifications) with the fixed-margin policy using second-order kernel inner-products was 20% of all test data, compared to 25% when performing the classification using three rounds of 2-class conventional SVM (which is the conventional approach to using the large margin principle for general multi-class learning). We also examined the ranking error by averaging, over all test vectors x, the difference between the true rank {1, 2, 3} and the predicted rank f(x) obtained by comparing the kernel projection

    \sum_{\text{support vectors}} \lambda_i^{*j} K(x_i^{j+1}, x) - \sum_{\text{support vectors}} \lambda_i^{j} K(x_i^{j}, x)

against the thresholds b_r, as in Section 3. The average was 0.216, compared to 1.408 using PRank. Fig. 6 shows a typical collection of correctly classified and incorrectly classified pictures from the test set.
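For reference, the two quantities used in this experiment, the flattened 20x20 raw-pixel feature vector and the average ranking error, can be written as the following small Python helpers (added for illustration; names are invented).

    import numpy as np

    def image_to_feature(img20x20):
        """Raw-pixel feature used in the vehicle experiment: a down-sampled
        20x20 grayscale image flattened into a 400-dimensional vector."""
        return np.asarray(img20x20, dtype=float).ravel()

    def ranking_error(true_ranks, predicted_ranks):
        """Average |true rank - predicted rank| over the test set, the measure
        reported as 0.216 (fixed margin) vs. 1.408 (PRank) in the text."""
        t = np.asarray(true_ranks, dtype=float)
        p = np.asarray(predicted_ranks, dtype=float)
        return float(np.mean(np.abs(t - p)))

    # toy usage with ranks in {1, 2, 3}
    print(ranking_error([1, 2, 3, 3], [1, 3, 3, 1]))  # (0 + 1 + 0 + 2) / 4 = 0.75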

7 Summary

We have introduced a number of algorithms, of linear size in the number of training examples, for implementing a large margin principle for the task of ordinal regression. The first type of algorithm (dual functionals (11), (32), (35)) introduces the constraint of a single margin determined by the closest adjacent pair of classes. That particular margin is maximized while preserving (modulo margin errors) the separability constraints. The support vectors lie on the boundaries of the closest adjacent pair of classes only, thus a complete solution requires first a QLP for finding the direction w of the hyperplanes and then an LP for finding the thresholds. This type of algorithm comes in two flavors: the first is a direct extension of conventional SVM (dual functional (11)) and the second is a direct extension of nu-SVM (dual functionals (32), (35)).

The second type of algorithm (dual functional (26)) allows for multiple different margins, where the optimization criterion is the sum of the k-1 margins. The key observation with this approach is that, in order to accommodate different margins, the pre-scaling concept ("canonical hyperplane") used in conventional SVM (and in the fixed-margin algorithms above) is not appropriate; instead one must have 2(k-1) parallel hyperplanes, where the margins are represented explicitly by the intervals b_j - a_j (rather than by w.w, as with conventional SVM and the fixed margin algorithms). A byproduct of the sum-of-margins approach is that the LP phase is no longer necessary, and that the role of the constant C has a natural interpretation. In fact, when k = 2 the sum-of-margins algorithm is identical to nu-SVM. The drawback of this approach (a drawback shared with nu-SVM) is that unfortunate choices of the constant C might lead to a duality gap with the QLP, thus rendering the dual functional irrelevant or degenerate.

Experiments performed on visual classification and collaborative filtering show that both approaches outperform an existing (on-line) ordinal regression algorithm applied to ranking, and multi-class SVM applied to the visual classification problem.

Acknowledgements

Thanks to MobilEye Ltd. for the use of the vehicle data set. This work was done while the authors were at the Computer Science department of Stanford University. A.S. especially thanks his host Leo Guibas for making his visit to Stanford possible.

References

[1] J. Anderson. Regression and ordered categorical variables. Journal of the Royal Statistical Society, Series B, 46:1-30, 1984.
[2] B.E. Boser, I.M. Guyon, and V.N. Vapnik. A training algorithm for optimal margin classifiers. In Proc. of the 5th ACM Workshop on Computational Learning Theory, pages 144-152. ACM Press, 1992.
[3] W.W. Cohen, R.E. Schapire, and Y. Singer. Learning to order things. Journal of Artificial Intelligence Research (JAIR), 10:243-270, 1999.
[4] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265-292, 2001.
[5] K. Crammer and Y. Singer. Pranking with ranking. In Proceedings of the conference on Neural Information Processing Systems (NIPS), 2001.
[6] http://www.research.compaq.com/src/eachmovie/
[7] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In Advances in Large Margin Classifiers, pages 115-132, 2000.

[8] Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines. Technical Report 1043, Univ. of Wisconsin, Dept. of Statistics, Sep. 2001.
[9] P. McCullagh and J.A. Nelder. Generalized Linear Models. Chapman and Hall, London, 2nd edition, 1989.
[10] B. Scholkopf, A. Smola, R.C. Williamson, and P.L. Bartlett. New support vector algorithms. Neural Computation, 12:1207-1245, 2000.
[11] V.N. Vapnik. The Nature of Statistical Learning Theory. Springer, 2nd edition, 1998.
[12] J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In Proc. of the 7th European Symposium on Artificial Neural Networks, April 1999.

A   A Closer Look at k = 2: the Role of the Constant C

In nu-SVM the constant 0 < nu < 1 sets the tradeoff between the fraction of allowable margin errors (at most nu*N data points can be margin errors) and the minimal number of support vectors (at least nu*N support vectors). The constant C in the sum-of-margins ranking learning specialized to k = 2 has a similar interpretation: 2/N <= C <= 2 is inversely proportional to the allowable number of margin errors, which is 2/C. Thus, when C = 2 only a single margin error is tolerated (otherwise the optimization problem will be in a weak duality state, to be discussed later), and when C = 2/N all the points may be margin errors (and in turn all the points are support vectors).

The role of the constant C as a tradeoff between the minimal number of support vectors and the allowable number of margin errors can be observed directly through the primal problem, as follows. Let w, a, b, \xi be a feasible solution of the primal problem. Let \epsilon_1 be the smallest of the non-vanishing \xi_i associated with the negative training examples, i.e., the distance of the nearest margin error associated with the negative training examples; and let \epsilon_2 be the smallest of the non-vanishing \xi_i associated with the positive training examples. Consider translating the two hyperplanes such that \hat{a} = a + \epsilon_1 and \hat{b} = b - \epsilon_2. The new feasible solution consists of \hat{a}, \hat{b}, w, \hat{\xi}, where

    \hat{\xi}_i = \begin{cases} \xi_i - \epsilon_1 & \xi_i > 0 \\ 0 & \text{otherwise} \end{cases}
    \qquad (i = 1,\dots,l_1),

and \hat{\xi}_i is defined similarly (with \epsilon_2) for the positive examples. The value of the criterion function becomes

    \hat{a} - \hat{b} + C \sum_i \hat{\xi}_i
    = a - b + C \sum_i \xi_i + \epsilon_1 (1 - m_1 C) + \epsilon_2 (1 - m_2 C),

where m_1 is the number of margin errors (points with \xi_i > 0) associated with the negative training examples and m_2 is the number of margin errors associated with the positive examples. In order for the original solution to be optimal we must have (1 - m_1 C) + (1 - m_2 C) >= 0 (otherwise we could lower the criterion function and obtain a better solution). Therefore,

    C \le \frac{2}{m_1 + m_2}.

We see that C = 2 when only a single margin error is allowed, and C = 2/N when all training data, positive and negative, are allowed to be margin errors. In other words, the smaller C <= 2 is, the more margin errors are allowed in the final solution.

To see the connection between C and the necessary number of support vectors, consider

    \epsilon_1 = \min \{\, a - w \cdot x_i \;:\; a - w \cdot x_i > 0,\; i = 1,\dots,l_1 \,\},

which is the smallest distance between a negative example which is not a support vector and the left hyperplane. Likewise,

    \epsilon_2 = \min \{\, w \cdot x_i - b \;:\; w \cdot x_i - b > 0,\; i = l_1+1,\dots,N \,\},

which is the smallest distance between a positive example which is not a support vector and the right hyperplane. Starting with a feasible solution w, a, b, \xi, we create a new feasible solution w, \hat{a}, \hat{b}, \hat{\xi} as follows. Let \hat{a} = a - \epsilon_1, \hat{b} = b + \epsilon_2,

    \hat{\xi}_i = \begin{cases} \xi_i + \epsilon_1 & x_i \text{ is a support vector} \\ 0 & \text{otherwise} \end{cases}
    \qquad (i = 1,\dots,l_1),

and

    \hat{\xi}_i = \begin{cases} \xi_i + \epsilon_2 & x_i \text{ is a support vector} \\ 0 & \text{otherwise} \end{cases}
    \qquad (i = l_1+1,\dots,N).

Note that the support vectors are associated with points on the hyperplanes and with points labeled as margin errors (\hat{\xi}_i > 0 covers both). Since in the new solution the hyperplanes are shifted, all the old support vectors become margin errors (thus \hat{\xi}_i > 0). The value of the criterion function becomes

    \hat{a} - \hat{b} + C \sum_i \hat{\xi}_i
    = a - b + C \sum_i \xi_i + \epsilon_1 (s_1 C - 1) + \epsilon_2 (s_2 C - 1),

where s_1 is the number of negative support vectors and s_2 is the number of positive support vectors. In order for the original solution to be optimal we must have (s_1 C - 1) + (s_2 C - 1) >= 0 (otherwise we could lower the criterion function and obtain a better solution). Therefore,

    s_1 + s_2 \ge \frac{2}{C}.

We see that when C = 2 (a single margin error is allowed) the number of support vectors is at least 1, and when C = 2/N (all instances are allowed to become margin errors) the number of support vectors is N (i.e., all instances are support vectors). Taken together, C forms a tradeoff: the more margin errors are allowed, the more support vectors one will have in the optimal solution.

Finally, it is worth noting that a wrong selection of the constant C (when there are more margin errors than the value of C allows for) would make the problem infeasible, as the primal criterion function would be positive (otherwise the constraints would not be satisfied). Since the dual criterion function is non-positive, a duality gap would emerge. In other words, even in the presence of slack variables (soft margin), there can be an unfortunate situation where the optimization problem is not feasible, and this situation is related to the choice of the constant C.

To conclude, the 2-parallel-hyperplanes formulation, or equivalently the nu-SVM formulation, carries with it a tradeoff. On the one hand, the role of the constant C is clear and intuitively simple: there is a direct relationship between the value of C and the fraction of data points which are allowed to be marked as margin errors. On the other hand, unlike conventional SVM, which exhibits strong duality under all choices of the regularization constant C, the 2-plane formulation exhibits strong duality only for values of C which are consistent with the worst case scenario of margin errors.