Nested Support Vector Machines


Gyemin Lee, Student Member, IEEE, and Clayton Scott, Member, IEEE

Abstract: One-class and cost-sensitive support vector machines (SVMs) are state-of-the-art machine learning methods for estimating density level sets and solving weighted classification problems, respectively. However, the solutions of these SVMs do not necessarily produce set estimates that are nested as the parameters controlling the density level or cost asymmetry are continuously varied. Such nesting not only reflects the true sets being estimated, but is also desirable for applications requiring the simultaneous estimation of multiple sets, including clustering, anomaly detection, and ranking. We propose new quadratic programs whose solutions give rise to nested versions of one-class and cost-sensitive SVMs. Furthermore, like conventional SVMs, the solution paths in our construction are piecewise linear in the control parameters, although here the number of breakpoints is directly controlled by the user. We also describe decomposition algorithms to solve the quadratic programs. These methods are compared to conventional (non-nested) SVMs on synthetic and benchmark data sets, and are shown to exhibit more stable rankings and decreased sensitivity to parameter settings.

Index Terms: machine learning, pattern classification, one-class support vector machine, cost-sensitive support vector machine, nested set estimation, solution paths.

I. INTRODUCTION

Many statistical learning problems may be characterized as problems of set estimation. In these problems, the input takes the form of a random sample of points in a feature space, while the desired output is a subset G of the feature space. For example, in density level set estimation, a random sample from a density is given and G is an estimate of a density level set. In binary classification, labeled training data are available, and G is the set of all feature vectors predicted to belong to one of the classes.

G. Lee and C. Scott are with the Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI, USA. E-mail: {gyemin, cscott}@eecs.umich.edu. This work was supported in part by NSF Award No.

(a) one-class SVM (b) cost-sensitive SVM

Fig. 1. Two decision boundaries from a one-class SVM (a) and a cost-sensitive SVM (b) at two density levels and cost asymmetries. The shaded regions indicate the density level set estimate at the higher density level and the positive decision set estimate at the lower cost asymmetry, respectively. These regions are not completely contained inside the solid contours corresponding to the smaller density level or the larger cost asymmetry; hence the two decision sets are not properly nested.

In other statistical learning problems, the desired output is a family of sets G_θ with the index θ taking values in a continuum. For example, estimating density level sets at multiple levels is an important task for many problems including clustering [1], outlier ranking [2], minimum volume set estimation [3], and anomaly detection [4]. Estimating cost-sensitive classifiers at a range of different cost asymmetries is important for ranking [5], Neyman-Pearson classification [6], semi-supervised novelty detection [7], and ROC studies [8].

Support vector machines (SVMs) are powerful nonparametric approaches to set estimation [9]. However, both the one-class SVM (OC-SVM) for level set estimation and the standard two-class SVM for classification do not produce set estimates that are nested as the parameter controlling the density level or, respectively, the misclassification cost is varied. As displayed in Fig. 1, set estimates from the original SVMs are not properly nested. On the other hand, Fig. 2 shows nested counterparts obtained from our proposed methods (see Sections III and IV). Since the true sets being estimated are in fact nested, estimators that enforce the nesting constraint will not only avoid nonsensical solutions, but should also be more accurate and less sensitive to parameter settings and perturbations of the training data.

One way to generate nested SVM classifiers is to train a cost-insensitive SVM and simply vary the offset. However, this often leads to inferior performance, as demonstrated in [8]. In this paper, we develop nested variants of one-class and two-class SVMs by incorporating nesting constraints into the dual quadratic programs associated with these methods. Decomposition algorithms for solving these modified duals are also presented. Like the solution paths for conventional SVMs [10], [8],

(a) nested OC-SVM (b) nested CS-SVM

Fig. 2. Five decision boundaries from our nested OC-SVM (a) and nested CS-SVM (b) at five different density levels and cost asymmetries, respectively. These decision boundaries from nested SVMs do not cross each other, unlike the decision boundaries from the original SVMs (OC-SVM and CS-SVM). Therefore, the corresponding set estimates are properly nested.

[11], nested SVM solution paths are also piecewise linear in the control parameters, but require far fewer breakpoints. We compare our nested paths to the unnested paths on synthetic and benchmark data sets. We also quantify the degree to which standard SVMs are unnested, which is often quite high. The Matlab implementation of our algorithms is available at www.eecs.umich.edu/~cscott/code/nestedsvm.zip. A preliminary version of this work appeared in [12].

A. Motivating Applications

With the multiple set estimates from nested SVMs over density levels or cost asymmetries, the following applications are envisioned.

Ranking: In the bipartite ranking problem [13], we are given labeled examples from two classes, and the goal is to construct a score function that rates new examples according to their likelihood of belonging to the positive class. If the decision sets are not nested as cost asymmetries or density levels vary, then the resulting score function leads to an ambiguous ranking. Nested SVMs make the ranking unambiguous and less sensitive to perturbations of the data. See Section VI-C for further discussion.

Clustering: Clusters may be defined as the connected components of a density level set. The level at which the density is thresholded determines a tradeoff between cluster number and cluster coverage. Varying the level from 0 to ∞ yields a cluster tree [14] that depicts the bifurcation of clusters into disjoint components and gives a hierarchical representation of cluster structure.

Anomaly Detection: Anomaly detection aims to identify deviations from nominal data when combined observations of nominal and anomalous data are given. Scott and Kolaczyk [4] and Scott and Blanchard

[7] present approaches to classifying the contaminated, unlabeled data by solving multiple level set estimation and multiple cost-sensitive classification problems, respectively.

II. BACKGROUND ON CS-SVM AND OC-SVM

In this section, we overview two SVM variants and show how they can be used to learn set estimates. To establish notation and basic concepts, we briefly review SVMs. Suppose that we have a random sample {(x_i, y_i)}_{i=1}^N, where x_i ∈ R^d is a feature vector and y_i ∈ {−1, +1} is its class. An SVM finds a separating hyperplane with a normal vector w in a high-dimensional space H by solving

$$\min_{w,\xi}\ \frac{\lambda}{2}\|w\|^2 + \sum_i \xi_i \qquad \text{s.t. } y_i\langle w, \Phi(x_i)\rangle \ge 1-\xi_i,\ \xi_i \ge 0,\ \forall i,$$

where λ is a regularization parameter and Φ is a nonlinear function that maps each data point into H, generated by a positive semi-definite kernel k : R^d × R^d → R. This kernel corresponds to an inner product in H through k(x, x') = ⟨Φ(x), Φ(x')⟩. Then the two half-spaces of the hyperplane {Φ(x) : f(x) = ⟨w, Φ(x)⟩ = 0} form the positive and negative decision sets. Since the offset of the hyperplane is often omitted when Gaussian or inhomogeneous polynomial kernels are chosen [15], it is not considered in this formulation. A more detailed discussion of SVMs can be found in [9].

A. Cost-Sensitive SVM

The SVM above, which we call a cost-insensitive SVM (CI-SVM), penalizes errors in both classes equally. However, there are many applications where the numbers of data samples from each class are not balanced, or false positives and false negatives incur different costs. The cost-sensitive SVM (CS-SVM) handles this issue by controlling the cost asymmetry between false positives and false negatives [16]. Let I_+ = {i : y_i = +1} and I_- = {i : y_i = −1} denote the two index sets, and let γ denote the cost asymmetry. Then a CS-SVM solves

$$\min_{w,\xi}\ \frac{\lambda}{2}\|w\|^2 + \gamma\sum_{i\in I_+}\xi_i + (1-\gamma)\sum_{i\in I_-}\xi_i \quad (1)$$
$$\text{s.t. } y_i\langle w, \Phi(x_i)\rangle \ge 1-\xi_i,\ \xi_i \ge 0,\ \forall i,$$

where w is the normal vector of the hyperplane. When γ = 1/2, the CS-SVM reduces to the CI-SVM.

In practice this optimization problem is solved via its dual, which depends only on a set of Lagrange multipliers (one for each x_i):

$$\min_{\alpha}\ \frac{1}{2\lambda}\sum_{i,j}\alpha_i\alpha_j y_iy_jK_{i,j} - \sum_i\alpha_i \quad (2)$$
$$\text{s.t. } 0 \le \alpha_i \le 1_{\{y_i<0\}} + y_i\gamma,\ \forall i,$$

where K_{i,j} = k(x_i, x_j) and α = (α_1, α_2, ..., α_N). The indicator function 1_{A} returns 1 if the condition A is true and 0 otherwise. Once an optimal solution α(γ) = (α_1(γ), ..., α_N(γ)) is found, the sign of the decision function

$$f_\gamma(x) = \frac{1}{\lambda}\sum_i \alpha_i(\gamma)y_ik(x_i, x) \quad (3)$$

determines the class of x. If k(·,·) ≥ 0, then this decision function takes only non-positive values when γ = 0, and corresponds to (0, 0) in the ROC. On the other hand, γ = 1 penalizes only the violations of positive examples, and corresponds to (1, 1) in the ROC.

Bach et al. [8] extended the method of Hastie et al. [10] to the CS-SVM. They showed that the α_i(γ) are piecewise linear in γ, and derived an efficient algorithm for computing the entire path of solutions to (2). Thus, a family of classifiers at a range of cost asymmetries can be found with a computational cost comparable to solving (2) for a single γ.

B. One-Class SVM

The OC-SVM was proposed in [17], [18] to estimate a level set of an underlying probability density given a data sample from the density. In one-class problems, all the instances are assumed to be from the same class, typically the negative class: y_i = −1, ∀i. The primal quadratic program of the OC-SVM is

$$\min_{w,\xi}\ \frac{\lambda}{2}\|w\|^2 + \frac{1}{N}\sum_{i=1}^N\xi_i \quad (4)$$
$$\text{s.t. } \langle w, \Phi(x_i)\rangle \ge 1-\xi_i,\ \xi_i \ge 0,\ \forall i.$$

This problem is again solved via its dual in practice:

$$\min_{\alpha}\ \frac{1}{2\lambda}\sum_{i,j}\alpha_i\alpha_jK_{i,j} - \sum_i\alpha_i \quad (5)$$
$$\text{s.t. } 0 \le \alpha_i \le \frac{1}{N},\ \forall i.$$

Then a solution α(λ) = (α_1(λ), ..., α_N(λ)) defines a decision function that determines whether a point is an outlier or not. Here the α_i(λ) are also piecewise linear in λ [11]. From this property, we can develop a path-following algorithm and generate a family of level set estimates with a small computational cost.
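As a concrete illustration, the following minimal sketch (with illustrative names; this is not the authors' released Matlab code) evaluates the decision function (3) from a dual solution α with a Gaussian kernel.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma):
    # pairwise squared distances between rows of X1 and rows of X2
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def decision_function(X_train, y, alpha, X_test, lam, sigma):
    # f_gamma(x) = (1/lambda) * sum_i alpha_i y_i k(x_i, x), as in (3)
    K = gaussian_kernel(X_test, X_train, sigma)   # shape (n_test, N)
    return K @ (alpha * y) / lam
```

The sign of the returned values gives the predicted class, and thresholding at zero yields the positive decision set.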

The set estimate conventionally associated with the OC-SVM is given by

$$\hat{G}_\lambda = \{x : \sum_i \alpha_i(\lambda)k(x_i, x) > \lambda\}. \quad (6)$$

Vert and Vert [19] showed that by modifying this estimate slightly, substituting α(ηλ) for α(λ) where η > 1, (6) leads to a consistent estimate of the true level set when a Gaussian kernel with a well-calibrated bandwidth is used. Regardless of whether η = 1 or η > 1, however, the obtained estimates are not guaranteed to be nested, as we will see in Section VI. Note also that when α_i(λ) = 1/N ∀i, (6) is equivalent to set estimation based on kernel density estimation.

III. NESTED CS-SVM

In this section, we develop the nested cost-sensitive SVM (NCS-SVM), which aims to produce nested positive decision sets G_γ = {x : f_γ(x) > 0} as the cost asymmetry γ varies. Our construction is a two-stage process. We first select a finite number of cost asymmetries 0 = γ_1 < γ_2 < ... < γ_M = 1 a priori and generate a family of nested decision sets at the preselected cost asymmetries. We achieve this goal by incorporating nesting constraints into the dual quadratic program of the CS-SVM. Second, we linearly interpolate the solution coefficients of the finite nested collection to a continuous nested family defined for all γ. As an efficient method to solve the formulated problem, we present a decomposition algorithm.

A. Finite Family of Nested Sets

Our NCS-SVM finds decision functions at the cost asymmetries γ_1, γ_2, ..., γ_M simultaneously by minimizing the sum of the duals (2) at each γ_m and by imposing additional constraints that induce nested sets. For a fixed λ and preselected cost asymmetries 0 = γ_1 < γ_2 < ... < γ_M = 1, an NCS-SVM solves

$$\min_{\alpha_1,\ldots,\alpha_M}\ \sum_{m=1}^M\Big[\frac{1}{2\lambda}\sum_{i,j}\alpha_{i,m}\alpha_{j,m}y_iy_jK_{i,j} - \sum_i\alpha_{i,m}\Big] \quad (7)$$
$$\text{s.t. } 0 \le \alpha_{i,m} \le 1_{\{y_i<0\}} + y_i\gamma_m,\ \forall i, m \quad (8)$$
$$y_i\alpha_{i,1} \le y_i\alpha_{i,2} \le \cdots \le y_i\alpha_{i,M},\ \forall i, \quad (9)$$

where α_m = (α_{1,m}, ..., α_{N,m}) and α_{i,m} is the coefficient for data point x_i and cost asymmetry γ_m. Then its optimal solution α*_m = (α*_{1,m}, ..., α*_{N,m}) defines the decision function f_{γ_m}(x) = (1/λ) Σ_i α*_{i,m} y_i k(x_i, x) and its corresponding decision set Ĝ_{γ_m} = {x : f_{γ_m}(x) > 0} for each m. In Section VII, the proposed quadratic program for the NCS-SVM is interpreted as the dual of a corresponding primal quadratic program.
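For small problems, the quadratic program (7)-(9) can also be handed directly to a generic convex solver. The following is a minimal sketch of that route, assuming the cvxpy package is available (all names are illustrative, and the eigenvalue factorization is only for solver compatibility); the decomposition algorithm of Section III-C is the approach actually developed in this paper.

```python
import numpy as np
import cvxpy as cp

def ncs_svm_dual(K, y, gammas, lam):
    """Jointly solve the NCS-SVM dual (7)-(9) at all cost asymmetries."""
    N, M = len(y), len(gammas)
    Q = np.outer(y, y) * K / lam                 # quadratic term of (7)
    w, V = np.linalg.eigh(Q)                     # factor Q = L L^T for a DCP-safe objective
    L = V * np.sqrt(np.clip(w, 0.0, None))
    A = cp.Variable((N, M))                      # A[i, m] = alpha_{i,m}
    obj = sum(0.5 * cp.sum_squares(L.T @ A[:, m]) - cp.sum(A[:, m]) for m in range(M))
    cons = []
    for m, g in enumerate(gammas):
        ub = np.where(y < 0, 1.0 - g, g)         # upper bound (8): 1_{y<0} + y*gamma_m
        cons += [A[:, m] >= 0, A[:, m] <= ub]
    for m in range(M - 1):                       # nesting constraints (9)
        cons.append(cp.multiply(y, A[:, m]) <= cp.multiply(y, A[:, m + 1]))
    cp.Problem(cp.Minimize(obj), cons).solve()
    return A.value
```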

B. Interpolation

For an intermediate cost asymmetry γ between two cost asymmetries, say γ_1 and γ_2 without loss of generality, we can write γ = εγ_1 + (1 − ε)γ_2 for some ε ∈ [0, 1]. Then we define new coefficients α_i(γ) through linear interpolation:

$$\alpha_i(\gamma) = \epsilon\alpha_{i,1} + (1-\epsilon)\alpha_{i,2}. \quad (10)$$

Then the positive decision set at cost asymmetry γ is

$$\hat{G}_\gamma = \Big\{x : f_\gamma(x) = \frac{1}{\lambda}\sum_i\alpha_i(\gamma)y_ik(x_i, x) > 0\Big\}. \quad (11)$$

This is motivated by the piecewise linearity of the Lagrange multipliers of the CS-SVM, and is further justified by the following result.

Proposition 1. The nested CS-SVM equipped with a kernel such that k(·,·) ≥ 0 (e.g., Gaussian kernels or polynomial kernels of even orders) generates nested decision sets. In other words, if 0 ≤ γ_ε < γ_δ ≤ 1, then Ĝ_{γ_ε} ⊆ Ĝ_{γ_δ}.

Proof: We prove the proposition in three steps. First, we show that sets from (7) satisfy Ĝ_{γ_1} ⊆ Ĝ_{γ_2} ⊆ ... ⊆ Ĝ_{γ_M}. Second, we show that if γ_m < γ < γ_{m+1}, then Ĝ_{γ_m} ⊆ Ĝ_γ ⊆ Ĝ_{γ_{m+1}}. Finally, we prove that any two sets from the NCS-SVM are nested.

Without loss of generality, we show Ĝ_{γ_1} ⊆ Ĝ_{γ_2}. Let α_1 and α_2 denote the optimal solutions for γ_1 and γ_2. Then from k(·,·) ≥ 0 and (9), we have Σ_i α_{i,1}y_ik(x_i, x) ≤ Σ_i α_{i,2}y_ik(x_i, x). Therefore, Ĝ_{γ_1} = {x : f_{γ_1}(x) > 0} ⊆ Ĝ_{γ_2} = {x : f_{γ_2}(x) > 0}.

Next, without loss of generality, we show Ĝ_{γ_1} ⊆ Ĝ_γ ⊆ Ĝ_{γ_2} when γ_1 ≤ γ ≤ γ_2. The linear interpolation (10) and the nesting constraints (9) imply y_iα_{i,1} ≤ y_iα_i(γ) ≤ y_iα_{i,2}, which, in turn, leads to Σ_i α_{i,1}y_ik(x_i, x) ≤ Σ_i α_i(γ)y_ik(x_i, x) ≤ Σ_i α_{i,2}y_ik(x_i, x).

Now consider arbitrary 0 ≤ γ_ε < γ_δ ≤ 1. If γ_ε ≤ γ_m ≤ γ_δ for some m, then Ĝ_{γ_ε} ⊆ Ĝ_{γ_δ} by the above results. Thus, suppose this is not the case and assume γ_1 < γ_ε < γ_δ < γ_2 without loss of generality. Then there exist ε > δ such that γ_ε = εγ_1 + (1 − ε)γ_2 and γ_δ = δγ_1 + (1 − δ)γ_2. Suppose x ∈ Ĝ_{γ_ε}. Then x ∈ Ĝ_{γ_2}; hence f_{γ_ε}(x) = (1/λ) Σ_i (εα_{i,1} + (1 − ε)α_{i,2})y_ik(x_i, x) > 0 and f_{γ_2}(x) = (1/λ) Σ_i α_{i,2}y_ik(x_i, x) > 0. By taking (δ/ε)f_{γ_ε}(x) + (1 − δ/ε)f_{γ_2}(x), we have f_{γ_δ}(x) = (1/λ) Σ_i (δα_{i,1} + (1 − δ)α_{i,2})y_ik(x_i, x) > 0. Thus, Ĝ_{γ_ε} ⊆ Ĝ_{γ_δ}. ∎

The assumption that the kernel is positive can in some cases be attained through pre-processing of the data. For example, a cubic polynomial kernel can be applied if the data support is shifted to lie in the positive orthant, so that the kernel function is in fact always positive.
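A minimal sketch of the interpolation rule (10), assuming the breakpoint solutions are stored columnwise (the helper name is ours, not the paper's):

```python
import numpy as np

def interpolate_alpha(gamma, gammas, A):
    """gammas: increasing breakpoints, shape (M,); A: coefficients, shape (N, M)."""
    m = np.searchsorted(gammas, gamma)       # gammas[m-1] <= gamma <= gammas[m]
    if m == 0:
        return A[:, 0]
    if m == len(gammas):
        return A[:, -1]
    eps = (gammas[m] - gamma) / (gammas[m] - gammas[m - 1])
    return eps * A[:, m - 1] + (1 - eps) * A[:, m]   # eq. (10)
```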

C. Decomposition Algorithm

The objective function (7) requires optimization over N × M variables. Due to its large size, standard quadratic programming algorithms are inadequate. Thus, we develop a decomposition algorithm that iteratively divides the large optimization problem into subproblems and optimizes the smaller problems. A similar approach also appears in a multi-class classification algorithm [20], although the algorithm developed there is substantively different from ours. The decomposition algorithm proceeds as follows:

1) Choose an example x_i from the data set.
2) Optimize the coefficients {α_{i,m}}_{m=1}^M corresponding to x_i while leaving the other variables fixed.
3) Repeat 1 and 2 until the optimality condition error falls below a predetermined tolerance.

The pseudo-code given in Fig. 3 initializes with the feasible solution α_{i,m} = 1_{\{y_i<0\}} + y_iγ_m, ∀i, m. A simple way of selection and termination is cycling through all the x_i, or picking x_i randomly and stopping after a fixed number of iterations. However, by checking the Karush-Kuhn-Tucker (KKT) optimality conditions and choosing the x_i most violating these conditions [21], the algorithm converges in far fewer iterations. In the Appendix, we provide a detailed discussion of the data point selection scheme and the termination criterion based on the KKT optimality conditions.

In step 2, the algorithm optimizes the set of variables associated with the chosen data point. Without loss of generality, let us assume that the data point x_1 is chosen, and {α_{1,m}}_{m=1}^M will be optimized while fixing the other α_{i,m}. We rewrite the objective function (7) in terms of the α_{1,m}:

$$\frac{1}{2\lambda}\sum_m\sum_{i,j}\alpha_{i,m}\alpha_{j,m}y_iy_jK_{i,j} - \sum_m\sum_i\alpha_{i,m}$$
$$= \frac{1}{\lambda}\sum_m\Big[\frac{1}{2}\alpha_{1,m}^2K_{1,1} + \alpha_{1,m}\sum_{j\neq 1}\alpha_{j,m}y_1y_jK_{1,j}\Big] - \sum_m\alpha_{1,m} + C$$
$$= \frac{1}{\lambda}\sum_m\Big[\frac{1}{2}\alpha_{1,m}^2K_{1,1} + \alpha_{1,m}\big(\lambda y_1f_{1,m} - \alpha^{old}_{1,m}K_{1,1} - \lambda\big)\Big] + C$$
$$= \frac{K_{1,1}}{\lambda}\sum_m\Big[\frac{1}{2}\alpha_{1,m}^2 - \alpha_{1,m}\Big(\alpha^{old}_{1,m} + \frac{\lambda(1 - y_1f_{1,m})}{K_{1,1}}\Big)\Big] + C,$$

where f_{1,m} = (1/λ)(Σ_{j≠1} α_{j,m}y_jK_{1,j} + α^{old}_{1,m}y_1K_{1,1}) and α^{old}_{1,m} denote the output and the variable preceding the update, respectively. These values can be easily computed from the result of the previous iteration. C is a collection of terms that do not depend on the α_{1,m}.

Input: {(x_i, y_i)}_{i=1}^N, {γ_m}_{m=1}^M
Initialize: α_{i,m} ← 1_{\{y_i<0\}} + y_iγ_m, ∀i, m
repeat
    Choose a data point x_i.
    Compute: f_{i,m} ← (1/λ) Σ_j α_{j,m}y_jK_{i,j}, ∀m
             α^{new}_{i,m} ← α_{i,m} + λ(1 − y_if_{i,m})/K_{i,i}, ∀m
    Update {α_{i,m}}_{m=1}^M with the solution of the subproblem:
        min over α_{i,1}, ..., α_{i,M} of Σ_m [ (1/2)α_{i,m}² − α_{i,m}α^{new}_{i,m} ]
        s.t. 0 ≤ α_{i,m} ≤ 1_{\{y_i<0\}} + y_iγ_m, ∀m
             y_iα_{i,1} ≤ y_iα_{i,2} ≤ ... ≤ y_iα_{i,M}
until the accuracy conditions are satisfied
Output: Ĝ_{γ_m} = {x : Σ_i α_{i,m}y_ik(x_i, x) > 0}, ∀m

Fig. 3. Decomposition algorithm for the nested cost-sensitive SVM. Specific strategies for data point selection and termination, based on the KKT conditions, are given in the Appendix.

Then the algorithm solves the new subproblem with M variables,

$$\min_{\alpha_{1,1},\ldots,\alpha_{1,M}}\ \sum_m\Big[\frac{1}{2}\alpha_{1,m}^2 - \alpha_{1,m}\alpha^{new}_{1,m}\Big]$$
$$\text{s.t. } 0 \le \alpha_{1,m} \le 1_{\{y_1<0\}} + y_1\gamma_m,\ \forall m$$
$$y_1\alpha_{1,1} \le y_1\alpha_{1,2} \le \cdots \le y_1\alpha_{1,M},$$

where α^{new}_{1,m} = α^{old}_{1,m} + λ(1 − y_1f_{1,m})/K_{1,1} is the solution if it is feasible. This subproblem is much smaller than (7) and can be solved efficiently via standard quadratic program solvers.
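Since this subproblem amounts to projecting α^{new} onto the box and chain constraints, a generic QP solver suffices. A minimal sketch, again assuming cvxpy and using illustrative names:

```python
import numpy as np
import cvxpy as cp

def ncs_subproblem(alpha_new, y_i, gammas):
    """M-variable subproblem of Fig. 3 for a single data point x_i."""
    M = len(gammas)
    a = cp.Variable(M)
    ub = np.where(y_i < 0, 1.0 - np.asarray(gammas), np.asarray(gammas))
    cons = [a >= 0, a <= ub]                                  # box constraints
    cons += [y_i * a[m] <= y_i * a[m + 1] for m in range(M - 1)]  # chain constraints
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(a) - alpha_new @ a), cons).solve()
    return a.value
```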

IV. NESTED OC-SVM

In this section, we present a nested extension of the OC-SVM. The nested OC-SVM (NOC-SVM) estimates a family of nested level sets over a continuum of levels λ. Our approach here parallels the approach developed for the NCS-SVM. First, we introduce an objective function for nested set estimation, and then develop analogous interpolation and decomposition algorithms for the NOC-SVM.

A. Finite Family of Nested Sets

For M different density levels of interest λ_1 > λ_2 > ... > λ_M > 0, an NOC-SVM solves the following optimization problem:

$$\min_{\alpha_1,\ldots,\alpha_M}\ \sum_{m=1}^M\Big[\frac{1}{2\lambda_m}\sum_{i,j}\alpha_{i,m}\alpha_{j,m}K_{i,j} - \sum_i\alpha_{i,m}\Big] \quad (12)$$
$$\text{s.t. } 0 \le \alpha_{i,m} \le \frac{1}{N},\ \forall i, m \quad (13)$$
$$\frac{\alpha_{i,1}}{\lambda_1} \le \frac{\alpha_{i,2}}{\lambda_2} \le \cdots \le \frac{\alpha_{i,M}}{\lambda_M},\ \forall i, \quad (14)$$

where α_m = (α_{1,m}, ..., α_{N,m}) and α_{i,m} corresponds to data point x_i at level λ_m. Its optimal solution α*_m = (α*_{1,m}, ..., α*_{N,m}) determines a level set estimate Ĝ_{λ_m} = {x : f_{λ_m}(x) > 1}, where f_{λ_m}(x) = (1/λ_m) Σ_i α*_{i,m} k(x_i, x). In practice, we can choose λ_1 and λ_M to cover the entire range of interesting values of the density level (see Section VI-B, Appendix C). In Section VII, this quadratic program for the NOC-SVM is interpreted as the dual of a corresponding primal quadratic program.

B. Interpolation and Extrapolation

We construct a density level set estimate at an intermediate level λ between two preselected levels, say λ_1 and λ_2. At λ = ελ_1 + (1 − ε)λ_2 for some ε ∈ [0, 1], we set α_i(λ) = εα_{i,1} + (1 − ε)α_{i,2}. For λ > λ_1, we extrapolate the solution by setting α_i(λ) = α_{i,1} for ∀i. These choices are motivated by the facts that the OC-SVM solution is piecewise linear in λ and remains constant for λ > λ_1, as presented in Appendix C. Then the level set estimate becomes

$$\hat{G}_\lambda = \{x : \sum_i\alpha_i(\lambda)k(x_i, x) > \lambda\}. \quad (15)$$

The level set estimates generated by the above process are shown to be nested in the next proposition.

Proposition 2. The nested OC-SVM equipped with a kernel such that k(·,·) ≥ 0 (in particular, a Gaussian kernel) generates nested density level set estimates. That is, if 0 < λ_ε < λ_δ < ∞, then Ĝ_{λ_δ} ⊆ Ĝ_{λ_ε}.

Proof: We prove the proposition in three steps. First, we show that sets from (12) satisfy Ĝ_{λ_1} ⊆ Ĝ_{λ_2} ⊆ ... ⊆ Ĝ_{λ_M}. Second, the interpolated set (15) is shown to satisfy Ĝ_{λ_m} ⊆ Ĝ_λ ⊆ Ĝ_{λ_{m+1}} when λ_m > λ > λ_{m+1}. Finally, we prove the claim for any two sets from the NOC-SVM.

Without loss of generality, we first show Ĝ_{λ_1} ⊆ Ĝ_{λ_2}. Let λ_1 > λ_2 denote two density levels chosen a priori, and let α_1 and α_2 denote their corresponding optimal solutions. From (14), we have Σ_i (α_{i,1}/λ_1)k(x_i, x) ≤ Σ_i (α_{i,2}/λ_2)k(x_i, x), so the two estimated level sets are nested: Ĝ_{λ_1} ⊆ Ĝ_{λ_2}.

Next, without loss of generality, we prove Ĝ_{λ_1} ⊆ Ĝ_λ ⊆ Ĝ_{λ_2} for λ_1 > λ > λ_2. From (14), we have α_{i,1}/λ_1 ≤ α_{i,2}/λ_2, and

$$\frac{\alpha_{i,1}}{\lambda_1} = \frac{\epsilon\alpha_{i,1} + (1-\epsilon)\frac{\lambda_2}{\lambda_1}\alpha_{i,1}}{\lambda} \le \frac{\epsilon\alpha_{i,1} + (1-\epsilon)\alpha_{i,2}}{\lambda} = \frac{\alpha_i(\lambda)}{\lambda} \le \frac{\epsilon\frac{\lambda_1}{\lambda_2}\alpha_{i,2} + (1-\epsilon)\alpha_{i,2}}{\lambda} = \frac{\alpha_{i,2}}{\lambda_2}.$$

Hence, f_{λ_1}(x) ≤ f_λ(x) ≤ f_{λ_2}(x).

Now consider arbitrary λ_δ > λ_ε > 0. By construction, we can easily see that Ĝ_{λ_δ} ⊆ Ĝ_{λ_ε} ⊆ Ĝ_{λ_1} for λ_δ > λ_ε > λ_1, and Ĝ_{λ_M} ⊆ Ĝ_{λ_δ} ⊆ Ĝ_{λ_ε} for λ_M > λ_δ > λ_ε. Thus we only need to consider the case λ_1 ≥ λ_δ > λ_ε ≥ λ_M. Since the above results imply Ĝ_{λ_δ} ⊆ Ĝ_{λ_ε} if λ_δ ≥ λ_m ≥ λ_ε for some m, we can safely assume λ_1 > λ_δ > λ_ε > λ_2 without loss of generality. Then there exist δ > ε such that λ_δ = δλ_1 + (1 − δ)λ_2 and λ_ε = ελ_1 + (1 − ε)λ_2. Suppose x ∈ Ĝ_{λ_δ}. Then x ∈ Ĝ_{λ_2} and

$$\sum_i(\delta\alpha_{i,1} + (1-\delta)\alpha_{i,2})k(x_i, x) > \lambda_\delta \quad (16)$$
$$\sum_i\alpha_{i,2}k(x_i, x) > \lambda_2. \quad (17)$$

By taking (ε/δ)·(16) + (1 − ε/δ)·(17), we have Σ_i (εα_{i,1} + (1 − ε)α_{i,2})k(x_i, x) > λ_ε. Thus, Ĝ_{λ_δ} ⊆ Ĝ_{λ_ε}. ∎

The statement of this result focuses on the Gaussian kernel because this is the primary kernel for which the OC-SVM has been successfully applied.

C. Decomposition Algorithm

We also use a decomposition algorithm to solve (12). The general steps are the same as explained in Section III-C for the NCS-SVM. Fig. 4 shows the outline of the algorithm.

In the algorithm, the feasible solution α_{i,m} = 1/N for ∀i, m is used as the initial solution. Here we present how we can divide the large optimization problem into a collection of smaller problems. Suppose that the data point x_1 is selected and its corresponding coefficients {α_{1,m}}_{m=1}^M will be updated. Writing the objective function only in terms of the α_{1,m}, we have

$$\sum_m\frac{1}{2\lambda_m}\sum_{i,j}\alpha_{i,m}\alpha_{j,m}K_{i,j} - \sum_m\sum_i\alpha_{i,m}$$
$$= \sum_m\frac{1}{2\lambda_m}\alpha_{1,m}^2K_{1,1} + \sum_m\alpha_{1,m}\Big[\frac{1}{\lambda_m}\sum_{j\neq 1}\alpha_{j,m}K_{1,j} - 1\Big] + C$$
$$= \sum_m\Big[\frac{1}{2\lambda_m}\alpha_{1,m}^2K_{1,1} + \alpha_{1,m}\Big(f_{1,m} - \frac{\alpha^{old}_{1,m}K_{1,1}}{\lambda_m} - 1\Big)\Big] + C$$
$$= K_{1,1}\sum_m\frac{1}{2\lambda_m}\Big[\alpha_{1,m}^2 - 2\alpha_{1,m}\Big(\alpha^{old}_{1,m} + \frac{\lambda_m(1 - f_{1,m})}{K_{1,1}}\Big)\Big] + C,$$

where α^{old}_{1,m} and f_{1,m} = (1/λ_m)(Σ_{j≠1} α_{j,m}K_{1,j} + α^{old}_{1,m}K_{1,1}) denote the variable from the previous iteration step and the corresponding output, respectively. C is a constant that does not affect the solution. Then we obtain the reduced optimization problem of M variables,

$$\min_{\alpha_{1,1},\ldots,\alpha_{1,M}}\ \sum_m\Big[\frac{1}{2\lambda_m}\alpha_{1,m}^2 - \frac{1}{\lambda_m}\alpha_{1,m}\alpha^{new}_{1,m}\Big] \quad (18)$$
$$\text{s.t. } 0 \le \alpha_{1,m} \le \frac{1}{N},\ \forall m \quad (19)$$
$$\frac{\alpha_{1,1}}{\lambda_1} \le \frac{\alpha_{1,2}}{\lambda_2} \le \cdots \le \frac{\alpha_{1,M}}{\lambda_M}, \quad (20)$$

where α^{new}_{1,m} = α^{old}_{1,m} + λ_m(1 − f_{1,m})/K_{1,1}. Notice that α^{new}_{1,m} becomes the solution if it is feasible. This reduced optimization problem can be solved through standard quadratic program solvers.

V. COMPUTATIONAL CONSIDERATIONS

Here we provide guidelines for breakpoint selection and discuss the effects of interpolation.

A. Breakpoint Selection

The construction of an NCS-SVM begins with the selection of a finite number of cost asymmetries. Since the cost asymmetries take values within the range [0, 1], the two breakpoints γ_1 and γ_M should be at the two extremes, so that γ_1 = 0 and γ_M = 1. Then the rest of the breakpoints γ_2, ..., γ_{M−1} can be set evenly spaced between γ_1 and γ_M. On the other hand, the density levels for NOC-SVMs should be strictly positive. Without covering all positive reals, however, λ_1 and λ_M can be chosen to cover practically all the density levels of interest.

Input: {x_i}_{i=1}^N, {λ_m}_{m=1}^M
Initialize: α_{i,m} ← 1/N, ∀i, m
repeat
    Choose a data point x_i.
    Compute: f_{i,m} ← (1/λ_m) Σ_j α_{j,m}K_{i,j}, ∀m
             α^{new}_{i,m} ← α_{i,m} + λ_m(1 − f_{i,m})/K_{i,i}, ∀m
    Update {α_{i,m}}_{m=1}^M with the solution of the subproblem:
        min over α_{i,1}, ..., α_{i,M} of Σ_m [ (1/(2λ_m))α_{i,m}² − (1/λ_m)α_{i,m}α^{new}_{i,m} ]
        s.t. 0 ≤ α_{i,m} ≤ 1/N, ∀m
             α_{i,1}/λ_1 ≤ α_{i,2}/λ_2 ≤ ... ≤ α_{i,M}/λ_M
until the accuracy conditions are satisfied
Output: Ĝ_{λ_m} = {x : Σ_i α_{i,m}k(x_i, x) > λ_m}, ∀m

Fig. 4. Decomposition algorithm for the nested one-class SVM. Specific strategies for data point selection and termination, based on the KKT conditions, are given in the Appendix.

The largest level λ_1 for the NOC-SVM is set as described in Appendix C, where we show that for λ > λ_1 the CS-SVM and OC-SVM remain unchanged. A very small number greater than 0 is set for λ_M. Then the NOC-SVM is trained on evenly spaced breakpoints between λ_1 and λ_M.

In our experiments, we set the number of breakpoints to M = 5 for NCS-SVMs and M = 11 for NOC-SVMs. These values were chosen because increasing the number of breakpoints M had diminishing AUC gains while increasing the training time in our experiments. Thus, the cost asymmetries for the NCS-SVM are (0, 0.25, 0.5, 0.75, 1), and the density levels for the NOC-SVM are 11 linearly spaced points from λ_1 = (1/N) max_i Σ_j K_{i,j} to λ_11 = 10^{-6}.
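A minimal sketch of these breakpoint choices (the helper names are ours):

```python
import numpy as np

def ncs_breakpoints(M=5):
    # evenly spaced cost asymmetries with gamma_1 = 0 and gamma_M = 1
    return np.linspace(0.0, 1.0, M)

def noc_breakpoints(K, M=11, lam_min=1e-6):
    # lambda_1 = (1/N) max_i sum_j K_ij down to a tiny positive floor
    lam_max = K.sum(axis=1).max() / K.shape[0]
    return np.linspace(lam_max, lam_min, M)   # decreasing: lambda_1 > ... > lambda_M
```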

B. Effects of Interpolation

Nested SVMs are trained on a finite number of cost asymmetries or density levels, and then the solution coefficients are linearly interpolated over a continuous range of parameters. Here we illustrate the effectiveness of the linear interpolation scheme of nested SVMs using the two-dimensional banana data set.

Consider two sets of cost asymmetries, γ = (0 : 0.25 : 1) and $\tilde{\gamma}$ = (0 : 0.1 : 1), with different numbers of breakpoints for the NCS-SVM. Let $\tilde{\alpha}^*_i(\gamma_m)$ denote the linearly interpolated solution at γ_m from the solution of the NCS-SVM trained with γ, and let $\alpha^*_i(\gamma_m)$ denote the solution from the NCS-SVM trained with $\tilde{\gamma}$. Fig. 5 compares these two sets of solution coefficients. The box plots in Fig. 5(a) show that the values of $\tilde{\alpha}^*_i(\gamma_m) - \alpha^*_i(\gamma_m)$ tend to be very small. Indeed, for most γ_m, the interquartile range on these box plots is not even visible. Regardless of these minor discrepancies, what is most important is that the resulting decision sets are almost indistinguishable, as illustrated in Fig. 5(c) and (e). Similar results can be observed for the NOC-SVM in Fig. 5(b), (d) and (f), where we consider two sets of density levels λ with 11 breakpoints and $\tilde{\lambda}$ with 16 breakpoints between λ_1 = (1/N) max_i Σ_j K_{i,j} and λ_M = 10^{-6}.

C. Computational complexity

According to Hastie et al. [10], the (non-nested) path following algorithm has O(N) breakpoints and complexity O(m²N + N²m), where m is the maximum number of points on the margin along the path. On the other hand, our nested SVMs have a controllable number of breakpoints M. To assess the complexity of the nested SVMs, we make a couple of assumptions based on experimental evidence. First, our experience has shown that the number of iterations of the decomposition algorithm is proportional to the number of data points N. Second, we assume that the subproblem, which has M variables, can be solved in O(M²) operations. Furthermore, each iteration of the decomposition algorithm also involves a variable selection step. This involves checking all variables for KKT condition violations (as detailed in the Appendices), and thus entails O(MN) operations. Thus, the computation time of nested SVMs is O(M²N + MN²). In Section VI-E, we experimentally compare the run times of the path following algorithms to our methods.

Fig. 5. Simulation results on the banana data set depicting the impact of interpolation on the coefficients and the final set estimates: (a), (b) coefficient differences over the cost asymmetries γ and the density levels λ; (c)-(f) the corresponding decision set estimates. See Section V-B for details.

VI. EXPERIMENTS AND RESULTS

In order to compare the algorithms described above, we experimented on 13 benchmark data sets available online [22]. A brief summary is provided in Fig. 6. Each feature is standardized with zero mean and unit variance. The first eleven data sets are randomly permuted 100 times (the last two are permuted 20 times) and divided into training and test sets. In all of our experiments, we used the Gaussian kernel

$$k(x, x') = \exp\Big(-\frac{\|x-x'\|^2}{2\sigma^2}\Big)$$

and searched for the bandwidth σ over 20 logarithmically spaced points from d_avg/15 to 10·d_avg, where d_avg is the average distance between training data points. This control parameter is selected via 5-fold cross-validation on the first 10 permutations; then the average of these values is used to train the remaining permutations.
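A minimal sketch of this bandwidth grid, assuming scipy is available (the helper name is ours):

```python
import numpy as np
from scipy.spatial.distance import pdist

def bandwidth_grid(X, n=20):
    # 20 log-spaced bandwidths from d_avg/15 to 10*d_avg,
    # where d_avg is the mean pairwise distance between training points
    d_avg = pdist(X).mean()
    return np.logspace(np.log10(d_avg / 15), np.log10(10 * d_avg), n)
```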

Fig. 6. Description of the data sets: banana, breast-cancer, diabetes, flare-solar, german, heart, ringnorm, thyroid, titanic, twonorm, waveform, image, and splice. dim is the number of features, and N_train and N_test are the numbers of training and test examples.

Each algorithm generates a family of decision functions and set estimates. From these sets, we construct an ROC and compute its area under the curve (AUC). We use the AUC averaged across permutations to compare the performance of the algorithms. As shown in Fig. 1, however, the set estimates from CS-SVMs or OC-SVMs are not properly nested, which causes ambiguity, particularly in ranking. In Section VI-C, we measure this violation of nesting by defining the ranking disagreement of two rank scoring functions. Then in Section VI-D, we combine this ranking disagreement with the AUC, and compare the algorithms over multiple data sets using the Wilcoxon signed ranks test, as suggested in [23].

A. Two-class Problems

CS-SVMs and NCS-SVMs are compared on two-class problems. For NCS-SVMs, we set M = 5 and solved (7) at the uniformly spaced cost asymmetries γ = (0, 0.25, 0.50, 0.75, 1). In two-class problems, we also searched for the regularization parameter λ over 10 logarithmically spaced points from 0.1 to λ_max, where

$$\lambda_{\max} = \max\Big(\max_i\sum_{j\in I_+}y_iy_jK_{i,j},\ \max_i\sum_{j\in I_-}y_iy_jK_{i,j}\Big).$$

Values of λ > λ_max do not produce different solutions in the CS-SVM (see Appendix C). We compared the described algorithms by constructing ROCs and computing their AUCs. The results are collected in Fig. 7.

Fig. 7. AUC values for the CS-SVM (CS) and NCS-SVM (NCS) in two-class problems, and the OC-SVM (OC) and NOC-SVM (NOC) in one-class problems, on each of the 13 benchmark data sets. In one-class problems, "Positive" indicates that the alternative hypotheses are from the positive class examples in the data sets, and "Uniform" indicates that the alternative hypotheses are from a uniform distribution.

B. One-class Problems

For the NOC-SVM, we selected 11 density levels spaced evenly from λ_1 = (1/N) max_i Σ_j K_{i,j} (see Appendix C) to λ_11 = 10^{-6}. Among the two classes available in each data set, we chose the negative class for training. Because the bandwidth selection step requires computing AUCs, we simulated an artificial second class from a uniform distribution. For evaluation of the trained decision functions, both the positive examples in the test sets and a new uniform sample were used as the alternative class. Fig. 7 reports the results for both cases (denoted by "Positive" and "Uniform", respectively).

Fig. 8 shows the AUC of the two algorithms over a range of σ. Throughout the experiments on one-class problems, we observed that the NOC-SVM is more robust to the kernel bandwidth selection than the OC-SVM. However, we did not observe similar results on two-class problems.

C. Ranking disagreement

The decision sets from the OC-SVM and the CS-SVM are not properly nested, as illustrated in Fig. 1. Since larger λ means a higher density level, the density level set estimate of the OC-SVM at larger λ is expected to be contained within the density level set estimate at smaller λ.

Fig. 8. The effect of the kernel bandwidth σ on the performance (AUC) for the breast-cancer data set. The AUC is evaluated when the alternative class is from the positive class in the data sets (a) and from a uniform distribution (b). The NOC-SVM is less sensitive to σ than the OC-SVM.

Likewise, larger γ in the CS-SVM penalizes misclassification of positive examples more; thus, its corresponding positive decision set should contain the decision set at smaller γ, and the two decision boundaries should not cross. This undesired behavior of the algorithms leads to non-unique ranking score functions. In the case of the CS-SVM, we can consider the following two ranking functions:

$$s_+(x) = 1 - \min_{\{\gamma\,:\,f_\gamma(x)\ge 0\}}\gamma, \qquad s_-(x) = 1 - \max_{\{\gamma\,:\,f_\gamma(x)\le 0\}}\gamma. \quad (21)$$

For the OC-SVM, we consider the next pair of ranking functions:

$$s_+(x) = \max_{\{\lambda\,:\,x\in\hat{G}_\lambda\}}\lambda, \qquad s_-(x) = \min_{\{\lambda\,:\,x\notin\hat{G}_\lambda\}}\lambda. \quad (22)$$

In words, s_+ ranks according to the first set containing a point x, and s_- ranks according to the last set not containing the point. In either case, it is easy to see s_+(x) ≥ s_-(x).

In order to quantify the disagreement of the two ranking functions, we define the following measure of ranking disagreement:

$$d(s_+, s_-) = \frac{1}{N}\sum_i\max_j 1_{\{(s_+(x_i)-s_+(x_j))(s_-(x_i)-s_-(x_j)) < 0\}},$$

which is the proportion of data points ambiguously ranked, i.e., ranked differently with respect to at least one other point. Then d(s_+, s_-) = 0 if and only if s_+ and s_- induce the same ranking.

With these ranking functions, Fig. 9 reports the ranking disagreements of the CS-SVM and OC-SVM. In the table, d_2 refers to the ranking disagreement of the CS-SVM, and d_p and d_u respectively refer to the ranking disagreement of the OC-SVM when the second class is from the positive samples and from an artificial uniform distribution. As can be seen in the table, for some data sets the violation of nesting causes severe differences between the above ranking functions.
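A minimal sketch of this disagreement measure, assuming the score vectors s_+ and s_- have already been computed for all N points (the helper name is ours):

```python
import numpy as np

def ranking_disagreement(s_plus, s_minus):
    dp = s_plus[:, None] - s_plus[None, :]    # pairwise differences under s+
    dm = s_minus[:, None] - s_minus[None, :]  # pairwise differences under s-
    ambiguous = (dp * dm < 0).any(axis=1)     # ranked differently w.r.t. some point
    return ambiguous.mean()                   # d(s+, s-)
```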

Fig. 9. The measure of disagreement, d_2(s_+, s_-), d_p(s_+, s_-), and d_u(s_+, s_-), of the two ranking functions from the CS-SVM and OC-SVM on each benchmark data set. The meaning of each subscript is explained in the text. s_+ and s_- are defined in (21) and (22).

D. Statistical comparison

We employ the statistical methodology of Demšar [23] to compare the algorithms across all data sets. Using the Wilcoxon signed ranks test, we compare the CS-SVM and the NCS-SVM for two-class problems, and the OC-SVM and the NOC-SVM for one-class problems. The Wilcoxon signed ranks test is a non-parametric method for testing the significance of differences between paired observations, and can be used to compare the performances of two algorithms over multiple data sets. The differences between the AUCs of the two algorithms are ranked ignoring the signs, and then the ranks of the positive and negative differences are summed.

Fig. 10 and Fig. 11 respectively report the comparison results of the algorithms for two-class problems and one-class problems. Here the numbers under NCS or NOC denote the sums of ranks of the data sets on which the nested SVMs performed better than the original SVMs; the values under CS or OC are for the opposite. T is the smaller of the two sums. For a confidence level of α = 0.01 and 13 data sets, the difference between algorithms is significant if T is less than or equal to 9 [24]. Therefore, no significant performance difference between the CS-SVM and the NCS-SVM was detected in the test. Likewise, no difference between the OC-SVM and the NOC-SVM was detected.

However, the AUC alone does not highlight the ranking disagreement of the algorithms. Therefore, we merge the AUC and the disorder measurement, and consider AUC − d(s_+, s_-) for algorithm comparison.
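A minimal sketch of this comparison, assuming scipy's implementation of the test and per-data-set scores for the two algorithms (names are illustrative):

```python
from scipy.stats import wilcoxon

def compare_algorithms(scores_a, scores_b, alpha=0.01):
    """scores_a, scores_b: per-data-set performance (e.g., AUC - d(s+, s-))."""
    stat, p = wilcoxon(scores_a, scores_b)   # stat is the smaller rank sum T
    return stat, p, p < alpha                # significant if p < alpha
```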

Fig. 10. Comparison of the AUCs of the two-class problem algorithms, CS-SVM (CS) and NCS-SVM (NCS), using the Wilcoxon signed ranks test (see the text for details). The test statistic T is greater than the critical value 9; hence no significant difference is detected in the test.

Fig. 11. Comparison of the OC-SVM (OC) and NOC-SVM (NOC). In the one-class problems, both cases of the alternative hypothesis are considered. Here no significant difference is detected.

Fig. 12 shows the results of the Wilcoxon signed ranks test using this combined performance measure. From the results, we can clearly observe the performance differences between the algorithms. Since the test statistic T is smaller than the critical value 9, the NCS-SVM outperforms the CS-SVM. Likewise, the performance difference between the OC-SVM and the NOC-SVM is also detected by the Wilcoxon test for both cases of the second class. Therefore, we can conclude that the nested algorithms perform better than their unnested counterparts.

E. Run time comparison

Fig. 13 shows the training times for each algorithm. The results for the CS-SVM and OC-SVM are based on our Matlab implementation of the solution path algorithms [8], [11], available at www.eecs.umich.edu/~cscott/code/svmpath.zip. We emphasize here that our decomposition algorithm relies on Matlab's quadprog function as the basic subproblem solver, and that this function is in no way optimized for our particular subproblem. A discussion of computational complexity was given in Section V-C.

Fig. 12. Comparison of the algorithms based on the AUC along with the ranking disagreement. Left: CS-SVM and NCS-SVM. Right: OC-SVM and NOC-SVM. T is less than the critical value 9; hence the nested SVMs outperform the original SVMs.

Fig. 13. Average training times (sec) for the CS-SVM, NCS-SVM, OC-SVM, and NOC-SVM on the benchmark data sets. This result is based on our implementation of the solution path algorithms for the CS-SVM and OC-SVM.

VII. PRIMAL OF NESTED SVMS

Although not essential for our approach, we can find a primal optimization problem of the NCS-SVM if we think of (7) as a dual problem:

$$\min_{\{w_m\},\{\xi_{i,m}\}}\ \sum_{m=1}^M\Big[\frac{\lambda}{2}\|w_m\|^2 + \gamma_m\sum_{i\in I_+}\xi_{i,m} + (1-\gamma_m)\sum_{i\in I_-}\xi_{i,m}\Big]$$
$$\text{s.t. } \Big\langle\sum_{k=m}^M w_k, \Phi(x_i)\Big\rangle \ge \sum_{k=m}^M(1-\xi_{i,k}),\quad i\in I_+,\ \forall m$$
$$\Big\langle\sum_{k=1}^m w_k, \Phi(x_i)\Big\rangle \le -\sum_{k=1}^m(1-\xi_{i,k}),\quad i\in I_-,\ \forall m$$
$$\xi_{i,m} \ge 0,\ \forall i, m.$$

The derivation of (7) from this primal can be found in [25]. Note that the above primal of the NCS-SVM reduces to the primal of the CS-SVM (1) when M = 1.

Likewise, the primal corresponding to the NOC-SVM is

$$\min_{\{w_m\},\{\xi_{i,m}\}}\ \sum_{m=1}^M\Big[\frac{\lambda_m}{2}\|w_m\|^2 + \frac{1}{N}\sum_i\xi_{i,m}\Big] \quad (23)$$
$$\text{s.t. } \Big\langle\sum_{k=m}^M\lambda_kw_k, \Phi(x_i)\Big\rangle \ge \sum_{k=m}^M\lambda_k(1-\xi_{i,k}),\ \forall i, m$$
$$\xi_{i,m} \ge 0,\ \forall i, m,$$

which also boils down to the primal of the OC-SVM (4) when M = 1.

With these formulations, we can see the geometric meaning of the w_m and ξ_{i,m}. For simplicity, consider (23) when M = 2:

$$\min_{\{w_m\},\{\xi_{i,m}\}}\ \frac{\lambda_2}{2}\|w_2\|^2 + \frac{1}{N}\sum_i\xi_{i,2} + \frac{\lambda_1}{2}\|w_1\|^2 + \frac{1}{N}\sum_i\xi_{i,1}$$
$$\text{s.t. } \langle\lambda_2w_2, \Phi(x_i)\rangle \ge \lambda_2(1-\xi_{i,2}),\ \forall i$$
$$\langle\lambda_2w_2 + \lambda_1w_1, \Phi(x_i)\rangle \ge \lambda_2(1-\xi_{i,2}) + \lambda_1(1-\xi_{i,1}),\ \forall i$$
$$\xi_{i,m} \ge 0,\ \forall i, m.$$

Here ξ_{i,1} > 0 when x_i lies between the hyperplane P_{(λ_2w_2+λ_1w_1)/(λ_2+λ_1)} and the origin, and ξ_{i,2} > 0 when the point lies between P_{w_2} and the origin, where we use P_w to denote {Φ(x) : ⟨w, Φ(x)⟩ = 1}, a hyperplane in H. Note that from the nesting structure, the hyperplane P_{(λ_2w_2+λ_1w_1)/(λ_2+λ_1)} is located between P_{w_1} and P_{w_2}. Then we can show that (λ_1ξ_{i,1} + λ_2ξ_{i,2})/‖λ_1w_1 + λ_2w_2‖ is the distance between the point Φ(x_i) and the hyperplane P_{(λ_2w_2+λ_1w_1)/(λ_2+λ_1)}.

VIII. CONCLUSION

In this paper, we introduced a novel framework for building a family of nested support vector machines for the tasks of cost-sensitive classification and density level set estimation. Our approach involves forming new quadratic programs inspired by the cost-sensitive and one-class SVMs, with additional constraints that enforce the nesting structure. Our construction generates a finite number of nested set estimates at a pre-selected set of parameter values, and linearly interpolates these sets to a continuous nested family. We also developed efficient algorithms to solve the proposed quadratic problems. Thus, the NCS-SVM yields a family of nested classifiers indexed by the cost asymmetry γ, and the NOC-SVM yields a family of nested density level set estimates indexed by the density level λ. Unlike the original SVMs, which are not nested, our methods can be readily applied to problems requiring multiple set estimation, including clustering, ranking, and anomaly detection.

In experimental evaluations, we found that non-nested SVMs can yield highly ambiguous rankings for many data sets, and that nested SVMs offer considerable improvements in this regard. Nested SVMs also exhibit greater stability with respect to model selection criteria such as cross-validation. In terms of area under the ROC (AUC), we found that enforcement of nesting appears to have a bigger impact on one-class problems. However, neither cost-sensitive nor one-class classification problems displayed significantly different AUC values between nested and non-nested methods.

Recently Clémençon and Vayatis [26] developed a method for bipartite ranking that also involves computing nested estimates of cost-sensitive classifiers at a finite grid of costs. Their set estimates are computed individually, and nesting is imposed subsequently through an explicit process of successive unions. These sets are then extended to a complete scoring function through piecewise constant interpolation. Their interest is primarily theoretical, as their estimates entail empirical risk minimization, and their results assume the underlying Bayes classifiers lie in a Vapnik-Chervonenkis class.

The statistical consistency of our nested SVMs is an interesting open question. Such a result would likely depend on the consistency of the original CS-SVM or OC-SVM at fixed values of γ or λ, respectively. We are unaware of consistency results for the CS-SVM at fixed γ [27]. However, consistency of the OC-SVM for fixed λ has been established [19]. Thus, suppose Ĝ_{λ_1}, ..., Ĝ_{λ_M} are (non-nested) OC-SVMs at a grid of points. Since these estimators are each consistent, and the true level sets they approximate are nested, it seems plausible that for a sufficiently large sample size, these OC-SVMs are also nested. In this case, they would be feasible for the NOC-SVM, which suggests that the NOC-SVM estimates the true level sets at least as well, asymptotically, as these estimates. Taking the grid of levels {λ_i} to be increasingly dense, the error of the interpolation scheme should also vanish. We leave it as future work to determine whether this intuition can be formalized.

APPENDIX A
DATA POINT SELECTION AND TERMINATION CONDITION OF THE NCS-SVM

On each round, the algorithm in Fig. 3 selects an example x_i, updates its corresponding variables {α_{i,m}}_{m=1}^M, and checks the termination condition. In this appendix, we employ the KKT conditions to derive an efficient variable selection strategy and a termination condition for the NCS-SVM.

We use the KKT conditions to find the necessary conditions for the optimal solution of (7). Before we proceed, we define α_{i,0} = 0 for i ∈ I_+ and α_{i,M+1} = 0 for i ∈ I_- for notational convenience.

The Lagrangian of the quadratic program is then

$$L(\alpha, u, v) = \sum_m\Big[\frac{1}{2\lambda}\sum_{i,j}\alpha_{i,m}\alpha_{j,m}y_iy_jK_{i,j} - \sum_i\alpha_{i,m}\Big] + \sum_m\sum_i u_{i,m}\big(\alpha_{i,m} - 1_{\{y_i<0\}} - y_i\gamma_m\big) + \sum_m\Big[\sum_{i\in I_+}v_{i,m}(\alpha_{i,m-1} - \alpha_{i,m}) + \sum_{i\in I_-}v_{i,m}(\alpha_{i,m+1} - \alpha_{i,m})\Big],$$

where u_{i,m} ≥ 0 and v_{i,m} ≥ 0 for ∀i, m. At the global minimum, the derivative of the Lagrangian with respect to α_{i,m} vanishes:

$$\frac{\partial L}{\partial\alpha_{i,m}} = y_if_{i,m} - 1 + u_{i,m} + \begin{cases} -v_{i,m} + v_{i,m+1}, & i\in I_+ \\ v_{i,m-1} - v_{i,m}, & i\in I_- \end{cases} = 0, \quad (24)$$

where, recall, f_{i,m} = (1/λ) Σ_j α_{j,m}y_jK_{i,j}, and we introduce the auxiliary variables v_{i,M+1} = 0 for i ∈ I_+ and v_{i,0} = 0 for i ∈ I_-. Then we obtain the following set of constraints from the KKT conditions:

$$y_if_{i,m} - 1 + u_{i,m} = \begin{cases} v_{i,m} - v_{i,m+1}, & i\in I_+ \\ v_{i,m} - v_{i,m-1}, & i\in I_- \end{cases} \quad (25)$$
$$0 \le \alpha_{i,m} \le 1_{\{y_i<0\}} + y_i\gamma_m,\ \forall i, m \quad (26)$$
$$y_i\alpha_{i,1} \le y_i\alpha_{i,2} \le \cdots \le y_i\alpha_{i,M},\ \forall i \quad (27)$$
$$u_{i,m}\big(\alpha_{i,m} - 1_{\{y_i<0\}} - y_i\gamma_m\big) = 0,\ \forall i, m \quad (28)$$
$$v_{i,m}(\alpha_{i,m-1} - \alpha_{i,m}) = 0,\ i\in I_+,\ \forall m \quad (29)$$
$$v_{i,m}(\alpha_{i,m+1} - \alpha_{i,m}) = 0,\ i\in I_-,\ \forall m \quad (30)$$
$$u_{i,m} \ge 0,\ v_{i,m} \ge 0,\ \forall i, m. \quad (31)$$

Since (7) is a convex program, the KKT conditions are also sufficient [21]. That is, any α_{i,m}, u_{i,m}, and v_{i,m} satisfying (25)-(31) are indeed optimal. Therefore, at the end of each iteration, we assess the current solution with these conditions and decide whether to stop or to continue. We evaluate the amount of error for x_i by defining

$$e_i = \sum_m\Big|\frac{\partial L}{\partial\alpha_{i,m}}\Big|,\ \forall i.$$

Fig. 14 (upper table, m = 1, 2, ..., M−1):

|                                    | α_{i,m−1} < α_{i,m}                                    | α_{i,m−1} = α_{i,m}
| α_{i,m} < min(γ_m, α_{i,m+1})      | u_{i,m} = 0, v_{i,m} = 0                                | u_{i,m} = 0, v_{i,m} = max(f_{i,m} − 1, 0)
| α_{i,m} = γ_m < α_{i,m+1}          | u_{i,m} = max(1 − f_{i,m}, 0), v_{i,m} = 0              | (cannot occur)
| α_{i,m} = α_{i,m+1} < γ_m          | u_{i,m} = 0, v_{i,m} = 0                                | u_{i,m} = 0, v_{i,m} = max(f_{i,m} − 1 + v_{i,m+1}, 0)
| α_{i,m} = α_{i,m+1} = γ_m          | u_{i,m} = max(1 − f_{i,m} − v_{i,m+1}, 0), v_{i,m} = 0  | (cannot occur)

Fig. 14 (lower table, m = M):

|                  | α_{i,M−1} < α_{i,M}                         | α_{i,M−1} = α_{i,M}
| α_{i,M} < γ_M    | u_{i,M} = 0, v_{i,M} = 0                     | u_{i,M} = 0, v_{i,M} = max(f_{i,M} − 1, 0)
| α_{i,M} = γ_M    | u_{i,M} = max(1 − f_{i,M}, 0), v_{i,M} = 0   | (cannot occur)

Fig. 14. The optimality conditions of the NCS-SVM when i ∈ I_+. (Upper: m = 1, 2, ..., M−1; lower: m = M.) Assuming the α_{i,m} are optimal, u_{i,m} and v_{i,m} are solved as above from the KKT conditions. Empty entries indicate cases that cannot occur.

Fig. 15 (upper table, m = 2, ..., M):

|                                      | α_{i,m+1} < α_{i,m}                                    | α_{i,m+1} = α_{i,m}
| α_{i,m} < min(1 − γ_m, α_{i,m−1})    | u_{i,m} = 0, v_{i,m} = 0                                | u_{i,m} = 0, v_{i,m} = max(−f_{i,m} − 1, 0)
| α_{i,m} = 1 − γ_m < α_{i,m−1}        | u_{i,m} = max(1 + f_{i,m}, 0), v_{i,m} = 0              | (cannot occur)
| α_{i,m} = α_{i,m−1} < 1 − γ_m        | u_{i,m} = 0, v_{i,m} = 0                                | u_{i,m} = 0, v_{i,m} = max(−f_{i,m} − 1 + v_{i,m−1}, 0)
| α_{i,m} = α_{i,m−1} = 1 − γ_m        | u_{i,m} = max(1 + f_{i,m} − v_{i,m−1}, 0), v_{i,m} = 0  | (cannot occur)

Fig. 15 (lower table, m = 1):

|                     | α_{i,2} < α_{i,1}                          | α_{i,2} = α_{i,1}
| α_{i,1} < 1 − γ_1   | u_{i,1} = 0, v_{i,1} = 0                    | u_{i,1} = 0, v_{i,1} = max(−f_{i,1} − 1, 0)
| α_{i,1} = 1 − γ_1   | u_{i,1} = max(1 + f_{i,1}, 0), v_{i,1} = 0  | (cannot occur)

Fig. 15. The optimality conditions of the NCS-SVM when i ∈ I_-. (Upper: m = 2, ..., M; lower: m = 1.)

An optimal solution makes these quantities zero. In practice, when their sum Σ_i e_i decreases below a predetermined tolerance, the algorithm stops and returns the current solution. If not, the algorithm chooses the example with the largest e_i and continues the loop.

Computing e_i involves the unknown variables u_{i,m} and v_{i,m} (see (24)), whereas f_{i,m} can be easily computed from the known variables α_{i,m}. Fig. 14 and Fig. 15 are for determining these u_{i,m} and v_{i,m}. These tables are obtained by first assuming the current solution α_{i,m} is optimal and then solving for u_{i,m} and v_{i,m} such that they satisfy the KKT conditions. Thus, depending on the value of α_{i,m} between its upper and lower bounds, u_{i,m} and v_{i,m} can be simply set as directed in the tables. For example, if i ∈ I_+, then we find u_{i,m} and v_{i,m} by referring to Fig. 14 iteratively from m = M down to m = 1. If i ∈ I_-, we use Fig. 15 and iterate from m = 1 up to m = M. Then the obtained e_i takes a non-zero value only when the assumption is false and the current solution is sub-optimal.

APPENDIX B
DATA POINT SELECTION AND TERMINATION CONDITION OF THE NOC-SVM

As in the NCS-SVM, we investigate the optimality conditions of the NOC-SVM (12) and find a data point selection method and a termination condition. With a slight modification, we rewrite (12) as

$$\min_{\alpha_1,\ldots,\alpha_M}\ \sum_{m=1}^M\Big[\frac{1}{2\lambda_m}\sum_{i,j}\alpha_{i,m}\alpha_{j,m}K_{i,j} - \sum_i\alpha_{i,m}\Big] \quad (32)$$
$$\text{s.t. } \alpha_{i,m} \le \frac{1}{N},\ \forall i, m$$
$$0 \le \frac{\alpha_{i,1}}{\lambda_1} \le \frac{\alpha_{i,2}}{\lambda_2} \le \cdots \le \frac{\alpha_{i,M}}{\lambda_M},\ \forall i.$$

We then use the KKT conditions to find the necessary conditions for the optimal solution of (32). The Lagrangian is

$$L(\alpha, u, v) = \sum_{m=1}^M\Big[\frac{1}{2\lambda_m}\sum_{i,j}\alpha_{i,m}\alpha_{j,m}K_{i,j} - \sum_i\alpha_{i,m}\Big] + \sum_{m=1}^M\sum_i u_{i,m}\Big(\alpha_{i,m} - \frac{1}{N}\Big) - \sum_i v_{i,1}\frac{\alpha_{i,1}}{\lambda_1} + \sum_{m=2}^M\sum_i v_{i,m}\Big(\frac{\alpha_{i,m-1}}{\lambda_{m-1}} - \frac{\alpha_{i,m}}{\lambda_m}\Big),$$

where u_{i,m} ≥ 0 and v_{i,m} ≥ 0 for ∀i, m. At the global minimum, the derivative of the Lagrangian with respect to α_{i,m} vanishes:

$$\frac{\partial L}{\partial\alpha_{i,m}} = f_{i,m} - 1 + u_{i,m} + \begin{cases} \dfrac{-v_{i,m} + v_{i,m+1}}{\lambda_m}, & m \neq M \\ -\dfrac{v_{i,M}}{\lambda_M}, & m = M \end{cases} = 0, \quad (33)$$

where, recall, f_{i,m} = (1/λ_m) Σ_j α_{j,m}K_{i,j}. Then, from the KKT conditions, we obtain the following set of constraints for x_i:

$$f_{i,m} - 1 + u_{i,m} = \begin{cases} \dfrac{v_{i,m} - v_{i,m+1}}{\lambda_m}, & m \neq M \\ \dfrac{v_{i,M}}{\lambda_M}, & m = M \end{cases} \quad (34)$$
$$\alpha_{i,m} \le \frac{1}{N},\ \forall m \quad (35)$$
$$0 \le \frac{\alpha_{i,1}}{\lambda_1} \le \frac{\alpha_{i,2}}{\lambda_2} \le \cdots \le \frac{\alpha_{i,M}}{\lambda_M} \quad (36)$$
$$u_{i,m}\Big(\alpha_{i,m} - \frac{1}{N}\Big) = 0,\ \forall m \quad (37)$$
$$v_{i,m}\Big(\frac{\alpha_{i,m-1}}{\lambda_{m-1}} - \frac{\alpha_{i,m}}{\lambda_m}\Big) = 0,\ \forall m \quad (38)$$
$$u_{i,m} \ge 0,\ v_{i,m} \ge 0,\ \forall m. \quad (39)$$

Since (32) is a convex program, the KKT conditions are sufficient [21]. That is, α_{i,m}, u_{i,m}, and v_{i,m} satisfying (34)-(39) are indeed optimal. Therefore, at the end of each iteration, we assess the current solution with these conditions and decide whether to stop or to continue. We evaluate the amount of error for x_i by defining e_i = Σ_m |∂L/∂α_{i,m}|, ∀i.

An optimal solution makes these quantities zero. In practice, when their sum Σ_i e_i decreases below a predetermined tolerance, the algorithm stops and returns the current solution. If not, the algorithm chooses the example with the largest e_i and continues the loop.

Computing e_i involves the unknown variables u_{i,m} and v_{i,m} (see (33)), whereas f_{i,m} can be easily computed from the known variables α_{i,m}. Fig. 16 is for determining these u_{i,m} and v_{i,m}. The table is obtained by first assuming the current solution α_{i,m} is optimal and then solving for u_{i,m} and v_{i,m} such that they satisfy the KKT conditions. Thus, depending on the value of α_{i,m} between its upper and lower bounds, u_{i,m} and v_{i,m} can be simply set by referring to Fig. 16 iteratively from m = M down to m = 1. Then the obtained e_i takes a non-zero value only when the assumption is false and the current solution is not optimal.

Fig. 16 (upper table, m = 1, 2, ..., M−1):

|                                              | (λ_m/λ_{m−1})α_{i,m−1} < α_{i,m}                             | (λ_m/λ_{m−1})α_{i,m−1} = α_{i,m}
| α_{i,m} < min(1/N, (λ_m/λ_{m+1})α_{i,m+1})   | u_{i,m} = 0, v_{i,m} = 0                                      | u_{i,m} = 0, v_{i,m} = max(λ_m(f_{i,m} − 1), 0)
| α_{i,m} = 1/N < (λ_m/λ_{m+1})α_{i,m+1}       | u_{i,m} = max(1 − f_{i,m}, 0), v_{i,m} = 0                    | (cannot occur)
| α_{i,m} = (λ_m/λ_{m+1})α_{i,m+1} < 1/N       | u_{i,m} = 0, v_{i,m} = 0                                      | u_{i,m} = 0, v_{i,m} = max(λ_m(f_{i,m} − 1 + v_{i,m+1}/λ_m), 0)
| α_{i,m} = (λ_m/λ_{m+1})α_{i,m+1} = 1/N       | u_{i,m} = max(1 − f_{i,m} − v_{i,m+1}/λ_m, 0), v_{i,m} = 0    | (cannot occur)

Fig. 16 (lower table, m = M):

|                  | (λ_M/λ_{M−1})α_{i,M−1} < α_{i,M}            | (λ_M/λ_{M−1})α_{i,M−1} = α_{i,M}
| α_{i,M} < 1/N    | u_{i,M} = 0, v_{i,M} = 0                     | u_{i,M} = 0, v_{i,M} = max(λ_M(f_{i,M} − 1), 0)
| α_{i,M} = 1/N    | u_{i,M} = max(1 − f_{i,M}, 0), v_{i,M} = 0   | (cannot occur)

Fig. 16. The optimality conditions of the NOC-SVM. (Upper: m = 1, 2, ..., M−1; lower: m = M.) Empty entries indicate cases that cannot occur.

APPENDIX C
MAXIMUM VALUE OF λ FOR THE CS-SVM AND OC-SVM

In this appendix, we find the values of the regularization parameter λ above which the OC-SVM or CS-SVM generates the same solution. First, we consider the OC-SVM. The decision function of the OC-SVM is f_λ(x) = (1/λ) Σ_j α_j k(x_j, x), and f_λ(x) = 1 forms the margin. For sufficiently large λ, every data point x_i falls inside the margin (f_λ(x_i) ≤ 1). Since the KKT optimality conditions of (4) imply α_i = 1/N for the data points such that f_λ(x_i) < 1, we obtain λ ≥ (1/N) Σ_j K_{i,j} for ∀i. Therefore, if the maximum row sum of the kernel matrix is denoted λ_OC = (1/N) max_i Σ_j K_{i,j}, then for any λ ≥ λ_OC the optimal solution of the OC-SVM becomes α_i = 1/N for ∀i.
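A minimal sketch of both thresholds, λ_OC from the paragraph above and λ_CS(γ) derived next (the helper names are ours):

```python
import numpy as np

def lambda_oc(K):
    # lambda_OC = (1/N) max_i sum_j K_ij
    return K.sum(axis=1).max() / K.shape[0]

def lambda_cs(K, y, gamma):
    # lambda_CS(gamma) = max_i [ gamma * sum_{j in I+} y_i y_j K_ij
    #                            + (1 - gamma) * sum_{j in I-} y_i y_j K_ij ]
    Q = y[:, None] * y[None, :] * K
    s = gamma * Q[:, y > 0].sum(axis=1) + (1 - gamma) * Q[:, y < 0].sum(axis=1)
    return s.max()
```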

Next, we consider the regularization parameter λ in the formulation (1) of the CS-SVM. The decision function of the CS-SVM is f_γ(x) = (1/λ) Σ_j α_j y_j k(x_j, x), and the margin is y f_γ(x) = 1. Thus, if λ is sufficiently large, all the data points are inside the margin and satisfy y_i f_γ(x_i) ≤ 1. Then λ ≥ γ Σ_{j∈I_+} y_iy_jK_{i,j} + (1 − γ) Σ_{j∈I_-} y_iy_jK_{i,j} for ∀i, because α_i = 1_{\{y_i<0\}} + y_iγ for all the data points such that y_i f_γ(x_i) < 1 from the KKT conditions. For a given γ, let

$$\lambda_{CS}(\gamma) = \max_i\Big[\gamma\sum_{j\in I_+}y_iy_jK_{i,j} + (1-\gamma)\sum_{j\in I_-}y_iy_jK_{i,j}\Big].$$

Then for λ > λ_CS(γ), the solution of the CS-SVM becomes α_i = 1_{\{y_i<0\}} + y_iγ for ∀i. Therefore, since λ_CS(γ) ≤ (1 − γ)λ_CS(0) + γλ_CS(1) for all γ ∈ [0, 1], values of λ > max(λ_CS(0), λ_CS(1)) generate the same solutions in the CS-SVM.

REFERENCES

[1] J. A. Hartigan, "Consistency of single linkage for high-density clusters," J. of the American Stat. Association, vol. 76, 1981.
[2] R. Liu, J. Parelius, and K. Singh, "Multivariate analysis by data depth: descriptive statistics, graphics and inference," Annals of Statistics, vol. 27, 1999.
[3] C. Scott and R. Nowak, "Learning minimum volume sets," Journal of Machine Learning Research, vol. 7, 2006.
[4] C. Scott and E. D. Kolaczyk, "Annotated minimum volume sets for nonparametric anomaly discovery," in IEEE Workshop on Statistical Signal Processing, 2007.
[5] R. Herbrich, T. Graepel, and K. Obermayer, "Large margin rank boundaries for ordinal regression," Advances in Large Margin Classifiers, 2000.
[6] C. Scott and R. Nowak, "A Neyman-Pearson approach to statistical learning," IEEE Trans. Inf. Theory, vol. 51, 2005.
[7] C. Scott and G. Blanchard, "Novelty detection: Unlabeled data definitely help," Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, vol. 5, 2009.
[8] F. R. Bach, D. Heckerman, and E. Horvitz, "Considering cost asymmetry in learning classifiers," Journal of Machine Learning Research, vol. 7, 2006.
[9] B. Schölkopf and A. Smola, Learning with Kernels. Cambridge, MA: MIT Press, 2002.
[10] T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu, "The entire regularization path for the support vector machine," Journal of Machine Learning Research, vol. 5, 2004.
[11] G. Lee and C. Scott, "The one class support vector machine solution path," in IEEE Intl. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), vol. 2, 2007, pp. II-521 to II-524.
[12] G. Lee and C. Scott, "Nested support vector machines," in IEEE Intl. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2008.
[13] S. Agarwal, T. Graepel, R. Herbrich, S. Har-Peled, and D. Roth, "Generalization bounds for the area under the ROC curve," Journal of Machine Learning Research, vol. 6, 2005.
[14] W. Stuetzle, "Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample," Journal of Classification, vol. 20, no. 5, 2003.
[15] V. Kecman, Learning and Soft Computing: Support Vector Machines, Neural Networks, and Fuzzy Logic Models. Cambridge, MA: MIT Press, 2001.


More information

BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET

BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET 1 BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET TZU-CHENG CHUANG School of Electrcal and Computer Engneerng, Purdue Unversty, West Lafayette, Indana 47907 SAUL B. GELFAND School

More information

NAG Fortran Library Chapter Introduction. G10 Smoothing in Statistics

NAG Fortran Library Chapter Introduction. G10 Smoothing in Statistics Introducton G10 NAG Fortran Lbrary Chapter Introducton G10 Smoothng n Statstcs Contents 1 Scope of the Chapter... 2 2 Background to the Problems... 2 2.1 Smoothng Methods... 2 2.2 Smoothng Splnes and Regresson

More information

Taxonomy of Large Margin Principle Algorithms for Ordinal Regression Problems

Taxonomy of Large Margin Principle Algorithms for Ordinal Regression Problems Taxonomy of Large Margn Prncple Algorthms for Ordnal Regresson Problems Amnon Shashua Computer Scence Department Stanford Unversty Stanford, CA 94305 emal: shashua@cs.stanford.edu Anat Levn School of Computer

More information

Cluster Analysis of Electrical Behavior

Cluster Analysis of Electrical Behavior Journal of Computer and Communcatons, 205, 3, 88-93 Publshed Onlne May 205 n ScRes. http://www.scrp.org/ournal/cc http://dx.do.org/0.4236/cc.205.350 Cluster Analyss of Electrcal Behavor Ln Lu Ln Lu, School

More information

A Binarization Algorithm specialized on Document Images and Photos

A Binarization Algorithm specialized on Document Images and Photos A Bnarzaton Algorthm specalzed on Document mages and Photos Ergna Kavalleratou Dept. of nformaton and Communcaton Systems Engneerng Unversty of the Aegean kavalleratou@aegean.gr Abstract n ths paper, a

More information

Machine Learning: Algorithms and Applications

Machine Learning: Algorithms and Applications 14/05/1 Machne Learnng: Algorthms and Applcatons Florano Zn Free Unversty of Bozen-Bolzano Faculty of Computer Scence Academc Year 011-01 Lecture 10: 14 May 01 Unsupervsed Learnng cont Sldes courtesy of

More information

Feature Reduction and Selection

Feature Reduction and Selection Feature Reducton and Selecton Dr. Shuang LIANG School of Software Engneerng TongJ Unversty Fall, 2012 Today s Topcs Introducton Problems of Dmensonalty Feature Reducton Statstc methods Prncpal Components

More information

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr) Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute

More information

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1 4/14/011 Outlne Dscrmnatve classfers for mage recognton Wednesday, Aprl 13 Krsten Grauman UT-Austn Last tme: wndow-based generc obect detecton basc ppelne face detecton wth boostng as case study Today:

More information

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz Compler Desgn Sprng 2014 Regster Allocaton Sample Exercses and Solutons Prof. Pedro C. Dnz USC / Informaton Scences Insttute 4676 Admralty Way, Sute 1001 Marna del Rey, Calforna 90292 pedro@s.edu Regster

More information

Announcements. Supervised Learning

Announcements. Supervised Learning Announcements See Chapter 5 of Duda, Hart, and Stork. Tutoral by Burge lnked to on web page. Supervsed Learnng Classfcaton wth labeled eamples. Images vectors n hgh-d space. Supervsed Learnng Labeled eamples

More information

TN348: Openlab Module - Colocalization

TN348: Openlab Module - Colocalization TN348: Openlab Module - Colocalzaton Topc The Colocalzaton module provdes the faclty to vsualze and quantfy colocalzaton between pars of mages. The Colocalzaton wndow contans a prevew of the two mages

More information

A Robust Method for Estimating the Fundamental Matrix

A Robust Method for Estimating the Fundamental Matrix Proc. VIIth Dgtal Image Computng: Technques and Applcatons, Sun C., Talbot H., Ourseln S. and Adraansen T. (Eds.), 0- Dec. 003, Sydney A Robust Method for Estmatng the Fundamental Matrx C.L. Feng and Y.S.

More information

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Proceedngs of the Wnter Smulaton Conference M E Kuhl, N M Steger, F B Armstrong, and J A Jones, eds A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Mark W Brantley Chun-Hung

More information

Intra-Parametric Analysis of a Fuzzy MOLP

Intra-Parametric Analysis of a Fuzzy MOLP Intra-Parametrc Analyss of a Fuzzy MOLP a MIAO-LING WANG a Department of Industral Engneerng and Management a Mnghsn Insttute of Technology and Hsnchu Tawan, ROC b HSIAO-FAN WANG b Insttute of Industral

More information

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique //00 :0 AM Outlne and Readng The Greedy Method The Greedy Method Technque (secton.) Fractonal Knapsack Problem (secton..) Task Schedulng (secton..) Mnmum Spannng Trees (secton.) Change Money Problem Greedy

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervsed Learnng and Clusterng Why consder unlabeled samples?. Collectng and labelng large set of samples s costly Gettng recorded speech s free, labelng s tme consumng 2. Classfer could be desgned

More information

EXTENDED BIC CRITERION FOR MODEL SELECTION

EXTENDED BIC CRITERION FOR MODEL SELECTION IDIAP RESEARCH REPORT EXTEDED BIC CRITERIO FOR ODEL SELECTIO Itshak Lapdot Andrew orrs IDIAP-RR-0-4 Dalle olle Insttute for Perceptual Artfcal Intellgence P.O.Box 59 artgny Valas Swtzerland phone +4 7

More information

Collaboratively Regularized Nearest Points for Set Based Recognition

Collaboratively Regularized Nearest Points for Set Based Recognition Academc Center for Computng and Meda Studes, Kyoto Unversty Collaboratvely Regularzed Nearest Ponts for Set Based Recognton Yang Wu, Mchhko Mnoh, Masayuk Mukunok Kyoto Unversty 9/1/013 BMVC 013 @ Brstol,

More information

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers IOSR Journal of Electroncs and Communcaton Engneerng (IOSR-JECE) e-issn: 78-834,p- ISSN: 78-8735.Volume 9, Issue, Ver. IV (Mar - Apr. 04), PP 0-07 Content Based Image Retreval Usng -D Dscrete Wavelet wth

More information

An Optimal Algorithm for Prufer Codes *

An Optimal Algorithm for Prufer Codes * J. Software Engneerng & Applcatons, 2009, 2: 111-115 do:10.4236/jsea.2009.22016 Publshed Onlne July 2009 (www.scrp.org/journal/jsea) An Optmal Algorthm for Prufer Codes * Xaodong Wang 1, 2, Le Wang 3,

More information

Lecture 5: Multilayer Perceptrons

Lecture 5: Multilayer Perceptrons Lecture 5: Multlayer Perceptrons Roger Grosse 1 Introducton So far, we ve only talked about lnear models: lnear regresson and lnear bnary classfers. We noted that there are functons that can t be represented

More information

GSLM Operations Research II Fall 13/14

GSLM Operations Research II Fall 13/14 GSLM 58 Operatons Research II Fall /4 6. Separable Programmng Consder a general NLP mn f(x) s.t. g j (x) b j j =. m. Defnton 6.. The NLP s a separable program f ts objectve functon and all constrants are

More information

Abstract Ths paper ponts out an mportant source of necency n Smola and Scholkopf's Sequental Mnmal Optmzaton (SMO) algorthm for SVM regresson that s c

Abstract Ths paper ponts out an mportant source of necency n Smola and Scholkopf's Sequental Mnmal Optmzaton (SMO) algorthm for SVM regresson that s c Improvements to SMO Algorthm for SVM Regresson 1 S.K. Shevade S.S. Keerth C. Bhattacharyya & K.R.K. Murthy shrsh@csa.sc.ernet.n mpessk@guppy.mpe.nus.edu.sg cbchru@csa.sc.ernet.n murthy@csa.sc.ernet.n 1

More information

Data Mining: Model Evaluation

Data Mining: Model Evaluation Data Mnng: Model Evaluaton Aprl 16, 2013 1 Issues: Evaluatng Classfcaton Methods Accurac classfer accurac: predctng class label predctor accurac: guessng value of predcted attrbutes Speed tme to construct

More information

An Entropy-Based Approach to Integrated Information Needs Assessment

An Entropy-Based Approach to Integrated Information Needs Assessment Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology

More information

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision SLAM Summer School 2006 Practcal 2: SLAM usng Monocular Vson Javer Cvera, Unversty of Zaragoza Andrew J. Davson, Imperal College London J.M.M Montel, Unversty of Zaragoza. josemar@unzar.es, jcvera@unzar.es,

More information

Biostatistics 615/815

Biostatistics 615/815 The E-M Algorthm Bostatstcs 615/815 Lecture 17 Last Lecture: The Smplex Method General method for optmzaton Makes few assumptons about functon Crawls towards mnmum Some recommendatons Multple startng ponts

More information

Solving two-person zero-sum game by Matlab

Solving two-person zero-sum game by Matlab Appled Mechancs and Materals Onlne: 2011-02-02 ISSN: 1662-7482, Vols. 50-51, pp 262-265 do:10.4028/www.scentfc.net/amm.50-51.262 2011 Trans Tech Publcatons, Swtzerland Solvng two-person zero-sum game by

More information

Active Contours/Snakes

Active Contours/Snakes Actve Contours/Snakes Erkut Erdem Acknowledgement: The sldes are adapted from the sldes prepared by K. Grauman of Unversty of Texas at Austn Fttng: Edges vs. boundares Edges useful sgnal to ndcate occludng

More information

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009.

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009. Farrukh Jabeen Algorthms 51 Assgnment #2 Due Date: June 15, 29. Assgnment # 2 Chapter 3 Dscrete Fourer Transforms Implement the FFT for the DFT. Descrbed n sectons 3.1 and 3.2. Delverables: 1. Concse descrpton

More information

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 An Iteratve Soluton Approach to Process Plant Layout usng Mxed

More information

Optimizing Document Scoring for Query Retrieval

Optimizing Document Scoring for Query Retrieval Optmzng Document Scorng for Query Retreval Brent Ellwen baellwe@cs.stanford.edu Abstract The goal of ths project was to automate the process of tunng a document query engne. Specfcally, I used machne learnng

More information

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance Tsnghua Unversty at TAC 2009: Summarzng Mult-documents by Informaton Dstance Chong Long, Mnle Huang, Xaoyan Zhu State Key Laboratory of Intellgent Technology and Systems, Tsnghua Natonal Laboratory for

More information

Reducing Frame Rate for Object Tracking

Reducing Frame Rate for Object Tracking Reducng Frame Rate for Object Trackng Pavel Korshunov 1 and We Tsang Oo 2 1 Natonal Unversty of Sngapore, Sngapore 11977, pavelkor@comp.nus.edu.sg 2 Natonal Unversty of Sngapore, Sngapore 11977, oowt@comp.nus.edu.sg

More information

Programming in Fortran 90 : 2017/2018

Programming in Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Exercse 1 : Evaluaton of functon dependng on nput Wrte a program who evaluate the functon f (x,y) for any two user specfed values

More information

Classifier Selection Based on Data Complexity Measures *

Classifier Selection Based on Data Complexity Measures * Classfer Selecton Based on Data Complexty Measures * Edth Hernández-Reyes, J.A. Carrasco-Ochoa, and J.Fco. Martínez-Trndad Natonal Insttute for Astrophyscs, Optcs and Electroncs, Lus Enrque Erro No.1 Sta.

More information

Meta-heuristics for Multidimensional Knapsack Problems

Meta-heuristics for Multidimensional Knapsack Problems 2012 4th Internatonal Conference on Computer Research and Development IPCSIT vol.39 (2012) (2012) IACSIT Press, Sngapore Meta-heurstcs for Multdmensonal Knapsack Problems Zhbao Man + Computer Scence Department,

More information

A Statistical Model Selection Strategy Applied to Neural Networks

A Statistical Model Selection Strategy Applied to Neural Networks A Statstcal Model Selecton Strategy Appled to Neural Networks Joaquín Pzarro Elsa Guerrero Pedro L. Galndo joaqun.pzarro@uca.es elsa.guerrero@uca.es pedro.galndo@uca.es Dpto Lenguajes y Sstemas Informátcos

More information

Discriminative Dictionary Learning with Pairwise Constraints

Discriminative Dictionary Learning with Pairwise Constraints Dscrmnatve Dctonary Learnng wth Parwse Constrants Humn Guo Zhuoln Jang LARRY S. DAVIS UNIVERSITY OF MARYLAND Nov. 6 th, Outlne Introducton/motvaton Dctonary Learnng Dscrmnatve Dctonary Learnng wth Parwse

More information

Machine Learning 9. week

Machine Learning 9. week Machne Learnng 9. week Mappng Concept Radal Bass Functons (RBF) RBF Networks 1 Mappng It s probably the best scenaro for the classfcaton of two dataset s to separate them lnearly. As you see n the below

More information

CLASSIFICATION OF ULTRASONIC SIGNALS

CLASSIFICATION OF ULTRASONIC SIGNALS The 8 th Internatonal Conference of the Slovenan Socety for Non-Destructve Testng»Applcaton of Contemporary Non-Destructve Testng n Engneerng«September -3, 5, Portorož, Slovena, pp. 7-33 CLASSIFICATION

More information

LECTURE NOTES Duality Theory, Sensitivity Analysis, and Parametric Programming

LECTURE NOTES Duality Theory, Sensitivity Analysis, and Parametric Programming CEE 60 Davd Rosenberg p. LECTURE NOTES Dualty Theory, Senstvty Analyss, and Parametrc Programmng Learnng Objectves. Revew the prmal LP model formulaton 2. Formulate the Dual Problem of an LP problem (TUES)

More information

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task Proceedngs of NTCIR-6 Workshop Meetng, May 15-18, 2007, Tokyo, Japan Term Weghtng Classfcaton System Usng the Ch-square Statstc for the Classfcaton Subtask at NTCIR-6 Patent Retreval Task Kotaro Hashmoto

More information

Machine Learning. K-means Algorithm

Machine Learning. K-means Algorithm Macne Learnng CS 6375 --- Sprng 2015 Gaussan Mture Model GMM pectaton Mamzaton M Acknowledgement: some sldes adopted from Crstoper Bsop Vncent Ng. 1 K-means Algortm Specal case of M Goal: represent a data

More information

TPL-Aware Displacement-driven Detailed Placement Refinement with Coloring Constraints

TPL-Aware Displacement-driven Detailed Placement Refinement with Coloring Constraints TPL-ware Dsplacement-drven Detaled Placement Refnement wth Colorng Constrants Tao Ln Iowa State Unversty tln@astate.edu Chrs Chu Iowa State Unversty cnchu@astate.edu BSTRCT To mnmze the effect of process

More information

Backpropagation: In Search of Performance Parameters

Backpropagation: In Search of Performance Parameters Bacpropagaton: In Search of Performance Parameters ANIL KUMAR ENUMULAPALLY, LINGGUO BU, and KHOSROW KAIKHAH, Ph.D. Computer Scence Department Texas State Unversty-San Marcos San Marcos, TX-78666 USA ae049@txstate.edu,

More information

LECTURE : MANIFOLD LEARNING

LECTURE : MANIFOLD LEARNING LECTURE : MANIFOLD LEARNING Rta Osadchy Some sldes are due to L.Saul, V. C. Raykar, N. Verma Topcs PCA MDS IsoMap LLE EgenMaps Done! Dmensonalty Reducton Data representaton Inputs are real-valued vectors

More information

2x x l. Module 3: Element Properties Lecture 4: Lagrange and Serendipity Elements

2x x l. Module 3: Element Properties Lecture 4: Lagrange and Serendipity Elements Module 3: Element Propertes Lecture : Lagrange and Serendpty Elements 5 In last lecture note, the nterpolaton functons are derved on the bass of assumed polynomal from Pascal s trangle for the fled varable.

More information

BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION

BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION SHI-LIANG SUN, HONG-LEI SHI Department of Computer Scence and Technology, East Chna Normal Unversty 500 Dongchuan Road, Shangha 200241, P. R. Chna E-MAIL: slsun@cs.ecnu.edu.cn,

More information

Discriminative classifiers for object classification. Last time

Discriminative classifiers for object classification. Last time Dscrmnatve classfers for object classfcaton Thursday, Nov 12 Krsten Grauman UT Austn Last tme Supervsed classfcaton Loss and rsk, kbayes rule Skn color detecton example Sldng ndo detecton Classfers, boostng

More information

Synthesizer 1.0. User s Guide. A Varying Coefficient Meta. nalytic Tool. Z. Krizan Employing Microsoft Excel 2007

Synthesizer 1.0. User s Guide. A Varying Coefficient Meta. nalytic Tool. Z. Krizan Employing Microsoft Excel 2007 Syntheszer 1.0 A Varyng Coeffcent Meta Meta-Analytc nalytc Tool Employng Mcrosoft Excel 007.38.17.5 User s Gude Z. Krzan 009 Table of Contents 1. Introducton and Acknowledgments 3. Operatonal Functons

More information

R s s f. m y s. SPH3UW Unit 7.3 Spherical Concave Mirrors Page 1 of 12. Notes

R s s f. m y s. SPH3UW Unit 7.3 Spherical Concave Mirrors Page 1 of 12. Notes SPH3UW Unt 7.3 Sphercal Concave Mrrors Page 1 of 1 Notes Physcs Tool box Concave Mrror If the reflectng surface takes place on the nner surface of the sphercal shape so that the centre of the mrror bulges

More information

Fitting: Deformable contours April 26 th, 2018

Fitting: Deformable contours April 26 th, 2018 4/6/08 Fttng: Deformable contours Aprl 6 th, 08 Yong Jae Lee UC Davs Recap so far: Groupng and Fttng Goal: move from array of pxel values (or flter outputs) to a collecton of regons, objects, and shapes.

More information

Optimal Workload-based Weighted Wavelet Synopses

Optimal Workload-based Weighted Wavelet Synopses Optmal Workload-based Weghted Wavelet Synopses Yoss Matas School of Computer Scence Tel Avv Unversty Tel Avv 69978, Israel matas@tau.ac.l Danel Urel School of Computer Scence Tel Avv Unversty Tel Avv 69978,

More information

Parallel Numerics. 1 Preconditioning & Iterative Solvers (From 2016)

Parallel Numerics. 1 Preconditioning & Iterative Solvers (From 2016) Technsche Unverstät München WSe 6/7 Insttut für Informatk Prof. Dr. Thomas Huckle Dpl.-Math. Benjamn Uekermann Parallel Numercs Exercse : Prevous Exam Questons Precondtonng & Iteratve Solvers (From 6)

More information

Sum of Linear and Fractional Multiobjective Programming Problem under Fuzzy Rules Constraints

Sum of Linear and Fractional Multiobjective Programming Problem under Fuzzy Rules Constraints Australan Journal of Basc and Appled Scences, 2(4): 1204-1208, 2008 ISSN 1991-8178 Sum of Lnear and Fractonal Multobjectve Programmng Problem under Fuzzy Rules Constrants 1 2 Sanjay Jan and Kalash Lachhwan

More information

Wishing you all a Total Quality New Year!

Wishing you all a Total Quality New Year! Total Qualty Management and Sx Sgma Post Graduate Program 214-15 Sesson 4 Vnay Kumar Kalakband Assstant Professor Operatons & Systems Area 1 Wshng you all a Total Qualty New Year! Hope you acheve Sx sgma

More information

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration Improvement of Spatal Resoluton Usng BlockMatchng Based Moton Estmaton and Frame Integraton Danya Suga and Takayuk Hamamoto Graduate School of Engneerng, Tokyo Unversty of Scence, 6-3-1, Nuku, Katsuska-ku,

More information

A Deflected Grid-based Algorithm for Clustering Analysis

A Deflected Grid-based Algorithm for Clustering Analysis A Deflected Grd-based Algorthm for Clusterng Analyss NANCY P. LIN, CHUNG-I CHANG, HAO-EN CHUEH, HUNG-JEN CHEN, WEI-HUA HAO Department of Computer Scence and Informaton Engneerng Tamkang Unversty 5 Yng-chuan

More information

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS ARPN Journal of Engneerng and Appled Scences 006-017 Asan Research Publshng Network (ARPN). All rghts reserved. NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS Igor Grgoryev, Svetlana

More information

Relevance Feedback Document Retrieval using Non-Relevant Documents

Relevance Feedback Document Retrieval using Non-Relevant Documents Relevance Feedback Document Retreval usng Non-Relevant Documents TAKASHI ONODA, HIROSHI MURATA and SEIJI YAMADA Ths paper reports a new document retreval method usng non-relevant documents. From a large

More information

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search Sequental search Buldng Java Programs Chapter 13 Searchng and Sortng sequental search: Locates a target value n an array/lst by examnng each element from start to fnsh. How many elements wll t need to

More information

Review of approximation techniques

Review of approximation techniques CHAPTER 2 Revew of appromaton technques 2. Introducton Optmzaton problems n engneerng desgn are characterzed by the followng assocated features: the objectve functon and constrants are mplct functons evaluated

More information

K-means and Hierarchical Clustering

K-means and Hierarchical Clustering Note to other teachers and users of these sldes. Andrew would be delghted f you found ths source materal useful n gvng your own lectures. Feel free to use these sldes verbatm, or to modfy them to ft your

More information

Range images. Range image registration. Examples of sampling patterns. Range images and range surfaces

Range images. Range image registration. Examples of sampling patterns. Range images and range surfaces Range mages For many structured lght scanners, the range data forms a hghly regular pattern known as a range mage. he samplng pattern s determned by the specfc scanner. Range mage regstraton 1 Examples

More information

Automatic selection of reference velocities for recursive depth migration

Automatic selection of reference velocities for recursive depth migration Automatc selecton of mgraton veloctes Automatc selecton of reference veloctes for recursve depth mgraton Hugh D. Geger and Gary F. Margrave ABSTRACT Wave equaton depth mgraton methods such as phase-shft

More information

Chapter 6 Programmng the fnte element method Inow turn to the man subject of ths book: The mplementaton of the fnte element algorthm n computer programs. In order to make my dscusson as straghtforward

More information

Lobachevsky State University of Nizhni Novgorod. Polyhedron. Quick Start Guide

Lobachevsky State University of Nizhni Novgorod. Polyhedron. Quick Start Guide Lobachevsky State Unversty of Nzhn Novgorod Polyhedron Quck Start Gude Nzhn Novgorod 2016 Contents Specfcaton of Polyhedron software... 3 Theoretcal background... 4 1. Interface of Polyhedron... 6 1.1.

More information

Solutions to Programming Assignment Five Interpolation and Numerical Differentiation

Solutions to Programming Assignment Five Interpolation and Numerical Differentiation College of Engneerng and Coputer Scence Mechancal Engneerng Departent Mechancal Engneerng 309 Nuercal Analyss of Engneerng Systes Sprng 04 Nuber: 537 Instructor: Larry Caretto Solutons to Prograng Assgnent

More information

APPLICATION OF MULTIVARIATE LOSS FUNCTION FOR ASSESSMENT OF THE QUALITY OF TECHNOLOGICAL PROCESS MANAGEMENT

APPLICATION OF MULTIVARIATE LOSS FUNCTION FOR ASSESSMENT OF THE QUALITY OF TECHNOLOGICAL PROCESS MANAGEMENT 3. - 5. 5., Brno, Czech Republc, EU APPLICATION OF MULTIVARIATE LOSS FUNCTION FOR ASSESSMENT OF THE QUALITY OF TECHNOLOGICAL PROCESS MANAGEMENT Abstract Josef TOŠENOVSKÝ ) Lenka MONSPORTOVÁ ) Flp TOŠENOVSKÝ

More information