INF 4300 Support Vector Machine Classifiers (SVM)
Anne Solberg (anne@ifi.uio.no)
- Linear classifiers with maximum margin for two-class problems
- The kernel trick: from linear to a high-dimensional generalization
- Generalization from 2 to M classes
- Practical issues
Curriculum
The lecture foils are most important! The lectures are based on selected sections from Pattern Recognition, Third Edition, by Theodoridis and Koutroumbas: 3.-3., 3.7 (but 3.7.3 is an SVM variant that we will skip) and 4.7. These sections use optimization theory described in Appendix C. We only include enough mathematics to state the optimization problem, and you are not required to understand how this optimization is solved. Another useful note: Andrew Ng's notes, http://cs9.stanford.edu/notes/cs9-notes3.pdf
Classification approaches
- Fit a parametric probability density function to the data and classify to the class that maximizes the posterior probability.
- kNN classification: use a non-parametric model.
- Other approaches to linear classification:
  - Logistic regression and Softmax (generalization to multiple classes) (INF 586)
  - Feedforward neural nets (INF 586)
  - Support Vector Machine classification (today)
Linear algebra basics: inner product between two vectors
The inner product (or dot product) between two vectors a and b, both of length N, is given by
  <a, b> = \sum_{i=1}^{N} a_i b_i = a^T b
The angle theta between two vectors A and B is defined by
  cos(theta) = <A, B> / (||A|| ||B||)
If the inner product of two vectors is zero, they are orthogonal to each other.
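As a minimal sketch (plain Python, helper names of our own choosing), these two definitions can be computed directly:

```python
import math

def inner(a, b):
    # <a, b> = sum_i a_i * b_i
    return sum(x * y for x, y in zip(a, b))

def angle(a, b):
    # cos(theta) = <a, b> / (||a|| ||b||)
    na = math.sqrt(inner(a, a))
    nb = math.sqrt(inner(b, b))
    return math.acos(inner(a, b) / (na * nb))

# Orthogonal vectors have inner product 0 and angle pi/2
print(inner([1, 0], [0, 1]))   # 0
print(angle([1, 0], [0, 1]))   # pi/2, approximately 1.5708
```
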
From last week: Case 1: Sigma_i = sigma^2 I
Now we get an equivalent formulation of the discriminant functions:
  g_i(x) = w_i^T x + w_{i0}, where w_i = \mu_i / \sigma^2 and w_{i0} = -\mu_i^T \mu_i / (2\sigma^2) + \ln P(\omega_i)
An equation for the decision boundary g_i(x) = g_j(x) can be written as
  w^T (x - x_0) = 0
where w = \mu_i - \mu_j and
  x_0 = \frac{1}{2}(\mu_i + \mu_j) - \sigma^2 \frac{\ln(P(\omega_i)/P(\omega_j))}{||\mu_i - \mu_j||^2} (\mu_i - \mu_j)
w = \mu_i - \mu_j is the vector between the mean values. This equation defines a hyperplane through the point x_0, orthogonal to w.
Gaussian model with Sigma = sigma^2 I: linear decision boundary
We found that the discriminant function (when Sigma_i = sigma^2 I) that defines the border between class 1 and class 2 in the feature space is a straight line. The discriminant function intersects the line connecting the two class means at the point c = (\mu_1 + \mu_2)/2 (if we do not consider prior probabilities). The discriminant function will also be normal to the line connecting the means.
Introduction to Support Vector Machine classifiers
To understand Support Vector Machine (SVM) classifiers, we need to study the linear classification problem in detail. We will need to see how classification using this (the linear case on the previous slide) can be computed using inner products. We start with two linearly separable classes, then extend to two non-separable classes, and then to M classes. The last step will be to use kernels to separate classes that cannot be separated in the input space.
A new view of linear classification using inner products
We have two classes (+ and -) represented by the class means c_+ and c_-. Let the feature vector be x, and let y_i be the class of feature vector x_i. If we have m_+ samples from class + and m_- samples from class -, the class means are given by
  c_+ = \frac{1}{m_+} \sum_{i: y_i = +1} x_i,   c_- = \frac{1}{m_-} \sum_{i: y_i = -1} x_i
A new pattern x should be classified to the class with the closest mean.
Halfway between the means lies the point c = (c_+ + c_-)/2. We can compute the class of a new sample x by checking whether the vector x - c (connecting x to c) encloses an angle smaller than pi/2 (in terms of absolute value) with the vector w = c_+ - c_-. The sign of this angle is given by the inner product between w and x - c:
  g(x) = w^T (x - c)
If a point x is on the decision-boundary hyperplane, then w^T (x - c) = 0. The angle between two vectors is computed by the inner product, which changes sign as the angle passes through pi/2.
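A minimal sketch of this nearest-mean rule expressed through inner products (plain Python, illustrative data only):

```python
def mean(vectors):
    # Component-wise mean of a list of equal-length vectors
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def classify(x, c_plus, c_minus):
    # w = c_+ - c_-, c = (c_+ + c_-)/2; the class is the sign of w^T (x - c)
    w = [p - m for p, m in zip(c_plus, c_minus)]
    c = [(p + m) / 2 for p, m in zip(c_plus, c_minus)]
    g = sum(wi * (xi - ci) for wi, xi, ci in zip(w, x, c))
    return +1 if g > 0 else -1

c_plus = mean([[2.0, 2.0], [3.0, 3.0]])   # class + mean: [2.5, 2.5]
c_minus = mean([[0.0, 0.0], [1.0, 1.0]])  # class - mean: [0.5, 0.5]
print(classify([2.4, 2.6], c_plus, c_minus))  # +1 (closer to c_+)
print(classify([0.2, 0.1], c_plus, c_minus))  # -1 (closer to c_-)
```
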
Support Vector Machines: two linearly separable classes
Let x_i, i = 1, ..., N be all the l-dimensional feature vectors in a training set with N samples. These belong to one of two classes, \omega_1 and \omega_2. We assume that the classes are linearly separable. This means that a hyperplane g(x) = w^T x + w_0 = 0 correctly classifies all these training samples. w = [w_1, ..., w_l]^T is called a weight vector, and w_0 is the threshold.
Introduction to linear SVM
Discriminant function: g(x) = w^T x + w_0
- w: weights/orientation
- w_0: threshold/bias
- x: input pattern
- y_i in {-1, 1}: class indicator for pattern x_i
Class prediction: y = -1 if g(x) < 0, y = 1 if g(x) > 0
Separable case: many candidates
Obviously we want the decision boundary to separate the classes... however, there can be many such hyperplanes. Which of these two candidates would you prefer? Why?
Distance to the decision boundary
Write a point x as x = x_B + z * w/||w||, where x_B is its projection onto the hyperplane g(x) = 0 and z is the distance from x to the decision boundary. Since w/||w|| is a unit vector in the direction of w:
  g(x) = w^T x + w_0 = w^T x_B + w_0 + z * (w^T w)/||w|| = z ||w||
because x_B lies on the decision boundary, so w^T x_B + w_0 = 0. Solve this for z:
  z = g(x)/||w||
The distance from the closest training points to the hyperplane is what we will call the margin of the classifier.
Hyperplanes and margins
If both classes are equally probable, the distance from the hyperplane to the closest points in both classes should be equal. This is called the margin. The margin in "direction 1" is z_1, and in "direction 2" it is z_2. From the previous slide, the distance from a point x_i to the separating hyperplane is
  z_i = |g(x_i)|/||w||
Goal: find w and w_0 maximizing the margin! How would you write a program finding this? Not easy unless we state the objective function cleverly!
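As a small sketch (plain Python, hyperplane parameters chosen for illustration), the point-to-hyperplane distance z = |g(x)|/||w|| is:

```python
import math

def g(x, w, w0):
    # Discriminant function g(x) = w^T x + w_0
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

def distance_to_hyperplane(x, w, w0):
    # z = |g(x)| / ||w||
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return abs(g(x, w, w0)) / norm_w

# Hyperplane x1 + x2 - 1 = 0, i.e. w = [1, 1], w0 = -1
w, w0 = [1.0, 1.0], -1.0
print(distance_to_hyperplane([1.0, 1.0], w, w0))  # 1/sqrt(2), about 0.7071
```
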
Towards a clever objective function
We can scale g(x) such that g(x) will be equal to 1 or -1 at the closest points in the two classes. This is equivalent to:
- It does not change the margin (it only rescales w and w_0).
- The margin becomes 1/||w|| + 1/||w|| = 2/||w||.
- Require that
    w^T x + w_0 >= 1, for all x in \omega_1
    w^T x + w_0 <= -1, for all x in \omega_2
Remember our goal: find w and w_0 yielding the maximum margin.
Maximum-margin objective function
The hyperplane with maximum margin can be found by solving the optimization problem (w.r.t. w and w_0):
  minimize J(w) = 1/2 ||w||^2
  subject to y_i (w^T x_i + w_0) >= 1, i = 1, ..., N
The 1/2 factor is for later convenience. Note: we assume here fully class-separable data!
Checkpoint: Do you understand the formulation? How is this criterion related to maximizing the margin?
Note! We are somewhat done -- Matlab (or similar software) can solve this now. But we seek more insight!
More on the optimization problem
Generalized Lagrangian function:
  L(w, w_0, \lambda) = 1/2 ||w||^2 - \sum_{i=1}^{N} \lambda_i [y_i (w^T x_i + w_0) - 1]
Karush-Kuhn-Tucker (KKT) conditions: for each i, either \lambda_i = 0 or y_i (w^T x_i + w_0) = 1. We recommend, again, reading the note on Lagrange multipliers (see undervisningsplan).
Support vectors
The feature vectors x_i with a corresponding \lambda_i > 0 are called the support vectors for the problem. The classifier defined by this hyperplane is called a Support Vector Machine. Depending on y_i (+1 or -1), the support vectors will thus lie on either of the two hyperplanes
  w^T x + w_0 = +/-1
The support vectors are the points in the training set that are closest to the decision hyperplane. The optimization has a unique solution; only one hyperplane satisfies the conditions. In the figure, the support vectors for hyperplane 1 are the blue circles, and the support vectors for hyperplane 2 are the red circles.
Dual representation
Plugging w = \sum_i \lambda_i y_i x_i back into L(w, w_0, \lambda) gives us the dual problem:
  max_{\lambda} \sum_{i=1}^{N} \lambda_i - 1/2 \sum_{i,j} \lambda_i \lambda_j y_i y_j x_i^T x_j
  s.t. \lambda_i >= 0, \sum_{i=1}^{N} \lambda_i y_i = 0
Important (for later): the samples come into play as inner products only!
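Once the multipliers are known, the classifier is g(x) = \sum_i \lambda_i y_i x_i^T x + w_0. A sketch in plain Python, with a two-point toy problem whose dual solution (\lambda_1 = \lambda_2 = 0.25, w_0 = 0) was worked out by hand rather than by a QP solver:

```python
def svm_decision(x, support, w0):
    # g(x) = sum_i lambda_i * y_i * (x_i^T x) + w_0
    # Note that the training samples enter only through inner products.
    return sum(lam * y * sum(a * b for a, b in zip(xi, x))
               for lam, y, xi in support) + w0

# Toy problem: x1 = (1, 1) with y = +1 and x2 = (-1, -1) with y = -1.
# Solving the dual by hand gives lambda_1 = lambda_2 = 0.25, w_0 = 0,
# i.e. w = (0.5, 0.5) and g(x) = 0.5 (x_1 + x_2).
support = [(0.25, +1, (1.0, 1.0)), (0.25, -1, (-1.0, -1.0))]
print(svm_decision((1.0, 1.0), support, 0.0))   # 1.0 (on the margin hyperplane)
print(svm_decision((-3.0, 1.0), support, 0.0))  # -1.0
```
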
The nonseparable case
If the two classes are nonseparable, a hyperplane satisfying all the conditions y_i (w^T x_i + w_0) >= 1 cannot be found. The feature vectors in the training set are now either:
1. Vectors that fall outside the band and are correctly classified. They satisfy y_i (w^T x_i + w_0) >= 1.
2. Vectors that are inside the band and are correctly classified. They satisfy 0 <= y_i (w^T x_i + w_0) < 1.
3. Vectors that are misclassified, expressed as y_i (w^T x_i + w_0) < 0.
(Figure: correctly classified vs. erroneously classified samples relative to the band.)
The three cases can be treated under a single type of constraint if we introduce slack variables \xi_i:
  y_i [w^T x_i + w_0] >= 1 - \xi_i
- The first category (outside, correctly classified) has \xi_i = 0.
- The second category (inside, correctly classified) has 0 < \xi_i <= 1.
- The third category (inside, misclassified) has \xi_i > 1.
The optimization goal is now to keep the margin as large as possible and the number of points with \xi_i > 0 as small as possible.
Cost function, nonseparable case
The cost function to minimize is now
  J(w, w_0, \xi) = 1/2 ||w||^2 + C \sum_{i=1}^{N} I(\xi_i > 0)
where \xi is the vector of slack parameters and I(\xi_i > 0) = 1 if \xi_i > 0 and 0 otherwise. C is a parameter that controls how much misclassified training samples are weighted. We skip the mathematics and present the alternative dual formulation:
  max_{\lambda} \sum_i \lambda_i - 1/2 \sum_{i,j} \lambda_i \lambda_j y_i y_j x_i^T x_j
  subject to 0 <= \lambda_i <= C, \sum_i \lambda_i y_i = 0
All points between the two hyperplanes (\xi_i > 0) can be shown to have \lambda_i = C.
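A sketch of how the slacks and the cost could be evaluated for a given hyperplane (plain Python; note the assumption that we use the common differentiable variant C * sum_i xi_i in place of the slide's indicator-count C * sum_i I(xi_i > 0)):

```python
def slack(x, y, w, w0):
    # xi = max(0, 1 - y * (w^T x + w_0)); zero for points outside the band
    g = sum(wi * xv for wi, xv in zip(w, x)) + w0
    return max(0.0, 1.0 - y * g)

def soft_margin_cost(data, w, w0, C):
    # J = 1/2 ||w||^2 + C * sum_i xi_i
    # (differentiable variant; the slide counts violations with I(xi_i > 0))
    total_slack = sum(slack(x, y, w, w0) for x, y in data)
    return 0.5 * sum(wi * wi for wi in w) + C * total_slack

data = [((2.0, 2.0), +1), ((-2.0, -2.0), -1), ((0.2, 0.2), -1)]
w, w0 = (0.5, 0.5), 0.0
# The first two points lie outside the band (slack 0);
# the third is misclassified (slack 1.2), so J = 0.25 + 1.2 = 1.45
print(soft_margin_cost(data, w, w0, C=1.0))  # 1.45
```
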
Nonseparable vs. separable case
Note that the slack variables do not enter the dual problem explicitly. The only difference between the linearly separable and the nonseparable case is that the Lagrange multipliers are bounded by C. Training an SVM classifier consists of solving the optimization problem. The problem is quite complex since it grows with the number of training pixels. It is computationally heavy. We will get back with hints on software libraries to use at the end of the lecture...
An example: the effect of C
C is the misclassification cost. (Figure: decision boundaries for two different values of C.) Selecting too high a C will give a classifier that fits the training data perfectly, but fails on a different data set. The value of C should be selected using a separate validation set: separate the training data into a part used for training, train with different values of C, and select the value that gives the best results on the validation data set. Then apply this to new data or the test data set. (explained later)
SVM: a geometric view
SVMs can be related to the convex hulls of the different classes. Consider a class that contains training samples X = {x_1, ..., x_N}. The convex hull of the set of points in X is given by all convex combinations of the N elements in X. A region R is convex if and only if, for any two points x_1, x_2 in R, the whole line segment between x_1 and x_2 is inside R. The convex hull of a region R is the smallest convex region H which satisfies the condition R ⊆ H.
The convex hull for a class is the smallest convex set that contains all the points in the class (X). Searching for the hyperplane with the highest margin is equivalent to searching for the two nearest points in the two convex sets. This can be proved, but we just take the result as an aid to get a better visual interpretation of the SVM hyperplane.
Reduced convex hull
To get a usable interpretation for nonseparable classes, we need the reduced convex hull. The convex hull can be expressed as:
  conv{X} = { y : y = \sum_{i=1}^{N} \lambda_i x_i, x_i \in X, \sum_{i=1}^{N} \lambda_i = 1, 0 <= \lambda_i <= 1 }
The reduced convex hull is:
  R(X, \mu) = { y : y = \sum_{i=1}^{N} \lambda_i x_i, x_i \in X, \sum_{i=1}^{N} \lambda_i = 1, 0 <= \lambda_i <= \mu }
\mu is a scalar between 0 and 1. \mu = 1 gives the regular convex hull. Here we add the restriction that each \lambda_i must also be smaller than \mu.
Reduced convex hull - example
(Figure: data set with overlapping classes; solid curves show the regular convex hulls (\mu = 1), dotted and dashed curves show reduced convex hulls for two smaller values of \mu.)
For small enough values of \mu, we can make the two reduced convex hulls non-overlapping. A very rough explanation of the nonseparable SVM problem is that a value of \mu that gives non-intersecting reduced convex hulls must be found. Given a value of \mu that gives non-intersecting reduced convex hulls, the best hyperplane will bisect the line between the closest points in these two reduced convex hulls.
Relating \mu and C
Given a value of \mu that gives non-intersecting reduced convex hulls, find the hyperplane by finding the closest two points in the two sets. Several values of \mu can give non-intersecting reduced hulls. \mu is related to C, the cost of misclassifying training samples (see the slide on the cost function for the nonseparable case). A high C will give regions that just barely give non-intersecting regions. The most robust choice, considering a validation data set, is probably a smaller value of C (and \mu).
Checkpoint
What does this criterion mean:
  J(w, w_0, \xi) = 1/2 ||w||^2 + C \sum_{i=1}^{N} I(\xi_i > 0)
where \xi is the vector of slack parameters and I(\xi_i > 0) = 1 if \xi_i > 0 and 0 otherwise?
Which points are the support vectors in the linear case?
SVMs: the nonlinear case - intro
The training samples are l-dimensional vectors; we have until now tried to find a linear separation in this l-dimensional feature space. This seems quite limiting. What if we increase the dimensionality (map our samples to a higher-dimensional space) before applying our SVM? Perhaps we can find a better linear decision boundary in that space? Even if the feature vectors are not linearly separable in the input space, they might be (close to) separable in a higher-dimensional space.
An example: from 2D to 3D
Let x be a 2D vector x = [x_1, x_2]. In the toy example on the right, the two classes cannot be linearly separated in the original 2D space. Consider now the transformation
  y = [x_1^2, \sqrt{2} x_1 x_2, x_2^2]^T
Now, the transformed points in this 3D space can be separated by a plane. The separating plane in 3D maps out an ellipse in the original 2D space. Cf. the next slide, and note that y_i^T y_j = (x_i^T x_j)^2. "Nonlinear"!
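The identity y_i^T y_j = (x_i^T x_j)^2 can be checked numerically. A sketch in plain Python (the mapping phi below is the standard one for this toy example):

```python
import math

def phi(x):
    # Map 2D -> 3D: phi(x) = (x1^2, sqrt(2) x1 x2, x2^2)
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

x, z = (1.0, 2.0), (3.0, -1.0)
lhs = dot(phi(x), phi(z))   # inner product evaluated in the 3D space
rhs = dot(x, z) ** 2        # (x^T z)^2 evaluated in the original 2D space
print(lhs, rhs)             # both approximately 1.0
```

So an inner product in the 3D space can be computed from the 2D vectors alone; this is exactly what the kernel trick exploits.
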
SVMs and kernels
Note that in both the optimization problem and the evaluation function g(x), the samples come into play as inner products only:
  max_{\lambda} \sum_i \lambda_i - 1/2 \sum_{i,j} \lambda_i \lambda_j y_i y_j x_i^T x_j   s.t. 0 <= \lambda_i <= C, \sum_i \lambda_i y_i = 0
  g(x) = \sum_{i=1}^{N_s} \lambda_i y_i x_i^T x + w_0
If we have a function evaluating inner products, K(x_i, x_j) (called a "kernel"), we can ignore the samples themselves. Let's say we have K(x_i, x_j) evaluating inner products in a higher-dimensional space: then there is no need to do the mapping of our samples explicitly!
Useful kernels for classification
- Polynomial kernels: K(x, z) = (x^T z + 1)^q, q > 0
- Radial basis function kernels (very commonly used!): K(x, z) = exp(-||x - z||^2 / \sigma^2)
- Hyperbolic tangent kernels (often with \beta = 2 and \gamma = 1): K(x, z) = tanh(\beta x^T z + \gamma)
Note that we need to set the parameter \sigma. The "support" of each point is controlled by \sigma. The inner product is related to the similarity of the two samples. The kernel inputs need not be numeric; e.g., kernels for text strings are possible. The kernels give inner-product evaluations in the, possibly infinite-dimensional, transformed space.
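The first two kernels are straightforward to implement. A plain-Python sketch:

```python
import math

def polynomial_kernel(x, z, q=2):
    # K(x, z) = (x^T z + 1)^q
    return (sum(a * b for a, b in zip(x, z)) + 1.0) ** q

def rbf_kernel(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / sigma^2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / sigma ** 2)

# x^T z = 1 for these two vectors, so the q=2 polynomial kernel gives (1+1)^2
print(polynomial_kernel((1.0, 2.0), (3.0, -1.0)))  # 4.0
# The RBF kernel of a point with itself is always 1 (maximum similarity)
print(rbf_kernel((0.0, 0.0), (0.0, 0.0)))          # 1.0
```

Note how the RBF value decays with distance: sigma controls how quickly a training point's influence ("support") falls off.
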
The kernel formulation of the objective function
Given the appropriate kernel (e.g. "radial" with width \sigma) and the cost of misclassification C, the optimization task is:
  max_{\lambda} \sum_i \lambda_i - 1/2 \sum_{i,j} \lambda_i \lambda_j y_i y_j K(x_i, x_j)
  subject to 0 <= \lambda_i <= C, i = 1, ..., N, \sum_i \lambda_i y_i = 0
The resulting classifier is:
  assign x to class \omega_1 if g(x) = \sum_{i=1}^{N_s} \lambda_i y_i K(x_i, x) + w_0 > 0, and to class \omega_2 otherwise.
Example of nonlinear decision boundary
This illustrates how the nonlinear SVM might look in the original feature space. RBF kernel used. Figure 4.3 in PR by Theodoridis et al.
From 2 to M classes
All we have discussed up until now involves only separating 2 classes. How do we extend the methods to M classes? Two common approaches:
- One-against-all: for each class m, find the hyperplane that best discriminates this class from all other classes. Then classify a sample to the class having the highest output. (To use this, we need the VALUE of the inner product and not just the sign.)
- Compare all pairwise classifiers: find a hyperplane for each pair of classes. This gives M(M-1)/2 pairwise classifiers. For a given sample, use a voting scheme to select the winning class.
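The pairwise voting scheme can be sketched as follows (plain Python; `toy_decision` is a hypothetical stand-in for a trained pairwise SVM, used only so the example runs):

```python
from itertools import combinations
from collections import Counter

def one_vs_one_predict(x, classes, decision):
    # decision(a, b, x) returns the winning class of the pairwise classifier
    # for classes a and b; each of the M(M-1)/2 classifiers casts one vote.
    votes = Counter(decision(a, b, x) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]

# Dummy pairwise rule for illustration only: for a 1D sample, pick the
# class label whose value is numerically closer to the sample.
def toy_decision(a, b, x):
    return a if abs(x - a) < abs(x - b) else b

print(one_vs_one_predict(1.2, classes=[0, 1, 2, 3], decision=toy_decision))  # 1
```
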
How to use an SVM classifier
- Find a library with all the necessary SVM functions, for example LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvm/) or the PRTools toolbox (http://www.37steps.com/prtools/).
- Read the introductory guides. Often a radial basis function kernel is a good starting point.
- Scale the data to the range [-1, 1] (so that features with large values do not dominate).
- Find the optimal values of C and \sigma by performing a grid search over selected values, using a validation data set.
- Train the classifier using the best values from the grid search.
- Test using a separate test set.
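The scaling step is easy to get wrong: the min/max must come from the training data only, and the same mapping is then applied to validation and test samples. A per-feature sketch in plain Python:

```python
def scaling_params(train_column):
    # Per-feature min and max, estimated from the TRAINING data only
    return min(train_column), max(train_column)

def scale(value, lo, hi):
    # Linearly map [lo, hi] -> [-1, 1]
    return 2.0 * (value - lo) / (hi - lo) - 1.0

train = [3.0, 7.0, 5.0]
lo, hi = scaling_params(train)
print([scale(v, lo, hi) for v in train])  # [-1.0, 1.0, 0.0]
print(scale(6.0, lo, hi))                 # 0.5 (test samples reuse training params)
```
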
How to do a grid search
Use n-fold cross-validation (e.g. 10-fold): divide the training data into 10 subsets of equal size, train on 9 subsets and test on the remaining subset, and repeat this procedure 10 times. Grid search: try pairs of (C, \sigma). Select the pair that gets the best classification performance on average over all the n validation subsets. Use e.g. the following grids of values (powers of two):
  C = 2^-5, 2^-3, ..., 2^15
  \sigma = 2^-15, 2^-13, ..., 2^3
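A sketch of the bookkeeping for this procedure (plain Python; the exponent ranges follow the common LIBSVM recommendation, an assumption since the slide's digits were garbled, and the actual train/evaluate step is left out):

```python
def k_fold_indices(n_samples, k):
    # Split sample indices into k (nearly) equal folds; fold i is the
    # validation set in round i, the rest is used for training.
    folds = [[] for _ in range(k)]
    for i in range(n_samples):
        folds[i % k].append(i)
    return folds

def grid(c_exponents, sigma_exponents):
    # All (C, sigma) pairs on a log2-spaced grid
    return [(2.0 ** a, 2.0 ** b) for a in c_exponents for b in sigma_exponents]

folds = k_fold_indices(20, 10)
print(len(folds), len(folds[0]))  # 10 folds of 2 samples each

# C = 2^-5, 2^-3, ..., 2^15 and sigma = 2^-15, 2^-13, ..., 2^3
pairs = grid(range(-5, 16, 2), range(-15, 4, 2))
print(len(pairs))  # 11 * 10 = 110 candidate (C, sigma) pairs
```

For each pair one would train on 9 folds, evaluate on the held-out fold, average over the 10 rounds, and keep the best pair.
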
Summary / learning goals
- Understand enough of SVM classifiers to be able to use them for a classification application.
- Understand the basic linearly separable problem and the meaning of the solution with the largest margin.
- Understand how SVMs work in the nonseparable case using a cost for misclassification.
- Accept the kernel trick: that the original feature vectors can be transformed into a higher-dimensional space, and that a linear SVM is applied in this space without explicitly doing the feature transform.
- Know briefly how to extend from 2 to M classes.
- Know which parameters (C, \sigma, etc.) the user must specify and how to perform a grid search for these.
- Be able to find an SVM library and use it correctly.