Support Vector Machines
The decision surface is a hyperplane (a line in 2D) in feature space (similar to the Perceptron).
Arguably the most important recent discovery in machine learning.
In a nutshell:
Map the data to a predetermined, very high-dimensional space via a kernel function.
Find the hyperplane that maximizes the margin between the two classes.
If the data are not separable, find the hyperplane that maximizes the margin and minimizes a (weighted average of the) misclassifications.
Support Vector Machines — Three main ideas:
1. Define what an optimal hyperplane is (in a way that can be identified in a computationally efficient manner): maximize the margin.
2. Extend the above definition to non-linearly separable problems: add a penalty term for misclassifications.
3. Map data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that the data are mapped implicitly to this space (kernel).
Which Separating Hyperplane to Use?
[Figure: two classes of points in the Var 1 / Var 2 plane with several candidate separating lines]
Maximizing the Margin
IDEA 1: Select the separating hyperplane that maximizes the margin!
[Figure: candidate hyperplanes in the Var 1 / Var 2 plane, each shown with its margin width]
Support Vectors
[Figure: the maximum-margin hyperplane in the Var 1 / Var 2 plane; the training points lying on the margin boundaries are the support vectors, and the distance between the boundaries is the margin width]
Setting Up the Optimization Problem
The separating hyperplane is w·x + b = 0, and the margin is bounded by the parallel hyperplanes w·x + b = k and w·x + b = -k (shown in the Var 1 / Var 2 plane).
The width of the margin is: 2k / ‖w‖
So the problem is:
max 2k / ‖w‖
s.t. (w·x_i + b) ≥ k, for x_i of class 1
     (w·x_i + b) ≤ -k, for x_i of class 2
Setting Up the Optimization Problem
There is a scale and unit for the data such that k = 1. The margin is then bounded by w·x + b = 1 and w·x + b = -1, with w·x + b = 0 between them, and the problem becomes:
max 2 / ‖w‖
s.t. (w·x_i + b) ≥ 1, for x_i of class 1
     (w·x_i + b) ≤ -1, for x_i of class 2
Setting Up the Optimization Problem
If class 1 corresponds to 1 and class 2 corresponds to -1, we can rewrite the constraints as:
(w·x_i + b) ≥ 1, for x_i with y_i = 1
(w·x_i + b) ≤ -1, for x_i with y_i = -1
or, combined: y_i (w·x_i + b) ≥ 1, for all x_i
So the problem becomes:
max 2 / ‖w‖, s.t. y_i (w·x_i + b) ≥ 1, for all x_i
or, equivalently:
min (1/2) ‖w‖², s.t. y_i (w·x_i + b) ≥ 1, for all x_i
Linear, Hard-Margin SVM Formulation
Find w, b that solve:
min (1/2) ‖w‖²
s.t. y_i (w·x_i + b) ≥ 1, for all x_i
The problem is convex, so there is a unique global minimum value (when feasible).
There is also a unique minimizer, i.e., the weight vector w and the value of b that attain the minimum.
The problem is infeasible (not solvable) if the data are not linearly separable.
It is a quadratic program: very efficient computationally with modern constrained-optimization engines (handles thousands of constraints and training instances).
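To make the hard-margin program concrete, here is a minimal sketch that hands the QP above to a generic convex solver. The toy dataset, the choice of cvxpy, and all variable names are illustrative assumptions, not part of the original slides.

```python
# Hard-margin linear SVM as a quadratic program (illustrative sketch).
import cvxpy as cp
import numpy as np

# Tiny, linearly separable 2-D dataset (assumed for illustration).
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# min (1/2)||w||^2   s.t.   y_i (w . x_i + b) >= 1 for all i
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, " b =", b.value)
print("margin width = 2/||w|| =", 2 / np.linalg.norm(w.value))
```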
Support Vector Machines — Three main ideas:
1. Define what an optimal hyperplane is (in a way that can be identified in a computationally efficient manner): maximize the margin.
2. Extend the above definition to non-linearly separable problems: add a penalty term for misclassifications.
3. Map data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that the data are mapped implicitly to this space.
Non-Linearly Separable Data
Introduce slack variables ξ_i.
Allow some instances to fall within the margin, but penalize them.
[Figure: the hyperplanes w·x + b = 1, w·x + b = 0, and w·x + b = -1 in the Var 1 / Var 2 plane, with slack ξ measured for instances that violate the margin]
Formulating the Optimization Problem
The constraints become:
y_i (w·x_i + b) ≥ 1 - ξ_i, for all x_i
ξ_i ≥ 0
The objective function penalizes misclassified instances and those within the margin:
min (1/2) ‖w‖² + C Σ_i ξ_i
C trades off margin width and misclassifications.
[Figure: slack ξ measured from the margin hyperplanes w·x + b = ±1 in the Var 1 / Var 2 plane]
Linear, Soft-Margin SVMs
min (1/2) ‖w‖² + C Σ_i ξ_i
s.t. y_i (w·x_i + b) ≥ 1 - ξ_i, for all x_i; ξ_i ≥ 0
The algorithm tries to keep the ξ_i at zero while maximizing the margin.
Notice: the algorithm does not minimize the number of misclassifications (an NP-complete problem) but the sum of distances from the margin hyperplanes.
Other formulations use ξ_i² instead.
As C → ∞, we approach the hard-margin solution.
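The same sketch extends to the soft-margin case by adding one slack variable per training instance; again, the toy data, the value of C, and the use of cvxpy are assumptions for illustration only.

```python
# Soft-margin linear SVM as a quadratic program (illustrative sketch).
import cvxpy as cp
import numpy as np

# Toy 2-D data; the third point sits inside the opposite class (an "outlier").
X = np.array([[2.0, 2.0], [2.5, 3.0], [0.2, 0.1],
              [0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
C = 1.0  # trade-off between margin width and slack (assumed value)

n, d = X.shape
w, b, xi = cp.Variable(d), cp.Variable(), cp.Variable(n)

# min (1/2)||w||^2 + C * sum_i xi_i
# s.t. y_i (w . x_i + b) >= 1 - xi_i,  xi_i >= 0
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print("slacks:", np.round(xi.value, 3))  # nonzero entries mark margin violations
```

Re-running the sketch with a much larger C drives the slacks toward zero, illustrating the approach to the hard-margin solution noted above.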
Robustness of Soft vs Hard Margin SVMs
[Figure: two panels in the Var 1 / Var 2 plane showing the decision surface w·x + b = 0 fitted to the same data with an outlier: the soft-margin SVM absorbs the outlier through its slack ξ, while the hard-margin SVM's surface is dragged by it]
Soft vs Hard Margin SVM
Soft-margin always has a solution.
Soft-margin is more robust to outliers: smoother surfaces (in the non-linear case).
Hard-margin does not require guessing the cost parameter (it requires no parameters at all).
Support Vector Machines — Three main ideas:
1. Define what an optimal hyperplane is (in a way that can be identified in a computationally efficient manner): maximize the margin.
2. Extend the above definition to non-linearly separable problems: add a penalty term for misclassifications.
3. Map data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that the data are mapped implicitly to this space.
Disadvantages of Linear Decision Surfaces
[Figure: a dataset in the Var 1 / Var 2 plane that no straight line can separate well]
Advantages of Non-Linear Surfaces
[Figure: the same kind of Var 1 / Var 2 data separated cleanly by a non-linear decision surface]
Linear Classifiers in High-Dimensional Spaces
Find a function Φ(x) to map the data to a different space.
[Figure: data plotted in the original Var 1 / Var 2 space and again in the space of Constructed Feature 1 / Constructed Feature 2, where a linear decision surface separates them]
Mapping Data to a High-Dimensional Space
Find a function Φ(x) to map the data to a different space; the SVM formulation then becomes:
min (1/2) ‖w‖² + C Σ_i ξ_i
s.t. y_i (w·Φ(x_i) + b) ≥ 1 - ξ_i, for all x_i; ξ_i ≥ 0
Data appear only as Φ(x_i); the weights w are now weights in the new space.
Explicit mapping is expensive if Φ(x) is very high-dimensional.
Solving the problem without explicitly mapping the data is desirable.
The Dual of the SVM Formulation
Original (primal) SVM formulation:
min over w, b: (1/2) ‖w‖² + C Σ_i ξ_i
s.t. y_i (w·Φ(x_i) + b) ≥ 1 - ξ_i, for all x_i; ξ_i ≥ 0
n inequality constraints, n positivity constraints, n slack variables ξ_i.
The (Wolfe) dual of this problem:
min over α: (1/2) Σ_i Σ_j α_i α_j y_i y_j (Φ(x_i)·Φ(x_j)) - Σ_i α_i
s.t. C ≥ α_i ≥ 0 for all i, and Σ_i α_i y_i = 0
One equality constraint, n box/positivity constraints, n variables α_i (Lagrange multipliers); the objective function is more complicated.
NOTICE: the data appear only as Φ(x_i)·Φ(x_j).
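As a sketch of the dual, the same toy problem can be solved over the multipliers α_i. A linear kernel is assumed here so that the quadratic term can be written as a squared norm (a cvxpy-friendly reformulation); the data and the value of C are again illustrative.

```python
# Wolfe dual of the soft-margin SVM with a linear kernel (illustrative sketch).
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
C = 10.0
n = len(y)

alpha = cp.Variable(n)

# For the linear kernel, sum_ij alpha_i alpha_j y_i y_j (x_i . x_j) = ||X^T (y*alpha)||^2,
# so the dual objective  sum_i alpha_i - (1/2) * quadratic term  stays concave for cvxpy.
objective = cp.Maximize(cp.sum(alpha)
                        - 0.5 * cp.sum_squares(X.T @ cp.multiply(y, alpha)))
constraints = [alpha >= 0, alpha <= C, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

print("alphas:", np.round(alpha.value, 3))  # nonzero alphas correspond to support vectors
```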
The Kernel Trick
Φ(x_i)·Φ(x_j) means: map the data into the new space, then take the inner product of the new vectors.
We can find a function such that K(x_i, x_j) = Φ(x_i)·Φ(x_j), i.e., the kernel of a pair of data points equals the inner product of their images.
Then we do not need to explicitly map the data into the high-dimensional space to solve the optimization problem (for training).
How do we classify without explicitly mapping the new instances? It turns out that:
sign(w·Φ(x) + b) = sign(Σ_i α_i y_i K(x_i, x) + b)
where b solves α_j (y_j (Σ_i α_i y_i K(x_i, x_j) + b) - 1) = 0, for any j with α_j ≠ 0.
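The classification rule above can be checked numerically with an off-the-shelf SVM: scikit-learn's SVC exposes the support vectors, the products α_i·y_i (dual_coef_), and b (intercept_), so the kernel-only decision value can be recomputed by hand. The dataset and parameter values below are assumptions for illustration.

```python
# Verifying sign(w . Phi(x) + b) = sign(sum_i alpha_i y_i K(x_i, x) + b)
# using only kernel evaluations against the support vectors (illustrative sketch).
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=0)
gamma = 1.0
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

x_new = np.array([[0.5, 0.25]])

# Gaussian kernel K(x_i, x_new) for each support vector x_i
k = np.exp(-gamma * np.sum((clf.support_vectors_ - x_new) ** 2, axis=1))
manual = clf.dual_coef_ @ k + clf.intercept_  # dual_coef_ holds alpha_i * y_i

print(manual, clf.decision_function(x_new))  # the two values agree
```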
Examples of Kernels
Assume we measure two quantities, e.g., the expression levels of the genes TrkC and SonicHedgehog (SH), and we use the mapping:
Φ: ⟨x_TrkC, x_SH⟩ → {x_TrkC², x_SH², √2·x_TrkC·x_SH, √2·x_TrkC, √2·x_SH, 1}
Consider the function: K(x, z) = (x·z + 1)²
We can verify that:
Φ(x)·Φ(z) = x_TrkC² z_TrkC² + x_SH² z_SH² + 2 x_TrkC x_SH z_TrkC z_SH + 2 x_TrkC z_TrkC + 2 x_SH z_SH + 1
= (x_TrkC z_TrkC + x_SH z_SH + 1)² = (x·z + 1)² = K(x, z)
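A quick numeric check of this identity (a sketch; the test values are made up) codes the explicit map Φ and compares it with the kernel:

```python
# Degree-2 polynomial kernel vs. explicit feature map (illustrative sketch).
import numpy as np

def phi(v):
    """Explicit map for two measurements, e.g. TrkC and SonicHedgehog levels."""
    x1, x2 = v
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     1.0])

x = np.array([0.7, 1.3])
z = np.array([2.0, -0.4])

k_direct = (x @ z + 1) ** 2   # K(x, z) = (x . z + 1)^2
k_mapped = phi(x) @ phi(z)    # Phi(x) . Phi(z)
print(k_direct, k_mapped)     # identical up to floating-point rounding
```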
Polynomial and Gaussian Kernels
K(x, z) = (x·z + 1)^p is called the polynomial kernel of degree p.
For p = 2, if we measure 7,000 genes, using the kernel once means calculating a summation product with 7,000 terms and then taking the square of this number.
Mapping explicitly to the high-dimensional space means calculating approximately 50,000,000 new features for both training instances, then taking the inner product of those (another 50,000,000 terms to sum).
In general, using the kernel trick provides huge computational savings over explicit mapping!
Another commonly used kernel is the Gaussian (it maps to a space with a number of dimensions equal to the number of training cases):
K(x, z) = exp(-‖x - z‖² / (2σ²))
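Both kernels are available in standard SVM libraries; the following sketch (assuming scikit-learn, a synthetic two-ring dataset, and illustrative parameter values) fits a degree-2 polynomial kernel matching (x·z + 1)² and a Gaussian kernel on data that no linear surface separates well.

```python
# Polynomial vs. Gaussian (RBF) kernel SVMs on non-linearly separable data (sketch).
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# coef0=1, gamma=1 makes scikit-learn's poly kernel equal to (x . z + 1)^degree
poly_svm = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1.0).fit(X_tr, y_tr)
rbf_svm = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X_tr, y_tr)

print("poly accuracy:", poly_svm.score(X_te, y_te))
print("rbf  accuracy:", rbf_svm.score(X_te, y_te))
```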
The Mercer Condition
Is there a mapping Φ(x) for any symmetric function K(x, z)? No.
The SVM dual formulation requires calculating K(x_i, x_j) for each pair of training instances. The matrix G_ij = K(x_i, x_j) is called the Gram matrix.
There is a feature space Φ(x) when the kernel is such that G is always positive semi-definite (Mercer condition).
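The condition can be probed empirically on a sample: build the Gram matrix for a candidate kernel and inspect its eigenvalues. A minimal sketch, assuming the Gaussian kernel and random data:

```python
# Checking that a kernel's Gram matrix is positive semi-definite on a sample (sketch).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))

def gaussian_kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma**2))

# Gram matrix G_ij = K(x_i, x_j)
G = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])

eigvals = np.linalg.eigvalsh(G)
print("smallest eigenvalue:", eigvals.min())  # >= 0 (up to rounding) for a Mercer kernel
```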
Support Vector Machines — Three main ideas:
1. Define what an optimal hyperplane is (in a way that can be identified in a computationally efficient manner): maximize the margin.
2. Extend the above definition to non-linearly separable problems: add a penalty term for misclassifications.
3. Map data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that the data are mapped implicitly to this space.
Other Types of Kernel Methods
SVMs that perform regression
SVMs that perform clustering
ν-Support Vector Machines: maximize the margin while bounding the number of margin errors
Leave-One-Out Machines: minimize a bound on the leave-one-out error
SVM formulations that take into account differences in the cost of misclassification for the different classes
Kernels suitable for sequences or strings, and other specialized kernels
Variable Selection with SVMs
Recursive Feature Elimination:
- Train a linear SVM.
- Remove the variables with the lowest weights (those variables affect classification the least), e.g., remove the lowest 50% of variables.
- Retrain the SVM with the remaining variables and repeat until classification performance is reduced.
Very successful.
Other formulations exist where minimizing the number of variables is folded into the optimization problem.
Similar algorithms exist for non-linear SVMs.
Among the best and most efficient variable selection methods.
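Recursive feature elimination with a linear SVM is available off the shelf; this sketch (assuming scikit-learn, a synthetic dataset, and the 50%-per-step schedule mentioned above) illustrates the loop.

```python
# SVM-based recursive feature elimination (illustrative sketch).
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=40, n_informative=5,
                           random_state=0)

linear_svm = SVC(kernel="linear", C=1.0)
# step=0.5: drop the 50% of variables with the smallest |weight| at each iteration
selector = RFE(estimator=linear_svm, n_features_to_select=5, step=0.5).fit(X, y)

kept = [i for i, s in enumerate(selector.support_) if s]
print("selected variable indices:", kept)
```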
Comparison with Neural Networks
Neural Networks:
- Hidden layers map to lower-dimensional spaces
- Search space has multiple local minima
- Training is expensive
- Classification is extremely efficient
- Requires choosing the number of hidden units and layers
- Very good accuracy in typical domains
SVMs:
- Kernel maps to a very high-dimensional space
- Search space has a unique minimum
- Training is extremely efficient
- Classification is extremely efficient
- Kernel and cost are the two parameters to select
- Very good accuracy in typical domains
- Extremely robust
Why Do SVMs Generalize?
Even though they map to a very high-dimensional space, they have a very strong bias in that space: the solution has to be a linear combination of the training instances.
There is a large theory on Structural Risk Minimization providing bounds on the error of an SVM.
Typically, however, the error bounds are too loose to be of practical use.
Multi-Class SVMs
One-versus-all: train n binary classifiers, one for each class against all other classes. The predicted class is the class of the most confident classifier.
One-versus-one: train n(n-1)/2 classifiers, each discriminating between a pair of classes. Several strategies exist for selecting the final classification based on the outputs of the binary SVMs.
Truly multi-class SVMs: generalize the SVM formulation to multiple categories.
More on that in the paper nominated for the student paper award: "Methods for Multi-Category Cancer Diagnosis from Gene Expression Data: A Comprehensive Evaluation to Inform Decision Support System Development", Alexander Statnikov, Constantin F. Aliferis, Ioannis Tsamardinos.
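Both decompositions are easy to try in practice; in this sketch (scikit-learn and the Iris data are assumptions for illustration), SVC handles one-versus-one internally, while a one-versus-all wrapper trains one classifier per class.

```python
# One-versus-one vs. one-versus-all multi-class SVMs (illustrative sketch).
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

ovo = SVC(kernel="linear").fit(X, y)                        # n(n-1)/2 pairwise classifiers
ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)   # one classifier per class

print("one-versus-one accuracy:", ovo.score(X, y))
print("one-versus-all accuracy:", ova.score(X, y))
```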
Conclusions
SVMs express learning as a mathematical program, taking advantage of the rich theory in optimization.
SVMs use the kernel trick to map indirectly to extremely high-dimensional spaces.
SVMs are extremely successful, robust, efficient, and versatile, and there are good theoretical indications as to why they generalize well.
Suggested Further Reading
http://www.kernel-machines.org/tutorial.html
C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998.
P.-H. Chen, C.-J. Lin, and B. Schölkopf. A tutorial on nu-support vector machines. 2003.
N. Cristianini. ICML'01 tutorial, 2001.
K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2):181-201, May 2001.
B. Schölkopf. SVM and kernel methods, 2001. Tutorial given at the NIPS Conference.
Hastie, Tibshirani, and Friedman. The Elements of Statistical Learning. Springer, 2001.