Support Vector Machines
Some slides adapted from Aliferis & Tsamardinos, Vanderbilt University
http://discover1.mc.vanderbilt.edu/discover/public/ml_tutorial_old/index.html
Rong Jin, Language Technology Institute
www.contrib.andrew.cmu.edu/~jin/ir_proj/svm.ppt
ABDBM Ron Shamir
Support Vector Machines
Decision surface: a hyperplane in feature space
One of the most important tools in the machine learning toolbox
In a nutshell:
- map the data to a predetermined very high-dimensional space via a kernel function
- find the hyperplane that maximizes the margin between the two classes
- if the data are not separable, find the hyperplane that maximizes the margin and minimizes the (weighted average of the) misclassifications
Support Vector Machines
Three main ideas:
1. Define what an optimal hyperplane is (taking into account that it needs to be computed efficiently): maximize margin
2. Generalize to non-linearly separable problems: have a penalty term for misclassifications
3. Map data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that data are mapped implicitly to this space
Which Separating Hyperplane to Use?
(Figure: two classes plotted against Var 1 and Var 2, with several candidate separating hyperplanes.)
Maximizing the Margin
IDEA 1: Select the separating hyperplane that maximizes the margin!
(Figure: two candidate hyperplanes on the Var 1 vs. Var 2 plot, each with its margin width shown.)
Support Vectors
(Figure: the support vectors are the training points lying on the margin boundaries; the margin width is indicated.)
Setting Up the Optimization Problem
The separating hyperplane is w·x + b = 0; the margin boundaries are w·x + b = k and w·x + b = -k.
The width of the margin is 2k/||w||.
So, the problem is:
max 2k/||w||
s.t. (w·x_i + b) ≥ k,  for x_i of class 1
     (w·x_i + b) ≤ -k, for x_i of class 2
Setting Up the Optimization Problem
Scaling w, b so that k = 1, the margin boundaries become w·x + b = 1 and w·x + b = -1, and the problem becomes:
max 2/||w||
s.t. (w·x_i + b) ≥ 1,  for x_i of class 1
     (w·x_i + b) ≤ -1, for x_i of class 2
Setting Up the Optimization Problem
If class 1 corresponds to y_i = 1 and class 2 corresponds to y_i = -1, we can rewrite
(w·x_i + b) ≥ 1,  for x_i with y_i = 1
(w·x_i + b) ≤ -1, for x_i with y_i = -1
as
y_i(w·x_i + b) ≥ 1, for all x_i
So the problem becomes:
max 2/||w||  s.t. y_i(w·x_i + b) ≥ 1, for all x_i
or equivalently
min (1/2)||w||²  s.t. y_i(w·x_i + b) ≥ 1, for all x_i
Linear, Hard-Margin SVM Formulation
Find w, b that solve
min (1/2)||w||²  s.t. y_i(w·x_i + b) ≥ 1, for all x_i
Quadratic program: quadratic objective, linear (in)equality constraints
Problem is convex ⇒ there is a unique global minimum value (when feasible)
There is also a unique minimizer, i.e. w and b values that provide the minimum
No solution if the data are not linearly separable
Objective is positive definite ⇒ polynomial-time solution
Very efficient solution with modern optimization software (handles 1000s of constraints and training instances)
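As a quick illustration, this QP can be handed to an off-the-shelf solver; scikit-learn's SVC with a very large C behaves like the hard-margin SVM on separable data (the toy points below are hypothetical, not from the slides):

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data in 2-D
X = np.array([[1, 1], [2, 2], [2, 0],
              [-1, -1], [-2, -2], [-2, 0]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin formulation min (1/2)||w||^2
clf = SVC(kernel="linear", C=1e10).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("margin width:", 2 / np.linalg.norm(w))
# every constraint y_i(w.x_i + b) >= 1 holds (up to solver tolerance)
print("constraints satisfied:", bool((y * (X @ w + b) >= 1 - 1e-6).all()))
```

At the optimum the points nearest the hyperplane sit exactly on the margin boundaries, so y_i(w·x_i + b) = 1 for them.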
Lagrange Multipliers
Minimize Φ(w) = (1/2)||w||²
s.t. y_i(w·x_i + b) ≥ 1, with Lagrange multipliers α_1 ≥ 0, …, α_l ≥ 0
Convex quadratic programming problem ⇒ duality theory applies!
Dual Space
Dual problem:
Maximize F(Λ) = Λ·1 − (1/2) ΛᵀDΛ
subject to α_i ≥ 0 (i = 1, …, l) and Λ·y = 0
where Λ = (α_1, α_2, …, α_l), y = (y_1, y_2, …, y_l), and D_ij = y_i y_j x_i·x_j
Representation for w:  w = Σ_i α_i y_i x_i
Decision function:  f(x) = sgn(Σ_{i=1}^l α_i y_i (x_i·x) + b)
Comments
Representation of vector w: a linear combination of the examples x_i
# parameters = # examples
α_i: the importance of each example
Only the points closest to the boundary have α_i ≠ 0
Core of the algorithm: x_i·x_j
Both the matrix D and the decision function require only the inner products x_i·x_j (more on this soon)
w = Σ_i α_i y_i x_i,  D_ij = y_i y_j x_i·x_j
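This representation can be checked numerically: in scikit-learn (a tool assumed here, not part of the slides), `dual_coef_` stores α_i·y_i for the support vectors, so w can be reassembled exactly as described:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical separable toy data
X = np.array([[1.0, 1.0], [2.0, 0.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only,
# so w = sum_i alpha_i y_i x_i is one matrix product away:
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))
```

Only the support vectors appear in the sum, which is exactly the sparsity the slide describes.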
Support Vector Machines
Three main ideas:
1. Define what an optimal hyperplane is (taking into account that it needs to be computed efficiently): maximize margin
2. Generalize to non-linearly separable problems: have a penalty term for misclassifications
3. Map data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that data are mapped implicitly to this space
Non-Linearly Separable Data
Introduce slack variables ξ_i ≥ 0
Allow some instances to fall within the margin, but penalize them
(Figure: hyperplane w·x + b = 0 with margin boundaries w·x + b = ±1; violating points x_i, x_j have slacks ξ_i, ξ_j.)
Formulating the Optimization Problem
Constraint becomes:
y_i(w·x_i + b) ≥ 1 − ξ_i, for all x_i;  ξ_i ≥ 0
Objective function penalizes for misclassified instances and those within the margin:
min (1/2)||w||² + C Σ_i ξ_i
C trades off margin width & misclassifications
Linear, Soft-Margin SVMs
min (1/2)||w||² + C Σ_i ξ_i
s.t. y_i(w·x_i + b) ≥ 1 − ξ_i, for all x_i;  ξ_i ≥ 0
Algorithm tries to keep the ξ_i at zero while maximizing the margin
Alg does not minimize the number of misclassifications (an NP-complete problem) but the sum of distances from the margin hyperplanes
Other formulations use Σ_i ξ_i² instead
C: penalty for misclassification
As C → ∞, we get closer to the hard-margin solution
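The trade-off controlled by C can be seen directly: with one outlier in otherwise well-separated data (a made-up example), a small C keeps ||w|| small (wide margin, outlier tolerated as slack), while a large C drives the solution toward the narrow hard margin:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical data: two well-separated clusters plus one negative outlier at (1.5, 0)
X = np.array([[2, 0], [3, 1], [3, -1],
              [-2, 0], [-3, 1], [-3, -1],
              [1.5, 0]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1, -1])

soft = SVC(kernel="linear", C=0.01).fit(X, y)   # wide margin, outlier absorbed as slack
hard = SVC(kernel="linear", C=1000).fit(X, y)   # near hard-margin, squeezed by the outlier

print("margin, small C:", 2 / np.linalg.norm(soft.coef_))
print("margin, large C:", 2 / np.linalg.norm(hard.coef_))
```

The small-C margin spans the gap between the clusters; the large-C margin shrinks to fit between the outlier and the positive class.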
Dual Space
Dual problem:
Maximize F(Λ) = Λ·1 − (1/2) ΛᵀDΛ
subject to 0 ≤ α_i ≤ C (i = 1, …, l) and Λ·y = 0
where Λ = (α_1, α_2, …, α_l), y = (y_1, y_2, …, y_l), and D_ij = y_i y_j x_i·x_j
Only difference: upper bound C on the α_i
Representation for w:  w = Σ_i α_i y_i x_i
Decision function:  f(x) = sgn(Σ_{i=1}^l α_i y_i (x_i·x) + b)
Param C: Comments
Controls the range of the α_i: avoids over-emphasizing some examples
(C − α_i) ξ_i = 0 ("complementary slackness")
C can be extended to be case-dependent
α_i < C ⇒ ξ_i = 0 ⇒ i-th example is correctly classified, not quite important
α_i = C ⇒ ξ_i can be nonzero ⇒ i-th training example may be misclassified, very important
Robustness of Soft vs Hard Margin SVMs
(Figures: on the same data with an outlier, the soft-margin hyperplane w·x + b = 0 stays close to the bulk of the data, while the hard-margin hyperplane is dragged by the outlier.)
Soft vs Hard Margin SVM
Soft-Margin always has a solution
Soft-Margin is more robust to outliers: smoother surfaces (in the non-linear case)
Hard-Margin does not require guessing the cost parameter (requires no parameters at all)
Support Vector Machines
Three main ideas:
1. Define what an optimal hyperplane is (taking into account that it needs to be computed efficiently): maximize margin
2. Generalize to non-linearly separable problems: have a penalty term for misclassifications
3. Map data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that data are mapped implicitly to this space
Disadvantages of Linear Decision Surfaces
(Figure: two classes on Var 1 vs. Var 2 that no straight line can separate well.)
Advantages of Non-Linear Surfaces
(Figure: the same two classes separated cleanly by a non-linear decision surface.)
Linear Classifiers in High-Dimensional Spaces
(Figure: data that are not linearly separable in (Var 1, Var 2) become linearly separable in (Constructed Feature 1, Constructed Feature 2).)
Find a function Φ(x) to map to a different space
Mapping Data to a High-Dimensional Space
Find a function Φ(x) to map to a different space; then the SVM formulation becomes:
min (1/2)||w||² + C Σ_i ξ_i
s.t. y_i(w·Φ(x_i) + b) ≥ 1 − ξ_i, for all x_i;  ξ_i ≥ 0
Data appear as Φ(x); the weights w are now weights in the new space
Explicit mapping is expensive if Φ(x) is very high-dimensional
Can we solve the problem without explicitly mapping the data?
The Dual of the SVM Formulation
Original SVM formulation:
- n inequality constraints, n positivity constraints, n slack variables ξ_i (plus w and b)
min_{w,b} (1/2)||w||² + C Σ_i ξ_i
s.t. y_i(w·Φ(x_i) + b) ≥ 1 − ξ_i, for all x_i;  ξ_i ≥ 0
The (Wolfe) dual of this problem:
- one equality constraint, n positivity constraints, n variables (the Lagrange multipliers), a more complicated objective function
max_α Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j (Φ(x_i)·Φ(x_j))
s.t. C ≥ α_i ≥ 0, for all x_i;  Σ_i α_i y_i = 0
But: data only appear as Φ(x_i)·Φ(x_j)
The Kernel Trick
Φ(x_i)ᵀΦ(x_j) means: map data into the new space, then take the inner product of the new vectors
Suppose we can find a function such that K(x_i, x_j) = Φ(x_i)ᵀΦ(x_j), i.e., K is the inner product of the images of the data
Then for training there is no need to explicitly map the data into the high-dimensional space to solve the optimization problem
How do we classify without explicitly mapping the new instances? Turns out
f(x) = sgn(Σ_i α_i y_i K(x_i, x) + b)
where b solves y_j (Σ_i α_i y_i K(x_i, x_j) + b) = 1, for any j with α_j ≠ 0
Examples of Kernels
Assume we measure x_1, x_2 and we use the mapping:
Φ: (x_1, x_2) ↦ (x_1², x_2², √2·x_1x_2, √2·x_1, √2·x_2, 1)
Consider the function K(x, z) = (x·z + 1)². Then:
Φ(x)ᵀΦ(z) = x_1²z_1² + x_2²z_2² + 2x_1x_2z_1z_2 + 2x_1z_1 + 2x_2z_2 + 1
          = (x_1z_1 + x_2z_2 + 1)²
          = (x·z + 1)²
          = K(x, z)
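A short numeric check of this identity (the feature map follows the slide; the test vectors are just illustrative choices):

```python
import numpy as np

def phi(x):
    """Explicit feature map whose inner product equals (x.z + 1)^2."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2, 1.0])

def K(x, z):
    """Degree-2 polynomial kernel, computed in the original 2-D space."""
    return (np.dot(x, z) + 1) ** 2

x, z = np.array([0.5, -1.0]), np.array([2.0, 3.0])
# Same number, but K never forms the 6-D vectors:
print(phi(x) @ phi(z), K(x, z))  # both equal 1.0
```

The kernel does two multiplications and a square; the explicit route builds two 6-dimensional vectors first, and the gap grows rapidly with input dimension and polynomial degree.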
Polynomial and Gaussian Kernels
K(x, z) = (x·z + 1)^p is called the polynomial kernel of degree p.
For p = 2, with 7,000 genes, using the kernel once: an inner product with 7,000 terms, then squaring.
Mapping explicitly to the high-dimensional space: calculating ~50,000,000 new features for both training instances, then taking the inner product (another ~50,000,000 terms to sum).
In general, using the kernel trick provides huge computational savings over explicit mapping!
Another common option: the Gaussian kernel (maps to an l-dimensional space, with l = number of training points):
K(x, z) = exp(−||x − z||² / 2σ²)
The Mercer Condition
Is there a mapping Φ(x) for any symmetric function K(x, z)? No.
The SVM dual formulation requires calculating K(x_i, x_j) for each pair of training instances. The matrix G_ij = K(x_i, x_j) is called the Gram matrix.
Theorem (Mercer 1908): There is a feature space Φ(x) iff the kernel is such that G is positive semi-definite.
Recall: M is PSD iff for all z ≠ 0, zᵀMz ≥ 0, iff M has non-negative eigenvalues
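The PSD property is easy to observe empirically; the sketch below (Gaussian kernel, random data, all choices illustrative) builds a Gram matrix and confirms its eigenvalues are non-negative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))  # 20 hypothetical training points in R^5

# Gram matrix G_ij = K(x_i, x_j) for the Gaussian kernel with sigma = 1
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
G = np.exp(-sq_dists / 2.0)

# Mercer: a valid kernel yields a positive semi-definite Gram matrix
eigenvalues = np.linalg.eigvalsh(G)
print("min eigenvalue:", eigenvalues.min())  # non-negative up to round-off
```

Replacing the Gaussian kernel with an arbitrary symmetric function (e.g. K(x, z) = −||x − z||) typically produces negative eigenvalues, i.e. no feature space exists.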
Support Vector Machines
Three main ideas:
1. Define what an optimal hyperplane is (taking into account that it needs to be computed efficiently): maximize margin
2. Generalize to non-linearly separable problems: have a penalty term for misclassifications
3. Map data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that data are mapped implicitly to this space
Complexity (for one implementation, Burges 98)
Notation: l training points of dimension d, N support vectors (N ≤ l)
When most SVs are not at the upper bound:
- O(N³ + N²l + Ndl) if N << l
- O(N³ + Nl + Ndl) if N ~ l
When most SVs are at the upper bound:
- O(N² + Ndl) if N << l
- O(dl²) if N ~ l
Other Types of Kernel Methods
SVMs that perform regression
SVMs that perform clustering
ν-Support Vector Machines: maximize margin while bounding the number of margin errors
Leave-One-Out Machines: minimize the bound on the leave-one-out error
SVM formulations that allow different costs of misclassification for different classes
Kernels suitable for sequences of strings, or other specialized kernels
Feature Selection with SVMs
Recursive Feature Elimination:
- Train a linear SVM
- Remove the x% of variables with the lowest weights (those variables affect classification the least)
- Retrain the SVM with the remaining variables and repeat until classification quality is reduced
Very successful
Other formulations exist where minimizing the number of variables is folded into the optimization problem
Similar algorithms for non-linear SVMs
Quite successful
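A minimal sketch of this procedure using scikit-learn's RFE with a linear SVM (the dataset is synthetic and all parameter choices are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

# Synthetic stand-in for expression data: 100 samples, 50 features, 5 informative
X, y = make_classification(n_samples=100, n_features=50,
                           n_informative=5, random_state=0)

# At each step: train a linear SVM, drop the 10% of features
# with the smallest |weight|, and retrain on the rest
svm = LinearSVC(dual=False, max_iter=10000)
selector = RFE(svm, n_features_to_select=5, step=0.1).fit(X, y)

print(selector.support_.sum())  # 5 features survive
```

Here `step=0.1` implements the "remove x% per round" rule from the slide; `ranking_` records the round in which each feature was eliminated.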
Why do SVMs Generalize?
Even though they map to a very high-dimensional space:
- they have a very strong bias in that space
- the solution has to be a linear combination of the training instances
Large theory on Structural Risk Minimization provides bounds on the error of an SVM
Typically the error bounds are too loose to be of practical use
Conclusions
SVMs formulate learning as a mathematical program, taking advantage of the rich theory in optimization
SVMs use kernels to map indirectly to extremely high-dimensional spaces
SVMs are extremely successful, robust, efficient, and versatile, and have a good theoretical basis
Vladimir Vapnik
Vladimir Naumovich Vapnik is one of the main developers of Vapnik-Chervonenkis theory. He was born in the Soviet Union. He received his master's degree in mathematics at the Uzbek State University, Samarkand, Uzbek SSR in 1958 and his Ph.D. in statistics at the Institute of Control Sciences, Moscow in 1964. He worked at this institute from 1961 to 1990 and became Head of the Computer Science Research Department. At the end of 1990, he moved to the USA and joined the Adaptive Systems Research Department at AT&T Bell Labs in Holmdel, New Jersey. The group later became the Image Processing Research Department of AT&T Laboratories when AT&T spun off Lucent Technologies in 1996. Vapnik left AT&T in 2002 and joined NEC Laboratories in Princeton, New Jersey, where he currently works in the Machine Learning group. He has also held a Professor of Computer Science and Statistics position at Royal Holloway, University of London since 1995, as well as an Adjunct Professor position at Columbia University, New York City since 2003. He was inducted into the U.S. National Academy of Engineering in 2006. He received the 2008 Paris Kanellakis Award. While at AT&T, Vapnik and his colleagues developed the theory of the support vector machine. They demonstrated its performance on a number of problems of interest to the machine learning community, including handwriting recognition.
http://en.wikipedia.org/wiki/Vladimir_Vapnik
Suggested Further Reading
http://www.kernel-machines.org/tutorial.html
http://www.svms.org/tutorials/ - many tutorials
C. J. C. Burges. "A Tutorial on Support Vector Machines for Pattern Recognition." Knowledge Discovery and Data Mining, 2(2), 1998.
E. Osuna, R. Freund, and F. Girosi. "Support vector machines: Training and applications." Technical Report AIM-1602, MIT A.I. Lab., 1996.
P.-H. Chen, C.-J. Lin, and B. Schölkopf. "A tutorial on nu-support vector machines." 2003.
N. Cristianini. ICML'01 tutorial, 2001.
K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. "An introduction to kernel-based learning algorithms." IEEE Transactions on Neural Networks, 12(2):181-201, May 2001.
B. Schölkopf. "SVM and kernel methods." Tutorial given at the NIPS Conference, 2001.
Hastie, Tibshirani, Friedman. The Elements of Statistical Learning. Springer, 2001.
Analysis of Microarray GE Data Using SVM
Brown, Grundy, Lin, Cristianini, Sugnet, Furey, Ares Jr., Haussler
PNAS 97(1):262-7 (2000)
Data
Expression patterns of n = 2,467 annotated yeast genes over m = 79 different conditions
Six gene functional classes: 5 related to transcript levels (tricarboxylic acid (TCA) cycle, respiration, cytoplasmic ribosomes, proteasome, histones) and 1 unrelated control (helix-turn-helix proteins).
For gene x, condition i:
E_i = level of x in tested condition i
R_i = level of x in reference condition i
Normalized pattern (X_1, …, X_m) of gene x:
X_i = log(E_i/R_i) / (Σ_k log²(E_k/R_k))^0.5
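The normalization can be sketched in a few lines (the expression values below are made up):

```python
import numpy as np

def normalize_gene(E, R):
    """X_i = log(E_i / R_i) / sqrt(sum_k log^2(E_k / R_k))"""
    log_ratios = np.log(np.asarray(E) / np.asarray(R))
    return log_ratios / np.sqrt((log_ratios ** 2).sum())

# Hypothetical tested/reference levels for one gene over 4 conditions
X = normalize_gene([2.0, 0.5, 1.0, 4.0], [1.0, 1.0, 1.0, 1.0])
print(np.linalg.norm(X))  # ~1.0: every gene's pattern has unit length
```

Dividing by the root sum of squared log-ratios puts every gene on the unit sphere, so classification depends on the shape of the expression profile rather than its magnitude.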
Goal
Classify genes based on gene expression
Tried SVM and other classifiers
(Figure: maximum-margin separation, margin 1/||w||; from www.contrib.andrew.cmu.edu/~jin/ir_proj/)
Kernel Functions Used
Simplest: K(X, Y) = X·Y + 1 (dot product; linear kernel)
Kernel of degree d: K(X, Y) = (X·Y + 1)^d
Radial basis (Gaussian) kernel: exp(−||X − Y||² / 2σ²)
n+ / n−: number of positive / negative examples
Problem: n+ << n−
Overcoming imbalance - modify K's diagonal:
K_ii = K(X_i, X_i) + c/n+ for positive examples
K_ii = K(X_i, X_i) + c/n− for negative examples
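A sketch of this diagonal modification using a precomputed kernel matrix (the data and the constant c are invented for illustration):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Imbalanced toy data: 5 positives vs. 45 negatives
X = np.vstack([rng.normal( 2.0, 1.0, (5, 3)),
               rng.normal(-2.0, 1.0, (45, 3))])
y = np.array([1] * 5 + [-1] * 45)

G = X @ X.T + 1.0                 # linear kernel K(X, Y) = X.Y + 1
c = 1.0
n_pos, n_neg = (y == 1).sum(), (y == -1).sum()
# Add c/n+ to the diagonal entries of positives, c/n- to those of negatives
G[np.diag_indices_from(G)] += np.where(y == 1, c / n_pos, c / n_neg)

clf = SVC(kernel="precomputed").fit(G, y)
print((clf.predict(G) == y).mean())  # training accuracy
```

Since n+ << n−, the positives get a larger diagonal boost, which effectively softens the margin less for the rare class and counters the imbalance.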
Measuring Performance

               True +   True −
Classifier +     TP       FP
Classifier −     FN       TN

The imbalance problem: very few positives
Performance of method M: C(M) = FP + 2·FN
C(N) = cost of classifying all as negatives
S(M) = C(N) − C(M) (how much we save by using the classifier)
3-way cross-validation: 2/3 learn, 1/3 test
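The savings score is simple arithmetic; here it is checked against one row of the TCA results:

```python
def savings(tp, fp, fn, tn):
    """S(M) = C(N) - C(M) with C(M) = FP + 2*FN.
    C(N): classifying everything as negative turns all TP + FN true
    positives into false negatives, so C(N) = 2 * (TP + FN)."""
    return 2 * (tp + fn) - (fp + 2 * fn)

# D-p-1-SVM on the TCA class: FP=18, FN=5, TP=12, TN=2432
print(savings(tp=12, fp=18, fn=5, tn=2432))  # 6, matching the table
```

Note TN never enters the score: with very few positives, counting correct negatives would swamp the signal.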
Results: TCA Class

Method       FP  FN  TP    TN   S(M)
D-p-1-SVM    18   5  12  2,432    6
D-p-2-SVM     7   9   8  2,443    9
D-p-3-SVM     4   9   8  2,446   12
Radial-SVM    5   9   8  2,445   11
Parzen        4  12   5  2,446    6
FLD           9  10   7  2,441    5
C4.5          7  17   0  2,443   -7
MOC1          3  16   1  2,446   -1

D-p-d-SVM: dot-product kernel of degree d
Other methods used: Parzen windows, Fisher linear discriminant (FLD), C4.5 + MOC1: decision trees
Results: Ribo Class

Method       FP  FN   TP    TN   S(M)
D-p-1-SVM    14   2  119  2,332   224
D-p-2-SVM     9   2  119  2,337   229
D-p-3-SVM     7   3  118  2,339   229
Radial-SVM    6   5  116  2,340   226
Parzen        6   8  113  2,340   220
FLD          15   5  116  2,331   217
C4.5         31  21  100  2,315   169
MOC1         26  26   95  2,320   164
Results: Summary
SVM outperformed the other methods
Either high-dimensional dot-product or Gaussian kernels worked best
Insensitive to the specific cost weighting
Consistently misclassified genes require special attention
Expression does not always reflect protein levels and post-translational modifications
Can use classifiers for functional annotation
David Haussler
Gene Selection via the BAHSIC Family of Algorithms
Le Song, Justin Bedo, Karsten M. Borgwardt, Arthur Gretton, Alex Smola
ISMB '07
Testing
15 two-class datasets (mostly cancer), 2K-25K genes, 50-300 samples
10-fold cross-validation
Selected the 10 top features according to each method (pc = Pearson's correlation, snr = signal-to-noise ratio, pam = shrunken centroid, t = t-statistics, m-t = moderated t-statistics, lods = B-statistics, lin = centroid, RBF = SVM with Gaussian kernel, rfe = SVM recursive feature elimination, l1 = L1-norm SVM, mi = mutual information)
Selection method RFE: train, remove the 10% of features that are least relevant, repeat.
(Table: for each method, classification error %, overlap between the 10 genes selected in each fold, L2 distance from the best method, and the number of times the algorithm was best. The linear kernel has the best overall performance.)
Multiclass Datasets
In a similar comparison on 13 multiclass datasets, the linear kernel was again best.
Rules of Thumb
Always apply the linear kernel for general-purpose gene selection
Apply a Gaussian kernel if nonlinear effects are present, such as multimodality or complementary effects of different genes
Not a big surprise, given the high dimension of microarray datasets, but the point is driven home by broad experimentation.