Support Vector Machines


Support Vector Machines
The decision surface is a hyperplane (a line in 2D) in feature space (similar to the Perceptron). Arguably the most important recent discovery in machine learning.
In a nutshell: map the data to a predetermined, very high-dimensional space via a kernel function, and find the hyperplane that maximizes the margin between the two classes. If the data are not separable, find the hyperplane that maximizes the margin and minimizes a weighted average of the misclassifications.
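As a minimal sketch of this pipeline (assuming scikit-learn is available; the toy dataset and parameter values are purely illustrative), training and applying a kernelized, soft-margin SVM looks like this:

```python
# Minimal sketch: train a kernelized, soft-margin SVM and classify new points.
# Assumes scikit-learn; dataset and parameter values are illustrative.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)  # two non-separable classes
clf = SVC(kernel="rbf", C=1.0, gamma="scale")  # C trades margin width vs. misclassifications
clf.fit(X, y)
print(clf.predict(X[:5]))           # predicted class labels
print(clf.support_vectors_.shape)   # the training points that define the margin
```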

Support Vector Machines
Three main ideas:
1. Define what an optimal hyperplane is (in a way that can be identified in a computationally efficient way): maximize margin.
2. Extend the above definition for non-linearly separable problems: have a penalty term for misclassifications.
3. Map data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that data are mapped implicitly to this space (kernel).

Which Separating Hyperplane to Use?
[Figure: several candidate separating hyperplanes between the two classes in the Var 1 / Var 2 plane.]

Maximizing the Margin
IDEA 1: Select the separating hyperplane that maximizes the margin!
[Figure: the margin width around the separating hyperplane in the Var 1 / Var 2 plane.]

Support Vectors
[Figure: the training points lying on the margin boundaries are the support vectors; margin width shown in the Var 1 / Var 2 plane.]

Setting Up the Optimization Problem
The margin is bounded by the hyperplanes $w \cdot x + b = k$ and $w \cdot x + b = -k$, with the separating hyperplane $w \cdot x + b = 0$ between them; the width of the margin is $2k / \|w\|$.
So the problem is:
$$\max \frac{2k}{\|w\|} \quad \text{s.t.} \quad w \cdot x_i + b \ge k \ \ \forall x_i \text{ of class 1}, \qquad w \cdot x_i + b \le -k \ \ \forall x_i \text{ of class 2}$$
[Figure: the two margin hyperplanes and the separating hyperplane in the Var 1 / Var 2 plane.]

Setting Up the Optimization Problem
There is a scale and unit for the data so that $k = 1$. The problem then becomes:
$$\max \frac{2}{\|w\|} \quad \text{s.t.} \quad w \cdot x_i + b \ge 1 \ \ \forall x_i \text{ of class 1}, \qquad w \cdot x_i + b \le -1 \ \ \forall x_i \text{ of class 2}$$

Setting Up the Optimization Problem
If class 1 corresponds to $y_i = 1$ and class 2 corresponds to $y_i = -1$, we can rewrite the constraints as
$$w \cdot x_i + b \ge 1 \ \ \forall x_i \text{ with } y_i = 1, \qquad w \cdot x_i + b \le -1 \ \ \forall x_i \text{ with } y_i = -1,$$
which is equivalent to $y_i (w \cdot x_i + b) \ge 1$ for all $x_i$. So the problem becomes:
$$\max \frac{2}{\|w\|} \ \ \text{s.t.} \ \ y_i (w \cdot x_i + b) \ge 1 \ \forall x_i \qquad \text{or} \qquad \min \frac{1}{2}\|w\|^2 \ \ \text{s.t.} \ \ y_i (w \cdot x_i + b) \ge 1 \ \forall x_i$$

Linear, Hard-Margin SVM Formulation
Find $w, b$ that solve
$$\min \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 \ \forall x_i$$
The problem is convex, so there is a unique global minimum value (when feasible). There is also a unique minimizer, i.e. the weight vector $w$ and offset $b$ that provide that minimum. The problem is not solvable if the data are not linearly separable. This is a quadratic program, which is very efficient computationally with modern constrained-optimization engines (it handles thousands of constraints and training instances).
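Hard-margin behavior can be approximated by making the misclassification penalty effectively infinite; a minimal sketch (assuming scikit-learn and a linearly separable toy dataset, with an illustrative choice of C):

```python
# Sketch: approximate a hard-margin linear SVM with a very large C.
# Assumes scikit-learn; make_blobs gives a linearly separable toy problem.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=0.8, random_state=1)
clf = SVC(kernel="linear", C=1e10)  # huge C ~ hard margin (essentially no slack tolerated)
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
margin_width = 2.0 / np.linalg.norm(w)  # 2 / ||w||, as in the formulation above
print("w =", w, "b =", b, "margin width =", margin_width)
```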

Support Vector Machines
Three main ideas:
1. Define what an optimal hyperplane is (in a way that can be identified in a computationally efficient way): maximize margin.
2. Extend the above definition for non-linearly separable problems: have a penalty term for misclassifications.
3. Map data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that data are mapped implicitly to this space.

Non-Linearly Separable Data
Introduce slack variables $\xi_i$: allow some instances to fall within the margin, but penalize them.
[Figure: the margin hyperplanes $w \cdot x + b = \pm 1$ and the separating hyperplane $w \cdot x + b = 0$ in the Var 1 / Var 2 plane, with the slacks $\xi_i$ measuring the margin violations.]

Formulating the Optimization Problem
The constraint becomes:
$$y_i (w \cdot x_i + b) \ge 1 - \xi_i \ \ \forall x_i, \qquad \xi_i \ge 0$$
The objective function penalizes misclassified instances and those within the margin:
$$\min \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$$
$C$ trades off margin width against misclassifications.

Linear, Soft-Margin SVMs
$$\min \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i \ \forall x_i, \ \ \xi_i \ge 0$$
The algorithm tries to keep the $\xi_i$ at zero while maximizing the margin. Notice: the algorithm does not minimize the number of misclassifications (an NP-complete problem) but the sum of distances from the margin hyperplanes. Other formulations use $\xi_i^2$ instead. As $C \to \infty$, we get closer to the hard-margin solution.
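The role of $C$ is easy to see empirically; a small hedged sketch (scikit-learn, toy data, illustrative values of C) comparing a loose and a tight penalty:

```python
# Sketch: the soft-margin parameter C trades margin width against training violations.
# Small C -> wide margin, more slack; large C -> narrow margin, approaches hard margin.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.5, random_state=2)
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    width = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:>6}: margin width={width:.3f}, support vectors={len(clf.support_)}")
```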

Robustness of Soft vs. Hard Margin SVMs
[Figure: two panels in the Var 1 / Var 2 plane comparing the separating hyperplane $w \cdot x + b = 0$ found by a soft-margin SVM (which absorbs a violating point through its slack $\xi_i$) and by a hard-margin SVM.]

Soft vs. Hard Margin SVM
- Soft-margin always has a solution.
- Soft-margin is more robust to outliers and gives smoother surfaces (in the non-linear case).
- Hard-margin does not require guessing the cost parameter (it requires no parameters at all).

Support Vector Machines
Three main ideas:
1. Define what an optimal hyperplane is (in a way that can be identified in a computationally efficient way): maximize margin.
2. Extend the above definition for non-linearly separable problems: have a penalty term for misclassifications.
3. Map data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that data are mapped implicitly to this space.

Disadvantages of Linear Decision Surfaces
[Figure: a dataset in the Var 1 / Var 2 plane that no linear decision surface separates well.]

Advantages of Non-Linear Surfaces
[Figure: data in the Var 1 / Var 2 plane separated cleanly by a non-linear decision surface.]

Linear Classifiers in High-Dimensional Spaces
Find a function $\Phi(x)$ to map to a different space.
[Figure: the original Var 1 / Var 2 space mapped to a space of constructed features (Constructed Feature 1 / Constructed Feature 2) where a linear classifier separates the classes.]

Mapping Data to a High-Dimensional Space
Find a function $\Phi(x)$ to map to a different space; the SVM formulation then becomes:
$$\min \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i (w \cdot \Phi(x_i) + b) \ge 1 - \xi_i \ \forall x_i, \ \ \xi_i \ge 0$$
The data appear as $\Phi(x_i)$, and the weights $w$ are now weights in the new space. The explicit mapping is expensive if $\Phi(x)$ is very high dimensional, so solving the problem without explicitly mapping the data is desirable.

The Dual of the SVM Formulation
Original SVM formulation: $n$ inequality constraints, $n$ positivity constraints, $n$ slack variables $\xi_i$:
$$\min_{w, b} \ \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i (w \cdot \Phi(x_i) + b) \ge 1 - \xi_i \ \forall x_i, \ \ \xi_i \ge 0$$
The (Wolfe) dual of this problem: one equality constraint, $n$ positivity constraints, $n$ variables $\alpha_i$ (Lagrange multipliers), and a more complicated objective function:
$$\min_\alpha \ \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \big(\Phi(x_i) \cdot \Phi(x_j)\big) - \sum_i \alpha_i \quad \text{s.t.} \quad C \ge \alpha_i \ge 0 \ \forall x_i, \ \ \sum_i \alpha_i y_i = 0$$
NOTICE: the data only appear as $\Phi(x_i) \cdot \Phi(x_j)$.
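On a tiny example the dual can be handed to a generic optimizer; a hedged sketch (numpy + scipy, toy data, a linear kernel so $\Phi(x) = x$), not how production SVM solvers actually work (those use specialized QP/SMO methods):

```python
# Sketch: solve the dual QP directly on a tiny toy problem (illustrative only).
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.5], [0.0, 0.0], [-1.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C = 10.0
K = X @ X.T                      # linear kernel: Phi(x) = x, so K_ij = x_i . x_j

def dual_objective(a):
    return 0.5 * np.sum(np.outer(a * y, a * y) * K) - np.sum(a)

res = minimize(dual_objective, np.zeros(len(y)), method="SLSQP",
               bounds=[(0.0, C)] * len(y),
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x
w = (alpha * y) @ X              # w = sum_i alpha_i y_i x_i (valid for the linear kernel)
sv = np.argmax((alpha > 1e-6) & (alpha < C - 1e-6))
b = y[sv] - w @ X[sv]            # b from a margin support vector: y_j (w . x_j + b) = 1
print("alpha =", np.round(alpha, 3), "w =", w, "b =", round(float(b), 3))
```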

The Kernel Trick
$\Phi(x_i) \cdot \Phi(x_j)$ means: map the data into the new space, then take the inner product of the new vectors. We can find a function such that $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$, i.e., the image of the inner product of the data is the inner product of the images of the data. Then we do not need to explicitly map the data into the high-dimensional space to solve the optimization problem (for training).
How do we classify without explicitly mapping the new instances? It turns out that
$$\operatorname{sign}(w \cdot \Phi(x) + b) = \operatorname{sign}\Big(\sum_i \alpha_i y_i K(x_i, x) + b\Big),$$
where $b$ solves $\alpha_j \big( y_j (\sum_i \alpha_i y_i K(x_i, x_j) + b) - 1 \big) = 0$ for any $j$ with $\alpha_j \ne 0$.
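This kernel expansion is exactly what a trained SVM stores; a sketch (scikit-learn, RBF kernel, toy data, illustrative parameters) checking that the support-vector expansion reproduces the library's own decision values:

```python
# Sketch: classify via the kernel expansion sign(sum_i alpha_i y_i K(x_i, x) + b)
# using the support vectors stored by a fitted SVC. Assumes scikit-learn.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=150, noise=0.2, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)

X_new = X[:5]
K = rbf_kernel(X_new, clf.support_vectors_, gamma=0.5)     # K(x, x_i) for each support vector
scores = K @ clf.dual_coef_[0] + clf.intercept_[0]          # dual_coef_ holds alpha_i * y_i
print(np.allclose(scores, clf.decision_function(X_new)))    # True: same decision values
print(np.sign(scores))                                       # sign gives the predicted side
```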

Examples of Kernels
Assume we measure two quantities, e.g. the expression levels of the genes TrkC and SonicHedgehog (SH), and we use the mapping:
$$\Phi : \langle x_{TrkC}, x_{SH} \rangle \to \{\, x_{TrkC}^2,\ x_{SH}^2,\ \sqrt{2}\, x_{TrkC} x_{SH},\ \sqrt{2}\, x_{TrkC},\ \sqrt{2}\, x_{SH},\ 1 \,\}$$
Consider the function $K(x, z) = (x \cdot z + 1)^2$. We can verify that:
$$\Phi(x) \cdot \Phi(z) = x_{TrkC}^2 z_{TrkC}^2 + x_{SH}^2 z_{SH}^2 + 2 x_{TrkC} x_{SH} z_{TrkC} z_{SH} + 2 x_{TrkC} z_{TrkC} + 2 x_{SH} z_{SH} + 1 = (x_{TrkC} z_{TrkC} + x_{SH} z_{SH} + 1)^2 = (x \cdot z + 1)^2 = K(x, z)$$
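The identity is easy to check numerically; a small sketch (numpy, with arbitrary 2-D vectors standing in for the two expression levels):

```python
# Sketch: verify that the explicit degree-2 mapping and the kernel (x.z + 1)^2 agree.
import numpy as np

def phi(v):  # v = (x_TrkC, x_SH)
    a, b = v
    return np.array([a**2, b**2, np.sqrt(2)*a*b, np.sqrt(2)*a, np.sqrt(2)*b, 1.0])

x = np.array([0.7, -1.3])
z = np.array([2.1, 0.4])
print(phi(x) @ phi(z))          # inner product in the mapped space
print((x @ z + 1.0) ** 2)       # kernel evaluated in the original space -> same value
```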

Polynomial and Gaussian Kernels
$K(x, z) = (x \cdot z + 1)^p$ is called the polynomial kernel of degree $p$. For $p = 2$, if we measure 7,000 genes, using the kernel once means calculating a summation product with 7,000 terms and then taking the square of that number. Mapping explicitly to the high-dimensional space means calculating approximately 50,000,000 new features for both training instances, then taking the inner product of those (another 50,000,000 terms to sum). In general, using the kernel trick provides huge computational savings over explicit mapping!
Another commonly used kernel is the Gaussian (it maps to a space with a number of dimensions equal to the number of training cases):
$$K(x, z) = \exp\!\big(-\|x - z\|^2 / 2\sigma^2\big)$$
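For reference, both kernels are one-liners; a minimal sketch (numpy, illustrative parameter values):

```python
# Sketch: the two kernels from this slide as plain functions of two vectors.
import numpy as np

def polynomial_kernel(x, z, p=2):
    return (x @ z + 1.0) ** p                                   # (x.z + 1)^p

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))   # exp(-||x-z||^2 / 2 sigma^2)

x, z = np.random.default_rng(0).normal(size=(2, 7000))  # e.g. 7,000 gene expression levels
print(polynomial_kernel(x, z), gaussian_kernel(x, z))
```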

The Mercer Condition
Is there a mapping $\Phi(x)$ for any symmetric function $K(x, z)$? No. The SVM dual formulation requires calculating $K(x_i, x_j)$ for each pair of training instances. The array $G_{ij} = K(x_i, x_j)$ is called the Gram matrix. There is a feature space $\Phi(x)$ when the kernel is such that $G$ is always positive semi-definite (the Mercer condition).
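A quick numerical sanity check of the Mercer condition (a sketch using numpy and scikit-learn's pairwise kernels on random data): the Gram matrix of a valid kernel has no significantly negative eigenvalues.

```python
# Sketch: check that the RBF kernel's Gram matrix is positive semi-definite
# on a random sample (all eigenvalues >= 0 up to numerical noise).
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

X = np.random.default_rng(0).normal(size=(50, 10))
G = rbf_kernel(X, X, gamma=0.1)         # Gram matrix G_ij = K(x_i, x_j)
eigenvalues = np.linalg.eigvalsh(G)     # G is symmetric, so eigvalsh applies
print(eigenvalues.min() >= -1e-10)      # True: Mercer condition holds empirically
```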

Support Vector Machines
Three main ideas:
1. Define what an optimal hyperplane is (in a way that can be identified in a computationally efficient way): maximize margin.
2. Extend the above definition for non-linearly separable problems: have a penalty term for misclassifications.
3. Map data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that data are mapped implicitly to this space.

Other Types of Kernel Methods
- SVMs that perform regression
- SVMs that perform clustering
- ν-Support Vector Machines: maximize the margin while bounding the number of margin errors
- Leave-One-Out Machines: minimize the bound on the leave-one-out error
- SVM formulations that take into consideration the difference in misclassification cost for the different classes
- Kernels suitable for sequences of strings, or other specialized kernels
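Several of these variants ship with common libraries; a hedged sketch (scikit-learn, toy data, illustrative parameters) of the regression and ν-formulation variants:

```python
# Sketch: kernelized support vector regression (SVR) and a nu-SVM classifier.
import numpy as np
from sklearn.svm import SVR, NuSVC

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)
reg = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)   # regression with an eps-insensitive tube
print(reg.predict([[0.5]]))

Xc = rng.normal(size=(100, 2))
yc = (Xc[:, 0] + Xc[:, 1] > 0).astype(int)
clf = NuSVC(nu=0.2, kernel="rbf").fit(Xc, yc)            # nu bounds the fraction of margin errors
print(clf.score(Xc, yc))
```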

Variable Selection with SVMs
Recursive Feature Elimination:
- Train a linear SVM.
- Remove the variables with the lowest weights (those variables affect classification the least), e.g., remove the lowest 50% of variables.
- Retrain the SVM with the remaining variables and repeat until classification performance is reduced.
Very successful. Other formulations exist where minimizing the number of variables is folded into the optimization problem, and similar algorithms exist for non-linear SVMs. These are some of the best and most efficient variable selection methods.
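This loop is available pre-packaged; a minimal sketch (scikit-learn's RFE wrapper around a linear SVM, halving the feature set at each step, on synthetic data where only a few features are informative):

```python
# Sketch: SVM-based recursive feature elimination, dropping 50% of the
# remaining features per iteration until 5 are left.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=50, n_informative=5, random_state=0)
selector = RFE(estimator=LinearSVC(C=1.0, max_iter=10000),
               n_features_to_select=5, step=0.5)    # step=0.5 removes 50% each round
selector.fit(X, y)
print(selector.get_support(indices=True))           # indices of the surviving features
```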

Comparison with Neural Networks
Neural Networks:
- Hidden layers map to lower-dimensional spaces
- Search space has multiple local minima
- Training is expensive
- Classification is extremely efficient
- Requires choosing the number of hidden units and layers
- Very good accuracy in typical domains
SVMs:
- The kernel maps to a very high-dimensional space
- Search space has a unique minimum
- Training is extremely efficient
- Classification is extremely efficient
- The kernel and the cost are the two parameters to select
- Very good accuracy in typical domains
- Extremely robust

Why Do SVMs Generalize?
Even though they map to a very high-dimensional space, they have a very strong bias in that space: the solution has to be a linear combination of the training instances. There is a large theory on Structural Risk Minimization providing bounds on the error of an SVM; typically, though, these error bounds are too loose to be of practical use.

Multiclass SVMs
One-versus-all: train n binary classifiers, one for each class against all other classes. The predicted class is the class of the most confident classifier.
One-versus-one: train n(n-1)/2 classifiers, each discriminating between a pair of classes. Several strategies exist for selecting the final classification based on the outputs of the binary SVMs.
Truly multiclass SVMs: generalize the SVM formulation to multiple categories.
More on that in the paper nominated for the student paper award: "Methods for Multi-Category Cancer Diagnosis from Gene Expression Data: A Comprehensive Evaluation to Inform Decision Support System Development", Alexander Statnikov, Constantin F. Aliferis, Ioannis Tsamardinos.
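Both decompositions are standard wrappers; a minimal sketch (scikit-learn, the classic three-class Iris data, illustrative parameters) comparing one-versus-all and one-versus-one around a linear SVM:

```python
# Sketch: multiclass SVMs via one-versus-all (rest) and one-versus-one decompositions.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)                                        # 3 classes
ova = OneVsRestClassifier(LinearSVC(C=1.0, max_iter=10000)).fit(X, y)    # 3 binary SVMs
ovo = OneVsOneClassifier(LinearSVC(C=1.0, max_iter=10000)).fit(X, y)     # 3*2/2 = 3 pairwise SVMs
print(ova.score(X, y), ovo.score(X, y))
print(len(ova.estimators_), len(ovo.estimators_))   # number of underlying binary classifiers
```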

Conclusions
SVMs express learning as a mathematical program, taking advantage of the rich theory in optimization. SVMs use the kernel trick to map indirectly to extremely high-dimensional spaces. SVMs are extremely successful, robust, efficient, and versatile, and there are good theoretical indications as to why they generalize well.

Suggested Further Reading
- http://www.kernel-machines.org/tutorial.html
- C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998.
- P.-H. Chen, C.-J. Lin, and B. Schölkopf. A Tutorial on ν-Support Vector Machines. 2003.
- N. Cristianini. ICML'01 tutorial, 2001.
- K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An Introduction to Kernel-Based Learning Algorithms. IEEE Transactions on Neural Networks, 12(2):181-201, May 2001.
- B. Schölkopf. SVM and Kernel Methods, 2001. Tutorial given at the NIPS Conference.
- Hastie, Tibshirani, Friedman. The Elements of Statistical Learning. Springer, 2001.