9. Support Vector Machines


1 Foundations of Machine Learning, École Centrale Paris, Fall 2015. 9. Support Vector Machines. Chloé-Agathe Azencott, Centre for Computational Biology, Mines ParisTech, chloe-agathe.azencott@mines-paristech.fr

Learning objectives: Define a large-margin classifier in the separable case. Write the corresponding primal and dual optimization problems. Re-write the optimization problem in the case of non-separable data. Use the kernel trick to apply soft-margin SVMs to non-linear cases. Define kernels for real-valued data, strings, and graphs.

The linearly separable case: hard-margin SVMs. Assume the data is linearly separable: there exists a line that separates + from -.

2 Margin of a linear classifier. Largest-margin classifier: support vector machines.

Formalization. Training set {(x_i, y_i)}, i = 1, ..., n, with x_i in R^p and y_i in {-1, +1}. Assume the data to be linearly separable. Goal: find (w*, b*) that define the hyperplane <w, x> + b = 0 with the largest margin. A minimal numpy sketch of the margin computation follows.
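The geometric margin of a hyperplane <w, x> + b = 0 is the distance from the hyperplane to the closest training point, min_i y_i(<w, x_i> + b) / ||w||. A minimal numpy sketch; the points, w, and b below are hypothetical, chosen for illustration:

```python
import numpy as np

# Hypothetical linearly separable toy data: two points per class.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1, 1, -1, -1])

# A candidate separating hyperplane <w, x> + b = 0.
w = np.array([1.0, 1.0])
b = 0.0

# Geometric margin: distance from the hyperplane to the closest point.
gamma = np.min(y * (X @ w + b)) / np.linalg.norm(w)
print(gamma)  # ~1.414 here: the point (-1, -1) is closest
```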

3 What is the size of the margin γ? For the largest-margin hyperplane with the canonical constraints y_i(<w, x_i> + b) >= 1, the margin is γ = 1/||w||.

Optimization problem. Margin maximization: minimize (1/2)||w||². Correct classification of the training points: for negative examples, <w, x_i> + b <= -1; for positive examples, <w, x_i> + b >= 1; summarized as y_i(<w, x_i> + b) >= 1. This is a classic quadratic optimization problem.

Karush-Kuhn-Tucker conditions: minimize f(w) under the constraint g(w) <= 0. Case 1: the unconstrained minimum lies in the feasible region. Case 2: it does not, and the solution lies on the boundary g(w) = 0.
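Because the primal is a convex quadratic program, it can be handed directly to a generic convex solver. A hedged sketch using cvxpy (not part of the lecture; the toy data is hypothetical, as above):

```python
import cvxpy as cp
import numpy as np

# Hypothetical toy data, as in the margin example above.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# Hard-margin primal: minimize (1/2)||w||^2  s.t.  y_i (<w, x_i> + b) >= 1.
constraints = [cp.multiply(y, X @ w + b) >= 1]
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
problem.solve()

print(w.value, b.value)             # maximum-margin hyperplane (w*, b*)
print(1 / np.linalg.norm(w.value))  # the margin gamma = 1/||w*||
```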

4 Karush-Kuhn-Tucker conditions: minimize f(w) under the constraint g(w) <= 0. Lagrangian: L(w, α) = f(w) + α g(w); α >= 0 is called the Lagrange multiplier. To minimize f(w) under the n constraints g_i(w) <= 0, use n Lagrange multipliers: L(w, α) = f(w) + Σ_i α_i g_i(w).

Duality. Lagrange dual function: q(α) = min_w L(w, α). q is concave in α (even if L is not convex). The dual function yields lower bounds on the optimal value of the primal problem when α >= 0.

5 Duality. Primal problem: minimize f s.t. g(x) <= 0. Lagrange dual problem: maximize q(α) s.t. α >= 0. Weak duality: if f* optimizes the primal and d* optimizes the dual, then d* <= f*. Always holds. Strong duality: f* = d*. Holds under specific conditions (constraint qualifications), e.g. Slater's condition: f convex and the constraints affine.

Back to hard-margin SVMs. Minimize (1/2)||w||² under the n constraints y_i(<w, x_i> + b) >= 1. We introduce one dual variable α_i for each constraint (i.e. each training point). Lagrangian: L(w, b, α) = (1/2)||w||² - Σ_i α_i [y_i(<w, x_i> + b) - 1].

Lagrangian of the SVM. L(w, b, α) is convex quadratic in w and minimized for ∇_w L = 0, i.e. w = Σ_i α_i y_i x_i. L(w, b, α) is affine in b; its minimum is -∞ except if Σ_i α_i y_i = 0.
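Since the SVM primal is a convex quadratic with affine constraints, strong duality holds and can be checked numerically. A sketch of the SVM dual in cvxpy, using the identity Σ_ij α_i α_j y_i y_j <x_i, x_j> = ||Σ_i α_i y_i x_i||² so the solver sees a concave objective (same hypothetical toy data as above):

```python
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

alpha = cp.Variable(n)
# q(alpha) = sum_i alpha_i - (1/2) || sum_i alpha_i y_i x_i ||^2
objective = cp.Maximize(cp.sum(alpha)
                        - 0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y)))
constraints = [alpha >= 0, y @ alpha == 0]
dual = cp.Problem(objective, constraints)
dual.solve()

# Stationarity condition: w* = sum_i alpha_i* y_i x_i.
w_star = X.T @ (alpha.value * y)
print(dual.value)                        # dual optimum d*
print(0.5 * np.sum(w_star ** 2))         # primal optimum f*: equal (strong duality)
```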

6 SVM dual problem. Lagrange dual function: q(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j <x_i, x_j>. Dual problem: maximize q(α) subject to α_i >= 0 and Σ_i α_i y_i = 0. Maximizing a quadratic function under box constraints can be solved efficiently using dedicated software.

Optimal hyperplane. Once the optimal α* is found, we recover (w*, b*): w* = Σ_i α_i* y_i x_i, and b* from any active constraint. The decision function is hence f(x) = <w*, x> + b* = Σ_i α_i* y_i <x_i, x> + b*.

KKT conditions: either α_i = 0 or g_i = 0. Support vectors: the points with α_i > 0, which lie exactly on the margin; points with α_i = 0 do not contribute to the decision function.
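In practice the dual is solved by such dedicated software; scikit-learn's SVC, for instance, exposes the support vectors, the products α_i y_i, and the recovered (w*, b*). A sketch; SVC always solves the soft-margin problem, so a very large C is used here to approximate the hard margin:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C ~ hard margin

print(clf.support_vectors_)               # the points with alpha_i > 0
print(clf.dual_coef_)                     # alpha_i * y_i for the support vectors
print(clf.coef_, clf.intercept_)          # w* and b*, recovered from alpha*
print(clf.decision_function([[0.0, 5.0]]))  # f(x) = <w*, x> + b*
```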

7 The non-linearly separable case: soft-margin SVMs.

Soft-margin SVMs. Find a trade-off between large margin and few errors. Error: a penalty for each training point on the wrong side of the margin. The soft-margin SVM solves: minimize (1/2)||w||² + C Σ_i error(f(x_i), y_i).

The C parameter. Large C makes few errors. Small C ensures a large margin. Intermediate C finds a trade-off.
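The trade-off is easy to observe empirically: as C grows, the margin 1/||w|| shrinks and the training error drops. A sketch on hypothetical overlapping Gaussian blobs:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical overlapping classes: two Gaussian blobs.
X = np.vstack([rng.normal(-1, 1.2, (50, 2)), rng.normal(1, 1.2, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 1 / np.linalg.norm(clf.coef_)   # half-width of the margin
    err = np.mean(clf.predict(X) != y)       # training error
    print(f"C={C:8}: margin={margin:.2f}, training error={err:.2f}")
```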

8 It is important to control C: as C grows, the prediction error on training data decreases, while the error on new data passes through a minimum (with λ = 1/C playing the role of a regularization parameter). [Figure: prediction error on training data and on new data as a function of C.]

Hinge loss. Hinge loss function: l_hinge(f(x), y) = max(0, 1 - y f(x)). [Figure: the hinge loss as a function of y f(x).]

Slack variables. Minimizing (1/2)||w||² + C Σ_i l_hinge(f(x_i), y_i) is equivalent to: minimize (1/2)||w||² + C Σ_i ξ_i subject to y_i(<w, x_i> + b) >= 1 - ξ_i and ξ_i >= 0, where ξ_i is a slack variable.
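For reference, the hinge loss is a one-liner:

```python
import numpy as np

def hinge_loss(f_x, y):
    """l_hinge(f(x), y) = max(0, 1 - y f(x))."""
    return np.maximum(0.0, 1.0 - y * f_x)

# Zero loss once the point is beyond the margin (y f(x) >= 1),
# then a linear penalty as y f(x) decreases below 1.
print(hinge_loss(np.array([2.0, 1.0, 0.5, -1.0]), 1))  # [0. 0. 0.5 2.]
```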

9 Dual formulation of the soft-margin SVM. Maximize q(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j <x_i, x_j> under the constraints 0 <= α_i <= C and Σ_i α_i y_i = 0, with the corresponding KKT conditions.

Support vectors of the soft-margin SVM. α_i = 0: easy points, outside the margin. 0 < α_i < C: somewhat hard points, exactly on the margin. α_i = C: hard points, inside the margin or misclassified.

Primal vs. dual. Primal: (w, b) has dimension (p+1). Favored if the data is low-dimensional. Dual: α has dimension n. Favored if there is little data available.
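These categories can be read off a fitted model. In scikit-learn, dual_coef_ stores α_i y_i for the support vectors, so (up to numerical tolerance) |dual_coef_| < C identifies points exactly on the margin and |dual_coef_| = C identifies margin violators. A sketch, reusing the hypothetical blobs from above:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.2, (50, 2)), rng.normal(1, 1.2, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)
alpha = np.abs(clf.dual_coef_).ravel()   # alpha_i for the support vectors

on_margin = np.sum(alpha < C - 1e-8)     # 0 < alpha_i < C: exactly on the margin
violators = np.sum(alpha >= C - 1e-8)    # alpha_i = C: inside the margin or misclassified
print(on_margin, violators)
```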

10 The non-linear case: kernel SVMs.

Non-linear mapping to a feature space: map the data from R² to a feature space in which it becomes linearly separable.

Kernels. For a given mapping φ from the space of objects X to some Hilbert space H, the kernel between two objects x and x' is the inner product of their images in the feature space: K(x, x') = <φ(x), φ(x')>. Kernels allow us to formalize the notion of similarity.
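A concrete instance of such a mapping (my choice for illustration, not necessarily the one on the slide): φ(x₁, x₂) = (x₁², x₂², √2 x₁x₂) sends R² to R³, where data separated by a circle becomes separable by a hyperplane, and the corresponding kernel is simply K(x, x') = <x, x'>². A sketch on hypothetical ring-shaped data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data, not linearly separable in R^2:
# one class inside a circle, the other outside it.
theta = rng.uniform(0, 2 * np.pi, 100)
r = np.concatenate([rng.uniform(0, 1, 50), rng.uniform(2, 3, 50)])
X = np.c_[r * np.cos(theta), r * np.sin(theta)]
y = np.array([-1] * 50 + [1] * 50)

def phi(X):
    """Map R^2 -> R^3: (x1, x2) -> (x1^2, x2^2, sqrt(2) x1 x2)."""
    return np.c_[X[:, 0]**2, X[:, 1]**2, np.sqrt(2) * X[:, 0] * X[:, 1]]

# In the feature space, the hyperplane z1 + z2 = 2.25 (i.e. the circle
# x1^2 + x2^2 = 1.5^2 back in R^2) separates the two classes.
w, b = np.array([1.0, 1.0, 0.0]), -2.25
print(np.all(np.sign(phi(X) @ w + b) == y))              # True
# And the inner product in feature space is a simple kernel:
print(np.allclose(phi(X) @ phi(X).T, (X @ X.T) ** 2))    # True
```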

11 Kernel trick. Many linear algorithms (in particular, linear SVMs) can be performed in the feature space H without explicitly computing the images φ(x), but instead by computing kernels K(x, x'). It is sometimes easy to compute kernels which correspond to large-dimensional feature spaces: K(x, x') is often much simpler to compute than φ(x).

SVM in the feature space. Train: maximize Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j <φ(x_i), φ(x_j)> under the constraints 0 <= α_i <= C and Σ_i α_i y_i = 0. Predict with the decision function f(x) = Σ_i α_i y_i <φ(x_i), φ(x)> + b*.

SVM with a kernel. Train: maximize Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j) under the same constraints. Predict with the decision function f(x) = Σ_i α_i y_i K(x_i, x) + b*.
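The training data thus enters only through the n × n matrix of kernel values. scikit-learn accepts such matrices directly via kernel='precomputed'; at prediction time one supplies the kernel values between test and training points. A sketch with a quadratic kernel on hypothetical data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(80, 5)), rng.normal(size=(20, 5))
y_train = np.sign(X_train[:, 0] * X_train[:, 1] + 0.1)  # hypothetical labels

# Any kernel works as long as we can evaluate it on pairs of objects;
# here the quadratic kernel K(x, x') = (<x, x'> + 1)^2 as an illustration.
K_train = (X_train @ X_train.T + 1) ** 2   # n_train x n_train
K_test = (X_test @ X_train.T + 1) ** 2     # n_test  x n_train

clf = SVC(kernel="precomputed", C=1.0).fit(K_train, y_train)
print(clf.predict(K_test))  # decision uses sum_i alpha_i y_i K(x_i, x) + b*
```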

12 Polynomial kernels. K(x, x') = (<x, x'> + 1)^d is an inner product in a feature space of all monomials of degree up to d.

Toy example: linear vs. polynomial SVM. [Figure: decision boundaries of a linear and a polynomial SVM on a toy dataset.]
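This claim can be verified numerically: expanding (<x, x'> + 1)² gives exactly the inner product between vectors of all monomials of degree up to 2, with √2 weights on the cross terms. A sketch for d = 2:

```python
import numpy as np
from itertools import combinations_with_replacement

def phi_poly(x):
    """Explicit feature map for K(x, x') = (<x, x'> + 1)^2:
    all monomials of degree <= 2, with multinomial weights."""
    x = np.append(x, 1.0)  # homogenize: the extra coordinate carries the '+1'
    feats = []
    for i, j in combinations_with_replacement(range(len(x)), 2):
        coef = 1.0 if i == j else np.sqrt(2.0)  # weight on cross terms
        feats.append(coef * x[i] * x[j])
    return np.array(feats)

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi_poly(x) @ phi_poly(xp))  # inner product in monomial space: 4.0
print((x @ xp + 1) ** 2)           # the kernel value: identical
```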

13 Which functions are kernels? A function K(x, x') defined on a set X is a kernel iff there exists a Hilbert space H and a mapping φ: X → H such that, for any x, x' in X: K(x, x') = <φ(x), φ(x')>. A function K(x, x') defined on a set X is positive definite iff it is symmetric and satisfies Σ_i Σ_j a_i a_j K(x_i, x_j) >= 0 for any points x_1, ..., x_n in X and real numbers a_1, ..., a_n. Theorem [Aronszajn, 1950]: K is a kernel iff it is positive definite.

Gaussian kernel: K(x, x') = exp(-||x - x'||² / (2σ²)).

Kernels for strings. Protein sequence classification. Goal: predict which proteins are secreted or not, based on their sequence.
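Positive definiteness of the Gaussian kernel can be spot-checked by building a kernel matrix on random points and inspecting its eigenvalues (a sanity check on one sample, not a proof):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))  # hypothetical points

sigma = 1.5
K = np.exp(-cdist(X, X, "sqeuclidean") / (2 * sigma**2))  # Gaussian kernel matrix

# Positive definiteness: K symmetric, all eigenvalues >= 0
# (up to numerical precision), whatever the choice of points.
print(np.allclose(K, K.T))
print(np.linalg.eigvalsh(K).min() >= -1e-10)
```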

14 Substring-based representations. Represent strings based on the presence/absence of substrings of fixed length. Number of occurrences of u in x: spectrum kernel [Leslie et al., 2002]. Number of occurrences of u in x, up to m mismatches: mismatch kernel [Leslie et al., 2004]. Number of occurrences of u in x, allowing gaps, with a weight decaying exponentially with the number of gaps: substring kernel [Lodhi et al., 2002].

Spectrum kernel. Formally, a sum over |A|^k terms (A the alphabet, k the substring length), but at most |x| - k + 1 non-zero terms in Φ(x). Hence: computation in O(|x| + |x'|), and fast prediction for a new sequence x.

The choice of kernel matters. Performance of several kernels on the SCOP superfamily recognition benchmark [Saigo et al., 2004].
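The spectrum kernel is a few lines of Python once k-mer counts are stored in a hash map; only the k-mers present in both strings contribute, which is what gives the O(|x| + |x'|) behaviour. A minimal sketch (the sequences are hypothetical):

```python
from collections import Counter

def spectrum_kernel(x, xp, k=3):
    """K(x, x') = sum over k-mers u of
    (# occurrences of u in x) * (# occurrences of u in x')."""
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cxp = Counter(xp[i:i + k] for i in range(len(xp) - k + 1))
    # Only k-mers present in both strings contribute to the sum.
    return sum(cx[u] * cxp[u] for u in cx if u in cxp)

print(spectrum_kernel("MKVLAAGLLA", "MKVLSAGLLT", k=3))  # 4 shared 3-mers
```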

15 Kernels for graphs: molecules, images [Harchaoui & Bach, 2007].

Subgraph-based representations: represent a graph by the number of occurrences of each subgraph from a dictionary of features. [Figure: a molecular graph indexed by its subgraph occurrence counts.]

Tanimoto & MinMax. On such count vectors x, x', the Tanimoto similarity is K(x, x') = <x, x'> / (<x, x> + <x', x'> - <x, x'>), and the MinMax similarity is K(x, x') = Σ_i min(x_i, x'_i) / Σ_i max(x_i, x'_i). The Tanimoto and MinMax similarities are kernels.
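Both similarities are direct to compute on fingerprint vectors. A sketch on hypothetical subgraph-count fingerprints of two molecules:

```python
import numpy as np

def tanimoto(x, y):
    """K(x, y) = <x, y> / (<x, x> + <y, y> - <x, y>)."""
    xy = x @ y
    return xy / (x @ x + y @ y - xy)

def minmax(x, y):
    """K(x, y) = sum_i min(x_i, y_i) / sum_i max(x_i, y_i)."""
    return np.minimum(x, y).sum() / np.maximum(x, y).sum()

# Hypothetical subgraph-count fingerprints of two molecules.
x = np.array([0, 2, 1, 0, 3])
y = np.array([1, 2, 0, 0, 1])
print(tanimoto(x, y), minmax(x, y))  # 7/13 and 3/7
```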

16 Which subgraphs to use? Indexing by all subgraphs is intractable: computing all subgraph occurrences is NP-hard. Actually, finding whether a given subgraph occurs in a graph is NP-hard in general.

Instead, use specific subgraphs that lead to computationally efficient indexing: subgraphs selected based on domain knowledge, e.g. chemical fingerprints; all frequent subgraphs [Helma et al., 2004]; all paths up to length k [Nicholls, 2005]; all walks up to length k [Mahé et al., 2005]; all trees up to depth k [Rogers, 2004]; all shortest paths [Borgwardt & Kriegel, 2005]; all subgraphs up to k vertices (graphlets) [Shervashidze et al., 2009].

Examples: a path of length 5, a walk of length 5, a tree of depth 2.
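As an illustration of one of these choices, here is a label-free sketch in the spirit of the shortest-path kernel [Borgwardt & Kriegel, 2005]: each graph is represented by the histogram of its shortest-path lengths, and two graphs are compared by the inner product of these histograms (a simplification; the published kernel also compares the labels of the path endpoints):

```python
import networkx as nx
from collections import Counter

def shortest_path_kernel(G1, G2):
    """Inner product of shortest-path-length histograms of two graphs."""
    def histogram(G):
        h = Counter()
        for _, lengths in nx.all_pairs_shortest_path_length(G):
            h.update(l for l in lengths.values() if l > 0)
        return h
    h1, h2 = histogram(G1), histogram(G2)
    return sum(h1[l] * h2[l] for l in h1 if l in h2)

# Hypothetical example: a 5-cycle vs. a 5-node path.
print(shortest_path_kernel(nx.cycle_graph(5), nx.path_graph(5)))
```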

17 The choice of kernel matters. Predicting inhibitors for 60 cancer cell lines [Mahé & Vert, 2009]. COREL14: 1400 natural images, 14 classes [Harchaoui & Bach, 2007]. Kernels: histogram (H), walk kernel (W), subtree kernel (TW), weighted subtree kernel (wTW), combination (M).

Summary. Linearly separable case: hard-margin SVM. Non-separable, but still linear: soft-margin SVM. Non-linear: kernel SVM. Kernels for real-valued data, strings, and graphs.
