Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017



Classification and Margin

Consider a classification problem with two classes: instance set $\mathcal{X} = \mathbb{R}^d$, label set $\mathcal{Y} = \{-1, +1\}$. Training data: $S = ((x_1, y_1), \dots, (x_m, y_m))$. Hypothesis set $H$ = halfspaces. Assumption: the data is linearly separable $\Rightarrow$ there exists a halfspace that perfectly classifies the training set. In general there are multiple separating hyperplanes $\Rightarrow$ which one is the best choice?

Classification and Margin

The last one seems the best choice, since it can tolerate more noise. Informally, for a given separating halfspace we define its margin as its minimum distance to an example in the training set $S$. Intuition: the best separating hyperplane is the one with the largest margin. How do we find it?

Linearly Separable Training Set

The training set $S = ((x_1, y_1), \dots, (x_m, y_m))$ is linearly separable if there exists a halfspace $(w, b)$ such that $y_i = \mathrm{sign}(\langle w, x_i \rangle + b)$ for all $i = 1, \dots, m$. Equivalently: $\forall i = 1, \dots, m:\ y_i(\langle w, x_i \rangle + b) > 0$. Informally: the margin of a separating hyperplane is its minimum distance to an example in the training set $S$.
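
As an illustration (not part of the original slides), here is a tiny numpy sketch of the separability check $\forall i:\ y_i(\langle w, x_i \rangle + b) > 0$ for a candidate halfspace; the data points and the choice of $(w, b)$ are made up.

```python
import numpy as np

# Made-up training set and a candidate halfspace (w, b).
X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = np.array([1.0, 1.0]), 0.0

# (w, b) separates S iff y_i * (<w, x_i> + b) > 0 for every i.
separates = np.all(y * (X @ w + b) > 0)
print(separates)  # True for this choice of (w, b)
```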

Separating Hyperplane and Margin

Given the hyperplane defined by $L = \{v : \langle w, v \rangle + b = 0\}$, and given $x$, the distance of $x$ to $L$ is
$$d(x, L) = \min\{\|x - v\| : v \in L\}$$
Claim: if $\|w\| = 1$ then $d(x, L) = |\langle w, x \rangle + b|$ (Proof: Claim 15.1 [UML]).
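
A minimal numeric sketch of the claim (assuming numpy; the unit-norm $w$, the offset $b$, and the point $x$ are made-up values): when $\|w\| = 1$, the formula $|\langle w, x \rangle + b|$ agrees with the distance from $x$ to its orthogonal projection onto $L$.

```python
import numpy as np

# Hypothetical hyperplane L = {v : <w, v> + b = 0} with ||w|| = 1, and a point x.
w = np.array([3.0, 4.0]) / 5.0   # unit-norm normal vector
b = -1.0
x = np.array([2.0, 1.5])

# Distance via the claim: d(x, L) = |<w, x> + b| (valid because ||w|| = 1).
d_formula = abs(w @ x + b)

# Cross-check: project x orthogonally onto L and measure the Euclidean distance.
v = x - (w @ x + b) * w          # projection of x onto L
d_projection = np.linalg.norm(x - v)

print(d_formula, d_projection)   # the two values coincide
```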

Margin and Support Vectors

The margin of a separating hyperplane is the distance of the closest example in the training set to it. If $\|w\| = 1$ the margin is
$$\min_{i \in \{1, \dots, m\}} |\langle w, x_i \rangle + b|$$
The closest examples are called support vectors.

Support Vector Machine (SVM)

Hard-SVM: seek the separating hyperplane with the largest margin (only for linearly separable data). Computational problem:
$$\operatorname*{arg\,max}_{(w,b):\ \|w\|=1}\ \min_{i \in \{1, \dots, m\}} |\langle w, x_i \rangle + b| \quad \text{subject to } \forall i:\ y_i(\langle w, x_i \rangle + b) > 0$$
Equivalent formulation (due to the separability assumption):
$$\operatorname*{arg\,max}_{(w,b):\ \|w\|=1}\ \min_{i \in \{1, \dots, m\}} y_i(\langle w, x_i \rangle + b)$$

Hard-SVM: Quadratic Programming Formulation

input: $(x_1, y_1), \dots, (x_m, y_m)$
solve:
$$(w_0, b_0) = \operatorname*{arg\,min}_{(w,b)} \|w\|^2 \quad \text{subject to } \forall i:\ y_i(\langle w, x_i \rangle + b) \ge 1$$
output: $\hat{w} = \frac{w_0}{\|w_0\|}$, $\hat{b} = \frac{b_0}{\|w_0\|}$

How do we get a solution? This is a quadratic optimization problem: the objective is a convex quadratic function and the constraints are linear inequalities $\Rightarrow$ Quadratic Programming solvers!

Proposition: The output of the algorithm above is a solution to the Equivalent Formulation on the previous slide. Proof: on the board and in the book (Lemma 15.2 [UML]).
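
As a hedged illustration (not from the slides), the QP above can be handed to a generic solver. The sketch below uses the cvxpy library on a made-up, linearly separable toy set; the data and variable names are assumptions for the example.

```python
import cvxpy as cp
import numpy as np

# Made-up, linearly separable toy data.
X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# minimize ||w||^2 subject to y_i * (<w, x_i> + b) >= 1 for all i
constraints = [cp.multiply(y, X @ w + b) >= 1]
problem = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
problem.solve()

# Rescale to the unit-norm output (w_hat, b_hat) of the algorithm above.
w0, b0 = w.value, b.value
w_hat, b_hat = w0 / np.linalg.norm(w0), b0 / np.linalg.norm(w0)
print(w_hat, b_hat)
```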

Equivalent Formulation and Support Vectors

Equivalent formulation (homogeneous halfspaces): assume the first component of each $x \in \mathcal{X}$ is 1; then
$$w_0 = \operatorname*{arg\,min}_{w} \|w\|^2 \quad \text{subject to } \forall i:\ y_i \langle w, x_i \rangle \ge 1$$
Support vectors = vectors at minimum distance from $w_0$. The support vectors are the only ones that matter for defining $w_0$!

Proposition: Let $w_0$ be as above and let $I = \{i : |\langle w_0, x_i \rangle| = 1\}$. Then there exist coefficients $\alpha_1, \dots, \alpha_m$ such that
$$w_0 = \sum_{i \in I} \alpha_i x_i$$
Support vectors = $\{x_i : i \in I\}$.

Note: solving Hard-SVM is equivalent to finding $\alpha_i$ for $i = 1, \dots, m$, with $\alpha_i \neq 0$ only for support vectors.
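
One way to see the proposition numerically (an illustration, not part of the slides): scikit-learn's linear SVC exposes the support vectors and their signed dual coefficients, and with a very large C it approximates Hard-SVM; reconstructing $w$ from the support vectors alone recovers the learned weight vector. The toy data and the choice C=1e6 are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Same made-up toy data as in the previous snippet.
X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ Hard-SVM

# w reconstructed as a linear combination of the support vectors only.
w_from_support_vectors = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_support_vectors, clf.coef_))  # True: only support vectors matter
print(clf.support_)                                    # indices of the support vectors
```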

Soft-SVM

Hard-SVM works if the data is linearly separable. What if the data is not linearly separable? $\Rightarrow$ Soft-SVM. Idea: modify the constraints of Hard-SVM to allow for some violations, but take the violations into account in the objective function.

Soft-SVM Constraints

Hard-SVM constraints: $y_i(\langle w, x_i \rangle + b) \ge 1$.

Soft-SVM constraints: introduce slack variables $\xi_1, \dots, \xi_m \ge 0$ (a vector $\xi$); for each $i = 1, \dots, m$: $y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i$. Here $\xi_i$ measures how much the constraint $y_i(\langle w, x_i \rangle + b) \ge 1$ is violated.

Soft-SVM minimizes a combination of the norm of $w$ and the average of the $\xi_i$. The tradeoff between the two terms is controlled by a parameter $\lambda \in \mathbb{R}$, $\lambda > 0$.

Soft-SVM: Optimization Problem

input: $(x_1, y_1), \dots, (x_m, y_m)$, parameter $\lambda > 0$
solve:
$$\min_{w, b, \xi}\ \left(\lambda \|w\|^2 + \frac{1}{m} \sum_{i=1}^m \xi_i\right) \quad \text{subject to } \forall i:\ y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i \text{ and } \xi_i \ge 0$$
output: $w, b$

Equivalent formulation: consider the hinge loss
$$\ell^{\text{hinge}}((w, b), (x, y)) = \max\{0,\ 1 - y(\langle w, x \rangle + b)\}$$
Given $(w, b)$ and a training set $S$, the empirical risk $L_S^{\text{hinge}}((w, b))$ is
$$L_S^{\text{hinge}}((w, b)) = \frac{1}{m} \sum_{i=1}^m \ell^{\text{hinge}}((w, b), (x_i, y_i))$$

Soft-SVM: Solve Soft-SVM as RLM

$$\min_{w, b, \xi}\ \left(\lambda \|w\|^2 + \frac{1}{m} \sum_{i=1}^m \xi_i\right) \quad \text{subject to } \forall i:\ y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i \text{ and } \xi_i \ge 0$$

Equivalent formulation with the hinge loss:
$$\min_{w, b}\ \left(\lambda \|w\|^2 + \frac{1}{m} \sum_{i=1}^m \ell^{\text{hinge}}((w, b), (x_i, y_i))\right)$$
that is
$$\min_{w, b}\ \left(\lambda \|w\|^2 + L_S^{\text{hinge}}(w, b)\right)$$

Note: $\lambda \|w\|^2$ is an $\ell_2$ regularization term; $L_S^{\text{hinge}}(w, b)$ is the empirical risk for the hinge loss.
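
A small numpy sketch of evaluating this regularized objective (illustrative only; the value of $\lambda$, the candidate $(w, b)$, and the toy data are made up):

```python
import numpy as np

def hinge_losses(w, b, X, y):
    """Per-example hinge loss max{0, 1 - y(<w, x> + b)} over the training set."""
    return np.maximum(0.0, 1.0 - y * (X @ w + b))

def soft_svm_objective(w, b, X, y, lam):
    """lambda * ||w||^2 + L_S^hinge(w, b), the RLM objective above."""
    return lam * np.dot(w, w) + hinge_losses(w, b, X, y).mean()

# Same made-up toy data as before, with an arbitrary candidate (w, b).
X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(soft_svm_objective(np.array([0.5, 0.5]), 0.0, X, y, lam=0.1))
```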

Soft-SVM: Solution

We need to solve:
$$\min_{w, b}\ \left(\lambda \|w\|^2 + \frac{1}{m} \sum_{i=1}^m \ell^{\text{hinge}}((w, b), (x_i, y_i))\right)$$
How? With standard solvers for optimization problems, or with Stochastic Gradient Descent.
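
The slides only mention Stochastic Gradient Descent; the sketch below is one possible Pegasos-style instantiation, with the step-size schedule, iteration count, and toy data as assumptions. It uses the homogeneous formulation from the earlier slide (first feature fixed to 1, so the bias is folded into $w$ and regularized with it).

```python
import numpy as np

def soft_svm_sgd(X, y, lam=0.1, n_iters=20000, seed=0):
    """SGD on lam * ||w||^2 + average hinge loss (homogeneous halfspaces)."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    for t in range(1, n_iters + 1):
        i = rng.integers(m)               # sample one training example
        eta = 1.0 / (2.0 * lam * t)       # decreasing step size (an assumption)
        grad = 2.0 * lam * w              # gradient of lam * ||w||^2
        if y[i] * (X[i] @ w) < 1:         # hinge loss is active on example i
            grad = grad - y[i] * X[i]     # subgradient of the hinge term
        w = w - eta * grad
    return w

# Made-up, separable toy data; the leading column of ones plays the role of b.
X = np.array([[1.0, 1.0, 2.0], [1.0, 2.0, 3.0],
              [1.0, -1.0, -1.5], [1.0, -2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = soft_svm_sgd(X, y)
print(np.sign(X @ w))   # should reproduce y on this toy set
```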

Exercise 1

Your friend has developed a new machine learning algorithm for binary classification (i.e., $y \in \{-1, 1\}$) with 0-1 loss and tells you that it achieves a generalization error of only 0.05. However, when you look at the learning problem he is working on, you find out that $\Pr_D[y = 1] = 0.95$...

Assume that $\Pr_D[y = \ell] = p_\ell$. Derive the generalization error of the (dumb) hypothesis/model that always predicts $\ell$. Use the result above to decide whether your friend's algorithm has learned something or not.

Exercise 2

Consider a linear regression problem, where $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{Y} = \mathbb{R}$, with mean squared loss. The hypothesis set is the set of constant functions, that is $H = \{h_a : a \in \mathbb{R}\}$, where $h_a(x) = a$. Let $S = ((x_1, y_1), \dots, (x_m, y_m))$ denote the training set. Derive the hypothesis $h \in H$ that minimizes the training error. Use the result above to explain why, for a given hypothesis $\hat{h}$ from the set of all linear models, the coefficient of determination
$$R^2 = 1 - \frac{\sum_{i=1}^m (\hat{h}(x_i) - y_i)^2}{\sum_{i=1}^m (y_i - \bar{y})^2}$$
where $\bar{y}$ is the average of the $y_i$, $i = 1, \dots, m$, is a measure of how well $\hat{h}$ performs (on the training set).

Exercise 3

Consider the classification problem with $\mathcal{X} = \mathbb{R}^2$, $\mathcal{Y} = \{0, 1\}$. Consider the hypothesis class $H = \{h_{(c,a)} : c \in \mathbb{R}^2, a \in \mathbb{R}\}$ with
$$h_{(c,a)}(x) = \begin{cases} 1 & \text{if } \|x - c\| \le a \\ 0 & \text{otherwise} \end{cases}$$
Find the VC-dimension of $H$.

Exercise 4

Assume we have the following dataset ($x_i \in \mathbb{R}^2$) and that, by solving the SVM for classification, we get the corresponding optimal dual variables $\alpha_i$:

 i    x_i^T           y_i    alpha_i
 1    [ 0.2, -1.4]    -1     0
 2    [-2.1,  1.7]     1     0
 3    [ 0.9,  1.0]     1     0.5
 4    [-1.0, -3.1]    -1     0
 5    [-0.2, -1.0]    -1     0.25
 6    [-0.2,  1.3]     1     0
 7    [ 2.0, -1.0]    -1     0.25
 8    [ 0.5,  2.1]     1     0

Answer the following:
(A) Which are the support vectors?
(B) Draw a schematic picture reporting the data points (approximately) and the optimal separating hyperplane, and mark the support vectors. Would it be possible, by moving only two data points, to obtain the SAME separating hyperplane with only 2 support vectors? If so, draw the modified configuration (approximately).

Exercise 5

Consider the ridge regression problem $\arg\min_w \lambda \|w\|^2 + \sum_{i=1}^m (\langle w, x_i \rangle - y_i)^2$. Let $h_S$ be the hypothesis obtained by ridge regression with training set $S$, and let $h^*$ be the hypothesis of minimum generalization error among all linear models.

(A) Draw, in the plot below, a typical behaviour of (i) the training error and (ii) the test/generalization error of $h_S$ as a function of $\lambda$.

(B) Draw, in the plot below, a typical behaviour of (i) $L_D(h_S) - L_D(h^*)$ and (ii) $L_D(h_S) - L_S(h_S)$ as a function of $\lambda$.

[Empty plots for (A) and (B) omitted; labels: training error, test error.]