Supervised vs. Unsupervised Learning

Transcription:

Overview T7 - SVM and Kernels. Christian Vögeli, cvoegeli@inf.ethz.ch. Topics: Supervised/Unsupervised Learning, Support Vector Machines, Kernels. Based on slides by P. Orbanz & J. Keuchel.

Task: Apply some machine learning method to data from a given source. The source has a characteristic distribution, but we don't know what it is.

Def.: Training data. In a classification/regression problem, a sample from the data source with the correct classification/regression results already assigned is called training data. Example: training data for a classification problem is a data sample plus the correct class labels. Training data is the knowledge about the data source which we use to construct the classifier/regressor.

Supervised Learning: Use training data to infer a model, then apply the model to test data. We can test our model. Examples: maximum likelihood, SVM.

Unsupervised Learning: No training data. Model inference and application both rely on the test data exclusively. Example: k-means. We cannot test our model.

Fish example from the lecture. Supervised learning: one bag with salmon, one bag with sea bass => build a model; one mixed bag => please classify. Unsupervised learning: one mixed bag => build a model; another mixed bag => please classify.
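As a concrete illustration of the two settings (not part of the original slides), here is a minimal sketch assuming NumPy and scikit-learn are available; the toy "salmon vs. sea bass" features and all parameter values are made up for illustration:

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Toy "salmon vs. sea bass" data: two features per fish.
    X_train = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(4, 1, (30, 2))])
    y_train = np.array([0] * 30 + [1] * 30)          # labels known: supervised
    X_test = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(4, 1, (10, 2))])

    # Supervised: infer a model from labeled training data, apply it to test data.
    clf = SVC(kernel="linear").fit(X_train, y_train)
    print("supervised test accuracy:", clf.score(X_test, [0] * 10 + [1] * 10))

    # Unsupervised: no training labels; k-means works on the unlabeled data alone.
    clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X_test)
    print("cluster assignments:", clusters)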

Supervised example: classification. 1. Choose a classifier model (e.g. a family of functions). 2. Use training data (pre-classified reference data) to choose a specific classifier (e.g. compute its parameters). 3. Use the classifier on new data (test data): apply the classifier function to a data value -> result: a class label.

Perceptron. Idea: separate data points from two classes C1, C2 by a linear function (hyperplane). Identify the hyperplane by its normal vector a (in generalized coordinates; cf. CG: homogeneous coordinates). The solution is not unique: the possible normal vectors define a conic region (= intersection of n half spaces). For training, normalize the data points by mirroring the points from C2 at the origin; linear separability is preserved. New requirement: every y_i lies in the halfspace of a, i.e. the angle between a and y_i is < 90° (a^T y_i > 0).

Algorithm: Sequentially add normalized misclassified data points y_i to a. This moves a in the direction of the wider opening of the solution cone. The algorithm terminates once a lies in the solution cone.

From the perceptron to the SVM: introduce a margin a^T y_i >= b to reduce the cone of possible solutions (and move the separating hyperplane more towards the middle). Exercise 3.
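A minimal sketch of the perceptron update described above, assuming NumPy; the homogeneous-coordinate and mirroring steps follow the slide, while the function name, loop bound and interface are illustrative choices:

    import numpy as np

    def perceptron(X, y, max_epochs=100):
        """Perceptron in generalized (homogeneous) coordinates.
        X: (n, d) data matrix, y: labels in {+1, -1}."""
        # Append a constant 1 so the hyperplane offset is part of a.
        Xh = np.hstack([X, np.ones((len(X), 1))])
        # Mirror points of the second class at the origin: afterwards
        # every point must satisfy a^T y_i > 0.
        Y = Xh * y[:, None]
        a = np.zeros(Xh.shape[1])
        for _ in range(max_epochs):
            misclassified = False
            for yi in Y:
                if a @ yi <= 0:          # angle between a and y_i >= 90 degrees
                    a += yi              # move a towards the solution cone
                    misclassified = True
            if not misclassified:        # a lies in the solution cone
                break
        return a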

SVM. By definition, the SVM is the maximum margin classifier defined in terms of the support vector approach. Real-world SVM implementations usually combine three techniques: 1. the maximum margin classifier (this is where convex optimization comes in), 2. the soft margin technique (slack variables), 3. the kernel trick.

Maximum margin classifiers. Background: generalization error. V. Vapnik: theory of structural risk minimization. Approach: the classification error consists of two parts, classification error = training error + generalization error. Flexible classifiers have a small training error but possibly a large generalization error (overfitting): a badly trained classifier does not generalize well.

To keep the generalization error low (i.e. to prevent overfitting): 1. Use a linear classifier; more complex/flexible classifiers (curved surfaces etc.) are more likely to overfit. 2. When placing the hyperplane, choose the position which maximizes the margin. Note: the only conceptual difference between the basic SVM and the perceptron is the rule for positioning the hyperplane.

Support vectors. Support vector idea: the classifier hyperplane is determined only by the points which are closest to it (the support vectors). Background: the region around the class boundary is critical for classification; what happens in the back of the classes does not matter (classification is not density estimation). In theory, you may have millions of data points but only three support vectors.

Exercise 2: maximum margin classifier. Personal opinion: a very nice task, good for low-level understanding. a): simply draw a picture. b)-d): this is not SVM but least squares, used to illustrate the difference to SVM. e) + f): SVM.

Soft margin. First problem of the SVM classifier: it requires the data to be linearly separable. Solution: the soft margin approach. We allow training points on the wrong side of the classifier, but such points produce extra costs. The margin maximization is rewritten as a minimization problem, and the two problems are combined.
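To make the margin discussion concrete, here is a small sketch (not from the slides) assuming scikit-learn: a linear SVM with a very large penalty C approximates the hard maximum-margin classifier, a small C gives a soft margin, and the support vectors can be read off the fitted model; the toy data and C values are illustrative:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    # Two roughly separable 2-D classes (toy data, purely illustrative).
    X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(+2, 1, (50, 2))])
    y = np.hstack([-np.ones(50), np.ones(50)])

    hard = SVC(kernel="linear", C=1e6).fit(X, y)   # ~hard margin
    soft = SVC(kernel="linear", C=0.1).fit(X, y)   # soft margin, wider tolerance

    # The hyperplane is determined only by the support vectors.
    print("support vectors (hard margin):", len(hard.support_vectors_))
    print("support vectors (soft margin):", len(soft.support_vectors_))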

Soft margin. Slack variables: the more the original constraint is violated, the larger they get. There are many ways to combine the two cost terms into a single cost function: every expression that becomes larger as either of the two terms increases is valid. Note: slack variables apply only to training data; the classification of a test point depends only on which side of the hyperplane it is on.

Optimization theory. The theory of convex optimization (KKT conditions etc.) only comes in because the maximum margin problem (with and without slack variables) can be cast as a quadratic optimization problem. In machine learning, the convex optimization part is usually used as a black box: to program SVMs, we use libraries or MATLAB functions. Why the formulation as a convex optimization problem is so appealing: convex problems have the best possible optimization properties (a clearly defined optimum, no catches to confuse computer algorithms), and the convex structure allows us to tie in constraints in a very neat fashion. The statement "can be formulated as a convex optimization problem" is roughly equivalent to "the optimization involved will not give us any trouble."

Kernel trick: motivation. Problem with linear classifiers: they are not very flexible. Two types of errors: 1. the classes overlap, 2. the boundary between the classes is obviously non-linear. The use of slack variables should be reserved for case (1). Consequence: we would like to be able to use curved classifiers. Curved classifiers are nonlinear objects, i.e. much more difficult mathematics. The way out: the classifier stays linear, but the space is transformed.

Observation: imagine some high-dimensional vector space. The projection of a linear object (say, a hyperplane) onto a linear subspace is again linear. The projection of the same object onto a non-linear subset (e.g. a curved surface) is non-linear. If we regard our data space as a non-linear subset of some high-dimensional space, we can simulate nonlinear operations in data space by doing linear operations in the high-dimensional space.

High-dimensional computations the hard way: define a transformation T into the high-dimensional space R^H, parameterize the data space R^L (e.g. as a parametric differentiable manifold), perform linear classifier training in R^H, and project the classifier C back onto R^L: R^L --T--> R^H, train C in R^H, then T^-1(C) in R^L. Nobody would think about this if it was the only way and did not lead to...
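The "hard way" above can be contrasted with what the kernel trick buys us. A minimal sketch, assuming NumPy and using the standard textbook degree-2 feature map (not taken from the slides): the scalar product in the high-dimensional space R^H equals a simple function of the vectors in R^L, so we never have to leave the data space:

    import numpy as np

    def phi(x):
        """Explicit degree-2 feature map R^2 -> R^3 (standard textbook example)."""
        x1, x2 = x
        return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

    def k(x, y):
        """Kernel computed entirely in the low-dimensional space."""
        return np.dot(x, y) ** 2

    x = np.array([1.0, 2.0])
    y = np.array([3.0, 4.0])
    # The scalar product in R^H equals the kernel evaluated in R^L.
    print(np.dot(phi(x), phi(y)), k(x, y))   # both equal 121 (up to floating point)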

The key operation of our linear methods is the scalar product: it computes projections (needed for classification) and orientations (it determines orthogonality and the direction of hyperplanes). Check the slides on optimization for SVMs: the only vector operation that appears is the scalar product. So: if we can represent the scalar product in R^H as a function in R^L, we can simulate linear classification in R^H without actually leaving R^L.

Recall linear algebra: a matrix A defines a scalar product if it is symmetric positive definite: <x, y>_A := x^T A y has all the properties of a scalar product.

For a function k(x, y), what do we have to require such that there is some space R^H in which k is a scalar product? Meaning: when do we have k(x, y) = <Phi(x), Phi(y)> for some mapping Phi: R^L -> R^H, i.e. when do we compute the scalar product of R^L vectors in R^H? Answer: Mercer's theorem. If K : R^L x R^L -> R is continuous and symmetric, and ∫∫ K(x, y) f(x) f(y) dx dy >= 0 for all square-integrable f, then K represents a scalar product in some high-dimensional space. The Mercer condition is the functional analogue of positive definiteness for matrices (previous slide). We can use all these functions instead of the scalar product.

Using the kernel trick: start with some function K, check whether it fulfills Mercer's condition, and replace all scalar products in the algorithm by K. The important part: see whether it works. Most of the time (i.e. for most kernels), it will not.

Why do SVMs work so well? First of all: it is an unsolved research problem. Some claims: because of... 1. the maximum margin classifier, 2. the compression property (the solution is represented by a small subset of the points, the support vectors), 3. certain (still largely unproved) convergence properties of the kernel operator eigenvalues.

Why do SVMs work so well? Three very simple reasons why SVMs are so popular: 1. They are of proven merit. 2. Lots of experience, literature etc. exists. 3. Several easy-to-use, freely accessible, well-tested implementations are available (libsvm, SVMlight etc.).
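A minimal usage sketch, assuming scikit-learn (whose SVC wraps libsvm, one of the implementations mentioned above); the ring-shaped toy data and the gamma value are illustrative, the point is only that replacing the scalar product by a Mercer kernel (here RBF) yields a curved decision boundary:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    # A non-linearly separable problem: the class depends on distance from the origin.
    X = rng.normal(0, 1, (200, 2))
    y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)

    # Replacing the scalar product by an RBF kernel gives a curved decision
    # boundary while the underlying optimization stays convex.
    linear = SVC(kernel="linear").fit(X, y)
    rbf = SVC(kernel="rbf", gamma=1.0).fit(X, y)
    print("linear kernel training accuracy:", linear.score(X, y))
    print("RBF kernel training accuracy:", rbf.score(X, y))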