Lecture 9: Support Vector Machines

Lecture 9: Support Vector Machines
William Webber (william@williamwebber.com)
COMP90042, 2014, Semester 1

What we'll learn in this lecture

Support Vector Machines (SVMs), a highly robust and effective classifier:
- the theory of the maximum-margin hyperplane
- transforming data into a higher-dimensional space
- soft margins to allow classifier errors
- practicalities of use for text classification

Support Vector Machines (SVMs)

Basic concepts of (binary) SVMs:
- Project the training data into a feature space
- Find the maximum-margin hyperplane (MMH) between the classes
  - a hyperplane generalizes a line (2 dimensions) or a plane (3 dimensions) to higher-dimensional spaces
- The MMH completely separates the training data into positive and negative classes...
- ... and maximizes the distance of the nearest examples from the hyperplane
- These nearest examples are called the support vectors
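To make these concepts concrete, here is a minimal sketch (not part of the lecture) using scikit-learn: it fits a linear SVM on made-up, linearly separable toy data and reads off the support vectors and the hyperplane parameters (w, b). The data points and the parameter settings are assumptions for illustration only.

```python
# Minimal sketch (not from the lecture): fit a linear SVM on made-up, separable toy
# data and inspect the support vectors and the maximum-margin hyperplane it finds.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],          # positive class (hypothetical)
              [-1.0, -1.0], [-1.5, -2.0], [-2.0, -1.5]])   # negative class (hypothetical)
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # very large C approximates a hard margin
clf.fit(X, y)

print("support vectors:\n", clf.support_vectors_)   # the examples nearest the hyperplane
print("w =", clf.coef_[0], "  b =", clf.intercept_[0])
```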

Support vectors and separating hyperplane

Figure: Maximum-margin hyperplane and support vectors (Wikipedia)

Calculating MMH: math

Labelled training examples:

    (y_1, x_1), ..., (y_l, x_l),    y_i ∈ {-1, 1}                    (1)

The data are linearly separable if there exist a vector w and a scalar b such that:

    y_i (w · x_i + b) ≥ 1,    i = 1, ..., l                          (2)

(· is the dot product.) w gives the orientation of the hyperplane, being the vector perpendicular to it; b (the "bias") locates it relative to the origin, scaled by ‖w‖.

The optimal hyperplane

    w_0 · x + b_0 = 0                                                (3)

is the one that separates the data with maximal margin. The maths (Cortes and Vapnik, 1995) shows this is the hyperplane that minimizes ‖w‖ subject to constraint (2).

Calculating MMH: implementation

    min ‖w‖   s.t.   y_i (w · x_i + b) ≥ 1,   i = 1, ..., l          (4)

- A quadratic programming (QP) problem
- Requires O(n^2) space in standard QP implementations
- But all that matters are the (candidate) support vectors
- This allows efficient decomposition methods, giving linear space and time
  - e.g. note that a support vector of the full set must be a support vector of any subset it occurs in
  - so calculate the support vectors of subsets, then merge (Cortes and Vapnik, 1995; see also Joachims, 1998)

A sketch of solving (4) directly with a generic QP solver follows below.
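The following sketch (mine, not the lecture's; it assumes the cvxopt package is installed) solves the hard-margin primal QP (4) directly for the toy data used earlier. Real SVM implementations instead solve the dual with decomposition methods (e.g. SVMlight, libsvm) rather than calling a generic QP solver.

```python
# Minimal sketch (assumes cvxopt): solve the hard-margin primal QP
#   min 1/2 ||w||^2   s.t.   y_i (w . x_i + b) >= 1
# with a generic QP solver, over the stacked optimization variable z = [w; b].
import numpy as np
from cvxopt import matrix, solvers

X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
              [-1.0, -1.0], [-1.5, -2.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
l, n = X.shape

P = matrix(np.diag([1.0] * n + [0.0]))                       # penalize ||w||^2 only, not b
q = matrix(np.zeros((n + 1, 1)))
# y_i (w . x_i + b) >= 1, rewritten in cvxopt's form G z <= h as -y_i (w . x_i + b) <= -1
G = matrix(-y[:, None] * np.hstack([X, np.ones((l, 1))]))
h = matrix(-np.ones((l, 1)))

solvers.options["show_progress"] = False
z = np.array(solvers.qp(P, q, G, h)["x"]).ravel()
w, b = z[:n], z[n]
print("w =", w, "  b =", b)                                  # the maximum-margin hyperplane
```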

Classifying new examples

The model is (w_0, b_0). More maths shows that w_0 is expressible as a linear combination of the support vectors Z:

    w_0 = Σ_{z_i ∈ Z} α_i y_i z_i,   α_i > 0                         (5)

(where y_i is the label of support vector z_i). For an unlabelled example x, calculate:

    ŷ = w_0 · x + b_0 = Σ_{z_i ∈ Z} α_i y_i (z_i · x) + b_0          (6)

- Predict the class of x from the sign of ŷ
- |ŷ| gives the strength of the prediction
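A small sketch of (6) in scikit-learn's conventions (toy data again made up): sklearn's dual_coef_ stores α_i·y_i for each support vector, so the decision value can be recomputed by hand and compared against decision_function.

```python
# Minimal sketch: recompute the decision value of (6) from the support-vector expansion.
# scikit-learn's dual_coef_[0, i] holds alpha_i * y_i for support vector z_i.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
              [-1.0, -1.0], [-1.5, -2.0], [-2.0, -1.5]])
y = np.array([1, 1, 1, -1, -1, -1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

x_new = np.array([0.5, 2.0])                        # an unlabelled example (made up)
y_hat = float(clf.dual_coef_[0] @ (clf.support_vectors_ @ x_new) + clf.intercept_[0])

print("manual decision value :", y_hat)
print("sklearn decision value:", float(clf.decision_function(x_new.reshape(1, -1))[0]))
print("predicted class       :", 1 if y_hat > 0 else -1)   # sign = class, |y_hat| = strength
```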

Linear separability

- Most problems are not linearly separable
- Two (not mutually exclusive) solutions:
  - Project the data into a higher-dimensional space (more chance of being separable)
  - Allow some training points to fall on the wrong side of the hyperplane (with a penalty)

Mapping to higher-dimensional space

Data points are mapped to a higher-dimensional (feature) space. E.g. for a polynomial space, extend / replace the raw features:

    x_1, ..., x_n                          (n dimensions)            (7)

with their squares:

    x_1^2, ..., x_n^2                      (n dimensions)            (8)

plus the cross-terms:

    x_1 x_2, x_1 x_3, ..., x_{n-1} x_n     (n(n-1)/2 dimensions)     (9)

and calculate the separating hyperplane in the higher-dimensional space.
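To make the growth in dimensions concrete, here is a short sketch (mine, not the lecture's) that builds the explicit degree-2 mapping with scikit-learn's PolynomialFeatures. Note that PolynomialFeatures also keeps the raw features, so the counts differ slightly from (8)-(9), but the quadratic growth is the point.

```python
# Minimal sketch: size of the explicit degree-2 feature map as n grows.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

for n in (3, 10, 100, 1000):
    x = np.ones((1, n))                                     # a dummy example with n raw features
    phi = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
    print(f"n = {n:4d} raw features -> {phi.shape[1]:7d} mapped features")
# The mapped space grows roughly as n^2/2: the explicit map quickly becomes expensive,
# which motivates the kernel trick on the next slide.
```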

Higher-dimensional space: why?

Mapping to a higher-dimensional space:
- Makes a linearly non-separable problem separable (perhaps)
- Finds important relationships between features
- Removes monotonicity assumptions on features
  - In the linear space, features must be monotonically related to the class (e.g., the greater the score of feature x, the greater the evidence for class y)
  - Some features are not like this (e.g., weight as a predictor of health)
  - (Some) mappings to a higher-dimensional space allow more complex relations to be discovered

But how is this even possible? Don't we get an explosion of features?

The kernel trick

The calculation of SVMs uses dot products throughout. For certain classes of projection ϕ(x), there exists a kernel function K(x, y) such that:

    K(x, y) = ϕ(x) · ϕ(y)                                            (10)

For example, the kernel function:

    K(x, y) = (x · y + 1)^d                                          (11)

is equivalent to a mapping into degree-d polynomial space.

- Simply replace the dot product with K(·, ·) throughout the computation of the SVM
- Then the linear hyperplane is effectively (and cheaply) calculated in the higher-dimensional space
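A small demonstration (hypothetical data, scikit-learn conventions): an SVM trained only from the kernel values K(x, y) = (x · y + 1)^2, never from the explicit mapping, gives the same decision values as scikit-learn's built-in polynomial kernel with matching settings (gamma=1, coef0=1, degree=2).

```python
# Minimal sketch: the kernel trick in practice, using eq. (11) with d = 2.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # not linearly separable in raw space

def K(A, B):
    return (A @ B.T + 1.0) ** 2                            # (x . y + 1)^2, computed in O(n)

svm_builtin = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0).fit(X, y)
svm_kernel  = SVC(kernel="precomputed").fit(K(X, X), y)    # never touches the explicit mapping

X_test = rng.normal(size=(5, 3))
print(svm_builtin.decision_function(X_test))
print(svm_kernel.decision_function(K(X_test, X)))          # same values: the kernel trick
```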

Soft-margin classifiers

Figure: Soft-margin SVM (from StackOverflow)

The second solution to linear non-separability: allow errors, i.e. training examples within the margin, or on the wrong side of the hyperplane.
- Penalize errors by how wrong they are
- Solve the minimization problem with the error penalty added

Soft-margin: maths

Change the hard-margin constraint:

    y_i (w · x_i + b) ≥ 1,   i = 1, ..., l                           (12)

to the soft-margin constraint:

    y_i (w · x_i + b) ≥ 1 - ξ_i,   ξ_i ≥ 0                           (13)

where ξ_i is the slack variable for x_i. Then solve:

    argmin_{w, ξ, b}  { (1/2) ‖w‖^2 + C Σ_{i=1}^{l} ξ_i }            (14)

subject to the constraint in (13), where C is a constant slack-penalty parameter (sometimes scaled relative to the number of training examples).
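A short sketch of the effect of the penalty constant C in (14), on made-up overlapping data with scikit-learn's SVC: small C tolerates margin violations cheaply (wider margin, more support vectors), while large C penalizes them heavily, approaching the hard margin.

```python
# Minimal sketch: how the slack penalty C of (14) trades margin violations against margin width.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=1.0, size=(50, 2)),        # two overlapping classes (made up),
               rng.normal(loc=-1.0, size=(50, 2))])      # so some slack is unavoidable
y = np.array([1] * 50 + [-1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margins = y * clf.decision_function(X)               # y_i (w . x_i + b)
    print(f"C = {C:6.2f}: {int(np.sum(margins < 1))} examples inside the margin or misclassified, "
          f"{len(clf.support_)} support vectors")
```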

SVM: practical considerations

- The choice of kernel function is trial and error, but some insight into the data can help (e.g. are the features monotonic?)
- Soft-margin classifiers are generally used now
- The SVM is reputedly robust:
  - doesn't get confused by correlated features
  - is relatively resistant to overfitting
  - has few or no parameters to tune
- So we can throw features at it (at least as a first pass)

SVM for text classification

- A linear SVM (i.e. no kernel transformation), with soft margins, is typically most effective:
  - large feature space
  - feature monotonicity
- SVMs consistently give best or near-best text classification effectiveness (over Rocchio, kNN, linear least-squares fit, Naive Bayes, MaxEnt, decision trees, etc.)
  - MaxEnt / logistic regression (see the second half of the course) and kNN come closest
- Drawback: the model is difficult to interpret
  - it is based on the support vectors (marginal documents)
  - hard to say which features (terms) are strong evidence
  - non-interpretability is common with geometric methods
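To show what this recommended setup looks like in code, here is a minimal sketch (the documents and labels are placeholders, not a real corpus): sparse tf-idf features feeding a linear, soft-margin SVM.

```python
# Minimal sketch of the setup above: tf-idf features + a linear, soft-margin SVM.
# The documents and labels are placeholders; substitute a real corpus (e.g. Reuters / RCV1).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = ["wheat prices rise on export news",            # hypothetical training documents
        "central bank raises interest rates",
        "bumper wheat harvest expected",
        "rates unchanged after bank meeting"]
labels = ["grain", "money", "grain", "money"]

model = make_pipeline(TfidfVectorizer(), LinearSVC(C=1.0))   # linear kernel, soft margin
model.fit(docs, labels)
print(model.predict(["harvest forecast lifts wheat futures"]))
```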

Looking back and forward

Back
- SVMs (like kNN and Rocchio) are based on a geometric model (but one of partitioning, not similarity)
- The SVM finds the maximally separating hyperplane between the training-data classes, represented by the marginal support vectors (training examples)
- The space can be transformed into higher dimensions, and the hyperplane efficiently calculated there, using the kernel trick
- The soft-margin version allows classifier errors, with a penalty
- The SVM is a robust classifier, and performs well for text classification

Looking back and forward

Forward
- Later in the course, we will look at probabilistic classifiers, and at further topics in classification
- Next lecture: start on probabilistic models of document similarity

Further reading

- Cortes and Vapnik, "Support-Vector Networks", Machine Learning, 1995 (Vapnik is the inventor of SVMs; this paper gives a readable introduction to the theory, then describes the soft-margin hyperplane).
- Joachims, "Making Large-Scale SVM Learning Practical", 1998 (describes the implementation of a decomposition method to optimize the calculation of SVMs).
- Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features", 1998 (compares SVMs with Naive Bayes, Rocchio, C4.5, and kNN).
- Lewis, Yang, Rose, and Li, "RCV1: A New Benchmark Collection for Text Categorization Research", 2004 (compares SVMs with kNN and Rocchio).