Data Mining: Concepts and Techniques
Chapter 9 Classification: Support Vector Machines

Support Vector Machines (SVMs)

SVMs are a set of related supervised learning methods used for classification. Based on the training data set, an SVM solves an optimization problem to find the maximum-margin hyperplane, which is then used to classify new data instances.

Applications: pattern recognition, classification, learning, decision making for games

Types of SVMs: linear vs. nonlinear; binary vs. multi-class; internal vs. external

SVM: Support Vector Machines

A relatively new classification method for both linear and nonlinear data. It uses a nonlinear mapping to transform the original training data into a higher dimension. In the new dimension, it searches for the linear optimal separating hyperplane (i.e., the decision boundary). With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane. SVM finds this hyperplane using support vectors (the essential training tuples) and margins (defined by the support vectors).

SVM History and Applications

Vapnik and colleagues (1992); groundwork from Vapnik & Chervonenkis' statistical learning theory in the 1960s.
Features: training can be slow, but accuracy is high owing to the ability to model complex nonlinear decision boundaries (margin maximization).
Used for: classification and numeric prediction.
Applications: handwritten digit recognition, object recognition, speaker identification, benchmarking time-series prediction tests.

Decision Boundary

Consider a two-class, linearly separable classification problem. There are many possible decision boundaries. Are all decision boundaries equally good?

Basic Idea

Notation: boldface x, w denote vectors; x, y, w denote scalars.
Input: training pairs {(x_1, y_1), ...}
Output: a classification function f(x) such that
  f(x_i) > 0 for y_i = +1
  f(x_i) < 0 for y_i = -1
The decision boundary is the set where f(x) = w^T x + b = 0, which in two dimensions reads w_1 x_1 + w_2 x_2 + b = 0.
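To make the notation concrete, here is a minimal sketch (not from the slides) of the classification function f(x) = w^T x + b in Python with NumPy; the weight vector, bias, and test point are made-up values chosen only to illustrate the sign rule.

```python
import numpy as np

def linear_decision(x, w, b):
    """Signed value of f(x) = w^T x + b; the sign gives the predicted class."""
    return np.dot(w, x) + b

# Hypothetical weights, bias, and test point (illustration only).
w = np.array([1.0, -2.0])
b = 0.5
x = np.array([3.0, 1.0])
print(linear_decision(x, w, b))   # 1.5 > 0, so predict y = +1
```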

SVM General Philosophy

[Figure: two separating hyperplanes, one with a small margin and one with a large margin; the support vectors are the points lying on the margin.]

SVM When Data Is Linearly Separable

Let the data D be (X_1, y_1), ..., (X_|D|, y_|D|), where each X_i is a training tuple associated with the class label y_i.
There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data).
SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH).

Large-Margin Decision Boundary

The decision boundary should be as far away from the data of both classes as possible, so we should maximize the margin m. The distance between the origin and the line w^T x = k is k/||w||; consequently, the margin between the two sides of the boundary works out to m = 2/||w||.

SVM: Linearly Separable Case

A separating hyperplane can be written as W . X + b = 0, where W = {w_1, w_2, ..., w_n} is a weight vector and b is a scalar (bias). For 2-D data it can be written as w_0 + w_1 x_1 + w_2 x_2 = 0 (with w_0 playing the role of b).
The hyperplanes defining the sides of the margin are:
  H_1: w_0 + w_1 x_1 + w_2 x_2 >= 1 for y_i = +1, and
  H_2: w_0 + w_1 x_1 + w_2 x_2 <= -1 for y_i = -1
Any training tuples that fall on hyperplane H_1 or H_2 (i.e., on the sides defining the margin) are support vectors.
Finding the MMH is a constrained (convex) quadratic optimization problem: a quadratic objective function with linear constraints, solved with quadratic programming (QP) and Lagrange multipliers.
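As an illustration of the maximum-margin hyperplane, here is a sketch using scikit-learn (not part of the slides); the toy points and the very large C used to approximate the hard-margin case are my own assumptions. The fitted model exposes W, b, the support vectors, and hence the margin 2/||W||.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable toy data.
X = np.array([[1, 1], [2, 2], [2, 0],        # class +1
              [-1, -1], [-2, -2], [-2, 0]])  # class -1
y = np.array([+1, +1, +1, -1, -1, -1])

# A very large C approximates the hard-margin SVM for separable data.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("support vectors:\n", clf.support_vectors_)
print("margin 2/||w|| =", 2 / np.linalg.norm(w))
```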

Finding the Decision Boundary

Let {x_1, ..., x_n} be our data set and let y_i in {+1, -1} be the class label of x_i. The decision boundary should classify all points correctly, i.e., y_i (w^T x_i + b) >= 1 for all i.
The decision boundary can be found by solving the following constrained optimization problem:
  minimize (1/2) ||w||^2
  subject to y_i (w^T x_i + b) >= 1 for i = 1, ..., n
This is a constrained optimization problem: determine w and b, define the hyperplane x^T w + b = 0, and use the decision function f(x) = sign(w^T x + b) to classify new data points.

Constrained Optimization

Suppose we want to minimize f(x) subject to g(x) = 0. A necessary condition for x_0 to be a solution is that the gradient of the Lagrangian L(x, a) = f(x) + a g(x) vanishes at x_0; a is the Lagrange multiplier.
For multiple constraints g_i(x) = 0, i = 1, ..., m, we need a Lagrange multiplier a_i for each of the constraints, and the condition becomes grad f(x_0) + sum_i a_i grad g_i(x_0) = 0.
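The primal problem above can also be handed directly to a generic constrained optimizer. The following is only a sketch, assuming SciPy is available; the toy data are invented, and SLSQP stands in for the specialized QP solvers an SVM package would actually use.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up, linearly separable toy data.
X = np.array([[2.0, 2.0], [2.0, 0.0], [-2.0, -2.0], [-2.0, 0.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])

def objective(p):                    # p = [w1, w2, b]
    w = p[:2]
    return 0.5 * np.dot(w, w)        # (1/2)||w||^2

# One inequality constraint per training point: y_i (w^T x_i + b) - 1 >= 0
constraints = [{"type": "ineq",
                "fun": lambda p, xi=xi, yi=yi: yi * (np.dot(p[:2], xi) + p[2]) - 1.0}
               for xi, yi in zip(X, y)]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "margin =", 2 / np.linalg.norm(w))
```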

Constrained Optimization (continued)

The case of inequality constraints g_i(x) <= 0 is similar, except that the Lagrange multipliers a_i must be non-negative. If x_0 is a solution to the constrained optimization problem, there must exist a_i >= 0 for i = 1, ..., m such that x_0 satisfies
  grad f(x_0) + sum_i a_i grad g_i(x_0) = 0
The function L(x, a) = f(x) + sum_i a_i g_i(x) is also known as the Lagrangian; we want to set its gradient to 0.

Back to the Original Problem

The Lagrangian is
  L = (1/2) w^T w - sum_i a_i [ y_i (w^T x_i + b) - 1 ]
Note that ||w||^2 = w^T w. Setting the gradient of L with respect to w and b to zero, we have
  w = sum_i a_i y_i x_i  and  sum_i a_i y_i = 0

The Dual Problem

If we substitute w = sum_i a_i y_i x_i into the Lagrangian, we have
  L = sum_i a_i - (1/2) sum_i sum_j a_i a_j y_i y_j (x_i^T x_j)
Note that sum_i a_i y_i = 0, so the term involving b vanishes. This is a function of the a_i only.

The new objective function is in terms of the a_i only. It is known as the dual problem: if we know w, we know all a_i; if we know all a_i, we know w. The original problem is known as the primal problem. The objective function of the dual problem needs to be maximized. The dual problem is therefore:
  maximize sum_i a_i - (1/2) sum_i sum_j a_i a_j y_i y_j (x_i^T x_j)
  subject to a_i >= 0 for all i and sum_i a_i y_i = 0
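The dual can be written as a standard QP and handed to an off-the-shelf solver. The sketch below assumes the cvxopt package and reuses the same invented toy points; a tiny ridge is added to the quadratic term purely for numerical stability.

```python
import numpy as np
from cvxopt import matrix, solvers

X = np.array([[2.0, 2.0], [2.0, 0.0], [-2.0, -2.0], [-2.0, 0.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
n = len(y)

K = X @ X.T                                         # Gram matrix of inner products x_i^T x_j
P = matrix(np.outer(y, y) * K + 1e-8 * np.eye(n))   # quadratic term (small ridge for stability)
q = matrix(-np.ones(n))                             # minimize (1/2) a^T P a - sum_i a_i
G = matrix(-np.eye(n))                              # -a_i <= 0, i.e. a_i >= 0
h = matrix(np.zeros(n))
A_eq = matrix(y.reshape(1, -1))                     # sum_i a_i y_i = 0
b_eq = matrix(0.0)

solvers.options["show_progress"] = False
a = np.ravel(solvers.qp(P, q, G, h, A_eq, b_eq)["x"])
w = (a * y) @ X                                     # recover w = sum_i a_i y_i x_i
print("a =", a)
print("w =", w)
```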

The Dual Problem (continued)

This is a quadratic programming (QP) problem, so a global maximum over the a_i can always be found. w can then be recovered as w = sum_i a_i y_i x_i.

Characteristics of the Solution

Many of the a_i are zero, so w is a linear combination of a small number of data points. The x_i with non-zero a_i are called support vectors (SV). The decision boundary is determined only by the SVs.
Let t_j (j = 1, ..., s) be the indices of the s support vectors. We can write
  w = sum_{j=1..s} a_{t_j} y_{t_j} x_{t_j}
For testing with a new data point z, compute
  f(z) = w^T z + b = sum_{j=1..s} a_{t_j} y_{t_j} (x_{t_j}^T z) + b
and classify z as class 1 if the sum is positive, and as class 2 otherwise.
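To see that only the support vectors matter at test time, the sketch below (again scikit-learn on the earlier made-up points, not something from the slides) recomputes f(z) from the stored support vectors, their coefficients a_{t_j} y_{t_j} (exposed as dual_coef_), and b (intercept_), and checks it against the library's own decision value.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 2], [2, 0], [-1, -1], [-2, -2], [-2, 0]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

z = np.array([0.5, 1.5])
# f(z) = sum_j a_tj y_tj (x_tj^T z) + b, using only the support vectors.
f_z = np.sum(clf.dual_coef_[0] * (clf.support_vectors_ @ z)) + clf.intercept_[0]
print(f_z, clf.decision_function([z])[0])   # the two values agree
print("class 1" if f_z > 0 else "class 2")
```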

A Geometrical Interpretation

[Figure: a trained SVM on two classes; most points have a_i = 0, and only the points on the margin have non-zero multipliers, e.g., a_1 = 0.8, a_6 = 1.4, a_8 = 0.6.]

Why Is SVM Effective on High-Dimensional Data?

The complexity of the trained classifier is characterized by the number of support vectors rather than by the dimensionality of the data.
The support vectors are the essential or critical training examples; they lie closest to the decision boundary (MMH).
If all other training examples were removed and training were repeated, the same separating hyperplane would be found.
The number of support vectors can be used to compute an (upper) bound on the expected error rate of the SVM classifier, which is independent of the data dimensionality.
Thus, an SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high.

Extension to a Non-linear Decision Boundary

So far, we have only considered large-margin classifiers with a linear decision boundary. How can they be generalized to become nonlinear?
Key idea: transform x_i to a higher-dimensional space to make life easier.
  Input space: the space where the points x_i are located.
  Feature space: the space of Φ(x_i) after the transformation.
Why transform? A linear operation in the feature space is equivalent to a nonlinear operation in the input space, and classification can become easier with a proper transformation.

Transforming the Data

[Figure: points in the input space are mapped by Φ(.) into the feature space, where a linear boundary separates the two classes.]

Note: in practice the feature space is of higher dimension than the input space. Computation in the feature space can be costly because it is high-dimensional; the feature space may even be infinite-dimensional. The kernel trick comes to the rescue.
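A small sketch of the transformation idea (an assumed example, not one from the slides): points inside a disk versus points outside it cannot be separated by a line in the input space, but after an explicit degree-2 feature map a linear boundary works well.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(np.sum(X**2, axis=1) < 1.0, 1, -1)   # inner disk vs. outer region

def phi(X):
    # Degree-2 feature map Phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
    return np.column_stack([X[:, 0]**2, np.sqrt(2) * X[:, 0] * X[:, 1], X[:, 1]**2])

Phi = phi(X)
acc_input = SVC(kernel="linear").fit(X, y).score(X, y)        # linear boundary in input space
acc_feature = SVC(kernel="linear").fit(Phi, y).score(Phi, y)  # linear boundary in feature space
print(acc_input, acc_feature)   # the feature-space accuracy is noticeably higher
```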

Kernel Functions for Nonlinear Classification

Instead of computing the dot product on the transformed data, it is mathematically equivalent to apply a kernel function K(X_i, X_j) to the original data, i.e., K(X_i, X_j) = Φ(X_i) . Φ(X_j).

Typical Kernel Functions

  Polynomial kernel of degree h: K(X_i, X_j) = (X_i . X_j + 1)^h
  Gaussian radial basis function (RBF) kernel: K(X_i, X_j) = exp(-||X_i - X_j||^2 / (2 sigma^2))
  Sigmoid kernel: K(X_i, X_j) = tanh(kappa X_i . X_j - delta)

SVMs can also be used for classifying multiple (> 2) classes and for regression analysis (with additional parameters).

Modification Due to Kernel Function

Change all inner products to kernel functions. For training, the dual objective becomes:
  Original: maximize sum_i a_i - (1/2) sum_i sum_j a_i a_j y_i y_j (x_i^T x_j)
  With kernel function: maximize sum_i a_i - (1/2) sum_i sum_j a_i a_j y_i y_j K(x_i, x_j)
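The identity K(X_i, X_j) = Φ(X_i) . Φ(X_j) can be checked numerically for the degree-2 case. This small sketch (my own example, not from the slides) uses the homogeneous kernel K(x, z) = (x . z)^2, whose feature map in two dimensions is Φ(x) = (x1^2, sqrt(2) x1 x2, x2^2).

```python
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 homogeneous polynomial kernel (2-D input).
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def poly2_kernel(x, z):
    # The same inner product computed directly in the input space.
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(np.dot(phi(x), phi(z)), poly2_kernel(x, z))   # both equal 1.0
```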

Modification Due to Kernel Function (continued)

For testing, the new data point z is classified as class 1 if f(z) >= 0 and as class 2 if f(z) < 0:
  Original: f(z) = sum_{j=1..s} a_{t_j} y_{t_j} (x_{t_j}^T z) + b
  With kernel function: f(z) = sum_{j=1..s} a_{t_j} y_{t_j} K(x_{t_j}, z) + b

Example

[Figure: value of the discriminant function for a 1-D example with class 1 points at x = 1, 2, 6 and class 2 points at x = 4, 5; a nonlinear kernel yields a discriminant that is positive around the class 1 points and negative around the class 2 points.]
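As a sketch of this example (the kernel choice and parameters below are assumptions; the slides only show the resulting discriminant), a degree-2 polynomial kernel separates the 1-D data with class 1 at x = 1, 2, 6 and class 2 at x = 4, 5, which no linear boundary can.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0], [2.0], [4.0], [5.0], [6.0]])
y = np.array([1, 1, 2, 2, 1])          # class 1 at 1, 2, 6; class 2 at 4, 5

# A degree-2 polynomial kernel; a large C approximates the hard-margin case.
clf = SVC(kernel="poly", degree=2, coef0=1.0, C=1e6).fit(X, y)

print(clf.predict(X))                   # reproduces the training labels: [1 1 2 2 1]
print(clf.predict([[4.5]]))             # a point between 4 and 5 -> class 2
```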

Strengths and Weaknesses of SVM

Strengths:
  Training is relatively easy: there are no local optima, unlike in neural networks.
  It scales relatively well to high-dimensional data.
  The tradeoff between classifier complexity and error can be controlled explicitly.
Weaknesses:
  A good kernel function needs to be chosen.

SVM vs. Neural Network

SVM:
  Deterministic algorithm
  Nice generalization properties
  Harder to train: learned in batch mode using quadratic programming techniques
  Using kernels, can learn very complex functions
Neural Network:
  Nondeterministic algorithm
  Generalizes well but doesn't have a strong mathematical foundation
  Can easily be learned in an incremental fashion
  To learn complex functions, use a multilayer perceptron (nontrivial)

SVM Related Links

SVM website: http://www.kernel-machines.org/
Representative implementations:
  LIBSVM: an efficient implementation of SVM supporting multiclass classification, nu-SVM, and one-class SVM, with interfaces for Java, Python, and other languages
  SVM-light: simpler, but its performance is not better than LIBSVM's; supports only binary classification and is available only in C
  SVM-torch: another recent implementation, also written in C