Linear methods for supervised learning

Linear methods for supervised learning: LDA, logistic regression, naïve Bayes, the perceptron learning algorithm (PLA), maximum margin hyperplanes, soft-margin hyperplanes, least squares regression, ridge regression.

Nonlinear feature maps. Sometimes linear methods (in both regression and classification) just don't work. One way to create nonlinear estimators or classifiers is to first transform the data via a nonlinear feature map $\Phi$. After applying $\Phi$, we can then try applying a linear method to the transformed data.

Classification. This data set is not linearly separable. Consider a suitable nonlinear mapping $\Phi$: the data set is linearly separable after applying this feature map.
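
The slide's specific data set and mapping are not reproduced here, so the following is only an illustrative sketch with choices of my own (the circular labeling rule and the map $\Phi(x_1, x_2) = (x_1^2, x_2^2)$ are assumptions, not the lecture's): points labeled by whether they fall inside a circle cannot be separated by a line in $\mathbb{R}^2$, but a linear classifier does well after the feature map.

import numpy as np
from sklearn.linear_model import LogisticRegression  # any linear classifier would do

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))                # points in the square [-1, 1]^2
y = (X[:, 0]**2 + X[:, 1]**2 < 0.5).astype(int)      # label: inside vs. outside a circle

def phi(X):
    # hypothetical feature map: (x1, x2) -> (x1^2, x2^2)
    return X**2

acc_raw = LogisticRegression().fit(X, y).score(X, y)            # linear fit on raw data
acc_phi = LogisticRegression().fit(phi(X), y).score(phi(X), y)  # linear fit after Phi
print(acc_raw, acc_phi)   # the first is mediocre; the second should be near 1.0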

Issues with nonlinear feature maps. Suppose we transform our data via $\Phi: \mathbb{R}^d \to \mathbb{R}^D$, where $D \gg d$. If $D \ge n$ (more features than training samples), then this can lead to problems with overfitting: you can always find a separating hyperplane. In addition, however, when $D$ is very large, there can be an increased computational burden.

The kernel trick. Fortunately, there is a clever way to get around this computational challenge by exploiting two facts: (1) many machine learning algorithms only involve the data through inner products, and (2) for many interesting feature maps $\Phi$, the function $k(u, v) = \langle \Phi(u), \Phi(v)\rangle$ has a simple, closed-form expression that can be evaluated without explicitly calculating $\Phi(u)$ and $\Phi(v)$ and taking their standard dot product.

Kernel-based classifiers. Algorithms we've seen that only need to compute inner products: nearest-neighbor classifiers (distances can be written entirely in terms of inner products, since $\|u - v\|^2 = \langle u, u\rangle - 2\langle u, v\rangle + \langle v, v\rangle$).
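
As a small sketch of this point (my own example; the kernel and function names are assumptions, using the quadratic kernel introduced on the next slides): the squared distance between mapped points can be computed from kernel evaluations alone.

import numpy as np

def quad_kernel(u, v):
    # the (homogeneous) quadratic kernel discussed on the next slides
    return np.dot(u, v)**2

def feature_space_sq_dist(u, v, k):
    # ||Phi(u) - Phi(v)||^2 = k(u,u) - 2*k(u,v) + k(v,v), no explicit Phi needed
    return k(u, u) - 2 * k(u, v) + k(v, v)

u, v = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(feature_space_sq_dist(u, v, quad_kernel))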

Kernel-based classifiers. Algorithms we've seen that only need to compute inner products: maximum margin hyperplanes. It is a fact that the optimal $w$ can be expressed as $w = \sum_{i=1}^{n} \alpha_i x_i$ for some choice of coefficients $\alpha_1, \dots, \alpha_n$.
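
One consequence (a standard observation, stated here for completeness rather than copied from the slide): once $w$ has this form, the classifier touches the data only through inner products, so after a feature map $\Phi$ it can be evaluated with the kernel alone:
\[
\langle w, \Phi(x)\rangle + b \;=\; \Big\langle \sum_{i=1}^{n} \alpha_i \Phi(x_i),\ \Phi(x)\Big\rangle + b \;=\; \sum_{i=1}^{n} \alpha_i\, k(x_i, x) + b .
\]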

Quadratic kernel: $k(u, v) = (\langle u, v\rangle)^2$ (the homogeneous quadratic kernel).

Quadratic kernel. What is $\Phi$, and what is the dimension of the corresponding feature space?
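
A quick numerical check of the usual answer (this worked example is mine and assumes the homogeneous quadratic kernel above): for $u \in \mathbb{R}^d$, taking $\Phi(u)$ to be the vector of all $d^2$ pairwise products $u_i u_j$ gives $\langle \Phi(u), \Phi(v)\rangle = \sum_{i,j} u_i u_j v_i v_j = (\langle u, v\rangle)^2$.

import numpy as np

def phi(u):
    # all d^2 degree-2 monomials u_i * u_j, flattened into one vector
    return np.outer(u, u).ravel()

rng = np.random.default_rng(1)
u, v = rng.standard_normal(5), rng.standard_normal(5)

lhs = np.dot(u, v)**2              # kernel evaluation: O(d) work
rhs = np.dot(phi(u), phi(v))       # explicit feature map: O(d^2) work
print(np.isclose(lhs, rhs))        # True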

Nonhomogeneous quadratic kernel: $k(u, v) = (1 + \langle u, v\rangle)^2$. In this case, the corresponding $\Phi$ is similar to the homogeneous quadratic kernel's, but it also replicates the original (degree-1) coordinates and a constant feature.
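
For concreteness (a standard expansion, not copied from the slide):
\[
(1 + \langle u, v\rangle)^2 \;=\; 1 + 2\sum_{i} u_i v_i + \sum_{i,j} u_i u_j\, v_i v_j ,
\]
so one valid feature map is $\Phi(u) = \bigl(1,\ \sqrt{2}\,u_1, \dots, \sqrt{2}\,u_d,\ \{u_i u_j\}_{i,j=1}^{d}\bigr)$: the degree-2 monomials from the homogeneous case plus (scaled) copies of the original coordinates and a constant.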

Polynomial kernels. In general, the degree-$p$ polynomial kernel is $k(u, v) = (1 + \langle u, v\rangle)^p$; by the multinomial theorem, the corresponding $\Phi(u)$ consists of (scaled) monomials in $u$ of degree at most $p$.

Inner product kernel. Definition: an inner product kernel is a mapping $k: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ for which there exists an inner product space $\mathcal{F}$ and a mapping $\Phi: \mathbb{R}^d \to \mathcal{F}$ such that $k(u, v) = \langle \Phi(u), \Phi(v)\rangle_{\mathcal{F}}$ for all $u, v \in \mathbb{R}^d$. Given a function $k$, how can we tell when it is an inner product kernel? Mercer's theorem; the positive semidefinite property.

Positive semidefinite kernels. We say that a symmetric function $k$ is a positive semidefinite kernel if for all $n$ and all $x_1, \dots, x_n$, the Gram matrix $K$ defined by $K_{ij} = k(x_i, x_j)$ is positive semidefinite, i.e., $\alpha^{T} K \alpha \ge 0$ for all $\alpha \in \mathbb{R}^n$. Theorem: $k$ is an inner product kernel if and only if $k$ is a positive semidefinite kernel. Proof: (future homework).
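
A quick empirical sanity check of the definition (my own sketch; it samples random points and inspects the Gram matrix's eigenvalues, which is evidence for one particular sample rather than a proof):

import numpy as np

def quad_kernel(u, v):
    # inhomogeneous quadratic kernel
    return (1.0 + np.dot(u, v))**2

rng = np.random.default_rng(2)
X = rng.standard_normal((30, 4))     # 30 random points in R^4

K = np.array([[quad_kernel(xi, xj) for xj in X] for xi in X])  # Gram matrix
eigvals = np.linalg.eigvalsh(K)      # K is symmetric, so use eigvalsh
print(eigvals.min() >= -1e-8)        # all eigenvalues are (numerically) nonnegative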

Examples: polynomial kernels. Homogeneous polynomial kernel: $k(u, v) = (\langle u, v\rangle)^p$, which maps to the set of all monomials of degree exactly $p$. Inhomogeneous polynomial kernel: $k(u, v) = (1 + \langle u, v\rangle)^p$, which maps to the set of all monomials of degree at most $p$.

Examples: Gaussian/RBF kernels. Gaussian / radial basis function (RBF) kernel: $k(u, v) = \exp\!\bigl(-\|u - v\|^2 / (2\sigma^2)\bigr)$. One can show that this is a positive semidefinite kernel, but what is $\Phi$? The corresponding feature space is infinite dimensional!
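
A small sketch of evaluating this kernel on a data set (the bandwidth and names are my own choices): the full Gram matrix can be computed from pairwise squared distances, i.e., from inner products alone, with no feature map in sight.

import numpy as np

def rbf_gram(X, sigma=1.0):
    # pairwise squared distances via ||xi - xj||^2 = <xi,xi> - 2<xi,xj> + <xj,xj>
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] - 2 * X @ X.T + sq_norms[None, :]
    return np.exp(-sq_dists / (2 * sigma**2))

X = np.random.default_rng(3).standard_normal((5, 3))
K = rbf_gram(X)
print(np.allclose(np.diag(K), 1.0))   # k(x, x) = 1 for the RBF kernel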

Kernels in action: SVMs. In order to understand both (1) how to kernelize the maximum margin optimization problem and (2) how to actually solve this optimization problem with a practical algorithm, we will need to spend a bit of time learning about constrained optimization. This stuff is super useful, even outside of this context.

Constrained optimization. A general constrained optimization problem has the form: minimize $f(x)$ subject to $g_i(x) \le 0$ for $i = 1, \dots, m$ and $h_j(x) = 0$ for $j = 1, \dots, p$, where $x \in \mathbb{R}^d$. We call $f$ the objective function. If $x$ satisfies all the constraints, we say that $x$ is feasible. We will assume that $f$ is defined for all feasible $x$.
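
As an aside, problems of this form can also be handed to an off-the-shelf solver; the toy problem below (my own example: minimize $x_1^2 + x_2^2$ subject to $x_1 + x_2 \ge 1$, i.e., $g(x) = 1 - x_1 - x_2 \le 0$) is only meant to make the notation concrete. Note that SciPy's "ineq" convention is $\text{fun}(x) \ge 0$, so the constraint is passed as $x_1 + x_2 - 1 \ge 0$.

import numpy as np
from scipy.optimize import minimize

f = lambda x: x[0]**2 + x[1]**2                                 # objective
cons = [{"type": "ineq", "fun": lambda x: x[0] + x[1] - 1.0}]   # x1 + x2 - 1 >= 0

res = minimize(f, x0=np.zeros(2), method="SLSQP", constraints=cons)
print(res.x, res.fun)   # expect roughly [0.5, 0.5] and 0.5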

The Lagrangian. It is possible to convert a constrained optimization problem into an unconstrained one via the Lagrangian function. In this case, the Lagrangian is given by $L(x, \lambda, \nu) = f(x) + \sum_{i=1}^{m} \lambda_i g_i(x) + \sum_{j=1}^{p} \nu_j h_j(x)$.

Lagrangian duality. The Lagrangian provides a way to bound or solve the original optimization problem via a different optimization problem. In the Lagrangian $L(x, \lambda, \nu)$, the variables $\lambda$ and $\nu$ are called Lagrange multipliers or dual variables. The (Lagrange) dual function is $d(\lambda, \nu) = \inf_{x} L(x, \lambda, \nu)$.

The dual optimization problem. We can then define the dual optimization problem: maximize $d(\lambda, \nu)$ subject to $\lambda \ge 0$. Why do we constrain $\lambda \ge 0$? Interesting fact: no matter the choice of $f$, $g_i$, or $h_j$, the dual function $d$ is always concave (it is a pointwise infimum of functions that are affine in $(\lambda, \nu)$).
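
As a worked example (mine, not from the slides): take the one-dimensional problem of minimizing $x^2$ subject to $x \ge 1$, i.e., $g(x) = 1 - x \le 0$. Then
\[
L(x, \lambda) = x^2 + \lambda(1 - x), \qquad
d(\lambda) = \inf_x L(x, \lambda) = \lambda - \frac{\lambda^2}{4} \quad (\text{attained at } x = \lambda/2),
\]
which is concave in $\lambda$; maximizing it over $\lambda \ge 0$ gives $\lambda^\star = 2$ with $d(\lambda^\star) = 1$, matching the primal optimum $x^\star = 1$, $f(x^\star) = 1$.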

The primal optimization problem. The primal function is $p(x) = \max_{\lambda \ge 0,\, \nu} L(x, \lambda, \nu)$, and the primal optimization problem is to minimize $p(x)$ over $x$. Contrast this with the dual optimization problem, which swaps the order of the max and the min.

What is so primal about the primal? Suppose $x$ is feasible: then $g_i(x) \le 0$ and $h_j(x) = 0$, so the maximizing choice of the multipliers gives $p(x) = f(x)$. Suppose $x$ is not feasible: then some constraint is violated and the max can be driven to $+\infty$, so $p(x) = \infty$. The primal therefore encompasses the original problem (i.e., they have the same solution), yet it is unconstrained.

Weak duality. Theorem: $d^\star \le p^\star$, i.e., the optimal dual value lower bounds the optimal primal value. Proof: let $x$ be feasible. Then for any $\lambda \ge 0$ and $\nu$, $L(x, \lambda, \nu) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \nu_j h_j(x) \le f(x)$, since $\lambda_i \ge 0$, $g_i(x) \le 0$, and $h_j(x) = 0$. Hence $d(\lambda, \nu) = \inf_{x'} L(x', \lambda, \nu) \le L(x, \lambda, \nu) \le f(x)$.

Weak duality (continued). Thus, for any $\lambda \ge 0$ and $\nu$, we have $d(\lambda, \nu) \le \min_{x\ \text{feasible}} f(x) = p^\star$. Since this holds for any $(\lambda, \nu)$ with $\lambda \ge 0$, we can take the max to obtain $d^\star = \max_{\lambda \ge 0,\, \nu} d(\lambda, \nu) \le p^\star$.

Duality gap. We have just shown that for any optimization problem, we can solve the dual (which is always a concave maximization problem) and obtain a lower bound on $p^\star$ (the optimal value of the objective function for the original problem). The difference $p^\star - d^\star$ is called the duality gap. In general, all we can say about the duality gap is that $p^\star - d^\star \ge 0$, but sometimes we are lucky. If $p^\star = d^\star$, we say that strong duality holds.

The KKT conditions. Assume that $f$, the $g_i$, and the $h_j$ are all differentiable. The Karush-Kuhn-Tucker (KKT) conditions are a set of properties that must hold under strong duality. The KKT conditions will allow us to translate between solutions of the primal and dual optimization problems.

The KKT conditions.
1. Stationarity: $\nabla_x L(x^\star, \lambda^\star, \nu^\star) = \nabla f(x^\star) + \sum_i \lambda_i^\star \nabla g_i(x^\star) + \sum_j \nu_j^\star \nabla h_j(x^\star) = 0$
2. Primal feasibility (inequalities): $g_i(x^\star) \le 0$ for all $i$
3. Primal feasibility (equalities): $h_j(x^\star) = 0$ for all $j$
4. Dual feasibility: $\lambda_i^\star \ge 0$ for all $i$
5. Complementary slackness: $\lambda_i^\star\, g_i(x^\star) = 0$ for all $i$
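
Continuing the worked example from above (again my example, not the lecture's): for $\min x^2$ subject to $1 - x \le 0$, the point $x^\star = 1$ with multiplier $\lambda^\star = 2$ satisfies all five conditions: 1. $2x^\star - \lambda^\star = 2 - 2 = 0$; 2. $1 - x^\star = 0 \le 0$; 3. holds vacuously (no equality constraints); 4. $\lambda^\star = 2 \ge 0$; 5. $\lambda^\star(1 - x^\star) = 0$.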

KKT conditions and convex problems. Theorem: if the original problem is convex, meaning that $f$ and the $g_i$ are convex and the $h_j$ are affine, and $(x^\star, \lambda^\star, \nu^\star)$ satisfy the KKT conditions, then $x^\star$ is primal optimal, $(\lambda^\star, \nu^\star)$ is dual optimal, and the duality gap is zero ($p^\star = d^\star$). Moreover, if $p^\star = d^\star$, $x^\star$ is primal optimal, and $(\lambda^\star, \nu^\star)$ is dual optimal, then the KKT conditions hold. Proof: see notes.

KKT conditions: necessity. Theorem: if $p^\star = d^\star$, $x^\star$ is primal optimal, and $(\lambda^\star, \nu^\star)$ is dual optimal, then the KKT conditions hold. Proof: 2 and 3 hold because $x^\star$ must be feasible. 4 holds by the definition of the dual problem. To prove 5, note that from strong duality we have $f(x^\star) = d(\lambda^\star, \nu^\star) = \min_x L(x, \lambda^\star, \nu^\star) \le L(x^\star, \lambda^\star, \nu^\star) = f(x^\star) + \sum_i \lambda_i^\star g_i(x^\star) + \sum_j \nu_j^\star h_j(x^\star) \le f(x^\star)$ (by 2, 3, and 4).

KKT conditions: necessity (continued). This shows that the two inequalities are actually equalities. Hence the equality of the last two lines implies $\sum_i \lambda_i^\star g_i(x^\star) = 0$, and since each term is nonpositive, $\lambda_i^\star g_i(x^\star) = 0$ for every $i$, which is 5. Equality of the 2nd and 3rd lines implies that $x^\star$ is a minimizer of $L(x, \lambda^\star, \nu^\star)$ with respect to $x$, and hence $\nabla_x L(x^\star, \lambda^\star, \nu^\star) = 0$, which establishes 1.

KKT conditions: sufficiency. Theorem: if the original problem is convex, meaning that $f$ and the $g_i$ are convex and the $h_j$ are affine, and $(x^\star, \lambda^\star, \nu^\star)$ satisfy the KKT conditions, then $x^\star$ is primal optimal, $(\lambda^\star, \nu^\star)$ is dual optimal, and the duality gap is zero. Proof: 2 and 3 imply that $x^\star$ is feasible. 4 implies that $L(x, \lambda^\star, \nu^\star)$ is convex in $x$, and so 1 means that $x^\star$ is a minimizer of $L(x, \lambda^\star, \nu^\star)$. Thus $d(\lambda^\star, \nu^\star) = L(x^\star, \lambda^\star, \nu^\star) = f(x^\star) + \sum_i \lambda_i^\star g_i(x^\star) + \sum_j \nu_j^\star h_j(x^\star) = f(x^\star)$ (by feasibility and 5).

KKT conditions: sufficiency (continued). It follows that $f(x^\star) = d(\lambda^\star, \nu^\star) \le d^\star \le p^\star \le f(x^\star)$, where the last step holds because $x^\star$ is feasible. But earlier we proved weak duality, and hence it must actually be that all of these are equal. Thus we have zero duality gap, $x^\star$ is primal optimal, and $(\lambda^\star, \nu^\star)$ is dual optimal.

KKT conditions: the bottom line. If a constrained optimization problem is differentiable and convex, then the KKT conditions are necessary and sufficient for primal/dual optimality (with zero duality gap). In this case, we can use the KKT conditions to find a solution to our optimization problem; i.e., if we find $(x^\star, \lambda^\star, \nu^\star)$ satisfying the conditions, we have found solutions to both the primal and dual problems.
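
To close the loop numerically (a sketch using the same toy problem as above, which is my example rather than the lecture's): the code below solves the primal with a generic constrained solver, maximizes the closed-form dual $d(\lambda) = \lambda - \lambda^2/4$, and confirms that the two optimal values coincide, as strong duality predicts for this convex problem.

import numpy as np
from scipy.optimize import minimize, minimize_scalar

# Primal: minimize x^2 subject to x >= 1  (SciPy's 'ineq' means fun(x) >= 0)
primal = minimize(lambda x: x[0]**2, x0=[0.0], method="SLSQP",
                  constraints=[{"type": "ineq", "fun": lambda x: x[0] - 1.0}])

# Dual: maximize d(lambda) = lambda - lambda^2/4 over lambda >= 0
dual = minimize_scalar(lambda lam: -(lam - lam**2 / 4.0),
                       bounds=(0.0, 10.0), method="bounded")

p_star, d_star = primal.fun, -dual.fun
print(p_star, d_star)                 # both approximately 1.0
print(abs(p_star - d_star) < 1e-6)    # zero duality gap (up to solver tolerance)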