CPSC 340: Machine Learning and Data Mining. Kernel Trick Fall 2017

Similar documents
CPSC 340: Machine Learning and Data Mining. More Linear Classifiers Fall 2017

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016

CPSC 340: Machine Learning and Data Mining. Logistic Regression Fall 2016

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2017

CPSC 340: Machine Learning and Data Mining. Robust Regression Fall 2015

CPSC 340: Machine Learning and Data Mining. Finding Similar Items Fall 2017

CPSC 340: Machine Learning and Data Mining. Multi-Dimensional Scaling Fall 2017

CPSC 340: Machine Learning and Data Mining. More Regularization Fall 2017

CPSC 340: Machine Learning and Data Mining. Regularization Fall 2016

CPSC 340: Machine Learning and Data Mining

CPSC 340: Machine Learning and Data Mining. Non-Linear Regression Fall 2016

CPSC 340: Machine Learning and Data Mining. Recommender Systems Fall 2017

CPSC 340: Machine Learning and Data Mining. Multi-Class Classification Fall 2017

Linear methods for supervised learning

CPSC 340: Machine Learning and Data Mining. Non-Parametric Models Fall 2016

CPSC 340: Machine Learning and Data Mining. Hierarchical Clustering Fall 2016

Support vector machines

Lecture 7: Support Vector Machine

Module 4. Non-linear machine learning econometrics: Support Vector Machine

Behavioral Data Mining. Lecture 10 Kernel methods and SVMs

1 Training/Validation/Testing

CPSC 340: Machine Learning and Data Mining

All lecture slides will be available at CSC2515_Winter15.html

CS 229 Midterm Review

Support vector machine (II): non-linear SVM. LING 572 Fei Xia

Kernel Methods & Support Vector Machines

Lecture 3: Linear Classification

CPSC 340: Machine Learning and Data Mining. Density-Based Clustering Fall 2016

Machine Learning: Think Big and Parallel

CPSC 340: Machine Learning and Data Mining. Outlier Detection Fall 2016

Generative and discriminative classification techniques

Content-based image and video analysis. Machine learning

Supervised vs unsupervised clustering

Support Vector Machines

CPSC 340: Machine Learning and Data Mining. Density-Based Clustering Fall 2017

Machine Learning / Jan 27, 2010

Missing Data. Where did it go?

Lecture 9: Support Vector Machines

Data Mining in Bioinformatics Day 1: Classification

Introduction to Computer Vision

CPSC 340: Machine Learning and Data Mining. Feature Selection Fall 2017

Integrated Math I. IM1.1.3 Understand and use the distributive, associative, and commutative properties.

CSE 417T: Introduction to Machine Learning. Lecture 22: The Kernel Trick. Henry Chai 11/15/18

Support Vector Machines

CPSC 340: Machine Learning and Data Mining. Deep Learning Fall 2018

Neural Networks and Deep Learning

Kernels + K-Means Introduction to Machine Learning. Matt Gormley Lecture 29 April 25, 2018

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet.

More on Learning. Neural Nets Support Vectors Machines Unsupervised Learning (Clustering) K-Means Expectation-Maximization

Support Vector Machines.

CPSC 340: Machine Learning and Data Mining. Non-Parametric Models Fall 2016

Support Vector Machines

Lecture 27: Review. Reading: All chapters in ISLR. STATS 202: Data mining and analysis. December 6, 2017

CPSC 340: Machine Learning and Data Mining. Feature Selection Fall 2016

Machine Learning for Signal Processing Clustering. Bhiksha Raj Class Oct 2016

Network Traffic Measurements and Analysis

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu

Data Mining: Concepts and Techniques. Chapter 9 Classification: Support Vector Machines. Support Vector Machines (SVMs)

Data Analysis 3. Support Vector Machines. Jan Platoš October 30, 2017

Divide and Conquer Kernel Ridge Regression

Introduction to object recognition. Slides adapted from Fei-Fei Li, Rob Fergus, Antonio Torralba, and others

Support Vector Machines + Classification for IR

CPSC 340: Machine Learning and Data Mining. Probabilistic Classification Fall 2017

CSE446: Linear Regression. Spring 2017

EE 511 Linear Regression

Problem 1: Complexity of Update Rules for Logistic Regression

Chakra Chennubhotla and David Koes

Deep Learning for Computer Vision

Machine Learning. Topic 5: Linear Discriminants. Bryan Pardo, EECS 349 Machine Learning, 2013

3 Nonlinear Regression

Support vector machines

Basis Functions. Volker Tresp Summer 2017

Overview Citation. ML Introduction. Overview Schedule. ML Intro Dataset. Introduction to Semi-Supervised Learning Review 10/4/2010

Part I. Graphical exploratory data analysis. Graphical summaries of data. Graphical summaries of data

1 Case study of SVM (Rob)

CPSC 340: Machine Learning and Data Mining. Ranking Fall 2016

A Brief Look at Optimization

Machine Learning Techniques for Data Mining

Machine Learning for NLP

CPSC 340: Machine Learning and Data Mining. Finding Similar Items Fall 2018

CS6220: DATA MINING TECHNIQUES

Support Vector Machines

Distribution-free Predictive Approaches

CPSC 340: Machine Learning and Data Mining. Outlier Detection Fall 2018

6 Model selection and kernels

Note Set 4: Finite Mixture Models and the EM Algorithm

CSE 446 Bias-Variance & Naïve Bayes

No more questions will be added

Support Vector Machines

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

Mathematics of Data. INFO-4604, Applied Machine Learning University of Colorado Boulder. September 5, 2017 Prof. Michael Paul

Density estimation. In density estimation problems, we are given a random from an unknown density. Our objective is to estimate

3 Nonlinear Regression

COMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS18. Lecture 2: Linear Regression Gradient Descent Non-linear basis functions

Density estimation. In density estimation problems, we are given a random from an unknown density. Our objective is to estimate

Preface to the Second Edition. Preface to the First Edition. 1 Introduction 1

10. Support Vector Machines

Topic 4: Support Vector Machines

Slides for Data Mining by I. H. Witten and E. Frank

CSE 158. Web Mining and Recommender Systems. Midterm recap

Transcription:

CPSC 340: Machine Learning and Data Mining. Kernel Trick Fall 2017

Admin Assignment 3: Due Friday. Midterm: Can view your exam during instructor office hours or after class this week.

Digression: the Other Normal Equations Recall the L2-regularized least squares objective: f(w) = (1/2)||Xw - y||^2 + (λ/2)||w||^2. We showed that the minimum is given by w = (X^T X + λI)^{-1} X^T y (in practice you still solve the linear system, since the inverse can be numerically unstable; see CPSC 302). With some work (bonus), this can equivalently be written as w = X^T (XX^T + λI)^{-1} y. This form is faster if n << d: the cost is O(n^2 d + n^3) instead of O(nd^2 + d^3).
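
As a quick sanity check (this snippet is my own illustration, not from the slides; `ridge_primal` and `ridge_dual` are made-up names), the two forms can be compared directly in NumPy:

```python
import numpy as np

def ridge_primal(X, y, lam):
    # w = (X^T X + lam*I)^{-1} X^T y: solve a d-by-d system, cost O(nd^2 + d^3)
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def ridge_dual(X, y, lam):
    # w = X^T (X X^T + lam*I)^{-1} y: solve an n-by-n system, cost O(n^2 d + n^3)
    n = X.shape[0]
    return X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 500))  # n << d, so the second form is much cheaper
y = rng.standard_normal(20)
print(np.allclose(ridge_primal(X, y, 1.0), ridge_dual(X, y, 1.0)))  # True
```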

Gram Matrix The matrix XX^T is called the Gram matrix K. K contains the inner products between all training examples. Similar to Z in RBFs, but using the dot product as the similarity instead of the distance.

Support Vector Machines for Non-Separable What about data that is not even close to separable? It may be separable under a change of basis (or closer to separable). http://math.stackexchange.com/questions/353607/how-do-inner-product-space-determine-half-planes

Multi-Dimensional Polynomial Basis Recall fitting polynomials when we only have 1 feature: ŷ_i = w_0 + w_1 x_i + w_2 (x_i)^2 + ... + w_p (x_i)^p. We can fit these models using a change of basis: z_i = (1, x_i, (x_i)^2, ..., (x_i)^p). How can we do this when we have a lot of features?

Multi-Dimensional Polynomial Basis Polynomial basis for d=2 and p=2: z_i = (1, x_{i1}, x_{i2}, (x_{i1})^2, x_{i1}x_{i2}, (x_{i2})^2). With d=4 and p=3, the polynomial basis would include: Bias variable and the x_{ij}: 1, x_{i1}, x_{i2}, x_{i3}, x_{i4}. The x_{ij} squared and cubed: (x_{i1})^2, (x_{i2})^2, (x_{i3})^2, (x_{i4})^2, (x_{i1})^3, (x_{i2})^3, (x_{i3})^3, (x_{i4})^3. Two-term interactions: x_{i1}x_{i2}, x_{i1}x_{i3}, x_{i1}x_{i4}, x_{i2}x_{i3}, x_{i2}x_{i4}, x_{i3}x_{i4}. Cubic interactions: x_{i1}x_{i2}x_{i3}, x_{i1}x_{i2}x_{i4}, x_{i1}x_{i3}x_{i4}, x_{i2}x_{i3}x_{i4}, (x_{i1})^2 x_{i2}, (x_{i1})^2 x_{i3}, (x_{i1})^2 x_{i4}, x_{i1}(x_{i2})^2, (x_{i2})^2 x_{i3}, (x_{i2})^2 x_{i4}, x_{i1}(x_{i3})^2, x_{i2}(x_{i3})^2, (x_{i3})^2 x_{i4}, x_{i1}(x_{i4})^2, x_{i2}(x_{i4})^2, x_{i3}(x_{i4})^2.
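
These terms can be enumerated mechanically. Here is a small sketch (my own, not course code; `polynomial_basis_terms` is a made-up helper) that lists all monomials of total degree at most p over d features:

```python
from itertools import combinations_with_replacement

def polynomial_basis_terms(d, p):
    """Monomials of total degree <= p, as tuples of feature indices (empty tuple = bias term)."""
    terms = []
    for degree in range(p + 1):
        terms.extend(combinations_with_replacement(range(1, d + 1), degree))
    return terms

basis = polynomial_basis_terms(d=4, p=3)
print(len(basis))  # 35 terms for d=4 and p=3
for term in basis:
    print("1" if not term else " * ".join(f"x_i{j}" for j in term))
```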

Kernel Trick If we go to degree p=5, we'll have O(d^5) quintic terms. For large d and p, we can't even store Z or w. But, even though the dimension k of the basis is exponential in d and p, for medium n we can use this basis efficiently with the kernel trick. Basic idea: We can sometimes efficiently compute the dot product z_i^T z_j directly from x_i and x_j. Use this to form the Gram matrix ZZ^T and make predictions.

Kernel Trick Given test data X̃, predict ŷ by forming Z̃ and using: ŷ = Z̃Z^T(ZZ^T + λI)^{-1} y = K̃(K + λI)^{-1} y, where K = ZZ^T and K̃ = Z̃Z^T. Key observation behind the kernel trick: Predictions ŷ only depend on the features through K and K̃. If we have a function that computes K and K̃, we don't need the z_i values.

Kernel Trick K contains the inner products between all training examples. Intuition: the inner product can be viewed as a measure of similarity, so this matrix gives a similarity between each pair of examples. K̃ contains the inner products between training and test examples. Kernel trick: I want to use a basis z_i that is too huge to store (very large k). But I only need the z_i to compute the Gram matrices K = ZZ^T and K̃ = Z̃Z^T. The sizes of these matrices are independent of k. Everything we need to know about the z_i is summarized by the n^2 values z_i^T z_j. I can use this basis if I have a kernel function that computes k(x_i,x_j) = z_i^T z_j. I don't need to compute the k-dimensional basis z_i explicitly.

Example: Degree-2 Kernel Consider two examples x_i and x_j for a 2-dimensional dataset: x_i = (x_{i1}, x_{i2}) and x_j = (x_{j1}, x_{j2}). And consider a particular degree-2 basis: z_i = ((x_{i1})^2, √2 x_{i1}x_{i2}, (x_{i2})^2). We can compute the inner product z_i^T z_j without forming z_i and z_j: z_i^T z_j = (x_{i1})^2(x_{j1})^2 + 2 x_{i1}x_{i2}x_{j1}x_{j2} + (x_{i2})^2(x_{j2})^2 = (x_{i1}x_{j1} + x_{i2}x_{j2})^2 = (x_i^T x_j)^2.
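
A quick numerical check of this identity (my own snippet, not from the lecture), using the degree-2 basis above:

```python
import numpy as np

def degree2_basis(x):
    # z = (x_1^2, sqrt(2)*x_1*x_2, x_2^2) for a 2-dimensional example x
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x_i, x_j = np.array([0.5, -1.0]), np.array([2.0, 3.0])
print(degree2_basis(x_i) @ degree2_basis(x_j))  # inner product in the explicit basis
print((x_i @ x_j) ** 2)                         # same value, computed from x_i and x_j only
```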

Polynomial Kernel with Higher Degrees Let's add a bias and linear terms to our degree-2 basis: z_i = (1, √2 x_{i1}, √2 x_{i2}, (x_{i1})^2, √2 x_{i1}x_{i2}, (x_{i2})^2). I can compute inner products using: z_i^T z_j = (1 + x_i^T x_j)^2.

Polynomial Kernel with Higher Degrees To get all degree-4 monomials I can use z_i^T z_j = (x_i^T x_j)^4. To also get lower-order terms, use z_i^T z_j = (1 + x_i^T x_j)^4. The general degree-p polynomial kernel function: k(x_i, x_j) = (1 + x_i^T x_j)^p. Works for any number of features d. But the cost of computing one z_i^T z_j is O(d) instead of O(d^p). Take-home message: I can compute dot products without forming the features.
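
As a one-function sketch (mine, with the made-up name `poly_kernel`), the degree-p polynomial kernel costs only one inner product plus a power:

```python
import numpy as np

def poly_kernel(x_i, x_j, p):
    # k(x_i, x_j) = (1 + x_i^T x_j)^p: O(d) work, no O(d^p) basis is ever formed
    return (1.0 + x_i @ x_j) ** p

x_i, x_j = np.ones(1000), np.arange(1000) / 1000.0
print(poly_kernel(x_i, x_j, p=5))
```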

Kernel Trick with Polynomials Using a polynomial basis of degree p with the kernel trick: Compute K and K̃ using K_{ij} = (1 + x_i^T x_j)^p and K̃_{ij} = (1 + x̃_i^T x_j)^p. Make predictions using ŷ = K̃(K + λI)^{-1} y. Training cost is only O(n^2 d + n^3), despite using k = O(d^p) features: we can form K in O(n^2 d), and we need to invert an n x n matrix. Testing cost is only O(ndt), the cost to form K̃.
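
Putting the pieces together, here is a minimal kernel ridge regression sketch under the assumptions above (my own illustration with made-up names like `poly_gram`, not the course code):

```python
import numpy as np

def poly_gram(A, B, p):
    # Gram matrix with entries (1 + a_i^T b_j)^p, formed in O(|A| |B| d)
    return (1.0 + A @ B.T) ** p

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))       # n = 50 training examples, d = 3 features
Xtest = rng.standard_normal((10, 3))   # t = 10 test examples
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)

lam, p = 1.0, 3
K = poly_gram(X, X, p)                        # n x n
K_tilde = poly_gram(Xtest, X, p)              # t x n
u = np.linalg.solve(K + lam * np.eye(50), y)  # training: O(n^2 d + n^3)
yhat = K_tilde @ u                            # prediction: yhat = K_tilde (K + lam I)^{-1} y
print(yhat.shape)  # (10,)
```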

Linear Regression vs. Kernel Regression

Motivation: Finding Gold Kernel methods first came from mining engineering ("Kriging"): A mining company wants to find gold. Drill holes, measure gold content. Build a kernel regression model (typically using RBF kernels). http://www.bisolutions.us/a-brief-introduction-to-spatial-interpolation.php

Gaussian-RBF Kernel The most common kernel is the Gaussian RBF kernel: k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2σ^2)). Same formula and behaviour as the RBF basis, but not equivalent: before we used RBFs as a basis, now we're using them as an inner product. The basis z_i giving the Gaussian RBF kernel is infinite-dimensional: if d=1 and σ=1, it corresponds to a basis containing scaled polynomial terms of every degree, each multiplied by a Gaussian factor in x_i (bonus slide).
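
A vectorized sketch of the Gaussian RBF Gram matrix (my own code; it assumes the exp(-||x_i - x_j||^2 / (2σ^2)) parameterization written above):

```python
import numpy as np

def rbf_gram(A, B, sigma):
    # K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2)), via expanded squared distances
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    # clip tiny negative values caused by floating-point round-off
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))
```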

Kernel Trick for Non-Vector Data Consider data where examples don't come as rows of a feature matrix, but instead are objects like text strings or graphs. The kernel trick lets us fit regression models without explicit features: we can interpret k(x_i,x_j) as a similarity between objects x_i and x_j. We don't need features if we can compute a similarity between objects. There are string kernels, image kernels, graph kernels, and so on.
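
As a toy illustration of a kernel on non-vector data (entirely my own example, not from the lecture), here is a string similarity that counts shared length-3 substrings; it is the inner product of implicit substring-count vectors, so no feature vectors are ever built explicitly:

```python
from collections import Counter

def kmer_kernel(s1, s2, k=3):
    # Inner product of the two strings' k-mer count vectors, without forming those vectors as features.
    c1 = Counter(s1[i:i + k] for i in range(len(s1) - k + 1))
    c2 = Counter(s2[i:i + k] for i in range(len(s2) - k + 1))
    return sum(count * c2[kmer] for kmer, count in c1.items())

print(kmer_kernel("machine learning", "data mining"))
```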

Valid Kernels What kernel functions k(x_i,x_j) can we use? The kernel k must be an inner product in some space: there must exist a mapping from x_i to some z_i such that k(x_i,x_j) = z_i^T z_j. It can be hard to show that a function satisfies this (it involves an infinite-dimensional eigenvalue equation). But, like convex functions, there are some simple rules for constructing valid kernels from other valid kernels (bonus slide).

Kernel Trick for Other Methods Besides L2-regularized least squares, when can we use kernels? We can compute Euclidean distances with kernels: ||z_i - z_j||^2 = k(x_i,x_i) - 2k(x_i,x_j) + k(x_j,x_j). So all of our distance-based methods have kernel versions: Kernel k-nearest neighbours. Kernel clustering with k-means (allows non-convex clusters). Kernel density-based clustering. Kernel hierarchical clustering. Kernel distance-based outlier detection. Kernel Amazon Product Recommendation.
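
Given any kernel function, the distance used by these methods can be computed as in this short sketch (my own helper, `kernel_sq_dist`):

```python
import numpy as np

def kernel_sq_dist(x_i, x_j, k):
    # ||z_i - z_j||^2 = k(x_i, x_i) - 2*k(x_i, x_j) + k(x_j, x_j), without forming z_i or z_j
    return k(x_i, x_i) - 2.0 * k(x_i, x_j) + k(x_j, x_j)

# Example with a degree-3 polynomial kernel:
poly3 = lambda a, b: (1.0 + a @ b) ** 3
print(kernel_sq_dist(np.array([1.0, 0.0]), np.array([0.0, 1.0]), poly3))
```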

Kernel Trick for Other Methods Besides L2-regularized least squares, when can we use kernels? Representer theorems (bonus slide) have shown that any L2-regularized linear model can be kernelized: L2-regularized robust regression. L2-regularized brittle regression. L2-regularized logistic regression. L2-regularized hinge loss (SVMs).

Logistic Regression with Kernels

Summary High-dimensional bases allow us to separate non-separable data. The kernel trick allows us to use high-dimensional bases efficiently: write the model so it only depends on inner products between feature vectors. Kernels let us use a similarity between objects, rather than features. This allows some exponential- or infinite-sized feature sets. It applies to L2-regularized linear models and distance-based models. Next time: how do we train on all of Gmail?

Why is the inner product a similarity? It seems weird to think of the inner product as a similarity. But consider this decomposition of the squared Euclidean distance: ||x_i - x_j||^2 = ||x_i||^2 - 2 x_i^T x_j + ||x_j||^2. If all training examples have the same norm, then minimizing Euclidean distance is equivalent to maximizing the inner product. So high similarity according to the inner product is like small Euclidean distance. The only difference is that the inner product is biased by the norms of the training examples. Some people explicitly normalize the x_i by setting x_i = x_i / ||x_i||, so that inner products act like the negation of Euclidean distances.
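
A one-line version of that normalization (my own sketch):

```python
import numpy as np

def normalize_rows(X):
    # Rescale each x_i to unit norm, so larger inner products correspond to smaller Euclidean distances.
    return X / np.linalg.norm(X, axis=1, keepdims=True)
```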