CS 229 Midterm Review

CS 229 Midterm Review. Course Staff, Fall 2018. 11/2/2018.

Outline. Today: SVMs, Kernels, Tree Ensembles, EM Algorithm / Mixture Models. [Focus on building intuition, less so on solving specific problems. Ask questions!]

SVMs

Optimal margin classifier. Two classes separable by a linear decision boundary. But first, what is a hyperplane? In d-dimensional space, it is a (d-1)-dimensional affine subspace. Examples: a line in 2D, a plane in 3D. Hyperplane in d-dimensional space: the set of points x with w^T x + b = 0. [Separates the space into two half-spaces.]

Hyperplanes

Optimal margin classifier Idea: Use a separating hyperplane for binary classification. Key assumption: Classes can be separated by a linear decision boundary.

Optimal margin classifier. To classify new data points: assign the class according to the location of the new point with respect to the hyperplane, i.e. predict y = sign(w^T x + b).

Optimal margin classifier. Problem: many possible separating hyperplanes!

Optimal margin classifier. Which linear decision boundary? The separating hyperplane farthest from the training data: the optimal margin.

Optimal margin classifier. Which linear decision boundary? The separating hyperplane farthest from the training data. Margin: the smallest distance between any training observation and the hyperplane. Support vectors: the training observations that achieve this smallest distance (they lie on the margin, equidistant from the hyperplane).
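For reference, a standard way to write the optimal margin problem (notation assumed here, since the slide's formulas were not preserved in the transcription):

\[
\min_{w,\,b}\ \tfrac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y^{(i)}\big(w^\top x^{(i)} + b\big) \ge 1, \quad i = 1, \dots, n,
\]

where the geometric margin is 1/\|w\| and the support vectors are the training points for which the constraint holds with equality.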

Regularization and the non-separable case. Disadvantage: can be sensitive to individual observations; may overfit the training data.

Regularization and the non-separable case. So far we've assumed that the classes can be separated by a linear decision boundary. What if there's no separating hyperplane?

Regularization and the non-separable case. What if there's no separating hyperplane? Support vector classifier: allows training samples on the wrong side of the margin or hyperplane.

Regularization and the non-separable case. Penalty parameter C: a budget for violations; it allows at most C observations to be misclassified on the training set. Support vectors: observations on the margin or violating the margin.
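A sketch of the budget formulation the slide refers to (ISLR-style notation, assumed here; other texts use an equivalent penalty form where C plays the inverse role):

\[
\max_{w,\,b,\,\xi,\,M}\ M \quad \text{s.t.} \quad \|w\|_2 = 1, \quad y^{(i)}\big(w^\top x^{(i)} + b\big) \ge M(1 - \xi_i), \quad \xi_i \ge 0, \quad \sum_{i=1}^{n} \xi_i \le C.
\]

Points with \xi_i > 0 violate the margin, and points with \xi_i > 1 fall on the wrong side of the hyperplane, so at most C training observations can be misclassified.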

Quiz

Non-linear decision boundary. Disadvantage of linear classifiers: what if the true decision boundary is non-linear?

Expanding feature space. Some data sets are not linearly separable... but they become linearly separable when transformed into a higher-dimensional space.

Expanding feature space. [Left panel: variables X1, X2. Right panel: variables X1, X2, X1X2.]

Non-linear decision boundary. Suppose our original data has d features: x_1, ..., x_d. Expand the feature space to include quadratic terms: x_1, ..., x_d, x_1^2, ..., x_d^2 (and possibly cross terms x_i x_j). The decision boundary will be non-linear in the original feature space (e.g., an ellipse), but linear in the expanded feature space.
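As an illustration (two features, my notation): a boundary that is linear in the expanded features,

\[
\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 = 0,
\]

traces out an ellipse (or another conic section) in the original (x_1, x_2) space.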

Non-linear decision boundary. [Figure slides.]

Non-linear decision boundary. A large number of features becomes computationally challenging. We need an efficient way to work with a large number of features.

Kernels

Kernels. Kernel: a generalization of the inner product. Kernels (implicitly) map data into a higher-dimensional space. Why use kernels instead of explicitly constructing the larger feature space? Computational advantage when n << d [see the following slide].

Kernels. Consider two equivalent ways to represent a linear classifier. Left: the explicit parameter vector θ in the d-dimensional feature space, predicting via θ^T φ(x). Right: the kernelized representation θ = Σ_i α_i φ(x^(i)), predicting via Σ_i α_i K(x^(i), x) with one coefficient per training example. If d >> n, it is much more space efficient to use the kernelized representation.
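A minimal sketch (not from the slides) verifying that a polynomial kernel computes the same inner product as an explicit quadratic feature map, so the expanded features never need to be built:

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map: all products x_i * x_j."""
    return np.outer(x, x).ravel()

def poly_kernel(x, z):
    """Kernel that computes <phi(x), phi(z)> implicitly in O(d) time."""
    return np.dot(x, z) ** 2

rng = np.random.default_rng(0)
x, z = rng.normal(size=5), rng.normal(size=5)

# Same value, but the kernel never materializes the d^2-dimensional features.
print(phi(x) @ phi(z))      # explicit: O(d^2) space
print(poly_kernel(x, z))    # implicit: O(d) space
```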

Tree Ensembles

Tree Ensembles. Decision tree: recursively partition the space to make predictions. Prediction: simply predict the average of the labels in the leaf.

Tree Ensembles. Recursive partitioning: split on thresholds of features. How to choose a split? Pick the one minimizing the (weighted) average loss of the children produced by the split. Classification: cross-entropy loss. Regression: mean squared-error loss.
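A worked statement of that split rule (standard notation, not taken verbatim from the slide): for a candidate feature j and threshold t splitting a region R into children R_left and R_right,

\[
\text{cost}(j, t) = \frac{|R_{\text{left}}|}{|R|}\, L(R_{\text{left}}) + \frac{|R_{\text{right}}|}{|R|}\, L(R_{\text{right}}),
\]

and we choose the (j, t) minimizing this cost, where L is the cross-entropy of the label distribution in a region (classification) or the mean squared error around the region's mean label (regression).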

Tree Ensembles. Decision tree tuning: (1) minimum leaf size, (2) maximum depth, (3) maximum number of nodes, (4) minimum decrease in loss, (5) pruning with a validation set. Advantage: easy to interpret! Disadvantage: easy to overfit.

Tree Ensembles. Random forests: take the average prediction of many decision trees, where (1) each tree is constructed on a bagged version of the original dataset, and (2) only a random subset of the features (typically √p out of p) is considered per split. Each decision tree has high variance, but averaging them together yields low variance. Bagging: a resample of the same size as the original dataset, drawn with replacement.
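A minimal sketch of these ideas using scikit-learn (an illustration, not the course's reference implementation; the dataset and parameter values are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each tree is fit on a bootstrap sample (bagging) and considers only
# sqrt(p) randomly chosen features at each split.
rf = RandomForestClassifier(
    n_estimators=200,      # number of trees to average
    max_features="sqrt",   # subset of features per split
    min_samples_leaf=5,    # one of the tuning knobs from the previous slide
    bootstrap=True,
    random_state=0,
)
rf.fit(X_tr, y_tr)
print("test accuracy:", rf.score(X_te, y_te))
```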

EM Algorithm / Mixtures

Mixture Models. Gaussian Mixture Model: we have data points x^(1), ..., x^(n), which we suppose come from a mixture of Gaussians. Here, each point has a latent component assignment z^(i) in {1, ..., k}; we hypothesize k Gaussians with parameters (μ_j, Σ_j) and mixing weights φ_j.
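In symbols (standard CS229-style notation; the slide's exact formulas were not preserved in the transcription):

\[
z^{(i)} \sim \text{Multinomial}(\phi), \qquad x^{(i)} \mid z^{(i)} = j \sim \mathcal{N}(\mu_j, \Sigma_j), \qquad p(x) = \sum_{j=1}^{k} \phi_j\, \mathcal{N}(x;\, \mu_j, \Sigma_j).
\]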

Mixture Models GMMs can be extremely effective at modeling distributions! [Richardson and Weiss 2018: On GANs and GMMs]

EM Algorithm. Problem: how do we estimate parameters when there are latent variables? Idea: maximize the marginal likelihood (the joint likelihood summed over the latent variables). But this is difficult to compute directly! Example: mixture of Gaussians.

EM Algorithm. Algorithm: 1. Begin with an initial guess for the parameters θ. 2. Alternate between: a. [E-step] Hallucinate the missing values by computing, for every possible value z of the latent variable, the posterior p(z^(i) = z | x^(i); θ). b. [M-step] Use the hallucinated dataset to maximize a lower bound on the log-likelihood, updating θ.

EM Algorithm. [E-step] Hallucinate the missing values by computing the posterior over every possible value of the latent variable. In the GMM example: for each data point x^(i), compute the responsibilities w_j^(i) = p(z^(i) = j | x^(i); φ, μ, Σ) for j = 1, ..., k. Now repeat for all data points. This creates an augmented dataset, weighted by these probabilities.

EM Algorithm. [M-step] Use the hallucinated dataset to maximize the lower bound on the log-likelihood. We now simply fit the parameters (φ, μ, Σ) as in the fully observed case, using the augmented dataset with its weights.
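A minimal sketch of one EM iteration for a GMM (an illustration using NumPy and SciPy; the variable names are mine, not the slides'):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, phi, mu, Sigma):
    """One E-step + M-step for a Gaussian mixture. X has shape (n, d)."""
    n, _ = X.shape
    k = len(phi)

    # E-step: responsibilities w[i, j] = p(z_i = j | x_i; phi, mu, Sigma)
    w = np.zeros((n, k))
    for j in range(k):
        w[:, j] = phi[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
    w /= w.sum(axis=1, keepdims=True)

    # M-step: weighted maximum likelihood on the "augmented" dataset
    Nj = w.sum(axis=0)                      # effective counts per component
    phi = Nj / n
    mu = (w.T @ X) / Nj[:, None]
    Sigma = []
    for j in range(k):
        diff = X - mu[j]
        Sigma.append((w[:, j, None] * diff).T @ diff / Nj[j])
    return phi, mu, np.array(Sigma)
```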

EM Algorithm Theory. It turns out that we're maximizing a lower bound on the true log-likelihood. Here we use Jensen's inequality.
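The lower bound referred to above, written out (a standard derivation; Q is any distribution over the latent variable):

\[
\log p(x; \theta) = \log \sum_{z} Q(z)\, \frac{p(x, z; \theta)}{Q(z)} \;\ge\; \sum_{z} Q(z) \log \frac{p(x, z; \theta)}{Q(z)},
\]

by Jensen's inequality applied to the concave function log. The bound is tight when Q(z) = p(z | x; θ), which is exactly what the E-step computes; the M-step then maximizes the right-hand side over θ.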

EM Algorithm Intuition

EM Algorithm Sanity Check EM is guaranteed to converge to a local optimum. Why? Runtime per iteration?

EM Algorithm Sanity Check. EM is guaranteed to converge to a local optimum. Why? Our estimated log-likelihood only ever increases. Runtime per iteration? In general, we need to hallucinate one new data point per possible value of the latent variable. For GMMs, we only need to hallucinate k weighted copies of each data point (nk in total), thanks to independence across examples.

K Means and More GMMs

K Means - Motivation

K Means - Algorithm(*): 1. Initialize the k cluster centroids (e.g., to randomly chosen data points). 2. Repeat until convergence: (a) assign each data point to its nearest centroid; (b) move each centroid to the mean of the points assigned to it.

K Means. [Figure slides: successive iterations of the assignment and centroid-update steps.]
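A minimal sketch of the two alternating k-means updates (an illustration with NumPy; the names and initialization choice are mine):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means: alternate nearest-centroid assignment and mean update."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: index of the nearest centroid for each point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```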

Real-World Example: Mixture of Gaussians. Note: the data is unlabeled.

How do we fit a GMM? The EM algorithm.

Observations. Only the distribution of X matters: you can change things like the ordering of the components without affecting the distribution, and hence without affecting the algorithm. Mixing two distributions from a parametric family might give a third distribution from the same family; for example, a mixture of two Bernoullis is another Bernoulli (see the worked example below). Probabilistic clustering: putting similar data points together into clusters, where the clusters are represented by the component distributions.
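A one-line check of the Bernoulli claim (my notation): mixing Bernoulli(p_1) and Bernoulli(p_2) with weight π gives

\[
P(X = 1) = \pi p_1 + (1 - \pi) p_2,
\]

which is simply Bernoulli(π p_1 + (1 - π) p_2); the mixture structure is not identifiable from the distribution of X alone.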

Sample Question: How do constraints on the covariance matrix change the Gaussian that is being fit?

Tied: all components share a single covariance matrix.

Diag: each component has a diagonal covariance matrix (axis-aligned ellipses).

Spherical: each component's covariance is a scaled identity, Σ_j = σ_j² I (K-Means!).
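These constraints map directly onto scikit-learn's GaussianMixture covariance_type option (shown here as an illustration; the slides themselves do not reference scikit-learn):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(500, 2))

# 'full' = unconstrained, 'tied' = shared across components,
# 'diag' = diagonal per component, 'spherical' = sigma^2 * I per component.
for cov_type in ["full", "tied", "diag", "spherical"]:
    gm = GaussianMixture(n_components=3, covariance_type=cov_type, random_state=0)
    gm.fit(X)
    print(cov_type, "covariances shape:", np.shape(gm.covariances_))
```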