Topics in Machine Learning

Topics in Machine Learning. Gilad Lerman, School of Mathematics, University of Minnesota. Text/slides stolen from G. James, D. Witten, T. Hastie, R. Tibshirani, and A. Ng.

Machine Learning - Motivation. Arthur Samuel (1959): the field of study that gives computers the ability to learn without being explicitly programmed. It sits in between computer science, statistics, and optimization. Three categories (the division is soft): supervised learning, unsupervised learning, reinforcement learning.

Difficulties. Understanding the methods (requires knowledge of various areas); understanding the data and application areas; sometimes hard to establish mathematical guarantees; sometimes hard to code and test; a fast-developing area of research.

Simplification. To avoid such difficulties, but still obtain a fine level of knowledge in 2 days, we'll follow An Introduction to Statistical Learning (James, Witten, Hastie, Tibshirani). The book is available online. Plan: the last 3 chapters (8-10) and a bit more.

Review. Supervised learning (training and test sets) vs. unsupervised learning. Examples of supervised learning: regression, classification. Examples of unsupervised learning: density/function estimation, clustering, dimension reduction. Recall: regression, the bias-variance tradeoff, resampling (e.g., cross-validation), linear and non-linear models.
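
Since the review mentions resampling, a minimal cross-validation sketch may help; it is not part of the original slides, and the data set, model, and 5-fold choice are illustrative assumptions using scikit-learn.

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data; cross_val_score returns the negative MSE per fold
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring="neg_mean_squared_error")
print("5-fold CV estimate of test MSE:", -scores.mean())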

Quick Review of Regression and Nearest Neighbors. Regression predicts a response variable Y (a quantitative variable) in terms of input variables (predictors) X_1, ..., X_p, given n samples in R^p; denote X = (X_1, ..., X_p). The regression function f(x) = E(Y | X = x) is the minimizer of the mean squared prediction error. We cannot compute f precisely, since we have few if any observations of Y at a given x.

Estimating f by NN
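
To make the nearest-neighbors estimate concrete, here is a minimal sketch, not taken from the slides, of f̂(x) = average of y_i over the k nearest neighbors of x; the function name knn_regress and the choices k = 10 and f(x) = sin(x) are illustrative assumptions.

import numpy as np

def knn_regress(x0, X, y, k=10):
    """Estimate f(x0) = E(Y | X = x0) by averaging the responses of the
    k training points closest to x0 (Euclidean distance)."""
    dists = np.linalg.norm(X - x0, axis=1)   # distances from x0 to all n samples
    nearest = np.argsort(dists)[:k]          # indices of the k nearest neighbors
    return y[nearest].mean()                 # average their responses

# Tiny illustration on synthetic data: Y = sin(X) + noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
print(knn_regress(np.array([1.5]), X, y, k=10))   # should be close to sin(1.5) ~ 0.997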

Remarks on NN and Classification. Need p ≤ 4 and a sufficiently large n: nearest neighbors tend to be far away in high dimensions. Can use kernel or spline smoothing instead. Other common methods: parametric and structured models.

Neighborhoods in Increasing Dimensions
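
The point that nearest neighbors become far away in high dimensions can be checked with a quick simulation (my own illustration, not from the slides): sample n points uniformly in the unit cube and watch the nearest-neighbor distance to a query point grow with p.

import numpy as np

rng = np.random.default_rng(5)
n = 1000
for p in (1, 2, 10, 50, 100):
    X = rng.uniform(size=(n, p))              # n points uniform in the unit cube [0,1]^p
    x0 = rng.uniform(size=p)                  # a query point
    d = np.sort(np.linalg.norm(X - x0, axis=1))
    print(f"p={p:>3}: nearest-neighbor distance = {d[0]:.3f}, 10th-nearest = {d[9]:.3f}")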

More on Regression. Assessing model accuracy: the mean squared error MSE = (1/n) Σ_{i=1}^n (y_i − f̂(x_i))², computed on the training data or, preferably, on test data.

More on Regression. The dashed line will be explained later (the irreducible error). Flexibility = degrees of freedom (each square represents the method with the same color).


On Regression Error. For an estimator f̂ learned on a training set, the mean squared error at x is E[(Y − f̂(X))² | X = x]. Assume Y = f(X) + ε, where ε is independent noise with mean zero; then E[(Y − f̂(X))² | X = x] = E[(f(X) + ε − f̂(X))² | X = x] = E[(f(X) − f̂(X))² | X = x] + Var(ε). Var(ε) is the irreducible error; E[(f(X) − f̂(X))² | X = x] is the reducible error (f̂ depends on the random training sample).

Regression Error: Bias and Variance. E[(f(X) − f̂(X))² | X = x] = E[(f̂(X) − E f̂(X))² | X = x] + (E[f̂(X) | X = x] − f(x))² = Var(f̂(X) | X = x) + Bias²(f̂(X) | X = x). Hence E[(Y − f̂(X))² | X = x] = Var(f̂(X) | X = x) + Bias²(f̂(X) | X = x) + Var(ε).
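
As a numerical sanity check of the decomposition (my own sketch, not from the slides), the code below redraws training sets, recomputes a k-NN estimate at a fixed x, and compares Var + Bias² + Var(ε) with a Monte Carlo estimate of the test error at x; all constants (f = sin, σ = 0.3, k = 10) are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
f = np.sin                     # true regression function
sigma = 0.3                    # noise s.d., so Var(eps) = sigma**2
x0, k, n, reps = 1.5, 10, 200, 2000

preds, errs = [], []
for _ in range(reps):
    X = rng.uniform(0, 6, size=(n, 1))                 # fresh training set
    y = f(X[:, 0]) + sigma * rng.standard_normal(n)
    idx = np.argsort(np.abs(X[:, 0] - x0))[:k]         # k nearest neighbors of x0
    fhat = y[idx].mean()                               # k-NN estimate at x0
    preds.append(fhat)
    errs.append((f(x0) + sigma * rng.standard_normal() - fhat) ** 2)

preds = np.array(preds)
var, bias2 = preds.var(), (preds.mean() - f(x0)) ** 2
print(var + bias2 + sigma**2)   # Var + Bias^2 + irreducible error
print(np.mean(errs))            # Monte Carlo E[(Y - fhat(X))^2 | X = x0]; should match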

Bias-Variance Tradeoff Two other tradeoffs:

Bias-Variance Tradeoff

Quick Review of Classification and Nearest Neighbors. Classification: predict a qualitative (categorical) response Y from the predictors X.

Quick Review of Classification and Nearest Neighbors Example:


Chapter 9: SVM

Separation of 2 Classes by a hyperplane. Training set: n points x_i = (x_{i,1}, ..., x_{i,p}), 1 ≤ i ≤ n, with n labels y_i ∈ {−1, 1}, 1 ≤ i ≤ n. A separating hyperplane (if one exists) satisfies y_i (β_0 + β_1 x_{i,1} + ... + β_p x_{i,p}) > 0 for all 1 ≤ i ≤ n.

Separation of 2 Classes by a hyperplane. Example:

Separation of 2 Classes by a hyperplane. If a separating hyperplane exists, then for a test observation x*, a classifier is obtained from the sign of f(x*) = β_0 + β_1 x*_1 + ... + β_p x*_p (negative sign → class −1, positive sign → class 1). The magnitude of f(x*) provides confidence in the class assignment, since d(x*, hyp.) = |β_0 + Σ_{j=1}^p β_j x*_j| / (Σ_{j=1}^p β_j²)^{1/2}.
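
A small sketch of the rule above (illustrative, not the book's code): classify by the sign of f(x*) = β_0 + βᵀx* and report the distance |f(x*)|/||β|| as the confidence; the coefficients beta0 and beta here are made-up values.

import numpy as np

def hyperplane_classify(x_star, beta0, beta):
    """Classify by the sign of f(x*) = beta0 + beta . x*, and return the
    distance |f(x*)| / ||beta||, which measures confidence in the assignment."""
    f = beta0 + beta @ x_star
    label = 1 if f > 0 else -1
    distance = abs(f) / np.linalg.norm(beta)
    return label, distance

beta0, beta = -1.0, np.array([2.0, 1.0])        # some separating hyperplane
print(hyperplane_classify(np.array([1.0, 1.0]), beta0, beta))   # (1, distance)
print(hyperplane_classify(np.array([0.0, 0.0]), beta0, beta))   # (-1, distance)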

Maximal Margin Classifier

Maximal Margin Classifier. The MMC is the solution of: maximize M over β_0, β_1, ..., β_p subject to Σ_{j=1}^p β_j² = 1 and y_i (β_0 + β_1 x_{i,1} + ... + β_p x_{i,p}) ≥ M for all 1 ≤ i ≤ n. There is no explanation in the book, but it is immediate for a math student; the actual algorithm is not discussed.

Numerical Solution (following A. Ng's CS229 notes). Change of notation: y^(i) = y_i, x^(i) = (x_{i,1}, ..., x_{i,p}). Recall: the distance of (x^(i), y^(i)) to a hyperplane w^T x + b = 0 is |w^T x^(i) + b| / ||w||.

Numerical Solution (following A. Ng's CS229 notes). Original problem (non-convex): maximize γ over w, b subject to y^(i)(w^T x^(i) + b) ≥ γ for all i and ||w|| = 1. Equivalent non-convex problem via the functional margin γ̂ = γ ||w||: maximize γ̂ / ||w|| subject to y^(i)(w^T x^(i) + b) ≥ γ̂ for all i.

Numerical Solution (following A. Ng's CS229 notes). Scale w and b by the same constant so that γ̂ = 1 (no effect on the problem) and change to the convex problem (a quadratic program): minimize (1/2) ||w||² over w, b subject to y^(i)(w^T x^(i) + b) ≥ 1 for all i.

Equivalent Formulation (following A. Ng's CS229 notes). Lagrangian: L(w, b, α) = (1/2) ||w||² − Σ_i α_i [y^(i)(w^T x^(i) + b) − 1]. Dual: maximize Σ_i α_i − (1/2) Σ_i Σ_j y^(i) y^(j) α_i α_j ⟨x^(i), x^(j)⟩ over α subject to α_i ≥ 0 and Σ_i α_i y^(i) = 0. Solution: w = Σ_i α_i y^(i) x^(i). Hence w^T x + b = Σ_i α_i y^(i) ⟨x^(i), x⟩ + b (used later).
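
To see the quadratic program in action, here is a hedged sketch (my own illustration, not Ng's or the book's code) that solves the primal problem min ½||w||² subject to y^(i)(wᵀx^(i) + b) ≥ 1 with SciPy's generic SLSQP solver on a tiny separable data set; a dedicated QP or SVM solver would be used in practice, and the data and starting point are arbitrary.

import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable training set with labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.0],
              [0.0, 0.0], [1.0, 0.5], [0.5, -0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

def objective(v):                      # v = (w_1, w_2, b); minimize (1/2)||w||^2
    w = v[:-1]
    return 0.5 * w @ w

# One inequality constraint per training point: y_i (w.x_i + b) - 1 >= 0
constraints = [{'type': 'ineq',
                'fun': (lambda v, xi=xi, yi=yi: yi * (v[:-1] @ xi + v[-1]) - 1.0)}
               for xi, yi in zip(X, y)]

res = minimize(objective, x0=np.array([1.0, 1.0, -3.0]),   # a feasible starting guess
               method='SLSQP', constraints=constraints)
w, b = res.x[:-1], res.x[-1]
print("w =", w, "b =", b, "geometric margin =", 1.0 / np.linalg.norm(w))
print("functional margins:", y * (X @ w + b))   # all >= 1; support vectors sit at 1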

A Non-separable Example

Non-robustness of the Maximal Margin Classifier

The Support Vector Classifier. Maximize M over β_0, ..., β_p and ε_1, ..., ε_n subject to Σ_j β_j² = 1, y_i (β_0 + β_1 x_{i,1} + ... + β_p x_{i,p}) ≥ M(1 − ε_i), ε_i ≥ 0, and Σ_i ε_i ≤ C. If ε_i = 0, the observation is on the correct side of the margin; if ε_i > 0, it is on the wrong side of the margin; if ε_i > 1, it is on the wrong side of the hyperplane. The solution is affected only by the support vectors, i.e., observations that lie on the margin or on the wrong side of the margin.

Concept Demonstration

More on the Optimization Problem. C controls the number of observations on the wrong side of the margin; C controls the bias-variance trade-off; the optimizer is affected only by the support vectors. The figure shows increasing C in clockwise order.
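
A hedged illustration (not from the book) of fitting support vector classifiers with different cost values in scikit-learn; note that sklearn's C penalizes margin violations and is roughly the inverse of the budget C used in the book, so a large sklearn C corresponds to a small budget (narrow margin, low bias, high variance). The synthetic data below is an arbitrary choice.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.standard_normal((60, 2))
y = np.where(X[:, 0] + X[:, 1] + 0.5 * rng.standard_normal(60) > 0, 1, -1)  # noisy linear classes

for C in (0.01, 1.0, 100.0):               # penalty parameter (inverse of the book's budget)
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: #support vectors = {clf.support_vectors_.shape[0]}, "
          f"train accuracy = {clf.score(X, y):.2f}")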

Equivalent Formulation (following A. Ng's CS229 notes). Dual: maximize Σ_i α_i − (1/2) Σ_i Σ_j y^(i) y^(j) α_i α_j ⟨x^(i), x^(j)⟩ over α subject to 0 ≤ α_i ≤ C and Σ_i α_i y^(i) = 0. Similarly to before, w^T x is a linear combination of the inner products ⟨x, x^(i)⟩.

Support Vector Machine (SVM). From linear to nonlinear boundaries by embedding into a higher-dimensional space. The algorithm can be written entirely in terms of dot products. Instead of embedding into a very high-dimensional space explicitly, replace the dot products with kernels.

Clarification

More (following the book). By the solution of the SVC (recall the earlier comment), f(x) = β_0 + Σ_{i=1}^n α_i ⟨x, x_i⟩. One can use only the support vectors S for the SVC: f(x) = β_0 + Σ_{i∈S} α_i ⟨x, x_i⟩. For the SVM, replace the dot products with kernels: f(x) = β_0 + Σ_{i∈S} α_i K(x, x_i).
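
As a sketch of the kernel substitution (illustrative, not the book's code), the snippet below fits SVMs with linear, polynomial, and radial kernels on data whose true boundary is a circle; the kernel parameters follow scikit-learn's SVC and the data set is a made-up example.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0, 1, -1)   # circular boundary: not linearly separable

for kernel, params in [("linear", {}),
                       ("poly", {"degree": 3}),            # K(x, x') = (gamma <x, x'> + r)^3
                       ("rbf", {"gamma": 1.0})]:           # K(x, x') = exp(-gamma ||x - x'||^2)
    clf = SVC(kernel=kernel, C=1.0, **params).fit(X, y)
    print(f"{kernel:>6} kernel: train accuracy = {clf.score(X, y):.2f}")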

Demonstration

SVM for K>2 Classes. OVO (One vs. One): for the training data, construct K(K−1)/2 binary ±1 classifiers (one for each pair of 2 out of the K classes); for a test point, use voting (the class with the most pairwise assignments wins). OVA (One vs. All): for training, construct K classifiers (one class labeled 1 vs. the rest labeled −1); for a test point x*, classify according to the largest estimated f(x*). OVO is better when K is not too large.
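
A hedged sketch of the two multiclass strategies using scikit-learn's OneVsOneClassifier and OneVsRestClassifier wrappers around a linear SVC; the iris data (K = 3) is just a convenient illustration, not part of the slides.

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)            # K = 3 classes
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)   # K(K-1)/2 pairwise classifiers
ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)  # K one-vs-rest classifiers
# For K = 3 both strategies happen to fit 3 classifiers
print("OVO accuracy:", ovo.score(X, y), "#estimators:", len(ovo.estimators_))
print("OVA accuracy:", ova.score(X, y), "#estimators:", len(ova.estimators_))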

Chapter 8: Tree-based Methods (or CART). Decision Trees for Regression. Demonstration of predicting log(salary/1000) as a function of the number of years in the major leagues and the number of hits in the previous year. Terminology: leaf/terminal node, internal node, branch.

Chapter 8: Tree-based Methods (or CART)

Building a Decision Tree. We wish to minimize the RSS (residual sum of squares) Σ_{j=1}^J Σ_{i: x_i ∈ R_j} (y_i − ȳ_{R_j})² over all partitions of the predictor space into regions R_1, ..., R_J. This is computationally infeasible; use instead recursive binary splitting (a top-down greedy procedure).

Recursive Binary Splitting. At each node (top to bottom), determine the predictor X_j and cutoff s minimizing Σ_{i: x_i ∈ R_1(j,s)} (y_i − ȳ_{R_1(j,s)})² + Σ_{i: x_i ∈ R_2(j,s)} (y_i − ȳ_{R_2(j,s)})², where R_1(j,s) = {X : X_j < s}, R_2(j,s) = {X : X_j ≥ s}, and ȳ_R denotes the mean response of the training observations in R.

Recursive Binary Splitting. For j = 1, ..., p, determine the s maximizing (Σ_{i: x_i ∈ R_1(j,s)} y_i)² / |R_1(j,s)| + (Σ_{i: x_i ∈ R_2(j,s)} y_i)² / |R_2(j,s)| (equivalent to minimizing the RSS above). This can be done by sorting the values of X_j and checking all n−1 consecutive pairs (x_i, x_{i+1}) (O(1) operations for each), reporting the average of x_i and x_{i+1} for the maximizing i. The total cost is O(pn) per node, after sorting. We assumed continuous predictors (the procedure can be modified for discrete ones).
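
A minimal sketch of the single-predictor split search described above (my own code, not the book's): sort by X_j, keep prefix sums of y, evaluate (Σ_{R1} y_i)²/|R1| + (Σ_{R2} y_i)²/|R2| for each of the n−1 candidate cutoffs in O(1) each, and report the midpoint of the best consecutive pair.

import numpy as np

def best_split_one_predictor(x, y):
    """Return (cutoff s, criterion value) maximizing
    (sum_{R1} y)^2/|R1| + (sum_{R2} y)^2/|R2|, which is equivalent to
    minimizing the two-region RSS. O(n log n) for the sort, then O(1) per candidate."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    csum = np.cumsum(ys)                      # prefix sums of the sorted responses
    total, n = csum[-1], len(ys)
    best_s, best_val = None, -np.inf
    for i in range(n - 1):                    # split between sorted positions i and i+1
        if xs[i] == xs[i + 1]:                # no valid cutoff between equal x-values
            continue
        left, right = csum[i], total - csum[i]
        val = left**2 / (i + 1) + right**2 / (n - i - 1)
        if val > best_val:
            best_val, best_s = val, 0.5 * (xs[i] + xs[i + 1])   # midpoint cutoff
    return best_s, best_val

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 200)
y = np.where(x < 4.0, 1.0, 5.0) + 0.2 * rng.standard_normal(200)   # true cutoff near 4
print(best_split_one_predictor(x, y))        # recovered cutoff should be close to 4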

More on Recursive Binary Splitting. The previous process is repeated until a stopping criterion is met. Predict the response of a test sample by the mean of the training observations in the region to which it belongs.

Tree Pruning. Continue with page 17 of the book's slides, trees.pdf.