Revisiting Logistic Regression & Naïve Bayes Aarti Singh Machine Learning 10-701/15-781 Jan 27, 2010

Generative and Discriminative Classifiers. Training classifiers involves learning a mapping f: X -> Y, or P(Y | X). Generative classifiers (e.g. Naïve Bayes): assume some functional form for P(X, Y) (or for P(X | Y) and P(Y)); estimate the parameters of P(X | Y) and P(Y) directly from training data; use Bayes' rule to calculate P(Y | X). Discriminative classifiers (e.g. Logistic Regression): assume some functional form for P(Y | X) and estimate its parameters directly from training data.

Logistic Regression. Assumes the following functional form for P(Y | X): P(Y = 1 | X) = 1 / (1 + exp(-(w0 + Σi wi Xi))). Equivalently, the log-odds log[P(Y=1|X) / P(Y=0|X)] = w0 + Σi wi Xi is linear in X, so the decision boundary is linear. Logistic regression does NOT require any conditional independence assumptions.
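As a quick illustration of this functional form (a minimal sketch; the weight values below are made up, not from the lecture):

```python
import numpy as np

def predict_proba(X, w0, w):
    """P(Y = 1 | X) under the logistic model: sigmoid of a linear score."""
    score = w0 + X @ w                    # linear function of the features
    return 1.0 / (1.0 + np.exp(-score))

# Hypothetical weights for a 2-feature problem (illustrative values only).
w0, w = -0.5, np.array([1.2, -0.7])
X = np.array([[0.3, 1.0], [2.0, 0.1]])
probs = predict_proba(X, w0, w)
labels = (probs >= 0.5).astype(int)       # threshold at 0.5 <=> sign of the linear score
```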

Connection to Gaussian Naïve Bayes. There are several distributions that can lead to a linear decision boundary. As another example, consider a generative model where the class-conditional densities P(X | Y) belong to an exponential family; the Gaussian is a special case.

Connection to Gaussian Naïve Bayes (continued). Expanding the log-odds gives a constant term plus a first-order (linear) term in X. Special case: P(X | Y = y) ~ Gaussian(µ_y, Σ_y) with shared covariance Σ_0 = Σ_1 (i.e. c_ij,0 = c_ij,1). If in addition the features are conditionally independent, c_ij,y = 0 for i ≠ j, we recover Gaussian Naïve Bayes.

Generative vs. Discriminative. Given infinite data (asymptotically): if the conditional independence assumption holds, discriminative and generative classifiers perform similarly; if the conditional independence assumption does NOT hold, the discriminative classifier outperforms the generative one.

Generative vs. Discriminative. Given finite data (n data points, p features), the Ng-Jordan paper shows that Naïve Bayes (generative) requires n = O(log p) samples to converge to its asymptotic error, whereas logistic regression (discriminative) requires n = O(p). Why? With independent class-conditional densities, the smaller model classes are easier to learn, and the parameter estimates are not coupled: each parameter is learnt independently, not jointly, from the training data.

Naïve Bayes vs. Logistic Regression: Verdict. Both learn a linear decision boundary. Naïve Bayes makes more restrictive assumptions and has higher asymptotic error, BUT it converges faster to its (less accurate) asymptotic error.

Experimental Comparison (Ng-Jordan '01). UCI Machine Learning Repository: 15 datasets, 8 with continuous features, 7 with discrete features. [Figure: error vs. training-set size curves for Naïve Bayes and Logistic Regression; more in the paper.]

Classification so far (Recap)

Classification Tasks. Examples of features X and labels Y: diagnosing sickle cell anemia (X = cell image, Y = anemic cell / healthy cell); tax fraud detection; web page classification (Y = Sports / Science / News); predicting Squirrel Hill residents (X = drives to CMU, Rachel's fan, shops at SH Giant Eagle; Y = resident / not resident).

Classification. Goal: learn a mapping from features X (e.g. a document) to labels Y (e.g. Sports / Science / News) that minimizes the probability of error.

Classification. Optimal predictor (the Bayes classifier): f*(X) = arg max_y P(Y = y | X). It depends on the unknown distribution P(X, Y).

Classification Algorithms. However, we can learn a good prediction rule from training data, assumed independent and identically distributed, using a learning algorithm. So far: Decision Trees, K-Nearest Neighbor, Naïve Bayes, Logistic Regression.

Linear Regression Aarti Singh Machine Learning 10-701/15-781 Jan 27, 2010

Discrete to Continuous Labels. Classification: X = document, Y = topic (Sports / Science / News); X = cell image, Y = diagnosis (anemic / healthy). Regression: stock market prediction, e.g. X = Feb 01, Y = ? (a continuous value).

Regression Tasks. Weather prediction: X = 7 pm, Y = temperature. Estimating contamination: X = new location, Y = sensor reading.

Supervised Learning. Goal: learn a predictor from X to Y. Classification (e.g. Sports / Science / News): minimize the probability of error. Regression (e.g. X = Feb 01, Y = ?): minimize the mean squared error.

Regression. Optimal predictor (the conditional mean): f*(X) = E[Y | X], dropping subscripts for notational convenience.

Regression. Optimal predictor (the conditional mean) f*(X) = E[Y | X]. Intuition: under a signal plus (zero-mean) noise model, Y = f(X) + ε with E[ε] = 0, the conditional mean recovers the signal f(X). As with classification, it depends on the unknown distribution.

Regression Algorithms. A learning algorithm maps training data to a predictor: Linear Regression; Lasso and Ridge regression (regularized linear regression); nonlinear regression; kernel regression; regression trees, splines, wavelet estimators, ... The Empirical Risk Minimizer replaces the unknown expectation with the empirical mean over the training data.

Linear Regression: Least Squares Estimator. Class of linear functions. Univariate case: f(X) = β1 + β2 X, where β1 is the intercept and β2 the slope. Multivariate case: f(X) = Xβ, where each row of X is a data point with a leading 1 appended (to absorb the intercept).

Least Squares Estimator. The estimator minimizes the residual sum of squares: J(β) = Σ_{i=1}^{n} (Yi - Xi β)^2 = (Y - Xβ)ᵀ(Y - Xβ).

Least Squares Estimator. Setting the gradient to zero, ∇_β J(β) = -2 Xᵀ(Y - Xβ) = 0, yields the normal equations XᵀXβ = XᵀY.

Normal Equations. (XᵀX)β = XᵀY, where XᵀX is p x p and β and XᵀY are p x 1. If XᵀX is invertible, β̂ = (XᵀX)⁻¹XᵀY. When is XᵀX invertible? (Homework 2) Recall: full rank matrices are invertible. What is the rank of XᵀX? What if XᵀX is not invertible? (Homework 2) Regularization (later).
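A minimal sketch of the least squares solution via the normal equations (the synthetic data and variable names are my own, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p - 1))])  # leading column of 1s for the intercept
beta_true = np.array([2.0, -1.0, 0.5])
Y = X @ beta_true + 0.1 * rng.normal(size=n)                    # signal plus zero-mean noise

# Solve the normal equations X^T X beta = X^T Y (assumes X^T X is invertible).
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
# In practice np.linalg.lstsq(X, Y, rcond=None) is the numerically safer route.
```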

Geometric Interpretation. The difference between predictions and labels on the training set, Y - Xβ̂, is orthogonal to the column space of X, i.e. Xᵀ(Y - Xβ̂) = 0: the prediction Xβ̂ is the orthogonal projection of Y onto the linear subspace spanned by the columns of X.

Revisiting Gradient Descent. Even when XᵀX is invertible, computing its inverse can be computationally expensive if the matrix is huge. Gradient Descent: initialize β⁰; update β^{t+1} = β^t - α ∇J(β^t), where the gradient is 0 exactly when XᵀXβ = XᵀY; stop when some criterion is met, e.g. a fixed number of iterations, or the change in β is < ε.
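A minimal gradient descent sketch for the least squares objective (the step size and stopping rule below are illustrative choices, not values from the slides):

```python
import numpy as np

def gd_least_squares(X, Y, alpha=0.01, max_iters=10_000, tol=1e-8):
    """Minimize ||Y - X beta||^2 by gradient descent."""
    beta = np.zeros(X.shape[1])                    # initialize at 0
    for _ in range(max_iters):
        grad = -2.0 * X.T @ (Y - X @ beta)         # gradient of the residual sum of squares
        beta_new = beta - alpha * grad             # step against the gradient
        if np.linalg.norm(beta_new - beta) < tol:  # stop when the update becomes tiny
            return beta_new
        beta = beta_new
    return beta

# alpha must be small relative to the largest eigenvalue of X^T X, otherwise the iterates diverge.
```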

Effect of step size α. Large α: fast convergence but larger residual error, and possible oscillations. Small α: slow convergence but small residual error.

When does Gradient Descent succeed? The algorithm's view is myopic: it only sees the local gradient. It is guaranteed to converge to a local minimum if the step size is small enough. In the least squares case it converges geometrically in the j-th eigen-direction, so the speed of convergence depends on the eigenvalue spread of XᵀX. (Illustrations: http://www.ce.berkeley.edu/~bayen/, http://demonstrations.wolfram.com)

Least Squares and MLE. Intuition: under the signal plus (zero-mean) noise model Y = Xβ + ε with Gaussian ε, the log likelihood is, up to constants, -||Y - Xβ||² / (2σ²). Maximizing it is the same as minimizing the squared error: the Least Squares Estimate is the Maximum Likelihood Estimate under a Gaussian noise model!

Regularized Least Squares and MAP. What if XᵀX is not invertible? Maximize the log posterior = log likelihood + log prior. I) Gaussian prior with zero mean: this gives Ridge Regression, min_β ||Y - Xβ||² + λ||β||₂². Closed form: HW. The prior belief that β is Gaussian with zero mean biases the solution toward small β.
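The closed form is left as homework in the slides; as a sketch, the standard textbook ridge solution looks like this (λ below is an arbitrary illustrative value):

```python
import numpy as np

def ridge_fit(X, Y, lam=1.0):
    """Standard ridge regression solution: beta = (X^T X + lam I)^{-1} X^T Y."""
    p = X.shape[1]
    # Adding lam * I makes the matrix invertible even when X^T X is not.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
```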

Regularized Least Squares and MAP. What if XᵀX is not invertible? Maximize the log posterior = log likelihood + log prior. II) Laplace prior with zero mean: this gives the Lasso, min_β ||Y - Xβ||² + λ||β||₁. Closed form: HW. The prior belief that β is Laplace with zero mean biases the solution toward small (and sparse) β.

Ridge Regression vs. Lasso. Ridge Regression uses an l2 penalty; Lasso uses an l1 penalty (HOT!). Ideally we would use an l0 penalty, but the optimization becomes non-convex. [Figure: level sets of J(β) in the (β1, β2) plane together with the sets of βs of constant l2, l1, and l0 norm.] Lasso (l1 penalty) results in sparse solutions, i.e. weight vectors with more zero coordinates. Good for high-dimensional problems: you don't have to store all coordinates!
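The slides don't give a lasso solver; one common approach (my own sketch, not the course's) is proximal gradient descent (ISTA), which alternates a gradient step on the squared error with soft-thresholding:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1: shrink each coordinate toward 0 by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, Y, lam=0.1, n_iters=1000):
    """Minimize ||Y - X beta||^2 + lam * ||beta||_1 by proximal gradient (ISTA)."""
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)   # safe step: inverse Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = -2.0 * X.T @ (Y - X @ beta)         # gradient of the smooth squared-error term
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta
```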

Beyond Linear Regression. Polynomial regression; regression with nonlinear features/basis functions; kernel regression (local/weighted regression with bandwidth h); regression trees (spatially adaptive regression).

Polynomial Regression. Univariate case: f(X) = β1 + β2 X + β3 X² + ..., where the βj are the weights of the nonlinear features 1, X, X², ...
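Polynomial regression reduces to linear regression on expanded features; a minimal sketch (the degree and data are illustrative choices):

```python
import numpy as np

def poly_features(x, degree):
    """Map a 1-D input to the nonlinear features [1, x, x^2, ..., x^degree]."""
    return np.vander(x, N=degree + 1, increasing=True)

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = np.sin(3 * x) + 0.1 * rng.normal(size=x.size)   # nonlinear signal plus noise

Phi = poly_features(x, degree=5)                     # design matrix of polynomial features
beta = np.linalg.lstsq(Phi, y, rcond=None)[0]        # ordinary least squares on the expanded features
y_hat = Phi @ beta
```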

Nonlinear Regression. f(X) = Σ_j βj φj(X), with basis coefficients βj and nonlinear features/basis functions φj. Fourier basis: a good representation for oscillatory functions. Wavelet basis: a good representation for functions localized at multiple scales.
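For instance, a Fourier-basis fit is the same least squares machinery applied to sinusoidal features (a minimal sketch; the number of frequencies is an arbitrary choice):

```python
import numpy as np

def fourier_features(x, n_freq=4):
    """Features [1, sin(pi x), cos(pi x), ..., sin(n pi x), cos(n pi x)] for x in [-1, 1]."""
    cols = [np.ones_like(x)]
    for k in range(1, n_freq + 1):
        cols += [np.sin(k * np.pi * x), np.cos(k * np.pi * x)]
    return np.stack(cols, axis=1)

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 100)
y = np.sign(np.sin(2 * np.pi * x)) + 0.1 * rng.normal(size=x.size)   # oscillatory signal plus noise
Phi = fourier_features(x)
beta = np.linalg.lstsq(Phi, y, rcond=None)[0]                        # basis coefficients
```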

Local Regression. f(X) = Σ_j βj φj(X), with basis coefficients and nonlinear features/basis functions. Globally supported basis functions (polynomial, Fourier) will not yield a good representation for functions with local structure; locally supported basis functions do better.

Kernel Regression. (Local) weighted least squares: weight each training point based on its distance to the test point. K is the kernel; h is the bandwidth of the kernel.

Nadaraya-Watson Kernel Regression. Fit a constant locally: f̂(X) = Σ_i K((X - Xi)/h) Yi / Σ_i K((X - Xi)/h), a kernel-weighted average of the Yi.

Nadaraya-Watson Kernel Regression (continued). With a box-car kernel, the estimate is the sum of the Ys in the h-ball around X divided by the number of points in that ball, i.e. the average of the Yi of the training points within distance h of X. Recall the NN classifier: average <-> majority vote.
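A minimal Nadaraya-Watson sketch with a Gaussian kernel (the kernel choice, bandwidth, and data are illustrative; a box-car kernel would recover the local-average form above):

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x_query, h=0.2):
    """Kernel-weighted average of the training labels at each query point."""
    d = x_query[:, None] - x_train[None, :]    # pairwise differences, shape (n_query, n_train)
    w = np.exp(-0.5 * (d / h) ** 2)            # Gaussian kernel weights
    return (w @ y_train) / w.sum(axis=1)       # weighted average per query point

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1, 1, 80))
y = np.sin(3 * x) + 0.1 * rng.normal(size=x.size)
y_hat = nadaraya_watson(x, y, np.linspace(-1, 1, 200), h=0.1)
```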

Choice of Bandwidth h. It should depend on n, the number of training data points (which determines the variance), and on the smoothness of the function (which determines the bias). Large bandwidth: averages more data points and reduces noise (lower variance). Small bandwidth: less smoothing, more accurate fit (lower bias). Bias-variance tradeoff: more to come in later lectures.

Spatially Adaptive Regression. If the function's smoothness varies spatially, we want to allow the bandwidth h to depend on X. Examples: local polynomials, splines, wavelets, regression trees.

Regression Trees. A binary decision tree splits on feature values (e.g. "Num children ≥ 2?" vs. "< 2") and fits the average (a constant) on each leaf.

Regression Trees (continued). A quad decision tree partitions the input space into cells of side h, with a polynomial fit on each leaf. To decide whether to split a cell, compare the residual error with and without the split: if the split helps, split; else stop.
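As an illustration of this splitting rule, here is a minimal 1-D regression tree with constant leaves (my own simplification of the slides' idea; the depth limit and improvement threshold are made-up choices):

```python
import numpy as np

def fit_tree(x, y, max_depth=4, min_gain=1e-3):
    """Greedy 1-D regression tree: split where it most reduces squared error, else fit a constant."""
    leaf = ("leaf", y.mean())
    if max_depth == 0 or x.size < 4:
        return leaf
    err_no_split = np.sum((y - y.mean()) ** 2)
    best = None
    for t in np.unique(x)[1:]:                               # candidate thresholds between data points
        left, right = y[x < t], y[x >= t]
        err = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if best is None or err < best[0]:
            best = (err, t)
    if best is None or err_no_split - best[0] < min_gain:    # split doesn't reduce error enough: stop
        return leaf
    t = best[1]
    return ("split", t,
            fit_tree(x[x < t], y[x < t], max_depth - 1, min_gain),
            fit_tree(x[x >= t], y[x >= t], max_depth - 1, min_gain))

def predict(tree, xq):
    """Route a query point down the tree to its leaf constant."""
    if tree[0] == "leaf":
        return tree[1]
    _, t, left, right = tree
    return predict(left, xq) if xq < t else predict(right, xq)
```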

Summary
Discriminative vs. Generative Classifiers
- Naïve Bayes vs. Logistic Regression
Regression
- Linear Regression: Least Squares Estimator, Normal Equations, Gradient Descent, Geometric Interpretation, Probabilistic Interpretation (connection to MLE)
- Regularized Linear Regression (connection to MAP): Ridge Regression, Lasso
- Polynomial Regression, Basis (Fourier, Wavelet) Estimators
- Kernel Regression (Localized)
- Regression Trees