Topics in Machine Learning - EE 5359: Model Assessment and Selection

Topics in Machine Learning - EE 5359: Model Assessment and Selection
Ioannis D. Schizas, Electrical Engineering Department, University of Texas at Arlington

Training and Generalization
Training stage: use the training data to learn a model for regression or classification. Generalization stage: given a new input x, apply the estimated mapping to produce an output y (a prediction or a class label). Generalization performance (e.g., RSS or misclassification error) needs to be assessed, since it guides the selection of the model or learning method. Setting: let Y be the target variable, X a vector of inputs, and f̂(X) a prediction model estimated from a training set T. Typical choices of the loss function L(Y, f̂(X)) are squared error and absolute error.
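
As a small illustration (a sketch; the function names below are mine, not from the slides), the two standard loss choices can be written as:

    import numpy as np

    def squared_error(y, y_hat):
        # L(Y, f_hat(X)) = (Y - f_hat(X))^2
        return (y - y_hat) ** 2

    def absolute_error(y, y_hat):
        # L(Y, f_hat(X)) = |Y - f_hat(X)|
        return np.abs(y - y_hat)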

Training Error vs. Testing Error
The test (generalization) error Err_T is defined for a fixed training set T; averaging over training sets gives the expected test error Err. The training error err̄ is the average loss over the training sample. (Figure: training-error and test-error curves over 100 training sets T of size N = 50, as model complexity varies.) We need to estimate Err; the training error is not a good performance indicator, since it underestimates the true error.
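
For reference, in the notation of The Elements of Statistical Learning (Hastie, Tibshirani & Friedman), which these slides appear to follow, the quantities above are

\[
\mathrm{Err}_T = \mathrm{E}\left[L(Y,\hat f(X)) \mid T\right], \qquad
\mathrm{Err} = \mathrm{E}_T\left[\mathrm{Err}_T\right], \qquad
\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} L\left(y_i,\hat f(x_i)\right).
\]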

Quantities in Classification
With a categorical response G taking one of K possible labels, we model p_k(X) = Pr(G = k | X) and classify according to the label with the largest estimated probability. Typical choices of the loss function are the 0-1 loss and the log-likelihood (deviance); the test and training errors are defined analogously to the regression case.

Generic Steps
Models have tuning constants, say α, that adjust complexity and affect the predictor. Model selection: estimate the performance of different models in order to choose the best one. Model assessment: after choosing a model, estimate its test error on new data. In a data-rich situation the data are split into three parts: a training set (e.g., 50%) to fit the models, a validation set (e.g., 25%) to estimate prediction error for model selection, and a test set (e.g., 25%) to assess the generalization error of the selected model; a sketch of such a split is shown below.
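
A minimal sketch of such a 50/25/25 split using scikit-learn's train_test_split; the synthetic data, exact proportions, and random seeds are purely illustrative:

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Illustrative data; in practice X, y come from the problem at hand.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))
    y = rng.normal(size=1000)

    # Carve off 50% for training, then split the rest 50/50 into
    # validation (25% of total) and test (25% of total).
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
    X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)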

Bias-Variance Decomposition
Assume that Y = f(X) + ε, where E[ε] = 0 and var(ε) = σ_ε². The expected prediction error at an input point X = x_0 then decomposes into the noise variance (the first term), the squared bias, and the variance of the fit. Typically, a more complex model has lower bias but higher variance.
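
Written out in ESL's notation, the decomposition is

\[
\mathrm{Err}(x_0) = \mathrm{E}\left[(Y-\hat f(x_0))^2 \mid X=x_0\right]
= \sigma_\varepsilon^2 + \left[\mathrm{E}\,\hat f(x_0) - f(x_0)\right]^2
+ \mathrm{E}\left[\hat f(x_0)-\mathrm{E}\,\hat f(x_0)\right]^2 .
\]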

k-NN Fit Example
For a k-nearest-neighbor fit the expected prediction error takes a closed form, assuming the training inputs x_i are fixed and the randomness is in the y_i. Here k is inversely related to model complexity: for small k the fit can adapt better to the underlying f(x); as k grows the bias goes up but the variance decreases due to averaging.
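
In the same notation, the closed form referred to above is

\[
\mathrm{Err}(x_0) = \sigma_\varepsilon^2
+ \left[f(x_0) - \frac{1}{k}\sum_{\ell=1}^{k} f(x_{(\ell)})\right]^2
+ \frac{\sigma_\varepsilon^2}{k},
\]

where \(x_{(\ell)}\) denote the k nearest neighbors of \(x_0\).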

Linear Fit Example
For a linear fit f̂_p(x) = x^T β̂ with p parameters, the prediction error at x_0 has the same form, with the variance term determined by the weights h(x_0) that produce the fit. Averaging the error over the training inputs, the variance contribution becomes (p/N)σ_ε², so model complexity is directly proportional to p.
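
The in-sample average of the error, in that notation, is

\[
\frac{1}{N}\sum_{i=1}^{N}\mathrm{Err}(x_i)
= \sigma_\varepsilon^2
+ \frac{1}{N}\sum_{i=1}^{N}\left[f(x_i) - \mathrm{E}\,\hat f(x_i)\right]^2
+ \frac{p}{N}\,\sigma_\varepsilon^2 .
\]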

Ridge Regression Bias Breakdown
For ridge regression the error takes the same form as before, except that the linear weights producing the fit are different. Let β_* denote the parameters of the best-fitting linear approximation to f. The average squared bias can then be written as the sum of a model bias, due to approximating f by a linear fit, and an estimation bias, due to regularization, which introduces extra bias in exchange for smaller variance.

The Behavior of Bias and Variance
Trading off bias against variance. (Figure: behavior of bias and variance as model complexity varies.)

Optimism of the Training Error Rate
Given a training set T, the generalization error Err_T is defined for that fixed training set, averaging over a new test point (X_0, Y_0); averaging over training sets gives the expected error Err. Since the fitting method adapts to its training data, the training error err̄ is an optimistic (lower) estimate of Err_T.

A Quantification of the Optimism
Consider the in-sample error Err_in, obtained by averaging over new responses Y_i^0 (new noise) at the given training inputs x_i, i = 1, …, N. The optimism is defined as op = Err_in − err̄. It is typically positive, since err̄ is biased downward as an estimate of Err_in. The average optimism over training sets, ω = E_y[op], is easier to estimate.

More Details
For squared-error, 0-1, and related loss functions, one can show that the average optimism is proportional to the sum of the covariances between each fitted value ŷ_i and its response y_i. The harder we fit the data, the larger these covariances and thus the larger the optimism. This yields an important relation between the expected in-sample error and the expected training error. If ŷ_i is obtained by a linear fit with d parameters, the covariance sum simplifies, and when Y = f(X) + ε the optimism grows linearly in d and in the noise variance; the relations are summarized below.
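
Spelled out (again following ESL), the relations referenced above are

\[
\omega = \frac{2}{N}\sum_{i=1}^{N}\mathrm{Cov}(\hat y_i, y_i), \qquad
\mathrm{E}_{\mathbf y}\left[\mathrm{Err}_{\mathrm{in}}\right]
= \mathrm{E}_{\mathbf y}\left[\overline{\mathrm{err}}\right]
+ \frac{2}{N}\sum_{i=1}^{N}\mathrm{Cov}(\hat y_i, y_i),
\]

and for a linear fit with d parameters \(\sum_i \mathrm{Cov}(\hat y_i, y_i) = d\,\sigma_\varepsilon^2\), so that under \(Y = f(X) + \varepsilon\)

\[
\omega = 2\,\frac{d}{N}\,\sigma_\varepsilon^2 .
\]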

Ways to Estimate the Test Error
Either estimate the optimism and add it to the training error err̄, as Cp, AIC, and BIC do, or use cross-validation or bootstrap techniques to form direct estimates of the average test error Err.

In-Sample Error Estimation
The general in-sample error estimate is the training error plus an estimate of the average optimism. Plugging in the optimism of a linear fit with d parameters gives the C_p statistic, where the noise variance estimate σ̂_ε² is obtained from the mean-squared error of a low-bias model. A similar idea underlies Akaike's information criterion (AIC), which holds asymptotically as N → ∞: given a family of densities for Y indexed by θ (containing the true density) and the training data, θ̂ is the maximum-likelihood estimate and loglik denotes the maximized log-likelihood.
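
In the same notation, the in-sample estimate and the resulting C_p statistic are

\[
\widehat{\mathrm{Err}}_{\mathrm{in}} = \overline{\mathrm{err}} + \hat\omega, \qquad
C_p = \overline{\mathrm{err}} + 2\,\frac{d}{N}\,\hat\sigma_\varepsilon^2 .
\]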

Akaike's Information Criterion (AIC)
Model selection: select the model that yields the smallest AIC. Given a set of models f_α(x) indexed by a tuning parameter α, consider the training error err̄(α) and the number of parameters d(α). AIC then gives an estimate of the test-error curve, and we select the parameter α̂ that minimizes it. The second (penalty) term is exact for linear models with additive errors and squared-error loss, and holds approximately for linear models fit by maximizing log-likelihoods. The formula does not hold in general for 0-1 loss, but it is still often used to determine tuning parameters in that case.
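
The two forms of AIC referred to above (the log-likelihood form and the tuning-parameter form used for model selection) are

\[
\mathrm{AIC} = -\frac{2}{N}\,\mathrm{loglik} + 2\,\frac{d}{N}, \qquad
\mathrm{AIC}(\alpha) = \overline{\mathrm{err}}(\alpha)
+ 2\,\frac{d(\alpha)}{N}\,\hat\sigma_\varepsilon^2 .
\]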

An Example
AIC is used to select the model order in a phoneme-recognition example, fit by logistic regression. (Figure: AIC-based model selection for this example.)

Effective Number of Parameters
Collect the outcomes y_1, y_2, …, y_N into a vector y in R^N, and similarly the predictions into ŷ. Consider a linear fitting method ŷ = S y, where S is an N×N matrix depending on the inputs x_i but not on the y_i (this covers linear regression and quadratic smoothers). The effective number of parameters is df(S) = trace(S). If S is a projection matrix onto a space spanned by M basis vectors, then trace(S) = M. If y arises from the additive-error model Y = f(X) + ε with variance σ_ε², then Σ_i Cov(ŷ_i, y_i) = trace(S) σ_ε², so the effective degrees of freedom can equivalently be written as df(ŷ) = Σ_i Cov(ŷ_i, y_i) / σ_ε².
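
As a concrete illustration (a sketch of mine, not from the slides), the effective degrees of freedom of a ridge-regression smoother can be computed directly from its hat matrix:

    import numpy as np

    def ridge_effective_df(X, lam):
        """Effective degrees of freedom trace(S) for ridge regression,
        with smoother matrix S = X (X'X + lam*I)^{-1} X'."""
        n, p = X.shape
        S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
        return np.trace(S)

    # Illustrative usage with synthetic inputs.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 10))
    print(ridge_effective_df(X, lam=0.0))   # ~10: plain least squares, trace(S) = p
    print(ridge_effective_df(X, lam=10.0))  # < 10: shrinkage lowers the effective df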

Bayesian Information Criterion (BIC)
BIC, like AIC, applies in settings where fitting is carried out by maximum likelihood; the generic form is BIC = −2·loglik + (log N)·d. Under a Gaussian model with variance σ_ε² the first term is proportional to the training error, which makes BIC proportional to AIC with the factor 2 replaced by log N. Although they look similar, they are motivated in completely different ways: BIC is derived from a Bayesian approach to model selection. Consider a set of candidate models M_m, m = 1, …, M, with parameters θ_m and prior distributions Pr(θ_m | M_m); we need to find the best model.

Posterior Model Probability
Here Z denotes the training data {x_i, y_i}, i = 1, …, N. The posterior model probability is proportional to Pr(M_m) · Pr(Z | M_m). To compare models M_m and M_l we form the ratio of their posteriors; its data-dependent part, Pr(Z | M_m) / Pr(Z | M_l), is the Bayes factor. If the ratio is greater than one we choose M_m, otherwise M_l. Typically a uniform model prior is assumed, so Pr(M_m) is constant.

Approximating the Likelihood
A Laplace approximation of the integral [Ripley, 1996] gives log Pr(Z | M_m) ≈ log Pr(Z | θ̂_m, M_m) − (d_m/2) log N + O(1), where θ̂_m is the maximum-likelihood estimate and d_m the number of free parameters in model M_m. The BIC criterion is obtained (up to a factor of −2) from this approximation, so minimizing BIC is equivalent to selecting the model with the largest approximate posterior probability. Given BIC_m for each model M_m, the posterior probabilities can be estimated as exp(−½·BIC_m) / Σ_l exp(−½·BIC_l).
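
A minimal sketch of that posterior estimate (the helper name and example BIC values are illustrative):

    import numpy as np

    def bic_posterior(bic):
        """Approximate posterior model probabilities from BIC values,
        Pr(M_m | Z) ~ exp(-0.5 * BIC_m) / sum_l exp(-0.5 * BIC_l)."""
        bic = np.asarray(bic, dtype=float)
        w = np.exp(-0.5 * (bic - bic.min()))  # subtract the minimum for numerical stability
        return w / w.sum()

    # Illustrative BIC values for three candidate models.
    print(bic_posterior([102.3, 98.7, 105.1]))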

BIC or AIC?
There is no clear choice between BIC and AIC. BIC is asymptotically consistent: given a family of models that includes the true model, the probability that BIC selects the right model goes to 1 as the number of training samples N → ∞. AIC is not asymptotically consistent and tends to choose models that are too complex as N → ∞. For finite samples, however, AIC may be the better option, since BIC's heavy penalty on complexity tends to select models that are too simple.

Minimum Description Length (MDL)
MDL is similar to BIC but is motivated by coding theory and data compression. Think of a datum z as a message that we want to encode and transmit; our selected model is a way of encoding the datum, and we choose the most parsimonious model, i.e., the shortest code. Consider a model M with parameters θ and data Z = (X, y) consisting of inputs and outputs, with conditional density Pr(y | θ, M, X). The message length required to transmit the outputs is the sum of two parts: the cost of transmitting the discrepancy between the model and the actual target values, and the cost of transmitting the model parameters.
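
Written out, the message length is

\[
\mathrm{length} = -\log \Pr(y \mid \theta, M, X) \;-\; \log \Pr(\theta \mid M).
\]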

MDL Example
The MDL principle says we should choose the model that minimizes the description length. Consider a single target y ~ N(θ, σ²) with model parameter θ ~ N(0, 1). The message length is then log σ + (y − θ)²/(2σ²) + θ²/2, up to constants: the smaller σ, the shorter the message length on average, since y is more concentrated around θ. Note the similarity to the BIC principle.

Cross-Validation (CV)
Cross-validation is the simplest and most widely used method for estimating the expected prediction error (EPE). K-fold CV: because data are scarce, split the dataset into K equally sized parts; in each round most parts are used to train and one part is used to test, e.g., K = 5. For the kth part (say k = 3) we fit the model using the other K − 1 parts and calculate the prediction error of the fitted model on the kth part. This is done in turn for k = 1, 2, …, K.

CV Details
Consider the partition mapping κ: {1, …, N} → {1, …, K}, indicating to which part observation i is allocated (by randomization), and let f̂^(−k)(x) denote the function fitted with the kth data part removed. The CV estimate of the EPE is CV(f̂) = (1/N) Σ_i L(y_i, f̂^(−κ(i))(x_i)). Typical choices for K are 5 or 10. K = N gives leave-one-out cross-validation, with κ(i) = i; each fit is obtained using all the data except the ith observation.
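
A minimal sketch of K-fold CV for estimating prediction error (the choice of ridge regression, the synthetic data, and the seeds are illustrative assumptions):

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.linear_model import Ridge

    def kfold_cv_error(X, y, alpha, K=5, seed=0):
        """K-fold CV estimate of squared-error prediction error for ridge regression."""
        kf = KFold(n_splits=K, shuffle=True, random_state=seed)
        errors = []
        for train_idx, test_idx in kf.split(X):
            model = Ridge(alpha=alpha).fit(X[train_idx], y[train_idx])
            resid = y[test_idx] - model.predict(X[test_idx])
            errors.append(np.mean(resid ** 2))
        return np.mean(errors)

    # Illustrative usage with synthetic data.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    y = X[:, 0] + 0.1 * rng.normal(size=200)
    print(kfold_cv_error(X, y, alpha=1.0))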

Selecting a Model via CV
Consider a set of models f(x, α) indexed by a tuning parameter α, and let f̂^(−k)(x, α) be the αth model fit with the kth part of the data removed. CV then traces out an estimated test-error curve, and we pick the tuning parameter α̂ that minimizes it. If K = N, CV is close to unbiased but has high variance, since all the training sets look very similar; it also carries a high computational burden. How to pick K? It depends on the slope of the learning curve, which is not known.

Selecting K and N
If N = 200 and K = 5, there is not much difference between the CV estimate and the actual EPE. If N = 50 and K = 5, CV will give a significantly higher estimate of the EPE (note the steep slope of the learning curve around 50 training samples). CV estimates overestimate (are biased upward relative to) the actual EPE.

Generalized CV
Generalized cross-validation (GCV) is a convenient approximation to leave-one-out CV for linear fitting under squared-error loss. In linear fitting the prediction takes the form ŷ = S y, and for many linear fitting methods the leave-one-out residuals can be written in terms of the diagonal elements S_ii; GCV replaces each S_ii by the average trace(S)/N.
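
Spelled out, the leave-one-out identity and the GCV approximation are

\[
\frac{1}{N}\sum_{i=1}^{N}\left[y_i - \hat f^{(-i)}(x_i)\right]^2
= \frac{1}{N}\sum_{i=1}^{N}\left[\frac{y_i - \hat f(x_i)}{1 - S_{ii}}\right]^2, \qquad
\mathrm{GCV}(\hat f) = \frac{1}{N}\sum_{i=1}^{N}\left[\frac{y_i - \hat f(x_i)}{1 - \mathrm{trace}(S)/N}\right]^2 .
\]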

Bootstrap Methods
The bootstrap is a general tool for assessing statistical accuracy and can also be used to estimate the EPE. Let the training data be Z = (z_1, …, z_N) with z_i = (x_i, y_i), and let S(Z) be any quantity computed from Z, e.g., its variance. Bootstrap samples are drawn with replacement from the training data, and the sampling distribution of S(Z) is approximated by a Monte Carlo estimate based on the empirical distribution of the data.

Bootstrap for EPE Estimation For each observation keep track of predictions from boot-strap samples not containing that observation (Leave-one out) C -i is the set of indices of the bootstrap samples b that do not contain observation i B should be large enough to avoid zero C -i Suffering from Bias; Better performance using the 0.632 estimator Improved estimator 0.632+ : 31