Lecture 13: Model selection and regularization

Reading: Sections 6.1-6.2.1. STATS 202: Data mining and analysis. October 23, 2017.

What do we know so far

In linear regression, adding predictors always decreases the training error (RSS). However, adding predictors does not necessarily improve the test error.

When n < p, there is no least squares solution $\hat{\beta} = (X^T X)^{-1} X^T y$, because $X^T X$ is singular. So, we must find a way to select fewer predictors if we want to use least squares linear regression.
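As a quick illustration (my own sketch, not part of the lecture), the following NumPy check shows that $X^T X$ is rank-deficient when n < p, so the normal equations have no unique solution; the sizes are arbitrary:

    import numpy as np

    # Hypothetical sizes for illustration: fewer observations than predictors.
    n, p = 10, 20
    rng = np.random.default_rng(0)
    X = rng.normal(size=(n, p))

    XtX = X.T @ X                       # p x p matrix
    print(np.linalg.matrix_rank(XtX))   # at most n = 10 < p = 20, so XtX is singular
    # np.linalg.inv(XtX) would fail (or be numerically meaningless), so the
    # closed-form least-squares estimate (X^T X)^{-1} X^T y is not available.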

Best subset selection

Simple idea: let's compare all models with k predictors. There are $\binom{p}{k} = p!/[k!(p-k)!]$ possible models. Choose the model with the smallest RSS. Do this for every possible k.

[Figure: residual sum of squares (left) and $R^2$ (right) of the best model of each size, plotted against the number of predictors.]

Naturally, the RSS and $R^2$ improve as we increase k.
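A minimal sketch of best subset selection for a fixed k (my own illustration in Python with scikit-learn; the helper name best_subset is hypothetical):

    import itertools
    import numpy as np
    from sklearn.linear_model import LinearRegression

    def best_subset(X, y, k):
        """Return the k-predictor subset with the smallest RSS."""
        n, p = X.shape
        best_rss, best_vars = np.inf, None
        for subset in itertools.combinations(range(p), k):   # all (p choose k) models
            Xs = X[:, subset]
            fit = LinearRegression().fit(Xs, y)
            rss = np.sum((y - fit.predict(Xs)) ** 2)
            if rss < best_rss:
                best_rss, best_vars = rss, subset
        return best_vars, best_rss

Running this for every k = 1, ..., p gives one candidate model per size; the next slides discuss how to choose among them.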

Best subset selection

To optimize k, we want to minimize the test error, not the training error. We could use cross-validation, or alternative estimates of the test error:

1. Akaike Information Criterion (AIC) or $C_p$: $\frac{1}{n}(\mathrm{RSS} + 2k\hat{\sigma}^2)$, where $\hat{\sigma}^2$ is an estimate of the irreducible error, typically $\hat{\sigma}^2 = \mathrm{RSS}/(n-p-1)$ computed from the full model with all p predictors. (For least squares, AIC and $C_p$ are proportional to each other, so they rank models identically.)

2. Bayesian Information Criterion (BIC): $\frac{1}{n}(\mathrm{RSS} + \log(n)\,k\,\hat{\sigma}^2)$.

3. Adjusted $R^2$: $R^2_{\mathrm{adj}} = 1 - \dfrac{\mathrm{RSS}/(n-k-1)}{\mathrm{TSS}/(n-1)}$, where $\mathrm{TSS} = \sum_i (y_i - \bar{y})^2$.

How do they compare to cross-validation? They are much less expensive to compute, but they are motivated by asymptotic arguments and rely on model assumptions (e.g. normality of the errors). Equivalent criteria exist for other models (e.g. logistic regression).
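These criteria are simple functions of the RSS; a sketch under the slide's definitions, where sigma2 is the estimate of the irreducible error from the full model (variable names are my own):

    import numpy as np

    def cp_aic(rss, n, k, sigma2):
        # C_p (equivalently AIC, up to scaling): (RSS + 2 k sigma^2) / n
        return (rss + 2 * k * sigma2) / n

    def bic(rss, n, k, sigma2):
        # BIC: (RSS + log(n) k sigma^2) / n
        return (rss + np.log(n) * k * sigma2) / n

    def adjusted_r2(rss, tss, n, k):
        # Adjusted R^2: 1 - [RSS / (n - k - 1)] / [TSS / (n - 1)]
        return 1 - (rss / (n - k - 1)) / (tss / (n - 1))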

Example

Best subset selection for the Credit dataset.

[Figure: $C_p$, BIC, and adjusted $R^2$ of the best model of each size, each plotted against the number of predictors.]

Example

Cross-validation vs. the BIC.

[Figure: square root of BIC, validation set error, and cross-validation error, each plotted against the number of predictors.]

Recall: in k-fold cross-validation, we can estimate a standard error for our test error estimate and apply the one-standard-error rule.
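A sketch of the one-standard-error rule applied to a table of cross-validation errors (the numbers here are randomly generated placeholders, purely for illustration):

    import numpy as np

    # cv_errors[k-1, j]: estimated test error of the best k-predictor model on fold j.
    cv_errors = np.random.default_rng(1).normal(loc=150, scale=10, size=(10, 5))

    mean_err = cv_errors.mean(axis=1)
    se_err = cv_errors.std(axis=1, ddof=1) / np.sqrt(cv_errors.shape[1])

    k_min = np.argmin(mean_err)
    threshold = mean_err[k_min] + se_err[k_min]
    # One-standard-error rule: pick the smallest model within one SE of the minimum.
    k_chosen = np.min(np.where(mean_err <= threshold)[0]) + 1
    print(k_chosen)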

Stepwise selection methods

Best subset selection has 2 problems:

1. It is often very expensive computationally: we have to fit $2^p$ models!

2. For a fixed k, there are so many candidate models that we increase our chances of overfitting (just like in multiple testing!). The selected model has high variance.

To mitigate these problems, we can restrict our search space for the best model. This reduces the variance of the selected model at the expense of an increase in bias.

Forward selection

Start with the null model (intercept only); at each step, add the predictor that most improves the fit (i.e. yields the smallest RSS); repeat until all p predictors have been added, producing one candidate model of each size k.
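A greedy forward-selection sketch in Python (my own illustration; the lecture slide does not give code):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def forward_selection(X, y):
        """At each step, add the predictor that most reduces the RSS."""
        n, p = X.shape
        selected, remaining, path = [], list(range(p)), []
        while remaining:
            best_rss, best_j = np.inf, None
            for j in remaining:
                cols = selected + [j]
                fit = LinearRegression().fit(X[:, cols], y)
                rss = np.sum((y - fit.predict(X[:, cols])) ** 2)
                if rss < best_rss:
                    best_rss, best_j = rss, j
            selected.append(best_j)
            remaining.remove(best_j)
            path.append((list(selected), best_rss))
        return path   # one nested model per size k = 1, ..., p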

Forward selection vs. best subset

Forward selection fits only $1 + p(p+1)/2$ models instead of $2^p$, but it is not guaranteed to find the best subset of each size: the best k-variable model need not contain the variables of the best (k-1)-variable model.

Backward selection

Start with the full model containing all p predictors; at each step, remove the predictor whose removal increases the RSS the least; repeat until only the intercept remains, again producing one candidate model of each size.

Forward vs. backward selection

You cannot apply backward selection when p > n. Although it seems like they should, they need not produce the same sequence of models.

Example. Let $X_1, X_2 \sim N(0, \sigma)$ be independent, $X_3 = X_1 + 3X_2$, and $Y = X_1 + 2X_2 + \epsilon$. Regress Y onto $X_1, X_2, X_3$.

Forward: $\{X_3\} \to \{X_3, X_2\} \to \{X_3, X_2, X_1\}$

Backward: $\{X_1, X_2, X_3\} \to \{X_1, X_2\} \to \{X_2\}$
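The first forward step can be checked numerically; a short simulation sketch under the stated model (with unit-variance predictors and noise, which the slide does not specify):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000
    x1, x2 = rng.normal(size=n), rng.normal(size=n)
    x3 = x1 + 3 * x2
    y = x1 + 2 * x2 + rng.normal(size=n)

    # For a single predictor, the smallest RSS corresponds to the largest
    # absolute correlation with y.
    for name, x in [("X1", x1), ("X2", x2), ("X3", x3)]:
        print(name, abs(np.corrcoef(x, y)[0, 1]))
    # X3 has the highest correlation, so forward selection starts from {X3},
    # while backward selection, per the slide's example, ends at {X2}.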

Other stepwise selection strategies

Mixed stepwise selection: do forward selection, but at every step remove any variables that are no longer necessary.

Forward stagewise selection: like forward selection, but when introducing a new variable, do not adjust the coefficients of the previously added variables.

...

Shrinkage methods

The idea is to perform a linear regression while regularizing, or shrinking, the coefficients $\hat{\beta}$ toward 0.

Why would shrunken coefficients be better? Shrinking introduces bias, but may significantly decrease the variance of the estimates. If the latter effect is larger, this decreases the test error.

There are also Bayesian motivations to do this: the prior tends to shrink the parameters.

Ridge regression

Ridge regression solves the following optimization:

$$\min_{\beta} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

The first term is the RSS of the model; the second is $\lambda$ times the squared $\ell_2$ norm of $\beta$ (excluding the intercept), $\|\beta\|_2^2$.

The parameter $\lambda \ge 0$ is a tuning parameter: it modulates the importance of fit vs. shrinkage. We find an estimate $\hat{\beta}^R_\lambda$ for many values of $\lambda$ and then choose $\lambda$ by cross-validation. Fortunately, this is no more expensive than running a single least-squares regression.
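A sketch of fitting ridge regression over a grid of $\lambda$ values and choosing $\lambda$ by cross-validation with scikit-learn (which calls $\lambda$ alpha); the data here are synthetic placeholders:

    import numpy as np
    from sklearn.linear_model import RidgeCV

    # Synthetic data, purely for illustration.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))
    y = X[:, 0] + 2 * X[:, 1] + rng.normal(size=100)

    lambdas = np.logspace(-2, 4, 50)              # grid of tuning parameters
    ridge_cv = RidgeCV(alphas=lambdas).fit(X, y)  # leave-one-out CV over the grid by default
    print(ridge_cv.alpha_)                        # selected lambda
    print(ridge_cv.coef_)                         # shrunken coefficient estimates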

Ridge regression

In least-squares linear regression, scaling the variables has no effect on the fit of the model:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p.$$

Multiplying $X_1$ by a constant c can be compensated by dividing $\hat{\beta}_1$ by c, i.e. after doing this we obtain the same RSS.

In ridge regression, this is not true, because the penalty depends on the size of the coefficients. In practice, we therefore scale each variable to have sample variance 1 before running the regression. This prevents penalizing some coefficients more than others.
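In scikit-learn, this standardization step can be bundled with the ridge fit in a pipeline; a minimal sketch with synthetic data whose predictors sit on very different scales:

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10)) * np.array([1.0, 100.0] + [1.0] * 8)  # mismatched scales
    y = X[:, 0] + 0.02 * X[:, 1] + rng.normal(size=100)

    # Scale each predictor to sample variance 1 before applying the ridge penalty,
    # so no coefficient is penalized more heavily just because of its units.
    model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
    model.fit(X, y)
    print(model.named_steps["ridge"].coef_)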

Example: ridge regression

Ridge regression on the Credit dataset.

[Figure: standardized coefficients for Income, Limit, Rating, and Student, plotted against $\lambda$ (left) and against $\|\hat{\beta}^R_\lambda\|_2 / \|\hat{\beta}\|_2$ (right).]

Bias-variance tradeoff

In a simulation study, we compute bias, variance, and test error as a function of $\lambda$.

[Figure: mean squared error and its bias and variance components, plotted against $\lambda$ (left) and against $\|\hat{\beta}^R_\lambda\|_2 / \|\hat{\beta}\|_2$ (right).]

Cross-validation would yield an estimate of the test error.