Linear Model Selection and Regularization: especially useful in high dimensions (p >> 100).


Why Linear Model Regularization? Linear models are simple, but consider p >> n: with more features than data records we can (often) fit a model with zero training error even when the features are independent of the response; such a model is overfitted. Fewer features in the model may lead to a smaller test error, so we add constraints or a penalty on the coefficients. A model with fewer features is also more interpretable. 2

Selection and Regularization Methods. Subset selection: evaluate all subsets and select the best model (by CV). Shrinkage (regularization): a penalty on the size of the coefficients shrinks them towards zero. Dimension reduction: from the p dimensions select an M-dimensional subspace, M < p, and fit a linear model in this M-dimensional subspace. 3

Best Subset Selection. The null model M_0 predicts the sample mean. For k in 1:p, fit all (p choose k) models with exactly k predictors and select the one with the smallest RSS (equivalently, the largest R^2); denote it M_k. Finally, select a single best model from among M_0, ..., M_p using cross-validation, AIC, BIC or adjusted R^2 (a sketch follows). 4
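As an illustration (not from the slides), a minimal R sketch of best subset selection, assuming a data frame dat whose response column is y and whose remaining columns are the predictors:

  # Best subset selection by exhaustive search (feasible only for small p).
  best_subset <- function(dat, response = "y") {
    preds <- setdiff(names(dat), response)
    p <- length(preds)
    best <- vector("list", p)
    for (k in 1:p) {
      combos <- combn(preds, k, simplify = FALSE)   # all predictor subsets of size k
      fits <- lapply(combos, function(vars) lm(reformulate(vars, response), data = dat))
      rss <- sapply(fits, function(f) sum(residuals(f)^2))
      best[[k]] <- fits[[which.min(rss)]]           # M_k: smallest RSS among the size-k models
    }
    best   # list of M_1, ..., M_p; compare across sizes with CV, AIC, BIC or adjusted R^2
  }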

Best subset selection is tractable only up to roughly p = 30-40. Similarly, it can be applied to logistic regression with deviance as the error measure instead of RSS; again, CV is used to select the model 'size'. 5
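For the classification case just mentioned, a hedged sketch: fit each candidate model with glm() and rank models by deviance instead of RSS (the data frame dat, binary response y and predictor names x1, x2 are assumptions):

  fit <- glm(y ~ x1 + x2, data = dat, family = binomial)  # one candidate subset
  deviance(fit)                                           # plays the role of RSS when ranking subsets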

Forward Stepwise Selection. The null model M_0 predicts the sample mean. For k in 0:(p-1), consider the (p - k) models obtained by adding one predictor to M_k and select the one with the smallest RSS (equivalently, the largest R^2); denote it M_{k+1}. Finally, select a single best model from among M_0, ..., M_p using cross-validation, AIC, BIC or adjusted R^2. 6
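One possible implementation uses the leaps package (an assumption; the slides do not name one). regsubsets() runs the forward search and keeps the best model of each size:

  library(leaps)
  fwd <- regsubsets(y ~ ., data = dat, nvmax = ncol(dat) - 1, method = "forward")
  summary(fwd)$bic                              # BIC of the best model of each size
  coef(fwd, id = which.min(summary(fwd)$bic))   # coefficients at the BIC-optimal size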

Backward Stepwise Selection. Start from the full model M_p with all p predictors (standard linear regression). For k in (p-1):0, consider the (k + 1) models obtained by removing one predictor from M_{k+1} and select the one with the smallest RSS (equivalently, the largest R^2); denote it M_k. Finally, select a single best model from among M_0, ..., M_p using cross-validation, AIC, BIC or adjusted R^2. 7
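The same leaps interface covers the backward search (again an assumed implementation choice):

  bwd <- regsubsets(y ~ ., data = dat, nvmax = ncol(dat) - 1, method = "backward")
  summary(bwd)$adjr2    # adjusted R^2 of the best model of each size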

Hybrid Approaches: go forward, but at any time try to eliminate a useless predictor. Each algorithm may provide a different subset for a given size k (except for sizes 0 and p ;-). None of these subsets has to be optimal with respect to the mean test error (a sketch follows). 8
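Base R's step() performs such a hybrid search, adding and dropping one term at a time with an AIC penalty; a minimal sketch on the hypothetical dat:

  null <- lm(y ~ 1, data = dat)                 # intercept-only starting model
  full <- lm(y ~ ., data = dat)                 # largest model considered
  hyb  <- step(null, scope = formula(full), direction = "both", trace = 0)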

Choosing the Optimal Model. Two main approaches: (1) analytical criteria, i.e. an adjustment ('penalty') to the training error that corrects for overfitting; these should not be used for p >> n. (2) A direct estimate of the test error, using either a validation set or a cross-validation approach. 9

Analytical Criteria. Mallows' C_p ('in-sample' estimate of test error): C_p = (RSS + 2 d σ̂²)/n. Akaike information criterion (more general; proportional to C_p here): AIC = (RSS + 2 d σ̂²)/(n σ̂²). Bayesian Information Criterion: BIC = (RSS + log(n) d σ̂²)/n. Adjusted R²: 1 − [RSS/(n − d − 1)]/[TSS/(n − 1)]; maximizing it is equivalent to minimizing RSS/(n − d − 1). Here d is the number of predictors and σ̂² an estimate of the error variance. 10
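A hedged sketch of computing these criteria in R for one candidate least-squares model fit (e.g. one of the M_k above) with d predictors, estimating σ̂² from the full model, which is a common choice assumed here:

  n      <- nobs(fit)
  d      <- length(coef(fit)) - 1                    # number of predictors, excluding the intercept
  rss    <- sum(residuals(fit)^2)
  tss    <- sum((dat$y - mean(dat$y))^2)
  sigma2 <- summary(lm(y ~ ., data = dat))$sigma^2   # error variance estimated from the full model
  Cp     <- (rss + 2 * d * sigma2) / n
  BICval <- (rss + log(n) * d * sigma2) / n
  adjR2  <- 1 - (rss / (n - d - 1)) / (tss / (n - 1))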

Example 11

Validation and Cross-Validation. Validation: at the beginning, exclude 1/4 of the data samples from training and use them for error estimation in model selection. Cross-validation: at the beginning, split the data records into k = 10 folds; for each k in 1:10, hold out the k-th fold from training and use it for error estimation in model selection. Note: different runs may provide different subsets of size 3. 12
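A minimal sketch of the 10-fold procedure for choosing the subset size, reusing the hypothetical best_subset() helper sketched earlier:

  set.seed(1)
  K      <- 10
  folds  <- sample(rep(1:K, length.out = nrow(dat)))
  p      <- ncol(dat) - 1
  cv.err <- matrix(NA, nrow = K, ncol = p)
  for (j in 1:K) {
    train  <- dat[folds != j, ]
    test   <- dat[folds == j, ]
    models <- best_subset(train)                   # best model of each size on this training split
    for (size in 1:p) {
      pred <- predict(models[[size]], newdata = test)
      cv.err[j, size] <- mean((test$y - pred)^2)
    }
  }
  colMeans(cv.err)   # CV estimate of the test MSE for each subset size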

Example 13

One Standard Error Rule: take the model size with the minimal CV error, calculate a one-standard-error interval around this error, and select the smallest model whose error lies inside this interval. 14
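Continuing the sketch above, the one-standard-error rule applied to the cv.err matrix:

  mean.err <- colMeans(cv.err)
  se.err   <- apply(cv.err, 2, sd) / sqrt(nrow(cv.err))
  best     <- which.min(mean.err)
  chosen   <- min(which(mean.err <= mean.err[best] + se.err[best]))  # smallest size inside the interval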

Shrinkage Methods. A penalty is placed on non-zero model parameters; there is no penalty on the intercept. Ridge: minimize RSS + λ Σ_j β_j². Lasso: minimize RSS + λ Σ_j |β_j|. 15
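A hedged sketch with the glmnet package (an assumption; the slide does not name an implementation). glmnet standardizes the predictors by default and leaves the intercept unpenalized:

  library(glmnet)
  x <- model.matrix(y ~ ., data = dat)[, -1]   # predictor matrix without the intercept column
  y <- dat$y
  ridge <- glmnet(x, y, alpha = 0)             # alpha = 0: ridge penalty
  lasso <- glmnet(x, y, alpha = 1)             # alpha = 1: lasso penalty
  cv    <- cv.glmnet(x, y, alpha = 1)          # choose lambda by cross-validation
  coef(cv, s = "lambda.1se")                   # coefficients at the one-standard-error lambda

The "lambda.1se" choice corresponds to the one-standard-error rule of slide 14.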

Ridge. The parameter λ penalizes the sum of squared coefficients Σ_j β_j²; the intercept is intentionally excluded from the penalty. We can center the features and fix the intercept at ȳ. For centered features the ridge solution is β̂ = (XᵀX + λI)⁻¹Xᵀy; for orthonormal features it reduces to a simple scaling, β̂_j/(1 + λ). The solution depends on the scale of the features, so standardization is useful. 16
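For reference, the closed-form ridge solution for centered (here also standardized) features, a sketch using the x and y defined above:

  xc <- scale(x)                                # center and standardize the features
  yc <- y - mean(y)                             # centering y fixes the intercept at mean(y)
  lambda <- 1                                   # an arbitrary illustrative penalty
  beta.ridge <- solve(t(xc) %*% xc + lambda * diag(ncol(xc)), t(xc) %*% yc)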

Ridge coefficients: cancer example 17

Lasso Regression. The penalty is λ Σ_j |β_j|; it forces some coefficients to be exactly zero. An equivalent specification: minimize RSS subject to Σ_j |β_j| ≤ s. 18

Ridge vs. Lasso 19

Correlated X, Parameter Shrinkage 20

Best Subset, Ridge, Lasso. Coefficient change for orthonormal features: best subset keeps the least-squares estimate β̂_j if it is among the M largest in absolute value and sets it to zero otherwise (hard thresholding); ridge scales it to β̂_j/(1 + λ); lasso soft-thresholds it to sign(β̂_j)(|β̂_j| − λ)₊. 21
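The three orthonormal-case updates written out as small R functions (an illustration, not from the slides); beta is the vector of least-squares estimates:

  hard.threshold <- function(beta, M) {                 # best subset of size M
    keep <- rank(-abs(beta)) <= M
    ifelse(keep, beta, 0)
  }
  ridge.shrink   <- function(beta, lambda) beta / (1 + lambda)
  soft.threshold <- function(beta, lambda) sign(beta) * pmax(abs(beta) - lambda, 0)   # lasso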

Example: p = 45, n = 50, only 2 predictors related to the output. 22

Bias-Variance Decomposition. Bias is the 'systematic error', usually caused by a restricted model subspace; variance is the variability of the estimate. We wish both to be zero. For a test point x the expected test MSE decomposes as E[(y − f̂(x))²] = Bias(f̂(x))² + Var(f̂(x)) + Var(ε).

Example of Bias

Example: Lasso, Ridge Regression, plotted against a positive penalty. Red: MSE; green: variance; black: squared bias.

MSE: 100 observations, p differs

Penalty ~ Prior Model Probability. Ridge: if we assume independent Normal prior probabilities for the parameters, β_j ~ N(0, τ²), then the ridge estimate is the most likely value (the posterior mode). Bayes' formula: P(β|X) = P(X|β) P(β) / P(X), where P(X) is constant, P(β) is the prior probability, P(X|β) the likelihood, and P(β|X) the posterior probability. 27
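A brief derivation (a sketch under the Gaussian assumptions above, with noise variance σ² and prior variance τ²): the log-posterior is

  log P(β | X, y) = log P(y | X, β) + log P(β) + const
                  = −(1/(2σ²)) Σ_i (y_i − x_iᵀβ)² − (1/(2τ²)) Σ_j β_j² + const,

so maximizing it (finding the posterior mode) is the same as minimizing RSS + λ Σ_j β_j² with λ = σ²/τ², i.e. exactly the ridge criterion.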

Prior Probability: Ridge, Lasso. Ridge: Normal distribution. Lasso: Laplace distribution. 28

Linear Models for Regression. Ridge, Lasso: penalization. PCR, PLS: a change of coordinate system plus dimension selection. 29

Principal Component Analysis (PCA): eigenvalues, eigenvectors. 30
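In base R the eigen-decomposition behind PCA can be obtained with prcomp() (a sketch on the x matrix defined earlier):

  pc <- prcomp(x, center = TRUE, scale. = TRUE)
  pc$sdev^2      # eigenvalues: variances along the principal directions
  pc$rotation    # eigenvectors: the principal directions (loadings)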

PCR, PLS. PCR (principal component regression): select the directions corresponding to the largest eigenvalues and fit regression coefficients for these directions only. For size = p it is equivalent to ordinary linear regression. Partial least squares (PLS) also considers Y when selecting directions: it calculates regression coefficients, weights the features by them and computes the first PLS direction; further directions are found similarly, orthogonal to the first (a sketch follows). 31
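A hedged sketch with the pls package (an assumption; the slides do not name an implementation): pcr() and plsr() fit PCR and PLS, and the number of components can be chosen by cross-validation:

  library(pls)
  pcr.fit <- pcr(y ~ ., data = dat, scale = TRUE, validation = "CV")
  pls.fit <- plsr(y ~ ., data = dat, scale = TRUE, validation = "CV")
  validationplot(pcr.fit, val.type = "MSEP")   # CV error versus the number of components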