Regularization Methods
Business Analytics Practice, Winter Term 2015/16
Stefan Feuerriegel

Today's Lecture Objectives
1 Avoiding overfitting and improving model interpretability with the help of regularization methods
2 Understanding both ridge regression and the LASSO
3 Applying these methods for variable selection

Outline 1 Motivation for Regularization 2 Ridge Regression 3 LASSO 4 Comparison 5 Wrap-Up Regularization 3

Outline 1 Motivation for Regularization 2 Ridge Regression 3 LASSO 4 Comparison 5 Wrap-Up Regularization: Motivation for Regularization 4

Motivation for Regularization
Linear models are frequently favored for their interpretability and often good predictive performance. Yet Ordinary Least Squares (OLS) estimation faces two challenges:
1 Interpretability: OLS cannot distinguish variables with little or no influence; these variables distract from the relevant regressors.
2 Overfitting: OLS works well when the number of observations n is much larger than the number of predictors p, i.e. n ≫ p. If n is not much larger than p, overfitting results in low accuracy on unseen observations. If n < p, the variance of the estimates is infinite and OLS fails.
As a remedy, one can identify only the relevant variables by feature selection.

Motivation for Regularization
Fitting techniques as alternatives to OLS:
Subset selection: Pick only a subset of all p variables which is assumed to be relevant, then estimate the model with least squares using this reduced set of variables.
Dimension reduction: Project the p predictors into a d-dimensional subspace with d < p; these d features are then used to fit a linear model by least squares.
Shrinkage methods, also called regularization: Fit the model with all p variables, but shrink some coefficients towards zero; this has the effect of reducing variance.

Regularization Methods
Fit linear models with least squares but impose constraints on the coefficients; instead of explicit constraints, alternative formulations add a penalty term to the OLS objective.
The best-known methods are ridge regression and the LASSO (least absolute shrinkage and selection operator).
Ridge regression can shrink parameters close to zero.
LASSO models can shrink some parameters exactly to zero, thereby performing implicit variable selection.

Outline 1 Motivation for Regularization 2 Ridge Regression 3 LASSO 4 Comparison 5 Wrap-Up Regularization: Ridge Regression 8

Ridge Regression
OLS estimation: recall the OLS technique to estimate \beta = [\beta_0, \beta_1, \ldots, \beta_p]^T by minimizing the residual sum of squares (RSS):
\beta^{OLS} = \min_\beta \text{RSS} = \min_\beta \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2
Ridge regression imposes a penalty on the size of the coefficients to reduce the variance of the estimates:
\beta^{ridge} = \min_\beta \underbrace{\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2}_{\text{RSS}} + \lambda \underbrace{\sum_{j=1}^{p} \beta_j^2}_{\text{shrinkage penalty}}

Tuning Parameter
The tuning parameter λ > 0 controls the relative impact of the penalty.
The penalty term λ ∑ β_j² (summing over j = 1, …, p) has the effect of shrinking the β_j towards zero.
If λ → 0, the penalty term has no effect (similar to OLS).
The choice of λ is critical; it is determined separately via cross-validation.
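To make the penalized objective concrete, here is a minimal R sketch (not from the slides) that evaluates the ridge objective for given coefficients, data, and λ; the function name and arguments are illustrative only.
ridge_objective <- function(beta0, beta, X, y, lambda) {
  rss <- sum((y - beta0 - X %*% beta)^2)  # residual sum of squares
  penalty <- lambda * sum(beta^2)         # shrinkage penalty on the slopes only
  rss + penalty
}
# e.g. ridge_objective(0, c(1, -2), matrix(rnorm(20), 10, 2), rnorm(10), lambda=1)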

Ridge Regression in R
Predicting salaries of U.S. baseball players based on game statistics.
Loading the data set Hitters:
library(ISLR)                 # Hitters is located inside ISLR
data(Hitters)
Hitters <- na.omit(Hitters)   # salary can be missing
Loading the package glmnet, which implements ridge regression:
library(glmnet)
The main function glmnet(x, y, alpha=0) requires the dependent variable y and the regressors x. The function only processes numerical input; categorical variables need to be transformed via model.matrix(...).

Ridge Regression in R
Prepare the variables:
set.seed(0)
# drop 1st column with intercept (glmnet already adds one)
x <- model.matrix(Salary ~ ., Hitters)[, -1]
y <- Hitters$Salary
train_idx <- sample(nrow(x), size=0.9*nrow(x))
x.train <- x[train_idx, ]
x.test <- x[-train_idx, ]
y.train <- y[train_idx]
y.test <- y[-train_idx]
Call ridge regression, automatically testing a sequence of λ values:
lm.ridge <- glmnet(x.train, y.train, alpha=0)
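glmnet chooses its own λ sequence by default; a custom grid can also be supplied via the lambda argument. A small sketch with an arbitrary grid (not part of the original slides):
grid <- 10^seq(4, -2, length.out=100)   # arbitrary grid from 10^4 down to 10^-2
lm.ridge.grid <- glmnet(x.train, y.train, alpha=0, lambda=grid)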

Ridge Regression in R
coef(...) retrieves the coefficients belonging to each λ:
dim(coef(lm.ridge))
## [1]  20 100
Here: 100 models with different λ, each with 20 coefficients. For example, the 50th model is as follows:
lm.ridge$lambda[50]          # tested lambda value
## [1] 2581.857
head(coef(lm.ridge)[, 50])   # estimated coefficients
##  (Intercept)        AtBat         Hits        HmRun         Runs
## 211.76123020   0.08903326   0.37913073   1.21041548   0.64115228
##          RBI
##   0.59834311
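Coefficients can also be requested at a λ value that is not on the fitted grid via the s argument; by default glmnet interpolates along the computed path. A small sketch with an arbitrary value:
coef(lm.ridge, s=50)   # coefficients at lambda = 50 (interpolated from the fitted path)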

Ridge Regression in R
plot(model, xvar="lambda") investigates the influence of λ on the estimated coefficients of all variables:
plot(lm.ridge, xvar="lambda")
[Figure: coefficient paths; y-axis "Coefficients", x-axis "Log Lambda"]
The bottom axis gives ln λ, the top axis the number of non-zero coefficients.

Plot: Lambda vs. Coefficients
[Figure: coefficient paths against λ, highlighting Division, League, New League, Years, and Walks; all other variables are shown in gray]
Manual effort is necessary for a pretty format.
As λ increases, the coefficients shrink towards zero.

Parameter Tuning
The optimal λ is determined via cross-validation by minimizing the mean squared prediction error. Usage is cv.glmnet(x, y, alpha=0):
cv.ridge <- cv.glmnet(x.train, y.train, alpha=0)
Optimal λ and corresponding coefficients:
cv.ridge$lambda.min
## [1] 29.68508
head(coef(cv.ridge, s="lambda.min"))
## 6 x 1 sparse Matrix of class "dgCMatrix"
##                       1
## (Intercept) 109.4192279
## AtBat        -0.6764771
## Hits          2.5974777
## HmRun        -0.7058689
## Runs          1.8565943
## RBI           0.3434801
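Besides lambda.min, cv.glmnet also stores lambda.1se, the largest λ whose cross-validated error lies within one standard error of the minimum; it is sometimes preferred as a more strongly regularized choice. A short sketch (not from the slides):
cv.ridge$lambda.1se                   # largest lambda within one SE of the minimum CV error
head(coef(cv.ridge, s="lambda.1se"))  # coefficients at that lambda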

Parameter Tuning
plot(cv.model) compares the mean squared error across λ:
plot(cv.ridge)
[Figure: cross-validated mean squared error vs. log(lambda); top axis: number of non-zero coefficients]
The mean squared error first remains fairly constant and then rises sharply.
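The numbers underlying this plot can also be inspected directly from the cv.glmnet object; the fields lambda and cvm hold the tested λ values and their cross-validated errors (a small sketch):
head(cbind(lambda=cv.ridge$lambda, cv.mse=cv.ridge$cvm))   # first rows of the CV error curve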

Ridge Regression in R
predict(model, newx=x, s=lambda) makes predictions for new data x and a specific λ:
pred.ridge <- predict(cv.ridge, newx=x.test, s="lambda.min")
head(cbind(pred.ridge, y.test))
##                          1   y.test
## -Alan Ashby       390.1766  475.000
## -Andre Dawson    1094.5741  500.000
## -Andre Thornton   798.5886 1100.000
## -Alan Trammell    893.8298  517.143
## -Barry Bonds      518.9105  100.000
## -Bob Dernier      353.4100  708.333
Mean absolute percentage error (MAPE):
mean(abs((y.test - pred.ridge)/y.test))
## [1] 0.6811053
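Other error measures can be computed the same way; for instance, a root mean squared error on the test set, complementing the MAPE above (a sketch, not part of the slides):
rmse.ridge <- sqrt(mean((y.test - pred.ridge)^2))   # RMSE on the held-out observations
rmse.ridge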

Scaling of Estimates
OLS estimation: recall that least squares estimates are scale equivariant, i.e. multiplying x_j by a constant c scales β_j by a factor 1/c.
Ridge regression: in contrast, coefficients in ridge regression can change substantially when scaling a variable x_j, due to the penalty term. It is best to use the following approach:
1 Scale the variables via
\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}}
which divides by the standard deviation of x_j.
2 Estimate the coefficients of ridge regression.
glmnet scales accordingly but returns the coefficients on the original scale.
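By default glmnet performs this standardization internally (standardize=TRUE); a rough manual equivalent is sketched below. Note that scale() divides by the sample standard deviation (denominator n-1) rather than n, a negligible difference here.
x.train.std <- scale(x.train)   # center each column and divide by its (sample) standard deviation
# the internal standardization can also be stated explicitly (this is the default):
lm.ridge.std <- glmnet(x.train, y.train, alpha=0, standardize=TRUE)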

Bias-Variance Trade-Off
Ridge regression benefits from the bias-variance trade-off: as λ increases, the flexibility of the ridge regression fit decreases. This decreases the variance but increases the bias.
[Figure: mean squared error vs. λ; squared bias (black), variance (blue), and error on the test set (red); the dashed line is the minimum possible mean squared error]

Pros and Cons
Advantages
Ridge regression can reduce the variance (at the cost of increased bias); it works best in situations where the OLS estimates have high variance.
Can improve predictive performance.
Works even in situations where p > n, where OLS fails.
Mathematically simple computations.
Disadvantages
Ridge regression is not able to shrink coefficients to exactly zero.
As a result, it cannot perform variable selection.
Alternative: the Least Absolute Shrinkage and Selection Operator (LASSO).

Outline 1 Motivation for Regularization 2 Ridge Regression 3 LASSO 4 Comparison 5 Wrap-Up Regularization: LASSO 22

LASSO
Least Absolute Shrinkage and Selection Operator (LASSO)
Ridge regression always includes all p variables, but the LASSO performs variable selection. The LASSO only changes the shrinkage penalty:
\beta^{LASSO} = \min_\beta \underbrace{\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2}_{\text{RSS}} + \lambda \underbrace{\sum_{j=1}^{p} |\beta_j|}_{\text{shrinkage penalty}}
Here, the LASSO uses the L_1-norm ‖β‖_1 = ∑_j |β_j|.
This penalty allows coefficients to shrink to exactly zero. The LASSO therefore usually results in sparse models, which are easier to interpret.
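Analogous to the ridge sketch earlier, the LASSO objective only swaps the penalty term; again an illustrative function, not part of the slides:
lasso_objective <- function(beta0, beta, X, y, lambda) {
  sum((y - beta0 - X %*% beta)^2) + lambda * sum(abs(beta))   # RSS + L1 penalty
}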

LASSO in R
Implemented in glmnet(x, y, alpha=1) as part of the glmnet package (note the different value for alpha):
lm.lasso <- glmnet(x.train, y.train, alpha=1)
plot(...) shows how λ changes the estimated coefficients:
plot(lm.lasso, xvar="lambda")
[Figure: LASSO coefficient paths; y-axis "Coefficients", x-axis "Log Lambda"; top axis: number of non-zero coefficients]

Parameter Tuning
cv.glmnet(x, y, alpha=1) determines the optimal λ via cross-validation by minimizing the mean squared prediction error:
set.seed(0)
cv.lasso <- cv.glmnet(x.train, y.train, alpha=1)
Optimal λ and corresponding coefficients ("." marks a removed variable):
cv.lasso$lambda.min
## [1] 2.143503
head(coef(cv.lasso, s="lambda.min"))
## 6 x 1 sparse Matrix of class "dgCMatrix"
##                       1
## (Intercept) 189.7212235
## AtBat        -1.9921887
## Hits          6.6124279
## HmRun         0.6674432
## Runs          .
## RBI           .

Parameter Tuning
Total number of variables:
nrow(coef(cv.lasso))
## [1] 20
Omitted variables:
dimnames(coef(cv.lasso, s="lambda.min"))[[1]][which(coef(cv.lasso, s="lambda.min") == 0)]
## [1] "Runs"   "RBI"    "CAtBat" "CHits"
Included variables:
dimnames(coef(cv.lasso, s="lambda.min"))[[1]][which(coef(cv.lasso, s="lambda.min") != 0)]
##  [1] "(Intercept)" "AtBat"       "Hits"        "HmRun"
##  [6] "Years"       "CHmRun"      "CRuns"       "CRBI"
## [11] "LeagueN"     "DivisionW"   "PutOuts"     "Assists"
## [16] "NewLeagueN"
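The same information can be pulled out more compactly as a named vector (an alternative sketch, not from the slides):
b <- as.matrix(coef(cv.lasso, s="lambda.min"))[, 1]   # dense named vector of coefficients
names(b)[b != 0]   # selected variables (including the intercept)
b[b != 0]          # their estimated values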

Parameter Tuning
plot(cv.model) compares the mean squared error across λ:
plot(cv.lasso)
[Figure: cross-validated mean squared error vs. log(lambda)]
The mean squared error first remains fairly constant and then rises sharply. The top axis denotes the number of included model variables.

LASSO in R
predict(model, newx=x, s=lambda) makes predictions for new data x and a specific λ:
pred.lasso <- predict(cv.lasso, newx=x.test, s="lambda.min")
Mean absolute percentage error (MAPE) of the LASSO:
mean(abs((y.test - pred.lasso)/y.test))
## [1] 0.6328225
For comparison, the error of ridge regression:
## [1] 0.6811053

Outline 1 Motivation for Regularization 2 Ridge Regression 3 LASSO 4 Comparison 5 Wrap-Up Regularization: Comparison 29

Problem Formulation
Both ridge regression and the LASSO can be rewritten as constrained optimization problems:
\beta^{ridge} = \min_\beta \underbrace{\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2}_{\text{RSS}} \quad \text{s.t.} \quad \sum_{j=1}^{p} \beta_j^2 \leq \theta
\beta^{LASSO} = \min_\beta \underbrace{\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2}_{\text{RSS}} \quad \text{s.t.} \quad \sum_{j=1}^{p} |\beta_j| \leq \theta
Outlook: both ridge regression and the LASSO have Bayesian formulations.

Variable Selection with LASSO
Comparison of the previous constraints for two dimensions:
Ridge regression: β_1^2 + β_2^2 ≤ θ
LASSO: |β_1| + |β_2| ≤ θ
[Figure: constraint regions in the (β_1, β_2)-plane together with the OLS, ridge, and LASSO estimates]
The objective function RSS is drawn as contours in red, the constraints (blue) in 2 dimensions. The intersection occurs at β_1 = 0 for the LASSO.

Case Study Example
Comparison to the OLS estimator:
lm.ols <- lm(y.train ~ x.train)
# Workaround as predict.lm only accepts a data.frame
pred.ols <- predict(lm.ols, data.frame(x.train=I(x.test)))
mean(abs((y.test - pred.ols)/y.test))
## [1] 0.6352089
Comparison (MAPE):
OLS               0.64
Ridge regression  0.68
LASSO             0.63
Here: the LASSO can outperform OLS with fewer predictors.
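The comparison row can also be assembled programmatically from the predictions already computed above (a small sketch; the helper function mape is introduced here for illustration only):
mape <- function(pred) mean(abs((y.test - pred)/y.test))   # mean absolute percentage error
round(c(OLS=mape(pred.ols), Ridge=mape(pred.ridge), LASSO=mape(pred.lasso)), 2)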

Elastic Net
The elastic net generalizes the ideas of both the LASSO and ridge regression by combining both penalties:
\beta^{ElasticNet} = \min_\beta \text{RSS} + \lambda \Big[ (1-\alpha) \underbrace{\|\beta\|_2^2 / 2}_{L_2\text{-penalty}} + \alpha \underbrace{\|\beta\|_1}_{L_1\text{-penalty}} \Big]
The L_1-penalty helps generate a sparse model, while the L_2-part overcomes a too strict selection. The parameter α also controls the numerical stability; α = 0.5 tends to handle correlated variables as groups.
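In glmnet the mixing parameter is the alpha argument; a single elastic-net fit with an equal mix of the two penalties could look like this (sketch only; the next slide instead sweeps over several α values):
cv.en05 <- cv.glmnet(x.train, y.train, alpha=0.5)   # elastic net with alpha = 0.5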

Elastic Net in R
Example: test the elastic net with a sequence of values for α and report the cross-validated mean squared error on the training data.
set.seed(0)
alpha <- seq(from=0, to=1, by=0.1)
en <- lapply(alpha, function(a) cv.glmnet(x.train, y.train, alpha=a))
en.mse <- unlist(lapply(en, function(i) i$cvm[which(i$lambda == i$lambda.min)]))
plot(alpha, en.mse, ylab="Mean Squared Error", pch=16)
[Figure: cross-validated mean squared error vs. alpha]

Elastic Net in R
Example (continued): report the out-of-sample mean absolute percentage error.
en.mape <- unlist(lapply(en, function(i) {
  pred <- predict(i, newx=x.test, s="lambda.min")
  mean(abs((y.test - pred)/y.test))
}))
plot(alpha, en.mape, ylab="MAPE", pch=16)
[Figure: out-of-sample MAPE vs. alpha]

Outline 1 Motivation for Regularization 2 Ridge Regression 3 LASSO 4 Comparison 5 Wrap-Up Regularization: Wrap-Up 36

Summary
Regularization methods
Regularization methods bring advantages beyond OLS.
Cross-validation chooses the tuning parameter λ.
The LASSO performs variable selection.
Neither ridge regression nor the LASSO dominates the other; cross-validation finds the best approach for a given dataset.
Outlook
In practice, λ is often adjusted by a rule of thumb to obtain better results.
Research has lately developed several variants and improvements.
Spike-and-slab regression can be a viable alternative for inference.

Further Readings
Package glmnet
glmnet tutorial: http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html
glmnet webinar: http://web.stanford.edu/~hastie/TALKS/glmnet_webinar.pdf
See Hastie's website for data and scripts.
Background on methods
Talk on the elastic net: http://web.stanford.edu/~hastie/talks/enet_talk.pdf
Section 6.2 in the book An Introduction to Statistical Learning
Applications
Especially healthcare analytics, but also sports, e.g. Groll, Schauberger & Tutz (2015): Prediction of major international soccer tournaments based on team-specific regularized Poisson regression: An application to the FIFA World Cup 2014. In: Journal of Quantitative Analysis in Sports, 10:2, pp. 97-115.