
Lasso

November 14, 2017

Contents

1 Case Study: Least Absolute Shrinkage and Selection Operator (LASSO)
  1.1 The Lasso Estimator
  1.2 Computation of the Lasso Solution
      1.2.1 Single Predictor: Soft Thresholding
  1.3 l_q Penalties
  1.4 Advantages of the l_1-penalty

1 Case Study: Least Absolute Shrinkage and Selection Operator (LASSO)

There are two reasons why we might consider an alternative to the least-squares estimate.

Prediction accuracy: the least-squares estimate often has low bias but large variance, and prediction accuracy can sometimes be improved by shrinking the values of the regression coefficients. By doing so we introduce some bias but reduce the variance of the predicted values, and hence may improve the overall prediction accuracy.

Purpose of interpretation: with a large number of predictors, we often would like to identify a smaller subset of these predictors that exhibits the strongest effects.

In this section we discuss the various penalty functions p_λ(·) used in the penalized problem

    \arg\min_\beta \{ L(\beta) + p_\lambda(\beta) \}

for some loss function L(β). We mainly use the least-squares loss function throughout our discussion.

1.1 The Lasso Estimator

Definition 1 (The lasso estimator). The lasso estimator, denoted by β̂^lasso, is defined as

    \hat\beta^{lasso} = \arg\min_\beta \left\{ \frac{1}{2n} \sum_{i=1}^n (y_i - \beta_0 - x_i^T \beta)^2 + \lambda \sum_{j=1}^p |\beta_j| \right\}, \qquad \lambda > 0,

or, equivalently,

    \hat\beta^{lasso} = \arg\min_\beta \frac{1}{2n} \sum_{i=1}^n (y_i - \beta_0 - x_i^T \beta)^2 \quad \text{subject to} \quad \sum_{j=1}^p |\beta_j| \le t, \qquad t > 0,

or, equivalently,

    \hat\beta^{lasso} = \arg\min_\beta \left\{ \frac{1}{2n} \| y - \beta_0 \mathbf{1} - X\beta \|_2^2 + \lambda \|\beta\|_1 \right\}, \qquad \lambda > 0,

where y = (y_1, ..., y_n) is the n-vector of responses, X is the n x p matrix with x_i in R^p as its ith row, 1 is the vector of n ones, ||.||_1 is the l_1-norm, and ||.||_2 is the usual Euclidean norm.

Why do we use the l_1 norm? Why not the l_2 norm, or some other l_q norm?

- The lasso yields sparse solution vectors.
- q = 1 is the smallest value of q that yields a convex problem.
- There are theoretical guarantees for the l_1 penalty.

Note: Typically we first standardize the predictors X so that each column is centered, (1/n) \sum_{i=1}^n x_{ij} = 0, and has unit variance, (1/n) \sum_{i=1}^n x_{ij}^2 = 1. Without standardization, the lasso solutions would depend on the units in which the predictors are measured. For convenience we also assume that the outcome values y_i have been centered, meaning that (1/n) \sum_{i=1}^n y_i = 0. These centering conditions are convenient, since they mean that we can omit the intercept term β_0 in the lasso optimization. Given an optimal lasso solution β̂ on the centered data, we can recover the optimal solutions for the uncentered data: β̂ is the same, and the intercept is

    \hat\beta_0 = \bar{y} - \sum_{j=1}^p \bar{x}_j \hat\beta_j,

where ȳ and {x̄_j}_1^p are the means of the original variables. (This is typically only true for linear regression with squared-error loss; it is not true, for example, for lasso logistic regression.) For this reason, we omit the intercept β_0 from the lasso for the remainder of this chapter.
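To make the centering argument concrete, here is a minimal R sketch (not part of the original notes; the simulated data and all variable names below are invented for illustration) that fits the lasso on centered data without an intercept and then recovers the intercept via the formula above.

# A minimal sketch of the centering argument above, on simulated data.
library(glmnet)
set.seed(1)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- 1 + X[, 1] - 2 * X[, 2] + rnorm(n)

xbar <- colMeans(X)
ybar <- mean(y)
Xc   <- sweep(X, 2, xbar)      # center each column of X
yc   <- y - ybar               # center the response

fit  <- glmnet(Xc, yc, intercept = FALSE)
bhat <- as.numeric(coef(fit, s = 0.05))[-1]   # drop the intercept slot (it is 0)

beta0 <- ybar - sum(xbar * bhat)   # recover the intercept for the uncentered data
c(beta0, bhat)

Up to the details of glmnet's internal standardization, this recovered intercept should agree with what glmnet(X, y) would report directly at the same value of lambda.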

Figure 1: The l_1 ball.

Figure 2 shows the results of applying three fitting procedures to the crime data; the lasso bound t was chosen by cross-validation. The left panel corresponds to the full least-squares fit. The middle panel shows the lasso fit. On the right, we have applied least-squares estimation to the subset of three predictors with nonzero coefficients in the lasso (the "relaxed lasso"). The standard errors for the least-squares estimates come from the usual formulas. No such simple formula exists for the lasso, so we have used the bootstrap to obtain estimates of the standard errors in the middle panel. Overall, it appears that funding has a large effect, probably indicating that police resources have been focused on higher-crime areas. The other predictors have small to moderate effects.

Note that the lasso sets two of the five coefficients to zero, and tends to shrink the coefficients of the others toward zero relative to the full least-squares estimate. In turn, the least-squares fit on the subset of three predictors tends to expand the lasso estimates away from zero. The nonzero estimates from the lasso tend to be biased toward zero, so the debiasing in the right panel can often improve the prediction error of the model. This two-stage process is also known as the relaxed lasso (Meinshausen 2007); a small code sketch of the two-stage procedure is given below.

Figure 2: Results from the analysis of the crime data. The left panel shows the least-squares estimates, standard errors, and their ratio (Z-score). The middle and right panels show the corresponding results for the lasso, and for the least-squares estimates applied to the subset of predictors chosen by the lasso.
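The two-stage relaxed lasso described above is easy to carry out by hand with glmnet. The following is a rough sketch on simulated data (the data, dimensions, and choice of lambda are placeholders for illustration; this is not the code used for the crime data).

# A rough sketch of the two-stage ("relaxed") lasso described above.
library(glmnet)
set.seed(2)
n <- 200; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] - x[, 2] + 0.5 * x[, 3] + rnorm(n)

cvfit   <- cv.glmnet(x, y)                        # stage 1: lasso, lambda chosen by cross-validation
b_lasso <- as.numeric(coef(cvfit, s = "lambda.min"))
active  <- which(b_lasso[-1] != 0)                # predictors with nonzero lasso coefficients

relaxed <- lm(y ~ x[, active, drop = FALSE])      # stage 2: unpenalized least squares on that subset
coef(relaxed)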

# glmnet can be obtained and installed directly from CRAN:
# install.packages("glmnet", repos = "http://cran.us.r-project.org")

# load the glmnet package:
library(glmnet)

# The default model used in the package is the Gaussian linear
# model, or "least squares", which we demonstrate in this
# section. We load a data set created beforehand for
# illustration. Users can either load their own data or use
# data saved in the workspace.
getwd()

## [1] "/Users/yiyang/Dropbox/Teaching/MATH680/Topic4/note"

load("bardet.rda")

# The command loads an input matrix x and a response
# vector y from this saved R data archive.
#
# We fit the model using the most basic call to glmnet.
fit = glmnet(x, y)

# "fit" is an object of class glmnet that contains all the # relevant information of the fitted model for further use. # We do not encourage users to extract the components directly. # Instead, various methods are provided for the object such # as plot, print, coef and predict that enable us to execute # those tasks more elegantly. # We can visualize the coefficients by executing the plot function: plot(fit) 0 17 29 42 54 61 66 73 Coefficients 0.10 0.00 0.05 0.10 0.15 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 L1 Norm # Each curve corresponds to a variable. It shows the path of # its coefficient against the l1-norm of the whole # coefficient vector at as lambda varies. The axis above 5

# indicates the number of nonzero coefficients at the
# current lambda, which is the effective degrees of freedom
# (df) for the lasso. Users may also wish to annotate
# the curves; this can be done by setting label = TRUE
# in the plot command.

# A summary of the glmnet path at each step is displayed
# if we just enter the object name or use
# the print function:
print(fit)

##
## Call: glmnet(x = x, y = y)
##
##          Df    %Dev   Lambda
##   [1,]    0 0.00000 0.109400
##   [2,]    1 0.05131 0.104500
##   [3,]    1 0.09807 0.099720
##   [4,]    1 0.14070 0.095190
##   [5,]    1 0.17950 0.090860
##   [6,]    4 0.22260 0.086730
##   [7,]    4 0.26500 0.082790
##   [8,]    4 0.30360 0.079030
##   [9,]    4 0.33880 0.075430
##  [10,]    8 0.37320 0.072010
##  [11,]    8 0.40520 0.068730
##  [12,]    9 0.43450 0.065610
##  [13,]    9 0.46160 0.062630
##  [14,]    9 0.48620 0.059780
##  [15,]   10 0.50880 0.057060
##  [16,]   10 0.52990 0.054470
##  [17,]   10 0.54910 0.051990
##  [18,]   11 0.56670 0.049630
##  [19,]   12 0.58270 0.047380
##  [20,]   13 0.59770 0.045220
##  [21,]   13 0.61270 0.043170
##  [22,]   13 0.62700 0.041200
##  [23,]   15 0.64060 0.039330
##  [24,]   17 0.65340 0.037540
##  [25,]   17 0.66540 0.035840
##  [26,]   18 0.67640 0.034210
##  [27,]   17 0.68640 0.032650
##  [28,]   17 0.69520 0.031170
##  [29,]   19 0.70340 0.029750
##  [30,]   19 0.71070 0.028400

##  [31,]   20 0.71750 0.027110
##  [32,]   21 0.72380 0.025880
##  [33,]   20 0.72960 0.024700
##  [34,]   19 0.73470 0.023580
##  [35,]   18 0.73950 0.022510
##  [36,]   18 0.74370 0.021480
##  [37,]   18 0.74760 0.020510
##  [38,]   18 0.75120 0.019580
##  [39,]   19 0.75450 0.018690
##  [40,]   18 0.75750 0.017840
##  [41,]   18 0.76020 0.017030
##  [42,]   18 0.76280 0.016250
##  [43,]   18 0.76500 0.015510
##  [44,]   19 0.76710 0.014810
##  [45,]   19 0.76910 0.014140
##  [46,]   19 0.77090 0.013490
##  [47,]   19 0.77250 0.012880
##  [48,]   19 0.77400 0.012290
##  [49,]   19 0.77530 0.011740
##  [50,]   19 0.77650 0.011200
##  [51,]   19 0.77770 0.010690
##  [52,]   19 0.77870 0.010210
##  [53,]   19 0.77960 0.009743
##  [54,]   19 0.78040 0.009300
##  [55,]   19 0.78120 0.008877
##  [56,]   20 0.78190 0.008474
##  [57,]   20 0.78260 0.008089
##  [58,]   20 0.78310 0.007721
##  [59,]   20 0.78370 0.007370
##  [60,]   21 0.78470 0.007035
##  [61,]   21 0.78900 0.006715
##  [62,]   23 0.79380 0.006410
##  [63,]   24 0.79900 0.006119
##  [64,]   25 0.80390 0.005841
##  [65,]   24 0.80850 0.005575
##  [66,]   25 0.81250 0.005322
##  [67,]   25 0.81620 0.005080
##  [68,]   27 0.81960 0.004849
##  [69,]   29 0.82370 0.004629
##  [70,]   30 0.82830 0.004418
##  [71,]   31 0.83330 0.004217
##  [72,]   31 0.83780 0.004026
##  [73,]   32 0.84200 0.003843
##  [74,]   34 0.84620 0.003668
##  [75,]   38 0.85130 0.003501

##  [76,]   40 0.85700 0.003342
##  [77,]   40 0.86280 0.003190
##  [78,]   41 0.86820 0.003045
##  [79,]   42 0.87310 0.002907
##  [80,]   46 0.87790 0.002775
##  [81,]   50 0.88260 0.002649
##  [82,]   52 0.88800 0.002528
##  [83,]   55 0.89340 0.002413
##  [84,]   55 0.89920 0.002304
##  [85,]   54 0.90410 0.002199
##  [86,]   54 0.90870 0.002099
##  [87,]   55 0.91290 0.002004
##  [88,]   57 0.91690 0.001913
##  [89,]   59 0.92100 0.001826
##  [90,]   61 0.92540 0.001743
##  [91,]   62 0.92930 0.001663
##  [92,]   62 0.93290 0.001588
##  [93,]   62 0.93620 0.001516
##  [94,]   64 0.93920 0.001447
##  [95,]   64 0.94210 0.001381
##  [96,]   66 0.94480 0.001318
##  [97,]   69 0.94720 0.001258
##  [98,]   71 0.94990 0.001201
##  [99,]   73 0.95280 0.001147
## [100,]   73 0.95530 0.001094

# The output shows, from left to right, the number of nonzero
# coefficients (Df), the fraction of (null) deviance explained
# (%Dev) and the value of lambda (Lambda).
# Although by default glmnet computes the fit for 100 values of
# lambda, the program stops early if %Dev does not
# change sufficiently from one lambda to the next
# (typically near the end of the path).

# We can obtain the actual coefficients at one or more values of
# lambda within the range of the sequence:
coef0 = coef(fit, s = 0.1)

# The function glmnet returns a sequence of models
# for the users to choose from. In many cases, users
# may prefer the software to select one of them.
# Cross-validation is perhaps the simplest and most
# widely used method for that task.
#

# cv.glmnet is the main function to do cross-validation
# here, along with various supporting methods such as
# plotting and prediction. We still act on the sample
# data loaded before.
cvfit = cv.glmnet(x, y)

# cv.glmnet returns a cv.glmnet object, which is "cvfit"
# here, a list with all the ingredients of the
# cross-validation fit. As for glmnet, we do not
# encourage users to extract the components directly,
# except for viewing the selected values of lambda.
# The package provides well-designed functions
# for such tasks.

# We can plot the object.
plot(cvfit)

[Cross-validation plot: mean squared error versus log(lambda), with error bars; the axis above the plot shows the number of nonzero coefficients.]

# The plot includes the cross-validation curve (red dotted line),
# and upper and lower standard deviation curves along the
# lambda sequence (error bars). Two selected values of lambda are
# indicated by the vertical dotted lines (see below).

# We can view the selected values of lambda and the corresponding
# coefficients. For example,
cvfit$lambda.min

## [1] 0.001663435

# lambda.min is the value of lambda that gives minimum
# mean cross-validated error. The other lambda saved is
# lambda.1se, which gives the most regularized model

# such that the error is within one standard error of
# the minimum. To use it, we only need to replace
# lambda.min with lambda.1se above.
coef1 = coef(cvfit, s = "lambda.min")

# Note that the coefficients are represented in
# sparse matrix format. The reason is that the
# solutions along the regularization path are
# often sparse, and hence it is more efficient
# in time and space to use a sparse format.
# If you prefer the non-sparse format,
# pipe the output through as.matrix().

# Predictions can be made based on the fitted
# cv.glmnet object. Let's see a toy example.
predict(cvfit, newx = x[1:5,], s = "lambda.min")

##           1
## V2 8.370210
## V3 8.332486
## V4 8.404824
## V5 8.294271
## V6 8.322188

# newx is the new input matrix and s,
# as before, is the value(s) of lambda at which
# predictions are made.

1.2 Computation of the Lasso Solution

The lasso prefers sparse solutions. To see this, notice that with ridge regression, the prior cost of a sparse solution such as β = (1, 0) is the same as the cost of a dense solution such as β = (1/√2, 1/√2), as long as they have the same l_2 norm:

    \|(1, 0)\|_2 = \|(1/\sqrt{2}, 1/\sqrt{2})\|_2 = 1.

For the lasso, however, setting β = (1, 0) is cheaper than setting β = (1/√2, 1/√2), since

    \|(1, 0)\|_1 = 1 < \|(1/\sqrt{2}, 1/\sqrt{2})\|_1 = \sqrt{2}.

The most rigorous way to see that l_1 regularization results in sparse solutions is to examine the conditions that hold at the optimum.
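A quick numeric check of the norm comparison above (purely illustrative; the two candidate vectors are the ones used in the text):

b_sparse <- c(1, 0)
b_dense  <- c(1, 1) / sqrt(2)
c(l2_sparse = sqrt(sum(b_sparse^2)), l2_dense = sqrt(sum(b_dense^2)))   # both equal 1
c(l1_sparse = sum(abs(b_sparse)),    l1_dense = sum(abs(b_dense)))      # 1 versus sqrt(2)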

1.2.1 Single Predictor: Soft Thresholding

Consider a single-predictor setting, based on samples {(z_i, y_i)}_{i=1}^n, where z_i denotes the single (centered) predictor (for convenience we write z_i in place of x_{i1}). The problem then is to solve

    \hat\beta = \arg\min_\beta \left\{ \frac{1}{2n} \sum_{i=1}^n (y_i - z_i \beta)^2 + \lambda |\beta| \right\}.   (1)

We cannot obtain the optimality condition by simply setting the derivative to zero, since |β| is not differentiable at β = 0. By direct inspection of the objective (1), we find that

    \hat\beta = \begin{cases}
      \frac{1}{n}\langle z, y \rangle - \lambda & \text{if } \frac{1}{n}\langle z, y \rangle > \lambda, \\
      0 & \text{if } \left|\frac{1}{n}\langle z, y \rangle\right| \le \lambda, \\
      \frac{1}{n}\langle z, y \rangle + \lambda & \text{if } \frac{1}{n}\langle z, y \rangle < -\lambda,
    \end{cases}

which can be written as

    \hat\beta = S_\lambda\!\left(\tfrac{1}{n}\langle z, y \rangle\right),

where S_λ(x) = sign(x)(|x| − λ)_+ is the soft-thresholding operator. When the data are standardized so that (1/n) Σ_i z_i^2 = 1, soft thresholding translates the usual least-squares estimate β̂^OLS = ⟨z, y⟩ / ⟨z, z⟩ = (1/n)⟨z, y⟩ toward zero by the amount λ. This is demonstrated in Figure 3.

Figure 3: The soft-thresholding function S_λ(x) = sign(x)(|x| − λ)_+ is shown in blue (broken lines), along with the 45-degree line in black.
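The single-predictor solution above is easy to verify numerically. The following small sketch (the simulated data and the helper name soft_threshold are my own, not part of glmnet or the original notes) compares the closed-form soft-thresholded estimate with a brute-force minimization of the one-dimensional objective (1).

# Numerical check of the soft-thresholding formula for a single standardized predictor.
soft_threshold <- function(x, lambda) sign(x) * pmax(abs(x) - lambda, 0)

set.seed(3)
n <- 500
z <- rnorm(n)
z <- (z - mean(z)) / sqrt(mean((z - mean(z))^2))   # centered, with (1/n) * sum(z^2) = 1
y <- 0.8 * z + rnorm(n)
y <- y - mean(y)                                   # centered response

lambda       <- 0.3
beta_formula <- soft_threshold(mean(z * y), lambda)   # S_lambda((1/n) <z, y>)

# Brute-force minimization of the objective (1).
obj          <- function(b) sum((y - z * b)^2) / (2 * n) + lambda * abs(b)
beta_numeric <- optimize(obj, interval = c(-5, 5))$minimum

c(beta_formula = beta_formula, beta_numeric = beta_numeric)   # the two should agree closely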

1.3 l_q Penalties

For a fixed real number q >= 0, consider the criterion

    \min_\beta \; \frac{1}{2n} \sum_{i=1}^n (y_i - x_i^T \beta)^2 + \lambda \sum_{j=1}^p |\beta_j|^q.   (2)

This is the lasso for q = 1 and ridge regression for q = 2. For q = 0, the term Σ_{j=1}^p |β_j|^0 counts the number of nonzero elements in β, and thus amounts to best-subset selection. Figure 4 displays the constraint regions corresponding to these penalties for the case of two predictors (p = 2).

Figure 4: Constraint regions Σ_{j=1}^p |β_j|^q <= 1 for different values of q. For q < 1, the constraint region is nonconvex.

In the special case of an orthonormal model matrix X, all three procedures have explicit solutions. Each method applies a simple coordinate-wise transformation to the least-squares estimate β̂, as detailed in Table 1. The lasso is special in that q = 1 is the smallest value of q (closest to best subset) that leads to a convex constraint region and hence a convex optimization problem. In this sense, it is the closest convex relaxation of the best-subset selection problem.

Table 1: Estimators of β_j from (2) in the case of an orthonormal model matrix X.
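As a rough illustration of the coordinate-wise transformations summarized in Table 1, the sketch below (my own; the orthonormal design, the threshold value, and the helper names are illustrative assumptions, and the hard threshold t is not tied to lambda in any exact way here) compares hard thresholding (best subset), soft thresholding (lasso), and proportional shrinkage (ridge) applied to the least-squares estimates.

# Coordinate-wise rules in the orthonormal case (illustrative sketch).
set.seed(4)
n <- 100; p <- 4
X <- qr.Q(qr(matrix(rnorm(n * p), n, p))) * sqrt(n)   # columns orthonormal: (1/n) * t(X) %*% X = I
beta_true <- c(2, -1, 0, 0)
y <- drop(X %*% beta_true + rnorm(n))

beta_ols <- drop(crossprod(X, y) / n)   # least-squares estimate, since (1/n) X'X = I

lambda <- 0.5
hard_threshold <- function(b, t)      b * (abs(b) > t)                    # best subset (hard thresholding)
soft_threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)  # lasso (soft thresholding)
ridge_shrink   <- function(b, lambda) b / (1 + lambda)                    # ridge (proportional shrinkage)

rbind(ols    = beta_ols,
      subset = hard_threshold(beta_ols, lambda),
      lasso  = soft_threshold(beta_ols, lambda),
      ridge  = ridge_shrink(beta_ols, lambda))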

1.4 Advantages of the l_1-penalty

Interpretation of the final model: the l_1-penalty provides a natural way to encourage or enforce sparsity and simplicity in the solution.

Statistical efficiency: the bet-on-sparsity principle. Assume that the underlying true signal is sparse, and use an l_1-penalty to try to recover it. If the assumption is correct, we can do a good job of recovering the true signal. But if we are wrong, and the underlying truth is not sparse in the chosen basis, then the l_1-penalty will not work well. In that instance, however, no method can do well relative to the Bayes error. There is now a large body of theoretical support for these loose statements. We can think of this in terms of the amount of information n/p per parameter. If p >> n and the true model is not sparse, i.e. k >> n, then the number of samples n is too small to allow accurate estimation of the parameters. But if the true model is sparse, so that only k < n parameters are actually nonzero in the true underlying model, then it turns out that we can estimate the parameters effectively using the lasso (see the simulation sketch at the end of this section). This may come as somewhat of a surprise, because we can do this even though we are not told which k of the p parameters are actually nonzero. Of course we cannot do as well as we could if we had that information, but it turns out that we can still do reasonably well.

Computational efficiency: l_1-based penalties are convex, and this convexity together with the assumed sparsity can lead to significant computational advantages.
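The bet-on-sparsity discussion above can be illustrated with a small simulation. The sketch below (my own; the dimensions, seed, and coefficient values are invented for illustration) fits the lasso with p > n but only k nonzero true coefficients, and examines how well the estimate at lambda.min recovers the sparse signal.

# Simulation sketch of the bet-on-sparsity idea: p > n, but only k coefficients are truly nonzero.
library(glmnet)
set.seed(5)
n <- 100; p <- 500; k <- 5
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(rep(2, k), rep(0, p - k))
y <- drop(X %*% beta_true + rnorm(n))

cvfit    <- cv.glmnet(X, y)
beta_hat <- as.numeric(coef(cvfit, s = "lambda.min"))[-1]   # drop the intercept

sum(beta_hat != 0)                    # number of selected predictors
which(beta_hat != 0)[1:k]             # typically includes the k true signal indices 1..k
sqrt(sum((beta_hat - beta_true)^2))   # l_2 estimation error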