
Lasso

November 14, 2017

Contents

1 Case Study: Least Absolute Shrinkage and Selection Operator (LASSO)
  1.1 The Lasso Estimator
  1.2 Computation of the Lasso Solution
      1.2.1 Single Predictor: Soft Thresholding
  1.3 l_q Penalties
  1.4 Advantages of the l_1-penalty

1 Case Study: Least Absolute Shrinkage and Selection Operator (LASSO)

There are two reasons why we might consider an alternative to the least-squares estimate.

Prediction accuracy: the least-squares estimate often has low bias but large variance, and prediction accuracy can sometimes be improved by shrinking the values of the regression coefficients. By doing so we introduce some bias but reduce the variance of the predicted values, and hence may improve the overall prediction accuracy.

Purpose of interpretation: with a large number of predictors, we often would like to identify a smaller subset of these predictors that exhibits the strongest effects.

In this section we discuss the various penalty functions p_λ(·) used in the penalized problem

    \arg\min_\beta \{ L(\beta) + p_\lambda(\beta) \}

for some loss function L(β). We mainly use the least-squares loss function throughout our discussion.

1.1 The Lasso Estimator

Definition 1 (The lasso estimator). The lasso estimator, denoted by β̂^lasso, is defined as

    \hat\beta^{lasso} = \arg\min_\beta \left\{ \frac{1}{2n} \sum_{i=1}^n (y_i - \beta_0 - x_i^T \beta)^2 + \lambda \sum_{j=1}^p |\beta_j| \right\}, \qquad \lambda > 0,

or, equivalently,

    \hat\beta^{lasso} = \arg\min_\beta \frac{1}{2n} \sum_{i=1}^n (y_i - \beta_0 - x_i^T \beta)^2 \quad \text{subject to} \quad \sum_{j=1}^p |\beta_j| \le t, \qquad t > 0,

or, equivalently,

    \hat\beta^{lasso} = \arg\min_\beta \left\{ \frac{1}{2n} \| y - \beta_0 \mathbf{1} - X\beta \|_2^2 + \lambda \|\beta\|_1 \right\}, \qquad \lambda > 0,

where y = (y_1, ..., y_n) is the n-vector of responses, X is the n x p matrix with x_i in R^p as its ith row, 1 is the vector of n ones, ||.||_1 is the l_1-norm, and ||.||_2 is the usual Euclidean norm.

Why do we use the l_1 norm? Why not the l_2 norm, or some other l_q norm?

- The lasso yields sparse solution vectors.
- q = 1 is the smallest value of q that yields a convex problem.
- There are theoretical guarantees for the l_1 penalty.

Note: Typically we first standardize the predictors X so that each column is centered, (1/n) \sum_{i=1}^n x_{ij} = 0, and has unit variance, (1/n) \sum_{i=1}^n x_{ij}^2 = 1. Without standardization, the lasso solutions would depend on the units in which the predictors are measured. For convenience we also assume that the outcome values y_i have been centered, meaning that (1/n) \sum_{i=1}^n y_i = 0. These centering conditions are convenient, since they mean that we can omit the intercept term β_0 in the lasso optimization. Given an optimal lasso solution β̂ on the centered data, we can recover the optimal solutions for the uncentered data: β̂ is the same, and the intercept is

    \hat\beta_0 = \bar{y} - \sum_{j=1}^p \bar{x}_j \hat\beta_j,

where ȳ and {x̄_j}_1^p are the means of the original variables. (This is typically only true for linear regression with squared-error loss; it is not true, for example, for lasso logistic regression.) For this reason, we omit the intercept β_0 from the lasso for the remainder of this chapter.
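To make the centering argument concrete, here is a minimal R sketch (not part of the original notes; the simulated data and all variable names below are invented for illustration) that fits the lasso on centered data without an intercept and then recovers the intercept via the formula above.

# A minimal sketch of the centering argument above, on simulated data.
library(glmnet)
set.seed(1)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- 1 + X[, 1] - 2 * X[, 2] + rnorm(n)

xbar <- colMeans(X)
ybar <- mean(y)
Xc   <- sweep(X, 2, xbar)      # center each column of X
yc   <- y - ybar               # center the response

fit  <- glmnet(Xc, yc, intercept = FALSE)
bhat <- as.numeric(coef(fit, s = 0.05))[-1]   # drop the intercept slot (it is 0)

beta0 <- ybar - sum(xbar * bhat)   # recover the intercept for the uncentered data
c(beta0, bhat)

Up to the details of glmnet's internal standardization, this recovered intercept should agree with what glmnet(X, y) would report directly at the same value of lambda.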

Figure 1: The l_1 ball.

Figure 2 shows the results of applying three fitting procedures to the crime data; the lasso bound t was chosen by cross-validation. The left panel corresponds to the full least-squares fit. The middle panel shows the lasso fit. On the right, we have applied least-squares estimation to the subset of three predictors with nonzero coefficients in the lasso (the "relaxed lasso"). The standard errors for the least-squares estimates come from the usual formulas. No such simple formula exists for the lasso, so we have used the bootstrap to obtain estimates of the standard errors in the middle panel. Overall, it appears that funding has a large effect, probably indicating that police resources have been focused on higher-crime areas. The other predictors have small to moderate effects.

Note that the lasso sets two of the five coefficients to zero, and tends to shrink the coefficients of the others toward zero relative to the full least-squares estimate. In turn, the least-squares fit on the subset of three predictors tends to expand the lasso estimates away from zero. The nonzero estimates from the lasso tend to be biased toward zero, so the debiasing in the right panel can often improve the prediction error of the model. This two-stage process is also known as the relaxed lasso (Meinshausen 2007); a small code sketch of the two-stage procedure is given below.

Figure 2: Results from the analysis of the crime data. The left panel shows the least-squares estimates, standard errors, and their ratio (Z-score). The middle and right panels show the corresponding results for the lasso, and for the least-squares estimates applied to the subset of predictors chosen by the lasso.
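The two-stage relaxed lasso described above is easy to carry out by hand with glmnet. The following is a rough sketch on simulated data (the data, dimensions, and choice of lambda are placeholders for illustration; this is not the code used for the crime data).

# A rough sketch of the two-stage ("relaxed") lasso described above.
library(glmnet)
set.seed(2)
n <- 200; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] - x[, 2] + 0.5 * x[, 3] + rnorm(n)

cvfit   <- cv.glmnet(x, y)                        # stage 1: lasso, lambda chosen by cross-validation
b_lasso <- as.numeric(coef(cvfit, s = "lambda.min"))
active  <- which(b_lasso[-1] != 0)                # predictors with nonzero lasso coefficients

relaxed <- lm(y ~ x[, active, drop = FALSE])      # stage 2: unpenalized least squares on that subset
coef(relaxed)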

# glmnet can be obtained and installed directly from CRAN:
# install.packages("glmnet", repos = "http://cran.us.r-project.org")

# load the glmnet package:
library(glmnet)

# The default model used in the package is the Gaussian linear
# model, or "least squares", which we demonstrate in this
# section. We load a data set created beforehand for
# illustration. Users can either load their own data or use
# data saved in the workspace.
getwd()

## [1] "/Users/yiyang/Dropbox/Teaching/MATH680/Topic4/note"

load("bardet.rda")

# The command loads an input matrix x and a response
# vector y from this saved R data archive.
#
# We fit the model using the most basic call to glmnet.
fit = glmnet(x, y)

# "fit" is an object of class glmnet that contains all the # relevant information of the fitted model for further use. # We do not encourage users to extract the components directly. # Instead, various methods are provided for the object such # as plot, print, coef and predict that enable us to execute # those tasks more elegantly. # We can visualize the coefficients by executing the plot function: plot(fit) 0 17 29 42 54 61 66 73 Coefficients 0.10 0.00 0.05 0.10 0.15 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 L1 Norm # Each curve corresponds to a variable. It shows the path of # its coefficient against the l1-norm of the whole # coefficient vector at as lambda varies. The axis above 5

# indicates the number of nonzero coefficients at the
# current lambda, which is the effective degrees of freedom
# (df) for the lasso. Users may also wish to annotate
# the curves; this can be done by setting label = TRUE
# in the plot command.

# A summary of the glmnet path at each step is displayed
# if we just enter the object name or use
# the print function:
print(fit)

##
## Call: glmnet(x = x, y = y)
##
##          Df    %Dev   Lambda
##   [1,]    0 0.00000 0.109400
##   [2,]    1 0.05131 0.104500
##   [3,]    1 0.09807 0.099720
##   [4,]    1 0.14070 0.095190
##   [5,]    1 0.17950 0.090860
##   [6,]    4 0.22260 0.086730
##   [7,]    4 0.26500 0.082790
##   [8,]    4 0.30360 0.079030
##   [9,]    4 0.33880 0.075430
##  [10,]    8 0.37320 0.072010
##  [11,]    8 0.40520 0.068730
##  [12,]    9 0.43450 0.065610
##  [13,]    9 0.46160 0.062630
##  [14,]    9 0.48620 0.059780
##  [15,]   10 0.50880 0.057060
##  [16,]   10 0.52990 0.054470
##  [17,]   10 0.54910 0.051990
##  [18,]   11 0.56670 0.049630
##  [19,]   12 0.58270 0.047380
##  [20,]   13 0.59770 0.045220
##  [21,]   13 0.61270 0.043170
##  [22,]   13 0.62700 0.041200
##  [23,]   15 0.64060 0.039330
##  [24,]   17 0.65340 0.037540
##  [25,]   17 0.66540 0.035840
##  [26,]   18 0.67640 0.034210
##  [27,]   17 0.68640 0.032650
##  [28,]   17 0.69520 0.031170
##  [29,]   19 0.70340 0.029750
##  [30,]   19 0.71070 0.028400

##  [31,]   20 0.71750 0.027110
##  [32,]   21 0.72380 0.025880
##  [33,]   20 0.72960 0.024700
##  [34,]   19 0.73470 0.023580
##  [35,]   18 0.73950 0.022510
##  [36,]   18 0.74370 0.021480
##  [37,]   18 0.74760 0.020510
##  [38,]   18 0.75120 0.019580
##  [39,]   19 0.75450 0.018690
##  [40,]   18 0.75750 0.017840
##  [41,]   18 0.76020 0.017030
##  [42,]   18 0.76280 0.016250
##  [43,]   18 0.76500 0.015510
##  [44,]   19 0.76710 0.014810
##  [45,]   19 0.76910 0.014140
##  [46,]   19 0.77090 0.013490
##  [47,]   19 0.77250 0.012880
##  [48,]   19 0.77400 0.012290
##  [49,]   19 0.77530 0.011740
##  [50,]   19 0.77650 0.011200
##  [51,]   19 0.77770 0.010690
##  [52,]   19 0.77870 0.010210
##  [53,]   19 0.77960 0.009743
##  [54,]   19 0.78040 0.009300
##  [55,]   19 0.78120 0.008877
##  [56,]   20 0.78190 0.008474
##  [57,]   20 0.78260 0.008089
##  [58,]   20 0.78310 0.007721
##  [59,]   20 0.78370 0.007370
##  [60,]   21 0.78470 0.007035
##  [61,]   21 0.78900 0.006715
##  [62,]   23 0.79380 0.006410
##  [63,]   24 0.79900 0.006119
##  [64,]   25 0.80390 0.005841
##  [65,]   24 0.80850 0.005575
##  [66,]   25 0.81250 0.005322
##  [67,]   25 0.81620 0.005080
##  [68,]   27 0.81960 0.004849
##  [69,]   29 0.82370 0.004629
##  [70,]   30 0.82830 0.004418
##  [71,]   31 0.83330 0.004217
##  [72,]   31 0.83780 0.004026
##  [73,]   32 0.84200 0.003843
##  [74,]   34 0.84620 0.003668
##  [75,]   38 0.85130 0.003501

##  [76,]   40 0.85700 0.003342
##  [77,]   40 0.86280 0.003190
##  [78,]   41 0.86820 0.003045
##  [79,]   42 0.87310 0.002907
##  [80,]   46 0.87790 0.002775
##  [81,]   50 0.88260 0.002649
##  [82,]   52 0.88800 0.002528
##  [83,]   55 0.89340 0.002413
##  [84,]   55 0.89920 0.002304
##  [85,]   54 0.90410 0.002199
##  [86,]   54 0.90870 0.002099
##  [87,]   55 0.91290 0.002004
##  [88,]   57 0.91690 0.001913
##  [89,]   59 0.92100 0.001826
##  [90,]   61 0.92540 0.001743
##  [91,]   62 0.92930 0.001663
##  [92,]   62 0.93290 0.001588
##  [93,]   62 0.93620 0.001516
##  [94,]   64 0.93920 0.001447
##  [95,]   64 0.94210 0.001381
##  [96,]   66 0.94480 0.001318
##  [97,]   69 0.94720 0.001258
##  [98,]   71 0.94990 0.001201
##  [99,]   73 0.95280 0.001147
## [100,]   73 0.95530 0.001094

# The output shows, from left to right, the number of nonzero
# coefficients (Df), the fraction of (null) deviance explained
# (%Dev) and the value of lambda (Lambda).
# Although by default glmnet computes the fit for 100 values of
# lambda, the program stops early if %Dev does not
# change sufficiently from one lambda to the next
# (typically near the end of the path).

# We can obtain the actual coefficients at one or more values of
# lambda within the range of the sequence:
coef0 = coef(fit, s = 0.1)

# The function glmnet returns a sequence of models
# for the users to choose from. In many cases, users
# may prefer the software to select one of them.
# Cross-validation is perhaps the simplest and most
# widely used method for that task.
#

# cv.glmnet is the main function to do cross-validation
# here, along with various supporting methods such as
# plotting and prediction. We still act on the sample
# data loaded before.
cvfit = cv.glmnet(x, y)

# cv.glmnet returns a cv.glmnet object, which is "cvfit"
# here, a list with all the ingredients of the
# cross-validation fit. As for glmnet, we do not
# encourage users to extract the components directly,
# except for viewing the selected values of lambda.
# The package provides well-designed functions
# for such tasks.

# We can plot the object.
plot(cvfit)

[Cross-validation plot: mean squared error versus log(lambda), with error bars; the axis above the plot shows the number of nonzero coefficients.]

# The plot includes the cross-validation curve (red dotted line),
# and upper and lower standard deviation curves along the
# lambda sequence (error bars). Two selected values of lambda are
# indicated by the vertical dotted lines (see below).

# We can view the selected values of lambda and the corresponding
# coefficients. For example,
cvfit$lambda.min

## [1] 0.001663435

# lambda.min is the value of lambda that gives minimum
# mean cross-validated error. The other lambda saved is
# lambda.1se, which gives the most regularized model

# such that the error is within one standard error of
# the minimum. To use it, we only need to replace
# lambda.min with lambda.1se above.
coef1 = coef(cvfit, s = "lambda.min")

# Note that the coefficients are represented in
# sparse matrix format. The reason is that the
# solutions along the regularization path are
# often sparse, and hence it is more efficient
# in time and space to use a sparse format.
# If you prefer the non-sparse format,
# pipe the output through as.matrix().

# Predictions can be made based on the fitted
# cv.glmnet object. Let's see a toy example.
predict(cvfit, newx = x[1:5,], s = "lambda.min")

##           1
## V2 8.370210
## V3 8.332486
## V4 8.404824
## V5 8.294271
## V6 8.322188

# newx is the new input matrix and s,
# as before, is the value(s) of lambda at which
# predictions are made.

1.2 Computation of the Lasso Solution

The lasso prefers sparse solutions. To see this, notice that with ridge regression, the prior cost of a sparse solution such as β = (1, 0) is the same as the cost of a dense solution such as β = (1/√2, 1/√2), as long as they have the same l_2 norm:

    \|(1, 0)\|_2 = \|(1/\sqrt{2}, 1/\sqrt{2})\|_2 = 1.

For the lasso, however, setting β = (1, 0) is cheaper than setting β = (1/√2, 1/√2), since

    \|(1, 0)\|_1 = 1 < \|(1/\sqrt{2}, 1/\sqrt{2})\|_1 = \sqrt{2}.

The most rigorous way to see that l_1 regularization results in sparse solutions is to examine the conditions that hold at the optimum.
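A quick numeric check of the norm comparison above (purely illustrative; the two candidate vectors are the ones used in the text):

b_sparse <- c(1, 0)
b_dense  <- c(1, 1) / sqrt(2)
c(l2_sparse = sqrt(sum(b_sparse^2)), l2_dense = sqrt(sum(b_dense^2)))   # both equal 1
c(l1_sparse = sum(abs(b_sparse)),    l1_dense = sum(abs(b_dense)))      # 1 versus sqrt(2)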

1.2.1 Single Predictor: Soft Thresholding

Consider a single-predictor setting, based on samples {(z_i, y_i)}_{i=1}^n, where z_i denotes the single (centered) predictor (for convenience we write z_i in place of x_{i1}). The problem then is to solve

    \hat\beta = \arg\min_\beta \left\{ \frac{1}{2n} \sum_{i=1}^n (y_i - z_i \beta)^2 + \lambda |\beta| \right\}.   (1)

We cannot obtain the optimality condition by simply setting the derivative to zero, since |β| is not differentiable at β = 0. By direct inspection of the objective (1), we find that

    \hat\beta = \begin{cases}
      \frac{1}{n}\langle z, y \rangle - \lambda & \text{if } \frac{1}{n}\langle z, y \rangle > \lambda, \\
      0 & \text{if } \left|\frac{1}{n}\langle z, y \rangle\right| \le \lambda, \\
      \frac{1}{n}\langle z, y \rangle + \lambda & \text{if } \frac{1}{n}\langle z, y \rangle < -\lambda,
    \end{cases}

which can be written as

    \hat\beta = S_\lambda\!\left(\tfrac{1}{n}\langle z, y \rangle\right),

where S_λ(x) = sign(x)(|x| − λ)_+ is the soft-thresholding operator. When the data are standardized so that (1/n) Σ_i z_i^2 = 1, soft thresholding translates the usual least-squares estimate β̂^OLS = ⟨z, y⟩ / ⟨z, z⟩ = (1/n)⟨z, y⟩ toward zero by the amount λ. This is demonstrated in Figure 3.

Figure 3: The soft-thresholding function S_λ(x) = sign(x)(|x| − λ)_+ is shown in blue (broken lines), along with the 45-degree line in black.
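The single-predictor solution above is easy to verify numerically. The following small sketch (the simulated data and the helper name soft_threshold are my own, not part of glmnet or the original notes) compares the closed-form soft-thresholded estimate with a brute-force minimization of the one-dimensional objective (1).

# Numerical check of the soft-thresholding formula for a single standardized predictor.
soft_threshold <- function(x, lambda) sign(x) * pmax(abs(x) - lambda, 0)

set.seed(3)
n <- 500
z <- rnorm(n)
z <- (z - mean(z)) / sqrt(mean((z - mean(z))^2))   # centered, with (1/n) * sum(z^2) = 1
y <- 0.8 * z + rnorm(n)
y <- y - mean(y)                                   # centered response

lambda       <- 0.3
beta_formula <- soft_threshold(mean(z * y), lambda)   # S_lambda((1/n) <z, y>)

# Brute-force minimization of the objective (1).
obj          <- function(b) sum((y - z * b)^2) / (2 * n) + lambda * abs(b)
beta_numeric <- optimize(obj, interval = c(-5, 5))$minimum

c(beta_formula = beta_formula, beta_numeric = beta_numeric)   # the two should agree closely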

1.3 l_q Penalties

For a fixed real number q >= 0, consider the criterion

    \min_\beta \; \frac{1}{2n} \sum_{i=1}^n (y_i - x_i^T \beta)^2 + \lambda \sum_{j=1}^p |\beta_j|^q.   (2)

This is the lasso for q = 1 and ridge regression for q = 2. For q = 0, the term Σ_{j=1}^p |β_j|^0 counts the number of nonzero elements in β, and thus amounts to best-subset selection. Figure 4 displays the constraint regions corresponding to these penalties for the case of two predictors (p = 2).

Figure 4: Constraint regions Σ_{j=1}^p |β_j|^q <= 1 for different values of q. For q < 1, the constraint region is nonconvex.

In the special case of an orthonormal model matrix X, all three procedures have explicit solutions. Each method applies a simple coordinate-wise transformation to the least-squares estimate β̂, as detailed in Table 1. The lasso is special in that q = 1 is the smallest value of q (closest to best subset) that leads to a convex constraint region and hence a convex optimization problem. In this sense, it is the closest convex relaxation of the best-subset selection problem.

Table 1: Estimators of β_j from (2) in the case of an orthonormal model matrix X.
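As a rough illustration of the coordinate-wise transformations summarized in Table 1, the sketch below (my own; the orthonormal design, the threshold value, and the helper names are illustrative assumptions, and the hard threshold t is not tied to lambda in any exact way here) compares hard thresholding (best subset), soft thresholding (lasso), and proportional shrinkage (ridge) applied to the least-squares estimates.

# Coordinate-wise rules in the orthonormal case (illustrative sketch).
set.seed(4)
n <- 100; p <- 4
X <- qr.Q(qr(matrix(rnorm(n * p), n, p))) * sqrt(n)   # columns orthonormal: (1/n) * t(X) %*% X = I
beta_true <- c(2, -1, 0, 0)
y <- drop(X %*% beta_true + rnorm(n))

beta_ols <- drop(crossprod(X, y) / n)   # least-squares estimate, since (1/n) X'X = I

lambda <- 0.5
hard_threshold <- function(b, t)      b * (abs(b) > t)                    # best subset (hard thresholding)
soft_threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)  # lasso (soft thresholding)
ridge_shrink   <- function(b, lambda) b / (1 + lambda)                    # ridge (proportional shrinkage)

rbind(ols    = beta_ols,
      subset = hard_threshold(beta_ols, lambda),
      lasso  = soft_threshold(beta_ols, lambda),
      ridge  = ridge_shrink(beta_ols, lambda))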

1.4 Advantages of the l_1-penalty

Interpretation of the final model: the l_1-penalty provides a natural way to encourage or enforce sparsity and simplicity in the solution.

Statistical efficiency: the bet-on-sparsity principle. Assume that the underlying true signal is sparse, and use an l_1-penalty to try to recover it. If the assumption is correct, we can do a good job of recovering the true signal. But if we are wrong, and the underlying truth is not sparse in the chosen basis, then the l_1-penalty will not work well. In that instance, however, no method can do well relative to the Bayes error. There is now a large body of theoretical support for these loose statements. We can think of this in terms of the amount of information n/p per parameter. If p >> n and the true model is not sparse, i.e. k >> n, then the number of samples n is too small to allow accurate estimation of the parameters. But if the true model is sparse, so that only k < n parameters are actually nonzero in the true underlying model, then it turns out that we can estimate the parameters effectively using the lasso (see the simulation sketch at the end of this section). This may come as somewhat of a surprise, because we can do this even though we are not told which k of the p parameters are actually nonzero. Of course we cannot do as well as we could if we had that information, but it turns out that we can still do reasonably well.

Computational efficiency: l_1-based penalties are convex, and this convexity together with the assumed sparsity can lead to significant computational advantages.
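The bet-on-sparsity discussion above can be illustrated with a small simulation. The sketch below (my own; the dimensions, seed, and coefficient values are invented for illustration) fits the lasso with p > n but only k nonzero true coefficients, and examines how well the estimate at lambda.min recovers the sparse signal.

# Simulation sketch of the bet-on-sparsity idea: p > n, but only k coefficients are truly nonzero.
library(glmnet)
set.seed(5)
n <- 100; p <- 500; k <- 5
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(rep(2, k), rep(0, p - k))
y <- drop(X %*% beta_true + rnorm(n))

cvfit    <- cv.glmnet(X, y)
beta_hat <- as.numeric(coef(cvfit, s = "lambda.min"))[-1]   # drop the intercept

sum(beta_hat != 0)                    # number of selected predictors
which(beta_hat != 0)[1:k]             # typically includes the k true signal indices 1..k
sqrt(sum((beta_hat - beta_true)^2))   # l_2 estimation error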