Chapter 6: Linear Model Selection and Regularization


1 Chapter 6: Linear Model Selection and Regularization As p (the number of predictors) approaches or exceeds n (the sample size), standard linear regression runs into problems: the variance of the estimates becomes large, and when p > n there is no unique least-squares solution. Reducing the number of predictors both improves the statistical properties of the regression estimates and simplifies the model, making it easier to interpret.

2 Main Topics Subset Selection: several approaches reduce the number of predictor variables and then fit an ordinary linear regression. Shrinkage: if we use all p predictors, some methods shrink (also called regularization) the magnitudes of the coefficients. This may entail a small increase in bias in exchange for a large reduction in variance. Dimension reduction: we may create linear combinations of the p predictors, i.e. project them onto a subspace of smaller dimension. Either way, the number of predictors is effectively reduced before ordinary linear regression is applied.

3 Subset Selection Best subset selection examines all 2^p models using the following algorithm. (1) Let M_0 be the null model with no predictors. (2) For k = 1, 2, ..., p: fit all (p choose k) = p!/(k!(p-k)!) models containing exactly k predictors, and pick the best one (M_k) based on the smallest RSS or largest R^2. (3) Select the best among M_0, ..., M_p using cross-validation (test MSE), C_p, AIC, BIC, or adjusted R^2. Using RSS or R^2 is fine at step (2) since all models compared there have the same number of predictors.
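
A minimal sketch of best subset selection in R, using the leaps package and the Credit data from the ISLR package (the data set and variable names are assumptions for illustration, not part of the original notes):

library(ISLR)
library(leaps)
best.fit <- regsubsets(Balance ~ ., data = Credit, nvmax = 11)  # all subsets up to 11 predictors
best.sum <- summary(best.fit)
which.min(best.sum$cp)     # model size minimizing C_p
which.min(best.sum$bic)    # model size minimizing BIC
which.max(best.sum$adjr2)  # model size maximizing adjusted R^2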

4 Subset Selection For logistic regression we can use the deviance in place of RSS. The deviance is -2 times the log-likelihood of the model; the smaller the better. The main drawback is the number of models that must be examined: for p = 20 it is over one million. For least-squares models there are some shortcuts that avoid fitting every possible model, but the search still becomes difficult for large p. Stepwise selection is computationally more efficient.

5 Stepwise Selection: forward selection Forward stepwise selection starts with no predictors and adds them one at a time. (1) Let M_0 be the null model with no predictors. (2) For k = 0, ..., p-1: consider all p-k models that add one predictor to M_k, and choose the best one (M_{k+1}) based on the smallest RSS or largest R^2. (3) Select the single best model among M_0, ..., M_p using cross-validation (test MSE), C_p, AIC, BIC, or adjusted R^2. As before, all the models compared at step (2) have the same number of predictors, so using RSS or R^2 is fine there.

6 Stepwise Selection: forward selection The total number of models fitted is now only 1 + p(p+1)/2, so when p = 20 we fit 211 models, not one million! We are not guaranteed to find the best model: if p = 3, the best single-variable model might use X_1, while the best two-variable model uses X_2 and X_3, which forward selection will miss. Although we can start the forward selection algorithm even when p > n, we can only go up to M_{n-1}.
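
A sketch of forward stepwise selection with the same leaps/Credit assumptions as above; only the method argument changes:

library(leaps)
fwd.fit <- regsubsets(Balance ~ ., data = Credit, nvmax = 11, method = "forward")
summary(fwd.fit)$bic  # compare model sizes on an adjusted criterion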

7 Stepwise Selection: backward selection Algorithm: (1) Let M_p be the full model with all p predictors. (2) For k = p, p-1, ..., 1: fit all k models that contain all but one of the predictors in M_k, and choose the best one (M_{k-1}) based on the smallest RSS or largest R^2. (3) Select the single best model among M_0, ..., M_p using cross-validation (test MSE), C_p, AIC, BIC, or adjusted R^2. The same number of models is fit as with forward selection; however, we must have n > p to fit the full model.

8 Choosing the Optimal Model We know the training MSE is an underestimate of the test MSE. There are two different approaches: (1) adjust the training error to correct for this bias, or (2) directly estimate the test error with a validation set or cross-validation.

9 C_p, AIC, BIC, and Adjusted R^2 Mallow's C_p = (RSS + 2dσ̂²)/n, where d is the number of predictors and σ̂² estimates the error variance. For least squares, AIC = (RSS + 2dσ̂²)/(nσ̂²), so AIC and C_p are proportional to each other. BIC = (RSS + log(n)dσ̂²)/n; since log(n) > 2 whenever n > 7, the BIC penalty is heavier, making BIC more conservative than C_p and AIC. Adjusted R^2 = 1 - [RSS/(n-d-1)] / [TSS/(n-1)]; unlike R^2, the adjusted R^2 will not always increase with d. Except for the adjusted R^2, these measures have a strong theoretical basis.
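
These criteria are easy to compute by hand for a least-squares fit. A small sketch (the function name and the use of the full-model error variance for σ̂² are illustrative assumptions):

model.criteria <- function(fit, sigma2) {
  n   <- length(residuals(fit))
  d   <- length(coef(fit)) - 1                      # predictors, excluding the intercept
  rss <- sum(residuals(fit)^2)
  y   <- model.response(model.frame(fit))
  tss <- sum((y - mean(y))^2)
  c(Cp    = (rss + 2 * d * sigma2) / n,
    BIC   = (rss + log(n) * d * sigma2) / n,
    adjR2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1)))
}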

10 C_p, AIC, BIC, and Adjusted R^2 The best model is at the minimum of C_p and BIC (and AIC) and at the maximum of the adjusted R^2. For the Credit data, BIC indicates an optimum with fewer predictors than C_p.

11 Validation Set and Cross-Validation For the same Credit data, the validation set and cross-validation happen to give the same optimum. James et al. propose the one-standard-error rule: calculate the standard error of the estimated test MSE for each model size, and after identifying the minimum, choose the smallest model whose estimated test MSE is within one standard error of that minimum.

12 Shrinkage Methods: Ridge Regression Ridge regression minimizes RSS + λ Σ_j β_j². Here λ ≥ 0 is called the tuning parameter and λ Σ_j β_j² is called the shrinkage penalty. When λ = 0, the ridge estimates are just the ordinary least squares estimates. As λ grows the penalty grows and the ridge estimates approach 0. For each λ there is a different set of regression coefficients, β̂_λ^R. The penalty does not include the intercept, β_0. James et al. do not discuss this directly, but when p > n ordinary least squares (λ = 0) has no unique solution, whereas adding the ridge penalty with λ > 0 restores a unique minimizer.
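
A sketch of ridge regression in R with the glmnet package (alpha = 0 selects the ridge penalty); the Credit data and the lambda grid are assumptions for illustration:

library(ISLR)
library(glmnet)
x <- model.matrix(Balance ~ ., Credit)[, -1]  # predictor matrix without the intercept column
y <- Credit$Balance
grid <- 10^seq(4, -2, length = 100)           # lambda values from heavy shrinkage to almost none
ridge.fit <- glmnet(x, y, alpha = 0, lambda = grid)
coef(ridge.fit, s = 1)                        # coefficients at lambda = 1; none are exactly zero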

13 Shrinkage Methods: Ridge Regression The l2 norm of β, written ||β||_2, equals sqrt(Σ_j β_j²). The x-axis of the coefficient plot, ||β̂_λ^R||_2 / ||β̂||_2, can be thought of as a measure of the relative amount of shrinkage: the shrinkage decreases as you move to the right, until the ratio reaches 1, which corresponds to no shrinkage.

14 Why does ridge regression work? Using simulated data with n = 50 and p = 45, the test MSE (purple line), squared bias (black), and variance (green) of the ridge estimator are shown as functions of λ. The least-squares estimates have a very large variance, which the ridge estimator reduces substantially. Ridge regression does not eliminate predictors; at best they are assigned very small coefficients.

15 The Lasso The lasso can set some coefficients exactly to 0 and thus effectively performs variable selection. The penalty uses an l1 norm instead of an l2 norm: the lasso coefficients minimize RSS + λ Σ_j |β_j|. As with the ridge estimates, as λ gets larger the coefficients shrink towards 0, but now some may equal exactly 0; thus we say the lasso yields sparse models. By convex duality it can be shown that when p > n there can be at most n non-zero lasso coefficients (see Rosset & Zhu, Piecewise linear regularized solution paths, Ann. Stat. 35).
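
A sketch of the lasso with glmnet (alpha = 1), reusing the x and y defined in the ridge sketch above; cv.glmnet picks lambda by 10-fold cross-validation, and lambda.1se implements the one-standard-error rule mentioned earlier:

library(glmnet)
cv.lasso <- cv.glmnet(x, y, alpha = 1)
coef(cv.lasso, s = "lambda.min")                 # sparse: some coefficients are exactly zero
sum(coef(cv.lasso, s = "lambda.1se") != 0) - 1   # number of predictors kept under the 1-SE rule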

16 The Lasso Credit data. The number of predictors in the final model is a function of λ. In the right-hand figure, as you move to the right, Rating is the first variable to enter the model, followed by Student and Limit.

17 The Lasso An alternative way to write the lasso and ridge problems is as constrained optimizations: lasso: minimize RSS subject to Σ_j |β_j| ≤ s; ridge: minimize RSS subject to Σ_j β_j² ≤ s. For every value of λ there is a corresponding value of s.

18 The Lasso The regions demarcated by s for the lasso (left) and ridge (right) estimators are where the solutions must lie. β̂ is the unconstrained least squares estimate. The ellipses are contours of constant RSS, and the RSS grows as you move away from β̂. The lasso solution often lands on a vertex of its constraint region, which sets one or more coefficients to 0.

19 The Lasso This simulation again has p = 45 and n = 50, but now only 2 of the predictors are related to the response. On the right, the properties of the lasso (solid) and ridge (dashed) estimators are compared.

20 Lasso and Ridge Soft thresholding Consider a simple model with no intercept, n = p, and X equal to the identity matrix. Then the ridge solution is β̂_j = y_j/(1+λ), while the lasso solution is β̂_j = y_j - λ/2 if y_j > λ/2; β̂_j = y_j + λ/2 if y_j < -λ/2; and β̂_j = 0 if |y_j| ≤ λ/2 (soft thresholding).
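
These two solutions are easy to code directly; a sketch of the n = p, X = I special case (the function names are illustrative):

ridge.soln <- function(y, lambda) y / (1 + lambda)                       # uniform shrinkage
lasso.soln <- function(y, lambda) sign(y) * pmax(abs(y) - lambda/2, 0)   # soft thresholding
y <- c(-3, -0.2, 0.1, 2)
rbind(ridge = ridge.soln(y, 1), lasso = lasso.soln(y, 1))  # the lasso zeroes the two small entries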

21 Choosing λ Using leave-one-out cross-validation, ridge regression was applied to the Credit data. The optimal λ is small and yields only a modest reduction in the MSE and in the magnitude of the coefficients; perhaps the original least squares estimates are not that bad here.

22 Choosing λ The lasso applied to the simulated data with p = 45 but only two predictors that affect the outcome. Now the optimal λ results in two non-zero coefficients, and they are the two that affect the outcome.

23 Principal Components Suppose we have a data matrix X of n independent samples and p features, with sample covariance matrix S. Principal components analysis finds p linear combinations of the features that are orthogonal (uncorrelated) with each other and ranked by variance, so the first principal component has the largest variance. If the features differ dramatically in scale it is best to center and scale the raw data, i.e. convert to z-scores. The principal components of the centered and scaled data will differ from those of the unscaled data; there is no simple transformation between them. The first principal component is Y_1 = a_1'X = a_11 X_1 + a_12 X_2 + ... + a_1p X_p, with Var(Y_1) = a_1' S a_1.

24 Principal Components We want to find a_1 such that Var(Y_1) is the largest among all normalized linear combinations satisfying a_1'a_1 = 1. Because of the constraint, finding the maximum is a little more involved but can be done with a Lagrange multiplier. Skipping the details, the Lagrange conditions give p simultaneous equations, (S - l_1 I) a_1 = 0, where l_1 is the Lagrange multiplier. The only way for this system to have a non-trivial solution is if det(S - l_1 I) = 0. This means that l_1 is a characteristic root (eigenvalue) of S and a_1 is its associated characteristic vector (eigenvector).

25 Principal Components If we pre-multiply (S - l_1 I) a_1 = 0 by a_1' we get l_1 = a_1' S a_1 = Var(Y_1). Since the first principal component should have the largest variance, l_1 is the largest of the p eigenvalues. The second principal component satisfies a_2'a_2 = 1 and a_2'a_1 = 0; its variance is the second largest eigenvalue of S, and so on. It is also the case that l_1 + l_2 + ... + l_p = tr(S), the total variance, so dividing the variance of each principal component by tr(S) gives the proportion of the total variance it explains.
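
In R this eigen-decomposition is handled by prcomp; a sketch assuming a numeric data matrix X (a hypothetical name here):

pc <- prcomp(X, center = TRUE, scale. = TRUE)
pc$rotation[, 1]              # a_1, the first eigenvector (loadings)
pc$sdev^2                     # eigenvalues l_1 >= ... >= l_p, the component variances
pc$sdev^2 / sum(pc$sdev^2)    # proportion of the total variance explained by each component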

26 Supervised Principal Components For details see chapter 18 of The Elements of Statistical Learning or Bair et al., JASA 101:119. This technique is designed for the p > n case. We do not want to use all the features, only those that are correlated with the outcome, hence the supervision. The technique was originally designed for survival data, but it can also be used with ordinary regression problems. The method can be run with the superpc package written by Bair and Tibshirani; the package website has a reasonably good tutorial. Note that the superpc.listfeatures command shown there is incorrect or outdated; see the help pages instead.

27 Supervised Principal Components Algorithm 1. Compute the standardized univariate regression coefficient of the outcome on each feature (standardized using v_j, the jth diagonal element of (X^T X)^{-1}). 2. For each value of the threshold θ from a list θ_1 < ... < θ_K: (a) form a reduced data matrix consisting of only those features whose standardized univariate coefficient exceeds θ in absolute value, and compute the first m principal components of this matrix; (b) use these principal components in a regression model to predict the outcome. 3. Pick θ (and m) by cross-validation.
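
A hedged sketch of this workflow with the superpc package (the list layout of the data object, with features in the rows of x, follows the package documentation; the object names x.train and y.train are assumptions):

library(superpc)
train <- list(x = t(x.train), y = y.train, featurenames = colnames(x.train))
fit <- superpc.train(train, type = "regression")   # step 1: univariate coefficients
cv.fit <- superpc.cv(fit, train)                   # step 3: CV over a grid of thresholds
superpc.plotcv(cv.fit)                             # choose the threshold from this plot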

28 Example: Supervised Principal Components We use a simulated pooled genomic allele frequency database. In this database, loci 1-30 have some effect on a phenotype; a second block of loci shows the same level of allele frequency differentiation as loci 1-30 but has NO effect on the phenotype; and the remaining loci show random variation between populations and also do not affect the phenotype. There are 40 populations in total (so n = 40 and p = 2000). Before the analysis we pre-filter loci by testing for allele frequency differences at a false discovery rate of 5%; with this database the pre-filtering reduced p to 43.

29 Example: Supervised Principal Components In this simulated database the allele frequency variation among populations is shown in panel (a); these are the mean allele frequencies. The database shows binomial sampling variation about these means.

30 Example: Supervised Principal Components The data were randomly divided into a training set of 32 populations and a test set of 8 populations. After training, the threshold value 6.49 was chosen from the cross-validation curve below; only the first principal component is shown.

31 Example: Supervised Principal Components We can test the significance of the first three principal components with
> superpc.fit.to.outcome(sim401.train, data.test, sim401.fit$v.pred)
which fits lm(formula = data.test$y ~ ., data = temp.list) on the test set. In the resulting coefficient table the intercept and the first principal-component score are highly significant (p-values on the order of 1e-09 and 1e-07), the second score is significant at the 0.05 level, and the third score is not significant. The overall fit has 4 residual degrees of freedom, with F-statistic 1849 on 3 and 4 DF and p-value 9.736e-07.

32 Example: Supervised Principal Components Results The output lists the importance score and raw score for each feature retained by the supervised principal components fit. The input list included all of these plus features 39, 453, and 1560. Thus supervised principal components was only able to eliminate three loci, and it retained almost all of the non-causative loci. These results were based on only the first principal component.

33 FLAM Applying FLAM to the same artificial database yields the following sparse list (features retained by the 50% criterion, with the frequency, out of 100 runs, in which each was selected):
Feature    Frequency/100
feature1   73
feature2   100
feature4   98
feature6   90
feature7   89
feature8   85
feature11  95
feature12  71
feature14  98
feature16  80
feature17  70
feature21  90
feature22  79
feature26  94
feature27  72
feature28  92
feature30  97
feature32  76

34 Partial Least Squares This technique can be used for dimension reduction like principal components regression. Up to p new directions are created which are linear functions of the original features. Unlike principal components, the new directions are based on both X and y, not just X. In principal components we choose each direction to maximize variance (the first direction has the largest variance, the second the next largest, and so on); partial least squares instead chooses directions that have both high variance and high correlation with the outcome y. Partial least squares software can be found in the pls R package; the detailed algorithm is on page 81 of The Elements of Statistical Learning.
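
A sketch of partial least squares with the pls package, again assuming the Credit data for illustration; validation = "CV" produces the cross-validated prediction error used to pick the number of directions:

library(pls)
pls.fit <- plsr(Balance ~ ., data = Credit, scale = TRUE, validation = "CV")
validationplot(pls.fit, val.type = "MSEP")   # choose the number of PLS directions
summary(pls.fit)                             # variance explained in X and in the outcome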

35 FLAM Fused Lasso Additive Model Assume n observations. For one feature j, let the ordered values be x_j(1), ..., x_j(n). The regression model fits a piecewise-constant function of each feature, E(y_i | x_i) = θ_0 + Σ_j θ_j,i, and the difference matrix D inside the l1 norm penalty ||D θ_j||_1 encourages adjacent (sorted) parameters to be equal, e.g. θ_j,(i+1) - θ_j,(i) = 0. Knots (jumps) will only appear if they reduce the sum of squares.

36 FLAM The full minimization problem also includes a group-lasso style penalty (an l2 norm, not squared) on each θ_j, which encourages discarding whole features. P_j is a permutation matrix which orders the x_j from least to greatest. The tuning parameters λ and α have to be chosen from a grid. Luckily there is a finite value of λ above which the fit is completely sparse, and α ranges from 0 to 1. If the outcomes y vary widely in magnitude, consider a transformation such as log(y); since the test MSE determines the model parameters, very small y values may otherwise have only a minor effect on the final model.

37 FLAM: Example One simulated database with 40 populations; the underlying patterns are shown in panel (c). FLAM is in the flam package. After initial filtering, FLAM was run 100 times on permuted genetic databases and a sparse list of features was formed using the 50% rule. FLAM was then run on just those features after the best α was determined. The final call was
best.flam <- flamCV(sparse.gen, pheno.data, alpha = 0.4, n.fold = 5, seed = 1, method = "BCD")

38 FLAM: Example
> summary(best.flam)
Call: flamCV(x = sparse.gen, y = pheno.data, alpha = 0.4, method = "BCD", n.fold = 5, seed = 1)
FLAM was fit using the tuning parameters lambda and alpha = 0.4. Cross-validation with K = 5 folds was used to choose lambda, which was taken to be the largest value with CV error within one standard error of the minimum CV error. The chosen lambda corresponds to 7 predictors having non-sparse fits; the summary also lists which predictors these are. The CV curve can be displayed with plot(best.flam).

39 FLAM: Example
plot(best.flam$flam.out, best.flam$index.cv)
best.predict <- cbind(pheno.data, best.flam$flam.out$y.hat.mat[best.flam$index.cv, ])
best.predict <- as.data.frame(best.predict)
colnames(best.predict) <- c("observed", "predicted")
library(ggplot2)
ggplot(best.predict, aes(observed, predicted)) + geom_point() + ylab("Predicted Phenotype") + xlab("Observed Phenotype") + geom_abline()
In the plot of the fitted functions, curve 5 = gene 7, 6 = gene 8, 7 = gene 10, 8 = gene 13, and 14 = gene 27.

40 Feature Assessment and Multiple Testing Problem: determine whether there are significant differences in the mean feature value between two groups. If p >> n this involves many hypothesis tests. If the type-I error rate (the chance of rejecting the null hypothesis when it is true) is 5%, then we expect many type-I errors when p is very large. The family-wise error rate (FWER) controls the type-I error over a collection of hypothesis tests: if we do a total of M independent tests each with type-I error rate α, the chance that any of the M tests produces a type-I error is FWER = 1 - (1 - α)^M. If there is positive dependence between the tests, the FWER will be smaller.
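
A two-line illustration of how quickly the FWER grows, and the corresponding Bonferroni correction (alpha and M are arbitrary example values):

alpha <- 0.05; M <- 1000
1 - (1 - alpha)^M   # FWER for M independent tests: essentially 1
alpha / M           # Bonferroni per-test threshold that keeps the FWER at or below alpha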

41 Feature Assessment and Multiple Testing To test each feature j for a significant difference, first calculate a two-sample t-statistic t_j = (x̄_2j - x̄_1j)/se_j, where x̄_lj = Σ_{i in C_l} x_ij / N_l and C_l is the set of indices of group l, with sample size N_l. The standard error is se_j = σ̂_j sqrt(1/N_1 + 1/N_2), with pooled variance σ̂_j² = [Σ_{i in C_1}(x_ij - x̄_1j)² + Σ_{i in C_2}(x_ij - x̄_2j)²]/(N_1 + N_2 - 2). We can approximate the null distribution of t_j with a t-distribution or construct a permutation distribution.

42 Permutation Distribution Here we permute the group labels of the samples many times and recompute the t-statistics for each permutation. In theory we could enumerate all possible permutations: the number of distinct ways to assign labels to group 1 is K = (N_1 + N_2 choose N_1). For permutation k the t-statistic for feature j is t_j^k, and the p-value for feature j is p_j = (1/K) Σ_k I(|t_j^k| ≥ |t_j|). If the features are very similar, the sum can be pooled over all features to give a better averaged null distribution. The Bonferroni method gives a FWER of at most α by simply dividing the individual error rate α by the number of tests; however, this can be overly conservative for large numbers of tests.
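
A sketch of the permutation p-value for a single feature (an illustrative function; t.test is used here in place of the pooled-variance statistic above, for brevity):

perm.pvalue <- function(x, group, K = 1000) {
  t.obs  <- t.test(x ~ group)$statistic                        # observed statistic
  t.perm <- replicate(K, t.test(x ~ sample(group))$statistic)  # statistics under permuted labels
  mean(abs(t.perm) >= abs(t.obs))                              # two-sided permutation p-value
}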

43 False Discovery Rate A second approach is to control the expected fraction of false significance calls. If R is the number of rejected hypotheses and V the number of those that are false rejections, the false discovery rate is FDR = E(V/R).

44 Benjamini and Hochberg Method See Benjamini and Hochberg, 1995, J. Royal Stat. Soc. Series B 57. Algorithm: 1. Fix the false discovery rate at α and let p_(1) ≤ p_(2) ≤ ... ≤ p_(M) denote the ordered p-values. 2. Define L = max{ j : p_(j) < α j/M }. 3. Reject all hypotheses for which p_j ≤ p_(L), the BH rejection threshold.
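
In R the BH procedure is available directly through p.adjust; a sketch, where pvals is a hypothetical vector of the M raw p-values:

alpha <- 0.15
rejected <- which(p.adjust(pvals, method = "BH") <= alpha)  # hypotheses rejected at FDR level alpha
length(rejected)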

45 Benjamini and Hochberg Method For the microarray example, the Bonferroni threshold with α = 0.15 is an order of magnitude smaller than the BH rejection threshold.

46 Plug-in estimate of false discovery rate Algorithm: 1. Create K permutations of the data, producing t-statistics t_j^k for features j = 1, ..., M and permutations k = 1, ..., K. 2. For a range of values of the cut-point C, let R = Σ_j I(|t_j| > C) and EV = (1/K) Σ_k Σ_j I(|t_j^k| > C). 3. Estimate the FDR by FDR-hat = EV/R. For the microarray data, using the cut-point corresponding to the BH threshold gives R_obs = 11 and EV = 1.518, so FDR-hat ≈ 0.14. The plug-in method rejects a greater number of hypotheses while controlling the same error rate, which leads to greater power (Storey, 2002, J. Roy. Stat. Soc. B 64:479).
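
A sketch of the plug-in estimate, assuming t.obs is the vector of M observed statistics and t.perm an M x K matrix of permuted statistics (both hypothetical names):

plugin.fdr <- function(t.obs, t.perm, C) {
  R  <- sum(abs(t.obs) >= C)              # observed number of rejections at cut-point C
  EV <- mean(colSums(abs(t.perm) >= C))   # average number of "rejections" across permutations
  EV / R                                  # plug-in FDR estimate
}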
