Stat 4510/7510 Homework 6

Garry Flowers

Instructions: Please list your name and student number clearly. In order to receive credit for a problem, your solution must show sufficient detail so that the grader can determine how you obtained your answer.

1. The file CommunityCrime.csv is a dataset containing 319 observations on 123 variables. The observations are communities within the United States. The data combine socio-economic data from the 1990 US Census, law enforcement data from the 1990 US LEMAS survey, and crime data from the 1995 FBI Uniform Crime Reporting program. A detailed description of all variables is available at

We seek to predict the variable ViolentCrimesPerPop, the total number of violent crimes per 100,000 people. Note: when asked to perform cross-validation to select a tuning parameter, be sure to conduct this cross-validation on the training data only, then see how well your cross-validated tuning parameter does on the test data.

(a). Set a seed of 1 and split the data into a 90% training set and a 10% test set.

    crime=read.csv("CommunityCrime.csv")
    set.seed(1)
    # sample 90% of the row indices, without replacement, for training
    train.obs=sample(1:nrow(crime),.9*nrow(crime),replace=FALSE)
    crime.train = crime[train.obs,]
    crime.test = crime[-train.obs,]
    VC.test = crime.test$ViolentCrimesPerPop

(b). Fit a linear model using least squares on the training set. Report the test error obtained.

    crime.lm=lm(ViolentCrimesPerPop~., data=crime.train)
    lm.pred=predict.lm(crime.lm,crime.test)
    # test mean squared prediction error (MSPE)
    mean((VC.test-lm.pred)^2)
    ## [1]

The MSPE for the linear model fitted with least squares is

(c). Fit a ridge regression model on the training set with λ chosen by cross-validation. Report the test error obtained.

    library(glmnet)
    ## Loading required package: Matrix
    ## Loading required package: foreach
    ## Loaded glmnet
    X <- model.matrix(ViolentCrimesPerPop~., data=crime.train)
    # alpha=0 selects the ridge penalty
    cv.out=cv.glmnet(X,y=crime.train$ViolentCrimesPerPop,alpha=0)
    plot(cv.out)

[Figure: cross-validated Mean Squared Error versus log(Lambda), ridge regression]

    cv.out$lambda.min
    ## [1]
    ridge.mod=glmnet(x = X,y=crime.train$ViolentCrimesPerPop,alpha=0, lambda = cv.out$lambda.min)
    X.new <- model.matrix(ViolentCrimesPerPop~., data=crime.test)
    ridge.pred=predict(ridge.mod,newx=X.new,s=cv.out$lambda.min)
    mean((VC.test-ridge.pred)^2)
    ## [1]

The MSPE for the linear model fitted with ridge regression is
The optimal λ value was 0.31.

(d). Fit a lasso model on the training set with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

    X <- model.matrix(ViolentCrimesPerPop~., data=crime.train)
    # alpha=1 selects the lasso penalty
    cv.out=cv.glmnet(X,y=crime.train$ViolentCrimesPerPop,alpha=1)
    plot(cv.out)

[Figure: cross-validated Mean Squared Error versus log(Lambda), lasso]

    cv.out$lambda.min
    ## [1]
    lasso.mod=glmnet(x = X,y=crime.train$ViolentCrimesPerPop,alpha=1, lambda = cv.out$lambda.min)
    X.new <- model.matrix(ViolentCrimesPerPop~., data=crime.test)
    lasso.pred=predict(lasso.mod,newx=X.new,s=cv.out$lambda.min)
    mean((VC.test-lasso.pred)^2)
    ## [1]

The MSPE for the linear model fitted with lasso is
There are 16 non-zero coefficient estimates. The optimal λ value is

(e). Fit a PCR model on the training set with M chosen by cross-validation. Report the test error obtained along with the value of M selected by cross-validation.

    library(pls)
    ##
    ## Attaching package: 'pls'
    ## The following object is masked from 'package:stats':
    ##
    ##     loadings
    pcr.mod=pcr(ViolentCrimesPerPop~., data=crime.train,scale=FALSE, validation="CV")
    validationplot(pcr.mod,val.type="MSEP",xlim=c(0,40),ylim=c(.025,.03))

[Figure: validation plot for ViolentCrimesPerPop, MSEP versus number of components, PCR]

    # drop the intercept column of the model matrix before predicting
    pcr.pred=predict(pcr.mod,newdata=X.new[,-1],ncomp=16)
    mean((VC.test-pcr.pred)^2)
    ## [1]

The value of M that minimizes cross-validation error is 16. The MSPE for the linear model fitted with PCR is

(f). Fit a PLS model on the training set with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

    library(pls)
    pls.mod=plsr(ViolentCrimesPerPop~., data=crime.train, scale=TRUE, validation="CV")
    validationplot(pls.mod,val.type="MSEP",ylim=c(0.02,.04),xlim=c(0,10))

[Figure: validation plot for ViolentCrimesPerPop, MSEP versus number of components, PLS]

    pls.pred=predict(pls.mod,X.new[,-1],ncomp=3)
    mean((VC.test-pls.pred)^2)
    ## [1]

The lowest cross-validation error occurs when only M = 3 partial least squares directions are used. The MSPE is again
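To make the choice of M in parts (e) and (f) more concrete, the following is a minimal base-R sketch of K-fold cross-validation for PCR. It uses simulated data rather than the crime data, and it is not how the pls package implements validation="CV" internally; the names fold, cv.err, cv.msep and M.best are illustrative only.

```r
# Hand-rolled K-fold cross-validation for PCR (base R only, simulated data).
set.seed(1)
n <- 100
p <- 10
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("X", 1:p)
y <- drop(1 + X[, 1:3] %*% c(1, 2, 3) + rnorm(n))

K <- 5
fold <- sample(rep(1:K, length.out = n))  # random fold label for each row
cv.err <- matrix(NA, K, p)                # rows: folds, columns: M = 1..p

for (k in 1:K) {
  tr <- fold != k
  # principal components are computed on the training fold only
  pc <- prcomp(X[tr, ], center = TRUE, scale. = FALSE)
  Z.tr <- pc$x                                            # training scores
  Z.te <- predict(pc, newdata = X[!tr, , drop = FALSE])   # held-out scores
  for (M in 1:p) {
    # least squares of y on the first M component scores (plus intercept)
    fit <- lm.fit(cbind(1, Z.tr[, 1:M, drop = FALSE]), y[tr])
    pred <- cbind(1, Z.te[, 1:M, drop = FALSE]) %*% fit$coefficients
    cv.err[k, M] <- mean((y[!tr] - pred)^2)
  }
}

cv.msep <- colMeans(cv.err)   # cross-validated MSEP for each M
M.best <- which.min(cv.msep)  # M with the lowest cross-validated error
```

Plotting cv.msep against 1:p gives a curve analogous to the validation plots above; M.best plays the role of the value of M read off those plots.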

(g). Comment on the above parts and how well you believe we can predict violent crime rate using these methods.

Each of the four approaches above outperforms the original least squares fitted linear model in terms of out-of-sample prediction. This is because the least squares method suffers from over-fitting due to the large number of possible predictor variables. Overall, we can predict violent crime rate only marginally well: the root mean squared prediction error is about 0.18 (sqrt(0.032)), which is still somewhat large for data that are constrained to lie between 0 and 1.

2. Run the following R code:

    set.seed(1)
    n<-80
    p<-50
    X<-matrix(0,nrow = n,ncol=p)
    for(j in 1:p) {
      X[,j]<-runif(n = n,min = 0,max = 1)
    }
    beta0<-2
    betas<-rep(0,p)
    betas[1:3]<-c(1,2,3)
    betas<-matrix(betas,ncol=1)
    Y<-beta0 + X %*% betas + rnorm(n,0,1)

(a). Briefly describe what this code is doing. What are n and p? Which X_j are related to Y? What are the true values of the β_j?

The code generates a matrix of predictor variables, X, of dimension n × p, with entries drawn from a Uniform(0, 1) distribution. Then, setting the intercept term to 2, the first three coefficients to 1, 2, and 3, and the remaining p − 3 coefficients to 0, it generates the response variable from a normal distribution with mean β_0 + Xβ and variance 1. Here n = 80 is the number of observations and p = 50 is the number of possible covariates. Only X_1, X_2, and X_3 are related to Y.

(b). Run a linear model with the lm() function and examine the coefficients. Do the estimates match your description in (a)?

    sim.lm=lm(Y~X)

No. The intercept is estimated at 1.19, and the first three coefficient estimates are 2.98, 2.37, and
They are all significantly different from zero but not very close to the true values. Also, five other covariates are found significant at the α = 0.10 level.
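The check in part (b) can be made systematic by pulling the coefficient table out of the fitted model and counting how many noise covariates fall below the chosen significance level. The sketch below regenerates the simulated data in vectorised form (an assumption made for brevity; filling the matrix column by column draws the uniforms in the same order as the loop above) and uses illustrative names est, noise.p and n.false. No specific output is claimed here.

```r
# Compare least squares estimates with the known truth in the simulation.
set.seed(1)
n <- 80
p <- 50
X <- matrix(runif(n * p, min = 0, max = 1), nrow = n, ncol = p)
betas <- c(1, 2, 3, rep(0, p - 3))        # true slopes: 3 signal, 47 noise
Y <- drop(2 + X %*% betas + rnorm(n, 0, 1))

sim.lm <- lm(Y ~ X)
est <- coef(summary(sim.lm))  # columns: Estimate, Std. Error, t value, Pr(>|t|)

signal <- est[2:4, ]                       # rows for X1, X2, X3
noise.p <- est[5:(p + 1), "Pr(>|t|)"]      # p-values of the 47 noise covariates
n.false <- sum(noise.p < 0.10)             # noise covariates flagged at alpha = 0.10

print(round(signal, 2))
print(n.false)
```

With 47 noise covariates each tested at α = 0.10, roughly 47 × 0.10 ≈ 4.7 false positives are expected by chance alone, so a handful of spuriously significant covariates is not surprising.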

(c). Conduct lasso with a grid of λ ranging from 10^-2 to 1 and construct a traceplot of coefficient value vs λ for all 50 variables. Color the first 3 X variables (which are related to Y) red and all others black. Describe your findings.

    lasso.mod<-glmnet(X,Y,alpha=1,lambda = 10^seq(from=-2,to=.1,length=100))
    # rows 2-4 of coef() are X1-X3 (row 1 is the intercept); col=2 is red
    plot(y=coef(lasso.mod)[2,],x=lasso.mod$lambda,type="l",ylim=c(-1,3),col=2)
    lines(y=coef(lasso.mod)[3,],x=lasso.mod$lambda,type="l",col=2)
    lines(y=coef(lasso.mod)[4,],x=lasso.mod$lambda,type="l",col=2)
    # remaining (noise) covariates in black
    for(j in 5:(p+1)) {
      lines(y=coef(lasso.mod)[j,],x=lasso.mod$lambda,type="l",col=1)
    }

[Figure: lasso coefficient traces; y-axis coef(lasso.mod)[2, ], x-axis lasso.mod$lambda]

The lasso model has correctly given much larger coefficient estimates to the three signal predictors than to the noise predictors in the model. For values of λ > 0.3, the coefficients of the noise predictors are all shrunk to 0.
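The exact zeros seen for the noise predictors follow from the form of the penalty. As a hedged illustration, the first display below is the objective that glmnet minimises for a Gaussian response, and the second gives the closed-form solutions in an idealised case that is an assumption here and does not hold exactly for the simulated X: centred predictors with (1/n) X^T X = I.

```latex
% glmnet's Gaussian objective; alpha = 1 is the lasso, alpha = 0 is ridge
\[
\min_{\beta_0,\beta}\;
\frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i-\beta_0-x_i^{\top}\beta\bigr)^2
+\lambda\Bigl[\tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2
+\alpha\,\lVert\beta\rVert_1\Bigr]
\]
% Idealised case (assumption): centred predictors with (1/n) X^T X = I
\[
\hat\beta_j^{\text{lasso}}
=\operatorname{sign}\bigl(\hat\beta_j^{\text{OLS}}\bigr)\,
\bigl(\lvert\hat\beta_j^{\text{OLS}}\rvert-\lambda\bigr)_{+},
\qquad
\hat\beta_j^{\text{ridge}}=\frac{\hat\beta_j^{\text{OLS}}}{1+\lambda}
\]
```

In this idealised setting the lasso sets a coefficient exactly to zero once λ exceeds the size of its least squares estimate, whereas ridge only rescales every coefficient and never reaches zero for finite λ. Small noise estimates therefore vanish early along the lasso path, while the larger signal estimates survive.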

(d). Repeat (c) with ridge regression and report the differences. You may need to change the range for λ to get an informative plot.

    ridge.mod<-glmnet(X,Y,alpha=0,lambda = 10^seq(from=-2,to=.1,length=100))
    plot(y=coef(ridge.mod)[2,],x=ridge.mod$lambda,type="l",ylim=c(-1,3),col=2)
    lines(y=coef(ridge.mod)[3,],x=ridge.mod$lambda,type="l",col=2)
    lines(y=coef(ridge.mod)[4,],x=ridge.mod$lambda,type="l",col=2)
    for(j in 5:(p+1)) {
      lines(y=coef(ridge.mod)[j,],x=ridge.mod$lambda,type="l",col=1)
    }

[Figure: ridge coefficient traces; y-axis coef(ridge.mod)[2, ], x-axis ridge.mod$lambda]

As with the lasso, ridge regression assigns the largest estimates to the first three coefficients, and the remaining noise covariates have much smaller coefficient estimates. Unlike the lasso, however, ridge regression shrinks the noise coefficients toward zero without setting any of them exactly to zero.

(e). Conduct a lasso again but use cross-validation to select the best λ. Report this value as well as the results of the model selection it performs. Did it find all 3 of our truly significant variables significant? Any false positives?

    cv.out=cv.glmnet(X,Y,alpha=1)
    cv.out$lambda.min
    ## [1]
    lasso.mod=glmnet(X,Y,alpha=1,lambda = cv.out$lambda.min)
    lasso.coef=predict(lasso.mod,type="coefficients",s=cv.out$lambda.min)

The chosen value of λ is
The intercept term and the first three coefficients were found significant. There were 14 false positive covariates.
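Counting true and false positives by eye from a 51-row coefficient vector is error-prone, so the tally in part (e) can be done programmatically. The sketch below assumes the glmnet package is installed and regenerates the simulated data in vectorised form; the names b, selected, true.pos, false.pos and b.1se are illustrative. Because cv.glmnet assigns folds at random, the selected λ and the counts depend on the random seed, and no particular output is claimed.

```r
library(glmnet)  # assumption: the glmnet package is installed

set.seed(1)
n <- 80
p <- 50
X <- matrix(runif(n * p, min = 0, max = 1), nrow = n, ncol = p)
betas <- c(1, 2, 3, rep(0, p - 3))
Y <- drop(2 + X %*% betas + rnorm(n, 0, 1))

cv.out <- cv.glmnet(X, Y, alpha = 1)  # lasso with cross-validated lambda

# coefficients at lambda.min, with the intercept (first row) dropped
b <- as.matrix(coef(cv.out, s = "lambda.min"))[-1, 1]
selected <- which(b != 0)             # indices of non-zero slopes

true.pos <- sum(selected <= 3)        # signal covariates X1-X3 retained
false.pos <- sum(selected > 3)        # noise covariates retained

# the more conservative one-standard-error choice of lambda
b.1se <- as.matrix(coef(cv.out, s = "lambda.1se"))[-1, 1]

print(c(true.pos = true.pos, false.pos = false.pos))
print(sum(b.1se != 0))
```

The value lambda.min targets prediction error rather than variable selection, which is one reason it tends to admit noise covariates; lambda.1se is a larger penalty that cv.glmnet also reports and that typically yields a sparser model.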
