HW 10 STAT 672, Summer 2018

1) (0 points) Do parts (a), (b), (c), and (e) of Exercise 2 on p. 298 of ISL.

2) (0 points) Do Exercise 3 on p. 298 of ISL.

3) For this problem, try to use the 64-bit version of R if possible. Otherwise, state clearly on your paper that you used the 32-bit version. You can merely submit the things that I specifically request in the various parts, or you can submit some of your work in addition to the answers, but if you do that, be sure to highlight the specific things I request in yellow. (Note: Just submitting the bare minimum won't allow you to earn much partial credit for incorrect answers. If you're unsure about something, it may be better to provide some of your R code. (Or better yet, ask me about any troublesome parts of the assignment.))

Attach the Auto data set from the ISLR library. With this exercise, we're going to first use some of the one-predictor regression methods from Sections 7.1 through 7.6 in an attempt to explain miles per gallon using horsepower. But an examination of the scatter plot created using plot(mpg~horsepower) shows that we'll have appreciable heteroscedasticity if we use mpg as the response variable. (The variation in mpg generally increases as horsepower decreases.) So instead we'll use the inverse of the square root of mpg as the response variable, with horsepower as the sole predictor variable for the first portion of this assignment. (Note: I decided on this transformation of mpg by just trying a few things. By making a plot, you can see that not only does it make a constant error term variance assumption much more plausible, but it also creates a closer-to-linear relationship between the response and the predictor.)

Later in the assignment we'll also incorporate some additional predictor variables (displacement and weight), so it'll be best to go ahead and include them as the training and test data are created. Furthermore, let's also load the kknn, glmnet, boot, splines, gam, and tree libraries, since eventually they'll be needed. (Note: You might have to first install the gam and tree libraries if you've never used them, but you don't have to install splines before using it, since it's part of the base installation.)

Now let's create training and test sets of our response and predictors as follows (being sure to set the seed of R's random number generator to 123 right before you create the train vector):

library(ISLR)   # supplies the Auto data set
attach(Auto)
library(kknn)
library(glmnet)
library(boot)
library(splines)
library(gam)
library(tree)
y=1/sqrt(mpg)
set.seed(123)   # set the seed right before creating the train vector, as noted above
train=sample(392,292,replace=FALSE)
train.dat=data.frame(cbind(y[train],displacement[train],horsepower[train],weight[train]))
test.dat=data.frame(cbind(y[-train],displacement[-train],horsepower[-train],weight[-train]))
names(train.dat)=c("y","disp","hp","wt")
names(test.dat)=c("y","disp","hp","wt")

Note that I've made the variable names y, disp, hp, and wt in order to make it easier to type in the various models we want to consider. In order to check things, enter

dim(train.dat)
head(train.dat)
dim(test.dat)
head(test.dat)

You should see that the dimension of train.dat is 292 by 4, and that the first 3 values of hp are 107, 60, and 105. You should also see that the dimension of test.dat is 100 by 4, and that the first 3 values of hp are 165, 150, and 140.
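(Aside: a minimal sketch, not part of the assignment, of the kind of plot mentioned in the note above. It puts the raw and transformed responses side by side, so you can see both the heteroscedasticity and the straightening effect of the transformation. It assumes the Auto data set is attached, as above.)

par(mfrow=c(1,2))               # two plots side by side
plot(horsepower, mpg)           # raw response: spread grows as horsepower decreases
plot(horsepower, 1/sqrt(mpg))   # transformed response: more nearly linear, steadier spread
par(mfrow=c(1,1))               # reset the plotting layout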

Now use the training data to fit a fourth-degree polynomial model having y as the response and hp as the predictor. Although there are a variety of ways that this can be done, please use

poly4=lm(y~poly(hp,4,raw=T), data=train.dat)
summary(poly4)

(since using the above with raw=F gives us a version that's not explained well in the text (nor the videos), and it's an unnecessary complication that we don't need to bother with).

(a) (1 point) What p-value results from the t test associated with the 4th-degree term in the model? (Round to the nearest thousandth (which may be indicating a bit too much accuracy, but the 3 digits will help me make sure that you've done everything correctly up to this point). You should get a large p-value, indicating that a 4th-order polynomial fit may not be necessary.)

Now fit a third-order polynomial model using:

poly3=lm(y~poly(hp,3,raw=T), data=train.dat)
summary(poly3)

(Note that R^2 did not decrease. You should see that the 3rd-order term has a small p-value, indicating that simplifying to a 2nd-order fit may not be good.)

Now let's check to see if our test set MSPE estimates indicate that the 3rd-order model is really superior to the 4th-order model. We can compute the estimated test MSPE for the 4th-order model as follows:

pred.test=predict(poly4, newdata=test.dat)
mean((pred.test-test.dat$y)^2)   # average squared prediction error on the test set

You should get a value of about .

(b) (1 point) Now give the estimated MSPE (based on the test data) for the 3rd-order model. (Report the value by rounding to 5 significant digits (so through the 8th digit after the decimal) so that I can confirm that you've done things correctly. You should see that while it's only a very tiny bit smaller than the estimate obtained from the 4th-order model, the simpler polynomial model did predict better.)

Now let's make a plot showing the 3rd-order polynomial fit, along with some standard error bands. This can be done by doing something similar to what is shown on the middle portion of p. 289 of the text, but I'll make a plot that is a bit less fancy, as follows:

hp.lims=range(train.dat$hp)
hp.grid=seq(from=hp.lims[1], to=hp.lims[2])
preds=predict(poly3,newdata=list(hp=hp.grid),se=TRUE)
se.bands=cbind(preds$fit+2*preds$se.fit, preds$fit-2*preds$se.fit)
plot(train.dat$hp,train.dat$y,xlab="hp",ylab="y")   # draw the scatter plot first, so lines() has a plot to add to
lines(hp.grid,preds$fit,lwd=2,col="blue")
matlines(hp.grid,se.bands,lwd=1,col="blue",lty=3)

(c) (1 point) Use R to produce such a plot and submit a hard copy of it. (You don't have to use any color if you don't have a good way to print color plots, but don't submit a hand-drawn plot! (These guidelines apply to all of the other plots requested in this assignment.))

Now let's fit some spline models. Since our polynomial fits indicate that the cubic polynomial is better than the 4th-degree polynomial, it may be that we don't need a lot of knots to get a good fit, so let's just use one knot, located at 175. (Most of the curvature occurs upwards of 150, so one might be tempted to move the knot even higher. But since there's not a lot of data with values of hp greater than 175, it may be better not to move it any higher.) We can produce such a cubic spline fit, and plot it, as follows:

cubspl=lm(y~bs(hp,knots=c(175)),data=train.dat)
cs.pred=predict(cubspl,newdata=list(hp=hp.grid))
lines(hp.grid,cs.pred,lwd=2,col="blue")

The fitted curve doesn't look much different from the one for part (c), except that it turns down more sharply at the extreme right. (Note: Based on what's in the top shaded box on p. 293 of ISL, one might think that I should use cs.pred$fit instead of just cs.pred in the last line above, but since I didn't include se=T (like the text did) when I used predict(), the way I did it is appropriate.)

(d) (1 point) Now give the estimated MSPE, based on the test data, for the cubic spline model. (Report the value by rounding to 5 significant digits (so through the 8th place after the decimal).)
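(Aside: since the same "predict on the test set, then average the squared errors" computation recurs in parts (b), (d), (e), (i), and (j), a small helper function can save typing. This is just a sketch of one way to do it, not something the assignment requires.)

test.mspe=function(fit) {
  # estimated test MSPE: mean squared prediction error on the held-out data
  pred=predict(fit, newdata=test.dat)
  mean((pred - test.dat$y)^2)
}
test.mspe(poly3)    # e.g., the quantity requested in part (b)
test.mspe(cubspl)   # e.g., the quantity requested in part (d)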

For a natural spline, we can use a total of 5 knots and have the same number of parameters. But let's try using just 4 knots: two close to the ends of the range of hp values, at 70 and 210, and two more in the region where the curvature starts to become more pronounced, at 170 and 190. We can fit the spline and view the fit as follows:

nspl=lm(y~ns(hp,knots=c(70,170,190,210)),data=train.dat)
ns.pred=predict(nspl,newdata=list(hp=hp.grid))
lines(hp.grid,ns.pred,lwd=2,col="blue")

(e) (1 point) Now give the estimated MSPE, based on the test data, for the natural spline model. (Again, round to 5 significant digits. (It can be noted that even though the estimated function perhaps seems to change slope a bit too much, this fit gives us the smallest estimated test MSPE so far.))

So now let's go to a smoothing spline fit, where we don't have to specify knot locations. First we'll let cross-validation select a value for the smoothing parameter and determine the corresponding effective degrees of freedom, and then we'll plot the fit, as follows:

sspl=smooth.spline(train.dat$hp,train.dat$y,cv=TRUE)
sspl$df
lines(sspl,lwd=2,col="blue")

(f) (1 point) Use R to produce such a plot and submit a hard copy of it.

This curve looks very different from the one we got using the natural spline with knots at 70, 170, 190, and 210. Unfortunately, we cannot estimate the MSPE in the usual way, since the predict() function seems to work differently on the sspl object that was produced. Unlike the cases when we applied the predict() function to the polynomial, cubic spline, and natural spline objects, both

pred.test=predict(sspl, newdata=test.dat)

and

pred.test=predict(sspl, newdata=list(hp=test.dat$hp))

only produce 83 different values, instead of the 100 values we need to compare to the 100 y values in the test set. (83 is the number of unique values in train.dat$hp.) If one looks at the output of table(test.dat$hp), it can be seen that there are only 50 different values of hp in the test set. However, to estimate the MSPE using the test sample, we can do the following:

pred.test=predict(sspl, x=test.dat$hp)
mean((pred.test$y-test.dat$y)^2)

(Note: I don't know why the syntax is different for the smoothing splines than it is for the other methods.) You should get a value of about (which is the worst performance we've gotten so far... maybe because the smoothing spline fit doesn't curve down as much on the extreme right).

Now let's use the local regression function, loess(), to estimate the mean response, and make predictions:

locreg=loess(y~hp,span=.5,data=train.dat,degree=1)
lo.pred=predict(locreg,data.frame(hp=hp.grid))
lines(hp.grid,lo.pred,lwd=2,col="blue")
pred.test=predict(locreg,data.frame(hp=test.dat$hp))

(Note: I first tried using span=.2, but it produced a very wiggly fit!) Use the R code above to do the two parts below.

(g) (1 point) Use R to produce a plot of the loess fit and submit a hard copy of it.

(h) (1 point) Now give the estimated MSPE, based on the test data, that comes from using the loess fit (resulting from degree=1 and span=.5) to make predictions. (As before, round to 5 significant digits. (Oddly, the wiggly fit based on span=.2 (that I tried, but didn't ask you to try) led to a smaller estimated MSPE.))
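(Aside: the note about span=.2 can be checked directly. The assignment only asks for degree=1 with span=.5, so treat this loop as a sketch for exploring the span choice, not as required work.)

# Compare estimated test MSPEs across a few span values for degree-1 loess fits.
for (sp in c(.2,.5,.8)) {
  fit=loess(y~hp,span=sp,data=train.dat,degree=1)
  pred=predict(fit,data.frame(hp=test.dat$hp))
  # na.rm=TRUE guards against NA predictions at hp values outside the training range
  cat("span =",sp," estimated MSPE =",mean((pred-test.dat$y)^2,na.rm=TRUE),"\n")
}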

Just for fun, let's try the same thing, except that this time we'll use a 2nd-order fit for the local regressions:

locreg2=loess(y~hp,span=.5,data=train.dat,degree=2)
lo2.pred=predict(locreg2,data.frame(hp=hp.grid))
lines(hp.grid,lo2.pred,lwd=2,col="red")
pred.test=predict(locreg2,data.frame(hp=test.dat$hp))
mean((pred.test-test.dat$y)^2)   # average squared prediction error on the test set

You should get a value of about , which is the smallest estimated MSPE we've obtained so far. (Note: Using degree=2 to produce local 2nd-order fits is the default for loess().)

Now use the training data to fit a basic multiple regression model with y as the response, and using disp, hp, and wt as predictors. This can be done using

fit1=lm(y~., data=train.dat)
summary(fit1)
mean((predict(fit1,test.dat)-test.dat$y)^2)

(Note: We get a higher value for R^2 from this multiple regression fit than we did from the polynomial fits just using hp, but it can also be noted that the test MSPE is larger here than what we have for the 2nd-order loess fit based on just the single predictor hp.)

An examination of a residual plot suggests a pretty good fit; however, if you look at the scatter plot produced by

plot(train.dat$disp,fit1$res)

you can see that perhaps we need more than just a linear term for disp. As a first attempt at improvement, let's simply add a quadratic term for disp. If you do this, creating the object fit2 (one possible call is sketched at the end of this section), and look at summary(fit2), you can see that disp went from being marginally significant in our initial model to now being highly significant, along with its associated quadratic term.

(i) (1 point) What is the test sample estimate of the test MSPE for the model containing 1st-order terms for hp and wt, and 1st-order and 2nd-order terms for disp? Please round to 5 significant digits. (It can be noted that the value is the smallest of all such MSPE values obtained so far with this data.)

Now let's fit a GAM. As a first attempt, let's use smoothing spline representations for all three predictors. (This way we don't have to make decisions about knot placement.) If we use the rule of thumb that suggests that you can have 1 df for every 15 observations, we get that we can afford to use 19 df in all. Taking out 1 for the intercept, that leaves 18. So, to be a bit conservative, we'll use 5 for hp, 5 for wt, and 6 for disp (since it appears to be the predictor needing the largest adjustment for nonlinearity). Then we'll look at the plots we can produce and make a new assessment of the situation. So, enter the following:

fit3=gam(y~s(disp,6)+s(hp,5)+s(wt,5), data=train.dat)
par(mfrow=c(1,3))
plot(fit3, se=TRUE, col="blue")

One can see that the hp and wt contributions are at most just a little nonlinear, but that the disp contribution is very nonlinear. So let's cut down on the flexibility allowed for hp and wt, by changing the df for each one, and keep disp as is:

fit4=gam(y~s(disp,6)+s(hp,4)+s(wt,3), data=train.dat)

Now let's use the test sample to estimate the MSPE for our last GAM model, to see if our guesses have been good ones:

mean((predict(fit4,test.dat)-test.dat$y)^2)

(j) (1 point) What is the test sample estimate of the test MSPE for the last GAM model (the one having the lower df for hp and wt)? (Round to 5 significant digits (and note that this is the smallest MSPE value so far).)

We could try several more GAM models, possibly including some interaction terms, but let's move on. (Note: I tried a full 3rd-order linear model, having 19 df, fit with OLS, and got an estimated MSPE of . So clearly our GAM did a better job than a more traditional approach. Somewhat oddly, the 3rd-order model made worse predictions than the 1st-order linear model, even though F tests indicated that both 2nd-order and 3rd-order terms were needed.)
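(For reference, here is a sketch of one way to create the fit2 object mentioned before part (i). The assignment leaves the exact call to you, so treat the form below as an assumption rather than the required one.)

fit2=lm(y~hp+wt+disp+I(disp^2), data=train.dat)   # 1st-order terms for hp and wt, 1st- and 2nd-order terms for disp
summary(fit2)
mean((predict(fit2,test.dat)-test.dat$y)^2)       # the test sample MSPE estimate requested in part (i)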

Now let's try nearest neighbors regression, first using just hp as a predictor, modifying what's on p. of the class notes as follows:

train.out=train.kknn(y~hp,data=train.dat,kmax=40,distance=2,kernel="epanechnikov")
par(mfrow=c(1,1))
plot(1:40, train.out$MEAN.SQU)

We can obtain the best value to use for K, and get the estimated MSPE (based on cross-validation) corresponding to this value of K, as follows:

which.min(train.out$MEAN.SQU)
train.out$MEAN.SQU[which.min(train.out$MEAN.SQU)]

Then we can modify what's on p. of the class notes in order to make predictions on the test set data, and see what the test set estimate of the MSPE is:

kknn.out=kknn(y~hp,train=train.dat,test=test.dat,k=11,distance=2,kernel="epanechnikov")
mean((test.dat$y - kknn.out$fitted.values)^2)

Although we can see that the estimated MSPE from cross-validation is very close to the test set estimate of the MSPE, it should be noted that all of the other methods considered previously predicted better.

(k) (3 points) Now do a multiple regression version: using hp, disp, and wt as predictors, but keeping everything else as above, use cross-validation to determine the best value for K, and then use this value of K, along with the training data, to make predictions for the test set. Then use these predicted values to estimate the MSPE for predicting future values of the response variable, and give this estimated MSPE, along with the value of K that was determined. Be sure to use set.seed(123) before using train.kknn for this part!

Now let's grow and examine a regression tree using the tree() function:

fit5=tree(y~., data=train.dat)
summary(fit5)
plot(fit5)
text(fit5,pretty=0)

If you enlarge the plot to be full screen, you can see that it's somewhat interesting: first splitting on disp, then splitting both branches formed on hp, then splitting 3 of the 4 branches formed on wt, and then there are no further splits. (With so much symmetry in the tree, it doesn't suggest the presence of strong interactions.) Let's compute an estimate of the MSPE:

mean((predict(fit5,test.dat)-test.dat$y)^2)

Not horrible, considering that regression trees are generally not so competitive, and this one wasn't fine-tuned. (Note that our tree can only produce 7 different predicted values (one for each terminal node of the tree).) So now let's see if using cross-validation to select a right-sized tree will lead to an improvement:

fit6=cv.tree(fit5)
plot(fit6$size,fit6$dev,type="b")

The plot indicates that the 6-node tree is slightly better than the 7-node tree. (Enlarge the plot to get a better look. You can also examine the contents of fit6$dev.) So now let's use the 6-node tree to get predictions, and estimate the test MSPE:

pruned.tree=prune.tree(fit5, best=6)
mean((predict(pruned.tree,test.dat)-test.dat$y)^2)

(It can be noted that the test sample estimate of the test MSPE was slightly lower for the original 7-node tree (and so maybe cross-validation misled us).)

(l) (2 points) What is the test sample estimate of the test MSPE for the pruned tree selected by cross-validation? (Round to 5 significant digits.)
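(One last aside: the note above says the unpruned tree can produce only 7 distinct predicted values, one per terminal node. A one-line sketch, not part of the assignment, makes that concrete.)

table(predict(fit5,test.dat))   # the test set predictions take at most 7 distinct values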
