HW 10 STAT 472, Spring 2018

1) (0 points) Do parts (a), (b), (c), and (e) of Exercise 2 on p. 298 of ISL.

2) (0 points) Do Exercise 3 on p. 298 of ISL.

3) For this problem, you can merely submit the things that I specifically request in the various parts, or you can submit some of your work in addition to the answers, but if you do that, be sure to highlight the specific things I request in yellow. (Note: Just submitting the bare minimum won't allow you to earn much partial credit for incorrect answers. If you're unsure about something, it may be better to provide some of your R code. (Or better yet, ask me about any troublesome parts of the assignment.))

Attach the Auto data set from the ISLR library. With this exercise, we're going to first use some of the one-predictor regression methods from Sections 7.1 through 7.6 in an attempt to explain miles per gallon using horsepower. But an examination of the scatter plot created using plot(mpg~horsepower) shows that we'll have appreciable heteroscedasticity if we use mpg as the response variable. (The variation in mpg generally increases as horsepower decreases.) So instead we'll use the inverse of the square root of mpg as the response variable, with horsepower as the sole predictor variable for the first portion of this assignment. (Note: I decided on this transformation of mpg by just trying a few things. By making a plot, you can see that not only does it make a constant error term variance assumption much more plausible, but it also creates a closer-to-linear relationship between the response and the predictor.) However, later in the assignment we'll also incorporate some additional predictor variables (displacement and weight), and so it'll be best to go ahead and include them as the training and test data are created. Furthermore, let's also load the glmnet, boot, splines, gam, and tree libraries, since eventually they'll be needed. (Note: You might have to first install the gam and tree libraries if you've never used them, but you don't have to install splines before using it since it's part of the base installation.)

Now let's create training and test sets of our response and predictors as follows (being sure to set the seed of R's random number generator to 123 right before you create the train vector):

library(ISLR)   # supplies the Auto data set
attach(Auto)    # attach Auto so that mpg, displacement, horsepower, and weight can be used directly
library(glmnet)
library(boot)
library(splines)
library(gam)
library(tree)
y=1/sqrt(mpg)
set.seed(123)
train=sample(392,292,replace=FALSE)
train.dat=data.frame(cbind(y[train],displacement[train],horsepower[train],weight[train]))
test.dat=data.frame(cbind(y[-train],displacement[-train],horsepower[-train],weight[-train]))
names(train.dat)=c("y","disp","hp","wt")
names(test.dat)=c("y","disp","hp","wt")

Note that I've made the variable names y, disp, hp, and wt in order to make it easier to type in the various models we want to consider. In order to check things, enter

dim(train.dat)
head(train.dat)
dim(test.dat)
head(test.dat)

You should see that the dimension of train.dat is 292 by 4, and that the first 3 values of hp are 107, 60, and 105. You should also see that the dimension of test.dat is 100 by 4, and that the first 3 values of hp are 165, 150, and 140.
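To see why the transformed response is preferable, it may help to look at the two scatter plots side by side. The following is just an illustrative sketch, assuming the Auto data set has been attached as in the code above; the plotting layout is my own choice and not part of the assignment.

par(mfrow=c(1,2))
plot(horsepower, mpg)           # the spread of mpg is larger at small horsepower values
plot(horsepower, 1/sqrt(mpg))   # the spread is more nearly constant, and the trend is closer to linear
par(mfrow=c(1,1))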

Now use the training data to fit a fourth-degree polynomial model having y as the response and hp as the predictor. Although there are a variety of ways that this can be done, please use

poly4=lm(y~poly(hp,4,raw=T), data=train.dat)
summary(poly4)

(since using the above with raw=F gives us a version that's not explained well in the text (nor the videos), and it's an unnecessary complication that we don't need to bother with).

(a) (1 point) What p-value results from the t test associated with the 4th-degree term in the model? (Round to the nearest thousandth (which may be indicating a bit too much accuracy, but the 3 digits will help me make sure that you've done everything correctly up to this point). You should get a large p-value, indicating that a 4th-order polynomial fit may not be necessary.)

Now fit a third-order polynomial model using:

poly3=lm(y~poly(hp,3,raw=T), data=train.dat)
summary(poly3)

(Note that R^2 did not decrease. You should see that the 3rd-order term has a small p-value, indicating that simplifying to a 2nd-order fit may not be good.)

Now let's check to see if our test set MSPE estimates indicate that the 3rd-order model is really superior to the 4th-order model. We can compute the estimated test MSPE for the 4th-order model as follows:

pred.test=predict(poly4, newdata=test.dat)
mean((pred.test-test.dat$y)^2)

You should get a value of about 0.00025412.

(b) (1 point) Now give the estimated MSPE (based on the test data) for the 3rd-order model. (Report the value by rounding to 5 significant digits (so through the 8th digit after the decimal) so that I can confirm that you've done things correctly. You should see that while it's only a very tiny bit smaller than the estimate obtained from the 4th-order model, the simpler polynomial model did predict better.)

Now let's make a plot showing the 3rd-order polynomial fit, along with some standard error bands. This can be done by doing something similar to what is shown on the middle portion of p. 289 of the text, but I'll make a plot that is a bit less fancy, as follows:

hp.lims=range(train.dat$hp)
hp.grid=seq(from=hp.lims[1], to=hp.lims[2])
preds=predict(poly3, newdata=list(hp=hp.grid), se=TRUE)
se.bands=cbind(preds$fit+2*preds$se.fit, preds$fit-2*preds$se.fit)
plot(train.dat$hp, train.dat$y)   # plot the training data first, so that lines() and matlines() have something to add to
lines(hp.grid, preds$fit, lwd=2, col="blue")
matlines(hp.grid, se.bands, lwd=1, col="blue", lty=3)

(c) (1 point) Use R to produce such a plot and submit a hard copy of it. (You don't have to use any color if you don't have a good way to print color plots, but don't submit a hand-drawn plot! (These guidelines apply to all of the other plots requested in this assignment.))

Now let's fit some spline models. Since our polynomial fits indicate that the cubic polynomial is better than the 4th-degree polynomial, it may be that we don't need a lot of knots to get a good fit, and so let's just use one knot, located at 175. (Most of the curvature occurs upwards of 150, so one might be tempted to move the knot even higher. But since there's not a lot of data with values of hp greater than 175, it may be better not to move it any higher.) We can produce such a cubic spline fit, and plot it, as follows:

cubspl=lm(y~bs(hp,knots=c(175)), data=train.dat)
cs.pred=predict(cubspl, newdata=list(hp=hp.grid))
lines(hp.grid, cs.pred, lwd=2, col="blue")

The fitted curve doesn't look much different from the one for part (c), except that it turns down more sharply at the extreme right. (Note: Based on what's in the top shaded box on p. 293 of ISL, one might think that I should use cs.pred$fit instead of just cs.pred in the last line above, but since I didn't include se=T (like the text did) when I used predict(), the way I did it is appropriate.)

(d) (1 point) Now give the estimated MSPE, based on the test data, for the cubic spline model. (Report the value by rounding to 5 significant digits (so through the 8th place after the decimal).)
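Parts (b), (d), and (e), as well as several later parts, all call for the same kind of test-sample MSPE calculation, so it may be convenient to wrap the pattern used above for poly4 in a small helper function. This is just a sketch (the name mspe is mine, not something used in the assignment), and it only applies to fits whose predict() method returns an ordinary numeric vector (so not, as noted later, to a smooth.spline fit).

mspe=function(fit, newdat) {
  # estimated mean squared prediction error of a fitted model on the supplied data set
  mean((predict(fit, newdata=newdat) - newdat$y)^2)
}
mspe(poly4, test.dat)   # should reproduce the value of about 0.00025412 given above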

For a natural spline, we can use a total of 5 knots and have the same number of parameters. But let's try using just 4 knots: two close to the ends of the range of hp values, at 70 and 210, and two more in the region where the curvature starts to become more pronounced, at 170 and 190. We can fit the spline and view the fit as follows:

nspl=lm(y~ns(hp,knots=c(70,170,190,210)), data=train.dat)
ns.pred=predict(nspl, newdata=list(hp=hp.grid))
lines(hp.grid, ns.pred, lwd=2, col="blue")

(e) (1 point) Now give the estimated MSPE, based on the test data, for the natural spline model. (Again, round to 5 significant digits.)

So now let's go to a smoothing spline fit, where we don't have to specify knot locations. First we'll let cross-validation select a value for the smoothing parameter and determine the corresponding effective degrees of freedom, and then we'll plot the fit, as follows:

sspl=smooth.spline(train.dat$hp, train.dat$y, cv=TRUE)
sspl$df
lines(sspl, lwd=2, col="blue")

(f) (1 point) Use R to produce such a plot and submit a hard copy of it. This curve looks very different from the one we got using the natural spline with knots at 70, 170, 190, and 210.

Unfortunately, we cannot estimate the MSPE in the usual way, since the predict() function works differently on the sspl object that was produced. Unlike the cases when we applied the predict() function to the polynomial, cubic spline, and natural spline objects, both

pred.test=predict(sspl, newdata=test.dat)

and

pred.test=predict(sspl, newdata=list(hp=test.dat$hp))

produce only 83 values, instead of the 100 values we need to compare to the 100 y values in the test set. (83 is the number of unique values in train.dat$hp.) If one looks at the output of table(test.dat$hp), it can be seen that there are only 50 different values of hp in the test set. However, to estimate the MSPE using the test sample, we can do the following:

pred.test=predict(sspl, x=test.dat$hp)
mean((pred.test$y-test.dat$y)^2)

(Note: I don't know why the syntax is different for the smoothing splines than it is for the other methods.) You should get a value of about 0.00025600 (which is the worst performance we've gotten so far... maybe because the smoothing spline fit doesn't curve down as much on the extreme right).
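To see concretely why the usual predict() pattern fails here, it may help to inspect what predict() returns for a smooth.spline object; the x values below are arbitrary and chosen only for illustration.

str(predict(sspl, x=c(100,150)))
# a list with two components, $x (the hp values supplied) and $y (the corresponding fitted values),
# which is why $y has to be extracted in the MSPE calculation above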

Now let's use the local regression function, loess(), to estimate the mean response and make predictions:

locreg=loess(y~hp, span=.5, data=train.dat, degree=1)
lo.pred=predict(locreg, data.frame(hp=hp.grid))
lines(hp.grid, lo.pred, lwd=2, col="blue")
pred.test=predict(locreg, data.frame(hp=test.dat$hp))

(Note: I first tried using span=.2, but it produced a very wiggly fit!) Use the R code above to do the two parts below.

(g) (1 point) Use R to produce a plot of the loess fit and submit a hard copy of it.

(h) (1 point) Now give the estimated MSPE, based on the test data, that comes from using the loess fit to make predictions. (As before, round to 5 significant digits. It can be noted that the value is the smallest of all such MSPE values obtained so far with this data.)

Just for fun, let's try the same thing, except that this time we'll use a 2nd-order fit for the local regressions.

locreg2=loess(y~hp, span=.5, data=train.dat, degree=2)
lo2.pred=predict(locreg2, data.frame(hp=hp.grid))
lines(hp.grid, lo2.pred, lwd=2, col="red")
pred.test=predict(locreg2, data.frame(hp=test.dat$hp))
mean((pred.test-test.dat$y)^2)

You should get a value of about 0.00024641, which is the smallest estimated MSPE we've obtained so far. (Note: Using degree=2 to produce local 2nd-order fits is the default for loess().)

Now use the training data to fit a basic multiple regression model with y as the response and disp, hp, and wt as predictors. This can be done using

fit1=lm(y~., data=train.dat)
summary(fit1)
mean((predict(fit1,test.dat)-test.dat$y)^2)

(Note: We get a higher value for R^2 from this multiple regression fit than we did from the polynomial fits just using hp, but it can also be noted that the test MSPE is larger here than what we have for the 2nd-order loess fit based on just the single predictor hp.)

An examination of a residual plot suggests a pretty good fit; however, if you look at the scatter plot produced by

plot(train.dat$disp, fit1$res)

you can see that perhaps we need more than just a linear term for disp. As a first attempt at improvement, let's simply add a quadratic term for disp. If you do this, creating the object fit2 (one way of doing so is sketched below, after the GAM discussion), and look at summary(fit2), you can see that disp went from being marginally significant in our initial model to now being highly significant, along with its associated quadratic term.

(i) (1 point) What is the test sample estimate of the test MSPE for the model containing 1st-order terms for hp and wt, and 1st-order and 2nd-order terms for disp? Please round to 5 significant digits.

Now let's fit a GAM. As a first attempt, let's use smoothing spline representations for all three predictors. (This way we don't have to make decisions about knot placement.) If we use the rule of thumb that suggests that you can have 1 df for every 15 observations, we get that we can afford to use 19 df in all. Taking out 1 for the intercept, that leaves 18. So, to be a bit conservative, we'll use 5 for hp, 5 for wt, and 6 for disp (since it appears to be the predictor needing the largest adjustment for nonlinearity). Then we'll look at the plots we can produce and make a new assessment of the situation. So, enter the following:

fit3=gam(y~s(disp,6)+s(hp,5)+s(wt,5), data=train.dat)
par(mfrow=c(1,3))
plot(fit3, se=TRUE, col="blue")

One can see that the hp and wt contributions are at most just a little nonlinear, but that the disp contribution is very nonlinear. So let's cut down on the flexibility allowed for hp and wt, by changing the df for each one, and keep disp as is.

fit4=gam(y~s(disp,6)+s(hp,4)+s(wt,3), data=train.dat)

Now let's use the test sample to estimate the MSPE for our last GAM model, to see if our guesses have been good ones.

mean((predict(fit4,test.dat)-test.dat$y)^2)

(j) (1 point) What is the test sample estimate of the test MSPE for the last GAM model (the one having the lower df for hp and wt)? (Round to 5 significant digits (and note that this is the smallest MSPE value so far).)

We could try several more GAM models, possibly including some interaction terms, but let's move on. (Note: I tried a full 3rd-order linear model, having 19 df, fit with OLS, and got an estimated MSPE of 0.00026288. So clearly our GAM did a better job than a more traditional approach. Somewhat oddly, the 3rd-order model made worse predictions than the 1st-order linear model, even though an F test indicated that 2nd-order and 3rd-order terms were needed.)
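Returning briefly to the model mentioned before part (i): the assignment never writes out the code that creates fit2, so the following is a minimal sketch of one way it could be specified. The use of I(disp^2) is my own choice here; poly(disp,2,raw=T) would be another way to get the same pair of terms.

fit2=lm(y~disp+I(disp^2)+hp+wt, data=train.dat)   # add a quadratic term for disp to the fit1 model
summary(fit2)
mean((predict(fit2,test.dat)-test.dat$y)^2)       # test-sample MSPE, as requested in part (i)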

Now let's grow and examine a regression tree using the tree() function.

fit5=tree(y~., data=train.dat)
summary(fit5)
par(mfrow=c(1,1))
plot(fit5)
text(fit5, pretty=0)

If you enlarge the plot to be full screen, you can see that it's somewhat interesting: the tree first splits on disp, then splits both of the resulting branches on hp, then splits 3 of the 4 branches so formed on wt, and then there are no further splits. (With so much symmetry in the tree, it doesn't suggest the presence of strong interactions.) Let's compute an estimate of the MSPE.

mean((predict(fit5,test.dat)-test.dat$y)^2)

Not horrible, considering that regression trees are generally not so competitive, and this one wasn't fine-tuned. So now let's see if using cross-validation to select a right-sized tree will lead to an improvement.

fit6=cv.tree(fit5)
plot(fit6$size, fit6$dev, type="b")

The plot indicates that the 7-node tree is best, and so I guess we're done! (Enlarge the plot to get a better look. You can also examine the contents of fit6$dev.)

(k) (1 point) What is the test sample estimate of the test MSPE for the tree model of fit5? (Round to 5 significant digits.)
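Since cross-validation here happens to select the full 7-node tree, no pruning is actually needed. For completeness, the following sketch shows how a smaller tree could be extracted and evaluated if cross-validation had favored one; the choice of 5 terminal nodes is purely illustrative and not part of the assignment.

pruned=prune.tree(fit5, best=5)                 # prune back to 5 terminal nodes (an illustrative size)
plot(pruned)
text(pruned, pretty=0)
mean((predict(pruned,test.dat)-test.dat$y)^2)   # test-sample MSPE of the pruned tree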