Exercise 2.23 Villanova MAT 8406 September 7, 2015

Step 1: Understand the Question

Consider the simple linear regression model y = 50 + 10x + ε where ε is NID(0, 16). Suppose that n = 20 pairs of observations are used to fit this model. Generate 500 samples of 20 observations, drawing one observation for each level of x = 1, 1.5, 2, …, 10 for each sample.

R makes this easy because its normal random number generator, rnorm, does not require fixed values of the parameters (the mean and standard deviation): you may vary them! Therefore you can generate one dataset according to the preceding instructions by means of remarkably terse, efficient commands:

sigma.2 <- 16
beta <- c(50, 10)
x <- seq(1, 10, by=1/2)
y <- rnorm(length(x), beta[1] + beta[2]*x, sigma.2)

Before proceeding, let's check that this is correct and matches what is intended in the problem. Always draw a picture:

plot(x, y, main="First Try at Sampling")

[Figure: scatterplot of y against x, titled "First Try at Sampling"]

Does it look correct? Is this a plot of 20 points that could be described by the model y ~ NID(50 + 10x, 16)? A quick check is afforded by fitting the OLS line and reading the summary output:
fit <- lm(y ~ x)
summary(fit)

Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max
 -14.587   -6.760   -1.073   10.555   22.435

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  46.4377     5.6664   8.195 2.62e-07 ***
x            11.9093     0.9222  12.913 3.25e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.01 on 17 degrees of freedom
Multiple R-squared: 0.9075, Adjusted R-squared: 0.902
F-statistic: 166.8 on 1 and 17 DF, p-value: 3.25e-10

Scan it carefully, looking for evidence of every quantitative value that was used: the dataset size of 20, the model y = 50 + 10x, and the variance of 16 in the errors. There are two salient problems that need to be addressed. (It's good we did this check before proceeding with extensive simulation!)

1. The value of 17 for DF (degrees of freedom) is one less than we would expect. Indeed, x has only 19 elements!

length(x)
[1] 19

Let's just assume statisticians can't count :-) and presume the question really is calling for generating samples of size 19. (A quick scan through the rest of the question suggests none of it relies fundamentally on the sample size being 20.)

2. The residual standard error of 11 suggests the error variance (its square) is around 121, which is far larger than the intended value of 16. This kind of mistake is common but insidious: the textbook uses a different parameterization of Normal distributions than the software does. R uses the mean and standard deviation while the text uses the mean and variance. (Still other sources might use the precision, which is the reciprocal of the variance, or even the logarithm of the variance for the second parameter.) This problem is particularly acute with other distributions, like the Gamma distributions, for which there is no clear convention for the parameters. It is crucial to understand what the parameters mean so that you can perform calculations correctly!
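The parameterization trap is easy to verify directly. The following quick check (a sketch; the variable names are mine) draws large samples both ways and compares the empirical variances with the intended value of 16:

```r
# rnorm's third argument is the standard deviation, not the variance.
set.seed(17)
wrong <- rnorm(1e5, 0, 16)        # sd = 16, so the variance is near 16^2 = 256
right <- rnorm(1e5, 0, sqrt(16))  # sd = 4, so the variance is near 16
c(var(wrong), var(right))
```

Passing the variance where the standard deviation belongs inflates the noise by a factor of sqrt(16) = 4, which is why the residual standard error above came out on the order of 16 rather than 4.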
There may be additional problems: the intercept of 46.4 and the slope of 11.91 differ somewhat from the intended intercept of 50 and slope of 10. However, they're of the right order of magnitude, so let's hope the discrepancies are due to randomness; but we'll keep an eye on this issue and perform a fuller check later.

Fixing these problems is easy: (1) needs no change, while (2) requires us to convert the variance of 16 into its square root:
y <- rnorm(length(x), beta[1] + beta[2]*x, sqrt(sigma.2))
plot(x, y, main="Fixed-up Sample") # Always check!

[Figure: scatterplot of y against x, titled "Fixed-up Sample"]

(You should re-run the lm and summary code to verify that you're getting what you expected.)

Step 2: Do the Calculations

We are asked to generate 500 samples according to this model. Now that we have written and tested the commands to generate one sample, there are many (easy) ways to generate 500 samples. Because 500 is a relatively small number and each sample is small and requires relatively little calculation, we can afford to be inefficient. Rather than extracting all the information requested in parts (a)–(d) of the question, let's just save all the samples and all the fits. We can then post-process them at our leisure. Here's the command:

sim <- replicate(3, {
  y <- rnorm(length(x), beta[1] + beta[2]*x, sqrt(sigma.2))
  lm(y ~ x)
})

To get started, the intended count of 500 has been replaced by 3. That's enough to practice with yet small enough to avoid being overwhelmed by managing 500 different (complex) fits. One step at a time!

The result is an array of three (or, later, 500) objects: each of them is the output of lm in the last line. It is an R idiosyncrasy that each object will be considered to be indexed by a second coordinate. For instance, the result of applying lm to the first sample is contained in sim[, 1], not sim[1, ]. You can confirm this by inspecting sim (either in the Global Environment pane in RStudio or by computing dim(sim)).
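To see concretely what replicate produced, a quick inspection confirms both the orientation of the array and that each column holds the components of one lm fit (a self-contained sketch; the seed is arbitrary):

```r
sigma.2 <- 16; beta <- c(50, 10)
x <- seq(1, 10, by = 1/2)
set.seed(1)
sim <- replicate(3, {
  y <- rnorm(length(x), beta[1] + beta[2]*x, sqrt(sigma.2))
  lm(y ~ x)
})
dim(sim)         # second dimension is 3: one column per fit
class(sim[, 1])  # a plain "list" -- the pieces of a single lm object
"coefficients" %in% names(sim[, 1])  # TRUE
```

This is why later steps must loop over the second index and restore the "lm" class before calling generic functions on a column.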
Question (a)

a. For each sample compute the least-squares estimates of the slope and intercept. Construct histograms of the sample values of β̂0 and β̂1. Discuss the shape of these histograms.

To apply some procedure, such as extracting the least-squares estimates of the coefficients, to an array like sim, you will usually use one of the *apply functions in R: often apply, lapply, or sapply, with the first being appropriate for looping over rows or columns of arrays. In this case we wish to treat sim as an array of columns by looping over its second index (number 2). The coefficients of the fit in each column are extracted using the coef function:

beta.hat <- apply(sim, 2, coef)

The output will have one column for each iteration in the loop. Because coef returns first the intercept and then the slope, the intercepts will be found in the first row of beta.hat and the slopes in its second row. Let's look:

print(beta.hat)

                 [,1]      [,2]     [,3]
(Intercept) 51.445077 53.584327 49.01810
x            9.956447  9.213877 10.06204

That's looking good! The first row is actually named (Intercept) and the second row, x (because x was the name of the regressor in the call to lm). We may refer to the rows by name. This is usually a good idea because it avoids mistakes made when we miscount the number of the row in which we are interested.

Thus, for instance, the histograms can be obtained with two calls to hist, one for each row. Since a histogram of just three values won't reveal much, first we go back and re-do the simulation with the full 500 values.

sim <- replicate(500, {
  y <- rnorm(length(x), beta[1] + beta[2]*x, sqrt(sigma.2))
  lm(y ~ x)
})
beta.hat <- apply(sim, 2, coef)
par(mfrow=c(1,2)) # Draws side-by-side histograms
hist(beta.hat["(Intercept)", ], freq=FALSE, main="", xlab=expression(hat(beta)[0]))
hist(beta.hat["x", ], freq=FALSE, main="", xlab=expression(hat(beta)[1]))
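When discussing centers and spreads, theory supplies reference values to compare against: E(β̂1) = β1 = 10 with Var(β̂1) = σ²/Sxx, and Var(β̂0) = σ²(1/n + x̄²/Sxx). The sketch below (self-contained, with an arbitrary seed) recomputes the simulation and sets the simulated standard deviations beside the theoretical ones:

```r
sigma.2 <- 16; beta <- c(50, 10)
x <- seq(1, 10, by = 1/2)
n <- length(x)
Sxx <- sum((x - mean(x))^2)
set.seed(8406)
beta.hat <- replicate(500,
  coef(lm(rnorm(n, beta[1] + beta[2]*x, sqrt(sigma.2)) ~ x)))
# Theoretical standard deviations of the intercept and slope estimators:
theory <- c(sqrt(sigma.2 * (1/n + mean(x)^2/Sxx)), sqrt(sigma.2/Sxx))
round(rbind(simulated = apply(beta.hat, 1, sd), theoretical = theory), 3)
```

The simulated rows should match the theoretical values (about 2.06 for the intercept and 0.34 for the slope) to within sampling error, and both histograms should look approximately Normal, since each estimator is a linear combination of Normal errors.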
[Figure: side-by-side density histograms of β̂0 (left) and β̂1 (right)]

"Discuss the shape of these histograms" should include quantitative evaluation of their centers and spreads, along with either quantitative or qualitative assessment of other aspects of a distribution, such as its skewness, heaviness of tails, presence of outliers, peakedness, number of modes, etc. If you have reason to suppose the data shown by these histograms would look approximately like some well-known distributional shape (such as Normal, Student t, etc.) then compare them to that shape as a reference.

Question (b)

For each sample, compute an estimate of E(y | x = 5). Construct a histogram of the estimates you obtained. Discuss the shape of the histogram.

The preferred way in R to estimate this expectation is with the predict function. It works in a strangely restricted way: you must supply it a data frame of the values of x in which you are interested. To test, note that you still have an object fit lying around from your initial testing. Let's try out predict on it:

predict(object=fit, newdata=data.frame(x=5))

       1
105.9842

fit is the name of the object containing the lm output (we chose it) and x is the name of the regressor variable used by lm. The output value of 106 is reasonably close to the model value 50 + 10 × 5 = 100.

Having successfully done the calculation with one fit, we are ready to apply it to the entire simulation. As before, all 500 values will be stored in a variable which is then fed to hist for visualization as a histogram.

y.hat.0 <- apply(sim, 2, function(f) {
  class(f) <- "lm"
  predict(f, newdata=data.frame(x=5))
})
As you can see, this is fussy: we are obliged to define a function on the fly that (re-)informs R that each column of sim really is the output of lm, just so we can apply predict. (R tends to be inconsistent: even core procedures like lm, coef, and predict do not work together in a consistent manner.)

A simpler approach is to use your knowledge of least squares. The predicted value at x = 5 is given by the estimated coefficients, which we have already computed (and stored as rows in beta.hat):

y.hat <- beta.hat["(Intercept)", ] + beta.hat["x", ] * 5
par(mfrow=c(1,2))
hist(y.hat.0, freq=FALSE, main="Output of `predict`", cex.main=0.95, xlab=expression(hat(y)[0]))
hist(y.hat, freq=FALSE, main="Manually computed predictions", cex.main=0.95, xlab=expression(hat(y)))

[Figure: side-by-side histograms, "Output of `predict`" (left) and "Manually computed predictions" (right)]

The results are the same, of course.

Question (c)

c. For each sample, compute a 95% CI on the slope. How many of these intervals contain the true value β1 = 10? Is this what you would expect?

It's a good exercise to compute this CI using formulas from the book. In practice, though, you would look for a built-in R function. It is confint:

confint(fit, "x", level=95/100)

      2.5 %   97.5 %
x  9.963523 13.85507

The art of statistical computing lies in continually checking that your understanding of the software is correct. How do we know that this output really is providing a symmetric, two-sided, 95%
confidence interval for β1? One way is to compute the same interval in an alternative way. For instance, we could inspect the summary table. For fit it included an estimate of β̂1 = 11.909 and a standard error of 0.9222. Using 19 − 2 = 17 degrees of freedom (also shown in the summary output) we may compute the corresponding multiplier as the 1 − α/2 quantile of the Student t distribution, κ = t⁻¹(1 − α/2; df), where α = 1 − 0.95 = 0.05. (Note that the code below computes α/2 directly and stores it as alpha.) Here are the commands to perform these calculations and display κ:

confidence <- 95/100
alpha <- (1 - confidence)/2
df <- fit$df.residual
(multiplier <- qt(1 - alpha, df))

[1] 2.109816

The confidence interval is β̂1 ± κ se(β̂1) = 11.909 ± 2.11 × 0.9222. It agrees with the output of confint. Now we can feel comfortable using confint in our work. Let's apply this to the simulation:

CI.beta.1 <- apply(sim, 2, function(f) {
  class(f) <- "lm"
  confint(f, "x", level=95/100)
})

To count the number of intervals containing the true value, compare them with the true value:

covers <- CI.beta.1[1, ] <= beta[2] & beta[2] <= CI.beta.1[2, ]
print(paste0(sum(covers), " (", mean(covers)*100, "%) of the intervals cover the true value."))

[1] "475 (95%) of the intervals cover the true value."

Question (d)

d. For each estimate of E(y | x = 5) in part b, compute the 95% CI, etc.

The R solution once again is predict. This function is overloaded: it does lots of different things, depending on what you ask of it. As before, we should not rely on it until we have tested it.

predict(fit, newdata=data.frame(x=5), interval="confidence", level=95/100)

       fit      lwr     upr
1 105.9842 100.5674 111.401

Evidently it produces a vector of three values: the fit ŷ and the lower and upper (symmetric, two-sided) confidence limits. We can deal with these exactly as we did with β̂: the result of apply will be three rows of output which can be referenced by their names fit, lwr, and upr.
y.hat.0 <- apply(sim, 2, function(f) {
  class(f) <- "lm"
  predict(f, newdata=data.frame(x=5), interval="confidence", level=95/100)
})

From this point on, emulate the calculations and the answer to part (c).
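That emulation might look like the following sketch (self-contained, with an arbitrary seed and variable names of my choosing): collect the 500 confidence intervals for E(y | x = 5), count how many cover the true mean 50 + 10·5 = 100, and compare the count with what the Binomial(500, 0.95) distribution leads us to expect.

```r
sigma.2 <- 16; beta <- c(50, 10)
x <- seq(1, 10, by = 1/2)
set.seed(8406)
sim <- replicate(500, {
  y <- rnorm(length(x), beta[1] + beta[2]*x, sqrt(sigma.2))
  lm(y ~ x)
})
CI.mean <- apply(sim, 2, function(f) {
  class(f) <- "lm"
  # [1, ] drops the one-row matrix to a named vector (fit, lwr, upr)
  predict(f, newdata=data.frame(x=5),
          interval="confidence", level=95/100)[1, ]
})
mu <- beta[1] + beta[2] * 5   # true mean at x = 5
covers <- CI.mean["lwr", ] <= mu & mu <= CI.mean["upr", ]
sum(covers)
# Expect a count near 0.95 * 500 = 475; the central 95% of the
# Binomial(500, 0.95) distribution shows how far off is unsurprising:
qbinom(c(0.025, 0.975), 500, 0.95)
```

Any count inside that Binomial range is consistent with the nominal 95% coverage, just as the 475 out of 500 observed in part (c) was.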