Exercise 2.23 Villanova MAT 8406 September 7, 2015


Step 1: Understand the Question

Consider the simple linear regression model y = 50 + 10x + ε, where ε is NID(0, 16). Suppose that n = 20 pairs of observations are used to fit this model. Generate 500 samples of 20 observations, drawing one observation for each level of x = 1, 1.5, 2, ..., 10 for each sample.

R makes this easy because its normal random number generator, rnorm, does not require fixed values of the parameters (the mean and standard deviation): you may vary them! Therefore you can generate one dataset according to the preceding instructions by means of remarkably terse, efficient commands:

    sigma.2 <- 16
    beta <- c(50, 10)
    x <- seq(1, 10, by=1/2)
    y <- rnorm(length(x), beta[1] + beta[2]*x, sigma.2)

Before proceeding, let's check that this is correct and matches what is intended in the problem. Always draw a picture:

    plot(x, y, main="First Try at Sampling")

[Figure: scatterplot of y against x, titled "First Try at Sampling".]

Does it look correct? Is this a plot of 20 points that could be described by the model y ~ NID(50 + 10x, 16)? A quick check is afforded by fitting the OLS line and reading the summary output:

    fit <- lm(y ~ x)
    summary(fit)

    Call:
    lm(formula = y ~ x)

    Residuals:
         Min       1Q   Median       3Q      Max
     -14.587   -6.760   -1.073   10.555   22.435

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  46.4377     5.6664   8.195 2.62e-07 ***
    x            11.9093     0.9222  12.913 3.25e-10 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 11.01 on 17 degrees of freedom
    Multiple R-squared:  0.9075,    Adjusted R-squared:  0.902
    F-statistic: 166.8 on 1 and 17 DF,  p-value: 3.25e-10

Scan it carefully, looking for evidence of every quantitative value that was used: the dataset size of 20, the model y = 50 + 10x, and the variance of 16 in the errors. There are two salient problems that need to be addressed. (It's good we did this check before proceeding with extensive simulation!)

1. The value of 17 for DF (degrees of freedom) is one less than we would expect. Indeed, x has only 19 elements:

       length(x)

       [1] 19

   Let's just assume statisticians can't count :-) and presume the question really is calling for generating samples of size 19. (A quick scan through the rest of the question suggests none of it relies fundamentally on the sample size being 20.)

2. The residual standard error of 11 suggests the error variance (its square) is around 121, which is far larger than the intended value of 16. This kind of mistake is common but insidious: the textbook uses a different parameterization of the Normal distribution than the software does. R uses the mean and standard deviation, while the text uses the mean and variance. (Still other sources might use the precision, which is the reciprocal of the variance, or even the logarithm of the variance, for the second parameter.) This problem is particularly acute with other distributions, such as the Gamma distributions, for which there is no clear convention for the parameters. It is crucial to understand what the parameters mean so that you can perform your calculations correctly!

There may be additional problems: the intercept of 46.4 and the slope of 11.91 differ somewhat from the intended intercept of 50 and slope of 10. However, they are of the right order of magnitude, so let's hope the discrepancies are due to randomness; we will keep an eye on this issue and perform a fuller check later.

Fixing these problems is easy: (1) needs no change, while (2) requires passing the square root of the variance, the standard deviation sqrt(sigma.2), to rnorm rather than the variance itself.
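If you want to convince yourself of R's parameterization before fixing the code, a minimal sketch (not part of the exercise, and using an arbitrary seed) is to draw a large sample with each choice of third argument and compare the empirical standard deviations:

    # Sketch: rnorm's third argument is the standard deviation, not the variance.
    set.seed(123)                       # arbitrary seed, for reproducibility only
    sd(rnorm(1e5, 0, sigma.2))          # roughly 16: passing the variance inflates the spread
    sd(rnorm(1e5, 0, sqrt(sigma.2)))    # roughly 4: the spread the model intends

The first call reproduces the mistake diagnosed above; the second matches the intended model. With the parameterization settled, the corrected sample is: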

    y <- rnorm(length(x), beta[1] + beta[2]*x, sqrt(sigma.2))
    plot(x, y, main="Fixed-up Sample")   # Always check!

[Figure: scatterplot of y against x, titled "Fixed-up Sample".]

(You should re-run the lm and summary code to verify that you are getting what you expected.)

Step 2: Do the Calculations

We are asked to generate 500 samples according to this model. Now that we have written and tested the commands to generate one sample, there are many (easy) ways to generate 500 of them. Because 500 is a relatively small number, and each sample is small and requires relatively little calculation, we can afford to be inefficient. Rather than extracting all the information requested in parts (a)-(d) of the question, let's just save all the samples and all the fits. We can then post-process them at our leisure. Here is the command:

    sim <- replicate(3, {
      y <- rnorm(length(x), beta[1] + beta[2]*x, sqrt(sigma.2))
      lm(y ~ x)
    })

To get started, the intended count of 500 has been replaced by 3. That is enough to practice with, yet small enough to avoid being overwhelmed by managing 500 different (complex) fits. One step at a time!

The result is an array of three (or, later, 500) objects: each of them is the output of lm in the last line. It is an R idiosyncrasy that each object will be considered to be indexed by a second coordinate. For instance, the result of applying lm to the first sample is contained in sim[, 1], not sim[1, ]. You can confirm this by inspecting sim (either in the Global Environment pane in RStudio or by computing dim(sim)).
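For instance, a small (hypothetical) inspection along these lines shows the layout:

    dim(sim)         # one column per replicate; one row per component of each lm object
    names(sim[, 1])  # "coefficients", "residuals", ...: sim[, 1] is the first fit, stored as a plain list
    class(sim[, 1])  # "list": the "lm" class attribute is dropped, which will matter when we call predict later

None of this is required by the exercise; it simply confirms how replicate has arranged the fits.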

Question (a)

a. For each sample compute the least-squares estimates of the slope and intercept. Construct histograms of the sample values of β̂0 and β̂1. Discuss the shape of these histograms.

To apply some procedure, such as extracting the least-squares estimates of the coefficients, to an array like sim, you will usually use one of the *apply functions in R: often apply, lapply, or sapply, with the first being appropriate for looping over rows or columns of arrays. In this case we wish to treat sim as an array of columns by looping over its second index (number 2). The coefficients of the fit in each column are extracted using the coef function:

    beta.hat <- apply(sim, 2, coef)

The output will have one column for each iteration of the loop. Because coef returns first the intercept and then the slope, the intercepts will be found in the first row of beta.hat and the slopes in its second row. Let's look:

    print(beta.hat)

                     [,1]      [,2]     [,3]
    (Intercept) 51.445077 53.584327 49.01810
    x            9.956447  9.213877 10.06204

That's looking good! The first row is actually named "(Intercept)" and the second row "x" (because x was the name of the regressor in the call to lm). We may refer to the rows by name. This is usually a good idea because it avoids the mistakes that arise from miscounting the row in which we are interested.

Thus, for instance, the histograms can be obtained with two calls to hist, one for each row. Since a histogram of just three values won't reveal much, first we go back and re-do the simulation with the full 500 values.

    sim <- replicate(500, {
      y <- rnorm(length(x), beta[1] + beta[2]*x, sqrt(sigma.2))
      lm(y ~ x)
    })
    beta.hat <- apply(sim, 2, coef)

    par(mfrow=c(1,2))   # Draws side-by-side histograms
    hist(beta.hat["(Intercept)", ], freq=FALSE, main="", xlab=expression(hat(beta)[0]))
    hist(beta.hat["x", ], freq=FALSE, main="", xlab=expression(hat(beta)[1]))

[Figure: side-by-side histograms of the 500 simulated values of β̂0 and β̂1.]

"Discuss the shape of these histograms" should include a quantitative evaluation of their centers and spreads, along with either a quantitative or qualitative assessment of other aspects of a distribution, such as its skewness, heaviness of tails, presence of outliers, peakedness, number of modes, etc. If you have reason to suppose the data shown by these histograms would look approximately like some well-known distributional shape (such as Normal, Student t, etc.), then compare them to that shape as a reference.
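As one possible sketch of such a check (not required by the exercise), you could compare the simulated centers and spreads with the means and standard errors that least-squares theory predicts for this design:

    # Sketch: compare the simulation with the theoretical sampling distributions.
    # Under the model, beta1-hat has mean beta[2] and sd sqrt(sigma^2 / Sxx),
    # and beta0-hat has mean beta[1] and sd sqrt(sigma^2 * (1/n + xbar^2 / Sxx)).
    n <- length(x)
    Sxx <- sum((x - mean(x))^2)
    theory <- c(mean.b0 = beta[1],
                sd.b0   = sqrt(sigma.2 * (1/n + mean(x)^2 / Sxx)),
                mean.b1 = beta[2],
                sd.b1   = sqrt(sigma.2 / Sxx))
    simulated <- c(mean(beta.hat["(Intercept)", ]), sd(beta.hat["(Intercept)", ]),
                   mean(beta.hat["x", ]), sd(beta.hat["x", ]))
    rbind(theory, simulated)

Both estimators are linear combinations of Normal errors, so their histograms should look approximately Normal, centered near 50 and 10 with spreads close to these theoretical standard errors.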

Question (b)

b. For each sample, compute an estimate of E(y | x = 5). Construct a histogram of the estimates you obtained. Discuss the shape of the histogram.

The preferred way in R to estimate this expectation is with the predict function. It works in a strangely restricted way: you must supply it a data frame of the values of x in which you are interested. To test it, note that you still have an object fit lying around from your initial testing. Let's try out predict on it:

    predict(object=fit, newdata=data.frame(x=5))

           1
    105.9842

Here fit is the name of the object containing the lm output (we chose it) and x is the name of the regressor variable used by lm. The output value of 106 is reasonably close to the model value 50 + 10 * 5 = 100.

Having successfully done the calculation with one fit, we are ready to apply it to the entire simulation. As before, all 500 values will be stored in a variable, which is then fed to hist for visualization as a histogram.

    y.hat.0 <- apply(sim, 2, function(f) {
      class(f) <- "lm"
      predict(f, newdata=data.frame(x=5))
    })

As you can see, this is fussy: we are obliged to define a function on the fly that (re-)informs R that each column of sim really is the output of lm, just so we can apply predict. (R tends to be inconsistent: even core procedures like lm, coef, and predict do not work together in a consistent manner.)

A simpler approach is to use your knowledge of least squares. The predicted value at x = 5 is given by the estimated coefficients, which we have already computed (and stored as rows in beta.hat):

    y.hat <- beta.hat["(Intercept)", ] + beta.hat["x", ] * 5

    par(mfrow=c(1,2))
    hist(y.hat.0, freq=FALSE, main="Output of `predict`", cex.main=0.95,
         xlab=expression(hat(y)[0]))
    hist(y.hat, freq=FALSE, main="Manually computed predictions", cex.main=0.95,
         xlab=expression(hat(y)))

[Figure: side-by-side histograms titled "Output of `predict`" and "Manually computed predictions".]

The results are the same, of course.

Question (c)

c. For each sample, compute a 95% CI on the slope. How many of these intervals contain the true value β1 = 10? Is this what you would expect?

It's a good exercise to compute this CI using formulas from the book. In practice, though, you would look for a built-in R function. It is confint:

    confint(fit, "x", level=95/100)

           2.5 %   97.5 %
    x   9.963523 13.85507

The art of statistical computing lies in continually checking that your understanding of the software is correct. How do we know that this output really is providing a symmetric, two-sided, 95% confidence interval for β1?

One way is to compute the same interval in an alternative way. For instance, we could inspect the summary table. For fit it included an estimate β̂1 = 11.909 with a standard error of 0.9222. Using 19 - 2 = 17 degrees of freedom (also shown in the summary output), we may compute the corresponding multiplier from the Student t distribution as κ = t_df^{-1}(1 - α/2), the 1 - α/2 quantile of the t distribution with df degrees of freedom. Here are the commands to perform these calculations and display κ:

    confidence <- 95/100
    alpha <- (1 - confidence)/2
    df <- fit$df.residual
    (multiplier <- qt(1 - alpha, df))

    [1] 2.109816

The confidence interval is β̂1 ± κ se(β̂1) = 11.909 ± 2.11 * 0.9222. It agrees with the output of confint. Now we can feel comfortable using confint in our work. Let's apply it to the simulation:

    CI.beta.1 <- apply(sim, 2, function(f) {
      class(f) <- "lm"
      confint(f, "x", level=95/100)
    })

To count the number of intervals containing the true value, compare them with that value:

    covers <- CI.beta.1[1, ] <= beta[2] & beta[2] <= CI.beta.1[2, ]
    print(paste0(sum(covers), " (", mean(covers)*100, "%) of the intervals cover the true value."))

    [1] "475 (95%) of the intervals cover the true value."

Question (d)

d. For each estimate of E(y | x = 5) in part (b), compute the 95% CI, etc.

The R solution once again is predict. This function is overloaded: it does lots of different things, depending on what you ask of it. As before, we should not rely on it until we have tested it:

    predict(fit, newdata=data.frame(x=5), interval="confidence", level=95/100)

           fit      lwr     upr
    1 105.9842 100.5674 111.401

Evidently it produces a vector of three values: the fit ŷ and the lower and upper limits of a (symmetric, two-sided) confidence interval. We can deal with these exactly as we did with β̂: the result of apply will have three rows of output, which can be referenced by their names fit, lwr, and upr.

    y.hat.0 <- apply(sim, 2, function(f) {
      class(f) <- "lm"
      predict(f, newdata=data.frame(x=5), interval="confidence", level=95/100)
    })

From this point on, emulate the calculations and the answer to part (c).
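A minimal sketch of that emulation is given below. It assumes that the fitted value and the interval limits occupy the first, second, and third rows of the apply output; indexing by position sidesteps any question of whether the names survive apply.

    # Sketch: coverage of the 95% confidence intervals for E(y | x = 5).
    # Rows of y.hat.0: 1 = fitted value, 2 = lower limit, 3 = upper limit.
    E.y <- beta[1] + beta[2] * 5   # the true mean at x = 5 is 50 + 10*5 = 100
    covers.Ey <- y.hat.0[2, ] <= E.y & E.y <= y.hat.0[3, ]
    print(paste0(sum(covers.Ey), " (", mean(covers.Ey)*100,
                 "%) of the intervals cover E(y | x = 5) = 100."))

As in part (c), you would expect roughly 95% of the 500 intervals to cover the true mean.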