
6.1

# 95% CI
t.star <- qt(0.975, df = 29)
t.star
## [1] 2.045
ME <- t.star * 1.9/sqrt(30)
ME
## [1] 0.7095
1.2 + c(-1, 1) * ME
## [1] 0.4905 1.9095

# 99% CI
t.star <- qt(0.995, df = 29)
t.star
## [1] 2.756
ME <- t.star * 1.9/sqrt(30)
ME
## [1] 0.9562
10.2 + c(-1, 1) * ME
## [1] 9.244 11.156

# standard uncertainty = SE = s / sqrt(n)
1.9/sqrt(30)
## [1] 0.3469
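For convenience, the same arithmetic can be wrapped in a small helper function. This is a sketch, not part of the original solution; the name t_ci is made up here.

t_ci <- function(xbar, s, n, level = 0.95) {
  # t-based CI from summary statistics: xbar +/- t* s/sqrt(n)
  t.star <- qt(1 - (1 - level)/2, df = n - 1)
  ME <- t.star * s / sqrt(n)
  c(lower = xbar - ME, upper = xbar + ME)
}
t_ci(1.2, 1.9, 30)           # should reproduce the 95% CI above
t_ci(10.2, 1.9, 30, 0.99)    # should reproduce the 99% CI above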

6.2

# rectangular: sd = width / sqrt(12) = width / (2 * sqrt(3))
0.4 * 2/sqrt(12)
## [1] 0.2309
0.4/sqrt(3)
## [1] 0.2309
# triangular: sd = width / (2 * sqrt(6))
0.4 * 2/(2 * sqrt(6))
## [1] 0.1633
0.4/sqrt(6)
## [1] 0.1633

The uniform distribution is the more conservative assumption. The triangle distribution should only be used in situations where errors are more likely to be small than large (within the specified bounds).

6.3

# rectangular
1e-04/sqrt(3)
## [1] 5.774e-05
# triangular
1e-04/sqrt(6)
## [1] 4.082e-05

6.4

Let \(f(R) = \pi R^2\). Then \(\frac{\partial f}{\partial R} = 2\pi R\), so the uncertainty in the area estimate is \(\sqrt{(2\pi R)^2 (0.3)^2} = 2\pi R \cdot 0.3 = 23.562\ \mathrm{m}^2\). We can report the area as 490 ± 20 m².
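The two conversions in 6.2 and 6.3 can be collected into one helper function. This is a sketch, not from the text; standard_u is a hypothetical name, and half_width is half the width of the error interval.

standard_u <- function(half_width, shape = c("rectangular", "triangular")) {
  # standard uncertainty from an interval half-width:
  # rectangular: half_width / sqrt(3); triangular: half_width / sqrt(6)
  shape <- match.arg(shape)
  switch(shape,
         rectangular = half_width / sqrt(3),
         triangular  = half_width / sqrt(6))
}
standard_u(0.4)                  # matches 6.2 (rectangular)
standard_u(1e-4, "triangular")   # matches 6.3 (triangular)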

6.5

estimate    uncertainty
5.43        0.02
1536        13
576         3
0.00149     0.00003

6.6

In a product, the relative uncertainties add in quadrature, so

L <- 2.65; W <- 3.10; H <- 4.61
ul <- 0.02; uw <- 0.02; uh <- 0.05
V <- L * W * H; V
## [1] 37.87
uv <- sqrt((ul/L)^2 + (uw/W)^2 + (uh/H)^2) * V; uv
## [1] 0.5569

So we report 37.9 ± 0.6.

6.7

The total fuel consumed is given by

\[ F = \frac{kPd}{E} \]

where \(k\) is the number of cars per person, \(P\) is the total population, \(d\) is the average distance driven per car, and \(E\) is the average efficiency.

\[ \frac{\partial F}{\partial k} = \frac{Pd}{E} \qquad \frac{\partial F}{\partial P} = \frac{kd}{E} \qquad \frac{\partial F}{\partial d} = \frac{kP}{E} \qquad \frac{\partial F}{\partial E} = -\frac{kPd}{E^2} \]

So the estimated amount of fuel consumed (in millions of gallons) is

\[ \frac{kPd}{E} = \frac{0.8 \cdot 311.6 \cdot 12000}{23.7} = 1.262 \times 10^5 \]

and the uncertainty can be computed as follows:

k <- 0.8;   u_k <- 0.12
P <- 311.6; u_p <- 0.2
d <- 12000; u_d <- 2000
E <- 23.7;  u_e <- 1.7
variances <- c(
  (P * d / E)^2 * u_k^2,
  (k * d / E)^2 * u_p^2,
  (k * P / E)^2 * u_d^2,
  (k * P * d / E^2)^2 * u_e^2
)
variances
## [1] 358445548      6563 442525367  81967525
u_f <- sqrt((P * d / E)^2 * u_k^2 + (k * d / E)^2 * u_p^2 +
            (k * P / E)^2 * u_d^2 + (k * P * d / E^2)^2 * u_e^2)
u_f
## [1] 29714
sqrt(sum(variances))
## [1] 29714

So we might report fuel consumption as \(1.3 \times 10^5 \pm 3 \times 10^4\) millions of gallons, or 130 ± 30 billions of gallons.

Note: This problem can also be done in stages. First we estimate the number of (millions of) cars using \(C = kP\).

C <- k * P
C
## [1] 249.3
u_c <- sqrt(P^2 * u_k^2 + k^2 * u_p^2)
u_c
## [1] 37.39

The number of (millions of) miles driven is estimated by \(M = Cd\):

M <- C * d
M
## [1] 2991360
u_m <- sqrt(d^2 * u_c^2 + C^2 * u_d^2)
u_m
## [1] 670747

Finally, the fuel used is \(F = M/E\).

F <- M / E
u_f <- sqrt((1/E)^2 * u_m^2 + (M/E^2)^2 * u_e^2)
u_f
## [1] 29714

This gives the same result (again in millions of gallons of gasoline). It would also be possible to use relative uncertainties to do this problem, since we have nice formulas for the relative uncertainty in products and quotients.

6.8

See the next problem for the answer.

6.9

Let \(Q = XY^{-1}\). So \(\frac{\partial Q}{\partial X} = Y^{-1}\) and \(\frac{\partial Q}{\partial Y} = -XY^{-2}\). From this we get

\[
\frac{u_Q}{Q}
= \sqrt{\frac{Y^{-2} u_X^2 + X^2 Y^{-4} u_Y^2}{Q^2}}
= \sqrt{\frac{Y^{-2} u_X^2 + X^2 Y^{-4} u_Y^2}{X^2 Y^{-2}}}
= \sqrt{\frac{u_X^2}{X^2} + \frac{u_Y^2}{Y^2}}
= \sqrt{\left(\frac{u_X}{X}\right)^2 + \left(\frac{u_Y}{Y}\right)^2}
\]

which gives a Pythagorean identity for the relative uncertainties, just as it did for a product.

V <- 1.637 / 0.43; V
## [1] 3.807
uv <- sqrt((0.02/0.43)^2 + (0.006/1.637)^2) * V; uv
## [1] 0.1776

So we report 3.81 ± 0.18.
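The partial derivatives in problems like 6.7 and 6.9 can also be produced symbolically with base R's deriv(). The following is a sketch (not from the text) that redoes the 6.7 propagation; it assumes the values of k, P, d, E and u_k, u_p, u_d, u_e given above.

g <- deriv(~ k * P * d / E, c("k", "P", "d", "E"), function.arg = TRUE)
val <- g(k = 0.8, P = 311.6, d = 12000, E = 23.7)
grad <- attr(val, "gradient")    # the four partial derivatives, evaluated
u <- c(0.12, 0.2, 2000, 1.7)     # u_k, u_p, u_d, u_e
sqrt(sum((grad * u)^2))          # should agree with u_f (29714) above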

7.1

a) There are 13 residual degrees of freedom, so there were 13 + 2 = 15 observations.

b) \(\widehat{\text{runoff}} = -1 + 0.83 \cdot \text{rainfall}\)

c) 0.83 ± 0.04

d)

confint(rain.model, "rainfall")
##           2.5 % 97.5 %
## rainfall 0.7481 0.9059

We can compute this from the information displayed:

t.star <- qt(0.975, df = 13)  # 13 df listed for residual standard error
t.star
## [1] 2.16
SE <- 0.0365
ME <- t.star * SE; ME
## [1] 0.07885
0.8270 + c(-1, 1) * ME  # CI as an interval
## [1] 0.7481 0.9059

We should round this using our rounding rules (treating the margin of error like an uncertainty).

e) The slope tells us how much additional runoff there is per additional amount of rain that falls. Since both are in the same units (m³) and since the intercept is essentially 0, we can interpret this slope as a proportion: roughly 83% of the rain water is being measured as runoff.

f) 5.24

7.2

foot.model <- lm(width ~ length, data = KidsFeet)
plot(foot.model, w = 1)
plot(foot.model, w = 2)

[Figure: Residuals vs Fitted and Normal Q-Q plots for lm(width ~ length).]

Our diagnostics look pretty good. The residuals look randomly distributed with similar amounts of variability throughout the plot. The normal-quantile plot is nearly linear.

f <- makefun(foot.model)
f(24, interval = "prediction")
##     fit   lwr   upr
## 1 8.813 7.997 9.629
f(24, interval = "confidence")
##     fit   lwr  upr
## 1 8.813 8.666 8.96

We can't estimate Billy's foot width very accurately (between 8.0 and 9.6 cm), but we can estimate the average foot width for all kids with a foot length of 24 cm more accurately (between 8.67 and 8.96 cm).
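The same intervals can also be obtained without mosaic's makefun(), using base R's predict(); a sketch, assuming foot.model as fit above.

predict(foot.model, newdata = data.frame(length = 24), interval = "prediction")
predict(foot.model, newdata = data.frame(length = 24), interval = "confidence")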

7.3

bike.model <- lm(distance ~ space, data = ex12.21)
coef(summary(bike.model))
##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)  -2.1825    1.05669  -2.065 7.275e-02
## space         0.6603    0.06748   9.786 9.975e-06
f <- makefun(bike.model)
f(15, interval = "prediction")
##     fit   lwr   upr
## 1 7.723 6.313 9.132

We would use a confidence interval to estimate the average separation distance for all streets with 15 feet of available space.

f(15, interval = "confidence")
##     fit   lwr   upr
## 1 7.723 7.293 8.152

7.4

a)

model <- lm(weight ~ height, data = Men)
coef(summary(model))
##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept) -69.6989   12.14450  -5.739 1.551e-08
## height        0.8707    0.06994  12.449 1.277e-31

b)

# we can ask for just the parameter we want, if we like
confint(model, parm = "height")
##         2.5 % 97.5 %
## height 0.7333  1.008

The slope tells us how much the average weight (in kg) increases per cm of height.
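As in 7.1(d), this interval can be reproduced by hand from the printed estimate and standard error. A sketch, not from the text, using df.residual() to pull the residual degrees of freedom from the fitted model:

t.star <- qt(0.975, df = df.residual(model))
0.8707 + c(-1, 1) * t.star * 0.06994   # should match the confint() output above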

c)

f <- makefun(model)
# in kg
f(6 * 12 * 2.54, interval = "confidence")
##     fit   lwr   upr
## 1 89.53 87.98 91.09
# in pounds
f(6 * 12 * 2.54, interval = "confidence") * 2.2
##   fit   lwr   upr
## 1 197 193.5 200.4

d)

xyplot(resid(model) ~ fitted(model))
qqmath(~resid(model))

[Figure: residuals vs fitted values and a normal quantile plot of the residuals.]

We could also have used

plot(model, which = 1)
plot(model, which = 2)

[Figure: Residuals vs Fitted and Normal Q-Q plots for lm(weight ~ height).]

The residual plot looks fine. There is a bit of a bend to the normal-quantile plot, indicating that the distribution of residuals is a bit skewed to the right (the heaviest men are farther above the mean weight for their height than the lightest men are below). In this particular case, a log transformation of the weights improves the residual distribution. There is still one man whose weight is quite high for his height, but otherwise things look quite good.

model2 <- lm(log(weight) ~ height, data = Men)
coef(summary(model2))
##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)  2.51348  0.1448876   17.35 2.146e-54
## height       0.01081  0.0008344   12.95 8.626e-34
xyplot(resid(model2) ~ fitted(model2))
qqmath(~resid(model2))

[Figure: residual and normal quantile plots for the log-transformed model.]

This model says that

\[ \log(\text{weight}) = 2.51 + 0.011 \cdot \text{height} \]

so

\[ \text{weight} = 12.3 \cdot (1.011)^{\text{height}} \]
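The numbers 12.3 and 1.011 come from exponentiating the fitted coefficients; this one-liner (not in the original) shows where they come from.

exp(coef(model2))   # approximately 12.3 (intercept) and 1.011 (per cm of height)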

7.5

Anscombe's data show that it is not sufficient to look only at the numerical summaries produced by regression software. His four data sets produce nearly identical output from lm() and anova(), yet show very different fits. An inspection of the residuals (or even simple scatterplots) quickly reveals the various difficulties.

7.7

data(xmp12.06, package = "Devore7")
sewage.model <- lm(moistcon ~ filtrate, data = xmp12.06)
# a)
coef(summary(sewage.model))  # this includes standard errors; could also use summary()
##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept) 72.95855   0.697528 104.596 1.616e-26
## filtrate     0.04103   0.004837   8.484 1.052e-07
# b)
resid(sewage.model)[1]
##       1
## -0.2001
# c)
sigma(sewage.model)  # can also be read off of the summary() output
## [1] 0.6653
# d)
confint(sewage.model)[2, , drop = FALSE]
##            2.5 % 97.5 %
## filtrate 0.03087 0.0512
# e)
estimated.moisture <- makefun(sewage.model)
estimated.moisture(170, interval = "confidence")
##     fit  lwr   upr
## 1 79.93 79.5 80.36

8.3

a) \(\log(y) = \log(a) + x \log(b)\), so \(g(y) = \log(y)\), \(f(x) = x\), \(\beta_0 = \log(a)\), and \(\beta_1 = \log(b)\).

b) \(\log(y) = \log(a) + b \log(x)\), so \(g(y) = \log(y)\), \(f(x) = \log(x)\), \(\beta_0 = \log(a)\), and \(\beta_1 = b\).

c) \(\frac{1}{y} = a + bx\), so \(g(y) = \frac{1}{y}\), \(f(x) = x\), \(\beta_0 = a\), and \(\beta_1 = b\).

d) \(\frac{1}{y} = a \cdot \frac{1}{x} + b\), so \(g(y) = \frac{1}{y}\), \(f(x) = \frac{1}{x}\), \(\beta_0 = b\), and \(\beta_1 = a\).

e) not possible

f) \(\frac{1}{y} = 1 + e^{a+bx}\), so \(\log(\frac{1}{y} - 1) = \log(\frac{1-y}{y}) = a + bx\), so \(g(y) = \log\left(\frac{1-y}{y}\right)\), \(f(x) = x\), \(\beta_0 = a\), and \(\beta_1 = b\).

g) \(\frac{100}{y} = 1 + e^{a+bx}\), so \(\log(\frac{100}{y} - 1) = \log(\frac{100-y}{y}) = a + bx\), so \(g(y) = \log\left(\frac{100-y}{y}\right)\), \(f(x) = x\), \(\beta_0 = a\), and \(\beta_1 = b\).

8.5

Moving up the ladder will spread the larger values more than the smaller values, resulting in a distribution that is right skewed.

8.6

At first glance, the two models might appear equally good. In each case the power is a bit below 2 and the fits look good on top of the raw data.

model <- lm(log(period) ~ log(length), data = Pendulum)
model2 <- nls(period ~ A * length^power, data = Pendulum,
              start = list(A = 1, power = 2))
f <- makefun(model)
g <- makefun(model2)
xyplot(period ~ length, data = Pendulum)
plotfun(f(x) ~ x, col = "gray50", add = TRUE)
xyplot(period ~ length, data = Pendulum)
plotfun(g(x) ~ x, col = "red", lty = 2, add = TRUE)
coef(summary(model))
##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)   0.7207   0.006361   113.3 2.027e-35
## log(length)   0.4785   0.003938   121.5 3.536e-36
coef(summary(model2))
##       Estimate Std. Error t value  Pr(>|t|)
## A       2.0470   0.022027   92.93 2.844e-33
## power   0.4827   0.004829   99.96 4.611e-34

[Figure: period vs length with the two fitted curves overlaid.]

But if we look at the residuals, we see that the linear model is clearly better in this case. The non-linear model suffers from heteroskedasticity.

xyplot(resid(model) ~ fitted(model), type = c("p", "smooth"))
xyplot(resid(model2) ~ fitted(model2), type = c("p", "smooth"))

[Figure: residuals vs fitted values for the linear and non-linear models.]

Both residual distributions are reasonably close to normal, but not perfect. In the ordinary least squares model, the largest few residuals are not as large as we would expect.

qqmath(~resid(model))
qqmath(~resid(model2))

[Figure: normal quantile plots of the residuals for both models.]

The residuals in the non-linear model show a clear change in variance as the fitted value increases. This is counteracted by the logarithmic transformation in the linear model. (In other cases, the non-linear model might have the preferred residual distribution.)

The estimated power based on the linear model is 0.478 ± 0.004.

8.7

Using Tukey's bulge rules it is pretty easy to land at something like one of the following:

xyplot(log(pressure) ~ temperature, data = Pressure)
## Error in eval(substitute(groups), data, environment(x)): object 'Pressure' not found
xyplot(log(pressure) ~ log(temperature), data = Pressure)
## Error in eval(substitute(groups), data, environment(x)): object 'Pressure' not found

Neither of these is perfect (although both are much more linear than the original untransformed relationship). But if you think in degrees Kelvin instead, you might find a much better transformation.

xyplot(log(pressure) ~ (273.15 + temperature), data = Pressure)
## Error in eval(substitute(groups), data, environment(x)): object 'Pressure' not found
xyplot(log(pressure) ~ log(273.15 + temperature), data = Pressure)
## Error in eval(substitute(groups), data, environment(x)): object 'Pressure' not found
xyplot(log(pressure) ~ 1/(273.15 + temperature), data = Pressure)
## Error in eval(substitute(groups), data, environment(x)): object 'Pressure' not found

Ah, that last one looks quite good. Let's try that one.

model <- lm(log(pressure) ~ I(1/(273.15 + temperature)), data = Pressure)
## Error in is.data.frame(data): object 'Pressure' not found
plot(model, 1:2)

[Figure: Residuals vs Fitted and Normal Q-Q plots; the panel labels read lm(log(period) ~ log(length)), since model still refers to the pendulum fit from 8.6 after the errors above.]

Things still aren't perfect. There's a bit of a bow in the residual plot, but the size of the residuals is quite small relative to the scale of the log-of-pressure values:

range(~resid(model))
## [1] -0.05675  0.04833
range(~log(pressure), data = Pressure)
## Error in range(~log(pressure), data = Pressure): object 'Pressure' not found

Also, the residual for the fourth observation is quite a bit larger than the rest.

8.9

a) We can build a confidence interval for the mean by fitting a model with only an intercept term.

grades <- ACTgpa
confint(lm(ACT ~ 1, data = grades))
##             2.5 % 97.5 %
## (Intercept) 24.25   27.9

But this isn't the only way to do it. Here are some other ways.

# here's another way to do it; but you don't need to know about it
t.test(grades$ACT)
##
##  One Sample t-test
##
## data:  grades$ACT
## t = 29, df = 25, p-value <2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  24.25 27.90
## sample estimates:
## mean of x
##     26.08

# or you can do it by hand
x.bar <- mean(~ACT, data = grades)
x.bar
## [1] 26.08
n <- nrow(grades)
n
## [1] 26
t.star <- qt(0.975, df = n - 1)
t.star
## [1] 2.06
SE <- sd(~ACT, data = grades)/sqrt(n)
SE
## [1] 0.8874
ME <- t.star * SE
ME
## [1] 1.828

So the CI is 26.077 ± 1.828. Of course, that is too many digits; we should round to 26.1 ± 1.8.

b)

confint(lm(GPA ~ 1, data = grades))
##             2.5 % 97.5 %
## (Intercept) 3.185  3.579
# this could also be done the other ways shown above.

c)

grades.model <- lm(GPA ~ ACT, data = grades)
f <- makefun(grades.model)
f(ACT = 25, interval = "confidence")
##     fit   lwr  upr
## 1 3.288 3.167 3.41

d)

f(ACT = 30, interval = "prediction")
##     fit   lwr   upr
## 1 3.723 3.101 4.346

e)

xyplot(GPA ~ ACT, data = grades, panel = panel.lmbands)
xyplot(resid(grades.model) ~ fitted(grades.model), type = c("p", "smooth"))

[Figure: GPA vs ACT with regression bands, and residuals vs fitted values.]

There are no major concerns with the regression model. The residuals look pretty good. (There is perhaps a bit more variability in GPA for the lower ACT scores, and if you said you were worried about that, I would not argue.) The prediction intervals are very wide and hardly useful, however. It's pretty hard to give a precise estimate for an individual person; there's just too much variability from person to person, even among people with the same ACT score.

8.10

The best of these four models is a model that says drag is proportional to the square of velocity. Given the design of the experiment, it makes the most sense to fit velocity as a function of drag force. Here are several ways we could do the fit:

model1 <- lm(velocity^2 ~ force.drag, data = Drag)
model2 <- lm(velocity ~ sqrt(force.drag), data = Drag)
model3 <- lm(log(velocity) ~ log(force.drag), data = Drag)
coef(summary(model1))
##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept) -0.06227   0.221931 -0.2806 7.805e-01
## force.drag   0.08400   0.002321 36.1966 3.449e-32

coef(summary(model2))
##                  Estimate Std. Error t value  Pr(>|t|)
## (Intercept)      -0.03586   0.054832 -0.6539 5.169e-01
## sqrt(force.drag)  0.29098   0.006807 42.7478 5.239e-35
coef(summary(model3))
##                 Estimate Std. Error t value  Pr(>|t|)
## (Intercept)      -1.1623    0.04540  -25.60 2.021e-26
## log(force.drag)   0.4745    0.01241   38.23 4.104e-33

Note that model1, model2, and model3 are not equivalent, but they all tell roughly the same story.

xyplot(velocity ~ force.drag, data = Drag)
f1 <- makefun(model1)
f2 <- makefun(model2)
f3 <- makefun(model3)
plotfun(sqrt(f1(x)) ~ x, add = TRUE, alpha = 0.4)
plotfun(f2(x) ~ x, add = TRUE, alpha = 0.4, col = "red")
plotfun(exp(f3(x)) ~ x, add = TRUE, alpha = 0.4, col = "brown")

[Figure: velocity vs drag force with the three fitted curves overlaid.]

The fit for these models reveals some potential errors in the design of this experiment. Separating out the data by the height used to determine velocity suggests that perhaps some of the velocity measurements are not yet at terminal velocity. In both groups, the velocities for the greatest drag forces are not as fast as the pattern of the remaining data would lead us to expect.

xyplot(velocity^2 ~ force.drag, data = Drag, groups = height)
plot(model1, w = 1)
xyplot(log(velocity) ~ log(force.drag), data = Drag, groups = height)
xyplot(velocity ~ force.drag, data = Drag, groups = height,
       scales = list(log = TRUE))
plot(model3, w = 1)

[Figure: grouped scatterplots (by height) and residuals vs fitted plots for model1 and model3.]

8.11

See the previous problem.

9.1

a) The normal-quantile plot looks pretty good for a sample of this size. The residual plot is perhaps not quite as noisy as we would like (the first few residuals cluster above zero, then the next few below), but it is not terrible either. Ideally we would like to have a larger data set to see whether this pattern persists or whether things look more noisy as we fill in with more data.

b) A is a measure of background radiation levels; it is the number of clicks we would get if our test substance were at infinity, i.e., so far away (or beyond some shielding) that it does not affect the Geiger counter.

c) 114 ± 11

d) \(5.5 \times 10^{-6}\). This gives strong evidence that there is some background radiation being measured by the Geiger counter.

e) k measures the rate at which our test substance is emitting radioactive particles. If k is 0, then our substance is not radioactive (or at least not being detected by the Geiger counter).

f) 31.5 ± 1.0

g) \(1.1 \times 10^{-9}\). We have strong evidence that our substance is decaying and contributing to the Geiger counter clicks.

h) \(H_0: k = 29.812\); \(H_a: k \neq 29.812\)

t <- (31.5 - 29.812)/1
t  # test statistic
## [1] 1.688
2 * (1 - pt(t, df = 8))  # p-value
## [1] 0.1299

Conclusion: with a p-value this large, we cannot reject the null hypothesis. Our data are consistent with the hypothesis that our test substance is the same as the standard substance.

9.2

We can use the product of conditional probabilities:

\[ \frac{1}{4} \cdot \frac{1}{3} \cdot \frac{1}{2} \cdot 1 = \frac{1}{24} \]

Since the probability of guessing the first glass correctly is \(\frac14\), the probability of guessing the second correctly given the first was guessed correctly is \(\frac13\); the probability of guessing the third correctly given the first two were guessed correctly is \(\frac12\); and if the first three are guessed correctly, the last is guaranteed to be correct.

Alternative solution: the wines can be presented in \(4 \cdot 3 \cdot 2 \cdot 1 = 24\) different orders, so the probability of guessing correctly is 1/24.

9.3

Let's remove the fastest values at each height setting. Although they are the fastest, it appears that terminal velocity has not yet been reached. At least, these points would fit the overall pattern better if the velocity were larger.

xyplot(force.drag ~ velocity, data = drag,
       groups = (velocity < 3.9) & !(velocity > 1 & velocity < 1.5))
drag2 <- subset(drag, (velocity < 3.9) & !(velocity > 1 & velocity < 1.5))

[Figure: force.drag vs velocity, with the removed points highlighted.]

drag2.model <- lm(log(force.drag) ~ log(velocity), data = drag2)
summary(drag2.model)
##
## Call:
## lm(formula = log(force.drag) ~ log(velocity), data = drag2)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -0.3158 -0.1050  0.0179  0.0801  0.2542
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)     2.3659     0.0272    86.9   <2e-16
## log(velocity)   2.1142     0.0368    57.5   <2e-16
##
## Residual standard error: 0.137 on 28 degrees of freedom
## Multiple R-squared: 0.992, Adjusted R-squared: 0.991
## F-statistic: 3.3e+03 on 1 and 28 DF,  p-value: <2e-16

plot(drag2.model, w = 1:2)

[Figure: Residuals vs Fitted and Normal Q-Q plots for lm(log(force.drag) ~ log(velocity)).]

The model is still not as good as we might like, and it seems like the fit is different for the heavier objects than for the lighter ones. This could be due to some flaw in the design of the experiment or because drag force actually behaves differently at low speeds vs. higher speeds. Notice the data suggest an exponent on velocity that is just a tiny bit larger than 2:

confint(drag2.model)
##               2.5 % 97.5 %
## (Intercept)   2.310  2.422
## log(velocity) 2.039  2.190

So our data (for whatever reason, potentially still due to design issues) are not compatible with the hypothesis that this exponent should be 2.

9.4

SE <- 1.2 / sqrt(16); SE
## [1] 0.3
t <- (94.32 - 95) / SE; t
## [1] -2.267
p_val <- 2 * pt(t, df = 15); p_val  # p-value
## [1] 0.03863

95 will be outside the confidence interval because the p-value is less than 0.05. As a double check, here is the 95% confidence interval.

t_star <- qt(0.975, df = 15)
t_star
## [1] 2.131
94.32 + c(-1, 1) * t_star * SE
## [1] 93.68 94.96

9.5

a) In this situation it makes sense to do a one-sided test, since we will reject the alloy only if the amount of expansion is too high, not if it is too low.

SE <- 5.9/sqrt(42); SE
## [1] 0.9104
t <- (73.1 - 75) / SE; t
## [1] -2.087
pt(t, df = 41)
## [1] 0.02157

b) Corresponding to a one-sided test with a significance level of \(\alpha = 0.01\), we could make a 98% confidence interval (1% in each tail). The upper end of this would be the same as for a one-sided 99% confidence interval.

t_star <- qt(0.99, df = 41)
73.1 + c(-1, 1) * t_star * SE
## [1] 70.9 75.3

c) If all we are interested in is a decision, the two methods are equivalent. A confidence interval might be easier to explain to someone less familiar with statistics.
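As a quick illustration of that equivalence (a sketch, not from the text), the one-sided p-value from part (a) and the 98% interval from part (b) lead to the same decision at \(\alpha = 0.01\):

SE <- 5.9 / sqrt(42)
t <- (73.1 - 75) / SE
pt(t, df = 41) < 0.01                 # FALSE: p-value exceeds alpha
75 < 73.1 + qt(0.99, df = 41) * SE    # TRUE: 75 falls inside the interval

Both say the same thing: the p-value is above 0.01 and 75 lies inside the interval, so we cannot reject the null hypothesis at the 1% level.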