Lecture 5 STATS/CME 195
Contents
- Hypothesis testing
- Linear regression
- Sparse regression
Hypothesis testing
Exploratory vs. confirmatory data analysis
Two approaches to analyzing data sets:
- Exploratory: use plotting, transformations and summaries to explore the data and formulate hypotheses.
- Confirmatory: represent the data as random variables, formulate hypotheses and test whether they are consistent with the model assumptions.
Traditionally, statistics focused more on hypothesis testing. John Tukey wrote the book Exploratory Data Analysis in 1977. Tukey's championing of exploratory analysis encouraged the development of statistical computing packages, such as S, the precursor of R, from Bell Labs.
Examples of hypotheses
- Is the measured quantity equal to/higher/lower than a given threshold? e.g. is the number of faulty items in an order statistically higher than the one guaranteed by the manufacturer?
- Is there a difference between two groups or observations? e.g. do treated patients have a higher survival rate than untreated ones?
- Is the level of one quantity related to the value of another quantity? e.g. is hyperactivity related to eating sugar? Is lung cancer related to smoking?
How to perform a hypothesis test
0. Start from a research hypothesis whose truth is unknown.
1. Formally define the null and alternative hypotheses.
2. Choose the level of significance $\alpha$.
3. Make statistical assumptions about the distribution of the observations.
4. Pick and compute a test statistic.
5. Derive the distribution of the test statistic under the null hypothesis from the assumptions.
6. Compute the p-value: the probability, under the null hypothesis, of sampling a test statistic at least as extreme as the one observed.
7. Decide whether to reject the null hypothesis by comparing the p-value to $\alpha$.
8. Draw conclusions from the test. (A worked sketch of steps 4-7 follows below.)
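As a concrete illustration of steps 4-7, here is a minimal sketch in base R that computes a one-sample t statistic and its p-value by hand on simulated data (the sample size and effect size are made-up illustrative values):

# Simulated sample: n = 30 observations with true mean 0.5
set.seed(1)
x <- rnorm(30, mean = 0.5, sd = 1)

# Step 4: the test statistic for H0: mu = 0
t.stat <- (mean(x) - 0) / (sd(x) / sqrt(length(x)))

# Steps 5-6: under H0 the statistic follows a t distribution with
# n - 1 degrees of freedom; the two-sided p-value follows from its CDF
p.value <- 2 * pt(-abs(t.stat), df = length(x) - 1)

# Step 7: reject H0 at level alpha = 0.05 if the p-value is smaller
p.value < 0.05
# t.test(x, mu = 0) reproduces the same statistic and p-value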
Null and alternative hypotheses
- Null hypothesis ($H_0$): a statement assumed to be true unless it can be shown to be incorrect beyond a reasonable doubt (at level $\alpha$). This is something one usually attempts to disprove or discredit.
- Alternative hypothesis ($H_1$): a claim that is contradictory to $H_0$ and what we conclude when we reject $H_0$.
This setting is asymmetric: you do not accept the null hypothesis, but you may fail to reject it.
Student's t-test
Student's t-test: one test with many applications
- In general, used when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known.
- William Gosset (1908), a chemist at the Guinness brewery, published it in Biometrika under the pseudonym "Student".
- Originally used to select the best-yielding varieties of barley.
- Now one of the standard/traditional methods for hypothesis testing.
Among the typical applications:
- Comparing a population mean to a constant value
- Comparing the means of two populations
- Comparing the slope of a regression line to a constant
Distribution of the statistic
If $X_1, \dots, X_n$ are independent and $X_i \sim N(\mu, \sigma^2)$, then with

$$\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i, \qquad S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2,$$

the statistic

$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{\nu = n-1}.$$
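A quick simulation (a sketch; the choices of n, mu and sigma are arbitrary) confirms that $T$ follows the t distribution with $n - 1$ degrees of freedom:

set.seed(1)
n <- 10; mu <- 2; sigma <- 3          # arbitrary illustrative values
T.sim <- replicate(10000, {
  x <- rnorm(n, mean = mu, sd = sigma)
  (mean(x) - mu) / (sd(x) / sqrt(n))  # the statistic T for one sample
})
# Simulated quantiles should match the theoretical t quantiles
quantile(T.sim, c(0.025, 0.5, 0.975))
qt(c(0.025, 0.5, 0.975), df = n - 1)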
P-values
The p-value is the probability of obtaining the observed result, or a more extreme one, assuming that the null hypothesis is true. A small p-value indicates strong evidence against the null hypothesis, so you reject the null hypothesis.
Two-sided test of the mean
Is the mean flight arrival delay statistically equal to 0? Test the null hypothesis

$$H_0: \mu = \mu_0 = 0 \quad \text{vs.} \quad H_1: \mu \neq \mu_0 = 0,$$

where $\mu$ is the average arrival delay.
Testing mean flight delay

library(tidyverse)
library(nycflights13)
mean(flights$arr_delay, na.rm = TRUE)
## [1] 6.895377

Is this statistically significant?

(tt = t.test(x = flights$arr_delay, mu = 0, alternative = "two.sided"))
##
##  One Sample t-test
##
## data:  flights$arr_delay
## t = 88.39, df = 327340, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  6.742478 7.048276
## sample estimates:
## mean of x
##  6.895377
Looking inside a t.test object
The function t.test returns an object containing the following components:

names(tt)
## [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"
## [6] "null.value"  "alternative" "method"      "data.name"

tt$p.value  # The p-value
## [1] 0

tt$conf.int  # The 95% confidence interval for the mean
## [1] 6.742478 7.048276
## attr(,"conf.level")
## [1] 0.95
One-sided test of the mean
Test the null hypothesis

$$H_0: \mu = \mu_0 = 0 \quad \text{vs.} \quad H_1: \mu < \mu_0 = 0.$$

t.test(x, mu = 0, alternative = "less")

A one-sided test can be more powerful, but the interpretation is more difficult.
Testing mean flight delay (II)
Is the average delay 0 or is it lower?

(tt = t.test(x = flights$arr_delay, mu = 0, alternative = "less"))
##
##  One Sample t-test
##
## data:  flights$arr_delay
## t = 88.39, df = 327340, p-value = 1
## alternative hypothesis: true mean is less than 0
## 95 percent confidence interval:
##      -Inf 7.023694
## sample estimates:
## mean of x
##  6.895377

Failure to reject is not acceptance of the null hypothesis.
Testing difference between groups
Are average arrival delays the same in winter and summer? Test the null hypothesis

$$H_0: \mu_A = \mu_B \quad \text{vs.} \quad H_1: \mu_A \neq \mu_B,$$

where $\mu_A$ is the mean of group A and $\mu_B$ is the mean of group B. The t.test function can also perform this test: t.test(x, y)
Seasonal differences in flight delay (I)

flights %>%
  mutate(season = cut(month, breaks = c(0, 3, 6, 9, 12))) %>%
  ggplot(aes(x = season, y = arr_delay)) +
  geom_boxplot(alpha = 0.1) +
  xlab("season") +
  ylab("arrival delay")
Seasonal differences in flight delay (II)

flights %>%
  filter(arr_delay < 120) %>%
  mutate(season = cut(month, breaks = c(0, 3, 6, 9, 12))) %>%
  ggplot(aes(x = season, y = arr_delay)) +
  geom_boxplot(alpha = 0.01) +
  xlab("season") +
  ylab("arrival delay")
Testing seasonal differences in flight delay

flights.winter = filter(flights, month %in% c(1, 2, 3))
flights.summer = filter(flights, month %in% c(7, 8, 9))
t.test(x = flights.winter$arr_delay, y = flights.summer$arr_delay)
##
##  Welch Two Sample t-test
##
## data:  flights.winter$arr_delay and flights.summer$arr_delay
## t = -2.4383, df = 161250, p-value = 0.01476
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.9780344 -0.1063691
## sample estimates:
## mean of x mean of y
##  5.857851  6.400052

Are the assumptions of the test valid?
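The t-test assumes independent observations and approximately normal sample means; with samples this large the central limit theorem handles non-normality reasonably well, but it is still worth inspecting the shape of the data. A sketch using a normal quantile-quantile plot on a subsample (base graphics):

set.seed(1)
delays <- na.omit(flights.winter$arr_delay)
s <- sample(delays, 5000)  # subsample for a readable plot
qqnorm(s)                  # departures from the line indicate non-normality
qqline(s)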
Linear Regression
Linear regression
Regression is a supervised learning method whose goal is to infer the relationship between input data $X$ and a continuous response variable $y$. Linear regression is a type of regression where $y$ is modeled as a linear function of $X$.
Simple linear regression predicts the output $y$ from a single predictor $x$:

$$y = \beta_0 + \beta_1 x + \epsilon$$

Multiple linear regression assumes $y$ depends on many covariates:

$$y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \epsilon$$

Here $\epsilon$ denotes a random noise term with zero mean and independent components.
Objective function
Linear regression seeks a solution $\hat{y} = X\hat{\beta}$ that minimizes the difference between the true outcome $y$ and the prediction $\hat{y}$, in terms of the residual sum of squares (RSS):

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^n \left( y^{(i)} - \beta^T x^{(i)} \right)^2$$
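For a full-rank design the minimizer has the closed-form solution $\hat{\beta} = (X^T X)^{-1} X^T y$. A minimal sketch on simulated data (variable names and coefficients are illustrative) checks that lm agrees with the normal equations:

set.seed(1)
x <- runif(100)
y <- 1 + 2 * x + rnorm(100, sd = 0.1)
Xmat <- cbind(1, x)                     # design matrix with an intercept column
solve(t(Xmat) %*% Xmat, t(Xmat) %*% y)  # normal-equations solution
coef(lm(y ~ x))                         # lm gives the same estimates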
Example of simple linear regression
Predict the logarithm of diamond price using the logarithm of its weight.

mod <- lm(log(price) ~ log(carat), data = diamonds)
library(modelr)
diamonds.mod <- diamonds %>%
  add_predictions(mod) %>%
  mutate(pred = exp(pred), resid = price - pred)
ggplot(diamonds.mod) +
  geom_point(aes(x = carat, y = price), alpha = 0.1) +
  geom_line(color = 'red', aes(x = carat, y = pred))
Simple linear regression

mod <- lm(log(price) ~ log(carat), data = diamonds)
summary(mod)
##
## Call:
## lm(formula = log(price) ~ log(carat), data = diamonds)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.50833 -0.16951 -0.00591  0.16637  1.33793
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.448661   0.001365  6190.9   <2e-16 ***
## log(carat)  1.675817   0.001934   866.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2627 on 53938 degrees of freedom
## Multiple R-squared:  0.933,  Adjusted R-squared:  0.933
## F-statistic: 7.51e+05 on 1 and 53938 DF,  p-value: < 2.2e-16
Looking inside a lm object
The coefficients ($\hat{\beta}$) of the fitted model:

(beta.hat <- coef(summary(mod)))
##             Estimate  Std. Error   t value Pr(>|t|)
## (Intercept) 8.448661 0.001364691 6190.8959        0
## log(carat)  1.675817 0.001933806  866.5901        0

Predicted values ($\hat{y}$) for the existing observations:

head(predict(mod))
##        1        2        3        4        5        6
## 5.985753 5.833301 5.985753 6.374210 6.485973 6.057075
Making predictions
Alternatively, add predictions with the modelr package:

library(modelr)
diamonds %>%
  add_predictions(mod) %>%
  mutate(pred = exp(pred))

Predictions for new observations:

new.diamonds = data.frame(carat = c(0.2, 0.5, 1, 2, 5))
predict(mod, new.diamonds)
##         1         2         3         4         5
##  5.751538  7.287073  8.448661  9.610248 11.145784
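Since the model was fit on log(price), exponentiating returns the predictions to the dollar scale (a standard caveat: this back-transformation estimates the median rather than the mean price under a log-linear model):

exp(predict(mod, new.diamonds))  # predicted prices in dollars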
Regression t-tests
Statistical significance of the coefficients is assessed with t-tests:

summary(mod)$coefficients
##             Estimate  Std. Error   t value Pr(>|t|)
## (Intercept) 8.448661 0.001364691 6190.8959        0
## log(carat)  1.675817 0.001933806  866.5901        0

Plot the lm fit and its uncertainty with geom_smooth(method = "lm"):
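For example, a sketch on the diamonds data:

ggplot(diamonds, aes(x = log(carat), y = log(price))) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm")  # fitted line plus its confidence band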
Multiple linear regression
We might like to predict log-price using log-carat and cut.

mod.2 <- lm(log(price) ~ log(carat) + cut, data = diamonds)
summary(mod.2)
##
## Call:
## lm(formula = log(price) ~ log(carat) + cut, data = diamonds)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.52247 -0.16484 -0.00587  0.16087  1.38115
##
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)
## (Intercept)  8.392010   0.001735 4835.551  < 2e-16 ***
## log(carat)   1.695771   0.001910  887.679  < 2e-16 ***
## cut.L        0.224330   0.004401   50.970  < 2e-16 ***
## cut.Q       -0.066427   0.003895  -17.054  < 2e-16 ***
## cut.C        0.052895   0.003402   15.550  < 2e-16 ***
## cut^4        0.018632   0.002734    6.814  9.6e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interaction terms
An interaction occurs when an independent variable has a different effect on the outcome depending on the value of another independent variable. For example, one variable $x_1$ might have a different effect on $y$ within different categories or groups, given by a variable $x_2$. With lm, an asterisk in the formula generates the interaction terms (see the sketch below).
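In R's formula syntax, a * b expands to both main effects plus their interaction, while a : b denotes the interaction term alone. A sketch (this particular pair of predictors is illustrative):

# These two specifications fit the same model:
lm(log(price) ~ log(carat) * cut, data = diamonds)
lm(log(price) ~ log(carat) + cut + log(carat):cut, data = diamonds)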
Linear regression with interaction terms

mod.2 <- lm(log(price) ~ log(carat) + cut*clarity, data = diamonds)
summary(mod.2)
##
## Call:
## lm(formula = log(price) ~ log(carat) + cut * clarity, data = diamonds)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.0355 -0.1171  0.0113  0.1226  2.0503
##
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)
## (Intercept)  8.478446   0.002530 3350.900  < 2e-16 ***
## log(carat)   1.812383   0.001500 1208.244  < 2e-16 ***
## cut.L        0.110090   0.006962   15.813  < 2e-16 ***
## cut.Q       -0.020511   0.006129   -3.346 0.000819 ***
## cut.C        0.013001   0.004729    2.750 0.005970 **
## cut^4        0.009691   0.003638    2.664 0.007720 **
## clarity.L    0.894861   0.009346   95.749  < 2e-16 ***
## clarity.Q   -0.211756   0.008635  -24.522  < 2e-16 ***
Linear regression with interaction terms (II)
You can also specify explicitly which terms you want:

mod.2 <- lm(log(price) ~ log(carat) + cut:clarity, data = diamonds)
summary(mod.2)
##
## Call:
## lm(formula = log(price) ~ log(carat) + cut:clarity, data = diamonds)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.0355 -0.1171  0.0113  0.1226  2.0503
##
## Coefficients: (1 not defined because of singularities)
##                         Estimate Std. Error  t value Pr(>|t|)
## (Intercept)             8.878400   0.005487 1618.114  < 2e-16 ***
## log(carat)              1.812383   0.001500 1208.244  < 2e-16 ***
## cutFair:clarityI1      -1.295617   0.013938  -92.957  < 2e-16 ***
## cutGood:clarityI1      -1.093947   0.019693  -55.549  < 2e-16 ***
## cutVery Good:clarityI1 -1.052551   0.020956  -50.226  < 2e-16 ***
## cutPremium:clarityI1   -1.099240   0.014074  -78.105  < 2e-16 ***
## cutIdeal:clarityI1     -0.907525   0.016296  -55.691  < 2e-16 ***
## cutFair:claritySI2     -0.761500   0.010207  -74.609  < 2e-16 ***
Sparse Regression
Linear regression with many covariates
Many modern datasets have many more covariates than observations: $p \gg n$. Example: genome-wide association studies often have $p \approx 10^6$ and $n \approx 10^4$. When $p > n$, the linear regression estimate is not well-defined and inference is not easy.
The assumption of sparsity: the number of available covariates is extremely large, but only a handful of them are relevant for predicting the outcome.
Sparse linear regression
Lasso regression is simply linear regression with an $L_1$ penalty:

$$\hat{\beta} = \arg\min_{\beta} \left\{ \frac{1}{n} \sum_{i=1}^n \left( y^{(i)} - \beta^T x^{(i)} \right)^2 + \lambda \|\beta\|_1 \right\}$$

The $L_1$ norm $\|\beta\|_1 = \sum_{j=1}^p |\beta_j|$ promotes sparsity: the solution $\hat{\beta}$ usually has only a small number of non-zero coefficients. The number of non-zero coefficients depends on the choice of the tuning parameter $\lambda$.
The glmnet package

# install.packages("glmnet")
library(glmnet)

Lasso regression is implemented in the R package glmnet. An introductory tutorial to the package can be found here: https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html
The glmnet function provided by the package computes the lasso regression for a whole sequence of different $\lambda$ values.
Fitting the lasso
glmnet does not work with data frames; it requires numeric input: glmnet(x, y).

diamonds.log <- diamonds %>%
  mutate(price = log10(price), carat = log10(carat))
X = diamonds.log[, !(names(diamonds.log) %in% c("price"))]
y = diamonds.log[, (names(diamonds.log) %in% c("price"))]
y = data.matrix(y)
head(data.matrix(X))
##           carat cut color clarity depth table    x    y    z
## [1,] -0.6382722   5     2       2  61.5    55 3.95 3.98 2.43
## [2,] -0.6777807   4     2       3  59.8    61 3.89 3.84 2.31
## [3,] -0.6382722   2     2       5  56.9    65 4.05 4.07 2.31
## [4,] -0.5376020   4     6       4  62.4    58 4.20 4.23 2.63
## [5,] -0.5086383   2     7       2  63.3    58 4.34 4.35 2.75
## [6,] -0.6197888   3     7       6  62.8    57 3.94 3.96 2.48
Dummy variables for categorical predictors
Create dummy variables for all categorical predictors:

X <- model.matrix(as.formula(
  "log(price) ~ log(carat) + cut + clarity + color"), diamonds)
colnames(X)
##  [1] "(Intercept)" "log(carat)"  "cut.L"       "cut.Q"       "cut.C"
##  [6] "cut^4"       "clarity.L"   "clarity.Q"   "clarity.C"   "clarity^4"
## [11] "clarity^5"   "clarity^6"   "clarity^7"   "color.L"     "color.Q"
## [16] "color.C"     "color^4"     "color^5"     "color^6"

Now we can call glmnet:

fit = glmnet(X, y)
Plotting the lasso path

plot(fit, label = TRUE)

- The y-axis corresponds to the values of the coefficients.
- The x-axis is denoted "L1 Norm" and is (inversely) related to $\lambda$.
Lasso coefficient estimates
The computed lasso coefficients for a particular choice of $\lambda$:

coef(fit, s = 0.02)  # Lambda = 0.02
## 20 x 1 sparse Matrix of class "dgCMatrix"
##                       1
## (Intercept)  3.68239956
## (Intercept)  .
## log(carat)   0.73994621
## cut.L        .
## cut.Q        .
## cut.C        .
## cut^4        .
## clarity.L    0.20919153
## clarity.Q    .
## clarity.C    .
## clarity^4    .
## clarity^5    .
## clarity^6    .
## clarity^7    .
## color.L     -0.07951473
## color.Q      .
Predictions from lasso estimates
As with lm(), we can use the predict() function to predict the log-price for the training or the test data.

# Predict for the train set
head(predict(fit, newx = X, s = c(0.02, 0.1)))
##          1        2
## 1 2.544275 2.783346
## 2 2.509239 2.732693
## 3 2.641112 2.783346
## 4 2.720246 2.912415
## 5 2.690009 2.949549
## 6 2.629748 2.807044

Each column corresponds to one choice of $\lambda$.
Choosing λ
To choose $\lambda$ you can use cross-validation: the cv.glmnet() function performs k-fold cross-validation. In k-fold cross-validation the original sample is randomly partitioned into k equal-sized subsamples; each subsample is retained once as validation data for testing the model, while the remaining k − 1 subsamples are used as training data. (A hand-written sketch of the mechanics follows below.)
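A minimal sketch of the k-fold mechanics, written out by hand with glmnet and the X, y built earlier (cv.glmnet automates all of this; the fold assignment and the value of lambda below are arbitrary illustrative choices):

set.seed(1)
k <- 5
fold <- sample(rep(1:k, length.out = nrow(X)))  # random fold labels
mse <- numeric(k)
for (i in 1:k) {
  fit.i <- glmnet(X[fold != i, ], y[fold != i])          # train on k-1 folds
  pred  <- predict(fit.i, newx = X[fold == i, ], s = 0.02)
  mse[i] <- mean((y[fold == i] - pred)^2)                # validate on the held-out fold
}
mean(mse)  # cross-validated MSE at lambda = 0.02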
Cross-validation with glmnet

set.seed(1)  # Seed for the random number generator
cvfit <- cv.glmnet(X, y, nfolds = 5)
plot(cvfit)

The red dots are the average MSE over the k folds. The two chosen $\lambda$ values are the one achieving the minimum MSE and the largest one whose MSE is within one standard error of the minimum.
Cross-validation with glmnet (II)
Value of $\lambda$ with minimum MSE:

cvfit$lambda.min
## [1] 0.000478123

Largest $\lambda$ such that the MSE is within one standard error of the minimum:

cvfit$lambda.1se
## [1] 0.000761307
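Both values can be passed by name to coef() (or predict()) on the cv.glmnet object:

coef(cvfit, s = "lambda.min")  # coefficients at the CV-optimal lambda
coef(cvfit, s = "lambda.1se")  # sparser model within one SE of the optimum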
Summary
Learning more
Some resources to learn more about hypothesis testing and regression:
- An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, http://www-bcf.usc.edu/~gareth/isl/
- The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman, http://statweb.stanford.edu/~tibs/elemstatlearn/
- Introductory Statistics with R by Peter Dalgaard, http://www.springer.com/us/book/9780387790534
Next time