
1 Lecture 5 STATS/CME 195

2 Contents Hypothesis testing, Linear regression, Sparse regression

3 Hypothesis testing

4 Exploratory vs. confirmatory data analysis Two statistical approaches to analyzing data sets: Exploratory: use plotting, transformations and summaries to explore the data and formulate hypotheses. Confirmatory: represent the data as random variables, formulate hypotheses and test whether they are consistent with the model assumptions. Traditionally, statistics focused more on hypothesis testing. John Tukey wrote the book Exploratory Data Analysis in 1977. Tukey's championing of exploratory analysis encouraged the development of statistical computing packages, such as S (the precursor of R) from Bell Labs.

5 Examples of hypotheses Is the measured quantity equal to/higher/lower than a given threshold? e.g. is the number of faulty items in an order statistically higher than the number guaranteed by the manufacturer? Is there a difference between two groups or observations? e.g. do treated patients have a higher survival rate than untreated ones? Is the level of one quantity related to the value of another quantity? e.g. is hyperactivity related to eating sugar? Is lung cancer related to smoking?

6 How to perform a hypothesis test
0. There is an initial research hypothesis whose truth is unknown.
1. Formally define the null and alternative hypotheses.
2. Choose the level of significance α.
3. Make statistical assumptions about the distribution of the observations.
4. Pick and compute the test statistic.
5. Derive the distribution of the test statistic under the null hypothesis from the assumptions.
6. Compute the p-value: the probability, under the null hypothesis, of sampling a test statistic at least as extreme as the one observed.
7. Decide whether to reject the null hypothesis by comparing the p-value to α.
8. Draw conclusions from the test.
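As an illustration, here is a minimal sketch of these steps in R on simulated data (the sample, the true mean of 0.3 and the choice α = 0.05 are made up for the example, not taken from the lecture):

# 1.-2. H0: mu = 0 vs. H1: mu != 0, tested at significance level alpha = 0.05
alpha <- 0.05
# 3.-4. assume (approximately) normal observations; here we simulate them
set.seed(1)
x <- rnorm(100, mean = 0.3, sd = 1)
t.stat <- (mean(x) - 0) / (sd(x) / sqrt(length(x)))  # the t statistic
# 5.-6. under H0 the statistic follows a t distribution with n - 1 df;
#       the two-sided p-value adds up both tails
p.value <- 2 * pt(-abs(t.stat), df = length(x) - 1)
# 7.-8. reject H0 if the p-value falls below alpha
p.value < alpha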

7 Null and alternative hypotheses Null hypothesis (H0): a statement assumed to be true unless it can be shown to be incorrect beyond a reasonable doubt (at significance level α). This is something one usually attempts to disprove or discredit. Alternative hypothesis (H1): a claim that is contradictory to H0 and what we conclude when we reject H0. This setting is asymmetric: you do not accept the null hypothesis, but you may fail to reject it.

8 Student's t-test

9 Student's t-test: one test with many applications Developed by William Gosset (1908), a chemist at the Guinness brewery, and published in Biometrika under the pseudonym Student. It was first used to select the best-yielding varieties of barley and is now one of the standard/traditional methods for hypothesis testing. In general, it is used when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known. Among the typical applications: comparing a population mean to a constant value; comparing the means of two populations; comparing the slope of a regression line to a constant.

10 Distribution of the statistic If X_1, ..., X_n are independent and X_i ∼ N(μ, σ²), then

T = (X̄ − μ) / (S / √n) ∼ t with ν = n − 1 degrees of freedom,

where X̄ = (1/n) Σ_{i=1}^n X_i and S² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)².
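A quick numerical check of this fact (on simulated data, purely illustrative): the statistic computed by hand agrees with the one reported by t.test().

set.seed(1)
x <- rnorm(50, mean = 2, sd = 3)     # simulated sample with true mu = 2
n <- length(x)
(mean(x) - 2) / (sd(x) / sqrt(n))    # T by hand; sd() uses the 1/(n-1) definition of S
t.test(x, mu = 2)$statistic          # the same value, compared against t with n - 1 df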

11 P-values The p-value is the probability of an observed (or more extreme) result, assuming that the null hypothesis is true. A small p-value indicates strong evidence against the null hypothesis, so you reject the null hypothesis.
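For a two-sided t-test, the p-value can be computed directly from the t distribution; the statistic and degrees of freedom below are hypothetical numbers chosen only to illustrate the calculation.

t.obs <- 2.1                      # hypothetical observed t statistic
nu    <- 40                       # hypothetical degrees of freedom
2 * pt(-abs(t.obs), df = nu)      # P(|T| >= |t.obs|) under the null hypothesis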

12 Two-sided test of the mean Is the mean flight arrival delay statistically equal to 0? Test the null hypothesis H0: μ = μ0 = 0 against the alternative H1: μ ≠ μ0 = 0, where μ is the average arrival delay.


14 Testing mean flight delay

library(tidyverse)
library(nycflights13)
mean(flights$arr_delay, na.rm = T)
## [1]

Is this statistically significant?

( tt = t.test(x = flights$arr_delay, mu = 0, alternative = "two.sided") )
##
## One Sample t-test
##
## data: flights$arr_delay
## t = 88.39, df = , p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##
## sample estimates:
## mean of x
##

15 Looking inside a t.test object The function t.test returns an object containing the following components:

names(tt)
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "alternative" "method" "data.name"

tt$p.value   # The p-value
## [1] 0

tt$conf.int  # The 95% confidence interval for the mean
## [1]
## attr(,"conf.level")
## [1] 0.95

16 One-sided test of the mean Test the null hypothesis H0: μ = μ0 = 0 against the one-sided alternative H1: μ < μ0 = 0:

t.test(x, mu = 0, alternative = "less")

One-sided tests can be more powerful, but the interpretation is more difficult.

17 Testing mean flight delay (II) Is the average delay 0 or is it lower?

( tt = t.test(x = flights$arr_delay, mu = 0, alternative = "less") )
##
## One Sample t-test
##
## data: flights$arr_delay
## t = 88.39, df = , p-value = 1
## alternative hypothesis: true mean is less than 0
## 95 percent confidence interval:
## -Inf
## sample estimates:
## mean of x
##

Failure to reject is not acceptance of the null hypothesis.

18 Testing difference between groups Are average arrival delays the same in winter and summer? Test the null hypothesis H0: μ_A = μ_B against the alternative H1: μ_A ≠ μ_B, where μ_A is the mean of group A and μ_B is the mean of group B. The t.test function can also perform this test: t.test(x, y).

19 Seasonal differences in flight delay (I)

flights %>%
  mutate(season = cut(month, breaks = c(0,3,6,9,12))) %>%
  ggplot(aes(x = season, y = arr_delay)) +
  geom_boxplot(alpha = 0.1) +
  xlab("season") +
  ylab("arrival delay")

20 Seasonal differences in flight delay (II)

flights %>%
  filter(arr_delay < 120) %>%
  mutate(season = cut(month, breaks = c(0,3,6,9,12))) %>%
  ggplot(aes(x = season, y = arr_delay)) +
  geom_boxplot(alpha = 0.01) +
  xlab("season") +
  ylab("arrival delay")

21 Testing seasonal differences in flight delay

flights.winter = filter(flights, month %in% c(1,2,3))
flights.summer = filter(flights, month %in% c(7,8,9))
t.test(x = flights.winter$arr_delay, y = flights.summer$arr_delay)
##
## Welch Two Sample t-test
##
## data: flights.winter$arr_delay and flights.summer$arr_delay
## t = , df = , p-value =
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##
## sample estimates:
## mean of x mean of y
##

Are the assumptions of the test valid?

22 Linear Regression

23 Linear regression Regression is a supervised learning method, whose goal is inferring the relationship between input data, X, and a continuous response variable, y. Linear regression is a type of regression where y is modeled as a linear function of X. Simple linear regression predicts the output y from a single predictor x: y = β0 + β1 x + ϵ. Multiple linear regression assumes y depends on many covariates: y = β0 + β1 x1 + ... + βp xp + ϵ. Here ϵ denotes a random noise term with zero mean and independent components.
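A small sketch on simulated data (the variable names and coefficients are made up for illustration) showing how lm() fits both forms of the model:

set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)  # true coefficients: 1, 2, -0.5
coef(lm(y ~ x1))                        # simple linear regression: one predictor
coef(lm(y ~ x1 + x2))                   # multiple linear regression: several covariates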

24 Objective function Linear regression seeks a solution ŷ = Xβ̂ that minimizes the difference between the true outcome y and the prediction ŷ, in terms of the residual sum of squares (RSS):

β̂ = arg min_β Σ_{i=1}^n ( y^(i) − β^T x^(i) )²
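As a sanity check (a sketch on simulated data, not part of the lecture), the coefficients returned by lm() coincide with the closed-form least-squares solution that minimizes the RSS:

set.seed(1)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
X <- cbind(1, x)                            # design matrix with an intercept column
beta.hat <- solve(t(X) %*% X, t(X) %*% y)   # closed-form least-squares estimate
coef(lm(y ~ x))                             # lm() returns the same values
sum((y - X %*% beta.hat)^2)                 # the minimized residual sum of squares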

25 Example of simple linear regression Predict the logarithm of diamond price using the logarithm of its weight.

mod <- lm(log(price) ~ log(carat), data = diamonds)
library(modelr)
diamonds.mod <- diamonds %>%
  add_predictions(mod) %>%
  mutate(pred = exp(pred), resid = price - pred)
ggplot(diamonds.mod) +
  geom_point(aes(x = carat, y = price), alpha = 0.1) +
  geom_line(color = 'red', aes(x = carat, y = pred))

26 Simple linear regression

mod <- lm(log(price) ~ log(carat), data = diamonds)
summary(mod)
##
## Call:
## lm(formula = log(price) ~ log(carat), data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) <2e-16 ***
## log(carat) <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: on degrees of freedom
## Multiple R-squared: 0.933, Adjusted R-squared:
## F-statistic: 7.51e+05 on 1 and DF, p-value: < 2.2e-16

27 Looking inside a lm object The coefficients (β̂) of the fitted model:

( beta.hat <- coef(summary(mod)) )
## Estimate Std. Error t value Pr(>|t|)
## (Intercept)
## log(carat)

Predicted values (ŷ) for the existing observations:

head(predict(mod))
##
##

28 Making predictions Alternatively, add predictions with the modelr package:

library(modelr)
diamonds %>%
  add_predictions(mod) %>%
  mutate(pred = exp(pred))

Predictions for new observations:

new.diamonds = data.frame(carat = c(0.2, 0.5, 1, 2, 5))
predict(mod, new.diamonds)
##
##

29 Regression t-tests Statistical significance of the coefficients is assessed with t-tests:

summary(mod)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept)
## log(carat)

Plot the lm fit and its uncertainty with geom_smooth(method = "lm"):
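For example, a call along these lines (an illustrative sketch; the exact figure from the slide is not reproduced here) overlays the fitted line and its confidence band on the data:

library(tidyverse)
ggplot(diamonds, aes(x = log(carat), y = log(price))) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm")   # fitted regression line with its uncertainty band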

30 Multiple linear regression We might like to predict log-price using log-carat and cut.

mod.2 <- lm(log(price) ~ log(carat) + cut, data = diamonds)
summary(mod.2)
##
## Call:
## lm(formula = log(price) ~ log(carat) + cut, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) < 2e-16 ***
## log(carat) < 2e-16 ***
## cut.L < 2e-16 ***
## cut.Q < 2e-16 ***
## cut.C < 2e-16 ***
## cut^4 e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

31 Interaction terms An interaction occurs when an independent variable has a different effect on the outcome depending on the values of another independent variable. For example, one variable, x1, might have a different effect on y within different categories or groups, given by variable x2. With lm, an asterisk in the formula generates the interaction terms (together with the main effects).

32 Linear regression with interaction terms

mod.2 <- lm(log(price) ~ log(carat) + cut*clarity, data = diamonds)
summary(mod.2)
##
## Call:
## lm(formula = log(price) ~ log(carat) + cut * clarity, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) < 2e-16 ***
## log(carat) < 2e-16 ***
## cut.L < 2e-16 ***
## cut.Q ***
## cut.C **
## cut^4 **
## clarity.L < 2e-16 ***
## clarity.Q < 2e-16 ***

33 Linear regression with interaction terms (II) You can also specify explicitly which terms you want:

mod.2 <- lm(log(price) ~ log(carat) + cut:clarity, data = diamonds)
summary(mod.2)
##
## Call:
## lm(formula = log(price) ~ log(carat) + cut:clarity, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) < 2e-16 ***
## log(carat) < 2e-16 ***
## cutFair:clarityI1 < 2e-16 ***
## cutGood:clarityI1 < 2e-16 ***
## cutVery Good:clarityI1 < 2e-16 ***
## cutPremium:clarityI1 < 2e-16 ***
## cutIdeal:clarityI1 < 2e-16 ***
## cutFair:claritySI2 < 2e-16 ***

34 Sparse Regression

35 Linear regression with many covariates Many modern datasets have many more covariates than observations: p ≫ n. Example: genome-wide association studies often have p ≈ 10^6 and a much smaller n. When p > n, the linear regression estimate is not well-defined and inference is not easy. The assumption of sparsity: the number of available covariates is extremely large, but only a handful of them are relevant for the prediction of the outcome.
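A tiny simulated illustration of the p > n problem (the dimensions are chosen arbitrarily): ordinary least squares cannot estimate all the coefficients, and lm() reports NA for the ones it cannot identify.

set.seed(1)
n <- 20; p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - 2 * X[, 2] + rnorm(n)   # only the first two covariates matter
fit <- lm(y ~ X)
sum(is.na(coef(fit)))                 # many coefficients are not defined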

36 Sparse linear regression Lasso regression is simply linear regression with an L1 penalty:

β̂ = arg min_β { Σ_{i=1}^n ( y^(i) − β^T x^(i) )² + λ ‖β‖_1 }

The L1 norm ‖β‖_1 = Σ_{j=1}^p |β_j| promotes sparsity: the solution β̂ usually has only a small number of non-zero coefficients. The number of non-zero coefficients depends on the choice of the tuning parameter, λ.

37 The glmnet package

# install.packages("glmnet")
library(glmnet)

Lasso regression is implemented in the R package glmnet; an introductory tutorial to the package is available online. The glmnet function provided by the package computes the lasso regression for a sequence of different values of λ.

38 Fitting the lasso glmnet does not work with data frames; it requires numeric input: glmnet(X, y).

diamonds.log <- diamonds %>%
  mutate(price = log10(price), carat = log10(carat))
X = diamonds.log[, !(names(diamonds.log) %in% c("price"))]
y = diamonds.log[, (names(diamonds.log) %in% c("price"))]
y = data.matrix(y)
head(data.matrix(X))
## carat cut color clarity depth table x y z
## [1,]
## [2,]
## [3,]
## [4,]
## [5,]
## [6,]

39 Dummy variables for categorical predictors Create dummy variables for all categorical predictors:

X <- model.matrix(as.formula(
  "log(price) ~ log(carat) + cut + clarity + color"), diamonds)
colnames(X)
## [1] "(Intercept)" "log(carat)" "cut.L" "cut.Q" "cut.C"
## [6] "cut^4" "clarity.L" "clarity.Q" "clarity.C" "clarity^4"
## [11] "clarity^5" "clarity^6" "clarity^7" "color.L" "color.Q"
## [16] "color.C" "color^4" "color^5" "color^6"

Now we can call glmnet:

fit = glmnet(X, y)

40 Plotting the lasso path

plot(fit, label = T)

The y-axis corresponds to the values of the coefficients; the x-axis is labeled L1 norm and is (inversely) related to λ.

41 Lasso coefficient estimates The computed lasso coefficients for a particular choice of λ:

coef(fit, s = 0.02)  # lambda = 0.02
## 20 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept)
## (Intercept) .
## log(carat)
## cut.L .
## cut.Q .
## cut.C .
## cut^4 .
## clarity.L
## clarity.Q .
## clarity.C .
## clarity^4 .
## clarity^5 .
## clarity^6 .
## clarity^7 .
## color.L
## color.Q .

42 Predictions from lasso estimates Like for lm(), we can use the predict() function to predict the log-price for the training or the test data.

# Predict for the training set
head( predict(fit, newx = X, s = c(0.02, 0.1)) )
## 1 2
##
##
##
##
##
##

Each of the columns corresponds to a choice of λ.

43 Choosing λ To choose λ you can use cross-validation. Use the cv.glmnet() function to perform a k-fold cross-validation. In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data.

44 Cross-validation with glmnet

set.seed(1)  # Seed for the random number generator
cvfit <- cv.glmnet(X, y, nfolds = 5)
plot(cvfit)

The red dots are the average MSE over the k folds. The two chosen λ values are the one attaining the minimum MSE and the one whose MSE is within one standard error of the minimum.

45 Cross-validation with glmnet (II) Value of λ with minimum MSE:

cvfit$lambda.min
## [1]

Largest λ such that the MSE is within one standard error of the minimum:

cvfit$lambda.1se
## [1]
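The selected values can be plugged back into coef() and predict() by name (continuing with the cvfit and X objects defined on the previous slides):

coef(cvfit, s = "lambda.min")                     # coefficients at the lambda minimizing the CV error
head(predict(cvfit, newx = X, s = "lambda.1se"))  # predictions at the more heavily penalized choice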

46 Summary

47 Learning more Some resources to learn more about hypothesis testing and regression: An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani; The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman; Introductory Statistics with R by Peter Dalgaard.

48 Next time
