
Lecture 5 STATS/CME 195

Contents

Hypothesis testing
Student's t-test
Linear Regression
Sparse Regression
Summary

Hypothesis testing

Exploratory vs. confirmatory data analysis

Two approaches of statistics to analyzing data sets:

Exploratory: use plotting, transformations and summaries to explore the data and formulate hypotheses.
Confirmatory: represent the data as random variables, formulate hypotheses and test whether they are consistent with the model assumptions.

Traditionally, statistics focused more on hypothesis testing. John Tukey wrote the book Exploratory Data Analysis in 1977. Tukey's championing of exploratory analysis encouraged the development of statistical computing packages, such as S (the precursor of R) from Bell Labs.

Examples of hypotheses

Is the measured quantity equal to / higher / lower than a given threshold? e.g. Is the number of faulty items in an order statistically higher than the one guaranteed by the manufacturer?

Is there a difference between two groups or observations? e.g. Do treated patients have a higher survival rate than untreated ones?

Is the level of one quantity related to the value of another quantity? e.g. Is hyperactivity related to eating sugar? Is lung cancer related to smoking?

How to perform a hypothesis test

0. There is an initial research hypothesis whose truth is unknown.
1. Formally define the null and alternative hypotheses.
2. Choose the level of significance α.
3. Make statistical assumptions on the distribution of the observations.
4. Pick and compute the test statistic.
5. Derive the distribution of the test statistic under the null hypothesis from the assumptions.
6. Compute the p-value. This is the probability, under the null hypothesis, of sampling a test statistic at least as extreme as the one observed.
7. Check whether to reject the null hypothesis by comparing the p-value to α.
8. Draw conclusions from the test.
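As an illustration (not part of the original slides), here is a minimal R sketch of these steps for a one-sample, two-sided test of the mean; the sample x and the threshold mu0 below are made up.

set.seed(42)
x <- rnorm(30, mean = 0.4)                       # hypothetical sample
mu0 <- 0                                         # 1. H_0: mu = mu0, H_1: mu != mu0
alpha <- 0.05                                    # 2. significance level
n <- length(x)                                   # 3. assume X_i ~ N(mu, sigma^2), sigma unknown
t.stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))    # 4./5. t statistic, t with n-1 df under H_0
p.value <- 2 * pt(-abs(t.stat), df = n - 1)      # 6. two-sided p-value
p.value < alpha                                  # 7. TRUE => reject H_0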

Null and alternative hypotheses

Null hypothesis (H_0): a statement assumed to be true unless it can be shown to be incorrect beyond a reasonable doubt (at significance level α). This is something one usually attempts to disprove or discredit.

Alternative hypothesis (H_1): a claim that is contradictory to H_0 and what we conclude when we reject H_0.

This setting is asymmetric: you do not accept the null hypothesis, but you may fail to reject it.

Student's t-test

Student's t-test: one test with many applications

In general, it is used when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known.

Developed by William Gosset (1908), a chemist at the Guinness brewery. Published in Biometrika under the pseudonym "Student". Originally used to select the best-yielding varieties of barley. Now one of the standard/traditional methods for hypothesis testing.

Among the typical applications:
Comparing a population mean to a constant value
Comparing the means of two populations
Comparing the slope of a regression line to a constant

Distribution of the statistic

If X_1, …, X_n are independent and X_i ∼ N(μ, σ²), then

T = (X̄ − μ) / (S / √n) ∼ t_{ν = n−1}

where X̄ = (1/n) Σ_{i=1}^n X_i and S² = (1/(n−1)) Σ_{i=1}^n (X_i − X̄)².
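As a quick sanity check (a sketch, not from the original slides), one can simulate this result: draw many samples from a normal distribution, compute T for each, and compare the simulated quantiles with those of the t distribution with n − 1 degrees of freedom.

set.seed(1)
n <- 10
mu <- 5
T.sim <- replicate(10000, {
  x <- rnorm(n, mean = mu, sd = 2)       # sample from N(mu, sigma^2)
  (mean(x) - mu) / (sd(x) / sqrt(n))     # the statistic T
})
quantile(T.sim, c(0.025, 0.5, 0.975))    # simulated quantiles
qt(c(0.025, 0.5, 0.975), df = n - 1)     # theoretical t quantiles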

P-values

The p-value is the probability of an observed (or more extreme) result assuming that the null hypothesis is true. A small p-value indicates strong evidence against the null hypothesis, so you reject the null hypothesis.

Two-sided test of the mean

Is the mean flight arrival delay statistically equal to 0? Test the null hypothesis

H_0: μ = μ_0 = 0
H_1: μ ≠ μ_0 = 0

where μ is the average arrival delay.

Testing mean flight delay

library(tidyverse)
library(nycflights13)
mean(flights$arr_delay, na.rm = T)
## [1] 6.895377

Is this statistically significant?

( tt = t.test(x = flights$arr_delay, mu = 0, alternative = "two.sided") )
##
##  One Sample t-test
##
## data:  flights$arr_delay
## t = 88.39, df = 327340, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  6.742478 7.048276
## sample estimates:
## mean of x
##  6.895377

Looking inside a t.test object

The function t.test returns an object containing the following components:

names(tt)
## [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"
## [6] "null.value"  "alternative" "method"      "data.name"

tt$p.value   # The p-value
## [1] 0

tt$conf.int  # The 95% confidence interval for the mean
## [1] 6.742478 7.048276
## attr(,"conf.level")
## [1] 0.95

One-sided test of the mean

Test the null hypothesis

H_0: μ = μ_0 = 0
H_1: μ < μ_0 = 0

t.test(x, mu = 0, alternative = "less")

A one-sided test can be more powerful, but the interpretation is more difficult.

Testing mean flight delay (II)

Is the average delay 0, or is it lower?

( tt = t.test(x = flights$arr_delay, mu = 0, alternative = "less") )
##
##  One Sample t-test
##
## data:  flights$arr_delay
## t = 88.39, df = 327340, p-value = 1
## alternative hypothesis: true mean is less than 0
## 95 percent confidence interval:
##      -Inf 7.023694
## sample estimates:
## mean of x
##  6.895377

Failure to reject is not acceptance of the null hypothesis.

Testing difference between groups

Are average arrival delays the same in winter and summer? Test the null hypothesis

H_0: μ_A = μ_B
H_1: μ_A ≠ μ_B

where μ_A is the mean of group A and μ_B is the mean of group B. The t.test function can also perform this test:

t.test(x, y)

Seasonal differences in flight delay (I)

flights %>%
  mutate(season = cut(month, breaks = c(0, 3, 6, 9, 12))) %>%
  ggplot(aes(x = season, y = arr_delay)) +
  geom_boxplot(alpha = 0.1) +
  xlab("season") +
  ylab("arrival delay")

Seasonal differences in flight delay (II)

flights %>%
  filter(arr_delay < 120) %>%
  mutate(season = cut(month, breaks = c(0, 3, 6, 9, 12))) %>%
  ggplot(aes(x = season, y = arr_delay)) +
  geom_boxplot(alpha = 0.01) +
  xlab("season") +
  ylab("arrival delay")

Testing seasonal differences in flight delay

flights.winter = filter(flights, month %in% c(1,2,3))
flights.summer = filter(flights, month %in% c(7,8,9))
t.test(x = flights.winter$arr_delay, y = flights.summer$arr_delay)
##
##  Welch Two Sample t-test
##
## data:  flights.winter$arr_delay and flights.summer$arr_delay
## t = -2.4383, df = 161250, p-value = 0.01476
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.9780344 -0.1063691
## sample estimates:
## mean of x mean of y
##  5.857851  6.400052

Are the assumptions of the test valid?

Linear Regression

Linear regression

Regression is a supervised learning method whose goal is inferring the relationship between input data X and a continuous response variable y. Linear regression is a type of regression where y is modeled as a linear function of X.

Simple linear regression predicts the output y from a single predictor x:

y = β_0 + β_1 x + ϵ

Multiple linear regression assumes y relies on many covariates:

y = β_0 + β_1 x_1 + … + β_p x_p + ϵ

Here ϵ denotes a random noise term with zero mean and independent components.

Objective function

Linear regression seeks coefficients β^ giving predictions y^ = Xβ^ that minimize the difference between the true outcome y and the prediction y^, in terms of the residual sum of squares (RSS):

β^ = argmin_β Σ_{i=1}^n ( y^(i) − β^T x^(i) )²
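For a concrete illustration (a sketch, not from the original slides): the RSS minimizer has the closed form (XᵀX)⁻¹Xᵀy, which agrees with what lm() computes. The example below uses the built-in mtcars data and the hypothetical names X.demo and y.demo purely for illustration.

X.demo <- model.matrix(mpg ~ wt + hp, data = mtcars)      # design matrix with intercept
y.demo <- mtcars$mpg
beta.hat <- solve(t(X.demo) %*% X.demo, t(X.demo) %*% y.demo)  # minimizes the RSS
cbind(beta.hat, coef(lm(mpg ~ wt + hp, data = mtcars)))        # closed form vs. lm()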

Example of simple linear regression

Predict the logarithm of diamond price using the logarithm of its weight.

mod <- lm(log(price) ~ log(carat), data = diamonds)
library(modelr)
diamonds.mod <- diamonds %>%
  add_predictions(mod) %>%
  mutate(pred = exp(pred), resid = price - pred)
ggplot(diamonds.mod) +
  geom_point(aes(x = carat, y = price), alpha = 0.1) +
  geom_line(color = 'red', aes(x = carat, y = pred))

Simple linear regression

mod <- lm(log(price) ~ log(carat), data = diamonds)
summary(mod)
##
## Call:
## lm(formula = log(price) ~ log(carat), data = diamonds)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.50833 -0.16951 -0.00591  0.16637  1.33793
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.448661   0.001365  6190.9   <2e-16 ***
## log(carat)  1.675817   0.001934   866.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2627 on 53938 degrees of freedom
## Multiple R-squared:  0.933,  Adjusted R-squared:  0.933
## F-statistic: 7.51e+05 on 1 and 53938 DF,  p-value: < 2.2e-16

Looking inside a lm object

The coefficients (β^) of the fitted model:

( beta.hat <- coef(summary(mod)) )
##             Estimate  Std. Error   t value Pr(>|t|)
## (Intercept) 8.448661 0.001364691 6190.8959        0
## log(carat)  1.675817 0.001933806  866.5901        0

Predicted values (y^) for the existing observations:

head(predict(mod))
##        1        2        3        4        5        6
## 5.985753 5.833301 5.985753 6.374210 6.485973 6.057075

Making predictions

Alternatively, add predictions with the modelr package:

library(modelr)
diamonds %>%
  add_predictions(mod) %>%
  mutate(pred = exp(pred))

Predictions for new observations:

new.diamonds = data.frame(carat = c(0.2, 0.5, 1, 2, 5))
predict(mod, new.diamonds)
##         1         2         3         4         5
##  5.751538  7.287073  8.448661  9.610248 11.145784

Regression t-tests

Statistical significance of the coefficients is assessed with t-tests:

summary(mod)$coefficients
##             Estimate  Std. Error   t value Pr(>|t|)
## (Intercept) 8.448661 0.001364691 6190.8959        0
## log(carat)  1.675817 0.001933806  866.5901        0

Plot the fitted line and its uncertainty with geom_smooth(method = "lm"):
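For example, a minimal sketch of such a plot (assuming the same diamonds data used above):

ggplot(diamonds, aes(x = log(carat), y = log(price))) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm")   # fitted line with a confidence band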

Multiple linear regression We might like to predict log-price using log-carat and cut. mod.2 <- lm(log(price) ~ log(carat) + cut, data = diamonds) summary(mod.2) ## ## Call: ## lm(formula = log(price) ~ log(carat) + cut, data = diamonds) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.52247-0.16484-0.00587 0.16087 1.38115 ## ## Coefficients: ## Estimate Std. Error t value Pr(> t ) ## (Intercept) 8.392010 0.001735 4835.551 < 2e-16 *** ## log(carat) 1.695771 0.001910 887.679 < 2e-16 *** ## cut.l 0.224330 0.004401 50.970 < 2e-16 *** ## cut.q -0.066427 0.003895-17.054 < 2e-16 *** ## cut.c 0.052895 0.003402 15.550 < 2e-16 *** ## cut^4 0.018632 0.002734 6.814 9.6e-12 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interaction terms

An interaction occurs when an independent variable has a different effect on the outcome depending on the values of another independent variable. For example, one variable, x_1, might have a different effect on y within different categories or groups, given by variable x_2. With lm, an asterisk in the formula generates the interaction terms.

Linear regression with interaction terms

mod.2 <- lm(log(price) ~ log(carat) + cut*clarity, data = diamonds)
summary(mod.2)
##
## Call:
## lm(formula = log(price) ~ log(carat) + cut * clarity, data = diamonds)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.0355 -0.1171  0.0113  0.1226  2.0503
##
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)
## (Intercept)  8.478446   0.002530 3350.900  < 2e-16 ***
## log(carat)   1.812383   0.001500 1208.244  < 2e-16 ***
## cut.L        0.110090   0.006962   15.813  < 2e-16 ***
## cut.Q       -0.020511   0.006129   -3.346 0.000819 ***
## cut.C        0.013001   0.004729    2.750 0.005970 **
## cut^4        0.009691   0.003638    2.664 0.007720 **
## clarity.L    0.894861   0.009346   95.749  < 2e-16 ***
## clarity.Q   -0.211756   0.008635  -24.522  < 2e-16 ***

Linear regression with interaction terms (II)

You can also specify explicitly which terms you want:

mod.2 <- lm(log(price) ~ log(carat) + cut:clarity, data = diamonds)
summary(mod.2)
##
## Call:
## lm(formula = log(price) ~ log(carat) + cut:clarity, data = diamonds)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.0355 -0.1171  0.0113  0.1226  2.0503
##
## Coefficients: (1 not defined because of singularities)
##                          Estimate Std. Error   t value Pr(>|t|)
## (Intercept)              8.878400   0.005487  1618.114  < 2e-16 ***
## log(carat)               1.812383   0.001500  1208.244  < 2e-16 ***
## cutFair:clarityI1       -1.295617   0.013938   -92.957  < 2e-16 ***
## cutGood:clarityI1       -1.093947   0.019693   -55.549  < 2e-16 ***
## cutVery Good:clarityI1  -1.052551   0.020956   -50.226  < 2e-16 ***
## cutPremium:clarityI1    -1.099240   0.014074   -78.105  < 2e-16 ***
## cutIdeal:clarityI1      -0.907525   0.016296   -55.691  < 2e-16 ***
## cutFair:claritySI2      -0.761500   0.010207   -74.609  < 2e-16 ***

Sparse Regression

Linear regression with many covariates

Many modern datasets have many more covariates than observations: p ≫ n. Example: genome-wide association studies often have p ≈ 10^6 and n ≈ 10^4. When p > n, the linear regression estimate is not well-defined and inference is not easy.

The assumption of sparsity: the number of available covariates is extremely large, but only a handful of them are relevant for the prediction of the outcome.

Sparse linear regression

Lasso regression is simply linear regression with an L1 penalty:

β^ = argmin_β { Σ_{i=1}^n ( y^(i) − β^T x^(i) )² + λ ‖β‖_1 }

The L1 norm ‖β‖_1 = Σ_{j=1}^p |β_j| promotes sparsity: the solution β^ usually has only a small number of non-zero coefficients. The number of non-zero coefficients depends on the choice of the tuning parameter λ.

The glmnet package

# install.packages("glmnet")
library(glmnet)

Lasso regression is implemented in the R package glmnet. An introductory tutorial to the package can be found here: https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html

The glmnet function provided by the package computes the lasso regression for a sequence of different λ values.

Fitting the lasso

glmnet does not work with data frames. It requires numeric input:

glmnet(x, y)

diamonds.log <- diamonds %>%
  mutate(price = log10(price), carat = log10(carat))
X = diamonds.log[, !(names(diamonds.log) %in% c("price"))]
y = diamonds.log[,  (names(diamonds.log) %in% c("price"))]
y = data.matrix(y)
head(data.matrix(X))
##           carat cut color clarity depth table    x    y    z
## [1,] -0.6382722   5     2       2  61.5    55 3.95 3.98 2.43
## [2,] -0.6777807   4     2       3  59.8    61 3.89 3.84 2.31
## [3,] -0.6382722   2     2       5  56.9    65 4.05 4.07 2.31
## [4,] -0.5376020   4     6       4  62.4    58 4.20 4.23 2.63
## [5,] -0.5086383   2     7       2  63.3    58 4.34 4.35 2.75
## [6,] -0.6197888   3     7       6  62.8    57 3.94 3.96 2.48

Dummy variables for categorical predictors

Create dummy variables for all categorical predictors:

X <- model.matrix(as.formula(
  "log(price) ~ log(carat) + cut + clarity + color"), diamonds)
colnames(X)
##  [1] "(Intercept)" "log(carat)"  "cut.L"       "cut.Q"       "cut.C"
##  [6] "cut^4"       "clarity.L"   "clarity.Q"   "clarity.C"   "clarity^4"
## [11] "clarity^5"   "clarity^6"   "clarity^7"   "color.L"     "color.Q"
## [16] "color.C"     "color^4"     "color^5"     "color^6"

Now we can call glmnet:

fit = glmnet(X, y)

Plotting the lasso path

plot(fit, label = T)

The y-axis corresponds to the value of the coefficients; the x-axis is labeled "L1 Norm" and is (inversely) related to λ.

Lasso coefficient estimates

The computed lasso coefficients for a particular choice of λ:

coef(fit, s = 0.02)  # lambda = 0.02
## 20 x 1 sparse Matrix of class "dgCMatrix"
##                       1
## (Intercept)  3.68239956
## (Intercept)  .
## log(carat)   0.73994621
## cut.L        .
## cut.Q        .
## cut.C        .
## cut^4        .
## clarity.L    0.20919153
## clarity.Q    .
## clarity.C    .
## clarity^4    .
## clarity^5    .
## clarity^6    .
## clarity^7    .
## color.L     -0.07951473
## color.Q      .

Predictions from lasso estimates

As with lm(), we can use the predict() function to predict the log-price for the training or the test data.

# Predict for the training set
head( predict(fit, newx = X, s = c(0.02, 0.1)) )
##          1        2
## 1 2.544275 2.783346
## 2 2.509239 2.732693
## 3 2.641112 2.783346
## 4 2.720246 2.912415
## 5 2.690009 2.949549
## 6 2.629748 2.807044

Each of the columns corresponds to a choice of λ.

Choosing λ

To choose λ you can use cross-validation. Use the cv.glmnet() function to perform a k-fold cross-validation. In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data.
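To make the procedure concrete, here is a simplified sketch (not the actual cv.glmnet implementation) of k-fold cross-validation for a single, arbitrarily chosen λ = 0.02, reusing the X and y defined above:

k <- 5
set.seed(1)
folds <- sample(rep(1:k, length.out = nrow(X)))   # random fold assignment
cv.mse <- sapply(1:k, function(j) {
  fit.j  <- glmnet(X[folds != j, ], y[folds != j], lambda = 0.02)  # train on k-1 folds
  pred.j <- predict(fit.j, newx = X[folds == j, ])                 # predict held-out fold
  mean((y[folds == j] - pred.j)^2)                                 # held-out MSE
})
mean(cv.mse)   # cross-validation estimate of the MSE at lambda = 0.02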

Cross-validation with glmnet

set.seed(1)  # Seed for the random number generator
cvfit <- cv.glmnet(X, y, nfolds = 5)
plot(cvfit)

The red dots are the average MSE over the k folds. The two chosen λ values are the one with the minimum MSE and the one whose MSE is one standard error above the minimum.

Cross-validation with glmnet (II)

Value of λ with minimum MSE:

cvfit$lambda.min
## [1] 0.000478123

Largest λ such that the MSE is within one standard error of the minimum:

cvfit$lambda.1se
## [1] 0.000761307
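Either value can then be plugged back into coef() or predict() on the cvfit object, for example:

coef(cvfit, s = "lambda.min")                     # coefficients at lambda.min
head(predict(cvfit, newx = X, s = "lambda.1se"))  # predictions at lambda.1se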

Summary

Learning more

Some resources to learn more about hypothesis testing and regression:

An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, http://www-bcf.usc.edu/~gareth/ISL/

Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman, http://statweb.stanford.edu/~tibs/ElemStatLearn/

Introductory Statistics with R by Peter Dalgaard, http://www.springer.com/us/book/9780387790534

Next time