
1 Lecture 5 STATS/CME 195

2 Contents Hypothesis testing, Linear regression, Sparse regression

3 Hypothesis testing

4 Exploratory vs. confirmatory data analysis Two statistical approaches to analyzing data sets: Exploratory: use plotting, transformations and summaries to explore the data and formulate hypotheses. Confirmatory: represent the data as random variables, formulate hypotheses and test whether they are consistent with the model assumptions. Traditionally, statistics focused more on hypothesis testing. John Tukey wrote the book Exploratory Data Analysis in 1977. Tukey's championing of exploratory analysis encouraged the development of statistical computing packages, such as S (the precursor of R) from Bell Labs.

5 Examples of hypotheses Is the measured quantity equal to/higher/lower than a given threshold? e.g. is the number of faulty items in an order statistically higher than the number guaranteed by the manufacturer? Is there a difference between two groups or observations? e.g. do treated patients have a higher survival rate than untreated ones? Is the level of one quantity related to the value of another quantity? e.g. is hyperactivity related to eating sugar? Is lung cancer related to smoking?

6 How to perform a hypothesis test
0. There is an initial research hypothesis whose truth is unknown.
1. Formally define the null and alternative hypotheses.
2. Choose the level of significance α.
3. Make statistical assumptions about the distribution of the observations.
4. Pick and compute the test statistic.
5. Derive the distribution of the test statistic under the null hypothesis from the assumptions.
6. Compute the p-value: the probability, under the null hypothesis, of sampling a test statistic at least as extreme as the one observed.
7. Decide whether to reject the null hypothesis by comparing the p-value to α.
8. Draw conclusions from the test.
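As an illustration, here is a minimal sketch of these steps in R on simulated data (the sample, the true mean of 0.3 and the choice α = 0.05 are made up for the example, not taken from the lecture):

# 1.-2. H0: mu = 0 vs. H1: mu != 0, tested at significance level alpha = 0.05
alpha <- 0.05
# 3.-4. assume (approximately) normal observations; here we simulate them
set.seed(1)
x <- rnorm(100, mean = 0.3, sd = 1)
t.stat <- (mean(x) - 0) / (sd(x) / sqrt(length(x)))  # the t statistic
# 5.-6. under H0 the statistic follows a t distribution with n - 1 df;
#       the two-sided p-value adds up both tails
p.value <- 2 * pt(-abs(t.stat), df = length(x) - 1)
# 7.-8. reject H0 if the p-value falls below alpha
p.value < alpha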

7 Null and alternative hypotheses Null hypothesis (H0): a statement assumed to be true unless it can be shown to be incorrect beyond a reasonable doubt (at significance level α). This is something one usually attempts to disprove or discredit. Alternative hypothesis (H1): a claim that is contradictory to H0 and what we conclude when we reject H0. This setting is asymmetric: you do not accept the null hypothesis, but you may fail to reject it.

8 Student's t-test

9 Student's t-test: one test with many applications Developed by William Gosset (1908), a chemist at the Guinness brewery, and published in Biometrika under the pseudonym Student. It was first used to select the best-yielding varieties of barley and is now one of the standard/traditional methods for hypothesis testing. In general, it is used when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known. Among the typical applications: comparing a population mean to a constant value; comparing the means of two populations; comparing the slope of a regression line to a constant.

10 Distribution of the statistic If X_1, ..., X_n are independent and X_i ∼ N(μ, σ²), then

T = (X̄ − μ) / (S / √n) ∼ t with ν = n − 1 degrees of freedom,

where X̄ = (1/n) Σ_{i=1}^n X_i and S² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)².
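A quick numerical check of this fact (on simulated data, purely illustrative): the statistic computed by hand agrees with the one reported by t.test().

set.seed(1)
x <- rnorm(50, mean = 2, sd = 3)     # simulated sample with true mu = 2
n <- length(x)
(mean(x) - 2) / (sd(x) / sqrt(n))    # T by hand; sd() uses the 1/(n-1) definition of S
t.test(x, mu = 2)$statistic          # the same value, compared against t with n - 1 df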

11 P-values The p-value is the probability of an observed (or more extreme) result, assuming that the null hypothesis is true. A small p-value indicates strong evidence against the null hypothesis, so you reject the null hypothesis.
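For a two-sided t-test, the p-value can be computed directly from the t distribution; the statistic and degrees of freedom below are hypothetical numbers chosen only to illustrate the calculation.

t.obs <- 2.1                      # hypothetical observed t statistic
nu    <- 40                       # hypothetical degrees of freedom
2 * pt(-abs(t.obs), df = nu)      # P(|T| >= |t.obs|) under the null hypothesis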

12 Two-sided test of the mean Is the mean flight arrival delay statistically equal to 0? Test the null hypothesis H0: μ = μ0 = 0 against the alternative H1: μ ≠ μ0 = 0, where μ is the average arrival delay.


14 Testing mean flight delay

library(tidyverse)
library(nycflights13)
mean(flights$arr_delay, na.rm = T)
## [1]

Is this statistically significant?

( tt = t.test(x = flights$arr_delay, mu = 0, alternative = "two.sided") )
##
## One Sample t-test
##
## data: flights$arr_delay
## t = 88.39, df = , p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##
## sample estimates:
## mean of x
##

15 Looking inside a t.test object The function t.test returns an object containing the following components:

names(tt)
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "alternative" "method" "data.name"

tt$p.value   # The p-value
## [1] 0

tt$conf.int  # The 95% confidence interval for the mean
## [1]
## attr(,"conf.level")
## [1] 0.95

16 One-sided test of the mean Test the null hypothesis H0: μ = μ0 = 0 against the one-sided alternative H1: μ < μ0 = 0:

t.test(x, mu = 0, alternative = "less")

One-sided tests can be more powerful, but the interpretation is more difficult.

17 Testing mean flight delay (II) Is the average delay 0 or is it lower?

( tt = t.test(x = flights$arr_delay, mu = 0, alternative = "less") )
##
## One Sample t-test
##
## data: flights$arr_delay
## t = 88.39, df = , p-value = 1
## alternative hypothesis: true mean is less than 0
## 95 percent confidence interval:
## -Inf
## sample estimates:
## mean of x
##

Failure to reject is not acceptance of the null hypothesis.

18 Testing difference between groups Are average arrival delays the same in winter and summer? Test the null hypothesis H0: μ_A = μ_B against the alternative H1: μ_A ≠ μ_B, where μ_A is the mean of group A and μ_B is the mean of group B. The t.test function can also perform this test: t.test(x, y).

19 Seasonal differences in flight delay (I)

flights %>%
  mutate(season = cut(month, breaks = c(0,3,6,9,12))) %>%
  ggplot(aes(x = season, y = arr_delay)) +
  geom_boxplot(alpha = 0.1) +
  xlab("season") +
  ylab("arrival delay")

20 Seasonal differences in flight delay (II)

flights %>%
  filter(arr_delay < 120) %>%
  mutate(season = cut(month, breaks = c(0,3,6,9,12))) %>%
  ggplot(aes(x = season, y = arr_delay)) +
  geom_boxplot(alpha = 0.01) +
  xlab("season") +
  ylab("arrival delay")

21 Testing seasonal differences in flight delay

flights.winter = filter(flights, month %in% c(1,2,3))
flights.summer = filter(flights, month %in% c(7,8,9))
t.test(x = flights.winter$arr_delay, y = flights.summer$arr_delay)
##
## Welch Two Sample t-test
##
## data: flights.winter$arr_delay and flights.summer$arr_delay
## t = , df = , p-value =
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##
## sample estimates:
## mean of x mean of y
##

Are the assumptions of the test valid?

22 Linear Regression

23 Linear regression Regression is a supervised learning method, whose goal is inferring the relationship between input data, X, and a continuous response variable, y. Linear regression is a type of regression where y is modeled as a linear function of X. Simple linear regression predicts the output y from a single predictor x: y = β0 + β1 x + ϵ. Multiple linear regression assumes y depends on many covariates: y = β0 + β1 x1 + ... + βp xp + ϵ. Here ϵ denotes a random noise term with zero mean and independent components.
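A small sketch on simulated data (the variable names and coefficients are made up for illustration) showing how lm() fits both forms of the model:

set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)  # true coefficients: 1, 2, -0.5
coef(lm(y ~ x1))                        # simple linear regression: one predictor
coef(lm(y ~ x1 + x2))                   # multiple linear regression: several covariates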

24 Objective function Linear regression seeks a solution ŷ = Xβ̂ that minimizes the difference between the true outcome y and the prediction ŷ, in terms of the residual sum of squares (RSS):

β̂ = arg min_β Σ_{i=1}^n ( y^(i) − β^T x^(i) )²
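As a sanity check (a sketch on simulated data, not part of the lecture), the coefficients returned by lm() coincide with the closed-form least-squares solution that minimizes the RSS:

set.seed(1)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
X <- cbind(1, x)                            # design matrix with an intercept column
beta.hat <- solve(t(X) %*% X, t(X) %*% y)   # closed-form least-squares estimate
coef(lm(y ~ x))                             # lm() returns the same values
sum((y - X %*% beta.hat)^2)                 # the minimized residual sum of squares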

25 Example of simple linear regression Predict the logarithm of diamond price using the logarithm of its weight.

mod <- lm(log(price) ~ log(carat), data = diamonds)
library(modelr)
diamonds.mod <- diamonds %>%
  add_predictions(mod) %>%
  mutate(pred = exp(pred), resid = price - pred)
ggplot(diamonds.mod) +
  geom_point(aes(x = carat, y = price), alpha = 0.1) +
  geom_line(color = 'red', aes(x = carat, y = pred))

26 Simple linear regression

mod <- lm(log(price) ~ log(carat), data = diamonds)
summary(mod)
##
## Call:
## lm(formula = log(price) ~ log(carat), data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) <2e-16 ***
## log(carat) <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: on degrees of freedom
## Multiple R-squared: 0.933, Adjusted R-squared:
## F-statistic: 7.51e+05 on 1 and DF, p-value: < 2.2e-16

27 Looking inside a lm object The coefficients (β̂) of the fitted model:

( beta.hat <- coef(summary(mod)) )
## Estimate Std. Error t value Pr(>|t|)
## (Intercept)
## log(carat)

Predicted values (ŷ) for the existing observations:

head(predict(mod))
##
##

28 Making predictions Alternatively, add predictions with the modelr package:

library(modelr)
diamonds %>%
  add_predictions(mod) %>%
  mutate(pred = exp(pred))

Predictions for new observations:

new.diamonds = data.frame(carat = c(0.2, 0.5, 1, 2, 5))
predict(mod, new.diamonds)
##
##

29 Regression t-tests Statistical significance of the coefficients is assessed with t-tests:

summary(mod)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept)
## log(carat)

Plot the lm fit and its uncertainty with geom_smooth(method = "lm"):
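For example, a call along these lines (an illustrative sketch; the exact figure from the slide is not reproduced here) overlays the fitted line and its confidence band on the data:

library(tidyverse)
ggplot(diamonds, aes(x = log(carat), y = log(price))) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm")   # fitted regression line with its uncertainty band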

30 Multiple linear regression We might like to predict log-price using log-carat and cut.

mod.2 <- lm(log(price) ~ log(carat) + cut, data = diamonds)
summary(mod.2)
##
## Call:
## lm(formula = log(price) ~ log(carat) + cut, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) < 2e-16 ***
## log(carat) < 2e-16 ***
## cut.L < 2e-16 ***
## cut.Q < 2e-16 ***
## cut.C < 2e-16 ***
## cut^4 e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

31 Interaction terms An interaction occurs when an independent variable has a different effect on the outcome depending on the values of another independent variable. For example, one variable, x1, might have a different effect on y within different categories or groups, given by variable x2. With lm, an asterisk in the formula generates the interaction terms (together with the main effects).

32 Linear regression with interaction terms

mod.2 <- lm(log(price) ~ log(carat) + cut*clarity, data = diamonds)
summary(mod.2)
##
## Call:
## lm(formula = log(price) ~ log(carat) + cut * clarity, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) < 2e-16 ***
## log(carat) < 2e-16 ***
## cut.L < 2e-16 ***
## cut.Q ***
## cut.C **
## cut^4 **
## clarity.L < 2e-16 ***
## clarity.Q < 2e-16 ***

33 Linear regression with interaction terms (II) You can also specify explicitly which terms you want:

mod.2 <- lm(log(price) ~ log(carat) + cut:clarity, data = diamonds)
summary(mod.2)
##
## Call:
## lm(formula = log(price) ~ log(carat) + cut:clarity, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) < 2e-16 ***
## log(carat) < 2e-16 ***
## cutFair:clarityI1 < 2e-16 ***
## cutGood:clarityI1 < 2e-16 ***
## cutVery Good:clarityI1 < 2e-16 ***
## cutPremium:clarityI1 < 2e-16 ***
## cutIdeal:clarityI1 < 2e-16 ***
## cutFair:claritySI2 < 2e-16 ***

34 Sparse Regression

35 Linear regression with many covariates Many modern datasets have many more covariates than observations: p ≫ n. Example: genome-wide association studies often have p ≈ 10^6 and a much smaller n. When p > n, the linear regression estimate is not well-defined and inference is not easy. The assumption of sparsity: the number of available covariates is extremely large, but only a handful of them are relevant for the prediction of the outcome.
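A tiny simulated illustration of the p > n problem (the dimensions are chosen arbitrarily): ordinary least squares cannot estimate all the coefficients, and lm() reports NA for the ones it cannot identify.

set.seed(1)
n <- 20; p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - 2 * X[, 2] + rnorm(n)   # only the first two covariates matter
fit <- lm(y ~ X)
sum(is.na(coef(fit)))                 # many coefficients are not defined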

36 Sparse linear regression Lasso regression is simply linear regression with an L1 penalty:

β̂ = arg min_β { Σ_{i=1}^n ( y^(i) − β^T x^(i) )² + λ ‖β‖_1 }

The L1 norm ‖β‖_1 = Σ_{j=1}^p |β_j| promotes sparsity: the solution β̂ usually has only a small number of non-zero coefficients. The number of non-zero coefficients depends on the choice of the tuning parameter, λ.

37 The glmnet package

# install.packages("glmnet")
library(glmnet)

Lasso regression is implemented in the R package glmnet; an introductory tutorial to the package is available online. The glmnet function provided by the package computes the lasso regression for a sequence of different values of λ.

38 Fitting the lasso glmnet does not work with data frames; it requires numeric input: glmnet(X, y).

diamonds.log <- diamonds %>%
  mutate(price = log10(price), carat = log10(carat))
X = diamonds.log[, !(names(diamonds.log) %in% c("price"))]
y = diamonds.log[, (names(diamonds.log) %in% c("price"))]
y = data.matrix(y)
head(data.matrix(X))
## carat cut color clarity depth table x y z
## [1,]
## [2,]
## [3,]
## [4,]
## [5,]
## [6,]

39 Dummy variables for categorical predictors Create dummy variables for all categorical predictors:

X <- model.matrix(as.formula(
  "log(price) ~ log(carat) + cut + clarity + color"), diamonds)
colnames(X)
## [1] "(Intercept)" "log(carat)" "cut.L" "cut.Q" "cut.C"
## [6] "cut^4" "clarity.L" "clarity.Q" "clarity.C" "clarity^4"
## [11] "clarity^5" "clarity^6" "clarity^7" "color.L" "color.Q"
## [16] "color.C" "color^4" "color^5" "color^6"

Now we can call glmnet:

fit = glmnet(X, y)

40 Plotting the lasso path

plot(fit, label = T)

The y-axis corresponds to the values of the coefficients; the x-axis is labeled L1 norm and is (inversely) related to λ.

41 Lasso coefficient estimates The computed lasso coefficients for a particular choice of λ:

coef(fit, s = 0.02)  # lambda = 0.02
## 20 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept)
## (Intercept) .
## log(carat)
## cut.L .
## cut.Q .
## cut.C .
## cut^4 .
## clarity.L
## clarity.Q .
## clarity.C .
## clarity^4 .
## clarity^5 .
## clarity^6 .
## clarity^7 .
## color.L
## color.Q .

42 Predictions from lasso estimates Like for lm(), we can use the predict() function to predict the log-price for the training or the test data.

# Predict for the training set
head( predict(fit, newx = X, s = c(0.02, 0.1)) )
## 1 2
##
##
##
##
##
##

Each of the columns corresponds to a choice of λ.

43 Choosing λ To choose λ you can use cross-validation. Use the cv.glmnet() function to perform a k-fold cross-validation. In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data.

44 Cross-validation with glmnet

set.seed(1)  # Seed for the random number generator
cvfit <- cv.glmnet(X, y, nfolds = 5)
plot(cvfit)

The red dots are the average MSE over the k folds. The two chosen λ values are the one attaining the minimum MSE and the one whose MSE is within one standard error of the minimum.

45 Cross-validation with glmnet (II) Value of λ with minimum MSE:

cvfit$lambda.min
## [1]

Largest λ such that the MSE is within one standard error of the minimum:

cvfit$lambda.1se
## [1]
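The selected values can be plugged back into coef() and predict() by name (continuing with the cvfit and X objects defined on the previous slides):

coef(cvfit, s = "lambda.min")                     # coefficients at the lambda minimizing the CV error
head(predict(cvfit, newx = X, s = "lambda.1se"))  # predictions at the more heavily penalized choice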

46 Summary

47 Learning more Some resources to learn more about hypothesis testing and regression: An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani; The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman; Introductory Statistics with R by Peter Dalgaard.

48 Next time
