1 Lecture 5 STATS/CME 195
2 Contents Hypothesis testing
3 Hypothesis testing
4 Exploratory vs. confirmatory data analysis Two approaches to analyzing data sets: Exploratory: use plotting, transformations and summaries to explore the data and formulate hypotheses. Confirmatory: represent the data as random variables, formulate hypotheses and test whether they are consistent with the model assumptions. Traditionally, statistics focused more on hypothesis testing. John Tukey wrote the book Exploratory Data Analysis in 1977. Tukey's championing of exploratory analysis encouraged the development of statistical computing packages, such as S, the precursor of R, from Bell Labs.
5 Examples of hypotheses Is the measured quantity equal to/higher/lower than a given threshold? E.g. is the number of faulty items in an order statistically higher than the one guaranteed by the manufacturer? Is there a difference between two groups or observations? E.g. do treated patients have a higher survival rate than untreated ones? Is the level of one quantity related to the value of another quantity? E.g. is hyperactivity related to eating sugar? Is lung cancer related to smoking?
6 How to perform a hypothesis test 0. There is an initial research hypothesis whose truth is unknown. 1. Formally define the null and alternative hypotheses. 2. Choose the level of significance α. 3. Make statistical assumptions about the distributions of the observations. 4. Pick and compute the test statistic. 5. Derive the distribution of the test statistic under the null hypothesis from the assumptions. 6. Compute the p-value: the probability, under the null hypothesis, of sampling a test statistic at least as extreme as the one observed. 7. Decide whether to reject the null hypothesis by comparing the p-value to α. 8. Draw conclusions from the test.
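As a sketch (not from the original slides), steps 4 through 7 can be carried out by hand in R; the data below are simulated and all names are illustrative:

```r
# Manual one-sample t-test on simulated data (a sketch of steps 4-7 above)
set.seed(1)
x <- rnorm(100, mean = 0.3, sd = 1)   # sample whose true mean is 0.3
mu0 <- 0                              # null hypothesis H0: mu = mu0 = 0
n <- length(x)
t.stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))   # step 4: test statistic
p.value <- 2 * pt(-abs(t.stat), df = n - 1)     # step 6: two-sided p-value
p.value < 0.05                                  # step 7: reject H0 at alpha = 0.05?
# The built-in t.test() performs the same computation:
all.equal(p.value, t.test(x, mu = mu0)$p.value)
```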
7 Null and alternative hypotheses Null hypothesis (H0): a statement assumed to be true unless it can be shown to be incorrect beyond a reasonable doubt (at significance level α). This is something one usually attempts to disprove or discredit. Alternative hypothesis (H1): a claim that is contradictory to H0 and is what we conclude when we reject H0. This setting is asymmetric: you do not accept the null hypothesis, but you may fail to reject it.
8 Student's t-test
9 Student's t-test: one test with many applications In general, used when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known. William Gosset (1908), a chemist at the Guinness brewery, published it in Biometrika under the pseudonym Student. It was used to select the best-yielding varieties of barley, and is now one of the standard/traditional methods for hypothesis testing. Among the typical applications: comparing a population mean to a constant value; comparing the means of two populations; comparing the slope of a regression line to a constant.
10 Distribution of the statistic If X_1, ..., X_n are independent and X_i ~ N(μ, σ²), then T = (X̄ − μ) / (S / √n) ~ t_{ν = n−1}, where X̄ = (1/n) Σ_{i=1}^n X_i and S² = (1/(n−1)) Σ_{i=1}^n (X_i − X̄)².
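A quick simulation (not in the original slides; all values are illustrative) shows that under these assumptions the statistic T matches the t distribution with n − 1 degrees of freedom:

```r
# Simulate the T statistic under the normality assumption and compare
# its quantiles with those of the t distribution with n - 1 df
set.seed(42)
n <- 10; mu <- 5; sigma <- 2
T.stat <- replicate(10000, {
  x <- rnorm(n, mean = mu, sd = sigma)
  (mean(x) - mu) / (sd(x) / sqrt(n))
})
quantile(T.stat, c(0.025, 0.975))   # empirical quantiles of T
qt(c(0.025, 0.975), df = n - 1)     # theoretical t quantiles: close match
```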
11 P-values The p-value is the probability of the observed (or a more extreme) result, assuming that the null hypothesis is true. A small p-value indicates strong evidence against the null hypothesis, so you reject the null hypothesis.
12 Two-sided test of the mean Is the mean flight arrival delay statistically equal to 0? Test the null hypothesis H0: μ = μ0 = 0 against the alternative H1: μ ≠ μ0, where μ is the average arrival delay.
14 Testing mean flight delay library(tidyverse) library(nycflights13) mean(flights$arr_delay, na.rm = T) ## [1] Is this statistically significant? ( tt = t.test(x=flights$arr_delay, mu=0, alternative="two.sided" ) ) ## ## One Sample t-test ## ## data: flights$arr_delay ## t = 88.39, df = , p-value < 2.2e-16 ## alternative hypothesis: true mean is not equal to 0 ## 95 percent confidence interval: ## ## sample estimates: ## mean of x ##
15 Looking inside a t.test object The function t.test returns an object containing the following components: names(tt) ## [1] "statistic" "parameter" "p.value" "conf.int" "estimate" ## [6] "null.value" "alternative" "method" "data.name" tt$p.value # The p-value ## [1] 0 tt$conf.int # The 95% confidence interval for the mean ## [1] ## attr(,"conf.level") ## [1] 0.95
16 One-sided test of the mean Test the null hypothesis H0: μ = μ0 = 0 against the alternative H1: μ < μ0: t.test(x, mu=0, alternative="less") A one-sided test can be more powerful, but the interpretation is more difficult.
17 Testing mean flight delay (II) Is the average delay 0 or is it lower? ( tt = t.test(x=flights$arr_delay, mu=0, alternative="less" ) ) ## ## One Sample t-test ## ## data: flights$arr_delay ## t = 88.39, df = , p-value = 1 ## alternative hypothesis: true mean is less than 0 ## 95 percent confidence interval: ## -Inf ## sample estimates: ## mean of x ## Failure to reject is not acceptance of the null hypothesis.
18 Testing difference between groups Are average arrival delays the same in winter and summer? Test the null hypothesis H0: μA = μB against H1: μA ≠ μB, where μA is the mean of group A and μB is the mean of group B. The t.test function can also perform this test: t.test(x, y)
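As a sketch on simulated data (a stand-in for the winter/summer delays used below; means and sizes are made up), the two-sample form of t.test works as follows:

```r
# Two-sample t-test on simulated groups; by default t.test uses the
# Welch correction, which does not assume equal variances
set.seed(7)
a <- rnorm(200, mean = 5, sd = 10)   # group A, e.g. winter delays
b <- rnorm(250, mean = 8, sd = 12)   # group B, e.g. summer delays
tt <- t.test(x = a, y = b)           # H0: mu_A = mu_B vs H1: mu_A != mu_B
tt$p.value                           # small p-value -> reject H0
```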
19 Seasonal differences in flight delay (I) flights %>% mutate(season = cut(month, breaks = c(0,3,6,9,12))) %>% ggplot(aes(x = season, y = arr_delay)) + geom_boxplot (alpha=0.1) + xlab("season" ) + ylab("arrival delay" )
20 Seasonal differences in flight delay (II) flights %>% filter(arr_delay < 120) %>% mutate(season = cut(month, breaks = c(0,3,6,9,12))) %>% ggplot(aes(x = season, y = arr_delay)) + geom_boxplot (alpha=0.01) + xlab("season" ) + ylab("arrival delay" )
21 Testing seasonal differences in flight delay flights.winter = filter(flights, month %in% c(1,2,3)) flights.summer = filter(flights, month %in% c(7,8,9)) t.test(x=flights.winter$arr_delay, y=flights.summer$arr_delay) ## ## Welch Two Sample t-test ## ## data: flights.winter$arr_delay and flights.summer$arr_delay ## t = , df = , p-value = ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## ## sample estimates: ## mean of x mean of y ## Are the assumptions of the test valid?
22 Linear Regression
23 Linear regression Regression is a supervised learning method whose goal is inferring the relationship between input data X and a continuous response variable y. Linear regression is a type of regression where y is modeled as a linear function of X. Simple linear regression predicts the output y from a single predictor x: y = β0 + β1 x + ε. Multiple linear regression assumes y depends on many covariates: y = β0 + β1 x1 + ... + βp xp + ε. Here ε denotes a random noise term with zero mean and independent components.
24 Objective function Linear regression seeks a solution ŷ = Xβ̂ that minimizes the difference between the true outcome y and the prediction ŷ, in terms of the residual sum of squares (RSS): β̂ = argmin_β Σ_{i=1}^n (y^(i) − βᵀ x^(i))².
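The RSS minimizer has a closed form via the normal equations; a small sketch on simulated data (not from the slides) checks it against lm():

```r
# Minimize the RSS directly via the normal equations and compare with lm()
set.seed(3)
x <- runif(50)
y <- 1 + 2 * x + rnorm(50, sd = 0.1)       # true intercept 1, slope 2
X <- cbind(1, x)                           # design matrix with intercept column
beta.hat <- solve(t(X) %*% X, t(X) %*% y)  # argmin of the RSS
rss <- sum((y - X %*% beta.hat)^2)         # residual sum of squares at the minimum
coef(lm(y ~ x))                            # lm() returns the same estimates
```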
25 Example of simple linear regression Predict the logarithm of diamond price using the logarithm of its weight. mod <- lm(log(price) ~ log(carat), data = diamonds) library(modelr) diamonds.mod <- diamonds %>% add_predictions(mod) %>% mutate(pred = exp(pred), resid=price-pred) ggplot(diamonds.mod) + geom_point(aes(x=carat, y=price), alpha=0.1) + geom_line(color='red', aes(x = carat, y = pred))
26 Simple linear regression mod <- lm(log(price) ~ log(carat), data = diamonds) summary(mod) ## ## Call: ## lm(formula = log(price) ~ log(carat), data = diamonds) ## ## Residuals: ## Min 1Q Median 3Q Max ## ## ## Coefficients: ## Estimate Std. Error t value Pr(> t ) ## (Intercept) <2e-16 *** ## log(carat) <2e-16 *** ## --- ## Signif. codes: 0 '***' '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: on degrees of freedom ## Multiple R-squared: 0.933, Adjusted R-squared: ## F-statistic: 7.51e+05 on 1 and DF, p-value: < 2.2e-16
27 Looking inside a lm object The coefficients (β̂) of the fitted model: ( beta.hat <- coef(summary(mod)) ) ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) ## log(carat) Predicted values (ŷ) for the existing observations: head(predict(mod)) ## ##
28 Making predictions Alternatively, add predictions with the modelr package library(modelr) diamonds %>% add_predictions(mod) %>% mutate(pred = exp(pred)) Predictions for new observations: new.diamonds = data.frame(carat = c(0.2,0.5,1,2,5)) predict(mod, new.diamonds) ## ##
29 Regression t-tests Statistical significance of coefficients with t-tests: summary(mod)$coefficients ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) ## log(carat) Plot the lm fit and its uncertainty with geom_smooth(method = "lm"):
30 Multiple linear regression We might like to predict log-price using log-carat and cut. mod.2 <- lm(log(price) ~ log(carat) + cut, data = diamonds) summary(mod.2) ## ## Call: ## lm(formula = log(price) ~ log(carat) + cut, data = diamonds) ## ## Residuals: ## Min 1Q Median 3Q Max ## ## ## Coefficients: ## Estimate Std. Error t value Pr(> t ) ## (Intercept) < 2e-16 *** ## log(carat) < 2e-16 *** ## cut.l < 2e-16 *** ## cut.q < 2e-16 *** ## cut.c < 2e-16 *** ## cut^ e-12 *** ## --- ## Signif. codes: 0 '***' '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
31 Interaction terms An interaction occurs when an independent variable has a different effect on the outcome depending on the value of another independent variable. For example, one variable, x1, might have a different effect on y within different categories or groups, given by variable x2. With lm, an asterisk in the formula generates the interaction terms together with the main effects.
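A tiny sketch (toy factors with made-up names) shows how the formula operators expand into model-matrix columns; `*` adds main effects plus the interaction, while `:` produces only the interaction term:

```r
# Compare the design-matrix columns generated by * and : for two factors
d <- data.frame(y  = rnorm(8),
                x1 = gl(2, 4, labels = c("a", "b")),
                x2 = rep(gl(2, 2, labels = c("u", "v")), 2))
colnames(model.matrix(y ~ x1 * x2, d))  # main effects and interaction
colnames(model.matrix(y ~ x1 : x2, d))  # interaction terms only
```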
32 Linear regression with interaction terms mod.2 <- lm(log(price) ~ log(carat) + cut*clarity, data = diamonds) summary(mod.2) ## ## Call: ## lm(formula = log(price) ~ log(carat) + cut * clarity, data = diamonds) ## ## Residuals: ## Min 1Q Median 3Q Max ## ## ## Coefficients: ## Estimate Std. Error t value Pr(> t ) ## (Intercept) < 2e-16 *** ## log(carat) < 2e-16 *** ## cut.l < 2e-16 *** ## cut.q *** ## cut.c ** ## cut^ ** ## clarity.l < 2e-16 *** ## clarity.q < 2e-16 ***
33 Linear regression with interaction terms (II) You can also specify explicitly which terms you want: mod.2 <- lm(log(price) ~ log(carat) + cut:clarity, data = diamonds) summary(mod.2) ## ## Call: ## lm(formula = log(price) ~ log(carat) + cut:clarity, data = diamonds) ## ## Residuals: ## Min 1Q Median 3Q Max ## ## ## Coefficients: (1 not defined because of singularities) ## Estimate Std. Error t value Pr(> t ) ## (Intercept) < 2e-16 *** ## log(carat) < 2e-16 *** ## cutfair:clarityi < 2e-16 *** ## cutgood:clarityi < 2e-16 *** ## cutvery Good:clarityI < 2e-16 *** ## cutpremium:clarityi < 2e-16 *** ## cutideal:clarityi < 2e-16 *** ## cutfair:claritysi < 2e-16 ***
34 Sparse Regression
35 Linear regression with many covariates Many modern datasets have many more covariates than observations: p ≫ n. Example: genome-wide association studies often have p ≈ 10^6 and a much smaller n. When p > n, the linear regression estimate is not well-defined and inference is not easy. The assumption of sparsity: the number of available covariates is extremely large, but only a handful of them are relevant for the prediction of the outcome.
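A short sketch on simulated data (all sizes made up) illustrates why p > n breaks ordinary least squares; lm() cannot estimate all coefficients and returns NAs for the rest:

```r
# With p > n the design matrix is rank-deficient, so OLS is not well-defined
set.seed(5)
n <- 20; p <- 50
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(2, 3), rep(0, p - 3))   # sparse truth: only 3 relevant covariates
y <- X %*% beta + rnorm(n)
fit <- lm(y ~ X)
sum(is.na(coef(fit)))                 # many coefficients cannot be estimated
```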
36 Sparse linear regression Lasso regression is simply linear regression with an L1 penalty: β̂ = argmin_β { Σ_{i=1}^n (y^(i) − βᵀ x^(i))² + λ ‖β‖₁ }. The L1 norm ‖β‖₁ = Σ_{j=1}^p |β_j| promotes sparsity: the solution β̂ usually has only a small number of non-zero coefficients. The number of non-zero coefficients depends on the choice of the tuning parameter λ.
37 The glmnet package # install.packages("glmnet") library(glmnet) Lasso regression is implemented in the R package glmnet. An introductory tutorial to the package can be found here: The glmnet function provided by the package computes the lasso regression for a sequence of different λ values.
38 Fitting the lasso glmnet does not work with data frames; it requires numeric input: glmnet(X, y) diamonds.log <- diamonds %>% mutate(price = log10(price), carat = log10(carat)) X = diamonds.log[,!(names(diamonds.log) %in% c("price"))] y = diamonds.log[, (names(diamonds.log) %in% c("price"))] y = data.matrix(y) head(data.matrix(X)) ## carat cut color clarity depth table x y z ## [1,] ## [2,] ## [3,] ## [4,] ## [5,] ## [6,]
39 Dummy variables for categorical predictors Create dummy variables for all categorical predictors: X <- model.matrix(as.formula( "log(price) ~ log(carat) + cut + clarity + color"), diamonds) colnames(X) ## [1] "(Intercept)" "log(carat)" "cut.L" "cut.Q" "cut.C" ## [6] "cut^4" "clarity.L" "clarity.Q" "clarity.C" "clarity^4" ## [11] "clarity^5" "clarity^6" "clarity^7" "color.L" "color.Q" ## [16] "color.C" "color^4" "color^5" "color^6" Now we can call glmnet: fit = glmnet(X, y)
40 Plotting the lasso path plot(fit, label = TRUE) The y-axis shows the values of the coefficients; the x-axis is labeled L1 norm and is inversely related to λ.
41 Lasso coefficient estimates The computed lasso coefficients for a particular choice of λ: coef(fit, s = 0.02) # Lambda = 0.02 ## 20 x 1 sparse Matrix of class "dgCMatrix" ## 1 ## (Intercept) ## (Intercept) . ## log(carat) ## cut.L . ## cut.Q . ## cut.C . ## cut^4 . ## clarity.L ## clarity.Q . ## clarity.C . ## clarity^4 . ## clarity^5 . ## clarity^6 . ## clarity^7 . ## color.L ## color.Q .
42 Predictions from lasso estimates As with lm(), we can use the predict() function to predict the log-price for the training or the test data. # Predict for the train set head( predict(fit, newx = X, s = c(0.02, 0.1)) ) ## 1 2 ## ## ## ## ## ## Each of the columns corresponds to a choice of λ.
43 Choosing λ To choose λ you can use cross-validation. Use the cv.glmnet() function to perform a k-fold cross-validation. In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as validation data for testing the model, and the remaining k − 1 subsamples are used as training data.
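The partition-fit-evaluate loop described above can be sketched by hand (here with a plain lm on simulated data rather than glmnet; cv.glmnet automates the same idea):

```r
# Manual k-fold cross-validation for a simple linear model
set.seed(9)
x <- runif(100)
y <- 1 + 2 * x + rnorm(100, sd = 0.5)
k <- 5
fold <- sample(rep(1:k, length.out = length(y)))  # random partition into k folds
cv.mse <- sapply(1:k, function(i) {
  train <- fold != i                              # k - 1 folds for training
  fit <- lm(y ~ x, subset = train)                # fit on the training folds
  pred <- predict(fit, data.frame(x = x[!train])) # predict on the held-out fold
  mean((y[!train] - pred)^2)                      # MSE on the validation fold
})
mean(cv.mse)                                      # cross-validated MSE estimate
```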
44 Cross-validation with glmnet set.seed(1) # Seed for the random number generator cvfit <- cv.glmnet(X, y, nfolds = 5) plot(cvfit) The red dots are the average MSE over the k folds. The two chosen λ values are the one with the minimum MSE and the one whose MSE is within one standard error of the minimum.
45 Cross-validation with glmnet (II) Value of λ with minimum MSE: cvfit$lambda.min ## [1] The largest λ such that the MSE is within one standard error of the minimum: cvfit$lambda.1se ## [1]
46 Summary
47 Learning more Some resources to learn more about hypothesis testing and regression: An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani; The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman; Introductory Statistics with R by Peter Dalgaard.
48 Next time
Lecture 25: Review I Reading: Up to chapter 5 in ISLR. STATS 202: Data mining and analysis Jonathan Taylor 1 / 18 Unsupervised learning In unsupervised learning, all the variables are on equal standing,
More informationTHE UNIVERSITY OF BRITISH COLUMBIA FORESTRY 430 and 533. Time: 50 minutes 40 Marks FRST Marks FRST 533 (extra questions)
THE UNIVERSITY OF BRITISH COLUMBIA FORESTRY 430 and 533 MIDTERM EXAMINATION: October 14, 2005 Instructor: Val LeMay Time: 50 minutes 40 Marks FRST 430 50 Marks FRST 533 (extra questions) This examination
More informationEstimating R 0 : Solutions
Estimating R 0 : Solutions John M. Drake and Pejman Rohani Exercise 1. Show how this result could have been obtained graphically without the rearranged equation. Here we use the influenza data discussed
More informationPenalized regression Statistical Learning, 2011
Penalized regression Statistical Learning, 2011 Niels Richard Hansen September 19, 2011 Penalized regression is implemented in several different R packages. Ridge regression can, in principle, be carried
More informationCSSS 510: Lab 2. Introduction to Maximum Likelihood Estimation
CSSS 510: Lab 2 Introduction to Maximum Likelihood Estimation 2018-10-12 0. Agenda 1. Housekeeping: simcf, tile 2. Questions about Homework 1 or lecture 3. Simulating heteroskedastic normal data 4. Fitting
More informationPerformance Estimation and Regularization. Kasthuri Kannan, PhD. Machine Learning, Spring 2018
Performance Estimation and Regularization Kasthuri Kannan, PhD. Machine Learning, Spring 2018 Bias- Variance Tradeoff Fundamental to machine learning approaches Bias- Variance Tradeoff Error due to Bias:
More informationIntroduction to Data Science
Introduction to Data Science CS 491, DES 430, IE 444, ME 444, MKTG 477 UIC Innovation Center Fall 2017 and Spring 2018 Instructors: Charles Frisbie, Marco Susani, Michael Scott and Ugo Buy Author: Ugo
More informationAn R Package flare for High Dimensional Linear Regression and Precision Matrix Estimation
An R Package flare for High Dimensional Linear Regression and Precision Matrix Estimation Xingguo Li Tuo Zhao Xiaoming Yuan Han Liu Abstract This paper describes an R package named flare, which implements
More informationWeek 4: Simple Linear Regression III
Week 4: Simple Linear Regression III Marcelo Coca Perraillon University of Colorado Anschutz Medical Campus Health Services Research Methods I HSMP 7607 2017 c 2017 PERRAILLON ARR 1 Outline Goodness of
More informationWeek 4: Simple Linear Regression II
Week 4: Simple Linear Regression II Marcelo Coca Perraillon University of Colorado Anschutz Medical Campus Health Services Research Methods I HSMP 7607 2017 c 2017 PERRAILLON ARR 1 Outline Algebraic properties
More informationFathom Dynamic Data TM Version 2 Specifications
Data Sources Fathom Dynamic Data TM Version 2 Specifications Use data from one of the many sample documents that come with Fathom. Enter your own data by typing into a case table. Paste data from other
More informationNetwork Traffic Measurements and Analysis
DEIB - Politecnico di Milano Fall, 2017 Sources Hastie, Tibshirani, Friedman: The Elements of Statistical Learning James, Witten, Hastie, Tibshirani: An Introduction to Statistical Learning Andrew Ng:
More informationSet up of the data is similar to the Randomized Block Design situation. A. Chang 1. 1) Setting up the data sheet
Repeated Measure Analysis (Univariate Mixed Effect Model Approach) (Treatment as the Fixed Effect and the Subject as the Random Effect) (This univariate approach can be used for randomized block design
More informationSTAT 113: Lab 9. Colin Reimer Dawson. Last revised November 10, 2015
STAT 113: Lab 9 Colin Reimer Dawson Last revised November 10, 2015 We will do some of the following together. The exercises with a (*) should be done and turned in as part of HW9. Before we start, let
More informationLars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
Syllabus Fri. 27.10. (1) 0. Introduction A. Supervised Learning: Linear Models & Fundamentals Fri. 3.11. (2) A.1 Linear Regression Fri. 10.11. (3) A.2 Linear Classification Fri. 17.11. (4) A.3 Regularization
More informationIntroduction to R, Github and Gitlab
Introduction to R, Github and Gitlab 27/11/2018 Pierpaolo Maisano Delser mail: maisanop@tcd.ie ; pm604@cam.ac.uk Outline: Why R? What can R do? Basic commands and operations Data analysis in R Github and
More informationRobust Linear Regression (Passing- Bablok Median-Slope)
Chapter 314 Robust Linear Regression (Passing- Bablok Median-Slope) Introduction This procedure performs robust linear regression estimation using the Passing-Bablok (1988) median-slope algorithm. Their
More informationbiglasso: extending lasso model to Big Data in R
biglasso: extending lasso model to Big Data in R Yaohui Zeng, Patrick Breheny Package Version: 1.2-3 December 1, 2016 1 User guide 1.1 Small data When the data size is small, the usage of biglasso package
More informationCHAPTER 7 EXAMPLES: MIXTURE MODELING WITH CROSS- SECTIONAL DATA
Examples: Mixture Modeling With Cross-Sectional Data CHAPTER 7 EXAMPLES: MIXTURE MODELING WITH CROSS- SECTIONAL DATA Mixture modeling refers to modeling with categorical latent variables that represent
More informationPackage glmnetutils. August 1, 2017
Type Package Version 1.1 Title Utilities for 'Glmnet' Package glmnetutils August 1, 2017 Description Provides a formula interface for the 'glmnet' package for elasticnet regression, a method for cross-validating
More informationIntroductory Guide to SAS:
Introductory Guide to SAS: For UVM Statistics Students By Richard Single Contents 1 Introduction and Preliminaries 2 2 Reading in Data: The DATA Step 2 2.1 The DATA Statement............................................
More informationLab 9 - Linear Model Selection in Python
Lab 9 - Linear Model Selection in Python March 7, 2016 This lab on Model Validation using Validation and Cross-Validation is a Python adaptation of p. 248-251 of Introduction to Statistical Learning with
More informationSolution to Bonus Questions
Solution to Bonus Questions Q2: (a) The histogram of 1000 sample means and sample variances are plotted below. Both histogram are symmetrically centered around the true lambda value 20. But the sample
More informationNCSS Statistical Software. Robust Regression
Chapter 308 Introduction Multiple regression analysis is documented in Chapter 305 Multiple Regression, so that information will not be repeated here. Refer to that chapter for in depth coverage of multiple
More informationSTATS PAD USER MANUAL
STATS PAD USER MANUAL For Version 2.0 Manual Version 2.0 1 Table of Contents Basic Navigation! 3 Settings! 7 Entering Data! 7 Sharing Data! 8 Managing Files! 10 Running Tests! 11 Interpreting Output! 11
More informationLecture 22 The Generalized Lasso
Lecture 22 The Generalized Lasso 07 December 2015 Taylor B. Arnold Yale Statistics STAT 312/612 Class Notes Midterm II - Due today Problem Set 7 - Available now, please hand in by the 16th Motivation Today
More informationNonparametric and Simulation-Based Tests. Stat OSU, Autumn 2018 Dalpiaz
Nonparametric and Simulation-Based Tests Stat 3202 @ OSU, Autumn 2018 Dalpiaz 1 What is Parametric Testing? 2 Warmup #1, Two Sample Test for p 1 p 2 Ohio Issue 1, the Drug and Criminal Justice Policies
More informationLinear Regression. A few Problems. Thursday, March 7, 13
Linear Regression A few Problems Plotting two datasets Often, one wishes to overlay two line plots over each other The range of the x and y variables might not be the same > x1 = seq(-3,2,.1) > x2 = seq(-1,3,.1)
More informationGENREG DID THAT? Clay Barker Research Statistician Developer JMP Division, SAS Institute
GENREG DID THAT? Clay Barker Research Statistician Developer JMP Division, SAS Institute GENREG WHAT IS IT? The Generalized Regression platform was introduced in JMP Pro 11 and got much better in version
More informationAn R Package flare for High Dimensional Linear Regression and Precision Matrix Estimation
An R Package flare for High Dimensional Linear Regression and Precision Matrix Estimation Xingguo Li Tuo Zhao Xiaoming Yuan Han Liu Abstract This paper describes an R package named flare, which implements
More informationChapters 5-6: Statistical Inference Methods
Chapters 5-6: Statistical Inference Methods Chapter 5: Estimation (of population parameters) Ex. Based on GSS data, we re 95% confident that the population mean of the variable LONELY (no. of days in past
More informationSalary 9 mo : 9 month salary for faculty member for 2004
22s:52 Applied Linear Regression DeCook Fall 2008 Lab 3 Friday October 3. The data Set In 2004, a study was done to examine if gender, after controlling for other variables, was a significant predictor
More informationPredictive Checking. Readings GH Chapter 6-8. February 8, 2017
Predictive Checking Readings GH Chapter 6-8 February 8, 2017 Model Choice and Model Checking 2 Questions: 1. Is my Model good enough? (no alternative models in mind) 2. Which Model is best? (comparison
More information
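To make steps 4 through 7 of the testing procedure concrete, here is a minimal sketch (not from the slides; the coin-flip numbers are hypothetical) of an exact two-sided binomial test. The test statistic is the observed count, its null distribution is Binomial(n, p0), and the p-value sums the probability of every outcome no more likely than the one observed:

```python
from math import comb

def binom_test_two_sided(k, n, p0=0.5):
    """Exact two-sided binomial test.

    Returns the probability, under H0: p = p0, of an outcome
    at least as extreme (i.e. no more likely) than observing k
    successes out of n trials.
    """
    # Null distribution of the test statistic: Binomial(n, p0).
    probs = [comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(n + 1)]
    observed = probs[k]
    # p-value: total probability of outcomes no more likely than k
    # (small tolerance handles floating-point ties at the symmetric tail).
    return sum(p for p in probs if p <= observed + 1e-12)

# Hypothetical example: 60 heads in 100 flips; H0: the coin is fair.
p_value = binom_test_two_sided(60, 100)
print(round(p_value, 4))
```

With α = 0.05 and a p-value of roughly 0.057, we fail to reject H0: the data are not quite inconsistent with a fair coin at that significance level. Note the asymmetry discussed above: this does not mean we accept that the coin is fair.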